Dear STAN community,
I am a new STAN fan. I am testing the following model:
data {
int<lower=0> N; // number of data points
vector[N] x; // independent variable
vector[N] y; // dependent variable
}
parameters {
real slope;
real intercept;
real<lower=0> sigma; // standard deviation
}
model {
vector[N] y_est;
y_est = slope * x + intercept;
y ~ normal(y_est, sigma); // posterior
}
This is a simple linear regression model and works fine. I use auto-generated data to test the model in R,
samplesize <- 100
x <- runif(samplesize,min=0,max=samplesize)
slope <- 0.4
intercept <- 2
stdev <- 0.5
y <- slope*x + intercept + rnorm(samplesize,sd=stdev)
fitted_stan_model <- stan(file="linearregressionmodel.stan", data=list(N=samplesize, x=x, y=y))
and after running the model, I use the following R function to calculate the log posterior from the generated samples:
f <- function(x,y,slope,intercept,sigma) { ## loglikelihood estimator
y_est <- slope*x + intercept
residuals <- y - y_est
loglikelihood <- sum(-.5*(residuals/sigma)^2)-length(residuals)*(log(sigma)+log(2*pi)/2)
return(loglikelihood)
}
Since I have not defined any prior, the posterior probability should be equal to the likelihood function. I understand that STAN does not scale the log posterior probability (lp__), however I still expect a linear relationship between lp__ and f(…) for a given parameter set. However, when I try to verify this, it does not work:
d <- as.data.frame(fitted_stan_model)
lp2 <- vector()
for (i in 1:nrow(d)){
lp2[i] <- with(d[i,],f(x,y,slope,intercept,sigma))
}
m1 <- lm(d$lp__~lp2)
plot(residuals(m1)~lp2)
This small difference confuses me as it seems to cause small differences in the estimated values when I compare with other methods that I used in the past. If anybody has a helpful explanation for this, it would be greatly appreciated.
Cheers