Hi everyone! Sorry for sending a long message but I really need your help! My simple linear regression model in Stan cannot be as good as lm() model in r and I don’t understand why!
I use rstan to fit a simple linear regression model. The dataset is 8000 rows * 10 cols and the numbers are all between 0 and 15. I try to use columns 1~9 to predict column 10. I also use lm() in r to do the prediction too. It turns out to be that the root of mean squared error (RMSE) from Stan is 0.49 and the RMSE in r is 0.369.
Below is my model in stan
data {
int<lower=0> N; // number of data items
int<lower=0> K; // number of predictors
matrix[N, K] x; // predictor matrix
vector[N] y; // outcome vector
}
parameters {
real alpha; // intercept
vector[K] beta; // coefficients for predictors
real<lower=0> sigma; // error scale
}
model {
y ~ normal(x * beta + alpha, sigma); // likelihood
}
There is no divergence when training the model. I try different priors for alpha, beta(s), and sigma but the RMSE doesn’t improve. The iteration time in Stan is 2000. (I also try iteration time = 500 or 5000, the RMSEs are still the same as 0.49.)
I attach some figures for reference:
The red dots are the coefficients from lm(), which indicate the best “locations” for the coefficients. And the violin distribution for each coefficient is the coefficient distribution from Stan. It is clear that the distribution of beta[7] from rstan is very off from the red dot from lm.
Below are some diagnostic plots using the bayesplot package. There is no divergence in the model so everything looks normal.
The plot from mcmc_parcoord
The plot from mcmc_trace for beta[7]:
I don’t understand why the RMSE from lm() in r can go to 0.369 but Stan can only do 0.49. How to make Stan model as good as lm() in this case? Thank you very much for your time and help!
Hi, @Jianfeng and welcome to the Stan forums.
If you fit with optimization in Stan, you should get the same answer as lm gets in R, so I would check that first to make sure the data’s formulated the same way in R and in Stan. That is, check that the estimated coefficients and error term all match (error should be close, but not exactly identical because R divides by N - degrees-of-freedom, whereas Stan just fits the MLE and divides by N).
Even if the optimizers get the same answer, that doesn’t guarantee the MCMC result will be the same. MCMC point estimates are posterior means rather than posterior modes, and if there’s skew, they can be very different.
Assuming you are fitting with MCMC, how are you evaluating R-hat? Are you just taking the Bayesian point estimates (posterior means) and plugging them in? If so, you might want to try using posterior predictive inference instead.
In general, if the data’s generated from the model, the posterior intervals should be well calibrated (in expectation over multiple simulations).
1 Like
Hi Bob! Thank you very much for your reply and abundant helpful information!
Below is how I extracted the paraments and calculated the predictions and RMSE:
# Extract trained parameters of the model
params = as.data.frame(extract(fit_rstan))
coefmean <- as.numeric(apply(params, 2, mean))
# Calculate the predictions
predictions <- Trainingset_data %*% as.matrix(coefmean[2:10], ncol=1) + coefmean[1]
# Calculate metrics: root of MSE, mean of absolute diff, O/E ratio.
RMSE <- sqrt(sum((predictions-Trainingset_pred)^2)/length(predictions))
I am not sure if I did the right thing…
And here is the code for R-hat and the output plot:
fit_rstan %>%
rhat() %>%
mcmc_rhat() +
yaxis_text()
It looks good I think?
Me, either, but that’s because I don’t understand all the R
you’re using. This isn’t the Bayesian way to compute this, but if everything’s linear, it will work out to the be the same. What you want to do is average the predictions over the posterior draws. But that works out to be the same in linear regression,
\displaystyle \hat{y}_n = \frac{1}{M} \sum_{m=1}^M \alpha^{(m)} + \beta^{(m)} \cdot x_n = \textrm{mean}(\alpha) + \textrm{mean}(\beta) \cdot x_n.
Then
\displaystyle \textrm{RMSE} = \sqrt{\left(\textrm{mean}(\alpha) + \textrm{mean}(\beta) \cdot x - y\right)^2}
where the square operation is element wise. In general the mean for the Bayesian posterior goes on the outside, but because everything’s linear, it can go inside here.
If this isn’t coming out close in the two approaches, I’d suggest again comparing the posterior means in Stan and the fitted coefficients from lm
and see if Stan’s optimizer gives you the same result as lm
. If it doesn’t, then there’s something wrong in how you’re specifying the Stan model, because the answers for the coefficients should be identical.
You are right! I just made a stupid mistake when reformulating the data. Thank you again for your time and help!