Poor results with a simple linear regression (with large slope range)

Hello, I am getting extremely poor results with an extremely simple linear regression. Can anyone tell me where the problem is? Or if I need to tweak settings, can anyone explain why Stan would perform so poorly on such a simple linear regression with default settings?

R code:

N <- 1000
x <- 1:N
y <- 10 + 3*x

lm_fit <- lm(y ~ x)

stan_dat <- within(list(), {
    N = N
    y = y
    x = x
})

stan_fit <- stan(file = "simple_regression.stan", data = stan_dat)

print(lm_fit)
print(stan_fit)

Stan code (simple_regression.stan)

data {
    int<lower=0> N;
    real y[N];
    real x[N];
}
parameters {
    real b0;
    real b1;
    real<lower=0> sigma;
}

model {  
    real mu[N];
    for (i in 1:N)
        mu[i] = b0+b1*x[i];
        
    b0 ~ normal(0, 10000);
    b1 ~ normal(0, 10000);
    
    sigma ~ uniform(0, 100);
    y ~ normal(mu, sigma);
}

end of output:

Call:
lm(formula = y ~ x)

Coefficients:
(Intercept)            x  
         10            3  

Inference for Stan model: simple_regression.
4 chains, each with iter=2000; warmup=1000; thin=1; 
post-warmup draws per chain=1000, total post-warmup draws=4000.

            mean  se_mean        sd       2.5%        25%        50%       75%   97.5% n_eff      Rhat
b0          2.89     2.97      4.20      -0.28      -0.14       0.92      3.95   10.00     2 190583.39
b1          0.37     1.13      1.60      -1.15      -0.78      -0.19      0.96    3.00     2  53056.45
sigma      75.00    30.61     43.31       0.00      75.00     100.00    100.00  100.00     2 198812.79
lp__  -161189.34 80501.54 113874.84 -293878.71 -245587.99 -179357.88 -95092.74 7856.47     2   2902.19

I notice three things.

(1) The range of x variable is very large. Standardizing it will help. So I would do: x <- (1:N)/N

(2) There is no error in the y ~ x relationship. The lm() fit is singular, I bet. So: y <- rnorm( N, 10 + 3*x , 1 )

(3) The priors are far too flat, resulting in initial draws that are very far from the target. Stan could probably handle it anyway, but better to do something like normal(0,10).

My hunch is that making those changes will fix it up.

1 Like

Thanks Richard. (1) is it. I was just figuring out that it worked well for N = 30 or so, but worked poorly at N = 1000, and was about to post an update.

Added later: now that I know the problem, there is a good related discussion here:

https://groups.google.com/forum/#!topic/stan-users/w-h5A8UbjMs