How to fix unconstrained growth of deviation parameter?

Hello Stan community,

I’ve been trying to troubleshoot a complex Stan model for about 2.5 days, but some sigma fits just go crazy. Here’s a minimal example - can someone tell me what’s wrong with it? The model problem is that we weigh two boxes, twice each, and want estimates for average box weight (mu), standard weight deviation (sigma), and measurement error (m_sd). For the sake of this example, let say that mu=1000g, box1=999g, box2=1001g, sigma = approx 1, and m_sd = appox 1.

Here’s the self-contained Stan model:

transformed data {
  int<lower=0> N; // total number of measurements
  int<lower=0> boxCount; // number of boxes to measure
  int<lower=1> B[4]; // index of box that was measured
  real<lower=0> M[4]; // weight measurements
  
  boxCount = 2;
  N = 4; // 2 measurements per box
  B = {1, 1, 2, 2}; // measured box 1 twice, then box 2 twice
  M = {999+1, 999-1, 1001-1, 1001+1}; // measured weights
}
parameters {
  real<lower=0> W[boxCount]; // true weight of a given box
  real<lower=0> mu; // average weight of boxes
  real<lower=0> sigma; // standard deviation of box weights
  real<lower=0> m_sd; // standard deviation of measurements
}
model {
  // True box weights
  for (b in 1:boxCount) {
    W[b] ~ normal(mu, sigma);
  }
  // Measured box weights
  for (m in 1:N) {
    M[m] ~ normal(W[B[m]], m_sd);
  }
}

The result of 100 iterations is as follows (sigma goes towards infinity with more iterations):

OS: x86_64, darwin13.4.0; rstan: 2.15.1; Rcpp: 0.12.8; inline: 0.3.14

Inference for Stan model: temp2.
1 chains, each with iter=100; warmup=50; thin=1;
post-warmup draws per chain=50, total post-warmup draws=50.

     mean se_mean      sd   2.5%    25%     50%     75%    97.5% n_eff Rhat

W[1] 999.87 18.61 118.90 755.96 957.54 984.58 1044.29 1230.22 41 0.98
W[2] 969.13 23.85 130.73 626.53 931.98 998.59 1016.34 1181.13 30 0.98
mu 1074.46 296.18 1435.16 51.14 421.16 621.86 1065.05 3803.69 23 0.99
sigma 2129.80 819.44 3836.89 501.56 875.56 1005.34 1503.88 15966.03 22 1.03
m_sd 137.75 52.68 134.06 37.96 42.18 83.51 155.66 480.19 6 0.99
lp__ -2.17 1.01 2.38 -6.94 -3.70 -2.18 -0.61 1.84 6 1.12

Why do sigma and m_sd go crazy like that? They should be around 1 or so, not in the hundreds or thousands…

The analysis using linear mixed effects works just fine:

dfBox = data.frame(
  B = factor(c(1, 1, 2, 2)),
  M = c(999+1, 999-1, 1001-1, 1001+1)
)
library(lme4)
lmer1 = lmer(M ~ (1|B), data=dfBox)
> summary(lmer1)
Linear mixed model fit by REML ['lmerMod']
Formula: M ~ (1 | B)
   Data: dfBox

REML criterion at convergence: 12.7

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-1.0607 -0.5303  0.0000  0.5303  1.0607 

Random effects:
 Groups   Name        Variance Std.Dev.
 B        (Intercept) 1        1.000   
 Residual             2        1.414   
Number of obs: 4, groups:  B, 2

Fixed effects:
            Estimate Std. Error t value
(Intercept)     1000          1    1000
> fixef(lmer1)
(Intercept) 
       1000 
> ranef(lmer1)
$B
  (Intercept)
1        -0.5
2         0.5

> confint(lmer1)
Computing profile confidence intervals ...
                  2.5 %      97.5 %
.sig01        0.0000000    4.118386
.sigma        0.6739152    3.468133
(Intercept) 997.5865087 1002.413505

Thanks in advance!
Ellis

For that little amount of data, it is essential to express your prior beliefs about all the parameters, particularly sigma and m_sd. For any measurement error problem, you should have a pretty good idea about whether the measurement error is big or small.

In addition, it is tough for Stan to sample efficiently from a posterior distribution where some parameters is on the order of 1000 and others on are the order of 1. In this case, I would make mu a transformed parameter like

transformed parameters {
  real mu = mu_loc + mu_scale * mu_raw;
}

where mu_raw is declared in the parameters block instead of mu, mu_loc and mu_scale are passed into the data block, and mu_raw has a standard normal prior. This is more or less how we do hierarchical models in the rstanarm package, and you can confirm you get reasonable answers from it with reasonable prior information:

dfBox <- data.frame(
  B = factor(c(1, 1, 2, 2)),
  M = c(999+1, 999-1, 1001-1, 1001+1)
)
library(rstanarm)
post <- stan_lmer(M ~ (1|B), data = dfBox, 
                  prior_intercept = normal(location = 1000, scale = 10), 
                  prior_aux = exponential(1),
                  prior_covariance = decov(shape = 2, scale = 0.2))
3 Likes

Thanks Ben, I really appreciate the information and suggestions! The result with stan_lmer looks promising. My attempts this morning to write Stan code equivalent to stan_lmer didn’t produce good results yet, though. I’ll see what I can figure out with that.

Parameter fits can easily “go crazy” if you don’t have priors.

You probably want to use a non-centered parameterization for W (draw W_raw from normal(0, 1) thn define W = mu + W_raw * sigma. It changes the distribution being sampled (now it’s W_raw), though the distribution for W will be the same.

You can also vectorize, as in:

W ~ normal(mu, sigma);
M ~ normal(W[B}, m_sd);

Thanks, Bob. After doing some more reading and watching some lectures about Stan models, I was able to get the models to fit much better. One of the keys was using non-centered parameterization, as you and Ben described.