Hierarchical model with some very small parameter values

I have simulated some data that resembles some real data that I am using.

The data is simulated from a simple hierarchical model with a constant intercept and slope and a random intercept term. More specifically, I have simulated data from

Y_{i}(t_{ij}) = (\eta_1 + w_i) +\eta_2t_{ij} + \varepsilon_{ij},

where w_i \sim N(0,\sigma^2_1) and \varepsilon_{ij} \sim N(0,\sigma^2). In context, I have i=1,\dots,n units and j=1,\dots,n_i recordings of a covariate for each unit.

The data is simulated using (\eta_1, \eta_2, \sigma_1, \sigma) = (-1.22, 0.004,1, 0.0001). I have attached a plot of the data below.

y.pdf (170.7 KB)

The Stan code that I am using (shown below) is taking a very long time (about 10 hours) to estimate the model parameters and is returning warnings

5: The largest R-hat is 2.14, indicating chains have not mixed.
Running the chains for more iterations may help.
6: Bulk Effective Samples Size (ESS) is too low, indicating posterior means and medians may be unreliable.
Running the chains for more iterations may help.
7: Tail Effective Samples Size (ESS) is too low, indicating posterior variances and tail quantiles may be unreliable.
Running the chains for more iterations may help.

The effective sample sizes for all parameters is low. I ran the code for 6000 iterations (1500 warmup) and the effective sample sizes were (91,20,317,22) for (\eta_1, \eta_2, \sigma_1, \sigma), respectively. I also had to increase the maximum treedepth to 30 to avoid getting an additional warning.

Is it the very small values of \sigma_1 and \sigma that are causing the issues?

data{
  int n; //Sample size
  int<lower=1> N; // Total number of covariate recordings (all recordings for all units)
  real y[N]; //All of the y values
  int w_ind[N]; //Indices for w
  int tt[N]; //All times for all units
}
parameters{
  real <lower = 0> sig;
  real <lower = 0> sig1;
  real eta1;
  real eta2;
  vector[n] w;
}
model{
  real Mu[N];
  for(i in 1:N){
    Mu[i] = eta1 + eta2*tt[i] + sig1*w[w_ind[i]];
  }
  target += normal_lpdf(y|Mu, sig);
  //priors
  eta1 ~ normal(0, 1);
  eta2 ~ normal(0, 1);
  w ~ normal(0, 1);
  sig ~ normal(0, 1);
  sig1 ~ normal(0, 1);
}

I have tried multiplying the response, y, by different factors to increase the estimate value of \eta_2 but this does not help. This is, for example, like changing the response from metres to centimetres.

I have 40 units and around 20 000 observations. However, even taking smaller subsets of the data takes a very long time to run and returns the same warnings. The lme function in R can estimate the parameters accurately in about a second.

It’s hard to provide a definitive answer, but there are some clear ways to check. The first thing I would check is to see what happens if you declare your parameters on a scale that’s a bit closer to the true values. So for example, you could do:

parameters {
 real <lower = 0> sigma_scaled;
}
transformed parameters{
 real sig = .0001 * sigma_scaled; // the Jacobian is constant and can be dropped 
}

If the tiny value for sig is your problem, then this might fix it.

Stan is never going to compete with lme for speed tests on this model, and the speed of Stan won’t be relevant anyway unless you can solve the convergence issue. However, you can vectorize your computation of Mu (just eliminate your for-loop and drop the subscripting on i), which should yield some speedup.

1 Like

I have tried transforming sig but it looks like the code is taking about the same time to run.

I don’t think I can vectorise the computation of Mu, though.

Mu is a [1 x N] vector, eta1 and eta2 are [1 x 1], tt is a [1 x N] vector, and w is a [1 x n] vector.

The first n_1 observations belong to unit 1, the next n_2 belong to unit 2, etc.

w_{ind} = (1,1,...,1,2,2,...,2,.....).

This way I can use w_1 for the first n_1 tt values and w_2 for tt values from (n_1 +1) to n_2.

I.e. y = (y_1, y_2, \dots, y_n), tt = (1,\dots,n_1, \dots, 1, \dots, n_n), w_{ind} = (1,\dots,1,\dots,n,\dots,n).

Does

real Mu[N] = eta1 + eta2*tt + sig1*w[w_ind];

not work?

1 Like

Aside from vectorizing, the next thing I’d try is switching your random effect (which appears to be strongly informed) from the non-centered parameterization to a centered one. That is, do w ~ normal(0, sig1) and remove the multiplication by sig1 from the computation of Mu

1 Like

Yes, this does work. Thank you.

I am running the code again. It is the warmup that is (still) taking a long time. I will try transforming eta2, also.

I used the above Stan code on a dataset with 456 units and 186 448 observations in total. The code took 22 hours to run. The output is shown below (note I added a subscript 240 to each variable name):

Inference for Stan model: covariate.
1 chains, each with iter=6000; warmup=1500; thin=1; 
post-warmup draws per chain=4500, total post-warmup draws=4500.

                         mean    se_mean          sd          2.5%           25%           50%           75%         97.5% n_eff      Rhat
sig_240_scaled   9.148857e+01 0.00371589  0.15143235  9.118760e+01  9.138527e+01  9.148774e+01  9.159188e+01  9.178513e+01  1661 0.9997921
sig1_240         1.153557e+00 0.00127602  0.03800022  1.082805e+00  1.127420e+00  1.152964e+00  1.177242e+00  1.231585e+00   887 1.0015214
eta1_240        -6.026718e-01 0.01351272  0.03937897 -6.761101e-01 -6.331541e-01 -5.895338e-01 -5.731584e-01 -5.360788e-01     8 0.9998684
eta2_240_scaled  8.333048e-01 0.00000044  0.00002952  8.332466e-01  8.332854e-01  8.333044e-01  8.333247e-01  8.333648e-01  4467 1.0001221
sig_240          9.148860e-03 0.00000037  0.00001514  9.118760e-03  9.138530e-03  9.148770e-03  9.159190e-03  9.178510e-03  1661 0.9997921
eta2_240         3.333220e-03 0.00000000  0.00000012  3.332990e-03  3.333140e-03  3.333220e-03  3.333300e-03  3.333460e-03  4467 1.0001221
lp__             6.103612e+05 0.40540291 15.41476895  6.103309e+05  6.103507e+05  6.103615e+05  6.103720e+05  6.103904e+05  1446 1.0001449

The effective sample size for eta1_240 (i.e. the fixed intercept term) is very low. The diagnostic plots (see attached) also show that the mixing of eta1_240 is poor. The same conclusion is drawn for all of the random intercepts.

diagnostic1.pdf (83.6 KB)
diagnostic2.pdf (82.7 KB)

Despite this, I did simulate some data from the posterior samples and the data is almost identical to the input data.

It is a little worrying that such a simple dataset has such poor mixing. I am going to try fitting the model without the fixed intercept to see if this helps (as this seems to be the term with the worst effective sample size).

Removing the fixed intercept solved the issue and reduced the runtime to 4 hours from 22 hours.

This parameter must have interfered with the estimation of sig1_240 since the intercepts were all centred around zero.