Need some advice on speeding up non-centered model with random effects


#1

Hi,

I have a model with a few levels to the data.
There are groups, students in groups, and tutors.
We are trying to model student scores.

There are approximately:

  • 2500 groups
  • 25000 students
  • 1500 tutors

It is clear from analyzing the data that the groups have distinct means and distinct variances. Additionally, students and tutors vary widely in ability.

I’ve built a basic model in Stan for this, with three intercepts (group, student, tutor) and a group-specific variance.

However, it can take many hours for just 100 iterations, and the diagnostics report that every transition hit the maximum tree depth. I've tried increasing the tree depth to 50, but got the same result. (I know 50 is extreme; I was just testing whether that was the issue.)

Attached is the Stan file.

Once the intercepts and variances are working OK, I plan to add some other covariates (non-intercepts).

I’d love any suggestions that the group might have on how to improve this model.

data{

  // Counts of things
  int N;
  int N_student;
  int N_group;
  int N_tutor;

  // Dependent variable
  real score[N];

  // Intercepts
  int<lower=1, upper=N_student> student[N];
  int<lower=1, upper=N_tutor>   tutor[N];
  int<lower=1, upper=N_group>   group[N];
}

parameters {

  real mu_eta; // grand location
  real mu_scale;

  vector<lower=0>[N_group] g_sigma;  // group specific scale

  
  vector[N_group] g_eta; // group specific intercept
  real g_scale; // scaling of group specific intercept

  vector[N_student] s_eta; // student specific intercept
  real s_scale; // scaling of student specific intercept

  vector[N_tutor] t_eta; // tutor specific intercept
  real t_scale; // scaling of tutor specific intercept

}

transformed parameters {
  vector[N_group]   g_re;
  vector[N_student] s_re;
  vector[N_tutor]   t_re;
  real mu;

  mu = mu_scale * mu_eta;

  g_re = g_scale * g_eta;  // intercept for group
  s_re = s_scale * s_eta;  // intercept for student
  t_re = t_scale * t_eta;  // intercept for tutor

}

model{

  vector[N] theta;
  vector[N] sigma;

  // Priors
  g_sigma  ~ cauchy(0,2.5);

  mu_eta   ~ normal(0,1);
  mu_scale ~ normal(0,1);

  g_eta    ~ normal(0,1);
  g_scale  ~ normal(0,1);

  s_eta    ~ normal(0,1);
  s_scale  ~ normal(0,1);

  t_eta    ~ normal(0,1);
  t_scale  ~ normal(0,1);

  // model likelihood
  theta = mu + g_re[group] + s_re[student] + t_re[tutor];
  sigma = g_sigma[group];

  score ~ normal(theta, sigma);
}

#2

You are likely seeing non-identifiability, and maybe some funny stuff from sign flipping. For example, mu = mu_scale * mu_eta, and both factors are on the whole real line with N(0,1) priors. I don't think you have any information to identify them separately.
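Concretely, the product is invariant to rescaling: for any constant $c \neq 0$,

$$
\mu \;=\; \mu_{\mathrm{scale}}\,\mu_{\mathrm{eta}} \;=\; (c\,\mu_{\mathrm{scale}})\left(\frac{\mu_{\mathrm{eta}}}{c}\right),
$$

so the likelihood cannot separate the two factors; only the priors weakly pin them down. That leaves a long ridge in the posterior (plus a sign-flip symmetry for $c < 0$), which is exactly the kind of geometry that can drive NUTS to very long trajectories and the max treedepth warnings you're seeing.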


#3

Thank you, that makes sense.

But I was trying to follow the concept of "Matt's trick" (non-centering) for mu. Are you suggesting that isn't appropriate for the mu parameter? If so, how do you suggest I fit it?

Thanks!!!


#4

Yeah, you only want to use the non-centered parameterization on parameters that would otherwise be given a hierarchical prior.
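In other words, something like this minimal sketch (reusing the names from the model above, not a full program; note I've also constrained g_scale to be positive, which the original model doesn't do): the hierarchical group effects keep the non-centered form, while mu is an ordinary parameter with its own prior:

```stan
parameters {
  real mu;                       // top-level mean: no non-centering needed
  vector[N_group] g_eta;         // standardized group effects
  real<lower=0> g_scale;         // scale of the group effects
}
transformed parameters {
  // non-centered parameterization: implies g_re ~ normal(0, g_scale)
  vector[N_group] g_re = g_scale * g_eta;
}
model {
  mu ~ normal(0, 5);             // weakly informative prior directly on mu
  g_eta ~ normal(0, 1);
  g_scale ~ normal(0, 1);        // half-normal, given the constraint
}
```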


#5

Good point.

I envision the model as having a “main mean” and then the 3 intercepts.

Two options for the main mean (mu):

  1. Give it a non-informative prior like normal(0, 5).
  2. Use the empirical mean from the data set, i.e. just set mu = 5.

Any opinion?

Thanks!


#6

Just use a weakly informative prior and estimate the overall mean; it'll be fine.


#7

I’ll give it a shot. Thanks!


#8

One thing that would help identify the model: all the "scale" parameters should be constrained to be positive, e.g.:

real<lower=0> s_scale;

#9

That makes sense.

  1. Surprisingly, the Stan manual doesn't suggest keeping the scale parameter positive.

  2. What about an exponential(1) prior on the scale parameter?


#10

Tried the positivity constraint as suggested, and I'm still getting 100% of transitions hitting the maximum tree depth.

(Note: on the command line, I've even tried increasing max_depth to 50, but that doesn't seem to carry through to the diagnostic output, which still reports a max_depth of 14.)

calling cmdstan with:

./test sample adapt delta=0.95 algorithm=hmc engine=nuts max_depth=50 num_samples=100 num_warmup=100 id=1 data file=data_dump.RD output refresh=1 file=samples_1.csv

diagnostic tells me:

100 of 100 (1e+02%) transitions hit the maximum treedepth limit of 14, or 2^14 leapfrog steps.


#11

You should probably post your model again so we know what we’re looking at.


#12

Due to a current limitation in our reader code, the diagnostic function does not read the current max treedepth but rather assumes the default of 14. So if you increase the max treedepth and end up with very long trajectories then this warning will be a false positive. Alternatively you can just look at the statistics for treedepth__ in stansummary.


#13

@betanalpha That’s very helpful to know. It was driving me nuts (no pun intended) trying to figure out why treedepth wasn’t working. Thanks for explaining.


#14

As suggested, here is the latest version. Note the weakly informative prior on mu. (In the training data set the mean is around 4.4, so this prior easily covers that, and the samples recover the 4.4 nicely.)

Once this seems to run OK, I’ll use the generated quantities block to generate some data to see how the posterior fits the training data.

data{
  int N;
  int N_student;
  int N_group;
  int N_tutor;

  // Dependent variable
  real<lower=0> score[N];  // lognormal likelihood requires score > 0


  // Intercepts
  int<lower=1, upper=N_student> student[N];
  int<lower=1, upper=N_group>   group[N];
  int<lower=1, upper=N_tutor>   tutor[N];

}

parameters {

  real mu; // base mean
  vector<lower=0>[N_group] g_sigma;  // group specific scale

  vector[N_group] g_eta; // group specific intercept
  real<lower=0> g_scale; // scaling of group specific intercept

  vector[N_student] s_eta; // student specific intercept
  real<lower=0> s_scale; // scaling of student specific intercept

  vector[N_tutor] t_eta; // tutor specific intercept
  real<lower=0> t_scale; // scaling of tutor specific intercept

}

transformed parameters {
  vector[N_group]   g_re;
  vector[N_student] s_re;
  vector[N_tutor]   t_re;

  g_re = g_scale * g_eta;  // intercept for group
  s_re = s_scale * s_eta;  // intercept for student
  t_re = t_scale * t_eta;  // intercept for tutor

}

model{

  vector[N] theta;
  vector[N] sigma;

  // Priors
  mu  ~ normal(0,10);
  g_sigma  ~ normal(0,5); 

  g_eta    ~ normal(0,1);
  g_scale  ~ exponential(1);

  s_eta    ~ normal(0,1);
  s_scale  ~ exponential(1);

  t_eta    ~ normal(0,1);
  t_scale  ~ exponential(1);

  // model likelihood
  theta = mu + g_re[group] + s_re[student] + t_re[tutor];
  sigma = g_sigma[group];

  score ~ lognormal(theta, sigma); 
}
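For the posterior predictive check mentioned earlier, a generated quantities block along these lines should work (a sketch matching this model's likelihood; `score_rep` is just a name I've picked):

```stan
generated quantities {
  vector[N] score_rep;  // replicated scores from the posterior predictive
  for (n in 1:N)
    score_rep[n] = lognormal_rng(mu + g_re[group[n]] + s_re[student[n]] + t_re[tutor[n]],
                                 g_sigma[group[n]]);
}
```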

#15

OK,

Added a single covariate (hours of study, with a corresponding slope coefficient) and the entire thing just crawls to a halt. After almost 18 hours, I have 45 samples.

Hours of study affect the score. Students belong to a few different skill "levels", and empirically it makes sense to have a different weight on hours_study for each level.

Clearly I’m doing something wrong here. Would really love any advice you might have!

Thank You!!!

data{
  int N;
  int N_student;
  int N_group;
  int N_tutor;
  int N_level;

  // Dependent variable
  real<lower=0> score[N];  // lognormal likelihood requires score > 0


  // Intercepts
  int<lower=1, upper=N_student> student[N];
  int<lower=1, upper=N_group>   group[N];
  int<lower=1, upper=N_tutor>   tutor[N];
  int<lower=1, upper=N_level>   level[N];

  // Slope covariate
  vector[N] hours_study;

}

parameters {

  real mu; // base mean
  vector<lower=0>[N_group] g_sigma;  // group specific scale

  vector[N_group] g_eta; // group specific intercept
  real<lower=0> g_scale; // scaling of group specific intercept

  vector[N_student] s_eta; // student specific intercept
  real<lower=0> s_scale; // scaling of student specific intercept

  vector[N_tutor] t_eta; // tutor specific intercept
  real<lower=0> t_scale; // scaling of tutor specific intercept

  vector[N_level] beta_hours;  // Hours of study, per student skill level (about 40 levels)


}

transformed parameters {
  vector[N_group]   g_re;
  vector[N_student] s_re;
  vector[N_tutor]   t_re;

  g_re = g_scale * g_eta;  // intercept for group
  s_re = s_scale * s_eta;  // intercept for student
  t_re = t_scale * t_eta;  // intercept for tutor

}

model{

  vector[N] theta;
  vector[N] sigma;

  // Priors
  mu  ~ normal(0,10);
  g_sigma  ~ normal(0,5); 

  g_eta    ~ normal(0,1);
  g_scale  ~ exponential(1);

  s_eta    ~ normal(0,1);
  s_scale  ~ exponential(1);

  t_eta    ~ normal(0,1);
  t_scale  ~ exponential(1);

  beta_hours ~ normal(0,1);

  // model likelihood
  theta = mu + g_re[group] + s_re[student] + t_re[tutor] + beta_hours[level] .* hours_study;
  sigma = g_sigma[group];

  score ~ lognormal(theta, sigma); 
}

#16

Tried an experiment this morning: if I change the likelihood from lognormal to normal, Stan runs VERY quickly and the samples look good (no divergences, etc.).
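(If I've got the relationship right, the lognormal is just a normal on the log scale: for $x > 0$,

$$
\mathrm{lognormal}(x \mid \theta, \sigma) \;=\; \frac{1}{x}\,\mathrm{normal}(\log x \mid \theta, \sigma),
$$

so a normal model on log(score) should give the same posterior as the lognormal on score, since the $1/x$ Jacobian doesn't depend on the parameters.)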


#17

Thanks for following up. I didn't see this before, but it's another example of what Andrew calls the "folk theorem" of statistical computing: the upshot is that misspecified models tend to be slow to fit.