My model with individual level random effects and group effects is really slow

I have a logistic (binary) model with two levels of clustering - group and individual. The group-level random effects are actually time-specific group effects, which are correlated within group. When I run a model with these group effects alone, it takes a while - maybe 20 minutes on my 2017 MacBook Pro. (That is with 24 groups and 12 time periods, so 12 effects per group.)

When I include an individual-level effect (forget about introducing any correlation at the individual level), the estimation time is on the order of hours - maybe it will take overnight, perhaps longer. Some individuals have data for a single time period, whereas others might have up to 20 periods. There are 5,000+ individuals, so I imagine that is part of the problem. I also saw from the diagnostics that the sampling wasn’t particularly stable for the variance of the individual effects.

Here’s my code - and if anyone can see something obvious, let me know. Or if it is what it is, that would be good to know as well. (And if you need more information, let me know.)

data {
  int<lower=1> I;              // number of unique individuals
  int<lower=1> N;              // number of records
  int<lower=1> K;              // number of predictors
  int<lower=1> J;              // number of clusters
  int<lower=0> T;              // number of periods
  int<lower=1,upper=I> ii[N];  // id for individual
  int<lower=1,upper=J> jj[N];  // group for individual
  int<lower=1,upper=T> tt[N];  // period for individual
  matrix[N, K] x;              // matrix of predictors
  int<lower=0,upper=1> y[N];   // vector of outcomes
}

parameters {
  vector[K] beta;              // model fixed effects
  real<lower=0> sigma_S;       // site-level sd
  real<lower=-1,upper=1> rho;  // correlation
  real<lower=0> sigma_I;       // individual-level sd
  
  vector[T] ran_S[J];           // site level random effects (by period)
   
  vector[I] ran_I;              // individual level random effects              
}
 
transformed parameters{ 
  
  cov_matrix[T] Sigma;
  vector[N] yloghat;
  vector[T] mu_S = rep_vector(0, T); // mod
    
  // Random effects with exchangeable correlation  
  
  real sigma_S2 = sigma_S^2;
  real rho_sigma_S2 = rho * (sigma_S^2);
  
  for (j in 1:(T-1))
    for (k in (j+1):T) {
      Sigma[j,k] = rho_sigma_S2;  
      Sigma[k,j] = Sigma[j, k];
    }
     
  for (i in 1:T)
      Sigma[i,i] = sigma_S2;
  
  //
  
  for (i in 1:N)  
      yloghat[i] = x[i]*beta + ran_S[jj[i], tt[i]] + ran_I[ii[i]];
}

model {
  
  sigma_I ~ exponential(0.25);
  sigma_S ~ exponential(0.25); // mod
  rho ~ uniform(-1, 1);
  
  ran_S ~ multi_normal(mu_S, Sigma); // mod: vectorized
  ran_I ~ normal(0, sigma_I);
  
  y ~ bernoulli_logit(yloghat);

}

generated quantities {
  
  real sigma_site2;
  real sigma_ind2;

  sigma_site2 = sigma_S^2;
  sigma_ind2 = sigma_I^2;

}

Try non-centering ran_S and ran_I.

Something like:

data {
  int<lower=1> I;              // number of unique individuals
  int<lower=1> N;              // number of records
  int<lower=1> K;              // number of predictors
  int<lower=1> J;              // number of clusters
  int<lower=0> T;              // number of periods
  int<lower=1,upper=I> ii[N];  // id for individual
  int<lower=1,upper=J> jj[N];  // group for individual
  int<lower=1,upper=T> tt[N];  // period for individual
  matrix[N, K] x;              // matrix of predictors
  int<lower=0,upper=1> y[N];   // vector of outcomes
}

parameters {
  vector[K] beta;              // model fixed effects
  real<lower=0> sigma_S;       // site-level sd
  real<lower=-1,upper=1> rho;  // correlation
  real<lower=0> sigma_I;       // individual-level sd
  
  vector[T] z_ran_S[J];
   
  vector[I] z_ran_I;    
}
 
transformed parameters{ 
  cov_matrix[T] Sigma;
  matrix[T, T] L;
    
  // Random effects with exchangeable correlation  
  
  real sigma_S2 = sigma_S^2;
  real rho_sigma_S2 = rho * (sigma_S^2);
  
  for (j in 1:(T-1))
    for (k in (j+1):T) {
      Sigma[j,k] = rho_sigma_S2;  
      Sigma[k,j] = Sigma[j, k];
    }
     
  for (i in 1:T)
      Sigma[i,i] = sigma_S2;

  // This seems like the type of matrix that might have a closed-form
  //   Cholesky, but we'll compute it numerically here (a T^3 operation)
  L = cholesky_decompose(Sigma);
}

model {
  vector[T] ran_S[J];           // site level random effects (by period)
  vector[I] ran_I;              // individual level random effects
  vector[N] yloghat;

  sigma_I ~ exponential(0.25);
  sigma_S ~ exponential(0.25);
  rho ~ uniform(-1, 1);

  for(j in 1:J) {
    z_ran_S[j] ~ normal(0, 1);
    ran_S[j] = L * z_ran_S[j];
  }

  z_ran_I ~ normal(0, 1);
  ran_I = sigma_I * z_ran_I;

  for (i in 1:N)
      yloghat[i] = x[i]*beta + ran_S[jj[i], tt[i]] + ran_I[ii[i]];

  y ~ bernoulli_logit(yloghat);
}

Check https://mc-stan.org/users/documentation/case-studies/divergences_and_bias.html and “Hierarchical Models and the Non-Centered Parameterization” here https://mc-stan.org/docs/2_18/stan-users-guide/reparameterization-section.html. It may or may not help, but it’s the standard thing to try with models like this.

Thanks so much for the links and the suggestion about the non-centered parameterization. The code changes certainly changed everything - the chains now complete in 30 minutes or so (as opposed to 12 hours), and more importantly, they seem to avoid divergent transitions, for the most part. I set “adapt_delta” equal to 0.90 for all of this.

I am using simulation, so I am able to create different data sets (under the same data generating process). For some of these data sets, everything seems to work out well. For others, one of the chains is completely deviant - starting off far from the true value and essentially remaining constant. I am wondering if there is anything I can do to fix this.

Here is a traceplot:

Here are the initial messages I get. Oddly, it is chain 4 that generates the initial start-up issue but then seems to settle in fine - it is the 3rd chain that goes awry (based on looking at the diagnostics).

SAMPLING FOR MODEL 'binary sw - cs ind effect - non-central' NOW (CHAIN 1).

SAMPLING FOR MODEL 'binary sw - cs ind effect - non-central' NOW (CHAIN 2).

SAMPLING FOR MODEL 'binary sw - cs ind effect - non-central' NOW (CHAIN 3).

SAMPLING FOR MODEL 'binary sw - cs ind effect - non-central' NOW (CHAIN 4).
Chain 1: 
Chain 1: Gradient evaluation took 0.011599 seconds
Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 115.99 seconds.
Chain 1: Adjust your expectations accordingly!
Chain 1: 
Chain 1: 
Chain 2: 
Chain 2: Gradient evaluation took 0.010633 seconds
Chain 2: 1000 transitions using 10 leapfrog steps per transition would take 106.33 seconds.
Chain 2: Adjust your expectations accordingly!
Chain 2: 
Chain 2: 
Chain 3: 
Chain 3: Gradient evaluation took 0.011352 seconds
Chain 3: 1000 transitions using 10 leapfrog steps per transition would take 113.52 seconds.
Chain 3: Adjust your expectations accordingly!
Chain 3: 
Chain 3: 
Chain 1: Iteration:    1 / 4000 [  0%]  (Warmup)
Chain 2: Iteration:    1 / 4000 [  0%]  (Warmup)
Chain 3: Iteration:    1 / 4000 [  0%]  (Warmup)
Chain 4: Rejecting initial value:
Chain 4:   Error evaluating the log probability at the initial value.
Chain 4: Exception: cholesky_decompose: Matrix m is not positive definite  (in 'model1aed2fedd42f_binary_sw___cs_ind_effect___non_central' at line 48)

Chain 4: 
Chain 4: Gradient evaluation took 0.011463 seconds
Chain 4: 1000 transitions using 10 leapfrog steps per transition would take 114.63 seconds.
Chain 4: Adjust your expectations accordingly!

Of course, I also get warning messages about excessive divergent transitions. And for complete information, here’s my call to stan from R.

stanseed <- 2716283

fit.i.cs <- sampling(sm, data = testdat, seed = stanseed,
                     iter = 4000, warmup = 1000,
                     control = list(adapt_delta = 0.90),
                     chains = 4)

And here’s the improved Stan code:

data {
  int<lower=1> I;              // number of unique individuals
  int<lower=1> N;              // number of records
  int<lower=1> K;              // number of predictors
  int<lower=1> J;              // number of sites
  int<lower=0> T;              // number of periods
  int<lower=1,upper=I> ii[N];  // id for individual
  int<lower=1,upper=J> jj[N];  // group for individual
  int<lower=1,upper=T> tt[N];  // period for individual
  matrix[N, K] x;              // matrix of predictors
  int<lower=0,upper=1> y[N];   // vector of binary outcomes
}

parameters {
  vector[K] beta;              // model fixed effects
  real<lower=0> sigma_S;       // site-level sd
  real<lower=-1,upper=1> rho;  // correlation
  real<lower=0> sigma_I;       // individual-level sd
  
  // non-centered parameterization
  
  vector[T] z_ran_S[J];        // standardized site effects (by period)
  vector[I] z_ran_I;           // standardized individual effects
}
 
transformed parameters{ 
  
  cov_matrix[T] Sigma;
  matrix[T, T] L;                      // for non-central parameterization
  vector[I] ran_I;                     // individual level random effects 
  vector[T] ran_S[J];                  // site level random effects (by period)
  vector[N] yloghat;

  // Random effects with exchangeable correlation  
  
  real sigma_S2 = sigma_S^2;
  real rho_sigma_S2 = rho * (sigma_S^2);
  
  for (j in 1:(T-1))
    for (k in (j+1):T) {
      Sigma[j,k] = rho_sigma_S2;  
      Sigma[k,j] = Sigma[j, k];
    }
     
  for (i in 1:T)
      Sigma[i,i] = sigma_S2;
  
  L = cholesky_decompose(Sigma);

  for(j in 1:J)
    ran_S[j] = L * z_ran_S[j];
    
  ran_I = sigma_I * z_ran_I;

  for (i in 1:N)  
      yloghat[i] = x[i]*beta + ran_S[jj[i], tt[i]] + ran_I[ii[i]];
      
}

model {
  
  sigma_I ~ exponential(0.25);
  sigma_S ~ exponential(0.25);
  
  rho ~ uniform(-1, 1);
  
  for(j in 1:J) {
    z_ran_S[j] ~ std_normal();
  }

  z_ran_I ~ std_normal();
  
  y ~ bernoulli_logit(yloghat);

}

generated quantities {
  
  real sigma_site2;
  real sigma_ind2;

  sigma_site2 = sigma_S^2;
  sigma_ind2 = sigma_I^2;

}

It’s hard to tell from the traceplots what to do in a situation like this.

If something isn’t moving, I’d assume it’s because something numerically overflowed somewhere and the output is NaN/inf or something.

The only thing in your model that seems like something that might do that is the cholesky_decompose. It’s easy to get numeric problems with those. Maybe add a small number (1e-8 or something) to the diagonal of Sigma and see if this goes away?
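
I.e., something like this where you build the diagonal in transformed parameters (just a sketch - the size of the jitter is a guess):

for (i in 1:T)
    Sigma[i,i] = sigma_S2 + 1e-8;  // tiny jitter on the diagonal for numerical stability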

Thanks for the suggestion. I gave it a try, and the results are the same. I suspect you are right about the cause of the problem, as I get this error message:

Chain 4: Rejecting initial value:
Chain 4:   Error evaluating the log probability at the initial value.
Chain 4: Exception: cholesky_decompose: Matrix m is not positive definite  (in 'model1aed58285071_binary_sw___cs_ind_effect___non_central' at line 48)

Granted, in some of the runs (where I use a different starting seed), I get that warning, but things get back on track and everything works fine. For now, I will just have to keep varying the starting seed to make sure I can find a process that converges/mixes properly. Thanks.

I wouldn’t use fixed seeds. Keep them random.

I might be being paranoid when I say this :D, but I think I’ve been misled by fixed seeds before. I don’t remember what it was, though.

OOOoooooh yeah, I remember this matrix too.

I suspect you can compute the Cholesky decomposition in closed form too.

Like check this: https://en.wikipedia.org/wiki/Cholesky_decomposition#Rank-one_update

So if your off-diagonals are a constant and your diagonals are a constant, you can decompose your matrix:

\begin{bmatrix} a & b \\ b & a \end{bmatrix}
= \begin{bmatrix} a - b & 0 \\ 0 & a - b \end{bmatrix} + \begin{bmatrix} b & b \\ b & b \end{bmatrix}
= \begin{bmatrix} a - b & 0 \\ 0 & a - b \end{bmatrix} + \begin{bmatrix} \sqrt{b} \\ \sqrt{b} \end{bmatrix} \begin{bmatrix} \sqrt{b} & \sqrt{b} \end{bmatrix}

And then you know the Cholesky of the thing on the left (just take the square root of the diagonal), and there is a rank-one update you can do for the rest: https://en.wikipedia.org/wiki/Cholesky_decomposition#Rank-one_update

There’s probably a way to write that in math that’ll give you a closed form.
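
For what it’s worth, here is a rough sketch of what that might look like as a Stan function - the name cs_cholesky is made up, and I haven’t tested this, so treat it as an assumption and check it against cholesky_decompose on a few values. It uses the fact that for a compound symmetry matrix (constant diagonal a, constant off-diagonal b, positive definite), every below-diagonal entry within a column of the Cholesky factor is the same, so the standard recursion collapses to O(T) arithmetic:

functions {
  // Cholesky factor of a T x T matrix with constant diagonal a and
  // constant off-diagonal b. Only two numbers per column are needed:
  // the diagonal entry d and the common below-diagonal entry o.
  matrix cs_cholesky(int T, real a, real b) {
    matrix[T, T] L = rep_matrix(0, T, T);
    real s = 0;                  // running sum of squared off-diagonal entries
    for (j in 1:T) {
      real d = sqrt(a - s);      // L[j, j]
      real o = (b - s) / d;      // L[i, j] for every i > j
      L[j, j] = d;
      for (i in (j + 1):T)
        L[i, j] = o;
      s = s + o^2;
    }
    return L;
  }
}

Then in transformed parameters you could replace the cholesky_decompose call with L = cs_cholesky(T, sigma_S2, rho_sigma_S2); and skip building Sigma entirely.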

Interesting. That could conceivably work for the case with compound symmetry, but not with an AR(1) structure, where the off-diagonals are not constant.

I have another question - I’ve looked at the initial value of the matrix in question, and it is actually positive definite. Is cholesky_decompose doing something weird? Is that why you suggested I try to do this manually?

Not that I’m aware of.

The matrices you see in the output probably aren’t the ones that are producing those errors.

Those errors probably come from early Stan initialization where it’s just guessing initial values and trying to figure out a timestep. During this period it’s entirely possible that something like sigma_S in your model would end up as an actual 0.0 or something weird.

I think you can add ‘print(Sigma)’ statements right before your cholesky_decompose to see this.
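
Something like this, right before the decomposition (keep in mind print fires every time the log probability is evaluated, so expect a lot of output):

print("Sigma = ", Sigma);
L = cholesky_decompose(Sigma);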

I think I found the culprit: it is rho that has been causing the problems. If it starts out extremely close to 0 and drifts negative, it gets stuck. I don’t know if it is too uncool to set the prior to be between 0 and 1 - is that too informative? (In this application, I would never expect negative correlations - 0 yes, but not negative.) When I do set the lower bound of rho to 0, things appear to work well, at least with my simulated data.
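
Concretely, the only changes are the bound in the parameters block and the matching prior:

real<lower=0,upper=1> rho;   // correlation, restricted to be non-negative

with rho ~ uniform(0, 1); in the model block.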

I dunno your application, but just tell people what you did and why you did it.