Speed up multilevel logit: large dataset and group-level variables

Hi all,

I’m writing with a question about fitting a multilevel logit model to a dataset with ~300,000 observations. The data is cross-national survey data where individual respondents are nested within countries and years. I’m including group-level covariates to predict country and year varying intercepts, like the county-level uranium variable in the radon example or state-level variables in Gelman et al’s Red State Blue State project.

Even after using reduce_sum (huge thanks to the Stan developers for this), this model takes about 7 days on a computer with 8 cores and 32GB of RAM. Looking at the code below, is there any way I could speed this up? Would writing the model differently by putting the group-level variables straight into the individual level regression in partial_sum speed things up? The non-centered parameterization of both sets of intercepts is necessary, as far as I can tell.


The model can be written as

y \sim \text{BernoulliLogit}(\alpha + \alpha_{state} + \alpha_{year} + \textbf{X}\beta)

with two group-level regressions, where the data matrices \textbf{G} and \textbf{Z} each include a constant:

\alpha_{year} \sim N(\textbf{G}\lambda, \sigma_{year})
\alpha_{state} \sim N(\textbf{Z}\gamma, \sigma_{state})

Stan program here. Country intercepts are indexed with S, while year intercepts are indexed with T. The group-level regressions are in the transformed parameters block.


functions {
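  // partial log-likelihood over one slice of the outcome, for use with reduce_sum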
  real partial_sum(int[] y_slice,
                   int start, int end,
                   matrix X, vector beta,
                   int[] state,
                   vector alpha_state,
                   int[] year,
                   vector alpha_year,
                   real alpha) {
    return bernoulli_logit_lpmf(y_slice | alpha +
      alpha_state[state[start:end]] + alpha_year[year[start:end]] +
      X[start:end, ] * beta);
  }
}

// data
data {
  int<lower = 0> N; // number of individual obs
  int y[N]; // outcome
  
  int<lower = 0> S; // number of state-year obs
  int<lower = 1, upper = S> state[N]; // state indicator

  int<lower = 0> T; // number of year obs
  int<lower = 1, upper = T> year[N];
  
  int<lower = 1> I; // number of individual variables
  matrix[N, I] X; // individual-level reg matrix
  int<lower = 1> J; // number of state-year variables
  matrix[S, J] Z; // state-year reg matrix
  int<lower = 1> L; // number of system-year variables
  matrix[T, L] G; // year/system reg matrix
}

// parameters.
parameters {
  real alpha; // overall intercept
//  real<lower = 0> sigma; // outcome variance
 
  vector[S] alpha_state_std; // state intercepts, non-centered
  real<lower = 0> sigma_state; // state var. hyperparam

  vector[T] alpha_year_std; // year intercepts, non-centered
  real<lower = 0> sigma_year; // year var. hyperparam

  vector[I] beta; // individual coefs
  vector[J] gamma; // state-year coefs
  vector[L] lambda; // alliance-year coefs

}

transformed parameters {
   vector[T] mu_year; 
   vector[T] alpha_year; // year intercepts
   vector[S] alpha_state; // state intercepts
   vector[S] mu_state;
   
   // regression models of state and year varying intercepts
   mu_year = G * lambda; // year/system level 
   mu_state = Z * gamma; // state-year level

  // non-centered parameterization of state intercepts
for(s in 1:S)
  alpha_state[s] = mu_state[s] + sigma_state * alpha_state_std[s];

  // non-centered parameterization of year intercepts
for(t in 1:T)
  alpha_year[t] = mu_year[t] + sigma_year * alpha_year_std[t];


}

// model 
model {
  // define grain size (automatic selection)
  int grainsize = 1; 
  
  // define priors
  alpha ~ std_normal(); 
  
  // state parameters
  sigma_state ~ normal(0, 1); // half-normal
  alpha_state_std ~ std_normal(); // state intercepts non-centered
  // year parameters
  alpha_year_std ~ std_normal(); // year intercepts non-centered
  sigma_year ~ normal(0, 1); // half-normal
  
  // regression coef priors 
  // robust priors for indiv params and 
  //weak info normal for state-year level
  beta ~ student_t(7, 0, 1); 
  gamma ~ normal(0, 1); 
  lambda ~ normal(0, 1);
  
  // split the outcome with reduce_sum
  target += reduce_sum(partial_sum, y,
                       grainsize,
                       X, beta,
                       state, alpha_state,
                       year, alpha_year,
                       alpha);
}

bernoulli_logit_glm should speed up the X * beta part of the model, though it won’t speed up the random effects (see “Bernoulli-Logit Generalized Linear Model (Logistic Regression)” in the Stan Functions Reference).
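
For what it’s worth, here is a minimal sketch of how that could look inside the slicing function, assuming your Stan version has the vector-intercept overload of bernoulli_logit_glm_lpmf (the partial_sum_glm name is just illustrative):

  real partial_sum_glm(int[] y_slice,
                       int start, int end,
                       matrix X, vector beta,
                       int[] state, vector alpha_state,
                       int[] year, vector alpha_year,
                       real alpha) {
    // fold the overall and varying intercepts into a per-observation
    // intercept vector; the GLM function handles X * beta internally
    return bernoulli_logit_glm_lpmf(y_slice | X[start:end, ],
      alpha + alpha_state[state[start:end]] + alpha_year[year[start:end]],
      beta);
  }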

You can vectorize these:

for(s in 1:S)
  alpha_state[s] = mu_state[s] + sigma_state * alpha_state_std[s];

  // non-centered parameterization of year intercepts
for(t in 1:T)
  alpha_year[t] = mu_year[t] + sigma_year * alpha_year_std[t];

like this:

alpha_state = mu_state + sigma_state * alpha_state_std;
alpha_year = mu_year + sigma_year * alpha_year_std;

It might be a tiny bump but I wouldn’t expect much from that.

Just updating this thread for anyone who’s interested. What @bbbales2 suggested helped at the margins, as expected.

The big gain came from a suggestion by @mike-lawrence, who pointed me to the exploiting sufficient statistics section of the manual. Assuming that observations are i.i.d. at the country-year level, I summed the number of ones and counted the number of trials for each country-year. This also meant averaging the individual-level covariates at the same level. The revised model, which takes about 40 minutes, is below.
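
In other words (a sketch of the collapse, under the assumption that every respondent in country-year group g shares the same linear predictor, with n_g respondents, group-averaged covariates \bar{\textbf{x}}_g, and group-level covariates \textbf{z}_g, i.e. row g of Z):

\sum_{i \in g} y_i \sim \text{BinomialLogit}\left(n_g,\ \alpha + \alpha_{state[g]} + \alpha_{year[g]} + \bar{\textbf{x}}_g\beta + \textbf{z}_g\gamma\right)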


// data
data {
  int<lower = 0> N; // number of obs: one row per country-year group after aggregation
  int<lower = 0> n_res[N]; // number of respondents in group
  int y[N]; // outcome: sum by group
  
  int<lower = 0> S; // number of countries
  int<lower = 1, upper = S> state[N]; // state indicator

  int<lower = 0> T; // number of year obs
  int<lower = 1, upper = T> year[N];
  
  int<lower = 1> I; // number of individual variables
  matrix[N, I] X; // individual-level reg matrix, averaged by group
  int<lower = 1> J; // number of state-year variables
  matrix[N, J] Z; // state-year reg matrix; summed 
  int<lower = 1> L; // number of system-year variables
  matrix[T, L] G; // year/system reg matrix
}

// parameters.
parameters {
  real alpha; // overall intercept
//  real<lower = 0> sigma; // outcome variance
  
 
  vector[S] alpha_state_std; // state intercepts, non-centered
  real<lower = 0> sigma_state; // state var. hyperparam
  real mu_state; // mean of state intercepts

  vector[T] alpha_year_std; // year intercepts, non-centered
  real<lower = 0> sigma_year; // year var. hyperparam

  vector[I] beta; // individual coefs
  vector[J] gamma; // state-year coefs
  vector[L] lambda; // alliance-year coefs

}

transformed parameters {
   vector[T] mu_year; 
   vector[T] alpha_year; // year intercepts
   vector[S] alpha_state; // state intercepts

   
   // regression models of mean year varying intercepts
   mu_year = G * lambda; // year/system level 

  // non-centered parameterization of state intercepts
  alpha_state = mu_state + sigma_state * alpha_state_std;

  // non-centered parameterization of year intercepts
  alpha_year = mu_year + sigma_year * alpha_year_std;
  

}

// model 
model {
  
  // define priors
  alpha ~ std_normal(); 
  
  // state parameters
  sigma_state ~ normal(0, 1); // half-normal
  alpha_state_std ~ std_normal(); // state intercepts non-centered
  // year parameters
  alpha_year_std ~ std_normal(); // year intercepts non-centered
  sigma_year ~ normal(0, 1); // half-normal
  
  // regression coef priors 
  // robust priors 
  beta ~ student_t(5, 0, 1); 
  gamma ~ student_t(5, 0, 1); 
  lambda ~ student_t(5, 0, 1);

  y ~ binomial_logit(n_res, alpha +
      alpha_state[state] + alpha_year[year] +
       X * beta + Z * gamma);

}

Another option that I considered but didn’t need is the trick for reducing redundant computation. Just putting this here in case someone else finds it helpful.


7 days to 40 minutes, wow!


Thanks for reporting back your solution! This is good to have.