Specifying random effects as dummies in Stan

In hierarchical models with many random effects (say >30), it can be cumbersome and time-consuming to specify random effects in the usual way:

for(j in 1:Nj){
  theta[j] = theta_z[j] * sigma_theta;
}
theta_z ~ std_normal();
sigma_theta ~ std_normal();

As far as I understand it, another way to specify random effects is to unpack each categorical covariate into a matrix of dummy variables and have the dummies from the same variable share a scale parameter. The appealing thing (to me) is that the code is more succinct: you can specify any number of random effects in just a couple of lines. I think the specification in Stan would look as follows:

for(i in 1:q){
  eta_star[i] = eta[i] * reff_scales[reff_vars_id[i]];
}
mu = alpha + Z * eta_star;

eta ~ std_normal();
reff_scales ~ std_normal();

Z is an n x q matrix whose columns are dummies indicating each category of the categorical variables whose effects we want to measure; reff_vars_id is an index telling you which columns come from the same variable. For example, if the first variable is “partisanship”, the first three columns of Z will be “democrat”, “independent” and “republican”, and the first three elements of reff_vars_id are (1,1,1…).
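To make the structure concrete, here is a minimal R sketch of how Z and reff_vars_id could be built (the data frame and variable names are just placeholders for illustration):

# toy data frame with two categorical covariates
d <- data.frame(
  partisanship = factor(c("democrat", "independent", "republican", "democrat")),
  age_bin      = factor(c("18-34", "35-64", "65+", "35-64"))
)
reff_vars <- c("partisanship", "age_bin")

# one block of dummy columns per variable, keeping all levels (no reference category dropped)
Z_blocks <- lapply(reff_vars, function(v) model.matrix(~ . - 1, data = d[, v, drop = FALSE]))
Z <- do.call(cbind, Z_blocks)

# reff_vars_id[i] says which variable column i of Z belongs to: here (1,1,1,2,2,2)
reff_vars_id <- rep(seq_along(reff_vars), times = sapply(Z_blocks, ncol))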

In my mind, this specification has a huge advantage in coding time for models with a large number of covariates where you want the shrinkage to be selective, in the way random effects allow. It also allows quick switching to a ridge or lasso specification, which is attractive for model comparison purposes.
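For instance, sticking with the illustrative Z and reff_vars_id from the sketch above, collapsing all columns into a single variance component gives a ridge-style model without touching the Stan code:

# grouped shrinkage: one scale per categorical variable (as above)
reff_vars_id_grouped <- reff_vars_id

# ridge-style shrinkage: every dummy column shares one global scale
reff_vars_id_ridge <- rep(1L, ncol(Z))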

In a previous thread from a while back, @bbbales2 suggested one can do this in Stan, but that one should be careful to use sparse matrix-vector products to keep it efficient.

I have a few questions I was hoping to get some answers on:

  1. This is my first time using sparse matrix-vector products. From this they seem reasonably straightforward, though I’ve never seen them used in Stan code before, so it would be great to have some reference code showing how they are used. My current guess is below, though I’m not sure how to declare the vector sizes. I assume they should be passed in as data? Should I compute each size according to the manual, or is Stan somehow smart enough to figure it out by itself?

data.list$Zw_size = sum(Z != 0)        # w: one entry per non-zero value of Z
data.list$Zv_size = data.list$Zw_size  # v: one column index per non-zero value
data.list$Zu_size = nrow(Z) + 1        # u: row-start pointers, length n + 1

transformed data {
  vector[Zw_size] Zw = csr_extract_w(Z);
  int Zv[Zv_size] = csr_extract_v(Z);
  int Zu[Zu_size] = csr_extract_u(Z);
}
transformed parameters{
  vector[q] eta_star;
  vector[n] mu;
  vector[n] reff_pred;

  for(i in 1:q){
    eta_star[i] = eta[i] * reff_scales[reff_vars_id[i]];
  }
  reff_pred = csr_matrix_times_vector(n, q, Zw, Zv, Zu, eta_star);
  mu = alpha + reff_pred;
}
  2. In the ridge-regression specification it is typically necessary to standardize the covariates before fitting, so that the coefficients are on the same scale, though we typically never do this with random effects. Is there any reason to believe that standardizing the columns of Z would bring any benefit?

  3. Beyond this, is there an existing, tried-and-tested way of quickly specifying large random-effects models in Stan (say with more than 100 random effects), and of quickly varying which variables to include, without having to rewrite the variable names and assign new priors every time?

Any and all help is greatly appreciated,

Can you post a full model of your original 1:Nj approach and the new 1:q approach? As someone who does a lot of hierarchical models with multivariate random effects, the latter seems to be the standard way the manual shows how to do it (with the exception that you’re not attempting inference on the correlation structure).

Oh, maybe you have data where there are multiple different “random effects” (e.g. human participant as one random effect, word tokens used as stimuli as another) that by definition wouldn’t have any correlation structure, yet you want to make it more efficient (typing-code-wise) to express their scales in a single variable?

(p.s. it’s “Stan”, not “STAN”; it’s a shortened-first-name, not an acronym)


Hi Mike - thanks for the reply and sorry about Stan v. STAN - my bad indeed.

You have it right in the second sentence, I think: I’m trying to avoid specifying the different random effects one by one, and yes, I guess it ultimately boils down to expressing the different scales as a single variable and the different effects in a single matrix.

I’m working on a model of the 2020 Republican vote with, for simplicity, let’s say 2 covariates: age and partisanship. To increase the out-of-sample predictive power of the model, I’d like to use random effects instead of fixed effects to benefit from the shrinkage effect on the predictions. I prefer this to ridge because the shrinkage is not ‘global’, but guided by some theoretical grouping of the variables (i.e. I think age categories should share a variance component amongst themselves, but not with partisanship categories, for example). The full model in the “traditional” random-effects specification, as I understand it, would be:

data {
  // Outcome data
  int<lower = 1> n; // total number of observations
  int<lower = 0> Y[n]; // vector of labels

  int age_id[n]; // index of age categories
  int<lower = 0> N_age; // number of age categories

  int party_id[n]; // index of party categories
  int<lower = 0> N_party; // number of party categories
  
 }

parameters {
  real alpha; // global fixed intercept 
  vector[N_age] eta_age; // age random effect
  vector[N_party] eta_party; // party random effect

  real<lower=0> age_scale; // scale for age effects
  real<lower=0> party_scale; // scale for party effects

}  
  
transformed parameters{
  vector[N_age] eta_age_star; // scaled age random effects
  vector[N_party] eta_party_star; // scaled party random effects
  vector[n] mu; // logit-scale linear predictor
  
  eta_age_star = eta_age * age_scale;
  eta_party_star = eta_party * party_scale;

  // linear function of the logit-scale propensity to vote republican
  mu = alpha + eta_age_star[age_id] + eta_party_star[party_id];  
}

model {
   // // // Fixed Effects
   alpha ~ std_normal();
   
  // // // Random effects
  eta_party ~ std_normal();
  eta_age ~ std_normal();
  age_scale ~ std_normal();
  party_scale ~ std_normal();

  // // // Likelihood
  Y ~ bernoulli_logit(mu);
}

Does that make the problem a little clearer?

Yup, that makes sense. I’ll be interested to see whether your idea of lumping all the levels of the various random effects into a sparse matrix maintains performance; I have a slight prior that it’ll be worse than doing things by hand, but I don’t have much expertise in this domain.

Brief aside on this specific application: if you have the raw ages, it’s best to use those, possibly with a GP to capture possibly-non-linear effects of age (or a GAM as an approximation if your data are too dense for reasonable GP compute times, but given 1-year reporting resolution and about 100 unique values, a GP should be tractable). If the data aren’t raw ages but instead responses on an ordinal multiple-choice survey, you could still achieve better fidelity by keeping the ordinal relation; you’d just need an ordinal model for the effect of age.
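For what it’s worth, a minimal brms sketch of the smooth-age idea (the data frame and variable names here are placeholders, not your actual data):

library(brms)

# smooth (GAM-style) effect of age plus varying intercepts for the categorical variables;
# gp(age) could be swapped in for s(age) if a full GP is computationally feasible
fit <- brm(vote2020 ~ s(age) + (1 | state) + (1 | income),
           data = d, family = bernoulli())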


Finally got the chance to test this out and compare the two approaches.

Code below. I used some data from a non-representative survey of MTurkers from the 2020 election and tested a number of random effects on a sample size of ~5000. I used random effects for gender, age, income and state.

I left the gender effects as random, instead of just having a “male fixed effect” (which would be more efficient, I think), because in the matrix formulation you don’t want to have to work out which effects can be made “fixed” and which you want to keep as “random”; and even if there is no shrinkage for k < 3, you still get to identify the model via the soft sum-to-zero constraint on gender, so that’s good enough for me.

Your intuition bears out: the sparse-matrix approach is more than twice as slow. It’s a shame, because the coding-time gains can be quite large imho, and the results seem to be pretty much identical, suggesting the approach is in principle viable.

I wonder if there is any way to speed up the sparse-matrix computation? Again, I think this is pretty vital for situations with large numbers of random effects: you’re not going to code up 100 effects separately, it’s just so inefficient.

I attach the plots for convergence (tradvsparse_convergence.pdf, 8.5 KB) and the coefficient comparisons (tradvsparse_reff.pdf, 16.4 KB), showing the mean and the 2.5% and 97.5% quantiles. The computation times are:

print(reffmatrix.time )
Time difference of 5.866842 mins
print(trad.time)
Time difference of 2.488841 mins

# # # # # # # # # # # # # # # Check Speed and accuracy of this random effect model # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # 
temp.list = list(  n = dim(train.voters.star)[1],
                   Y = train.voters.star$vote2020,  
                   gender_id = match(train.voters.star$gender,levels(SF_complete_temp$gender)),
                   N_gender = nlevels(train.voters.star$gender),
                   
                   age_id = match(train.voters.star$age_bins,levels(train.voters.star$age_bins)),
                   N_age = nlevels(train.voters.star$age_bins),
                   
                   income_id = match(train.voters.star$commercial_estimated_hh_income,levels(train.voters.star$commercial_estimated_hh_income)),
                   N_income = nlevels(train.voters.star$commercial_estimated_hh_income),
                   
                   state_id = match(train.voters.star$state,levels(SF_complete_temp$state)),
                   N_state = nlevels(train.voters.star$state)
                   )

reff_vars.temp = c("gender","age_bins","commercial_estimated_hh_income","state")
Z.temp = data.table()
for(j in 1:length(reff_vars.temp)){
  temp = train.voters.star[,reff_vars.temp[j],with=FALSE]
  colnames(temp)=paste(colnames(temp),"___",sep="")
  Z.temp = cbind(Z.temp, model.matrix(~.-1,data = temp))
}
reff_vars_id.temp = match(sub("\\_\\_\\_.*", "",colnames(Z.temp)), reff_vars.temp )
temp.list$Z = Z.temp
temp.list$q = dim(Z.temp)[2]
temp.list$reff_vars_id = reff_vars_id.temp
temp.list$n_reff= max(reff_vars_id.temp)

temp.list$Zw_size = sum(Z.temp != 0)       # w: one entry per non-zero value of Z
temp.list$Zv_size = temp.list$Zw_size      # v: one column index per non-zero value
temp.list$Zu_size = nrow(Z.temp) + 1       # u: row-start pointers, length n + 1
# # # # Sparse-matrix random-effects model
model.reffmatrix= "
data {
  // Outcome data
  int<lower = 1> n; // total number of observations
  int<lower = 0> Y[n]; // vector of labels

  // Random effects data
  int<lower = 1> q; // number of random-effect covariates
  matrix[n,q] Z; // random design matrix
  int<lower = 1> reff_vars_id[q]; // variable-group ids (for local shrinkage)
  int<lower = 1> n_reff; // no. of separate variance components
  
  // Data for sparse matrix operations
  int<lower = 1> Zw_size; // number of non-zero values in Z (length of w)
  int<lower = 1> Zv_size; // length of the integer column-index array v
  int<lower = 1> Zu_size; // length of the integer row-start array u (n + 1)
}

transformed data {

  // Translate Z to a sparse (CSR) representation
  vector[Zw_size] Zw = csr_extract_w(Z);
  int Zv[Zv_size] = csr_extract_v(Z);
  int Zu[Zu_size] = csr_extract_u(Z);
}

parameters {
  real alpha; // global fixed intercept 
  vector[q] eta; // random effects
  real<lower = 0> reff_scales[n_reff]; // scale of the random effects
}  
  
transformed parameters{
  vector[q] eta_star; // scaled random effects
  vector[n] reff_pred; // random effects predictor
  vector[n] mu; // logit-scale total linear predictor

  // calculate random-effects predictor
  for(i in 1:q){
  eta_star[i] = eta[i]*reff_scales[reff_vars_id[i]];
  }
  reff_pred = csr_matrix_times_vector(n, q, Zw, Zv, Zu, eta_star);

  // linear function of the logit-scale propensity to vote Republican
  mu = alpha + reff_pred;  
}

model {
   // // // Fixed Effects
   alpha ~ std_normal();
   eta ~ std_normal();
   reff_scales ~ std_normal();

  // // // Likelihood
  Y ~ bernoulli_logit(mu);
}
"
pars.reffmatrix = c('alpha','eta_star','reff_scales')
model.trad= "
data {
  // Outcome data
  int<lower = 1> n; // total number of observations
  int<lower = 0> Y[n]; // vector of labels

  // Random effects data
  int gender_id[n]; // index of gender categories
  int<lower = 0> N_gender; // number of gender categories
  
  int age_id[n]; // index of age categories
  int<lower = 0> N_age; // number of age categories

  int income_id[n]; // index of income categories
  int<lower = 0> N_income; // number of income categories

  int state_id[n]; // index of state categories
  int<lower = 0> N_state; // number of state categories
}

parameters {
  real alpha; // global fixed intercept 
  vector[N_gender] eta_gender; 
  vector[N_age] eta_age; 
  vector[N_income] eta_income; 
  vector[N_state] eta_state; 

  real<lower = 0> gender_scale; 
  real<lower = 0> age_scale; 
  real<lower = 0> income_scale; 
  real<lower = 0> state_scale; 
}  
  
transformed parameters{
  vector[N_gender] eta_gender_star; // scaled gender random effects
  vector[N_age] eta_age_star; // scaled age random effects
  vector[N_income] eta_income_star; // scaled income random effects
  vector[N_state] eta_state_star; // scaled state random effects

  vector[n] mu; // logit-scale linear predictor
  
  eta_gender_star = eta_gender * gender_scale;
  eta_age_star = eta_age * age_scale;
  eta_income_star = eta_income * income_scale;
  eta_state_star = eta_state * state_scale;

  // linear function of the logit-scale propensity to vote republican
  mu = alpha + eta_gender_star[gender_id] + eta_age_star[age_id] +  eta_income_star[income_id] +  eta_state_star[state_id];  
}

model {
   // // // Fixed Effects
   alpha ~ std_normal();
   eta_gender ~ std_normal();
   eta_age ~ std_normal();
   eta_income ~ std_normal();
   eta_state ~ std_normal();


   gender_scale ~ std_normal();
   age_scale ~ std_normal();
   income_scale ~ std_normal();
   state_scale ~ std_normal();


  // // // Likelihood
  Y ~ bernoulli_logit(mu);
}
"
pars.trad = c('alpha',
              'eta_gender_star',
              'eta_age_star','eta_income_star',
              'eta_state_star')

# # # Fit Total Model # # # # # # # # # # # # # # # # # # # # # # # # 
start.time = Sys.time()
stan.model.fit.reffmatrix <- stan(model_code = model.reffmatrix, 
                             data = temp.list, 
                             iter = 500,
                             warmup = 250,
                             thin = 4,
                             pars = pars.reffmatrix,
                             cores =4,
                             chains = 4,
                             control = list(max_treedepth =10),
                             verbose = TRUE)
end.time = Sys.time()
reffmatrix.time = end.time - start.time
# # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # 
start.time = Sys.time()
stan.model.fit.trad <- stan(model_code = model.trad, 
                                  data = temp.list, 
                                  iter = 500,
                                  warmup = 250,
                                  thin = 4,
                                  pars = pars.trad,
                                  cores =4,
                                  chains = 4,
                                  control = list(max_treedepth =10),
                                  verbose = TRUE)
end.time = Sys.time()
trad.time = end.time - start.time
print(trad.time)
# # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # 
# generate estimated parameters object
model.params.reffmatrix = extract(stan.model.fit.reffmatrix,pars = pars.reffmatrix)
summary_fit.reffmatrix = summary(stan.model.fit.reffmatrix)

model.params.trad = extract(stan.model.fit.trad,pars = pars.trad)
summary_fit.trad = summary(stan.model.fit.trad)

# save summary
par(mfrow = c(1,2))
# # # Convergence Diagnostics
plot(summary_fit.trad$summary[,"Rhat"],pch = NA,
     ylim = c(min(min(summary_fit.trad$summary[,"Rhat"],na.rm=TRUE),0.85),max(max(summary_fit.trad$summary[,"Rhat"],na.rm=TRUE),1.5)),
     bty = "n",ylab = "Rhat",xlab = 'Index of Parameters',main = 'Convergence Diagnostics - Traditional Approach')
abline(h = 1.1,col= 'red',lty = 2)
params = unique(sub("\\[.*", "",rownames(summary_fit.trad$summary)))
for(j in 1:length(params)){
  points(x = grep(params[j],rownames(summary_fit.trad$summary)),
         y = summary_fit.trad$summary[grep(params[j],rownames(summary_fit.trad$summary)),"Rhat"],
         pch = unlist(lapply(X = 15:16,function(x){rep(x,8)}))[j],col = adjustcolor( rep(seq(1,8,1),2)[j],0.5))
}
legend('topleft',legend = params ,col = adjustcolor(rep(seq(1,8,1),2)[1:length(params)],0.85),
       pch = unlist(lapply(X = 15:16,function(x){rep(x,8)}))[1:length(params)])
# # # 
plot(summary_fit.reffmatrix$summary[,"Rhat"],pch = NA,
     ylim = c(min(min(summary_fit.reffmatrix$summary[,"Rhat"],na.rm=TRUE),0.85),max(max(summary_fit.reffmatrix$summary[,"Rhat"],na.rm=TRUE),1.5)),
     bty = "n",ylab = "Rhat",xlab = 'Index of Parameters',main = 'Convergence Diagnostics - Sparse Matrix Approach')
abline(h = 1.1,col= 'red',lty = 2)
params = unique(sub("\\[.*", "",rownames(summary_fit.reffmatrix$summary)))
for(j in 1:length(params)){
  points(x = grep(params[j],rownames(summary_fit.reffmatrix$summary)),
         y = summary_fit.reffmatrix$summary[grep(params[j],rownames(summary_fit.reffmatrix$summary)),"Rhat"],
         pch = unlist(lapply(X = 15:16,function(x){rep(x,8)}))[j],col = adjustcolor( rep(seq(1,8,1),2)[j],0.5))
}
legend('topleft',legend = params ,col = adjustcolor(rep(seq(1,8,1),2)[1:length(params)],0.85),
       pch = unlist(lapply(X = 15:16,function(x){rep(x,8)}))[1:length(params)])

# # # check estimates 
par(mfrow = c(1,3))
plot(y = summary_fit.trad$summary[grep("eta",rownames(summary_fit.trad$summary)),"mean"],
     x = summary_fit.reffmatrix$summary[grep("eta",rownames(summary_fit.reffmatrix$summary)),"mean"],
     main = 'mean',ylab = 'traditional',xlab = 'sparse')
abline(0,1)
plot(y = summary_fit.trad$summary[grep("eta",rownames(summary_fit.trad$summary)),"2.5%"],
     x = summary_fit.reffmatrix$summary[grep("eta",rownames(summary_fit.reffmatrix$summary)),"2.5%"],
     main = '2.5%',ylab = 'traditional',xlab = 'sparse')
abline(0,1)
plot(y = summary_fit.trad$summary[grep("eta",rownames(summary_fit.trad$summary)),"97.5%"],
     x = summary_fit.reffmatrix$summary[grep("eta",rownames(summary_fit.reffmatrix$summary)),"97.5%"],
     main = '97.5%',ylab = 'traditional',xlab = 'sparse')
abline(0,1)

It’s something we’re talking a lot about, though it’s a work in progress. The implementation of the sparse matrix-vector product in Stan right now is known to be slower than we’d like.

I think since you are constrained by the problem definition (100 random effects) to specify it with sparse matrices, go ahead and do that.

I think the other fast way to work on a model like this would be to work with rstanarm or brms. Do either of those interfaces work for the problem you’re working on?

You can also use brms to generate Stan code and data, via make_stancode and make_standata, that you can then take and modify manually.
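Roughly like this (a sketch; the formula, data frame and outcome name are placeholders):

library(brms)

# generate Stan code and the matching data list for a varying-intercepts model,
# then edit the generated code by hand and fit it with rstan or cmdstanr
code     <- make_stancode(vote2020 ~ (1 | age_bins) + (1 | state),
                          data = d, family = bernoulli())
standata <- make_standata(vote2020 ~ (1 | age_bins) + (1 | state),
                          data = d, family = bernoulli())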

Pinging @stevebronder so he sees the thread


An alternative might also be to use regular matrix multiplication, but with GPU acceleration (OpenCL with cmdstanr).


Thanks @bbbales2. I typically prefer to write Stan code manually to exploit the flexibility it offers; for instance, in these MrP models I usually have some element of spatial modeling, some traditional random effects, and some GP components for dynamic effects. I was under the impression this level of complexity is not easily specified in brms.

That said, I imagine that the solution you are suggesting would allow me to quickly get the code for the random effects part of the model, and then further add whatever extra complexity by hand in the Stan code, which sounds attractive.

I was under the impression that in brms you’d still have to specify the random effects by hand with the (1|var) syntax, which would still require writing this out for 100 variables. Is there another way that I’m not familiar with? (I’m not overly familiar with brms.)

Many thanks again.

Thanks Andrew, that sounds promising; I’ll look into it. I’m new to OpenCL, and I was under the impression that Stan already parallelises the code over the available cores if you ask it to in the stan() function. What is the difference between that and enabling parallelisation via OpenCL?

Apologies for the basic question.

Stan parallelises in the sense that it will estimate multiple chains simultaneously (i.e., one chain per core), but it doesn’t parallelise the computations within those chains. So even if you specify cores=8, if you only have 4 chains then only 4 cores will be used.

You can parallelise the computations within chains using the map_rect or reduce_sum frameworks, allowing for multiple CPU cores to be used by a given chain.

A much ‘larger’ version of parallelism is introduced through GPU acceleration, which offloads some functions to the GPU (a full list of those functions is here). You can think of GPU acceleration as parallelism on a greater scale, allowing several hundred calculations to be executed simultaneously (depending on the GPU).
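For reference, enabling OpenCL in cmdstanr looks roughly like this (a sketch; the model file name and the device ids are placeholders for your setup):

library(cmdstanr)

# compile with OpenCL support, then tell the sampler which platform/device to use
mod <- cmdstan_model("reff_matrix.stan",
                     cpp_options = list(stan_opencl = TRUE))
fit <- mod$sample(data = temp.list,
                  chains = 4, parallel_chains = 4,
                  opencl_ids = c(0, 0))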


Brilliant, thanks for the quick explainer, which makes total sense. As a former JAGS user I always wondered about within-chain parallelisation; this sounds extremely promising. Thanks for the pointers, I’ll report back if I can make it work.

Yeah, so it’s faster than writing the Stan code but you’d still need to type out all the terms for a big regression (or generate them somehow).

What you could do is automate the building of the formula. Assuming the variables you want as random effects are the Nth to Kth columns of your data frame:

formula_text = paste0(paste0('(1|',names(a)[N:K]),')',collapse='+')

Then add the dv ~ at the beginning plus any other fixed-effects structure, and finally supply eval(parse(text=formula_text)) as the formula argument to brms, as in the sketch below.
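Putting it together (a sketch; a, N, K and the outcome name are the placeholders from above):

library(brms)

# one (1|var) term per column N..K of the data frame a
formula_text <- paste0(paste0('(1|', names(a)[N:K]), ')', collapse = '+')
formula_text <- paste('vote2020 ~', formula_text)

fit <- brm(eval(parse(text = formula_text)),
           data = a, family = bernoulli())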


good explanation