Stand-alone predictions in the context of a hierarchical random effect model

Model: I am working with a model that can be described, at a bird's-eye level, as a random-slope logistic model. I have 0/1 outcomes, some predictors, and random intercepts and random slopes for my individuals. I can fit the model and am satisfied with it: goodness of fit, convergence and all.

Problem: suppose I am given a new individual with their predictors and their trajectory of 0/1 outcomes. Can I run the sampled chains on these to obtain that individual's random effects? Section 27.7 provides pointers, in the “flat” case of i.i.d. data, on how to take the parameters that have already been simulated and feed them to the generated quantities block without having the model to sample from. But in my case I have to deal with the random effects that I am declaring as parameters, and the random effects for the new data, obviously, have not been sampled. I cannot simply add the new data to the model and re-run the sampler, as the fit takes about 50 core-hours, so I cannot afford to re-run it for another 2000 new individuals that I need to produce the model predictions for.

Reprex as an Rmd fragment, so that the Stan model is in the middle. The goal is to get the vector u_i[2] for the new data.


title: "Random effects reprex"
output: html_document

knitr::opts_chunk$set(echo = TRUE, 
                      warning = FALSE, 
                      message = FALSE, 
                      dev = "svg") ## export images as SVG

library(ggplot2)
library(magrittr)
library(here)
library(dplyr)
library(tidyr)
library(tibble)
library(rstan)
library(bayesplot)
library(knitr)

Simulate data

N <- 10
set.seed(1)
# One row per person: true random intercept, true random slope, and 5-8 observations each
my_df <- data.frame( id=1:N, interc=rnorm(n=N, mean=0.5, sd=0.3), slope=rnorm(n=N, mean=-0.5, sd=0.3), n=sample(5:8, N, replace=TRUE) )
# Expand to one row per observation, then add the time index t and the 0/1 outcome y
my_df <- my_df[ rep(1:N, my_df$n), ]
my_df %>% group_by(id) %>% mutate(t=row_number()) %>% ungroup() %>%
  mutate(y=as.integer(interc + slope*t/3 + rnorm(nrow(my_df), mean=0, sd=0.3) > 0)) -> my_df

Stan code (compiled by the Rmd stan chunk into the model object random_slope_logit used below)

data {
  int<lower=1> N;                                     // total number of observations
  int<lower=0,upper=1> outcome[N];                    // outcome
  int<lower=1> N_people;                              // number of unique individuals
  int<lower=1,upper=N_people> seq_person_id[N];       // SEQUENTIAL person id -- important for indexing
  vector[N]    predictor;                             // predictor that goes with random slope
}
parameters {
  // variance components
  vector<lower = 0>[2]  tau_u;   
  // random effects for people
  vector[2] u_i[N_people];
  // correlation matrix
  corr_matrix[2] rho_matrix;

}
transformed parameters {
  cov_matrix[2] Sigma;
  vector[N] theta;
  
  // covariance matrix
  Sigma = quad_form_diag(rho_matrix,tau_u);
  
  // logit of outcome
  for (t in 1:N)
    theta[t] = u_i[seq_person_id[t],1] 
               + u_i[seq_person_id[t],2]*predictor[t];
}
model {
  // outcome
  outcome ~ bernoulli(inv_logit(theta));

  // variances
  tau_u ~ cauchy(0,1);
  
  // correlation
  rho_matrix ~ lkj_corr(1);
  
  // random effects
  u_i ~ multi_normal( [0, 0]', Sigma );
  
}

Bayesian computation

Prepare the data for Stan

stan_data <- list(
  N = nrow(my_df),                                      # int<lower=1> N;
  outcome = unlist(my_df[,"y"]),                        # int<lower=0,upper=1> outcome[N];
  N_people = (my_df %>% distinct(id) %>% nrow()),       # int<lower=1> N_people;
  seq_person_id = unlist(my_df[,"id"]),                 # int<lower=1,upper=N_people> seq_person_id[N]; 
  predictor = unlist(my_df[,"t"])                       # vector[N]    predictor;
)

Sample from the model

stan_mcmc_rslope_logit <- rstan::sampling(
  random_slope_logit, data = stan_data,
  chains = 2, seed = 1234,
  warmup = 1000, iter = 2000,
  control = list(adapt_delta = 0.9, stepsize_jitter = 0.1),
  refresh = 100,
  init = list( list(tau_u = c(0.2, 0.2)), list(tau_u = c(0.3, 0.3)) )
)
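
(For individuals that are already in the fit, this sort of after-the-fact prediction is easy to do directly in R from the saved draws. A minimal sketch, using the objects from this reprex, for, say, person 3 at t = 4:

# Posterior draws of the random effects: iterations x N_people x 2
post_u <- rstan::extract(stan_mcmc_rslope_logit, pars = "u_i")$u_i
# Posterior predictive probability that person 3 has y = 1 at t = 4
p_draws <- plogis(post_u[, 3, 1] + post_u[, 3, 2] * 4)
quantile(p_draws, c(0.05, 0.5, 0.95))

The trouble is that for a new person there are no u_i draws to index into, which is exactly the problem below.)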

New observations

new_df <- data.frame( y_new = c(0,1,1,0,1), t_new=1:5)

What now??

Estimating the group-level parameters for new observations is a more complicated endeavour, and was discussed in more detail in this thread: Posterior Predictive for hierarchical model without further hyper-parameter inference - #8 by maxbiostat
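
In rough outline (a sketch, not tested against your model, with object names taken from the reprex above; the choice of K and the resampling step are illustrative, not prescriptions from that thread): for each saved posterior draw of the hyper-parameters, treat the new person's effects as having the prior multi_normal(0, Sigma), and importance-reweight draws from that prior by the Bernoulli likelihood of the observed new trajectory.

post <- rstan::extract(stan_mcmc_rslope_logit, pars = "Sigma")
S <- dim(post$Sigma)[1]   # number of saved posterior draws
K <- 200                  # prior candidates per posterior draw

y_new <- new_df$y_new
t_new <- new_df$t_new

u_new_draws <- matrix(NA_real_, S, 2)
for (s in 1:S) {
  Sigma_s <- post$Sigma[s, , ]                               # 2 x 2 covariance for draw s
  cand    <- MASS::mvrnorm(K, mu = c(0, 0), Sigma = Sigma_s) # K x 2 prior candidates
  eta     <- cand[, 1] + outer(cand[, 2], t_new)             # K x T matrix of logits
  loglik  <- as.vector(eta %*% y_new) - rowSums(log1p(exp(eta)))  # Bernoulli log-lik per candidate
  w       <- exp(loglik - max(loglik))                       # unnormalised importance weights
  u_new_draws[s, ] <- cand[sample.int(K, 1, prob = w), ]     # resample one candidate
}
# Column 2 of u_new_draws approximates the posterior of the new person's random
# slope (the u_i[2] asked about above), holding the hyper-parameters at the
# posterior from the existing fit.

Larger K gives a better approximation; note that this deliberately does not update the hyper-parameters for the new data, which is the setting discussed in that thread.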

Note that you can pretty significantly optimise your model by using the non-centered parameterisation for u_i and the compound bernoulli_logit_glm distribution for your likelihood, like so:

data {
  int<lower=1> N;                                     // total number of observations
  int<lower=0,upper=1> outcome[N];                    // outcome
  int<lower=1> N_people;                              // number of unique individuals
  int<lower=1,upper=N_people> seq_person_id[N];       // SEQUENTIAL person id -- important for indexing
  matrix[N, 1]    predictor;                             // predictor that goes with random slope
}
parameters {
  // variance components
  vector<lower = 0>[2]  tau_u;
  // 'Raw' N(0,1) variates for the non-centered param
  matrix[2, N_people] u_i_raw;
  // correlation matrix
  cholesky_factor_corr[2] rho_matrix_cholesky;

}
transformed parameters {
  // random effects for people
  // Implies u_i ~ multi_normal( [0, 0]', Sigma )
  // Specified with column per group (person) since it's
  // faster to extract a column than a row
  matrix[2, N_people] u_i = diag_pre_multiply(tau_u, rho_matrix_cholesky) * u_i_raw;
}
model {
  // correlation
  rho_matrix_cholesky ~ lkj_corr_cholesky(1);
  // variances
  tau_u ~ cauchy(0,1);

  // 'Raw' N(0,1) variates for the non-centered param
  to_vector(u_i_raw) ~ std_normal();

  // outcome
  outcome ~ bernoulli_logit_glm(predictor,
                                u_i[1, seq_person_id]',
                                u_i[2, seq_person_id]');
}


Thanks for the syntax tips. In my production model (which is 150 lines of code as it is right now; there is way more going on, and I only wanted to reprex the methodological issue I am facing), I do use the non-centered parameterization of the random effects (although I am not convinced it helps much: my variance components are very well isolated from zero, with hundreds of degrees of freedom, and I doubt I am facing the funnel situation). Are you saying bernoulli_logit_glm() would work much faster?

What @maxbiostat has proposed is pretty close to what I am attempting to do now: draw the independent random effects, independently estimate the regression of the sequence of events on time, and combine the two using the rules for two normals. I know it is not kosher Bayes but it is good enough for the business application :).
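
(For concreteness, this is the usual precision-weighted combination, ignoring for simplicity the correlation between the random intercept and slope: with the population distribution \mathrm{N}(0, \tau^2) as the prior and a stand-alone per-person estimate \hat{u} with standard error s as an approximately normal likelihood, the approximate posterior for the new person's effect is

\mathrm{N}\!\left( \frac{\hat{u}/s^2}{1/\tau^2 + 1/s^2},\; \frac{1}{1/\tau^2 + 1/s^2} \right). )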

although I am not convinced it helps much: my variance components are very well isolated from zero, with hundreds of degrees of freedom, and I doubt I am facing the funnel situation

When you have correlated random effects, the primary benefit comes from the fact that the multivariate (Cholesky) likelihood is much more computationally expensive than the univariate standard normal plus a single matrix multiplication.

Are you saying bernoulli_logit_glm() would work much faster?

Both faster and more efficient. The _glm likelihoods calculate analytic gradients for the respective inputs separately, which leads to more efficient sampling. They also avoid calculating gradients for inputs which are data rather than parameters, so each evaluation requires less computation. The _glm likelihoods can also be GPU-accelerated, which provides a significant improvement with large datasets.


My read of the bernoulli_logit_glm() documentation is that it is (very well) suited for i.i.d. data, as it exposes the overall constant alpha and slopes beta. I don’t really see a way to adapt it to mixed models short of blowing up the design matrix x.

The model, for individuals indexed as i and observations within the individuals indexed as t, is

\operatorname{logit} \mathop{\mathbf{P}} [y_{it}=1 \mid \alpha, \beta, \mathbf{u}_i ] = \alpha + \beta' x_{it} + u_{1i} + u_{2i} z_{it}

So to utilize the bernoulli_logit_glm() version, I would need to append_row(beta, u[,1], u[,2]) for the parameters, and append_col(x, a bunch of indicators of the i-th cluster, a bunch of vectors of z_it corresponding to the i-th cluster) for the regressors. I am not entirely sure it is worth it, as it would blow up the memory: I currently have a 40K x 12 regressor matrix and smaller 1.5K x 2 random-effect objects, and I would end up with something like a 40K x 3K regressor matrix with the bernoulli_logit_glm() factorization.

The syntax you provided would have multiplied u_{1i} by u_{2i}, which would be… weird.

Maybe @Matthijs could suggest a better form? Would it matter if the parameters appended in the above way are sampled differently? Or is the primary purpose of breaking the likelihood down into the parameters, the overall constant, and the regressors the improvement in the derivatives and the chain rule that can be used? Identification of the overall constant in these mixed models is awful; that parameter has the highest autocorrelations, and its effective sample size is about 1/10 that of the slopes.

The syntax you provided would have multiplied u_{1i} by u_{2i}, which would be… weird.

That’s not quite the case. The likelihood statement is:

  outcome ~ bernoulli_logit_glm(predictor,
                                u_i[1, seq_person_id]',
                                u_i[2, seq_person_id]');

Which implies:

\operatorname{logit} \mathop{\mathbf{P}} [y_{it}=1] = u_{1i} + u_{2i}z_{it}

I currently have a 40K x 12 regressor matrix and smaller 1.5K x 2 random-effect objects, and I would end up with something like a 40K x 3K regressor matrix with the bernoulli_logit_glm() factorization

If you have 12 fixed-effect predictors and 2 random-effect predictors, then your final x matrix (passed as input data) would be 40K x 14.

To add a fixed intercept and fixed regression coefficients to your model, you would add syntax similar to:

parameters {
  ...
  real alpha;
  vector[12] beta;
}

model {
  ...
  outcome ~ bernoulli_logit_glm(predictor,
                                alpha + u_i[1, seq_person_id]',
                                // Assuming that the first 12 columns of 'predictor'
                                // comprise the fixed covariates
                                append_row(beta, u_i[2, seq_person_id]'));
}

Which would imply

\operatorname{logit} \mathop{\mathbf{P}} [y_{it}=1] = \alpha + \beta'x_{it} + u_{1i} + u_{2i}z_{it}

Because these operations all use large matrices/vectors, it’s quite likely you will also see significant speed improvements by using the new stanc --O1 optimisation flag and OpenCL acceleration (if you have access to a discrete GPU).
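
For example, with cmdstanr (rstan may not expose these options) the compile- and run-time settings would look roughly like this; the file name and OpenCL device ids below are placeholders:

library(cmdstanr)

# Compile with the O1 optimisation level and OpenCL support
mod <- cmdstan_model(
  "random_slope_logit.stan",
  stanc_options = list("O1"),
  cpp_options   = list(stan_opencl = TRUE)
)

# Point the sampler at the GPU (platform id, device id)
fit <- mod$sample(
  data = stan_data,
  chains = 2,
  opencl_ids = c(0, 0)
)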

For an alternative perspective on optimising mixed models in Stan, @mike-lawrence has done some brilliant work and might be able to help as well.

I see, thanks. That’s a tricky use of the function. It would be great if it were documented (1.9 Hierarchical logistic regression | Stan User’s Guide talks about bernoulli_logit().)