Categorical model with large dataset - large/divergent ELBO & possible combined reduce_sum()/GPU support

I have posted a few times here about this model and received great support to get the model to this point. Unfortunately, I’m still having trouble with it. R/data file included at bottom. I have estimated such models using Stan before (in fact I have a publication on it). The model is slightly different here and the dataset much larger.

I started from the following brms model:

  brm(data = mod[1:1000,],
      family = categorical(link = logit, refcat = NA),
      bf(trptrans ~ 1,
         nlf(mu1 ~ btc * (b1tt * travelTimeDrive + travelCostDrive)),
         nlf(mu2 ~ btc * (a2 + b2tt * travelTimeWalk)),
         nlf(mu3 ~ btc * (a3 + b3tt * travelTimeBike)),
         nlf(mu4 ~ btc * (a4 + b4tt * travelTimeTransit)),
         mvbind(a2+ a3 + a4 + btc + b1tt + b2tt + b3tt + b4tt) ~ 1 + (1 + race_2 + race_3 + race_4 + race_5 + hhfaminc|personid) + (1|trippurp),
         nl = TRUE),
      prior = c(prior(normal(0, 2.5), class = b, nlpar = a2),
                prior(normal(0, 2.5), class = b, nlpar = a3),
                prior(normal(0, 2.5), class = b, nlpar = a4),
                prior(normal(0, 2.5), class = b, nlpar = b1tt),
                prior(normal(0, 2.5), class = b, nlpar = btc),
                prior(normal(0, 2.5), class = b, nlpar = b2tt),
                prior(normal(0, 2.5), class = b, nlpar = b3tt),
                prior(normal(0, 2.5), class = b, nlpar = b4tt)),
      algorithm = "meanfield")

My dataset has about 178,000 rows, which I filter in the above code mod[1:1000,] for testing. I adjusted my Stan code from that generated by brms to be more vectorized and adjust the specification for btc, b2tt, b3tt, b4tt to be exponentiated (so they are lognormal distributed). It was recommended I run the model on a GPU because I have a large dataset. This doesn’t give a huge improvement because categorical_logit_glm_lupmf requires me to loop over the N rows and the GPU isn’t highly utilized.

I simplified the model as a next step to remove correlations and contextual variables (race_2, race_3, race_4, race_5, and hhfaminc). I ran the model on VI for the full dataset. On the CPU, gradient evaluation took 1.08 seconds. On the GPU, it took 0.92 seconds. I get very large ELBO values (e.g., -2783111885420583704669702329161101097016063682459208926108713371041792.000) and a divergence message. When I run the above brms code with VI, it runs fine (gradient evaluation takes 0.0088 seconds for 1000 records or 1.52 seconds with the full dataset) and stochastic ascent runs with reasonable ELBO values.

I thought part of the problem could be the priors combined with the exp() exploding the initial values. I reduced the prior mean from 2.5 to 0.5, but it doesn’t have an appreciable effect on the ELBO.

If I can get this model giving reasonable results (in a reasonable runtime), then I can add back the contextual variables and some of the correlation. I have two questions/focuses:

  1. Any suggestions on how to address the divergence/large ELBO?
  2. The GPU is underutilized, so not being optimally used (using a Tesla k20 but have access to a V100). I read it might be possible to combine reduce_sum and GPU (Any intuitions/results on when GPU>reduce_sum? - #5 by saudiwin). I thought this might increase GPU utilization because it would pass multiple rows at once. Is this a possibility? I have a version of the code written using reduce_sum.
data {
  int<lower=1> N;  // total number of observations
  int<lower=2> ncat;  // number of categories
  int<lower=2> nvar;  // number of variables for each alternative
  int<lower=2> ntotvar;  // number of total variables
  array[N] int Y;  // response variable
  // covariate matrix for alt1
  array[N] row_vector[nvar] x1; // matrix[N,nvar] x1;
  // covariate matrix for alt2 
  array[N] row_vector[nvar] x2; // matrix[N,nvar] x2;
  // covariate matrix for alt3 
  array[N] row_vector[nvar] x3; // matrix[N,nvar] x3;
  // covariate matrix for alt4
  array[N] row_vector[nvar] x4; // matrix[N,nvar] x4;
  // data for group-level effects of ID 1
  int<lower=1> N_1;  // number of grouping levels
  int<lower=1> M_1;  // number of coefficients per level
  int<lower=1> J_1[N];  // grouping indicator per observation
  // group-level predictor values
  // matrix[N, M_1] Z_1;
  vector[N] Z_1;
  int<lower=1> NC_1;  // number of group-level correlations
  // data for group-level effects of ID 2
  int<lower=1> N_2;  // number of grouping levels
  int<lower=1> M_2;  // number of coefficients per level
  int<lower=1> J_2[N];  // grouping indicator per observation
  int prior_only;  // should the likelihood be ignored?
transformed data {
  int<lower=1> M_3 = (M_1+M_2)*(ncat*2);  // total number of sd
parameters {
  real b_btc;  // population-level effects
  real b_a2;  // population-level effects
  real b_a3;  // population-level effects
  real b_a4;  // population-level effects
  real b_b1tt;  // population-level effects
  real b_b2tt;  // population-level effects
  real b_b3tt;  // population-level effects
  real b_b4tt;  // population-level effects
  vector<lower=0>[M_3] sd;  // group-level standard deviations
  vector[N_1] z_1[M_1];  // standardized group-level effects
  vector[N_2] z_2[M_2];  // standardized group-level effects
  vector[N_1] z_3[M_1];  // standardized group-level effects
  vector[N_2] z_4[M_2];  // standardized group-level effects
  vector[N_1] z_5[M_1];  // standardized group-level effects
  vector[N_2] z_6[M_2];  // standardized group-level effects
  vector[N_1] z_7[M_1];  // standardized group-level effects
  vector[N_2] z_8[M_2];  // standardized group-level effects
  vector[N_1] z_9[M_1];  // standardized group-level effects
  vector[N_2] z_10[M_2];  // standardized group-level effects
  vector[N_1] z_11[M_1];  // standardized group-level effects
  vector[N_2] z_12[M_2];  // standardized group-level effects
  vector[N_1] z_13[M_1];  // standardized group-level effects
  vector[N_2] z_14[M_2];  // standardized group-level effects
  vector[N_1] z_15[M_1];  // standardized group-level effects
  vector[N_2] z_16[M_2];  // standardized group-level effects
transformed parameters {
  vector[N_1] r_1;  // actual group-level effects
  vector[N_2] r_2;  // actual group-level effects
  vector[N_1] r_3;  // actual group-level effects
  vector[N_2] r_4;  // actual group-level effects
  vector[N_1] r_5;  // actual group-level effects
  vector[N_2] r_6;  // actual group-level effects
  vector[N_1] r_7;  // actual group-level effects
  vector[N_2] r_8;  // actual group-level effects
  vector[N_1] r_9;  // actual group-level effects
  vector[N_2] r_10;  // actual group-level effects
  vector[N_1] r_11;  // actual group-level effects
  vector[N_2] r_12;  // actual group-level effects
  vector[N_1] r_13;  // actual group-level effects
  vector[N_2] r_14;  // actual group-level effects
  vector[N_1] r_15;  // actual group-level effects
  vector[N_2] r_16;  // actual group-level effects
  // compute actual group-level effects
  r_1 = (sd[1] * (z_1[1]));
  r_2 = (sd[2] * (z_2[1]));
  // compute actual group-level effects
  r_3 = (sd[3] * (z_3[1]));
  r_4 = (sd[4] * (z_4[1]));
  // compute actual group-level effects
  r_5 = (sd[5] * (z_5[1]));
  r_6 = (sd[6] * (z_6[1]));
  // compute actual group-level effects
  r_7 = (sd[7] * (z_7[1]));
  r_8 = (sd[8] * (z_8[1]));
  // compute actual group-level effects
  r_9 = (sd[9] * (z_9[1]));
  r_10 = (sd[10] * (z_10[1]));
  // compute actual group-level effects
  r_11 = (sd[11] * (z_11[1]));
  r_12 = (sd[12] * (z_12[1]));
  // compute actual group-level effects
  r_13 = (sd[13] * (z_13[1]));
  r_14 = (sd[14] * (z_14[1]));
  // compute actual group-level effects
  r_15 = (sd[15] * (z_15[1]));
  r_16 = (sd[16] * (z_16[1]));
model {
  // likelihood including constants
  if (!prior_only) {
    // Define matrices/vector for x, alpha, beta
    matrix[ncat,nvar] x;
    matrix[nvar,ncat] beta;
    vector[ncat] alpha;
    array[N] row_vector[ncat] a = rep_array([0,b_a2,b_a3,b_a4], N);
    // initialize linear predictor term
    // Terms are btc, b1tt, b2tt, b3tt, b4tt

    array[N] row_vector[ntotvar] b = rep_array([b_btc,b_b1tt,b_b2tt,b_b3tt,b_b4tt], N);     // matrix[N,ntotvar] b = rep_matrix(b_btc,N,ntotvar);
    for (n in 1:N) {
      // add to linear predictor term
      a[n] += [0,r_3[J_1[n]] + r_4[J_2[n]], r_5[J_1[n]] + r_6[J_2[n]], r_7[J_1[n]] + r_8[J_2[n]]];
      // add to linear predictor term
      // Terms are btc, b1tt, b2tt, b3tt, b4tt
      b[n] += [r_1[J_1[n]] + r_2[J_2[n]], r_9[J_1[n]] + r_10[J_2[n]], r_11[J_1[n]] + r_12[J_2[n]],r_13[J_1[n]] + r_14[J_2[n]],r_15[J_1[n]] + r_16[J_2[n]]];
      b[n] = exp(b[n]);
      // Each x and beta is a matrix with dimensions alts x variables
      // Our x will be the time/cost coming in as inputs
      x = to_matrix([x1[n],x2[n],x3[n],x4[n]]);

      // Our betas will be the hierarchical slope parameters (2 x 4)
      beta = [b[n,1] * b[n,2:5], rep_row_vector(b[n,1],ncat)];
      // Our alphas will be the hierarchical intercept parameters
      alpha = [a[n,1], b[n,1] * a[n,2], b[n,1] * a[n,3], b[n,1] * a[n,4]]';
      target += categorical_logit_glm_lupmf(Y[n] | x, alpha, beta);
  // priors
  target += normal_lpdf(b_btc | 0, 0.5);
  target += normal_lpdf(b_a2 | 0, 2.5);
  target += normal_lpdf(b_a3 | 0, 2.5);
  target += normal_lpdf(b_a4 | 0, 2.5);
  target += normal_lpdf(b_b1tt | 0, 0.5);
  target += normal_lpdf(b_b2tt | 0, 0.5);
  target += normal_lpdf(b_b3tt | 0, 0.5);
  target += normal_lpdf(b_b4tt | 0, 0.5);
  target += student_t_lpdf(sd | 3, 0, 2.5);
  target += std_normal_lpdf(z_1[1]);
  target += std_normal_lpdf(z_2[1]);
  target += std_normal_lpdf(z_3[1]);
  target += std_normal_lpdf(z_4[1]);
  target += std_normal_lpdf(z_5[1]);
  target += std_normal_lpdf(z_6[1]);
  target += std_normal_lpdf(z_7[1]);
  target += std_normal_lpdf(z_8[1]);
  target += std_normal_lpdf(z_9[1]);
  target += std_normal_lpdf(z_10[1]);
  target += std_normal_lpdf(z_11[1]);
  target += std_normal_lpdf(z_12[1]);
  target += std_normal_lpdf(z_13[1]);
  target += std_normal_lpdf(z_14[1]);
  target += std_normal_lpdf(z_15[1]);
  target += std_normal_lpdf(z_16[1]);

time_cost_calculated.CSV (7.3 MB)
mtc_equity.R (2.0 KB)

This doesn’t give a huge improvement because categorical_logit_glm_lupmf requires me to loop over the N rows and the GPU isn’t highly utilized.

The categorical_logit_glm_lupmf supports an array of integers as the outcome, was there another reason that you needed the loop?

I saw it supports an array output, but I was confused by the input specification. x and beta should be arrays of alternatives x variables, right? E.g., with 3 alternatives and 2 variables (_alt,var) [x_11beta_11+x_12beta_12, x_21beta_21+x_22beta_22, x_31beta_31+x_32beta_32].

The loop was also for the group references J_1[n] and J_2[n].

I assumed it would use the array position to connect with the y[n] value. E.g., the 3rd alternative (3rd element in the input vector) being chosen would mean y = 3?

That is not a good model for variational bayes, which has been shown to have serious problems with difficult posterior geometry. I would recommend switching to full bayes and using cmdstanr with brms to do within-chain parallelization with reduce_sum.

1 Like

Thanks. I’m planning to update the likelihood to avoid looping. I was only using VI to try getting a result out quick. The plan was to use full bayes with cmdstanr. I’ll go that route for testing, too.

1 Like

Ah I see my mistake, the vectorised implementation is for situations where the alpha and beta parameters are the same for all observations, but you have different values for each observation.

Quick question, you’ve declared the z_ as being of arrays of size M_1 (6) or M_2 (1):

  vector[N_1] z_1[M_1];  // standardized group-level effects
  vector[N_2] z_2[M_2];  // standardized group-level effects

But then you’re only using the first vector from each array:

  r_1 = (sd[1] * (z_1[1]));
  r_2 = (sd[2] * (z_2[1]));

This means that for each of the 8 z_ parameters of size M_1, there are 5 vectors of size N_1 going unused - so your model is currently estimating 105130 parameters which aren’t being used.

You can implement your non-centered param directly instead using the multiplier syntax:

parameters {
  row_vector<lower=0>[8] sd1;
  row_vector<lower=0>[8] sd2;
  matrix<multiplier = rep_matrix(sd1, N_1)>[N_1, 8] r_1;
  matrix<multiplier = rep_matrix(sd2, N_2)>[N_2, 8] r_2;
model {
  for (k in 1:8) {
    r_1[, k] ~ normal(0, sd1[k]);
    r_2[, k] ~ normal(0, sd2[k]);
1 Like

Thanks @andrjohns.

I made a few other modifications based on your recommendations and a few other things I thought could be improved. I had some Metropolis-Hastings rejection warnings. I changed the step size to 0.1 and also changed my starting seed. I still get warnings, but they stop after a few iterations.

For M_1 and M_2, I missed updating them after I simplified the model to remove contextual variables. The simplified model just has the hierarchical intercept terms. I ran the model with a 10% sample and now get a result in 46 hours. I assume the runtime scaling for a 100% sample will be <10x but still take over a week. Adjusting the priors, etc. based on these results could help runtime, correct?

It gives me:

Warning: 1168 of 4000 (29.0%) transitions ended with a divergence.
This may indicate insufficient exploration of the posterior distribution.
Possible remedies include: 
  * Increasing adapt_delta closer to 1 (default is 0.8) 
  * Reparameterizing the model (e.g. using a non-centered parameterization)
  * Using informative or weakly informative prior distributions 

2832 of 4000 (71.0%) transitions hit the maximum treedepth limit of 10 or 2^10-1 leapfrog steps.
Trajectories that are prematurely terminated due to this limit will result in slow exploration.
Increasing the max_treedepth limit can avoid this at the expense of more computation.
If increasing max_treedepth does not remove warnings, try to reparameterize the model.

This isn’t entirely unexpected since I’m working with a base model and hadn’t adjusted any priors or adapt_delta. I tried setting up the model to build the terms within a loop and apply categorical_logit_glm_lupmf with Y as an array, x as a+x\beta, \alpha as a vector of 0s, and \beta as a matrix of 1s. It seems to run, but I’m not sure it increased the GPU utilization beyond the 2-5% it was using with everything in a loop. I’m playing with it to see if I can also use threading to increase the amount of data sent to the GPU at once.

Interesting result. I use 9 cores per chain and run it on the GPU, down to 8 hours runtime! Of course, many people don’t have 36 cores and a V100 GPU available.


Can you please share your cmdstan code? I gave up on a categorical logit project due to long modelling time and I was unsuccesful to utilize GPU.

mtc_equity_7.stan (4.6 KB)

Attached is the code. It runs, but I should note there are many divergences for my particular model. I am running it on a Tesla V100 GPU with 9 CPU cores per chain. This uses ~40% of the GPU.


Awesome, thank you very much. I will be running the code on GTX5000, let me see how much difference it will make.

With that much data I think it would also be beneficial for you to directly use cmdstan with the num_chains argument. That will let one program run multiple chains and share your data between the chains. At the cmdstan level you can also just pass the total number of threads you want to use (I usually recommend about 70% of your available threads) and cmdstan will manage across chain and within chain parallelism for you.

Thanks! I’ll give that a try. I started another topic because I was having trouble with many divergences. I think this may help with that problem, too. My sense from looking at the iteration plots was that the threads weren’t talking with each other.