I have posted a few times here about this model and received great support to get it to this point. Unfortunately, I'm still having trouble with it. **R/data files included at the bottom.** I have estimated such models with Stan before (in fact, I have a publication on it), but the model here is slightly different and the dataset much larger.

I started from the following **brms model**:

```
brm(data = mod[1:1000, ],
    family = categorical(link = logit, refcat = NA),
    bf(trptrans ~ 1,
       nlf(mu1 ~ btc * (b1tt * travelTimeDrive + travelCostDrive)),
       nlf(mu2 ~ btc * (a2 + b2tt * travelTimeWalk)),
       nlf(mu3 ~ btc * (a3 + b3tt * travelTimeBike)),
       nlf(mu4 ~ btc * (a4 + b4tt * travelTimeTransit)),
       mvbind(a2 + a3 + a4 + btc + b1tt + b2tt + b3tt + b4tt) ~
         1 + (1 + race_2 + race_3 + race_4 + race_5 + hhfaminc | personid) + (1 | trippurp),
       nl = TRUE),
    prior = c(prior(normal(0, 2.5), class = b, nlpar = a2),
              prior(normal(0, 2.5), class = b, nlpar = a3),
              prior(normal(0, 2.5), class = b, nlpar = a4),
              prior(normal(0, 2.5), class = b, nlpar = b1tt),
              prior(normal(0, 2.5), class = b, nlpar = btc),
              prior(normal(0, 2.5), class = b, nlpar = b2tt),
              prior(normal(0, 2.5), class = b, nlpar = b3tt),
              prior(normal(0, 2.5), class = b, nlpar = b4tt)),
    algorithm = "meanfield")
```
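In equation form, the `nlf` terms above specify the following systematic utilities (with `refcat = NA`, all four modes get a linear predictor and the choice probabilities are a softmax over them):

$$
\begin{aligned}
\mu_1 &= \beta_{tc}\left(\beta_{1tt}\, t_{\mathrm{drive}} + c_{\mathrm{drive}}\right) \\
\mu_k &= \beta_{tc}\left(\alpha_k + \beta_{ktt}\, t_k\right), \quad k \in \{2, 3, 4\} \;(\text{walk, bike, transit}) \\
\Pr(y = j) &= \frac{e^{\mu_j}}{\sum_{k=1}^{4} e^{\mu_k}}
\end{aligned}
$$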

My dataset has about **178,000 rows**, which I subset to the first 1,000 (`mod[1:1000,]`) in the code above for testing. I adapted the Stan code generated by brms to be more vectorized, and changed the specification of `btc`, `b2tt`, `b3tt`, and `b4tt` so that they are exponentiated (i.e., lognormally distributed). It was recommended that I run the model on a GPU because of the large dataset, but that doesn't give a huge improvement: `categorical_logit_glm_lupmf` requires me to loop over the N rows, so the GPU isn't highly utilized.
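Concretely, the exponentiation amounts to the following reparameterization (a minimal sketch; `b_raw` and `b` are illustrative names, not the ones in my code):

```
// Minimal sketch of the exponentiated-coefficient idea:
// a normal prior on the log scale implies a lognormal coefficient.
parameters {
  real b_raw;                    // coefficient on the log scale
}
transformed parameters {
  real<lower=0> b = exp(b_raw);  // lognormal(0, 2.5) a priori
}
model {
  b_raw ~ normal(0, 2.5);
  // Implied prior on b: median exp(0) = 1, but the 97.5th percentile
  // is exp(1.96 * 2.5) ~ 134, so extreme draws can blow up the linear
  // predictor. Shrinking the sd to 0.5 (tried below) brings that
  // percentile down to exp(1.96 * 0.5) ~ 2.7.
}
```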

As a next step, I simplified the model to remove the correlations and the contextual variables (race_2, race_3, race_4, race_5, and hhfaminc), and ran it with VI on the full dataset. On the CPU, gradient evaluation took 1.08 seconds; on the GPU, 0.92 seconds. I get very **large ELBO values** (≈ -2.8 × 10^69; specifically, -2783111885420583704669702329161101097016063682459208926108713371041792.000) and a divergence message. When I run the brms code above with VI, it **runs fine** (gradient evaluation takes 0.0088 seconds for 1,000 records, or 1.52 seconds with the full dataset) and stochastic gradient ascent proceeds with reasonable ELBO values.

I thought part of the problem could be the priors combined with the `exp()` blowing up the initial values (see the quantiles in the sketch above). I reduced the prior standard deviation from **2.5 to 0.5**, but it doesn't have an appreciable effect on the ELBO.

If I can get this model to give reasonable results (in a reasonable runtime), then I can add back the contextual variables and some of the correlations. **I have two questions/focuses**:

- Any suggestions on how to **address the divergence/large ELBO**?
- The **GPU is underutilized**, so it is not being used optimally (I'm on a Tesla K20 but have access to a V100). I read it might be possible to **combine** `reduce_sum` with the GPU (Any intuitions/results on when GPU > reduce_sum? - #5 by saudiwin). I thought this might increase GPU utilization because it would pass multiple rows at once. Is this a possibility? I have versions of the code written using `reduce_sum` and using the GPU; a sketch of the combined structure follows the Stan program below.

```
data {
int<lower=1> N; // total number of observations
int<lower=2> ncat; // number of categories
int<lower=2> nvar; // number of variables for each alternative
int<lower=2> ntotvar; // number of total variables
array[N] int Y; // response variable
// covariate matrix for alt1
array[N] row_vector[nvar] x1; // matrix[N,nvar] x1;
// covariate matrix for alt2
array[N] row_vector[nvar] x2; // matrix[N,nvar] x2;
// covariate matrix for alt3
array[N] row_vector[nvar] x3; // matrix[N,nvar] x3;
// covariate matrix for alt4
array[N] row_vector[nvar] x4; // matrix[N,nvar] x4;
// data for group-level effects of ID 1
int<lower=1> N_1; // number of grouping levels
int<lower=1> M_1; // number of coefficients per level
array[N] int<lower=1> J_1; // grouping indicator per observation
// group-level predictor values
// matrix[N, M_1] Z_1;
vector[N] Z_1;
int<lower=1> NC_1; // number of group-level correlations
// data for group-level effects of ID 2
int<lower=1> N_2; // number of grouping levels
int<lower=1> M_2; // number of coefficients per level
array[N] int<lower=1> J_2; // grouping indicator per observation
int prior_only; // should the likelihood be ignored?
}
transformed data {
int<lower=1> M_3 = (M_1 + M_2) * (ncat * 2); // total number of group-level standard deviations
}
parameters {
real b_btc; // population-level effects
real b_a2; // population-level effects
real b_a3; // population-level effects
real b_a4; // population-level effects
real b_b1tt; // population-level effects
real b_b2tt; // population-level effects
real b_b3tt; // population-level effects
real b_b4tt; // population-level effects
vector<lower=0>[M_3] sd; // group-level standard deviations
array[M_1] vector[N_1] z_1; // standardized group-level effects
array[M_2] vector[N_2] z_2; // standardized group-level effects
array[M_1] vector[N_1] z_3; // standardized group-level effects
array[M_2] vector[N_2] z_4; // standardized group-level effects
array[M_1] vector[N_1] z_5; // standardized group-level effects
array[M_2] vector[N_2] z_6; // standardized group-level effects
array[M_1] vector[N_1] z_7; // standardized group-level effects
array[M_2] vector[N_2] z_8; // standardized group-level effects
array[M_1] vector[N_1] z_9; // standardized group-level effects
array[M_2] vector[N_2] z_10; // standardized group-level effects
array[M_1] vector[N_1] z_11; // standardized group-level effects
array[M_2] vector[N_2] z_12; // standardized group-level effects
array[M_1] vector[N_1] z_13; // standardized group-level effects
array[M_2] vector[N_2] z_14; // standardized group-level effects
array[M_1] vector[N_1] z_15; // standardized group-level effects
array[M_2] vector[N_2] z_16; // standardized group-level effects
}
transformed parameters {
vector[N_1] r_1; // actual group-level effects
vector[N_2] r_2; // actual group-level effects
vector[N_1] r_3; // actual group-level effects
vector[N_2] r_4; // actual group-level effects
vector[N_1] r_5; // actual group-level effects
vector[N_2] r_6; // actual group-level effects
vector[N_1] r_7; // actual group-level effects
vector[N_2] r_8; // actual group-level effects
vector[N_1] r_9; // actual group-level effects
vector[N_2] r_10; // actual group-level effects
vector[N_1] r_11; // actual group-level effects
vector[N_2] r_12; // actual group-level effects
vector[N_1] r_13; // actual group-level effects
vector[N_2] r_14; // actual group-level effects
vector[N_1] r_15; // actual group-level effects
vector[N_2] r_16; // actual group-level effects
// compute actual group-level effects; odd indices vary by person (J_1),
// even indices by trip purpose (J_2)
r_1 = sd[1] * z_1[1]; // btc
r_2 = sd[2] * z_2[1]; // btc
r_3 = sd[3] * z_3[1]; // a2
r_4 = sd[4] * z_4[1]; // a2
r_5 = sd[5] * z_5[1]; // a3
r_6 = sd[6] * z_6[1]; // a3
r_7 = sd[7] * z_7[1]; // a4
r_8 = sd[8] * z_8[1]; // a4
r_9 = sd[9] * z_9[1]; // b1tt
r_10 = sd[10] * z_10[1]; // b1tt
r_11 = sd[11] * z_11[1]; // b2tt
r_12 = sd[12] * z_12[1]; // b2tt
r_13 = sd[13] * z_13[1]; // b3tt
r_14 = sd[14] * z_14[1]; // b3tt
r_15 = sd[15] * z_15[1]; // b4tt
r_16 = sd[16] * z_16[1]; // b4tt
}
model {
// likelihood including constants
if (!prior_only) {
// Define matrices/vector for x, alpha, beta
matrix[ncat,nvar] x;
matrix[nvar,ncat] beta;
vector[ncat] alpha;
array[N] row_vector[ncat] a = rep_array([0,b_a2,b_a3,b_a4], N);
// initialize linear predictor term
// Terms are btc, b1tt, b2tt, b3tt, b4tt
array[N] row_vector[ntotvar] b = rep_array([b_btc,b_b1tt,b_b2tt,b_b3tt,b_b4tt], N); // matrix[N,ntotvar] b = rep_matrix(b_btc,N,ntotvar);
for (n in 1:N) {
// add to linear predictor term
a[n] += [0,r_3[J_1[n]] + r_4[J_2[n]], r_5[J_1[n]] + r_6[J_2[n]], r_7[J_1[n]] + r_8[J_2[n]]];
// add to linear predictor term
// Terms are btc, b1tt, b2tt, b3tt, b4tt
b[n] += [r_1[J_1[n]] + r_2[J_2[n]], r_9[J_1[n]] + r_10[J_2[n]], r_11[J_1[n]] + r_12[J_2[n]],r_13[J_1[n]] + r_14[J_2[n]],r_15[J_1[n]] + r_16[J_2[n]]];
b[n] = exp(b[n]);
// x is an ncat x nvar matrix stacking the alternatives' covariates
// (travel time and cost) for observation n
x = to_matrix([x1[n], x2[n], x3[n], x4[n]]);
// beta (nvar x ncat): row 1 holds btc times the alternative-specific
// time coefficients, row 2 the common cost coefficient btc
beta = [b[n,1] * b[n,2:5], rep_row_vector(b[n,1], ncat)];
// alpha: intercepts, with the non-reference intercepts scaled by btc
alpha = [a[n,1], b[n,1] * a[n,2], b[n,1] * a[n,3], b[n,1] * a[n,4]]';
target += categorical_logit_glm_lupmf(Y[n] | x, alpha, beta);
}
}
// priors
target += normal_lpdf(b_btc | 0, 0.5);
target += normal_lpdf(b_a2 | 0, 2.5);
target += normal_lpdf(b_a3 | 0, 2.5);
target += normal_lpdf(b_a4 | 0, 2.5);
target += normal_lpdf(b_b1tt | 0, 0.5);
target += normal_lpdf(b_b2tt | 0, 0.5);
target += normal_lpdf(b_b3tt | 0, 0.5);
target += normal_lpdf(b_b4tt | 0, 0.5);
target += student_t_lpdf(sd | 3, 0, 2.5);
target += std_normal_lpdf(z_1[1]);
target += std_normal_lpdf(z_2[1]);
target += std_normal_lpdf(z_3[1]);
target += std_normal_lpdf(z_4[1]);
target += std_normal_lpdf(z_5[1]);
target += std_normal_lpdf(z_6[1]);
target += std_normal_lpdf(z_7[1]);
target += std_normal_lpdf(z_8[1]);
target += std_normal_lpdf(z_9[1]);
target += std_normal_lpdf(z_10[1]);
target += std_normal_lpdf(z_11[1]);
target += std_normal_lpdf(z_12[1]);
target += std_normal_lpdf(z_13[1]);
target += std_normal_lpdf(z_14[1]);
target += std_normal_lpdf(z_15[1]);
target += std_normal_lpdf(z_16[1]);
}
}
```
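On the `reduce_sum` question, here is a minimal sketch of the structure I have in mind, with the hierarchical construction of `alpha`/`beta` collapsed into plain parameters for brevity (in the full model, the `r_*` effects would be passed through as extra arguments and `alpha`/`beta` rebuilt per row inside the partial-sum function; whether the OpenCL-backed `categorical_logit_glm_lupmf` actually engages the GPU from inside `reduce_sum` threads is exactly what I'm unsure about):

```
functions {
  // Partial log-likelihood over a slice of observations. Rows within a
  // slice are still evaluated one at a time, but slices run in parallel
  // across threads.
  real partial_cat_lpmf(array[] int y_slice, int start, int end,
                        array[] row_vector x1, array[] row_vector x2,
                        array[] row_vector x3, array[] row_vector x4,
                        vector alpha, matrix beta) {
    real lp = 0;
    for (i in 1:size(y_slice)) {
      int n = start + i - 1;
      lp += categorical_logit_glm_lupmf(y_slice[i] |
              to_matrix([x1[n], x2[n], x3[n], x4[n]]), alpha, beta);
    }
    return lp;
  }
}
data {
  int<lower=1> N;
  int<lower=2> ncat;
  int<lower=1> nvar;
  array[N] int Y;
  array[N] row_vector[nvar] x1;
  array[N] row_vector[nvar] x2;
  array[N] row_vector[nvar] x3;
  array[N] row_vector[nvar] x4;
}
parameters {
  vector[ncat] alpha;
  matrix[nvar, ncat] beta;
}
model {
  alpha ~ normal(0, 1);
  to_vector(beta) ~ normal(0, 1);
  // grainsize = 1 lets the scheduler pick the slice sizes
  target += reduce_sum(partial_cat_lpmf, Y, 1, x1, x2, x3, x4, alpha, beta);
}
```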

Attached files:

- time_cost_calculated.CSV (7.3 MB)
- mtc_equity.R (2.0 KB)