Categorical Model is taking really long for estimation

Hi All (@Bob_Carpenter ),

I am very new to Stan. I am planning to estimate the following model. There are three types of datasets (Version_1, Version_2, Version_3). At the first level, the following linear regression models are estimated for the latent variable “att”:

  • Version_1 uses X_lk as predictors
  • Version_2 uses X_lk as predictors
  • Version_3 uses X_oe as predictors

I have re-parameterized “att” and estimated “y_att”.

Using the estimated values of “y_att”, a multinomial logit model is estimated for the choice “y”. The choice variable has 3 levels. I have vectorised all of the steps except the final likelihood loop:

for (n in 1:N)
    y[n] ~ categorical(softmax(alpha + beta_att * y_att[n]' + beta_sd * X_sd[n]')); // likelihood

All the steps prior to this are estimated very quickly. With the inclusion of the above choice model, the estimation takes days to complete.

My dataset contains 14,261 observations, and the numbers of predictors are D_lk = 100, D_oe = 34, and D_sd = 55. I am running 4 chains with 3,000 MCMC iterations each, of which 600 are warm-up.

Is there a way to reduce the estimation time? Furthermore, are there any suggestions on improving my code further?

Any suggestions and help will be greatly appreciated.

Thank you in advance.


data {

     int<lower=0> N;                         // Number of observations
     int<lower=0> D_lk;                      // Number of predictors for Version_1 of the dataset
     int<lower=0> D_oe;                      // Number of predictors for Version_3 of the dataset
     int<lower=0> D_sd;                      // Number of predictors for sd
     int<lower=0> C;                         // Number of classes
     matrix[N,D_lk] X_lk;                    // Predictor matrix for Version_1 of the dataset
     matrix[N,D_oe] X_oe;                    // Predictor matrix for Version_3 of the dataset
     matrix[N,D_sd] X_sd;                    // Predictor matrix for sd
     int<lower=1,upper=C> y[N];              // Categorical dependent variable
     int<lower=0> K;                         // Number of multivariate dimensions
     matrix[N, K] bern1;                     // Version_1 dummy
     matrix[N, K] bern2;                     // Version_2 dummy
     matrix[N, K] bern3;                     // Version_3 dummy
}
parameters {
     row_vector[K] alpha_lkv1;    // Intercept for Version_1
     matrix[D_lk,K] gamma_lkv1;   // Coefficients of predictors for Version_1
     row_vector[K] alpha_lkv2;    // Intercept for Version_2
     matrix[D_lk,K] gamma_lkv2;   // Coefficients of predictors for Version_2
     row_vector[K] alpha_oe;      // Intercept for Version_3
     matrix[D_oe,K] gamma_oe;     // Coefficients of predictors for Version_3
     vector[C] alpha;             // Intercept for the choice model
     matrix[C,K] beta_att;        // Coefficient for Ind_var_1
     matrix[C,D_sd] beta_sd;      // Coefficients of predictors for sd
     cov_matrix[K] Sigma;         // Covariance Matrix
     matrix[N, K] alpha_scale;    // Re-parameterisation scale factor
}
transformed parameters {
     matrix[N, K] att;                     // Regression equation for Ind_var_1
     matrix[K,K] L;                        // Cholesky factor Matrix
     matrix[N, K] y_att;                   // Predicted value for latent attitudes
     matrix[N,K] v1;                       // Value matrix for Version_1
     matrix[N,K] v2;                       // Value matrix for Version_2
     matrix[N,K] v3;                       // Value matrix for Version_3
     matrix[N,K] m1;                       // Matrix of intercepts for Version_1
     matrix[N,K] m2;                       // Matrix of intercepts for Version_2
     matrix[N,K] m3;                       // Matrix of intercepts for Version_3
     
     m1 = rep_matrix(alpha_lkv1, N);
     m2 = rep_matrix(alpha_lkv2, N);
     m3 = rep_matrix(alpha_oe, N);
     
     v1 = m1 + (X_lk * gamma_lkv1);
     v2 = m2 + (X_lk * gamma_lkv2);
     v3 = m3 + (X_oe * gamma_oe);
     
     att = (bern1 .* v1) + (bern2 .* v2) + (bern3 .* v3);
 
     L = cholesky_decompose(Sigma);       // Decomposing the Covariance matrix into Cholesky components
     
     y_att = att + (alpha_scale * L);
}
model {
     alpha_lkv1 ~ normal(0,1);                 // Prior for the constants for Version_1
     alpha_lkv2 ~ normal(0,1);                 // Prior for the constants for Version_2
     alpha_oe ~ normal(0,1);                   // Prior for the constants for Version_3
     
     to_vector(gamma_lkv1)~ normal(0,1);       // Priors for the coefficients for Version_1
     to_vector(gamma_lkv2)~ normal(0,1);       // Priors for the coefficients for Version_2
     to_vector(gamma_oe)~ normal(0,1);         // Priors for the coefficients for Version_3
     
     Sigma ~ inv_wishart((K+1), diag_matrix(rep_vector(0.1, K)));   // Priors for the Covariance Matrix
 
     to_vector(alpha_scale) ~ normal(0,1);     // Priors for the re-parameterisation 
     
     alpha ~ normal(0,1);                      // Prior for the constants for the choice model
     to_vector(beta_att) ~ normal(0,1);        // Prior for the Ind_var_1 for the choice model
     to_vector(beta_sd) ~ normal(0,1);         // Priors for the coefficients of sd
     
     for (n in 1:N)
         y[n] ~ categorical(softmax(alpha + beta_att * y_att[n]' + beta_sd * X_sd[n]')); // likelihood
}

MODERATOR EDIT: @martinmodrak reformatted code


My first guess is that your model is overparametrized: the categorical distribution with C categories is completely determined by C - 1 parameters (because the probabilities need to sum to 1). However, if I read the model correctly, you predict C parameters. Usually, when working on the logit scale (i.e. before applying softmax), one would fix one of the vector elements to 0. Such overparametrization both prevents you from interpreting the coefficients in a useful way and creates weird interdependencies between the parameters that are hard for the sampler to work with.
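One way to remove the redundant parameters (a sketch against your declarations, keeping category 1 as the reference with all of its logit-scale parameters fixed to 0) is to declare only C - 1 free rows/elements and prepend zeros:

```stan
parameters {
  // C - 1 free parameters; category 1 is the reference category
  vector[C - 1] alpha_raw;
  matrix[C - 1, K] beta_att_raw;
  matrix[C - 1, D_sd] beta_sd_raw;
}
transformed parameters {
  // Prepend zeros so category 1 always has a logit of 0
  vector[C] alpha = append_row(0, alpha_raw);
  matrix[C, K] beta_att = append_row(rep_row_vector(0, K), beta_att_raw);
  matrix[C, D_sd] beta_sd = append_row(rep_row_vector(0, D_sd), beta_sd_raw);
}
```

The likelihood then stays exactly as you wrote it, but the coefficients become interpretable as log-odds relative to category 1.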

I discussed a similar issue recently at Two questions: ①Rejecting initial value but still sampling. ②regarding divergent transitions but feel free to ask for clarifications here, if it is hard to understand.

A few additional minor suggestions:

  • You can use the cholesky_factor_cov type so that the sampler will work directly with the decomposition (this avoids having to decompose the matrix in the transformed parameters block and is usually more numerically stable). In many use cases it is recommended to separate the correlation matrix from a vector of scales; then you can use the cholesky_factor_corr type and the lkj_corr_cholesky prior.
  • categorical_logit(X) is a more efficient and more numerically stable shorthand for categorical(softmax(X))
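Putting both suggestions together, a sketch of the relevant changes (keeping your variable names, and assuming the rest of the model is unchanged) might look like:

```stan
parameters {
  cholesky_factor_corr[K] L_corr;  // Cholesky factor of the correlation matrix
  vector<lower=0>[K] tau;          // per-dimension scales
  matrix[N, K] alpha_scale;        // re-parameterisation scale factor
}
transformed parameters {
  // Equivalent to cholesky_decompose(Sigma) with
  // Sigma = diag_matrix(tau) * Corr * diag_matrix(tau)
  matrix[K, K] L = diag_pre_multiply(tau, L_corr);
  matrix[N, K] y_att = att + alpha_scale * L;
}
model {
  L_corr ~ lkj_corr_cholesky(2);
  tau ~ normal(0, 1);              // half-normal because of the lower bound
  to_vector(alpha_scale) ~ normal(0, 1);

  for (n in 1:N)
    y[n] ~ categorical_logit(alpha + beta_att * y_att[n]' + beta_sd * X_sd[n]');
}
```

On recent Stan versions you may also be able to replace the loop entirely with the categorical_logit_glm distribution (note it expects the transposed coefficient layout), which typically speeds things up further, e.g. `y ~ categorical_logit_glm(append_col(y_att, X_sd), alpha, append_col(beta_att, beta_sd)');`.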

Best of luck with the model!
