Overfitting in hierarchical multinomial logit model



I’ve been running a hierarchical multinomial logit model using the Stan code shown below. I’m finding that the model tends to be overfitted when comparing in-sample and holdout accuracies (for each respondent I hold out responses to some questions and fit the model on the remaining questions, the accuracy is based on predicting respondent choices). The in-sample accuracies are much larger than the holdout accuracies (e.g. 100% vs 60%).

Is there anything I can change to the model to reduce the amount of overfitting, i.e. the difference between in-sample and holdout accuracies while still maintaining or improving the holdout performance? I have tried tweaking the gamma and LKJ settings to reduce the amount of respondent parameter variation but this tends to decrease performance. Any helpful feedback would be welcome. Thanks!

data {
    int<lower=2> C; // Number of alternatives (choices) in each question
    int<lower=1> R; // Number of respondents
    int<lower=1> S[R]; // Number of questions per respondent
    int<lower=1> RS; // sum(S)
    int<lower=1> A; // Number of attributes
    int<lower=1> V; // Number of parameters
    int<lower=1> V_attribute[A]; // Number of parameters in each attribute
    int<lower=1,upper=C> Y[RS]; // choices
    matrix[C, V] X[RS]; // matrix of attributes for each obs
    vector[V] prior_mean; // Prior mean for theta
    vector[V] prior_sd; // Prior sd for theta
    real<lower=1> lkj_shape; // shape parameter for LKJ prior
    real<lower=0> gamma_alpha; // alpha parameter for gamma prior for sigma
    real<lower=0> gamma_beta; // beta parameter for gamma prior for sigma

parameters {
    vector<lower=0>[V] sigma;
    row_vector[V] theta;
    cholesky_factor_corr[V] L_omega;
    matrix[V, R] standard_normal;

transformed parameters {
    matrix[V, V] L_sigma;
    matrix[R, V] beta;

    L_sigma = diag_pre_multiply(sigma, L_omega);
    beta = rep_matrix(theta, R) + (L_sigma * standard_normal)';

model {
    int rs = 1;
    vector[V * R] vector_normal;

    sigma ~ gamma(gamma_alpha, gamma_beta);

    theta ~ normal(prior_mean, prior_sd);
    L_omega ~ lkj_corr_cholesky(lkj_shape);

    vector_normal = to_vector(standard_normal);
    vector_normal ~ normal(0, 1);

    for (r in 1:R)
        for (s in 1:S[r])
            Y[rs] ~ categorical_logit(X[rs] * to_vector(beta[r]));
            rs += 1;


I’ve done some previous work with multilevel, multinomial models in Stan.

It would be helpful to learn more about the research and data structure. It sounds like you have data on respondents who make multiple choices throughout the study. Can you tell us more?


You could add a random intercept for each observation.
Second use a noise nugget to your covariance matrix.
Your covariance matrix \Sigma would become \Sigma + \sigma I,
with I identity matrix. No diagonal update in the cholesky form
is known, thus sampling the full matrix and doing a cholesky decomposition
of the sum is needed.


Yes, the data is from a conjoint analysis where a number of survey respondents are asked to respond to a number of questions in which they choose between a number of alternatives, each with different attributes. The matrix beta contains the parameters (utilities) for each respondent * attribute, X contains the attribute levels for each question, and Y contains the choices made by the respondents.


You might benefit from this thread (though it doesn’t directly relate to your question about tuning parameters):


Thanks, I’ll have a read!


Thanks Andre, I’ll look into your suggestions!