Understanding priors and generated quantities in hierarchical multinomial

I’m trying to construct a multinomial model with two effects, district and age. I’ve followed @rtrangucci’s model at the bottom of this notebook but had a few questions.

My data looks like the following; I’ve also annotated it in the data block. Basically, I have a survey of respondents who each favor one of four vodkas (and a fifth zero column for identifiability). Each respondent is located in one of four districts and falls into one of four age groups. I want to understand the probability that an age group prefers a given vodka.

Model code is below:

data { 
  int<lower=2> K; // number of Vodkas
  int<lower=0> N; // total survey responses
  int<lower=0> y[N, K]; // data matrix of vodka responses
  int<lower=1> J_age; // J ages (1-4)
  int<lower=1> J_district; // J districts (1-4)
  int<lower=1> G; // G levels (2)
  int<lower=1, upper=J_age> idx_age[N]; // index column
  int<lower=1, upper=J_district> idx_district[N]; // index column
}
transformed data {
  row_vector[J_age] zeros_age = rep_row_vector(0, J_age);
  row_vector[J_district] zeros_district = rep_row_vector(0, J_district);
}
parameters {
  vector[K - 1] alpha_raw;
  matrix[K - 1, J_age] eta_age;
  matrix[K - 1, J_district] eta_district;
  vector<lower=0>[K - 1] sigma_age;
  vector<lower=0>[K - 1] sigma_district;
  vector<lower=0>[G] sigma_inter_eqn;
}
transformed parameters {
  matrix[K, J_age] alpha_age;
  matrix[K, J_district] alpha_district;
  vector[K] alpha;
  
  for (k in 1:(K - 1)) {
    alpha[k] = alpha_raw[k];
    alpha_age[k] = sigma_inter_eqn[1] * sigma_age[k] * eta_age[k];
    alpha_district[k] = sigma_inter_eqn[2] * sigma_district[k] * eta_district[k];
  }
  alpha[K] = 0;
  alpha_age[K] = zeros_age;
  alpha_district[K] = zeros_district;
}
model {
  to_vector(eta_age) ~ normal(0, 1);
  to_vector(eta_district) ~ normal(0, 1);
  sigma_inter_eqn ~ normal(0, 1);
  sigma_age ~ normal(0, 1);
  sigma_district ~ normal(0, 1);
  alpha_raw ~ normal(0, 1);
  
  for (n in 1:N) y[n] ~ multinomial(softmax(col(alpha_age,idx_age[n]) + col(alpha_district,idx_district[n])));
}
generated quantities {
  vector[K] theta[J_age];
  for (j in 1:J_age) theta[j] = softmax(col(alpha_age,j));
}

I have two questions.

  1. Understanding the effects pooling.

Why, when varying the intercept, do we multiply the etas by the district- and age-level hyperparameters? I suppose this is a somewhat basic question, but why don’t we add our etas, the way we would in a hierarchical linear model, when defining the treatment effect?

    alpha_age[k] = sigma_inter_eqn[1] * sigma_age[k] * eta_age[k];
    alpha_district[k] = sigma_inter_eqn[2] * sigma_district[k] * eta_district[k];

  2. Understanding how to interpret and construct generated quantities for the effects.

I would like to return the mean estimate of the probability that an age group prefers a particular vodka brand. I’ve struggled with how to build a generated quantities block to understand which age group prefers which vodka. The block below seems to work, in the sense that a larger value corresponds to a greater probability of choosing a vodka, but I’m not sure how to interpret the values.

Because we’ve modeled on the softmax scale, I’m not sure how to back-transform these values (if I can at all). I can wrap the line in a softmax, like so: for (j in 1:J_age) theta[j] = softmax(col(alpha_age,j)); to put the posteriors on a 0-to-1 scale, but, again, I’m not sure I gain any interpretability that way.

      for (n in 1:N) y[n] ~ multinomial(softmax(col(alpha_age,idx_age[n]) + col(alpha_district,idx_district[n])));
    }
    generated quantities {
      vector[K] theta[J_age];
      for (j in 1:J_age) theta[j] = col(alpha_age,j);
    }

This is probably because the model uses the “non-centered” parametrization (mentioned, for example, in the case study Diagnosing Biased Inference with Divergences): instead of drawing each effect directly from normal(0, sigma), you draw a standard-normal eta and multiply it by the scale(s). The implied prior on the effect is exactly the same, but the resulting geometry is usually much easier for the sampler in hierarchical models.
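
To make the equivalence concrete, here is a minimal sketch (not your vodka model: a single age effect with a normal outcome, so the parallel with a hierarchical linear model is easy to see) of the non-centered form, with the centered alternative noted in the comments:

    // Minimal sketch, not the vodka model: one hierarchical effect, normal outcome.
    // The centered form would declare alpha_age as a parameter and write
    //   alpha_age ~ normal(0, sigma_age);
    // The non-centered form below scales a standard-normal eta_age by sigma_age,
    // which implies exactly the same prior on alpha_age.
    data {
      int<lower=0> N;
      int<lower=1> J_age;
      int<lower=1, upper=J_age> idx_age[N];
      vector[N] y;
    }
    parameters {
      vector[J_age] eta_age;      // standard-normal raw effects
      real<lower=0> sigma_age;    // hierarchical scale
      real<lower=0> sigma_y;      // residual scale
    }
    transformed parameters {
      vector[J_age] alpha_age = sigma_age * eta_age;  // implies alpha_age ~ normal(0, sigma_age)
    }
    model {
      eta_age ~ normal(0, 1);
      sigma_age ~ normal(0, 1);
      sigma_y ~ normal(0, 1);
      y ~ normal(alpha_age[idx_age], sigma_y);
    }

In your model the same idea is just applied per outcome row k, with the extra shared scale sigma_inter_eqn[g] multiplied in as well.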

Regarding the generated quantities: I am not sure this is a sensible approach, because, due to the multinomial nature of the data, there are negative correlations between the preferences. Depending on whether you are more interested in the true/latent/unobserved probabilities (without sampling noise) or in future observations (with sampling noise), I would either store the vector of multinomial probabilities (here, softmax(col(alpha_age,idx_age[n]) + col(alpha_district,idx_district[n]))) or add a multinomial_rng on top of it.

Means, quantiles etc. of those values should be directly interpretable.
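
Something along these lines should work (just a sketch, reusing your parameter names and mirroring your likelihood; sum(y[n]) keeps each posterior predictive draw at the respondent’s observed total count):

    generated quantities {
      // Option 1: latent preference probabilities for each observation's
      // age/district combination (no sampling noise).
      vector[K] theta[N];
      // Option 2: posterior predictive counts, which add the multinomial
      // sampling noise back in.
      int y_rep[N, K];
      for (n in 1:N) {
        theta[n] = softmax(col(alpha_age, idx_age[n]) + col(alpha_district, idx_district[n]));
        y_rep[n] = multinomial_rng(theta[n], sum(y[n]));
      }
    }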

Hope that helps, albeit a bit late.
