Hierarchical Ordinal Probit model: priors for fixed thresholds and non-centered t-distribution

Hi all. I am trying to estimate a series of ordinal probit models with increasing complexity. Further down, I am attaching the script for the most complex one. The model is based on Kruschke and Liddell’s 2018 paper, where the first and last thresholds are fixed to identify the model while estimating the intercept and the standard deviation. The modeling questions:

  • Does it make sense to put the Dirichlet prior on the raw thresholds (in code: tau_prop)? Or is it recommended to put a normal prior on the transformed thresholds (tau)? Extra: Is there some useful interpretation of the tau_prop parameter, or do we just mention it as a tool for fixing the lower and upper thresholds?
  • In the model without eta, the model does fine (no divergencies, trace plots and other diagnostics looks satisfying). Including eta, I get a few divergences with adapt_delta = 0.95. Would it be more appropriate to use a non-centered parametrisation for eta here? If so, do we follow the same procedure as for normal non-centered distributions? I am using the t distribution to prevent too much shrinkage, given that there are only six agencies.

Some background: The data is original survey data at the individual level, with unit of analysis being individual-agency. We have asked our respondents to report their general confidence in six Swedish public agencies and also asked for their evaluation of each agency in terms of their competence, motivation, and opportunity to perform their task. All items have seven response categories.

The theoretical quantity that I am interested in is the gamma part, basically trying to predict general confidence using our theoretical components of trust. I am including a model at the individual level to strengthen exchangeability among the observations: covariates include some demographical and political variables. Eta is motivated theoretically to adjust for differences in scale-usage between the agencies (inspired by Jackman’s chapter on ordinal models) but I realise that the constants will soak up any agency-unique variance as well.

The way I have parametrised the thresholds are in line with the solution given by @nhuurre in this post.

I would be extremely thankful for any comments, both relating to the questions or other reflections. All the best!


data {
  int<lower = 1> N_n; // number of rows
  int<lower = 1, upper = 7> N_k; // number of categories on all items, 7
  int<lower = 3> N_d; // number of theoretical trust dimensions, 3
  int<lower = 1> N_r; // number of unique respondents, 1810
  int<lower = 1, upper = 6> N_a; // number of agencies asked for, 6
  int<lower = 1, upper = N_r> resp[N_n]; // respondent at row n
  int<lower = 1, upper = N_a> agency[N_n]; // agency at row n
  real<lower = 0.0, upper = 1.5> tau_lower; // == 1 + 0.5 = 1.5
  real<lower = 0.0, upper = 6.5> tau_upper; // == N_k - 0.5 = 6.5
  int<lower = 2> nu_eta; // degree of freedom for eta, 4
  matrix[N_n, N_d] l1_xmat; // Individual-agency level predictor matrix, N_n by N_d
  int<lower = 1> R; // number of predictors in l2_xmat
  matrix[N_r, R] l2_xmat; // Individual-level predictor matrix, N_r by R
  int<lower = 1, upper = 7> general[N_n]; // outcome vector with general trust statements
parameters {
  // Thresholds
  simplex[N_k-2] tau_prop; // Define tau_prop, used to transform into thresholds
  // Individual-level section
  vector[N_r] raw_alpha; // raw_alpha for non-centered parametrization of varying intercepts
  real<lower = 0> sigma_alpha; // standard deviation for distribution of varying intercepts
  real mu_alpha; // grand mean for varying intercept distribution
  real<lower = 0.0> sigma_mu_alpha; // standard deviation for grand mean for varying intercept distribution
  vector[R] beta; // vector of regression coefficients for individual level model (level 2)
  vector<lower = 0.0>[R] sigma_beta; // vector of standard deviations for beta
  // Agency-individual section
  vector<lower = 0.0>[N_d] sigma_gamma_fixed; // vector of standard deviations for population level regression parameters
  vector[N_d] gamma_fixed; // length 3 vector of population level regrssion parameters
  matrix[N_d, N_a] z_gamma; // matrix of 3 by 6, unit-normal
  cholesky_factor_corr[N_d] L_dim; // Cholesky factor for group-level effects
  vector<lower = 0.0>[N_d] sigma_dim; // standard deviaton for each dimension of trust
  // Eta = agency-specific constants
  vector[N_a] eta_raw; // vector of agency-specific constants
  real<lower = 0.0> sigma_eta; // standard deviation of eta distribution
  // Outcome section
  real<lower = 0.0> sigma_y; // standard deviation of outcomes
transformed parameters {
 ordered[N_k-1] tau; // transformed ordered thresholds
 vector[N_r] alpha; // transformed varying intercepts (1 | resp)
 vector[N_r] alpha_hat; // predictor for alpha = mu_alpha + model
 simplex[N_k] theta[N_n]; // category-response probabilities, N_k repeated N_n times
 matrix[N_d, N_a] gamma; // 3 by 6 matrix for random effects ( {competence, motivation, opportunity} | agency)
 vector[N_a] eta; // transformed agency-specific constants
 // Thresholds
 tau[1] = tau_lower; // first value is set to fixed lower from data, 1.5
 for(c in 2:(N_k-1)) tau[c] = tau[c-1] + (tau_upper - tau_lower) * tau_prop[c-1] ; // add (upper-lower) between tau[1] and tau[6] by multiplying "on" the simplex bit-by-bit
 // Model for level 2
 for(i in 1:N_r) alpha_hat[i] = mu_alpha + l2_xmat[i,1] * beta[1] + l2_xmat[i,2] * beta[2] + l2_xmat[i,3] * beta[3] + l2_xmat[i,4] * beta[4]; // model for alpha hat
 // Varying intercepts: centering alpha
 alpha = alpha_hat + raw_alpha * sigma_alpha;
 // Transform random effects
 gamma = diag_pre_multiply(sigma_dim, L_dim) * z_gamma; // Multiple the diag(Sigma_dim) by the Cholesky factor then multiply with z_u to get random effects
 // Eta
 eta = eta_raw - mean(eta_raw); // sum eta to 0 recursively
 // Model for level 1
 for(n in 1:N_n) { // N_n loop starts here...
   real mu[N_n]; // local variable mu for the conditional mean of the linear predictor
   mu[n] = alpha[resp[n]] + eta[agency[n]] + 
   (gamma_fixed[1] + gamma[1, agency[n]]) * l1_xmat[n,1] +
   (gamma_fixed[2] + gamma[2, agency[n]]) * l1_xmat[n,2] +
   (gamma_fixed[3] + gamma[3, agency[n]]) * l1_xmat[n,3];
   //Calculate response probabilitities
   theta[n,1] = Phi_approx( ( tau[1] - mu[n] ) / sigma_y );
   theta[n,N_k] = 1 - Phi_approx( ( tau[N_k - 1] - mu[n] ) / sigma_y );
   for (k in 2:(N_k-1)) {theta[n,k] = Phi_approx( ( tau[k] - mu[n] ) / sigma_y ) - Phi_approx( ( tau[k-1] - mu[n] ) / sigma_y );}
  } // ... N_n loop ends here
model {
  // Level 2 parametrs
  sigma_mu_alpha ~ exponential(1.0); // exponential for sigma_mu_alpha
  sigma_alpha ~ normal(0.0, 5.0); // half-normal for standard deviation
  raw_alpha ~ normal(0, 1); // standard normal for non-centered
  mu_alpha ~ normal( (N_k+1.0)/2.0 , sigma_mu_alpha); // location of mean alpha, prior location is mid-point of the response scale
  sigma_beta ~ exponential(1.0); // prior for sigma_beta scale
  beta ~ normal(0, sigma_beta); // prior for beta, individual level coefficients
  // Level 1 parametrs
  sigma_gamma_fixed ~ exponential(1.0);
  sigma_y ~ exponential(1.0);
  gamma_fixed ~ normal(0.0, sigma_gamma_fixed); // for each gamma, location is 0 with adaptive prior for sigmas
  L_dim ~ lkj_corr_cholesky(1.0); // LKJ(1) on the Cholesky factor to reflect uniform expectations on the correlation matrix
  sigma_dim ~ normal(0.0, 5.0); // Half-normal for the standard deviations for each dimension
  to_vector(z_gamma) ~ normal(0.0, 1.0); // Standard-normal prior on the random deviations
  sigma_eta ~ normal(0.0, 2.0); // half-normal for standard deviation
  eta_raw ~ student_t(nu_eta, 0.0, sigma_eta); // student t distribution for the agency specific constants, df from data, located at 0.0
  // thresholds
  tau_prop ~ dirichlet( rep_vector(1.0, (N_k-2))); // Uniform Dirichlet prior on the threshold proportions
  // Increament log probabibility of general as categorical
  for(n in 1:N_n) general[n] ~ categorical( theta[n, 1:N_k] ); // General is the outcome vector
generated quantities {
  vector[N_n] log_lik;
  matrix[N_d, N_d] Omega;
  vector[N_n] y_rep;
  // Get log likelihood of data point n
  for(n in 1:N_n) log_lik[n] = categorical_lpmf( general[n] | theta[n, 1:N_k] );
  // y_rep, given parameters and observed data
  for(n in 1:N_n) y_rep[n] = categorical_rng(theta[n, 1:N_k]);
  // Calculate correlation matrix of ( {comp, moti, oppo} | agency )
  Omega = multiply_lower_tri_self_transpose(L_dim); // retrieve the correlation matrix

Yes, putting a dirchlet prior on tau_prop looks mostly sensible, but is probably not directly interpretable. If you really want an interpretable prior, then I think you could put a dirichlet prior on the implied distribution of responses when mu is at some reference value (probably zero). That would require some transformations and Jacobian adjustments similar to the prior choice discussed at Mike Betancourt’s Ordinal Regression case study (but note that in the case study, none of the thresholds are fixed - which introduces problems if some of the extreme categories are rare in the data, but that shouldn’t be a concern for your model).

This usually requires a bit more detailed investigation. The discussion in Mike’s case study on hierarchical models describes when one would expect centered vs. non-centered parametrization to work better.

Additionally, I have recently encountered a couple situations where there was a strong negative correlation between random intercepts and the overall intercept (thresholds in your case), which then posed sampling problems. (i.e. the model allows all the random intercepts to go up while the main intercept goes down). You can investigate pairs plots for the relevant parameters to see if this is the case. I don’t think I understand this type of problems well at this moment, but enforcing a sum-to-zero constraint on the random intercepts tended to help quite a bit (but somehwat changes the interpretation of the coefficients!).

Probably more than you need to know about implementing sum-to-zero constraint: Test: Soft vs Hard sum-to-zero constrain + choosing the right prior for soft constrain

TLDR: either use sum(x) ~ normal(0, <small_number> * <num_lements>) or the code in this answer in the aformentioned thread + the one below (which is in my experience slightly superior). For more details read the whole topic.

Hope that helps at least a bit.

1 Like

Dear Martin, thank you for your reply. And sorry for being slow in getting back.

Regarding the first point: This makes sense. I think I will stick with prior on tau_prop for the time being. After doing some prior predictive checks and simulated the implied thresholds, I have proceeded with a slightly more informative argument to the Dirichlet prior.

Regarding the second point: This was a great case study that helped me understand the consequences of different parametrization strategies, thank you for directing me to it. A bit complicated for me to follow at some points, but I will have to study it harder.

At the next round I will take a deeper look into the issue with random intercepts. Just to be clear: the overall intercept is mu_alpha in my model. I fix the extreme thresholds in order to be able to estimate both the residual variance sigma_y and along with the alpha terms, gamma and beta. If I proceed with a sum-to-zero constraint, I think I will have to apply it to both the thresholds and all alpha terms, so they are still comparable. Recently I discovered that there might be some issues with one or more beta's (coefficients on the varying intercepts). It seems as they are causing some of the divergencies. But I will have to look more into that.

Again, thanks for the helpful comments!

1 Like