Non-Centered, Cholesky Factored Covariance Matrix Converges to Inconsistent Parameter Values; Issue with Model Specification?

Hey all!

Extending work mentioned here, I’ve been developing a hierarchical SEM model where:

  1. Latent variable z[i] is fit to predict a common (Likert scale) survey item y_z for individual i via ordinal regression
  2. Latent variable t[m,i] is defined for individual i and topic m, where latent variable z[i]
    is a predictor of t[m,i] with varying slope delta[m] (this slope parameter delta varies by topic)
  3. Cutpoints c_t[n] for responses y_n to survey item n (Likert scale) are fit via ordinal logistic regression on t[m,i], where m is the topic corresponding to item n
  4. Latent variables z[i], t[1,i], …, t[M,i] are modeled with a covarying error term for individual i
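
For concreteness, here is a small numpy sketch of the generative structure I have in mind (a simulation sketch only; all dimensions, parameter values, and the shared cutpoints are illustrative, not my real data):

```python
import numpy as np

rng = np.random.default_rng(1)

I, M, Q, R = 500, 3, 5, 5               # individuals, topics, response classes
c_z = np.array([-2.0, -0.5, 0.5, 2.0])  # cutpoints for the common item (Q - 1 of them)

def ordered_logistic_draw(eta, cutpoints, rng):
    """Draw ordinal responses via the latent-utility form of ordered logistic:
    y = 1 + #{cutpoints below eta + logistic noise}."""
    u = rng.logistic(loc=eta)
    return 1 + np.sum(u[:, None] > cutpoints[None, :], axis=1)

# 1. latent z[i] (here just standard normal; the real model builds it from
#    varying intercepts plus the correlated error term)
z = rng.normal(size=I)

# 2. each topic's latent t[m,i] = intercept-free sketch: delta[m] * z[i] + noise
delta = rng.normal(size=M)
t = delta[:, None] * z[None, :] + rng.normal(scale=0.5, size=(M, I))

# 3. ordinal responses to the common item and to one item per topic
#    (for illustration the topic items reuse c_z; the real model has c_t[n])
y_z = ordered_logistic_draw(z, c_z, rng)
y_t = np.stack([ordered_logistic_draw(t[m], c_z, rng) for m in range(M)])
```

(In the real model the latent errors covary across z and the t[m]; that term is omitted here.)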

I’m experiencing a fitting issue where, when I rerun my model under different randomization defaults,

  1. All my parameters “converge” in the sense of returning R_hat = 1.0, but
  2. I recover consistent values for all my parameters except those related to the covarying individual error term (parameters tau, L_Omega, and tau_L_Omega, and downstream generated quantities Omega and Sigma)

I’ve tried changing the prior for my scale parameter tau from the recommended cauchy(0,2.5) to something less diffuse (cauchy(0,1), student_t(8,0,1), etc.); when I do so, I find that my covariance terms vary less from one randomization default to the next, but remain quite noisy. I’m not sure whether:

  1. I may have mis-specified my coding of the non-centered, Cholesky-factored covarying error term
  2. My error term might be non-identifiable in this model as specified (I believe the term is identifiable since the associated parameters consistently return R_hat values of 1.0, but I recognize that I may be mis-interpreting this QC output)
  3. I’m suffering from some third source of error that I haven’t considered

Appreciate any guidance that folks might have to impart here. Thanks in advance!

data {

  // Metadata
  int<lower=0> I; // Number of Interviews
  int<lower=0> J; // Number of Item-Response Pairs for Issue Topic Items
  int<lower=0> M; // Number of Topics
  int<lower=0> N; // Number of Survey Items
  int<lower=0> Q; // Number of Common Item Response Classes
  int<lower=0> R; // Number of Issue Topic Item Response Classes
  
  // Discrete Predictor Level Declarations
  int<lower=0> N_a;
  int<lower=0> N_b;
  
  // PII-Survey Data

  /// Individual Demography
  array[I] int<lower=1,upper=N_a> a;
  array[I] int<lower=1,upper=N_b> b;

  /// Common Item Response Labels
  array[I] int<lower=1,upper=Q> y_z; 
  
  // Issue Topic-Item Data
  
  /// Metadata
  array[N] int<lower=1,upper=M> m_n;   // Item-Level Topic Index

  /// Responses
  array[J] int<lower=1,upper=I> i_j;   // Item-Response Respondent ID
  array[J] int<lower=1,upper=N> n_j;   // Item-Response Item ID
  array[J] int<lower=1,upper=R> y_n_j; // Item-Response Response Labels
  
}
parameters {

  // Target Cutpoints
  ordered[Q - 1] c_z;          // Cutpoints for common item
  array[N] ordered[R - 1] c_t; // Cutpoints for topic items
  
  // Latent Hierarchical Model for Common Item

  /// Hyperprior Uncertainties for Varying Intercepts
  real<lower=0> sigma_alpha_a;
  real<lower=0> sigma_alpha_b;
  
  /// Modeled Effects
  vector[N_a] alpha_a_raw;
  vector[N_b] alpha_b_raw;  
  
  // Latent Hierarchical Model for Topic Items

  /// Hyperprior Uncertainties for Varying Intercepts, Slopes
  vector<lower=0>[M] sigma_beta_a;
  vector<lower=0>[M] sigma_delta;
  
  /// Modeled Effects
  array[M] vector[N_a] beta_a_raw;
  vector[M] delta_raw; // varying slope for effect of latent variable representing *common item* in predicting *topic items*

  // Correlated Error Parameters

  /// Hyperpriors
  vector<lower=0>[M + 1] tau;          // prior scale
  cholesky_factor_corr[M + 1] L_Omega; // Cholesky factor of the prior correlation matrix

  /// Non-Centered (Raw) Cholesky Factored Covariance Term
  vector[M + 1] tau_L_Omega_raw;
  
}
transformed parameters{

  // Latent Hierarchical Model for Common Item
  
  /// Vector codings for multilevel intercepts
  vector[N_a] alpha_a = alpha_a_raw * sigma_alpha_a;
  vector[N_b] alpha_b = alpha_b_raw * sigma_alpha_b;

  // Latent Hierarchical Model for Topic Items
  
  /// Varying Intercepts, Slopes
  array[M] vector[N_a] beta_a;
  for (m in 1:M) {
    beta_a[m] = beta_a_raw[m] * sigma_beta_a[m];
  }
  vector[M] delta = delta_raw .* sigma_delta;
  
  // Cholesky Factored Covariance Term
  vector[M + 1] tau_L_Omega = diag_pre_multiply(tau, L_Omega) * tau_L_Omega_raw;
  
}
model {
  
  // Latent Hierarchical Model Parameters for Common Item
  
  /// Prior Uncertainties
  sigma_alpha_a ~ std_normal();
  sigma_alpha_b ~ std_normal();

  /// Varying Intercepts
  alpha_a_raw ~ std_normal();
  alpha_b_raw ~ std_normal();
  
  // Latent Hierarchical Model Parameters for Topic Items
  
  /// Hyperprior Uncertainties for Varying Intercepts, Slopes
  sigma_beta_a ~ std_normal();
  sigma_delta ~ std_normal();
  
  /// Varying Intercepts, Slopes
  for (m in 1:M){
      beta_a_raw[m] ~ std_normal();
   }
  delta_raw ~ std_normal();

  // Latent Topic Correlation
  tau ~ cauchy(0,2);
  L_Omega ~ lkj_corr_cholesky(2);
  tau_L_Omega_raw ~ std_normal();
  
  // Declare Local Scope for Latent Variable Vectorization
  {

    // Declare Latent Core Item Variable z, Topic Items variables t_m
    vector[I] z = alpha_a[a] + alpha_b[b] + tau_L_Omega[1];
    array[M] vector[I] t_m_i;
    
    // Vectorize Latent Variables for Core Item, Topics
    for (m in 1:M){
        t_m_i[m] = beta_a[m, a] + delta[m] * z + tau_L_Omega[m + 1];
    }
    
    // Reindex Latent Topic Variables from individuals to item-response pairs
    vector[J] t_j;
    for (j in 1:J) {
      t_j[j] = t_m_i[m_n[n_j[j]], i_j[j]];
    }

    // Fit responses to Core Item, Topic Items
    y_z ~ ordered_logistic(z, c_z);
    y_n_j ~ ordered_logistic(t_j, c_t[n_j]);
    
  }

}
generated quantities{

  matrix[M + 1, M + 1] Omega = multiply_lower_tri_self_transpose(L_Omega);
  matrix[M + 1, M + 1] Sigma = quad_form_diag(Omega, tau);
  
}

Note: code has been edited for clarity from initial submission


I think it should be

for (k in 1:K){
  tau_L_Omega[k] = diag_pre_multiply(tau[k] .* tau_L_Omega_raw[k], L_Omega[k]);
}

Though I’d probably make tau a real instead of a vector.

@spinkney thanks for the response!

I’d been taking my cue on writing out the non-centered, Cholesky-factored covariance term from the snippet in the Stan User’s Guide section on Multivariate Hierarchical Priors (7th code block, directly below Optimization Under Cholesky Factorization). Am I correct in interpreting from your post that the user guide no longer represents coding best practices? (tau_L_Omega_raw in my Stan code is equivalent to z in the User’s Guide example.)

  beta = gamma * u + diag_pre_multiply(tau, L_Omega) * z;
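
As a numerical sanity check on that line (values arbitrary, numpy standing in for Stan): a draw diag_pre_multiply(tau, L_Omega) * z with z ~ std_normal() should have covariance Sigma = diag(tau) * Omega * diag(tau), which is the algebra the User’s Guide parameterization relies on:

```python
import numpy as np

rng = np.random.default_rng(0)

tau = np.array([1.0, 0.5, 2.0])                 # prior scales
Omega = np.array([[ 1.0, 0.3, -0.2],
                  [ 0.3, 1.0,  0.5],
                  [-0.2, 0.5,  1.0]])           # correlation matrix
L_Omega = np.linalg.cholesky(Omega)             # its Cholesky factor

# non-centered draws: diag_pre_multiply(tau, L_Omega) @ z with z ~ N(0, I)
z = rng.standard_normal((3, 200_000))
draws = (np.diag(tau) @ L_Omega) @ z

Sigma = np.diag(tau) @ Omega @ np.diag(tau)     # implied covariance
print(np.max(np.abs(np.cov(draws) - Sigma)))    # should be small
```

The empirical covariance of the draws does match diag(tau) * Omega * diag(tau), which is why I believed the parameterization itself was sound.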

No, I misunderstood your model. That’s right then.

Do you mean that if you run multiple chains you get \widehat{R} \gg 1, which indicates non-mixing? If not, what specifically is not consistent? Did you simulate data and check coverage?

Estimating correlation and covariance structure is challenging, especially if you don’t have a lot of data. I’m afraid your model is too complicated for me to grasp its structure through all the indexing. Our usual advice is to simplify to a model that works then build back up the complexity. That way, you can see where things go wrong. For example, start without a complicated correlated prior and get that working, then introduce the correlated prior.

Also, you can use declare-define and vectorization to simplify a lot of this code, e.g.,

vector[N_a] alpha_a = alpha_a_raw * sigma_alpha_a;

vector[I] z = alpha_a[a] + alpha_b[b] + tau_L_Omega[1];
...

Thanks for reaching out @Bob_Carpenter!

One quick point of clarification. I appreciate your nudge toward starting with a simple model and building up; my initial post is the result of exactly that exercise. The model I’ve shared above is the simplest specification with which I’m able to illustrate the unexpected behavior I’m observing, and I’m able to recover results that match my expectations for all specifications simpler than the one above. I realize my use of “expectation” in that last sentence is deeply subjective, and I hope to clarify what I mean below.

(As a quick aside: I’ve done my best to implement some of the code-cleanup strategies you’ve recommended above for easier interpretability, but I recognize that my code may remain quite abstruse. Apologies to any trying to dig in – don’t expect anyone to read my code who doesn’t want to, hopefully I’m able to fully summarize my ideas verbally below).

I think my core question, in a nutshell, is: how sensitive should I expect posterior sampling of the parameters in a covariance matrix to be to initialization? This question is at the heart of what I was gesturing at earlier by describing some of my parameter estimates as “inconsistent” (likely an abuse of language, apologies there). Perhaps a better word to use here would be “stable” (though I’m not sure this isn’t just as bad semantically).

The observation prompting this question is: for different random seed values, using the same model and inputting the same data, sampling is yielding posterior parameter estimates where the following properties hold:

  1. The magnitude of the diagonal entries in the covariance matrix rank-orders inconsistently across sampling runs, i.e. it might be the case that \Sigma_{i,i} > \Sigma_{j,j} > \Sigma_{k,k} for one sampling run, but \Sigma_{i,i} < \Sigma_{j,j} < \Sigma_{k,k} for a sampling run of the same model with the same data but a different initialization
  2. The signs of the off-diagonal entries in the covariance matrix are inconsistent, i.e. it may be the case that \Sigma_{i,j} > 0 in a first run but \Sigma_{i,j} < 0 in a subsequent run of the same model with the same data but a different initialization
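
To be concrete about the comparison I’m making, here is the (hypothetical; names are my own) numpy helper that captures both checks, applied to the posterior-mean Sigma from two seeds:

```python
import numpy as np

def sigma_stability(Sigma_1, Sigma_2):
    """Compare two posterior-mean covariance estimates from different seeds:
    (a) do the diagonal entries rank-order the same way, and
    (b) do the off-diagonal entries agree in sign?"""
    same_rank = np.array_equal(np.argsort(np.diag(Sigma_1)),
                               np.argsort(np.diag(Sigma_2)))
    off = ~np.eye(Sigma_1.shape[0], dtype=bool)
    same_sign = np.array_equal(np.sign(Sigma_1[off]), np.sign(Sigma_2[off]))
    return same_rank, same_sign
```

Both checks fail across seeds for Sigma in the runs described above, while the analogous checks pass for my other parameters.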

As a point of comparison, all the other parameters in my model (alpha_*, beta_*, delta, and c_*) return posterior estimates that are fairly “stable”, in the sense that these posterior estimates are similar in magnitude, sign, variance, and rank-order across different initializations (apologies again if there’s different language than “stable” I should be using to denote these properties). If I remove the covariance structure from my model (omit tau, L_Omega, and tau_L_Omega from my parameters, and Omega and Sigma from my generated quantities), posterior samples for my alpha_*, beta_*, c_*, and delta parameters remain “stable” across different initializations.

Across initializations, the \hat{R} values for the parameters in my model associated with \Sigma (i.e. tau, L_Omega, and tau_L_Omega) are quite close to 1.0. The \hat{R} values for the other parameters in my model (alpha_*, beta_*, delta, and c_*) are also quite close to 1.0 – whether I include the tau, L_Omega, and tau_L_Omega parameters, or omit them.
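
For reference, my understanding of what the statistic measures, as a minimal split-\hat{R} sketch (simplified; no rank-normalization as in the full diagnostic). Because it compares chains within a single run, it can sit near 1.0 even when two runs with different seeds land in different places:

```python
import numpy as np

def split_rhat(chains):
    """Basic split-R-hat. `chains` is (n_chains, n_draws). Each chain is split
    in half, then between- and within-chain variances are compared."""
    m, n_draws = chains.shape
    half = n_draws // 2
    splits = chains[:, :2 * half].reshape(2 * m, half)
    W = splits.var(axis=1, ddof=1).mean()       # within-chain variance
    B = half * splits.mean(axis=1).var(ddof=1)  # between-chain variance
    var_hat = (half - 1) / half * W + B / half
    return np.sqrt(var_hat / W)
```

With chains drawn from the same distribution this sits near 1; with chains stuck in different places it blows up. Either way it says nothing about whether a second run with a different seed lands in the same region.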

The size of my dataset is a few thousand records (I’m testing things out with about 5k records right now, though I have several thousand more in-hand).

Thanks to any/all who have borne with me this far. To restate my core question at a high level: to what extent, if at all, should I be concerned at how sensitive my posterior estimates for parameters related to covariance structure are to initialization? Do the fluctuations I’m observing fall within the range of regularly expected variability, given my data size and prior specification? Or do they portend some deeper model misspecification that I’ll need to weed out?

Appreciate advice/guidance that folks are able to impart.

Will try once more, since the chatter here has gone cold for the last week and a half or so:

I’m surprised at the result that the posterior sampling of parameters associated with the modeled covariance matrix (alone among my model parameters) is so dependent on initialization, and so would love to get a sense for whether this is reasonable/expected behavior, and if not what types of issues it might portend. Appreciate any guidance folks might have to impart!