Non-Centered, Cholesky Factored Covariance Matrix Converges to Inconsistent Parameter Values; Issue with Model Specification?

Hey all!

Extending work mentioned here, I’ve been developing a hierarchical SEM model where:

  1. Latent variable z[i] is fit to predict a common (Likert scale) survey item y_z for individual i via ordinal regression
  2. Latent variable t[m,i] is defined for individual i and topic m, where latent variable z[i]
    is a predictor of t[m,i] with varying slope delta[m] (this slope parameter delta varies by topic)
  3. Cutpoints c_t[n] for responses y_n to survey item n (Likert scale) are fit via ordinal logistic regression on t[m,i], where m is the topic corresponding to item n
  4. Latent variables z[i], t[1,i], …, t[M,i] are modeled to have a covarying error term for individual i (see the equations sketched just below)
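
In equations, the intended generative structure is roughly (a sketch; the \epsilon terms are the covarying individual errors from item 4, and a[i], b[i] index the demographic groups of individual i):

(\epsilon_{z,i}, \epsilon_{1,i}, \ldots, \epsilon_{M,i}) \sim \text{MultiNormal}(0, \Sigma)
z_i = \alpha_{a[i]} + \alpha_{b[i]} + \epsilon_{z,i}
t_{m,i} = \beta_{m,a[i]} + \delta_m z_i + \epsilon_{m,i}
y_{z,i} \sim \text{OrderedLogistic}(z_i, c_z)
y_{n,i} \sim \text{OrderedLogistic}(t_{m_n,i}, c_{t,n})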

I’m experiencing a fitting issue where, when I rerun my model under different randomization defaults,

  1. All my parameters “converge” in the sense of returning R_hat = 1.0, but
  2. I recover consistent values for all my parameters except those related to the covarying individual error term (parameters tau, L_Omega, and tau_L_Omega, and the downstream generated quantities Omega and Sigma)

I’ve tried changing the prior for my scale parameter tau from the recommended cauchy(0,2.5) to something less diffuse (cauchy(0,1), student_t(8,0,1), etc.). When I do, I find that my covariance terms vary less from one randomization default to the next, but remain quite noisy. I’m not sure whether:

  1. I may have mis-specified my coding of the non-centered, Cholesky-factored covarying error term
  2. My error term might be non-identifiable in this model as specified (I believe the term is identifiable since the associated parameters consistently return R_hat values of 1.0, but I recognize that I may be mis-interpreting this QC output)
  3. I’m suffering from some third source of error that I haven’t considered

Appreciate any guidance that folks might have to impart here. Thanks in advance!

data {

  // Metadata
  int<lower=0> I; // Number of Interviews
  int<lower=0> J; // Number of Item-Response Pairs for Issue Topic Items
  int<lower=0> M; // Number of Topics
  int<lower=0> N; // Number of Survey Items
  int<lower=0> Q; // Number of Common Item Response Classes
  int<lower=0> R; // Number of Issue Topic Item Response Classes
  
  // Discrete Predictor Level Declarations
  int<lower=0> N_a; // Number of levels of demographic predictor a
  int<lower=0> N_b; // Number of levels of demographic predictor b
  
  // PII-Survey Data

  /// Individual Demography
  array[I] int<lower=1,upper=N_a> a;
  array[I] int<lower=1,upper=N_b> b;

  /// Common Item Response Labels
  array[I] int<lower=1,upper=Q> y_z; 
  
  // Issue Topic-Item Data
  
  /// Metadata
  array[N] int<lower=1,upper=M> m_n;   // Item-Level Topic Index

  /// Responses
  array[J] int<lower=1,upper=I> i_j;   // Item-Response Respondent ID
  array[J] int<lower=1,upper=N> n_j;   // Item-Response Item ID
  array[J] int<lower=1,upper=R> y_n_j; // Item-Response Response Labels
  
}
parameters {

  // Target Cutpoints
  ordered[Q - 1] c_z;          // Cutpoints for common item
  array[N] ordered[R - 1] c_t; // Cutpoints for topic items
  
  // Latent Hierarchical Model for Common Item

  /// Hyperprior Uncertainties for Varying Intercepts
  real<lower=0> sigma_alpha_a;
  real<lower=0> sigma_alpha_b;
  
  /// Modeled Effects
  vector[N_a] alpha_a_raw;
  vector[N_b] alpha_b_raw;  
  
  // Latent Hierarchical Model for Topic Items

  /// Hyperprior Uncertainties for Varying Intercepts, Slopes
  vector<lower=0>[M] sigma_beta_a;
  vector<lower=0>[M] sigma_delta;
  
  /// Modeled Effects
  array[M] vector[N_a] beta_a_raw;
  vector[M] delta_raw; // varying slope for effect of latent variable representing *common item* in predicting *topic items*

  // Correlated Error Parameters

  /// Hyperpriors
  vector<lower=0>[M + 1] tau;          // prior scale (z plus M topics)
  cholesky_factor_corr[M + 1] L_Omega; // Cholesky-factored prior correlation matrix

  /// Non-Centered Cholesky Factored Covariance Term
  vector[M + 1] tau_L_Omega_raw;
  
}
transformed parameters{

  // Latent Hierarchical Model for Common Item
  
  /// Vector codings for multilevel intercepts
  vector[N_a] alpha_a = alpha_a_raw * sigma_alpha_a;
  vector[N_b] alpha_b = alpha_b_raw * sigma_alpha_b;

  // Latent Hierarchical Model for Topic Items
  
  /// Varying Intercepts, Slopes
  array[M] vector[N_a] beta_a;
  for (m in 1:M) {
    beta_a[m] = beta_a_raw[m] * sigma_beta_a[m]; // per-topic scaling (array-of-vector .* vector isn't defined)
  }
  vector[M] delta = delta_raw .* sigma_delta;
  
  // Cholesky Factored Covariance Term
  vector[M + 1] tau_L_Omega = diag_pre_multiply(tau, L_Omega) * tau_L_Omega_raw;
  
}
model {
  
  // Latent Hierarchical Model Parameters for Common Item
  
  /// Prior Uncertainties
  sigma_alpha_a ~ std_normal();
  sigma_alpha_b ~ std_normal();

  /// Varying Intercepts
  alpha_a_raw ~ std_normal();
  alpha_b_raw ~ std_normal();
  
  // Latent Hierarchical Model Parameters for Topic Items
  
  /// Hyperprior Uncertainties for Varying Intercepts, Slopes
  sigma_beta_a ~ std_normal();
  sigma_delta ~ std_normal();
  
  /// Varying Intercepts, Slopes
  for (m in 1:M){
      beta_a_raw[m] ~ std_normal();
   }
  delta_raw ~ std_normal();

  // Latent Topic Correlation
  tau ~ cauchy(0,2);
  L_Omega ~ lkj_corr_cholesky(2);
  tau_L_Omega_raw ~ std_normal();
  
  // Declare Local Scope for Latent Variable Vectorization
  {

    // Declare Latent Core Item Variable z, Topic Items variables t_m
    vector[I] z = alpha_a[a] + alpha_b[b] + tau_L_Omega[1];
    array[M] vector[I] t_m_i;
    
    // Vectorize Latent Variables for Core Item, Topics
    for (m in 1:M){
        t_m_i[m] = beta_a[m, a] + delta[m] * z + tau_L_Omega[m + 1];
    }
    
    // Reindex Latent Topic Variables from individuals to item-response pairs
    vector[J] t_j;
    for (j in 1:J) {
      t_j[j] = t_m_i[m_n[n_j[j]], i_j[j]]; // topic of item n_j[j], respondent i_j[j]
    }

    // Fit responses to Core Item, Topic Items
    y_z ~ ordered_logistic(z, c_z);
    y_n_j ~ ordered_logistic(t_j, c_t[n_j]);
    
  }

}
generated quantities{

  matrix[M + 1, M + 1] Omega = multiply_lower_tri_self_transpose(L_Omega);
  matrix[M + 1, M + 1] Sigma = quad_form_diag(Omega, tau);
  
}
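
(For anyone skimming the generated quantities: Omega and Sigma just reconstruct the correlation and covariance matrices from the sampled factors, i.e. \Omega = L_{\Omega} L_{\Omega}^\top and \Sigma = \text{diag}(\tau) \, \Omega \, \text{diag}(\tau).)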

Note: code has been edited for clarity from initial submission


I think it should be

for (k in 1:K){
    tau_L_Omega[k] = diag_pre_multiply(tau[k] .* tau_L_Omega_raw[k], L_Omega[k]);
  }

Though I’d probably make tau a real instead of a vector.

@spinkney thanks for the response!

I’d been taking my cue on writing out the non-centered, Cholesky-factored covariance term from the snippet in the Stan User’s Guide section on Multivariate Hierarchical Priors (7th code block, directly below Optimization Under Cholesky Factorization). Am I correct in interpreting from your post that the user guide no longer represents coding best practices? (tau_L_Omega_raw in my Stan code is equivalent to z in the User Guide example.)

  beta = gamma * u + diag_pre_multiply(tau, L_Omega) * z;
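
For completeness, here’s the pattern I was following, stripped down to just the correlated-error piece (a sketch with illustrative names, not the guide’s code verbatim; K correlated dimensions, I individuals):

data {
  int<lower=1> K; // number of correlated dimensions
  int<lower=1> I; // number of individuals
}
parameters {
  vector<lower=0>[K] tau;           // per-dimension scales
  cholesky_factor_corr[K] L_Omega;  // Cholesky factor of the correlation matrix
  matrix[K, I] eps_raw;             // standard-normal innovations (the guide's z)
}
transformed parameters {
  // Column i is a draw from multi_normal(0, quad_form_diag(Omega, tau))
  matrix[K, I] eps = diag_pre_multiply(tau, L_Omega) * eps_raw;
}
model {
  // No likelihood here; this just defines the correlated prior
  tau ~ cauchy(0, 2.5);
  L_Omega ~ lkj_corr_cholesky(2);
  to_vector(eps_raw) ~ std_normal();
}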

No, I misunderstood your model. That’s right then.

Do you mean that if you run multiple chains you get \widehat{R} \gg 1, which indicates non-mixing? If not, what specifically is not consistent? Did you simulate data and check coverage?

Estimating correlation and covariance structure is challenging, especially if you don’t have a lot of data. I’m afraid your model is too complicated for me to grasp its structure through all the indexing. Our usual advice is to simplify to a model that works then build back up the complexity. That way, you can see where things go wrong. For example, start without a complicated correlated prior and get that working, then introduce the correlated prior.

Also, you can use declare-define and vectorization to simplify a lot of this code, e.g.,

vector[N_a] alpha_a = alpha_a_raw * sigma_alpha_a;

vector[I] z = alpha_a[a] + alpha_b[b] + tau_L_Omega[1];
...

Thanks for reaching out @Bob_Carpenter!

One quick point of clarification. I appreciate your bump on starting with a simple model and building up; my initial post is the result of exactly this exercise. The model I’ve shared above is the simplest specification with which I’m able to illustrate the unexpected behavior I’m observing, and I’m able to recover results that map to my expectations for all specifications simpler than the one above. I realize the use of “expectation” in that last sentence is deeply subjective, and I hope to clarify what I mean below.

(As a quick aside: I’ve done my best to implement some of the code-cleanup strategies you’ve recommended above for easier interpretability, but I recognize that my code may remain quite abstruse. Apologies to anyone trying to dig in – I don’t expect anyone to read my code who doesn’t want to; hopefully I’m able to fully summarize my ideas verbally below.)

I think my core question, in a nutshell, is: how sensitive should I expect posterior sampling of parameters in a covariance matrix to be to initialization? This question is at the heart of what I was gesturing at earlier by describing some of my parameter estimates as “inconsistent” (likely an abuse of language, apologies there). Perhaps a better word to use here would be “stable” (though I’m not sure this isn’t just as bad semantically).

The observation prompting this question is: for different random seed values, using the same model and inputting the same data, sampling is yielding posterior parameter estimates where the following properties hold:

  1. The magnitudes of the diagonal entries in the covariance matrix rank-order inconsistently across sampling runs, i.e. it might be the case that \Sigma_{i,i} > \Sigma_{j,j} > \Sigma_{k,k} for one sampling run, but \Sigma_{i,i} < \Sigma_{j,j} < \Sigma_{k,k} for a sampling run of the same model with the same data but different initialization
  2. The signs of the off-diagonal entries in the covariance matrix are inconsistent, i.e. it may be the case that \Sigma_{i,j} > 0 in a first run but \Sigma_{i,j} < 0 in the next run of the same model with the same data but different initialization

As a point of comparison, all the other parameters in my model (alpha_*, beta_*, delta, and c_*) return posterior estimates that are fairly “stable”, in the sense that these posterior estimates are similar in magnitude, sign, variance, and rank-order across different initializations (apologies again if there’s different language from stable I should be using to denote these properties). If I remove the covariance structure from my model (omit tau, L_Omega, and tau_L_Omega from my parameters, and Omega and Sigma from my generated quantities), posterior samples for my alpha_*, beta_*, c_*, and delta parameters remain “stable” across different initializations.

Across initializations, the \hat{R} values for the parameters in my model associated with \Sigma (i.e. tau, L_Omega, and tau_L_Omega) are quite close to 1.0. The \hat{R} values for the other parameters in my model (alpha_*, beta_*, delta, and c_*) are also quite close to 1.0 – whether I include the tau, L_Omega, and tau_L_Omega parameters, or omit them.

The size of my dataset is a few thousand records (I’m testing things out with about 5k records right now, though I have several thousand more in-hand).

Thanks to any/all who have borne with me this far. To restate my core question at a high level: to what extent, if at all, should I be concerned about how sensitive my posterior estimates for parameters related to covariance structure are to initialization? Do the fluctuations I’m observing fall within the range of regularly expected variability, given my data size and prior specification? Or do they portend some deeper model misspecification that I’ll need to weed out?

Appreciate advice/guidance that folks are able to impart.

Will try once more, since the chatter here has gone cold for the last week and a half or so:

I’m surprised at the result that the posterior sampling of parameters associated with the modeled covariance matrix (alone among my model parameters) is so dependent on initialization, and so would love to get a sense for whether this is reasonable/expected behavior, and if not what types of issues it might portend. Appreciate any guidance folks might have to impart!

I wonder what your effective sample sizes look like. You said that Rhat is close to 1, which probably implies large effective sample sizes, but maybe they are still on the low side for certain parameters.

Not always. After 100 iterations sampling a standard normal, we get R_hat < 1 and N_Eff = 40.

          Mean      MCSE    StdDev       5%       50%       95%    N_Eff  N_Eff/s   R_hat
y     0.098738  0.152707  0.967841 -1.52322  0.029663  1.61712  40.1692  40169.2  0.996454

On the plus side, if R_hat is near 1, we expect N_eff to grow linearly with sample size.

@edm here’s an example ArviZ summary for the parameters associated with the model covariance matrix (I’ve fit the 2 * 2 * 3 case, the smallest possible non-trivial example for my model specification); PPS samples were generated from four chains of 1k samples each (after 1k warmup samples per chain), for 4k PPS draws total.

  • ess_bulk is definitely a bit low for the covariance matrix scale parameter terms (tau), at about 2k per parameter
  • Some of the non-null-valued L_Omega (Cholesky-factored correlation matrix) terms have fairly low ess_bulk values (~2k), though others have values > 5k
  • The tau_L_Omega terms all appear to have ess_bulk values smaller than 1k
  • The generated quantity Sigma (covariance matrix) terms are mostly around the full 4k, except for the \Sigma_{i,j,j} terms, which are at about 2k each

Are these ess_bulk values in a range that would be considered potentially problematic? Are there any steps available to me to increase them, other than a) running PPS sampling for longer and b) increasing the size of my training data?

mean sd hdi_3% hdi_97% mcse_mean mcse_sd ess_bulk ess_tail r_hat
tau[0, 0] 0.574 0.593 0.000 1.674 0.010 0.007 2671.0 2459.0 1.00
tau[0, 1] 0.564 0.543 0.001 1.534 0.011 0.008 1881.0 2485.0 1.00
tau[0, 2] 0.565 0.568 0.000 1.591 0.012 0.008 2154.0 2694.0 1.00
tau[1, 0] 0.600 0.584 0.000 1.615 0.010 0.009 2586.0 2053.0 1.00
tau[1, 1] 0.561 0.562 0.000 1.584 0.011 0.008 1974.0 2398.0 1.00
tau[1, 2] 0.558 0.543 0.000 1.586 0.011 0.008 1997.0 2068.0 1.00
L_Omega[0, 0, 0] 1.000 0.000 1.000 1.000 0.000 0.000 4000.0 4000.0 NaN
L_Omega[0, 0, 1] 0.000 0.000 0.000 0.000 0.000 0.000 4000.0 4000.0 NaN
L_Omega[0, 0, 2] 0.000 0.000 0.000 0.000 0.000 0.000 4000.0 4000.0 NaN
L_Omega[0, 1, 0] -0.004 0.411 -0.753 0.730 0.006 0.007 5280.0 2825.0 1.00
L_Omega[0, 1, 1] 0.904 0.122 0.666 1.000 0.003 0.002 2089.0 2297.0 1.00
L_Omega[0, 1, 2] 0.000 0.000 0.000 0.000 0.000 0.000 4000.0 4000.0 NaN
L_Omega[0, 2, 0] -0.003 0.414 -0.804 0.689 0.006 0.007 5253.0 2928.0 1.00
L_Omega[0, 2, 1] -0.002 0.415 -0.764 0.724 0.006 0.007 5426.0 2602.0 1.00
L_Omega[0, 2, 2] 0.793 0.165 0.486 1.000 0.004 0.003 1776.0 2233.0 1.00
L_Omega[1, 0, 0] 1.000 0.000 1.000 1.000 0.000 0.000 4000.0 4000.0 NaN
L_Omega[1, 0, 1] 0.000 0.000 0.000 0.000 0.000 0.000 4000.0 4000.0 NaN
L_Omega[1, 0, 2] 0.000 0.000 0.000 0.000 0.000 0.000 4000.0 4000.0 NaN
L_Omega[1, 1, 0] -0.007 0.411 -0.781 0.696 0.006 0.007 5422.0 2954.0 1.00
L_Omega[1, 1, 1] 0.904 0.119 0.669 1.000 0.003 0.002 1997.0 2417.0 1.00
L_Omega[1, 1, 2] 0.000 0.000 0.000 0.000 0.000 0.000 4000.0 4000.0 NaN
L_Omega[1, 2, 0] -0.006 0.412 -0.783 0.711 0.005 0.007 5784.0 2764.0 1.01
L_Omega[1, 2, 1] -0.003 0.414 -0.683 0.787 0.006 0.007 4735.0 2583.0 1.00
L_Omega[1, 2, 2] 0.794 0.166 0.486 1.000 0.004 0.003 1844.0 2298.0 1.00
tau_L_Omega[0, 0] 0.025 0.338 -0.646 0.704 0.009 0.007 1412.0 1936.0 1.00
tau_L_Omega[0, 1] -0.002 0.307 -0.658 0.587 0.013 0.009 758.0 597.0 1.00
tau_L_Omega[0, 2] -0.031 0.299 -0.645 0.553 0.012 0.010 814.0 542.0 1.00
tau_L_Omega[1, 0] -0.044 0.339 -0.719 0.636 0.010 0.007 1367.0 1788.0 1.00
tau_L_Omega[1, 1] 0.010 0.311 -0.646 0.622 0.013 0.009 764.0 669.0 1.00
tau_L_Omega[1, 2] 0.006 0.298 -0.554 0.630 0.011 0.010 803.0 477.0 1.00
Omega[0, 0, 0] 1.000 0.000 1.000 1.000 0.000 0.000 4000.0 4000.0 NaN
Omega[0, 0, 1] -0.004 0.411 -0.753 0.730 0.006 0.007 5280.0 2825.0 1.00
Omega[0, 0, 2] -0.003 0.414 -0.804 0.689 0.006 0.007 5253.0 2928.0 1.00
Omega[0, 1, 0] -0.004 0.411 -0.753 0.730 0.006 0.007 5280.0 2825.0 1.00
Omega[0, 1, 1] 1.000 0.000 1.000 1.000 0.000 0.000 3958.0 3798.0 1.00
Omega[0, 1, 2] -0.001 0.413 -0.765 0.703 0.006 0.007 4233.0 2872.0 1.00
Omega[0, 2, 0] -0.003 0.414 -0.804 0.689 0.006 0.007 5253.0 2928.0 1.00
Omega[0, 2, 1] -0.001 0.413 -0.765 0.703 0.006 0.007 4233.0 2872.0 1.00
Omega[0, 2, 2] 1.000 0.000 1.000 1.000 0.000 0.000 3492.0 3585.0 1.00
Omega[1, 0, 0] 1.000 0.000 1.000 1.000 0.000 0.000 4000.0 4000.0 NaN
Omega[1, 0, 1] -0.007 0.411 -0.781 0.696 0.006 0.007 5422.0 2954.0 1.00
Omega[1, 0, 2] -0.006 0.412 -0.783 0.711 0.005 0.007 5784.0 2764.0 1.01
Omega[1, 1, 0] -0.007 0.411 -0.781 0.696 0.006 0.007 5422.0 2954.0 1.00
Omega[1, 1, 1] 1.000 0.000 1.000 1.000 0.000 0.000 4029.0 3930.0 1.00
Omega[1, 1, 2] 0.000 0.419 -0.703 0.798 0.007 0.007 3795.0 2735.0 1.00
Omega[1, 2, 0] -0.006 0.412 -0.783 0.711 0.005 0.007 5784.0 2764.0 1.01
Omega[1, 2, 1] 0.000 0.419 -0.703 0.798 0.007 0.007 3795.0 2735.0 1.00
Omega[1, 2, 2] 1.000 0.000 1.000 1.000 0.000 0.000 3577.0 3198.0 1.00
Sigma[0, 0, 0] 0.682 1.553 0.000 2.803 0.026 0.019 2671.0 2459.0 1.00
Sigma[0, 0, 1] 0.006 0.329 -0.384 0.430 0.006 0.004 4267.0 3093.0 1.00
Sigma[0, 0, 2] -0.003 0.276 -0.371 0.498 0.005 0.004 4300.0 3248.0 1.00
Sigma[0, 1, 0] 0.006 0.329 -0.384 0.430 0.006 0.004 4267.0 3093.0 1.00
Sigma[0, 1, 1] 0.613 1.344 0.000 2.353 0.025 0.018 1881.0 2485.0 1.00
Sigma[0, 1, 2] -0.002 0.246 -0.398 0.434 0.004 0.003 3633.0 3078.0 1.00
Sigma[0, 2, 0] -0.003 0.276 -0.371 0.498 0.005 0.004 4300.0 3248.0 1.00
Sigma[0, 2, 1] -0.002 0.246 -0.398 0.434 0.004 0.003 3633.0 3078.0 1.00
Sigma[0, 2, 2] 0.641 1.498 0.000 2.532 0.031 0.023 2154.0 2694.0 1.00
Sigma[1, 0, 0] 0.701 1.600 0.000 2.608 0.033 0.030 2586.0 2053.0 1.00
Sigma[1, 0, 1] -0.000 0.278 -0.462 0.439 0.005 0.003 4197.0 3347.0 1.00
Sigma[1, 0, 2] -0.003 0.270 -0.459 0.403 0.005 0.003 4065.0 3079.0 1.00
Sigma[1, 1, 0] -0.000 0.278 -0.462 0.439 0.005 0.003 4197.0 3347.0 1.00
Sigma[1, 1, 1] 0.630 1.516 0.000 2.510 0.029 0.021 1974.0 2398.0 1.00
Sigma[1, 1, 2] -0.005 0.267 -0.462 0.424 0.004 0.003 4014.0 3450.0 1.00
Sigma[1, 2, 0] -0.003 0.270 -0.459 0.403 0.005 0.003 4065.0 3079.0 1.00
Sigma[1, 2, 1] -0.005 0.267 -0.462 0.424 0.004 0.003 4014.0 3450.0 1.00
Sigma[1, 2, 2] 0.606 1.250 0.000 2.515 0.024 0.017 1997.0 2068.0 1.00

I wouldn’t say that’s low. An ESS of 2000 in 4000 iterations is actually quite high for general models. Usually you can get away with an ESS of 100 or so, as that reduces uncertainty in estimates to sd/10, which is more than enough for most purposes.
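
(The arithmetic behind that: the Monte Carlo standard error of a posterior mean is roughly \text{sd} / \sqrt{\text{ESS}}, so an ESS of 100 gives \text{sd} / \sqrt{100} = \text{sd} / 10.)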

ESS less than 20 per chain or so is problematic because the estimators we use for ESS are noisy if the chains are very short.

Run more iterations. It might help to tighten the prior on tau compared to the current Cauchy. But otherwise, I don’t see a reparameterization that would help. With less data, it can help to put priors on the cutpoints to keep them from collapsing when there are no observations for a level.
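
For example, something along these lines (a sketch, not a recommendation tuned to your data):

// Lighter-tailed scale prior, in place of tau ~ cauchy(0, 2):
tau ~ student_t(4, 0, 1);

// Weak priors on the cutpoints to keep them from drifting when some
// response levels are rarely or never observed:
c_z ~ normal(0, 4);
for (n in 1:N) {
  c_t[n] ~ normal(0, 4);
}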

Thank you @Bob_Carpenter/@edm/@spinkney for your time and attention!