Non-Centered, Cholesky Factored Covariance Matrix Converges to Inconsistent Parameter Values; Issue with Model Specification?

Hey all!

Extending work mentioned here, I’ve been developing a hierarchical SEM model where:

  1. Latent variable z[i] is fit to predict a common (Likert scale) survey item y_z for individual i via ordinal regression
  2. Latent variable t[m,i] is defined for individual i and topic m, where latent variable z[i]
    is a predictor of t[m,i] with varying slope delta[m] (this slope parameter delta varies by topic)
  3. Cutpoints c_t[n] for responses y_n to survey item n (Likert scale) are fit via ordinal logistic regression on t[m,i], where m is the topic corresponding to item n
  4. Latent variables z[i], t[1,i], …, t[M,i] are modeled to have a covarying error term for individual i (see the equations sketched just below)
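
In equations, the intended generative structure is roughly (a sketch; the \epsilon terms are the covarying individual errors from item 4, and a[i], b[i] index the demographic groups of individual i):

(\epsilon_{z,i}, \epsilon_{1,i}, \ldots, \epsilon_{M,i}) \sim \text{MultiNormal}(0, \Sigma)
z_i = \alpha_{a[i]} + \alpha_{b[i]} + \epsilon_{z,i}
t_{m,i} = \beta_{m,a[i]} + \delta_m z_i + \epsilon_{m,i}
y_{z,i} \sim \text{OrderedLogistic}(z_i, c_z)
y_{n,i} \sim \text{OrderedLogistic}(t_{m_n,i}, c_{t,n})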

I’m experiencing a fitting issue where, when I rerun my model under different randomization defaults,

  1. All my parameters “converge” in the sense of returning R_hat = 1.0, but
  2. I recover consistent values for all my parameters except those related to the covarying individual error term (parameters tau, L_Omega, and tau_L_Omega, and the downstream generated quantities Omega and Sigma)

I’ve tried changing the prior for my scale parameter tau from the recommended cauchy(0,2.5) to something less diffuse (cauchy(0,1), student_t(8,0,1), etc.). When I do, I find that my covariance terms vary less from one randomization default to the next, but remain quite noisy. I’m not sure whether:

  1. I may have mis-specified my coding of the non-centered, Cholesky-factored covarying error term
  2. My error term might be non-identifiable in this model as specified (I believe the term is identifiable since the associated parameters consistently return R_hat values of 1.0, but I recognize that I may be mis-interpreting this QC output)
  3. I’m suffering from some third source of error that I haven’t considered

Appreciate any guidance that folks might have to impart here. Thanks in advance!

data {

  // Metadata
  int<lower=0> I; // Number of Interviews
  int<lower=0> J; // Number of Item-Response Pairs for Issue Topic Items
  int<lower=0> M; // Number of Topics
  int<lower=0> N; // Number of Survey Items
  int<lower=0> Q; // Number of Common Item Response Classes
  int<lower=0> R; // Number of Issue Topic Item Response Classes
  
  // Discrete Predictor Level Declarations
  int<lower=0> N_a; // Number of levels of demographic predictor a
  int<lower=0> N_b; // Number of levels of demographic predictor b
  
  // PII-Survey Data

  /// Individual Demography
  array[I] int<lower=1,upper=N_a> a;
  array[I] int<lower=1,upper=N_b> b;

  /// Common Item Response Labels
  array[I] int<lower=1,upper=Q> y_z; 
  
  // Issue Topic-Item Data
  
  /// Metadata
  array[N] int<lower=1,upper=M> m_n;   // Item-Level Topic Index

  /// Responses
  array[J] int<lower=1,upper=I> i_j;   // Item-Response Respondent ID
  array[J] int<lower=1,upper=N> n_j;   // Item-Response Item ID
  array[J] int<lower=1,upper=R> y_n_j; // Item-Response Response Labels
  
}
parameters {

  // Target Cutpoints
  ordered[Q - 1] c_z;          // Cutpoints for common item
  array[N] ordered[R - 1] c_t; // Cutpoints for topic items
  
  // Latent Hierarchical Model for Common Item

  /// Hyperprior Uncertainties for Varying Intercepts
  real<lower=0> sigma_alpha_a;
  real<lower=0> sigma_alpha_b;
  
  /// Modeled Effects
  vector[N_a] alpha_a_raw;
  vector[N_b] alpha_b_raw;  
  
  // Latent Hierarchical Model for Topic Items

  /// Hyperprior Uncertainties for Varying Intercepts, Slopes
  vector<lower=0>[M] sigma_beta_a;
  vector<lower=0>[M] sigma_delta;
  
  /// Modeled Effects
  array[M] vector[N_a] beta_a_raw;
  vector[M] delta_raw; // varying slope for effect of latent variable representing *common item* in predicting *topic items*

  // Correlated Error Parameters

  /// Hyperpriors
  vector<lower=0>[M + 1] tau;          // prior scale (z plus M topics)
  cholesky_factor_corr[M + 1] L_Omega; // Cholesky-factored prior correlation matrix

  /// Non-Centered Cholesky Factored Covariance Term
  vector[M + 1] tau_L_Omega_raw;
  
}
transformed parameters{

  // Latent Hierarchical Model for Common Item
  
  /// Vector codings for multilevel intercepts
  vector[N_a] alpha_a = alpha_a_raw * sigma_alpha_a;
  vector[N_b] alpha_b = alpha_b_raw * sigma_alpha_b;

  // Latent Hierarchical Model for Topic Items
  
  /// Varying Intercepts, Slopes
  array[M] vector[N_a] beta_a;
  for (m in 1:M) {
    beta_a[m] = beta_a_raw[m] * sigma_beta_a[m]; // per-topic scaling (array-of-vector .* vector isn't defined)
  }
  vector[M] delta = delta_raw .* sigma_delta;
  
  // Cholesky Factored Covariance Term
  vector[M + 1] tau_L_Omega = diag_pre_multiply(tau, L_Omega) * tau_L_Omega_raw;
  
}
model {
  
  // Latent Hierarchical Model Parameters for Common Item
  
  /// Prior Uncertainties
  sigma_alpha_a ~ std_normal();
  sigma_alpha_b ~ std_normal();

  /// Varying Intercepts
  alpha_a_raw ~ std_normal();
  alpha_b_raw ~ std_normal();
  
  // Latent Hierarchical Model Parameters for Topic Items
  
  /// Hyperprior Uncertainties for Varying Intercepts, Slopes
  sigma_beta_a ~ std_normal();
  sigma_delta ~ std_normal();
  
  /// Varying Intercepts, Slopes
  for (m in 1:M){
      beta_a_raw[m] ~ std_normal();
   }
  delta_raw ~ std_normal();

  // Latent Topic Correlation
  tau ~ cauchy(0,2);
  L_Omega ~ lkj_corr_cholesky(2);
  tau_L_Omega_raw ~ std_normal();
  
  // Declare Local Scope for Latent Variable Vectorization
  {

    // Declare Latent Core Item Variable z, Topic Items variables t_m
    vector[I] z = alpha_a[a] + alpha_b[b] + tau_L_Omega[1];
    array[M] vector[I] t_m_i;
    
    // Vectorize Latent Variables for Core Item, Topics
    for (m in 1:M){
        t_m_i[m] = beta_a[m, a] + delta[m] * z + tau_L_Omega[m + 1];
    }
    
    // Reindex Latent Topic Variables from individuals to item-response pairs
    vector[J] t_j;
    for (j in 1:J) {
      t_j[j] = t_m_i[m_n[n_j[j]], i_j[j]]; // topic of item n_j[j], respondent i_j[j]
    }

    // Fit responses to Core Item, Topic Items
    y_z ~ ordered_logistic(z, c_z);
    y_n_j ~ ordered_logistic(t_j, c_t[n_j]);
    
  }

}
generated quantities{

  matrix[M + 1, M + 1] Omega = multiply_lower_tri_self_transpose(L_Omega);
  matrix[M + 1, M + 1] Sigma = quad_form_diag(Omega, tau);
  
}
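
(For anyone skimming the generated quantities: Omega and Sigma just reconstruct the correlation and covariance matrices from the sampled factors, i.e. \Omega = L_{\Omega} L_{\Omega}^\top and \Sigma = \text{diag}(\tau) \, \Omega \, \text{diag}(\tau).)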

Note: code has been edited for clarity from initial submission


I think it should be

for (k in 1:K){
    tau_L_Omega[k] = diag_pre_multiply(tau[k] .* tau_L_Omega_raw[k], L_Omega[k]);
  }

Though I’d probably make tau a real instead of a vector.

@spinkney thanks for the response!

I’d been taking my cue on writing out the non-centered, Cholesky-factored covariance term from the snippet in the Stan User’s Guide section on Multivariate Hierarchical Priors (7th code block, directly below Optimization Under Cholesky Factorization). Am I correct in interpreting from your post that the user guide no longer represents coding best practices? (tau_L_Omega_raw in my Stan code is equivalent to z in the User Guide example.)

  beta = gamma * u + diag_pre_multiply(tau, L_Omega) * z;
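
For completeness, here’s the pattern I was following, stripped down to just the correlated-error piece (a sketch with illustrative names, not the guide’s code verbatim; K correlated dimensions, I individuals):

data {
  int<lower=1> K; // number of correlated dimensions
  int<lower=1> I; // number of individuals
}
parameters {
  vector<lower=0>[K] tau;           // per-dimension scales
  cholesky_factor_corr[K] L_Omega;  // Cholesky factor of the correlation matrix
  matrix[K, I] eps_raw;             // standard-normal innovations (the guide's z)
}
transformed parameters {
  // Column i is a draw from multi_normal(0, quad_form_diag(Omega, tau))
  matrix[K, I] eps = diag_pre_multiply(tau, L_Omega) * eps_raw;
}
model {
  // No likelihood here; this just defines the correlated prior
  tau ~ cauchy(0, 2.5);
  L_Omega ~ lkj_corr_cholesky(2);
  to_vector(eps_raw) ~ std_normal();
}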

No, I misunderstood your model. That’s right then.

Do you mean that if you run multiple chains you get \widehat{R} \gg 1, which indicates non-mixing? If not, what specifically is not consistent? Did you simulate data and check coverage?

Estimating correlation and covariance structure is challenging, especially if you don’t have a lot of data. I’m afraid your model is too complicated for me to grasp its structure through all the indexing. Our usual advice is to simplify to a model that works then build back up the complexity. That way, you can see where things go wrong. For example, start without a complicated correlated prior and get that working, then introduce the correlated prior.

Also, you can use declare-define and vectorization to simplify a lot of this code, e.g.,

vector[N_a] alpha_a = alpha_a_raw * sigma_alpha_a;

vector[I] z = alpha_a[a] + alpha_b[b] + tau_L_Omega[1];
...

Thanks for reaching out @Bob_Carpenter!

One quick point of clarification. I appreciate your bump on starting with a simple model and building up; my initial post is the result of exactly this exercise. The model I’ve shared above is the simplest specification with which I’m able to illustrate the unexpected behavior I’m observing, and I’m able to recover results that map to my expectations for all specifications simpler than the one above. I realize the use of “expectation” in that last sentence is deeply subjective, and I hope to clarify what I mean below.

(As a quick aside: I’ve done my best to implement some of the code-cleanup strategies you’ve recommended above for easier interpretability, but I recognize that my code may remain quite abstruse. Apologies to anyone trying to dig in – I don’t expect anyone to read my code who doesn’t want to; hopefully I’m able to fully summarize my ideas verbally below.)

I think my core question, in a nutshell, is: how sensitive should I expect posterior sampling of parameters in a covariance matrix to be to initialization? This question is at the heart of what I was gesturing at earlier by describing some of my parameter estimates as “inconsistent” (likely an abuse of language, apologies there). Perhaps a better word to use here would be “stable” (though I’m not sure this isn’t just as bad semantically).

The observation prompting this question is: for different random seed values, using the same model and inputting the same data, sampling is yielding posterior parameter estimates where the following properties hold:

  1. The magnitudes of the diagonal entries in the covariance matrix rank-order inconsistently across sampling runs, i.e. it might be the case that \Sigma_{i,i} > \Sigma_{j,j} > \Sigma_{k,k} for one sampling run, but \Sigma_{i,i} < \Sigma_{j,j} < \Sigma_{k,k} for a sampling run of the same model with the same data but different initialization
  2. The signs of the off-diagonal entries in the covariance matrix are inconsistent, i.e. it may be the case that \Sigma_{i,j} > 0 in a first run but \Sigma_{i,j} < 0 in the next run of the same model with the same data but different initialization

As a point of comparison, all the other parameters in my model (alpha_*, beta_*, delta, and c_*) return posterior estimates that are fairly “stable”, in the sense that these posterior estimates are similar in magnitude, sign, variance, and rank-order across different initializations (apologies again if there’s different language from stable I should be using to denote these properties). If I remove the covariance structure from my model (omit tau, L_Omega, and tau_L_Omega from my parameters, and Omega and Sigma from my generated quantities), posterior samples for my alpha_*, beta_*, c_*, and delta parameters remain “stable” across different initializations.

Across initializations, the \hat{R} values for the parameters in my model associated with \Sigma (i.e. tau, L_Omega, and tau_L_Omega) are quite close to 1.0. The \hat{R} values for the other parameters in my model (alpha_*, beta_*, delta, and c_*) are also quite close to 1.0 – whether I include the tau, L_Omega, and tau_L_Omega parameters, or omit them.

The size of my dataset is a few thousand records (I’m testing things out with about 5k records right now, though I have several thousand more in-hand).

Thanks to any/all who have borne with me this far. To restate my core question at a high level: to what extent, if at all, should I be concerned about how sensitive my posterior estimates for parameters related to covariance structure are to initialization? Do the fluctuations I’m observing fall within the range of regularly expected variability, given my data size and prior specification? Or do they portend some deeper model misspecification that I’ll need to weed out?

Appreciate advice/guidance that folks are able to impart.

Will try once more, since the chatter here has gone cold for the last week and a half or so:

I’m surprised at the result that the posterior sampling of parameters associated with the modeled covariance matrix (alone among my model parameters) is so dependent on initialization, and so would love to get a sense for whether this is reasonable/expected behavior, and if not what types of issues it might portend. Appreciate any guidance folks might have to impart!

I wonder what your effective sample sizes look like. You said that Rhat is close to 1, which probably implies large effective sample sizes, but maybe they are still on the low side for certain parameters.

Not always. After 100 iterations sampling a standard normal, we get R_hat < 1 and N_Eff = 40.

          Mean      MCSE    StdDev       5%       50%       95%    N_Eff  N_Eff/s   R_hat
y     0.098738  0.152707  0.967841 -1.52322  0.029663  1.61712  40.1692  40169.2  0.996454

On the plus side, if R_hat is near 1, we expect N_eff to grow linearly with sample size.

@edm here’s an example ArviZ summary for the parameters associated with the model covariance matrix (I’ve fit the 2 * 2 * 3 case, the smallest possible non-trivial example for my model specification); PPS samples were generated from four chains of 1k samples each (after 1k warmup samples per chain), for 4k PPS draws total.

  • ess_bulk is definitely a bit low for the covariance matrix scale parameter terms (tau), at about 2k per parameter
  • Some of the non-null-valued L_Omega (Cholesky-factored correlation matrix) terms have fairly low ess_bulk values (~2k), though others have values > 5k
  • The tau_L_Omega terms all appear to have ess_bulk values smaller than 1k
  • The generated quantity Sigma (covariance matrix) terms are mostly around the full 4k, except for the \Sigma_{i,j,j} terms, which are at about 2k each

Are these ess_bulk values in a range that would be considered potentially problematic? Are there any steps available to me to increase them, other than a) running PPS sampling for longer and b) increasing the size of my training data?

mean sd hdi_3% hdi_97% mcse_mean mcse_sd ess_bulk ess_tail r_hat
tau[0, 0] 0.574 0.593 0.000 1.674 0.010 0.007 2671.0 2459.0 1.00
tau[0, 1] 0.564 0.543 0.001 1.534 0.011 0.008 1881.0 2485.0 1.00
tau[0, 2] 0.565 0.568 0.000 1.591 0.012 0.008 2154.0 2694.0 1.00
tau[1, 0] 0.600 0.584 0.000 1.615 0.010 0.009 2586.0 2053.0 1.00
tau[1, 1] 0.561 0.562 0.000 1.584 0.011 0.008 1974.0 2398.0 1.00
tau[1, 2] 0.558 0.543 0.000 1.586 0.011 0.008 1997.0 2068.0 1.00
L_Omega[0, 0, 0] 1.000 0.000 1.000 1.000 0.000 0.000 4000.0 4000.0 NaN
L_Omega[0, 0, 1] 0.000 0.000 0.000 0.000 0.000 0.000 4000.0 4000.0 NaN
L_Omega[0, 0, 2] 0.000 0.000 0.000 0.000 0.000 0.000 4000.0 4000.0 NaN
L_Omega[0, 1, 0] -0.004 0.411 -0.753 0.730 0.006 0.007 5280.0 2825.0 1.00
L_Omega[0, 1, 1] 0.904 0.122 0.666 1.000 0.003 0.002 2089.0 2297.0 1.00
L_Omega[0, 1, 2] 0.000 0.000 0.000 0.000 0.000 0.000 4000.0 4000.0 NaN
L_Omega[0, 2, 0] -0.003 0.414 -0.804 0.689 0.006 0.007 5253.0 2928.0 1.00
L_Omega[0, 2, 1] -0.002 0.415 -0.764 0.724 0.006 0.007 5426.0 2602.0 1.00
L_Omega[0, 2, 2] 0.793 0.165 0.486 1.000 0.004 0.003 1776.0 2233.0 1.00
L_Omega[1, 0, 0] 1.000 0.000 1.000 1.000 0.000 0.000 4000.0 4000.0 NaN
L_Omega[1, 0, 1] 0.000 0.000 0.000 0.000 0.000 0.000 4000.0 4000.0 NaN
L_Omega[1, 0, 2] 0.000 0.000 0.000 0.000 0.000 0.000 4000.0 4000.0 NaN
L_Omega[1, 1, 0] -0.007 0.411 -0.781 0.696 0.006 0.007 5422.0 2954.0 1.00
L_Omega[1, 1, 1] 0.904 0.119 0.669 1.000 0.003 0.002 1997.0 2417.0 1.00
L_Omega[1, 1, 2] 0.000 0.000 0.000 0.000 0.000 0.000 4000.0 4000.0 NaN
L_Omega[1, 2, 0] -0.006 0.412 -0.783 0.711 0.005 0.007 5784.0 2764.0 1.01
L_Omega[1, 2, 1] -0.003 0.414 -0.683 0.787 0.006 0.007 4735.0 2583.0 1.00
L_Omega[1, 2, 2] 0.794 0.166 0.486 1.000 0.004 0.003 1844.0 2298.0 1.00
tau_L_Omega[0, 0] 0.025 0.338 -0.646 0.704 0.009 0.007 1412.0 1936.0 1.00
tau_L_Omega[0, 1] -0.002 0.307 -0.658 0.587 0.013 0.009 758.0 597.0 1.00
tau_L_Omega[0, 2] -0.031 0.299 -0.645 0.553 0.012 0.010 814.0 542.0 1.00
tau_L_Omega[1, 0] -0.044 0.339 -0.719 0.636 0.010 0.007 1367.0 1788.0 1.00
tau_L_Omega[1, 1] 0.010 0.311 -0.646 0.622 0.013 0.009 764.0 669.0 1.00
tau_L_Omega[1, 2] 0.006 0.298 -0.554 0.630 0.011 0.010 803.0 477.0 1.00
Omega[0, 0, 0] 1.000 0.000 1.000 1.000 0.000 0.000 4000.0 4000.0 NaN
Omega[0, 0, 1] -0.004 0.411 -0.753 0.730 0.006 0.007 5280.0 2825.0 1.00
Omega[0, 0, 2] -0.003 0.414 -0.804 0.689 0.006 0.007 5253.0 2928.0 1.00
Omega[0, 1, 0] -0.004 0.411 -0.753 0.730 0.006 0.007 5280.0 2825.0 1.00
Omega[0, 1, 1] 1.000 0.000 1.000 1.000 0.000 0.000 3958.0 3798.0 1.00
Omega[0, 1, 2] -0.001 0.413 -0.765 0.703 0.006 0.007 4233.0 2872.0 1.00
Omega[0, 2, 0] -0.003 0.414 -0.804 0.689 0.006 0.007 5253.0 2928.0 1.00
Omega[0, 2, 1] -0.001 0.413 -0.765 0.703 0.006 0.007 4233.0 2872.0 1.00
Omega[0, 2, 2] 1.000 0.000 1.000 1.000 0.000 0.000 3492.0 3585.0 1.00
Omega[1, 0, 0] 1.000 0.000 1.000 1.000 0.000 0.000 4000.0 4000.0 NaN
Omega[1, 0, 1] -0.007 0.411 -0.781 0.696 0.006 0.007 5422.0 2954.0 1.00
Omega[1, 0, 2] -0.006 0.412 -0.783 0.711 0.005 0.007 5784.0 2764.0 1.01
Omega[1, 1, 0] -0.007 0.411 -0.781 0.696 0.006 0.007 5422.0 2954.0 1.00
Omega[1, 1, 1] 1.000 0.000 1.000 1.000 0.000 0.000 4029.0 3930.0 1.00
Omega[1, 1, 2] 0.000 0.419 -0.703 0.798 0.007 0.007 3795.0 2735.0 1.00
Omega[1, 2, 0] -0.006 0.412 -0.783 0.711 0.005 0.007 5784.0 2764.0 1.01
Omega[1, 2, 1] 0.000 0.419 -0.703 0.798 0.007 0.007 3795.0 2735.0 1.00
Omega[1, 2, 2] 1.000 0.000 1.000 1.000 0.000 0.000 3577.0 3198.0 1.00
Sigma[0, 0, 0] 0.682 1.553 0.000 2.803 0.026 0.019 2671.0 2459.0 1.00
Sigma[0, 0, 1] 0.006 0.329 -0.384 0.430 0.006 0.004 4267.0 3093.0 1.00
Sigma[0, 0, 2] -0.003 0.276 -0.371 0.498 0.005 0.004 4300.0 3248.0 1.00
Sigma[0, 1, 0] 0.006 0.329 -0.384 0.430 0.006 0.004 4267.0 3093.0 1.00
Sigma[0, 1, 1] 0.613 1.344 0.000 2.353 0.025 0.018 1881.0 2485.0 1.00
Sigma[0, 1, 2] -0.002 0.246 -0.398 0.434 0.004 0.003 3633.0 3078.0 1.00
Sigma[0, 2, 0] -0.003 0.276 -0.371 0.498 0.005 0.004 4300.0 3248.0 1.00
Sigma[0, 2, 1] -0.002 0.246 -0.398 0.434 0.004 0.003 3633.0 3078.0 1.00
Sigma[0, 2, 2] 0.641 1.498 0.000 2.532 0.031 0.023 2154.0 2694.0 1.00
Sigma[1, 0, 0] 0.701 1.600 0.000 2.608 0.033 0.030 2586.0 2053.0 1.00
Sigma[1, 0, 1] -0.000 0.278 -0.462 0.439 0.005 0.003 4197.0 3347.0 1.00
Sigma[1, 0, 2] -0.003 0.270 -0.459 0.403 0.005 0.003 4065.0 3079.0 1.00
Sigma[1, 1, 0] -0.000 0.278 -0.462 0.439 0.005 0.003 4197.0 3347.0 1.00
Sigma[1, 1, 1] 0.630 1.516 0.000 2.510 0.029 0.021 1974.0 2398.0 1.00
Sigma[1, 1, 2] -0.005 0.267 -0.462 0.424 0.004 0.003 4014.0 3450.0 1.00
Sigma[1, 2, 0] -0.003 0.270 -0.459 0.403 0.005 0.003 4065.0 3079.0 1.00
Sigma[1, 2, 1] -0.005 0.267 -0.462 0.424 0.004 0.003 4014.0 3450.0 1.00
Sigma[1, 2, 2] 0.606 1.250 0.000 2.515 0.024 0.017 1997.0 2068.0 1.00

I wouldn’t say that’s low. An ESS of 2000 in 4000 iterations is actually quite high for general models. Usually you can get away with an ESS of 100 or so, as that reduces uncertainty in estimates to sd/10, which is more than enough for most purposes.
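
(The arithmetic behind that: the Monte Carlo standard error of a posterior mean is roughly \text{sd} / \sqrt{\text{ESS}}, so an ESS of 100 gives \text{sd} / \sqrt{100} = \text{sd} / 10.)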

ESS less than 20 per chain or so is problematic because the estimators we use for ESS are noisy if the chains are very short.

Run more iterations. It might help to tighten the prior on tau compared to the current Cauchy. But otherwise, I don’t see a reparameterization that would help. With less data, it can help to put priors on the cutpoints to keep them from collapsing when there are no observations for a level.
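
For example, something along these lines (a sketch, not a recommendation tuned to your data):

// Lighter-tailed scale prior, in place of tau ~ cauchy(0, 2):
tau ~ student_t(4, 0, 1);

// Weak priors on the cutpoints to keep them from drifting when some
// response levels are rarely or never observed:
c_z ~ normal(0, 4);
for (n in 1:N) {
  c_t[n] ~ normal(0, 4);
}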

Thank you @Bob_Carpenter/@edm/@spinkney for your time and attention!