Why do I find chains running in parallel and not converging

stanc

#1

Hello,

I am relatively new to STAN, but simply loving the model fitting experience. I call stan from R using rstan, and I am interested in fitting an lmer (multilevel model) equivalent hierarchical bayesian model. Since I have multiple outcomes, I plan to fit a hierarchical model with a single latent variable as an outcome, that has four manifest variables corresponding to the four outcomes. I have followed [Merkle and Wang 2016](http://Bayesian latent variable models for the analysis of experimental psychology data) to represent my model as follows:

For each outcome y_1, y_2, y_3, y_4:
y_j ~ normal(mu, sigmaY_j)
where
mu is modeled as
mu = lambda_j * theta
theta is a latent trait which is modeled using hierarchical bayesian model as follows:
theta = intercept + betas’ * covariates + latent_error

I have included a random intercept representation based on Sorensen et al. 2016 's tutorial.

My Stan code is as follows:

// Hierarchical model with:
// Multilevel model structure with level-1 and level-2 Covariates
// Four outcomes that are used to model a single latent variable
// First lambda (factor loading) is fixed to 1, and other loadings are restricted to positive values
// based on domain knowledge and to ensure identification and avoid rotation invariance

data {
int N; //number of data points
int J; //Number of outcomes
matrix[N,J] y; //Outcomes
int ID; //Total number of groups in level-2 structure
int<lower=1,upper=ID> plevel[N]; // Group number
int CovN; // Number of covariates created using model.matrix
matrix[N,CovN] CovX; // Covariates matrix
}

parameters {
vector<lower=0>[J] sigma_error; // SD of outcomes’ distribution
vector[ID] fixed_eff; // Fixed effects vector for latent equation
real<lower=0> sigma_fixed_eff;

vector[J] lambda; // Final factor loading for each outcome
real latent_intercept;
vector[CovN] beta_CovX;

}

model
{
real MuY;
real latent;

// Prior distribution
sigma_fixed_eff ~ cauchy(0,1); // Hyper-prior for fixed effects for each group
fixed_eff ~ normal(0, sigma_fixed_eff);

beta_CovX ~ normal(0,1); // Prior for beta of covariates

sigma_error[1] ~ cauchy(0,1); // Each outcome has its own residual variance
sigma_error[2] ~ cauchy(0,1);
sigma_error[3] ~ cauchy(0,1);
sigma_error[4] ~ cauchy(0,1);

lambda[1] ~ normal(1,0.001); // The factor loading of first outcome is fixed to resolve rotation invariance
lambda[2] ~ normal(1,1); // All outcomes are scaled and centered, and we know that they have a positive relationship with latent variable based on domain knowledge
lambda[3] ~ normal(1,1);
lambda[4] ~ normal(1,1);

for (i in 1:N)
{
latent = latent_intercept + fixed_eff[plevel[i]] + dot_product(beta_CovX, CovX[i,]) ;
for (j in 1:J)
{
MuY = lambda[j] * latent ; // Latent to outcome(s) relationship
y[i,j] ~ normal(MuY, sigma_error[j]); // Each outcome is univariate normal
}
}
}

As shown in code, I skipped the error term for the latent trait parameter to avoid getting the syntax error: Exception: normal_lpdf: Random variable is nan, but must not be nan!

For different number of iterations (500, 1000,…, 5000), I keep getting results where my MCMC chains do not converge and are parallel for many parameters, as seen in figure:

I wonder where I am going wrong in my code above. If I convert my multivariate outcome to a univariate outcome and represent the same equation using the below code, I do get a model that good convergence, but I am not sure if this heuristic approach is correct. The trick below is to copy the input matrix four times, and have the outcomes serially in a single vector (so number of observations is 4x, but the number of outcomes is 1). One reference where this trick is used is for bivariate outcomes in Morrison et al. 2016

model
{
real MuY;
real latent;

// Prior distribution
sigma_fixed_eff ~ cauchy(0,1); // Hyper-prior for fixed effects for each group
fixed_eff ~ normal(0, sigma_fixed_eff);

beta_CovX ~ normal(0,1); // Prior for beta of covariates

sigma_error[1] ~ cauchy(0,1); // Each outcome has its own residual variance
sigma_error[2] ~ cauchy(0,1);
sigma_error[3] ~ cauchy(0,1);
sigma_error[4] ~ cauchy(0,1);

lambda[1] ~ normal(1,0.001); // The factor loading of first outcome is fixed to resolve rotation invariance
lambda[2] ~ normal(1,1); // All outcomes are scaled and centered, and we know that they have a positive relationship with latent variable based on domain knowledge
lambda[3] ~ normal(1,1);
lambda[4] ~ normal(1,1);

for (i in 1:(N/4))
{
latent = latent_intercept + fixed_eff[plevel[i]] + dot_product(beta_CovX, CovX[i,]) ;
y[i] ~ normal(lambda[1]*latent, sigma_error[1]);
}

for (i in ((N/4) + 1) :(N/2))
{
latent = latent_intercept + fixed_eff[plevel[i]] + dot_product(beta_CovX, CovX[i,]) ;
y[i] ~ normal(lambda[2]*latent, sigma_error[2]);
}

for (i in ((N/2) + 1) :(3*N/4))
{
latent = latent_intercept + fixed_eff[plevel[i]] + dot_product(beta_CovX, CovX[i,]) ;
y[i] ~ normal(lambda[3]*latent, sigma_error[3]);
}

for (i in ((3*N/4) + 1) :N)
{
latent = latent_intercept + fixed_eff[plevel[i]] + dot_product(beta_CovX, CovX[i,]) ;
y[i] ~ normal(lambda[4]*latent, sigma_error[4]);
}

}

Looking forward to your advise as to whether the latter approach has a major flaw in logic.

Thanks,
Karthik S


#2

One issue that jumps out is the way in which you are fixing lambda[1]. I’m not sure if this is causing your problem, but it could be related.

Rather than specifying a tight prior, you should define lambda_raw[j-1] in the parameters block and then define lambda[j] = append_row([1.0]', lambda_raw] in the transformed parameter block. Having such a tight prior on a parameter can cause issues in adaptation and fitting.


#3

Thanks. Will try that out. Amazing way to do this. I was trying out all kinds of stuff before this (putting lambda in init and all). I got this idea of fitting a tight prior from this discussion: https://groups.google.com/forum/#!topic/stan-users/UMl1jWLGkx0

Best,
Karthik S


#4

That should probably be just append_row(1, lambda_raw). We added a scalar option recently.

The only reason to ever write 1.0 in Stan is to avoid integer division rounding; otherwise, you can use 1 anywhere you can use 1.0.