Simulated based calibration - unexpected correlations

Hi,
I’m trying to do some proper Bayesian work (following Betancourt workflow) so I simulated data and then I fit them on the same model. The model formula would have been written as
y ~ -1+analyst + (analyst|p) + (analyst|u)

The problem is that the model estimates are not aligned with the used priors I set to generate the data.

So I get these results out of 50 sets,

  • Blue points: Mean of each posterior
  • Black intervals: HDI of the blue points
  • Orange is the priors I used to generate the data.

The issue is that the Omegas had to be close to zero as I generated the data using LKG(10) but the summary show a clear (weak) correlation. By using LKG(2) the correlation is like -0.7.

Any ideas why the random effects are correlated when the data generative process indicate no correlation?

Thanks for your time

Here is the Stan model

/*
Fitting model:
Log-normal Multilevel model - 
y ~ -1+analyst + (analyst|period) + (analyst|unit)
*/

// Observed variables 
data {
  int<lower=0> N;
  int<lower=0> n_analyst;
  int analyst[N];
  real analyst_sd; //Analyst error
  
  // Varying intercepts
  // period
  int<lower=1> P;                   //number of periods
  int<lower=1,upper=P> period[N];  //period id
  
  // item
  int<lower=1> U;                   //number units
  int<lower=1,upper=U> unit[N];     //unit id  
  
  // Parameters for the given priors
  real sigma_mu; //Random error
  real sigma_sd;  // Random error variation 
  
  real sigma_p_mu_o; //Random error
  real sigma_p_sd_o;  // Random error variation 
  real sigma_p_mu_i; //Random error
  real sigma_p_sd_i;  // Random error variation 
  
  real sigma_u_mu_o; //Random error
  real sigma_u_sd_o;  // Random error variation 
  real sigma_u_mu_i; //Randome error
  real sigma_u_sd_i;  // Random error variation 
  
  // Period effect by analyst 1
  real period_mu_o;
  real period_sd_o;
  // Period effect  by analyst 2
  real period_mu_i;
  real period_sd_i;

  // Unit effect
  real unit_mu;
  real unit_sd;
  
  // Correlation 
  real eta_p;
  real eta_u;
  
  real analyst_mu1; //Set the mean of 1st analyst
  real analyst_mu2;//Set the mean of 2nd analyst
  
  vector[N] y; // Response
}

parameters {
  // analyst
  vector[n_analyst] analyst_b;
 
  real<lower=0> sigma;   
  
  // sd for random effects - There're of size 2 because they are 2 analysts 
  vector<lower=0>[2] sigma_p;       // period errors
  vector<lower=0>[2] sigma_u;       // unit errors 

  // declare L_u to be the Choleski factor of a 2x2 correlation matrix
  // 2x2 because they are two type of uncertainties (2 analysts)
    // for periods
    cholesky_factor_corr[2] Lu_p;
    // for units
    cholesky_factor_corr[2] Lu_u;  
  
  // random effect matrix
    // Size of 2 to have the intercept and slope 
      // for periods
      // Size of P to get the intercept and slope for each period seperately
      matrix[2,P] zu_p;                  
    
      // for items
      // The size of z_ui is follwing the same logic to zu_p
      matrix[2,U] zu_u;     
}

transformed parameters {
  // this transform random effects so that they have the correlation
  // matrix specified by the correlation matrix above
  
  // period
  matrix[2,P] up;
  up = diag_pre_multiply(sigma_p, Lu_p) * zu_p;
  
  // item
  matrix[2,U] uu;
  uu = diag_pre_multiply(sigma_u, Lu_u) * zu_u;
}

model {
  real mu; // conditional mean of the dependent variable
      
  //priors
    // LKJ prior for the correlation matrix
    Lu_p ~ lkj_corr_cholesky(eta_p); 
    Lu_u ~ lkj_corr_cholesky(eta_u); 
  to_vector(zu_p) ~ normal(0,1);
  to_vector(zu_u) ~ normal(0,1);
  
  // prior for the dispersion and the standard deviation
  sigma ~ lognormal(sigma_mu, sigma_sd);  
  sigma_p[1] ~ lognormal(sigma_p_mu_o, sigma_p_sd_o);  
  sigma_p[2] ~ lognormal(sigma_p_mu_i, sigma_p_sd_i); 
  sigma_u[1] ~ lognormal(sigma_u_mu_o, sigma_u_sd_o);     
  sigma_u[2] ~ lognormal(sigma_u_mu_i, sigma_u_sd_i); 
  
  // prior for fixed-effect
  analyst_b[1] ~ normal(analyst_mu1, analyst_sd);    
  analyst_b[2] ~ normal(analyst_mu2, analyst_sd); 


  // Simulate data from observational model
  for(n in 1:N){         
  // The model    
    mu =  analyst_b[analyst[n]] + //analyst effect
          up[1,period[n]] +       // period intercept
          up[2,period[n]]*analyst[n] + // deviation from period intercept
          uu[1,unit[n]] +          // unit intercept 
          uu[2,unit[n]]*analyst[n];// deviation from unit intercept

  //likelihood  
  y[n] ~ lognormal(mu, sigma);
  }
}

generated quantities {
  // get the correlation matrix back
  matrix[2, 2] Omega_p= Lu_p * Lu_p';
  matrix[2, 2] Omega_u= Lu_u * Lu_u';
}

The data generating process doesn’t have them correlated, but you should expect correlation in the posterior when random effects inform the same outcome. This is easier to see in simpler models, like a trivial linear regression y[n] ~ normal(alpha + beta * x[n], sigma) in the case where all the x[n] are positive. Even if you give alpha and beta independent priors, you will see negative correlation between alpha and beta in the posterior because increasing alpha can be compensated somewhat by decreasing beta and vice-versa. Whenever you have two parameters that can explain the same movement, you get correlation in the posterior. Same thing happens if you do a simple varying effects regression on say, income and age, y[n] ~ normal(alpha_income[income[n]] + alpha_age[age[n]] + ..., sigma). Here you will find alpha_income and alpha_age correlated in that you can add to alpha_income and subtract from alpha_age and get the same answer (this won’t happen if you identify with sum-to-zero or if you pin one of the values to zero).

In your example, you are getting three different slopes for analyst, so I would expect the fixed effect for analyst and the two varying effects based on period and unit to all be pairwise negatively correlated.

3 Likes

Hi, thanks for replying.
Do you have any references discussing this matter?
Should I look for a solution of this issue? Isn’t this a sign of biased model?

I would imagine most books on regression in a Bayesian setting would discuss this. Rather than a reference, I suggest you just simulate data and try it yourself. You will also find that if you are doing multiple regression with multiple covariates and the covariates are correlated, then you get correlation in the posterior for their estimated coefficients.

No. As the simulation-based calibration papers lay out, Bayesian models are well calibrated by construction when they are well specified, by which I mean the data comes from the model’s data generating process. You can see the issue with posterior correlation even with well-specified models from which data is generated according to the model.