Asymmetric Gaussian Hierarchical Model

Hello,

I am trying to fit a hierarchical model for longitudinal data. The mean function is specified as an asymmetric Gaussian function (6 parameters, all positive), which enters as the mean of a normal likelihood so that the observations are stochastic. I have data for multiple groups (curves), and when I fit the model to a single group it fits fine as long as I specify relatively informative priors. However, there is strong correlation between some of the parameters, and I would like to model the data hierarchically using the LKJ prior. The MCMC takes quite a while to run (1.5 hours for N = 5000 observations), and for most of the parameters the effective sample size is < 10. I have a few questions that I was hoping someone could help me with.
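
For reference, the mean function implied by the code below is

$$
f(d) = \begin{cases}
\theta_1 + (\theta_3 - \theta_1)\exp\left(-\frac{(d - \theta_4)^2}{2\theta_5^2}\right) & d < \theta_4 \\
\theta_2 + (\theta_3 - \theta_2)\exp\left(-\frac{(d - \theta_4)^2}{2\theta_6^2}\right) & d \ge \theta_4
\end{cases}
$$

where $\theta_1$ and $\theta_2$ are the left and right baselines, $\theta_3$ is the peak height, $\theta_4$ is the peak day, and $\theta_5$ and $\theta_6$ are the left and right widths.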

  1. What is the best way to specify that the parameters should be positive? I have seen nonlinear examples (e.g., Gelman, Lee, & Guo, 2015) where all the parameters are log-transformed. I have also seen hard constraints declared in the parameters block with uniform or lognormal priors, and I have seen parameters defined on the whole real line with normal priors that put little mass below zero (see the sketches after question 2). The speed of the sampling seems to vary considerably depending on the choice here. All of my parameters are greater than zero, and some are bounded between 0 and 1. Right now I am putting informative normal hyperpriors on those (see the mu_theta priors in the code), but perhaps this is causing some of the problems?

  2. Also, I am getting very small effective sample sizes for the parameters of the hierarchical covariance matrix. I think I have the LKJ prior specified in the efficient Cholesky form, and I have a diffuse prior (eta = 1) on the correlation structure, but maybe this is part of the problem as well.
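
For context, here are minimal sketches of the three options from question 1 (illustrative fragments with placeholder priors; each assumes K is declared in the data block):

// Option 1: parameterize on the log scale (as in Gelman, Lee, & Guo, 2015)
parameters {
  vector[K] log_theta;
}
transformed parameters {
  vector[K] theta;
  theta = exp(log_theta); // positive by construction
}
model {
  log_theta ~ normal(0, 1); // prior lives on the log scale, so no Jacobian is needed
}

// Option 2: hard constraint plus a positive-support prior
parameters {
  vector<lower=0>[K] theta;
}
model {
  theta ~ lognormal(0, 1); // or omit for an implicit flat prior on (0, infinity)
}

// Option 3: unconstrained, with a prior putting little mass below zero
parameters {
  vector[K] theta;
}
model {
  theta ~ normal(5, 1); // placeholder location and scale
}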

Any suggestions would be much appreciated. I can also post a subset of the data if that would be helpful as well.

Thank you,
Colin

data {
  int N; //total number of obs
  int G; //number of groups
  int<lower=1, upper=G> group[N]; // group indicator for each observation
  int d[N]; // day of year for each observation
  real y[N]; // observed value of T
  int<lower=1> K; // number of parameters 
}

parameters {
  cholesky_factor_corr[K] Lcorr; // Cholesky factor of the correlation matrix
  vector<lower=0>[K] sigma; // standard deviations
  vector[K] theta[G]; // parameters for each group
  real<lower=0> sigma_e[G]; // separate error term for each group
  vector[K] mu_theta; // means for the hierarchical parameters
}

model {  
  vector[N] f; //Mean at Each Day
  sigma ~ cauchy(0, 2.5); // prior on standard deviations
  Lcorr ~ lkj_corr_cholesky(1); // LKJ prior on the Cholesky factor of the correlation matrix
  sigma_e ~ cauchy(0, 1);
  
 
  //informative hyperpriors
  mu_theta[1] ~ normal(.1, .5);  
  mu_theta[2] ~ normal(.2, .5);
  mu_theta[3] ~ normal(.5, .5);
  mu_theta[4] ~ normal(225, 30);
  mu_theta[5] ~ normal(20, 25);
  mu_theta[6] ~ normal(50, 25);
  
  theta ~ multi_normal_cholesky(mu_theta, diag_pre_multiply(sigma, Lcorr)); // hierarchical prior on the K parameters
  
  //Likelihood
  for (i in 1:N) {
    if (d[i] < theta[group[i], 4]) {
      f[i] = theta[group[i], 1] + (theta[group[i], 3] - theta[group[i], 1]) * exp(-((d[i] - theta[group[i], 4])^2) / (2 * theta[group[i], 5]^2));
    } else {
      f[i] = theta[group[i], 2] + (theta[group[i], 3] - theta[group[i], 2]) * exp(-((d[i] - theta[group[i], 4])^2) / (2 * theta[group[i], 6]^2));
    }
    y[i] ~ normal(f[i], sigma_e[group[i]]);
  }
}

generated quantities {
  matrix[K, K] Omega; // recovered correlation matrix
  matrix[K, K] Sigma; // recovered covariance matrix
  Omega = multiply_lower_tri_self_transpose(Lcorr);
  Sigma = quad_form_diag(Omega, sigma);
}

Make theta a transformed parameter (or a local variable in the model block) such that

theta[j] = mu_theta + L * theta_raw[j];

where L = diag_pre_multiply(sigma, Lcorr) and theta_raw is declared in the parameters block and has independent standard normal priors.

Hey Ben,

Thank you for the quick response. I think I understand your suggestion for the non-centered parameterization of the hierarchical distribution. However, with my changes, I am getting an error that the sampler failed to be created. Per your note, I added the following to the transformed parameters block to define the means of the 6 parameters that come from a multivariate normal distribution. I then put N(0, 1) priors on theta_raw and kept the previous priors I had for mu_theta. Any thoughts on what I am missing?

Thanks again,
Colin

data {
  int N; //total number of obs
  int G; //number of groups
  int<lower=1, upper=G> group[N]; // group indicator for each observation
  int d[N]; // day of year for each observation
  real y[N]; // observed value of T
  int<lower=1> K; // number of parameters 
}

parameters {
  cholesky_factor_corr[K] Lcorr;  
  vector<lower=0>[K] sigma; // Standard Deviations
  vector[K] theta[G]; // parameters for each group
  real<lower=0> sigma_e[G]; // separate error term for each group
  vector[K] theta_raw; // standardized deviations for the non-centered parameterization
  vector[K] mu_theta; // means for the hierarchical parameters
}

transformed parameters {
  vector[K] theta_transform;
  theta_transform = mu_theta + (diag_pre_multiply(sigma, Lcorr) * theta_raw);
}

model {  
  vector[N] f; //Mean at Each Day
  sigma ~ cauchy(0, 2.5); //Prior on Standard Deviations
  Lcorr ~ lkj_corr_cholesky(1);
  sigma_e ~ cauchy(0,1);
  
  theta_raw ~ normal(0, 1);
  
  mu_theta[1] ~ normal(.1, .5);
  mu_theta[2] ~ normal(.2, .5);
  mu_theta[3] ~ normal(.5, .5);
  mu_theta[4] ~ normal(225, 30);
  mu_theta[5] ~ normal(20, 25);
  mu_theta[6] ~ normal(50, 25);
  
  theta ~ multi_normal_cholesky(theta_transform, diag_pre_multiply(sigma, Lcorr));
   
  //Likelihood
  for (i in 1:N) {
    if (d[i] < theta[group[i], 4]) {
      f[i] = theta[group[i], 1] + (theta[group[i], 3] - theta[group[i], 1]) * exp(-((d[i] - theta[group[i], 4])^2) / (2 * theta[group[i], 5]^2));
    } else {
      f[i] = theta[group[i], 2] + (theta[group[i], 3] - theta[group[i], 2]) * exp(-((d[i] - theta[group[i], 4])^2) / (2 * theta[group[i], 6]^2));
    }
    y[i] ~ normal(f[i], sigma_e[group[i]]);
  }
}

Get rid of the theta ~ multi_normal_cholesky(theta_transform, ...) line, since you have reparameterized into theta_raw. But I think you need one theta vector for each group, and hence one theta_raw vector for each group.
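
Spelled out, that would look something like this sketch (same priors and likelihood as in your model):

parameters {
  cholesky_factor_corr[K] Lcorr;
  vector<lower=0>[K] sigma; // standard deviations
  vector[K] mu_theta; // means for the hierarchical parameters
  vector[K] theta_raw[G]; // one vector of standard normal deviates per group
  real<lower=0> sigma_e[G];
}

transformed parameters {
  vector[K] theta[G];
  for (j in 1:G)
    theta[j] = mu_theta + diag_pre_multiply(sigma, Lcorr) * theta_raw[j];
}

model {
  for (j in 1:G)
    theta_raw[j] ~ normal(0, 1); // implies theta[j] ~ multi_normal(mu_theta, Sigma)
  // ... priors on mu_theta, sigma, Lcorr, sigma_e and the likelihood as before
}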

Thanks, Ben! I see what you mean. That trick sped up the MCMC from 74 to 24 minutes. Unfortunately, I am still getting poor mixing. All my parameters are positive, so I am going to add vector<lower=0>[K] theta_raw[G]; as a constraint. I know the RStan GitHub page has some recommended priors; for non-linear models the parameters seem to need fairly strong priors to avoid unstable solutions. Do you or anyone else have general suggestions for helping with convergence? A few of the parameters are very highly correlated (rho > 0.8), which is why I am using a multivariate normal for the hierarchical distribution, with an LKJ prior and eta set to 1 to allow a non-diagonal covariance matrix.

Thanks,
Colin

parameters {
  cholesky_factor_corr[K] Lcorr;
  vector<lower=0>[K] sigma; // standard deviations
  real<lower=0> sigma_e; // model error term
  vector<lower=0>[K] theta_raw[G];
  vector[K] mu_theta;
}

transformed parameters {
  vector<lower=0>[K] theta[G];
  for (j in 1:G) {
    theta[j] = mu_theta + (diag_pre_multiply(sigma, Lcorr) * theta_raw[j]);
  }
}

That is because you are rejecting leapfrog steps with negative values of theta. It is probably better to make theta_raw unrestricted and do

transformed parameters {
  vector[K] theta[G];
  {
    matrix[K, K] L = diag_pre_multiply(sigma, Lcorr);
    for (j in 1:G) theta[j] = exp(mu_theta + L * theta_raw[j]);
  }
}

Thanks, Ben. I will give this a shot. I assume I will then have to insert log(theta[group[i], 1]) and so on into the likelihood to undo the exp transform?

No. I thought the point was to have theta values that were strictly positive?

Yes, you are correct. My mistake. Thanks again for the help.

~Colin

Hi Ben,

I am still having convergence issues with the multivariate hierarchical distribution, so I simplified the model by assuming independence across the hierarchical parameters. I tried your suggestion, as well as a few others, but am still getting lots of divergent transitions. The data are pretty informative, and the model fits fine when I fit groups individually with very diffuse priors (i.e., constrain the parameters to be positive and leave a flat prior over the positive reals).

Yet once I scale it up to a hierarchical model, the HMC seems to have trouble exploring the parameter space. I tried your suggestion of using a link function to ensure all the parameters are positive, but this results in thousands of divergent transitions. My other thought was to assume all the parameters are normal and simply put priors on the group means that place little mass below zero.

Below is my code with the link function to ensure all the parameters are positive. Is there anything obvious that I am missing here? My only thought was maybe to also play with the priors on the group standard deviations, but I doubt that is going to have much of an impact.

Thanks,
Colin

data {
  int N; //total number of obs
  int G; //number of groups
  int<lower=1, upper=G> group[N]; // group indicator for each observation
  int d[N]; // day of year for each observation
  real y[N]; // observed value of T
  int<lower=1> K; // number of parameters 
}

parameters {
  
  //population mean
  real<lower=0> mu1;
  real<lower=0> mu2;
  real<lower=0> mu3;
  real<lower=0> mu4;
  real<lower=0> mu5;
  real<lower=0> mu6;
  
  //population sd
  real<lower=0> tau1;
  real<lower=0> tau2;
  real<lower=0> tau3;
  real<lower=0> tau4;
  real<lower=0> tau5;
  real<lower=0> tau6;
  
  //Group level errors
  vector[G] eta1;
  vector[G] eta2;
  vector[G] eta3;
  vector[G] eta4;
  vector[G] eta5;
  vector[G] eta6;
  
  vector<lower=0> [G] sigma; //model error sd
}

transformed parameters{
  vector[G] theta1;
  vector[G] theta2;
  vector[G] theta3;
  vector[G] theta4;
  vector[G] theta5;
  vector[G] theta6;
  
  theta1 = exp(mu1 + tau1*eta1);
  theta2 = exp(mu2 + tau2*eta2);
  theta3 = exp(mu3 + tau3*eta3);
  theta4 = exp(mu4 + tau4*eta4);
  theta5 = exp(mu5 + tau5*eta5);
  theta6 = exp(mu6 + tau6*eta6);
  
}

model {  
  vector[N] f;
  
  //Likelihood
  for (i in 1:N) {
    if (d[i] < theta4[group[i]]) {
      f[i] = theta1[group[i]] + (theta3[group[i]] - theta1[group[i]]) * exp(-((d[i] - theta4[group[i]])^2) / (2 * theta5[group[i]]^2));
    } else {
      f[i] = theta2[group[i]] + (theta3[group[i]] - theta2[group[i]]) * exp(-((d[i] - theta4[group[i]])^2) / (2 * theta6[group[i]]^2));
    }
    y[i] ~ normal(f[i], sigma[group[i]]);
  }
  
  // implicit uniform prior on the mu's over the positive real line (via the <lower=0> declarations)
  
  //priors on population sd
  tau1 ~ cauchy(0,1);
  tau2 ~ cauchy(0,1);
  tau3 ~ cauchy(0,1);
  tau4 ~ cauchy(0,1);
  tau5 ~ cauchy(0,1);
  tau6 ~ cauchy(0,1);
  
  eta1 ~ normal(0,1);
  eta2 ~ normal(0,1);
  eta3 ~ normal(0,1);
  eta4 ~ normal(0,1);
  eta5 ~ normal(0,1);
  eta6 ~ normal(0,1);
  
  //error sd prior
  sigma ~ cauchy(0,1);
  
}

Half-Cauchy is an odd thing to believe about standard deviations and an invitation to have divergent transitions.

Gotcha. I was trying to be uninformative, and BDA3 has some examples where a half-Cauchy is used on the standard deviation of the hyperparameters. Maybe I will try something with heavier tails.

Thanks,
Colin

No, you want less heavy tails so that the prior expectations exist.
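
For example, with the real<lower=0> declarations already in place, either of these gives a lighter-tailed half prior whose expectations exist (the scales here are placeholders):

  tau1 ~ normal(0, 1); // half-normal
  tau1 ~ student_t(4, 0, 1); // half-t; with 4 df the mean and variance are finite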

Hi Ben,

Oops. Yes, that’s what I meant. I will try some of the suggested priors from the Stan GitHub page to help better regularize the posterior. Thanks!

Maybe we should take it out of our examples! It’s really time to go through and revise them all.

Hi Bob,

Yes, I started out with a half-Cauchy since that is often recommended as a weakly informative distribution for hierarchical scale parameters. However, I only have 28 groups in my hierarchical model, and I see the Stan team now suggests t or normal priors on the sd, especially in cases where there are few groups.

It’s more that the Cauchy lets you have extreme values; if those aren’t reasonable, it’s much better to avoid the long tails. Earlier, we hadn’t quite realized how sensitive MCMC was to those long tails.

Hi Bob,

Thanks for the suggestion. When fitting the model to individual groups, I now get nice mixing. However, I am still having convergence issues when I put the model into a hierarchical framework. To keep things simple, at first I am assuming the parameters come from independent normal distributions. I am using the non-centered parameterization, and because all of the parameters are positive, per bgoodri’s suggestion above I specified the following in the transformed parameters block:

transformed parameters {
  vector[G] theta1;
  vector[G] theta2;
  vector[G] theta3;
  vector[G] theta4;
  vector[G] theta5;
  vector[G] theta6;
  
  theta1 = exp(log_mu1 + tau1*eta1);
  theta2 = exp(log_mu2 + tau2*eta2);
  theta3 = exp(log_mu3 + tau3*eta3);
  theta4 = exp(log_mu4 + tau4*eta4);
  theta5 = exp(log_mu5 + tau5*eta5);
  theta6 = exp(log_mu6 + tau6*eta6);
}

The eta’s are the N(0,1) variables, and the tau’s are the standard deviations of the hierarchical parameters. I am wondering, however, how best to specify priors on the log_mu’s. Previously, when fitting the model for individual groups, I used the following priors. I can’t standardize mu4 because it corresponds to the day of the year and is thus an integer from 0 to 365. mu5 and mu6 are scale parameters, and I had a diffuse half-t distribution on those.

  mu1 ~ normal(0, 1);
  mu2 ~ normal(0, 1);
  mu3 ~ normal(0, 1);
  mu4 ~ normal(225, 15);  
  mu5 ~ student_t(4, 0 , 10); 
  mu6 ~ student_t(4, 0, 10); 

For the hierarchical model with the exp transformation, I am putting these priors on the log scale and have been trying the following, but without much luck.

  log_mu1 ~ normal(-1, 1);
  log_mu2 ~ normal(-1, 1);
  log_mu3 ~ normal(-1, 1);
  log_mu4 ~ normal(4, 2);  
  log_mu5 ~ normal(3, 5);
  log_mu6 ~ normal(3, 5); 
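
One way I can sanity-check these is to moment-match the natural-scale priors through the log using the delta method: for a positive X with mean m and sd s, sd(log X) ≈ s/m. For example, the old normal(225, 15) prior on mu4 would translate to roughly

  log_mu4 ~ normal(log(225), 15.0 / 225); // approximately normal(5.42, 0.07), far tighter than normal(4, 2)

(This matching breaks down when the natural-scale mean is near zero, as for mu1 through mu3.)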

Because the model is non-linear, diffuse priors lead to even worse MCMC diagnostics. I just wanted to check whether these transformations make sense, or whether there are better priors to put on the hierarchical means. Given that the model fits well for individual groups, I am surprised the hierarchical extension is having trouble. Any suggestions would be much appreciated.

Thanks,
Colin

Are you fitting with a non-centered hierarchical parameterization? And the hierarchical parts are on the log_mu and tau and eta?

Introducing additional intercepts can also be problematic for identifiability, even if they’re given priors for soft identifiability.

Hi Bob,

Yes, below is the full Stan code for the model, and yes, I am fitting it with a non-centered hierarchical parameterization. I assume two intercepts (mu1 and mu2) because the left and right sides of the Gaussian curve have different levels. The parameters mu5 and mu6 are the two scales that determine the shape of the left and right sides of the curve, and the other two parameters are the height of the curve (mu3) and the day of the year where it reaches its peak (mu4). Because all of the parameters are constrained to be positive, I used the exp link around the hierarchical distribution before putting the parameters back into the likelihood. I wonder, however, if this scale is causing big jumps in the sampler. I saw another post where you urged keeping parameters on the log scale as long as possible for numerical stability. Does this explanation help?

data {
  int N; //total number of obs
  int G; //number of pixels
  int<lower=1, upper=G> pixel[N]; // pixel indicator for each observation
  int d[N]; // day of year for each observation
  real y[N]; // observed value of T
  int<lower=1> K; // number of parameters 
}

parameters {
  
  //population mean (log scale)
  real log_mu1;
  real log_mu2;
  real log_mu3;
  real log_mu4;
  real log_mu5;
  real log_mu6;
  
  //population sd for log parameters 
  real<lower=0> log_tau1;
  real<lower=0> log_tau2;
  real<lower=0> log_tau3;
  real<lower=0> log_tau4;
  real<lower=0> log_tau5;
  real<lower=0> log_tau6;
  
  //Group level errors
  vector[G] eta1;
  vector[G] eta2;
  vector[G] eta3;
  vector[G] eta4;
  vector[G] eta5;
  vector[G] eta6;
  
  vector<lower=0>[G] sigma; //model error sd
}

transformed parameters {
  vector[G] theta1;
  vector[G] theta2;
  vector[G] theta3;
  vector[G] theta4;
  vector[G] theta5;
  vector[G] theta6;
  
  theta1 = exp(log_mu1 + log_tau1*eta1);
  theta2 = exp(log_mu2 + log_tau2*eta2);
  theta3 = exp(log_mu3 + log_tau3*eta3);
  theta4 = exp(log_mu4 + log_tau4*eta4);
  theta5 = exp(log_mu5 + log_tau5*eta5);
  theta6 = exp(log_mu6 + log_tau6*eta6);
  
}


model {  
  vector[N] f;
  
  //Likelihood
  for (i in 1:N) {
    if (d[i] < theta4[pixel[i]]) {
      f[i] = theta1[pixel[i]] + (theta3[pixel[i]] - theta1[pixel[i]]) * exp(-((d[i] - theta4[pixel[i]])^2) / (2 * theta5[pixel[i]]^2));
    } else {
      f[i] = theta2[pixel[i]] + (theta3[pixel[i]] - theta2[pixel[i]]) * exp(-((d[i] - theta4[pixel[i]])^2) / (2 * theta6[pixel[i]]^2));
    }
    y[i] ~ normal(f[i], sigma[pixel[i]]);
  }
     
  //priors on population means (log scale)
  log_mu1 ~ normal(-1, 1);
  log_mu2 ~ normal(-1, 1);
  log_mu3 ~ normal(-1, 1);
  log_mu4 ~ normal(4, 2);
  log_mu5 ~ normal(3, 5);
  log_mu6 ~ normal(3, 5);
  
  //priors on population sds
  log_tau1 ~ normal(0,1);
  log_tau2 ~ normal(0,1);
  log_tau3 ~ normal(0,1);
  log_tau4 ~ normal(0,5);
  log_tau5 ~ normal(0,5);
  log_tau6 ~ normal(0,5);
  
  eta1 ~ normal(0,1);
  eta2 ~ normal(0,1);
  eta3 ~ normal(0,1);
  eta4 ~ normal(0,1);
  eta5 ~ normal(0,1);
  eta6 ~ normal(0,1);
  
  //error sd prior
  sigma ~ cauchy(0,1);
  
}
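
As an aside, the six parallel copies of everything can be collapsed with arrays, which makes this kind of refactoring less error-prone. Here is a sketch of the same independent, non-centered, log-scale model (the likelihood is unchanged, and the priors shown are placeholders along the lines discussed above):

parameters {
  vector[K] log_mu; // population means on the log scale
  vector<lower=0>[K] tau; // population sds on the log scale
  vector[K] eta[G]; // standard normal group-level deviates
  vector<lower=0>[G] sigma; // per-pixel error sd
}

transformed parameters {
  vector<lower=0>[K] theta[G];
  for (j in 1:G)
    theta[j] = exp(log_mu + tau .* eta[j]); // elementwise; positive by construction
}

model {
  // log_mu priors as above; light-tailed half priors on the scales
  tau ~ normal(0, 1);
  sigma ~ normal(0, 1);
  for (j in 1:G)
    eta[j] ~ normal(0, 1);
  // ... likelihood as above, using theta[pixel[i]][1] through theta[pixel[i]][6]
}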