Hierarchical model with covariance relationship between parameters: applying different truncations

Hi everybody,

I have just recently taken my first steps in hierarchical Bayesian modeling with Stan. The forum has been a great resource for basic questions so far, but now I am facing an issue that I can’t wrap my head around:

I want to code a 3-parameter hierarchical model with subject-level random effects (nSubjects subjects) and group-level fixed effects. I also want to account for possible correlations between the parameters. To do so, I adapted code I found for a 2-parameter linear model:

data {
  int<lower=1> nSubjects;
  int<lower=1> nTrials;
  matrix<lower=0>[nSubjects, nTrials] x;
  matrix<lower=0>[nSubjects, nTrials] y;
}

parameters {
  // group-level parameters
  real<lower=0> sigma;
  real alpha;
  real beta;
  real gamma;
  
  vector<lower=0>[3] tau_u;
  matrix[3, nSubjects] z_u;
  cholesky_factor_corr[3] L_u;
}

transformed parameters {
  matrix[nSubjects, 3] u;
  u = (diag_pre_multiply(tau_u, L_u) * z_u)';
}

model {
  // priors
  target += cauchy_lpdf(sigma | 0,10) - cauchy_lccdf(0 | 0,10);
  target += normal_lpdf(alpha | 0,1000);
  target += normal_lpdf(beta | 0,10);
  target += normal_lpdf(gamma | 0,10);
  
  target += cauchy_lpdf(tau_u[1] | 0,50) - cauchy_lccdf(0 | 0,50);
  target += cauchy_lpdf(tau_u[2] | 0,50) - cauchy_lccdf(0 | 0,50);
  target += cauchy_lpdf(tau_u[3] | 0,50) - cauchy_lccdf(0 | 0,50);
 
  target += lkj_corr_cholesky_lpdf(L_u | 2);
  target += std_normal_lpdf(to_vector(z_u));
  
  // likelihood
  for (s in 1:nSubjects) {
    
    target += normal_lpdf(y[s] | (alpha + u[s, 1]) ./ (x[s] - (beta + u[s, 2])) + (gamma + u[s, 3]), sigma);
    
  }
}

The issue I’m having now is that, according to the published theory behind the model, the parameters should have different bounds, stated as follows:

alpha <lower = 0>
beta <upper = 0>
gamma <lower = 0>

This applies to both the group-level and the subject-level parameters. Setting bounds on the group-level parameters shouldn’t be too much of a problem. But how can you apply different truncations to subject-level parameters that are sampled jointly through a covariance structure?

Thank you in advance for any help!

In your model at present, u is a matrix of subjects’ deviations from the mean. What you want to do is instead make it a matrix of the subjects’ actual values by adding the mean (as you do in your expression of the likelihood; then remember to remove the addition of the means from the likelihood so it’s not being done twice). Then add a lower bound to the declaration of u.


Hi Mike,
thank you for your advice! I don’t quite see yet how this accounts for the beta parameter having an upper bound (only alpha and gamma have lower bounds). So the truncation is not uniform across the entire parameter matrix u; the bounds are column-specific (parameter-specific). Could you briefly explain this aspect?

Sure. Sorry, I usually provide code associated with recommendations but was afk all morning. Plus I see now you have more complicated constraints than I originally understood from my over-quick reading. Here’s what I’d suggest:

...
transformed parameters{
  matrix[nSubjects, 3] u = (diag_pre_multiply(tau_u, L_u) * z_u)';
  vector<lower=0>[nSubjects] alpha_u = alpha + u[,1] ;
  vector<upper=0>[nSubjects] beta_u = beta + u[,2] ;
  vector<lower=0>[nSubjects] gamma_u = gamma + u[,3] ;
}
model{
 ...
    // normal_lpdf is vectorized, so no need to loop over subjects
    target += normal_lpdf( y | alpha_u ./ (x - beta_u) + gamma_u, sigma);
}

Now that I look at your model, there’s a bunch of other stuff I’d recommend changing:

Presumably everywhere you have:

target += dist_lpdf(par | ...) - dist_lccdf(0| ...) ;

You’re intending the - dist_lccdf(0| ...) part to account for the positive constraint on par, but the constraint itself is already handled automatically by Stan through the declared lower bound, and the lccdf term is only a normalizing constant, so you can drop it entirely.
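For instance, with sigma declared as real<lower=0>, this on its own already gives you a half-Cauchy prior (the dropped lccdf term would only matter if you needed normalized densities, e.g. for bridge sampling):

target += cauchy_lpdf(sigma | 0, 10) ;
// equivalently: sigma ~ cauchy(0, 10) ;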

You have prior scales of 10, 1000, and 50, all of which may cause issues and suggest you either aren’t scaling your data or haven’t run prior predictive checks to nail things down properly. Generally Stan’s sampler works best if the data (outcomes & predictors) are scaled such that the parameters are expected to fall somewhere from -10 to +10, or even better from -1 to +1. With that scale of outcomes/predictors/parameters, the prior scales you’re using are going to imply that a ridiculously broad range of data are a priori credible.

Your use of Cauchy priors on the parameters representing standard deviations of a normal (sigma & tau) was popular and widely recommended in the past, but tends not to be recommended any longer: the heavy tails of the Cauchy can cause issues for sampling and, again, imply unreasonable data when checked with prior predictive checks. Most people now use student_t(3) or even normal(). I personally also think it’s unreasonable to use priors for variance-related parameters that peak at zero, so I tend to use something like weibull(2,1), which has a peak around .8, falls to zero at zero, and has about 2% of its mass above 2. Combined with data that is scaled to have an SD of 1, I think this makes for much more reasonable prior predictive performance than priors peaked at zero.
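If you want to see what a given prior implies, a quick trick is a standalone Stan file with only a generated quantities block, run with the fixed_param algorithm, and then summarizing the draws. A minimal sketch (assuming a standardized outcome, purely for illustration):

generated quantities {
  real tau_sim = weibull_rng(2, 1) ;    // one draw from the weibull(2,1) prior per iteration
  real y_sim = normal_rng(0, tau_sim) ; // prior predictive draw for a scaled outcome
}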


Oh, and you can get slightly more compact code by simply flipping the sign of beta, so that all of the by-subject values can be expressed with lower bounds. Here’s the whole thing re-written with my recommendations (plus fixing my error in thinking you could nix the loop over subjects):

data {
  int<lower=1> nSubjects;
  int<lower=1> nTrials;
  matrix<lower=0>[nSubjects, nTrials] x;
  matrix<lower=0>[nSubjects, nTrials] y;
}
transformed data{
  real y_mean = mean(to_vector(y)) ;
  real y_sd = sd(to_vector(y)) ;
  matrix[nSubjects,nTrials] y_scaled = (y-y_mean)/y_sd ;

  real x_mean = mean(to_vector(x)) ;
  real x_sd = sd(to_vector(x)) ;
  matrix[nSubjects,nTrials] x_scaled = (x-x_mean)/x_sd ;
}
parameters {
  // group-level parameters
  real<lower=0> sigma;
  row_vector<lower=0>[3] mu_u ;  
  vector<lower=0>[3] tau_u ;
  matrix[3, nSubjects] z_u ;
  cholesky_factor_corr[3] L_u ;
}
transformed parameters {
  matrix<lower=0>[nSubjects, 3] u = rep_matrix(mu_u,nSubjects) + transpose(diag_pre_multiply(tau_u, L_u) * z_u) ;
}
model {
  // priors
  target += weibull_lpdf(sigma | 2,1) ;
  target += std_normal_lpdf(mu_u) ;  
  target += weibull_lpdf(tau_u | 2,1) ;
  target += lkj_corr_cholesky_lpdf(L_u | 2) ;
  target += std_normal_lpdf(to_vector(z_u)) ;  
  // likelihood
  for( s in 1:nSubjects){
    target += normal_lpdf( y_scaled[s] | u[s,1] ./ (x_scaled[s] + u[s,2]) + u[s,3], sigma); //note change to x+beta (previously x-beta)
  }
}
generated quantities{
  // unscaled versions of parameters
  real gamma = mu_u[3] * y_sd + y_mean; //gamma serves as an intercept, so scale and add back the mean
  real alpha = mu_u[1] * y_sd * x_sd ; //alpha is a covariate-dependent scalar, so need to double-scale
  real beta = x_mean - mu_u[2] * x_sd ; // beta is on the x scale; u[,2] enters with flipped sign, so un-flip, un-scale by x_sd, and shift by x_mean
}

Amazing! I guess I was so focused on applying the literature-based model expression that I didn’t see the forest for the trees. Flipping the sign of beta absolutely does the trick in my case.
Also, thank you for the input on SD priors. I just looked into the Weibull distribution and your reasoning absolutely makes sense.
Scaling the data for Bayesian modeling is uncharted territory for me, but I see where you are coming from. I will have to look into that subject more.

Thank you for the comprehensive feedback!

It still may be faster to compute mu in a loop and then make one vectorized call to normal, because of the derivatives.

Hm, I don’t follow; mind posting code for what you’re suggesting as an alternative? I’m very curious to hear there’s an optimization I haven’t heard of yet!

Not really an optimization, and I’m not sure if this helps in the normal case. For example, with the multivariate normal, precomputing mu in a loop and then calling multi_normal_cholesky once on the whole array is faster, because the adjoint is computed with matrix algebra and only needs to be done once per iteration.
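For concreteness, the pattern I mean looks roughly like this (a toy multivariate regression, not your model; all names here are just illustrative):

data {
  int<lower=1> N;
  int<lower=1> K;
  matrix[N, K] X;
  vector[K] y[N];
}
parameters {
  vector[K] alpha;
  matrix[K, K] B;
  cholesky_factor_corr[K] L_Omega;
  vector<lower=0>[K] tau;
}
model {
  vector[K] mu[N];
  for (n in 1:N)
    mu[n] = alpha + B * X[n]' ;  // precompute each observation's mean vector in a loop
  target += std_normal_lpdf(alpha) ;
  target += std_normal_lpdf(to_vector(B)) ;
  target += weibull_lpdf(tau | 2, 1) ;
  target += lkj_corr_cholesky_lpdf(L_Omega | 2) ;
  // a single vectorized call, so the covariance-related adjoint work happens once per iteration
  target += multi_normal_cholesky_lpdf(y | mu, diag_pre_multiply(tau, L_Omega)) ;
}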

In this case, I’m curious whether filling mu in a loop in the model block and then making a single vectorized call to normal is faster or the same. It could well be the same, but there are cases, depending on the implementation in Stan math, where it’s better to precompute in a loop and then call the vectorized version of the distribution.
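And the univariate version of that pattern, again just as a toy sketch (simple linear mean, illustrative names only):

data {
  int<lower=1> N;
  vector[N] x;
  vector[N] y;
}
parameters {
  real a;
  real b;
  real<lower=0> sigma;
}
model {
  vector[N] mu;
  for (n in 1:N)
    mu[n] = a + b * x[n] ;               // fill mu in a loop in the model block
  target += std_normal_lpdf(a) ;
  target += std_normal_lpdf(b) ;
  target += weibull_lpdf(sigma | 2, 1) ;
  target += normal_lpdf(y | mu, sigma) ; // then one vectorized normal call
}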

Oh, you mean the final likelihood part? That’d require the data to be structured as a single long vector, with a new data variable containing the subject index for each trial, like this:

data {
  int<lower=1> nSubjects;
  int<lower=1> nTrials;
  vector<lower=0>[nSubjects*nTrials] x;
  vector<lower=0>[nSubjects*nTrials] y;
  int<lower=1,upper=nSubjects> trialSubject[nSubjects*nTrials] ;
}
transformed data{
  real y_mean = mean(y) ;
  real y_sd = sd(y) ;
  vector[nSubjects*nTrials] y_scaled = (y-y_mean)/y_sd ;

  real x_mean = mean(x) ;
  real x_sd = sd(x) ;
  vector[nSubjects*nTrials] x_scaled = (x-x_mean)/x_sd ;
}
parameters {
  // group-level parameters
  real<lower=0> sigma;
  row_vector<lower=0>[3] mu_u ;  
  vector<lower=0>[3] tau_u ;
  matrix[3, nSubjects] z_u ;
  cholesky_factor_corr[3] L_u ;
}
transformed parameters {
  matrix<lower=0>[nSubjects, 3] u = rep_matrix(mu_u,nSubjects) + transpose(diag_pre_multiply(tau_u, L_u) * z_u) ;
}
model {
  // priors
  target += weibull_lpdf(sigma | 2,1) ;
  target += std_normal_lpdf(mu_u) ;  
  target += weibull_lpdf(tau_u | 2,1) ;
  target += lkj_corr_cholesky_lpdf(L_u | 2) ;
  target += std_normal_lpdf(to_vector(z_u)) ;  
  // likelihood
  target += normal_lpdf( y_scaled | u[,1][trialSubject] ./ (x_scaled + u[,2][trialSubject]) + u[,3][trialSubject], sigma); 

}
generated quantities{
  // unscaled versions of parameters
  real gamma = mu_u[3] * y_sd + y_mean; //gamma serves as an intercept, so scale and add back the mean
  real alpha = mu_u[1] * y_sd * x_sd ; //alpha is a covariate-dependent scalar, so need to double-scale
  real beta = x_mean - mu_u[2] * x_sd ; // beta is on the x scale; u[,2] enters with flipped sign, so un-flip, un-scale by x_sd, and shift by x_mean
}

Is that what you mean?


@BeniSportSci Note that I just realized that given the bounds in your model, you probably shouldn’t subtract the mean from y when scaling the data. I’ll need to think about whether the same goes for x.
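In the long-vector version above, that would just mean dividing by the SD without centering, something like:

transformed data{
  real y_sd = sd(y) ;
  vector[nSubjects*nTrials] y_scaled = y/y_sd ; // no centering, so y_scaled keeps its lower bound at zero
  ...
}

The generated quantities that un-scale gamma would then need a corresponding tweak (no y_mean to add back).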