Sampling issues with vectorized multivariate normal

brijo · July 17, 2023, 5:44pm

I have a set of treatment estimates from randomized trials that I’m trying to better model. All of the interventions, each belonging to one of several broad intervention categories, attempted to improve student attendance, with vary degrees of success. Each of the treatment estimates comes from a Poisson regression run in R, with the resulting (untransformed) estimates and covariance passed in as data to Stan. What I’m also trying to do, and what I suspect I might have done poorly, is to expand my Stan model so that it also takes in the mean absence rate from the control group in each trial (along with another covariance matrix for this). As various people have pointed out, if the control mean in an RCT is correlated with the treatment effect, and you build this into your model, maybe you can get better inference on the estimated true treatment effects and estimated true heterogeneity in effects.

One complicating factor is that many of the trials had more than one treatment arm and a single control arm. In which case there is of course only one control mean absence rate for that trial’s set of treatment estimates. So I try to have one lambda – what I call the true mean absence rate parameter – for each trial be correlated with a set of thetas – what I call the true treatment effect parameters for the treatment estimates. I also (believe) I set up the model to estimate just a single correlation between each lambda and set of thetas across all trials, along with a single mean and variance for all the lambdas. Whereas I allow the mean and variation in the thetas to depend on the intervention type.

This model compiles and samples – but the sampling is poor. I run 4 chains with 8000 iterations. There were 22 divergent transitions, which doesn’t seem too terrible. But I also get warnings about ~16000 transitions exceeding max treedepth, 1 chain with low BFMI, that the largest R-hat was NA, and that ESS was too low. This seems like just about every warning that Stan throws.

I’m wondering if I wrote this model poorly, especially the arrays YY, MU, and D, as referenced in the multi_normal() function where I iterate through the contents of YY, MU, and D.

If there isn’t anything glaringly wrong with the model, does anyone have suggestions for how to make the sampling proceed more smoothly?

Thank you!

data {
	int<lower=1> J;							            // distinct intervention types
	int<lower=1> T;                         // distinct trials, some multi-armed; each trial falls into some type j
	int<lower=1> N;							            // number of total arms across all trials and intervention types
	vector[N] y;								            // estimated treatment effects
	cov_matrix[N] sigma_y;                  // covariance matrix of estimated treatment effects
	vector[T] a;                            // estimated control mean absence rates
	cov_matrix[T] sigma_a;                  // covariance matrix of mean absence rates
	int<lower=1,upper=J> idx_study[N];		  // intervention type index for that arm 
	int<lower=1,upper=T> idx_trial[N];      // trial index for that arm
	real<lower=0> alpha_sd;                 // SD for prior on alpha
	real<lower=0> sigma_delta_sd;           // SD for prior on intervention effects
	real<lower=0> sigma_epsilon_sd_sd;      // SD for prior for heterogeneity in arm effects variances
}

parameters {
	real alpha;								            // average affect across all treatment arms, trials, intervention types
	real<lower=0> sigma_epsilon_sd;       // SD of half normal distribution of sigma_epsilons
	real<lower=0> sigma_delta;				    // SD of intervention type effect deviances
	vector[J] delta_nc;                   // non-centered deltas
	vector<lower=0>[J] sigma_epsilon_nc;  // non-centered sigma epsilons
	vector<lower=0>[T] lambda;            // true parameters for estimated control mean abs rates
	real<lower=0> mu;                     // mean parameter for the lambdas
	real<lower=0> sigma_lambda;           // SD of lambdas
	corr_matrix[2] Rho;                   // matrix for correlation between thetas, lambdas
	vector[N] theta;				              // true parameter generating observed treatment effects
}

transformed parameters {
	vector[J] delta;						        // intervention type effects
	vector<lower=0>[J] sigma_epsilon;   // SD of arm effects, estimated for each intervention type

	sigma_epsilon[idx_study] = sigma_epsilon_nc[idx_study]*sigma_epsilon_sd;
	delta[idx_study] = delta_nc[idx_study]*sigma_delta;

}

model {
  array[N] vector[2] YY; 
  array[N] vector[2] MU;
  array[N] vector[2] D; 
  for (i in 1:N) YY[i] = [theta[i],lambda[idx_trial[i]]]';
  for (i in 1:N) MU[i] = [alpha + delta[idx_study[i]],mu]';
  for (i in 1:N) D[i] = [sigma_epsilon[idx_study[i]],sigma_lambda]'; 
  
	alpha ~ normal(0,alpha_sd);

	sigma_epsilon_nc ~ normal(0,1);
	sigma_epsilon_sd ~ normal(0,sigma_epsilon_sd_sd); // sigma_epsilon_sd_sd is admittedly a bad name

	delta_nc ~ normal(0,1);
	sigma_delta ~ normal(0,sigma_delta_sd);
	
	sigma_lambda ~ normal(0,0.05);
	mu ~ normal(0.07, 0.03) T[0,];
  Rho ~ lkj_corr(2);

	for (i in 1:N) YY[i] ~ multi_normal(MU[i], quad_form_diag(Rho, D[i]));
	
  a ~ multi_normal(lambda, sigma_a); // likelihood function for the observed control mean abs. rates
	y ~ multi_normal(theta, sigma_y); // likelihood function for the observed treatment estimates
	
}

Bob_Carpenter · July 25, 2023, 9:42pm

There are a couple of issues here. First, fix all the indentation. If the problem is tabs, just don’t use tabs in source code and then they render the same way everywhere.

Whatever you do, this is going to slow because of this line:

To evaluate the multi_normal, we need to solve the matrix D[I]. You’ll get a huge speedup parameterizing in terms of Cholesky factors. There’s a description of how to do that in the regression chapter of the User’s Guide. This goes for all of your multi-normals. For the data-driven ones, you can do a Cholesky decomposition in the transformed data block to define L_Sigma_a (I’d recommend capitalizing covariances to distinguish them notationally from ordinary scale parameters).

It looks like you’ve properly included non-centered parameterizations of all of your vector priors, which is good. For some of the other variables, it can still help to scale. For instance, when you have (with spaces after comma for readability):

sigma_lambda ~ normal(0, 0.05);

As is, this can be problematic because a positive constrained parameter will be initialized in the range (1/exp(2), exp(2)), or about (1/8, 8), all of which are at least a factor of two too large given the prior. This can be a challenge to mitigate with constrained parameters as you have to do everything on the unconstrained scale. It can help to define

parameters {
  sigma_lambda_std ~ normal(0, 1);
  ...
transformed parameters {
  sigma_lambda = 0.05 * sigma_lambda_std;
  ...

This mainly helps with our default initialization. the other way to mitigate the problem is better user-defined initializations.

For this

	mu ~ normal(0.07, 0.03) T[0,];

you don’t need the T[0, ]. You need it to normalize, but the normalization term will just be a constant and hence drop out during MCMC. You would need the T[0, ] if you had parameters in place of the constants 0.07 and 0.03.

For the top-level definitions, it’s more efficient to do this:

YY[, 1] = theta;
YY[, 2] = lambda[idx_trial];

where now theta and lambda are declared as arrays. Similarly for MU and D (I wouldn’t capitalize these—the usual convention is that all caps is reserved for constants in computer programs.). The MU case and D case look strange in that there’s no way a normal distribution would generate the same value for the second parameter for every entry—this means the model is reducing rank from how it’s expressed in the sampling notation (which can be OK, but it’s something to watch out for).

brijo · August 1, 2023, 10:31pm

Bob: Thank you for your very generous feedback. As you might have guessed, I’m still relatively new to Stan. I learned a great deal from your comments.

I apologize for my slightly delayed reply. It took me a minute to get back to this model and another minute to work my way through your suggestions. I have a few more still to implement, mostly related (I think) to faster sampling. I focused on this first pass on changes that seem necessary for sampling to work well at all, no matter how slow. In particular, I tried to set up non-centered parameterization of my covariance structure, using the Cholesky decomposition approach you suggested. I also created a sigma_lambda_std parameter and dropped the truncation statement on the prior for mu.

But now, I’m sure because I messed something up, my model won’t sample at all. It isn’t a compilation error; it does compile. It just says every initial value is rejected. Here it is now (with all tabs removed for better rendering hopefully):

data {
 int<lower=1> J; // distinct intervention types
 int<lower=1> T; // distinct trials, some multi-armed; each trial falls into some type j
 int<lower=1> N; // number of total arms across all trials and intervention types
 vector[N] y; // estimated treatment effects
 cov_matrix[N] Sigma_y; // covariance matrix of estimated treatment effects
 vector[T] a; // estimated control mean absence rates
 cov_matrix[T] Sigma_a; // covariance matrix of mean absence rates
 int<lower=1,upper=J> idx_study[N]; // intervention type index for that arm 
 int<lower=1,upper=T> idx_trial[N]; // trial index for that arm
 real<lower=0> alpha_sd; // SD for prior on alpha
 real<lower=0> sigma_delta_sd; // SD for prior on intervention effects
 real<lower=0> sigma_epsilon_sd_sd; // SD for prior for heterogeneity in arm effects variances
}

parameters {
 real alpha; // average affect across all treatment arms, trials, intervention types
 real<lower=0> sigma_epsilon_sd; // SD of half normal distribution of sigma_epsilons
 real<lower=0> sigma_delta; // SD of intervention type effect deviances
 vector[J] delta_nc; // non-centered deltas
 vector<lower=0>[J] sigma_epsilon_nc; // non-centered sigma epsilons
 real<lower=0> mu; // mean parameter for the lambdas
 real sigma_lambda_std; // will ultimately become sigma_lambda but better to initialize as std. normal
 cholesky_factor_corr[2] L_Rho; // Cholesky factor for correlation matrix Rho
 array[N] vector[2] z; // draws from standard normal for use with Cholesky
}

transformed parameters {
 vector[J] delta; // intervention type effects
 vector<lower=0>[J] sigma_epsilon; // SD of arm effects, estimated for each intervention type
 real<lower=0> sigma_lambda; // SD of lambdas 
 vector<lower=0>[T] lambda; // true parameters for estimated control mean abs rates 
 vector[N] theta; // true parameter generating observed treatment effects 
 
 array[N] vector[2] YY; 
 array[N] vector[2] MU;
 array[N] vector[2] D;
 for (i in 1:N) YY[i] = [theta[i], lambda[idx_trial[i]]]';
 for (i in 1:N) MU[i] = [alpha + delta[idx_study[i]], mu]';
 for (i in 1:N) D[i] = [sigma_epsilon[idx_study[i]], sigma_lambda]';

 for (i in 1:N) YY[i] = MU[i] + diag_pre_multiply(D[i],L_Rho)*z[i];

 sigma_epsilon[idx_study] = sigma_epsilon_nc[idx_study]*sigma_epsilon_sd;
 delta[idx_study] = delta_nc[idx_study]*sigma_delta;

 sigma_lambda = 0.05*sigma_lambda_std; // transforming here to make initialization easier for constrained parameter

}

model {
   
 alpha ~ normal(0, alpha_sd);

 sigma_epsilon_nc ~ normal(0, 1);
 sigma_epsilon_sd ~ normal(0, sigma_epsilon_sd_sd); // sigma_epsilon_sd_sd is admittedly a bad name

 delta_nc ~ normal(0, 1);
 sigma_delta ~ normal(0, sigma_delta_sd);
	
 sigma_lambda_std ~ normal(0, 1);
 mu ~ normal(0.07, 0.03); // not truncating at 0 because this normal uses constants, not parameters, and in MCMC constant of normalization drops out 
 L_Rho ~ lkj_corr_cholesky(2);
 for (i in 1:N) z[i] ~ normal(0, 1);

 a ~ multi_normal(lambda, Sigma_a); // likelihood function for the observed control mean abs. rates
 y ~ multi_normal(theta, Sigma_y); // likelihood function for the observed treatment estimates
	
}

What might I be doing wrong now such that I don’t get any sampling at all?

I’m not sure it explains the rejection of every initial value (or maybe it does?), but I’m still puzzling over how to use built-in functions that require a vector as an argument when I have an array of vectors I’m trying to use. For instance:

for (i in 1:N) YY[i] = MU[i] + diag_pre_multiply(D[i],L_Rho)*z[i]

I realize my top-level definitions for arrays like YY might look bizarre (in addition to having all-cap names, which I still intend to remedy). But I’m setting these up as I am because I understand I ultimately need to feed diag_pre_multipy() a single vector as the first argument. Just as I ultimately needed to feed quad_form_diag() a single vector as the second argument, in the earlier draft of this model. And it seems easiest to set up arrays of vectors as I have and iterate over them.

brijo · August 18, 2023, 2:03am

After more reflection and experimentation, this model does what I had hoped. It samples well - though I know there are further refinements I could make so that it samples faster.

Some of the improvement was I think due to just writing a better model. I think control group mean absence rates might be correlated with the treatment effects, but I sometimes have several treatment estimates estimated off of the same control group, from a multi-arm trial, so now I just make a trial effect part of the decomposition of the estimated true treatment effect for each estimate. And let the correlation happen between those trial effects and the control mean absence rates. Rather than trying to re-cycle parameters in a weird way with burdensome syntax, all the while maybe creating model rank issues.

Bob’s comments were really helpful. In the end, I think the biggest one was just non-centering my parameterization of the covariance. I tried messing with the way the positive constrained parameters are set up but that didn’t seem to make a big difference with the issues I’d been having.

Alright, thanks for listening!

functions {

  // function to get number of matching elements in a vector
  int num_matches(int[] x, int y) {
    int n = 0;
    for (i in 1:num_elements(x))
      if (x[i] == y)
        n += 1;
    return n;
 }
  
  // funciton to get which indices in vector match
  int[] which_equal(int[] x, int y) {
    int match_positions[num_matches(x, y)];
    int pos = 1;
    for (i in 1:num_elements(x)) {
      if (x[i] == y) {
        match_positions[pos] = i;
        pos += 1;
      }
    }
    return match_positions;
  }

  // function to get the intervention type of a given trial:
  //   takes t as an integer argument for a trial index
  //   trial is data idx_trial vector
  //   type is data idx_study vector
  int trial_type(int t, int[] trial, int[] type) {
    int ids[num_matches(trial, t)] = which_equal(trial, t);
    int answer = type[ids[1]]; // just need one element from ids, so just taking first
    return answer;
  }

}

data {
  int<lower=1> N; // number of total arms across all trials and intervention types
  int<lower=1> J; // number of intervention types
  int<lower=1> T; // number of trials
  vector[N] y; // estimated treatment effects
  cov_matrix[N] Sigma_y; // covariance matrix of estimated treatment effects
  vector[T] a; // estimated control mean absence rates
  cov_matrix[T] Sigma_a; // covariance matrix of mean absence rates
  int<lower=1,upper=J> idx_study[N]; // intervention type index
  int<lower=1,upper=T> idx_trial[N]; // trial index 
  real<lower=0> alpha_sd; // SD for prior on alpha
  real<lower=0> sigma_delta_sd; // SD prior on distribution of the deltas
  real<lower=0> sigma_epsilon_sd_sd; // SD prior on distribution of sigma epsilons
  real<lower=0> sigma_gamma_sigma_eta_sd; // SD for prior on sigma gammas and sigma etas
}

parameters {
  real alpha; // mean true effect
  real<lower=0> sigma_delta; // spread of the deltas
  vector[J] delta_nc; // non-centered deltas
  vector<lower=0>[J] mu; // mean true control abs. rate, for each intervention type
  array[J] vector<lower=0>[2] sigma_vec_std; // array of std. sigma vectors, one for each intervention type
  array[J] cholesky_factor_corr[2] L_Rho; // array of Cholesky correlation factors, one for each intervention type
  matrix[2,T] z;
  real<lower=0> sigma_epsilon_sd; // spread on distribution of sigma epsilons
  vector<lower=0>[J] sigma_epsilon_nc; // non-centered sigma epsilons
  vector[N] epsilon_nc; // non-centered arm effects
}

transformed parameters {
  vector[J] delta; // intervention type effects
  array[J] vector[2] sigma_vec; // sigma_gamma is first element in each vector, sigma_eta is second
  matrix[2,T] gamma_eta; // top row for gamma (trial effects), bottom row eta (control abs. rate deviations from mu) 
  vector<lower=0>[J] sigma_epsilon; 
  vector[N] epsilon; // individual arm effects
  vector[N] theta; // true effects for the observed treatment effects 
 
 for (j in 1:J) {
   delta[j] = delta_nc[j]*sigma_delta; 
   sigma_vec[j] = sigma_gamma_sigma_eta_sd*sigma_vec_std[j];
   sigma_epsilon[j] = sigma_epsilon_nc[j]*sigma_epsilon_sd; 
 }
 for (t in 1:T) {
  gamma_eta[,t] = diag_pre_multiply(sigma_vec[trial_type(t, idx_trial, idx_study)], L_Rho[trial_type(t, idx_trial, idx_study)])*z[,t]; 
 }
 for (i in 1:N) {
  epsilon[i] = epsilon_nc[i]*sigma_epsilon[idx_study[i]];
 }
 
 theta = alpha + delta[idx_study] + (gamma_eta[1])'[idx_trial] + epsilon;
 
}

model {
  int trial_type_reduced[T]; // intervention type for each trial, in order of trial index 1:T  
  
  alpha ~ normal(0, alpha_sd);
  sigma_delta ~ normal(0, sigma_delta_sd);
  delta_nc ~ normal(0, 1);
  mu ~ normal(0.07, 0.03); // no need to explicitly include T[0,]
  for (j in 1:J) sigma_vec_std[j] ~ normal(0, 1);
  for (j in 1:J) L_Rho[j] ~ lkj_corr_cholesky(2);  
  to_vector(z) ~ normal(0, 1);
  sigma_epsilon_sd ~ normal(0, sigma_epsilon_sd_sd);
  sigma_epsilon_nc ~ normal(0, 1);
  epsilon_nc ~ normal(0, 1);

  y ~ multi_normal(theta, Sigma_y);
   
  for (t in 1:T) {
    trial_type_reduced[t] = trial_type(t, idx_trial, idx_study);
  }
  a ~ multi_normal(mu[trial_type_reduced] + (gamma_eta[2])', Sigma_a); // recall a is vector of length T, not N, so can't just use idx_study here
  
}

spinkney · August 18, 2023, 4:24pm

Cool! You can speed this up a little bit by taking the cholesky decomposition of Sigma_y and Sigma_a either before you input as data or in the transformed data block. The multi_normal is taking the decomposition anyway in the backend (specifically an LDL’ decomp). Then use multi_normal_cholesky.

Depending on how big J is you could construct 2 x 2 cholesky factor corr matrices using just a J sized vector. We know that any lower triangular cholesky corr matrix will have a 1 in the 1,1 slot. Then the 2, 1 slot is a correlation value, \rho, between -1 and 1. The 2,2 slot will be \sqrt{1 - \rho^2}. Now we just need to declare vector<lower=-1, upper=1>[J] rho; and in tp you do

parameters {
 //   array[J] cholesky_factor_corr[2] L_Rho; // array of Cholesky correlation factors, one for each intervention type
  vector<lower=-1, upper=1>[J] rho;
}
transformed parameters {
...

 for (t in 1:T) {
  matrix[2, 2] L_Rho;
  L_Rho[1, 1] = 1;
  L_Rho[1, 2] = 0;
  L_Rho[2, 1] = rho[trial_type(t, idx_trial, idx_study)];
  L_Rho[2, 2] = sqrt(1 - L_Rho[2, 1]^2);
  gamma_eta[,t] = diag_pre_multiply(sigma_vec[trial_type(t, idx_trial, idx_study)], L_Rho) * z[,t]; 
 }

}

brijo · August 28, 2023, 12:10am

Awesome, thank you, I will give this a try!

Topic		Replies	Views
Inefficient sampling and divergent transitions in network meta-analysis (ported from WinBUGS) Modeling fitting-issues , meta-analysis	12	1208	July 24, 2020
Correct way of sampling correlated multidimensional vectors Modeling	3	643	February 9, 2019
Efficient sampling SRM with large number of correlated random effects Modeling	1	727	May 4, 2018
How to speed up sampling in rstan? Modeling rstan , performance , hierarchical-model	6	6040	July 24, 2020
MCMC sampling does not work when execute Modeling fitting-issues	78	11198	November 1, 2017

Sampling issues with vectorized multivariate normal

Related topics