Inefficient sampling and divergent transitions in network meta-analysis (ported from WinBUGS)

dietary_fat_dataset_long.txt (564 Bytes)

Hi everybody,

This post is a bit lengthy, but I wanted to provide as much context as possible.

I’m trying to port a network meta-analysis from BUGS/JAGS to Stan, the main motivations being pre-compiled models (the model will need to be run many times on different data sets) and faster sampling than JAGS offers. This has turned out to be a little tricky: in particular, I keep getting divergent transitions with the NUTS algorithm, and what I believe is quite inefficient sampling with the HMC sampler.

I’ve seen some other attempts at this on Discourse, but they all (in my view) stay too loyal to the original WinBUGS models, with ragged arrays and a bunch of constant zero-valued parameters, which makes the Stan code a lot more verbose and harder to read.

The univariate “trick” mentioned in the code is that, instead of using a multivariate normal distribution to model within-trial correlation in N-arm trials (N \geq 3), one can use a univariate conditional distribution of the effect of the k’th treatment in the i’th trial (with the trial-specific control arm being k = 1); in the code, the md vector represents the location parameter of this conditional distribution:
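\delta_{i,1k} \mid \delta_{i,12}, \ldots, \delta_{i,1(k-1)} \sim \mathrm{N}\!\left( d_{t_{ik}} - d_{t_{i1}} + \frac{1}{k-1} \sum_{j=1}^{k-1} \left[ \delta_{i,1j} - \left( d_{t_{ij}} - d_{t_{i1}} \right) \right],\ \frac{k}{2(k-1)} \, \sigma^2 \right), \qquad k = 2, \ldots, a_i,

with \delta_{i,11} = 0 by definition (eq. 14, p. 36, of the TSD2 document linked in the code below).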

Although I’m not convinced that this conditional univariate likelihood is advantageous, in this first attempt I wanted to implement the same model as the “industry standard” to have a point of reference. The model, comparing Poisson-distributed event rates, looks like this:

// Inspired by model in: https://discourse.mc-stan.org/t/replication-of-winbugs-nma-code-in-stan/3587/2

data {
	int<lower=0> ns;                      // number of studies
	int<lower=0> nt;                      // number of treatments
	int<lower=1> no;                      // number of observations
	
	int<lower=0> s[no];                   // study indicator
	int<lower=0> t[no];                   // treatment indicator
	int<lower=0> b[no];                   // baseline indicator
	int<lower=0> r[no];                   // number of events
	int<lower=1> a[no];					  // arm in study
	vector<lower=0>[no] E;                // exposure time (vector for vectorised sampling)
	
	int<lower=1> b_idx[ns];				  // indices of control arms
	int<lower=1> t_idx[no-ns];			  // indices of active treatment arms
}

transformed data {
	vector[no-ns] a_t = to_vector(a[t_idx]); // arms of active treatments only (used in sampling block)
}

parameters {
	vector[nt-1] d_param;                 // mean effect of treatments (except control) against control 
	vector[ns] mu;                        // trial-specific baseline mean effect
	real<lower=0> sigma;        		  // heterogeneity parameter
	vector[no-ns] delta;				  // comparison-specific treatment effects
}

transformed parameters{
	vector[nt] d;						  // mean effect of treatments (incl. control) against control
	vector[no-ns] md;        			  // comparison-specific mean difference
	
	// Stan doesn't allow mixing random and static values
	d[1] = 0;							  // effect of control vs. control is zero by definition
	for(i in 2:nt) {
		d[i] = d_param[i-1];			  // pull in the rest from the parameter of real interest
	}
	
	// Univariate trick (eq. 14 [p. 36] in http://nicedsu.org.uk/wp-content/uploads/2017/05/TSD2-General-meta-analysis-corrected-2Sep2016v2.pdf)
	{
		real weight = 0;
		int current_study = 0; // keep track of study to know when to reset multi-arm weight
		
		// FIX: Should probably be vectorised in some nifty way
		for (i in 1:(no-ns)) {
			int current_t_idx = t_idx[i];
			
			// Reset weight when encountering new study
			if (current_study != s[current_t_idx]) { 
				weight = 0; 
				current_study = s[current_t_idx];
			}
			
			md[i] = d[t[current_t_idx]] - d[b[current_t_idx]];
			// SD weighting directly in sampling statement
			
			if (a[current_t_idx] > 2) {
				weight += delta[i-1] - d[t[current_t_idx-1]] + d[b[current_t_idx-1]]; // mimic sum over k-1
				md[i] += weight / (a[current_t_idx] - 1);
			} 
		}
	}
}

model {
	// Priors
	mu ~ normal(0, 10);
	sigma ~ normal(0, 10);
	
	// Likelihoods
	r[b_idx] ~ poisson_log(mu[s[b_idx]] + log(E[b_idx])); 
	r[t_idx] ~ poisson_log(mu[s[t_idx]] + delta + log(E[t_idx]));
	delta ~ normal(md, sigma * sqrt(0.5 * a_t ./ (a_t - 1.0)));
}

Running the following in RStudio:

for (p in c("rstan", "dplyr")) 
	library(p, character.only = TRUE)

stan_data <- read.delim("dietary_fat_dataset_long.txt", sep = "\t") %>% # attached to post
	mutate(s = as.numeric(factor(study))) %>%
	group_by(s) %>%
	mutate(a = seq(n())) %>%
	c(ns = n_distinct(.$s), 
	  nt = n_distinct(.$t), 
	  no = nrow(.), 
	  b_idx = list(which(.$t == .$b)),
	  t_idx = list(which(.$t != .$b)))

stan_data looks like this:

$t
 [1] 1 2 1 2 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2

$b
 [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

$E
 [1] 1917.0 1925.0   43.6   41.3   38.0  393.5  373.9 4715.0 4823.0  715.0  751.0  885.0  895.0   87.8   91.0 1011.0  939.0 1544.0 1588.0  125.0  123.0

$r
 [1] 113 111   1   5   3  24  20 248 269  31  28  65  48   3   1  28  39 177 174   2   1

$s
 [1]  1  1  2  2  2  3  3  4  4  5  5  6  6  7  7  8  8  9  9 10 10

$a
 [1] 1 2 1 2 3 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2

$ns
[1] 10

$nt
[1] 2

$no
[1] 21

$b_idx
 [1]  1  3  6  8 10 12 14 16 18 20

$t_idx
 [1]  2  4  5  7  9 11 13 15 17 19 21

When running the following (note that I set algorithm = "HMC"; sampling takes about 10 seconds on my MacBook Pro i7):

options(mc.cores = parallel::detectCores() - 1)
rstan_options(auto_write = TRUE)
fit <- stan(file = "poisson_mtc_re_model_with_univariate_trick_lean1.stan", 
			data = stan_data, seed = 42, algorithm = "HMC", iter = 10000, chains = 7)

Stan throws these three warnings at me:

1: The largest R-hat is NA, indicating chains have not mixed.
Running the chains for more iterations may help. See
http://mc-stan.org/misc/warnings.html#r-hat 
2: Bulk Effective Samples Size (ESS) is too low, indicating posterior means and medians may be unreliable.
Running the chains for more iterations may help. See
http://mc-stan.org/misc/warnings.html#bulk-ess 
3: Tail Effective Samples Size (ESS) is too low, indicating posterior variances and tail quantiles may be unreliable.
Running the chains for more iterations may help. See
http://mc-stan.org/misc/warnings.html#tail-ess 

If I try algorithm = "NUTS", I get 1177 divergent transitions.

print(fit, probs = c(0.5, 0.025, 0.975)) yields (omitting the “header” and “footer”):

              mean se_mean   sd    2.5%     50%   97.5% n_eff Rhat
d_param[1]   -0.02    0.00 0.09   -0.20   -0.02    0.16 26377    1
mu[1]        -2.83    0.00 0.08   -3.00   -2.83   -2.68  9707    1
mu[2]        -2.70    0.00 0.37   -3.48   -2.67   -2.04 11642    1
mu[3]        -2.86    0.00 0.17   -3.19   -2.85   -2.54  9944    1
mu[4]        -2.93    0.00 0.06   -3.05   -2.93   -2.82  9194    1
mu[5]        -3.20    0.00 0.15   -3.50   -3.20   -2.91  8177    1
mu[6]        -2.71    0.00 0.11   -2.93   -2.71   -2.48  9649    1
mu[7]        -3.89    0.01 0.53   -5.04   -3.86   -2.98  9331    1
mu[8]        -3.43    0.00 0.15   -3.74   -3.42   -3.16  6905    1
mu[9]        -2.18    0.00 0.07   -2.31   -2.18   -2.04  7781    1
mu[10]       -4.57    0.01 0.64   -6.01   -4.51   -3.49  3832    1
sigma         0.15    0.00 0.12    0.02    0.11    0.44  2466    1
delta[1]     -0.02    0.00 0.10   -0.22   -0.02    0.17 11073    1
delta[2]      0.04    0.00 0.21   -0.29    0.01    0.57 12659    1
delta[3]      0.02    0.00 0.20   -0.34    0.00    0.48 20373    1
delta[4]     -0.04    0.00 0.15   -0.37   -0.03    0.25 14772    1
delta[5]      0.03    0.00 0.08   -0.12    0.02    0.19  8293    1
delta[6]     -0.05    0.00 0.14   -0.36   -0.03    0.22 12595    1
delta[7]     -0.11    0.00 0.14   -0.44   -0.09    0.11  5187    1
delta[8]     -0.05    0.00 0.20   -0.53   -0.03    0.31 21202    1
delta[9]      0.09    0.00 0.16   -0.16    0.06    0.49  5468    1
delta[10]    -0.03    0.00 0.08   -0.20   -0.03    0.13 10623    1
delta[11]    -0.03    0.00 0.20   -0.47   -0.02    0.36 23127    1
d[1]          0.00     NaN 0.00    0.00    0.00    0.00   NaN  NaN
d[2]         -0.02    0.00 0.09   -0.20   -0.02    0.16 26377    1
md[1]        -0.02    0.00 0.09   -0.20   -0.02    0.16 26377    1
md[2]        -0.02    0.00 0.09   -0.20   -0.02    0.16 26377    1
md[3]         0.01    0.00 0.13   -0.21    0.00    0.32 16029    1
md[4]        -0.02    0.00 0.09   -0.20   -0.02    0.16 26377    1
md[5]        -0.02    0.00 0.09   -0.20   -0.02    0.16 26377    1
md[6]        -0.02    0.00 0.09   -0.20   -0.02    0.16 26377    1
md[7]        -0.02    0.00 0.09   -0.20   -0.02    0.16 26377    1
md[8]        -0.02    0.00 0.09   -0.20   -0.02    0.16 26377    1
md[9]        -0.02    0.00 0.09   -0.20   -0.02    0.16 26377    1
md[10]       -0.02    0.00 0.09   -0.20   -0.02    0.16 26377    1
md[11]       -0.02    0.00 0.09   -0.20   -0.02    0.16 26377    1
lp__       5403.67    0.21 7.65 5389.42 5403.20 5418.95  1346    1

stan_dens and stan_trace don’t suggest problems with mixing, consistent with Rhat being 1 for all parameters above.

In essence, this model gives me the same results as the technical report (http://nicedsu.org.uk/wp-content/uploads/2017/05/TSD2-General-meta-analysis-corrected-2Sep2016v2.pdf, p. 66), suggesting that the model is equivalent (although, strangely, my sigma (a standard deviation) comes out at what appears to be their precision; that might be a typo in their table, or of course some odd mistake in my model).

My concerns/questions are the following:

  • Is it expected, or are there good reasons, that NUTS cannot “find its way” around the posterior when plain HMC seems to be doing fine? Are there any (more or less) obvious problems with the code that make it difficult for NUTS? Or is it likely simply a matter of estimating too many parameters from too few observations? Can enforcing fixed values for parameters cause this kind of behaviour?
  • I wonder if the warnings regarding ESS are an artefact caused by forcing d[1] to zero.
  • lp__ seems to have very low n_eff. Is that a problem? All the other parameters seem fine, although an n_eff of a few thousand isn’t impressive when I have 35,000 posterior draws.
  • Is this model simply ill-defined?

If you’ve made it this far, thanks for reading on! I hope someone has some insights and could perhaps give me a pointer as to what to do from here.

Cheers,
Ben

Can you please show the output from the call with default settings?

Sure thing! Sampling with default settings (except for setting the seed) yields 397 divergent transitions (plus, of course, the suggestion to inspect pairs()). The parameter summaries, which I suppose you’re interested in, are (again without the 25%/75% quantiles):

Inference for Stan model: poisson_mtc_re_model_with_univariate_trick.
4 chains, each with iter=2000; warmup=1000; thin=1; 
post-warmup draws per chain=1000, total post-warmup draws=4000.

              mean se_mean    sd     50%    2.5%   97.5% n_eff Rhat
d_param[1]   -0.02    0.00  0.09   -0.04   -0.20    0.16   385 1.01
mu[1]        -2.84    0.00  0.08   -2.85   -3.00   -2.68  1364 1.00
mu[2]        -2.72    0.01  0.33   -2.76   -3.45   -2.09  1545 1.00
mu[3]        -2.84    0.01  0.16   -2.81   -3.18   -2.56   145 1.02
mu[4]        -2.91    0.01  0.06   -2.91   -3.04   -2.81    19 1.11
mu[5]        -3.18    0.01  0.14   -3.15   -3.47   -2.92   142 1.03
mu[6]        -2.71    0.00  0.11   -2.74   -2.92   -2.49   633 1.01
mu[7]        -3.90    0.01  0.50   -3.91   -5.07   -3.00  1399 1.00
mu[8]        -3.48    0.07  0.19   -3.46   -3.75   -3.15     8 1.21
mu[9]        -2.16    0.02  0.07   -2.15   -2.32   -2.05    17 1.09
mu[10]       -4.47    0.05  0.58   -4.33   -5.79   -3.51   122 1.03
sigma         0.12    0.03  0.12    0.09    0.01    0.42    15 1.12
delta[1]     -0.02    0.00  0.09   -0.03   -0.22    0.17   513 1.01
delta[2]      0.03    0.02  0.19   -0.02   -0.24    0.50   144 1.04
delta[3]      0.01    0.01  0.18   -0.03   -0.31    0.43   355 1.02
delta[4]     -0.04    0.00  0.14   -0.05   -0.36    0.24  1651 1.00
delta[5]      0.01    0.01  0.08    0.00   -0.12    0.18    43 1.06
delta[6]     -0.05    0.00  0.13   -0.05   -0.35    0.20  1004 1.00
delta[7]     -0.10    0.01  0.13   -0.06   -0.44    0.10   236 1.03
delta[8]     -0.05    0.01  0.19   -0.05   -0.52    0.30  1395 1.00
delta[9]      0.06    0.03  0.16    0.02   -0.17    0.48    27 1.07
delta[10]    -0.04    0.00  0.08   -0.04   -0.20    0.13   585 1.01
delta[11]    -0.03    0.00  0.18   -0.03   -0.43    0.34  1687 1.00
d[1]          0.00     NaN  0.00    0.00    0.00    0.00   NaN  NaN
d[2]         -0.02    0.00  0.09   -0.04   -0.20    0.16   385 1.01
md[1]        -0.02    0.00  0.09   -0.04   -0.20    0.16   385 1.01
md[2]        -0.02    0.00  0.09   -0.04   -0.20    0.16   385 1.01
md[3]         0.00    0.01  0.12   -0.02   -0.20    0.29   141 1.04
md[4]        -0.02    0.00  0.09   -0.04   -0.20    0.16   385 1.01
md[5]        -0.02    0.00  0.09   -0.04   -0.20    0.16   385 1.01
md[6]        -0.02    0.00  0.09   -0.04   -0.20    0.16   385 1.01
md[7]        -0.02    0.00  0.09   -0.04   -0.20    0.16   385 1.01
md[8]        -0.02    0.00  0.09   -0.04   -0.20    0.16   385 1.01
md[9]        -0.02    0.00  0.09   -0.04   -0.20    0.16   385 1.01
md[10]       -0.02    0.00  0.09   -0.04   -0.20    0.16   385 1.01
md[11]       -0.02    0.00  0.09   -0.04   -0.20    0.16   385 1.01
lp__       5407.29    4.71 10.56 5405.68 5389.91 5426.89     5 1.37

Samples were drawn using NUTS(diag_e) at Tue Jun 23 08:22:42 2020.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).

Chains definitely haven’t mixed (stan_dens and stan_trace are wicked-looking, which is expected with these Rhat values).

I reckon the priors are very wide, mainly because I wanted to stay as loyal as possible to the original WinBUGS configuration. Such wide priors might just cause more trouble for NUTS than for the Gibbs sampler of WinBUGS/JAGS.

One thing that I think is odd about your model (from a quick read) is that it sets d[1] = 0 but then also proceeds to estimate it. That’s the parameter that has NaN Rhat. Could you perhaps try not estimating d[1] if the value is truly known?

Indeed, the NaN for d[1] isn’t surprising, and as I wrote, perhaps the ESS warnings are simply an artefact caused by the nature of this parameter. The d vector holds the treatment effects of each treatment against a control treatment (e.g., placebo or gold standard). Thus, d[1] is by definition 0 because that is the effect of the control treatment against itself. So, the treatment effect parameters of actual interest are d_param, while the full d vector is needed to define indirect treatment effects. I tried not setting d[1] to zero and instead sampling it from a quite narrow zero-mean normal prior, but that only made sampling worse while departing from the logic of the model.

I wonder if this kind of model causes trouble for NUTS, or if there’s something else going on. I believe I have some intuition as to how NUTS works, but far from enough to understand whether having a fixed-value parameter is problematic. In essence, it would be nice to know if we need to rethink entirely how to write Stan models equivalent to WinBUGS models, because the sampler is perhaps more sensitive to “odd” specifications of parameters.

The SE of d[1] is 0, since d[1] is set to 0. Obvious. d[1] is your reference parameter, if I understood it right; it’s subtracted from all the others.

Before we go into WinBUGS tricks and how to adapt them, we should step back and look at equation (13).

Looking into the code: why do delta and d have different numbers of elements?
I’d like to make a suggestion and bypass the trick, and go back to equation (13):

  real sq_sigma = square(sigma);
  matrix[ns, ns] Sigma;
  matrix[ns, ns] L;
  for (i in 2:ns) {
    for (j in 1:(i - 1)) {
      Sigma[i, j] = 0.5 * sq_sigma;
      Sigma[j, i] = 0.5 * sq_sigma;
    }
  }
  for (i in 1:ns) {
    Sigma[i, i] = sq_sigma;
  }

  L = cholesky_decompose(Sigma);

In the model block we write:

delta ~ multi_normal_cholesky(append_row(0, d_param), L);

Once we get good results, we can think about optimising, if sampling speed is a concern.

Thanks a lot for your input, Andre! I very much appreciate it, and thanks for giving an example using the Cholesky decomposition. I had thought about using that, but it seemed a bit daunting to jump right into.

I just wanted to point out a few things: d[1] is not in general subtracted from all other parameters, although in this simple example it is; had there been another trial comparing treatments 2 and 3 (not in the sample data set), we’d have md = d[3] - d[2] for that trial. Indeed, this is a random-effects model, which is why delta and d aren’t of equal size: the delta's are noisy realisations of the underlying (true) treatment effects d, so not all trials contribute to estimating all d's. In the sample data set, d is of length 2 (1 vs 1 and 2 vs 1), but delta is of length n_obs - n_studies = 11 (because there are 10 studies, of which one has two arms with treatment 2).

So, for 2-arm trials a univariate normal distribution suffices, but the multivariate variant is needed for multi-arm trials to account for intra-trial correlation between estimates. That’s why the location parameter of the multivariate normal distribution is a_i - 1 long, where a_i is the number of arms in trial i. Sampling needs to account for this.
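To be explicit, the random-effects model for trial i (equation (13)) is

\boldsymbol{\delta}_i \sim \mathrm{N}\!\left( \mathbf{md}_i,\ \Sigma_i \right), \qquad \Sigma_i = \sigma^2 \begin{pmatrix} 1 & 0.5 & \cdots & 0.5 \\ 0.5 & 1 & \cdots & 0.5 \\ \vdots & \vdots & \ddots & \vdots \\ 0.5 & 0.5 & \cdots & 1 \end{pmatrix},

where \mathbf{md}_i and \boldsymbol{\delta}_i have length a_i - 1; that is, variance \sigma^2 on the diagonal and covariance \sigma^2 / 2 between arms of the same trial.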

I took your suggestion and tried to tweak it to work well with the data:

data {
	int<lower=0> ns;		// number of studies
	int<lower=0> nt;       	// number of treatments
	int<lower=1> no;        // number of observations
	
	int<lower=0> s[no];     // study indicator
	int<lower=0> t[no];     // treatment indicator
	int<lower=0> b[no];     // baseline treatment indicator
	int<lower=0> r[no];     // number of events
	vector<lower=0>[no] E;  // exposure time; vector for vectorised sampling
	
	int<lower=1> na[ns];	// number of arms in study
	int<lower=1> b_idx[ns];	// baseline/control index
	int<lower=1> t_idx[no-ns]; // treatment indices
}

parameters {
	vector[nt-1] d_param;   // underlying treatment effects of treatments vs. baseline (except baseline)
	vector[no-ns] delta;    // trial-specific treatment effects of active treatments      
	vector[ns] mu;          // baseline mean effect for each trial
	real<lower=0> sigma;	// heterogeneity parameter
}

transformed parameters{
	vector[nt] d = append_row(0, d_param); // underlying treatment effects (incl. baseline)
	matrix[nt, nt] Sigma; 	// covariance matrix
	
	// Set off-diagonals of covariance matrix
	for(i in 2:nt) { 
		for(j in 1:i-1) {
			Sigma[i, j] = 0.5 * square(sigma);
			Sigma[j, i] = 0.5 * square(sigma);
		}
	} 
	
	// Set diagonals of covariance matrix
	for(i in 1:nt) {
		Sigma[i, i] = square(sigma);
	}
}

model {
	// Priors
	mu ~ normal(0, 10);
	sigma ~ normal(0, 10);
	delta ~ normal(0, 10);
	
	// Likelihoods
	r[b_idx] ~ poisson_log(mu[s[b_idx]] + log(E[b_idx]));
	{
		int pos_idx = 2; 	// position in t and b vectors; skip baseline of trial 1
		int pos_delta = 1; 	// position in delta vector
		for (i in 1:ns) { 	// loop through studies
			int offset = na[i] - 2;
			vector[na[i]-1] md = d[t[pos_idx:(pos_idx+offset)]] - d[b[pos_idx:(pos_idx+offset)]];
			
			if (na[i] == 2) {
				delta[pos_delta] ~ normal(md, sigma);
			} else {
				// delta[pos_delta:(pos_delta+offset)] ~ multi_normal(md, Sigma[1:(na[i]-1), 1:(na[i]-1)]);
				delta[pos_delta:(pos_delta+offset)] ~ multi_normal_cholesky(md, cholesky_decompose(Sigma[1:(na[i]-1), 1:(na[i]-1)]));
			}
			
			pos_idx += na[i];
			pos_delta += na[i] - 1;
		}
	}
	r[t_idx] ~ poisson_log(mu[s[t_idx]] + delta + log(E[t_idx]));
}

I suppose it’s not valid to take a subset of a Cholesky decomposition (if I’m wrong, I’d be happy to know), so I’m not defining it as a transformed parameter but instead computing it on the fly when needed. I’ve also tried using multi_normal directly, which yields 77 divergent transitions with the call below; the Cholesky variant yields 229 divergent transitions. Both versions are included above, one commented out.

For this model, the only additional data needed is a vector holding the number of arms in each study:

stan_data <- mutate(df, s = as.numeric(factor(s))) %>%
	group_by(s) %>%
	mutate(a = seq(n())) %>%
	c(ns = n_distinct(.$s), 
	  nt = n_distinct(.$t), 
	  no = nrow(.), 
	  b_idx = list(which(.$t == .$b)),
	  t_idx = list(which(.$t != .$b)),
	  na = list(aggregate(.$a, list(.$s), length)$x))
fit <- stan(file = "poisson_mtc_model_with_cholesky.stan", data = stan_data, seed = 42)

I’ve played around with different (still quite wide) priors, and even no prior specification at all (just to see), but to little avail. The summary of the fit with multi_normal looks like this:

Inference for Stan model: poisson_mtc_model_with_cholesky.
4 chains, each with iter=2000; warmup=1000; thin=1; 
post-warmup draws per chain=1000, total post-warmup draws=4000.

              mean se_mean   sd     50%    2.5%   97.5% n_eff Rhat
d_param[1]   -0.02    0.00 0.09   -0.02   -0.19    0.16  1721 1.00
delta[1]     -0.02    0.00 0.10   -0.02   -0.22    0.17  2416 1.00
delta[2]      0.04    0.00 0.19    0.02   -0.27    0.49  1465 1.00
delta[3]      0.01    0.00 0.18    0.00   -0.33    0.42  2047 1.00
delta[4]     -0.04    0.00 0.15   -0.03   -0.38    0.23  2390 1.00
delta[5]      0.03    0.00 0.08    0.03   -0.11    0.18  1716 1.00
delta[6]     -0.05    0.00 0.14   -0.03   -0.33    0.21  2096 1.00
delta[7]     -0.11    0.00 0.14   -0.09   -0.44    0.11   904 1.00
delta[8]     -0.04    0.00 0.18   -0.03   -0.47    0.27  2481 1.00
delta[9]      0.08    0.00 0.16    0.05   -0.16    0.47  1077 1.00
delta[10]    -0.03    0.00 0.08   -0.03   -0.20    0.13  2078 1.00
delta[11]    -0.03    0.00 0.19   -0.02   -0.45    0.33  2788 1.00
mu[1]        -2.83    0.00 0.08   -2.83   -3.00   -2.67  1972 1.00
mu[2]        -2.68    0.01 0.35   -2.65   -3.44   -2.05  2878 1.00
mu[3]        -2.86    0.00 0.17   -2.85   -3.18   -2.53  2321 1.00
mu[4]        -2.93    0.00 0.06   -2.93   -3.05   -2.82  1873 1.00
mu[5]        -3.20    0.00 0.15   -3.20   -3.49   -2.92  2041 1.00
mu[6]        -2.71    0.00 0.11   -2.72   -2.93   -2.49  1228 1.00
mu[7]        -3.89    0.01 0.53   -3.83   -5.04   -2.96  2505 1.00
mu[8]        -3.42    0.00 0.15   -3.41   -3.74   -3.16  1716 1.00
mu[9]        -2.18    0.00 0.07   -2.17   -2.31   -2.04  2787 1.00
mu[10]       -4.57    0.02 0.63   -4.51   -5.96   -3.49  1657 1.00
sigma         0.14    0.01 0.10    0.11    0.02    0.39   367 1.01
d[1]          0.00     NaN 0.00    0.00    0.00    0.00   NaN  NaN
d[2]         -0.02    0.00 0.09   -0.02   -0.19    0.16  1721 1.00
Sigma[1,1]    0.03    0.00 0.05    0.01    0.00    0.15   733 1.00
Sigma[1,2]    0.01    0.00 0.03    0.01    0.00    0.08   733 1.00
Sigma[2,1]    0.01    0.00 0.03    0.01    0.00    0.08   733 1.00
Sigma[2,2]    0.03    0.00 0.05    0.01    0.00    0.15   733 1.00
lp__       5404.17    0.56 7.44 5403.92 5390.23 5419.29   178 1.02

Samples were drawn using NUTS(diag_e) at Fri Jun 26 16:28:19 2020.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).

pairs(fit, pars = c("sigma", "d_param", "lp__")) gives:

Clearly, something’s up with sigma, which I suppose spills over on lp__. I wonder if the hierarchical structure defined in these models simply doesn’t play well with NUTS, or at least needs to be “reinvented”.

Thanks for all your time!

This can be solved with a block-diagonal covariance matrix; the blocks are independent of each other. Consider a diagonal covariance matrix: that is the special case of independent normal distributions.
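For example, with one 2-arm and one 3-arm study, the joint covariance of the stacked delta vector would be

\Sigma = \sigma^2 \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0.5 \\ 0 & 0.5 & 1 \end{pmatrix},

i.e. one 1x1 block for the 2-arm study and one 2x2 block for the 3-arm study; the zeros between the blocks make the studies independent.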

I think we should use a non-centered parameterization of the multivariate distribution.
Take a look at the Stan User’s Guide, in particular the section on non-centered parameterization. The subsection Multivariate reparameterizations describes how to do it:

delta = md + cholesky_decompose(Sigma) * delta_param;

delta should go in transformed parameters or the model block, and delta_param is a parameter vector. Add the statement delta_param ~ std_normal(); to the model block.

Set adapt_delta = 0.99 and max_treedepth = 14 when sampling.
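In rstan that would be something like:

fit <- stan(file = "poisson_mtc_model_with_cholesky.stan", data = stan_data, seed = 42,
            control = list(adapt_delta = 0.99, max_treedepth = 14))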

You used delta ~ normal(0, 10);
and
delta[pos_delta:(pos_delta+offset)] ~ multi_normal_cholesky(md, cholesky_decompose(Sigma[1:(na[i]-1), 1:(na[i]-1)]));
This puts two sampling statements on delta for some elements, which should be corrected.

I haven’t looked too much into the application side of your code, e.g. what you are trying to achieve.

Thanks a bunch for your inputs! I think it actually works now, with the following model:

data {
	int<lower=0> ns;		// number of studies
	int<lower=0> nt;       	// number of treatments
	int<lower=1> no;        // number of observations
	
	int<lower=0> s[no];     // study indicator
	int<lower=0> t[no];     // treatment indicator
	int<lower=0> b[no];     // baseline treatment indicator
	int<lower=0> r[no];     // number of events
	vector<lower=0>[no] E;  // exposure time; vector for vectorised sampling
	
	int<lower=1> na[ns];	// number of arms in study
	int<lower=1> b_idx[ns];	// baseline/control index
	int<lower=1> t_idx[no-ns]; // treatment indices
}

transformed data {
	// Create "scaffold" for block-diagonal covariance matrix
	matrix[no-ns, no-ns] S = rep_matrix(0.0, no-ns, no-ns);
	
	// Make block-diagonal
	{
		int j = 1;
		for (i in 1:ns) {
			S[j:(j+na[i]-2), j:(j+na[i]-2)] = rep_matrix(0.5, na[i]-1, na[i]-1);
			j += na[i] - 1;
		}
	}
	
	// Set diagonals
	for (i in 1:(no-ns)) {
		S[i, i] = 1.0;
	}
}

parameters {
	vector[nt-1] d_param;   // underlying treatment effects of treatments vs. baseline (except baseline)
	vector[no-ns] delta_param;    // helper for non-centered parameterisation
	vector[ns] mu;          // baseline mean effect for each trial
	real<lower=0> sigma;	// heterogeneity parameter
}

transformed parameters{
	vector[nt] d = append_row(0, d_param); // underlying treatment effects (incl. baseline)
}

model {
	// Priors
	mu ~ normal(0, 10);
	sigma ~ normal(0, 10);
	d_param ~ normal(0, 10);
	
	// Likelihoods
	r[b_idx] ~ poisson_log(mu[s[b_idx]] + log(E[b_idx]));
	delta_param ~ std_normal(); 
	{
		// Non-centered parameterisation of trial-specific treatment effects of active treatments
		vector[no-ns] delta = d[t[t_idx]] - d[b[t_idx]] + sigma * cholesky_decompose(S) * delta_param; // sigma scales the Cholesky factor of S, so cov(delta) = square(sigma) * S
		r[t_idx] ~ poisson_log(mu[s[t_idx]] + delta + log(E[t_idx]));
	}
}

Running with the same data, adapt_delta = 0.99, max_treedepth = 15 and seed = 42 yields no divergent transitions and the following summary:

4 chains, each with iter=2000; warmup=1000; thin=1; 
post-warmup draws per chain=1000, total post-warmup draws=4000.

                   mean se_mean   sd     50%    2.5%   97.5% n_eff Rhat
d_param[1]        -0.02    0.00 0.13   -0.02   -0.29    0.26  1568    1
delta_param[1]    -0.02    0.01 0.68   -0.02   -1.38    1.29  2555    1
delta_param[2]     0.43    0.02 0.98    0.44   -1.51    2.37  3473    1
delta_param[3]     0.05    0.01 0.99    0.04   -1.83    2.05  4510    1
delta_param[4]    -0.14    0.01 0.84   -0.14   -1.75    1.56  3811    1
delta_param[5]     0.28    0.01 0.63    0.28   -0.94    1.54  2136    1
delta_param[6]    -0.20    0.01 0.79   -0.20   -1.80    1.34  3375    1
delta_param[7]    -0.64    0.01 0.75   -0.63   -2.19    0.86  2814    1
delta_param[8]    -0.23    0.01 0.98   -0.22   -2.18    1.70  4563    1
delta_param[9]     0.69    0.01 0.82    0.70   -1.01    2.24  3025    1
delta_param[10]   -0.10    0.01 0.67   -0.09   -1.45    1.24  2370    1
delta_param[11]   -0.11    0.01 0.99   -0.12   -2.13    1.83  5577    1
mu[1]             -2.83    0.00 0.09   -2.83   -3.00   -2.67  3935    1
mu[2]             -2.76    0.01 0.41   -2.73   -3.68   -2.04  2230    1
mu[3]             -2.85    0.00 0.18   -2.84   -3.20   -2.51  4486    1
mu[4]             -2.94    0.00 0.06   -2.94   -3.06   -2.82  4663    1
mu[5]             -3.19    0.00 0.16   -3.19   -3.50   -2.89  4474    1
mu[6]             -2.68    0.00 0.12   -2.68   -2.92   -2.44  3473    1
mu[7]             -3.90    0.01 0.55   -3.86   -5.07   -2.93  3981    1
mu[8]             -3.46    0.00 0.16   -3.45   -3.81   -3.17  2506    1
mu[9]             -2.17    0.00 0.07   -2.17   -2.32   -2.03  4336    1
mu[10]            -4.55    0.01 0.64   -4.50   -5.92   -3.45  4517    1
sigma              0.09    0.01 0.20    0.04    0.00    0.48  1043    1
d[1]               0.00     NaN 0.00    0.00    0.00    0.00   NaN  NaN
d[2]              -0.02    0.00 0.13   -0.02   -0.29    0.26  1568    1
lp__            5378.64    0.12 4.13 5378.90 5369.97 5386.13  1101    1

So, the same point estimates of d_param[1] and sigma, but with wider credible intervals. Increasing the number of iterations only affects the CrI widths slightly, as does changing (or even omitting) the priors. This made me wonder: does the specification of the block-diagonal covariance matrix and the location (i.e. md = d[t[t_idx]] - d[b[t_idx]]) actually produce the random-effects model I intend? The location parameter holds the same md’s several times, but I suppose that’s not a problem as long as the covariance matrix is correctly specified, making it a performance “trick” (vectorised sampling) rather than looping over studies.

Funnily, if I define the delta vector in transformed parameters instead, the fit produces 1 divergent transition. If anything, I would have expected the opposite, but there may be a good explanation based on a better understanding of NUTS than I possess.

I don’t have any insights to add that you probably haven’t already come across in my own efforts to port these models, but I did want to point you to David Phillippo’s dissertation, which, while focused on an implementation of NMA that leverages IPD where it is available, contains Stan code that might give you some ideas.

Thanks for the pointer, @timdisher. Looks very interesting, and I look forward to reading the dissertation.

That’s a good question. You could simulate the WinBUGS “trick” in R with, say, 10,000 draws, put them column-wise in a matrix X and calculate cov(X). The mean part holds the same values.

Second, you may check whether the WinBUGS output and the Stan output report the same results.
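A minimal sketch of that check for a single 3-arm trial (the sigma and md values are made up for illustration), drawing the delta’s via the conditional construction and comparing their covariance against sigma^2 * [1, 0.5; 0.5, 1]:

set.seed(42)
n_sim <- 10000
sigma <- 0.15          # assumed heterogeneity SD
md    <- c(0.1, -0.2)  # assumed mean differences for arms 2 and 3 vs. arm 1

delta2 <- rnorm(n_sim, md[1], sigma)                  # arm 2: unconditional
delta3 <- rnorm(n_sim, md[2] + (delta2 - md[1]) / 2,  # arm 3: conditional on arm 2
                sigma * sqrt(0.5 * 3 / (3 - 1)))      # SD from the univariate trick (a = 3)

X <- cbind(delta2, delta3)
colMeans(X)        # should be close to md
cov(X) / sigma^2   # should be close to matrix(c(1, 0.5, 0.5, 1), nrow = 2)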

Sorry for the belated reply. I’ll dig into this later; I had to re-focus my attention and hammer down some other things, but I’ll definitely get back to this in late summer.