Loo issues on simulated meta-analysis data

Hi all,

Trying to do a quick proof of concept of loo as a replacement for DIC in network meta-analysis models I fit at work. I have yet to convince everyone to do a full switch to stan (mostly because I am the only one with any familiarity but it’s mostly via brms/rstanarm), so I hope it’s OK that this is a JAGS model for now. The question relates to the use of the loo package.

model
  {
  for (i in 1:NS) {
  
  
  
   # If NAs exist in covariates, impute from the mean at each step


  # Prior in baselines, will be normal for most but uniform for identity-binomial
  mu[i] ~ dnorm(0,.0001)           # vague priors for all trial baselines
  
  #Code for the linear predictor and likelihood

  
  for (k in 1:na[i]) {
  r[i, k] ~ dbin(p[i, k], n[i, k]) # Observed number of events in arm k comes from binomial
  
  #Code for the linear predictor
  logit(p[i,k]) <- mu[i] + delta[i,k]   # model for linear predictor
  
  rhat[i, k] <- p[i, k] * n[i, k] # Expected number of events
  
 # Calculte deviance for arm
  dev[i, k] <- 2 * (r[i, k] * (log(r[i, k]) - log(rhat[i, 
  k])) + (n[i, k] - r[i, k]) * (log(n[i, k] - r[i, 
  k]) - log(n[i, k] - rhat[i, k])))


  log_lik[i,k] <- logdensity.bin(r[i,k], p[i, k], n[i, k])
  }
  



# Calculate residual deviance for trial
  resdev[i] <- sum(dev[i, 1:na[i]])
  
  
  # Code for random effects

  
  
  }
  
  totresdev <- sum(resdev[])
  
  d[1] <- 0.00000E+00
  
  for (k in 2:NT) {
  
  d[k] ~ dnorm(0.00000E+00, 1.00000E-04)
  
  }
  
  # If a random effects model is used, code for different priors
  
  
  # If a meta-regression model is used, code for Betas
  
  
   
  # Transformed effects
   

 # Treatment 1 baseline, based on average of NP trials including it. 							
  for (i in 1:NS){ 							
    mu1[i] <- mu[i] * equals(t[i,1],1)
  }							
  
  for (k in 1:NT){ 							
    logit(T[k]) <- (sum(mu1[])/NP) +d[k]
  }


# Treatment effect estimates					
  for (c in 1:(NT - 1)) {  							
    for (k in (c + 1):NT) {
      # odds ratio
      OR[c, k] <- exp(d[k] - d[c] )
      # log odds ratio
      lOR[c, k] <- d[k] - d[c]				
      # relative risk
      RR[c, k] <- T[k]/T[c]				
      # risk difference
      RD[c, k] <- T[k] - T[c]
      
      better[c, k] <- 1 - step(d[k] - d[c]) # Outcome is bad			
    }  							
  }
  
  # Rank code depending on whether outcomes are good or bad
  rk <- rank(d[]) # Outcome is bad

  # All of the common rank based outputs
  
  for (k in 1:NT) {
  
    best[k] <- equals(rk[k], 1) # Best treatment has rank 1
    for (h in 1:NT) {
    prob[k, h] <- equals(rk[k], h) # Probability that given drug is in any rank
    }
  }
  for (k in 1:NT) {
    for (h in 1:NT) {
      cumeffectiveness[k, h] <- sum(prob[k, 1:h]) # Cumulative rank (e.g. area under ranking curve)
    }
  }

  for (i in 1:NT) {
    SUCRA[i] <- sum(cumeffectiveness[i, 1:(NT - 1)])/(NT - 1) # Calculate SUCRA
    }

}

I have simulated a meta-analysis dataset and the below code returns the simulated OR nicely. I calculate log density for each arm of a trial. The deviance calculation is standard code from the National Institutes for Health and Care Excellence so I have left it there for now.

Data was simulated with a fixed effect model, so there shouldn’t be any issues, but my results from loo are less than comforting

Computed from 120000 by 100 log-likelihood matrix

         Estimate   SE
elpd_loo   -359.0  9.2
p_loo        52.6  6.0
looic       718.0 18.3
------
Monte Carlo SE of elpd_loo is NA.

Pareto k diagnostic values:
                         Count Pct.    Min. n_eff
(-Inf, 0.5]   (good)     24    24.0%   15548     
 (0.5, 0.7]   (ok)       61    61.0%   660       
   (0.7, 1]   (bad)      15    15.0%   36        
   (1, Inf)   (very bad)  0     0.0%   <NA>      
See help('pareto-k-diagnostic') for details.

Is there any obvious reason for this? PSIS paper suggests something wrong with the model but I know that isn’t true in this case. Could this be a sign of something pathological in JAGS sampling that might otherwise be caught in stan? Is this purely an issue with the loo approximation or does it suggest I should be concerned about my DIC results as well?

Lastly (and hopefully for my own private mission), would more principled priors help here at all? Typically we can’t touch these because they are annointed by the gods, but I’m keen to change that if possible.

Edit: Lastly (really this time), if all I really want is to look at stacking weights, is this as problematic?

Based on this you have only 100 observations. It would be easier to see data and parameters from blocked Stan code, but it seems you have 100+ parameters, that is, the model is very flexible.

See loo package glossary and especially the section “Interpreting p_loo when Pareto k is large” . It is possible that in this case it’s not about the model misspecification, but just having very flexible model with weak prior and thus observations are influential.

For the convergence diagnostics see the new R-hat paper. You can use the new monitor.R also with JAGS as shown in Appendix E

If Pareto k diagnostic gives large values then WAIC and DIC fail even more.

More informative priors can help, but sometimes even the best models are too flexible for loo (and WAIC and DIC).

Stacking weights will be also biased if there are Pareto k values larger than 0.7.

You could use k-fold-CV instead.

We’ll soon have also improved loo version, which probably would solve this problem…

I have 50 trials, each with two arms but the total number of observations would >>100, could this be causing a problem?

The model is definitely NOT very flexible. We estimate a baseline nuisance parameter for every trial (so n = 50) but then treatment difference is just fixed effect. It should be equivalent-ish to just running:

glm(cbind(event, non-event) ~ -1 + study + trt, data = dat))

Sorry, I don’t understand. Could you give more details on data.

For me nuisance parameter for every trial is lot of parameters, and lot of flexibility.

What are study and trt? Where are the nuisance parameters? Can you try running with this?

In your model code you have delta[i,k] which is NSna, but it has no prior so it’s just NSna parameters. I can’t figure out whether delta is data or parameter, but I assume it’s parameter. Sorry, I don’t know how JAGS works.

loo_validation.txt (913 Bytes)

I’ve uploaded the data file. Each row is a study, n0 is n in control, r0 is event in control, n1/r1 is same but in treatment.

It looks like in my haste I copied the wrong model file, corrected is below. Delta is just the trial specific observed treatment effect (which is fixed in the fe model), the d[t[i,k]]s are the actual parameters. The mu[i] are the nuisance parameters.

model
  {
  for (i in 1:NS) {
  
  delta[i, 1] <- 0.00000E+00 # Trial specific outcome in arm 1 is zero
  
   # If NAs exist in covariates, impute from the mean at each step


  # Prior in baselines, will be normal for most but uniform for identity-binomial
  mu[i] ~ dnorm(0,0.0001)           # vague priors for all trial baselines
  
  #Code for the linear predictor and likelihood

  
  for (k in 1:na[i]) {
  r[i, k] ~ dbin(p[i, k], n[i, k]) # Observed number of events in arm k comes from binomial
  
  #Code for the linear predictor
  logit(p[i,k]) <- mu[i] + delta[i,k]   # model for linear predictor
  
  rhat[i, k] <- p[i, k] * n[i, k] # Expected number of events
  
 # Calculate deviance for arm
  dev[i, k] <- 2 * (r[i, k] * (log(r[i, k]) - log(rhat[i, 
  k])) + (n[i, k] - r[i, k]) * (log(n[i, k] - r[i, 
  k]) - log(n[i, k] - rhat[i, k])))


  }
  



# Calculate residual deviance for trial
  resdev[i] <- sum(dev[i, 1:na[i]])

  
  
  # Code for random effects

  for (k in 2:na[i]) {
  delta[i, k] <-  d[t[i, k]] - d[t[i, 1]]  # Trial specific treatment effect comes from a normal distribution
    }
  
  }
  
  totresdev <- sum(resdev[])
  

  d[1] <- 0.00000E+00


  for (k in 2:NT) {
  
  d[k] ~ dnorm(0.00000E+00, 0.0001)
  
  }
  
  # If a random effects model is used, code for different priors
  
  
  # If a meta-regression model is used, code for Betas
  
  
   
  # Transformed effects
   

 # Treatment 1 baseline, based on average of NP trials including it. 							
  for (i in 1:NS){ 							
    mu1[i] <- mu[i] * equals(t[i,1],1)
  }							
  
  for (k in 1:NT){ 							
    logit(T[k]) <- (sum(mu1[])/NP) +d[k]
  }


# Treatment effect estimates					
  for (c in 1:(NT - 1)) {  							
    for (k in (c + 1):NT) {
      # odds ratio
      OR[c, k] <- exp(d[k] - d[c] )
      # log odds ratio
      lOR[c, k] <- d[k] - d[c]				
      # relative risk
      RR[c, k] <- T[k]/T[c]				
      # risk difference
      RD[c, k] <- T[k] - T[c]
      
      better[c, k] <- 1 - step(d[k] - d[c]) # Outcome is bad			
    }  							
  }
  
  # Rank code depending on whether outcomes are good or bad
  rk <- rank(d[]) # Outcome is bad

  # All of the common rank based outputs
  
  for (k in 1:NT) {
  
    best[k] <- equals(rk[k], 1) # Best treatment has rank 1
    for (h in 1:NT) {
    prob[k, h] <- equals(rk[k], h) # Probability that given drug is in any rank
    }
  }
  for (k in 1:NT) {
    for (h in 1:NT) {
      cumeffectiveness[k, h] <- sum(prob[k, 1:h]) # Cumulative rank (e.g. area under ranking curve)
    }
  }

  for (i in 1:NT) {
    SUCRA[i] <- sum(cumeffectiveness[i, 1:(NT - 1)])/(NT - 1) # Calculate SUCRA
    }

}

Is the new rhat only available in a dev version of rstan? I am on 2.18.2 and the appendix E code doesn’t give all the fun bonus columns when I run it (Bulk_ESS, Tail_ESS).

Got this working. All rhat < 1.01, all ESS > 7000

Great

50 rows…

… and thus mu has length 50? So you still have more than 50 parameters and only 100 observations?

Bit since your loo report said 100, you are leaving either control or treatment out, and then that trial has either 100% controls or 100% treatments, which is much different from 50%/50% case with full data. In this kind of balanced case you might want to leave out the whole trial to keep the balance.

If you think that you could do your model as

glm(cbind(event, non-event) ~ -1 + study + trt, data = dat))

then you could test this with Stan using either rstanaarm and stan_glm or brms and brm

So it doesn’t seem to be an issue with JAGS because I get the right parameter back with this model in rstanarm and still have loo issues.

Computed from 4000 by 100 log-likelihood matrix

         Estimate   SE
elpd_loo   -356.3  8.6
p_loo        49.8  5.4
looic       712.5 17.2
------
Monte Carlo SE of elpd_loo is NA.

Pareto k diagnostic values:
                         Count Pct.    Min. n_eff
(-Inf, 0.5]   (good)     31    31.0%   565       
 (0.5, 0.7]   (ok)       49    49.0%   147       
   (0.7, 1]   (bad)      17    17.0%   89        
   (1, Inf)   (very bad)  3     3.0%   14        
See help('pareto-k-diagnostic') for details.

To get the model to work as I wrote it out I needed entry in long format (1 arm per row) hence why we’re back at 100 observations.

For excluding the whole trial at once would I just sum log predictive density calculated for each arm?

I didn’t expect JAGS to fail on this simple model, and thus it was expected that you get the same result with rstanarm. However, now you can run loo with option k_threshold=0.7 and automatically refit all folds with khat>0.7. It should be quite fast for this model and data.

Yes

The only issue with this is that this is a toy dataset/model and I think adapting the full version of what I’m looking for to rstanarm/brms is beyond my ability. See this earlier question for a sense of where I am trying to get to.

Will give this a shot and report back!

Just as bad :(

I suggest to continue in that thread by writing down the equations for the statistical model. @paul.buerkner thinks you can do that model with brms, and he can help you more if you provide more details.

Just wanted to close this: Your intuition that study-specific intercepts was the culprit was correct. Excluding that feature of the model lead to loo results that were well-behaved.

2 Likes