Information criteria - conflicting advice for hierarchical models

Have been having a great time reading through stacking (https://projecteuclid.org/euclid.ba/1516093227), survey of bayesian predictive methods for model selection (https://projecteuclid.org/euclid.ssu/1356628931), and projpred (https://arxiv.org/abs/1503.08650) papers and have a quick (potentially silly) question:

In the meta-analysis for health technology assessment literature, one of the more authorative documents (http://nicedsu.org.uk/wp-content/uploads/2017/05/TSD2-General-meta-analysis-corrected-2Sep2016v2.pdf), states that while DIC could be considered for choosing between fixed and random effect models, that it should not be used to choose between a model with and without an interaction term:

" In deciding whether a covariate should be included, the posterior mean of the regression coefficient should be compared to the posterior standard deviation. The DIC is not a reliable criterion for deciding whether to include a covariate in RE models. This is because RE models can fit the data equally well, whatever the between-trial variation ."

This would also seem to suggest that loo or even a loo-stacking type of approach would also be inappropriate, but this doesn’t make a lot of sense to me. Is there a reason why loo/AIC/DIC etc… would be appropriate for model selection/averaging in the analysis of individual participant data but not for meta-analysis?

You should be careful on translating recommendation from DIC to LOO or WAIC. DIC looks at the model comparison based on a single overall estimate of predictive log likelihood. While LOO and WAIC account for subject predictive log-likelohhod across the multiple iterations

This paper seems to be outdated on Bayesian methods of model comparison. As it only looks at DIC. I see no reason for LOO to fail for model selection on meta-analysis. In every case, assuming your model is correct, LOO would be a better estimate of the expected predictive lo-likelihood, and with the respective model comparison there is a measure of Standard error for the difference. Which helps to represent the magnitude of the difference.

In a different threat, was mentioned that the basic recommendation is that the ELPD difference should be at least 2 SE to represent a meaningful model difference, but strongly suggesting to use at least 4 SE as the measure of SE is “optimistic”

I might be missing something

There is this presentation from Aki Vehtari on the topic
https://aalto.cloud.panopto.eu/Panopto/Pages/Viewer.aspx?id=e998b5dd-bf8e-42da-9f7c-ab1700ca2702

Would also recommend these paper on model comparison
Piironen, J., & Vehtari, A. (2017). Comparison of Bayesian predictive methods for model selection. Statistics and Computing , 27 (3), 711–735. https://doi.org/10.1007/s11222-016-9649-y

Vehtari, A., Gelman, A., & Gabry, J. (2017). Practical Bayesian model evaluation using leave-one-out cross-validation and WAIC. Statistics and Computing , 27 (5), 1413–1432. https://doi.org/10.1007/s11222-016-9696-4

2 Likes

I could imagine the same issue with loo - so far as I can see the difficulty is because there is enough information about the subject in the left in data to estimate the random coefficient, so the predictor is not helpful. If you instead took a leave one subject out approach, the predictor may be able to show it’s value. This is armchair speculation so take it for what it’s worth…

  • If each group has lot of individuals, then the group specific random effect is well defined (“whatever the between-trial variation”) and if we leave one individual out the posterior for the random effect doesn’t change much and we don’t learn much about how good our model is for the “between trial variation” in the hierarchical model
  • If the focus is in the “between trial variation”, we can do leave-one-group-out cross-validation, and assess how well we can predict the group effect for the left out group.

See more in my videos https://www.youtube.com/watch?v=Re-2yVd0Mqk or Lecture videos 9.1 and 9.2 https://github.com/avehtari/BDA_course_Aalto
And a paper by Merkle, Furr & Rabe-Hesketh https://arxiv.org/abs/1802.04452

@avehtari you seem to be suggesting that it’s the number of groups, and not the amount of information within the groups, that determines whether the predictor looks important. This is contrary to my armchair speculation above – am I interpreting you correctly?

Thank you all the responses. I will start looking through the linked references now but in the interim will provide a bit more information on all the lovely challenges I’ve been encountering trying to translate model selection advice to these models:

  1. The goal of this particular brand of meta-analysis is to compare multiple treatments via a network of direct and indirect evidence. The full model details are above but in short what we typically have are observations of arms within trials, and we fit a random effect for the treatment effect by study (shared across treatments for identifiability reasons) so predictions are focused on new studies. The rapid approximations estimates of LOO don’t tend to work for these models since the way we preserve within-trial comparisons is by fitting a study specific fixed effect so we can end up estimating almost as many parameters as studies.

  2. The diagram below shows where I am stuck for how to implement actual cross-validation via fitting multiple models. Since treatment estimates are all made relative to a reference treatment (A), if any of these nodes are informed by a single study then we either can’t get an estimate for that treatment or, in the case of dropping the sole study comparing C vs A, we can end up not being able to estimate multiple treatments (since D relies on C to connect to A).
    image

A couple other quick notes:

Just to clarify: for a binomial case with an interaction term these models would be:

r[i,k] ~ binomial(p, n[i,k])
logit( p ) <- mu[i] + study-arm_specific_treatment_effect + beta_age*centered_mean_study_age[i]

study-arm_specific_treatment_effect<- normal(treatment_effect, sd)

where i and k index trials and arms and mu[i] is a study-specific fixed effect. So based on the paper you link the groups to be left out are studies, correct? I just wanted to be clear that we don’t typically have patient level data so leaving out individuals is not an option.

Thanks, this is useful. When you earlier said just “hierarchical models”, I just guessed the most common simple hierarchical model. The network in you model makes A, B, C, D non-exchangeable and cross-validation more difficult. Could you provide a bit more details? Different levels you mention are

  • in general no individuals
  • arms
  • trials
  • treatments (A, B, C, D)
  • studies

Can you tell a bit more how these are related (I can guess, but I’m not expert on these kind of studies, so I could make mistakes) and how many each of those you have, and which are assumed to be exchangeable given some other information?

1 Like

“the predictor” at what level? A large number of groups make it easier to make inference about the predictor on group level, and a large number if individuals in a group make it easier to make inference about the predictor on that group. And of course it’s not just the amount of data, but also how wide the distributions at different levels are.

Yes this is correct. Typically the goal of these analyses is comparative efficacy of multiple treatments that have all been trialed against a common comparator (typically placebo), but they also extend to the case where you have mixed evidence (direct comparisons between two comparators as well as indirect evidence via placebo). We would commonly have between 10-20 treatments with each treatment vs placebo comparison has 1-5 2-3 arm studies with maybe 200-300 patients per arm (although this varies based therapeutic area). Treatment effects are not typically modeled as exchangeable (but can be in models where treatments are eg exchangeable within a class).

Treatments effects are considered exchangeable across studies. Since these are RCTs were interested in “maintaining randomization” i.e. we’re typically estimating and pooling ORs across trials. The reason data is entered in arm format (instead of just pooling log ORs assuming normal distribution) is so that we can use exact likelihoods. The workaround for this is to fit fixed effects for arm 1 in every trial.

We are putting together some work based on a lot of the good writing you and others here have done in the last few years around model selection (it is very out of date in our field), but all these factors together have made implementation of loo very challenging.

I’m happy to help, and I’m interested in this more complex structure of the data, but so far it seems it’s challenging for me to put all pieces together.

Slow down. Where did the trials go? Is one trial as described above? How do you fit 20 treatments in 1-5 2-3 arm studies? Maybe if you write the full model (equations, Stan code or pseudo-code) or point a paper with such a model? I tried reading your previous messages and one of them had three lines of code, but no explanation what is r, is there typo in study-arm_specific_treatment_effect (it’s strange to have - on the left hand side of <-), and is there more structure to show?

Do classes form grouping or hierarchy? Is that used?

Now here trials is mentioned, but I don’t understand how it is related to studies.

Is arm 1 somehow special or is it arbitrary which arm is labeled 1?

1 Like

Thank you, hopefully I can help clear up problems below

I have attached a reproducible example here:
example_nma.R (2.1 KB)
ex_nma_jags.txt (1.8 KB)
In this particular network there are:
24 studies
50 arms (two 3-arm studies)
4 treatments

All formula and additional code can be found here. Briefly:



Where the delta is drawn from a normal distribution

We sometimes will specify each of the ds in equation six above as being exchangeable within a class. so if d[2] was a member of antibody class we could say:
d[2] ~ dnorm(d_ab, sd_class)
This is rare though.

Sorry, studies == trials.

It is arbitrary

In the code I’ve attached you will see that loo fails badly and that the number of parameters is nearly equal to the number of data points. Am I understanding correctly that DIC is likely also failing in this scenario since it is only a good estimate of predictive error when the number of effective parameters is much smaller than the number of data points?

Expanding on above. I’ve done some troubleshooting to make sure everything is working as expected. Using the built-in WAIC monitor in JAGS I get the same estimate as the one output by loo. One unexpected finding however is that when I change the model to a fixed effect model where I am estimating a fixed baseline for every trial (24) and three treatment effects (d[1] is set to zero) I get:

pD: DIC = 26.8
p_waic = 131.2 (loo::waic and jags monitor)
p_loo = 145.2 (> 50% bad or very bad pareto k diagnostics)

Thinking that this could be partially explained by the nuisance trial baseline, I switched data entry to be in contrast format, where each trial estimates a log odds ratio for each arm versus arm 1 and then the log odds ratios are specified as coming from normal distribution with known standard error. Now fitting a fixed effect model (cont_nma_model.txt) have the following:

pD DIC = 2.98
p_WAIC = 27.7
p_loo = 27.1 (1 bad, 1 very bad pareto-k)

In both cases pD from DIC gives me what I would expect in terms of the effective number of parameters. Any idea what might be happening here?

This is getting more clear. The code is quite clear, but it is still difficult to match what you write with what is in the code.

Going back to what is your goal I re-read the first message

What do you mean by an “interaction”? The text you quote says “covarite”.
What are the models you want to compare?

Yes, and more so for DIC and for DIC we can say inappropriate. WAIC and PSIS-LOO are also likely to suffer from computational problems in case of random effect for each observation (ie when Pareto-k’s are high), but even with exact LOO, it is possible that random effects can explain all variation with or without additional covariate. So for WAIC and PSIS-LOO we can say that it is appropriate, but can be either computationally difficult or not well identified. In such cases it is useful to also look at the posterior scale of the random effect. If adding a covariate reduces the scale of random effects, then that covariate is useful. See section 2.2 in the appendix of https://arxiv.org/abs/1912.03549 for analysing how much each component explains the variation.

DIC is failing if loo fails.

From the model code I don’t see how A, B, C, and D are connected as in your earlier graph.

Based on the code this seems to be simple hierarchical model. If the goal is to analyse treatment effects, it is likely that you could analyse the posterior of the effects directly and you don’t need *IC or LOO.

What do you get in the other case? What is fixed part and what is unknown (btw. I hate terms fixed and random effects as they are used so that I don’t which parameters are unknown and which are not)? Do you still have n random effects delta[i,k]? Do you have model code for this?

Again I’m confused what is fixed as you still have delta[i,k].

You have in both models d[k] ~ dnorm(0, 0.11), but that 0.11 has very different meaning with respect to the scale of the likelihood.

Looking at the models, I would expect the values close to the number of terms in lik

Thanks for spending so much time on this @avehtari

Sorry, the covariates here are all interaction terms i.e. we say the relative effect of treatment k vs the reference treatment increases/decreases based on the level of the covariate.

Is there any reading you can suggest on this?

Ultimately the motivation for getting loo working in this scenario is for implementing stacking weights . The inspiration came from the linear subset regression section of the stacking paper. We will often have maybe 3-4 study level covariates that we think might be important but only have enough data to fit one at a time. We’re looking at a few options:

  1. Regularized horseshoe
  2. Stochastic search variable selection
  3. Stacking
  4. Selection of a single model using IC
  5. Selection of a single model using change in scale of random effects

The thinking what that we could do a similar set up with simulation in M-open and M-closed. We would expect horsehoe/SVSS would work in M-closed but that stacking would be better in M-Open. From what you’re saying, it sounds like it might either be impossible (in cases of PSIS-LOO) or computationally difficult (in case of exact loo) to get a reliable estimate we can use for either 3. or 4. and that even if we did it might not actually be helpful. Is there a solution to this problem that will work in M-open?

Sorry this code is made to be easily modular for some automated analyses. The important difference in the random and fixed models is:

 // Fixed effect model
delta[i,k] <- d[t[i, k]] - d[t[i, 1]] # Fixed, all trials with same treatments in arm 1 and arm k will have the same delta.
  
// Random Effect model - All trials with the same treatments in arm 1 and arm k have exchangeable delta[i,k]
    delta[i, k] ~ dnorm(md[i, k], taud[i, k])
      md[i, k] <- d[t[i, k]] - d[t[i, 1]] 

In the fixed effect model the unknowns are:
An intercept for each study (24 in this case)
Three treatment effects (d[2], d[3], d[4])
Total parameters: 27

In the random effect model the unknowns are:
An intercept for each study = 24
A treatment effect for arm 2:n for each study = 26
the scale of the random effect which I would understand as controlling the amount of partial pooling so that the number of treatment effects should be somewhere between 3 and 26
Total parameters = 50

Random effects model:
pD from DIC: 44.8

Loo output:

Computed from 40000 by 50 log-likelihood matrix

       Estimate   SE
elpd_loo   -179.1  5.4
p_loo        45.9  3.0
looic       358.1 10.7
------
Monte Carlo SE of elpd_loo is NA.

Pareto k diagnostic values:
                         Count Pct.    Min. n_eff
(-Inf, 0.5]   (good)      0     0.0%   <NA>      
 (0.5, 0.7]   (ok)        5    10.0%   1785      
   (0.7, 1]   (bad)      37    74.0%   48        
   (1, Inf)   (very bad)  8    16.0%   12        
See help('pareto-k-diagnostic') for details.

Fixed effect model:

pD from DIC: 26.7

Loo output:

Computed from 40000 by 50 log-likelihood matrix

         Estimate    SE
elpd_loo   -343.7  70.2
p_loo       137.6  39.4
looic       687.3 140.4
------
Monte Carlo SE of elpd_loo is NA.

Pareto k diagnostic values:
                         Count Pct.    Min. n_eff
(-Inf, 0.5]   (good)     12    24.0%   4518      
 (0.5, 0.7]   (ok)        9    18.0%   328       
   (0.7, 1]   (bad)      19    38.0%   63        
   (1, Inf)   (very bad) 10    20.0%   3         
See help('pareto-k-diagnostic') for details.

So pD from DIC makes sense to me in both cases but loo not so much in in the fixed effect model.

I left this for the end since I think it’s a bit of a separate problem maybe. All treatments in the data in this network are connected through a common comparator. The code I attached estimates each of the deltas as treatment effects relative to the reference treatment (which is arbitrarily set as placebo). If we had a group of studies that could be represented by the earlier graph and we wanted to calculate exact loo, then in the model where we drop the study that compared C to A we would disconnect both treatment C and D from the network so we wouldn’t be able to estimate D vs A or C vs A.

2 Likes

And the model you posted earlier did not have any covariates?

I don’t remember seeing anything which would explain all aspects. These are related
http://www.stat.columbia.edu/~gelman/research/published/final_sub.pdf
https://avehtari.github.io/modelselection/rats_kcv.html


I’ll add individual level random effect case study to my model selection examples soon. I mentioned the section 2.2 in the appendix of https://arxiv.org/abs/1912.03549 for analysing how much each component explains the variation. Before that in the earlier paper https://www.nature.com/articles/s41467-019-09785-8 we used predictive performance estimates, but looking at the posterior of the explained variations is better (when applicable).

This is useful to know. I think you don’t need stacking, but we can check this.

Can you explain this?

For 3-4 covariates RHS might be overkill or there is not much difference.

For 3-4 covariates it seems you could test all combinations.

I would recommend first to try to make a model with all covariates. Stacking is useful as a quick test of benefits of model averaging or if making a continuous expansion to super model is for some reason computationally challenging.

If theere is no bad model misspecification and if there are no strong posterior dependencies then looking at the posterior is better than looking at IC (https://avehtari.github.io/modelselection/betablockers.html, https://avehtari.github.io/modelselection/collinear.html)

Integrate over the lowest level random effects. If they have normal prior, then Laplace approximation is likely to work well. Alternatively drop explicit random effect parameters and switch to overdispersed distribution for y, e.g. switch binomial to beta-binomial or normal to t. This would make PSIS-LOO computation to work well. It would be useful to see posterior predictive checks for the current models without random effects).

I also figured out this just 5mins ago.

These and loo outputs are super useful for me. See also Interpreting p_loo when Pareto k is large

DIC and PSIS-LOO are both failing as each observation has it’s own parameter and thus removing one observations changes the posterior for the corresponding random effect a lot. This can be solved by integrating random effects away or switching to observation models which correspond to integrating analytically over random effects with certain distribution.

p_loo is much larger than the total number of parameters which in this case is very likely due to very bad model misspecification. It would be very useful to make posterior predictive check plot (see, e.g. https://mc-stan.org/bayesplot/articles/graphical-ppcs.html)

Is this similar to estimating skill of players in some sport, where comparisons can be made only based on matches where both players had met? There would not be a problem if treatments are connected via alternate route, but if there is a single trial when removed would separate other treatments in distinc two groups then there is a problem.

Roaches example has now also “random effects” example, illustrating that PSIS-LOO and WAIC fail due to the computational reasons, but cross-validation can still be used with K-fold-CV. The example also illustrates that if instead of “random effects” we use overdispersed distribution which corresponds to another random effect model, but with random effects integrated out analytically, then PSIS-LOO works just fine.

1 Like

Thanks @avehtari. Sorry for the delay, I have been taking this opportunity to translate code from JAGS to Stan and then implemented the beta-binomial version at the same time.

Ya this is pretty much the issue. I answer the rest of your comments below.

First, Stan code:

Fixed Effect model

data {
  int<lower=1> NT; // Number of treatments
  int<lower=1> NS; // Number of studies
  int ND; // Number of data points
  int study[ND]; // Study indicator
  int  n[ND]; // Number of patients in study and treatment arm 
  int  r[ND]; // Number of responders in study and treatment arm
  int  t[ND]; // Treatment used in study and treatment arm
  int base[ND]; // Treatment in arm 1
  int arm[ND]; // Arm index
  
}



parameters {
  vector[NT]  d_orig; // Treatment effect for each treatment 
  vector[NS] mu; // One mu for each study
  
}

transformed parameters {
  vector[NT] d = d_orig; // Treatment effect for each treatment with values of 0 for treatment one
  real  p[ND]; // logit probability
  real md[ND]; // Mean difference
  
  
  d[1] = 0.0; // Treatment effect is zero for reference treatment
  
  for(i in 1:ND){
   md[i] = d[t[i]] - d[base[i]];
   p[i] = mu[study[i]] + md[i];
   }
   

  
 }
 


model {
  
  r ~ binomial_logit(n, p);
  
  // Priors
  mu ~ normal(0, 3);
  d_orig ~ normal(0, 3);
    
  }

generated quantities{
  real rhat[ND];
  real dev[ND];
  real totresdev;
  real lik[ND];
  int r_rep[ND];

  
  for (i in 1:ND){
    rhat[i] =  inv_logit(p[i]) * n[i]; // Expected number of events
    dev[i] = 2 * (r[i] * (log(r[i]) - log(rhat[i])) + (n[i] - r[i]) * (log(n[i] - r[i]) - log(n[i] - rhat[i]))); // Deviance residual

lik[i] = binomial_logit_lpmf(r[i] | n[i], p[i]);
  
  }
  
   for(i in 1:ND){
    r_rep[i] = binomial_rng(n[i], inv_logit(p[i]));
    
    
  }
  
  totresdev = sum(dev);

     }

Random Effects Model

The transformed data section creates a covariance matrix that ensures that deltas from multi-arm trials have covariance var/2. This seems to be working as I would expect but it might not be the most efficient approach. Can you confirm I got the non-centered parameterization right?

data {
  int<lower=1> NT; // Number of treatments
  int<lower=1> NS; // Number of studies
  int ND; // Number of data points
  int study[ND]; // Study indicator
  int  n[ND]; // Number of patients in study and treatment arm 
  int  r[ND]; // Number of responders in study and treatment arm
  int  t[ND]; // Treatment used in study and treatment arm
  int base[ND]; // Treatment in arm 1
  int arm[ND]; // Arm index

}

transformed data{
   matrix[ND, ND] re_cov; // Random effect covariance
   vector[ND] mu_cov; // Vector of zeroes for non-centered parameterization
  
  for(i in 1:ND){
    for (k in 1:ND){
    
      if (i == k)
      re_cov[i, k] = 1; // Diagonal = 1 (standard normal)
      else if (study[i] == study[k])
      re_cov[i, k] = 0.5; // In multi-arm trials, covariance = sd^2/2
      else
      re_cov[i,k] = 0; // No co-variance between trials
    }
  mu_cov[i] = 0; // Mu is vector of zero (standard normal)
  }  

}


parameters {
  vector[NT]  d_orig; // Treatment effect for each treatment 
  vector[NS] mu; // One mu for each study
  vector[ND] delta_raw; // Non-centered paramaterization
  real<lower=0> std_dev; // Between trial standard deviation 
}

transformed parameters {
  vector[NT] d = d_orig; // Treatment effect for each treatment with values of 0 for treatment one
  vector[ND] delta; // Study specfic treatment effect allowing delta to be zero when base is 1
  real  p[ND]; // logit probability
  real md[ND]; // Mean difference
  
  
  d[1] = 0.0; // Treatment effect is zero for reference treatment
  
  for(i in 1:ND){
   md[i] = d[t[i]] - d[base[i]];
  
  if(arm[i] == 1){
    delta[i] = 0; // Delta is zero in arm 1
    } else{

  delta[i] = md[i] + std_dev * delta_raw[i]; // Non-centered parameterization (equivalent to normal(md[i], std_dev) with delta raw sampled with multi-arm trial cov matrix)
  
   }
  p[i] = mu[study[i]] + delta[i];
  
 }
 
}

model {
  
  r ~ binomial_logit(n, p);
  delta_raw ~ multi_normal(mu_cov, re_cov); // Non-centered paramaterization with covariance matrix for multi-arm trials
  
  // Priors
  mu ~ normal(0, 3);
  d_orig ~ normal(0, 3);
  std_dev ~ uniform(0, 2);
    
  }

generated quantities{
  real rhat[ND];
  real dev[ND];
  real totresdev;
  real lik[ND];
  int  r_rep[ND];

  
  for (i in 1:ND){
    rhat[i] =  inv_logit(p[i]) * n[i]; // Expected number of events
    dev[i] = 2 * (r[i] * (log(r[i]) - log(rhat[i])) + (n[i] - r[i]) * (log(n[i] - r[i]) - log(n[i] - rhat[i]))); // Deviance residual

lik[i] = binomial_logit_lpmf(r[i] | n[i], p[i]);
  
  }
  
  totresdev = sum(dev);
  
  for(i in 1:ND){
    r_rep[i] = binomial_rng(n[i], inv_logit(p[i]));
    
    
  }

     }

Beta-binomial model

I’m not as confident in this implementation, but based it on a pairwise implementation described
here

data {
  int<lower=1> NT; // Number of treatments
  int<lower=1> NS; // Number of studies
  int ND; // Number of data points
  int study[ND]; // Study indicator
  int  n[ND]; // Number of patients in study and treatment arm 
  int  r[ND]; // Number of responders in study and treatment arm
  int  t[ND]; // Treatment used in study and treatment arm
  int base[ND]; // Treatment in arm 1
  int arm[ND]; // Arm index
  
}



parameters {
  vector[NT]  d_orig; // Treatment effect for each treatment 
  vector[NS] mu; // One mu for each study
  real<lower=0,upper=1> rho; // Overdispersion parameter (1 for whole network)
  
}

transformed parameters {
  vector[NT] d = d_orig; // Treatment effect for each treatment with values of 0 for treatment one
  real  p[ND]; // logit probability
  real md[ND]; // Mean difference

  real alpha[ND]; // Alpha for beta portion of beta-binomial
  real beta[ND]; // Beta for beta portion of beta-binomial
  
  
  d[1] = 0.0; // Treatment effect is zero for reference treatment
  
  for(i in 1:ND){
   md[i] = d[t[i]] - d[base[i]];
   
   p[i] = inv_logit(mu[study[i]] + md[i]);
   
  alpha[i] = p[i] * (1 - rho) / rho; 
  beta[i] = (1 - p[i]) * (1 - rho) / rho;
  
   }
   

  
 }
 


model {
  
  r ~ beta_binomial(n, alpha, beta); // Beta-binomial likelihood
  
  // Priors
  mu ~ normal(0, 3);
  d_orig ~ normal(0, 3);
  rho ~ uniform(0, 1);
  
  }

generated quantities{
  real rhat[ND];
  real dev[ND];
  real totresdev;
  real lik[ND];
  int  r_rep[ND]; 

  
  for (i in 1:ND){
    rhat[i] =  p[i] * n[i]; // Expected number of events
    dev[i] = 2 * (r[i] * (log(r[i]) - log(rhat[i])) + (n[i] - r[i]) * (log(n[i] - r[i]) - log(n[i] - rhat[i]))); // Deviance residual

lik[i] = beta_binomial_lpmf(r[i] | n[i], alpha[i], beta[i]);
  
  }
  
  totresdev = sum(dev);
  
   for(i in 1:ND){
    r_rep[i] = beta_binomial_rng(n[i], alpha[i], beta[i]);
    
    
  }

     }

LOO Results

Fixed

As expected, PSIS-LOO performs very poorly for the fixed effect model. Based on your previous posts I would interpret this as meaning the model is mis-specified, but I couldn’t actually use these values for comparison.

Computed from 2000 by 50 log-likelihood matrix

         Estimate    SE
elpd_loo   -333.1  69.3
p_loo       128.4  39.4
looic       666.3 138.6
------
Monte Carlo SE of elpd_loo is NA.

Pareto k diagnostic values:
                         Count Pct.    Min. n_eff
(-Inf, 0.5]   (good)     10    20.0%   168       
 (0.5, 0.7]   (ok)       18    36.0%   80        
   (0.7, 1]   (bad)      10    20.0%   41        
   (1, Inf)   (very bad) 12    24.0%   2         
See help('pareto-k-diagnostic') for details.

Random

Again this PSIS-LOO doesn’t perform well but now it is because assigning a parameter to every data-point makes it computationally difficult. The p_loo is at least within the right boundaries (i.e. no pooling = 50, complete pooling = 30).

Computed from 2000 by 50 log-likelihood matrix

         Estimate   SE
elpd_loo   -171.0  5.2
p_loo        37.7  2.3
looic       342.0 10.3
------
Monte Carlo SE of elpd_loo is NA.

Pareto k diagnostic values:
                         Count Pct.    Min. n_eff
(-Inf, 0.5]   (good)      2     4.0%   440       
 (0.5, 0.7]   (ok)       11    22.0%   156       
   (0.7, 1]   (bad)      29    58.0%   34        
   (1, Inf)   (very bad)  8    16.0%   6         
See help('pareto-k-diagnostic') for details.

Beta-binomial

Still not great but better.

Computed from 2000 by 50 log-likelihood matrix

         Estimate   SE
elpd_loo   -211.0  9.4
p_loo        28.2  4.0
looic       421.9 18.9
------
Monte Carlo SE of elpd_loo is NA.

Pareto k diagnostic values:
                         Count Pct.    Min. n_eff
(-Inf, 0.5]   (good)     14    28.0%   608       
 (0.5, 0.7]   (ok)       24    48.0%   145       
   (0.7, 1]   (bad)       9    18.0%   19        
   (1, Inf)   (very bad)  3     6.0%   21        
See help('pareto-k-diagnostic') for details.

PP Checks

Density of events

The random effect model seems to be doing the best here, but based on your link where p_loo < p but large compared to N then this could be evidence that the model is too flexible?

Intervals of events/log odds ratio

The top panel shows events in arms, the bottom panel shows log odds ratio vs arm 1 (so all arm 1s are zero). The top panel here helps me understand what you mean when you say that the random effects model is too flexible. The FE model does poorly but that poor performance seems to be mostly due to two hard to predict arms. In the lower panel I used the md[i] parameter for fixed and beta-binomial model but the delta parameter for the random effects model (since it predicts one LOR for each arm). The beta-binomial estimates differ from the fixed effect in both point (some more than other) and uncertainty, which is what I would expect based on beta-binomial being equivalent to a random effects model.

Estimates (95% Credible Interval)

Typically we base inference on the ds. This gives me some faith in my beta-binomial implementation since the point estimates and intervals are so similar to the random effects model.

So based on above it still doesn’t look great for using an approach that would work well in M-Open since even the beta-binomial doesn’t behave with PSIS-LOO. Would it be worth starting to look at whether it can be feasible to compute exact LOO for the bad/very bad estimates? Or is there something else I can try to get PSIS-LOO working? The two concerns I have for exact LOO are:

  1. Trials being dropped leading to a disconnected network
  2. If the bad or very bad points are arm 1 of trials, I’m not sure how the model would predict those points since there is no sharing of information across trials (i.e. I would need an estimate of mu[i] but won’t have one). One approach I could think of would be to switch to a contrast model, where the data are log odds ratio vs arm 1, but then I would have to rely on approximately normal log ORs instead of exact likelihoods.

With this d_orig[1] is unnecessary parameter which doesn’t have effect in the likelihood and has just the normal(0,3) prior. You could have something like

vector[NT-1] d = d_orig; 
...
vector[NT] d = append_row(0, d_orig);

to avoid sampling that extra parameter.

@paul.buerkner or @bgoodri can probably help with this better than I. Now also delta_raw has the first parameter without any effect on likelihood (at least I think so). The prior delta_raw ~ multi_normal(mu_cov, re_cov) is not conditioned on the fact that delta[1]=0;

You might be able to make this model also with brms or rstanarm.

“You have requested a resource that is restricted to current Dalhousie students, faculty and staff.”

Yes, I would assume it’s mis-specified and specifically under-dispersed. You could check what are the loo predictions (see E_loo) compared to actual observations to understand which are the most difficult to predict.

Yes. If one model fitting is fast (say less than few minutes), you could run kfold(fit, K=50) to compare to exact leave-one-out by sampling each leave-one-out-posterior with MCMC. This would help to know whether random effect could be better than beta-binomial

As p_loo = 28.2 ND=50, many observations are still influential. You might get less bad k’s by using more posterior draws (Pareto k are estimated from posterior draws and the estimate has some variability which can make values larger than 0.5 just by chance and increasing the number of draws makes the estimate more accurate). It’s still possible that a few high k’s remain. In the forthcoming loo package release we have new iterative computation (https://arxiv.org/abs/1906.08850) which probably could get rid of high k’s here (random effect model is too hard for even this new method).

The random effect model is too flexible to be tested with posterior predictive checks. You could do corresponding cross-validation predictive checks with kfold if the computation time is not an issue (PSIS-LOO based predictive checks have a problem in computation)

Is there explanation for the higher peak in data density closer to 0? Also it seems that the tail really is quite long (kernel density estimate is also failing to model the distribution well).

This is excellent figure. You could make the same with cross-validation predictive y_rep if you have time.

Yes. These two are visible also in PPC figures as bumps in kernel density estimate. Is there additional information why they are hard to predict?

There’s not much difference in posterior means (or medians?) , but Beta-binomial is predicting with more uncertainty in tails.

That’s a good point, that posteriors for these parameters from random effects and beta-binomial model are likely to be similar (although there can be differences as 1) they are asymptotically different, 2) there is only finite number of random effects)

I think PSIS-LOO has been very useful here when we have expected the models not to be true :)

  • You could run longer chains to get more accurate Pareto k estimates
  • You could do exact LOO only for the folds with high k (this would be very easy when using brms or rstanarm)
  • You could try the new MM-LOO (which is nit yet merged in master branch in loo package) https://github.com/stan-dev/loo/pull/130

This happens also in PSIS-LOO when we get small Pareto k.

I assume this related to having d[1]=0? Could you share information across trials? How true you assume your “exact likelihood” is?

How true you assume your “exact likelihood” is? Beta in beta-binomial is used for computational convenience (closed form for beta-binomial) and same for normal distribution in random effects models. They are not exact representations of your knowledge of possible variation between events.

That’s a nice idea, thank you

The first delta for every study is set to zero so if there are 50 datapoints each from two arm trials then there would be 50 deltas of which 25 are zero. It works in the sense that my deltas from multi-arm trials have the correct covariance in the posterior but I’m sure there is a smarter way to get there.

Sorry: this should work

These models fit in about 3 seconds. Can you recommend any accessible tutorials on coding kfold cv in stan?

Unfortunately these are just example data so I’m not overly familiar with the back-story.

Yes I think that’s fair

Will try each of these and get back to you

Yes, the purpose is to “maintain randomization” and focus on the relative effects since these are RCTs. The most common approach to sharing information across trials would be to have the baselines be exchangeable across trials. This is discouraged since then differences in prognostic variables across trials could alter the relative effect estimates.

That’s fair. I guess the main issue that lor ~ normal requires continuity corrections when there are zero events and doesn’t perform as well as well as normal-binomial when events are rare.

Here’s one example https://datascienceplus.com/k-fold-cross-validation-in-stan/

Consider also if you can write this in brms or rstanarm (brms is more flexible)

1 Like