Simple intercept only hierarchical model with two groups: deadly slow, poor convergence

lizzieinvancouver · September 8, 2021, 8:55pm

Hello!

I am trying to fit what I thought what a super simple intercept-only model with two grouping factors (let’s call them study and species) that I would call `crossed’… and it takes forever to run in rstan (where we started) or rstanarm (shown here for ease and the hopes someone will see what we’re missing), and has poor convergence (sad Rhat, neffective well below 5%).

Nrep <- 10 # rep per trait
Nstudy <- 50 # number of studies w/ traits 
Nspp <- 50 # number of species with traits

# First making a data frame for the test trait data
Ntrt <- Nspp * Nstudy * Nrep # total number of traits observations

# make a dataframe for height
trt.dat <- data.frame(matrix(NA, Ntrt, 1))
names(trt.dat) <- c("rep")
trt.dat$rep <- c(1:Nrep)
trt.dat$study <- rep(c(1:Nstudy), each = Nspp)
trt.dat$species <- rep(1:Nspp, Nstudy)

# now generating the species trait data, here it is for height
mu.grand <- 10 
sigma.species <- 10 

mu.trtsp <- rnorm(Nspp, 0, sigma.species)
trt.dat$mu.trtsp <- rep(mu.trtsp, Nstudy) 

# now generating the effects of study
sigma.study <- 10
mu.study <- rnorm(Nstudy, 0, sigma.study) #intercept for each study
trt.dat$mu.study <- rep(mu.study, each = Nspp) # generate data for ea study

# general variance
trt.var <- 1 #sigmaTrait_y in the stan code
trt.dat$trt.er <- rnorm(Ntrt, 0, trt.var)

# generate yhat - heights
trt.dat$yTraiti <- mu.grand + trt.dat$mu.trtsp + trt.dat$mu.study + trt.dat$trt.er

#
library(rstanarm)
tryme <- stan_lmer(yTraiti~ (1|study) + (1|species), data=trt.dat, cores=4)

Any thoughts would be most welcome! BTW, we think this is may be the problem underlying Issue with chain mixing and low n_eff for joint model but we’re not yet sure.

Thanks!

Faith · September 8, 2021, 10:29pm

Have you checked that your study and species variables are in the correct format? Stan models take in integers as the group names, but I think brms and rstanarm might need them to be factors.

lizzieinvancouver · September 8, 2021, 10:33pm

Oh, good point. I will try that, but we also had the same problem with the Stan version.

So, here’s the Stan version. (I wonder if I just haven’t these types of models before and they’re more persnickety than I think, but I would like to understand the persnicketyness.)

In R:

## Trait only stan model ###########################################################
trait_data <- list(yTraiti = trt.dat$yTraiti, 
                   N = Ntrt, 
                   n_spec = Nspp, 
                   species = trt.dat$species, 
                   study = trt.dat$study, 
                   n_study = Nstudy,
                   prior_mu_grand = 10,
                   prior_sigma_grand = 20,
                   prior_mu_sp = 0,
                   prior_sigma_sp_mu = 10,
                   prior_sigma_sp_sigma = 10,
                   prior_mu_study = 0,
                   prior_sigma_study_mu = 10,
                   prior_sigma_study_sigma = 10,
                   prior_sigma_traity_mu = 1,
                   prior_sigma_traity_sigma = 1) 

mdl.traitonly <- stan('stan/joint_traitonly.stan',
                      data = trait_data, iter = 4000, warmup = 3000)

And

data {
    //MODEL 1 ------------------------------------------------
    int < lower = 1 > N; // Sample size for trait data 
 
    int < lower = 1 > n_study; // number of random effect levels (study) 
    int < lower = 1, upper = n_study > study[N]; // id of random effect (study)

    vector[N] yTraiti; // Outcome trait data 

    //both models --------------------------------------------------------
    int < lower = 1 > n_spec; // number of random effect levels (species) 
    int < lower = 1, upper = n_spec > species[N]; // id of random effect (species)

    // //MODEL 2 ------------------------------------------------
    // int < lower = 1 > Nph; // Sample size for forcing 
    // 
    // vector[Nph] yPhenoi; // Outcome phenology
    // vector[Nph] forcei; // predictor forcing 
    // vector[Nph] photoi; // predictor photoperiod 
    // vector[Nph] chilli; // predictor chilling
    // 
    // int < lower = 1, upper = n_spec > species2[Nph]; // id of random effect (species)
    
    //Priors
    real prior_mu_grand; 
    real prior_sigma_grand;
    real prior_mu_sp;
    real prior_sigma_sp_mu;
    real prior_sigma_sp_sigma;
    real prior_mu_study;
    real prior_sigma_study_mu;
    real prior_sigma_study_sigma;
    real prior_sigma_traity_mu;
    real prior_sigma_traity_sigma;
}

parameters{

    //MODEL 1 ------------------------------------------------
    //level 1
    real <lower =0> sigmaTrait_y; // overall variation accross observations
    real mu_grand; // Grand mean for trait value 
    //level 2
    real <lower = 0> sigma_sp; // variation of intercept amoung species

    real muSp[n_spec]; //The trait effect of each species without study 

    real <lower = 0> sigma_study; // variation of intercept amoung studies
    real muStudy[n_study]; // mean of the alpha value for studies 

//     //MODEL 2 -----------------------------------------------------
//     //level 2
//     
//     real alphaForceSp[n_spec]; //the distribution of species forcing values
//     real muForceSp; // the mean of the effect of forcing
//     real <lower = 0> sigmaForceSp; //variation around the mean of the effect of forcing 
//     
//     real alphaChillSp[n_spec]; //the distribution of species chilling values
//     real muChillSp; // the mean of the effect of chilling
//     real <lower = 0> sigmaChillSp; //variation around the mean of the effect of chilling
//     
//     real alphaPhotoSp[n_spec]; //the distribution of species photoperiod values
//     real muPhotoSp; // the mean of the effect of photoperiod
//     real <lower = 0> sigmaPhotoSp; //variation around the mean of the effect of photo
// 
//     real alphaPhenoSp[n_spec]; //the species level intercept 
//     real muPhenoSp; // 
//     real <lower = 0> sigmaPhenoSp; 
// 
//     real betaTraitxForce; 
//     real betaTraitxChill; 
//     real betaTraitxPhoto; 
//     // general varience/error
//     real <lower =0> sigmapheno_y; // overall variation accross observations
}

transformed parameters{
    //MODEL 1 ----------------------------------------
    //Individual mean for species and study
    real ymu[N]; // the trait value, height, ssd, etc.

    // //MODEL 2------------------------------------------------
    // real betaForceSp[n_spec];     //species level beta forcing 
    // real betaPhotoSp[n_spec];     //species level beta photoperiod
    // real betaChillSp[n_spec];     //species level beta chilling 

    //MODEL 1
    //Individual mean calculation 
    for (i in 1:N){
        ymu[i] = mu_grand + muSp[species[i]] + muStudy[study[i]];  //muSp is used in 2nd level of model
    }

    // //MODEL 2----------------------------------------
    // //get beta-cue-Sp values for each species
    // for (isp in 1:n_spec){
    // betaForceSp[isp] = alphaForceSp[isp] + betaTraitxForce * ( muSp[isp]);
    // }
    // 
    // for (isp in 1:n_spec){
    // betaPhotoSp[isp] = alphaPhotoSp[isp] + betaTraitxPhoto* ( muSp[isp]);
    // }
    // 
    // for (isp in 1:n_spec){
    // betaChillSp[isp] = alphaChillSp[isp] + betaTraitxChill* (muSp[isp]);
    // }
}
model{ 
    //MODEL 1 ---------------------------------------------
    //assign priors
    sigmaTrait_y ~ normal(prior_sigma_traity_mu, prior_sigma_traity_sigma); // trt.var 0.5
    sigma_sp ~ normal(prior_sigma_sp_mu, prior_sigma_sp_sigma); //sigma_species 10
    mu_grand ~ normal(prior_mu_grand, prior_sigma_grand); // 
    muSp ~ normal(prior_mu_sp, sigma_sp); //

    sigma_study ~ normal(prior_sigma_study_mu, prior_sigma_study_sigma); //sigma.study 5
    muStudy ~ normal(prior_mu_study, sigma_study);//
    
    // run the actual model - likihood
    for (i in 1:N){
        yTraiti[i] ~ normal(ymu[i], sigmaTrait_y);
    }

    // //MODEL 2 -----------------------------------------------
    // //priors - level 1
    // sigmapheno_y ~ normal(0.5, 0.5); // 
    // 
    // //priors level 2
    // 
    // sigmaForceSp ~ normal(2, 0.5); //
    // muForceSp ~ normal(-1, 0.5);//
    // alphaForceSp ~ normal(muForceSp, sigmaForceSp);  //
    // 
    // sigmaChillSp ~ normal(2, 0.5); //sigma.chill.sp
    // muChillSp ~ normal(-2, 0.5);// 
    // alphaChillSp ~ normal(muChillSp, sigmaChillSp);  //
    // 
    // sigmaPhotoSp ~ normal(2, 0.5); //sigma.photo.sp
    // muPhotoSp ~ normal(-3, 0.5);// 
    // alphaPhotoSp ~ normal(muPhotoSp, sigmaPhotoSp);  //
    // 
    // sigmaPhenoSp ~ normal(2, 0.5); // sigma.pheno.sp =2  
    // muPhenoSp ~ normal(150, 10);  // mu.pheno.sp = 150
    // alphaPhenoSp ~ normal(muPhenoSp, sigmaPhenoSp);//
    // 
    // betaTraitxForce ~ normal(-2, 0.5); // 
    // betaTraitxPhoto ~ normal(-2, 0.5); // 
    // betaTraitxChill ~ normal(-2, 0.5); // 
    // 
    // //likelihood 
    //     for (i in 1:Nph){
    // yPhenoi[i] ~ normal( alphaPhenoSp[species2[i]] + betaForceSp[species2[i]] * forcei[i] + betaPhotoSp[species2[i]] * photoi[i] + betaChillSp[species2[i]] * chilli[i], sigmapheno_y);
    //     }

}

andrjohns · September 9, 2021, 2:00am

Hierarchical models in Stan (and Bayes generally) can have difficulties with sampling and convergence when specified directly, instead it’s recommended to use a non-centered parameterisation which constructs the parameter as a function of a standard normal. There’s more background for this in the User’s Guide.

For parameters which don’t have a lower bound, you can implement this very simply using the offset and multiplier syntax without changing anything else in the model:

real<offset=prior_mu_grand,multiplier=prior_sigma_grand> mu_grand;

However for parameters with a lower bound (e.g., sigma_sp), you need to use the ‘manual’ approach which specifies a standard normal parameter with an appropriate lower bound:

data {
    real prior_sigma_sp_mu;
    real prior_sigma_sp_sigma;
}
parameters {                 
    real<lower=-prior_sigma_sp_mu/prior_sigma_sp_sigma> sigma_sp_raw;
}
transformed parameters {
  // Implies sigma_sp<lower=0> ~ normal(prior_sigma_sp_mu, prior_sigma_sp_sigma)
  real sigma_sp = prior_sigma_sp_mu + sigma_sp_raw * prior_sigma_sp_sigma;
}
model {
    sigma_sp_raw ~ std_normal();
}

See this post for more background on how these constraints are derived.

Additionally, with stan models in general you will want to use:

vector[N] parameter;

Instead of:

real parameter[N];

As it is more efficient to work with vectors than arrays of real. This also allows you to vectorise your likelihood, so you can instead write:

yTraiti ~ normal(mu_grand + muSp[species] + muStudy[study], sigmaTrait_y);

Without the need for the loop or the ymu variable

lizzieinvancouver · September 9, 2021, 2:11am

Thanks! I am used to doing an NCP when I see structure in the log posterior that suggests I need one (when we don’t it makes my models run slower) … I haven’t seen that yet, but I can dig deeper. Because we’re getting similar failures in rstanarm where I believe NCP is the default, any other ideas you have of what to try are most welcome.

aammd · September 9, 2021, 4:08pm

I like these models too! I replicated your result (bad, slow sampling) on my laptop and started playing around. Here are some ideas; I hope something here is useful to you!

Could it have something to do with the scaled priors used by rstanarm? Apparently if you don’t specify priors, rstanarm will choose them for you and give them scales based on the range of the data.
Could the difficulty be related to specific parameter values used in this simulation? Here we have a couple things that are set to “10” – particularly the sd of both the species and study offsets are the same (10). I wonder if that might make it particularly difficulty to identify, since the same amount of variance can come from two sources.
If the problem is the difficulty of identifying the variance parameters, I wonder if some other strategy might help? I’ve been interested in digging into Ensuring identifiability in hierarchical mixed effects Bayesian models, where Ogle and Barber talk about adding various constraints to help with this.
Most of all, this question re-ignited my curiosity about this paper by Fuglstad et al on Intuitive joint priors for variance parameters. Essentially (as I understand it) their innovation is to recognize that the variance in the data is split up into three sources: the observation variance, the variance from studies, and that from species. They define a kind of tree structure to capture ideas about where the variance comes from.
Finally, have you considered trying INLA? I know this is the Stan forums but I hope that’s not too unwelcome a suggestion :) INLA is supposed to be faster; and when I tried it quickly it seemed to do better (reprex of that below)

Anyway, i realize that 5 questions don’t make an answer, but hopefully something in there is useful!

varying intercepts via INLA

# sample sizes
Nrep <- 11 # rep per trait
Nstudy <- 13 # number of studies w/ traits
Nspp <- 17 # number of species with traits

# parameters
mu.grand <- 10
sigma.species <- 4
sigma.study <- 13
trt.var <- 1 #sigmaTrait_y in the stan code


mu.study <- rnorm(Nstudy, 0, sigma.study)
mu.trtsp <- rnorm(Nspp, 0, sigma.species)

trt.dat$trt.er <- rnorm(Ntrt, 0, trt.var)
#> Error in rnorm(Ntrt, 0, trt.var): object 'Ntrt' not found

data_grid <- expand.grid(rep = 1:Nrep,
            study = 1:Nstudy,
            species = 1:Nspp)

data_offsets <- transform(data_grid,
          mu_study = mu.study[study],
          mu_species = mu.trtsp[species],
          obs_error = rnorm(n = nrow(data_grid), mean = 0, sd = trt.var))

data_trait <- transform(data_offsets, yTraiti = mu.grand + mu_study + mu_species + obs_error)

library(ggplot2)

ggplot(data_trait, aes(x = species, y = yTraiti)) +
  geom_point(alpha = 0.2) +
  facet_wrap(~study) +
  geom_hline(yintercept = mu.grand, lty = 2) +
  geom_hline(aes(yintercept = mu.grand + mu_study)) +
  geom_point(aes(y = mu.grand + mu_study + mu_species), col = "red") +
  theme_bw()

head(data_trait)
#>   rep study species mu_study mu_species  obs_error  yTraiti
#> 1   1     1       1 14.00328  -2.751635  0.7790162 22.03066
#> 2   2     1       1 14.00328  -2.751635  0.8086471 22.06029
#> 3   3     1       1 14.00328  -2.751635  1.0896166 22.34126
#> 4   4     1       1 14.00328  -2.751635 -1.1565259 20.09512
#> 5   5     1       1 14.00328  -2.751635 -0.2241306 21.02751
#> 6   6     1       1 14.00328  -2.751635 -1.4029173 19.84873

library(INLA)
#> Loading required package: Matrix
#> Loading required package: foreach
#> Loading required package: parallel
#> Loading required package: sp
#> This is INLA_21.02.23 built 2021-09-09 15:17:08 UTC.
#>  - See www.r-inla.org/contact-us for how to get help.
#>  - To enable PARDISO sparse library; see inla.pardiso()
#>  - Save 350.5Mb of storage running 'inla.prune()'
tryinla <- inla(yTraiti ~ 1 + f(study, model = "iid") + f(species, model = "iid"),
     family = "gaussian",
     data = data_trait,
     control.predictor = list(link = 1)
)

summary(tryinla)
#> 
#> Call:
#>    c("inla(formula = yTraiti ~ 1 + f(study, model = \"iid\") + f(species, 
#>    ", " model = \"iid\"), family = \"gaussian\", data = data_trait, 
#>    control.predictor = list(link = 1))" ) 
#> Time used:
#>     Pre = 6.39, Running = 11.3, Post = 1, Total = 18.7 
#> Fixed effects:
#>             mean    sd 0.025quant 0.5quant 0.975quant mode kld
#> (Intercept)   10 3.715       2.61       10     17.381   10   0
#> 
#> Random effects:
#>   Name     Model
#>     study IID model
#>    species IID model
#> 
#> Model hyperparameters:
#>                                          mean    sd 0.025quant 0.5quant
#> Precision for the Gaussian observations 0.996 0.029      0.940    0.995
#> Precision for study                     0.006 0.002      0.003    0.006
#> Precision for species                   0.093 0.031      0.044    0.090
#>                                         0.975quant  mode
#> Precision for the Gaussian observations      1.053 0.995
#> Precision for study                          0.012 0.005
#> Precision for species                        0.164 0.083
#> 
#> Expected number of effective parameters(stdev): 29.00(0.004)
#> Number of equivalent replicates : 83.82 
#> 
#> Marginal log-Likelihood:  -3620.32 
#> Posterior marginals for the linear predictor and
#>  the fitted values are computed

^{Created on 2021-09-09 by the reprex package (v2.0.0)}

Session info

sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value                       
#>  version  R version 4.1.0 (2021-05-18)
#>  os       macOS Mojave 10.14.6        
#>  system   x86_64, darwin17.0          
#>  ui       X11                         
#>  language (EN)                        
#>  collate  en_CA.UTF-8                 
#>  ctype    en_CA.UTF-8                 
#>  tz       America/Montreal            
#>  date     2021-09-09                  
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package      * version     date       lib source                            
#>  assertthat     0.2.1       2019-03-21 [1] CRAN (R 4.1.0)                    
#>  backports      1.2.1       2020-12-09 [1] standard (@1.2.1)                 
#>  cli            2.5.0       2021-04-26 [1] CRAN (R 4.1.0)                    
#>  codetools      0.2-18      2020-11-04 [1] CRAN (R 4.1.0)                    
#>  colorspace     2.0-1       2021-05-04 [1] standard (@2.0-1)                 
#>  crayon         1.4.1       2021-02-08 [1] CRAN (R 4.1.0)                    
#>  curl           4.3.1       2021-04-30 [1] CRAN (R 4.1.0)                    
#>  DBI            1.1.1       2021-01-15 [1] standard (@1.1.1)                 
#>  digest         0.6.27      2020-10-24 [1] CRAN (R 4.1.0)                    
#>  dplyr          1.0.6       2021-05-05 [1] standard (@1.0.6)                 
#>  ellipsis       0.3.2       2021-04-29 [1] CRAN (R 4.1.0)                    
#>  evaluate       0.14        2019-05-28 [1] standard (@0.14)                  
#>  fansi          0.5.0       2021-05-25 [1] CRAN (R 4.1.0)                    
#>  farver         2.1.0       2021-02-28 [1] standard (@2.1.0)                 
#>  fastmap        1.1.0       2021-01-25 [1] standard (@1.1.0)                 
#>  foreach      * 1.5.1       2020-10-15 [1] standard (@1.5.1)                 
#>  fs             1.5.0       2020-07-31 [1] standard (@1.5.0)                 
#>  generics       0.1.0       2020-10-31 [1] standard (@0.1.0)                 
#>  ggplot2      * 3.3.3       2020-12-30 [1] standard (@3.3.3)                 
#>  glue           1.4.2       2020-08-27 [1] CRAN (R 4.1.0)                    
#>  gtable         0.3.0       2019-03-25 [1] standard (@0.3.0)                 
#>  highr          0.9         2021-04-16 [1] standard (@0.9)                   
#>  htmltools      0.5.1.9005  2021-06-09 [1] Github (rstudio/htmltools@7fbab16)
#>  httr           1.4.2       2020-07-20 [1] standard (@1.4.2)                 
#>  INLA         * 21.02.23    2021-09-09 [1] local                             
#>  iterators      1.0.13      2020-10-15 [1] standard (@1.0.13)                
#>  knitr          1.33        2021-04-24 [1] standard (@1.33)                  
#>  labeling       0.4.2       2020-10-20 [1] standard (@0.4.2)                 
#>  lattice        0.20-44     2021-05-02 [1] CRAN (R 4.1.0)                    
#>  lifecycle      1.0.0       2021-02-15 [1] CRAN (R 4.1.0)                    
#>  magrittr       2.0.1       2020-11-17 [1] CRAN (R 4.1.0)                    
#>  Matrix       * 1.3-4       2021-06-01 [1] standard (@1.3-4)                 
#>  MatrixModels   0.5-0       2021-03-02 [1] CRAN (R 4.1.0)                    
#>  mime           0.10        2021-02-13 [1] standard (@0.10)                  
#>  munsell        0.5.0       2018-06-12 [1] standard (@0.5.0)                 
#>  pillar         1.6.1       2021-05-16 [1] CRAN (R 4.1.0)                    
#>  pkgconfig      2.0.3       2019-09-22 [1] CRAN (R 4.1.0)                    
#>  purrr          0.3.4       2020-04-17 [1] standard (@0.3.4)                 
#>  R6             2.5.0       2020-10-28 [1] CRAN (R 4.1.0)                    
#>  reprex         2.0.0       2021-04-02 [1] standard (@2.0.0)                 
#>  rlang          0.4.11.9000 2021-06-09 [1] Github (r-lib/rlang@b501cf7)      
#>  rmarkdown      2.8         2021-05-07 [1] standard (@2.8)                   
#>  scales         1.1.1       2020-05-11 [1] standard (@1.1.1)                 
#>  sessioninfo    1.1.1       2018-11-05 [1] standard (@1.1.1)                 
#>  sp           * 1.4-5       2021-01-10 [1] standard (@1.4-5)                 
#>  stringi        1.6.2       2021-05-17 [1] CRAN (R 4.1.0)                    
#>  stringr        1.4.0       2019-02-10 [1] CRAN (R 4.1.0)                    
#>  styler         1.5.1       2021-07-13 [1] CRAN (R 4.1.0)                    
#>  tibble         3.1.2       2021-05-16 [1] CRAN (R 4.1.0)                    
#>  tidyselect     1.1.1       2021-04-30 [1] standard (@1.1.1)                 
#>  utf8           1.2.1       2021-03-12 [1] CRAN (R 4.1.0)                    
#>  vctrs          0.3.8       2021-04-29 [1] CRAN (R 4.1.0)                    
#>  withr          2.4.2       2021-04-18 [1] CRAN (R 4.1.0)                    
#>  xfun           0.23        2021-05-15 [1] standard (@0.23)                  
#>  xml2           1.3.2       2020-04-23 [1] CRAN (R 4.1.0)                    
#>  yaml           2.2.1       2020-02-01 [1] standard (@2.2.1)                 
#> 
#> [1] /Library/Frameworks/R.framework/Versions/4.1/Resources/library

betanalpha · September 20, 2021, 9:38pm

Hierarchical models are often problematic, but assuming that any problems are due to hierarchal structure just leads people down unproductive rabbit holes. In practice one shouldn’t attempt any of the usual strategies for moderating the degeneracies inherent to hierarchical models (in particular normal hierarchal models) without first verifying that those degeneracies are the problem.

In this case running the fit doesn’t show any divergences nor E-FMI warnings, which indicates that there likely isn’t any hierarchical structure at fault here. This is consistent with the fact that with 50 observations in each factor level, and a relatively small measurement variability, the realized likelihood function should be narrow enough to inform all of the factor level parameters strongly enough that the centered parameterization assumed in the model is optimal. It’s straightforward to verify by removing the hierarchal priors entirely and running again – the same problems arise!

So what else could be going on? The effective sample size warnings should be ignored until all other diagnostics are clear, which means that we prioritize the Rhat warnings. Because these show up for all of the parameters it appears that the four Markov chains run by default in Stan aren’t overlapping much at all!

That could be due to metastability, for example metastability caused by multiple modes in the posterior distribution, or it could be due to slow exploration caused by poor adaptation. Because there are also maximum treedepth warnings the latter is more suspect.

Looking at the number of leapfrog steps across all four chains

steps <- do.call(rbind, get_sampler_params(mdl.traitonly, inc_warmup=FALSE))[,'n_leapfrog__']
counts <- as.data.frame(table(steps))
colnames(counts) <- c("Leapfrog Steps", "Counts")
print(counts, row.names=FALSE)

we can see that the sampler is utilizing very long numerical trajectories. This suggests either a very degenerate posterior density function or poor adaptation.

To investigate the adaption further we have to look at the adaptation Hamiltonian Monte Carlo configuration for each chain. In RStan we can look at the adapted step sizes with

sapply(1:4, function(c) get_sampler_params(mdl.traitonly, inc_warmup=FALSE)[[c]][,'stepsize__'][1])

which shows some variation between the chains.

We also want to look at the adapted inverse metric elements,

info <- c()

for (c in 1:4) {
  chain_info <- get_adaptation_info(mdl.traitonly)[[c]]
  info <- c(info, as.list(strsplit(sub("\n$", "", sub("^#.*\n# ", "", chain_info)), ", ")))
}

inv_metric_elem <- data.frame(info)
colnames(inv_metric_elem) <- sapply(1:4, function(c) paste("Chain", c))
inv_metric_elem

This reveals even stronger variations in the adaptation arrived at from each Markov chain.

This provides more evidence that Stan is having trouble adapting Hamiltonian Monte Carlo well enough, and each trajectory is exploring only the very local neighborhood around each initial point. This is consistent with the trace plots, for example

unpermuted_samples <- extract(mdl.traitonly, permute=FALSE)

par(mfrow=c(2, 2))

for (c in 1:4) {
plot(1:1000, unpermuted_samples[,c,2], type="l", lwd=1, col=c_dark,
     main=paste("Chain", c),
     xlab="Iteration",  xlim=c(1, 1000),
     ylab="mu_grand", ylim=c(0, 20))
}

Okay, so why is the adaptation ending up in such a bad sampler configuration? In this case I think that, perhaps ironically, it’s due to the glut of data! With so much data the posterior is extremely narrow, and with a population scale of 10 for the individual species and study parameters the narrow posterior is likely to concentrate outside of the [-2, 2] initialization window that Stan uses by default. This means that Stan is trying to initialize very far away from the posterior typical set which makes the adaptation problem really, really hard.

Simulating new data with a measurement variability of 10 instead of 1 provides supporting evidence for this. The larger variability results in wider likelihood functions, and hence wider posterior density functions, that more strongly overlap with the default initialization window.

Here we can initialize around the point estimates, for example

init_mu_grand <- mean(trt.dat$yTraiti)
init_muSp <- mean(trt.dat$yTraiti[trt.dat$species == n])) - init_mu_grand
init_muStudy <- trt.dat$yTraiti[trt.dat$study == n])) - init_mu_grand

which should drastically reduce the adaptation difficulty, at least if my hypothesis is correct. For more complicated observational models this prescient initialization might not be possible.

lizzieinvancouver · September 22, 2021, 6:28pm

Thanks! This answer makes sense and is very helpful, especially the walk through of warnings (hierarchy of warnings) and diagnostic plots.

We’re still having a debate though on this issue of multiple modes, specifically whether these may be present because we are trying to separate out the variance of two separate effects (here: study and species). Signs here point to poor adaptation, but is there a way to also check for evidence of metastability or rule it out?

Thanks.

jsocolar · September 22, 2021, 7:28pm

One strategy that is sometimes useful when there are two standard deviation terms that are difficult to identify (not saying that’s necessarily what’s going on here though) is to parameterize the model in terms of a combined standard deviation (i.e. the square root of the sum of the variances) and then a parameter p on the unit interval that allocates a proportion of the combined variance to one effect or the other.

parameters {
  real<lower=0> sigma_combined;
  real<lower=0, upper = 1> p;
}
transformed parameters {
  real<lower=0> sigma_species = sqrt(p*sigma_combined^2);
  real<lower=0> sigma_study = sqrt((1-p)*sigma_combined^2);
}
model {
  // put a prior on sigma_combined
}

I first encountered this trick in the BYM2 parameterization of the ICAR model, where the spatial and nonspatial variances can be difficult to identify and so are reparameterized in this way, but I’ve occasionally found it to help in standard linear mixed models, particularly if membership in one grouping (e.g. species) is strongly related to membership in another grouping (e.g. study), which is exactly what would make it difficult to attribute variance to one effect or the other. However, if this is what’s going on, you’d expect it to show up as a negative correlation joint posterior for the two standard deviation parameters under the “normal” parameterization. This reparameterization is useful for breaking that correlation, but otherwise is unlikely to help.

stijn · September 23, 2021, 2:06am

Excellent advice. In addition, this type of parameterization is also useful for setting priors. It’s often easier to think about the total variation (sigma_combined) and the proportion that can be attributed to different components (p) than thinking about the actual variation of the different components.

lizzieinvancouver · September 24, 2021, 7:05pm

Thanks! I had not seen that.

In this case, it’s all test data where we generated perfectly balanced data – all species are in all studies and vice versa. For such a case I am trying to confirm that there should not be issues in separating out the variances. I personally think it should be fine, but my lab is having a bit of a debate.

jsocolar · September 24, 2021, 7:08pm

I’m firmly on your side of the debate! But if there is a problem, then it should show up as strong posterior correlations between the variances, so maybe there’s a direct resolution to the debate available.

betanalpha · October 1, 2021, 6:45pm

Perhaps the most important consequence of true metastability is that the parameter space will decomposes into non-overlapping “basins” of attraction (typically the neighborhoods surrounding each mode). Markov chains initialized in a basin will be confined to that basin, exploring the truncated posterior within, for many, many iterations. Often so many that you never see a Markov chain transition from one basin to another.

In this case each Markov chain should look like it’s exploring reasonably well, with fuzzy trace plots and reasonably small empirical autocorrelations. Markov chains initialized within the same basin should converge to the same exploration, and Rhat, or equivalent diagnostics, run just on those chains shouldn’t indicate any problems. Markov chains initialized in different basins, however, should converge to different explorations that trigger Rhat problems.

Because we usually don’t know where the basics of attraction actually are we can’t control these circumstances. All we can do in practice is

Run multiple Markov chains initialized as diffusely as possible. The breadth of the prior usually, but not often, is a useful target.
Verify that each Markov chain has adapted to a reasonable sampler configuration and that the trace plots show reasonable exploration from iteration to iteration.
Then verify that the Markov chains are exploring different areas, for example visually with trace plots or with Rhat for any functional output of interest.

In this case (2) was suspicious so we couldn’t jump to (3).

Separating out the variance of study and species with normal population models and a normal observational model will always yield a unimodal posterior distribution, I believe. The common issue here is not a discrete degeneracy (multiple modes which case metastable Markov chains) but rather a continuous degeneracy which is diagnosed and moderated in different ways. See for example the discussion in Identity Crisis.

Topic		Replies	Views
DHGLM with two (nested) random intercept Modeling rstan , fitting-issues , specification	3	421	May 8, 2020
Regression with random intercept for groups RStan	2	1459	August 29, 2018
Model struggles to systematically estimate simple three estimate model Modeling	2	291	August 18, 2023
Question about hierarchical effects rstanarm	2	773	February 15, 2018
Simple intercept only model with poor chain mixing Modeling rstan	6	539	September 17, 2021

Simple intercept only hierarchical model with two groups: deadly slow, poor convergence

varying intercepts via INLA

Related topics