Multimodality in hierarchical population parameters

Hi all,

I’ve been working on developing a three-level occupancy model, that considers \psi_{ij} (the probability site j is occupied by species i) \theta_{ijk} (the probability that sample k in site j is occupied by species i), and \lambda_{ijk} (the (normalised) rate at which DNA sequences of species i are encountered in sample k at site j) - see Conditional posterior predictions for three-level occupancy model for some more details. To fit the model for multiple species, I have put independent hierarchical priors on each level of the model to estimate the population of species responses at each level.

I have six total datasets of varying sizes, and 5 out of 6 of these datasets fit well, with Rhats <1.01 and (with perhaps a sometimes slightly forgiving eye) the MCMC traceplots and other graphical diagnostics look pretty good. However, for one model, I get multimodality specifically in the site-level population parameters for the hierarchical model. Here, sigmas_occ[1] is a random intercept, and the others all represent slopes to 10 other continuous environmental predictors. N_species=2063, N_sites=30, and N_replicates=300. There are no max treedepth warnings or divergences. There is no multimodality present at the higher levels.


Which seems to be due to some chains exploring completely different regions of parameter space:

I had wondered if the problem was due to parameterisation, so I tried reparameterising the model using an implementation of the rstanarm decov() prior, where there is a single variance parameter that is split into standard deviations using a simplex, but this ended up looking pretty much the same, but with divergences to boot.

I’m at a bit of a loss at what to try next - I suppose one option might be to tighten up the priors, and another would be to try another parameterisation. One thing that’s probably worth mentioning is that I use the same 11 environmental variables at all three levels of the hierarchy (though at the sample and abundance levels, there are extra predictors too) - the thinking being, for example, if a species responds negatively to increasing rainfall, then this is likely to be reflected both in the probability of finding it at a site, the probability of it being present in a sample, and the number of sequences detected - and perhaps there’s a better way of representing this dependency among the levels that might help fix this problem?

Any insight much appreciated!

Model code is below:

functions {
  real sample_detection_lpmf(int Y, real lambda, real theta, real phi){
    return bernoulli_logit_lpmf(1 | theta) +
      neg_binomial_2_log_lpmf(Y | lambda, phi);
  }
  
  real sample_nondetection_lpmf(int Y, real lambda, real theta, real phi) {
    return log_sum_exp(
      sample_detection_lpmf(Y | lambda, theta, phi), bernoulli_logit_lpmf(0 | theta));
  }
  
  real sample_occupancy_sum_lpmf(array[] int Y, vector lambda, vector theta, real phi) {
    int K = size(Y);
    real temp_target = 0;
    
    for(k in 1:K) {
      if(Y[k] > 0) {
        temp_target += sample_detection_lpmf(Y[k] | lambda[k], theta[k], phi);
      } else {
        temp_target += sample_nondetection_lpmf(0 | lambda[k], theta[k], phi);
      }
    }
    
    return(temp_target);
  }
  
  real partial_sum_lpmf(array[,] int Y, 
                        int start, int end,
                        array[,] int Q,
                        int N_sites,
                        int N_replicates,
                        matrix X_reads,
                        matrix X_rep,
                        matrix X_occ,
                        vector b_reads,
                        vector b_rep,
                        vector b_occ,
                        matrix b_reads_species,
                        matrix b_rep_species,
                        matrix b_occ_species,
                        vector phi,
                        vector SF,
                        array[] int rep_group_sizes) {
    
    int N = size(Y);
    real temp_target = 0;
    array[N, N_sites] int Q_local = Q[start:end,];
    vector[N] phi_local = phi[start:end];
    
    matrix[N_replicates, N] lambda = rep_matrix(SF, N) +
                                     rep_matrix(X_reads * b_reads, N) +  
                                     (X_reads * b_reads_species[,start:end]);
    matrix[N_replicates, N] theta = rep_matrix(X_rep * b_rep, N) + X_rep * b_rep_species[,start:end];
    matrix[N_sites, N] psi = rep_matrix(X_occ * b_occ, N) + X_occ * b_occ_species[,start:end];
    
    for(i in 1:N) {
      int loc_start=1; //replicates are ordered by site - starting at replicate 1
      for(j in 1:N_sites) {
        int loc_end = loc_start + rep_group_sizes[j] - 1; // add number of reps for site j and subtract 1
  
        if(Q_local[i, j] == 1) {
          temp_target += bernoulli_logit_lpmf(1 | psi[j, i]) + 
              sample_occupancy_sum_lpmf(Y[i,loc_start:loc_end] | lambda[loc_start:loc_end, i], theta[loc_start:loc_end, i], phi_local[i]);
        }
        if(Q_local[i, j] == 0) {
          temp_target += log_sum_exp(bernoulli_logit_lpmf(1 | psi[j, i]) + 
              sample_occupancy_sum_lpmf(Y[i, loc_start:loc_end] | lambda[loc_start:loc_end, i], theta[loc_start:loc_end, i], phi_local[i]),
              bernoulli_logit_lpmf(0 | psi[j, i]));
        }
        loc_start += rep_group_sizes[j]; // update start point for next site
      }
    }
    
    return temp_target;
  }
}
data {
  int N_species;                            // Number of species
  int N_sites;                              // Number of sites
  int N_replicates;                         // Number of replicates
  array[N_species, N_replicates] int Y;     // Raw data [OTU][Replicate]
  array[N_species, N_sites] int Q;          // Site occupation factor
  
  // Different number of replicates per site, so the input replicates are passed in site
  // order - then we can access them per site by looping over the group sizes
  array[N_sites] int rep_group_sizes;

  // Site occupancy predictors
  int N_pred_occ;                    // Number of predictors
  matrix[N_sites, N_pred_occ] X_occ; // Predictor matrix for occupancy
  
  // Replicate occupancy predictors
  int N_pred_rep;
  matrix[N_replicates, N_pred_rep] X_rep;

  // Read abundance predictors
  int N_pred_reads;
  matrix[N_replicates, N_pred_reads] X_reads;
  
  // Parallelisation options
  int grainsize;
}
parameters {
  // Site occupancy parameters
  vector[N_pred_occ] b_occ;
  cholesky_factor_corr[N_pred_occ] L_Rho_occ;
  vector<lower=0>[N_pred_occ] sigmas_occ;
  matrix[N_pred_occ, N_species] z_occ;

  // Sample occupancy parameters
  vector[N_pred_rep] b_rep;
  cholesky_factor_corr[N_pred_rep] L_Rho_rep;
  vector<lower=0>[N_pred_rep] sigmas_rep;
  matrix[N_pred_rep, N_species] z_rep;
  
  // Size factors/row effects
  vector<lower=0>[N_replicates] SF;

  // Read abundance parameters
  vector[N_pred_reads] b_reads;
  cholesky_factor_corr[N_pred_reads] L_Rho_reads;
  vector<lower=0>[N_pred_reads] sigmas_reads;
  matrix[N_pred_reads, N_species] z_reads;
  
  // Read dispersion
  vector[N_species] z_phi;
  real<lower=0> sigma_phi;
  real a_phi;
}
transformed parameters {
  // Site occupancy parameters
  matrix[N_pred_occ, N_species] b_occ_species;
  b_occ_species = (diag_pre_multiply(sigmas_occ, L_Rho_occ) * z_occ);
  
  // Sample occupancy parameters
  matrix[N_pred_rep, N_species] b_rep_species;
  b_rep_species = (diag_pre_multiply(sigmas_rep, L_Rho_rep) * z_rep);

  // Read abundance parametters
  matrix[N_pred_reads, N_species] b_reads_species;
  b_reads_species = (diag_pre_multiply(sigmas_reads, L_Rho_reads) * z_reads);
  
  // Read dispersion
  vector<lower=0>[N_species] phi;
  phi = exp((a_phi + z_phi) * sigma_phi);
}
model {
  // Priors
  // Site occupancy
  target += std_normal_lpdf(b_occ);
  target += lkj_corr_cholesky_lpdf(L_Rho_occ | 3);
  target += exponential_lpdf(sigmas_occ | 1);
  target += std_normal_lpdf(to_vector(z_occ));

  // Sample occupancy
  target += std_normal_lpdf(b_rep);
  target += lkj_corr_cholesky_lpdf(L_Rho_rep | 3);
  target += exponential_lpdf(sigmas_rep | 1);
  target += std_normal_lpdf(to_vector(z_rep));
  
  // Size factors
  target += lognormal_lpdf(SF | 0, 1);

  // Read abundance 
  target += normal_lpdf(b_reads | 0, 3);
  target += lkj_corr_cholesky_lpdf(L_Rho_reads | 3);
  target += exponential_lpdf(sigmas_reads | 1);
  target += std_normal_lpdf(to_vector(z_reads));

  // Read dispersion
  target += std_normal_lpdf(a_phi);
  target += std_normal_lpdf(z_phi);
  target += exponential_lpdf(sigma_phi | 1);

  //array[,,] int Y, int grainsize, array[,] int Q, int N_sites, int N_replicates, matrix X_reads,
  //matrix X_rep,   matrix X_occ, vector b_reads, vector b_rep, vector b_occ, matrix b_reads_species,
  //matrix b_rep_species, matrix b_occ_species, vector phi, vector SF, array[] int rep_group_sizes
  
  target += reduce_sum(partial_sum_lpmf, Y, grainsize, Q, N_sites, N_replicates, X_reads, X_rep, X_occ, b_reads,
                       b_rep, b_occ, b_reads_species, b_rep_species, b_occ_species, phi, SF, rep_group_sizes);

}

Looks like you are already using a non-centered parameterization of the random effects, which is good. One possible reason for your results would be high multicolinearity between your predictors. If some of your predictors are highly correlated, then identifying the regression coefficients becomes difficult. For example, if you have two perfectly correlated predictors x1 and x2, almost any combination of b1 and b2 that sums to the same value would have the same fit. From your plots it looks like something like that might be the case - one of the chains finds higher values for some paramteres but lower values for others. Investigating the correlation between your predictors and simpliftiting the model would be one approach.

1 Like

Hi, many thanks for your thoughts!

One clarification that’s worth making is that my six datasets are six sets of measurements for the same set of 30 sites and 300 replicates, so across all six datasets the covariate values are the same - particularly for the 30 sites. The differences between the datasets are in the number and type of species (and location of measurement within a replicate) - the species have poor overlap among datasets. So if multicollinearity is the problem, it’s perhaps somewhat surprising that the model is identifiable in 5/6 cases - unless there’s something particular about the biology of this particular dataset that I’m not appreciating.

The variables are pretty much all meteorological variables, so I think in some cases a reasonable degree of correlation is expected, but they have all have been mean-centred and scaled (which in my understanding can help with collinearity problems), and examining the pairs plot, while some variables are quite correlated with one another, I don’t see anything that seems exceedinly high to my eye:

I also tried a QR decomposition at one point, which definitely decreased the time to fit - but while I understand how to convert the betas back to the non-QR scale, it wasn’t clear to me that it would be possible to transform the sigma parameters back to the natural scale, which is annoying as I am interested in making inferences about them specifically.

baribault2022_troubleshooting-bayesian-cognitive-models_DOC-3.pdf (329.7 KB)

These two diagnostic plots might be something you want to check out.

Baribault, B., & Collins, A. G. E. (2023). Troubleshooting Bayesian cognitive models. Psychological Methods, Advance online publication. https://dx.doi.org/10.1037/met0000554

1 Like

Hi, thanks for the suggestion. Plotting the energy diagnostic plot, it seems like there’s definitely a problem with chain #1.

Looking at the pairs plot for the problematic SD parameters, with the HMC energy in final position (two chains plotted in the upper diagonal, two chains in the lower diagonal):

There’s a lot of pathological behaviour here, including what look like random walks in the problematic chain, suggesting its having trouble exploring the posterior correctly, and otherwise inducing what appear to be strong correlations which suggest identifiability issues.

Would I be best off presupposing that the model is dramatically mis-specified and going back to the drawing board a bit, or are there other things I could try first (e.g. a longer warmup) that might help? In an ideal world I would specify the exact same model for all six datasets, and I’d rather not pick and choose variables from one model and not the others.

Just wondering if anyone has any other insight? If not, I think for now I will start looking at the variables with strongly correlated SD parameters and cross-comparing with those which are collinear in the raw data, and try dropping one or two and see if that helps. I don’t love it, but as I’m not putting too much causal interpretation onto these species responses (I’m highly unconvinced it’s possible to disentangle them with a DAG), hopefuly it’s not the end of the world if one model has a couple less variables than the other (or I can standardise to a core set for all models if it fixes the issue).

If you think the upper right posteriors look sensible, it seems you have additional prior information that says the smaller values are more sensible, and the current prior is too wide causing problems when the sampler gets stuck in bad region of big values. Do you think you can adjust your priors using that additional information?

Possibly? The upper-right look sensible in terms of comparison to the other 5 models (they’re in a similar range of magnitudes between about 0.5 - 1.5, depending on the variable), so tightening the prior on these SDs to e.g. an exponential(2) would put these higher modes well outside the range of prior plausibility, which might be helpful.

That said, outside of comparisons to the other models, I don’t necessarily have a strong expectation about their values - I would have said a-priori that for some variables larger values would be possible - so I’m not sure if tightening the prior too much is a sensible idea?

Instead of tightening prior, you could try Pathfinder for initializing MCMC as it can discard minor modes

1 Like

Looking again at your Birthday case study with Pathfinder - it looks like I can just fit the model with Pathfinder, and then pass the Pathfinder fit object directly as model$sample(init = pathfinder_fit)? Or do I need to post-process the Pathfinder fit first to produce a list of inits? I’m on cmdstanr 0.7.1 with cmdstan 2.34.1.

Sorry, I have updated the case study to use a PR Adds feature to make inits from fit and draws objects by SteveBronder · Pull Request #937 · stan-dev/cmdstanr · GitHub. You can install the PR version of CmdStanR, or
with your version, you can use the following code

#' Function to form a list of list of initial values from a draws object.
#' Something like this will be eventually available in `cmdstanr` package.
as_inits <- function(draws, variable=NULL, ndraws=4) {
  ndraws <- min(ndraws(draws),ndraws)
  if (is.null(draws)) {variable = variables(draws)}
  inits <- lapply(1:ndraws,
                  function(drawid) {
                    sapply(variable,
                           function(var) {
                             as.numeric(subset_draws(draws, variable=var, draw=drawid))
                           })
                  })
  if (ndraws==1) { inits[[1]] } else { inits }
}

#' Function to form a list of list of initial values from a Pathfinder object.
#' Something like this will be eventually available in `cmdstanr` package.
create_inits <- function(pthfs, variables=NULL, ndraws=4) {
  if (is.list(pthfs)) {
    pthf <- pthfs[[1]]
    draws <- do.call(bind_draws, c(lapply(pthfs, as_draws), along='draw'))
  } else {
    pthf <- pthfs
    draws <- pthf$draws()
  }
  draws <- draws |>
    mutate_variables(lw=lp__-lp_approx__,
                     w=exp(lw-max(lw)),
                     ws=pareto_smooth(w, tail='right', r_eff=1)$x)
  if (is.null(variables)) {
    variables <- names(pthf$variable_skeleton(transformed_parameters = FALSE,
                                              generated_quantities = FALSE))
  }
  draws |>
    weight_draws(weights=extract_variable(draws,"ws"), log=FALSE) |>
    resample_draws(ndraws=ndraws, method = "simple_no_replace") |>
    as_inits(variable=variables, ndraws=ndraws)
}

init_list <- create_inits(pathfinder_object)

For this code to work, you need to call cmdstan_model() with compile_model_methods=TRUE, force_recompile=TRUE

Thanks! I tried running Pathfinder on the smallest of my datasets (~300 species, 300 replicates, 30 sites) and it returned a singular fit with an SD of 0 for all parameters - so I suspect my model is probably far too complicated for anything other than MCMC.

I’ll try out tightening the priors in the meantime, and see if that helps any!

Try running pathfinder with psis_resample=FALSE. There are examples of this and explanation in the case study (I hope that feature is in your version, too)

1 Like

I gave this a shot - the results are highly multimodal, and most of the modes are very far from the values (much larger) given by HMC for the same model.

I went back to this and gave it some further thought - an exponential(1) prior on standard deviations for a slope parameter on the logit scale means that we can expect 95% of slopes to be in the range of 0.05 - 20x proportional change in odds for a unit change (1 SD as I’ve scaled my variables) in a predictor. That’s quite extreme! Tightening that to an exponential(2) narrows that range a lot to about 0.22 - 4.5x, which is still quite wide but much more within the range of plausibility. It also prompted me to take a look at some of my other priors, which in retrospect are also a bit too wide.

When I first set these priors I chose reasonably common “defaults”, as I was more focused on writing the model likelihood and getting the model to sample efficiently on a large dataset - I didn’t spend enough time considering how “weakly informative” they actually are. I’m going to refit the failing model with this new priorset, and see how much that helps.

1 Like

Ru

And then you could use the sample without replacement as shown in the Birthdays case study to check if there is just one major mode.

One part of the workflow recommendations says if the sample efficiency is bad, do think about priors more [2011.01808] Bayesian Workflow

Great! Please share also here if that helps

1 Like