Help with hierarchically modeling parameters of a nonlinear model

Dear Stan community,

I’m struggling to fit a model in which linear models are hierarchically attached to the parameters of a nonlinear model. The model sometimes runs smoothly and produces sensible estimates, but sometimes one or two chains get stuck, hitting the maximum treedepth and/or producing divergent transitions. I’ve tried non-centered parameterizations for the linear models and used prior predictive simulations to inform prior choice, but sampling is still unstable. I’m not sure what to do (or what can be done) at this point, so I wanted to ask for advice/help here.

Here is a slightly simplified version of the model:

data{
  int<lower=1> n; // number of observations
  vector<lower=0, upper=1>[n] w; // predictor for nonlinear parameters (scaled to 0--1)
  array[n] int<lower=0> x; // predictor for outcome
  array[n] int<lower=0> y; // outcome
}

transformed data {
  vector<lower=0>[n] w2 = square(w); // quadratic term for the model on L
}

parameters{
  // linear model for L
  real intercept_L;
  real beta_L1;
  real beta_L2;
  real<lower=0> sigma_L;
  // linear model for A
  real intercept_A;
  real beta_A;
  real<lower=0> sigma_A;
  // nonlinear model for y
  real<lower=0> phi;
  // z-scores
  vector[n] z_L;
  vector[n] z_A;
}

model {
  vector[n] mu;
  vector[n] L;
  vector[n] log_L;
  vector[n] mu_L;
  vector[n] A;
  vector[n] log_A;
  vector[n] mu_A;
  intercept_L ~ normal(4, 2);
  beta_L1 ~ normal(0, 2);
  beta_L2 ~ normal(0, 2);
  sigma_L ~ exponential(1);
  intercept_A ~ normal(0, 1);
  beta_A ~ normal(0, 1);
  sigma_A ~ exponential(1);
  1/phi ~ exponential(1);
  z_L ~ normal(0, 1);
  z_A ~ normal(0, 1);
  for (i in 1:n) {
    // linear model for L
    mu_L[i] = intercept_L + beta_L1 * w[i] + beta_L2 * w2[i];
    log_L[i] = mu_L[i] + z_L[i] * sigma_L;
    L[i] = exp(log_L[i]);
    // linear model for A
    mu_A[i] = intercept_A + beta_A * w[i];
    log_A[i] = mu_A[i] + z_A[i] * sigma_A;
    A[i] = exp(log_A[i]);
    // nonlinear model for y (L > 0, A > 0)
    mu[i] = L[i] / (1 + A[i] * x[i]);
  }
  y ~ neg_binomial_2(mu, phi);
}

I suspect that the sampling difficulties partially stem from the exp() transformations used to constrain L and A to be positive, as required by the nonlinear model. Wrapping the linear models in exp() has also made it difficult to set sensible priors. I’ve observed that the sampling issues can worsen with even minor changes to the priors, changes that don’t seem drastically different in terms of their prior predictions (e.g., widening to intercept_A ~ normal(0, 2) and beta_A ~ normal(0, 2)). Perhaps the model is too complex for the data?

I would appreciate any advice/suggestions on how I might be able to improve this model and possibly resolve my issues. Thank you very much!

System info: macOS 13.4, R v4.4.1, cmdstanr v0.8.1, CmdStan v2.35.0

Update: Exploring trace plots and pairs plots for sampling runs that failed to converge, I see that there is bimodality in the posterior distribution. Here are plots for a run that had both maximum treedepth (yellow) and divergence (red) issues:

The plots are for the full model, which has some additional indexing of parameters to differentiate between two different populations. lambda is L and alpha is A in terms of the model code provided above. It seems pretty clear that the treedepth and divergence issues are closely related, but I’m not sure how to deal with the apparent bimodality in the posterior. I also see that some posterior pairs exhibit somewhat narrow ridges (e.g., intercept_lambda[1] and intercept_alpha[1,1]), which I understand might indicate some identifiability issues.

This exploration has helped me further understand the issue at hand, but I’m still unsure how to proceed from here. Perhaps I need to simplify one or more components of the model, but I feel that anything simpler than this might be hard to justify based on domain background (e.g., it is rather implausible that L and/or A do not exhibit some sort of response to w), not to mention that I’ve already dropped two group-level varying intercept terms to simplify the model.

Once again, any advice on how to approach these issues would be very much appreciated!

Why do you set 1/phi ~ exponential()? If you go with phi ~ exponential() you still get a strictly positive parameter.

If you must define the prior on 1/phi, you need to make the Jacobian adjustment target += -2*log(phi).
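For concreteness, a minimal sketch of how that would look in the model block (the factor comes from the change of variables q = 1/\phi, with |dq/d\phi| = \phi^{-2}):

model {
  // ... other priors ...
  1 / phi ~ exponential(1);
  // Jacobian adjustment for the change of variables q = 1/phi:
  // log |dq/dphi| = -2 * log(phi)
  target += -2 * log(phi);
  // ... likelihood ...
}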

Thank you for pointing that out; the prior on 1/phi was admittedly naive. I can confirm that the issues persist with phi ~ exponential(1), although possibly at a lower rate (I ran the model 20 times with 4 chains each and got one chain with ~300 divergences). The divergences occur in the same region of parameter space as shown above and appear to correspond to values of phi about an order of magnitude larger than in the other chains.

Are the runs that seemingly converge trustworthy, or are they just missing some parts of the full posterior? Similarly, would adjusting priors away from the other ‘peak’ just cause the posterior to be incompletely sampled?

The seemingly converged runs are trustworthy, but that sounds like your prior is too informative. Did you try phi ~ exponential(0.1) or even lower? You should try the same with the other parameters.

Generally, it’s best to have the model do the work and confirm that it can pull its weight. If your parameters are barely identified and the results only look okay because the priors are strong enough, then you may as well skip the model and just use the priors (which obviously one does not intend to do). This is probably not the case for you, but using maximally uninformative priors is a good idea.

Can I ask you to elaborate on your intuition behind the priors possibly being too informative? I would have thought that drifting into different areas of parameter space would be a symptom of looser priors.

Model behavior is similar with phi ~ exponential(0.1), but slightly loosening the priors on the parameters for L or A (e.g., intercept_A ~ normal(0, 2), beta_A ~ normal(0, 2)) greatly worsens sampling (frequent maximum treedepth and E-BFMI warnings, occasional divergences). This feels quite sensitive, even accounting for the exp() around the linear models. Here are plots from a run using the priors described above:

Considering how finicky this is turning out to be, perhaps this model is not a good idea after all. I do have a simpler counterpart to this model in which the data are categorized into groups along w (a quantitative variable, but one applied in fixed quantities in an experimental setting); this categorical model seems to work much better.

First, if different combinations of L_i and A_i can yield the same \mu_i, you have an identification problem. For example, with x_i = 1, both (L_i, A_i) = (2, 1) and (L_i, A_i) = (4, 3) give \mu_i = 1.

Second, z_L and z_A are observation-level random effects (OLREs), and you probably can’t have OLREs in a model that already has a dispersion parameter unless the dispersion parameter is known: sigma_L, sigma_A, and \phi all describe observation-level variation, so they compete to explain the same overdispersion. This is probably why the traceplot for \phi is so strange. You could use OLREs in a Poisson model.

Third, while there is nothing wrong with strong priors, the “strong” priors here are likely hiding the extent of non-identification in this model.

I would:

  1. Switch to weakly-informative priors, unless the “strong” priors reflect prior knowledge.
  2. Eliminate z_L and z_A.
  3. Fix one intercept, maybe intercept_A, to a constant, e.g. 0 (a sketch applying 1–3 follows this list).
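A minimal sketch of what applying 1–3 to the simplified model might look like; the particular prior scales are illustrative, not recommendations:

parameters {
  real intercept_L;
  real beta_L1;
  real beta_L2;
  real beta_A;        // intercept_A is fixed to 0 below
  real<lower=0> phi;
}

model {
  vector[n] mu;
  intercept_L ~ normal(0, 5);  // illustrative weakly-informative scales
  beta_L1 ~ normal(0, 5);
  beta_L2 ~ normal(0, 5);
  beta_A ~ normal(0, 5);
  phi ~ exponential(0.1);
  for (i in 1:n) {
    // z_L and z_A eliminated; intercept_A fixed to 0
    real L = exp(intercept_L + beta_L1 * w[i] + beta_L2 * w2[i]);
    real A = exp(beta_A * w[i]);
    mu[i] = L / (1 + A * x[i]);
  }
  y ~ neg_binomial_2(mu, phi);
}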

Maybe this is enough to resolve the problems. If not, I would then remove the w_i predictor from the model, though I doubt it is as much of a problem.

Generally, it is fine to keep eliminating aspects of your model when you have problems like this until you get to a model without problems, even if that model is useless for your purposes. Then you slowly add things back to see which additions create problems.

If 1–3 resolve the problem, I would add z_L and z_A back in, which I believe would reintroduce problems due to using OLREs and NB with unknown \phi. If sigma_L and sigma_A are substantively interesting parameters, then I would switch to a Poisson model.

Minor, but you may be able to write the likelihood as y ~ neg_binomial_2_log(log_L - log1p_exp(log_A + log_x), phi), where vector[n] log_x = log(to_vector(x)); is defined in the transformed data block. I have no idea if this is faster, but it may be worth trying.
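A sketch of how those pieces might fit together, assuming log_L and log_A are assembled as length-n vectors as in the original model:

transformed data {
  // elementwise log of the counts; log(0) = -inf, which
  // log1p_exp() maps back to log1p(0) = 0 in the likelihood
  vector[n] log_x = log(to_vector(x));
}

model {
  // ... priors, and fill vector[n] log_L and vector[n] log_A ...
  y ~ neg_binomial_2_log(log_L - log1p_exp(log_A + log_x), phi);
}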

Hope this is somewhat useful.

This is a subtle question. It’s possible for the R-hat and ESS criteria to be misled by a sampler mixing well through a subspace of the posterior. You see this with things like hierarchical models with very little data: you won’t get into the tails, but R-hat and ESS will look good.

You can validate using simulation based calibration checks. Informally, you can generate parameters from the prior, data from the parameters, and then see if the posterior that arises from the data covers the parameters with the appropriate coverage (e.g., 50% intervals contain the true value about 50% of the time).
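As a concrete (hypothetical) version of that informal check for the original model, one could write a standalone simulator, run it with the fixed_param algorithm, then fit the model to each simulated data set and check interval coverage. This sketch reuses the original priors (with the corrected phi ~ exponential(1)) and omits the observation-level z terms for simplicity:

data {
  int<lower=1> n;
  vector<lower=0, upper=1>[n] w;
  array[n] int<lower=0> x;
}

generated quantities {
  // draw parameters from the priors
  real intercept_L = normal_rng(4, 2);
  real beta_L1 = normal_rng(0, 2);
  real beta_L2 = normal_rng(0, 2);
  real intercept_A = normal_rng(0, 1);
  real beta_A = normal_rng(0, 1);
  real phi = exponential_rng(1);
  // simulate an outcome vector from those parameters
  array[n] int y_sim;
  for (i in 1:n) {
    real L = exp(intercept_L + beta_L1 * w[i] + beta_L2 * square(w[i]));
    real A = exp(intercept_A + beta_A * w[i]);
    y_sim[i] = neg_binomial_2_rng(L / (1 + A * x[i]), phi);
  }
}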

P.S. It looks like this code was translated from BUGS. We don’t need to define giant arrays and it’s more efficient if you replace lines like this:

vector[n] mu_L;
...
for (i in 1:n) {
  mu_L[i] = intercept_L + beta_L1 * w[i] + beta_L2 * w2[i];

with

for (i in 1:n) {
  real mu_L = intercept_L + beta_L1 * w[i] + beta_L2 * w2[i];

Negative binomials are often challenging to fit because you can explain variation through either a shifting mean or more overdispersion.

You’ll find neg_binomial_2_log to be more arithmetically stable, as it doesn’t require you to put mu on the positive scale. Instead, you can define

log_mu[i] = log_L[i] - log1p(A[i] * x[i]);
...
y ~ neg_binomial_2_log(log_mu, phi);

How so? You put priors on the values before the exponentiation. I’d generally recommend putting priors on all the variables you are using; as is, I don’t see any, just a likelihood for y.

Priors on phi and 1/phi go in opposite directions. An exponential(1) prior on phi means you expect phi to be relatively small, whereas an exponential prior on 1 / phi allows for large values of phi. You should figure out where you want the prior mass here and set an appropriate prior.
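To make the direction concrete: if 1/\phi \sim \mathrm{Exponential}(1), the change of variables gives the implied density p(\phi) = \phi^{-2} e^{-1/\phi}, which has a polynomial right tail (it has no finite mean), whereas \phi \sim \mathrm{Exponential}(1) has an exponentially decaying tail and concentrates mass on small \phi.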

You also want to take care that your likelihood is identified as much as possible. It can help a lot to standardize your covariates w and w2.

The bimodal intercept is also potentially going to be a problem for sampling. Are both modes reasonable for this problem?

It can help a lot with understanding and debugging to start with a very simple model and build it up. For example, start with a Poisson and add the overdispersion from the negative binomial later. Start with fewer effects in the regressions, and so on. Then, as you add features, you can see where things start to become non-identified and break, which is what @James.uanhoro was getting at in his reasonable comments. You can see the non-identifiability in the heavy negative correlation of (beta_lambda_1, beta_lambda_2, intercept) and in the bimodality. One problem may be allowing too many things to vary with i.

Thank you @james.uanhoro and @Bob_Carpenter - I really appreciate the detailed advice. I went back to a simple nonlinear model with only intercepts for L (herein lambda) and A (herein alpha), weaker priors, and a Poisson likelihood, then tried incrementally building back up from there. Even the simplest model had some degree of nonidentifiability for intercept_lambda and intercept_alpha, perhaps due to the division involving these quantities in the nonlinear formula:

I was able to add back the linear predictor terms for lambda (w and w^2, where w is now centered and scaled to unit variance) without trouble. However, when I instead tried to add back the linear predictor terms for alpha, I encountered severe nonidentifiability for some parameters, which I suppose explains all the maximum treedepth warnings I was getting before.

Here is the code for this problematic model:

data {
  int<lower=1> n; // number of observations
  array[n] int<lower=1, upper=2> sp; // species ID
  vector[n] w; // total water added (linear)
  array[n] int<lower=0> N1; // density of species 1
  array[n] int<lower=0> N2; // density of species 2
  array[n] int<lower=0> F; // fecundity
}

parameters {
  vector[2] intercept_lambda;
  matrix[2, 2] intercept_alpha;
  matrix[2, 2] b_alpha;
}

model {
  vector[n] mu;
  intercept_lambda ~ normal(5, 5);
  to_vector(intercept_alpha) ~ normal(0, 10);
  to_vector(b_alpha) ~ normal(0, 10);
  for (i in 1:n) {
    // lambda (intercept only in this version)
    real Y = exp(intercept_lambda[sp[i]]);
    // linear model for alpha
    vector[2] a;
    vector[2] mu_a;
    for (j in 1:2) {
      mu_a[j] = intercept_alpha[sp[i], j] + b_alpha[sp[i], j] * w[i];
      a[j] = exp(mu_a[j]);
    }
    // nonlinear model for F
    mu[i] = Y / (1 + a[1] * N1[i] + a[2] * N2[i]);
  }
  F ~ poisson(mu);
}

Here are results from two different runs of this model:

Both heavy posterior correlation and bimodality in the posterior arise here. I’m surprised to see that the two ‘modes’ for b_alpha[1,2] are so far apart with opposite signs, but I haven’t managed to grasp exactly why this might be happening. Considering that the posterior pair of intercept_alpha[1,1] and b_alpha[1,1] looks much more reasonable than its [1,2] counterpart, could it be that the model is poorly informed by the data involving species 2? I was able to get the above model to work by setting intercept_alpha to 0, but the moment I add back w as a predictor of lambda things become bimodal/unidentified again.

Would a reasonable explanation for all this be that the parameters of the nonlinear model are not sufficiently identifiable to allow each of them to be modeled separately, at least using the same predictor? If so, are these the alternatives?

  1. Reparameterize the model so that some (non)linear combination of lambda and alpha is predicted by w. I’m not sure this is possible with both N1 and N2 in the nonlinear formula (not to mention that lambda and alpha are really the quantities of interest here).
  2. Use a different (non)linear formula for the outcome that allows for better identifiability of lambda and alpha.
  3. Give up on using w as a continuous predictor.
  4. Use tighter priors (if they can be justified based on prior knowledge, of course). This doesn’t seem like it will get me very far, based on the issues explained in earlier posts.

Apologies for the lengthy reply. Please let me know if you have any additional insights!

I just wanted to make sure to respond to some of your additional questions/comments here @Bob_Carpenter.

I think I put priors on all the variables in the initial model, but perhaps I’m misunderstanding something? As for setting priors inside exp(), I sometimes struggle with how a symmetric prior like normal(5, 5) on the log scale becomes asymmetric when exponentiated and can imply unrealistically massive values at ~2 SD above the prior mean. But perhaps I’m overthinking things here.

In the updated model above, I tried standardizing (centered and scaled) w, but I’m wondering how much this helps in this particular case since w^2 is still just a function of w.

In some versions of the model I found one of the posterior modes to be clearly unreasonable, but in some versions (like the one above) I can’t say for sure. Most of my prior knowledge is about lambda, not so much about alpha.

Thank you all for your help with this model. After some further exploration based on the above feedback, I found that a combination of (1) dropping observation-level random effects, (2) using an orthogonal polynomial for the predictor w, and (3) bringing in some more prior information on the intercept allowed the model to work much more reasonably. I will mark this topic as solved!
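For future readers, here is a minimal sketch of one way to construct an orthogonal quadratic basis for w directly in Stan, via a thin QR decomposition in transformed data (this is illustrative; the exact construction I used may differ):

transformed data {
  // orthogonalize the columns of [w, w^2]; the columns of Q_w
  // are orthogonal by construction and replace w and w2
  matrix[n, 2] W = append_col(w, square(w));
  matrix[n, 2] Q_w = qr_thin_Q(W) * sqrt(n - 1);
}

The scaling by sqrt(n - 1) follows the usual QR reparameterization convention and keeps the columns on roughly unit scale.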