How to objectively identify poorly fitting sites in a hierarchical Bayesian forest recovery model

Hi all,

I’m modeling aboveground carbon stock (AGC) recovery across various forest sites in Africa using a hierarchical Bayesian framework in Stan. At each site, I have around 20 plot-level AGC measurements taken at different times since disturbance (i.e., clearing for agriculture). I model the AGC recovery using a Chapman-Richards growth function (a sigmoidal curve) with a lognormal likelihood, and estimate site-specific parameters (α = initial value, λ = recovery rate, ω = asymptotic value, δ = inflection point). These site-specific parameters are given lognormal hyperpriors.
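For reference, the site-level part of the model looks roughly like the sketch below. This is a simplified stand-in rather than my actual code: the hyperpriors are only indicated by comments, and I’ve written one common parameterisation of the Chapman-Richards curve, which differs slightly from the exact form I use.

    data {
      int<lower=1> N;                        // total number of plot measurements
      int<lower=1> S;                        // number of sites
      array[N] int<lower=1, upper=S> site;   // site index of each plot
      vector<lower=0>[N] t;                  // time since disturbance
      vector<lower=0>[N] y;                  // observed plot-level AGC
    }
    parameters {
      vector<lower=0>[S] alpha;              // initial value
      vector<lower=0>[S] lambda;             // recovery rate
      vector<lower=0>[S] omega;              // asymptotic value
      vector<lower=0>[S] delta;              // shape (inflection) parameter
      real<lower=0> sigma_obs;               // lognormal observation scale
      // ... plus the hyperparameters of the lognormal hyperpriors
    }
    model {
      vector[N] mu;
      for (n in 1:N) {
        int s = site[n];
        // one common Chapman-Richards form; my exact parameterisation differs slightly
        mu[n] = alpha[s] + (omega[s] - alpha[s]) * pow(1 - exp(-lambda[s] * t[n]), delta[s]);
      }
      y ~ lognormal(log(mu), sigma_obs);
      // hyperpriors on alpha, lambda, omega, delta omitted in this sketch
    }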

The model runs and converges well, but for certain sites, posterior predictive checks show a poor fit — likely due to high measurement noise or uncertainty in the “time since disturbance” variable, which is notoriously hard to pin down. I suspect that these poorly fitting sites may negatively influence hyperparameter estimates and thus affect the quality of estimates even for better-fitting sites.

I’d like to identify these “problematic” sites in an objective and principled way to consider excluding them from the model. Here’s what I’ve tried so far:

  • For each site, I computed the difference between posterior predictive draws and observed values at each time point.
  • For each MCMC iteration, I then computed the mean and standard deviation of these differences, giving me a distribution for each metric.
  • I noticed that some sites show very wide distributions for both the mean difference and the standard deviation, suggesting poor and uncertain predictive performance (the computation is sketched below).
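Roughly, the computation looks like this in generated quantities. The names mu, sigma_obs, site and n_plots are placeholders for my actual model objects (n_plots[s] is the number of plots at site s, and mu has to live in transformed parameters, or be recomputed here, to be visible):

    generated quantities {
      vector[S] mean_diff = rep_vector(0, S);   // per-site mean of (y_rep - y), per draw
      vector[S] sd_diff;                        // per-site sd of (y_rep - y), per draw
      {
        vector[N] diff;
        vector[S] ss = rep_vector(0, S);
        for (n in 1:N) {
          // posterior predictive draw for plot n minus the observed value
          diff[n] = lognormal_rng(log(mu[n]), sigma_obs) - y[n];
          mean_diff[site[n]] += diff[n] / n_plots[site[n]];
        }
        for (n in 1:N)
          ss[site[n]] += square(diff[n] - mean_diff[site[n]]);
        for (s in 1:S)
          sd_diff[s] = sqrt(ss[s] / (n_plots[s] - 1));
      }
    }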

However, I’m unsure how to rank or compare sites:

  • Should I prioritize high uncertainty in the mean difference or in the spread of those ‘residuals’?
  • How do I fairly compare sites with different “times since disturbance” (since larger AGC values naturally produce larger absolute errors)?

I’m considering using a more integrated goodness-of-fit metric like Bayesian R² (as in Gelman et al. (2019)), but I’m unsure how to compute it correctly. My current attempt is:

Rsq_Bayes_site[s] = variance(modeled_value_site_s) / (variance(modeled_value_site_s) + variance(residuals_site_s));

However, I know this doesn’t incorporate the posterior predictive distribution.
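If it helps to be concrete, my best understanding of the per-draw version is to compute R² separately for every posterior draw, using that draw’s fitted values and residuals, so each site ends up with a posterior distribution of R². A sketch, again with placeholder names (mu, site, n_plots):

    generated quantities {
      vector[S] rsq_bayes;   // per-site Bayesian R^2, one value per posterior draw
      {
        vector[S] mean_fit = rep_vector(0, S);
        vector[S] mean_res = rep_vector(0, S);
        vector[S] var_fit = rep_vector(0, S);
        vector[S] var_res = rep_vector(0, S);
        vector[N] res;
        for (n in 1:N) {
          res[n] = y[n] - mu[n];                         // residual for this draw
          mean_fit[site[n]] += mu[n] / n_plots[site[n]];
          mean_res[site[n]] += res[n] / n_plots[site[n]];
        }
        for (n in 1:N) {
          var_fit[site[n]] += square(mu[n] - mean_fit[site[n]]) / (n_plots[site[n]] - 1);
          var_res[site[n]] += square(res[n] - mean_res[site[n]]) / (n_plots[site[n]] - 1);
        }
        for (s in 1:S)
          rsq_bayes[s] = var_fit[s] / (var_fit[s] + var_res[s]);
      }
    }

As far as I understand, Gelman et al. (2019) also discuss using the modeled residual variance instead of the empirical residual variance in the denominator; I’m not sure which is preferable on the lognormal scale.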

So here are my questions:

  1. Is Bayesian R² (properly computed) a reasonable approach to flag problematic sites?
  2. Are there better metrics or methods that leverage posterior predictive samples more fully to assess model fit at the site level?
  3. Is there a principled way to exclude sites without introducing selection bias into the model?

Any suggestions on robust, interpretable ways to approach this would be greatly appreciated!

Thanks so much,
Viktor

I think this approach for identifying problematic sites is appropriate (you could possibly systematize it by computing posterior p-values for some appropriate test statistic and screening below some threshold). But around here you probably won’t find much support for excluding them from the model. Instead, since you seem to have some idea of why the fit may be poor for certain sites, can you expand the model? Specifically, both site-varying observation noise and input uncertainties are fairly easy to implement in Stan.
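For example, per-site observation noise with a shared hyperprior could look something like the sketch below. This is just to show the shape of it, not a drop-in for your model; mu and site stand for whatever you already have for the predicted mean and the plot-to-site index.

    parameters {
      vector<lower=0>[S] sigma_site;    // per-site lognormal observation scale
      real mu_log_sigma;                // hyper-mean of the noise scales
      real<lower=0> tau_log_sigma;      // hyper-sd of the noise scales
    }
    model {
      mu_log_sigma ~ normal(-1, 1);              // weakly informative; adjust to your scale
      tau_log_sigma ~ normal(0, 1);              // half-normal via the <lower=0> constraint
      sigma_site ~ lognormal(mu_log_sigma, tau_log_sigma);
      y ~ lognormal(log(mu), sigma_site[site]);  // mu and site as in your existing model
    }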

Thank you very much! I’ll look into modeling the observation noise, but I’m not yet sure if the underlying trends are consistent enough.

My advice is to fit a measurement error model directly and account for the “outliers” in the model. That is, you want to be able to tell a story about how all of the data was generated. For example, stock market data is plagued by decimal slip errors, where, say, 1.01 is reported as 10.1. We can add a measurement model that allows a probability of each measurement being in error (though we probably couldn’t fit this particular example with Stan due to the combinatorics of discrete errors like this).

Can you say more about what’s failing in the posterior predictive checks? I assume by that you mean you have posterior p-values that look “significant”?

The other way to look at this is that the poorly fitting sites broaden the prior variance (which is presumably a hyperparameter somewhere) to the appropriate level to cover all the data.

Here’s the best clue you provided about how to fix your model:

“likely due to high measurement noise or uncertainty in the ‘time since disturbance’ variable, which is notoriously hard to pin down”

First, if there’s just high measurement noise, you don’t want to simply remove noisy measurements. They are still measurements, and unless your measurement device is changing, you have to account for that measurement noise in anything you do predictively.

Second, if the “time since disturbance” measurement is noisy, you can add an explicit noisy measurement model to account for it. That can lead to a much tighter fit and less uncertainty in the “clean data” version.
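Here is a sketch of what that could look like, treating the true time since disturbance as a latent quantity. The names t_obs, tau_t, t_true and the priors are placeholders you’d have to adapt, and depending on how your disturbance dates were assigned, a per-site rather than per-plot latent time may make more sense.

    data {
      vector<lower=0>[N] t_obs;        // recorded time since disturbance
      real<lower=0> tau_t;             // assumed measurement sd (or give it a prior)
    }
    parameters {
      vector<lower=0>[N] t_true;       // latent "clean" time since disturbance
    }
    model {
      t_obs ~ normal(t_true, tau_t);   // measurement model
      t_true ~ lognormal(2, 1);        // prior on the true times; adjust to your data
      // ... and use t_true instead of t_obs in the Chapman-Richards mean
    }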

The differences you computed should be standardized by dividing by the scale. Just knowing the difference doesn’t tell you anything about the quality of the match unless you know how uncertain things are. For instance, under a normal(0, 1), an estimate of 0.2 is not far from the mean, but under a normal(0, 0.01) it’s miles away. For calibration, you want to check that the posterior intervals cover the estimated values at their nominal rates (e.g., 50% posterior intervals should contain roughly 50% of the estimated values).
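For example, in generated quantities you can compute standardized residuals on the log scale and an indicator for whether each observation falls below its posterior predictive replicate (reusing whatever mu and sigma_obs are in your model):

    generated quantities {
      vector[N] z_resid;       // standardized residuals on the log scale
      array[N] int y_lt_rep;   // 1 if the observation is below its replicate
      for (n in 1:N) {
        real y_rep = lognormal_rng(log(mu[n]), sigma_obs);
        z_resid[n] = (log(y[n]) - log(mu[n])) / sigma_obs;
        y_lt_rep[n] = y[n] < y_rep;
      }
    }

The posterior mean of y_lt_rep[n] is the share of replicates that exceed observation n; values near 0 or 1 flag plots out in the tails, and the fraction of plots landing inside [0.25, 0.75] estimates the coverage of the 50% predictive intervals.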

As for excluding sites without introducing selection bias: no. You’re literally pulling out the things that don’t fit and thus biasing the uncertainty estimates to be lower than they would be without selection.

Thank you very much for your elaborate answer.

I ended up grouping my sites using an additional hierarchical level based on the Köppen-Geiger climate classification. The main reason is that including or excluding sites really affected the covariate effect-size estimates. The effect size of the same covariate also differed strongly depending on some of the other covariates, even though covariance between covariates had been checked beforehand. Most importantly, in a preliminary PCA I could clearly see some sites clustering, which also showed up as discrete trends in e.g. the covariate histograms.

Although this is conceptually a different model, it now runs fine and is not doing anything unexpected, so I will leave it at that for now.

The hierarchy is now as follows: regional parameters < climate class hyperdistributions < general hierarchical hyperdistribution.
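For one parameter (say log λ), the added level looks roughly like this (simplified sketch; climate maps each site to its Köppen-Geiger class, and the other parameters get the same treatment):

    data {
      int<lower=1> K;                          // number of Koppen-Geiger classes
      array[S] int<lower=1, upper=K> climate;  // climate class of each site
    }
    parameters {
      real mu_global;                          // top-level mean of log lambda
      real<lower=0> tau_global;
      vector[K] mu_climate;                    // climate-class means
      real<lower=0> tau_climate;
      vector[S] log_lambda;                    // site-level log recovery rates
    }
    model {
      mu_climate ~ normal(mu_global, tau_global);
      log_lambda ~ normal(mu_climate[climate], tau_climate);
    }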

Nonetheless, thanks for the help!
