I am sorry if this question is too basic or posted under the wrong tags, but here goes:
I have modeled forest regrowth trajectories in Stan using an exponential recovery function described by an intrinsic recovery rate lambda. The model is hierarchical, estimating different lambdas for 20 forest sites. Now, I would like to relate these lambdas (which, in my case, are a measure for the relative recovery speed of the specific forest sites) to environmental covariates like rainfall, % tree cover, etc.
I could simply include the covariates in the Stan model and model their effect sizes on lamda with additional parameters. However, I first need to do a predictor selection and I am also not sure of the actual relation between the response (i.e. site-specific lambda values) and the covariates (may be linear or non-linear). Additionally, I want to take full use of the previous Bayesian modeling output, of course, accounting for all simulated values for lambda for all sites.
Could someone please point me in the direction of how to tackle this? I posted here in Gaussian processes since I expect the answer would involve simulating many regressions. But I’m still left with the question of what to do with potential non-linear (e.g. saturating) trends that may go unnoticed.
Can you elaborate a bit on the hierarchical structure? Are the sites nested within geographic regions or are there individual data points nested within sites? If you only have 20 data points I suspect your ability to do meaningful predictor selection (or even learning a nonlinear relationship between \lambda and the covariates) will be limited.
Thank you for the swift reply and the reference to the overview.
The individual data points consist of inventories of forests of different ages. These data points were collected in several sites, so indeed geographical areas, across Africa. Each site thus contains various forest inventory data points.
Some sites are close to each other, others are far away, but for now I’ve modeled separate lambda’s for each site, regardless of their distance from one another. I’m still looking to add an extra hierarchical level (‘above’ the site-level), based on climate or vegetation type.
The 20 sites (and thus 20 lambda’s I estimate) are hence based on many underlying individual data points.
Ok, I think based on my understanding of your situation the keyword(s) you are looking for is “group level predictors”. The Gelman and Hill text has discussion on this.
In Bayesian models, you usually do all of this at once. If by “predictor selection” you mean you want to set some of the effects exactly to zero, you can’t easily do that “within” Bayes without a spike-and-slab prior, but that’s not something Stan can fit as the marginalization becomes combinatorial. So instead you have to do something outside of standard Bayesian inference. I’m just going to assume that you’ve somehow by hook or by crook figured out which covariates to use.
At that point, yes, you can fit the whole thing with a Gaussian process. The kernel can be set up to decide how much smoothing to do. A GP does not simulate many regressions—it combines them all into one big joint regression. The best place to start for Stan is probably the chapter in the User’s Guide. A lot of the literature on the GPs takes a prior over functions approach that can be very math heavy. Just like the linear modeling literature, you can add a link function on top to keep things positive, for example.
A simpler approach to non-linearity is to bin your covariates and treat them as varying effects. For instance, instead of adding a coefficient \beta^\text{age} \times x^\text{age}_n for a continuous age covariate x^\text{age}, you can break age down into ten bands and give them all their own effect, e.g., \beta^\text{age}_{\text{age-band}(n)}. This allows non-monotonicity, but it is not smooth among the groups. And it leaves you with selecting the boundaries. It will probably be a lot more efficient than a GP to fit.