Hi all,
I’m quite new to Bayesian modelling and have spent the past couple of months working through Statistical Rethinking. I’ve also read Bayesian Models by Hobbs and Hooten, which inspired me to go beyond GLMMs and think more carefully about the processes that generate the data. I thought it might be fun to apply some of what I’ve learned to one of my own datasets, hopefully cementing it along the way, but I’m running into all sorts of fun new problems.
The dataset in question contains data on the mycorrhizal fungal species richness of pine seedlings. Seedlings from each of 12 genotypes (“families”) were grown in the field at each of 18 locations (“grids”), with seedlings from each family replicated 3–5 times at each location. Seedlings were collected after three months and the species richness of fungi on each was counted. Each fungal species occupies a single root tip, so we also counted the total number of root tips on each seedling (this varies quite a lot, from 0 to ~500). The question we would like to answer is whether pine genotypes differ in their fungal species richness.
When I worked on these data during my PhD, I modelled them using a GLMM, which gave very lacklustre results. Thinking more about the data-generating process, however, species richness is likely to saturate as the number of root tips increases, since a seedling samples fungal species from the pool available at a given location. I thought it might be fun to try to model this using a Michaelis-Menten equation, modelling V_{max} (the saturation point of the curve) as a function of genotype and location. However, I’m running into some issues fitting the model, and I’m not quite sure where I’ve gone wrong.
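To illustrate the saturating shape I have in mind (a toy sketch with made-up numbers, not the real data):

```r
# Toy illustration of the saturating relationship (hypothetical values):
# expected richness follows a Michaelis-Menten curve in root-tip count.
vmax <- 20    # hypothetical asymptotic richness at a location
k    <- 100   # hypothetical half-saturation constant (tips at vmax / 2)
tips <- 0:500
lambda <- (vmax * tips) / (k + tips)
plot(tips, lambda, type = "l",
     xlab = "Total root tips", ylab = "Expected species richness")
```

Expected richness rises steeply for small seedlings and flattens towards vmax for heavily rooted ones, which is the pattern I’d like the model to capture.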
The model I’m trying to implement is:

$$
\begin{aligned}
\text{richness}_i &\sim \text{Poisson}(\lambda_i) \\
\lambda_i &= \frac{V_{max,i} \times \text{tips}_i}{k + \text{tips}_i} \\
V_{max,i} &= \text{fam}_{\text{family}[i]} \times \text{gr}_{\text{grid}[i]} \\
\text{fam}_j &\sim \text{Lognormal}(fam\_a, fam\_sig) \\
\text{gr}_j &\sim \text{Lognormal}(grid\_a, grid\_sig) \\
k &\sim \text{Lognormal}(1, 1)
\end{aligned}
$$

with $fam\_a, grid\_a \sim \text{Normal}(1, 1)$ and $fam\_sig, grid\_sig \sim \text{Exponential}(1)$.
I modelled V_{max} as the product of the Family and Grid parameters, the idea being that the Grid parameter captures the maximum species richness at a location, while different families may capture different proportions of that richness (if there are any differences). My understanding is that as long as both $V_{max}$ and $k$ are positive, $\lambda$ is positive and thus I don’t need a log-link. To ensure that, I specified Lognormal distributions for Family, Grid and k, and set lower bounds on these parameters at 0.0001.
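A quick prior-predictive sketch of that reasoning (hypothetical draws from the priors, just to convince myself that $\lambda$ stays positive):

```r
# Prior-predictive sanity check (hypothetical draws, not a fitted model):
# with positive fam, gr and k, lambda is positive whenever tips > 0.
set.seed(1)
fam_a    <- rnorm(1, 1, 1)
fam_sig  <- rexp(1, 1)
grid_a   <- rnorm(1, 1, 1)
grid_sig <- rexp(1, 1)
fam <- rlnorm(12, fam_a, fam_sig)    # family multipliers, all > 0
gr  <- rlnorm(18, grid_a, grid_sig)  # grid maxima, all > 0
k   <- rlnorm(1, 1, 1)               # half-saturation constant, > 0
tips   <- runif(100)                 # scaled tip counts in (0, 1)
vmax   <- fam[sample(12, 100, replace = TRUE)] *
          gr[sample(18, 100, replace = TRUE)]
lambda <- (vmax * tips) / (k + tips)
all(lambda > 0)  # TRUE here, though lambda == 0 exactly when tips == 0
```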
I have implemented this model as follows, using ulam in the rethinking package:
SeedMod <- ulam(
  alist(
    richness ~ dpois(lambda),
    lambda <- (vmax * tips) / (k + tips),
    vmax <- fam[family] * gr[grid],
    k ~ dlnorm(1, 1),
    fam[family] ~ dlnorm(fam_a, fam_sig),
    gr[grid] ~ dlnorm(grid_a, grid_sig),
    fam_a ~ dnorm(1, 1),
    fam_sig ~ dexp(1),
    grid_a ~ dnorm(1, 1),
    grid_sig ~ dexp(1)
  ),
  data = with(fungi_data, list(
    richness = Richness,
    family = as.integer(as.factor(Family)),
    grid = as.integer(as.factor(Grid)),
    tips = Tot_tips / max(Tot_tips)
  )),
  cores = 4, chains = 4, log_lik = TRUE, cmdstan = TRUE,
  constraints = list(fam = "lower=0.0001", gr = "lower=0.0001", k = "lower=0.0001")
)
This produces the following Stan code:
data{
    int richness[673];
    vector[673] tips;
    int grid[673];
    int family[673];
}
parameters{
    real<lower=0.0001> k;
    vector<lower=0.0001>[12] fam;
    vector<lower=0.0001>[18] gr;
    real<lower=0> fam_a;
    real<lower=0> fam_sig;
    real<lower=0> grid_a;
    real<lower=0> grid_sig;
}
model{
    vector[673] lambda;
    vector[673] vmax;
    grid_sig ~ exponential( 1 );
    grid_a ~ lognormal( 1 , 1 );
    fam_sig ~ exponential( 1 );
    fam_a ~ lognormal( 1 , 1 );
    gr ~ lognormal( grid_a , grid_sig );
    fam ~ lognormal( fam_a , fam_sig );
    k ~ lognormal( 1 , 1 );
    for ( i in 1:673 ) {
        vmax[i] = fam[family[i]] * gr[grid[i]];
    }
    for ( i in 1:673 ) {
        lambda[i] = (vmax[i] * tips[i])/(k + tips[i]);
    }
    richness ~ poisson( lambda );
}
generated quantities{
    vector[673] log_lik;
    vector[673] lambda;
    vector[673] vmax;
    for ( i in 1:673 ) {
        vmax[i] = fam[family[i]] * gr[grid[i]];
    }
    for ( i in 1:673 ) {
        lambda[i] = (vmax[i] * tips[i])/(k + tips[i]);
    }
    for ( i in 1:673 ) log_lik[i] = poisson_lpmf( richness[i] | lambda[i] );
}
However, the model fails: although it compiles fine, it cannot find valid starting values for any chain and just dies:
Chain 4 Rejecting initial value:
Chain 4 Gradient evaluated at the initial value is not finite.
Chain 4 Stan can't start sampling from this initial value.
Chain 4 Initialization between (-2, 2) failed after 100 attempts.
Chain 4 Try specifying initial values, reducing ranges of constrained values, or reparameterizing the model.
Chain 4 Initialization failed.
Warning: Chain 4 finished unexpectedly!
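Poking at the likelihood by hand at init-scale values seemed worth a try (a rough sketch with made-up draws, not the values Stan actually tried; Stan initialises unconstrained parameters in (-2, 2), which the lower bound then maps to values just above 0.0001):

```r
# Rough by-hand check of the Poisson log-likelihood at init-like values
# (hypothetical draws mimicking Stan's default initialization)
set.seed(2)
vmax <- (1e-4 + exp(runif(1, -2, 2))) *   # fam[j] at init
        (1e-4 + exp(runif(1, -2, 2)))     # gr[j] at init
k    <- 1e-4 + exp(runif(1, -2, 2))
tips <- c(0, 0.01, 0.5, 1)   # scaled tip counts, including a zero
lambda <- (vmax * tips) / (k + tips)
dpois(1, lambda, log = TRUE) # -Inf wherever lambda is exactly 0
```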
Where have I gone wrong here? The model samples OK if I include a log-link (with about 1% of transitions divergent), but dies completely without it, which suggests to me that it’s trying impossible values for \lambda. From my understanding, though, with the bounds I specified this shouldn’t happen? Any insight anyone might have would be really invaluable!