Hi all,
I’m developing a spatially explicit ecological model of plant interactions/dispersal using inverse modeling. The model is pretty simple in structure, but given my limited experience with Stan I wanted to make sure it’s doing what it’s supposed to. In other words, is there anything I’m missing? Additionally, I’d appreciate any suggestions on how to optimize the code and make the calculations more efficient, as the matrices can get pretty large.
data {
  int<lower = 0> N;                   // number of observations
  vector[N] growth;                   // response
  vector[N] sizet;                    // target plant sizes
  vector<lower = 0>[476] sizemat[N];  // neighbor size matrix
  vector<lower = 0>[476] distmat[N];  // neighbor distance matrix
}
parameters {
  real I;
  real a;
  real b;
  real c;
  real<lower = 0> sigma;
}
transformed parameters {
  vector[476] dist[N];
  for (n in 1:N) {
    for (i in 1:476) {
      dist[n, i] = distmat[n, i]^b;
    }
  }
}
model {
  I ~ normal(1, 10);
  a ~ normal(5, 3);
  b ~ normal(1, 2);
  c ~ normal(1, 3);
  sigma ~ normal(5, 10);
  for (n in 1:N) {
    growth[n] ~ normal(I + sizet[n]*c + a*sum(sizemat[n] ./ exp(dist[n])), sigma);
  }
}
generated quantities {
  vector[N] log_lik;
  for (n in 1:N) {
    log_lik[n] = normal_lpdf(growth[n] | I + sizet[n]*c + a*sum(sizemat[n] ./ exp(dist[n])), sigma);
  }
}
//the formula for the MLE fit (mle2 in R) looks like this:
growth ~ dnorm(I + size*c + a*rowSums(sizemat/exp(distmat^b), na.rm = T), exp(sigma))
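On the efficiency question: one easy win is that everything declared in transformed parameters (here the N x 476 dist array) is saved to the output at every iteration. A sketch of the same model with the interaction term computed as a local variable and the likelihood vectorized (same math, just restructured, and untested on the real data):

model {
  vector[N] cf;                                // interaction term per observation
  I ~ normal(1, 10);
  a ~ normal(5, 3);
  b ~ normal(1, 2);
  c ~ normal(1, 3);
  sigma ~ normal(5, 10);
  for (n in 1:N) {
    vector[476] d;                             // local, so not written to the output
    for (i in 1:476)
      d[i] = distmat[n, i]^b;
    cf[n] = sum(sizemat[n] ./ exp(d));
  }
  growth ~ normal(I + c*sizet + a*cf, sigma);  // single vectorized statement
}

Generated quantities cannot see model-block locals, so log_lik would need to recompute cf (or cf could live in transformed parameters if saving it is acceptable).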
The model is running fine as of now, but one of the issues I’ve had is a discrepancy with the maximum likelihood estimates. I started off with MLE but soon ran into problems with the parametrization. I then fit the model in Stan using uninformative/weakly informative priors and used the coefficient estimates as starting values for the MLE. This, I thought, should have helped the optim algorithm find a stable solution, but it didn’t. Is this common?
Rather, it converges to different solutions from run to run. Here are a few mle2 runs on the same model/data (saved RDS files).
Could not estimate SEs for some parameters: m1.R (447.3 KB), m2.R (447.3 KB)
For this run, the estimates from profiling were used as starting values. It converged, but rerunning shows the solution is not stable (it still produces different parameter values): m3.R (447.4 KB)
Whoops, you already did this with your m4_stanfit.R fit. It’s just the difference between calling optimizing or sampling on the model.
The optimizer converging to multiple separate solutions is just telling you there is something complicated going on in the model.
Running the sampler is a better way of understanding what is happening in your problem. So that is what you have with the m4_stanfit file.
The first thing that jumps out to me looking at that is the divergences. The model has 400+. You really want that to be zero, or maybe just a handful. They are a strong indicator that something funky is happening, and the places where they happen can clue you in on the problem. A divergence happens when the numerical integrator in the Stan sampler breaks down.
Google for the Betancourt case study on divergences. I think by default pairs(fit) will show you divergences in red. The place to start is there. Whether a model has divergences or not is a function of the data, so maybe look at the Stan fit for the data that worked for you and see if you find anything.
It could be multimodality (many local optima) or a failure to find a single global optimum due to config issues like step size or convergence tolerances.
Right. This is saying that the coding of the log density isn’t stable when used for a first-order Hamiltonian simulation (i.e., just gradients). That indicates there is some extreme curvature in the posterior, which can also cause problems for the optimizer.
Are there missing constraints, or can b truly be negative?
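For reference, such a constraint would just go on the declaration in the parameters block:

real<lower = 0> b;  // restricts b to positive values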
The reference on divergence diagnostics is helpful. I looked into the locations of the divergences, but as far as I can tell they don’t show a clear pattern of occurrence (plot below). The pairs() plots likewise show hardly any transitions flagged for numeric errors.
I also tried running the model with a smaller step size (adapt_delta = 0.9), and another run with a longer warmup (iter = 6000, warmup = 5000) without changing adapt_delta, but neither affected the number of divergences (300+).
A constraint to positive values for b seems reasonable. I’m not sure that’s the way to do it, but I simply introduced a <lower=0> bound on b in the parameters block, and it brought the number of divergences up to 1100+.
If the issue is multimodality, which I imagine is a function of the data, to what extent do the divergences bias the estimates? By visual assessment, the chains don’t get stuck in any particular place, the four chains are well mixed, the effective sample size is relatively high, and Rhat < 1.05. If these indicate convergence (?), would the estimates be reliable?
Another possibility: is the Stan code doing the same thing as my MLE function (i.e., is the vectorization done correctly, and do the matrix/vector operations work as expected)?
It’s totally bad news bears. Divergences can make areas of parameter space inaccessible. The mixing diagnostics might be fine, but that only indicates you’ve mixed well outside the region of parameter space where the divergences are. You’ve got reason to be worried with more than a couple of divergences. They are indicators that the sampler is trying to explore somewhere it cannot get to.
Like is it something like: “we expect the growth (growth[n]) in biomass to be proportional to the size (sizet[n]) of field n, but growth of a field will also be affected by the size of the neighboring fields, with an effect that diminishes with distance”. Correct me!
And then why is that distance raised to a power and then exponentiated? That seems like an awful lot of transformin’.
An interpretation of the excerpted line is that growth of a target plant depends both on the target plant’s size (bigger plants grow faster) as well as on the neighbors around the target plant (we expect plant growth to be less when there are big neighbors that are nearby, due to competition).
So then sizemat and dist are the matrices of the sizes of, and distances to, the neighbors of each target plant.
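Written out, the mean function in the model block is

$$\mathbb{E}[\text{growth}_n] \;=\; I \,+\, c \cdot \text{sizet}_n \,+\, a \sum_{j=1}^{476} \frac{s_{nj}}{\exp\!\big(d_{nj}^{\,b}\big)},$$

where $s_{nj}$ is the size of neighbor $j$ of target plant $n$ and $d_{nj}$ the distance to it; for positive $b$, the neighbor contribution decays with distance at a rate governed by $b$.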
Thanks for your thoughts on this (the original poster is my MSc student, btw).
Thanks for everyone’s input; the model now converges a lot better.
I’m currently working on speeding up the calculations with the segment() function, because as the neighborhood matrices get bigger the computation slows down significantly.
One way to do this is to remove the zeros from the matrices, as they should contribute nothing to the non-linear term. The two-dimensional array-of-vectors syntax (example above) works fine when the matrices are relatively small:
...
vector[476] sizemat[N];
vector[476] dist[N];
vector[N] cf;
for (n in 1:N)
  cf[n] = sum(exp(log(sizemat[n]) - dist[n]));
...
for (n in 1:N)
  growth[n] ~ normal(... + a*cf[n], sigma);
In this example, zeros in sizemat correspond to ones in dist to avoid division by zero, so the sum() should equal the row-wise sum of quotients over the entries where sizemat is non-zero.
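The equivalence rests on the identity

$$\exp\big(\log(s) - d\big) = \frac{s}{\exp(d)} \qquad (s > 0),$$

while padded entries with $s = 0$ give $\log(0) = -\infty$ and hence $\exp(-\infty) = 0$, so they contribute nothing to the sum.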
Another syntax that should, in theory, do the same thing with all zeros removed uses the segment() function:
...
vector[y] size_obs;  // sizemat converted to a vector, zeros removed
vector[y] dist_obs;  // distmat converted to a vector, zeros removed
int pos[N];          // start position of each matrix row in the long vectors (int: segment() needs integer arguments)
int n_nb[N];         // group size = number of non-zero values in each row of the matrix
...
vector[N] cf;
for (n in 1:N)
  cf[n] = sum(segment(size_obs, pos[n], n_nb[n]) ./ exp(segment(dist_obs, pos[n], n_nb[n])*b));
...
growth ~ normal(... + a*cf, sigma);
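As an aside, assuming pos holds the usual cumulative start indices, it can be computed inside the program rather than passed as data, following the ragged-array pattern from the Stan manual (pos2 is just an illustrative name):

transformed data {
  int pos2[N];                  // start of each plant's slice in size_obs/dist_obs
  pos2[1] = 1;
  for (n in 2:N)
    pos2[n] = pos2[n - 1] + n_nb[n - 1];
}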
The calculation times with the segment() syntax are a lot faster, and to my understanding it should be doing the same thing - slicing chunks of a long vector, taking element-wise ratios, and adding them up. However, I’ve noticed some other differences with the segment() syntax that I can’t really explain:
- convergence is a lot easier
- the WAIC is significantly lower than with the “matrix” notation, which means the syntax change affects the log likelihood
- both versions successfully recover the true parameter values from a simulated data set
So my question is whether the two versions are truly doing the same operation, and if so, why are there differences in model fit (e.g. WAIC)?
Ahhh, finally managed to get the two to converge to the same estimates. Thanks for looking into this, @bbbales2 - it took me a while to figure out where the bug was, but you were totally right: tracking things down was the key.
Main findings: (i) when the zeros were removed and the usual R matrix was converted to a ragged structure, convergence was a lot faster; (ii) the two ways of coding the program yielded the same estimates with very similar CIs, and the WAIC values were within a point of each other; (iii) lastly, the ragged format with zeros removed also allows the exp(log(x1) - x2) form, unlike the 2D matrix format, which contains many zeros.
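For reference, a consolidated sketch of the ragged version pieced together from the snippets above (same priors and mean function as the original program; pos, n_nb, and y as defined earlier; untested as written):

data {
  int<lower = 0> N;                 // number of observations
  int<lower = 0> y;                 // total non-zero neighbor entries
  vector[N] growth;                 // response
  vector[N] sizet;                  // target plant sizes
  vector<lower = 0>[y] size_obs;    // flattened neighbor sizes, zeros removed
  vector<lower = 0>[y] dist_obs;    // flattened neighbor distances
  int pos[N];                       // start index of each plant's slice
  int n_nb[N];                      // number of neighbors per plant
}
parameters {
  real I;
  real a;
  real b;
  real c;
  real<lower = 0> sigma;
}
model {
  vector[N] cf;
  I ~ normal(1, 10);
  a ~ normal(5, 3);
  b ~ normal(1, 2);
  c ~ normal(1, 3);
  sigma ~ normal(5, 10);
  for (n in 1:N) {
    vector[n_nb[n]] s = segment(size_obs, pos[n], n_nb[n]);
    vector[n_nb[n]] d = segment(dist_obs, pos[n], n_nb[n]);
    cf[n] = sum(exp(log(s) - exp(b * log(d))));  // s / exp(d^b), with d^b = exp(b*log(d))
  }
  growth ~ normal(I + c*sizet + a*cf, sigma);
}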