 # Divergent transitions and other warnings for binomial model describing spider feeding rates

Hi all,

I am trying to fit a hierarchical binomial model to some spider foraging data that I have and am running into some issues with divergent transitions and low bulk effective and tail effective sample sizes.

For some background, I am trying to fit spider feeding rates (or functions describing how spider feeding rates change other variables) to feeding surveys of spiders eating midges on walls. The way this method works is by combining the fraction of feeding spiders during a survey and the time over which feeding events are detectable to estimate the feeding rate. Skipping some details, the proportion of spiders feeding in a survey p should be equal to the product of the feeding rate of the spiders during that survey f and the detection time of spiders feeding on their prey d, or
p = fd.

Now, we want to be able to use a whole series of surveys to be able to estimate the parameters of a function describing how f changes with other variables such as midge densities. The classic function for this is
f = \frac{aR}{1 + ahR}
where a is the so-called space clearance rate of the predator and h is the handling time which for our convenience here we will assume is equal to the detection time, d, and R is the resource density. For each survey, we have estimated the detection times using data from another experiment and we also have estimates of the resource densities in each survey.

Finally, to the actual model. Originally, we simply fit the following the model
y_{i} \sim Binomial(n = n_{t,i} , p_{i} = \frac{aR_{i}}{1 + ad_{i}R_{i}}d_{i})
where y_{i} is the number of spiders feeding in survey i and n_{t,i} is the total number of spiders in survey i. Fitting this model everything worked fine.

However, an astute reviewer noticed that we had surveyed certain walls multiple times and suggested that we account for this non-independence by using a hierarchical model. I figured this wouldn’t be a big deal and tried to fit the following model
y_{ji} \sim Binomial(n = n_{t,ji}, p_{i} = \frac{a_{j}R_{ji}}{1 + a_{j}d_{ji}R_{ji}}d_{ji})
a_{j} \sim Normal(\mu_{a}, \sigma_{a})
where y_{ji} is now the number of feeding spiders on wall j in survey i, a_{j} is wall-specific space clearance rate for wall j, and so on. \mu_{a} and \sigma_{a} are hyperpriors describing desitrbution of wall-specific space clearance rates, a.
Now, however, I get a bunch of warnings about divergent transitions and low bulk effective and tail effective sample sizes. I’m looking for any tips on how to get rid of these issues. This is likely the simplest of the models I will fit to the data since I will also be fitting models in which, for example, the a parameter is temperature-dependent. My worry is that if I’m running into issues with the hierarchical models already, fitting those more complex models may not be possible.

Below, I give my R code to fit the model to the data, the Stan code to define the model, and a link to the .csv file containing the data on GitHub,

R Code:

### load required packages

library(dplyr); library(rstan);

### set some options for Stan

options(mc.cores = parallel::detectCores())

rstan_options(auto_write = TRUE)

spider <- read.csv('FeedingData_Manipulated_f.csv', stringsAsFactors = TRUE)

### drop surveys with no midges

spider <- spider %>% filter(NumberMidges > 0)

### set up data for Stan

notemp_noint_data <- list(N = nrow(spider),
y = spider$NumberFeed, n_t = spider$NumberPred,
K = length(unique(spider$BuildingWall)), d = spider$DetectionEst,
R = spider$NumberMidges, j = as.numeric(spider$BuildingWall))

notemp_noint_fit <- stan(file = 'Female_notemp_BDnone_hierarchical.stan', data = notemp_noint_data)



The contents of the file ‘Female_notemp_BDnone_hierarchical.stan’

// Stan model to fit type II FR
//

// Input data:
data {
int<lower=0> N; // Number of surveys
int<lower=0> y[N]; // Number of spiders feeding
int<lower=0> n_t[N]; // Number of spiders observed
int<lower=1> K; // Number of different building/wall combinations
vector[N] d; // Detection times
vector[N] R; // Prey densities
int<lower=1> j[N]; // building/wall index for each observation
}

// The parameters accepted by the model.

parameters {
real<lower = 0> aj[K];
real<lower = 0> a_mean;
real<lower = 0> a_sd;
}

transformed parameters {
vector[N] f;
vector<lower = 0, upper = 1>[N] p;

for (i in 1:N) {
f[i] = (aj[j[i]] * R[i])/(1 + (aj[j[i]] * R[i] * d[i]));
p[i] = f[i] * d[i];
}
}

// The model to be estimated. y is binomial with the probability
// equal to the feeding rate times the detection time
model {
// Likelihood
for (i in 1:N) {
y[i] ~ binomial(n_t[i], p[i]);
}
// Need priors and hyper priors
for (k in 1:K) {
aj[k] ~ normal(a_mean, a_sd);
}

a_mean ~ normal(30, 15);

a_sd ~ normal(0, 10);

}

generated quantities {
vector[N] log_lik;
for (n in 1:N){
log_lik[n] = binomial_lpmf(y[n] | n_t[n], (aj[j[n]]*R[n])/(1 + aj[j[n]]*d[n]*R[n]) * d[n]);
}
}


Link to the .csv file containing the data on GitHub: ZebraSpiderFR/FeedingData_Manipulated_f.csv at main · KyleCoblentz/ZebraSpiderFR · GitHub

I really appreciate any tips or help that anyone can offer. Thank you in advance and please let me know if there are any questions you have.

A low hanging fruit is just to increase the acceptance rate by increasing adapt_delta

 stan(file = 'Female_notemp_BDnone_hierarchical.stan', data = notemp_noint_data
, control = list(
max_treedepth = 12
)
)


I’d suggest to replace the truncated normal distribution normal(\mu_j, \sigma_j) for a_j into something more appropriate like an exponential or gamma distribution.
You may work out the parameters a, b of a gamma(a, b) distribution depending on the Expected Value and Standard-Deviation. Or you can try something like:

real<lower=0> scale;  // parameters

aj[k] ~ gamma(scale * a_mean, scale);
scale ~ cauchy(0, 10);


Further I would also study some papers if the beta_binomial distribution has been used in your case before. To make it short, the beta distribution is to the binomial distribution, what the gamma distribution is to the poisson distribution (=negbinomial).

1 Like

Wow, that’s an interesting system!

Based on this block:

  for (k in 1:K) {
aj[k] ~ normal(a_mean, a_sd);
}

a_mean ~ normal(30, 15);

a_sd ~ normal(0, 10);


I think this corresponds to centered parameterization which notoriously causes divergences. Case study here is a very accessible description of why (Diagnosing Biased Inference with Divergences), and shows that you could recode the hierarchical component here using a non-centered parameterization.

To follow @andre.pfeuffer comments regarding the beta-binomial, I believe that a clever implementation of hierarchical terms could be an alternate approach to add observation-level variability, see here Beta-Binomial: Why not a default family in the package? - #2 by paul.buerkner.

Hi @andre.pfeuffer and @AWoodward,

Unfortunately, increasing the adapt_delta and changing the distribution of the hierarchical parameters themselves did not solve the problem.

I also worked for a little while trying to connect the ecological background to the beta-binomial distribution, but I failed at this also. I might think about it a bit more though and see if I can’t come up with something clever.

The good news is that the noncentered parameterization seems to work! Below is my current code for the model. It occasionally gives one or two divergent transitions, but, generally, I can run it a few times and get a set of chains that have no divergent transitions. The parameter estimates all seem to make sense too. I’ll now move on to trying to fit some more complicated models, for example, where the a parameter is also possibly temperature-dependent. Thanks again for y’all’s help, I really appreciate it.

Here is my Stan code for the non-centered parameterization. The R code to run the model in Stan is the same as my original post but with the adapt-delta increased:

// Stan model to fit type II FR
//

// Input data:
data {
int<lower=0> N; // Number of surveys
int<lower=0> y[N]; // Number of spiders feeding
int<lower=0> n_t[N]; // Number of spiders observed
int<lower=1> K; // Number of different building/wall combinations
vector[N] d; // Detection times
vector[N] R; // Prey densities
int<lower=1> j[N]; // building/wall index for each observation
}

// The parameters accepted by the model.

parameters {
real<lower = 0> a_mean;
real<lower = 0> a_sd;
real a_raw[K]; // need for non-centered parameterization
}

transformed parameters {
vector[N] f; // individual feeding rates
vector<lower = 0, upper = 1>[N] p; // individual estimated proportions of spiders feeding
real<lower = 0> aj[K]; // building/wall specific space clearance rate

for (k in 1:K) {
aj[k] = a_mean + a_sd * a_raw[k]; // non-centered parameterization
}

for (i in 1:N) {
f[i] = (aj[j[i]] * R[i])/(1 + (aj[j[i]] * R[i] * d[i])); // define feeding rate
p[i] = f[i] * d[i]; // define proportion
}

}

// The model to be estimated. y is binomial with the probability
// equal to the feeding rate times the detection time
model {
// Likelihood
for (i in 1:N) {
y[i] ~ binomial(n_t[i], p[i]);
}
// Need priors and hyper priors

for (k in 1:K) {
a_raw[k] ~ normal(0, 1);
}

a_mean ~ normal(30, 10);

a_sd ~ exponential(0.5);

}

generated quantities {
vector[N] log_lik;
for (n in 1:N){
log_lik[n] = binomial_lpmf(y[n] | n_t[n], (aj[j[n]]*R[n])/(1 + aj[j[n]]*d[n]*R[n]) * d[n]);
}
}


// Stan model to fit type II FR
//

// Input data:
data {
int<lower=0> N; // Number of surveys
int<lower=0> y[N]; // Number of spiders feeding
int<lower=0> n_t[N]; // Number of spiders observed
int<lower=1> K; // Number of different building/wall combinations
vector[N] d; // Detection times
vector[N] R; // Prey densities
int<lower=1> j[N]; // building/wall index for each observation
}

// The parameters accepted by the model.

parameters {
real<lower = 0> aj[K];
real<lower=0> a_mean;
//real<lower = 0> a_sd;
real<lower = 0, upper=1> phi;
//real<lower = 0> scale;
}

transformed parameters {
vector[N] f;
vector<lower = 0, upper = 1>[N] p;

for (i in 1:N) {
f[i] = (aj[j[i]] * R[i])/(1 + (aj[j[i]] * R[i] * d[i]));
p[i] = f[i] * d[i];
}
}

// The model to be estimated. y is binomial with the probability
// equal to the feeding rate times the detection time
model {
real shape_mu = (1-phi)/phi;
// Likelihood
for (i in 1:N) {
//y[i] ~ beta_binomial(n_t[i], p[i] * inv_square(scale), (1- p[i]) * inv_square(scale));
y[i] ~ binomial(n_t[i], p[i]);
}

// Need priors and hyper priors
for (k in 1:K) {
//  aj[k] ~ normal(a_mean, a_sd);
aj[k] ~ gamma(shape_mu, shape_mu * exp(- a_mean));
}

a_mean ~ lognormal(1.38, 2);
phi ~ beta(1, 2);
//scale ~ cauchy(0, 1);
//a_sd ~ normal(0, 10);

}

generated quantities {
vector[N] log_lik;
for (n in 1:N){
log_lik[n] = binomial_lpmf(y[n] | n_t[n], (aj[j[n]]*R[n])/(1 + aj[j[n]]*d[n]*R[n]) * d[n]);
}
}


Run with:

file <- paste0(fname,".stan")
smod <- stan_model(file=model_file)

notemp_noint_fit <- sampling(smod
, data = notemp_noint_data
#, pars = pars
#,  include = FALSE
, iter=5000
, chains=4
#  , init =  init_ll
, seed = 123
, control = list(
, max_treedepth = 14
)
)
extr<-extract(notemp_noint_fit)

dump_file_base <- "~/spider/sider.diagnosis"
library(loo)
options(max.print=1000000, width = 999)
sink(paste0(dump_file_base, ".diagnosis"))
print(summary(notemp_noint_fit))
sink()


This runs with no divergent transitions. It requires an adapt_delta rate of 0.99. I tried the beta_binomial pdf (see commented out).

Thank you @andre.pfeuffer for this code.

I just have a few questions to better help me understand what is going on with the hyperpriors. In this model, you’ve placed a gamma distribution on the a_{j}'s (the building/wall specific space clearing rates). As a hierarchical model, we need hyperpriors on the \alpha and \beta parameters of the model. In the code, the \alpha parameter of the gamma distribution is given by shape_mu. Rather than supplying a prior distribution for shape_mu, you have instead defined shape_mu as \frac{(1 - \phi)}{\phi} and placed a beta prior on \phi. May I ask why you took that approach rather than placing a prior on shape_mu?

From my understanding, the definition of the beta parameter of the gamma distribution is a trick to define the parameter in terms of the mean of the gamma distribution. That is, the mean of the gamma distribution is \frac{\alpha}{\beta}. So, \beta = \frac{\alpha}{mean}. Because we have a_mean having a log normal distribution, this becomes \beta = \frac{\alpha}{e^{\text{a_mean}}} or \beta = \alpha e^{-\text{a_mean}}. Is this correct?

This beta prior on phi makes the sampler to favor high(er) numbers for shape_mu. You may also try inv(phi) or inv_square(phi) and put a prior on phi. This resulted in a few divergent transitions.

The math is correct. Not because. The lognormal distribution is to sample a positive value for a_mean. Suppose we have used a truncated normal, then the mass gets concentrated towards zero. I wanted it to get concentrated around exp(1.38) thus for aj to exp(exp(1.38)).
Reconsidering, the sigma value of lognormal is high. Maybe something smaller lognormal(1.38,0.5); is more appropriate.

Thanks a lot. This all makes sense.