Suggestions for improving model of bird monitoring data

Hi Folks
I’m trying to build a Stan version of the JAGS models currently used to analyse the North American Breeding Bird Survey (BBS) monitoring data. The model is effectively a hierarchical, over-dispersed Poisson regression, with a number of intercept parameters to account for observer-effects, route-level variation in abundance, and random annual fluctuations around the slope-line.

The model works: it generates reasonable estimates with no divergent transitions, and effective sample sizes are generally ~80% of the total samples, provided the max_treedepth argument is increased beyond 10.

The problem

The problem is that the sampler exceeds the maximum tree depth, even when the limit is set to 15-17 (I haven't tried higher yet because life is short). The high max_treedepth limit means the sampler takes a very long time, even for a species with relatively few data compared to many in the dataset.

For example, at max_treedepth 15, n_leapfrog exceeds 30,000. So even a limited initial run of 500 iterations (400 warmup) requires ~10 hours. With max_treedepth at 17, it's closer to 24 hours for the same small run (500 transitions). If I were to run the model on some of the most data-rich species in the dataset, it would require months of run-time.
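For context, those n_leapfrog numbers follow directly from how NUTS builds its trajectory: the tree doubles until the no-U-turn criterion or max_treedepth is hit, so a saturated tree of depth d costs about 2^d leapfrog steps. A quick back-of-envelope check (illustrative only, not the author's timings):

```python
# NUTS doubles its trajectory until the no-U-turn criterion or
# max_treedepth is reached, so a saturated tree of depth d costs
# 2**d - 1 leapfrog (gradient) evaluations.
def saturated_leapfrog_steps(treedepth: int) -> int:
    return 2**treedepth - 1

for d in (10, 15, 17):
    print(d, saturated_leapfrog_steps(d))
# depth 15 gives 32767, consistent with the "> 30,000" n_leapfrog above
```

This is why each extra unit of max_treedepth roughly doubles the per-iteration cost when the tree saturates.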

Help!

Is there an alternative parameterization that might make this posterior easier to sample?

I’m really new to Stan (this is my 3rd bespoke model), and I’m no mathematician. So it’s entirely possible that I’m doing something dumb.

The model

// This is a Stan implementation of the bbsBayes slope model

data {
  int<lower=1> nstrata;
  int<lower=1> ncounts;
  int<lower=1> nyears;

  int<lower=0> count[ncounts];   // count observations
  int<lower=1> strat[ncounts];   // strata indicators
  int<lower=1> year[ncounts];    // year index
  int<lower=1> fixedyear;        // centering value for years

  int<lower=0> firstyr[ncounts]; // first-year-observer indicator

  int<lower=1> obser[ncounts];   // observer indicators
  int<lower=1> nobservers;
}

parameters {
  vector[ncounts] noise_raw;   // non-centered observation-level random effect to model over-dispersion
  // real lambda[ncounts];     // unused leftover: an unbounded parameter with no
  //                           // prior and no likelihood makes the posterior improper

  vector[nstrata] beta_p;      // non-centered random-effect slopes
  real BETA;                   // slope hyperparameter

  vector[nstrata] strata_p;    // non-centered random-effect intercepts
  real STRATA;                 // intercept hyperparameter

  real eta;                    // first-year observer intercept (observer startup effect)

  matrix[nstrata,nyears] yeareffect_raw; // random year-effects modeling departures from slope-based trajectory

  vector[nobservers] obs_raw;  // non-centered observer effects

  real<lower=0> sdnoise;       // sd of over-dispersion
  real<lower=0> sdobs;         // sd of observer effects
  real<lower=0> sdbeta;        // sd of slopes
  real<lower=0> sdstrata;      // sd of intercepts
  real<lower=0> sdyear[nstrata]; // sd of year effects, stratum specific
}

transformed parameters {
  vector[ncounts] E;           // log-scale additive likelihood
  vector[ncounts] noise;       // extra-Poisson log-normal variance
  vector[nstrata] beta;
  vector[nstrata] strata;
  vector[nobservers] obs;      // observer effects
  matrix[nstrata,nyears] yeareffect;

  obs = sdobs*obs_raw;
  noise = sdnoise*noise_raw;

  // covariate effect on intercepts and slopes
  beta = (sdbeta*beta_p) + BETA;
  strata = (sdstrata*strata_p) + STRATA;

  for(s in 1:nstrata){
    yeareffect[s,] = sdyear[s]*yeareffect_raw[s,];
  }

  for(i in 1:ncounts){
    E[i] = beta[strat[i]] * (year[i]-fixedyear) + strata[strat[i]] + yeareffect[strat[i],year[i]] + obs[obser[i]] + eta*firstyr[i] + noise[i];
  }
}

model {
  sdnoise ~ normal(0,1);   // prior on scale of extra-Poisson log-normal variance
  noise_raw ~ normal(0,1); // non-centered, normal-tailed extra-Poisson log-normal variance

  sdobs ~ normal(0,1);     // prior on sd of observer effects
  sdyear ~ normal(0,1);    // prior on sd of year effects, stratum specific
  obs_raw ~ normal(0,1);   // non-centered prior on observer effects

  for(s in 1:nstrata){
    yeareffect_raw[s,] ~ normal(0,1);
    sum(yeareffect_raw[s,]) ~ normal(0,0.001*nyears); // soft sum-to-zero constraint
  }

  count ~ poisson_log(E);  // vectorized count likelihood on the log scale

  BETA ~ normal(0,0.1);    // prior on fixed-effect mean slope - hyperparameter
  STRATA ~ normal(0,1);    // prior on fixed-effect mean intercept - hyperparameter
  eta ~ normal(0,1);       // prior on first-year observer effect

  sdstrata ~ normal(0,1);  // prior on sd of intercept variation
  sdbeta ~ normal(0,0.1);  // prior on sd of slope variation

  beta_p ~ normal(0,1);    // non-centered prior on stratum-level slopes
  strata_p ~ normal(0,1);  // non-centered prior on stratum-level intercepts

  // soft sum-to-zero constraints
  sum(strata_p) ~ normal(0,0.001*nstrata);
  sum(beta_p) ~ normal(0,0.001*nstrata);
}
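The non-centered parameterization above (e.g. `beta = (sdbeta*beta_p) + BETA`) can be sanity-checked numerically: scaling and shifting standard-normal draws gives the same distribution as sampling the group effects directly. A minimal Python sketch, with made-up values for BETA and sdbeta:

```python
import numpy as np

# Sampling beta_p ~ std_normal() and computing sdbeta*beta_p + BETA is
# distributionally identical to sampling beta ~ normal(BETA, sdbeta).
# BETA and sdbeta here are invented for illustration.
rng = np.random.default_rng(2021)
BETA, sdbeta, n = 0.02, 0.1, 200_000

beta_p = rng.standard_normal(n)   # the "raw" non-centered draws
beta = sdbeta * beta_p + BETA     # mirrors `beta = (sdbeta*beta_p) + BETA`

print(round(beta.mean(), 3), round(beta.std(), 3))
```

The point of writing it this way in Stan is that the sampler explores the standard-normal `beta_p` space, which avoids the funnel geometry that arises when sdbeta and the centered betas are sampled jointly.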

The Data

I haven’t included the data here (it’s large), but I have the example data and all code necessary to run the model at this GitHub repo. This repo also includes some background on the data, as well as some of the JAGS modeling that’s been done with these data.

In a nutshell, the data object is a list with the following components:

  • count - the counts of individual birds on each route and year (the example data here has ~4000 counts)

  • strat - the integer indicator for each of the 18 geographic strata included in this species’ range (some species in the BBS database cover ~ 150 strata)

  • year - the integer indicating which year of the time-series

  • firstyr - 1 if this route-year combination represents the observer's first year on this particular route

  • obser - integer indicating which observer conducted the survey

  • ncounts - number of observations

  • nstrata - number of geographic strata

  • nyears - number of years included

  • nobservers - number of unique observer-route combinations in the data

  • fixedyear - midpoint of the time-series, used to center the year-values

1 Like

Have you tried fitting your model to fake data and seeing if you can recover the true parameter values? This paper (https://arxiv.org/pdf/1804.06788.pdf) and Ch. 5 of this paper (https://arxiv.org/abs/2011.01808) give some guidance on solving computational problems in Bayesian workflows.
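A fake-data check along those lines can start from a deliberately simplified generator, for example a single-stratum version of the slope model with log-normal over-dispersion and invented "true" values (everything below is illustrative, not the BBS data):

```python
import numpy as np

# Simplified generative sketch in the spirit of the slope model:
# one stratum, no observer effects, log-normal over-dispersion.
rng = np.random.default_rng(123)
nyears, nroutes = 20, 50
BETA_true, STRATA_true, sdnoise_true = 0.03, 1.5, 0.2  # invented truths

year = np.tile(np.arange(1, nyears + 1), nroutes)
fixedyear = nyears // 2
log_mu = (STRATA_true
          + BETA_true * (year - fixedyear)
          + rng.normal(0.0, sdnoise_true, year.size))
count = rng.poisson(np.exp(log_mu))

print(count.size, count.min())
```

Feeding `count`, `year`, `fixedyear`, etc. to the Stan model and checking that the posteriors bracket the true BETA, STRATA, and sdnoise is the recovery test being suggested.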

Yes. Thanks for your suggestion and the links to those preprints.
I have done previous work with this model in its JAGS representations, and it recovers known parameter values from simulated data quite well. The model has been used to estimate population status for ~500 North American bird species by federal agencies in Canada and the US for the last 10 years, so in its JAGS applications it's reasonably well understood. This Stan version does seem to work (i.e., for a few example species it generates very similar estimates to the JAGS application).
I have a feeling that either I’m missing something in the way I’ve parameterized it so that Stan’s sampler is struggling, or that Stan is actually telling me that our work in JAGS has been ignorant of some underlying complexity in the posterior.

1 Like

If I remember correctly, many models that run without errors in JAGS get flagged with issues when moved over to Stan (which has better diagnostic tools). I think there are some threads around here about that, and some discussion on Andrew Gelman's blog.

Also, as @js592 said, you will need to generate some fake data and run the model in Stan to see if you can recover the parameters. Once that is done it's a bit easier to troubleshoot.

Yes, I have a feeling that might be part of what’s happening here. I’m hoping that the experienced members of this community might have some suggestions for an alternative parameterization that might help.

Yes, you’re right. Thanks @js592 and @Ara_Winter. I should re-run some of the simulated data in the Stan model that I’ve run through the JAGS model in the past. I will do that.

However, I strongly suspect it will recover the parameters well. The model does seem to work (it generates estimates that are essentially the same as the JAGS versions, and I've run simulated data through the JAGS models); it's just extremely slow.

1 Like

Are the normal(0,1) priors reasonable for the model?

1 Like

Some recent talk about setting priors

1 Like

It may be faster if you move those transformed parameters into the model block, as mentioned here recently:

I'm wondering if the sum-to-zero constraints were brought over from JAGS; I haven't seen them used this way in Stan before. Unless I just need to be enlightened on what you're doing there, you could try commenting those lines out and seeing what happens.

2 Likes

Good question.
So most of the priors are normal(0,1), largely because they're the raw components of a non-centered parameterization. The half-normal priors on the sd components should be pretty reasonable given the log link, since they imply that the most likely effects on bird abundance span at most about a single order of magnitude (unlikely to be larger). So, if I understand the "Hierarchical Models and the Non-Centered Parameterization" section of the link above, then yes, the priors should be reasonable.
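The order-of-magnitude intuition is easy to check: with a log link, a half-normal(0,1) prior on an sd puts roughly 95% of effects within exp(±2) of the baseline. A one-line sketch:

```python
import math

# With a log link, effects within +/- 2 sd (sd = 1) correspond to a
# multiplicative factor of at most exp(2) on the count scale.
factor = math.exp(2 * 1.0)
print(round(factor, 2))  # ~7.39x, i.e. under one order of magnitude
```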

1 Like

Thanks @cmcd. I'm going to try moving some of the transformed parameters into the model block. The discussion in the link you shared suggested that disk writes differ between blocks (transformed parameters are saved to the output; model-block local variables are not), and with this many data those writes may matter. I also noticed that post suggested using the std_normal() distribution instead of normal(0,1). I'll try that too.

The soft sum-to-zero constraint is pretty new for me too. I discovered it when working with an iCAR model in Stan. There's some discussion of different approaches to these constraints here:


and the modelling work where I first discovered them is here:

I have tried versions of this same model without the sum-to-zero constraints (as you suggested, they're easy to comment out). There's little difference in run-time, and convergence is generally a bit worse (particularly for the main slope and intercept hyperparameters).
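For intuition, the soft constraint `sum(x) ~ normal(0, 0.001*K)` simply adds a normal log-density term on the sum, so vectors whose elements cancel are strongly favored over those that don't. A small sketch of that penalty, with invented values:

```python
import math

# Log-density increment a soft sum-to-zero statement contributes:
# normal_lpdf(sum(x) | 0, scale * len(x)).
def soft_sum_penalty(x, scale=0.001):
    sd = scale * len(x)
    s = sum(x)
    return -0.5 * (s / sd) ** 2 - math.log(sd * math.sqrt(2 * math.pi))

balanced = [0.5, -0.3, -0.2]   # sums to 0.0: essentially no penalty
unbalanced = [0.5, 0.3, 0.2]   # sums to 1.0: enormous penalty at sd = 0.003
print(soft_sum_penalty(balanced) > soft_sum_penalty(unbalanced))  # True
```

The tiny scale (0.001 per element) is what makes the constraint "soft but tight": sums even slightly away from zero are heavily penalized without removing a parameter outright.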

Update: I’ve tried these two suggestions

and I’ve used the std_normal() distribution instead of the normal(0,1).
It seems to have improved things a bit:
The warmup seems to be going faster (mostly)…

get_elapsed_time(slope_stanfit)/3600 ## in hours

          warmup   sample
chain:1 5.227583 2.122542
chain:2 7.342417 2.112506
chain:3 5.522083 2.103778

Here’s the same result from the original parameterization:

          warmup   sample
chain:1 7.798528 2.105014
chain:2 8.093528 2.224678
chain:3 7.706306 2.151717

It's not the kind of speed I'm hoping for yet, so I'm going to keep trying things. I'm wondering if it's worth digging deeper into the error distribution on the counts. The sdnoise parameter is one of the least well sampled parameters; perhaps that's causing some misspecification.

So, here’s the model now:

// This is a Stan implementation of the bbsBayes slope model

data {
  int<lower=1> nstrata;
  int<lower=1> ncounts;
  int<lower=1> nyears;

  int<lower=0> count[ncounts];              // count observations
  int<lower=1> strat[ncounts];               // strata indicators
  int<lower=1> year[ncounts]; // year index
  int<lower=1> fixedyear; // centering value for years
  
  int<lower=0> firstyr[ncounts]; // first year index
  
  int<lower=1> obser[ncounts];              // observer indicators
  int<lower=1> nobservers;
  
}

parameters {
  vector[ncounts] noise_raw;             //non centered observation-level random effect to model over-dispersion
  // real lambda[ncounts];        // unused leftover: an unbounded parameter with no
  //                              // prior and no likelihood makes the posterior improper
  
  vector[nstrata] beta_p; //non centered random effect slopes
  real BETA; //Slope hyperparameter

  vector[nstrata] strata_p; //non centered random effect intercepts
  real STRATA; //intercept hyperparameter

  real eta; //first-year observer intercept (observer startup effect)
  
  matrix[nstrata,nyears] yeareffect_raw; //random year-effects modeling departures from slope-based trajectory

  vector[nobservers] obs_raw;    // non-centered observer effects

  real<lower=0> sdnoise;    // sd of over-dispersion
 //real<lower=1> nu;  //optional heavy-tail df for t-distribution
  real<lower=0> sdobs;    // sd of observer effects
  real<lower=0> sdbeta;    // sd of slopes 
  real<lower=0> sdstrata;    // sd of intercepts
  real<lower=0> sdyear[nstrata];    // sd of year effects

  
}


model {


  vector[ncounts] E;           // log_scale additive likelihood
  vector[ncounts] noise;           // extra-Poisson log-normal variance
  vector[nstrata] beta;
  vector[nstrata] strata;
  vector[nobservers] obs; //observer effects
  matrix[nstrata,nyears] yeareffect;




  sdnoise ~ std_normal(); //prior on scale of extra Poisson log-normal variance
  noise_raw ~ std_normal(); // non-centered, normal-tailed extra-Poisson log-normal variance
  
  sdobs ~ std_normal(); //prior on sd of observer-route effects
  sdyear ~ std_normal(); // prior on sd of yeareffects - stratum specific
  obs_raw ~ std_normal(); //non centered prior on observer effects
  
  
  //nu ~ gamma(2,0.1); // alternate prior on df for t-distribution of heavy tailed 
 for(s in 1:nstrata){

  yeareffect_raw[s,] ~ std_normal();
  sum(yeareffect_raw[s,]) ~ normal(0,0.001*nyears);// soft sum to zero constraint
  
 }
 
  BETA ~ normal(0,0.1);// prior on fixed effect mean slope - hyperparameter
  STRATA ~ std_normal();// prior on fixed effect mean intercept - hyperparameter
  eta ~ std_normal();// prior on first-year observer effect
  
  
  sdstrata ~ std_normal(); //prior on sd of intercept variation
  sdbeta ~ normal(0,0.1); //prior on sd of slope variation

  beta_p ~ std_normal(); //non centered prior on stratum-level slopes
  strata_p ~ std_normal(); //non centered prior on stratum-level intercepts

  //sum to zero constraints
  sum(strata_p) ~ normal(0,0.001*nstrata);
  sum(beta_p) ~ normal(0,0.001*nstrata);
  
  
  obs = sdobs*obs_raw;
  noise = sdnoise*noise_raw;
      

// covariate effect on intercepts and slopes
  beta = (sdbeta*beta_p) + BETA;
  strata = (sdstrata*strata_p) + STRATA;
  
   for(s in 1:nstrata){
     yeareffect[s,] = sdyear[s]*yeareffect_raw[s,];
   }
  

  for(i in 1:ncounts){
    E[i] =  beta[strat[i]] * (year[i]-fixedyear) + strata[strat[i]] + yeareffect[strat[i],year[i]] + obs[obser[i]] + eta*firstyr[i] + noise[i];
  }
  
    count ~ poisson_log(E); //vectorized count likelihood with log-transformation
 
}
1 Like