Divergent transition every iteration--multi-scale occupancy model

Hello,

Tried to use the ecology tag (suggested here…https://stanecology.github.io/), but can’t find it. Anywhos:

I’m attempting to fit something like a multi-scale occupancy model (sensu Nichols et al. 2008). As the topic states, the problem I’ve faced is that that every iteration ends up with a divergent transition, and I’m trying to figure out whether this is fixable or not. I suspect the problem relates to a coding error/mis-specification of the likelihood, although it could be that the data in hand don’t support the model. At least, I have not found that increasing adapt.delta to 0.95 and upping the length of the warm-up to 3500 makes any difference.

Briefly, the complete data likelihood is like:

z_i~Bernoulli(Psi), where z_i denotes whether a site is occupied by some species
a_i_t~Bernoulli(theta), where a_i_t denotes whether the species is ‘available’ to be detected (e.g., it is in site i during time period t given that the site is ever occupied)
y_i_t_k~Bernoulli(z_ia_i_tp), where y_i_t_k is observed presence (with probability p conditional on occupancy and availability) or absence at one of multiple points k contained within site i at time t. The replicated observation points k are required for parameters p and theta to be identifiable.

The likelihood I’m employing in the stan program looks like so:

for (o in 1:nobs) {
 //data is in long format, with a corresponding vector denoting site, time, and k indexing
      if (obstate[o]==1) { 
          //known occupancy, known availability--i.e., species observed at site i at time t at at least 1 k
          1~bernoulli(psi[site_idx[o]]);
          1~bernoulli(theta[site_idx[o], Day[o]]);
           y[o]~bernoulli(p[k_idx[o]]);
      }
      else if (obstate[o]==2){
           //known occupancy, because observed at site i at least one time.
           //unknown availability. Could have been unavailable, or available and not detected.
           1~bernoulli(psi[site_idx[o]]);
           target +=log_sum_exp(bernoulli_lpmf(1|theta[site_idx[o], Day[o]])+
           bernoulli_lpmf(0 |p[k_idx[o]]),
           bernoulli_lpmf(0|theta[site_idx[o], Day[o]]));
      }
      else {
        //unknown occupancy and availability--never seen at site i.
        //prob not occupied, prob occupied but not available, prob occupied & available but not detected 
        vector[3] lp;
        lp[1]=bernoulli_lpmf(0|psi[site_idx[o]]);
        lp[2]=bernoulli_lpmf(1|psi[site_idx[o]])+bernoulli_lpmf(0|theta[site_idx[o], Day[o]]);
        lp[3]=bernoulli_lpmf(1|psi[site_idx[o]])+bernoulli_lpmf(1|theta[site_idx[o], Day[o]])+
               bernoulli_lpmf(0|p[k_idx[o]]);
        target+=log_sum_exp(lp);  
     }

} //end o loop

I’ve looked at this a lot, and have not seen anything obviously incorrect (could be more efficient), but maybe there’s an error in there.

The other thing that I thought might be an issue relates to some of the random effects structures employed. I think I’ve coded an uncentered model correctly, but perhaps I’m off? e.g.:

     transformed parameters {
      //other junk....
       vector[365] a0_theta;
       a0_theta[1]=a0_theta_raw[1];  
           for (t in 2:365){
               a0_theta[t]=a0_theta[t-1] + a0_theta_raw[t]*var_a0_theta;     
            }
       }//end transformed

     model{
      
var_a0_theta ~ normal (0, 2.5); //<lower=0>
      a0_theta_raw[1] ~ normal (0, 2.5); 
         for (t in 2:365){
              a0_theta_raw[t] ~ std_normal(); 
        }
      
      }

****
When including Stan code in your post it really helps if you make it as readable as possible by using Stan code chunks (```stan) with clear spacing and indentation. For example, use
```stan
model {
  vector[N] mu = alpha + beta * x;
  y ~ normal(mu, sigma); 
} 

instead of

model{
vector[N] mu = alpha+beta*x;
y~normal(mu,sigma);
}


To include mathematical notation in your post put LaTeX syntax between two $ symbols, e.g.,
p(\theta | y) \propto p(\theta) p(y | \theta).

Sorry–hit the space bar and this went off. I can’t figure out how to delete?

Anyway, the other issue relates to the random effect structure: I think I have the uncentered structure specified correctly for a random walk sort of prior, and it seems to mix a little better/more quickly than a centered structure, but maybe the issue is here?

parameters{
   vector[365] a0_theta_raw;
   real<lower=0> var_a0_theta;
 //other junk
}


transformed parameters {
      //other junk....
       vector[365] a0_theta;
       a0_theta[1]=a0_theta_raw[1];  
           for (t in 2:365){
               a0_theta[t]=a0_theta[t-1] + a0_theta_raw[t]*var_a0_theta;     
            }
       }//end transformed

model{
     //other junk 
      var_a0_theta ~ normal (0, 2.5); //<lower=0>
      a0_theta_raw[1] ~ normal (0, 2.5); 
         for (t in 2:365){
              a0_theta_raw[t] ~ std_normal(); 
        }
      }

Exploration of the pairs plots has not turned up any obvious confounding geometry (yet), although I’m continuing to look closer at it.

At any rate, any feedback regarding things to look for would be greatly appreciated. Obviously, every iteration featuring a divergent transition is, ah, problematic–I’ve just never encountered something like this. Can provide full code if that would be more helpful?

Thanks

Hi! Can you share a subset of the data? And when you plot your data was does it look like? I use the geom_tile option in ggplot in R to plot out this kind of data. Have you run one of the simpler occupancy models in Stan to see if those work?

Hi, thanks for the response. I can pull some data together, although I think it’s in a different form than one might expect due to ragged effort–rather than, say, a site x time matrix, a list of observations with nested indexing to reference site, subsite (k), and time. I guess I’m not quite sure what format you’re looking for?

Simpler models fit fine in the sense that the posterior median nearly mirrors MLE’s (although often, there is an issue with divergent transitions for these, as well). Actually, point estimates here seem reasonable as well, although a truncated version of the pairs plot indicates some multi-modality (below)? I dunno, I feel like this is an issue either with weak identifiability due to the data (most sites do not have replicate subsites k) or the way the model is parameterized?

Note that the pairs above are from a very short chain–although some of the multimodality seems to persist in longer chains.

1 Like

Hi @jcwi!

If I’m reading Nichols et al. 2008 correctly, the likelihood should be a product (sum in log space) over sites, rather than over observations as in the implementation you provided.

I put together an example R simulation and Stan script that show how to compute the likelihood. There’s a bit of extra complexity required to deal with the number possible combinations of \alpha that can result in non-detection across all t and k for any particular i (i.e., it takes some extra work to compute the n_possible terms in the sum for the equation of Pr(000 000) in the original paper).

Here’s a gist with the R and Stan files: https://gist.github.com/mbjoseph/6ad0fa57a4de987399c8324b7a6847db

I haven’t optimized the Stan code (e.g., re-using computations of log(psi)) to keep things simple, but here’s the likelihood: https://gist.github.com/mbjoseph/6ad0fa57a4de987399c8324b7a6847db#file-multiscale-occupancy-stan-L44-L78

4 Likes

Thanks for the feedback and link!

Ah,I think I see (?). The issue is that the psi part of the likelihood (e.g., 1~psi), although constant for each subsite/occassion within a site, is being summed over each observation rather than over the number of sites. (similarly, 1~theta is being summed over each k in i and t, rather than each i,t).

I’ll have to figure out a good way to structure the input data given its extremely ragged nature and the variability of interest (for theta, across all i and t).

1 Like

Thanks for putting that together @mbjoseph! I’ve used this model a good bit for eDNA-related work, which was a while back and I was primarily using JAGS at the time. I’ve been wanting to implement it in Stan but haven’t gotten around to it. So, again, just wanted to say thanks!

Perhaps for posterity, here is a simple bugs likelihood, which I followed from Kery and Royle’s AHM (2016) book and an MEE article by Schmidt et al. 2013, in case that might help anyone with that background (probably many of us ecology folk) compare to what you’ve done for the discrete parameters in the Stan model:

for ( i in 1:n.sites ) {
 z[ i ] ~ dbern( psi[ i ] )
  logit( psi[ i ] ) <- a0_psi 

 for ( j in 1:n.dates ) {
  a[ i, j ] ~ dbern( mu_a[ i, j ] )
   mu_a[ i , j ] <- z[ i ] * theta[ i , j ]
     logit( theta[ i, j ] ) <- a0_theta

  for ( k in 1:n.reps ) {
   y[ i, j, k ] ~ dbern( mu_y[ i, j, k ] )
    mu_y[ i, j, k ] <- a[ i, j ] * p[ i, j, k ]
     logit( p[ i, j, k] ) <- a0_p  
     }
   }
 }
1 Like

Just to follow up here,

I think that maybe

for (jj in 1:n_possible) {
          for (j in 1:ntime) {
            if (alpha_potential[jj, j] == 0) {
              // not available
              tmp_poss[jj, j] = log1m(theta);
            } else {
              // available but not detected
              tmp_poss[jj, j] = log(theta) + bernoulli_lpmf(y[i, j, ] | p);
            }
          }
          sum_poss[jj] = log(psi) + sum(tmp_poss[jj, ]);
        }
        sum_poss[n_possible + 1] = log1m(psi);
        log_lik[i] = log_sum_exp(sum_poss);

can be simplified? At least, I’m getting very similar results with:

data {
  int<lower = 1> nsite;
  int<lower = 1> ntime;
  int<lower = 1> nrep;
  
  int<lower = 0, upper = 1> y[nsite, ntime, nrep];
  
  //int<lower = 1> n_possible;
  //matrix<lower = 0, upper = 1>[n_possible, ntime] alpha_potential;
}

transformed data {
  int<lower = 0, upper = 1> known_present[nsite];
  int<lower = 0, upper = 1> known_available[nsite, ntime];
  
  for (i in 1:nsite) {
    known_present[i] = 0;
    for (j in 1:ntime) {
      known_available[i, j] = 0;
      for (k in 1:nrep) {
        if (y[i, j, k] == 1) {
          known_present[i] = 1;
          known_available[i, j] = 1;
        }
      }
    }
  }
}

parameters {
  real<lower = 0, upper = 1> psi;
  real<lower = 0, upper = 1> theta;
  real<lower = 0, upper = 1> p;
}

transformed parameters {
  vector[nsite] log_lik;
  
  {
    vector[ntime] tmp_lp;
    matrix[2, ntime] tmp_poss;
    vector[3] sum_poss;
    
    for (i in 1:nsite) {
      if (known_present[i]) {
        for (j in 1:ntime) {
          if (known_available[i, j]) {
            // present and available
            tmp_lp[j] = log(theta) + bernoulli_lpmf(y[i, j, ] | p);
          } else {
            // present, possibly unavailable
            tmp_lp[j] = log_sum_exp(
              log(theta) + bernoulli_lpmf(y[i, j, ] | p), 
              log1m(theta)
            );
          }
        }
        log_lik[i] = log(psi) + sum(tmp_lp);
      } else {
        for (jj in 1:2) {
          for (j in 1:ntime) {
            //if (alpha_potential[jj, j] == 0) {
              // not available
              tmp_poss[1, j] = log1m(theta);
              tmp_poss[2, j] = log(theta) + bernoulli_lpmf(y[i, j, ] | p);
            }
          sum_poss[jj] = log(psi) + sum(tmp_poss[jj, ]);
          }
          
        
        sum_poss[3] = log1m(psi);
        log_lik[i] = log_sum_exp(sum_poss);
        }
      }
    }
  }


model {
  target += sum(log_lik);
}

Maybe I don’t quite follow the need to consider all the combinations of alpha? Either way, this has been very helpful.

This is probably my fault for not explaining this. To recap, for a particular site i (I’m going to drop this subscript, but describe a capture history for one site), we have dates j=1, ..., J, and on each date, we make detection/non-detection observations on sampling occasions k=1, ..., K.

The model for true occurrence z is:

z \sim \text{Bernoulli}(\psi),

availability for detection (conditional on occurrence) on dates j=1, ..., J is:

\alpha_{j} \sim \text{Bernoulli}(\theta_{j}),

and the detection/non-detection observation model for occasions k=1, ..., K is:

y_{j, k} \sim \text{Bernoulli}(z \alpha_{j} p_{j, k}).

In the case of an all zero encounter history there are 2^J + 1 possible combinations of z and \alpha_{j} to account for. In the original paper, they provide an example where J =2 and K=3, representing the capture history as 000 \; 000 (two days with three observed zeros each), and write the probability of the encounter history as \text{Pr}(000 \; 000)=

(1 - \psi) \\ + \psi (1 - \theta_{1}) (1 - \theta_{2}) \\+ \psi \theta_{1} \prod_{k=1}^K (1 - p_{1, k}) (1 - \theta_{2}) \\+ \psi (1 - \theta_{1}) \theta_{2} \prod_{k=1}^K (1 - p_{2, k}) \\+ \psi \theta_{1} \theta_{2} \prod_{k=1}^K (1 - p_{1, k}) (1 - p_{2, k}).

Note that there are 2^J + 1 terms being summed here to represent the different ways that an all-zero encounter history could arise, listed in the same order that they appear in the equation above:

  • Site was not occupied
  • Site was occupied, but the critter was never available for detection
  • Site was occupied, critter available but not detected on j=1, and not available on j=2
  • Site was occupied, critter not available on j=1, available not detected j=2
  • Site was occupied, critters available on all j, but not detected

In general, for some value of J, there are 2^J combinations of \alpha_{j} that could lead to an all-zero encounter history if the site is occupied, and one more additional possibility that the site is not occupied and the values of \alpha are irrelevant (hence 2^J + 1 possibilities).

For example, if J=2, you can think about a table of potential values of \alpha_j that could lead to an all-zero encounter history, where you put the values \alpha_1 in the first column, and \alpha_2 in the second column.

In the R script that I posted, I compute a table of these 2^J possibilities with:

expand.grid(rep(list(c(0, 1)), 2))
#>   Var1 Var2
#> 1    0    0
#> 2    1    0
#> 3    0    1
#> 4    1    1

Created on 2020-03-20 by the reprex package (v0.3.0)

If J=3, then there are 2^J=8 possibilities:

expand.grid(rep(list(c(0, 1)), 3))
#>   Var1 Var2 Var3
#> 1    0    0    0
#> 2    1    0    0
#> 3    0    1    0
#> 4    1    1    0
#> 5    0    0    1
#> 6    1    0    1
#> 7    0    1    1
#> 8    1    1    1

Created on 2020-03-20 by the reprex package (v0.3.0)

In my implementation, the number of rows in this table is n_possible, and that’s why the model loops over n_possible terms to compute the likelihood for all zero encounter histories.

Hopefully that helps, but feel free to let me know if I can clarify anything.

1 Like

Thanks again–this is a great explanation for the code provided and the likelihood described by Nichols et al!

I guess I should have clarified that my earlier comment probably relates more to what I think is kind of a gentle but inefficient description of the likelihood provided by Nichols et al. The n_possible term they describe actually becomes intractable pretty quickly (e.g., with J = 20, there’s a summation over more than 1 million combinations, with J=30, more than a billion combinations, j=40 gets into the trillions and beyond). Unless I’m doing the math/factorization incorrectly (quite possible, :) ), I think the formulation below works out to essentially the same thing for the sites where the organism in question was never observed.

//earlier i loop over sites...
for (jj in 1:2) {
          for (j in 1:ntime) {
              tmp_poss[1, j] = log1m(theta);
              tmp_poss[2, j] = log(theta) + bernoulli_lpmf(y[i, j, ] | p);
            }
          sum_poss[jj] = log(psi) + sum(tmp_poss[jj, ]);
          }
                 
   sum_poss[3] = log1m(psi);
 log_lik[i] = log_sum_exp(sum_poss);
 }//end i loop

Awesome - thanks @jcwi for clarifying! I see that the implementation above accounts for three of the 2^J + 1 possibilities, but I don’t see how it is equivalent to the likelihood described in the Nichols model.

For example, we can compare the probability of the capture history 000 \; 000 with J=2 and K=3 from the above implementation to the capture history probability in the original paper (also provided in my last post).

For an all-zero capture history, the likelihood in the Stan implementation you’ve provided evaluates to:

\psi \prod_{j=1}^J (1 - \theta) + \\ \psi \prod_{j=1}^J \big( \theta \prod_{k=1}^K (1 - p) \big) + \\ 1 - \psi,

summing over three possibilities, regardless of J:

  1. present, never available (sum_poss[1])
  2. present, always available but never detected (sum_poss[2])
  3. absent (sum_poss[3])

Unless I’m missing something, it seems like this misses cases with mixed availability (e.g., available on one day, but not another), and is not in general equivalent to the Nichols et al. model, except in the case where J=1.

True - scaling of n_possible is a challenge for large J. I’m not sure if there are any computational tricks out there that people have used when marginalizing over the discrete parameters in these multiscale occupancy models.

Shoot. Yep, you’re totally correct. Thanks for the thorough explanation! I’ll have to poke around regarding likelihood efficiency.

1 Like

In cases with large J, I wonder whether it would be worthwhile to re-frame a multi-scale occupancy model as a multistate hidden Markov model. That way, you could use the forward algorithm to compute the capture history probabilities, instead of enumerating over all possible combinations of the latent variables.

The Markov assumption would be that availability in time j depends on availability in the previous time j-1.

1 Like

That’s a good idea–and actually, probably closer to the implementation I was planning to work towards, anyway. Small number of states and transitions, so probably a gentle way to introduce oneself to marginal likelihood HMM’s.

1 Like

Cool, let me know if you want any help with it! It seems like this thread is starting to “diverge” a bit from the original question - do you think it makes sense to mark it as solved?