Identifiability across levels in occupancy model

Background

I am building occupancy models with Stan, following the occupancy example. For those not familiar with occupancy models, they are a Bernoulli/Bernoulli or Bernoulli/Binomial model that models the probability a site is “occupied” as well as a the probability of detection. The simplest occupancy model estimates the probability a site is occupied (\psi) and the probability of detection within a site (p). My end goal is to build up models that include coefficients at both levels. As a first step, I am using different data input options. Also, I have changed the tutorial example to estimate parameters on the logit scale (mu) to increase stability and allow for the eventual addition of coefficients.

When I try to change the data from “wide” format to “long” format, the model can no longer distinguish between the two levels.

Example model and results

The example uses “wide” format data with columns of visits and rows of sites:

> head(y)
     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,]    0    0    0    0    0    0    0    0    0     0
[2,]    0    0    0    0    0    0    0    0    0     0
[3,]    0    0    0    0    0    0    0    0    0     0
[4,]    1    1    1    1    1    1    0    1    0     0
[5,]    0    0    0    0    0    0    0    0    0     0
[6,]    0    0    0    0    0    0    0    0    0     0

This has the following likelihood code

  // likelihood
  for (r in 1:R) {
    if (sum(y[r]) > 0)
      target +=
	/* log_psi1 + bernoulli_lpmf(y[r] | p); */
	log_muPsi + bernoulli_logit_lpmf(y[r] | muP);
    else
      target +=
	/* log_sum_exp(log_psi1 + bernoulli_lpmf(y[r] | p), log1m_psi1); */
	log_sum_exp(log_muPsi +
		    bernoulli_logit_lpmf(y[r] | muP), log1m_muPsi);
  }

The outputs from this model look good:

> Inference for Stan model: occupancyMu.
4 chains, each with iter=2000; warmup=1000; thin=1; 
post-warmup draws per chain=1000, total post-warmup draws=4000.

         mean se_mean   sd    2.5%     25%     50%     75%   97.5% n_eff Rhat
muPsi   -0.58    0.00 0.13   -0.85   -0.67   -0.58   -0.49   -0.33  3159    1
muP      0.50    0.00 0.07    0.36    0.45    0.50    0.55    0.63  3403    1
p        0.62    0.00 0.02    0.59    0.61    0.62    0.63    0.65  3405    1
psi      0.36    0.00 0.03    0.30    0.34    0.36    0.38    0.42  3172    1
lp__  -761.09    0.02 1.01 -763.73 -761.47 -760.78 -760.38 -760.11  2043    1

Binomial model

Next, I changed the model to take data as a binomial sum rather than in “wide” format. This allows for a different number of visits per site. This model produces similar results as the above model.

Example data:

> head(ySum)
[1] 0 7 0 0 0 5
> head(k)
[1] 10 10 10 10 10 10

Likelihood

  // likelihood
  for (r in 1:R) {
    if (y[r] > 0)
      target +=
	log_muPsi + binomial_logit_lpmf(y[r] | k[r],  muP);
    else
      target +=
	log_sum_exp(log_muPsi +
		    binomial_logit_lpmf(y[r] | k[r], muP), log1m_muPsi);
  }

And similar outputs to the first model.

Inference for Stan model: occupancyMuBinomial.
4 chains, each with iter=2000; warmup=1000; thin=1; 
post-warmup draws per chain=1000, total post-warmup draws=4000.

         mean se_mean   sd    2.5%     25%     50%     75%   97.5% n_eff Rhat
muPsi   -0.58    0.00 0.13   -0.84   -0.66   -0.57   -0.49   -0.31  3324    1
muP      0.50    0.00 0.07    0.36    0.45    0.50    0.54    0.63  3147    1
p        0.62    0.00 0.02    0.59    0.61    0.62    0.63    0.65  3150    1
psi      0.36    0.00 0.03    0.30    0.34    0.36    0.38    0.42  3324    1
lp__  -322.09    0.03 1.02 -324.91 -322.47 -321.77 -321.36 -321.10  1634    1

Samples were drawn using NUTS(diag_e) at Thu Aug 23 08:12:23 2018.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).

Model with Problem: Bernoulli model with long form data

Last, I tried to put the data into “long” form where each site visit is a row.
The data now has a site-sample detection (0/1) column y, a site-level detection zObs, as well as a site and visit identification column. There is also a “known” z column that was generated during data simulation.

> yDTlong
      site z visit y zObs
   1:    1 0    V1 0    0
   2:    2 1    V1 0    1
   3:    3 0    V1 0    0
   4:    4 0    V1 0    0
   5:    5 0    V1 0    0

The likelihood code looks like:

  // likelihood
  for (r in 1:R) {
    if (z[r] > 0)
      target +=
	log_muPsi + bernoulli_logit_lpmf(y[r] |  muP );
    else
      target +=
	log_sum_exp(log_muPsi +
		    bernoulli_logit_lpmf(y[r] |  muP),
		    log1m_muPsi);
  }

The model output cannot tell the difference between p and \psi:

> 
Inference for Stan model: occupancyMuLong.
4 chains, each with iter=2000; warmup=1000; thin=1; 
post-warmup draws per chain=1000, total post-warmup draws=4000.

               mean      se_mean           sd          2.5%           25%
muPsi  9.481725e+18 5.447956e+18 1.335993e+19  1.715213e+17  1.925087e+18
muP   -1.240000e+00 0.000000e+00 5.000000e-02 -1.340000e+00 -1.270000e+00
p      2.200000e-01 0.000000e+00 1.000000e-02  2.100000e-01  2.200000e-01
psi    1.000000e+00 0.000000e+00 0.000000e+00  1.000000e+00  1.000000e+00
lp__  -1.330300e+03 2.000000e-02 6.900000e-01 -1.332210e+03 -1.330470e+03
                50%           75%         97.5% n_eff Rhat
muPsi  5.454296e+18  1.126274e+19  5.649307e+19     6 1.47
muP   -1.240000e+00 -1.210000e+00 -1.150000e+00  3637 1.00
p      2.200000e-01  2.300000e-01  2.400000e-01  3643 1.00
psi    1.000000e+00  1.000000e+00  1.000000e+00  4000  NaN
lp__  -1.330020e+03 -1.329860e+03 -1.329810e+03  2040 1.00

Samples were drawn using NUTS(diag_e) at Thu Aug 23 08:33:35 2018.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).

Questions/Problems

I have discussed this problem with a statistician collaborator.
He thinks we have an identifiability problem because the model cannot tell the two levels apart using this formulation.
His observations seem reasonable to me.
Here are my questions:

  • Did I write my likelihood statement correctly?
  • Any suggestions for this identifiability problem?

Note on Occupancy models examples

My short term goals is to write a tutorial on occupancy models in Stan and my longer term goal is to wrap the tutorials up into a package. If anyone else is interested in helping, please let me know.

Hi @Richard_Erickson - looks like your likelihood is a bit off for the Bernoulli observation model. To see why, let’s consider the likelihood for a site where at least one detection was made (we know the true occupancy state is z = 1). If we have k surveys, then the likelihood for the observations at this site are:

[y \mid \psi, p] = \psi \times \text{Binomial}(y \mid k, p),

which on the log scale turns into

\log(\psi) + \log(\text{Binomial}(y \mid k, p)).

And that part shows up in your Stan code for the binomial model here:

Now to unpack the binomial likelihood into bernoulli observations, we can rewrite the above likelihood (again assuming we know z=1) as:

[y \mid \psi, p] = \psi \times \prod_{i = 1}^k \text{Bernoulli}(y_i \mid p_i),

where y_i is either zero or one, and p_i is the probability of detection given that z = 1 on survey i. Taking a log:

\log(\psi) + \sum_{i = 1}^k \log(\text{Bernoulli}(y_i \mid p_i)).

But, in the Bernoulli model that you provided there is no summation over repeat surveys at a site. If I’m following your code correctly, the summation is happening over surveys, which means that the log_muPsi term is going to be added to the likelihood too many times:

One fix for this is to use indexing with vectorization to feed a vector of y and p values to bernoulli_logit_lpmf, looping over sites instead of looping over surveys. This way, bernoulli_logit_lpmf will automatically sum the survey-level probabilities for you, but it does take some extra work to generate appropriate indices to slice the detection/nondetection data. For a complete example with simulated data check out this gist: Bernoulli observation occupancy model in Stan · GitHub

The likelihood is

for (i in 1:n_site) {
    if (n_survey[i] > 0) {
      if (any_seen[i]) {
        // site is occupied
        target += log_psi[i] 
                  + bernoulli_logit_lpmf(y[start_idx[i]:end_idx[i]] | 
                                         logit_p[start_idx[i]:end_idx[i]]);
      } else {
        // site may or may not be occupied
        target += log_sum_exp(
          log_psi[i] + bernoulli_logit_lpmf(y[start_idx[i]:end_idx[i]] |
                                            logit_p[start_idx[i]:end_idx[i]]), 
          log1m_psi[i]
        );
      }
    }
}

There may be a more efficient way to get these slices. It’s possibly not ideal to construct the same index vector twice to extract slices of y and logit_p.

Hi @mbjoseph thank you for your assistance. I’ve been looking over your code and I think it does exactly what I need.

Also, I looked through your Github Profile and you’re doing interesting work. I especially liked this Ecology article. I had wondered how to include occupancy modeling with SEMs.

Glad that the code is helpful @Richard_Erickson! Also, happy to hear that paper was of interest - the project was a lot of fun.