Background
I am building occupancy models with Stan, following the occupancy example. For those not familiar with occupancy models, they are Bernoulli/Bernoulli or Bernoulli/Binomial models that estimate both the probability a site is "occupied" and the probability of detection. The simplest occupancy model estimates the probability a site is occupied (\psi) and the probability of detection within a site (p). My end goal is to build up models that include coefficients at both levels. As a first step, I am trying different ways of passing in the data. I have also changed the tutorial example to estimate the parameters on the logit scale (muPsi and muP) to increase stability and allow for the eventual addition of coefficients.
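For readers who want the math: with \psi and p the inverse-logit of muPsi and muP, and J visits per site, the marginalized likelihood these models implement is

$$
\Pr(y_r \mid \psi, p) =
\begin{cases}
\psi \prod_{j=1}^{J} \text{Bernoulli}(y_{rj} \mid p), & \text{if } \sum_j y_{rj} > 0,\\[4pt]
\psi \prod_{j=1}^{J} \text{Bernoulli}(0 \mid p) + (1 - \psi), & \text{if } \sum_j y_{rj} = 0.
\end{cases}
$$

The latent occupancy state is marginalized out once per site: a detection history with at least one 1 can only come from an occupied site, while an all-zero history can come either from an occupied site with no detections or from an unoccupied site.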
When I try to change the data from “wide” format to “long” format, the model can no longer distinguish between the two levels.
Example model and results
The example uses “wide” format data with columns of visits and rows of sites:
> head(y)
     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,]    0    0    0    0    0    0    0    0    0     0
[2,]    0    0    0    0    0    0    0    0    0     0
[3,]    0    0    0    0    0    0    0    0    0     0
[4,]    1    1    1    1    1    1    0    1    0     0
[5,]    0    0    0    0    0    0    0    0    0     0
[6,]    0    0    0    0    0    0    0    0    0     0
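For context, data in this shape can be simulated along the following lines (a minimal R sketch; the constants are placeholders, not the exact settings I used):

set.seed(123)
R <- 250                                    # number of sites
J <- 10                                     # visits per site
psi <- 0.35                                 # occupancy probability
p <- 0.62                                   # detection probability given occupancy
z <- rbinom(R, 1, psi)                      # latent occupancy state per site
y <- matrix(rbinom(R * J, 1, z * p), R, J)  # detections are only possible where z = 1
head(y)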
This model uses the following likelihood code:
// likelihood
for (r in 1:R) {
  if (sum(y[r]) > 0)
    target +=
      /* log_psi1 + bernoulli_lpmf(y[r] | p); */
      log_muPsi + bernoulli_logit_lpmf(y[r] | muP);
  else
    target +=
      /* log_sum_exp(log_psi1 + bernoulli_lpmf(y[r] | p), log1m_psi1); */
      log_sum_exp(log_muPsi + bernoulli_logit_lpmf(y[r] | muP),
                  log1m_muPsi);
}
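For context, the loop relies on quantities defined earlier in the model, roughly like this (a simplified sketch, not my full Stan file):

parameters {
  real muPsi;   // occupancy probability on the logit scale
  real muP;     // detection probability on the logit scale
}
transformed parameters {
  real log_muPsi = log_inv_logit(muPsi);      // log(psi)
  real log1m_muPsi = log1m_inv_logit(muPsi);  // log(1 - psi)
}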
The outputs from this model look good:
> Inference for Stan model: occupancyMu.
4 chains, each with iter=2000; warmup=1000; thin=1;
post-warmup draws per chain=1000, total post-warmup draws=4000.
         mean se_mean   sd    2.5%     25%     50%     75%   97.5% n_eff Rhat
muPsi   -0.58    0.00 0.13   -0.85   -0.67   -0.58   -0.49   -0.33  3159    1
muP      0.50    0.00 0.07    0.36    0.45    0.50    0.55    0.63  3403    1
p        0.62    0.00 0.02    0.59    0.61    0.62    0.63    0.65  3405    1
psi      0.36    0.00 0.03    0.30    0.34    0.36    0.38    0.42  3172    1
lp__  -761.09    0.02 1.01 -763.73 -761.47 -760.78 -760.38 -760.11  2043    1
Binomial model
Next, I changed the model to take the data as a binomial sum rather than in "wide" format. This allows for a different number of visits per site. This model produces similar results to the model above.
Example data:
> head(ySum)
[1] 0 7 0 0 0 5
> head(k)
[1] 10 10 10 10 10 10
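Both come straight from the wide matrix; something like:

ySum <- rowSums(y)          # detections per site
k <- rep(ncol(y), nrow(y))  # visits per site (all 10 in this simulation)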
Likelihood
// likelihood
for (r in 1:R) {
  if (y[r] > 0)
    target +=
      log_muPsi + binomial_logit_lpmf(y[r] | k[r], muP);
  else
    target +=
      log_sum_exp(log_muPsi + binomial_logit_lpmf(y[r] | k[r], muP),
                  log1m_muPsi);
}
The outputs are similar to those from the first model:
Inference for Stan model: occupancyMuBinomial.
4 chains, each with iter=2000; warmup=1000; thin=1;
post-warmup draws per chain=1000, total post-warmup draws=4000.
         mean se_mean   sd    2.5%     25%     50%     75%   97.5% n_eff Rhat
muPsi   -0.58    0.00 0.13   -0.84   -0.66   -0.57   -0.49   -0.31  3324    1
muP      0.50    0.00 0.07    0.36    0.45    0.50    0.54    0.63  3147    1
p        0.62    0.00 0.02    0.59    0.61    0.62    0.63    0.65  3150    1
psi      0.36    0.00 0.03    0.30    0.34    0.36    0.38    0.42  3324    1
lp__  -322.09    0.03 1.02 -324.91 -322.47 -321.77 -321.36 -321.10  1634    1
Samples were drawn using NUTS(diag_e) at Thu Aug 23 08:12:23 2018.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).
Model with problem: Bernoulli model with long-form data
Last, I tried to put the data into "long" form, where each site visit is a row. The data now has a site-visit detection (0/1) column y, a site-level detection column zObs, as well as site and visit identification columns. There is also a "known" z column that was generated during data simulation.
> yDTlong
   site z visit y zObs
1:    1 0    V1 0    0
2:    2 1    V1 0    1
3:    3 0    V1 0    0
4:    4 0    V1 0    0
5:    5 0    V1 0    0
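The reshape itself can be done with data.table along these lines (a sketch, assuming the wide matrix y and the simulated z from above):

library(data.table)
yDTlong <- data.table(site = 1:nrow(y), z = z, y)       # the matrix becomes columns V1..V10
yDTlong <- melt(yDTlong, id.vars = c("site", "z"),
                variable.name = "visit", value.name = "y")
yDTlong[, zObs := as.integer(any(y == 1)), by = site]   # 1 if the site was ever detected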
The likelihood code looks like:
// likelihood
for (r in 1:R) {
  if (z[r] > 0)
    target +=
      log_muPsi + bernoulli_logit_lpmf(y[r] | muP);
  else
    target +=
      log_sum_exp(log_muPsi + bernoulli_logit_lpmf(y[r] | muP),
                  log1m_muPsi);
}
The model output cannot tell the difference between p and \psi:
> Inference for Stan model: occupancyMuLong.
4 chains, each with iter=2000; warmup=1000; thin=1;
post-warmup draws per chain=1000, total post-warmup draws=4000.
               mean      se_mean           sd          2.5%           25%
muPsi  9.481725e+18 5.447956e+18 1.335993e+19  1.715213e+17  1.925087e+18
muP   -1.240000e+00 0.000000e+00 5.000000e-02 -1.340000e+00 -1.270000e+00
p      2.200000e-01 0.000000e+00 1.000000e-02  2.100000e-01  2.200000e-01
psi    1.000000e+00 0.000000e+00 0.000000e+00  1.000000e+00  1.000000e+00
lp__  -1.330300e+03 2.000000e-02 6.900000e-01 -1.332210e+03 -1.330470e+03
                50%           75%         97.5% n_eff Rhat
muPsi  5.454296e+18  1.126274e+19  5.649307e+19     6 1.47
muP   -1.240000e+00 -1.210000e+00 -1.150000e+00  3637 1.00
p      2.200000e-01  2.300000e-01  2.400000e-01  3643 1.00
psi    1.000000e+00  1.000000e+00  1.000000e+00  4000  NaN
lp__  -1.330020e+03 -1.329860e+03 -1.329810e+03  2040 1.00
Samples were drawn using NUTS(diag_e) at Thu Aug 23 08:33:35 2018.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).
Questions/Problems
I have discussed this problem with a statistician collaborator.
He thinks we have an identifiability problem: with this formulation, the model cannot tell the two levels apart.
His observation seems reasonable to me.
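To make his point concrete, take a site with J visits and no detections (so its rows have z[r] = 0). The wide and binomial versions marginalize the occupancy state once over the whole detection history, while the long-form loop applies the mixture once per row:

$$
\underbrace{\psi (1-p)^{J} + (1 - \psi)}_{\text{per-site mixture (wide/binomial)}}
\;\neq\;
\underbrace{\prod_{j=1}^{J}\left[\psi (1-p) + (1 - \psi)\right]}_{\text{per-row mixture (long form as coded)}}
$$

Likewise, for rows with z[r] > 0 the long form adds log_muPsi once per visit rather than once per site, so \psi and p end up entangled in a single per-row probability, which looks like the identifiability problem we are seeing.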
Here are my questions:
- Did I write my likelihood statement correctly?
- Any suggestions for this identifiability problem?
Note on occupancy model examples
My short-term goal is to write a tutorial on occupancy models in Stan, and my longer-term goal is to wrap the tutorials up into a package. If anyone else is interested in helping, please let me know.