Posterior predictive distribution for occupancy models

Brief background

Occupancy models simultaneously estimate the probabilities that a species occupies each of a collection of site based on detection/nondetection data from biological surveys. Conceptually, the model is formulated as a pair of logistic regressions. Across sites, the site occupancy Z (a latent binary state), is regressed on covariates whose linear combination gives the logit of the predicted occupancy probability \psi. Conditional on Z==1, the observed detection/non-detection data are also regressed on covariates whose linear combination gives the logit predicted detection probability \theta. \psi and \theta identifiable because the sites are visited repeatedly during a period over which Z is assumed not to change. The outcomes of sampling on the visits are assumed to be conditionally independent. We implement the model by marginalizing over the latent state Z. See here for more details about the model and the marginalization.

My question

I am doing some posterior predictive checking on an occupancy model, and it seems to me that the marginalized and non-marginalized specifications actually lead to substantially different posterior predictive distributions (see below)! I’m personally surprised to realize this, if only because I had kinda conceptualized marginalization as nothing but a tool to make models easier to sample, not as a decision about my posterior predictive distribution. I guess my question is whether anybody has some interesting or useful comments about when it might be preferable to use one distribution or the other. I imagine that the first distribution (the one that arises from the marginalized model) should be more sensitive for diagnosing model misspecification, since the posterior predictive distribution is less tightly tied to the observed data.

Posterior predictive distribution in the marginalized model

Consider the posterior predictive distribution for obtaining at least one detection at site i. This is given by Bernoulli(\psi_i*(1-\prod_{j=1}^{n}(1-\theta_{ij}))), where \psi is the modeled occupancy probability, \theta is the modeled conditional detection probability, and j indexes the visit.

Posterior predictive distribution in the non-marginalized model

In the non-marginalized model, the posterior predictive distribution is different for points with at least one detection in the data versus points with no detections. With at least one detection at point i, the posterior distribution for Z_i is equal to one with probability 1, and the posterior predictive distribution of interest is just Bernoulli(1-\prod_{j=1}^{n}(1-\theta_{ij})) (note the lack of dependence on \psi).
For points with no detections, Z_i is equal to one with probability \frac{\psi_i*\prod_{j=1}^{n}(1-\theta_{ij})}{ (1-\psi_i) + \psi_i*\prod_{j=1}^{n}(1-\theta_{ij})}. So the posterior predictive distribution for obtaining at least one detection at a point is then Bernoulli(\frac{\psi_i*\prod_{j=1}^{n}(1-\theta_{ij})}{\psi_i*\prod_{j=1}^{n}(1-\theta_{ij}) + (1-\psi_i)}*(1-\prod_{j=1}^{n}(1-\theta_{ij}))) Hopefully I got that right, anyway.


Hi, apologies for the delay in responding to your question, which is well-written and informative.

@vianeylb @Bob_Carpenter @betanalpha are our resident experts, but I can have a look in the coming week as well.

I should start off by saying I’m not super familiar with occupancy models (although I know they can be formulated as hidden Markov models). However, whether marginalized or not, the manner in which to simulate data from an occupancy model using draws from the joint posterior distribution shouldn’t change, no? I see you have code in your case study writeup on how to simulate data from a specific model.

What might affect results in practice is the ability of an MCMC algorithm to explore the discrete space efficiently vs marginalizing. (just a guess!!) I wouldn’t ever expect a posterior predictive distribution to be different for an HMM, for example, whether fitting the model using the marginalized likelihood or not. I’ll have a think and try to come back with some more useful advice!


I think you’re right; in both cases the posterior predictive distribution should be the one that I wrote down for the non-marginalized model. The confusing thing is that in the marginalized model, the likelihood is different depending on whether or not there’s at least one detection at the site. But there’s still something going on here that I’m not entirely comfortable/confident about.

Just to expose the issue more starkly, suppose that I have a generative model:

 Z ~ bernoulli(p_occupancy);  // Z is a latent discrete parameter that needs to be marginalized out
 y ~ bernoulli(Z*p_detection);  // y is data, and there are no repeat visits to keep track of here.

If I have this right, then in the non-marginalized case (treating Z as a parameter) the posterior predictive distribution if y = 1 is just bernoulli(p_{detection}), because the posterior distribution for Z is one with probability one. If y = 0 the posterior predictive distribution is bernoulli(p_{occupancy} * (1 - p_{detection}) + (1 - p_{occupancy})).

Perhaps this is the correct posterior predictive distribution for the marginalized model as well. It is the distribution that we recover if we marginalize with

if(y == 1){
  y ~ bernoulli(p_detection)
if(y == 0){
 y ~ bernoulli(p_occupancy * (1 - p_detection) + (1 - p_occupancy))

But it sure is tempting to say that the marginalized model is just

y ~ bernoulli(p_occupancy * p_detection)

And this yields a different posterior predictive distribution. So I guess I feel like either we need to say that this is not the fully “correct” marginalization, or that marginalizing changes the posterior predictive distribution?

Thinking back to the full occupancy model (with multiple visits), it feels very weird to have the posterior predictive distribution depend so directly on the observed data. It seems like it should weaken the power of posterior predictive checking. And it doesn’t at all correspond to the way I would go about simulating synthetic data from scratch. That simulation is easy to do–we just resample the latent discrete occupancy state from the distribution given by the fitted model. Any suggestions for what to call this distribution, which at the end of the day is not the posterior predictive distribution?

Thanks so much for taking a look!

Looking over @mbjoseph’s blog, is there a chance you’re putting p_occupancy inside the Bernoulli distribution when it should be a mixture? Maxwell B. Joseph: A step-by-step guide to marginalizing over discrete parameters for ecologists using Stan


One quick note - the posterior predictive distribution is different depending on whether the occupancy state is known.

If any detections are made, then the site is occupied (because we assume no false positive detections). In that case, we can simulate from the posterior predictive distribution as:

\tilde y \sim \text{Binomial}(J, p),

where \tilde y is a predicted observation, J is the number of replicate surveys, and p the detection probability.

If no detections are made, then we don’t know the occupancy state z. Further, the probability that a site is occupied given that no detections were made is not \psi, which is the unconditional probability of occupancy \text{Pr}(z=1). We actually want to work with \text{Pr}(z=1 \mid y=0). By Bayes’ theorem this is given by:

\dfrac{\psi (1-p)^J}{\psi (1-p)^J + 1-\psi}.

For a Stan example see the occupancy model BPA book Stan translations: example-models/site-occ.stan at master · stan-dev/example-models · GitHub

So, in the case where there are no detections, you would first draw a value for the unknown occupancy state in the generated quantities block:

z \sim \text{Bernoulli}\Big( \dfrac{\psi (1-p)^J}{\psi (1-p)^J + 1-\psi} \Big),

then, simulate new detection data conditional on this value:

\tilde y \sim \text{Binomial}(J, zp).