Posterior predictive distribution for occupancy models

Brief background

Occupancy models estimate the probabilities that a species occupies each of a collection of sites, based on detection/nondetection data from biological surveys. Conceptually, the model is formulated as a pair of logistic regressions. Across sites, the site occupancy Z (a latent binary state) is regressed on covariates whose linear combination gives the logit of the predicted occupancy probability \psi. Conditional on Z = 1, the observed detection/nondetection data are also regressed on covariates whose linear combination gives the logit of the predicted detection probability \theta. \psi and \theta are identifiable because the sites are visited repeatedly during a period over which Z is assumed not to change. The outcomes of sampling on the visits are assumed to be conditionally independent. We implement the model by marginalizing over the latent state Z. See here for more details about the model and the marginalization.

My question

I am doing some posterior predictive checking on an occupancy model, and it seems to me that the marginalized and non-marginalized specifications actually lead to substantially different posterior predictive distributions (see below)! I’m personally surprised to realize this, if only because I had kinda conceptualized marginalization as nothing but a tool to make models easier to sample, not as a decision about my posterior predictive distribution. I guess my question is whether anybody has some interesting or useful comments about when it might be preferable to use one distribution or the other. I imagine that the first distribution (the one that arises from the marginalized model) should be more sensitive for diagnosing model misspecification, since the posterior predictive distribution is less tightly tied to the observed data.

Posterior predictive distribution in the marginalized model

Consider the posterior predictive distribution for obtaining at least one detection at site i. This is given by Bernoulli\left(\psi_i\left(1-\prod_{j=1}^{n}(1-\theta_{ij})\right)\right), where \psi_i is the modeled occupancy probability, \theta_{ij} is the modeled conditional detection probability, n is the number of visits, and j indexes the visit.
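As a quick sketch in R of simulating from this distribution (the vectors psi and theta here are made-up stand-ins for posterior draws at a single site i with n = 3 visits, not output from a real fit):

set.seed(1)
S <- 4000                                   # number of posterior draws
psi <- rbeta(S, 2, 2)                       # stand-in posterior draws of psi_i
theta <- matrix(rbeta(S * 3, 2, 2), S, 3)   # stand-in draws of theta_ij, n = 3 visits
p_any <- psi * (1 - apply(1 - theta, 1, prod))  # psi_i * (1 - prod_j(1 - theta_ij))
y_any_rep <- rbinom(S, 1, p_any)            # posterior predictive draws of ">= 1 detection"
mean(y_any_rep)                             # posterior predictive Pr(>= 1 detection)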

Posterior predictive distribution in the non-marginalized model

In the non-marginalized model, the posterior predictive distribution is different for sites with at least one detection in the data versus sites with no detections. With at least one detection at site i, the posterior distribution for Z_i places probability 1 on Z_i = 1, and the posterior predictive distribution of interest is just Bernoulli\left(1-\prod_{j=1}^{n}(1-\theta_{ij})\right) (note the lack of dependence on \psi_i).
For sites with no detections, Z_i is equal to one with posterior probability \frac{\psi_i\prod_{j=1}^{n}(1-\theta_{ij})}{(1-\psi_i) + \psi_i\prod_{j=1}^{n}(1-\theta_{ij})}. The posterior predictive distribution for obtaining at least one detection at such a site is then Bernoulli\left(\frac{\psi_i\prod_{j=1}^{n}(1-\theta_{ij})}{\psi_i\prod_{j=1}^{n}(1-\theta_{ij}) + (1-\psi_i)}\left(1-\prod_{j=1}^{n}(1-\theta_{ij})\right)\right). Hopefully I got that right, anyway.
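For comparison, here is the same kind of R sketch for the non-marginalized version at a site whose observed history is all zeros, reusing the stand-in draws from above:

q0 <- apply(1 - theta, 1, prod)                   # prod_j(1 - theta_ij)
pr_z1 <- psi * q0 / (psi * q0 + (1 - psi))        # posterior Pr(Z_i = 1 | all-zero history)
z_rep <- rbinom(S, 1, pr_z1)                      # draw the latent occupancy state
y_any_rep_cond <- rbinom(S, 1, z_rep * (1 - q0))  # >= 1 detection, conditional on Z_i
mean(y_any_rep_cond)                              # generally differs from the marginalized version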


Hi, apologies for the delay in responding to your question, which is well-written and informative.

@vianeylb @Bob_Carpenter @betanalpha are our resident experts, but I can have a look in the coming week as well.

I should start off by saying I’m not super familiar with occupancy models (although I know they can be formulated as hidden Markov models). However, whether marginalized or not, the manner in which to simulate data from an occupancy model using draws from the joint posterior distribution shouldn’t change, no? I see you have code in your case study writeup on how to simulate data from a specific model.

What might affect results in practice is the ability of an MCMC algorithm to explore the discrete space efficiently versus sampling from the marginalized model (just a guess!!). I wouldn’t ever expect a posterior predictive distribution to be different for an HMM, for example, whether fitting the model using the marginalized likelihood or not. I’ll have a think and try to come back with some more useful advice!


I think you’re right; in both cases the posterior predictive distribution should be the one that I wrote down for the non-marginalized model. The confusing thing is that in the marginalized model, the likelihood is different depending on whether or not there’s at least one detection at the site. But there’s still something going on here that I’m not entirely comfortable/confident about.

Just to expose the issue more starkly, suppose that I have a generative model:

 Z ~ bernoulli(p_occupancy);  // Z is a latent discrete parameter that needs to be marginalized out
 y ~ bernoulli(Z*p_detection);  // y is data, and there are no repeat visits to keep track of here.

If I have this right, then in the non-marginalized case (treating Z as a parameter) the posterior predictive distribution if y = 1 is just bernoulli(p_{detection}), because the posterior distribution for Z places probability one on Z = 1. If y = 0, the posterior probability that Z = 1 is \frac{p_{occupancy}(1 - p_{detection})}{p_{occupancy}(1 - p_{detection}) + (1 - p_{occupancy})}, so the posterior predictive distribution is bernoulli\left(p_{detection}\,\frac{p_{occupancy}(1 - p_{detection})}{p_{occupancy}(1 - p_{detection}) + (1 - p_{occupancy})}\right).

Perhaps this is the correct posterior predictive distribution for the marginalized model as well. It is the distribution that we recover if we marginalize with

if (y == 1) {
  target += log(p_detection);  // Pr(y = 1 | Z = 1)
}
if (y == 0) {
  target += log(p_occupancy * (1 - p_detection) + (1 - p_occupancy));  // Pr(y = 0)
}

But it sure is tempting to say that the marginalized model is just

y ~ bernoulli(p_occupancy * p_detection)

And this yields a different posterior predictive distribution. So I guess I feel like either we need to say that this is not the fully “correct” marginalization, or that marginalizing changes the posterior predictive distribution?
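For concreteness, writing out the sum over Z in this toy model gives the marginal probabilities

\begin{align*} P(y=1) &= p_{occupancy} \, p_{detection} \\ P(y=0) &= p_{occupancy}(1 - p_{detection}) + (1 - p_{occupancy}) = 1 - p_{occupancy} \, p_{detection} \end{align*}

which matches the single Bernoulli statement above.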

Thinking back to the full occupancy model (with multiple visits), it feels very weird to have the posterior predictive distribution depend so directly on the observed data. It seems like it should weaken the power of posterior predictive checking. And it doesn’t at all correspond to the way I would go about simulating synthetic data from scratch. That simulation is easy to do: we just resample the latent discrete occupancy state from the distribution given by the fitted model. Any suggestions for what to call this distribution, which at the end of the day is not the posterior predictive distribution?

Thanks so much for taking a look!

Looking over @mbjoseph’s blog, is there a chance you’re putting p_occupancy inside the Bernoulli distribution when it should be a mixture? Maxwell B. Joseph: A step-by-step guide to marginalizing over discrete parameters for ecologists using Stan


One quick note: the posterior predictive distribution is different depending on whether the occupancy state is known.

If any detections are made, then the site is occupied (because we assume no false positive detections). In that case, we can simulate from the posterior predictive distribution as:

\tilde y \sim \text{Binomial}(J, p),

where \tilde y is a predicted observation, J is the number of replicate surveys, and p the detection probability.

If no detections are made, then we don’t know the occupancy state z. Further, the probability that a site is occupied given that no detections were made is not \psi, which is the unconditional probability of occupancy \text{Pr}(z=1). We actually want to work with \text{Pr}(z=1 \mid y=0). By Bayes’ theorem this is given by:

\dfrac{\psi (1-p)^J}{\psi (1-p)^J + 1-\psi}.
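Spelling out that Bayes’ theorem step (writing y = 0 as shorthand for no detections in any of the J surveys):

\text{Pr}(z=1 \mid y=0) = \dfrac{\text{Pr}(y=0 \mid z=1)\,\text{Pr}(z=1)}{\text{Pr}(y=0 \mid z=1)\,\text{Pr}(z=1) + \text{Pr}(y=0 \mid z=0)\,\text{Pr}(z=0)} = \dfrac{(1-p)^J \psi}{(1-p)^J \psi + 1 \cdot (1-\psi)}.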

For a Stan example see the occupancy model BPA book Stan translations: example-models/site-occ.stan at master · stan-dev/example-models · GitHub

So, in the case where there are no detections, you would first draw a value for the unknown occupancy state in the generated quantities block:

z \sim \text{Bernoulli}\Big( \dfrac{\psi (1-p)^J}{\psi (1-p)^J + 1-\psi} \Big),

then, simulate new detection data conditional on this value:

\tilde y \sim \text{Binomial}(J, zp).
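In R, the two cases can be sketched with something like the following (psi and p are hypothetical vectors of posterior draws for a single site; this is a rough mirror of the generated quantities logic, not brms or Stan API code):

# Simulate posterior predictive detection counts for one site with J surveys
# and y_obs observed detections, given posterior draws of psi and p.
simulate_ytilde <- function(psi, p, J, y_obs) {
  S <- length(psi)
  if (y_obs > 0) {
    # Occupancy state is known (no false positives), so z = 1 in every draw.
    rbinom(S, J, p)
  } else {
    # Draw z from its conditional posterior given J nondetections,
    # then simulate new detection data conditional on z.
    pr_z1 <- psi * (1 - p)^J / (psi * (1 - p)^J + 1 - psi)
    z <- rbinom(S, 1, pr_z1)
    rbinom(S, J, z * p)
  }
}

ytilde <- simulate_ytilde(psi = rbeta(4000, 2, 2), p = rbeta(4000, 2, 2), J = 4, y_obs = 0)  # stand-in draws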


Hi @mbjoseph,
I’m digging this up because I think we actually ended up in the wrong place here.

I think we both fell prey to the same semantic confusion about what should properly be called the “posterior predictive distribution” in models like these. In particular, I was misled by all of the literature that calls Z a “parameter” in the non-marginalized model. If Z is a parameter, then of course the posterior predictive distribution should condition on Z. But Z is not really a parameter as such in these models; it has more the flavor of a residual error term in an ordinary regression (note that if we condition the predictive distribution of an ordinary regression on the estimated residuals, we precisely recapitulate the original data every time).

One way to see that Z isn’t really a parameter is to consider the behavior of cross-validation if the likelihood were to condition on Z. In particular, cross-validation would always favor models with the smallest possible occupancy probabilities, since the likelihood of the observed detections conditional on Z contains absolutely no penalty for estimating arbitrarily small occupancy probabilities, but the likelihood of the observed nondetections always increases with decreasing occupancy probability.

One way to check how experts use the concept of the posterior predictive distribution in the context of these models is to take a peek at how brms handles posterior predictions for a zero-inflated binomial regression, which is an occupancy model without visit-varying detection covariates:

library(brms)
N <- 100 # number of sites
nvisit <- 15 # number of repeat visits (a large number is useful for illustration)
p <- .8 # detection probability
zi <- .5 # 1 minus occupancy probability

nzi <- rbinom(1, N, zi) # number of unoccupied sites
nBinomial <- N - nzi # number of occupied sites

mydata <- data.frame(y = c(rbinom(nBinomial, nvisit, p), rep(0, nzi)),
                     trials = nvisit)

fit <- brm(y | trials(trials) ~ 1, family = zero_inflated_binomial(), data = mydata,
                 backend = 'cmdstanr')
posterior_predict(fit)

If brms were conditioning the posterior predictions on the latent zero-inflation state, we would see exclusively nonzero predictions in the columns for the occupied sites (the first nBinomial columns) and exclusively zero predictions in the columns for the unoccupied sites (the remaining nzi columns). But brms does not condition the posterior predictive distribution on Z, and we instead see zero and nonzero predictions scattered at random.
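(A quick numeric check of this claim, reusing the fit above:)

yrep <- posterior_predict(fit)       # draws-by-sites matrix
colMeans(yrep[, mydata$y == 0] > 0)  # fraction of draws with >= 1 detection at the observed all-zero sites

If the predictions conditioned on the latent state, these fractions would be essentially zero; instead each should come out near (1 - zi) times the probability of at least one detection, roughly 0.5 with the simulation settings above.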

I’ve recently learned from @Bob_Carpenter of some useful language to attach to this issue. Apparently some people like to refer to Z as unobserved data, which I quite like. Then some people distinguish between the (unobserved) full data likelihood defined as P(y,Z|\psi, \theta) and the regular old likelihood defined by P(y|\psi, \theta), where \psi and \theta are occupancy and detection probabilities respectively. But we never refer to P(y|Z,\psi,\theta) as a likelihood, and simulating y from this distribution is not properly said to yield the “Posterior Predictive Distribution”.

At the end of the day, this is a bit of a relief to me, since it always felt icky to provide the posterior predictive distribution with absolutely certain knowledge about characteristics of the observed response (i.e. whether or not there is at least one detection).


A historical note: this language was derived from a common frequentist bait-and-switch. In frequentist inference, probability theory is assumed to be applicable only to repeated events with limiting frequencies (hence the name), which means we can’t apply probability theory to the model configuration space; parameters have some fixed but unknown value. Variables like Z can never be observed directly, so they’re not technically data, and we can’t apply probability theory to them (including integrating them out). The bait-and-switch is then to assume that they could be observed in theory but were intentionally left out, in which case they are interpreted as “missing data”. In this case the observational model and likelihood functions take the form P(y, Z | \psi, \theta), and the Z can be integrated out to define the “marginal” or “collapsed” likelihood P(y | \psi, \theta).

It’s all much less convoluted in the more general Bayesian setting, where both data and model configuration variables can be marginalized.

The confusion here largely arises because there are multiple data generating processes consistent with the joint model P(y, Z, \psi, \theta). If multiple observations are made from the same Z then we have the process

\begin{align*} \psi, \theta &\sim P(\psi, \theta) \\ Z &\sim P(Z | \psi, \theta) \\ y_{n} &\sim P(y | Z, \psi, \theta) \end{align*}

On the other hand if Z varies with each observation then we have

\begin{align*} \psi, \theta &\sim P(\psi, \theta) \\ Z_{n} &\sim P(Z | \psi, \theta) \\ y_{n} &\sim P(y | Z_{n}, \psi, \theta) \end{align*}

This distinction doesn’t affect posterior inferences, but it will change the interpretation of the data generating process, and hence the generation of retrodictive simulations that exactly match the assumed data generating process. If Z is fixed then retrodictive y_{n} are simulated conditional on Z; otherwise we have to first simulate retrodictive Z_{n} before simulating the retrodictive y_{n}.

This only gets more confusing when we consider that we can use the inferences from this model to make not just retrodictions but also predictions in other circumstances. In that case we can make predictions both for hypothetical data generating processes where Z is fixed and for processes where Z varies! But only one of these hypothetical processes will match the assumed data generating process that generated the observed data.


That’s hilarious! Bummer, though, that it gives a more distasteful flavor to the language. I like calling Z something other than “parameter” since for many important applications (e.g. LOO) the “likelihood” does not condition on Z. And this is true in general for all zero-inflated count models.

My feeble mind likes to latch on to the idea of likelihood as P(\mathrm{data}|\mathrm{parameters}), so I’ll avoid calling Z a parameter whenever possible.

I’ll push back a bit here. Every mixture model can be interpreted as having either a single, fixed but unknown group index common to all observations or group indices that vary from observation to observation, which means every one of these models has the same ambiguity. In many fields one of these options becomes convention, but that doesn’t mean that it becomes a general truth!

Again, when constructing posterior inferences the difference between the two perspectives doesn’t matter, but when you want to make retrodictions or predictions the two perspectives diverge, because retrodictions and predictions depend on more details of the assumed generative process. I think that it’s much clearer to explicitly specify which of those processes you’re assuming instead of relying on an assumed notational convention that may or may not be understood by others.