How to generate posterior predictive distribution for hierarchical occupancy model in Stan?

Hi, I am new to Bayesian modeling and Stan. I'm hoping to get some guidance on generating a posterior predictive distribution to compare to my observed y values of presences and absences.

I am using the below model which was developed by Richard Erickson (StanOccupancyModelTutorials/TwoLevelOccupancyModels.Rmd at master · raerickson/StanOccupancyModelTutorials · GitHub).

I mostly understand the theory; however, the complexity of the likelihood is making it hard for me to write the generated quantities block.

data {
  int<lower=0> n_sampling_units;     // number of sites
  int<lower=0> y[n_sampling_units];  // detections per site (count out of k surveys)
  int<lower=0> k[n_sampling_units];  // number of surveys (visits) per site
}

parameters {
  real muPsi;
  real muP;
}

model {

  real log_muPsi;
  real log1m_muPsi;

  log_muPsi   = log_inv_logit(muPsi);
  log1m_muPsi = log1m_inv_logit(muPsi);

// priors
  muP ~ normal(2, .5);
  muPsi ~ normal(0, 2);
  
 // likelihood
  for (r in 1:n_sampling_units) {
    if (y[r] > 0)
      target += log_muPsi + binomial_logit_lpmf(y[r] | k[r], muP);
    else
      target += log_sum_exp(log_muPsi + binomial_logit_lpmf(y[r] | k[r], muP),
                            log1m_muPsi);
  }
}
generated quantities{
  real<lower = 0, upper = 1> p;
  real<lower = 0, upper = 1> psi;

  p   = inv_logit(muP);
  psi = inv_logit(muPsi);
}
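
As I understand it, the model block marginalizes the occupancy state out, so the likelihood it implements for each site r is

$$
\Pr(y_r \mid \psi, p) =
\begin{cases}
\psi \, \mathrm{Binomial}(y_r \mid k_r, p) & \text{if } y_r > 0, \\
\psi \, (1 - p)^{k_r} + (1 - \psi) & \text{if } y_r = 0,
\end{cases}
$$

where \psi = inv_logit(muPsi) and p = inv_logit(muP), as in the generated quantities block.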
  

My generated quantities block so far is:

  vector[n_sampling_units] y_rep;
  real<lower = 0, upper = 1> p;
  real<lower = 0, upper = 1> psi;

  p   = inv_logit(muP);
  psi = inv_logit(muPsi);

  for (i in 1:n_sampling_units) {
    y_rep[i] = bernoulli_rng(psi * p);
  }

Thank you!

-Vlad


Welcome @vladkovalenko!

Check out this thread for an in-depth discussion: Posterior predictive distribution for occupancy models - #10 by betanalpha.


Thanks @mbjoseph!

This is very helpful, but I am hoping to also get some help with the actual code as I am new to Stan (and programming in general).

Hi Vlad,

This is complicated stuff! Here are some resources that might help:
Marginalized occupancy models: a how-to for JAGS and Stan (see especially the very last paragraph of the "The marginalized JAGS/BUGS model" section).
Occupancy model likelihoods: advanced topics. (see especially the whole thing!)

One major confusion here is that you need to decide two things:

  1. What are you trying to posterior-predict? The latent occupancy state? Or the observed detection/nondetection data?
  2. Do you want your predictions to condition only on the intercepts and covariates (if you have any) – that is, condition only on the linear predictors – or do you want your predictions to condition twice on the observations? I say “condition twice” because the linear predictors are estimated conditional on the data, but once we know the linear predictors we can still condition again on the data.

     This is easiest to see when there is at least one detection at a site. Then, regardless of the linear predictor for occupancy \text{logit}(\psi), we know with certainty that the true occupancy state Z is one. So do we want to condition our posterior predictions on Z = 1 or on the value of \text{logit}(\psi)? It turns out that for posterior predictive checking, and for likelihood-based model comparison including LOO, it is very important to condition on \psi and NOT on Z = 1.

     To see why intuitively, notice that at every site with a detection, we know with certainty that Z = 1. If we condition on Z = 1 at all of these sites, then we will universally get the best predictive performance by estimating the lowest possible intercept for \text{logit}(\psi), because at every site without a detection, low values of \text{logit}(\psi) maximize the likelihood, and at every site with a detection, \text{logit}(\psi) doesn't matter if we condition on Z = 1. On the other hand, if we want to get the best possible posterior predictions for the latent occupancy state Z, we absolutely should condition on the data in this way (see the formula after this list). This is all dealt with a bit more neatly in the two resources linked above.
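
To put the “condition twice” option in formula form, using the notation of your model above (where k_i is the number of visits to site i): at a site with no detections, the posterior probability of occupancy conditional on both the linear predictors and the data is

$$
\Pr(Z_i = 1 \mid y_i = 0, \psi_i, p_i) = \frac{\psi_i \, (1 - p_i)^{k_i}}{\psi_i \, (1 - p_i)^{k_i} + (1 - \psi_i)},
$$

and at a site with at least one detection it is exactly 1. Conditioning only on the linear predictors means using \psi_i itself instead of this expression.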

Edit: If we can pin down what exactly you want to predict, I’m happy to help you with the Stan code!


Thanks so much for the detailed reply and the links to your tutorials, Jacob!

Admittedly, after reading your reply and your tutorials, I feel even more in over my head.

To answer your first question, I believe I am trying to posterior-predict the observed detection/nondetection data. That way I can get a sense of how well the model fits my data. For the second question, I think I agree that we should condition the posterior on the value of logit(ψ).

Thanks for clarifying that! Also, I have two models at the moment: one with no detection or occupancy covariates to simply get at Z at my sites (the model which I shared in my original post), and a model with covariates for both parameters that I hope will elucidate the effects of the covariates and then also predict the Z at my sites. Do you think this is a good structure? Or should I try to combine both these models into one?

Thanks so much for your help!

Best,
Vlad

If you think that the covariates are relevant and useful, and the model that includes them doesn’t show any problems fitting, then I would abandon the model without covariates.

It sounds like you want to do two different flavors of posterior prediction:

  1. Predict Z at your sites.
  2. Generate posterior predictions to assess the quality of model fit.

The approach for (2) is actually easier than the approach for (1). For (2), we need to use the generated quantities to sample Z conditional on \psi. So in the generated quantities, you’ll want to compute p and psi just like you’ve done above. Then you’ll want to do something like

  for (i in 1:n_sampling_units) {
    z_rep[i] = bernoulli_rng(psi[i]); // here we index psi by site to deal with covariates later on
    y_rep[i] = z_rep[i] * binomial_rng(k[i], p[i]);
  }

For (1), we instead need

for (i in 1:n_sampling_units) {
  if (y[i] == 1) {
    Z_rep[i] = 1;
  } else {
    real ulo = psi[i] * (1 - p[i])^k[i]; // unnormalized likelihood associated with occupancy
    real uln = (1 - psi[i]);             // unnormalized likelihood associated with non-occupancy
    Z_rep[i] = bernoulli_rng(ulo / (ulo + uln));
  }
}
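
One small note: since y[i] in your original model is the count of detections out of k[i] surveys, the check in that branch should really be y[i] > 0 rather than y[i] == 1. To put the pieces together, here's an untested sketch of what a complete generated quantities block could look like for the covariate-free model you posted (psi and p stay scalar here, and I'm calling the data-conditioned occupancy state Z_cond just to keep it separate from z_rep):

generated quantities {
  real<lower = 0, upper = 1> p   = inv_logit(muP);
  real<lower = 0, upper = 1> psi = inv_logit(muPsi);
  int z_rep[n_sampling_units];  // occupancy simulated from psi alone
  int y_rep[n_sampling_units];  // posterior predictive detection counts
  int Z_cond[n_sampling_units]; // occupancy conditioned on the data as well

  for (i in 1:n_sampling_units) {
    // (2) posterior predictive data for assessing model fit
    z_rep[i] = bernoulli_rng(psi);
    y_rep[i] = z_rep[i] * binomial_rng(k[i], p);

    // (1) latent occupancy at your sites, conditioning on the observations too
    if (y[i] > 0) {
      Z_cond[i] = 1;
    } else {
      real ulo = psi * (1 - p)^k[i]; // unnormalized likelihood associated with occupancy
      real uln = 1 - psi;            // unnormalized likelihood associated with non-occupancy
      Z_cond[i] = bernoulli_rng(ulo / (ulo + uln));
    }
  }
}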

Thanks, Jacob! Some of the code you provided is definitely new to me.

I get an error in the first if statement (y[i] == 1), that says:

Binary infix operator == with functional interpretation logical_eq requires arguments of primitive type (int or real), found left type=int[ ], right arg type=int.

I think the issue is in how I first define the new generated quantities. I have:

generated quantities{
  real<lower = 0, upper = 1> p;
  real<lower = 0, upper = 1> psi;
  vector[n_sampling_units] y_rep;
  vector[n_sampling_units] z_rep;

  // detection and occupancy
  p   = inv_logit(muP);
  psi = inv_logit(muPsi);

  // Predict Z at sites
  for (i in 1:n_sampling_units) {
    if (y[i] == 1) {
      z_rep[i] = 1;
    } else {
      real ulo = psi[i] * (1 - p[i])^k[i]; // unnormalized likelihood associated with occupancy
      real uln = (1 - psi[i]); // unnormalized likelihood associated with non-occupancy
      z_rep[i] = bernoulli_rng(ulo / (ulo + uln));
    }
  }

  // generating posterior predictive distribution
  for (i in 1:n_sampling_units) {
    z_rep[i] = bernoulli_rng(psi[i]); // here we index psi by site to deal with covariates later on
    y_rep[i] = z_rep[i] * binomial_rng(k[i], p[i]);
  }
}

A bit of clarification: n_sampling_units is my number of sites (n=30). n_surveys is my number of repeat surveys at each site. So I am wondering if the declared dimensions for y_rep and z_rep are correct?

Additionally, I wonder if p and psi are correctly defined? I see that in your code you are indexing them by i. Does that require that they be vectors?

Thanks so much for your help!

-Vlad

So there are two things where we might be talking past each other.

First, I tucked this into a comment in my Stan snippets, but maybe I should have been more upfront about it. I am indexing psi and p by i because you said that you want to include covariates later on. I figured I'd go ahead and write the more general code that can deal with covariates.

Second, the original model that you posted doesn't have n_surveys but does have k, which I think is the same thing. Your model expects the data to be a single vector of length n_sampling_units giving the counts of detections per site. But based on the error you're reporting, I wonder if you've switched to a model where you pass y as an array of detection/nondetection data, where every element is a one or a zero corresponding to a detection or nondetection on a single visit. The generated quantities code would need modification to play well with a model formulated in this way.
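
For example, if your data block now declares something like int y[n_sampling_units, n_surveys], with each element a 0/1 outcome for a single visit (I'm guessing at the exact names here), then the occupancy branch would check whether any visit at the site had a detection, roughly:

for (i in 1:n_sampling_units) {
  if (sum(y[i]) > 0) {
    z_rep[i] = 1;
  } else {
    // assuming a constant per-visit detection probability p[i] at site i
    real ulo = psi[i] * (1 - p[i])^n_surveys;
    real uln = 1 - psi[i];
    z_rep[i] = bernoulli_rng(ulo / (ulo + uln));
  }
}

This is untested, but hopefully it shows the idea.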


Thanks, Jacob!

I was sensing the same thing so I went back to your (amazing) tutorial about marginalizing in Stan and am now using that model code to try and fit my data. In your model, it appears you have two site-level (psi) covariates, and one survey-level (detection) covariate.

One quick question about the code you provided there:

transformed parameters{
  real psi[n_unit];
  real theta[n_unit, n_rep];
  for(i in 1:n_unit){
    psi[i] = inv_logit(a0 + a1 * cov1[i] + a2 * cov2[i]);
    for(j in 1:n_rep){
      theta[i,j] = inv_logit(b0 + b1 * cov1[i] + b2 * cov3[i,j]);
    }
  }
}

I understand what you are doing for psi, but for theta it appears you are also using your first occupancy covariate as one of the linear terms. I'm wondering why it is there instead of theta[i,j] = inv_logit(b0 + b1 * cov3[i,j]);?

Thank you!


In this case I’m imagining that there’s a site-level covariate that influences both occupancy and detection :)


Thank you, Jacob.

Now that we are more or less on the same page and referring to the Stan model in your tutorial (Marginalized occupancy models: a how-to for JAGS and Stan), I wonder if you would please help me adapt your code for generating y_rep and z_rep?

At the moment, I think the indexing of det_data is where I'm running into trouble. I am unsure how to tell Stan to check the i-th row and the j-th column:

generated quantities{
  matrix[n_unit, n_rep] y_rep;
  vector[n_unit] z_rep;

  // Predict Z at sites
  for (i in 1:n_unit) {
    if (det_data[i] == 1) {
      z_rep[i] = 1;
    } else {
      real ulo = psi[i] * (1 - p[i])^n_rep[i]; // unnormalized likelihood associated with occupancy
      real uln = (1 - psi[i]); // unnormalized likelihood associated with non-occupancy
      z_rep[i] = bernoulli_rng(ulo / (ulo + uln));
    }
  }

  // generating posterior predictive distribution
  for (i in 1:n_unit) {
    z_rep[i] = bernoulli_rng(psi[i]); // here we index psi by site to deal with covariates later on
    y_rep[i] = z_rep[i] * binomial_rng(n_rep[i], p[i]);
  }
}

Thanks and sorry if I’m boring you with my basic problems :)

Before you reply, I realize that p should be theta. Should det_data be Q?

Hi Vlad,
Just want to give you a heads-up that I likely won't have time to respond properly on this thread until sometime next week. But you are on the right track. To predict posterior Z you need to implement the branching logic based on whether or not there is at least one detection at a site, which in the tutorial I have called Q.
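
Very roughly (and untested), the shape of it is something like this, where Q[i] is 1 if there was at least one detection at site i and theta[i, j] is the detection probability on visit j to site i:

for (i in 1:n_unit) {
  if (Q[i] == 1) {
    z_rep[i] = 1;
  } else {
    real ulo = psi[i];     // will become psi[i] * Pr(no detection on any visit | occupied)
    real uln = 1 - psi[i]; // unnormalized likelihood of non-occupancy
    for (j in 1:n_rep) {
      ulo *= 1 - theta[i, j];
    }
    z_rep[i] = bernoulli_rng(ulo / (ulo + uln));
  }
}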

Sounds good, Jacob. Thanks for all your help!

Hi Jacob,

Just wanted to circle back. I have successfully created a z_rep and it seems to predict my actual site occupancy quite well. I have yet to figure out y_rep and am still struggling with indexing issues. Please let me know if you have any thoughts. Otherwise, I feel like z_rep might be sufficient?

Thanks,
Vlad