Accounting for unobserved successes in binary repreated trial model


Dear group,

I am fairly new to Stan/Bayesian statistics and maybe my question is stupid. Aynways, here is my problem:

I use a partially-pooled model for binary repeated trials to estimate annual forest disturbance rates (i.e., proportion of forests that were disturbed by insects, wind, etc.). My data set has 32 rows (=32 years), one column indicating the ‘successes’ (=number of forested plots disturbed), and one column indicating the number of ‘trials’ (=total number of forested plots). The data is attached. For building the model I follow the suggestions by Bob Carpenter: The model ist also attached (see model_base.stan).

Up to here, everything is working nicely! However, I know that while interpreting the plots we sometimes omit a disturbances (because it was to ephemeral) and we sometimes comit a disturbances (labeling a plot as disturbed even though it is not). Hence, the number of ‘successes’ (=number of forested plots disturbed) is just an estimation of the unknown ‘real’ number. I would like to integrate this into my model to get a better representation of uncertainties.

My first try was to model the response (i.e., number of ‘successes’) itself as binomial distributed. Hence, instead of:

model {
mu ~ normal(-1, 1); // hyperprior
sigma ~ normal(0, 1); // hyperprior
alpha_std ~ normal(0, 1); // prior

y ~ binomial_logit(K, mu + sigma * alpha_std); // likelihood

I tried something like this:

model {
int y_obs[N];

mu ~ normal(-1, 1); // hyperprior
sigma ~ normal(0, 1); // hyperprior
alpha_std ~ normal(0, 1); // prior

for (i in 1:N)
y_obs[i] ~ binomial(K[i], y[i] / K[i]); // With y / K being the estimated success-rate

y_obs ~ binomial_logit(K, mu + sigma * alpha_std); // likelihood

This model, however, does not sample. I am stuck here as I a) do not understand why it does not work and b) how I should proceed. Any suggestions?

Here the code and data:

model_base.stan (885 Bytes)
forestplots.csv (441 Bytes)
disturbance_rate_model.R (800 Bytes)


It looks like you’re trying to use y_obs as a discrete parameter, which won’t work since the tilde is not really a sampling statement ( and Stan does not estimate discrete parameters directly). To fit this model in Stan you would need to sum over possible y_obs values (and there’s a section of the manual about it, … it’s the one on fitting mixture distriubtions).

I’m not sure this is really the model you want to fit. To estimate something about the effect of omissions/comissions you would need some auxilliary data source about them (or at least some signal about them). In this particular model I don’t see that (at a glance).

Really great to see so much interest in Stan from Ecoogists since ecology tends to have really wacky observation models.


Okay, think I see the problem. The idea behind was to derive the number of successes from a binomial distribution, where the success rate is estimated from the raw data. This would introduce some additional stochasticity, as for each sample the number of successes would be slightly different.

I could make some rough estimates about the overall detection error (number of plots classified (in-)correctly), if this what you mean by auxiliary data? This would actually be nice to integrate, as we could do some kind of sensitivity analysis showing the effect of ommission/comission by varying the detection error.



As @Krzysztof_Sakrejda says, what you’re trying to create is what’s known as a measurement error model. There’s a chapter in the manual on how to think about them and code them. Typically what you’d do here is something like:

y[i] ~ foo(y_true[i], K[i], ...);
y_true ~ binomial_logit(...);

Your model has y_true (which you call y_obs) defined the other way around. Instead, you want the observed value to be generated based on the true value. Then you plug the true value into the regression. So what you’d want to do is this if you think there’s pure binomial measurement error:

y[i] ~ binomial(K[i], y_true[i] / K[i]);

The problem you’ll run into is that y is a discrete parameter. If the counts K[i] are smallish (hundreds, not thousands) and bounded (or approximately so), you can marginalize out. The chapter on latent discrete parameters in the manual shows how to do this.

Are the K measured without error?

If the count’s not bounded or large, continuous approximations can work, but you coldn’t use it with a binomial logit.

The Stan bug in the way you formulated your model is that you’re using y_obs before it’s defined. The ~ doesn’t sample and do assignment, it just evaluates a log density (or mass) and adds it to the target density. So you’d have to make y_obs a parameter, but Stan doesn’t support discrete parameters. This is the one case where we’d like to be able to support discrete params. So you have to do several difficult things to get Stan to fit this kind of model.