Do you need to specify offsets (for sampling effort) twice in hurdle models?

I was reading this thread on modelling with a hurdle_gamma versus a hurdle_lognormal distribution. Someone mentioned that adding an offset to the model might be useful, which made me want to follow up and ask: when adding an offset, must it be included in both the continuous response part and the hurdle part?

In the example below, would brms use the offset information from the continuous (gamma) portion to inform the hurdle portion? I ask because I am used to having to specify trial numbers when running a binomial model: count | trials(n) ~ …

gamma_hu_model <- brm(
  bf(
    # gamma model on the non-zero part
    Y ~ x1 + x2 + offset(log(n)) + (1 | RE),
    # binomial model for the zeros
    hu ~ x1 + x2 + (1 | RE)
  ),
  family = hurdle_gamma(link = "log"),
  data = df
)

When modelling counts, the offset allows us to make estimates per unit of sampling effort while keeping the observations as integers. The upshot is that we can still use a Poisson or negative binomial distribution.
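For concreteness, here is a minimal sketch of that count case with base R's glm(); the data frame and column names (d, count, x, effort) are made up for illustration:

```r
# Sketch: a Poisson model of counts per unit of sampling effort.
# With a log link, offset(log(effort)) means
#   E[count] = effort * exp(b0 + b1 * x),
# so doubling the effort doubles the expected count, while the
# response itself stays an integer and keeps its Poisson variance.
fit <- glm(count ~ x + offset(log(effort)),
           family = poisson(link = "log"),
           data = d)
```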

I do not see the benefit of using an offset with a continuous response. Why not just divide Y by n?


@amynang’s comment about whether to offset versus control for effort via a covariate or via division by n is an astute one.

I think it will rarely be sensible to use an offset on the hurdle part of a model, because that part works on the logit scale, and it is hard to know a priori how much the log odds of an event should shift. Offsets work well on the log scale precisely because there we often do know the shift a priori: doubling the sampling area, for example, will tend to double the count.

It’s also worth noticing that whether to control for effort in the hurdle part of a model depends entirely on the system being modeled. One could imagine a system where the hurdle zeros are structural zeros that will never change no matter how much effort one invests, or scenarios where the chance of observing the non-zero part of the response goes up with increasing effort.

For an example of the former, think of counting capuchin monkeys in sampling plots of different sizes, where some plots are in unsuitable habitat and cannot yield a count other than zero no matter how big you make the plot. For an example of the latter, think of counting capuchin monkeys along transects, where zeros arise when you don’t happen to encounter a troop during a transect walk, but longer transects or more time spent on the transect increase the probability of encountering a troop.

One could even imagine a sampling protocol that calls for walking a transect until you spot the first troop, then counting that troop and stopping. In that case, the total length of transect you are prepared to walk would need to be controlled for in the hurdle part, while the count conditional on finding a troop needs no effort control at all: a case where controlling for effort is appropriately applied only in the hurdle part and not in the non-hurdle part.

Sorry if this example is esoteric or off the wall; I’m making an assumption based on your user name.


@amynang and @jsocolar thanks very much for your replies. I appreciate the time taken to help me on this topic. The example about transects and structural zeros is an interesting one that I will keep in mind for the future.

I will give an example from observational data. Think of n in the example as the number of hours an individual (imagine a capuchin) is observed, and Y as the amount of time the individual was seen doing extractive foraging of a particular fruit. With limited sampling, some individuals will never be seen doing this behavior, but if we had observed them long enough, many more individuals would have been documented performing it. In such a case, false zeros are more likely for individuals with less observation time. Individuals also vary in their propensity for extractive foraging, so individuals that engage in it less are also less likely to be documented performing it over a given period. Let’s assume, though, that a hurdle model is appropriate because some individuals simply never become extractive foragers, and different demographic variables drive the likelihood of becoming one.

As for the continuous part, I could divide Y by n, but then I would generate a rate while throwing away information about the certainty of the estimate due to sampling effort (1 second observed over 30 minutes, versus 200 seconds observed over 20 hours). In this case, assume individuals also vary in the amount of time they dedicate to extractive foraging, conditional on doing it at all.

So, overall there is variation in the population in both the likelihood of an event (extractive foraging) and in the time spent performing the behavior (amount of time spent extracting, given that you are extractive foraging).

To understand variation in the likelihood of extractive foraging, if I were to run the model as a binomial model (where count is the number of events where Y > 0), I would set it up as:

mod_bin <- brm(count | trials(n) ~ x1 + x2 + (1 | Subject), family = binomial(), data = df)

but an equivalent syntax does not work in the hurdle portion of a hurdle model as far as I am aware.

If I were interested in variation in extractive foraging time (conditional on the event taking place), I could run a model like this:

mod_gamma <- brm(Y ~ x1 + x3 + offset(log(n)) + (1 | Subject), family = Gamma(link = "log"), data = df)

but then, if using a hurdle_gamma model, how does one properly account for sampling effort in both the hurdle and the continuous portion (let’s assume it’s reasonable to do so)? For the gamma portion, it seems offset(log(n)) would work, but it is not clear to me what to do with the hurdle portion of the model.

gamma_hu_model <- brm(
  bf(
    # gamma, non-zero part (how much do you extractively forage?)
    Y ~ x1 + x2 + offset(log(n)) + (1 | RE),
    # binomial, zeros (do you become an extractive forager?)
    hu ~ x1 + x3 + (1 | RE)
  ),
  family = hurdle_gamma(link = "log"),
  data = df
)

In reading my example, it sounds more like a zero-inflated gamma (structural zeros and sampling zeros), but I’ve read that there is no such distinction in brms.

Leaving aside the inflation/hurdle question, the offset does not have the effect you desire. It is strictly equivalent to dividing by observation time, in the sense that 5/10 is not weighted any differently than 500/1000.
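A quick way to convince yourself of this equivalence, with simulated data (a sketch; the names are made up). For a gamma GLM with a log link, the score equations are identical whether you use the offset or divide the response, so the fitted coefficients come out the same either way:

```r
set.seed(1)
n_obs <- 200
d <- data.frame(
  x = rnorm(n_obs),
  n = runif(n_obs, 1, 50)  # observation time (exposure)
)
# simulate: the rate per unit time depends on x, and Y scales with exposure n
d$Y <- d$n * rgamma(n_obs, shape = 2, rate = 2 / exp(0.5 + 0.3 * d$x))

fit_offset <- glm(Y ~ x + offset(log(n)), family = Gamma(link = "log"), data = d)
fit_rate   <- glm(I(Y / n) ~ x,           family = Gamma(link = "log"), data = d)

# same coefficients: the offset is equivalent to dividing by n,
# and a 5/10 observation carries no less weight than a 500/1000 one
all.equal(coef(fit_offset), coef(fit_rate))
```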

I am not sure what would be a sensible way to incorporate observation time in the model if what you want is a continuous equivalent to what trials do in binomial regression. Perhaps using n as weights?

In the case of hurdle gamma, you have a mixture of a Bernoulli model (0/1, observed/not observed) with a gamma model (duration, if observed). I think this helps in understanding why offsets, weights, etc. would not make sense in the Bernoulli component. Your intuition is that 0/10 should be different from 0/1000, but the hurdle part does not work like that. It is either zero or not zero.

Thank you @amynang. That clarifies for me why you would not have something like offset() or trials() in the hurdle part.

I will need to do a bit more homework to understand what offset(log(n)) does for the gamma portion. I understand it can put the response on a per-unit-of-time scale, but I had also thought it was always a better option than transforming the response variable (Y/n): dividing the response by exposure (sampling effort) in the formula can lead to a misspecified model, because it changes the distribution of the response variable. The post I read was mostly about Poisson, but it later mentions using offset(log(exposure)) for Gamma and Tweedie as well.

But I understand that’s not the same reason for wanting to use offset() as my original intention (trying to have more certain estimates for individuals with higher sampling effort).

Thanks for all the help on this topic so far!! What a great community ;)

Suppose you have a very skewed distribution of observations because whatever you are measuring is predominantly measured at small (in space/time) observational units. Once you divide by unit size the distribution may no longer be skewed, allowing you to use Gaussian instead of lognormal or Gamma. The distribution has changed but for a perfectly legitimate reason.

So perhaps offsets may give us some more flexibility in how we handle continuous responses that are meant to be per unit of space/time/effort. I should not have been so dismissive of the idea :)

I may be way off the mark here (perhaps @jsocolar can put me right if that is so) but I think that using observation time as a weight would result in something closer to what you have in mind (and to how 1/5 differs from 100/500 in a binomial regression).

behaviour_time/observation_time | weights(observation_time) ~ ...

Potential caveat: trials are independent events. If what I say makes sense, it should hold when the total time of extractive behaviour is accumulated over several independent extractive events, which are more likely to be observed over a longer observation time.
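A sketch of that weighting idea in brms (untested; it precomputes the rate as a data column rather than doing arithmetic on the formula's left-hand side, and the column names behaviour_time and observation_time are from the examples above):

```r
# Sketch: observation time as likelihood weights on the rate.
# weights() multiplies each observation's log-likelihood contribution,
# so individuals watched longer pull harder on the fit, which is closer
# to how 100/500 beats 1/5 in precision in a binomial regression.
df$rate <- df$behaviour_time / df$observation_time
fit_w <- brm(
  rate | weights(observation_time) ~ x1 + x2 + (1 | Subject),
  family = Gamma(link = "log"),
  data = df
)
```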


I haven’t read the entire thread, but if you model the hurdle part with a different link function, e.g. hu = 1 - exp(-h), where h is the rate of occurrence, then you model log(h) = alpha + offset. I think this is called the cloglog link function. So if you consider the hurdle process to be driven by an underlying rate, then you can just add the offset there.
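For what it's worth, a sketch of that rate-based idea applied to the thread's model (untested; it assumes brms accepts link_hu = "cloglog" for hurdle_gamma, which is worth checking in ?brmsfamily, and note that in brms hu is the probability of a zero, so think through which direction the effort offset should act before interpreting its sign):

```r
# Sketch: cloglog link on the hurdle part, so an offset acts on the log
# of an underlying rate. With eta = alpha + log(n), the inverse cloglog
# gives 1 - exp(-n * exp(alpha)): the probability of at least one event
# of a Poisson process with rate exp(alpha) over exposure n.
gamma_hu_cloglog <- brm(
  bf(
    Y ~ x1 + x2 + offset(log(n)) + (1 | RE),
    hu ~ x1 + x3 + offset(log(n)) + (1 | RE)
  ),
  family = hurdle_gamma(link = "log", link_hu = "cloglog"),
  data = df
)
```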
