Which family distribution to use for my duration time data

Hi everyone,

I have data from great apes, where we measured how long they stayed in a specific zone. Each ape did 3 trials, corresponding to 3 conditions. Because it is time data, and because we stopped measuring after 6 minutes, the possible values range from 0 to 360 seconds.

I fitted a brms model to model this and assess the difference between conditions:

Specify the priors


# Specify the priors for the model
prior <- c(
  prior(normal(0, 10), class = "Intercept"),
  prior(normal(0, 10), class = "b"),
  prior(inv_gamma(0.1, 0.1), class = "sigma")
)

Specify the model formula for conditions


# Specify the model formula
formula <- approach_time_Final|trunc(lb=-0.01) ~ condition + (1 | subject)

Fit the model



# Fit the model (the family is what I'm unsure about; this was originally a log-link Gaussian)
model <- brm(formula, data = Apedata, family = ???,
             prior = prior, iter = 50000, chains = 3, warmup = 20000, cores = 6,
             control = list(adapt_delta = 0.9), save_pars = save_pars(all = TRUE))


However, for some reason I can’t find a proper family distribution that fits the data. There is a relatively high number of low values in there (and even a fair amount of zeros).

I have tried a variety of distributions (including survival distributions in which I set the zeros to 0.001), but something always goes wrong in a different part of subsequent data analysis:

  1. Sometimes the model itself does not converge, or shows poor Gelman or Geweke statistics.
  2. Some distributions just fall outside of the parameter space (e.g., contain negative values or values much higher than 360). And truncating them results in one of the other problems.
  3. Sometimes the posterior predictive check does not even compute (I get NA warnings)
  4. Sometimes it then fails to generate expected predicted values (which I need for calculating contrasts)
  5. And then, finally, I once had a problem with a distribution that seemed fine in all of these ways, until I tried to compute Bayes factors between this model and a model with additional covariate predictors, and I got an error that some estimated likelihoods were inf/inf or -inf/-inf, suggesting that this was also not an appropriate distribution.

So if anyone has any ideas on what distribution to use, it would be greatly appreciated!

Best,

Wouter

Howdy!
It might help if you provided a plot (for example a histogram) of your outcome variable.

If you have many values at 360, then you probably need to account for censoring.
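
In brms that is just a cens() term in the model formula. A minimal sketch, assuming you add a hypothetical 0/1 indicator column (here called censored) flagging trials that were stopped at the 6-minute cutoff:


# Hypothetical right-censoring setup: `censored` = 1 if the trial hit 360 s, 0 otherwise
formula_cens <- approach_time_Final | cens(censored) ~ condition + (1 | subject)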

So this is all integer data from 0 to 360? Or is it continuous? In other words, is the precision down to the second, or are there fractions of a second?

So you have tried distributions like lognormal, Gamma, exponential, etc.?

If it is count data, integers from 0 to 360 in your case, perhaps you should try something like negative binomial.

Hi, thanks for your quick reply:

Here is a histogram:

The time is coded in (whole) seconds.

I have tried Gamma, lognormal, and exponential (with different link functions), as well as Weibull.

A negative binomial gives me a posterior predictive check like this:

Which also seems out of bounds. I suppose I can truncate it?

Then a model for count data like negative binomial or Poisson might be more appropriate.

I usually think about truncation as when the data generation process itself restricts the data to some bound. For example, some detector that can only detect a signal if it is large enough. Data where you simply stop the measurement is more a case of censoring, although in this particular case, it doesn’t seem that you have any data near the upper bound. The coefficients in your model when you use truncated distributions can be hard to interpret.
You might want to try a Poisson as well, just to compare.
Plotting PPC as histograms and also looking at the proportion of zeroes and max value could be helpful as well.

# Proportion of zeroes: compare the observed value with the posterior predictive distribution
prop_zero <- function(outcome) mean(outcome == 0)
(prop_zero_test <- pp_check(model_fit, type = "stat", stat = "prop_zero"))

# Maximum value: compare the observed value with the posterior predictive distribution
pp_max <- function(outcome) max(outcome)
(pp_max_test <- pp_check(model_fit, type = "stat", stat = "pp_max"))

I think the correct answer here probably depends on the question you want to ask. Are you interested in modelling the absolute amount of time each ape spent in each area, measured over the given time, or are you interested in the proportion of time each ape spent in each area within the experimental bounds?

If you’re interested in proportions, you might consider dividing the data by 360 and fitting a beta model, which models on the (0,1) scale. The only consideration would be that you can’t model zeroes exactly - so given the graph you’ve posted, which looks like it contains zeroes, you would need to consider a zero inflated model to account for this, and you might have to think closely about what the zero inflated part of the model means (probability that an ape enters an area at all?). Because you’re modelling a proportion out of 360 here, this model would never predict a number higher than this.
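
A rough sketch of what that could look like in brms; zone_prop is a hypothetical new column holding approach_time_Final / 360, and the zi part models the probability of never entering the zone:


# Sketch: proportion of available time spent in the zone, with zero inflation
# (exact values of 1, i.e. the full 360 s, would need zero_one_inflated_beta instead)
Apedata$zone_prop <- Apedata$approach_time_Final / 360

fit_zib <- brm(
  bf(zone_prop ~ condition + (1 | subject),
     zi ~ condition + (1 | subject)),
  data = Apedata,
  family = zero_inflated_beta()
)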

Alternatively, since you have count data, you might model using a Poisson or negative binomial model, with an offset of log(total_time), the total time each ape was in each experiment (which should be 360 in all cases?). This will give you a rate (the proportion of time each ape spent in an area, per second), which should be a number between zero and one. This has the advantage that it can natively handle zeroes, but the disadvantage is that for high rates (near 1) you may find that there are possible times beyond the maximum allowable in the experiment. YMMV here though, and this might be a good enough approximation if the rates are not particularly high, especially since you can avoid some of the additional complexity of the zero inflation component, as well as modelling the counts on their natural scale.
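
A sketch of that rate parameterisation, assuming a (hypothetical) column total_time with the observation time per trial (360 s here):


# Sketch: time in the zone as a count with an exposure offset;
# coefficients are then interpretable as (log) rates per second of observation
Apedata$total_time <- 360

fit_nb_rate <- brm(
  approach_time_Final ~ condition + (1 | subject) + offset(log(total_time)),
  data = Apedata,
  family = negbinomial()
)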

Thanks everyone.

I have tried 5 models based on your suggestions:

Poisson:

Poisson with a log link

Hurdle Poisson to accommodate the large number of observations at and close to zero:

Negbinomial:

Zero inflated Negbinomial to account for some of the zeros in our data being apes that did not approach (N = 6):

It seems like the hurdle Poisson fits the data best?
Ultimately, I want to use expected predicted values to see if the time the apes stay in the zone differs across conditions (using emmeans contrasts).

For example like this:
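(A sketch of what I have in mind; this assumes the fitted brms model object is called model and that the emmeans package is loaded.)


library(emmeans)

# Expected marginal means per condition, back-transformed to the response scale
emm <- emmeans(model, ~ condition, type = "response")

# Pairwise contrasts between conditions
contrast(emm, method = "pairwise")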

Thanks for your help! I’m relatively new to choosing very specific distributions for my data, so I appreciate others sharing their experience!

Do all the 0s in your data mean that the ape didn’t approach to start with? It’s not clear whether they didn’t approach under all three conditions or whether they didn’t approach under one condition but perhaps approached under others. If all the 0s are because 6 apes didn’t approach under all three conditions, I would also consider the following analysis of “time spent if ape approached at all”: analyze the positive observations separately, apply a transformation (log?) and then model f(time) as normal.

Note that you want to be careful about choosing a zero-inflated vs a hurdle model, because they make different assumptions. The hurdle model assumes that all of the zeroes arise from a separate process, while the zero-inflated model assumes that some zeroes come from the Poisson or negative binomial process and some come from another process. In brms, you can model the process that generates zeroes, so you might want a model for this process that includes condition and perhaps a varying intercept for subject.
Again, I would suggest other PPC than just the density plots, since this is count data. Checking the proportion of zeroes, maximum value, and using histograms would be a good idea.
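
For the zero process, a sketch of what that could look like in brms, reusing the variable names from your formula above (hu is the probability of a zero):


# Sketch: hurdle Poisson where the hurdle (zero) part also depends on condition
fit_hurdle <- brm(
  bf(approach_time_Final ~ condition + (1 | subject),
     hu ~ condition + (1 | subject)),
  data = Apedata,
  family = hurdle_poisson()
)

The same idea works for the zero-inflated families, where the distributional parameter is zi instead of hu.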

Hi everyone,

thanks again for your feedback.
Just a quick clarification: the zeros are specific trials, not subjects. So all participants approached in at least 1 condition.

Looking at the proportion of zeroes and max values yields the following:

Poisson


Poisson Log Link:

Hurdle Poisson:


Negative Binomial:


Zero inflated Neg Binomial


So as far as I can tell, the normal Poisson models actually don’t work as well, because they underestimate both the proportion of zeros and the max values.

The hurdle Poisson model seems pretty on point with the proportion of zeroes, but still underestimates the max value.

On the other hand, both negbinomial models seem to at least include the data’s proportion of zeros and max values, but the estimation of the max value allows for much higher (i.e., impossible) max values over 1000, whereas the max of the actual data cannot be higher than 360.

As for the assumptions about where the zeros come from, it is a bit tricky. There are actually 8 apes that spent no time in the zone; however, 2 of them did approach before the official time started but followed the experimenter somewhere else during the final procedural step before the clock started, whereas 6 of them never approached at all.

Any ideas?

Thanks!

For what it’s worth, I’m still not clear why you are modeling your data as counts. Time is continuous, but the measurement process means the observations are rounded to the nearest second. (That is okay; it would be weird to demand sub-second precision.) When you think about it, many variables that we treat as continuous are rounded to a certain precision: age, weight, grades (on a 0 to 100 scale), survival time (in days). So I would probably be thinking about whether I can model the data as a mixture of “not approached” (so Y = 0) and “approached & spent time, so Y > 0”. That could open the possibility of modeling time spent as continuous.
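
One way to write that in brms would be one of the continuous hurdle families (hurdle_lognormal or hurdle_gamma); just a sketch:


# Sketch: zeros handled by the hurdle part, positive times modeled as continuous
fit_hln <- brm(
  bf(approach_time_Final ~ condition + (1 | subject),
     hu ~ condition + (1 | subject)),    # hu = probability of not approaching
  data = Apedata,
  family = hurdle_lognormal()
)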


So I would probably be thinking about whether I can model the data as a mixture of “not approached” (so Y = 0) and “approached & spent time, so Y > 0”. That could open the possibility of modeling time spent as continuous.

So would that be in a single model? Because we are also analyzing the go vs. no-go in a separate Bernoulli model.

You can do it either way - as a mixture model or modeling them separately. When you use a continuous distribution, you now assume two separate processes, a Bernoulli for the zeroes and some model for the non-zero observations. As Michael Betancourt writes in his Mixture Modeling Basics, "In other words because the inflated and non-inflated points are essentially modeled by separate data generating processes the entire model decomposes into a binomial model for the total inflated observations and a baseline model for the non-zero observations.

Because these two models are independent we can fit them independently or jointly, whichever is more convenient. In particular if the inflated counts are just a nuisance then we can ignore them entirely and just fit the non-inflated observations directly without any consideration of the inflation!"

For your data, it doesn’t sound like you would want to treat the zero inflation as a nuisance. It seems like you would want to model it.
Other than the high max counts, the zero-inflated negative binomial doesn’t seem so bad. You can model the zero-inflation part, for example with a varying intercept for ape. If you know that it is not possible to have such high counts, then you can put a better prior on the shape parameter.
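
A sketch of what I mean; the prior values here are only illustrative, so check them with prior predictive simulations:


# Sketch: zero-inflated negative binomial with a modeled zero-inflation part
# and a more informative prior on the shape parameter to rein in extreme counts
fit_zinb <- brm(
  bf(approach_time_Final ~ condition + (1 | subject),
     zi ~ condition + (1 | subject)),
  data = Apedata,
  family = zero_inflated_negbinomial(),
  prior = c(
    prior(normal(0, 2), class = "b"),
    prior(gamma(2, 0.1), class = "shape")  # illustrative only; a larger shape means less overdispersion
  )
)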

Ok, then I will continue with the negbinomial model.

I have tried different shape parameters, but I keep getting max values between 1000 and 2000 (whereas the sample limit is 360). Truncating also does not work. Any suggestions on how to deal with this?