Zero-truncated count data

Hi everyone,

I am trying to model count data, but based on the posterior distribution (pp_checks) I cannot parametrize the model correctly I think.

About the data:

  • Participants perform a task (twice) that requires the to change between locations, and we want to measure and model the number of changes as a function of an earlier priming factor with 2 levels.

  • Note that the minimum possible count is 4 which suggests a truncated model. Or I can subtract 4 from the count, but I am first trying to model the distribution as is.

I have fitted the following models

# the data look like this (40% excerpt):
counts <- c(5, 4, 4, 5, 6, 6, 5, 5, 5, 6, 6, 4, 4, 4, 4, 6, 7, 5, 4, 5, 5, 5, 4, 6, 10, 5, 5, 4, 5, 4, 6, 5) 
prime <- c(rep('A', 16), rep('B', 16))
task <- c(rep('1', 8), rep('2', 8), rep('1', 8), rep('2', 8))
subject <- paste0("S",  1:32)

data <- tibble(counts, prime, task, subject)

# models
## standard 
m1 <- brm(counts ~ 1 + prime + task + (1|subject) , 
                                 family= negbinomial(),
                                 prior = set_prior("normal(0, 5", class = "b"), 
                                 data,
                                 save_all_pars = T) 
## zero truncated
m2 <- brm(counts | trunc(lb = 4) ~ 1 + prime + task + (1|subject) , 
                                 family= negbinomial(),
                                 prior = set_prior("normal(0, 5", class = "b"), 
                                 data,
                                 save_all_pars = T) 

I have tried this with poisson, negative binomial distributions (as well as other options, like gamma etc). I have also tried adding a couple other parameters (e.g. gender, but they dont’ make a difference).

From the posterior checks, lognormal provides the best fit but is inappropriate (assumes continuous outcome), and then poisson / negbinomial are close.

non-truncated neg binomial

truncated neg binomial (results similar with poisson)

However, as you can see in the figure, in the pp_checks, I notice that the models do no capture well the 4s (overs predict) and 5s (underpredict).

I am not sure how to improve the fit before deciding on a model with loocv.

Any advice most welcome!

  • brms beginner ;-)
  • Operating System: Mac OSX
  • brms Version: brms_2.14.4
2 Likes

Hey there!

IMO it would make sense to substract 4 from the dependent variable since values 0-3 are impossible to reach under any circumstance. I bet you’ll also get much better model fits this way.

If this seems too ad-hoc, you can motivate this with a change of variables where the transformation is just substracting 4 – so the log absolute Jacobian is just 0 (so you don’t have to do anything).

Maybe others have other opinions/suggestions on this, but I didn’t want your question stay unanswered any longer :)

Cheers,
Max

2 Likes

Alright Max was brave enough to make a guess so I’ll try one too :D.

The two plots seem really surprising to me too. Can you generate data from a truncated negative binomial in R and then fit it with brms to make sure the truncated negative binomial is working like you expect? Something like:

brm(counts | trunc(lb = 4) ~ 1, ...)

Maybe the tail behavior is making it hard for the negative binomial to fit what happens at y = 4. If that’s the case, then I guess you’ll need a more complicated distribution.

1 Like

To relevant question here is why is the minimum count 4?

If the hypothesized phenomena appears only after 4 counts then it’s natural to start modeling \text{counts} - 4, as it’s only the excess counts that manifest to that phenomena.

One the other hand if the hypothesized phenomena manifests in all counts but you require 4 for the experiment then it’s much more natural to model the truncation explicitly as it’s part of your data generating process.

Critically the negative binomial mass functions are only so flexible, and the behavior near the threshold will in general be very different between these two approaches. Based on limited details about the experiment I would speculate that the truncation is more natural, and that’s supported by the deficiencies of the excess fit.

1 Like

Hi everyone,
Thanks @bbales2 and @betanalpha for your replies! I had to take a break from this project for personal reasons and going back to it now.

The minimum count is due to the experimental design. To solve the task, people have to do a minimum of four moves, or more, so you are correct we are interested in the excess moves.

I tried with the truncation and it did not improve sufficiently. Eventually I used the acat family, as recommended elsewhere (trying to find and link to the post), which produced very good results.

Just wanted to say thanks.
Cheers

1 Like

OK to follow up on this, in case this helps anyone in the future.

After trying the acat() family, I also tried and settled with hurdle_poisson(). I think it makes sense from a conceptual point of view (model the excess zeros as a separate process) and the pp_checks show that it captures the distribution well.

1 Like

Hi @pmavros,

I am fitting a similar model but with hospital encounters. Number of encounters at hospital is recorded as a minimum of at least one encounter.

Did you finally subtract 4 to your outcome?

Best,
Esteban

Hi @maurosc3ner

TL;DR yes i subtracted 4.

Long story:
I subtracted ‘4’ which was the behaviourally relevant minimum number (you just can’t do the task without 4 counts). In my case, because most people just did the minimum (minimised effort) this produced a lot zeros, so it made sense to use a hurdle poisson distribution, which is a two-part model. Writing it up was straightforward, although i had discuss the hurdle part and the poisson parts separately. The adjacent category (acat) model gives very good results too, at least in terms of pp_checks.

However, in the end these counts did not tell me enough about what i was observing so i reconceptualised the problem – I suppose that’s what people mean by saying to think about the generative process. There were two types of behaviours people could do, and eventually we fitted 2 separate bayesian logistic regression models (i.e. family binomial). Because the overall response was both lower and upper bounded, the formula looked like this:

brm(counts|trials(upper_bound) ~ independent variables ..., family = binomial())

This helped unpack the behaviour better. Hope this helps, let me know if this answers your question.