First time modeling censored count data—looking for peer review

  • Operating System: Mac
  • brms Version: 2.10.3

I’ve inherited some count data of self-reported behaviors. The response format was limited in that the lowest count one could endorse was 1 and the highest count one could endorse was 6. The way I see it, this implies each 1 is a mixture of 1s and 0s and each 6 is a mixture of 6s and above. I believe these could be considered censored count data and I’d like to model them with the Poisson likelihood.

Here’s a simplified version of the data. The y values are the true counts. The y_cens values are the counts after censoring as described, above.

# load packages
library(tidyverse)
library(brms)

# simulate data
set.seed(1)
s <-
  tibble(y      = rpois(n = 1e3, lambda = 3)) %>% 
  mutate(y_cens = ifelse(y < 1, 1,
                         ifelse(y > 6, 6, y)))

The cens() argument in brms::brm() requires an additional column in the data indicating the censoring status of each observation. Here’s my attempt.

s <-
  mutate(censored = ifelse(y == 1, "left",
                           ifelse(y > 6, "right", "none")))

Here’s my model setup.

fit <-
  brm(data = s,
      family = poisson,
      y_cens | cens(censored) ~ 1)

The model output seems to looks good.

print(fit)
 Family: poisson 
  Links: mu = log 
Formula: y_cens | cens(censored) ~ 1 
   Data: s (Number of observations: 1000) 
Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
         total post-warmup samples = 4000

Population-Level Effects: 
          Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept     1.11      0.02     1.07     1.14 1.00     1649     2219

Samples were drawn using sampling(NUTS). For each parameter, Eff.Sample 
is a crude measure of effective sample size, and Rhat is the potential 
scale reduction factor on split chains (at convergence, Rhat = 1).

Does it appear I’ve correctly interpreted how to work with censored count data?

1 Like

This looks a bit weird I would expect it to be ifelse(y <= 1, "left", ifelse(y >= 6, "right", "none")))
since you would not know which ones and sixes are “real” and which are censored and since values y == 0 were also censored. This might actually be the reason why your estimate is slightly biased upwards from log(3) = 1.098... despite the large data size.

Other than that it looks correct to me. Also testing your intuitions on a small model with simulated data is a good practice!

Hope that helps!

Thanks for the suggestion, @martinmodrak. I tried your coding method out. Here’s the code and subsequent model.

# add alternative censoring variable
s <-
  s %>% 
  mutate(c2 = ifelse(y <= 1, "left", ifelse(y >= 6, "right", "none")))

# fit the model
fit2 <-
  brm(data = s,
      family = poisson,
      y_cens | cens(c2) ~ 1)

# check the summary
print(fit2)
 Family: poisson 
  Links: mu = log 
Formula: y_cens | cens(c2) ~ 1 
   Data: s (Number of observations: 1000) 
Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
         total post-warmup samples = 4000

Population-Level Effects: 
          Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept     1.13      0.02     1.10     1.17 1.00     1634     2132

Samples were drawn using sampling(NUTS). For each parameter, Eff.Sample 
is a crude measure of effective sample size, and Rhat is the potential 
scale reduction factor on split chains (at convergence, Rhat = 1).

The updated intercept appears more biased relative to the original y data and the idealized lambda = 3.

log(mean(s$y)) %>% round(digits = 2)
log(3) %>% round(digits = 2)
[1] 1.11
[1] 1.1

Here’s the LOO comparison.

fit1 <- add_criterion(fit1, "loo")
fit2 <- add_criterion(fit2, "loo")

loo_compare(fit1, fit2) %>%  
  print(simplify = F)
     elpd_diff se_diff elpd_loo se_elpd_loo p_loo   se_p_loo looic   se_looic
fit1     0.0       0.0 -1810.2     15.3         1.0     0.0   3620.4    30.6 
fit2   -11.2       2.9 -1821.4     16.0         1.2     0.1   3642.8    31.9 

Sorry for taking long to get back. I was mistaken in my initial evaluation, because brms treats lower bounds as exclusive and upper bounds as inclusive (when censoring), sorry for the confusion. So what I see as correct would be:

mutate(censored = ifelse(y <= 1, "left",
                           ifelse(y > 6, "right", "none")))

(in your original version values with y==0 would appear as uncensored y = 1)

Hope that makes sense.