Custom likelihood for multinomial model on pooled data

Hi Folks,

For some context, I’m working with disease prevalence data in ticks. To reduce costs, ticks are sometimes ‘pooled’ and screened for a particular pathogen together. Here, the binomial likelihood function doesn’t work as it’s impossible to know if tick A, B or both are infected. I overcome this using a ‘misclassification model’ as outlined by Richard McElreath’s in his statistical rethinking lecture 17 (2023). https://www.youtube.com/watch?v=mt9WKbQJrI4&t=3224s.

Part 1 of the code bellow shows this implementation in the rethinking package. It seems to work. However, ticks carry multiple diseases. I wondered if anyone could help me to re-write a multinomial ‘equivalent’ custom likelihood function in brms for the simulated data in part 2.

Any help and advice is greatly appreciated,
Thankyou

part 1: data simulation and working ‘rethinking’ model

library(tidyverse)

simple_dat <- tibble(
  pool_id = c(rep(1:20, time = 2), 21:30)) %>% # 20 pools of two samples, 10 single samples
  mutate(intercept = 0.2) %>% # 20% positive rate
  expand_grid(site = 1:5) %>% # increase sample size by x5
  mutate(positive = rbinom(250, 1, intercept))) # generate outcomes (positive, negative)

(simple_dat_pld <- simple_dat %>% 
  group_by(pool_id, site) %>% 
  summarise(num_positive = sum(positive), 
            pool_positive = ifelse(num_positive >= 1, 1, 0)) %>% # reduce the level of information, to positive/negative for each pool
  mutate(n = ifelse(pool_id %in% 1:20, 2, 1),
         # label the different outcome:pool size combinations
         c = 1, # 1 labels samples which weren't pooled and were negative 
         c = replace(c, n == 1 & pool_positive == 1, 2), # 2 samples which weren't pooled + positive 
         c = replace(c, n == 2 & pool_positive == 0, 3), # 3 samples pooled, negatives 
         c = replace(c, n == 2 & pool_positive == 1, 4)) %>% 
  uncount(n))

# put the outcome variable, and information about pooling into a list 
simple_dat_l_pld <- list(bp = as.numeric(simple_dat_pld$pool_positive),
                         c = as.numeric(simple_dat_pld$c))

simple_pool_mod <- ulam(
  alist(
    
    # missclassification
    bp|c == 1 ~ custom(log(1 - p)), 
    bp|c == 2 ~ custom(log(p)), # this is the standard binomial likelihood, for non-pooled samples
    bp|c == 3 ~ custom(log((1-p)^2)), # the lower too are likelihoods for pooled samples 
    bp|c == 4 ~ custom(log(1 - ((1-p)^2))),
    
    logit(p) <- a,
    
    # intercept 
    a ~ dnorm(-1, 1)
    
  ), data = simple_dat_l_pld, chains = 4, cores = 4
)

post <- as.data.frame(extract.samples(simple_pool_mod))

# how well does the model predict prevalence ? 
ggplot(post) +
  geom_density(aes(x = inv_logit(a))) +
  geom_vline(xintercept = sum(simple_dat$positive)/length(simple_dat$pool_id))

Part 2: multinomial data simulation

(simple_dat <- tibble(
  pool_id = c(rep(1:20, time = 2), 21:30)) %>% 
  mutate(intercept_a = 0.1,
         intercept_b = 0.2) %>% 
  expand_grid(site = 1:5) %>% 
  mutate(positive_a = rbinom(250, 1, intercept_a),
         positive_b = rbinom(250, 1, intercept_b))) # generate outcomes (positive, negative)

(simple_dat_pld <- simple_dat %>% 
  group_by(pool_id, site) %>% 
  summarise(num_positive_a = sum(positive_a), 
            pool_positive_a = ifelse(num_positive_a >= 1, 1, 0),
            num_positive_b = sum(positive_b), 
            pool_positive_b = ifelse(num_positive_b >= 1, 1, 0)) %>% 
  mutate(n = ifelse(pool_id %in% 1:20, 2, 1),
         ca = 1,  
         ca = replace(ca, n == 1 & pool_positive_a == 1, 2), 
         ca = replace(ca, n == 2 & pool_positive_a == 0, 3), 
         ca = replace(ca, n == 2 & pool_positive_a == 1, 4),
         cb = 1,  
         cb = replace(cb, n == 1 & pool_positive_b == 1, 2), 
         cb = replace(cb, n == 2 & pool_positive_b == 0, 3), 
         cb = replace(cb, n == 2 & pool_positive_b == 1, 4)) %>% 
  uncount(n))