Hi Folks,
For some context, I’m working with disease prevalence data in ticks. To reduce costs, ticks are sometimes ‘pooled’ and screened for a particular pathogen together. Here, the binomial likelihood function doesn’t work as it’s impossible to know if tick A, B or both are infected. I overcome this using a ‘misclassification model’ as outlined by Richard McElreath’s in his statistical rethinking lecture 17 (2023). https://www.youtube.com/watch?v=mt9WKbQJrI4&t=3224s.
Part 1 of the code bellow shows this implementation in the rethinking package. It seems to work. However, ticks carry multiple diseases. I wondered if anyone could help me to re-write a multinomial ‘equivalent’ custom likelihood function in brms for the simulated data in part 2.
Any help and advice is greatly appreciated,
Thankyou
part 1: data simulation and working ‘rethinking’ model
library(tidyverse)
simple_dat <- tibble(
pool_id = c(rep(1:20, time = 2), 21:30)) %>% # 20 pools of two samples, 10 single samples
mutate(intercept = 0.2) %>% # 20% positive rate
expand_grid(site = 1:5) %>% # increase sample size by x5
mutate(positive = rbinom(250, 1, intercept))) # generate outcomes (positive, negative)
(simple_dat_pld <- simple_dat %>%
group_by(pool_id, site) %>%
summarise(num_positive = sum(positive),
pool_positive = ifelse(num_positive >= 1, 1, 0)) %>% # reduce the level of information, to positive/negative for each pool
mutate(n = ifelse(pool_id %in% 1:20, 2, 1),
# label the different outcome:pool size combinations
c = 1, # 1 labels samples which weren't pooled and were negative
c = replace(c, n == 1 & pool_positive == 1, 2), # 2 samples which weren't pooled + positive
c = replace(c, n == 2 & pool_positive == 0, 3), # 3 samples pooled, negatives
c = replace(c, n == 2 & pool_positive == 1, 4)) %>%
uncount(n))
# put the outcome variable, and information about pooling into a list
simple_dat_l_pld <- list(bp = as.numeric(simple_dat_pld$pool_positive),
c = as.numeric(simple_dat_pld$c))
simple_pool_mod <- ulam(
alist(
# missclassification
bp|c == 1 ~ custom(log(1 - p)),
bp|c == 2 ~ custom(log(p)), # this is the standard binomial likelihood, for non-pooled samples
bp|c == 3 ~ custom(log((1-p)^2)), # the lower too are likelihoods for pooled samples
bp|c == 4 ~ custom(log(1 - ((1-p)^2))),
logit(p) <- a,
# intercept
a ~ dnorm(-1, 1)
), data = simple_dat_l_pld, chains = 4, cores = 4
)
post <- as.data.frame(extract.samples(simple_pool_mod))
# how well does the model predict prevalence ?
ggplot(post) +
geom_density(aes(x = inv_logit(a))) +
geom_vline(xintercept = sum(simple_dat$positive)/length(simple_dat$pool_id))
Part 2: multinomial data simulation
(simple_dat <- tibble(
pool_id = c(rep(1:20, time = 2), 21:30)) %>%
mutate(intercept_a = 0.1,
intercept_b = 0.2) %>%
expand_grid(site = 1:5) %>%
mutate(positive_a = rbinom(250, 1, intercept_a),
positive_b = rbinom(250, 1, intercept_b))) # generate outcomes (positive, negative)
(simple_dat_pld <- simple_dat %>%
group_by(pool_id, site) %>%
summarise(num_positive_a = sum(positive_a),
pool_positive_a = ifelse(num_positive_a >= 1, 1, 0),
num_positive_b = sum(positive_b),
pool_positive_b = ifelse(num_positive_b >= 1, 1, 0)) %>%
mutate(n = ifelse(pool_id %in% 1:20, 2, 1),
ca = 1,
ca = replace(ca, n == 1 & pool_positive_a == 1, 2),
ca = replace(ca, n == 2 & pool_positive_a == 0, 3),
ca = replace(ca, n == 2 & pool_positive_a == 1, 4),
cb = 1,
cb = replace(cb, n == 1 & pool_positive_b == 1, 2),
cb = replace(cb, n == 2 & pool_positive_b == 0, 3),
cb = replace(cb, n == 2 & pool_positive_b == 1, 4)) %>%
uncount(n))