Hi,
I am evaulating a screening programme, where there is a two-stage process:
-
test: gives a continous score from 0 to 1. Iftestscore is >=0.88, then the participant progresses to a definitive test to ascertain disease status -
disease: based on another test, the person is classifed as having the disease or not.
We also have covariates and tests that are associated with disease status (e.g. test2).
We are interested in the overall accuracy of test for predicting Disease, but our sample is biased because we only know disease status for those with a “positive” test.
Here are some simulated data to demonstrate:
library(copula)
library(tidyverse)
#Simulate some correlated data
set.seed(123)
myCop <- normalCopula(param = c(0.46, 0.27, 0.18),
dim = 3, dispstr = "un")
out <- rCopula(1e5, myCop)
out[, 1] <- qbeta(out[, 1], 8,2)
out[, 2] <- qbinom(out[, 2], size = 1, prob = 0.05)
out[, 3] <- qbinom(out[, 3], size = 1, prob = 0.226)
#prepare the data for modelling
d <- as.data.frame(out) %>%
rename(test=V1, disease=V2, test2=V3) %>%
mutate_at(.vars = vars(disease, test2),
.funs = funs(factor))
#> Warning: `funs()` was deprecated in dplyr 0.8.0.
#> Please use a list of either functions or lambdas:
#>
#> # Simple named list:
#> list(mean = mean, median = median)
#>
#> # Auto named with `tibble::lst()`:
#> tibble::lst(mean, median)
#>
#> # Using lambdas
#> list(~ mean(., trim = .2), ~ median(., na.rm = TRUE))
head(d)
#> test disease test2
#> 1 0.7686354 0 1
#> 2 0.8574327 0 1
#> 3 0.8278272 0 0
#> 4 0.8080611 0 0
#> 5 0.8596687 0 0
#> 6 0.9560706 0 0
#distributions of the data
d %>%
ggplot() +
geom_density(aes(test, fill=disease), alpha=0.3) +
facet_grid(test2~.)

#Now change disease status to missing for all with test score >=0.88
#i.e. disease status only ascertained for those with a high test1 score
d2 <- d %>%
mutate(disease = case_when(test>=0.88 ~ disease,
TRUE ~ NA_integer_))
head(d2)
#> test disease test2
#> 1 0.7686354 <NA> 1
#> 2 0.8574327 <NA> 1
#> 3 0.8278272 <NA> 0
#> 4 0.8080611 <NA> 0
#> 5 0.8596687 <NA> 0
#> 6 0.9560706 0 0
Created on 2021-03-31 by the reprex package (v0.3.0)
Now we want to estimate the overall acuracy of test, but recognising that disease status is missing in a biased manner, and that we potentially have additional information from test2. This where I am struggling. It seems like this could be either be a “missing data” problem, or an problem that could be addressed through latent class analysis.
I understand Stan does not support discrete parameters for estimation of count missing values (see m2 example below). But I am hopeful that someone will have a suggestion for an alternative approach!
library(brms)
#> Loading required package: Rcpp
#> Loading 'brms' package (version 2.14.4). Useful instructions
#> can be found by typing help('brms'). A more detailed introduction
#> to the package is available through vignette('brms_overview').
#>
#> Attaching package: 'brms'
#> The following object is masked from 'package:stats':
#>
#> ar
#Fit a model to the data
#this ignores the missing disease status
m1 <- brm(disease ~ test + test2,
data=d2,
chains=1,
family=bernoulli())
#> Warning: Rows containing NAs were excluded from the model.
#> Compiling Stan program...
#> Trying to compile a simple C file
#> Running /Library/Frameworks/R.framework/Resources/bin/R CMD SHLIB foo.c
#> clang -mmacosx-version-min=10.13 -I"/Library/Frameworks/R.framework/Resources/include" -DNDEBUG -I"/Library/Frameworks/R.framework/Versions/4.0/Resources/library/Rcpp/include/" -I"/Library/Frameworks/R.framework/Versions/4.0/Resources/library/RcppEigen/include/" -I"/Library/Frameworks/R.framework/Versions/4.0/Resources/library/RcppEigen/include/unsupported" -I"/Library/Frameworks/R.framework/Versions/4.0/Resources/library/BH/include" -I"/Library/Frameworks/R.framework/Versions/4.0/Resources/library/StanHeaders/include/src/" -I"/Library/Frameworks/R.framework/Versions/4.0/Resources/library/StanHeaders/include/" -I"/Library/Frameworks/R.framework/Versions/4.0/Resources/library/RcppParallel/include/" -I"/Library/Frameworks/R.framework/Versions/4.0/Resources/library/rstan/include" -DEIGEN_NO_DEBUG -DBOOST_DISABLE_ASSERTS -DBOOST_PENDING_INTEGER_LOG2_HPP -DSTAN_THREADS -DBOOST_NO_AUTO_PTR -include '/Library/Frameworks/R.framework/Versions/4.0/Resources/library/StanHeaders/include/stan/math/prim/mat/fun/Eigen.hpp' -D_REENTRANT -DRCPP_PARALLEL_USE_TBB=1 -I/usr/local/include -fPIC -Wall -g -O2 -c foo.c -o foo.o
#> In file included from <built-in>:1:
#> In file included from /Library/Frameworks/R.framework/Versions/4.0/Resources/library/StanHeaders/include/stan/math/prim/mat/fun/Eigen.hpp:13:
#> In file included from /Library/Frameworks/R.framework/Versions/4.0/Resources/library/RcppEigen/include/Eigen/Dense:1:
#> In file included from /Library/Frameworks/R.framework/Versions/4.0/Resources/library/RcppEigen/include/Eigen/Core:88:
#> /Library/Frameworks/R.framework/Versions/4.0/Resources/library/RcppEigen/include/Eigen/src/Core/util/Macros.h:628:1: error: unknown type name 'namespace'
#> namespace Eigen {
#> ^
#> /Library/Frameworks/R.framework/Versions/4.0/Resources/library/RcppEigen/include/Eigen/src/Core/util/Macros.h:628:16: error: expected ';' after top level declarator
#> namespace Eigen {
#> ^
#> ;
#> In file included from <built-in>:1:
#> In file included from /Library/Frameworks/R.framework/Versions/4.0/Resources/library/StanHeaders/include/stan/math/prim/mat/fun/Eigen.hpp:13:
#> In file included from /Library/Frameworks/R.framework/Versions/4.0/Resources/library/RcppEigen/include/Eigen/Dense:1:
#> /Library/Frameworks/R.framework/Versions/4.0/Resources/library/RcppEigen/include/Eigen/Core:96:10: fatal error: 'complex' file not found
#> #include <complex>
#> ^~~~~~~~~
#> 3 errors generated.
#> make: *** [foo.o] Error 1
#> Start sampling
#>
#> SAMPLING FOR MODEL 'fc5c5b8a5161ca9226ca5d50c5c16cee' NOW (CHAIN 1).
#> Chain 1:
#> Chain 1: Gradient evaluation took 0.000717 seconds
#> Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 7.17 seconds.
#> Chain 1: Adjust your expectations accordingly!
#> Chain 1:
#> Chain 1:
#> Chain 1: Iteration: 1 / 2000 [ 0%] (Warmup)
#> Chain 1: Iteration: 200 / 2000 [ 10%] (Warmup)
#> Chain 1: Iteration: 400 / 2000 [ 20%] (Warmup)
#> Chain 1: Iteration: 600 / 2000 [ 30%] (Warmup)
#> Chain 1: Iteration: 800 / 2000 [ 40%] (Warmup)
#> Chain 1: Iteration: 1000 / 2000 [ 50%] (Warmup)
#> Chain 1: Iteration: 1001 / 2000 [ 50%] (Sampling)
#> Chain 1: Iteration: 1200 / 2000 [ 60%] (Sampling)
#> Chain 1: Iteration: 1400 / 2000 [ 70%] (Sampling)
#> Chain 1: Iteration: 1600 / 2000 [ 80%] (Sampling)
#> Chain 1: Iteration: 1800 / 2000 [ 90%] (Sampling)
#> Chain 1: Iteration: 2000 / 2000 [100%] (Sampling)
#> Chain 1:
#> Chain 1: Elapsed Time: 5.08442 seconds (Warm-up)
#> Chain 1: 3.76533 seconds (Sampling)
#> Chain 1: 8.84975 seconds (Total)
#> Chain 1:
conditional_effects(m1)


#Now try to fit a model imputing missing disease status
m2 <- brm(disease | mi() ~ test + test2,
data=d2,
chains=1,
family=bernoulli())
#> Error: Argument 'mi' is not supported for family 'bernoulli(logit)'.
Created on 2021-03-31 by the reprex package (v0.3.0)