Hi,
I am evaulating a screening programme, where there is a twostage process:

test
: gives a continous score from 0 to 1. Iftest
score is >=0.88, then the participant progresses to a definitive test to ascertain disease status 
disease
: based on another test, the person is classifed as having the disease or not.
We also have covariates and tests that are associated with disease status (e.g. test2
).
We are interested in the overall accuracy of test
for predicting Disease
, but our sample is biased because we only know disease
status for those with a â€śpositiveâ€ť test
.
Here are some simulated data to demonstrate:
library(copula)
library(tidyverse)
#Simulate some correlated data
set.seed(123)
myCop < normalCopula(param = c(0.46, 0.27, 0.18),
dim = 3, dispstr = "un")
out < rCopula(1e5, myCop)
out[, 1] < qbeta(out[, 1], 8,2)
out[, 2] < qbinom(out[, 2], size = 1, prob = 0.05)
out[, 3] < qbinom(out[, 3], size = 1, prob = 0.226)
#prepare the data for modelling
d < as.data.frame(out) %>%
rename(test=V1, disease=V2, test2=V3) %>%
mutate_at(.vars = vars(disease, test2),
.funs = funs(factor))
#> Warning: `funs()` was deprecated in dplyr 0.8.0.
#> Please use a list of either functions or lambdas:
#>
#> # Simple named list:
#> list(mean = mean, median = median)
#>
#> # Auto named with `tibble::lst()`:
#> tibble::lst(mean, median)
#>
#> # Using lambdas
#> list(~ mean(., trim = .2), ~ median(., na.rm = TRUE))
head(d)
#> test disease test2
#> 1 0.7686354 0 1
#> 2 0.8574327 0 1
#> 3 0.8278272 0 0
#> 4 0.8080611 0 0
#> 5 0.8596687 0 0
#> 6 0.9560706 0 0
#distributions of the data
d %>%
ggplot() +
geom_density(aes(test, fill=disease), alpha=0.3) +
facet_grid(test2~.)
#Now change disease status to missing for all with test score >=0.88
#i.e. disease status only ascertained for those with a high test1 score
d2 < d %>%
mutate(disease = case_when(test>=0.88 ~ disease,
TRUE ~ NA_integer_))
head(d2)
#> test disease test2
#> 1 0.7686354 <NA> 1
#> 2 0.8574327 <NA> 1
#> 3 0.8278272 <NA> 0
#> 4 0.8080611 <NA> 0
#> 5 0.8596687 <NA> 0
#> 6 0.9560706 0 0
^{Created on 20210331 by the reprex package (v0.3.0)}
Now we want to estimate the overall acuracy of test
, but recognising that disease status is missing in a biased manner, and that we potentially have additional information from test2
. This where I am struggling. It seems like this could be either be a â€śmissing dataâ€ť problem, or an problem that could be addressed through latent class analysis.
I understand Stan
does not support discrete parameters for estimation of count missing values (see m2
example below). But I am hopeful that someone will have a suggestion for an alternative approach!
library(brms)
#> Loading required package: Rcpp
#> Loading 'brms' package (version 2.14.4). Useful instructions
#> can be found by typing help('brms'). A more detailed introduction
#> to the package is available through vignette('brms_overview').
#>
#> Attaching package: 'brms'
#> The following object is masked from 'package:stats':
#>
#> ar
#Fit a model to the data
#this ignores the missing disease status
m1 < brm(disease ~ test + test2,
data=d2,
chains=1,
family=bernoulli())
#> Warning: Rows containing NAs were excluded from the model.
#> Compiling Stan program...
#> Trying to compile a simple C file
#> Running /Library/Frameworks/R.framework/Resources/bin/R CMD SHLIB foo.c
#> clang mmacosxversionmin=10.13 I"/Library/Frameworks/R.framework/Resources/include" DNDEBUG I"/Library/Frameworks/R.framework/Versions/4.0/Resources/library/Rcpp/include/" I"/Library/Frameworks/R.framework/Versions/4.0/Resources/library/RcppEigen/include/" I"/Library/Frameworks/R.framework/Versions/4.0/Resources/library/RcppEigen/include/unsupported" I"/Library/Frameworks/R.framework/Versions/4.0/Resources/library/BH/include" I"/Library/Frameworks/R.framework/Versions/4.0/Resources/library/StanHeaders/include/src/" I"/Library/Frameworks/R.framework/Versions/4.0/Resources/library/StanHeaders/include/" I"/Library/Frameworks/R.framework/Versions/4.0/Resources/library/RcppParallel/include/" I"/Library/Frameworks/R.framework/Versions/4.0/Resources/library/rstan/include" DEIGEN_NO_DEBUG DBOOST_DISABLE_ASSERTS DBOOST_PENDING_INTEGER_LOG2_HPP DSTAN_THREADS DBOOST_NO_AUTO_PTR include '/Library/Frameworks/R.framework/Versions/4.0/Resources/library/StanHeaders/include/stan/math/prim/mat/fun/Eigen.hpp' D_REENTRANT DRCPP_PARALLEL_USE_TBB=1 I/usr/local/include fPIC Wall g O2 c foo.c o foo.o
#> In file included from <builtin>:1:
#> In file included from /Library/Frameworks/R.framework/Versions/4.0/Resources/library/StanHeaders/include/stan/math/prim/mat/fun/Eigen.hpp:13:
#> In file included from /Library/Frameworks/R.framework/Versions/4.0/Resources/library/RcppEigen/include/Eigen/Dense:1:
#> In file included from /Library/Frameworks/R.framework/Versions/4.0/Resources/library/RcppEigen/include/Eigen/Core:88:
#> /Library/Frameworks/R.framework/Versions/4.0/Resources/library/RcppEigen/include/Eigen/src/Core/util/Macros.h:628:1: error: unknown type name 'namespace'
#> namespace Eigen {
#> ^
#> /Library/Frameworks/R.framework/Versions/4.0/Resources/library/RcppEigen/include/Eigen/src/Core/util/Macros.h:628:16: error: expected ';' after top level declarator
#> namespace Eigen {
#> ^
#> ;
#> In file included from <builtin>:1:
#> In file included from /Library/Frameworks/R.framework/Versions/4.0/Resources/library/StanHeaders/include/stan/math/prim/mat/fun/Eigen.hpp:13:
#> In file included from /Library/Frameworks/R.framework/Versions/4.0/Resources/library/RcppEigen/include/Eigen/Dense:1:
#> /Library/Frameworks/R.framework/Versions/4.0/Resources/library/RcppEigen/include/Eigen/Core:96:10: fatal error: 'complex' file not found
#> #include <complex>
#> ^~~~~~~~~
#> 3 errors generated.
#> make: *** [foo.o] Error 1
#> Start sampling
#>
#> SAMPLING FOR MODEL 'fc5c5b8a5161ca9226ca5d50c5c16cee' NOW (CHAIN 1).
#> Chain 1:
#> Chain 1: Gradient evaluation took 0.000717 seconds
#> Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 7.17 seconds.
#> Chain 1: Adjust your expectations accordingly!
#> Chain 1:
#> Chain 1:
#> Chain 1: Iteration: 1 / 2000 [ 0%] (Warmup)
#> Chain 1: Iteration: 200 / 2000 [ 10%] (Warmup)
#> Chain 1: Iteration: 400 / 2000 [ 20%] (Warmup)
#> Chain 1: Iteration: 600 / 2000 [ 30%] (Warmup)
#> Chain 1: Iteration: 800 / 2000 [ 40%] (Warmup)
#> Chain 1: Iteration: 1000 / 2000 [ 50%] (Warmup)
#> Chain 1: Iteration: 1001 / 2000 [ 50%] (Sampling)
#> Chain 1: Iteration: 1200 / 2000 [ 60%] (Sampling)
#> Chain 1: Iteration: 1400 / 2000 [ 70%] (Sampling)
#> Chain 1: Iteration: 1600 / 2000 [ 80%] (Sampling)
#> Chain 1: Iteration: 1800 / 2000 [ 90%] (Sampling)
#> Chain 1: Iteration: 2000 / 2000 [100%] (Sampling)
#> Chain 1:
#> Chain 1: Elapsed Time: 5.08442 seconds (Warmup)
#> Chain 1: 3.76533 seconds (Sampling)
#> Chain 1: 8.84975 seconds (Total)
#> Chain 1:
conditional_effects(m1)
#Now try to fit a model imputing missing disease status
m2 < brm(disease  mi() ~ test + test2,
data=d2,
chains=1,
family=bernoulli())
#> Error: Argument 'mi' is not supported for family 'bernoulli(logit)'.
^{Created on 20210331 by the reprex package (v0.3.0)}