# Missing data/latent variable problem

Hi,

I am evaluating a screening programme with a two-stage process:

• `test`: gives a continuous score from 0 to 1. If the `test` score is >= 0.88, the participant progresses to a definitive test to ascertain disease status
• `disease`: based on the definitive test, the person is classified as having the disease or not.

We also have covariates and tests that are associated with disease status (e.g. `test2`).

We are interested in the overall accuracy of `test` for predicting `disease`, but our sample is biased because we only know `disease` status for those with a "positive" `test`.

Here are some simulated data to demonstrate:

```r
library(copula)
library(tidyverse)

# Simulate some correlated data
set.seed(123)
myCop <- normalCopula(param = c(0.46, 0.27, 0.18),
                      dim = 3, dispstr = "un")
out <- rCopula(1e5, myCop)
out[, 1] <- qbeta(out[, 1], 8, 2)
out[, 2] <- qbinom(out[, 2], size = 1, prob = 0.05)
out[, 3] <- qbinom(out[, 3], size = 1, prob = 0.226)

# Prepare the data for modelling
d <- as.data.frame(out) %>%
  rename(test = V1, disease = V2, test2 = V3) %>%
  mutate(across(c(disease, test2), factor))

head(d)
#>        test disease test2
#> 1 0.7686354       0     1
#> 2 0.8574327       0     1
#> 3 0.8278272       0     0
#> 4 0.8080611       0     0
#> 5 0.8596687       0     0
#> 6 0.9560706       0     0

# Distributions of the data
d %>%
  ggplot() +
  geom_density(aes(test, fill = disease), alpha = 0.3) +
  facet_grid(test2 ~ .)
```

```r
# Now set disease status to missing for all with a test score < 0.88,
# i.e. disease status is only ascertained for those with a high test score
d2 <- d %>%
  mutate(disease = case_when(test >= 0.88 ~ disease))

head(d2)
#>        test disease test2
#> 1 0.7686354    <NA>     1
#> 2 0.8574327    <NA>     1
#> 3 0.8278272    <NA>     0
#> 4 0.8080611    <NA>     0
#> 5 0.8596687    <NA>     0
#> 6 0.9560706       0     0
```

Created on 2021-03-31 by the reprex package (v0.3.0)

Now we want to estimate the overall accuracy of `test`, while recognising that disease status is missing in a biased manner and that we potentially have additional information from `test2`. This is where I am struggling. It seems like this could either be a "missing data" problem or a problem that could be addressed through latent class analysis.

I understand `Stan` does not support discrete parameters, so the missing binary disease values cannot be estimated directly (see the `m2` example below). But I am hopeful that someone will have a suggestion for an alternative approach!

```r
library(brms)
#> Loading required package: Rcpp
#> Loading 'brms' package (version 2.14.4). Useful instructions
#> can be found by typing help('brms'). A more detailed introduction
#> to the package is available through vignette('brms_overview').
#>
#> Attaching package: 'brms'
#> The following object is masked from 'package:stats':
#>
#>     ar

# Fit a model to the data
# This ignores the missing disease status
m1 <- brm(disease ~ test + test2,
          data = d2,
          chains = 1,
          family = bernoulli())
#> Warning: Rows containing NAs were excluded from the model.
#> Compiling Stan program...
#> Start sampling
#>
#> SAMPLING FOR MODEL 'fc5c5b8a5161ca9226ca5d50c5c16cee' NOW (CHAIN 1).
#> Chain 1:
#> Chain 1: Gradient evaluation took 0.000717 seconds
#> Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 7.17 seconds.
#> Chain 1: Adjust your expectations accordingly!
#> Chain 1:
#> Chain 1:
#> Chain 1: Iteration:    1 / 2000 [  0%]  (Warmup)
#> Chain 1: Iteration:  200 / 2000 [ 10%]  (Warmup)
#> Chain 1: Iteration:  400 / 2000 [ 20%]  (Warmup)
#> Chain 1: Iteration:  600 / 2000 [ 30%]  (Warmup)
#> Chain 1: Iteration:  800 / 2000 [ 40%]  (Warmup)
#> Chain 1: Iteration: 1000 / 2000 [ 50%]  (Warmup)
#> Chain 1: Iteration: 1001 / 2000 [ 50%]  (Sampling)
#> Chain 1: Iteration: 1200 / 2000 [ 60%]  (Sampling)
#> Chain 1: Iteration: 1400 / 2000 [ 70%]  (Sampling)
#> Chain 1: Iteration: 1600 / 2000 [ 80%]  (Sampling)
#> Chain 1: Iteration: 1800 / 2000 [ 90%]  (Sampling)
#> Chain 1: Iteration: 2000 / 2000 [100%]  (Sampling)
#> Chain 1:
#> Chain 1:  Elapsed Time: 5.08442 seconds (Warm-up)
#> Chain 1:                3.76533 seconds (Sampling)
#> Chain 1:                8.84975 seconds (Total)
#> Chain 1:

conditional_effects(m1)
```

```r
# Now try to fit a model imputing missing disease status
m2 <- brm(disease | mi() ~ test + test2,
          data = d2,
          chains = 1,
          family = bernoulli())
#> Error: Argument 'mi' is not supported for family 'bernoulli(logit)'.
```
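For what it's worth, one workaround that does run is to impute the missing responses with `mice` and pool the fits with `brm_multiple()`. A minimal sketch (`m = 5` imputations is an arbitrary choice of mine), though as far as I understand this imputes under a missing-at-random assumption, which is exactly what fails in our verification-biased design, so I show it only for completeness:

```r
library(mice)
library(brms)

# Impute disease from the fully observed test and test2
# (mice picks logistic regression for a two-level factor by default)
imp <- mice(d2, m = 5, printFlag = FALSE)

# Fit the model to each completed dataset and pool the draws
m3 <- brm_multiple(disease ~ test + test2,
                   data = imp,
                   chains = 1,
                   family = bernoulli())
```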

Created on 2021-03-31 by the reprex package (v0.3.0)


Hi, sorry for not getting back to you earlier; your question is relevant and well written.

Unfortunately this seems to me like a very tough problem that lies mostly in the design of the data collection. If you had at least some data on what `disease` looks like for people with lower test scores, the whole thing would be reasonably easy. Even incorporating such data from different studies/populations/… might be easier than trying to answer the question from the data you have. Alternatively, if you have some idea of the expected total prevalence in your population, we could use that to infer the number of cases we missed.
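To sketch how such a prevalence assumption could enter a model (this is only an illustration, not a tested solution: the `expected_cases` input and the Poisson approximation to the model-implied total case count are my assumptions):

```stan
data {
  int<lower=0> N_obs;                    // verified participants (test >= 0.88)
  vector[N_obs] test_obs;
  int<lower=0, upper=1> disease[N_obs];
  int<lower=0> N_all;                    // all screened participants
  vector[N_all] test_all;
  int<lower=0> expected_cases;           // assumed total cases, from external prevalence
}
parameters {
  real alpha;
  real beta;
}
model {
  alpha ~ normal(0, 2.5);
  beta ~ normal(0, 2.5);
  // logistic relationship fit only where disease status was ascertained
  disease ~ bernoulli_logit(alpha + beta * test_obs);
  // soft aggregate constraint: the model-implied number of cases in the full
  // sample should match the externally assumed count (a strong, untestable
  // assumption that effectively pins down the extrapolation)
  expected_cases ~ poisson(sum(inv_logit(alpha + beta * test_all)));
}
```

The whole weight of the extrapolation then rests on the assumed prevalence and the assumed linear log-odds relationship, so this should be stress-tested over a range of plausible `expected_cases` values.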

Without additional data, I think you will need to bring a lot of untestable assumptions to the table to be able to do anything. You basically need to extrapolate the relationship between `test` and `disease` (and potentially `test2`) from cases where `test >= 0.88` to the rest of the cases, and extrapolations are tough. E.g. even if there were a clean linear relationship between the log odds of `disease` and `test` for high test values, how could we justify extrapolating this line all the way down to, say, `test = 0.25`?

If you can use knowledge of the particular disease and test to constrain the possible form of the relationships, that might help.

It seems to me that your case is very similar to the one discussed at: MICE missing values on the response. If that is right, then the best you can do is fit the model to the cases where you have the response and hope it generalizes, but as I said above, I don't think such hope would be easy to justify.

Best of luck with your model!

Hi, thanks for the reply.

I had feared that this was the case.

We are trying some alternative approaches, so I will post back here if any success.

All the best.


I realized that I worded my reply a bit too strongly: everything I wrote is my opinion and understanding, and it is quite possible I am missing something. I don't have expertise in this kind of problem and was just trying to use some statistical common sense, which is definitely fallible, so please double-check that what I wrote makes sense to you and ask if you feel there are potential holes or mistakes in the argument.


Could this help?


Not at all - very helpful to think through these issues, and very grateful for the insights.


Thank you - that is exactly the way our thinking is evolving. Once we have ironed out some of the code, I will post back with how we get on. Thanks!
