I have data with a binary outcome that I would like to model allowing for misclassification in the outcome. I have no validation data and no idea what the actual misclassification rate might be.
A very brief search turned up a paper by Liu and Zhang (http://doi.org/10.1007/s41237-017-0031-y) and the accompanying R package logistic4p.
So I’ve been trying to implement that model in Stan, to play with it and see if it helps me.
I am neither mathematician nor computer scientist. I do not really understand the Stan language (yet), and do most of my modelling through brms.
So I am looking for advice on implementing the model, and extending it if possible.
I have managed to create a model that (sort of) works, and gives parameter estimates similar to the logistic4p model in the logistic4p::nlsy example data. (See below for model and R script to compare with logistic4p.)
But there are problems.
I run 8 chains; only 7 complete. The other gives the error:
[1] “Error in sampler$call_sampler(args_list[[i]]) : "
[2] " c++ exception (unknown reason)”
error occurred during calling the sampler; sampling not done
I have no idea what this error means, or how to fix it.
I am also getting messages about initial values being rejected. I think I don’t know how to implement constraints on parameters properly.
Model so far has no multilevel structure, but my real data does.
Ultimately, I would like the misclassification rates to depend on cluster-level variables (and my wife wants a horse).
If anyone can help with this model, or direct me to a better model for this purpose, I will be grateful.
Thanks
Tim
===== Stan model =====
functions {
}
data {
int<lower=1> N; // total number of observations
int Y[N]; // response variable
int<lower=1> K; // number of population-level effects
matrix[N, K] X; // population-level design matrix
}
transformed data {
}
parameters {
vector[K] b; // population-level effects
real<lower=0.0, upper=1.0> FP; // false positive
// real<lower=0.0, upper=1.0> FN; // false negative
real<lower=0.0, upper=1.0> r2;
}
transformed parameters {
// vector[N] nu = Xc * b + temp_Intercept;
// vector[N] F = inv_logit(nu);
real<lower=0.0, upper=1.0> FN = r2 - FP; // if I use (1 - FP - FN) in following line, then b parameters are very poorly estimated (Neff = 4, Rhat > 20, after 40000 iterations)
vector[N] mu = logit(FP + (1 - r2)inv_logit(X * b));
// vector[N] mu=Xb;
}
model {
FP ~ normal(0, .5);
r2 ~ normal(0, .5);
b ~ normal(0, 1);
Y ~ bernoulli_logit(mu); // model
}
generated quantities {
}
===== Script to compare with logistic4p =====
library(logistic4p)
library(rstan)
options(mc.cores=parallel::detectCores())
data(nlsy)
y=nlsy[, 1]
x=nlsy[,-1]
modl=logistic4p(x,y, model=‘fp.fn’)
dta = list()
dta$Y=nlsy[,1]
dta$X= model.matrix(marijuana ~ gender + smoke + residence + peer, data=nlsy)
dta$N = nrow(dta$X)
dta$K = ncol(dta$X)
set.seed(20180405)
mdl = stan(“mdl.stan”, iter=100, chains=8, data=dta)
modl
print(mdl, pars=“mu”, include=FALSE, digits_summary=4)