Model for binary outcome with misclassification

I have data with a binary outcome that I would like to model allowing for misclassification in the outcome. I have no validation data and no idea what the actual misclassification rate might be.

A very brief search turned up a paper by Liu and Zhang ( and the accompanying R package logistic4p.

So I’ve been trying to implement that model in Stan, to play with it and see if it helps me.

I am neither mathematician nor computer scientist. I do not really understand the Stan language (yet), and do most of my modelling through brms.

So I am looking for advice on implementing the model, and extending it if possible.

I have managed to create a model that (sort of) works, and gives parameter estimates similar to the logistic4p model in the logistic4p::nlsy example data. (See below for model and R script to compare with logistic4p.)

But there are problems.
I run 8 chains; only 7 complete. The other gives the error:
[1] “Error in sampler$call_sampler(args_list[[i]]) : "
[2] " c++ exception (unknown reason)”
error occurred during calling the sampler; sampling not done

I have no idea what this error means, or how to fix it.

I am also getting messages about initial values being rejected. I think I don’t know how to implement constraints on parameters properly.

Model so far has no multilevel structure, but my real data does.
Ultimately, I would like the misclassification rates to depend on cluster-level variables (and my wife wants a horse).

If anyone can help with this model, or direct me to a better model for this purpose, I will be grateful.



===== Stan model =====
functions {
data {
int<lower=1> N; // total number of observations
int Y[N]; // response variable
int<lower=1> K; // number of population-level effects
matrix[N, K] X; // population-level design matrix
transformed data {
parameters {

vector[K] b; // population-level effects
real<lower=0.0, upper=1.0> FP; // false positive
// real<lower=0.0, upper=1.0> FN; // false negative
real<lower=0.0, upper=1.0> r2;
transformed parameters {
// vector[N] nu = Xc * b + temp_Intercept;
// vector[N] F = inv_logit(nu);
real<lower=0.0, upper=1.0> FN = r2 - FP; // if I use (1 - FP - FN) in following line, then b parameters are very poorly estimated (Neff = 4, Rhat > 20, after 40000 iterations)
vector[N] mu = logit(FP + (1 - r2)inv_logit(X * b));
// vector[N] mu=X
model {
FP ~ normal(0, .5);
r2 ~ normal(0, .5);
b ~ normal(0, 1);
Y ~ bernoulli_logit(mu); // model

generated quantities {

===== Script to compare with logistic4p =====

y=nlsy[, 1]
modl=logistic4p(x,y, model=‘fp.fn’)

dta = list()
dta$X= model.matrix(marijuana ~ gender + smoke + residence + peer, data=nlsy)
dta$N = nrow(dta$X)
dta$K = ncol(dta$X)
mdl = stan(“mdl.stan”, iter=100, chains=8, data=dta)


print(mdl, pars=“mu”, include=FALSE, digits_summary=4)

Upgrade to the latest version of RStan