I have data with a binary outcome that I would like to model allowing for misclassification in the outcome. I have no validation data and no idea what the actual misclassification rate might be.

A very brief search turned up a paper by Liu and Zhang (http://doi.org/10.1007/s41237-017-0031-y) and the accompanying R package logistic4p.

So I’ve been trying to implement that model in Stan, to play with it and see if it helps me.

I am neither mathematician nor computer scientist. I do not really understand the Stan language (yet), and do most of my modelling through brms.

So I am looking for advice on implementing the model, and extending it if possible.

I have managed to create a model that (sort of) works, and gives parameter estimates similar to the logistic4p model in the logistic4p::nlsy example data. (See below for model and R script to compare with logistic4p.)

But there are problems.

I run 8 chains; only 7 complete. The other gives the error:

[1] “Error in sampler$call_sampler(args_list[[i]]) : "

[2] " c++ exception (unknown reason)”

error occurred during calling the sampler; sampling not done

I have no idea what this error means, or how to fix it.

I am also getting messages about initial values being rejected. I think I don’t know how to implement constraints on parameters properly.

Model so far has no multilevel structure, but my real data does.

Ultimately, I would like the misclassification rates to depend on cluster-level variables (and my wife wants a horse).

If anyone can help with this model, or direct me to a better model for this purpose, I will be grateful.

Thanks

Tim

===== Stan model =====

functions {

}

data {

int<lower=1> N; // total number of observations

int Y[N]; // response variable

int<lower=1> K; // number of population-level effects

matrix[N, K] X; // population-level design matrix

}

transformed data {

}

parameters {

vector[K] b; // population-level effects

real<lower=0.0, upper=1.0> FP; // false positive

// real<lower=0.0, upper=1.0> FN; // false negative

real<lower=0.0, upper=1.0> r2;

}

transformed parameters {

// vector[N] nu = Xc * b + temp_Intercept;

// vector[N] F = inv_logit(nu);

real<lower=0.0, upper=1.0> FN = r2 - FP; // if I use (1 - FP - FN) in following line, then b parameters are very poorly estimated (Neff = 4, Rhat > 20, after 40000 iterations)

vector[N] mu = logit(FP + (1 - r2)*inv_logit(X * b));
// vector[N] mu=X*b;

}

model {

FP ~ normal(0, .5);

r2 ~ normal(0, .5);

b ~ normal(0, 1);

Y ~ bernoulli_logit(mu); // model

}

generated quantities {

}

===== Script to compare with logistic4p =====

library(logistic4p)

library(rstan)

options(mc.cores=parallel::detectCores())

data(nlsy)

y=nlsy[, 1]

x=nlsy[,-1]

modl=logistic4p(x,y, model=‘fp.fn’)

dta = list()

dta$Y=nlsy[,1]

dta$X= model.matrix(marijuana ~ gender + smoke + residence + peer, data=nlsy)

dta$N = nrow(dta$X)

dta$K = ncol(dta$X)

set.seed(20180405)

mdl = stan(“mdl.stan”, iter=100, chains=8, data=dta)

modl

print(mdl, pars=“mu”, include=FALSE, digits_summary=4)