Double definition of response variable allowed?


(apologies for a rather long-ish post, I wanted to provide all information necessary to understand my problem)

I am working on a rather complex model that attempts to address multiple goals, and am getting pushback from journal reviewers on a specific aspect of the model.
I am trying to find out whether what I’m doing is technically appropriate, but my technical background as a disease ecologist is not strong enough to solve this problem so I am hoping to get some feedback from an expert.

The essence of the issue is whether it is allowed/appropriate to define an outcome variable twice.
I’m providing a simplified version of this aspect of the model below, including R code to simulate data and test the model.

I have two different sets of binary (neg/pos) samples, y1 and y2,
both collected during a number of sessions on times t.
Prevalence (the proportion of positives) is different for each session.
For set y1, I have covariate data X, for y2 there are no covariates.
Crucially, the assay used to determine the status of y1 samples has a small probability (psi) of testing false negative. The assay used for y2 can be considered 100% reliable.

Model goals:

  1. Estimate prevalence (the proportion of positives) at each session.
  2. Estimate the false negative rate.
  3. Estimate the covariate coefficients.

When estimating the covariate effects, it is important to account for the effect of false negatives, as not doing so would result in biased effect estimates.
And I want prevalence to be informed by both y1 and y2.
This is a key reason to model all 3 parameters at the same time.

It is not clear to me whether this formulation, where y1 is used twice as an outcome variable, is allowed. One reviewer says it is not allowed to define a random variable twice. My understanding however is that these two models can both be included, as they don’t use the same data twice to inform a parameter, and thus they are basically separate and just contribute to the overall likelihood.
The model works great, and recovers the parameters well, but is this formulation generally acceptable, or should I find a different solution (if so, any suggestions?)?
And if this approach is fine, how can I argue this so that I can convince those who don’t agree?

I’d be extremely grateful for any feedback, I always want to learn and I am working hard to get my skillset up to speed.

Thank you in advance,


Stan and R simulation code below:

data {
       int<lower=0> N;
       int y1[N];
       int y2[N];
       int<lower=1> Nt;
       int t[N];
       int K;
       matrix[N,K] X;

parameters {
       real<lower=0,upper=1> prev[Nt];
       real<lower=0,upper=1> psi;
       real beta0[Nt];
       vector[K] beta;

model {
       beta0 ~ normal(0,10);
       beta ~ normal(0,10);
       psi ~ uniform(0,1);
       prev ~ uniform(0,1);
       for(i in 1:N) {
              y1[i] ~ bernoulli((1 - psi) * inv_logit(beta0[t[i]] + X[i,] * beta));
              y1[i] ~ bernoulli((1 - psi) * prev[t[i]]);
              y2[i] ~ bernoulli(prev[t[i]]);

R simulation code:

# simulate data    
Nt = 5  # number of sampling times
prev = runif(Nt,0,1)  # prevalence at each sampling time
N = 500    # number of samples at each sampling time
t = rep(1:Nt, each = N)     # index of sampling times

# simulate y1, which includes false negatives and covariate data
X = matrix(data = rnorm(N*Nt, 0,1), ncol=1)
beta0 = rnorm(Nt, 0,2)
beta = 2.4
mu = rep(beta0, each=N) + X[,1] * beta = exp(mu)/(1+exp(mu))

y1 = rbinom(n = N*Nt, 1,    # simulate sample status
psi = 0.17   # P(false negative)
false.neg.idx = sample(which(y1==1), round(sum(y1==1) * psi), replace=F)
y1[false.neg.idx] = 0

# calculate resulting prevalence per sampling time
prev = numeric(Nt)
for(i in 1:Nt) prev[i] = sum(y2[which(t==i)]==1)/N

# data for stan 
data.stan = list(
       N = N*Nt,
       y1 = y1,
       y2 = y2,
       Nt = Nt,
       t = t,
       K = ncol(X),
       X = X