It’s not surprising why the mixture and adjusted models fit poorly, since they’re unrelated to your original data generation. As to the puzzle of why the unadjusted model fits well… My original hunch (which I lost while in the weeds exploring your original question) was that
u ~ bernoulli(theta)
leads to an unbiased estimate of theta
y ~ bernoulli((1 - psi) * theta)
leads to an unbiased estimate of psi
, conditional on theta
being unbiased
y ~ bernoulli((1 - psi) * mu)
leads to an unbiased estimate of mu
, given psi
is unbiased
So that might lead to unbiased estimates overall, even if the unadjusted double-likelihood is conceptually and statistically odd. From that perspective, it’s not surprising that you’d get unbiased results.
But I bet you’d also get results that do not have adequate coverage. Have you checked to see whether 95% of your (95%) credible intervals include the true values? The unadjusted model is as if you actually drew two y
values per y
. So an analogy would be if you ran a logistic regression and just doubled every observation. You’d still get unbiased results, but your intervals would be too narrow.
Edit: I’ve been running some simulations. From what I’m seeing, phi (and, to a lesser-extent, theta) don’t have sufficient coverage when the sample size is “small” (say, 60 or fewer). But I wasn’t seeing any immediate issues when using 1,000 observations. Other than sample size, I used the same parameters values as you specified, though used the integrate
function to find the value of theta
given the distribution of X.
inv_logit <- function(x){
1 / (1 + exp(-x))
}
mean_logit <- function(alpha, beta, X_mu = 0, X_sigma = 1){
# probability of y given X (inv_logit) weighted by likelihood of X (dnorm)
f <- function(x){
dnorm(x, X_mu, X_sigma) * inv_logit(alpha + beta * x)
}
# Integrate to find average probability of y given distribution of X
integrate(f, -Inf, Inf)
}
generate_data <- function(n = 1000, psi = 0.23, alpha = -0.5, beta = 0.01, X_mu = 0, X_sd = 1){
# Observations recorded without error (but without covariate data)
theta = mean_logit(alpha, beta, X_mu, X_sd)$value
u = rbinom(n, 1, theta)
# Observations recorded with covariate data (but also with error)
X = rnorm(n, X_mu, X_sd)
mu = inv_logit(alpha + beta * X)
y = rbinom(n, size = 1, prob = (1 - psi) * mu)
list(
params = list('n' = n,
'theta' = theta,
'psi' = psi,
'alpha' = alpha,
'beta' = beta,
'X_mu' = X_mu,
'X_sd' = X_sd),
data = list(
N = n,
y = y,
u = u,
X = X)
)
}