Dear all,
I need to analyze an annoyingly difficult dataset and am at a point where I am somewhat out of my depth and could do with some reassurance. Here’s a simplified version of my problem.
Suppose I have a dataset of N observations, a continuous outcome variable y, and a binary predictor variable x (indicating, for example, which of two interventions an observation resulted from).
data {
  int<lower = 1> N;                           // total number of observations
  vector[N] y;                                // continuous outcome
  int<lower = 0> N_mis;                       // number of missing x values
  int<lower = 0> N_obs;                       // number of observed x values
  int<lower = 1, upper = N> ii_mis[N_mis];    // indices of missing x
  int<lower = 1, upper = N> ii_obs[N_obs];    // indices of observed x
  vector<lower = 0, upper = 1>[N_obs] x_obs;  // observed x values (0 or 1)
  vector<lower = 0>[N_mis] x_alpha;           // beta shape 1 for each missing x
  vector<lower = 0>[N_mis] x_beta;            // beta shape 2 for each missing x
}
Some observations of the binary predictor variable are missing but, for each missing observation, I know (with some uncertainty) the probability that x_i = 1. This information enters the model as a beta distribution, such that x_i ~ Beta(alpha_i, beta_i) for each missing observation i.
This is the relevant Stan code for imputing the missing data:
parameters {
  vector<lower = 0, upper = 1>[N_mis] x_imp;  // imputed probabilities for missing x
}
transformed parameters {
  vector<lower = 0, upper = 1>[N] x;  // combined predictor
  x[ii_mis] = x_imp;                  // fill in imputed values
  x[ii_obs] = x_obs;                  // fill in observed values
}
model {
  x_imp ~ beta(x_alpha, x_beta);      // informative prior on each missing x
}
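As an aside on where x_alpha and x_beta might come from: one common way to construct beta parameters from a probability estimate p and a degree of certainty is the mean/concentration parameterization, where kappa (a pseudo-sample size) expresses certainty. This is a sketch of that idea, not taken from the model above, and the helper name is my own:

```python
def beta_params(p, kappa):
    """Mean/concentration parameterization of the beta distribution:
    mean = alpha / (alpha + beta) = p, concentration = alpha + beta = kappa."""
    alpha = p * kappa
    beta = (1.0 - p) * kappa
    return alpha, beta

# p = 0.8 with kappa = 10 gives roughly Beta(8, 2): mean 0.8,
# tighter around 0.8 as kappa grows.
a, b = beta_params(0.8, 10.0)
```

Larger kappa encodes more confidence in the reported probability; kappa = 2 with p = 0.5 recovers the uniform Beta(1, 1).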
This means that the predictor variable x is now a mixture of some observed values (that are either 0 or 1) and some uncertain probabilities (that have a value between 0 and 1). I want to estimate the mean of y when x = 0 and when x = 1 (and thus the difference between the two means).
This is the Stan code without the lines relating to the missing observations.
parameters {
  vector[2] mu;                // group means (mu[1] for x = 1, mu[2] for x = 0)
  vector<lower = 0>[2] sigma;  // group standard deviations
}
model {
  mu ~ student_t(3, 0, 1);
  sigma ~ student_t(3, 0, 3);
  for (n in 1:N)
    target += log_mix(
      x[n],                                 // mixing weight: Pr(x[n] = 1)
      normal_lpdf(y[n] | mu[1], sigma[1]),  // likelihood if x[n] = 1
      normal_lpdf(y[n] | mu[2], sigma[2])   // likelihood if x[n] = 0
    );
}
generated quantities {
  real delta = mu[1] - mu[2];  // difference between the two group means
}
My rationale is that the log density of an observation should be a weighted combination of the two likelihoods, weighted by the probability that the observation had either value of x.
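For what it's worth, this marginalization can be checked numerically outside Stan. Below is a minimal Python sketch of what log_mix computes (the Python function names are my own, written to mirror the Stan functions):

```python
import math

def normal_lpdf(y, mu, sigma):
    """Log density of y under Normal(mu, sigma), mirroring Stan's normal_lpdf."""
    return (-0.5 * math.log(2 * math.pi) - math.log(sigma)
            - 0.5 * ((y - mu) / sigma) ** 2)

def log_mix(theta, lp1, lp2):
    """Stan's log_mix: log(theta * exp(lp1) + (1 - theta) * exp(lp2)),
    computed stably via log-sum-exp. Assumes 0 < theta < 1."""
    a = math.log(theta) + lp1
    b = math.log1p(-theta) + lp2
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))

# One observation with Pr(x = 1) = 0.3: its log density is a
# probability-weighted mixture of the two group likelihoods.
lp = log_mix(0.3,
             normal_lpdf(0.3, 0.5, 1.0),    # component for x = 1
             normal_lpdf(0.3, -0.5, 1.0))   # component for x = 0
```

As theta approaches 1 the mixture collapses to the first component, which is why the same expression also handles the observed 0/1 values of x.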
Is this the right way to go about it? Is there a better parameterization that I’m missing? I am asking because when I run this model with a lot of observations (60,000+), with some varying effects added, and with a different likelihood function, I run into trouble. (But I wanted to make sure the basic specification is correct before investing more time into the more complex model.)
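One way I thought of sanity-checking the basic specification is to fit it to simulated data with known parameters. This is a rough Python sketch of such a data-generating process (all names and the specific numbers are my own illustration, not part of the model above):

```python
import random

random.seed(42)
N = 200
true_mu = {1: 1.0, 0: -1.0}      # true group means
true_sigma = {1: 0.5, 0: 0.5}    # true group standard deviations

# Draw group membership and outcomes from the assumed generative model.
x_true, y = [], []
for _ in range(N):
    xi = 1 if random.random() < 0.4 else 0
    x_true.append(xi)
    y.append(random.gauss(true_mu[xi], true_sigma[xi]))

# Mask roughly 30% of x at random; for each masked value, supply a
# beta "soft observation" loosely centered on the truth, as a stand-in
# for the uncertain external probabilities.
ii_mis = [i for i in range(N) if random.random() < 0.3]
x_alpha = [8.0 if x_true[i] == 1 else 2.0 for i in ii_mis]
x_beta  = [2.0 if x_true[i] == 1 else 8.0 for i in ii_mis]
```

Fitting the Stan model to data like this and checking that delta recovers the true difference in means (here 2.0) would tell me whether the log_mix construction behaves as intended before scaling up.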
Thank you in advance for your help.
Nils