Dear all,

I need to analyze an annoyingly difficult dataset and am at a point where I am somewhat out of my depth and could do with some reassurance. Here’s a simplified version of my problem.

Suppose I have a dataset of `N` observations, a continuous outcome variable `y`, and a binary predictor variable `x` (indicating, for example, which of two interventions an observation resulted from).

```stan
data {
  int<lower = 1> N;
  vector[N] y;
  int<lower = 0> N_mis;
  int<lower = 0> N_obs;
  int<lower = 1, upper = N> ii_mis[N_mis];
  int<lower = 1, upper = N> ii_obs[N_obs];
  vector<lower = 0, upper = 1>[N_obs] x_obs;
  vector<lower = 0>[N_mis] x_alpha;  // beta parameters must be positive
  vector<lower = 0>[N_mis] x_beta;
}
```

Some observations of the binary predictor variable are missing but, for each missing observation, I know (with some uncertainty) the probability that x_i = 1. This information enters the model as a beta prior, such that `x_i ~ beta(alpha_i, beta_i)` for each missing observation, with `alpha_i` and `beta_i` supplied as data (`x_alpha` and `x_beta` above).

This is the relevant Stan code for imputing the missing data:

```stan
parameters {
  vector<lower = 0, upper = 1>[N_mis] x_imp;
}
transformed parameters {
  vector<lower = 0, upper = 1>[N] x;
  x[ii_mis] = x_imp;
  x[ii_obs] = x_obs;
}
model {
  x_imp ~ beta(x_alpha, x_beta);
}
```
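For intuition, here is a small Python sketch (the alpha and beta values are made up, not from my data) of what a beta prior on a missing `x_i` encodes: an uncertain probability whose mean is alpha / (alpha + beta).

```python
# Sketch with hypothetical numbers: a Beta(alpha, beta) prior on a missing
# x_i expresses "the probability that x_i = 1" together with our uncertainty
# about that probability.
alpha, beta = 8.0, 2.0  # e.g. we believe x_i = 1 with probability around 0.8

# Mean and variance of a Beta(alpha, beta) distribution
mean = alpha / (alpha + beta)
var = alpha * beta / ((alpha + beta) ** 2 * (alpha + beta + 1.0))

print(mean)  # 0.8
print(var)   # larger alpha + beta would mean less uncertainty
```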

This means that the predictor variable `x` is now a mixture of observed values (which are exactly 0 or 1) and uncertain probabilities (which lie strictly between 0 and 1). I want to estimate the mean of `y` when x = 0 and when x = 1 (and thus the difference between the two means).

This is the Stan code without the lines relating to the missing observations.

```stan
parameters {
  vector[2] mu;
  vector<lower = 0>[2] sigma;
}
model {
  mu ~ student_t(3, 0, 1);
  sigma ~ student_t(3, 0, 3);
  // mu[1], sigma[1]: component for x = 1; mu[2], sigma[2]: component for x = 0
  for (n in 1:N)
    target += log_mix(
      x[n],
      normal_lpdf(y[n] | mu[1], sigma[1]),
      normal_lpdf(y[n] | mu[2], sigma[2])
    );
}
generated quantities {
  real delta = mu[1] - mu[2];
}
```

My rationale is that the log density of each observation should be a weighted combination of the two component likelihoods, with the weights given by the probability that the observation has each value of x.
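To check this reasoning outside of Stan, here is a small Python sketch of the same computation (the observation and component parameters are made up). `log_mix(theta, lp1, lp2)` is `log(theta * exp(lp1) + (1 - theta) * exp(lp2))`, computed stably via log-sum-exp.

```python
import math

def log_mix(theta, lp1, lp2):
    """Stable log of theta * exp(lp1) + (1 - theta) * exp(lp2),
    mirroring Stan's log_mix for a two-component mixture."""
    a = math.log(theta) + lp1
    b = math.log1p(-theta) + lp2
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))

def normal_lpdf(y, mu, sigma):
    """Log density of Normal(mu, sigma) at y."""
    z = (y - mu) / sigma
    return -0.5 * math.log(2.0 * math.pi) - math.log(sigma) - 0.5 * z * z

# Hypothetical observation y[n] = 0.3 with imputed x[n] = 0.7:
# its log density mixes the two components with weights 0.7 and 0.3.
lp = log_mix(0.7, normal_lpdf(0.3, 1.0, 1.0), normal_lpdf(0.3, -1.0, 1.0))
print(lp)
```

For an observed x (exactly 0 or 1) the mixture collapses to a single component, so the same line of code handles observed and imputed predictors uniformly.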

**Is this the right way to go about it?** Is there a better parameterization that I'm missing? I ask because I run into trouble when fitting this model with many observations (60,000+), some varying effects added, and a different likelihood function. (But I wanted to make sure the basic specification is correct before investing more time into the more complex model.)

Thank you in advance for your help.

Nils