Using log_mix when imputing missing observations of a binary predictor variable

Dear all,

I need to analyze an annoyingly difficult dataset and am at a point where I am somewhat out of my depth and could do with some reassurance. Here’s a simplified version of my problem.

Suppose I have a dataset of N observations, a continuous outcome variable y, and a binary predictor variable x (indicating, for example, from which of two interventions an observation resulted from).

data {
  int<lower = 1> N;
  vector[N] y;
  int<lower = 0> N_mis;
  int<lower = 0> N_obs;
  int<lower = 1, upper = N> ii_mis[N_mis];
  int<lower = 1, upper = N> ii_obs[N_obs];
  vector<lower = 0, upper = 1>[N_obs] x_obs;
  vector[N_mis] x_alpha;
  vector[N_mis] x_beta;
}

Some observations of the binary predictor variable are missing but, for each missing observation, I know (with some uncertainty) the probability that x_i = 1. This information enters the model as a beta distribution such that:

x_i \sim \text{Beta}(\alpha_i, \beta_i)~\text{if } x_i \text{ is missing.}

This is the relevant Stan code for imputing the missing data:

parameters {
  vector<lower = 0, upper = 1>[N_mis] x_imp;
}
transformed parameters {
  vector<lower = 0, upper = 1>[N] x;
  x[ii_mis] = x_imp;
  x[ii_obs] = x_obs;
}
model {
  x_imp ~ beta(x_alpha, x_beta);
}

This means that the predictor variable x is now a mixture of some observed values (that either are 0 or 1) and some uncertain probabilities (that have a value between 0 and 1). I want to estimate the mean of y for when x = 0 and when x = 1 (and thus the difference between the two means).

This is the Stan code without the lines relating to the missing observations.

parameters {
  vector[2] mu;
  vector<lower = 0>[2] sigma;
}
model {
  mu ~ student_t(3, 0, 1);
  sigma ~ student_t(3, 0, 3);
  for (n in 1:N)
    target += log_mix(
      x[n],
      normal_lpdf(y[n] | mu[1], sigma[1]),
      normal_lpdf(y[n] | mu[2], sigma[2])
    );
}
generated quantities {
  real delta = mu[1] - mu[2];
}

My rationale is that the log density of an observation should depend on each of the two likelihoods to the extent of how likely the observation had either value of x.

Is this the right way to go about it? Is there a better parameterization that I’m missing? I am asking because when I’m running this model with a lot of observations (60,000+), with some varying effects added, and a different likelihood function I’m running into trouble. (But I wanted to make sure the basic specification is the correct one before investing more time into the more complex model.)

Thank you in advance for your help.
Nils

1 Like

Hi Nils!

My brief response would be: Yes, I think your approach makes sense!

A couple of thoughts. Maybe some are helpful.


So, right now y|x=0 \sim N(\mu_1, \sigma_1) and y|x=1 \sim N(\mu_2, \sigma_2). It might be worth it to think about the easier, although less general, case, where y|x \sim N(\alpha + \beta x, \sigma), essentially a regression. This implies y|x=1 \sim N(\alpha + \beta x, \sigma) and y|x=0 \sim N(\alpha, \sigma), so you have \alpha as a common mean, and \beta as the estimated difference (assumes same variance in the two groups, but this can be relaxed by modelling \sigma explicitly). Then you can do

...
model{
  ...
  target += normal_lpdf(y_obs | alpha + beta*x_obs, sigma);
  for(n in 1:N_mis)
    target += log_mix(
      p_x[n],
      normal_lpdf(y_mis[n] | alpha, sigma), // x = 0 and beta drops out
      normal_lpdf(y_mis[n] | alpha + beta, sigma) // x = 1 and left out
    );
  ...
}
...

…so you have to use log_mix only on the missing observations, which will hopefully speed up you model.


Having something like this

  x_p ~ beta(x_alpha, x_beta);

is cool, if you have data on x_alpha, x_beta (which you provide in your model). I always feel like using the beta_proportion version always to be more intuitive:

  x_p ~ beta_proportion(x_p_prior_expec, x_p_prior_uncer);

where x_p_prior_expec \in (0,1) is the expected value of the prior probability, and x_p_prior_uncer \in \mathbb{R}_+ is your uncertainty about the prior expectation. Maybe this is also more convenient for you?


I somehow feel like there is some opportunity to hierarchically model the observed x and data on x_alpha and x_beta “together”, to use as much information as possible in your data. But it’s probably a bit to late for me to think this through properly. Do you have this prior data/info/guess x_alpha and x_beta also for the cases where you do observe x?


Hm… that’s it for now. I guess. I really hope this helps somehow!

Cheers!
Max

2 Likes

Hi Max,

Thank you—this is excellent feedback and just what I was looking for! I am planning to write a longer reply once I have settled on and tested the final model.

Best wishes,
Nils

1 Like