Using log_mix when imputing missing observations of a binary predictor variable

nkreimer · February 10, 2020, 1:34pm

Dear all,

I need to analyze an annoyingly difficult dataset and am at a point where I am somewhat out of my depth and could do with some reassurance. Here’s a simplified version of my problem.

Suppose I have a dataset of N observations, a continuous outcome variable y, and a binary predictor variable x (indicating, for example, which of two interventions an observation resulted from).

data {
  int<lower = 1> N;
  vector[N] y;
  int<lower = 0> N_mis;
  int<lower = 0> N_obs;
  array[N_mis] int<lower = 1, upper = N> ii_mis;
  array[N_obs] int<lower = 1, upper = N> ii_obs;
  vector<lower = 0, upper = 1>[N_obs] x_obs;
  vector<lower = 0>[N_mis] x_alpha;
  vector<lower = 0>[N_mis] x_beta;
}

Some observations of the binary predictor variable are missing but, for each missing observation, I know (with some uncertainty) the probability that x_i = 1. This information enters the model as a beta distribution such that:

x_i \sim \text{Beta}(\alpha_i, \beta_i)~\text{if } x_i \text{ is missing.}

This is the relevant Stan code for imputing the missing data:

parameters {
  vector<lower = 0, upper = 1>[N_mis] x_imp;
}
transformed parameters {
  vector<lower = 0, upper = 1>[N] x;
  x[ii_mis] = x_imp;
  x[ii_obs] = x_obs;
}
model {
  x_imp ~ beta(x_alpha, x_beta);
}

This means that the predictor variable x is now a mixture of observed values (that are either 0 or 1) and uncertain probabilities (that lie between 0 and 1). I want to estimate the mean of y when x = 0 and when x = 1 (and thus the difference between the two means).

This is the Stan code without the lines relating to the missing observations.

parameters {
  vector[2] mu;
  vector<lower = 0>[2] sigma;
}
model {
  mu ~ student_t(3, 0, 1);
  sigma ~ student_t(3, 0, 3);
  for (n in 1:N)
    target += log_mix(
      x[n],
      normal_lpdf(y[n] | mu[1], sigma[1]),
      normal_lpdf(y[n] | mu[2], sigma[2])
    );
}
generated quantities {
  real delta = mu[1] - mu[2];
}

My rationale is that each observation's log density should weight the two likelihoods by how probable each value of x is for that observation.
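
To spell out what log_mix computes here (a sketch based on Stan's standard definition of log_mix, writing \theta_n for x[n] and \ell_k for the k-th normal_lpdf term; these symbols are introduced only for illustration):

\text{log\_mix}(\theta_n, \ell_1, \ell_2) = \log\left(\theta_n e^{\ell_1} + (1 - \theta_n) e^{\ell_2}\right), \quad \ell_k = \log \text{Normal}(y_n \mid \mu_k, \sigma_k)

So the first component (mu[1], sigma[1]) receives weight x[n]. If x[n] is read as the probability that x = 1, then mu[1] is the mean of the x = 1 group and mu[2] the mean of the x = 0 group. For observed rows the expression collapses to a single term: x[n] = 1 gives \ell_1 and x[n] = 0 gives \ell_2.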

Is this the right way to go about it? Is there a better parameterization that I’m missing? I am asking because I run into trouble when I fit this model with a lot of observations (60,000+), some varying effects added, and a different likelihood function. (But I wanted to make sure the basic specification is correct before investing more time in the more complex model.)

Thank you in advance for your help.
Nils

Max_Mantei · February 13, 2020, 10:46pm

Hi Nils!

My brief response would be: Yes, I think your approach makes sense!

A couple of thoughts. Maybe some are helpful.

So, right now y|x=1 \sim N(\mu_1, \sigma_1) and y|x=0 \sim N(\mu_2, \sigma_2), since log_mix gives its first component the weight x[n]. It might be worth thinking about the simpler, although less general, case where y|x \sim N(\alpha + \beta x, \sigma), essentially a regression. This implies y|x=0 \sim N(\alpha, \sigma) and y|x=1 \sim N(\alpha + \beta, \sigma), so \alpha is the mean of the x = 0 group and \beta is the estimated difference (this assumes the same variance in the two groups, but that can be relaxed by modelling \sigma explicitly). Then you can do

...
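model {
  ...
  // rows with observed x: ordinary regression likelihood
  target += normal_lpdf(y_obs | alpha + beta * x_obs, sigma);
  // rows with missing x: marginalize over x, with p_x[n] = P(x = 1)
  for (n in 1:N_mis)
    target += log_mix(
      p_x[n],
      normal_lpdf(y_mis[n] | alpha + beta, sigma), // x = 1, weight p_x[n]
      normal_lpdf(y_mis[n] | alpha, sigma) // x = 0, so beta drops out
    );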
  ...
}
...

…so you only need log_mix for the missing observations, which will hopefully speed up your model.
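
For reference, a minimal self-contained sketch of this regression variant might look as follows. The names y_obs, y_mis and p_x are assumptions carried over from the snippet above (they are not in the original data block), the priors are simply borrowed from the first post, and the equal-variance assumption is kept. Since log_mix gives weight p_x[n] to its first component, the x = 1 density goes first when p_x[n] is the probability that x = 1.

```stan
// Sketch only: y_obs, y_mis and p_x are assumed names, not from the original post.
data {
  int<lower = 0> N_obs;
  int<lower = 0> N_mis;
  vector[N_obs] y_obs;                        // outcomes whose x is observed
  vector[N_mis] y_mis;                        // outcomes whose x is missing
  vector<lower = 0, upper = 1>[N_obs] x_obs;  // observed predictor (0 or 1)
  vector<lower = 0>[N_mis] x_alpha;           // beta shape parameters for P(x = 1)
  vector<lower = 0>[N_mis] x_beta;
}
parameters {
  real alpha;                                 // mean of y when x = 0
  real beta;                                  // difference in means (x = 1 minus x = 0)
  real<lower = 0> sigma;                      // common residual sd (assumed equal)
  vector<lower = 0, upper = 1>[N_mis] p_x;    // P(x = 1) for each missing x
}
model {
  // priors borrowed from the first post (an assumption for this sketch)
  alpha ~ student_t(3, 0, 1);
  beta ~ student_t(3, 0, 1);
  sigma ~ student_t(3, 0, 3);
  target += beta_lpdf(p_x | x_alpha, x_beta);

  // observed x: vectorized regression likelihood
  target += normal_lpdf(y_obs | alpha + beta * x_obs, sigma);

  // missing x: marginalize over the two possible values of x
  for (n in 1:N_mis) {
    target += log_mix(p_x[n],
                      normal_lpdf(y_mis[n] | alpha + beta, sigma),  // x = 1
                      normal_lpdf(y_mis[n] | alpha, sigma));        // x = 0
  }
}
```

With this layout the per-observation loop only runs over the N_mis rows, while the N_obs rows are handled in a single vectorized statement.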

Having something like this

  x_p ~ beta(x_alpha, x_beta);

works well if you have data on x_alpha and x_beta (which you supply to your model). I always find the beta_proportion version more intuitive:

  x_p ~ beta_proportion(x_p_prior_expec, x_p_prior_uncer);

where x_p_prior_expec \in (0,1) is the expected value of the prior probability, and x_p_prior_uncer \in \mathbb{R}_+ is the precision of the prior (larger values mean less uncertainty about the prior expectation). Maybe this is also more convenient for you?
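
As a rough sketch of that reparameterization: in Stan, beta_proportion(mu, kappa) corresponds to beta(mu * kappa, (1 - mu) * kappa), so the mean and precision can be derived from existing shape parameters. The names x_mu and x_kappa below are hypothetical, chosen only for this illustration.

```stan
data {
  int<lower = 0> N_mis;
  vector<lower = 0>[N_mis] x_alpha;  // beta shape parameters, as in the first post
  vector<lower = 0>[N_mis] x_beta;
}
transformed data {
  // hypothetical names: prior mean and precision implied by (x_alpha, x_beta)
  vector[N_mis] x_mu = x_alpha ./ (x_alpha + x_beta);
  vector[N_mis] x_kappa = x_alpha + x_beta;
}
parameters {
  vector<lower = 0, upper = 1>[N_mis] x_imp;
}
model {
  // same prior as x_imp ~ beta(x_alpha, x_beta), written as mean and precision
  x_imp ~ beta_proportion(x_mu, x_kappa);
}
```

If the prior information is naturally available as an expected probability and a precision, those can be passed in as data directly and the transformed data block dropped.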

I somehow feel like there is some opportunity to hierarchically model the observed x and the data on x_alpha and x_beta “together”, to use as much of the information in your data as possible. But it’s probably a bit too late for me to think this through properly. Do you also have this prior data/info/guess for x_alpha and x_beta for the cases where you do observe x?

Hm… that’s it for now, I guess. I really hope this helps somehow!

Cheers!
Max

nkreimer · February 14, 2020, 4:35pm

Hi Max,

Thank you—this is excellent feedback and just what I was looking for! I am planning to write a longer reply once I have settled on and tested the final model.

Best wishes,
Nils
