Imputation of a 3 category covariate to model a binary outcome

Diego · December 3, 2018, 7:23am

Hi everyone:

It seems this question is becoming a classic. It concerns log_mix, log_sum_exp and the imputation of a 3-category covariate used to model a binary outcome. I have checked several similar questions and references:

https://andrewgelman.com/2017/08/21/mixture-models-stan-can-use-log_mix/

github.com/stan-dev/math pull request "log_mix - Multivariate Containers" (stan-dev:develop ← andrjohns:feature/log_mix_arr, opened 6 Feb 2018 by andrjohns), which extends log_mix to take a vector of mixing proportions and an array of density vectors, i.e. log_mix(vector, vectors[]), with analytic gradients.

https://gist.github.com/rmcelreath/9406643583a8c99304e459e644762f82 (discrete_missingness.R: "imputes" a missing binary predictor by marginalizing over missingness; imputed values are produced in generated quantities)

But it is still not clear enough to me. I have been building the model up step by step to see what works, comparing the output from JAGS and Stan at each step. I got to the point of imputing the categorical covariate using only the complete data, and it works like a charm (same coefficients and posterior predictive of the categories). So I know the problem arises when I move to the data that include the unobserved categories.

In the following code:
y is the binary outcome
cov_cat_1 and cov_cat_3 are dummy variables for categories 1 and 3 (-1 when NA)
cov_cat_miss flags the cases whose category is missing (0 when observed)
day and lat are predictors of the log odds of being in categories 1-3

As you can see, theta is a vector of length 3 containing the log odds of being in categories 1-3 (category 1 is fixed at 0 as the reference). Once the category is imputed, the idea is to use the imputation to estimate a0, beta_bp, and beta_m.

My questions are: what is wrong? What do you suggest? I would really appreciate suggestions that are as clear as possible, please.

model{
  a0 ~ normal(0,30); 
  beta_bp ~ normal(0,30); 
  beta_m ~ normal(0,30); 
  a_imp ~ normal(0,30); // explained above
  b1_imp ~ normal(0,30); // explained above
  b2_imp ~ normal(0,30); // explained above
  

  for (i in 1:n_obs) {
    
    if (cov_cat_miss[i] == 0) {
      y[i] ~ bernoulli_logit(a0+
                               beta_bp*cov_cat_1[i]+
                               beta_m*cov_cat_3[i]);}
    
    else {
      vector[n_cat] theta;
      vector[n_cat] log_prob_theta;
      matrix[n_cat, n_cat] lp;  
      
      real p2 = a_imp[2] + b1_imp[2]*day[i] + b2_imp[2]*lat[i]; 
      real p3 = a_imp[3] + b1_imp[3]*day[i] + b2_imp[3]*lat[i]; 
      
      theta[1] = 0;
      theta[2] = p2;
      theta[3] = p3; 
      log_prob_theta=log_softmax(theta);
      
      lp[1,1] = log_prob_theta[1] + bernoulli_logit_lpmf( y[i] | a0  + beta_bp); //cat 1
      lp[2,1] = log_prob_theta[1] + bernoulli_logit_lpmf( y[i] | a0); //cat 2 (baseline)
      lp[3,1] = log_prob_theta[1] + bernoulli_logit_lpmf( y[i] | a0  + beta_m); //cat 3
      lp[1,2] = log_prob_theta[2] + bernoulli_logit_lpmf( y[i] | a0  + beta_bp);
      lp[2,2] = log_prob_theta[2] + bernoulli_logit_lpmf( y[i] | a0);
      lp[3,2] = log_prob_theta[2] + bernoulli_logit_lpmf( y[i] | a0  + beta_m);
      lp[1,3] = log_prob_theta[3] + bernoulli_logit_lpmf( y[i] | a0  + beta_bp);
      lp[2,3] = log_prob_theta[3] + bernoulli_logit_lpmf( y[i] | a0);
      lp[3,3] = log_prob_theta[3] + bernoulli_logit_lpmf( y[i] | a0  + beta_m);
      
      target += log_sum_exp(lp); 
      
    }
  }
}
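
For reference, here is a sketch of the arithmetic behind the else branch. The notation is not from the original post: $\pi_{ik}$ stands for the softmax probability of category $k$ for case $i$ (the exponential of log_prob_theta[k]) and $\eta_k$ for the outcome log odds in category $k$. Marginalizing over a single 3-category covariate would normally need one term per category:

```latex
\begin{aligned}
p(y_i \mid \mathrm{day}_i, \mathrm{lat}_i)
  &= \sum_{k=1}^{3} \pi_{ik}\,
     \mathrm{Bernoulli}\!\left(y_i \mid \operatorname{logit}^{-1}(\eta_k)\right), \\
\eta_1 &= a_0 + \beta_{bp}, \qquad \eta_2 = a_0, \qquad \eta_3 = a_0 + \beta_m .
\end{aligned}
```

The 3 x 3 matrix lp instead pairs every category probability with every outcome term, so log_sum_exp(lp) evaluates to

```latex
\log \sum_{j=1}^{3} \sum_{k=1}^{3} \pi_{ik}\, p(y_i \mid \eta_j)
  = \log \sum_{j=1}^{3} p(y_i \mid \eta_j),
\qquad \text{because } \sum_{k=1}^{3} \pi_{ik} = 1 .
```

That quantity does not depend on theta at all, which may be worth checking against the intended model.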

Erik_Ringen · December 3, 2018, 9:51am

Diego,

At first glance, it looks like you haven’t specified how day and lat predict the categorical covariate in the cases where the categorical covariate is actually observed. For cases where cov_cat is observed you need a line like:

if (cov_cat_miss[i] == 0) {
  y[i] ~ bernoulli_logit(a0+
                           beta_bp*cov_cat_1[i]+
                           beta_m*cov_cat_3[i]);
  cov_cat[i] ~ categorical_logit( p );
}

As is, it looks like there’s no data informing linear models p2 and p3. I accidentally omitted this step from my example in Log_mix for missing categorical data. See my edit to that post for an updated example script. I didn’t catch the error before because I assigned uniform probabilities to each category, so it didn’t matter that they weren’t being informed by data so long as they had reasonable priors.
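
The point about the observed cases can be written as a factorization of the log likelihood. This is only a sketch, with notation that does not appear in the thread: $c_i$ is the category of case $i$, $x_i = (\mathrm{day}_i, \mathrm{lat}_i)$, $\phi$ collects a_imp, b1_imp and b2_imp, and $\psi$ collects a0, beta_bp and beta_m.

```latex
\begin{aligned}
\log L(\phi, \psi)
  ={}& \sum_{i:\; c_i \text{ observed}}
      \Bigl[ \log p(c_i \mid x_i, \phi) + \log p(y_i \mid c_i, \psi) \Bigr] \\
   &+ \sum_{i:\; c_i \text{ missing}}
      \log \sum_{k=1}^{3} p(c_i = k \mid x_i, \phi)\; p(y_i \mid c_i = k, \psi)
\end{aligned}
```

The categorical_logit statement supplies the $\log p(c_i \mid x_i, \phi)$ term. Without it, $\phi$ is informed only by its priors and, at most weakly, by the mixture terms of the missing cases.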

Diego · December 3, 2018, 3:55pm

AWESOME!!

This is the complete solution for the introductory example:

model{
  a0 ~ normal(0,30);
  beta_bp ~ normal(0,30);
  beta_m ~ normal(0,30);
  a_imp ~ normal(0,30); // explained above
  b1_imp ~ normal(0,30); // explained above
  b2_imp ~ normal(0,30); // explained above

  for (i in 1:n_obs) {
    vector[n_cat] theta;

    // log odds of categories 2 and 3 relative to category 1
    theta[1] = 0;
    theta[2] = a_imp[2] + b1_imp[2]*day[i] + b2_imp[2]*lat[i];
    theta[3] = a_imp[3] + b1_imp[3]*day[i] + b2_imp[3]*lat[i];

    if (cov_cat_miss[i] == 0) {
      // category observed: it informs the outcome model and the imputation model
      y[i] ~ bernoulli_logit(a0+
                               beta_bp*cov_cat_1[i]+
                               beta_m*cov_cat_3[i]);
      repro_cat_obs[i] ~ categorical( softmax(theta));
    }
    else {
      // category missing: marginalize over the three categories,
      // one term per category: log P(cat k) + log P(y | cat k)
      vector[n_cat] log_prob_theta;
      vector[n_cat] lp;

      log_prob_theta=log_softmax(theta);

      lp[1] = log_prob_theta[1] + bernoulli_logit_lpmf( y[i] | a0  + beta_bp); //cat 1
      lp[2] = log_prob_theta[2] + bernoulli_logit_lpmf( y[i] | a0); //cat 2 (baseline)
      lp[3] = log_prob_theta[3] + bernoulli_logit_lpmf( y[i] | a0  + beta_m); //cat 3

      target += log_sum_exp(lp);
    }
  }
}
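
For anyone who wants to run something end to end, here is a hedged sketch of how a model block like the one above could sit in a complete program. Several things are assumptions rather than part of the thread: the data block, the coding of repro_cat_obs as 1..3 (with -1 when missing), and the names category_logits, category_log_terms and cat_imp, which are hypothetical. It marginalizes with one term per category and, following the gist's note that imputed values are produced in generated quantities, draws an imputed category for each missing case. The array syntax needs Stan 2.26 or later.

```stan
functions {
  // Hypothetical helper: category log odds, with category 1 as the reference.
  vector category_logits(real day_i, real lat_i,
                         vector a, vector b1, vector b2) {
    vector[3] theta;
    theta[1] = 0;
    theta[2] = a[2] + b1[2] * day_i + b2[2] * lat_i;
    theta[3] = a[3] + b1[3] * day_i + b2[3] * lat_i;
    return theta;
  }
  // Hypothetical helper: for k = 1..3, log P(cat = k) + log P(y | cat = k).
  vector category_log_terms(int y_i, vector theta,
                            real a0, real beta_bp, real beta_m) {
    vector[3] eta = [a0 + beta_bp, a0, a0 + beta_m]';  // cat 2 is the baseline
    vector[3] lp = log_softmax(theta);
    for (k in 1:3) {
      lp[k] += bernoulli_logit_lpmf(y_i | eta[k]);
    }
    return lp;
  }
}
data {
  int<lower=1> n_obs;
  array[n_obs] int<lower=0, upper=1> y;
  array[n_obs] int<lower=0, upper=1> cov_cat_miss;    // 1 = category missing
  array[n_obs] int<lower=-1, upper=3> repro_cat_obs;  // assumed 1..3, -1 if missing
  vector[n_obs] day;
  vector[n_obs] lat;
}
parameters {
  real a0;
  real beta_bp;
  real beta_m;
  vector[3] a_imp;   // element 1 is unused, as in the posted code
  vector[3] b1_imp;
  vector[3] b2_imp;
}
model {
  a0 ~ normal(0, 30);
  beta_bp ~ normal(0, 30);
  beta_m ~ normal(0, 30);
  a_imp ~ normal(0, 30);
  b1_imp ~ normal(0, 30);
  b2_imp ~ normal(0, 30);
  for (i in 1:n_obs) {
    vector[3] theta = category_logits(day[i], lat[i], a_imp, b1_imp, b2_imp);
    vector[3] lp = category_log_terms(y[i], theta, a0, beta_bp, beta_m);
    if (cov_cat_miss[i] == 0) {
      target += lp[repro_cat_obs[i]];  // observed: keep the recorded category's term
    } else {
      target += log_sum_exp(lp);       // missing: sum over the three categories
    }
  }
}
generated quantities {
  array[n_obs] int cat_imp;  // recorded category, or a posterior draw if missing
  for (i in 1:n_obs) {
    if (cov_cat_miss[i] == 0) {
      cat_imp[i] = repro_cat_obs[i];
    } else {
      vector[3] theta = category_logits(day[i], lat[i], a_imp, b1_imp, b2_imp);
      cat_imp[i] = categorical_rng(
        softmax(category_log_terms(y[i], theta, a0, beta_bp, beta_m)));
    }
  }
}
```

Picking lp[repro_cat_obs[i]] for an observed case is equivalent to writing the categorical and bernoulli_logit statements separately, provided the dummy variables agree with the recorded category. Applying softmax to the three joint terms gives the probability of each category conditional on y[i], day[i] and lat[i], which is what categorical_rng samples from.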
