Modeling non-response in ordinal survey data

Hi everyone!

This is my first Stan Discourse post, so I apologize in advance for anything I may have missed. I’m working with a bunch of ordinal response variables (for this example I’m using Angela Merkel’s favorability as reported by Pew, data available here) from several surveys.

Some of the questions have a fair proportion of ‘don’t know / don’t respond’ answers, and I want to avoid removing them (or using MICE / full Bayesian imputation) because the level of non-response is informative for us, but also because if I do MRP later on I’d want to be able to model the % of non-respondents by cell (so I want to be able to predict it).

I know there was some discussion of Tobit / Heckman selection models here, and @Solomon discussed conditional logistic models a while ago, but this feels slightly different so I’m curious about your thoughts re: how best to do this! (Also, whether it’s possible in #interfaces:brms )

I put together a simple (and hopefully not catastrophically incorrect) Stan model where the prob of not responding is a Bernoulli, and the ordinal part is handled by the usual ordinal logistic, is there a better (and maybe more straightforward) way to do this?

data {
  int<lower=2> K; // number of categories
  int<lower=0> N; // number of observations
  int<lower=1> D; // number of predictors
  int<lower=0,upper=K> y[N]; // response variable
  matrix[N, D] x; // design matrix
}
parameters {
  vector[D] beta; // predictors for ordered logistic part (e.g. likert scale)
  ordered[K-1] c; // number of thresholds for ordered logistic
  real alpha_p; // intercept for proportion of non-response
  vector[D] beta_p; // predictors for proportion of non-response
}
model {
  // model
  for (n in 1:N) {
    if (y[n] == 0) // assume non-response coded as 0
      target += bernoulli_logit_lpmf(1|alpha_p + x[n] * beta_p); // predict non-response with probability p
    else // otherwise use ordered logistic with prob 1-p
      target += bernoulli_logit_lpmf(0|alpha_p + x[n] * beta_p) + ordered_logistic_lpmf(y[n]| x[n] * beta, c);
  }
}
generated quantities {
  vector[N] yrep; // generate samples
  vector[N] temp;
  for (n in 1:N) {
    temp[n] = bernoulli_rng(inv_logit(alpha_p + x[n] * beta_p));
    if (temp[n] == 1)
      yrep[n] = 0;
    else
      yrep[n] = ordered_logistic_rng(x[n] * beta, c);
  }
}

Here’s the code to run the model. It seems to work fine.

library(readr)
library(dplyr)
library(rstan)

data <- read_csv("https://raw.githubusercontent.com/octmedina/zi-ordinal/main/merkel_data.csv")

D <- 4 # number of covariates
N <- length(data$confid_merkel) # number of observations
y <- data$confid_merkel # response variable
x <- matrix(c(data$party, # covariates
              data$edu,
              data$race,
              data$income), 
            nrow = N, ncol = D) 
K <- length(unique(data$confid_merkel)) - 1 # number of ordinal levels (0 doesn't count)


stan_data <- list(
  N = N,
  K = K,
  y = y, 
  x = x, 
  D = D)


fit_merkel <- stan(file = "https://raw.githubusercontent.com/octmedina/zi-ordinal/main/zi-ordinal-model.stan", data = stan_data,
                   chains = 2,
                   cores = 2,
                   iter = 2000)


print(fit_merkel, pars= c("beta", "c", "alpha_p", "beta_p"))

Thank you so much!

2 Likes

Update: I have since been able to make this (kind of) work in #interfaces:brms using the awesome custom family capabilities.

Here’s the custom family call. Based on this discussion I tried using the “categorical” special option , but couldn’t get it to work because I kept getting messages about the number of categories, so did it the manual way. If anyone has any thoughts on this, please share!

library(brms)
library(readr)
library(dplyr)
library(rstan)
library(bayesplot)

# create custom family for zero inflated ordinal
# define two parameters (mu and eta), one for the zero inflation (mu)
# and one for the regular ordinal part (eta)
# also have to define the (threshold) intercept variable (c_int) 
zi_ordinal <- custom_family(
  "zi_ordinal", dpars = c("mu", "eta"),
  links = c("identity"), lb = c(NA, NA),
  type = c("int"), vars = c("c_int")
)


# define stan functions
# we define the lpmf and the random number generator
stan_funs <- stanvar(block = "functions", scode = "
  real zi_ordinal_lpmf(int k, real mu, real eta, vector c_int) {
      if (k == 0) // assume non-response coded as 0
        return bernoulli_logit_lpmf(1| mu); // predict non-response with probability p
      else // otherwise use ordered logistic with prob 1-p
        return bernoulli_logit_lpmf(0| mu) + ordered_logistic_lpmf(k | eta, c_int);
  }
  int zi_ordinal_rng(real mu, real eta, vector c_int) {
    if (bernoulli_rng(inv_logit(mu)) == 1)
      return 0;
    else
      return ordered_logistic_rng(eta, c_int);
  }
")

# this feels sort of hacky, but to define the ordered c_int variable i had to
# add this to the parameters block, as well as the number of categories (n_thresh)
ordered_var <- stanvar(scode = "ordered[n_thresh] c_int;", block = "parameters")
ncat_var <- stanvar(x = 3, name = "n_thresh", scode = "int n_thresh;")
stanvars <- ordered_var + ncat_var + stan_funs


This seems to work fine. The parameters recovered are the same as the Stan model (shown above).


# read in data
data <- read_csv("https://raw.githubusercontent.com/octmedina/zi-ordinal/main/merkel_data.csv") 

# fit the model. it works!
fit_zi_ord <- brm(
  bf(confid_merkel ~ 1 + party + income + edu + race,
     eta ~ 0 + party + income + edu + race),
  data = data, 
  family = zi_ordinal, stanvars = stanvars,
  chains = 2,
  cores = 2
)

But here’s something that I’ve been struggling with. The posterior predictive checks look similar, except the intervals are way wider! I don’t really know why, since it seems that the parameters are similar, and the predicted values are also similar for every observation. Could it be the way I’m generating the samples? I used the categorical distribution to generate them.

# Create posterior predict function for post-processing
posterior_predict_zi_ordinal <- function(i, prep, ...) {
  c_all <- rstan::extract(fit_zi_ord$fit, pars = "c_int") # extract intercepts
  c_int <- as.matrix(c_all$c_int[,1:3]) # convert to matrix
  
  mu <- brms::get_dpar(prep, "mu", i = i) # get mu (zero/non-response inflation)
  eta <- brms::get_dpar(prep, "eta", i = i) # get eta (regular ordinal logistic)
  
  p_zi <- plogis(mu) # inverse-logit 
  thresholds <- ncol(c_int) # get number of thresholds
  
  # create the probabilities for all the categories
  p_total <- cbind(p_zi, (1-p_zi)*(1-plogis(eta-c_int[,1])))
  for (val in 2:thresholds) {
    p_total <- cbind(p_total, (1-p_zi)*(plogis(eta-c_int[,val-1])-plogis(eta-c_int[,val])))
  }
  p_total <- cbind(p_total, (1-p_zi)*(plogis(eta-c_int[,thresholds])))
  
  # generate samples (subtract 1 bc first category is 0)
  y_rep <- extraDistr::rcat(n=dim(p_total)[1], prob = p_total)-1
  y_rep <- as.matrix(y_rep)
  y_rep
}

# Generate samples (seems to work)
brms_yrep <- posterior_predict(fit_zi_ord)


Here are the two bayesplot outputs, for comparison. The Rstan model is available in the message above, and the full reprex is here.


1 Like