# Modeling non-response in ordinal survey data

Hi everyone!

This is my first Stan Discourse post, so I apologize in advance for anything I may have missed. I’m working with a bunch of ordinal response variables (for this example I’m using Angela Merkel’s favorability as reported by Pew, data available here) from several surveys.

Some of the questions have a fair proportion of ‘don’t know / don’t respond’ answers, and I want to avoid removing them (or using MICE / full Bayesian imputation) because the level of non-response is informative for us, but also because if I do MRP later on I’d want to be able to model the % of non-respondents by cell (so I want to be able to predict it).

I know there was some discussion of Tobit / Heckman selection models here, and @Solomon discussed conditional logistic models a while ago, but this feels slightly different so I’m curious about your thoughts re: how best to do this! (Also, whether it’s possible in #interfaces:brms )

I put together a simple (and hopefully not catastrophically incorrect) Stan model where the prob of not responding is a Bernoulli, and the ordinal part is handled by the usual ordinal logistic, is there a better (and maybe more straightforward) way to do this?

``````data {
int<lower=2> K; // number of categories
int<lower=0> N; // number of observations
int<lower=1> D; // number of predictors
int<lower=0,upper=K> y[N]; // response variable
matrix[N, D] x; // design matrix
}
parameters {
vector[D] beta; // predictors for ordered logistic part (e.g. likert scale)
ordered[K-1] c; // number of thresholds for ordered logistic
real alpha_p; // intercept for proportion of non-response
vector[D] beta_p; // predictors for proportion of non-response
}
model {
// model
for (n in 1:N) {
if (y[n] == 0) // assume non-response coded as 0
target += bernoulli_logit_lpmf(1|alpha_p + x[n] * beta_p); // predict non-response with probability p
else // otherwise use ordered logistic with prob 1-p
target += bernoulli_logit_lpmf(0|alpha_p + x[n] * beta_p) + ordered_logistic_lpmf(y[n]| x[n] * beta, c);
}
}
generated quantities {
vector[N] yrep; // generate samples
vector[N] temp;
for (n in 1:N) {
temp[n] = bernoulli_rng(inv_logit(alpha_p + x[n] * beta_p));
if (temp[n] == 1)
yrep[n] = 0;
else
yrep[n] = ordered_logistic_rng(x[n] * beta, c);
}
}
``````

Here’s the code to run the model. It seems to work fine.

``````library(readr)
library(dplyr)
library(rstan)

D <- 4 # number of covariates
N <- length(data\$confid_merkel) # number of observations
y <- data\$confid_merkel # response variable
x <- matrix(c(data\$party, # covariates
data\$edu,
data\$race,
data\$income),
nrow = N, ncol = D)
K <- length(unique(data\$confid_merkel)) - 1 # number of ordinal levels (0 doesn't count)

stan_data <- list(
N = N,
K = K,
y = y,
x = x,
D = D)

fit_merkel <- stan(file = "https://raw.githubusercontent.com/octmedina/zi-ordinal/main/zi-ordinal-model.stan", data = stan_data,
chains = 2,
cores = 2,
iter = 2000)

print(fit_merkel, pars= c("beta", "c", "alpha_p", "beta_p"))

``````

Thank you so much!

1 Like

Update: I have since been able to make this (kind of) work in #interfaces:brms using the awesome custom family capabilities.

Here’s the custom family call. Based on this discussion I tried using the “categorical” special option , but couldn’t get it to work because I kept getting messages about the number of categories, so did it the manual way. If anyone has any thoughts on this, please share!

``````library(brms)
library(dplyr)
library(rstan)
library(bayesplot)

# create custom family for zero inflated ordinal
# define two parameters (mu and eta), one for the zero inflation (mu)
# and one for the regular ordinal part (eta)
# also have to define the (threshold) intercept variable (c_int)
zi_ordinal <- custom_family(
"zi_ordinal", dpars = c("mu", "eta"),
links = c("identity"), lb = c(NA, NA),
type = c("int"), vars = c("c_int")
)

# define stan functions
# we define the lpmf and the random number generator
stan_funs <- stanvar(block = "functions", scode = "
real zi_ordinal_lpmf(int k, real mu, real eta, vector c_int) {
if (k == 0) // assume non-response coded as 0
return bernoulli_logit_lpmf(1| mu); // predict non-response with probability p
else // otherwise use ordered logistic with prob 1-p
return bernoulli_logit_lpmf(0| mu) + ordered_logistic_lpmf(k | eta, c_int);
}
int zi_ordinal_rng(real mu, real eta, vector c_int) {
if (bernoulli_rng(inv_logit(mu)) == 1)
return 0;
else
return ordered_logistic_rng(eta, c_int);
}
")

# this feels sort of hacky, but to define the ordered c_int variable i had to
# add this to the parameters block, as well as the number of categories (n_thresh)
ordered_var <- stanvar(scode = "ordered[n_thresh] c_int;", block = "parameters")
ncat_var <- stanvar(x = 3, name = "n_thresh", scode = "int n_thresh;")
stanvars <- ordered_var + ncat_var + stan_funs

``````

This seems to work fine. The parameters recovered are the same as the Stan model (shown above).

``````

# fit the model. it works!
fit_zi_ord <- brm(
bf(confid_merkel ~ 1 + party + income + edu + race,
eta ~ 0 + party + income + edu + race),
data = data,
family = zi_ordinal, stanvars = stanvars,
chains = 2,
cores = 2
)

``````

But here’s something that I’ve been struggling with. The posterior predictive checks look similar, except the intervals are way wider! I don’t really know why, since it seems that the parameters are similar, and the predicted values are also similar for every observation. Could it be the way I’m generating the samples? I used the categorical distribution to generate them.

``````# Create posterior predict function for post-processing
posterior_predict_zi_ordinal <- function(i, prep, ...) {
c_all <- rstan::extract(fit_zi_ord\$fit, pars = "c_int") # extract intercepts
c_int <- as.matrix(c_all\$c_int[,1:3]) # convert to matrix

mu <- brms::get_dpar(prep, "mu", i = i) # get mu (zero/non-response inflation)
eta <- brms::get_dpar(prep, "eta", i = i) # get eta (regular ordinal logistic)

p_zi <- plogis(mu) # inverse-logit
thresholds <- ncol(c_int) # get number of thresholds

# create the probabilities for all the categories
p_total <- cbind(p_zi, (1-p_zi)*(1-plogis(eta-c_int[,1])))
for (val in 2:thresholds) {
p_total <- cbind(p_total, (1-p_zi)*(plogis(eta-c_int[,val-1])-plogis(eta-c_int[,val])))
}
p_total <- cbind(p_total, (1-p_zi)*(plogis(eta-c_int[,thresholds])))

# generate samples (subtract 1 bc first category is 0)
y_rep <- extraDistr::rcat(n=dim(p_total)[1], prob = p_total)-1
y_rep <- as.matrix(y_rep)
y_rep
}

# Generate samples (seems to work)
brms_yrep <- posterior_predict(fit_zi_ord)

``````

Here are the two bayesplot outputs, for comparison. The Rstan model is available in the message above, and the full reprex is here.