Categorical Predictor with Measurement Error Regression

bwilden · January 6, 2021, 5:29pm

Hello,
I am still very new to the world of Stan and Bayesian modeling, so please excuse in advance my ignorance!

I am attempting to build a measurement error model in Stan where the main predictor variable is categorical, but where there is uncertainty whether the observed category matches the true category. Attached to each observation is a vector of probabilities for all potential categories. The observed categories are generated from the highest of these probabilities per observation. My goal is to incorporate this uncertainty when predicting some binary outcome Y, which is a function of the observation’s true category. The simulated data below shows the structure of what I am working with:

library(MCMCpack)

N <- 1000
B_a <- 0.8
B_b <- 0.5
B_c <- 0.2
cat_true <- rep(NA, N)
probs <- data.frame(prob_a = rep(NA, N),
                    prob_b = rep(NA, N),
                    prob_c = rep(NA, N))
Y <- rep(NA, N)

for (i in 1:N) {
  cat_true[i] <- sample(c("a", "b", "c"), 1)
  if (cat_true[i] == "a") {
    probs[i, ] <- rdirichlet(1, c(3, 1, 1))
    Y[i] <- rbinom(1, 1, B_a)
  } else if (cat_true[i] == "b") {
    probs[i, ] <- rdirichlet(1, c(1, 3, 1))
    Y[i] <- rbinom(1, 1, B_b)
  } else if (cat_true[i] == "c") {
    probs[i, ] <- rdirichlet(1, c(1, 1, 3))
    Y[i] <- rbinom(1, 1, B_c)
  }
}
cat_obs <- colnames(probs)[max.col(probs)]
cat_obs <- gsub("prob_", "", cat_obs)

My main question is how to write a logit model in Stan with this data in the absence of cat_true. As I understand it, it is tricky working with discrete values in Stan? I have been attempting to adapt Richard McElreath’s example in 15.3 with weighted averages, but without luck so far. Any guidance would be greatly appreciated!

nhuurre · January 6, 2021, 8:49pm

The User Guide has a chapter on discrete latent parameters with an example of Noisy Categorical Measurement Model.

The following code that corresponds to your simulation

data {
  int N;
  simplex[3] probs[N];
  int Y[N];
}
parameters {
  real<lower=0,upper=1> B_a;
  real<lower=0,upper=1> B_b;
  real<lower=0,upper=1> B_c;
}
model {
  for (i in 1:N) {
    vector[3] lprob;
    // cat_true[i] == "a"
    lprob[1] = dirichlet_lpdf(probs[i]| [3,1,1]')
               + bernoulli_lpmf(Y[i]| B_a);
    // cat_true[i] == "b"
    lprob[2] = dirichlet_lpdf(probs[i]| [1,3,1]')
               + bernoulli_lpmf(Y[i]| B_b);
    // cat_true[i] == "c"
    lprob[3] = dirichlet_lpdf(probs[i]| [1,1,3]')
               + bernoulli_lpmf(Y[i]| B_c);
    target += log_sum_exp(lprob);
  }
}
generated quantities {
  // posterior probabilities of cat_true
  simplex[3] posterior_prob[N];
  for (i in 1:N) {
    vector[3] lprob;
    lprob[1] = dirichlet_lpdf(probs[i]| [3,1,1]')
               + bernoulli_lpmf(Y[i]| B_a);
    lprob[2] = dirichlet_lpdf(probs[i]| [1,3,1]')
               + bernoulli_lpmf(Y[i]| B_b);
    lprob[3] = dirichlet_lpdf(probs[i]| [1,1,3]')
               + bernoulli_lpmf(Y[i]| B_c);
    posterior_prob[i] = softmax(lprob);
  }
}

bwilden · January 15, 2021, 3:11pm

Thank you for this! I greatly appreciate it

Topic		Replies	Views
Latent Categorical Predictor Variable Model Modeling specification	10	503	February 24, 2023
Bayesian Measurement Error Model in Stan Modeling rstan , techniques	5	155	April 24, 2025
Another measurement error modeling question Modeling	13	585	July 5, 2023
Bayesian categorical annotation Modeling	4	816	January 11, 2019
Measurement error in a categorical outcome variable Modeling	3	508	March 29, 2018

Categorical Predictor with Measurement Error Regression

Related topics