Categorical Predictor with Measurement Error Regression

Hello,
I am still very new to the world of Stan and Bayesian modeling, so please excuse in advance my ignorance!

I am attempting to build a measurement error model in Stan where the main predictor variable is categorical, but where there is uncertainty whether the observed category matches the true category. Attached to each observation is a vector of probabilities for all potential categories. The observed categories are generated from the highest of these probabilities per observation. My goal is to incorporate this uncertainty when predicting some binary outcome Y, which is a function of the observation’s true category. The simulated data below shows the structure of what I am working with:

library(MCMCpack)

N <- 1000
B_a <- 0.8
B_b <- 0.5
B_c <- 0.2
cat_true <- rep(NA, N)
probs <- data.frame(prob_a = rep(NA, N),
                    prob_b = rep(NA, N),
                    prob_c = rep(NA, N))
Y <- rep(NA, N)

for (i in 1:N) {
  cat_true[i] <- sample(c("a", "b", "c"), 1)
  if (cat_true[i] == "a") {
    probs[i, ] <- rdirichlet(1, c(3, 1, 1))
    Y[i] <- rbinom(1, 1, B_a)
  } else if (cat_true[i] == "b") {
    probs[i, ] <- rdirichlet(1, c(1, 3, 1))
    Y[i] <- rbinom(1, 1, B_b)
  } else if (cat_true[i] == "c") {
    probs[i, ] <- rdirichlet(1, c(1, 1, 3))
    Y[i] <- rbinom(1, 1, B_c)
  }
}
cat_obs <- colnames(probs)[max.col(probs)]
cat_obs <- gsub("prob_", "", cat_obs)

My main question is how to write a logit model in Stan with this data in the absence of cat_true. As I understand it, it is tricky working with discrete values in Stan? I have been attempting to adapt Richard McElreath’s example in 15.3 with weighted averages, but without luck so far. Any guidance would be greatly appreciated!

The User Guide has a chapter on discrete latent parameters with an example of Noisy Categorical Measurement Model.

The following code that corresponds to your simulation

data {
  int N;
  simplex[3] probs[N];
  int Y[N];
}
parameters {
  real<lower=0,upper=1> B_a;
  real<lower=0,upper=1> B_b;
  real<lower=0,upper=1> B_c;
}
model {
  for (i in 1:N) {
    vector[3] lprob;
    // cat_true[i] == "a"
    lprob[1] = dirichlet_lpdf(probs[i]| [3,1,1]')
               + bernoulli_lpmf(Y[i]| B_a);
    // cat_true[i] == "b"
    lprob[2] = dirichlet_lpdf(probs[i]| [1,3,1]')
               + bernoulli_lpmf(Y[i]| B_b);
    // cat_true[i] == "c"
    lprob[3] = dirichlet_lpdf(probs[i]| [1,1,3]')
               + bernoulli_lpmf(Y[i]| B_c);
    target += log_sum_exp(lprob);
  }
}
generated quantities {
  // posterior probabilities of cat_true
  simplex[3] posterior_prob[N];
  for (i in 1:N) {
    vector[3] lprob;
    lprob[1] = dirichlet_lpdf(probs[i]| [3,1,1]')
               + bernoulli_lpmf(Y[i]| B_a);
    lprob[2] = dirichlet_lpdf(probs[i]| [1,3,1]')
               + bernoulli_lpmf(Y[i]| B_b);
    lprob[3] = dirichlet_lpdf(probs[i]| [1,1,3]')
               + bernoulli_lpmf(Y[i]| B_c);
    posterior_prob[i] = softmax(lprob);
  }
}

Thank you for this! I greatly appreciate it