# Latent Categorical Predictor Variable Model

Hello,

I am working on a measurement error model with a latent categorical predictor variable. I want to do something that is very similar to a continuous measurement error model, but instead of a known standard error for each observation I am using a known simplex of probabilities which relates to the latent variable for each observation. For illustration, my data look like:

``````data.frame(y = c(1, 1, 0, 1, 0),
pr_Xa = c(0.3, 0.2, 0.4, 0.7, 0.3),
pr_Xb = c(0.1, 0.1, 0.5, 0.1, 0.4),
pr_Xc = c(0.6, 0.7, 0.1, 0.2, 0.3),
X_obs = c("c", "c", "b", "a", "b"))
``````

A simple model without taking into account measurement error would look like `y ~ X_obs`. Where `X_obs` is the â€śobservedâ€ť category for the latent variable which we get by only considering the highest value in the simplex.

But how can I include the measurement error simplex for each observation in a logistic regression model in Stan? I have tried adapting the code from this example, but if Iâ€™m understanding it right, that example assigns equal measurement error probabilities to each observation?

Here is an example of the type of model Iâ€™m trying to writeâ€”except this uses a continuous predictor variable with a known standard error attached to each observation.

``````data {
int<lower=0> N;
array[N] real x_meas;
array[N] real x_se;
vector[N] y;
}

parameters {
real alpha;
real beta;
real<lower=0> sigma;
array[N] real x;    // unknown true value
}

model {
alpha ~ normal(0, 2);
beta ~ normal(0, 2);
sigma ~ cauchy(0, 5);
for (i in 1:N) {
x[i] ~ normal(0, 2);
x_meas[i] ~ normal(x[i], x_se[i]);
y[i] ~ normal(alpha + beta * x[i], sigma);
}
}
``````

My idealized model with a categorical predictor variable whose measurement error is given in a simplex for each observation is:

``````data {
int<lower=0> N;
int<lower=1> K;
array[N] int x_meas;     // observed categorical predictor
simplex[K] p[N];      // simplex for each observation
vector[N] y;
}

parameters {
real alpha;
real beta;
array[N] int x;    // unknown true value
}

model {
alpha ~ normal(0, 2);
beta ~ normal(0, 2);
for (i in 1:N) {
...
}
}
``````

I am a little lost regarding what to put in the `model` block to mimic the measurement error process in the continuous predictor example above. I think it would be something with a categorical distribution and dirichlet prior? And I am also struggling with how to deal with the fact that the unknown true value of the predictor is an integer and therefore canâ€™t be part of the `parameters` block.

The trick here will be, for each row in the data, to marginalize over all of the possible `x`s according to their probabilities. If the model looks like `y ~ x`, then in Stan without measurement error we could write

``````target += normal_lpdf(y | mu[x], sigma);
``````

where `x` is encoded as integers. To marginalize over the possible values of `x`, we would instead write

``````data{
array[N, 3] real<lower=0, upper=1> pr_X; // pass the probabilities as an array
}
model{
array[3] real likelihood_terms;
for(i in 1:N){
for(j in 1:3){
likelihood_terms[j] = log(pr_X[i, j]) + normal_lpdf(y | mu[j], sigma);
}
target += log_sum_exp(likelihood_terms);
}
}
``````

Edited to add: the critical difference from the Dawid & Skene model in the Userâ€™s Guide is that they treat the misclassification probabilities as a parameter to be estimated. In your case, we can simplify and enable more flexible models (e.g. where the probabilities of misclassification vary by row) because we know we donâ€™t need to estimate those misclassification probabilities; we know them a priori. Note that the language gets a bit confusing here, and itâ€™s hard to be unambiguous and precise. The posterior distribution for the true category of a given observation will not be the same as the simplex of classification probabilities that we pass as data. Which one of these things we mean when we refer to â€śthe misclassification probabilitiesâ€ť is ambiguous except from context.

1 Like

Thank you! This is very helpful! I feel like I am still missing something though. When I run the full model below, each categoryâ€™s `mu` parameter is equal to the average value of `y` despite the fact that they should all be different levels based on my simulated data.

``````data {
int<lower=1> N;
int<lower=1> K;  // number of categories
vector[N] y;
array[N, K] real<lower=0,upper=1> pr_X;
}

parameters {
vector[K] mu;
real<lower=0> sigma;
}

model {
mu ~ normal(0, 2);
sigma ~ cauchy(0, 5);

array[K] real likelihood_terms;
for(i in 1:N){
for(j in 1:K){
likelihood_terms[j] = log(pr_X[i, j]) + normal_lpdf(y | mu[j], sigma);
}
target += log_sum_exp(likelihood_terms);
}
}
``````

ah, looks like I neglected to index `y` in the likelihood; should be `y[N]`

Wouldnâ€™t that condition everything on the Nth, or last, observation of `y`? I tried `y[i]` but that resulted in a ton of divergences/max treedepths

Sorry, Iâ€™m going too fast `y[i]` is indeed what I meant. This model is simple enough that Iâ€™m surprised youâ€™re hitting divergences/treedepths, and that could indicate a bug in this implementation, but Iâ€™m not currently spotting it. Happy to try to troubleshoot if you can provide real or simulated data.

As a side-note, be aware that these sorts of models can produce some unexpected behavior even with known class probabilities. See Section 7.1 (pages 27-31) of Asparouhov and Muthen 2014 (alt: journal version). Iâ€™m not aware of any Stan-based solution equivalent to the three-step solution in MPlus. But just be aware that youâ€™ll want to do a closer inspection of the `y` mixture distributions.

@simonbrauer is this just saying that if you badly misspecify the observation model by applying a unimodal model to a bimodal response, you can confuse the class allocation?

@jsocolar thatâ€™s an interesting way of putting it, and it sounds correct. The idea is that the outcome `y` is equivalent to another variable in the mixture definitions. In such a case the ML/MAP estimates are not equivalent to the true model. In the admittedly extreme case that Asparouhov and Muthen generate, the results are dramatic. The ML estimate produced two mixture components that were distinct on the bimodal, continuous outcome and not at all distinct on the discrete variables. In reality, the reverse was true.

In most cases, the data wonâ€™t be as extreme as the contrived example, but you never know. You can imagine an omitted variable that is unrelated to the mixture components but creates a bimodal distribution in the outcome. You would run into this problem in such a scenario.

1 Like

@jsocolar hereâ€™s a full reproducible example of what I have been playing with. I included some code extracting the max value category from each simplex, `x_meas` because I initially thought that would be relevant. Thank you again for your help!

``````library(rstan)
library(MCMCpack)

N = 200
x = sample(c("a", "b", "c"), N, replace = TRUE) # true categories

# true category outcome averages
mu_a = 0.7
mu_b = 0.5
mu_c = 0.3

y = NULL
probs = data.frame(pr_a = NA, pr_b = NA, pr_c = NA)
for (i in seq_len(N)) {
if (x[i] == "a") {
probs[i, ] = rdirichlet(1, c(3, 1, 1))
y[i] = rbinom(1, 1, mu_a)
} else if (x[i] == "b") {
probs[i, ] = rdirichlet(1, c(1, 3, 1))
y[i] = rbinom(1, 1, mu_b)
} else if (x[i] == "c") {
probs[i, ] = rdirichlet(1, c(1, 1, 3))
y[i] = rbinom(1, 1, mu_c)
}
}
# x_meas <- colnames(probs)[max.col(probs)]
# x_meas <- gsub("pr_", "", x_meas)

stan_list <- list(
N = N,
K = length(unique(x)),
#  x = as.numeric(as.factor(x)),
#  x_meas = as.numeric(as.factor(x_meas)),
y = y,
pr_X = probs
)

model <- "
data {
int<lower=1> N;
int<lower=1> K;  // number of categories
vector[N] y;
array[N, K] real<lower=0,upper=1> pr_X;
}

parameters {
vector[K] mu;
real<lower=0> sigma;
}

model {
mu ~ normal(0, 2);
sigma ~ student_t(3, 0, 2);

array[K] real likelihood_terms;
for(i in 1:N){
for(j in 1:K){
likelihood_terms[j] = log(pr_X[i, j]) + normal_lpdf(y[i] | mu[j], sigma);
}
target += log_sum_exp(likelihood_terms);
}
}
"

fit <- stan(model_code = model, data = stan_list, iter = 2000, chains = 4)
``````