Latent Categorical Predictor Variable Model

Hello,

I am working on a measurement error model with a latent categorical predictor variable. I want to do something that is very similar to a continuous measurement error model, but instead of a known standard error for each observation I am using a known simplex of probabilities which relates to the latent variable for each observation. For illustration, my data look like:

data.frame(y = c(1, 1, 0, 1, 0),
           pr_Xa = c(0.3, 0.2, 0.4, 0.7, 0.3),
           pr_Xb = c(0.1, 0.1, 0.5, 0.1, 0.4),
           pr_Xc = c(0.6, 0.7, 0.1, 0.2, 0.3),
           X_obs = c("c", "c", "b", "a", "b"))

A simple model without taking into account measurement error would look like y ~ X_obs. Where X_obs is the “observed” category for the latent variable which we get by only considering the highest value in the simplex.

But how can I include the measurement error simplex for each observation in a logistic regression model in Stan? I have tried adapting the code from this example, but if I’m understanding it right, that example assigns equal measurement error probabilities to each observation?

Here is an example of the type of model I’m trying to write—except this uses a continuous predictor variable with a known standard error attached to each observation.

data {
  int<lower=0> N;    
  array[N] real x_meas;     
  array[N] real x_se;
  vector[N] y;        
}

parameters {
  real alpha;           
  real beta;            
  real<lower=0> sigma;  
  array[N] real x;    // unknown true value
}

model {
  alpha ~ normal(0, 2);
  beta ~ normal(0, 2);
  sigma ~ cauchy(0, 5);
  for (i in 1:N) {
    x[i] ~ normal(0, 2);
    x_meas[i] ~ normal(x[i], x_se[i]);
    y[i] ~ normal(alpha + beta * x[i], sigma);
  }
}

My idealized model with a categorical predictor variable whose measurement error is given in a simplex for each observation is:

data {
  int<lower=0> N;  
  int<lower=1> K;
  array[N] int x_meas;     // observed categorical predictor
  simplex[K] p[N];      // simplex for each observation
  vector[N] y;        
}

parameters {
  real alpha;           
  real beta;            
  array[N] int x;    // unknown true value
}

model {
  alpha ~ normal(0, 2);
  beta ~ normal(0, 2);
  for (i in 1:N) {
    ...
  }
}

I am a little lost regarding what to put in the model block to mimic the measurement error process in the continuous predictor example above. I think it would be something with a categorical distribution and dirichlet prior? And I am also struggling with how to deal with the fact that the unknown true value of the predictor is an integer and therefore can’t be part of the parameters block.

The trick here will be, for each row in the data, to marginalize over all of the possible xs according to their probabilities. If the model looks like y ~ x, then in Stan without measurement error we could write

target += normal_lpdf(y | mu[x], sigma);

where x is encoded as integers. To marginalize over the possible values of x, we would instead write

data{
  array[N, 3] real<lower=0, upper=1> pr_X; // pass the probabilities as an array
}
model{
  array[3] real likelihood_terms;
  for(i in 1:N){
    for(j in 1:3){
      likelihood_terms[j] = log(pr_X[i, j]) + normal_lpdf(y | mu[j], sigma);
    }
    target += log_sum_exp(likelihood_terms);
  }
}

Edited to add: the critical difference from the Dawid & Skene model in the User’s Guide is that they treat the misclassification probabilities as a parameter to be estimated. In your case, we can simplify and enable more flexible models (e.g. where the probabilities of misclassification vary by row) because we know we don’t need to estimate those misclassification probabilities; we know them a priori. Note that the language gets a bit confusing here, and it’s hard to be unambiguous and precise. The posterior distribution for the true category of a given observation will not be the same as the simplex of classification probabilities that we pass as data. Which one of these things we mean when we refer to “the misclassification probabilities” is ambiguous except from context.

1 Like

Thank you! This is very helpful! I feel like I am still missing something though. When I run the full model below, each category’s mu parameter is equal to the average value of y despite the fact that they should all be different levels based on my simulated data.

data {
  int<lower=1> N; 
  int<lower=1> K;  // number of categories
  vector[N] y;
  array[N, K] real<lower=0,upper=1> pr_X;
}

parameters {
  vector[K] mu;
  real<lower=0> sigma;
}

model {
  mu ~ normal(0, 2);
  sigma ~ cauchy(0, 5);

  array[K] real likelihood_terms;
  for(i in 1:N){
    for(j in 1:K){
      likelihood_terms[j] = log(pr_X[i, j]) + normal_lpdf(y | mu[j], sigma);
    }
    target += log_sum_exp(likelihood_terms);
  }
}

ah, looks like I neglected to index y in the likelihood; should be y[N]

Wouldn’t that condition everything on the Nth, or last, observation of y? I tried y[i] but that resulted in a ton of divergences/max treedepths

Sorry, I’m going too fast y[i] is indeed what I meant. This model is simple enough that I’m surprised you’re hitting divergences/treedepths, and that could indicate a bug in this implementation, but I’m not currently spotting it. Happy to try to troubleshoot if you can provide real or simulated data.

As a side-note, be aware that these sorts of models can produce some unexpected behavior even with known class probabilities. See Section 7.1 (pages 27-31) of Asparouhov and Muthen 2014 (alt: journal version). I’m not aware of any Stan-based solution equivalent to the three-step solution in MPlus. But just be aware that you’ll want to do a closer inspection of the y mixture distributions.

@simonbrauer is this just saying that if you badly misspecify the observation model by applying a unimodal model to a bimodal response, you can confuse the class allocation?

@jsocolar that’s an interesting way of putting it, and it sounds correct. The idea is that the outcome y is equivalent to another variable in the mixture definitions. In such a case the ML/MAP estimates are not equivalent to the true model. In the admittedly extreme case that Asparouhov and Muthen generate, the results are dramatic. The ML estimate produced two mixture components that were distinct on the bimodal, continuous outcome and not at all distinct on the discrete variables. In reality, the reverse was true.

In most cases, the data won’t be as extreme as the contrived example, but you never know. You can imagine an omitted variable that is unrelated to the mixture components but creates a bimodal distribution in the outcome. You would run into this problem in such a scenario.

1 Like

@jsocolar here’s a full reproducible example of what I have been playing with. I included some code extracting the max value category from each simplex, x_meas because I initially thought that would be relevant. Thank you again for your help!

library(rstan)
library(MCMCpack)

N = 200
x = sample(c("a", "b", "c"), N, replace = TRUE) # true categories

# true category outcome averages
mu_a = 0.7
mu_b = 0.5
mu_c = 0.3

y = NULL
probs = data.frame(pr_a = NA, pr_b = NA, pr_c = NA)
for (i in seq_len(N)) {
  if (x[i] == "a") {
    probs[i, ] = rdirichlet(1, c(3, 1, 1))
    y[i] = rbinom(1, 1, mu_a)
  } else if (x[i] == "b") {
    probs[i, ] = rdirichlet(1, c(1, 3, 1))
    y[i] = rbinom(1, 1, mu_b)
  } else if (x[i] == "c") {
    probs[i, ] = rdirichlet(1, c(1, 1, 3))
    y[i] = rbinom(1, 1, mu_c)
  }
}
# x_meas <- colnames(probs)[max.col(probs)]
# x_meas <- gsub("pr_", "", x_meas)

stan_list <- list(
  N = N,
  K = length(unique(x)),
#  x = as.numeric(as.factor(x)),
#  x_meas = as.numeric(as.factor(x_meas)),
  y = y,
  pr_X = probs
)

model <- "
data {
  int<lower=1> N; 
  int<lower=1> K;  // number of categories
  vector[N] y;
  array[N, K] real<lower=0,upper=1> pr_X;
}

parameters {
  vector[K] mu;
  real<lower=0> sigma;
}

model {
  mu ~ normal(0, 2);
  sigma ~ student_t(3, 0, 2);

  array[K] real likelihood_terms;
  for(i in 1:N){
    for(j in 1:K){
      likelihood_terms[j] = log(pr_X[i, j]) + normal_lpdf(y[i] | mu[j], sigma);
    }
    target += log_sum_exp(likelihood_terms);
  }
}
"

fit <- stan(model_code = model, data = stan_list, iter = 2000, chains = 4)