Random group assignment to model uncertainty

Hello everyone,

I’m getting started with Stan and am trying to code a model where I have some group-assignment uncertainty in my data. I tried to come up with a simple reprex to illustrate my goal, but ultimately this would become part of a modified brms Stan script:

library(dplyr)
library(rstan)
library(brms)
library(tidybayes)

# Fake data structure ----
set.seed(42)
N <- 100L
K <- 3L
dat <- 
  tibble(
    x = rnorm(N),
    y = rnorm(N, 2.5 * x - 1, 1.5),
    group = sample(1:K, N, replace = TRUE)
  )
dat

I have data to which I want to fit a GLMM with Stan, and among other variables I have a categorical grouping factor. For some of the data points the value of group is uncertain, akin to a measurement error model on a continuous covariate. However, I have data on how probable each data point is to belong to a given group (i.e. group_i \sim categorical(\theta_i) with \theta_i = (p_1,\ p_2,\ ...,\ p_K) a simplex; say here that \theta_i is constant, like vector theta = [0.2, 0.3, 0.5]' for simplicity).
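Just to make the data structure concrete, this is how I picture \theta for the constant case in the reprex above (one row per observation, each row a simplex over the K groups), which I assume would be passed to Stan as data:

# assignment probabilities: N rows, K columns, here constant for simplicity
theta <- matrix(rep(c(0.2, 0.3, 0.5), each = N), nrow = N, ncol = K)
head(theta)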

My first intuition was to randomly reassign group_i at every sampling iteration following \theta_i to propagate this uncertainty to the estimate of the effect of group (whether as a fixed or random effect) via categorical_rng():

data {
  int N;
  int K;
  array[N] simplex[K] theta;
  // ...
}
parameters {
  array[N] int newGroup;
  // ...
}
transformed parameters {
  for (i in 1:N) {
    newGroup[i] = categorical_rng(theta[i]);
  }
  // ...
}

but I successively ran into several hurdles when trying to model this, namely:

  1. no integer variables are allowed in the parameters or transformed parameters blocks,
  2. no _rng functions are allowed in the parameters, transformed parameters, or model blocks,
  3. the transformed data block is run only once, whereas I would need this random assignment to be performed at each sampling step.

I tried several alternative/“hacky” ways to do so, like:

  • creating the newGroup[N] integer array in the generated quantities block, but it is then out of the model block’s scope (sketched just after this list),
  • having a random real rand follow a Uniform(0, 1) to infer a random group assignment (using the cumulative sum of \theta_i) in the model block, but rand ~ uniform(0.0, 1.0); obviously does not work the way I hoped it would.
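For reference, the generated-quantities version mentioned in the first bullet would look something like this (it runs, but as said newGroup never reaches the likelihood):

generated quantities {
  // categorical_rng() is allowed in this block, but newGroup is not visible
  // from the model block, so it cannot enter the likelihood
  array[N] int newGroup;
  for (i in 1:N) {
    newGroup[i] = categorical_rng(theta[i]);
  }
}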

Questions

  • Is it possible to model such a group uncertainty in Stan?
  • Is it a mathematically correct way to propagate uncertainty over a grouping factor?
  • Is there a better way to achieve what I want?
  • An alternative intuition I have would be to generate several randomized versions of my dataset following the \theta_i and fit my model on each of them, but this feels inefficient and less accurate than something that could run directly in Stan.

Any help/comment/criticism welcome!

Welcome @Kenneth.K!
There may be better ways to approach this, but one reasonable way to proceed would be to create, say, 100 versions of your dataset by sampling from the assignment probabilities, and then fit your model with brm_multiple().
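Something along these lines (untested sketch, reusing dat, K, and the theta matrix of assignment probabilities from your reprex):

# draw a fresh group column from the assignment probabilities for each of
# 100 copies of the data, then fit them all at once with brm_multiple()
datasets <- lapply(1:100, \(i)
  dat |> mutate(group = factor(apply(theta, 1, \(p) sample(1:K, 1, prob = p))))
)
fit <- brm_multiple(y ~ x + group, data = datasets)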

Hey @amynang!

Thank you for the pointer!

brm_multiple() does indeed do exactly what I was considering as the alternative I mentioned, but even better than I had hoped, since it outputs a combined brmsfit object that behaves (almost) like a regular one (barring the extra caution needed for convergence diagnostics, as per the function’s manual page).

For reference, an updated reprex showing the result would be:

library(dplyr)
library(brms)

# Fake data ----
set.seed(42)
N <- 200L
K <- 3L

## Probability of group assignment ----
#' cf. https://en.wikipedia.org/wiki/Dirichlet_distribution#From_gamma_distribution
theta <- 
  lapply(
    1:N,
    \(i) rgamma(K, runif(K, 0, 100)) |> 
      {\(.) . / sum(.)}() |> 
      setNames(paste0("p", 1:K))
  ) |> 
  bind_rows()

## Dataset where group matches true_group 80% of the time ----
dat <- 
  tibble(
    x = rnorm(N),
    true_group = sapply(1:N, \(i) sample(1:K, 1, replace = TRUE, prob = theta[i, ])),
    y = rnorm(
      N,
      2.5 * x + case_when(
        true_group == 1 ~ 0.8,
        true_group == 2 ~ -0.3,
        true_group == 3 ~ 2.5
      ),
      1.5
    )
  ) |> 
  mutate(
    group = sapply(true_group, \(g) if (runif(1) <= 0.8) g else sample(setdiff(1:K, g), 1)),
    .after = true_group
  ) |> 
  mutate(across(ends_with("group"), as.factor))
    
# Generating a list of 100 datasets with group reassigned ----
datasets <- 
  lapply(
    1:100,
    \(.) dat |> 
      mutate(group = sapply(1:N, \(i) sample(1:K, 1, replace = TRUE, prob = theta[i, ])) |> as.factor())
  )

# Fitting on single dataset and multiple datasets ----
empty_fit    <- brm_multiple(formula = y ~ -1 + x + group, data = datasets, combine = TRUE, chains = 0)
fit_single   <- brms:::update.brmsfit_multiple(empty_fit, formula. = y ~ -1 + x + group, newdata = list(dat), chains = 4)
fit_multiple <- brms:::update.brmsfit_multiple(empty_fit, formula. = y ~ -1 + x + group, newdata = datasets, chains = 4)

fit_single
fit_multiple

This will certainly suffice if I don’t find another way to proceed, so I can mark this as a solution.

To go further towards a potentially better solution, I also read chapter 15, “Missing Data and Other Opportunities”, of McElreath’s Statistical Rethinking, as I saw later that this chapter tackles measurement error and missing-data imputation, with a mention of categorical errors and discrete data. The relevant bit I’ll try to adapt is:

15.3.2. Discrete error. The example above concerned missing data. But when the data are measured instead with error, the procedure is very similar. Suppose for example that in the example above each house is assigned a probability of a cat being present. Call this probability k_i. When we are sure there is a cat there, k_i = 1. When we are sure there is no cat, k_i = 0. When we think it is a coin flip, k_i = 0.5. These k_i values replace the parameter k in the previous model, becoming the weights for averaging over our uncertainty.

Adapting to my case, I imagine this becomes:

\begin{align} y_i &\sim Normal(\beta x_i + \alpha_{group_i},\sigma) \\ group_i &\sim categorical(\theta_i) \end{align}
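If I understand the quote correctly, “averaging over our uncertainty” means the discrete group_i gets marginalized out of the likelihood, so each observation contributes a mixture over the K possible groups:

\begin{align} p(y_i \mid \beta, \alpha, \sigma) &= \sum_{k=1}^{K} \theta_{i,k} \ Normal(y_i \mid \beta x_i + \alpha_k,\ \sigma) \end{align}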

But how to code this in Stan is still unclear to me. I suspect \theta would be provided as data and that group_i \sim categorical(\theta_i) should be part of the model block, but I still bump into the issue of group_i being an integer. It’s unfortunate that the book doesn’t provide a Stan code example for this, nor does Kurz’s nice companion web book.
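My best guess at coding that marginalization would be something like the following (untested sketch, with \theta passed as data and alpha holding the group intercepts):

data {
  int<lower=1> N;
  int<lower=2> K;
  vector[N] x;
  vector[N] y;
  array[N] simplex[K] theta;  // group-assignment probabilities, given as data
}
parameters {
  real beta;
  vector[K] alpha;            // group intercepts
  real<lower=0> sigma;
}
model {
  // priors omitted for brevity
  for (i in 1:N) {
    vector[K] lp;
    for (k in 1:K) {
      // log of: Pr(group_i = k) * Normal density of y_i given group k
      lp[k] = log(theta[i, k]) + normal_lpdf(y[i] | beta * x[i] + alpha[k], sigma);
    }
    target += log_sum_exp(lp);  // marginalize over the K possible groups
  }
}

If that is the right reading, the posterior probability of each group for a given observation could then be recovered in generated quantities by applying softmax() to that observation’s lp vector, but I’m not sure I have this right.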

Any help with this or other feedback welcome!

You can define local int variables in the model block.

That’s right. You can have randomness in transformed data and generated quantities. If you want randomness elsewhere, it has to be with a parameter.
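A minimal illustration of both points:

model {
  int g = 2;                         // local integer variables are fine here
  // g = categorical_rng(theta[1]);  // ...but _rng calls are not allowed in this block
}
generated quantities {
  int g = categorical_rng(theta[1]); // _rng calls are fine here and in transformed data
}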

Not quite the way you present it without doing the multiple imputation suggested by @amynang. We’re working on a new block that would sort of solve this problem the same way as “cut” in BUGS.

Thanks for your input @Bob_Carpenter!

I’ve read a little about cut models in BUGS after seeing your post; that does sound interesting indeed.
I’ll check this feature out when it is released in Stan!