Hello everyone,
I’m getting started with Stan and try to code a model where I have some group assignment uncertainty in my data. I tried to come up with a simple reprex to illustrate what’s my goal, but ultimately I would have this as a part of a modified brms
Stan script:
library(dplyr)
library(rstan)
library(brms)
library(tidybayes)
# Fake data structure ----
set.seed(42)
N <- 100L
K <- 3L
dat <-
tibble(
x = rnorm(N),
y = rnorm(N,2.5 * x - 1,1.5),
group = sample(1:k,N,replace = TRUE)
)
dat
I have data onto which I want to fit a GLMM with Stan, and among other variables I have a categorical grouping factor. For some of the data points, the value of group
is uncertain, akin to a measurement error model you can have on a continuous covariate. I have however data on how probable each data point is to be a given value (e.g. group_i \sim categorical(\theta_i) with \theta_i = (p_1,\ p2_,\ ...,\ p_k) a simplex (say here that \theta_i is constant, like vector theta = [0.2, 0.3, 0.5]'
for simplicity)).
My first intuition was to randomly reassign group_i at every sampling iteration following \theta_i to propagate this uncertainty to the estimate of the effect of group (whether as a fixed or random effect) via categorical_rng()
:
data {
int N;
int K;
array[N] vector[k] theta;
// ...
}
parameters {
int newGroup[N];
// ...
}
transformed parameters {
for (i in 1:N) {
newGroup[i] = categorical_rng(theta[i])
}
// ...
}
but successively found out several hurdles in successfully modeling this, namely:
- No
int
allowed in (transformed) parameters block(s) - No
_rng
functions allowed in (transformed) parameters or model block(s) - the
transformed data
block being run only once, where I would need this random assignment performed at each sampling step
I tried several alternative/“hacky” ways to do so, like:
- creating the
newGroup[N]
integer vector in the generated quantities but it is then out of themodel
block’s scope, - having a random
real rand
follow a Uniform(0, 1) to infer a random group assignment (using the cumulated sum of \theta_i) in themodel
block, butrand ~ uniform(0.0, 1.0);
obviously does not work like I hoped it would.
Questions
- Is it possible to model such a group uncertainty in Stan?
- Is it a mathematically correct way to propagate uncertainty over a grouping factor?
- Is there a better way to achieve what I want?
- An alternative intuition I have would be to randomize my dataset following the \theta_i and fit my model on each dataset, but this feels inefficient and less accurate compared to something that could run in Stan directly.
Any help/comment/criticism welcome!