 Specifying a model with uncertain predictor variables?

tl;dr I’m trying to fit a regression with categorical predictor variables, but the categories have uncertainty.

I’m analyzing a fairly large data set (>5000 rows) with a categorical linear regression model in R, fit with CmdStan via {cmdstanr}. Here’s a small reprex that illustrates the data structure: a continuous measured variable, two categorical predictor variables, and their assigned certainties:

## Simulating data
d <- data.frame(
  Y = rnorm(100),                                 # continuous measured variable
  X1 = sample(LETTERS[1:3], 100, replace = TRUE), # categorical predictor variable
  wei1 = runif(100),                              # certainty [0,1] of the assigned category
  X2 = sample(LETTERS[4:6], 100, replace = TRUE),
  wei2 = runif(100)
)
head(d)
Y X1       wei1 X2       wei2
1 0.55175975  A 0.65072489  D 0.53731439
2 0.15496037  C 0.01422981  D 0.76525172
3 0.85153264  B 0.01561891  E 0.18192876
4 0.02384035  A 0.25004881  D 0.71313754
5 0.37444941  C 0.13034577  F 0.29462676
6 0.83104657  C 0.55697677  F 0.03499256

The two variables X1 and X2 are independent and each has three possible categories. The complication is that the category assignment is uncertain; the certainty is given in the wei1 and wei2 columns, respectively, with values in [0,1]. For example, there’s a 0.65 certainty that the category of X1 in the first row (i.e., A) is assigned correctly. It’s outside the scope of this post to describe why this is the case, so let’s just accept it. Note that these are uncertainties about the predictors, not about Y.

I want to set up a model that incorporates all this information and can predict Y given an additive combination of X1 and X2, i.e., Y ~ X1 + X2 (without intercept).

My initial idea was to “summarize” the uncertainty weights and treat the Stan model as weighted linear regression:

d$wei <- (d$wei1 + d$wei2) / 2 # "summarizing" the certainties

The Stan code of this model (mod1) is as follows:

data {
  int<lower=1> N;                      // total number of observations
  vector[N] Y;                         // response variable
  vector<lower=0,upper=1>[N] wei;      // model weights
  int<lower=1> X1;                     // number of X1 categories
  matrix<lower=0,upper=1>[N, X1] mx1;  // design matrix for X1
  int<lower=1> X2;                     // number of X2 categories
  matrix<lower=0,upper=1>[N, X2] mx2;  // design matrix for X2
}
parameters {
  vector[X1] b1;                       // X1 category effects
  vector[X2] b2;                       // X2 category effects
  real<lower=0> sigma;                 // dispersion parameter
}
model {
  // likelihood
  vector[N] mu = mx1 * b1 + mx2 * b2;  // linear predictor term
  for (n in 1:N) {
    target += wei[n] * normal_lpdf(Y[n] | mu[n], sigma);
  }
  // priors
  b1 ~ std_normal();
  b2 ~ std_normal();
  sigma ~ std_normal();
}

The following code in R runs the model:

standata1 <- list(
  N = nrow(d),
  Y = d$Y,
  wei = d$wei,
  X1 = length(unique(d$X1)),
  mx1 = model.matrix(~ 0 + X1, d),
  X2 = length(unique(d$X2)),
  mx2 = model.matrix(~ 0 + X2, d)
)

mod1 <- cmdstanr::cmdstan_model("mod1.stan")

fit1 <- mod1$sample(data = standata1, parallel_chains = 4)

However, I feel like summarizing the weights might be wrong in this case. In reality the weights are associated with the categorical variables X1 and X2, not with the continuous measured variable Y - but maybe I’m wrong here!
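As a sanity check on mod1 (a sketch, not part of the original analysis): multiplying each observation’s log-density by wei[n] is the same device as the weights argument of lm(), so with flat priors the posterior means of b1/b2 should land near a weighted least squares fit:

```r
# Frequentist cross-check of the weighted-likelihood idea in mod1.
# lm() with `weights` performs weighted least squares, which matches
# wei[n] * normal_lpdf(...) up to the priors and the parameterization
# (lm() drops a redundant dummy column; the Stan model keeps all of them).
fit_wls <- lm(Y ~ 0 + X1 + X2, data = d, weights = wei)
coef(fit_wls)  # compare with the posterior means of b1 and b2 from fit1
```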

My attempt in trying to specify the model to more accurately reflect the data basis was as follows (mod2):

data {
  int<lower=1> N;                      // total number of observations
  vector[N] Y;                         // response variable
  vector<lower=0,upper=1>[N] wei1;     // certainties for X1
  vector<lower=0,upper=1>[N] wei2;     // certainties for X2
  int<lower=1> X1;                     // number of X1 categories
  matrix<lower=0,upper=1>[N, X1] mx1;  // design matrix for X1
  int<lower=1> X2;                     // number of X2 categories
  matrix<lower=0,upper=1>[N, X2] mx2;  // design matrix for X2
}
parameters {
  vector[X1] b1;                       // X1 category effects
  vector[X2] b2;                       // X2 category effects
  real<lower=0> sigma;                 // dispersion parameter
}
model {
  // likelihood
  vector[N] b1_hat = mx1 * b1;
  vector[N] b2_hat = mx2 * b2;
  vector[N] mu = rep_vector(0.0, N);   // linear predictor term
  for (n in 1:N) {
    mu[n] += wei1[n] * b1_hat[n] + wei2[n] * b2_hat[n];
    target += normal_lpdf(Y[n] | mu[n], sigma);
  }
  // priors
  b1 ~ std_normal();
  b2 ~ std_normal();
  sigma ~ std_normal();
}

Surely not the most elegant or best way to specify it (especially in the model block), but it works. Now the “weights” (i.e., the certainties) are attached to the categorical coefficients, which in my understanding better reflects the data. The model can be fit with the following R code:

standata2 <- list(
  N = nrow(d),
  Y = d$Y,
  X1 = length(unique(d$X1)),
  mx1 = model.matrix(~ 0 + X1, d),
  wei1 = d$wei1,
  X2 = length(unique(d$X2)),
  mx2 = model.matrix(~ 0 + X2, d),
  wei2 = d$wei2
)

mod2 <- cmdstanr::cmdstan_model("mod2.stan")

fit2 <- mod2$sample(data = standata2, parallel_chains = 4)
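As an aside (a sketch with unchanged behavior), the loop in mod2’s model block can be written more compactly with elementwise products, which also makes the intended structure easier to read:

```stan
model {
  // same likelihood as mod2: the per-observation certainties scale the
  // category effects, not the log-density
  vector[N] mu = wei1 .* (mx1 * b1) + wei2 .* (mx2 * b2);
  Y ~ normal(mu, sigma);
  // priors
  b1 ~ std_normal();
  b2 ~ std_normal();
  sigma ~ std_normal();
}
```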

Now I am struggling to understand whether mod2 is really doing what I think it should be doing - applying the down-weighting (penalty?) to the X1 and X2 category effects instead of to Y - or whether I’m just mis-specifying my model and getting bogus results.

Any help would be appreciated!

Given that the predictor variables have measurement uncertainty, would a measurement error model make sense here? I don’t have much experience with measurement error models, and my limited experience is with continuous predictor variables and normally distributed measurement noise (the Stan manual discusses this case). In your case, the measurement error model would perhaps use a multinomial distribution, or maybe a Dirichlet distribution if the uncertainty varies for each observation. If a measurement error model makes sense for your data, hopefully one of the experts on the Stan forum can provide more specific advice.

Do you have information about the probability that the true value is in each of the three categories, or just the probability that it’s in the most likely category? For example, if we have X1_i = A and wei1_i = .75, do you know anything about the probabilities of B and C beyond the constraint that their sum must be .25?
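Following up on the measurement-error suggestion: if full per-row category probabilities were available (not just the certainty of the most likely category), the unknown true category could be marginalized out of the likelihood. A minimal single-predictor sketch, where p1 is a hypothetical array of per-row simplexes over the X1 categories (the names and data layout are assumptions, not from the original post):

```stan
data {
  int<lower=1> N;
  vector[N] Y;
  int<lower=1> X1;          // number of X1 categories
  array[N] simplex[X1] p1;  // hypothetical per-row category probabilities
}
parameters {
  vector[X1] b1;
  real<lower=0> sigma;
}
model {
  // marginalize over the unknown true category of X1
  for (n in 1:N) {
    vector[X1] lp;
    for (k in 1:X1) {
      lp[k] = log(p1[n][k]) + normal_lpdf(Y[n] | b1[k], sigma);
    }
    target += log_sum_exp(lp);
  }
  b1 ~ std_normal();
  sigma ~ std_normal();
}
```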
