**tl;dr** I’m trying to fit a regression with categorical predictor variables whose category assignments are uncertain.

I’m analyzing a semi-large data set (> 5,000 rows) with a linear regression model on categorical predictors, fit in R with CmdStan via {cmdstanr}. Here’s a small reprex that illustrates the data structure: a continuous measured variable, two categorical predictor variables, and their assigned certainties:

```
## Simulating data
d <- data.frame(
  Y    = rnorm(100),                                # continuous measured variable
  X1   = sample(LETTERS[1:3], 100, replace = TRUE), # categorical predictor (A, B, C)
  wei1 = runif(100),                                # certainty [0,1] of the assigned category
  X2   = sample(LETTERS[4:6], 100, replace = TRUE), # categorical predictor (D, E, F)
  wei2 = runif(100)                                 # certainty [0,1] of the assigned category
)
```

```
> head(d)
           Y X1       wei1 X2       wei2
1 0.55175975  A 0.65072489  D 0.53731439
2 0.15496037  C 0.01422981  D 0.76525172
3 0.85153264  B 0.01561891  E 0.18192876
4 0.02384035  A 0.25004881  D 0.71313754
5 0.37444941  C 0.13034577  F 0.29462676
6 0.83104657  C 0.55697677  F 0.03499256
```

The two variables `X1` and `X2` are independent, and each has three possible categories. The complication is that the category assignments are uncertain; the certainties are given in the `wei1` and `wei2` columns, respectively, with values in `[0,1]`. For example, there’s a `0.65` certainty that the first category value in variable `X1` (*i.e.*, `A`) is assigned correctly. It’s outside the scope of this post to describe *why* this is the case, so let’s just accept it. Note that these uncertainties are *not* uncertainties of `Y`.

I want to set up a model that incorporates all this information and can predict `Y` given an additive combination of `X1` and `X2`, *i.e.*, `Y ~ X1 + X2` (without intercept).
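
For reference, if the category assignments were fully certain, this would just be an ordinary no-intercept regression. A minimal sketch of that baseline in base R (the hypothetical `fit_lm` simply ignores the certainty columns):

```
## Baseline that ignores the certainties entirely
fit_lm <- lm(Y ~ 0 + X1 + X2, data = d)
coef(fit_lm) # note: lm drops one X2 level to keep the design full rank
```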

My initial idea was to “summarize” the uncertainty weights and treat the Stan model as a weighted linear regression:

```
d$wei <- (d$wei1 + d$wei2) / 2 # "summarizing" the certainties
```
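
(Since the two assignments are independent, the product of the certainties, *i.e.*, the joint probability that *both* labels are correct, would be another possible summary; the hypothetical `wei_joint` below just notes that alternative, and I stick with the mean in what follows.)

```
## Alternative summary: joint certainty that both labels are correct (not used below)
d$wei_joint <- d$wei1 * d$wei2
```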

The Stan code of this model (`mod1`) is as follows:

```
data {
  int<lower=1> N;                      // total number of observations
  vector[N] Y;                         // response variable
  vector<lower=0,upper=1>[N] wei;      // observation weights
  int<lower=1> X1;                     // number of X1 categories
  matrix<lower=0,upper=1>[N, X1] mx1;  // design matrix for X1
  int<lower=1> X2;                     // number of X2 categories
  matrix<lower=0,upper=1>[N, X2] mx2;  // design matrix for X2
}
parameters {
  vector[X1] b1;        // X1 category effects
  vector[X2] b2;        // X2 category effects
  real<lower=0> sigma;  // dispersion parameter
}
model {
  // likelihood: each observation's log density is scaled by its weight
  vector[N] mu = mx1 * b1 + mx2 * b2;  // linear predictor term
  for (n in 1:N) {
    target += wei[n] * normal_lpdf(Y[n] | mu[n], sigma);
  }
  // priors
  b1 ~ std_normal();
  b2 ~ std_normal();
  sigma ~ std_normal();
}
```

The following R code prepares the data and runs the model:

```
standata1 <- list(
  N   = nrow(d),
  Y   = d$Y,
  wei = d$wei,
  X1  = length(unique(d$X1)),
  mx1 = model.matrix(~ 0 + X1, d),
  X2  = length(unique(d$X2)),
  mx2 = model.matrix(~ 0 + X2, d)
)
mod1 <- cmdstanr::cmdstan_model("mod1.stan")
fit1 <- mod1$sample(data = standata1, parallel_chains = 4)
```
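
As a quick sanity check (assuming the Stan code above is saved as `mod1.stan`, as in the call), the posterior summaries can be pulled with {cmdstanr}’s `$summary()` method:

```
## Quick look at the coefficient and dispersion posteriors
fit1$summary(variables = c("b1", "b2", "sigma"))
```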

However, I feel like summarizing the weights might be wrong in this case. In reality the weights are associated with the categorical variables `X1` and `X2`, *not* with the continuous measured variable `Y`, but maybe I’m wrong here!

My attempt at specifying a model that more accurately reflects the data was as follows (`mod2`):

```
data {
  int<lower=1> N;                      // total number of observations
  vector[N] Y;                         // response variable
  vector<lower=0,upper=1>[N] wei1;     // certainties for X1
  vector<lower=0,upper=1>[N] wei2;     // certainties for X2
  int<lower=1> X1;                     // number of X1 categories
  matrix<lower=0,upper=1>[N, X1] mx1;  // design matrix for X1
  int<lower=1> X2;                     // number of X2 categories
  matrix<lower=0,upper=1>[N, X2] mx2;  // design matrix for X2
}
parameters {
  vector[X1] b1;        // X1 category effects
  vector[X2] b2;        // X2 category effects
  real<lower=0> sigma;  // dispersion parameter
}
model {
  // likelihood: the weights scale the coefficients, not the log density
  vector[N] b1_hat = mx1 * b1;
  vector[N] b2_hat = mx2 * b2;
  vector[N] mu;  // linear predictor term
  for (n in 1:N) {
    mu[n] = wei1[n] * b1_hat[n] + wei2[n] * b2_hat[n];
    target += normal_lpdf(Y[n] | mu[n], sigma);
  }
  // priors
  b1 ~ std_normal();
  b2 ~ std_normal();
  sigma ~ std_normal();
}
```
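
If it helps readability, I believe the same likelihood can be written without the loop via Stan’s elementwise product `.*` (a sketch that should be equivalent to the `model` block above, up to additive constants):

```
model {
  // vectorized version of the same likelihood
  vector[N] mu = wei1 .* (mx1 * b1) + wei2 .* (mx2 * b2);
  Y ~ normal(mu, sigma);
  // priors
  b1 ~ std_normal();
  b2 ~ std_normal();
  sigma ~ std_normal();
}
```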

Surely not the most elegant or best way to specify it (especially in the `model` block), but it works. Now the “weights” (*i.e.*, the certainties) are associated with the categorical coefficients, which in my understanding reflects the data basis. The model can be fit with the following R code:

```
standata2 <- list(
  N    = nrow(d),
  Y    = d$Y,
  X1   = length(unique(d$X1)),
  mx1  = model.matrix(~ 0 + X1, d),
  wei1 = d$wei1,
  X2   = length(unique(d$X2)),
  mx2  = model.matrix(~ 0 + X2, d),
  wei2 = d$wei2
)
mod2 <- cmdstanr::cmdstan_model("mod2.stan")
fit2 <- mod2$sample(data = standata2, parallel_chains = 4)
```
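
To see how the two specifications differ in practice, I compare the coefficient posteriors side by side (again just `$summary()` calls):

```
## Compare the category effects from the two models
fit1$summary(variables = c("b1", "b2"))
fit2$summary(variables = c("b1", "b2"))
```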

Now I am struggling to understand whether `mod2` is really doing what I think it is doing, namely putting the penalty (?) on the `X1` and `X2` categories instead of on `Y`, or whether I’m just misspecifying the model and getting bogus results.
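
Spelling out what I *think* the two likelihoods are (please correct me if this reading is wrong), with $x_{1,n}$ and $x_{2,n}$ the assigned categories of observation $n$:

$$
\begin{aligned}
\texttt{mod1}: &\quad \sum_{n} w_n \, \log \mathcal{N}\!\left(Y_n \mid b_{1,x_{1,n}} + b_{2,x_{2,n}},\ \sigma\right) \\
\texttt{mod2}: &\quad \sum_{n} \log \mathcal{N}\!\left(Y_n \mid w_{1,n}\, b_{1,x_{1,n}} + w_{2,n}\, b_{2,x_{2,n}},\ \sigma\right)
\end{aligned}
$$

So `mod1` down-weights the whole observation, while `mod2` shrinks an uncertain coefficient’s contribution toward zero.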

Any help would be appreciated!