# Estimate predictor from multiple measurements and then use to predict outcome

Hi there,

I obtained a predictor 3 times per individual. Then I obtained an outcome once for each individual (N = 120).

I want to estimate the “true” value for the predictor for each individual using partial pooling. Then I want to use this estimate to predict the outcome. So the model looks like this:

Let the predictor be denoted by X and the outcome by y.

X_{obs} \sim normal(\mu_{Xtrue}, \sigma_{X})
\mu_{Xtrue} = \alpha_{subject}
\alpha_{subject} \sim normal(\bar{a}, \sigma_{subject})
\bar{a} \sim normal(0, 1)
\sigma_{X} \sim exponential(1)
\sigma_{subject} \sim exponential(1)

y \sim normal(\mu, \sigma)
\mu \sim \alpha_y + X_{true} \times \beta
\alpha_y \sim normal(0, 1)
\beta \sim normal(0, 0.5)
\sigma \sim exponential(1)

Obviously the vectors for the first model have 360 values, because for each individual there are three values. However, my X_{true} estimate should be the pooled subject mean distributions (1 per subject, thus 120). The outcome vector has 120 values. My question is, how can I model this in stan. If I write the code the way I thought it should be I get an error saying that there cannot be different vector lengths. It only works if I put for each subject the same outcome value 3 times. But then, the model assumes that the outcome has been measured three times, which was not the case.

How do I have to specify this model. Here was the attempt where I put the outcome values 3x per subject but that is not the way I need it. I must code it somehow with vector y as 120 and that for the final model 120 “true” estimated predictor values are used

data{
vector[360] y;
vector[360] predictor_obs;
int id[360];
}
parameters{
vector[120] a_predictor;
real a_predictor_bar;
real<lower=0> sigma_predictor_a;
real<lower=0> sigma_predictor;
real a_y;
real b;
real<lower=0> sigma;
}
model{
vector[360] mu;
vector[360] predictor_true;
sigma ~ exponential( 1 );
b ~ normal( 0 , 0.5 );
a_y ~ normal( 0 , 1 );
sigma_predictor ~ exponential( 1 );
sigma_predictor_a ~ exponential( 1 );
a_predictor_bar ~ normal( 0 , 1 );
a_predictor ~ normal( a_predictor_bar , sigma_predictor_a );
for ( i in 1:360 ) {
predictor_true[i] = a_predictor[id[i]];
}
predictor_obs ~ normal( predictor_true , sigma_predictor );
for ( i in 1:360 ) {
mu[i] = a_y + b * predictor_true[i];
}
y ~ normal( mu , sigma );
}


1 Like

So if I get it correctly, this is some sort of measurement error model?

The way I would do this would be to have (not checked for completeness, just a sketch of the main changes):

data{
int<lower=1> N_subj;
int<lower=1> N_X_obs;
vector[N_subj] y;
matrix[N_X_obs, N_subj] predictor_obs;
}

...

model {
vector[N_subj] mu;
...
for(i in 1:N_X_obs) {
predictor_obs[i, ] ~ normal(a_predictor, sigma_predictor);
}

mu = a_y + b * a_predictor;
y ~ normal( mu , sigma );
}


I also moved the constants to data which is a good practice but not strictly necessary.

Is that somewhat clear or is some of the syntax hard to understand? (I also hope there are no big errors that would prevent you from compiling this)

1 Like

@martinmodrak:
I could finally test the model with your input and test it (simulations). In the simulations it works very well. Thanks so much for providing me hints. I will test some more scenarios and then apply it to real data. Maybe I will get back with questions but so far it is clear!

1 Like

Hi again,

The model does the job perfectly. But I need to implement the next step:
There is missing data in the predictor and in the outcome. I read the chapter about that in SR2 and the stan manual about missingness but I do not feel confident to code this in stan. At the same time, I do not know how to code above model in rethinking or brms. I am thinking to fit this model in brms and then look at the stancode to build understanding. In any case, I need a hint how to implement this.

@paul.buerkner: Can you maybe tell me if the below model can be written in brms? I think I then would know how to implement the missing data imputation for brms (or alternatively use mice). I saw the measurement error vignette but it seems to assume that I have the SE instead of estimating this SE from the data.

Regarding the missing data:
As this predictor has been obtained three times per individual, the best imputation strategy seems to me to impute it using a multilevel model that only estimates the intercept and random intercepts:

pred \sim normal(\mu_p, \sigma_p)
\mu_p = \alpha_p
\alpha_p \sim normal(\alpha_{pbar}, \sigma_{pbar})
\alpha_{pbar} \sim normal(0,1)
\sigma_{pbar} \sim exponential(1)

So, pred here is known except that for some individuals (<15%) there was only 1 or 2 measurement(s) instead of 3. I used now mean imputation (mean of the individual). But that makes the model maybe too certain. And that also does not solve the missingness in the outcome (what I did not say above to make it less complicated: there are several outcomes that are only weakly correlated < 0.2 but that get predicted from the predictor).

Hope someone can help!

data{
int<lower=1> N_subj;
int<lower=1> N_X_obs;
vector[N_subj] y;
matrix[N_X_obs, N_subj] pred_obs;
}
parameters{
vector[N_subj] a_pred;
real<lower=0> sigma_pred;
real a_y;
real bS;
real<lower=0> sigma;
}
model{
vector[N_subj] mu;
sigma ~ exponential( 1 );
bS ~ normal( 0 , 0.5 );
a_y ~ normal( 0 , 1 );
sigma_pred ~ exponential( 1 );
a_pred ~ normal(0, 1);

for (i in 1:N_X_obs) {
pred_obs[i, ] ~ normal(a_pred, sigma_pred);
}

mu = a_y + bS * a_pred;
y ~ normal(mu, sigma);
}