How to estimate a parameter for use in another model in brms?

Hi all, I’m working on an experiment in which a predictor variable was measured several times (with error) and an outcome variable was measured once.

What I would like to do is estimate the “true” value of the predictor for each participant from the noisy replicated measurements, and then use that estimate to predict the outcome. So, a kind of two-stage approach: 1) estimate the mean of the predictor, 2) model the relationship between the estimated predictor mean and the outcome. Ideally both regressions would be fit in the same Markov chain. I have attempted this in brms to no avail (I’m still a bit Stan-intimidated…).

Here is some representative simulated data:

library(brms)
library(tidyverse)

# simulate noisy x data

d <- tibble(
  true_x = rnorm(200),
  y = 0.5*true_x + rnorm(200),
  # observed x values
  x1 = true_x + rnorm(200, sd = 0.5),
  x2 = true_x + rnorm(200, sd = 0.5),
  x3 = true_x + rnorm(200, sd = 0.5),
  x4 = true_x + rnorm(200, sd = 0.5),
  x5 = true_x + rnorm(200, sd = 0.5))

d$ID <- factor(seq_len(nrow(d)))

I can estimate the means for each participant:

d_long <- pivot_longer(d, -c(ID, y, true_x),
                       names_to = "rep", values_to = "measured_x")

# model

m1 <- brm(
  measured_x ~ 0 + ID,
  data = d_long,
  prior = prior(normal(0,1), class = "b"),
  warmup = 1500, iter = 3000,
  cores = 4, backend = "cmdstanr")

# save draws

post <- as_draws_df(m1) %>% select(starts_with("b_ID"))

But that isn’t particularly helpful, since I can’t (to my knowledge) use those posteriors in the next model stage.
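The best I can manage from here is to collapse those posteriors to point estimates and plug them into a second model as data, which rather defeats the purpose because the first-stage uncertainty is discarded (x_hat and m2 below are just my own names):

# plug the posterior means in as if they were data (first-stage uncertainty is lost)
d$x_hat <- colMeans(post)   # one posterior mean per ID, in the same order as d
m2 <- brm(
  y ~ x_hat,
  data = d,
  cores = 4, backend = "cmdstanr")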

An ideal solution would be to fit these two models simultaneously. But I’m not sure whether this is possible, given that each model requires a different dataset (d_long for model 1, d for model 2). Potentially there’s a solution with the subset() addition term in brms, but I haven’t managed to figure it out.

Any help much appreciated!


I think it might make things clearer if I present the model mathematically. Ignoring priors for now, I have:

true\_x_i \sim N(\mu_{1i}, \sigma_1)
\mu_{1i} = \beta_1*ID1_i + \beta_2*ID2_i + \ldots + \beta_N*IDN_i

Y_i \sim N(\mu_{2i}, \sigma_2)
\mu_{2i} = \delta_0 + \delta_1 * \mu_{1i}

The problem I’m having is that the estimation of \mu_{1i} requires a dataset with a row for each of the k replicated measurements, while the estimation of the parameters in the second part of the model requires a single row per unique ID. And I can’t figure out how to estimate both within the same model formula.

p.s. I am not wedded to a specifically brms solution. If anyone has ideas for doing this directly in Stan, I would be eager to hear them.

Hello @LachlanC, there was some discussion on a related topic before in Brms & pk models?, which covers part of a PK modelling workflow (the dataset is per-observation but the post-processing is per-subject).

I think your need is a bit different: you want to express two simultaneous models in which a predictor variable in the second model is the response variable from the first. In your notation, the first part of the model predicts the latent true value \mu_{1i} for each observation, and that latent value is then the predictor in the second part.

I wonder if there is a way that the full-length dataset could be populated with the predicted value at each replicate, using the non-linear syntax (so that no row reduction is required). Personally, though, I think I’d do this directly in Stan. You can define the latent variable in the transformed parameters block and then pass it directly to the second component of your model. It may be convenient to start from the Stan code that brms generates via make_stancode() for your two submodels.
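For example, something along these lines would give you the two chunks of generated Stan code to work from (the second formula just uses the simulated true_x as a placeholder for however you end up expressing the latent predictor):

# Stan code brms generates for the measurement sub-model
make_stancode(
  measured_x ~ 0 + ID,
  data = d_long,
  prior = prior(normal(0, 1), class = "b"))

# Stan code brms generates for the outcome sub-model
make_stancode(y ~ true_x, data = d)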


You can fit this model directly in Stan, where it’s a bit easier to pass data structures with different dimensions. You can think of this as a hierarchical model:

x_{n,j} \sim N(\mu_n, \sigma_x)
y_n \sim N(\mu_n, \sigma_y)

This is specified in Stan as:

data {
  int<lower=1> N;              // number of individuals
  int<lower=1> Nmeasurements;  // number of replicate measurements per individual
  vector[N] y;                 // outcome, one value per individual
  matrix[N, Nmeasurements] x;  // noisy replicate measurements of the predictor
}

parameters {
  real<lower=0> sigma_x;       // measurement error sd
  vector[N] true_x;            // latent 'true' predictor value per individual

  real<lower=0> sigma_y;       // residual sd of the outcome
}

model {
  // row n of x holds all replicates for individual n
  for (n in 1:N) {
    x[n] ~ normal(true_x[n], sigma_x);
  }

  y ~ normal(true_x, sigma_y);
}

This does a reasonable job of recovering the ‘true’ values with the default uniform priors:

library(tidyverse)
library(cmdstanr)

# simulate noisy x data

d <- tibble(
  true_x = rnorm(200),
  y = 0.5*true_x + rnorm(200),
  # observed x values
  x1 = true_x + rnorm(200, sd = 0.5),
  x2 = true_x + rnorm(200, sd = 0.5),
  x3 = true_x + rnorm(200, sd = 0.5),
  x4 = true_x + rnorm(200, sd = 0.5),
  x5 = true_x + rnorm(200, sd = 0.5))


stan_data <- list(
  N = nrow(d),
  Nmeasurements = ncol(d[,3:7]),
  y = d$y,
  x = as.matrix(d[,3:7])  # pass the replicates as a numeric matrix
)

# Estimate Stan model
mod <- cmdstan_model("model.stan")
samp <- mod$sample(
  data = stan_data
)

# Compute mean estimated 'true' value for each individual
samp_x_mean <- colMeans(samp$draws(variables = "true_x",
                                   format = "draws_matrix"))

# Compare estimated 'true' value to generating value
comp_true_x <- cbind(d$true_x, samp_x_mean)

> head(comp_true_x)
                     samp_x_mean
true_x[1] -0.6522736  -1.0719358
true_x[2]  0.7498932   0.7134411
true_x[3]  2.0755305   2.3759420
true_x[4] -2.6725507  -2.9482978
true_x[5]  0.2261058   0.3088987
true_x[6]  0.4756021   0.5248264


Yes, I had been trying to do this exact thing without success. But it seems, from the Stan code above, that I may not even need the data in long format!

Wow, I can’t believe how simple this proved to be! Could you please elaborate a little on the meaning of these lines?

for (n in 1:N) {
  x[n] ~ normal(true_x[n], sigma_x);
}
I see that, for the nth individual, their x measurements are distributed as normal with mean true\_x_n. I’m so used to the long-format data structure, though, that I can’t quite see how Stan uses this matrix structure to inform true\_x_n.
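If I’m reading it right, x[n] picks out row n of the matrix as a row_vector, and the vectorised normal applies element-wise, so it would be equivalent to this double loop (just my attempt at unpacking it, please correct me if I’m wrong):

for (n in 1:N) {
  for (j in 1:Nmeasurements) {
    x[n, j] ~ normal(true_x[n], sigma_x);
  }
}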

I have adjusted the Stan code slightly to include an intercept and a beta parameter in the model for y (a sketch of that change is a little further down). The posterior means of true\_x calculated as above almost exactly match those obtained by simply averaging the replicates, which is what I had anticipated:

d$raw_mean <- rowMeans(d[,3:7])

head(cbind(d$raw_mean, samp_x_mean))

                     samp_x_mean
true_x[1]  0.7175155   0.7146122
true_x[2]  1.3910691   1.3881449
true_x[3] -1.2007596  -1.1945678
true_x[4] -0.3405883  -0.3363386
true_x[5]  0.2427991   0.2361900
true_x[6] -0.7224959  -0.7126916
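
For reference, the adjustment I made to the model for y was roughly the following, with alpha as the intercept and beta as the slope on the latent true_x (I’m sure there are tidier ways to write it):

parameters {
  real<lower=0> sigma_x;
  vector[N] true_x;

  real alpha;                  // intercept in the model for y
  real beta;                   // slope on the latent true_x
  real<lower=0> sigma_y;
}

model {
  for (n in 1:N) {
    x[n] ~ normal(true_x[n], sigma_x);
  }

  y ~ normal(alpha + beta * true_x, sigma_y);
}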

The estimated beta parameter is quite close to that produced from a model including the simulated true_x variable:

broom::tidy(lm(y ~ true_x, data = d))
# A tibble: 2 × 5
  term        estimate std.error statistic      p.value
  <chr>          <dbl>     <dbl>     <dbl>        <dbl>
1 (Intercept)    0.112    0.0697      1.61 0.110       
2 true_x         0.409    0.0699      5.84 0.0000000206

posterior::summarise_draws(samp$draws(c("alpha", "beta")))
# A tibble: 2 × 10
  variable  mean median     sd    mad       q5   q95  rhat ess_bulk ess_tail
  <chr>    <dbl>  <dbl>  <dbl>  <dbl>    <dbl> <dbl> <dbl>    <dbl>    <dbl>
1 alpha    0.113  0.113 0.0706 0.0705 -0.00325 0.229  1.00   27255.    7885.
2 beta     0.379  0.379 0.0672 0.0673  0.268   0.489  1.00   22409.    8206.

What I did not expect is that the results from the model fitted in Stan are almost identical to those from a model that simply uses the calculated means as data (ignoring the measurement error):

broom::tidy(lm(y ~ raw_mean, data = d))
# A tibble: 2 × 5
  term        estimate std.error statistic      p.value
  <chr>          <dbl>     <dbl>     <dbl>        <dbl>
1 (Intercept)    0.113    0.0700      1.61 0.108       
2 raw_mean       0.379    0.0674      5.62 0.0000000628

My expectation was that the Stan model would produce a somewhat different beta estimate and a larger posterior sd than the above. But this may be a consequence of the way I simulated the data: the error around the estimated true_x may not play much of a role here. If I’m thinking about it correctly, with five replicates of sd 0.5 the error in the per-person mean has sd 0.5/sqrt(5) ≈ 0.22, which is small relative to the sd of true_x (1), so very little attenuation of beta would be expected. I will try a few different settings (greater measurement error, fewer replicates, etc.) and see how things change.
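
Something like this is what I have in mind for the next round of simulations, reusing the compiled model from above (the sd of 1.5 and the two replicates are arbitrary choices just to exaggerate the measurement error):

# noisier measurements, fewer replicates, same generative model
d2 <- tibble(
  true_x = rnorm(200),
  y = 0.5*true_x + rnorm(200),
  x1 = true_x + rnorm(200, sd = 1.5),
  x2 = true_x + rnorm(200, sd = 1.5))

stan_data2 <- list(
  N = nrow(d2),
  Nmeasurements = 2,
  y = d2$y,
  x = as.matrix(d2[, c("x1", "x2")])
)

samp2 <- mod$sample(data = stan_data2)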

Thank you both for your help!!!