Predictions from a multilevel model, conditional on a single new observation in a subject not in the original data

This question is a more concrete follow-up on this question, were @martinmodrak was very helpful in recognizing the core of my question.

I have fitted a varying effects model to some data with multiple measurements for each individual. I expect that the outcome (y) is proportional to time. Here is similar, simulated data.

library(brms)
library(tidybayes)
library(ggplot2)
library(dplyr)

d <- expand.grid(time = 4:10, id = factor(1:10))


d$y <- lme4::simulate.formula(~0+time+(0+time|id), newdata = d,
                       newparams = list(beta = c(time = 1), 
                                        theta = c(id.time = 1), 
                                        sigma = 1),
                       family = gaussian,
                       seed = 1)[[1]]

head(d)
#>   time id         y
#> 1    4  1 3.0059659
#> 2    5  1 2.2575742
#> 3    6  1 1.6200366
#> 4    7  1 0.4001234
#> 5    8  1 4.1133004
#> 6    9  1 3.3169821
                       
m1 <- brm(y~0+time+(0+time|id), data = d)
add_epred_draws(d, m1) |>
  summarise(mean_e = mean(.epred)) |> 
  ggplot(aes(time, y, color = id)) + 
  geom_point() +
  geom_line(aes(y = mean_e))

I now want to make predictions for a large amount of hypothetical observations. E.g. If a new subject has y=10 at time=6, what can I expect from measurements at time = 10.
I want to make these predictions for a grid of relevant combinations of y and time (in my real data I have two fixed effects, so the grid quickly becomes very large).

Here is the solution I have for now. It seems to work, but requires refitting the entire model for each hypothetical observation:

# Generate a grid of hypothetical observations to make predictions for (here only 3)
fake_data_for_predictions <- 
  expand.grid(time = 6, y = c(5,10,15)) %>% 
  mutate(id = row_number() + 100,
         id = factor(id))

# make a list of new datasets, each with the original data and one fake observation
d_list_w_one_fake <- lapply(split(fake_data_for_predictions, 
                             fake_data_for_predictions$id),
                 rbind, d)

head(d_list_w_one_fake[[1]])
#>   time         y  id
#> 1    6 5.0000000 101
#> 2    4 3.0059659   1
#> 3    5 2.2575742   1
#> 4    6 1.6200366   1
#> 5    7 0.4001234   1
#> 6    8 4.1133004   1

# Fit a model to each nearly identical dataset
m_mult <- brm_multiple(y~0+time+(0+time|id), data = d_list_w_one_fake,
                       combine = FALSE)

# Generate predictions at time = 10 for the fake subject
gen_prediction_at_10 <- function(mod) {
  fake_obs <- mod$data[1,] # the first row is the fake
  
  fake_obs %>% 
    mutate(
      time_obs = time,
      y_obs = y,
      time = 10) %>% 
    add_predicted_draws(mod, value = "y_predicted")
}

predictions_at_10 <- purrr::map_df(m_mult, gen_prediction_at_10)

ggplot(predictions_at_10, aes(y_obs, y_predicted)) +
  stat_eye() +
  labs(title = "Predictions of y at time = 10, conditional on an observation of y at time = 6")

Created on 2022-03-23 by the reprex package (v2.0.0)

Is there a faster way to do this? This vignette (Approximate leave-future-out cross-validation for Bayesian time series models • loo) seems to describe an approach to validate this type of predictions, but does not seem to help with generating loads of predictions.

In the end, I want to make a visualization conveying the level of y at different “times” that corresponds to an e.g. 80% probability of y > 20 at time = 10.

1 Like