Model: I am working with a model that can be described, at the bird-eye level, as a random slope logistic model. I have 0/1 outcomes, some predictors, random intercepts and random slopes for my individuals. I am fitting the models I am satisfied with, goodness of fit and convergence and all.
Problem: suppose I am given a new individual with their predictors and their trajectory of the 0/1 outcomes. Can I run the sampled chains on these to obtain the random effects? Section 27.7 provides pointers in the “flat” case of i.i.d. data on how to pick the parameters that have been simulated, and feed them to the generated quantities
without the model to sample from. But in my case, I have to deal with the random effects that I am declaring as parameters – and the random effects for the new data, obviously, have not been sampled. I cannot simply add that to the model and the sampler as the stuff runs for about 50 core-hours, so I cannot afford to re-run it for another 2000 new individuals that I need to produce the model predictions for.
Reprex as an Rmd fragment, so that the Stan model is in the middle. The goal is to get the vector u_i[2]
for the new data.
title: “Random effects reprex”
output: html_document
knitr::opts_chunk$set(echo = TRUE,
warning = FALSE,
message = FALSE,
dev = "svg") ## export images as SVG
library(ggplot2)
library(magrittr)
library(here)
library(dplyr)
library(tidyr)
library(ggplot2)
library(tibble)
library(rstan)
library(bayesplot)
library(knitr)
Simulate data
N <- 10
set.seed(1)
my_df <- data.frame( id=1:N, interc=rnorm(n=N,m=0.5,s=0.3), slope=rnorm(n=N, m=-0.5, s=0.3), n=sample(5:8, N, replace=T) )
my_df <- my_df[ rep(1:N, my_df$n), ]
my_df %>% group_by(id) %>% mutate(t=row_number()) %>% ungroup() %>%
mutate(y=as.integer(interc+slope*t/3+rnorm(nrow(my_df),m=0,s=0.3)>0)) -> my_df
Stan code
data {
int<lower=1> N; // total number of observations
int<lower=0,upper=1> outcome[N]; // outcome
int<lower=1> N_people; // number of unique individuals
int<lower=1,upper=N_people> seq_person_id[N]; // SEQUENTIAL person id -- important for indexing
vector[N] predictor; // predictor that goes with random slope
}
parameters {
// variance components
vector<lower = 0>[2] tau_u;
// random effects for people
vector[2] u_i[N_people];
// correlation matrix
corr_matrix[2] rho_matrix;
}
transformed parameters {
cov_matrix[2] Sigma;
vector[N] theta;
// covariance matrix
Sigma = quad_form_diag(rho_matrix,tau_u);
// logit of outcome
for (t in 1:N)
theta[t] = u_i[seq_person_id[t],1]
+ u_i[seq_person_id[t],2]*predictor[t];
}
model {
// outcome
outcome ~ bernoulli(inv_logit(theta));
// variances
tau_u ~ cauchy(0,1);
// correlation
rho_matrix ~ lkj_corr(1);
// random effects
u_i ~ multi_normal( [0, 0]', Sigma );
}
Bayesian computation
Prepare the data for Stan
stan_data <- list(
N = nrow(my_df), # int<lower=1> N;
outcome = unlist(my_df[,"y"]), # int<lower=0,upper=1> outcome[N];
N_people = (my_df %>% distinct(id) %>% nrow()), # int<lower=1> N_people;
seq_person_id = unlist(my_df[,"id"]), # int<lower=1,upper=N_people> seq_person_id[N];
predictor = unlist(my_df[,"t"]) # vector[N] predictor;
)
Sample from the model
stan_mcmc_rslope_logit <- rstan::sampling(
random_slope_logit, data=stan_data,
chains = 2, seed = 1234,
warmup = 1000, iter = 2000,
control=list(adapt_delta=0.9, stepsize_jitter = 0.1),
refresh = 100,
init = list( list(tau_u=c(0.2,0.2)), list(tau_u=c(0.3,0.3)) )
)
New observations
new_df <- data.frame( y_new = c(0,1,1,0,1), t_new=1:5)
What now??