I am hoping that somebody (with a more formal Bayesian statistics background) could help me label and understand the output of the following Stan model better.
Consider the following nested longitudinal data generating process:
- Process 1 generates extremal (time-series) data (for example, the maximum temperature) according to a Gumbel distribution
  $$x_i \sim \text{Gumbel}(\mu_{x,i}, \sigma_x)\,,$$
  where
  $$\mu_{x,i} = \beta_{0,x} + \beta_{1,x} t_i$$
  and $t_i$ is the time at point $i$.
- Process 2 generates (time-series) data for a derived (measured) quantity $y$
  $$y_i \sim \text{Normal}(\mu_{y,i}, \sigma_y)\,,$$
  with
  $$\mu_{y,i} = \beta_{0,y} + \beta_{1,y} x_i\,.$$
Generating sample data according to this data generating process and then fitting the model in Stan gives the following (sensible) results:
library(rstan)
options(mc.cores = parallel::detectCores())
rstan_options(auto_write = TRUE)
library(tidyverse)
library(extraDistr) # `rgumbel()`
library(broom.mixed) # `tidy` model outputs
# Sample data
t <- 0:10
set.seed(2020)
beta0_x <- 1; beta1_x <- 5; sigma_x <- 2
beta0_y <- 10; beta1_y <- 0.5; sigma_y <- 1.5
x <- beta0_x + beta1_x * t + rgumbel(length(t), 0, sigma_x)
y <- beta0_y + beta1_y * x + rnorm(length(t), 0, sigma_y)
# Stan model
mod <- "
data {
  int<lower=1> N;  // Number of observations
  vector[N] t;     // Time
  vector[N] x;     // Extremal quantity measured as a function of time
  vector[N] y;     // Derived quantity y measured as a function of time
}
parameters {
  real beta0_x;
  real beta1_x;
  real<lower=0> sigma_x;
  real beta0_y;
  real beta1_y;
  real<lower=0> sigma_y;
}
model {
  // Priors
  beta0_x ~ normal(mean(x), 10);
  beta1_x ~ normal(0, 10);
  sigma_x ~ cauchy(0, 2.5);
  beta0_y ~ normal(mean(y), 10);
  beta1_y ~ normal(0, 10);
  sigma_y ~ cauchy(0, 2.5);
  // Likelihood
  x ~ gumbel(beta0_x + beta1_x * t, sigma_x);
  y ~ normal(beta0_y + beta1_y * x, sigma_y);
}
"
# Fit model and show par estimates
fit <- stan(model_code = mod, data = list(N = length(t), t = t, x = x, y = y))
vars <- c("beta0_x", "beta1_x", "sigma_x", "beta0_y", "beta1_y", "sigma_y")
df_val <- mget(vars) %>% enframe() %>% unnest(value)
fit %>% tidy(vars) %>% left_join(df_val, by = c("term" = "name"))
# A tibble: 6 x 4
# term estimate std.error value
# <chr> <dbl> <dbl> <dbl>
#1 beta0_x 1.64 1.56 1
#2 beta1_x 4.82 0.285 5
#3 sigma_x 2.30 0.649 2
#4 beta0_y 9.70 1.15 10
#5 beta1_y 0.512 0.0374 0.5
#6 sigma_y 1.76 0.506 1.5
Note how parameter estimates and actual values show good agreement.
I am confused about a few things involving the model and its Stan implementation:
- First off, I’m not even sure how to characterise/label the model: I am simultaneously estimating the posteriors of both the $x$ submodel's and the $y$ submodel's parameters. Could this be interpreted as a joint model, as in e.g. Estimating Joint Models for Longitudinal and Time-to-Event Data with rstanarm? Or should this be interpreted as a latent variable model, where $t$ plays the role of the latent variable? (See also the factorisation sketch after this list.)
- I could have fitted the $x$ model first, drawn samples from its posterior predictive distribution (PPD) at the different $t_i$ values, and then used those PPD samples in a second model to estimate the $y$ model parameters (see the two-stage sketch below). I am surprised that the joint RStan model, which fits both submodels simultaneously, gives sensible results. How does Stan manage to estimate those two distributions simultaneously? Is this even a sensible thing to do?
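Regarding the first question: for what it's worth, writing $\theta_x = (\beta_{0,x}, \beta_{1,x}, \sigma_x)$ and $\theta_y = (\beta_{0,y}, \beta_{1,y}, \sigma_y)$, my current understanding is that the joint posterior factorises as
$$p(\theta_x, \theta_y \mid t, x, y) \propto p(\theta_x)\, p(x \mid t, \theta_x)\; p(\theta_y)\, p(y \mid x, \theta_y)\,,$$
so that, with $x$ observed, the $\theta_x$ and $\theta_y$ blocks appear to decouple. But this is only my own reading of the model, and I may well be missing something.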
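Regarding the second question, here is a minimal sketch of the two-stage alternative I have in mind. Note that mod_x, mod_y, and the x_rep generated quantities below are my own reduced constructions (splitting mod into its x and y halves), not part of the model above:
mod_x <- "
data {
  int<lower=1> N;
  vector[N] t;
  vector[N] x;
}
parameters {
  real beta0_x;
  real beta1_x;
  real<lower=0> sigma_x;
}
model {
  beta0_x ~ normal(mean(x), 10);
  beta1_x ~ normal(0, 10);
  sigma_x ~ cauchy(0, 2.5);
  x ~ gumbel(beta0_x + beta1_x * t, sigma_x);
}
generated quantities {
  vector[N] x_rep;  // PPD draws of x at the observed t_i
  for (i in 1:N)
    x_rep[i] = gumbel_rng(beta0_x + beta1_x * t[i], sigma_x);
}
"
mod_y <- "
data {
  int<lower=1> N;
  vector[N] x;  // here: a PPD draw of x, not the observed x
  vector[N] y;
}
parameters {
  real beta0_y;
  real beta1_y;
  real<lower=0> sigma_y;
}
model {
  beta0_y ~ normal(mean(y), 10);
  beta1_y ~ normal(0, 10);
  sigma_y ~ cauchy(0, 2.5);
  y ~ normal(beta0_y + beta1_y * x, sigma_y);
}
"
# Stage 1: fit the x model alone and extract PPD draws of x
fit_x <- stan(model_code = mod_x, data = list(N = length(t), t = t, x = x))
x_rep <- rstan::extract(fit_x, pars = "x_rep")$x_rep  # iterations x N matrix
# Stage 2: fit the y model once per PPD draw (a subsample of draws keeps
# this manageable), then pool the resulting posteriors to propagate the
# uncertainty in x
fits_y <- apply(x_rep[1:50, ], 1, function(x_draw) {
  stan(model_code = mod_y, data = list(N = length(t), x = x_draw, y = y))
})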