# Nested longitudinal data and estimating joint model

I am hoping that somebody (with a more formal Bayesian statistics background) could help me label and understand the output of the following Stan model better.

Consider the following nested longitudinal data generating process:

1. Process 1 generates extremal (time-series) data (for example, the maximum temperature) according to a Gumbel distribution

x_i \sim \text{Gumbel}(\mu_{x,i}, \sigma_x)\,,

where

\mu_{x,i} = \beta_{0,x} + \beta_{1,x} t_i

and t_i is the time at point i.

2. Process 2 is understood to generate (time-series) data for a derived (measured) quantity y

y_i = \text{Normal}(\mu_{y, i}, \sigma_y)

with

\mu_{y, i} = \beta_{0,y} + \beta_{1,y} x_i\,.

Generating sample data according to this data generation process, and then fitting the model in Stan gives the following (sensible) results:

library(rstan)
options(mc.cores = parallel::detectCores())
rstan_options(auto_write = TRUE)
library(tidyverse)
library(extraDistr)     # rgumbel()
library(broom.mixed)    # tidy model outputs

# Sample data
t <- 0:10
set.seed(2020)
beta0_x <- 1; beta1_x <- 5; sigma_x <- 2
beta0_y <- 10; beta1_y <- 0.5; sigma_y <- 1.5
x <- beta0_x + beta1_x * t + rgumbel(length(t), 0, sigma_x)
y <- beta0_y + beta1_y * x + rnorm(length(t), 0, sigma_y)

# Stan model
mod <- "
data {
int<lower=1> N;         // Number of observations
vector[N] t;            // Time
vector[N] x;            // Extremal quantity measured as a function of time
vector[N] y;            // Derived quantity y measured as a function of time
}

parameters {
real beta0_x;
real beta1_x;
real<lower=0> sigma_x;
real beta0_y;
real beta1_y;
real<lower=0> sigma_y;
}

model {
// Priors
beta0_x ~ normal(mean(x), 10);
beta1_x ~ normal(0, 10);
sigma_x ~ cauchy(0, 2.5);
beta0_y ~ normal(mean(y), 10);
beta1_y ~ normal(0, 10);
sigma_y ~ cauchy(0, 2.5);
// Likelihood
x ~ gumbel(beta0_x + beta1_x * t, sigma_x);
y ~ normal(beta0_y + beta1_y * x, sigma_y);
}
"

# Fit model and show par estimates
fit <- stan(model_code = mod, data = list(N = length(t), t = t, x = x, y = y))
vars <- c("beta0_x", "beta1_x", "sigma_x", "beta0_y", "beta1_y", "sigma_y")
df_val <- mget(vars) %>% enframe() %>% unnest(value)
fit %>% tidy(vars) %>% left_join(df_val, by = c("term" = "name"))
## A tibble: 6 x 4
#  term    estimate std.error value
#  <chr>      <dbl>     <dbl> <dbl>
#1 beta0_x    1.64     1.56     1
#2 beta1_x    4.82     0.285    5
#3 sigma_x    2.30     0.649    2
#4 beta0_y    9.70     1.15    10
#5 beta1_y    0.512    0.0374   0.5
#6 sigma_y    1.76     0.506    1.5


Note how parameter estimates and actual values show good agreement.

I am confused about a few things involving the model and its Stan implementation:

1. First off, I’m not even sure how to characterise/label the model; I am simultaneously estimating the posterior density of x and y. Could this be interpreted along the lines of estimating a joint model, as in e.g. Estimating Joint Models for Longitudinal and Time-to-Event Data with rstanarm. Or should this be interpreted as a latent variable model, where t plays the role of the latent variable?
2. I could have fitted the x model first, then drawn samples from the posterior predictive distribution (ppd) for the different t_i values, and then used those ppd samples in a second model to estimate the y model parameters. I am surprised that the joint RStan model fitting both models simulatenously gives sensible results. How is it that Stan manages to estimate those two distributions simultaneously? Is this even a sensible thing to do?
1 Like

I am still trying to wrap my head around the model described above; to rephrase my question somewhat:

Consider the model where L depends on time t, and Q depends on L. Both L and Q are quantities that are measured at time t, and that (I assume to) have a specific underlying distribution.

\begin{aligned} L_i &\sim \text{Gumbel}(\beta_{L,0} + \beta_{L,1} t_i, \sigma_L)\,,\\ Q_i &\sim \text{Normal}(\beta_{Q,0} + \beta_{Q,1} L_i, \sigma_Q)\,. \end{aligned}

My questions are:

1. It seems to be possible to fit both L_i and Q_i in one go in rstan. Why is that the case? We seem to be getting two models (posterior densities) for the price of one Stan model.

2. Are the fit results meaningful/sensible? Based on the parameter estimates, they seem to be.

3. And how would such a nested/coupled/joint model be called?

Apologies for the ping/bump but I am hoping that a Stan dev might help me with understanding above model and/or point me in the right direction. Based on my limited time here on the forum, @jonah and @martinmodrak, you’ve always been very helpful & patient with explanations and advice.

Thanks,
Maurits

PS. If this question is not a good fit, I’d like to remove the question so I can re-post on e.g. Cross Validated.

1 Like

Sorry for not getting to you sooner. Your question is relevant an well written and I am glad you bumped it up.

Actually the two-step process of using the posterior of one model as a prior in another is (usually) the harder way in Stan (see e.g. Using posteriors as new priors). However, the model as you have written it doesn’t use the ppd of x the predict y. It uses the observed x, which is fixed. I.e. you would get exactly the same results if you split the model into two, one for x and one for y.

The approach would be closer to what stan_jm does if you used \mu_{x,i} as the predictor instead of x_i. I. e.:

vector[N] mu_x = beta0_x + beta1_x * t;
x ~ gumbel(mu_x, sigma_x);
y ~ normal(beta0_y + beta1_y .* mu_x, sigma_y);


This would be both more flexible (in that it involves uncertainty about beta0_x and beta1_x) and more rigid (in that it forces a strict linear influence of t on y). One could also add additional flexibility by treating mu_x as parameters and have something like mu_x ~ normal(beta0_x + beta1_x * t, sigma_mu_x) (although I would fear this model might have computational issues).

As for giving a name to the model - we need to be aware that all the nomenclature out there is not a prescription, it is just an attempt to force some order on the actual messy reality of models people fit. The model as you have originally written it would not IMHO deserve any special name: it is just two separate linear models. The model as I suggested above would probably best be called a “joint” model for x and y, but that really isn’t saying much as “joint” simply means we model multiple things together. So to remove any ambiguity I think you should report the model formulas as you did here to let everybody know what you mean by “joint model”.

Does that make sense?

2 Likes

First off, I really appreciate your detailed response @martinmodrak. I hope you don’t mind if I follow up with some questions.

However, the model as you have written it doesn’t use the ppd of x the predict y. It uses the observed x, which is fixed. I.e. you would get exactly the same results if you split the model into two, one for x and one for y.

Yes, and that’s exactly what has me stumped. We have essentially two separate models, and Stan manages to explore the posterior densities of both x and y in a single sampling run. Is this efficient and “good” Stan modelling? Or would it be better to sample both posterior densities separately in two models? To be honest, I was surprised that sampling was successful and that I was able to obtain reasonable parameter estimates in the toy example.

The approach would be closer to what stan_jm does if you used \mu_{x,i} as the predictor instead of x_{i}. I. e.:

vector[N] mu_x = beta0_x + beta1_x * t;
x ~ gumbel(mu_x, sigma_x);
y ~ normal(beta0_y + beta1_y .* mu_x, sigma_y);


This would be both more flexible (in that it involves uncertainty about beta0_x and beta1_x) and more rigid (in that it forces a strict linear influence of t on y).

I think this is a very important point you’re making. Provided I understood you correctly, the change you suggests in essence means “propagating” uncertainties in beta0_x and beta1_x through to the estimation of beta0_y and beta1_y.

I’m not very familiar with rstanarm::stan_jm; to re-phrase would you suggest, I guess this should be straightforward to implement in above Stan example as well, basically involving the replacement of x with mu_x?

Since for x \sim \text{Gumbel}(\mu_x,\sigma_x)

E[x] = \mu_x + \sigma_x \gamma

we might have to scale mu_x_prime = mu_x + sigma_x * 0.577 in order to make the parameter estimates for beta1_y directly comparable to the actual value during data simulation.

In my “real data” model I am less interested in parameter estimates and more interested in producing predictions for x and y. Following fitting the model to observations x_i and y_i, I then

1. first predict x_\text{pred} for some new times t, and then
2. use the predicted x_\text{pred} to predict y_\text{pred}.

In Stan code, this is something along the lines of

generated quantities {
...
for (i in 1:10)
t_new[i] = max(t) + i
x_pred[i] = gumbel_rng(beta0_x + beta1_x * t_new[i]);
y_pred[i] = normal_rng(beta0_y + beta1_y * x_pred, sigma_y);
}


Would you say that here uncertainties in beta0_x and beta1_x are properly included in predicted y_pred?

I am not sure how to explain this, but I don’t think it sould be surprising - Independent variables (or submodels) is the simplest form of dependency, so if that wouldn’t work there would be little hope to make anything with more complex dependency else work. Stuff like fitting 10000 i.i.d Gaussian parameters is actually a simple benchmark used from time to time. For very small models+data this sampling of two models together might even be faster than sampling separately (you pay the price of spawning Stan only once), but anything of decent size will work faster separately (Stan is pretty good at exploiting structure of the model but not so good to not pay any additional price from the model being larger)

I am not sure I follow your thoughts here. The code snippet I provided is (hopefully, I didn’t test) all that would take to turn your model into the model I proposed.

That makes sense - you would then have mu_x_prime be the mean of the Gumbel distribution while mu_x would be the median. I can imagine situations where the median is preferable, but agree that mean is most likely the more sensible first choice.

I repeat that I don’t think this approach is inherently better than the one in your original model, just that it might make more sense for some scenarios.

Yes, assuming the original model is a good model of the data this looks reasonable to me.

Best of luck!

Brilliant, thanks again for your insights @martinmodrak. Much appreciated.