Composing Stan models (posterior as next prior)

#1

Bayesian inference is, in theory, composable: the posterior distribution from one model can serve as the prior for another.

Can this be achieved in Stan? Specifically, if I have an output sample matrix from one Stan model, can I turn it into a multivariate prior in another Stan model?

2 Likes
#2

@Daniel_Simpson was lamenting about the same thing on Twitter the other day. For the univariate case, a very good solution that has not been utilized is to use quantile-parameterized distributions for priors, which I linked to at


in which case the posterior quantiles become the hyperparameters for the prior the next time you have new data to condition on. In the multivariate but unconstrained case, you could link them together with a multivariate normal copula density, which is sort of alluded to starting at page 38 of http://www.metalogdistributions.com/images/TheMetalogDistributionsInvitedLecture.pdf . For things like simplexes, which are multivariate but constrained, I don’t know that there is a great solution.
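
A minimal sketch of the hand-off described above, in case the mechanics help: the posterior quantiles from one fit become the hyperparameters for the next fit's prior. The quantile-parameterized (metalog) density itself is not implemented here (it lives in the linked material); the toy model, the quantile levels and all names are purely illustrative.

R code:

## Stage 1: fit a model to the first batch of data (toy normal-mean model).
library(rstan)

stage1_code <- "
data { int<lower=0> N; vector[N] y; }
parameters { real mu; }
model {
  mu ~ normal(0, 10);
  y ~ normal(mu, 1);
}
"
fit1 <- stan(model_code = stage1_code,
             data = list(N = 50, y = rnorm(50, mean = 1, sd = 1)))

## Stage 2: summarize the posterior for mu by a handful of quantiles.
## These quantiles are exactly the hyperparameters a quantile-parameterized
## prior (such as a metalog) would take when conditioning on new data.
probs <- c(0.05, 0.25, 0.5, 0.75, 0.95)
mu_quantiles <- quantile(extract(fit1, "mu")$mu, probs = probs)
mu_quantiles  # pass these as data to the next model's quantile-parameterized prior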

1 Like
#3

Are you secretly on twitter @bgoodri?

1 Like
#4

Yes, with probability \pi.

2 Likes
#5

Ben’s solution is neat and general. Another, perhaps more niche, way of including a posterior from a previous analysis as a prior is the so-called power prior. If the paper piques your interest, I can share some code implementing some pretty simple power priors.

1 Like
#6

@maxbiostat I’d love to see some code examples. I am working with a lot of historical data these days.

#7

The meta-analytic-predictive (MAP) approach is another option here. See https://cran.r-project.org/web/packages/RBesT/vignettes/introduction.html for an introduction. The package supports normal, binary and Poisson endpoints. It is tailored to deriving informative priors for clinical trials in order to reduce the control-group sample size, but it can of course be used for many other applications.
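
For a flavour of the workflow, here is a rough sketch following the linked vignette (binary endpoint). The AS example data set ships with RBesT; the tuning values (tau.prior, beta.prior, the robustification weight, and the new-trial counts r and n) are illustrative and should be checked against the vignette and package documentation rather than taken as recommendations.

R code:

library(RBesT)
set.seed(34563)

## Fit the meta-analytic-predictive (MAP) model to the historical studies
## (AS has columns study, n, r).
map_mcmc <- gMAP(cbind(r, n - r) ~ 1 | study,
                 data = AS, family = binomial,
                 tau.dist = "HalfNormal", tau.prior = 1,
                 beta.prior = 2)

## Approximate the MAP prior by a parametric mixture, then robustify it
## by adding a weakly-informative component.
map_prior    <- automixfit(map_mcmc)
robust_prior <- robustify(map_prior, weight = 0.2, mean = 0.5)

## Conditioning on new trial data is then a cheap conjugate update.
posterior <- postmix(robust_prior, r = 12, n = 40)
posterior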

#8

OK, here’s some code for a Normal inverse-Gamma regression analysis using power priors.
R code:

## Simulate the "historical" data set
N0 <- 1000
true.betas <- c(-1, 1, .5, -.5)
P <- 4
X0 <- matrix(NA, nrow = N0, ncol = P)
for(i in 1:P) X0[, i] <- rnorm(N0)
sy <- 2
y0 <- rnorm(N0, mean = X0 %*% true.betas, sd = sy)

summary(lm(y0 ~ -1 + X0)) # OLS sanity check

library(rstan)
rstan_options(auto_write = TRUE)

compiled.model.prior <- stan_model("simple_linear_regression_NIG_prior.stan")

## hyperparameters of the Normal-inverse-Gamma prior
as <- .5   # inverse-Gamma shape
bs <- 2    # inverse-Gamma rate
vb <- 1.5  # prior scale for the regression coefficients
lm.data <- list(
  N0 = N0,
  P = P,
  X0 = X0,
  y0 = y0,
  mu_beta = rep(0, P),
  lambda_0 = solve(vb * diag(P)),
  alpha0 = as,
  beta0 = bs,
  a_0 = .65
)

prior.lm <- sampling(compiled.model.prior, data = lm.data)
prior.lm
pairs(prior.lm, pars = "beta")

############
### Posterior: the new data are drawn from a similar process,
### with slightly perturbed coefficients
N <- 100
X <- matrix(NA, nrow = N, ncol = P)
new.betas <- true.betas + rnorm(P, 0, sd = .2)
for(i in 1:P) X[, i] <- rnorm(N)
y <- rnorm(N, mean = X %*% new.betas, sd = sy)

compiled.model.posterior <- stan_model("simple_linear_regression_NIG_posterior.stan")

lm.data.forposterior <- list(
  N0 = lm.data$N0,
  P = lm.data$P,
  X0 = lm.data$X0,
  y0 = lm.data$y0,
  mu_beta = lm.data$mu_beta,
  lambda_0 = lm.data$lambda_0,
  alpha0 = lm.data$alpha0,
  beta0 = lm.data$beta0,
  X = X,
  N = N,
  y = y,
  a_0 = lm.data$a_0
)
posterior.lm <- sampling(compiled.model.posterior, data = lm.data.forposterior)
posterior.lm
pairs(posterior.lm, pars = c("beta"))

where simple_linear_regression_NIG_prior.stan reads as

data{
  int<lower=0> N0;
  int<lower=0> P;
  real y0[N0];
  matrix[N0, P] X0;
  vector[P] mu_beta;
  matrix[P, P] lambda_0;
  real<lower=0> alpha0;
  real<lower=0> beta0;
  real<lower=0> a_0;
}
parameters{
  vector[P] beta;
  real<lower=0> sigma_sq;
}
model{
  /* power prior: historical likelihood discounted by a_0 */
  target += a_0 * normal_lpdf(y0 | X0*beta, sqrt(sigma_sq) );
  /* initial Normal-inverse-Gamma prior */
  target += multi_normal_lpdf(beta| mu_beta, sigma_sq * lambda_0);
  target += inv_gamma_lpdf(sigma_sq | alpha0, beta0);
}

and simple_linear_regression_NIG_posterior.stan is

data{
  int<lower=0> P;
  int<lower=0> N0;
  real y0[N0];
  matrix[N0, P] X0;
  vector[P] mu_beta;
  matrix[P, P] lambda_0;
  real<lower=0> alpha0;
  real<lower=0> beta0;
  int<lower=0> N;
  real y[N];
  matrix[N, P] X;
  real<lower=0> a_0;
}
parameters{
  vector[P] beta;
  real<lower=0> sigma_sq;
}
model{
  /* power prior: historical likelihood discounted by a_0 */
  target += a_0 * normal_lpdf(y0 | X0*beta, sqrt(sigma_sq) );
  /* initial Normal-inverse-Gamma prior */
  target += multi_normal_lpdf(beta| mu_beta, sigma_sq * lambda_0);
  target += inv_gamma_lpdf(sigma_sq | alpha0, beta0);
  /* likelihood of the current data */
  target += normal_lpdf(y | X*beta, sqrt(sigma_sq) );
}
2 Likes
#9

To put the issue a bit more succinctly – posterior distributions are composable with respect to independent observations, but posterior computations typically are not (the one exception is Sequential Monte Carlo, which still has a long way to go before it becomes a general tool). As with many properties, this property of the underlying theory doesn’t carry over to the implementations that we employ in practice.

There are lots of ways of trying to construct intermediate densities to approximate this composability, but let me warn that they typically hemorrhage information, especially for high-dimensional parameter spaces. You can always avoid this by simply refitting to the entire data set as you add more observations, if that’s computationally feasible.
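
For comparison with the power-prior code in post #8, the "just refit everything" baseline is a small variation on that code: stack the historical and current data and give the historical likelihood full weight (a_0 = 1). The sketch below assumes the objects from post #8 (X0, y0, X, y, lm.data, compiled.model.prior) are still in scope.

R code:

## Full refit baseline: pool historical and current data and do not discount
## the historical likelihood (a_0 = 1). Reuses the prior-only Stan model.
lm.data.full <- lm.data
lm.data.full$N0  <- nrow(X0) + nrow(X)
lm.data.full$X0  <- rbind(X0, X)
lm.data.full$y0  <- c(y0, y)
lm.data.full$a_0 <- 1
full.refit <- sampling(compiled.model.prior, data = lm.data.full)
full.refit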

1 Like
#10

Thanks @maxbiostat ! I’ll roll that paper and the code into our working group to study.

#11

@Ara_Winter - Side note, here’s a potentially relevant paper: https://osf.io/jrjd2/ - “Cumulative science via Bayesian posterior passing, an introduction.”

1 Like
#12

You set a_0; however, you could also estimate it. In your experience, what is typically done?

#13

You may also want to follow Omega.jl where composability is a main feature (i.e. condition on predicates). It looks to be quite experimental and the author recommends using Stan if your model is expressible in Stan.

#14

Typically, a_0 is fixed. But you’re absolutely right that it can be estimated. You just need to be careful about the normalising constant: c(a_0) = \int_{\Theta} L(D \mid \theta)^{a_0}\pi(\theta)\,d\theta.

See this note for discussion and examples. The paper I linked above also has a section with discussion on fixed versus random a_0.
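
One way to get a handle on c(a_0) in practice is to estimate its logarithm on a grid of a_0 values and then smooth or interpolate it before handing it to a model that treats a_0 as a parameter. The sketch below assumes the bridgesampling package and reuses the objects from post #8 (compiled.model.prior, lm.data); the grid, iteration counts and smoothing choice are illustrative, not a recommendation.

R code:

library(bridgesampling)

a0_grid <- seq(0.05, 1, by = 0.05)
log_c <- sapply(a0_grid, function(a0) {
  dat <- lm.data
  dat$a_0 <- a0
  fit <- sampling(compiled.model.prior, data = dat,
                  iter = 4000, refresh = 0)
  ## log marginal likelihood of the discounted historical data = log c(a_0)
  bridge_sampler(fit, silent = TRUE)$logml
})

## log c(a_0) can then be smoothed/interpolated (e.g. with a spline) and
## subtracted from the target in a Stan model where a_0 is a parameter.
plot(a0_grid, log_c, type = "b", xlab = expression(a[0]), ylab = "log c(a0)")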

1 Like
#15

Note that the power prior approach also suffers from this information leakage. For a few models, we know how to set a_0 such that some “information-optimality” criterion is satisfied (see this paper).