Composing Stan models (posterior as next prior)

Bayesian inference is composable in theory: the posterior distribution from one model can serve as the prior for another.
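In symbols: if data sets y_1 and y_2 are conditionally independent given \theta, then

\pi(\theta \mid y_1, y_2) \propto p(y_2 \mid \theta) \, \pi(\theta \mid y_1) \propto p(y_2 \mid \theta) \, p(y_1 \mid \theta) \, \pi(\theta),

i.e. updating on y_2 with the first posterior as the prior gives the same posterior as conditioning on both data sets at once.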

Can this be achieved in Stan? Specifically, if I have an output sample matrix from one Stan model, can I turn it into a multivariate prior in another Stan model?


@anon75146577 was lamenting the same thing on Twitter the other day. For the univariate case, a very good but so far under-utilized solution is to use quantile-parameterized distributions for the priors, which I linked to at

in which case the posterior quantiles become the hyperparameters of the prior the next time you have new data to condition on. In the multivariate but unconstrained case, you could link them together with a multivariate normal copula density, which is sort of alluded to starting on page 38 of http://www.metalogdistributions.com/images/TheMetalogDistributionsInvitedLecture.pdf . For things like simplexes that are multivariate but constrained, I don’t know that there is a great solution.
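As a rough illustration of the univariate idea (not from Ben's post; the helper name and the three probability levels are my own choices), one can solve a small linear system so that a 3-term metalog's quantile function Q(u) = a1 + a2*logit(u) + a3*(u - 0.5)*logit(u) reproduces three posterior quantiles, and then pass the resulting coefficients to the next Stan model as data:

posterior_to_metalog3 <- function(draws, probs = c(0.1, 0.5, 0.9)) {
  q <- quantile(draws, probs = probs)   # posterior quantiles to match
  l <- qlogis(probs)                    # logit(u) at the chosen probabilities
  B <- cbind(1, l, (probs - 0.5) * l)   # metalog basis evaluated at probs
  as.numeric(solve(B, q))               # coefficients (a1, a2, a3)
}

# hypothetical usage with draws of one coefficient from a previous fit:
# a_next <- posterior_to_metalog3(as.matrix(previous_fit)[, "beta[1]"])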


Are you secretly on Twitter, @bgoodri?


Yes, with probability \pi.


Ben’s solution is neat and general. Another, perhaps more niche, way of including a posterior from a previous analysis as a prior is the so-called power prior. If the paper piques your interest, I can share some code for implementing some pretty simple power priors.
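For reference (the standard definition, which the post above does not spell out): given historical data D_0, the (unnormalized) power prior tempers the historical-data likelihood by a fixed a_0 \in [0, 1],

\pi(\theta \mid D_0, a_0) \propto L(D_0 \mid \theta)^{a_0} \, \pi_0(\theta),

so a_0 = 0 discards the historical data and a_0 = 1 uses it at full weight.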


@maxbiostat I’d love to see some code examples. I am working with a lot of historical data these days.

The meta-analytic-predictive (MAP) approach is another option for this. See https://cran.r-project.org/web/packages/RBesT/vignettes/introduction.html for an intro. The package implements normal, binary, and Poisson endpoints… it is tailored to deriving informative priors for clinical trials in order to reduce the control-group sample size, but it can of course be used for many other applications.

OK, here’s some code for a Normal inverse-Gamma regression analysis using power priors.
R code:

N0 <- 1000
true.betas <- c(-1, 1, .5, -.5) 
P <- 4
X0 <- matrix(NA, nrow = N0, ncol = P)
for(i in 1:P) X0[, i] <- rnorm(N0)
sy <- 2
y0 <- rnorm(N0, mean = X0%*%true.betas, sd = sy)

summary(lm(y0 ~ -1 + X0))

library(rstan)
rstan_options(auto_write = TRUE)

compiled.model.prior <- stan_model("simple_linear_regression_NIG_prior.stan")

as <- .5
bs <- 2
vb <- 1.5
lm.data <- list(
  N0 = N0,
  P = P,
  X0 = X0,
  y0 = y0,
  mu_beta = rep(0, P),
  lambda_0 = solve(vb * diag(P)),
  alpha0 = as,
  beta0 = bs,
  a_0 = .65
)

prior.lm <- sampling(compiled.model.prior, data = lm.data)
prior.lm
pairs(prior.lm, pars = "beta")

############
### Posterior
### will draw from a similar process
N <- 100
X <- matrix(NA, nrow = N, ncol = P)
new.betas <- true.betas + rnorm(P, 0, sd = .2)
for(i in 1:P) X[, i] <- rnorm(N)
y <-  rnorm(N, mean = X%*%new.betas, sd = sy)

compiled.model.posterior <- stan_model("simple_linear_regression_NIG_posterior.stan")

lm.data.forposterior <- list(
  N0 = lm.data$N0,
  P = lm.data$P,
  X0 = lm.data$X0,
  y0 = lm.data$y0,
  mu_beta = lm.data$mu_beta,
  lambda_0 = lm.data$lambda_0,
  alpha0 = lm.data$alpha0,
  beta0 = lm.data$beta0,
  N = N,
  X = X,
  y = y,
  a_0 = lm.data$a_0
)
posterior.lm <- sampling(compiled.model.posterior, data = lm.data.forposterior)
posterior.lm
pairs(posterior.lm, pars = c("beta"))

where simple_linear_regression_NIG_prior.stan reads

data{
  int<lower=0> N0;
  int<lower=0> P;
  real y0[N0];
  matrix[N0, P] X0;
  vector[P] mu_beta;
  matrix[P, P] lambda_0;
  real<lower=0> alpha0;
  real<lower=0> beta0;
  real<lower=0> a_0;
}
parameters{
  vector[P] beta;
  real<lower=0> sigma_sq;
}
model{
  /*power prior*/
  target += a_0 * normal_lpdf(y0 | X0*beta, sqrt(sigma_sq) );
  target += multi_normal_lpdf(beta| mu_beta, sigma_sq * lambda_0);
  target += inv_gamma_lpdf(sigma_sq | alpha0, beta0);
}

and simple_linear_regression_NIG_posterior.stan is

data{
  int<lower=0> P;
  int<lower=0> N0;
  real y0[N0];
  matrix[N0, P] X0;
  vector[P] mu_beta;
  matrix[P, P] lambda_0;
  real<lower=0> alpha0;
  real<lower=0> beta0;
  int<lower=0> N;
  real y[N];
  matrix[N, P] X;
  real<lower=0> a_0;
}
parameters{
  vector[P] beta;
  real<lower=0> sigma_sq;
}
model{
  /*power prior*/
  target += a_0 * normal_lpdf(y0 | X0*beta, sqrt(sigma_sq) );
  target += multi_normal_lpdf(beta| mu_beta, sigma_sq * lambda_0);
  target += inv_gamma_lpdf(sigma_sq | alpha0, beta0);
  /* Likelihood */
  target += normal_lpdf(y | X*beta, sqrt(sigma_sq) );
}

To put the issue a bit more succinctly: posterior distributions are composable with respect to independent observations, but posterior computations typically are not (the one exception being Sequential Monte Carlo, which still has a long way to go before it is a general tool). As with many properties, this property of the underlying theory doesn’t carry over to the implementations that we employ in practice.

There are lots of ways of trying to construct intermediate densities to approximate this composability, but let me warn that they typically hemorrhage information, especially in high-dimensional parameter spaces. You can always avoid this by simply refitting to the entire data set as you add more observations, if that’s computationally feasible.


Thanks @maxbiostat ! I’ll roll that paper and the code into our working group to study.

@Ara_Winter - Side note, here’s a potentially relevant paper: https://osf.io/jrjd2/ - “Cumulative science via Bayesian posterior passing, an introduction.”


You set a_0 here; however, you could also estimate it. In your experience, what is typically done?

You may also want to follow Omega.jl, where composability is a main feature (e.g. conditioning on predicates). It looks to be quite experimental, and the author recommends using Stan if your model is expressible in Stan.

Typically, a_0 is fixed. But you’re absolutely right that it can be estimated. You just need to be careful about the normalising constant: c(a_0) = \int_{\Theta} L(D \mid \theta)^{a_0}\pi(\theta)\,d\theta.
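Spelled out in the standard normalized power prior notation (with D_0 the historical data and D the new data; this is not in the original post), treating a_0 as a parameter means sampling from

\pi(\theta, a_0 \mid D, D_0) \propto L(D \mid \theta) \, \frac{L(D_0 \mid \theta)^{a_0} \, \pi_0(\theta)}{c(a_0)} \, \pi(a_0),

so dropping c(a_0) does not leave the posterior untouched; it changes the implied prior on a_0.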

See this note for discussion and examples. The paper I linked above also has a section with discussion on fixed versus random a_0.


Note that the power prior approach also suffers from this information leakage. For a few models, we know how to set a_0 such that some “information optimality” is satisfied (see this paper).

Hi,
I am having a similar issue.
I am estimating the matrix A in a model where y = Ax + e, with e an error term and x and y given vectors (len(x) < len(y)). I have several (x, y) couples and would like to iterate the estimation over the couples, using the posterior from the i-th iteration as the prior for the (i+1)-th iteration. If anyone has an idea, I would very much appreciate it.

My Stan model for the first iteration reads as follows:

data{
  vector[13] y; // NACE Rev 2 vector
  vector[9] x;  // NACE Rev 1 vector
  // A priori Dirichlet parameters
  vector[5] c01;
  vector[5] c02;
  vector[2] c03;
  vector[2] c04;
  vector[4] c05;
  vector[5] c07;
  vector[7] c08;
  vector[6] c09;
  // Likelihood parameter (the 1 left out of the estimation)
  real a06;
}

parameters{
  simplex[5] a1;
  simplex[5] a2;
  simplex[2] a3;
  simplex[2] a4;
  simplex[4] a5;
  simplex[5] a7;
  simplex[7] a8;
  simplex[6] a9;
}

model{
  a1 ~ dirichlet(c01);
  a2 ~ dirichlet(c02);
  a3 ~ dirichlet(c03);
  a4 ~ dirichlet(c04);
  a5 ~ dirichlet(c05);
  a7 ~ dirichlet(c07);
  a8 ~ dirichlet(c08);
  a9 ~ dirichlet(c09);

  y[1]  ~ normal(a1[1]*x[1], 0.1);
  y[2]  ~ normal(a1[2]*x[1] + a2[1]*x[2] + a5[1]*x[5] + a8[1]*x[8] + a9[1]*x[9], 0.1);
  y[3]  ~ normal(a3[1]*x[3], 0.1);
  y[4]  ~ normal(a1[3]*x[1] + a2[2]*x[2] + a3[2]*x[3] + a9[2]*x[9], 0.1);
  y[5]  ~ normal(a2[3]*x[2] + a8[2]*x[8], 0.1);
  y[6]  ~ normal(a5[2]*x[5], 0.1);
  y[7]  ~ normal(a1[4]*x[1] + a5[3]*x[5] + a7[1]*x[7], 0.1);
  y[8]  ~ normal(a06*x[6], 0.1);
  y[9]  ~ normal(a2[4]*x[2] + a7[2]*x[7] + a8[3]*x[8] + a9[3]*x[9], 0.1);
  y[10] ~ normal(a8[4]*x[8], 0.1);
  y[11] ~ normal(a7[3]*x[7] + a8[5]*x[8] + a9[4]*x[9], 0.1);
  y[12] ~ normal(a4[2]*x[4] + a7[4]*x[7] + a8[6]*x[8] + a9[5]*x[9], 0.1);
  y[13] ~ normal(a1[5]*x[1] + a2[5]*x[2] + a5[4]*x[5] + a7[5]*x[7] + a8[7]*x[8] + a9[6]*x[9], 0.1);
}
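One rough, approximate way to do the iteration described above (my own sketch, not something proposed elsewhere in the thread; the helper name is made up) is to moment-match each simplex's posterior to a Dirichlet and pass the fitted concentration parameters in as c01, c02, … for the next (x, y) couple. Note that this throws away the posterior correlations between the different simplexes:

fit_dirichlet_moments <- function(draws) {
  # draws: iterations x K matrix of posterior draws for one simplex (e.g. a1)
  m <- colMeans(draws)
  v <- apply(draws, 2, var)
  # for a Dirichlet, var_k = m_k * (1 - m_k) / (s + 1) with s = sum(alpha);
  # average the implied s over components for a crude but stable estimate
  s <- mean(m * (1 - m) / v - 1)
  m * s  # fitted concentration parameters alpha
}

# hypothetical usage between iterations:
# draws <- rstan::extract(fit_i)
# data_next$c01 <- fit_dirichlet_moments(draws$a1)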

Hi!
Just to pick this discussion up: I am trying to reproduce the power prior with ‘random’ a_0 for a simple regression example. It works with the unnormalized version. I am unsure, however, how to include the normalizing constant c(a_0) mentioned by Max. Currently, I am thinking along the lines of specifying the power prior as in Max’s code, using the resulting posterior with the bridgesampling package to calculate the marginal likelihood, and then using this in the regression setup.

Perhaps somebody has another idea or already implemented the normalized power prior in Stan?

Any help/suggestions appreciated!

Thanks,
Chris


Depending on the model, you might be able to write c(a_0) in closed form. Otherwise, bridge sampling is a good way to go. I have a manuscript in the works that deals with this. PM me if you are interested.

Hi max,
yes, I’d be interested! It seems I cannot send you a PM (the composer throws an error: ‘you cannot send a PM to that user’), so I am replying here.

My first example would be a simple regression model, but later on it will become more complicated (mostly SEMs). I’ve attached a small example where I used your code with the bridgesampling package. Perhaps you could have a look and see whether I am on the right track…

Thank you very much!
Best,
Chris
test_normalizedPP.R (2.2 KB)


Cool. I may move this conversation to email, since this is an ongoing project with collaborators and I’m not sure I can publicise the details yet. But one thing to notice is that

data{
...
  real<upper=0> C;
}
model{
    target += a0 * normal_lpdf(y0 | X0*beta, sigma ) / C;
...
}

won’t work, because C depends on a_0. What you need is a function that approximates c(a_0), which is the topic I’ve been working on.
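To make the general idea concrete (my own sketch under stated assumptions, not the approach from the manuscript mentioned above; it reuses compiled.model.prior and lm.data from the power prior example earlier in the thread and assumes the bridgesampling package), one can estimate log c(a_0) on a grid of fixed a_0 values and then fit a smooth approximation to it. The fitted coefficients could then be passed as data to a model in which a_0 is a parameter, with the approximated log c(a_0) subtracted from the target:

library(bridgesampling)

a0_grid <- seq(0.05, 1, by = 0.05)
log_c <- sapply(a0_grid, function(a0) {
  d <- lm.data
  d$a_0 <- a0
  fit <- sampling(compiled.model.prior, data = d, refresh = 0)
  bridge_sampler(fit, silent = TRUE)$logml  # estimate of log c(a0)
})

# simple cubic polynomial approximation of log c(a0)
log_c_fit <- lm(log_c ~ poly(a0_grid, 3, raw = TRUE))
coef(log_c_fit)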