Composing Stan models (posterior as next prior)

Bayesian inference is composable in theory: the posterior distribution from one model can serve as the prior for another.
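In symbols: if data sets y_1 and y_2 are conditionally independent given \theta, then

\pi(\theta \mid y_1, y_2) \propto p(y_2 \mid \theta) \, \pi(\theta \mid y_1) \propto p(y_2 \mid \theta) \, p(y_1 \mid \theta) \, \pi(\theta),

i.e. updating on y_2 with the first posterior as the prior gives the same posterior as conditioning on both data sets at once.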

Can this be achieved in Stan? Specifically, if I have an output sample matrix from one Stan model, can I turn it into a multivariate prior in another Stan model?


@anon75146577 was lamenting the same thing on Twitter the other day. For the univariate case, a very good but so far under-utilized solution is to use quantile-parameterized distributions for the priors, which I linked to at

in which case the posterior quantiles become the hyperparameters of the prior the next time you have new data to condition on. In the multivariate but unconstrained case, you could link them together with a multivariate normal copula density, which is sort of alluded to starting on page 38 of http://www.metalogdistributions.com/images/TheMetalogDistributionsInvitedLecture.pdf . For things like simplexes that are multivariate but constrained, I don’t know that there is a great solution.
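As a rough illustration of the univariate idea (not from Ben's post; the helper name and the three probability levels are my own choices), one can solve a small linear system so that a 3-term metalog's quantile function Q(u) = a1 + a2*logit(u) + a3*(u - 0.5)*logit(u) reproduces three posterior quantiles, and then pass the resulting coefficients to the next Stan model as data:

posterior_to_metalog3 <- function(draws, probs = c(0.1, 0.5, 0.9)) {
  q <- quantile(draws, probs = probs)   # posterior quantiles to match
  l <- qlogis(probs)                    # logit(u) at the chosen probabilities
  B <- cbind(1, l, (probs - 0.5) * l)   # metalog basis evaluated at probs
  as.numeric(solve(B, q))               # coefficients (a1, a2, a3)
}

# hypothetical usage with draws of one coefficient from a previous fit:
# a_next <- posterior_to_metalog3(as.matrix(previous_fit)[, "beta[1]"])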


Are you secretly on Twitter, @bgoodri?


Yes, with probability \pi.


Ben’s solution is neat and general. Another, perhaps more niche, way of including a posterior from a previous analysis as a prior is the so-called power prior. If the paper piques your interest, I can share some code for implementing some pretty simple power priors.
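For reference (the standard definition, which the post above does not spell out): given historical data D_0, the (unnormalized) power prior tempers the historical-data likelihood by a fixed a_0 \in [0, 1],

\pi(\theta \mid D_0, a_0) \propto L(D_0 \mid \theta)^{a_0} \, \pi_0(\theta),

so a_0 = 0 discards the historical data and a_0 = 1 uses it at full weight.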


@maxbiostat I’d love to see some code examples. I am working with a lot of historical data these days.

The meta-analytic-predictive (MAP) approach is another option for this. See https://cran.r-project.org/web/packages/RBesT/vignettes/introduction.html for an intro. The package implements normal, binary, and Poisson endpoints… it is tailored to deriving informative priors for clinical trials in order to reduce the control-group sample size, but it can of course be used for many other applications.

OK, here’s some code for a Normal inverse-Gamma regression analysis using power priors.
R code:

N0 <- 1000
true.betas <- c(-1, 1, .5, -.5) 
P <- 4
X0 <- matrix(NA, nrow = N0, ncol = P)
for(i in 1:P) X0[, i] <- rnorm(N0)
sy <- 2
y0 <- rnorm(N0, mean = X0%*%true.betas, sd = sy)

summary(lm(y0 ~ -1 + X0))

library(rstan)
rstan_options(auto_write = TRUE)

compiled.model.prior <- stan_model("simple_linear_regression_NIG_prior.stan")

as <- .5
bs <- 2
vb <- 1.5
lm.data <- list(
  N0 = N0,
  P = P,
  X0 = X0,
  y0 = y0,
  mu_beta = rep(0, P),
  lambda_0 = solve(vb * diag(P)),
  alpha0 = as,
  beta0 = bs,
  a_0 = .65
)

prior.lm <- sampling(compiled.model.prior, data = lm.data)
prior.lm
pairs(prior.lm, pars = "beta")

############
### Posterior
### will draw from a similar process
N <- 100
X <- matrix(NA, nrow = N, ncol = P)
new.betas <- true.betas + rnorm(P, 0, sd = .2)
for(i in 1:P) X[, i] <- rnorm(N)
y <-  rnorm(N, mean = X%*%new.betas, sd = sy)

compiled.model.posterior <- stan_model("simple_linear_regression_NIG_posterior.stan")

lm.data.forposterior <- list(
  N0 = lm.data$N0,
  P = lm.data$P,
  X0 = lm.data$X0,
  y0 = lm.data$y0,
  mu_beta = lm.data$mu_beta,
  lambda_0 = lm.data$lambda_0,
  alpha0 = lm.data$alpha0,
  beta0 = lm.data$beta0,
  N = N,
  X = X,
  y = y,
  a_0 = lm.data$a_0
)
posterior.lm <- sampling(compiled.model.posterior, data = lm.data.forposterior)
posterior.lm
pairs(posterior.lm, pars = c("beta"))

where simple_linear_regression_NIG_prior.stan reads

data{
  int<lower=0> N0;
  int<lower=0> P;
  real y0[N0];
  matrix[N0, P] X0;
  vector[P] mu_beta;
  matrix[P, P] lambda_0;
  real<lower=0> alpha0;
  real<lower=0> beta0;
  real<lower=0> a_0;
}
parameters{
  vector[P] beta;
  real<lower=0> sigma_sq;
}
model{
  /*power prior*/
  target += a_0 * normal_lpdf(y0 | X0*beta, sqrt(sigma_sq) );
  target += multi_normal_lpdf(beta| mu_beta, sigma_sq * lambda_0);
  target += inv_gamma_lpdf(sigma_sq | alpha0, beta0);
}

and simple_linear_regression_NIG_posterior.stan is

data{
  int<lower=0> P;
  int<lower=0> N0;
  real y0[N0];
  matrix[N0, P] X0;
  vector[P] mu_beta;
  matrix[P, P] lambda_0;
  real<lower=0> alpha0;
  real<lower=0> beta0;
  int<lower=0> N;
  real y[N];
  matrix[N, P] X;
  real<lower=0> a_0;
}
parameters{
  vector[P] beta;
  real<lower=0> sigma_sq;
}
model{
  /*power prior*/
  target += a_0 * normal_lpdf(y0 | X0*beta, sqrt(sigma_sq) );
  target += multi_normal_lpdf(beta| mu_beta, sigma_sq * lambda_0);
  target += inv_gamma_lpdf(sigma_sq | alpha0, beta0);
  /* Likelihood */
  target += normal_lpdf(y | X*beta, sqrt(sigma_sq) );
}

To put the issue a bit more succinctly: posterior distributions are composable with respect to independent observations, but posterior computations typically are not (the one exception being Sequential Monte Carlo, which still has a long way to go before it is a general tool). As with many properties, this property of the underlying theory doesn’t carry over to the implementations that we employ in practice.

There are lots of ways of trying to construct intermediate densities to approximate this composability, but let me warn that they typically hemorrhage information, especially in high-dimensional parameter spaces. You can always avoid this by simply refitting to the entire data set as you add more observations, if that’s computationally feasible.


Thanks @maxbiostat ! I’ll roll that paper and the code into our working group to study.

@Ara_Winter - Side note, here’s a potentially relevant paper: https://osf.io/jrjd2/ - “Cumulative science via Bayesian posterior passing, an introduction.”


You set a_0 here; however, you could also estimate it. In your experience, what is typically done?

You may also want to follow Omega.jl, where composability is a main feature (e.g. conditioning on predicates). It looks to be quite experimental, and the author recommends using Stan if your model is expressible in Stan.

Typically, a_0 is fixed. But you’re absolutely right that it can be estimated. You just need to be careful about the normalising constant: c(a_0) = \int_{\Theta} L(D \mid \theta)^{a_0}\pi(\theta)\,d\theta.
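Spelled out in the standard normalized power prior notation (with D_0 the historical data and D the new data; this is not in the original post), treating a_0 as a parameter means sampling from

\pi(\theta, a_0 \mid D, D_0) \propto L(D \mid \theta) \, \frac{L(D_0 \mid \theta)^{a_0} \, \pi_0(\theta)}{c(a_0)} \, \pi(a_0),

so dropping c(a_0) does not leave the posterior untouched; it changes the implied prior on a_0.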

See this note for discussion and examples. The paper I linked above also has a section with discussion on fixed versus random a_0.


Note that the power prior approach also suffers from this information leakage. For a few models, we know how to set a_0 such that some “information optimality” is satisfied (see this paper).

Hi,
I am having a similar issue.
I am estimating the matrix A in a model where y = Ax + e, with e an error term and x and y given vectors (len(x) < len(y)). I have several (x, y) couples and would like to iterate the estimation over the couples, using the posterior from the i-th iteration as the prior for the (i+1)-th iteration. If anyone has an idea, I would very much appreciate it.

My Stan model for the first iteration reads as follows:

data{
  vector[13] y; // NACE Rev 2 vector
  vector[9] x;  // NACE Rev 1 vector
  // A priori Dirichlet parameters
  vector[5] c01;
  vector[5] c02;
  vector[2] c03;
  vector[2] c04;
  vector[4] c05;
  vector[5] c07;
  vector[7] c08;
  vector[6] c09;
  // Likelihood parameter (the 1 left out of the estimation)
  real a06;
}

parameters{
  simplex[5] a1;
  simplex[5] a2;
  simplex[2] a3;
  simplex[2] a4;
  simplex[4] a5;
  simplex[5] a7;
  simplex[7] a8;
  simplex[6] a9;
}

model{
  a1 ~ dirichlet(c01);
  a2 ~ dirichlet(c02);
  a3 ~ dirichlet(c03);
  a4 ~ dirichlet(c04);
  a5 ~ dirichlet(c05);
  a7 ~ dirichlet(c07);
  a8 ~ dirichlet(c08);
  a9 ~ dirichlet(c09);

  y[1]  ~ normal(a1[1]*x[1], 0.1);
  y[2]  ~ normal(a1[2]*x[1] + a2[1]*x[2] + a5[1]*x[5] + a8[1]*x[8] + a9[1]*x[9], 0.1);
  y[3]  ~ normal(a3[1]*x[3], 0.1);
  y[4]  ~ normal(a1[3]*x[1] + a2[2]*x[2] + a3[2]*x[3] + a9[2]*x[9], 0.1);
  y[5]  ~ normal(a2[3]*x[2] + a8[2]*x[8], 0.1);
  y[6]  ~ normal(a5[2]*x[5], 0.1);
  y[7]  ~ normal(a1[4]*x[1] + a5[3]*x[5] + a7[1]*x[7], 0.1);
  y[8]  ~ normal(a06*x[6], 0.1);
  y[9]  ~ normal(a2[4]*x[2] + a7[2]*x[7] + a8[3]*x[8] + a9[3]*x[9], 0.1);
  y[10] ~ normal(a8[4]*x[8], 0.1);
  y[11] ~ normal(a7[3]*x[7] + a8[5]*x[8] + a9[4]*x[9], 0.1);
  y[12] ~ normal(a4[2]*x[4] + a7[4]*x[7] + a8[6]*x[8] + a9[5]*x[9], 0.1);
  y[13] ~ normal(a1[5]*x[1] + a2[5]*x[2] + a5[4]*x[5] + a7[5]*x[7] + a8[7]*x[8] + a9[6]*x[9], 0.1);
}
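One rough, approximate way to do the iteration described above (my own sketch, not something proposed elsewhere in the thread; the helper name is made up) is to moment-match each simplex's posterior to a Dirichlet and pass the fitted concentration parameters in as c01, c02, … for the next (x, y) couple. Note that this throws away the posterior correlations between the different simplexes:

fit_dirichlet_moments <- function(draws) {
  # draws: iterations x K matrix of posterior draws for one simplex (e.g. a1)
  m <- colMeans(draws)
  v <- apply(draws, 2, var)
  # for a Dirichlet, var_k = m_k * (1 - m_k) / (s + 1) with s = sum(alpha);
  # average the implied s over components for a crude but stable estimate
  s <- mean(m * (1 - m) / v - 1)
  m * s  # fitted concentration parameters alpha
}

# hypothetical usage between iterations:
# draws <- rstan::extract(fit_i)
# data_next$c01 <- fit_dirichlet_moments(draws$a1)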

Hi!
Just to pick this discussion up: I am trying to reproduce the power prior with ‘random’ a_0 for a simple regression example. It works with the unnormalized version. I am unsure, however, how to include the normalizing constant c(a_0) mentioned by Max. Currently, I am thinking along the lines of specifying the power prior as in Max’s code, using the resulting posterior with the bridgesampling package to calculate the marginal likelihood, and then using this in the regression setup.

Perhaps somebody has another idea or already implemented the normalized power prior in Stan?

Any help/suggestions appreciated!

Thanks,
Chris


Depending on the model, you might be able to write c(a_0) in closed form. Otherwise, bridge sampling is a good way to go. I have a manuscript in the works that deals with this. PM me if you are interested.

Hi max,
yes, I’d be interested! It seems I cannot send you a PM (the composer throws an error: ‘you cannot send a PM to that user’), so I am replying here.

My first example would be a simple regression model, but later on it will become more complicated (mostly SEMs). I’ve attached a small example where I used your code with the bridgesampling package. Perhaps you could have a look and see whether I am on the right track…

Thank you very much!
Best,
Chris
test_normalizedPP.R (2.2 KB)


Cool. I may move this conversation to email, since this is an ongoing project with collaborators and I’m not sure I can publicise the details yet. But one thing to notice is that

data{
...
  real<upper=0> C;
}
model{
    target += a0 * normal_lpdf(y0 | X0*beta, sigma ) / C;
...
}

won’t work, because C depends on a_0. What you need is a function that approximates c(a_0), which is the topic I’ve been working on.
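To make the general idea concrete (my own sketch under stated assumptions, not the approach from the manuscript mentioned above; it reuses compiled.model.prior and lm.data from the power prior example earlier in the thread and assumes the bridgesampling package), one can estimate log c(a_0) on a grid of fixed a_0 values and then fit a smooth approximation to it. The fitted coefficients could then be passed as data to a model in which a_0 is a parameter, with the approximated log c(a_0) subtracted from the target:

library(bridgesampling)

a0_grid <- seq(0.05, 1, by = 0.05)
log_c <- sapply(a0_grid, function(a0) {
  d <- lm.data
  d$a_0 <- a0
  fit <- sampling(compiled.model.prior, data = d, refresh = 0)
  bridge_sampler(fit, silent = TRUE)$logml  # estimate of log c(a0)
})

# simple cubic polynomial approximation of log c(a0)
log_c_fit <- lm(log_c ~ poly(a0_grid, 3, raw = TRUE))
coef(log_c_fit)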