Posterior prediction from moments of parameter coefficients

Hello,
I have a non-centered, hierarchical logistic regression model in STAN, and drawing from the posterior I can make predictions on new data, say

y_new = inv_logit(x1 * b1 + x2 * b2 + alpha)

However, the sampling procedure is involved: is it possible to work with moments of b1 and b2, such as their posterior means, instead of having to run over all the posterior draws? When I tested this on my data, it did not work. I was under the impression that if the coefficients are normally distributed, you could work with the expectations.

Is there a way to extract a useful moment from the distributions of b1 and b2 such that those moments produce useful predictions, instead of simulating from the posterior?
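
For concreteness, here is a minimal Python/NumPy sketch of the two quantities I am comparing; the "draws" here are made-up stand-ins for what I actually pull out of the fit:

import numpy as np

def inv_logit(eta):
    return 1.0 / (1.0 + np.exp(-eta))

# Made-up "posterior draws" standing in for the output of the fit.
rng = np.random.default_rng(0)
b_draws = rng.normal(loc=[0.8, -1.2], scale=0.5, size=(4000, 2))  # draws of (b1, b2)
alpha_draws = rng.normal(loc=0.3, scale=0.4, size=4000)           # draws of alpha

x_new = np.array([1.5, 2.0])  # one new observation (x1, x2)

# (1) Full posterior prediction: transform every draw, then average.
p_full = inv_logit(b_draws @ x_new + alpha_draws).mean()

# (2) Plug-in prediction at the posterior means of the coefficients.
p_plugin = inv_logit(b_draws.mean(axis=0) @ x_new + alpha_draws.mean())

print(p_full, p_plugin)  # these generally differ because inv_logit is nonlinear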

Not under a nonlinear transformation.

Thanks! I ran a simulation using a similar dataset, and I don’t think the non-linear aspect of the inverse-logit transform is causing the problem. The values are only slightly off (0.0009383003), nothing like what I’m seeing in my full model.
Source: https://gist.github.com/adamwespiser/1cad89738fbf1ea1553ed7fe77869070

Could the issue be the hierarchical construction of the parameters, given that the group-level parameters all come from a normal distribution?

Just a quick note… Your model will run faster if you use y ~ bernoulli_logit(X * b + a); in the model block and get rid of the transformed parameters block. If you still want to calculate pr_y, you can do that in the generated quantities block. I don’t know if this is feasible for you, but you can also compute the predictions that you are interested in there.

And a quick question: right now you have a logistic regression model with 15 independent variables, and you have put a hierarchical prior (in your example this is not non-centered) on the coefficients of those 15 variables, right? I don’t see what you mean by group parameters.

Thanks! It’s really not feasible to run predictions within Stan. In fact, the most desired output is an inference expression using single-variable summaries, i.e. pr(y == 1) == inv_logit(x1 * mean(b1) + x2 * mean(b2) + ... + mean(alpha)), where mean(...) is the posterior mean output by Stan.

My production model has group-based predictors drawn from a multivariate normal distribution, in the style of

matrix[S, K] Beta;
Beta = (diag_pre_multiply(sigma, L_Omega) * re_tilde)';  // group coefficients; re_tilde is K x S
sigma ~ cauchy(0, 2.5);                                  // scales of the K coefficients
L_Omega ~ lkj_corr_cholesky(4);                          // Cholesky factor of the correlation matrix
to_vector(re_tilde) ~ normal(0, 10);                     // raw (pre-transform) group effects

Here S is the number of groups and K is the number of predictors. I would like to make posterior predictions by taking the expectation over the posterior distribution of Beta. Is there a way to do this without having to draw Beta from the posterior at prediction time?
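
To make this concrete, here is a rough Python sketch (the array names are hypothetical placeholders for draws saved from the fit) of the plug-in prediction I would like to ship, next to the full posterior prediction I am trying to avoid at prediction time:

import numpy as np

def inv_logit(eta):
    return 1.0 / (1.0 + np.exp(-eta))

# Hypothetical arrays pulled from the fit:
#   beta_draws : (n_draws, S, K) posterior draws of Beta
#   x_new      : (K,)            predictors for one new observation
#   g          : int             the group it belongs to

def p_full(beta_draws, x_new, g):
    """Posterior predictive mean: average inv_logit over all draws of Beta."""
    return inv_logit(beta_draws[:, g, :] @ x_new).mean()

def p_plugin(beta_draws, x_new, g):
    """Plug-in prediction at E[Beta].  Note that E[Beta] is estimated by
    averaging the Beta draws themselves, not by rebuilding Beta from the
    means of sigma, L_Omega and re_tilde (Beta is nonlinear in those)."""
    return inv_logit(beta_draws.mean(axis=0)[g] @ x_new)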

Addition – In the model above, on my GitHub gist page, taking the expectation of each parameter value from the posterior works pretty well, ~0.1% off, which is good enough to use those parameters for inference. However, the reconstruction error between using the mean parameters and drawing from the posterior directly to calculate pr(y == 1) is so much higher that the distribution is unrecognizable.

The mean of a nonlinear function is not equal to a nonlinear function of the mean.

Yes, I understand this, but in my simulation it’s close enough to use for approximate inference, which is appropriate for my business problem. I’m just trying to understand the situations where it degrades beyond reasonable performance, and what role hierarchical parameters play in this!

The more parameters there are, the worse it is going to be to reduce them to constants before predicting. But I don’t understand why the time constraints are the way they are. You can do posterior prediction in a tiny fraction of the time it takes to draw from the posterior distribution in Stan.
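
For example (a rough NumPy sketch with placeholder names, not a reference implementation): once the draws are saved, posterior prediction for new data is essentially one matrix multiply.

import numpy as np

def inv_logit(eta):
    return 1.0 / (1.0 + np.exp(-eta))

# Assumed shapes: b_draws is (n_draws, K) coefficient draws,
# alpha_draws is (n_draws,), X_new is (n_obs, K) new data.
def posterior_predict(X_new, b_draws, alpha_draws):
    # One matrix multiply gives the linear predictor for every (observation, draw) pair.
    eta = X_new @ b_draws.T + alpha_draws   # (n_obs, n_draws)
    return inv_logit(eta).mean(axis=1)      # posterior predictive P(y == 1) per row

For a few hundred saved draws, this is a negligible amount of computation compared to sampling.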

Thanks! It’s definitely an artificial problem; however, I’m just one part of a much larger team that happens to expect, and has planned to implement, single coefficients!

I would ask your company to put it in writing that you will not be held responsible for the bad predictions.


It’s popular to take Laplace approximations of the posterior, which are essentially multivariate normal. But you’d still need to integrate over that nasty inverse logit to get an approximate distribution over y_new. I’m pretty sure that’s not going to work analytically.

On the other hand, you probably don’t need many more than a few dozen draws to get reasonable downstream inferences.
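
For instance, assuming you already had a mode and covariance from a Laplace fit, the downstream computation would look roughly like this (placeholder names, just a sketch):

import numpy as np

def inv_logit(eta):
    return 1.0 / (1.0 + np.exp(-eta))

# Hypothetical inputs: theta_mode is the posterior mode of (b1, ..., bK, alpha)
# and Sigma is the corresponding Laplace covariance (inverse negative Hessian).
def laplace_predict(x_new, theta_mode, Sigma, n_draws=50, seed=1):
    rng = np.random.default_rng(seed)
    draws = rng.multivariate_normal(theta_mode, Sigma, size=n_draws)
    b, alpha = draws[:, :-1], draws[:, -1]
    # Monte Carlo integral over the approximate posterior: a few dozen draws
    # is often enough for a stable predictive probability.
    return inv_logit(b @ x_new + alpha).mean()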

There are lots of ways to get bad predictions, including doing beautiful full Bayesian inference on a misspecified model. I’d be more worried about linearity assumptions in the log odds than I would be about a 0.1% non-linearity error. Of course, if it compounds, it’ll matter how much.

Thanks Bob,
The posterior simulation gives ~0.1% error with just 100 posterior draws, so that’s definitely a viable option. I’ll look into Laplace approximations of the posterior!

Your simulation may be misleading if it tends to keep things in the linear part of the logistic curve. I would suggest making a histogram of the linear predictor (what goes into inv_logit) for both your simulated and your actual data, to see how often this value lies between +/-1.5.
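
Something along these lines (just a sketch; eta stands for the linear-predictor values you compute from your draws and data):

import numpy as np

def check_linearity(eta, cutoff=1.5):
    """eta: array of linear-predictor values (what goes into inv_logit),
    computed for either the simulated or the actual data."""
    frac = np.mean(np.abs(eta) < cutoff)
    counts, edges = np.histogram(eta, bins=30)  # or plot with plt.hist(eta, bins=30)
    print(f"{100 * frac:.1f}% of values lie within +/-{cutoff}")
    return counts, edges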

BTW, I have been in the situation where I could not use multiple draws in production. I got around this by finding an optimal point estimate, one that most closely matched the posterior predictive probabilities. You can read about this here, especially Section 3:
http://ksvanhorn.com/bayes/Papers/mpd.pdf.

The basic idea is this: first, you decide on an appropriate joint distribution for your predictor variables, one that approximates what you’ll see in production. Second, you generate a large amount of simulated data: repeatedly draw from that joint distribution of predictor variables, then generate multiple simulated outcomes using posterior draws of your model parameters. You should end up with much more simulated data than you initially had when estimating your model. Now run maximum likelihood estimation on the simulated data.
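
A rough Python sketch of that recipe, where every name is a placeholder and sample_x is whatever you decide approximates the production distribution of your predictors:

import numpy as np
from sklearn.linear_model import LogisticRegression

def inv_logit(eta):
    return 1.0 / (1.0 + np.exp(-eta))

# b_draws is (n_draws, K) and alpha_draws is (n_draws,), posterior draws from the fit.
def matched_point_estimate(sample_x, b_draws, alpha_draws, n_x=20_000, n_rep=10, seed=2):
    rng = np.random.default_rng(seed)
    X = np.array([sample_x(rng) for _ in range(n_x)])      # predictor draws
    X_sim = np.repeat(X, n_rep, axis=0)                    # several outcomes per predictor draw
    idx = rng.integers(b_draws.shape[0], size=len(X_sim))  # a posterior draw for each row
    p = inv_logit(np.sum(X_sim * b_draws[idx], axis=1) + alpha_draws[idx])
    y_sim = rng.binomial(1, p)                             # simulated outcomes
    # Maximum likelihood on the simulated data (large C ~ unpenalized logistic MLE).
    mle = LogisticRegression(C=1e6, max_iter=1000).fit(X_sim, y_sim)
    return mle.coef_.ravel(), mle.intercept_[0]            # single-coefficient summary

In production you then use only the returned coefficients and intercept, exactly as you would for any plain logistic regression.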


This sounds kind of like variational inference, only with the goal of matching posterior predictive probabilities rather than minimizing KL divergences.