Posterior prediction from moments of parameter coefficients

Hello,
I have a non-centered, hierarchical logistic regression model in STAN, and drawing from the posterior I can make predictions on new data, say

y_new = inv_logit(x1 * b1 + x2 * b2 + alpha)

However, the sampling procedure is involved: is it possible to work with moments of b1 and b2, such as their posterior means, instead of having to run over all the posterior draws? When I tested this on my data, it did not work. I was under the impression that if the coefficients are normally distributed, you could work with the expectations.

Is there a way to extract a useful moment from the distributions of b1 and b2 such that those moments produce useful predictions, instead of simulating from the posterior?
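
For concreteness, here is a minimal Python/NumPy sketch of the two quantities I am comparing; the "draws" here are made-up stand-ins for what I actually pull out of the fit:

import numpy as np

def inv_logit(eta):
    return 1.0 / (1.0 + np.exp(-eta))

# Made-up "posterior draws" standing in for the output of the fit.
rng = np.random.default_rng(0)
b_draws = rng.normal(loc=[0.8, -1.2], scale=0.5, size=(4000, 2))  # draws of (b1, b2)
alpha_draws = rng.normal(loc=0.3, scale=0.4, size=4000)           # draws of alpha

x_new = np.array([1.5, 2.0])  # one new observation (x1, x2)

# (1) Full posterior prediction: transform every draw, then average.
p_full = inv_logit(b_draws @ x_new + alpha_draws).mean()

# (2) Plug-in prediction at the posterior means of the coefficients.
p_plugin = inv_logit(b_draws.mean(axis=0) @ x_new + alpha_draws.mean())

print(p_full, p_plugin)  # these generally differ because inv_logit is nonlinear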

Not under a nonlinear transformation.

Thanks! I ran a simulation using a similar dataset, and I don’t think the non-linear aspect of the inverse-logit transform is causing the problem. The values are only slightly off (0.0009383003), nothing like what I’m seeing in my full model.
Source: https://gist.github.com/adamwespiser/1cad89738fbf1ea1553ed7fe77869070

Could the issue be the hierarchical construction of the parameters, given that the group-level parameters all come from a normal distribution?

Just a quick note… Your model will run faster if you use y ~ bernoulli_logit(X * b + a); in the model block and get rid of the transformed parameters block. If you still want to calculate pr_y, you can do that in the generated quantities block. I don’t know if this is feasible for you, but you can also compute the predictions that you are interested in there.

And a quick question: right now you have a logistic regression model with 15 independent variables, and you have put a hierarchical prior (in your example this is not non-centered) on the coefficients of those 15 variables, right? I don’t see what you mean by group parameters.

Thanks! It’s really not feasible to run predictions within Stan. In fact, the most desired output is an inference expression using single-variable summaries, i.e. pr(y == 1) == inv_logit(x1 * mean(b1) + x2 * mean(b2) + ... + mean(alpha)), where mean(...) is the posterior mean output by Stan.

My production model has group-based predictors drawn from a multivariate normal distribution, in the style of

matrix[S, K] Beta;
Beta = (diag_pre_multiply(sigma, L_Omega) * re_tilde)';  // group coefficients; re_tilde is K x S
sigma ~ cauchy(0, 2.5);                                  // scales of the K coefficients
L_Omega ~ lkj_corr_cholesky(4);                          // Cholesky factor of the correlation matrix
to_vector(re_tilde) ~ normal(0, 10);                     // raw (pre-transform) group effects

Here S is the number of groups and K is the number of predictors. I would like to make posterior predictions by taking the expectation over the posterior distribution of Beta. Is there a way to do this without having to draw Beta from the posterior at prediction time?
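
To make this concrete, here is a rough Python sketch (the array names are hypothetical placeholders for draws saved from the fit) of the plug-in prediction I would like to ship, next to the full posterior prediction I am trying to avoid at prediction time:

import numpy as np

def inv_logit(eta):
    return 1.0 / (1.0 + np.exp(-eta))

# Hypothetical arrays pulled from the fit:
#   beta_draws : (n_draws, S, K) posterior draws of Beta
#   x_new      : (K,)            predictors for one new observation
#   g          : int             the group it belongs to

def p_full(beta_draws, x_new, g):
    """Posterior predictive mean: average inv_logit over all draws of Beta."""
    return inv_logit(beta_draws[:, g, :] @ x_new).mean()

def p_plugin(beta_draws, x_new, g):
    """Plug-in prediction at E[Beta].  Note that E[Beta] is estimated by
    averaging the Beta draws themselves, not by rebuilding Beta from the
    means of sigma, L_Omega and re_tilde (Beta is nonlinear in those)."""
    return inv_logit(beta_draws.mean(axis=0)[g] @ x_new)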

Addition – In the model above, on my GitHub gist page, taking the expectation of each parameter value from the posterior works pretty well, ~0.1% off, which is good enough to use those parameters for inference. However, the reconstruction error between using the mean parameters and drawing from the posterior directly to calculate pr(y == 1) is so much higher that the distribution is unrecognizable.

The mean of a nonlinear function is not equal to a nonlinear function of the mean.

Yes, I understand this, but in my simulation it’s close enough to use for approximate inference, which is appropriate for my business problem. I’m just trying to understand the situations where it degrades beyond reasonable performance, and what role hierarchical parameters play in this!

The more parameters there are, the worse it is going to be to reduce them to constants before predicting. But I don’t understand why the time constraints are the way they are. You can do posterior prediction in a tiny fraction of the time it takes to draw from the posterior distribution in Stan.
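
For example (a rough NumPy sketch with placeholder names, not a reference implementation): once the draws are saved, posterior prediction for new data is essentially one matrix multiply.

import numpy as np

def inv_logit(eta):
    return 1.0 / (1.0 + np.exp(-eta))

# Assumed shapes: b_draws is (n_draws, K) coefficient draws,
# alpha_draws is (n_draws,), X_new is (n_obs, K) new data.
def posterior_predict(X_new, b_draws, alpha_draws):
    # One matrix multiply gives the linear predictor for every (observation, draw) pair.
    eta = X_new @ b_draws.T + alpha_draws   # (n_obs, n_draws)
    return inv_logit(eta).mean(axis=1)      # posterior predictive P(y == 1) per row

For a few hundred saved draws, this is a negligible amount of computation compared to sampling.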

Thanks! It’s definitely an artificial problem; however, I’m just one part of a much larger team that happens to expect, and has planned to implement, single coefficients!

I would ask your company to put it in writing that you will not be held responsible for the bad predictions.


It’s popular to take Laplace approximations of the posterior, which are essentially multivariate normal. But you’d still need to integrate over that nasty inverse logit to get an approximate distribution over y_new. I’m pretty sure that’s not going to work analytically.

On the other hand, you probably don’t need many more than a few dozen draws to get reasonable downstream inferences.
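
For instance, assuming you already had a mode and covariance from a Laplace fit, the downstream computation would look roughly like this (placeholder names, just a sketch):

import numpy as np

def inv_logit(eta):
    return 1.0 / (1.0 + np.exp(-eta))

# Hypothetical inputs: theta_mode is the posterior mode of (b1, ..., bK, alpha)
# and Sigma is the corresponding Laplace covariance (inverse negative Hessian).
def laplace_predict(x_new, theta_mode, Sigma, n_draws=50, seed=1):
    rng = np.random.default_rng(seed)
    draws = rng.multivariate_normal(theta_mode, Sigma, size=n_draws)
    b, alpha = draws[:, :-1], draws[:, -1]
    # Monte Carlo integral over the approximate posterior: a few dozen draws
    # is often enough for a stable predictive probability.
    return inv_logit(b @ x_new + alpha).mean()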

There are lots of ways to get bad predictions, including doing beautiful full Bayesian inference on a misspecified model. I’d be more worried about linearity assumptions in the log odds than I would be about a 0.1% non-linearity error. Of course, if it compounds, it’ll matter how much.

Thanks Bob,
The posterior simulation gives ~0.1% error with just 100 posterior draws, so that’s definitely a viable option. I’ll look into Laplace approximations of the posterior!

Your simulation may be misleading if it tends to keep things in the linear part of the logistic curve. I would suggest making a histogram of the linear predictor (what goes into inv_logit) for both your simulated and your actual data, to see how often this value lies between +/-1.5.
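
Something along these lines (just a sketch; eta stands for the linear-predictor values you compute from your draws and data):

import numpy as np

def check_linearity(eta, cutoff=1.5):
    """eta: array of linear-predictor values (what goes into inv_logit),
    computed for either the simulated or the actual data."""
    frac = np.mean(np.abs(eta) < cutoff)
    counts, edges = np.histogram(eta, bins=30)  # or plot with plt.hist(eta, bins=30)
    print(f"{100 * frac:.1f}% of values lie within +/-{cutoff}")
    return counts, edges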

BTW, I have been in the situation where I could not use multiple draws in production. I got around this by finding an optimal point estimate, one that most closely matched the posterior predictive probabilities. You can read about this here, especially Section 3:
http://ksvanhorn.com/bayes/Papers/mpd.pdf.

The basic idea is this: first, you decide on an appropriate joint distribution for your predictor variables, one that approximates what you’ll see in production. Second, you generate a large amount of simulated data: repeatedly draw from that joint distribution of predictor variables, then generate multiple simulated outcomes using posterior draws of your model parameters. You should end up with much more simulated data than you initially had when estimating your model. Now run maximum likelihood estimation on the simulated data.
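
A rough Python sketch of that recipe, where every name is a placeholder and sample_x is whatever you decide approximates the production distribution of your predictors:

import numpy as np
from sklearn.linear_model import LogisticRegression

def inv_logit(eta):
    return 1.0 / (1.0 + np.exp(-eta))

# b_draws is (n_draws, K) and alpha_draws is (n_draws,), posterior draws from the fit.
def matched_point_estimate(sample_x, b_draws, alpha_draws, n_x=20_000, n_rep=10, seed=2):
    rng = np.random.default_rng(seed)
    X = np.array([sample_x(rng) for _ in range(n_x)])      # predictor draws
    X_sim = np.repeat(X, n_rep, axis=0)                    # several outcomes per predictor draw
    idx = rng.integers(b_draws.shape[0], size=len(X_sim))  # a posterior draw for each row
    p = inv_logit(np.sum(X_sim * b_draws[idx], axis=1) + alpha_draws[idx])
    y_sim = rng.binomial(1, p)                             # simulated outcomes
    # Maximum likelihood on the simulated data (large C ~ unpenalized logistic MLE).
    mle = LogisticRegression(C=1e6, max_iter=1000).fit(X_sim, y_sim)
    return mle.coef_.ravel(), mle.intercept_[0]            # single-coefficient summary

In production you then use only the returned coefficients and intercept, exactly as you would for any plain logistic regression.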


This sounds kind of like variational inference, only with the goal of matching posterior predictive probabilities rather than minimizing KL divergences.