Prediction using point estimates in rstan

Hi,

I’m newish to Stan and I’m trying to do prediction on a much larger data set by extracting the coefficients from the Stan model and translating them to SQL. It’s a bernoulli_logit hierarchical model with two levels, where the second level has 19 regions. I’m also including a generated quantities block for prediction so I can compare against my manually calculated predictions. The coefficients I extract by taking the posterior mean match the printed coefficients, but when I do the manual calculation, 1 / (1 + exp(-(B * X))), I get different results. Is there an obvious step I’m missing? Specifically, the manually calculated predictions seem to be more extreme (closer to 0 or 1).

Thanks for any help!

Michael

Your description doesn’t quite make sense. What are you comparing these transformed coefficients to?

Sorry, what I meant was that the coefficients I found by taking the posterior means match the coefficients reported by print(model, pars = c("beta")).

Either:

  • you are summarizing before transforming, which is the wrong order (transform, then summarize), or
  • you have a bug.

You don’t provide enough info to tell which is going on.

I think I might be summarizing before transforming (I’m also using variational Bayes for speed). Does this look wrong? The beta_coef matches the printed coefficients above for all J regions.

beta <- extract(model, "beta")
# sum the 1000 draws divided by 1000, i.e. the posterior mean of each coefficient
beta_mat <- as.matrix(beta$beta, ncol = K * J, nrow = 1000, byrow = T) / 1000
beta_mat <- matrix(beta_mat, ncol = K * J)
beta_coef <- matrix(colSums(beta_mat), ncol = J)

Then:
prediction <- 1 / (1 + exp(-as.matrix(df[, full_vars_w_intercept1, with = F]) %*% beta_coef[, ]))

It helps if you supply your model. Otherwise we have to guess, as I’m about to do.

This doesn’t make sense. The model just defines a log density. There are then several inference algorithms that let you extract point estimates, including optimization and MCMC sampling.

If you’re using MCMC sampling to compute a posterior, then plugging in the posterior mean won’t give you the same as the mean of the posterior prediction because there are non-linear functions involved. That is, the posterior mean of the parameters is just that—it’s the average parameter value weighted by posterior density. The posterior mean of the prediction is the prediction averaged over all values of parameters (weighted the same way).
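
As a quick illustration of the difference (toy numbers, nothing to do with your model; just simulated draws of a single coefficient):

inv_logit <- function(z) 1 / (1 + exp(-z))
set.seed(1)
beta_draws <- rnorm(1000, mean = 0.5, sd = 1.5)   # pretend posterior draws of one coefficient
x <- 2                                            # a single predictor value

inv_logit(mean(beta_draws) * x)   # plug-in: transform the posterior mean of the parameter
mean(inv_logit(beta_draws * x))   # full Bayes: posterior mean of the transformed prediction

The second number is pulled toward 0.5 relative to the first, which is exactly the “more extreme” pattern you see when you plug the averaged coefficients into the inverse logit yourself.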

In general, you don’t want to use point estimates to do prediction if you can do full Bayes because they tend not to be well calibrated.

Thanks Bob, that’s the conclusion I was coming to: prediction using the mean of the parameters is not the same as the mean of the posterior prediction, which makes sense. I’m now using the full sampling to generate the posterior predictions for all individuals and then taking the mean over those draws as a point estimate for each observation, which I think is the track I want to be on.

Stan computes that posterior mean over the draws automatically in all of the interfaces. So you just write the prediction down in the generated quantities block and the posterior mean gets reported for you.
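
In rstan that looks something like this (assuming the generated quantities block declares a vector of predictions called y_pred; the name is just illustrative):

pred_draws <- rstan::extract(model, pars = "y_pred")$y_pred   # draws x N matrix
point_pred <- colMeans(pred_draws)                            # posterior mean prediction per observation

# or read the same means straight off the fit summary
summary(model, pars = "y_pred")$summary[, "mean"]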

One other thing to keep in mind: add any “noise” in the model to the prediction using an RNG if you want to capture the appropriate uncertainty in the posterior predictive distribution. It’ll change posterior intervals even when it has little (or, in some cases, no) effect on the posterior mean.
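
For a Bernoulli outcome that means simulating a 0/1 draw per posterior draw rather than stopping at the probability; in the Stan program itself that’s what bernoulli_logit_rng does in generated quantities. A rough sketch of the same thing in R, where beta_draws (draws x K) and X (N x K) are illustrative names:

inv_logit <- function(z) 1 / (1 + exp(-z))
p_draws <- inv_logit(beta_draws %*% t(X))              # draws x N matrix of probabilities
y_rep <- matrix(rbinom(length(p_draws), 1, p_draws),   # one simulated outcome per draw and observation
                nrow = nrow(p_draws))
colMeans(y_rep)   # posterior predictive mean, close to colMeans(p_draws),
                  # but the draws in y_rep also carry the outcome-level noise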

In cases where it isn’t feasible to run all the observations through Stan itself, what is the best practice for generating the full posterior for all observations? For example, I’m using RStan and can only load a few million rows into memory, but I have a universe of several hundred million observations that I’d like to generate the posterior for. Right now I’m coding all the models up in SQL, but that seems tedious and pretty computationally intensive.

I think you want to be using the bigmemory R package or one of the packages related to it:
https://cran.r-project.org/web/packages/bigmemory/index.html

Thanks, I’ll check that out.

Usually the CLT kicks in well before several hundred million observations unless you have long-tailed data, sparse structure, or lots of subgroup structure. It doesn’t sound like you have that.

I meant for the prediction step: I have several hundred million observations without a dependent variable (outcome) that I’d like the model to generate posterior predictions for.

Not sure what you mean by dependent variable here, but you’re not going to have an easy time doing 100M+ predictions with the full posterior in Stan just due to output size. What you’ll probably need to do is compute running means (and/or variances with Welford’s algorithm) and not store all of the draws for the predictive inferences.
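
Something along these lines (a sketch only; beta_draws and X are the same illustrative names as above, and in practice you’d also stream X in row chunks from disk or the database):

inv_logit <- function(z) 1 / (1 + exp(-z))
n_draws <- nrow(beta_draws)
mean_pred <- rep(0, nrow(X))   # running posterior mean of the prediction, one per observation
m2_pred <- rep(0, nrow(X))     # running sum of squared deviations (Welford)

for (d in seq_len(n_draws)) {
  p <- inv_logit(as.vector(X %*% beta_draws[d, ]))   # predictions under draw d
  delta <- p - mean_pred
  mean_pred <- mean_pred + delta / d
  m2_pred <- m2_pred + delta * (p - mean_pred)
}

var_pred <- m2_pred / (n_draws - 1)   # per-observation posterior variance of the prediction

Only the running mean and variance per observation ever live in memory, never the full draws-by-observations matrix of predictions.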