Best way to do prediction on new data? R rstan stanfit


#1

Hi,

I am new to stan, so I’d really appreciate any help. I use R 3.4.1 with rstan (2.14.1).

I have a fitted stan object in R, fitted via rstan::stan. How do I make prediction on new data?

Options so far:

  • Include new data in call to ‘stan’ function, like stan reference 2.16.0 page 160. However this requires new data to be available at training time, which is not the case for me. I do not wish to retrain the model when new-data-for-prediction comes since model training takes a while
  • Extract posterior parameter samples into R and do prediction in R with new-data-for-prediction. However this requires me to duplicate the model structure in R
  • Muck around with the “algorithm = 'Fixed_parameter” function in rstan::stan. Doesn’t do what I want, it probably is not meant for this use case?

The user experience I am after is:

  • Fit model in R, eg: fit <- stan(…)
  • Make posterior predictions using new data and the posterior parameter samples in fit, eg: new_data_pred_samples <- predict(fit, newdata)
  • then I can get prediction point estimates and bounds from these new_data_pred_samples

Any ideas?
Thanks


#2

This is your best bet if you are using rstan::stan. If you were using the rstanarm or brms packages, there is a posterior_predict method with a newdata argument that does exactly this.


#3

Please note also that, if you coded your model in stan, you can access it in rstan without recoding: here


#4

Thanks everyone. I’ve decided to duplicate model structure in R.

Think the expose_stan_functions functionality seemed promising, but couldn’t find any useful examples of it being used for my use case. Most of the R code I ended up writing were wrapper code that pulls samples from stanfit object and stan-data lists anyway.

It is a shame that rstan doesn’t have posterior_predict method like brms and rstanarm.


#5

rstan::stan does not not what the model is, so it cannot do the duplication to have a posterior_predict method.


#6

I had a similar problem a while ago and I too lamented the lack of something as easy as a predict() type solution. If you’re not yet familiar with it, generating posterior predicted values using McElreath’s rethinking R package is just as easy. The rethinking package is a front-end for stan and I can’t say enough positive things about the package or McElreath’s book. If you pursue this route, see the link() function in chapter 4.

Until there’s a similar solution available directly in rstan, I think you’ve got to do this manually. But rather than using expose_stan_functions(), I think it’s easier and more transparent to do this entirely outside of stan and in R. Here’s an example using the rats example that comes with rstan. In that example, assume you’ve fit the very same model I just linked to in stan and that you have a resulting stanfit object called “rats1”. Assume also that your interest is in predicting average rat weight for new data–days 21, 22, and 23.

# generate posterior samples of the stan model coefficients
param.sample <- as.data.frame(rats1)  

# assume data2 is your new df. This df has just 3 time periods one day apart in which the rats were weighed
data2 <- data.frame(x=c(21, 22, 23))  
data2$xbar <- mean(data2$x)

# posterior predicted values for rat weight using the new data frame (data2):
result.sample<- t(apply(param.sample, 1, function(x) x["mu_alpha"] + x["mu_beta"]*(data2$xbar - data2$x) ))

# summary
result.summary <- apply(result.sample, 2, mean) 
result.summary.std <- apply(result.sample, 2, sd) 

I’m guessing your interest would be in result.summary and in the standard errors around it.


#7

Hi Ben. I think you meant to say that rstan doesn’t know what the model is and, hence, it can’t compute predicted values. But can’t one divine what the model is from the following?

cat(get_stancode(my.stan.model))

In any event, it appears that others have been able to create simple posterior_predict() type solutions from a stanfit object. In my reply to the OP above, I make reference to McElreath’s rethinking package. That package has a link() function that appears to pull whatever’s needed from the stanfit object and generate predicted values. And since rethinking is a front-end for stan, this ought to be doable natively in rstan.

pred.sample <- rethinking::link(my.model, new.df) # where mymodel is an object of class 'rethinking' and rethinking is just a stan wrapper.

I’m just not smart enough to extract just exactly only what’s needed from the get_stancode text string and cast it into an R formula object or equation.


#8

A small percentage of humans can, but a computer cannot (yet), in general. The rethinking, rstanarm, brms, etc. R packages all use R syntax to specify the model and call rstan::sampling. By caching the information from the R syntax, these packages can do posterior prediction fairly easily. But they are not inferring it from the Stan program.


#9

Thanks. That makes complete sense.


#10

I assume this isn’t feasible for larger datasets, but could you not just do posterior predictions within generated quantities? Include the prediction-dataset data {}, then plug all that into a predictive formula in generated quantities {}?


#11

Yes, that’s what we recommend. But it’s not automatic the way it is in the other packages mentioned.


#12

I do not really know how to perform posterior predictions and I wonder whether you would have a practical example or tutorial that could explain it further. How would the code.stan look like?


#13

There are examples in the manual and in my case study on repeated binary trials (there’s a page of links to case studies for Stan).