Thanks for sharing this! It very nicely complements and extends my notebook, and I can recommend these kinds of workflow and diagnostic ideas to others for perfecting their models. Everything about basis function GPs holds for spline models, too, and many of the things Niko discusses can be used in other contexts as well.
A couple of comments, mainly focusing on the difference in thinking when doing fast prototyping; Niko also touches on some of these (based on our previous discussions).
I knew that 10 basis functions is too few to match the exact GP even without any diagnostics, and the purpose was to show that it’s possible to do quick experiments with something that is not perfect. When Niko adds many basis functions already in model 1, it reveals that the prior for the lengthscale was actually putting too much mass on small lengthscales. I was not interested in checking this at that point, as I knew the model was badly misspecified and lacking many components; I wanted a separate component to model the faster variation.
We had forgotten to make the mean of the basis functions zero within the box. Stan’s dynamic HMC is so good that the results were good anyway, and after making the means of the basis functions zero, the speed-up I observed was less than an order of magnitude. I prefer to have zero-mean basis functions and keep the explicit intercept term, as this makes the behavior more consistent when adding several basis function GPs. In this specific case the data had been transformed to have zero mean and the mid-point of the basis functions was at the mid-point of the data, so that a priori the intercept was likely to be close to zero. Instead of removing the intercept due to the correlation, I’d prefer to keep it and adjust the model so that the correlation is not there, which lets us confirm that the posterior is concentrated close to zero. Now I feel that our mistake of not having zero means is too prominent through all the models, although that part could have been fixed already for the first model.
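For concreteness, here is a minimal sketch (not our reference implementation) of a 1-D Hilbert space basis function GP where each basis function is centered to have zero mean over the data and an explicit intercept is kept; the data names, the boundary handling, and the priors are illustrative assumptions:

```stan
// Minimal sketch: 1-D basis function GP with zero-mean basis functions
// and an explicit intercept. Names and priors are illustrative.
data {
  int<lower=1> N;        // number of observations
  vector[N] x;           // inputs, assumed centered around zero
  vector[N] y;           // outputs, assumed standardized
  int<lower=1> M;        // number of basis functions
  real<lower=0> L;       // half-width of the box, e.g. 1.5 * max(|x|)
}
transformed data {
  matrix[N, M] PHI;
  for (m in 1:M) {
    vector[N] phi_m = sin(m * pi() / (2 * L) * (x + L)) / sqrt(L);
    // center so each basis function has zero mean over the data;
    // this removes the correlation with the explicit intercept
    real phi_mean = mean(phi_m);
    for (n in 1:N) PHI[n, m] = phi_m[n] - phi_mean;
  }
}
parameters {
  real intercept;
  real<lower=0> sigma_f;       // GP marginal SD
  real<lower=0> lengthscale;
  vector[M] beta;              // basis function weights
  real<lower=0> sigma;         // residual SD
}
model {
  // square roots of the spectral densities of the exponentiated
  // quadratic kernel at the basis function frequencies
  vector[M] diagSPD;
  for (m in 1:M)
    diagSPD[m] = sigma_f * sqrt(sqrt(2 * pi()) * lengthscale)
                 * exp(-0.25 * square(lengthscale * m * pi() / (2 * L)));
  intercept ~ normal(0, 1);
  sigma_f ~ normal(0, 1);
  lengthscale ~ lognormal(0, 1);
  beta ~ normal(0, 1);
  sigma ~ normal(0, 1);
  y ~ normal(intercept + PHI * (diagSPD .* beta), sigma);
}
```

With the centered basis functions, the intercept no longer competes with the basis function weights for the overall level, so its posterior should concentrate close to zero when the data have been centered.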
There is also the question of how much time to spend tuning the model to improve computational efficiency. Niko enjoyed improving the computational efficiency, e.g. by looking at the correlations, but the conclusions from the models did not change. If the models are used only once to make the analysis and Stan is able to provide sufficient accuracy with the initial set of models, there is no need for tuning. If the models are going to be used repeatedly, or the diagnostics indicate that the computation is not reliable, then of course additional changes to the model are needed, and the methods Niko discusses are very useful and recommended.
I’m surprised that the day-of-week effect would require a sum-to-zero constraint, as the Monday effect was already fixed to 0.
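For concreteness, a minimal sketch of the two identification strategies I have in mind; the variable names and the simple normal observation model are illustrative assumptions, not the actual case study model:

```stan
// Minimal sketch contrasting two ways to identify a day-of-week effect:
// fixing the Monday effect to zero vs. a sum-to-zero constraint.
data {
  int<lower=1> N;
  array[N] int<lower=1, upper=7> dow;  // day of week, 1 = Monday
  vector[N] y;
}
parameters {
  real intercept;
  vector[6] beta_free;                 // Tuesday..Sunday
  real<lower=0> sigma;
}
transformed parameters {
  // (a) corner constraint: Monday effect fixed to 0
  vector[7] beta_corner = append_row(0, beta_free);
  // (b) sum-to-zero: recenter so the seven effects sum to zero;
  //     the removed level is absorbed by the intercept
  vector[7] beta_sum0 = beta_corner - mean(beta_corner);
}
model {
  intercept ~ normal(0, 1);
  beta_free ~ normal(0, 1);
  sigma ~ normal(0, 1);
  y ~ normal(intercept + beta_corner[dow], sigma);  // or beta_sum0[dow]
}
```

Both versions remove the non-identifiability between the intercept and the day effects; they just put the overall level in different places.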
I think that for the Birthday example specifically, what Niko did was premature optimization: my fast prototyping already showed that an actually useful model for predictions would require more calendar information and a constraint that babies not born on some day have to be born on a day near that day, so there is no need to tune the models further before that work is done (and even then some more useful data would be needed). However, I expect there are other, more interesting data analysis cases that would greatly benefit from the workflow he presents.
We’ll make the basis functions have zero mean in our reference implementation.
Thank you for the nice feedback! I do agree that for the Birthday problem this might be a little much :)
My main aim was to showcase my workflow and develop it further on an interesting problem. I did learn a few new things already, which is great! And I expect many things to carry over to other problems, which is even better!
Edit: some additional comments:
I’m not 100% sure how to go forward with the DOW sum-to-zero constraint. There seems to be something extra weird going on with the original model, but enforcing the sum-to-zero constraint did not improve the model once my other proposed improvements were in place. I’m guessing that how to handle this depends on how we go forward with the GP for the weekday effect, which would probably also benefit from making the basis functions have zero mean.
Hello. I am interested in using a GP for a binary classification problem. It is a “big data” problem with n ≈ 200K observations and p ≈ 200 variables. I don’t know if this is similarly complex to the Birthday problem. I am thinking about following the workflow you have here. Do you think it would make sense? Thanks.
Thank you for your interest! I’d guess that if you really want/need to do full MCMC, then you won’t get around optimizing your model in the way that I did and adaptively tuning the approximation and the parametrization.
However, there are two important caveats. First, you currently can’t do what I did, because none of the publicly available Stan interfaces supports it. Second, and depending on the problem more importantly, you might not need to do full MCMC.
I have very little experience with GPs, but your problem sounds as if this discussion could be very helpful to you: Gaussian process regression
I’m sure @avehtari could tell you more, but I’d guess the following quote applies to your problem as well:
For comparison, the Birthday problem has ~7300 observations and a single dimension, both in input and output.