(Edited) Rules of thumb relating multilevel modeling speed to number of samples and number of levels/complexity?



[Edited: I think my original question might have turned people off because it seemed very specific. I illustrate my motivation with a specific concern, but all I really want is to hear rules of thumb about modeling speed from those with experience. I have observed that a simple model on lots of data is slow. I know from reading other posts that this may have something to do with the posterior’s density concentrating too much because there are too many data points and the prior is therefore too weak to capture any uncertainty. I’m wondering if perhaps adding levels and complexity to the model is one way of making a model on so much data actually run faster.]

Original post:

Inspired by Gelman et al.’s BDA3 and DAURMM, I would like to apply the multilevel modeling framework to a work application. I have set up RStan and some simple models and am getting very slow performance.

Let me explain a bit of context. I have monthly data (120 months), with about 90,000 observations of security returns per month. Each of these observations has two features: color and shape. There are about 20 colors and about 10 shapes. Each of these security returns has an exposure to the market return for that month that we wish to compute. If we did no pooling, we would run a simple regression on the securities in each (month, color, shape) bucket. However, some buckets have very few observations. Furthermore, we have prior information that, for a fixed month and shape, there is a known ordering of the true population slope parameters across colors.
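To make the no-pooling baseline concrete, here is a minimal Python sketch (standard library only). It assumes each observation is a tuple `(month, color, shape, x, y)` where `x` is the predictor and `y` the security return; those field names and the helper functions are hypothetical, not part of any real data set or library. It fits a separate least-squares slope per bucket and returns `None` where the slope is not identified, which is the small-bucket problem that motivates partial pooling.

```python
from collections import defaultdict


def ols_slope(xs, ys):
    """Least-squares slope of y on x (with intercept); None if not identified."""
    n = len(xs)
    if n < 2:
        return None  # a single observation cannot determine a slope
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        return None  # no variation in the predictor within this bucket
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


def no_pooling_slopes(records):
    """Fit an independent slope for every (month, color, shape) bucket."""
    buckets = defaultdict(lambda: ([], []))
    for month, color, shape, x, y in records:
        xs, ys = buckets[(month, color, shape)]
        xs.append(x)
        ys.append(y)
    return {key: ols_slope(xs, ys) for key, (xs, ys) in buckets.items()}
```

Buckets that come back as `None`, or that rest on only a handful of points, are the ones a multilevel model would shrink toward the group-level estimate.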

The above details are perhaps not relevant to my rather general question. To test Stan, I first set up a simple linear regression of security returns on market returns in which I pool all the colors and shapes. This takes a long time!

I would simply love to know whether multilevel modeling with Stan is worth my time. If I make the model more complicated, with levels corresponding to year, color, and shape, will things slow down even more? Or can I get gains in time via some smart approach?

I would really really like to make this work. I’ve fallen in love with the powerful simplicity of multilevel modeling!


I think the problem is the opposite—it’s hard to see a specific question in there.

If you want help speeding up a model, you’ll have to show us what you currently have. The key efficiency issues are outlined in the manual chapter on efficiency: you need to use non-centered parameterizations and vectorize for a start.

These changes usually result in posteriors that are better behaved and in faster sampling.
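For anyone unfamiliar with the term, a non-centered parameterization replaces a group-level effect theta_j ~ normal(mu, tau) with a standard-normal variable z_j and the deterministic transform theta_j = mu + tau * z_j. The two define the same distribution for theta_j, but the sampler then works with z_j, which has no prior dependence on tau. The Python sketch below only illustrates that identity by forward simulation; it is not Stan code, and the function names are made up for illustration.

```python
import random


def draw_centered(mu, tau, n_groups, rng):
    """Centered form: draw each group effect directly from normal(mu, tau)."""
    return [rng.gauss(mu, tau) for _ in range(n_groups)]


def draw_non_centered(mu, tau, n_groups, rng):
    """Non-centered form: draw z ~ normal(0, 1), then shift and scale."""
    z = [rng.gauss(0.0, 1.0) for _ in range(n_groups)]
    return [mu + tau * z_j for z_j in z]


def sample_mean_sd(values):
    """Sample mean and (n - 1) standard deviation of a list of floats."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / (n - 1)
    return mean, var ** 0.5
```

Drawing many values from each function with the same `mu` and `tau` should give matching sample means and standard deviations, up to Monte Carlo error. In a Stan program the same idea means declaring `z` as the parameter and computing the group effects from it, rather than declaring the group effects as parameters directly.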

I think you’re wrestling with the issue of whether the model fits the data well. More complex models (in terms of time to evaluate the log density) can be faster to fit if they match the data better.

And don’t you want to be applying time-series models to financial data like this? Keep in mind that those should also use non-centered parameterizations.