I’m attempting to model varying effects of different loan characteristics on defaults in rstanarm. When my model only has varying intercepts, everything runs fine; when I add any predictor to get a varying slope, none of my chains start, due to underflow: “Log probability evaluates to log(0)”. I tried initializing the intercept term at its eventual mean (~ -6), but that didn’t solve the problem (and I suspect that wasn’t the problem to begin with, since initialization worked fine in the varying-intercepts case). What I find puzzling is that initialization only breaks down when I add varying slopes. Typically I would suspect my initial values are off the support of their densities, but all of my varying slopes should take values on the whole real line. One fear is that with >2,200 parameters I am getting underflow issues that can’t be solved.
If someone could enlighten me as to how to initialize my varying slopes over a wider range (perhaps by widening the uniform from which every parameter is drawn), I would appreciate it; I’d rather not have to write my own Stan code in this case. I know you can supply a function or specify a list of inits, but those options seem difficult or unfruitful when my matrix of varying parameters has over 1,000 rows.
My other concern is that, since there are over 1,000 groups, I may not be able to successfully initialize any chains just by random initialization.
One other thing: some of my groups have few observations, so my model may be barely identified. I assume this makes the model less stable, but I wasn’t sure whether it would also make the model that much more difficult to initialize.
And lastly, my model’s levels are state : 3-digit ZIP : individual.
Maybe try increasing the complexity of your model step by step (e.g., adding covariates in small batches). Also, it is sometimes hard for Stan to initialize when the scales of the predictors are very different. You could also try whether the QR = TRUE option helps; it is recommended when the predictors are (highly) correlated, though I’m not sure how rstanarm handles it in the multilevel case. Also, did you run a nested model, as in ... + (... | state/zip/individual)? Maybe that could help sampling?
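For concreteness, a sketch of what I mean — the variable names (`default`, `ln_char`, `loans`) are placeholders, not from your model:

```r
library(rstanarm)

# Sketch only: 'default', 'ln_char', and 'loans' are placeholder names.
fit <- stan_glmer(
  default ~ ln_char + (ln_char | state/zip),  # '/' expands to (ln_char | state) + (ln_char | state:zip)
  data   = loans,
  family = binomial(link = "logit"),
  QR     = TRUE   # orthogonalize (correlated) predictors before sampling
)
```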
That’s all I can think of… Maybe @MegBallard has time to think of better ideas for this problem?
I don’t have a ton of time, but until I do:
A couple of quick thoughts / questions / suggestions in the interim (though I ended up with several after re-reading). Apologies for the lackluster formatting and clarity; I’m trying to save time while still getting the ideas out.
- Are you using uniform priors? That’s kind of a bad idea generally, but especially in a model with a lot of group-level effects.
- I think the rstanarm docs (and the brms ones) suggest that if you are going to specify inits rather than use random ones, initializing at 0 is sometimes helpful; other fixed values aren’t likely to be.
- In the rstanarm vignettes (though maybe I came across it somewhere else; I can figure out where if you need), they discuss that sometimes setting autoscale = FALSE in the priors can help the sampler better explore the space, but I’m not sure this applies if you can’t initialize chains.
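For reference, a sketch of where `autoscale` goes; the prior values are illustrative, not recommendations, and the model variables are placeholders:

```r
library(rstanarm)

fit <- stan_glmer(
  default ~ ln_char + (1 | zip),   # placeholder formula
  data   = loans,
  family = binomial(),
  prior           = normal(0, 1, autoscale = FALSE),  # fixed scale, no rescaling by predictor SD
  prior_intercept = normal(-6, 2, autoscale = FALSE)  # centered near the ~ -6 mentioned above
)
```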
Try a few of the following ideas to track down what’s breaking:
- Run the model using a subset of the data, including only the groups you think have an adequate number of observations.
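Something like this in base R — `loans`, `zip`, and the threshold of 30 are placeholders:

```r
# Keep only groups with at least 30 observations (threshold is arbitrary).
counts    <- table(loans$zip)
keep      <- names(counts)[counts >= 30]
loans_sub <- loans[loans$zip %in% keep, , drop = FALSE]
```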
- Use submodels that don’t include all three group effects: try one, then two, etc., especially leaving out individual. In other settings, observation-level “random effects” can account for overdispersion, but this doesn’t work well in the Bayesian setting (and isn’t necessary, I think).
- Similarly, I don’t know how many population-level terms you are including, but starting with fewer is a good approach. Standardizing the scale of your numeric predictors also helps the sampler; you can read about that in a ton of places, and I can link some later.
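E.g., with base R’s `scale()` (column names are placeholders):

```r
# Center and scale numeric predictors to mean 0, SD 1.
loans$int_rate_z <- as.numeric(scale(loans$int_rate))
loans$loan_amt_z <- as.numeric(scale(loans$loan_amt))
```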
- If your groups are nested, code them so they are explicitly nested. If you have 3 states and within each state you have 10 ZIP-code groups, you should have 30 unique level names for those groups; the same is true for individual. This will let you be very specific when varying slopes and/or intercepts, e.g.
(ln_char | state + 3-digitzip + individual)
(ln_char | state + 3-digitzip ) + (1 | individual)
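One base-R way to build explicitly nested labels (column names are placeholders):

```r
# Make ZIP labels globally unique by prefixing the state,
# so the same ZIP code in two states becomes two distinct levels.
loans$zip_nested <- interaction(loans$state, loans$zip, drop = TRUE)
# equivalent: factor(paste(loans$state, loans$zip, sep = ":"))
```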
As always (because I’m no expert), it’s possible I’ve completely misunderstood your problem.
However, these recommendations are pretty general when working on multilevel / hierarchical / mixed models*. As suggested by @Max_Mantei, a general recommendation is to start with a simpler model, so fewer parameters are being predicted/estimated (not sure which term is correct), and then build up once you have something that works at the simplest level. Sometimes I just start with an intercept! Clearly you did that before adding slopes, but adding them incrementally can help. And if the problem is mathematical, you may end up with a less “full” model, but Bayes, MCMC, and NUTS can’t solve everything.
*NB, for interested/knowledgeable parties:
Can we just refer to them as partial-pooling models at this point? Is that accurate and general enough for all those flavors?
Can’t use an initialism because there are too many involving m’s and h’s already!
Have I conjured a problem from the non-problem ether?
I definitely have failed in my attempt to write a quick response!
Thanks for the detailed reply, Meg.
I am using weakly informative priors; with Stan I can’t see any point in uniform priors anyway.
On specifying inits, my initial problem (haha) is that I have no idea which inits are causing me trouble, and in my larger models I have >1,000 parameters. I could try setting them all to zero, but I am unsure what that would do to my variance parameters etc., since they are bounded and probably ill-behaved at zero.
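For what it’s worth, my understanding (worth checking against the rstan docs) is that inits are applied on Stan’s unconstrained scale, so `init = 0` puts a lower-bounded scale parameter at exp(0) = 1, not at its boundary; rstanarm documents that extra arguments in `...` are forwarded to `rstan::sampling()`. Formula and data names below are placeholders:

```r
library(rstanarm)

# init = 0: start every parameter at 0 on the unconstrained scale.
# init_r:  change the radius of the default uniform(-2, 2) random-init draw.
fit <- stan_glmer(
  default ~ ln_char + (ln_char | zip),   # placeholder formula
  data = loans, family = binomial(),
  init = 0    # or e.g. init_r = 0.5 for inits drawn from uniform(-0.5, 0.5)
)
```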
I will try autoscale=FALSE.
Scaling down the model is another good idea; I have done that and it works fine, which leads me back to my general “too many parameters” problem.
I may settle for just using a two-level model, as it reduces the complexity of the model. I don’t exactly follow your point about overdispersion; my motivation for the random effects is partial pooling, as you suggest below. Several years ago I would’ve just used a fixed-effects (dummy-coded-everything) model.
I have also standardized the scale of my predictors. I was curious whether the goal of standardizing predictors is really to standardize the scale of the parameters, because my effect sizes differ by orders of magnitude. E.g., interest rates have a per-unit effect orders of magnitude larger than changes on a per-dollar basis, but the dollar predictors are in the 100Ks, so the effect sizes end up comparable. Is that a case where putting every predictor on unit scale is good even though it makes the parameters orders of magnitude different? My intuition was no, the parameters should be on the same scale.
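A toy simulation of that point (simulated data, not my actual model): standardizing multiplies each coefficient by its predictor’s SD, so per-SD effects can land on the same scale even when per-unit effects differ by orders of magnitude.

```r
set.seed(1)
dollars <- rnorm(500, mean = 2e5, sd = 5e4)   # loan size, ~100Ks of dollars
rate    <- rnorm(500, mean = 0.05, sd = 0.01) # interest rate
y       <- 1e-5 * dollars + 50 * rate + rnorm(500)

coef(lm(y ~ dollars + rate))                # per-unit effects: ~1e-5 vs ~50
coef(lm(y ~ scale(dollars) + scale(rate)))  # per-SD effects: both ~0.5
```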
I’ll try to be more explicit in my coding of the structure, but rstanarm seems pretty straightforward with the group IDs, so to be honest I didn’t think about that too much.
Thank you for the very detailed reply, I’ll definitely try your suggestions!