We are modeling crash rates (the number of times an app crashes, adjusted for the hours it was used) across different operating systems and major/minor versions. For example, a major, minor combination might be 70.0.1, where the major is 70 and the minor is 1. We call the combined form the cversion (or complete version), so in this example the cversion is 70.0.1.
In the first model the grouping factor is cversion, so information is pooled across all cversions. But apps change a lot between majors, and minors are much closer to other minors of the same major. The design, I feel, is akin to a nested model, and the next model seems more appropriate.
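For concreteness, here is roughly what I mean in brms syntax. This is only a sketch: the grouping terms are the ones discussed below, but the likelihood, the offset, and the names crashes, usage_hours, and d are my shorthand for the setup described above.

```r
library(brms)

# First model: a single grouping factor, so all cversions are pooled together.
m1 <- brm(
  crashes ~ os + offset(log(usage_hours)) + (1 + os | cversion),
  family = negbinomial(),  # count with an exposure offset; the likelihood is an assumption here
  data = d
)

# Second (nested-style) model: minors are pooled within their major.
m2 <- brm(
  crashes ~ os + offset(log(usage_hours)) + (1 + os | major) + (1 + os | major:minor),
  family = negbinomial(),
  data = d
)
```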
About the dataset: it has 105 distinct combinations of OS, major, and minor: 3 OSes, 3 majors, 14 minors, and 35 distinct major,minor combinations, with on average 4.4 observations per OS, major, minor combination.
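(If I'm computing this right, that is 3 OSes × 35 major,minor combinations = 105 cells, and roughly 105 × 4.4 ≈ 460 observations in total.)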
The first model takes roughly 882 seconds to run; the second takes 3253 seconds and produces a lot of divergences (though the R-hats are 1) unless I increase adapt_delta. My concern is the running time: is a long running time a symptom of an incorrectly specified model?
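For reference, I raise it through the control argument (same placeholder names as in the sketch above):

```r
# A higher adapt_delta forces smaller step sizes, which removes the divergences
# at the cost of even longer sampling time.
m2 <- brm(
  crashes ~ os + offset(log(usage_hours)) + (1 + os | major) + (1 + os | major:minor),
  family = negbinomial(),
  data = d,
  control = list(adapt_delta = 0.99)
)
```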
One other question: would (1+os | major:minor) be equivalent to (1+os | cversion)?
Replying to one of my own questions here: having checked the two models, it appears that (1+os | major:minor) is equivalent to (1+os | cversion) (if I'm not mistaken, major:minor creates levels equivalent to cversion). Hence the extra running time comes from the addition of the other grouping factor, (1+os | major).
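A quick way to convince myself of the equivalence (assuming the data frame is called d and has factor columns major, minor, and cversion):

```r
# major:minor in the formula builds the same grouping levels as
# interaction(major, minor); each of those levels should map onto
# exactly one cversion.
length(unique(interaction(d$major, d$minor, drop = TRUE)))     # 35 combinations
length(unique(d$cversion))                                      # should also be 35
table(interaction(d$major, d$minor, drop = TRUE), d$cversion)   # one-to-one mapping
```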
How much complexity does this add to the design matrix such that the running time nearly quadruples?
There is a problem with the major grouping factor, which has just 3 levels. Without more informative priors, this will likely be problematic, since you are trying to estimate SDs and correlations from the equivalent of 3 observations here.
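If you keep major as a grouping factor, one option is to put fairly tight priors on its SDs and correlations, along these lines (just a sketch, with the same placeholder names as above; the scales would need thought for your data):

```r
priors <- c(
  prior(normal(0, 0.5), class = sd, group = major),  # shrinks the major-level SDs
  prior(lkj(2), class = cor, group = major)          # pulls the correlations towards 0
)

m2 <- brm(
  crashes ~ os + offset(log(usage_hours)) + (1 + os | major) + (1 + os | major:minor),
  family = negbinomial(),
  data = d,
  prior = priors
)
```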