Nested Models and Slowness : how to debug?

Hello,

We are modeling crash rates (the number of times an app crashes, adjusted for the hours it was used) across different operating systems and major/minor versions. For example, a major/minor combination might be 70.0.1, where the major is 70 and the minor is 1. We call the combined form the cversion (or complete version), so in this example the cversion is 70.0.1.

The model I fitted looks like

Model 1

bf( cmain ~   offset(log( hours_used + 1/60)) 
                  + os
                  + (1+os|cversion),
      shape ~ os) + negbinomial()

Here the grouping factor is cversion, so information is pooled across all cversions. But apps change a lot between majors, and minors are much closer to other minors of the same major. The design, I feel, is akin to a nested model, so the next model seems appropriate:

Model 2

bf( cmain ~   offset(log( hours_used + 1/60)) 
                  + os
                  + (1+os|major/minor),
      shape ~ os) + negbinomial()
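For reference, brms expands the `/` nesting shorthand into two separate group-level terms, which is why the Model 2 output below reports both a `~major` and a `~major:minor` group. A minimal sketch (the data columns `cmain`, `hours_used`, `os`, `major`, `minor` are taken from the formulas above):

```r
library(brms)

# (1 + os | major/minor) is shorthand for
# (1 + os | major) + (1 + os | major:minor)
f_nested <- bf(
  cmain ~ offset(log(hours_used + 1/60)) + os + (1 + os | major/minor),
  shape ~ os
)
f_expanded <- bf(
  cmain ~ offset(log(hours_used + 1/60)) + os +
    (1 + os | major) + (1 + os | major:minor),
  shape ~ os
)
# Both formulas define the same model; fitting either should give
# equivalent results.
```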

About the dataset: it has 105 different combinations of OS, major, and minor: 3 OSes, 3 majors, 14 minors, and 35 distinct major/minor combinations, with on average 4.4 observations per OS/major/minor cell.

The first model takes roughly 882 seconds to run; the second takes 3253 seconds and produces a lot of divergent transitions (though all R-hats are 1) unless I increase adapt_delta. My concern is the running time: is a long running time a symptom of an incorrectly specified model?
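For context, the sampler settings I am adjusting look like this (a sketch; `f` and `D` stand in for the formula and data above):

```r
# Raising adapt_delta forces smaller leapfrog steps, which usually
# removes divergences at the cost of slower sampling. max_treedepth
# can also be raised if the sampler hits the treedepth limit.
fit2 <- brm(
  f, data = D, family = negbinomial(),
  chains = 4, iter = 3000, warmup = 1500,
  control = list(adapt_delta = 0.99, max_treedepth = 12)
)
```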

One other question: would (1+os | major:minor) be equivalent to (1+os | cversion)?
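One quick way to check this on the data itself: if every cversion corresponds to exactly one major/minor pair, the two grouping factors partition the rows identically. A sketch in base R (assuming columns `major`, `minor`, `cversion` in `D`):

```r
# Build the interaction factor that (1 + os | major:minor) uses
D$mm <- interaction(D$major, D$minor, drop = TRUE)

# Same number of levels (35 in this dataset)?
length(levels(D$mm)) == length(unique(D$cversion))

# Does each cversion map to exactly one major:minor level?
all(tapply(D$mm, D$cversion, function(x) length(unique(x))) == 1)
```

If both checks return TRUE, the two terms induce the same grouping and should fit equivalently.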

Any insights appreciated
Saptarshi

Output of Models

Model 1

 Family: negbinomial
  Links: mu = log; shape = log
Formula: cmain + 1 ~ offset(log(usage_cm_crasher_cversion + 1/60)) + os + (1 + os | c_version)
         shape ~ os
   Data: D (Number of observations: 454)
Samples: 4 chains, each with iter = 3000; warmup = 1500; thin = 1;
         total post-warmup samples = 6000

Group-Level Effects:
~c_version (Number of levels: 35)
                            Estimate Est.Error l-95% CI u-95% CI Eff.Sample
sd(Intercept)                   0.43      0.09     0.25     0.62       1565
sd(osLinux)                     0.68      0.18     0.33     1.04       1420
sd(osWindows_NT)                0.44      0.10     0.25     0.64       1685
cor(Intercept,osLinux)         -0.36      0.26    -0.77     0.23       1640
cor(Intercept,osWindows_NT)    -0.97      0.03    -1.00    -0.89       2165
cor(osLinux,osWindows_NT)       0.36      0.27    -0.26     0.77       1872
                            Rhat
sd(Intercept)               1.00
sd(osLinux)                 1.00
sd(osWindows_NT)            1.00
cor(Intercept,osLinux)      1.00
cor(Intercept,osWindows_NT) 1.00
cor(osLinux,osWindows_NT)   1.00

Population-Level Effects:
                   Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
Intercept              0.34      0.11     0.13     0.54       2456 1.00
shape_Intercept        0.77      0.16     0.45     1.10       2957 1.00
osLinux                0.24      0.18    -0.11     0.59       3083 1.00
osWindows_NT          -0.36      0.11    -0.57    -0.15       2642 1.00
shape_osLinux         -0.49      0.23    -0.95    -0.04       3872 1.00
shape_osWindows_NT     1.33      0.20     0.93     1.72       3660 1.00

Samples were drawn using sampling(NUTS). For each parameter, Eff.Sample
is a crude measure of effective sample size, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).

and for Model 2

Formula: cmain + 1 ~ offset(log(usage_cm_crasher_cversion + 1/60)) + os + (1 + os | major) + (1 + os | major:minor)
         shape ~ os
   Data: D (Number of observations: 454)
Samples: 4 chains, each with iter = 3000; warmup = 1500; thin = 1;
         total post-warmup samples = 6000

Group-Level Effects:
~major (Number of levels: 3)
                            Estimate Est.Error l-95% CI u-95% CI Eff.Sample
sd(Intercept)                   1.32      1.83     0.13     6.30       1437
sd(osLinux)                     1.78      2.34     0.10     8.38       1432
sd(osWindows_NT)                1.42      2.25     0.12     7.01        355
cor(Intercept,osLinux)         -0.19      0.51    -0.95     0.82       4684
cor(Intercept,osWindows_NT)    -0.26      0.50    -0.96     0.77       4271
cor(osLinux,osWindows_NT)       0.15      0.51    -0.83     0.94       4340
                            Rhat
sd(Intercept)               1.01
sd(osLinux)                 1.00
sd(osWindows_NT)            1.01
cor(Intercept,osLinux)      1.00
cor(Intercept,osWindows_NT) 1.00
cor(osLinux,osWindows_NT)   1.00

~major:minor (Number of levels: 35)
                            Estimate Est.Error l-95% CI u-95% CI Eff.Sample
sd(Intercept)                   0.16      0.13     0.01     0.47        458
sd(osLinux)                     0.60      0.17     0.27     0.96       1652
sd(osWindows_NT)                0.15      0.14     0.00     0.49        457
cor(Intercept,osLinux)          0.03      0.44    -0.74     0.86        977
cor(Intercept,osWindows_NT)    -0.51      0.51    -0.99     0.70        935
cor(osLinux,osWindows_NT)       0.02      0.45    -0.85     0.81       2442
                            Rhat
sd(Intercept)               1.01
sd(osLinux)                 1.00
sd(osWindows_NT)            1.01
cor(Intercept,osLinux)      1.00
cor(Intercept,osWindows_NT) 1.00
cor(osLinux,osWindows_NT)   1.00

Population-Level Effects:
                   Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
Intercept              0.41      1.19    -1.60     2.75       1078 1.00
shape_Intercept        0.68      0.16     0.38     1.00       1915 1.00
osLinux                0.27      1.75    -2.96     3.71       1008 1.00
osWindows_NT          -0.29      1.55    -2.68     2.28        244 1.02
shape_osLinux         -0.42      0.23    -0.89     0.02       3607 1.00
shape_osWindows_NT     1.42      0.20     1.03     1.80       3039 1.00


Samples were drawn using sampling(NUTS). For each parameter, Eff.Sample
is a crude measure of effective sample size, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
Warning message:
There were 5 divergent transitions after warmup. Increasing adapt_delta above 0.999 may help.

Replying to one of my own questions here: having checked the two models, it appears that (1+os | major:minor) is equivalent to (1+os | cversion) (if I'm not mistaken, major:minor is pretty much creating levels equivalent to cversion). Hence the extra running time comes from the additional grouping factor (1+os | major).

How much complexity does this add to the design matrix that causes the running time to nearly quadruple?

@paul.buerkner do you have any insight you could provide about this?

There is a problem with the major grouping, which has just 3 levels. Without more informative priors, this will likely be problematic, as you are trying to estimate SDs and correlations on the basis of the equivalent of 3 observations.
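One way to act on this advice: tighten the priors on the group-level SDs and correlations for major. A sketch (the prior scales here are illustrative, not a recommendation):

```r
library(brms)

# Regularize the weakly identified major-level parameters:
# half-normal(0, 1) on the SDs, LKJ(2) on the correlation matrix.
priors <- c(
  set_prior("normal(0, 1)", class = "sd", group = "major"),
  set_prior("lkj(2)", class = "cor", group = "major")
)
# Then pass prior = priors to brm() when refitting Model 2.
```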
