Categorical modelling on a large dataset stalls

Greetings all,

I have a large dataset of about 63,000 entries with 10 columns. I am trying to fit a categorical model since the response has 3 levels. However, even after 5 hours, sampling does not progress: the output stays at Chain 4: Iteration: 1 / 2000 [ 0%] (Warmup) with no error. Is it normal for this to take so long, or is there a problem? I'm running an i5 CPU with 32 GB of RAM.

model1 <- brm(Pattern_ID ~ Modal_Verbs, data = df1, family = categorical(), cores = 4)

tibble [62,982 x 10] (S3: tbl_df/tbl/data.frame)
 $ docid_field    : chr [1:62982] "SPM02010" "DBAN3028" "SENS2032" "BGSU1003" ...
 $ Gender         : chr [1:62982] "Female" "Female" "Female" "Female" ...
 $ Native_language: chr [1:62982] "Spanish" "Dutch" "Serbian" "Bulgarian" ...
 $ Modal_Verbs    : chr [1:62982] "will" "may" "will" "would" ...
 $ Main_Verbs     : chr [1:62982] "make" "be" "affect" "ask" ...
 $ Pattern_ID     : chr [1:62982] "pattern_simple_no_adv" "pattern_simple_no_adv" "pattern_simple_no_adv" "pattern_simple_with_adv" ...
 $ Type           : chr [1:62982] "Argumentative" "Argumentative" "Argumentative" "Argumentative" ...
 $ Conditions     : chr [1:62982] "No timing" "No timing" "Timed" "No timing" ...
 $ Reference_tools: chr [1:62982] "Yes" "Yes" "No" "Yes" ...
 $ Examination    : chr [1:62982] "No" "No" "No" "No" ...
R version 4.1.1 (2021-08-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 22000)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252    LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                           LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] udpipe_0.8.8    brms_2.16.3     rstanarm_2.21.1 Rcpp_1.0.7      forcats_0.5.1   stringr_1.4.0   dplyr_1.0.7    
 [8] purrr_0.3.4     readr_2.1.2     tidyr_1.2.0     tibble_3.1.5    ggplot2_3.3.5   tidyverse_1.3.1 readxl_1.3.1   

loaded via a namespace (and not attached):
  [1] minqa_1.2.4          colorspace_2.0-2     ellipsis_0.3.2       ggridges_0.5.3       rsconnect_0.8.25    
  [6] estimability_1.3     markdown_1.1         base64enc_0.1-3      fs_1.5.2             rstudioapi_0.13     
 [11] farver_2.1.0         rstan_2.21.2         DT_0.20              mvtnorm_1.1-3        fansi_0.5.0         
 [16] lubridate_1.8.0      diffobj_0.3.5        xml2_1.3.3           bridgesampling_1.1-2 codetools_0.2-18    
 [21] splines_4.1.1        shinythemes_1.2.0    projpred_2.0.2       bayesplot_1.8.1      jsonlite_1.7.2      
 [26] nloptr_1.2.2.2       broom_0.7.12         Rmpfr_0.8-7          dbplyr_2.1.1         shiny_1.7.1         
 [31] compiler_4.1.1       httr_1.4.2           emmeans_1.7.2        backports_1.4.1      assertthat_0.2.1    
 [36] Matrix_1.3-4         fastmap_1.1.0        cli_3.1.0            later_1.3.0          htmltools_0.5.2     
 [41] prettyunits_1.1.1    tools_4.1.1          gmp_0.6-2.1          igraph_1.2.7         coda_0.19-4         
 [46] gtable_0.3.0         glue_1.4.2           posterior_1.2.0      reshape2_1.4.4       V8_3.5.0            
 [51] cellranger_1.1.0     vctrs_0.3.8          nlme_3.1-152         crosstalk_1.2.0      tensorA_0.36.2      
 [56] ps_1.6.0             rvest_1.0.2          lme4_1.1-27.1        mime_0.12            miniUI_0.1.1.1      
 [61] lifecycle_1.0.1      gtools_3.9.2         MASS_7.3-54          zoo_1.8-9            scales_1.1.1        
 [66] colourpicker_1.1.1   Brobdingnag_1.2-7    hms_1.1.1            promises_1.2.0.1     parallel_4.1.1      
 [71] inline_0.3.19        shinystan_2.5.0      gamm4_0.2-6          curl_4.3.2           gridExtra_2.3       
 [76] loo_2.4.1            StanHeaders_2.21.0-7 stringi_1.7.5        dygraphs_1.1.1.6     checkmate_2.0.0     
 [81] boot_1.3-28          pkgbuild_1.3.1       rlang_0.4.12         pkgconfig_2.0.3      matrixStats_0.61.0  
 [86] distributional_0.3.0 lattice_0.20-44      rstantools_2.1.1     htmlwidgets_1.5.4    processx_3.5.2      
 [91] tidyselect_1.1.1     plyr_1.8.6           magrittr_2.0.1       R6_2.5.1             generics_0.1.2      
 [96] DBI_1.1.2            mgcv_1.8-36          pillar_1.7.0         haven_2.4.3          withr_2.4.3         
[101] xts_0.12.1           abind_1.4-5          survival_3.2-11      modelr_0.1.8         crayon_1.4.2        
[106] utf8_1.2.2           tzdb_0.2.0           grid_4.1.1           data.table_1.14.2    callr_3.7.0         
[111] threejs_0.3.3        reprex_2.0.1         digest_0.6.28        xtable_1.8-4         httpuv_1.6.3        
[116] RcppParallel_5.1.4   stats4_4.1.1         munsell_0.5.0        shinyjs_2.1.0       

Can you post your Stan model?

Here is the code:

model1 <- brm(Pattern_ID ~ Modal_Verbs, data = df1, family = categorical(), cores = 4)

Can you try subsetting your dataset to a much smaller size and checking that the estimates and run times are as expected:

model1 <- brm(Pattern_ID ~ Modal_Verbs,
              data = df1[1:100, ],
              family = categorical(),
              cores = 4)
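One caveat with taking the first rows: if the data happen to be sorted, the head of the table may not contain every response level. A minimal sketch of drawing a random subset instead is below. The helper name `subsample_rows` and the toy data frame are made up for illustration and are not from the original post.

```r
# Hypothetical helper: draw n rows at random, reproducibly.
subsample_rows <- function(data, n, seed = 1) {
  stopifnot(n <= nrow(data))
  set.seed(seed)
  data[sample(nrow(data), n), , drop = FALSE]
}

# Toy stand-in for df1, deliberately sorted by the response.
toy <- data.frame(
  Pattern_ID  = rep(c("a", "b", "c"), each = 200),
  Modal_Verbs = rep(c("will", "may", "would", "can"), times = 150),
  stringsAsFactors = FALSE
)

head_rows   <- toy[1:100, ]
random_rows <- subsample_rows(toy, 100)

# Compare how many response levels each subset contains.
length(unique(head_rows$Pattern_ID))
length(unique(random_rows$Pattern_ID))
```

With the real data this would be called as `subsample_rows(df1, 100)` and the result passed to `brm()` in place of `df1[1:100, ]`.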

It seems to work after subsetting, with a few warnings:

Warning messages:
1: In system(paste(CXX, ARGS), ignore.stdout = TRUE, ignore.stderr = TRUE) :
‘-E’ not found
2: There were 499 transitions after warmup that exceeded the maximum treedepth. Increase max_treedepth above 10. See
http://mc-stan.org/misc/warnings.html#maximum-treedepth-exceeded
3: Examine the pairs() plot to diagnose sampling problems

4: The largest R-hat is 1.1, indicating chains have not mixed.
Running the chains for more iterations may help. See
http://mc-stan.org/misc/warnings.html#r-hat
5: Bulk Effective Samples Size (ESS) is too low, indicating posterior means and medians may be unreliable.
Running the chains for more iterations may help. See
http://mc-stan.org/misc/warnings.html#bulk-ess
6: Tail Effective Samples Size (ESS) is too low, indicating posterior variances and tail quantiles may be unreliable.
Running the chains for more iterations may help. See
http://mc-stan.org/misc/warnings.html#tail-ess

 
Chain 3:  Elapsed Time: 32.371 seconds (Warm-up)
Chain 3:                19.498 seconds (Sampling)
Chain 3:                51.869 seconds (Total)
Chain 3: 
Chain 2: Iteration: 1400 / 2000 [ 70%]  (Sampling)
Chain 4: Iteration: 2000 / 2000 [100%]  (Sampling)
Chain 4: 
Chain 4:  Elapsed Time: 33.976 seconds (Warm-up)
Chain 4:                20.072 seconds (Sampling)
Chain 4:                54.048 seconds (Total)
Chain 4: 
Chain 2: Iteration: 1600 / 2000 [ 80%]  (Sampling)
Chain 1: Iteration: 1400 / 2000 [ 70%]  (Sampling)
Chain 2: Iteration: 1800 / 2000 [ 90%]  (Sampling)
Chain 2: Iteration: 2000 / 2000 [100%]  (Sampling)
Chain 2: 
Chain 2:  Elapsed Time: 45.53 seconds (Warm-up)
Chain 2:                32.03 seconds (Sampling)
Chain 2:                77.56 seconds (Total)
Chain 2: 
Chain 1: Iteration: 1600 / 2000 [ 80%]  (Sampling)
Chain 1: Iteration: 1800 / 2000 [ 90%]  (Sampling)
Chain 1: Iteration: 2000 / 2000 [100%]  (Sampling)
Chain 1: 
Chain 1:  Elapsed Time: 37.384 seconds (Warm-up)
Chain 1:                69.197 seconds (Sampling)
Chain 1:                106.581 seconds (Total)
Chain 1:

There is also no problem with df1[1:1000, ]. I assume the issue is the dataset size: the full run takes forever.

Chain 3: 
Chain 3:  Elapsed Time: 304.575 seconds (Warm-up)
Chain 3:                259.139 seconds (Sampling)
Chain 3:                563.714 seconds (Total)
Chain 3: 
Chain 1: Iteration: 1600 / 2000 [ 80%]  (Sampling)
Chain 2: Iteration: 1800 / 2000 [ 90%]  (Sampling)
Chain 4: Iteration: 1600 / 2000 [ 80%]  (Sampling)
Chain 2: Iteration: 2000 / 2000 [100%]  (Sampling)
Chain 2: 
Chain 2:  Elapsed Time: 378.879 seconds (Warm-up)
Chain 2:                256.811 seconds (Sampling)
Chain 2:                635.69 seconds (Total)
Chain 2: 
Chain 1: Iteration: 1800 / 2000 [ 90%]  (Sampling)
Chain 1: Iteration: 2000 / 2000 [100%]  (Sampling)
Chain 1: 
Chain 1:  Elapsed Time: 375.08 seconds (Warm-up)
Chain 1:                370.751 seconds (Sampling)
Chain 1:                745.831 seconds (Total)
Chain 1: 
Chain 4: Iteration: 1800 / 2000 [ 90%]  (Sampling)
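The two timing logs above allow a rough extrapolation to the full data. The sketch below uses the slowest chain totals shown for the 100-row and 1000-row runs (Chain 4 of the 1000-row run had not finished in the posted log, so its time is not included). Two points are a thin basis, and the 100-row run hit the maximum treedepth, which likely inflated its time, so treat the result as an order-of-magnitude guess only.

```r
# Slowest per-chain totals (seconds) copied from the logs in this thread.
rows    <- c(100, 1000)
seconds <- c(106.581, 745.831)
full_n  <- 62982

# Power-law fit through the two points: time = a * rows^b.
b <- log(seconds[2] / seconds[1]) / log(rows[2] / rows[1])
power_law_hours <- seconds[2] * (full_n / rows[2])^b / 3600

# Plain linear scaling from the 1000-row run as a more pessimistic guess.
linear_hours <- seconds[2] * (full_n / rows[2]) / 3600

round(c(exponent = b, power_law = power_law_hours, linear = linear_hours), 2)
```

Either way the estimate lands in the range of many hours per chain. As far as I recall, with 2000 iterations Stan prints progress only every 200 iterations by default, so a chain that shows "Iteration: 1 / 2000" for a long time may be slow rather than stuck.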

I switched from my office PC to my home PC and installed cmdstanr with GPU (NVIDIA 1660 Super) support. It seems to work, yet now it is stuck at Chain 1: Iteration: 1 / 2000 [ 0%] (Warmup).

Should I give up on brms and go on with cmdstanr?
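One option that stays within brms, offered as a sketch rather than a tested fix: since the model has a single categorical predictor, the rows could be collapsed to one row per modal verb with a vector of counts per pattern, and fitted with the multinomial family. This assumes `Modal_Verbs` really is the only predictor. The toy data and the names `toy`, `counts`, and `agg` are illustrative, and the `brm()` call is left commented out because it needs brms and a working Stan toolchain.

```r
# Toy stand-in for df1 (made-up values; the real data has 62,982 rows).
set.seed(1)
toy <- data.frame(
  Modal_Verbs = sample(c("will", "may", "would"), 500, replace = TRUE),
  Pattern_ID  = sample(c("pattern_a", "pattern_b", "pattern_c"), 500,
                       replace = TRUE),
  stringsAsFactors = FALSE
)

# Cross-tabulate: one row per modal verb, one column per response level.
counts <- unclass(table(toy$Modal_Verbs, toy$Pattern_ID))

agg <- data.frame(Modal_Verbs = rownames(counts), stringsAsFactors = FALSE)
agg$n <- as.integer(rowSums(counts))  # total observations per modal verb
agg$y <- counts                       # matrix column of per-level counts

# Not run here: the likelihood is now evaluated on a few rows, not thousands.
# fit <- brms::brm(y | trials(n) ~ Modal_Verbs, data = agg,
#                  family = brms::multinomial(), cores = 4)
```

The aggregated model should target the same coefficients as the row-level categorical model, but the naming of parameters and the reference category are worth checking against a small row-level fit before relying on it.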

Update: I switched to the cmdstanr backend with GPU support. The process still takes a long time, but at least it now completes.
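For anyone finding this later, a setup along these lines might look like the sketch below. It is not the exact call used above. The wrapper name `fit_with_cmdstanr` is made up, the OpenCL platform and device ids `c(0, 0)` are an assumption that depends on the machine, and the `opencl` argument may not exist in older brms versions. `refresh = 50` is only there to print progress more often.

```r
# Hypothetical wrapper; nothing is fitted until it is called with real data.
fit_with_cmdstanr <- function(data) {
  stopifnot(requireNamespace("brms", quietly = TRUE),
            requireNamespace("cmdstanr", quietly = TRUE))
  brms::brm(
    Pattern_ID ~ Modal_Verbs,
    data    = data,
    family  = brms::categorical(),
    backend = "cmdstanr",             # compile and sample via CmdStan
    chains  = 4,
    cores   = 4,
    opencl  = brms::opencl(c(0, 0)),  # assumed platform and device ids
    refresh = 50                      # progress message every 50 iterations
  )
}

# model1 <- fit_with_cmdstanr(df1)
```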

Thank you for the support, andrjohns.

Regards.