Categorical Modelling over huge dataset stops

Greetings all,

I have a huge dataset consisting of 63,000 entries with 10 columns. I am trying to apply categorical modelling, since the response has 3 levels. However, even after 5 hours of sampling, the process does not progress: it stays at Chain 4: Iteration: 1 / 2000 [ 0%] (Warmup) with no error or other output. Is it normal for it to take this long, or is there a problem? I'm running an i5 CPU with 32 GB of RAM.

model1 <- brm(Pattern_ID ~ Modal_Verbs, data = df1, family = categorical(), cores = 4)

tibble [62,982 x 10] (S3: tbl_df/tbl/data.frame)
 $ docid_field    : chr [1:62982] "SPM02010" "DBAN3028" "SENS2032" "BGSU1003" ...
 $ Gender         : chr [1:62982] "Female" "Female" "Female" "Female" ...
 $ Native_language: chr [1:62982] "Spanish" "Dutch" "Serbian" "Bulgarian" ...
 $ Modal_Verbs    : chr [1:62982] "will" "may" "will" "would" ...
 $ Main_Verbs     : chr [1:62982] "make" "be" "affect" "ask" ...
 $ Pattern_ID     : chr [1:62982] "pattern_simple_no_adv" "pattern_simple_no_adv" "pattern_simple_no_adv" "pattern_simple_with_adv" ...
 $ Type           : chr [1:62982] "Argumentative" "Argumentative" "Argumentative" "Argumentative" ...
 $ Conditions     : chr [1:62982] "No timing" "No timing" "Timed" "No timing" ...
 $ Reference_tools: chr [1:62982] "Yes" "Yes" "No" "Yes" ...
 $ Examination    : chr [1:62982] "No" "No" "No" "No" ...
R version 4.1.1 (2021-08-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 22000)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252    LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                           LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] udpipe_0.8.8    brms_2.16.3     rstanarm_2.21.1 Rcpp_1.0.7      forcats_0.5.1   stringr_1.4.0   dplyr_1.0.7    
 [8] purrr_0.3.4     readr_2.1.2     tidyr_1.2.0     tibble_3.1.5    ggplot2_3.3.5   tidyverse_1.3.1 readxl_1.3.1   

loaded via a namespace (and not attached):
  [1] minqa_1.2.4          colorspace_2.0-2     ellipsis_0.3.2       ggridges_0.5.3       rsconnect_0.8.25    
  [6] estimability_1.3     markdown_1.1         base64enc_0.1-3      fs_1.5.2             rstudioapi_0.13     
 [11] farver_2.1.0         rstan_2.21.2         DT_0.20              mvtnorm_1.1-3        fansi_0.5.0         
 [16] lubridate_1.8.0      diffobj_0.3.5        xml2_1.3.3           bridgesampling_1.1-2 codetools_0.2-18    
 [21] splines_4.1.1        shinythemes_1.2.0    projpred_2.0.2       bayesplot_1.8.1      jsonlite_1.7.2      
 [26] nloptr_1.2.2.2       broom_0.7.12         Rmpfr_0.8-7          dbplyr_2.1.1         shiny_1.7.1         
 [31] compiler_4.1.1       httr_1.4.2           emmeans_1.7.2        backports_1.4.1      assertthat_0.2.1    
 [36] Matrix_1.3-4         fastmap_1.1.0        cli_3.1.0            later_1.3.0          htmltools_0.5.2     
 [41] prettyunits_1.1.1    tools_4.1.1          gmp_0.6-2.1          igraph_1.2.7         coda_0.19-4         
 [46] gtable_0.3.0         glue_1.4.2           posterior_1.2.0      reshape2_1.4.4       V8_3.5.0            
 [51] cellranger_1.1.0     vctrs_0.3.8          nlme_3.1-152         crosstalk_1.2.0      tensorA_0.36.2      
 [56] ps_1.6.0             rvest_1.0.2          lme4_1.1-27.1        mime_0.12            miniUI_0.1.1.1      
 [61] lifecycle_1.0.1      gtools_3.9.2         MASS_7.3-54          zoo_1.8-9            scales_1.1.1        
 [66] colourpicker_1.1.1   Brobdingnag_1.2-7    hms_1.1.1            promises_1.2.0.1     parallel_4.1.1      
 [71] inline_0.3.19        shinystan_2.5.0      gamm4_0.2-6          curl_4.3.2           gridExtra_2.3       
 [76] loo_2.4.1            StanHeaders_2.21.0-7 stringi_1.7.5        dygraphs_1.1.1.6     checkmate_2.0.0     
 [81] boot_1.3-28          pkgbuild_1.3.1       rlang_0.4.12         pkgconfig_2.0.3      matrixStats_0.61.0  
 [86] distributional_0.3.0 lattice_0.20-44      rstantools_2.1.1     htmlwidgets_1.5.4    processx_3.5.2      
 [91] tidyselect_1.1.1     plyr_1.8.6           magrittr_2.0.1       R6_2.5.1             generics_0.1.2      
 [96] DBI_1.1.2            mgcv_1.8-36          pillar_1.7.0         haven_2.4.3          withr_2.4.3         
[101] xts_0.12.1           abind_1.4-5          survival_3.2-11      modelr_0.1.8         crayon_1.4.2        
[106] utf8_1.2.2           tzdb_0.2.0           grid_4.1.1           data.table_1.14.2    callr_3.7.0         
[111] threejs_0.3.3        reprex_2.0.1         digest_0.6.28        xtable_1.8-4         httpuv_1.6.3        
[116] RcppParallel_5.1.4   stats4_4.1.1         munsell_0.5.0        shinyjs_2.1.0       

Can you post your Stan model?

Here is the code:

model1 <- brm(Pattern_ID ~ Modal_Verbs, data = df1, family = categorical(), cores = 4)

Can you try subsetting your dataset to a much smaller size, and checking that the estimation values and times are as expected:

model1 <- brm(Pattern_ID ~ Modal_Verbs,
              data = df1[1:100, ],
              family = categorical(),
              cores = 4)
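One way to tell a slow fit from a hung one (a sketch of my own, not something from this thread) is to time the same model on increasing subset sizes and see how the runtime scales before committing to the full ~63k rows:

```r
library(brms)

# Sketch: fit the same model on growing subsets and record elapsed time.
# `refresh = 0` suppresses the per-iteration progress output.
for (n in c(100, 500, 1000)) {
  t <- system.time(
    brm(Pattern_ID ~ Modal_Verbs,
        data = df1[seq_len(n), ],
        family = categorical(),
        cores = 4,
        refresh = 0)
  )
  cat(n, "rows:", round(t[["elapsed"]], 1), "seconds\n")
}
```

If the elapsed time grows roughly linearly with n, the full fit is probably just slow rather than stuck.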

It seems to be working after subsetting, with a few warnings:

Warning messages:
1: In system(paste(CXX, ARGS), ignore.stdout = TRUE, ignore.stderr = TRUE) :
‘-E’ not found
2: There were 499 transitions after warmup that exceeded the maximum treedepth. Increase max_treedepth above 10. See
Runtime warnings and convergence problems
3: Examine the pairs() plot to diagnose sampling problems

4: The largest R-hat is 1.1, indicating chains have not mixed.
Running the chains for more iterations may help. See
Runtime warnings and convergence problems
5: Bulk Effective Samples Size (ESS) is too low, indicating posterior means and medians may be unreliable.
Running the chains for more iterations may help. See
Runtime warnings and convergence problems
6: Tail Effective Samples Size (ESS) is too low, indicating posterior variances and tail quantiles may be unreliable.
Running the chains for more iterations may help. See
Runtime warnings and convergence problems
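The treedepth and ESS warnings above can usually be addressed by letting the sampler take deeper trajectories and running longer chains. A minimal sketch; the specific values are illustrative, not tuned for this data:

```r
library(brms)

# Sketch: raise max_treedepth (warning 2) and run more iterations
# (warnings 4-6, low R-hat/ESS). Values here are guesses to illustrate
# the brm() arguments, not recommendations for this dataset.
model1 <- brm(Pattern_ID ~ Modal_Verbs,
              data = df1[1:100, ],
              family = categorical(),
              cores = 4,
              iter = 4000,
              warmup = 1000,
              control = list(max_treedepth = 12))
```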

Chain 3:  Elapsed Time: 32.371 seconds (Warm-up)
Chain 3:                19.498 seconds (Sampling)
Chain 3:                51.869 seconds (Total)
Chain 3: 
Chain 2: Iteration: 1400 / 2000 [ 70%]  (Sampling)
Chain 4: Iteration: 2000 / 2000 [100%]  (Sampling)
Chain 4: 
Chain 4:  Elapsed Time: 33.976 seconds (Warm-up)
Chain 4:                20.072 seconds (Sampling)
Chain 4:                54.048 seconds (Total)
Chain 4: 
Chain 2: Iteration: 1600 / 2000 [ 80%]  (Sampling)
Chain 1: Iteration: 1400 / 2000 [ 70%]  (Sampling)
Chain 2: Iteration: 1800 / 2000 [ 90%]  (Sampling)
Chain 2: Iteration: 2000 / 2000 [100%]  (Sampling)
Chain 2: 
Chain 2:  Elapsed Time: 45.53 seconds (Warm-up)
Chain 2:                32.03 seconds (Sampling)
Chain 2:                77.56 seconds (Total)
Chain 2: 
Chain 1: Iteration: 1600 / 2000 [ 80%]  (Sampling)
Chain 1: Iteration: 1800 / 2000 [ 90%]  (Sampling)
Chain 1: Iteration: 2000 / 2000 [100%]  (Sampling)
Chain 1: 
Chain 1:  Elapsed Time: 37.384 seconds (Warm-up)
Chain 1:                69.197 seconds (Sampling)
Chain 1:                106.581 seconds (Total)
Chain 1:

Also, no problem with df1[1:1000, ]. I assume it is due to the dataset size; it takes forever.
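If the full fit is merely slow, one option worth trying before giving up is the cmdstanr backend, which brms supports directly. A sketch; the one-time setup lines assume a working C++ toolchain:

```r
# One-time setup (commented out; assumes a working C++ toolchain):
# install.packages("cmdstanr",
#                  repos = c("https://mc-stan.org/r-packages",
#                            getOption("repos")))
# cmdstanr::install_cmdstan()

library(brms)

# Sketch: run the same model through the cmdstanr backend instead of
# the default rstan backend.
model1 <- brm(Pattern_ID ~ Modal_Verbs,
              data = df1,
              family = categorical(),
              backend = "cmdstanr",
              cores = 4)
```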

Chain 3: 
Chain 3:  Elapsed Time: 304.575 seconds (Warm-up)
Chain 3:                259.139 seconds (Sampling)
Chain 3:                563.714 seconds (Total)
Chain 3: 
Chain 1: Iteration: 1600 / 2000 [ 80%]  (Sampling)
Chain 2: Iteration: 1800 / 2000 [ 90%]  (Sampling)
Chain 4: Iteration: 1600 / 2000 [ 80%]  (Sampling)
Chain 2: Iteration: 2000 / 2000 [100%]  (Sampling)
Chain 2: 
Chain 2:  Elapsed Time: 378.879 seconds (Warm-up)
Chain 2:                256.811 seconds (Sampling)
Chain 2:                635.69 seconds (Total)
Chain 2: 
Chain 1: Iteration: 1800 / 2000 [ 90%]  (Sampling)
Chain 1: Iteration: 2000 / 2000 [100%]  (Sampling)
Chain 1: 
Chain 1:  Elapsed Time: 375.08 seconds (Warm-up)
Chain 1:                370.751 seconds (Sampling)
Chain 1:                745.831 seconds (Total)
Chain 1: 
Chain 4: Iteration: 1800 / 2000 [ 90%]  (Sampling)

I switched from my office PC to my home PC and installed cmdstanr with GPU (NVIDIA 1660 Super) support. It seemed to be working, yet now it is stuck at Chain 1: Iteration: 1 / 2000 [ 0%] (Warmup).

Should I give up on brms and move on to cmdstanr?

Update: I switched to the cmdstanr backend with GPU support. Though the process still takes a long time, at least it completes.
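For reference, GPU support does not require leaving brms: versions >= 2.16 can pass OpenCL options through the cmdstanr backend. A sketch, assuming OpenCL platform 0 / device 0 is the GPU (check yours first, e.g. with clinfo):

```r
library(brms)

# Sketch: enable OpenCL (GPU) acceleration via the cmdstanr backend.
# opencl(c(0, 0)) selects OpenCL platform 0, device 0; adjust the ids
# to match your system's OpenCL device listing.
model1 <- brm(Pattern_ID ~ Modal_Verbs,
              data = df1,
              family = categorical(),
              backend = "cmdstanr",
              opencl = opencl(c(0, 0)),
              cores = 4)
```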

Thank you for the support, andrjohns.

Regards.