Categorical Modelling over huge dataset stops

Greetings all,

I have a huge dataset consisting of 63,000 entries with 10 columns. I am trying to apply categorical modelling, since the response has 3 levels. However, even after 5 hours of sampling, the process does not progress: it stays at Chain 4: Iteration: 1 / 2000 [ 0%] (Warmup) with no error or other output. Is it normal for it to take this long, or is there a problem? I'm running an i5 CPU with 32 GB of RAM.

model1 <- brm(Pattern_ID ~ Modal_Verbs, data = df1, family = categorical(), cores = 4)

tibble [62,982 x 10] (S3: tbl_df/tbl/data.frame)
 $ docid_field    : chr [1:62982] "SPM02010" "DBAN3028" "SENS2032" "BGSU1003" ...
 $ Gender         : chr [1:62982] "Female" "Female" "Female" "Female" ...
 $ Native_language: chr [1:62982] "Spanish" "Dutch" "Serbian" "Bulgarian" ...
 $ Modal_Verbs    : chr [1:62982] "will" "may" "will" "would" ...
 $ Main_Verbs     : chr [1:62982] "make" "be" "affect" "ask" ...
 $ Pattern_ID     : chr [1:62982] "pattern_simple_no_adv" "pattern_simple_no_adv" "pattern_simple_no_adv" "pattern_simple_with_adv" ...
 $ Type           : chr [1:62982] "Argumentative" "Argumentative" "Argumentative" "Argumentative" ...
 $ Conditions     : chr [1:62982] "No timing" "No timing" "Timed" "No timing" ...
 $ Reference_tools: chr [1:62982] "Yes" "Yes" "No" "Yes" ...
 $ Examination    : chr [1:62982] "No" "No" "No" "No" ...
R version 4.1.1 (2021-08-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 22000)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252    LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                           LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] udpipe_0.8.8    brms_2.16.3     rstanarm_2.21.1 Rcpp_1.0.7      forcats_0.5.1   stringr_1.4.0   dplyr_1.0.7    
 [8] purrr_0.3.4     readr_2.1.2     tidyr_1.2.0     tibble_3.1.5    ggplot2_3.3.5   tidyverse_1.3.1 readxl_1.3.1   

loaded via a namespace (and not attached):
  [1] minqa_1.2.4          colorspace_2.0-2     ellipsis_0.3.2       ggridges_0.5.3       rsconnect_0.8.25    
  [6] estimability_1.3     markdown_1.1         base64enc_0.1-3      fs_1.5.2             rstudioapi_0.13     
 [11] farver_2.1.0         rstan_2.21.2         DT_0.20              mvtnorm_1.1-3        fansi_0.5.0         
 [16] lubridate_1.8.0      diffobj_0.3.5        xml2_1.3.3           bridgesampling_1.1-2 codetools_0.2-18    
 [21] splines_4.1.1        shinythemes_1.2.0    projpred_2.0.2       bayesplot_1.8.1      jsonlite_1.7.2      
 [26] nloptr_1.2.2.2       broom_0.7.12         Rmpfr_0.8-7          dbplyr_2.1.1         shiny_1.7.1         
 [31] compiler_4.1.1       httr_1.4.2           emmeans_1.7.2        backports_1.4.1      assertthat_0.2.1    
 [36] Matrix_1.3-4         fastmap_1.1.0        cli_3.1.0            later_1.3.0          htmltools_0.5.2     
 [41] prettyunits_1.1.1    tools_4.1.1          gmp_0.6-2.1          igraph_1.2.7         coda_0.19-4         
 [46] gtable_0.3.0         glue_1.4.2           posterior_1.2.0      reshape2_1.4.4       V8_3.5.0            
 [51] cellranger_1.1.0     vctrs_0.3.8          nlme_3.1-152         crosstalk_1.2.0      tensorA_0.36.2      
 [56] ps_1.6.0             rvest_1.0.2          lme4_1.1-27.1        mime_0.12            miniUI_0.1.1.1      
 [61] lifecycle_1.0.1      gtools_3.9.2         MASS_7.3-54          zoo_1.8-9            scales_1.1.1        
 [66] colourpicker_1.1.1   Brobdingnag_1.2-7    hms_1.1.1            promises_1.2.0.1     parallel_4.1.1      
 [71] inline_0.3.19        shinystan_2.5.0      gamm4_0.2-6          curl_4.3.2           gridExtra_2.3       
 [76] loo_2.4.1            StanHeaders_2.21.0-7 stringi_1.7.5        dygraphs_1.1.1.6     checkmate_2.0.0     
 [81] boot_1.3-28          pkgbuild_1.3.1       rlang_0.4.12         pkgconfig_2.0.3      matrixStats_0.61.0  
 [86] distributional_0.3.0 lattice_0.20-44      rstantools_2.1.1     htmlwidgets_1.5.4    processx_3.5.2      
 [91] tidyselect_1.1.1     plyr_1.8.6           magrittr_2.0.1       R6_2.5.1             generics_0.1.2      
 [96] DBI_1.1.2            mgcv_1.8-36          pillar_1.7.0         haven_2.4.3          withr_2.4.3         
[101] xts_0.12.1           abind_1.4-5          survival_3.2-11      modelr_0.1.8         crayon_1.4.2        
[106] utf8_1.2.2           tzdb_0.2.0           grid_4.1.1           data.table_1.14.2    callr_3.7.0         
[111] threejs_0.3.3        reprex_2.0.1         digest_0.6.28        xtable_1.8-4         httpuv_1.6.3        
[116] RcppParallel_5.1.4   stats4_4.1.1         munsell_0.5.0        shinyjs_2.1.0       

Can you post your Stan model?

Here is the code:

model1 <- brm(Pattern_ID ~ Modal_Verbs, data = df1, family = categorical(), cores = 4)

Can you try subsetting your dataset to a much smaller size, and checking that the estimation values and times are as expected:

model1 <- brm(Pattern_ID ~ Modal_Verbs,
              data = df1[1:100, ],
              family = categorical(),
              cores = 4)
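One way to tell a slow fit from a hung one (a sketch of my own, not something from this thread) is to time the same model on increasing subset sizes and see how the runtime scales before committing to the full ~63k rows:

```r
library(brms)

# Sketch: fit the same model on growing subsets and record elapsed time.
# `refresh = 0` suppresses the per-iteration progress output.
for (n in c(100, 500, 1000)) {
  t <- system.time(
    brm(Pattern_ID ~ Modal_Verbs,
        data = df1[seq_len(n), ],
        family = categorical(),
        cores = 4,
        refresh = 0)
  )
  cat(n, "rows:", round(t[["elapsed"]], 1), "seconds\n")
}
```

If the elapsed time grows roughly linearly with n, the full fit is probably just slow rather than stuck.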

It seems to be working after subsetting, with a few warnings:

Warning messages:
1: In system(paste(CXX, ARGS), ignore.stdout = TRUE, ignore.stderr = TRUE) :
‘-E’ not found
2: There were 499 transitions after warmup that exceeded the maximum treedepth. Increase max_treedepth above 10. See
Runtime warnings and convergence problems
3: Examine the pairs() plot to diagnose sampling problems

4: The largest R-hat is 1.1, indicating chains have not mixed.
Running the chains for more iterations may help. See
Runtime warnings and convergence problems
5: Bulk Effective Samples Size (ESS) is too low, indicating posterior means and medians may be unreliable.
Running the chains for more iterations may help. See
Runtime warnings and convergence problems
6: Tail Effective Samples Size (ESS) is too low, indicating posterior variances and tail quantiles may be unreliable.
Running the chains for more iterations may help. See
Runtime warnings and convergence problems
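The treedepth and ESS warnings above can usually be addressed by letting the sampler take deeper trajectories and running longer chains. A minimal sketch; the specific values are illustrative, not tuned for this data:

```r
library(brms)

# Sketch: raise max_treedepth (warning 2) and run more iterations
# (warnings 4-6, low R-hat/ESS). Values here are guesses to illustrate
# the brm() arguments, not recommendations for this dataset.
model1 <- brm(Pattern_ID ~ Modal_Verbs,
              data = df1[1:100, ],
              family = categorical(),
              cores = 4,
              iter = 4000,
              warmup = 1000,
              control = list(max_treedepth = 12))
```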

Chain 3:  Elapsed Time: 32.371 seconds (Warm-up)
Chain 3:                19.498 seconds (Sampling)
Chain 3:                51.869 seconds (Total)
Chain 3: 
Chain 2: Iteration: 1400 / 2000 [ 70%]  (Sampling)
Chain 4: Iteration: 2000 / 2000 [100%]  (Sampling)
Chain 4: 
Chain 4:  Elapsed Time: 33.976 seconds (Warm-up)
Chain 4:                20.072 seconds (Sampling)
Chain 4:                54.048 seconds (Total)
Chain 4: 
Chain 2: Iteration: 1600 / 2000 [ 80%]  (Sampling)
Chain 1: Iteration: 1400 / 2000 [ 70%]  (Sampling)
Chain 2: Iteration: 1800 / 2000 [ 90%]  (Sampling)
Chain 2: Iteration: 2000 / 2000 [100%]  (Sampling)
Chain 2: 
Chain 2:  Elapsed Time: 45.53 seconds (Warm-up)
Chain 2:                32.03 seconds (Sampling)
Chain 2:                77.56 seconds (Total)
Chain 2: 
Chain 1: Iteration: 1600 / 2000 [ 80%]  (Sampling)
Chain 1: Iteration: 1800 / 2000 [ 90%]  (Sampling)
Chain 1: Iteration: 2000 / 2000 [100%]  (Sampling)
Chain 1: 
Chain 1:  Elapsed Time: 37.384 seconds (Warm-up)
Chain 1:                69.197 seconds (Sampling)
Chain 1:                106.581 seconds (Total)
Chain 1:

Also, no problem with df1[1:1000, ]. I assume it is due to the dataset size; it takes forever.
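If the full fit is merely slow, one option worth trying before giving up is the cmdstanr backend, which brms supports directly. A sketch; the one-time setup lines assume a working C++ toolchain:

```r
# One-time setup (commented out; assumes a working C++ toolchain):
# install.packages("cmdstanr",
#                  repos = c("https://mc-stan.org/r-packages",
#                            getOption("repos")))
# cmdstanr::install_cmdstan()

library(brms)

# Sketch: run the same model through the cmdstanr backend instead of
# the default rstan backend.
model1 <- brm(Pattern_ID ~ Modal_Verbs,
              data = df1,
              family = categorical(),
              backend = "cmdstanr",
              cores = 4)
```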

Chain 3: 
Chain 3:  Elapsed Time: 304.575 seconds (Warm-up)
Chain 3:                259.139 seconds (Sampling)
Chain 3:                563.714 seconds (Total)
Chain 3: 
Chain 1: Iteration: 1600 / 2000 [ 80%]  (Sampling)
Chain 2: Iteration: 1800 / 2000 [ 90%]  (Sampling)
Chain 4: Iteration: 1600 / 2000 [ 80%]  (Sampling)
Chain 2: Iteration: 2000 / 2000 [100%]  (Sampling)
Chain 2: 
Chain 2:  Elapsed Time: 378.879 seconds (Warm-up)
Chain 2:                256.811 seconds (Sampling)
Chain 2:                635.69 seconds (Total)
Chain 2: 
Chain 1: Iteration: 1800 / 2000 [ 90%]  (Sampling)
Chain 1: Iteration: 2000 / 2000 [100%]  (Sampling)
Chain 1: 
Chain 1:  Elapsed Time: 375.08 seconds (Warm-up)
Chain 1:                370.751 seconds (Sampling)
Chain 1:                745.831 seconds (Total)
Chain 1: 
Chain 4: Iteration: 1800 / 2000 [ 90%]  (Sampling)

I switched from my office PC to my home PC and installed cmdstanr with GPU (NVIDIA 1660 Super) support. It seemed to be working, yet now it is stuck at Chain 1: Iteration: 1 / 2000 [ 0%] (Warmup).

Should I give up on brms and move on to cmdstanr?

Update: I switched to the cmdstanr backend with GPU support. Though the process still takes a long time, at least it completes.
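For reference, GPU support does not require leaving brms: versions >= 2.16 can pass OpenCL options through the cmdstanr backend. A sketch, assuming OpenCL platform 0 / device 0 is the GPU (check yours first, e.g. with clinfo):

```r
library(brms)

# Sketch: enable OpenCL (GPU) acceleration via the cmdstanr backend.
# opencl(c(0, 0)) selects OpenCL platform 0, device 0; adjust the ids
# to match your system's OpenCL device listing.
model1 <- brm(Pattern_ID ~ Modal_Verbs,
              data = df1,
              family = categorical(),
              backend = "cmdstanr",
              opencl = opencl(c(0, 0)),
              cores = 4)
```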

Thank you for the support, andrjohns.

Regards.