Question Regarding Possible Speed Gains ( GPU or CPU ) for a Four Level Hierarchical Logistic Regression Model

ajnafa · July 9, 2021, 12:29am

I’m in the process of trying to figure out a more efficient way to fit a four level logistic regression model for repeated cross-sectional survey data in which the i^{th} respondent is nested within birth cohort l, survey-year t, and country j. The Stan code for the baseline model with only the group level intercepts is as follows:

// Hierarchical Logistic Regression Model of Support for Democracy
// Model 1: Null Model with Varying Intercepts country, survey-year, and cohort
// Data Dimensions: N = 208,107; L = 2460; T = 169; J = 58
data {
  int <lower = 0> N; // Total Number of Observations
  int <lower = 0, upper = 1> Y[N]; // The Response Vector
  int <lower = 1> K; // Number of Fixed Effects in the Model
  matrix[N, K] X; // Design Matrix for the Fixed Effects
  // Data for Country-Level Effects (Level 4)
  int <lower = 1> J; // Total Number of Countries J
  int<lower=1> M_J; // Number of Country Level Coefficients
  int <lower = 1> jj[N]; // Country-Observation Indicator
  vector[N] Z_J; // Country-Level Predictors
  // Data for Country-Year Effects  (Level 3)
  int<lower=1> T; // Total Number of Country-Years T
  int<lower=1> M_T; // Number of Country-Year Level Coefficients
  int<lower=1> tt[N]; // Grouping Indicator Per Observation
  vector[N] Z_T; // Time Varying Country-Level Predictors
  // Data for Country-Birth Cohort Effects  (Level 2)
  int<lower=1> L; // Total Number Cohorts L
  int<lower=1> M_L; // Number of Country-Birth Cohort Level Coefficients
  int<lower=1> ll[N]; // Country-Birth Cohort Indicator
  vector[N] Z_L; // Country-Birth Cohort Intercepts
}

parameters {
  vector[K] beta; // Fixed Effects Parameters
  vector <lower = 0>[M_J] sigma_j; // Country-Level Standard Deviations
  vector[J] alpha_j[M_J]; // Standardized Country-Level Intercepts
  vector <lower = 0>[M_T] sigma_t; // Country-Year Standard Deviations
  vector[T] alpha_t[M_T]; // Standardized Country-Level Intercepts
  vector <lower = 0>[M_L] sigma_l; // Country-Year Standard Deviations
  vector[L] alpha_l[M_L]; // Standardized Birth Cohort-Level Intercepts
}

transformed parameters {
  vector[J] country; // Actual Country-Level Effects
  country = (sigma_j[1] * (alpha_j[1]));
  vector[T] survey_year; // Actual Country-Year Effects
  survey_year = (sigma_t[1] * (alpha_t[1]));
  vector[L] cohort; // Actual Birth Cohort Effects
  cohort = (sigma_l[1] * (alpha_l[1]));
}

model {
  vector[N] mu = X * beta;
  // Priors for the Model Parameters
  profile("Priors") {
    target += normal_lpdf(beta | 0, 1);
    target += student_t_lpdf(sigma_j | 3, 0, 2) - 1 * student_t_lccdf(0 | 3, 0, 2);
    target += std_normal_lpdf(alpha_j[1]);
    target += student_t_lpdf(sigma_t | 3, 0, 2) - 1 * student_t_lccdf(0 | 3, 0, 2);
    target += std_normal_lpdf(alpha_t[1]);
    target += student_t_lpdf(sigma_l | 3, 0, 2) - 1 * student_t_lccdf(0 | 3, 0, 2);
    target += std_normal_lpdf(alpha_l[1]);
  }
  // Linear Predictor
  profile("Linear Predictor") {
    for (n in 1:N) {
      mu[n] += country[jj[n]] * Z_J[n] + survey_year[tt[n]] * Z_T[n] + cohort[ll[n]] * Z_L[n];
      }
    }
  // Likelihood
  profile("Likelihood") {
    target += bernoulli_logit_lpmf(Y | mu);
    }
}

However, I have run into what based on the profiling output below seems to be a severe bottle-neck in the linear predictor and likelihood evaluation sections.

              name thread_id total_time forward_time reverse_time  chain_stack no_chain_stack autodiff_calls no_autodiff_calls
1       Likelihood         1       4465         2649       1815.6       691190              0         345595                 1
2 Linear Predictor         1       7286         4830       2456.2 175600275450              0         345595                 1
3           Priors         1       1220         1006        213.8      4492735              0         345595                 1

I’m hoping there’s some way to implement the model more efficiently but I haven’t had much luck so far and as things stand now, if the null specification is this slow I imagine adding predictors will just make things worse since the full specification is a within-between random effects model with around 20 predictors and a cross-level interaction.

I’m using cmdstanr with cmdstan 2.27.0 on a Windows 10 desktop with a Ryzen 9 5900X, 64GB of Memory, and an Nvidia RTX 3090 and I’m hoping I can take advantage of the expanded OpenCL support that was added in the latest version of cmdstan since the 3090 should be capable of delivering some pretty substantial computational power (it requires less than half the wall time the CPU does for the OpenCL cmdstanr vignette model).

Any help anyone can provide with this would be very much appreciated.

Model File: Churchill_HLogit_Null.stan (2.9 KB)
Null Model Data: Data for the Stan Model

wds15 · July 9, 2021, 6:56am

You should use the glm version of Bernoulli logit lpmf.

… and I’d very much recommend you to write your model down using the brms R package. With that you can just turn on CPU parallelisation or OpenCL by using some arguments to the brm function… no need to code that yourself. I’d advise to use the GitHub version though and you need the cmdstanr backend.

rok_cesnovar · July 9, 2021, 12:57pm

Hey,

have you been able to get any speedup with CmdStan 2.27 with a GPU?

With 2.27 only the bernoulli_logit_lpmf will be run on the GPU from your model. The linear predictor calculation will still run on the CPU, as we currently do not support indexing vectors with OpenCL.

That will change soon though after this is merged: https://github.com/stan-dev/stan/pull/3048 Plus a bit more, but the majority of work is done so should be ready soon.

Even then though, the linear predictor will need to be calculated vectorized. If I hadnt make a mistake, that should just be:

mu = country[jj] .* Z_J + survey_year[tt] .* Z_T + cohort[ll] .* Z_L;

If its possible to convert this to use bernoulli_logit_glm_lpmf (12.3 Bernoulli-logit generalized linear model (Logistic Regression) | Stan Functions Reference) as @wds15 suggests that would help you speed this up without waiting for additional features. Otherwise, you will need to wait for indexing to be supported with OpenCL. That will severly decreased the number of transfers and thus enable speedups with GPUs.

ajnafa · July 10, 2021, 12:28am

Thank you for the suggestions. I’ve been able to get some fairly substantial speed gains using GPU based computation with cmdstan 2.27 in logistic regression models using bernoulli_logit_glm_lpmf (less than half the time required for purely CPU-based computation on average), though the gains in hierarchical specifications have been far more varied.

If I understand correctly, the use of bernoulli_logit_lpmf rather than bernoulli_logit_glm_lpmf may explain why the null model with only random intercepts at each level is less computationally efficient than the full specification which utilizes bernoulli_logit_glm_lpmf?

Replacing the original loop for the linear predictor with the vectorized version and using bernoulli_logit_glm_lpmf as shown below seems to be sampling faster and definitely has my GPU doing a lot more of the work (original code was at 9% CUDA usage and revised code is at 22%).

// Linear Predictor
vector[N]  mu = country[jj] .* Z_J + survey_year[tt] .* Z_T + cohort[ll] .* Z_L;
// Likelihood
target += bernoulli_logit_glm_lpmf(Y | X, mu, b);

The vast majority of the computational time seems to be spent on the random effects for country-birth-cohort so I may just have to drop those for the time being since they don’t change the substantive results in any meaningful way and including them adds at least 24 hours of wall time. I’ll keep an eye out for the OpenCL indexing update as well since that would definitely help with performance. Thank you.

rok_cesnovar · July 18, 2021, 2:46pm

@ajnafa Would you be willing to run a test with the latest developments?

Rewrote the model a bit, to make it vectorized and remove the need for the intermediate mu varaible.

// Hierarchical Logistic Regression Model of Support for Democracy
// Model 1: Null Model with Varying Intercepts country, survey-year, and cohort
// Data Dimensions: N = 208,107; L = 2460; T = 169; J = 58
data {
  int <lower = 0> N; // Total Number of Observations
  int <lower = 0, upper = 1> Y[N]; // The Response Vector
  int <lower = 1> K; // Number of Fixed Effects in the Model
  matrix[N, K] X; // Design Matrix for the Fixed Effects
  // Data for Country-Level Effects (Level 4)
  int <lower = 1> J; // Total Number of Countries J
  int<lower=1> M_J; // Number of Country Level Coefficients
  int <lower = 1> jj[N]; // Country-Observation Indicator
  vector[N] Z_J; // Country-Level Predictors
  // Data for Country-Year Effects  (Level 3)
  int<lower=1> T; // Total Number of Country-Years T
  int<lower=1> M_T; // Number of Country-Year Level Coefficients
  int<lower=1> tt[N]; // Grouping Indicator Per Observation
  vector[N] Z_T; // Time Varying Country-Level Predictors
  // Data for Country-Birth Cohort Effects  (Level 2)
  int<lower=1> L; // Total Number Cohorts L
  int<lower=1> M_L; // Number of Country-Birth Cohort Level Coefficients
  int<lower=1> ll[N]; // Country-Birth Cohort Indicator
  vector[N] Z_L; // Country-Birth Cohort Intercepts
}

parameters {
  vector[K] beta; // Fixed Effects Parameters
  vector <lower = 0>[M_J] sigma_j; // Country-Level Standard Deviations
  vector[J] alpha_j[M_J]; // Standardized Country-Level Intercepts
  vector <lower = 0>[M_T] sigma_t; // Country-Year Standard Deviations
  vector[T] alpha_t[M_T]; // Standardized Country-Level Intercepts
  vector <lower = 0>[M_L] sigma_l; // Country-Year Standard Deviations
  vector[L] alpha_l[M_L]; // Standardized Birth Cohort-Level Intercepts
}

transformed parameters {
  vector[J] country = (sigma_j[1] * (alpha_j[1])); // Actual Country-Level Effects
  vector[T] survey_year = (sigma_t[1] * (alpha_t[1])); // Actual Country-Year Effects
  vector[L] cohort = (sigma_l[1] * (alpha_l[1])); // Actual Birth Cohort Effects
}

model {
  // Priors for the Model Parameters
  profile("Priors") {
    target += normal_lpdf(beta | 0, 1);
    target += student_t_lpdf(sigma_j | 3, 0, 2) - 1 * student_t_lccdf(0 | 3, 0, 2);
    target += std_normal_lpdf(alpha_j[1]);
    target += student_t_lpdf(sigma_t | 3, 0, 2) - 1 * student_t_lccdf(0 | 3, 0, 2);
    target += std_normal_lpdf(alpha_t[1]);
    target += student_t_lpdf(sigma_l | 3, 0, 2) - 1 * student_t_lccdf(0 | 3, 0, 2);
    target += std_normal_lpdf(alpha_l[1]);
  }
  // Likelihood
  profile("Likelihood") {
    target += bernoulli_logit_lpmf(Y | X * beta + country[jj] .* Z_J + survey_year[tt] .* Z_T + cohort[ll] .* Z_L);
  }
}

You would need to then install a development build with:

install_cmdstan(release_url = "https://github.com/rok-cesnovar/cmdstan/releases/download/opencl-v0.1/cmdstan-opencl-test-v0.1.tar.gz", cores = 4)

Then run the following:

set_cmdstan_path(file.path(cmdstan_default_install_path(), "cmdstan-opencl-test-v0.1"))

data <- readRDS("HLogit_Null_Data.rds")

stan_file <- "Churchill_HLogit_Null.stan"
mod <- cmdstan_model(stan_file = stan_file, cpp_options = list(STAN_OPENCL=TRUE))

download.file(
  "https://gist.githubusercontent.com/rok-cesnovar/0a306c58a4915777fbcd8efaab712991/raw/54007d546312eb72166ab25a0e6faa1d5aef6efd/Churchill_HLogit_Null.hpp",
  "Churchill_HLogit_Null.hpp"
)

processx::run(
  cmdstanr:::make_cmd(),
  c("STAN_OPENCL=TRUE", cmdstanr:::cmdstan_ext(cmdstanr:::strip_ext(cmdstanr:::absolute_path(stan_file)))),
  wd = cmdstan_path()
)

fit <- mod$sample(data = data, chains = 1, seed = 123) #not sure how many iterations to run

There is a file download in there that replaces the generated C++ code with one that uses all the latest improvements that will be merged soon.
I made the changes manually to the generated C++ code. Locally, this model runs ~4 times faster with my Radeon VII vs a i7-4790 CPU with N = 200k.

ajnafa · July 19, 2021, 7:01pm

I’m in the middle of doing maintenance on my RTX 3090 system (custom loop) but I gave this a quick run on my office desktop that has a Ryzen 9 5900X and an Nvidia RTX 2070 Super which should have similar performance to a Radeon VII.

# Profiling results for 1 chain w/40 warmup and 10 sampling iterations using only the CPU (5900X) 
fit_cpu$profiles()
[[1]]
        name thread_id total_time forward_time reverse_time chain_stack no_chain_stack autodiff_calls no_autodiff_calls
1 Likelihood         1 764.354000   589.733000   174.620000      243240    44292453345          30405                 1
2     Priors         1   0.581463     0.356678     0.224786      304050              0          30405                 1

# Profiling results for 1 chain w/40 warmup and 10 sampling using OpenCL (RTX 2070 Super)
fit$profiles()
[[1]]
        name thread_id total_time forward_time reverse_time chain_stack no_chain_stack autodiff_calls no_autodiff_calls
1 Likelihood         1   789.1930      558.732     230.4610      284850    46106105850          31650                 1
2     Priors         1    72.9164       62.656      10.2604      443100              0          31650                 1

While I followed the directions you provided, I suspect I did something wrong somewhere because this is the first time I’ve ever seen my CPU out-perform my GPU entirely. On the other hand, there do appear to be some performance gains over OpenCL as implemented in cmdstan 2.27.0

# Profiling results for 1 chain w/40 warmup and 10 sampling using OpenCL (RTX 2070 Super) under cmdstan 2.27.0
fit_2$profiles()
[[1]]
        name thread_id total_time forward_time reverse_time chain_stack no_chain_stack autodiff_calls no_autodiff_calls
1 Likelihood         1   802.0440      574.727     227.3160      287496    46534390056          31944                 1
2     Priors         1    72.9852       62.787      10.1982      447216              0          31944                 1

I’ll try running it with multiple chains in parallel and a more realistic number of iterations overnight and see how things scale.

Thanks.

rok_cesnovar · July 19, 2021, 7:15pm

Thanks, make sure to set the same seed.

ajnafa · July 22, 2021, 4:48pm

After some discussion with one of my colleagues I made some slight changes to the data so now the models have N = 199247, K = 1, J = 51, T = 160, and L = 2329. I have uploaded the revised data here.

Using the modified code and instructions you provided, I re-ran the model both with and without OpenCL using 6 parallel chains with 1000 warm-up iterations and 1000 sampling iterations per chain. Due to some warnings I was getting with a few of the iterations using the default values, I set the max_treedepth and adapt_delta arguments to 11 and 0.90 respectively.

# Set the cmdstan path to the development build
set_cmdstan_path("F:/.cmdstanr/cmdstan-opencl-test-v0.1")

# Path to the revised Stan model file
stan_file <- "E:/Users/Documents/Churchill_HLogit_Null.stan"

# Compile the model with OpenCL
mod_opencl <- cmdstan_model(
  stan_file, 
  cpp_options = list(
    stan_opencl = TRUE,
    opencl_platform_id = 0,
    opencl_device_id = 0
  ), 
  force_recompile = T
)

# Not shown for brevity, but I run the download.file and processx::run code you 
# provided after compiling the model

# Fit the model with OpenCL using an Nvidia RTX 2070
opencl_fit <- mod_opencl$sample(
  data = stan_data,
  chains = 6,
  opencl_ids = c(0, 0),
  seed = 123,
  refresh = 1,
  iter_warmup = 1000,
  iter_sampling = 1000,
  parallel_chains = 6,
  adapt_delta = 0.90,
  max_treedepth = 11
)

# Print the total wall time per chain using OpenCL
opencl_fit$time()
# $total
# [1] 33866

# $chains
#  chain_id warmup sampling total
#1      1   20858   12154   33012
#2      2   20877   12169   33047
#3      3   21217   12016   33232
#4      4   23087   10775   33862
#5      5   21580   11845   33425
#6      6   21486   11915   33401

# Print the profiling information
opencl_fit$profiles() %>% 
  bind_rows(.id = "chain") %>% 
  select(chain:autodiff_calls)

#   chain       name thread_id total_time forward_time reverse_time chain_stack no_chain_stack autodiff_calls
#1      1 Likelihood         1      30469        19507      10961.8     3117609   483135520329         346401
#2      1     Priors         1       1911         1564        346.4     4849614              0         346401
#3      2 Likelihood         1      30520        19509      11011.2     3117996   483195493676         346444
#4      2     Priors         1       1900         1554        345.6     4850216              0         346444
#5      3 Likelihood         1      30689        19643      11045.7     3156372   489142618132         350708
#6      3     Priors         1       1919         1572        348.0     4909912              0         350708
#7      4 Likelihood         1      31248        20023      11225.0     3327687   515691284647         369743
#8      4     Priors         1       1967         1610        357.1     5176402              0         369743
#9      5 Likelihood         1      30863        19748      11115.6     3181905   493099464305         353545
#10     5     Priors         1       1930         1580        350.5     4949630              0         353545
#11     6 Likelihood         1      30831        19763      11067.7     3170745   491370000345         352305
#12     6     Priors         1       1935         1584        351.8     4932270              0         352305

# Compile the model without OpenCL
mod_cpu <- cmdstan_model(stan_file, force_recompile = T)

# Fit the model with only the Ryzen 9 5900X CPU
cpu_fit <- mod_cpu$sample(
  data = stan_data, 
  chains = 6,
  seed = 123,   
  refresh = 1,
  iter_warmup = 1000,
  iter_sampling = 1000,
  parallel_chains = 6,
  adapt_delta = 0.90,
  max_treedepth = 11
)

# Print total wall time per chain using OpenCL
cpu_fit$time()
#$total
#[1] 34298

#$chains
#  chain_id warmup sampling total
#1        1  21469    12549 34018
#2        2  22080    12215 34294
#3        3  21388    12542 33930
#4        4  21565    12501 34067
#5        5  21966    12308 34273
#6        6  21990    12284 34274

# Print the profiling information
cpu_fit$profiles() %>% 
   bind_rows(.id = "chain") %>% 
   select(chain:autodiff_calls)

#   chain       name thread_id total_time forward_time reverse_time chain_stack no_chain_stack autodiff_calls
#1      1 Likelihood         1   33872.20     22456.10     11416.10     2773424   483521860262         346678
#2      1     Priors         1      22.27        10.56        11.71     3466780              0         346678
#3      2 Likelihood         1   34147.60     22642.60     11505.00     2826072   492700571811         353259
#4      2     Priors         1      22.47        10.69        11.78     3532590              0         353259
#5      3 Likelihood         1   33783.90     22376.00     11407.90     2770744   483054626047         346343
#6      3     Priors         1      22.19        10.49        11.70     3463430              0         346343
#7      4 Likelihood         1   33920.90     22469.70     11451.10     2784912   485524691106         348114
#8      4     Priors         1      22.16        10.38        11.79     3481140              0         348114
#9      5 Likelihood         1   34125.30     22630.50     11494.80     2812600   490351848175         351575
#10     5     Priors         1      22.56        10.80        11.77     3515750              0         351575
#11     6 Likelihood         1   34125.70     22618.60     11507.10     2819976   491637788313         352497
#12     6     Priors         1      22.63        10.73        11.90     3524970              0         352497

# Calculate the total difference between OpenCL and CPU
cpu_fit$time()$total / opencl_fit$time()$total
# [1] 1.013

# Time differences for profiling (cpu time/gpu time); code not shown
chain   name     total_time_diff forward_time_diff reverse_time_diff
1     Likelihood       1.11             1.15               1.04  
1     Priors          0.0117           0.00675            0.0338
2     Likelihood       1.12             1.16               1.04  
2     Priors          0.0118           0.00688            0.0341
3     Likelihood       1.10             1.14               1.03  
3     Priors          0.0116           0.00667            0.0336
4     Likelihood       1.09             1.12               1.02  
4     Priors          0.0113           0.00644            0.0330
5     Likelihood       1.11             1.15               1.03  
5     Priors          0.0117           0.00683            0.0336
6     Likelihood       1.11             1.14               1.04  
6     Priors          0.0117           0.00677            0.0338

I’m still waiting on my replacement coolant to arrive but as soon as I have my 3090 system back up and running I’ll test it on that to see if there’s any difference in performance and I’m open to any suggestions you may have based on the results above. Thank you.

Topic		Replies	Views
(Another) Slow Hierarchical Model in Stan Modeling fitting-issues , specification , performance	3	1755	September 29, 2017
Speed up multilevel logit: large dataset and group-level variables Modeling techniques , specification , performance , cmdstanr	4	875	February 4, 2021
Theoretical lowest runtime – hierarchical linear model Modeling fitting-issues , specification , performance	14	1390	September 22, 2019
Improving Speed for hierarchical logit/probit models Modeling rstan , performance , hierarchical-model	2	718	October 26, 2020
Question about multilevel logistic model (mixed intercept logistic model) in Stan code Modeling rstan	4	489	June 7, 2021

Question Regarding Possible Speed Gains ( GPU or CPU ) for a Four Level Hierarchical Logistic Regression Model

Related topics