OpenCL in brms with the cmdstanr backend - making use of Stan Math OpenCL functions

Firstly, apologies for any formatting issues; I can’t find guidance on how to make use of the tags available.

I have been pursuing OpenCL for performance enhancement on some models I am writing in brms and cmdstanr. Here is some example code I am working on:



stan_funs_LNP_basic  =  "
real LNP_basic_lcdf(real y,real tempalpha,real tempsigma,real temptheta) {
    real mu;
    real ln_r;
    real lntmp;
    real ln_one_m_r;
    real alpha;
    real sigma;
    real theta;
    // softplus transform parameters with a little bit added near zero
    //alpha      = (tempalpha > 15) ? tempalpha : 0.1*log(1+exp(10*tempalpha))+1e-10*exp(-1*(exp(tempalpha)-1))  ;
    //sigma      = (tempsigma > 15) ? tempsigma : 0.1*log(1+exp(10*tempsigma))+1e-10*exp(-1*(exp(tempsigma)-1))  ;
    //theta      = (temptheta > 15) ? temptheta : 10.0*log(1+exp(0.1*temptheta))+1e-10*exp(-1*(exp(temptheta)-1))  ;
    
    // exponentiate to test gradient problems
    alpha      = exp(tempalpha )   ;
    sigma      = exp(tempsigma )   ;
    theta      = exp(temptheta )*1e4   ;
    
    lntmp = 0.5*log(2*pi()) +log(alpha)  +log( sigma)  + log(Phi_approx(alpha * sigma)) + 0.5 * pow(alpha * sigma, 2);
    
    ln_one_m_r = -log_sum_exp(0,lntmp);
    ln_r = lntmp+ln_one_m_r;

    mu = log(theta) - alpha * sigma * sigma;
    if (y < theta){
      return ln_r-log(Phi_approx(alpha * sigma)) +lognormal_lcdf(y|mu, sigma);
    } else {
      return log_sum_exp(ln_one_m_r + pareto_lcdf(y| theta, alpha),ln_r-log(Phi_approx(alpha * sigma)) +lognormal_lcdf(theta|mu, sigma));
    }
}

real LNP_ecdf_bands_lpdf(real y,real tempalpha,real templnpsigma,real temptheta,real tempsigma, int n, real bandmean) {
    return n*normal_lpdf( y| exp(LNP_basic_lcdf( bandmean| tempalpha, templnpsigma, temptheta)), exp(tempsigma));
}
"
stanvars_LNP_basic <- stanvar(scode = stan_funs_LNP_basic, block = "functions")

LNP_ecdf_bands<- custom_family(
  "LNP_ecdf_bands", dpars = c("mu","sigmaLNP","thetaLNP","sigma"),
  links = c("identity","identity","identity","identity"),
  #lb=c(NA,0,0),
  #ub=c(NA,5,1e6),
  vars = c("vint1[n]","vreal1[n]"),
  type = "real"
)
  
LNP_sample = function(N, alpha, sigma, theta, deductible){
  # generally assume that theta > deductible
  require(VGAM)  # for rpareto()
  SAMPLE_LNandP_01 <- list()
  for (i in 1:N) {
    # mixing weight r of the lognormal component
    tmp = sqrt(2 * pi) * alpha[i] * sigma[i] * pnorm(alpha[i] * sigma[i]) * exp(0.5 * (alpha[i] * sigma[i])^2)
    r = tmp / (tmp + 1)

    # Pareto component, with scale at least the deductible
    PA_VALUE = rpareto(1, max(theta[i], deductible[i]), alpha[i])

    # lognormal component, truncated to (deductible, theta]
    lncdf_theta = plnorm(theta[i], log(theta[i]) - alpha[i] * sigma[i]^2, sigma[i])
    lncdf_ded   = plnorm(deductible[i], log(theta[i]) - alpha[i] * sigma[i]^2, sigma[i])
    LN_VALUE = qlnorm(runif(1) * (lncdf_theta - lncdf_ded) + lncdf_ded,
                      log(theta[i]) - alpha[i] * sigma[i]^2, sigma[i])

    # adjust the mixing rate for the truncation, then draw the component indicator
    mod_rate = r * max(lncdf_theta - lncdf_ded, 0) / lncdf_theta
    R = rbinom(1, 1, mod_rate / (mod_rate + 1 - r))
    SAMPLE_LNandP_01$ACTUAL_LOSS[i] <- R * LN_VALUE + (1 - R) * PA_VALUE
  }
  return(SAMPLE_LNandP_01$ACTUAL_LOSS)
}

test_data = expand( data.frame(ay =c(2016,2021),am=c(1,12),t=c(1,60), sample = c(1,100)),ay = full_seq(ay,1),am=full_seq(am,1),t=full_seq(t,1),sample=full_seq(sample,1))%>%
            mutate(y = LNP_sample(n(),rep(1,n()),rep(2,n()),rep(1e3,n()),rep(0,n())))%>%
            mutate(ayxt = paste0(ay," ",t), inception = ay+(am-1)/12-min(ay))%>%
            group_by(ayxt)%>%
            mutate(ecdf = ecdf(y)(y))%>%
            mutate(ecdf_band = 1+ecdf%/%0.05)%>%
            group_by(ecdf_band, ayxt)%>%
            mutate(bandmean = mean(y),bandcount = n(),bandecdf = mean(ecdf))%>%
            mutate(bandmean = case_when(ecdf_band==20 ~ y, TRUE ~ bandmean))%>%
            mutate(bandecdf = case_when(ecdf_band==20 ~ ecdf, TRUE ~ bandecdf))%>%
            mutate(bandcount = case_when(ecdf_band==20 ~ 1, TRUE ~ bandcount*1.0))%>%
            group_by(bandmean,ayxt)%>%
            summarise(bandmean = max(bandmean), bandcount = max(bandcount), ay=max(ay), t=max(t), ayxt = max(ayxt),bandecdf = max(bandecdf), inception = max(inception))


ecdf_model_cov = brm(bf(bandecdf | vint(bandcount) + vreal(bandmean) ~ (1|p|ayxt) + s(t) + s(inception),
                        sigmaLNP + thetaLNP ~ (1|p|ayxt) + s(t) + s(inception),
                        sigma ~ 1,
                        nl = FALSE),
                     family = LNP_ecdf_bands,
                     stanvars = stanvars_LNP_basic,
                     data = test_data,
                     backend = "cmdstanr",
                     threads = threading(10),
                     iter = 200,
                     warmup = 100,
                     chains = 2,
                     refresh = 1,
                     control = list(max_treedepth = 50))

Broadly speaking, we are simplifying a distribution-fitting exercise by starting with matching the empirical CDF (ECDF) to the theoretical CDF of a lognormal-Pareto (LNP) mixture distribution. This is designed to assist diagnosis of LNP fits on a likelihood basis, rather than ECDF fitting being the end goal. The data are time series in which each cohort of samples evolves through time, and cohorts emanate from different inception dates, so the structure is something like a collection of time series. The focus is actuarial: long-tail claims analysis.
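For reference, the composite CDF that LNP_basic_lcdf implements, reconstructed from the code above (\Phi is the standard normal CDF, and \alpha, \sigma, \theta are the transformed parameters), is

F(y) = \begin{cases} \dfrac{r}{\Phi(\alpha\sigma)} \, \Phi\!\left(\dfrac{\ln y - \mu}{\sigma}\right), & y < \theta \\ r + (1 - r)\left(1 - (\theta/y)^{\alpha}\right), & y \ge \theta \end{cases} \qquad \mu = \ln\theta - \alpha\sigma^2,

with the lognormal mixing weight r given by

\frac{r}{1-r} = \sqrt{2\pi}\,\alpha\sigma\,\Phi(\alpha\sigma)\,e^{(\alpha\sigma)^2/2}.

Since (\ln\theta - \mu)/\sigma = \alpha\sigma, the two branches meet continuously at y = \theta.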

I am trying to find ways to improve the performance of the code constructed by brms using OpenCL. The code below was generated by make_stancode() from the brms model above:


// generated with brms 2.14.6
functions {
 /* compute correlated group-level effects
  * Args: 
  *   z: matrix of unscaled group-level effects
  *   SD: vector of standard deviation parameters
  *   L: cholesky factor correlation matrix
  * Returns: 
  *   matrix of scaled group-level effects
  */ 
  matrix scale_r_cor(matrix z, vector SD, matrix L) {
    // r is stored in another dimension order than z
    return transpose(diag_pre_multiply(SD, L) * z);
  }
  /* integer sequence of values
   * Args: 
   *   start: starting integer
   *   end: ending integer
   * Returns: 
   *   an integer sequence from start to end
   */ 
  int[] sequence(int start, int end) { 
    int seq[end - start + 1];
    for (n in 1:num_elements(seq)) {
      seq[n] = n + start - 1;
    }
    return seq; 
  } 
  
real LNP_basic_lcdf(real y,real tempalpha,real tempsigma,real temptheta) {
    real mu;
    real ln_r;
    real lntmp;
    real ln_one_m_r;
    real alpha;
    real sigma;
    real theta;
    // softplus transform parameters with a little bit added near zero
    //alpha      = (tempalpha > 15) ? tempalpha : 0.1*log(1+exp(10*tempalpha))+1e-10*exp(-1*(exp(tempalpha)-1))  ;
    //sigma      = (tempsigma > 15) ? tempsigma : 0.1*log(1+exp(10*tempsigma))+1e-10*exp(-1*(exp(tempsigma)-1))  ;
    //theta      = (temptheta > 15) ? temptheta : 10.0*log(1+exp(0.1*temptheta))+1e-10*exp(-1*(exp(temptheta)-1))  ;
    
    // exponentiate to test gradient problems
    alpha      = exp(tempalpha )   ;
    sigma      = exp(tempsigma )   ;
    theta      = exp(temptheta )*1e4   ;
    
    lntmp = 0.5*log(2*pi()) +log(alpha)  +log( sigma)  + log(Phi_approx(alpha * sigma)) + 0.5 * pow(alpha * sigma, 2);
    
    ln_one_m_r = -log_sum_exp(0,lntmp);
    ln_r = lntmp+ln_one_m_r;

    mu = log(theta) - alpha * sigma * sigma;
    if (y < theta){
      return ln_r-log(Phi_approx(alpha * sigma)) +lognormal_lcdf(y|mu, sigma);
    } else {
      return log_sum_exp(ln_one_m_r + pareto_lcdf(y| theta, alpha),ln_r-log(Phi_approx(alpha * sigma)) +lognormal_lcdf(theta|mu, sigma));
    }
}

real LNP_ecdf_bands_lpdf(real y,real alpha,real lnpsigma,real theta,real sigma, int n, real bandmean) {
    
    return n*normal_lpdf( y| exp(LNP_basic_lcdf( bandmean| alpha, lnpsigma, theta)), exp(sigma));
}


  // compute partial sums of the log-likelihood
  real partial_log_lik_lpmf(int[] seq, int start, int end, vector Y, real[] vreal1, int[] vint1, real Intercept, real Intercept_sigmaLNP, real Intercept_thetaLNP, real Intercept_sigma, int[] J_1, vector Z_1_1, vector Z_1_sigmaLNP_2, vector Z_1_thetaLNP_3, vector r_1_1, vector r_1_sigmaLNP_2, vector r_1_thetaLNP_3) {
    real ptarget = 0;
    int N = end - start + 1;
    // initialize linear predictor term
    vector[N] mu = Intercept + rep_vector(0.0, N);
    // initialize linear predictor term
    vector[N] sigmaLNP = Intercept_sigmaLNP + rep_vector(0.0, N);
    // initialize linear predictor term
    vector[N] thetaLNP = Intercept_thetaLNP + rep_vector(0.0, N);
    // initialize linear predictor term
    vector[N] sigma = Intercept_sigma + rep_vector(0.0, N);
    for (n in 1:N) {
      // add more terms to the linear predictor
      int nn = n + start - 1;
      mu[n] += r_1_1[J_1[nn]] * Z_1_1[nn];
    }
    for (n in 1:N) {
      // add more terms to the linear predictor
      int nn = n + start - 1;
      sigmaLNP[n] += r_1_sigmaLNP_2[J_1[nn]] * Z_1_sigmaLNP_2[nn];
    }
    for (n in 1:N) {
      // add more terms to the linear predictor
      int nn = n + start - 1;
      thetaLNP[n] += r_1_thetaLNP_3[J_1[nn]] * Z_1_thetaLNP_3[nn];
    }
    for (n in 1:N) {
      int nn = n + start - 1;
      ptarget += LNP_ecdf_bands_lpdf(Y[nn] | mu[n], sigmaLNP[n], thetaLNP[n], sigma[n], vint1[nn], vreal1[nn]);
    }
    return ptarget;
  }
}
data {
  int<lower=1> N;  // total number of observations
  vector[N] Y;  // response variable
  // data for custom real vectors
  real vreal1[N];
  // data for custom integer vectors
  int vint1[N];
  int grainsize;  // grainsize for threading
  // data for group-level effects of ID 1
  int<lower=1> N_1;  // number of grouping levels
  int<lower=1> M_1;  // number of coefficients per level
  int<lower=1> J_1[N];  // grouping indicator per observation
  // group-level predictor values
  vector[N] Z_1_1;
  vector[N] Z_1_sigmaLNP_2;
  vector[N] Z_1_thetaLNP_3;
  int<lower=1> NC_1;  // number of group-level correlations
  int prior_only;  // should the likelihood be ignored?
}
transformed data {
  int seq[N] = sequence(1, N);
}
parameters {
  real Intercept;  // temporary intercept for centered predictors
  real Intercept_sigmaLNP;  // temporary intercept for centered predictors
  real Intercept_thetaLNP;  // temporary intercept for centered predictors
  real Intercept_sigma;  // temporary intercept for centered predictors
  vector<lower=0>[M_1] sd_1;  // group-level standard deviations
  matrix[M_1, N_1] z_1;  // standardized group-level effects
  cholesky_factor_corr[M_1] L_1;  // cholesky factor of correlation matrix
}
transformed parameters {
  matrix[N_1, M_1] r_1;  // actual group-level effects
  // using vectors speeds up indexing in loops
  vector[N_1] r_1_1;
  vector[N_1] r_1_sigmaLNP_2;
  vector[N_1] r_1_thetaLNP_3;
  // compute actual group-level effects
  r_1 = scale_r_cor(z_1, sd_1, L_1);
  r_1_1 = r_1[, 1];
  r_1_sigmaLNP_2 = r_1[, 2];
  r_1_thetaLNP_3 = r_1[, 3];
}
model {
  // likelihood including constants
  if (!prior_only) {
    target += reduce_sum(partial_log_lik_lpmf, seq, grainsize, Y, vreal1, vint1, Intercept, Intercept_sigmaLNP, Intercept_thetaLNP, Intercept_sigma, J_1, Z_1_1, Z_1_sigmaLNP_2, Z_1_thetaLNP_3, r_1_1, r_1_sigmaLNP_2, r_1_thetaLNP_3);
  }
  // priors including constants
  target += student_t_lpdf(Intercept | 3, 1, 2.5);
  target += student_t_lpdf(Intercept_sigma | 3, 0, 2.5);
  target += student_t_lpdf(sd_1 | 3, 0, 2.5)
    - 3 * student_t_lccdf(0 | 3, 0, 2.5);
  target += std_normal_lpdf(to_vector(z_1));
  target += lkj_corr_cholesky_lpdf(L_1 | 1);
}
generated quantities {
  // actual population-level intercept
  real b_Intercept = Intercept;
  // actual population-level intercept
  real b_sigmaLNP_Intercept = Intercept_sigmaLNP;
  // actual population-level intercept
  real b_thetaLNP_Intercept = Intercept_thetaLNP;
  // actual population-level intercept
  real b_sigma_Intercept = Intercept_sigma;
  // compute group-level correlations
  corr_matrix[M_1] Cor_1 = multiply_lower_tri_self_transpose(L_1);
  vector<lower=-1,upper=1>[NC_1] cor_1;
  // extract upper diagonal of correlation matrix
  for (k in 1:M_1) {
    for (j in 1:(k - 1)) {
      cor_1[choose(k - 1, 2) + j] = Cor_1[j, k];
    }
  }
}

After reading through the recent Stan Math library and several posts, I am getting the idea that I should do away with reduce_sum and modify partial_log_lik_lpmf to convert the parameter vectors to matrix_cl, while converting the static matrices to matrix_cl in the transformed data section. To do this, I gather that I would need to call several functions from Stan Math inside the Stan code using #include statements.

I would really like to see a piece of example code where to_matrix_cl and some kernel are called from within the Stan code using a #include statement (as that seems to be what is required). There don’t seem to be any simple examples to get me going, as everyone involved seems to be writing their code directly in C++.

This would allow me to modify the brms code to run a little more efficiently on my GPU, I hope. At this stage my code is taking days to run on my 10-core Xeon system while my Nvidia Quadro RTX 6000 card sits idle.

Perhaps someone can provide a pointer or two.

Kind regards
Cynon Sonkkila

1 Like

Hi Cynon,

the release version of CmdStan (2.25.0) has automatic OpenCL support for the GLM functions, cholesky_decompose, multiply, and mdivide_left_tri. Quickly looking at your model, I am not sure those will help you a lot.

The plan for the next release is to have automatic support for:

  • bernoulli_lpmf
  • bernoulli_logit_lpmf
  • bernoulli_logit_glm_lpmf
  • beta_lpdf
  • beta_proportion_lpdf
  • binomial_lpmf
  • categorical_logit_glm_lpmf
  • cauchy_lpdf
  • chi_square_lpdf
  • double_exponential_lpdf
  • exp_mod_normal_lpdf
  • exponential_lpdf
  • frechet_lpdf
  • gamma_lpdf
  • gumbel_lpdf
  • inv_chi_square_lpdf
  • inv_gamma_lpdf
  • logistic_lpdf
  • lognormal_lpdf
  • neg_binomial_lpmf
  • neg_binomial_2_lpmf
  • neg_binomial_2_log_lpmf
  • neg_binomial_2_log_glm_lpmf
  • normal_lpdf
  • normal_id_glm_lpdf
  • ordered_logistic_glm_lpmf
  • pareto_lpdf
  • pareto_type_2_lpdf
  • poisson_lpmf
  • poisson_log_lpmf
  • poisson_log_glm_lpmf
  • rayleigh_lpdf
  • scaled_inv_chi_square_lpdf
  • skew_normal_lpdf
  • std_normal_lpdf
  • student_t_lpdf
  • uniform_lpdf
  • weibull_lpdf

and the following functions:

  • acos, acosh
  • add, operator+
  • asin, asinh
  • atan, atanh
  • block, col, row
  • cholesky_decompose
  • cols, dims, num_elements, rows, size
  • cos, cosh
  • crossprod, tcrossprod
  • diag_matrix, diagonal
  • head, tail
  • multiply, operator*
  • sin, sinh
  • subtract, operator-
  • sum
  • tan, tanh
  • transpose

We are also working hard on adding more functions and plan to add another 20-30 before the release.

For now, the listed functions are supported in the Stan Math backend but not by the compiler, so in order to use them you would need to generate the C++ code, modify it manually, and then recompile. It’s not something that is really simple, but it should not be too difficult either.

Before we dig into the OpenCL stuff, I would go and profile the model to see where its bottlenecks are. In order to do that, you again have to use something that is in the works for 2.26: profiling.

A version supporting profiling is available here: https://github.com/rok-cesnovar/cmdstan/releases/download/profiling/cmdstan-profiling.tgz

An example of its use is here:

data {
  int<lower=0> N;
  int<lower=0> K;
  int<lower=0,upper=1> y[N];
  matrix[N, K] X;
}
parameters {
  real alpha;
  vector[K] beta;
}
model {
  {
    profile("normal_lpdf alpha");
    target += normal_lpdf(alpha | 0, 1);
  }
  
  {
    profile("normal_lpdf beta");
    target += normal_lpdf(beta | 0, 1);
  }
  {
    profile("bernoulli GLM");
    target += bernoulli_logit_glm_lpmf(y | X, alpha, beta);
  }
}

Once we figure out what the largest bottleneck is, we can then check whether those functions have OpenCL support.

2 Likes

@Cynon would you be able to generate the Stan code without threading? That would probably make it a bit easier to read and find a bottleneck.

Great to be able to respond to you, Rok. Thanks for the quick reply. I have been reading most of your posts and can provide a cut-down version of the code for you:

stancode = make_stancode(bf(bandecdf | vint(bandcount) + vreal(bandmean) ~ 1,
                            sigmaLNP + thetaLNP ~ 1,
                            sigma ~ 1,
                            nl = FALSE),
                         family = LNP_ecdf_bands,
                         stanvars = stanvars_LNP_basic,
                         data = test_data
                         #, backend = "cmdstanr"
                         #, threads = threading(10)
                         )

This returns the new Stan code:

// generated with brms 2.14.6
functions {
real LNP_basic_lcdf(real y,real tempalpha,real tempsigma,real temptheta) {
    real mu;
    real ln_r;
    real lntmp;
    real ln_one_m_r;
    real alpha;
    real sigma;
    real theta;
  // exponentiate as a link function
    alpha      = exp(tempalpha )   ;
    sigma      = exp(tempsigma )   ;
    theta      = exp(temptheta )*1e4   ;
    
    // below is one possible way to generate the lcdf for the lognormal pareto distribution
    lntmp = 0.5*log(2*pi()) +log(alpha)  +log( sigma)  + log(Phi_approx(alpha * sigma)) + 0.5 * pow(alpha * sigma, 2);
    
    ln_one_m_r = -log_sum_exp(0,lntmp);
    ln_r = lntmp+ln_one_m_r;

    mu = log(theta) - alpha * sigma * sigma;
    if (y < theta){
      return ln_r-log(Phi_approx(alpha * sigma)) +lognormal_lcdf(y|mu, sigma);
    } else {
      return log_sum_exp(ln_one_m_r + pareto_lcdf(y| theta, alpha),ln_r-log(Phi_approx(alpha * sigma)) +lognormal_lcdf(theta|mu, sigma));
    }
}

real LNP_ecdf_bands_lpdf(real y,real tempalpha,real templnpsigma,real temptheta,real tempsigma, int n, real bandmean) {
  
    return n*normal_lpdf( y| exp(LNP_basic_lcdf( bandmean| tempalpha, templnpsigma, temptheta)), exp(tempsigma));
}


}
data {
  int<lower=1> N;  // total number of observations
  vector[N] Y;  // response variable
  // data for custom real vectors
  real vreal1[N];
  // data for custom integer vectors
  int vint1[N];
  int prior_only;  // should the likelihood be ignored?
}
transformed data {
}
parameters {
  real Intercept;  // temporary intercept for centered predictors
  real Intercept_sigmaLNP;  // temporary intercept for centered predictors
  real Intercept_thetaLNP;  // temporary intercept for centered predictors
  real Intercept_sigma;  // temporary intercept for centered predictors
}
transformed parameters {
}
model {
  // likelihood including constants
  if (!prior_only) {
    // initialize linear predictor term
    vector[N] mu = Intercept + rep_vector(0.0, N);
    // initialize linear predictor term
    vector[N] sigmaLNP = Intercept_sigmaLNP + rep_vector(0.0, N);
    // initialize linear predictor term
    vector[N] thetaLNP = Intercept_thetaLNP + rep_vector(0.0, N);
    // initialize linear predictor term
    vector[N] sigma = Intercept_sigma + rep_vector(0.0, N);
    for (n in 1:N) {
      target += LNP_ecdf_bands_lpdf(Y[n] | mu[n], sigmaLNP[n], thetaLNP[n], sigma[n], vint1[n], vreal1[n]);
    }
  }
  // priors including constants
  target += student_t_lpdf(Intercept | 3, 1, 2.5);
  target += student_t_lpdf(Intercept_sigma | 3, 0, 2.5);
}
generated quantities {
  // actual population-level intercept
  real b_Intercept = Intercept;
  // actual population-level intercept
  real b_sigmaLNP_Intercept = Intercept_sigmaLNP;
  // actual population-level intercept
  real b_thetaLNP_Intercept = Intercept_thetaLNP;
  // actual population-level intercept
  real b_sigma_Intercept = Intercept_sigma;
}

Here we see the call to LNP_ecdf_bands_lpdf() in a loop over N after turning off the cmdstanr backend and threading, which removes the use of reduce_sum. Vectorising this code does not improve performance on a CPU unless we use reduce_sum, but converting everything to matrix_cl, with a matrix_cl version of the lpdf function implemented on the GPU, would likely improve performance.

Is it not possible for me to declare (but not define) the to_matrix_cl function in the functions section (along with any other function available in Stan Math), and then call it (and others) inside LNP_ecdf_bands_lpdf()? This seems to be described in “Interfacing with External C++ Code” for rstan [https://mc-stan.org/rstan/articles/external.html].
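The rstan mechanism referenced there looks roughly like the following sketch (the function name and signature are hypothetical; it relies on compiling with allow_undefined = TRUE and an includes argument pointing at a C++ header that supplies the definition):

functions {
  // declared but deliberately not defined in Stan;
  // the body must be supplied via external C++ at compile time
  real my_external_lpdf(vector y, vector mu, vector sigma);
}

Note, however, that such a declaration can only use Stan types; matrix_cl is a C++-only type, so to_matrix_cl cannot be declared this way.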

Interested in your thoughts here, as modifying the generated C++ code directly is the next step, and a new skill I would need to develop.

Regards
Cy

As I see it, I need to produce a new version of the normal_lpdf function that returns a matrix_cl (the vector of the lpdf values for each row, rather than the usual scalar result) and takes matrix_cl arguments.

The performance improvement would only materialize if the formula for the normal lpdf, when implemented in array form, could be parallelized across the many compute cores on my GPU.

I was thinking something like:

vector LNP_lcdf_function(vector y,vector tempalpha,vector tempsigma,vector temptheta) {
    int N = rows(y);

    vector[N] alpha = exp(tempalpha);   // eventually to_matrix_cl(exp(tempalpha))
    vector[N] sigma= exp(tempsigma );
    vector[N] theta=exp(temptheta )*1e4;
    vector[N] alphasigma= alpha .* sigma;
    
    vector[N] lntmp = 0.5*log(2*pi()) +log(alpha)  +log( sigma)  + log(Phi_approx(alphasigma)) + 0.5 * alphasigma .* alphasigma;
    vector[N] fm = fmax(0,lntmp);
    vector[N] ln_one_m_r = -fm-log(exp(0-fm)+exp(lntmp-fm));
    vector[N] ln_r = lntmp+ln_one_m_r;
    vector[N] mu = log(theta) - alpha .* square(sigma);
    vector[N] y_gt_theta = 0.5*(1 + (y-theta) ./ fabs(y-theta));
    
    vector[N] a= ln_r-log(Phi_approx(alphasigma)) +log(Phi_approx((log(y)-mu) ./ sigma));
    vector[N] b=ln_one_m_r + log(1-(theta ./ fmax(theta,y))^alpha);
    vector[N] c=fmax(a,b);
    
    return  (1-y_gt_theta) .* a +y_gt_theta .* ( c+log(exp(b-c)+exp(a-c)) );

    // 1-(theta/x)^alpha is the pareto cdf
    
}

This would replace the LNP_basic_lcdf function. But perhaps this is too involved for getting my point across. Consider, instead, the normal_lpdf() function, where the density is:
f(x) = \frac{1}{\sigma\sqrt{2\pi}} \, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^{2}}

I would replace normal_lpdf with a weighted version. I note that the weights and the y vectors would be static and passed in as data. So the code might be something like:

real normal_lpdf_function(vector vy, vector vmu, vector vsigma, vector vweights){
  int n = rows(vy);
  matrix_cl[n] weights = to_matrix_cl(vweights);
  matrix_cl[n] y = to_matrix_cl(vy);
  matrix_cl[n] mu = to_matrix_cl(vmu);
  matrix_cl[n] sigma = to_matrix_cl(vsigma);

  real lp = transpose(weights) * ( -log(sigma) - 0.5*log(2*pi()) - 0.5*diag_matrix(diag_matrix(y-mu)*(1/sigma))*(diag_matrix(y-mu)*(1/sigma)) );
  return lp;
}

At least I am hoping something like this can be achieved. I am assuming that when lp is calculated, the OpenCL forms of the functions can easily parallelize the calculation across the rows of the vectors. I have used diag_matrix() as a stand-in to avoid the elementwise operator .*, which, I imagine, would be slower, but maybe I am mistaken?
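For comparison, here is the same weighted log density written with ordinary elementwise operators, in plain Stan with no OpenCL types (a sketch; the function name is illustrative):

real weighted_normal_lpdf_v(vector vy, vector vmu, vector vsigma, vector vweights) {
  // per-observation normal log densities, computed elementwise
  vector[rows(vy)] lp = -log(vsigma) - 0.5 * log(2 * pi())
                        - 0.5 * square((vy - vmu) ./ vsigma);
  // weighted sum over observations
  return dot_product(vweights, lp);
}

The elementwise operators work on n-vectors directly, whereas diag_matrix() materializes an n-by-n matrix, so the diag_matrix() stand-in is likely to be slower rather than faster.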

Let me know what you think.

Regards
Cy

Thanks Rok,

Regarding profiling: I am already aware that the bottleneck is the log-likelihood calculation step. Calculating this step on the GPU and only returning the final log-likelihood is my plan.

It would be incredibly helpful if you could provide a minimal example of Stan code that you have compiled to C++ and then modified in order to use the GPU. This would give everyone a material insight into what is needed when modifying the C++ directly. One example I can see online is “External C++ (experimental)” in the PyStan 2.19.1.1 documentation, although I am working in R.

Sorry, I thought I responded, but it seems I got distracted before I clicked send… Oh well.

OK, so the first step is to vectorize the code as much as we can. Loops over scalars are not performant with the OpenCL backend. But let’s first just vectorize and then deal with OpenCL.

Let’s use this code to start and try to vectorize it as much as possible:

// generated with brms 2.14.6
functions {
real LNP_basic_lcdf(real y,real tempalpha,real tempsigma,real temptheta) {
    real mu;
    real ln_r;
    real lntmp;
    real ln_one_m_r;
    real alpha;
    real sigma;
    real theta;
  // exponentiate as a link function
    alpha      = exp(tempalpha )   ;
    sigma      = exp(tempsigma )   ;
    theta      = exp(temptheta )*1e4   ;
    
    // below is one possible way to generate the lcdf for the lognormal pareto distribution
    lntmp = 0.5*log(2*pi()) +log(alpha)  +log( sigma)  + log(Phi_approx(alpha * sigma)) + 0.5 * pow(alpha * sigma, 2);
    
    ln_one_m_r = -log_sum_exp(0,lntmp);
    ln_r = lntmp+ln_one_m_r;

    mu = log(theta) - alpha * sigma * sigma;
    if (y < theta){
      return ln_r-log(Phi_approx(alpha * sigma)) +lognormal_lcdf(y|mu, sigma);
    } else {
      return log_sum_exp(ln_one_m_r + pareto_lcdf(y| theta, alpha),ln_r-log(Phi_approx(alpha * sigma)) +lognormal_lcdf(theta|mu, sigma));
    }
}

real LNP_ecdf_bands_lpdf(real y,real tempalpha,real templnpsigma,real temptheta,real tempsigma, int n, real bandmean) {
  
    return n*normal_lpdf( y| exp(LNP_basic_lcdf( bandmean| tempalpha, templnpsigma, temptheta)), exp(tempsigma));
}


}
data {
  int<lower=1> N;  // total number of observations
  vector[N] Y;  // response variable
  // data for custom real vectors
  real vreal1[N];
  // data for custom integer vectors
  int vint1[N];
  int prior_only;  // should the likelihood be ignored?
}
transformed data {
}
parameters {
  real Intercept;  // temporary intercept for centered predictors
  real Intercept_sigmaLNP;  // temporary intercept for centered predictors
  real Intercept_thetaLNP;  // temporary intercept for centered predictors
  real Intercept_sigma;  // temporary intercept for centered predictors
}
transformed parameters {
}
model {
  // likelihood including constants
  if (!prior_only) {
    // initialize linear predictor term
    vector[N] mu = Intercept + rep_vector(0.0, N);
    // initialize linear predictor term
    vector[N] sigmaLNP = Intercept_sigmaLNP + rep_vector(0.0, N);
    // initialize linear predictor term
    vector[N] thetaLNP = Intercept_thetaLNP + rep_vector(0.0, N);
    // initialize linear predictor term
    vector[N] sigma = Intercept_sigma + rep_vector(0.0, N);
    for (n in 1:N) {
      target += LNP_ecdf_bands_lpdf(Y[n] | mu[n], sigmaLNP[n], thetaLNP[n], sigma[n], vint1[n], vreal1[n]);
    }
  }
  // priors including constants
  target += student_t_lpdf(Intercept | 3, 1, 2.5);
  target += student_t_lpdf(Intercept_sigma | 3, 0, 2.5);
}
generated quantities {
  // actual population-level intercept
  real b_Intercept = Intercept;
  // actual population-level intercept
  real b_sigmaLNP_Intercept = Intercept_sigmaLNP;
  // actual population-level intercept
  real b_thetaLNP_Intercept = Intercept_thetaLNP;
  // actual population-level intercept
  real b_sigma_Intercept = Intercept_sigma;
}

So this loop here:

for (n in 1:N) {
      target += LNP_ecdf_bands_lpdf(Y[n] | mu[n], sigmaLNP[n], thetaLNP[n], sigma[n], vint1[n], vreal1[n]);
}

should be replaced with a vectorized version if possible.
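For illustration, the vectorized replacement might take roughly this shape (a sketch, not working code; LNP_lcdf_vec is a hypothetical elementwise, vector-returning version of LNP_basic_lcdf):

{
  vector[N] band_lcdf = LNP_lcdf_vec(to_vector(vreal1), mu, sigmaLNP, thetaLNP);
  // weighted normal log density in vector form; the sd on the
  // original scale is exp(sigma), so log(sd) is simply sigma
  vector[N] lp = -sigma - 0.5 * log(2 * pi())
                 - 0.5 * square((Y - exp(band_lcdf)) ./ exp(sigma));
  target += dot_product(to_vector(vint1), lp);
}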

Here is one way of achieving it.

I’m not sure how to efficiently replicate the .* elementwise multiplication. I thought it might be faster to use diag_pre_multiply(v, v) or something like that. Let me know and I will rewrite.

Also, you suggested:

target += dot_product(vint1, normal_lpdf( y| exp(LNP_basic_lcdf( vreal1| mu, sigmaLNP, thetaLNP)), exp(sigma)))

which implies that the normal_lpdf() function returns a vector. But I don’t see that this is the case anywhere; as far as I can tell, the built-in lpdf functions return the scalar sum of the log densities. In my code I have had to expand out the lpdf formula for the normal in vector form manually so that it returns a vector. If there is a more efficient way, I would be grateful to hear about it, and maybe you could confirm whether normal_lpdf() can be induced to return a vector for your dot product? That is not documented as far as I can see. Maybe you are intending to implement it this way, which I totally agree with if it is possible.

Regards

Cy

openclcode.stan (10.3 KB)

1 Like