Firstly, apologies for any formatting issues; I can't find guidance on how to make use of the forum tags available.
I have been pursuing OpenCL to improve the performance of some models I am writing in brms and cmdstanr. Here is some example code I am working on:
stan_funs_LNP_basic = "
real LNP_basic_lcdf(real y,real tempalpha,real tempsigma,real temptheta) {
real mu;
real ln_r;
real lntmp;
real ln_one_m_r;
real alpha;
real sigma;
real theta;
// softplus transform parameters with a little bit added near zero
//alpha = (tempalpha > 15) ? tempalpha : 0.1*log(1+exp(10*tempalpha))+1e-10*exp(-1*(exp(tempalpha)-1)) ;
//sigma = (tempsigma > 15) ? tempsigma : 0.1*log(1+exp(10*tempsigma))+1e-10*exp(-1*(exp(tempsigma)-1)) ;
//theta = (temptheta > 15) ? temptheta : 10.0*log(1+exp(0.1*temptheta))+1e-10*exp(-1*(exp(temptheta)-1)) ;
// exponentiate to test gradient problems
alpha = exp(tempalpha ) ;
sigma = exp(tempsigma ) ;
theta = exp(temptheta )*1e4 ;
lntmp = 0.5*log(2*pi()) +log(alpha) +log( sigma) + log(Phi_approx(alpha * sigma)) + 0.5 * pow(alpha * sigma, 2);
ln_one_m_r = -log_sum_exp(0,lntmp);
ln_r = lntmp+ln_one_m_r;
mu = log(theta) - alpha * sigma * sigma;
if (y < theta){
return ln_r-log(Phi_approx(alpha * sigma)) +lognormal_lcdf(y|mu, sigma);
} else {
return log_sum_exp(ln_one_m_r + pareto_lcdf(y| theta, alpha),ln_r-log(Phi_approx(alpha * sigma)) +lognormal_lcdf(theta|mu, sigma));
}
}
real LNP_ecdf_bands_lpdf(real y,real tempalpha,real templnpsigma,real temptheta,real tempsigma, int n, real bandmean) {
return n*normal_lpdf( y| exp(LNP_basic_lcdf( bandmean| tempalpha, templnpsigma, temptheta)), exp(tempsigma));
}
"
library(brms)
stanvars_LNP_basic <- stanvar(scode = stan_funs_LNP_basic, block = "functions")
LNP_ecdf_bands <- custom_family(
  "LNP_ecdf_bands", dpars = c("mu", "sigmaLNP", "thetaLNP", "sigma"),
  links = c("identity", "identity", "identity", "identity"),
  #lb = c(NA, 0, 0),
  #ub = c(NA, 5, 1e6),
  vars = c("vint1[n]", "vreal1[n]"),
  type = "real"
)
LNP_sample = function(N, alpha, sigma, theta, deductible) {
  # generally assume that theta > deductible
  require(VGAM)
  ACTUAL_LOSS <- numeric(N)
  for (i in 1:N) {
    # mixture weight r of the lognormal component
    tmp = sqrt(2 * pi) * alpha[i] * sigma[i] * pnorm(alpha[i] * sigma[i]) * exp(0.5 * (alpha[i] * sigma[i])^2)
    r = tmp / (tmp + 1)
    PA_VALUE = rpareto(1, max(theta[i], deductible[i]), alpha[i])
    lncdf_theta = plnorm(theta[i], log(theta[i]) - alpha[i] * sigma[i]^2, sigma[i])
    lncdf_ded = plnorm(deductible[i], log(theta[i]) - alpha[i] * sigma[i]^2, sigma[i])
    # lognormal draw truncated to (deductible, theta) via the inverse CDF
    LN_VALUE = qlnorm(runif(1) * (lncdf_theta - lncdf_ded) + lncdf_ded,
                      log(theta[i]) - alpha[i] * sigma[i]^2, sigma[i])
    # adjust the mixing rate for the truncation below theta
    mod_rate = r * max(lncdf_theta - lncdf_ded, 0) / lncdf_theta
    R = rbinom(1, 1, mod_rate / (mod_rate + 1 - r))
    ACTUAL_LOSS[i] <- R * LN_VALUE + (1 - R) * PA_VALUE
  }
  return(ACTUAL_LOSS)
}
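As a quick sanity check of the simulator (the parameter values just mirror those used for the test data below):
set.seed(123)
y_sim <- LNP_sample(1000, alpha = rep(1, 1000), sigma = rep(2, 1000),
                    theta = rep(1e3, 1000), deductible = rep(0, 1000))
summary(y_sim)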
library(dplyr)
library(tidyr)
test_data = expand(data.frame(ay = c(2016, 2021), am = c(1, 12), t = c(1, 60), sample = c(1, 100)),
                   ay = full_seq(ay, 1), am = full_seq(am, 1), t = full_seq(t, 1), sample = full_seq(sample, 1)) %>%
  mutate(y = LNP_sample(n(), rep(1, n()), rep(2, n()), rep(1e3, n()), rep(0, n()))) %>%
  mutate(ayxt = paste0(ay, " ", t), inception = ay + (am - 1) / 12 - min(ay)) %>%
  group_by(ayxt) %>%
  mutate(ecdf = ecdf(y)(y)) %>%
  mutate(ecdf_band = 1 + ecdf %/% 0.05) %>%
  group_by(ecdf_band, ayxt) %>%
  mutate(bandmean = mean(y), bandcount = n(), bandecdf = mean(ecdf)) %>%
  # keep the top band as individual observations rather than a summarised band
  mutate(bandmean = case_when(ecdf_band == 20 ~ y, TRUE ~ bandmean)) %>%
  mutate(bandecdf = case_when(ecdf_band == 20 ~ ecdf, TRUE ~ bandecdf)) %>%
  mutate(bandcount = case_when(ecdf_band == 20 ~ 1, TRUE ~ bandcount * 1.0)) %>%
  group_by(bandmean, ayxt) %>%
  summarise(bandcount = max(bandcount), ay = max(ay), t = max(t),
            bandecdf = max(bandecdf), inception = max(inception))
ecdf_model_cov = brm(
  bf(bandecdf | vint(bandcount) + vreal(bandmean) ~ (1 | p | ayxt) + s(t) + s(inception),
     sigmaLNP + thetaLNP ~ (1 | p | ayxt) + s(t) + s(inception),
     sigma ~ 1,
     nl = FALSE),
  family = LNP_ecdf_bands,
  stanvars = stanvars_LNP_basic,
  data = test_data,
  backend = "cmdstanr",
  threads = threading(10),
  iter = 200,
  warmup = 100,
  chains = 2,
  refresh = 1,
  control = list(max_treedepth = 50)
)
Broadly speaking, we are simplifying a distribution-fitting exercise by starting with matching the empirical CDF (ECDF) to the theoretical CDF of a lognormal-Pareto (LNP) mixture distribution. This is designed to assist diagnosis of LNP fits on a likelihood basis, rather than ECDF fitting being an end in itself. There are time series in which each cohort of samples evolves through time, and the cohorts emanate from different inception dates, so it is something like a collection of time series. The application is actuarial, focused on long-tail claims analysis.
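For concreteness, here is a plain-R version of the same mixture CDF, which I use to sanity-check LNP_basic_lcdf against the ECDF (pLNP is just an illustrative name of mine, not part of the model code):
pLNP <- function(y, alpha, sigma, theta) {
  # mixture weight r of the (truncated) lognormal component, as in LNP_basic_lcdf
  tmp <- sqrt(2 * pi) * alpha * sigma * pnorm(alpha * sigma) * exp(0.5 * (alpha * sigma)^2)
  r <- tmp / (tmp + 1)
  mu <- log(theta) - alpha * sigma^2
  # lognormal body, truncated above at theta and renormalised by pnorm(alpha * sigma)
  ln_part <- r * plnorm(pmin(y, theta), mu, sigma) / pnorm(alpha * sigma)
  # Pareto tail takes over above theta
  pa_part <- ifelse(y < theta, 0, (1 - r) * VGAM::ppareto(y, scale = theta, shape = alpha))
  ln_part + pa_part
}
At y = theta the lognormal part equals r and the Pareto part is zero, so the two pieces join continuously, matching the if/else branches of the Stan function.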
I am looking for ways to use OpenCL to improve the performance of the code that brms constructs. The code below was produced by calling make_stancode() on the brms model above:
// generated with brms 2.14.6
functions {
  /* compute correlated group-level effects
   * Args:
   *   z: matrix of unscaled group-level effects
   *   SD: vector of standard deviation parameters
   *   L: cholesky factor correlation matrix
   * Returns:
   *   matrix of scaled group-level effects
   */
  matrix scale_r_cor(matrix z, vector SD, matrix L) {
    // r is stored in another dimension order than z
    return transpose(diag_pre_multiply(SD, L) * z);
  }
  /* integer sequence of values
   * Args:
   *   start: starting integer
   *   end: ending integer
   * Returns:
   *   an integer sequence from start to end
   */
  int[] sequence(int start, int end) {
    int seq[end - start + 1];
    for (n in 1:num_elements(seq)) {
      seq[n] = n + start - 1;
    }
    return seq;
  }
  real LNP_basic_lcdf(real y, real tempalpha, real tempsigma, real temptheta) {
    real mu;
    real ln_r;
    real lntmp;
    real ln_one_m_r;
    real alpha;
    real sigma;
    real theta;
    // softplus transform parameters with a little bit added near zero
    //alpha = (tempalpha > 15) ? tempalpha : 0.1*log(1+exp(10*tempalpha))+1e-10*exp(-1*(exp(tempalpha)-1));
    //sigma = (tempsigma > 15) ? tempsigma : 0.1*log(1+exp(10*tempsigma))+1e-10*exp(-1*(exp(tempsigma)-1));
    //theta = (temptheta > 15) ? temptheta : 10.0*log(1+exp(0.1*temptheta))+1e-10*exp(-1*(exp(temptheta)-1));
    // exponentiate to test gradient problems
    alpha = exp(tempalpha);
    sigma = exp(tempsigma);
    theta = exp(temptheta) * 1e4;
    lntmp = 0.5 * log(2 * pi()) + log(alpha) + log(sigma)
            + log(Phi_approx(alpha * sigma)) + 0.5 * pow(alpha * sigma, 2);
    ln_one_m_r = -log_sum_exp(0, lntmp);
    ln_r = lntmp + ln_one_m_r;
    mu = log(theta) - alpha * sigma * sigma;
    if (y < theta) {
      return ln_r - log(Phi_approx(alpha * sigma)) + lognormal_lcdf(y | mu, sigma);
    } else {
      return log_sum_exp(ln_one_m_r + pareto_lcdf(y | theta, alpha),
                         ln_r - log(Phi_approx(alpha * sigma)) + lognormal_lcdf(theta | mu, sigma));
    }
  }
  real LNP_ecdf_bands_lpdf(real y, real alpha, real lnpsigma, real theta, real sigma, int n, real bandmean) {
    return n * normal_lpdf(y | exp(LNP_basic_lcdf(bandmean | alpha, lnpsigma, theta)), exp(sigma));
  }
  // compute partial sums of the log-likelihood
  real partial_log_lik_lpmf(int[] seq, int start, int end, vector Y, real[] vreal1,
                            int[] vint1, real Intercept, real Intercept_sigmaLNP,
                            real Intercept_thetaLNP, real Intercept_sigma, int[] J_1,
                            vector Z_1_1, vector Z_1_sigmaLNP_2, vector Z_1_thetaLNP_3,
                            vector r_1_1, vector r_1_sigmaLNP_2, vector r_1_thetaLNP_3) {
    real ptarget = 0;
    int N = end - start + 1;
    // initialize linear predictor term
    vector[N] mu = Intercept + rep_vector(0.0, N);
    // initialize linear predictor term
    vector[N] sigmaLNP = Intercept_sigmaLNP + rep_vector(0.0, N);
    // initialize linear predictor term
    vector[N] thetaLNP = Intercept_thetaLNP + rep_vector(0.0, N);
    // initialize linear predictor term
    vector[N] sigma = Intercept_sigma + rep_vector(0.0, N);
    for (n in 1:N) {
      // add more terms to the linear predictor
      int nn = n + start - 1;
      mu[n] += r_1_1[J_1[nn]] * Z_1_1[nn];
    }
    for (n in 1:N) {
      // add more terms to the linear predictor
      int nn = n + start - 1;
      sigmaLNP[n] += r_1_sigmaLNP_2[J_1[nn]] * Z_1_sigmaLNP_2[nn];
    }
    for (n in 1:N) {
      // add more terms to the linear predictor
      int nn = n + start - 1;
      thetaLNP[n] += r_1_thetaLNP_3[J_1[nn]] * Z_1_thetaLNP_3[nn];
    }
    for (n in 1:N) {
      int nn = n + start - 1;
      // data arrays (Y, vint1, vreal1) are full length, so index with nn rather than n
      ptarget += LNP_ecdf_bands_lpdf(Y[nn] | mu[n], sigmaLNP[n], thetaLNP[n], sigma[n], vint1[nn], vreal1[nn]);
    }
    return ptarget;
  }
}
data {
  int<lower=1> N;  // total number of observations
  vector[N] Y;  // response variable
  // data for custom real vectors
  real vreal1[N];
  // data for custom integer vectors
  int vint1[N];
  int grainsize;  // grainsize for threading
  // data for group-level effects of ID 1
  int<lower=1> N_1;  // number of grouping levels
  int<lower=1> M_1;  // number of coefficients per level
  int<lower=1> J_1[N];  // grouping indicator per observation
  // group-level predictor values
  vector[N] Z_1_1;
  vector[N] Z_1_sigmaLNP_2;
  vector[N] Z_1_thetaLNP_3;
  int<lower=1> NC_1;  // number of group-level correlations
  int prior_only;  // should the likelihood be ignored?
}
transformed data {
  int seq[N] = sequence(1, N);
}
parameters {
  real Intercept;  // temporary intercept for centered predictors
  real Intercept_sigmaLNP;  // temporary intercept for centered predictors
  real Intercept_thetaLNP;  // temporary intercept for centered predictors
  real Intercept_sigma;  // temporary intercept for centered predictors
  vector<lower=0>[M_1] sd_1;  // group-level standard deviations
  matrix[M_1, N_1] z_1;  // standardized group-level effects
  cholesky_factor_corr[M_1] L_1;  // cholesky factor of correlation matrix
}
transformed parameters {
  matrix[N_1, M_1] r_1;  // actual group-level effects
  // using vectors speeds up indexing in loops
  vector[N_1] r_1_1;
  vector[N_1] r_1_sigmaLNP_2;
  vector[N_1] r_1_thetaLNP_3;
  // compute actual group-level effects
  r_1 = scale_r_cor(z_1, sd_1, L_1);
  r_1_1 = r_1[, 1];
  r_1_sigmaLNP_2 = r_1[, 2];
  r_1_thetaLNP_3 = r_1[, 3];
}
model {
  // likelihood including constants
  if (!prior_only) {
    target += reduce_sum(partial_log_lik_lpmf, seq, grainsize, Y, vreal1, vint1,
                         Intercept, Intercept_sigmaLNP, Intercept_thetaLNP, Intercept_sigma,
                         J_1, Z_1_1, Z_1_sigmaLNP_2, Z_1_thetaLNP_3,
                         r_1_1, r_1_sigmaLNP_2, r_1_thetaLNP_3);
  }
  // priors including constants
  target += student_t_lpdf(Intercept | 3, 1, 2.5);
  target += student_t_lpdf(Intercept_sigma | 3, 0, 2.5);
  target += student_t_lpdf(sd_1 | 3, 0, 2.5)
            - 3 * student_t_lccdf(0 | 3, 0, 2.5);
  target += std_normal_lpdf(to_vector(z_1));
  target += lkj_corr_cholesky_lpdf(L_1 | 1);
}
generated quantities {
  // actual population-level intercept
  real b_Intercept = Intercept;
  // actual population-level intercept
  real b_sigmaLNP_Intercept = Intercept_sigmaLNP;
  // actual population-level intercept
  real b_thetaLNP_Intercept = Intercept_thetaLNP;
  // actual population-level intercept
  real b_sigma_Intercept = Intercept_sigma;
  // compute group-level correlations
  corr_matrix[M_1] Cor_1 = multiply_lower_tri_self_transpose(L_1);
  vector<lower=-1,upper=1>[NC_1] cor_1;
  // extract upper diagonal of correlation matrix
  for (k in 1:M_1) {
    for (j in 1:(k - 1)) {
      cor_1[choose(k - 1, 2) + j] = Cor_1[j, k];
    }
  }
}
After reading through the recent stan-math library and several posts, I am getting the idea that I should do away with the reduce_sum and modify partial_log_lik_lpmf to convert the parameter vectors to matrix_cl, while converting the static matrices to matrix_cl in the transformed data section. To do this, I gather that I would need to call several functions from stan-math inside the Stan code using #include statements.
I would really like to see a piece of example code where to_matrix_cl and some kernel are called from within the Stan code using a #include statement (that seems to be the mechanism). There don't seem to be any simple examples to get me going, as everyone involved seems to be writing their code directly in C++.
This would, I hope, allow me to modify the brms-generated code to run on my GPU a little more efficiently. At this stage my code takes days to run on my 10-core Xeon system while my Nvidia Quadro RTX 6000 card sits idle.
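For what it's worth, the only GPU route I have found documented so far is the compiler-level one: build the model with the stan_opencl flag via cmdstanr so that the supported built-in functions are offloaded automatically, with no hand-written C++. A minimal sketch of what I mean, assuming CmdStan was installed with OpenCL support ("lnp_model.stan" and standata are placeholders for the brms-generated program and its data):
library(cmdstanr)
# compile the brms-generated Stan program with the OpenCL backend enabled
mod <- cmdstan_model("lnp_model.stan", cpp_options = list(stan_opencl = TRUE))
fit <- mod$sample(
  data = standata,       # e.g. the list returned by brms::make_standata()
  chains = 2,
  opencl_ids = c(0, 0)   # OpenCL platform id and device id
)
As far as I can tell, though, this only accelerates the built-in GPU-enabled distributions and not custom functions like LNP_basic_lcdf, which is why I am asking about matrix_cl and the lower-level kernels.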
Perhaps someone can provide a pointer or two.
Kind regards
Cynon Sonkkila