CommaInitializer.h error running model on GPU

I’m moving this over from a previous question (linked here) because I think it has moved beyond the original problem. I have a hierarchical logit/categorical model with a large dataset, which I’ve set up to run on a GPU through my institution’s HPC.

I get a C++ error:
stan/lib/stan_math/lib/eigen_3.3.9/Eigen/src/Core/CommaInitializer.h:118: XprType& Eigen::CommaInitializer<MatrixType>::finished() [with XprType = Eigen::Matrix<double, 1, -1>]: Assertion ((m_row+m_currentBlockRows) == m_xpr.rows() || m_xpr.cols() == 0) && m_col == m_xpr.cols() && "Too few coefficients passed to comma initializer (operator<<)"' failed.

It seems related to dataset size, because the model runs fine if I include 100 records. As I increase the dataset size (the full dataset is ~187,000 rows), it starts giving the error above. The HPC isn’t giving me any error that I can see. I’m requesting 4 cores, a Tesla V100 (with OPA and 16 GB of memory), and 16 GB of CPU RAM. It’s my first time working with a GPU, so I may be missing something in the setup.

Checking the GPU status while my code runs with 100 rows, it only uses 7% of the GPU, so I’m not sure why it won’t run with more data.

R script:

model.1 = cmdstan_model("mtc_equity_2.stan", cpp_options = list(stan_opencl = TRUE), stanc_options = list("O1"))

samp_ct = 100L
mod = mod[1:samp_ct,]

data_list = list(N = samp_ct,
            ncat = 4L,
            nvar = 2L,
            ntotvar = 5L,
            Y = as.integer(mod$trptrans),
            x1 = mod%>%select(travelTimeDrive,travelCostDrive),
            x2 = cbind(mod$travelTimeWalk,rep(0,samp_ct)),
            x3 = cbind(mod$travelTimeBike,rep(0,samp_ct)),
            x4 = cbind(mod$travelTimeTransit,rep(-1,samp_ct)),
            N_1 = 62L,
            M_1 = 6L,
            J_1 = as.integer(mod$personid),
            Z_1 = cbind(rep(1,samp_ct),mod%>%select(race_2,race_3,race_4,race_5,hhfaminc)), 
            NC_1 = 15L,
            N_2 = 5L,
            M_2 = 1L,
            J_2 = as.integer(mod$trippurp),
            prior_only = 0L)

model_fit = model.1$sample(data = data_list,
                           seed = 24567,
                           iter_warmup = 500,
                           iter_sampling = 500,
                           chains = 4,
                           parallel_chains = 4)

Stan code:

functions {
 /* compute correlated group-level effects
  * Args: 
  *   z: matrix of unscaled group-level effects
  *   SD: vector of standard deviation parameters
  *   L: cholesky factor correlation matrix
  * Returns: 
  *   matrix of scaled group-level effects
  */ 
  matrix scale_r_cor(matrix z, vector SD, matrix L) {
    // r is stored in another dimension order than z
    return transpose(diag_pre_multiply(SD, L) * z);
  }
}
data {
  int<lower=1> N;  // total number of observations
  int<lower=2> ncat;  // number of categories
  int<lower=2> nvar;  // number of variables for each alternative
  int<lower=2> ntotvar;  // number of total variables
  array[N] int Y;  // response variable
  // covariate matrix for alt1
  matrix[N,nvar] x1;
  // covariate matrix for alt2
  matrix[N,nvar] x2;
  // covariate matrix for alt3
  matrix[N,nvar] x3;
  // covariate matrix for alt4
  matrix[N,nvar] x4;
  // data for group-level effects of ID 1
  int<lower=1> N_1;  // number of grouping levels
  int<lower=1> M_1;  // number of coefficients per level
  int<lower=1> J_1[N];  // grouping indicator per observation
  // group-level predictor values
  matrix[N, M_1] Z_1;
  int<lower=1> NC_1;  // number of group-level correlations
  // data for group-level effects of ID 2
  int<lower=1> N_2;  // number of grouping levels
  int<lower=1> M_2;  // number of coefficients per level
  int<lower=1> J_2[N];  // grouping indicator per observation
  int prior_only;  // should the likelihood be ignored?
}
transformed data {
  int<lower=1> M_3 = (M_1+M_2)*(ncat*2);  // total number of sd
}
parameters {
  real b_btc;  // population-level effects
  real b_a2;  // population-level effects
  real b_a3;  // population-level effects
  real b_a4;  // population-level effects
  real b_b1tt;  // population-level effects
  real b_b2tt;  // population-level effects
  real b_b3tt;  // population-level effects
  real b_b4tt;  // population-level effects
  vector<lower=0>[M_3] sd;  // group-level standard deviations
  matrix[M_1, N_1] z_1;  // standardized group-level effects
  cholesky_factor_corr[M_1] L_1;  // cholesky factor of correlation matrix
  vector[N_2] z_2[M_2];  // standardized group-level effects
  matrix[M_1, N_1] z_3;  // standardized group-level effects
  cholesky_factor_corr[M_1] L_3;  // cholesky factor of correlation matrix
  vector[N_2] z_4[M_2];  // standardized group-level effects
  matrix[M_1, N_1] z_5;  // standardized group-level effects
  cholesky_factor_corr[M_1] L_5;  // cholesky factor of correlation matrix
  vector[N_2] z_6[M_2];  // standardized group-level effects
  matrix[M_1, N_1] z_7;  // standardized group-level effects
  cholesky_factor_corr[M_1] L_7;  // cholesky factor of correlation matrix
  vector[N_2] z_8[M_2];  // standardized group-level effects
  matrix[M_1, N_1] z_9;  // standardized group-level effects
  cholesky_factor_corr[M_1] L_9;  // cholesky factor of correlation matrix
  vector[N_2] z_10[M_2];  // standardized group-level effects
  matrix[M_1, N_1] z_11;  // standardized group-level effects
  cholesky_factor_corr[M_1] L_11;  // cholesky factor of correlation matrix
  vector[N_2] z_12[M_2];  // standardized group-level effects
  matrix[M_1, N_1] z_13;  // standardized group-level effects
  cholesky_factor_corr[M_1] L_13;  // cholesky factor of correlation matrix
  vector[N_2] z_14[M_2];  // standardized group-level effects
  matrix[M_1, N_1] z_15;  // standardized group-level effects
  cholesky_factor_corr[M_1] L_15;  // cholesky factor of correlation matrix
  vector[N_2] z_16[M_2];  // standardized group-level effects
}
transformed parameters {
  matrix[N_1, M_1] r_1;  // actual group-level effects
  vector[N_2] r_2;  // actual group-level effects
  matrix[N_1, M_1] r_3;  // actual group-level effects
  vector[N_2] r_4;  // actual group-level effects
  matrix[N_1, M_1] r_5;  // actual group-level effects
  vector[N_2] r_6;  // actual group-level effects
  matrix[N_1, M_1] r_7;  // actual group-level effects
  vector[N_2] r_8;  // actual group-level effects
  matrix[N_1, M_1] r_9;  // actual group-level effects
  vector[N_2] r_10;  // actual group-level effects
  matrix[N_1, M_1] r_11;  // actual group-level effects
  vector[N_2] r_12;  // actual group-level effects
  matrix[N_1, M_1] r_13;  // actual group-level effects
  vector[N_2] r_14;  // actual group-level effects
  matrix[N_1, M_1] r_15;  // actual group-level effects
  vector[N_2] r_16;  // actual group-level effects
  
  // compute actual group-level effects
  r_1 = scale_r_cor(z_1, sd[1:M_1], L_1);
  r_2 = (sd[M_1+1] * (z_2[1]));
  // compute actual group-level effects
  r_3 = scale_r_cor(z_3, sd[M_1+2:2*M_1+1], L_3);
  r_4 = (sd[2*M_1+2] * (z_4[1]));
  // compute actual group-level effects
  r_5 = scale_r_cor(z_5, sd[2*M_1+3:3*M_1+2], L_5);
  r_6 = (sd[3*M_1+3] * (z_6[1]));
  // compute actual group-level effects
  r_7 = scale_r_cor(z_7, sd[3*M_1+4:4*M_1+3], L_7);
  r_8 = (sd[4*M_1+4] * (z_8[1]));
  // compute actual group-level effects
  r_9 = scale_r_cor(z_9, sd[4*M_1+5:5*M_1+4], L_9);
  r_10 = (sd[5*M_1+5] * (z_10[1]));
  // compute actual group-level effects
  r_11 = scale_r_cor(z_11, sd[5*M_1+6:6*M_1+5], L_11);
  r_12 = (sd[5*M_1+6] * (z_12[1]));
  // compute actual group-level effects
  r_13 = scale_r_cor(z_13, sd[6*M_1+7:7*M_1+6], L_13);
  r_14 = (sd[6*M_1+7] * (z_14[1]));
  // compute actual group-level effects
  r_15 = scale_r_cor(z_15, sd[7*M_1+8:8*M_1+7], L_15);
  r_16 = (sd[8*M_1+8] * (z_16[1]));
}
model {
  // likelihood including constants
  if (!prior_only) {
    // Define matrices/vector for x, alpha, beta
    matrix[ncat,nvar] x;
    matrix[ncat,nvar] beta;
    vector[ncat] alpha;
    matrix[N,ncat] a = rep_matrix(0,N,ncat);
    a[,2] = rep_vector(b_a2,N);
    a[,3] = rep_vector(b_a3,N);
    a[,4] = rep_vector(b_a4,N);
      
    // initialize linear predictor term
    // Terms are btc, b1tt, b2tt, b3tt, b4tt
    matrix[N,ntotvar] b = rep_matrix(b_btc,N,ntotvar);
    b[,2] = rep_vector(b_b1tt,N);
    b[,3] = rep_vector(b_b2tt,N);
    b[,4] = rep_vector(b_b3tt,N);
    b[,5] = rep_vector(b_b4tt,N);
    
    for (n in 1:N) {
      // add to linear predictor term
      a[n] += [0,r_3[J_1[n]]*Z_1[n]' + r_4[J_2[n]], r_5[J_1[n]]*Z_1[n]' + r_6[J_2[n]], r_7[J_1[n]]*Z_1[n]' + r_8[J_2[n]]];
      // add to linear predictor term
      // Terms are btc, b1tt, b2tt, b3tt, b4tt
      b[n] += [r_1[J_1[n]]*Z_1[n]' + r_2[J_2[n]], r_9[J_1[n]]*Z_1[n]' + r_10[J_2[n]], r_11[J_1[n]]*Z_1[n]' + r_12[J_2[n]],r_13[J_1[n]]*Z_1[n]' + r_14[J_2[n]],r_15[J_1[n]]*Z_1[n]' + r_16[J_2[n]]];
      b[n] = exp(b[n]);
      // Each x and beta is a matrix with dimensions alts x variables
      // Our x will be the time/cost coming in as inputs
      x[1,] = x1[n];
      x[2,] = x2[n];
      x[3,] = x3[n];
      x[4,] = x4[n];
      // Our betas will be the hierarchical slope parameters
      beta[1,] = [b[n,1] * b[n,2], b[n,1]];
      beta[2,] = [b[n,1] * b[n,3], b[n,1]];
      beta[3,] = [b[n,1] * b[n,4], b[n,1]];
      beta[4,] = [b[n,1] * b[n,5], b[n,1]];
      // Our alphas will be the hierarchical intercept parameters
      alpha = [a[n,1], b[n,1] * a[n,2], b[n,1] * a[n,3], b[n,1] * a[n,4]]';
      target += categorical_logit_glm_lupmf(Y[n] | x, alpha, beta');
    }
  }
  // priors including constants
  target += normal_lpdf(b_btc | 0, 2.5);
  target += normal_lpdf(b_a2 | 0, 2.5);
  target += normal_lpdf(b_a3 | 0, 2.5);
  target += normal_lpdf(b_a4 | 0, 2.5);
  target += normal_lpdf(b_b1tt | 0, 2.5);
  target += normal_lpdf(b_b2tt | 0, 2.5);
  target += normal_lpdf(b_b3tt | 0, 2.5);
  target += normal_lpdf(b_b4tt | 0, 2.5);
  target += student_t_lpdf(sd | 3, 0, 2.5);
  target += std_normal_lpdf(to_vector(z_1));
  target += lkj_corr_cholesky_lpdf(L_1 | 1);
  target += std_normal_lpdf(z_2[1]);
  target += std_normal_lpdf(to_vector(z_3));
  target += lkj_corr_cholesky_lpdf(L_3 | 1);
  target += std_normal_lpdf(z_4[1]);
  target += std_normal_lpdf(to_vector(z_5));
  target += lkj_corr_cholesky_lpdf(L_5 | 1);
  target += std_normal_lpdf(z_6[1]);
  target += std_normal_lpdf(to_vector(z_7));
  target += lkj_corr_cholesky_lpdf(L_7 | 1);
  target += std_normal_lpdf(z_8[1]);
  target += std_normal_lpdf(to_vector(z_9));
  target += lkj_corr_cholesky_lpdf(L_9 | 1);
  target += std_normal_lpdf(z_10[1]);
  target += std_normal_lpdf(to_vector(z_11));
  target += lkj_corr_cholesky_lpdf(L_11 | 1);
  target += std_normal_lpdf(z_12[1]);
  target += std_normal_lpdf(to_vector(z_13));
  target += lkj_corr_cholesky_lpdf(L_13 | 1);
  target += std_normal_lpdf(z_14[1]);
  target += std_normal_lpdf(to_vector(z_15));
  target += lkj_corr_cholesky_lpdf(L_15 | 1);
  target += std_normal_lpdf(z_16[1]);
}

Submit script:

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4
#SBATCH --partition=gpu
#SBATCH --mem=16G
#SBATCH --gres=gpu
#SBATCH --constraint=gpu_v100
#SBATCH --time=1:00:00
#SBATCH --job-name=R_mtc_base
#SBATCH --error=R_mtc_gpu.%J.err
#SBATCH --output=R_mtc_gpu.%J.out

module load crane
module load R/4.1

module load cuda/11.4

Rscript mtc_equity_2.R

Hi,

which CmdStan version are you running?

I’m using CmdStan 2.29.2

I played with the GPU test model a bit more and it runs with a larger dataset size.

Based on my limited understanding of the C++ error message, it looks like an error somewhere in a matrix operation. I’m not sure why the test model runs at various data sizes and my code runs with 100 records, yet won’t run with more (I’ve tried dividing the dataset by 10, 100, 250, 500, 750, and 1000). With 101 records it runs while using only 1% of the GPU.

One consideration might be that I get Metropolis proposal rejection warnings when the model does run. I’ve been ignoring these because it finds a good parameter space within 100 iterations, and it doesn’t surprise me that it gets some invalid results at the initial values because of the exp() terms, which can be sensitive (I’ve sketched a possible mitigation below the warning message).

Chain 1 Informational Message: The current Metropolis proposal is about to be rejected because of the following issue:
Chain 1 Exception: categorical_logit_glm_lpmf: Intercept[1, 0] = -nan, but it must be finite! (in '/tmp/RtmpVa6vc8/model-3fd0635ebba0a5.stan', line 163, column 6 to column 68)
Chain 1 If this warning occurs sporadically, such as for highly constrained variable types like covariance matrices, then the sampler is fine,
Chain 1 but if this warning occurs often then your model may be either severely ill-conditioned or misspecified.
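
If these rejections ever turn out to matter beyond the first few iterations, one thing I could try (untested on my part; just the standard cmdstanr knob as I understand it) is narrowing the range the sampler draws initial values from:

# Hypothetical tweak: draw initial values uniformly from (-0.5, 0.5) rather than
# the default (-2, 2), to keep the exp() terms from blowing up at the start.
model_fit = model.1$sample(data = data_list,
                           seed = 24567,
                           init = 0.5,
                           iter_warmup = 500,
                           iter_sampling = 500,
                           chains = 4,
                           parallel_chains = 4)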

I read a bit of the Eigen docs. It sounds to me like the comma initializer is used to build matrices, and I may be causing problems with the way I’m using rep_vector() and [v1, v2, v3, …] to construct matrices. I realize it’s probably not the best option here. Is there a better way to write this in Stan that plays nicer with Eigen?

I tried changing the way I define some of my matrices/arrays, but no luck. Below are examples of how I updated the Stan code (the original is given in comments). Any suggestions? I’m at a loss on how to solve this problem; I don’t know why it stops working once I add more than about 100 records.

  // covariate matrix for alt1
  array[N] row_vector[nvar] x1; // matrix[N,nvar] x1;
  // covariate matrix for alt2 
  array[N] row_vector[nvar] x2; // matrix[N,nvar] x2;
  // covariate matrix for alt3 
  array[N] row_vector[nvar] x3; // matrix[N,nvar] x3;
  // covariate matrix for alt4
  array[N] row_vector[nvar] x4; // matrix[N,nvar] x4;

array[N] row_vector[ntotvar] b = rep_array([b_btc,b_b1tt,b_b2tt,b_b3tt,b_b4tt], N);     // matrix[N,ntotvar] b = rep_matrix(b_btc,N,ntotvar);

x = to_matrix([x1[n],x2[n],x3[n],x4[n]]);
      // x[1,] = x1[n];
      // x[2,] = x2[n];
      // x[3,] = x3[n];
      // x[4,] = x4[n];

Another update on testing (no change yet):
I’ve run my model using categorical_logit_lpmf(Y[n] | mu[n]) rather than categorical_logit_glm_lupmf(Y[n] | x, alpha, beta') and also tried running it on the CPU rather than GPU. I thought one of these might change the way it was handling the data, which seems to be the problem (as previously found, it works with a small dataset and the GPU test model runs fine with a much larger dataset). Neither option changed the output.

I took a look at a recent post by @Bob_Carpenter (here) for a similar model, where he recommended making sure you don’t include a transpose operation in the likelihood expression (for efficiency, not to solve my particular problem). No luck with that change either. @andrjohns or @rok_cesnovar, do you have any suggestions for what I might try? I’m stumped at this point. I’ve attached the current R/Stan files and a snippet of my dataset.

mtc_equity_3.stan (8.7 KB)
mtc_equity.R (1.8 KB)
time_cost_calculated_2000.CSV (2.9 MB)

Sorry for the delay in getting to this! I can reproduce this error on Ubuntu with 2.29.2 and OpenCL; I’m looking into it now.

Ah, this was a tricky one! For some reason an out-of-bounds indexing error was failing silently.

In your model you have the matrix r_3:

matrix[N_1, M_1] r_3

whose target row is indexed using J_1:

      a[n] += ...r_3[J_1[n]]...

However, the indexes in J_1 exceed the number of rows in r_3 (i.e., N_1):

> data_list$N_1
[1] 62
> max(data_list$J_1)
[1] 385

So the Stan model was attempting to retrieve rows that did not exist. Normally, this would result in an out-of-bounds error like:

Exception: matrix[uni] indexing: accessing element out of range. index 63 out of range; expecting index to be between 1 and 62

But it’s not getting displayed in this case (FYI @rok_cesnovar @stevebronder)

I suspect there are similar issues with the other r_* parameters as well.
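
A quick sanity check you can run on the data list (a rough sketch; it only covers the two grouping indicators, so adjust it if other indices feed the r_* terms) is to confirm that every grouping index stays within its declared number of levels:

# Rough sketch: each grouping indicator must index levels 1..N_1 (or 1..N_2).
stopifnot(all(data_list$J_1 >= 1), max(data_list$J_1) <= data_list$N_1)
stopifnot(all(data_list$J_2 >= 1), max(data_list$J_2) <= data_list$N_2)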


Thanks Andrew for the debugging!

Yeah, I think the assign and rvalue functions for OpenCL might not have these range checks. These were added only recently; up to a version or two ago, all of these errors also printed the less helpful Eigen errors.

Ah, I should have clarified: the issue is present with base Stan, no OpenCL needed. Here’s a minimal example:

library(cmdstanr)

modcode <- "
data {
  real y_mean;
}

transformed data {
  row_vector[2] in_vec = [1, 2];
  row_vector[2] out_vec = [1, in_vec[5]];
}

parameters {
  real y;
}

model {
  y ~ normal(y_mean, 1);
}
"

mod <- cmdstan_model(write_stan_file(modcode))

mod$sample(
  data = list(y_mean = 0)
)

Thanks for the help! This also explains why it works with a subset of the data from the first 100 rows.
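
For reference, recoding the grouping variables to consecutive integers and deriving the level counts from the data looks roughly like this (a sketch of the idea, not necessarily exactly what I ended up running):

# Sketch: map person and trip-purpose IDs onto consecutive integers so the
# grouping indices always fall in 1..N_1 and 1..N_2, then derive those counts.
J_1 = as.integer(factor(mod$personid))
J_2 = as.integer(factor(mod$trippurp))
data_list$J_1 = J_1
data_list$J_2 = J_2
data_list$N_1 = max(J_1)  # number of distinct persons in the subset
data_list$N_2 = max(J_2)  # number of distinct trip purposes in the subset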

The model now runs, but very slowly. I tried using VI with meanfield. After 48 hours on the GPU, it finishes the eta adaptation. The gradient evaluation message is:

Gradient evaluation took 541.178 seconds 
1000 transitions using 10 leapfrog steps per transition would take 5.41178e+06 seconds.

Running full inference prints one iteration for one chain. I ran some tests with sample code and found it runs much faster when the GPU is fully utilized. categorical_logit_glm_lupmf needs to run inside a loop here, so even though the dataset is large, the GPU is only being fed small chunks at a time. I’m running it on a 5% sample to see if I can get it to run fully, and I’ll adjust parameters and priors from there.
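
For the 5% test run, the subsetting itself is straightforward (a sketch; the seed and row-wise sampling are arbitrary choices, and the grouping indices get recoded afterwards as above):

# Sketch: take a ~5% random subset of rows before building data_list,
# then recompute samp_ct so downstream dimensions stay consistent.
set.seed(1)  # arbitrary seed for the test run
mod = mod[sample(nrow(mod), size = ceiling(0.05 * nrow(mod))), ]
samp_ct = nrow(mod)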