CmdStan OpenCL GPU problems and wiki page

I was using cmdstanr with num_cores=2. If I change that to num_cores=1 it works. If I had more memory, would there be a problem running several chains at the same time with GPU support?

It’s still on the border of running out of memory, though: if I have emacs open, running with num_cores=1 fails, too.

I don’t know how num_cores works in cmdstanr. If it uses a single process with multiple threads, I think the kernels should be compiled only once, so kernel compilation should not significantly increase RAM demand. If multiple processes are launched, each will have to compile the kernels and you will have problems.

It calls CmdStan. I don’t think CmdStan has multithreaded chains yet.

So that is probably multiple processes. Since they are launched at the same time, each will try to compile the kernels and you will probably run out of RAM.
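
To make that concrete, num_cores=2 amounts to roughly the following (a sketch with illustrative file names, not the actual cmdstanr internals): two CmdStan processes launched at once, each JIT-compiling the OpenCL kernels on startup.

./bern_glm sample data file=bern.data.R output file=out1.csv id=1 &
./bern_glm sample data file=bern.data.R output file=out2.csv id=2 &
# both processes compile the OpenCL kernels at the same time,
# so peak host RAM during kernel compilation roughly doubles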

If I don’t run out of RAM, what is the performance hit of running multiple chains with one GPU card?

Depending on the model, it should either be negligible or even a bit faster due to async OpenCL calls (in theory, GPU computations, CPU computations, and memory transfers between CPU and GPU can run at the same time without affecting each other). In a single thread, however, one operation must often wait for the previous one to complete.

Our prior on multi-process GPU use is what Tadej says: it should actually be more efficient in most cases, as you are more likely to keep the GPU occupied at 100%. This has not been tested extensively, however, as I find running multiple processes in the shell an unpleasant experience. This is one of the reasons I started working on cmdstanr and am also actively pushing for the next patch release to get through (hopefully I am not too annoying :) ).

Thanks for the help @rok_cesnovar and @tadej !

It would be good to add this to that wiki page, e.g.: OpenCL kernels are compiled just-in-time at the start of any OpenCL-enabled Stan/Stan Math program, which may therefore require more memory than when running without GPU support. If several CmdStan processes are started at the same time, each process needs that memory for a moment. If there is not enough memory to compile the OpenCL kernels, you should see a “CL_OUT_OF_HOST_MEMORY” error.
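
For reference, the OpenCL setup being discussed lives in CmdStan’s make/local; a minimal sketch (the platform and device ids are machine-specific, e.g. as listed by clinfo):

# make/local
STAN_OPENCL=true
OPENCL_PLATFORM_ID=0
OPENCL_DEVICE_ID=0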

Done. Thanks!

It seems that make clean-all is not cleaning everything. After removing the OpenCL lines from make/local and running make clean-all; make build, I get hundreds of lines like

stansummary.cpp:(.text._ZN4stan4math14opencl_kernels14kernel_functorIJRKN2cl6BufferES6_S6_RKiS8_S8_RKNS0_14matrix_cl_viewESB_EEC2EPKcRKSt6vectorINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEESaISL_EERKSt3mapISL_iSt4lessISL_ESaISt4pairIKSL_iEEE[_ZN4stan4math14opencl_kernels14kernel_functorIJRKN2cl6BufferES6_S6_RKiS8_S8_RKNS0_14matrix_cl_viewESB_EEC5EPKcRKSt6vectorINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEESaISL_EERKSt3mapISL_iSt4lessISL_ESaISt4pairIKSL_iEEE]+0x156): undefined reference to `clReleaseKernel'

Does removing src/cmdstan/main.o help?

It’s removed as part of make clean-all, but something else seems not to be removed, and the error message is not useful.

Unlikely, but are you sure you cleaned up all the make/local files (one of the reasons I never use any make/local other than the CmdStan one)? You mentioned previously that you added lines to the math library’s make/local. I tried to replicate this on my Ubuntu machine and couldn’t, unfortunately.

Yes, there are no local files left. I had run the OpenCL tests in math, and I ran make clean-all there as well. I guess I’ll just clone a fresh copy, but I thought I’d mention that there might be cleaning issues.
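
A blunt alternative to a fresh clone would be deleting every object file and rebuilding (a sketch; this assumes the culprit is stale objects compiled with STAN_OPENCL, which the undefined clReleaseKernel references suggest, since such objects still expect -lOpenCL at link time):

# remove all object files, then rebuild from scratch
find . -name '*.o' -delete
make build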

Yes, thanks. That is definitely not desirable. Will try to get to the same state.

The wiki page lists:
" We currently have support for the following methods

  • bernoulli_logit_glm"

I tested two different models, one with n=54, p=1536 covariates and one with n=1e4, p=100, but I don’t see any speed difference with or without the GPU, and the GPU load doesn’t change from 0%. How can I check whether bernoulli_logit_glm is actually using the GPU? With num_cores=4 I do get CL_OUT_OF_HOST_MEMORY with one version, but not with the other, so I assume something is happening, just not enough. (I should probably stop trying to get this to work.)

CmdStan outputs a line stating that it uses an OpenCL device and which one. Can you post the hpp file?

Ah, cmdstanr doesn’t show that. Tested from the shell and I see

STAN_OPENCL is enabled. OpenCL supported functions will use:
Platform: NVIDIA CUDA
Device: GRID P40-2Q

Also with nvidia-smi I can see

| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      1104      G   /usr/lib/xorg/Xorg                           222MiB |
|    0      1145      G   /usr/bin/gnome-shell                         184MiB |
|    0     10195    C+G   ...ent/VMwareBlastServer/VMwareBlastServer   232MiB |
|    0     11560      G   /usr/bin/gnome-shell                         177MiB |
|    0     12254      G   /usr/lib/xorg/Xorg                           663MiB |
|    0     28238      C   ./bern_glm                                   157MiB |
+-----------------------------------------------------------------------------+

The hpp:

// Code generated by stanc 11a0fa51
#include <stan/model/model_header.hpp>
namespace bern_glm_model_namespace {

template <typename T, typename S>
std::vector<T> resize_to_match__(std::vector<T>& dst, const std::vector<S>& src) {
  dst.resize(src.size());
  return dst;
}

template <typename T>
Eigen::Matrix<T, -1, -1>
resize_to_match__(Eigen::Matrix<T, -1, -1>& dst, const Eigen::Matrix<T, -1, -1>& src) {
  dst.resize(src.rows(), src.cols());
  return dst;
}

template <typename T>
Eigen::Matrix<T, 1, -1>
resize_to_match__(Eigen::Matrix<T, 1, -1>& dst, const Eigen::Matrix<T, 1, -1>& src) {
  dst.resize(src.size());
  return dst;
}

template <typename T>
Eigen::Matrix<T, -1, 1>
resize_to_match__(Eigen::Matrix<T, -1, 1>& dst, const Eigen::Matrix<T, -1, 1>& src) {
  dst.resize(src.size());
  return dst;
}
std::vector<double> to_doubles__(std::initializer_list<double> x) {
  return x;
}

std::vector<stan::math::var> to_vars__(std::initializer_list<stan::math::var> x) {
  return x;
}

inline void validate_positive_index(const char* var_name, const char* expr,
                                    int val) {
  if (val < 1) {
    std::stringstream msg;
    msg << "Found dimension size less than one in simplex declaration"
        << "; variable=" << var_name << "; dimension size expression=" << expr
        << "; expression value=" << val;
    std::string msg_str(msg.str());
    throw std::invalid_argument(msg_str.c_str());
  }
}

inline void validate_unit_vector_index(const char* var_name, const char* expr,
                                       int val) {
  if (val <= 1) {
    std::stringstream msg;
    if (val == 1) {
      msg << "Found dimension size one in unit vector declaration."
          << " One-dimensional unit vector is discrete"
          << " but the target distribution must be continuous."
          << " variable=" << var_name << "; dimension size expression=" << expr;
    } else {
      msg << "Found dimension size less than one in unit vector declaration"
          << "; variable=" << var_name << "; dimension size expression=" << expr
          << "; expression value=" << val;
    }
    std::string msg_str(msg.str());
    throw std::invalid_argument(msg_str.c_str());
  }
}


using std::istream;
using std::string;
using std::stringstream;
using std::vector;
using stan::io::dump;
using stan::math::lgamma;
using stan::model::model_base_crtp;
using stan::model::rvalue;
using stan::model::cons_list;
using stan::model::index_uni;
using stan::model::index_max;
using stan::model::index_min;
using stan::model::index_min_max;
using stan::model::index_multi;
using stan::model::index_omni;
using stan::model::nil_index_list;
using namespace stan::math; 

static int current_statement__ = 0;
static const std::vector<string> locations_array__ = {" (found before start of program)",
                                                      " (in '/m/home/home7/77/ave/unix/proj/ovarian/bern_glm.stan', line 8, column 2 to column 14)",
                                                      " (in '/m/home/home7/77/ave/unix/proj/ovarian/bern_glm.stan', line 9, column 2 to column 17)",
                                                      " (in '/m/home/home7/77/ave/unix/proj/ovarian/bern_glm.stan', line 13, column 2 to column 49)",
                                                      " (in '/m/home/home7/77/ave/unix/proj/ovarian/bern_glm.stan', line 15, column 2 to column 43)",
                                                      " (in '/m/home/home7/77/ave/unix/proj/ovarian/bern_glm.stan', line 2, column 2 to column 17)",
                                                      " (in '/m/home/home7/77/ave/unix/proj/ovarian/bern_glm.stan', line 3, column 2 to column 11)",
                                                      " (in '/m/home/home7/77/ave/unix/proj/ovarian/bern_glm.stan', line 4, column 2 to column 17)",
                                                      " (in '/m/home/home7/77/ave/unix/proj/ovarian/bern_glm.stan', line 5, column 2 to column 17)"};


class bern_glm_model : public model_base_crtp<bern_glm_model> {

 private:
  int pos__;
  int N;
  std::vector<int> Y;
  int K;
  Eigen::Matrix<double, -1, -1> X;
 
 public:
  ~bern_glm_model() { }
  
  std::string model_name() const { return "bern_glm_model"; }
  
  bern_glm_model(stan::io::var_context& context__,
                 unsigned int random_seed__ = 0,
                 std::ostream* pstream__ = nullptr) : model_base_crtp(0) {
    typedef double local_scalar_t__;
    boost::ecuyer1988 base_rng__ = 
        stan::services::util::create_rng(random_seed__, 0);
    (void) base_rng__;  // suppress unused var warning
    static const char* function__ = "bern_glm_model_namespace::bern_glm_model";
    (void) function__;  // suppress unused var warning
    
    try {
      
      pos__ = 1;
      context__.validate_dims("data initialization","N","int",
          context__.to_vec());
      
      current_statement__ = 5;
      N = context__.vals_i("N")[(1 - 1)];
      current_statement__ = 6;
      validate_non_negative_index("Y", "N", N);
      context__.validate_dims("data initialization","Y","int",
          context__.to_vec(N));
      Y = std::vector<int>(N, 0);
      
      current_statement__ = 6;
      assign(Y, nil_index_list(), context__.vals_i("Y"),
        "assigning variable Y");
      context__.validate_dims("data initialization","K","int",
          context__.to_vec());
      
      current_statement__ = 7;
      K = context__.vals_i("K")[(1 - 1)];
      current_statement__ = 8;
      validate_non_negative_index("X", "N", N);
      current_statement__ = 8;
      validate_non_negative_index("X", "K", K);
      context__.validate_dims("data initialization","X","double",
          context__.to_vec(N, K));
      X = Eigen::Matrix<double, -1, -1>(N, K);
      
      {
        std::vector<local_scalar_t__> X_flat__;
        current_statement__ = 8;
        assign(X_flat__, nil_index_list(), context__.vals_r("X"),
          "assigning variable X_flat__");
        current_statement__ = 8;
        pos__ = 1;
        current_statement__ = 8;
        for (size_t sym1__ = 1; sym1__ <= K; ++sym1__) {
          current_statement__ = 8;
          for (size_t sym2__ = 1; sym2__ <= N; ++sym2__) {
            current_statement__ = 8;
            assign(X,
              cons_list(index_uni(sym2__),
                cons_list(index_uni(sym1__), nil_index_list())),
              X_flat__[(pos__ - 1)], "assigning variable X");
            current_statement__ = 8;
            pos__ = (pos__ + 1);}}
      }
      current_statement__ = 5;
      current_statement__ = 5;
      check_greater_or_equal(function__, "N", N, 1);
      current_statement__ = 7;
      current_statement__ = 7;
      check_greater_or_equal(function__, "K", K, 1);
    } catch (const std::exception& e) {
      stan::lang::rethrow_located(e, locations_array__[current_statement__]);
      // Next line prevents compiler griping about no return
      throw std::runtime_error("*** IF YOU SEE THIS, PLEASE REPORT A BUG ***"); 
    }
    num_params_r__ = 0U;
    
    try {
      current_statement__ = 1;
      validate_non_negative_index("b", "K", K);
      num_params_r__ += K;
      num_params_r__ += 1;
    } catch (const std::exception& e) {
      stan::lang::rethrow_located(e, locations_array__[current_statement__]);
      // Next line prevents compiler griping about no return
      throw std::runtime_error("*** IF YOU SEE THIS, PLEASE REPORT A BUG ***"); 
    }
  }
  template <bool propto__, bool jacobian__, typename T__>
  T__ log_prob(std::vector<T__>& params_r__, std::vector<int>& params_i__,
               std::ostream* pstream__ = 0) const {
    typedef T__ local_scalar_t__;
    T__ lp__(0.0);
    stan::math::accumulator<T__> lp_accum__;
    static const char* function__ = "bern_glm_model_namespace::log_prob";
(void) function__;  // suppress unused var warning

    stan::io::reader<local_scalar_t__> in__(params_r__, params_i__);
    
    try {
      current_statement__ = 1;
      validate_non_negative_index("b", "K", K);
      Eigen::Matrix<local_scalar_t__, -1, 1> b;
      b = Eigen::Matrix<local_scalar_t__, -1, 1>(K);
      
      current_statement__ = 1;
      b = in__.vector(K);
      local_scalar_t__ Intercept;
      
      current_statement__ = 2;
      Intercept = in__.scalar();
      {
        current_statement__ = 3;
        lp_accum__.add(student_t_lpdf<false>(Intercept, 3, 0, 10));
        current_statement__ = 4;
        lp_accum__.add(bernoulli_logit_glm_lpmf<propto__>(Y, X, Intercept, b));
      }
    } catch (const std::exception& e) {
      stan::lang::rethrow_located(e, locations_array__[current_statement__]);
      // Next line prevents compiler griping about no return
      throw std::runtime_error("*** IF YOU SEE THIS, PLEASE REPORT A BUG ***"); 
    }
    lp_accum__.add(lp__);
    return lp_accum__.sum();
    } // log_prob() 
    
  template <typename RNG>
  void write_array(RNG& base_rng__, std::vector<double>& params_r__,
                   std::vector<int>& params_i__, std::vector<double>& vars__,
                   bool emit_transformed_parameters__ = true,
                   bool emit_generated_quantities__ = true,
                   std::ostream* pstream__ = 0) const {
    typedef double local_scalar_t__;
    vars__.resize(0);
    stan::io::reader<local_scalar_t__> in__(params_r__, params_i__);
    static const char* function__ = "bern_glm_model_namespace::write_array";
(void) function__;  // suppress unused var warning

    (void) function__;  // suppress unused var warning

    double lp__ = 0.0;
    (void) lp__;  // dummy to suppress unused var warning
    stan::math::accumulator<double> lp_accum__;
    
    try {
      current_statement__ = 1;
      validate_non_negative_index("b", "K", K);
      Eigen::Matrix<double, -1, 1> b;
      b = Eigen::Matrix<double, -1, 1>(K);
      
      current_statement__ = 1;
      b = in__.vector(K);
      double Intercept;
      
      current_statement__ = 2;
      Intercept = in__.scalar();
      for (size_t sym1__ = 1; sym1__ <= K; ++sym1__) {
        vars__.push_back(b[(sym1__ - 1)]);}
      vars__.push_back(Intercept);
      if (logical_negation((primitive_value(emit_transformed_parameters__) ||
            primitive_value(emit_generated_quantities__)))) {
        return ;
      } 
      if (logical_negation(emit_generated_quantities__)) {
        return ;
      } 
    } catch (const std::exception& e) {
      stan::lang::rethrow_located(e, locations_array__[current_statement__]);
      // Next line prevents compiler griping about no return
      throw std::runtime_error("*** IF YOU SEE THIS, PLEASE REPORT A BUG ***"); 
    }
    } // write_array() 
    
  void transform_inits(const stan::io::var_context& context__,
                       std::vector<int>& params_i__,
                       std::vector<double>& vars__, std::ostream* pstream__) const {
    typedef double local_scalar_t__;
    vars__.resize(0);
    vars__.reserve(num_params_r__);
    
    try {
      int pos__;
      
      pos__ = 1;
      current_statement__ = 1;
      validate_non_negative_index("b", "K", K);
      Eigen::Matrix<double, -1, 1> b;
      b = Eigen::Matrix<double, -1, 1>(K);
      
      {
        std::vector<local_scalar_t__> b_flat__;
        current_statement__ = 1;
        assign(b_flat__, nil_index_list(), context__.vals_r("b"),
          "assigning variable b_flat__");
        current_statement__ = 1;
        pos__ = 1;
        current_statement__ = 1;
        for (size_t sym1__ = 1; sym1__ <= K; ++sym1__) {
          current_statement__ = 1;
          assign(b, cons_list(index_uni(sym1__), nil_index_list()),
            b_flat__[(pos__ - 1)], "assigning variable b");
          current_statement__ = 1;
          pos__ = (pos__ + 1);}
      }
      double Intercept;
      
      current_statement__ = 2;
      Intercept = context__.vals_r("Intercept")[(1 - 1)];
      for (size_t sym1__ = 1; sym1__ <= K; ++sym1__) {
        vars__.push_back(b[(sym1__ - 1)]);}
      vars__.push_back(Intercept);
    } catch (const std::exception& e) {
      stan::lang::rethrow_located(e, locations_array__[current_statement__]);
      // Next line prevents compiler griping about no return
      throw std::runtime_error("*** IF YOU SEE THIS, PLEASE REPORT A BUG ***"); 
    }
    } // transform_inits() 
    
  void get_param_names(std::vector<std::string>& names__) const {
    
    names__.resize(0);
    names__.push_back("b");
    names__.push_back("Intercept");
    } // get_param_names() 
    
  void get_dims(std::vector<std::vector<size_t>>& dimss__) const {
    dimss__.resize(0);
    std::vector<size_t> dims__;
    dims__.push_back(K);
    dimss__.push_back(dims__);
    dims__.resize(0);
    dimss__.push_back(dims__);
    dims__.resize(0);
    
    } // get_dims() 
    
  void constrained_param_names(std::vector<std::string>& param_names__,
                               bool emit_transformed_parameters__ = true,
                               bool emit_generated_quantities__ = true) const {
    
    for (size_t sym1__ = 1; sym1__ <= K; ++sym1__) {
      {
        param_names__.push_back(std::string() + "b" + '.' + std::to_string(sym1__));
      }}
    param_names__.push_back(std::string() + "Intercept");
    if (emit_transformed_parameters__) {
      
    }
    
    if (emit_generated_quantities__) {
      
    }
    
    } // constrained_param_names() 
    
  void unconstrained_param_names(std::vector<std::string>& param_names__,
                                 bool emit_transformed_parameters__ = true,
                                 bool emit_generated_quantities__ = true) const {
    
    for (size_t sym1__ = 1; sym1__ <= K; ++sym1__) {
      {
        param_names__.push_back(std::string() + "b" + '.' + std::to_string(sym1__));
      }}
    param_names__.push_back(std::string() + "Intercept");
    if (emit_transformed_parameters__) {
      
    }
    
    if (emit_generated_quantities__) {
      
    }
    
    } // unconstrained_param_names() 
    
  std::string get_constrained_sizedtypes() const {
    stringstream s__;
    s__ << "[{\"name\":\"b\",\"type\":{\"name\":\"vector\",\"length\":" << K << "},\"block\":\"parameters\"},{\"name\":\"Intercept\",\"type\":{\"name\":\"real\"},\"block\":\"parameters\"}]";
    return s__.str();
    } // get_constrained_sizedtypes() 
    
  std::string get_unconstrained_sizedtypes() const {
    stringstream s__;
    s__ << "[{\"name\":\"b\",\"type\":{\"name\":\"vector\",\"length\":" << K << "},\"block\":\"parameters\"},{\"name\":\"Intercept\",\"type\":{\"name\":\"real\"},\"block\":\"parameters\"}]";
    return s__.str();
    } // get_unconstrained_sizedtypes() 
    
  
    // Begin method overload boilerplate
    template <typename RNG>
    void write_array(RNG& base_rng__,
                     Eigen::Matrix<double,Eigen::Dynamic,1>& params_r,
                     Eigen::Matrix<double,Eigen::Dynamic,1>& vars,
                     bool emit_transformed_parameters__ = true,
                     bool emit_generated_quantities__ = true,
                     std::ostream* pstream = 0) const {
      std::vector<double> params_r_vec(params_r.size());
      for (int i = 0; i < params_r.size(); ++i)
        params_r_vec[i] = params_r(i);
      std::vector<double> vars_vec;
      std::vector<int> params_i_vec;
      write_array(base_rng__, params_r_vec, params_i_vec, vars_vec,
          emit_transformed_parameters__, emit_generated_quantities__, pstream);
      vars.resize(vars_vec.size());
      for (int i = 0; i < vars.size(); ++i)
        vars(i) = vars_vec[i];
    }

    template <bool propto__, bool jacobian__, typename T_>
    T_ log_prob(Eigen::Matrix<T_,Eigen::Dynamic,1>& params_r,
               std::ostream* pstream = 0) const {
      std::vector<T_> vec_params_r;
      vec_params_r.reserve(params_r.size());
      for (int i = 0; i < params_r.size(); ++i)
        vec_params_r.push_back(params_r(i));
      std::vector<int> vec_params_i;
      return log_prob<propto__,jacobian__,T_>(vec_params_r, vec_params_i, pstream);
    }

    void transform_inits(const stan::io::var_context& context,
                         Eigen::Matrix<double, Eigen::Dynamic, 1>& params_r,
                         std::ostream* pstream__) const {
      std::vector<double> params_r_vec;
      std::vector<int> params_i_vec;
      transform_inits(context, params_i_vec, params_r_vec, pstream__);
      params_r.resize(params_r_vec.size());
      for (int i = 0; i < params_r.size(); ++i)
        params_r(i) = params_r_vec[i];
    }

};
}
typedef bern_glm_model_namespace::bern_glm_model stan_model;

#ifndef USING_R

// Boilerplate
stan::model::model_base& new_model(
        stan::io::var_context& data_context,
        unsigned int seed,
        std::ostream* msg_stream) {
  stan_model* m = new stan_model(data_context, seed, msg_stream);
  return *m;
}

#endif
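
For reference, the bern_glm.stan model this hpp corresponds to can be read back out of the generated code; a reconstruction (inferred from the data checks and the lpdf/lpmf calls above, so details may differ from the original source):

data {
  int<lower=1> N;
  int Y[N];
  int<lower=1> K;
  matrix[N, K] X;
}
parameters {
  vector[K] b;
  real Intercept;
}
model {
  target += student_t_lpdf(Intercept | 3, 0, 10);
  Y ~ bernoulli_logit_glm(X, Intercept, b);
}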

Yes, this was added in CmdStan 2.22, and cmdstanr is currently a bit behind, with a bunch of PRs open. It’s on my short-term TODO.

As far as your issue:

stanc3 seems not to pick up
y_vi_d ~ bernoulli_logit_glm(X_d, alpha, beta);
for OpenCL, but only the equivalent
target += bernoulli_logit_glm_lpmf(y_vi_d | X_d, alpha, beta);

The latter is the syntax that is tested in stanc3 https://github.com/stan-dev/stanc3/blob/master/test/integration/good/code-gen/optimize_glm.stan

Not sure why it doesn’t pick that up, though. Ah… I apologize for the inconvenience and the wasted hours. I do appreciate your efforts in testing this feature.
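
In the meantime, a quick heuristic for checking which path stanc3 generated (this assumes, as in the linked test file, that the OpenCL GLM path references matrix_cl types in the generated C++):

# zero matches suggests the plain CPU path was generated
grep -c "matrix_cl" bern_glm.hpp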