I was using cmdstanr with `num_cores=2`. If I change that to `num_cores=1` it works. If I had more memory, would there be a problem running several chains at the same time with GPU support? Although it's still on the border of running out of memory: if I open Emacs, running `num_cores=1` fails, too.
I don't know how `num_cores` works in cmdstanr. If it uses a single process with multiple threads, I think the kernels should be compiled only once, so there should not be a significant increase in RAM demand from kernel compilation. If multiple processes are launched, each will have to compile the kernels and you will have problems.
It calls CmdStan. I don't think CmdStan has multi-threaded chains yet, so that is probably multiple processes. Since they are launched at the same time, each will try to compile the kernels and you will probably run out of RAM.
If I don't run out of RAM, what is the performance hit of running multiple chains with a single GPU card?
Depending on the model it should be either negligible or even a bit faster due to async OpenCL calls (in theory GPU computations, CPU computations, and memory transfers between CPU and GPU can all run at the same time without affecting each other). In a single thread, however, one operation must often wait for the previous one to complete.
Our prior on multi-process GPU use is what Tadej says: it should actually be more efficient in most cases, as you are more likely to occupy the GPU at 100%. This has not been tested extensively, however, as I find running multiple processes in the shell an unpleasant experience. That is one of the reasons I started working on cmdstanr and am also actively pushing for the next patch release to get through (hopefully I am not too annoying :) ).
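For reference, what cmdstanr does with `num_cores=4` corresponds roughly to the shell sketch below: independent CmdStan processes, each compiling its own OpenCL kernels at startup. The executable and data file names are just this thread's example, and the sleep between launches is an untested guess at spreading out the kernel-compilation memory peak, not something measured here.

for i in 1 2 3 4; do
  # launch chain i as its own CmdStan process, in the background
  ./bern_glm sample random seed=$i \
    data file=bern_glm.data.R \
    output file=samples_${i}.csv &
  sleep 5   # stagger the launches so the kernels are not all compiled at once
done
wait        # block until all four chains have finished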
Thanks for the help, @rok_cesnovar and @tadej!
It would be good to add that to the wiki page, e.g.: OpenCL kernels are compiled just-in-time at the start of any OpenCL-enabled Stan/Stan Math program and may thus require more memory than when running without GPU support. If several CmdStan processes are started at the same time, each process needs that memory for a moment. If there is not enough memory to compile the OpenCL kernels, you should see a "CL_OUT_OF_HOST_MEMORY" error.
Done. Thanks!
It seems that `make clean-all` is not cleaning everything. After removing the OpenCL settings from make/local and running `make clean-all; make build`, I get hundreds of lines like:
stansummary.cpp:(.text._ZN4stan4math14opencl_kernels14kernel_functorIJRKN2cl6BufferES6_S6_RKiS8_S8_RKNS0_14matrix_cl_viewESB_EEC2EPKcRKSt6vectorINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEESaISL_EERKSt3mapISL_iSt4lessISL_ESaISt4pairIKSL_iEEE[_ZN4stan4math14opencl_kernels14kernel_functorIJRKN2cl6BufferES6_S6_RKiS8_S8_RKNS0_14matrix_cl_viewESB_EEC5EPKcRKSt6vectorINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEESaISL_EERKSt3mapISL_iSt4lessISL_ESaISt4pairIKSL_iEEE]+0x156): undefined reference to `clReleaseKernel'
Does removing src/cmdstan/main.o help?
It's removed as part of `make clean-all`, but something else seems not to be removed, and the error message is not helpful.
Unlikely, but are you sure you cleaned up all the make/local files (one of the reasons I never use any make/local other than the CmdStan one)? You mentioned previously that you added lines to the math library's make/local. I tried to replicate this on my Ubuntu machine and unfortunately couldn't.
Yes, there are no local files left. I had run the OpenCL tests in math, and I ran `make clean-all` there as well. I guess I'll just clone a fresh copy, but I thought I'd mention that there might be cleaning issues.
Yes, thanks. That is definitely not desirable. I will try to get to the same state.
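In the meantime, a blunt manual clean along these lines might work around it. This is a sketch assuming a standard CmdStan checkout with the math library as a submodule, not a verified recipe:

# clean CmdStan's own targets
make clean-all
# also clean the math submodule, where the OpenCL tests were run
make -C stan/lib/stan_math clean-all
# drop any stale objects; bin is recreated by the build
rm -rf bin
make build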
The wiki page lists:
"We currently have support for the following methods
- bernoulli_logit_glm"
I tested two different models, one with n=54, p=1536 covariates and one with n=1e4, p=100, but I don't see any speed difference with or without the GPU, and the GPU load doesn't change from 0%. How can I check whether bernoulli_logit_glm is actually using the GPU? With `num_cores=4`, I do get CL_OUT_OF_HOST_MEMORY with one version but not with the other, so I assume something is happening, just not enough. (I should probably stop trying to get this to work.)
CmdStan outputs a line stating that it uses an OpenCL device and which one. Can you post the hpp file?
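(For reference, a minimal run straight from the shell, reusing this thread's executable and data file names as an example; the OpenCL banner is printed at startup:)

./bern_glm sample data file=bern_glm.data.R output file=output.csv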
Ah, cmdstanr doesn't show that. Tested from the shell and I see:
STAN_OPENCL is enabled. OpenCL supported functions will use:
Platform: NVIDIA CUDA
Device: GRID P40-2Q
Also, with nvidia-smi I can see:
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 1104 G /usr/lib/xorg/Xorg 222MiB |
| 0 1145 G /usr/bin/gnome-shell 184MiB |
| 0 10195 C+G ...ent/VMwareBlastServer/VMwareBlastServer 232MiB |
| 0 11560 G /usr/bin/gnome-shell 177MiB |
| 0 12254 G /usr/lib/xorg/Xorg 663MiB |
| 0 28238 C ./bern_glm 157MiB |
+-----------------------------------------------------------------------------+
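(Side note: a quick way to watch GPU utilization while the model runs is nvidia-smi's query mode, polling once per second:)

# print GPU utilization and memory use every second
nvidia-smi --query-gpu=utilization.gpu,memory.used --format=csv -l 1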
The hpp file:
// Code generated by stanc 11a0fa51
#include <stan/model/model_header.hpp>
namespace bern_glm_model_namespace {
template <typename T, typename S>
std::vector<T> resize_to_match__(std::vector<T>& dst, const std::vector<S>& src) {
dst.resize(src.size());
return dst;
}
template <typename T>
Eigen::Matrix<T, -1, -1>
resize_to_match__(Eigen::Matrix<T, -1, -1>& dst, const Eigen::Matrix<T, -1, -1>& src) {
dst.resize(src.rows(), src.cols());
return dst;
}
template <typename T>
Eigen::Matrix<T, 1, -1>
resize_to_match__(Eigen::Matrix<T, 1, -1>& dst, const Eigen::Matrix<T, 1, -1>& src) {
dst.resize(src.size());
return dst;
}
template <typename T>
Eigen::Matrix<T, -1, 1>
resize_to_match__(Eigen::Matrix<T, -1, 1>& dst, const Eigen::Matrix<T, -1, 1>& src) {
dst.resize(src.size());
return dst;
}
std::vector<double> to_doubles__(std::initializer_list<double> x) {
return x;
}
std::vector<stan::math::var> to_vars__(std::initializer_list<stan::math::var> x) {
return x;
}
inline void validate_positive_index(const char* var_name, const char* expr,
int val) {
if (val < 1) {
std::stringstream msg;
msg << "Found dimension size less than one in simplex declaration"
<< "; variable=" << var_name << "; dimension size expression=" << expr
<< "; expression value=" << val;
std::string msg_str(msg.str());
throw std::invalid_argument(msg_str.c_str());
}
}
inline void validate_unit_vector_index(const char* var_name, const char* expr,
int val) {
if (val <= 1) {
std::stringstream msg;
if (val == 1) {
msg << "Found dimension size one in unit vector declaration."
<< " One-dimensional unit vector is discrete"
<< " but the target distribution must be continuous."
<< " variable=" << var_name << "; dimension size expression=" << expr;
} else {
msg << "Found dimension size less than one in unit vector declaration"
<< "; variable=" << var_name << "; dimension size expression=" << expr
<< "; expression value=" << val;
}
std::string msg_str(msg.str());
throw std::invalid_argument(msg_str.c_str());
}
}
using std::istream;
using std::string;
using std::stringstream;
using std::vector;
using stan::io::dump;
using stan::math::lgamma;
using stan::model::model_base_crtp;
using stan::model::rvalue;
using stan::model::cons_list;
using stan::model::index_uni;
using stan::model::index_max;
using stan::model::index_min;
using stan::model::index_min_max;
using stan::model::index_multi;
using stan::model::index_omni;
using stan::model::nil_index_list;
using namespace stan::math;
static int current_statement__ = 0;
static const std::vector<string> locations_array__ = {" (found before start of program)",
" (in '/m/home/home7/77/ave/unix/proj/ovarian/bern_glm.stan', line 8, column 2 to column 14)",
" (in '/m/home/home7/77/ave/unix/proj/ovarian/bern_glm.stan', line 9, column 2 to column 17)",
" (in '/m/home/home7/77/ave/unix/proj/ovarian/bern_glm.stan', line 13, column 2 to column 49)",
" (in '/m/home/home7/77/ave/unix/proj/ovarian/bern_glm.stan', line 15, column 2 to column 43)",
" (in '/m/home/home7/77/ave/unix/proj/ovarian/bern_glm.stan', line 2, column 2 to column 17)",
" (in '/m/home/home7/77/ave/unix/proj/ovarian/bern_glm.stan', line 3, column 2 to column 11)",
" (in '/m/home/home7/77/ave/unix/proj/ovarian/bern_glm.stan', line 4, column 2 to column 17)",
" (in '/m/home/home7/77/ave/unix/proj/ovarian/bern_glm.stan', line 5, column 2 to column 17)"};
class bern_glm_model : public model_base_crtp<bern_glm_model> {
private:
int pos__;
int N;
std::vector<int> Y;
int K;
Eigen::Matrix<double, -1, -1> X;
public:
~bern_glm_model() { }
std::string model_name() const { return "bern_glm_model"; }
bern_glm_model(stan::io::var_context& context__,
unsigned int random_seed__ = 0,
std::ostream* pstream__ = nullptr) : model_base_crtp(0) {
typedef double local_scalar_t__;
boost::ecuyer1988 base_rng__ =
stan::services::util::create_rng(random_seed__, 0);
(void) base_rng__; // suppress unused var warning
static const char* function__ = "bern_glm_model_namespace::bern_glm_model";
(void) function__; // suppress unused var warning
try {
pos__ = 1;
context__.validate_dims("data initialization","N","int",
context__.to_vec());
current_statement__ = 5;
N = context__.vals_i("N")[(1 - 1)];
current_statement__ = 6;
validate_non_negative_index("Y", "N", N);
context__.validate_dims("data initialization","Y","int",
context__.to_vec(N));
Y = std::vector<int>(N, 0);
current_statement__ = 6;
assign(Y, nil_index_list(), context__.vals_i("Y"),
"assigning variable Y");
context__.validate_dims("data initialization","K","int",
context__.to_vec());
current_statement__ = 7;
K = context__.vals_i("K")[(1 - 1)];
current_statement__ = 8;
validate_non_negative_index("X", "N", N);
current_statement__ = 8;
validate_non_negative_index("X", "K", K);
context__.validate_dims("data initialization","X","double",
context__.to_vec(N, K));
X = Eigen::Matrix<double, -1, -1>(N, K);
{
std::vector<local_scalar_t__> X_flat__;
current_statement__ = 8;
assign(X_flat__, nil_index_list(), context__.vals_r("X"),
"assigning variable X_flat__");
current_statement__ = 8;
pos__ = 1;
current_statement__ = 8;
for (size_t sym1__ = 1; sym1__ <= K; ++sym1__) {
current_statement__ = 8;
for (size_t sym2__ = 1; sym2__ <= N; ++sym2__) {
current_statement__ = 8;
assign(X,
cons_list(index_uni(sym2__),
cons_list(index_uni(sym1__), nil_index_list())),
X_flat__[(pos__ - 1)], "assigning variable X");
current_statement__ = 8;
pos__ = (pos__ + 1);}}
}
current_statement__ = 5;
current_statement__ = 5;
check_greater_or_equal(function__, "N", N, 1);
current_statement__ = 7;
current_statement__ = 7;
check_greater_or_equal(function__, "K", K, 1);
} catch (const std::exception& e) {
stan::lang::rethrow_located(e, locations_array__[current_statement__]);
// Next line prevents compiler griping about no return
throw std::runtime_error("*** IF YOU SEE THIS, PLEASE REPORT A BUG ***");
}
num_params_r__ = 0U;
try {
current_statement__ = 1;
validate_non_negative_index("b", "K", K);
num_params_r__ += K;
num_params_r__ += 1;
} catch (const std::exception& e) {
stan::lang::rethrow_located(e, locations_array__[current_statement__]);
// Next line prevents compiler griping about no return
throw std::runtime_error("*** IF YOU SEE THIS, PLEASE REPORT A BUG ***");
}
}
template <bool propto__, bool jacobian__, typename T__>
T__ log_prob(std::vector<T__>& params_r__, std::vector<int>& params_i__,
std::ostream* pstream__ = 0) const {
typedef T__ local_scalar_t__;
T__ lp__(0.0);
stan::math::accumulator<T__> lp_accum__;
static const char* function__ = "bern_glm_model_namespace::log_prob";
(void) function__; // suppress unused var warning
stan::io::reader<local_scalar_t__> in__(params_r__, params_i__);
try {
current_statement__ = 1;
validate_non_negative_index("b", "K", K);
Eigen::Matrix<local_scalar_t__, -1, 1> b;
b = Eigen::Matrix<local_scalar_t__, -1, 1>(K);
current_statement__ = 1;
b = in__.vector(K);
local_scalar_t__ Intercept;
current_statement__ = 2;
Intercept = in__.scalar();
{
current_statement__ = 3;
lp_accum__.add(student_t_lpdf<false>(Intercept, 3, 0, 10));
current_statement__ = 4;
lp_accum__.add(bernoulli_logit_glm_lpmf<propto__>(Y, X, Intercept, b));
}
} catch (const std::exception& e) {
stan::lang::rethrow_located(e, locations_array__[current_statement__]);
// Next line prevents compiler griping about no return
throw std::runtime_error("*** IF YOU SEE THIS, PLEASE REPORT A BUG ***");
}
lp_accum__.add(lp__);
return lp_accum__.sum();
} // log_prob()
template <typename RNG>
void write_array(RNG& base_rng__, std::vector<double>& params_r__,
std::vector<int>& params_i__, std::vector<double>& vars__,
bool emit_transformed_parameters__ = true,
bool emit_generated_quantities__ = true,
std::ostream* pstream__ = 0) const {
typedef double local_scalar_t__;
vars__.resize(0);
stan::io::reader<local_scalar_t__> in__(params_r__, params_i__);
static const char* function__ = "bern_glm_model_namespace::write_array";
(void) function__; // suppress unused var warning
(void) function__; // suppress unused var warning
double lp__ = 0.0;
(void) lp__; // dummy to suppress unused var warning
stan::math::accumulator<double> lp_accum__;
try {
current_statement__ = 1;
validate_non_negative_index("b", "K", K);
Eigen::Matrix<double, -1, 1> b;
b = Eigen::Matrix<double, -1, 1>(K);
current_statement__ = 1;
b = in__.vector(K);
double Intercept;
current_statement__ = 2;
Intercept = in__.scalar();
for (size_t sym1__ = 1; sym1__ <= K; ++sym1__) {
vars__.push_back(b[(sym1__ - 1)]);}
vars__.push_back(Intercept);
if (logical_negation((primitive_value(emit_transformed_parameters__) ||
primitive_value(emit_generated_quantities__)))) {
return ;
}
if (logical_negation(emit_generated_quantities__)) {
return ;
}
} catch (const std::exception& e) {
stan::lang::rethrow_located(e, locations_array__[current_statement__]);
// Next line prevents compiler griping about no return
throw std::runtime_error("*** IF YOU SEE THIS, PLEASE REPORT A BUG ***");
}
} // write_array()
void transform_inits(const stan::io::var_context& context__,
std::vector<int>& params_i__,
std::vector<double>& vars__, std::ostream* pstream__) const {
typedef double local_scalar_t__;
vars__.resize(0);
vars__.reserve(num_params_r__);
try {
int pos__;
pos__ = 1;
current_statement__ = 1;
validate_non_negative_index("b", "K", K);
Eigen::Matrix<double, -1, 1> b;
b = Eigen::Matrix<double, -1, 1>(K);
{
std::vector<local_scalar_t__> b_flat__;
current_statement__ = 1;
assign(b_flat__, nil_index_list(), context__.vals_r("b"),
"assigning variable b_flat__");
current_statement__ = 1;
pos__ = 1;
current_statement__ = 1;
for (size_t sym1__ = 1; sym1__ <= K; ++sym1__) {
current_statement__ = 1;
assign(b, cons_list(index_uni(sym1__), nil_index_list()),
b_flat__[(pos__ - 1)], "assigning variable b");
current_statement__ = 1;
pos__ = (pos__ + 1);}
}
double Intercept;
current_statement__ = 2;
Intercept = context__.vals_r("Intercept")[(1 - 1)];
for (size_t sym1__ = 1; sym1__ <= K; ++sym1__) {
vars__.push_back(b[(sym1__ - 1)]);}
vars__.push_back(Intercept);
} catch (const std::exception& e) {
stan::lang::rethrow_located(e, locations_array__[current_statement__]);
// Next line prevents compiler griping about no return
throw std::runtime_error("*** IF YOU SEE THIS, PLEASE REPORT A BUG ***");
}
} // transform_inits()
void get_param_names(std::vector<std::string>& names__) const {
names__.resize(0);
names__.push_back("b");
names__.push_back("Intercept");
} // get_param_names()
void get_dims(std::vector<std::vector<size_t>>& dimss__) const {
dimss__.resize(0);
std::vector<size_t> dims__;
dims__.push_back(K);
dimss__.push_back(dims__);
dims__.resize(0);
dimss__.push_back(dims__);
dims__.resize(0);
} // get_dims()
void constrained_param_names(std::vector<std::string>& param_names__,
bool emit_transformed_parameters__ = true,
bool emit_generated_quantities__ = true) const {
for (size_t sym1__ = 1; sym1__ <= K; ++sym1__) {
{
param_names__.push_back(std::string() + "b" + '.' + std::to_string(sym1__));
}}
param_names__.push_back(std::string() + "Intercept");
if (emit_transformed_parameters__) {
}
if (emit_generated_quantities__) {
}
} // constrained_param_names()
void unconstrained_param_names(std::vector<std::string>& param_names__,
bool emit_transformed_parameters__ = true,
bool emit_generated_quantities__ = true) const {
for (size_t sym1__ = 1; sym1__ <= K; ++sym1__) {
{
param_names__.push_back(std::string() + "b" + '.' + std::to_string(sym1__));
}}
param_names__.push_back(std::string() + "Intercept");
if (emit_transformed_parameters__) {
}
if (emit_generated_quantities__) {
}
} // unconstrained_param_names()
std::string get_constrained_sizedtypes() const {
stringstream s__;
s__ << "[{\"name\":\"b\",\"type\":{\"name\":\"vector\",\"length\":" << K << "},\"block\":\"parameters\"},{\"name\":\"Intercept\",\"type\":{\"name\":\"real\"},\"block\":\"parameters\"}]";
return s__.str();
} // get_constrained_sizedtypes()
std::string get_unconstrained_sizedtypes() const {
stringstream s__;
s__ << "[{\"name\":\"b\",\"type\":{\"name\":\"vector\",\"length\":" << K << "},\"block\":\"parameters\"},{\"name\":\"Intercept\",\"type\":{\"name\":\"real\"},\"block\":\"parameters\"}]";
return s__.str();
} // get_unconstrained_sizedtypes()
// Begin method overload boilerplate
template <typename RNG>
void write_array(RNG& base_rng__,
Eigen::Matrix<double,Eigen::Dynamic,1>& params_r,
Eigen::Matrix<double,Eigen::Dynamic,1>& vars,
bool emit_transformed_parameters__ = true,
bool emit_generated_quantities__ = true,
std::ostream* pstream = 0) const {
std::vector<double> params_r_vec(params_r.size());
for (int i = 0; i < params_r.size(); ++i)
params_r_vec[i] = params_r(i);
std::vector<double> vars_vec;
std::vector<int> params_i_vec;
write_array(base_rng__, params_r_vec, params_i_vec, vars_vec,
emit_transformed_parameters__, emit_generated_quantities__, pstream);
vars.resize(vars_vec.size());
for (int i = 0; i < vars.size(); ++i)
vars(i) = vars_vec[i];
}
template <bool propto__, bool jacobian__, typename T_>
T_ log_prob(Eigen::Matrix<T_,Eigen::Dynamic,1>& params_r,
std::ostream* pstream = 0) const {
std::vector<T_> vec_params_r;
vec_params_r.reserve(params_r.size());
for (int i = 0; i < params_r.size(); ++i)
vec_params_r.push_back(params_r(i));
std::vector<int> vec_params_i;
return log_prob<propto__,jacobian__,T_>(vec_params_r, vec_params_i, pstream);
}
void transform_inits(const stan::io::var_context& context,
Eigen::Matrix<double, Eigen::Dynamic, 1>& params_r,
std::ostream* pstream__) const {
std::vector<double> params_r_vec;
std::vector<int> params_i_vec;
transform_inits(context, params_i_vec, params_r_vec, pstream__);
params_r.resize(params_r_vec.size());
for (int i = 0; i < params_r.size(); ++i)
params_r(i) = params_r_vec[i];
}
};
}
typedef bern_glm_model_namespace::bern_glm_model stan_model;
#ifndef USING_R
// Boilerplate
stan::model::model_base& new_model(
stan::io::var_context& data_context,
unsigned int seed,
std::ostream* msg_stream) {
stan_model* m = new stan_model(data_context, seed, msg_stream);
return *m;
}
#endif
Yes, this was added in CmdStan 2.22, and cmdstanr is currently a bit behind, with a bunch of PRs open. It's on my short-term TODO.
As for your issue:
stanc3 seems not to pick up
y_vi_d ~ bernoulli_logit_glm(X_d, alpha, beta);
for OpenCL, only the
target += bernoulli_logit_glm_lpmf(y_vi_d | X_d, alpha, beta);
form. The latter is the syntax that is tested in stanc3: https://github.com/stan-dev/stanc3/blob/master/test/integration/good/code-gen/optimize_glm.stan
Not sure why it doesn't pick up the ~ form, though; until that is fixed, rewriting the model with the target += form should trigger the OpenCL path. Ah… I apologize for the inconvenience and the wasted hours. I do appreciate your effort in testing this feature.
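As a side note, one way to check whether stanc3 actually generated the OpenCL path, without timing anything, is to inspect the generated hpp. The assumption here, based on the optimize_glm integration test linked above, is that the optimized code references matrix_cl, which the hpp posted earlier does not:

# count references to the OpenCL matrix type in the generated model code;
# 0 suggests the plain CPU path was generated
grep -c "matrix_cl" bern_glm.hpp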