I have kind of a general question about the way that multi-threading works. I have a very large reduce_sum
function that has a ton of data passed in, with the intention being that I want to shard the data across threads:
target += reduce_sum(
choice_lpmf,
indices, // Thing to cut across
grainsize, // grainsize == 1
p_R_ftr_not_na,
p_R_ftr_wo_clm,
pricing_error_mean,
pricing_error_sd,
tm_int_ind_n,
p_R_ftr_wo_clm_w_tm, // transformed data, N
pricing_error_mean_tm, // model block, N
pricing_error_sd_tm, // model block
run_demand, // data
is_nb, // transformed data, N
L_draws, // data
lambda, // transformed data, N
demand_choice_index, // data
tm_ind, // data
dollar_norm, // data
is_nb_int, // transformed data
tm_elig_n, // transformed data
J, // data
J_oo, // data
prices_not_na_ind, // data, N x J
prices_oo_not_na_ind, // data, N x J_oo
limits, // transformed data, J --
limits_oo, // transformed data, J_oo --
prices, // transformed data, N x J
prices_oo, // transformed data, N x J_oo
tm_optin_disc_n, // transformed data, N
prices_pre_firm_ind, // data, N x J
prices_pre_cov_ind, // data, N x J
pre_oo_firm_ind, // data, N x J_oo
pre_oo_cov_ind, // data, N x J_oo
clm_surcharge, // data, N
pareto_loc, // transformed data, N
pareto_shape, // data
risk_aversion, // model block, scalar
firm_switch_cost, // model block, N
cov_switch_cost, // model block, scalar
tm_frictions, // model block, N
sigma_logit, // model block, scalar
plan_fes, // model block, N x J
plan_oo_fes, // model block, N x J_oo
quad_nodes,
quad_weights
);
The mapping function choice_lpmf
cuts all these variables into chunks and then does all the likelihood calculations. I have yet to find a better way to do this – ideally, reduce_sum
would shard all the variables for me before handing them to a thread, but I have not yet found a way to do this.
Many of these variables are quite large. On a single threaded run, this doesn’t use much memory, it’s just slow.
However, I’m running into the issue where, when I start adding threads, it quickly becomes an inoperable problem. The memory usage is insane, and I cannot actually run this model on the full data set with more than 1-2 threads.
My best guess here is that reduce_sum
deep copies all the variables to each thread. Is there a way to prevent this? Maybe to flag some of these variables as something to be passed by reference? Or is there some other way of writing out functions like this that use lots of heterogeneous data?