Multithreading and memory usage

I have a fairly general question about the way multi-threading works. I have a very large reduce_sum call that has a ton of data passed in, with the intention of sharding the data across threads:

    target += reduce_sum(
      choice_lpmf,
      indices, // Thing to cut across
      grainsize, // grainsize == 1
      p_R_ftr_not_na,
      p_R_ftr_wo_clm,
      pricing_error_mean,
      pricing_error_sd,
      tm_int_ind_n,
      p_R_ftr_wo_clm_w_tm, // transformed data, N
      pricing_error_mean_tm, // model block, N
      pricing_error_sd_tm, // model block
      run_demand, // data
      is_nb, // transformed data, N
      L_draws, // data
      lambda, // transformed data, N
      demand_choice_index, // data
      tm_ind, // data
      dollar_norm, // data
      is_nb_int, // transformed data
      tm_elig_n, // transformed data
      J, // data
      J_oo, // data
      prices_not_na_ind, // data, N x J
      prices_oo_not_na_ind, // data, N x J_oo
      limits, // transformed data, J --
      limits_oo, // transformed data, J_oo --
      prices, // transformed data, N x J
      prices_oo, // transformed data, N x J_oo
      tm_optin_disc_n, // transformed data, N
      prices_pre_firm_ind, // data, N x J
      prices_pre_cov_ind, // data, N x J
      pre_oo_firm_ind, // data, N x J_oo
      pre_oo_cov_ind, // data, N x J_oo
      clm_surcharge, // data, N
      pareto_loc, // transformed data, N
      pareto_shape, // data
      risk_aversion, // model block, scalar
      firm_switch_cost, // model block, N
      cov_switch_cost, // model block, scalar
      tm_frictions, // model block, N
      sigma_logit, // model block, scalar
      plan_fes, // model block, N x J
      plan_oo_fes, // model block, N x J_oo
      quad_nodes,
      quad_weights
    );

The mapping function choice_lpmf cuts all these variables into chunks and then does all the likelihood calculations. I have yet to find a better way to do this; ideally, reduce_sum would shard all of these variables for me before handing them to a thread.
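For reference, the shape choice_lpmf has to take is roughly this: the sliced argument comes in first, followed by the 1-based start/end of that slice within the full array, followed by every shared argument in order. This is just a stripped-down sketch with a placeholder likelihood and assumed types for lambda and tm_ind, not the real function:

    functions {
      real choice_lpmf(array[] int index_slice, int start, int end,
                       vector lambda, array[] int tm_ind) {
        real lp = 0;
        // shared arguments arrive whole; each thread only touches rows start..end
        for (n in start:end) {
          lp += poisson_lpmf(tm_ind[n] | lambda[n]);  // placeholder likelihood
        }
        return lp;
      }
    }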

Many of these variables are quite large. On a single-threaded run this doesn't use much memory; it's just slow.

However, as soon as I start adding threads it quickly becomes unworkable: the memory usage explodes, and I cannot actually run this model on the full data set with more than 1-2 threads.

My best guess here is that reduce_sum deep copies all the variables to each thread. Is there a way to prevent this? Maybe to flag some of these variables as something to be passed by reference? Or is there some other way of writing out functions like this that use lots of heterogeneous data?

I'm also noticing that memory spikes after the first likelihood evaluation; presumably this is during the gradient pass? Does anyone have a good sense of what causes memory usage during the gradient to be so much higher?

reduce_sum does deep copy parameters, but not data.

If you have large vectors or matrices of parameters that need to be shared across all threads, that will be a big bottleneck, and reduce_sum is less likely to be efficient unless the per-chunk computation is really expensive.

Data variables are passed by reference and should not cause additional memory usage.
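Going by the comments in the call above, the shared arguments coming from the model block (pricing_error_mean_tm, pricing_error_sd_tm, risk_aversion, firm_switch_cost, cov_switch_cost, tm_frictions, sigma_logit, plan_fes, plan_oo_fes) are the ones that get copied; the N x J plan_fes and N x J_oo plan_oo_fes are presumably the largest of them.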


Have you tried grainsizes other than 1?


Is this something I need to annotate, i.e. with data vector ...?

Yes; other grainsizes generally increase the memory usage, and they also slow things down a bit. I've found grainsize=1 works best for my application (at least in terms of compute speed).

Anything declared in the data or transformed data blocks will get treated as data for this purpose.
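Schematically (hypothetical declarations, nothing to do with the actual model), only the block a variable comes from matters here, not any annotation at the call site:

    data {
      int<lower=1> N;
      vector[N] x;                              // data block: passed by reference
    }
    transformed data {
      vector[N] x_std = (x - mean(x)) / sd(x);  // still data: no per-thread copy
    }
    parameters {
      vector[N] theta;                          // parameter: deep-copied if passed
                                                // to reduce_sum as a shared argument
    }
    model {
      theta ~ std_normal();
    }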