Multithreading and memory usage

I have a fairly general question about the way multi-threading works. I have a very large reduce_sum call that has a ton of data passed in, with the intention of sharding the data across threads:

    target += reduce_sum(
      choice_lpmf,
      indices, // Thing to cut across
      grainsize, // grainsize == 1
      p_R_ftr_not_na,
      p_R_ftr_wo_clm,
      pricing_error_mean,
      pricing_error_sd,
      tm_int_ind_n,
      p_R_ftr_wo_clm_w_tm, // transformed data, N
      pricing_error_mean_tm, // model block, N
      pricing_error_sd_tm, // model block
      run_demand, // data
      is_nb, // transformed data, N
      L_draws, // data
      lambda, // transformed data, N
      demand_choice_index, // data
      tm_ind, // data
      dollar_norm, // data
      is_nb_int, // transformed data
      tm_elig_n, // transformed data
      J, // data
      J_oo, // data
      prices_not_na_ind, // data, N x J
      prices_oo_not_na_ind, // data, N x J_oo
      limits, // transformed data, J --
      limits_oo, // transformed data, J_oo --
      prices, // transformed data, N x J
      prices_oo, // transformed data, N x J_oo
      tm_optin_disc_n, // transformed data, N
      prices_pre_firm_ind, // data, N x J
      prices_pre_cov_ind, // data, N x J
      pre_oo_firm_ind, // data, N x J_oo
      pre_oo_cov_ind, // data, N x J_oo
      clm_surcharge, // data, N
      pareto_loc, // transformed data, N
      pareto_shape, // data
      risk_aversion, // model block, scalar
      firm_switch_cost, // model block, N
      cov_switch_cost, // model block, scalar
      tm_frictions, // model block, N
      sigma_logit, // model block, scalar
      plan_fes, // model block, N x J
      plan_oo_fes, // model block, N x J_oo
      quad_nodes,
      quad_weights
    );

The mapping function choice_lpmf cuts all these variables into chunks and then does all the likelihood calculations. I have yet to find a better way to do this; ideally, reduce_sum would shard all of these variables for me before handing them to a thread.
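For reference, the shape choice_lpmf has to take is roughly this: the sliced argument comes in first, followed by the 1-based start/end of that slice within the full array, followed by every shared argument in order. This is just a stripped-down sketch with a placeholder likelihood and assumed types for lambda and tm_ind, not the real function:

    functions {
      real choice_lpmf(array[] int index_slice, int start, int end,
                       vector lambda, array[] int tm_ind) {
        real lp = 0;
        // shared arguments arrive whole; each thread only touches rows start..end
        for (n in start:end) {
          lp += poisson_lpmf(tm_ind[n] | lambda[n]);  // placeholder likelihood
        }
        return lp;
      }
    }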

Many of these variables are quite large. On a single-threaded run this doesn't use much memory; it's just slow.

However, as soon as I start adding threads it quickly becomes unworkable: the memory usage explodes, and I cannot actually run this model on the full data set with more than 1-2 threads.

My best guess here is that reduce_sum deep copies all the variables to each thread. Is there a way to prevent this? Maybe to flag some of these variables as something to be passed by reference? Or is there some other way of writing out functions like this that use lots of heterogeneous data?

I'm also noticing that memory spikes after the first likelihood evaluation; presumably this is during the gradient pass? Does anyone have a good sense of what causes memory usage during the gradient to be so much higher?

reduce_sum does deep copy parameters, but not data.

If you have large vectors or matrices of parameters that need to be shared across all threads, that will be a big bottleneck, and reduce_sum is less likely to be efficient unless the per-chunk computation is really expensive.

Data variables are passed by reference and should not cause additional memory usage.
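Going by the comments in the call above, the shared arguments coming from the model block (pricing_error_mean_tm, pricing_error_sd_tm, risk_aversion, firm_switch_cost, cov_switch_cost, tm_frictions, sigma_logit, plan_fes, plan_oo_fes) are the ones that get copied; the N x J plan_fes and N x J_oo plan_oo_fes are presumably the largest of them.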


Have you tried grainsizes other than 1?


Is this something I need to annotate, i.e. with data vector ...?

Yes; other grainsizes generally increase the memory usage, and they also slow things down a bit. I've found grainsize=1 works best for my application (at least in terms of compute speed).

Anything declared in the data or transformed data blocks will get treated as data for this purpose.
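Schematically (hypothetical declarations, nothing to do with the actual model), only the block a variable comes from matters here, not any annotation at the call site:

    data {
      int<lower=1> N;
      vector[N] x;                              // data block: passed by reference
    }
    transformed data {
      vector[N] x_std = (x - mean(x)) / sd(x);  // still data: no per-thread copy
    }
    parameters {
      vector[N] theta;                          // parameter: deep-copied if passed
                                                // to reduce_sum as a shared argument
    }
    model {
      theta ~ std_normal();
    }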