How to vectorize Stan code to reduce run time

I have the Stan code below for a marketing mix modelling application with thousands of records and hundreds of media/control variables, using **pystan version 2.19.1.1**. I am new to Stan. The current setup takes 10-12 hours to run on daily granular data with hundreds of media and control variables. The Stan code and the PyStan call are below:

functions {
 // the Hill function
 real Hill(real t, real ec, real slope) {
  return 1 / (1 + (t / ec)^(-slope));
 }
 // the adstock transformation with a vector of weights
 real Adstock(row_vector t, row_vector weights) {
  return dot_product(t, weights) / sum(weights);
 }
}

data {
 // the total number of observations
 int<lower=1> N;
 
  // the total number of training observations
 int<lower=1> T;
 // the total number of holdout observations
 int<lower=0> H;
 
 int<lower=0> n_interactions;
 int interaction_left[n_interactions];
 int interaction_right[n_interactions];

 int<lower=0> tau_dist_type;

 int<lower=0> noise_var_dist_type;


 real tau_dist_mean;
 real<lower=0> tau_dist_sd;

 real noise_var_dist_mean;
 real<lower=0> noise_var_dist_sd;


 // training data indexes 
 int training_index[T];
 // holdout data indexes 
 int holdout_index[H];

 real Y_train[T];
 real Y_holdout[H];
 
 // the maximum duration of lag effect, in weeks
 int<lower=1> max_lag;
 // the number of media channels
 int<lower=1> num_media;
 row_vector[num_media] media_prior_dist_type;
 row_vector[num_media] media_prior_mean;
 row_vector[num_media] media_prior_sd;

 row_vector[num_media] retain_rate_dist_type;
 row_vector[num_media] retain_rate_dist_mean;
 row_vector[num_media] retain_rate_dist_sd;

 row_vector[num_media] delay_dist_type;
 row_vector[num_media] delay_dist_mean;
 row_vector[num_media] delay_dist_sd;

 row_vector[num_media] slope_dist_type;
 row_vector[num_media] slope_dist_mean;
 row_vector[num_media] slope_dist_sd;

 row_vector[num_media] ec_dist_type;
 row_vector[num_media] ec_dist_mean;
 row_vector[num_media] ec_dist_sd;



 
 // a vector of 0 to max_lag - 1
 //row_vector[max_lag] lag_vec;
 // 3D array of media variables
 row_vector[max_lag] X_media[N, num_media];
 // the number of other control variables
 int<lower=1> num_ctrl;
 row_vector[num_ctrl] ctrl_prior_dist_type;
 row_vector[num_ctrl] ctrl_prior_mean;
 row_vector[num_ctrl] ctrl_prior_sd;
 
 // a matrix of control variables
 row_vector[num_ctrl] X_ctrl[N];
 
 
 row_vector<lower=0>[num_media] slope;
}

parameters {
 // residual variance
 real<lower=0> noise_var;
 // the intercept
 real tau;
 // the coefficients for media variables
 vector<lower=0>[num_media] beta_medias;
 // coefficients for other control variables
 vector[num_ctrl] gamma_ctrl;
 // the retention rate and delay parameter for the adstock transformation of
 // each media
 vector<lower=0,upper=1>[num_media] retain_rate;
 //vector<lower=0,upper=max_lag-1>[num_media] delay;
 // ec50 and slope for Hill function of each media
 vector<lower=0,upper=1>[num_media] ec;
 vector<lower=0>[n_interactions] beta_interactions;
 // vector<lower=0>[num_media] slope;
}

transformed parameters {
 // a vector of the mean response
 real mu[T];
 // the cumulative media effect after adstock
 real cum_effect;
 // the cumulative media effect after adstock, and then Hill transformation
 row_vector[num_media] cum_effects_hill[T];
 row_vector[max_lag] lag_weights;
 row_vector[n_interactions] cum_effects_hill_interaction[T];
 
 
 for (nn in 1:T) {
  for (media in 1 : num_media) {
   for (lag in 1 : max_lag) {
    lag_weights[lag] <- pow(retain_rate[media], (lag - 1) ); 
   }
   cum_effect <- Adstock(X_media[training_index[nn], media], lag_weights);
   cum_effects_hill[nn, media] <- Hill(cum_effect, ec[media], slope[media]);
  }
  

  
  if(n_interactions > 0)
   for (inter in 1:n_interactions){
    cum_effects_hill_interaction[nn,inter] = cum_effects_hill[nn,interaction_left[inter]]*cum_effects_hill[nn,interaction_right[inter]]; 
   }
   
  if(n_interactions > 0) 
    mu[nn] <- tau +
              dot_product(cum_effects_hill[nn], beta_medias) +
              dot_product(X_ctrl[training_index[nn]], gamma_ctrl) + 
              dot_product(cum_effects_hill_interaction[nn],beta_interactions);
  else  
   mu[nn] <- tau +
            dot_product(cum_effects_hill[nn], beta_medias) +
            dot_product(X_ctrl[training_index[nn]], gamma_ctrl);
 }
}
model {

    tau ~ normal(tau_dist_mean,tau_dist_sd);

  for (media_index in 1 : num_media) {
     beta_medias[media_index] ~ normal(media_prior_mean[media_index],media_prior_sd[media_index]);
 
   

    retain_rate[media_index] ~ normal(retain_rate_dist_mean[media_index],retain_rate_dist_sd[media_index]);
   
 
    slope[media_index] ~ normal(slope_dist_mean[media_index],slope_dist_sd[media_index]);
    ec[media_index] ~ beta(ec_dist_mean[media_index],ec_dist_sd[media_index]);
  
}
  for (ctrl_index in 1 : num_ctrl) {
 
    gamma_ctrl[ctrl_index] ~ normal(ctrl_prior_mean[ctrl_index],ctrl_prior_sd[ctrl_index]);
  
    
  }
 
  noise_var ~ inv_gamma(noise_var_dist_mean,noise_var_dist_sd);
  Y_train ~ normal(mu, sqrt(noise_var));
}
import pystan

stan_file = 'stan_code.stan'
stanmodel = pystan.stan(
    stan_file,
    data=stan_data,
    chains=chains,
    control={'max_treedepth': max_treedepth, 'adapt_delta': adapt_delta, 'stepsize': stepsize},
    iter=iterations,
    verbose=False,
    n_jobs=n_jobs,
    seed=9966
)

If you look at the transformed parameters block above, it loops over every training record, every media channel, and the full lag window (a max lag of 40 days), so the number of inner iterations is very large.

My objective is to replace these for loops with matrix/array multiplications and change the *Hill* and *Adstock* functions to accept matrices/arrays. I tried to vectorize this with the code below, using the newer pystan 3.10.0, but it produces several errors. Can you help me vectorize this Stan code so I can reduce the run time, or suggest another way to reduce it? Here is my attempt:

functions {
  // The Hill function (vectorized for each media channel)
  vector Hill(vector t, vector ec, vector slope) {
    return 1 ./ (1 + pow((t ./ ec),(-slope)));
  }

  // The Adstock transformation (vectorized)
  vector Adstock(vector t, vector weights) {
    return (t.*weights)/ rowwise_sum(weights);
  }
}

data {
 // the total number of observations
 int<lower=1> N;
 
  // the total number of training observations
 int<lower=1> T;
 // the total number of holdout observations
 int<lower=0> H;
 
 int<lower=0> n_interactions;
 array[n_interactions] int interaction_left;
 array[n_interactions] int interaction_right;

 int<lower=0> tau_dist_type;

 int<lower=0> noise_var_dist_type;


 real tau_dist_mean;
 real<lower=0> tau_dist_sd;

 real noise_var_dist_mean;
 real<lower=0> noise_var_dist_sd;


 // training data indexes 
 array[T] int training_index;
 // holdout data indexes 
 array[H] int holdout_index;

 array[T] real Y_train;
 array[H] real Y_holdout;
 
 // the maximum duration of lag effect, in weeks
 int<lower=1> max_lag;
 // the number of media channels
 int<lower=1> num_media;
 row_vector[num_media] media_prior_dist_type;
 row_vector[num_media] media_prior_mean;
 row_vector[num_media] media_prior_sd;

 row_vector[num_media] retain_rate_dist_type;
 row_vector[num_media] retain_rate_dist_mean;
 row_vector[num_media] retain_rate_dist_sd;

 row_vector[num_media] delay_dist_type;
 row_vector[num_media] delay_dist_mean;
 row_vector[num_media] delay_dist_sd;

 row_vector[num_media] slope_dist_type;
 row_vector[num_media] slope_dist_mean;
 row_vector[num_media] slope_dist_sd;

 row_vector[num_media] ec_dist_type;
 row_vector[num_media] ec_dist_mean;
 row_vector[num_media] ec_dist_sd;



 
 // a vector of 0 to max_lag - 1
 //row_vector[max_lag] lag_vec;
 // 3D array of media variables
 array[N, num_media] row_vector[max_lag] X_media;
 // the number of other control variables
 int<lower=1> num_ctrl;
 row_vector[num_ctrl] ctrl_prior_dist_type;
 row_vector[num_ctrl] ctrl_prior_mean;
 row_vector[num_ctrl] ctrl_prior_sd;
 
 // a matrix of control variables
 array[N] row_vector[num_ctrl] X_ctrl;
 
 
 row_vector<lower=0>[num_media] slope;
}

parameters {
 // residual variance
 real<lower=0> noise_var;
 // the intercept
 real tau;
 // the coefficients for media variables
 vector<lower=0>[num_media] beta_medias;
 // coefficients for other control variables
 vector[num_ctrl] gamma_ctrl;
 // the retention rate and delay parameter for the adstock transformation of
 // each media
 vector<lower=0,upper=1>[num_media] retain_rate;
 //vector<lower=0,upper=max_lag-1>[num_media] delay;
 // ec50 and slope for Hill function of each media
 vector<lower=0,upper=1>[num_media] ec;
 vector<lower=0>[n_interactions] beta_interactions;
 // vector<lower=0>[num_media] slope;
}

transformed parameters {
 matrix[T, num_media] cum_effects;               // Cumulative effects after Adstock
 matrix[T, num_media] cum_effects_hill;          // After Hill transformation
 matrix[T, n_interactions] cum_effects_hill_interaction; // Interaction effects
 vector[T] mu;                                   // Mean response
 array[num_media] row_vector[max_lag] lag_weights;

 for (media in 1 : num_media) {
   for (lag in 1 : max_lag) {
    lag_weights[media,lag] = pow(retain_rate[media], (lag - 1) ); 
   }
 }

  // Apply Adstock transformation
  
  cum_effects = Adstock(X_media[training_index, ], lag_weights);

  // Apply Hill transformation using vectorized `Hill` function
  cum_effects_hill = Hill(cum_effects, ec, slope);

  // Compute interaction effects if applicable
  if (n_interactions > 0) {
    for (inter in 1:n_interactions) {
      cum_effects_hill_interaction[, inter] = 
        cum_effects_hill[, interaction_left[inter]] .* cum_effects_hill[, interaction_right[inter]];
    }
  }

  // Compute mu using dot products for medias, controls, and interactions
  if (n_interactions > 0) {
    mu = tau + 
         cum_effects_hill * beta_medias + 
         X_ctrl[training_index, ] * gamma_ctrl + 
         cum_effects_hill_interaction * beta_interactions;
  } else {
    mu = tau + 
         cum_effects_hill * beta_medias + 
         X_ctrl[training_index, ] * gamma_ctrl;
  }
}

model {
    tau ~ normal(tau_dist_mean,tau_dist_sd);
  for (media_index in 1 : num_media) {
    beta_medias[media_index] ~ normal(media_prior_mean[media_index],media_prior_sd[media_index]);

 retain_rate[media_index] ~    normal(retain_rate_dist_mean[media_index],retain_rate_dist_sd[media_index]);
    slope[media_index] ~ normal(slope_dist_mean[media_index],slope_dist_sd[media_index]);
    ec[media_index] ~ beta(ec_dist_mean[media_index],ec_dist_sd[media_index]);
  }
  for (ctrl_index in 1 : num_ctrl) {
    gamma_ctrl[ctrl_index] ~ normal(ctrl_prior_mean[ctrl_index],ctrl_prior_sd[ctrl_index]); 
  }
  
  noise_var ~ inv_gamma(noise_var_dist_mean,noise_var_dist_sd);
  Y_train ~ normal(mu, sqrt(noise_var));
}
import stan
import nest_asyncio
nest_asyncio.apply()

# Path to your Stan model file
stan_file = 'stan_codev2.stan'

# Read the Stan model code
with open(stan_file, 'r') as file:
    stan_code = file.read()

# Compile the Stan model
stan_model = stan.build(stan_code, data=stan_data, random_seed=9966)

# Sample from the posterior
fit = stan_model.sample(
    num_chains=chains,
    num_samples=iterations,
    num_warmup=int(iterations / 2),
    adapt_delta=adapt_delta,
    max_treedepth=max_treedepth,
    step_size=stepsize
)

# Access results
print(fit)

Let me know if I missed any information. Thanks in advance.

Hello, that code from Google is really difficult to work with. I am also trying to make it faster or at least modify it meaningfully, but so far no success.

If you are doing this for an actual problem, I suggest you just use their Python package: GitHub - google/lightweight_mmm: LightweightMMM 🦇 is a lightweight Bayesian Marketing Mix Modeling (MMM) library that allows users to easily train MMMs and obtain channel attribution information.

It is essentially the same model. Good luck and report back if you manage to speed it up.

Thanks @sonicking, I looked at LightweightMMM at a very high level.
It seems it doesn’t support different distributions for each media channel, and its adstock is a bit different from the one Google describes in their paper.

We have these three models available in lightweightmmm, each with certain limitations; I am exploring further how to define these priors and transforms as custom. Their configurations are below (see the usage sketch after the excerpt).
"carryover":
    immutabledict.immutabledict({
        _AD_EFFECT_RETENTION_RATE:
            dist.Beta(concentration1=1., concentration0=1.),
        _PEAK_EFFECT_DELAY:
            dist.HalfNormal(scale=2.),
        _EXPONENT:
            dist.Beta(concentration1=9., concentration0=1.)
    }),
"adstock":
    immutabledict.immutabledict({
        _EXPONENT: dist.Beta(concentration1=9., concentration0=1.),
        _LAG_WEIGHT: dist.Beta(concentration1=2., concentration0=1.)
    }),
"hill_adstock":
    immutabledict.immutabledict({
        _LAG_WEIGHT:
            dist.Beta(concentration1=2., concentration0=1.),
        _HALF_MAX_EFFECTIVE_CONCENTRATION:
            dist.Gamma(concentration=1., rate=1.),
        _SLOPE:
            dist.Gamma(concentration=1., rate=1.)
    })
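For reference, choosing between these in the package is just a constructor argument (the model_name values match the keys above); the exact fit() arguments differ between lightweight_mmm versions, so treat that part as an assumption and check the README for the version you install:

from lightweight_mmm import lightweight_mmm

# model_name selects which prior/transform set above is used:
# "carryover", "adstock", or "hill_adstock"
mmm = lightweight_mmm.LightweightMMM(model_name="hill_adstock")

# Fitting then looks roughly like the following (argument names vary by version,
# so this call is only indicative):
# mmm.fit(media=media_data, target=target, ...)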

Not sure what disconnect you are noticing. Both the paper and the package are from Google and it is the same model. But best of luck.

Hi, @Samip_Tandon and welcome to the Stan forums. Sorry it’s taken so long to get to this.

I’d recommend using cmdstanpy for scalability. Also, if it’s mixing well, then you should probably reduce the number of iterations to where you get an ESS of around 100. But that’s not answering your question. (Also, it’s “Stan” because it’s not an acronym.)

Wow, 40 data variables. This is a big program, which is probably why you haven’t gotten any responses.

First, you’re going to need to update to our new array syntax, which looks like this:

array[N] int y;

The only thing I can see to do to optimize that code further is a bit of vectorization. It’s not that loops are slow, but rather that vectorization lets us compress and partially evaluate the automatic differentiation. For example,

for (media_index in 1 : num_media) {
  beta_medias[media_index]
    ~ normal(media_prior_mean[media_index],media_prior_sd[media_index]);

is more efficiently written as:

beta_medias ~ normal(media_prior_mean, media_prior_sd);

It’s not going to be a huge gain because the scales vary with each observation. This can be done with all four distribution statements in that loop.
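Spelled out, the whole media prior loop collapses to four vectorized statements (the row_vector hyperparameters from the data block can be mixed with the vector parameters directly, since vectorized distribution statements only require matching sizes):

beta_medias ~ normal(media_prior_mean, media_prior_sd);
retain_rate ~ normal(retain_rate_dist_mean, retain_rate_dist_sd);
slope ~ normal(slope_dist_mean, slope_dist_sd);  // slope is data in the posted model, so this only adds a constant to the target
ec ~ beta(ec_dist_mean, ec_dist_sd);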

Same thing can be done for the second loop in the model block.
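That is:

gamma_ctrl ~ normal(ctrl_prior_mean, ctrl_prior_sd);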

You don’t say what the constants are, but you don’t want to use inverse gamma priors with epsilon parameters like inv_gamma(0.001, 0.001); it has bad computational and statistical ramifications, as Andrew Gelman has written about specifically. Usually we just put priors directly on the scale because it’s easier to interpret in the same units as the location, but it’s also OK to do what you’re doing.
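For example, a sketch of putting the prior directly on the scale (the half-normal scale of 1 here is arbitrary and would need to be matched to the units of Y):

parameters {
  real<lower=0> sigma;   // residual standard deviation, replacing noise_var
  ...
}
model {
  sigma ~ normal(0, 1);  // half-normal because of the lower-bound constraint
  ...
  Y_train ~ normal(mu, sigma);
}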

In the transformed parameters block, you can also vectorize operations. For example, you can use linspaced_vector to replace

 for (lag in 1 : max_lag) {
	lag_weights[lag] <- pow(retain_rate[media], (lag - 1) ); 
      }

with

lag_weights = pow(retain_rate[media], linspaced_vector(max_lag, 0, max_lag - 1))';

That may not even really speed things up because there’s not much to save in the autodiff graph.

Similarly, you could compute cum_effect for all indices at once, but then you’d have to vectorize Hill() to match. Speaking of Hill(), I’d convert

  real Hill(real t, real ec, real slope) {
    return 1 / (1 + (t / ec)^(-slope));
  }

to

real Hill(real t, real ec, real slope) {
  return inv(1 + (ec / t)^slope);
}
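If you want to avoid the division and power entirely, the same function can also be written on the log-odds scale (algebraically identical for t > 0):

real Hill(real t, real ec, real slope) {
  // inv_logit(x) = 1 / (1 + exp(-x)), so this equals 1 / (1 + (ec / t)^slope)
  return inv_logit(slope * (log(t) - log(ec)));
}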

The real speedups come from removing redundant/duplicated computations and I don’t see any left in this code.

You can keep vectorizing, e.g., cum_effects_hill_interaction can vectorize over inter.
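With multiple indexing, that inner loop over inter disappears; against the original array-of-row-vectors declarations it would look something like:

cum_effects_hill_interaction[nn] = cum_effects_hill[nn, interaction_left]
                                   .* cum_effects_hill[nn, interaction_right];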

I find this to be too much code duplication,

for (nn in 1:T) {
    ....
    if (n_interactions > 0) 
      mu[nn] <- tau +
	dot_product(cum_effects_hill[nn], beta_medias) +
	dot_product(X_ctrl[training_index[nn]], gamma_ctrl) + 
	dot_product(cum_effects_hill_interaction[nn],beta_interactions);
    else  
      mu[nn] <- tau +
	dot_product(cum_effects_hill[nn], beta_medias) +
	dot_product(X_ctrl[training_index[nn]], gamma_ctrl);

and I’d prefer to see

for (nn in 1:T) {
  mu[nn] = tau
      + dot_product(cum_effects_hill[nn], beta_medias)
      + dot_product(X_ctrl[training_index[nn]], gamma_ctrl);
  if (n_interactions > 0) {
    mu[nn] += dot_product(cum_effects_hill_interaction[nn],beta_interactions);
  }
}

Now we can turn those dot-products into matrix products, which can lead to real savings

mu = tau + cum_effects_hill * beta_medias + X_ctrl[training_index] * gamma_ctrl;
for (t in 1:T) {
  if (n_interactions > 0) {
    mu[t] += dot_product(cum_effects_hill_interaction[t], beta_interactions);
  }
}

Basically, anywhere you see an iterated dot product, think about whether the operation can be pulled out of the loop and turned into a matrix multiply. One matrix multiply is a whole lot faster than a bunch of dot products because of the blocking, memory locality, and autodiff speedups.
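If you go one step further and declare cum_effects_hill as matrix[T, num_media], cum_effects_hill_interaction as matrix[T, n_interactions], and X_ctrl as matrix[N, num_ctrl], the interaction term becomes a matrix product too and the remaining loop disappears:

mu = tau + cum_effects_hill * beta_medias + X_ctrl[training_index] * gamma_ctrl;
if (n_interactions > 0) {
  mu += cum_effects_hill_interaction * beta_interactions;
}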


Thank you so much @Bob_Carpenter.
I also realized that, of those 8-10 hours of model running time, around 50% is spent extracting the model summary, as below:

import pandas as pd

model_summary = stanmodel.summary()
model_summary = pd.DataFrame(model_summary['summary'],
                             columns=model_summary['summary_colnames'],
                             index=model_summary['summary_rownames'])

Is there any way to speed up this model summary extraction, and what does it depend on? I am using 4 chains.

Thanks in advance.

You might want to try CmdStanPy rather than PyStan. It offloads things to disk then reads back in and uses our C++ code to do the analysis. Even so, it’s slow.
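A minimal sketch of that workflow, reusing the stan_data dict and sampler settings from the original post (the iteration counts here are placeholders, and the model file would need the new array syntax to compile with a current CmdStan):

from cmdstanpy import CmdStanModel

model = CmdStanModel(stan_file='stan_code.stan')  # compiles the model via CmdStan
fit = model.sample(
    data=stan_data,
    chains=4,
    iter_warmup=500,
    iter_sampling=500,
    adapt_delta=adapt_delta,
    max_treedepth=max_treedepth,
    seed=9966,
)

# summary() runs CmdStan's stansummary and returns a pandas DataFrame
model_summary = fit.summary()
print(model_summary.head())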

One thing you can do to cut down summary time is remove variables. If there are transformed parameters, move them into the model block as local variables. If there are generated quantities, compute them using standalone generated quantities later.