Hierarchical Media Mix Model

I am trying to fit a hierarchical media mix model following the specifications in the Google white paper. The hierarchy is at the geographic level (DMAs). I have defined the model as below, with both media and control variables scaled to the range 0-1.


Stan:
functions {
// the Hill function
real Hill(real t, real ec, real slope) {
return 1 / (1 + (t / ec)^(-slope));
}
// the adstock transformation with a vector of weights
real Adstock(row_vector t, row_vector weights) {
return dot_product(t, weights) / sum(weights);
}
}

data {
// the total number of observations
int<lower=1> N;
//the number of DMAs
int<lower=1> J;
// vector of group indices
int<lower=1,upper=J> dma[N];
// the vector of sales
real<lower=0> Y[N];
// the maximum duration of lag effect, in weeks
int<lower=1> max_lag;
// the number of media channels
int<lower=1> num_media;
// a vector of 0 to max_lag - 1
row_vector[max_lag] lag_vec;
// 3D array of media variables
row_vector[max_lag] X_media[N, num_media];
// the number of other control variables
int<lower=1> num_ctrl;
// a matrix of control variables
row_vector[num_ctrl] X_ctrl[N];

}

parameters {
// residual variance
real<lower=0> noise_var;
// the population level intercept
real tau;
// the population coefficients for media variables
vector<lower=0>[num_media] beta_medias;
// the population coefficients for other control variables
vector[num_ctrl] gamma_ctrl;
// group level coefficients for intercept
vector[J] theta;
// the group coefficients for media variables
vector[num_media] beta[J];
// the group coefficients for control
vector[num_ctrl] gamma[J];
// the standard deviation of intercept
real<lower=0> sigma_tau;
// the standard deviation of betas
real<lower=0> sigma_beta;
// the retention rate and delay parameter for the adstock transformation of
// each media
vector<lower=0,upper=1>[num_media] retain_rate;
vector<lower=0,upper=max_lag-1>[num_media] delay;
// ec50 and slope for Hill function of each media
vector<lower=0,upper=1>[num_media] ec;
vector<lower=0>[num_media] slope;
}

transformed parameters {
// a vector of the mean response
real mu[N];
// the cumulative media effect after adstock
real cum_effect;
// the cumulative media effect after adstock, and then Hill transformation
row_vector[num_media] cum_effects_hill[N];
row_vector[max_lag] lag_weights;

for (nn in 1:N) {
for (media in 1 : num_media) {
for (lag in 1 : max_lag) {
lag_weights[lag] = pow(retain_rate[media], (lag - 1 - delay[media]) ^ 2);
}
cum_effect = Adstock(X_media[nn, media], lag_weights);
cum_effects_hill[nn, media] = Hill(cum_effect, ec[media], slope[media]);
}
mu[nn] = theta[dma[nn]] +
dot_product(cum_effects_hill[nn], beta[dma[nn]]) +
dot_product(X_ctrl[nn], gamma[dma[nn]]);
}
}

model {
retain_rate ~ uniform(0,1);
delay ~ uniform(0, 3);
slope ~ gamma(1.5,0.5);
ec ~ uniform(0,1);
tau ~ normal(0,1);
sigma_tau ~ cauchy(0,2.5);
sigma_beta ~ gamma(2,0.1);
theta ~ normal(0,sigma_tau);
beta_medias ~ normal(0,1);
for (j in 1:J){
beta[j] ~ normal(beta_medias,sigma_beta);
}
gamma_ctrl ~ normal(0,1);
for (j in 1:J) {
gamma[j] ~ normal(gamma_ctrl,sigma_beta);
}
noise_var ~ inv_gamma(0.05, 0.05 * 0.01);
Y ~ normal(mu, sqrt(noise_var));
}

generated quantities {
vector[N] Y_new;

for (n in 1:N) {
Y_new[n] = normal_rng(mu[n], sqrt(noise_var));
}

}
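For intuition, here is a minimal Python sketch of the two transforms in the functions block and of the lag-weight construction from transformed parameters (illustration only, not the model itself; the example numbers are made up):

```python
import numpy as np

def hill(t, ec, slope):
    # Hill saturation curve: 1 / (1 + (t / ec)^(-slope))
    return 1.0 / (1.0 + (t / ec) ** (-slope))

def adstock(x, weights):
    # weighted average of the last max_lag observations
    return np.dot(x, weights) / np.sum(weights)

def lag_weights(retain_rate, delay, max_lag):
    # geometric decay with a delayed peak; with 0-based lags,
    # (lags - delay) matches Stan's (lag - 1 - delay)
    lags = np.arange(max_lag)
    return retain_rate ** ((lags - delay) ** 2)

w = lag_weights(0.8, 2.0, 8)                      # peaks at lag == delay
x = np.array([1.0, 0.5, 0.2, 0.0, 0.0, 0.0, 0.0, 0.0])
effect = adstock(x, w)
saturated = hill(effect, ec=0.5, slope=1.0)
```

Note that `hill(ec, ec, slope)` equals 0.5 for any slope, which is why `ec` is the half-saturation point.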

I am fairly new to specifying multilevel Stan models, so I can't seem to debug the program when I get the following error:
Chain 1: Log probability evaluates to log(0), i.e. negative infinity.
Chain 1: Stan can’t start sampling from this initial value.
Chain 1:
Chain 1: Initialization between (-2, 2) failed after 100 attempts.

Initialization failure usually means the parameter constraints are wrong. You have upper=max_lag-1 on delay, but then in the model it's delay ~ uniform(0, 3). If max_lag is more than 4, that's going to be a problem: any initial delay above 3 gets zero density.
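To see why this fails at initialization: the constraint allows delay anywhere in (0, max_lag - 1), but the uniform(0, 3) prior has zero density above 3, so the log density there is negative infinity. A quick illustration in plain Python (not Stan; the initial value is made up):

```python
import math

def uniform_lpdf(x, lo, hi):
    # log density of Uniform(lo, hi); -inf outside the support
    if lo <= x <= hi:
        return -math.log(hi - lo)
    return -math.inf

max_lag = 13
delay_init = 7.5   # legal under <lower=0, upper=max_lag-1>, illegal under uniform(0, 3)
print(uniform_lpdf(delay_init, 0, 3))   # -inf: Stan rejects the initial point
```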


Thank you, yes, this indeed was an error.

I tried a non-centered reparameterization to speed up the sampling, but it did not help much. Currently I am testing the following on only 2 DMAs, i.e. X_media is 312 × 21 and X_ctrl is 312 × 8.

functions {
// the Hill function
real Hill(real t, real ec, real slope) {
return 1 / (1 + (t / ec)^(-slope));
}
// the adstock transformation with a vector of weights
real Adstock(row_vector t, row_vector weights) {
return dot_product(t, weights) / sum(weights);
}
}

data {
// the total number of observations
int<lower=1> N;
//the number of DMAs
int<lower=1> J;
// vector of group indices
int<lower=1,upper=J> dma[N];
// the vector of sales
real<lower=0> Y[N];
// the maximum duration of lag effect, in weeks
int<lower=1> max_lag;
// the number of media channels
int<lower=1> num_media;
// a vector of 0 to max_lag - 1
row_vector[max_lag] lag_vec;
// 3D array of media variables
row_vector[max_lag] X_media[N, num_media];
// the number of other control variables
int<lower=1> num_ctrl;
// a matrix of control variables
row_vector[num_ctrl] X_ctrl[N];

}

parameters {
// residual variance
real<lower=0> noise_var;
// the population level intercept
real tau;
// the population coefficients for media variables
vector<lower=0>[num_media] beta_medias;
// the population coefficients for other control variables
vector[num_ctrl] gamma_ctrl;
// group level coefficients for intercept (non-centered tau)
vector[J] tau_raw;
// the group coefficients for media variables
vector[num_media] beta_raw[J];
// the group coefficients for control
vector[num_ctrl] gamma_raw[J];
// the standard deviation of intercept
real<lower=0> sigma_tau;
// the standard deviation of betas
real<lower=0> sigma_beta;
// the standard deviation of controls
real<lower=0> sigma_gamma;
// the retention rate and delay parameter for the adstock transformation of
// each media
vector<lower=0,upper=1>[num_media] retain_rate;
vector<lower=0,upper=max_lag-1>[num_media] delay;
// ec50 and slope for Hill function of each media
vector<lower=0,upper=1>[num_media] ec;
vector<lower=0>[num_media] slope;
}

transformed parameters {
// a vector of the mean response
real mu[N];
// the cumulative media effect after adstock
real cum_effect;
// the cumulative media effect after adstock, and then Hill transformation
row_vector[num_media] cum_effects_hill[N];
row_vector[max_lag] lag_weights;
vector[num_ctrl] gamma[J];
vector[num_media] beta[J];
vector[J] theta;

for (j in 1:J) {
beta[j] = beta_medias + sigma_beta * beta_raw[j];
}

for (j in 1:J) {
gamma[j] = gamma_ctrl + sigma_gamma * gamma_raw[j];
}

for (j in 1:J) {
theta[j] = tau + sigma_tau * tau_raw[j];
}

for (nn in 1:N) {
for (media in 1 : num_media) {
for (lag in 1 : max_lag) {
lag_weights[lag] = pow(retain_rate[media], (lag - 1 - delay[media]) ^ 2);
}
cum_effect = Adstock(X_media[nn, media], lag_weights);
cum_effects_hill[nn, media] = Hill(cum_effect, ec[media], slope[media]);
}
mu[nn] = theta[dma[nn]] +
dot_product(cum_effects_hill[nn], beta[dma[nn]]) +
dot_product(X_ctrl[nn], gamma[dma[nn]]);
}
}

model {
retain_rate ~ uniform(0,1);
delay ~ uniform(0, 13);
slope ~ gamma(1.5,0.5);
ec ~ uniform(0,1);
tau ~ normal(0,1);
for (j in 1:J){
tau_raw[j] ~ normal(0,1);
}
beta_medias ~ normal(0,1);
for (j in 1:J){
beta_raw[j] ~ normal(0,1);
}
gamma_ctrl ~ normal(0,1);
for (j in 1:J) {
gamma_raw[j] ~ normal(0,1);
}
noise_var ~ normal(0, 0.05 * 0.01);
Y ~ normal(mu, sqrt(noise_var));
}
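As a sanity check that the non-centered parameterization is equivalent: drawing raw ~ normal(0, 1) and setting beta = mu + sigma * raw gives the same distribution as beta ~ normal(mu, sigma). A quick simulation confirms this (illustrative Python, not part of the model; mu and sigma are made-up values):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 2.0, 0.5

# centered draw: sample directly from normal(mu, sigma)
centered = rng.normal(mu, sigma, size=100_000)

# non-centered draw: shift and scale a standard normal
non_centered = mu + sigma * rng.normal(0.0, 1.0, size=100_000)

print(centered.mean(), non_centered.mean())   # both near 2.0
print(centered.std(), non_centered.std())     # both near 0.5
```

The reparameterization changes the geometry HMC sees (the raw parameters are a priori independent of the hierarchical scales), not the model itself.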

My complete data set has 210 DMAs, i.e. X_media is 32000 × 21 and X_ctrl is 32000 × 8. Are there any suggestions to improve the sampling speed? Would increasing the computational power help in a substantial way?

Thanks

Stan supports threading and MPI via the map_rect function. It's a rudimentary way of sharding the log-probability calculation, but it's reported to deliver "embarrassingly parallel" magnitude speed-ups. There is a simple tutorial on how to stand it up here.

Thank you, I'll definitely look into it. Just for reference, can you suggest any system specifications suitable for multithreading?

The official guide for MPI parallelism is here:

https://github.com/stan-dev/math/wiki/MPI-Parallelism

Once you've made your MPI version, try running your model with a progressively larger number of iterations on 1 core to understand how it scales. For N < 40 cores, divide your estimate by N to get the run time.

Thank you emiruz, I have been reading up on all this and working on converting my Stan program to the specifications needed for map_rect. However, I am having trouble translating my 3D array X_media to the requirements. Is there a way I can use the transformed parameters block in the reduce function defined at the top?

It's a bit painful, especially when your model is complicated, because you have to pack all your parameters into the arrays the interface provides. If you have a matrix of the form:

aaaa
bbbb
cccc

You have to write it one row at a time into the form:

aaaabbbbcccc

and then, inside the function you pass to map_rect, recover it by reading back into your matrix modulo the number of columns.

It’s ugly and brittle but it works and it’s what we have at the moment.
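The round trip looks like this, sketched in Python for clarity (in Stan you would do the equivalent flattening when building the data and the unpacking with index arithmetic inside the shard function; the matrix here is made up):

```python
import numpy as np

X = np.array([[1, 2, 3, 4],
              [5, 6, 7, 8],
              [9, 10, 11, 12]])
n_rows, n_cols = X.shape

# pack: write the matrix one row at a time into a flat array
flat = X.reshape(-1)   # [1 2 3 4 5 6 7 8 9 10 11 12]

# unpack inside the shard function: read back n_cols elements per row
recovered = np.array([flat[i * n_cols:(i + 1) * n_cols] for i in range(n_rows)])
```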

Thank you for all the help so far.

I was looking into GPU support for Stan and was able to configure Stan on an NVIDIA GPU. I have set up the model to run on CmdStan, but despite the model showing up in the GPU activity list, I see no major speed-ups. Am I missing something regarding the GPU setup?

Only a handful of functions invoke the GPU. You need to check Stan Math to see whether your model explicitly uses them; map_rect doesn't make use of the GPU by itself.

Interesting paper btw, it’s on my list to read properly.

Hi, I am trying to apply a hierarchical MMM to my data, and I am not clear on how to shape the data for the model.

For N weeks, M media channels, C control variables, lag duration L and G geographies, the data for the baseline MMM looks like the following:

  • X_media is an N × M × L array,
  • X_ctrl is N × C,
  • y is of length N.

I have geography-level information for all of y, X_media and X_ctrl, so I want to fit a hierarchical MMM. What are the expected data sizes for X_media, X_ctrl and y? Below are the sizes as per my understanding:

  • X_media is an N × G × M × L array,
  • X_ctrl is N × G × C,
  • y is N × G.
    To get the global y, I would sum y across G.
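Concretely, the layout I have in mind stacks the G geographies so that every row is a (week, geography) pair, with a group index mapping each row to its geography (illustrative Python; the sizes and values below are made up):

```python
import numpy as np

N, G, M, L, C = 104, 3, 2, 8, 4   # weeks, geos, media channels, lags, controls

rng = np.random.default_rng(1)
X_media = rng.random((N * G, M, L))   # 3-D media array: one slice per (week, geo) row
X_ctrl  = rng.random((N * G, C))      # control matrix, one row per (week, geo)
y       = rng.random(N * G)           # sales, one value per (week, geo)
dma     = np.repeat(np.arange(1, G + 1), N)   # group index of length N*G

# with rows stacked geography-major, a global per-week series is the
# sum of y over geographies within each week
y_global = y.reshape(G, N).sum(axis=0)
```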

Could someone please verify this approach? Any help would be really appreciated! Thanks!

(My apologies, since this is not exactly a PyStan issue; the reason I decided to post it here is that this is the only post I have come across so far where people have actually discussed the implementation.)

Google created a Python library for this model: GitHub - google/lightweight_mmm, a lightweight Bayesian Marketing Mix Modeling (MMM) library that allows users to easily train MMMs and obtain channel attribution information.

While I still recommend that anyone learn the model by peeling back the onion, it is good to try out your data with the library.
