So glad I have found this topic! After spending a weekend trying to fit a double nonlinear function to my MMM model I feel really helpless.
It’s my first post here and it will be long - I don’t mean to bother you guys too much, but I’m kind of desperate and feel like “deep” description of the problem is essential here. Just in case I will bold the questions.
I’m in @alexhallam shoeas at the moment trying to create a Marketing Mix Hierarchical Bayesian Model. As an example I would like to use cheese dataset from “bayesm” package. I will do my best to explain the transformations on the simple example and move to more advanced details.
Imagine we want to build a model for a single retailer from the dataset. Our purpose is to transform the data so they capture both carry-over and diminishing returns shape effect (logistic function in this case). Our effects are predefined at the moment; decay rate is set to 0.25 and Sigma from the logistic function to 2.
PRICE is for price ;) and DISP is an advertising variable.
library(dplyr)
adstockTransform <- function(x){
stats::filter( 1/(1+exp(-2*x)), 0.25, method = "recursive")
}
mmm.data <- cheese %>% filter(RETAILER == 'CHICAGO - DOMINICK') %>%
select(-RETAILER) %>%
mutate(log.volume = log(VOLUME), log.price = log(PRICE), adstock= adstockTransform(DISP)) %>%
select(-VOLUME, -DISP, -PRICE)
We can run a regression:
fit <- lm(log.volume ~ adstock + log.price, data=mmm.data)
summary(fit)
And the result is:
Call:
lm(formula = log.volume ~ adstock + log.price, data = mmm.data)
Residuals:
Min 1Q Median 3Q Max
-0.53274 -0.13505 -0.02216 0.08934 0.58237
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 11.6611 0.3029 38.50 <2e-16 ***
adstock 0.6719 0.2615 2.57 0.0128 *
log.price -2.9680 0.1442 -20.58 <2e-16 ***
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
Residual standard error: 0.2468 on 58 degrees of freedom
Multiple R-squared: 0.9016, Adjusted R-squared: 0.8982
F-statistic: 265.6 on 2 and 58 DF, p-value: < 2.2e-16
Nice! Now let’s try to fit decay effect and sigma with non-linear methods.
adstockTransform <- function(x, as, beta){
stats::filter( 1/(1+exp(-beta*x)), as, method = "recursive")
}
library(minpack.lm)
fits2 <- nlsLM(log.volume ~ const + B1*adstockTransform(DISP, as, beta) + B2*log.price,
start = c(const = 10, B1 = 0.5, as = 0.2, beta = 2, B2 = -3),
lower = c(const = 5, B1 = 0.2, as = 0.1, beta = 1.5, B2 = -5),
upper = c(const = 12, B1 = 2, as = 0.4, beta = 6, B2 = -2),
data=mmm.data)
summary(fits2)
Formula: log.volume ~ const + B1 * adstockTransform(DISP, as, beta) +
B2 * log.price
Parameters:
Estimate Std. Error t value Pr(>|t|)
const 12.0000 0.2842 42.230 <2e-16 ***
B1 0.3637 0.3119 1.166 0.249
as 0.1000 0.5807 0.172 0.864
beta 6.0000 8.8240 0.680 0.499
B2 -3.0502 0.1718 -17.749 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.2504 on 56 degrees of freedom
Number of iterations to convergence: 23
Achieved convergence tolerance: 1.49e-08
It’s not perfect, but works fine for an aggregated model and allows us to estimate the non-linear parameters: as and beta. However, having cross-sectional data does complicate things. We can use two methods now:
-
Fixed Effect Modeling (LSDV Approach) In this method we regress the dependent variable on predictors at the individual level but use as predictors a set of n-1 dummy variables for the n groups to identify the group membership of each individual in the data set. This method allows differences in intercepts of the individual groups and therefore considers variations between groups. However, the slope of each predictor is assumed to be same for all groups. In reality we expect the predictor variables (such as advertising) to have differing effects on the separate groups and so the slope of the predictor variables should change between groups. The regression coefficient for each predictor obtained from the Fixed effects method therefore is the weighted average of the regression coefficients in each of the individual groups. It is widely used; the curvature parameters may be found after applying nlsLM (for example) on the demeaned data so the transformed variables may be then put into the final FE model.
-
Hierarchical Bayesian model. It may be applied if coefficients are assumed to be different across markets. This method accounts for both within group variation and between group variation. The responsiveness of advertising can be expected to vary between different markets as well as different brands, and the advantage of mixed models is that they inherently account for this feature. These models can also handle data structures that have multiple levels (e.g.: brands within stores within markets). This method also allows parameter estimation of advertising effects in markets with very few observations as well as helps to quantify or differentiate the individual effect of each group from the average effect across all groups. Also, these models are very powerful since all the available information for all groups is used to obtain the coefficients for each subgroup and even observations with missing data are used instead of being deleted.
Let’s skip the FE model and jump straight to the HB one. For now we also skip the AdStock and diminishing returns function.
cheese_brm1 <- brm(bf(log(VOLUME) ~ DISP + log(PRICE) + (DISP + log(PRICE)|RETAILER)), data = cheese, iter = 4000)
summary(cheese.brm1)
Family: gaussian
Links: mu = identity; sigma = identity
Formula: log(VOLUME) ~ DISP + log(PRICE) + (DISP + log(PRICE) | RETAILER)
Data: cheese (Number of observations: 5555)
Samples: 4 chains, each with iter = 4000; warmup = 2000; thin = 1;
total post-warmup samples = 8000
Group-Level Effects:
~RETAILER (Number of levels: 88)
Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
sd(Intercept) 1.14 0.10 0.96 1.35 2232 1.00
sd(DISP) 1.24 0.15 0.97 1.56 2310 1.00
sd(logPRICE) 0.85 0.08 0.70 1.02 1518 1.00
cor(Intercept,DISP) 0.11 0.12 -0.13 0.33 1747 1.00
cor(Intercept,logPRICE) -0.74 0.05 -0.83 -0.62 2052 1.00
cor(DISP,logPRICE) -0.07 0.13 -0.31 0.18 1509 1.00
Population-Level Effects:
Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
Intercept 10.33 0.13 10.08 10.60 1687 1.00
DISP 1.17 0.15 0.88 1.49 1517 1.00
logPRICE -2.20 0.11 -2.41 -1.99 1601 1.00
Family Specific Parameters:
Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
sigma 0.25 0.00 0.24 0.25 8000 1.00
Samples were drawn using sampling(NUTS). For each parameter, Eff.Sample
is a crude measure of effective sample size, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
Great! Seems like I managed to build my first Hierarchical Bayesian Model with brms! However, I didn’t touch the priors and some of the results seem to be a bit suspicious; after running coefs(cheese.brm1) I can see that DISP estimated values for some retailers are negative. Well, it should not happen as we believe that advertising always has a positive impact on sales. How can I “strict” the DISP coefficient to be positive using brms…?
Moving on. I would like to add non-linear parameters to my model, just like in the first simple example. I’m new to Bayesian analysis, so don’t feel offended by the priors I set here. :)
For now I will use only the diminishing returns transformation, as Adstock function from the first example does not work here…
I assumed that intercept, DISP “global” coefficient and PRICE may vary among the retailers, but their diminishing returns curvature is the same.
prior0 <- prior(normal(10,5), nlpar = "const")
prior1 <- prior(normal(1, 0.5), nlpar = "beta1")
prior2 <- prior(normal(2, 3), nlpar = "beta2")
prior3 <- prior(normal(-2, 0.1), nlpar = "beta3")
bfx5 <- bf(log(VOLUME) ~ const + beta1/(1+exp(-beta2*DISP)) + beta3*log(PRICE),
const + beta1 + beta3 ~ 1 + (1|RETAILER), beta2 ~ 1,
nl = TRUE)
ups3 <- brm(bfx5, data = cheese, prior = c(prior0,prior1, prior2, prior3), iter = 2000)
Family: gaussian
Links: mu = identity; sigma = identity
Formula: log(VOLUME) ~ const + beta1/(1 + exp(-beta2 * DISP)) + beta3 * log(PRICE)
const ~ 1 + (1 | RETAILER)
beta1 ~ 1 + (1 | RETAILER)
beta3 ~ 1 + (1 | RETAILER)
beta2 ~ 1
Data: cheese (Number of observations: 5555)
Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
total post-warmup samples = 4000
Group-Level Effects:
~RETAILER (Number of levels: 88)
Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
sd(const_Intercept) 1.10 0.11 0.91 1.33 905 1.00
sd(beta1_Intercept) 1.43 0.18 1.12 1.80 945 1.00
sd(beta3_Intercept) 0.78 0.07 0.65 0.94 1005 1.00
Population-Level Effects:
Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
const_Intercept 9.43 0.15 9.14 9.71 1323 1.00
beta1_Intercept 1.70 0.19 1.34 2.08 934 1.01
beta3_Intercept -2.11 0.07 -2.24 -1.98 1199 1.00
beta2_Intercept 2.77 0.26 2.29 3.31 2076 1.00
Family Specific Parameters:
Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
sigma 0.25 0.00 0.24 0.25 4000 1.00
Samples were drawn using sampling(NUTS). For each parameter, Eff.Sample
is a crude measure of effective sample size, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
summary(cheese_brm6)
Family: gaussian
Links: mu = identity; sigma = identity
Formula: log(VOLUME) ~ DISP + log(PRICE) + (DISP + log(PRICE) | RETAILER)
Data: cheese (Number of observations: 5555)
Samples: 4 chains, each with iter = 4000; warmup = 2000; thin = 1;
total post-warmup samples = 8000
Group-Level Effects:
~RETAILER (Number of levels: 88)
Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
sd(Intercept) 1.14 0.10 0.96 1.35 2232 1.00
sd(DISP) 1.24 0.15 0.97 1.56 2310 1.00
sd(logPRICE) 0.85 0.08 0.70 1.02 1518 1.00
cor(Intercept,DISP) 0.11 0.12 -0.13 0.33 1747 1.00
cor(Intercept,logPRICE) -0.74 0.05 -0.83 -0.62 2052 1.00
cor(DISP,logPRICE) -0.07 0.13 -0.31 0.18 1509 1.00
Population-Level Effects:
Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
Intercept 10.33 0.13 10.08 10.60 1687 1.00
DISP 1.17 0.15 0.88 1.49 1517 1.00
logPRICE -2.20 0.11 -2.41 -1.99 1601 1.00
Family Specific Parameters:
Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
sigma 0.25 0.00 0.24 0.25 8000 1.00
Samples were drawn using sampling(NUTS). For each parameter, Eff.Sample
is a crude measure of effective sample size, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
The results of these models are similiar, however, some retailers have (again) negative values of beta1 which should not happen.
Now I am like half-way there. My main issues/questions are:
1. Are the models I used in brms correctly specified so we can call them Hierarchical Bayesian ones? This question may sound naive, but after spending like 40 hours during the weekend reading and testing I feel a bit twisted at the moment :)
2. How can I make the DISP coefficients positive? Using Gamma ditribution while setting priors? Also, how can I “force” coefficients into some predefined bounds? I’ve just ran into lb and ub in manual, but could anyone provide me with a simple example in the cheese case…?
3. @alexhallam question once again. :) Did you manage to implement adstock into the brms function? @paul.buerkner or maybe is there another way of doing it without writing the STAN code…? What if the adstock transformation would be described like:
for (i in 1:length(advertising)){
if (i == 1) {
adstocked_advertising[i] = advertising[i] * adstock_rate }
else {
adstocked_advertising[i] = adstock_rate * advertising[i] + (1 - adstock_rate) * adstocked_advertising[i-1]
} } ?
I have found an example of STAN code used for both Adstock and Hill transformation, but at the moment it’s like black magic.
functions {
// the Hill function
real Hill(real t, real ec, real slope) {
return 1 / (1 + (t / ec)^(-slope));
}
// the adstock transformation with a vector of weights
real Adstock(row_vector t, row_vector weights) {
return dot_product(t, weights) / sum(weights);
}
}
data {
// the total number of observations
int<lower=1> N;
// the vector of sales
real<lower=0> Y[N];
// the maximum duration of lag effect, in weeks
int<lower=1> max_lag;
// the number of media channels
int<lower=1> num_media;
// a vector of 0 to max_lag - 1
row_vector[max_lag] lag_vec;
// 3D array of media variables
row_vector[max_lag] X_media[N, num_media];
// the number of other control variables
int<lower=1> num_ctrl;
// a matrix of control variables
row_vector[num_ctrl] X_ctrl[N];
}
parameters {
// residual variance
real<lower=0> noise_var;
// the intercept
real tau;
// the coefficients for media variables
vector<lower=0>[num_media] beta_medias;
// coefficients for other control variables
vector[num_ctrl] gamma_ctrl;
// the retention rate and delay parameter for the adstock transformation of
// each media
vector<lower=0,upper=1>[num_media] retain_rate;
vector<lower=0,upper=max_lag-1>[num_media] delay;
// ec50 and slope for Hill function of each media
vector<lower=0,upper=1>[num_media] ec;
vector<lower=0>[num_media] slope;
}
transformed parameters {
// a vector of the mean response
real mu[N];
// the cumulative media effect after adstock
real cum_effect;
// the cumulative media effect after adstock, and then Hill transformation
row_vector[num_media] cum_effects_hill[N];
row_vector[max_lag] lag_weights;
for (nn in 1:N) {
for (media in 1 : num_media) {
for (lag in 1 : max_lag) {
lag_weights[lag] <- pow(retain_rate[media], (lag - 1 - delay[media]) ^ 2);
}
cum_effect <- Adstock(X_media[nn, media], lag_weights);
cum_effects_hill[nn, media] <- Hill(cum_effect, ec[media], slope[media]);
}
mu[nn] <- tau +
dot_product(cum_effects_hill[nn], beta_medias) +
dot_product(X_ctrl[nn], gamma_ctrl);
}
}
model {
retain_rate ~ beta(3,3);
delay ~ uniform(0, max_lag - 1);
slope ~ gamma(3, 1);
ec ~ beta(2,2);
tau ~ normal(0, 5);
for (media_index in 1 : num_media) {
beta_medias[media_index] ~ normal(0, 1);
}
for (ctrl_index in 1 : num_ctrl) {
gamma_ctrl[ctrl_index] ~ normal(0,1);
}
noise_var ~ inv_gamma(0.05, 0.05 * 0.01);
Y ~ normal(mu, sqrt(noise_var));
}
Once again - sorry for a really looong post. I hope you won’t find it overhelming or simply dumb. I would really love to start using HB models instead of FE ones, but seem to struggle with the basics.
Best wishes!