Error when using adstock function in non-linear models

I am currently trying to reproduce results I found on this post in brms. The goal is to see if I can recover adstock rates, coefficients, and intercept using a non-linear model from simulated data.

The only non-linear models I have tried are the ones in vignette(brms_nonlinear) so I am betting I am making an embarrassingly simple mistake.

The error I am getting is

Error in stanc(model_code = paste(program, collapse = “\n”), model_name = model_cppname, :
failed to parse Stan model ‘file290d3c2c341a’ due to the above error.

I am guessing this is due to the adstock function being part of my model.

My code is as follows:

# load libraries
library(purrr)
library(tidyverse)

# positive_round function 
positive_round <- function(...) round(pmax(..., 0), 0)

# adstock function
adstock<-function(x,rate=0){
  return(as.numeric(stats::filter(x=x,filter=rate,method="recursive")))
}

# base and ad constants 
base     <-  100 
n_weeks  <-  104
avg      <-  20
sdev     <-  10 

# adstock rates
ad1_rate <- .7
ad2_rate <- .4
ad3_rate <- .5

set.seed(42)
# 3 ads with mean
fake_df <- rep(avg, 3) %>%
  # normal distribution with 2 years of observations
  map(~ rnorm(n = n_weeks, mean = .x, sd = sdev)) %>% 
  # make sure there are no 0s and make values integers
  map(~ positive_round(0,.x)) %>% 
  # convert lists to data frame
  map_dfc(~as_tibble(.x)) %>% 
  # sensible names
  setNames(c("ad1", "ad2", "ad3")) %>% 
  # sales = sum of adstock and some noise
  mutate(sales = base + 
           adstock(ad1, ad1_rate) + 
           adstock(ad2, ad2_rate) + 
           adstock(ad3, ad3_rate) +
           rnorm(n(), sd = 5)) %>% 
  # round sales to whole number
  mutate(sales =  round(sales, 0))

# Model building
library(brms)
prior1 <- prior(normal(.5, .5), nlpar = "rate1") + 
          prior(normal(.5, .5), nlpar = "rate2") +
          prior(normal(.5, .5), nlpar = "rate3") +
          prior(normal(1, .5), nlpar = "b1") +
          prior(normal(1, .5), nlpar = "b2") +
          prior(normal(1, .5), nlpar = "b3") +
          prior(normal(100, 50), nlpar = "basesales") 
  
  
fit1 <- brm(bf(sales ~ basesales + 
                 b1 * adstock(ad1, rate1) + 
                 b2 * adstock(ad2, rate2) + 
                 b3 * adstock(ad3, rate3), 
               basesales + b1 + b2 + b3 + rate1 + rate2 + rate3 ~ 1, 
               nl = TRUE),
            data = fake_df, 
            prior = prior1)

summary(fit1)

Can you include what the error message for the “above error” was? That will help us figure out what’s going on.

Thanks!

Sorry

SYNTAX ERROR, MESSAGE(S) FROM PARSER:

No matches for: 

  adstock(real, real)

Function adstock not found.
  error in 'model290d7a10b5d2_file290d3c2c341a' at line 52, column 71
  -------------------------------------------------
    50:   for (n in 1:N) { 
    51:     // compute non-linear predictor 
    52:     mu[n] = mu_basesales[n] + mu_b1[n] * adstock(C_1[n] , mu_rate1[n]) + mu_b2[n] * adstock(C_2[n] , mu_rate2[n]) + mu_b3[n] * adstock(C_3[n] , mu_rate3[n]);
                                                                              ^
    53:   } 
  -------------------------------------------------

Error in stanc(model_code = paste(program, collapse = "\n"), model_name = model_cppname,  : 
  failed to parse Stan model 'file290d3c2c341a' due to the above error.

Looks like the issue is the adstock function. What is a possible work-around for this?

It looks like the Stan code brms writes includes the adstock() function as specified in your formula, that is, bf() is interpreting the formula as the formula to use inside of Stan rather than a formula where the terms are evaluated in R (which I think it does when you don’t use bf() or if nl=FALSE). So yeah, that’s not going to work. adstock() would need to be a user-defined Stan function for this to work and not an R function. So either adstock() needs to be defined in the Stan language or, if possible, used only in R to precompute what you need without having to call adstock() inside of Stan.

Thanks for the help. Unfortunately the rate in the adstock function is something that I need to optimize. So I don’t think that precomputing in R will work. The only option would be to define the function in Stan.

Indeed, bf() interpretes the right-hand side of a formula literally if argument nl is set to TRUE. That’s the main idea of non-linear formulas in brms, basically. So yes, you need to define adstock() in Stan for this model to work.

1 Like

I have not worked with Stan directly. Since I have to define adstock() in Stan does that mean that I need to write the entire program in Stan?

just the function as a string which you then pass to argument stan_funs of brm().

Given that adstock is a wrapper around another R function (filter in this case) I’d recommend starting by rewriting it using more primitive R functions than filter. In my experience, doing that will often make it easier to translate R into Stan.

After some time in the Stan User’s Manual and a few iterations I was able to produce the following

my_stan_adstock <- "
  vector my_stan_adstock(vector x, real rate, int N){
  vector[N] y;
  vector[N] my_vector;
      for(i in 1:N){ 
       y[0] = x[i];
       y[i] = x[i] + rate * y[i - 1];
       my_vector[i] = y[i];
     }
  return my_vector;
  }
"

This chuck of code seems like it should work, but when I use stan_funs in brms I get the following error:

SYNTAX ERROR, MESSAGE(S) FROM PARSER:

No matches for: 

  my_stan_adstock(real, real, int)

Available argument signatures for my_stan_adstock:

  my_stan_adstock(vector, real, int)

  error in 'model290d28a7de7f_file290d3c8a7f3e' at line 63, column 87
  -------------------------------------------------
    61:   for (n in 1:N) { 
    62:     // compute non-linear predictor 
    63:     mu[n] = mu_basesales[n] + mu_b1[n] * my_stan_adstock(C_1[n] , mu_ad1rate[n] , 104) + mu_b2[n] * my_stan_adstock(C_2[n] , mu_ad2rate[n] , 104) + mu_b3[n] * my_stan_adstock(C_3[n] , mu_ad3rate[n] , 104);
                                                                                              ^
    64:   } 
  -------------------------------------------------

If I understand this correctly, after translation the function is expected to be my_stan_adstock(real, real, int), but I am providing my_stan_adstock(vector, real, int). I would change that vector to a real, but I need a vector to calculate the output.

If anyone is interested in reproducing this error. My code is in a gist here.

My question at this point is how do I take in the whole vector as opposed to the real?

It looks like the brms parser translates your formula into calls of the function within a loop. So this is going to depend on how flexible this feature in brms is. There would need to be a way to inform brms that you need to index some of the arguments differently. @paul.buerkner will know if it’s currently possible.

Either way, it looks like brms is getting you close to the right Stan code so you could also use make_stancode()it to generate the code and make_standata() to get the data. Then tweak the code so that it works with your function and use it with rstan to fit the model.

We have a vignette coming imminently that includes a section about using brms for that purpose.

brms expects functions to work in a pointwise manner (that is for each observation n). That is, your function should take the input necessary for the nth observation and return a real value for this observations. Is it possible to write your function this way?

@alexhallam Looking back at your function are you sure you don’t need i-1 in some of the indexes? If so then the point wise function would need the arguments for the current i and also arguments for the previous output to be fed back in. I’m not sure if it’s possible with brms or not. As a fallback you can always modify the brms code.

Thanks @jonah and @paul.buerkner for your help. I do need the y[i - 1] in there. I updated the code to reflect that. If brms expects the code in a pointwise manner then I may have to do some rethinking to solve this problem. I think that you provided great suggestions:

  1. Using make_stancode() and make_standata()
  2. Possible function rewrite.
  3. Modify some brms code for this specific purpose.

I will work through these suggestions. I am also excited to see the vignette that comes out on this topic.

Thanks again for the help. I am going to keep chipping away at this problem.

You could think of adding another covariate to adstock which contains y[i - 1] and 0 for the first observation.

So glad I have found this topic! After spending a weekend trying to fit a double nonlinear function to my MMM model I feel really helpless.

It’s my first post here and it will be long - I don’t mean to bother you guys too much, but I’m kind of desperate and feel like “deep” description of the problem is essential here. Just in case I will bold the questions.

I’m in @alexhallam shoeas at the moment trying to create a Marketing Mix Hierarchical Bayesian Model. As an example I would like to use cheese dataset from “bayesm” package. I will do my best to explain the transformations on the simple example and move to more advanced details.

Imagine we want to build a model for a single retailer from the dataset. Our purpose is to transform the data so they capture both carry-over and diminishing returns shape effect (logistic function in this case). Our effects are predefined at the moment; decay rate is set to 0.25 and Sigma from the logistic function to 2.

PRICE is for price ;) and DISP is an advertising variable.

library(dplyr)

adstockTransform <- function(x){
  stats::filter( 1/(1+exp(-2*x)), 0.25, method = "recursive")
}

mmm.data <- cheese %>% filter(RETAILER == 'CHICAGO - DOMINICK') %>%
  select(-RETAILER) %>% 
  mutate(log.volume = log(VOLUME), log.price = log(PRICE), adstock= adstockTransform(DISP)) %>%
  select(-VOLUME, -DISP, -PRICE)

We can run a regression:

fit <- lm(log.volume ~ adstock + log.price, data=mmm.data)
summary(fit)

And the result is:

Call:
lm(formula = log.volume ~ adstock + log.price, data = mmm.data)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.53274 -0.13505 -0.02216  0.08934  0.58237 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  11.6611     0.3029   38.50   <2e-16 ***
adstock       0.6719     0.2615    2.57   0.0128 *  
log.price    -2.9680     0.1442  -20.58   <2e-16 ***

Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.2468 on 58 degrees of freedom
Multiple R-squared:  0.9016,    Adjusted R-squared:  0.8982 
F-statistic: 265.6 on 2 and 58 DF,  p-value: < 2.2e-16

Nice! Now let’s try to fit decay effect and sigma with non-linear methods.

adstockTransform <- function(x, as, beta){
  stats::filter( 1/(1+exp(-beta*x)), as, method = "recursive")
}

library(minpack.lm)

fits2 <- nlsLM(log.volume ~  const + B1*adstockTransform(DISP, as, beta) + B2*log.price,
             start = c(const = 10, B1 = 0.5, as = 0.2, beta = 2, B2 = -3),
             lower = c(const = 5, B1 = 0.2, as = 0.1, beta = 1.5, B2 = -5),
             upper = c(const = 12, B1 = 2, as = 0.4, beta = 6, B2 = -2),
             data=mmm.data)

summary(fits2)

Formula: log.volume ~ const + B1 * adstockTransform(DISP, as, beta) + 
    B2 * log.price

Parameters:
      Estimate Std. Error t value Pr(>|t|)    
const  12.0000     0.2842  42.230   <2e-16 ***
B1      0.3637     0.3119   1.166    0.249    
as      0.1000     0.5807   0.172    0.864    
beta    6.0000     8.8240   0.680    0.499    
B2     -3.0502     0.1718 -17.749   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.2504 on 56 degrees of freedom

Number of iterations to convergence: 23 
Achieved convergence tolerance: 1.49e-08

It’s not perfect, but works fine for an aggregated model and allows us to estimate the non-linear parameters: as and beta. However, having cross-sectional data does complicate things. We can use two methods now:

  1. Fixed Effect Modeling (LSDV Approach) In this method we regress the dependent variable on predictors at the individual level but use as predictors a set of n-1 dummy variables for the n groups to identify the group membership of each individual in the data set. This method allows differences in intercepts of the individual groups and therefore considers variations between groups. However, the slope of each predictor is assumed to be same for all groups. In reality we expect the predictor variables (such as advertising) to have differing effects on the separate groups and so the slope of the predictor variables should change between groups. The regression coefficient for each predictor obtained from the Fixed effects method therefore is the weighted average of the regression coefficients in each of the individual groups. It is widely used; the curvature parameters may be found after applying nlsLM (for example) on the demeaned data so the transformed variables may be then put into the final FE model.

  2. Hierarchical Bayesian model. It may be applied if coefficients are assumed to be different across markets. This method accounts for both within group variation and between group variation. The responsiveness of advertising can be expected to vary between different markets as well as different brands, and the advantage of mixed models is that they inherently account for this feature. These models can also handle data structures that have multiple levels (e.g.: brands within stores within markets). This method also allows parameter estimation of advertising effects in markets with very few observations as well as helps to quantify or differentiate the individual effect of each group from the average effect across all groups. Also, these models are very powerful since all the available information for all groups is used to obtain the coefficients for each subgroup and even observations with missing data are used instead of being deleted.

Let’s skip the FE model and jump straight to the HB one. For now we also skip the AdStock and diminishing returns function.


cheese_brm1 <- brm(bf(log(VOLUME) ~ DISP + log(PRICE)  + (DISP + log(PRICE)|RETAILER)), data = cheese, iter = 4000)

summary(cheese.brm1)

Family: gaussian 
  Links: mu = identity; sigma = identity 
Formula: log(VOLUME) ~ DISP + log(PRICE) + (DISP + log(PRICE) | RETAILER) 
   Data: cheese (Number of observations: 5555) 
Samples: 4 chains, each with iter = 4000; warmup = 2000; thin = 1;
         total post-warmup samples = 8000

Group-Level Effects: 
~RETAILER (Number of levels: 88) 
                        Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
sd(Intercept)               1.14      0.10     0.96     1.35       2232 1.00
sd(DISP)                    1.24      0.15     0.97     1.56       2310 1.00
sd(logPRICE)                0.85      0.08     0.70     1.02       1518 1.00
cor(Intercept,DISP)         0.11      0.12    -0.13     0.33       1747 1.00
cor(Intercept,logPRICE)    -0.74      0.05    -0.83    -0.62       2052 1.00
cor(DISP,logPRICE)         -0.07      0.13    -0.31     0.18       1509 1.00

Population-Level Effects: 
          Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
Intercept    10.33      0.13    10.08    10.60       1687 1.00
DISP          1.17      0.15     0.88     1.49       1517 1.00
logPRICE     -2.20      0.11    -2.41    -1.99       1601 1.00

Family Specific Parameters: 
      Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
sigma     0.25      0.00     0.24     0.25       8000 1.00

Samples were drawn using sampling(NUTS). For each parameter, Eff.Sample 
is a crude measure of effective sample size, and Rhat is the potential 
scale reduction factor on split chains (at convergence, Rhat = 1). 

Great! Seems like I managed to build my first Hierarchical Bayesian Model with brms! However, I didn’t touch the priors and some of the results seem to be a bit suspicious; after running coefs(cheese.brm1) I can see that DISP estimated values for some retailers are negative. Well, it should not happen as we believe that advertising always has a positive impact on sales. How can I “strict” the DISP coefficient to be positive using brms…?

Moving on. I would like to add non-linear parameters to my model, just like in the first simple example. I’m new to Bayesian analysis, so don’t feel offended by the priors I set here. :)

For now I will use only the diminishing returns transformation, as Adstock function from the first example does not work here…

I assumed that intercept, DISP “global” coefficient and PRICE may vary among the retailers, but their diminishing returns curvature is the same.


prior0 <- prior(normal(10,5), nlpar = "const")
prior1 <- prior(normal(1, 0.5), nlpar = "beta1")
prior2 <- prior(normal(2, 3), nlpar = "beta2")
prior3 <- prior(normal(-2, 0.1), nlpar = "beta3")

bfx5 <- bf(log(VOLUME) ~ const + beta1/(1+exp(-beta2*DISP)) + beta3*log(PRICE),
           const + beta1 + beta3 ~ 1 + (1|RETAILER), beta2 ~ 1,  
           nl = TRUE)

ups3 <- brm(bfx5, data = cheese, prior = c(prior0,prior1, prior2, prior3), iter = 2000)

 Family: gaussian 
  Links: mu = identity; sigma = identity 
Formula: log(VOLUME) ~ const + beta1/(1 + exp(-beta2 * DISP)) + beta3 * log(PRICE) 
         const ~ 1 + (1 | RETAILER)
         beta1 ~ 1 + (1 | RETAILER)
         beta3 ~ 1 + (1 | RETAILER)
         beta2 ~ 1
   Data: cheese (Number of observations: 5555) 
Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
         total post-warmup samples = 4000

Group-Level Effects: 
~RETAILER (Number of levels: 88) 
                    Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
sd(const_Intercept)     1.10      0.11     0.91     1.33        905 1.00
sd(beta1_Intercept)     1.43      0.18     1.12     1.80        945 1.00
sd(beta3_Intercept)     0.78      0.07     0.65     0.94       1005 1.00

Population-Level Effects: 
                Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
const_Intercept     9.43      0.15     9.14     9.71       1323 1.00
beta1_Intercept     1.70      0.19     1.34     2.08        934 1.01
beta3_Intercept    -2.11      0.07    -2.24    -1.98       1199 1.00
beta2_Intercept     2.77      0.26     2.29     3.31       2076 1.00

Family Specific Parameters: 
      Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
sigma     0.25      0.00     0.24     0.25       4000 1.00

Samples were drawn using sampling(NUTS). For each parameter, Eff.Sample 
is a crude measure of effective sample size, and Rhat is the potential 
scale reduction factor on split chains (at convergence, Rhat = 1).

summary(cheese_brm6)
 Family: gaussian 
  Links: mu = identity; sigma = identity 
Formula: log(VOLUME) ~ DISP + log(PRICE) + (DISP + log(PRICE) | RETAILER) 
   Data: cheese (Number of observations: 5555) 
Samples: 4 chains, each with iter = 4000; warmup = 2000; thin = 1;
         total post-warmup samples = 8000

Group-Level Effects: 
~RETAILER (Number of levels: 88) 
                        Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
sd(Intercept)               1.14      0.10     0.96     1.35       2232 1.00
sd(DISP)                    1.24      0.15     0.97     1.56       2310 1.00
sd(logPRICE)                0.85      0.08     0.70     1.02       1518 1.00
cor(Intercept,DISP)         0.11      0.12    -0.13     0.33       1747 1.00
cor(Intercept,logPRICE)    -0.74      0.05    -0.83    -0.62       2052 1.00
cor(DISP,logPRICE)         -0.07      0.13    -0.31     0.18       1509 1.00

Population-Level Effects: 
          Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
Intercept    10.33      0.13    10.08    10.60       1687 1.00
DISP          1.17      0.15     0.88     1.49       1517 1.00
logPRICE     -2.20      0.11    -2.41    -1.99       1601 1.00

Family Specific Parameters: 
      Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
sigma     0.25      0.00     0.24     0.25       8000 1.00

Samples were drawn using sampling(NUTS). For each parameter, Eff.Sample 
is a crude measure of effective sample size, and Rhat is the potential 
scale reduction factor on split chains (at convergence, Rhat = 1).

The results of these models are similiar, however, some retailers have (again) negative values of beta1 which should not happen.

Now I am like half-way there. My main issues/questions are:

1. Are the models I used in brms correctly specified so we can call them Hierarchical Bayesian ones? This question may sound naive, but after spending like 40 hours during the weekend reading and testing I feel a bit twisted at the moment :)

2. How can I make the DISP coefficients positive? Using Gamma ditribution while setting priors? Also, how can I “force” coefficients into some predefined bounds? I’ve just ran into lb and ub in manual, but could anyone provide me with a simple example in the cheese case…?

3. @alexhallam question once again. :) Did you manage to implement adstock into the brms function? @paul.buerkner or maybe is there another way of doing it without writing the STAN code…? What if the adstock transformation would be described like:


for (i in 1:length(advertising)){

  if (i == 1) {
    adstocked_advertising[i] = advertising[i] * adstock_rate }

  else {

 adstocked_advertising[i] = adstock_rate * advertising[i] + (1 - adstock_rate) * adstocked_advertising[i-1]
} }    ?

I have found an example of STAN code used for both Adstock and Hill transformation, but at the moment it’s like black magic.

functions {
// the Hill function
real Hill(real t, real ec, real slope) {

return 1 / (1 + (t / ec)^(-slope));
}
// the adstock transformation with a vector of weights
real Adstock(row_vector t, row_vector weights) {
return dot_product(t, weights) / sum(weights);
}
}
data {
// the total number of observations
int<lower=1> N;
// the vector of sales
real<lower=0> Y[N];
// the maximum duration of lag effect, in weeks
int<lower=1> max_lag;
// the number of media channels
int<lower=1> num_media;
// a vector of 0 to max_lag - 1
row_vector[max_lag] lag_vec;
// 3D array of media variables
row_vector[max_lag] X_media[N, num_media];
// the number of other control variables
int<lower=1> num_ctrl;
// a matrix of control variables
row_vector[num_ctrl] X_ctrl[N];
}
parameters {
// residual variance
real<lower=0> noise_var;
// the intercept
real tau;
// the coefficients for media variables
vector<lower=0>[num_media] beta_medias;
// coefficients for other control variables
vector[num_ctrl] gamma_ctrl;
// the retention rate and delay parameter for the adstock transformation of
// each media
vector<lower=0,upper=1>[num_media] retain_rate;
vector<lower=0,upper=max_lag-1>[num_media] delay;
// ec50 and slope for Hill function of each media
vector<lower=0,upper=1>[num_media] ec;
vector<lower=0>[num_media] slope;
}
transformed parameters {

// a vector of the mean response
real mu[N];
// the cumulative media effect after adstock
real cum_effect;
// the cumulative media effect after adstock, and then Hill transformation
row_vector[num_media] cum_effects_hill[N];
row_vector[max_lag] lag_weights;
for (nn in 1:N) {
for (media in 1 : num_media) {
for (lag in 1 : max_lag) {
lag_weights[lag] <- pow(retain_rate[media], (lag - 1 - delay[media]) ^ 2);
}
cum_effect <- Adstock(X_media[nn, media], lag_weights);
cum_effects_hill[nn, media] <- Hill(cum_effect, ec[media], slope[media]);
}
mu[nn] <- tau +
dot_product(cum_effects_hill[nn], beta_medias) +
dot_product(X_ctrl[nn], gamma_ctrl);
}
}
model {
retain_rate ~ beta(3,3);
delay ~ uniform(0, max_lag - 1);
slope ~ gamma(3, 1);
ec ~ beta(2,2);
tau ~ normal(0, 5);
for (media_index in 1 : num_media) {
beta_medias[media_index] ~ normal(0, 1);
}
for (ctrl_index in 1 : num_ctrl) {
gamma_ctrl[ctrl_index] ~ normal(0,1);
}
noise_var ~ inv_gamma(0.05, 0.05 * 0.01);
Y ~ normal(mu, sqrt(noise_var));
}

Once again - sorry for a really looong post. I hope you won’t find it overhelming or simply dumb. I would really love to start using HB models instead of FE ones, but seem to struggle with the basics.

Best wishes!

To force coefficients in certain boundaries, use transformations such as exp() to force something to be positive or inv_logit to force something to be within [0, 1]. The lb and ub arguments don’t help you anymore once you have multilevel structure in your parameters.

1 Like

Thank you, Paul. Does it refer to all coefficients in the model or the ones describing the multilevel variables? For example, if I kept the DISP variable as fixed - would it be still impossible to use lb / ub?

If you just fit an intercept for a parameter (disp ~ 1), lb and ub will suffice.