# Bayesian Regression Model Building

Hi Stan experts, I am new to Bayesian modeling and brms, and I am trying to run a multivariate regression using a Bayesian approach.

I am working on a marketing mix problem. My data has 200k records at the customer x month level for the year 2019, which means I have 12 records per customer. Sales is the dependent variable and various marketing channels are the independent variables. I am trying to estimate the betas of these marketing channels using Bayesian models.

There are a few issues I am facing:

1. Prior selection: I have beta estimates for the different channels from 2018 data. Can I use these betas as priors for my 2019 Bayesian model? If yes, then if I run the model on my entire 200k-record 2019 dataset with the 2018 betas as priors, the effect of the priors is going to be washed out by the sheer number of records anyway. How can I make use of the 2018 information? Also, I have to include all 200k records in my model, as we cannot drop any customer from the analysis.

2. I understand each customer has different purchasing behavior, so I have already bucketed them by sales into deciles - 0, 1, 2, 3, 4 - with decile 4 for the customers with the highest sales. Since I was still not clear on point 1 above, I used 2018 priors with just a few observations to check the approach.

I ran the model below for 100 customers from decile 4:

bmod1 <- brm(
  Sales ~ Comp + FTO + CPL + SPL + CC + ENG + IMP + PSO +
    (1 | decile) + (0 + FTO | decile) + (0 + Comp | decile) +
    (0 + SPL | decile) + (0 + CC | decile) + (0 + ENG | decile) +
    (0 + IMP | decile) + (0 + PSO | decile),
  data = mdb_14, family = gaussian(), prior = prior2,
  warmup = 1000, iter = 5000, chains = 4,
  control = list(adapt_delta = 0.9999), seed = 150, thin = 2,
  cores = parallel::detectCores()
)



But the fit of bmod1 is very poor, as shown below:

I think I have added all the relevant details, but please feel free to ask for any clarification. I would really appreciate it if you could first look into issue no. 1 and give your qualitative suggestions, and second look into issue no. 2 and let me know if there is a different distribution or any other tweak I should try with bmod1 to improve the fit.

Hi @Sham414
Regarding your first question - you could certainly use estimates for parameters based on your previous data to make more informative priors. The extent to which these priors affect your ultimate posterior would depend not only on the amount of data you have, but also on how precisely that new data indicates any particular parameter value, and how tightly you specify your prior. If you use the prior to give a very precise initial estimate and rule out certain posterior values a priori, then this could still strongly affect the posterior estimate.

In any case, it is not an issue for your prior to be ‘washed out’. The final posterior simply is a combination of the prior, the model, and the data you feed it, and the prior information is incorporated into the posterior even if you can’t tell with the naked eye. If you have good reason to ensure that the new posterior corresponds more strongly to your prior, then you can specify a more constrained prior accordingly, but of course be careful to also let the new data speak.

For the second question regarding the fit in your posterior predictive check, I’m afraid I don’t really understand fully what you are trying to do and can’t quite understand your intention with the formula, so I can’t comment on the overall appropriateness of your model/coding or how to tweak it to better fit the data.

I could also be wrong here, but that adapt_delta choice looks rather high.

Hi @JimBob,

Thank you so much for taking the time to answer my query. I really appreciate the support of the community.

• Could you please explain a little more what you mean by “specifying a more constrained prior”?

• Also, you mentioned I should let the new data speak as well. Is there any way I could find an appropriate model in which the likelihood and the priors have an equal say in the results? (I don’t want either the priors or the current likelihood to dominate; however, since my N is 200k, I am afraid the current data would dominate.)

I selected 100 observations with a similar tier-level distribution of customers to that of my complete dataset and ran a Bayesian model on it. I have Comp, FTO, CPL, SPL, CC, ENG, IMP and PSO as my independent variables and Sales as my dependent variable. I am letting the effects of all these independent variables vary at the decile level, as there are underlying differences in behavior between customers from different deciles.

Other details: all my variables are continuous in nature. Furthermore, since my data is at the Customer x Month level, there might be a significant difference between the sales of a regular customer and those of a non-regular customer. For example, customer 1001 in month 5 might have sales of 2.4, whereas another customer, 1003, in month 5 might have sales as low as 0.

Please let me know if you need any further details. I would love to have a discussion on this.

Thank you once again for your support.

Hi, regarding your model, it is still a little hard for me to wrap my head around, but hopefully another person on the forum will be able to give some more pointers as to what you can do with it. I might suggest starting out with a larger number of participants from your sample but a less complex model, if you want to get a better sense of the ideas behind the priors.

By more/less constrained priors, I mean that you can specify your own priors for the different parameters in the model, and they can be more or less specific. By parameters, I am referring to each of the different model components that you get estimates for when you look at your brms output. You can see which parameters you can put a prior on by using the brms function get_prior(), which requires your formula, family (e.g., gaussian), and data. You can then use the function set_prior() to give different model components a prior.

With the prior, what you are basically doing is specifying a plausible range of possible values for that parameter. When you run the regression, the program is using the priors, your data, and your model, to update a best estimate for these parameters. What exact prior is ‘best’ is going to depend on the situation and your knowledge of what the parameter values should be. By default, brms usually uses very ‘vague’ or only weakly constrained priors, allowing the search space for the parameter values to spread over a very wide number of possibilities. If you have reasons based on past information to make this search space more constrained, you can specify that by defining a particular distribution for the prior. For example, if you are estimating the heights of a random population of adults, it doesn’t make sense to look between 0-10000 meters with equal likelihood. You could specify a prior on the mean height as something like a normal distribution, with a mean of 170 and a standard deviation of say 15 centimeters. You can think of the parameters in your model along these lines, and work out if you have information that might cause you to specify a more constrained search space for the model. This will be easier to think about though if you first start with a simpler model, most likely.
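As a concrete sketch of the two functions mentioned above (the formula, the prior2 name, and the data frame mdb_14 are just placeholders taken from this thread, and the numeric values are purely illustrative):

```r
library(brms)

# List every parameter the model defines together with its default prior
get_prior(Sales ~ Comp + FTO, data = mdb_14, family = gaussian())

# Constrain the priors: e.g. centre the Comp slope on a hypothetical
# 2018 estimate of 0.5 with a deliberately tight sd, and keep a
# weakly informative normal(0, 1) prior on the remaining slopes
prior2 <- c(
  set_prior("normal(0.5, 0.1)", class = "b", coef = "Comp"),
  set_prior("normal(0, 1)", class = "b")
)
```

The height example would translate the same way: set_prior("normal(170, 15)", class = "Intercept") on a model of heights measured in centimeters.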

The posterior is always an updating of the priors, using your data and the model specification. So long as you don’t give very bizarre or overly constrained priors, your model should represent a nice posterior that reflects the process you are observing. If you are worried about the influence of the priors or the data, you can try running the model with different specifications for the priors, and see how and if the resulting posteriors differ.

Hi,

unfortunately, you’ve binned your outcome. It looks like the outcome is now 0, …, 4, which means it’s an ordered categorical outcome. Simply fitting with gaussian() might not be the right thing to do. Perhaps not binning the outcome could be a way forward, or use a likelihood that takes into account that the outcome is ordered categorical.

Thanks @JimBob. I got your points regarding testing various priors - both non-informative and highly constrained - however, I have specific information about exactly what the priors should look like from my 2018 analysis. I will post my results with my specific priors on a larger sample than the one I used above.

Additionally, just one thing you could possibly help me with: since I have to run my Bayesian model on 200k records, it looks computationally impossible for my machine. Is there anything I could do about this, such as parallel processing, taking subsamples, or anything else?

Thank you once again @JimBob

Hi @torkar, thanks for your reply. I binned my dependent variable, Sales, because there are visible differences between high buyers and low buyers, and I thought implementing a segment or decile might tell the model to treat these customers in a separate way. I was afraid that if I did not bin my Sales, I might end up underestimating the parameter values due to the presence of low buyers, or overestimating them due to the presence of high buyers, hence I separated them so that there are different betas/coefficients for each decile. I would love to know your thoughts on this.

I am also a little confused about why the model would treat the outcome as an ordered categorical variable, since my dependent variable, Sales, is continuous in nature. Am I missing something here?

Thank you once again for your support :)

Hi

well, is your outcome binned or not? If you have chopped it up into several (ordered) categories, then it’s binned and you no longer have a continuous number, right?

If it’s ordered categorical data, then you should model it as such. I am a bit unsure whether you should actually do that, since I’ve been taught to always avoid removing information (which, in a way, you do when you bin the values), but Bürkner’s tutorial is a good way to start: https://journals.sagepub.com/doi/pdf/10.1177/2515245918823199

If your model is the same as the one you used for your 2018 data, then you can simply update your previous results with the new data. There is a function for this in brms, which you can google as I don’t remember it off the top of my head. You feed it the new data and it takes the old results sort of like a prior, and updates with the new data, I believe. An alternative option would be to take a look at the posteriors, find distributions that describe each parameter’s posterior, and then use these to explicitly define new priors in the new model.

Updating your model using the brms function could also be a way to account for the large sample size if it is computationally not possible. I think that you could take samples from your data and feed them in sequentially, each time updating the model from the previous one, until you have used up the whole sample.
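The brms function being alluded to is, I believe, update() (the update method for fitted brms models). A minimal sketch, assuming a previously fitted model fit_2018 and a new data frame mdb_2019 (both names hypothetical):

```r
# Refit the same model formula on the new data; brms reuses the
# already-compiled Stan model, so only sampling is repeated
fit_2019 <- update(fit_2018, newdata = mdb_2019)
```

One caveat: update() refits the model on the new data with the original priors rather than literally using the old posterior as the new prior, so for true sequential updating you would still need to summarize the old posterior into explicit priors, as described above.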

Regarding the comment of @torkar about using a different likelihood, I think they mean that instead of using gaussian() as your likelihood family in the model specification, you could use something such as cumulative(“probit”), if the outcome variable is an ordered categorical variable.
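For reference, a minimal sketch of that ordinal specification (the outcome and predictor names here are placeholders, and the outcome would need to be the binned 0-4 variable, stored as an ordered factor or integer):

```r
# Ordered-categorical outcome modeled with a cumulative probit link
fit_ord <- brm(
  SalesDecile ~ Comp + FTO,
  data = mdb_14,
  family = cumulative("probit")
)
```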


Thanks for the quick reply. I think you misunderstood my deciling step. My “Sales” variable, which is also my dependent variable, is intact - I haven’t chopped up the “Sales” column in any way. I have just introduced another column, “decile”, which is based on the values of the “Sales” column, and I use “decile” only for the varying effects.

In my model, my dependent variable is still “Sales”, which is still continuous.

Yeah, I’m sure I didn’t get some things right, but the plot looks very strange, especially the line plotting your y. It looks like it’s categorical - that’s why I assumed that :)

Just wanted to check: how should I search for that function you mentioned which allows feeding in new data while taking old results as the prior? It would be very helpful if you could give me some direction, as this function looks encouraging.

I think the “Sales” variable is weirdly distributed because you have one or a few products with similar prices. The modes of the distribution seem to be equidistant, which suggests you are seeing the products selling 1, 2, 3, …, units.

Instead of ranking customers by deciles, you could use interactions based on a customer’s previous number of sales. This is probably more robust than computing deciles of sales based on the whole sample, and it is taking advantage of the fact that higher values have meaning (so using a hierarchical structure in that way does not make much sense; you can add splines or something like that if you want extra flexibility, although at that point I am not sure how efficient things would be with 200K observations).

Also, it seems that the issue is ill-posed, although this seems to be typical for these marketing problems (they are hard…). You have barely any sales at zero - why? Do you only have data on regular customers? If so, you can only know whether the marketing channels are improving sales for your existing customers - everything else is confounded to hell. But if that is the case, you need to be really careful about how you structure the model.

You probably need to model seasonal effects (which will be correlated with spending on marketing channels). Be careful not to leak information from the future, as that will artificially diminish the effects of the marketing channels.

Another worry is related to the monthly data: the effect of being advertised to has to diminish over time, meaning that ads should have a larger effect on sales after one week than after one year. Hence, you should set this up as a time series problem so that you can capture the variation over time, otherwise your results won’t be that informative. That said, because your data is not very granular, it might be hard to capture that more immediate effect without leaking future information.

There are probably other issues that I am not seeing right now, so I would strongly encourage you to do simulations on this.

But here are my recommendations, summarized:

• If you can, use a Poisson likelihood to predict the number of units sold, not “Sales” (which seems to be units × price). If you do not have the data for that, you can probably get away with assuming that each ‘hump’ corresponds to a unit sold, and then estimate how the error varies around that (but most of the variability seems to be due to units sold, so start with the simplifying assumption first). This is the more correct thing to do because, after all, the units sold are dependent on the price.

• Model this as a time series problem, maybe. The simplest adequate model might be something like:

UnitsSold_{t, i} ~ Poisson(lambda_{t, i})
log(lambda_{t, i}) = beta_0 + beta_1 * Price_{t, j} + beta_2 * UnitsSold_{t-1, i}
                     + [insert marketing channels at t-1]


for each client i and product j. Having the “beta_2 * UnitsSold_{t-1, i}” part will alleviate some of the issues with heterogeneity between high and low spenders, so you do not need to add the interactions just yet. I am not sure the model will be good for what you want to find causally, so figure that out with simulations.

• Start small. Select 100 clients at random from your data (~1,200 observations?) and try to fit the model with 100 iterations. Does it run? Ramp up the iterations to 1,000, do model checking and all that. Then use the data of 1,000 clients, see how well the model scales up, or whether there are any obvious opportunities for improving the model, and go from there.
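The Poisson time-series sketch above could be written in brms roughly as follows, assuming you have built the lagged columns yourself beforehand (UnitsSold_lag1, Comp_lag1, etc. are hypothetical column names, not part of the original thread):

```r
# Poisson regression with a lagged outcome and lagged marketing
# channels as predictors; the log link makes the effects multiplicative
pois_ts <- brm(
  UnitsSold ~ 1 + Price + UnitsSold_lag1 + Comp_lag1 + FTO_lag1,
  data = mdb_14,
  family = poisson()
)
```

The first month of each customer would have no lag and would need to be dropped or handled separately.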

Hi @davidmpinho,

• My data has no zero Sales because my dependent variable is log-transformed Sales

• This data covers almost all the customers of our current brand, so I think it should represent the overall sales as well

Additionally, as you suggested, I tried modeling a small sample of data with the previous year’s model results as priors. I drew a sample of 200 customers and tried fitting the model. I also reworked the deciles. It seems the deciles created previously had some other factors mixed in, such as customer segment, type, or behavior, so I recreated the deciles based solely on Sales (not the log of Sales, although it would be mostly the same). I am seeing a great fit for this small dataset. Below is my code:

bmod1 <- brm(
  Sales ~ Comp + FTO + CPL + SPL + CC + ENG + IMP + PSO +
    (1 | decile2) + (0 + FTO | decile2) + (0 + Comp | decile2) +
    (0 + SPL | decile2) + (0 + CC | decile2) + (0 + ENG | decile2) +
    (0 + IMP | decile2) + (0 + PSO | decile2),
  data = mdb_14, family = gaussian(), prior = prior2,
  warmup = 1000, iter = 5000, chains = 4,
  control = list(adapt_delta = 0.9999), seed = 150, thin = 2,
  cores = parallel::detectCores()
)
Below is the fit for 200 observations, 5,000 iterations, and 4 chains:

Below is the fit when I scale the model to 800 observations and 20,000 iterations, with 4 chains:

However, the problem is that when I try to scale this model to 5,000 observations (20k iterations, 4 chains), I only see some lines:

This is a real challenge for me, because eventually I will have to scale this up to 200k records. I would really appreciate your inputs on this.

Thanks

The data is different from what I thought, so ignore some of my previous suggestions.

My data has no Zero Sales because my dependent variable is log transformed Sales

But did you remove months where a customer had zero sales and then take the log? Or were there no actual zeros? Because you cannot just remove the zeros - that will give you worse predictions and horrible inferences.

For everything else, I do not think you would need anything close to 80K iterations if the model was adequate. From what I can tell, there is a lot of heterogeneity in your data set, so it needs to be dealt with. My guess is that products of different prices are accounting for a large part of that heterogeneity: more expensive products probably sell much fewer units than cheap products; and the price of a product does not change much over time, so modeling this with a continuous process might be causing some issues if customers do not buy different products. So I will, again, insist that you try something similar to what I suggested, but if you do not have the data available yet, I also recommend something closer to what you did:

Use the following model with 100 iterations and then increase it to 1000:

pois_model <- brm(
  UnitsSold ~ 1 + LogPricePerUnit + Comp + FTO + CPL + SPL +
    CC + ENG + IMP + PSO,
  data = mdb_14, family = 'poisson',
  warmup = 50, iter = 100, chains = 4,
  control = list(adapt_delta = 0.95), seed = 150, thin = 2,
  cores = parallel::detectCores()
)


Do not log UnitsSold, and “LogPricePerUnit” can be the average price in a given month. Do not use custom priors just yet; those may be causing issues. Also, do not use a hierarchical model just yet, because I suspect it will not be as necessary after including the log of the price per unit. Ideally, this simpler model would be fast enough to allow you to then add a hierarchical effect for each customer instead of something else.

If you really (!!) cannot know how many units were sold or their average prices, I would try this:

test_1 <- brm(
  LogSales ~ 1 + Comp + FTO + CPL + SPL + CC + ENG + IMP + PSO,
  data = mdb_14, family = 'gaussian',
  warmup = 50, iter = 100, chains = 4,
  control = list(adapt_delta = 0.95), seed = 150, thin = 2,
  cores = parallel::detectCores()
)


Again, without custom priors or hierarchical effects. This gets trickier if you have zeros because you might need to use a zero-inflated process. I do not know how to do that in BRMS, but try it with the data set you were using to see if the previous issues are solved.
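For what it’s worth, brms does ship families for exactly this situation; one that might apply here, sketched with the thread’s placeholder names, is a hurdle model on the raw (not logged) Sales:

```r
# Hurdle lognormal: models P(Sales == 0) and the positive sales
# separately, so the zero months do not need to be dropped
fit_hurdle <- brm(
  Sales ~ 1 + Comp + FTO + CPL + SPL + CC + ENG + IMP + PSO,
  data = mdb_14,
  family = hurdle_lognormal()
)
```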

If there seem to be issues with users spending different amounts, try:

test_2 <- brm(
  LogSales ~ (1 | UserID) + Comp + FTO + CPL + SPL + CC + ENG + IMP + PSO,
  data = mdb_14, family = 'gaussian',
  warmup = 50, iter = 100, chains = 4,
  control = list(adapt_delta = 0.95), seed = 150, thin = 2,
  cores = parallel::detectCores()
)


This should deal with the issue of high/low spenders much better than quantiles, but (for now) we are assuming that Comp, FTO, etc. have the same effect on everyone. As I said previously, I have a hunch that we do not need hierarchical effects for those.

Show us the results with 800 observations. And since there are only a few parameters, post the trace plots and the pairs plots if you see anything that looks weird (the distribution of the “random effects” for UserIDs would also be interesting, to understand what is going on).

However the problem is when I try to scale this model to 5000 observations(20k iterations, 4 chains) I am only seeing some lines:-

This could be because your model gives some credibility to extremely large values - perhaps some extreme event is predicted to happen on occasion. The appearance of the lines would then just be due to the scaling of the graph. If you change the graph from its default, or make your own so that you can zoom in on the left-hand side, you might find that it is quite similar to your previous graphs, but with a very long tail.
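One way to zoom in, a sketch assuming the fitted object is still called bmod1 (the axis range is hypothetical):

```r
library(ggplot2)

# pp_check() returns a ggplot object, so the x-axis can be
# restricted to inspect the bulk of the distribution without
# discarding the long-tail draws
pp_check(bmod1, ndraws = 100) +
  coord_cartesian(xlim = c(0, 10))
```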