Modeling an A/B Test with Average Revenue as the Measure, Using Per-Day Data

I am after a model for an A/B test of average revenue.
The data is given in the following form:

| Day | A Views | A Clicks | A Sales | A Revenue [$] | B Views | B Clicks | B Sales | B Revenue [$] |
|-----|---------|----------|---------|---------------|---------|----------|---------|---------------|
| 1   | 100     | 70       | 10      | 52.3          | 93      | 71       | 9       | 51.1          |
| 2   | 103     | 65       | 11      | 59.2          | 110     | 67       | 12      | 57.6          |
| 3   | 99      | 55       | 7       | 46.4          | 89      | 61       | 9       | 42.4          |
| 4   | …       | …        | …       | …             | …       | …        | …       | …             |
| 5   | …       | …        | …       | …             | …       | …        | …       | …             |
| 6   | …       | …        | …       | …             | …       | …        | …       | …             |

Views is the number of users who entered the site, Clicks is the number of users who made a click, Sales is the number of users who made a purchase, and Revenue is the total revenue.
I don't have the data at per-user granularity.
In the end, I want to model the average revenue per view / per click.

What would be a plausible model for such data?
Is there a good way to leverage all data (Views, Clicks, Sales)?

I can think of many models for the per-user case; I wonder what would be a good model for per-day aggregated data.


So I'd think that Views, Clicks, Sales, & Revenue will have strong correlation across days, so you'll want structure reflecting that. You might be tempted to use a multivariate normal, but I'd be wary of assuming normality for count variables like Views & Clicks & Sales. Instead, what you can do is have a latent rate for Views for each day and a latent rate for each count outcome for each day, and connect each to its associated count observations with a Poisson likelihood.

Since total revenue for the day is determined entirely by sales count × average sale value, and it's the latter value that you say you're most interested in, you can go ahead and compute it in the data as av_sale_val = total_revenue / num_sales and then use av_sale_val rather than total_revenue in the model.

Finally, inference on the influence of the A/B manipulation could be achieved by modelling a difference in the means of the correlated quantities. You might also consider having a separate correlation structure for A & B.
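A rough Stan sketch of the above might look something like this (variable names and prior values are placeholders, a lognormal for av_sale_val is just one option, the cross-measure correlation structure mentioned above is omitted for brevity, and it assumes at least one sale per day so that av_sale_val is defined):

data {
	int<lower=1> n_days ;
	array[n_days, 2] int<lower=0> views ;     // column 1 = A, column 2 = B
	array[n_days, 2] int<lower=0> clicks ;
	array[n_days, 2] int<lower=0> sales ;
	matrix<lower=0>[n_days, 2] av_sale_val ;  // revenue / sales, computed before fitting
}
parameters {
	vector[3] log_rate_mean ;            // overall log-rates: views, clicks, sales
	vector[3] rate_diff ;                // A/B differences on the log-rate scale
	matrix[n_days, 3] log_rate_day_z ;   // per-day latent deviations (non-centred)
	vector<lower=0>[3] rate_day_sd ;
	real log_val_mean ;                  // mean log average-sale-value
	real val_diff ;                      // A/B difference in log average-sale-value
	real<lower=0> val_sd ;
}
model {
	// placeholder priors; set these from domain expertise
	log_rate_mean ~ normal(4, 2) ;
	rate_diff ~ normal(0, 0.5) ;
	to_vector(log_rate_day_z) ~ std_normal() ;
	rate_day_sd ~ normal(0, 1) ;
	log_val_mean ~ normal(1.5, 1) ;
	val_diff ~ normal(0, 0.5) ;
	val_sd ~ normal(0, 1) ;
	// likelihood: Poisson for the counts, lognormal for the average sale value
	for (d in 1:n_days) {
		vector[3] day_log_rate = log_rate_mean + rate_day_sd .* log_rate_day_z[d]' ;
		views[d, 1]  ~ poisson_log(day_log_rate[1] + rate_diff[1] / 2) ;
		views[d, 2]  ~ poisson_log(day_log_rate[1] - rate_diff[1] / 2) ;
		clicks[d, 1] ~ poisson_log(day_log_rate[2] + rate_diff[2] / 2) ;
		clicks[d, 2] ~ poisson_log(day_log_rate[2] - rate_diff[2] / 2) ;
		sales[d, 1]  ~ poisson_log(day_log_rate[3] + rate_diff[3] / 2) ;
		sales[d, 2]  ~ poisson_log(day_log_rate[3] - rate_diff[3] / 2) ;
		av_sale_val[d, 1] ~ lognormal(log_val_mean + val_diff / 2, val_sd) ;
		av_sale_val[d, 2] ~ lognormal(log_val_mean - val_diff / 2, val_sd) ;
	}
}

Here rate_diff and val_diff are the A/B differences of interest, with val_diff being the difference in log average sale value.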


@mike-lawrence , is there a way to visualize the model you suggest?

Thank you for sharing the knowledge.

I remember there being a tool someone was working on to translate Stan models into DAG visuals; is that along the lines of what you mean by visualize? Or do you mean visualize what the model's posterior implies for any difference between A & B?

Oh, since you have my brain reflecting on this scenario again, quick question: was it a concurrent experimental design, such that there is data for both A and B for every day, and day 1 for the A data is literally the same day as day 1 for the B data? If that were the case, then I'd suggest a Gaussian Process (possibly with periodic components, depending on how many days it was run, to capture weekly and annual patterns) to model an average for each outcome type around which the A & B conditions then vary. To capture the correlations I mentioned before, you could take an SEM-style route whereby you have one hyper-latent GP that "loads" on the latent GPs for each measure type, but it might also be possible/more efficient to model things as a GP in three dimensions. @avehtari might chime in on whether a 3D GP would capture correlations across days in the way I'd imagine would be necessary.


I think the model for A/B testing assumes that the underlying distribution of the variable (average revenue per click / sale) is constant over the course of the trial; namely, the parameters of the distribution are constant.

Isn't using a Gaussian Process the opposite of that?


@mike-lawrence , the samples are in parallel; namely, the variants are shown to different users on the same day.

I meant that I want to see the hierarchy graphically:
what the distribution of the sale per click would be, its parameters, their distributions, etc.

Cool, then my suggested GP(s) to capture structure in the day-to-day intercept applies.

I'm unclear if you mean my original model or "typical models", but regardless, if you are fundamentally interested in accurate & optimally-informed inference on the A/B difference, I think it would be folly to a priori assume that all variability across days reflects unstructured random effects. For clarity, think of a simpler AB scenario with just one Gaussian outcome, wherein an "all variability across days is random" model would be:

data{
	int n_days;
	matrix[n_days,2] ab ;
}
parameters{
	real intercept ;
	real difference ;
	real<lower=0> noise ;
}
model{
	// priors
	intercept ~ ... ;
	difference ~ ... ;
	noise ~ ... ;
	// likelihood
	ab[,1] ~ normal( intercept + difference/2 , noise ) ;
	ab[,2] ~ normal( intercept - difference/2 , noise ) ;
}

And a model permitting GP structure to the intercept as I’m suggesting would be:

...
parameters{
	vector[n_days] f_intercept ;
	real difference ;
	real<lower=0> noise ;
	... // GP-related parameters here
}
model{
	// priors
	f_intercept ~ GP(...) ;
	difference ~ ... ;
	noise ~ ... ;
	... // GP-related parameter priors here
	// likelihood
	ab[,1] ~ normal( f_intercept + difference/2 , noise ) ;
	ab[,2] ~ normal( f_intercept - difference/2 , noise ) ;
}

Take note that the GP is on the intercept, i.e. the quantity reflecting the mean outcome regardless of the AB condition. I think it's certainly the case that, especially in the realm of human behaviour, temporal structure abounds, and the burden of proof would be on accounts positing the absence of said structure. And certainly, if such structure exists, the second model permits attribution of the variability associated with that structure to the GP, decreasing the left-over variability/uncertainty that must then get dispersed among difference and noise. The first model has no such mechanism, and thus both difference and noise will be left with greater uncertainty.
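For what it's worth, one way the GP(...) placeholder could be fleshed out is sketched below; it uses an exponentiated-quadratic kernel with a non-centred parameterization, and names like gp_amplitude / gp_lengthscale plus all prior values are placeholders (gp_exp_quad_cov is the newer name for cov_exp_quad):

data {
	int<lower=1> n_days ;
	matrix[n_days, 2] ab ;
}
transformed data {
	array[n_days] real day_idx ;
	for (d in 1:n_days) day_idx[d] = d ;
}
parameters {
	real intercept ;
	real difference ;
	real<lower=0> noise ;
	real<lower=0> gp_amplitude ;
	real<lower=0> gp_lengthscale ;
	vector[n_days] gp_z ;
}
transformed parameters {
	vector[n_days] f_intercept ;
	{
		matrix[n_days, n_days] K = gp_exp_quad_cov(day_idx, gp_amplitude, gp_lengthscale) ;
		K = K + diag_matrix(rep_vector(1e-9, n_days)) ;  // jitter for numerical stability
		f_intercept = intercept + cholesky_decompose(K) * gp_z ;
	}
}
model {
	// placeholder priors
	intercept ~ normal(0, 1) ;
	difference ~ normal(0, 0.5) ;    // peaked at zero: "no A/B difference" is the simpler model
	noise ~ normal(1, 0.5) ;         // not peaked at zero
	gp_amplitude ~ normal(0, 0.5) ;  // peaked at zero: "no temporal structure" is the simpler model
	gp_lengthscale ~ inv_gamma(5, 5) ;
	gp_z ~ std_normal() ;
	// likelihood
	ab[,1] ~ normal( f_intercept + difference/2 , noise ) ;
	ab[,2] ~ normal( f_intercept - difference/2 , noise ) ;
}

This is still the toy single-Gaussian-outcome version; the same day-indexed GP intercept idea carries over to the count/sale-value model sketched earlier in the thread.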

It’s even possible (& possibly advisable, depending on the context) to add structure to capture structured by-day variation in the AB difference too.

The key to all of this is to employ a principled workflow whereby:

  1. domain expertise informs on the qualitative structures included in the model (i.e. decisions on whether to include a GP across days on the intercept or not; for such binary decisions I’d lean toward a low bar for inclusion)
  2. domain expertise informs on key parameters of any included structures, including, where possible, a parameterization that permits simple specification of structural-simplicity-preferring priors (ex. the parameterization of either model above permits a peaked-at-zero prior for the AB difference, where a zero difference reflects a simpler "less structure" model; in the case of the GP, a prior on noise that is not peaked at zero and a prior on the GP amplitude that is peaked at zero similarly reflect a simplicity-preferring parameterization-&-prior combo).
  3. Both prior predictive checks and posterior predictive checks to validate that the implementations of 1 & 2 are actually consistent with both domain expertise and the data at hand.

If I get you right, you'd model the data as having a mean shared by the variants up to the difference, while the daily mean is a Gaussian Process.
Wouldn't such a model require a large number of samples to get a good estimate of the parameters?

I have seen the paper Bayesian A/B Testing at VWO. In chapter 10 it has:

[image of the model from the paper]

The problem is that it assumes data is given per user, while in my case the data is aggregated (per day).
Is there a way to adapt it?

Certainly more complex models have larger parameter spaces across which uncertainty is distributed, so to the degree that some of the parameter space is unnecessary, such larger models will require more data to achieve the same uncertainty reduction in the necessary parameter dimensions as a simpler model that contains only the necessary/true structures. But we never know what that true structure is, and it's also the case that models that omit necessary structure similarly achieve less uncertainty reduction per datum than models with precisely the true structure. So statements on the advisability of more or less complicated structure on the basis of how much data it requires are certainly nonsensical, and a better way of thinking about it is the principled workflow I described, which can indeed lead to eventual reduction of model structure but doesn't a priori avoid complex structure when there's strong domain expertise to motivate exploration of said structures.

And the GP strategy I suggest does not require observations from multiple users per day; that would help achieve greater uncertainty reduction, but isn't strictly necessary. There are lots of examples of GPs with one observation per day that you can find around; indeed, most one-dimensional GP examples show this case.


Another issue with your very elegant approach is scaling: in the case of A/B/C (3 variants) it will be trickier.

Do you have any suggestions for a simpler model for the aggregated data?
What about the model I linked; is there any way to make it suitable for aggregated data?

Maybe model something like a sum of Exponential distributions as a Gamma distribution?

Any idea how to do it?
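For concreteness, a rough sketch of that idea (placeholder names and priors): if each sale's revenue is Exponential(lambda), then the sum over a day's sales is Gamma(sales, lambda), and the counts can stay Binomial along the views → clicks → sales chain:

data {
	int<lower=1> n_days ;
	array[n_days, 2] int<lower=0> views ;   // column 1 = A, column 2 = B
	array[n_days, 2] int<lower=0> clicks ;
	array[n_days, 2] int<lower=0> sales ;
	matrix<lower=0>[n_days, 2] revenue ;    // total daily revenue per variant
}
parameters {
	vector<lower=0, upper=1>[2] click_rate ;  // P(click | view), A & B
	vector<lower=0, upper=1>[2] sale_rate ;   // P(sale | click), A & B
	vector<lower=0>[2] lambda ;               // rate of the per-sale Exponential revenue
}
model {
	// placeholder priors
	click_rate ~ beta(1, 1) ;
	sale_rate ~ beta(1, 1) ;
	lambda ~ gamma(1, 1) ;
	for (v in 1:2) {
		for (d in 1:n_days) {
			clicks[d, v] ~ binomial(views[d, v], click_rate[v]) ;
			sales[d, v] ~ binomial(clicks[d, v], sale_rate[v]) ;
			// sum of sales[d, v] iid Exponential(lambda) revenues is Gamma(sales[d, v], lambda)
			if (sales[d, v] > 0)
				revenue[d, v] ~ gamma(sales[d, v], lambda[v]) ;
		}
	}
}
generated quantities {
	// mean per-sale revenue is 1 / lambda
	vector[2] revenue_per_click = sale_rate ./ lambda ;
	vector[2] revenue_per_view = click_rate .* sale_rate ./ lambda ;
}

This assumes the rates and lambda are constant across days, i.e. no day-to-day structure of the kind discussed above.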