Hi,
I am trying to model something relatively simple but somehow can't get it together. I believe my problem is more a lack of knowledge in Bayesian statistics than in Stan, but I'd be really glad to get some help.
Let me first explain what I am trying to do and what kind of data I have before going into the model, so the context is clear.
I have an existing article recommender that recommends one related article for each article you browse. The goal is to assess whether the recommender is doing a good job by looking at the improvement in CTR for the related articles.
The test was run as follows: in the experiment group we collect clicks and impressions for a related article only on the pages where the model recommended it, while in the control group we collect clicks and impressions for the same related article on pages where the model would not have recommended it.
To give a more concrete example.
Let’s say I have data as follows:
mydf <- data.frame(
  assignment = c(rep(1, 4), rep(2, 4)),
  impressions = c(12003, 40049, 3077, 8021, 600937, 31059, 3008, 3000),
  clicks = c(142, 84, 31, 60, 833, 246, 206, 168),
  variant = rep(1:4, 2)
)
So it looks like this:

| assignment | impressions | clicks | variant |
|---|---|---|---|
| 1 | 12003 | 142 | 1 |
| 1 | 40049 | 84 | 2 |
| 1 | 3077 | 31 | 3 |
| 1 | 8021 | 60 | 4 |
| 2 | 600937 | 833 | 1 |
| 2 | 31059 | 246 | 2 |
| 2 | 3008 | 206 | 3 |
| 2 | 3000 | 168 | 4 |
Assignment 1 is the experiment group and assignment 2 is the control group.
Variant is the related-article id.
For example, the first line says that when we showed related article 1 on pages where the model recommended it, we got a total of 12003 impressions and 142 clicks. The fifth line says that when we showed related article 1 on pages where the model did not recommend it, we got 600937 impressions and 833 clicks.
I would like to know if assignment 1 has a higher CTR than assignment 2.
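For reference, here are the raw per-row CTRs (a quick Python sanity check on the table above):

```python
# Raw CTRs per row of the table (rows 1-4: experiment, rows 5-8: control).
impressions = [12003, 40049, 3077, 8021, 600937, 31059, 3008, 3000]
clicks = [142, 84, 31, 60, 833, 246, 206, 168]
ctr = [c / i for c, i in zip(clicks, impressions)]
for row, r in enumerate(ctr, start=1):
    print(f"row {row}: CTR = {r:.4f}")
```

Note that the raw CTRs vary by more than an order of magnitude across cells, which is exactly why I think the per-variant structure matters.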
My assumption is that some related articles have a higher overall CTR (because of higher-quality content), and therefore I believe I should put a prior on each variant that is shared between the control group and the experiment group.
This seems like a fairly easy test. However, I can't figure out how to model it so that there is some partial pooling between the variants. It seems to me that, because of the way the data is aggregated, I have very few data points, each with low uncertainty, which makes the model very sensitive to the prior.
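To make the point about few data points with low uncertainty concrete: with a flat Beta(1,1) prior on each cell, the posterior is conjugate, Beta(1 + clicks, 1 + impressions - clicks), and every cell is pinned down very tightly (a quick Python check):

```python
import math

# Per-cell conjugate posterior under a flat Beta(1,1) prior:
# theta | data ~ Beta(1 + c, 1 + n - c).
impressions = [12003, 40049, 3077, 8021, 600937, 31059, 3008, 3000]
clicks = [142, 84, 31, 60, 833, 246, 206, 168]

sds = []
for c, n in zip(clicks, impressions):
    a, b = 1 + c, 1 + n - c
    mean = a / (a + b)
    sd = math.sqrt(a * b / ((a + b) ** 2 * (a + b + 1)))
    sds.append(sd)
    print(f"posterior mean {mean:.4f}, sd {sd:.5f}")
```

Every cell has a posterior sd well below its mean, so the likelihood nails each theta individually and the hyperpriors end up carrying most of the weight with only eight cells.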
I tried different approaches that I'd like to show here.
- Following DBDA and adding another level from this article
The code looks like this:
mymodel_hie_shrink <- "
data {
  int<lower=0> A;              // number of assignments
  int<lower=0> V;              // number of variants
  int<lower=0> N;              // number of observations
  int<lower=1,upper=V> OV[N];  // observed variant
  int<lower=1,upper=A> OA[N];  // observed assignment
  int<lower=0> I[N];           // number of impressions
  int<lower=0> C[N];           // number of clicks
}
parameters {
  vector<lower=0.001, upper=0.03>[V] phi_var;
  vector<lower=1>[V] kappa_var;
  vector<lower=0.001, upper=0.03>[A] phi_ass;
  vector<lower=1>[A] kappa_ass;
  vector<lower=0, upper=1>[N] theta;  // chance of success
}
transformed parameters {
  vector[V] alpha_var;
  vector[V] beta_var;
  vector[A] alpha_ass;
  vector[A] beta_ass;
  alpha_ass = phi_ass .* kappa_ass;
  beta_ass = (1 - phi_ass) .* kappa_ass;
  alpha_var = phi_var .* kappa_var;
  beta_var = (1 - phi_var) .* kappa_var;
}
model {
  kappa_ass ~ gamma(0.01, 0.05);  // hyperprior
  kappa_var ~ gamma(0.01, 0.05);  // hyperprior
  for (i in 1:N) {
    target += beta_lpdf(phi_var[OV[i]] | alpha_ass[OA[i]], beta_ass[OA[i]]);
    target += beta_lpdf(theta[i] | alpha_var[OV[i]], beta_var[OV[i]]);
    target += binomial_lpmf(C[i] | I[i], theta[i]);
  }
}
generated quantities {
  real diff;
  diff = phi_ass[1] - phi_ass[2];
}
"
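For context, the phi/kappa construction in the transformed parameters block is the mean/concentration parameterization of the beta distribution: with alpha = phi * kappa and beta = (1 - phi) * kappa, the beta prior has mean phi and variance phi * (1 - phi) / (kappa + 1), so larger kappa means stronger shrinkage toward phi. A quick numeric check:

```python
# Mean/concentration parameterization of the beta distribution:
#   alpha = phi * kappa, beta = (1 - phi) * kappa
#   => mean = phi, variance = phi * (1 - phi) / (kappa + 1)
phi, kappa = 0.01, 200.0
alpha = phi * kappa
beta = (1 - phi) * kappa
mean = alpha / (alpha + beta)
var = alpha * beta / ((alpha + beta) ** 2 * (alpha + beta + 1))
print(abs(mean - phi) < 1e-12)                           # True
print(abs(var - phi * (1 - phi) / (kappa + 1)) < 1e-12)  # True
```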
I tried different priors and the model seems very sensitive to them; in addition, with these priors the results do not really make sense. Please note that I also tried changing the hyperprior to what Bob's case study used (Gelman's prior).
- Using a built-in approach
The code looks like this:
data {
  int<lower=0> A;              // number of assignments
  int<lower=0> V;              // number of variants
  int<lower=0> N;              // number of observations
  int<lower=1,upper=V> OV[N];  // observed variant
  int<lower=1,upper=A> OA[N];  // observed assignment
  int<lower=0> I[N];           // number of impressions
  int<lower=0> C[N];           // number of clicks
}
parameters {
  vector<lower=0, upper=1>[V] phi;           // population chance of success
  vector<lower=1>[V] kappa;                  // population concentration
  vector<lower=0, upper=1>[N] theta;         // chance of success
  vector<lower=-0.03, upper=0.03>[V] delta;  // improvement of experiment vs control
  real<lower=-0.3, upper=0.3> mu;            // prior on mean diff
  real<lower=0, upper=0.05> sigma;           // prior on sd diff
}
transformed parameters {
  vector[N] alpha;
  vector[N] beta;
  vector[N] theta_prime;
  for (i in 1:N) {
    int idx = OV[i];
    alpha[i] = phi[idx] * kappa[idx];
    beta[i] = (1 - phi[idx]) * kappa[idx];
    theta_prime[i] = theta[i];
    if (OA[i] == 1) {  // assignment 1 = experiment group
      theta_prime[i] = theta[i] + delta[idx];
    }
  }
}
model {
  kappa ~ pareto(1, 1.5);  // hyperprior
  mu ~ normal(0, 0.01);
  sigma ~ exponential(200);
  delta ~ normal(mu, sigma);
  theta ~ beta(alpha, beta);
  C ~ binomial(I, theta_prime);
}
generated quantities {
  real diff;
  diff = mu;
}
Again, this approach had a very hard time (if you change the data set a bit it does not converge) and is very sensitive to the prior.
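One thing I noticed while debugging this version: theta is constrained to [0,1] and delta to [-0.03, 0.03], so theta_prime = theta + delta can leave [0,1] for the low-CTR cells, and the binomial likelihood then rejects. A toy Python check of the boundary problem:

```python
# theta_prime = theta + delta is not guaranteed to stay in [0, 1]:
theta = 0.002  # a plausible low CTR (e.g. variant 1 in the control group)
delta = -0.01  # well inside the declared [-0.03, 0.03] bounds on delta
theta_prime = theta + delta
print(0.0 <= theta_prime <= 1.0)  # False: invalid binomial probability
```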
Notes:
As a comparison, I am using a simple model that completely pools the variants within each assignment:
data {
  int<lower=0> A;              // number of assignments
  int<lower=0> V;              // number of variants
  int<lower=0> N;              // number of observations
  int<lower=1,upper=V> OV[N];  // observed variant
  int<lower=1,upper=A> OA[N];  // observed assignment
  int<lower=0> I[N];           // number of impressions
  int<lower=0> C[N];           // number of clicks
}
parameters {
  vector<lower=0, upper=1>[A] theta;  // chance of success per assignment
}
model {
  theta ~ beta(1, 1);
  C ~ binomial(I, theta[OA]);
}
generated quantities {
  real diff;
  diff = theta[1] - theta[2];
}
generated quantities {
real diff;
diff = theta[1] - theta[2];
}
And its results are completely different.
Is there anything I am missing? What kind of model could I use with fairly vague priors?