[Solved] Beta binomial with multilevel partial pooling


#1

Hi,
I am trying to model something relatively easy but somehow can’t get it together. I believe my problem is mostly lack of knowledge in bayesian statistics than stan but I’d be really glad to get some help.
Let me first explain what I am trying to do and what kind of data I have before going into the model so maybe it can help to get the context.
I have an already existing article recommender model that recommend one related article for each article you browse. The goal is to asses if the recommender model is doing a good job by looking at the improvement of CTR for the related articles.
The test was run in a way such that we have a control group where we collect all the clicks and impressions for a “related article” on pages that it would not have been recommended by the model and the experiment is we collect the clicks and impressions for the related article on only page it was recommended.
To give a more concrete example.
Let’s say I have data as follow:

mydf <- data.frame(
                   assignment = c(rep(1,4),rep(2,4)),
                   impressions = c(12003,40049,3077,8021,600937,31059,3008,3000),
                   clicks = c(142,84,31,60,833,246,206,168),
                   variant = rep(1:4,2)
                   )

So it looks like that

assignment impressions clicks variant
1 12003 142 1
1 40049 84 2
1 3077 31 3
1 8021 60 4
2 600937 833 1
2 31059 246 2
2 3008 206 3
2 3000 168 4

Assignment is control(2) and experiment(1).
Variant is the related article id.
For example on the first line it says that when we did put the related article 1 on pages that were recommended by the model we had a total amount of impressions of 12003 and a total amount of clicks of 142. On the other hand when you put the related article 1 on pages that were not recommended by the model you had a total amount of impressions of 600937 and 833 clicks.
I would like to know if assignment 1 has a higher CTR than assignment 2.
My assumption is that some “related article” have higher overall CTR because of (higher quality content) and therefore I believe I should have a prior on each variant that is common to the control group and experiment group.
Seems like a fairly easy test. However, I can’t figure out how to model it such that you have some partial pooling between the variant. It seems to me that because of the way the data is put together I have very few point with low uncertainty which make the model very sensible to the prior.
I tried different approach that I’d like to show here.

  1. Following DBDA and adding another level from this article

The code look like that:

mymodel_hie_shrink <-"
data {
    int<lower=0> A; //number of assignment
    int<lower=0> V; //number of variant
    int<lower=0> N; //number of observations
    int<lower=0,upper=A> OV[N]; //observed variant
    int<lower=0,upper=V> OA[N]; //observed assignment
    int<lower=0> I[N]; //number of impressions
    int<lower=0> C[N] ; //number of clicks
}


parameters {
  vector<lower=0.001, upper=0.03>[V] phi_var;
  vector<lower=1>[V] kappa_var;
  vector<lower=0.001, upper=0.03>[A] phi_ass;
  vector<lower=1>[A] kappa_ass;
  vector<lower=0, upper=1>[N] theta;  // chance of success
}

transformed parameters {
    vector[V] alpha_var;
    vector[V] beta_var;
    vector[A] alpha_ass;
    vector[A] beta_ass;
    alpha_ass = phi_ass .* kappa_ass;
    beta_ass = (1 - phi_ass) .* kappa_ass;
    alpha_var = phi_var .* kappa_var;
    beta_var = (1 - phi_var) .* kappa_var;
}

model {
  kappa_ass ~ gamma(0.01,0.05);                        // hyperprior
  kappa_var ~ gamma(0.01,0.05);                        // hyperprior
  for (i in 1:N){
      target += beta_lpdf(phi_ad[OV[i]] | alpha_ass[OA[i]],beta_ass[OA[i]]);
      target += beta_lpdf(theta[i] | alpha_var[OV[i]],beta_var[OV[i]]);
      target += binomial_lpmf( C[i] | I[i],theta[i]) ;
  }
}
generated quantities {
  real diff;
  diff = phi_ass[1] - phi_ass[2];
}
"

I tried different prior and it seems to be very sensible to the prior, in addition with theses prior the result does not really make sense. Please note that I also tryied to change the hyperprior to use what Bob’s case study used (Gelman’s prior).

  1. Using a built in approach

the code look like that

data {
    int<lower=0> A; //number of assigmnent
    int<lower=0> V; //number of variant
    int<lower=0> N; //number of observations
    int<lower=0,upper=V> OV[N]; //observed variant
    int<lower=0,upper=A> OA[N]; //observed assignment
    int<lower=0> I[N]; //number of impressions
    int<lower=0> C[N] ; //number of clicks
}


parameters {
  vector<lower=0, upper=1>[V] phi;         // population chance of success
  vector<lower=1>[V] kappa;                // population concentration
  vector<lower=0, upper=1>[N] theta;  // chance of success
  vector<lower=-.03,upper=0.03>[V] delta; // improvement of experiment vs control
  real<lower=-0.3,upper=0.3> mu; //prior on mean diff
  real<lower=0,upper=0.05> sigma; //prior on sd diff
}

transformed parameters {
    vector[N] alpha;
    vector[N] beta;
    vector[N] theta_prime;
    for (i in 1:N){
        int idx = OV[i];
        alpha[i] = phi[idx] * kappa[idx];
        beta[i] = (1 - phi[idx]) * kappa[idx];
        theta_prime = theta;
        if( vidx == 1) {
            theta_prime = theta_prime + delta[idx];
        }

    }
}

model {
  kappa ~ pareto(1, 1.5);                        // hyperprior
  mu ~ normal(0,0.01);
  sigma ~ exponential(200);
  delta ~ normal(mu,sigma);
  theta ~ beta(alpha,beta);
  C ~ binomial(I,theta_prime);
}
generated quantities {
  real diff;
  diff = mu;
}

Again this approach had a very hard time (if you change the data set a bit it will not converge) and is very sensible to the prior.

Notes:
As a comparison I am using a simple model that just does partial pooling on the assignment:

data {
    int<lower=0> A; //number of assigmnent
    int<lower=0> V; //number of variant
    int<lower=0> N; //number of observations
    int<lower=0,upper=V> OV[N]; //observed variant
    int<lower=0,upper=A> OA[N]; //observed assignment
    int<lower=0> I[N]; //number of impressions
    int<lower=0> C[N] ; //number of clicks
}


parameters {
  vector<lower=0, upper=1>[A] theta;  // chance of success
}


model {
    theta ~ beta(1,1);
    C ~ binomial(I,theta[OA]);
}
generated quantities {
  real diff;
  diff = theta[1] - theta[2];
}

And my result are completely different.

is there anything I am missing? What kind of model could I use with pretty vague priors?


#2

Hi,
your code is quite long and I am a bit too lazy to try to understand it completely (sorry), but there are a few tricks that seem applicable to your situation and frequently help people fit their models:

  1. Avoid hard constraints unless they are actual physical/mathematical limits. It is OK to have vector<lower=0, upper=1>[N] theta; as probability has to be in [0,1], but you should not have things like real<lower=-0.3,upper=0.3> mu;. Instead your prior should provide a soft constraint. e.g. having mu unconstrained with mu ~ normal(0,0.15) gives little mass to anything outside [-0.3, 0.3] but is easier for the sampler to work with

  2. The gamma(0.01,0.05); is really very wide prior maybe try something more constrained? E.g. normal or cauchy prior? Andrew Gelman generally advises against wide Gamma priors on theoretical grounds (search his blog for more details).

  3. Try simulating data from the model (sample hyperparameters from their priors, then params, then actual data) - this way you know the true parameters and can check that your model recovers them correctly. It will also force you to specify your priors well - if most of the data you generate this way looks unrealistic, your priors are wrong.
    You can search this forum and Andrew’s papers for “prior predictive check” for some further discussion.

Hope that helps.


#3

I hope so, too, as it’s all important advice.

We spend a lot of time trying to undo advice people picked up from looking at lots of BUGS models. The ironic thing is that the models didn’t fit in BUGS, but Gibbs is so slow to mix given its conditional formulation that you often couldn’t detect the problem.

But it’s not going to recreate the answers from a different model.

You don’t need theta ~ beta(1, 1) in Stan—that’s just uniform and its implicit.

I also couldn’t follow the larger model. You probably don’t want exponential(200), either!


#4

It turns out that I had a typical 8 schools problem. Using the standard error helped me solve this.
Thanks for all your advises!