Hierarchical mixture models in Stan

Hi everybody!
I was trying to replicate the hierarchical mixture model proposed in “Bayesian hierarchical model for the prediction of football results” by G. Baio and M. Blangiardo (2010), but I am having some trouble writing the model in Stan (this is the first time I have used Stan and I am not familiar with the C++ language).

Briefly, there are 3 different generating mechanisms: one for the top teams, one for the mid-table teams and one for the bottom-table teams. Each team t has 2 latent variables,
grp^{att}(t) and grp^{def}(t), which take on the values 1, 2, 3, identifying bottom-, mid- or top-table performance in terms of attack and defence. These are given suitable categorical distributions, each depending on a vector of prior probabilities
\pi^{att}_t = (\pi^{att}_{1t}, \pi^{att}_{2t}, \pi^{att}_{3t}) and
\pi^{def}_t = (\pi^{def}_{1t}, \pi^{def}_{2t}, \pi^{def}_{3t}),
which follow a Dirichlet distribution with parameters (1, 1, 1). They represent the prior chance that each team is in one of the 3 categories.

Let y1 be the number of goals scored by the home team. The likelihood is
y1 | \theta1 \sim Poisson(\theta1), with
log(\theta1) = home + att_{hometeam} + def_{visitorteam}

Let y2 be the number of goals scored by the away team. The likelihood is
y2 | \theta2 \sim Poisson(\theta2), with
log(\theta2) = att_{visitorteam} + def_{hometeam}

Where

  • home \sim Normal(0, 0.0001) is the advantage for teams playing at home (in the paper’s BUGS notation the second argument is a precision, so this is an essentially flat prior)
  • att_t \sim nct(\mu^{att}_{grp(t)}, \tau^{att}_{grp(t)}, \nu) (nct = non-central Student-t). Marginalizing over the group membership, the prior on att is the mixture
    att_t \sim \sum_{k=1}^3 \pi^{att}_{kt} \cdot nct(\mu^{att}_k, \tau^{att}_k, \nu=4)
  • def_t \sim nct(\mu^{def}_{grp(t)}, \tau^{def}_{grp(t)}, \nu), and likewise the marginal prior on def is the mixture
    def_t \sim \sum_{k=1}^3 \pi^{def}_{kt} \cdot nct(\mu^{def}_k, \tau^{def}_k, \nu=4)
    (I sketch how I read this in Stan terms right after this list.)
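
Here is a sketch of how I understand the model in Stan terms. This is my reading, not code from the paper: I use standard deviations rather than precisions, I take tau_att and tau_def to be vectors of per-group scales, and I show only the model block, assuming declarations like the ones in my attempt below.

// sketch: mixture priors with the discrete group marginalized out,
// plus the scoring likelihood
model {
  for (t in 1:J) {
    vector[K] lp_att;
    vector[K] lp_def;
    for (k in 1:K) {
      lp_att[k] = log(pi_att[k]) + student_t_lpdf(att[t] | 4, mu_att[k], tau_att[k]);
      lp_def[k] = log(pi_def[k]) + student_t_lpdf(def[t] | 4, mu_def[k], tau_def[k]);
    }
    target += log_sum_exp(lp_att); // marginal (mixture) prior on att[t]
    target += log_sum_exp(lp_def); // marginal (mixture) prior on def[t]
  }
  for (n in 1:N) {
    y1[n] ~ poisson_log(home + att[hteams_idx[n]] + def[vteams_idx[n]]);
    y2[n] ~ poisson_log(att[vteams_idx[n]] + def[hteams_idx[n]]);
  }
}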

If a team has poor performance, then it is likely to show a low propensity to score and a high propensity to concede goals; this is represented using the following truncated normal distributions:

  • poor teams
    \mu^{att}_1 \sim truncNormal(0, 0.001, -3, 0)
    \mu^{def}_1 \sim truncNormal(0, 0.001, 0, 3)
  • top teams
    \mu^{att}_3 \sim truncNormal(0, 0.001, 0, 3)
    \mu^{def}_3 \sim truncNormal(0, 0.001, -3, 0)
  • mid teams
    \mu^{att}_2 \sim Normal(0, \tau^{att}_2)
    \mu^{def}_2 \sim Normal(0, \tau^{def}_2)

For all 3 groups, the precisions are modelled using independent minimally informative Gamma distributions (I sketch the truncations in Stan right after this):
    \tau^{att}_k \sim Gamma(0.01, 0.01)
    \tau^{def}_k \sim Gamma(0.01, 0.01)
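
From the manual, I think truncation with fixed bounds is most naturally written in Stan as a constrained parameter declaration, rather than by adding truncation terms to the target by hand (which is what I attempt below). A sketch of what I am aiming for — the names and the scale of 10 are mine, standing in for the paper’s vague precisions, so this is not the paper’s code:

parameters {
  real<lower=-3, upper=0> mu_att1; // poor teams: attack mean in (-3, 0)
  real<lower=0, upper=3> mu_def1;  // poor teams: defence mean in (0, 3)
  real mu_att2;                    // mid teams: unconstrained means
  real mu_def2;
  real<lower=0, upper=3> mu_att3;  // top teams: attack mean in (0, 3)
  real<lower=-3, upper=0> mu_def3; // top teams: defence mean in (-3, 0)
  real<lower=0> tau_att2;          // mid-group scale
  real<lower=0> tau_def2;
}
model {
  // with fixed declared bounds, a plain ~ statement already gives a
  // truncated normal (up to a normalizing constant)
  mu_att1 ~ normal(0, 10);
  mu_def1 ~ normal(0, 10);
  mu_att3 ~ normal(0, 10);
  mu_def3 ~ normal(0, 10);
  mu_att2 ~ normal(0, tau_att2);
  mu_def2 ~ normal(0, tau_def2);
}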

I’ve done my best writing the Stan model below, but it does not work. I have a few questions to ask, and some explanations to give, about the model I wrote.

  1. Since Stan does not allow discrete parameters, I found an alternative approach on the internet, based on the alpha construction you can see in the code below. Are there any other ways?
  2. mu_att and mu_def vary over groups. I tried to capture this behaviour using a for loop, but fitting the model in R gives a warning saying that sampling was not possible because the index k = 0 is accessed, even though my loop only runs over k = 1, 2, 3! How can I write this part correctly?
  3. in the last for loop, I wanted to express
    att_t = \sum_{k=1}^3 \pi^{att}_{kt} \cdot nct(\mu^{att}_k, \tau^{att}_k, \nu=4)
    def_t = \sum_{k=1}^3 \pi^{def}_{kt} \cdot nct(\mu^{def}_k, \tau^{def}_k, \nu=4)
    but I don’t think this is the right way of doing so.

I’ve tried a lot of variations, but I no longer know what to try.

data {
  // Dimensions
  int<lower=1> N; // number of matches
  int<lower=1, upper=20> J; // number of teams
  int<lower=1, upper=3> K; // number of groups
  
  // Variables
  int<lower=0> y1[N]; // number of goals of home: outcomes
  int<lower=0> y2[N]; // number of goals of visitor: outcomes
  int<lower=1, upper=J> hteams_idx[N]; //identifier for each teams in home
  int<lower=1, upper=J> vteams_idx[N]; //identifier for each teams in away
  real<lower=0.5> alphaNu; //degrees of freedom for alpha0N = J*2
  vector<lower=0.5>[K] alpha0mu; // prior for alpha0 = c(0.5, 0.5, 0.5)
}
parameters {
  real<lower=0, upper=1> home;
  vector[J-1] att_start;
  vector[J-1] def_start;
  simplex[K] pi_att;
  simplex[K] pi_def; 
  real<lower=0, upper=N> alphaN;
  simplex[K] alpha0;
  ordered[K] mu_att;
  ordered[K] mu_def;
  real<lower=0> tau_att;
  real<lower=0> tau_def;
}
transformed parameters{
    vector[J] att = append_row(att_start, -sum(att_start));
    vector[J] def = append_row(def_start, -sum(def_start));
    vector<lower=1>[K] alpha = alpha0*alphaN;
}    
model{
  home ~ normal(0, 0.1); //flat prior, fixed effect
  
  alphaN ~ chi_square(alphaNu);
  alpha0 ~ dirichlet(alpha0mu);
  
  pi_att ~ dirichlet(alpha+0.01);
  pi_def ~ dirichlet(alpha+0.01);
 
  tau_att ~ gamma(0.01, 0.01);
  tau_def ~ gamma(0.01, 0.01);

 for(k in 1:K){
  mu_att ~ normal(0, 0.001); 
  if (mu_att[k==1] < -3 || mu_att[k==1] > 0)
  target += negative_infinity();
  else
  target += -log_diff_exp(normal_lcdf(0 | 0, 0.001),
                          normal_lcdf(-3 | 0, 0.001));
  if (mu_att[k==2])
  target += normal_lpdf(mu_att[k==2] | 0, tau_att);
  if (mu_att[k==3] < 0 || mu_att[k==3] > 3)
  target += negative_infinity();
  else
  target += -log_diff_exp(normal_lcdf(3 | 0, 0.001),
                          normal_lcdf(0 | 0, 0.001));
  }
   
   for(k in 1:K){
  mu_def ~ normal(0, 0.001);
  if (mu_def[k==1] < 0 || mu_def[k] > 3)
  target += negative_infinity();
  else
  target += -log_diff_exp(normal_lcdf(3 | 0, 0.001),
                          normal_lcdf(0 | 0, 0.001));
  if (mu_def[k==2])
  target += normal_lpdf(mu_def[k] | 0, tau_def);
  if (mu_def[k==3] < -3 || mu_def[k] > 0)
  target += negative_infinity();
  else
  target += -log_diff_exp(normal_lcdf(0 | 0, 0.001),
                          normal_lcdf(-3 | 0, 0.001));
   }

for(j in 1:J){
  vector[J] att_mixt;
  vector[J] def_mixt;
  for(k in 1:K){
    att_mixt[j] += log(pi_att[k]) + 
    student_t_lpdf(att[j]|4, mu_att[k], tau_att);
    def_mixt += log(pi_def[k]) + 
    student_t_lpdf(def[j]| 4, mu_def[k], tau_def);
  }
  target += poisson_log_lpmf(y1| home + att_mixt[hteams_idx] 
  - def_mixt[vteams_idx]);
  target += poisson_log_lpmf(y2| att_mixt[vteams_idx] - def_mixt[hteams_idx]);
}
}

I thank in advance anyone who takes the time to help me understand where I am going wrong and how I could improve.

Hi!
It looks to me like the for-loops are giving you trouble. I think clarifying two lines might give you the tools to make some progress.

for(k in 1:K){
  mu_att ~ normal(0, 0.001); 

The ~ statement is vectorized over the elements of mu_att, and it is applied K times in this loop. The upshot is that you are saying that mu_att is drawn from an extremely tight distribution around zero, almost surely tighter than you intend. As an aside, do you really mean to use a normal with a standard deviation of 0.001 as your prior here? That’s awfully tight, and it looks like a common mistake made by people coming to Stan from BUGS who then try to set vague priors: in Stan, the normal distribution is parameterized by its standard deviation, not its precision. With that said, I don’t think anyone here would recommend overly vague priors anyway. In any case, by applying this prior K times, you are getting something even tighter than a normal distribution with standard deviation 0.001.
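
To make that concrete, here are two equivalent ways to state the prior exactly once per element (a sketch; the standard deviation of 1 is just an example, not a recommendation):

// a single vectorized statement, outside any loop
mu_att ~ normal(0, 1);

// or, equivalently, element by element, but only once per element
for (k in 1:K)
  mu_att[k] ~ normal(0, 1);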

for(k in 1:K){
...
  if (mu_att[k==2])

your index k ranges from 1 to K. When k is 2, then k == 2 will evaluate to 1 (true), and I think mu_att[k==2] will evaluate to mu_att[1]. When k is not 2, then k == 2 will evaluate to 0, and I think that mu_att[k==2] will attempt to access mu_att[0], which is out of bounds. I imagine this is probably where your warning was coming from.

I cannot tell what you actually mean to do here. if (mu_att[k] == 2) makes sense syntactically, but not conceptually. mu_att[k] is continuous, and therefore will never literally equal exactly 2, so this conditional would always evaluate as FALSE.
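
If what you meant was “apply the mid-group prior when the loop index k is 2”, then the condition should test k itself rather than appear inside the index. Here is a sketch of what I am guessing you intended (I can’t be sure this matches your model):

for (k in 1:K) {
  if (k == 2) {
    // mid-table group: hierarchical normal prior on the group mean
    target += normal_lpdf(mu_att[k] | 0, tau_att);
  }
  // the truncated priors for k == 1 and k == 3 would go in similar
  // branches, or better, in constrained parameter declarations
}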

I think the confusion underlying these two issues is probably rearing its head throughout this code. Getting a better feel for loops and indexing in a language like R or Python (where it’s easier to evaluate code snippets and understand how they behave) might be a good way to begin to feel more comfortable writing loops in general.
