Hi all,
I am posting here since I believe my issues & questions to be a little too involved for forum helpers to dive into, so I’d like to offer small compensation (name your price) for hands-on help with model formulation.
My name is Liam and I’m an undergraduate at UC Berkeley. I am working on a sports modeling research project that aims to model latent strengths of teams and home-field advantages based only on home/away win-loss data. I am using Stan and its companion, RStan, to implement it. I am relatively new to Stan but have been working with it for 2 months now.
I have spent a significant amount of time building and testing different formulations of the model. A lot of my formulations come away with posterior estimates that make sense, i.e. good teams are good and bad teams are bad. However, I’m having trouble picking the right priors, achieving non-divergent, non-autocorrelated results, or otherwise passing all the diagnostic checks. I have the sneaking suspicion that there are much better ways to formulate the model, but I haven’t found them and do not possess the experience and understanding to illuminate a path forward. I’ve done my best to research online but there’s only so much I can learn via reading alone.
This would be a relatively small time commitment, unless of course it takes weeks to hammer out the right formulation. Perhaps it would only take hours, I have no clue.
Let me know if you are interested.
There are a few things to do that might not take up much of folk’s time.
I am not sure if you have done these yet, but try writing a post with the following:
- Plots of your data.
- Simulate some fake data so you know what your true parameters are. Small enough to run fast, large enough to, well, be large enough ;)
- Write out your model.
- Pick reasonable priors based on your literature review, domain specific knowledge, and expert opinion. Have these documented.
- Run your model on your fake data to troubleshoot all the diagnostic checks.
- See if you recovered your fake data parameters.
Include your OS, OS version, version of R, and rstan.
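As a rough sketch of the fake-data step above, one option is to let Stan itself generate outcomes from parameter values you choose. The names true_strength and true_home_team_advantage are made up for illustration, and the logistic form is an assumption (home-win probability driven by the difference in team strengths plus a home advantage), which may not match your current formulation.

```stan
// Fake-data generator: no parameters block, everything is supplied as data.
// Uses the older array declaration style (int x[N]); newer Stan versions
// write this as "array[N] int x" instead.
data {
  int<lower=2> num_teams;
  int<lower=2> num_games;
  int<lower=1,upper=num_teams> home_team[num_games];
  int<lower=1,upper=num_teams> away_team[num_games];
  vector[num_teams] true_strength;      // hypothetical: known strengths you pick
  real true_home_team_advantage;        // hypothetical: known home advantage you pick
}
generated quantities {
  // one simulated win/loss outcome per game, per draw
  int<lower=0,upper=1> home_win[num_games];
  for (g in 1:num_games) {
    home_win[g] = bernoulli_logit_rng(
      true_home_team_advantage
      + true_strength[home_team[g]]
      - true_strength[away_team[g]]
    );
  }
}
```

Because there is nothing to sample, this would be run with the fixed-parameter algorithm (in rstan, algorithm = "Fixed_param"). Each draw is one complete fake season that you can feed back into your model to see whether the known values are recovered.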
I took a look at your post history and the model you’re working on, and I wonder if you’re making things more difficult for yourself than they should be by attempting to model things in the probability space. It actually looks like you should be able to use a standard item response theory (IRT) model for this kind of data, where you model each outcome as a function of each team’s latent strength plus an intercept reflecting home team advantage. I also wonder if one of the reasons your current model is experiencing issues is that you’re not constraining it properly as IRT models usually require (e.g., fixing one team’s strength to an arbitrary constant against which all the others are compared).
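Written out, the kind of model described here might look like the following. The symbols are notation chosen for this sketch, not anything from the thread: $y_g$ is 1 if the home team won game $g$, $\alpha$ is the home team advantage, $\theta_t$ is the latent strength of team $t$, and $h[g]$ and $a[g]$ are the indices of the home and away team in game $g$.

$$
\Pr(y_g = 1) = \operatorname{logit}^{-1}\!\left(\alpha + \theta_{h[g]} - \theta_{a[g]}\right), \qquad \theta_1 = 0
$$

Only differences in strength enter the likelihood, so adding the same constant to every $\theta_t$ leaves all the probabilities unchanged. Pinning one team (here team 1, an arbitrary choice) at zero removes that ambiguity.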
Ara: I think you are right, I should probably do that. I feel like I would struggle a bit to cover all 6 bases well, but I will try to formulate a good post.
Mike: I am interested in this approach, but have not heard of item response theory before now. To clarify, I would still use Stan to sample the latent strengths given some prior distribution, and then formulate probabilities as a logistic function of strength with an intercept? Do these models incorporate the opposing team’s strength as well? I’d be interested in any resource pointers you have outside of the Wikipedia page. EDIT: found this: https://mc-stan.org/docs/2_20/stan-users-guide/item-response-models-section.html
There’s an IRT section in the Stan User’s Guide. Your scenario isn’t as complicated as the ones presented there because you only have one variable (team) while the example tends to have two crossed variables (student and question). So I think you’d do something like:
data {
  int<lower=2> num_teams;
  int<lower=2> num_games;
  int<lower=0,upper=1> home_win[num_games];
  int<lower=1,upper=num_teams> home_team[num_games];
  int<lower=1,upper=num_teams> away_team[num_games];
}
parameters {
  // strengths of teams 2..num_teams; team 1 is pinned at 0 as the reference
  vector[num_teams-1] strength_all_but_first ;
  real home_team_advantage ;
}
model {
  // full strength vector with the reference team's 0 prepended
  vector[num_teams] strength_all = append_row(0, strength_all_but_first) ;
  // priors
  target += normal_lpdf( strength_all_but_first | 0 , 10 ) ;
  target += normal_lpdf( home_team_advantage | 0 , 10 ) ;
  // likelihood
  target += bernoulli_logit_lpmf( home_win | home_team_advantage + strength_all[home_team] - strength_all[away_team] ) ;
}
Note I used a completely arbitrary scale of 10 for the priors, so you should play with some prior predictive checks to work out priors that actually correspond to your domain expertise.
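A minimal sketch of such a prior predictive check, assuming the same normal priors as in the model above. Here prior_scale is a hypothetical data input so that different scales can be tried without editing the program, and the two teams are just generic draws from the strength prior.

```stan
// Prior predictive sketch: draw parameters from the priors only (no data,
// no likelihood) and look at the win probabilities they imply.
data {
  real<lower=0> prior_scale;   // hypothetical: the prior sd you want to inspect
}
generated quantities {
  // strengths of two arbitrary teams and a home advantage, all from the prior
  real strength_home = normal_rng(0, prior_scale);
  real strength_away = normal_rng(0, prior_scale);
  real home_team_advantage = normal_rng(0, prior_scale);
  // implied probability that the home team wins
  real win_prob = inv_logit(home_team_advantage + strength_home - strength_away);
}
```

This has no parameters, so it would be run with the fixed-parameter algorithm. With prior_scale = 10 the win_prob draws should pile up near 0 and 1, which is probably not what you believe about real teams; smaller scales spread the draws more evenly across the middle.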
Once you have an initial model working as expected, you might consider letting the home-team advantage vary by which team is home; I’m not expert in baseball but I can imagine that some teams might have a greater home team advantage than others.
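One way that extension could look, as an untested sketch: give each team its own home advantage drawn from a shared distribution. The names home_advantage_mean, home_advantage_sd, home_advantage_raw and home_advantage_by_team are invented here, the prior scales are placeholders as arbitrary as the 10 used above, and the non-centered parameterization is one choice among several.

```stan
data {
  int<lower=2> num_teams;
  int<lower=2> num_games;
  int<lower=0,upper=1> home_win[num_games];
  int<lower=1,upper=num_teams> home_team[num_games];
  int<lower=1,upper=num_teams> away_team[num_games];
}
parameters {
  vector[num_teams-1] strength_all_but_first;
  real home_advantage_mean;              // average home advantage across teams
  real<lower=0> home_advantage_sd;       // how much teams differ in home advantage
  vector[num_teams] home_advantage_raw;  // standardized team-level deviations
}
transformed parameters {
  // team 1 is still the reference team with strength 0
  vector[num_teams] strength_all = append_row(0, strength_all_but_first);
  // non-centered form: team advantage = mean + sd * standardized deviation
  vector[num_teams] home_advantage_by_team
    = home_advantage_mean + home_advantage_sd * home_advantage_raw;
}
model {
  // priors (placeholder scales)
  target += normal_lpdf(strength_all_but_first | 0, 10);
  target += normal_lpdf(home_advantage_mean | 0, 10);
  target += normal_lpdf(home_advantage_sd | 0, 10);  // half-normal because of lower=0
  target += normal_lpdf(home_advantage_raw | 0, 1);
  // likelihood: the home team's own advantage replaces the single intercept
  target += bernoulli_logit_lpmf(home_win |
    home_advantage_by_team[home_team]
    + strength_all[home_team]
    - strength_all[away_team]);
}
```

The shared mean and sd let teams with few home games borrow information from the rest of the league, so their estimated advantages are pulled toward the average rather than fit independently.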
(I ended up editing my last post a bunch after you might have first seen it; be sure to check it out now)
@mike-lawrence Thank you, thank you, thank you for this awesome start! I was able to successfully implement it (change dimension of strength_all_but_first to [num_teams-1] instead of [num_games-1]). This is a cool new class of models that we will use either as a benchmark or perhaps the primary model itself. I think it was a little ambitious to model all of the win probabilities as well.
Nearly all of the diagnostics are giving green lights. That being said, I am getting low N_eff, which sometimes throws <10% warnings and sometimes doesn’t, but in those cases it’s between 10% and 15%. I was wondering if you have any idea about strategies to address this, and what you’d need to see from me. I can post everything but since the model is simple I figure the solution may be as well. Or perhaps I shouldn’t value N_eff as much as I do in the first place?
Good catch. I’ll change it in my original post too so that anyone coming here later sees the correct version first.
You can compute the win probabilities in a generated quantities section:
data {
  int<lower=2> num_teams;
  int<lower=2> num_games;
  int<lower=0,upper=1> home_win[num_games];
  int<lower=1,upper=num_teams> home_team[num_games];
  int<lower=1,upper=num_teams> away_team[num_games];
}
parameters {
  vector[num_teams-1] strength_all_but_first ;
  real home_team_advantage ;
}
transformed parameters {
  // declared here rather than in the model block so generated quantities can see it
  vector[num_teams] strength_all = append_row(0, strength_all_but_first) ;
}
model {
  // priors
  target += normal_lpdf( strength_all_but_first | 0 , 10 ) ;
  target += normal_lpdf( home_team_advantage | 0 , 10 ) ;
  // likelihood
  target += bernoulli_logit_lpmf( home_win | home_team_advantage + strength_all[home_team] - strength_all[away_team] ) ;
}
generated quantities{
  matrix[num_teams,num_teams] win_prob ;
  // h = home team, a = away team; loop variables cannot reuse the data names home_team/away_team
  for(h in 1:num_teams){
    for(a in 1:num_teams){
      if(h==a){
        //just do a random uniform so the automatic Rhat/ESS computations don't fail and cause an unnecessary warning
        win_prob[h,a] = uniform_rng(1e-16,1e-15) ;
      }else{
        win_prob[h,a] = inv_logit(home_team_advantage + strength_all[h] - strength_all[a] ) ;
      }
    }
  }
}
It depends. Maybe check the bulk/tail ESS separately via monitor(fit). Depending on what kinds of inferences you want to make, one or the other might be more important to have a high ESS. If you want higher ESS and there are no other problems with the model, you can simply re-sample with a higher iteration count.
@wahsmail like @mike-lawrence said, IRT models are great for this. I would still go through all the steps I outlined, just to keep with best practices of a Bayesian workflow, and post those here.
You can also check your Stan model against the IRT model in brms (runs Stan in the background).
There are a lot more IRT models here:
https://education-stan.github.io