Creating a prior for a logistic model in brms

I am a recent graduate from George Mason with a degree in statistics. My thesis was to create a model that could predict match outcomes from player performance indicators, specifically transfer value. I have only 80 observations from World Cup knockout-stage matches. The first time I ran my model with loo_predict I received a lot of warnings about numerous high Pareto k values. This gave me unreliable predictions, and the predictions were skewed in favor of the true outcomes. I presume this may have been because I specified the prior incorrectly. I can send you my data if you need it, and I would really appreciate the chance to join you on a Zoom call to talk about it. Thank you so much in advance for your help.
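For reference, a minimal sketch of how to inspect the Pareto k diagnostics behind those loo_predict warnings; 'model' stands in for the fitted brms object described below.

library(brms)

# Inspect the PSIS-LOO diagnostics for the fitted brms model
loo_fit <- loo(model)
print(loo_fit)                              # counts of observations per Pareto k range
which(loo::pareto_k_values(loo_fit) > 0.7)  # observations with unreliable PSIS estimates

# One remedy: refit the model exactly for the problematic observations
loo_exact <- loo(model, reloo = TRUE)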

  1. Would you please help me specify a prior?

a. Do power priors exist in brms to weight variables based on my beliefs?
b. What should I specify? Just the fixed effects, the fixed and random effects, or something more?
c. How should I evaluate the distribution of best fit?
My data: the random effects seem to follow something between a bell curve and a uniform distribution. Most fixed effects follow a compressed bell curve with extremely heavy tails, but one variable is binary (-1 and 1). Would sensitivity analysis and WAIC be the best way to fit the data to distributions?

Sincerely,

Julian Christov 312-945-9463 julianchri@gmail.com

What is the frequency breakdown of the 80 observations’ outcomes? How many predictors do you have? You will get more benefit from limiting the predictor space to what the sample size will support rather than tinkering with smooth priors. For example, do a good deal of unsupervised learning, e.g. variable clustering, redundancy analysis, possibly sparse principal component analysis. If your outcome is binary you need 96 observations just to estimate the intercept in a logistic model.
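For concreteness, a sketch of pre-modeling data reduction on the predictors alone (the outcome plays no role here); the data frame and column names are assumptions based on the variables described later in the thread.

library(Hmisc)

# Predictor matrix only
pred <- soccer[, c("ELO", "GK_TV", "CB_TV", "CM_TV", "CF_TV", "OUT_TV")]

# Variable clustering: groups predictors that carry similar information
vc <- varclus(as.matrix(pred))
plot(vc)

# Redundancy analysis: flags predictors that are well predicted by the others
redun(~ ., data = pred)

# Ordinary PCA as a simple stand-in for sparse PCA
pc <- prcomp(pred, scale. = TRUE)
summary(pc)   # how many components capture most of the variance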

Thank you for the timely response. Below, I have provided answers to your questions and included the formula and model, as well as histograms for the predictors. I have minimal experience with unsupervised learning, most of which is PCA. Is it possible you could provide code and further insight into applying unsupervised learning techniques to my Bayesian logistic regression model? Thank you so much.

What is the frequency breakdown of the 80 observations’ outcomes?

54 instances of 1 and 26 instances of 0 (54 wins for the home team and 26 losses for the home team).

How many predictors do you have?

11 predictors, 6 fixed and 5 random. Just to clarify, I used “0” to remove the intercept from my model because my variables measure the differences in market value and age between one team and the other.

My model looks at the differences between the two competing teams and I needed to make adjustments to transfer value to better improve the model. In terms of the variables used in my model, ‘TV’ represents the proportional difference in transfer value across player positions. Due to the closed market structure for international soccer, positional value is relevant. ‘AGE’ indicates the annual difference in ages across positions. Age is critical since transfer value only looks to measure projected long-term performance of the player and value to a club and does not evaluate immediate player performance, meaning older players near or slightly above their playing peak are undervalued. When looking at values, ‘CB_TV = -1.2’ and ‘CB_AGE = 2’ would mean the team’s starting Centre Backs (CBs) are, on average, 2 years older, and the opposing team’s starting CBs have a transfer value 120% higher, on average.

Code:

formula <- bf(Outcome ~ 0 + ELO + 0 + GK_TV + (0 + GK_TV | GK_AGE) + 0 + CB_TV + (0 + CB_TV | CB_AGE) + 0 + CM_TV + (0 + CM_TV | CM_AGE) + 0 + CF_TV + (0 + CF_TV | CF_AGE) + 0 + OUT_TV + (0 + OUT_TV | OUT_AGE))

model <- brm(
formula = formula,
data = soccer,
family = bernoulli(),
prior = prior,
cores = num.cores,
control = list(max_treedepth = 15, adapt_delta = 0.99))

Since your effective sample size is very low and you don’t have enough observations to even estimate an intercept-only model (the frequentist Wilson 0.95 compatibility interval for the probability of Y=1 is (0.57, 0.77), which is relevant even if you force the covariate-adjusted intercept to be zero), analysis may be futile. For small datasets you have to have an extremely high-resolution Y.
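For reference, the quoted interval can be reproduced in R; prop.test() without continuity correction returns the Wilson score interval.

# 54 wins out of 80 matches
prop.test(54, 80, correct = FALSE)$conf.int
# roughly 0.57 to 0.77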

What is a random predictor?

It is often safe to do data reduction outside of the main model fitting stage, so it need not necessarily involve Bayesian methods. In the future I will have a way to insert PCA restrictions into logistic regression so that all modeling will be on the original predictor scales.

It is often dangerous to omit the intercept in the best of conditions.

My apologies, there are no random predictors. Instead, the age variables are just random effects on transfer value to adjust for immediate performance. Please let me know if my data would be helpful to you for the PCA.

Based on my soccer knowledge I strongly believe the intercept is close to, if not exactly, zero, because that is the case when all things are equal, and therefore the teams should have the same shot at winning, at 50/50 odds. If the model had an intercept, it would imply that the home or away team has an advantage regardless of player composition (transfer value) or past results (Elo), which does not seem logical to me, especially when games are played at a neutral location with neutral referees. I understand that there may be additional complexities to evaluate, such as different coaches, but for the most part the intercept seems close to, if not, zero.
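One way to encode that belief without dropping the intercept from the model is a tight prior centred at zero; a sketch, where the 0.2 scale and the normal(0, 1) slope prior are assumptions rather than recommendations.

library(brms)

# Keep the intercept but state the prior belief that it is near zero
priors <- c(
prior(normal(0, 0.2), class = "Intercept"),
prior(normal(0, 1), class = "b")
)

fit <- brm(Outcome ~ ELO + GK_TV + CB_TV + CM_TV + CF_TV + OUT_TV,
data = soccer, family = bernoulli(), prior = priors)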

I would really appreciate getting on a Zoom call with you. I can send a link whenever you are available. Thank you so much for taking so much time to help me with this.


Still not understanding ‘random’ here. But to the main issue: pick one binary fixed predictor and fit a simple model with it. Get the 0.95 HPD uncertainty interval for the odds ratio for that factor. You’ll find it is so wide that you haven’t learned anything from the data. It just gets worse with more predictors.
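A minimal sketch along those lines, where x stands in for whichever binary predictor is chosen and the weakly informative prior is an assumption; the equal-tailed interval is used as a stand-in for the HPD interval.

library(brms)

# One-predictor logistic model
fit1 <- brm(Outcome ~ x, data = soccer, family = bernoulli(),
prior = prior(normal(0, 2.5), class = "b"))

# Posterior draws of the odds ratio for x
or_draws <- exp(as_draws_df(fit1)$b_x)

# 0.95 equal-tailed interval; bayestestR::hdi() would give an HPD interval
quantile(or_draws, probs = c(0.025, 0.975))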

In the formula, age indicates that the influence of transfer value (TV) on the outcome varies across different age groups. I do not mean that age is ‘random’, just that the effect of transfer value is not necessarily constant across ages.

I can omit ELO in the modeling process and see if it explains the incorrect predictions. However, it becomes tough to omit or combine other variables, since I am looking to measure disparities between the players of the two competing teams. Please let me know if you would like my Capstone paper, as it explains much of the context and my thought process behind the topic.

I don’t mean to collapse all predictors to one measure. Pick one good measure and do a complete analysis. You’ll see that the sample size is inadequate as judged by the extreme width of an uncertainty interval for an odds ratio. Therefore the sample is inadequate for everything you are doing.

I think you are too pessimistic here. I know your justifications for binary-target sample sizes, but you are leaving out of this post the background assumptions: no prior, and a requirement for a quite narrow interval. You claim the posterior interval would have extreme width, but it is still much narrower than without any data.

Even with this computation, the interval width is only about 21% of the width of the interval under a uniform prior.
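One way to reproduce that figure, assuming a uniform Beta(1, 1) prior on the win probability and the 54/80 split reported above.

# Posterior under a uniform prior: Beta(1 + 54, 1 + 26)
post_width  <- diff(qbeta(c(0.025, 0.975), 55, 27))
prior_width <- diff(qbeta(c(0.025, 0.975), 1, 1))
post_width / prior_width   # roughly 0.21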

Also, the original goal was to make predictions and not to infer any posterior intervals. There are plenty of papers showing that we can get useful predictions with even fewer observations (and a binary target).

Of course you learn from the data, and usually you learn more the more information you have. There are plenty of papers showing you can get useful predictions with fewer observations and thousands of predictors. In this case the predictors are likely to be correlated and thus provide similar information, so adding more of them is likely to improve predictive performance. Of course, this assumes you are using Bayesian inference and not priors that force overfitting.

A couple of references:

  • Juho Piironen, Markus Paasiniemi, and Aki Vehtari (2020). Projective inference in high-dimensional problems: prediction and feature selection. Electronic Journal of Statistics, 14(1):2155-2197. Online.
  • Juho Piironen and Aki Vehtari (2017). Sparsity information and regularization in the horseshoe and other shrinkage priors. Electronic Journal of Statistics, 11(2):5018-5051. Online.

But these are not just binary data, as there is also a score for both teams, and instead of a Bernoulli model it would be better to use a bivariate Poisson model for both scores or a Poisson-difference model for the score difference. You can’t do these with brms; you would need to write them in the Stan language.
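For concreteness, a minimal sketch of a bivariate Poisson likelihood in the Stan language (embedded here as an R string, to be compiled with rstan::stan_model), using the convolution representation X = W1 + W3, Y = W2 + W3 with independent Poisson components; the data names (score, elo_diff) and the single shared predictor are placeholders, not taken from the thread.

bivpois_code <- "
functions {
real bivariate_poisson_lpmf(int[] xy, real l1, real l2, real l3) {
int m = min(xy[1], xy[2]);
vector[m + 1] lp;
for (k in 0:m)
lp[k + 1] = poisson_lpmf(xy[1] - k | l1)
+ poisson_lpmf(xy[2] - k | l2)
+ poisson_lpmf(k | l3);
return log_sum_exp(lp);
}
}
data {
int<lower=1> n;
int<lower=0> score[n, 2];   // home and away goals
vector[n] elo_diff;         // placeholder predictor
}
parameters {
real alpha;                 // log baseline scoring rate
real beta;                  // effect of the Elo difference
real<lower=0> lambda3;      // shared component inducing correlation between the scores
}
model {
alpha ~ normal(0, 1);
beta ~ normal(0, 1);
lambda3 ~ exponential(1);
for (i in 1:n)
score[i] ~ bivariate_poisson(exp(alpha + beta * elo_diff[i]),
exp(alpha - beta * elo_diff[i]),
lambda3);
}
"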

You have many redundant 0s here and it would be good to read the documentation on what 0 means in the formula.

While I did say above that for a binary target fewer than 80 observations can be plenty for getting a useful predictive model (for example, one that beats a model with no data at all in repeated betting), and I think something like an R2D2-type prior would be sensible (instead of an improper uniform or independent normals that favor overfitting), I don’t think there is enough information in these data to improve the predictive performance with varying-slope components.
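For what it’s worth, brms can place an R2D2 prior on the population-level coefficients directly; a minimal sketch with fixed effects only (no varying slopes), where the default R2D2() hyperparameters are just a placeholder, not a recommendation.

library(brms)

# R2D2 prior on the fixed effects of the Bernoulli model
fit_r2d2 <- brm(Outcome ~ 0 + ELO + GK_TV + CB_TV + CM_TV + CF_TV + OUT_TV,
data = soccer, family = bernoulli(),
prior = prior(R2D2(), class = "b"))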

Sorry, I don’t have more time for this, but I wanted to counter Frank’s statements, which seemed to make too universal claims about what can be done with binary outcomes.


Thank you so much for the advice. First off, I took away the 0’s that were not needed in the formula:

formula <- bf(Outcome ~ 0 + ELO + GK_TV + (0 + GK_TV | GK_AGE) + CB_TV + (0 + CB_TV | CB_AGE) + CM_TV + (0 + CM_TV | CM_AGE) + CF_TV + (0 + CF_TV | CF_AGE) + OUT_TV + (0 + OUT_TV | OUT_AGE))

I know a lot of successful soccer models have used a bivariate Poisson model to predict outcomes for soccer matches, like Dixon and Coles. However, personally I do not think the score line is best for predicting knockout-stage soccer matches, and the information it provides may not be as accurate as the true outcome itself.

  • There are only 2 outcomes that ultimately determine success in knock-out matches: win or go home.
  • Certain teams consistently win by small margins while other teams win less consistently but by large margins.
  • Game scenarios can greatly affect score lines even when teams are closely matched.
  • The GK position may be more important in knockout-stage matches since games can go to penalty kicks.

These sources seemed very useful for what I am working on but a lot of what was talked about is beyond my current level of expertise.

Predictive projection to reduce the number of features may not be as useful as regularization techniques that decrease the magnitude of those features, especially since the variables seem key to understanding the outcome in their own ways. I could be wrong, though, and this may help me mitigate overfitting the training data.

The paper having to do with the regularized horseshoe prior seems very useful in my case.

I will follow your advice using the R2D2 prior though. It seems to fit best with the model I am trying to build since:

  • Use R2D2 prior when you are interested in controlling the proportion of variance explained by the predictors and have a moderate number of predictors.
  • Use regularized horseshoe prior when you are dealing with high-dimensional data with many predictors, where you expect only a few of them to be truly relevant.

Please let me know if this code is okay for computing LOOCV to determine the accuracy of this kind of model, since it is my first time using rstan and the code came from ChatGPT (I may end up taking ELO out and also seeing how it performs). Also, would PCA be helpful at all to reduce the dimensionality (I am guessing the shrinkage prior should be fine without it)?

Here is my code:

stan_model_code_r2d2 <- "
data {
int<lower=0> n; // number of observations
int<lower=0> d; // number of predictors
int<lower=0> k; // number of age groups
int<lower=0, upper=1> y[n]; // binary outcome
matrix[n, d] x; // predictors
matrix[n, k] age_group_indices; // age-difference values for each position
real<lower=0> scale_icept; // prior std for the intercept
}

parameters {
real beta0; // intercept
real<lower=0, upper=1> r2; // R-squared parameter
vector[d] beta_raw; // raw coefficients
vector[k] u; // age-interaction coefficients, one per position (the age differences enter as continuous modifiers of the TV slopes)
}

transformed parameters {
vector[d] beta; // regression coefficients
vector[n] eta; // linear predictor

beta = beta_raw * sqrt(r2 / (1 - r2)); // global scaling driven by R2 (a simplification of the full R2D2 prior)

// Compute linear predictor
eta = beta0 + x * beta;
for (i in 1:n) {
eta[i] += u[1] * age_group_indices[i, 1] * x[i, 2]; // GK_TV slope modified by GK_AGE
eta[i] += u[2] * age_group_indices[i, 2] * x[i, 3]; // CB_TV slope modified by CB_AGE
eta[i] += u[3] * age_group_indices[i, 3] * x[i, 4]; // CM_TV slope modified by CM_AGE
eta[i] += u[4] * age_group_indices[i, 4] * x[i, 5]; // CF_TV slope modified by CF_AGE
eta[i] += u[5] * age_group_indices[i, 5] * x[i, 6]; // OUT_TV slope modified by OUT_AGE
}
}

model {
// Prior for R-squared
r2 ~ beta(2, 2);

// Prior for coefficients
beta_raw ~ normal(0, 1);

// Prior for the age-interaction coefficients
u ~ normal(0, 1);

// Prior for intercept
beta0 ~ normal(0, scale_icept);

// Logistic regression model
y ~ bernoulli_logit(eta);
}
"

library(rstan)
library(dplyr)

# Load the data
soccer <- read.table("/mnt/data/luluboyo.txt", header = TRUE, sep = "\t")

# Define the number of observations
n_obs <- nrow(soccer)

# Initialize a vector to store the predicted probabilities
predicted_probs <- numeric(n_obs)

# Compile the Stan model once; recompiling inside the LOOCV loop would be very slow
compiled_model <- stan_model(model_code = stan_model_code_r2d2)

# Perform LOOCV

for (i in 1:n_obs) {

# Prepare training data leaving out the ith observation
soccer_train <- soccer[-i, ]
soccer_test <- soccer[i, ]

# Extract predictors and outcome for the training data
x_train <- as.matrix(soccer_train[, c("ELO", "GK_TV", "CB_TV", "CM_TV", "CF_TV", "OUT_TV")])
age_group_indices_train <- as.matrix(soccer_train[, c("GK_AGE", "CB_AGE", "CM_AGE", "CF_AGE", "OUT_AGE")])
y_train <- soccer_train$Outcome

# Prepare data for Stan
stan_data_train <- list(
n = nrow(x_train),
d = ncol(x_train),
k = ncol(age_group_indices_train),
y = as.integer(y_train),
x = x_train,
age_group_indices = age_group_indices_train,
scale_icept = 10
)

# Fit the precompiled model to the training data
stan_fit <- sampling(compiled_model, data = stan_data_train,
iter = 2000, chains = 4, cores = parallel::detectCores(),
control = list(max_treedepth = 15, adapt_delta = 0.99))

# Extract posterior samples
posterior_samples <- extract(stan_fit)

# Prepare the new data (the left-out observation)
new_x <- as.matrix(soccer_test[, c("ELO", "GK_TV", "CB_TV", "CM_TV", "CF_TV", "OUT_TV")])
new_age_group_indices <- as.matrix(soccer_test[, c("GK_AGE", "CB_AGE", "CM_AGE", "CF_AGE", "OUT_AGE")])

# Number of posterior samples
n_samples <- length(posterior_samples$beta0)

# Initialize a vector to store the predictions
predictions <- numeric(n_samples)

# Compute the linear predictor (eta) for each posterior sample
for (j in 1:n_samples) {
eta <- posterior_samples$beta0[j] + sum(new_x * posterior_samples$beta[j, ])
eta <- eta + posterior_samples$u[j, 1] * new_age_group_indices[1, 1] * new_x[1, 2]  # GK_TV | GK_AGE
eta <- eta + posterior_samples$u[j, 2] * new_age_group_indices[1, 2] * new_x[1, 3]  # CB_TV | CB_AGE
eta <- eta + posterior_samples$u[j, 3] * new_age_group_indices[1, 3] * new_x[1, 4]  # CM_TV | CM_AGE
eta <- eta + posterior_samples$u[j, 4] * new_age_group_indices[1, 4] * new_x[1, 5]  # CF_TV | CF_AGE
eta <- eta + posterior_samples$u[j, 5] * new_age_group_indices[1, 5] * new_x[1, 6]  # OUT_TV | OUT_AGE
predictions[j] <- 1 / (1 + exp(-eta))  # Apply logistic function to get a probability
}

# Store the mean prediction for the ith observation
predicted_probs[i] <- mean(predictions)

cat("Finished LOOCV for observation", i, "with predicted probability:", predicted_probs[i], "\n")
}

# Print all predicted probabilities

predicted_probs

I’m not surprised you can get a model that is better than no model. But I would be surprised if the posterior distribution of a typical risk estimate is usefully narrow unless the prior information is so extreme that you almost don’t need data.

Last question. I would like to determine which team is more likely to win based on the transfer value of their starters, adjusted for age. Could I perform a sensitivity analysis on the age groups across different positions to assess the impact on this prediction?