Censoring in UFC fights

Hi all,

I want to apply survival analysis to UFC fights. Basically, I want to describe fighters as diseases. I am exploring my dataset and have some questions.

A standard UFC fight consists of three five-minute rounds. Title fights, however, are extended to five five-minute rounds. We assume that an event (death) time for opponent o exists and denote it T*. Next, we have T, the observed event or censoring time. Both are measured in seconds.

image

In the case of a knock-out, o is uncensored and T* = T. When o survives until the end of the match and wins by decision of the jury, o is right-censored and T* > T.
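In standard right-censoring notation, the two cases can be written compactly. The sketch below introduces a censoring time C_o and an event indicator δ_o, neither of which appears in the post, and assumes the clock counts fight time only (breaks between rounds excluded):

```latex
\[
T_o = \min(T_o^{*}, C_o), \qquad
\delta_o = \mathbf{1}\{T_o^{*} \le C_o\}, \qquad
C_o =
\begin{cases}
3 \times 300 = 900 \text{ s} & \text{standard fight} \\
5 \times 300 = 1500 \text{ s} & \text{title fight}
\end{cases}
\]
```

Under this notation a knock-out gives δ_o = 1 and T_o = T_o*, while going the distance gives δ_o = 0 and T_o = C_o.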

Question: how should we deal with a loss for o by decision of the jury? Literally, the event of a loss (death) occurs. So, is o uncensored in this case? The chart below shows that the majority of outcomes are decided by the jury. How can we model this in a parametric way?

image

Thank you for your time

Morning,

Are you looking to code this up in rstanarm, brms, or Stan? If so, do you have a model already in mind? If you are looking for broader advice on modeling https://stats.stackexchange might be a better place to ask.

Ara is right that we try to keep the scope here to models fitted with packages from the Stan ecosystem, but I'll just assume you are planning to do this, as this looks like a neat little puzzle.

I have almost no understanding of fighting sports, but I would expect that what leads to a knockout is only indirectly related to jury decisions: there would be shared factors, but I wouldn't expect the jury decisions to be reasonably interpretable as estimates of who would have been knocked out. So I would probably start by separately modelling the outcome (knockout vs. jury decision) and the knockout times conditional on a knockout happening, and then try to build a shared model which would assume either some correlation of predictors or a similar structure.

I also think you are missing an important piece of the structure, in that this is a competition between two fighters, so I think it would make sense to model the full set of four outcomes, potentially as an ordinal model, and have predictors for both fighters. I think there are some football case studies out there that might serve as inspiration.

Finally, there seems to be some additional structure in the knockout times within each round, so one might probably want to model this as well.

Best of luck with the model!

hey @martinmodrak

Thanks for replying. After thinking more about it, the event I want to capture is the knock-out of the opponent by a fighter. In addition, I want to use the round of the fight, instead of the number of seconds, as the time unit. So, there are basically three different possible outcomes per fight:

  • Fighter KO's opponent in round r
    • event is observed in round r
  • Opponent KO's fighter in round r
    • event is right-censored in round r
  • Fighter or opponent wins by jury decision
    • event is right-censored in round r = 3 (normal fight) or r = 5 (title fight)

Visually this looks, for a randomly picked fighter, as follows:

image

The table below shows the summarized data:

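| round | KO | censored | fighters removed | fighters at risk | hazard rate |
|-------|----|----------|------------------|------------------|-------------|
| 0     | 0  | 0        | 0                | 27               | -           |
| 1     | 7  | 1        | 8                | 27               | 0.26        |
| 2     | 4  | 0        | 4                | 19               | 0.21        |
| 3     | 3  | 2        | 5                | 15               | 0.20        |
| 4     | 2  | 0        | 2                | 10               | 0.20        |
| 5     | 0  | 8        | 8                | 8                | -           |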
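The hazard-rate column is the number of KOs in a round divided by the number of fighters at risk in that round. As a worked check using only the table's own numbers (the symbols d_r for KOs and n_r for fighters at risk are introduced here for illustration):

```latex
\[
\hat{\lambda}_r = \frac{d_r}{n_r}: \qquad
\hat{\lambda}_1 = \tfrac{7}{27} \approx 0.26, \quad
\hat{\lambda}_2 = \tfrac{4}{19} \approx 0.21, \quad
\hat{\lambda}_3 = \tfrac{3}{15} = 0.20, \quad
\hat{\lambda}_4 = \tfrac{2}{10} = 0.20
\]
\[
\hat{S}(4) = \prod_{r=1}^{4} \bigl(1 - \hat{\lambda}_r\bigr)
= \tfrac{20}{27} \cdot \tfrac{15}{19} \cdot \tfrac{12}{15} \cdot \tfrac{8}{10}
\approx 0.37
\]
```

The second line is the usual life-table survival estimate: the estimated probability of not having been KO'ed by the end of round 4.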

I want to calculate the probability that a KO happens in round r given that it hasn't happened by the end of round r - 1. So, I want to build a discrete-time survival model.

Let KO_o be a discrete random variable that indicates the round when the KO occurs for a randomly selected opponent o.
Next, we define the discrete-time hazard as the conditional probability of opponent o getting KO'ed in round r, given that he/she has survived until that round:

image

Cox (1972) proposed that, because hazard rates are probabilities, they can be reparametrized so that they have a logistic dependence on the time periods:

image

Where [R_1, R_2, R_3, R_4, R_5] is a sequence of dummy variables. If an opponent is KO'ed or censored in round 3, R_3 = 1 and the rest are 0. If we take the log-odds, we obtain a model for the logit of the hazard rate:

image
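The equation images are not reproduced in this text. A plausible reconstruction of the three formulas above, assuming the standard discrete-time hazard setup and the α_r coefficients used in the Stan code further down (the exact notation in the images may differ):

```latex
\[
\lambda_{r,o} = \Pr\{KO_o = r \mid KO_o \ge r\}
\]
\[
\lambda_{r,o} = \frac{1}{1 + \exp\bigl[-(\alpha_1 R_1 + \alpha_2 R_2 + \alpha_3 R_3 + \alpha_4 R_4 + \alpha_5 R_5)\bigr]}
\]
\[
\operatorname{logit}(\lambda_{r,o})
= \log\frac{\lambda_{r,o}}{1 - \lambda_{r,o}}
= \alpha_1 R_1 + \alpha_2 R_2 + \alpha_3 R_3 + \alpha_4 R_4 + \alpha_5 R_5
\]
```

Because exactly one dummy equals 1 in any given round, the linear predictor in round r reduces to α_r.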

Next, for every opponent o, we record for each round r whether a KO was observed, using a sequence of dummy variables Y_r,o that take the values y_r,o:

image

If opponent o does not get KO'ed during the match, Y_r,o will be equal to 0 in every round that was observed during the fight. If the fight was a title fight, Y_r,o is equal to {0, 0, 0, 0, 0}. For an opponent that gets KO'ed in the third round, Y_r,o is equal to {0, 0, 1}.

In addition, we want to check whether opponent o is censored or not.

image

The probability that an uncensored opponent o will get KO'ed in round r is equal to

image

The probability that a censored opponent o will get KO'ed after round r is

image

The likelihood function is the product of the probabilities of observing the data, Pr{T_ko = t_r} in the case of uncensored opponents (c_o = 0) and Pr{T_ko > t_r} in the case of censored opponents (c_o = 1):

image

According to Singer and Willett (1993), we can rewrite the above as

image

Now the likelihood function of the discrete-time hazard model is equal to the likelihood function for N (t_1, t_2, …, t_r) independent Bernoulli trials with parameter λ_r,o. So, we can treat the N dichotomous observed values y_r,o as the values of the outcome variable in a logistic regression on the time-period indicators R.
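The chain of steps from the censoring indicator to the Bernoulli form can be sketched as follows. This is a reconstruction under the definitions given in the text, not a copy of the missing images; t_o (the last round observed for opponent o) is a symbol introduced here, and KO_o is the same random variable the text also writes as T_ko:

```latex
\[
\Pr\{KO_o = t_o\} = \lambda_{t_o,o} \prod_{r=1}^{t_o - 1} (1 - \lambda_{r,o}),
\qquad
\Pr\{KO_o > t_o\} = \prod_{r=1}^{t_o} (1 - \lambda_{r,o})
\]
\[
L = \prod_{o} \Pr\{KO_o = t_o\}^{\,1 - c_o} \, \Pr\{KO_o > t_o\}^{\,c_o}
  = \prod_{o} \prod_{r=1}^{t_o} \lambda_{r,o}^{\,y_{r,o}} \, (1 - \lambda_{r,o})^{\,1 - y_{r,o}}
\]
```

The last equality holds because y_r,o is 1 only in the round where an uncensored opponent is KO'ed and 0 in every other observed round, so each person-period row contributes one Bernoulli term.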

I transformed my data to person-period format (attached as rounds_jon_jones.csv) and wrote the following Stan code:

> data {
>   int<lower=0> n_rounds;                             // rows in person-period format
>   // array[...] declarations need Stan 2.26 or later
>   array[n_rounds] int<lower=0, upper=1> knockouts;   // 1 if a KO happened in that row's round
>   matrix[n_rounds, 5] rounds;                        // round dummies R_1..R_5
> }
> 
> parameters {
>   real alpha_1;
>   real alpha_2;
>   real alpha_3;
>   real alpha_4;
>   real alpha_5;
> }
> 
> model {
>   // priors on the logit hazard of each round
>   alpha_1 ~ normal(0, 1);
>   alpha_2 ~ normal(0, 1);
>   alpha_3 ~ normal(0, 1);
>   alpha_4 ~ normal(0, 1);
>   alpha_5 ~ normal(0, 1);
> 
>   // likelihood: one Bernoulli trial per person-period row
>   knockouts ~ bernoulli_logit(alpha_1 * rounds[, 1] + alpha_2 * rounds[, 2]
>                               + alpha_3 * rounds[, 3] + alpha_4 * rounds[, 4]
>                               + alpha_5 * rounds[, 5]);
> }
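Since each row has exactly one round dummy set to 1, the hazard the model implies for round r is inv_logit(alpha_r). A minimal sketch of an equivalent, more compact parameterisation that reports those hazards directly; the names `alpha` and `hazard` are introduced here and are not from the original post:

```stan
data {
  int<lower=0> n_rounds;                             // rows in person-period format
  array[n_rounds] int<lower=0, upper=1> knockouts;   // array syntax needs Stan 2.26+
  matrix[n_rounds, 5] rounds;                        // round dummies R_1..R_5
}
parameters {
  vector[5] alpha;                                   // logit hazard per round
}
model {
  alpha ~ normal(0, 1);
  // matrix-vector product picks out alpha[r] for each row
  knockouts ~ bernoulli_logit(rounds * alpha);
}
generated quantities {
  // hazard per round on the probability scale
  vector[5] hazard = inv_logit(alpha);
}
```

Summarising `hazard` across posterior draws gives per-round hazards with uncertainty, which can be set side by side with the KO / at-risk ratios from the summary table.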

Running the code gives the following output

image

I calculate the hazard rate of the first round as follows:

image

The table below shows the hazard rate for every round:

| round | hazard |
|-------|--------|
| 1     | 0.752  |
| 2     | 0.505  |
| 3     | 0.257  |
| 4     | 0.330  |
| 5     | 0.061  |

These hazard rates are quite different from those in the first table, where we divide the number of KO's by the number of fighters at risk per round. Can someone tell me if I did something wrong? Any other feedback is much appreciated.
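One way to compare the two tables is to put the empirical hazards on the logit scale, where the α_r live. The worked calculation below uses only the numbers from the first table and assumes the model was fitted to the same fights that table summarises, which the post does not state explicitly:

```latex
\[
\lambda_r = \operatorname{logit}^{-1}(\alpha_r) = \frac{1}{1 + e^{-\alpha_r}}
\]
\[
\operatorname{logit}\tfrac{7}{27} = \ln\tfrac{7}{20} \approx -1.05, \quad
\operatorname{logit}\tfrac{4}{19} = \ln\tfrac{4}{15} \approx -1.32, \quad
\operatorname{logit}\tfrac{3}{15} = \ln\tfrac{3}{12} \approx -1.39, \quad
\operatorname{logit}\tfrac{2}{10} = \ln\tfrac{2}{8} \approx -1.39
\]
```

If the posterior means of α_1 to α_4 are far from these values (positive, for instance), that would point to a mismatch in the data coding or in the back-transformation rather than in the likelihood itself. Round 5 has no KOs among its 8 fighters at risk, so its empirical hazard is 0 and α_5 is kept finite only by the prior.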

Thank you

That's a lot of work you shared! I admit that I didn't dig into the math much, but I don't see any glaring issues.

If I understand it correctly, you are actually fitting something very similar (or equivalent?) to a sequential ordinal model, as discussed by Paul Bürkner at https://journals.sagepub.com/doi/epub/10.1177/2515245918823199

As I said before (assuming the maximum number of rounds is the same for all matches), you could probably extend the ordinal model to incorporate outcomes for both opponents, so that your categories would be:

  • A knocked out in round 1
  • A knocked out in round 2
  • …
  • A knocked out in round 5
  • Nobody knocked out
  • B knocked out in round 5
  • …
  • B knocked out in round 1

This would let you capture the fact that a knock-out for one of the fighters automatically means that the other didn't score a knock-out. You could then enforce symmetry of the “difficulty” by having the intercepts mirrored.
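A rough sketch of what mirrored intercepts could look like in Stan. It uses a cumulative `ordered_logistic` model rather than the sequential family mentioned above, assumes every fight has a five-round maximum, and all data and parameter names (`fighter_a`, `fighter_b`, `outcome`, `skill`, `cut_pos`) are hypothetical:

```stan
data {
  int<lower=1> n_fights;
  int<lower=2> n_fighters;
  array[n_fights] int<lower=1, upper=n_fighters> fighter_a;
  array[n_fights] int<lower=1, upper=n_fighters> fighter_b;
  // 1..5: A knocked out in round 1..5, 6: nobody knocked out,
  // 7..11: B knocked out in round 5..1
  array[n_fights] int<lower=1, upper=11> outcome;
}
parameters {
  vector[n_fighters] skill;        // latent strength per fighter
  real<lower=0> sigma_skill;       // spread of strengths (partial pooling)
  positive_ordered[5] cut_pos;     // upper half of the thresholds
}
transformed parameters {
  // mirrored thresholds: (-c5, ..., -c1, c1, ..., c5), ascending by construction
  vector[10] cuts = append_row(-reverse(cut_pos), cut_pos);
}
model {
  sigma_skill ~ normal(0, 1);
  skill ~ normal(0, sigma_skill);
  cut_pos ~ normal(0, 2);
  // higher skill difference pushes the outcome towards "B knocked out early"
  outcome ~ ordered_logistic(skill[fighter_a] - skill[fighter_b], cuts);
}
```

With thresholds mirrored around zero, swapping the labels A and B flips both the sign of the skill difference and the order of the categories, so the likelihood does not depend on which fighter is called A.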

Would that make sense?

Best of luck with your model!

hey @martinmodrak, sorry for the delayed response.
The maximum number of rounds is not always the same. The majority of fights, the standard ones, have a maximum of 3 rounds, while title fights can last up to 5 rounds. Does this make the ordinal model infeasible? Nevertheless, I continued exploring the logistic regression approach and changed the code to include multiple fighters:

data {
  int<lower=1> n_rounds;                             // rows in person-period format
  // array[...] declarations need Stan 2.26 or later
  array[n_rounds] int<lower=0, upper=1> knockouts;   // 1 if a KO happened in that row's round
  matrix[n_rounds, 5] rounds;                        // round dummies R_1..R_5
  int<lower=1> n_fighters;
  array[n_rounds] int<lower=1, upper=n_fighters> fighter_id;
}

parameters {
  vector[n_fighters] alpha_1;
  vector[n_fighters] alpha_2;
  vector[n_fighters] alpha_3;
  vector[n_fighters] alpha_4;
  vector[n_fighters] alpha_5;
}

model {
  // priors
  alpha_1 ~ normal(0, 1);
  alpha_2 ~ normal(0, 1);
  alpha_3 ~ normal(0, 1);
  alpha_4 ~ normal(0, 1);
  alpha_5 ~ normal(0, 1);

  // likelihood: alpha_r[fighter_id] is a vector with one entry per row, so it
  // is combined with the dummy column elementwise (.*); vector * vector is
  // not defined in Stan
  knockouts ~ bernoulli_logit(alpha_1[fighter_id] .* rounds[, 1]
                              + alpha_2[fighter_id] .* rounds[, 2]
                              + alpha_3[fighter_id] .* rounds[, 3]
                              + alpha_4[fighter_id] .* rounds[, 4]
                              + alpha_5[fighter_id] .* rounds[, 5]);
}

However, I am still ignoring the strength of the opponent. How can I include it? Is that as simple as adding a vector[n_fighters] beta term to the script?
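One possible reading of that idea, sketched under assumptions: a single additive `beta[opponent_id]` on the logit scale, shared across rounds and interpreted as the opponent's resistance to being knocked out. The data field `opponent_id`, the matrix-valued `alpha` and the priors are introduced here for illustration and are not from the original post:

```stan
data {
  int<lower=1> n_rounds;
  array[n_rounds] int<lower=0, upper=1> knockouts;
  matrix[n_rounds, 5] rounds;                                    // round dummies R_1..R_5
  int<lower=1> n_fighters;
  array[n_rounds] int<lower=1, upper=n_fighters> fighter_id;
  array[n_rounds] int<lower=1, upper=n_fighters> opponent_id;    // hypothetical field
}
parameters {
  matrix[n_fighters, 5] alpha;     // per-fighter logit hazard for each round
  vector[n_fighters] beta;         // per-opponent resistance to being KO'ed
}
model {
  to_vector(alpha) ~ normal(0, 1);
  beta ~ normal(0, 1);
  // alpha[fighter_id] selects one row of alpha per observation;
  // rows_dot_product then picks the entry for the active round
  knockouts ~ bernoulli_logit(rows_dot_product(alpha[fighter_id], rounds)
                              - beta[opponent_id]);
}
```

Whether one round-independent `beta` per opponent is enough, or whether resistance should also vary by round, is a modelling choice this sketch does not settle.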

Basically, the model is a survival analysis where each fighter is a disease and the patients are the same. This would make partial pooling very natural, right? In addition, there are no competing risks, because there is enough rest between the fights (sometimes months). Have you ever read a paper with a similar set-up?

Thank you again

Not completely, just more tedious. You could probably assume that you have the same predictor for the latent continuous variable but have separate thresholds (intercepts) for the two cases.

I am not sure I understand you completely, but it might be.

I think there is quite a lot of work out there on modelling outcomes in sports, especially baseball and soccer are AFAIK popular targets for modelling, so I would start there (I am not familiar with the literature to give specific recommendations). There are definitely other sports that have variable number of rounds (I think cricket does, but I am not 100% sure), so I would expect there to be some prior work.

I admit that I really struggle to understand how framing this problem as a survival model is useful/sensible - I think the differences are bigger than the similarities - but I might be missing something. But yes, partial pooling would be easy in this setting (and in many other settings).
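To illustrate what partial pooling could look like here, the sketch below replaces the five independent per-fighter vectors with a shared per-round baseline plus one fighter-level deviation drawn from a common distribution. It assumes a fighter shifts the logit hazard by the same amount in every round, and the names `mu`, `tau`, `z` and `fighter_effect` are hypothetical:

```stan
data {
  int<lower=1> n_rounds;
  array[n_rounds] int<lower=0, upper=1> knockouts;
  matrix[n_rounds, 5] rounds;                          // round dummies R_1..R_5
  int<lower=1> n_fighters;
  array[n_rounds] int<lower=1, upper=n_fighters> fighter_id;
}
parameters {
  vector[5] mu;                    // population-level logit hazard per round
  real<lower=0> tau;               // spread of fighters around the baseline
  vector[n_fighters] z;            // standardised fighter effects (non-centred)
}
transformed parameters {
  vector[n_fighters] fighter_effect = tau * z;
}
model {
  mu ~ normal(0, 1.5);
  tau ~ normal(0, 1);
  z ~ std_normal();
  knockouts ~ bernoulli_logit(rounds * mu + fighter_effect[fighter_id]);
}
```

Fighters with few recorded rounds are pulled towards the baseline `mu`, while fighters with many rounds are estimated mostly from their own data.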