Censoring in UFC fights

HJAM24 · May 1, 2021, 10:41am

Hi all,

I want to apply survival analysis to UFC fights. Basically, I want to describe fighters as diseases. I am exploring my dataset and have some questions.

A standard UFC fight consists of three five-minute rounds. Title fights, however, are extended to five five-minute rounds. We assume that an event / death time for opponent o exists and can be denoted T*. Next, we have T, the observed event or censoring time. Both variables are measured in seconds.

In the case of a knock-out, o is uncensored and T* = T. When o survives until the end of the match and wins based on the decision of the jury, o is right-censored and T* > T.
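In the usual right-censoring notation this could be summarized as below. This is only a sketch: C (the scheduled length of the fight, 900 seconds of fighting time for a standard fight and 1500 for a title fight) and the event indicator δ are symbols introduced here for illustration and do not appear in the original post.

```latex
T = \min(T^{*}, C), \qquad
\delta = \mathbb{1}\{T^{*} \le C\}, \qquad
\delta = 1 \;\Rightarrow\; T^{*} = T, \qquad
\delta = 0 \;\Rightarrow\; T^{*} > T
```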

Question: how should we deal with a loss for o by decision of the jury? Literally, the event of a loss / death occurs. So, is o uncensored in this case? The chart below shows that the majority of the outcomes are determined by the jury. How can we model this in a parametric way?

Thank you for your time

Ara_Winter · May 6, 2021, 3:44pm

Morning,

Are you looking to code this up in rstanarm, brms, or Stan? If so, do you have a model already in mind? If you are looking for broader advice on modeling https://stats.stackexchange might be a better place to ask.

martinmodrak · May 6, 2021, 9:05pm

Ara is right that we try to keep the scope here to models fitted with packages from the Stan ecosystem, but I’ll just assume you are planning to do this, as this looks like a neat little puzzle.

I have almost no understanding of fighting sports, but I would actually expect that what leads to a knockout is only indirectly related to jury decisions: there would be shared factors, but I wouldn’t expect the jury decisions to be reasonably interpretable as estimates of who would have been knocked out. So I would probably start by modelling the outcome (knockout vs. jury decision) and the knockout times (conditional on a knockout happening) separately, and then try to build a shared model which would assume either some correlation of predictors or a similar structure.

I also think you are missing an important piece of the structure in that this is a competitive thing between two fighters, so I think it would make sense to model the full set of four outcomes, potentially as an ordinal model, and have predictors for both fighters. I think there are some football case studies out there that might serve as inspiration.

Finally, there seems to be some additional structure in the knockout times within each round, so one might probably want to model this as well.

Best of luck with the model!

HJAM24 · May 18, 2021, 9:03am

hey @martinmodrak

Thanks for replying. After thinking more about it, the event I want to capture is the knock-out of the opponent by a fighter. In addition, I want to use the round of the fight, instead of the number of seconds, as the time unit. So, there are basically three different possible outcomes per fight:

- Fighter KO’s opponent in round r: the event is observed in round r
- Opponent KO’s fighter in round r: the event is right-censored in round r
- Fighter or opponent wins by jury decision: the event is right-censored in round r = 3 (normal fight) or r = 5 (title fight)

Visually, for a randomly picked fighter, this looks as follows:

The table below shows the summarized data:

round	KO	censored	fighters removed	fighters at risk	hazard rate
0	0	0	0	27	-
1	7	1	8	27	0.26
2	4	0	4	19	0.21
3	3	2	5	15	0.20
4	2	0	2	10	0.20
5	0	8	8	8	-

I want to calculate the probability that a KO happens in round r when it hasn’t happened yet in round r – 1 . So, I want to create a discrete time survival model.
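For reference, the "hazard rate" column of the table is the life-table estimate, KOs divided by fighters at risk. In the sketch below d_r (KOs in round r) and n_r (fighters at risk at the start of round r) are symbols introduced here for illustration; the numbers come straight from the table.

```latex
\hat{\lambda}_r = \frac{d_r}{n_r}, \qquad
\hat{\lambda}_1 = \frac{7}{27} \approx 0.26, \quad
\hat{\lambda}_2 = \frac{4}{19} \approx 0.21, \quad
\hat{\lambda}_3 = \frac{3}{15} = 0.20, \quad
\hat{\lambda}_4 = \frac{2}{10} = 0.20
```

By the same formula round 5 would give 0/8 = 0; the table leaves that cell blank.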

Let KO_o be a discrete random variable that indicates the round when the KO occurs for a randomly selected opponent o.
Next, we define the discrete-time hazard as the conditional probability of opponent o getting KO’ed in round r given that he/she has survived until that round.
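Written as a formula, the verbal definition above would presumably be the standard discrete-time hazard. The symbol λ_{r,o} is borrowed from later in this post; the exact form of the original equation is an assumption.

```latex
\lambda_{r,o} = \Pr\{KO_o = r \mid KO_o \ge r\}
```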

Cox (1972) proposed that, because the hazard rates are probabilities, they can be reparametrized so that they have a logistic dependence on the time periods.

Where [R_1, R_2, R_3, R_4, R_5] is a sequence of dummy variables. If an opponent is KO’ed / censored in round 3, R_3 = 1 and the rest are 0. If we take the logs, we obtain a model on the logit of the hazard rate.
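Written out with the coefficients α_1, …, α_5 that the Stan code further down uses, the logistic form and its logit would presumably read as follows. This is a reconstruction under that naming assumption, not a copy of the original equations.

```latex
\lambda_{r,o} = \frac{1}{1 + \exp\!\big(-(\alpha_1 R_1 + \alpha_2 R_2 + \alpha_3 R_3 + \alpha_4 R_4 + \alpha_5 R_5)\big)}

\operatorname{logit}(\lambda_{r,o})
  = \log\frac{\lambda_{r,o}}{1 - \lambda_{r,o}}
  = \alpha_1 R_1 + \alpha_2 R_2 + \alpha_3 R_3 + \alpha_4 R_4 + \alpha_5 R_5
```

Because exactly one dummy equals 1 in any given round, the logit of the hazard in round r is simply α_r.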

Next, for every opponent o, we determine for each round r whether a KO was observed using a sequence of dummy variables Y_r,o that consist of the values y_r,o.

If opponent o does not get KO’ed during the match, Y_r,o will be equal to 0 in every round that was observed during the fight. If the fight was a title fight, Y_r,o is equal to {0, 0, 0, 0, 0}. For an opponent that gets KO’ed in the third round, Y_r,o is equal to {0, 0, 1}.

In addition, we want to check whether opponent o is censored or not.

The probability that an uncensored opponent o will get KO’ed in round r is equal to

The probability that a censored opponent o will get KO’ed after round r is
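In terms of the hazards λ_{j,o}, these two probabilities are usually written as below. The index j, running over the rounds up to r, is introduced here for illustration.

```latex
\Pr\{T_{ko} = t_r\} = \lambda_{r,o} \prod_{j=1}^{r-1} \left(1 - \lambda_{j,o}\right)

\Pr\{T_{ko} > t_r\} = \prod_{j=1}^{r} \left(1 - \lambda_{j,o}\right)
```

As a quick check with the life-table hazards from the table above, a KO in round 3 would have probability (20/27) · (15/19) · (3/15) = 60/513 ≈ 0.12.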

The likelihood function is the product of the probabilities of observing the data, Pr⁡{T_ko = t_r }, in the case of uncensored opponents (c_o = 0), and Pr⁡{T_ko > t_r }, in the case of the censored opponents (c_o = 1):
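A sketch of that product in symbols, assuming n opponents and writing r_o for the last round observed for opponent o (both names are introduced here and are not in the original post):

```latex
L = \prod_{o=1}^{n}
    \Pr\{T_{ko} = t_{r_o}\}^{\,1 - c_o}\;
    \Pr\{T_{ko} > t_{r_o}\}^{\,c_o}
```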

According to Singer and Willett (1993) we can rewrite the above as
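The rewritten likelihood is presumably the familiar product of Bernoulli terms over all observed opponent-rounds. In this sketch n is the number of opponents and r_o the last round observed for opponent o; both are notation assumed here.

```latex
L = \prod_{o=1}^{n} \prod_{j=1}^{r_o}
    \lambda_{j,o}^{\,y_{j,o}} \left(1 - \lambda_{j,o}\right)^{1 - y_{j,o}}
```

For an opponent KO’ed in round r the only factor with y = 1 is the last one, which gives λ_r times the product of (1 − λ_j) for j < r. For a censored opponent every y is 0, which gives the product of (1 − λ_j) for j ≤ r.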

Now the likelihood function of the discrete-time hazard model is equal to the likelihood function for N (t_1, t_2, …, t_r) independent Bernoulli trials with parameter λ_r,o. So, we can treat the N dichotomous observed values y_r,o as the values of the outcome variable in a logistic regression analysis of the time-period indicators R.

I transformed my data to person-period format (attached as rounds_jon_jones.csv, 2.4 KB)

and wrote the following Stan code

> data {
>   int<lower=0> n_rounds;
>   int<lower=0, upper=1> knockouts[n_rounds];
>   matrix[n_rounds, 5] rounds;
> }
> 
> parameters {
>   real alpha_1;
>   real alpha_2;
>   real alpha_3;
>   real alpha_4;
>   real alpha_5;
> }
> 
> model {
>   // priors 
>   alpha_1 ~ normal(0, 1);
>   alpha_2 ~ normal(0, 1);
>   alpha_3 ~ normal(0, 1);
>   alpha_4 ~ normal(0, 1);
>   alpha_5 ~ normal(0, 1);
> 	
>   // likelihood
>   knockouts ~ bernoulli_logit(alpha_1 * rounds[,1] + alpha_2 * rounds[,2] + alpha_3 * rounds[,3] + alpha_4 * rounds[,4] + alpha_5 * rounds[,5]);
> }

Running the code gives the following output

I calculate the hazard rate of the first round as follows:
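Under the logit model above, the conversion from an intercept to a hazard would be the inverse logit. The formula actually used in the post is not shown, so the first line is an assumption about what it should be; the two worked values only illustrate the scale.

```latex
\lambda_1 = \operatorname{logit}^{-1}(\alpha_1) = \frac{1}{1 + e^{-\alpha_1}}

\hat{\lambda}_1 = \tfrac{7}{27} \approx 0.26
  \;\Longleftrightarrow\; \alpha_1 = \log\tfrac{7}{20} \approx -1.05,
\qquad
\lambda_1 = 0.752
  \;\Longleftrightarrow\; \alpha_1 = \log\tfrac{0.752}{0.248} \approx 1.11
```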

The table below shows the hazard rate for every round:

round	hazard
1	0.752
2	0.505
3	0.257
4	0.330
5	0.061

These hazard rates are quite different from the hazard rates in the first table, where we divide the number of KOs by the number of fighters at risk per round. Can someone tell me if I did something wrong? Any other feedback is much appreciated as well.
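A compact variant of the model above may make the comparison easier. This is only a sketch: it collects the five intercepts into one vector alpha (a renaming, not the original parameterization), uses the array[...] declaration syntax of recent Stan releases (older releases would need int knockouts[n_rounds]), and adds a generated quantities block, with the hypothetical name hazard, that maps each intercept to the probability scale with inv_logit.

```stan
data {
  int<lower=0> n_rounds;                             // rows of the person-period data
  array[n_rounds] int<lower=0, upper=1> knockouts;   // 1 if a KO was observed in that row's round
  matrix[n_rounds, 5] rounds;                        // one-hot round indicators R_1..R_5
}

parameters {
  vector[5] alpha;                                   // logit of the hazard in each round
}

model {
  // same prior as in the original code, applied to all five intercepts
  alpha ~ normal(0, 1);

  // rounds * alpha picks alpha[r] for a row that belongs to round r
  knockouts ~ bernoulli_logit(rounds * alpha);
}

generated quantities {
  // hazards on the probability scale (name "hazard" is illustrative)
  vector[5] hazard = inv_logit(alpha);
}
```

The posterior summaries of hazard can then be compared with the life-table column directly. With a normal(0, 1) prior and only 27 fights some shrinkage toward 0.5 is to be expected, so the two sets of numbers need not agree exactly.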

Thank you

martinmodrak · May 18, 2021, 5:03pm

That’s a lot of work you shared! I admit that I didn’t dig into the math much, but I don’t see any glaring issues.

If I understand it correctly, you are actually fitting something very similar (or equivalent?) to a sequential ordinal model as discussed by Paul Bürkner at https://journals.sagepub.com/doi/epub/10.1177/2515245918823199

As I said before, (assuming the maximum number of rounds is the same for all matches) you could probably extend the ordinal model to incorporate outcomes for both opponents, so that your categories would be:

A knocked out in round 1
A knocked out in round 2
…
A knocked out in round 5
Nobody knocked out
B knocked out in round 5
…
B knocked out in round 1

This would let you capture the fact that a knock-out for one of the fighters automatically means that the other didn’t score a knock-out. You could then enforce symmetry of the “difficulty” by having the intercepts mirrored.
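A rough sketch of how the mirrored intercepts might look in Stan, assuming every fight is scheduled for five rounds as stated above. All names (fighter_a, fighter_b, outcome, skill, c_pos) and the priors are hypothetical choices made for illustration, and the array[...] syntax requires a recent Stan release.

```stan
data {
  int<lower=1> n_fights;
  int<lower=1> n_fighters;
  array[n_fights] int<lower=1, upper=n_fighters> fighter_a;
  array[n_fights] int<lower=1, upper=n_fighters> fighter_b;
  // 1..5:  A knocked out in round 1..5
  // 6:     nobody knocked out
  // 7..11: B knocked out in round 5..1
  array[n_fights] int<lower=1, upper=11> outcome;
}

parameters {
  vector[n_fighters] skill;       // one latent strength per fighter
  positive_ordered[5] c_pos;      // upper half of the cutpoints
}

transformed parameters {
  // 10 cutpoints for 11 categories, mirrored around zero
  vector[10] cutpoints;
  for (k in 1:5) {
    cutpoints[k] = -c_pos[6 - k];
    cutpoints[5 + k] = c_pos[k];
  }
}

model {
  skill ~ normal(0, 1);
  c_pos ~ normal(0, 2);

  // a larger skill difference pushes the outcome toward "B knocked out early"
  outcome ~ ordered_logistic(skill[fighter_a] - skill[fighter_b], cutpoints);
}
```

Mirroring means that swapping the labels A and B leaves the model unchanged, which is one way to read the symmetry requirement.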

Would that make sense?

Best of luck with your model!

HJAM24 · May 28, 2021, 10:23am

hey @martinmodrak, sorry for the delayed response.
The maximum number of rounds is not always the same. The majority of the fights, the standard fights, have a maximum of 3 rounds. Title fights can last at most 5 rounds. Does this setup make the ordinal model infeasible? Nevertheless, I continued exploring the logistic regression approach and changed the code to include multiple fighters:

data {
  int<lower=1> n_rounds;
  int<lower=0, upper=1> knockouts[n_rounds];
  matrix[n_rounds, 5] rounds;
  int<lower=1> n_fighters;
  int<lower=1, upper=n_fighters> fighter_id[n_rounds];
}

parameters {
  vector[n_fighters] alpha_1;
  vector[n_fighters] alpha_2;
  vector[n_fighters] alpha_3;
  vector[n_fighters] alpha_4;
  vector[n_fighters] alpha_5;
}

model {
  // priors
  alpha_1 ~ normal(0, 1);
  alpha_2 ~ normal(0, 1);
  alpha_3 ~ normal(0, 1);
  alpha_4 ~ normal(0, 1);
  alpha_5 ~ normal(0, 1);

  // likelihood
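  // alpha_k[fighter_id] and rounds[,k] are both vectors, so the product must be elementwise (.*)
  knockouts ~ bernoulli_logit(alpha_1[fighter_id] .* rounds[,1]
                              + alpha_2[fighter_id] .* rounds[,2]
                              + alpha_3[fighter_id] .* rounds[,3]
                              + alpha_4[fighter_id] .* rounds[,4]
                              + alpha_5[fighter_id] .* rounds[,5]);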
}

However, I am still ignoring the strength of the opponent. How can I make sure that it is included? Is that as simple as adding a vector[n_fighters] beta term to the script?

Basically, the model is a survival analysis where each fighter is a disease and the patients are the same. This would lend itself well to partial pooling, right? In addition, there are no competing risks, because there is enough rest between the fights (sometimes months). Have you ever read a paper with a similar set-up?
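One possible way to write this down, as a sketch only: the logit of the hazard gets a shared round effect, plus a partially pooled "attack" effect for the fighter, minus a partially pooled "defense" effect for the opponent. The names round_id, opponent_id, attack, defense and the sigma_* scales are hypothetical, the priors are placeholders, and the array[...] syntax requires a recent Stan release.

```stan
data {
  int<lower=1> n_rows;                                         // rows of the person-period data
  int<lower=1> n_fighters;
  array[n_rows] int<lower=0, upper=1> knockouts;
  array[n_rows] int<lower=1, upper=5> round_id;                // round of each row
  array[n_rows] int<lower=1, upper=n_fighters> fighter_id;     // fighter trying to score the KO
  array[n_rows] int<lower=1, upper=n_fighters> opponent_id;    // fighter at risk of being KO'ed
}

parameters {
  vector[5] alpha;                      // baseline logit-hazard per round
  vector[n_fighters] attack_raw;        // non-centered fighter effects
  vector[n_fighters] defense_raw;       // non-centered opponent effects
  real<lower=0> sigma_attack;
  real<lower=0> sigma_defense;
}

transformed parameters {
  vector[n_fighters] attack = sigma_attack * attack_raw;
  vector[n_fighters] defense = sigma_defense * defense_raw;
}

model {
  alpha ~ normal(0, 1.5);
  attack_raw ~ std_normal();
  defense_raw ~ std_normal();
  sigma_attack ~ normal(0, 1);
  sigma_defense ~ normal(0, 1);

  // a strong opponent (large defense) lowers the hazard of being KO'ed
  knockouts ~ bernoulli_logit(alpha[round_id]
                              + attack[fighter_id]
                              - defense[opponent_id]);
}
```

The shared scales sigma_attack and sigma_defense are what provide the partial pooling: fighters with few fights are pulled toward the average.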

Thank you again

martinmodrak · June 7, 2021, 4:55pm

The varying number of rounds does not make the ordinal model completely infeasible, just more tedious. You could probably assume that you have the same predictor for the latent continuous variable but have separate thresholds (intercepts) for the two cases.

As for including opponent strength via a beta term: I am not sure I understand you completely, but it might be that simple.

I think there is quite a lot of work out there on modelling outcomes in sports, especially baseball and soccer are AFAIK popular targets for modelling, so I would start there (I am not familiar with the literature to give specific recommendations). There are definitely other sports that have variable number of rounds (I think cricket does, but I am not 100% sure), so I would expect there to be some prior work.

I admit that I really struggle to understand how framing this problem as a survival model is useful/sensible - I think the differences are bigger than the similarities - but I might be missing something. But yes, partial pooling would be easy in this setting (and in many other settings).
