Ordinal multilevel regression in brms

Hi all, I am trying to predict ranking (1st to 10th place) from racetime using historical data of elite marathoners. My question is “Conditional on a 2:05 marathon (2h 5min) performance what is the probability for a medal winning position (i.e at least 3rd)?”.

To put that in perspective the first sub2:05 marathon was run by the winner of the Berlin marathon back in 2003 (a world record at the time); some 20 years later 9 out of top-10 2023 Berlin marathon achieved sub2:05 performances. So there is a natural progression of marathon performance, there are mean differences between venues (some attract the best elite runners, some attract “slower” elites and of course some venues are inherently slower compared to others due to course , environmental or other external factors).

My data set contains runner id, venue id, year id , racetime and top-10 ranking for 15 selected popular marathon races for years 2001-2023. My DV is final race position (as ordered) and the predictors are year and racetime. My model is specified as:

fit<- brm(
position_ordered ~ 1 + year+racetime+(1|athleteid)+(1|venueid)+(1|year_as_factor),
data = Dataset,family = sratio("cloglog"),iter=4000, warmup=1500,control = list(adapt_delta = 0.95, max_treedepth=15), seed=12345)

I included year both as a population- and group-effect given that there is a natural progression in marathon performance across years on average, but also random variations around that trend. In addition I added athlete and venue group-effects due to the repeated measures within both athletes and venues.

The first question is whether the sequential ordinal regression is appropriate for this type of research? In addition I am diving into Burkner and Vuorre ordinal tutorial on the interpretation of the coefficients but I cannot wrap my head around a satisfactory natural interpretation. On the above model I have 0.1 (95%CI: 0.07-0.12) for year and 0.42 (95%CI: 0.39-0.45) for racetime.

Any help/input would be greatly appreciated!!!

PS: Haven’t touched the priors yet, just float with the default ones, the model does seem to converge without any problems, but first thing’s first and I would like to get a solid understanding of the interpretation before experimenting with priors or with more complex models.

I don’t know how to solve this problem using the precise model you’re describing, but with the proviso that this may be a silly question: Why not model the finishing times directly, and then use the distributions of those finishing times to answer your question? For instance, if you can estimate the distribution of finishing times for a single race, you can then use that distribution to simulate the race some large number of times, and count the proportion of those simulated races where a time < 2:05 was sufficient for a top-3 ranking. Would that work?

Hi ps2,
Thanks for the reply!

Would you suggest that something in the likes of:

fit<- brm(
racetime~ 1 + year+position_ordered+(1|athleteid)+(1|venueid)+(1|year_as_factor),
data = Dataset,family = sratio("cloglog"),iter=4000, warmup=1500,control = list(adapt_delta = 0.95, max_treedepth=15), seed=12345) 

would be more appropriate?

If i understand correctly your suggestion, I did experiment with this model formulation (with position as ordered monotonic effect) but i failed to see how predicting racetime from position (and the other effects) will provide that a certain level of performance (e.g 2:05) is associated with x% probability for 3rd place in a particular race. I will give it another shoot; that said I would still wait for any input on the sequential ordinal model.

Thanks for your time!

I may be totally misleading you here, so take my suggestions with a grain of salt. But it seems to me that modelling your data may not be that different from some of the ways that response times are modelled in behavioural experiments. In experiments of that sort we often assume that response times on individual trials result from evidence accumulating at a certain rate until it reaches a threshold, at which point the response occurs. We then look at what effects different experimental manipulations have on rate of accumulation. I see this as being analogous to times in races: Each runner moves at a certain speed (rate) until they reach the finish line (threshold), and across runners this results in a distribution of finishing times. The speeds vary across runners, across courses, across weather conditions, etc.

In your case, you want to know the probability that a race time of 2:05 results in a podium finish (i.e., is one of the top 3 finishing times in a specific race). Given that distance is fixed in marathons, this is the same as the probability that a speed of 20.26 km/h (approx) is one of the top three speeds in the race. This will be a function of both the distribution of speeds in the race, and the number of racers.

I am not sure whether your data would be easier to model as finishing times or speeds, but let’s assume that you’re modelling finishing times. Let n be the number of racers, f(x) and F(x) be the density and cumulative distribution functions for finishing time in the race, and S(x) be the survival function 1-F(x). F(2:05) gives us the probability of a finishing time shorter than 2:05, whereas S(2:05) gives us the probability of a finishing time greater than that. If we make some potentially dubious simplifying assumptions (e.g., that finishing times are independent draws from a race-specific distribution), then to obtain the probability of winning the race, conditional on a time of 2:05, we can take S(2:05)^n. To obtain the probability that the same time will achieve second place we can take S(2:05)^(n-1), and to obtain the probability of a third place finish we take S(2:05)^(n-2). We can then sum these probabilities to obtain the probability of a podium finish in the race, given a finishing time of 2:05.

To be able to do all this for different races, and/or to predict this probability for future races, we need to model the finishing times with a distribution that accurately reproduces them, so we can then use the relevant CDF to generate these probabilities from the posterior distributions. If these were response times, I’d probably go with the inverse Gaussian distribution for process reasons; but the lognormal and ex-Gaussian distributions also usually provide good fits to response time data if one is not overly concerned about a process interpretation. All are available as families in brms. I have no idea what your finishing time distributions actually look like, so perhaps these would not be good families to use in your case.

Once we have modelled the finishing times, it would just be a matter of using the relevant posterior draws as the input to the appropriate CDF to generate the probabilities, taking the uncertainty inherent in the parameter estimates into account.

Modelling the data as speeds would not be that different, but you’d probably want to use a different distribution AND the probability calculations would be slightly different (because you’d want the probability of speeds greater than the target speed (20.26 km/h), rather than times shorter than the target time. As such, you’d want to use the CDFs directly, rather than the survival functions.

A further alternative that may be simpler than either of the above options could be to recode your data to make a couple of different binary variables: Time < 2:05 (0 or 1) and finishing position ≤ 3 (0 or 1). You’d then use the binary time variable and your other predictors to model the probability of the binary finishing position variable. This wouldn’t give you the probability that a time of 2:05 is in the top 3, but it would tell you something about the probability that a time < 2:05 is.

Hope some of this is helpful. I have never worked with race data like this and am not 100% sure I’m interpreting your research question as intended, so as I said at the outset, take everything I write here with a big grain of salt!

1 Like

Not specifically about how to interpret the estimates, but I imagine you’d want something like this:

1 + year * racetime + (1|athleteid) + (1|venueid)

i.e., with an interaction between year and racetime, because the relationship between time and placement changes over time (and it’s usually not necessary or a great idea to have a predictor both as a fixed effect and a random factor).