Excess Death Study, Preprint + Source Code

I’ve finally submitted my first paper that uses Stan, an attempt to model excess mortality in Canada over the span of fourteen years. It’s just been submitted, so who knows if/when it’ll get published, but in the meantime the preprint and source code are public. The latter will probably be of most interest here, most notably I wrote two different Stan “libraries” for cubic spline interpolation, both of which include routines for integrals and derivatives.

https://gitlab.com/hjhornbeck/excess_deaths_canada_2010_2023/-/blob/main/models/nr_splines.stan
https://gitlab.com/hjhornbeck/excess_deaths_canada_2010_2023/-/blob/main/models/mitchell_splines.stan

There’s also Stan code to handle heaps, reservoir sampling, and double-precision summation. I could have sworn there was a central repository for sharing Stan code, but all of my searches have come up empty, so this will have to do for now. Most of the repository carries a CC BY-SA-NC license, but a lot of simpler low-level functions are CC0.

As for the paper itself, the pre-print is here:

Understanding the causes of death in a population is vital to guide effective health policy. In this paper, we build a novel population-level model of mortality in Canada. It is capable of handling multiple causes of death simultaneously, has a more intuitive interpretation than prior models, allows for variable temporal resolution, and incorporates uncertainty into its outputs. Our model estimates that the official estimates of influenza only capture one in twenty of all deaths associated with influenza, and official COVID-19 mortality is half of all COVID-19-associated mortality. It finds younger demographics appear more susceptible to COVID-19 mortality than estimated, suggests there are multiple periodic patterns to mortality for all demographics, and that the per-capita rate of mortality would have decreased if not for influenza, COVID-19, and drug poisonings. We also examine and critique our model in depth, explore alternative parametrizations, and find it is weakest at capturing susceptibility to influenza and COVID-19-linked mortality. Full results and source code are public.

I don’t know how useful the rest of the Stan code will be; I think there’s some nifty ideas in there, but this was also my first time working with Stan, and my statistical knowledge is largely self-taught. There’s probably some silly or embarrassing stuff as well (at one point I thought log_sum_exp()wasn’t vectorized), but I am only human. Critiques are welcome, of course, it’s tough to improve without peer review.

Welcome to the Stan forums and thanks for posting!

We’ve talked about it, but as far as I know, such a thing does not exist.

Also, sorry you had to hack the interpolation. That’s the one case where we feel like it’d be nice to have a parameter to integer conversion because you can guarantee differentiability analytically.

I couldn’t see where your interpolate() function was used in nr_splines.stan. But you should be able to constrain the knots to be ordered so you don’t have to do a rejection test. If that ever fires, it means Stan’s devolving to rejection sampling, which is not good for efficiency or robustness.

What led you to rewrite all the numerics here? Was the simple vectorized log-sum-exp too inaccurate in some places? And if you are going to do different kinds of summation, Kahan summation is the usual approach. We would have included it if it had ever become an issue for anyone. So I’m very curious about this example. If you do need to sort, we have a fast built-in quick sort that should be much more performant than writing your own heap sort.

You may be thinking of @spinkney’s GitHub - spinkney/helpful_stan_functions · GitHub

Oh no no, don’t apologize for a lack of spline routines! There’s so many different variants out there it’d be silly to incorporate one. Kharratzadeh’s semi-official natural spline code just wasn’t cutting it for my use case.

Statistics Canada groups mortality data into different age demographics than the Public Health Agency of Canada’s drug poisoning mortality data. Rather than fudge something, I fit a density spline to the PHA data with a C^1-continuous exponential welded onto it:

See what I mean? As luck would have it, though, I had a C++ re-implementation of Numerical Recipes’ cubic spline code in my back pocket, I just needed to rewrite it in Stan and add code for integration. It quickly became a core utility through all my models, and not just because of its flexibility. A preview of what will probably become my next paper:

block and sub-section spline routine execution time, 1000 loops (in seconds)
transformed_data.interpolate mitchell_sparse 0.014977
transformed_data.interpolate nr_default 0.030765
transformed_data.interpolate deboor_sparse 0.046460
transformed_data.interpolate mitchell_default 0.050792
transformed_data.interpolate deboor_default 0.272033

There’s good reason it made it into NR, the algorithm’s dang quick. I can only beat it by writing custom sparse matrix code tuned to spline interpolation (deboor_sparse uses Stan’s internal sparse matrix routines instead, hence the relative sluggishness). Alas, these splines don’t perform nearly as well when in transformed parameters, which explains the code for Mitchell-Netravli B-splines.

interpolate() is only ever used in excess_deaths.stan, alas, the Python equivalent gets far more of a workout.

As for why heaps are there, I originally didn’t think log_sum_exp() was vectorized, and I was paranoid that summing together log likelihoods would quickly lead to lost precision. I’ll spare you the full rollercoaster, but I eventually worked out that heaps + non-vectorized log_sum_exp() is indeed more precise than vectorized log_sum_exp()… but only ~50% of the time (and log_sum_exp() was more precise maybe a quarter of the time?!), and only for the last digit or two, plus my routine was 100x slower. Whoops! Why not use the internal sort? My algorithm was only adding one new value to an otherwise sorted list, and it only ever needed the smallest value, so on paper heaps would be more performant.

The implementation of Kahan-Babuška in there was a fallback in case my heap-based solution turned out to be annoyingly more precise than vectorized log_sum_exp(). Since it wasn’t, I never wound up using that code.

The reservoir sampling code came from fitting functions to a posterior. Without some form of sub-sampling, the model section would have been crushed with roughly 2,960,000 evaluations. But I also didn’t want to include the same sample twice, and for the time axis I also wanted a stratified sub-sample. The former I wrote for fun (there’s more efficient routines out there, but you can get away with a lot in transformed data), the latter because I couldn’t find an existing algorithm.

Those other bits of code probably won’t get nearly as much use as the spline routines, but the code was always going to be released under an open-source license. It made more sense to keep it in place, even if I wound up doing something else, just in case someone else could make use of it.

That’s the one! A shame it doesn’t seem very active, but no matter. I’ll fire off a PR in the next day or two.