I’m currently having some issues with Stan’s NUTS sampler. I’ve encountered them before, but last time I think I fixed the issue by either hardening the model against outputting NaNs or “fixed” it by fiddling with random seed values. This time around, I’m not seeing anything obvious in the logs that could point me towards a fix. And now that my code’s public, it’s much easier to ask for help.
My “star” Stan model is run in parallel for eight demographics. Seven of them complete without error, but STDOUT for the eighth’s sampler portion contains this oddity:
# [A TONNE OF OUTPUT OMITTED]
8093.835: Chain [1] Iteration: 10000 / 10000 [100%] (Sampling)
8093.843:
8103.613: Elapsed Time: 8086.29 seconds (Warm-up)
8103.613: 7.494 seconds (Sampling)
8103.613: 8093.79 seconds (Total)
8103.613:
# [STILL MORE OUTPUT OMITTED]
8898.718: Chain [2] Iteration: 10000 / 10000 [100%] (Sampling)
8899.392:
8919.936: Elapsed Time: 8227.19 seconds (Warm-up)
8919.936: 672.141 seconds (Sampling)
8919.936: 8899.33 seconds (Total)
8919.936:
# [SKIPPING PAST CHAIN 3]
9952.118: Chain [4] Iteration: 10000 / 10000 [100%] (Sampling)
9953.434:
9953.439: Elapsed Time: 8695.15 seconds (Warm-up)
9953.439: 1258.23 seconds (Sampling)
9953.439: 9953.38 seconds (Total)
9953.439:
It’s a bad sign when one chain completed two orders of magnitude faster than the others! My code discards the raw posterior CSVs output by Stan’s NUTS sampler, but I can still isolate one chain’s output in the compressed Parquet file it keeps around.
>>> posterior = pd.read_parquet( "excess_deaths.Age at time of death, 65 to 84 years.Males.parquet" )
>>> posterior['exp_amp'][:32:4] # nothing special, I get the same behaviour from other parameters
0 987.64845
4 987.64845
8 987.64845
12 987.64845
16 987.64845
20 987.64845
24 987.64845
28 987.64845
Name: exp_amp, dtype: float64
>>> posterior['exp_amp'][1:32:4]
1 1162.0122
5 1240.2681
9 1294.4327
13 1428.2402
17 1114.5056
21 1081.0909
25 1453.9135
29 1326.2621
Name: exp_amp, dtype: float64
Chain one seems to have gotten “stuck,” pumping out the same value over and over again. One possible explanation is a NaNin the posterior, but…
>>> sum( posterior.isna().sum() )
0
… there are none to be found. The gradient may have become a NaN value anyway, but if that’s the case I’d expect the log probabilities to be NaN for that chain, and the Parquet files include the lp__ column. I’d also expect a much shorter burn-in period than is observed. There are some infinite values in the posterior, but that’s because one variable in “generated quantities” returns inf if nothing’s amiss. Chains two through four have ample inf values, and yet nothing’s wrong there.
The STDERR log of the sampler run isn’t illuminating, either.
$ bzcat excess_deaths.Age\ at\ time\ of\ death\,\ 65\ to\ 84\ years.Males.sampler.err.bz2 | grep Exception
0.027: Exception: normal_lpdf: Scale parameter[5] is -4.18656e+28, but must be positive! (in 'excess_deaths.stan', line 646, column 8 to column 130)
0.031: Exception: normal_lpdf: Scale parameter[4] is -1.21467e+29, but must be positive! (in 'excess_deaths.stan', line 646, column 8 to column 130)
0.031: Exception: normal_lpdf: Scale parameter[2] is -9.44079e+28, but must be positive! (in 'excess_deaths.stan', line 646, column 8 to column 130)
0.031: Exception: normal_lpdf: Scale parameter[393] is -7.26158e+28, but must be positive! (in 'excess_deaths.stan', line 646, column 8 to column 130)
0.032: Exception: normal_lpdf: Scale parameter[2] is -4.16264e+28, but must be positive! (in 'excess_deaths.stan', line 646, column 8 to column 130)
0.033: Exception: normal_lpdf: Scale parameter[3] is -29.9696, but must be positive! (in 'excess_deaths.stan', line 646, column 8 to column 130)
0.033: Exception: normal_lpdf: Scale parameter[1] is -7.19344e+27, but must be positive! (in 'excess_deaths.stan', line 646, column 8 to column 130)
0.033: Exception: normal_lpdf: Scale parameter[356] is -1.43916e+29, but must be positive! (in 'excess_deaths.stan', line 646, column 8 to column 130)
0.035: Exception: normal_lpdf: Scale parameter[393] is -1.22123e+29, but must be positive! (in 'excess_deaths.stan', line 646, column 8 to column 130)
0.035: Exception: normal_lpdf: Scale parameter[4] is -198.739, but must be positive! (in 'excess_deaths.stan', line 646, column 8 to column 130)
0.036: Exception: normal_lpdf: Scale parameter[421] is -2091.82, but must be positive! (in 'excess_deaths.stan', line 646, column 8 to column 130)
0.080: Exception: normal_lpdf: Scale parameter[20] is -19.7683, but must be positive! (in 'excess_deaths.stan', line 646, column 8 to column 130)
0.224: Exception: Found a inf value in 'baseline', rejecting sample. (in 'excess_deaths.stan', line 331, column 1 to column 75)
4.948: Exception: Found a inf value in 'baseline', rejecting sample. (in 'excess_deaths.stan', line 331, column 1 to column 75)
5.624: Exception: Found a inf value in 'flu_dispersion', rejecting sample. (in 'excess_deaths.stan', line 338, column 1 to column 87)
5.626: Exception: Found a inf value in 'exp_amp', rejecting sample. (in 'excess_deaths.stan', line 333, column 1 to column 73)
9953.439: Exception: normal_lpdf: Scale parameter[314] is -2.38664e+06, but must be positive! (in 'excess_deaths.stan', line 646, column 8 to column 130)
That may look nasty, but other demographics can have very similar exceptions in their output. Nothing stands out to me.
I’ve looked around in this forum for others running into the same issue, and while there is some precedent their problem was usually systemic and could be solved by eyeballing the code. My model is much too complex for that. As pointed out earlier, this problem also only shows up for one demographic out of eight, and only popped up after months of A-OK operation; but what I haven’t yet mentioned is that I have a “variant” system that currently spams those eight demographics 44 more times with slightly tweaked parameters. The only significant difference between all those successful prior runs and the current one are updated datasets (at the start of the pipeline, it downloads the latest versions off the public internet). The only code changes have been isolated to the Python chart-generating code, and that’s well downstream. This points to something rare and intermittent causing the issue, be it within my model or CmdStan itself.
Thankfully, I put a lot of thought into how to replicate my work, which also makes replicating this bug a snap:
$ docker pull hjhornbeck/excess_deaths_canada_2010_2023:DEBUG_IMAGE
$ docker run -it --rm hjhornbeck/excess_deaths_canada_2010_2023:DEBUG_IMAGE /bin/bash
user@8c639c5cd547:~/excess_deaths_canada_2010_2023$ python3 src/model_excess_deaths.py --stripe-id 5 --stripe-count 8 --debug
If all three commands run successfully, then approximately 9,953 seconds later you should have a finished sampler run with a broken posterior. Change “5” to any other number between 0 and 7, and you’ll be working on another demographic (derived/categories.csv provides a handy list); eliminate all the options after the Python source file, and up to 32 cores of your computer will start number-crunching on every demographic. If you want to save the original posterior CSVs, the function that manages the sampler run has a parameter to handle that but you’ll have to either manually edit the code in src/model_excess_deaths.py or enable pdb and intercept the call just before it happens.
Just want to examine the aftermath, without the wait?
$ docker pull hjhornbeck/excess_deaths_canada_2010_2023:DEBUG_POST_IMAGE
Docker images are mounted to the filesystem when run, so with root access you can poke at what’s inside from the outside. If you’d prefer to work from the inside, it’s not hard to write aDockerfilethat adds whatever tools you need to the existing image.



