Stan 2.17 running slower on a model than Stan 2.15

That’s way too much of a slowdown to tolerate.

If that’s all coming from the strings, can we revert it back? Is there a way to test if that’s what it is?

1.76s to 2.13s is way too much of a slowdown? I think Daniel was saying that it was likely just due to new random numbers affecting how long fitting takes…?

One can revert merge commits and run tests, sure - git revert -m1 <sha1> and then ./runTests src/test/performance

Wouldn’t that just show up on a profiler? Also, are these strings being constructed even if no errors are thrown? That seems unnecessary…

I’m running the performance tests on 2.17.0 with PR 584 (the strings change) reverted in the math library and it seems to go back to ~1.7s per run for me locally.

I’m not sure the built-in performance test is measuring what @jjramsey is reporting (which was an order of magnitude more). If someone can get us the data for that we could try it out locally and make sure.

Sorry, that wasn’t really a question. If there were something specific to profile that showed the order-of-magnitude slowdown, I’d be interested in trying it out. I haven’t seen meaningful slowdowns.

May I share what you sent me here?

Yes, please do.

Should I go ahead and revert PR 584 (since it’s causing at least a 25% slowdown on its own)?

Whoa… hold up! Can’t compare wall time across these git hashes.

You need to be measuring against effective sample size. And since we’ve changed how random numbers are generated, you can’t do this with just one seed. It’s one of the reasons I’m very careful when we change behavior.

I timed the char * vs std::string and there really was no measurable time difference. I looked at the timings with the performance test too and there wasn’t anything that was noticeable.

For this particular model, I think it would help to run it with 100 different seeds and check the wall time divided by the number of effective samples per run. And the same with the other version. (And maybe replicating that whole process to get a sense of the variability.)
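
In rstan terms, that’s roughly this kind of loop (an untested sketch; "model.stan" and stan_data are placeholders for whatever model and data is being benchmarked):

```r
# Rough sketch: seconds per effective sample across many seeds, via rstan.
library(rstan)

time_per_ess <- sapply(1:100, function(s) {
  fit <- stan(file = "model.stan", data = stan_data,
              chains = 1, seed = s, refresh = 0)
  wall  <- sum(get_elapsed_time(fit))       # warmup + sampling seconds
  n_eff <- summary(fit)$summary[, "n_eff"]  # ESS for every parameter (and lp__)
  wall / min(n_eff, na.rm = TRUE)           # seconds per effective draw, worst case
})

summary(time_per_ess)  # compare this whole distribution across the two versions
```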

It sounds like you already did that work - can you dig it up so that we could use it now? We should check it in and ideally modify the performance test to output those numbers, if those are what we really care about.


I’m pretty sure I reported it on the pull request. If not, maybe we can verify pre and post PR? I’ll try to dig up the PR (do you know the number off-hand by any chance?).

PR 584: https://github.com/stan-dev/math/pull/584 (Discourse is telling me to stop spamming this link :P)

What do you mean by verify pre- and post-?

Also isn’t the performance test already running with 100 different seeds?

=). Thank you! And… no, I didn’t post it there!

This would be easy to verify by running the tests before this pull request and then after and looking at the timing change.

Nope. That variability is what happens when we run 100 times with the same seed! It shocked me to see that there’s sometimes a 2x difference in time based on whatever is happening on the computer at the time, even though it’s supposed to be the only thing running on the machine.

Seems like, as you’re mentioning, what you really want to do is run with 100 different seeds, no…?

I ran the tests with and without just that PR and came up with the above results - are you talking about something different with the pre and post?

20% isn’t an order of magnitude, but that’s way too high a cost just to clean up string passing.

It sounds like you’re running from 100 different seeds and reporting average timing, which should average out most of the seed effects. Try another 100 and see how much variation there is.

Good point.

What I want to test for the models is the time to evaluate the log density and gradients. No sampling involved. That should catch a lot of things, and it should always produce the same result for the same input. There’s some dependence on inputs if there are functions with different code paths for different inputs (definitely iterative things like ODE solvers, but even simple things like logs can behave this way under the hood).

Then we also want to test sampling, but that’s way more complicated as we have to make sure we’re getting the right results, then conditional on that test speed.
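
For the first part (just log density and gradients), here’s a rough, untested sketch through rstan, where fit is any stanfit object for the model under test:

```r
# Time repeated log-density and gradient evaluations; no sampling involved.
library(rstan)
library(microbenchmark)

upars <- rnorm(get_num_upars(fit))  # an arbitrary point on the unconstrained scale

microbenchmark(
  lp   = log_prob(fit, upars),
  grad = grad_log_prob(fit, upars),
  times = 1000
)
```

That goes through R’s wrapper, so there’s some call overhead, but it should be the same overhead on both versions being compared.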


That’s a much more robust test in some ways, at least if what we’re trying to measure is run time. But as Daniel points out, we also need to test n_eff.

So you’re saying there’s a +/- 20% measurement effect from running the same program twice? Is the third time the same as the second? There are a lot of file and memory buffering effects.

There are statistics you can use to measure these things :-)

I installed the package ‘moments’:

```r
> library('moments')
> skewness(vanilla)
[1] 0.8239469
> skewness(sans)
[1] 0.6429934
> kurtosis(vanilla)
[1] 4.35243
> kurtosis(sans)
[1] 3.92635
```

I don’t know if those are meaningfully different. I’m sure there are statistical methods for that, but I don’t know what they are off the top of my head, or whether they’re really worth looking into; it looks similar enough from the histogram.

Sorry, should’ve indicated that was a joke even more clearly. (Of course there are tests—statisticians have been very busy.)
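
For example, if vanilla and sans are the two vectors of run times, a distribution-level comparison in base R would be something like:

```r
# Compare the two timing distributions directly.
ks.test(vanilla, sans)      # two-sample Kolmogorov-Smirnov test
wilcox.test(vanilla, sans)  # rank-based test for a shift in location
```

Neither tells you whether a difference matters in practice, only whether one is detectable.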

When you reported the timing, you mentioned that it was 2.16.0 vs 2.17.0. Can you confirm the exact hashes you used? I’ll dig up the timing on our Jenkins box. Luckily, I think we store a decent amount of history.

Did you see my earlier posts? I’m not sure if there’s any point in recovering that data or running tests - the distribution of times looks about the same. I did the original test with tags v2.16.0 vs v2.17.0. I then tested with just that PR removed and found the results to be the same. And then I did 1000 runs of 2.17.0 with the PR reverted (“sans”, above) vs. 2.17.0 “vanilla.”