How much precision should we expect in our fits?

seantalts · May 31, 2019, 2:55pm

In the new compiler, an interesting case has come up where the code being generated is not exactly the same and the fits are off by something like 0.005 or so (on single-digit posterior means). Is this close enough? It seems like something of that order of magnitude could come up just through reordering floating point operations over the course of ~10k leapfrog steps. Anyone have thoughts on this? @Bob_Carpenter

syclik · May 31, 2019, 3:00pm

isn’t that a bug and not just an interesting case?

On the same machine with the same compiler, if the model is the same, each transition should be the same. If you’ve intentionally generated different code that does the numerical computation, then it might differ, but you’d really have to dig into what’s different about it to even make a determination whether a posterior mean being off by 0.005 is a bug or if it’s MCMC error.

syclik · May 31, 2019, 3:01pm

Do you have a simple example? Maybe we can break down the code side-by-side to see what it is.

seantalts · May 31, 2019, 3:13pm

The question is basically “what is enough off to be a bug” - your answer might be “they have to be exactly the same” but it’s not the obvious answer to me when making changes to the system.

Agreed it’d be best to figure out what it is. I think it might be a waste of your time to dig into this specific example given that I already know a bunch of things about the new compiler are still incorrect. But thanks for the offer.

I’m just trying to get a sense of what a reasonable precision target is given that we are intentionally generating (slightly) different code to do the numerical computation.

syclik · May 31, 2019, 3:44pm

Yup. It really depends on what you’re trying to do. I thought the first goal was to produce the same generated code (or at least same functionally). With that goal in mind, any deviation is a bug. If the goal is to produce similar code… that’s a lot harder to test. You have to think about it.

Why not post the example? We can get a head start on discussing strategies for hunting this down? Or maybe I can actually spot a bug by eye.

Question 1: what part of the numerical computation is intentionally different?

seantalts · May 31, 2019, 3:50pm

I don’t mind posting it (it’s certainly not private) I just want to be very conscientious of your limited time and recognize that it’s pretty low quality code right now. Here it is anyway: https://gist.github.com/seantalts/2eee4aaf7a835791b681d5091e43be1b . Compare with what stan2 emits: https://gist.github.com/seantalts/e2da343fea48c645b3e38060993a5044

One way in which things are different is that I am eschewing the writer and reader constrain/unconstrain helper functions in favor of direct Math lib calls.

seantalts · May 31, 2019, 3:53pm

Oh, and I also don’t want to derail this thread from its intended purpose: figure out what acceptable precision tolerances are for significant changes.

syclik · May 31, 2019, 5:11pm

Yup. I hear you. My personal tolerance (from a technical, dev point of view) is 0 if there’s no explanation. If there’s an explanation, the bounds don’t matter so much.

For the numeric computation, it really should match identically. We could make the existing generator match it or we could update the math functions to behave the same or we can update the new generator to match. But if the numerical computation is off, then it’s really not computing the same thing for some reason. You must be able to see that with individual calls to whatever function you have.

syclik · May 31, 2019, 5:13pm

for that example, what’s the original Stan code and do you have some data?

seantalts · May 31, 2019, 5:15pm

This is https://github.com/stan-dev/stat_comp_benchmarks/tree/master/benchmarks/low_dim_gauss_mix

syclik · May 31, 2019, 5:21pm

Just a heads up… using CmdStan and diagnose mode, the log prob is very different. It looks like the gradients are the same, though.

This is how I ran it:
./stan3 data file=low_dim_gauss_mix.data.R diagnose random seed=234

Stan3:

TEST GRADIENT MODE

 Log probability=-2342.17

 param idx           value           model     finite diff           error
         0          1.9997         67.5104         67.5104    -1.88828e-07
         1        -1.84968        -12.3411        -12.3411    -3.59003e-06
         2       0.0448746         143.026         143.026       2.227e-06
         3         1.76805        -238.177        -238.177     5.30072e-07
         4        0.580072        -329.885        -329.885    -3.68993e-07

Stan

 Log probability=-3261.11

 param idx           value           model     finite diff           error
         0          1.9997         67.5104         67.5104    -1.88828e-07
         1        -1.84968        -12.3411        -12.3411    -3.59003e-06
         2       0.0448746         143.026         143.026       2.227e-06
         3         1.76805        -238.177        -238.177     5.30072e-07
         4        0.580072        -329.885        -329.885    -3.68993e-07

seantalts · May 31, 2019, 5:23pm

Oh awesome, I didn’t know about that tool. Gives me some clues for what to dig in to :D Thank you!

syclik · May 31, 2019, 5:26pm

Found it. Stan3’s generated code is wrong.

In Stan:

   target += log_mix(theta,
                     normal_lpdf(y[n] | mu[1], sigma[1]),
                     normal_lpdf(y[n] | mu[2], sigma[2]));

this needs to generate the call without the propto__ template parameters for this to be properly normalized.

Stan (correct):

log_mix(theta, normal_log(get_base1(y, n, "y", 1), get_base1(mu, 1, "mu", 1), get_base1(sigma, 1, "sigma", 1)), normal_log(get_base1(y, n, "y", 1), get_base1(mu, 2, "mu", 1), get_base1(sigma, 2, "sigma", 1)));

Stan3 (incorrect):

log_mix(
              theta,
              normal_lpdf<propto__>(
                  stan::model::rvalue(
                      y,
                      stan::model::cons_list(stan::model::index_uni(n),
                                             stan::model::nil_index_list()),
                      "pretty printed e"),
                  stan::model::rvalue(
                      mu,
                      stan::model::cons_list(stan::model::index_uni(1),
                                             stan::model::nil_index_list()),
                      "pretty printed e"),
                  stan::model::rvalue(
                      sigma,
                      stan::model::cons_list(stan::model::index_uni(1),
                                             stan::model::nil_index_list()),
                      "pretty printed e")),
              normal_lpdf<propto__>(
                  stan::model::rvalue(
                      y,
                      stan::model::cons_list(stan::model::index_uni(n),
                                             stan::model::nil_index_list()),
                      "pretty printed e"),
                  stan::model::rvalue(
                      mu,
                      stan::model::cons_list(stan::model::index_uni(2),
                                             stan::model::nil_index_list()),
                      "pretty printed e"),
                  stan::model::rvalue(
                      sigma,
                      stan::model::cons_list(stan::model::index_uni(2),
                                             stan::model::nil_index_list()),
                      "pretty printed e"))));

If you removed the <propto__> or explicitly set false, it’d be correct.

seantalts · May 31, 2019, 5:33pm

Thanks! If you want to learn OCaml and help out with code gen, clearly we could use your discerning eye! :)

syclik · May 31, 2019, 5:55pm

so… curiosity got the better of me. I removed the <propto__> from the generated header and verified that the output was identical. So it looks like that’s the only bug in the generated code for that.

For timing, it looks like the Stan3 generated version is slightly faster than Stan. Very rough numbers, not really trying to be very systematic about it yet:

Stan 3: about 3.1 sec
Stan: about 3.3 sec

seantalts · May 31, 2019, 5:59pm

Thanks :) The new compiler is still definitely cheating - not updating locations or running all of the same validation. But it also doesn’t have Matthijs and Ryan’s optimization module hooked up yet, either :)

syclik · May 31, 2019, 6:08pm

I found the timing difference. It’s in the access of the elements. This is something we can address immediately in Stan and probably should.

In Stan3, I saw this pattern:

stan::model::rvalue(y, stan::model::cons_list(stan::model::index_uni(n), stan::model::nil_index_list())

to access the nth element of y; y is an Eigen Vector.

In Stan, it’s currently:

get_base1(y, n, "y", 1)

My guess is that get_base1 is slower to access that element.

syclik · May 31, 2019, 6:12pm

btw, this sort of shows why my tolerance is 0. Although the end results would have been similar in this case, if we changed the model to something like:

target += log_mix(theta,
                     normal_lpdf(y[n] | mu[1], sigma[1]),
                     student_t(y[n] | mu[2], sigma[2], 3));

the inference results would have been different.

avehtari · May 31, 2019, 6:24pm

If the code is same then the difference should be 0. If the code is correct, but with different order of floating point operations then Markov chains with identical seeds can diverge (different divergence than @betanalpha’s divergence diagnostic) and after that behave as independent chains with different seeds. In that case the estimate for distribution of difference of expectations from two chains (or sets of chains) can be computed using effective sample size and Monte Carlo standard estimates or by repeated simulations. There is no absolute threshold given for known number of leapforg steps. If the difference is very small but non-zero, that can be also due to luck and repeated runs would be needed to check the distribution of differences.

seantalts · May 31, 2019, 6:33pm

Can you say more about the steps required there? That sounds interesting.

Topic		Replies	Views
Stan Performance Test Developers	42	2375	March 28, 2018
Reproducibility across machines - hardware/C++ complier dependent? Modeling	2	472	June 29, 2019
Stan on computing cluster: strange results CmdStan	11	1613	June 8, 2018
Question about the Reproducibility of Stan Results Algorithms cmdstan , cmdstanr	6	1659	January 10, 2022
GPU GLMs: float or double Developers math	41	2172	July 30, 2019

How much precision should we expect in our fits?

Related topics