In the new compiler, an interesting case has come up where the code being generated is not exactly the same and the fits are off by something like 0.005 or so (on single-digit posterior means). Is this close enough? It seems like something of that order of magnitude could come up just through reordering floating point operations over the course of ~10k leapfrog steps. Anyone have thoughts on this? @Bob_Carpenter

Isn’t that a bug rather than just an interesting case?

On the same machine with the same compiler, if the model is the same, each transition should be the same. If you’ve intentionally generated different code for the numerical computation, then it might differ, but you’d really have to dig into what’s different to determine whether a posterior mean being off by 0.005 is a bug or just MCMC error.

Do you have a simple example? Maybe we can break down the code side-by-side to see what it is.

The question is basically “how far off is far enough to be a bug?” Your answer might be “they have to be exactly the same,” but that’s not the obvious answer to me when making changes to the system.

Agreed it’d be best to figure out what it is. I think it might be a waste of your time to dig into this specific example, though, given that I already know a bunch of things about the new compiler are still incorrect. But thanks for the offer.

I’m just trying to get a sense of what a reasonable precision target is, given that we are intentionally generating (slightly) different code to do the numerical computation.

Yup. It really depends on what you’re trying to do. I thought the first goal was to produce the same generated code (or at least functionally the same code). With that goal in mind, any deviation is a bug. If the goal is to produce similar code… that’s a lot harder to test. You have to think about it.

Why not post the example? We could get a head start on discussing strategies for hunting this down, or maybe I can actually spot a bug by eye.

Question 1: what part of the numerical computation is intentionally different?

I don’t mind posting it (it’s certainly not private); I just want to be mindful of your limited time, and I recognize that it’s pretty low-quality code right now. Here it is anyway: https://gist.github.com/seantalts/2eee4aaf7a835791b681d5091e43be1b . Compare with what stan2 emits: https://gist.github.com/seantalts/e2da343fea48c645b3e38060993a5044

One way in which things are different is that I am eschewing the writer and reader constrain/unconstrain helper functions in favor of direct Math lib calls.

Oh, and I also don’t want to derail this thread from its intended purpose: figuring out what acceptable precision tolerances are for significant changes.

Yup. I hear you. My personal tolerance (from a technical, dev point of view) is 0 if there’s no explanation. If there’s an explanation, the bounds don’t matter so much.

For the numerical computation, it really should match identically. We could make the existing generator match it, update the math functions to behave the same, or update the new generator to match. But if the numerical computation is off, then it’s really not computing the same thing for some reason. You should be able to see that with individual calls to whatever function you have.

For that example, what’s the original Stan code, and do you have some data?

Just a heads up… using CmdStan and diagnose mode, the log prob is very different. It looks like the gradients are the same, though.

This is how I ran it:

`./stan3 data file=low_dim_gauss_mix.data.R diagnose random seed=234`

Stan3:

```
TEST GRADIENT MODE
Log probability=-2342.17
param idx value model finite diff error
0 1.9997 67.5104 67.5104 -1.88828e-07
1 -1.84968 -12.3411 -12.3411 -3.59003e-06
2 0.0448746 143.026 143.026 2.227e-06
3 1.76805 -238.177 -238.177 5.30072e-07
4 0.580072 -329.885 -329.885 -3.68993e-07
```

Stan:

```
Log probability=-3261.11
param idx value model finite diff error
0 1.9997 67.5104 67.5104 -1.88828e-07
1 -1.84968 -12.3411 -12.3411 -3.59003e-06
2 0.0448746 143.026 143.026 2.227e-06
3 1.76805 -238.177 -238.177 5.30072e-07
4 0.580072 -329.885 -329.885 -3.68993e-07
```

Oh awesome, I didn’t know about that tool. Gives me some clues for what to dig into :D Thank you!

Found it. Stan3’s generated code is wrong.

In Stan:

```
target += log_mix(theta,
                  normal_lpdf(y[n] | mu[1], sigma[1]),
                  normal_lpdf(y[n] | mu[2], sigma[2]));
```

this needs to generate the call without the `propto__` template parameters for this to be properly normalized.

Stan (correct):

```
log_mix(theta, normal_log(get_base1(y, n, "y", 1), get_base1(mu, 1, "mu", 1), get_base1(sigma, 1, "sigma", 1)), normal_log(get_base1(y, n, "y", 1), get_base1(mu, 2, "mu", 1), get_base1(sigma, 2, "sigma", 1)));
```

Stan3 (incorrect):

```
log_mix(
    theta,
    normal_lpdf<propto__>(
        stan::model::rvalue(
            y,
            stan::model::cons_list(stan::model::index_uni(n),
                stan::model::nil_index_list()),
            "pretty printed e"),
        stan::model::rvalue(
            mu,
            stan::model::cons_list(stan::model::index_uni(1),
                stan::model::nil_index_list()),
            "pretty printed e"),
        stan::model::rvalue(
            sigma,
            stan::model::cons_list(stan::model::index_uni(1),
                stan::model::nil_index_list()),
            "pretty printed e")),
    normal_lpdf<propto__>(
        stan::model::rvalue(
            y,
            stan::model::cons_list(stan::model::index_uni(n),
                stan::model::nil_index_list()),
            "pretty printed e"),
        stan::model::rvalue(
            mu,
            stan::model::cons_list(stan::model::index_uni(2),
                stan::model::nil_index_list()),
            "pretty printed e"),
        stan::model::rvalue(
            sigma,
            stan::model::cons_list(stan::model::index_uni(2),
                stan::model::nil_index_list()),
            "pretty printed e"))));
```

If you removed the `<propto__>` or explicitly set it to `false`, it’d be correct.

Thanks! If you want to learn OCaml and help out with code gen, clearly we could use your discerning eye! :)

so… curiosity got the better of me. I removed the `<propto__>` from the generated header and verified that the output was identical. So it looks like that’s the only bug in the generated code for that model.

For timing, it looks like the Stan3-generated version is slightly faster than Stan. Very rough numbers; not really trying to be systematic about it yet:

- Stan 3: about 3.1 sec
- Stan: about 3.3 sec

Thanks :) The new compiler is still definitely cheating - not updating locations or running all of the same validation. But it also doesn’t have Matthijs and Ryan’s optimization module hooked up yet, either :)

I found the timing difference. It’s in the element access. This is something we can address immediately in Stan, and probably should.

In Stan3, I saw this pattern:

```
stan::model::rvalue(y, stan::model::cons_list(stan::model::index_uni(n), stan::model::nil_index_list()), "pretty printed e")
```

to access the nth element of `y`, where `y` is an Eigen vector.

In Stan, it’s currently:

```
get_base1(y, n, "y", 1)
```

My guess is that `get_base1` is slower at accessing that element.

btw, this sort of shows why my tolerance is 0. Although the end results would have been similar in this case, if we changed the model to something like:

```
target += log_mix(theta,
                  normal_lpdf(y[n] | mu[1], sigma[1]),
                  student_t_lpdf(y[n] | 3, mu[2], sigma[2]));
```

the inference results would have been different.

If the code is the same, then the difference should be 0. If the code is correct but uses a different order of floating-point operations, then Markov chains with identical seeds can diverge (a different kind of divergence than @betanalpha’s divergence diagnostic) and thereafter behave as independent chains with different seeds. In that case, the distribution of the difference in expectations between two chains (or sets of chains) can be estimated using effective sample sizes and Monte Carlo standard errors, or by repeated simulations. There is no absolute threshold for a given number of leapfrog steps. If the difference is very small but non-zero, that can also be due to luck, and repeated runs would be needed to check the distribution of differences.

Can you say more about the steps required there? That sounds interesting.