Update: the problem seems to be almost exclusively the stochasticity of Rhat if the number of post-warmup samples is low.

Taking a very simple model:

```
data {
  int<lower=0> N;
  real y[N];
}
parameters {
  real mu;
}
model {
  mu ~ normal(0, 2);
  y ~ normal(mu, 1);
}
```

I simulate data exactly from this model.

Running 1000 fits with 1000 warmup iterations but 200 sampling iterations: 499 fits had Rhat > 1.01. Largest Rhat was 1.042.

Running 1000 fits with 200 warmup iterations but 1000 sampling iterations: all Rhats were < 1.01.
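Roughly, the refit loop looks like this (just a sketch using cmdstanr; the file name `model.stan`, the data size `N`, and storing the counts this way are arbitrary choices, not exactly what I ran):

```
library(cmdstanr)

mod <- cmdstan_model("model.stan")  # the model shown above
N <- 100          # arbitrary data size
n_fits <- 1000

n_high_rhat <- 0
for (i in 1:n_fits) {
  # Simulate data exactly from the model: mu ~ normal(0, 2), y ~ normal(mu, 1)
  mu_true <- rnorm(1, 0, 2)
  stan_data <- list(N = N, y = rnorm(N, mu_true, 1))
  fit <- mod$sample(data = stan_data, chains = 4, refresh = 0,
                    iter_warmup = 1000, iter_sampling = 200)
  if (fit$summary("mu")$rhat > 1.01) n_high_rhat <- n_high_rhat + 1
}
n_high_rhat  # number of fits flagged by the Rhat > 1.01 heuristic
```

For the second experiment, swap `iter_warmup` and `iter_sampling`.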

So it seems the model is able to adapt and warm up quickly, and the chains actually explore the same distribution in both cases; the Rhat estimate simply has larger variance with fewer samples (which sounds unsurprising).

I did check that 200 i.i.d. samples have unproblematic Rhat while autocorrelated samples can have problems, so it is probably the combination of autocorrelation and a low number of samples that is causing Rhat to give false positives.

## Code for simulation

I.i.d. samples:

```
n_iter <- 200
for(i in 1:10) {
  # 4 chains of i.i.d. standard normal draws
  rv <- posterior::rvar(array(rnorm(n_iter * 4), dim = c(n_iter, 4)),
                        with_chains = TRUE, nchains = 4)
  print(posterior::rhat(rv))
}
```

A representative result:

```
[1] 1.000692
[1] 1.001277
[1] 0.9993861
[1] 0.9969532
[1] 1.000892
[1] 1.001532
[1] 1.006309
[1] 1.001748
[1] 0.9996341
[1] 1.004682
```

Autocorrelated samples (a sliding-window linear combination of i.i.d. normals - hope that’s not particularly stupid):

```
n_iter_high <- 1000
n_iter_low <- 200
lag_coeffs <- rev(c(1, 0.8, 0.5))
N_sims <- 10
res <- array(NA_real_, dim = c(N_sims, 2),
             dimnames = list(NULL, paste0(c(n_iter_low, n_iter_high), "_iter")))
n_lags <- length(lag_coeffs)
for(i in 1:N_sims) {
  # Latent i.i.d. draws, with n_lags extra rows to initialize the sliding window
  draws_latent <- array(rnorm((n_iter_high + n_lags) * 4),
                        dim = c(n_iter_high + n_lags, 4))
  draws_observed <- array(NA_real_, dim = c(n_iter_high, 4))
  for(n in 1:n_iter_high) {
    for(c in 1:4) {
      # The sliding-window linear combination induces autocorrelation
      draws_observed[n, c] <- sum(draws_latent[n:(n + n_lags - 1), c] * lag_coeffs)
    }
  }
  var_high <- posterior::rvar(draws_observed, with_chains = TRUE, nchains = 4)
  # Last n_iter_low iterations of each chain (note the + 1 to get exactly n_iter_low rows)
  var_low <- posterior::rvar(draws_observed[(n_iter_high - n_iter_low + 1):n_iter_high, ],
                             with_chains = TRUE, nchains = 4)
  res[i, 1] <- posterior::rhat(var_low)
  res[i, 2] <- posterior::rhat(var_high)
}
res
```

Gives something like:

```
      200_iter 1000_iter
 [1,] 1.005455  1.001189
 [2,] 1.002999  1.000354
 [3,] 1.020892  1.003734
 [4,] 1.009019  1.003877
 [5,] 1.011578  1.002240
 [6,] 1.004152  1.001302
 [7,] 1.006214  1.001396
 [8,] 1.006232  1.001696
 [9,] 1.010586  1.002576
[10,] 1.010942  1.000831

I.e. the last 200 iterations of the same array produce substantially more high Rhats than the full 1000 iterations.

I forgot about this - yes, that could help partially. But given that I have many fits, each with an R* value, I still need to decide whether to raise a warning overall. Do you have an idea how one could pick a sensible threshold? Also, since R* is even more stochastic than Rhat, I would expect to see similar problems with short chains (but I haven’t checked yet).
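One way a threshold could be grounded is purely empirical calibration: estimate the null distribution of R* on i.i.d. draws with the same chain/iteration geometry as the fits, and warn only above an upper quantile. A rough sketch (assuming `posterior::rstar`, which needs the caret package; the number of simulations and the quantile level are arbitrary choices here):

```
# Null distribution of R* on i.i.d. draws with the same geometry
# as the fits (4 chains x 200 iterations each)
n_iter <- 200
n_chains <- 4
n_sims <- 50
rstar_null <- numeric(n_sims)
for (i in 1:n_sims) {
  arr <- array(rnorm(n_iter * n_chains),
               dim = c(n_iter, n_chains, 1),
               dimnames = list(NULL, NULL, "x"))
  rstar_null[i] <- posterior::rstar(posterior::as_draws_array(arr))
}
# Warn only if an observed R* exceeds, say, the 99% quantile of the null
threshold <- quantile(rstar_null, 0.99)
threshold
```

This would of course only calibrate against the i.i.d. case, not autocorrelation, so it may still be too strict for short chains.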