x0 ~ normal(-1, 1);
for (t in 1:(T-1))
  z[t+1] ~ normal(f(z[t], x0), sig);
data ~ normal(z, eps);

where HMC shows Rhat values near 1 for x0 & high Rhat values for z, but I only care about the x0 parameter.

Is it correct to assume that, because the Rhat of z is poor and the only link between data and x0 is z, the estimate of x0 is bad despite its own Rhat being near 1?

After rereading the section of BDA on Rhat (and a bunch of forum posts), the alternative interpretation I came up with is that running the chains longer would improve the estimate of z but not x0… but I can't make heads or tails of that given the dependencies between the variables in this model. Any words of wisdom are welcome.

It's possible that (for some model, not necessarily this one) the estimate of x0 is completely insensitive to the exact values of z and you can duck your head in shame and use these estimates.

I doubt it.

You likely have strong correlation between successive z's, as seen in the naively-written AR model in Stan. You could fix this by re-writing it as a non-centered variation (where the model is on z[t+1] - z[t]), but that depends on f. At least for AR(p) models there are a few implementations posted on the list already; for VAR(p) models I think @James_Savage posted something once… and for non-linear models I don't know.
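For concreteness, here is a minimal sketch of what a non-centered AR(1) looks like in Stan (the names `z_raw`, `phi`, and the assumption that `eps` is known data are illustrative, not from this thread):

```stan
data {
  int<lower=2> T;
  vector[T] y;
  real<lower=0> eps;  // observation noise, assumed known for this sketch
}
parameters {
  real<lower=-1, upper=1> phi;  // AR coefficient (illustrative)
  real<lower=0> sigma;
  vector[T] z_raw;              // standardized innovations, a priori iid N(0, 1)
}
transformed parameters {
  vector[T] z;
  // the correlated series is built deterministically from the iid innovations
  z[1] = sigma / sqrt(1 - square(phi)) * z_raw[1];  // stationary start
  for (t in 2:T)
    z[t] = phi * z[t - 1] + sigma * z_raw[t];
}
model {
  z_raw ~ std_normal();
  y ~ normal(z, eps);
}
```

The point of the rewrite is that the sampler moves in `z_raw` space, where the parameters are a priori independent, instead of in the strongly autocorrelated `z` space.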

x0 is completely insensitive to the exact values of z and you can duck your head in shame and use these estimates

Well, I'd prefer not to; from this I take it that all Rhats should be close to one for convergence (and thus a good estimate).

non-centered variation (where the model is on z[t+1] - z[t]) but that depends on f

Indeed, f is nonlinear here (actually an SDE), and mean(z) is not zero; still, shifting it won't reduce the correlation (think of a noisy oscillator: the autocorrelation in t is significant). Maybe I don't understand the non-centering thing, though.

(I'm continuing here, though outside the original topic.)

Yes, they're correlated for sure. I think, at this point, I'd need to estimate a generic (Volterra?) kernel and fit variations around that kernel.

If I understand correctly, for the AR you mentioned, that kernel would be {1, -1}, which is just a difference filter: it removes low-frequency components, which lowers the correlation between successive z's.

Or, maybe I can have just the Brownian path as the parameter, and z as a transformation?

parameters {
  real x0;
  vector[T] dWt;  // Brownian increments
  real z0;
}
transformed parameters {
  vector[T] z;
  z[1] = z0;
  for (t in 1:(T-1))
    z[t+1] = f(z[t], x0) + dWt[t];
}
model {
  x0 ~ normal(-1, 1);
  dWt ~ normal(0, sig);
  observed_data ~ normal(z, eps);
}

on the face of it, I guess those parameters should be far less correlated.

I recently did this approach for an SIR model and it worked, so I think it should work here as well. If you don't mind sharing what you get, I would like to see how well it works.

The problem is that if they're not, you don't know what you're missing and how large its effect would be. We often see this behavior in hierarchical models: the low-level parameters are estimated precisely, but the hierarchical parameters don't mix well. It often doesn't matter, but the only way to verify is to fit the whole thing properly.

And that indeed means reparameterization as @sakrejda pointed out.

I've found similar things in some of the more esoteric state-space models we use at work. Some geometries are just tough, and reparameterization is difficult. I know thinning is a bad word around here, but so long as you don't have pathological results it can help.

As a check that your estimation isn't behaving too poorly, you should make sure that you can recover known values of z simulated from the model. This method might help. I find with state-space models in particular that doing this gives you a lot of information about what priors you should be using, and that knowledge can help you get better inference.
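One way to sketch that check, under the model as written in this thread: simulate a synthetic path and observations in a `generated quantities` block (run e.g. with the fixed-parameter sampler), then fit the model to the simulated data and see whether the posterior recovers the known x0 and path. The `_sim` names are illustrative, and `T`, `x0`, `z0`, `sig`, `eps`, and the drift `f` are assumed supplied as data and a user-defined function:

```stan
generated quantities {
  vector[T] z_sim;  // simulated latent path
  vector[T] y_sim;  // simulated noisy observations
  z_sim[1] = z0;
  for (t in 1:(T - 1))
    z_sim[t + 1] = f(z_sim[t], x0) + normal_rng(0, sig);  // process noise
  for (t in 1:T)
    y_sim[t] = normal_rng(z_sim[t], eps);  // observation noise
}
```

If the posterior from fitting to `y_sim` doesn't cover the known x0, no amount of good-looking Rhat on the real data should be trusted.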

I set up a simulation and tried both strategies (shown below), which give identical results (w.r.t. Neff, Rhat, and PPC), but estimating the Brownian increments directly (with the SDE realization as a transformation thereof) is much slower. The core difference in code (full code here)

parameters {
  ...
  vector[T] zh;
}
model {
  ...
  for (t in 1:(T - 1))
    zh[t + 1] ~ normal(f(zh[t], mh, bh, dt), sqrt(dt) * sig);
  z ~ normal(zh, eps);
}

where the latter is much slower. Is this related to centering? I've seen terms like "non-centered" thrown around and still don't grasp concretely what the notion is or how to exploit it.

There's a paper by Michael Betancourt and Mark Girolami that explains centering vs. non-centering with some simple examples: https://arxiv.org/abs/1312.0906
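The hierarchical example that paper uses can be sketched in Stan roughly as follows (a minimal eight-schools-style fragment; `J`, `y`, and `sigma` are assumed declared in `data`, and the names are illustrative):

```stan
// Centered: theta is sampled directly, so theta and tau are
// strongly coupled a posteriori (the "funnel" geometry).
parameters {
  real mu;
  real<lower=0> tau;
  vector[J] theta;
}
model {
  theta ~ normal(mu, tau);
  y ~ normal(theta, sigma);
}
```

versus

```stan
// Non-centered: the sampler sees iid standard-normal theta_raw,
// and theta is reconstructed deterministically.
parameters {
  real mu;
  real<lower=0> tau;
  vector[J] theta_raw;
}
transformed parameters {
  vector[J] theta = mu + tau * theta_raw;
}
model {
  theta_raw ~ std_normal();
  y ~ normal(theta, sigma);
}
```

Which version samples better depends on how strongly the data constrain `theta`; neither is uniformly faster, which may be relevant to the slowdown observed above.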

I am not a statistician (IANAS?), but I think the point in that paper doesn't apply to dependencies between subsequent time points in a time-series model.

This has gotten off topic though; I'm going to pick Bob's earlier comment as the response to my question :)

Ah, I see what you mean, and in fact that's nearly what I did in the previous code snippets I posted, but I was puzzled about why the non-centered version was slower, despite it no longer having correlations between subsequent time points.

Also, in your code, it seems that z_std in the parameters section is supposed to be z_raw as in the model section.

Either way, the point of non-centering is to break dependence between specific parameters, so saying it's slower/faster without plotting specific parameters before and after is silly. Plot the specific parameters and see what it did.

Re-parameterization changes the space the sampler moves in, such that it is not moving in (z[t], z[t-1]) space; that's the whole point. Plot the space the sampler moves through in each case.