Cherry-picking Rhat values


#1

I have a model of the form

x0 ~ normal(-1, 1);
for (t in 1:(T-1))
    z[t+1] ~ normal(f(z[t], x0), sig);
data ~ normal(z, eps);

where HMC shows Rhat values near 1 for x0 & high Rhat values for z, but I only care about the x0 parameter.

Is it correct to assume that because the Rhat of z is poor, and that the only link between data and x0 is z, that the estimate of x0 is bad despite its Rhat being low?

After rereading the section of BDA on Rhat (and a bunch of forum posts), that alternative interpretation I thought of is it, running the chains longer would improve estimate of z, but not x0… but I can’t make heads or tails of that given the dependencies of the variables in this model. Any words of wisdom are welcome.


#2
  1. It’s possible that (for some model not necessarily this one) the estimate of x0 is completely insensitive to the exact values of z and you can duck your head in shame and use these esitmates.

  2. I doubt it.

  3. you likely have strong correlation between successive z’s as seen in the naively-written AR model in Stan. You could fix this by re-writing this as a non-centered variation (where the model is on z[t+1] - z[t]) but that depends on f. At least for AR§ there are a few implementations posted on the list already, for VAR§ models I think @James_Savage posted something once… and for non-linear models I don’t know.


#3

x0 is completely insensitive to the exact values of z and you can duck your head in shame and use these esitmates

well I’d prefer not; from this I take it that all Rhats should be close to one for purposes of convergence (and thus good estimate).

non-centered variation (where the model is on z[t+1] - z[t]) but that depends on f

indeed f is nonlinear (actually a SDE) here, and mean(z) is not zero; still, shifting it won’t reduce correlation (think noisy oscillator, autocorrelation in t is significant). Maybe I don’t understand the non-centered-ing thing though.


#4

Have you plotted the pair plots for some successive z’s? I think you’re firmly in the land of problem-specific solutions.


#5

(I’m continuing here though outside original topic)

Yes, they’re correlated for sure. I think, at this point, I’d need to estimate a generic (Volterra?) kernel and fit variations around that kernel.

If I understand correctly, for the AR you mentioned, that kernel would be {1, -1} which is just a difference filter, removing low frequency components, which lowers correlations between successive z’s.

Or, maybe I can have just the Brownian path as the parameter, and z as a transformation?

parameters {
    real x0;
    vector dWt[T];
    real z0;
}
transformed parameters {
    vector z[T];
    z[1] = z0;
    for (t in 1:(T-1))
        z[t+1] = f(z[t], x0) + dWt[t];
}
model {
    x0 ~ normal(-1, 1);
    dWt ~ normal(0, sig);
    observed_data ~ normal(z, eps);
}

on the face of it, I guess those parameters should be far less correlated.


#6

I recently did this approach for an SIR model and it worked so I think this
should work her as well. If you don’t mind sharing what you get I would
like to see how well it works.


#7

The problem is that if they’re not, you don’t know what you’re missing and how large its effect would be. We often see this behavior in hierarchical models—the low-level parameters are estimated precisely, but the hierarchical parameters don’t mix well. It often doesn’t matter, but the only way to verify is to fit the whole thing properly.

And that indeed means reparameterization as @sakrejda pointed out.


#8

I’ve found similar things in some of the more esoteric state-space models we use at work. Some geometries are just tough, and reparameterization is difficult. I know thinning is a bad word around here, but so long as you don’t have pathological results it can help.

As a check for whether your estimation isn’t behaving too poorly, you should make sure that you can recapture known values of z simulated from the model. This method might help. I find with state space models in particular that doing this gives you a lot of information about what values your priors you should be using, and that knowledge can help you get better inference.


#9

Thanks for the advice, on all accounts.

I setup a simulation and tried both strategies (shown below), which provide identical results (wrt. Neff, R_hat and PPC), but estimating the Brownian increments directly (w/ the SDE realization as a transformation thereof) is much slower. In code, the core difference in code (full code here)

parameters {
...
  vector[T] zh;
}
model {
...
  for (t in 1:(T - 1))
    zh[t + 1] ~ normal(f(zh[t], mh, bh, dt), sqrt(dt) * sig);
  z ~ normal(zh, eps);
}

vs

parameters {
...
  vector[T] dWt;
}
model {
...
  dWt ~ normal(0.0, sig);
  z ~ normal(f_loop(z0, mh, bh, dWt, dt), eps);
}

where the latter is much slower. Is this related to centering? I’ve seen terms like non-centered thrown around and still don’t yet grasp concrete what/how to exploit the notion.


#10

There’s a paper by Michael Betancourt and Mark Girolami that explains the centering vs. non-centering with some simple examples: https://arxiv.org/abs/1312.0906


#11

I am not a statistician (IANAS?), but I think the point in that paper doesn’t apply to dependencies between subsequent time points in a time series model.

This has gotten off topic though, I’m going to pick Bob’s earlier comment as the response to my question :)


#12

It works in the same way in that it breaks a prior dependency by reparameterizing.

In a simple case, suppose you have some kind of latent time series z with noise,

for (t in 2:T) {
  z[t] ~ normal(alpha * z[t - 1] + beta, sigma);
  y[t] ~ normal(z[t], tau);
}

or in vectorized form,

z[2:T] ~ normal(alpha * z[1:T - 1] + beta, sigma);
y ~ normal(z, tau);

Now you can see how to non-center z.

parameters {
  vector[T - 1] z_std;
  ...
model {
  z_raw ~ normal(0, 1);
  z[1] = mu1 + sigma1 * z_raw[1];  // first point gets its own prior
  for (t in 2:T)
    z[t] = z_raw[t] * sigma + alpha * z[t - 1] + beta;

#13

Ah I see what you mean, and in fact that’s nearly what I did in the previous code snippets I posted, but I was puzzled about why the non-centered version was slower, despite it no longer having correlations between subsequent time points.

Also, in your code, it seems that z_std in the parameters section is supposed to be z_raw as in the model section.


#14

Yes, I’ve been trying to get myself to swtich from the generic z_raw to the more specific z_std for standard normal variates.

The non-centered version can be slower if there’s a lot of data. Otherwise, it should be faster in an effective sample size per second basis.


#15

Either way the point of non centering is to break dependence between specific parameters so saying it’s slower/faster without plotting specific parameters before and after is silly. Plot the specific parameters and see what it did.


#16

This case has an SDE with strong nonlinear deterministic part, so there is by construction strong correlation between z[t] and z[t-1].

Thanks, I’ll look at the N eff / s metric before making performance statements again.


#17

Re-parameterization changes the space the sampler moves in s.t. it is not moving in z[t],z[t-1] space, that’s the whole point. You plot the space the sampler moves through in each case.