Failing to reproduce a Stan fit: is it a dependency issue?

I am trying to reproduce the Stan results in this Jupyter notebook (GitHub link).

Background

This is a PCA / factor-analysis model where we observe a data matrix Y = XW^T + \epsilon and want to recover the loadings matrix W. The loadings matrix is not rotationally identifiable (XW^T = (XQ)(WQ)^T for any orthogonal Q), so we should see a pattern of concentric circles in the posterior.

The authors ran this model on a synthetic data matrix Y with n=100 observations, d=4 dimensions, and 2 latent factors, so the loadings matrix W should be 4x2. They specified how Y was generated but did not post a seed, so I can only reproduce Y up to its distribution.
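For concreteness, here is roughly how I regenerate the data in NumPy. This is a sketch under my own assumptions (seed, noise scale, standard-normal factors and loadings), not the authors' exact code, since they did not post a seed:

```python
import numpy as np

rng = np.random.RandomState(0)   # arbitrary seed, not the authors'
n, d, q = 100, 4, 2

X = rng.randn(n, q)              # latent factors
W_true = rng.randn(d, q)         # true loadings, d x q
sigma = 1.0                      # assumed noise scale
Y = X @ W_true.T + sigma * rng.randn(n, d)

# Rotational nonidentifiability: replacing W with W Q (Q orthogonal) leaves
# the likelihood unchanged, since (X Q)(W Q)^T = X Q Q^T W^T = X W^T.
theta = 0.7
Q = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
assert np.allclose((X @ Q) @ (W_true @ Q).T, X @ W_true.T)
```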

The following figure shows their posterior of the matrix W, viewed as four vectors in R^2, for two models:

  1. The rotationally nonidentifiable PCA model, which corresponds to the concentric circles
  2. Their proposed modification of this model which removes the nonidentifiability. This corresponds to the dense blobs within each concentric ring. I’m not concerned with this part at the moment.

My question

Even using the authors’ code, I cannot reproduce the figure above. The model compiles and runs, but about 90% of the iterations hit the maximum tree depth of 10. (The authors’ notebook output shows no tree-depth warnings.) When I plot my posterior, I see the desired concentric circles:

However, when I break it out chain by chain, I see that there are large angular sections that each chain did not reach:

Variations of this issue show up no matter what I do: changing seeds, Stan versions, or computing environments. What are possible reasons that I have this treedepth/mixing problem where the authors did not? What could I try to fix it?

Certainly the authors had a specific set of dependencies: C++ toolchain, Eigen build, Stan version, etc. Could the discrepancy be due to that? (All I know is that they used PyStan 2.17.)

Here is the model:

data {
    int<lower=0> N;              // number of observations
    int<lower=1> D;              // observed dimension
    int<lower=1> Q;              // number of latent factors
    vector[D] Y[N];
}
parameters {
    matrix[D, Q] W;              // loadings
    real<lower=0> sigma_noise;
    vector[D] mu;
}
transformed parameters {
    cholesky_factor_cov[D] L;
    {
        // marginal covariance of Y: W W' + sigma^2 I (plus jitter for numerical stability)
        matrix[D, D] K = tcrossprod(W);
        for (d in 1:D)
            K[d, d] += square(sigma_noise) + 1e-14;
        L = cholesky_decompose(K);
    }
}
model {
    mu ~ normal(0, 10);
    sigma_noise ~ normal(0, 1);
    to_vector(W) ~ normal(0, 1);

    Y ~ multi_normal_cholesky(mu, L);
}
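For reference, here is roughly how I am fitting the model and counting saturated transitions with PyStan 2.17. The variable names (`model_code` holding the Stan program above, `n`, `d`, `q`, `Y` from the simulation sketch earlier) are mine, and the exact call in the notebook may differ:

```python
import pystan
import numpy as np

stan_data = dict(N=n, D=d, Q=q, Y=Y)
sm = pystan.StanModel(model_code=model_code)
fit = sm.sampling(data=stan_data, iter=2000, chains=4, seed=1,
                  control={'max_treedepth': 10})

# treedepth__ == max_treedepth means the trajectory was truncated
params = fit.get_sampler_params(inc_warmup=False)
for i, chain in enumerate(params):
    saturated = np.mean(chain['treedepth__'] >= 10)
    print('chain %d: %.0f%% of iterations hit max treedepth' % (i, 100 * saturated))
```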

Do they say there were no issues, or just not say anything about issues? Is it possible they simply ignored the warnings? I think this kind of posterior is exactly the sort that would generate tree-depth problems, and possibly all kinds of other problems.

You can’t start from the same initial weights, although you could eyeball their posterior and choose values close to theirs (up to the rotation) and see if that matters. Otherwise your results seem compatible with the notebook’s. One suggestion would be to try static HMC and see what the output looks like; I’m guessing the performance would be similar and the tree-depth warnings would disappear “by design” (though the underlying problems would not).
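Something along these lines should run static HMC from PyStan 2, reusing the `sm` and `stan_data` objects from your question; this is my best guess at the relevant arguments (in particular `int_time` for the fixed integration time), so double-check against the PyStan docs:

```python
# Static HMC: fixed integration time, so there is no tree to saturate,
# but poor mixing can still show up in other diagnostics.
fit_hmc = sm.sampling(data=stan_data, iter=2000, chains=4, seed=1,
                      algorithm='HMC',
                      control={'int_time': 1.0, 'adapt_engaged': True})
print(fit_hmc)
```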

I’d be surprised if it’s a dependency issue, but maybe the developers know of a reason why it might be.

As discussed in the Stan manual, any changes to the operating system, hardware, C++ toolchain, Stan version, etc. can, and likely will, cause the Markov chains realized from a given seed to change. If those Markov chains are well behaved then their equilibrium behavior will be the same (in particular, Markov chain Monte Carlo estimators will bounce around the same distribution centered on the exact expectation value), but bitwise reproducibility cannot be guaranteed. When the Markov chains are not so well behaved then reproducibility will be even worse.

For example, the adaptation of Stan’s sampler depends on the early states in a realized Markov chain. Some realized Markov chains might compromise the adaptation, for example by seeding the chains in particularly pathological neighborhoods or making excursions out to pathological neighborhoods early on, which then affects the behavior of the later states. The treedepth warnings and partial explorations seen in your fits suggest that you may have hit a bad adaptation that is limiting how well you can recover the posterior behavior.

Without more information from the authors of the original paper it’s not clear whether they also encountered this problem in their fit, whether they tuned the seed to avoid it, or whether they didn’t encounter it at all. You could try different seeds until the adaptation behavior improves (in addition to looking at treedepth warnings you can also dig into the sampler adaptation information, which includes both the step size and the inverse metric elements), or even reduce the range of the initial values to see whether that stabilizes the adaptation, and, if it does, whether you can recover what is shown in the paper.
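With PyStan 2 you can pull out the adapted step size and the adapted inverse metric per chain roughly like this (a sketch reusing the `fit`, `sm`, and `stan_data` objects from your question; the initial values in the narrowed-init run are arbitrary placeholders):

```python
import numpy as np

# adapted step size per chain, from the post-warmup sampler parameters
stepsizes = [chain['stepsize__'][-1]
             for chain in fit.get_sampler_params(inc_warmup=False)]
print('adapted step sizes per chain:', stepsizes)

# free-form adaptation report per chain, including the adapted inverse metric
for info in fit.get_adaptation_info():
    print(info)

# narrowing the initialization: explicit init dictionaries, one per chain
inits = [dict(W=np.zeros((4, 2)), sigma_noise=1.0, mu=np.zeros(4))
         for _ in range(4)]
fit_narrow = sm.sampling(data=stan_data, iter=2000, chains=4, seed=1, init=inits)
```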

Another issue could be with the simulated data itself. The proposed modification might be fragile in the sense that it works for some simulated data but not all, with the paper explicitly or implicitly selecting for one of the simulations for which the approach does work. Here all you can do is run over an ensemble of simulations, look for the desired behavior in any of the simulations, and if you see it then determine how stable/robust it might be. While more work this kind of analysis is critical for properly understanding a method.
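A rough sketch of that ensemble check, where `simulate_data` is a stand-in name for a helper wrapping the simulation code from your question and everything else reuses your fitted model object:

```python
import numpy as np

n, d, q = 100, 4, 2
results = []
for sim in range(20):
    Y_sim = simulate_data(seed=sim)   # hypothetical helper returning an n x d data matrix
    fit_sim = sm.sampling(data=dict(N=n, D=d, Q=q, Y=Y_sim),
                          iter=2000, chains=4, seed=sim)
    sp = fit_sim.get_sampler_params(inc_warmup=False)
    frac = np.mean([np.mean(c['treedepth__'] >= 10) for c in sp])
    results.append(frac)

print('fraction of saturated iterations per simulation:', results)
```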