Speeding up and reducing divergences for a multivariate hierarchical stochastic factor model

Good luck!

But there x is data, so you’re just evaluating the likelihood. Integrating variables out is usually good for getting rid of divergences.

What are the dimensions of BB^T? And is \Sigma there constant along the diagonal (like it could be expressed as something times the identity matrix)?

Cheers, fingers crossed my submissions will work out!

So the dimensions are currently 10 \times 10 or 20 \times 20, but I’m hoping to scale it to 30 \times 30 if possible. I can make \Sigma constant along the diagonal, but currently I’m letting it differ for each dimension.

Oh well if these are just 30x30 matrices then that shouldn’t be too expensive. Definitely give this a go.

If you use the same covariance more than once, the multi_normal_cholesky form of the likelihood is really useful: https://mc-stan.org/docs/2_23/functions-reference/multi-normal-cholesky-fun.html
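For reference, a minimal sketch of that pattern, assuming T observations share a single covariance (the names, priors, and dimensions here are just placeholders):

```stan
data {
  int<lower=1> T;                  // observations sharing one covariance
  int<lower=1> K;                  // dimension of each observation
  vector[K] y[T];
}
parameters {
  vector[K] mu;
  cholesky_factor_corr[K] L_corr;  // Cholesky factor of the correlation matrix
  vector<lower=0>[K] tau;          // marginal scales
}
model {
  // build the Cholesky factor of the covariance once ...
  matrix[K, K] L_Sigma = diag_pre_multiply(tau, L_corr);
  mu ~ normal(0, 5);
  tau ~ normal(0, 1);
  L_corr ~ lkj_corr_cholesky(2);
  // ... and reuse it for every observation, so nothing is refactorised per y[t]
  y ~ multi_normal_cholesky(mu, L_Sigma);
}
```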

I guess if you mean the matrix had 30x30 columns, so it’s actually 900x900, that’s more of a problem lol.

Yeah, it’s tricky as I have to solve the Cholesky for a y x y matrix for each time step, so it’s relatively slow and I don’t ever get to reuse the same covariance >.<. I guess there’s no harm slowing things down slightly to get more accurate inference…

Does Stan have an approximate solver or something along those lines?

The diagonal being constant was the simplest trick I could think of. Maybe just code it up and give it a go? Make sure to find a way to test that you did the integrating-out correctly.
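Something along these lines is what I have in mind, an untested sketch assuming the model is y_t = B x_t + noise with x_t ~ normal(0, I), diagonal noise covariance \Sigma, and a loading matrix B that doesn’t change over time (all names are made up):

```stan
data {
  int<lower=1> T;              // time steps
  int<lower=1> D;              // observed dimension
  int<lower=1> K;              // number of latent factors
  vector[D] y[T];
}
parameters {
  matrix[D, K] B;              // factor loadings (real factor models usually
                               // need sign/ordering constraints for identifiability)
  vector<lower=0>[D] sigma_y;  // observation noise scales (diagonal Sigma)
}
model {
  // x_t ~ normal(0, I) integrated out analytically:
  // y_t ~ multi_normal(0, B * B' + diag(sigma_y^2))
  matrix[D, D] L = cholesky_decompose(add_diag(tcrossprod(B), square(sigma_y)));
  to_vector(B) ~ normal(0, 1);
  sigma_y ~ normal(0, 1);
  y ~ multi_normal_cholesky(rep_vector(0, D), L);
}
```

If the covariance changes each time step (as in your case), the cholesky_decompose call just moves inside a loop over t. A simple check that the marginalisation is right: simulate data, fit both the latent-x version and this one, and confirm they give the same posterior for B and sigma_y.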

I’m pretty hopeful since you say the integrate-out is possible!

It’s much slower, about 2x the time required, and the comparison plots are really similar…

I’m not sure what’s the right thing to do…

Well, if they’re really similar, that’s good, assuming the other version worked!

What about the effective sample sizes? Are they different?

And I wonder if you can get away from the hard constraint on the prior? Having posterior mass pushed up against the constraint is bad. That means the model is trying to move out into that region to fit the data, but you’re not allowing it.

Unless there are mathematical constraints that absolutely do not permit something to happen, we like to avoid hard prior constraints, because those regions can be opportunities to learn!
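Concretely, the kind of change I mean, with a made-up parameter name and prior scale:

```stan
parameters {
  // hard constraint: if the data want beta_s near 0 or 1, posterior mass
  // piles up against the boundary and sampling suffers
  // real<lower=0, upper=1> beta_s;

  // softer alternative: drop the bounds and encode the expectation
  // through a weakly informative prior instead
  real beta_s;
}
model {
  beta_s ~ normal(0.5, 0.5);  // gentle pull towards (0, 1), but the data can override it
}
```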

I had a geometry teacher in high school that called quizzes opportunities (I think they were to show off what you learned?), so perhaps that’s a warm bucket of spit situation.

I’m running it to check now. This time I’ve initialised between -1 and 1 to see how well it works.

The current prior for \beta_s is cauchy(0, 0.5). Do you mean the hard constraint on initialisation?

Oh I think I misunderstood. From this comment:

I assumed that was a uniform(0, 1) prior on beta with upper and lower constraints at 0 and 1, my bad. Ignore that.

Ah no worries. I’m surprised that using a beta prior doesn’t work better; it seems odd to me that the Normal/Cauchy family dominates in terms of performance…

Re ess_bulk, the average in the integrated-out case is 303, while for the non-integrated-out case it is 247. However, the individual numbers for each variable differ by quite a bit. What should I be looking out for?

Also, initialising between -0.5 and 0.5 improves the average ess_bulk for the non-integrated out case to 392.459. I’m not sure what more can be done…

Look at the lowest Neff (presumably the lowest is the limiting factor for the inferences). The higher Neffs can do whatever – it probably doesn’t matter. The hope would be the lowest Neff would improve enough to make up for the 2x slowdown.

Wow, unexpected.

This is doing an integrate out trick: https://statmodeling.stat.columbia.edu/2020/06/10/hamiltonian-monte-carlo-using-an-adjoint-differentiated-laplace-approximation/

Looking at option b/c here, I wonder if there’s a quick way to compute the determinant of your system? Cause the B matrices in the B B^T terms have more rows than columns?
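In case it’s useful: if B is D \times K with K much smaller than D and \Sigma is diagonal, the matrix determinant lemma and the Woodbury identity give

\det(B B^T + \Sigma) = \det(\Sigma) \, \det(I_K + B^T \Sigma^{-1} B)

(B B^T + \Sigma)^{-1} = \Sigma^{-1} - \Sigma^{-1} B (I_K + B^T \Sigma^{-1} B)^{-1} B^T \Sigma^{-1}

so the only matrix that actually needs factorising is the K \times K matrix I_K + B^T \Sigma^{-1} B, and \Sigma^{-1} and \det(\Sigma) are trivial because \Sigma is diagonal.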

Maybe there’s a way to write your likelihoods faster. You need a determinant and a solve. There’s this which I think the Laplace thing linked at the top is using.
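Here’s an untested sketch of a custom lpdf along those lines, using the identities above; the function name and arguments are made up, and it assumes a zero-mean y and a diagonal \Sigma with variances sigma2:

```stan
functions {
  // log density of y ~ multi_normal(0, B * B' + diag_matrix(sigma2)),
  // computed with the Woodbury identity and the matrix determinant lemma
  // so that only a K x K matrix is factorised.
  real lowrank_normal_lpdf(vector y, matrix B, vector sigma2) {
    int D = rows(B);
    int K = cols(B);
    vector[D] z = y ./ sigma2;                              // Sigma^{-1} y
    matrix[D, K] SinvB = diag_pre_multiply(inv(sigma2), B); // Sigma^{-1} B
    matrix[K, K] M = add_diag(B' * SinvB, 1.0);             // I_K + B' Sigma^{-1} B
    matrix[K, K] L = cholesky_decompose(M);
    vector[K] w = mdivide_left_tri_low(L, B' * z);
    // quadratic form y' (B B' + Sigma)^{-1} y via Woodbury
    real quad = dot_product(y, z) - dot_self(w);
    // log determinant via the matrix determinant lemma
    real log_det = sum(log(sigma2)) + 2 * sum(log(diagonal(L)));
    return -0.5 * (quad + log_det + D * log(2 * pi()));
  }
}
```

Inside the loop over time steps it would be used as target += lowrank_normal_lpdf(y[t] | B, sigma2); with whatever B and sigma2 apply at step t. No claim this is faster for your exact model, but it avoids the D \times D Cholesky entirely.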

The lowest Neff (ess_basic, I assume, in posterior) for the non-integrated-out and integrated-out cases are 7 and 12 respectively (for \Sigma_X), so integrating out is definitely not worth the 2x computational time.

Since we can obtain the integrated-out form directly, the main benefit of what you mentioned lies in being able to compute gradients quickly, right?

Huh, with Neffs that small I assume there must be a multimodality somewhere.

Maybe it would make sense to pin down the entries of \Sigma_x to diagnose what’s going on?

The Neff output should be per element of the matrix. Like sigma[3, 5] or whatever. Have a look at a plot of the marginal posteriors for different chains (or look at them in shinystan).

This might be one of those multimodalities that doesn’t really matter (different, equivalent modes), or maybe not.

It took a while to run the comparisons as it takes a day or two for the runs to finish. It seems to me that the integrated variant ensures that I don’t get any divergences, compared to the non-integrated variant, which does diverge. From a random sample of the posterior kernel density plots across draws, the beta scale seems to have very thick tails and I don’t see convergence yet, while I see multimodality elsewhere for beta_sd and x_sd.

Does this mean that I should be running MCMC for longer when I see thick tails? Also, what can I do to fix the multimodalities?

Not sure if this has been mentioned (didn’t read all posts!). I’ve had good success using VI with the meanfield option when fitting multivariate factor models with multivariate_normal data. It’s typically very fast and converges relatively quickly. The parameters are recovered nicely, though be cautious about interpreting the variances of the estimates, as VI is known to underestimate those. You may want to explore this option.