Speeding up and reducing divergences for a multivariate hierarchical stochastic factor model

Good luck!

But there x is data, so you’re just evaluating the likelihood. Integrating variables out is usually good for getting rid of divergences.

What are the dimensions of BB^T? And is \Sigma there constant along the diagonal (like it could be expressed as something times the identity matrix)?

Cheers, fingers crossed my submissions will work out!

So the dimensions are currently 10 \times 10 or 20 \times 20, but I’m hoping to scale it to 30 \times 30 if possible. I can make \Sigma constant along the diagonal, but currently I’m letting it differ for each dimension.

Oh well if these are just 30x30 matrices then that shouldn’t be too expensive. Definitely give this a go.

If you use the same covariance more than once, the multi_normal_cholesky form of the likelihood is really useful: https://mc-stan.org/docs/2_23/functions-reference/multi-normal-cholesky-fun.html
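For reference, a minimal sketch of that pattern, assuming T observations share a single covariance (the names, priors, and dimensions here are just placeholders):

```stan
data {
  int<lower=1> T;                  // observations sharing one covariance
  int<lower=1> K;                  // dimension of each observation
  vector[K] y[T];
}
parameters {
  vector[K] mu;
  cholesky_factor_corr[K] L_corr;  // Cholesky factor of the correlation matrix
  vector<lower=0>[K] tau;          // marginal scales
}
model {
  // build the Cholesky factor of the covariance once ...
  matrix[K, K] L_Sigma = diag_pre_multiply(tau, L_corr);
  mu ~ normal(0, 5);
  tau ~ normal(0, 1);
  L_corr ~ lkj_corr_cholesky(2);
  // ... and reuse it for every observation, so nothing is refactorised per y[t]
  y ~ multi_normal_cholesky(mu, L_Sigma);
}
```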

I guess if you mean the matrix had 30x30 columns, so it’s actually 900x900, that’s more of a problem lol.

Yeah, it’s tricky as I have to solve the Cholesky for a y x y matrix for each time step, so it’s relatively slow and I don’t ever get to reuse the same covariance >.<. I guess there’s no harm slowing things down slightly to get more accurate inference…

Does Stan have an approximate solver or something along those lines?

The diagonal being constant was the simplest trick I could think of. Maybe just code it up and give it a go? Make sure to find a way to test that you did the integrating-out correctly.
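Something along these lines is what I have in mind, an untested sketch assuming the model is y_t = B x_t + noise with x_t ~ normal(0, I), diagonal noise covariance \Sigma, and a loading matrix B that doesn’t change over time (all names are made up):

```stan
data {
  int<lower=1> T;              // time steps
  int<lower=1> D;              // observed dimension
  int<lower=1> K;              // number of latent factors
  vector[D] y[T];
}
parameters {
  matrix[D, K] B;              // factor loadings (real factor models usually
                               // need sign/ordering constraints for identifiability)
  vector<lower=0>[D] sigma_y;  // observation noise scales (diagonal Sigma)
}
model {
  // x_t ~ normal(0, I) integrated out analytically:
  // y_t ~ multi_normal(0, B * B' + diag(sigma_y^2))
  matrix[D, D] L = cholesky_decompose(add_diag(tcrossprod(B), square(sigma_y)));
  to_vector(B) ~ normal(0, 1);
  sigma_y ~ normal(0, 1);
  y ~ multi_normal_cholesky(rep_vector(0, D), L);
}
```

If the covariance changes each time step (as in your case), the cholesky_decompose call just moves inside a loop over t. A simple check that the marginalisation is right: simulate data, fit both the latent-x version and this one, and confirm they give the same posterior for B and sigma_y.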

I’m pretty hopeful since you say the integrate-out is possible!

It’s much slower, about 2x the time required, and the comparison plots are really similar…

I’m not sure what’s the right thing to do…

Well, if they’re really similar, that’s good, assuming the other version worked!

What about the effective sample sizes? Are they different?

And I wonder if you can get away from the hard constraint on the prior? Having posterior mass pushed up against the constraint is bad. That means the model is trying to move out into that region to fit the data, but you’re not allowing it.

Unless there are mathematical constraints that absolutely do not permit something to happen, we like to avoid hard prior constraints, because those regions can be opportunities to learn!
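Concretely, the kind of change I mean, with a made-up parameter name and prior scale:

```stan
parameters {
  // hard constraint: if the data want beta_s near 0 or 1, posterior mass
  // piles up against the boundary and sampling suffers
  // real<lower=0, upper=1> beta_s;

  // softer alternative: drop the bounds and encode the expectation
  // through a weakly informative prior instead
  real beta_s;
}
model {
  beta_s ~ normal(0.5, 0.5);  // gentle pull towards (0, 1), but the data can override it
}
```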

I had a geometry teacher in high school that called quizzes opportunities (I think they were to show off what you learned?), so perhaps that’s a warm bucket of spit situation.

I’m running it to check now. This time I’ve initialised between -1 and 1 to see how well it works.

The current prior for \beta_s is cauchy(0, 0.5). Do you mean the hard constraint on initialisation?

Oh I think I misunderstood. From this comment:

I assumed that was a uniform(0, 1) prior on beta with upper and lower constraints at 0 and 1, my bad. Ignore that.

Ah no worries. I’m surprised that using a beta prior doesn’t work better; it seems odd to me that the Normal/Cauchy family dominates in terms of performance…

Re ess_bulk, the average in the integrated-out case is 303, while for the non-integrated-out case it is 247. However, the individual numbers for each variable differ by quite a bit. What should I be looking out for?

Also, initialising between -0.5 and 0.5 improves the average ess_bulk for the non-integrated out case to 392.459. I’m not sure what more can be done…

Look at the lowest Neff (presumably the lowest is the limiting factor for the inferences). The higher Neffs can do whatever – it probably doesn’t matter. The hope would be the lowest Neff would improve enough to make up for the 2x slowdown.

Wow, unexpected.

This is doing an integrate out trick: https://statmodeling.stat.columbia.edu/2020/06/10/hamiltonian-monte-carlo-using-an-adjoint-differentiated-laplace-approximation/

Looking at option b/c here, I wonder if there’s a quick way to compute the determinant of your system? Cause the B matrices in the B B^T terms have more rows than columns?
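In case it’s useful: if B is D \times K with K much smaller than D and \Sigma is diagonal, the matrix determinant lemma and the Woodbury identity give

\det(B B^T + \Sigma) = \det(\Sigma) \, \det(I_K + B^T \Sigma^{-1} B)

(B B^T + \Sigma)^{-1} = \Sigma^{-1} - \Sigma^{-1} B (I_K + B^T \Sigma^{-1} B)^{-1} B^T \Sigma^{-1}

so the only matrix that actually needs factorising is the K \times K matrix I_K + B^T \Sigma^{-1} B, and \Sigma^{-1} and \det(\Sigma) are trivial because \Sigma is diagonal.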

Maybe there’s a way to write your likelihoods faster. You need a determinant and a solve. There’s this which I think the Laplace thing linked at the top is using.
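Here’s an untested sketch of a custom lpdf along those lines, using the identities above; the function name and arguments are made up, and it assumes a zero-mean y and a diagonal \Sigma with variances sigma2:

```stan
functions {
  // log density of y ~ multi_normal(0, B * B' + diag_matrix(sigma2)),
  // computed with the Woodbury identity and the matrix determinant lemma
  // so that only a K x K matrix is factorised.
  real lowrank_normal_lpdf(vector y, matrix B, vector sigma2) {
    int D = rows(B);
    int K = cols(B);
    vector[D] z = y ./ sigma2;                              // Sigma^{-1} y
    matrix[D, K] SinvB = diag_pre_multiply(inv(sigma2), B); // Sigma^{-1} B
    matrix[K, K] M = add_diag(B' * SinvB, 1.0);             // I_K + B' Sigma^{-1} B
    matrix[K, K] L = cholesky_decompose(M);
    vector[K] w = mdivide_left_tri_low(L, B' * z);
    // quadratic form y' (B B' + Sigma)^{-1} y via Woodbury
    real quad = dot_product(y, z) - dot_self(w);
    // log determinant via the matrix determinant lemma
    real log_det = sum(log(sigma2)) + 2 * sum(log(diagonal(L)));
    return -0.5 * (quad + log_det + D * log(2 * pi()));
  }
}
```

Inside the loop over time steps it would be used as target += lowrank_normal_lpdf(y[t] | B, sigma2); with whatever B and sigma2 apply at step t. No claim this is faster for your exact model, but it avoids the D \times D Cholesky entirely.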

The lowest Neff (ess_basic, I assume, in posterior) for the non-integrated-out and integrated-out cases are 7 and 12 respectively (for \Sigma_X), so integrating out is definitely not worth the 2x computational time.

Since we can obtain the integrated-out form directly, the main benefit of what you mentioned lies in being able to compute gradients quickly, right?

Huh, with Neffs that small I assume there must be a multimodality somewhere.

Maybe it would make sense to pin down the entries of \Sigma_x to diagnose what’s going on?

The Neff output should be per element of the matrix. Like sigma[3, 5] or whatever. Have a look at a plot of the marginal posteriors for different chains (or look at them in shinystan).

This might be one of those multimodalities that doesn’t really matter (different, equivalent modes), or maybe not.

It took a while to run the comparisons as it takes a day or two for the runs to finish. It seems to me that the integrated variant ensures that I don’t get any divergences, compared to the non-integrated variant, which does diverge. From a random sample of the posterior kernel density plots across draws, the beta scale seems to have very thick tails and I don’t see convergence yet, while I see multimodality elsewhere for beta_sd and x_sd.

Does this mean that I should be running MCMC for longer when I see thick tails? Also, what can I do to fix the multimodalities?

Not sure if this has been mentioned (didn’t read all posts!). I’ve had good success using VI with the meanfield option when fitting multivariate factor models with multivariate_normal data. It’s typically very fast and converges relatively quickly. The parameters are recovered nicely, though be cautious about interpreting the variances of the estimates, as VI is known to underestimate those. You may want to explore this option.