Cheers, fingers-crossed my submissions will work out!
So the dimensions are currently 10 \times 10 or 20 \times 20, but I'm hoping to scale it to 30 \times 30 if possible. I can make it constant, but currently I'm letting it differ for each dimension.
Yeah, it's tricky, as I have to compute the Cholesky of a y \times y matrix for each time step, so it's relatively slow and I never get to reuse the same covariance >.<. I guess there's no harm slowing things down slightly to get more accurate inference…
Does Stan have an approximate solver, or something along those lines?
The diagonal being constant was the simplest trick I could think of. Maybe just code it up and give it a go? Make sure to find a way to test that you did the integrating out correctly.
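For what it's worth, one generic way to check an integrate-out step is to compare the analytic marginal density against a brute-force Monte Carlo average of the conditional density over the latent's prior. Here is a minimal sketch with a toy scalar model (the names and numbers are illustrative, not the model in this thread):

```python
# Toy check of an integrate-out step: analytic marginal vs Monte Carlo average.
# The model here (x ~ N(0, sigma_x^2), y | x ~ N(x, sigma_y^2)) is purely
# illustrative; its marginal is y ~ N(0, sigma_x^2 + sigma_y^2).
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
sigma_x, sigma_y = 1.3, 0.7          # hypothetical latent and noise scales
y = 0.9                              # a test observation

# Analytic marginal after integrating out x
analytic = norm.pdf(y, 0.0, np.sqrt(sigma_x**2 + sigma_y**2))

# Brute-force check: E_x[ p(y | x) ] with x drawn from its prior
x_draws = rng.normal(0.0, sigma_x, size=200_000)
monte_carlo = norm.pdf(y, x_draws, sigma_y).mean()

print(analytic, monte_carlo)         # should agree to a few decimal places
```

The same idea carries over to the multivariate case, just with a multivariate density and more draws.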
Well, if it's really similar, that's good, assuming the other thing worked!
What about the effective sample sizes? Are they different?
And I wonder if you can get away from the hard constraint in the prior? Having posterior mass pushed up against the constraint is bad. It means the model is trying to move out into that region to fit the data, but you're not allowing it.
Unless there are mathematical constraints that absolutely do not permit something to happen, we like to avoid prior constraints, because leaving those regions open can be an opportunity to learn!
I had a geometry teacher in high school who called quizzes "opportunities" (I think they were opportunities to show off what you learned?), so perhaps that's a warm-bucket-of-spit situation.
Ah, no worries. I'm surprised that using a beta prior doesn't work better; it seems odd to me that the Normal/Cauchy family dominates in terms of performance…
Re ess_bulk, the average in the integrated-out case is 303, while for the non-integrated-out case it is 247. However, the individual numbers for each variable differ by quite a bit. What should I be looking out for?
Also, initialising between -0.5 and 0.5 improves the average ess_bulk for the non-integrated-out case to 392.459. I'm not sure what more can be done…
Look at the lowest Neff (presumably the lowest is the limiting factor for the inferences). The higher Neffs can do whatever – it probably doesn’t matter. The hope would be the lowest Neff would improve enough to make up for the 2x slowdown.
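As a quick sketch of that (ArviZ here, with its bundled centered_eight example standing in for your own fit; I believe posterior's summarise_draws in R would give you the same numbers):

```python
# Hedged ArviZ sketch: per-parameter bulk ESS, sorted so the limiting
# (lowest-ESS) elements come first.
import arviz as az

idata = az.load_arviz_data("centered_eight")    # stand-in for your own fit
summ = az.summary(idata)                        # includes ess_bulk and ess_tail columns
print(summ.sort_values("ess_bulk").head(10))    # the lowest-ESS parameters
```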
Looking at option b/c here, I wonder if there's a quick way to compute the determinant of your system, since the B matrices in the B B^T terms have more rows than columns?
Maybe there's a way to write your likelihoods faster. You need a determinant and a solve. There's this, which I think the Laplace thing linked at the top is using.
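If the covariance really does have the form \Sigma = D + B B^T with a diagonal D and a tall-skinny B (that structure is my assumption here, not something stated above), the matrix determinant lemma and the Woodbury identity reduce the n \times n determinant and solve to a k \times k one. A NumPy sketch of the idea:

```python
# Hedged sketch: log-determinant and solve for Sigma = D + B B^T using only a
# k x k core matrix, checked against the brute-force n x n versions.
import numpy as np

rng = np.random.default_rng(0)
n, k = 200, 5                              # illustrative sizes, k << n
d = rng.uniform(0.5, 2.0, size=n)          # diagonal of D
B = rng.normal(size=(n, k))                # tall-skinny matrix
y = rng.normal(size=n)

# Small core matrix: M = I_k + B^T D^{-1} B
M = np.eye(k) + B.T @ (B / d[:, None])

# Matrix determinant lemma: log|Sigma| = log|M| + log|D|
logdet_fast = np.linalg.slogdet(M)[1] + np.sum(np.log(d))

# Woodbury identity: Sigma^{-1} y = D^{-1} y - D^{-1} B M^{-1} B^T D^{-1} y
Dinv_y = y / d
solve_fast = Dinv_y - (B / d[:, None]) @ np.linalg.solve(M, B.T @ Dinv_y)

# Check against the direct n x n computation
Sigma = np.diag(d) + B @ B.T
assert np.isclose(logdet_fast, np.linalg.slogdet(Sigma)[1])
assert np.allclose(solve_fast, np.linalg.solve(Sigma, y))
```

The same algebra can be written out directly in a Stan model block, so the autodiff should only ever see the k \times k solve.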
The lowest Neffs (ess_basic in the posterior package, I assume) for the non-integrated-out and integrated-out cases are 7 and 12 (for \Sigma_X), so integrating out is definitely not worth the 2x computational time.
Since we can obtain the integrated-out form directly, the main benefit of what you mentioned lies in being able to compute gradients quickly, right?
The Neff output should be per element of the matrix. Like sigma[3, 5] or whatever. Have a look at a plot of the marginal posteriors for different chains (or look at them in shinystan).
This might be one of those multimodalities that doesn’t really matter (different, equivalent modes), or maybe not.
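In case it helps, a small ArviZ sketch of that per-chain check (the variable name and example dataset are placeholders for your own fit):

```python
# Hedged sketch: one KDE per chain in the left-hand trace-plot panels.
# Chains sitting on different but equivalent modes show up as per-chain
# densities that are shifted copies of each other.
import arviz as az
import matplotlib.pyplot as plt

idata = az.load_arviz_data("centered_eight")       # stand-in for your own fit
az.plot_trace(idata, var_names=["tau"], compact=False)
plt.show()
```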
It took a while to run the comparisons, as it takes a day or two for the runs to finish. It seems to me that the integrated variant ensures that I don't get any divergences, unlike the non-integrated variant. From a random sampling of the posterior kernel density plots across draws, the beta scale seems to have very thick tails and I don't see convergence yet, while I see multimodality elsewhere for beta_sd and x_sd.
Does this mean that I should be running MCMC for longer when I see thick tails? Also, what can I do to fix the multimodalities?
Not sure if this has been mentioned (didn't read all posts!). I've had good success using VI with the meanfield option when fitting multivariate factor models with multivariate_normal data. It's typically very fast and converges relatively quickly. The parameters are recovered nicely, though be cautious about interpreting the variance of the estimates, as VI is known to underestimate those. You may want to explore this option.
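If you want to try it, a minimal CmdStanPy sketch of running meanfield ADVI (the file names below are placeholders, not anything from this thread):

```python
# Hedged sketch: meanfield ADVI via CmdStanPy; swap in your own .stan file
# and data. The meanfield algorithm fits a fully factorised Gaussian, which
# is what tends to underestimate posterior variances.
from cmdstanpy import CmdStanModel

model = CmdStanModel(stan_file="factor_model.stan")   # placeholder model file
vb_fit = model.variational(
    data="factor_model_data.json",                    # placeholder data file
    algorithm="meanfield",
    seed=1,
)
print(vb_fit.variational_params_pd)                   # approximate posterior estimates
```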