Cholesky decomposition/speed/divergencies


#1

Hi,

I have tried to compare efficiency of the same hierarchical model but one with cholesky decomposition tweak while other using multi-normal. Agains expectations the multi-normal version shaved off 1000s per 2000 iterations and gave no divergencies while cholesky version was slover and gave 45 divergencies. Is this expected, what version is generally recommended - manual recommends cholecky but my results are contradictory. Perhaps I had to few parameters and lots of data. Both models are attached for those who may be interested.

cholesky version doa3.stan (2.8 KB)

multi-normal version doa5.stan (2.5 KB)


#2

Looks like you have the generated quantities set up to check if everything is working out the same between the two models – is it?

It’s not obvious to me what’s going wrong. Only thing is the centered vs. non-centered parameterizations are two different models, so they can behave differently. Have you looked around in Shinystan or the Stan pairplots to see if you can see which parameters the divergences correspond to?

Since K = 3 in the example here, maybe the solves are just easy enough that the reparameterization isn’t necessary? That doesn’t seem like a good explanation though.


#3

Sorry for using incorrect expression. Yes, I understand that those are two different models with overlapping set of parameters.

As far as I can tell both models produce nearly identical estimates.

I have difficulty understanding how to trace divergent parameter using pairs(). Below are pairs plots for random effects.

image

The same pairs for centered model:

image


#4

Ooops, sorry for the slow response, I was hoping discourse would send me an e-mail but I must have my settings messed up.

Can you try a version of this model where you use multi_normal_cholesky in place of multi_normal?

This is a version of the multivariate normal pdf that takes in the Cholesky factored matrix instead of a full one (this is still the centered parameterization – all we’re doing is changing how the computation works a little). If this works, then we can say it’s the centered vs. non-centered parameterization that is causing the problem. If it has divergences, then maybe there’s a bug somewhere in the code.

Should be able to change:

beta_unit ~ multi_normal(zero_unit, quad_form_diag(Omega_unit, tau_unit));

to (and redefine things in terms of L_Omega_unit instead of Omega_unit):

beta_unit ~ multi_normal_cholesky(zero_unit, diag_pre_multiply(tau_unit,L_Omega_unit));

#5

There were no divergencies. Time went down by 500s (some improvement with multi-normal 4800s).


#6

Hmm, I guess it’s just the difference in parameterizations. If you switch the lot and unit variables one at a time, are the divergences associated with one set of parameters more than the other?

I don’t think there’s any magic to finding divergences other than making tons of pairplots and looking if they correlate to parameters (http://mc-stan.org/users/documentation/case-studies/divergences_and_bias.html). You might be able to write a quick script to help you dig through the plots, but it’s pretty labor intensive process that might not end up telling you anything haha.

And when you look at pairplots, look at mixes of different parameters too. Your divergences could come from a coupling between sigma and beta_lot, or sigma and beta_unit, or some combination of different things. Maybe there’s a non-identifiability that only becomes an issue in the non-centered parameterization for some weird reason.


#7

The Cholesky form should be more efficient and more stable. That’s because if you declare a covariance matrix, it uses a Cholesky factor parameterization on the unconstrained scale. Then it has to reconstruct the covariance matrix and then factor once more when calling multi_normal. Using Cholesky throughout cuts out the middle layer.

But as @bbbales2 says, you’re looking at two different models as far as the sampling algorithm is concerned—it works only over the unconstrained parameters.


#8

Thanks. Any idea why ncp gives divergencies why cp doesn’t? Also I found that in some instances ncp gives two modes and it is very difficult to distinguish them - this one (below) I found by accident and those modes are not clearly separated for different individuals.


#9

Michael Betancourt’s arXiv paper on hierarchical HMC is the best place to look.

When there’s no data, you get a funnel shape for a hierarchical posterior in the centered parameteriation and an isotropic posterior with the non-centered. When the data highly constrains the posterior (either there’s a lot of it or it’s very informative), then it’s the other way around.

The multiple modes may just be an artifact of not mixing well. It’s easy for the sampler to get stuck in the neck of a funnel-like posterior.


#10

I bet you have suggested somewhere to vectorize weibull. I did some but wonder if some for loops could be replaced with vectorized form (doa5.stan (3.3 KB)). Also I am missing vectorized form of power function.

Friend of mine likes WinBUGS and equivalent model with Wishart priors outperforms stan quite a bit (40,000 iterations run 10h while in stan 1,000 iterations takes 1h). Results are nearly identical. He runs 3 chains (total 120,000 draws) while rstan chokes with memory problems. I guess I have to use cmdstan for this purpose. But one of his estimate (namely lot to lot variability derived from 6 lots) has much higher uncertainty. I wonder if it is related to Wishart prior. For lot2lot corr matrix I was using lkj_corr_cholesky(2) and then tried lkj_corr_cholesky(20). He uses dwish(I[ , ],4). So my questions are why there is such performance gap between stan and WinBUGS and what prior better corresponds to dwish(I[ , ],4).


#11

performance gap

I think to get an idea of the actual performance gap you’d need to compute some sort of number of effective (n_eff) samples for both sequences of samples (and compute that n_eff the same way – there are a bunch of options). I know there’s a section on this is Andrew’s book (BDA3), but try Googling around for it too.

He uses dwish(I[ , ],4)

Stan has Wishart distributions (and inverse Wishart distributions). One of those is probably what you want. Don’t just go by the names though. Check the actual parameterization to make sure you’re using the right one.


#12

It was in the Stan manual first. I’m pretty sure it was Andrew’s idea to use cross-chain n_eff

We use the standard naming convention as does BUGS/JAGS/R, so dwish is going to be our Wishart. We don’t have Cholesky-factor parameterizations of those, though, at least not yet. We tend to use the scaled LKJ priors for reasons described in @bgoodri’s paper: