QR decomposition of parameters for multilevel regression?

Hi,
I have a multi level regression where I am trying to fit a set of data with a linear relationship between two sets of parameters. I’m continuing this over from my thread at Divergences and Quantifying HPD intervals, as my question is now different.

I have a model where I’m fitting some low level parameters to data for a set of experiments, let’s call these parameters k,D and H, each of which are N dimensional vectors fit to a set of x,y data (each of which has its own elements).
At a hierarchical level, the values of H should be dependent on k, according to the sampling statement:
H \sim \mathrm{Normal}(\alpha+\beta k,\sigma) where \alpha,\beta,\sigma are hyper parameters. \alpha is what I want to get an estimate for with this model.

Running this model just using this sampling statements results in divergences, especially when N is small, because I haven’t set a prior on the slope parameter \beta. Switching to the non centered parameterization as detailed here seems to remove these divergences, but causes the model to hit max tree depth.

If I use the QR decomposition on my k vector (treating it as a matrix) as outlined here then I no longer hit the maximum tree depth. It’s not clear that I’m sampling from the same distribution because each time I’m QR decomposing a different parameter matrix rather than a vector matrix, so I’m not sure the \theta parameter that I’m sampling from imposes any sort of consistent prior on \beta. However, my results haven’t varied significantly from my original model, so it seems to be working. Is it OK to perform this decomposition for a parameter vector in this way?

The way to think about this is to write out mathematically the difference in the two models.

Certainly it sounds different than the regular QR decomposition trick.

You could also try the model without the QR but a dense metric. If you’re using Rstan, search for ‘dense_e’ in the manual and you’ll find the docs on how to do that (should just be control = list(metric = "dense_e") but I haven’t used it in a while)

The QR decomposition of a vector is such that \mathbf{Q} = \mathbf{x} / R_{11|}, so it is just a rescaling. In rstanarm, we do QR decompositions of the \mathbf{X} matrix a lot (when there is more than one predictor) where the linear predictor is written in lme4 form: \boldsymbol{\eta} = \mathbf{X} \boldsymbol{\beta} + \mathbf{Z} \mathbf{b}.

I think you are right about the QR parameterization perhaps not being appropriate. The ‘dense_e’ metric with a non centered parameterization causes divergences, like with the centered parameterization. Simply using the non centered parameterization on its own is causing saturation of max treedepth.

Ultimately, the inferences that I draw from this in terms of my credible intervals on \alpha (which is what I am interested in) do not seem to change hugely between these models, except when the number of experiments is low (which is generally when the number of divergences is higher, probably due to the curvature of the posterior being higher). Perhaps I can run my centered model with a large number of experiments and my non centered model with a small number of experiments. I’m guessing the issue is that the NUTS algorithm just can’t find a suitable step size for these high curvature models.

Yeah if the only difference between your two models is one is non-centered and one is centered, and the centered one works better with big data and the non-centered with small data, this happens.

There’s two different problems that happen. Here’s code to plot the problem and examine it on the eight schools dataset:

eight_schools_centered.stan (334 Bytes) eight_schools_noncentered.stan (357 Bytes)

library(rstan)

data_low = list(J = 8,
            y = c(28,  8, -3,  7, -1,  1, 18, 12),
            sigma = c(15, 10, 16, 11,  9, 11, 10, 18))

# really small measurement errors (small sigma) is like having lots of data
data_high = list(J = 8,
            y = c(28,  8, -3,  7, -1,  1, 18, 12),
            sigma = 0.01 * c(15, 10, 16, 11,  9, 11, 10, 18))


cmod = stan_model("eight_schools_centered.stan")
ncmod = stan_model("eight_schools_noncentered.stan")

fit_clow = sampling(cmod, data_low) # centered with low data, divergences
fit_chigh = sampling(cmod, data_high) # centered with lots of data, good

fit_nclow = sampling(ncmod, data_low) # non-centered low data, good
fit_nchigh = sampling(ncmod, data_high) # non-centered high data, treedepth problems

# plot log_tau instead of tau because log_tau is what the sampler sees
pairs(fit_clow, pars = c("theta[1]", "theta[3]", "mu", "log_tau"))
pairs(fit_chigh, pars = c("theta[1]", "theta[3]", "mu", "log_tau"))

# In the non-centered case the sampler works in z not in theta
pairs(fit_nclow, pars = c("z[1]", "z[3]", "mu", "log_tau"))
pairs(fit_nchigh, pars = c("z[1]", "z[3]", "mu", "log_tau"))