How to think about QR decomposition with Y-dependent X effects

Taking a design matrix X and QR-orthonormalizing it is a standard and highly effective technique for improving sampling in the presence of collinearities among predictor variables. For the problem I’m about to describe, my current solution for isolating a prior for a certain parameter is to hold a column of X out of the QR process, but for most parameters I specify priors through contrasts, e.g., I put a prior on the treatment effect at age=50 when age interacts with treatment. I’d like a more general solution that allows QR to be used across all columns of X.
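
For concreteness, here is a minimal R sketch of the kind of QR pre-processing I mean (all data and coefficient values are made up; the point is that sampling happens on the Q-scale coefficients and the original-scale coefficients are recovered afterwards):

```r
# Minimal sketch of the thin-QR reparameterization; all data are simulated.
set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- 0.9 * x1 + 0.1 * rnorm(n)   # deliberately collinear with x1
X  <- cbind(x1, x2)

qr_obj <- qr(X)
Q_star <- qr.Q(qr_obj) * sqrt(n - 1)   # orthogonal columns, rescaled
R_star <- qr.R(qr_obj) / sqrt(n - 1)

# The sampler works with theta, the coefficients of Q_star.
# Original-scale coefficients are recovered as beta = R_star^{-1} theta.
theta <- c(0.5, -0.3)                  # pretend these came from the posterior
beta  <- backsolve(R_star, theta)
```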

This question is focused on models for which you might say that a subset of the Xs interacts with Y, in the sense of a partial proportional odds model that allows some of the Xs to not act in proportional odds, or a Cox proportional hazards model with time-dependent covariates that allows an X to not act in proportional hazards.

For example, a partial proportional odds model with two predictors and an ordinal response Y with levels 0, 1, 2 may be stated as follows, where X_2 is not assumed to be in proportional odds:

\Pr(Y \geq j | X) = \text{expit}(\alpha_{j} + \beta_{1}X_{1} + \beta_{2}X_{2} + \tau[j=2]X_{2}),

where j \in \{1,2\}, [\cdot] is a 0/1 indicator, and expit is the inverse logit transform.
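
To make the formula concrete, here is how the three category probabilities would be evaluated for one observation (every numeric value below is invented purely for illustration):

```r
# Illustrative evaluation of the partial proportional odds formula above;
# all parameter and covariate values are made up.
expit <- plogis
alpha <- c(0.4, -1.1)   # alpha_1, alpha_2
beta  <- c(0.8, -0.5)   # beta_1, beta_2
tau   <- 0.6            # extra effect of X_2 at j = 2 (departure from PO)
x1 <- 1.0
x2 <- 0.5

p_ge1 <- expit(alpha[1] + beta[1] * x1 + beta[2] * x2)             # Pr(Y >= 1)
p_ge2 <- expit(alpha[2] + beta[1] * x1 + beta[2] * x2 + tau * x2)  # Pr(Y >= 2)

probs <- c("Y=0" = 1 - p_ge1, "Y=1" = p_ge1 - p_ge2, "Y=2" = p_ge2)
```

The likelihood contribution of an observation with Y = y is the corresponding cell probability, which is why the \tau term has to enter the likelihood carefully rather than as an ordinary column of X.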

Just as with the Cox model with a time-dependent covariate, the likelihood needs to be constructed carefully when dealing with X \times Y “interactions”.

This raises the question of how to use QR pre-processing. If X_1 and X_2 do not have a correlation of exactly 0.0, the QR-transformed columns of X will convert a one-parameter departure from proportional odds into a 2-parameter partial proportional odds component. Perhaps inverse-transforming back to the original X space would collapse the \tau-like effect into a single parameter again, but I can’t wrap my head around this.
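
One way to make that concern concrete (a toy illustration, not tied to any particular model fit): with correlated X_1 and X_2, the column X_2 is a linear combination of both columns of Q, so a single \tau multiplying X_2 turns into two Q-scale parameters.

```r
# Toy illustration: X_2 loads on both columns of Q when x1 and x2 are not orthogonal.
set.seed(2)
n  <- 100
x1 <- rnorm(n)
x2 <- 0.7 * x1 + rnorm(n, sd = 0.5)
X  <- cbind(x1, x2)

qr_obj <- qr(X)
Q <- qr.Q(qr_obj)
R <- qr.R(qr_obj)

# x2 = R[1, 2] * Q[, 1] + R[2, 2] * Q[, 2], and R[1, 2] is nonzero unless
# x1 and x2 are orthogonal, so the single tau on X_2 corresponds to the
# pair (tau * R[1, 2], tau * R[2, 2]) on the Q columns.
c(R[1, 2], R[2, 2])
```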

Is it the case that in such models we need to run QR separately on the regular X components and on the components that involve Y?

The general issue of having Y-dependent covariate effects ultimately needs to be addressed in Stan front-ends, and I’d also love to get @paul.buerkner's take on this.

I hope someone can offer even some wild thoughts about this. The only solution I can think of is to use QR on the subset of predictors that don’t have Y-dependencies, do a separate QR on the subset that do, and specify priors on contrasts separately for the two sets of parameters. This approach would unfortunately not allow cross-parameter-type contrasts. For example, it would allow an easy specification of a prior for the differential effect of treatment on Y>3 vs. Y>4, but would not allow priors on sex-specific differential treatment effects.

Hi, @harrelfe. I just added a proportional hazards model to the Stan doc, but I’m not good enough at regression to help with this question. Best to ping @andrewgelman, who doesn’t read these posts unless pinged (or send him a personal email and he’ll turn his reply into a blog post in six months).

What I don’t understand from the question is why specifying a transform on the X is going to constrain putting priors on the coefficients. But I think it may be related to a problem I’ve been worrying a bit about and meaning to ask Andrew or @avehtari or @bgoodri about (those are the regression ringers we can bring in from the Stan side, in addition to Paul, whom you already pinged!). I have been wondering whether QR decomposition, or any kind of standardization, uses the data twice. That is, we don’t know the true means, variances, and covariances of the covariates, just our sample estimates. With more data this should converge, but we only have finite amounts of data. I have no idea what the biasing effect of this might be.


Instead of QR-orthonormalizing, it would be better to use a mass matrix in HMC that makes the corresponding transformation but keeps your parameters interpretable and keeps it possible to use non-rotation-invariant priors, such as the various sparsity-assuming priors (e.g., RHS and R2D2) or whatever you like. I’m using R2D2-type priors more and more all the time, with great success.

If you don’t have a huge number of parameters, you could test running Stan with a dense mass matrix, which would learn the appropriate transformation during warmup. The dense mass matrix is not the default, since with a large number of parameters doing the transformation at every leapfrog step is time-consuming. A low-rank + diagonal mass matrix in Stan has been on my wishlist since @bbbales2 demonstrated the great speedup in [1905.11916] Selecting the Metric in Hamiltonian Monte Carlo.

There is a duality between this and keeping the data fixed while using data-adaptive priors. This can have a significant effect if the dataset is very small, but even with small data, if the priors are relatively weak, the effect is usually small. I know people who also cross-validate the standardization step just to be sure.


Thanks for your reply, Bob. I think that standardization/orthogonalization of X is a structural process that does not entail double dipping, because X is conditioned upon throughout. It’s true that there is volatility in the correlation and variance estimates for the covariates, which will make the QR decomposition give different results for different datasets, but by the time you reverse the QR transform to get back to the original data scale, those problems cancel out.

My worry about priors and messing with the design matrix X was restricted to the case where one or more of the columns of X has an effect on Y\geq y that depends on y (in an ordinal regression setting for ordinal Y). Say there is one y-dependent covariate that is collinear with 2 other covariates that are not y-dependent. QR may then create 3 y-dependent covariates that change the structure of the likelihood function.

Thank you very much, Aki. I am always learning something completely new from you guys. I have never studied mass matrices and would greatly appreciate a pointer to an introduction that explains the spirit of how mass matrices work in this type of collinearity context.

My temporary solution to the problem is to put a restriction on the types of contrasts for which I allow the user to specify priors. One set of contrasts would be for the non-y-dependent part of the model and another set would be strictly for the y-dependencies. In a partial proportional odds model that means I would have priors on the effects of variables at a single reference value of y, then have contrasts limiting the y-variation of effects. The first set restricts “main effect” odds ratios for covariate effects and the second restricts the amount of non-proportional odds (non-parallelism across y cutoffs). I had hoped to be able to define contrasts that encode double differences and not just differences. A double difference would be, for example, how different the effect of an age \times sex interaction is on Y\geq 3 vs. Y\geq 4. In the separate-contrast restricted approach I’ll use, one would only be able to put a prior on how different the age 30:40 effect is for different levels of Y.

Stan HMC algorithm parameters are briefly discussed in 15.2 HMC algorithm parameters | Stan Reference Manual. You can activate the dense mass matrix by including control=list(metric="dense_e") in RStan or metric="dense_e" in the CmdStanR arguments. Adaptation is automatic. The dense mass matrix scales and rotates the parameter space based on the posterior covariance estimated during warmup adaptation, which makes sampling easier for HMC. The dense mass matrix doesn’t help with nonlinear dependencies like funnels and bananas, but it is likely to help in the case of collinearity. If sampling with the dense mass matrix is slower than with the diagonal one, it is likely that you have so many parameters that the matrix-vector product in each HMC leapfrog step used to rotate the parameter space dominates the computation time.
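
For concreteness, here is a minimal sketch of how the option is passed in the two interfaces (the model file and data object names are placeholders):

```r
# CmdStanR (model file and data names are placeholders)
library(cmdstanr)
mod <- cmdstan_model("model.stan")
fit_dense <- mod$sample(data = stan_data, metric = "dense_e")

# RStan equivalent
library(rstan)
fit_dense_rstan <- stan("model.stan", data = stan_data,
                        control = list(metric = "dense_e"))
```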


Fantastic. Sounds like a situation for having a little sampling contest between QR and a dense mass matrix.

I ran one test using the R rmsb package with CmdStan for an ordinal logistic model with 56 intercepts and 23 other parameters, n=892. It took 6 seconds with the default sampler and 29 seconds with metric='dense_e'. This was with an orthogonalized design matrix in both cases, using 4 chains and 2000 total iterations each.

@avehtari do you have a comment on the execution time differences? Thanks.

Any difference in ESSs? Or in the number of leapfrog steps? If the design matrix is orthogonal, I guess there are no strong posterior correlations. dense_e has a higher computational cost per leapfrog step and is beneficial only if the number of leapfrog steps is reduced by reducing the correlation in the rotated space. If the model has some hierarchical structure leading to funnels, then rotation is not sufficient.
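
In case it helps, here is one way to pull those diagnostics out of a CmdStanR fit (just a sketch, assuming fit is the object returned by $sample()):

```r
# Assuming `fit` is a CmdStanR fit object.
library(posterior)

# Bulk and tail ESS per parameter
summarise_draws(fit$draws(), "ess_bulk", "ess_tail")

# Leapfrog steps per iteration
leapfrogs <- fit$sampler_diagnostics()[, , "n_leapfrog__"]
summary(as.vector(leapfrogs))
```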


ESS was the same for both. Will have to run again to look up leapfrogs.

For now the conclusion seems to be this: Keep doing QR because it’s very fast, and greatly speeds up default sampling.
