Linear Regression with Highly Correlated Covariates

I’m posting this in case anyone finds it useful…

I was getting some funky results with Stan until I eventually simplified my problem to the point that I realized the issue was linear regression with highly correlated covariates.

In particular, I was trying to fit an error correction model (ECM), which is a time series model that extends a vector autoregression (VAR) in differences to account for long-term trends. My issue was in the long-term trend part of the ECM. Basically, I was regressing a variable on two covariates that were highly correlated (over 99%). When I checked the pairs plot, this became very obvious. Part of the reason they were so highly correlated was that an ECM requires regressing the time series variables in levels (both were I(1)) and the covariates shared a common trend.

rstan recommended increasing the tree depth, and while that allowed the model to be fit without a warning, the effective sample size was still quite low.

What did I do instead? Once I realized the problem, all I needed to do was de-trend the covariates. This brought the correlation between them down to around 30%, and Stan didn’t have any issue with it (much faster execution and a higher effective sample size).

The Stan documentation mentions a few techniques that can improve Stan’s efficiency, like standardizing variables, but de-trending can also be an important tool in some cases.

See below for accompanying code.

library(rstan)
library(MASS)
rstan_options(auto_write = TRUE)

# Model
stan_model <- "
data {
	int<lower=0> T;
	int<lower=0> N_X;
	vector[T] Y;
	matrix[T, N_X] X;
}
parameters {
	vector[N_X] B;
	real<lower=0> sigma;
}
model {
	B ~ normal(0, 100);        // weak prior on the regression coefficients
	sigma ~ cauchy(0, 1);      // half-Cauchy prior on the residual sd
	Y ~ normal(X * B, sigma);  // linear regression likelihood
}
"

# Stan parameters
N_iter <- 500
N_chain <- 4

# Simulation Parameters
T <- 285
N_X <- 3
Y_coefficients <- c(-39.512948, 1.833247, 2.219502)
sd_Y_residual <- 0.05590053
X_initial <- c(10.9678, 11.00594)
mu_diff_X <- c(0.003503189, 0.003508181)
sigma_diff_X <- matrix(c(2.801688E-5, 1.243823E-5, 1.243823E-5, 1.340090E-5), 2, 2)

# Simulate: correlated first differences, cumulated into I(1) levels that
# share a common drift, plus an intercept column
X_sim <- mvrnorm(T - 1, mu_diff_X, sigma_diff_X)
X_sim <- rbind(X_initial, X_sim)
X_sim <- apply(X_sim, 2, cumsum)
X_sim <- cbind(rep(1, T), X_sim)
Y_sim <- X_sim %*% Y_coefficients + rnorm(T, 0, sd_Y_residual)
	 
reg1 <- lm(Y_sim ~ X_sim[, 2] + X_sim[, 3])

stan_data1 <- list(T=T, N_X=N_X, Y=as.vector(Y_sim), X=X_sim)

fit_1 <- stan(model_code = stan_model,
              data = stan_data1, verbose = FALSE,
              iter = N_iter, chains = N_chain)
# warns that some transitions after warmup exceeded the maximum treedepth
print(fit_1)
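# Sketch of how the treedepth warning itself can be silenced by raising the
# cap via rstan's control argument (the value 15 is only an example); in my
# experience this just slows sampling while n_eff stays low.
fit_1_deep <- stan(model_code = stan_model,
                   data = stan_data1, verbose = FALSE,
                   iter = N_iter, chains = N_chain,
                   control = list(max_treedepth = 15))
print(fit_1_deep)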

# De-trend Simulation: regress each covariate on a linear time trend and
# keep the residuals
X_sim_alt <- X_sim
reg_detrend1 <- lm(X_sim_alt[, 2] ~ seq(1, T))
reg_detrend2 <- lm(X_sim_alt[, 3] ~ seq(1, T))
X_sim_alt[, 2] <- reg_detrend1$residuals
X_sim_alt[, 3] <- reg_detrend2$residuals

Y_sim_alt <- X_sim_alt %*% Y_coefficients + rnorm(T, 0, sd_Y_residual)
stan_data2 <- list(T=T, N_X=N_X, Y=as.vector(Y_sim_alt), X=X_sim_alt)

reg2 <- lm(Y_sim_alt ~ X_sim_alt[, 2] + X_sim_alt[, 3])

fit_2 <- stan(model_code = stan_model,
              data = stan_data2, verbose = FALSE,
              iter = N_iter, chains = N_chain)
print(fit_2)

In the case of high posterior correlations you may get much better performance by using a dense mass matrix. Currently a diagonal mass matrix is the default, and you need to set metric=dense_e to switch to a dense mass matrix. Later this year we may get automatic metric selection in Stan.
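For reference, in rstan the metric is passed through the control list. A minimal sketch, assuming the stan_model, stan_data1, N_iter and N_chain objects from the first post are still in the workspace:

fit_1_dense <- stan(model_code = stan_model,
                    data = stan_data1, verbose = FALSE,
                    iter = N_iter, chains = N_chain,
                    control = list(metric = "dense_e"))
print(fit_1_dense)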

Thanks for the reply. I wasn’t aware of that option. I don’t tend to tweak Stan too much as my grasp of HMC and NUTS isn’t that strong.

Hey Aki, I just tried dense_e and I actually didn’t see much of an impact. I was able to get the correct mean values for the parameters with diag_e, but the n_effs were all low, so I wasn’t sure I wanted to pursue that.

However, I followed some of the links you posted on Andrew’s blog, and one of them directed me to the piece that recommends the QR decomposition technique for dealing with multicollinearity. That worked like a charm!
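For anyone curious, here is a minimal sketch of the QR reparameterization from the Stan User’s Guide, adapted to the data block from the first post. It is not necessarily the exact model I ended up with, but it is the same idea: sample coefficients on the orthogonalized design matrix and recover the coefficients on the original scale of X in generated quantities (assumes a Stan version that has qr_thin_Q / qr_thin_R, and reuses stan_data1, N_iter and N_chain from above).

stan_model_qr <- "
data {
	int<lower=0> T;
	int<lower=0> N_X;
	vector[T] Y;
	matrix[T, N_X] X;
}
transformed data {
	matrix[T, N_X] Q_ast;
	matrix[N_X, N_X] R_ast;
	matrix[N_X, N_X] R_ast_inverse;
	// thin and scale the QR decomposition, as in the Stan User's Guide
	Q_ast = qr_thin_Q(X) * sqrt(T - 1);
	R_ast = qr_thin_R(X) / sqrt(T - 1);
	R_ast_inverse = inverse(R_ast);
}
parameters {
	vector[N_X] theta;   // coefficients on Q_ast
	real<lower=0> sigma;
}
model {
	sigma ~ cauchy(0, 1);
	Y ~ normal(Q_ast * theta, sigma);  // theta gets an implicit flat prior here
}
generated quantities {
	vector[N_X] B;
	B = R_ast_inverse * theta;  // coefficients on the original scale of X
}
"

fit_qr <- stan(model_code = stan_model_qr,
               data = stan_data1, verbose = FALSE,
               iter = N_iter, chains = N_chain)
print(fit_qr)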


Just another update. While I had resolved the issue for the simpler linear regression, I also noticed that it comes back (to a much smaller extent) when extending the model to multiple Y variables whose errors follow a multivariate normal distribution. The pairs plot again revealed that parameters were highly correlated; in this case it was two parameters in the Cholesky factor of the correlation matrix. Because the data are strongly cointegrated, they also tend to be strongly correlated.

Increasing max_treedepth does eliminate the warnings this time, but it slows down fitting quite a bit. Again, I didn’t have any success with dense_e or diag_e; dense_e actually made things worse, as I got some divergences.

The only thing I found that consistently works well and allows the model to be fit quickly is rewriting the problem to avoid the high correlation between parameters. The QR decomposition worked very well above in the case of the covariates. However, in my application I also have high correlation within the correlation matrix. My solution makes the model a little more complicated to work with, but it gave me the best results yet in terms of n_effs and fitting time. I made it so that two of the variables follow a multivariate normal, rather than three as before, and then I regressed the third on the other two variables, plus the original covariates. This effectively takes the high correlation out of the problem (there is still some between those two variables, but not enough for Stan to issue warnings).

One recommendation from Andrew’s blog post was to use a stronger prior, but there isn’t really an easy way to put a more informative prior on the LKJ distribution. My way forces things, but it works.
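To make that a bit more concrete, here is a rough, purely illustrative sketch of the restructuring (this is not my actual model, and the names Y12, Y3, B12, B3 and gamma are hypothetical): two of the outcomes keep a bivariate normal with an LKJ prior on a 2 x 2 Cholesky factor, and the third outcome is regressed on those two plus the covariates. The point is that the problematic near-unit correlation is then handled by ordinary regression coefficients (gamma) rather than by an element of the Cholesky factor.

# Rough illustrative sketch only -- hypothetical variable names, not the exact
# model discussed in this thread
stan_model_factored <- "
data {
	int<lower=0> T;
	int<lower=0> N_X;
	matrix[T, N_X] X;
	matrix[T, 2] Y12;   // the two outcomes kept in the bivariate normal block
	vector[T] Y3;       // the third outcome, modeled conditionally on the other two
}
parameters {
	matrix[N_X, 2] B12;
	vector[N_X] B3;
	vector[2] gamma;              // loadings of Y3 on the other two outcomes
	cholesky_factor_corr[2] L;
	vector<lower=0>[2] tau;
	real<lower=0> sigma3;
}
model {
	matrix[2, 2] L_Sigma;
	L_Sigma = diag_pre_multiply(tau, L);
	L ~ lkj_corr_cholesky(2);
	tau ~ cauchy(0, 1);
	sigma3 ~ cauchy(0, 1);
	to_vector(B12) ~ normal(0, 100);
	B3 ~ normal(0, 100);
	gamma ~ normal(0, 100);
	for (t in 1:T)
		Y12[t]' ~ multi_normal_cholesky((X[t] * B12)', L_Sigma);
	Y3 ~ normal(X * B3 + Y12 * gamma, sigma3);
}
"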

EDIT: Just checked and saw 2.19.2 is on CRAN now.

I think the previous version, rstan 2.18.2, had not implemented the dense_e option correctly. See this reply: Problems remain after non-centered parametrization of correlated parameters

(and also the whole thread there if you are interested in a comparison between dense_e and diag_e)

As of yesterday’s 2.19 release, the dense metric is working nicely. It tripled my sampling speed on one model…


Conveniently enough… I am still using 2.18.1 rather than the latest version. I was going to wait until the next release to upgrade, because upgrading at work can sometimes be a bit of a hassle.

Sorry, I forgot that the RStan CRAN release didn’t have a working dense metric pre-2.19, and I’m happy to know it improved performance for @Charles_Driver.