Cool, thanks!
Maybe a minor addition on the first case that you present:
In this example x and x^2 are actually the same thing, because x only takes the values 0 and 1, so x^2 = x (perfect multi-collinearity). One could therefore rewrite (ignoring subscripts) y\sim\text{Normal}(x+x^2,\sigma) as y\sim\text{Normal}(2x,\sigma), right? If you run
set.seed(20180512)
x = rep(c(0,1), times = 10)
N = length(x)
y = rnorm(N, x + x ^ 2, 1)
# -1 for "no-intercept"
lm(y ~ x + I(x^2) -1)
R will just drop the second term:
Call:
lm(formula = y ~ x + I(x^2) - 1)
Residuals:
Min 1Q Median 3Q Max
-2.23957 -0.46602 -0.06163 0.49694 2.24213
Coefficients: (1 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
x 2.3450 0.3391 6.915 1.36e-06 ***
I(x^2) NA NA NA NA
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.072 on 19 degrees of freedom
Multiple R-squared: 0.7157, Adjusted R-squared: 0.7007
F-statistic: 47.82 on 1 and 19 DF, p-value: 1.359e-06
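Just to make the collinearity explicit, a quick check with the same x and y as above (the reduced model gives exactly the x estimate that lm() kept):
# x is a 0/1 dummy, so squaring changes nothing and x + x^2 = 2*x
all(x^2 == x)
# refitting without the redundant column gives the same estimate (2.3450)
coef(lm(y ~ x - 1))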
I think this is very convenient in most cases, because usually the problems we have are not so straightforward. I really wanted to have that behavior for my Stan models, so this is what I do (taking the generated data from above):
# -1 for no intercept
X_full <- model.matrix(y ~ x + I(x^2) -1)
qrX_full <- qr(X_full)
# this should evaluate to TRUE (otherwise check the "0.1" value)
ncol(X_full) - qrX_full$rank == sum(abs(diag(qr.R(qrX_full))) < 0.1)
singular <- colnames(qr.R(qrX_full))[abs(diag(qr.R(qrX_full))) < 0.1]
# this prints the column names of the singular predictors
singular
# remove singularities; drop=FALSE needed to maintain matrix class when ncol = 1
# (logical indexing also handles the case where singular is empty)
X_new <- X_full[, !(colnames(X_full) %in% singular), drop = FALSE]
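If you need this in more than one model, the same steps are easy to wrap in a small helper. Just a sketch (drop_singular() is not from any package); it uses the pivot returned by qr() to map the small diagonal entries of R back to the original columns:
# reusable version of the steps above
drop_singular <- function(X, tol = 0.1) {
  qrX <- qr(X)
  # diag(qr.R()) follows the pivoted column order, so translate back via $pivot
  bad <- qrX$pivot[abs(diag(qr.R(qrX))) < tol]
  if (length(bad) == 0) return(X)
  X[, -bad, drop = FALSE]
}
X_new <- drop_singular(X_full)  # here this keeps x and drops I(x^2)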
library(rstan)
options(mc.cores = parallel::detectCores())
stan_code_linear <- "
data {
  int N;
  int k;
  vector[N] y;
  matrix[N, k] X;
}
parameters {
  vector[k] beta;
  real<lower=0> sigma;
}
model {
  y ~ normal(X * beta, sigma);
  sigma ~ normal(0, 1);
  beta ~ normal(0, 100);
}
"
model_linear <- stan_model(model_code = stan_code_linear)
data_linear <- list(
N = N,
k = ncol(X_new),
y = y,
X = X_new
)
fit_linear <- sampling(model_linear, data = data_linear)
Note that this works well with flat priors and (sort of) automatically re-parametrizes to the sum-of-variables solution you mentioned in your blog post.
Inference for Stan model: f12f5625ffe0f1b2c58d969f38db8a32.
4 chains, each with iter=2000; warmup=1000; thin=1;
post-warmup draws per chain=1000, total post-warmup draws=4000.
mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
beta[1] 2.35 0.01 0.35 1.63 2.12 2.35 2.58 3.06 2301 1
sigma 1.11 0.00 0.19 0.81 0.98 1.08 1.21 1.53 2237 1
lp__ -12.44 0.03 1.08 -15.32 -12.83 -12.10 -11.68 -11.41 1195 1
Samples were drawn using NUTS(diag_e) at Thu May 17 15:09:03 2018.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).
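As a quick sanity check (using the objects from above), the single remaining beta matches the no-intercept lm() slope, i.e. the combined x + x^2 effect:
print(fit_linear, pars = "beta")  # posterior mean around 2.35
coef(lm(y ~ x - 1))               # 2.3450, as in the lm() fit above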
I get that your point was more to give an instructive example, so this is just a tip on how to get the “Coefficients: (1 not defined because of singularities)” behavior that people like me are used to from lm() or glm().
Cheers! :)
Max