Blog post: Identifying non-identifiability


Hi all,
I wrote a blog post, surveying some broken models in hope it helps people spot, uncover and understand issues with their models.

As in the previous post about divergences, I am learning this stuff as I go, so feedback is very welcome, hopefully I didn’t get anything very wrong.

GLM with growth function - problem of indetermiinability when slope = 0

This is really cool. In the spirit of giving back. The problem you identify in the sigmoid model with non-identified special case made me think of the problem with multi-collinearity. That is, a parameter blowing up because of near non-identifiability. In one of Andrew Gelman’s papers there was some suggestion to solve that with a sort of joint prior for some of the parameters. I had some success with taming divergences in that way but I have no idea whether there are any problems with the approach. I think for your model it would like this.

w ~ normal(0, prior_sd);
b ~ normal(0, prior_sd);
prior_sd ~ normal(0, 2.5); // Or whatever is appropriate in your case.

Again, intuitively it makes sense for me to have a prior that says if w is not very large, it is unlikely that b is very large (in this case) but that is also the only motivation I have.

GLM with growth function - problem of indetermiinability when slope = 0

Cool, thanks!

Maybe a minor addition on the first case that you present:

In this example x and x^2 are actually the same thing (perfect multi-collinearity) and one could rewrite (ignoring subscripts) y\sim\text{Normal}(x+x^2,\sigma) to y\sim\text{Normal}(2x,\sigma), right? If you run


x = rep(c(0,1), times = 10)
N = length(x)
y = rnorm(N, x + x ^ 2, 1)

# -1 for "no-intercept"
lm(y ~ x + I(x^2) -1)

R will just drop the second term:

lm(formula = y ~ x + I(x^2) - 1)

     Min       1Q   Median       3Q      Max 
-2.23957 -0.46602 -0.06163  0.49694  2.24213 

Coefficients: (1 not defined because of singularities)
       Estimate Std. Error t value Pr(>|t|)    
x        2.3450     0.3391   6.915 1.36e-06 ***
I(x^2)       NA         NA      NA       NA    
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.072 on 19 degrees of freedom
Multiple R-squared:  0.7157,	Adjusted R-squared:  0.7007 
F-statistic: 47.82 on 1 and 19 DF,  p-value: 1.359e-06

and I think this is very convenient in most cases, because usually the problems that we have are not so straight forward. I really wanted to have that for my Stan models. So this is what I do (taking the generated data from above):

# -1 for no intercept
X_full <- model.matrix(y ~ x + I(x^2) -1)

qrX_full <- qr(X_full)

# this should evaluate to TRUE (otherwise check the "0.1" value)
ncol(X_full) - qrX_full$rank == sum(abs(diag(qr.R(qrX_full))) < 0.1)

singular <- colnames(qr.R(qrX_full))[abs(diag(qr.R(qrX_full))) < 0.1]

# this prints the column names of the singular predictors

# remove singularities; drop=FALSE needed to maintain matrix class when ncol = 1
X_new <- X_full[,-which(colnames(X_full) %in% singular),drop=FALSE]

options(mc.cores = parallel::detectCores())

stan_code_linear <- "
data { 
  int N;
  int k;
  vector[N] y; 
  matrix[N,k] X; 
parameters {
  vector[k] beta;
  real<lower=0> sigma;
model { 
  y ~ normal(X*beta, sigma); 
  sigma ~ normal(0,1); 
  beta ~ normal(0,100); 

model_linear <- stan_model(model_code = stan_code_linear)
data_linear <- list(
  N = N,
  k = ncol(X_new),
  y = y,
  X = X_new

fit_linear <- sampling(model_linear, data = data_linear)

Note that this works well with flat priors and (sort of) automatically re-parametrizes to the sum-of-variables solution you mentioned in your blog post.

Inference for Stan model: f12f5625ffe0f1b2c58d969f38db8a32.
4 chains, each with iter=2000; warmup=1000; thin=1; 
post-warmup draws per chain=1000, total post-warmup draws=4000.

          mean se_mean   sd   2.5%    25%    50%    75%  97.5% n_eff Rhat
beta[1]   2.35    0.01 0.35   1.63   2.12   2.35   2.58   3.06  2301    1
sigma     1.11    0.00 0.19   0.81   0.98   1.08   1.21   1.53  2237    1
lp__    -12.44    0.03 1.08 -15.32 -12.83 -12.10 -11.68 -11.41  1195    1

Samples were drawn using NUTS(diag_e) at Thu May 17 15:09:03 2018.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).

I get that you point was more to give an instructive example, so this is just a tip how to get this “Coefficients: (1 not defined because of singularities)” behavior that people like me are used to from lm() or glm().

Cheers! :)