Blog post: Identifying non-identifiability

Hi all,
I wrote a blog post surveying some broken models, in the hope that it helps people spot, uncover, and understand issues with their own models.

http://www.martinmodrak.cz/2018/05/14/identifying-non-identifiability/

As with the previous post about divergences, I am learning this stuff as I go, so feedback is very welcome; hopefully I didn’t get anything very wrong.

13 Likes

This is really cool. In the spirit of giving back: the problem you identify in the sigmoid model with a non-identified special case made me think of the problem of multi-collinearity, that is, a parameter blowing up because of near non-identifiability. In one of Andrew Gelman’s papers there was a suggestion to solve that with a sort of joint prior for some of the parameters. I had some success taming divergences that way, but I have no idea whether there are any problems with the approach. I think for your model it would look like this:

w ~ normal(0, prior_sd);
b ~ normal(0, prior_sd);
prior_sd ~ normal(0, 2.5); // Or whatever is appropriate in your case.

Again, intuitively it makes sense to me to have a prior that says that if w is not very large, then b is unlikely to be very large (in this case), but that is also the only motivation I have.
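
For completeness, a minimal sketch of how this could sit inside a full Stan program (assuming scalar w and b, as in the blog’s sigmoid example; the likelihood line is only a placeholder):

parameters {
  real w;
  real b;
  real<lower=0> prior_sd;   // shared scale for w and b; the constraint makes the prior half-normal
}
model {
  w ~ normal(0, prior_sd);
  b ~ normal(0, prior_sd);
  prior_sd ~ normal(0, 2.5);
  // ... likelihood of the sigmoid model goes here
}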

1 Like

Cool, thanks!

Maybe a minor addition on the first case that you present:

In this example x and x^2 are actually the same thing (perfect multi-collinearity, because x only takes the values 0 and 1, so x^2 = x), and one could rewrite (ignoring subscripts) y\sim\text{Normal}(x+x^2,\sigma) as y\sim\text{Normal}(2x,\sigma), right? If you run

set.seed(20180512)

x = rep(c(0,1), times = 10)
N = length(x)
y = rnorm(N, x + x ^ 2, 1)

# -1 for "no-intercept"
lm(y ~ x + I(x^2) -1)

R will just drop the second term:

Call:
lm(formula = y ~ x + I(x^2) - 1)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.23957 -0.46602 -0.06163  0.49694  2.24213 

Coefficients: (1 not defined because of singularities)
       Estimate Std. Error t value Pr(>|t|)    
x        2.3450     0.3391   6.915 1.36e-06 ***
I(x^2)       NA         NA      NA       NA    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.072 on 19 degrees of freedom
Multiple R-squared:  0.7157,	Adjusted R-squared:  0.7007 
F-statistic: 47.82 on 1 and 19 DF,  p-value: 1.359e-06

and I think this is very convenient in most cases, because usually the problems we have are not so straightforward. I really wanted to have that for my Stan models, so this is what I do (taking the generated data from above):

# -1 for no intercept
X_full <- model.matrix(y ~ x + I(x^2) -1)

qrX_full <- qr(X_full)

# this should evaluate to TRUE (otherwise check the "0.1" value)
ncol(X_full) - qrX_full$rank == sum(abs(diag(qr.R(qrX_full))) < 0.1)

singular <- colnames(qr.R(qrX_full))[abs(diag(qr.R(qrX_full))) < 0.1]

# this prints the column names of the singular predictors
singular

# remove singularities; drop=FALSE needed to maintain matrix class when ncol = 1
X_new <- X_full[,-which(colnames(X_full) %in% singular),drop=FALSE]

library(rstan)
options(mc.cores = parallel::detectCores())

stan_code_linear <- "
data { 
  int N;
  int k;
  vector[N] y; 
  matrix[N,k] X; 
} 
 
parameters {
  vector[k] beta;
  real<lower=0> sigma;
} 
 
model { 
  y ~ normal(X*beta, sigma); 
  sigma ~ normal(0,1); 
  beta ~ normal(0,100); 
} 
"

model_linear <- stan_model(model_code = stan_code_linear)
data_linear <- list(
  N = N,
  k = ncol(X_new),
  y = y,
  X = X_new
)

fit_linear <- sampling(model_linear, data = data_linear)

Note that this works well with flat priors and (sort of) automatically re-parametrizes to the sum-of-variables solution you mentioned in your blog post.

Inference for Stan model: f12f5625ffe0f1b2c58d969f38db8a32.
4 chains, each with iter=2000; warmup=1000; thin=1; 
post-warmup draws per chain=1000, total post-warmup draws=4000.

          mean se_mean   sd   2.5%    25%    50%    75%  97.5% n_eff Rhat
beta[1]   2.35    0.01 0.35   1.63   2.12   2.35   2.58   3.06  2301    1
sigma     1.11    0.00 0.19   0.81   0.98   1.08   1.21   1.53  2237    1
lp__    -12.44    0.03 1.08 -15.32 -12.83 -12.10 -11.68 -11.41  1195    1

Samples were drawn using NUTS(diag_e) at Thu May 17 15:09:03 2018.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).

I get that your point was more to give an instructive example, so this is just a tip on how to get the “Coefficients: (1 not defined because of singularities)” behavior that people like me are used to from lm() or glm().

Cheers! :)
Max

1 Like

That’s a standard application of regularization for maximum likelihood estimates or priors for Bayes. It will also solve the separability problem for logistic regression (one predictor that’s always the same sign and showing up with only successes or failures in the data, leading to an infinite coefficient estimate).
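
To make that concrete, here is a minimal sketch with made-up data: the sign of the predictor perfectly separates successes from failures, so the maximum likelihood slope wants to go to infinity, while even a weakly informative prior (shown via rstanarm just as one option) keeps it finite.

set.seed(1)
x <- c(-2, -1.5, -1, 1, 1.5, 2)
y <- as.integer(x > 0)   # complete separation: y is 1 exactly when x > 0

# Maximum likelihood: glm() warns and reports a huge slope with a huge standard error
fit_mle <- glm(y ~ x, family = binomial())
coef(fit_mle)

# With a weakly informative prior on the coefficients the estimate stays finite, e.g.:
# fit_bayes <- rstanarm::stan_glm(y ~ x, family = binomial(),
#                                 prior = rstanarm::normal(0, 2.5))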

Just to give my experience:

[edited] double-checking some quite non-reproducible results.

[EDIT] The divergent transitions arose because of the non-centred parametrization.

OK, here is a case of mine, in which the joint prior suggested above does not by itself resolve the problem, with the following model:

data {
  int<lower=0> G;                   // genes
  int<lower=0> T;                   // tubes (collections of genes for a patient)
  int<lower=0> R;                   // covariates
  vector<lower=0>[G] y[T];          // RNA-seq counts
  matrix[T, R] X;                   // design matrix
}
parameters {
  matrix[R, G] beta;
  row_vector<lower=0>[G] k;
  real<lower=0> sigma;
  vector<lower=0>[G] prior_sd;
  real<lower=0> k_mu;
  real<lower=0> k_sigma;
}
model {
  for (t in 1:T)
    y[t] ~ normal(inv_logit(X[t] * beta) .* k, sigma);

  for (r in 1:R)
    beta[r] ~ normal(0, prior_sd);

  k ~ normal(k_mu, k_sigma);
  sigma ~ cauchy(0, 2.5);

  k_mu ~ normal(0, 2);
  k_sigma ~ normal(0, 1);

  prior_sd ~ normal(0, 1);
}
generated quantities {
  vector[G] y_gen[T];               // RNA-seq counts
  matrix[T, G] y_hat = X * beta;

  for (t in 1:T)
    for (g in 1:G)
      y_gen[t, g] = normal_rng(inv_logit(y_hat[t, g]) * k[g], sigma);
}

Slope == 0

[plots not shown]

Slope == 5

[plots not shown]

Now I shrink prior_sd to ~ normal(0,0.1);

Slope == 0

[plot not shown]

Now the correlation is linear rather than multiplicative, but still there.

Slope == 5

[plots not shown]

Any comments on what I can change to recover this inference? Or do I have to consider that the common prior on the intercept and slope is not necessarily enough, and that I should look at deeper reparametrizations (e.g., ODE Model of Gene Regulation)?

[EDIT] The divergent transitions arose because of the non-centred parametrization.

I am trying to use this technique to make the alpha parameter identifiable in the case where the slope beta is 0. With the link function

k / (1 + exp(- (alpha + beta * X) ))

I apply

alpha[g] ~ normal(0, sigma_common[g]); // goes to 0 when beta == 0
beta[g] ~ normal(0, sigma_common[g]);
sigma_common ~ normal(0,1);

However, I am wondering: if I want to put a hyperprior on the slope, e.g.

beta ~ reg.horseshoe()

then I would have two sampling statements for each beta[g]. I imagine this is not advisable. Is there a better way?

That’s k * inv_logit(alpha + beta * X). But why map to the interval [0, k]?

You might want another post on hierarchical horseshoes that @avehtari should see. You can do this combination—it works out similarly to something like the elastic net for penalized maximum likelihood (which combines L1 (Laplace) and L2 (normal) priors to allow the L1 to shrink to zero and the L2 to identify [which L1 doesn’t do by itself]).
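
For reference, the (naive) elastic net objective combines exactly those two penalties,

\hat\beta = \arg\min_\beta \left\{ \lVert y - X\beta \rVert_2^2 + \lambda_1 \lVert \beta \rVert_1 + \lambda_2 \lVert \beta \rVert_2^2 \right\},

which, read as a negative log posterior, corresponds to a Laplace (L1) prior multiplied by a normal (L2) prior on \beta.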

1 Like

Use the regularized horseshoe with group-specific slab scales instead of the single scale used in the paper.
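
For readers who haven’t seen it, a rough Stan sketch of that idea (adapted from the regularized horseshoe construction in Piironen & Vehtari, 2017, with a per-gene slab scale; all names and hyperparameter choices below are placeholders, not from this thread):

data {
  int<lower=0> G;
  real<lower=0> scale_global;   // tau0, e.g. based on the expected number of non-zero slopes
  real<lower=1> nu_slab;        // slab degrees of freedom
  real<lower=0> s_slab;         // slab scale
}
parameters {
  vector[G] z;
  vector<lower=0>[G] lambda;    // local shrinkage
  real<lower=0> tau;            // global shrinkage
  vector<lower=0>[G] c2;        // per-gene slab scale (squared)
}
transformed parameters {
  vector<lower=0>[G] lambda_tilde
    = sqrt(c2 .* square(lambda) ./ (c2 + square(tau) * square(lambda)));
  vector[G] beta_slope = z .* lambda_tilde * tau;
}
model {
  z ~ normal(0, 1);
  lambda ~ cauchy(0, 1);
  tau ~ cauchy(0, scale_global);
  c2 ~ inv_gamma(0.5 * nu_slab, 0.5 * nu_slab * square(s_slab));
  // beta_slope then enters the likelihood in place of the slope coefficients
}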

1 Like

This list is like magic! Just say @avehtari’s name and he appears with an answer. Thanks, Aki!

1 Like

I realised I have hijacked this thread (I will open another topic if the discussion needs to go further).

Instead of using exp(alpha + beta * X), the above function allows for the biological plateaus that I observe in the data. The biological interpretation is that (i) a gene has reached its maximum functionality and does not need to increase further, or (ii) the concentration of the protein encoded by the gene starts to become toxic. For example (I know I have a small number of data points; the interpolating curves here are splines from an old analysis):

[plots not shown]

while for some genes it never reaches the soft biological maximum:

[plot not shown]

There are lots of techniques for dealing with this kind of thing in population models where there’s not so much a hard boundary like this, but a point at which the environment can’t sustain the population and the increased population size exerts a negative influence on population growth.
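
The classic example of that kind of soft limit is logistic growth with carrying capacity K,

\frac{dN}{dt} = r N \left( 1 - \frac{N}{K} \right),

where growth simply slows as the population N approaches K rather than hitting a hard boundary.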