All parameters ~0 except the final entry in the model matrix (regardless of what that value is)

I’m trying to fit a hierarchical negative binomial (and eventually zero-inflated poisson) model to ~2000 datapoints using a model with ~40 predictors and the QR decomposition. I am able to get the model to fit reasonably well (Rhat, neff, and PPchecks all seem reasonable), but am finding that all of the predictors are estimated to be ~0 except for the last predictor in the model matrix (regardless of what that predictor is). I have tried reducing the number of predictors and changing the order of the predictors, but the result is always a series of highly variable estimates (centered at zero, but with high levels of uncertainty) save for the last predictor which is typically estimated much more precisely and centered around 25. It seems that all of the ‘explanatory power’ is being loaded on the last predictor (regardless of what that predictor is). Is this a problem with the way the predictors and model matrix are entered? Or are priors driving the outcome (which seems unlikely given the size of the dataset)? Any insight would be greatly appreciated.

My stan code is:

  int<lower=1> N; //the number of observations
  int<lower=1> J; //the number of states
  int<lower=1> K; //number of columns in the model matrix
  int<lower=1,upper=J> id[N]; //vector of group indices
  matrix[K,N] X; //the model matrix
  int<lower=0> y[N]; //the response variable
  real<lower=0> sdLT; //observed standard deviation
transformed data{
  matrix[N,K] Q = qr_Q(X')[,1:K] * N;
  matrix[K,K] R = qr_R(X')[1:K,] / N;
  matrix[K,K] R_inv = inverse(R);
real alpha_lambda;
real<lower=0> sigma_state_lambda;
real<lower=0> phi;
vector[J] eta_state_lambda;
vector[K] beta_l_tilde;

transformed parameters{
vector[K] beta_l = R_inv * beta_l_tilde;
vector[J] alpha_state_lambda;
vector[N] lambda;                     // avg. number of LTs in successful states

//varying intercepts
alpha_state_lambda = eta_state_lambda * sigma_state_lambda;
  for(n in 1:N){  
    for(k in 1:K){
      // linear predictors with random effect
      // add fixed and random effects here 
      lambda[n] = exp(alpha_lambda + alpha_state_lambda[id[n]] + beta_l_tilde[k]*X[k,n]); 
phi ~ gamma(10,0.01);
eta_state_lambda ~ normal(0,1);
sigma_state_lambda ~ cauchy(0, 2.5*sdLT);
alpha_lambda ~ normal(0,100);
beta_l ~ normal(0,100);

// likelihood
y ~ neg_binomial_2(lambda, phi);
generated quantities {
 vector[N] y_rep;
 for(n in 1:N){
  y_rep[n] = neg_binomial_2_rng(lambda[n],phi); //posterior draws to get posterior predictive checks

And the code for running in R:

N_obs <- nrow(LT_scale) #1984
J_st <- length(unique(LT_scale$stID)) #48 states
K_mat <- ncol(LT_scale[,c(7:12, 25:30, 48:53)]) #15 for a test run
id <- LT_scale$stID #id linking observation to group
X_mat <- t(as.matrix(LT_scale[,c(7:12, 25:30, 48:53)])) #15 row x 1984 column matrix
y_lt <- LT_scale$LTActivities #outcome 
sdLT <- sd(LT_scale$LTActivities) #empirical estimate of variation in the outcome

LT_data <- list(N=N_obs, J=J_st, K=K_mat, id=id, X=X_mat, y=y_lt, sdLT=sdLT)
LT_fitNG <- stan("/Users/matthewwilliamson/ConsOppBatch/LTA_nested_NegBin.stan", data=LT_data, iter=800,chains=3, control = list(max_treedepth=15)) ```

Here is a screenshot of the issue I’m describing

Guessing a nearly singular design matrix X or one that includes a constant.

Thanks Ben - There aren’t any constants in the matrix. How can I diagnose whether the matrix is singular?

Also, does the QR decomposition effect that? Or is the singularity a function of the original data matrix?

If X is nearly singular the last (possibly few) diagonal element(s) of R will be nearly zero, in which case the last (possibly few) diagonal element(s) of the inverse of R will be really large.

You should also make sure that you center the columns of X before doing the QR decomposition (assuming you are estimating the intercept separately).

Thanks for that. I think I understand your description of the effect of singularity on the R matrix, but don’t see evidence of singularity in the matrix prior to QR decomp (i.e. rankMatrix returns the same number of columns that are in the dataset).
I forgot to mention that all data in X were scaled (mean 0, unit variance) before modeling (so I don’t think that centering is the issue). I found a few covariates that were highly correlated and have removed them (thought I caught them all before running model). Is there something “wrong” with the priors I’m using for the Negative Binomial? I assumed that the amount of data I have would result in the likelihood overwhelming the prior, but that may not be correct?

Functions that try to determine the rank of a matrix numerically are not that helpful. It is possible that X is of full-rank but close to short-rank. When you have many coefficients or a complicated model generally, the number of data points is almost a complete red herring. You are typically going to have a lot of information in the data on some parameters and much less on others. Look at the (estimated) effective number of parameters produced by loo().

To clarify, I run loo on the stanfit object?

No. Run loo on a simulations by observations matrix of log-likelihoods evaluated over the posterior simulations. See help(loo, package = "loo").

Ok - I’ve run loo and the p_loo value is 90.6. I’m not entirely clear what that’s counting or how to use that to identify which parts of the design matrix are causing problems. I’m sorry if this is an obvious question, I’ve not used loo before… Thanks in advance for any explanation you might have

Also, the pointwise output from loo has 1948 rows (which matches the number of observations). Is that correct? Or should it match the number of predictors in X_mat?

Should have N = 1948 columns.

Ok, so that is correct. I’ve read the help file and the Vehtari et al 2016 paper on loo and it is still not clear to me how to use loo to identify the components of the design matrix that are causing the singularity. Is there additional information I could provide that would be helpful? Thanks again for your help.

I’ve now re-run the model with a much-reduced matrix and am continuing to have the same problem. I’ve attached a subset of the data (note that the number of observations has changed because there are more complete cases in the reduced dataset) that can be run with the following:

N_obs <- nrow(LT_scale)
J_st <- length(unique(LT_scale$stID))
K_mat <- ncol(LT_scale[,c(7:15)])
id <- LT_scale$stID
X_mat <- t(as.matrix(LT_scale[,c(7:15)]))
y_lt <- LT_scale$LTActivities
#offset <- LT_scale$TotPop2010
sdLT <- sd(LT_scale$LTActivities)

LT_data <- list(N=N_obs, J=J_st, K=K_mat, id=id, X=X_mat, y=y_lt, sdLT=sdLT)
LT_fitNG <- stan("/Users/matthewwilliamson/ConsOppBatch/LTA_nested_NegBin.stan", data=LT_data, iter=800,chains=3)  

LTA_scale_subset.csv (497.0 KB)