Modeling interactions with unconstrained parameters

I am trying to model interactions between nominal predictors, following Kruschke’s strategy from DBDA. That is, I model my data with unconstrained parameters, and then I calculate the parameters so that each predictor sum to zero, and the data is interpreted as deflections from the overall mean. I have copied my model code below, b0, b1, b2, b1b2, b1b3 are the unconstrained parameters and a0, a1, a2, a3, a1a2, a1a3 are the constrained parameters, a0 is the overall mean.

As my models becomes more complex I get warnings about transitions that exceed max_treedepth. However, I have noticed that while the unconstrained parameters chains are jagged and highly autocorrelated, the constrained parameters seem to be just fine. The posterior estimates of the constrained parameters are virtually identical to what is obtained with higher max_treedepth that does not produce warning messages.

In my example it seem that the interaction term b1b3 is trading off with b0, but once the constraints (sum-to-zero) are imposed there is not such correlation, and posterior distributions/chains look clean.

  1. Can I trust these posteriors for the constrained parameters even though the unconstrained parameters look bad? My real data/model is larger and more complex so increasing max_treedepth results in sampling time of days.

  2. Is there a way to know which of the parameters are the culprits for the warnings other than checking the chains? I understand that parameters represent simultaneous credible values as a set but I can see that usually there are particularly problematic ones. Checking the pairs plot just shows me the problematic transition in all the parameters.

  3. My posteriors are very sharp but still a bit off the true values. Specifically, the main effects seem to be a little smaller and the interactions bigger. I am guessing this is a problem with any interactions model, where it is not possible to distinguish between a particular model and the sample noise. Are there any tricks or ways to improve this?

This is the data I simulated to test my model.

# simulate 2 interactions 
# main effects
a1 = c( -2,1,0,1,2 ); a2=c(-1,0,1); a3=c( -3,0,3 )
n1=length(a1); n2=length(a2); n3=length(a3)
# interaction
a1a2 = rbind(c(1,0,0), 
             c(0,0,0),
             c(0,0,0),
             c(0,0,0),
             c(0,0,-1))

a1a3 = rbind(c(-1,0,0), 
             c(0,0,0),
             c(0,0,0),
             c(0,0,0),
             c(0,0,1))

sigma=0.2
a1_s=0.2; a2_s=0.2; a3_s=0.2; a1a2_s=0.2; a1a3_s=0.2


# number of samples
n=50

df <- list(); i=1
for (l1 in 1:n1){
	for (l2 in 1:n2){
		for (l3 in 1:n3){
			# l1=1; l2=1; l3=1
			b1 <- rnorm(n, a1[l1], a1_s)
			b2 <- rnorm(n, a2[l2], a2_s)
			b3 <- rnorm(n, a3[l3], a3_s)
			b1b2 <- rnorm(n, a1a2[l1, l2], a1a2_s)
			b1b3 <- rnorm(n, a1a3[l1, l3], a1a3_s)
			mu <- b1 + b2 + b3 + b1b2 + b1b3 + rnorm(n, 0, sigma)
			df[[i]] <- tibble(a1=l1, a2=l2, a3=l3, b1=b1, b2=b2, b3=b3, b1b2=b1b2, b1b3=b1b3, value = mu)
			i=i+1
		}
	}
}
df <- bind_rows(df)

# run model
model_data <- list(
	n_obs = nrow(df),
	n_a1 = df %>% distinct(a1) %>% nrow(),
	n_a2 = df %>% distinct(a2) %>% nrow(),
	n_a3 = df %>% distinct(a3) %>% nrow(),
	obs2a1 = df$a1,
	obs2a2 = df$a2,
	obs2a3 = df$a3,
	y = df$value, 
	y_mean = df %>% pull(value) %>% mean(),
	y_sd = df %>% pull(value) %>% sd()
	)

and this is the stan model

data{
  int<lower=1> n_obs;
  int<lower=1> n_a1;
  int<lower=1> n_a2;
  int<lower=1> n_a3;
  int<lower=1, upper=n_a1> obs2a1[n_obs];
  int<lower=1, upper=n_a2> obs2a2[n_obs];
  int<lower=1, upper=n_a3> obs2a3[n_obs];
  real y[n_obs];
  real y_mean;
  real y_sd;
}

parameters{
  real b0; 
  vector[n_a1] b1; 
  real<lower=0> b1_s; 
  vector[n_a2] b2; 
  vector[n_a3] b3; 
  matrix[n_a1, n_a2] b1b2;
  matrix[n_a1, n_a2] b1b3;
  real<lower=0> b1b2_s;
  real<lower=0> b1b3_s;
  real<lower=0> sigma; 
}

transformed parameters{
  real mu[n_obs];
  for (i in 1:n_obs){
    mu[i] = b0 + b1[obs2a1[i]] + b2[obs2a2[i]] + b3[obs2a3[i]] + 
      b1b2[obs2a1[i], obs2a2[i]] + b1b3[obs2a1[i], obs2a3[i]];  
  }
}

model{
  b0 ~ normal(y_mean,y_sd*5);
  b1_s ~ normal(0,1);
  b1 ~ normal(0, b1_s);
  b2 ~ normal(0, 1);
  b3 ~ normal(0, 1);  
  b1b2_s ~ normal(0,1);
  for (j2 in 1:n_a2){
    for (j1 in 1:n_a1){
      b1b2[j1,j2] ~ normal(0, b1b2_s);
    }
  }
  b1b3_s ~ normal(0,1);
  for (j3 in 1:n_a3){
    for (j1 in 1:n_a1){
      b1b2[j1,j3] ~ normal(0, b1b3_s);
    }
  }

  // likelihood
  y ~ normal(mu, sigma);
}

generated quantities{
  real a0; 
  vector[n_a1] a1; 
  vector[n_a2] a2; 
  vector[n_a3] a3; 
  matrix[n_a1, n_a2] a1a2; 
  matrix[n_a1, n_a3] a1a3; 
  real log_lik[n_obs];
  real m[n_a1,n_a2, n_a3]; // cell means
  real m_1x2[n_a1, n_a2];
  real m_1x3[n_a1, n_a3];

  // -------------------------------------------------------------------------------   
  // for loo
  for (i in 1:n_obs) {
    log_lik[i] = normal_lpdf( y[i] | mu[i], sigma );     
  }

  // ------------------------------------------------------------------------------- 
  // convert predictors to deviation from mean
  for (j1 in 1:n_a1) { 
    for (j2 in 1:n_a2) {
      for (j3 in 1:n_a3) {  
        m[j1,j2,j3] = b0 + b1[j1] + b2[j2] + b3[j3] + b1b2[j1,j2] + b1b3[j1,j3]; // cell means 
      }
    } 
  }

  // overall mean: a0 = mean(m);
  a0 = 0.0;
  for (j1 in 1:n_a1) { 
    for (j2 in 1:n_a2) {
      for (j3 in 1:n_a3) {
        a0 += m[j1,j2,j3];
      }
    } 
  }
  a0 = a0 / num_elements(m);

  // main effects
  for (j1 in 1:n_a1) { a1[j1] = mean(to_array_1d(m[j1,:,:])) - a0; }
  for (j2 in 1:n_a2) { a2[j2] = mean(to_array_1d(m[:,j2,:])) - a0; }
  for (j3 in 1:n_a3) { a3[j3] = mean(to_array_1d(m[:,:,j3])) - a0; }

  // interctions
  for (j1 in 1:n_a1) { 
    for (j2 in 1:n_a2) {
        m_1x2[j1,j2] = mean(to_array_1d(m[j1,j2,:]));
    } 
  } 
  for (j1 in 1:n_a1) { 
    for (j3 in 1:n_a3) {
        m_1x3[j1,j3] = mean(to_array_1d(m[j1,:,j3]));
    } 
  } 

  for ( j1 in 1:n_a1 ) { 
    for ( j2 in 1:n_a2 ) { 
      a1a2[j1,j2] = m_1x2[j1,j2] - (a1[j1] + a2[j2] + a0);
    }
  }

  for ( j1 in 1:n_a1 ) { 
    for ( j3 in 1:n_a3 ) { 
      a1a3[j1,j3] = m_1x3[j1,j3] - (a1[j1] + a3[j3] + a0);
    }
  }

}

Any help, comment, advice is greatly appreciated!

2 Likes

Just center your predictors from the get go and that might help things.

So for a basic linear regression, instead of doing:

y ~ normal(a + b * x, sigma);

Set xc = x - mean(x) and do:

y ~ normal(a + b * xc, sigma);

When you do that xc sums to zero, so I assume this is similar to what you’re doing now, but now you’ll be sampling in the possibly-better-identified space and maybe your Neffs will be better.

That’ll change the interpretation of your parameters but I’m pretty sure it’ll change them in the way you want (“and the data is interpreted as deflections from the overall mean”).

@bbbales2

I’m a little confused. If my predictors were metric I could do that, but in my model I have nominal predictors (categories) so I don’t know how I could center these since the numbers are just a convention to assign different groups.

But your idea inspired me a little. Maybe I can have a much stronger prior on b0 to force it to be closer to the overall mean.

Oh I misinterpreted, my bad!

I think what I’d do with this model is non-center it. So that’d mean everywhere you have:

b1b2[j1,j2] ~ normal(0, b1b2_s);

Make b1b2 just a declared variable at the top of the model block (instead of a parameter), make a new parameter b1b2_z with the same shape, and then do:

b1b2_z[j1,j2] ~ normal(0, 1);
b1b2[j1,j2] = b1b2_s * b1b2_z[j1, j2];

That’ll give b1b2 the desired distribution. It should hopefully be a bit easier on Stan to adapt to and sample though.

Might not work. It really sounds like the posterior is correlated, and this isn’t gonna help that. Try it though. It shouldn’t take that long to code up. Here’s longer form directions: Diagnosing Biased Inference with Divergences. If that isn’t clear just message back.

Is there any chance you have a bug in your second loop?

  for (j3 in 1:n_a3){
    for (j1 in 1:n_a1){
      b1b2[j1,j3] ~ normal(0, b1b3_s);
    }
  }

Should that be b1b3 on the left side instead of b1b2?

@bbbales2 You are absolutely right, I mixed up b1b2 and b1b3, and also matrix[n_a1, n_a2] b1b3; should be matrix[n_a1, n_a3] b1b3;

This was not however, the root of the problem (I’ve been running a couple of different models) and I still get the same phenomenon: lots of max_treedepth warnings and the unconstrained parameters looking bad but the constrained ones looking ok.

I will try your suggested parameterization and report back!

UPDATE:

  1. My original assertion that b0 was correlated to b1b3 was due to the error pointed by bbbales2.
    <(_ _)> With the proper model specification the model fits fine.

  2. I tried the non-centered parameterization. This did not help. Posteriors for b0, b1, etc actually look worth, but again constrained parameters a0, a1, etc look ok. I guess non-center parameterization does not work well in this model, but thanks @bbbales2 anyway!

Nagging Question

Can I ignore max_treedepth warnings? According to the Brief Guide to Stan’s Warnings

“Warnings about hitting the maximum treedepth are not as serious as warnings about divergent transitions.”, they are efficiency concerns. But if all/most of my iterations exceed max_tree can I trust my posteriors?

I tried this problem with dense adaptation and it worked well for this. How many parameters do you have in your full model? That can be a limiter for dense adaptation.

Here’s a summary of the output from four chains: summary.txt (218.7 KB)

There was a bug with dense adaptation a helpful internet person tracked down recently (https://github.com/stan-dev/stan/issues/2671) – I don’t think the fix has propagated into Rstan, but you can do this in the latest cmdstan.

To dump your Rstan data to a file in the format used by Cmdstan do:

stan_rdump(names(stan_data), "stan_data.dat", env = list2env(stan_data))

The run command should be something like:

./model sample algorithm=hmc metric=dense_e data file=model.dat output refresh=1

Dense adaptation makes it easier for Stan to sample things with correlated posteriors which can cause treedepth issues. See if this works.

Edit: The specific model and data I used were mat.stan (1.1 KB) and mat.data.R (60.7 KB)

@bbbales2 sorry for the lateness. It took me a while to get cmdstan working.
Just as you said it works great with metric=dnese_e option! Thank you so much!
I get about 9% of transitions exceeding max_treedepth in Rstan and 0% when using dense adaptation in cmdstan. You are also right the bug has not been corrected in Rstan, in Rstan dense adaptation results in 97% of transitions exceeding max_treedepth.

My actual data has 7 nominal categories, 3 with 3 predictors and 4 with 2 predictors, and I am interested in 4 interaction terms, which makes it 50 predictors (20 for main effects and 30 for interactions). I will give it a go with cmdstan using dense adaptation.

As I am a newbie to cmdstan, I am not sure what’s the recommended way to use it. Should I download cmdstan and make build every time I want to run a new model, or should I move my model and data files to a master cmdstan folder?

I quickly read through cmdstan manual and Stan’s users guide, but I couldn’t find anything about the different methods and metrics. Could you recommend me anything to find out more about different algorithms and metrics?

Niiice! (Thanks @fehiepsi for finding this problem)

If you have less than a couple hundred parameters, I think you should be good to go.

I just treat the “examples” folder inside cmdstan as a working directory. I keep all my models and results in there. It makes it easier to write scripts and stuff. I’m not a well-organized person though :D.

The idea is to make a linear coordinate transform that makes the posterior look more like an isotropic Gaussian. I think isotropic is the right word. If you have N parameters, you try to rescale the posterior to look like N iid normal(0, 1) random variables. The same sort of rescaling is done in Metropolis and probably any other MCMC.

The way it works in Stan is you use the warmup samples to approximate to the posterior covariance. dense_e uses this full covariance estimate. diag_e uses a diagonal approximation (just set all the off diagonals equal to zero). It takes more work to get the full estimate than the diagonal one (you need at least as many warmup samples as parameters, practically more). And that’s about the state of things now :D. If you have treedepth issues, try dense_e!

1 Like

@bbbales2 I was finally able to model my data with interactions, worked just fine with dense adaptation. I’ve been trying to do this weeks! Thank you so much for all of your advice!

1 Like