# Multilevel regression: poor behaviour with certain parameter combinations

Apologies in advance for the long post.

I am working on “micro-macro” multilevel regression, in which group-level responses are predicted from individual-level covariates. It’s the same structure as regression with covariate measurement error, but here there are more covariate replications per group than there usually are in measurement error models. The model for the data is as follows:

\begin{align} Y_i &\sim \text{Normal}\left(\alpha + \beta X_i, \sigma_Y\right) \quad (i = 1, \ldots, N) \\ W_{ij} &\sim \text{Normal}\left(X_i, \sigma_W\right) \quad (j = 1, \ldots, n) \end{align}

where Y_i is the response for the i^\mathrm{th} group and W_{ij} is the measured covariate for the j^\mathrm{th} individual in the i^\mathrm{th} group (assume for simplicity that all groups contain n individuals). The unobserved group-level covariates X_i are given a \mathrm{Normal}\left(0, \sigma_X\right) prior.
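For concreteness, here is a minimal Python sketch of the same generative process (the attached R code is not reproduced here, so all parameter values below are illustrative assumptions, not the ones used in the original simulations):

```python
import numpy as np

def simulate(N=50, n=10, alpha=1.0, beta=3.0,
             sigma_X=2.0, sigma_W=2.0, sigma_Y=0.5, seed=0):
    """Simulate micro-macro data: latent group covariates X, n noisy
    individual-level measurements W per group, and group responses Y.
    All default parameter values are illustrative assumptions."""
    rng = np.random.default_rng(seed)
    X = rng.normal(0.0, sigma_X, size=N)       # latent group-level covariates
    W = rng.normal(np.repeat(X, n), sigma_W)   # individual measurements around X_i
    Y = rng.normal(alpha + beta * X, sigma_Y)  # group-level responses
    return X, W.reshape(N, n), Y
```

The rows of `W` correspond to groups, matching the balanced-groups simplification (all groups have the same `n`).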

I am currently testing this model on simulated data (generated using the attached R code) and have noticed that certain parameter combinations result in inescapably poor behaviour in Stan. Specifically, if \sigma_W/\sqrt{n} is large relative to \sigma_Y, the posterior for the latter tends to be biased substantially upward. If \sigma_W/\sqrt{n} is large relative to \sigma_X, the NUTS chains have very low BFMIs and low effective sample sizes (oddly enough, this poor sampling behaviour usually manifests for \sigma_Y, even when the first problem is not present).

The severity of these problems depends on the magnitude of the true regression slope \beta: the larger it is, the more sensitive the model is to relative differences between the \sigma's. Conversely, for small enough \beta, the difficulties disappear.

Note that \sigma_W^2/n is the variance (and \sigma_W/\sqrt{n} the standard deviation) of the “naive covariates” \overline{W}_i = \frac{1}{n}\sum_j W_{ij}, whose use in an ordinary regression is known to result in biased (attenuated) parameter estimates. It therefore seems that the problems arise when the effect size is large and the model has to overcome a large gap between the “naive regression” and the “true regression”.
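The attenuation in the naive regression is easy to check numerically: regressing Y_i on the group means \overline{W}_i recovers roughly \lambda\beta, where \lambda = \sigma_X^2/(\sigma_X^2 + \sigma_W^2/n) is the reliability ratio, rather than the true \beta. A minimal sketch (parameter values are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
N, n = 20_000, 5
beta, sigma_X, sigma_W, sigma_Y = 3.0, 1.0, 2.0, 0.5

X = rng.normal(0.0, sigma_X, size=N)
# Group mean of n noisy replicates has sd sigma_W / sqrt(n) around X_i
Wbar = X + rng.normal(0.0, sigma_W / np.sqrt(n), size=N)
Y = rng.normal(beta * X, sigma_Y)

naive_slope = np.polyfit(Wbar, Y, 1)[0]          # OLS slope of Y on Wbar
lam = sigma_X**2 / (sigma_X**2 + sigma_W**2 / n)  # reliability ratio
# naive_slope is close to lam * beta, well below the true beta
```

This is exactly the gap the full model has to bridge, which grows with \beta and with \sigma_W^2/n.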

I’ve tried every trick in the book to resolve this: scaling the priors differently, centered/non-centered parameterizations, standardizing/not standardizing the data, using heavier-tailed or even improper priors, using a dense mass matrix, and running for more iterations with adapt_delta close to 1. The only thing that sometimes works is a zero-avoiding prior for \sigma_Y, and even that fails for many parameter configurations (and when it does work, the resulting posterior for \sigma_Y is quite biased unless the hyperparameters are chosen very carefully).

Does anyone have any experience with this type of difficulty? I’m starting to suspect that the difficulties here are inherent to the model itself, and it will be resistant to any possible fixes. But any help is greatly appreciated.

data {
  int<lower = 1> N;  // Number of groups
  int<lower = 1> M;  // Total number of covariate observations (sum of n_i's)
  vector[N] Y;       // Group-level regression responses
  vector[M] W;       // Individual-level covariate observations for all groups
  int n[N];          // Number of covariate observations per group
}

parameters {
  real alpha_raw;
  real beta_raw;
  real<lower = 0> sigma_Y_raw;
  real<lower = 0> sigma_X_raw;
  real<lower = 0> sigma_W_raw;
  vector[N] X_raw;
}

transformed parameters {
  real sigma_Y = 0.5 * sigma_Y_raw;
  real sigma_X = 2 * sigma_X_raw;
  real sigma_W = 2 * sigma_W_raw;
  vector[N] X = sigma_X * X_raw;
  real alpha = 15 * alpha_raw;
  real beta = 15 * beta_raw;
}

model {
  // Likelihoods for individual-level covariates in each group
  int pos = 1;
  for (i in 1:N) {
    segment(W, pos, n[i]) ~ normal(X[i], sigma_W);
    pos += n[i];
  }

  X_raw ~ std_normal();  // Implies X ~ normal(0, sigma_X)

  Y ~ normal(alpha + beta * X, sigma_Y);

  // Priors
  sigma_Y_raw ~ std_normal();      // Implies sigma_Y ~ normal(0, 0.5)
  sigma_X_raw ~ std_normal();      // Implies sigma_X ~ normal(0, 2)
  sigma_W_raw ~ std_normal();      // Implies sigma_W ~ normal(0, 2)
  beta_raw ~ normal(0, sigma_Y);   // Implies beta ~ normal(0, 15*sigma_Y)
  alpha_raw ~ normal(0, sigma_Y);  // Implies alpha ~ normal(0, 15*sigma_Y)
}

generated quantities {
  vector[N] fitted_values = alpha + beta * X;
}


multilevel_regression.R (1.0 KB)
scalar_micromacro_regression.stan (1.4 KB)


This reflects a model for X in which you do inference on the variability but assume a fixed mean of zero. Have you tried instead doing inference on the mean (with, for example, a zero-peaked prior)?

Thanks for the suggestion!

I just tried a few variations of X \sim \text{Normal}\left(\mu_X, \sigma_X\right) with a Normal prior on \mu_X, and unfortunately the problems persist. It’s also worth mentioning that the fixed mean of zero is consistent with the simulated data I’m using.

I suspect the problem is in the regression part of the model, namely the difficulty of inferring the “true regression” when there is a lot of within-group noise (and the effect size is large). The question now is how to overcome it.

Do you get any divergences? You don’t mention them explicitly in your post, but the number of other diagnostic issues you’ve run into makes me think you may be running into some.

If so, you could plot the divergences in parameter space to better understand where the problematic regions are.

My guess is that the model has some inherent degeneracies in the coupling of Y_i and W_{ij} that are frustrating/biasing the computation. You say that the issues seem to go away when the true value of \beta is very small, which corresponds to cases where Y_i and W_{ij} are effectively decoupled because X_i is not strongly informative of Y_i.

In the case when \beta is higher and the coupling is strong, there may not be enough information to successfully jointly estimate all the parameters.

One thing that is confusing me about your model is how you are modeling \alpha and \beta with:

\alpha \sim \textrm{Normal}(0, 15 \sigma_Y) \qquad \beta\sim \textrm{Normal}(0,15 \sigma_Y)

Y_i \sim \textrm{Normal}(\alpha + \beta X_i, \sigma_Y)

That seems somewhat awkward, with \sigma_Y affecting not only the estimation of the variability of Y_i but also the estimation of \alpha and \beta. Not an expert by any means, but that seems like it could be a problematic inferential setup.

> Do you get any divergences? You don’t mention them explicitly in your post, but the number of other diagnostic issues you’ve run into makes me think you may be running into some.
> If so, you could plot the divergences in parameter space to better understand where the problematic regions are.

Surprisingly, divergences don’t occur very often here. In runs/parameter combinations where I do get divergences, there don’t seem to be any clear patterns in terms of posterior geometry.

> My guess is that the model has some inherent degeneracies in the coupling of Y_i and W_{ij} that are frustrating/biasing the computation. You say that the issues seem to go away when the true value of \beta is very small, which corresponds to cases where Y_i and W_{ij} are effectively decoupled because X_i is not strongly informative of Y_i.

This might be the case. However, the model behaves well when \beta is large and \sigma_X \gg \sigma_Y > \sigma_W/\sqrt{n}. In that case the X_i's are strongly informative of the Y_i's (so we would expect coupling between Y_i and W_{ij}), but because the within-group covariate error is small, the “true regression” is not obfuscated.

> One thing that is confusing me about your model is how you are modeling \alpha and \beta with [conditional priors]

It’s fairly standard in the Bayesian regression literature (at least, the literature I’ve seen) to condition the scales of \alpha and \beta on \sigma_Y in this way. In my application, not doing it caused even worse model behaviour. There are also some theoretical results showing that it guarantees unimodality in a wide range of settings, but maybe with standardized data and group-level covariate noise there’s a case to be made against it.

You can think of the encoded linkage as making for priors on \alpha and \beta that are expressed as “effect sizes”.
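As a quick Monte Carlo check of that interpretation, here is a sketch (distributions taken from the Stan model above; the half-normal scale on \sigma_Y follows from its positivity constraint) showing that the implied prior on the standardized effect \beta/\sigma_Y is \mathrm{Normal}(0, 15), independent of \sigma_Y:

```python
import numpy as np

rng = np.random.default_rng(2)
S = 1_000_000
sigma_Y = 0.5 * np.abs(rng.standard_normal(S))  # sigma_Y ~ half-normal(0, 0.5)
beta_raw = rng.normal(0.0, sigma_Y)             # beta_raw ~ normal(0, sigma_Y)
beta = 15.0 * beta_raw                          # beta ~ normal(0, 15 * sigma_Y)

effect_size = beta / sigma_Y  # standardized slope, marginally Normal(0, 15)
```

So the conditional prior is equivalent to a fixed prior on the slope measured in units of the residual scale.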

Interesting. Have you been able to show that the computation is tractable under much more informative priors (you mention a zero-avoiding prior for \sigma_Y sometimes helping) or a larger dataset?

Makes sense, thank you! Pardon my ignorance.