Ineffective sampling in linear regression when errors approach zero

Hi,

I’m new to Stan and Bayesian modeling in general, but I’m really liking Stan so far and learning a ton from the user’s guide and this community. I’ve been searching hard for some ideas about how to approach the problem outlined below, but haven’t found anything addressing it so far. I sincerely apologize if this has been covered elsewhere or if it’s a nonsensical question.

I am trying to construct a linear regression model to deconvolute processes in spectral data. I’m using simulated data (for which I know the noise level and the true coefficients) to test the model, and I find that the model generally does a good job of estimating the true coefficient values for noisy datasets. However, as the noise level in the data approaches zero (which is common for this type of data), I start to get warnings about low E-BFMI, and the resulting posterior distribution of coefficients poorly approximates the true coefficients.

My understanding of what’s happening is that as the noise level decreases, the noise-scale parameters that make up sigma_tot (which aggregates several different noise contributions) all have high posterior density very close to zero, which prevents the sampler from efficiently exploring the posterior distribution of the coefficients, beta_raw (similar to the Neal’s funnel example in the User’s Guide). Instead, the sampler seems to get stuck in a local maximum. In this case, though, I can’t see a way to reparameterize the model to enable efficient sampling. Below is my model code:

data {
    int<lower=0> N;
    int<lower=0> K;
    matrix[N, K] x;
    vector[N] y;
    vector[N/2] freq;
    int<lower=0> N_tilde;
    matrix[N_tilde,K] x_tilde;
    vector[N_tilde/2] freq_tilde;
}
transformed data {
    vector [N] hfr_vec = append_row(rep_vector(1,N/2), rep_vector(0,N/2));
    vector [N] induc_vec = append_row(rep_vector(0,N/2), -2*pi()*freq);
    vector [N_tilde] hfr_vec_tilde = append_row(rep_vector(1,N_tilde/2), rep_vector(0,N_tilde/2));
    vector [N_tilde] induc_vec_tilde = append_row(rep_vector(0,N_tilde/2), -2*pi()*freq_tilde);
    // magnitude sqrt(Re^2 + Im^2) of each (real, imaginary) pair in y, repeated for both halves
    vector[N] y_mod = append_row(sqrt(square(y[1:N/2]) + square(y[N/2+1:])),
                                 sqrt(square(y[1:N/2]) + square(y[N/2+1:])));
}
parameters {
    real<lower=0> hfr_raw;
    real<lower=0> induc;
    vector<lower=0>[K] beta_raw;
    real<lower=0> sigma_res_raw;
    real<lower=0> sigma_prop_raw;
    real<lower=0> sigma_mod_raw;
    vector<lower=0>[K] tau_raw;
}
transformed parameters {
    real<lower=0> hfr = hfr_raw*100;
    vector<lower=0>[K] tau = tau_raw*5;
    vector<lower=0>[K] beta = tau .* beta_raw;
    real<lower=0> sigma_res = sigma_res_raw*0.02;
    real<lower=0> sigma_prop = sigma_prop_raw*0.02;
    real<lower=0> sigma_mod = sigma_mod_raw*0.02;
    vector[N] yhat = x*beta + hfr*hfr_vec + induc*induc_vec;
    vector[N] sigma_tot = sigma_res + fabs(yhat)*sigma_prop + y_mod*sigma_mod;
}
model {
    tau_raw ~ std_normal();
    hfr_raw ~ std_normal();
    induc ~ std_normal();
    beta_raw ~ std_normal();
    y ~ normal(yhat,sigma_tot);
    sigma_res_raw ~ std_normal();
    sigma_prop_raw ~ std_normal();
    sigma_mod_raw ~ std_normal();
}

I’ve tried a few things to address this. I tried using heavy-tailed distributions (cauchy and student-t) for y (i.e. y ~ cauchy(yhat,sigma_tot)), but when sigma_tot is very small, the tails don’t help much. I’ve tried increasing adapt_t0 in the sampler control options, which helps marginally. I’ve also tried setting a minimum noise level by adding a data input, sigma_min, that puts a floor on sigma_tot to allow the sampler to escape local maxima even when the true noise level is low:

data {
    ...
    real<lower=0> sigma_min;
}
...
transformed parameters {
    ...
    vector[N] sigma_tot = sigma_min + sigma_res + fabs(yhat)*sigma_prop + y_mod*sigma_mod;
}
...

The sigma_min approach does seem to help, but in many cases it results in suboptimal resolution of the coefficients, as the inflated sigma_tot washes out the posterior density that would ideally push the estimated coefficients toward their true values.

Does anyone have ideas about how I could reparameterize the model to efficiently sample when the noise level is very low? It seems to me that this would be an issue for any regression model when noise in the outcome is low and the outcome is well-described by the predictors. Thanks for any advice or suggestions!


Sorry for taking quite long to respond.

A few thoughts that may (and may not) be related to your problem:

Usually, when you have multiple independent sources of noise, you need to add them on the variance scale (variances of independent random variables add, standard deviations don’t), e.g. you might want to try: sigma_tot = sqrt(sigma_res^2 + (yhat*sigma_prop)^2 + (y_mod*sigma_mod)^2). Note that this also gets rid of the fabs call, which in itself may cause some problems due to its discontinuous gradient.
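
In Stan that could look something like the following (a rough sketch using the variable names from your transformed parameters block; square() and sqrt() act elementwise, so everything stays vectorized):

transformed parameters {
    ...
    // add independent noise contributions on the variance scale,
    // then take the square root to get the total standard deviation
    vector[N] sigma_tot = sqrt(square(sigma_res)
                               + square(yhat*sigma_prop)
                               + square(y_mod*sigma_mod));
}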

I’m not sure what the underlying data are, but I find that when the variance/sd is considered proportional to the mean, this is often represented via a lognormal or other positive-only distribution. So why do you think a normal is a good model for your data? (Note: I have no background in signal processing or similar, so I might be misunderstanding something obvious here.)
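
Just to illustrate what I mean, a very rough sketch of a lognormal observation model is below. This only makes sense if the outcome is strictly positive (your y contains both halves of the spectrum, so it may not apply directly), and sigma_log here is just a placeholder log-scale noise parameter, not one of the parameters in your model:

model {
    ...
    // only applicable if y and yhat are strictly positive;
    // a constant sd on the log scale corresponds to noise roughly
    // proportional to the mean on the original scale
    y ~ lognormal(log(yhat), sigma_log);
}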

I would definitely look at some pairs plots, or inspect bivariate plots in ShinyStan, from the problematic fit; those might provide some hints as to what is going wrong. (You should find a bunch of posts on the forums about those topics if you struggle with how to make them.)

Hope that helps and best of luck with your model!

Hi Martin, thanks so much for your reply, and for your advice about summing on the variance scale and adjusting the error distributions. I think you are probably right that the distributions for the proportional errors should be lognormal or something similar.

I made a stripped-down version of the model with only one normal error contribution, and no variable scales on the coefficients:

data {
    int<lower=0> N;
    int<lower=0> K;
    matrix[N, K] x;
    vector[N] y;
    vector[N/2] freq;
    int<lower=0> N_tilde;
    matrix[N_tilde,K] x_tilde;
    vector[N_tilde/2] freq_tilde;
}
transformed data {
    vector [N] hfr_vec = append_row(rep_vector(1,N/2), rep_vector(0,N/2));
    vector [N] induc_vec = append_row(rep_vector(0,N/2), 2*pi()*freq);
    vector [N_tilde] hfr_vec_tilde = append_row(rep_vector(1,N_tilde/2), rep_vector(0,N_tilde/2));
    vector [N_tilde] induc_vec_tilde = append_row(rep_vector(0,N_tilde/2), 2*pi()*freq_tilde);
}
parameters {
    real<lower=0> hfr;
    real<lower=0> induc;
    vector<lower=0>[K] beta;
    real<lower=0> sigma;
}
model {
    hfr ~ normal(0,1000);
    induc ~ std_normal();
    beta ~ normal(0,5);
    y ~ normal(x*beta + hfr*hfr_vec + induc*induc_vec, sigma);
    sigma ~ std_normal();
}
generated quantities {
    vector[N_tilde] y_tilde
        = x_tilde*beta + hfr*hfr_vec_tilde + induc*induc_vec_tilde;
}

I still run into the same problem when I run this model on noiseless simulated data. Below is a plot of the posterior mean coefficient values compared to the true coefficient values:

[Figure: posterior mean coefficient values plotted against the true coefficient values]

While the posterior mean coefficient values are generally scattered around the true values, many of them are far off. The trace plot shows that all four chains (correctly) determined that sigma is essentially zero.

If I look at a pairplot of any two coefficients that are close to each other, I can see that each chain gets stuck in its own small local neighborhood. The chains do not mix:

[Figure: pair plot of two neighboring coefficients; each of the four chains occupies its own small, separate region]

However, if I simulate data with some artificial noise, the sampling is much more effective, the chains mix, and the posterior mean coefficient values are much closer to the true values. I presume this is because, when sigma is very small, the energy barrier is too large for the sampler to escape local minima. I wonder if part of the problem is that neighboring covariates in x tend to be highly correlated.