Ineffective sampling in linear regression when errors approach zero

Hi,

I’m new to Stan and Bayesian modeling in general, but I’m really liking Stan so far and learning a ton from the user’s guide and this community. I’ve been searching hard for some ideas about how to approach the problem outlined below, but haven’t found anything addressing it so far. I sincerely apologize if this has been covered elsewhere or if it’s a nonsensical question.

I am trying to construct a linear regression model to deconvolute processes in spectral data. I’m using simulated data (for which I know the noise level and the true coefficients) to test the model, and I find that the model generally does a good job of estimating the true coefficient values for noisy datasets. However, as the noise level in the data approaches zero (which is common for this type of data), I start to get warnings about low E-BFMI, and the resulting posterior distribution of the coefficients poorly approximates the true coefficients.

My understanding of what’s happening is that as the noise level decreases, the noise contributions aggregated into sigma_tot all have high posterior density very close to zero, which prevents the sampler from efficiently sampling the posterior distribution of the coefficients, beta_raw (similar to the Neal’s funnel example in the User’s Guide). Instead, the sampler seems to get stuck in a local maximum. In this case, though, I can’t see a way to reparameterize the model to enable efficient sampling. Below is my model code:

data {
    int<lower=0> N;
    int<lower=0> K;
    matrix[N, K] x;
    vector[N] y;
    vector[N/2] freq;
    int<lower=0> N_tilde;
    matrix[N_tilde,K] x_tilde;
    vector[N_tilde/2] freq_tilde;
}
transformed data {
    vector [N] hfr_vec = append_row(rep_vector(1,N/2), rep_vector(0,N/2));
    vector [N] induc_vec = append_row(rep_vector(0,N/2), -2*pi()*freq);
    vector [N_tilde] hfr_vec_tilde = append_row(rep_vector(1,N_tilde/2), rep_vector(0,N_tilde/2));
    vector [N_tilde] induc_vec_tilde = append_row(rep_vector(0,N_tilde/2), -2*pi()*freq_tilde);
    vector [N] y_mod;
    y_mod = append_row(sqrt(square(y[1:N/2]) + square(y[N/2+1:])),sqrt(square(y[1:N/2]) + square(y[N/2+1:])));
}
parameters {
    real<lower=0> hfr_raw;
    real<lower=0> induc;
    vector<lower=0>[K] beta_raw;
    real<lower=0> sigma_res_raw;
    real<lower=0> sigma_prop_raw;
    real<lower=0> sigma_mod_raw;
    vector<lower=0>[K] tau_raw;
}
transformed parameters {
    // rescale the raw parameters to their natural scales
    real<lower=0> hfr = hfr_raw*100;
    vector<lower=0>[K] tau = tau_raw*5;
    vector<lower=0>[K] beta = tau .* beta_raw;
    real<lower=0> sigma_res = sigma_res_raw*0.02;
    real<lower=0> sigma_prop = sigma_prop_raw*0.02;
    real<lower=0> sigma_mod = sigma_mod_raw*0.02;
    // linear predictor and aggregated noise scale
    vector[N] yhat = x*beta + hfr*hfr_vec + induc*induc_vec;
    vector[N] sigma_tot = sigma_res + fabs(yhat)*sigma_prop + y_mod*sigma_mod;
}
model {
    tau_raw ~ std_normal();
    hfr_raw ~ std_normal();
    induc ~ std_normal();
    beta_raw ~ std_normal();
    y ~ normal(yhat,sigma_tot);
    sigma_res_raw ~ std_normal();
    sigma_prop_raw ~ std_normal();
    sigma_mod_raw ~ std_normal();
}

I’ve tried a few things to address this. I tried using heavy-tailed distributions (Cauchy and Student-t) for y (i.e. y ~ cauchy(yhat, sigma_tot)), but when sigma_tot is very small, the tails don’t help much. I’ve tried increasing adapt_t0 in the sampler control options, which helps marginally. I’ve also tried setting a minimum noise level by adding a data input, sigma_min, that puts a floor on sigma_tot so that the sampler can escape local maxima even when the true noise level is low:

data {
    ...
    real<lower=0> sigma_min;
}
...
transformed parameters {
    ...
    vector[N] sigma_tot = sigma_min + sigma_res + fabs(yhat)*sigma_prop + y_mod*sigma_mod;
}
...

The sigma_min approach does seem to help, but in many cases it results in suboptimal resolution of the coefficients, as the inflated sigma_tot washes out the posterior density that would ideally push the estimated coefficients toward their true values.

Does anyone have ideas about how I could reparameterize the model to efficiently sample when the noise level is very low? It seems to me that this would be an issue for any regression model when noise in the outcome is low and the outcome is well-described by the predictors. Thanks for any advice or suggestions!


Sorry for taking so long to respond.

A few thoughts that may (and may not) be related to your problem:

Usually, when you have multiple sources of noise, you need to add them on the variance scale (variances of independent contributions add, while standard deviations do not), e.g. you might want to try: sigma_tot = sqrt(sigma_res^2 + (yhat*sigma_prop)^2 + (y_mod*sigma_mod)^2). (Note that this also gets rid of the fabs call, which in itself may cause some problems due to its discontinuous gradient.)
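In the transformed parameters block of your model, that suggestion would look something like this (just a sketch adapted from your code above, not tested):

transformed parameters {
    ...
    // sum the independent noise contributions on the variance scale;
    // square() makes the fabs(yhat) call unnecessary
    vector[N] sigma_tot = sqrt(square(sigma_res)
                               + square(yhat * sigma_prop)
                               + square(y_mod * sigma_mod));
}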

Not sure what the underlying data are, but I find that when the variance/sd is considered proportional to the mean, this is often represented via a lognormal or other positive-only distribution. So - why do you think normal is a good model for your data? (Note: I have no background in signal processing or similar, so I might be misunderstanding something obvious here.)
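For concreteness (and only if both y and the fitted values were strictly positive, which may well not hold for your spectra), a proportional-noise observation model of that kind might look roughly like this sketch:

model {
    // multiplicative noise: for small sigma_prop it acts roughly like a coefficient of variation
    y ~ lognormal(log(yhat), sigma_prop);
}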

I would definitely look at some pairs plots or inspect bivariate plots in ShinyStan from the problematic fit; those might provide some hints as to what is going wrong (you should find a bunch of posts on the forums about those topics if you struggle with how to get them).

Hope that helps and best of luck with your model!


Hi Martin, thanks so much for your reply, and for your advice about summing on the variance scale and adjusting the error distributions. I think you are right that the proportional error contributions would probably be better represented by a lognormal or something similar.

I made a stripped-down version of the model with only one normal error contribution, and no variable scales on the coefficients:

data {
    int<lower=0> N;
    int<lower=0> K;
    matrix[N, K] x;
    vector[N] y;
    vector[N/2] freq;
    int<lower=0> N_tilde;
    matrix[N_tilde,K] x_tilde;
    vector[N_tilde/2] freq_tilde;
}
transformed data {
    vector [N] hfr_vec = append_row(rep_vector(1,N/2), rep_vector(0,N/2));
    vector [N] induc_vec = append_row(rep_vector(0,N/2), 2*pi()*freq);
    vector [N_tilde] hfr_vec_tilde = append_row(rep_vector(1,N_tilde/2), rep_vector(0,N_tilde/2));
    vector [N_tilde] induc_vec_tilde = append_row(rep_vector(0,N_tilde/2), 2*pi()*freq_tilde);
}
parameters {
    real<lower=0> hfr;
    real<lower=0> induc;
    vector<lower=0>[K] beta;
    real<lower=0> sigma;
}
model {
    hfr ~ normal(0,1000);
    induc ~ std_normal();
    beta ~ normal(0,5);
    y ~ normal(x*beta + hfr*hfr_vec + induc*induc_vec, sigma);
    sigma ~ std_normal();
}
generated quantities {
    vector[N_tilde] y_tilde
        = x_tilde*beta + hfr*hfr_vec_tilde + induc*induc_vec_tilde;
}

I still run into the same problem when I run this model on noiseless simulated data. Below is a plot of the posterior mean coefficient values compared to the true coefficient values:

[image: posterior mean coefficient values vs. true coefficient values]

While the posterior mean coefficient values are generally distributed around the true values, they are often far from the true values. The trace plot shows that all four chains (correctly) determined that sigma is essentially zero.

If I look at a pairplot of any two coefficients that are close to each other, I can see that each chain gets stuck in its own small local neighborhood. The chains do not mix:

[image: pairs plot of two neighboring coefficients; each chain occupies its own small, separate region]

However, if I simulate data with some artificial noise, the sampling is much more effective, the chains mix, and the posterior mean coefficient values are much closer to the true values. I presume this is because, when sigma is very small, the energy barrier is too large for the sampler to escape local minima. I wonder if part of the problem is that neighboring covariates in X tend to be highly correlated.

I think you might simply be hitting issues with floating-point arithmetic: large means with very small sigma can be problematic by themselves. Also, with a very small sigma, you are leaving only a very narrow peak in the posterior.

Also, you have a strong data-prior conflict: with sigma ~ std_normal() the probability of sigma < 0.001 is around 0.00093, and for sigma < 0.0001 it is around 7.61e-06. This basically means your model (prior) is heavily mismatched to the data, which is at least part of what is causing the problems.

Overall, you are leaving the model in a situation where it is difficult to move: it needs to make extremely small steps, and even those small steps still have a large chance of changing the likelihood a lot (due both to the actual geometry of the posterior and to possible floating-point inaccuracies).

If you expect such small noise, I am not sure Bayesian methods are even a sensible approach - why not just do a maximum likelihood / least squares fit?


That makes sense. I set sigma ~ std_normal() just for the simpler model; before I had it as sigma_raw ~ std_normal() with sigma = sigma_raw * 0.02, although the probability of sigma values approaching zero would still be tiny. Would you recommend something like an exponential or gamma distribution with more density close to zero?
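For concreteness, I mean something like this (the rate of 1 is just a placeholder):

model {
    // exponential prior: density is highest at zero and decays monotonically
    sigma ~ exponential(1);
}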

The problem may be a bit odd for a Bayesian approach. In practice, the measurement noise level varies a lot depending on the experimental setup (i.e. the instrumentation, the sample under study, operating conditions, etc.). Some measurements are close to noiseless, some have very high noise, and others have noise levels that vary with measurement frequency.

The deconvolution that I am trying to achieve with a Bayesian model is very ill-posed, and has been addressed previously with a hierarchical penalized maximum likelihood approach. The model for that looks like:

\begin{align}
\mathbf{y} &\sim Normal(\mathbf{X} \boldsymbol\beta, \boldsymbol\sigma) \\
\mathbf{L} \boldsymbol\beta &\sim Normal(0, \boldsymbol\lambda) \\
\boldsymbol\lambda &\sim Gamma(a, b)
\end{align}

Here \mathbf{L} is a differentiation matrix such that \mathbf{L} \boldsymbol\beta gives a vector quantifying the smoothness of the solution. This approach works fairly well (for simulated data) if a and b are carefully tuned, but the problem is that in practice we don’t know the optimal values of a and b, nor is there a good way to determine them. Cross-validation tends to suggest a and b values that significantly under-regularize the solution. Because the problem is very ill-posed, regularization is required even in the noiseless case.
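Written as a Stan model, that formulation would translate roughly to the sketch below (untested, stripped of the hfr/induc terms, with \mathbf{L}, a, and b assumed to be passed in as data):

data {
    ...
    matrix[K, K] L;    // differentiation matrix
    real<lower=0> a;   // gamma hyperparameters
    real<lower=0> b;
}
parameters {
    vector<lower=0>[K] beta;
    vector<lower=0>[K] lambda;
    real<lower=0> sigma;
}
model {
    lambda ~ gamma(a, b);
    // smoothness prior; L*beta is a linear transform of beta, so no Jacobian adjustment is needed
    L * beta ~ normal(0, lambda);
    y ~ normal(x * beta, sigma);
}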

I thought that a Bayesian approach might be helpful for avoiding this burden of calibration and improving noise tolerance. From what I’ve seen with my testing so far, it actually works really nicely, except for this issue with low-noise data. I’ve been working on a penalized maximum likelihood model in parallel, in which I am trying to address the calibration issue by placing additional priors on the gamma parameters a and b, but that is proving to be tricky.

I’m not a data scientist/statistician by any means (though very interested), so I greatly appreciate your help! It’s helpful to know that I may simply be trying to do something that doesn’t make sense.


That might actually be sensible in this case, but I’m not sure.

I would also try supplying initial values for the parameters (e.g. from a maximum-likelihood fit); those sometimes help when the geometry is difficult.

I would also try saving the predictor (here vector[N] mu = x*beta + hfr*hfr_vec + induc*induc_vec) in transformed parameters and inspecting how it behaves during sampling. In particular, if both y and mu are large, then y - mu (which is the first thing computed internally in normal_lpdf) can lose a lot of precision, which might be numerically problematic when sigma is small.
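In the simplified model above, that would look something like this (sketch):

transformed parameters {
    // save the linear predictor so it is recorded with the posterior draws
    vector[N] mu = x*beta + hfr*hfr_vec + induc*induc_vec;
}
model {
    ...
    y ~ normal(mu, sigma);
}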

And if mu is almost the same for all chains while the individual coefficients don’t mix, you know you are facing a true non-identifiability, not an artifact.

Finally I would try to simplify the model even further and fit it to data simulated exactly from the model (i.e. drawing all parameters from their respective priors).

Also, if the only problem you are having is with those ultra-low-noise experimental setups, then maybe a non-ideal but pragmatic solution would be to preprocess the data by adding a small amount of noise. Even if you don’t know beforehand which experiments are low-noise, you can just add it everywhere; adding a small amount of noise to very noisy setups shouldn’t hurt much.
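If you wanted to do that inside Stan rather than in your data preparation script, one way (just a sketch; jitter_sd would be a new data input) is to jitter y once in transformed data:

data {
    ...
    real<lower=0> jitter_sd;
}
transformed data {
    // add a small amount of noise to the outcome once, before sampling
    vector[N] y_jit;
    for (n in 1:N)
        y_jit[n] = y[n] + normal_rng(0, jitter_sd);
}

and then fit the model to y_jit instead of y.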

I’d also ask @Max_Mantei for a second opinion as I feel I am a bit out of my depth.


Heyhey!

I’m not familiar with the model that you have coded up. So take everything I say with a shovel of salt.

The situation you described (bad sampling behavior in low- or no-noise scenarios) regularly comes up on this forum. And with \sigma as low as 0.000005 (remember, that means variance is \sigma^2 = 0.000005^2 = 0.000000000025) it’s not too surprising that Stan runs into problems.

Did you try it this way? It would then be pretty easy to build the non-centered parameterization for the \beta's, at least. What would \mathbf{L} look like?
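As a reminder of the general pattern (not specific to your model, and ignoring the positivity constraint on \beta for the moment), a non-centered parameterization looks something like:

parameters {
    vector[K] beta_raw;    // unit-scale parameter
    real<lower=0> lambda;
}
transformed parameters {
    // non-centered: the scale enters through the transform, not the prior
    vector[K] beta = lambda * beta_raw;
}
model {
    beta_raw ~ std_normal();    // together with the transform, implies beta ~ normal(0, lambda)
    lambda ~ gamma(a, b);       // a, b assumed to be passed as data
}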

If you want to, you can also upload your data here (csv format) so it’s a bit easier for me to see what’s going on. :)

Cheers!
Max


Just to clarify a bit, I think Max is referring to using the optimizing mode of Stan, which indeed might be worth trying.


I did try placing a prior on \mathbf{L}\boldsymbol\beta in sampling mode, but it didn’t fully resolve issues at low noise until the solution had become over-smoothed. However, that was pretty early in my attempts, and I think it’s worth another try now that I have a somewhat better understanding of what’s going on.

I did also try out the optimizing method, but it didn’t seem to give me good results (whether or not there was noise). I don’t think I tried it with the smoothing \mathbf{L}\boldsymbol\beta prior though, so I’ll look at that again.

I think this is definitely part of the problem, although the much-improved performance when there is significant noise suggests that there is some degree of identifiability that is masked by sampling issues (since the problem should become less identifiable as the noise level increases). My covariates are highly correlated - each one actually represents the contribution of a radial basis function in time constant space to the response in frequency space, so neighboring covariates (basis functions) are naturally correlated. Thus, the “spiky” set of coefficients shown in my previous post matches the data nearly (but not quite) as well as the smoothly varying true coefficients. This is perhaps where a smoothing prior is necessary to improve identifiability - I just wanted to avoid it unless absolutely necessary, as it tends to oversmooth some features.

I’m going to try out these suggestions and see if I can’t make something work. I’ll report back in a couple days and provide some example data if I’m still struggling. Thanks so much for your thoughtful responses!


If L is deterministic then you could precompute XL and use a QR decomposition to decorrelate the columns of the resulting matrix.
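Roughly along the lines of the QR reparameterization in the User's Guide (a sketch only; it assumes L comes in as data and ignores the positivity constraint on the coefficients):

transformed data {
    matrix[N, K] XL = x * L;    // precomputed once, since L is fixed
    // thin QR decomposition, scaled as in the Stan User's Guide
    matrix[N, K] Q_ast = qr_thin_Q(XL) * sqrt(N - 1);
    matrix[K, K] R_ast = qr_thin_R(XL) / sqrt(N - 1);
    matrix[K, K] R_ast_inverse = inverse(R_ast);
}
parameters {
    vector[K] theta;    // coefficients on the decorrelated columns Q_ast
    real<lower=0> sigma;
}
transformed parameters {
    vector[K] gamma = R_ast_inverse * theta;    // coefficients on the original X*L scale
}
model {
    y ~ normal(Q_ast * theta, sigma);
    ...
}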


That’s an interesting idea! \mathbf{L} is deterministic. I tried QR decomposition of \mathbf{X} itself previously, but the results didn’t make any sense to me (the coefficients were highly oscillatory, and I had trouble enforcing the \beta > 0 constraint). I read elsewhere that if we have informative priors on the location of \boldsymbol \beta, QR reparameterization may actually yield worse results. I want to eventually place more informative priors on the \beta 's themselves, so I abandoned that idea, but it’s worth trying again with \mathbf{XL}.

It took me a while to iron out most of the kinks, but I think I have a workable solution at this point. I think the issue really was with poor identifiability; neighboring \beta 's could trade off with each other with very little change in \hat{y}. Thus, I did implement the smoothing prior I mentioned, i.e.

\mathbf{L}\boldsymbol\beta \sim Normal(0,\boldsymbol\tau) ,

while allowing \tau to vary modestly:

\begin{align}
\tau_k &\sim InvGamma(\alpha,\beta) \\
\frac{\tau_k - (\tau_{k+1}+\tau_{k-1})/2}{\tau_k} &\sim Normal(0,1)
\end{align}

I found that it’s still helpful/necessary to have a floor on the noise level, i.e. \sigma_{tot} = (\sigma_{min}^2 + \sigma_{res}^2)^{1/2}, to ensure good sampling when the noise level in the data is very low, although the necessary value of \sigma_{min} is much smaller than it was without the smoothing prior.

My current model looks like this:

data {
    int<lower=0> N;
    int<lower=0> K;
    matrix[N, K] x;
    vector[N] y;
    vector[N/2] freq;
    int<lower=0> N_tilde;
    matrix[N_tilde,K] x_tilde;
    vector[N_tilde/2] freq_tilde;
    matrix[K,K] L0; //0th-order differentiation matrix
    matrix[K,K] L1; //1st-order differentiation matrix
    matrix[K,K] L2; //2nd-order differentiation matrix
    real<lower=0> sigma_min;
    real<lower=0> tau_alpha;
    real<lower=0> tau_beta;
}
transformed data {
    vector [N] hfr_vec = append_row(rep_vector(1,N/2), rep_vector(0,N/2));
    vector [N] induc_vec = append_row(rep_vector(0,N/2), 2*pi()*freq);
    vector [N_tilde] hfr_vec_tilde = append_row(rep_vector(1,N_tilde/2), rep_vector(0,N_tilde/2));
    vector [N_tilde] induc_vec_tilde = append_row(rep_vector(0,N_tilde/2), 2*pi()*freq_tilde);
}
parameters {
    real<lower=0> hfr_raw;
    real<lower=0> induc;
    vector<lower=0>[K] beta;
    real<lower=0> sigma_res_raw;
    vector<lower=0>[K] tau_raw;
    real<lower=0> d0_strength;
    real<lower=0> d1_strength;
    real<lower=0> d2_strength;
}
transformed parameters {
    real<lower=0> hfr = hfr_raw*100;
    // weighted combination of 0th-, 1st-, and 2nd-derivative magnitudes of beta,
    // used for the smoothing prior below
    vector<lower=0>[K] LB = sqrt(d0_strength*square(L0*beta) + d1_strength*square(L1*beta) + d2_strength*square(L2*beta));
    real<lower=0> sigma_res = sigma_res_raw*0.02;
    vector[N] yhat = x*beta + hfr*hfr_vec + induc*induc_vec;
    // sigma_min puts a floor on the total noise scale (summed on the variance scale)
    real<lower=0> sigma_tot = sqrt(square(sigma_min) + square(sigma_res));
    vector<lower=0>[K] tau = tau_raw * 0.15;
    // scaled relative second difference of tau, penalized toward zero below
    vector[K-2] dtau;
    for (k in 1:K-2)
        dtau[k] = 0.5*(tau[k+1] - 0.5*(tau[k] + tau[k+2]))/tau[k+1];
}
model {
    d0_strength ~ inv_gamma(5,5);
    d1_strength ~ inv_gamma(5,5);
    d2_strength ~ inv_gamma(5,5);
    tau_raw ~ inv_gamma(tau_alpha,tau_beta);
    hfr_raw ~ std_normal();
    induc ~ std_normal();
    LB ~ normal(0,tau);
    dtau ~ std_normal();
    y ~ normal(yhat,sigma_tot);
    sigma_res_raw ~ std_normal();
}
generated quantities {
    vector[N_tilde] y_tilde
        = x_tilde*beta + hfr*hfr_vec_tilde + induc*induc_vec_tilde;
}

This model runs a bit more slowly, and occasionally saturates the maximum tree depth or hits divergences, but overall the performance is much improved compared to my previous attempts. Thanks, Max and Martin, for helping me think through this and figure out a better solution!
