# Error in robust regression model

This is my first Stan model, so please be gentle ;)
I am trying to set up a robust linear regression model using the probability density proposed by Barron in
Barron, Jonathan T. "A General and Adaptive Robust Loss Function." *Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition*, 2019.
Here is the code:

```stan
functions {
  real barron_lpdf(vector x, vector mu, real alpha, real c) {
    vector[num_elements(x)] prob;

    for (i in 1:num_elements(x)) {
      if (alpha == 2) {
        prob[i] = -0.5*((x[i]-mu[i])/c)^2;
      } else if (alpha == 0) {
        prob[i] = -log(0.5*((x[i]-mu[i])/c)^2+1);
      } else if (alpha == negative_infinity()) {
        prob[i] = exp(-0.5*((x[i]-mu[i])/c)^2) - 1;
      } else {
        prob[i] = -abs(alpha-2)/alpha*(pow(((x[i]-mu[i])/c)^2/abs(alpha-2)+1,alpha/2)-1);
      }
    }
    return sum(log(prob));
  }
}
data {
  int<lower=0> N;
  vector[N] x;
  vector[N] y;
}
parameters {
  real y0;    // intercept
  real slope;
  real<lower=0> alpha;
  real<lower=0> c;
}
model {
  y0 ~ normal(0,1);
  slope ~ normal(0,1);
  alpha ~ normal(0,1);
  c ~ normal(0,1);
  y ~ barron(y0 + slope * x, alpha, c);
}
```


The idea behind the Barron density/loss is to have a parameter (alpha) that can be tuned to control the shape of the loss. I am getting this error:

```
RuntimeError: Exception during call to services function: ValueError("Initialization failed. Rejecting
initial value: Log probability evaluates to log(0), i.e. negative infinity. Stan can't start sampling from this
initial value. Rejecting initial value: ..."), traceback: ['  File
__step\n    result = coro.send(None)\n', '  File "/Users/arrigo/opt/anaconda3/envs/warp-
env/lib/python3.10/site-packages/httpstan/services_stub.py", line 185, in call\n    future.result()\n', '  File
"/Users/arrigo/opt/anaconda3/envs/warp-env/lib/python3.10/asyncio/futures.py", line 201, in
result\n    raise self._exception.with_traceback(self._exception_tb)\n']
```


The model seems correct to me. Any ideas?

I think your issue arises because you’ve coded up the negated loss function (Equation 8, page 2) rather than the PDF (Equation 16, page 3; equation below). The cases for alpha = 2, 0, and negative infinity will never run in your particular model, so we can focus on the last catch-all calculation. The original loss function has a lower bound of 0 (when x = mu) and an upper bound of infinity. Consequently, -1 * loss ranges from 0 down to negative infinity. You then get an error when you take the log of those non-positive values (and log(0) is exactly the negative infinity the error message reports).
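To see the failure concretely, here is a Python sketch of the catch-all branch (the values of alpha, c, and the residual are illustrative; any alpha outside {2, 0, -inf} hits this branch):

```python
# What the original Stan code stores in prob[i]: minus the Barron loss
# (catch-all branch, Equation 8 of the paper).
def stored_value(r, alpha, c):
    return -abs(alpha - 2) / alpha * (((r / c) ** 2 / abs(alpha - 2) + 1) ** (alpha / 2) - 1)

print(stored_value(1.5, 1.0, 1.0))  # negative, so its log is undefined
print(stored_value(0.0, 1.0, 1.0))  # exactly zero; log(0) is -infinity in Stan
```

So every element of `prob` is at most zero before the final `log()` is applied, which is why no initial value can produce a finite log probability.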

Based on the text, the PDF requires Bessel functions of the second kind, which they approximate with a cubic Hermite spline. I’m not familiar with either of those methods, but maybe Stan has that functionality. Of course, if you pre-define alpha (e.g. pass it in the data section), then that portion of the normalizing constant can be ignored. In that case,

$$
p(x \mid \mu, \alpha, c) = \frac{1}{c\,Z(\alpha)} \exp\big(-\rho(x - \mu, \alpha, c)\big), \qquad Z(\alpha) = \text{some constant once } \alpha \text{ is fixed}
$$

$$
\Rightarrow\quad p(x \mid \mu, \alpha, c) \propto \frac{1}{c} \exp\big(-\rho(x - \mu, \alpha, c)\big)
$$

And the code would look something like this (though check my math):

```stan
functions {
  real barron_lpdf(vector x, vector mu, real alpha, real c) {
    int N = num_elements(x);
    vector[N] loss;
    vector[N] loglik;

    for (i in 1:N) {
      if (alpha == 2) {
        loss[i] = 0.5*((x[i]-mu[i])/c)^2;
      } else if (alpha == 0) {
        loss[i] = log(0.5*((x[i]-mu[i])/c)^2+1);
      } else if (alpha == negative_infinity()) {
        // density may not be valid for this loss; see page 3 ("Z(a) is divergent when a < 0")
        loss[i] = 1 - exp(-0.5*((x[i]-mu[i])/c)^2);
      } else {
        loss[i] = abs(alpha-2)/alpha*(pow(((x[i]-mu[i])/c)^2/abs(alpha-2)+1,alpha/2)-1);
      }
    }
    loglik = -loss - log(c);

    return sum(loglik);
  }
}
data {
  int<lower=0> N;
  vector[N] x;
  vector[N] y;
  real<lower=0> alpha;
}
parameters {
  real y0;    // intercept
  real slope;
  real<lower=0> c;
}
model {
  y0 ~ normal(0,1);
  slope ~ normal(0,1);
  c ~ normal(0,1);
  y ~ barron(y0 + slope * x, alpha, c);
}
```
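As a quick sanity check on the catch-all formula (a Python sketch, independent of the Stan code; the residual and scale are arbitrary), it recovers the special cases it is supposed to: near alpha = 2 it approaches the scaled squared error, and at alpha = 1 it matches the closed form sqrt((r/c)^2 + 1) - 1:

```python
import math

def barron_loss(r, alpha, c):
    # catch-all branch of Barron's rho (Equation 8); valid away from alpha in {2, 0, -inf}
    return abs(alpha - 2) / alpha * (((r / c) ** 2 / abs(alpha - 2) + 1) ** (alpha / 2) - 1)

r, c = 0.7, 1.3

# alpha -> 2: recovers the scaled squared error (Gaussian log density up to constants)
assert abs(barron_loss(r, 2 - 1e-9, c) - 0.5 * (r / c) ** 2) < 1e-6

# alpha = 1: closed form sqrt((r/c)^2 + 1) - 1, a smoothed absolute loss
assert abs(barron_loss(r, 1.0, c) - (math.sqrt((r / c) ** 2 + 1) - 1)) < 1e-12
```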


Thanks for the detailed explanation. I was hoping to avoid having to compute the PDF normalization constant (in the general case it is a Meijer G-function, good luck with that), but I should be able to use an interpolation scheme as explained in Barron’s paper.
So if I understand this correctly, the calculation of the log likelihood of a density can run into numerical problems. I wonder if there are other ways to deal with this.
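On the interpolation route: Z(alpha) is a one-dimensional integral, so you can tabulate log Z(alpha) on a grid and interpolate (the paper uses a cubic Hermite spline). A rough stdlib-only Python sketch using the trapezoid rule (the integration limits and grid size here are illustrative; in practice something like scipy.integrate.quad plus a spline would be more robust):

```python
import math

def rho(x, alpha):
    # catch-all branch of the Barron loss with c = 1
    return abs(alpha - 2) / alpha * ((x * x / abs(alpha - 2) + 1) ** (alpha / 2) - 1)

def Z(alpha, lim=60.0, n=50_000):
    # Z(alpha) = integral of exp(-rho(x, alpha)) over x, trapezoid rule on [-lim, lim].
    # Adequate for alpha in roughly [0.5, 2], where the integrand decays fast enough.
    h = 2 * lim / n
    total = 0.5 * (math.exp(-rho(-lim, alpha)) + math.exp(-rho(lim, alpha)))
    for i in range(1, n):
        total += math.exp(-rho(-lim + i * h, alpha))
    return total * h

print(Z(2 - 1e-9))  # ~ sqrt(2*pi) = 2.5066..., the Gaussian normalizer
print(Z(1.0))       # larger, since smaller alpha gives heavier tails
```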

There are a few things going on here, all of which are important.

• The paper gives a loss function that assumes some fixed value of \alpha. As @simonbrauer notes, if you want to treat \alpha as a fitted parameter, then you need to modify the loss function to include the parts of the normalization constant that depend on \alpha. I’m just commenting to add that you can intuitively see why this is the case by looking at figure 1 from the paper. If we don’t re-normalize, then no matter the data the penalty is always strictly smaller for smaller \alpha, so it’s clear that the model would tend to simply find extremely negative values of \alpha in the course of fitting. There’s no possible dataset that would encourage the model towards moderate \alpha.
• With that said, we can still ask why the program you wrote down fails to initialize. My best guess is that you are getting underflow in `-abs(alpha-2)/alpha*(pow(((x[i]-mu[i])/c)^2/abs(alpha-2)+1,alpha/2)-1)`, but without being able to check with your data this is non-obvious to me.
• When alpha is treated as a continuous parameter, you are guaranteed to never end up on any branch of the control flow other than the last one. So you can delete all of the if statements and just run the final else.
• All of this might be moot, because this likelihood has a discontinuous derivative in alpha at alpha = 2, so it’s probably going to be very hard to sample even if you resolve all of the above issues.
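The re-normalization point in the first bullet is easy to verify numerically (a Python sketch; the residual and the alpha grid are arbitrary): for any fixed residual, the un-normalized penalty strictly decreases as alpha decreases, so without the alpha-dependent part of Z(alpha) the fit is always rewarded for pushing alpha toward negative infinity.

```python
def rho(r, alpha, c=1.0):
    # catch-all branch of the Barron loss (Equation 8)
    return abs(alpha - 2) / alpha * (((r / c) ** 2 / abs(alpha - 2) + 1) ** (alpha / 2) - 1)

penalties = [rho(1.0, a) for a in (1.5, 1.0, 0.5, -1.0, -5.0)]
# strictly decreasing in alpha: every dataset "prefers" smaller alpha
assert all(p > q for p, q in zip(penalties, penalties[1:]))
print(penalties)
```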

Regarding the last point (discontinuities of the likelihood with respect to alpha): in principle one could run separate simulations with alpha constrained to an interval where there are no discontinuities, sample from these posteriors, and combine the results. @jsocolar, do you think this would work?


This would work in principle, using appropriate stacking weights for the posteriors under the different constraints, as long as you are estimating just one value of alpha for the entire dataset. Once you start fitting different alphas for different (subsets of) observations, this approach gets combinatorially hard.

Likewise, as long as you assume a single shared value of alpha for all observations, you could also treat this whole thing as a mixture of a model where alpha is greater than 2 and a model where alpha is less than 2, and estimate in one step. This approach might be simpler to implement in code and to explain methodologically, but it might also be harder to debug.