How to effectively handle outliers in predictor/input variables

I’m trying to fit a Weibull survival model (right-censored and left-truncated, to be specific), but my question is more general. One of the input variables has a nice normal-ish distribution, except for a very long left tail. Most of the values lie between -2 and 2, but you can get a value as low as -130. This causes issues with the sampler, because the likelihood of such a data point (given my model, of course) is close to zero, so I end up with the typical log(0) = -Inf issue.

The particularly problematic part, I think, is the exponential term, which can blow up when X_i is such an extreme outlier, leading to “exp(very large negative number)”:

model {
...
  sigma = exp(-(mu + (X*beta))/alpha);
...
} 

Is the sampler supposed to be able to navigate through these types of issues? I can add a prior with longer tails for the beta parameters, such as a student_t(2, 0, 1), but that does not solve my issue.

What is the correct approach to dealing with such extreme outliers in predictor variables? Furthermore, I have many variables and don’t want to go and change them all manually, so I’m hoping for a more robust strategy than capping outliers at certain values or pulling out the usual tricks of applying complicated transforms to the predictor variables (yes, some interpretability in the end is desirable).


Do you have any sense of why the outliers happen? Can you post some histograms?

Yes, I do. I will try to share some more context, although this is based on very confidential data, hence the limited information provided so far.

You can imagine something like an account balance. For our discussion here, the largest group of individuals is, say, middle income. But there is also a small number of extremely wealthy clients with either extreme positive or extreme negative balances. So the bulk of the data follows a normal-ish distribution but with a very long tail, in some cases only to one side. Below is an example of one feature (there are about 50-60 features in total), without being too specific about the unit. I’m not sure this helps?

To give an idea, the mean is about zero, the 25th percentile is about -0.5, and the 75th percentile is about 1.2. So the values are really dense around the mean of zero.

Off the top of my head, there are a couple of ways to go about this. As you’ve seen, simply giving beta (the coefficient for the influence of X) a fat-tailed prior doesn’t help, because the core issue is that whatever the value of beta is, it is applied to all values of X, blowing up the prediction for the extreme values.

One way to handle this is to create a latent variable X_prime with the same dimensionality as X, model X ~ student_t(X_prime, ...), and then use X_prime, not X, in your computation of sigma. This instantiates a “measurement error” model, where your predictor X is modelled as a latent variable X_prime observed with student_t measurement noise.
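
For concreteness, a minimal Stan sketch of this might look like the following (single predictor shown; the names x_prime, nu_x, and tau_x and the specific priors are just placeholders, and the Weibull likelihood itself is elided):

data {
  int<lower=1> N;
  vector[N] x;                  // observed predictor with extreme outliers
}
parameters {
  vector[N] x_prime;            // latent "true" predictor values
  real<lower=1> nu_x;           // degrees of freedom of the measurement noise
  real<lower=0> tau_x;          // scale of the measurement noise
}
model {
  x_prime ~ std_normal();               // latent values assumed roughly standard normal
  nu_x ~ gamma(2, 0.1);
  tau_x ~ normal(0, 1);
  x ~ student_t(nu_x, x_prime, tau_x);  // heavy-tailed "measurement" model for the observed x
  // ... then x_prime, not x, goes into the linear predictor, e.g.
  // sigma = exp(-(mu + x_prime * beta) / alpha);
}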

Another approach would be to model not a linear relationship X*beta but instead a flexible non-linear one, where different areas of the domain of X get a different multiplier. This is most elegantly achieved with a Gaussian Process, but I suspect that your data density is such that a GP would be too compute-intensive, so you could try a Generalized Additive Model instead. You’d want to spend some time choosing basis functions; I could imagine both “more bases in the more dense regions” and “more bases in the less-dense regions” making sense (the former enables data-informed non-linearity in the data-dense regions, while the latter imposes more linearity in the data-dense regions, which serves as a kind of “mostly-linear but uncertain in the tails” model).
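
A minimal sketch of the GAM variant might look like this, assuming the basis matrix B is built outside Stan (e.g. with splines::bs() in R) and passed in as data; the names B, w, and tau_w and the simple shrinkage prior are just placeholders:

data {
  int<lower=1> N;
  int<lower=1> K;               // number of basis functions
  matrix[N, K] B;               // basis expansion of x, built outside Stan
}
parameters {
  vector[K] w;                  // basis coefficients
  real<lower=0> tau_w;          // shared scale of the coefficients
}
model {
  tau_w ~ normal(0, 1);
  w ~ normal(0, tau_w);         // ridge-style shrinkage; a random-walk prior on w is a common
                                // alternative that more directly penalizes wiggliness
  // ... then B * w replaces X * beta in the linear predictor, e.g.
  // sigma = exp(-(mu + B * w) / alpha);
}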


Thanks for the thoughtful response, Mike. I will definitely consider both options. Option 1 seems a bit more elegant; I haven’t tried this before. So I would still keep tight priors on the beta parameters and then a much wider prior distribution for the “measurement error” part to deal with the large errors, right?

Regarding Option 2, I have considered using splines before. I’m just worried about losing too much interpretability. I was thinking of simple linear basis functions, but only about 3 of them: one for the values below, say, the 10th percentile; one for the dense middle 80%; and one for the values above the 90th percentile. It is a bit rigid, but this way I can at least make the selection of knots more automated (because I have a large number of variables) and deal with negative and positive outliers separately.
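
Something like the following sketch is what I have in mind (the knots would be computed outside the model, e.g. the 10th and 90th percentiles of each variable, and passed in as data; the names k_lo, k_hi, b_left, b_mid, and b_right are just placeholders):

data {
  int<lower=1> N;
  vector[N] x;
  real k_lo;                    // lower knot, e.g. the 10th percentile of x
  real<lower=k_lo> k_hi;        // upper knot, e.g. the 90th percentile of x
}
transformed data {
  // three piecewise-linear ("hinge") bases: left tail, dense middle, right tail
  vector[N] b_left;
  vector[N] b_mid;
  vector[N] b_right;
  for (n in 1:N) {
    b_left[n]  = fmin(x[n] - k_lo, 0);          // nonzero only below the lower knot
    b_mid[n]   = fmin(fmax(x[n], k_lo), k_hi);  // x clamped to [k_lo, k_hi]
    b_right[n] = fmax(x[n] - k_hi, 0);          // nonzero only above the upper knot
  }
  // the linear predictor then uses
  //   b_left * beta_left + b_mid * beta_mid + b_right * beta_right
  // in place of x * beta, giving each region its own slope while staying continuous
}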

I guess a third option would be to see if I can find a more robust and aggressive monotonic transformation, so that I can at least still say that higher values are associated with higher/lower expected lifetimes. I used Sklearn’s RobustScaler for most variables, but I’m sure there might be better options for dealing with these extreme-valued distributions.

Ultimately I don’t want to overcomplicate the model or lose a substantial part of explainability, but I know there is no free lunch. Let me try out your suggestions!

Just a follow-up relevant to this thread: the measurement error model definitely seems to help with a lot of the model-fitting issues. However, I have two practical problems with it:

  1. The number of parameters estimated is massive, causing extremely slow fits and memory issues (CmdStan helped a lot with reducing the memory footprint, by the way, so it seems R/RStudio added some overhead, but it still runs extremely slowly). I have approximately 80k observations with about 40 variables, so I would have to estimate millions of latent parameters, which is not practical.

  2. My second practical issue is that when I actually want to score new data in the future, I don’t know which values to use for X to make predictions, because the model was fit with the latent variables X_prime rather than X itself. If a new X_{ij} = -430, its latent value might be -3, but how do I determine that? I would probably need some way to “shrink” new observed values towards the latent distribution.

Any comments or suggestions are welcome, but I thought I would just mention it here for anyone else that wants to follow the same approach.

Just had an idea for this: in R, do:

est_true_X = qnorm(rank(X) / (length(X) + 1))  # dividing by length(X) + 1 keeps the largest rank away from qnorm(1) = Inf

That is, get the percentile of each entry in X, then the value of a standard normal given that percentile.

Now, the above would be a data transform outside the model, which implies a kind of perfect certainty about the accuracy of the transform as a representation of the generative process. A more uncertainty-respecting elaboration would be to bring the transform inside the model by adding some parameters to it; for example, instead of the standard normal implied by qnorm, you could use a student-t and have the SD and df as parameters (I think it wouldn’t make much sense to let the mean be non-zero, but I might be mistaken).


Thanks, Mike, another good suggestion. I’ll be sure to try it out; this will save me from having to estimate a crazy number of parameters, and the transform is still monotone, so it will preserve that relationship. At this stage, my variables are already transformed, but I can jump between the different transforms and perhaps even consider using this one from the start again. (I did consider Sklearn’s quantile transform at the beginning of the project, but the model seemed to work better with the robust transform mentioned earlier for the particular Python library I used to fit the survival model.)