Ideas for modeling systematically skewed outliers?

Hi All,

I have (log) apartment prices along with apartment sizes (sq meters) for 12 different neighborhoods that I’d like to model using a hierarchical regression. For the most part, the data seems to follow a linear or quadratic trend except for a few outliers which I’d like to model (see image below). When the outliers do occur, they are far below the predicted regression mean, and usually, but not always at one of three specific levels (Mil1, Mil2, and Mil3 below). I think those are tax milestones, e.g. 1 million is the highest you can sell for without incurring any taxes. I have also found some decent predictors (not pictured) that help identify when a sale will be an outlier, e.g. when the economy is bad there are more of these outliers (maybe people getting worried and trying to liquify their investments).

I’m looking for ideas for a sensible generative process to describe this data. My ultimate goal is to get the predictive distribution of new houses. One thing I’ve tried is modeling the data as bi-modal (code below). One mode comes from the regression function, and another is a high-variance normal distribution with a much lower mean e.g. 14. The problem with this model is that the mixture probability is the same for all points given the economy variable (z). This means that the huge outliers affect the regression fit just as much the small outliers that may not even be outliers. I’m thinking I need more of a k-means type assignment probability that separate for each point so that the large outliers can only affect the outlier distribution and not the non-outlier distribution. Does anyone have any better ideas or tips for doing that?

data {
  int<lower=0> N;
  int<lower=0> K;
  int<lower=0> L;
  matrix[N, K] x;               //basic housing predictors
  matrix[N, L] z;               //predictors of outlier
  real y[N];
transformed data {
  matrix[N, K] Q_ast;
  matrix[K, K] R_ast;
  matrix[K, K] R_ast_inverse;
  // thin and scale the QR decomposition
  Q_ast = qr_Q(x)[, 1:K] * sqrt(N - 1);
  R_ast = qr_R(x)[1:K, ] / sqrt(N - 1);
  R_ast_inverse = inverse(R_ast);
parameters {
  vector[L] gamma;
  vector[K] theta;               //regression coefficients for non-outliers (include intercept)
  real<lower=0,upper=0.5> sigma;          //std. error for non-outliers
  real<lower=12, upper=15> mu_outlier;
  real<lower=0, upper=0.5> sigma_outlier;
model {
  gamma[1] ~ normal(-7.64, 10.0);
  gamma[2] ~ normal(6.72, 10.0);
  //mu_outlier ~ normal(13.5, 1.0);
  theta[1] ~ normal(15.51, 0.01);
  theta[2] ~ normal(0.19, 0.01);
  theta[3] ~ normal(-0.01, 0.01);
  //sigma ~ normal(0.1, 0.01);
  //sigma_outlier ~ gamma(10, 20);
  for (n in 1:N) {
    target += log_mix(inv_logit(z[n]*gamma),
                      normal_lpdf(y[n] | mu_outlier, sigma_outlier),
                      normal_lpdf(y[n] | Q_ast[n]*theta, sigma));
generated quantities {
  vector[K] beta;
  beta = R_ast_inverse * theta; // coefficients on x
1 Like

wouldn’t it make sense to do something multiplicative (well, additive on this log scale) where the outliers are selling at a fraction of the value they would otherwise sell at? Outliers might also just be down at the sell now price. Both of these might be outlier-type-specific.

I think mixture model would be natural here. You could use as many mixtures as you have groups in your plot. Mixture components M1, MI2, MI3 would have informative prior as they don’t have much variation and they are quite well separated from normal observations. For other outliers you could use some skewed distribution, so that you can separate reasonably well it from the normal observations.Check Stan manual for examples of coding a mixture model.


Thanks Aki. Which skewed distribution would you recommend?

It seens that OtherOutliers are not so well defined, and you might be able to model the normals and non-MI’s together with skewed t-distrbution (one mixture component less). Otherwise I recommend plotting some conditional histograms to get ideas what distribution might be useful.


So it doesn’t look like the skew-t distribution is available yet. It’s still an open issue it seems. Has anyone tried using it in a Stan model file as a custom density?

Unfortunately, none of the distributions with support on the real line that are available in Stan are skewed AND have heavy tails, so I think it’d be pretty useful. I’d be willing to take a crack at implementing it myself, if the developers think it’s a newbie-friendly issue.

Here’s a quick implementation of Skewed generalised t (see, e.g. There’s no argument checking, but in my experiments I used such constraints for parameters that it was valid all the time. The priors below are weakly informative for my specific case. In your case you might want to keep the parameter p fixed (providing skewed t distribution), and constrain lambda (l) to be negative (to constrain the long tail towards smaller values).

functions {
  real sgt_log(vector x, real mu, real s, real l, real p, real q) {
    // Skewed generalised t
    int N;
    real lz1;
    real lz2;
    real v;
    real m;
    real r;
    real out;
    N <- dims(x)[1];
    for (n in 1:N) {
      if (r<0)
      	     out<-out+log(p)-log(2*v*s*q^(1.0/p)*exp(lz1)*(fabs(r)^p /(q*(v*s)^p*(l*(-1)+1)^p)+1)^(1.0/p+q));
      	     out<-out+log(p)-log(2*v*s*q^(1.0/p)*exp(lz1)*(fabs(r)^p /(q*(v*s)^p*(l*(1)+1)^p)+1)^(1.0/p+q));
    return out;
data {
  int<lower=0> N;
  vector[N] x;
parameters {
  real mu; 
  real<lower=0> sigma; 
  real<lower=-.99,upper=0> l; 
  real<lower=1,upper=10> p; 
  real<lower=3.0/p,upper=p*50> q; 
model {
  mu ~ normal(0,9.0/sqrt(N));
  l ~ normal(0,.5);
  p ~ lognormal(log(2),1);
  q ~ gamma(2,0.1);
  x ~ sgt(mu, sigma, l, p, q);

Thanks Aki! That distribution worked like a charm! Would be nice to get it as a default distribution in Stan, especially to make it vectorizable.

I think implementing an additional distribution as an internal Stan function belongs to the “newbie-friendly” or as Stan repo issue label says “new dev friendly” category.


You can judge for yourself by looking at the non-skewed t and skewed normals:


Both these look like they way over-include; we split them apart at some point, but weren’t careful with includes. We need to clean that up!

Thank you for this code snippet! I have two questions.

  1. Is there any reason you are not using a proper prior for sigma?
  2. The number of degrees of freedom is given by 2 q. Then, following your recommendation about gamma(2, 0.1), should the prior for q not be gamma(2, 0.2)?

The reason was that it was a quick experiment and had a brief moment to share the code. I recommend to use a proper prior and add argument checking.

For my quick experiment I didn’t think carefully about the priors. If gamma(2, 0.2) better reflects your prior information, I recommend to use that.

I have not used this since the quick experiment. If you are able to do something useful with
Skewed generalised t or learn something useful about the priors, it would nice if you can post it here.