Hi All,

I have (log) apartment prices along with apartment sizes (sq. meters) for 12 different neighborhoods that I’d like to model using a hierarchical regression. For the most part, the data seem to follow a linear or quadratic trend, except for a few outliers that I’d also like to model (see image below). When the outliers do occur, they fall far below the predicted regression mean, usually, but not always, at one of three specific levels (Mil1, Mil2, and Mil3 below). I think those are tax milestones, e.g. 1 million is the highest you can sell for without incurring any taxes. I have also found some decent predictors (not pictured) that help identify when a sale will be an outlier, e.g. when the economy is bad there are more of these outliers (perhaps people getting worried and trying to liquidate their investments).

I’m looking for ideas for a sensible generative process to describe this data. My ultimate goal is to get the predictive distribution for new houses. One thing I’ve tried is modeling the data as bimodal (code below). One mode comes from the regression function, and the other is a high-variance normal distribution with a much lower mean, e.g. 14. The problem with this model is that the mixture probability is the same for all points given the economy variable (z). This means that the huge outliers affect the regression fit just as much as the small outliers that may not even be outliers. I’m thinking I need more of a k-means-style assignment probability that is separate for each point, so that the large outliers can only affect the outlier distribution and not the non-outlier distribution. Does anyone have better ideas or tips for doing that?

```
data {
  int<lower=0> N;
  int<lower=0> K;
  int<lower=0> L;
  matrix[N, K] x; // basic housing predictors
  matrix[N, L] z; // predictors of outlier
  real y[N];
}
transformed data {
  matrix[N, K] Q_ast;
  matrix[K, K] R_ast;
  matrix[K, K] R_ast_inverse;
  // thin and scale the QR decomposition
  Q_ast = qr_Q(x)[, 1:K] * sqrt(N - 1);
  R_ast = qr_R(x)[1:K, ] / sqrt(N - 1);
  R_ast_inverse = inverse(R_ast);
}
parameters {
  vector[L] gamma;
  vector[K] theta; // regression coefficients for non-outliers (includes intercept)
  real<lower=0, upper=0.5> sigma; // std. error for non-outliers
  real<lower=12, upper=15> mu_outlier;
  real<lower=0, upper=0.5> sigma_outlier;
}
model {
  gamma[1] ~ normal(-7.64, 10.0);
  gamma[2] ~ normal(6.72, 10.0);
  //mu_outlier ~ normal(13.5, 1.0);
  theta[1] ~ normal(15.51, 0.01);
  theta[2] ~ normal(0.19, 0.01);
  theta[3] ~ normal(-0.01, 0.01);
  //sigma ~ normal(0.1, 0.01);
  //sigma_outlier ~ gamma(10, 20);
  for (n in 1:N) {
    target += log_mix(inv_logit(z[n] * gamma),
                      normal_lpdf(y[n] | mu_outlier, sigma_outlier),
                      normal_lpdf(y[n] | Q_ast[n] * theta, sigma));
  }
}
generated quantities {
  vector[K] beta;
  beta = R_ast_inverse * theta; // coefficients on x
}
```
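
To make concrete what I mean by a per-point assignment: I realize the marginalized mixture above already implies a soft membership for each observation, and it can be read off by extending the generated quantities block. Here is a sketch (same variable names as the model above; `prob_outlier` is just an illustrative name) of the posterior responsibility of the outlier component for each point:

```
generated quantities {
  vector[K] beta = R_ast_inverse * theta; // coefficients on x
  vector[N] prob_outlier;                 // per-point posterior prob. of being an outlier
  for (n in 1:N) {
    // log joint density under each component, weighted by the mixing probability
    real lp_out = log_inv_logit(z[n] * gamma)
                  + normal_lpdf(y[n] | mu_outlier, sigma_outlier);
    real lp_reg = log1m_inv_logit(z[n] * gamma)
                  + normal_lpdf(y[n] | Q_ast[n] * theta, sigma);
    prob_outlier[n] = exp(lp_out - log_sum_exp(lp_out, lp_reg));
  }
}
```

What I’m unsure about is whether these responsibilities are enough, or whether the likelihood itself needs to change so that extreme points can’t pull on the regression component at all.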