Hello all,

I’m assisting with creating/fiting a model to reflect a somewhat simple, albeit skewed, dataset. The dataset contains two variables, reflecting outcomes of two slightly different diagnostic tests: one being a shorter version of the other. The outcomes were determined to be highly correlated using a multivariate normal model, as well as a multivariate student-t distribution model. However, the data is fairly skewed, and when we examined posterior predictive samples, the model produced a larger number of samples that had values quite a bit higher than the maximum point in our actual dataset. The figure below illustrates the problem, with the red dots indicating actual datapoints, and the black representing the predicted datapoints.

It seems as though a skewed model would more accurately capture the distribution of the collected data, but Stan does not currently have this capability innately. We’ve been following another post on the forum that has worked with this problem, and are currently testing the solution that was outlined there (Multivariate Skew Normal). However, we could use some guidance on whether this is the best way to go about working with this dataset, and if so, what priors would be most appropriate to use for the parameters. Any guidance as to how to go about solving this problem would be much appreciated! For the sake of ease, I’ve posted the Stan code we’re working with below, we made a few slight additions/modifications, but credit must go to the original authors bgoodri and samreay.

```
data {
int<lower=0> n;
vector[2] x[n];
}
parameters {
vector[2] xi;
vector<lower=0>[2] omega;
vector[2] alpha;
cholesky_factor_corr[2] L;
}
model {
target += multi_normal_cholesky_lpdf(x | xi, diag_pre_multiply(omega, L));
/
for (i in 1:n)
target += normal_lcdf(dot_product(alpha, (x[i] - xi) ./ omega) | 0, 1);
xi ~ normal(0, 100);
omega ~ cauchy(0, 3);
L ~ lkj_corr_cholesky(1);
alpha ~ normal(0, 100);
}
generated quantities {
matrix[2,2] cov;
cov = diag_pre_multiply(omega, L);
}
```

Since this code is based on the multivariate skewed normal model by Azzalini et al, we’re hoping to use the function ‘rmsn’ from the ‘sn’ package to generate sample datapoints using the Stan fit, since it is based on the same distribution/model. However, we’re open to other suggestions of how to generate samples.

Thanks!