Modelling location, scale, and shape. Difficulty fitting skewed normal distribution

Hi there!

Apologies if this is a daft question, still trying to learn! I am trying to fit a distributional model in brms. I wanted to start with a smaller/easier example where I have generated the data to that I want to model. I cannot seem to get the results I expect.

I want to learn how to implement models for location scale and shape (see here and here for examples). The example data I am working with looks like this (see below for reprex).

It has been easier working with simpler models where the variance varies as a function of X. But I am having difficulty as soon as I introduce “skew” parameters.



The data

Data is generated as follows


N <- 200
X <- cbind(1, runif(N, 0, 100)) # Generate N uniform random numbers between 0-100

# Setting Distributions Params for Coefficients
B_mu <- c(10, 0.2) 
B_sigma <- c(0.1, 0.02)
# Note it is nu for gamlss.dist, but alpha for brms
B_nu <- c(2, 0.5)

# Generating Coefficients for X values
mu <- X %*% B_mu 
sigma <- exp(X %*% B_sigma)
nu <- X %*% B_nu

# Generating y-values
y <- gamlss.dist::rSN1(n=N, mu = mu, sigma = sigma, nu = nu)
sim_data <- data.frame(y, X)


lss_fm <- bf(
  y ~ X2,
  sigma ~ X2,
  alpha ~ X2

lss_brm <- brm(lss_fm, 
               data = sim_data, 
               family = skew_normal(
                 link = "identity", 
                 link_sigma = "log", 
                 link_alpha = "identity")

# Summarising fit

# Prediction
pred_df <- data.frame(X2 = seq(0, 100))
lss_post_pred <- cbind(pred_df, predict(lss_brm, newdata = pred_df))

ggplot() +
  geom_point(data=sim_data, aes(y=y, x=X2, shape="Data Point"))+
  geom_ribbon(data=lss_post_pred,aes(x=X2, ymin=Q2.5, ymax=Q97.5,fill = "Confidence Interval"), alpha=0.5)+
  scale_fill_manual("",values="grey12") +
  geom_line(data=lss_post_pred,aes(x=X2, y=Estimate, color="Estimate"))+


Model Summary:

Warning: Parts of the model have not converged (some Rhats are > 1.05). Be careful when analysing the results! We recommend running more iterations and/or setting stronger priors.
Warning: There were 2127 divergent transitions after warmup. Increasing adapt_delta above 0.95 may help. See

 Family: skew_normal 
  Links: mu = identity; sigma = log; alpha = identity 
Formula: y ~ X2 
         sigma ~ X2
         alpha ~ X2
   Data: sim_data (Number of observations: 200) 
  Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
         total post-warmup draws = 4000

Population-Level Effects: 
                Estimate Est.Error l-95% CI u-95% CI  Rhat Bulk_ESS Tail_ESS
Intercept          14.04      9.24     3.14    28.65  7.23        4       NA
sigma_Intercept    -1.73     14.42   -23.33    16.12 10.30        4       NA
alpha_Intercept     5.19     67.12   -92.74    82.94  7.82        4        4
X2                 -0.28      0.21    -0.60    -0.04  7.93        4        4
sigma_X2            0.06      0.29    -0.29     0.48 10.25        4       NA
alpha_X2           -0.10      1.31    -1.64     1.80  7.84        4       NA

Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).

If anyone has time to give advice/guidance, that would be really helpful! Thanks for developing such a fantastic tool!

To update on this. I have tried a range of things. Increasing adapt_delta, increasing the number of chains and iterations. I feel like there is something fundamental that I am missing here.

I have very little experience with skew_normal models, so take my comments with a grain of salt. However, my experience with distributional Gaussian and Student-t models is the non \mu parameters can be fickle. I see your data is a modest n =200; you may be better off simulating more like n = 500 or so. You also may need to use tighter priors than the defaults. It’s also not impossible that you may need to do something terrible like set your init values. Perhaps a good place to start is init = 0.

Thanks so much! So I ran an array of these models (each with different params). Turned out the main things that helped were:

  • Increasing number of iterations
  • init = 0
  • setting slightly stricter priors

Interestingly, increasing the sample size didn’t make that much of a difference. Thanks for the help, also your books have been a life saver!


The problem with skew normal and other three-parameter families of density functions is that in most cases data poorly informs all of the parameters at the same time. This results in large, and often complex, uncertainties between the parameters which frustrates effective quantification of the posterior distribution. Indeed adding more data often doesn’t reduce these uncertainties but rather makes them even more complex and harder to deal with – for a simple example see Section 5.2.3 of Identity Crisis.

In these cases stricter component prior modeling is often much more effective as they can restraint the individual parameters directly (at least when the domain expertise is available for informing these component prior models).


Good to know!

Additionally, the skew parameter for the skew normal distribution, in particular, is only meaningful within a certain range. I helped Paul get the SN distribution into brms, but it’s been a long time since I’ve looked at it. If I recall correctly though, the SN parameterization used in brms has a real shape parameter; however, beyond a limit (say, around \pm 2 or something), it has very little effect on the distribution. That means for a highly skewed parameter, the credible interval for alpha could be [2, a number much much larger than 2] even with a lot of data. Also it may be worth noting that if alpha is very high, it basically forms a cliff of a distribution; i.e., it may cause divergences due to a massive cliff on the ‘other side of the skew’ that is effectively a straight vertical line upward.

Edit: It may be worth trying to throw a non-identity link function on the skew parameter; my guess would be, ideally, something like inverse-tanh so that the linear variate is mapped onto [-1,1], then just multiply that sucker by 2; i.e., I’d say something that maps the real line to [-2,2] may be a good choice (or [-4,4], I cannot remember what the ‘meaningful range’ of the alpha parameter is; it was either near 2 or near 4)

1 Like

Absolutely! For example if you look at the skewness and kurtosis of the skew normal at Skew normal distribution - Wikipedia you can see how quickly they saturate as \alpha increases and the intermediate variable \delta converges towards one.

That said once you’re in that plateau there tends to be much less correlation with the other two parameters. In this case a decent containment prior on \alpha should be sufficiently to avoid any problematic from propagating to the posterior distribution. In particular I personally would stay away from explicit cutoffs because you never know when you’ll have enough data to actually resolve subtle changes in skew that require larger values of \alpha.