Skew_normal pp_ckeck looks better than skew_generalized_t


I have a response variable that is substantially skewed. I initially attempted to fit a model using brms with the skew_normal() family.

brm(bf(SPT ~ rain + (rain | tag)), 
                       data = data[complete.cases(data[,c("rain", "SPT")]),],
                       save_pars = save_pars(all = TRUE),
                       iter = 5000,
                       #init = init_fun,
                       prior = c(
                         prior(student_t(3,700, 100), class = Intercept),
                         prior(student_t(3,0, 50), class = b),
                         prior(student_t(3,0, 50), class = sd)
                       ),
                       family = skew_normal(),
                       backend = "cmdstanr",
                       threads = threading(2),
                       control = list(max_treedepth = 10, adapt_delta = .999))

The pp_check didn’t look great

So I decided to try using the skewed generalized t distribution in the hopes that it would capture the shape and peak better. Details on my attempt at implementing this distribution in brms are here

When I simulate random data from a skew_generalized_t distribution, I’m able to recover the shape of my data reasonably well

testdata=rskew_generalized_t (n=10000, mu=715, sigma=125, lambda=-.9, p=2.1, q=2.10)
testdata=data.frame(testdata)

newdata = data[complete.cases(data[,c("rain", "SPT")]),]$SPT
newdata = data.frame(newdata)
colnames(newdata)="data"
colnames(testdata)="data"
newdata$Type = "Data"
testdata$Type = "simulated"
newdata=rbind(newdata,testdata)
#hist(testdata$testdata, breaks = 100)

ggplot(newdata,aes(x = data, color = Type)) + geom_density()  +
  geom_vline(xintercept = 660, color = "black") + theme_classic()

However when I actually try to fit this model in brms, the model doesn’t seem to be able to recover the original shape of the data well

init_fun <- function() {
  list(mu = rnorm(1, 0, 1),
       sigma = runif(1, 0.1, 1),
       lambda = runif(1, -0.99, 0.99),
       p = runif(1, 2, 10),
       q = runif(1, 2, 10))
}

SPT_rain_model <- brm(bf(SPT ~ rain + (rain | tag)), 
                       data = data[complete.cases(data[,c("rain", "SPT")]),],
                       save_pars = save_pars(all = TRUE),
                       iter = 8000,
                       init = init_fun,
                       prior = c(
                         prior(normal(700, 100), class = Intercept,),
                         prior(student_t(3,0, 2), class = b),
                         prior(gamma(2, .1), class = sd, group = tag, coef = Intercept),
                         prior(gamma(2, .5), class = sd, group = tag, coef = rain),
                         prior(normal(0, 150), class = sigma,lb = 0.1),
                         prior(uniform(-0.99, 0.99), class = lambda, lb = -0.99, ub = 0.99),
                         prior(gamma(2, .5), class = p, lb = 2),
                         prior(gamma(2, .5), class = q, lb = 2)
                       ),
                       family = skew_generalized_t, stanvars = stanvars,
                       #refresh = 1,
                       backend = "cmdstanr",
                       threads = threading(2),
                       control = list(max_treedepth = 10, adapt_delta = .999))

I’m not sure if the problem is my parameterization (I’ve experimented a lot with priors and so far no luck), with my helper functions, or if there is something inherent about this distribution that I am not understanding.

Does anyone have any ideas as to why the skew_normal model is able to align so much better while the skew_generalized_t model captures the peak well but is so misaligned? Any suggestions for things I can try to improve this?

Sorry to press, but I would be very grateful for some feedback from people in this community regarding this issue. Is anyone willing to give me some suggestions??

Could you perhaps tell a bit more about your variables, your assumptions about the data generating process, and the aims of your analysis? I’m thinking that the pp-checking is telling you to revise your model, and that perhaps neither a skew-normal or skew-t distribution is appriopriate.

Have you considered whether a mixture of normals might be a better fit? It might be completely off the mark - as I don’t know anything about your data or analysis. But if a subgroup with a substantially different distribution might be plausible, and relevant to your analysis, I’d give that a try perhaps?

2 Likes

Fair question and a very good point I didn’t consider well enough. I am trying to model the effect of local weather on sleep duration in our study species. I suppose there could have been some unobserved process resulting in two overlapping distributions, shorter sleep bouts and a longer sleep bouts. Our sleep bout duration data were estimated from remotely collected accelerometer data, so I didn’t actually observe any of the sleep behavior directly. The rain data is also remotely collected and in this case we are examining the effect of the total rain per sleep bout on the durations of those sleep bouts.

A mixture of gaussians does seem to look pretty good

And the pp-check for the mixture model looks a lot better than the previous models.

mix <- mixture(gaussian, gaussian)
prior = c(
  prior(normal(450, 10), class = Intercept, dpar = mu1),
  prior(normal(650, 10), class = Intercept, dpar = mu2),
  prior(normal(150, 10), class = sigma1),
  prior(normal(40, 10), class = sigma2),
  prior(student_t(3,0, 5), class = b, dpar = mu1),
  prior(student_t(3,0, 5), class = b, dpar = mu2),
  prior(gamma(2, .1), class = sd, group = tag, coef = Intercept, dpar = mu1),
  prior(gamma(2, .5), class = sd, group = tag, coef = rain, dpar = mu1),
  prior(gamma(2, .1), class = sd, group = tag, coef = Intercept, dpar = mu2),
  prior(gamma(2, .5), class = sd, group = tag, coef = rain, dpar = mu2)
)

SPT_rain_model <- brm(bf(SPT ~ rain + (rain | tag)), 
                      data = data[complete.cases(data[,c("rain", "SPT")]),],
                      save_pars = save_pars(all = TRUE),
                      iter = 6000,
                      prior = prior,
                      family = mix,
                      backend = "cmdstanr",
                      threads = threading(2))

summary(SPT_rain_model)
pp_check(SPT_rain_model, ndraws=100)

I just need to wrap my head around if I think this is biologically plausible.

Ok, I interrogated our data some more and think there really are two distributions being generated by actual behavior. There are two distributions of sleep onset times, leading to a distribution of short sleep periods and a distribution of long sleep periods. This convinces me that the mixture model is actually biologically sensible. Thanks for helping me realize this!

That’s great - good thing you took the time to revise your model. It’s really a nice example of the value of PPC and a proper workflow. Thanks for posting about your results!

1 Like