I am examining my model using a posterior predictive check. I am trying to predict air conditioner power based on ambient temperature, seasonality and weather. I have 60 trips with these variables.
The distribution of air conditioner power:
In order to account for outliers, I use log transform on my response and use student t distribution.
brm_model_aircon ← brm(log(X.AirCon.Power.) ~ Weather_cloudy + Weather_cloudy + Weather_dark + Weather_dark_little_rainy + Weather_rainy + Weather_slightly_cloudy + Weather_sunny + Weather_sunrise + Weather_sunset + s(X.Time.) + s(X.Ambient.Temperature.) + s(X.Time., by = Seasonality) + s(X.Ambient.Temperature., by = Seasonality) + (1 | Vehicle_id) , train, family = student(), sample_prior = “yes”, iter = 100, warmup = 50,
cores=ncores, backend = “cmdstanr”,
threads = threading(4))
The pp check:
Thanks in advance for your help!
Can you clarify exactly what you’re needing help with? I might be missing it, but I didn’t see a question in your post. Are you asking how to check for outliers in your data?
A few things do jump out at me. Can you explain how you log-transformed your outcome? From your histogram, it looks like you have a lot of observed zero values, but in your code, it looks like you take just a simple log of the outcome. If those two points are correct, then you’re accidently getting rid a lot of your data because
What is the overall N of your data? I see that you have a healthy number of repeated observations, but I don’t see where you specify the total number of trucks observed. It doesn’t look like you’re using any priors, so just trying to get a sense for what could impact your results since you do include a lot of predictors (though maybe a lot of those are dummy coded?)
Can you share a little more about your modeling up to this point? The number of iterations and warmup samples is really small, but I also don’t know how many cores you’re using. Are you getting (or have you gotten) any convergence issues like small ESS or large R-hats?
The likelihood of the data strikes me as odd if your observed variable is restricted to be positive. Is there a reason you prefer the student-t? Is it just to be “robust” to outliers?
Thank you for your reply.
I want to ask if this kind of pp check plot normal or is due to outliers?
I add a small number (0.00001) to my response and do log transform.
I have total 89515 observations for these 60 trips. The covariate ambient temperature is time varying and the others are not time varying and dummy coded (I also include time as the covariate).
This is my prior:
prior_aircon ← c(set_prior(“student_t(3,-13.8,2.5)”, class = “Intercept”),
set_prior(“gamma(2, 0.1)”, class = “nu”),
set_prior(“student_t(3,0,2.5)”, class = “sd”),
#set_prior(“flat”, class = “b”),
set_prior(“student_t(3, 0, 2.5)”, class = “sds”),
set_prior(“student_t(3, 0, 2.5)”, class = “sigma”)
I also tried hurdle gamma on my original response (without log transformation), but the pp check plot is also weird. I suspect that it may due to the outliers, so I used student t instead.
Thanks again for your help!
Well, it’s very hard to tell what the posterior predictive check is actually showing you, so I’m not sure what the weird behavior is that you’re identifying. My guess is that the plot is so zoomed out because you have such long tails. You can try zooming in the plot a little more by changing the range of your x-axis. With brms and ggplot2 loaded, you should be able to do that via:
You may want to change the x-axis range based on your outcome’s actual data range (I’d suggest plotting a little outside of that range in both the positive and negative direction to also get an idea of what your model is predicting for those ranges). I just picked those values based on the histogram you provided, but there’s nothing special about them. This range should help you see a little better what’s actually happening in your predictions. You can also examine what kind of range you get when you run
posterior_predict(). Your model may very well predict some out-of-range values, but whether that’s a problem depends on your desired aims with the model and how frequently those kinds of predictions occur.
You may also consider just adding 1 to your outcome values: log(0+0.00001) = -5 but log(0+1) = 0, meaning that all your 0 values will stay 0 and everything else is just evenly shifted. I think that adding 1 as your constant has some greater fidelity to what your untransformed data actually are.
It is possible that there are some outlying or highly influential observations, but I’d encourage thinking of those cases as arising from a data generation process rather than as potential reasons a model performs poorly. The posterior predictive check is showing you what your model is able to predict based on what information you’ve given it (in the priors and likelihood) and what it’s learned (from the data and estimated parameters). When there’s a mismatch between those predictions and the observed data, then it can highlight that the model is not accounting for something. It could be that there is actually a mixture of regressions going on wherein some observations are estimated by one set of parameters while other observations come from some other model(s). It could also be that the population effects are inadequate for capturing the effects you’re interested in, meaning that you may consider random slopes in addition to the random intercepts to capture more heterogeneity in the effects of various predictors on your observations. It could also just be that the likelihood is incorrect or that you need some priors to help regularize your estimates and keep some influential cases from pulling your estimates to some kind of an extreme.
I don’t know that a single posterior predictive check will really tell you exactly what the next steps in model revision might be, but it should be able to give you some ideas if you can actually observe what the ppc looks like rather than seeing just a large spike with no detail as it seems you’re currently looking at.
Okay, thank you very much for your help! I will try to investigate what is the potential cause of problem.