Hi, rstan and modeling pros –
I am having issue with a count time series model that contains a spline producing one-off, extremely large estimates at certain instances. Here is some background:
The data + model:
- My data is structured as a long-format time series with hourly observations at ~10 locations. At each location, I have between 15 hours 3 days to 24 hours90 days; some of the locations (LOCATION) had months of observation, and others only had a few days of observation. For each hour, I have observed data (OBSERVED), some kind of proxy for the observed data that is an incomplete measure (PROXY). In some locations, OBSERVED and PROXY are sparse.
LOCATION | HOUR | DATE | OBSERVED | PROXY |
---|---|---|---|---|
1 | 0 | 1/1/2024 | 0 | 0 |
1 | 1 | 1/1/2024 | 0 | 0 |
1 | 2 | 1/1/2024 | 0 | 0 |
1 | 3 | 1/1/2024 | 2 | 1 |
1 | 4 | 1/1/2024 | 10 | 5 |
1 | 5 | 1/1/2024 | 15 | 6 |
… | … | … | … | … |
2 | 0 | 1/5/2024 | 0 | 0 |
2 | 1 | 1/5/2024 | 0 | 0 |
2 | 2 | 1/5/2024 | 35 | 20 |
2 | 3 | 1/5/2024 | 0 | 0 |
2 | 4 | 1/5/2024 | 0 | 0 |
2 | 5 | 1/5/2024 | 1 | 0 |
… | … | … | … | … |
- I want to model the outcome variable OBSERVED as a function of PROXY. Here is what the specification looks something like:
formula ← bf(OBSERVED ~ PROXY*LOCATION + s(HOUR, by=LOCATION)).
I have included a spline term that creates new predictions for each hour that vary by location. My parameter priors are:
(intercept) \alpha \sim Normal(2,1)
(proxy parameter) \beta_{PROXY} \sim Normal(2,1), and
spline \sim Student(3,0,2.5) –
pretty standard stuff.
The problem
This model works quite well for my data in almost all locations for all hours, except at locations where there are (location-specific) outliers. Said another way: if there is an hour at a location that has a relatively high – as in relative to that location – value of PROXY, my model makes quite extreme predictions at those hours, resulting in errors of 10s of thousands. This occurred an hour where the “outlier” value was PROXY = 35 and the more typical value is PROXY = 5 for a given location; and when where the typical value was PROXY = 0 and the “outlier” is PROXY = 5 for a location. Of course the problem of course could go away if the outliers were removed, but for reasons specific to the analysis, I do not want to remove them.
Playing around with the model has led me to believe that the error could be caused by the spline responding to sparse data.
Here is what I’ve tried
- Tried wider-tailed parameter priors, which did not work well and messed with my other “good” predictions;
- Log1p() transformed the data, which resulted in smaller extreme predictions (i.e., +150 instead of +50,000), but clearly did not exactly fix with the problem;
- Increasing the penalty term on the spline to k = 15 and k = 20, which did not help;
- Creating an interaction term for LOCATION*DATE and LOCATION*HOUR in the spline, since perhaps this is an issue caused by the fact that not all observations have the same number of days and hours in the observation, but this couldn’t because the model became too nested for all my data;
So, I am flummoxed. Any ideas to produce more reasonable estimates at the hours with outlier values? Thanks in advance for any help you can offer! Let me know if I can clarify further.