Help understanding splines and extreme estimates

Hi, rstan and modeling pros –

I am having an issue with a count time series model that contains a spline producing one-off, extremely large estimates in certain instances. Here is some background:

The data + model:

  • My data is structured as a long-format time series with hourly observations at ~10 locations. At each location, I have between 15 hours/day over 3 days and 24 hours/day over 90 days; some of the locations (LOCATION) had months of observation, and others only had a few days. For each hour, I have observed data (OBSERVED) and a proxy for the observed data that is an incomplete measure (PROXY). In some locations, OBSERVED and PROXY are sparse.
| LOCATION | HOUR | DATE | OBSERVED | PROXY |
|---|---|---|---|---|
| 1 | 0 | 1/1/2024 | 0 | 0 |
| 1 | 1 | 1/1/2024 | 0 | 0 |
| 1 | 2 | 1/1/2024 | 0 | 0 |
| 1 | 3 | 1/1/2024 | 2 | 1 |
| 1 | 4 | 1/1/2024 | 10 | 5 |
| 1 | 5 | 1/1/2024 | 15 | 6 |
| 2 | 0 | 1/5/2024 | 0 | 0 |
| 2 | 1 | 1/5/2024 | 0 | 0 |
| 2 | 2 | 1/5/2024 | 35 | 20 |
| 2 | 3 | 1/5/2024 | 0 | 0 |
| 2 | 4 | 1/5/2024 | 0 | 0 |
| 2 | 5 | 1/5/2024 | 1 | 0 |
  • I want to model the outcome variable OBSERVED as a function of PROXY. Here is roughly what the specification looks like:

formula <- bf(OBSERVED ~ PROXY * LOCATION + s(HOUR, by = LOCATION))

I have included a spline term so that the hourly pattern can vary by location. My parameter priors are:

(intercept) $\alpha \sim \mathrm{Normal}(2, 1)$,
(proxy parameter) $\beta_{\mathrm{PROXY}} \sim \mathrm{Normal}(2, 1)$, and
(spline) $\sim \mathrm{Student}(3, 0, 2.5)$

pretty standard stuff.
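In brms syntax, that's roughly the following (a sketch; assigning the Student-t to the `sds` class, the spline wiggliness SD, is my assumption about where it goes):

```r
library(brms)

# Sketch of the priors above in brms syntax; the `sds` class for the
# spline's wiggliness SD is an assumption
priors <- c(
  prior(normal(2, 1), class = Intercept),          # alpha
  prior(normal(2, 1), class = b, coef = PROXY),    # beta_PROXY
  prior(student_t(3, 0, 2.5), class = sds)         # spline SD
)
```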

The problem
This model works quite well for my data in almost all locations for all hours, except at locations where there are (location-specific) outliers. Said another way: if there is an hour at a location that has a relatively high value of PROXY – as in relative to that location – my model makes quite extreme predictions at those hours, resulting in errors in the tens of thousands. This occurred at an hour where the “outlier” value was PROXY = 35 and the more typical value was PROXY = 5 for that location, and also at a location where the typical value was PROXY = 0 and the “outlier” was PROXY = 5. Of course, the problem would go away if the outliers were removed, but for reasons specific to the analysis, I do not want to remove them.

Playing around with the model has led me to believe that the error could be caused by the spline responding to sparse data.

Here is what I’ve tried

  • Wider-tailed parameter priors, which did not help and messed with my other “good” predictions;
  • log1p()-transforming the outcome, which resulted in smaller extreme predictions (i.e., +150 instead of +50,000) but clearly did not fix the problem;
  • Increasing the penalty term on the spline to k = 15 and k = 20, which did not help;
  • Creating interaction terms for LOCATION*DATE and LOCATION*HOUR in the spline, since perhaps the issue is caused by the fact that not all locations have the same number of days and hours of observation; but this didn’t work because the model became too nested for my data;

So, I am flummoxed. Any ideas to produce more reasonable estimates at the hours with outlier values? Thanks in advance for any help you can offer! Let me know if I can clarify further.

You may need to modify the observational model to include an outlier component.

Sorry, but I am not sure what you mean. Do you mean a separate model or parameter for outliers?

For a continuous outcome, one might model the observations (conditional on location and time) as coming from a “normal” distribution (with one level of measurement noise/variability) with some probability, or from a (perhaps heavy-tailed) “outlier” distribution (with much higher measurement variability) with some lesser probability. There is probably an analog for count data, but I don’t know it off the top of my head.
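For what it's worth, here is a rough sketch of what a count-data version of that idea might look like in brms (untested; the component families, the prior, and the data frame name `dat` are all my assumptions):

```r
library(brms)

# Rough sketch (untested): a two-component count mixture in which most
# hours come from a "typical" Poisson component and occasional outlier
# hours come from a heavier-tailed negative-binomial component.
mix <- mixture(poisson, negbinomial)

fit_mix <- brm(
  OBSERVED ~ PROXY * LOCATION + s(HOUR, by = LOCATION),
  data   = dat,  # `dat` stands in for your data frame
  family = mix,
  # nudge the mixing weights toward the typical component (an assumption)
  prior  = prior(dirichlet(8, 2), class = theta)
)
```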

hey @rtpanik22 ,

I’m not sure yet about outlier approaches but I had a few questions for ya on other aspects of your model:

What likelihood function (and link) are you using? Since it’s count data, I imagine you may want Poisson(log) or similar. You mentioned transforming the data beforehand (log1p()) – it might be better to let the response family handle it instead via a generalised model.
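For example (a sketch, assuming a data frame called `dat`):

```r
library(brms)

# Letting the response family handle the counts rather than
# log1p()-transforming them beforehand (sketch; `dat` is assumed)
fit_pois <- brm(
  OBSERVED ~ PROXY * LOCATION + s(HOUR, by = LOCATION),
  data   = dat,
  family = poisson(link = "log")   # or negbinomial() if overdispersed
)
```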

I’m assuming HOUR is something like 1 to 24 (or 0 to 23?) – right? If you have data around the clock, you could tell your spline to use the cyclic cubic basis: s(HOUR, bs = 'cc'). That means the diurnal variation modeled by that spline knows to ‘wrap around’ from the last hour back to the first. It's also handy for day-of-year (1–365) and similar seasonal effects.

You can still have that cyclic smooth vary by LOCATION as a factor smooth (see ?mgcv::factor.smooth; note that brms uses mgcv to set up smooth terms) with: s(HOUR, by = LOCATION, bs = 'cc').
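In brms that might look like the sketch below (the knot endpoints are my assumption for HOUR coded 0–23; they tell the cyclic basis where the cycle joins up):

```r
library(brms)

# Sketch: cyclic smooth of HOUR varying by LOCATION. The two knot values
# are an assumption for HOUR coded 0-23; they define where the cycle
# wraps (hour 23.5 joins back to hour -0.5).
fit_cc <- brm(
  OBSERVED ~ PROXY * LOCATION + s(HOUR, by = LOCATION, bs = "cc"),
  data   = dat,
  family = poisson(),
  knots  = list(HOUR = c(-0.5, 23.5))
)
```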

Speaking of, I recommend checking out Pedersen et al. (2019) if you haven’t already. (I have to refer to it every time I deal with these sorts of GAMs now.) Reading the section on “GI models” has me thinking you could try, instead of the LOCATION fixed intercepts, a varying intercept: (1|LOCATION). There might be some alternative models for you to try as well. E.g., would it make sense that each location’s HOUR smooth (diurnal variation) relates to the others or shares some similarities? Then a factor smooth interaction might be better (bs = 'fs'); see the sketch below. These sorts of features can help keep a model from chasing extreme observations in the data through ‘shrinkage’.
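A rough sketch of that kind of model (untested; loosely following the “GS” setup in Pedersen et al. 2019, with `dat`, the family, and the cyclic knots all assumed):

```r
library(brms)

# Rough sketch (untested): one global cyclic smooth of HOUR plus
# location-level factor smooths that are shrunk toward it. LOCATION must
# be a factor. The factor smooth absorbs location-level differences, so
# the fixed LOCATION intercepts are dropped here (my assumption).
fit_fs <- brm(
  OBSERVED ~ PROXY +
    s(HOUR, bs = "cc") +
    s(HOUR, LOCATION, bs = "fs", xt = list(bs = "cc")),
  data   = dat,
  family = negbinomial(),
  knots  = list(HOUR = c(-0.5, 23.5))
)
```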

> Increasing the penalty term on the spline to k = 15 and k = 20, which did not help

btw, the k argument in smooths isn’t really a penalty term – I think of it as “allowable complexity” for that smooth. E.g., giving it k = 50 means a pretty damn complex smooth is permissible (not that it will estimate one, though – it could still estimate a straight line!), but the tradeoff is that it’s computationally much more burdensome than, say, k = 12, which still permits plenty of complex shapes. Simon Wood gives more detail at ?mgcv::choose.k. The idea is: “complex enough”.
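If you want a quick diagnostic for whether k is big enough, mgcv’s gam.check() reports a k-index per smooth. The sketch below fits the same formula directly in mgcv as a fast side-check (using it alongside a brms model is my suggestion, not something required by brms):

```r
library(mgcv)

# Quick side-check in mgcv (assumptions: same formula as the brms model,
# LOCATION coded as a factor). gam.check() prints a k-index and p-value
# per smooth; low p-values with edf near k' suggest k may be too small.
# See ?mgcv::choose.k for Simon Wood's guidance.
g <- gam(OBSERVED ~ PROXY * LOCATION + s(HOUR, by = LOCATION, k = 12),
         data = dat, family = poisson)
gam.check(g)
```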