Help understanding splines and extreme estimates

Hi, rstan and modeling pros –

I am having an issue with a count time series model that contains a spline producing one-off, extremely large estimates in certain instances. Here is some background:

The data + model:

  • My data is structured as a long-format time series with hourly observations at ~10 locations. At each location, I have between 15 hours/day over 3 days and 24 hours/day over 90 days; some of the locations (LOCATION) had months of observation, and others only had a few days. For each hour, I have observed data (OBSERVED) and a proxy for the observed data that is an incomplete measure (PROXY). In some locations, OBSERVED and PROXY are sparse.
| LOCATION | HOUR | DATE | OBSERVED | PROXY |
|---|---|---|---|---|
| 1 | 0 | 1/1/2024 | 0 | 0 |
| 1 | 1 | 1/1/2024 | 0 | 0 |
| 1 | 2 | 1/1/2024 | 0 | 0 |
| 1 | 3 | 1/1/2024 | 2 | 1 |
| 1 | 4 | 1/1/2024 | 10 | 5 |
| 1 | 5 | 1/1/2024 | 15 | 6 |
| 2 | 0 | 1/5/2024 | 0 | 0 |
| 2 | 1 | 1/5/2024 | 0 | 0 |
| 2 | 2 | 1/5/2024 | 35 | 20 |
| 2 | 3 | 1/5/2024 | 0 | 0 |
| 2 | 4 | 1/5/2024 | 0 | 0 |
| 2 | 5 | 1/5/2024 | 1 | 0 |
  • I want to model the outcome variable OBSERVED as a function of PROXY. Here is roughly what the specification looks like:

formula <- bf(OBSERVED ~ PROXY * LOCATION + s(HOUR, by = LOCATION))

I have included a spline term so that the hourly pattern can vary by location. My parameter priors are:

(intercept) $\alpha \sim \mathrm{Normal}(2, 1)$,
(proxy parameter) $\beta_{\mathrm{PROXY}} \sim \mathrm{Normal}(2, 1)$, and
(spline) $\sim \mathrm{Student}(3, 0, 2.5)$

pretty standard stuff.
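In brms syntax, that's roughly the following (a sketch; assigning the Student-t to the `sds` class, the spline wiggliness SD, is my assumption about where it goes):

```r
library(brms)

# Sketch of the priors above in brms syntax; the `sds` class for the
# spline's wiggliness SD is an assumption
priors <- c(
  prior(normal(2, 1), class = Intercept),          # alpha
  prior(normal(2, 1), class = b, coef = PROXY),    # beta_PROXY
  prior(student_t(3, 0, 2.5), class = sds)         # spline SD
)
```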

The problem
This model works quite well for my data in almost all locations for all hours, except at locations where there are (location-specific) outliers. Said another way: if there is an hour at a location that has a relatively high value of PROXY – as in relative to that location – my model makes quite extreme predictions at those hours, resulting in errors in the tens of thousands. This occurred at an hour where the “outlier” value was PROXY = 35 and the more typical value was PROXY = 5 for that location, and also at a location where the typical value was PROXY = 0 and the “outlier” was PROXY = 5. Of course, the problem would go away if the outliers were removed, but for reasons specific to the analysis, I do not want to remove them.

Playing around with the model has led me to believe that the error could be caused by the spline responding to sparse data.

Here is what I’ve tried

  • Wider-tailed parameter priors, which did not help and messed with my other “good” predictions;
  • log1p()-transforming the outcome, which resulted in smaller extreme predictions (i.e., +150 instead of +50,000) but clearly did not fix the problem;
  • Increasing the penalty term on the spline to k = 15 and k = 20, which did not help;
  • Creating interaction terms for LOCATION*DATE and LOCATION*HOUR in the spline, since perhaps the issue is caused by the fact that not all locations have the same number of days and hours of observation; but this didn’t work because the model became too nested for my data;

So, I am flummoxed. Any ideas to produce more reasonable estimates at the hours with outlier values? Thanks in advance for any help you can offer! Let me know if I can clarify further.

You may need to modify the observational model to include an outlier component.

Sorry, but I am not sure what you mean. Do you mean a separate model or parameter for outliers?

For a continuous outcome, one might model the observations (conditional on location and time) as coming from a “normal” distribution (with one level of measurement noise/variability) with some probability, or from a (perhaps heavy-tailed) “outlier” distribution (with much higher measurement variability) with some lesser probability. There is probably an analog for count data, but I don’t know it off the top of my head.
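For what it's worth, here is a rough sketch of what a count-data version of that idea might look like in brms (untested; the component families, the prior, and the data frame name `dat` are all my assumptions):

```r
library(brms)

# Rough sketch (untested): a two-component count mixture in which most
# hours come from a "typical" Poisson component and occasional outlier
# hours come from a heavier-tailed negative-binomial component.
mix <- mixture(poisson, negbinomial)

fit_mix <- brm(
  OBSERVED ~ PROXY * LOCATION + s(HOUR, by = LOCATION),
  data   = dat,  # `dat` stands in for your data frame
  family = mix,
  # nudge the mixing weights toward the typical component (an assumption)
  prior  = prior(dirichlet(8, 2), class = theta)
)
```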

hey @rtpanik22 ,

I’m not sure yet about outlier approaches but I had a few questions for ya on other aspects of your model:

What likelihood function (and link) are you using? Since it’s count data, I imagine you may want Poisson(log) or similar. You mentioned transforming the data beforehand (log1p()) – it might be better to let the response family handle it instead via a generalised model.
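For example (a sketch, assuming a data frame called `dat`):

```r
library(brms)

# Letting the response family handle the counts rather than
# log1p()-transforming them beforehand (sketch; `dat` is assumed)
fit_pois <- brm(
  OBSERVED ~ PROXY * LOCATION + s(HOUR, by = LOCATION),
  data   = dat,
  family = poisson(link = "log")   # or negbinomial() if overdispersed
)
```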

I’m assuming HOUR is something like 1 to 24 (or 0 to 23?) – right? If you have data around the clock, you could tell your spline to use the cyclic cubic basis: s(HOUR, bs = 'cc'). That means the diurnal variation modeled by that spline knows to ‘wrap around’ from the last hour back to the first. It's also handy for day-of-year (1–365) and similar seasonal effects.

You can still have that cyclic smooth vary by LOCATION as a factor smooth (see ?mgcv::factor.smooth; note that brms uses mgcv to set up smooth terms) with: s(HOUR, by = LOCATION, bs = 'cc').
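In brms that might look like the sketch below (the knot endpoints are my assumption for HOUR coded 0–23; they tell the cyclic basis where the cycle joins up):

```r
library(brms)

# Sketch: cyclic smooth of HOUR varying by LOCATION. The two knot values
# are an assumption for HOUR coded 0-23; they define where the cycle
# wraps (hour 23.5 joins back to hour -0.5).
fit_cc <- brm(
  OBSERVED ~ PROXY * LOCATION + s(HOUR, by = LOCATION, bs = "cc"),
  data   = dat,
  family = poisson(),
  knots  = list(HOUR = c(-0.5, 23.5))
)
```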

Speaking of, I recommend checking out Pedersen et al. (2019) if you haven’t already. (I have to refer to it every time I deal with these sorts of GAMs now.) Reading the section on “GI models” has me thinking you could try, instead of the LOCATION fixed intercepts, a varying intercept: (1|LOCATION). There might be some alternative models for you to try as well. E.g., would it make sense that each location’s HOUR smooth (diurnal variation) relates to the others or shares some similarities? Then a factor smooth interaction might be better (bs = 'fs'); see the sketch below. These sorts of features can help keep a model from chasing extreme observations in the data through ‘shrinkage’.
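A rough sketch of that kind of model (untested; loosely following the “GS” setup in Pedersen et al. 2019, with `dat`, the family, and the cyclic knots all assumed):

```r
library(brms)

# Rough sketch (untested): one global cyclic smooth of HOUR plus
# location-level factor smooths that are shrunk toward it. LOCATION must
# be a factor. The factor smooth absorbs location-level differences, so
# the fixed LOCATION intercepts are dropped here (my assumption).
fit_fs <- brm(
  OBSERVED ~ PROXY +
    s(HOUR, bs = "cc") +
    s(HOUR, LOCATION, bs = "fs", xt = list(bs = "cc")),
  data   = dat,
  family = negbinomial(),
  knots  = list(HOUR = c(-0.5, 23.5))
)
```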

> Increasing the penalty term on the spline to k = 15 and k = 20, which did not help

btw, the k argument in smooths isn’t really a penalty term – I think of it as “allowable complexity” for that smooth. E.g., giving it k = 50 means a pretty damn complex smooth is permissible (not that it will estimate one, though – it could still estimate a straight line!), but the tradeoff is that it’s computationally much more burdensome than, say, k = 12, which still permits plenty of complex shapes. Simon Wood gives more detail at ?mgcv::choose.k. The idea is: “complex enough”.
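If you want a quick diagnostic for whether k is big enough, mgcv’s gam.check() reports a k-index per smooth. The sketch below fits the same formula directly in mgcv as a fast side-check (using it alongside a brms model is my suggestion, not something required by brms):

```r
library(mgcv)

# Quick side-check in mgcv (assumptions: same formula as the brms model,
# LOCATION coded as a factor). gam.check() prints a k-index and p-value
# per smooth; low p-values with edf near k' suggest k may be too small.
# See ?mgcv::choose.k for Simon Wood's guidance.
g <- gam(OBSERVED ~ PROXY * LOCATION + s(HOUR, by = LOCATION, k = 12),
         data = dat, family = poisson)
gam.check(g)
```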