I have data from multiple voting intention surveys. Some individuals appear once, some many times. The data cover a period of 10 years. I want to model individual-level voting intention in response to change in GDP growth. But I also need to account for when the data were collected. I know that this is possible using gaussian processes. But these are difficult to fit and, as the data set is very large (~250k rows), not practical. Is a spline a suitable alternative?
The model would look something like this (plus some controls):
```r
brm(Vote ~ 1 + s(years) + gdp + (1 | id),
    family = bernoulli(link = "logit"))
```
where `Vote` is a dummy variable indicating whether the respondent would support the incumbent party, `years` is a continuous variable measuring years passed since the first observation in the data, `gdp` is GDP growth at the time of measurement, and `id` is a unique respondent ID.
250k rows would be too much data for an exact GP with a full-rank covariance matrix, but if there are only 10 distinct years, the covariance matrix is just 10x10 and this would be easy. There is a strong connection between GPs and splines, and some of them are equivalent. With 10 years there would be at most 10 knots, so splines are easy, too.
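As a sketch of the exact-GP route (assuming `years` takes only about 10 distinct values; `d` is a placeholder name for your data frame, which `brm` needs):

```r
library(brms)

# Exact GP over the distinct year values. brms groups identical
# covariate values by default (gr = TRUE), so with ~10 unique years
# the covariance matrix is only 10x10, which stays cheap even with
# 250k rows.
brm(Vote ~ 1 + gp(years) + gdp + (1 | id),
    family = bernoulli(link = "logit"),
    data = d)
```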
`s(years)` uses thin plate regression splines by default, which is probably an ok choice here.
If you want GPs, you can use `s(years, bs = "gp")`, which is a spline basis function representation of a GP with a Matérn covariance function whose lengthscale is fixed relative to the data range.
The problem with the spline implementation in `s()` is that it uses splines only within the data range, and a linear model for extrapolation. I haven't figured out how to make the assumed data range larger.
With that many observations and an additive effect for time, it's likely that the result is not sensitive to which non-linear / smoothing model you use, and you would probably get an indistinguishable result with an unstructured random effect for years.
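A sketch of that unstructured alternative (assuming the data frame is called `d`; since `years` is continuous, it is discretized here with `floor()` so that each of the ~10 years gets its own exchangeable intercept):

```r
library(brms)

# Treat year as a grouping factor instead of a smooth: each year
# gets its own random intercept with no ordering or smoothness
# assumption across years.
d$year_f <- factor(floor(d$years))

brm(Vote ~ 1 + gdp + (1 | year_f) + (1 | id),
    family = bernoulli(link = "logit"),
    data = d)
```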
If you have 25k individuals, this model will have 25k random effects, which makes it quite slow to sample.