I spent today playing with periodic models and I thought I’d share some observations in case anyone is interested. There are definitely some troublesome spots on which I could use some input as well.
So, first let’s generate some data in R. I’ll start out with very simple sinusoidal data where there’s no DC shift, the amplitude is 1, the phase is zero, and there’s observation noise with an SD of .1:
#load packages used
library(cmdstanr)
library(tidyverse)
# define a function to generate simple sinusoid that's defined by hz & phase
sine = function(time,hz,phase) sin(time*(2*pi)*hz - phase)
#generate data
set.seed(1)
n = 100
dat = (
tibble(
time = seq(0,10,length.out = n)
, z = sine(time,hz=1/3,phase=0)
, y = z + rnorm(n,0,.1)
)
)
#visualize
(
ggplot(dat)
+geom_line(aes(x=time,y=z))
+geom_point(aes(x=time,y=y))
)
And now let’s express a simple model where the observed data (dots in the above figure) are expressed as random normal deviates from a latent sinusoid with a phase of zero but an unknown frequency. We’ll want a prior for the frequency that avoids values too high to be well-characterized given the observed sample rate, as well as values too low to be well-characterized given the observed sample window. To do this we’ll define a min and max frequency and use a beta(2,2) prior within that range:
data{
int n ;
vector[n] y ;
vector[n] x ;
real min_diff_x ;
real diff_range_x ;
}
transformed data{
// max frequency: require at least ~5 samples per cycle given the observed sample rate
real hz_max = 1/(min_diff_x*5) ;
// min frequency: require at least one full cycle within the observed window
real hz_min = 1/diff_range_x ;
real hz_max_minus_min = hz_max - hz_min ;
vector[n] x_times_pi_times_2 = x*pi()*2 ;
}
parameters{
real<lower=0> noise ;
real<lower=0,upper=1> hz_scaled01 ;
}
transformed parameters{
real hz = hz_scaled01 * hz_max_minus_min + hz_min;
vector[n] f = sin(x_times_pi_times_2*hz) ;
}
model{
hz_scaled01 ~ beta(2,2) ;
noise ~ std_normal() ;
y ~ normal(f,noise) ;
}
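For this particular dataset those bounds work out to roughly hz_min = 1/10 = 0.1 and hz_max = 1/(5 \cdot 10/99) \approx 1.98, which comfortably bracket the data-generating frequency of 1/3.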
And now let’s use cmdstanr to compile and sample, confirm there are no issues, and finally visualize.
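Assuming the Stan code above has been saved to a file (the name sinusoid.stan below is just a placeholder), the compile step is:
# compile the Stan model defined above; file name is a placeholder
mod = cmdstanr::cmdstan_model('sinusoid.stan')
Then sampling: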
fit = mod$sample(
data = list(
n = nrow(dat)
, y = dat$y
, x = dat$time
, min_diff_x = min(diff(dat$time))
, diff_range_x = diff(range(dat$time))
)
, chains = parallel::detectCores()/2-1
, parallel_chains = parallel::detectCores()/2-1
, refresh = 0
, seed = 1
, iter_warmup = 1e4
, iter_sampling = 1e3
)
fit$cmdstan_diagnose()
draws =
(
fit$draws(
variables = 'f'
)
%>% posterior::as_draws_df()
%>% dplyr::select(-.iteration)
%>% tidyr::pivot_longer(
cols = starts_with('f')
, names_to = 'x'
)
%>% dplyr::mutate(
x = str_replace(x,fixed('f['),'')
, x = str_replace(x,fixed(']'),'')
, x = dat$time[as.numeric(x)] # map the index in f[...] back to its time value
, .chain = factor(.chain)
)
)
(
draws
%>% dplyr::group_by(
x
, .chain
)
%>% dplyr::summarise(
med = median(value)
, lo50 = quantile(value,.25)
, hi50 = quantile(value,.75)
, lo95 = quantile(value,.025)
, hi95 = quantile(value,.975)
, .groups = 'drop'
)
%>% ggplot()
+ facet_grid(.chain~.)
+ geom_line(
data = dat
, mapping = aes(
x = time
, y = z
)
)
+ geom_point(
data = dat
, mapping = aes(
x = time
, y = y
)
, alpha = .5
)
+ geom_ribbon(
mapping = aes(
x = x
, ymin = lo50
, ymax = hi50
, group = .chain
, fill = .chain
)
, alpha = .5
)
+ geom_ribbon(
mapping = aes(
x = x
, ymin = lo95
, ymax = hi95
, group = .chain
, fill = .chain
)
, alpha = .5
)
+ geom_line(
data =
(
draws
%>% dplyr::nest_by(.chain,.draw)
%>% dplyr::group_by(.chain)
%>% dplyr::slice_sample(n=10)
%>% tidyr::unnest(cols=data)
)
, mapping = aes(
x = x
, y = value
, group = .draw
, colour = .chain
)
, alpha = .5
)
)
Beautiful! Well, kinda. Notice I used 1e4 warmup iterations; that’s because the default 1e3 yielded post-warmup samples that failed the R-hat diagnostic. Inspection of the samples showed that while some chains found the data-generating values for hz and noise quickly, others had settled on higher values for both. I think this makes sense: a higher value for noise means that the latent sinusoid isn’t as influential on the data, so the sampler is free to explore more widely through the parameter space.
But the model samples so quickly that extending warmup by an order of magnitude is barely noticeable, and it seems that with that length, whatever learning about the shape of the parameter space happens during warmup (that’s still a bit of voodoo to me) lets all chains settle into the proper typical set.
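In case it’s useful, here’s a rough sketch of the kind of per-chain check I mean (nothing essential about the particulars):
# per-chain posterior means for hz & noise, to spot chains stuck at higher values
(
	fit$draws(variables = c('hz','noise'))
	%>% posterior::as_draws_df()
	%>% dplyr::group_by(.chain)
	%>% dplyr::summarise(hz = mean(hz), noise = mean(noise))
)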
Ok, so let’s add complexity by no longer assuming the phase of the data is zero. Same data as before, but now:
parameters{
real<lower=0> noise ;
real<lower=0,upper=1> hz_scaled01 ;
real<lower=0,upper=2*pi()> phase ;
}
transformed parameters{
real hz = hz_scaled01 * hz_max_minus_min + hz_min;
vector[n] f = sin(x_times_pi_times_2*hz-phase) ;
}
model{
hz_scaled01 ~ beta(2,2) ;
noise ~ std_normal() ;
y ~ normal(f,noise) ;
}
Note we now have a phase parameter constrained to the range from 0 to 2\pi (radians) with an implicit uniform prior across this range. We can use the same data as above, generated with a phase of zero, and see if we can recover that value along with the frequency as before.
Unfortunately, even with 1e4 warmup iterations, I observe that while all chains seem to settle into the right areas of the noise and hz dimensions of the parameter space, there’s bimodality among the chains in where they settle in the phase dimension, with some settling down near zero and some settling up at 2\pi. Now, in principle these regions are right next to each other, since phase should be considered a circular parameter. At first I even thought that the warnings from the diagnostics were false alarms driven by the diagnostics not being made aware that phase was a circular parameter. However, I don’t think that’s the case, since the diagnostics for f, the latent sinusoid, are also being flagged for high Rhat and low ESS. There are also warnings for hz, and I can see that while all the chains are in generally the same place, there’s still some discernible separation between them (depending on what seed is used to start sampling), so I think something about phase being in the mix is causing issues for the sampler more fundamentally.
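For anyone wanting to see it, here’s the sort of quick plot I used to look at the per-chain behaviour of phase (just a rough sketch):
# per-chain histograms of the phase posterior; the bimodality shows up as some
# chains piling up near 0 and others near 2*pi
(
	fit$draws(variables = 'phase')
	%>% posterior::as_draws_df()
	%>% ggplot()
	+ geom_histogram(aes(x = phase), bins = 50)
	+ facet_grid(.chain ~ .)
)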
One thought is that I’ve not done anything to signal to the sampler that phase is a circular parameter. As a wild last-ditch effort at achieving this, I tried giving phase a prior via phase ~ von_mises(0,0.01), which uses the explicitly-circular von Mises distribution but with a concentration value small enough to yield a pretty-much flat prior. But no dice; same behaviour as with the earlier implicit uniform. I also tried bumping the warmup iterations up as high as 1e6, but still found things unreliable.
So, the seemingly simple approach of modelling the sinusoid-generated data as sinusoidal works great if we know the data-generating phase. Possibly it would also work if we had decent prior info on the phase (e.g. seasonal data): we could center the representation of the phase variable on that value and give it a center-peaked prior to avoid the bimodal behaviour I’ve observed here. However, I’m more immediately interested in inference amidst truly-nil prior knowledge of the phase of periodic signals, so I decided to look at Gaussian Processes next.
First let’s model a GP using a non-periodic kernel and see how well it nonetheless captures this periodic data. Like the above sinusoidal model, I’ll assume an amplitude of 1 for the latent signal, and, as with the prior for hz, we should express a prior on the lengthscale parameter that avoids extremes which the sampling scheme of the observed data cannot inform.
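For reference, the covariance that cov_exp_quad() computes between two inputs is k(x, x') = \alpha^2 \exp\!\left( -\frac{(x - x')^2}{2\rho^2} \right), where \alpha is the amplitude (fixed to 1 below) and \rho is the lengthscale, so the bounds below keep \rho between the sample interval and the width of the observation window: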
functions{
// gp: computes noiseless Gaussian process
vector gp(
real lengthscale
, real [] x
, vector stdnormal
) {
int n = num_elements(x) ;
matrix[n,n] cov = cov_exp_quad(x,1.0,lengthscale);
for(i in 1:n){
cov[i,i] = cov[i,i]+1e-6 ; //jitter to avoid positive-definite warnings
}
return(cholesky_decompose(cov) * stdnormal ) ;
}
}
data{
int n ;
vector[n] y ;
real x[n];
real min_diff_x ;
real diff_range_x ;
}
transformed data{
real lengthscale_min = min_diff_x ;
real lengthscale_max = diff_range_x ;
real lengthscale_max_minus_min = lengthscale_max - lengthscale_min ;
}
parameters{
real<lower=0> noise ;
real<lower=0,upper=1> lengthscale_scaled01 ;
vector[n] gp_helper ;
}
transformed parameters{
real lengthscale = lengthscale_scaled01 * lengthscale_max_minus_min + lengthscale_min;
vector[n] f = gp(lengthscale,x,gp_helper) ;
}
model{
lengthscale_scaled01 ~ beta(2,2) ;
gp_helper ~ std_normal() ;
noise ~ std_normal() ;
y ~ normal(f,noise) ;
}
Sampling this model takes far longer than the sinusoidal model, several minutes rather than several seconds. That’s why I was motivated to work out the sinusoid model first, as I knew that a GP would be a hefty compute burden, limiting applications. But putting that aside for now, this period-naive model does at least seem to sample without issues and yields great recovery of the latent sinusoid:
However, I’d seen some neat work here by others using kernels with explicit periodic content, so I thought I’d see if I could get a minimal periodic kernel working. This meant abandoning cov_exp_quad() and the stellar accelerations the devs have built into it, but there are some pre-computation tricks from the pre-cov_exp_quad() days that help avoid slowing things down too badly.
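The kernel computed in the loop below is just the cosine kernel, k(x, x') = \cos( 2\pi \cdot hz \cdot |x - x'| ), with unit amplitude plus a bit of jitter on the diagonal: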
functions{
// pgp: computes noiseless periodic Gaussian process
vector pgp(
real hz
, vector stdnormal
, real amplitude_sq_plus_jitter
, matrix pi_times_2_times_abs_diff_x
) {
int n = rows(pi_times_2_times_abs_diff_x) ;
matrix[n,n] cov ;
for(i in 1:(n-1)){
cov[i,i] = amplitude_sq_plus_jitter ;
for(j in (i+1):n){
cov[i, j] = cos( pi_times_2_times_abs_diff_x[i,j] * hz ) ;
cov[j,i] = cov[i,j] ;
}
}
cov[n,n] = amplitude_sq_plus_jitter ;
return(cholesky_decompose(cov) * stdnormal ) ;
}
}
data{
int n ;
vector[n] y ;
vector[n] x ;
real min_diff_x ;
real diff_range_x ;
}
transformed data{
real hz_max = 1/(min_diff_x*5) ;
real hz_min = 1/diff_range_x ;
real hz_max_minus_min = hz_max - hz_min ;
real amplitude_sq_plus_jitter = 1 + 1e-3 ;
matrix[n,n] pi_times_2_times_abs_diff_x ;
for(i in 1:n){
for(j in 1:n){
pi_times_2_times_abs_diff_x[i,j] = pi() * 2 * fabs(x[i]-x[j]) ;
}
}
}
parameters{
real<lower=0> noise ;
real<lower=0,upper=1> hz_scaled01 ;
vector[n] gp_helper ;
}
transformed parameters{
real hz = hz_scaled01 * hz_max_minus_min + hz_min;
vector[n] gp = pgp(
hz
, gp_helper
, amplitude_sq_plus_jitter
, pi_times_2_times_abs_diff_x
) ;
}
model{
hz_scaled01 ~ beta(2,2) ;
gp_helper ~ std_normal() ;
noise ~ std_normal() ;
y ~ normal(gp,noise) ;
}
This periodic-kernel GP takes about as long as the regular GP to sample, has no warnings and seems to do a great job of recovering the latent sinusoid:
Now, the regular GP really should perform worse in this case, at least in the sense of having greater uncertainty in the posterior of the latent signal, because each point is informed by a smaller proportion of the other points than under the periodic kernel.
It’s difficult to see this performance difference by looking at the earlier plots, but when we compare the width of the posteriors, computed as deviations from the data-generating truth, we see the periodic GP is substantially narrower:
(
# gp_draws & pgp_draws hold the posterior draws of f from the two GP fits,
# munged into long format the same way as `draws` above
gp_draws
%>% dplyr::mutate(kind='gp')
%>% dplyr::bind_rows(
(
pgp_draws
%>% dplyr::mutate(kind='pgp')
)
)
%>% dplyr::rename(time=x)
%>% dplyr::left_join(dat)
%>% dplyr::mutate(value=value-z)
%>% dplyr::group_by(kind)
%>% dplyr::summarise(
value = diff(quantile(value,c(.025,.975)))
)
)
# A tibble: 2 x 2
kind value
<chr> <dbl>
1 gp 0.178
2 pgp 0.103
Now, that greater certainty isn’t free; we pay for it in less robustness amid non-stationary signals. So long as the lengthscale is stationary, a standard GP will be able to accommodate a broad diversity of wiggles in the data, while the periodic GP will dramatically fail in the face of most non-stationarity. Even periodic and stationary data that is merely non-sinusoidal will thwart it:
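(To make that concrete, here’s a sketch of the kind of asymmetric-but-still-periodic signal I have in mind; this is just an illustrative stand-in, not the exact generating code I used:)
# a skewed sine: still periodic & stationary, but no longer a pure sinusoid
sine_skewed = function(time,hz,phase){
	theta = time*(2*pi)*hz - phase
	sin(theta + sin(theta)) # nesting the sine skews each cycle: fast rise, slow fall
}
dat_skewed = tibble(
	time = seq(0,10,length.out = n)
	, z = sine_skewed(time, hz = 1/3, phase = 0)
	, y = z + rnorm(n, 0, .1)
)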
Which contrasts with the performance of the regular GP:
(though we do see some widening of the uncertainty relative to the truly-sinusoidal-data case, presumably driven by the fact that the asymmetric shape of the periodic element causes local variation in the lengthscale)
It’s also possible to combine the two approaches:
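(Concretely, this multiplies the cosine kernel by an exponentiated-quadratic decay, i.e. k(x, x') = \cos( 2\pi \cdot hz \cdot |x - x'| ) \cdot \exp\!\left( -\frac{(x - x')^2}{\ell} \right), with \ell the lengthscale as parameterized in the code below.)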
functions{
// pgp: computes noiseless quasi-periodic Gaussian process (periodic kernel with exp-quad decay)
vector pgp(
real hz
, real lengthscale
, vector stdnormal
, real amplitude_sq_plus_jitter
, matrix pi_times_2_times_abs_diff_x
, matrix neg_squared_diff_x
) {
int n = rows(pi_times_2_times_abs_diff_x) ;
matrix[n,n] cov ;
for(i in 1:(n-1)){
cov[i,i] = amplitude_sq_plus_jitter ;
for(j in (i+1):n){
cov[i, j] = (
cos( pi_times_2_times_abs_diff_x[i,j] * hz )
* exp( neg_squared_diff_x[i,j] / lengthscale )
) ;
cov[j,i] = cov[i,j] ;
}
}
cov[n,n] = amplitude_sq_plus_jitter ;
return(cholesky_decompose(cov) * stdnormal ) ;
}
}
data{
int n ;
vector[n] y ;
vector[n] x ;
real min_diff_x ;
real diff_range_x ;
}
transformed data{
real lengthscale_min = min_diff_x ;
real lengthscale_max = diff_range_x ;
real lengthscale_max_minus_min = lengthscale_max - lengthscale_min ;
real hz_max = 1/(min_diff_x*5) ;
real hz_min = 1/diff_range_x ;
real hz_max_minus_min = hz_max - hz_min ;
real amplitude_sq_plus_jitter = 1 + 1e-3 ;
matrix[n,n] pi_times_2_times_abs_diff_x ;
matrix[n,n] neg_squared_diff_x ;
for(i in 1:n){
for(j in 1:n){
pi_times_2_times_abs_diff_x[i,j] = 2 * pi() * fabs(x[i]-x[j]) ;
neg_squared_diff_x[i,j] = -((x[i]-x[j])^2) ;
}
}
}
parameters{
real<lower=0> noise ;
real<lower=0,upper=1> lengthscale_scaled01 ;
real<lower=0,upper=1> hz_scaled01 ;
vector[n] gp_helper ;
}
transformed parameters{
real lengthscale = lengthscale_scaled01 * lengthscale_max_minus_min + lengthscale_min;
real hz = hz_scaled01 * hz_max_minus_min + hz_min;
vector[n] f = pgp(
hz
, lengthscale
, gp_helper
, amplitude_sq_plus_jitter
, pi_times_2_times_abs_diff_x
, neg_squared_diff_x
) ;
}
model{
lengthscale_scaled01 ~ beta(2,2) ;
hz_scaled01 ~ beta(2,2) ;
gp_helper ~ std_normal() ;
noise ~ std_normal() ;
y ~ normal(f,noise) ;
}
I haven’t managed to observe much benefit from this combined approach so far, but I probably just haven’t explored enough of the parameter space.
So that’s what I’ve come up with so far. It wasn’t rigorous or thorough enough to call a case study, but I wanted to log my observations nonetheless. And I’d still love a solution to the phase issue in the first models, as they’re so much faster. Obviously GPs will give me far more flexibility, but I’d rather spend time working out proper generative models for any nuances that crop up in the data than burn CPU cycles leaving it to the GP to work everything out on its own. (In my applied work, at least.)