Results from playing with periodic models

I spent today playing with periodic models and I thought I’d share some observations in case anyone is interested. There are definitely some troublesome spots on which I could use some input as well.

So, first let’s generate some data in R. I’ll start out with very simple sinusoidal data where there’s no DC shift, the amplitude is 1, the phase is zero, and there’s observation noise with an SD of 0.1:

#load packages used
library(cmdstanr)
library(tidyverse)

# define a function to generate simple sinusoid that's defined by hz & phase
sine = function(time,hz,phase) sin(time*(2*pi)*hz - phase)

#generate data
set.seed(1)
n = 100
dat = (
	tibble(
		time = seq(0,10,length.out = n)
		, z = sine(time,hz=1/3,phase=0)
		, y = z + rnorm(n,0,.1)
	)
)

#visualize
(
	ggplot(dat)
	+geom_line(aes(x=time,y=z))
	+geom_point(aes(x=time,y=y))
)

And now let’s express a simple model where the observed data (dots in the figure above) are random normal deviates from a latent sinusoid with a phase of zero but an unknown frequency. We’ll want a prior for the frequency that avoids values too high to be well-characterized given the observed sample rate, as well as values too low to be well-characterized given the observed sample window. To do this we’ll define a min and max frequency and use a beta(2,2) within that range:

data{
	int n ;
	vector[n] y ;
	vector[n] x ;
	real min_diff_x ;
	real diff_range_x ;
}
transformed data{
	real hz_max = 1/(min_diff_x*5) ;
	real hz_min = 1/diff_range_x ;
	real hz_max_minus_min = hz_max - hz_min ;
	vector[n] x_times_pi_times_2 = x*pi()*2 ;
}
parameters{
	real<lower=0> noise ;
	real<lower=0,upper=1> hz_scaled01 ;
}
transformed parameters{
	real hz = hz_scaled01 * hz_max_minus_min + hz_min;
	vector[n] f = sin(x_times_pi_times_2*hz) ;
}
model{
	hz_scaled01 ~ beta(2,2) ;
	noise ~ std_normal() ;
	y ~ normal(f,noise) ;
}

And now we compile with cmdstanr, sample, confirm there are no issues, and finally visualize:
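(I’m assuming here that the Stan program above has been saved as, e.g., sinusoid.stan; the compile step is just:)

mod = cmdstan_model('sinusoid.stan')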

fit = mod$sample(
	data = list(
		n = nrow(dat)
		, y = dat$y
		, x = dat$time
		, min_diff_x = min(diff(dat$time))
		, diff_range_x = diff(range(dat$time))
	)
	, chains = parallel::detectCores()/2-1
	, parallel_chains = parallel::detectCores()/2-1
	, refresh = 0
	, seed = 1
	, iter_warmup = 1e4
	, iter_sampling = 1e3
)
fit$cmdstan_diagnose()
draws =
	(
		fit$draws(
			variables = 'f'
		)
		%>% posterior::as_draws_df()
		%>% dplyr::select(-.iteration)
		%>% tidyr::pivot_longer(
			cols = starts_with('f')
			, names_to = 'x'
		)
		%>% dplyr::mutate(
			x = str_replace(x,fixed('f['),'')
			, x = str_replace(x,fixed(']'),'')
			, x = dat$time[as.numeric(x)]
			, .chain = factor(.chain)
		)
	)


(
	draws
	%>% dplyr::group_by(
		x
		, .chain
	)
	%>% dplyr::summarise(
		med = median(value)
		, lo50 = quantile(value,.25)
		, hi50 = quantile(value,.75)
		, lo95 = quantile(value,.025)
		, hi95 = quantile(value,.975)
		, .groups = 'drop'
	)
	%>% ggplot()
	+ facet_grid(.chain~.)
	+ geom_line(
		data = dat
		, mapping = aes(
			x = time
			, y = z
		)
	)
	+ geom_point(
		data = dat
		, mapping = aes(
			x = time
			, y = y
		)
		, alpha = .5
	)
	+ geom_ribbon(
		mapping = aes(
			x = x
			, ymin = lo50
			, ymax = hi50
			, group = .chain
			, fill = .chain
		)
		, alpha = .5
	)
	+ geom_ribbon(
		mapping = aes(
			x = x
			, ymin = lo95
			, ymax = hi95
			, group = .chain
			, fill = .chain
		)
		, alpha = .5
	)
	+ geom_line(
		data =
			(
				draws
				%>% dplyr::nest_by(.chain,.draw)
				%>% dplyr::group_by(.chain)
				%>% dplyr::slice_sample(n=10)
				%>% tidyr::unnest(cols=data)
			)
		, mapping = aes(
			x = x
			, y = value
			, group = .draw
			, colour = .chain
		)
		, alpha = .5
	)
)

Beautiful! Well, kinda. Notice I used 1e4 warmup iterations; that’s because the default 1e3 yielded post-warmup samples that failed the R-hat diagnostic. Inspection of the samples showed that while some chains found the data-generating values for hz and noise quickly, others had settled on higher values for both. I think this makes sense: a higher value for noise means the latent sinusoid is less influential on the data, so the sampler is free to explore the parameter space more widely.

But the model samples so quickly that extending warmup by an order of magnitude is barely noticeable, and it seems that with that length, whatever learning about the shape of the parameter space happens during warmup (still a bit of voodoo to me) allows all chains to settle into the proper typical set.

Ok, so let’s add complexity by no longer assuming the phase of the data is zero. Same data as before, but now:

parameters{
	real<lower=0> noise ;
	real<lower=0,upper=1> hz_scaled01 ;
	real<lower=0,upper=2*pi()> phase ;
}
transformed parameters{
	real hz = hz_scaled01 * hz_max_minus_min + hz_min;
	vector[n] f = sin(x_times_pi_times_2*hz-phase) ;
}
model{
	hz_scaled01 ~ beta(2,2) ;
	noise ~ std_normal() ;
	y ~ normal(f,noise) ;
}

Note we now have a phase parameter constrained to the range from 0 to 2\pi (radians) with an implicit uniform prior across this range. We can use the same data as above, generated with a phase of zero, and see if we can recover that value along with the frequency as before.

Unfortunately, even with 1e4 warmup iterations I observe that while all chains seem to settle into the right areas of the noise and hz dimensions of the parameter space, there’s bimodality among the chains in where they settle in the phase dimension: some settle down near zero and some settle up at 2\pi. Now, in principle these regions are right next to each other, since phase should be considered a circular parameter. At first I even thought the diagnostic warnings were false alarms driven by the diagnostics not being aware that phase is circular. However, I don’t think that’s the case, since the diagnostics for f, the latent sinusoid, are also flagged for high R-hat and low ESS. There are also warnings for hz, and I can see that while all the chains are in generally the same place, there’s still some discernible separation between them (depending on the seed used to start sampling), so I think something about having phase in the mix is causing more fundamental issues for the sampler.

One thought is that I’ve done nothing to signal to the sampler that phase is a circular parameter. As a wild last-ditch effort at achieving this I tried giving phase a prior via phase ~ von_mises(0,0.01), which uses the explicitly-circular von Mises distribution but with a concentration low enough to render a pretty much flat prior. But no dice; same behaviour as with the earlier implicit uniform. I also tried bumping the warmup iterations up as high as 1e6, but still found things unreliable.
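(To back up the “pretty much flat” claim: the von Mises density is proportional to exp(kappa*cos(theta - mu)), so with kappa = 0.01 the ratio of the density at its peak to its trough is only about exp(0.02):)

# ratio of the von Mises density at its peak vs its trough when kappa = .01
exp(.01*cos(0)) / exp(.01*cos(pi)) # ~1.02, i.e. essentially flat across the circle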

So, the seemingly simple approach of modelling sinusoid-generated data as sinusoidal works great if we know the data-generating phase. It would possibly also work if we had decent prior information on the phase (e.g. seasonal data): we could center the representation of the phase variable on that value and give it a center-peaked prior to avoid the bimodal behaviour I’ve observed here. However, I’m more immediately interested in inference amidst truly nil prior knowledge of the phase of periodic signals, so I decided to look at Gaussian processes next.

First let’s model a GP using a non-periodic kernel and see how well it nonetheless captures this periodic data. As in the sinusoidal model above, I’ll assume the latent signal has an amplitude of 1, and, as with the prior for hz, we should express a prior on the lengthscale parameter that avoids extremes that the sampling scheme of the observed data cannot inform:

functions{
	// gp: computes noiseless Gaussian process
	vector gp(
		real lengthscale
		, real [] x
		, vector stdnormal
	) {
		int n = num_elements(x) ;
		matrix[n,n] cov = cov_exp_quad(x,1.0,lengthscale);
		for(i in 1:n){
			cov[i,i] = cov[i,i]+1e-6 ; //jitter to avoid positive-definite warnings
		}
		return(cholesky_decompose(cov) * stdnormal ) ;
	}
}

data{
	int n ;
	vector[n] y ;
	real x[n];
	real min_diff_x ;
	real diff_range_x ;
}
transformed data{
	real lengthscale_min = min_diff_x ;
	real lengthscale_max = diff_range_x ;
	real lengthscale_max_minus_min = lengthscale_max - lengthscale_min ;
}
parameters{
	real<lower=0> noise ;
	real<lower=0,upper=1> lengthscale_scaled01 ;
	vector[n] gp_helper ;
}
transformed parameters{
	real lengthscale = lengthscale_scaled01 * lengthscale_max_minus_min + lengthscale_min;
	vector[n] f = gp(lengthscale,x,gp_helper) ;
}
model{
	lengthscale_scaled01 ~ beta(2,2) ;
	gp_helper ~ std_normal() ;
	noise ~ std_normal() ;
	y ~ normal(f,noise) ;
}
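(For reference, the fit call is essentially the same as for the sinusoid model; the file and object names below are just my assumptions about how things are saved:)

mod_gp = cmdstan_model('gp.stan')
fit_gp = mod_gp$sample(
	data = list(
		n = nrow(dat)
		, y = dat$y
		, x = dat$time
		, min_diff_x = min(diff(dat$time))
		, diff_range_x = diff(range(dat$time))
	)
	, chains = parallel::detectCores()/2-1
	, parallel_chains = parallel::detectCores()/2-1
	, refresh = 0
	, seed = 1
	, iter_warmup = 1e4
	, iter_sampling = 1e3
)
fit_gp$cmdstan_diagnose()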

Sampling this model takes far longer than the sinusoidal model: several minutes rather than several seconds. That’s why I was motivated to work out the sinusoid model first, as I knew a GP would be a hefty compute burden, limiting applications. Putting that aside for now, though, this period-naive model does at least seem to sample without issues, and it yields great recovery of the latent sinusoid:

However, I’d seen some neat work here by others using kernels with explicit periodic content, so I thought I’d see if I could get a minimal periodic kernel working. This meant abandoning cov_exp_quad() and the stellar accelerations the devs built into it, but there are some pre-computation tricks from the pre-cov_exp_quad() days that help avoid slowing down too badly:

functions{
	// pgp: computes noiseless periodic Gaussian process
	vector pgp(
		real hz
		, vector stdnormal
		, real amplitude_sq_plus_jitter
		, matrix pi_times_2_times_abs_diff_x
	) {
		int n = rows(pi_times_2_times_abs_diff_x) ;
		matrix[n,n] cov ;
		for(i in 1:(n-1)){
			cov[i,i] = amplitude_sq_plus_jitter ;
			for(j in (i+1):n){
				cov[i, j] = cos( pi_times_2_times_abs_diff_x[i,j] * hz ) ;
				cov[j,i] = cov[i,j] ;
			}
		}
		cov[n,n] = amplitude_sq_plus_jitter ;
		return(cholesky_decompose(cov) * stdnormal ) ;
	}
}
data{
	int n ;
	vector[n] y ;
	vector[n] x ;
	real min_diff_x ;
	real diff_range_x ;
}
transformed data{
	real hz_max = 1/(min_diff_x*5) ;
	real hz_min = 1/diff_range_x ;
	real hz_max_minus_min = hz_max - hz_min ;
	real amplitude_sq_plus_jitter = 1 + 1e-3 ;
	matrix[n,n] pi_times_2_times_abs_diff_x ;
	for(i in 1:n){
		for(j in 1:n){
			pi_times_2_times_abs_diff_x[i,j] = pi() * 2 * fabs(x[i]-x[j]) ;
		}
	}
}
parameters{
	real<lower=0> noise ;
	real<lower=0,upper=1> hz_scaled01 ;
	vector[n] gp_helper ;
}
transformed parameters{
	real hz = hz_scaled01 * hz_max_minus_min + hz_min;
	vector[n] gp = pgp(
		hz
		, gp_helper
		, amplitude_sq_plus_jitter
		, pi_times_2_times_abs_diff_x
	) ;
}
model{
	hz_scaled01 ~ beta(2,2) ;
	gp_helper ~ std_normal() ;
	noise ~ std_normal() ;
	y ~ normal(gp,noise) ;
}

This periodic-kernel GP takes about as long as the regular GP to sample, has no warnings and seems to do a great job of recovering the latent sinusoid:

Now, the regular GP really should perform worse in this case, at least in the sense of having greater uncertainty in the posterior of the latent signal, because each point is influenced by a lower proportion of the other points than under the periodic kernel:

It’s difficult to see this performance difference looking at the earlier plots, but when we compare the width of the posteriors computed as a deviation from the data-generating truth, we see the periodic GP is substantially narrower:

(
	gp_draws
	%>% dplyr::mutate(kind='gp')
	%>% dplyr::bind_rows(
		(
			pgp_draws
			%>% dplyr::mutate(kind='pgp')
		)
	)
	%>% dplyr::rename(time=x)
	%>% dplyr::left_join(dat)
	%>% dplyr::mutate(value=value-z)
	%>% dplyr::group_by(kind)
	%>% dplyr::summarise(
		value = diff(quantile(value,c(.025,.975)))
	)
)
# A tibble: 2 x 2
  kind   value
  <chr>  <dbl>
1 gp    0.178
2 pgp   0.103

Now, that greater certainty isn’t free; we pay for it in less robustness amid non-stationary signals. So long as the lengthscale is stationary, a standard GP can accommodate a broad diversity of wiggles in the data, while the periodic GP will fail dramatically in the face of most non-stationarity. Even periodic and stationary data that is merely non-sinusoidal will thwart it (see the sketch below for one way such data might be generated):
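(For concreteness, here’s a sketch of one way to generate periodic-but-asymmetric data; this particular nested-sine form is an arbitrary choice and comes up again later in this thread:)

# periodic but asymmetric: warp a sine by nesting another sine inside it
asym = function(time,hz,phase) sin( time*(2*pi)*hz - phase + sin(time*(2*pi)*hz - phase) )
dat_asym = tibble(
	time = seq(0,10,length.out = n)
	, z = asym(time,hz=1/3,phase=0)
	, y = z + rnorm(n,0,.1)
)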

Which contrasts with the performance of the regular GP:

(though we do see some widening of the uncertainty relative to the truly-sinusoidal-data case, presumably driven by the asymmetric shape of the periodic element causing local variation in the lengthscale)

It’s also possible to combine the two approaches:

functions{
	// pgp: computes noiseless periodic Gaussian process
	vector pgp(
		real hz
		, real lengthscale
		, vector stdnormal
		, real amplitude_sq_plus_jitter
		, matrix pi_times_2_times_abs_diff_x
		, matrix neg_squared_diff_x
	) {
		int n = rows(pi_times_2_times_abs_diff_x) ;
		matrix[n,n] cov ;
		for(i in 1:(n-1)){
			cov[i,i] = amplitude_sq_plus_jitter ;
			for(j in (i+1):n){
				cov[i, j] = (
					cos( pi_times_2_times_abs_diff_x[i,j] * hz )
					* exp( neg_squared_diff_x[i,j] / lengthscale )
				) ;
				cov[j,i] = cov[i,j] ;
			}
		}
		cov[n,n] = amplitude_sq_plus_jitter ;
		return(cholesky_decompose(cov) * stdnormal ) ;
	}
}
data{
	int n ;
	vector[n] y ;
	vector[n] x ;
	real min_diff_x ;
	real diff_range_x ;
}
transformed data{
	real lengthscale_min = min_diff_x ;
	real lengthscale_max = diff_range_x ;
	real lengthscale_max_minus_min = lengthscale_max - lengthscale_min ;
	real hz_max = 1/(min_diff_x*5) ;
	real hz_min = 1/diff_range_x ;
	real hz_max_minus_min = hz_max - hz_min ;
	real amplitude_sq_plus_jitter = 1 + 1e-3 ;
	matrix[n,n] pi_times_2_times_abs_diff_x ;
	matrix[n,n] neg_squared_diff_x ;
	for(i in 1:n){
		for(j in 1:n){
			pi_times_2_times_abs_diff_x[i,j] = 2 * pi() * fabs(x[i]-x[j]) ;
			neg_squared_diff_x[i,j] = -((x[i]-x[j])^2) ;
		}
	}
}
parameters{
	real<lower=0> noise ;
	real<lower=0,upper=1> lengthscale_scaled01 ;
	real<lower=0,upper=1> hz_scaled01 ;
	vector[n] gp_helper ;
}
transformed parameters{
	real lengthscale = lengthscale_scaled01 * lengthscale_max_minus_min + lengthscale_min;
	real hz = hz_scaled01 * hz_max_minus_min + hz_min;
	vector[n] f = pgp(
		hz
		, lengthscale
		, gp_helper
		, amplitude_sq_plus_jitter
		, pi_times_2_times_abs_diff_x
		, neg_squared_diff_x
	) ;
}
model{
	lengthscale_scaled01 ~ beta(2,2) ;
	hz_scaled01 ~ beta(2,2) ;
	gp_helper ~ std_normal() ;
	noise ~ std_normal() ;
	y ~ normal(f,noise) ;
}

While I haven’t managed to observe much benefit from using this combined approach, I probably just haven’t explored enough of the parameter space.

So that’s what I’ve come up with so far. It wasn’t rigorous or thorough enough to call a case study, but I wanted to log my observations nonetheless. And I’d still love a solution to the phase issue in the first models, as they’re so much faster. Obviously GPs will give me far more flexibility, but I’d rather spend time working out proper generative models for any nuances that crop up in the data than burn CPU cycles leaving it to the GP to work things out on its own. (In my applied work, at least.)


von_mises(0,0) is a perfectly good uniform distribution and it should work with CmdStanR. (RStan might still use an older version of Stan which had a bug at \kappa=0.)

The reason a prior does not help is that the sampler does not examine the model block. The model block is only for computing the density. To signal that phase is circular it must be declared as such in the parameters block. Stan does provide a data type with the appropriate constraint: a 2-dimensional unit vector is equivalent to an angle.
You could extract the phase angle from the unit vector with the atan2() function but personally I like writing the phase-shifted sine as the sum of sine and cosine components.

parameters{
	real<lower=0> noise ;
	real<lower=0,upper=1> hz_scaled01 ;
	unit_vector[2] phase ;
}
transformed parameters{
	real hz = hz_scaled01 * hz_max_minus_min + hz_min;
	vector[n] f = phase[1]*sin(x_times_pi_times_2*hz) + phase[2]*cos(x_times_pi_times_2*hz);
}

This is especially convenient when the amplitude is also unknown because then you can change that unit_vector[2] to a plain vector[2].

Thanks! I’m embarrassed (concerned?) to realize, on being reminded by your response, that I had already explored all this and learned that solution only a year and a half ago! Hopefully my re-teaching myself here can be of use to others in some way.


I will say that I don’t think I knew about the trick of not explicitly converting to an angle; I’ll see if that helps with the issues I previously observed when sampling with the unit_vector.

I think the previous problems were due to Stan’s bad unit_vector implementation. It’s been discussed a couple of times but unit vectors are hard because of topology.
Encoding both phase and amplitude in a single vector probably helps.


With your phamp approach (which is what I insist you call it! :) ), is it possible to put a prior on the amplitude by combining the values? Presumably there’s some transform that derives what the combination of the two values implies for the final amplitude, but I’m having trouble working out what that transform is.

Some R code from my attempt to at least visualize the relationship:

sine = function(time,hz,phase) sin(time*(2*pi)*hz - phase)
cosine = function(time,hz,phase) cos(time*(2*pi)*hz - phase)
n = 1e5
time = seq(0,10,length.out = n)
phamp1 = 2
phamp2 = seq(-10,10,length.out = 1e3)
amp = rep(NA,times=length(phamp2))
for(i in 1:length(phamp2)){
	z = (
		phamp1*sine(time,hz=1/3,phase=0) 
		+ phamp2[i]*cosine(time,hz=1/3,phase=0) 
	)
	amp[i] = diff(range(z))
}
qplot(phamp2,amp,geom='line')

The components are the amplitude times the sine and cosine of the phase. This follows from the angle-sum formula for trigonometric functions:

A\mathrm{sin}\left(x+\phi\right) = \left(A\mathrm{cos}\phi\right)\mathrm{sin}x + \left(A\mathrm{sin}\phi\right)\mathrm{cos}x

The amplitude is then simply the length of the vector.
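(A quick numerical check of that identity in R, with arbitrary example values for the amplitude and phase:)

# A*sin(x + phi) should equal (A*cos(phi))*sin(x) + (A*sin(phi))*cos(x)
x = seq(0, 10, length.out = 1e3)
A = 2 ; phi = .7
max(abs( A*sin(x + phi) - ((A*cos(phi))*sin(x) + (A*sin(phi))*cos(x)) )) # ~1e-15
# and the length of the coefficient vector recovers the amplitude:
sqrt((A*cos(phi))^2 + (A*sin(phi))^2) # = 2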
A prior like

  phamp ~ normal(0, sigma);

implies a chi distribution with 2 degrees of freedom (equivalently, a Rayleigh distribution with scale sigma) for the total amplitude.
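(And a quick Monte Carlo check of that claim, with sigma = 1.5 picked arbitrarily; the Rayleigh quantile function is sigma*sqrt(-2*log(1-p)):)

set.seed(1)
sigma = 1.5
phamp = matrix(rnorm(2e5, 0, sigma), ncol = 2)
amp = sqrt(rowSums(phamp^2))
p = c(.25, .5, .75)
round(quantile(amp, p), 2)
round(sigma*sqrt(-2*log(1-p)), 2) # theoretical quantiles; should match closely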


Thanks! I’m going to have to sit with that a bit to parse what is going on, but in the meantime: presumably this solution really only works for modelling pure sinusoids, right? There’s not a similarly straightforward solution to doing inference on phase with something asymmetric like sin( x*hz-phase + sin(x*hz-phase) )? (not actually wedded to that functional form, just one that produces periodic-but-asymmetric data)

Any periodic function can be written as a Fourier series

f\left(x\right)=\sum_{n}a_{n}\mathrm{sin}\left(\frac{2\pi n}{P}x\right)+b_{n}\mathrm{cos}\left(\frac{2\pi n}{P}x\right)

IIRC, if you take infinitely many terms and independent a_n, b_n ~ normal(0, s_n) (where s_n\to0 when n\to\infty) it amounts to a periodic Gaussian process.
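(A little R sketch of that idea: draw coefficients with scales that decay in n and sum a truncated series; each draw is a smooth random function with the chosen period, which is the flavour of a periodic Gaussian process. The truncation point and decay rate below are arbitrary choices:)

set.seed(1)
period = 3
n_terms = 10                # truncation point; more terms allow sharper features
s = 1/(1:n_terms)^2         # coefficient scales decaying with n
x = seq(0, 10, length.out = 500)
f = rep(0, length(x))
for(k in 1:n_terms){
	f = f + rnorm(1,0,s[k])*sin(2*pi*k*x/period) + rnorm(1,0,s[k])*cos(2*pi*k*x/period)
}
plot(x, f, type = 'l') # a random smooth periodic function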

Ah, right! I think I caught glimmers of that last night while still banging my head on the phase inference before your solution. I tried supplying as data a matrix of sinusoids, each with a slightly different phase and the phases spanning the 0-2\pi range, then giving each a weight parameter and summing. This gave really good recovery but was almost as slow as the full GP. I can see now that had I also included the cosine terms it would have been a Fourier series.

However, for the case of an explicit periodic function P(x), I think your original solution can be applied without all the burden/flexibility of the Fourier series / GP by modelling a sine as we’ve done, then somehow transforming that to a sawtooth that is then used to index P(x). This is getting way beyond my math background, but I think the sine-to-sawtooth transform is a straightforward Fourier series itself.

Hmm, yes, let’s say you separate the amplitude and the phase angle

parameters{
  vector[2] phamp ;
}
transformed parameters{
  real amp = sqrt(dot_self(phamp));
  real phase = atan2(phamp[1],phamp[2]);

  vector[n] f = amp * P(x_times_pi_times_2*hz - phase) ;
}

Although atan2 is discontinuous at some point, the important thing is that f is a smooth function of phamp. The original problem was that the real<lower=0,upper=2*pi()> (or even unit_vector[2]) parametrization is discontinuous.
The separate sine and cosine terms were just a cute optimization of sin(x + atan2(v[1],v[2])).


@nhuurre, one last Q (hopefully!):

Do you know the Jacobian adjustment I’d need to use a prior for the amplitude directly?

I know you suggested that phamp ~ normal(0,sigma); implies a \chi_{df=2} distribution for the total amplitude (though I’m not sure where sigma comes from nor how its value affects the implication), but I’m encountering some issues with an elaboration on these ideas and I think it would help me diagnose if I could put a prior on the amplitude directly.

So the transform is

parameters{
  vector[2] phamp ;
}
transformed parameters{
  real amp = sqrt(dot_self(phamp));
  real phase = atan2(phamp[1],phamp[2]);
}

and I believe the corresponding Jacobian adjustment is

model {
  target += -log(amp); // log |det d(amp,phase)/d(phamp)| = -log(amp)
}
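(If you want a numerical sanity check of that adjustment, here’s a throwaway sketch in R: a crude random-walk Metropolis over phamp with an arbitrary Gamma(2,1) prior placed directly on amp. With the adjustment in place, the draws of amp should follow that prior, and phase comes out uniform.)

set.seed(1)
lp = function(ab){
	amp = sqrt(sum(ab^2))
	dgamma(amp, shape = 2, rate = 1, log = TRUE) - log(amp) # prior on amp + Jacobian
}
n_iter = 5e4
ab = c(1, 0)
amp_draws = rep(NA_real_, n_iter)
for(i in 1:n_iter){
	prop = ab + rnorm(2, 0, .5)
	if(log(runif(1)) < lp(prop) - lp(ab)) ab = prop
	amp_draws[i] = sqrt(sum(ab^2))
}
round(quantile(amp_draws, c(.1, .5, .9)), 2)
round(qgamma(c(.1, .5, .9), shape = 2, rate = 1), 2) # should roughly match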

You wouldn’t happen to have already worked out the Jacobian necessary to put a prior on the phase too, eh?

It’s the above. The Jacobian is for the entire 2d transform. There’s no such thing as a “partial” Jacobian.

Alternative way to say it: if you put a prior only on amp then phase gets a uniform prior. A statement like

 target += von_mises_lpdf(phase | mu, kappa);

adds the difference between the uniform and von Mises probability density.
