Poor sampling in model of periodic response including a rate parameter

Another entry in my ongoing mission to model some physiological data in which the heartbeat induces quasi-regular changes in the signal. I’ve managed to get a version working where I assume a known rate and do inference on the noise magnitude and the heartbeat amplitude, but when I try to do inference on the rate as well, I get very high Rhat and very low n_eff for the rate parameter, even with iter=1e5. I’ve tried a couple of different forms for the response function (symmetric half-sine, asymmetric half-sine, & beta(2,2)), and get the same behaviour with all of them. I’ve also tried the scenario with known noise & pulse amplitudes, doing inference only on the rate parameter, but still get very poor sampling.

As per the warning at the beginning of section 41.7 in the manual, I suspect that this is an issue with my use of fmod() to impose periodicity, but I’m out of ideas for fixing this. Help?

Here’s the model that attempts inference on rate, amplitude, & noise:

functions{
  // pulse: function to generate pulse-like curve (half-sine); oversimplified in shape for initial model building/testing
  vector pulse(vector time, real amplitude, real rate){
    vector[num_elements(time)] position ;
    real period = 1/rate ;
    //need to loop because fmod isn't vectorized
    for(i in 1:num_elements(time)){
      position[i] = fmod(time[i],period)/period ;
    }
    return amplitude * sin( position*pi() ) ;
  }
}
data{
  int n;
  vector[n] y;
  vector[n] time;
}
parameters{
  real<lower=0> amplitude;
  real<lower=0,upper=1> rate; //needs upper bound else chains will gravitate towards higher frequencies whose aliased samples "fit" the data well
  real<lower=0> noise;
}
model{
  rate ~ uniform(0,1) ;
  amplitude ~ weibull(2,2) ;
  noise ~ normal(0,1) ;
  y ~ normal( pulse(time,amplitude,rate) , noise ) ;
}

If you think that it’s the fmod, then could you code up a model that wouldn’t necessarily need the fmod and then try it both ways?

Like if your signal was a full sine wave, then you wouldn’t really need the fmod, because the signal actually is periodic without fmod tricks to make it look periodic. You could just do sin(2 * pi * time[i] / period). But a sine has derivatives of all orders everywhere, unlike half-sines stuck together, so maybe that doesn’t really diagnose anything.

Could you write your function as a few terms of a Fourier series or something to avoid having fmods crawling around?
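
Since the glued half-sines are exactly amplitude * |sin(pi * rate * t)|, and |sin| has a known Fourier series (|sin x| = 2/pi - (4/pi) * sum_k cos(2kx)/(4k^2 - 1)), a truncated version would be smooth everywhere. Something like this, maybe (a throwaway R sketch; fourier_pulse is just a made-up name):

#truncated Fourier series for the half-sine train (K terms of the series for |sin|)
fourier_pulse = function(time, amplitude, rate, K = 4) {
  x = pi * rate * time
  out = rep(2/pi, length(time))
  for (k in 1:K) {
    out = out - (4/pi) * cos(2*k*x) / (4*k^2 - 1)
  }
  return(amplitude * out)
}

#compare to the exact half-sine train
curve(abs(sin(pi * 0.5 * x)), 0, 10, n = 500)
curve(fourier_pulse(x, amplitude = 1, rate = 0.5), add = TRUE, lty = 2, n = 500)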

I was hoping for something generally applicable to any non-sine-based periodic function. I might be speaking beyond my expertise, but since my long-term goal is to model non-stationary scenarios where I want to model changes in the amplitudes and rates of said responses through time, I don’t think a Fourier series would be able to handle those changes, and it would certainly be difficult to set priors on its coefficients.

I was able to get this to sample well using the DE-MCMC sampler in the BayesianTools package:

library(BayesianTools)

#define a simple non-sine periodic response function
simple_pulse_curve = function(time,amplitude,rate){
  period = 1/rate
  position = (time%%period)/period
  return(amplitude*sin(position*pi))
}

#generate some data
n = 1e2
time = seq(0,10,length.out = n)
rate = .5
amplitude = 1
noise = .1
y = simple_pulse_curve(time,amplitude,rate)+rnorm(n,0,noise)
plot(time,y,type='b')

#define the likelihood
likelihood = function(par){
  rate = par[1]
  amplitude = par[2]
  noise = par[3]
  pulse = simple_pulse_curve(time,amplitude,rate)
  target = sum(dnorm(y,pulse,noise,log=T))
  return(target)
}

#define the sampler that expresses the priors
priorSampler = function(n=1){
  rate = rweibull(n,2,1)
  amplitude = rweibull(n,2,2)
  noise = abs(rnorm(n,0,1)) #half-normal; same as rejection-sampling normal(0,1) at zero, but works for n>1
  return(cbind(rate,amplitude,noise))
}

#define the priors
prior = function(par){
  rate = par[1]
  amplitude = par[2]
  noise = par[3]
  target = dweibull(rate,2,1,log=T) + #scale 1 to match priorSampler above
    dweibull(amplitude,2,2,log=T) +
    dnorm(noise,0,1,log=T)
  return(target)
}

#combine above plus bounds
bayesianSetup = createBayesianSetup(
  likelihood = likelihood
  , prior = prior
  , priorSampler = priorSampler
  , names = c('rate','amplitude','noise')
  , lower = c(0, 0, 0)
  , upper = c(2,Inf,Inf) #need to constrain rate, else higher rates that alias against the sampling grid will fit well
)

#set iterations, etc
settings = list(iterations = 1e4, nrChains = 4, message = FALSE)

#run the sampler
post = runMCMC(bayesianSetup = bayesianSetup, sampler = "DEzs", settings = settings)

#plot the traces & densities
plot(post)

#show a summary of the sampling
summary(post)

Note that one thing I had to do to achieve good sampling was put an upper bound on rate, as I discovered that higher rates could seem consistent with the data thanks to aliasing. I also set a uniform prior on rate, but that isn’t consequential (it just seemed more appropriate for a variable with 0-1 bounds). I tried bounding rate in the Stan version too, but still got the same poor sampling.
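
To make the aliasing concrete: on a regular time grid with spacing dt, a rate of r and a rate of r + k/dt produce identical samples from the half-sine pulse. A quick numeric check (the dt and rates here are arbitrary illustrations, not my actual grid):

dt = 0.1
tt = seq(0, 10, by = dt)
y1 = simple_pulse_curve(tt, amplitude = 1, rate = 0.5)
y2 = simple_pulse_curve(tt, amplitude = 1, rate = 0.5 + 1/dt) #aliased rate
max(abs(y1 - y2)) #~0 up to floating-point error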

Oh, and possibly pertinent to tracking down why Stan isn’t performing well in this case: when I look at the traceplots, it seems that some chains explore the space around the original data-generating parameters (rate=.5, amplitude=1, noise=.1), while others seem to have gravitated to a second set:

[traceplot image]

Hm, depending on the run, seems there are a few different rate values that chains get attracted to:

The alternative rate values that chains sometimes gravitate to are .08, .18, & .38. Here’s what the signals specified by those rates look like (colored lines) compared to the data-generating (points & lines) rate of 0.5:

Weird that their periods are not simply multiples of the data-generating period, which was my first guess as to what was going on.

Cool beans! Good to hear you got something working.

I dunno really what to do with these curve-fitting sorts of applications. Just seems really hard, but they pop up a lot. I saw your post before Xmas on this but just didn’t have any advice haha.

The most obvious thing is that if the curve gets off a little at the start of the data, it can just screw up everything else down the line. And maybe there’s a way to use fancy GPs to account for how rates are changing through time, or the models are imperfect or whatever, but that just sounds painful.

What do you think about doing some feature extraction sorta stuff? Like, if you have a bunch of humps, just manually extract the peaks and then try to model the distance between them.

There’d probably be plenty of problems with that (multiple detections, missed peaks, etc.), but it just seems like it’d be so much less data and maybe more workable.

This kinda gets away from the generating process thing, but if the function we’re fitting with isn’t a good model to begin with, then maybe that’s not such a bad sacrifice. It’s annoying to break stuff up into a bunch of little stages though, more forking paths and magic constants and whatnot.
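
Something along these lines, say (a very rough R sketch using the y & time from your simulation above; find_peaks is a made-up helper, and real data would need a smarter detector):

#rough sketch: smooth, find interior local maxima, then model the gaps
find_peaks = function(y, time) {
  i = which(diff(sign(diff(y))) == -2) + 1 #slope flips from + to -
  return(time[i])
}
y_smooth = as.numeric(stats::filter(y, rep(1/5, 5))) #crude moving average
peak_times = find_peaks(y_smooth, time)
intervals = diff(peak_times) #e.g. 1/mean(intervals) estimates the rate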

I made a bit of headway, at least to the extent that I managed to figure out how to code the model without using fmod, as well as adding inference on the phase of the periodic response. Here’s the version where I don’t attempt inference on rate:

data{
  int n;
  vector[n] y;
  vector[n] x;
  real<lower=0> rate ;
}
parameters{
  //real<lower=0> rate ;
  unit_vector[2] phase_helper;
  real<lower=0> amplitude;
  real<lower=0> noise;
}
transformed parameters{
  real phase = atan2(phase_helper[1],phase_helper[2]) ;
}
model{
  //rate ~ weibull(2,1) ;
  amplitude ~ weibull(2,2) ;
  noise ~ normal(0,.1) ;
  {
    vector[n] position ;
    vector[n] f ;
    vector[n] temp = x*rate*2*pi()+phase ;
    for(i in 1:n){ //atan2 isn't vectorized, so have to loop
      position[i] = atan2(sin(temp[i]),cos(temp[i])) ;
    }
    f = amplitude * sin( ((position/pi()+1)/2)*pi() ) ; //example periodic function: half-sine
    y ~ normal( f , noise ) ;
  }
}
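
The atan2(sin(theta), cos(theta)) bit is what replaces the fmod: it wraps the growing phase x*rate*2*pi() + phase back into (-pi, pi]. A quick sanity check in R:

#check that atan2(sin(theta), cos(theta)) wraps theta into (-pi, pi]
theta = seq(-10, 10, by = .25)
wrapped = atan2(sin(theta), cos(theta))
range(wrapped) #stays within (-pi, pi]
max(abs(sin(wrapped) - sin(theta))) #~0: same point on the circle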

Unfortunately, when I tweak it to also attempt inference on rate (comment out the rate data variable, then uncomment the rate-related lines below), I’m still getting poor sampling, with different chains getting stuck at different rate values. When this happens, the noise parameter is over-estimated & the amplitude parameter is under-estimated. If I put a very tight prior around the true rate, it samples just fine, recovering all parameters well, but that isn’t feasible for real-world data where I’ll have substantial uncertainty about the rate. Any ideas? @Bob_Carpenter I tagged you on a related thread (but I think you misread it as having to do with GPs); possibly you might have insight? Or maybe this is an issue of it being a non-Euclidean sampling space, on which @betanalpha might have input?

Oh, and here’s some R code for generating data then attempting inference:

library(rstan)
rstan_options(auto_write = TRUE)

n = 1e3
x = seq(0,10,length.out = n)
noise = .1
phase = pi/2
rate = 1
amplitude = 1

temp = x*rate*2*pi+phase
position = atan2(sin(temp),cos(temp))
f = amplitude*sin(((position/pi+1)/2)*pi)
y = f + rnorm(n,0,noise)
plot(x,y,type='b') #show the observations

#run stan
post = rstan::stan(
  file = 'periodic.stan'
  , data = list(
    n = length(y)
    , x = x
    , y = y
  )
  , chains = 4 
  , cores = 4 #set to number of physical cores on your system
  , iter = 1e3
)

#check the traceplots
rstan::traceplot(post)

Is it possible to do a fit on just the first lump of the data and then use that to do inference on the whole timeseries?

Might be awkward to code.

It’s the discontinuities in gluing together a bunch of half sines that makes me nervous, no matter how it’s computed.

If f(x) is a function of all these things glued together, and you have a rate scaling thing, then presumably you have something like:

target += normal_lpdf(y | f(t / rate), sigma);

dtarget/drate

will require

df(x)/drate, where x = t / rate

and because the autodiff is effectively just applying a bunch of chain rules, this will require:

df/dx * dx/drate

And df/dx has discontinuities if it’s just glued-together half-sines. Whether these cause problems or not is unclear, but they definitely could.
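
You can see the kink numerically, since the glued train equals amplitude * |sin(pi * rate * t)|; the one-sided slopes at a gluing point disagree (throwaway check, rate = 0.5 picked arbitrarily):

f = function(t) abs(sin(pi * 0.5 * t)) #half-sine train with rate = 0.5
eps = 1e-6
t0 = 2 #a gluing point (period = 1/0.5 = 2)
(f(t0 + eps) - f(t0)) / eps #right slope, ~ +pi/2
(f(t0) - f(t0 - eps)) / eps #left slope, ~ -pi/2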

Yes, this is a topological issue. By using the unit_vector type you avoid the issue for the phase but don’t do anything to avoid aliasing in the rate. If you’re suspicious, then just plot rate %% (2 * pi) to see whether your chains are actually aliasing.

Thanks @betanalpha for taking a look; presumably no solutions/re-mappings come to mind?

Thanks @bbbales2, I was using half-sines as a simple test case but it’s actually fairly representative of my real data:

(n.b. there’s lower-frequency variability that I plan to model with a GP when I finally get to the point of analyzing real data)

If that’s the real data, I’m all for peak extraction sorta stuff. Just get all the local minima/maxima in a sequence, model the times between them, and see where that leads. You could model the envelope or the mean or whatever with some GPs as well.

I’m scared that half-sines, even though they look like the data, are too much of a model (even if it did sample well).

That’s a particularly low-noise sequence; I have more data where extremum search doesn’t work.

Yeah, that makes sense.

It’s just, I had someone come to me with a segmentation problem a month or so ago and I put on my fancy hat and we tried all sorts of fancy things (Bayesian and not) and stuff didn’t work. In the end she showed me the results she got from just doing a big Gaussian filter and a threshold, and it was waaaaay better than the stuff I’d come up with (it was pretty funny in the moment – I’d put on quite a good show about how Bayesian inference would solve her problems and be interpretable).

The reason I’m hinting around about overengineering here is because I’d probably overengineer the problem :D.

No, not the way you have constructed the rest of the model.

Looking at the data I’m not sure why you don’t just model it as a sequence of peaks with the peak locations constrained to be in non-overlapping windows.

Neat idea! I’ll give it a try.

Presumably there is a causal mechanism here, where one peak can’t fire until the previous peak relaxes to a sufficient point, which would allow you to, say, model the peak locations with an ordered vector (Stan’s ordered type).
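
In R terms the deterministic part might look something like this (peaks_curve, the half-cosine bump shape, & the fixed width are all stand-ins, not a worked-out model):

#stand-in sketch: given ordered peak times, sum half-cosine bumps of fixed width
peaks_curve = function(peak_times, width, amplitude, grid) {
  f = rep(0, length(grid))
  for (p in peak_times) {
    in_bump = abs(grid - p) < width/2
    f[in_bump] = f[in_bump] + amplitude * cos(pi * (grid[in_bump] - p) / width)
  }
  return(f)
}

#e.g. three peaks on an arbitrary grid
gg = seq(0, 10, length.out = 200)
plot(gg, peaks_curve(c(1, 3.2, 5.1), width = 1, amplitude = 1, grid = gg), type = 'l')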

Ah, darn, I think the model you propose implies an integer parameter (number of peaks), which isn’t supported in Stan. Maybe instead I should be thinking about a latent survival process for the occurrence of extrema in a latent function that is then observed with noise.

You could marginalize over the number of peaks as long as you could find a reasonable bound without too many cases. If it’s thousands, you’re out of luck.
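
Roughly, the bookkeeping is a log-sum-exp over candidate K (in Stan, target += log_sum_exp(lp)). A stand-in R sketch reusing simple_pulse_curve from above (loglik_K and its K/10 rate mapping are made up for illustration):

log_sum_exp = function(x) { m = max(x); m + log(sum(exp(x - m))) }

#stand-in data & per-K log likelihood: K equally spaced peaks across a 10s window
tt = seq(0, 10, length.out = 100)
yy = simple_pulse_curve(tt, amplitude = 1, rate = .5) + rnorm(100, 0, .1)
loglik_K = function(K) {
  sum(dnorm(yy, simple_pulse_curve(tt, amplitude = 1, rate = K/10), .1, log = TRUE))
}

K_max = 20
lp = log(rep(1/K_max, K_max)) + sapply(1:K_max, loglik_K) #uniform prior over K
log_sum_exp(lp) #marginal log likelihood with the number of peaks summed out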