Poisson time series analysis with multi-component random walks

Dear all,

I have used Stan successfully before to study the demographics of supermassive black holes over cosmic time. However, I am now stuck on modelling a different problem.

I am asking for helpful pointers on the analysis of Poisson time series with randomly variable count rates. This is motivated by high-energy astrophysics observations. Ultimately, I want to build quite complex source variability models. For example, I want to place a prior on a primary random walk so that it has, say, a power-law power spectrum with index 2 ± 1. I also want to add secondary source processes that depend on the first, e.g., a second component that is a delayed and time-averaged form of the first, with the parametric response kernel fitted simultaneously. (Such echoes occur when light is reflected off a distant surface.)

To start simple, let's say I want to model time-binned Poisson data sampled from a random walk with a power-law PSD. I have looked into a few approaches for this.

Approach 1): A simple way is to generate many such time series with different power-law indices (with the Timmer & Koenig method) and then compute the likelihood of the data under each simulated series. This falls apart pretty quickly at >~100 data points, because one is unlikely to find an exact match to the data, and the generating method does not easily allow exploring nearby realisations.
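For reference, a minimal sketch of the Timmer & Koenig recipe in my notation: for each Fourier frequency $f_j$, draw independent Gaussian coefficients scaled by the target power-law spectrum $S(f) \propto f^{-\alpha}$,

$$\tilde{x}(f_j) = \sqrt{S(f_j)/2}\,\bigl(n_{1,j} + i\, n_{2,j}\bigr), \qquad n_{1,j},\, n_{2,j} \sim \mathcal{N}(0,1),$$

and take the inverse FFT of $\tilde{x}$ to obtain the time series. All the randomness sits in these per-frequency amplitudes and phases, which is why two simulations with the same $\alpha$ rarely resemble each other, let alone the data.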

Approach 2): Another approach is ARMA methods, or state-space representations in general (Kelly+14 wrote a nice introduction for astronomers). I quite like this because it is very expressive, allows defining arbitrarily complex state changes, and deals well with data gaps. Because of the Poisson data, I turned away from Kalman filters and looked into particle filters. In a classical particle filter, one would at each time step draw new candidate random states and resample the particles using their likelihood weights.
Q1: My first Stan question here is whether it is preferable to make these intermediate random variables Stan parameters instead. My suspicion is yes, because otherwise the derivatives are very noisy, and HMC will complain.
My second problem is that the ARMA parameters are a bit unnatural. From them I can compute a PSD (1, 2), but (Q2) how do I best place a power-law slope prior on them? I was considering running Stan with the ARMA parameters and no data, using the desired prior as the target (compute the PSD and a Gaussian likelihood on deviations from a power law), and then using the ARMA parameter posterior mean and covariance as priors for the actual data analysis. This seems quite cumbersome, however.
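For reference, the analytic form I mean is the standard ARMA(p,q) spectral density (stated here from memory; normalisation conventions vary):

$$S(f) \;=\; \sigma^2\, \frac{\bigl| 1 + \sum_{k=1}^{q} \theta_k\, e^{-2\pi i k f} \bigr|^2}{\bigl| 1 - \sum_{k=1}^{p} \phi_k\, e^{-2\pi i k f} \bigr|^2},$$

so a power-law-slope prior on $S(f)$ is an implicit, nonlinear constraint on the coefficients $(\phi_k, \theta_k)$.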

Approach 3): The third approach I considered is Gaussian processes, drawing a Poisson realisation from the latent process. Q3: I believe one can place a power-law PSD prior on the covariance kernel, but how does one do this exactly? Specifically, the decline of the covariance with |r| is not the same as the PSD slope, if I understand correctly, but I couldn't find the right formulae. I would appreciate pointers.

I did manage to get an AR(1) primary process plus a reprocessing secondary process (called reflection below, modelled as a convolution with a Gaussian window) to work. An issue here is that I have to model many states before my time series actually starts, so that the reflection (the delayed Gaussian convolution) has earlier states to reprocess. In the code below, I have Nt data points and Npremodel states before that (the lognorm vector contains all the states of the primary process). Convergence on these pre-data states is poor, but I guess that is just how it is: they must be highly correlated but are poorly constrained. The situation is different for the times with observations, where the states are well constrained. I tried to fill in these Npremodel points by running the AR(1) backwards, but it did not perform much better. Q4: Is there a smarter way to go about this?

data {
    int Nflux;
    int Nchan;
    int Nt;
    int Npremodel;
    int<lower=0> all_src_data[Nt,Nchan];
    int<lower=0> bkg_data[Nchan];
    real phoindex;
    real reflphoindex;

    vector[Nchan] bkg_model;

    matrix<lower=0>[Nflux,Nchan] RMF;
    vector<lower=0>[Nflux] ARF;
    vector<lower=0>[Nflux] e_lo;
    vector<lower=0>[Nflux] e_hi;
    real<lower=0> src_expoarea;
    real<lower=0> src_to_bkg_ratio;
    real<lower=0> dt;
}
transformed data {
    vector<lower=0>[Nflux] e_width;
    vector<lower=0>[Nflux] e_mid;
    int Ntotalmodel;
    matrix<lower=0>[Nchan,Nflux] response;
    vector[Nflux] fluxmodel;
    vector[Nflux] reflmodel;
    vector[Nchan] fluxmodeldet;
    vector[Nchan] reflmodeldet;
    real max_time_logsigma;
    int<lower=0> all_src_data_flat[Nt*Nchan];

    // width of the delay kernel or DRW timescale has to be within the modelled window
    max_time_logsigma = log10(Npremodel * dt / 3);

    e_mid = (e_hi + e_lo) / 2.0;
    e_width = e_hi - e_lo;
    Ntotalmodel = Npremodel + Nt;
    for (i in 1:Nchan) {
        for (j in 1:Nflux) {
            response[i,j] = RMF[j,i] * ARF[j] * e_width[j] * src_expoarea;
        }
    }

    for (i in 1:Nflux) {
        fluxmodel[i] = e_mid[i]^(-phoindex);
        reflmodel[i] = e_mid[i]^(-reflphoindex);
    }
    // project components through response:
    fluxmodeldet = response * fluxmodel;
    reflmodeldet = response * reflmodel;

    for (i in 1:Nt) {
        for (j in 1:Nchan) {
            all_src_data_flat[Nchan * (i-1) + j] = all_src_data[i,j];
        }
    }
}
parameters {
    // primary component: powerlaw with damped random walk
    real<lower=0,upper=Npremodel*dt> drw_tau;
    real drw_mean;
    real<lower=-3,upper=max_time_logsigma> drw_logsigma;
    vector[Ntotalmodel] lognorm;

    // secondary component: powerlaw with gaussian delay response of primary
    real<lower=0,upper=Npremodel*dt> delay_mean;  // in time units, to match the kernel below
    real<lower=0,upper=max_time_logsigma> delay_logstd;
    real logreflstrength;
    real logbkgnorm;
}
transformed parameters {
    real reflstrength;
    real bkgnorm;
    real delay_std;
    real drw_sigma;
    vector[Ntotalmodel] norm;
    vector[Nt] reflnorm;
    vector[Npremodel] reflweight;
    real all_src_model_flat[Nt*Nchan];

    drw_sigma = pow(10, drw_logsigma);
    reflstrength = pow(10, logreflstrength);
    bkgnorm = pow(10, logbkgnorm);
    delay_std = pow(10, delay_logstd);

    norm = exp(lognorm * log10());  // norm = 10^lognorm; log10() is ln(10)
    // prepare delayed response function (gaussian)
    // convention: with delay_mean=0, j=Npremodel should be the maximum
    for (j in 1:Npremodel) {
        // look back in time j steps:
        reflweight[j] = exp(-0.5 * square(((Npremodel - j) * dt - delay_mean) / delay_std));
    }

    for (i in 1:Nt) {
        // compute reflection by summing up contributions from the past
        // weighted by delayed response function (gaussian)
        reflnorm[i] = dot_product(norm[i:(i + Npremodel - 1)], reflweight);
        for (j in 1:Nchan) {
            all_src_model_flat[Nchan * (i-1) + j] = 
                norm[Npremodel+i-1] * fluxmodeldet[j] + 
                reflstrength * reflnorm[i] * reflmodeldet[j] //+ 
                //bkgnorm * bkg_model
            ;
        }
    }
}
model {
    lognorm[1] ~ normal(0, 10);  // prior on the initial state only; the AR(1) below handles the rest
    drw_mean ~ normal(0, 3);
    // drw_tau: implicit uniform prior from its declared bounds
    drw_logsigma ~ normal(-1, 0.3);
    delay_logstd ~ normal(1, 1);
    logreflstrength ~ normal(-3, 3);

    // AR(1) / damped random walk: mean-reverts towards drw_mean
    lognorm[2:Ntotalmodel] ~ normal(lognorm[1:Ntotalmodel-1]
        + dt / drw_tau * (drw_mean - lognorm[1:Ntotalmodel-1]), drw_sigma);

    // source + background time series
    all_src_data_flat ~ poisson(all_src_model_flat);
    // background time series
    bkg_data ~ poisson(bkgnorm * bkg_model);
}

Cheers,
Johannes


Hi,
sorry for not getting to you earlier. This looks like a tough problem, although I don't fully grasp many of the details, as I don't have a physics/signal-processing background and can't translate sentences like "random walk to have, say, a power-law power spectrum" into math.

This is correct. Also note that you can build linear state-space models (i.e., the stuff Kalman filters are built for) with an arbitrary likelihood, but you then need to make the latent states explicit parameters and can't integrate them out (as the Kalman filter does). Not sure if that's a good approach for you, but I've built a model with this and it worked OK-ish.
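To illustrate, a minimal untested sketch of what I mean (hypothetical names; AR(1) latent states observed through a Poisson likelihood):

data {
    int<lower=1> N;
    int<lower=0> y[N];
}
parameters {
    real<lower=-1,upper=1> phi;    // AR(1) coefficient
    real<lower=0> sigma;           // innovation scale
    vector[N] log_rate;            // latent states, made explicit parameters
}
model {
    sigma ~ normal(0, 1);
    log_rate[1] ~ normal(0, 5);
    // AR(1) transition density replaces the Kalman filter recursion
    log_rate[2:N] ~ normal(phi * log_rate[1:N-1], sigma);
    // arbitrary (here Poisson) observation model on the latent states
    y ~ poisson_log(log_rate);
}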

So if I understand it correctly, the PSD is something you can compute from the ARMA parameters. A direct approach would then be to have the PSD as an explicit parameter in your model (so you can put a prior on it directly), remove one of the ARMA parameters, and instead compute that parameter from the PSD and the rest of the parameters. If there is no analytic solution, you can use algebra_solver to solve the equation numerically (but performance will suffer). Not sure if this is feasible for your case.

This is unfortunately completely out of my expertise (while the points above are only somewhat out of my expertise :-). I know there is "spectral approximation to Gaussian processes" (e.g., Approximate GPs with Spectral Stuff and https://discourse.mc-stan.org/t/hermite-polynomials-and-their-inverse-matrix-continued-from-approximate-gps-with-spectral-stuff/37930), but I am not sure they use "spectrum" in the sense you use it, so this is just a wild guess that it might be relevant. (Once again, I don't understand the spectral stuff either in GPs or in your application :-/)

One thing that sometimes helps with random-walk models is that instead of working with the states directly, you take lognorm_raw ~ normal(0, 1) and compute lognorm by scaling and summing lognorm_raw appropriately (the "non-centered parameterization"). This can remove a lot of the correlation.
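For your model, that could look roughly like the following untested sketch, written as a drop-in change to your code (drw_tau, drw_mean and drw_sigma as before; lognorm_raw replaces lognorm as the sampled quantity, and I assume the mean-reverting AR(1) form):

parameters {
    vector[Ntotalmodel] lognorm_raw;  // standardized innovations
}
transformed parameters {
    vector[Ntotalmodel] lognorm;
    lognorm[1] = 10 * lognorm_raw[1];  // matches lognorm[1] ~ normal(0, 10)
    for (t in 2:Ntotalmodel) {
        lognorm[t] = lognorm[t-1]
                     + dt / drw_tau * (drw_mean - lognorm[t-1])
                     + drw_sigma * lognorm_raw[t];
    }
}
model {
    // implies the AR(1) prior on lognorm without a sampling statement on it
    lognorm_raw ~ normal(0, 1);
}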

Also, we usually recommend enforcing constraints that are not physical via priors rather than hard bounds. Actually, you kind of do this already: when you have drw_logsigma ~ normal(-1, 0.3);, you should not need real<lower=-3,upper=max_time_logsigma> drw_logsigma; (assuming max_time_logsigma > 0), as those configurations are already quite strongly excluded by the prior. If your chains move so far from the prior that the bounds have an effect, there is likely an issue with the model (a bug or simply a prior-data conflict).

Other than that the model is too complex for me to understand well, but hope this helps at least a bit.

Good luck!


Many thanks for the reply. I think you hit a couple of good points, and this was helpful for clarifying my thoughts.

Regarding the power spectral density (PSD), just to make 100% sure we are on the same page: I mean that if the process were infinitely well sampled, without measurement uncertainties, you could take a Fourier transform, and I want to place priors on that spectrum (Fourier power vs. frequency). A power law is one example.
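In formulae (my notation), for a stationary process $x(t)$,

$$S(f) = \lim_{T \to \infty} \frac{1}{T}\, \mathbb{E}\!\left[\, \Bigl| \int_0^T x(t)\, e^{-2\pi i f t}\, dt \Bigr|^2 \right],$$

and the prior I have in mind is $\log S(f) \approx -\alpha \log f + \mathrm{const}$ with $\alpha \sim \mathcal{N}(2, 1)$.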

For ARMA, yes, there is an analytic formula. Your idea to drop one free parameter and select it based on the PSD prior is a good one. Thanks, I will try it! But I suspect that for some choices of the other n-1 ARMA parameters, the PSD prior cannot be reached for any value of the nth parameter. The prior density will likely also not be conserved by such an approach, even if I can find the Jacobian. But it may be a step in the right direction.

For GPs, I was unsure. Thank you for the links. They led me to some interesting references, including spectral decompositions and Fourier features, which I have yet to understand. Perhaps this is the way to go. I will also ping @grburgess, who I believe has worked with this before.

Just a quick pointer that for a basic Poisson GP you can use this approach by Michael Betancourt: Robust Gaussian Process Modeling

Another colleague referred me to the Wiener–Khinchin theorem (Wiener–Khinchin theorem - Wikipedia), which provides the connection between covariance kernels and their Fourier spectrum.
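Concretely, for a stationary covariance kernel $k(\tau)$ (sign and $2\pi$ conventions vary):

$$S(f) = \int_{-\infty}^{\infty} k(\tau)\, e^{-2\pi i f \tau}\, d\tau, \qquad k(\tau) = \int_{-\infty}^{\infty} S(f)\, e^{2\pi i f \tau}\, df,$$

i.e., the PSD and the covariance kernel form a Fourier-transform pair.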

Given that I can sample the power spectrum, I could do an inverse Fourier transform and create the covariance kernel. But I would need to sample the complex phases. That might be tricky even with Stan.

Sparse representations such as random Fourier features sound interesting. I am interested in presenting the posterior PSD, but worried that if it is a handful of delta functions, it will not be a very convincing result. But maybe that is not a problem if I use enough features.

The main reference for GPs is certainly "Gaussian Processes for Machine Learning" by Rasmussen and Williams. Chapter 4 provides some discussion of the PSD of common kernels and their relation to AR(p) processes. The Matérn class of kernels might be what you are looking for …
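For the record (quoting from memory, cf. their eq. 4.15), the one-dimensional Matérn kernel with smoothness $\nu$ and length scale $\ell$ has, up to normalisation,

$$S(f) \propto \left( \frac{2\nu}{\ell^2} + 4\pi^2 f^2 \right)^{-(\nu + 1/2)},$$

i.e., a high-frequency power law with slope $2\nu + 1$. In particular, $\nu = 1/2$ (the exponential kernel, an OU process) gives exactly the slope-2 case you mentioned.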

Thanks @bertschi, I checked there already, but the information on the corresponding Fourier representation (Appendix A.8) is quite meager.

Another (simplistic) idea:

First, do a warm-up inference: Treat each time interval independently and infer a posterior there. This will be quite informative for the primary process, and very uninformative for the secondary processes.
Then, for each posterior sample, Fourier transform the primary process, weight it by the desired PSD prior, and resample the posterior accordingly. This posterior will be an auxiliary distribution.

Then, in a particle filter / state-space approach, instead of drawing the next state based on the few previous ones (as in ARMA), I can draw the next state from the auxiliary distribution learned above and apply an importance-sampling weight correction to judge the proposal against the data, with the usual particle-filter resampling as needed. At the end of the time series, look at the Fourier transform of the generated primary process and apply the prior weighting there as well. Then update the hyper-parameters and repeat, with whatever sampler is wrapped around the particle filter.

I think I might then get away with a relatively low-dimensional sampler (dealing only with the hyper-parameters).

I think this could also address the issue of the states before the time series starts, through the covariance of the auxiliary distribution. Presumably I have to assume a shape for that distribution, e.g., Gaussian or Student-t, and perhaps do a few iterations to learn it.

Caveats apply, of course.

Note that you only need the Jacobian if you are putting a transformed quantity "on the left-hand side of a sampling statement", so there are actually two approaches (I missed one of them, so good observation there):

  1. Have the PSD as an actual parameter and compute one of the ARMA coefficients from it. You might need constraints (e.g., choosing the PSD might make some values of the ARMA coefficients impossible), but you will not need a Jacobian.
  2. Have the ARMA coefficients as parameters and compute the PSD as a transformed parameter, then put a prior on the PSD plus all ARMA parameters except one. You need the mapping between "parameters" and "stuff you put priors on" (very loosely speaking) to be one-to-one. Then you need the Jacobian, but you don't need the constraints or to solve for ARMA parameters. (See the sketch below.)
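A minimal sketch of option 2 with a single AR(1) coefficient. To keep it self-contained, slope_of here is a made-up toy mapping, not the real ARMA PSD slope; it only shows where the Jacobian term goes:

functions {
    // toy stand-in for the true ARMA-coefficient -> PSD-slope mapping
    real slope_of(real phi) {
        return 2 * phi / (1 - phi);
    }
}
parameters {
    real<lower=0,upper=1> phi;
}
transformed parameters {
    real slope = slope_of(phi);
}
model {
    slope ~ normal(2, 1);               // prior on the transformed quantity
    // Jacobian of the (monotone) map phi -> slope: d slope / d phi = 2 / (1 - phi)^2
    target += log(2) - 2 * log1m(phi);  // log |d slope / d phi|
}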

It is obviously still possible that the other approaches are better; I just wanted to highlight this.

Does that make sense?

Yes, thanks.