Fitting GP with noisy observations of time

martinmodrak · August 1, 2017, 2:22pm

Hi all,
I am trying to solve a biological problem with a GP model, where I cannot directly observe the time of the observations. In particular, I have

regulator ~ GP(gp_sigma, gp_length)
target = f(regulator) //here f is a function over functions
time_position[i] ~ Beta(shape1, shape2)
regulator_expression[i] ~ N(regulator(time_position[i]), expression_sigma)
target_expression[i] ~ N(target(time_position[i]), expression_sigma)

And I can observe only regulator_expression and target_expression, not time_position. I am interested in multiple forms of f, but cannot fit it for any, so I decided to try to work with a simpler model where f(g)(x) = x, i.e. f maps any function to identity, in hopes of learning what’s wrong. Now the problem can be seen as having noisy observations of the GP (regulator_expression) and noisy observations of the associated time (target_expression) - still the model does not converge, even if all the hyperparameters are known. Since if the time_position is observable, the GP alone converges, I assume, the added noise in observation time is responsible for the issues.

Do you have any ideas to try? Or is there a reason to believe that this class of models is beyond the reach of current samplers?

Here is the model code, I have also attached it in a file with an R script that generates data from the model, tries to fit the data with Stan and shows a graphical representation of the fit. Usually most of the chains get close to the true values of regulator over time, but some chains contain nonsense and there are divergent transitions and often the max treedepth is exceeded. I’ve run through a bunch of tweaks and reparametrizations, but nothing worked and I am out of ideas.

Thanks for any help!

The model:

data {
  int num_cells;
  vector[num_cells] regulator_expression;
  vector[num_cells] target_expression;
  real expression_sigma;
  real gp_sigma;
  real gp_length;
  real time_shape_1;
  real time_shape_2;
}


parameters {
  real<lower = 0, upper = 1> time_position[num_cells];
  vector[num_cells] regulator_true_raw;
}

transformed parameters {
  vector[num_cells] regulator_true;
  vector[num_cells] target_estimate;

  {
    //new scope to introduce vars that are not stored in output
    matrix[num_cells, num_cells] cov_m = cov_exp_quad(time_position, gp_sigma, gp_length);
    matrix[num_cells, num_cells] cov_m_chol;

    //Add jitter to ensure positive semidefiniteness
    for(i in 1:num_cells) {
      cov_m[i,i] = cov_m[i,i] + 1e-6;
    }

    cov_m_chol = cholesky_decompose(cov_m);

    regulator_true = cov_m_chol * regulator_true_raw;

    target_estimate = to_vector(time_position);
  }

}

model {
  regulator_expression ~ normal(regulator_true, expression_sigma);
  target_expression ~ normal(target_estimate, expression_sigma);

  regulator_true_raw ~ normal(0,1);

  time_position ~ beta(time_shape_1, time_shape_2);
}

An example model fit - lines are samples from the regulator posterior and points are the true values.

gp_position_test.stan (1.2 KB)

gp_position_test.R (4.5 KB)

anon75146577 · August 1, 2017, 2:53pm

I’m not sure the model you wrote down (if I understood it correctly) will fit well anywhere. There will be massive multimodality if you can’t, for example, order the observations (or approximately order them).

This is a really really hard problem, an ugly non-linear extension of the Berksen measurement error model. The challenge is that the GP is so flexible it will be very very hard to make this identified. There’s a long history of bad solutions to this problem (probably because there are no good ones) in fields like paleoclimate (https://arxiv.org/pdf/1206.5009.pdf) and spatial statistics (http://onlinelibrary.wiley.com/doi/10.1002/env.1062/abstract).

As far as Stan goes, I would use a slightly different GP model. Namely, I would use a spline with fixed knots (or a fixed number of basis functions if you’re using P-Splines / O-Splines). Because a spline is continuous, you can evaluate the likelihood for any set of time points t without changing the covariance matrix. This means that the prior on the spline doesn’t change as the time points change (whereas with the GP you need to re-evaluate everything every time, which will be difficult to do stably).

But If you want this to actually work, I would recommend putting very informative priors on everything. A good workflow would be to start with fake data (that mirrors the structure of the data you’re interested in) and fit the model with increasingly weak priors (starting from known time points and then relaxing). This will give you some idea of how much prior information you’ll need to inject to stop your inference from going haywire.

anon75146577 · August 1, 2017, 3:00pm

Just to give an example of the sort of prior information that would really help you with this model: in that paleoclimate example, they know the order the observations come in (because they’re from things like ice cores, which are naturally ordered) and they have some sensible bounds for how far off they can be (from the “dating procedure”). Similar things happen in Berkson measurement error models (to spell Berkson correctly for a change). In the spatial paper, it was assumed that the GPS measurements of the location have some known instrumental error.

Without something like this, it’s hard to see how you can expect not to have uselessly broad uncertainty (ie your posterior saying “I have no idea”).

Is there anything in your problem you can use?

betanalpha · August 1, 2017, 4:36pm

You’ll at least need ordering to avoid the multimodality that @anon75146577 mentions. If you can guarantee ordering then use

ordered_vector[num_cells] time_position;

Priors on the individual elements of time_position can enforce the constraint to 0 < t < 1 well enough.

Andre_Pfeuffer · August 1, 2017, 5:10pm

Are the time-points independent? Did you check for AR, MA? Why you modeled it as GP? Why its not a stochastic process? Lévy Process? It don’t understand why the time is fixed to 0,1 interval?

avehtari · August 1, 2017, 7:53pm

I would also recommend ordered (or at least soft ordered).

Hmmm… I guess you meant to write something else than “because a spline is continuous”. It’s not good explanation as GP is also continuous. You can also make basis function representation of GP, if you want to think in terms of covariances instead of splines.

GP is a stochastic process.

Aki

anon75146577 · August 1, 2017, 8:01pm

Spline could be replaced with any basis function expansion with known basis functions and unknown weights that have a multivariate normal distribution. P-splines are probably the easiest of these.

The key thing that needs to be true is that if you change the observation locations you don’t want to have to re-compute the prior covariance matrix.

betanalpha · August 2, 2017, 12:25am

This is the first time I’ve heard a reasonable argument for splines! I could be an important one in this particular application, although if you’re marginalizing over the hyperparameters then you’d be recreating the covariance matrix at every iteration anyways.

anon75146577 · August 2, 2017, 12:58am

Say it with me Michael: a spline is an intrinsic Gaussian process and hence
no more weird than a GP.

betanalpha · August 2, 2017, 1:10am

Sure, but isn’t every well-posed time series model?

I was referring to the interpretation – I could never grok the consequences of the knot placement and then the tunings condition on those placements, whereas the hyperparameters of a exponential quadratic covariance function have some immediate meaning. Are splines formally infinite dimensional or do they correspond to degenerate kernels?

As always, this could be do to my limited understanding of spines!

anon75146577 · August 2, 2017, 1:11am

Knots are for people who didn’t understand Wahba the first time.

The GP is the fundamental bit. The knots are just the representor theorem +
“implicit regularisation”

anon75146577 · August 2, 2017, 1:14am

People who aren’t Betancourt are welcome to convert that to something
polite in their head. I forgot it was a forum post.

Knots are fine but they are much harder to understand as a GP (although
notable attempts would be “subset of regressors” and “predictive processes”)

betanalpha · August 2, 2017, 1:22am

For the Stan community, we are usually running arounds with our proverbial heads cut off regarding various modeling techniques and until @anon75146577 informs us of the useful interpretation that we wished we had found many months prior. Please heed his advice.

anon75146577 · August 2, 2017, 1:25am

There is not a power on this earth that will stop me sharing my opinion (or
you’re wondering why I’ll die alone). You only had to ask.

Between me and Aki we’ve got both the topic and the time zones covered!

martinmodrak · August 2, 2017, 5:42am

Thanks a lot for the hints - a lot to digest for me, but will try to make sense of it all and let you know, if that allowed me to go further. I understand the task is challenging, but that makes it worth trying :-)

As for ordering, I unfortunately cannot assume that. The data I am aiming at are single cell RNA-seq data at steady state, where I assume cells to be distributed along an oscillating gene expression trajectory at an unknown/arbitrary time scale (that’s why the time is [0,1]). The only thing I can safely assume is periodicity - or, more precisely, that if I waited an arbitrary amount of time and measured the same cells (impossible in practice), the distrbution of (regulator, target) pairs will be the same, but individual cells could have changed values. Models in use for this kind of data have a stricter assumption - that each individual cell is in steady state and thus if measured after a time interval, all cells would stay the same, which obviously makes the problem a lot easier, but could be overly strict.

Except for periodicity, the only other thing I can assume is continuity/smoothness and that there is a known deterministic relationship between regulator and target over time - in particular, that regulator determines the derivative of target.

avehtari · August 2, 2017, 11:04am

We should write a wiki page on recommendations when to use splines, basis functions, Gaussian Markov random fields, Gaussian processes, etc. depending on the number of dimensions, amount of data, prior assumptions about the functions, what Stan can currently handle easily, etc. I’ll start writing this after I have resubmitted PSIS paper, and I’ll then ask you to edit.

Aki

bbbales2 · August 2, 2017, 2:03pm

@martincerny

Except for periodicity, the only other thing I can assume is continuity/smoothness and that there is a known deterministic relationship between regulator and target over time - in particular, that regulator determines the derivative of target.

If you plotted all your regular vs. target points in phase space would we see a circle?

@betanalpha

although if you’re marginalizing over the hyperparameters then you’d be recreating the covariance matrix at every iteration anyways.

I don’t think a lot of these splines have hyperparameters to marginalize over. Or at least difficult ones like the RBF length scale. Maybe it’s not an advantage because splines just make you think about this earlier (cause you have to specify their domains/decide if you’re going to use an expansion, etc), so the difficulty is just being shifted?

@anon75146577

Since the thread has gone this way, the splines assumption that weirds me out a little is if you split the regression into two separate spaces like Wahba does, the non-penalized H0 and then the penalized H1, that leaves you with the improper priors (uniform -inf to inf) on the basis functions that form H0 (cause they’re non-penalized). Does it matter mathematically if I start putting (possibly non-Gaussian) priors on my H0 basis functions? Am I screwing something up? I really liked splitting things out into different spaces to avoid the non-identifiabilities between what you’re fitting and what you’re considering noise.

@avehtari

We should write a wiki page on recommendations when to use splines, basis functions, Gaussian Markov random fields, Gaussian processes, etc. depending on the number of dimensions

Yes yes this. If you want help with example models or plots, I’m in.

anon75146577 · August 2, 2017, 3:15pm

Nope. That’s exactly what we’re doing here Case study on spatial models for areal data - Poisson CAR/IAR - #69 by anon75146577

A technical point though: H0 are not non-identifiabilities. They’re just directions that aren’t “seen” by the prior. They’re only non-identifiable if you have them in the model a second time (eg if H0 contains constant functions and you fit an intercept + a spline this is non-identifiable unless you constrain the spline to be orthogonal to H0).

They don’t have a length-scale/band-width parameter. In fact you can get a lot of spline models by letting the length-scale go to infinity. You will often see data try to send you to that limit (or the practical version: length scale > than observed domain), which is another version of the “folklore” form time series modelling that data often tends towards the non-stationary boundary of the parameter space.

Also - in a very soon to be complete paper (next week if the gods are kind) [see also this one] we argue that you need to consider things like domain size when setting priors on the parameters in GPs, so you still need to consider domains. (In a not soon to be completed paper because its two authors are busy, we show that you don’t need to worry about boundary effects - they’re easy to deal with [at least in low dimensions]).

anon75146577 · August 2, 2017, 4:06pm

I am up for this.

martinmodrak · August 2, 2017, 6:56pm

@anon75146577: I tried working with very informative priors and it turns out, splines converge with priors that are still to broad for the GP to converge, so this is a way forward - thanks for the idea, it is a useful trick I didn’t consider!

@bbbales2:

If you plotted all your regular vs. target points in phase space would we see a circle?

Unless my math (which is rusty at many places) fails me badly, it should be a closed path. The problem is, that the dynamics of the system is not fully contained in regulator and target, there is also an unobserved/impossible to model part, i.e. knowing target and regulator does not uniquely determine the future evolution of the system. In fact, it turns out that if I assume that target and regulator determine the evolution uniquely, the system always diverges (target or regulator grow without bounds) or converges to a single point - so under assumption of stability over time, the only solution is that all cells are exactly in the same state.

Wait a minute - I guess I just said, that if the system is interesting, the posterior for time_position has to be multimodal. No wonder I am having problems then :-). Thanks for helping me realize this - I will obviously need to find a way to include more information in the model.

Topic		Replies	Views
Generalising Gaussian Process (GP) model Modeling specification , gaussian-process	12	1109	October 25, 2017
Gaussian Process Prediction Modeling rstan , techniques , fitting-issues	2	484	March 22, 2021
Change from global model fit to one-step-ahead prediction Modeling rstan , fitting-issues	7	1497	May 25, 2020
Advice to improve fit using ODE solver Modeling	1	719	August 1, 2017
Brownian motion with measurement noise Modeling	11	1591	April 26, 2021

Fitting GP with noisy observations of time

Related topics