Ito process as numerical solution of stochastic differential equation

NONMEM has had SDE + Kalman filter support for a while, though I haven’t used it.

There are numerous approaches and seemingly even more numerous notations/terminologies. At the end of the day whatever solution the solver returns just has to be locally smooth with respect to the parameters.

Can you point me to some other notations/terminologies? Calculating greeks is the only context/application that I’m familiar with.

I don’t have any references offhand, but beyond finance there is at least the statistics stochastic-process literature and the physics stochastic-process literature. The notation has some overlap but also plenty of heterogeneity. Conditioning on a single realization of the randomness is also a subtle mathematical concept that convolves the math with the implementation – while I’ve had many conversations about such techniques, I haven’t seen the discussion arise in the literature. But I’m also far from an expert.

As @betanalpha suggested, I’m adding some background here (I don’t know how to do LaTeX in a GitHub issue), trying to avoid most technicalities.

  • What’s the issue ito process as numerical solution of SDE · Issue #1263 · stan-dev/math · GitHub is about?
    It’s about solving a general stochastic differential equation of the form dX = \mu(X)dt + \sigma(X)dW using a modified Euler method. The output is a matrix whose column i is X^i, the numerical solution at step i.
  • What does the above equation mean?
    It’s an integral equation written like a differential equation. The integral form is
    X = X_0 + \int \mu\, dt + \int \sigma\, dW
    The first integral is an ordinary one (in the Riemann–Stieltjes sense). The second is an Ito integral (a generalized Riemann–Stieltjes integral) with respect to the Wiener process W (a continuous random walk, in a sense); it describes the random contribution and gives X its random nature. One can view dW as a white-noise shock imposed on the evolution at every time step.
  • What is \mu, \sigma?
    \mu is a function capturing the drift (or trend) of the process. \sigma is a function describing the intensity (volatility) of the random fluctuation.
  • What does a “solution” mean for such an equation?
    Solution X is a stochastic process (an Ito process) that satisfies the above integral equation.
  • Why Euler method?
    It’s simple and the most popular scheme (so there is a large knowledge base on its performance). The proposed use of a tamed Euler method lifts some of the restrictions of Euler–Maruyama, which requires a globally Lipschitz \mu, and hence suits more applications.
  • What’s the order of this scheme?
    It’s of order 1/2 for strong approximation (convergence of the mean error E(|X_n - X(t_n)|)), and order 1 for weak approximation (convergence of the error of the mean |E(X_n) - E(X(t_n))|).
  • What’s the limitation of the scheme?
    Low strong-convergence order, and it is not adaptive (no established adaptive stepping scheme exists for SDEs, to my knowledge).
  • How do random variables enter the SDE from Stan’s point of view?
    First, one should be able to use the implementation as an RNG to generate an ensemble of Ito processes, by fixing X_0, \mu, and \sigma while letting W be generated from a standard normal RNG. Second, we also want to fit an SDE for \theta in \mu(X, \theta), \sigma(X, \theta), given data as a time series. In this case we have to include the nuisance parameter W. Take 1D for example (X \in \mathbb R): if we abstract the Euler scheme as a map f: \Delta W_i \rightarrow X_i for step i, then passing in the parameter W is equivalent to f^{-1}(X_i) \sim \mathcal N(0, \Delta t).
  • Any reference?
    The most accessible is Evans’ note.
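To make the scheme in the bullets above concrete, here is a minimal Python sketch of a tamed Euler step for dX = \mu(X)dt + \sigma(X)dW. The taming factor 1/(1 + \Delta t|\mu(X)|) is one standard choice from the tamed-Euler literature; the function names and the example drift/diffusion are illustrative, not the proposed Stan interface:

```python
import math
import random

def tamed_euler(mu, sigma, x0, t_end, n_steps, rng=random):
    """Simulate one path of dX = mu(X) dt + sigma(X) dW with a tamed
    Euler scheme: the drift increment is damped by 1/(1 + dt*|mu(X)|),
    which relaxes the global Lipschitz requirement of Euler-Maruyama."""
    dt = t_end / n_steps
    sqrt_dt = math.sqrt(dt)
    x = [x0]
    for _ in range(n_steps):
        dw = rng.gauss(0.0, 1.0) * sqrt_dt   # Wiener increment ~ N(0, dt)
        drift = mu(x[-1]) * dt / (1.0 + dt * abs(mu(x[-1])))
        x.append(x[-1] + drift + sigma(x[-1]) * dw)
    return x

# Example: geometric Brownian motion dX = 0.1 X dt + 0.2 X dW
path = tamed_euler(lambda x: 0.1 * x, lambda x: 0.2 * x, 1.0, 1.0, 1000)
```

Fixing x0, mu, and sigma while redrawing the Gaussian increments gives an ensemble of paths, which is exactly the RNG usage described above.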

I’m still not sure to what the solutions X^{i} refer. In general the solution to an SDE is an entire probability distribution (usually specified as a probability density function) at each time point, not a single state. So are you planning on returning multiple X^{i} as samples from that distribution, or a single sequence of X^{i} corresponding to a sequence of means or other summaries of the distributions, or even a single sequence X^{i} corresponding to one path sampled from the sequence of distributions?

The first option runs into differentiability issues, the second may work but only if the given summary is all that is needed, and the third may work if care is taken in how the Wiener process is handled.

X refers to the sample path of the corresponding Ito process given a Wiener process sample path. X^i refers to X at time step i.

Considering that SDEs in general do not have analytical solutions, so that no pdf can be written in closed form, I’d say the solution is in general implicitly defined by the SDE itself.

By solution I mean the third: a path (trajectory) given the initial condition, drift/diffusion functions, and a Wiener process. This is not very different from working with a process that has an analytical path formulation. Take the OU process as an example: given a discrete sample path, how do we fit the constant drift \mu and constant diffusion \sigma? We write out the likelihood by expressing the Wiener process increment \Delta W in terms of X, \Delta t, \mu, \sigma and making it standard normal.
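A minimal sketch of that likelihood construction for the constant-drift, constant-diffusion case dX = \mu\,dt + \sigma\,dW (function name and interface are illustrative): each increment \Delta X_i = \mu\Delta t + \sigma\Delta W_i, so the implied Wiener step (\Delta X_i - \mu\Delta t)/\sigma must be N(0, \Delta t).

```python
import math

def increment_loglik(x, dt, mu, sigma):
    """Log-likelihood of a discretely observed path of dX = mu dt + sigma dW:
    each increment dx_i = x[i+1] - x[i] is N(mu*dt, sigma^2*dt), i.e. the
    implied Wiener step (dx_i - mu*dt)/sigma is standard normal once
    rescaled by sqrt(dt)."""
    ll = 0.0
    for i in range(len(x) - 1):
        dx = x[i + 1] - x[i]
        z = (dx - mu * dt) / (sigma * math.sqrt(dt))  # standardized Wiener step
        ll += -0.5 * z * z - math.log(sigma * math.sqrt(2.0 * math.pi * dt))
    return ll
```

For a numerical scheme the same idea applies with the one-step map of the integrator in place of the exact transition.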

Great, thanks, that makes things more clear.

How do you imagine this would be used in a Stan program? Because you’re solving for a single sample path you can’t construct a probability density for the path realizations at each point, which also means that you can’t build a joint density for the path and the process parameters (and hence can’t implicitly get at the posterior for the Wiener realizations and process parameters conditioned on the path). Just want to make sure that I’m not missing something.

Using the trivial model dX = \theta dW as an example, what I’d like to do in Stan looks like this

data {
  int N;                        /* nb. of data points */
  int M;                        /* nb. of auxiliary data between each pair of data points */
  real<lower=0> T;              /* end time */
  vector[N] scal_w;             /* observed data X */
}

transformed data {
  real dt = T/(N*M);            /* fixed time step */
}
parameters {
  real<lower=0.0> sigma;        /* diffusion parameter (data simulated with known sigma) */
  real<lower=0.0> eps;          /* obs error */
  real sn[M*N];                 /* Wiener process' std normal steps */
}

model {
  vector[M] x;
  real x0;
  
  sigma ~ normal(1.0, 0.5);
  eps ~ cauchy(0, 1);
  sn ~ normal(0, 1);

  x0 = 0.0;
  for (i in 1:N) {
                                /* nested loop equivalent to SDE integrator */
                                /* that outputs x, given sn, dt, and sigma */
    for (j in 1:M) {
      x[j] = (j == 1 ? x0 : x[j - 1]) + sigma * sqrt(dt) * sn[(i - 1) * M + j];
    }
    scal_w[i] ~ normal(x[M], eps);
    x0 = x[M];
  }
}

as proposed in http://apps.olin.wustl.edu/faculty/chib/papers/elerianchibshephard2001.pdf. The idea is to generate missing/auxiliary data between observations, with the SDE integrator describing the recursive conditional likelihood p(X_{i+1}|X_i, W_{i+1}, \theta). This is a full-posterior approach, as opposed to filter-based approaches.
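To exercise a program like the one above one needs simulated data. A minimal Python sketch that generates observations for the trivial model dX = \sigma dW, with names chosen to mirror the Stan data block (the function itself is illustrative):

```python
import math
import random

def simulate_scal_w(N, M, T, sigma, eps, seed=0):
    """Simulate observations for dX = sigma dW: integrate N*M Euler
    steps of size dt = T/(N*M) and record every M-th state with
    N(0, eps) observation noise, mirroring the Stan data block."""
    rng = random.Random(seed)
    dt = T / (N * M)
    x = 0.0
    scal_w = []
    for i in range(N):
        for _ in range(M):
            x += sigma * math.sqrt(dt) * rng.gauss(0.0, 1.0)
        scal_w.append(x + eps * rng.gauss(0.0, 1.0))
    return scal_w
```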

Ah, okay. For the model without any drift it’s just a latent autoregressive process, right? In other words you’re modeling the unit innovations with parameters, yeah?

Yes. Within the state-space model framework we’re probably better off with a Kalman filter when things are linear and Gaussian. But for general drift and diffusion we’d have to work with nonlinear filters, so sometimes the full posterior may be easier to use.

I’m mildly hesitant about how hard nonlinear models will be to fit, and about what happens if \epsilon in your example is very small and the non-centered parameterization you use becomes suboptimal relative to a centered parameterization, but it’s worth moving forward if only to experiment.

That’s definitely a problem. On the other hand, it’s not easy to project how an extended Kalman filter would perform in nonlinear models either. Also, the proposed abstraction doesn’t change the recursive nature of the latent states or how the parameterization is done between latent states and measurement states. Worst case, we get an RNG for SDEs.

But having a single abstraction forces one to choose between a centered or non-centered latent parameterization, no?

Ah, I misread your comment and thought we were talking about the error variance \epsilon. Yes, the abstraction limits the recursive relationship to a non-centered parameterization. Considering the smallness of \Delta t for an explicit scheme, wouldn’t non-centered be more forgiving for the sampler?
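The two parameterizations being discussed are deterministic re-expressions of each other: non-centered treats the standard-normal innovations as the parameters and builds the states from them (as in the Stan program above), while centered treats the states themselves as parameters. A small sketch of that one-to-one map (function names are illustrative):

```python
import math

def noncentered_path(x0, sigma, dt, sn):
    """Non-centered: latent states are a deterministic transform of
    standard-normal innovations sn."""
    x = [x0]
    for z in sn:
        x.append(x[-1] + sigma * math.sqrt(dt) * z)
    return x[1:]

def centered_increments(x0, sigma, dt, x):
    """Centered: latent states x are the parameters; recovering the
    implied innovations inverts the non-centered map."""
    prev = x0
    sn = []
    for xi in x:
        sn.append((xi - prev) / (sigma * math.sqrt(dt)))
        prev = xi
    return sn
```

Which one samples better depends on how strongly the data constrain the states, which is the tension raised above.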

I’ve done limited experimentation in this direction, and basically gave up on sampling states in favour of integration; for anything multivariate the sampling seemed very difficult. I tried centered and non-centered, scaled and non-scaled. But things change – maybe a dense mass matrix would help. Curious how performance goes!

I’m playing with several models to experiment. If it doesn’t work out, filter approaches are the next option.

Depends on how small the observational noise is. If the observations are extremely informative then the latent realization would be extremely constrained even without the regularization of the latent process and a centered parameterization would be preferable.

That makes sense.

Unfortunately, according to the tests I’ve done, neither the centered nor the non-centered parameterization gives satisfactory performance for nonlinear models. This is not surprising, and is consistent with @Charles_Driver 's comment. The scheme that I’m trying to implement, on the other hand, requires a two-step sampling process:
For an SDE with parameter \theta and trajectory data X^{\text{obs}}, we augment the data X by adding k missing data points between observations i and i+1, so that the added missing data (parameters in Stan) \tilde{X} and X^{\text{obs}} line up as

X^{\text{obs}}_1, \tilde{X}_{1,1}, \tilde{X}_{1,2}, \dots, \tilde{X}_{1,k}, X^{\text{obs}}_2, \tilde{X}_{2,1}, \tilde{X}_{2,2}, \dots, \tilde{X}_{2,k}, X^{\text{obs}}_3, \dots

chronologically.

The sampling scheme mentioned earlier (Ito process as numerical solution of stochastic differential equation) samples \theta in two steps within iteration j of MCMC: first sample \tilde{X}^j from f(\tilde{X} | X^{\text{obs}}, \theta^{j-1}), then sample \theta^j from f(\theta | \tilde{X}^j, X^{\text{obs}}).
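In skeleton form the alternation is a standard two-block Gibbs loop; the conditional samplers below are hypothetical callables standing in for the path update and the parameter update, not anything Stan provides:

```python
def gibbs_two_step(sample_x_given, sample_theta_given, theta0, n_iter):
    """Skeleton of the two-step sampler described above: alternate
    draws of the augmented path given the current parameter and of
    the parameter given the augmented path."""
    theta = theta0
    draws = []
    for _ in range(n_iter):
        x_tilde = sample_x_given(theta)      # draw from f(x_tilde | X_obs, theta)
        theta = sample_theta_given(x_tilde)  # draw from f(theta | x_tilde, X_obs)
        draws.append(theta)
    return draws
```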

I’ve scratched my head and don’t see how this Gibbs-type two-step sampling is doable in Stan. Is it?

You have to fit everything jointly in Stan. I’ve done something similar to fit missing data in autoregressive models, but the \tilde{X} become at once both poorly identified and highly correlated with each other, making them hard to fit.