Optimizer struggling with random walk

Hello. I have a time-series-like model with discrete observations generated from a continuous latent variable that evolves according to a Gaussian random walk. When I fit the model by sampling, I get the desired solution: a relatively gentle latent-variable trajectory. However, when I use the L-BFGS optimizer, it converges to a crazy solution where, in essence, the latent variable adapts perfectly to the observation at every time point.

The observational model is rather complex, but the problem remains even if I replace the discrete observations with simpler, continuous proxies, which I also have. This leads me to believe that the issue is not with my model specification (which is why I don’t paste the code here), but that this type of likelihood is bimodal and the optimizer just likes one mode better than the other.

Is there a way to force the optimizer to remain in the desired region of the parameter space? I initialise it there, and I give it a low initial step size, but I see no other option that could help. And more generally, has anyone else run into similar issues with latent random walks and can share their experiences? Thanks in advance.

when I use the L-BFGS optimizer, it converges to a crazy solution where, in essence, the latent variable adapts perfectly to the observation at every time point.

Not sure if I fully understand the symptom, but here's my crude mental model: the optimiser is attempting to locally minimise a loss function (the negative log posterior probability) involving two terms. The first term corresponds to the observation model and penalises solutions that are poor fits to the observed data; the second term corresponds to the evolution of the latent variable and penalises solutions that involve implausible trajectories of the latent variable.

Does your model have parameters that could cause the former term (fitting the observations) to completely dominate the term for selecting a smooth latent-variable trajectory? E.g. if your observation model includes an additive error term from a normal distribution where sigma is a free parameter, could the optimiser be choosing sigma = 0 for some reason? That would cause the loss from failing to exactly interpolate the observations to dominate any loss from choosing a wild latent-variable trajectory. On the other hand, if there are parameters that govern the evolution of the latent variable, is it possible that the optimiser is picking a value that allows wild trajectories to go completely unpenalised, which would then allow it to interpolate the observations without cost?
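A toy numeric check of this failure mode (a hypothetical local-level model, not your actual model): if the latent trajectory is allowed to interpolate the data exactly and the observation sigma is free, the joint log density grows without bound as sigma shrinks, so any smooth solution eventually loses to the interpolating one.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
T = 50
# hypothetical data: a smooth latent random walk plus observation noise
x_true = np.cumsum(rng.normal(0, 0.1, T))
y = x_true + rng.normal(0, 1.0, T)

def joint_logp(x, sigma, tau, y):
    """Joint log density: observation term + random-walk (smoothness) term."""
    obs = norm.logpdf(y, loc=x, scale=sigma).sum()
    walk = norm.logpdf(np.diff(x), loc=0, scale=tau).sum()
    return obs + walk

# "smooth" candidate: latent stays near the truth, sigma near its true value
smooth = joint_logp(x_true, sigma=1.0, tau=0.1, y=y)

# "interpolating" candidate: latent equals the data exactly;
# tau is set to the sd of the data increments so the walk term stays finite
tau_hat = np.diff(y).std()
interp_vals = [joint_logp(y, sigma=s, tau=tau_hat, y=y) for s in (1e-1, 1e-3, 1e-6)]
# the joint log density of the interpolating solution diverges as sigma -> 0
```

The observation term contributes roughly -T*log(sigma) once the residuals are zero, which is unbounded above, so a local optimiser that wanders into this basin will happily keep shrinking sigma.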

Maybe one way to investigate whether this is happening, and to isolate which parameter (if any) is causing the problem, would be to replace parameters in your model with constants estimated from sampling.

You just can’t optimize latent states. If you want to use optimization, you have to marginalise them out via a Kalman filter or similar; then, in a later step (after fitting), you can sample from the posterior for each latent state.
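For a concrete sense of what "marginalise them out via a Kalman filter" means, here is a minimal sketch (in Python rather than Stan, for a hypothetical local-level model): the filter recursions integrate the latent states out analytically, leaving a marginal likelihood in the system parameters alone, which is safe to hand to an optimiser because it no longer rewards sigma_obs -> 0.

```python
import numpy as np

def kalman_loglik(y, sigma_obs, sigma_walk, m0=0.0, P0=1e6):
    """Marginal log-likelihood of a local-level model
    (x_t = x_{t-1} + N(0, sigma_walk^2), y_t = x_t + N(0, sigma_obs^2)),
    with the latent states integrated out by the filter recursions."""
    m, P = m0, P0
    ll = 0.0
    for yt in y:
        P = P + sigma_walk ** 2          # predict: latent state variance grows
        S = P + sigma_obs ** 2           # innovation variance
        v = yt - m                       # innovation (one-step prediction error)
        ll += -0.5 * (np.log(2 * np.pi * S) + v ** 2 / S)
        K = P / S                        # Kalman gain
        m, P = m + K * v, (1 - K) * P    # update
    return ll

rng = np.random.default_rng(0)
x = np.cumsum(rng.normal(0, 0.1, 100))
y = x + rng.normal(0, 1.0, 100)

# unlike the joint density over states and parameters,
# the marginal likelihood penalises a collapsing observation sigma
ll_true = kalman_loglik(y, sigma_obs=1.0, sigma_walk=0.1)
ll_degenerate = kalman_loglik(y, sigma_obs=1e-6, sigma_walk=0.1)
```

The diffuse prior `P0=1e6` is an assumption for illustration; in practice you would set the initial state distribution to match your model.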

2 Likes

@Charles_Driver

Can you explain in a bit more detail about why optimising latent states doesn’t work?

I am not familiar with Kalman filters but have been learning about hidden Markov models (so, small finite state space for latent variables).

If the task is to estimate a trajectory of latent (aka hidden) states in a historical time window, one approach is exactly as you describe (Forward–backward algorithm - Wikipedia). This approach gives the marginally most probable latent state at each individual historical timestep, but has the downside that it doesn’t necessarily produce a good (or even valid) estimate of the trajectory of latent states. E.g. the resulting individual estimates x_t, x_{t+1} of the latent state x at timesteps t and t+1 might be a pair of states such that there is zero probability of transitioning between x_t and x_{t+1}.

A different approach is the Viterbi algorithm, which computes a MAP estimate of a most probable trajectory of hidden states.

Wouldn’t the problem that is efficiently solved by the Viterbi algorithm – an optimisation problem to recover a MAP estimate of the trajectory of hidden states – be the same optimisation problem we’re attempting to solve via Stan with L-BFGS minimisation of the log posterior probability?

Aha – maybe what I didn’t understand is that the Viterbi algorithm requires that any parameters of the observation model and transition model are held constant (e.g. noise level in observation, smoothness parameters for transition model), so the only variables being optimised over are the sequence of latent states.
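A minimal Viterbi sketch (toy NumPy code, not from the thread) makes that restriction explicit: the initial, transition, and emission log probabilities are all inputs held fixed, and the only thing being optimised is the state sequence.

```python
import numpy as np

def viterbi(log_init, log_trans, log_emit):
    """MAP state trajectory for an HMM with *fixed* parameters.
    log_init:  (K,)   log initial state probabilities
    log_trans: (K, K) log transition probabilities, rows = from-state
    log_emit:  (T, K) log probability of each observation under each state
    """
    T, K = log_emit.shape
    delta = log_init + log_emit[0]       # best score ending in each state
    back = np.zeros((T, K), dtype=int)   # backpointers
    for t in range(1, T):
        scores = delta[:, None] + log_trans   # (from, to)
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + log_emit[t]
    path = np.zeros(T, dtype=int)
    path[-1] = delta.argmax()
    for t in range(T - 2, -1, -1):       # backtrack
        path[t] = back[t + 1, path[t + 1]]
    return path

# a sticky two-state chain with informative emissions (toy numbers)
log_init = np.log([0.5, 0.5])
log_trans = np.log([[0.9, 0.1], [0.1, 0.9]])
log_emit = np.log([[0.9, 0.1]] * 3 + [[0.1, 0.9]] * 3)
path = viterbi(log_init, log_trans, log_emit)
```

Note that nothing in the recursion touches the parameters themselves; trying to optimise them jointly with the path reintroduces exactly the degeneracy discussed above.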

1 Like

Yeah you got it. Was too general in my above statement. Can’t optimise latent states when also optimising system pars.

4 Likes

Hi @Charles_Driver , can you clarify what sorts of latent states this statement applies to? I know nothing about how maximum-likelihood estimation works in, for example, a simple hierarchical (random effect) model. Is it the case that one cannot simultaneously optimize the BLUPs and the hyperparameters? I would find that surprising. If that’s not the case, what is the key dividing line between latent variables/factors/parameters that cannot be optimized jointly with their hyperparameters, and those that can?

1 Like

To the best of my understanding it’s valid as a general statement. You might find very specific cases / datasets where it works, but you can’t rely on it.

2 Likes

Thank you all very much for your thoughts.

This is exactly right. The rate of change of the latent variable is one of the parameters I am optimising, so the optimizer is free to fit a large value that permits the latent variable to interpolate the observations. With sampling, I can prevent this from happening with the prior, in which case the desired solution is found: one with a moderate rate of change. I was hoping there was a way to restrict the optimizer to stay in this region and thus find the desired local optimum, but the right way to deal with this is indeed to kalmanize the model.

Like @jsocolar, I would be very interested in knowing how broadly this principle applies.
[Edit: thanks for the reply above.]

1 Like

If we

  1. marginalise out the latent states, leaving a posterior log probability as a function of system parameters only ; and
  2. regard the resulting posterior log probability as an objective function to maximise, rather than a distribution to sample from

then I believe we would be maximising the same marginal objective that an Expectation Maximisation algorithm targets; EM is one iterative scheme for climbing exactly this marginalised objective without differentiating it directly (?).

cf.

…although the examples focus on marginalising discrete latent parameters, presumably a similar treatment works for Kalman-filter-style continuous state spaces.
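As a toy illustration of the EM alternation (a two-component Gaussian mixture with known unit variances and equal weights; nothing here is from the thread): the E-step averages over the latent assignments rather than optimising them, and the M-step updates only the system parameters, so the latent variables are never point-optimised.

```python
import numpy as np

rng = np.random.default_rng(1)
# hypothetical data from two well-separated components
y = np.concatenate([rng.normal(-3, 1, 200), rng.normal(3, 1, 200)])

mu = np.array([-1.0, 1.0])  # initial guess for the component means
for _ in range(50):
    # E-step: posterior responsibility of each component for each point
    # (equal weights and unit variances assumed, so only the means matter)
    logp = -0.5 * (y[:, None] - mu[None, :]) ** 2
    r = np.exp(logp - logp.max(axis=1, keepdims=True))
    r /= r.sum(axis=1, keepdims=True)
    # M-step: maximise the expected complete-data log-likelihood over mu
    mu = (r * y[:, None]).sum(axis=0) / r.sum(axis=0)
```

The same E/M structure carries over to linear-Gaussian state spaces, with the Kalman smoother supplying the E-step quantities.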