Correlated Random Walk with Measurement Error: Trouble Converging

#1

I’m slowly building up a model on based on GPS collar data for a small forest carnivore. The data consist of a time series of GPS fixes and accelerometer readings collected every fifteen minutes or half hour. All GPS fixes have some amount of measurement error, but because the animal is in the forest some fixes have more error than others, up to the limit of having no fix if the GPS was not in view of any satellites. We do however, always have an accelerometer measurement. One of the goals of our research is to identify when the animal is resting and the location of resting sites (often structures like tree cavities), but also to generally characterize the animals movement. Because these animals rest in cover, we tend to have more failed GPS fixes or high measurement error fixes when we presume the animal is resting in a resting site. Eventually, I want to build up to hidden Markov model with measurement error. However, for now I’m trying to understand the measurement error piece with a simple correlated random walk with measurement error.

Using some Stan and Rcode that a colleague obtained from a recent workshop on animal movement at ISEC (which I believe Juan Morales wrote) as a basis, I have the following R and Stan code for simulating and fitting a correlated random walk with measurement error. The issue is that for this relatively simple model, I am having trouble getting the model to converge without divergent transitions, exceeding the treedepth, or getting a warning about BFMI. I was eventually able to eliminate divergent transitions and treedepth exceedances by setting adapt_delta to 0.975 and max_treedepth to 20. But with 4000 iterations over 4 chains and a 10 hour runtime I still get BFMI warnings. This is for a simple model with only 200 data points in the time-series. That doesn’t give me a lot of hope for scaling this approach up to account for totally missing GPS fixes and state switching.

The model is able to return the provided parameter values even with divergent transition warnings. Indeed the fit never looks bad (except for the warnings). But I’m wondering if there is some inherent pathology here that I’m missing? Is there a better way to parameterize this?

Here is the R code to simulate:

library(CircStats)
library(rstan)
rstan_options(auto_write = TRUE)
options(mc.cores = parallel::detectCores())

T <- 200
scale <- 1
shape <- 1.5
m <- 0
rho <- 0.8
locs       <- matrix(0, T, 2)
steps      <- rep(NA, T - 1)
turn_angle <- rep(NA, T - 1)

turn_angle[1] <- runif(1,-pi, pi)

for(t in 2:T){
steps[t - 1] <- rweibull(1, shape = shape, scale = scale)
if(t > 2)
turn_angle[t - 1] <- rwrpcauchy(1, location = turn_angle[t - 2] + m + pi, rho = rho) - pi
locs[t, 1] <- locs[t - 1, 1] + cos(turn_angle[t - 1]) * steps[t - 1]
locs[t, 2] <- locs[t - 1, 2] + sin(turn_angle[t - 1]) * steps[t - 1]
}

plot(locs[,1],locs[,2], type = "l", asp = 1, xlab="", ylab="", lwd=1)

s_obs <- 0.8
xo <- locs[,1] + rnorm(dim(locs)[1], 0, s_obs)
yo <- locs[,2] + rnorm(dim(locs)[1], 0, s_obs)

plot(xo,yo, type = "l", asp = 1, xlab="", ylab="", lwd=1)

data_list <- list(
T, xo, yo
)

params <- c("m", "rho", "scale", "shape", "x", "y",
"sigma", "steps", "turn_angle")

fit <- stan(file="crw_oe.stan", data=data_list, iter = 4000,
chains = 4, pars = params)


And Stan code to fit:

functions{
//Function that returns the log pdf of the wrapped-Cauchy
real wrappedCauchy_lpdf(real y, real rho, real mu) {
return(log(1/(2*pi())) + log((1-rho^2)/(1+rho^2-2*rho*cos(y-mu))));
}
}

data {
int<lower=0> T;   // length of the time series
vector[T] xo;
vector[T] yo;
}

parameters {
real<lower=0, upper=1> rho;
real<lower=-pi(), upper=pi()> m;
real<lower=0> shape;
real<lower=0> scale;
real<lower=0> sigma;
vector<lower=0>[T - 1] steps;
vector[T - 1] turn_angle;
}

transformed parameters{
vector[T] x;
vector[T] y;
x[1] = 0;
y[1] = 0;
for(t in 2:T){
x[t] = x[t - 1] + cos(turn_angle[t - 1]) * steps[t - 1];
y[t] = y[t - 1] + sin(turn_angle[t - 1]) * steps[t - 1];
}
}

model {
turn_angle[1] ~ uniform(-pi(), pi());
m ~ normal(0,0.5);
rho ~ beta(4,1);
scale ~ normal(0,1);
shape ~ gamma(2,1);
sigma ~ normal(0, .25);

for(t in 2:T){
steps[t - 1] ~ weibull(shape, scale);
if (t > 2)
turn_angle[t - 1] ~ wrappedCauchy(rho, turn_angle[t - 2] + m);
}

for (t in 1:T) {
xo[t] ~ normal(x[t], sigma);
yo[t] ~ normal(y[t], sigma);
}

}


Divergence /Treedepth issues with unit_vector
#2

pairs(fit, pars = c("steps", "turn_angle", "x", "y"), include = FALSE, las = 1)?

#3

This is from the 10 hour model run where I was able to eliminate divergent transitions, but not BFMI.

#4

If you have the old runs, their pairs plots may make for a good comparison, but it looks from this one that large values of rho may be the source of the low BFMI warnings due to its correlation with energy__. Maybe you can think of a way to change the parameterization there.

#5

Here’s a run with vanilla settings. 2 divergent transitions and 3449 transitions that exceeded maximum treedepth, BFMI low on all 4 chains. rho doesn’t seems to have any correlation with energy__ for this run:

#6

Yes, although in this case rho never reaches extreme values.

#7

You will certainly want to restrict your angle parameter

vector<lower=-pi(), upper=pi()>[T - 1] turn_angle;


Otherwise the posterior will be highly multi-modal as turn_angle could just be shifted by 2*pi. Furthermore, putting a uniform prior on an unconstrained parameter is asking for trouble.

The disadvantage of restricting the range between -pi and pi is that the sampler cannot easily move across the boundary. Yet, as the parameter actually describes an angle this should be easily possible on a circular space.

Another approach would be to use a unit vector which is then transformed to polar coordinates (shortly discussed in section 19.4 of the Stan manual). I have often found it more effective to use an unconstrained 2-d vector instead with a soft constraint on its length, i.e. len ~ normal(1, 0.1);

The following code illustrates this approach for your model:

functions{
//Function that returns the log pdf of the wrapped-Cauchy
real wrappedCauchy_lpdf(real y, real rho, real mu) {
return(log(1/(2*pi())) + log((1-rho^2)/(1+rho^2-2*rho*cos(y-mu))));
}
}

data {
int<lower=0> T;   // length of the time series
vector[T] xo;
vector[T] yo;
}

parameters {
real<lower=0, upper=1> rho;
real<lower=-pi(), upper=pi()> m;
real<lower=0> shape;
real<lower=0> scale;
real<lower=0> sigma;
vector<lower=0>[T - 1] steps;
// unconstrained 2-D vectors
vector[2] angle_raw[T - 1];
}

transformed parameters{
vector[T] x;
vector[T] y;
vector<lower=-pi(), upper=pi()>[T-1] turn_angle;
vector<lower=0>[T-1] angle_len;

// transform to polar coordinates
for (t in 2:T) {
turn_angle[t - 1] = atan2(angle_raw[t - 1, 1], angle_raw[t - 1, 2]);
angle_len[t - 1] = sqrt(angle_raw[t - 1, 1]^2 + angle_raw[t - 1, 2]^2);
}

x[1] = 0;
y[1] = 0;
for(t in 2:T){
x[t] = x[t - 1] + cos(turn_angle[t - 1]) * steps[t - 1];
y[t] = y[t - 1] + sin(turn_angle[t - 1]) * steps[t - 1];
}
}

model {
// soft constraint on vector length
angle_len ~ normal(1, 1e-1);
// Jacobian correction for polar coordinates
target += - sum(log(angle_len));

turn_angle[1] ~ uniform(-pi(), pi());
m ~ normal(0,0.5);
rho ~ beta(4,1);
scale ~ normal(0,1);
shape ~ gamma(2,1);
sigma ~ normal(0, .25);

for(t in 2:T){
steps[t - 1] ~ weibull(shape, scale);
if (t > 2)
turn_angle[t - 1] ~ wrappedCauchy(rho, turn_angle[t - 2] + m);
}

for (t in 1:T) {
xo[t] ~ normal(x[t], sigma);
yo[t] ~ normal(y[t], sigma);
}

}


Note the Jacobian correction to account for the change to polar coordinates. For me this parameterization gets rid of divergent trajectories and low BMI warnings.

#8

Lol, how did I miss this post before?

Why parameterize this in terms of angles and steps?

GPS is going to give you position information ± some error. Why not try to estimate the true position and handle any sorta step/turn_angle stuff estimates with some postprocessing?

#9

Why parameterize this in terms of angles and steps?

The positions are a time series and I want to model them as a time-series. Angles and steps are natural transformation of the difference between two 2-d coordinates. Much of the literature on animal movement models the difference in subsequent positions in terms of step-lengths and angles. I’m just starting out on this project though and really this question was motivated by me simply trying to understand some modelling techniques that are new to me. Most of the material I’m working off of is based off of step-lengths and angles. Eventually I want to work in a hidden Markov like approach where the animal switches between at least two “states” each with different movement parameters. One of these states is “resting” where we presume the animal doesn’t move at all. One of the major goals of this project is identify when the animal is resting and get as good of an estimates of the location it is resting at. The reason for this is that the resting locations for these animals are features of the landscape that are good targets for conservation.

GPS is going to give you position information ± some error. Why not try to estimate the true position and handle any sorta step/turn_angle stuff estimates with some postprocessing?

GPS sometimes gives me position information. When the error is large enough I don’t get a GPS fix. However, even if I don’t have a GPS position, I do have a data point. The latitude and longitude is missing for these datapoints, but I do have an accelerometer reading, which gives me some pretty good information on whether the animal was moving or resting in the interval between that data point and the previous one. So even though the data on GPS is missing I should be able to constrain the possible locations this animal is at by modelling the differences between successive locations as a time series, especially if I can get a good estimates of the variance of the measurement error when I do have a location. Hence steps and angles. There is probably a good way to model the differences between locations as a 2-d multivariate normal, but steps and angles is where I started to understand this problem.

#10

Thank you! I was hoping this was just a transformation problem and not some fundamental problem with imputation. On a run with 4 chains and 2000 iterations, I got only 12 max treedepth warnings and no other warnings.

This is a good starting point for me. Although I must admit it is rather curious and not intuitive (at least to me) to make use a parameter angle_len which seems to have only a tangential (apologies for the trigonometric pun) relationship to the parameters I’m interested in.

Also, I am embarrassed to admit I didn’t read section 19 of the Stan manual which should’ve been my first place to start. So file this post under RtFM (“Read the Frickin’ Manual!”).

#11

The way I’d try to structure this is Kalman filter style, where there’s some underlying state that you want to estimate and you have measurements of that state. This is what you’re doing, but I’d parameterize it in x/y coordinates for no other reason than to avoid rotation parameterizations (I could be wrong here – sounds @bertschi has some experience with these and likes them).

So maybe if we had a critter that moved on continuous paths and Everything Was Nice you could do:

y_i \sim \text{normal}(\mu_i, \sigma_y)
\dot{ y_i} \sim \text{normal}(\dot{\mu_i}, \sigma_\dot{y})

And you’d make noisy measurements of y_i and \dot{ y_i} as you did and you’d set up some non-parametric form of \mu (maybe a GP with a derivative) and try to estimate it. So the prior on \mu might end up looking like a big multivariate normal:

\mu \sim \text{multi_normal}(0, \Sigma)

I doubt this would work super well cause the continuous path thing probably breaks down. I assume the critter could move in a pretty weird way (cause it just likes to or the terrain requires it).

So maybe we break the smoothness assumption and just assume animals move from point to point in straight lines and we measure every point they’re moving from and to:

y_i \sim \text{normal}(\mu_i, \sigma_y)
\dot{ y_i} \sim \text{normal}(v_i, \sigma_\dot{y})

And then tag on some sort of update for \mu

\mu_i \sim \text{normal}(\mu_{i - 1} + v_{i - 1} (t_i - t_{i - 1}), \sigma_\mu)

So now we’re using the accelerometer measurements to estimate a latent velocity, v_i and the position measurements to estimate the latent position but we’re also using the third equation to say everything should kinda be related in the same model.

But you aren’t putting super hard constraints on things. So unlike the GP your estimates of latent velocity might not exactly correspond to changes in estimates of positions over time haha.

Back to the rotations real quick, even with things like the [-pi, pi] constraint you end up worrying, what if the parameter crosses -pi to pi? You can do the unit vector thing too, but I just don’t think it’s necessary. There’s nothing mathematically wrong with parameterizing this way. It’s really just a shortcoming of Stan that it might have trouble sampling rotations. Just putting fair warning here – I don’t mean to keep going in circles about it (woohoo! puns!).

If you’re gonna play with these things it might be worth your time to skim through the first few chapters of http://users.aalto.fi/~ssarkka/pub/cup_book_online_20131111.pdf (Simo Sarkka, “Bayesian Filtering and Smoothing”). It’s a quick read but good. I’d guess the EE journals would be dense with Kalman filter tracking applications like this too that’d probably give you good ideas you can translate over. I guess the neat thing with your data is that you know exactly which animal is generating each data point?

I think one of the harder tracking problems is when you also aren’t sure which thing generated each data point, and how many things there are exactly (like in image processing or something). So it should be worth a lot that you’d know who owns what.

#12

That was my other hunch. I’ve got about 17 tabs open my browser right now on Kalman filters. I haven’t seen these as much in the animal movement literature however. There is this paper (linked from the author’s website so hopefully no copyright issues here) which deals with the issue from the continuous time perspective. We’ve got discrete time here, so for example t_i - t_{i-1} is constant for all i.

I’m sure I’ll be posting more on this topic as I dig deeper. We’ve got some rich data here thanks to some exceptional field biologists I’m fortunate to know.

#13

This is a trick I came up with when playing with unit_vector parameters in Stan. I attached a short writeup that explains this idea … circular.pdf (1.7 MB)

Divergence /Treedepth issues with unit_vector
#14

I had a look at your writeup. I think it makes sense. The Stan implementation just normalizes a vector with each component normal(0, 1).

When I’d used the default Stan parameterization before I had treedepth problems. I assumed there was an underlying identifiability problem cause r wasn’t constrained enough. This seems like it would fix that.

The discontinuity at zero always scares me cause my math is bad. But this transformation seems to make the same assumptions as the one in Stan, so I don’t see a problem.

I don’t get why the unit_vector in Stan is working badly though. I don’t see why rs near 0 would cause it a problem.

#15

Thanks for the write-up. It occurs to me that we might run into a similar issue however if we want to estimate the location (i.e. mean) of the von Mises or wrapped Cauchy, in addition to or instead of, estimating the angle. Especially if the location is near the boundary. That is, in your write up you give the parameters (mu = location, kappa = concentration) of the von Mises as data and use a transformation with a Jacobian to sample the angle as a parameter. But what if we also want to estimate mu?

It looks like Leos-Barajas and Michelot used a similar “trick” for the purpose of estimating the location and concentration of von Mises distributions within a Hidden Markov Model. To paraphase their model, using the von Mises distribution, the data are angles and steps (this differs from my use case where angles and steps are also parameters) and in the parameters block we define:

parameters {
real x; //x-dimension of Euclidean transformation of von Mises distribution
real y; //y-dimension of the same
}


Then the transformed parameters has:

transformed parameters {
real<lower = -pi(), upper = pi()> loc;
real<lower = 0>                   kappa;

loc   = atan2(y, x);
kappa = sqrt(square(x) + square(y));
}


Then the model block could look something like:

model  {
//priors

x ~ normal(0, 1);
y ~ normal(0, 1);

for (i in 1:N)
angle[i] ~ von_mises(loc, kappa);
}


Do we not need a Jacobian in the model step then for this parameterization? What about in the use case of this thread where both the sampled angles and the location and concentration of the von Mises are parameters? What about for the wrapped Cauchy - do we have the same connection between the location and scale parameters? It looks like that isn’t the case in the parameterization of the wrapped Cauchy that I cribbed from somewhere on the Stan forums, as in this case the scale parameters of the wrapped Cauchy is bounded between 0 and 1:

  //Function that returns the log pdf of the wrapped-Cauchy
real wrappedCauchy_lpdf(real y, real mu, real rho) {
return(log(1/(2*pi())) + log((1-rho^2)/(1+rho^2-2*rho*cos(y-mu))));
}


#16

Yes, mu is also a circular quantity. Just represent it the same way:

parameters {
real mu_x;
real mu_y;
}
transformed parameters {
real<lower=0> mu_r = sqrt(square(mu_x) + square(mu_y));
real<lower=-pi(), upper=pi()> mu = atan2(mu_y, mu_x);
}
model {
// Some prior on mu
mu ~ ...;
// Keep length from zero
mu_r ~ normal(1, 0.1);
target += - log(mu_r); // Jacobian

// Use mu in likelihood etc.
obs ~ von_mises(mu, kappa);
}


Think of probability distributions as building blocks for a Bayesian model. Thus if your model has a circular location parameter just represent it and its distribution as part of your larger model.

They use the same transformation to polar coordinates, but with a different distribution. Again, thinking of the parameters loc and kappa as building blocks just study their distribution:

x ~ normal(0, 1); y ~ normal(0, 1);

Then, loc = atan2(y, x) is uniformly distributed on (-pi, pi] and kappa = sqrt(square(x) + square(y)) is Chi-distributed with two degrees of freedom. Furthermore, loc and kappa are independent, i.e. there is no a-priori connection between these parameters.

As they put the prior on the parameters x and y and transform afterwards, no Jacobian correction is needed. Instead, you could define the model equivalently as follows:

parameters {
real x;
real y;
}
transformed parameters {
real<lower = -pi(), upper = pi()> loc;
real<lower = 0>                   kappa;

loc   = atan2(y, x);
kappa = sqrt(square(x) + square(y));
}
model {
target += - log(kappa); Jacobian

// Put prior density on loc and kappa
loc ~ uniform(-pi(), pi());
target += - (2 / 2 - 1) * log(2) - lgamma(2 / 2) + (2 - 1) * log(kappa) - square(kappa) / 2;
// i.e. kappa ~ Chi(k = 2)
// Note: implies x, y ~ normal(0, 1)!

for (i in 1:N)
angle[i] ~ von_mises(loc, kappa);


For clarity, I would probably not reuse the radius of the polar transform and introduce a separate parameter for kappa. As explained, while the above code suggests a connection between loc and kappa, from a modelling perspective we have independent priors on them. If you are fine with the Chi-distribution the same code could be used for the parameters of a wrapped Cauchy.

#17

I am trying to do the same job, currently I am using GMM and Weibull to eliminate GPS noise. Maybe Kalman filter is a better choice.