I’m a Stan beginner and I’m trying to model the following problem: given an arbitrary 1D function (here a zigzag-like function) with an unknown shift in the x direction, I’m trying to build a model that evaluates the probability of the possible shifts.
Here is my code:
library(rstan)

# simulate the data
x <- seq(-10, 10, 0.01)

# triangular "zigzag" bump of width 2*pi centered at `off`
zickzack <- function(x, off) {
  if (abs(x - off) >= 0 & abs(x - off) < pi) {
    return(0.5 * (pi - abs(x - off)))
  } else {
    return(0)
  }
}

y <- sapply(x, zickzack, off = 5)
y_obs <- y + rnorm(length(y), 0, 0.3)
model_code="
functions {
real zz(real x, real off) {
if ( (fabs(x-off) >= 0) && (fabs(x-off) < pi()) )
return 0.5*(pi()-fabs(x-off));
else
return 0;
}
}
data{
vector[2001] y;
vector[2001] x;
}
parameters{
real off;
real<lower=0> sigma;
}
model{
real mu;
sigma ~ exponential( 1 );
off ~ normal( 0 , 10 );
mu = zz(x,off);
y ~ normal( mu , sigma );
}
"
mymodel <- stan(model_code=model_code, data=list(y=y_obs,x=x))
This is what the data look like (the shift of the red function in the x direction is unknown):
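(The original post includes a figure here; a minimal sketch to recreate something like it from the simulation code above, with styling choices of my own:)

plot(x, y_obs, pch = ".", col = "grey40")
lines(x, y, col = "red", lwd = 2)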
I didn’t test the code below, but I think it should be something like this. You’re right that you need it to work with vectors: since the input x in the function is declared as type real, the function expects a real, but the model variable x is a vector.
I don’t think you can get around using a loop, since you have conditional logic. You can put the loop inside the function and still use the vectorized call to normal in the model block.
functions {
  vector zz(vector x, real off) {
    int N = num_elements(x);
    vector[N] out;
    for (n in 1:N) {
      if ((fabs(x[n] - off) >= 0) && (fabs(x[n] - off) < pi()))
        out[n] = 0.5 * (pi() - fabs(x[n] - off));
      else
        out[n] = 0;
    }
    return out;
  }
}
data {
  vector[2001] y;
  vector[2001] x;
}
parameters {
  real off;
  real<lower=0> sigma;
}
model {
  sigma ~ exponential(1);
  off ~ normal(0, 10);
  y ~ normal(zz(x, off), sigma);
}
Be sure to read the Changepoint Models section of the Stan User’s Guide (SUG). Such models induce a discontinuity in the gradient that can cause the sampler to fail, but that section describes how to code them to avoid this.
Because you have a single covariate whose relationship with the outcome changes discretely (hence the conditional) depending on the value of the covariate.
Thanks Mike. That might be the reason why I often get divergent transitions. But why discretely? The shift parameter is continuous, isn’t it? Anyway, if the parameter that I want to marginalize out is my shift (off), how can I do that when the parameter doesn’t have a uniform prior as in the example?
(@martinmodrak: can you check the below for accuracy, or tag someone you think has the expertise to do so?)
First a version of the SUG example that’s mathematically equivalent but more explicit:
data {
  real<lower=0> r_e;
  real<lower=0> r_l;
  int<lower=1> T;
  int<lower=0> D[T];
}
parameters {
  real<lower=0> e;
  real<lower=0> l;
}
// declaration/computation of lp moved to the model block for clarity
model {
  vector[T] lp;
  e ~ exponential(r_e);
  l ~ exponential(r_l);
  for (s in 1:T) {
    lp[s] = uniform_lpdf(s | 1, T); // here's where the prior for the changepoint is expressed
    for (t in 1:T) {
      lp[s] = lp[s] + poisson_lpmf(D[t] | t < s ? e : l);
    }
  }
  target += log_sum_exp(lp);
}
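To spell out what the loops compute (my notation, not from the SUG): each lp[s] accumulates the log changepoint prior plus the log likelihood given changepoint s,

$$\text{lp}[s] = \log p(s) + \sum_{t=1}^{T} \log \text{Poisson}\!\left(D_t \mid t < s \;?\; e : l\right),$$

and target += log_sum_exp(lp) then adds $\log \sum_{s=1}^{T} \exp(\text{lp}[s])$, i.e. the log likelihood with the discrete changepoint marginalized out. That is why changing the prior only requires swapping the uniform_lpdf term, as in the next version.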
And now with a normal prior centered at the middle time:
data {
  real<lower=0> r_e;
  real<lower=0> r_l;
  int<lower=1> T;
  int<lower=0> D[T];
}
parameters {
  real<lower=0> e;
  real<lower=0> l;
}
model {
  vector[T] lp;
  e ~ exponential(r_e);
  l ~ exponential(r_l);
  for (s in 1:T) {
    lp[s] = normal_lpdf(s | T / 2.0, 1); // here's where the prior for the changepoint is expressed
    for (t in 1:T) {
      lp[s] = lp[s] + poisson_lpmf(D[t] | t < s ? e : l);
    }
  }
  target += log_sum_exp(lp);
}
I think @mike-lawrence is pointing in the right general direction, but a few clarifications are needed:
The problem in this case (unlike the changepoint example in the user’s guide) is not that off has a somewhat “discrete” relationship with the outcome. Actually, the likelihood is a continuous function of off (because fabs is a continuous function). The problem arises because fabs does not have a continuous derivative, and thus the derivative of the likelihood with respect to off is not continuous. Stan requires the derivative to be continuous because it relies on the derivative to describe the behaviour of the likelihood in the neighbourhood of any point of interest reasonably well. Note that this problem would largely go away if you created a smooth function similar to zickzack, e.g. by using some sigmoid instead of the hard cutoff.
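For concreteness, here is a minimal sketch (my own, untested) of one way to do that smoothing: sqrt(square(u) + delta) replaces fabs(u), and a scaled softplus replaces the hard max(0, .) cutoff. The name zz_smooth and the sharpness parameters k and delta are my inventions; as k grows and delta shrinks, this approaches the original function while keeping continuous derivatives everywhere.

functions {
  vector zz_smooth(vector x, real off, real k, real delta) {
    int N = num_elements(x);
    vector[N] out;
    for (n in 1:N) {
      real soft_abs = sqrt(square(x[n] - off) + delta); // smooth version of fabs(x[n] - off)
      out[n] = 0.5 * log1p_exp(k * (pi() - soft_abs)) / k; // smooth version of max(0, pi() - soft_abs)
    }
    return out;
  }
}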
There is actually an additional potential problem: even if zickzack were smooth, fitting such a function to some datasets might be problematic. If the observed data had more than one “peak” (even a pretty subtle additional one), the posterior would be multimodal, as the peak of the zickzack could be at each of the peaks of the observed data, but nowhere in between. This would hinder sampling. (Note also that, AFAIK, even fitting a sigmoid to data reliably is an unresolved problem and all approaches I know of fail for certain datasets, so any function that involves sigmoids can pose challenges.)
The approach used in the Stan user’s guide does not directly translate to the case here, because you need to model not only the discrete “segment” in which the changepoint lies, but also the location within that segment. A sketch of the code as I would write it (not tested), assuming the x values are in increasing order:
parameters {
  simplex[2000] off_segment; // off_segment[i] = probability of `off` being between x[i] and x[i+1] (note: not used in the model block below)
  vector<lower=0, upper=1>[2000] off_raw;
  vector<lower=0>[2000] sigma;
}
transformed parameters {
  vector[2000] off = x[1:2000] + (x[2:2001] - x[1:2000]) .* off_raw;
  // log Jacobian of the transform from `off_raw` to `off`, needed because
  // the prior below is placed on the transformed `off`
  vector[2000] log_jacobian_off = log(x[2:2001] - x[1:2000]);
}
model {
  vector[2000] lp;
  sigma ~ exponential(1);
  for (s in 1:2000) {
    // truncated normal prior for off within segment s, plus the Jacobian
    lp[s] = normal_lpdf(off[s] | 0, 10)
            - log_diff_exp(normal_lcdf(x[s + 1] | 0, 10),
                           normal_lcdf(x[s] | 0, 10))
            + log_jacobian_off[s];
    lp[s] = lp[s] + normal_lpdf(y | zz(x, off[s]), sigma[s]);
  }
  target += log_sum_exp(lp);
}
This could (in theory) even handle the multimodality problem, but the model uses some somewhat advanced techniques (Jacobian, truncation) and is likely not to scale well to big datasets, because we need to keep a copy of all variables of interest for each segment :/ (we might be able to get away with using only one copy of sigma; that would also depend on the data). I also hope I didn’t mess something up when coding this. I’ve never successfully implemented such a model; the last time I tried helping someone with a similar (but more complicated) task, we didn’t end up with a working model.
I don’t think there are any universally good options. I would start with finding a smooth version of zickzack and move to the marginalized version only if that didn’t work well.
Manual sections for some of the stuff I use.
Hope this clarifies more than confuses and best of luck with your model!
Wow, thank you so much for your extensive answer. It already answered many of my questions. I will work through it in detail tomorrow morning, but I would like to thank you for spending so many lines explaining stuff to a Stan beginner!
Regarding your concern that fabs doesn’t have a continuous derivative: I was just wondering if replacing fabs(x[n] - off) with sqrt(pow(x[n] - off, 2)) would help?
While I’m still working through the stuff @martinmodrak posted, I’ve started worrying about something else, because this was just a (in my eyes, simplified) example for learning how to solve a problem I’m actually interested in. Let me try to describe it:
I have a 2D image Y (greyscale, 32 bit) which is very noisy (the SNR might be < 1), and I have the 2D reference image X. However, the image Y might be randomly rotated and shifted with respect to the reference image X. Now I want to find the probability distribution of the shifts/rotation angles. I thought doing this first in 1D with a simple function would be a good idea.
Do you think that my simplified problem is a good approximation? The difference would be that in the real case the function zz would need to be replaced by something that shifts/rotates an image. I don’t know yet how to do that with Stan, or whether I can call external functions to do it (maybe this? https://cran.r-project.org/web/packages/rstan/vignettes/external.html)
I fear that would be a fundamentally multimodal problem: e.g. imagine the reference image is a square; then the rotation would have 4 separate modes. Or if the reference image were a checkerboard, then the shift would have multiple modes. Even if the image didn’t have obvious symmetries, I think there would often be multiple local modes. You would get at least one mode for each pair of “dark patch on reference” and “dark patch on Y” that overlays those two dark patches and then optimizes the best fit for the rest, so for complex images you could easily be looking at hundreds of separate modes…
Unfortunately, multimodal problems are hard to do Bayesian inference for in Stan (but also everywhere else): at best you can hope for a single chain to explore a single mode (at worst, chains fail to adapt and don’t even explore one mode well).
However, your problem (unless there are additional complicating factors) could possibly be small enough to be feasible to attack via brute force: either choose a dense 3D grid over all parameters, compute the likelihood at each point, and sum all likelihoods to get a normalizing constant, giving you the full posterior; or do direct Monte Carlo/importance sampling: draw shift and angle values from the prior, compute the likelihood, and once again renormalize over all samples. I have little experience with those methods, so I’m not completely sure they would work well, but they are the only options I can suggest that have some chance of working. I don’t doubt that there is a huge number of much more sophisticated methods in the literature, but those might be hard to apply.
EDIT: The last few sentences are probably overconfident. I really don’t understand this topic very well. Maybe there are other simple options. Maybe the simple options are unlikely to work. I honestly don’t know a lot about this and don’t want to oversell my quick thoughts on the topic as rock solid advice.
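To make the brute-force idea concrete on the 1D toy problem from earlier in the thread, here is a minimal sketch in R (mine, untested; the grid resolution and the fixed sigma = 0.3 are my assumptions, and it reuses x, y_obs, and zickzack from the simulation code above). The importance-sampling variant would draw the candidate shifts from the prior instead of using a grid and renormalize the same way.

# dense grid of candidate shifts
off_grid <- seq(-10, 10, 0.05)
# log prior + log likelihood at each grid point (sigma fixed at the true 0.3)
log_post <- sapply(off_grid, function(off) {
  mu <- sapply(x, zickzack, off = off)
  dnorm(off, 0, 10, log = TRUE) + sum(dnorm(y_obs, mu, 0.3, log = TRUE))
})
# renormalize on the log scale to avoid underflow
post <- exp(log_post - max(log_post))
post <- post / sum(post)
plot(off_grid, post, type = "l", xlab = "off", ylab = "posterior mass")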
Thanks for sharing your opinion. What if I told you that I know the symmetry? That should reduce the multimodality problem, right? If you don’t see a realistic chance that this will work, I will probably try something else, like your grid suggestion. However, it would be sad, as I’ve invested some time in learning to model with Stan :-)
Another problem I’m facing is that at every stage of the analysis I typically only want to evaluate a discrete grid of parameters. What is the best way to handle that? Or should I first implement it using continuous parameters and then come back to the forum to discuss this “discrete grid” problem?
Two quick things: you need to handle the symmetries for both the reference and the observed data. In the 1D example, I would expect your model to fail if there are two bumps in the observed data (even if one is quite pronounced and the other just barely above the noise).
This might be crucial. If the grid is small-ish, I would just brute-force it. Or, if you need to use the grid on top of a more complex model, you can marginalize the grid out in Stan (this would force you to have at least one parameter per grid point, so it will become intractable quickly); a sketch follows below. Generally, in the context of Stan, moving between discrete and continuous usually completely changes how hard the problem is. Here it would make the computation much more expensive, but once you marginalize the discrete grid out, you might (I'm not completely certain about this) get rid of the multimodality, i.e. the marginalized model would likely have just one mode but still be able to represent all the modes of the original model (due to the extra parameters).
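For the 1D example, a minimal sketch (mine, untested) of what marginalizing a fixed grid of candidate shifts could look like, reusing the vectorized zz() function from earlier in the thread. The data names G and off_grid are my inventions, and for simplicity it shares a single sigma across grid points rather than one per point as discussed above.

data {
  int<lower=1> N;
  int<lower=1> G; // number of grid points
  vector[N] x;
  vector[N] y;
  vector[G] off_grid; // fixed grid of candidate shifts
}
parameters {
  real<lower=0> sigma;
}
model {
  vector[G] lp;
  sigma ~ exponential(1);
  for (g in 1:G) {
    // prior weight of this grid point plus the likelihood given that shift
    lp[g] = normal_lpdf(off_grid[g] | 0, 10)
            + normal_lpdf(y | zz(x, off_grid[g]), sigma);
  }
  target += log_sum_exp(lp);
}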
It also depends heavily on why you want to consider only a grid: is it really a physical constraint of the system, or a convenient simplification?