Fitting a known piecewise linear function with unknown offset

I didn’t test the code below, but it should be something like this. You’re right that you need it to work with vectors: in your function the input x is declared as real, so the function expects a scalar, while the model variable x is a vector.

I don’t think you can get around using a loop, since you have conditional logic. You can put the loop inside the function and still use the vectorized call to normal in the model block.

functions {
    vector zz(vector x, real off) {
        int N = num_elements(x); // num_elements is a function, so parentheses, not brackets
        vector[N] out;
        for (n in 1:N) {
            if (fabs(x[n] - off) < pi()) // fabs(...) >= 0 always holds, so only the upper bound matters
                out[n] = 0.5 * (pi() - fabs(x[n] - off));
            else
                out[n] = 0;
        }
        return out;
    }
}
data {
    vector[201] y;
    vector[201] x;
}
parameters {
    real off;
    real<lower=0> sigma;
}
model {
    sigma ~ exponential(1);
    off ~ normal(0, 10);
    y ~ normal(zz(x, off), sigma);
}

Got it, I will try it. Can you quickly tell me what the advantage is compared to the “solution” I’ve posted?

Elegance is in the eye of the beholder :)

With a larger model, what I posted should be faster due to how the gradient of the log-probability is computed (see https://mc-stan.org/docs/2_25/stan-users-guide/vectorization.html).


Be sure to read the Changepoint Models section of the Stan User’s Guide (SUG). Such models induce a discontinuity in the gradient that can cause the sampler to fail, but that section describes how to code them to avoid this.

Thanks mike! I will check. But I’m not sure why you consider my example a changepoint model?

Because you have a single covariate that has a changing relationship with the outcome depending discretely (hence the conditional) on the value of the covariate.

Thanks mike. That might be the reason why I often get divergent transitions. But why discretely? The shift parameter is continuous, isn’t it? Anyway, if the parameter that I marginalize out is my shift (off), how can I do that when the parameter doesn’t have a uniform prior as in the example?

(@martinmodrak: can you check the below for accuracy, or tag someone you think has the expertise to do so?)

First a version of the SUG example that’s mathematically equivalent but more explicit:

data {
  real<lower=0> r_e;
  real<lower=0> r_l;

  int<lower=1> T;
  int<lower=0> D[T];
}
parameters {
  real<lower=0> e;
  real<lower=0> l;
}
// declaration/computation of lp moved to model block for clarity
model {
  e ~ exponential(r_e);
  l ~ exponential(r_l);
  vector[T] lp;
  for (s in 1:T) {
    lp[s] = uniform_lpdf(s | 1, T); // here's where the prior for the changepoint is expressed
    for (t in 1:T) {
      lp[s] = lp[s] + poisson_lpmf(D[t] | t < s ? e : l);
    }
  }
  target += log_sum_exp(lp);
}

And now with a normal prior centered at the middle time:

data {
  real<lower=0> r_e;
  real<lower=0> r_l;

  int<lower=1> T;
  int<lower=0> D[T];
}
parameters {
  real<lower=0> e;
  real<lower=0> l;
}
model {
  e ~ exponential(r_e);
  l ~ exponential(r_l);
  vector[T] lp;
  for (s in 1:T) {
    lp[s] = normal_lpdf(s | T / 2.0, 1); // here's where the prior for the changepoint is expressed
    for (t in 1:T) {
      lp[s] = lp[s] + poisson_lpmf(D[t] | t < s ? e : l);
    }
  }
  target += log_sum_exp(lp);
}

I think @mike-lawrence is correct in the general direction, but a few clarifications are needed:

  • The problem in this case (unlike the change point example in the user’s guide) is not that off has a somewhat “discrete” relationship with the outcome. Actually, the likelihood is a continuous function of off (because fabs is a continuous function). The problem arises because fabs does not have a continuous derivative, and thus the derivative of the likelihood with respect to off is not continuous. Stan requires the derivative to be continuous because it relies on the derivative to describe the behaviour of the likelihood reasonably well in a neighbourhood of any point of interest. Note that this problem would largely go away if you created a smooth function similar to zickzack, e.g. by using some sigmoid instead of the hard cutoff (a sketch of one way to do this is at the end of this post).

  • There is actually an additional potential problem - even if zickzack were smooth, fitting such a function to some datasets might be problematic. If the observed data had more than one “peak” - even a pretty subtle additional peak - the posterior would be multimodal, as the peak of the zickzack could sit at each of the peaks of the observed data, but nowhere in between. This would hinder sampling. (Note also that AFAIK even fitting a sigmoid to data reliably is an unresolved problem, and all approaches I know of fail for certain datasets - so any function that involves sigmoids can pose challenges.)

  • The approach used in the Stan user’s guide does not directly translate to the case here, because you need to model not only the discrete “segment” in which the change point lies, but also the location within that segment. A sketch of the code as I would write it (not tested), assuming x is in increasing order:

// (functions and data blocks with zz, x, and y as in the model above)
parameters {
    simplex[200] off_segment; // off_segment[i] = probability of `off` lying between x[i] and x[i+1]
    vector<lower=0, upper=1>[200] off_raw; // position of `off` within its segment
    vector<lower=0>[200] sigma;
}
transformed parameters {
  vector[200] off = x[1:200] + (x[2:201] - x[1:200]) .* off_raw;
  // log |d off / d off_raw|, needed because the prior below is on `off` while `off_raw` is the sampled parameter
  vector[200] log_jacobian_off = log(x[2:201] - x[1:200]);
}
model {
  sigma ~ exponential(1);
  vector[200] lp;
  for (s in 1:200) {
    // weight of segment s + truncated normal prior on `off` within the segment + the Jacobian
    lp[s] = log(off_segment[s])
            + normal_lpdf(off[s] | 0, 10)
            - log_diff_exp(normal_lcdf(x[s + 1] | 0, 10),
                           normal_lcdf(x[s] | 0, 10))
            + log_jacobian_off[s];
    lp[s] = lp[s] + normal_lpdf(y | zz(x, off[s]), sigma[s]);
  }
  target += log_sum_exp(lp);
}

This could (in theory) even handle the multimodality problem, but the model uses some somewhat advanced techniques (Jacobian, truncation) and is likely not to scale well to big datasets, because we need to keep a copy of all variables of interest for each segment :/ (we might be able to get away with using only one copy of sigma - that would also depend on the data). I also hope I didn’t mess something up when coding this. I’ve never successfully implemented such a model; the last time I tried helping someone with a similar (but more complicated) task, we didn’t end up with a working model.

I don’t think there are any universally good options - I would start with finding a smooth version of zickzack and move to the marginalized version only if that didn’t work well.
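
For illustration, here is a minimal sketch (in R, untested against your data) of one way to smooth out both kinks of zz - the constants eps and k below are arbitrary choices of mine, not anything canonical:

# Smooth approximation of zz = 0.5 * max(0, pi - |x - off|):
# replace |u| with sqrt(u^2 + eps^2) and max(0, v) with a softplus.
zz_smooth <- function(x, off, eps = 0.01, k = 10) {
  d <- sqrt((x - off)^2 + eps^2)      # smooth stand-in for fabs(x - off)
  0.5 * log1p(exp(k * (pi - d))) / k  # smooth stand-in for max(0, pi - d)
}
curve(zz_smooth(x, off = 6), from = 0, to = 12)  # stays close to zz away from the kinks

Larger k makes the approximation tighter but the derivative steeper, so there is a trade-off with how easily the sampler can move.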

The Stan manuals have sections on the techniques I used here (changes of variables/Jacobian adjustments, truncated distributions).


Hope this clarifies more than confuses and best of luck with your model!


Wow, thank you so much for your extensive answer. It already answered many of my questions. I will work on it in detail tomorrow morning, but I would like to thank you for spending so many lines explaining stuff to a Stan beginner!

Thanks @martinmodrak and @mike-lawrence

Great community :-)


One thing I forgot to add is that people use “changepoint model” in at least two senses:

  1. A discrete change point (as in the Stan user’s guide example @mike-lawrence mentioned).

  2. Fitting a function that is defined piecewise (this is IMHO closer to what you care about). A general approach using sigmoids to smooth out transitions between piecewise linear functions is described at: Piecewise Linear Mixed Models With a Random Change Point and Simpler Change Point Models?


Regarding your concern that fabs doesn’t have a continuous derivative, I was just wondering if replacing fabs(x[n] - off) with sqrt(pow(x[n] - off, 2)) would help?

No, that happens to be exactly the same function. (Note that the derivative of \sqrt{x} is \frac{1}{2\sqrt{x}}, which has a discontinuity at 0.)
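
(A smooth alternative, to give one example, would be sqrt(pow(x[n] - off, 2) + pow(eps, 2)) for some small constant eps > 0 of your choosing - its derivative with respect to off, -(x[n] - off) / sqrt(pow(x[n] - off, 2) + pow(eps, 2)), is continuous everywhere. Note that the kink at the pi() cutoff would still remain.)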


While I’m still working on the stuff @martinmodrak posted, I started worrying about something else, because this was just a (in my eyes simplified) example to learn how I can solve the problem I’m actually interested in. Let me try to describe it:

I have a 2D image Y (greyscale, 32 bit) which is very noisy (the SNR might be < 1), and I have the 2D reference image X. However, the 2D image Y might be randomly rotated and shifted with respect to the reference image X. Now I want to find the probability distribution of the shifts/rotation angles. I thought doing this first in 1D with a simple function would be a good idea.

Do you think that my simplified problem is a good approximation? The difference would be that in the real case the function “zz” would need to be replaced by something shifting/rotating an image. I don’t know yet how to do that with Stan, or whether I can call external functions that do it (maybe this: https://cran.r-project.org/web/packages/rstan/vignettes/external.html ?)


You were right! By replacing my example function with the following function it works perfectly fine:

set.seed(31235)
library(pracma)  # provides sigmoid()
# x is the same grid of x values as in my earlier posts
zickzack_smooth <- function(x, off, offy) {
    return( pi*(sigmoid(x-off)*(1-sigmoid(x-off))) + offy )
}
y_smooth <- sapply(x, zickzack_smooth, off=6, offy=5)
y_smooth_obs <- y_smooth + rnorm(length(y_smooth), 0, 0.1)
plot(x, y_smooth_obs)
lines(x, y_smooth, col='red')

This is the model:

set.seed(31235)
model_code_vector <- "
functions {
    vector zz_vec(vector x, real offx, real offy) {
        int N = num_elements(x); // instead of hard-coding N = 201
        vector[N] out;
        for (n in 1:N) {
            out[n] = pi() * (inv_logit(x[n] - offx) * (1 - inv_logit(x[n] - offx))) + offy;
        }
        return out;
    }
}
data {
    vector[201] y;
    vector[201] x;
}
parameters {
    real offx;
    real offy;
    real<lower=0> sigma;
}
model {
    sigma ~ exponential(1);
    offx ~ normal(0, 10);
    offy ~ normal(0, 10);
    y ~ normal(zz_vec(x, offx, offy), sigma);
}
"

library(rstan)
mymodel <- stan(model_code = model_code_vector, data = list(y = y_smooth_obs, x = x))

No divergent transitions and it is super fast. Thank you a lot!

Looking forward to discussing the follow-up question that I asked one post above.

Best,
Thorsten

Edit: This also works quite well when the measurements are noisy.


I fear that would be a fundamentally multimodal problem: e.g. imagine the reference image is a square - then the rotation would have 4 separate modes. Or if the reference image were a checkerboard, then the shift would have multiple modes. Even if the image doesn’t have obvious symmetries, I think there would often be multiple local modes. You would get at least one mode for each pair of “dark patch on reference” / “dark patch on Y” that overlays those two dark patches and then optimizes the best fit for the rest, so for complex images you could easily be looking at hundreds of separate modes…

Unfortunately, multimodal problems are hard to do Bayesian inference for in Stan (but also everywhere else) - at best you can hope for a single chain to explore a single mode (at worst, chains fail to adapt and don’t even explore one mode well).

However, your problem (unless there are additional complicating factors) could possibly be small enough to attack via brute force: either choose a dense 3D grid over all parameters, compute the likelihood at each point, and sum all likelihoods to get a normalizing constant - and you have the full posterior; or do direct Monte Carlo/importance sampling: draw shift and angle values from the prior, compute the likelihood, and once again renormalize over all samples. I have little experience with those methods, so I’m not completely sure they would work well, but those are the only options I can suggest that have some chance of working. I don’t doubt that there is a huge number of much more sophisticated methods in the literature, but those might be hard to apply.
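
For concreteness, a rough sketch (untested) of the importance-sampling variant in R - log_lik here is a hypothetical function you would have to write for your image problem, and the priors are placeholders:

n <- 1e5
shift_x <- rnorm(n, 0, 10)   # draws from the (placeholder) priors
shift_y <- rnorm(n, 0, 10)
angle   <- runif(n, 0, 2 * pi)
log_w <- mapply(log_lik, shift_x, shift_y, angle)  # hypothetical log-likelihood
w <- exp(log_w - max(log_w))  # subtract the max for numerical stability
w <- w / sum(w)               # self-normalized importance weights
sum(w * angle)                # e.g. the posterior mean of the angle

The dense-grid variant is the same computation with the random draws replaced by the points of a regular grid, in which case each weight also needs a factor for the prior density at that point.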

EDIT: The last few sentences are probably overconfident. I really don’t understand this topic very well. Maybe there are other simple options. Maybe the simple options are unlikely to work. I honestly don’t know a lot about this and don’t want to oversell my quick thoughts on the topic as rock solid advice.


Hi @martinmodrak ,

thanks for sharing your opinion. What if I tell you that I know the symmetry? That should reduce the multimodality problem, right? If you don’t see a realistic chance that this will work, I will probably try something else, like your grid suggestions. However, it would be sad, as I’ve invested some time in learning modelling with Stan :-)

Another problem I’m facing is that at every stage of the analysis, I typically only want to evaluate a discrete grid of parameters. What is the best way to handle that? Or should I first implement it using continuous parameters and then come back to the forum to discuss this “discrete grid” problem?

Best,
Thorsten

Two quick things: you need to handle the symmetries for both the reference and the observed data. In the 1D example, I would expect your model to fail if there are two bumps in the observed data (even if one is quite pronounced and the other just barely above the noise).

This might be crucial. If the grid is small-ish, I would just brute-force it. Or - if you need to use the grid on top of a more complex model - you can marginalize the grid out in Stan (this would force you to have at least one parameter per grid point, so it will become intractable quickly). Generally, in the context of Stan, moving between discrete and continuous usually completely changes how hard the problem is. Here it would make the computation much more expensive, but once you marginalize the discrete grid out, you might (not completely certain about this) get rid of the multimodality, i.e. the marginalized model would likely have just one mode but still be able to represent all the modes of the original model (due to the extra parameters).

It also depends heavily on why you want to consider only a grid - is it really a physical constraint of the system or a convenient simplification?

You were right: if I introduce a second peak, sometimes the model works and finds both, and sometimes not.

Here is the reason for the discrete grid:
What I observe are noisy projections of an unknown 3D object. The goal is to reconstruct the 3D volume from the 2D projections. In the beginning you start with some kind of low-resolution initial 3D model, then you find the likelihood of the orientations + shifts (on a coarse grid) of the observed projections, backproject them (weighted by likelihood), and get a better-resolved 3D reconstruction. Then you restart the process (expectation maximization) with the new reconstructed volume and a finer grid. In the end you get a high-resolution reconstruction.

There are already software packages that implement Bayesian approaches to this problem ( https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3690530/ ), but I entered the world of Bayes with the rethinking book and Stan, and was wondering if I can implement similar things with Stan.

Best,
Thorsten

This sounds like something that would be extremely challenging to implement in Stan. Maybe not impossible, but definitely challenging. As I said, treating the grid indices as discrete parameters to be marginalized out might work and get rid of the multimodality, but there might be other challenges, and it would likely be very expensive computationally.

This is the funny thing with the word “Bayes”: it can mean many different things. The paper you linked doesn’t (after a brief read-through) describe a “Bayesian” approach in the sense Stan uses the word. What they do is compute a MAP estimate, which is basically maximum likelihood + regularization (in Stan you can achieve this behaviour using the “optimize” mode). They do not (if I understand the paper correctly) try to quantify the uncertainty about the structure, which would be needed for us to call the approach “fully Bayesian”. IMHO their use of “Bayes” is really just a buzzword; MAP estimates are AFAIK rarely called “Bayes” (because hey - we already have a name for them, and that is “MAP estimate”).

BTW I don’t really understand cryo-EM, but I believe it is very cool stuff! However, if I read the paper right, even optimization-based approaches (like the MAP procedure they describe) are very expensive to compute, making it unlikely that a fully Bayesian treatment would be tractable.
