Inferring spot intensity when the spot may or may not be present

Hi,

I am trying to infer the intensity of some fluorescent, diffraction-limited spots using Stan. Here is an example of the regions of interest of one spot that was followed over time:

[image: ROIs of the tracked spot at successive time points]

Each region of interest represents one time point; time increases from left to right and from top to bottom.

As you can see, sometimes the spot is present and sometimes not.
For each region of interest (time point), I am modelling each pixel as drawn from a Gamma distribution (representing the pixel noise of the acquisition system) with mean given by the sum of a background and a signal:
\mu_{ij} = b_{ij} + s_{ij}
The signal s_{ij} is given by a 2D Gaussian with parameters

  • total intensity I
  • w_x, w_y, the widths in the x and y directions
  • i_0 and j_0, the center of the 2D Gaussian in pixel coordinates.

s_{ij} = \frac{I}{2 \pi w_x w_y}\exp{(-\frac{(i-i_0)^2}{2 w_x^2}-\frac{(j-j_0)^2}{2 w_y^2})}
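
If I am not mistaken, with this normalisation the signal approximately sums to the total intensity whenever the spot lies well inside the region of interest,

\sum_{i,j} s_{ij} \approx \iint s \, di \, dj = I,

which is why I refer to I as the total intensity.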

I am primarily interested in I, and I would like it to be zero when no spot is present in the region of interest. However, the posterior mean of I (\langle I \rangle) can be substantially different from 0 even when no spot is present, and I am still trying to understand why this is the case. Here you can see \langle I \rangle over time for the spot shown above:

[image: \langle I \rangle over time for the spot shown above]

I notice that there is a correlation between w_x and I (and between w_y and I), which could explain the high values of I in the absence of a spot…

I don’t know if I am approaching the modelling problem correctly, because if there is no spot the parameters w_x, w_y, i_0 and j_0 do not have any meaning, and in fact the posterior has a larger variance than when a spot is present. Here I am showing the variance of w_x over time:

[image: variance of w_x over time]

I thought about having a Bernoulli variable indicating the presence or absence of a spot, but I think that marginalising it out in every frame would become computationally very expensive.

Would you have any suggestions about how to improve the model? I would be happy to give more details if needed.

Thank you in advance.

It wasn’t clear to me which of those examples had a spot. Just the second and third rows? And how is intensity defined? Are the images actually color or is this a multicolor rendering of a black and white image?

What’s a region and what’s a time point? You showed a bunch of 2D color (?) images. Are they arranged into a time series somehow?

Is \mu_{i, j} the image you observe? Is there some kind of hierarchical model for the background?

I don’t understand why s[i, j] takes on a value that looks like a density. Is it sampled somehow from something with that density rather than equated?

It’s hard to say much without knowing what a spot is. How are you modeling images that do not have spots? One issue is that Bayesian posterior means are never zero (more precisely, they take on value zero with measure zero if you have a continuous density).

How are you modeling the time-series of the spot? Presumably the frames are related.

Wouldn’t it be more natural to have a mixture model of spot and no spot? I don’t think this would be any harder to marginalize than any other mixture model.

Hi @Bob_Carpenter

Thank you very much for taking the time to read my question.

  1. I’ll describe the data in more detail. The original data are gray-scale images (16 bits) with 2 channels and several time frames. The first channel (not shown in my original question) contains a few target molecules, appearing as diffraction-limited fluorescent spots (1 target molecule = 1 spot). I have tracked each molecule over time, defining for each time point a region of interest (ROI) containing the target molecule at that time. This is an example of a time series for one target molecule (time increases from left to right and top to bottom). It’s a multicolor rendering of gray-scale images, with higher pixel values in yellow and lower pixel values in dark blue. What I call a spot is a few bright (yellow) pixels popping out of the dark (blue) background. I have several of these target molecules tracked over time.
    One target molecule over time:

    The second channel (shown in my original post) corresponds to the wavelength of binder molecules, which can bind to the target molecule (multiple binders can bind to one target molecule at the same time). So these are the same ROIs I just showed, but in the binder channel.
    Binder channel:

    The colormap is the same as before. I’d say that binder molecules become visible at frame 5 (starting at 0) and disappear around frame 18/19. The total intensity of the binder spot relates to the number of binder molecules bound to the target molecule. I am interested in the number of binders at each time point, which is why I would like to determine the intensity of the binder spot, if binder molecules are present. I hope the problem is better stated now.
    Finally, the pre-processed data are M (number of target molecules) arrays of length T (number of frames) of 9x9 matrices (pixel values in the binder channel), each array representing the ROI of one target molecule followed over time.
  2. As you mentioned afterwards, the different frames are related; however, in a first modelling attempt I neglected the temporal correlation. For each ROI, I was modelling the pixel value \mu_{ij} (of an idealised noise-free image) as the sum of a background b and a signal due to the binder molecules s_{ij}:
    \mu_{ij} = b + s_{ij}
    I decided to approximate the distribution in space of the intensity of the binder spot as a 2D gaussian, so
    s_{ij} = \frac{I}{2 \pi w_x w_y}\exp{\left ( -\frac{(i-i_0)^2}{2w^2_x} -\frac{(j-j_0)^2}{2w^2_y} \right )}
    where w_x (w_y) is the width in the x (y) direction, (i_0, j_0) is the position of the spot center, and I is the total intensity (the integral value) of the binder spot. The 2D Gaussian here represents the diffraction pattern of the light emitted by the binder molecules, an approximation of the point spread function of the microscope… I hope this is clearer.
    On top of this I am modelling the noise due to the acquisition system as a Gamma distribution with mean \mu_{ij} and variance \gamma \mu_{ij}, where \gamma represents the gain of the EMCCD camera. Hence, finally, the value of the pixel at position i, j in the ROI is given by
    v_{ij} \sim \text{Gamma} (\mu_{ij}, \gamma \mu_{ij}),
    where the Gamma distribution is parameterised here by its mean and variance (see the small Stan sketch after this list for the mapping to the shape/rate parameterisation).
    Because \gamma depends on the acquisition settings, it should be a parameter shared between all the data coming from the same acquisition, i.e. between all the arrays.
  3. So, I think the problem in my model is exactly that I am using the same model regardless of the presence or absence of the spot: if there is no spot, the parameters w_x, w_y, i_0 and j_0 do not have any meaning, while I would like I to be close to zero. I thought about introducing some temporal correlation, but maybe that is not the point here.
  4. I thought about using a Bernoulli variable p that is 0 if no spot is present and 1 if the spot is present, following the description in section 7.2 of the Stan User’s Guide (change point models). Is this what you mean? Is marginalisation feasible considering that I would need one p variable for every time point t < T and for every target molecule m < M?
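
To make point 2 concrete, here is a minimal sketch (my own notation, for a single ROI with a hypothetical known noise-free image) of how this noise model maps to Stan’s shape/rate parameterisation: a mean of \mu and a variance of \gamma \mu correspond to shape \mu / \gamma and rate 1 / \gamma.

data {
  int npixels;                             // linear ROI size (e.g. 9)
  matrix<lower=0>[npixels, npixels] mu;    // hypothetical: idealised noise-free image b + s
  matrix<lower=0>[npixels, npixels] v;     // observed pixel values
}
parameters {
  real<lower=0> g;                         // EMCCD gain (gamma above)
}
model {
  g ~ normal(0, 10);
  // mean mu and variance g * mu  <=>  shape = mu / g, rate = 1 / g
  to_vector(v) ~ gamma(to_vector(mu) / g, 1 / g);
}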

Thank you again for your answers.

I read section 5.3 of the Stan User’s Guide, and if I understand correctly, a simple model to describe my data using finite mixtures and neglecting all correlations (treating each ROI as independent) could be the following:

data {
  int nROIs;                                   // number of ROIs
  int npixels;                                 // linear size of one ROI in pixels
  array[nROIs] matrix[npixels, npixels] ROIs;  // pixel values
}
parameters {
  real<lower=0> g;                             // detector gain
  vector<lower=0>[nROIs] b;                    // background
  vector<lower=0>[nROIs] i0, j0;               // spot center (pixel coordinates)
  vector<lower=0>[nROIs] wx, wy;               // spot width in pixels
  vector<lower=0>[nROIs] I;                    // spot intensity

  simplex[2] theta;                            // mixing proportions
}
model {
  // priors
  g ~ normal(0, 10);
  b ~ normal(0, 100);
  i0 ~ normal((npixels + 1) %/% 2, 2);
  j0 ~ normal((npixels + 1) %/% 2, 2);
  wx ~ normal(1, 0.25);
  wy ~ normal(1, 0.25);
  I ~ normal(0, 3000);

  vector[2] log_theta = log(theta);

  for (n in 1:nROIs) {  // iterating over ROIs
    vector[2] lps = log_theta;

    for (i in 1:npixels) {
      for (j in 1:npixels) {  // iterating over pixels
        real y = ROIs[n][j, i];

        // no spot: background only
        real mu = b[n];
        real alpha = mu / g;  // shape
        real beta = 1 / g;    // rate
        lps[1] += gamma_lpdf(y | alpha, beta);

        // spot: background plus 2D Gaussian
        mu += I[n] / (2 * pi() * wx[n] * wy[n])
              * exp(-pow(i - i0[n], 2) / (2 * pow(wx[n], 2))
                    - pow(j - j0[n], 2) / (2 * pow(wy[n], 2)));
        alpha = mu / g;
        lps[2] += gamma_lpdf(y | alpha, beta);
      }
    }
    target += log_sum_exp(lps);
  }
}

Sorry to add yet another comment, but if I understand correctly, in the finite mixture model that I wrote above theta[1] will be the proportion of data generated by the ‘no spot’ model, while theta[2] will be the proportion generated by the ‘spot’ model… However, I think it would be more interesting for me to have a probability of having a spot for each ROI. So I was thinking that I could use a probability p of having a spot in each ROI, as in the following model:

data {
  int nROIs;                                   // number of ROIs
  int npixels;                                 // linear size of one ROI in pixels
  array[nROIs] matrix[npixels, npixels] ROIs;  // pixel values
}
parameters {
  real<lower=0> g;                             // detector gain
  vector<lower=0>[nROIs] b;                    // background
  vector<lower=0>[nROIs] i0, j0;               // spot center
  vector<lower=0>[nROIs] wx, wy;               // spot width
  vector<lower=0>[nROIs] I;                    // spot intensity

  vector<lower=0, upper=1>[nROIs] p;           // probability of a spot in each ROI
}
model {
  g ~ normal(0, 10);
  b ~ normal(0, 100);
  i0 ~ normal((npixels + 1) %/% 2, 2);
  j0 ~ normal((npixels + 1) %/% 2, 2);
  wx ~ normal(1, 0.25);
  wy ~ normal(1, 0.25);
  I ~ normal(0, 3000);

  p ~ uniform(0, 1);

  for (n in 1:nROIs) {
    // mixture weights on the log scale: log(1 - p[n]) and log(p[n])
    vector[2] lps;
    lps[1] = log1m(p[n]);
    lps[2] = log(p[n]);

    for (i in 1:npixels) {
      for (j in 1:npixels) {
        real y = ROIs[n][j, i];

        // no spot: background only
        real mu = b[n];
        real alpha = mu / g;  // shape
        real beta = 1 / g;    // rate
        lps[1] += gamma_lpdf(y | alpha, beta);

        // spot: background plus 2D Gaussian
        mu += I[n] / (2 * pi() * wx[n] * wy[n])
              * exp(-pow(i - i0[n], 2) / (2 * pow(wx[n], 2))
                    - pow(j - j0[n], 2) / (2 * pow(wy[n], 2)));
        alpha = mu / g;
        lps[2] += gamma_lpdf(y | alpha, beta);
      }
    }
    target += log_sum_exp(lps);
  }
}

The user’s guide chapter on latent discrete parameters shows how to derive these probabilities for individual cases. I think I may have done it in the mixture model chapter, too, but I don’t recall. The trick is to use Bayes’s rule to generate probabilities given the posterior.
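
Concretely, for ROI n with pixel data y_n under the two-component mixture above, Bayes’s rule gives

\Pr[\text{spot} \mid y_n] = \frac{\theta_2 \, p(y_n \mid \text{spot})}{\theta_1 \, p(y_n \mid \text{no spot}) + \theta_2 \, p(y_n \mid \text{spot})},

which in the notation of the code above is exp(lps[2] - log_sum_exp(lps)).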

Yes, I think I am getting a better understanding…
If I am correct, what I am looking for is similar to what is described in Section 5.3, in the paragraph “Estimating parameters of a mixture”. There, one parameter of the model is the vector of mixture proportions (theta), which lies on the unit simplex. No prior is specified in that example, but if I understood correctly the “default prior” is a Dirichlet distribution. Is that correct?
My thought was to specify a probability p_k of having (or not having) a spot in the k-th ROI, which would be equivalent to the theta parameter in the mixture model. In my case p_k would be a parameter with a prior distribution (on which I could add some temporal correlation), and I would then use Bayes’ rule to write the log likelihood given the priors. Does that make sense?

The default prior in Stan is always uniform over support. So if a variable is constrained to a simplex, that will be uniform over simplexes, which turns out to be a Dirichlet with shape 1. You can add whatever kind of prior you want on top of that: a non-uniform Dirichlet, a logistic multivariate normal, or some kind of time-series thing (though you’d probably do that on the log odds scale and then run through softmax rather than using a simplex parameter). For example, to get a multivariate logistic normal parameter theta, you can code this way.

parameters {
  vector[K] alpha;                      // unconstrained, log-odds scale
}
transformed parameters {
  simplex[K] theta = softmax(alpha);    // map to the unit simplex
                                        // (softmax is invariant to adding a constant to alpha,
                                        //  so alpha is only identified through its prior)
}
model {
  alpha ~ multi_normal_cholesky(mu_alpha, L_Sigma_alpha);  // mu_alpha, L_Sigma_alpha defined elsewhere
}

or if you want to put some kind of time-series on alpha, that’d just be something like

   alpha ~ rw2(sigma);

where rw2_lpdf(vector alpha, real sigma) implements a second-order random walk or similar.
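
For example, one possible sketch of such a user-defined density (just an illustration, not a built-in Stan function) puts independent normal(0, sigma) terms on the second differences of alpha:

functions {
  // second-order random walk log density: normal(0, sigma) on second differences of alpha
  real rw2_lpdf(vector alpha, real sigma) {
    int K = num_elements(alpha);
    return normal_lpdf(alpha[3:K] - 2 * alpha[2:(K - 1)] + alpha[1:(K - 2)] | 0, sigma);
  }
}

On its own this density is improper (it only constrains the second differences), so in practice you’d add weak priors on the overall level and slope.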

I was pointing out the user’s guide sections because they show you how to derive the posterior probability distribution over discrete parameters, from which you can either sample or perform inference directly in expectation (the change point example shows the latter is much more statistically efficient, which follows from the Rao-Blackwell theorem). See the “recovering mixture proportions” section in 5.3, Summing out the responsibility parameter, in the Stan User’s Guide.

Thanks for describing the data. That makes the whole thing much clearer to me in terms of your goal.

Yes, I think that’d be feasible. I wasn’t really thinking about the time series, but it’d make sense to have the sequence of choices spot/no-spot be related. But I think you’d need two change points—intro of spot and exit of spot, either of which could be pinned to the first or last entry. So a model where a spot shows up, gets tracked with a time series, then disappears would make sense and I think it’d be feasible (though non-trivial) to code.

In my experience, models get better computationally and inferentially the closer you can get to the data generating process, even if generating the data isn’t your goal (something like detecting spots conditionally might be the goal).

Another option would be to code up a classifier using a neural network image recognition model, though that would require some labeled training data.

This is why I thought a mixture might make sense.

Thank you very much for the explanations!

I see one problem with this approach: I can have multiple appearance/disappearance events for the same target spot (in the same track), so in principle I don’t know the number of change points in advance, which I guess would complicate the model a bit.

This is very nice advice, thanks.

Thank you. I think I understood how to marginalise discrete parameters and how to derive their posterior probability distribution.
I am starting with a very simple model with 2 mixture components (spot and no spot), without modelling the time series (I thought it was better to start simple). At the end I can compute the probability that each image belongs to one category or the other, as shown in the paragraph “Recovering posterior mixture proportions” (5.3, Summing out the responsibility parameter).
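
For reference, this is the sketch I have in mind for that last step (my own code, appended to the first mixture model above; it recomputes the two per-ROI log densities and applies Bayes’ rule, as in the user’s guide):

generated quantities {
  vector[nROIs] prob_spot;   // posterior probability that ROI n contains a spot
  for (n in 1:nROIs) {
    vector[2] lps = log(theta);
    for (i in 1:npixels) {
      for (j in 1:npixels) {
        real y = ROIs[n][j, i];
        real mu = b[n];
        lps[1] += gamma_lpdf(y | mu / g, 1 / g);   // no spot
        mu += I[n] / (2 * pi() * wx[n] * wy[n])
              * exp(-pow(i - i0[n], 2) / (2 * pow(wx[n], 2))
                    - pow(j - j0[n], 2) / (2 * pow(wy[n], 2)));
        lps[2] += gamma_lpdf(y | mu / g, 1 / g);   // spot
      }
    }
    prob_spot[n] = exp(lps[2] - log_sum_exp(lps));
  }
}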

I also thought that maybe a good model for this could be a hidden Markov model, with spot/no spot as the hidden state. I’m not sure if this is a good idea, though.
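
If I were to try that, a rough sketch for a single track might look like the following (my own attempt, assuming the same per-frame model as above and Stan’s built-in HMM functions available since version 2.24; hidden states are 1 = no spot, 2 = spot):

data {
  int T;                                   // number of frames in one track
  int npixels;                             // linear ROI size
  array[T] matrix[npixels, npixels] ROIs;  // pixel values over time
}
parameters {
  real<lower=0> g;                         // detector gain
  vector<lower=0>[T] b;                    // background per frame
  vector<lower=0>[T] i0, j0;               // spot center per frame
  vector<lower=0>[T] wx, wy;               // spot widths per frame
  vector<lower=0>[T] I;                    // spot intensity per frame
  array[2] simplex[2] Gamma_rows;          // transition probabilities (row k: from state k)
  simplex[2] rho;                          // initial state distribution
}
model {
  // priors on g, b, i0, j0, wx, wy and I as in the mixture models above (omitted here)
  matrix[2, 2] Gamma = append_row(Gamma_rows[1]', Gamma_rows[2]');
  matrix[2, T] log_omegas;                 // log p(frame t | state k)
  for (t in 1:T) {
    vector[2] lps = rep_vector(0, 2);
    for (i in 1:npixels) {
      for (j in 1:npixels) {
        real y = ROIs[t][j, i];
        real mu = b[t];
        lps[1] += gamma_lpdf(y | mu / g, 1 / g);   // no spot
        mu += I[t] / (2 * pi() * wx[t] * wy[t])
              * exp(-pow(i - i0[t], 2) / (2 * pow(wx[t], 2))
                    - pow(j - j0[t], 2) / (2 * pow(wy[t], 2)));
        lps[2] += gamma_lpdf(y | mu / g, 1 / g);   // spot
      }
    }
    log_omegas[1, t] = lps[1];
    log_omegas[2, t] = lps[2];
  }
  target += hmm_marginal(log_omegas, Gamma, rho);
}

If I read the documentation correctly, the per-frame posterior probability of the spot state could then be recovered with hmm_hidden_state_prob(log_omegas, Gamma, rho), recomputing log_omegas in generated quantities.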