Fitting when data is a density

I have fluorescence data that looks like this:
[figure: sampdata, the measured fluorescence spectrum]

There is strong theoretical support for representing it as coming from a mixture of five normals. I was wondering how to specify such a model in Stan. This is what I had in mind:

model {
      cum_value = p[1]*Phi((mu[1]-x[i])/s[1])+...+p[5]*Phi((mu[5]-x[i])/s[5])
}

Obviously this won’t work as is since cum_value is data, but that’s the gist of the model I’d like to estimate. I’m particularly interested in the mixture proportions.

EDIT: it probably makes more sense to use the pdf as the expectation for a positive-support random variable to capture the measurement error. I’ll be trying that for now.
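
For concreteness, here’s a rough, untested sketch of what I mean (N, x, and y are placeholders for the number of grid points, the wavelength grid, and the normalized intensities, which I’m assuming are strictly positive; lognormal is just one choice of positive-support noise):

data {
  int<lower=1> N;
  vector[N] x;            // wavelength grid
  vector<lower=0>[N] y;   // normalized intensity at each grid point
}
parameters {
  simplex[5] p;           // mixture proportions
  ordered[5] mu;          // emission peak locations
  vector<lower=0>[5] s;   // peak widths
  real<lower=0> sigma;    // noise scale
}
model {
  for (i in 1:N) {
    real dens = 0;
    for (j in 1:5)
      dens += p[j] * exp(normal_lpdf(x[i] | mu[j], s[j]));
    y[i] ~ lognormal(log(dens), sigma);  // positive-support noise around the mixture density
  }
}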

Can you tell us a bit more about the data generating process that gave rise to the data in the figure?

We excite a solution of a certain protein with a UV laser and measure the intensity of light it emits back over a range of wavelengths. It is known that the components of the protein that emit light do so at one of five specific wavelengths (and approx normally around that value) but we don’t know a priori the proportions of these “emitters” for our particular protein.

The proportions actually also vary depending on the conformational state of the protein, which means we have a mixture over the five-proportion mixture, but just having the idea of what the average proportions are would be fine for now.
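
In symbols, the idea is roughly

I(\lambda) \propto \sum_{c} \pi_c \sum_{j=1}^{5} p_{j \mid c}\, \mathcal{N}(\lambda \mid \mu_j, \sigma_j^2)

where c indexes conformational states; for now I’d settle for the averaged proportions \bar{p}_j = \sum_c \pi_c\, p_{j \mid c}.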

If you’re modeling the process as a linear combination of normal distributions, can’t you just normalize your data so that it actually gives the density of emissions, and estimate the five parameters giving the coefficients of this linear combination (with the constraint that they sum to 1)?
Maybe that could be formulated as a multivariate normal with a diagonal covariance matrix, since that will be normalized to unity without additional constraints, and the \mathbf{\mu} vector components would give you the relative intensities. I’m not sure about that, though, so it’s worth checking.
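
In Stan, at least, the sum-to-one constraint comes for free as a type:

parameters {
  simplex[5] p;  // nonnegative, sums to 1 by construction
}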

There’s a chapter in the manual that explains how to code up multi-component mixture models in Stan. We don’t have convenient built-ins beyond two components, but it’s not that hard to code directly.
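
The basic pattern for K components looks roughly like this (for data that are raw draws from the mixture, which isn’t quite the situation here):

data {
  int<lower=1> N;
  int<lower=1> K;
  vector[N] y;
}
parameters {
  simplex[K] theta;           // mixing proportions
  ordered[K] mu;              // component means (ordered for identifiability)
  vector<lower=0>[K] sigma;   // component scales
}
model {
  for (n in 1:N) {
    vector[K] lps = log(theta);
    for (k in 1:K)
      lps[k] += normal_lpdf(y[n] | mu[k], sigma[k]);
    target += log_sum_exp(lps);  // marginalize out the discrete component label
  }
}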

So this is more like curve fitting than a mixture model, but conveniently it’ll have the mixture-model problems anyway, I guess, haha.

So you have this process that generates a curve. Presumably we’re assuming this is deterministic?

Like the ideal version of the thing you plotted is

f(\vec{\alpha}, \vec{\text{loc}}, \vec{\text{scale}}) = \sum_{i=1}^{N} \alpha_i e^{-(x - \text{loc}_i)^2 / \text{scale}_i}

and for whatever reason this is a sum of squared exponential-looking terms. All you need to do is decide on a measurement process and you’re ready for a Stan model.

For continuous measurements, the standard thing is to just fit a normal if you don’t have a reason to think anything else.

So the simplest curve fit you might try is:

parameters {
  vector[5] alpha;             // component amplitudes
  vector[5] loc;               // component centers
  vector<lower=0.0>[5] scale;  // component widths
  real<lower=0.0> sigma;       // measurement noise
}
model {
  y ~ normal(f(x, alpha, loc, scale), sigma);
}
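
Here f isn’t a built-in; you’d code the curve above as a Stan function, with x and y declared in a data block and the grid x passed in explicitly. A rough, untested sketch:

functions {
  vector f(vector x, vector alpha, vector loc, vector scale) {
    vector[rows(x)] out = rep_vector(0, rows(x));
    for (i in 1:5)
      out += alpha[i] * exp(-square(x - loc[i]) / scale[i]);
    return out;
  }
}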

So it’s easy enough to write that down, but you’re probably going to have a bunch of problems with this model. I’ll list ’em in order of what I think is most important, but they’re all pretty rough.

  1. The model I wrote down has 16 parameters. I do not think the curve you plotted will identify 16 parameters reliably. Just my opinion – I didn’t actually try it, but that curve looks really simple. There are usually a deceptively large number of ways to put quadratic exponentials together and fit curves. Sampling will help you find some of them, but probably not all of them, and the diagnostics will be sad.

  2. You say there is strong evidence to support that this data is the sum of a bunch of squared exponentials. I believe that maybe it is a good fit, but your data looks really clean. When your data doesn’t have much noise, it is important that you have the right model. Bayesian posteriors are only reliable when you have a good model, and really clean data raises the bar on what a good enough model is. If you don’t have a good model, then the posterior is going to tell you more about model misspecification than anything reliable about the parameters of your system. It’s all about that generative process.

  3. There are lots of problems with the posteriors of things that are mixed together, for instance re-orderings of the means. You can put ordering constraints on the means (see the one-liner below), but it’ll still be rough.
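
Concretely, the ordering constraint is just a declaration change in the parameters block:

  ordered[5] loc;  // instead of vector[5] loc; enforces loc[1] < ... < loc[5]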

Since I answered so negatively, here’s a similar question from a while ago: Mixture of Gaussian functions. You can have a look at that and see if there’s anything useful.

I probably made the outlook sound really bleak, haha :P, but feel free to ask more questions. I could be super wrong. Just give things a try and see what works and what doesn’t, and post questions if you have ’em! Hope that helps!

Gaussian mixture models (in one or more dimensions) are a general non-parametric density estimation technique as the number of components grows. Andrew et al. discuss this in BDA.

But @bmfazio’s problem is not density estimation but curve fitting, as @bbbales2 wrote.

I also assume it’s still going to be difficult to estimate five normals even if the noise is quite small.

I would definitely constrain the means of normals to be ordered.

EDIT: I’m dumb and used the wrong curve for reference. Stan actually did great.

For reference:

[figure: Rplot, Stan fit for reference]

Thanks for the help so far - I suppose I will be back later as I try to extend this further with the rest of my data.


Hello all,

I followed the approach suggested by @bbbales2. I actually found a better parametrization for the underlying components, which requires a single parameter for each one, and after some exploration it seems that only three of the five hypothesized components are present here. Regardless, I’ve been getting some odd results.

The fit that HMC settled for looks quite off:

[figure: mix3hmcid0d0, the HMC fit]

(the intensity is normalized here and it isn’t the same curve I initially showed, but this one should actually follow the model more closely given how the experiments were conducted)

The results were consistent over a few sets of reasonable initial values. I decided to see if R’s optim() could do better, and it seemingly did:

[figure: optimgraph1, the optim() fit]

However, this fit was very susceptible to the choice of initial values.

I thought that using rstan::optimizing() might achieve a compromise of stability and good fits, but it gave essentially identical results to those from HMC.

Since Stan’s solutions seem clearly suboptimal, I’m wondering if there might be something wrong with my specification. But the unstable fits from optim() make me think that maybe I need to add some additional restrictions? Not sure how to proceed from here. I’ve got ~60 more curves in which the signal I’m trying to recover here appears mixed with more components, but if I can’t reliably recover the “pure” signal, I don’t suppose using the entire dataset makes sense.

Here’s the model I’ve currently set up (the underlying components are parametrized solely in terms of their maxima):

data {
  int<lower=0> n;   // number of observations
  int<lower=0> k;   // number of components (tried 2-5, but 3 works best)
  
  vector[k] lm0;    // prior means for the component maxima
  vector[k] s0;     // prior sds (how much to trust lm0)
  
  vector[n] I;      // intensity (measured signal)
  vector[n] v;      // wavenumber (reciprocal of wavelength shown in graph)
}
parameters {
  // Maxima for the components
  ordered[k] lm;
  // Relative intensities
  real<lower=0,upper=1> pu[k];
  // Noise
  real<lower=0> error;
}
model {
  // Intermediate values to ease writing the component formula
  vector[k] a; 
  vector[k] r;
  vector[k] vm;
  vector[k] vneg;
  vector[k] vpos;
  // Signals
  vector[k] Im[n]; // Latent signal contributed per component
  vector[n] mI;    // Expected total signal

  // Component formula
    // Place prior on maxima and invert it
  lm ~ normal(lm0,s0);
  for(i in 1:k){
     vm[i] = (10^7)/lm[i];  // convert peak wavelength (nm) to wavenumber (cm^-1)
  }
    // Intermediate calculations
  vneg = 1.177*vm - 7780;
  vpos = 0.831*vm + 7070;
    // Per-component shape parameters (constant across observations)
  for(i in 1:k){
    r[i] = (vm[i] - vneg[i])/(vpos[i] - vm[i]);
    a[i] = vm[i] + r[i]*(vpos[i] - vneg[i])/(r[i]^2 - 1);
  }
    // Per-component emission formula
  for(j in 1:n){
    mI[j] = 0;
    for(i in 1:k){
      Im[j][i] = exp(-log(2)*(log(a[i] - v[j]) - log(a[i] - vm[i]))^2/(log(r[i])^2));
      mI[j] = mI[j] + pu[i]*Im[j][i];
    }
  }
  
  // Prior for noise
  error ~ normal(0, 10);
  // Sampling
  I ~ normal(mI, error);
}