Fitting when data is a density


#1

I have fluorescence data that looks like this:
[figure: sample fluorescence data]

There is strong theoretical support for representing it as coming from a mixture of five normals. I was wondering how to specify such a model in Stan. This is what I had in mind:

model {
  // gist only: the five-component mixture CDF evaluated at data point x[i]
  real cum_value = 0;
  for (k in 1:5)
    cum_value += p[k] * Phi((mu[k] - x[i]) / s[k]);
}

Obviously this won’t work as is since cum_value is data, but that’s the gist of the model I’d like to estimate. I’m particularly interested in the mixture proportions.

EDIT: it probably makes more sense to use the pdf as the expectation for a positive-support random variable to capture the measurement error. I’ll be trying that for now.
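
A rough sketch of what I have in mind, where the lognormal noise model and all the names are just placeholder assumptions on my part:

data {
  int<lower=1> N;
  vector[N] x;             // wavelengths
  vector<lower=0>[N] y;    // measured intensities (positive support)
}
parameters {
  simplex[5] p;            // mixture proportions, what I'm after
  vector[5] mu;            // emission peak locations
  vector<lower=0>[5] s;    // peak widths
  real<lower=0> sigma;     // measurement noise scale
}
model {
  for (i in 1:N) {
    // the mixture pdf at x[i] gives the expected intensity
    real pdf_val = 0;
    for (k in 1:5)
      pdf_val += p[k] * exp(normal_lpdf(x[i] | mu[k], s[k]));
    y[i] ~ lognormal(log(pdf_val), sigma);  // one possible positive-support choice
  }
}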


#2

Can you tell us a bit more about the data generating process that gave rise to the data in the figure?


#3

We excite a solution of a certain protein with a UV laser and measure the intensity of light it emits back over a range of wavelengths. It is known that the components of the protein that emit light do so at one of five specific wavelengths (and approx normally around that value) but we don’t know a priori the proportions of these “emitters” for our particular protein.

The proportions actually also vary depending on the conformational state of the protein, which means we have a mixture over the five-proportion mixture, but just having the idea of what the average proportions are would be fine for now.


#4

If you’re modeling the process as a linear combination of normal distributions, can’t you just normalize your data so that it actually gives the density of emissions, and estimate the five parameters giving the coefficients of this linear combination (with the constraint that they sum to 1)?
Maybe that could be formulated as a multivariate normal with a diagonal covariance matrix, since that is normalized to unity without additional constraints, and the \mathbf{\mu} vector components would give you the relative intensities. I'm not sure about that, though, so check.
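
Roughly what I mean, as a sketch only; the trapezoid-rule normalization and all the names here are just for illustration:

data {
  int<lower=1> N;
  vector[N] x;             // wavelengths, assumed sorted
  vector[N] y;             // raw intensities
}
transformed data {
  real area = 0;
  vector[N] y_norm;
  // rescale so the curve integrates to ~1 (trapezoid rule)
  for (i in 2:N)
    area += 0.5 * (y[i] + y[i - 1]) * (x[i] - x[i - 1]);
  y_norm = y / area;
}
parameters {
  simplex[5] p;            // coefficients constrained to sum to 1
  vector[5] mu;
  vector<lower=0>[5] s;
  real<lower=0> sigma;
}
model {
  for (i in 1:N) {
    real f = 0;
    for (k in 1:5)
      f += p[k] * exp(normal_lpdf(x[i] | mu[k], s[k]));
    y_norm[i] ~ normal(f, sigma);
  }
}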


#5

There’s a chapter in the manual that explains how to code up multi-component mixture models in Stan. We don’t have convenient built-ins beyond two components, but it’s not that hard to code directly.


#6

So this is more like curve fitting than a mixture model, but conveniently it'll have the mixture-model problems anyway, I guess, haha.

So you have this process that generates a curve. Presumably we’re assuming this is deterministic?

Like the ideal version of the thing you plotted is

f(x; \vec{\alpha}, \vec{\text{loc}}, \vec{\text{scale}}) = \sum_{i=1}^{N} \alpha_i e^{-(x - \text{loc}_i)^2 / \text{scale}_i}

and for whatever reason this is a sum of squared exponential-looking terms. All you need to do is decide on a measurement process and you’re ready for a Stan model.

For continuous measurements, the standard thing is to just fit normal noise unless you have a reason to assume anything else.

So the simplest curve fit you might try is:

data {
  int<lower=1> N;
  vector[N] x;   // wavelengths
  vector[N] y;   // measured intensities
}
parameters {
  vector[5] alpha;
  vector[5] loc;
  vector<lower=0.0>[5] scale;
  real<lower=0.0> sigma;
}
model {
  // f from the formula above, computed inline
  vector[N] f = rep_vector(0.0, N);
  for (i in 1:5)
    f += alpha[i] * exp(-square(x - loc[i]) / scale[i]);
  y ~ normal(f, sigma);
}

So it’s easy enough to write that down, but you’re probably going to have a bunch of problems with this model. I’ll list ’em in order of what I think is most important, but they’re all pretty rough.

  1. The model I wrote down has 16 parameters. I do not think the curve you plotted will identify 16 parameters reliably. Just my opinion – I didn’t actually try it, but that curve looks really simple. There are usually a deceptively large number of ways to put quadratic exponentials together and fit curves. Sampling will help you find some of them, but probably not all of them, and the diagnostics will be sad.

  2. You say there is strong evidence that this data is the sum of a bunch of squared exponentials. I believe that it may be a good fit, but your data looks really clean. When your data doesn’t have much noise, it is important that you have the right model. Bayesian posteriors are only reliable when you have a good model, and really clean data raises the bar on what a good enough model is. If you don’t have a good model, then the posterior is going to tell you more about model misspecification than anything reliable about the parameters of your system. It’s all about that generative process.

  3. There are lots of problems with the posteriors of things that are mixed together, for instance re-orderings of the means (label switching). You can put ordering constraints on the means (see the sketch after this list), but it’ll still be rough.
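
For example (a sketch, not something I’ve actually run): the ordering constraint from point 3 is just swapping the loc declaration in the model above for Stan’s built-in ordered type.

data {
  int<lower=1> N;
  vector[N] x;
  vector[N] y;
}
parameters {
  vector[5] alpha;
  ordered[5] loc;              // the one change: loc[1] < ... < loc[5] by construction
  vector<lower=0.0>[5] scale;
  real<lower=0.0> sigma;
}
model {
  vector[N] f = rep_vector(0.0, N);
  for (i in 1:5)
    f += alpha[i] * exp(-square(x - loc[i]) / scale[i]);
  y ~ normal(f, sigma);
}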

Since I answered so negatively, here’s a similar question from a while ago: Mixture of Gaussian functions. You can have a look at that and see if there’s anything useful.

I probably made the outlook sound really bleak, haha :P, but feel free to ask more questions. I could be super wrong. Just give things a try, see what works and what doesn’t, and post questions if you have ’em! Hope that helps!


#7

Gaussian mixture models (in one or more dimensions) are a general non-parametric density estimation technique as the number of components grows. Andrew et al. discuss it in BDA.


#8

But @bmfazio’s problem is not density estimation but curve fitting, as @bbbales2 wrote.

I also assume it’s still going to be difficult to estimate five normals even if the noise is quite small.

I would definitely constrain the means of the normals to be ordered.