Continuous(?) mixing model examples

Hi all,

I’m fairly new to stan, my work focuses on chemical mixing models where we want to e.g. estimate the contributions of different sources to pollutants present in a given sample based on tracers with uncertain endmember data. An example of such a model is this: https://agupubs.onlinelibrary.wiley.com/doi/full/10.1002/2014GC005683

I’m wondering if one of you can point me to examples where such mixing models have been implemented in stan that I could use as a starting point.

thanks, Lukas

Hi Lukas,

After a bit of searching I couldn’t personally find anything, though someone else might know of some examples in Stan. I skimmed the paper you linked though and I think I understand the general model. It seems like the idea is that you have measurements of different pollutants, where each measurement for a given pollutant is modeled as being drawn from a normal distribution centered on mean \mu = \sum_{k} p_k\mu_k , where k is the number of sources of a particular pollutant, p_k is the proportion of total pollutant coming from that source and \mu_k is the average amount of pollutant from source k. If I am misunderstanding the model you are talking about, then please ignore the rest of this post. If it sounds right though, then if only considering a single pollutant, a very simple stan model for estimating p_k assuming that \mu_k are known (e.g., the sample means from the source, ignoring sample variance for now; more on that later), a simple Stan model might look something like:

// Simple isotope mixing model

data {
  int<lower=0> N; // Number of observations from mixture distribution
  vector[N] y; // Measurements from mixture
  int<lower=1> k; // Number of sources of isotopes
  vector[k] mus; // Sample means from each source
  vector[k] alpha; // Dirichlet prior
}


parameters {
  simplex[k] ps; // Mixing proportions
  real<lower=0> sigma; // Mixture distribution variance
}

transformed parameters{
  real mu = dot_product(mus, ps); // Mixture mean

}


model {
  
  ps ~ dirichlet(alpha); // Prior for mixing proportions
  
  y ~ normal(mu, sigma); // Model for mixing distribution

}

generated quantities{
  real y_predict[N];
  
  // Posterior predictive simulation
  for(i in 1:N){
    y_predict[i] = normal_rng(mu, sigma);
  }
  
  
}

The most unique part of this model is the prior for the mixture proportions (p_k ). These mixture proportions have to sum to 1, so that is why I specified it as a simplex data type (a simplex is just a vector whose elements sum to 1). If you only had two sources, then an appropriate prior for one of the p_k (call it $p_1) might be some sort of beta distribution, with the other p_k just being 1 - p_1. In the case where there are more than two sources, you need a distribution that is a multi-variate generalization of a beta distribution, which is where the Dirichlet distribution comes in. The Dirichlet distribution has one parameter, a vector with number of elements = number of sources in this case.

If this is indeed the kind of model that you are looking for, then playing around with simulations can help you to better understand the model and the Dirichlet prior. Here is an R script I used to simulate data according to the simple model I put forth and to analyze the simulated data with Stan (using rstan)
Mixing_Simulation.R (2.2 KB)

The model converges and fits the data well. See for example the posterior predictive simulations:

and a simple comparison of the mixture proportion estimates to the truth (red point is simulated truth; error bars represent one posterior standard deviation):

Improvements to this model could include modeling the population means as error contaminated (see here for more details) and hierarchical modeling if you have measurements of multiple different pollutants (lots has been written about hierarchical models so I’ll just link this and this as some nice examples).

I know this was a lot of detail, but I am happy to explain anything included here in greater depth and with less jargon if necessary. Just didn’t want to waste too much time if I am misunderstanding the model you are trying to implement :).

2 Likes

Thanks a lot for your help! This looks great - I’ll need bit of time to digest this, but I’m sure this will save me days and weeks of trial and error!

1 Like

I’m glad I could help! One small thing to note as you work through the model is that in the example Stan code I posted I forgot to explicitly specify a prior for sigma. This means that Stan opts for a default, super flat “uninformative” prior, which is almost always an unrealistic prior and can lead to convergence issues. So I would suggest using your domain expertise to decide on a more informative/realistic sigma prior, and to always explicitly specify priors in general.