# Simple linear fitting problem with two predictors

I am working on a linear fitting problem using Stan to analyze contributions from two sources (emitters) on two pollutants. The goal is to determine the emission ratios of the pollutants for both emitters and how the contributions vary over time.

A simple scenario is described as follows:

• Emitter A releases both Pollutant 1 and Pollutant 2.
• Emitter B releases only Pollutant 1.
• Measurements of Pollutant 1 and Pollutant 2 in the atmosphere over time are available.

We can express this relationship as:

Conc = Emitter Time Series × Emitter Profiles

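To make the dimensions concrete, here is a small NumPy sketch of that product (the values of `a`, `G`, and the time points are made up purely for illustration):

```python
import numpy as np

a = 0.3                       # fraction of Emitter A's output that is Pollutant 1
F = np.array([[a, 1 - a],     # Emitter A's profile over the two pollutants
              [1.0, 0.0]])    # Emitter B emits only Pollutant 1
G = np.array([[2.0, 1.0],     # contributions of A and B at time 1
              [0.5, 3.0]])    # contributions of A and B at time 2
Conc = G @ F                  # (time x pollutant) concentration matrix

# Pollutant 2 comes only from Emitter A, so its column is G[:, 0] * (1 - a)
print(Conc)  # [[1.6, 1.4], [3.15, 0.35]]
```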
To solve this, I have written the following Stan code:

```stan
data {
  int<lower=1> I;               // number of samples
  int<lower=1> J;               // number of species (set to 2 for two pollutants)
  int<lower=1> K;               // number of factors (set to 2 for two factors)
  matrix<lower=0>[I, J] X;      // concentration matrix
  matrix<lower=0>[I, J] Sigma;  // uncertainty (variance) matrix
}

parameters {
  matrix<lower=0>[I, K] G;  // scores (time series of each factor)
  real<lower=0> a;          // proportion of Pollutant 1 attributed to Factor 1
}

transformed parameters {
  // Populate the F matrix according to specific constraints
  matrix[K, J] F;
  F[1, 1] = a;       // Factor 1 affects Pollutant 1
  F[1, 2] = 1 - a;   // Factor 1 affects Pollutant 2
  F[2, 1] = 1;       // Factor 2 affects Pollutant 1
  F[2, 2] = 0;       // Factor 2 has no effect on Pollutant 2 (prior knowledge)
}

model {
  // Priors
  a ~ uniform(0, 1);  // uniform prior for a

  // Likelihood
  for (i in 1:I) {
    for (j in 1:J) {
      X[i, j] ~ normal(dot_product(row(G, i), col(F, j)), sqrt(Sigma[i, j]));
    }
  }
}
```

#### Issue

I expected an extreme case where the model would drive the value of `a` toward zero, so that the first factor explains only Pollutant 2 and the second factor explains only Pollutant 1. However, the model consistently returns non-zero values for `a`. When I compare the reproduced concentration matrix to this extreme case (which could achieve a perfect match), the model’s fit does not align with my expectations.

So I am confused: why doesn’t the model find this extreme case? Or is the resulting solution actually better mathematically, since both factors I set up are treated as important for both pollutants?

Any insights or suggestions from the Stan community would be greatly appreciated. Thank you!

This is a very popular factor model for pollution source apportionment, and we’ve discussed it on the forums previously. Some comments:

1. You need to declare `<lower=0, upper=1>` on `a` to make it a proportion. As is, it’s more like a concentration, constrained only to be positive. In Stan, the constraint on a parameter should match the support of its distribution, and here the distribution is `uniform(0, 1)`, so you really need the constraint for the model to be well formed.

2. Unless `K = J = 2`, this model’s not going to be well formed around `F`. Assuming this holds, then you can declare and define `F` in one line as

```stan
matrix<lower=0>[K, J] F = [[a, 1 - a], [1, 0]];
```

You’d only do this if you knew the loadings were `[1, 0]` for the second emitter; otherwise you’d fit it as `[b, 1 - b]`. In general, you can declare those to be simplexes.

```stan
parameters {
  simplex[2] factor1;
  simplex[2] factor2;
  ...
}
model {
  matrix[K, J] F = [factor1', factor2'];  // transpose: simplexes are column vectors
  ...
}
```
These will get uniform distributions by default, though this version assumes you’re also fitting the second factor rather than fixing it at `[1, 0]`.
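For intuition, here’s a NumPy analogue of that parameterization: Dirichlet(1, 1) draws are uniform over the simplex, matching Stan’s implicit uniform prior on `simplex` parameters (the seed and dimensions are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
# Dirichlet(1, 1) draws are uniform on the 2-simplex, which is what a
# simplex[2] parameter with no explicit prior gets in Stan.
factor1 = rng.dirichlet([1.0, 1.0])
factor2 = rng.dirichlet([1.0, 1.0])
F = np.vstack([factor1, factor2])  # rows are profiles, each summing to 1
```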

3.  In the model block, you want to use matrix multiply for efficiency,

```stan
matrix[I, J] GF = G * F;
matrix[I, J] sqrtSigma = sqrt(Sigma);
for (i in 1:I) for (j in 1:J) X[i, j] ~ normal(GF[i, j], sqrtSigma[i, j]);
```

Then you can efficiently collapse to a one-liner as

```stan
to_vector(X) ~ normal(to_vector(G * F), sqrt(to_vector(Sigma)));
```
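As a sanity check on that equivalence, here is a NumPy sketch showing the double loop and the flattened one-liner give the same log likelihood; Stan’s `to_vector` flattens column-major, which `order='F'` mimics (all values here are simulated for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
I, J, K = 5, 2, 2
a = 0.3
F = np.array([[a, 1 - a], [1.0, 0.0]])      # K x J profile matrix
G = rng.uniform(0, 2, size=(I, K))          # I x K time series
Sigma = rng.uniform(0.1, 0.5, size=(I, J))  # per-element variances
X = G @ F + rng.normal(0, np.sqrt(Sigma))   # noisy observations

# Double loop, as in the elementwise model block
GF = G @ F
ll_loop = sum(
    -0.5 * np.log(2 * np.pi * Sigma[i, j])
    - 0.5 * (X[i, j] - GF[i, j]) ** 2 / Sigma[i, j]
    for i in range(I) for j in range(J)
)

# Vectorized, mimicking Stan's column-major to_vector
mu = GF.flatten(order='F')
x = X.flatten(order='F')
s2 = Sigma.flatten(order='F')
ll_vec = np.sum(-0.5 * np.log(2 * np.pi * s2) - 0.5 * (x - mu) ** 2 / s2)
```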
4. The normal distribution isn’t such a great choice for values that are constrained to be positive. You might want to use lognormal, which at least has the correct support. The problem with using normal is that you might get posterior intervals that move into unphysical regions with values of `X` below zero.
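A quick NumPy illustration of the support problem (the mean, scale, and seed are made-up values):

```python
import numpy as np

rng = np.random.default_rng(2)
mu, sigma = 0.5, 0.4  # a small positive concentration with sizable noise
normal_draws = rng.normal(mu, sigma, size=10_000)
lognormal_draws = rng.lognormal(np.log(mu), sigma, size=10_000)

# The normal puts real mass on unphysical negative concentrations;
# the lognormal is positive by construction.
frac_negative = (normal_draws < 0).mean()
```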

5. You might want to share some of the `Sigma` values rather than having variation for each component. For instance, I would have started by taking `Sigma` to be a `J`-vector rather than an `I x J` matrix. Does `Sigma` vary by observation or just by species? Sometimes you’ll find that having independent error terms for each measurement is too many degrees of freedom and sucks up all the explanation into the error terms.

What you want to do is simulate data from the model and see if the recovered posteriors cover your true parameters at their nominal rate (e.g., 50% posterior intervals should contain the true value about 50% of the time).

Hamiltonian Monte Carlo doesn’t “search” per se; it samples by taking a random-ish walk around the posterior. And we’re doing Bayesian inference, not optimization, so we’re rarely going to push a factor all the way to zero this way. What you can do is look at your posterior coverage on the two factors and see if it’s consistent with the values you simulated, even if the posterior mean isn’t exactly your simulated parameter. As you get more data, the posterior means should converge to the true values and the posterior intervals should shrink toward zero width. If instead you use optimization to find a maximum likelihood estimate, those estimates often land on the boundaries of parameter space.
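To see why a non-zero `a` can genuinely be the better answer, here is a simplified NumPy simulation that treats `G` as known (an assumption made only to keep the sketch short) and profiles the log likelihood over a grid of `a` values; the likelihood peaks near the true `a`, not at zero:

```python
import numpy as np

rng = np.random.default_rng(3)
I = 200
a_true = 0.3
F_true = np.array([[a_true, 1 - a_true], [1.0, 0.0]])
G = rng.uniform(0.5, 2.0, size=(I, 2))  # pretend the time series is known
sigma = 0.05
X = G @ F_true + rng.normal(0, sigma, size=(I, 2))

# Profile the Gaussian log likelihood over a grid of candidate a values
grid = np.linspace(0, 1, 201)

def loglik(a):
    F = np.array([[a, 1 - a], [1.0, 0.0]])
    return -0.5 * np.sum((X - G @ F) ** 2) / sigma ** 2

lls = np.array([loglik(a) for a in grid])
a_hat = grid[np.argmax(lls)]  # maximizer lands near a_true, not at 0
```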