I am working on a linear fitting problem using Stan to analyze contributions from two sources (emitters) on two pollutants. The goal is to determine the emission ratios of the pollutants for both emitters and how the contributions vary over time.
A simple scenario is described as follows:
- Emitter A releases both Pollutant 1 and Pollutant 2.
- Emitter B releases only Pollutant 1.
- The measurements of Pollutant 1 and Pollutant 2 in the atmosphere over time is avaiable.
We can express this relationship as:
Conc =Emitters\ Profiles × Emitters\ Time\ Series
To solve this, I have written the following Stan code:
stan_pmf_code = '''
data {
int<lower=1> I; // Number of samples
int<lower=1> J; // Number of species (set to 2 for two pollutants)
int<lower=1> K; // Number of factors (set to 2 for two factors)
matrix<lower=0>[I, J] X; // Concentration matrix
matrix<lower=0>[I, J] Sigma; // Uncertainty matrix
}
parameters {
matrix<lower=0>[I, K] G; // Scores (time series of each factor)
real<lower=0> a; // Proportion of Pollutant 1 attributed to Factor 1
}
transformed parameters {
matrix<lower=0>[K, J] F; // Loadings (factor profiles)
// Populate the F matrix according to specific constraints
F[1, 1] = a; // Factor 1 affects Pollutant 1
F[1, 2] = 1 - a; // Factor 1 affects Pollutant 2
F[2, 1] = 1; // Factor 2 affects Pollutant 1
F[2, 2] = 0; // Factor 2 has no effect on Pollutant 2 (prior knowledge)
}
model {
// Priors
a ~ uniform(0, 1); // Uniform prior for a
// Likelihood
for (i in 1:I) {
for (j in 1:J) {
X[i, j] ~ normal(dot_product(row(G, i), col(F, j)), sqrt(Sigma[i, j]));
}
}
}
'''
Issue
What I personally assumed was an extreme case where the model would fit by searching the value of a
toward zero, resulting in the first factor only explaining Pollutant 2 and the second factor only explaining Pollutant 1. However, the model consistently provides non-zero values for a
. When I compare the reproduced concentration matrix to this extreme case (which could achieve a perfect match), the model’s fit does not seem to align with my expectations.
So, I am confused why the model could not search out for the extreme case, or the resulting solution was actually better mathmetically since the 2 factors I set will be considered both important to the 2 pollutants?
Any insights or suggestions from the Stan community would be greatly appreciated. Thank you!