Hi,
I am trying to model Ozone concentrations by combining predictions from multiple cheap sensors. In particular, I have training data recorded by 6 cheap Ozone sensors and 1 reference-grade Ozone instrument that were all co-located. I want to fit a model that I can use to get more accurate Ozone estimates from future field deployments where I don’t have a reference-grade instrument.
The relationship between the reference-grade measurements and each cheap sensor can be modelled linearly, with corrections for temperature (T) and relative humidity (RH):
\text{Ozone}_\text{reference} = \alpha + \beta_1 \, \text{Ozone}_\text{cheap sensor} + \beta_2 T + \beta_3 RH + \epsilon
As a first pass, I fitted 6 independent linear regression models (one per sensor) and averaged their outputs to get an overall estimate. However, I thought I could get a more accurate estimate from a hierarchical model, so I fitted the model below in Stan, which gives each sensor its own intercept and coefficients drawn from a shared group-level distribution.
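In other words, for row n recorded by sensor j[n]:
\alpha_j \sim \mathcal{N}(\mu_\alpha, \sigma_\alpha), \quad \beta_j \sim \mathcal{N}(\mu_\beta, \sigma_\beta)
\text{Ozone}_{\text{reference},n} \sim \mathcal{N}(\alpha_{j[n]} + x_n \beta_{j[n]}, \sigma)
where x_n is the row vector of predictors (cheap-sensor Ozone, T, RH).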
I had to reshape my data to long format, so if there are M timepoints in the training data there are N = 6M rows in the input to Stan, with each \text{Ozone}_\text{reference} value duplicated 6 times.
However, I’m having problems working out how to use this model to predict future values. I’m not sure if this is a statistical issue, or a programming one…
Say I get 6 new cheap-sensor readings for 1 timepoint (plus T and RH). In theory I could use these to get 6 separate predictions of the actual Ozone concentration, which can then be averaged. This would be just like my 6-separate-regressions approach, but hopefully the partial pooling makes it a bit more accurate.
Is there a more principled way of combining these predictions?
Secondly, how do I actually generate predictions programmatically? The docs state that the preferred way is to use generated quantities when fitting the model, but I’d like to be able to obtain predictions on-line, as new field data arrive.
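For concreteness, this is the kind of thing I have in mind: a standalone generated quantities program run against the saved fit (via CmdStan’s generate_quantities method or rstan::gqs), with the new readings passed in as a hypothetical x_new array. I believe the parameters block has to match the fitted model exactly. Sketch only:

data {
  int<lower=0> J;
  int<lower=0> n_terms;
  row_vector[n_terms] x_new[J];  // new (cheap O3, T, RH) reading for each sensor
}
parameters {
  // must be identical to the fitted model's parameters block
  real alpha[J];
  vector[n_terms] beta[J];
  real mu_alpha;
  real<lower=0> sigma_alpha;
  real mu_beta[n_terms];
  real<lower=0> sigma_beta[n_terms];
  real<lower=0> sigma_sensors;
}
generated quantities {
  vector[J] O3_pred_sensor;  // one posterior predictive draw per sensor
  real O3_pred;              // naive average across sensors
  for (j in 1:J)
    O3_pred_sensor[j] = normal_rng(alpha[j] + x_new[j] * beta[j], sigma_sensors);
  O3_pred = mean(O3_pred_sensor);
}

Averaging inside generated quantities like this propagates the parameter uncertainty, but I don’t know whether a plain mean across sensors is the most principled combination (hence my first question).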
I’ve also got a modelling question: what would be a good starting point for placing a hierarchical prior on the observation-level noise scale, sigma_sensors? This is currently a single parameter shared by all sensors, but ideally it would vary by sensor, since each sensor will have different error characteristics; I just don’t know what a suitable uninformative group-level prior would be.
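One option I’ve considered (sketch only, with a hypothetical hyperparameter tau_sigma) is giving each sensor its own noise scale tied together by a shared hyperprior:

parameters {
  vector<lower=0>[J] sigma_sensors;  // per-sensor noise scale
  real<lower=0> tau_sigma;           // hypothetical group-level scale
  // ... other parameters as before
}
model {
  tau_sigma ~ exponential(1);
  sigma_sensors ~ exponential(tau_sigma);  // or lognormal?
  // ... rest of the model as before, with the likelihood becoming:
  O3 ~ normal(linear_pred, sigma_sensors[sensor]);  // per-row scale via multi-indexing
}

but I have no feel for whether an exponential hyperprior on the scales is sensible here.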
Finally, are there any obvious optimizations? The model takes a while to fit (about 8 hours when N = 13,000), and I think I’ve vectorized as much as I can. One idea I haven’t tried yet is sketched after the model code below.
data {
  int<lower=0> N;                   // rows in long format (6 per timepoint)
  int<lower=0> J;                   // number of sensors (6)
  int<lower=0> n_terms;             // predictors per row (cheap O3, T, RH)
  int<lower=1, upper=J> sensor[N];  // sensor index for each row
  row_vector[n_terms] x[N];         // predictors for each row
  real O3[N];                       // reference-grade Ozone (duplicated per sensor)
}
parameters {
  real alpha[J];                    // per-sensor intercepts
  vector[n_terms] beta[J];          // per-sensor coefficients
  real mu_alpha;
  real<lower=0> sigma_alpha;
  real mu_beta[n_terms];
  real<lower=0> sigma_beta[n_terms];
  real<lower=0> sigma_sensors;      // observation noise, shared across sensors
}
model {
  vector[N] linear_pred;

  // Group-level priors
  mu_alpha ~ normal(0, 2.5);
  sigma_alpha ~ exponential(1);
  mu_beta ~ normal(0, 2.5);
  sigma_beta ~ exponential(1);

  // Observation-level prior
  // TODO: ideally this would also have a group-level prior,
  // but what would be an appropriate distribution?
  sigma_sensors ~ exponential(1);

  // Per-sensor parameters drawn from the group-level distributions
  alpha ~ normal(mu_alpha, sigma_alpha);   // intercepts
  for (j in 1:J)
    beta[j] ~ normal(mu_beta, sigma_beta); // coefficients

  // Likelihood
  for (n in 1:N)
    linear_pred[n] = alpha[sensor[n]] + x[n] * beta[sensor[n]];
  O3 ~ normal(linear_pred, sigma_sensors);
}
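(The optimization idea mentioned above: I’ve wondered whether a non-centered parameterisation of the per-sensor parameters would help with the fitting time. For alpha it would look something like the sketch below, with beta handled analogously; I haven’t tried it yet.)

parameters {
  vector[J] alpha_raw;  // standard-normal innovations
  // ... other parameters as before
}
transformed parameters {
  // implies alpha[j] ~ normal(mu_alpha, sigma_alpha)
  vector[J] alpha = mu_alpha + sigma_alpha * alpha_raw;
}
model {
  alpha_raw ~ std_normal();
  // ... rest of the model as before, minus the explicit alpha ~ normal(...) statement
}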