Prediction from model with repeated measurements

Hi,

I am trying to model Ozone concentrations by combining predictions from multiple cheap sensors. In particular, I have training data recorded by 6 cheap Ozone sensors and 1 reference-grade Ozone instrument that were all placed together. I want to fit a model that I can use to get more accurate Ozone estimates from future field deployments where I don’t have a reference-grade instrument.

The relationship between the reference-grade measurements and the cheap sensors can be modelled linearly with corrections for temperature (T) and relative humidity (RH):

\text{Ozone}_\text{reference} = \alpha + \beta_1 \, \text{Ozone}_\text{cheap sensor} + \beta_2 \, T + \beta_3 \, \text{RH} + \varepsilon

I first fitted 6 independent linear regression models (1 per sensor) and averaged their outputs to get an overall estimate. However, I thought I could get a more accurate estimate with a hierarchical model.
I fitted the model below in Stan, which gives separate intercepts and coefficients for each sensor that are drawn from an overall distribution.
I had to make my data long, so if there are M timepoints in my training data there are N = 6M rows in my input to Stan, with each \text{Ozone}_\text{reference} duplicated 6 times.

However, I’m having problems working out how to use this model to predict future values. I’m not sure if this is a statistical issue, or a programming one…

Say I get 6 new cheap sensor readings for 1 timepoint (+ T & RH). In theory I could use these to get 6 separate predictions of the actual Ozone concentration, which can then be averaged. This would be just like my 6-separate-regressions approach, but hopefully the partial pooling will make it a bit more accurate.
Is there a more principled way of combining these predictions?

Secondly, how do I actually generate predictions programmatically? The docs state that the preferred way is to use generated quantities when fitting the model, but I’d like to be able to obtain predictions online.

I’ve also got a modelling question: what is a good starting point for placing a hierarchical prior on the observation-level noise, sigma_sensors? This is currently a single parameter, but ideally it would vary by sensor, since each sensor will have different error characteristics; however, I don’t know what a suitable uninformative prior would be.

Finally, are there any obvious optimizations? It takes a while to fit (8 hours when N=13 000). I think I’ve vectorized as much as I can.

data {
  int<lower=0> N;                    // number of rows (timepoints x sensors)
  int<lower=0> J;                    // number of sensors (here 6)
  int<lower=0> n_terms;              // number of covariates (cheap O3, T, RH)
  int<lower=1, upper=J> sensor[N];   // sensor index for each row
  row_vector[n_terms] x[N];          // covariates for each row
  real O3[N];                        // reference-grade Ozone (duplicated across sensors)
}
parameters {
  real alpha[J];
  vector[n_terms] beta[J];
  
  real mu_alpha;
  real<lower=0> sigma_alpha;
  real mu_beta[n_terms];
  real<lower=0> sigma_beta[n_terms];
  real<lower=0> sigma_sensors;
}
model {
  vector[N] linear_pred;
  
  // Group level priors
  mu_alpha ~ normal(0, 2.5);
  sigma_alpha ~ exponential(1);
  mu_beta ~ normal(0, 2.5);
  sigma_beta ~ exponential(1);
  
  // Observation level priors
  // TODO: ideally this would also have a group-level prior, 
  // but what would be an appropriate distribution?
  sigma_sensors ~ exponential(1);
    
  for (j in 1:J)
    beta[j] ~ normal(mu_beta, sigma_beta);   // Coefficients
  alpha ~ normal(mu_alpha, sigma_alpha);    // Intercept
    
  for (n in 1:N) {
    linear_pred[n] = alpha[sensor[n]] + x[n] * beta[sensor[n]];
  }
    
  O3 ~ normal(linear_pred, sigma_sensors);
}

If you’re deploying six different sensors drawn from the same population, you’ve got to treat their parameters as new draws from that population.

If each sensor has its own \alpha_j and you have a hierarchical prior \alpha_j \sim N(\mu_\alpha, \sigma_\alpha), you have estimates for \alpha_1, \ldots, \alpha_6 specifically, but if you get more sensors and don’t have any calibration data for them, then you’d just assume they’re new draws from the population N(\mu_\alpha, \sigma_\alpha).

So generating predictions would be something like:

for each draw i in posterior:
  draw 6 new alphas (and betas) from the i-th draw of mu_alpha, sigma_alpha, mu_beta, sigma_beta
  generate new predicted measurements from these new alphas, betas + covariates (and sigma_sensors)

That make sense?

I completely follow that from a theoretical standpoint, but how do you actually perform those steps in Stan, particularly the "generate new predictions" one?

Do you have any idea how you could combine the 6 predictions too?

Given a new set of covariate observations, x_new, and the posterior draws for mu_alpha and mu_beta, the most sensible prediction would be:

pred = mu_alpha + x_new * mu_beta
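
For what it’s worth, once the draws are extracted into R, that could look something like this (the draw objects and covariate values below are just illustrative):

# mu_alpha_d: vector with one value per draw; mu_beta_d: draws x n_terms matrix
# (illustrative names, e.g. from rstan::extract() or posterior::as_draws_matrix()).
x_new <- c(41.2, 18.5, 0.62)                          # made-up covariates: cheap O3, T, RH
pred  <- mu_alpha_d + as.matrix(mu_beta_d) %*% x_new  # one prediction per posterior draw
mean(pred)                                            # point estimate
quantile(pred, c(0.05, 0.95))                         # credible interval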

On getting better ESS/s, how many iterations are you running for? Presumably no divergences?

Assuming you have K new sensors and N_new planned measurements, and you’ve added K, N_new, sensor_new, and x_new to the data block to describe the measurements you expect to take (that is, which sensor is going to take which measurement and the covariates attached to each measurement), then:

generated quantities {
  vector[n_terms] beta_new[K];
  vector[K] alpha_new;
  vector[N_new] O3_new;
  for (k in 1:K) {
    beta_new[k] = to_vector(normal_rng(mu_beta, sigma_beta));  // coefficients for new sensor k
    alpha_new[k] = normal_rng(mu_alpha, sigma_alpha);          // intercept for new sensor k
  }

  for (n in 1:N_new) {
    O3_new[n] = normal_rng(alpha_new[sensor_new[n]] + x_new[n] * beta_new[sensor_new[n]], sigma_sensors);
  }
}

Something like that. I probably made a couple bugs for you so watch out :D!

This seems to be doing a full generative simulation of a new sensor: its sensor-level parameters deviate from the mean-of-sensors with the same variability seen in the calibration experiment, and its observations have the same noise observed there. For the OP’s use case, applying what they’ve learned from the calibration experiment to new sensors going forward, they should simply use the parameters reflecting the mean-of-sensors, right?

If it’s not the same exact sensors, then I think they have to take new sensor parameters from the population.

So if it were the same 6, then yeah, that would make sense, but it’s 6 new random ones I think, and you wanna respect the variation in new sensors you might get. If you take the 6 you already had, then that’ll underestimate the variation.

Lemme know if that still sounds weird. I think that’s how it works.

I might be overconfident in discerning the intent of the prediction, and/or under-imaginative, but I have a hard time imagining what purpose is served by a version that randomly shifts the mean prediction away from the one derived from the parameters reflecting the mean-across-sensors. Certainly, if the intent is to minimize point-prediction error for a new sensor, then the mean-across-sensors must be used to generate the point prediction. If one wants an uncertainty interval around said prediction, then it would be important to incorporate both the learned measurement noise (sigma_sensors) and the learned magnitude by which individual sensors deviate from the mean-across-sensors (sigma_alpha & sigma_beta).

Oh! I bet what we’re saying is mathematically equivalent. To simplify (because I’m having trouble thinking through the full hierarchical model with covariates), let’s consider a simpler hierarchical model with just intercepts. I was going to propose something like:

x = rnorm(
  1
  , mean_across_sensors
  , sqrt(sd_among_sensors^2 + measurement_noise^2)
)

While your proposal is (if I have it correctly):

z = rnorm(1,mean_across_sensors,sd_among_sensors)
x = rnorm(1, z, measurement_noise)

I’m embarrassingly bad at math sometimes, but I think those are equivalent in expectation. Both should yield a mean prediction that matches the mean_across_sensors, and both should yield a sd of predictions of the same magnitude, right?

If so, sorry for the diversion, but this has been helpful for me in thinking about this (I also now see that your method generalizes really easily, while my reliance on summing variances depends on a specific model structure; hence my difficulty thinking through how to apply my approach when the model has covariates). I think I was stuck erroneously thinking that, for a given new sensor and posterior samples from the model, one would obtain a single sample z and then get lots of values for x, in which case the deviation from the mean imposed by sampling z would persist. But of course we sample z for every sample from the posterior, so that bias averages out and all that remains is the proper propagation of uncertainty.

z = rnorm(1,mean_across_sensors,sd_among_sensors)
x = rnorm(1, z, measurement_noise)

Looks right! And yeah the xs here will have the same distribution.
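
As a quick sanity check (purely illustrative numbers), simulating both versions shows they give the same marginal distribution:

set.seed(1)
n_draws <- 1e6
mean_across_sensors <- 10    # made-up values
sd_among_sensors    <- 2
measurement_noise   <- 1

# collapsed version: one normal with the variances summed
x1 <- rnorm(n_draws, mean_across_sensors, sqrt(sd_among_sensors^2 + measurement_noise^2))

# hierarchical version: draw a sensor-level mean, then a noisy observation around it
z  <- rnorm(n_draws, mean_across_sensors, sd_among_sensors)
x2 <- rnorm(n_draws, z, measurement_noise)

c(mean(x1), mean(x2))   # both ~10
c(sd(x1), sd(x2))       # both ~sqrt(2^2 + 1^2) ~= 2.24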

In this case the covariates are the noisy sensor measurements, and we’re using them to predict the reference-grade measurement.

We have direct estimates for the parameters of the six calibration sensors. For any other sensor in the population, the best guess we can make is that its parameters come from the hierarchical distributions.

Something like that. Hope it works lol.

Thanks for the discussion guys, it’s been really useful to consider these different aspects that I hadn’t thought about before!

So to summarize then, if I have some new sensors I draw the sensor-level parameters from my posterior of mu_alpha, mu_beta, sigma_alpha, and sigma_beta and then just draw the predictions from these equations. I don’t need to refit my model with a generated quantities block.

z = rnorm(1,mean_across_sensors,sd_among_sensors)
x = rnorm(1, z, measurement_noise)
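
Concretely, I guess doing that by hand in R would look something like this (assuming I’ve extracted the draws; all object names below are just placeholders):

# mu_alpha_d, sigma_alpha_d, sigma_sensors_d: one value per posterior draw;
# mu_beta_d, sigma_beta_d: n_draws x n_terms matrices; x_new: covariates for one new timepoint.
n_terms <- length(x_new)
n_draws <- length(mu_alpha_d)
O3_pred <- numeric(n_draws)
for (i in 1:n_draws) {
  alpha_new <- rnorm(1, mu_alpha_d[i], sigma_alpha_d[i])          # new sensor's intercept
  beta_new  <- rnorm(n_terms, mu_beta_d[i, ], sigma_beta_d[i, ])  # new sensor's coefficients
  O3_pred[i] <- rnorm(1, alpha_new + sum(beta_new * x_new), sigma_sensors_d[i])
}
quantile(O3_pred, c(0.05, 0.5, 0.95))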

What about if I want to use my original 6 sensors on some new data, do I then need to refit my model using generated quantities each time I want to make a prediction? That’s the impression I got from your second message @bbbales2 but I could be misunderstanding something!

Finally, if I have multiple sensors at the same recording location (doesn’t matter if they are part of the original 6 or new sensors) and I want to combine their predictions - is there an obvious way of doing that rather than just taking the mean? I.e. is there a Bayesian weighting approach?

If I follow and you just want prediction plus uncertainty, then no. You could extract the draws and compute things by hand in R. But note that at least cmdstanr has a method to just run the generated quantities section of a fitted model object with new data. So it’s probably easiest to put the prediction stuff in the model’s generated quantities, fit it, save the fit object, then load it and use that method whenever you want to get predictions for new sensors.
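
A rough sketch of that workflow (file and object names here are illustrative, and the model’s data block is assumed to declare the new-deployment variables used in generated quantities):

library(cmdstanr)

# Compile and fit once; the prediction code lives in generated quantities.
mod <- cmdstan_model("ozone_hierarchical.stan")
fit <- mod$sample(data = stan_data)   # stan_data: every entry declared in the data block
fit$save_object("ozone_fit.rds")

# Later (in a new session, re-run cmdstan_model() first; it reuses the cached executable):
# reload the fit and rerun only the generated quantities with the new-deployment
# entries swapped into the data list.
fit <- readRDS("ozone_fit.rds")
stan_data$K          <- 6             # number of new sensors
stan_data$sensor_new <- 1:6           # which (new) sensor takes each measurement
stan_data$x_new      <- x_new         # new covariates, N_new x n_terms
stan_data$N_new      <- nrow(x_new)
preds <- mod$generate_quantities(fitted_params = fit, data = stan_data)
preds$draws("O3_new")                 # posterior predictive draws for the new measurements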

If you don’t have any reason a priori to expect different magnitudes of measurement noise from each sensor, then the simple mean should be the best way to collapse the predictions. Doing this in the generated quantities (just add the mean part after the code @bbales provided) will be easiest and ensure uncertainty propagates properly.
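
For example, if you’ve pulled the per-sensor posterior predictive draws for one timepoint into an n_draws x K matrix (illustrative name below), averaging within each draw keeps the uncertainty propagation intact:

# O3_new_draws: n_draws x K matrix, one column per sensor at the same timepoint
O3_combined <- rowMeans(O3_new_draws)   # average the K sensors within each posterior draw
mean(O3_combined)                       # combined point estimate
quantile(O3_combined, c(0.05, 0.95))    # interval that still reflects all uncertainty sources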

Motivation for some kind of differential weighting of the sensors would derive from an expectation of differential measurement noise, in which case you should include that in your model. To do this, you’d model sigma_sensors hierarchically (usually on the log scale) in the same way you model beta hierarchically. I’m having trouble thinking through what to do in the generated quantities part, though; indeed, I can’t decide whether anything there would need to differ from the just-compute-a-mean approach that’s definitely OK when there’s no noise heterogeneity. Any thoughts @bbales?

I’m not sure about taking the mean, because then your uncertainties will go down. I’m not sure that having two of the same type of sensor measuring in the same place really gives you more information unless you make more assumptions. Like, you could put 1000 sensors there; it doesn’t seem like we’d really know that much more about the actual thing you’re measuring if the variation could just come from having a ton of different sensor values there. Not sure. I think you’d have to build this into your original model somehow, like predicting your high-precision measurement as a function of all the N sensors you had in one spot:

y_high_res ~ normal(a1 * sensor1 + a2 * sensor2 + ... + intercept, sigma);

And then if you assumed out of the box that a1 = a2 = a3 = ... you could do something like:

y_high_res ~ normal(a * mean(sensors) + intercept, sigma);

And it’s a different thing, but the mean shows up there.

If you’re wanting to get into evaluating your predictions, then the place to start is probably Model selection tutorials and talks (the FAQ link at the top especially)

Ah right, so effectively optimising weights for a weighted average?

I do see what you’re saying about how adding more sensors doesn’t necessarily provide more knowledge about the underlying process, but I feel intuitively that it should have some benefit.

I’ve found a fair bit of literature on sensor fusion using Kalman filters, which might be more applicable for my use case where the model is relatively straightforward and I want online processing, but I think I’d have to lose some of the hierarchical structure.

To do this, you’d model sigma_sensors hierarchically (usually on the log scale) in the same way you model beta hierarchically.

Something like moving the likelihood into the for loop with

O3[n] ~ normal(linear_pred[n], sigma_sensors[sensor[n]]);

And declaring sigma_sensors as a vector of length J, with a group-level prior that itself has a log-normal prior?

sigma_sensors ~ exponential(group_sigma);
group_sigma ~ lognormal(0, 2.5);

Yup!
