Updating model with new data

Hi everyone,

I was just wondering if it’s possible to update a model when new data arrives. Suppose I have some data X and fit a model to obtain posterior estimates for some parameters, and then I get some new data X’. Rather than running the whole model again on all the available data X+X’, is it possible to just update the model with the new data X’ to get updated posterior estimates for the parameters?

Just thinking that this could save time if I didn’t need to re-run the whole model.

Might just be a stupid question, but I’m curious.

Thanks!
Ryan

Yes, if you can obtain a parametric representation for the posterior conditional on X.

Thanks for the reply Ben!

So if I had a stanfit object called stanfit_1 from fitting the model to X, how do I obtain the updated model stanfit_2 with data X+X’ from this?

You look at the posterior distribution of the parameters conditional on X, see what parametric distribution it is closest to, and use that as your prior when conditioning on X’ to obtain the new posterior distribution.

Can you explain how I can determine, in Stan and RStan, which distribution the posterior distribution of the parameters is closest to, please?

If it is anything other than multivariate normal, it is pretty difficult. But if it is multivariate normal, then you just need to estimate the mean vector and the covariance matrix (presumably with some regularization).
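
For example, something along these lines in R, using the draws from stanfit_1 (just a sketch; the parameter names alpha and beta and the small ridge term added for regularization are placeholders):

library(rstan)

# draws from the first fit, as an iterations x parameters matrix
draws <- as.matrix(stanfit_1, pars = c("alpha", "beta"))

post_mean <- colMeans(draws)   # estimated posterior mean vector
post_cov  <- cov(draws)        # estimated posterior covariance matrix

# a little regularization so the precision matrix is well conditioned
post_cov  <- post_cov + diag(1e-6, nrow(post_cov))
post_prec <- solve(post_cov)   # estimated posterior precision matrix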

The task is usually to find a good transformation that makes the parameters approximately multivariate normal (ideally uncorrelated). This is a good exercise to do for a given model anyway (if possible), and as a bonus you usually get a model that samples better.
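
As a rough sketch of that idea in R, with a hypothetical positive parameter sigma (positive parameters are usually closer to normal on the log scale):

draws <- as.matrix(stanfit_1, pars = c("alpha", "sigma"))
draws[, "sigma"] <- log(draws[, "sigma"])  # transform before fitting the normal

post_mean <- colMeans(draws)
post_prec <- solve(cov(draws))
# the multivariate normal prior in the next fit then applies to log(sigma),
# not to sigma itself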

This makes sense, but I’m just not 100% clear on how to implement it in practice.

I’ve implemented this football/soccer model by Baio and Blangiardo (http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.182.8659&rep=rep1&type=pdf) and my current Stan code is:

data {
  int nteams;
  int ngames;
  int home_team[ngames];
  int away_team[ngames];
  int<lower=0> home_goals[ngames];
  int<lower=0> away_goals[ngames];
}
parameters {
  real home;
  real mu_att;
  real mu_def;
  real tau_att;
  real tau_def;

  vector[nteams-1] att_free;
  vector[nteams-1] def_free;
}
transformed parameters {
  vector[nteams] att;
  vector[nteams] def;
  vector[ngames] log_theta_home;
  vector[ngames] log_theta_away;

  // need to make sum(att)=sum(def)=0
  for (k in 1:(nteams-1)) {
    att[k] = att_free[k];
    def[k] = def_free[k];
  }
  att[nteams] = -sum(att_free);
  def[nteams] = -sum(def_free);

  log_theta_home = home + att[home_team] + def[away_team];
  log_theta_away = att[away_team] + def[home_team];
}
model {
  home ~ normal(0, 10000);
  mu_att ~ normal(0, 10000);
  mu_def ~ normal(0, 10000);
  tau_att ~ gamma(0.1, 0.1);
  tau_def ~ gamma(0.1, 0.1);

  att_free ~ normal(mu_att, 1/tau_att);
  def_free ~ normal(mu_def, 1/tau_def);

  home_goals ~ poisson_log(log_theta_home);
  away_goals ~ poisson_log(log_theta_away);
}

The reason I want to update the model is that, when testing its performance, I’ve just kept adding to the data and re-running this whole model on the full dataset as more games come in.

The parameters I’m specifically interested in are att, def and home.

If the posterior distribution of the parameters after seeing a first set of games X is approximately multivariate normal, how do I adjust this Stan code so that I can obtain the new posterior distribution by running it only on the next set of games X’? So far, I’ve just been re-running the code on X+X’ when new games come in.

Pass in the posterior mean vector and the posterior precision matrix as data. Eliminate all the old priors and use the multivariate normal instead.

Except your original posterior is going to be messed up because you did not constrain tau_att and tau_def to be positive. It would probably be better to declare them in log form in the parameters block and then exponentiate them in the transformed parameters block. And use sensible priors.

data {
  int nteams;
  int ngames;
  int home_team[ngames];
  int away_team[ngames];
  int<lower=0> home_goals[ngames];
  int<lower=0> away_goals[ngames];
  
  vector[5 * nteams - 2] mu;
  cov_matrix[rows(mu)] precision;
}
parameters {
  real home;
  real mu_att;
  real mu_def;
  real tau_att;
  real tau_def;

  vector[nteams-1] att_free;
  vector[nteams-1] def_free;
}
transformed parameters {
  vector[nteams] att;
  vector[nteams] def;
  vector[ngames] log_theta_home;
  vector[ngames] log_theta_away;

  // need to make sum(att)=sum(def)=0
  for (k in 1:(nteams-1)) {
    att[k] = att_free[k];
    def[k] = def_free[k];
  }
  att[nteams] = -sum(att_free);
  def[nteams] = -sum(def_free);

  log_theta_home = home + att[home_team] + def[away_team];
  log_theta_away = att[away_team] + def[home_team];
}
model {
  vector[5 + 2 * nteams - 2] theta = append_row([home, mu_att, mu_def, tau_att, tau_def]',
                                                append_row(att_free, def_free));
  theta ~ multi_normal_prec(mu, precision);
  home_goals ~ poisson_log(log_theta_home);
  away_goals ~ poisson_log(log_theta_away);
}
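
On the R side, the whole workflow could look roughly like this (an untested sketch; the file names, the data lists data_X and data_Xprime, and the small ridge term are placeholders, and it assumes mu is declared with the same length as theta, as discussed below):

library(rstan)

# 1. fit the original model to the first batch of games X
stanfit_1 <- stan("original_model.stan", data = data_X)

# 2. summarize its posterior as a multivariate normal
pars  <- c("home", "mu_att", "mu_def", "tau_att", "tau_def", "att_free", "def_free")
draws <- as.matrix(stanfit_1, pars = pars)
# (if tau_att / tau_def are re-declared on the log scale as suggested above,
#  log the corresponding columns here first)

mu        <- colMeans(draws)
Sigma     <- cov(draws) + diag(1e-6, ncol(draws))  # small ridge for stability
precision <- solve(Sigma)

# 3. fit the updated model to the new games X' only, using the old posterior as prior
data_Xprime$mu        <- mu
data_Xprime$precision <- precision
stanfit_2 <- stan("updated_model.stan", data = data_Xprime)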

Thanks Ben, I appreciate your patience. This is a great help!

Shouldn’t the dimensions of these two vectors be the same? Is it meant to be vector[5 + 2 * nteams - 2] for both mu and theta?

Yes, it should be vector[5 + 2 * nteams - 2] for both mu and theta.