I’ve been a long-time user of Stan and I admire what you’ve built. That said, this is less a concrete question than a shot in the dark.

At work, I am concerned about data shift: is the data we are seeing in production similar to what the model saw during training? The frequentist approach that I have been using is the following:

1. Running a two-sample test (e.g., Kolmogorov–Smirnov) and checking the p-value.
2. Computing the Jensen–Shannon divergence (or Wasserstein distance, or K-L divergence).
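For concreteness, here is a minimal sketch of these frequentist checks on a single one-dimensional feature, using scipy; the data and distributional shift below are invented for illustration:

```python
# Illustrative data: one training feature vs. the "same" feature in production.
import numpy as np
from scipy.stats import ks_2samp, wasserstein_distance
from scipy.spatial.distance import jensenshannon

rng = np.random.default_rng(0)
train = rng.normal(0.0, 1.0, size=5000)
prod = rng.normal(0.3, 1.1, size=5000)   # mildly shifted production batch

# Two-sample KS test: distance statistic plus the p-value
stat, pval = ks_2samp(train, prod)

# Jensen-Shannon needs discretized densities on a shared grid;
# jensenshannon() returns the *distance* (sqrt of the divergence), in [0, 1]
bins = np.histogram_bin_edges(np.concatenate([train, prod]), bins=50)
p, _ = np.histogram(train, bins=bins, density=True)
q, _ = np.histogram(prod, bins=bins, density=True)
jsd = jensenshannon(p, q)

# Wasserstein (earth-mover) distance works directly on the raw samples
wd = wasserstein_distance(train, prod)
print(f"KS={stat:.3f} (p={pval:.3g}), JS distance={jsd:.3f}, W1={wd:.3f}")
```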

I don’t like (1) because it conflates statistical significance with practical significance. I don’t like (2) because I lose any uncertainty estimate. In other domains, I’d normally solve for these inconsistencies using Bayes.

I guess my question is the following: Do any of you know of a Bayesian approach to the above problem?

Just a thought from a fellow interested discourse attendee: I think if you can specify a quantitative metric and a criterion that represents the practical significance you mention, the Bayesian sandbox is your universe. With such a criterion, you calculate the probability that your metric of interest is above/below the criterion, averaging over the posterior (thus accounting for uncertainty). The challenge is, of course, to come up with a metric and a criterion that you and others can agree on as the right representation of “practical significance”.
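To make that concrete, here is a toy sketch of the decision rule. The posterior draws and the 0.2 criterion are invented for illustration; in practice the draws would come from a fitted model of whatever drift metric you settle on:

```python
# Pretend these are posterior draws of a drift metric (here, a mean shift
# in one feature) from a fitted Bayesian model -- they are faked below.
import numpy as np

rng = np.random.default_rng(1)
posterior_shift = rng.normal(0.25, 0.05, size=4000)

criterion = 0.2  # agreed-upon boundary for "practically significant" drift
# Probability of practically significant drift, averaged over the posterior
prob_practical_drift = np.mean(np.abs(posterior_shift) > criterion)
print(f"P(|shift| > {criterion}) = {prob_practical_drift:.2f}")
```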

Given that you’re talking about data that a model saw in training, it seems like the crucial test of similarity is whether the model’s fit to the new data is of similar quality to the model’s fit to the training data. This is something that you can calculate directly, and I think it might usefully focus the question of what is practically “as good”.

One design decision to make is whether you want to compare the model’s fit to the new data with the actual realized fit to the training data, or whether you want to compare the model’s fit to the new data with the expected fit to hold-out sets from the training data (i.e. to the expected predictive density under cross-validation based on the training data).
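As a sketch of those two comparison targets, assuming a toy Gaussian “model” and made-up data (a real model would substitute its own log predictive density):

```python
# Compare (a) realized fit to the training data, (b) expected hold-out fit
# via K-fold cross-validation, and the fit to the new data.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
train = rng.normal(0.0, 1.0, 1000)
new = rng.normal(0.8, 1.0, 1000)   # drifted new batch (invented)

def fit(x):
    """'Fit' the toy model: just a Gaussian mean and sd."""
    return x.mean(), x.std(ddof=1)

def mean_lpd(x, mu, sigma):
    """Mean log predictive density of x under N(mu, sigma)."""
    return norm.logpdf(x, mu, sigma).mean()

mu, sigma = fit(train)
realized_fit = mean_lpd(train, mu, sigma)       # (a) in-sample fit

folds = np.array_split(rng.permutation(train), 5)
cv_fit = np.mean([
    mean_lpd(folds[k], *fit(np.concatenate(folds[:k] + folds[k + 1:])))
    for k in range(5)
])                                              # (b) expected hold-out fit

new_fit = mean_lpd(new, mu, sigma)              # fit to the new data
print(f"train={realized_fit:.3f}, cv={cv_fit:.3f}, new={new_fit:.3f}")
```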

Hi, @jsocolar ! That would be a solution if I could observe the thing I am predicting. Many times, I cannot; many times, I have to wait months to observe the outcome for my production predictions. Thus, I use data drift as a leading indicator of model decay: if the data is radically different from what I had during training, chances are that the model will underperform.

What data do you have that might drift, and what data do you not have (that you are unable to observe and therefore unable to evaluate the quality of the model predictions)?

I’m trying to read between the lines here a bit: is the situation that you observe the new independent variables, but you don’t have access to the new responses, and you want to know whether the new independent variables are distributed in a way that is similar to the old ones?

In this case, I think you need to answer the following question: is there a reasonable generative model for the independent variables, such that you want to treat the independent variables as random? Or is it more appropriate to treat the independent variables as fixed?

Often, we treat the independent variables as fixed. Thus, there is no uncertainty in the similarity between your old independent variables and your new independent variables; there are just the two samples, which have some empirical similarity that you can calculate according to whatever measure you want, and there is no posterior uncertainty in this measure because the samples are what they are.

On the other hand, if you are interested in predicting how well the model will generalize to future data that is as-yet unseen, then you might want to do inference on how much the generative process for the independent variables has changed. This question is tractable because the “independent variables” are the “response” in this assumed generative process, so you can assess the quality of the fit of this model to the newly observed variables.
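As one hedged illustration of such inference, here is a conjugate-normal sketch where the variance is treated as known and the question is whether the mean of the X-generating process has shifted by more than some practically important delta. All numbers, and the choice of delta, are invented:

```python
# With a flat prior and known variance, the posterior for the new-process
# mean is Gaussian, so P(shift beyond delta) has a closed form.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)
x_old = rng.normal(0.0, 1.0, 2000)   # old independent variables
x_new = rng.normal(0.3, 1.0, 300)    # new batch from a shifted process

mu_old = x_old.mean()
sigma = x_old.std(ddof=1)            # treated as known, estimated from old data

# Posterior for the new-process mean: N(mean(x_new), sigma^2 / n_new)
post_mu = x_new.mean()
post_sd = sigma / np.sqrt(len(x_new))

delta = 0.1                          # practically important shift (assumption)
p_shift = (norm.sf(mu_old + delta, post_mu, post_sd)
           + norm.cdf(mu_old - delta, post_mu, post_sd))
print(f"P(|mu_new - mu_old| > {delta}) = {p_shift:.3f}")
```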

But I don’t think any of this properly answers your question, which is how to decide what type/magnitude of change in the distribution of the independent variables amounts to an “important” change. I think the key questions I might think to ask are:

What proportion of the new data (compared to the old data) falls into regions where there’s evidence that the fitted model for the outcome (i.e. the model using just the old data) is locally misspecified?

What proportion of the new data (compared to the old data) falls into regions where the old data are so sparse that you can’t evaluate whether the fitted model for the outcome is misspecified?
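The second of those questions can be approximated empirically. A rough sketch, where “sparse” is arbitrarily defined as fewer than k old-data neighbors within radius r; both thresholds, and the data, are stand-ins:

```python
# What share of new points land where the old data is sparse?
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(4)
x_old = rng.normal(0.0, 1.0, size=(2000, 2))   # old independent variables
x_new = rng.normal(2.5, 1.0, size=(500, 2))    # heavily shifted new batch

tree = cKDTree(x_old)
r, k = 0.5, 10                                 # arbitrary sparsity thresholds
counts = np.array([len(idx) for idx in tree.query_ball_point(x_new, r)])
frac_sparse = np.mean(counts < k)
print(f"{frac_sparse:.1%} of new points fall in sparse regions of the old data")
```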

Again, note that you can evaluate these questions empirically with no posterior uncertainty if you take the data as fixed and you’re interested in how confident you can be about model performance on the new data. On the other hand, if you are interested in how confident you can be about model performance on hypothetical new new data (i.e. data for which the independent variables still have not been collected), then you might want to know something about how much/rapidly/predictably the generative process for the independent variables is changing.

Maybe too obvious or straightforward, but how about the KS distance metric itself (i.e., the maximum difference in percentile points between the two distributions)? I agree that the associated p-value is not of interest, but maybe the KS distance itself is?
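For what it’s worth, that distance is easy to compute by hand as exactly the maximum gap between the two empirical CDFs (illustrative data; scipy’s ks_2samp returns the same statistic):

```python
# Two-sample KS distance as the maximum ECDF gap, computed directly.
import numpy as np

rng = np.random.default_rng(5)
a = rng.normal(0.0, 1.0, 3000)
b = rng.normal(0.3, 1.0, 3000)

grid = np.sort(np.concatenate([a, b]))
# ECDF of each sample evaluated on the pooled grid
ecdf_a = np.searchsorted(np.sort(a), grid, side="right") / len(a)
ecdf_b = np.searchsorted(np.sort(b), grid, side="right") / len(b)
ks = np.max(np.abs(ecdf_a - ecdf_b))
print(f"KS distance = {ks:.3f}")
```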