I’m looking for a way to evaluate a forecasting model with an “extrinsic” metric.
As part of my PhD, I developed three state-space models for forecasting stink bug populations in soybean crops, which a company now wants to use in production. They are interested in the models because, from the posterior predictive distribution, we can compute the probability of the stink bug population surpassing a given “economic threshold”, beyond which a pest control intervention is required to avoid economic loss.
So far, I’ve only evaluated my models using LOO-CV (although I should have used LFO-CV), which was enough to decide which model was best.
The thing is that this company wants to know “how useful is the model”. That makes a lot of sense, but I can’t answer properly: I lack the background, and my attempts to find an extrinsic metric for my Bayesian models (i.e., a metric that tells us how well the model will perform overall in a business context) have been unsuccessful.
So my questions are:
Do you know of an extrinsic metric that I could use in this case?
I wrote down an idea that I had for an extrinsic metric but I’m not sure if it’s a sound proposal. Could you please comment on it?
Extrinsic metric proposal
Given that this company wants a model that gives a “recommendation” on whether a pest control intervention should be carried out in a given week to keep the economic threshold from being surpassed the following week, the proposed extrinsic metric aims to measure how often a model makes a “good” recommendation.
In the metric, so far named “Threshold Surpassing Accuracy”, a recommendation is considered ‘good’ when the model suggested carrying out a pest control intervention at time t and stink bugs surpassed the economic threshold at t+1, or when the model did NOT suggest an intervention at time t and stink bugs did NOT surpass the economic threshold at t+1.
A recommendation is considered ‘bad’ when the opposite happens, i.e., the model recommends a control and the threshold is not surpassed, or the model does not recommend a control and the threshold is surpassed.
The model will only recommend a pest control intervention when the probability of the population density surpassing the economic threshold at t+1 is higher than 50% (or another percentage based on expert knowledge).
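To make the decision rule concrete, here is a minimal sketch of how it could be computed from posterior predictive draws for t+1. The function names, the `cutoff` parameter, and the example numbers are my own assumptions for illustration; they are not taken from the fitted models.

```python
from typing import Sequence


def exceedance_probability(draws: Sequence[float], threshold: float) -> float:
    """Share of posterior predictive draws for t+1 that lie strictly above the threshold."""
    if len(draws) == 0:
        raise ValueError("need at least one posterior predictive draw")
    return sum(1 for d in draws if d > threshold) / len(draws)


def recommend_control(draws: Sequence[float], threshold: float, cutoff: float = 0.5) -> bool:
    """Recommend an intervention when the exceedance probability is higher than the cutoff."""
    return exceedance_probability(draws, threshold) > cutoff


# Made-up draws of next week's population density (illustrative numbers only).
example_draws = [1.2, 2.5, 3.1, 0.8, 2.2]
p_exceed = exceedance_probability(example_draws, threshold=2.0)  # 3 of the 5 draws exceed 2.0
advice = recommend_control(example_draws, threshold=2.0)
```

Keeping `cutoff` as a parameter makes it easy to swap the 50% default for a value chosen by domain experts.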
A good recommendation gives a score of +1 and a bad recommendation gives a score of
The “Threshold Surpassing Accuracy” (TSA) metric is calculated as the total score divided by the total number of recommendations.
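The scoring itself could be sketched as below. The function name is hypothetical, and the score for a bad recommendation is deliberately a required argument, since that value still has to be chosen. If it is set to 0, TSA is the share of good recommendations (ordinary classification accuracy); if it is set to -1, TSA ranges from -1 to 1.

```python
from typing import Sequence


def threshold_surpassing_accuracy(
    recommended: Sequence[bool],
    surpassed: Sequence[bool],
    *,
    bad_score: float,
    good_score: float = 1.0,
) -> float:
    """Total score divided by the number of recommendations.

    recommended[i]: the model advised an intervention at time t.
    surpassed[i]:   the threshold was actually surpassed at t+1.
    A recommendation is 'good' when the two agree and 'bad' otherwise.
    """
    if len(recommended) != len(surpassed):
        raise ValueError("inputs must have the same length")
    if len(recommended) == 0:
        raise ValueError("need at least one recommendation")
    total = sum(
        good_score if rec == obs else bad_score
        for rec, obs in zip(recommended, surpassed)
    )
    return total / len(recommended)


# Illustrative data: two good and two bad recommendations.
recs = [True, False, True, False]
obs = [True, False, False, True]
tsa_zero = threshold_surpassing_accuracy(recs, obs, bad_score=0.0)
tsa_minus = threshold_surpassing_accuracy(recs, obs, bad_score=-1.0)
```

Note that this version weights a missed outbreak and an unnecessary intervention equally; unequal costs would need separate scores for the two kinds of bad recommendation.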
The TSA metric is computed in a similar way to the Leave-Future-Out Cross-Validation algorithm (arXiv:1902.06281, “Approximate leave-future-out cross-validation for Bayesian time series models”), i.e., the model is trained on the first L sequential data points, a forecast for L + 1 is made, and a score is given based on the recommendation (good/bad) and the observed outcome at L + 1 (threshold was surpassed or not). The process is repeated for every possible L and then the final score is computed.
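The whole procedure could look roughly like the expanding-window loop below. This is only a sketch under stated assumptions: `forecast_draws` is a hypothetical callable standing in for "refit the model on the first L points and return posterior predictive draws for L + 1", `toy_forecast` is a placeholder rather than a state-space model, and `min_train` and `bad_score` are parameters I introduced for illustration.

```python
from typing import Callable, List, Sequence


def lfo_tsa(
    series: Sequence[float],
    threshold: float,
    forecast_draws: Callable[[Sequence[float]], List[float]],
    *,
    bad_score: float,
    good_score: float = 1.0,
    cutoff: float = 0.5,
    min_train: int = 1,
) -> float:
    """Expanding-window TSA: train on the first L points, score the forecast for L + 1."""
    if not 1 <= min_train < len(series):
        raise ValueError("min_train must be at least 1 and smaller than len(series)")
    total = 0.0
    n_recommendations = 0
    for L in range(min_train, len(series)):
        draws = forecast_draws(series[:L])           # stand-in for refitting on L points
        p_exceed = sum(1 for d in draws if d > threshold) / len(draws)
        recommended = p_exceed > cutoff              # decision rule
        surpassed = series[L] > threshold            # observed outcome at L + 1
        total += good_score if recommended == surpassed else bad_score
        n_recommendations += 1
    return total / n_recommendations


def toy_forecast(train: Sequence[float]) -> List[float]:
    """Placeholder forecaster: last observation plus a few fixed offsets."""
    last = train[-1]
    return [last + offset for offset in (-1.0, -0.5, 0.0, 0.5, 1.0)]


# Made-up weekly densities, for illustration only.
counts = [1.0, 1.5, 2.6, 3.0, 1.0]
score = lfo_tsa(counts, threshold=2.0, forecast_draws=toy_forecast, bad_score=0.0)
```

With a real model, each pass through the loop means a full refit, so a larger `min_train` may be worth considering to avoid scoring forecasts based on only a handful of observations.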