On Extrinsic forecasting metrics

Hello everyone.

I’m looking for a way to evaluate a forecasting model with an “extrinsic” metric.

As part of my PhD, I developed 3 state-space models for forecasting stink bug populations in soybean crops that a company now wants to use in production. They are interested in the models because, with the posterior predictive distribution, we can compute the probability of stink bugs surpassing a given “economic threshold”, which would require a pest control intervention to avoid economic loss.

So far, I’ve only evaluated my models using LOO-CV (although I should have used LFO-CV), which was good for deciding which model was best.
The thing is that this company wants to know “how useful is the model?”, which makes a lot of sense, but I’m unable to answer properly: I lack the knowledge, and my attempts to find an extrinsic metric for my Bayesian models (i.e., a metric that tells us how well the model will perform in a business context) have been unsuccessful.

So my questions are:

  1. Do you know of an extrinsic metric that I could use in this case?

  2. I wrote down an idea that I had for an extrinsic metric but I’m not sure if it’s a sound proposal. Could you please comment on it?

Extrinsic metric proposal

Given that this company wants a model that “recommends” whether a pest control intervention should be carried out in a given week, so that the economic threshold is not surpassed the following week, this proposed extrinsic metric aims to measure how often the model makes a “good” recommendation.

In the metric, so far named “Threshold Surpassing Accuracy”, a recommendation is considered ‘good’ when the model suggested carrying out a pest control intervention at time t and stink bugs surpassed the economic threshold at t+1, or when the model did NOT suggest an intervention at time t and stink bugs did NOT surpass the economic threshold at t+1.

A recommendation is considered ‘bad’ when the opposite happens, i.e., the model recommends an intervention and the threshold is not surpassed, or the model does not recommend one and the threshold is surpassed.

The model will only recommend an intervention when the probability of the population density surpassing the economic threshold at t+1 is higher than 50% (or another percentage based on expert knowledge).

A good recommendation scores +1 and a bad recommendation scores 0.

The “Threshold Surpassing Accuracy” (TSA) metric is calculated as the total score divided by the total number of recommendations.

The TSA metric is computed in a similar way to the Leave-Future-Out Cross-Validation algorithm (“Approximate leave-future-out cross-validation for Bayesian time series models”, arXiv:1902.06281): the model is trained on L sequential data points, a forecast for L + 1 is made, and a score is given based on the recommendation (good/bad) and the observed outcome at L + 1 (threshold surpassed or not). The process is repeated for every possible L, and then the final score is computed.
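To make the proposal concrete, here is a minimal sketch of the scoring step, assuming the LFO-style loop has already produced one-step-ahead forecast probabilities (e.g., the fraction of posterior predictive draws above the economic threshold) and the matching observed outcomes. The function name `tsa` and its arguments are my own invention, not anything from the thread:

```python
import numpy as np

def tsa(surpass_probs, surpassed, p_threshold=0.5):
    """Threshold Surpassing Accuracy (hypothetical implementation).

    surpass_probs: forecast probabilities that the population exceeds the
        economic threshold at t+1, one per one-step-ahead forecast
        (e.g., np.mean(posterior_predictive_draws > economic_threshold)).
    surpassed: observed binary outcomes at t+1 (True if threshold exceeded).
    p_threshold: probability above which the model recommends intervention.
    """
    surpass_probs = np.asarray(surpass_probs, dtype=float)
    surpassed = np.asarray(surpassed, dtype=bool)
    recommend = surpass_probs > p_threshold   # model's recommendation at time t
    good = recommend == surpassed             # +1 when recommendation and outcome agree
    return good.mean()

# Example with 4 one-step-ahead forecasts from the LFO-style loop
probs = [0.8, 0.2, 0.6, 0.3]
obs = [True, False, False, False]
print(tsa(probs, obs))  # 3 of 4 recommendations are "good" -> 0.75
```

The third forecast recommends an intervention (0.6 > 0.5) but the threshold is not surpassed, so it counts as a bad recommendation and the score is 3/4.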



I just want to gut check an assumption in your proposed evaluation, which is that the company really cares about targeting the intervention to the correct week. Is it really a problem to intervene a week or two early in cases where intervention was indeed going to be required eventually?

Yeah, it’s a good point. But it is indeed a problem to intervene at the wrong time, mainly because those interventions tend to be less effective, and because the intervention itself is costly and has a negative impact on the environment.

This seems like a reasonable idea then, again with the caveat that it treats a prediction that misses by a week or two (in either direction) as an equally bad outcome to missing the prediction by a much larger amount of time. If it’s true that from the company’s perspective, missing by a week is just as bad as missing by months, then that’s exactly what you want. Otherwise, you might consider evaluation metrics that penalize outcomes in proportion to how bad they are for the company.

Is there a reason why an existing forecast scoring metric won’t work? In case you haven’t encountered Gneiting and Katzfuss 2014, it’s an excellent (albeit technical) review of the properties of scoring rules for probabilistic forecasts. A good resource to explore for ideas.

For a model that makes binary recommendations (intervene vs. don’t intervene) and the outcome measured is also binary (above vs. below economic threshold), the True Skill Statistic has some nice properties. If you have a decent validation dataset in hand, you could even determine the optimal probability at which to recommend intervention. (Iterate over all possible forecast probability thresholds, 0-100%, with a reasonably fine step size and select the threshold that maximizes the TSS.)
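The threshold search described above can be sketched in a few lines. This is a minimal illustration under assumptions of my own (function names, a 0.01 step size, and a validation set containing both outcomes, since TSS is undefined when either class is absent):

```python
import numpy as np

def true_skill_statistic(recommend, surpassed):
    """TSS = sensitivity + specificity - 1.

    Assumes the validation data contain at least one surpassed and one
    non-surpassed week, otherwise a division by zero occurs.
    """
    recommend = np.asarray(recommend, dtype=bool)
    surpassed = np.asarray(surpassed, dtype=bool)
    tp = np.sum(recommend & surpassed)    # intervene recommended, threshold surpassed
    fn = np.sum(~recommend & surpassed)   # missed outbreak
    tn = np.sum(~recommend & ~surpassed)  # correctly no intervention
    fp = np.sum(recommend & ~surpassed)   # unnecessary intervention
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity + specificity - 1

def best_threshold(surpass_probs, surpassed, step=0.01):
    """Grid-search the forecast-probability cutoff that maximizes TSS."""
    probs = np.asarray(surpass_probs, dtype=float)
    grid = np.arange(step, 1.0, step)
    scores = [true_skill_statistic(probs > p, surpassed) for p in grid]
    i = int(np.argmax(scores))
    return grid[i], scores[i]
```

For example, with forecasts `[0.9, 0.7, 0.4, 0.2, 0.1]` and outcomes `[True, True, False, False, False]`, any cutoff between 0.4 and 0.7 separates the classes perfectly and achieves TSS = 1.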

I see why a discretized recommendation is important for the end-user, but if the model’s forecast skill is something the company wants to understand with more nuance, perhaps something like a Brier Score would be a good place to start. If the economic threshold is indeed passed, you’d like a model that forecasted this event with 91% probability over one that forecasted 51%. Similarly, forecasts that are -1% and +1% of the selected threshold for intervention really aren’t all that different from each other. The related Brier Skill Score might also be a tool you could use to measure the skill of your model relative to the current approach or a naive forecasting model. Internal to the current model, you could also use the BS(S) to measure how your model’s forecast skill decays with increasing lead time and how far ahead its forecasts cease to provide any value.
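Both scores are short enough to sketch directly; this is a minimal illustration, with the climatological reference forecast (a constant 0.5) chosen only for the example:

```python
import numpy as np

def brier_score(probs, outcomes):
    """Mean squared difference between forecast probabilities and binary outcomes.

    Lower is better; a perfect forecast scores 0, a constant 0.5 forecast
    scores 0.25 regardless of the outcomes.
    """
    probs = np.asarray(probs, dtype=float)
    outcomes = np.asarray(outcomes, dtype=float)
    return np.mean((probs - outcomes) ** 2)

def brier_skill_score(probs, outcomes, ref_probs):
    """Skill relative to a reference forecast: 1 is perfect, 0 matches the
    reference, negative is worse than the reference."""
    return 1.0 - brier_score(probs, outcomes) / brier_score(ref_probs, outcomes)

probs = [0.9, 0.1, 0.8, 0.2]          # model's threshold-surpassing forecasts
outcomes = [1, 0, 1, 0]               # observed: threshold surpassed or not
reference = [0.5, 0.5, 0.5, 0.5]      # naive "coin flip" reference forecast
print(brier_score(probs, outcomes))                     # -> 0.025
print(brier_skill_score(probs, outcomes, reference))    # -> 0.9
```

To measure skill decay with lead time, you would compute the score separately for 1-week-ahead, 2-week-ahead, etc. forecasts and watch where the skill score approaches zero.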

I’ll second @jsocolar’s suggestion to think about linking these scores to a currency-based cost function. The best forecast skill won’t always minimize cost—it may be better to intervene more cautiously or more aggressively than the strategy with the highest forecast skill would suggest. Michael Dietze’s “Ecological Forecasting” book has a section on using forecasts for decision support that might be useful to get you started. The cost function will depend on the cost of the intervention and the cost of pest losses. That second one may be very hard to pin down given the uncertainty of future yield and market prices for the crop being produced (although yields too are increasingly being forecast in their own right).
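The cost-function idea can be illustrated with a deliberately simplified expected-cost decision rule. All numbers and names here are hypothetical, and the sketch assumes an intervention is fully effective (it ignores partially effective or mistimed interventions, which the thread notes are a real concern):

```python
def expected_cost(p_surpass, c_intervene, c_loss):
    """Expected cost of each action given the forecast probability.

    p_surpass: forecast probability the threshold is surpassed next week.
    c_intervene: cost of a pest control intervention (hypothetical figure).
    c_loss: economic loss if the threshold is surpassed with no intervention.
    """
    cost_if_intervene = c_intervene      # simplification: intervention always works
    cost_if_wait = p_surpass * c_loss    # expected loss from doing nothing
    return cost_if_intervene, cost_if_wait

def recommend(p_surpass, c_intervene, c_loss):
    """Intervene when the expected loss of waiting exceeds the intervention cost."""
    cost_if_intervene, cost_if_wait = expected_cost(p_surpass, c_intervene, c_loss)
    return cost_if_intervene < cost_if_wait

# Under this rule the break-even probability is c_intervene / c_loss:
# with hypothetical costs of 100 per intervention and 400 per outbreak,
# the optimal cutoff is 0.25 rather than 0.5.
print(recommend(0.30, 100, 400))  # True
print(recommend(0.20, 100, 400))  # False
```

This is exactly the sense in which the highest-skill cutoff and the cost-minimizing cutoff can differ: the asymmetry between the two costs moves the optimal decision threshold away from 50%.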

Ultimately, the “right” answer will depend on what the company wants to know and what makes a forecast “good.” This is much harder to specify than it may seem to them. You may think about exploring a couple of different scoring rules that measure “good” in different ways.


Hello @wpetry.

Sorry for the (very) late reply.
I wanted to thank you for your answer; it was everything I needed.
We are going to propose the Brier Score to begin with and then a currency-based cost function, and both the paper and book that you suggested have been extremely useful.

Thank you again,
