Do we pay too much attention to accuracy in forecasting?


It sounds like a nonsense question. We forecast in the hope of getting an accurate picture of what the future will be like. We’ll be tempted to tweet less-than-complimentary messages about weather forecasters if their predicted barbecue weekend turns into a drenching for guests at an outdoor party we’ve organized. Similarly, a forecast that tells us we can expect to sell 2000 units next week when we only sell 1500 will be galling if we wasted resources producing the surplus 500 units that now have to be dumped.

What I’m referring to is the use of accuracy to decide which forecasting method, or which human forecaster, we should employ in a particular circumstance. Typically, we look at the track record of the method or person. Or we provide them with some past data so that any patterns can be detected, while keeping the latest data hidden, and then see how accurately they can forecast these unseen observations, which are referred to as holdout data. This raises three practical problems.

First, to be confident in our choice of method or person we need to test a large number of their forecasts. One or two seemingly brilliant forecasts are not enough. Forecasters can be lucky: by chance, an outrageous forecast coincides with what actually happens. A maverick analyst foresees a recession no one else had seen coming. Or a TV pundit risks ridicule to predict a stock market crash and a week later the market nosedives. In neither case can we conclude that the forecaster has some mystical powers of foresight. Research suggests that the opposite is more likely to be true.

But, if we need to assess accuracy over a large number of forecasts, where do we get them from? In many circumstances, there is a dearth of opportunities for evaluating forecast performance. Events like elections occur relatively infrequently so we have few chances to assess how skilled a politics expert is in identifying the most likely winner. Product life cycles are getting shorter so we usually have only a limited amount of past demand data. Once we’ve used some of this data to detect patterns there is not much left for the holdout observations. This means that, when comparing competing methods, it’s tempting to use just one, two or three unseen observations and then declare one method as the clear winner.
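
To see how easily a tiny holdout sample can crown the wrong winner, here is a rough simulation sketch. Everything in it is an assumption made for illustration: the two hypothetical methods, the normally distributed errors, the error spreads of 10 and 15 units, and the function name `wrong_winner_rate`. It simply counts how often the genuinely poorer method achieves the lower mean absolute error on the holdout observations.

```python
import random


def wrong_winner_rate(holdout_size, trials=5000, seed=0):
    """Share of trials in which the truly poorer method looks better.

    Assumption: the better method's errors are drawn from N(0, 10) and the
    poorer method's from N(0, 15). These figures are purely illustrative.
    """
    rng = random.Random(seed)
    wrong = 0
    for _ in range(trials):
        # Mean absolute error of each method on the holdout observations.
        mae_better = sum(abs(rng.gauss(0, 10)) for _ in range(holdout_size)) / holdout_size
        mae_poorer = sum(abs(rng.gauss(0, 15)) for _ in range(holdout_size)) / holdout_size
        if mae_poorer < mae_better:
            wrong += 1
    return wrong / trials


rate_small = wrong_winner_rate(holdout_size=3)
rate_large = wrong_winner_rate(holdout_size=50)

print(f"Poorer method wins with 3 holdout points:  {rate_small:.1%}")
print(f"Poorer method wins with 50 holdout points: {rate_large:.1%}")
```

Under these assumptions the poorer method is declared the winner far more often with three holdout observations than with fifty, which is the danger of naming a clear winner on the strength of two or three unseen values.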

Of course, we can test the expert on lots of elections in different countries, just as we can test a statistical forecasting method on lots of different short-life-cycle products, if they are available. But if the expert only claims knowledge of the political landscape of one country, or if the products have different demand characteristics, our testing is likely to mislead us.

This leads to a second, related problem. As they warn in investment advertisements: past performance is no guarantee of future performance. In a rapidly changing world, what worked in the past may be a poor guide to what will work in the future. Forecasting has been compared to steering a ship by studying its wake. Similarly, to focus on past accuracy is to focus on history in an exercise that should be all about the future.

The third problem is how we measure accuracy. There are a host of different measures, ranging from mean absolute errors to Brier scores, depending on the type of forecast being made. These make different, and often undeclared, assumptions about the seriousness of differences between the forecast and the outcome. As a result, they can lead to contradictory findings: Method A is more accurate than Method B on one accuracy measure, but B is more accurate than A on another. Moreover, the assumptions about the consequences of forecast-outcome discrepancies rarely coincide with the true consequences in a given situation, such as a soaking for my party guests, loss of customer goodwill through underproduction or the costs of surplus stocks.
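
A small made-up example shows how two measures can disagree. The sketch below compares mean absolute error (MAE) with root mean squared error (RMSE), which is my choice of a second measure for illustration; the demand figures and both sets of forecasts are invented. Method A is spot on four times and badly wrong once, while Method B is slightly wrong every time.

```python
import math


def mean_absolute_error(actual, forecast):
    """Average size of the errors, treating all errors proportionally."""
    return sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)


def root_mean_squared_error(actual, forecast):
    """Square root of the average squared error; penalizes large misses more."""
    return math.sqrt(sum((a - f) ** 2 for a, f in zip(actual, forecast)) / len(actual))


# Invented data: five periods of actual demand and two hypothetical methods.
actual = [100, 100, 100, 100, 100]
method_a = [100, 100, 100, 100, 110]  # perfect four times, one large miss
method_b = [103, 103, 103, 103, 103]  # a small miss every period

mae_a = mean_absolute_error(actual, method_a)
mae_b = mean_absolute_error(actual, method_b)
rmse_a = root_mean_squared_error(actual, method_a)
rmse_b = root_mean_squared_error(actual, method_b)

print(f"MAE:  A = {mae_a:.2f}, B = {mae_b:.2f}")
print(f"RMSE: A = {rmse_a:.2f}, B = {rmse_b:.2f}")
```

On these figures A has the lower MAE and B has the lower RMSE, because squaring the errors treats one large miss as far more serious than several small ones. Neither measure knows whether a single ten-unit miss really is worse for the business than five three-unit misses.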

So what’s the answer? In decision making it’s often said that you should not judge the quality of a decision by its outcome. I might decide to gamble everything I own on a 500 to 1 outsider in a horse race and, incredibly, I win. That’s a great outcome. But most people would agree that it was an awful, reckless decision. We should judge the quality of a decision by the process that underpinned it. Was accurate, cost-effective information gathered? Were all stakeholders consulted? Were the risks assessed? The same should be true of forecasting.

Nearly twenty years ago, the Wharton professor Scott Armstrong led the Forecasting Principles project, which was designed to identify the characteristics of a good forecasting process. The M-Competitions led by Spyros Makridakis have provided further guidance. Later work, such as the Good Judgment project led by Philip Tetlock, has added to our knowledge of what makes a ‘good’ forecast, albeit in a more restricted range of contexts. Of course, the validity of the principles uncovered by these projects depends on their ability to improve the likelihood of an accurate or well-calibrated forecast. This validity is established by testing them on very large numbers of forecasts under different conditions and using a range of measures.

As we’ve seen, in many practical situations we don’t have access to this richness of data to test each of our candidates for the title of ‘Best Forecasting Method’. So we should spend more time comparing how well they adhere to principles of good forecasting and give less prominence to fortuitous short-term bursts of apparent accuracy or a few unlucky instances that seem to suggest poor performance.

Paul Goodwin

Read more in: Forewarned: A Sceptic’s Guide to Prediction (Biteback Publishing).