Skill or Luck?: How NHC’s Hurricane Track Forecasts Beat the Models


Skill or Luck?

There’s one thing that many of us are missing right now while we’re occupying ourselves at home:  sports.  We should have been all set for the playoffs in major league hockey and basketball, and we would be excited about the beginning of the major league baseball and soccer seasons.  We also would have been eagerly anticipating some of this spring and summer’s major sporting events, including the Olympics.  So let’s dream a little…

When we set out to write this blog post for Inside the Eye, we wanted to show how National Hurricane Center (NHC) forecasters use their skill and expertise to predict the future track of a hurricane.  And then it got us thinking, how does luck factor into the equation?  In other words, when meteorologists get a weather forecast right, how much of it is luck, and how much of it is forecasters’ skill in correctly interpreting, or even beating, the weather models available to them?

Investment strategist Michael Mauboussin created a “Skill-Luck Continuum” on which individual sports, among other activities in life, are placed on a spectrum somewhere between pure skill and pure luck (Figure 1).  Based on factors such as the number of games in a season, the number of players in action, and the number of scoring opportunities in a game or match, athletes and their teams in some sports might have to rely on a little more luck than those in other sports to be successful.  On this spectrum, a sport like basketball would be closest to the skill side (there are a lot of scoring opportunities in a basketball game), whereas a sport like hockey would require a little more luck (there are fewer scoring opportunities in a hockey game, and sometimes you just need the puck to bounce your way).  Fortunately for hockey fans, there are enough games in a season for their favorite team’s “unlucky” games to not matter so much.


Figure 1.  The Skill-Luck Continuum in Sports, developed by investment strategist Michael Mauboussin.


Where would hurricane forecasting lie on such a continuum?  There’s no doubt that luck plays at least some part in weather forecasting too, particularly in individual forecasts when random or unforeseen circumstances could either play in your favor (and make you look like the best forecaster around) or turn against you (and make you look like you don’t know what you’re doing!).  But luck is much less of a factor when you consider a lot of forecasts over longer periods of time, where the good and bad circumstances should cancel each other out and true skill shines through (just as in sports).  At NHC, we routinely compare our forecasts with weather models over these long periods of time to assess our skill at predicting, for example, the future tracks of hurricanes.

An International Friendly?

From our experience talking to people about hurricanes and weather models, it seems to be almost common “knowledge” that only two models exist – the U.S. Global Forecast System (GFS) and the European Centre for Medium-Range Weather Forecasts (ECMWF) model.  It’s true that those two models are used heavily at NHC and across the National Weather Service in general, but many more weather models can simulate a hurricane’s track and general weather across the globe.  (If you’re interested, here’s a comprehensive list of the weather models used at NHC today:  https://www.nhc.noaa.gov/modelsummary.shtml.)  We’ve also heard and seen people compare the GFS and ECMWF models and debate which model scenario might be more correct for a given storm.  This blog entry summarizes the performance of those models and discusses how, on the whole, NHC systematically outperforms them in predicting the track of a storm.

Figure 2 shows the most recent three years (2017, 2018, and 2019) of Atlantic basin track forecast skill for NHC and the three best individual track models:  the GFS, the ECMWF, and the United Kingdom Meteorological Office model (UKMET).  Track forecast skill is assessed by comparing NHC’s and each model’s performance to that of a baseline, which in this case is a climatology and persistence model.  This baseline model makes forecasts based on a combination of what past storms with similar characteristics–like location, intensity, forward speed, and the time of year–have done (the climatology part) and a continuation of what the current storm has been doing (the persistence part).  It contains no information about the current state of the atmosphere and therefore represents a “no-skill” level of accuracy.
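For the technically minded, skill of this kind is usually expressed as the percent improvement in track error over the no-skill baseline.  Here’s a minimal sketch of that calculation in Python (the function name and the sample numbers are hypothetical, for illustration only – this is not NHC’s verification code):

```python
def track_skill(baseline_error, forecast_error):
    """Percent improvement of a forecast's track error over the
    no-skill climatology-and-persistence baseline.  Positive values
    mean the forecast beat the baseline; zero means no skill."""
    return 100.0 * (baseline_error - forecast_error) / baseline_error

# Hypothetical example: a 48-hour forecast with a 70 n mi track error,
# measured against a 175 n mi baseline error, has 60% skill.
print(track_skill(175.0, 70.0))  # 60.0
```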

Figure 2.  NHC and selected model track forecast skill for the Atlantic basin in 2017, 2018, and 2019.


On the skill diagrams above, higher lines indicate more skillful forecasts or models.  In each year shown, NHC (black line) outperforms the models and has the greatest skill at most, if not all, forecast times (the black line is above the other colored lines most of the time).  Among the models, the ECMWF (red line) has been the best performer, with the GFS (blue line) and UKMET (green line) trading spots for second place.

Another metric for estimating how often NHC outperforms the models is called “frequency of superior performance.”  Based on this metric, over the last three years (2017-19), NHC outperformed the GFS 65% of the time, the UKMET 59% of the time, and the ECMWF 56% of the time.  This means that more often than not, NHC is beating these individual models.  So the question is, how do the NHC forecasters beat the models?
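In code, this metric boils down to the fraction of matched verification cases in which NHC’s error was smaller than the model’s.  A minimal sketch, with made-up error values (the function name and numbers are ours, not NHC’s):

```python
def frequency_of_superior_performance(nhc_errors, model_errors):
    """Percentage of matched forecast cases (same storm, cycle, and
    lead time) in which NHC's track error was smaller than the model's."""
    wins = sum(nhc < model for nhc, model in zip(nhc_errors, model_errors))
    return 100.0 * wins / len(nhc_errors)

# Hypothetical matched 72-hour track errors, in nautical miles:
nhc_errs = [80, 95, 110, 60, 150]
gfs_errs = [100, 90, 130, 75, 160]
print(frequency_of_superior_performance(nhc_errs, gfs_errs))  # 80.0
```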

Keep Your Eyes on the Ball

Forecasters at NHC are quite skilled at assessing weather models and their associated strengths and weaknesses.  It is that experience, along with a methodology of using averages of model solutions (a consensus), that typically helps NHC perform best.  If you ever read an NHC forecast discussion and see statements like “the track forecast is near the consensus aids” or “the track forecast is near the middle of the guidance envelope,” it means the forecaster believed that the best solution was to be near the average of the models.  Although this strategy often works, NHC occasionally abandons it when something does not seem right in the model solutions.  One recent example was Tropical Storm Isaac in 2018.  Figure 3 shows the available model guidance, denoted by different colors, at 2 PM EDT (1800 UTC) on September 9 for Isaac, with the red-brown line representing the model consensus (TVCA).
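To give a feel for what a consensus aid does: at its simplest, it just averages the member models’ forecast positions at each lead time.  (The real TVCA uses a specific set of member models and more careful bookkeeping; the sketch below, with hypothetical positions, only illustrates the idea.)

```python
def consensus_position(member_positions):
    """Average the member models' forecast (latitude, longitude)
    positions at one lead time.  Averaging raw lat/lon values is a
    reasonable approximation when the members are clustered together."""
    lats = [lat for lat, lon in member_positions]
    lons = [lon for lat, lon in member_positions]
    return sum(lats) / len(lats), sum(lons) / len(lons)

# Hypothetical 72-hour positions from three track models:
members = [(16.5, -59.0), (17.2, -58.4), (16.8, -59.6)]
print(consensus_position(members))  # (16.83..., -59.0)
```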

Figure 3. NHC forecast (dashed black line) and selected model tracks at 2 PM EDT (1800 UTC) September 9, 2018 for then-Tropical Storm Isaac.  The solid black line represents the actual track of Isaac and the red-brown line represents the model consensus.


Although the models were in fair agreement that the storm would head westward for some time, a few models diverged by the time Isaac was expected to be near the eastern Caribbean islands, mostly because they disagreed on how fast Isaac would be moving at that time.  Instead of staying near the middle of the guidance envelope, NHC placed the forecast on the southern side of the model suite (dashed black line) at the later forecast times because the forecaster believed that the steering flow would continue to force Isaac westward into the central Caribbean.  Indeed, NHC was correct in this case, and in fact, NHC had very low track errors for the entire storm.

In some cases, all of the models turn out to be wrong, which usually causes the official forecast to suffer as well.  That was the case for a period during Dorian in 2019.  Figure 4 shows many of the available operational models at 8 PM EDT on August 26 (0000 UTC August 27) for then-Tropical Storm Dorian.  As you can see by noting the deviation of the colored lines from the solid black line (Dorian’s actual track), none of the models (colored lines) or the official forecast anticipated that Dorian would turn as sharply as it did over the northeastern Caribbean Sea, and no model showed a direct impact on the Virgin Islands, where Dorian made landfall as a hurricane.


Figure 4.  NHC forecast (dashed black line) and selected model tracks at 8 PM EDT on August 26 (0000 UTC August 27), 2019 for then-Tropical Storm Dorian.  The solid black line represents the actual track of Dorian.


Figure 5 shows many of the operational models at 2 AM EDT (0600 UTC) on August 30 when Dorian, a major hurricane at the time, was approaching the Bahamas.  You can see that all of the models showed Dorian making landfall in south or central Florida in about four days from the time of the model runs, and none of them captured the catastrophic two-day stall that occurred over Great Abaco and Grand Bahama Islands.  NHC’s forecast followed the consensus of the models in this case and thus did not initially anticipate Dorian’s long, drawn-out battering of the northwestern Bahamas.

Figure 5.  NHC forecast (dashed black line) and selected model tracks at 2 AM EDT (0600 UTC) on August 30, 2019 for Hurricane Dorian.  The solid black line represents the actual track of Dorian.


The Undervalued Player?  A Consistently Good Field-Goal Kicker

In American football, probably the most undervalued player on the field is the kicker.  Kickers don’t see much action during the majority of the game.  But at the end of a close game, who has the best chance to win it for a team?  A dependably accurate field goal kicker.  In that vein, it’s not just accuracy that can make NHC’s forecasts “better” than the individual models.  Another important factor is how consistent NHC’s predictions are from forecast to forecast compared to those from the models.  We looked at consistency by comparing the average difference in forecast storm locations between predictions made 12 hours apart.  For example, by how much did the 96-hour storm position in the current forecast change from the 108-hour position in the forecast made 12 hours ago (which was interpolated between that forecast’s 96- and 120-hour positions)?  Figure 6 shows this 4-day “consistency,” as well as the 4-day error, plotted together for the GFS, ECMWF, UKMET, and NHC forecasts for the Atlantic basin from 2017-19.  It can be seen that NHC is not only more accurate than these models (it’s farthest down on the y-axis) but also more consistent (it’s farthest to the left on the x-axis), meaning the official forecast holds steadier from cycle to cycle than the models do.  We like to say that we’re avoiding the model run-to-run “windshield wiper” effect (large shifts in the forecast track to the left or right) or “trombone” effect (tracks that speed up or slow down) that even the most accurate models often display.
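For the technically inclined, here is a minimal sketch of that consistency calculation: linearly interpolate the previous cycle’s track to the matching valid time, then measure the great-circle distance to the current forecast position.  All positions below are hypothetical, and the helper names are ours.

```python
import math

EARTH_RADIUS_NMI = 3440.065  # mean Earth radius in nautical miles

def great_circle_nmi(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points (haversine formula)."""
    lat1, lon1, lat2, lon2 = map(math.radians, (lat1, lon1, lat2, lon2))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_NMI * math.asin(math.sqrt(a))

def midpoint(p1, p2):
    """Linear interpolation halfway between two (lat, lon) positions,
    e.g. a 108-hour point from the 96- and 120-hour forecast positions."""
    return (p1[0] + p2[0]) / 2, (p1[1] + p2[1]) / 2

# Hypothetical positions: the previous cycle's 96- and 120-hour forecasts,
# and the current cycle's 96-hour forecast (valid at the same time as the
# previous cycle's interpolated 108-hour point).
prev_108h = midpoint((25.0, -75.0), (28.0, -73.0))
curr_96h = (26.2, -74.3)
print(great_circle_nmi(*prev_108h, *curr_96h))  # forecast change, in n mi
```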

Figure 6.  96-hour NHC and model forecast error and consistency for 2017-2019 in the Atlantic basin (change from cycle to cycle).


NHC’s emphasis on consistency is so great that there are times when we knowingly accept that we might be sacrificing a little track accuracy to achieve consistency and a better public response to the threat.  An example would be a hurricane that is forecast to move westward and pose a serious threat to the U.S. southeastern states.  Sometimes, such storms “recurve” to the north and then the northeast and move back out to sea before reaching the coast.  When the models trend toward such a recurvature, NHC’s forecast will sometimes lag behind the models’ depiction of a lower threat to land.  In these cases, NHC does not want to prematurely take the southeastern states “off the hook” and send a potentially erroneous signal that the risk of impacts on land has diminished, only to have later forecasts ratchet the threat back up after the public has turned its attention and energies elsewhere, should the models, well, “change their mind”.  That is exactly the kind of windshield wiper effect NHC wants to prevent in its own forecasts.  Now, there are times when the recurvature does indeed occur.  Then, NHC’s track forecasts, which have hung back a little from the models, could end up having larger errors than the models.  But NHC can accept having somewhat larger track forecast errors than the models in such circumstances at longer lead times if, in doing so, it can provide those at risk with a more effective message–achieved in part through consistency.

The superior accuracy and higher levels of consistency of the NHC forecasts are both important characteristics since emergency managers and other decision makers have to make challenging decisions, such as evacuation orders, based on that information.  It is not surprising to us that NHC’s forecasts are more consistent than the global models, since forecasters here take a conservative approach and usually make gradual changes from the forecast they inherited from the previous forecaster.  Conversely, the models often bounce around more and are not constrained by their previous prediction.  And, unlike human forecasters, the models also bear no responsibility or feel remorse when they are wrong!

Filling Out Your Bracket

Accuracy, consistency, and luck are important factors in one particular favorite sport:  college basketball.  We just passed the time of year when we should have been crowning champions in the men’s and women’s college basketball tournaments.  But before those tournaments would have kicked off, “bracketologists” (no known relation to meteorologists!) would have made predictions about which teams would make it into the tournaments and which teams would be likely to win.

Think of it this way:  a team can be accurate in that they have a spectacular winning record during the regular season, but does that mean they are guaranteed to win the tournament, or even advance far?  Nope.  As is often said, that’s why they play the game.  An inconsistent team—one whose performance varies wildly from game to game—has a higher risk of having a bad game and losing to an underdog in the first few rounds, even if their regular season record by itself suggests they should have no problem winning.  The problem is, they could have been very lucky in the regular season, winning a lot of close games that could have easily swung the other way.  If that luck runs out, the inconsistent team could make an early exit from the tournament.  With a consistent team, on the other hand, you pretty much know what kind of performance you’re going to get—good or bad—and that increases confidence in how far the team will advance in the tournament.  You’d want to hitch your wagon to a good team that is consistent and hasn’t had to rely on too much luck to get where they are.

The same can be said for hurricane forecasts from NHC and the models.  NHC’s track forecasts are more accurate and more consistent than the individual models in the long run, and that fact should increase overall user confidence in the forecasts put out by NHC.  Even so, there is always room to improve, and we hope that forecasts will continue to become more accurate and consistent in the future.  It is always a good idea to read the NHC Forecast Discussion to understand the reasoning behind the forecast and to gauge the forecaster’s confidence in the prediction.  For more information on NHC forecast and model verification, visit:  https://www.nhc.noaa.gov/verification/

— John Cangialosi, Robbie Berg, and Andrew Penny