Skill or Luck?: How NHC’s Hurricane Track Forecasts Beat the Models
Skill or Luck?
There’s one thing that many of us are missing right now while we’re occupying ourselves at home: sports. We should have been all set for the playoffs in major league hockey and basketball, and we would be excited about the beginning of the major league baseball and soccer seasons. We also would have been eagerly anticipating some of this spring and summer’s major sporting events, including the Olympics. So let’s dream a little…
When we set out to write this blog post for Inside the Eye, we wanted to show how National Hurricane Center (NHC) forecasters use their skill and expertise to predict the future track of a hurricane. And then it got us thinking, how does luck factor into the equation? In other words, when meteorologists get a weather forecast right, how much of it is luck, and how much of it is forecasters’ skill in correctly interpreting, or even beating, the weather models available to them?
Investment strategist Michael Mauboussin created a “Skill-Luck Continuum” where individual sports, among other activities in life, are placed on a spectrum somewhere between pure skill and pure luck (Figure 1). Based on factors such as the number of games in a season, number of players in action, and number of scoring opportunities in a game or match, athletes and their teams in some sports might have to rely on a little more luck than other sports to be successful. On this spectrum, a sport like basketball would be closest to the skill side (there are a lot of scoring opportunities in a basketball game) whereas a sport like hockey would require a little more luck (there are fewer scoring opportunities in a hockey match, and sometimes you just need the puck to bounce your way). Fortunately for hockey fans, there are enough games in a season for their favorite team’s “unlucky” games to not matter so much.
Figure 1. The Skill-Luck Continuum in Sports, developed by investment strategist Michael Mauboussin.
Where would hurricane forecasting lie on such a continuum? There’s no doubt that luck plays at least some part in weather forecasting too, particularly in individual forecasts when random or unforeseen circumstances could either play in your favor (and make you look like the best forecaster around) or turn against you (and make you look like you don’t know what you’re doing!). But luck is much less of a factor when you consider a lot of forecasts over longer periods of time, where the good and bad circumstances should cancel each other out and true skill shines through (just as in sports). At NHC, we routinely compare our forecasts with weather models over these long periods of time to assess our skill at predicting, for example, the future tracks of hurricanes.
An International Friendly?
From our experience of talking to people about hurricanes and weather models, it seems to be almost common “knowledge” that only two models exist – the U.S. Global Forecast System (GFS) and the European Centre for Medium-Range Weather Forecasts (ECMWF) model. It’s true that those two models are used heavily at NHC and the National Weather Service in general, but there are many more weather models that can simulate a hurricane’s track and general weather across the globe. (Here’s a comprehensive list showing all of the available weather models that are used at NHC today, if you’re interested: https://www.nhc.noaa.gov/modelsummary.shtml.) We’ve also heard and seen people compare the GFS and ECMWF models and talk about which model scenario might be more correct for a given storm. This blog entry summarizes the performances of those models and discusses how, on the whole, NHC systematically outperforms them at predicting the track of a storm.
Below are the most recent three years of data (2017, 2018, and 2019) of Atlantic basin track forecast skill from NHC and the three best individual track models: the GFS, ECMWF, and the United Kingdom Meteorological Office model (UKMET) (Figure 2). Track forecast skill is assessed by comparing NHC’s and each model’s performance to that of a baseline, which in this case is a climatology and persistence model. This model makes forecasts based on a combination of what past storms with similar characteristics–like location, intensity, forward speed, and the time of year–have done (the climatology part) and a continuation of what the current storm has been doing (the persistence part). This model contains no information about the current state of the atmosphere and represents a “no-skill” level of accuracy.
Figure 2. NHC and selected model track forecast skill for the Atlantic basin in 2017, 2018, and 2019.
On the skill diagrams above, a line that sits above another line indicates greater skill. In each year shown, NHC (black line) outperforms the models and has the greatest skill at most, if not all, forecast times (the black line is above the other colored lines most of the time). Among the models, the ECMWF (red line) has been the best performer, with the GFS (blue line) and UKMET (green line) trading spots for second place.
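For readers who like to see the arithmetic, here is a minimal sketch of how a skill score relative to a baseline can be computed. It assumes skill is expressed as the percent improvement over the climatology and persistence baseline, which is a common convention but is not spelled out in this post. The function name and all error values are hypothetical and are not NHC data.

```python
def track_skill(forecast_error_nmi: float, baseline_error_nmi: float) -> float:
    """Percent improvement over the baseline (assumed definition).

    Positive values mean the forecast beat the no-skill baseline;
    zero means it did no better than the baseline.
    """
    if baseline_error_nmi <= 0:
        raise ValueError("baseline error must be positive")
    return 100.0 * (baseline_error_nmi - forecast_error_nmi) / baseline_error_nmi


# Hypothetical 72-hour mean track errors in nautical miles (illustrative only).
baseline_72h = 250.0
mean_errors_72h = {"OFCL": 100.0, "ECMWF": 105.0, "GFS": 120.0}

skill_72h = {
    name: track_skill(error, baseline_72h)
    for name, error in mean_errors_72h.items()
}

for name, skill in skill_72h.items():
    print(f"{name}: {skill:.0f}% better than the baseline")
```

With these made-up numbers, the forecast with the smallest error gets the highest skill, which is why a higher line on a skill diagram indicates a better forecast.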
Yet another metric to estimate how often NHC outperforms the models is called “frequency of superior performance.” Based on this metric, over the last 3 years (2017-19), NHC outperformed the GFS 65% of the time, the UKMET 59% of the time, and the ECMWF 56% of the time. This means that more often than not, NHC is beating these individual models. So the question is, how do the NHC forecasters beat the models?
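As a rough illustration of the “frequency of superior performance” idea, the sketch below counts how often one set of forecasts has a smaller error than another when the two are compared forecast by forecast. How ties are treated is an assumption here (they are simply left out), and the error values are invented for the example.

```python
def frequency_of_superior_performance(errors_a, errors_b):
    """Percent of paired forecasts in which A had a smaller error than B.

    Ties are excluded from the count (an assumption for this sketch).
    """
    if len(errors_a) != len(errors_b):
        raise ValueError("forecasts must be paired one-to-one")
    wins = sum(a < b for a, b in zip(errors_a, errors_b))
    losses = sum(a > b for a, b in zip(errors_a, errors_b))
    decided = wins + losses
    if decided == 0:
        raise ValueError("no forecast pairs with a clear winner")
    return 100.0 * wins / decided


# Hypothetical track errors (n mi) for the same five forecasts.
official_errors = [50, 80, 120, 40, 95]
model_errors = [60, 70, 150, 40, 110]

fsp = frequency_of_superior_performance(official_errors, model_errors)
print(f"Official forecast was better {fsp:.0f}% of the time")
```

Because the comparison is made one forecast at a time, this metric answers a slightly different question than average error: not “who is closer on average?” but “who is closer more often?”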
Keep Your Eyes on the Ball
Forecasters at NHC are quite skilled at assessing weather models and their associated strengths and weaknesses. It is that experience and a methodology of using averages of model solutions (consensus) that typically help NHC perform best. If you ever read an NHC forecast discussion and see statements like “the track forecast is near the consensus aids,” or “the track forecast is near the middle of the guidance envelope,” the forecaster believed that the best solution was to be near the average of the models. Although this strategy often works, NHC occasionally abandons this method when something does not seem right in the model solutions. One recent example of this was Tropical Storm Isaac in 2018. The figure below (Figure 3) shows the available model guidance, denoted by different colors, at 2 PM EDT (1800 UTC) on September 9 for Isaac, with the red-brown line representing the model consensus (TVCA).
Figure 3. NHC forecast (dashed black line) and selected model tracks at 2 PM EDT (1800 UTC) September 9, 2018 for then-Tropical Storm Isaac. The solid black line represents the actual track of Isaac and the red-brown line represents the model consensus.
Although the models were in fair agreement that the storm would head westward for some time, a few models diverged by the time Isaac was expected to be near the eastern Caribbean Islands, mostly because they disagreed on how fast Isaac would be moving at that time. Instead of being near the middle of the guidance envelope, NHC placed the forecast on the southern side of the model suite (dashed black line) at the latter forecast times since the forecaster believed that the steering flow would continue to force Isaac westward into the central Caribbean. Indeed, NHC was correct in this case, and in fact, for the entire storm, NHC had very low track errors.
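To make the idea of a model consensus a little more concrete, here is a minimal sketch that averages the forecast positions from several models at a single forecast time. It is only an illustration: the model names and positions are made up, the simple latitude/longitude average is an assumption that works for nearby points away from the date line, and it is not meant to describe how TVCA itself is built.

```python
from statistics import fmean


def consensus_position(positions):
    """Arithmetic mean of (lat, lon) forecast positions, in degrees.

    `positions` maps a model name to its (lat, lon) forecast for one
    lead time. A plain average is assumed to be adequate here.
    """
    if not positions:
        raise ValueError("need at least one model position")
    lats = [lat for lat, _ in positions.values()]
    lons = [lon for _, lon in positions.values()]
    return fmean(lats), fmean(lons)


# Hypothetical 96-hour positions from three unnamed models.
guidance_96h = {
    "MODEL_A": (14.0, -60.0),
    "MODEL_B": (15.0, -62.0),
    "MODEL_C": (16.0, -58.0),
}

consensus_96h = consensus_position(guidance_96h)
print(f"Consensus position: {consensus_96h[0]:.1f}N, {-consensus_96h[1]:.1f}W")
```

A forecast “near the middle of the guidance envelope” would sit close to this averaged point, while a forecast like the one for Isaac deliberately sits to one side of it.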
In some cases all of the models turn out to be wrong, which usually causes the official forecast to suffer as well. That was the case for a period during Dorian in 2019. Figure 4 shows many of the available operational models at 8 PM EDT on August 26 (0000 UTC August 27) for then-Tropical Storm Dorian. As you can see by noting the deviation of the forecast lines from the solid black line (Dorian’s actual track), none of the models (colored lines) or the official forecast (dashed black line) anticipated that Dorian would turn as sharply as it did over the northeastern Caribbean Sea, and no model showed a direct impact to the Virgin Islands, where Dorian made landfall as a hurricane.
Figure 4. NHC forecast (dashed black line) and selected model tracks at 8 PM EDT on August 26 (0000 UTC August 27), 2019 for then-Tropical Storm Dorian. The solid black line represents the actual track of Dorian.
Figure 5 shows many of the operational models at 2 AM EDT (0600 UTC) on August 30 when Dorian, a major hurricane at the time, was approaching the Bahamas. You can see that all of the models showed Dorian making landfall in south or central Florida in about four days from the time of the model runs, and none of them captured the catastrophic two-day stall that occurred over Great Abaco and Grand Bahama Islands. NHC’s forecast followed the consensus of the models in this case and thus did not initially anticipate Dorian’s long, drawn-out battering of the northwestern Bahamas.
Figure 5. NHC forecast (dashed black line) and selected model tracks at 2 AM EDT (0600 UTC) on August 30, 2019 for Hurricane Dorian. The solid black line represents the actual track of Dorian.
The Undervalued Player? A Consistently Good Field-Goal Kicker
In American football, probably one of the most undervalued players on the field is the kicker. They don’t see much action during the majority of the game. But at the end of close games, who has the best chance to win the game for a team? A dependably accurate field goal kicker. In that vein, it’s not just accuracy that can make NHC’s forecasts “better” than the individual models. Another important factor is how consistent NHC’s predictions are from forecast to forecast compared to those from the models. We looked at consistency by comparing the average difference in the forecast storm locations between predictions that were made 12 hours apart. For example, by how much did the 96-hour storm position in the current forecast change from the 108-hour position in the forecast that was made 12 hours ago (which was interpolated between the 96- and 120-hour forecast positions)? Figure 6 shows this 4-day “consistency,” as well as the 4-day error, plotted together for the GFS, ECMWF, UKMET, and NHC forecasts for the Atlantic basin from 2017-19. It can be seen that NHC is not only more accurate than these models (it’s farthest down on the y-axis), but it is also more consistent (it’s farthest to the left on the x-axis), meaning the official forecast holds steady more than the models do from cycle to cycle. We like to say that we’re avoiding the model run-to-run “windshield wiper” effect (large shifts in forecast track to the left or right) or “trombone” effect (tracks that speed up or slow down) that are often displayed by even the most accurate models.
Figure 6. 96-hour NHC and model forecast error and consistency for 2017-2019 in the Atlantic basin (change from cycle to cycle).
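The consistency comparison described above can be sketched in a few lines. The example below interpolates a 108-hour position from the previous forecast’s 96- and 120-hour positions and measures how far the current 96-hour position has moved from it. The straight-line interpolation in latitude/longitude, the haversine distance, the function names, and the sample positions are all assumptions made for illustration.

```python
import math

# Mean Earth radius in nautical miles (6371 km; 1 n mi = 1.852 km).
EARTH_RADIUS_NMI = 6371.0 / 1.852


def great_circle_nmi(p, q):
    """Haversine distance in nautical miles between two (lat, lon) points."""
    lat1, lon1 = map(math.radians, p)
    lat2, lon2 = map(math.radians, q)
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = (math.sin(dlat / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2)
    return 2 * EARTH_RADIUS_NMI * math.asin(math.sqrt(a))


def interpolate_position(p_start, p_end, fraction):
    """Linear interpolation in lat/lon (assumed adequate over 24 hours)."""
    return (p_start[0] + fraction * (p_end[0] - p_start[0]),
            p_start[1] + fraction * (p_end[1] - p_start[1]))


def consistency_96h(current_96h, previous_96h, previous_120h):
    """Shift (n mi) between the current 96-h position and the 108-h position
    interpolated from the forecast made 12 hours earlier. Both positions
    are valid at the same time."""
    previous_108h = interpolate_position(previous_96h, previous_120h, 0.5)
    return great_circle_nmi(current_96h, previous_108h)


# Hypothetical positions (lat, lon in degrees).
shift_nmi = consistency_96h(
    current_96h=(27.0, -76.0),
    previous_96h=(25.0, -75.0),
    previous_120h=(27.0, -77.0),
)
print(f"Forecast position shifted by about {shift_nmi:.0f} n mi")
```

Averaging this shift over many forecast cycles gives a single consistency number: the smaller it is, the less the forecast jumps around from one cycle to the next.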
NHC’s emphasis on consistency is so great that there are times when we knowingly accept that we might be sacrificing a little track accuracy to achieve consistency and a better public response to the threat. An example would be for a hurricane that is forecast to move westward and pose a serious threat to the U.S. southeastern states. Sometimes, such storms “recurve” to the north and then the northeast and move back out to sea before reaching the coast. When the models trend toward such a recurvature, the NHC’s forecast will sometimes lag the models’ forecast of a lower threat to land. In these cases, NHC does not want to prematurely take the southeastern states “off the hook”, sending a potentially erroneous signal that the risk of impacts on land has diminished, only to have later forecasts ratchet the threat back up after the public has turned their attention and energies elsewhere if the models, well, “change their mind”. That would be the kind of windshield wiper effect NHC wants to prevent in its own forecasts. Now, there are times where the recurvature does indeed occur. Then, NHC’s track forecasts, which have hung back a little from the models, could end up having larger errors than the models. But, NHC can accept having somewhat larger track forecast errors than the models in such circumstances at longer lead times if in doing so it can provide those at risk with a more effective message–achieved in part through consistency.
The superior accuracy and higher levels of consistency of the NHC forecasts are both important characteristics since emergency managers and other decision makers have to make challenging decisions, such as evacuation orders, based on that information. It is not surprising to us that NHC’s forecasts are more consistent than the global models, since forecasters here take a conservative approach and usually make gradual changes from the forecast they inherited from the previous forecaster. Conversely, the models often bounce around more and are not constrained by their previous prediction. And, unlike human forecasters, the models bear no responsibility and feel no remorse when they are wrong!
Filling Out Your Bracket
Accuracy, consistency, and luck are important factors in one particularly favorite sport: college basketball. We just passed the time of year when we should have been crowning champions in the men’s and women’s college basketball tournaments. But before those tournaments would have kicked off, “bracketologists” (no known relation to meteorologists!) would have made predictions on which teams would make it into the tournaments and which teams would have been likely to win.
Think of it this way: a team can be accurate in that they have a spectacular winning record during the regular season, but does that mean they are guaranteed to win the tournament, or even advance far? Nope. As is often said, that’s why they play the game. An inconsistent team—one whose performance varies wildly from game to game—has a higher risk of having a bad game and losing to an underdog in the first few rounds, even if their regular season record by itself suggests they should have no problem winning. The problem is, they could have been very lucky in the regular season, winning a lot of close games that could have easily swung the other way. If that luck runs out, the inconsistent team could have an early exit from the tournament. With a consistent team, on the other hand, you pretty much know what kind of performance you’re going to get—good or bad—and that increases confidence in knowing how far in the tournament the team would advance. You’d want to hitch your wagon to a good team that is consistent and hasn’t had to rely on too much luck to get where they are.
The same can be said for hurricane forecasts from NHC and the models. NHC’s track forecasts are more accurate and more consistent than the individual models in the long run, and that fact should increase overall user confidence in the forecasts put out by NHC. Even so, there is always room to improve, and it is hoped that forecasts will continue to become more accurate and consistent in the future. It is always a good idea to read the NHC Forecast Discussion to understand the reasons behind the forecast and to gauge the forecaster’s confidence in the prediction. For more information on NHC forecast and model verification, click the following link: https://www.nhc.noaa.gov/verification/
— John Cangialosi, Robbie Berg, and Andrew Penny
The State of Hurricane Forecasting
The State of Hurricane Forecasting is . . .
The National Hurricane Center (NHC) has the responsibility for issuing advisories and U.S. watches/warnings for tropical cyclones (TCs), which include tropical depressions, tropical storms, and hurricanes, for the Atlantic and east Pacific basins. NHC has a long history of issuing advisories for TCs, with the first known recorded forecast being in 1954, when 24-hour predictions of a TC’s track were made. Since then, we’ve expanded our forecasts out in time and added predictions of TC intensity, size, and associated hazards, such as wind, storm surge, and rainfall. In addition, the lead times of tropical storm and hurricane watches and warnings have increased to give the public additional time to prepare for these potentially devastating events. Since we’re at the time of year when the U.S. President and state governors have just given their “State of the Union” or “State of the State” speeches, we thought this might be a good time to give our own “State of Hurricane Forecasting” speech. This blog entry takes a look at the accuracy of NHC’s forecasts and quantifies how much more accurate they are today compared to decades ago.
Track Forecasting (a.k.a., Where the Storm Will Go)
We are usually more confident in predicting the path of a TC than in predicting its strength or size. The primary reason is that the track of a TC is governed by forces larger than the tropical system itself: the surrounding steering currents cover a much larger area than the hurricane. Because these nearby weather patterns are big, we can usually “see” them easily, and the global weather models do a fairly good job in predicting how these steering features might evolve over the course of a few days.
The figure below shows the average NHC track forecast errors for tropical storms and hurricanes by decade beginning in the 1960s. You can see that there has been a steady reduction in the track errors over time, with the average errors in the current decade about 30-40% smaller than they were in the 2000s and about half of the size (or even smaller) than they were in the 1990s.
If that doesn’t seem impressive, let’s look at another example. The next graphic shows two circles centered on a point near Pensacola, Florida, with the blue one representing the average 48-hour track error in 1990 and the red one showing the average 48-hour error today. What it shows is that if NHC had made a forecast for a storm to be over Pensacola in 48 hours back in 1990, the TC would have ended up, on average, not exactly over Pensacola but somewhere on the blue circle. If NHC makes the same forecast today, now the storm ends up, on average, somewhere on the red circle. You can easily see that the NHC forecasts for the path of a TC today are much more accurate, on average, than they were decades ago, and these more accurate forecasts have helped narrow the warning areas, save lives, and make for more efficient and less costly evacuations.
So, you might be wondering why the track forecasts are more accurate today than in the past. Well, the primary reason is the advancements in technology, specifically the improvements in the observing platforms (satellites, for example) and the various modeling systems we use to make forecasts. The amount and quality of data available to the models so they can paint an initial picture of the atmosphere have increased dramatically in the last 20 to 30 years. Also, the resolution and physics in the models we use today are far superior to what forecasters had available in the 1990s or prior decades, in part due to the tremendous improvements in computational capabilities. In addition, NHC has found ways to even beat the individual dynamical models by using a balance of statistical approaches and experience.
We often hear a lot of questions asking which model is the best one. Although some models are usually better than others, no model is perfect, and their performance varies from season to season and from storm to storm. Two of the most well-known models for weather forecasting are the U.S. National Weather Service’s Global Forecast System (GFS) and the European Centre for Medium-Range Weather Forecasts (ECMWF). The figure below shows a comparison of the NHC forecasts (OFCL, black) and forecasts from the GFS (GFSI, blue) and ECMWF (EMXI, red) models for Hurricanes Harvey, Irma, Maria, and Nate in 2017. In all of these cases, except for Hurricane Irma, OFCL performed as well as or better than GFSI and EMXI. Between the two models, EMXI beat GFSI for Harvey, Irma, and Nate, but GFSI beat EMXI for Maria.
Over the past decade, the average track errors of the GFSI and EMXI models have been quite close, so even though EMXI was the best-performing model most of the time in 2017, it does not mean that it will always be the best for every storm. The models that typically have the lowest errors are consensus aids, which blend several models together. Forecasters construct their own forecasts of how the storm will evolve, aided by model simulations and their knowledge of model strengths and weaknesses.
Even though our track forecasts are much more accurate today – in fact preliminary estimates are that the 2017 Atlantic track forecasts set record low errors at all time periods – typical track errors currently start off at 37 n mi at 24 hours and then increase by about 35 n mi (40 mi) per day of the forecast. This means that our 5-day track error is on average around 180 n mi (210 mi). So, keep that in mind and be sure to account for forecast uncertainty when using NHC forecasts next hurricane season.
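The rule of thumb above is easy to turn into a quick calculation. The sketch below takes the two figures quoted in this post (37 n mi at 24 hours, growing by about 35 n mi per day) and treats the growth as a straight line, which is a simplification. The function name is hypothetical.

```python
NMI_TO_STATUTE_MI = 1.15078  # one nautical mile in statute miles


def typical_track_error_nmi(lead_days: float) -> float:
    """Approximate average track error (n mi) at a given lead time.

    Uses the figures quoted in the text: 37 n mi at day 1, then roughly
    35 n mi more for each additional day (assumed to grow linearly).
    """
    if lead_days < 1:
        raise ValueError("this rule of thumb starts at 24 hours")
    return 37.0 + 35.0 * (lead_days - 1)


for day in range(1, 6):
    error_nmi = typical_track_error_nmi(day)
    error_mi = error_nmi * NMI_TO_STATUTE_MI
    print(f"Day {day}: about {error_nmi:.0f} n mi ({error_mi:.0f} mi)")
```

At day 5 the straight-line estimate comes to 177 n mi, in line with the roughly 180 n mi average quoted above.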
Intensity Forecasting (a.k.a., How Strong the Storm Will Get)
Predicting the intensity of a tropical storm or hurricane is usually more challenging than forecasting its track. This is because the intensity of these weather systems is affected by factors that are both big and small. On the large scale, vertical wind shear (the change of wind speed and direction with height) and the amount of moisture in the atmosphere greatly affect the amount or organization of the thunderstorm activity that the TC can produce. Ocean temperatures also affect the system’s intensity, with temperatures below 80°F usually being too cool to sustain significant thunderstorm activity. However, smaller-scale features can also be at play. One of the more complex phenomena that affect a TC’s intensity is an eyewall replacement cycle. Initially, when two eyewalls, one inside the other, are present, the hurricane’s wind field will begin to expand, and as the inner eyewall dies, the hurricane’s peak winds start to weaken. However, if the second eyewall contracts, the hurricane can often re-intensify. The radar image below of Hurricane Irma (2017) was taken at the beginning of an eyewall replacement cycle, when the hurricane had a double eyewall structure.
Given these complex factors and the fact that errors in the track can also affect the TC’s future intensity, we have not made as much progress in this area as we have for track forecasting. The next graphic (below) shows NHC average intensity errors for Atlantic tropical storms and hurricanes by decade starting in the 1970s. Note that only small improvements were made in the intensity predictions from the 1970s through the 2000s. A much more significant reduction in error has occurred in the current decade, which could mean that the recent investment in new models and techniques is beginning to pay off. Today’s intensity errors are close to 15 kt (17 mph) from 72 to 120 h. This number is on the order of one Saffir-Simpson category, so we often encourage those who could be affected by a TC to prepare for a storm one category stronger (on the Saffir-Simpson Hurricane Wind Scale) than what we are forecasting.
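Here is a small sketch of the “prepare for one category stronger” advice. It maps a forecast wind speed in knots to its Saffir-Simpson category and then bumps the result up by one, capped at Category 5. The function names are hypothetical, and treating the advice as a simple plus-one rule is our shorthand for illustration, not an official procedure.

```python
def saffir_simpson_category(wind_kt: float) -> int:
    """Saffir-Simpson category for a sustained wind in knots.

    Returns 0 for winds below hurricane strength (under 64 kt).
    """
    thresholds = [(137, 5), (113, 4), (96, 3), (83, 2), (64, 1)]
    for minimum_kt, category in thresholds:
        if wind_kt >= minimum_kt:
            return category
    return 0


def planning_category(forecast_wind_kt: float) -> int:
    """One category above the forecast, capped at 5 (illustrative rule)."""
    return min(5, saffir_simpson_category(forecast_wind_kt) + 1)


forecast_kt = 90  # hypothetical forecast peak wind
print(f"Forecast: Category {saffir_simpson_category(forecast_kt)}")
print(f"Prepare for: Category {planning_category(forecast_kt)}")
```

The categories are between 13 and 24 kt wide below Category 5, which is why a typical 15 kt intensity error is “on the order of” one category.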
Although the GFS and ECMWF models are skillful for track forecasting and help us understand the environment around the TC, did you know that these models are typically inadequate to predict how strong a TC might become? Both the GFS and ECMWF are global models, and they cannot “see” sufficient detail within the storm to represent and predict the core winds in the hurricane’s eyewall. Therefore, we use different models to predict intensity, some that are run at high resolution specifically for TCs (e.g., Hurricane Weather Research and Forecasting [HWRF] model, Hurricanes in a Multi-scale Ocean-coupled Non-hydrostatic [HMON] model) and some that are statistical in nature (e.g., Statistical Hurricane Intensity Prediction Scheme [SHIPS], Logistic Growth Equation Model [LGEM]). The statistical models tell the forecaster what typically occurs for a TC in a specific location and environment based on past storm behavior. Even though the intensity models are improving, the gains in these models are much smaller than what has occurred in the models we use for track forecasting.
If you want more information on the models, please visit the following page for details: http://www.nhc.noaa.gov/modelsummary.shtml
Will the errors keep decreasing?
The short answer is they likely won’t forever. At some point the forecasts made by NHC and other forecasting centers will likely reach the limits of predictability. No one knows for sure what those limits are or when they will be reached, but researchers are still providing great information that is helping NHC make steady advancements as discussed above.
For more information on the NHC and model verification please visit the following page: http://www.nhc.noaa.gov/verification/