Does rain predict rain? | Towards Data Science

Introducing useful climate datasets and validating a global warming prediction

Photo by Wim van ‘t Einde on Unsplash

During a dreary stretch of June and July in Boston, it seemed to rain every time my family had planned to do something fun. We started to wonder if we were stuck in a rainy pattern and asked, “Does the fact that it has rained a lot for three days straight, make it likely to rain tomorrow?” I realized that this question is easy to answer using available weather data.

This article presents the US weather datasets I used, the Python/pandas program I wrote to analyze the data, and the results. In short, yes, a stretch of rainy days strongly predicts more rain. And, surprisingly, the longer the stretch of rain, the more likely it is to rain the next day. The results also prove a prediction from global warming models — there is more rain now than in previous years.

There are two key datasets about rainfall from the US Oceanic and Atmospheric Administration (NOAA). I used Hourly Precipitation Data (HPD). The description page is generally helpful, but the link to the dataset is wrong and points to an older version. The new dataset is here and covers the period 1940 to 2022. (2023 data will be added.) HPD has fine granularity with hourly precipitation amounts from more than 2000 collection stations across the US. The data files contain all years for one station in each CSV file. I used only the daily totals, but the hourly information might be useful for future analysis.

What about when it snows rather than rains? Snow accumulation is melted to find the equivalent rain amount. So all data in HPD includes liquid rain, melted snow, and everything in between such as slush and hail.

There is another valuable dataset named Local Climatological Data (LCD) that could be used for similar analysis. LCD contains far more information than just precipitation and also includes temperature, sunrise/sunset, air pressure, visibility, wind speed, fog, smoke, monthly summaries, and more. LCD is updated daily, so it contains yesterday’s data. To use it you need to decode the Integrated Surface Dataset (ISD) station numbers.

The rain analysis program is written in Python/pandas. I wrote the code to be readable as is, but it is worth exploring some specific features.

The program can read the full list of HPD stations or a specific list of stations from a text file. This feature is used to re-run the program with various parameters while making sure to use the exact same stations as a previous run.

from rain_helpers import ALL_STATION_FILES
STATION_LIST_INPUT = "/Users/chuck/Desktop/Articles/hpd_stations_used_list_1940-1950.txt"
ALL_STATIONS = True # use every station, or a specific list ?

# Choose either all stations we know about, or a specific list of stations (usually from a previous run of this program)

station_files = ALL_STATION_FILES
with open(STATION_LIST_INPUT, 'r') as fp:
data =
station_files = data.split("n")

Another handy feature is the ability to choose a subset of the station files. You can run the code very quickly with 1/100 of the stations for debugging, or something like 1/3 for an accurate approximation of the results. By the Law Of Large Numbers, my tests with 1/3 (about 600 stations) produced results that are nearly identical to the full dataset.

SKIP_COUNT = 3 # 1 = don't skip any.

for i in range (0, len(station_files), SKIP_COUNT):
station_url = HPD_LOCAL_DIR + station_files[i]
stationDF = pd.read_csv(station_url, sep=',', header='infer', dtype=str)

Another speed improvement is to download all of the stations files to your local machine, so you don’t have to fetch them from NOAA each time. The full set is about 20GB. If you don’t have this extra space, the code runs just fine while reading from the cloud.

HPD_CLOUD_DIR = ""  # Hourly Precipitation Data (HPD)
HPD_LOCAL_DIR = "/Users/chuck/Desktop/Articles/NOAA/HPD/"

station_url = HPD_LOCAL_DIR + station_files[i] # toggle between local and cloud

The trickiest part of the code is the look-back on each date to see if it has been rainy for a stretch of previous days. The problem is that the data to be looked up is within the very same DataFrame, a self-join. It is tempting to use a loop over the DataFrame and, for each row, look up previous dates as you go. But loops over large data structures are bad style in any programming language, especially pandas. My code solves this problem by taking a snapshot of the DataFrame, creating fields on each row that contain the nine previous dates (and one for tomorrow), and then using those fields to join with the snapshot.

    # Grab a snapshot for a self-join later. Adjust fields names to avoid confusion after the join.

stationCopyDF = stationDF
stationCopyDF = stationCopyDF[["STATION","DATE","DlySumToday"]] # keep just what we need
stationCopyDF = stationCopyDF.rename({"DlySumToday":"DlySumOther", "DATE":"DATEother"}, axis='columns')

# Add in some other dates, for which we will pull in rainfall.

stationDF["DATE_minus9"] = stationDF["DATE"] - pd.offsets.Day(9)
stationDF["DATE_minus8"] = stationDF["DATE"] - pd.offsets.Day(8)
stationDF["DATE_minus1"] = stationDF["DATE"] - pd.offsets.Day(1)
stationDF["DATE_plus1"] = stationDF["DATE"] + pd.offsets.Day(1)

# Join other rainfall onto base record. Adjust column names to make clear what we did.

stationDF = stationDF.merge(stationCopyDF, how='inner', left_on=["STATION","DATE_minus9"], right_on = ["STATION","DATEother"])
stationDF = stationDF.rename({"DlySumOther":"DlySum9DaysAgo"}, axis='columns')
stationDF = stationDF.drop(columns=["DATEother"])

stationDF = stationDF.merge(stationCopyDF, how='inner', left_on=["STATION","DATE_minus8"], right_on = ["STATION","DATEother"])
stationDF = stationDF.rename({"DlySumOther":"DlySum8DaysAgo"}, axis='columns')
stationDF = stationDF.drop(columns=["DATEother"])


stationDF = stationDF.merge(stationCopyDF, how='inner', left_on=["STATION","DATE_minus1"], right_on = ["STATION","DATEother"])
stationDF = stationDF.rename({"DlySumOther":"DlySum1DayAgo"}, axis='columns')
stationDF = stationDF.drop(columns=["DATEother"])

stationDF = stationDF.merge(stationCopyDF, how='inner', left_on=["STATION","DATE_plus1"], right_on = ["STATION","DATEother"])
stationDF = stationDF.rename({"DlySumOther":"DlySumTomorrow"}, axis='columns')
stationDF = stationDF.drop(columns=["DATEother"])

After getting the previous days’ rainfalls on each row, the code easily finds the length of each rainy period. Note that when calculating how many days it has been raining, today counts as a day.

    stationDF["DaysOfRain"] = 0   
stationDF.loc[(stationDF["DlySumToday"] >= RAINY), "DaysOfRain"] = 1
stationDF.loc[(stationDF['DlySumToday'] >= RAINY) & (stationDF['DlySum1DayAgo'] >= RAINY), 'DaysOfRain'] = 2
stationDF.loc[(stationDF['DlySumToday'] >= RAINY) & (stationDF['DlySum1DayAgo'] >= RAINY) & (stationDF['DlySum2DaysAgo'] >= RAINY), 'DaysOfRain'] = 3
... etc

Using the years 2000 to 2021 inclusive, there are 1808 stations with valid data, with 8,967,394 data points (a date, location, and rain amount).

  • The average rainfall over all data points was 0.0983 inches, or about 1/10 of an inch.
  • The fraction of days it was rainy (≥ 0.5 inches) was 6.2%.
  • The fraction of dry days (≤ 0.05 inches) was 78.0%.

The answer to the original question that prompted this project is,

Yes, rainy days predict rain tomorrow. The longer it has been raining (up to 8 days), the more likely it is to rain again.

And a related result…

Rainy days predict how much rain is expected tomorrow. The longer it has been raining (up to 7 days), the more rain is likely tomorrow.

Two charts show this result.

Chance of Rain Tomorrow vs Days of Rain (image by Author)
Amt of Rain Tomorrow vs Days of Rain (image by Author)

I experimented with different settings for “rainy day”, changing it from 0.5 inches to 0.75 and 1.0. These changes show the same general phenomena of rainy days predicting more rain, but without perfect correlation through eight days. The definition of “rainy” as 0.5 inches seems to be the sweet spot for predicting tomorrow’s rain.

You might wonder where it could rain for 10 days straight. Over 22 years across the entire US, with almost nine million data points, there were only 118 runs of such weather. Some of the places were: Boca Raton, FL; San Juan, PR; Kahuna Falls, HI; Kaumana, HI; Kihalani, HI; Paakea, HI; Pascagoula, MS; Quinault, WA; and Quilcene, WA.

Sequential days of dryness (< 0.05 inches of rain) also correlate well with dryness the next day, but the prediction is not as strong because the chances of a dry day tomorrow are very close to each other. The chance of a dry day tomorrow is always near the overall mean of 78%.

Sequential days of dryness are somewhat better at predicting the amount of rain expected the next day.

Amt of Rain Tomorrow vs Days of Dry (image by Author)

An obvious related question is whether the results described here have changed as the earth’s temperature rises from climate change. I ran the same analysis of US data from 1940 to 1960, 1960 to 1980, and 1980 to 2000.

The central result is the same — rainy days predict more rain. The exact numbers are slightly different in each time span, but they do not change the strong correlation. For example, from 1960 to 1980, there were 1388 stations with valid data and 6,807,917 data points, with these results:

Fraction of days it is rainy after 1 rainy day = 17.3%
Fraction of days it is rainy after 2 rainy days = 19.6%
Fraction of days it is rainy after 3 rainy days = 27.4%
Fraction of days it is rainy after 4 rainy days = 37.1%
Fraction of days it is rainy after 5 rainy days = 43.8%
Fraction of days it is rainy after 6 rainy days = 51.5%
Fraction of days it is rainy after 7 rainy days = 52.4%

A more important corollary, predicted by climate change models, is that as the earth warms up, it will rain more. The HPD dataset can verify this, at least over the past 80 years.

The simple approach would be to take all of the current climate stations (about 2000) and look at rainfall data during each decade. But there is a potential bias with doing that. There are more weather stations now than there were in 1940, since stations have gradually been added over the past 80 years. It is possible that newer stations have been built in rainy places. If that is the case, newer data would show more rain, but only because the overall set of stations is wetter than the 1940 set.

A more accurate approach is to find the set of stations that had rain data in the 1940s, and then use the same stations for every decade. My program can do this because it emits the list of stations actually used on each run. So first I found data for the years 1940 to 1950, and then re-used the emitted station list again for 1950 to 1960, then 1960 to 1970, etc. This is about 840 stations with between 400K and 2.5M data points in the various decades.

The average rainfall in each decade should be very close — by the Law Of Large Numbers again. But the chart below shows significantly increased rain over the same collection stations. This is a remarkable result supporting a key predication of global warming models.

Average Rainfall vs Decade (image by Author) — How the US National Weather Service makes forecasts. — Overview of US Oceanic and Atmospheric Administration datasets. — US Environmental Protection Agency report on precipitation increase from climate change.

Source link

Leave a Comment