Running with Pandas AI: an Exploration of the Boston Marathon | by Sejal Dua | May, 2023


Striding through the Boston Marathon Winners dataset from start to finish line

Photo by Miguel A Amutio on Unsplash

As I stood amidst the electrifying atmosphere of this year’s Boston Marathon, cheering on the runners with exhilaration, I began to understand the almost inarticulable magic of Boston. Marathons demonstrate that with mental fortitude, discipline, and determination, ordinary individuals can accomplish extraordinary feats. Even larger than that, marathons are a celebration of human progress and potential. With Boston being the oldest annual marathon in the world, one of the most challenging courses, and also one of the most symbolic events in terms of strength and resilience in light of the 2013 bombing, I decided to embark on a journey to explore it through a data lens.

I was able to find a Kaggle dataset with the winners (men & women) of the Boston Marathon, and their winning times. The columns of this dataset are as follows:

  • Year (int): the year in which the Boston Marathon was held and the winner obtained the title
  • Winner (str): the name of the athlete (male / female) who won the marathon in that year
  • Country (str): the country the athlete represents or is from
  • Time (time): the finishing time of the athlete in hours, minutes, and seconds
  • Distance (miles): the distance covered by the athlete in miles
  • Distance (km): the distance covered by the athlete in kilometers

Early on, I learned that there is not a publicly available dataset with the times of all of the elite finishers, which limited the types of questions I would be able to explore. However, since this Boston Marathon Winners dataset is relatively compact (126 rows x 6 columns), I thought it could be fun to try a speedier route, if you will. With the advent and democratization of ChatGPT, among other large language models (LLMs) this past year, I wanted to explore AI could be used to level up our analysis and/or bring it to a less technical audience.

Coincidentally, on my GitHub explore page, I stumbled upon Pandas AI, which is a Python library that integrates generative artificial intelligence capabilities into Pandas, making dataframes conversational. Essentially, what if we could learn a thing or two about Boston Marathon winners just by talking to our data set? That would be a game changer.

Image generated with DALL-E

Pandas AI has great documentation for how to get set up with your Open AI token, your choice of LLM, etc., so I will dive straight into the analysis, but it is worth mentioning that there are a couple of handy parameters a user can leverage when invoking Pandas AI. You can ask for a conversational response, and you can also ask Pandas AI to “show code”. In this walkthrough, I will show examples for both of these types of prompts.

Analysis

First, I downloaded the datasets from Kaggle, read them in using pandas, and set up my LLM instance.

import pandas as pd
from pandasai import PandasAI
from pandasai.llm.openai import OpenAI
import matplotlib.pyplot as plt
from dotenv import load_dotenv
load_dotenv()
import os

# Instantiate a LLM
llm = OpenAI(api_token=os.environ.get('OPENAI_KEY'))
pandas_ai = PandasAI(llm, conversational=True)

# Read in the data
df = pd.read_csv('./data/Mens_Boston_Marathon_Winners.csv')

Screenshot by author

Next, I asked Pandas AI to do the data cleaning for me because I noticed some empty rows. Namely, 2020 did not have a Boston Marathon winner due to the COVID-19 pandemic. I also noticed that some of the data types were likely not being interpreted properly.

pandas_ai.run(df, prompt='''
Clean the dataset for me, please. Drop rows with empty, not a number,
or null values. Cast Year as integers. Interepret Time as a pandas
datetime object where the format is hours, minutes, then seconds. Include comments in the code.
''',
show_code=True, is_conversational_answer=True)
Screenshot by author

It handled this task with ease. What I love most about its response is that it shows exactly how my English prompt is addressed by each line of code. This makes the package a really compelling educational tool for people who are looking to get started with Python for data analysis.

I then did a small test of Pandas AI’s quick fact retrieval capabilities. I asked it this question:

“Who won the Boston marathon in 2022? How much faster or slower was their Time compared to the Time in 1922? Please report the answer in minutes.”

After a bit of stumbling, it was able to provide some helpful code:

winner_2022 = df.loc[df['Year'] == 2022, 'Winner'].values[0]
time_2022 = df.loc[df['Year'] == 2022, 'Time'].values[0]
time_1922 = df.loc[df['Year'] == 1922, 'Time'].values[0]
time_diff = (time_2022 - time_1922).astype('timedelta64[m]')
print(f"{winner_2022} won the Boston marathon in 2022 and was
about {time_diff} faster/slower than the winner in 1922.")

Evans Chebet won the Boston marathon in 2022 and was about -12 minutes faster/slower than the winner in 1922.

Pretty cool that Pandas AI yielded this insight, but let’s just take a moment to digest how unreal this statement is. Clarence DeMar finished the 1922 Boston Marathon with a time of 2:18:10. Flash forward 100 years, and Evans Chebet finishes the 2022 Boston Marathon with a time of 2:06:51. Now, here is the kicker: the Boston marathon was actually only a 24.5 mile race in 1922. The distance of the race was increased to over 26 miles to adhere to the Olympic standard later on, which in our dataset, would be marked by the 1927 race. The 12+ minute improvement over the course of a century, marks a tremendous feat in human progress, fitness training, and also footwear innovation.

Visualization

Pivoting to some visualization prompts, we can now explore the Country column to uncover which countries have produced the most Boston Marathon winners.

pandas_ai.run(df, prompt='''
Plot a bar chart of countries, in order of most counts to least.
Use fivethirtyeight as the matplotlib style. Make the font size 10.0
''', is_conversational_answer=True)
Screenshot by author

The bar chart above demonstrates that the U.S. has dominated the Boston Marathon over the long term history of the race, followed by Kenya, Canada, Japan, Finland, and Ethiopia. This insight is to be taken with a grain of salt because we are only showcasing the winners here, and I imagine that the race did not have a lot of international participants in the early 1900s. Perhaps we can contextualize these insights by understanding when runners representing Kenya, Canada, Japan, and other countries, secured their wins, over time.

pandas_ai.run(df, prompt='''
Use the data frame to plot a scatter plot of Time and Year.
Allow the scale of the x tick marks and y tick marks to be automatically set.
''', show_code=True)
Screenshot by author

The scatterplot above shows us just how drastically winning times have dropped over the 127 year history of the Boston marathon. Individual runners can improve their personal best time for the marathon by quite a bit, if their training reflects major fitness improvement, but for the winning time to decrease by over 45 minutes, that shows true longitudinal performance evolution. As mentioned before, it is worth bearing in mind that the distance of the Boston Marathon course started at 24.5 miles, and only increased to 26+ miles in the 1920s, as evidenced by the upward spike in the points on the plot around that time. The most consistently fast finish times appear to reside during the 1990-early 2000s period. There has been an a higher amount of variance in winning times since around 2005, which could be worth looking into in the future!

Next, let’s see if we can color code this scatterplot by the Country of the winning runner. There will be too many colors going on if we represent each runner with their Country, so let’s only color-code the top countries. We can represent points for runners that do not represent top countries with an “Other” label. We will implement this visualization in 2 steps:

  1. Ask Pandas AI to create our top countries array
  2. Ask Pandas AI to color code the previously shown scatterplot by top countries

For step 1, this was the prompt given to Pandas AI:

pandas_ai.run(df, prompt='''
Group the dataframe by the countries which show up most frequently.
Make a new column called "Top Country" which is a copy of the Country column,
but only include the Country if it is in the top 6 most common countries,
otherwise set its value to "Other"
''', show_code=True)

It gave back the following code:

top_countries = df['Country'].value_counts().nlargest(6).index.tolist()
df['Top Country'] = df['Country'].apply(lambda x: x if x in top_countries else 'Other')
grouped_df = df.groupby('Top Country')

For step 2, this was the prompt given to Pandas AI:

pandas_ai.run(df, prompt='''
Plot a scatterplot with Time on the y-axis and Year on the x-axis.
Make the xlabel "Year" and the ylabel "Time".
Color each point by top_countries, which is a list that already exists.
Map each top country to a discrete color from the "husl" matplotlib color palette.
Place the legend below the plot. Make the font size 10.0.
''', is_conversational_answer=True, show_code=True)

We received back the following pretty involved code:

import matplotlib.pyplot as plt
import seaborn as sns

# create a dictionary to map each top country to a color from the "Set1" palette
color_dict = {country: sns.color_palette("husl", n_colors=len(df["Top Country"].unique()))[i] for i, country in enumerate(df["Top Country"].unique())}

# create the scatterplot
sns.scatterplot(x="Year", y="Time", hue="Top Country", data=df, palette=color_dict)

# set the xlabel and ylabel
plt.xlabel("Year")
plt.ylabel("Time")

# move the legend below the plot and set the font size
plt.legend(loc='upper center', bbox_to_anchor=(0.5, -0.2), ncol=5, fontsize=10.0)

# show the plot
plt.show()

Screenshot by author

As we can see above, the earlier hypothesis that international runners did not start accumulating Boston Marathon wins till later on is, in fact, validated by the scatterplot. Until the 1940s, American and Canadian runners seemed to dominate the Boston Marathon. From 1950 on, we see more green points, indicating a higher prevalence of runners from elsewhere. We also see a brief period of Finnish dominance in the 1950s. Then, from 1970 to present day, we see a shift from Americans owning the Boston hills to Kenyan and Ethiopian runners going for gold.

Conclusion

This is only the beginning…

In this article, I have touched on cleaning, grouping, and visualization tasks that can be accelerated with the help of Pandas AI. It was incredibly rewarding to get to take Pandas AI for a spin with this project. I am personally excited that it can make seamlessly expedite EDA work, and it only just recently became available to use. I also want to point out that it is not all unicorns and rainbows when it comes to leveraging AI today. If you are a data analyst, fear not, Pandas AI is not at all coming to replace you…

Screenshot by author

In fact, sometimes it can be a little bit naive / sarcastic in the way it responds to certain prompts (see example above). That being said, using it as a tool to boost your own productivity is something I’d definitely encourage (as long as you have considered the privacy implications of your data).

Its use cases and functionality will only continue to improve from here. In particular, I am excited to test out different LLMs, integrate it with different data user interfaces, such as Google Sheets, merge dataframes with ease, and also one day hopefully ask predictive questions, like who will win the Boston marathon next year?

Links & References



Source link

Leave a Comment