Fast String Processing with Polars — Scam Emails Dataset | by Antons Tocilins-Ruberts | May, 2023


Clean, process and tokenise texts in milliseconds using in-built Polars string expressions

Photo by Stephen Phillips – Hostreviews.co.uk on Unsplash

With the large-scale adoption of Large Language Models (LLMs), it might seem that we’re past the stage where we had to manually clean and process text data. Unfortunately, other NLP practitioners and I can attest that this is very much not the case. Clean text data is required at every level of NLP complexity, from basic text analytics to machine learning and LLMs. This post will showcase how this laborious and tedious process can be sped up significantly using Polars.

Polars is a blazingly fast DataFrame library written in Rust that handles strings very efficiently thanks to its Arrow backend. Polars stores strings in Arrow’s Utf8 format, which makes string traversal cache-optimal and predictable. It also exposes many built-in string operations under the str namespace, and these operations run in parallel. Together, these two factors make working with strings easy and fast.
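To make the "cache-optimal" claim a bit more concrete, here is a minimal plain-Python sketch of the classic Arrow variable-length string layout: all string bytes sit in one contiguous buffer and a separate offsets array marks where each value starts and ends. The helper names (to_arrow_like, get_string) are mine and purely illustrative, and the validity bitmap that Arrow also keeps is left out.

```python
# Simplified model of an Arrow-style string column: one contiguous byte
# buffer plus an offsets array (validity bitmap omitted for brevity).
def to_arrow_like(strings: list[str]) -> tuple[bytes, list[int]]:
    data = bytearray()
    offsets = [0]
    for s in strings:
        data.extend(s.encode("utf-8"))
        offsets.append(len(data))  # end of this string = start of the next
    return bytes(data), offsets


def get_string(data: bytes, offsets: list[int], i: int) -> str:
    # The i-th value is the byte slice between two neighbouring offsets.
    return data[offsets[i]:offsets[i + 1]].decode("utf-8")


data, offsets = to_arrow_like(["urgent", "next of kin", "bank"])
print(offsets)                       # [0, 6, 17, 21]
print(get_string(data, offsets, 1))  # next of kin
```

Scanning the whole column means walking one buffer front to back, rather than chasing a pointer per string as a list of Python objects would.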

The library shares a lot of syntax with Pandas, but there are also a lot of quirks that you’ll need to get used to. This post will walk you through working with strings, but for a comprehensive overview of the library I highly recommend the “Getting Started” guide.

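You can find all the code in this GitHub repo, so make sure to pull it if you want to code along (don’t forget to ⭐ it). To make this post more practical and fun, I’ll showcase how we can clean a small scam email dataset, which can be found on Kaggle (License CC BY-SA 4.0). Polars can be installed with pip (pip install polars), and the recommended Python version is 3.10. The code here was written against the Polars release available in May 2023; some of the methods used (for example the arr namespace for list columns) have since been renamed, so you may need to adapt the names on newer releases.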

The goal of this pipeline is to parse the raw text file into a DataFrame that can be used for further analytics/modelling. Here are the overall steps that will be implemented:

  1. Read in text data
  2. Extract relevant fields (e.g. sender email, subject, text, etc.)
  3. Extract useful features from these fields (e.g. length, % of digits, etc)
  4. Pre-process text for further analysis
  5. Perform some basic text analytics

Without further ado, let’s begin!

Reading Data

Assuming that the text file with emails is saved as fradulent_emails.txt (the file name is spelled this way in the code below), here’s the function used to read them in:

def load_emails_txt(path: str, split_str: str = "From r  ") -> list[str]:
    with open(path, "r", encoding="utf-8", errors="ignore") as file:
        text = file.read()

    emails = text.split(split_str)

    return emails

If you explore the text data you’ll see that the emails have two main sections

  • Metadata (starts with From r ) that contains email sender, subject, etc.
  • Email text (starts after Status: O or Status: RO )

I’m using the first pattern to split the continuous text file into a list of emails. Overall, we should be able to read in 3977 emails that we put into a Polars DataFrame for further analysis.

import polars as pl

emails = load_emails_txt("fradulent_emails.txt")
emails_pl = pl.DataFrame({"emails": emails})

print(len(emails))
>>> 3977
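One thing worth checking with this kind of split: if the file starts with the separator itself, the first element of the list is an empty string, so one of the rows may not be a real email. Here is a small sketch on a made-up two-email string (the content of raw is hypothetical, not taken from the dataset):

```python
# Tiny in-memory stand-in for the raw file (hypothetical content).
raw = (
    "From r  Wed Oct 30 21:41:56 2002\nSubject: first\nStatus: O\nHello\n"
    "From r  Thu Oct 31 08:11:39 2002\nSubject: second\nStatus: RO\nHi\n"
)

parts = raw.split("From r  ")
print(len(parts))      # 3: the text before the first separator is an empty string
print(repr(parts[0]))  # ''

# Optional: drop empty chunks before building the DataFrame.
emails = [p for p in parts if p.strip()]
print(len(emails))     # 2
```

Whether you filter these out is a judgment call; I’m only flagging it so the row count doesn’t surprise you.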

Extracting Relevant Fields

Now the tricky part begins. How do we extract relevant fields from this mess of text data? Unfortunately, the answer is regex.

Sender and Subject

Upon further inspection of metadata (below) you can see that it has fields From: and Subject: which are going to be very useful for us.

From r  Wed Oct 30 21:41:56 2002
Return-Path: <james_ngola2002@maktoob.com>
X-Sieve: cmu-sieve 2.0
Return-Path: <james_ngola2002@maktoob.com>
Message-Id: <200210310241.g9V2fNm6028281@cs.CU>
From: "MR. JAMES NGOLA." <james_ngola2002@maktoob.com>
Reply-To: james_ngola2002@maktoob.com
To: webmaster@aclweb.org
Date: Thu, 31 Oct 2002 02:38:20 +0000
Subject: URGENT BUSINESS ASSISTANCE AND PARTNERSHIP
X-Mailer: Microsoft Outlook Express 5.00.2919.6900 DM
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 8bit
Status: O

If you keep scrolling through the emails, you’ll find that there are a few formats for the From: field. The first format, shown above, has both a name and an email. The second format contains only the email, e.g. From: 123@abc.com or From: "123@abc.com". With this in mind, we’ll need three regex patterns: one for the subject and two for the sender (name with email, and just email).

email_pattern = r"From:\s*([^<\n\s]+)"
subject_pattern = r"Subject:\s*(.*)"
name_email_pattern = r'From:\s*"?([^"<]+)"?\s*<([^>]+)>'
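Before handing these patterns to Polars, it can help to try them on a single header. The sketch below uses Python’s re module on a shortened copy of the header shown earlier. Polars uses a Rust regex engine, so treat re as a stand-in here; for patterns this simple I’d expect the same matches.

```python
import re

# Shortened copy of the sample header from the dataset.
header = '''From r  Wed Oct 30 21:41:56 2002
Return-Path: <james_ngola2002@maktoob.com>
From: "MR. JAMES NGOLA." <james_ngola2002@maktoob.com>
Reply-To: james_ngola2002@maktoob.com
Subject: URGENT BUSINESS ASSISTANCE AND PARTNERSHIP
Status: O
'''

email_pattern = r"From:\s*([^<\n\s]+)"
subject_pattern = r"Subject:\s*(.*)"
name_email_pattern = r'From:\s*"?([^"<]+)"?\s*<([^>]+)>'

name_match = re.search(name_email_pattern, header)
print(name_match.group(1))  # MR. JAMES NGOLA.
print(name_match.group(2))  # james_ngola2002@maktoob.com
print(re.search(subject_pattern, header).group(1))  # URGENT BUSINESS ASSISTANCE AND PARTNERSHIP
print(re.search(email_pattern, "From: 123@abc.com\n").group(1))  # 123@abc.com
```

Note that "From r" at the top has no colon, so none of the patterns pick it up by accident.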

Polars has an str.extract method that can compare the above patterns to our text and (you guessed it) extract the matching groups. Here’s how you can apply it to the emails_pl DataFrame.

emails_pl = emails_pl.with_columns(
    # Extract the first match group as the sender name
    pl.col("emails").str.extract(name_email_pattern, 1).alias("sender_name"),
    # Extract the second match group as the sender email
    pl.col("emails").str.extract(name_email_pattern, 2).alias("sender_email"),
    # Extract the subject
    pl.col("emails").str.extract(subject_pattern, 1).alias("subject"),
).with_columns(
    # In cases where we didn't extract an email
    pl.when(pl.col("sender_email").is_null())
    # Try another pattern (just email)
    .then(pl.col("emails").str.extract(email_pattern, 1))
    # If we do have an email, do nothing
    .otherwise(pl.col("sender_email"))
    .alias("sender_email")
)

As you can see, besides str.extract we’re also using a pl.when().then().otherwise() expression (the Polars version of if/else) to account for the second, email-only pattern. If you print out the results, you’ll see that in most cases it worked correctly (and incredibly fast). We now have sender_name, sender_email and subject fields for our analysis.

Polars DF sample. Screenshot by author.
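If the when/then/otherwise chain feels abstract, here is the same fallback written as an ordinary Python function for a single email. The function name extract_sender_email and the test strings are hypothetical, and re stands in for the Rust regex engine that Polars uses.

```python
import re
from typing import Optional

EMAIL_ONLY = r"From:\s*([^<\n\s]+)"
NAME_EMAIL = r'From:\s*"?([^"<]+)"?\s*<([^>]+)>'


def extract_sender_email(raw: str) -> Optional[str]:
    """Plain-Python mirror of the when/then/otherwise fallback."""
    match = re.search(NAME_EMAIL, raw)
    if match:  # "otherwise" branch: we already have an email, keep it
        return match.group(2)
    match = re.search(EMAIL_ONLY, raw)  # "then" branch: try the email-only pattern
    return match.group(1) if match else None


print(extract_sender_email('From: "MR. X" <x@abc.com>\n'))  # x@abc.com
print(extract_sender_email("From: 123@abc.com\n"))          # 123@abc.com
print(extract_sender_email('From: "123@abc.com"\n'))        # "123@abc.com"
print(extract_sender_email("no sender here"))               # None
```

The third case keeps the surrounding quotes, which is one reason the results are right "in most cases" rather than all of them; stripping the quotes afterwards would be an easy follow-up.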

Email Text

As was noted above, the actual email text starts after Status: O or Status: RO (in the mbox convention, O marks a message as old and R as read), which means that we can use this pattern to split the email into “metadata” and “text” parts. Below you can see the three steps that we need to take to extract the required field, along with the corresponding Polars method for each.

  1. Replace Status: RO with Status: O so that we only have one “split” pattern (use str.replace)
  2. Split the actual string by Status: O (use str.split)
  3. Get the second element (text) of the resulting list (use arr.get(1))

emails_pl = emails_pl.with_columns(
    # Apply operations to the emails column
    pl.col("emails")
    # Make these two statuses the same
    .str.replace("Status: RO", "Status: O", literal=True)
    # Split using the status string
    .str.split("Status: O")
    # Get the second element
    .arr.get(1)
    # Rename the field
    .alias("email_text")
)
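The three steps above can be sketched for a single email in plain Python. The function name extract_body is mine, and returning None when the marker is missing is a choice made for this sketch; how Polars treats a missing second element may depend on the version you run.

```python
from typing import Optional


def extract_body(raw: str) -> Optional[str]:
    # Step 1: normalise the first "Status: RO" to "Status: O"
    # (str.replace in Polars also touches only the first match).
    normalised = raw.replace("Status: RO", "Status: O", 1)
    # Step 2: split on the status marker.
    parts = normalised.split("Status: O")
    # Step 3: take the second element, or None when the marker is missing.
    return parts[1] if len(parts) > 1 else None


print(repr(extract_body("Subject: hi\nStatus: RO\nDear friend")))  # '\nDear friend'
print(extract_body("Subject: hi\nno status header"))               # None
```

One caveat that follows from taking only the second element: if the marker text happens to appear again inside the body, everything after that second occurrence is cut off.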

Et voilà! We have extracted important fields in just a few milliseconds. Let’s put it all into one coherent function that we can later use in the pipeline.

def extract_fields(emails: pl.DataFrame) -> pl.DataFrame:
    email_pattern = r"From:\s*([^<\n\s]+)"
    subject_pattern = r"Subject:\s*(.*)"
    name_email_pattern = r'From:\s*"?([^"<]+)"?\s*<([^>]+)>'

    emails = (
        emails.with_columns(
            pl.col("emails").str.extract(name_email_pattern, 2).alias("sender_email"),
            pl.col("emails").str.extract(name_email_pattern, 1).alias("sender_name"),
            pl.col("emails").str.extract(subject_pattern, 1).alias("subject"),
        )
        .with_columns(
            pl.when(pl.col("sender_email").is_null())
            .then(pl.col("emails").str.extract(email_pattern, 1))
            .otherwise(pl.col("sender_email"))
            .alias("sender_email")
        )
        .with_columns(
            pl.col("emails")
            .str.replace("Status: RO", "Status: O", literal=True)
            .str.split("Status: O")
            .arr.get(1)
            .alias("email_text")
        )
    )

    return emails

Now, we can move on to the feature generation part.

Feature Engineering

From personal experience, scam emails tend to be very detailed and long (since scammers are trying to win your trust), so the character length of an email is going to be quite informative. They also make heavy use of exclamation points and digits, so calculating the proportion of non-alphabetic characters in an email can also be useful. Finally, scammers love to use caps lock, so let’s calculate the proportion of capital letters as well. There are, of course, many more features we could create, but to keep this post from getting too long, let’s focus on these three.

The first feature can be created very easily using the built-in str.n_chars() function. The other two features can be computed using regex and str.count_match(). Below you can find the function that calculates these three features. Like the previous function, it uses with_columns() to carry over the existing columns and create the new ones on top of them.

def email_features(data: pl.DataFrame, col: str) -> pl.DataFrame:
    data = data.with_columns(
        pl.col(col).str.n_chars().alias(f"{col}_length"),
    ).with_columns(
        (pl.col(col).str.count_match(r"[A-Z]") / pl.col(f"{col}_length")).alias(
            f"{col}_percent_capital"
        ),
        (pl.col(col).str.count_match(r"[^A-Za-z ]") / pl.col(f"{col}_length")).alias(
            f"{col}_percent_digits"
        ),
    )

    return data
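To see what these three numbers look like on a concrete string, here is a plain-Python version of the same arithmetic. The function name text_features and the sample string are made up for illustration, and the guard for empty strings is my addition.

```python
import re
from typing import Optional


def text_features(text: str) -> Optional[dict]:
    length = len(text)  # number of characters, like str.n_chars()
    if length == 0:
        return None     # avoid dividing by zero for empty strings
    return {
        "length": length,
        "percent_capital": len(re.findall(r"[A-Z]", text)) / length,
        # Everything that is neither an ASCII letter nor a space
        "percent_digits": len(re.findall(r"[^A-Za-z ]", text)) / length,
    }


features = text_features("WIN $1000 NOW!!")
print(features["length"])           # 15
print(features["percent_capital"])  # 0.4
```

Despite its name, the percent_digits feature counts every character that is not a letter or a space, so punctuation and newlines are included as well as digits.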

Text Cleaning

If you print out a few of the emails we’ve extracted, you’ll notice some things that need to be cleaned. For example:

  • HTML tags are still present in some of the emails
  • Lots of non-alphabetic characters are used
  • Some emails are written in uppercase, some in lowercase, and some are mixed

Same as above, we’re going to use regular expressions to clean up the data. However, now the method of choice is str.replace_all because we want to replace all the matched instances, not just the first one. Additionally, we’ll use str.to_lowercase() to make all text lowercase.

emails_pl = emails_pl.with_columns(
    # Apply operations to the emails text column
    pl.col("email_text")
    # Remove all the data in <..> (HTML tags)
    .str.replace_all(r"<.*?>", "")
    # Replace non-alphabetic characters (except whitespace) in text
    .str.replace_all(r"[^a-zA-Z\s]+", " ")
    # Replace multiple whitespaces with one whitespace
    # We need to do this because of the previous cleaning step
    .str.replace_all(r"\s+", " ")
    # Make all text lowercase
    .str.to_lowercase()
    # Keep the field's name
    .keep_name()
)

Now, let’s refactor this chain of operations into a function, so that it could be applied to the other columns of interest as well.

def email_clean(
    data: pl.DataFrame, col: str, new_col_name: str | None = None
) -> pl.DataFrame:
    data = data.with_columns(
        pl.col(col)
        .str.replace_all(r"<.*?>", " ")
        .str.replace_all(r"[^a-zA-Z\s]+", " ")
        .str.replace_all(r"\s+", " ")
        .str.to_lowercase()
        .alias(new_col_name if new_col_name is not None else col)
    )

    return data
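Here is the same chain of replacements applied to one made-up string with Python’s re module, so you can see what each step leaves behind. As before, re is only a stand-in for the Polars regex engine, and clean_text is a name I picked for this sketch.

```python
import re


def clean_text(text: str) -> str:
    text = re.sub(r"<.*?>", " ", text)         # drop anything inside <...>
    text = re.sub(r"[^a-zA-Z\s]+", " ", text)  # runs of non-letters become a space
    text = re.sub(r"\s+", " ", text)           # collapse whitespace runs
    return text.lower()


print(repr(clean_text("<b>URGENT</b>: Send $5,000 to   me!")))
# ' urgent send to me '
```

The result keeps a leading and a trailing space, because the tags and the final "!" were each replaced by a space rather than removed.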

Text Tokenisation

As a final step in the pre-processing pipeline, we’re going to tokenise the text. Tokenisation uses the already familiar str.split() method, with a single whitespace character as the split token.

emails_pl = emails_pl.with_columns(
    pl.col("email_text").str.split(" ").alias("email_text_tokenised")
)

Again, let’s put this code into a function for our final pipeline.

def tokenise_text(data: pl.DataFrame, col: str, split_token: str = " ") -> pl.DataFrame:
    data = data.with_columns(pl.col(col).str.split(split_token).alias(f"{col}_tokenised"))

    return data
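A quick plain-Python illustration of what splitting on a single space does to a cleaned string that still has spaces at both ends (the sample string is made up). I’d expect the Polars split to behave the same way here, but that is an assumption worth verifying on your own data.

```python
cleaned = " urgent send to me "

tokens = cleaned.split(" ")
print(tokens)  # ['', 'urgent', 'send', 'to', 'me', '']

# Python's split() without an argument would drop the empty strings instead.
print(cleaned.split())  # ['urgent', 'send', 'to', 'me']
```

Those empty tokens are harmless as long as a later step filters them out.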

Removing Stop Words

If you’ve worked with text data before, you know that stop word removal is a key step in pre-processing tokenised texts. Removing these words allows us to focus the analysis only on the important parts of the text.

To remove these words, we first need to define them. Here, I’m going to use the default set of English stop words from the nltk library plus a set of HTML-related words.

# Requires the nltk stop word corpus: nltk.download("stopwords")
from nltk.corpus import stopwords

stops = set(
    stopwords.words("english")
    + ["", "nbsp", "content", "type", "text", "charset", "iso", "qzsoft"]
)

Now, we need to find out whether these words exist in the tokenised array and, if they do, drop them. For this we’ll use the arr.eval method because it allows us to run Polars expressions (e.g. .is_in) against every element of the tokenised list. Make sure to read the comments below to understand what each line does, as this part of the code is more complicated.

emails_pl = emails_pl.with_columns(
    # Apply to the tokenised column (it's a list)
    pl.col("email_text_tokenised")
    # Keep an element only if it is not a stop word and is longer than
    # 2 characters; everything else becomes null
    .arr.eval(
        pl.when(
            (~pl.element().is_in(stops)) & (pl.element().str.n_chars() > 2)
        ).then(pl.element())
    )
    # Drop the nulls (the items that were stop words or too short)
    .arr.eval(pl.element().drop_nulls())
    .keep_name()
)

As usual, let’s refactor this bit of code into a function for our final pipeline.

def remove_stopwords(
    data: pl.DataFrame, stopwords: set | list, col: str
) -> pl.DataFrame:
    data = data.with_columns(
        pl.col(col)
        .arr.eval(pl.when(~pl.element().is_in(stopwords)).then(pl.element()))
        .arr.eval(pl.element().drop_nulls())
    )
    return data

While this pattern might seem quite complicated it’s well worth it to use the pre-defined str and arr expressions to optimise the performance.
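If the two-step "mark, then drop" logic is hard to picture, here it is for one token list in plain Python. The stop word set below is a tiny hypothetical one, not the nltk list, and the length filter mirrors the inline example rather than the refactored function.

```python
# Hypothetical, tiny stop word set; the article uses nltk's English list.
stops = {"", "to", "me", "the", "nbsp"}

tokens = ["", "urgent", "send", "to", "me", "nbsp", "bank", ""]

# Step 1 (when/then): keep the token if it passes, otherwise mark it as None.
marked = [t if (t not in stops and len(t) > 2) else None for t in tokens]
print(marked)  # [None, 'urgent', 'send', None, None, None, 'bank', None]

# Step 2 (drop_nulls): remove the markers.
kept = [t for t in marked if t is not None]
print(kept)    # ['urgent', 'send', 'bank']
```

In plain Python you would normally do this in a single comprehension; the two steps are spelled out only to line up with the two arr.eval calls.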

Full Pipeline

So far, we’ve defined pre-processing functions and seen how they can be applied to a single column. Polars provides a very handy pipe method that allows us to chain operations specified as functions. Here’s what the final pipeline looks like:

emails = load_emails_txt("fradulent_emails.txt")
emails_pl = pl.DataFrame({"emails": emails})

emails_pl = (
    emails_pl.pipe(extract_fields)
    .pipe(email_features, "email_text")
    .pipe(email_features, "sender_email")
    .pipe(email_features, "subject")
    .pipe(email_clean, "email_text")
    .pipe(email_clean, "sender_name")
    .pipe(email_clean, "subject")
    .pipe(tokenise_text, "email_text")
    .pipe(tokenise_text, "subject")
    .pipe(remove_stopwords, stops, "email_text_tokenised")
    .pipe(remove_stopwords, stops, "subject_tokenised")
)

Notice that now we can easily apply all the feature engineering, cleaning, and tokenisation functions to all the extracted columns and not just the email text like in the examples above.
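In case pipe is new to you: it simply calls the function you give it with the object as the first argument and passes the remaining arguments along. The sketch below shows the idea on a hypothetical list subclass (Words, add_suffix and keep_longer_than are all made up for this example and have nothing to do with Polars itself).

```python
class Words(list):
    """Hypothetical list subclass with a DataFrame-style pipe method."""

    def pipe(self, func, *args, **kwargs):
        # Call func with this object as the first argument.
        return func(self, *args, **kwargs)


def add_suffix(words, suffix):
    return Words(w + suffix for w in words)


def keep_longer_than(words, n):
    return Words(w for w in words if len(w) > n)


words = Words(["bank", "kin", "transfer"])

# Nested calls read inside-out ...
nested = keep_longer_than(add_suffix(words, "!"), 4)
# ... while piping reads top to bottom, like the pipeline above.
piped = words.pipe(add_suffix, "!").pipe(keep_longer_than, 4)

print(piped)            # ['bank!', 'transfer!']
print(nested == piped)  # True
```

The only requirement is that every step takes the object as its first parameter and returns something that also has a pipe method, which is why all the functions in this post take and return a DataFrame.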

If you’ve got this far, great job! We’ve read in, cleaned, processed, and tokenised ~4k text records, and done some basic feature engineering on them, in under a second (at least on my Mac M2 machine). Now, let’s enjoy the fruits of our labor and do some basic text analysis.

First of all, let’s look at the word cloud of the email texts and marvel at all the silly things we can find.

import matplotlib.pyplot as plt
from wordcloud import WordCloud

# Word cloud function
def generate_word_cloud(text: str):
    wordcloud = WordCloud(
        max_words=100, background_color="white", width=1600, height=800
    ).generate(text)

    plt.figure(figsize=(20, 10), facecolor="k")
    plt.imshow(wordcloud)
    plt.axis("off")
    plt.tight_layout(pad=0)
    plt.show()

# Prepare data for word cloud
text_list = emails_pl.select(pl.col("email_text_tokenised").arr.join(" "))[
    "email_text_tokenised"
].to_list()
all_emails = " ".join(text_list)

generate_word_cloud(all_emails)

Email text word cloud. Generated by the author.

Bank accounts, next of kin, security companies, and deceased relatives: it has got it all. Let’s see what these look like for text clusters created using simple TF-IDF and K-Means.

import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# TF-IDF with 500 words
vectorizer = TfidfVectorizer(max_features=500)
transformed_text = vectorizer.fit_transform(text_list)
tf_idf = pd.DataFrame(transformed_text.toarray(), columns=vectorizer.get_feature_names_out())

# Cluster into 5 clusters
n = 5
cluster = KMeans(n_clusters=n, n_init='auto')
clusters = cluster.fit_predict(tf_idf)

# Draw one word cloud per cluster
for c in range(n):
    cluster_texts = np.array(text_list)[clusters==c]
    cluster_text = ' '.join(list(cluster_texts))

    generate_word_cloud(cluster_text)
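For readers who haven’t met TF-IDF before, here is a rough plain-Python sketch of the weighting on two made-up token lists. It assumes scikit-learn’s documented defaults (smoothed IDF and L2 normalisation) and skips the vectoriser’s own tokenisation and the max_features cut, so it shows the idea rather than reproducing the exact numbers. The function name tf_idf is mine.

```python
import math
from collections import Counter


def tf_idf(docs: list[list[str]]) -> list[dict[str, float]]:
    n_docs = len(docs)
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for doc in docs for term in set(doc))
    # Smoothed inverse document frequency (rarer terms get a larger weight).
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in df.items()}

    vectors = []
    for doc in docs:
        counts = Counter(doc)  # raw term counts in this document
        weights = {t: c * idf[t] for t, c in counts.items()}
        # L2-normalise so long and short emails are comparable.
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({t: w / norm for t, w in weights.items()})
    return vectors


docs = [["bank", "transfer", "bank"], ["bank", "kin"]]
vectors = tf_idf(docs)
for vec in vectors:
    print({t: round(w, 3) for t, w in vec.items()})
```

In the second document "kin" ends up with a higher weight than "bank", because "bank" appears in both documents and is therefore less distinctive. K-Means then groups emails whose weight vectors point in similar directions.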

Below you can see a few interesting clusters that I’ve identified:

Besides these, I also found a few nonsense clusters, which means that there is still room for improvement when it comes to text cleaning. Still, it looks like we were able to extract useful clusters, so let’s call it a success. Let me know which clusters you find!

This post has covered a wide variety of pre-processing and cleaning operations that the Polars library allows you to do. We’ve seen how to use Polars to:

  • Extract specific patterns from texts
  • Split texts into lists based on a token
  • Calculate lengths and the number of matches in texts
  • Clean texts using regex
  • Tokenise texts and filter for stop words

I hope that this post was useful to you and you’ll give Polars a chance in your next NLP project. Please consider subscribing, clapping and commenting below.


Radev, D. (2008), CLAIR collection of fraud email, ACL Data and Code Repository, ADCR2008T001, http://aclweb.org/aclwiki

Project Github https://github.com/aruberts/tutorials/tree/main/metaflow/fraud_email

Polars User Guide https://pola-rs.github.io/polars-book/user-guide/


