How to Chunk Text Data — A Comparative Analysis | by Solano Todeschini | Jul, 2023


Exploring distinct approaches to text chunking.

Image compiled by the author. Pineapple image from Canva.

The ‘Text chunking’ process in Natural Language Processing (NLP) involves the conversion of unstructured text data into meaningful units. This seemingly simple task belies the complexity of the various methods employed to achieve it, each with its strengths and weaknesses.

At a high level, these methods typically fall into one of two categories. The first, rule-based methods, hinge on the use of explicit separators such as punctuation or space characters, or the application of sophisticated systems like regular expressions, to partition text into chunks. The second category, semantic clustering methods, leverages the inherent meaning embedded in the text to guide the chunking process. These might utilize machine learning algorithms to discern context and infer natural divisions within the text.

In this article, we’ll explore and compare these two distinct approaches to text chunking. We’ll represent rule-based methods with NLTK, Spacy, and Langchain, and contrast this with two different semantic clustering techniques: KMeans and a custom technique for Adjacent Sentence Clustering.

The goal is to equip practitioners with a clear understanding of each method’s pros, cons, and ideal use cases to enable better decision-making in their NLP projects.

In Brazilian slang, “abacaxi,” which translates to “pineapple,” signifies “something that doesn’t yield a good outcome, a tangled mess, or something that is no good.”

Use Cases for Text Chunking

Text chunking can be used by several different applications:

  1. Text Summarization: By breaking down large bodies of text into manageable chunks, we can summarize each section individually, leading to a more accurate overall summary.
  2. Sentiment Analysis: Analyzing the sentiment of shorter, coherent chunks can often yield more precise results than analyzing an entire document.
  3. Information Extraction: Chunking helps in locating specific entities or phrases within text, enhancing the process of information retrieval.
  4. Text Classification: Breaking down text into chunks allows classifiers to focus on smaller, contextually meaningful units rather than entire documents, which can improve performance.
  5. Machine Translation: Translation systems often operate on chunks of text rather than on individual words or whole documents. Chunking can aid in maintaining the coherence of the translated text.

Understanding these use cases can help in choosing the most suitable chunking technique for your specific project.

In this part of the article, we will compare popular methods for chunking unstructured text: NLTK Sentence Tokenizer, Spacy Sentence Splitter, Langchain Text Splitter, KMeans Clustering, and Clustering Adjacent Sentences based on similarity.

In the following examples, we evaluate each technique on text extracted from a PDF, processing it into sentences and then into clusters.

The data we used was a PDF exported from the Wikipedia page on Brazil.

To extract the text from the PDF, before splitting it into sentences with NLTK, we use the following function:

from PyPDF2 import PdfReader
import nltk
nltk.download('punkt')

# Extracting Text from PDF
def extract_text_from_pdf(file_path):
    with open(file_path, 'rb') as file:
        pdf = PdfReader(file)
        # Guard against pages that yield no text (e.g. scanned images)
        text = " ".join(page.extract_text() or "" for page in pdf.pages)
    return text

# Placeholder path: point this at your own exported PDF
file_path = "brazil_wikipedia.pdf"

# Extract the text from the PDF
text = extract_text_from_pdf(file_path)

This leaves us with a single string, text, that is 210964 characters long.

Here’s a sample of the Wiki text:

sample = text[1015:3037]
print(sample)

"""
=======
Output:
=======

Brazil is the world's fifth-largest country by area and the seventh most popul ous. Its capital
is Brasília, and its most popul ous city is São Paulo. The federation is composed of the union of the 26
states and the Federal District. It is the only country in the Americas to have Portugue se as an official
langua ge.[11][12] It is one of the most multicultural and ethnically diverse nations, due to over a century of
mass immigration from around t he world,[13] and the most popul ous Roman Catholic-majority country.
Bounde d by the Atlantic Ocean on the east, Brazil has a coastline of 7,491 kilometers (4,655 mi).[14] It
borders all other countries and territories in South America except Ecuador and Chile and covers roughl y
half of the continent's land area.[15] Its Amazon basin includes a vast tropical forest, home to diverse
wildlife, a variety of ecological systems, and extensive natural resources spanning numerous protected
habitats.[14] This unique environmental heritage positions Brazil at number one of 17 megadiverse
countries, and is the subject of significant global interest, as environmental degradation through processes
like deforestation has direct impacts on gl obal issues like climate change and biodiversity loss.
The territory which would become know n as Brazil was inhabited by numerous tribal nations prior to the
landing in 1500 of explorer Pedro Álvares Cabral, who claimed the discovered land for the Portugue se
Empire. Brazil remained a Portugue se colony until 1808 when the capital of the empire was transferred
from Lisbon to Rio de Janeiro. In 1815, the colony was elevated to the rank of kingdom upon the
formation of the United Kingdom of Portugal, Brazil and the Algarves. Independence was achieved in
1822 with the creation of the Empire of Brazil, a unitary state gove rned unde r a constitutional monarchy
and a parliamentary system. The ratification of the first constitution in 1824 led to the formation of a
bicameral legislature, now called the National Congress.
"""

The Natural Language Toolkit (NLTK) provides a useful function for splitting text into sentences. This sentence tokenizer divides a given block of text into its component sentences, which can then be used for further processing.

Implementation

Here’s an example of using the NLTK sentence tokenizer:

import nltk
nltk.download('punkt')

# Splitting Text into Sentences
def split_text_into_sentences(text):
    sentences = nltk.sent_tokenize(text)
    return sentences

sentences = split_text_into_sentences(text)

This returns a list of 2670 sentences extracted from the input text with a mean of 78 characters per sentence.
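The article does not show how the count and mean length were computed. A minimal sketch of one way to get such figures follows; the helper name length_stats and the toy sentences are assumptions for illustration, not the author's code.

```python
from statistics import mean

def length_stats(chunks):
    """Return (number of chunks, mean chunk length in characters)."""
    lengths = [len(chunk) for chunk in chunks]
    return len(lengths), mean(lengths)

# Toy input; in the article this would be the `sentences` list from NLTK
toy_sentences = [
    "Brazil is large.",
    "It borders many countries.",
    "Its coastline is long.",
]
count, avg = length_stats(toy_sentences)
print(f"{count} sentences, mean length {avg:.1f} characters")
```

The same helper can be applied to the output of any of the splitters discussed here, which makes their chunk sizes directly comparable.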

Evaluating NLTK Sentence Tokenizer

While the NLTK Sentence Tokenizer is a straightforward and efficient way to divide a large body of text into individual sentences, it does come with certain limitations:

  1. Language Dependency: The NLTK Sentence Tokenizer relies heavily on the language of the text. It performs well with English but may not provide accurate results with other languages without additional configuration.
  2. Abbreviations and Punctuation: The tokenizer can occasionally misinterpret abbreviations or other punctuation at the end of a sentence. This can lead to fragments of sentences being treated as independent sentences.
  3. Lack of Semantic Understanding: Like most tokenizers, the NLTK Sentence Tokenizer does not consider the semantic relationship between sentences. Therefore, a context that spans multiple sentences might be lost in the tokenization process.
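To make the abbreviation problem concrete, here is a deliberately naive rule-based splitter built on a regular expression. It is not NLTK's algorithm (NLTK's Punkt tokenizer is considerably smarter about abbreviations), and the function name is hypothetical; the sketch only illustrates the class of error that punctuation-driven rules can produce.

```python
import re

def naive_sentence_split(text):
    """Split after '.', '!' or '?' when followed by whitespace (a naive rule)."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

example = "Dr. Silva arrived at 5 p.m. on Monday. She gave a talk."
pieces = naive_sentence_split(example)
for piece in pieces:
    print(repr(piece))
```

The two real sentences come back as four fragments, because the periods in "Dr." and "p.m." are treated as sentence endings.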

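Spacy, another powerful NLP library, also provides sentence segmentation. With a trained pipeline such as en_core_web_sm, sentence boundaries are derived from the dependency parse rather than from punctuation rules alone (a purely rule-based sentencizer component is also available). In practice, it is used in much the same way as NLTK.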

Implementation

Implementing Spacy’s sentence splitter is quite straightforward. Here’s how to do it in Python:

import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp(text)
sentences = list(doc.sents)

This returns a list of 2336 sentences extracted from the input text with a mean of 89 characters per sentence.

Evaluating Spacy Sentence Splitter

Spacy’s sentence splitter tends to create smaller chunks than a character-based splitter configured with a large chunk size (such as the Langchain Character Text Splitter discussed next), as it strictly adheres to sentence boundaries. This can be advantageous when smaller text units are necessary for analysis.

Like NLTK, however, Spacy’s performance depends on the quality of the input text. For poorly punctuated or structured text, the identified sentence boundaries might not always be accurate.

Now, we’ll see how Langchain provides a framework for chunking text data and further compare it with NLTK and Spacy.

The Langchain Character Text Splitter works by recursively dividing the text at specific characters. It is especially useful for generic text.

The splitter is defined by a list of characters. It attempts to split the text based on these characters until the generated chunks meet the desired size criterion. The default list is ["\n\n", "\n", " ", ""], aiming to keep paragraphs, sentences, and words together as much as possible to maintain semantic coherence.
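The idea of trying coarse separators first and falling back to finer ones can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not Langchain's implementation: the function name recursive_split is hypothetical, and overlap between chunks is left out entirely.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Simplified sketch: split on the coarsest separator, recurse on big pieces."""
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    # No separators left (or the empty separator): cut at fixed positions
    if not separators or separators[0] == "":
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, rest = separators[0], separators[1:]
    if sep not in text:
        return recursive_split(text, chunk_size, rest)
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging small pieces together
            continue
        if current:
            chunks.append(current)
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too long: retry it with finer separators
            chunks.extend(recursive_split(piece, chunk_size, rest))
            current = ""
    if current:
        chunks.append(current)
    return chunks

demo_text = (
    "First paragraph here.\n\n"
    "Second paragraph is a little longer than the first.\n\n"
    "Third."
)
chunks = recursive_split(demo_text, chunk_size=30)
for chunk in chunks:
    print(repr(chunk))
```

In this toy run, the short paragraphs survive intact, while the long middle paragraph is broken at word boundaries, which is the behavior the separator ordering is meant to encourage.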

Implementation

Consider the following example, where we split the sample text extracted from our PDF using this method.

# Import path used by Langchain releases from 2023; newer releases may
# expose the same class from a separate text-splitters package
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Initialize the text splitter with custom parameters
custom_text_splitter = RecursiveCharacterTextSplitter(
    # Set custom chunk size
    chunk_size = 100,
    chunk_overlap = 20,
    # Use length of the text as the size measure
    length_function = len,
)

# Create the chunks
texts = custom_text_splitter.create_documents([sample])

# Print the first two chunks
print(f'### Chunk 1: \n\n{texts[0].page_content}\n\n=====\n')
print(f'### Chunk 2: \n\n{texts[1].page_content}\n\n=====')

"""
=======
Output:
=======

### Chunk 1:

Brazil is the world's fifth-largest country by area and the seventh most popul ous. Its capital

=====

### Chunk 2:

is Brasília, and its most popul ous city is São Paulo. The federation is composed of the union of

=====

"""

Run on the full text rather than the sample, the splitter produces 3205 chunks in the texts list. The mean chunk length is 65.8 characters, a bit less than NLTK’s mean (78 characters).

Changing Parameters and Using ‘\n’ Separator:

For a more customized approach with the Langchain Splitter, we can alter the chunk_size and chunk_overlap parameters according to our needs. Additionally, we can specify only one character (or set of characters) for the splitting operation, such as \n. This will guide the splitter to separate the text into chunks only at the newline characters.

Let’s consider an example where we set chunk_size to 300, chunk_overlap to 30, and only use \n as the separator.
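How chunk_size and chunk_overlap interact is easiest to see on raw character windows. The sketch below is conceptual only: Langchain applies overlap while merging separator-delimited pieces, not by slicing fixed windows, and the function name fixed_size_chunks is an assumption made for this illustration.

```python
def fixed_size_chunks(text, chunk_size, chunk_overlap):
    """Cut text into windows of chunk_size characters overlapping by chunk_overlap."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]

demo = "abcdefghijklmnopqrstuvwxyz"
windows = fixed_size_chunks(demo, chunk_size=10, chunk_overlap=3)
for window in windows:
    print(window)
```

Each window repeats the last three characters of the previous one, so content near a boundary appears in both neighboring chunks. A larger overlap preserves more context at the cost of more, partly redundant, chunks.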

# Initialize the text splitter with custom parameters
custom_text_splitter = RecursiveCharacterTextSplitter(
    # Set custom chunk size
    chunk_size = 300,
    chunk_overlap = 30,
    # Use length of the text as the size measure
    length_function = len,
    # Use only "\n" as the separator
    separators = ['\n']
)

# Create the chunks
custom_texts = custom_text_splitter.create_documents([sample])

# Print the first two chunks
print(f'### Chunk 1: \n\n{custom_texts[0].page_content}\n\n=====\n')
print(f'### Chunk 2: \n\n{custom_texts[1].page_content}\n\n=====')

Now, let’s compare some outputs from the standard set of parameters with the custom parameters:

# Print the sampled chunks
print("==== Sample chunks from 'Standard Parameters': ====\n\n")
for i, chunk in enumerate(texts):
    if i < 4:
        print(f"### Chunk {i+1}: \n{chunk.page_content}\n")

print("==== Sample chunks from 'Custom Parameters': ====\n\n")
for i, chunk in enumerate(custom_texts):
    if i < 4:
        print(f"### Chunk {i+1}: \n{chunk.page_content}\n")

"""
=======
Output:
=======

==== Sample chunks from 'Standard Parameters': ====

### Chunk 1:
Brazil is the world's fifth-largest country by area and the seventh most popul ous. Its capital

### Chunk 2:
is Brasília, and its most popul ous city is São Paulo. The federation is composed of the union of

### Chunk 3:
of the union of the 26

### Chunk 4:
states and the Federal District. It is the only country in the Americas to have Portugue se as an

==== Sample chunks from 'Custom Parameters': ====

### Chunk 1:
Brazil is the world's fifth-largest country by area and the seventh most popul ous. Its capital
is Brasília, and its most popul ous city is São Paulo. The federation is composed of the union of the 26

### Chunk 2:
states and the Federal District. It is the only country in the Americas to have Portugue se as an official
langua ge.[11][12] It is one of the most multicultural and ethnically diverse nations, due to over a century of

### Chunk 3:
mass immigration from around t he world,[13] and the most popul ous Roman Catholic-majority country.
Bounde d by the Atlantic Ocean on the east, Brazil has a coastline of 7,491 kilometers (4,655 mi).[14] It

### Chunk 4:
borders all other countries and territories in South America except Ecuador and Chile and covers roughl y
half of the continent's land area.[15] Its Amazon basin includes a vast tropical forest, home to diverse
"""

We can already see that the custom parameters yield much bigger chunks, so each chunk keeps more context than with the first set of parameters.

Evaluating the Langchain Character Text Splitter

After splitting the text into chunks using different parameters, we obtain two lists of chunks: texts and custom_texts, containing 3205 and 1404 text chunks, respectively. Now, let’s plot the distribution of chunk lengths for these two scenarios to better understand the impact of changing the parameters.

Figure 1: Distribution plot of chunk lengths for Langchain splitter with different parameters (Image by Author)

In this histogram, the x-axis represents the chunk lengths, while the y-axis represents the frequency of each length. The blue bars represent the distribution of chunk lengths for the original parameters, and the orange bars represent the distribution of the custom parameters. By comparing these two distributions, we can see how the changes in parameters affected the resulting chunk lengths.

Remember, the ideal distribution depends on the specific requirements of your text-processing task. You might want smaller, more numerous chunks if you’re dealing with fine-grained analysis or larger, fewer chunks for broader semantic analysis.
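The plotting code behind Figure 1 is not included in the article. As a rough, dependency-free stand-in, chunk lengths can be binned and compared as plain counts; the function name length_histogram and the toy chunk lists below are assumptions for illustration.

```python
from collections import Counter

def length_histogram(chunks, bin_width=50):
    """Count chunks per length bin; each key is the lower edge of its bin."""
    bins = Counter((len(chunk) // bin_width) * bin_width for chunk in chunks)
    return dict(sorted(bins.items()))

# Toy stand-ins for the `texts` and `custom_texts` chunk contents
small_chunks = ["a" * 40, "b" * 95, "c" * 60]
large_chunks = ["d" * 210, "e" * 290]

small_hist = length_histogram(small_chunks)
large_hist = length_histogram(large_chunks)
print(small_hist)
print(large_hist)
```

With Langchain documents, the input would be the page_content of each chunk rather than raw strings.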

Langchain Character Text Splitter vs. NLTK and Spacy

Earlier, we generated 3205 chunks using the Langchain splitter with the first set of parameters (chunk_size = 100). The NLTK Sentence Tokenizer, on the other hand, split the same text into a total of 2670 sentences.

To get a more intuitive understanding of the difference between these methods, we can visualize the distribution of chunk lengths. The following plot shows the densities of chunk lengths for each method, allowing us to see how the lengths are distributed and where most of the lengths lie.

Figure 2: Distribution plot of chunk lengths resulting from Langchain Splitter with custom parameters vs. NLTK and Spacy (Image by Author)

From Figure 2, we can see that the Langchain splitter produces a much more concentrated distribution of chunk lengths and tends to yield longer chunks. NLTK and Spacy produce very similar outputs in terms of chunk length: both favor shorter sentences, with frequency decreasing as length grows, while still having many outliers that can reach up to 1400 characters.

Sentence Clustering is a technique that involves grouping sentences based on their semantic similarity. By using sentence embeddings and a clustering algorithm such as K-means, we can implement Sentence Clustering.

Implementation

Here is a simple example code snippet using the Python library sentence-transformers for generating sentence embeddings and scikit-learn for K-means clustering:

from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

# Load the Sentence Transformer model
model = SentenceTransformer('all-MiniLM-L6-v2')

# Define a list of sentences (your text data)
sentences = ["This is an example sentence.", "Another sentence goes here.", "..."]

# Generate embeddings for the sentences
embeddings = model.encode(sentences)

# Choose an appropriate number of clusters (here we choose 3 as an example)
num_clusters = 3

# Perform K-means clustering
kmeans = KMeans(n_clusters=num_clusters)
clusters = kmeans.fit_predict(embeddings)

You can see here that the steps for clustering a list of sentences are:

  1. Load a Sentence Transformer model. In this case, we’re using all-MiniLM-L6-v2 from sentence-transformers/all-MiniLM-L6-v2 in HuggingFace.
  2. Define your sentences and generate their embeddings with the encode() method from the model.
  3. Then define your clustering technique and number of clusters (we’re using KMeans with 3 clusters here) and finally fit it to the embeddings.

Evaluating KMeans Clustering

Finally, we plot a word cloud for each cluster.

import string

import matplotlib.pyplot as plt
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from wordcloud import WordCloud

nltk.download('stopwords')

# Define a list of stop words
stop_words = set(stopwords.words('english'))

# Define a function to clean sentences
def clean_sentence(sentence):
    # Tokenize the sentence
    tokens = word_tokenize(sentence)
    # Convert to lower case
    tokens = [w.lower() for w in tokens]
    # Remove punctuation
    table = str.maketrans('', '', string.punctuation)
    stripped = [w.translate(table) for w in tokens]
    # Remove non-alphabetic tokens
    words = [word for word in stripped if word.isalpha()]
    # Filter out stop words
    words = [w for w in words if w not in stop_words]
    return words

# Compute and print Word Clouds for each cluster
for i in range(num_clusters):
    cluster_sentences = [sentences[j] for j in range(len(sentences)) if clusters[j] == i]
    cleaned_sentences = [' '.join(clean_sentence(s)) for s in cluster_sentences]
    # Use a separate name so the full document in `text` is not overwritten
    cloud_text = ' '.join(cleaned_sentences)

    wordcloud = WordCloud(max_font_size=50, max_words=100, background_color="white").generate(cloud_text)
    plt.figure()
    plt.imshow(wordcloud, interpolation="bilinear")
    plt.axis("off")
    plt.title(f"Cluster {i}")
    plt.show()

Below we have the WordCloud plots for the generated clusters:

Figure 3: Word Cloud plot for KMeans clustering — cluster 0 (Image by Author)
Figure 4: Word Cloud plot for KMeans clustering — cluster 1 (Image by Author)
Figure 5: Word Cloud plot for KMeans clustering — cluster 2 (Image by Author)

In the word clouds for the KMeans clusters, each cluster is clearly distinguished by the semantics of its most frequent words, which indicates strong semantic differentiation among clusters. Cluster sizes also vary noticeably, indicating a significant disparity in the number of sentences each cluster comprises.

Limitations of KMeans Clustering

Sentence clustering, although beneficial, does have a few notable drawbacks. The primary limitations include:

  1. Loss of Sentence Order: Sentence clustering doesn’t retain the original sequence of sentences, which could distort the natural flow of the narrative. This is a very important limitation.
  2. Computational Efficiency: KMeans can be computationally intensive and slow, especially with large text corpora or when working with a larger number of clusters. This can be a significant drawback for real-time applications or when handling big data.
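The loss of sentence order can be shown without running KMeans at all. The sketch below takes a hypothetical list of cluster labels, as KMeans might assign to six consecutive sentences, and checks whether each cluster covers an unbroken run of the text; the helper names are assumptions for illustration.

```python
from collections import defaultdict

def group_by_label(labels):
    """Map each cluster label to the sentence indices assigned to it."""
    groups = defaultdict(list)
    for index, label in enumerate(labels):
        groups[label].append(index)
    return dict(groups)

def is_contiguous(indices):
    """True when the indices form one unbroken run, e.g. [3, 4, 5]."""
    return indices == list(range(indices[0], indices[-1] + 1))

# Hypothetical labels for six consecutive sentences (not real KMeans output)
labels = [0, 2, 0, 1, 2, 0]
groups = group_by_label(labels)
for label, indices in sorted(groups.items()):
    print(label, indices, "contiguous" if is_contiguous(indices) else "scattered")
```

Cluster 0 collects sentences 0, 2, and 5, so reading that cluster as a chunk skips over the sentences in between, which is why the narrative flow is lost.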

To overcome some of the limitations of KMeans clustering, especially the loss of sentence order, an alternative approach could be clustering adjacent sentences based on their semantic similarity. The fundamental premise of this approach is that two sentences that appear consecutively in a text are more likely to be semantically related than two sentences that are farther apart.

Implementation

Here’s an expanded implementation of this heuristic using Spacy sentences as inputs:

import numpy as np
import spacy

# Load the Spacy model
nlp = spacy.load('en_core_web_sm')

def process(text):
    doc = nlp(text)
    sents = list(doc.sents)
    # Normalize each sentence vector so the dot product equals cosine similarity
    vecs = np.stack([sent.vector / sent.vector_norm for sent in sents])

    return sents, vecs

def cluster_text(sents, vecs, threshold):
    clusters = [[0]]
    for i in range(1, len(sents)):
        # Start a new cluster when adjacent sentences are not similar enough
        if np.dot(vecs[i], vecs[i-1]) < threshold:
            clusters.append([])
        clusters[-1].append(i)

    return clusters

def clean_text(text):
    # Add your text cleaning process here
    return text

# Initialize the clusters lengths list and final texts list
clusters_lens = []
final_texts = []

# Process the chunk
threshold = 0.3
sents, vecs = process(text)

# Cluster the sentences
clusters = cluster_text(sents, vecs, threshold)

for cluster in clusters:
    cluster_txt = clean_text(' '.join([sents[i].text for i in cluster]))
    cluster_len = len(cluster_txt)

    # Check if the cluster is too short
    if cluster_len < 60:
        continue

    # Check if the cluster is too long
    elif cluster_len > 3000:
        # Re-cluster the long chunk once with a stricter threshold
        threshold = 0.6
        sents_div, vecs_div = process(cluster_txt)
        reclusters = cluster_text(sents_div, vecs_div, threshold)

        for subcluster in reclusters:
            div_txt = clean_text(' '.join([sents_div[i].text for i in subcluster]))
            div_len = len(div_txt)

            # Drop sub-clusters that are still too short or too long
            if div_len < 60 or div_len > 3000:
                continue

            clusters_lens.append(div_len)
            final_texts.append(div_txt)

    else:
        clusters_lens.append(cluster_len)
        final_texts.append(cluster_txt)

Key takeaways from this code:

  1. Text Processing: Each text chunk is passed to the process function. This function uses the SpaCy library to create sentence embeddings, which are used to represent the semantic meaning of each sentence in the text chunk.
  2. Cluster Creation: The cluster_text function forms clusters of sentences based on the cosine similarity of their embeddings. If the cosine similarity is less than a specified threshold, a new cluster begins.
  3. Length Check: The code then checks the length of each cluster. Clusters that are too short (less than 60 characters) are discarded. Clusters that are too long (more than 3000 characters) are re-clustered once with a higher threshold (0.6), and any resulting sub-cluster that still falls outside the 60 to 3000 character range is discarded.
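The core of the cluster-creation step can be tried out without Spacy by feeding it hand-made vectors. This is a minimal sketch under that assumption: the two-dimensional toy vectors stand in for real sentence embeddings, and the function names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def cluster_adjacent(vectors, threshold):
    """Group consecutive indices; open a new group when similarity drops."""
    clusters = [[0]]
    for i in range(1, len(vectors)):
        if cosine(vectors[i], vectors[i - 1]) < threshold:
            clusters.append([])
        clusters[-1].append(i)
    return clusters

# Toy "embeddings": two sentences on one topic, two on another, then a return
toy_vectors = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0), (0.1, 0.9), (1.0, 0.1)]

loose = cluster_adjacent(toy_vectors, threshold=0.3)
strict = cluster_adjacent(toy_vectors, threshold=0.999)
print(loose)   # [[0, 1], [2, 3], [4]]
print(strict)  # [[0], [1], [2], [3], [4]]
```

Every cluster is a run of consecutive indices, so sentence order is preserved by construction. Raising the threshold makes the method split more eagerly, which is the same lever the code above pulls when it re-clusters overly long chunks.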

Let’s take a look at some of the output chunks from this approach and compare them to Langchain Splitter:

====   Sample chunks from 'Langchain Splitter with Custom Parameters':   ====

### Chunk 1:
Brazil is the world's fifth-largest country by area and the seventh most popul ous. Its capital
is Brasília, and its most popul ous city is São Paulo. The federation is composed of the union of the 26

### Chunk 2:
states and the Federal District. It is the only country in the Americas to have Portugue se as an official
langua ge.[11][12] It is one of the most multicultural and ethnically diverse nations, due to over a century of

==== Sample chunks from 'Adjacent Sentences Clustering': ====

### Chunk 1:
Brazil is the world's fifth-largest country by area and the seventh most popul ous. Its capital
is Brasília, and its most popul ous city is São Paulo.

### Chunk 2:
The federation is composed of the union of the 26
states and the Federal District. It is the only country in the Americas to have Portugue se as an official
langua ge.[11][12]

Great, now let’s compare the distribution of chunk lengths of the final_texts (from the adjacent sequence clustering approach) with the distributions from the Langchain Character Text Splitter and NLTK Sentence Tokenizer. To do this, we’ll first need to calculate the lengths of the chunks in final_texts:

final_texts_lengths = [len(chunk) for chunk in final_texts]

We can now plot the distributions of all the methods:

Figure 6: Distribution plot of chunk lengths resulting from all the different methods tested (Image by Author)

From Figure 6, we can see that the Langchain splitter, using its predefined chunk size, produces a narrow distribution concentrated around that size, implying consistent chunk lengths.

The Spacy Sentence Splitter and the NLTK Sentence Tokenizer, on the other hand, seem to prefer smaller sentences, though with many larger outliers, indicating their reliance on linguistic cues to determine splits and potentially produce irregularly sized chunks.

Lastly, the custom Adjacent Sequence Clustering approach, which clusters based on semantic similarity, exhibits a more varied distribution. This could be indicative of a more context-sensitive approach, maintaining the coherence of content within chunks while allowing for more flexibility in size.

Evaluating Adjacent Sequence Clustering Approach

The Adjacent Sequence Clustering Approach brings unique benefits:

  1. Contextual Coherence: Generates thematically consistent chunks by considering semantic and contextual coherence.
  2. Flexibility: Balances context preservation and computational efficiency, providing adjustable chunk sizes.
  3. Threshold Tuning: Allows users to fine-tune the chunking process according to their needs, by adjusting the similarity threshold.
  4. Sequence Preservation: Retains the original order of sentences in the text, essential for sequential language models and tasks where text order matters.

Langchain Character Text Splitter

This method provides consistent chunk lengths, yielding a narrow distribution concentrated around the configured chunk size. This could be beneficial when a standard size is necessary for downstream processing or analysis. The approach is less sensitive to the specific linguistic structure of the text, focusing more on producing chunks of a predefined character length.

NLTK Sentence Tokenizer and Spacy Sentence Splitter

These approaches exhibit a preference for smaller sentences but include many larger outliers. While this can result in more linguistically coherent chunks, it can also lead to high variability in chunk size.

These methods can yield good results that can serve as inputs to downstream tasks too.

Adjacent Sequence Clustering

This method generates a more varied distribution, indicative of its context-sensitive approach. By clustering based on semantic similarity, it ensures that the content within each chunk is coherent while allowing for flexibility in chunk size. This method may be advantageous when it is important to preserve the semantic continuity of text data.

For a more visual and abstract (or silly) representation, let’s look at Figure 7 below and try to figure out which kind of pineapple “cut” would better represent the approaches discussed:

Figure 7: Different methods of text chunking shown as pineapple cuts (Image compiled by the author. Pineapple image from Canva)

Listing them in order:

  1. Cut number 1 would represent a rule-based approach, in which you can just “peel off” the “junk” text you want based on filters or regular expressions. Lots of work to do the whole pineapple though, since it also retains a lot of outliers with a much bigger context size.
  2. Langchain would be like cut number 2. Very similar pieces in size but not holding the entire desired context (it’s a triangle, so it could be a watermelon as well).
  3. Cut number 3 is definitely KMeans. You may even group only what makes sense for you — the juiciest part — but you won’t get its core. Without it, the chunks lose all the structure and meaning. I think it takes a lot of work to do that as well… especially for bigger pineapples.
  4. Lastly, cut number 4 illustrates the Adjacent Sentence Clustering method. The size of the chunks can vary but they often maintain contextual information, similar to uneven pineapple pieces that still indicate the fruit’s overall structure.


