Mastering Text Chunking in LLMs

Adnan Writes

Learn the best practices and strategies for document splitting to enhance data indexing and retrieval efficiency. Discover various chunking methods tailored for different content types and applications, from basic character splitting to advanced semantic and agentic methods.

What is Chunking?

Chunking is the process of breaking down a large piece of text into smaller, more manageable units, often referred to as “chunks.” This technique is crucial in various natural language processing (NLP) and information retrieval tasks. By dividing text into smaller segments, systems can handle, index, and search through data more efficiently.

Why Split Documents?

Ease of Search

Large chunks of data are harder to search and index efficiently. Splitting documents into smaller, manageable chunks allows for more precise indexing, leading to faster and more accurate search results.

Context Window Size

Large Language Models (LLMs) have a finite context window, meaning they can only process a limited number of tokens at a time. Splitting documents ensures that each chunk fits within this context window, allowing the model to process and generate responses effectively.
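
As a quick sanity check, you can count tokens before handing a chunk to a model. Here is a minimal sketch using the tiktoken library (the 8,192-token budget is an illustrative assumption; check your model's actual limit):

import tiktoken

# Count tokens the way an OpenAI-style model would see them.
encoding = tiktoken.get_encoding("cl100k_base")

def fits_in_context(chunk: str, max_tokens: int = 8192) -> bool:
    # True if the chunk fits within the assumed token budget.
    return len(encoding.encode(chunk)) <= max_tokens

print(fits_in_context("This chunk is well within any context window."))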

Chunking Strategies

Nature of Content

The type of content you’re working with significantly influences your chunking strategy. Lengthy documents, such as articles or books, require different approaches compared to shorter content like tweets or instant messages. Understanding the nature of your content helps in selecting an appropriate chunking method.

Embedding Model Considerations

The choice of embedding model also dictates the chunking strategy. Some models perform better with specific chunk lengths, making it crucial to tailor your approach based on the model you intend to use.
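
For example, many sentence-transformer models silently truncate inputs beyond their maximum sequence length, so chunks should stay within that budget. A small sketch (using the same model as the semantic example later in this article):

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('paraphrase-MiniLM-L6-v2')

# Inputs longer than this many tokens are truncated before encoding,
# so chunk sizes should be chosen with this limit in mind.
print(model.max_seq_length)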

User Query Length and Complexity

Anticipate the nature of user queries — whether they are short and specific or long and complex. This understanding helps in designing a chunking strategy that aligns the embedded queries with the embedded chunks, enhancing search accuracy.

Application-Specific Requirements

The end use case, such as semantic search, question answering, or summarization, will determine how text should be chunked. For instance, if the results need to be input into another language model with a token limit, this constraint must be factored into your chunking strategy.

Chunking Methods

Depending on the considerations mentioned above, several text splitters are available, each with unique methods of dividing text.

Basic Character Splitting

This method involves dividing text into static character chunks without regard to content structure. While simple, it is generally not recommended for practical applications due to its rigidity.

Example:

text = "This is the text I would like to chunk up. It is the example text for this exercise"
chunk_size = 35
chunks = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
##output##
['This is the text I would like to ch',
'unk up. It is the example text for ',
'this exercise']

Pros: Easy to implement. Cons: Ignores content structure, leading to poor search and indexing performance.

Recursive Character Text Splitting

A more sophisticated method that recursively splits text based on a list of separators. This approach maintains a balance between chunk size and content structure, making it suitable for more complex documents.

Example: LangChain’s RecursiveCharacterTextSplitter

from langchain.text_splitter import RecursiveCharacterTextSplitter

text = """
One of the most important things I didn't understand about the world when I was a child is the degree to which the returns for performance are superlinear.
Teachers and coaches implicitly told us the returns were linear. "You get out," I heard a thousand times, "what you put in." They meant well, but this is rarely true. If your product is only half as good as your competitor's, you don't get half as many customers. You get no customers, and you go out of business.
It's obviously true that the returns for performance are superlinear in business. Some think this is a flaw of capitalism, and that if we changed the rules it would stop being true. But superlinear returns for performance are a feature of the world, not an artifact of rules we've invented. We see the same pattern in fame, power, military victories, knowledge, and even benefit to humanity. In all of these, the rich get richer. [1]
"""

text_splitter = RecursiveCharacterTextSplitter(chunk_size=450, chunk_overlap=0)
documents = text_splitter.create_documents([text])
# Output:
[Document(page_content="One of the most important things I didn't understand about the world when I was a child is the degree to which the returns for performance are superlinear."),
Document(page_content='Teachers and coaches implicitly told us the returns were linear. "You get out," I heard a thousand times, "what you put in." They meant well, but this is rarely true. If your product is only half as good as your competitor\'s, you don\'t get half as many customers. You get no customers, and you go out of business.'),
Document(page_content="It's obviously true that the returns for performance are superlinear in business. Some think this is a flaw of capitalism, and that if we changed the rules it would stop being true. But superlinear returns for performance are a feature of the world, not an artifact of rules we've invented. We see the same pattern in fame, power, military victories, knowledge, and even benefit to humanity. In all of these, the rich get richer. [1]")]

Document-Specific Splitting

Tailored chunking methods for different document types (e.g., PDF, Python, Markdown) to ensure optimal splitting based on the document’s inherent structure.

Example:

from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Load a PDF page by page (the path is a placeholder), then split
# each page with the recursive splitter.
pdf_loader = PyPDFLoader('path/to/document.pdf')
documents = pdf_loader.load()

text_splitter = RecursiveCharacterTextSplitter(chunk_size=450, chunk_overlap=0)
split_documents = text_splitter.split_documents(documents)
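
The same idea extends to source code and markup. For instance, LangChain provides language-aware presets via from_language; here is a brief sketch for Python source (the snippet being split is illustrative):

from langchain.text_splitter import Language, RecursiveCharacterTextSplitter

python_code = """
def hello():
    print("Hello, world!")

class Greeter:
    def greet(self, name):
        return f"Hello, {name}!"
"""

# from_language pre-loads separators that respect function and class boundaries.
python_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON, chunk_size=100, chunk_overlap=0
)
documents = python_splitter.create_documents([python_code])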

Semantic Splitting

Utilizes embedding models to split text based on semantic meaning, so that chunk boundaries fall where the topic shifts rather than at an arbitrary character count. Below is a minimal sketch: it embeds each sentence and starts a new chunk when the cosine similarity between neighbouring sentences drops below a threshold (the 0.5 threshold is an illustrative assumption, not a fixed rule).

Example:

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('paraphrase-MiniLM-L6-v2')

text = """
Machine learning is a method of data analysis that automates analytical model building.
It is a branch of artificial intelligence based on the idea that systems can learn from data, identify patterns, and make decisions with minimal human intervention.
"""

# Embed each sentence and break wherever adjacent sentences diverge.
sentences = [s.strip() for s in text.split('.') if s.strip()]
embeddings = model.encode(sentences)

chunks, current = [], [sentences[0]]
for i in range(1, len(sentences)):
    similarity = util.cos_sim(embeddings[i - 1], embeddings[i]).item()
    if similarity < 0.5:  # illustrative threshold; tune per corpus
        chunks.append('. '.join(current) + '.')
        current = []
    current.append(sentences[i])
chunks.append('. '.join(current) + '.')

Agentic Splitting

An experimental method that delegates the splitting decision to an LLM, which reasons about where one idea ends and the next begins; it becomes practical as token costs trend toward zero. Below is a minimal sketch (assuming a configured OpenAI API key; the marker token and prompt are illustrative): the model is asked to insert a split marker at topic boundaries, and the text is then split on that marker.

Example:

from langchain.chat_models import ChatOpenAI

text = """
Agents are autonomous entities that observe through sensors and act upon an environment using actuators and direct their activity towards achieving goals.
"""

# Ask the model to mark chunk boundaries, then split on the marker.
llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)
prompt = (
    "Insert the marker <<<SPLIT>>> wherever the topic changes in the "
    "following text, and return the text otherwise unchanged:\n\n" + text
)
response = llm.predict(prompt)
chunks = [chunk.strip() for chunk in response.split("<<<SPLIT>>>")]

Alternative Representation Chunking

Creates derivative representations of the raw text (for example, normalized text, summaries, or hypothetical questions) that are indexed in place of, or alongside, the original chunks to aid retrieval.

Example:

from langchain.text_splitter import RecursiveCharacterTextSplitter

text = """
Data representation refers to the form in which data is stored, processed, and transmitted.
The methods used to represent data can significantly impact the efficiency and effectiveness of data processing systems.
"""

text_splitter = RecursiveCharacterTextSplitter(chunk_size=50, chunk_overlap=0)
split_documents = text_splitter.create_documents([text])

# Alternative representation for retrieval
alternative_representation = [doc.page_content.lower() for doc in split_documents]
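
A practical detail: index the derivative form but keep a pointer back to the source chunk, so queries match against the normalized text while the original chunk is what gets returned. A minimal sketch of that mapping (the dict-based index is illustrative; a real system would use a vector store):

# Map each alternative representation back to its source chunk.
index = {
    alt: doc.page_content
    for alt, doc in zip(alternative_representation, split_documents)
}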

A very common approach is to pre-determine the size of the text chunks. Additionally, we can specify the overlap between chunks (overlap is preferred because it maintains contextual continuity between chunks). This approach is simple and cheap, and is therefore widely used. Let’s look at some examples.

Split by Character: in this approach, the text is split on a chosen character and the chunk size is measured by the number of characters. A longer text, such as alice_in_wonderland.txt (the book in .txt format), can be split this way using LangChain’s CharacterTextSplitter, as demonstrated below.

Character Splitting

Character splitting is the simplest form of dividing text. It involves breaking down the text into chunks of a specified number of characters, without considering the content or structure.

While this method is generally not recommended for practical applications due to its rigidity, it serves as a useful starting point for understanding the fundamentals of text splitting.

Pros:

  • Easy to implement
  • Simple to understand

Cons:

  • Very rigid
  • Does not consider the text’s structure

Key Concepts:

  • Chunk Size: The number of characters in each chunk, e.g., 50, 100, 10,000, etc.
  • Chunk Overlap: The number of characters by which sequential chunks overlap. This helps to avoid splitting important context between chunks but results in duplicate data across chunks.

Practical Implementation: Character Splitting

Let’s walk through a basic example of character splitting using LangChain’s CharacterTextSplitter.

  1. Sample Text:

text = "This is the text I would like to chunk up. It is the example text for this exercise"

  2. Manual Splitting:

chunks = []
chunk_size = 35

# Step through the string in fixed-size strides.
for i in range(0, len(text), chunk_size):
    chunk = text[i:i + chunk_size]
    chunks.append(chunk)

  3. Using LangChain’s CharacterTextSplitter:

from langchain.text_splitter import CharacterTextSplitter

text_splitter = CharacterTextSplitter(chunk_size=35, chunk_overlap=0, separator='', strip_whitespace=False)
documents = text_splitter.create_documents([text])
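
To see chunk overlap in action, the same text can be re-split with a non-zero overlap, so the tail of each chunk is repeated at the head of the next (the 10-character overlap is an arbitrary illustrative choice):

overlap_splitter = CharacterTextSplitter(
    chunk_size=35, chunk_overlap=10, separator='', strip_whitespace=False
)
overlapping_documents = overlap_splitter.create_documents([text])
# Adjacent chunks now share 10 characters of context.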

Conclusion

Document splitting is a foundational step in building efficient indexing and search systems. By understanding the nature of your content, the requirements of your application, and the capabilities of your embedding models, you can choose and implement the most effective chunking strategy. Whether you are working with basic character splitting or advanced semantic splitting, the key is to balance context preservation with indexing efficiency.

Please subscribe to the newsletter for more. Thank you!
