Mastering Text Chunking in LLMs
Learn the best practices and strategies for document splitting to enhance data indexing and retrieval efficiency. Discover various chunking methods tailored for different content types and applications, from basic character splitting to advanced semantic and agentic methods.
What is Chunking?
Chunking is the process of breaking down a large piece of text into smaller, more manageable units, often referred to as “chunks.” This technique is crucial in various natural language processing (NLP) and information retrieval tasks. By dividing text into smaller segments, systems can handle, index, and search through data more efficiently.
Why Split Documents?
Ease of Search
Large chunks of data are harder to search and index efficiently. Splitting documents into smaller, manageable chunks allows for more precise indexing, leading to faster and more accurate search results.
Context Window Size
Large Language Models (LLMs) have a finite context window, meaning they can only process a limited number of tokens at a time. Splitting documents ensures that each chunk fits within this context window, allowing the model to process and generate responses effectively.
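For example, you can verify that a chunk fits a model's context window by counting its tokens. Here is a minimal sketch using the tiktoken library (the encoding name and token limit are illustrative; match them to your target model):
# Count tokens to check that a chunk fits the target model's context window.
# Assumes the `tiktoken` package; "cl100k_base" and 8192 are illustrative choices.
import tiktoken

encoding = tiktoken.get_encoding("cl100k_base")

def fits_in_context(chunk: str, max_tokens: int = 8192) -> bool:
    return len(encoding.encode(chunk)) <= max_tokens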
Chunking Strategies
Nature of Content
The type of content you’re working with significantly influences your chunking strategy. Lengthy documents, such as articles or books, require different approaches compared to shorter content like tweets or instant messages. Understanding the nature of your content helps in selecting an appropriate chunking method.
Embedding Model Considerations
The choice of embedding model also dictates the chunking strategy. Some models perform better with specific chunk lengths, making it crucial to tailor your approach based on the model you intend to use.
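For instance, sentence-transformers models expose their maximum input length, which can guide chunk sizing. A quick sketch (the model name is illustrative):
# Inspect an embedding model's input limit before picking a chunk size.
# Assumes the `sentence-transformers` package; the model name is illustrative.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')
print(model.max_seq_length)  # inputs longer than this are truncated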
User Query Length and Complexity
Anticipate the nature of user queries — whether they are short and specific or long and complex. This understanding helps in designing a chunking strategy that aligns the embedded queries with the embedded chunks, enhancing search accuracy.
Application-Specific Requirements
The end use case, such as semantic search, question answering, or summarization, will determine how text should be chunked. For instance, if the results need to be input into another language model with a token limit, this constraint must be factored into your chunking strategy.
Chunking Methods
Depending on the considerations mentioned above, several text splitters are available, each with unique methods of dividing text.
Basic Character Splitting
This method involves dividing text into static character chunks without regard to content structure. While simple, it is generally not recommended for practical applications due to its rigidity.
Example:
text = "This is the text I would like to chunk up. It is the example text for this exercise"
chunk_size = 35
chunks = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
# Output:
['This is the text I would like to ch',
'unk up. It is the example text for ',
'this exercise']
Pros: Easy to implement. Cons: Ignores content structure, leading to poor search and indexing performance.
Recursive Character Text Splitting
A more sophisticated method that recursively splits text based on a list of separators. This approach maintains a balance between chunk size and content structure, making it suitable for more complex documents.
Example: LangChain’s RecursiveCharacterTextSplitter
from langchain.text_splitter import RecursiveCharacterTextSplitter
text = """
One of the most important things I didn't understand about the world when I was a child is the degree to which the returns for performance are superlinear.
Teachers and coaches implicitly told us the returns were linear. "You get out," I heard a thousand times, "what you put in." They meant well, but this is rarely true. If your product is only half as good as your competitor's, you don't get half as many customers. You get no customers, and you go out of business.
It's obviously true that the returns for performance are superlinear in business. Some think this is a flaw of capitalism, and that if we changed the rules it would stop being true. But superlinear returns for performance are a feature of the world, not an artifact of rules we've invented. We see the same pattern in fame, power, military victories, knowledge, and even benefit to humanity. In all of these, the rich get richer. [1]
"""
text_splitter = RecursiveCharacterTextSplitter(chunk_size=450, chunk_overlap=0)
documents = text_splitter.create_documents([text])
# Output:
[Document(page_content="One of the most important things I didn't understand about the world when I was a child is the degree to which the returns for performance are superlinear."),
Document(page_content='Teachers and coaches implicitly told us the returns were linear. "You get out," I heard a thousand times, "what you put in." They meant well, but this is rarely true. If your product is only half as good as your competitor\'s, you don\'t get half as many customers. You get no customers, and you go out of business.'),
Document(page_content="It's obviously true that the returns for performance are superlinear in business. Some think this is a flaw of capitalism, and that if we changed the rules it would stop being true. But superlinear returns for performance are a feature of the world, not an artifact of rules we've invented. We see the same pattern in fame, power, military victories, knowledge, and even benefit to humanity. In all of these, the rich get richer. [1]")]
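By default, RecursiveCharacterTextSplitter tries paragraph breaks first, then line breaks, spaces, and finally individual characters; you can also pass your own hierarchy. A brief sketch reusing the text above (the separators shown are the documented defaults):
from langchain.text_splitter import RecursiveCharacterTextSplitter
# The splitter tries each separator in order, recursing to the next one only
# when a piece is still larger than chunk_size.
custom_splitter = RecursiveCharacterTextSplitter(
    chunk_size=450,
    chunk_overlap=0,
    separators=["\n\n", "\n", " ", ""],
)
documents = custom_splitter.create_documents([text])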
Document-Specific Splitting
Tailored chunking methods for different document types (e.g., PDF, Python, Markdown) to ensure optimal splitting based on the document’s inherent structure.
Example:
from langchain.document_loaders import PyPDFLoader  # requires the pypdf package
from langchain.text_splitter import RecursiveCharacterTextSplitter
# Load the PDF page by page, then split the pages into chunks.
pdf_loader = PyPDFLoader('path/to/document.pdf')
documents = pdf_loader.load()
text_splitter = RecursiveCharacterTextSplitter(chunk_size=450, chunk_overlap=0)
split_documents = text_splitter.split_documents(documents)
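LangChain also ships structure-aware splitters for source code and markup. As a short sketch, a Python-aware splitter prefers to break at function and class boundaries (the chunk_size here is arbitrary):
from langchain.text_splitter import Language, RecursiveCharacterTextSplitter

python_code = """
def hello():
    print("Hello, world")

class Greeter:
    def greet(self, name):
        return f"Hello, {name}"
"""
# from_language pre-loads separators suited to Python syntax (def, class, ...).
python_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON, chunk_size=60, chunk_overlap=0
)
documents = python_splitter.create_documents([python_code])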
Semantic Splitting
Utilizes embedding models to split text based on semantic meaning: sentences are embedded, and a new chunk begins wherever adjacent sentences drift apart semantically, so each chunk stays contextually coherent. The example below is a minimal illustration of this idea; the 0.5 similarity threshold is an arbitrary choice.
Example:
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('paraphrase-MiniLM-L6-v2')
text = """
Machine learning is a method of data analysis that automates analytical model building.
It is a branch of artificial intelligence based on the idea that systems can learn from data, identify patterns, and make decisions with minimal human intervention.
"""
# Embed each sentence, then start a new chunk wherever adjacent sentences
# fall below the similarity threshold.
sentences = [s.strip() for s in text.split('.') if s.strip()]
embeddings = model.encode(sentences)
chunks, current = [], [sentences[0]]
for i in range(1, len(sentences)):
    if util.cos_sim(embeddings[i - 1], embeddings[i]).item() < 0.5:
        chunks.append('. '.join(current) + '.')
        current = []
    current.append(sentences[i])
chunks.append('. '.join(current) + '.')
Agentic Splitting
An experimental method in which an LLM acts as an agent, reading the text and deciding where each chunk should begin and end, much as a human would. Because every boundary decision costs an LLM call, it is most practical in scenarios where token cost trends toward zero.
Example: a minimal sketch that asks an LLM to act as the chunking agent. It assumes the openai package and an API key in the environment; the model name and prompt are illustrative choices, not a fixed recipe.
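from openai import OpenAI

client = OpenAI()
text = """
Agents are autonomous entities that observe through sensors and act upon an environment using actuators and direct their activity towards achieving goals.
"""
# The model acts as the chunking agent: it reads the text and returns
# self-contained chunks, one per line.
response = client.chat.completions.create(
    model="gpt-4o-mini",  # hypothetical choice; any capable chat model works
    messages=[{
        "role": "user",
        "content": "Split the following text into self-contained chunks, "
                   "one chunk per line, keeping each complete idea intact:\n\n" + text,
    }],
)
chunks = [line.strip() for line in
          response.choices[0].message.content.splitlines() if line.strip()]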
Alternative Representation Chunking
Creates derivative representations of the raw text (for example, normalized copies, summaries, or hypothetical questions) that are indexed alongside or instead of the raw chunks to aid retrieval.
Example:
from langchain.text_splitter import RecursiveCharacterTextSplitter
text = """
Data representation refers to the form in which data is stored, processed, and transmitted.
The methods used to represent data can significantly impact the efficiency and effectiveness of data processing systems.
"""
text_splitter = RecursiveCharacterTextSplitter(chunk_size=50, chunk_overlap=0)
split_documents = text_splitter.create_documents([text])
# Alternative representation for retrieval
alternative_representation = [doc.page_content.lower() for doc in split_documents]
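Whichever derivative you index (here, a simple lowercased copy), keep a pointer back to the raw chunk so the original text can be returned at query time. A minimal sketch:
# Search against the derivative keys, return the original chunk values.
source_lookup = {alt: doc.page_content
                 for alt, doc in zip(alternative_representation, split_documents)}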
A very common approach is to pre-determine the size of the text chunks and, optionally, the overlap between consecutive chunks (overlap is preferred because it maintains contextual continuity between chunks). Because it is simple and cheap, this approach is widely used. In the simplest variant, the text is split on a character and the chunk size is measured in characters, which is what LangChain’s CharacterTextSplitter does. Let’s look at some examples.
Character Splitting
Character splitting is the simplest form of dividing text. It involves breaking down the text into chunks of a specified number of characters, without considering the content or structure.
While this method is generally not recommended for practical applications due to its rigidity, it serves as a useful starting point for understanding the fundamentals of text splitting.
Pros:
- Easy to implement
- Simple to understand
Cons:
- Very rigid
- Does not consider the text’s structure
Key Concepts:
- Chunk Size: The number of characters in each chunk, e.g., 50, 100, 10,000, etc.
- Chunk Overlap: The number of characters by which sequential chunks overlap. This helps to avoid splitting important context between chunks but results in duplicate data across chunks.
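To make overlap concrete, here is a minimal sketch of manual overlapping chunks (the sizes are arbitrary):
# Slide a window of chunk_size characters, stepping by chunk_size - overlap,
# so each chunk repeats the last `overlap` characters of the previous one.
text = "This is the text I would like to chunk up. It is the example text for this exercise"
chunk_size, overlap = 35, 5
chunks = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size - overlap)]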
Practical Implementation: Character Splitting
Let’s walk through a basic example of character splitting using LangChain’s CharacterTextSplitter.
- Sample Text:
text = "This is the text I would like to chunk up. It is the example text for this exercise"
- Manual Splitting:
chunks = []
chunk_size = 35
for i in range(0, len(text), chunk_size):
    chunk = text[i:i + chunk_size]
    chunks.append(chunk)
- Using LangChain’s CharacterTextSplitter:
from langchain.text_splitter import CharacterTextSplitter
text_splitter = CharacterTextSplitter(chunk_size=35, chunk_overlap=0, separator='', strip_whitespace=False)
documents = text_splitter.create_documents([text])
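With separator='' and strip_whitespace=False, this yields the same three chunks as the manual loop above, wrapped as Document objects.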
Conclusion
Document splitting is a foundational step in building efficient indexing and search systems. By understanding the nature of your content, the requirements of your application, and the capabilities of your embedding models, you can choose and implement the most effective chunking strategy. Whether you are working with basic character splitting or advanced semantic splitting, the key is to balance context preservation with indexing efficiency.
Please subscribe to the newsletter for more. Thank you!