Tokenization & Embeddings
What Are Embeddings?
Embeddings are representations of values or objects, such as text, images, and audio, that are designed to be consumed by machine learning models and semantic search algorithms. They translate these objects into a mathematical form based on the traits and categories each one may have.
Essentially, embeddings help machine learning models find similar objects. For instance, given a photo or a document, a machine learning model using embeddings could locate a similar photo or document. Embeddings enable computers to understand relationships between words and other objects, making them foundational for artificial intelligence (AI).
For example, if documents are plotted as points in a two-dimensional space, documents that land close together, say in the upper right of the plot, are likely relevant to each other.
Technically, embeddings are vectors created by machine learning models to capture meaningful data about each object.
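To make this concrete, here is a minimal sketch with invented values: each object becomes a list of numbers, and a similarity measure such as cosine similarity tells us how close two objects are.
import math

# Two hypothetical 4-dimensional embeddings; real models produce hundreds of dimensions
doc_a = [0.12, 0.87, 0.33, 0.05]
doc_b = [0.10, 0.80, 0.40, 0.01]

def cosine_similarity(u, v):
    # Dot product divided by the product of the vector lengths
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(x * x for x in v)))

print(cosine_similarity(doc_a, doc_b))  # a value near 1.0 means the objects are similar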
The Problem with Semantic Search and Why We Use Embeddings
Semantic search aims to improve search accuracy by understanding the intent and contextual meaning behind the search terms rather than just matching keywords. While this approach is more sophisticated than traditional keyword-based search, it still faces several challenges:
- Ambiguity: Words can have multiple meanings depending on the context. For example, the word “bank” can refer to a financial institution or the side of a river.
- Synonymy: Different words can have similar meanings. A search for “car” should also return results for “automobile.”
- Context: Understanding the context in which words are used is crucial. The same word can imply different things in different sentences.
- Scalability: Handling large volumes of data efficiently while maintaining accuracy is challenging.
The basic design of a semantic search system, as pitched by most vector search vendors, involves two seemingly straightforward steps (note the irony):
- Compute embeddings for your documents and queries: Somewhere. Somehow. Figure it out by yourself.
- Upload them to a vector search engine: Enjoy a better semantic search.
However, a good embedding model is crucial: your semantic search is only as good as your embedding model. Unfortunately, choosing the right model is often treated as out of scope by early adopters, so many simply pick a readily available model like sentence-transformers/all-MiniLM-L6-v2 and hope for the best.
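For completeness, here is a minimal sketch of those two steps using the sentence-transformers library and the all-MiniLM-L6-v2 model mentioned above. The documents and query are assumptions for illustration, and in production the similarity ranking would happen inside a vector search engine rather than in memory. Note how the query "car loan options" should surface the "automobile" document, which is exactly the synonymy problem described above.
from sentence_transformers import SentenceTransformer, util

# Step 1: compute embeddings for documents and a query
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
documents = [
    "How to finance a new automobile",
    "Riverbank erosion and flood control",
    "Opening a savings account at a bank",
]
doc_embeddings = model.encode(documents)
query_embedding = model.encode("car loan options")

# Step 2: rank documents by cosine similarity (a vector search engine does this at scale)
scores = util.cos_sim(query_embedding, doc_embeddings)[0].tolist()
for doc, score in sorted(zip(documents, scores), key=lambda pair: -pair[1]):
    print(f"{score:.3f}  {doc}")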
Embeddings address these challenges by converting objects (such as text, images, and audio) into high-dimensional vectors that capture their semantic meanings. Here’s why embeddings are effective:
- Contextual Understanding: Embeddings consider the context in which words appear, allowing for a more nuanced understanding of meaning. This helps disambiguate words with multiple meanings.
- Similarity Detection: By placing similar objects close to each other in the vector space, embeddings make it easier to identify synonyms and related concepts. This improves the relevance of search results.
- Efficient Processing: Embeddings enable efficient comparisons between large volumes of data. Since they represent complex data in a condensed form, they allow for faster and more accurate search processes.
- Flexibility: Embeddings can be applied to various types of data, including text, images, and audio, making them versatile for different semantic search applications.
What Is a Vector in Machine Learning?
In mathematics, a vector is an array of numbers that defines a point in space. In more practical terms, a vector is simply a list of numbers, such as 1989, 22, 9, 180, where each number indicates where the object lies along a specified dimension.
In machine learning, vectors make it possible to search for similar objects: a vector search algorithm simply finds vectors that are close together in a vector database.
To understand this better, think about latitude and longitude. These two dimensions — north-south and east-west — can indicate the location of any place on Earth. For example, Vancouver, British Columbia, Canada, can be represented by the coordinates 49°15'40"N, 123°06'50"W. This list of two values is a simple vector.
Now, imagine trying to find a city near Vancouver. A person would look at a map, while a machine learning model would look at the latitude and longitude (or vector) to find a similar location. The city of Burnaby, with coordinates 49°16'N, 122°58'W, is very close to Vancouver’s coordinates. The model can conclude, correctly, that Burnaby is near Vancouver.
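As a quick sketch, we can treat each city as a two-number vector in decimal degrees (the conversions below are approximate) and measure the straight-line distance between the vectors. Seattle is included here for comparison with the next section.
import math

# Approximate (latitude, longitude) vectors in decimal degrees
vancouver = (49.26, -123.11)
burnaby = (49.27, -122.97)
seattle = (47.61, -122.33)  # introduced just below

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

print(euclidean(vancouver, burnaby))  # small distance: Burnaby is "near" Vancouver
print(euclidean(vancouver, seattle))  # larger distance: Seattle is farther away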
Adding More Dimensions to Vectors
Now, imagine trying to find a city not only close to Vancouver but also of similar size. We can add a third “dimension” to latitude and longitude: population size. Population can be added to each city’s vector, treating it like a Z-axis, with latitude and longitude as the Y- and X-axes.
The vector for Vancouver is now 49°15'40"N, 123°06'50"W, 662,248. With this third dimension, Burnaby is no longer particularly close to Vancouver, as its population is only 249,125. Instead, the model might find Seattle, Washington, US, which has a vector of 47°36'35"N, 122°19'59"W, 749,256.
This example shows how vectors and similarity search work. However, machine learning models often generate more than three dimensions, resulting in much more complex vectors.
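Here is a hedged sketch of that third dimension in code. Because population sits on a wildly different scale than degrees of latitude and longitude, each dimension is rescaled to a 0-1 range before comparing (the ranges below are chosen purely for illustration); with that scaling, Seattle ends up closer to Vancouver than Burnaby does.
# (latitude, longitude, population) vectors; values are approximate
vancouver = (49.26, -123.11, 662_248)
cities = {
    "Burnaby": (49.27, -122.97, 249_125),
    "Seattle": (47.61, -122.33, 749_256),
}

# Illustrative min/max ranges used to bring every dimension onto a 0-1 scale
ranges = [(45.0, 50.0), (-125.0, -120.0), (0, 1_000_000)]

def scale(vec):
    return tuple((x - lo) / (hi - lo) for x, (lo, hi) in zip(vec, ranges))

def distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5

target = scale(vancouver)
for name, vec in cities.items():
    print(name, round(distance(target, scale(vec)), 3))
# Prints a smaller distance for Seattle than for Burnaby once population is included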
Even More Multi-Dimensional Vectors
For instance, how can a model tell which TV shows are similar and likely to be watched by the same people? There are many factors to consider: episode length, number of episodes, genre, number of viewers in common, actors, year of debut, and so on. All of these can be “dimensions,” with each show represented as a point along these dimensions.
Multi-dimensional vectors can help us determine if the sitcom Seinfeld is similar to the horror show Wednesday. Seinfeld debuted in 1989, while Wednesday debuted in 2022. The two shows also have different episode lengths, with Seinfeld at 22–24 minutes and Wednesday at 46–57 minutes. Looking at their vectors, we can see that these shows likely occupy very different points in a multi-dimensional representation of TV shows.
We can express these vectors as follows, listing genre, debut year, episode length in minutes, number of seasons, and number of episodes:
- Seinfeld vector: [Sitcom], 1989, 22–24, 9, 180
- Wednesday vector: [Horror], 2022, 46–57, 1, 8
A machine learning model might identify the sitcom Cheers as being much more similar to Seinfeld. Cheers is of the same genre, debuted in 1982, features episodes of 21–25 minutes, and ran for 11 seasons and 275 episodes.
- Cheers vector: [Sitcom], 1982, 21–25, 11, 275
In our examples, a city was a point along the two dimensions of latitude and longitude; we then added a third dimension of population. We also analyzed TV shows along five dimensions. Instead of two, three, or five dimensions, a TV show within a machine learning model might be a point along perhaps a hundred or a thousand dimensions.
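As a rough sketch, we can turn the vectors above into purely numeric form (a hypothetical encoding: 0 for sitcom, 1 for horror, with episode length taken as the midpoint of its range) and compare them with a simple distance measure.
import math

# Hypothetical encoding: [genre (0 = sitcom, 1 = horror), debut year,
# episode length in minutes (midpoint of range), seasons, episodes]
seinfeld  = [0, 1989, 23, 9, 180]
wednesday = [1, 2022, 51.5, 1, 8]
cheers    = [0, 1982, 23, 11, 275]

def distance(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

print("Seinfeld vs Wednesday:", round(distance(seinfeld, wednesday), 1))
print("Seinfeld vs Cheers:   ", round(distance(seinfeld, cheers), 1))
# Cheers lands much closer to Seinfeld; a real model would also scale each
# dimension so no single feature (such as episode count) dominates the distance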
How Do Embeddings Work?
Embedding is the process of creating vectors using deep learning. An “embedding” is the output of this process — the vector created by a deep learning model for similarity searches.
Embeddings that are close to each other — like Seattle and Vancouver having similar latitude and longitude values and populations — can be considered similar. Using embeddings, an algorithm can suggest a relevant TV show, find similar locations, or identify which words are likely to be used together in language models.
Tokenization Methods
Tokenization is the process of breaking down a string of text into smaller units called tokens. These tokens can be words, subwords, or even characters. Let’s explore some popular tokenization methods with examples to understand how they work.
Example String for Tokenization
example_string = "Welcome to the world of generative AI."
Method 1: White Space Tokenization
This method splits the text based on white spaces.
white_space_tokens = example_string.split()
Method 2: WordPunct Tokenization
This method splits the text into words and punctuation.
from nltk.tokenize import WordPunctTokenizer
wordpunct_tokenizer = WordPunctTokenizer()
wordpunct_tokens = wordpunct_tokenizer.tokenize(example_string)
Method 3: Treebank Word Tokenization
This method uses the standard word tokenization of the Penn Treebank.
from nltk.tokenize import TreebankWordTokenizer
treebank_tokenizer = TreebankWordTokenizer()
treebank_tokens = treebank_tokenizer.tokenize(example_string)
Tokenization Outputs
white_space_tokens, wordpunct_tokens, treebank_tokens
# Output:
# (['Welcome', 'to', 'the', 'world', 'of', 'generative', 'AI.'],
# ['Welcome', 'to', 'the', 'world', 'of', 'generative', 'AI', '.'],
# ['Welcome', 'to', 'the', 'world', 'of', 'generative', 'AI', '.'])
Each method breaks down the sentence into tokens differently, illustrating how tokenization can vary based on the method used. This process is fundamental for preparing text for analysis in natural language processing tasks.
Why Tokenization Matters
Tokenization plays a crucial role for several reasons:
- Breaking Down Complex Text: Text often contains punctuation, special characters, and multiple languages. Tokenization breaks down this complexity into manageable units (tokens), facilitating further analysis.
- Preparation for Analysis: Tokens are essential for tasks like part-of-speech tagging, syntactic parsing, named entity recognition, and sentiment analysis. These tasks rely on understanding the structure and meaning of words within a sentence or document.
- Creating Structured Data: Tokenization transforms unstructured text into structured data, where each token can be analyzed, processed, and used as input for various machine learning models.
Tokenization in Large Language Models (LLMs)
In the context of large language models (LLMs) such as BERT and GPT, tokenization takes a more sophisticated approach:
- Subword Tokenization: Unlike traditional methods that tokenize based on words or punctuation, LLMs often use subword tokenization. This approach breaks words into smaller units, allowing models to handle a broader vocabulary and adapt to different languages efficiently (see the sketch after this list).
- Embeddings: LLMs create embeddings for tokens, which are high-dimensional vectors capturing semantic relationships and contextual meanings. These embeddings enable models to understand nuances in language and perform complex tasks like language translation and sentiment analysis.
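As a minimal sketch of subword tokenization (assuming the Hugging Face transformers library is installed; the exact splits depend on each model's learned vocabulary), we can run the same example string through the WordPiece tokenizer used by BERT and the byte-level BPE tokenizer used by GPT-2.
from transformers import AutoTokenizer

example_string = "Welcome to the world of generative AI."

# WordPiece (BERT): out-of-vocabulary words are split into '##'-prefixed pieces
bert_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
print(bert_tokenizer.tokenize(example_string))

# Byte-level BPE (GPT-2): leading spaces are encoded as part of each token
gpt2_tokenizer = AutoTokenizer.from_pretrained("gpt2")
print(gpt2_tokenizer.tokenize(example_string))

# The model itself consumes token IDs, including special tokens like [CLS] and [SEP]
print(bert_tokenizer(example_string)["input_ids"])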
How Are Embeddings Used in Large Language Models (LLMs)?
Embeddings are fundamental to the effectiveness of large language models (LLMs) due to their intricate nature and high-dimensional structure. Each token’s embedding represents a vector in a multi-dimensional space, allowing the model to encapsulate a wide array of linguistic nuances. This includes understanding the meaning of words, their grammatical roles, and how they relate to other words within a sentence.
Contextual Embeddings, such as those used by BERT, take this a step further by varying the embeddings of identical words based on their context. This means that a word can have different embeddings depending on the words surrounding it in a sentence. This contextual richness is crucial for tasks like understanding semantics and syntactic structure.
In practical terms, when a sentence like “Welcome to the world of generative AI” is tokenized, each token receives its own embedding vector. In more sophisticated models like BERT, embeddings are not only derived from the final layer of the neural network but also from intermediary layers. Each layer captures different aspects of language complexity, contributing to the overall richness of the model’s understanding.
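Here is a hedged sketch of that idea, again assuming the transformers library and the bert-base-uncased checkpoint: every token in the sentence gets its own embedding vector, and BERT exposes one set of hidden states per layer in addition to the final layer.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("Welcome to the world of generative AI.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# One contextual embedding per token from the final layer: (batch, tokens, hidden size)
print(outputs.last_hidden_state.shape)

# Hidden states from every layer (the embedding layer plus 12 transformer layers for BERT base)
print(len(outputs.hidden_states))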
These embeddings serve as inputs for various NLP tasks such as sentiment analysis, question answering, and language translation. Their complexity enables LLMs to perform these tasks with a high degree of accuracy and sophistication, essentially transforming text into meaningful numerical representations.
Moreover, embeddings form a crucial part of the model’s internal representation and are intertwined with its weights. They represent learned features about language encoded during training, where adjustments to these weights refine the model’s ability to generate accurate predictions. Whether initialized randomly or from pre-trained models, these weights are optimized through training to ensure the model produces the most precise outputs for a given input context.
In essence, embeddings are not just numerical representations of tokens; they are the cornerstone of how LLMs comprehend and process language. Enhancing these embeddings through advanced techniques leads to models that can handle a broader range of linguistic tasks with greater accuracy and efficiency.
If you have any questions or feedback on this content, please feel free to share them in the comments section, and I’ll update the material accordingly.