Diving into Natural Language Processing (NLP) often leads to the intricate world of tokenization. Before algorithms like Byte Pair Encoding (BPE) can work their magic, we must understand how text is represented at a fundamental level. This post documents a learning journey into UTF-8, a crucial prerequisite for anyone looking to understand how modern tokenizers operate.
The First Hurdle: Making Sense of Encoded Bytes
While learning to implement the Byte Pair Encoding (BPE) algorithm from scratch, I came across a fundamental step that often precedes the main algorithm: converting text into a sequence of bytes and then into integers. I was following two excellent tutorials:
- BPE from Scratch by Sebastian Raschka
- Andrej Karpathy’s video tutorial: Let’s build the GPT Tokenizer (which also points to a helpful blog post: A Programmer’s Introduction to Unicode).
The initial step involves taking a string and encoding it. Let’s see what happens when we encode a complex string containing various Unicode characters.
text = """
Unicode! 🅤🅝🅘🅒🅞🅓🅔‽ 🇺🇳🇮🇨🇴🇩🇪! 😄 The very name strikes fear and awe into the hearts of programmers worldwide. We all know we ought to “support Unicode” in our software (whatever that means—like using wchar_t for all the strings, right?). But Unicode can be abstruse, and diving into the thousand-page Unicode Standard plus its dozens of supplementary annexes, reports, and notes can be more than a little intimidating. I don’t blame programmers for still finding the whole thing mysterious, even 30 years after Unicode’s inception.
"
encoded_text = text.encode('utf-8')
print(encoded_text)
The output is a byte string:
b'\n\xef\xbc\xb5\xef\xbd\x8e\xef\xbd\x89\xef\xbd\x83\xef\xbd\x8f\xef\xbd\x84\xef\xbd\x85! \xf0\x9f\x85\xa4\xf0\x9f\x85\x9d\xf0\x9f\x85\x98\xf0\x9f\x85\x92\xf0\x9f\x85\x9e\xf0\x9f\x85\x93\xf0\x9f\x85\x94\xe2\x80\xbd \xf0\x9f\x87\xba\xe2\x80\x8c\xf0\x9f\x87\xb3\xe2\x80\x8c\xf0\x9f\x87\xae\xe2\x80\x8c\xf0\x9f\x87\xa8\xe2\x80\x8c\xf0\x9f\x87\xb4\xe2\x80\x8c\xf0\x9f\x87\xa9\xe2\x80\x8c\xf0\x9f\x87\xaa! \xf0\x9f\x98\x84 The very name strikes fear and awe into the hearts of programmers worldwide. We all know we ought to \xe2\x80\x9csupport Unicode\xe2\x80\x9d in our software (whatever that means\xe2\x80\x94like using wchar_t for all the strings, right?). But Unicode can be abstruse, and diving into the thousand-page Unicode Standard plus its dozens of supplementary annexes, reports, and notes can be more than a little intimidating. I don\xe2\x80\x99t blame programmers for still finding the whole thing mysterious, even 30 years after Unicode\xe2\x80\x99s inception.\n'
This output immediately sparked a couple of questions in my mind:
- Why are some bytes shown in hexadecimal format (e.g., `\xef`), while others appear as regular characters (like in the word `programmers`)?
- These bytes can be mapped to integers. Does this mean that `\xef` and the letter `p` are both just integers displayed differently?
To get clarity, I turned to Google’s Gemini for an explanation.
Understanding Bytes and Integers (with help from Gemini)
*Disclaimer: This section was generated with the assistance of Google’s Gemini and has been edited for clarity.*
The answers to my questions are indeed related to how characters are represented.
- Why do some bytes look like `\xef` while others look like `p`?

  When you encode a string to UTF-8, every character becomes one or more bytes. For characters within the standard ASCII set (like the letters in “programmers”), the UTF-8 representation is a single byte that is identical to its ASCII value. Since these byte values correspond to printable characters, Python displays them as such.

  However, for characters outside the basic ASCII range (like the fullwidth `Ｕ`, `🅤`, or `😄`), the UTF-8 representation requires multiple bytes. These individual bytes often don’t correspond to any printable ASCII characters. Therefore, Python displays them using their hexadecimal escape sequences, like `\xef` and `\xf0`. A sequence of these bytes, when decoded together, represents a single Unicode character.

- Are bytes just integers?

  Yes, precisely! At the fundamental level, everything in the byte string is an integer. The way it’s displayed is just a matter of convention and readability.

  - The character `p` is displayed as `p` because its integer value (112) falls within the printable ASCII range.
  - The sequence `\xef` is the hexadecimal representation of the integer 239. It’s shown this way because this value on its own is not a printable ASCII character.
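A quick check in Python makes both points concrete (my own illustration, not from the tutorials). The letter p round-trips as a single printable byte, while the fullwidth Ｕ that opens the sample text spans three bytes that only form a character when decoded together:

# 'p' is ASCII: one byte whose value is the ASCII code 112.
print('p'.encode('utf-8'))      # b'p'
print(ord('p'))                 # 112

# The fullwidth 'Ｕ' (U+FF35) needs three bytes in UTF-8.
u_bytes = 'Ｕ'.encode('utf-8')
print(u_bytes)                  # b'\xef\xbc\xb5'

# Decoded together, the three bytes yield a single character again.
print(u_bytes.decode('utf-8'))  # Ｕ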
We can prove this by explicitly converting the byte string into a list of integers:
list_of_integers = list(map(int, encoded_text))
# Equivalent: list(encoded_text) or list(bytearray(text, 'utf-8')),
# since iterating over a bytes object already yields integers.
print(list_of_integers)
This gives us a long list of integers, where you can find 239 (for \xef) and 112 (for p). This integer list is the actual input that the BPE algorithm will work with.
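To connect this back to tokenization: the first thing BPE does with such a list is count adjacent pairs and merge the most frequent one into a new token. A minimal sketch of that counting step (my own simplification, not the exact code from either tutorial):

from collections import Counter

# Count every adjacent pair of integers in the sequence.
pair_counts = Counter(zip(list_of_integers, list_of_integers[1:]))

# The most frequent pair is the first merge candidate in BPE.
top_pair, count = pair_counts.most_common(1)[0]
print(top_pair, count)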
The Second Hurdle: Decoding the UTF-8 Specification
Deeper into UTF-8 Encoding Ranges
After understanding the byte representation, I read more about the structure of UTF-8 from the Unicode article and found this table:
| UTF-8 (binary) | Code point (binary) | Range |
|---|---|---|
| 0xxxxxxx | xxxxxxx | U+0000–U+007F |
| 110xxxxx 10yyyyyy | xxxxx yyyyyy | U+0080–U+07FF |
| 1110xxxx 10yyyyyy 10zzzzzz | xxxx yyyyyy zzzzzz | U+0800–U+FFFF |
| 11110xxx 10yyyyyy 10zzzzzz 10wwwwww | xxx yyyyyy zzzzzz wwwwww | U+10000–U+10FFFF |
This led to two more questions:
- Can you give me a concrete example for each of these four cases?
- The integer ranges for these cases don’t start from zero; each byte length skips over the values already covered by shorter encodings. Why are there gaps?
This time, I asked ChatGPT for a detailed explanation.
UTF-8 Encoding Explained (with help from ChatGPT)
*Disclaimer: This section was generated with the assistance of OpenAI’s ChatGPT and has been edited for clarity.*
📝 What UTF-8 Does
- Each Unicode character (code point) is stored using 1 to 4 bytes.
- The prefix bits (like `0xxxxxxx`, `110xxxxx`) tell us how many bytes the character takes.
- The remaining bits hold the actual code point index of the character.
Here is a summary of the encoding patterns with examples:
| Bytes | Range | Format (Binary) | Example | Unicode | UTF-8 (Hex) |
|---|---|---|---|---|---|
| 1 | U+0000–U+007F | 0xxxxxxx | ‘A’ | U+0041 | 41 |
| 2 | U+0080–U+07FF | 110xxxxx 10yyyyyy | ‘é’ | U+00E9 | C3 A9 |
| 3 | U+0800–U+FFFF | 1110xxxx 10yyyyyy 10zzzzzz | ‘你’ | U+4F60 | E4 BD A0 |
| 4 | U+10000–U+10FFFF | 11110xxx 10yyyyyy 10zzzzzz 10wwwwww | ‘😀’ | U+1F600 | F0 9F 98 80 |
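Each row of this table is easy to verify in Python, and the 2-byte case can even be reconstructed by hand from the bit pattern (a small sketch of my own, not part of the ChatGPT answer):

# Verify the table rows: code point and UTF-8 bytes for each example.
for ch in ['A', 'é', '你', '😀']:
    print(ch, hex(ord(ch)), ch.encode('utf-8').hex(' ').upper())
# A 0x41 41
# é 0xe9 C3 A9
# 你 0x4f60 E4 BD A0
# 😀 0x1f600 F0 9F 98 80

# Build 'é' (U+00E9 = 0b11101001) by hand using 110xxxxx 10yyyyyy.
cp = ord('é')
byte1 = 0b11000000 | (cp >> 6)    # 110xxxxx -> 0xC3
byte2 = 0b10000000 | (cp & 0x3F)  # 10yyyyyy -> 0xA9
print(hex(byte1), hex(byte2))     # 0xc3 0xa9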
🔑 Why are there “gaps” in the ranges?
This was a sharp observation. Even though a longer byte pattern has enough bits to re-encode the smaller code points, each range starts exactly where the previous one ends. The ranges are kept strictly non-overlapping for a critical reason: UTF-8 is designed to be unambiguous and self-synchronizing.
This design ensures that:
- No valid byte sequence is a prefix of another, and continuation bytes always carry the `10xxxxxx` pattern. This allows you to jump into the middle of a byte stream and still find the beginning of a character without having to parse from the start.
- Each encoding length has “overhead” bits (the `110`, `1110`, `11110` prefixes) that signal how many bytes the character sequence takes.
These “gaps” are intentional and crucial for robustness.
- The 2-byte block starts at `U+0080` because the range `U+0000`–`U+007F` is already covered by 1-byte encodings.
- The 3-byte block starts at `U+0800` because anything below that is handled by 1- or 2-byte encodings.
- The 4-byte block starts at `U+10000` because anything below that is handled by 1-, 2-, or 3-byte encodings.
This staggering ensures there is no overlap between byte lengths, making UTF-8 backward-compatible with ASCII, robust, and unambiguous.
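A short sketch (my own, assuming well-formed UTF-8 input) shows the self-synchronizing property in action: continuation bytes always match 10xxxxxx, so from an arbitrary position you can skip backwards past them to the start of the current character:

data = 'héllo 😀'.encode('utf-8')

def char_start(buf, i):
    # Continuation bytes look like 10xxxxxx; step back past them.
    while buf[i] & 0b11000000 == 0b10000000:
        i -= 1
    return i

# Land in the middle of the 4-byte emoji and recover its start.
i = len(data) - 2                # index 9, inside '😀'
start = char_start(data, i)
print(start, data[start:].decode('utf-8'))  # 7 😀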
Conclusion
- UTF-8 is a variable-length encoding system: It uses 1, 2, 3, or 4 bytes to represent every possible Unicode character.
- It’s backward-compatible with ASCII: The first 128 characters (U+0000 to U+007F) are encoded in a single byte, identical to the ASCII standard.
- It’s robust and self-synchronizing: Thanks to its clever use of byte prefixes, you can always determine the start of a character in a stream of bytes, preventing corruption and errors.
Understanding this byte-level foundation is not just an academic exercise; it’s essential for truly grasping how and why modern NLP models and tokenizers work the way they do.