Diving into Natural Language Processing (NLP) often leads to the intricate world of tokenization. Before algorithms like Byte Pair Encoding (BPE) can work their magic, we must understand how text is represented at a fundamental level. This post documents a learning journey into UTF-8, a crucial prerequisite for anyone looking to understand how modern tokenizers operate.
The First Hurdle: Making Sense of Encoded Bytes
While learning to implement the Byte Pair Encoding (BPE) algorithm from scratch, I came across a fundamental step that often precedes the main algorithm: converting text into a sequence of bytes and then into integers. I was following two excellent tutorials:
- BPE from Scratch by Sebastian Raschka
- Andrej Karpathy’s video tutorial: Let’s build the GPT Tokenizer (which also points to a helpful blog post: A Programmer’s Introduction to Unicode).
The initial step involves taking a string and encoding it. Let’s see what happens when we encode a complex string containing various Unicode characters.
text = """
Unicode! 🅤🅝🅘🅒🅞🅓🅔‽ 🇺🇳🇮🇨🇴🇩🇪! 😄 The very name strikes fear and awe into the hearts of programmers worldwide. We all know we ought to “support Unicode” in our software (whatever that means—like using wchar_t for all the strings, right?). But Unicode can be abstruse, and diving into the thousand-page Unicode Standard plus its dozens of supplementary annexes, reports, and notes can be more than a little intimidating. I don’t blame programmers for still finding the whole thing mysterious, even 30 years after Unicode’s inception.
"
encoded_text = text.encode('utf-8')
print(encoded_text)
The output is a byte string:
b'\n\xef\xbc\xb5\xef\xbd\x8e\xef\xbd\x89\xef\xbd\x83\xef\xbd\x8f\xef\xbd\x84\xef\xbd\x85! \xf0\x9f\x85\xa4\xf0\x9f\x85\x9d\xf0\x9f\x85\x98\xf0\x9f\x85\x92\xf0\x9f\x85\x9e\xf0\x9f\x85\x93\xf0\x9f\x85\x94\xe2\x80\xbd \xf0\x9f\x87\xba\xe2\x80\x8c\xf0\x9f\x87\xb3\xe2\x80\x8c\xf0\x9f\x87\xae\xe2\x80\x8c\xf0\x9f\x87\xa8\xe2\x80\x8c\xf0\x9f\x87\xb4\xe2\x80\x8c\xf0\x9f\x87\xa9\xe2\x80\x8c\xf0\x9f\x87\xaa! \xf0\x9f\x98\x84 The very name strikes fear and awe into the hearts of programmers worldwide. We all know we ought to \xe2\x80\x9csupport Unicode\xe2\x80\x9d in our software (whatever that means\xe2\x80\x94like using wchar_t for all the strings, right?). But Unicode can be abstruse, and diving into the thousand-page Unicode Standard plus its dozens of supplementary annexes, reports, and notes can be more than a little intimidating. I don\xe2\x80\x99t blame programmers for still finding the whole thing mysterious, even 30 years after Unicode\xe2\x80\x99s inception.\n'
This output immediately sparked a couple of questions in my mind:
- Why are some bytes shown in hexadecimal format (e.g., `\xef`), while others appear as regular characters (like in the word `programmers`)?
- These bytes can be mapped to integers. Does this mean that `\xef` and the letter `p` are both just integers displayed differently?
To get clarity, I turned to Google’s Gemini for an explanation.
Understanding Bytes and Integers (with help from Gemini)
*Disclaimer: This section was generated with the assistance of Google’s Gemini and has been edited for clarity.*
The answers to my questions are indeed related to how characters are represented.
- Why do some bytes look like `\xef` while others look like `p`?

  When you encode a string to UTF-8, every character becomes one or more bytes. For characters within the standard ASCII set (like the letters in “programmers”), the UTF-8 representation is a single byte that is identical to its ASCII value. Since these byte values correspond to printable characters, Python displays them as such.

  However, for characters outside the basic ASCII range (like the fullwidth `Ｕ`, `🅤`, or `😄`), the UTF-8 representation requires multiple bytes. These individual bytes often don’t correspond to any printable ASCII characters. Therefore, Python displays them using their hexadecimal escape sequences, like `\xef` and `\xf0`. A sequence of these bytes, when decoded together, represents a single Unicode character.

- Are bytes just integers?

  Yes, precisely! At the fundamental level, everything in the byte string is an integer. The way it’s displayed is just a matter of convention and readability.

  - The character `p` is displayed as `p` because its integer value (112) falls within the printable ASCII range.
  - The sequence `\xef` is the hexadecimal representation of the integer 239. It’s shown this way because this value on its own is not a printable ASCII character.
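A quick check in Python makes both points concrete (my own illustration, not from the tutorials). The letter p round-trips as a single printable byte, while the fullwidth Ｕ that opens the sample text spans three bytes that only form a character when decoded together:

# 'p' is ASCII: one byte whose value is the ASCII code 112.
print('p'.encode('utf-8'))      # b'p'
print(ord('p'))                 # 112

# The fullwidth 'Ｕ' (U+FF35) needs three bytes in UTF-8.
u_bytes = 'Ｕ'.encode('utf-8')
print(u_bytes)                  # b'\xef\xbc\xb5'

# Decoded together, the three bytes yield a single character again.
print(u_bytes.decode('utf-8'))  # Ｕ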
We can prove this by explicitly converting the byte string into a list of integers:
list_of_integers = list(map(int, encoded_text))
# Equivalent: list(encoded_text) or list(bytearray(text, 'utf-8')),
# since iterating over a bytes object already yields integers.
print(list_of_integers)
This gives us a long list of integers, where you can find 239 (for \xef) and 112 (for p). This integer list is the actual input that the BPE algorithm will work with.
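To connect this back to tokenization: the first thing BPE does with such a list is count adjacent pairs and merge the most frequent one into a new token. A minimal sketch of that counting step (my own simplification, not the exact code from either tutorial):

from collections import Counter

# Count every adjacent pair of integers in the sequence.
pair_counts = Counter(zip(list_of_integers, list_of_integers[1:]))

# The most frequent pair is the first merge candidate in BPE.
top_pair, count = pair_counts.most_common(1)[0]
print(top_pair, count)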
The Second Hurdle: Decoding the UTF-8 Specification
Deeper into UTF-8 Encoding Ranges
After understanding the byte representation, I read more about the structure of UTF-8 from the Unicode article and found this table:
| UTF-8 (binary) | Code point (binary) | Range |
|---|---|---|
| 0xxxxxxx | xxxxxxx | U+0000–U+007F |
| 110xxxxx 10yyyyyy | xxxxx yyyyyy | U+0080–U+07FF |
| 1110xxxx 10yyyyyy 10zzzzzz | xxxx yyyyyy zzzzzz | U+0800–U+FFFF |
| 11110xxx 10yyyyyy 10zzzzzz 10wwwwww | xxx yyyyyy zzzzzz wwwwww | U+10000–U+10FFFF |
This led to two more questions:
- Can you give me a concrete example for each of these four cases?
- The integer ranges for these cases don’t start from zero; each byte length skips over the values already covered by shorter encodings. Why are there gaps?
This time, I asked ChatGPT for a detailed explanation.
UTF-8 Encoding Explained (with help from ChatGPT)
*Disclaimer: This section was generated with the assistance of OpenAI’s ChatGPT and has been edited for clarity.*
📝 What UTF-8 Does
- Each Unicode character (code point) is stored using 1 to 4 bytes.
- The prefix bits (like `0xxxxxxx`, `110xxxxx`) tell us how many bytes the character takes.
- The remaining bits hold the actual code point index of the character.
Here is a summary of the encoding patterns with examples:
| Bytes | Range | Format (Binary) | Example | Unicode | UTF-8 (Hex) |
|---|---|---|---|---|---|
| 1 | U+0000–U+007F | 0xxxxxxx | ‘A’ | U+0041 | 41 |
| 2 | U+0080–U+07FF | 110xxxxx 10yyyyyy | ‘é’ | U+00E9 | C3 A9 |
| 3 | U+0800–U+FFFF | 1110xxxx 10yyyyyy 10zzzzzz | ‘你’ | U+4F60 | E4 BD A0 |
| 4 | U+10000–U+10FFFF | 11110xxx 10yyyyyy 10zzzzzz 10wwwwww | ‘😀’ | U+1F600 | F0 9F 98 80 |
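Each row of this table is easy to verify in Python, and the 2-byte case can even be reconstructed by hand from the bit pattern (a small sketch of my own, not part of the ChatGPT answer):

# Verify the table rows: code point and UTF-8 bytes for each example.
for ch in ['A', 'é', '你', '😀']:
    print(ch, hex(ord(ch)), ch.encode('utf-8').hex(' ').upper())
# A 0x41 41
# é 0xe9 C3 A9
# 你 0x4f60 E4 BD A0
# 😀 0x1f600 F0 9F 98 80

# Build 'é' (U+00E9 = 0b11101001) by hand using 110xxxxx 10yyyyyy.
cp = ord('é')
byte1 = 0b11000000 | (cp >> 6)    # 110xxxxx -> 0xC3
byte2 = 0b10000000 | (cp & 0x3F)  # 10yyyyyy -> 0xA9
print(hex(byte1), hex(byte2))     # 0xc3 0xa9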
🔑 Why are there “gaps” in the ranges?
This was a sharp observation. Even though a longer byte pattern has enough bits to re-encode the smaller code points, each range starts exactly where the previous one ends. The ranges are kept strictly non-overlapping for a critical reason: UTF-8 is designed to be unambiguous and self-synchronizing.
This design ensures that:
- No valid byte sequence is a prefix of another, and continuation bytes always carry the `10xxxxxx` pattern. This allows you to jump into the middle of a byte stream and still find the beginning of a character without having to parse from the start.
- Each encoding length has “overhead” bits (the `110`, `1110`, `11110` prefixes) that signal how many bytes the character sequence takes.
These “gaps” are intentional and crucial for robustness.
- The 2-byte block starts at `U+0080` because the range `U+0000`–`U+007F` is already covered by 1-byte encodings.
- The 3-byte block starts at `U+0800` because anything below that is handled by 1- or 2-byte encodings.
- The 4-byte block starts at `U+10000` because anything below that is handled by 1-, 2-, or 3-byte encodings.
This staggering ensures there is no overlap between byte lengths, making UTF-8 backward-compatible with ASCII, robust, and unambiguous.
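A short sketch (my own, assuming well-formed UTF-8 input) shows the self-synchronizing property in action: continuation bytes always match 10xxxxxx, so from an arbitrary position you can skip backwards past them to the start of the current character:

data = 'héllo 😀'.encode('utf-8')

def char_start(buf, i):
    # Continuation bytes look like 10xxxxxx; step back past them.
    while buf[i] & 0b11000000 == 0b10000000:
        i -= 1
    return i

# Land in the middle of the 4-byte emoji and recover its start.
i = len(data) - 2                # index 9, inside '😀'
start = char_start(data, i)
print(start, data[start:].decode('utf-8'))  # 7 😀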
Conclusion
- UTF-8 is a variable-length encoding system: It uses 1, 2, 3, or 4 bytes to represent every possible Unicode character.
- It’s backward-compatible with ASCII: The first 128 characters (U+0000 to U+007F) are encoded in a single byte, identical to the ASCII standard.
- It’s robust and self-synchronizing: Thanks to its clever use of byte prefixes, you can always determine the start of a character in a stream of bytes, preventing corruption and errors.
Understanding this byte-level foundation is not just an academic exercise; it’s essential for truly grasping how and why modern NLP models and tokenizers work the way they do.