Understanding the NULL Character: Behavior Across Languages, Databases, and Editors
In this post, we explore the NULL character in different programming languages and databases, and how it’s presented in different contexts.
Final part of our BPE tokenizer series, where we implement encoding and decoding capabilities. We’ll write comprehensive tests for token conversion, handle special tokens, and ensure proper error handling for edge cases.
Second part of our BPE tokenizer series, focusing on implementing the train method. We’ll cover the core BPE algorithm, write tests for training functionality, and implement vocabulary management and pair merging logic.
First part of a series on building a Byte Pair Encoding tokenizer using Test-Driven Development. We set up our project structure, create a virtual environment, and write our first test for the tokenizer’s initialization.
A hands-on exploration of UTF-8 encoding, prompted by the need to prepare text for a Byte Pair Encoding (BPE) tokenizer. This post breaks down why encoded Unicode text displays as a mix of readable characters and hex escape codes, clarifies that all bytes are fundamentally integers, and demystifies the non-contiguous ranges in the UTF-8 specification with examples. Features AI-assisted explanations from Gemini and ChatGPT.
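As a quick illustration of the mixed output that summary describes, here is a minimal Python sketch. The sample string and variable names are my own choices for illustration, not taken from the post. It shows that ASCII characters appear as readable text in a `bytes` object, non-ASCII characters appear as hex escapes, and every byte is just an integer underneath.

```python
# Sample text chosen for illustration: ASCII letters plus two non-ASCII characters.
text = "héllo €"

# Encode to UTF-8. Python displays printable ASCII bytes as characters
# and every other byte as a \xNN hex escape.
encoded = text.encode("utf-8")
print(encoded)        # b'h\xc3\xa9llo \xe2\x82\xac'

# The same data viewed as plain integers in the range 0-255.
as_ints = list(encoded)
print(as_ints)        # [104, 195, 169, 108, 108, 111, 32, 226, 130, 172]

# Characters take between one and four bytes depending on their code point.
byte_lengths = {ch: len(ch.encode("utf-8")) for ch in "Aé€😀"}
print(byte_lengths)   # {'A': 1, 'é': 2, '€': 3, '😀': 4}
```

The readable letters and the `\x..` escapes are the same kind of thing; only the display differs, which is why a byte-level tokenizer can treat the whole sequence uniformly as integers.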