Python

Understanding the NULL Character: Behavior Across Languages, Databases, and Editors

In this post, we explore how the NULL character behaves in different programming languages and databases, and how editors display it.

Building a BPE Tokenizer with TDD - Part 3: Implementing Encode and Decode Methods

Final part of our BPE tokenizer series, where we implement encoding and decoding capabilities. We’ll write comprehensive tests for token conversion, handle special tokens, and ensure proper error handling for edge cases.

Building a BPE Tokenizer with TDD - Part 2: Implementing the Train Method

Second part of our BPE tokenizer series, focusing on implementing the train method. We’ll cover the core BPE algorithm, write tests for training functionality, and implement vocabulary management and pair merging logic.

Building a BPE Tokenizer with TDD - Part 1: Project Setup and First Test

First part of a series on building a Byte Pair Encoding tokenizer using Test-Driven Development. We set up our project structure, create a virtual environment, and write our first test for the tokenizer’s initialization.

A Deep Dive into UTF-8 for BPE Tokenization

A hands-on exploration of UTF-8 encoding, prompted by the need to prepare text for a Byte Pair Encoding (BPE) tokenizer. This post breaks down why encoded Unicode text displays as a mix of readable characters and hex escape codes, clarifies that all bytes are fundamentally integers, and demystifies the non-contiguous ranges in the UTF-8 specification with examples. Features AI-assisted explanations from Gemini and ChatGPT.
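
As a small illustration of the first two points, here is a minimal sketch. It is not taken from the post itself, and the sample string is an assumption chosen for the example. Python displays ASCII-range bytes as readable characters and all other bytes as hex escapes, yet every byte is just an integer from 0 to 255:

```python
# Sample text (an assumption for this sketch): four ASCII letters plus "é" (U+00E9).
text = "héllo"

# Encode to UTF-8. ASCII characters become one byte each; "é" becomes two bytes.
encoded = text.encode("utf-8")

# The bytes repr shows ASCII bytes as characters and the rest as \x hex escapes.
print(encoded)        # b'h\xc3\xa9llo'

# Iterating over a bytes object yields plain integers.
print(list(encoded))  # [104, 195, 169, 108, 108, 111]

# Decoding reverses the process.
assert encoded.decode("utf-8") == text
```

The integer list is the kind of input a byte-level BPE tokenizer would typically start from.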