<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Huggingface on ZML - Model to Metal</title><link>https://zml.ai/tags/huggingface/</link><description>Recent content in Huggingface on ZML - Model to Metal</description><generator>Hugo</generator><language>en-us</language><lastBuildDate>Tue, 07 Apr 2026 09:00:00 +0100</lastBuildDate><atom:link href="https://zml.ai/tags/huggingface/index.xml" rel="self" type="application/rss+xml"/><item><title>10x faster tokenization</title><link>https://zml.ai/posts/iree-tokenizer/</link><pubDate>Tue, 07 Apr 2026 09:00:00 +0100</pubDate><guid>https://zml.ai/posts/iree-tokenizer/</guid><description>&lt;p&gt;Today we merged a pretty nice performance improvement to ZML.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://github.com/zml/zml/pull/416"&gt;PR #416&lt;/a&gt; switches &lt;code&gt;zml.tokenizer.Tokenizer&lt;/code&gt; from using the Hugging Face
&lt;a href="https://crates.io/crates/tokenizers"&gt;&lt;code&gt;tokenizers&lt;/code&gt;&lt;/a&gt; crate to using the &lt;a href="https://github.com/iree-org/iree/blob/main/runtime/src/iree/tokenizer/README.md"&gt;IREE project tokenizer&lt;/a&gt;
when using Hugging Face &lt;code&gt;tokenizer.json&lt;/code&gt; files.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://zml.ai/img/posts/iree-tokenizer/1-benchmark.jpg" alt=""&gt;&lt;/p&gt;
&lt;h1 id="the-upcoming-problem"&gt;The (upcoming) problem&lt;/h1&gt;
&lt;p&gt;Tokenizer performance is often overlooked, but it can have a significant impact on the overall latency of LLM inference.
In some cases, it can even become the bottleneck of the entire system. This matters more and more as context windows
keep growing. For code generation, a 1M-token context is now common, and 1M tokens means 4MB of data (one &lt;code&gt;u32&lt;/code&gt; per token).&lt;/p&gt;</description></item></channel></rss>