Today we merged a pretty nice performance improvement to ZML.

PR #416 switches zml.tokenizer.Tokenizer from the Hugging Face tokenizers crate to the IREE project's tokenizer for handling Hugging Face tokenizer.json files.

The (upcoming) problem

Tokenizer performance is often overlooked, but it can have a significant impact on the overall latency of LLM inference. In some cases, it can even become the bottleneck of the entire system. This matters more and more as contexts keep getting bigger. For code generation, a 1M-token context window is now common, and a 1M-token context means 4MB of data (one u32 per token).

A slow (de)tokenization strategy can severely degrade the latency of the whole system, so having a fast and efficient tokenizer is crucial.

ZML Tokenizers

ZML handles two types of tokenizers: Hugging Face tokenizer.json files and SentencePiece models. Both are exposed through the zml.tokenizer.Tokenizer union type, which makes the distinction transparent to the user:

var tokenizer: zml.tokenizer.Tokenizer = try .fromFile(allocator, io, "tokenizer.json");
defer tokenizer.deinit();
const token_id = tokenizer.tokenId("<|im_start|>") orelse return error.TokenNotFound;

The zml.tokenizer.Tokenizer type also provides Encoder and Decoder types for streaming purposes. We are also iterating on std.Io.Reader/Writer-based APIs for great composability.

For SentencePiece, we use the official google/sentencepiece implementation.

For Hugging Face, we leverage the tokenizers crate via a C trampoline.

The IREE tokenizer

The IREE tokenizer, introduced by Stella Laurenzo in her LinkedIn post, is a C reimplementation of the Hugging Face and TikToken tokenizers. The interesting part: it was generated by agents, orchestrated and reviewed by Ben Vanik.

It also happens to be 10x faster than everything else.

We’re sold.

Easily integrated via Bazel

At its core, the IREE tokenizer is a pretty straightforward C API. Simple, right? Well, as always, there are strings attached. However, because ZML and IREE are both built with Bazel, and thanks to our native translate-c support in rules_zig, integration was super easy.

IREE being a pretty big project, we had to be careful to pull in only the tokenizer and its dependencies. This is done via a git sparse checkout and a little patch to shim out or remove a few dependencies from the build graph. No biggie.

After that, we just had to depend on a few targets, bind it to our abstractions and we were good to go.

Benchmarks

With all said and done, here are the benchmarks of the ZML integration on a ~100k-token text:

| Model | Hugging Face (ms) | IREE oneshot / stream (ms) | Speedup |
| --- | --- | --- | --- |
| liquidai/LFM2.5-1.2B-Instruct | 40.5 | 8.5 / 7.3 | 4.8x / 5.5x |
| Qwen/Qwen3.5-9B | 59.4 | 6.0 / 8.2 | 9.9x / 7.2x |
| mistralai/Ministral-3-3B-Reasoning-2512 | 59.2 | 7.4 / 6.9 | 8x / 8.6x |

Sunsetting the Rust implementation

We’re pretty happy with the IREE tokenizer. There has been a long-running discussion at ZML about whether we should write our own implementation in Zig, mostly because we had some ideas to optimize the tokenization process. That said, the performance uplift is so significant that we don’t think it’s worth it for now.

As such, we’ll be removing the Rust implementation in a separate PR.

Happy hacking!