<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Huggingface on ZML - Model to Metal</title><link>https://zml.ai/tags/huggingface/</link><description>Recent content in Huggingface on ZML - Model to Metal</description><generator>Hugo</generator><language>en-us</language><lastBuildDate>Tue, 07 Apr 2026 09:00:00 +0100</lastBuildDate><atom:link href="https://zml.ai/tags/huggingface/index.xml" rel="self" type="application/rss+xml"/><item><title>10x faster tokenization</title><link>https://zml.ai/posts/iree-tokenizer/</link><pubDate>Tue, 07 Apr 2026 09:00:00 +0100</pubDate><guid>https://zml.ai/posts/iree-tokenizer/</guid><description>&lt;p&gt;Today we merged a pretty nice performance improvement to ZML.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://github.com/zml/zml/pull/416"&gt;PR #416&lt;/a&gt; switches &lt;code&gt;zml.tokenizer.Tokenizer&lt;/code&gt; from using the Hugging Face
&lt;a href="https://crates.io/crates/tokenizers"&gt;&lt;code&gt;tokenizers&lt;/code&gt;&lt;/a&gt; crate to using the &lt;a href="https://github.com/iree-org/iree/blob/main/runtime/src/iree/tokenizer/README.md"&gt;IREE project tokenizer&lt;/a&gt;
when using Hugging Face &lt;code&gt;tokenizer.json&lt;/code&gt; files.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://zml.ai/img/posts/iree-tokenizer/1-benchmark.jpg" alt=""&gt;&lt;/p&gt;
&lt;h1 id="the-upcoming-problem"&gt;The (upcoming) problem&lt;/h1&gt;
&lt;p&gt;Tokenizer performance is often overlooked, but it can have a significant impact on the overall latency of LLM inference.
In some cases, it can even become the bottleneck of the entire system. This matters more and more as context windows
keep growing. For code generation, a 1M-token context is now common, and 1M tokens means 4MB of data (one &lt;code&gt;u32&lt;/code&gt; per token).&lt;/p&gt;</description></item></channel></rss>