<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Inference on ZML - Model to Metal</title><link>https://zml.ai/tags/inference/</link><description>Recent content in Inference on ZML - Model to Metal</description><generator>Hugo</generator><language>en-us</language><lastBuildDate>Tue, 24 Mar 2026 16:00:00 +0100</lastBuildDate><atom:link href="https://zml.ai/tags/inference/index.xml" rel="self" type="application/rss+xml"/><item><title>Introducing ZML/v2</title><link>https://zml.ai/posts/zml-v2/</link><pubDate>Tue, 24 Mar 2026 16:00:00 +0100</pubDate><guid>https://zml.ai/posts/zml-v2/</guid><description>&lt;p&gt;ZML is an inference stack built close to the hardware. It lowers models directly onto NVIDIA, AMD, TPU, and Trainium
targets from a single codebase, without depending on the Python-heavy runtime layers that burden most of the
ecosystem.&lt;/p&gt;
&lt;p&gt;The guiding idea behind zml/v1 was simplicity: give ZML a model and its weights, and the system would handle
compilation, placement, and execution for you. That made the first version approachable and effective for standard
deployments, but it also baked too much behavior into implicit global state. As the project pushed into partial
compilation, custom passes, sharding, quantization, and more backend-specific execution paths, those implicit shortcuts
became constraints. ZML/v2 is the rewrite that makes those concepts explicit: platform ownership, compilation, memory,
IO, and placement are now first-class, so advanced use cases can be expressed directly instead of being forced through
workarounds.&lt;/p&gt;</description></item></channel></rss>