<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Inference on ZML - Model to Metal</title><link>https://zml.ai/tags/inference/</link><description>Recent content in Inference on ZML - Model to Metal</description><generator>Hugo</generator><language>en-us</language><lastBuildDate>Tue, 24 Mar 2026 16:00:00 +0100</lastBuildDate><atom:link href="https://zml.ai/tags/inference/index.xml" rel="self" type="application/rss+xml"/><item><title>Introducing ZML/v2</title><link>https://zml.ai/posts/zml-v2/</link><pubDate>Tue, 24 Mar 2026 16:00:00 +0100</pubDate><guid>https://zml.ai/posts/zml-v2/</guid><description>&lt;p&gt;ZML is an inference stack built close to the hardware. It lowers models directly onto NVIDIA, AMD, TPU, and Trainium
targets from a single codebase, without depending on the Python-heavy runtime layers that burden most of the
ecosystem.&lt;/p&gt;
&lt;p&gt;The guiding idea behind zml/v1 was simplicity: give ZML a model and its weights, and the system would handle
compilation, placement, and execution for you. That made the first version approachable and effective for standard
deployments, but it also baked too much behavior into implicit global state. As the project pushed into partial
compilation, custom passes, sharding, quantization, and more backend-specific execution paths, those implicit shortcuts
became constraints. ZML/v2 is the rewrite that makes those concepts explicit: platform ownership, compilation, memory,
IO, and placement are now first-class, so advanced use cases can be expressed directly instead of being forced through
workarounds.&lt;/p&gt;</description></item></channel></rss>