Welcome talk covering some organizational questions
llama.cpp has become a key tool for running LLMs efficiently on any hardware. This talk explores how multimodal support has grown in the project, focusing on libmtmd, a library added in April 2025 to make multimodal features in llama.cpp easier to use and to maintain.
We will first cover the main achievements. These include merging the separate CLI tools for different models into a single tool, llama-mtmd-cli. Next, we will discuss how libmtmd works with llama-server and show real examples of low-latency OCR applications. We will also talk about adding audio support, which lets newer models summarize audio inputs. Finally, we will cover the challenges of handling legacy code while keeping the project flexible for future models.
Looking forward, the talk will share plans for new features such as video input, text-to-speech support, and image generation. Attendees will also learn how to contribute and how to use these multimodal tools in their own projects.
Running modern Large Language Model (LLM) workloads on macOS presents a unique challenge: reconciling powerful local hardware with a mature, Linux-first AI tooling and container ecosystem.
The Problem: Bridging the OS and Acceleration Gap
While containerization offers macOS developers access to Linux-centric tools like Ramalama and the Podman Desktop AI Lab, introducing a virtualization layer immediately compromises GPU acceleration. Direct device passthrough is infeasible, as it monopolizes the display and host system resources. Consequently, achieving high-performance LLM inference requires a sophisticated bridging mechanism.
The Solution: Para-Virtualization via GGML API Remoting
We present the implementation of an API-remoting, para-virtualized execution path for llama.cpp. This novel design is built directly on the GGML API, allowing us to selectively offload compute-intensive operations to execute on the macOS host. Crucially, this execution path natively leverages the GGML-Metal backend, achieving full-speed host performance. This strategy preserves a fully containerized Linux developer experience inside the virtual machine while achieving hardware-level acceleration outside of it.
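To make the offload idea concrete, here is a minimal, purely illustrative C sketch of selective API remoting. All names are hypothetical; this is not the actual GGML or llama.cpp code, just the shape of the dispatch decision: compute-heavy operations are serialized and forwarded to a host-side agent, while everything else stays inside the guest.

```c
/* Conceptual sketch only: hypothetical names, not the actual GGML/llama.cpp code.
 * Illustrates the idea of API remoting: compute-heavy ops are serialized and
 * forwarded to a host-side agent, everything else runs locally in the guest. */
#include <stdio.h>
#include <stdint.h>
#include <string.h>

typedef enum { OP_ADD, OP_MUL_MAT, OP_SOFT_MAX } op_kind; /* hypothetical op tags */

typedef struct {
    op_kind kind;
    int64_t rows, cols;   /* enough shape info for the host to rebuild the op */
} op_desc;

/* In a real backend this would write to a virtio/vsock channel shared with
 * the macOS host agent; here the transport is stubbed out with a print. */
static void forward_to_host(const op_desc *op) {
    uint8_t wire[sizeof(op_desc)];
    memcpy(wire, op, sizeof(op_desc));          /* trivial serialization */
    printf("remoting op %d (%lld x %lld) to the host GPU backend\n",
           op->kind, (long long)op->rows, (long long)op->cols);
}

static void run_locally(const op_desc *op) {
    printf("running op %d locally in the guest (CPU)\n", op->kind);
}

/* Selective offload policy: only compute-intensive ops cross the VM boundary. */
static void dispatch(const op_desc *op) {
    if (op->kind == OP_MUL_MAT) forward_to_host(op);
    else                        run_locally(op);
}

int main(void) {
    op_desc gemm = { OP_MUL_MAT, 4096, 4096 };
    op_desc bias = { OP_ADD,     4096, 1    };
    dispatch(&gemm);  /* goes to the host GPU */
    dispatch(&bias);  /* stays in the guest   */
    return 0;
}
```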
Performance and Implications
We will share concrete performance results demonstrating near-native inference speed compared to running llama.cpp directly on the host. The talk will examine the architectural trade-offs of this split-execution model, providing insight into:
* The implementation details of the para-virtualized GGML API bridge.
* The latency overhead of API remoting versus the throughput gain from host GPU execution.
* How this open-source approach fundamentally changes the future of containerized and accelerated AI tooling on non-Linux platforms, specifically addressing the needs of macOS users in the open-source AI community.
Deploying neural networks in production environments presents unique challenges: models must run efficiently across diverse hardware, from powerful servers to resource-constrained embedded devices, while maintaining predictable performance without heavy runtime dependencies.
This talk introduces tract, Sonos's open-source neural network inference toolkit started in 2018 and written in Rust. We'll explore how tract bridges the gap between training frameworks and production deployment by offering a no-nonsense, self-contained inference solution used today to deploy deep learning on millions of devices at Sonos.
This toolkit has some unique strengths thanks to embedded graph optimization, automated streaming management, and symbolic abstraction for dynamic dimensions — plus support for multiple open exchange standards including ONNX, NNEF, and TensorFlow Lite.
tract also has a companion project, torch-to-nnef, that strives to export PyTorch models to NNEF optimized for tract with maximum compatibility. It enables some unique features such as quantization, better Fourier-transform support, and easier extensibility; these will also be discussed briefly during this presentation.
Can an ESP32-based MCU run (tiny)ML models accurately and efficiently? This talk showcases how a tiny microcontroller can transparently leverage neighboring nodes to run inference on full, unquantized torchvision models in less than 100 ms! We build on vAccel, an open abstraction layer for interoperable hardware acceleration that enables devices like the ESP32 to transparently offload ML inference and signal-processing tasks to nearby edge or cloud nodes. Through a lightweight agent and a unified API, vAccel bridges heterogeneous devices, enabling seamless offload without modifying application logic.
This session presents our IoT port of vAccel (client & lightweight agent) and demonstrates a real deployment where an ESP32 delegates inference to a GPU-backed k8s node, reducing latency by 3 orders of magnitude while preserving Kubernetes-native control and observability. Attendees will see how open acceleration can unify the Cloud–Edge–IoT stack through standard interfaces and reusable runtimes.
As AI workloads move to the browser, the lack of a unified low-level acceleration layer on Linux—equivalent to DirectML or CoreML—creates major bottlenecks. In this talk, we explore how WebNN and next-generation WebLLM can unlock efficient on-device inference on RISC-V, using Tenstorrent hardware and the emerging RVV 1.0 Variable-Length vector ISA. We cover the challenges of WebNN integration on Linux, the importance of WASM support for RVV, and demonstrate progress on running modern LLMs directly in the browser. We will also detail the RVV-enabled WASM implementation path for WebNN and what’s needed upstream.
Leveraging Rust and Khronos' emerging Slang initiative, we introduce our efforts toward a cross-platform GPU LLM inference ecosystem. With a single-source approach, we aim to minimize backend-specific code and foster community participation by writing inference kernels once and running them everywhere.
Over the past decade, CPU vector units have evolved faster than most software stacks have adapted. From AVX2 and AVX-512 to NEON, SVE/SVE2, AMX, and SME, each generation introduced wider registers, richer predication, new mixed-precision formats, and entirely new execution models. Yet extracting sustained throughput from these extensions requires understanding architectural asymmetries, compiler behaviour, portability constraints, and the practical limits of auto-vectorisation.
This session distills ten years of developing SIMD kernels deployed inside large-scale open-source systems, including USearch — a high-throughput vector search engine embedded today in many modern DBMS projects — as well as StringZilla, SimSIMD, and bioinformatics tools. We will examine how different CPU families behave on identical workloads, which instructions consistently deliver real speedups, which look promising but rarely pay off, and where compilers do (and do not) generate optimal vector code.
Case studies include: AVX-512 reductions exceeding 300 GB/s on current x86 machines; i8/u8 and bf16 pipelines on NEON and SVE2; practical limits of AMX tiles for dense math; and early insights into SME’s streaming execution model. Each example is shown through minimal, reproducible kernel fragments.
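For readers who want a feel for what such a fragment looks like, here is an illustrative AVX-512 sum reduction in C (a generic example, not one of the talk's benchmarked kernels):

```c
/* Illustrative AVX-512 fragment: sums a float array 16 lanes at a time,
 * with a scalar tail. Build with e.g.: gcc -O3 -mavx512f reduce.c */
#include <immintrin.h>
#include <stddef.h>

float sum_f32_avx512(const float *x, size_t n) {
    __m512 acc = _mm512_setzero_ps();
    size_t i = 0;
    for (; i + 16 <= n; i += 16) {
        acc = _mm512_add_ps(acc, _mm512_loadu_ps(x + i)); /* 16 floats per step */
    }
    float total = _mm512_reduce_add_ps(acc);  /* horizontal reduction of the vector */
    for (; i < n; ++i) total += x[i];         /* scalar tail */
    return total;
}
```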
Beyond raw performance, the talk outlines when SIMD is the right tool: which classes of problems benefit most, when auto-vectorisation is sufficient, and when hand-written intrinsics or assembly are justified. We will also discuss hardware selection for SIMD-heavy workloads — both on-prem and in the cloud — and what upcoming extensions mean for open-source systems in the next decade.
Serving multiple models on a single GPU sounds great until something segfaults.
Two approaches dominate for parallel inference: MIG (hardware partitioning) and MPS (software sharing). Both promise efficient GPU sharing. Both have trade-offs and their behavior changes based on GPU architecture.
I tested both strategies on Hopper and Blackwell, running diffusion, MoE, and TTS workloads in parallel. Some setups survived. Others didn't.
This talk digs into what actually happened: where memory isolation falls apart, which configs crash, and what survives under load.
By the end, you'll know:
"Adventures in Model Quantization" continues to quest to run high quality models with minimal hardware resources. In this edition, community quantizer John Leimgruber ("ubergarm" on huggingface), tells the story of how a single line change to llama.cpp enabled the 1000B open weights model Kimi-K2-Thinking to maintain full quality while using only half the memory!
This talk presents an overview and visualizations of llama.cpp quantization types and discuses how Quantization Aware Training (QAT) effects mapping models across ecosystems from transformers' safetensors into llama.cpp GGUF.
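To ground the discussion, here is a simplified C sketch of the symmetric block-quantization idea behind llama.cpp's simpler types such as Q8_0 (32-element blocks, one scale per block). The real GGUF layouts differ in storage details such as fp16 scales, packing, and the K-quant variants:

```c
/* Simplified sketch of symmetric block quantization in the spirit of Q8_0.
 * Real GGUF types store the scale as fp16 and use different packing. */
#include <math.h>
#include <stdint.h>

#define BLOCK 32

typedef struct {
    float  scale;        /* per-block scale */
    int8_t q[BLOCK];     /* quantized weights */
} block_q8;

void quantize_block(const float *x, block_q8 *out) {
    float amax = 0.0f;
    for (int i = 0; i < BLOCK; ++i) {
        float a = fabsf(x[i]);
        if (a > amax) amax = a;
    }
    out->scale = amax / 127.0f;                    /* map [-amax, amax] onto [-127, 127] */
    float inv = out->scale != 0.0f ? 1.0f / out->scale : 0.0f;
    for (int i = 0; i < BLOCK; ++i) {
        out->q[i] = (int8_t)lroundf(x[i] * inv);
    }
}

void dequantize_block(const block_q8 *in, float *x) {
    for (int i = 0; i < BLOCK; ++i) x[i] = in->q[i] * in->scale;
}
```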
If you're interested in running the best open-weights LLMs and ai models on gaming rigs, home-lab servers, or privately for your organization, then come learn how to benchmark both quality and speed for all the huggingface quants available for ik/llama.cpp.
This is an updated presentation expanding upon a recent AI Plumbers talk given in October 2025 in San Francisco:
Most machine learning tools use CUDA for hardware acceleration and are, as a result, only compatible with Nvidia GPUs. AMD has been making a lot of progress enabling code to be recompiled for ROCm with minimal changes to run on their hardware, but why not use an open and broadly compatible API instead? That's where Vulkan comes in: it was built for game development, but it also allows compute-only applications and has broad, solid driver support across many hardware vendors.
As a follow-up to last year's talk about my work on the llama.cpp/GGML Vulkan backend, this talk will discuss lessons learnt from the optimizations and new features we have added since, how viable Vulkan is for machine learning, and what it is still missing.
https://github.com/ggml-org/llama.cpp https://github.com/ggml-org/ggml
Running various forms of inference on microcontroller NPUs is not new. Systems where machine learning is used to analyze sensor data or do light CV on microcontroller-grade systems, under 1 watt, under a few dozen MB of RAM and flash, and under a 10 USD bill of materials, are being massively deployed (even if they stay in the long shadow of flashier LLMs and GenAI). That area, however, has historically been the domain of specialized machine learning frameworks such as emlearn, LiteRT (the artist formerly known as TensorFlow Lite), and a few others.
The question I will try to answer in this talk is the following: are there any benefits to using more well-established, yet still tightly optimized, frameworks such as ggml and tinygrad for these kinds of deployments? I will share my experience adapting these frameworks to targets such as the Google Coral NPU and AI Foundry Erbium, and the interesting challenges this presented.
In the era of big data and artificial intelligence, organizations are increasingly relying on data lakehouses to store, process, and analyze vast amounts of structured and unstructured data. The path from raw data to a production-ready AI model is complex, often bottlenecked by data inconsistencies, schema drift, and a lack of data versioning. In the spirit of open-source AI hacking, the focus must shift to ensuring the underlying data infrastructure is as reliable and reproducible as the models themselves. This presentation addresses the critical role of Open Table Formats (OTFs), specifically Apache Iceberg, Delta Lake, and Apache Hudi, in transforming unstructured data lakes into reliable, queryable data lakehouses. OTFs provide database-like capabilities, including ACID transactions, schema evolution, and time travel, directly on low-cost object storage. The talk will provide a practical overview of integrating open-source OTFs with popular AI/ML frameworks, empowering "AI Plumbers" to build robust, governed, and highly performant data foundations for their next generation of open-source models.
The growing energy demands of modern AI models pose a significant barrier to sustainable computing. As model complexity and deployment scale continue to rise, training and inference increasingly contribute to carbon emissions and operational costs. This talk begins by examining the technical challenges of accurately measuring energy consumption at multiple levels of abstraction—from system-wide and process-level metrics down to individual source code methods and API calls. Practical strategies for overcoming these measurement hurdles are discussed. The second part of the talk explores power consumption patterns in GPU kernels, highlighting how thread configuration, block geometry, and power limit settings shape kernel-level energy efficiency. We demonstrate how these characteristics influence power draw and discuss techniques for predicting consumption based on kernel properties. The session concludes with insights and best practices for managing performance–energy trade-offs in GPU-accelerated AI applications, offering a path toward more sustainable AI development.
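As one small, concrete example of the measurement plumbing involved, GPU board power can be polled from C through NVIDIA's NVML library. The sketch below (illustrative only, not the talk's own tooling) samples the instantaneous power draw, which can be sampled in a loop and integrated over time to estimate energy for a kernel or process:

```c
/* Illustrative sketch: polling GPU board power via the NVML C API.
 * Link with -lnvidia-ml. Not the talk's own measurement tooling. */
#include <stdio.h>
#include <nvml.h>

int main(void) {
    if (nvmlInit_v2() != NVML_SUCCESS) {
        fprintf(stderr, "failed to initialize NVML\n");
        return 1;
    }
    nvmlDevice_t dev;
    if (nvmlDeviceGetHandleByIndex_v2(0, &dev) == NVML_SUCCESS) {
        unsigned int mw = 0;
        /* Instantaneous board power draw in milliwatts. */
        if (nvmlDeviceGetPowerUsage(dev, &mw) == NVML_SUCCESS) {
            printf("GPU 0 power draw: %.1f W\n", mw / 1000.0);
        }
    }
    nvmlShutdown();
    return 0;
}
```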
Silicon engineering can seem like an unattainable, elite field, reserved for industry veterans with access to expensive instruments and years of experience, right? Well, not anymore! With AI tools, coding agents, and more and more available substrate to hook them up to, you can approach it as system design and get started in days! Get new models running inference on novel accelerator cards, put open cores in an FPGA, and pull off other tricks without a specialized engineering degree. I'm not suggesting you tape it out and put it in production without proper review, but for prototyping and faster feedback cycles it really changes the game! I'll demo a couple of examples, all based on open IP!
Running LLMs is currently fraught with friction: dependency hell and the ungoverned management of massive weight files (GGUF, safetensors).
In this talk, we introduce Docker Model Runner (DMR), an open initiative to bring the same standard of reproducibility to AI models that containers brought to code. We will explore how DMR streamlines the "pull, push, run, serve" lifecycle by treating AI models as first-class OCI (Open Container Initiative) artifacts, decoupling the model weights from the inference engine.
We will dive deep into the architecture of the Docker Model Runner, covering:
Models as Artifacts: How we package and distribute GGUF and safetensors formats using OCI-compliant registries, eliminating the need for arbitrary file downloads.
A Unified Interface: How DMR abstracts the complexity of underlying inference backends. We will discuss the integration of llama.cpp for broad hardware compatibility (CPU/GPU) and vLLM for high-performance production serving.
The Model Driver Pack: How the runner handles hardware acceleration automatically, managing the interface between the container runtime and host resources (NVIDIA CUDA, Vulkan, etc.).
Developer Experience: A look at the docker model CLI plugin and the local REST API that allows developers to swap models without changing their client code.
Join us to see how we are building a standardized, open ecosystem where docker model run ai/gemma3 is all you need to start building AI applications.
Key Takeaways for the Review Committee (Why this fits FOSDEM):
Open Standards: The talk focuses on OCI compliance and open weights (GGUF/safetensors).
Open Source Integration: It highlights the usage and orchestration of popular open-source projects (llama.cpp and vLLM).
Technical Depth: It addresses infrastructure challenges (GPU passthrough, artifact management) rather than just high-level AI concepts.
Last year, I shared Paddler, an open-source LLM load balancer. A year of community feedback and building Poet (a static site generator with AI features) on top of it taught me what actually matters when self-hosting LLMs. This talk shares the practical patterns the open-source community needs: what works, what doesn't, and what tooling we still need to build together.
The ET-SoC-1 chip contains more than one thousand RISC-V cores, with custom vector and tensor extensions on each core, and has recently been given a new open-source lease of life [1]. What do low-level AI software engineers do with novel hardware? Obviously the answer is to make it do matmuls.
Join me on a rapid journey from naïve matmul to optimized matmul, learning about the ET-SoC-1 along the way. Some of its hardware features will help us, whereas others will be a hindrance.
[1] https://github.com/aifoundry-org
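For orientation, the starting point of that journey is the textbook triple loop below; this is an illustrative baseline in C, with nothing ET-SoC-1-specific in it yet:

```c
/* Textbook starting point of the naive-to-optimized journey.
 * C[m][n] = sum_k A[m][k] * B[k][n], row-major layout. */
void matmul_naive(const float *A, const float *B, float *C,
                  int M, int N, int K) {
    for (int m = 0; m < M; ++m) {
        for (int n = 0; n < N; ++n) {
            float acc = 0.0f;
            for (int k = 0; k < K; ++k) {
                acc += A[m * K + k] * B[k * N + n];
            }
            C[m * N + n] = acc;
        }
    }
}
/* From here the usual optimization steps apply: reorder loops for unit-stride
 * access, tile for the cache hierarchy, then map the inner tile onto the
 * core's vector and tensor extensions. */
```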
RISC-V is rapidly evolving into a serious platform for AI acceleration, from embedded devices to full AI PCs and datacenter-class compute. But building real, production-ready AI systems on open hardware still poses challenges: memory bandwidth bottlenecks, heterogeneous compute scheduling, toolchain maturity, and model deployment efficiency. As part of this session, we will also briefly share DeepComputing's AI product roadmap to illustrate how these engineering breakthroughs translate into real devices.
In this talk, engineers from DeepComputing and Tenstorrent will share how we are solving these challenges together across two ends of the computing spectrum:
• AI PC / edge devices: How we integrate high-performance RISC-V CPUs with NPUs, optimize dataflow for multi-die architectures, and overcome compiler/runtime fragmentation to run LLMs locally.
• AI servers: How Tenstorrent's RISC-V cores and scalable mesh architecture handle AI workloads; how we bridge the software gap (compilers, toolchains, scheduling, kernel-level tuning); and how we standardize low-level interfaces for AI compute.
The focus is on how these problems are solved: microarchitecture decisions, data movement, kernel optimizations, interoperability layers, and lessons learned from building real products. This session will show why "All in RISC-V, RISC-V All in AI" is no longer a slogan but a practical engineering path forward.
In the last 10 years there has been a hardware race to build the best application-specific integrated circuit (ASIC) for both machine learning training and inference, i.e. AI accelerators. What started with vision processing units (VPUs) went through tensor processing units (TPUs), and now we are dealing with neural processing units (NPUs). What's next?
This talk will take a systematic look at the different hardware platforms for AI acceleration, with a focus on the software stacks that support them on Linux. We'll look at how individual vendors approached their ASICs from the kernel side, and how they exposed the acceleration functionality to user-space.
Is it all proprietary or is there liberté? We'll find out together!