Welcome talk covering some organizational questions
llama.cpp has become a key tool for running LLMs efficiently on any hardware. This talk explores how multimodal support has grown in the project, focusing on libmtmd, a library added in April 2025 to make multimodal features in llama.cpp easier to use and to maintain.
We will first cover the main achievements. These include merging the separate CLI tools for different models into a single tool, llama-mtmd-cli. Next, we will discuss how libmtmd works with llama-server and show real examples of low-latency OCR applications. We will also talk about adding audio support, which lets newer models summarize audio inputs. Finally, we will cover the challenges of handling legacy code while keeping the project flexible for future models.
Looking forward, the talk will share plans for new features such as video input, text-to-speech support, and image generation. Attendees will also learn how to contribute and how to use these multimodal tools in their own projects.
Running modern Large Language Model (LLM) workloads on macOS presents a unique challenge: reconciling powerful local hardware with a mature, Linux-first AI tooling and container ecosystem.
The Problem: Bridging the OS and Acceleration Gap
While containerization offers macOS developers access to Linux-centric tools like Ramalama and the Podman Desktop AI Lab, introducing a virtualization layer immediately compromises GPU acceleration. Direct device passthrough is infeasible, as it monopolizes the display and host system resources. Consequently, achieving high-performance LLM inference requires a sophisticated bridging mechanism.
The Solution: Para-Virtualization via GGML API Remoting
We present the implementation of an API-remoting, para-virtualized execution path for llama.cpp. This novel design is built directly on the GGML API, allowing us to selectively offload compute-intensive operations to execute on the macOS host. Crucially, this execution path natively leverages the GGML-Metal backend, achieving full-speed host performance. This strategy preserves a fully containerized Linux developer experience inside the virtual machine while achieving hardware-level acceleration outside of it.
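To make the offload idea concrete, here is a minimal, purely illustrative C sketch of selective API remoting. All names are hypothetical; this is not the actual GGML or llama.cpp code, just the shape of the dispatch decision: compute-heavy operations are serialized and forwarded to a host-side agent, while everything else stays inside the guest.

```c
/* Conceptual sketch only: hypothetical names, not the actual GGML/llama.cpp code.
 * Illustrates the idea of API remoting: compute-heavy ops are serialized and
 * forwarded to a host-side agent, everything else runs locally in the guest. */
#include <stdio.h>
#include <stdint.h>
#include <string.h>

typedef enum { OP_ADD, OP_MUL_MAT, OP_SOFT_MAX } op_kind; /* hypothetical op tags */

typedef struct {
    op_kind kind;
    int64_t rows, cols;   /* enough shape info for the host to rebuild the op */
} op_desc;

/* In a real backend this would write to a virtio/vsock channel shared with
 * the macOS host agent; here the transport is stubbed out with a print. */
static void forward_to_host(const op_desc *op) {
    uint8_t wire[sizeof(op_desc)];
    memcpy(wire, op, sizeof(op_desc));          /* trivial serialization */
    printf("remoting op %d (%lld x %lld) to the host GPU backend\n",
           op->kind, (long long)op->rows, (long long)op->cols);
}

static void run_locally(const op_desc *op) {
    printf("running op %d locally in the guest (CPU)\n", op->kind);
}

/* Selective offload policy: only compute-intensive ops cross the VM boundary. */
static void dispatch(const op_desc *op) {
    if (op->kind == OP_MUL_MAT) forward_to_host(op);
    else                        run_locally(op);
}

int main(void) {
    op_desc gemm = { OP_MUL_MAT, 4096, 4096 };
    op_desc bias = { OP_ADD,     4096, 1    };
    dispatch(&gemm);  /* goes to the host GPU */
    dispatch(&bias);  /* stays in the guest   */
    return 0;
}
```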
Performance and Implications
We will share concrete performance results demonstrating near-native inference speed compared to running llama.cpp directly on the host. The talk will examine the architectural trade-offs of this split-execution model, providing insight into:
* The implementation details of the para-virtualized GGML API bridge.
* The latency overhead of API remoting versus the throughput gain from host GPU execution.
* How this open-source approach fundamentally changes the future of containerized and accelerated AI tooling on non-Linux platforms, specifically addressing the needs of macOS users in the open-source AI community.
Deploying neural networks in production environments presents unique challenges: models must run efficiently across diverse hardware, from powerful servers to resource-constrained embedded devices, while maintaining predictable performance without heavy runtime dependencies.
This talk introduces tract, Sonos's open-source neural network inference toolkit started in 2018 and written in Rust. We'll explore how tract bridges the gap between training frameworks and production deployment by offering a no-nonsense, self-contained inference solution used today to deploy deep learning on millions of devices at Sonos.
This toolkit has some unique strengths thanks to embedded graph optimization, automated streaming management, and symbolic abstraction for dynamic dimensions — plus support for multiple open exchange standards including ONNX, NNEF, and TensorFlow Lite.
tract also has a companion project, torch-to-nnef, that strives to export PyTorch models to NNEF optimized for tract with maximum compatibility. It enables some unique features such as quantization, better Fourier-transform support, and easier extensibility; these will also be discussed briefly during this presentation.
Can an ESP32-based MCU run (tiny)ML models accurately and efficiently? This talk showcases how a tiny microcontroller can transparently leverage neighboring nodes to run inference on full, unquantized torchvision models in less than 100 ms! We build on vAccel, an open abstraction layer for interoperable hardware acceleration that enables devices like the ESP32 to transparently offload ML inference and signal-processing tasks to nearby edge or cloud nodes. Through a lightweight agent and a unified API, vAccel bridges heterogeneous devices, enabling seamless offload without modifying application logic.
This session presents our IoT port of vAccel (client & lightweight agent) and demonstrates a real deployment where an ESP32 delegates inference to a GPU-backed k8s node, reducing latency by 3 orders of magnitude while preserving Kubernetes-native control and observability. Attendees will see how open acceleration can unify the Cloud–Edge–IoT stack through standard interfaces and reusable runtimes.
As AI workloads move to the browser, the lack of a unified low-level acceleration layer on Linux—equivalent to DirectML or CoreML—creates major bottlenecks. In this talk, we explore how WebNN and next-generation WebLLM can unlock efficient on-device inference on RISC-V, using Tenstorrent hardware and the emerging RVV 1.0 Variable-Length vector ISA. We cover the challenges of WebNN integration on Linux, the importance of WASM support for RVV, and demonstrate progress on running modern LLMs directly in the browser. We will also detail the RVV-enabled WASM implementation path for WebNN and what’s needed upstream.
Leveraging Rust and Khronos' emerging Slang initiative, we introduce our efforts toward a cross-platform GPU LLM inference ecosystem. With a single-source approach, we aim to minimize backend-specific code and foster community participation by writing inference kernels once and running them everywhere.
Over the past decade, CPU vector units have evolved faster than most software stacks have adapted. From AVX2 and AVX-512 to NEON, SVE/SVE2, AMX, and SME, each generation introduced wider registers, richer predication, new mixed-precision formats, and entirely new execution models. Yet extracting sustained throughput from these extensions requires understanding architectural asymmetries, compiler behaviour, portability constraints, and the practical limits of auto-vectorisation.
This session distills ten years of developing SIMD kernels deployed inside large-scale open-source systems, including USearch — a high-throughput vector search engine embedded today in many modern DBMS projects — as well as StringZilla, SimSIMD, and bioinformatics tools. We will examine how different CPU families behave on identical workloads, which instructions consistently deliver real speedups, which look promising but rarely pay off, and where compilers do (and do not) generate optimal vector code.
Case studies include: AVX-512 reductions exceeding 300 GB/s on current x86 machines; i8/u8 and bf16 pipelines on NEON and SVE2; practical limits of AMX tiles for dense math; and early insights into SME’s streaming execution model. Each example is shown through minimal, reproducible kernel fragments.
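For readers who want a feel for what such a fragment looks like, here is an illustrative AVX-512 sum reduction in C (a generic example, not one of the talk's benchmarked kernels):

```c
/* Illustrative AVX-512 fragment: sums a float array 16 lanes at a time,
 * with a scalar tail. Build with e.g.: gcc -O3 -mavx512f reduce.c */
#include <immintrin.h>
#include <stddef.h>

float sum_f32_avx512(const float *x, size_t n) {
    __m512 acc = _mm512_setzero_ps();
    size_t i = 0;
    for (; i + 16 <= n; i += 16) {
        acc = _mm512_add_ps(acc, _mm512_loadu_ps(x + i)); /* 16 floats per step */
    }
    float total = _mm512_reduce_add_ps(acc);  /* horizontal reduction of the vector */
    for (; i < n; ++i) total += x[i];         /* scalar tail */
    return total;
}
```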
Beyond raw performance, the talk outlines when SIMD is the right tool: which classes of problems benefit most, when auto-vectorisation is sufficient, and when hand-written intrinsics or assembly are justified. We will also discuss hardware selection for SIMD-heavy workloads — both on-prem and in the cloud — and what upcoming extensions mean for open-source systems in the next decade.
Serving multiple models on a single GPU sounds great until something segfaults.
Two approaches dominate for parallel inference: MIG (hardware partitioning) and MPS (software sharing). Both promise efficient GPU sharing. Both have trade-offs and their behavior changes based on GPU architecture.
I tested both strategies on Hopper and Blackwell, running diffusion, MoE, and TTS workloads in parallel. Some setups survived. Others didn't.
This talk digs into what actually happened: where memory isolation falls apart, which configs crash, and what survives under load.
By the end, you'll know:
"Adventures in Model Quantization" continues to quest to run high quality models with minimal hardware resources. In this edition, community quantizer John Leimgruber ("ubergarm" on huggingface), tells the story of how a single line change to llama.cpp enabled the 1000B open weights model Kimi-K2-Thinking to maintain full quality while using only half the memory!
This talk presents an overview and visualizations of llama.cpp quantization types and discuses how Quantization Aware Training (QAT) effects mapping models across ecosystems from transformers' safetensors into llama.cpp GGUF.
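To ground the discussion, here is a simplified C sketch of the symmetric block-quantization idea behind llama.cpp's simpler types such as Q8_0 (32-element blocks, one scale per block). The real GGUF layouts differ in storage details such as fp16 scales, packing, and the K-quant variants:

```c
/* Simplified sketch of symmetric block quantization in the spirit of Q8_0.
 * Real GGUF types store the scale as fp16 and use different packing. */
#include <math.h>
#include <stdint.h>

#define BLOCK 32

typedef struct {
    float  scale;        /* per-block scale */
    int8_t q[BLOCK];     /* quantized weights */
} block_q8;

void quantize_block(const float *x, block_q8 *out) {
    float amax = 0.0f;
    for (int i = 0; i < BLOCK; ++i) {
        float a = fabsf(x[i]);
        if (a > amax) amax = a;
    }
    out->scale = amax / 127.0f;                    /* map [-amax, amax] onto [-127, 127] */
    float inv = out->scale != 0.0f ? 1.0f / out->scale : 0.0f;
    for (int i = 0; i < BLOCK; ++i) {
        out->q[i] = (int8_t)lroundf(x[i] * inv);
    }
}

void dequantize_block(const block_q8 *in, float *x) {
    for (int i = 0; i < BLOCK; ++i) x[i] = in->q[i] * in->scale;
}
```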
If you're interested in running the best open-weights LLMs and ai models on gaming rigs, home-lab servers, or privately for your organization, then come learn how to benchmark both quality and speed for all the huggingface quants available for ik/llama.cpp.
This is an updated presentation expanding upon a recent AI Plumbers talk given in October 2025 in San Francisco:
Most machine learning tools use CUDA for hardware acceleration and are, as a result, only compatible with Nvidia GPUs. AMD has been making a lot of progress enabling code to be recompiled for ROCm with minimal changes to run on their hardware, but why not use an open and broadly compatible API instead? That's where Vulkan comes in: it was built for game development, but it also allows compute-only applications and has broad, solid driver support across many hardware vendors.
As a follow-up to last year's talk about my work on the llama.cpp/GGML Vulkan backend, this talk will discuss lessons learnt from the optimizations and new features we have added since, how viable Vulkan is for machine learning, and what it is still missing.
https://github.com/ggml-org/llama.cpp https://github.com/ggml-org/ggml
Running various forms of inference on microcontroller NPUs is not new. Systems where machine learning is used to analyze sensor data or do light CV on microcontroller-grade systems, under 1 watt, under a few dozen MB of RAM and flash, and under a 10 USD bill of materials, are being massively deployed (even if they stay in the long shadow of flashier LLMs and GenAI). That area, however, has historically been the domain of specialized machine learning frameworks such as emlearn, LiteRT (the artist formerly known as TensorFlow Lite), and a few others.
The question I will try to answer in this talk is the following: are there any benefits to using more well-established, yet still tightly optimized, frameworks such as ggml and tinygrad for these kinds of deployments? I will share my experience adapting these frameworks to targets such as the Google Coral NPU and AI Foundry Erbium, and the interesting challenges this presented.
In the era of big data and artificial intelligence, organizations are increasingly relying on data lakehouses to store, process, and analyze vast amounts of structured and unstructured data. The path from raw data to a production-ready AI model is complex, often bottlenecked by data inconsistencies, schema drift, and a lack of data versioning. In the spirit of open-source AI hacking, the focus must shift to ensuring the underlying data infrastructure is as reliable and reproducible as the models themselves. This presentation addresses the critical role of Open Table Formats (OTFs), specifically Apache Iceberg, Delta Lake, and Apache Hudi, in transforming unstructured data lakes into reliable, queryable data lakehouses. OTFs provide database-like capabilities, including ACID transactions, schema evolution, and time travel, directly on low-cost object storage. The talk will provide a practical overview of integrating open-source OTFs with popular AI/ML frameworks, empowering "AI Plumbers" to build robust, governed, and highly performant data foundations for their next generation of open-source models.
The growing energy demands of modern AI models pose a significant barrier to sustainable computing. As model complexity and deployment scale continue to rise, training and inference increasingly contribute to carbon emissions and operational costs. This talk begins by examining the technical challenges of accurately measuring energy consumption at multiple levels of abstraction—from system-wide and process-level metrics down to individual source code methods and API calls. Practical strategies for overcoming these measurement hurdles are discussed. The second part of the talk explores power consumption patterns in GPU kernels, highlighting how thread configuration, block geometry, and power limit settings shape kernel-level energy efficiency. We demonstrate how these characteristics influence power draw and discuss techniques for predicting consumption based on kernel properties. The session concludes with insights and best practices for managing performance–energy trade-offs in GPU-accelerated AI applications, offering a path toward more sustainable AI development.
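As one small, concrete example of the measurement plumbing involved, GPU board power can be polled from C through NVIDIA's NVML library. The sketch below (illustrative only, not the talk's own tooling) samples the instantaneous power draw, which can be sampled in a loop and integrated over time to estimate energy for a kernel or process:

```c
/* Illustrative sketch: polling GPU board power via the NVML C API.
 * Link with -lnvidia-ml. Not the talk's own measurement tooling. */
#include <stdio.h>
#include <nvml.h>

int main(void) {
    if (nvmlInit_v2() != NVML_SUCCESS) {
        fprintf(stderr, "failed to initialize NVML\n");
        return 1;
    }
    nvmlDevice_t dev;
    if (nvmlDeviceGetHandleByIndex_v2(0, &dev) == NVML_SUCCESS) {
        unsigned int mw = 0;
        /* Instantaneous board power draw in milliwatts. */
        if (nvmlDeviceGetPowerUsage(dev, &mw) == NVML_SUCCESS) {
            printf("GPU 0 power draw: %.1f W\n", mw / 1000.0);
        }
    }
    nvmlShutdown();
    return 0;
}
```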
Silicon engineering can seem like an unattainable, elite field, reserved for industry veterans with access to expensive instruments and years of experience, right? Well, not anymore! With AI tools, coding agents, and more and more available substrate to hook them up to, you can approach it as system design and get started in days! Get new models running inference on novel accelerator cards, put open cores in an FPGA, and pull off other tricks without a specialized engineering degree. I'm not suggesting you tape it out and put it in production without proper review, but for prototyping and faster feedback cycles it really changes the game! I'll demo a couple of examples, all based on open IP!
Running LLMs is currently fraught with friction: dependency hell and the ungoverned management of massive weight files (GGUF, safetensors).
In this talk, we introduce Docker Model Runner (DMR), an open initiative to bring the same standard of reproducibility to AI models that containers brought to code. We will explore how DMR streamlines the "pull, push, run, serve" lifecycle by treating AI models as first-class OCI (Open Container Initiative) artifacts, decoupling the model weights from the inference engine.
We will dive deep into the architecture of the Docker Model Runner, covering:
Models as Artifacts: How we package and distribute GGUF and safetensors formats using OCI-compliant registries, eliminating the need for arbitrary file downloads.
A Unified Interface: How DMR abstracts the complexity of underlying inference backends. We will discuss the integration of llama.cpp for broad hardware compatibility (CPU/GPU) and vLLM for high-performance production serving.
The Model Driver Pack: How the runner handles hardware acceleration automatically, managing the interface between the container runtime and host resources (NVIDIA CUDA, Vulkan, etc.).
Developer Experience: A look at the docker model CLI plugin and the local REST API that allows developers to swap models without changing their client code.
Join us to see how we are building a standardized, open ecosystem where docker model run ai/gemma3 is all you need to start building AI applications.
Key Takeaways for the Review Committee (Why this fits FOSDEM):
Open Standards: The talk focuses on OCI compliance and open weights (GGUF/safetensors).
Open Source Integration: It highlights the usage and orchestration of popular open-source projects (llama.cpp and vLLM).
Technical Depth: It addresses infrastructure challenges (GPU passthrough, artifact management) rather than just high-level AI concepts.
Last year, I shared Paddler, an open-source LLM load balancer. A year of community feedback and building Poet (a static site generator with AI features) on top of it taught me what actually matters when self-hosting LLMs. This talk shares the practical patterns the open-source community needs: what works, what doesn't, and what tooling we still need to build together.
The ET-SoC-1 chip contains more than one thousand RISC-V cores, with custom vector and tensor extensions on each core, and has recently been given a new open-source lease of life [1]. What do low-level AI software engineers do with novel hardware? Obviously the answer is to make it do matmuls.
Join me on a rapid journey from naïve matmul to optimized matmul, learning about the ET-SoC-1 along the way. Some of its hardware features will help us, whereas others will be a hindrance.
[1] https://github.com/aifoundry-org
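For orientation, the starting point of that journey is the textbook triple loop below; this is an illustrative baseline in C, with nothing ET-SoC-1-specific in it yet:

```c
/* Textbook starting point of the naive-to-optimized journey.
 * C[m][n] = sum_k A[m][k] * B[k][n], row-major layout. */
void matmul_naive(const float *A, const float *B, float *C,
                  int M, int N, int K) {
    for (int m = 0; m < M; ++m) {
        for (int n = 0; n < N; ++n) {
            float acc = 0.0f;
            for (int k = 0; k < K; ++k) {
                acc += A[m * K + k] * B[k * N + n];
            }
            C[m * N + n] = acc;
        }
    }
}
/* From here the usual optimization steps apply: reorder loops for unit-stride
 * access, tile for the cache hierarchy, then map the inner tile onto the
 * core's vector and tensor extensions. */
```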
RISC-V is rapidly evolving into a serious platform for AI acceleration, from embedded devices to full AI PCs and datacenter-class compute. But building real, production-ready AI systems on open hardware still poses challenges: memory bandwidth bottlenecks, heterogeneous compute scheduling, toolchain maturity, and model deployment efficiency. As part of this session, we will also briefly share DeepComputing's AI product roadmap to illustrate how these engineering breakthroughs translate into real devices.
In this talk, engineers from DeepComputing and Tenstorrent will share how we are solving these challenges together across two ends of the computing spectrum:
• AI PC / edge devices: How we integrate high-performance RISC-V CPUs with NPUs, optimize dataflow for multi-die architectures, and overcome compiler/runtime fragmentation to run LLMs locally.
• AI servers: How Tenstorrent's RISC-V cores and scalable mesh architecture handle AI workloads; how we bridge the software gap (compilers, toolchains, scheduling, kernel-level tuning); and how we standardize low-level interfaces for AI compute.
The focus is on how these problems are solved: microarchitecture decisions, data movement, kernel optimizations, interoperability layers, and lessons learned from building real products. This session will show why "All in RISC-V, RISC-V All in AI" is no longer a slogan but a practical engineering path forward.
In the last 10 years there has been a hardware race to build the best application-specific integrated circuit (ASIC) for both machine learning training and inference, i.e. AI accelerators. What started with vision processing units (VPUs) went through tensor processing units (TPUs), and now we are dealing with neural processing units (NPUs). What's next?
This talk will take a systematic look at the different hardware platforms for AI acceleration, with a focus on the software stacks that support them on Linux. We'll look at how individual vendors approached their ASICs from the kernel side, and how they exposed the acceleration functionality to user-space.
Is it all proprietary or is there liberté? We'll find out together!