Software Performance Track

2026-02-01T09:00:00+01:00

Nowadays, in the software industry, we already have a lot of ways to improve performance of our applications: compilers become better and better each year in the optimization field, we have a lot of tools like Linux perf and Intel VTune to analyze performance. Even algorithms are still improving in various domains! But how many of these improvements are actually adopted in the industry, and how difficult it is to adopt them in reality? That's an interesting question!

In this talk, I want to show you:

Why accessibility of software performance matters
How various software optimization approaches are different from the adoption easiness perspective: from different compiler optimizations to semi-automatic optimizations to a manual approach
What things can be improved and how
Many related open-source examples from my practice
Share with you an idea behind the "Software performance" devroom

I hope after the talk you get an interesting perspective on software performance to think about.

Beyond nvidia-smi: Tools for Real GPU Performance Metrics

2026-02-01T09:50:00+01:00

Relying only on nvidia-smi is like measuring highway usage by checking if any car is present, not how many lanes are full.

This talk reveals the metrics nvidia-smi doesn't show and introduces open source tools that expose actual GPU efficiency metrics.

We'll cover:

Why GPU Utilization is not same as GPU Efficiency.
Deep dive into relevant key metrics: SM metrics, Tensor Core metrics, Memory metrics explained.
Practical gpu profiling and monitoring setup.
Identifying bottlenecks in inference workloads.

Attendees will leave understanding how to identify underutilized GPU and discover real optimization opportunities across inference workloads.

Keeping the P in HPC: the EESSI Way

2026-02-01T10:30:00+01:00

In scientific computing on supercomputers, performance should be king. Today’s rapidly diversifying High-Performance Computing (HPC) landscape makes this increasingly difficult to achieve however...

Modern supercomputers rely heavily on open source software, from a Linux-based operating system to scientific applications and their vast dependency stacks. A decade ago, HPC systems were relatively homogeneous: Intel CPUs, a fast interconnect like Infininand, and a shared filesystem. Today, diversity is the norm: AMD and Intel CPUs, emerging Arm-based exascale systems like JUPITER, widespread acceleration with NVIDIA and AMD GPUs, soon also RISC-V system architectures (like Tenstorrent), etc.

This hardware fragmentation creates significant challenges for researchers and HPC support teams. Getting scientific software installed reliably and efficiently is more painful than ever, and that’s before even considering software performance.

Containers, once heralded as the solution for mobility-of-compute, are increasingly showing their limits. An x86_64 container image is useless on a system with Arm CPUs, and will be equally useless on RISC-V in the not so distant future. What's worse is that portable container images used today already sacrifice performance by avoiding CPU-specific instructions like AVX-512 or AVX10, potentially leaving substantial performance gains on the table. Containerization also complicates MPI-heavy workloads and introduces friction for HPC users.

This talk introduces the European Environment for Scientific Software Installations (EESSI), which tackles these challenges head-on with a fundamentally different approach. EESSI is a curated, performance-optimized scientific software stack powered by open source technologies including CernVM-FS, Gentoo Prefix, EasyBuild, Lmod, Magic Castle, ReFrame, etc.

We will show how EESSI enables researchers to use the same optimized software stack seamlessly across laptops, cloud VMs, supercomputers, CI pipelines, and even Raspberry Pis—without sacrificing performance or ignoring hardware differences. This unlocks powerful workflows and simplifies software management across heterogeneous environments.

EESSI is already being adopted across European supercomputers and plays a central role in the upcoming EuroHPC Federation Platform.

Come learn why EESSI is the right way to keep the P in HPC.

Towards unified full-stack performance analysis and automated computer system design at CERN with Adaptyst

2026-02-01T11:10:00+01:00

Slow performance is often a major blocker of new visionary applications in scientific computing and related fields, regardless of whether it is embedded or distributed computing. This issue is becoming more and more challenging to tackle as it is no longer enough to do only algorithmic optimisations, only hardware optimisations, or only (operating) system optimisations: all of them need to be considered together.

Architecting full-stack computer systems customised for a use case comes to the rescue, namely software-system-hardware co-design. However, doing this manually per use case is cumbersome as the search space of possible solutions is vast, the number of different programming models is substantial, and experts from various disciplines need to be involved. Moreover, performance analysis tools often used here are fragmented, with state-of-the-art programs tending to be proprietary and not compatible with each other.

This is why automated full-stack system design is promising, but the existing solutions are few and far between and do not scale. Adaptyst is an open-source project at CERN (the world-leading particle physics laboratory) aiming to solve this problem. It is meant to be a comprehensive architecture-agnostic tool which:

unifies performance analysis across the entire software-hardware stack by calling state-of-the-art software and APIs under the hood with any remaining gaps bridged by Adaptyst (so that performance can be inspected both macro- and microscopically regardless of the workflow and platform type)
suggests automatically the best solutions of workflow performance bottlenecks in terms of one or more of: software optimisations, hardware choices and/or customisations, and (operating) system design
scales easily from embedded to high-performance/distributed computing and allows adding support for new software/system/hardware components seamlessly by anyone thanks to the modular design

The tool is in the early phase of development with small workforce and concentrating on profiling at the moment. Given that Adaptyst has broad application potential and we want it to be for everyone’s benefit, we are building an open-source community around the project.

This talk is an invitation to join us: we will explain the performance problems we face at CERN, tell you in detail what Adaptyst is and how you can get involved, and demonstrate the current version of the project on CPU and CUDA examples.

Project website: https://adaptyst.web.cern.ch

How to Reliably Measure Software Performance

2026-02-01T11:50:00+01:00

Reliable performance measurement remains an unsolved problem across most open source projects. Benchmarks are often an afterthought, and when they aren't they can be noisy, non-repeatable, and hard to act on.

This talk shares lessons learned from building a large-scale benchmarking system at Datadog and shows how small fixes can make a big difference: controlling environmental noise, designing benchmarks, interpreting results with sound statistical methods, and more.

Attendees should leave with practical principles they can apply in their own projects to make benchmarks trustworthy and actionable. We'll illustrate each principle with real data — for instance, environment tuning that cut variance by 100x, or design changes that turned a flaky benchmark into a reliable one.

Pulling 100k revisions 100× faster

2026-02-01T12:30:00+01:00

Mercurial is a distributed version control system whose codebase combines Python, C and Rust. Over its twenty years of development, significant effort has been put into its scaling and overall performance.

In the recent 7.2 version, the performance of exchanging data between repositories (e.g. push and pull) has been significantly improved, with some of our most complicated benchmark cases moving from almost four hours down to 2 minutes, a speedup of over 100x.

This talk uses this work as a case study of the multiple places where performance improvements lie. It goes over the challenges that arise from exchanging data in a DVCS, and the levers we can pull to overcome them: higher level logic changes, lower level algorithmic improvements, programming language strengths, modern CPU architecture, network protocol design, etc.

Despite the great results, exchanging data in version control remains a complex matter, and we lastly expose our ideas to further tackle its inherent complexity.

Database benchmarks: Lessons learned from running a benchmark standard organization

2026-02-01T13:10:00+01:00

Database vendors often engage in fierce competition on system performance – in the 1980s, they even had their "benchmark wars". The creation of the TPC, a non-profit organization that defines standard benchmarks and supervises their use through rigorous audits, spelled an end to the benchmark wars and helped drive innovation on performance in relational database management systems.

TPC served as a model for defining database benchmarks, including the Linked Data Benchmark Council (LDBC, https://ldbc.org/), of which I've been a contributor and board member for the past 5+ years. Through LDBC's workloads, graph database systems have seen a 25× speedup in four years and a 71× price-performance improvement on transactional workloads.

Defining database benchmarks requires a careful balancing of multiple aspects: relevance, portability, scalability, and simplicity. Most notably, the field in the last few years has shifted toward using simpler, leaderboard-style benchmarks that skip the rigorous auditing process but allow quick iterations.

In this talk, I will share my lessons learned on designing database benchmarks and using them in practice. The talk has five sections:

The need for database benchmarks
TPC overview (Transaction Processing Performance Council)
LDBC overview (Linked Data Benchmark Council)
The current benchmark landscape (ClickBench, H2O, etc.)
Takeaways for designing new benchmarks

Continuous Performance Engineering HowTo

2026-02-01T13:50:00+01:00

In the past 30 years we've moved from manual QA testing of release candidates to Continuous Integration and even Continuous Deployment. But while most software projects excel at testing correctness, the level of automation of performance testing is still near zero. And while it's a given that each developer writes tests for their own code, Performance Engineering remains the domain of individual experts or separate teams, who benchmark the product with custom tools developed in house, often focusing on beta and release candidates, with zero performance tests happening in the Continuous Integration work stream.

This talk is your guide to Continuous Performance Engineering, aka Continuous Benchmarking. We will cover standard benchmarking frameworks and how to automate them in CI, automating deployments of large end-to-end environments, how to tune your infrastructure for minimum noise and maximum repeatability, and using change point detection to automatically alert on performance regressions with a minimal amount of those annoying false positives.

Writing an ultrafast Lua/JSON encoder+decoder as a LuaJIT module

2026-02-01T14:30:00+01:00

JSON is one of the most popular data exchange formats. Parsing routines for it exist in every modern programming languages, either built-in, or included in popular libraries such as RapidJSON for C++ or json for Rust.

The task of conversion between JSON strings and Lua objects has been solved plenty of times before, but either the solutions are not focused on performance, or the parsers are too strict for the "relaxed" format we use at BeamNG.

What if we want to have the fastest Lua table <-> relaxed JSON conversion possible? We came up with a highly optimized LuaJIT code we use for handling JSONs at BeamNG since a few years. But there is a way to go further -- hacking on the C source code of the interpreter itself to add compiled built-in JSON support. How much extra performance can we squeeze out by going a level deeper?

Get ready for juicy benchmarks and an optimization story from a real usage perspective.

How To Move Bytes Around

2026-02-01T15:10:00+01:00

If you take a random program and start profiling it, you'll usually find that the memcpy function is at the top. However, this doesn't necessarily mean memcpy is slow. The most hopeless thing a C++/Rust developer can do (while no one is watching) is optimize memcpy to move bytes faster. That's exactly what we'll do.

A Performance Comparison of Kubernetes Multi-Cluster Networking

2026-02-01T15:50:00+01:00

Driven by application, compliance, and end-user requirements, companies opt to deploy multiple Kubernetes clusters across public and private clouds. However, deploying applications in multi-cluster environments presents distinct challenges, especially managing the communication between the microservices spread across clusters. Traditionally, custom configurations, like VPNs or firewall rules, were required to connect such complex setups of clusters spanning the public cloud and on-premise infrastructure. This talk presents a comprehensive analysis of network performance characteristics for three popular open-source multi-cluster networking solutions (namely, Skupper, Submariner and Istio), addressing the challenges of microservices connectivity across clusters. We evaluate key factors such as latency, throughput, and resource utilization using established tools and benchmarks, offering valuable insights for organizations aiming to optimize the network performance of their multi-cluster deployments. Our experiments revealed that each solution involves unique trade-offs in performance and resource efficiency: Submariner offers low latency and consistency, Istio excels in throughput with moderate resource consumption, and Skupper stands out for its ease of configuration while maintaining balanced performance.

Load Testing Real React Applications for Production Performance

2026-02-01T16:30:00+01:00

In this talk, we'll explore how we built comprehensive load testing for React applications at Mattermost, achieving 100,000 concurrent users in production-like environments. We'll begin by revealing why traditional API testing missed critical browser issues that only emerged at scale. Next, we'll demonstrate our open-source tool that uses Playwright to run thousands of real browsers, measuring React-specific metrics like component render times, memory leaks, and state management bottlenecks. Finally, we'll share the optimization journey that reduced browser memory and enabled true production readiness, ensuring our React application performs flawlessly for enterprise customers.