Virtually Attend FOSDEM 2026

Testing and Continuous Delivery Track

2026-02-01T09:00:00+01:00

A number of industrial applications now demand hard real-time scheduling capabilities from the kernel of a Linux-based operating system, but scheduling measurements from the system itself cannot be completely trusted as they are referenced to the same clock as the kernel-under-test. Yet, if the system can output signals to hardware as it runs, their timing can be analysed by an external microcontroller, and a second "external" measurement obtained to compare with the system's own report.

Codethink wrote embedded Rust firmware to run on a Raspberry Pi Pico which analyses the timings of characters received on a UART serial port. On the other end of the serial port is our "Rusty Worker": a Rust program on a Linux system which uses the sched_setattr syscall to request that Linux schedules it with specified parameters, and then measures its own scheduling period and runtime. The single-threaded and interrupt-based architecture of this firmware allowed accurate external measurements of the Rusty Worker’s scheduling parameters at microsecond precision, and meant it was easy to extend to monitor the "petting" behaviour of a watchdog.
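To make the mechanism concrete, here is a minimal, purely illustrative sketch in Python of the same idea (the actual Rusty Worker is written in Rust): ask for SCHED_DEADLINE via the raw sched_setattr syscall, then record the observed period between successive wake-ups. The syscall number shown assumes x86_64, and the runtime/deadline/period values are arbitrary examples, not the talk's real parameters.

```python
import ctypes
import os
import time

SYS_SCHED_SETATTR = 314   # x86_64 syscall number; differs on other architectures
SCHED_DEADLINE = 6

class SchedAttr(ctypes.Structure):
    # Layout follows sched_setattr(2); time values are in nanoseconds.
    _fields_ = [
        ("size", ctypes.c_uint32),
        ("sched_policy", ctypes.c_uint32),
        ("sched_flags", ctypes.c_uint64),
        ("sched_nice", ctypes.c_int32),
        ("sched_priority", ctypes.c_uint32),
        ("sched_runtime", ctypes.c_uint64),
        ("sched_deadline", ctypes.c_uint64),
        ("sched_period", ctypes.c_uint64),
    ]

libc = ctypes.CDLL(None, use_errno=True)
attr = SchedAttr(
    size=ctypes.sizeof(SchedAttr),
    sched_policy=SCHED_DEADLINE,
    sched_runtime=2_000_000,    # up to 2 ms of CPU time...
    sched_deadline=5_000_000,   # ...delivered within 5 ms...
    sched_period=10_000_000,    # ...every 10 ms
)
if libc.syscall(SYS_SCHED_SETATTR, 0, ctypes.byref(attr), 0) != 0:
    raise OSError(ctypes.get_errno(), "sched_setattr failed (needs root/CAP_SYS_NICE)")

# Self-measurement: how long did each scheduling period actually take?
previous = time.monotonic_ns()
for _ in range(100):
    os.sched_yield()            # give up the remainder of this period's runtime
    now = time.monotonic_ns()
    print("observed period (ns):", now - previous)
    previous = now
```

The external firmware's role is then to timestamp a signal (for example, a UART character) emitted at each wake-up with its own clock, so the two series of periods can be compared.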

Rust was a natural choice for both the verification firmware and the Rusty Worker. Higher-level or interpreted languages would increase non-determinism and reduce our confidence in the accuracy of the collected timing data, whereas C or C++ programs risk undefined behaviour that is unacceptable in a safety-related context. But Rust really came into its own through the relative simplicity of the cargo toolchain for embedded targets: reproducibly building the firmware with a self-built toolchain (and without access to the internet) was just as straightforward as building the Rusty Worker for a Linux target.

Having equipped our suite of bare-metal CI runners with KiCad-designed custom PCBs that feature a Raspberry Pi Pico and Debug Probe, we are able to run “soak” tests to collect thousands of self- and externally-measured deadline scheduling parameters for each iteration of our Codethink Trustable Reproducible Linux (CTRL OS), as soon as engineers push a new commit. We then use the open source Eclipse Trustable Software Framework (TSF) to facilitate automatic “real time” aggregation and statistical analysis of the external scheduling measurements for each commit, clearly communicating the results and trends to senior stakeholders and junior engineers alike.

TSF challenges us both to robustly, systematically, and continuously evidence our claims about Linux’s capabilities as a real-time operating system, and to scrutinise the software used for testing as strictly as the software under test. We are excited to share how we work towards these goals with external scheduling measurements and embedded Rust.

2026-02-01T09:30:00+01:00

OS kernel development often involves a time-consuming testing process and non-trivial debugging techniques. Although emulators like QEMU and Bochs ease this work significantly, nothing compares with the convenience of a userspace development environment. Moving parts of the kernel into a userspace binary is not straightforward, especially if the kernel has almost no POSIX compatibility and is written entirely in assembly. Still, sometimes it is doable. The talk shares the experience, architecture, and design decisions behind compiling the VFS, block, and some other subsystems of KolibriOS as a Linux® interactive shell program and a FUSE filesystem. The unit testing framework and coverage collection tool implemented for assembly (flat assembler) programs are also discussed.
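To show the general pattern of running filesystem code in userspace behind FUSE, here is a minimal read-only example using Python and the fusepy library. It is only an illustration of the idea; the KolibriOS port discussed in the talk is written in flat assembler, not Python, and exposes real kernel subsystems rather than a toy file.

```python
# Tiny read-only FUSE filesystem: mount with `python hellofs.py /mnt/hellofs`.
import errno
import stat
import sys
from fuse import FUSE, FuseOSError, Operations

class HelloFS(Operations):
    FILES = {"/hello.txt": b"hello from userspace\n"}

    def getattr(self, path, fh=None):
        if path == "/":
            return {"st_mode": stat.S_IFDIR | 0o755, "st_nlink": 2}
        if path in self.FILES:
            return {"st_mode": stat.S_IFREG | 0o444, "st_nlink": 1,
                    "st_size": len(self.FILES[path])}
        raise FuseOSError(errno.ENOENT)

    def readdir(self, path, fh):
        return [".", ".."] + [name.lstrip("/") for name in self.FILES]

    def read(self, path, size, offset, fh):
        return self.FILES[path][offset:offset + size]

if __name__ == "__main__":
    FUSE(HelloFS(), sys.argv[1], foreground=True, ro=True)
```

The payoff of this arrangement is exactly what the abstract describes: the filesystem logic runs as an ordinary process, so normal debuggers, unit tests, and coverage tools apply.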

2026-02-01T10:00:00+01:00

Lightning talk about a project that implements a hardware-in-the-loop testing framework for validating Linux distributions on specific development boards. The system uses a universal testing harness that automatically detects target hardware platforms and adapts generic testing scripts to board-specific configurations with help from Claude AI. Platform adaptation is achieved through configuration files that define board-specific parameters, enabling the same testing codebase to validate different hardware capabilities. The GitHub Actions CI/CD integration provides automated testing across multiple platforms, with matrix-based execution that flashes the appropriate images and runs comprehensive validation, including hardware-specific feature testing.

2026-02-01T10:15:00+01:00

The ci-multiplatform project is a generic, OCI-based multi-architecture CI system designed to make cross-platform testing practical for open-source projects using GitLab CI. Originally created while enabling RISC-V support for Pixman (https://gitlab.freedesktop.org/pixman/pixman), it has since grown into an independent project under the RISE (RISC-V Software Ecosystem) umbrella: https://gitlab.com/riseproject/CI/ci-multiplatform, with a mirror on freedesktop.org: https://gitlab.freedesktop.org/pixman/ci-multiplatform

The project provides multi-arch layered OCI images (Base GNU, LLVM, Meson) based on Debian, GitLab Component-style templates, and fully automated downstream test pipelines for the included examples. It supports creating customized OCI images, building and testing across 16 Linux and Windows targets – including x86, ARM, RISC-V, MIPS, and PowerPC – using unprivileged GitLab runners with QEMU user-mode emulation. Architecture-specific options (e.g., RISC-V VLEN configuration) allow developers to exercise multiple virtual hardware profiles without any physical hardware, all within a convenient job-matrix workflow.

The talk covers how the system is engineered, tested, and validated across multiple GitLab instances, and what happens when unprivileged runners, QEMU quirks, toolchain differences, and architecture-specific behaviours all converge in a single pipeline. I will show how projects can adopt ci-multiplatform with minimal effort and turn multi-arch CI from a maintenance burden into a routine part of upstream development.

2026-02-01T10:30:00+01:00

Testing is central to modern software quality, yet many widely used Fortran codebases still lack automated tests. Existing tests are often limited to coarse end-to-end regression checks that provide only partial confidence. With the growth of open-source Fortran tools, we can now bring unit testing and continuous validation to legacy and modern Fortran projects alike.

This talk surveys the current landscape of Fortran testing frameworks before focusing on three I have evaluated in practice — pFUnit, test-drive and veggies — and explaining why pFUnit is often the most robust choice. I will discuss its JUnit-inspired design, use of the preprocessor, and the compiler idiosyncrasies that can still make adoption challenging. I will examine the hurdles that make testing Fortran hard: global state, oversized subroutines, legacy dependencies, and compiler-specific behaviour.

I will then present community-oriented efforts to improve testing practices in Fortran, including development of an open-source Carpentries-style training course on testing in Fortran, with plans to expand into a broader introduction to sustainable Fortran development using open-source linting and documentation tools such as Fortitude and Ford.

Attendees will gain practical guidance for introducing effective testing into existing Fortran codebases, and insight into current efforts towards modern workflows that support reproducibility and continuous delivery.

2026-02-01T10:45:00+01:00

ESPHome is a versatile framework for creating custom firmware for various microcontrollers. In this talk, we will look at how to automatically test the latest ESPHome firmware on an ESP32.

As ESPHome devices are used to interact with the real world, we will also look at how to test that a lux sensor is able to detect light variations.
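As a hedged sketch of what such a check could look like, here is a pytest-style test where `set_lamp` and `read_lux` are hypothetical fixtures (for example, wrappers around ESPHome's native API and a lab-controlled lamp or relay); they are placeholders, not part of ESPHome itself.

```python
import time

def test_lux_sensor_tracks_light(read_lux, set_lamp):
    # Establish a dark baseline, then turn the lamp on and expect a clear rise.
    set_lamp(on=False)
    time.sleep(2)                 # let the sensor reading settle
    dark = read_lux()

    set_lamp(on=True)
    time.sleep(2)
    bright = read_lux()

    assert bright > dark * 2, f"expected a clear rise: dark={dark}, bright={bright}"
```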

2026-02-01T11:05:00+01:00

The CI/CD server Jenkins provides powerful build-quality visualizations through plugins such as Warnings, Coverage, and Git Forensics. These plugins aggregate and visualize data from static analysis tools, coverage reports, software metrics, and Git history, enabling teams to track quality trends across builds. We have now brought this functionality to other widely used CI/CD platforms, including GitHub Actions and GitLab CI.

This talk presents portable, UI-independent implementations of these capabilities for GitHub Actions and GitLab CI: the quality monitor GitHub Action and the GitLab autograding-action. Both tools share a common architecture and codebase with the Jenkins plugins. They automatically analyze pull requests and branch pipelines, generate structured comments and concise Markdown summaries, and enforce configurable quality gates. The solutions are language-agnostic and integrate seamlessly with more than 150 static analysis, coverage, test, and metrics report formats—including Checkstyle, SpotBugs, SARIF, JUnit, JaCoCo, GoCov, and GCC. Additionally, both tools provide an autograding mode for educational use, enabling instructors to assess student submissions through a flexible, configurable point-based scoring system.
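For readers unfamiliar with the term, a quality gate is simply an automated pass/fail check on such reports. The sketch below is a generic illustration of the concept, not the actions' own code: it parses a JUnit XML report (the path is a placeholder) and fails the pipeline when too many tests fail.

```python
import sys
import xml.etree.ElementTree as ET

def junit_gate(report_path: str, max_failures: int = 0) -> None:
    """Fail the build if the JUnit report contains more failures than allowed."""
    root = ET.parse(report_path).getroot()
    # A report is either a single <testsuite> or a <testsuites> wrapper.
    suites = [root] if root.tag == "testsuite" else root.findall("testsuite")
    failures = sum(int(s.get("failures", 0)) + int(s.get("errors", 0)) for s in suites)
    if failures > max_failures:
        sys.exit(f"Quality gate failed: {failures} failing tests (allowed {max_failures})")

if __name__ == "__main__":
    junit_gate("reports/junit.xml")   # placeholder path
```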

2026-02-01T11:35:00+01:00

CI/CD with Gerrit, AI-Enhanced Review, and Hardware-in-the-Loop Testing in Jenkins Pipelines

This presentation will explore advanced Continuous Integration (CI) strategies essential for open-source embedded systems development, moving beyond standard software testing to encompass physical hardware validation. We will begin by establishing the necessity of integrating rigorous unit and integration testing directly into the development workflow, demonstrating how to effectively define these steps within Jenkins Declarative Pipelines (DSL). The core of our approach involves deep integration with Gerrit Code Review, ensuring that tests and static analysis are triggered automatically upon every patch set creation, providing fast feedback to developers.

A significant portion of the talk will focus on achieving true end-to-end validation through Hardware-in-the-Loop (HIL) testing. We will detail the implementation of Labgrid, an open-source tool used to manage and control remote hardware resources (such as embedded boards and IoT devices). This integration allows the Jenkins pipeline to reserve, provision, and execute automated, system-level tests directly on physical target devices before firmware changes are merged.

Furthermore, we will introduce two critical elements for pipeline stability and code quality. Firstly, we will demonstrate the utility of an AI-powered error explanation component (e.g., via the Explain Error Plugin). This feature leverages large language models to analyze complex Jenkins log files and pipeline failures, translating cryptic errors into human-readable insights and suggested fixes, which dramatically cuts down debugging time. Secondly, we will showcase the Warnings Next Generation (Warnings NG) plugin, which serves as a central aggregator, collecting and visualizing issues and potential vulnerabilities reported by various static analysis tools, thereby enforcing strict, quantifiable quality gates within the CI process.

Attendees will gain practical, cutting-edge insights into implementing a robust, AI- and hardware-enhanced CI/CD workflow suitable for modern open-source projects.

2026-02-01T12:05:00+01:00

The problem with the currently most common way of running code reviews, Pull Requests, is that they have the nasty habit of blocking the flow of delivery. They introduce a cost of delay. Any delay reduces feedback, which in turn drives down quality.

The usual way to achieve fast, efficient and effective Continuous Code Reviews without disrupting the flow of delivery is through Pair Programming or Team Programming. However, for various valid reasons, these can be a cultural stretch for many teams and organisations.

In 2012, a novice team practising trunk-based development put in place a fairly uncommon but efficient way of implementing continuous code reviews on mainline without ever blocking the flow of delivery.

This team went from a ragtag bunch to a reference team within the organisation, with auditors floored by the high quality the team delivered.

Target audience: software engineers, test engineers, infrastructure engineers, team leads, engineering managers, CTOs

2026-02-01T12:35:00+01:00

With everything changing in tech at a frenetic pace, the emphasis on developer productivity has overshadowed the true essence of developer experience (DevEx). While frameworks like SPACE, getDX, and DORA metrics provide valuable insights, they often miss the mark on capturing developers' real, day-to-day experiences using tools and services, instead focusing strictly on the bottom line for the company. Meanwhile, developers and practitioners are job-hopping more than ever. This talk will explore the origins and evolution of "developer experience," dissect popular frameworks, and advocate for a more balanced approach that values the practitioner's perspective. At the end we will set a path towards integrating top-down metrics with bottom-up feedback, ensuring an approach to developer experience that fosters innovation and satisfaction.

2026-02-01T13:05:00+01:00

Even with robust CI/CD, production rollouts can hit unexpected snags. In Kubernetes, Argo Rollouts excels at Progressive Delivery and automated rollbacks to mitigate deployment issues, but what if we could go a step further?

This session explores how to elevate your release process by integrating Agentic AI and asynchronous coding agents with Argo Rollouts canary deployments. We'll demonstrate how an intelligent agent can automatically analyze a rollout failure, pinpointing the root cause. Beyond diagnosis, these agents can take proactive steps on your behalf, suggesting and even implementing code fixes as new pull requests, which can be redeployed automatically after PR review. This approach moves us closer to truly self-healing deployments.

Join us to learn how to combine the power of Kubernetes and Argo Rollouts with the autonomous capabilities of Agentic AI, achieving a release experience that is not only seamless but also resilient.

2026-02-01T13:35:00+01:00

We love ArgoCD, but it creates a classic "map vs. territory" problem. We treat Git as our "map", our single source of truth. But the cluster is the "territory", and it's often more complex than the map shows. This becomes a crisis with the 3 AM hotfix: an SRE fixes production, changing the territory. ArgoCD, loyal to the map, sees this as drift and helpfully overwrites the fix, re-breaking the cluster. The problem is that Git isn't our Truth, it's our Intention. This talk introduces a pragmatic solution: Cluster-Scoped Snapshotting. We’ll show a simple pattern that dumps the entire live cluster state (the "territory") into its own "reality" Git repo. To automate this, we wrote a small open-source tool called Kalco, but the pattern is the real takeaway. This "reality" repo gives us a powerful "pre-flight diff" in our CI pipeline, comparing our "intention" (the app repo) against the "truth" (the snapshot repo). This simple check lets us bootstrap existing clusters, create a complete audit log, and stop our pipeline before it merges a change that conflicts with a critical live fix.
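A minimal sketch of the snapshot side of the pattern follows; Kalco automates and generalizes this, and the resource kinds, repository path, and comparison step below are illustrative only.

```python
# Dump selected live resources into a "reality" Git repo, then let CI diff the
# rendered "intention" manifests against that snapshot before merging.
import pathlib
import subprocess

SNAPSHOT_REPO = pathlib.Path("cluster-reality")     # local clone of the snapshot repo
KINDS = ["deployments", "services", "configmaps"]   # extend to whatever matters to you

def snapshot() -> None:
    """Record the current territory: one YAML dump per resource kind."""
    for kind in KINDS:
        live = subprocess.run(
            ["kubectl", "get", kind, "--all-namespaces", "-o", "yaml"],
            check=True, capture_output=True, text=True,
        ).stdout
        (SNAPSHOT_REPO / f"{kind}.yaml").write_text(live)
    subprocess.run(["git", "-C", str(SNAPSHOT_REPO), "add", "-A"], check=True)
    subprocess.run(["git", "-C", str(SNAPSHOT_REPO), "commit", "--allow-empty",
                    "-m", "cluster snapshot"], check=True)

def preflight_diff(rendered_manifests: pathlib.Path) -> str:
    """Diff intention (rendered manifests) against reality (the snapshot).

    A non-empty diff in CI is the cue to stop and check whether the change
    would overwrite a live fix. In practice you would strip server-managed
    fields (status, managed fields) before comparing.
    """
    result = subprocess.run(
        ["git", "diff", "--no-index", str(SNAPSHOT_REPO), str(rendered_manifests)],
        capture_output=True, text=True,   # exit code 1 simply means "differences found"
    )
    return result.stdout
```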

2026-02-01T14:05:00+01:00

We've all heard that we should test our software, but what happens when we don't? Sometimes, it leads to strange and unexplainable events.

Is 'testing more' always the right solution? What do these bugs reveal about software and its failures? And how can we use these lessons to build more resilient systems?

Let's explore together the most bizarre software bugs in history!

2026-02-01T14:20:00+01:00

Forgotten files, incomplete system info, back-and-forth emails... Bug reporting can be a messy process and sometimes wastes a lot of time on both the developer and user sides. However, a lot of it can be normalized and automated. In this talk we introduce DebugPack, a CLI tool that helps us simplify the bug reporting process for our team and ensures that developers always have the necessary information for successful bug hunting.

https://gitlab.nic.cz/labs/bird-group/debugpack
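To make the idea concrete, a toy version of such a collector might look like the following. It is purely illustrative and is not how DebugPack itself is implemented.

```python
# Bundle basic system facts plus any files developers always ask for into one archive.
import json
import pathlib
import platform
import tarfile
import tempfile

def collect_report(extra_files: list[str], out: str = "bug-report.tar.gz") -> str:
    info = {
        "platform": platform.platform(),
        "python": platform.python_version(),
        "machine": platform.machine(),
    }
    with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
        json.dump(info, f, indent=2)
    with tarfile.open(out, "w:gz") as tar:
        tar.add(f.name, arcname="system-info.json")
        for path in extra_files:              # config dumps, logs, core files, ...
            p = pathlib.Path(path)
            if p.exists():
                tar.add(p, arcname=p.name)
    return out

# e.g. collect_report(["/etc/myapp/config.yaml", "/var/log/myapp.log"])
```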

2026-02-01T14:35:00+01:00

Software backdoors aren’t a myth—they’re a recurring nightmare. Time and again, we’ve watched malicious code slip into open-source ecosystems. The notorious xz compromise grabbed headlines, but it wasn’t the first act in this drama. Earlier breaches included the PHP incident in 2021, as well as vulnerabilities in vsFTPd (CVE-2011-2523) and ProFTPD (CVE-2010-20103). And here’s the unsettling truth: these examples likely just scratch the surface. Why does it matter? Because a single backdoor in a widely used project turns into a hacker’s dream buffet—millions of machines served up for exploitation.

Tracking down and eliminating backdoors isn’t a quick win—it’s like diving headfirst into sprawling code jungles. Sounds epic? In reality, even for a veteran armed with reverse-engineering gear, it’s a grueling slog. So grueling that most people simply don’t bother. The good news? New tools such as ROSA (https://github.com/binsec/rosa) prove that large-scale backdoor detection can be automated—at least to a significant extent. Here’s the twist: traditional fuzzers like AFL++ (https://github.com/AFLplusplus/AFLplusplus) test programs with endless input variations to trigger crashes. It’s brute force, but brilliant for uncovering memory-safety flaws. Backdoors, however, play by different rules—they don’t crash; they lurk behind hidden triggers and perfectly valid behaviors. ROSA changes the game by training fuzzers to tell “normal” execution apart from “backdoored” behavior.

But there’s a catch: ROSA’s current use case is after-the-fact analysis, helping security experts vet full software releases (including binaries). Following the shift-left paradigm, our goal is to bring this detection magic into the CI pipeline—so we can stop backdoors before they ever land. Sounds great, but reality bites: ROSA produces false alarms and can require a significant test budget to find backdoors, which are a nightmare in CI. In this talk, we would like explore the methodological and technical upgrades needed to build a ROSA-based backdoor detection prototype that thrives in CI environments. Think reduced resources, and minimal noise—all within the tight resource windows CI jobs demand.

2026-02-01T14:55:00+01:00

Modern automated testing environments generate vast amounts of test results, making failure analysis increasingly complex as both the number of tests and failures grow. This presentation introduces an AI-driven approach to failure aggregation, leveraging text embeddings and semantic similarity to efficiently group and analyze unique failures. The workflow integrates open-source, pre-trained models for text embedding (such as Sentence Transformers) and vector similarity search using PostgreSQL with pgvector, enabling scalable and low-barrier adoption.
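A minimal sketch of that workflow, assuming a PostgreSQL database with the pgvector extension and an illustrative `failures(message text, embedding vector(384))` table (the table and column names are not from the talk):

```python
from sentence_transformers import SentenceTransformer
import psycopg
from pgvector.psycopg import register_vector

model = SentenceTransformer("all-MiniLM-L6-v2")   # 384-dimensional embeddings

conn = psycopg.connect("dbname=testresults")      # placeholder connection string
register_vector(conn)                             # teach psycopg about the vector type

def ingest(message: str) -> None:
    """Embed a failure message and store it for later similarity search."""
    emb = model.encode(message)
    conn.execute("INSERT INTO failures (message, embedding) VALUES (%s, %s)",
                 (message, emb))
    conn.commit()

def similar_failures(message: str, k: int = 5):
    """Return the k most semantically similar known failures (cosine distance)."""
    emb = model.encode(message)
    return conn.execute(
        "SELECT message, embedding <=> %s AS distance "
        "FROM failures ORDER BY distance LIMIT %s",
        (emb, k),
    ).fetchall()
```

Grouping then amounts to treating failures within a chosen distance threshold as the same underlying issue, so triage happens per cluster rather than per test run.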

2026-02-01T15:25:00+01:00

In 2024, I left my job to build CDviz full-time—an open source platform for CI/CD observability using CDEvents, an emerging specification with minimal ecosystem adoption. This talk shares lessons from building production tooling on early-stage standards.

You'll see:

- Why I chose to build on CDEvents despite limited adoption
- Technical challenges: converting diverse tool events into a unified format
- Architecture decisions: PostgreSQL/TimescaleDB for storage, Grafana for visualization
- Live demo: CDviz tracking deployments with real metrics
- What worked, what didn't, and lessons for building on emerging specs

This is a builder's story about creating interoperability tooling before the ecosystem is ready—and why standardization matters even when adoption is slow.

2026-02-01T15:55:00+01:00

Testing VoIP infrastructure at scale is far from straightforward. These systems route calls and enrich the caller experience with features such as audio prompts, interactive menus, and caller queues. With so many features and interactions, manually testing every scenario is impossible, so test automation is essential.

As a software tester for a real-world VoIP infrastructure, I built an automated framework using the open-source SIPSorcery library (https://github.com/sipsorcery-org/sipsorcery) to create programmable softphones that simulate complex call interactions. The talk covers the most interesting challenges faced, such as verifying who can be heard and seen in audio and video calls, mimicking specific physical phones, and making timing-sensitive tests run reliably.

Attendees will take away insights into the challenges of large-scale VoIP testing and practical strategies for designing automated tests that are reliable, repeatable, and maintainable.

2026-02-01T16:25:00+01:00

In this talk, we present formal verification, a technique for mathematically proving that code is safe for all possible inputs.

We explore in particular the theorem prover Rocq, and how we use it to model and verify production code at Formal Land. We show the two primary methods for creating a code model, either testing or proving its equivalence with the implementation, and the main classes of properties that are interesting to verify formally on a program.
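For flavour, here is a toy version of the "prove the model matches the implementation" step, written in Lean 4 rather than Rocq (which the talk itself uses); the functions are invented for illustration.

```lean
-- An "implementation" and a cleaner "model" of the same computation,
-- plus a proof that they agree on every input.
def impl (n : Nat) : Nat := n + n      -- how the code happens to compute it
def model (n : Nat) : Nat := 2 * n     -- the specification-style model

theorem impl_eq_model (n : Nat) : impl n = model n := by
  unfold impl model
  omega                                 -- linear arithmetic closes the goal
```

Real production code is of course far larger, which is why the choice between testing equivalence and proving it, and the choice of which properties to verify, matters.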