Virtually Attend FOSDEM 2026

Databases Track

2026-01-31T10:30:00+01:00

In this session, four seasoned database administrators with sound knowledge of both PostgreSQL and MySQL present an unbiased comparison of the two technologies. Attendees will learn about the architectural and developer-experience (DX) differences between two of the world's most popular databases.

Pep Pla, with his peculiar sense of humour, will open the session with a deep dive into the two systems' MVCC architectures. The audience will learn why we need MVCC, and how differently Postgres and MySQL implement it: Postgres relies on row versioning and vacuuming dead tuples, while MySQL updates rows in place and tracks old versions in the undo log.
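To make the contrast concrete, here is a minimal, hedged sketch of how each system's MVCC bookkeeping can be observed from Python (psycopg2, PyMySQL, and local test instances assumed; connection details are hypothetical):

    import psycopg2   # pip install psycopg2-binary
    import pymysql    # pip install pymysql

    # PostgreSQL: old row versions stay in the table as dead tuples until VACUUM reclaims them.
    pg = psycopg2.connect("dbname=test user=postgres")
    with pg, pg.cursor() as cur:
        cur.execute("""
            SELECT relname, n_dead_tup, last_autovacuum
            FROM pg_stat_user_tables
            ORDER BY n_dead_tup DESC
            LIMIT 5
        """)
        for row in cur.fetchall():
            print("postgres:", row)

    # MySQL/InnoDB: old row versions live in the undo log; the history list length
    # shows how much undo the purge threads have not yet cleaned up.
    my = pymysql.connect(host="127.0.0.1", user="root", password="")
    with my.cursor() as cur:
        cur.execute("""
            SELECT name, `count` FROM information_schema.INNODB_METRICS
            WHERE name = 'trx_rseg_history_len'
        """)
        print("mysql:", cur.fetchone())
    my.close()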

A broad-strokes overview from Ben Dicken, who has worked closely with both, will emphasize where ecosystem cross-pollination would help. This includes differences in table storage, bloat management, replication, and process-per-connection vs thread-per-connection architecture.

Postgres and MySQL take fundamentally different approaches to logical replication. Rohit Nayak and Shlomi Noach will examine how these designs affect WAL/binlog retention, backpressure, and CDC workloads, explore their failover implications, and highlight key feature-parity gaps between the two systems.
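As a hedged illustration of where retention pressure shows up in each system (connection details hypothetical): a logical replication slot in Postgres pins WAL until its consumer advances, while MySQL purges binlogs on a timer regardless of consumers.

    import psycopg2
    import pymysql

    # PostgreSQL: a stalled logical-replication consumer means WAL keeps accumulating.
    pg = psycopg2.connect("dbname=test user=postgres")
    with pg, pg.cursor() as cur:
        cur.execute("""
            SELECT slot_name, active,
                   pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
            FROM pg_replication_slots
        """)
        print(cur.fetchall())

    # MySQL: binlogs are kept for binlog_expire_logs_seconds and then purged,
    # so a consumer that falls too far behind simply loses its position.
    my = pymysql.connect(host="127.0.0.1", user="root", password="")
    with my.cursor() as cur:
        cur.execute("SHOW BINARY LOGS")
        print(cur.fetchall())
        cur.execute("SHOW VARIABLES LIKE 'binlog_expire_logs_seconds'")
        print(cur.fetchone())
    my.close()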

2026-01-31T11:25:00+01:00

The success of open source databases like PostgreSQL and MySQL/MariaDB has created an ecosystem of derivatives claiming "drop-in compatibility." But as the distance between upstream and these derivatives grows, user confusion and brand dilution can follow.

To address this, we explore the challenge of compatibility with de facto standards from two distinct angles: a governance perspective on defining the compatibility criteria, and a systems engineering case study on implementing them.

  1. The Standard: We present the findings from the "Establishing the PostgreSQL Standard" working group held at PGConf.EU 2025. This progress report details the community's consensus on the hard requirements needed to fix the "wild west" of marketing claims, including:
    • Core SQL: Defining the non-negotiable functions, types, and PL/pgSQL.
    • Protocol: Why wire compatibility is insufficient without consistent transactional and pg_catalog behaviour.
    • Ecosystem: The critical requirements for integration with logical replication and tools like Patroni.
  2. The Implementation: Maintaining compatibility with MySQL/MariaDB in TiDB, a distributed database engine, involves far more than matching the syntax of an evolving SQL dialect:
    • We explore the architectural friction of making TiDB speak the MySQL wire protocol and support MySQL syntax (see the sketch after this list).
    • We cover compatibility with MySQL's binary-log-based replication.
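A minimal sketch of what wire-protocol compatibility looks like in practice: a stock MySQL driver pointed at TiDB (host, port, and credentials are assumptions; TiDB listens on port 4000 by default):

    import pymysql

    conn = pymysql.connect(host="127.0.0.1", port=4000, user="root", password="")
    with conn.cursor() as cur:
        cur.execute("SELECT VERSION()")       # returns a MySQL-compatible version string
        print(cur.fetchone())
        cur.execute("SELECT tidb_version()")  # a TiDB-specific function behind the same protocol
        print(cur.fetchone())
    conn.close()
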
2026-01-31T11:55:00+01:00

As analytics ecosystems grow more diverse, organisations increasingly need to query data across warehouses, data lakes, and operational systems without excessive movement or duplication. Query federation has become essential because it enables unified SQL access and intelligent pushdown into heterogeneous sources. This talk introduces the core principles of federation, why it matters for modern OLAP workloads, and how it differs from Trino.

Using StarRocks as a model system, we highlight its vectorized execution engine, native connectors, and deep Apache Iceberg integration that together deliver high-performance lakehouse querying. We examine common lakehouse challenges—schema evolution, file fragmentation, and object-storage latency—and show how federation and hot/cold data separation help address them.

Finally, we explore federating additional sources such as Elasticsearch, PostgreSQL, and Apache Paimon to build a unified analytical architecture.

2026-01-31T12:20:00+01:00

We all write SQL, but how many of us have looked under the hood of a relational database like PostgreSQL? This talk is a deep dive into the guts of the database engine, tracking a simple SELECT statement from the moment you hit "Enter" to the final result set.

We'll lift the veil on the core components: the parser, the planner (and the optimizer's black magic!), and the executor, and see how they transform a text string into a low-level, high-performance operation. Using a live, interactive session on a PostgreSQL instance, we'll expose the role of the shared buffer cache, explain why an index works (or doesn't), explore the true cost of I/O, and understand the significance of the write-ahead log (WAL) for read operations.
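For a taste of what the live session covers, here is a minimal sketch (psycopg2 and a local PostgreSQL instance assumed; the table name is hypothetical) of asking the planner and executor to explain themselves:

    import psycopg2

    conn = psycopg2.connect("dbname=test user=postgres")
    with conn, conn.cursor() as cur:
        # BUFFERS distinguishes shared-buffer cache hits from real disk reads,
        # which is exactly the I/O cost the talk digs into.
        cur.execute("EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM orders WHERE customer_id = 42")
        for (line,) in cur.fetchall():
            print(line)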

Whether you're a developer frustrated with slow queries or a database administrator looking to squeeze out every millisecond of performance, you'll leave this talk with a mental model that demystifies query execution and gives you the knowledge to write queries that fly.

2026-01-31T12:45:00+01:00

While optimizing a new heap storage engine across both MySQL and a PostgreSQL-based database, we encountered a puzzling result: on MySQL, throughput stalled below 500k tpmC, while the other database achieved over 1 million tpmC. The mystery deepened when three different TPC-C benchmarks each told a conflicting story about MySQL’s speed.

This talk details the systematic investigation to resolve these contradictions and reclaim the lost performance. We’ll walk through the methodical process of isolating variables across the entire software stack, dissecting benchmark implementations, profiling execution end-to-end with advanced tools, analyzing client/server protocol behavior, and comparing query optimization plans.

The investigation revealed that the performance gap was not caused by a single flaw, but by a cascade of inefficiencies in multiple areas of the stack. Subtle issues in query planning, protocol handling, and client-side implementation conspired to create overwhelming overhead. By addressing these interconnected problems holistically – through optimizer fixes, protocol enhancements, and client improvements – we transformed MySQL’s performance profile to reveal the engine’s true potential.

The outcome was a dramatic turnaround: with additional improvements, the new engine's performance on MySQL now reaches almost 2 million tpmC.

This case study underscores a critical lesson: database performance, for OLTP workloads in particular, is determined not by any single component, but by the precise alignment of the entire database stack, from the client down to the storage engine.

2026-01-31T13:10:00+01:00

RonDB is a high-performance, MySQL-compatible distributed database engineered for real-time, latency-critical workloads. Built on decades of development in the MySQL NDB Cluster—led by the original founder of the NDB product—RonDB extends the NDB storage engine with new capabilities, cloud-native automation, and modern APIs tailored for large-scale AI and online services.

This talk will describe how RonDB consistently delivers 1–4 ms latency even for large batched operations involving hundreds of rows and multi-megabyte payloads, and will explain the architectural techniques that make such performance possible. We will highlight RonDB’s role as the online feature store powering the Hopsworks Real-Time AI platform, deployed in production at companies such as Zalando for personalized recommendations and other low-latency machine-learning applications.

The session will also introduce key components of the RonDB ecosystem:

rondb-helm – Kubernetes and Helm tooling for deploying, managing, and scaling RonDB clusters in cloud environments.

rondb-tools – Scripts and automation utilities for quickly setting up local or distributed RonDB testbeds.

New API layers, including:

  • A REST API server offering batch key operations, batch scans, and aggregated SQL queries.
  • An experimental Redis-compatible interface, enabling RonDB to act as a durable, high-throughput backend behind standard Redis commands (see the sketch below).
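A minimal sketch of the experimental Redis-compatible interface, using the standard redis-py client (host and port are assumptions; consult the RonDB documentation for the actual endpoint configuration):

    import redis

    r = redis.Redis(host="127.0.0.1", port=6379, decode_responses=True)
    r.set("feature:user:42:avg_basket", "37.5")   # standard Redis commands...
    print(r.get("feature:user:42:avg_basket"))    # ...backed by durable RonDB storage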

We will outline the active collaboration between the RonDB team and Oracle’s MySQL NDB Cluster engineers, and how RonDB extends and complements the upstream NDB ecosystem. In addition, we will present ongoing cooperation with Datagraph to build a SPARQL interface to RonDB, leveraging Datagraph’s Common Lisp NDB API.

Attendees will come away with a clear understanding of how RonDB achieves its performance characteristics, how it integrates with modern real-time AI pipelines, and how to deploy, operate, and experiment with RonDB using the available open-source tools.

GitHub repositories: https://github.com/logicalclocks/rondb https://github.com/logicalclocks/rondb-helm https://github.com/logicalclocks/rondb-tools https://github.com/datagraph/cl-ndbapi/

Web sites of note: https://rondb.com https://docs.rondb.com https://hopsworks.ai https://blog.dydra.com/@datagenous/blog-catalog

2026-01-31T13:15:00+01:00

DuckDB has traditionally been seen as a last-mile analytics powerhouse, the fastest way to run a SQL query on your laptop. But DuckDB offers more than just fast SQL, of course; it supports full database semantics and ACID transactions, behaving like a fully fledged, in-process OLAP database. The in-process component has sometimes been viewed as a limitation when considering DuckDB as a data warehouse.

However, DuckDB now supports reading and writing to most Open Table Formats (OTFs), including Iceberg, Delta, and DuckLake. This capability puts DuckDB in a very different position: it allows DuckDB to act as a SQL engine in the cloud (or on your local machine) and run queries against any OTF stored in remote cloud storage. DuckDB can now be the almighty, single-node query engine that powers your data analytics use cases.
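Getting a feel for this takes only a few lines. A minimal sketch using DuckDB's Python API (the iceberg extension is assumed to be installed; the table location is hypothetical and any object-storage credential setup is omitted):

    import duckdb

    con = duckdb.connect()            # in-process: no server to run
    con.sql("INSTALL iceberg")
    con.sql("LOAD iceberg")
    result = con.sql("""
        SELECT count(*)
        FROM iceberg_scan('s3://my-bucket/warehouse/events')
    """)
    print(result.fetchall())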

In this talk I will also dive into:

  • Why this change allows for a "multi-player" DuckDB experience.
  • How DuckDB efficiently queries very large tables leveraging table statistics and cache.
  • Why building native implementations with minimal dependencies to interact with OTFs is hard but can potentially pay off.

Projects: https://github.com/duckdb/duckdb https://github.com/duckdb/ducklake https://github.com/duckdb/duckdb-iceberg https://github.com/duckdb/duckdb-delta

2026-01-31T13:20:00+01:00

Observability data isn’t typically blended with the data that your analysts are working with. The two are usually stored in entirely separate databases and interrogated through different tools.

But that needn’t be the case. At Grafana Labs we’ve started blending this data together, to answer questions that we or our customers have, such as:

  • How much revenue did that downtime cost me?
  • How did latency affect sales last Black Friday?
  • Which customers were impacted by that incident, and which ones are the highest priority to follow up with?

The FOSS projects we’re combining to get there are:

  • The LGTM stack (github.com/grafana) for Observability
  • Cube core (cube.dev/docs/product/getting-started/core) for Semantic Layer
  • dbt core (github.com/dbt-labs/dbt-core) for transforming SQL data
  • Grafana itself to blend, visualise and even alert on the end-result

During this talk I’ll describe how you too can fit these pieces together and use them to answer similar questions for your own context.

2026-01-31T13:25:00+01:00

Everyone is running their applications on Kubernetes these days. Most of the time the application servers are stateless, which makes this easy, because the database behind the application is responsible for storing the state. But what if you also want to run your database on the same Kubernetes stack? Will you use stateful sets? Will you use network-attached storage? That kind of storage introduces a lot of disk latency because of the mandatory network hops. This is why, in many environments, the database servers are still dedicated machines that are treated as pets while the rest of the fleet is more like cattle.

In this session I will speak about how we run our databases on Kubernetes using local ephemeral storage to store the data, and how we are confident we will not lose it in the process!

2026-01-31T13:35:00+01:00

PostgreSQL already knows how to parse SQL, track object dependencies, and understand your schema. Most tools that work with schemas reimplement this from scratch. What if you just asked Postgres instead?

This talk digs into the techniques that make that possible. We’ll start with the shadow database pattern: applying schema files to a temporary PostgreSQL instance and letting Postgres handle all parsing and validation. Then we’ll explore pg_depend and the system catalogs, where PostgreSQL tracks that your view depends on a function, which depends on a table, which depends on a custom type. I’ll show the exact catalog queries that extract this dependency graph, the edge cases that make it interesting (extension-owned objects, implicit sequences, array types, function bodies that pg_depend can’t see), and how to turn it all into a correct topological ordering for migration generation.
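A hedged sketch of the kind of catalog query involved (psycopg2 assumed; real tools filter and post-process far more carefully): walk pg_depend and let pg_identify_object() turn raw OIDs into readable identities.

    import psycopg2

    conn = psycopg2.connect("dbname=shadow user=postgres")   # e.g. the shadow database
    with conn, conn.cursor() as cur:
        cur.execute("""
            SELECT (pg_identify_object(d.classid, d.objid, d.objsubid)).identity          AS dependent,
                   (pg_identify_object(d.refclassid, d.refobjid, d.refobjsubid)).identity AS referenced,
                   d.deptype
            FROM pg_depend d
            WHERE d.deptype IN ('n', 'a')   -- 'normal' and 'auto' dependencies
            LIMIT 20
        """)
        for dependent, referenced, deptype in cur.fetchall():
            print(f"{dependent} depends on {referenced} ({deptype})")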

I learned this while building pgmt, a tool that diffs PostgreSQL schemas to generate migrations. But the techniques apply to anything that needs to understand a Postgres schema -- linters, drift detectors, visualization tools, CI validation -- and they let you build on Postgres’s own knowledge instead of reinventing it.

2026-01-31T14:00:00+01:00

This talk discusses the design choices behind the open source clickhouse-sink-connector project (https://github.com/Altinity/clickhouse-sink-connector), which leverages Debezium. It reliably replicates data to ClickHouse, a well-known open source real-time analytics database that can be deployed anywhere. The sink connector provides an alternative to proprietary solutions that typically lock people in or are only available in the cloud. It works with MySQL, MariaDB, Postgres, Oracle (experimental), and MongoDB. As a bonus, binary log analysis and Time Travel will also be presented.

2026-01-31T14:25:00+01:00

Using SQL from other programming languages can prove to be quite the hassle: wrangling the database rows into the host's language types is tedious and error prone, and making sure the application code stays up to date with the ever-changing database schema is just as challenging.

To address these developer-experience shortcomings, ORMs try to shield the developer from ever having to write any SQL at all. This doesn't feel totally satisfying though: as developers we are always keen on using the right language for the job, so what would it look like to fully embrace SQL instead of trying to abstract it away?

In this talk we'll look at Squirrel (https://github.com/giacomocavalieri/squirrel), a library that tackles database access in Gleam (https://gleam.run): a functional, statically-typed language. We'll explore how code generation from raw SQL can help bridge the gap between the database and a functional language without compromising on type-safety, performance or developer experience.

2026-01-31T14:50:00+01:00

Time series databases face the significant challenge of processing vast amounts of data. At VictoriaMetrics, we are actively developing an open-source time series database entirely from scratch using Go. Our average installation handles between 2 and 4 million samples per second during ingestion, with larger setups managing over 100 million samples per second on a single cluster. In this presentation, we will explore various techniques essential for constructing write-heavy applications, such as:

  • Understanding and mitigating write amplification.
  • Implementing instant database snapshots.
  • Safeguarding against data corruption after power outages.
  • Evaluating the advantages and disadvantages of using a write-ahead log (a minimal sketch follows this list).
  • Enhancing reliability in Network File System (NFS) environments.

Throughout the talk, we will illustrate these concepts with real code examples sourced from open-source projects.
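As a flavour of those trade-offs, here is a minimal, generic sketch of the write-ahead-log idea itself (not VictoriaMetrics code): acknowledge a write only after it has been appended and fsync'd, which is precisely the durability-versus-throughput tension the talk evaluates.

    import os

    class WriteAheadLog:
        def __init__(self, path="wal.log"):
            self.f = open(path, "ab", buffering=0)

        def append(self, records):
            for r in records:
                self.f.write(r + b"\n")
            os.fsync(self.f.fileno())   # the write survives a power loss only after this returns

    wal = WriteAheadLog()
    wal.append([b'sample{job="api"} 1.0 1738300000'])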

2026-01-31T15:15:00+01:00

Contributing to MariaDB: Learn how to contribute to the MariaDB server codebase, be prepared for what it takes, and see what you will learn along the way.

Have you ever wondered what it would take to actually get your contribution into the MariaDB server codebase?

We will take one specific contribution and follow it through the process. It's a bug-fix contribution: two lines of actual code change. On smaller codebases, used by fewer people, this would probably have taken minutes to process. It is somewhat different with the MariaDB server's codebase, but for a very good reason!

Contributing to Postgres: Contributing to open source can feel intimidating early in your career, especially with a project as widely used and critical as Postgres. This feeling can be exacerbated by the Postgres contribution process, which revolves around mailing lists and commit fests instead of GitHub issues and pull requests. Often, confidence comes after action; the first patch is the hardest. Even small contributions can reach thousands of people.

This talk traces my path from setting up a local build and gaining familiarity with the codebase to contributing bug-fix patches and documentation updates. Also, it outlines how the Postgres development process and community operate. The aim is to demystify the process so more engineers feel confident contributing to Postgres, and leave with the context and practical steps to make their first (or next) patch and take their favourite database to new heights.

2026-01-31T15:45:00+01:00

From plain-old Postgres to the Grafana stack (Loki, Grafana, Tempo, and Mimir), OpenSearch, Cassandra, and ClickHouse, the landscape of telemetry storage options is as vast as it is overwhelming. With so many choices, how do we decide which datastore is right for the job?

In this talk, Joshua will guide attendees through the foundational principles of telemetry—covering metrics, traces, logs, profiles, and wide events—and break down the strengths and limitations of different database technologies for each use case. We’ll examine how traditional relational databases like Postgres can still hold their own, where OpenSearch and Prometheus fit into the picture, and why specialized stacks like LGTM (Loki, Grafana, Tempo, Mimir) are so popular in modern observability pipelines. And, of course, we’ll highlight the growing role of ClickHouse as a versatile and high-performance option for logs, traces, and more, and of VictoriaMetrics as a drop-in replacement for Prometheus.

By the end of this session, attendees will have a clearer understanding of the trade-offs between these datastores and how to make informed decisions based on the unique requirements of their systems. Whether you’re building an observability stack from scratch or looking to optimize an existing setup, this tour of the observability datastore landscape will leave you better equipped to navigate the options.

2026-01-31T16:10:00+01:00

Change Data Capture (CDC) has become foundational for real-time analytics, cross-region replication, event-driven systems, and streaming ingestion pipelines. Databases like MySQL and Postgres expose change streams through a single-writer log, so CDC in these systems is comparatively straightforward. Modern distributed SQL databases like TiDB require a fundamentally different design and face bigger challenges, because they need to order the writes of multiple writers and deal with millions of tables.

This talk is about TiCDC’s architecture and how its event-driven pipeline handles thousands of concurrent writers and millions of tables. To preserve total order, TiCDC must merge, order, and stream updates arriving concurrently from multiple regions, Raft groups, and storage nodes, all while maintaining correctness and low latency.
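A conceptual sketch (not TiCDC's actual implementation) of the core ordering problem, using a k-way merge of per-region change streams by commit timestamp:

    import heapq

    # Each region/Raft group emits its own stream, already ordered by commit_ts.
    region_a = [(100, "INSERT t1 ..."), (105, "UPDATE t2 ...")]
    region_b = [(101, "DELETE t1 ..."), (104, "INSERT t3 ...")]
    region_c = [(102, "UPDATE t1 ...")]

    # A k-way merge yields one totally ordered change stream for downstream sinks.
    # The hard part in a real system is the watermark (a "resolved" timestamp) that
    # tells you when it is safe to emit events up to a given commit_ts.
    for commit_ts, event in heapq.merge(region_a, region_b, region_c):
        print(commit_ts, event)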

It will explore the challenges and the evolution of the TiCDC design over several iterations, with lessons learned the hard way.

2026-01-31T16:35:00+01:00

Database usage in practice often involves heavy text processing. For example, in "observability" use cases, databases must extract, store, and search billions of log messages daily. Most databases, including many column-oriented OLAP databases, struggle with such massive amounts of text data. The only way to process text data at scale is by using specialized inverted indexes in databases.

This presentation explains how inverted indexes work and which (text) search patterns they support. Where appropriate, we describe our experience and the gotchas we encountered when adding an inverted index to ClickHouse, one of the most popular open-source databases for analytics.
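The core idea fits in a few lines of illustrative Python: map each token to the set of rows that contain it (a posting list), so a search intersects small lists instead of scanning every log line.

    from collections import defaultdict

    logs = {
        1: "connection timeout from 10.0.0.1",
        2: "user login failed",
        3: "connection reset by peer",
    }

    index = defaultdict(set)                      # token -> set of row IDs (posting list)
    for row_id, line in logs.items():
        for token in line.lower().split():
            index[token].add(row_id)

    # "Which rows contain both 'connection' and 'timeout'?" becomes a set intersection.
    print(index["connection"] & index["timeout"])  # {1}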

2026-01-31T17:00:00+01:00

In 2017, Mark Raasveldt and Hannes Mühleisen (who went on to create DuckDB) presented a VLDB paper entitled “Don’t Hold My Data Hostage – A Case For Client Protocol Redesign.” Their paper proposed the use of columnar serialization to achieve order-of-magnitude improvements in query result transfer performance. Eight years later, this talk revisits Raasveldt and Mühleisen’s argument and describes the central role that the Apache Arrow project has played in realizing this vision—through the dissemination of Arrow IPC, Arrow Flight, Arrow Flight SQL, Arrow over HTTP, and ADBC across numerous open source and commercial query systems. The talk concludes with a call to action to introduce Arrow-based transport to the systems that continue to “hold data hostage.”
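A hedged sketch of what this looks like from the client side today, using ADBC against PostgreSQL (connection URI and table name are hypothetical): the result arrives as an Arrow table, columnar end to end, instead of being re-serialized row by row.

    import adbc_driver_postgresql.dbapi as pg_adbc

    with pg_adbc.connect("postgresql://localhost/test") as conn:
        with conn.cursor() as cur:
            cur.execute("SELECT * FROM measurements LIMIT 100000")
            table = cur.fetch_arrow_table()   # a pyarrow.Table, no per-row conversion
            print(table.num_rows, table.schema)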

2026-01-31T17:25:00+01:00

Our database had reached a point where failure scenarios were becoming increasingly complex and time-consuming. A single node could take up to 15 minutes to recover. It was expensive to run and operate, and it simply couldn’t scale to meet the customer demand we were facing. It became clear that we needed a new design. By leveraging a modern architecture and the latest open-source technologies, we rebuilt our database for the cloud era. Recoveries that once took 15 minutes now complete in seconds. Operational costs dropped by 50%, and query latencies improved by 200%. These gains weren’t the result of any single change, but of a holistic redesign powered by technologies like Vortex, DataFusion, Delta Lake, and Rust.

In this talk, Thor will walk you through the end-to-end journey of this evolution:

  • the failure patterns and scaling limits that forced a rethink,
  • the architectural principles that guided the redesign,
  • the trade-offs and dead ends along the way,
  • how modern open-source components were evaluated and integrated, and
  • the concrete performance and reliability improvements unlocked by the new design.

You’ll leave with a blueprint for modernizing a legacy data system: how to identify when your architecture is holding you back, and how to apply today’s open-source ecosystem to build a cloud-native database that’s fast, resilient, and ready for the future.

2026-01-31T17:50:00+01:00

Apache DataFusion is emerging as a powerful open-source foundation for building interoperable data systems, thanks to its strongly modular design, Arrow-native execution model, and growing ecosystem of extension libraries. In this talk, we'll explore our contributions to the DataFusion ecosystem—most notably DataFusion Federation for cross-database query execution and DataFusion Table Providers that connect DataFusion to a wide range of backends.

We'll show how we use these components to federate queries to databases such as TiDB and InfluxDB 2, and how this fits into a broader data fabric/API generation work we're doing at Twintag. We'll also discuss our work on Arrow-native interfaces, including an Arrow Flight SQL Server implementation for DataFusion and a prototype Flight SQL endpoint for TiDB, which together enable a fully Arrow-based pipeline spanning query submission, execution, and federated dispatch.
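As a minimal illustration of the client side of such a pipeline (endpoint URI hypothetical; the adbc-driver-flightsql package assumed), any Arrow Flight SQL endpoint can be queried the same way:

    import adbc_driver_flightsql.dbapi as flightsql

    with flightsql.connect("grpc://localhost:50051") as conn:
        with conn.cursor() as cur:
            cur.execute("SELECT 1 AS ping")
            print(cur.fetch_arrow_table())    # results travel as Arrow record batches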

The session highlights practical patterns for building distributed data infrastructure using open libraries rather than monolithic systems, and offers a look at where Arrow and DataFusion are headed as shared interoperability layers for modern databases.

2026-01-31T18:15:00+01:00

Cloud-native databases often use open-source embedded key-value stores on each node or shard. OLTP workloads are read- and write-intensive, typically relying on indexes for data access. Two main on-disk structures are prevalent: B-Trees, such as WiredTiger, and LSM-Trees, like RocksDB. This talk explores the similarities and differences in their internal implementations, as well as the trade-offs among read, write, and storage amplification. It also compares these structures to traditional fixed-size block storage in RDBMS and discusses the differences in caching the working set in memory and ensuring durability through write-ahead logging.
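To make the contrast tangible, here is a toy sketch (not any real engine's code) of the LSM write path: writes land in a memtable and are flushed as immutable sorted runs, so writes are sequential but a read may have to consult several runs, whereas a B-Tree updates pages in place.

    class ToyLSM:
        def __init__(self, memtable_limit=4):
            self.memtable = {}
            self.runs = []                    # newest first; each run is an immutable sorted map
            self.memtable_limit = memtable_limit

        def put(self, key, value):
            self.memtable[key] = value        # cheap in-memory write
            if len(self.memtable) >= self.memtable_limit:
                # flush: one sequential write of a sorted run (low write amplification)
                self.runs.insert(0, dict(sorted(self.memtable.items())))
                self.memtable = {}

        def get(self, key):
            if key in self.memtable:
                return self.memtable[key]
            for run in self.runs:             # read amplification: newest run wins
                if key in run:
                    return run[key]
            return None

    db = ToyLSM()
    for i in range(10):
        db.put(f"k{i}", i)
    print(db.get("k3"), db.get("missing"))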

2026-01-31T18:40:00+01:00

Your AI application returns wrong answers. Not because of your LLM choice or vector database, but because of the data engineering (or lack thereof) nobody wants to talk about.

This technical deep dive shows why embedding models, chunking strategies, and search filtering have more impact on AI accuracy than switching from one model to another. Using real production data, we'll demonstrate how naive vector search returns Star Trek reviews when users ask about Star Wars, how poor chunking strategies lose critical context (who wants their AI to suggest fixing a headache with a head transplant?), and why "just use a vector" without proper data engineering guarantees hallucinations.

We'll cover:

  • Embedding model selection: dimensions, token limits, and silent truncation failures
  • Chunking strategies: when to chunk, how to preserve context, and the double-embedding approach
  • Hybrid search: combining Full Text/BM25 keyword matching with vector similarity
  • Filtering architecture: pre-filter vs post-filter performance trade-offs
  • Production gotchas: triggers, performance, batch processing, and cold start problems

While many of the examples will use PostgreSQL, the talk is database-agnostic: whether you are using PostgreSQL, MariaDB, ClickHouse, or something else, you will learn something! In AI Land, the hard problem is always data engineering, not database selection.
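Since PostgreSQL will feature in many examples, here is a hedged sketch of the hybrid-search idea there (table, columns, and weights are hypothetical; the pgvector extension is assumed, and Postgres full-text ranking stands in for BM25): combine a keyword score with vector similarity instead of relying on either alone.

    import psycopg2

    query_text = "star wars droid repair"
    query_embedding = "[0.12, -0.03, 0.58]"   # produced by your embedding model

    conn = psycopg2.connect("dbname=docs user=postgres")
    with conn, conn.cursor() as cur:
        cur.execute("""
            SELECT id, title,
                   ts_rank(tsv, plainto_tsquery('english', %s)) AS keyword_score,
                   1 - (embedding <=> %s::vector)               AS vector_score
            FROM chunks
            WHERE tsv @@ plainto_tsquery('english', %s)         -- keyword pre-filter
            ORDER BY 0.4 * ts_rank(tsv, plainto_tsquery('english', %s))
                   + 0.6 * (1 - (embedding <=> %s::vector)) DESC
            LIMIT 10
        """, (query_text, query_embedding, query_text, query_text, query_embedding))
        for row in cur.fetchall():
            print(row)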

Users don't care about inference speed—they care about accuracy. This talk shows how to engineer your data pipeline so your AI doesn't lie.