Virtually Attend FOSDEM 2026

One GPU, Many Models: What Works and What Segfaults

2026-01-31T14:10:00+01:00 for 00:20

Serving multiple models on a single GPU sounds great until something segfaults.

Two approaches dominate for parallel inference: MIG (Multi-Instance GPU, hardware partitioning) and MPS (Multi-Process Service, software sharing). Both promise efficient GPU sharing; both come with trade-offs, and their behavior changes with GPU architecture.

I tested both strategies on Hopper and Blackwell, running diffusion, MoE, and TTS workloads in parallel. Some setups survived. Others didn't.

This talk digs into what actually happened: where memory isolation falls apart, which configs crash, and what survives under load.

By the end, you'll know:

  1. How to put unused GPU capacity to work.
  2. How to set up MIG and MPS.
  3. How MIG and MPS behave under real load.
  4. Which memory issues, crashes, and failures to expect.
  5. Which configuration best suits your AI workload.
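To make the setup step concrete, here is a minimal sketch of both approaches using standard NVIDIA tooling. The device index, the `1g.10gb` partition profile, and the `serve_model.py` server script are illustrative assumptions, not the configurations benchmarked in the talk.

```shell
# --- MIG: hardware partitioning (requires a MIG-capable GPU, e.g. Hopper) ---
# Enable MIG mode on GPU 0 (may require a GPU reset to take effect).
sudo nvidia-smi -i 0 -mig 1

# List the partition profiles this GPU supports.
nvidia-smi mig -lgip

# Create two GPU instances with illustrative 1g.10gb profiles,
# plus a compute instance on each (-C).
sudo nvidia-smi mig -cgi 1g.10gb,1g.10gb -C

# Each instance gets its own UUID; pin a model server (hypothetical
# serve_model.py) to one of them:
nvidia-smi -L                          # lists MIG-<uuid> entries
CUDA_VISIBLE_DEVICES=MIG-<uuid> python serve_model.py

# --- MPS: software sharing (no hard memory isolation) ---
# Start the MPS control daemon for GPU 0.
export CUDA_VISIBLE_DEVICES=0
nvidia-cuda-mps-control -d

# Optionally cap each client's share of SMs (Volta and newer).
export CUDA_MPS_ACTIVE_THREAD_PERCENTAGE=50
python serve_model.py &                # clients now share the GPU via MPS

# Shut the daemon down when done.
echo quit | nvidia-cuda-mps-control
```

The asymmetry is the point: MIG gives each instance its own memory and SM slice, while MPS clients share one memory space, so the two fail very differently under load.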
