Serving multiple models on a single GPU sounds great until something segfaults.
Two approaches dominate parallel inference on NVIDIA GPUs: MIG (Multi-Instance GPU, hardware partitioning) and MPS (Multi-Process Service, software sharing). Both promise efficient GPU sharing. Both have trade-offs, and their behavior changes with GPU architecture.
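The two modes aren't even switched on the same way, which hints at how different their failure modes are. Here's a minimal sketch of each switch, assuming NVIDIA's `nvidia-ml-py` (pynvml) bindings and the stock `nvidia-cuda-mps-control` daemon; the device index and pipe directory are illustrative choices, not prescriptions.

```python
import os
import subprocess

import pynvml  # NVIDIA's nvidia-ml-py bindings


def enable_mig(index: int = 0) -> None:
    """Flip the hardware-partitioning (MIG) mode on one device.

    Needs admin privileges, and the change stays pending until the GPU
    resets; the card must then be carved into MIG instances (e.g. with
    `nvidia-smi mig -cgi ...`) before any workload can run on it.
    """
    pynvml.nvmlInit()
    try:
        gpu = pynvml.nvmlDeviceGetHandleByIndex(index)
        pynvml.nvmlDeviceSetMigMode(gpu, pynvml.NVML_DEVICE_MIG_ENABLE)
        current, pending = pynvml.nvmlDeviceGetMigMode(gpu)
        print(f"MIG mode: current={current} pending={pending}")
    finally:
        pynvml.nvmlShutdown()


def start_mps(pipe_dir: str = "/tmp/nvidia-mps") -> None:
    """Start the software-sharing (MPS) control daemon.

    CUDA clients launched with the same pipe directory get funneled
    through one shared context on the GPU instead of time-slicing
    separate contexts. The pipe directory here is illustrative.
    """
    os.environ["CUDA_MPS_PIPE_DIRECTORY"] = pipe_dir
    subprocess.run(["nvidia-cuda-mps-control", "-d"], check=True)
```

MIG gives each partition its own memory and fault isolation at the cost of a fixed carve-up; MPS keeps the full GPU flexible but puts every client in one shared context, which is exactly where the isolation questions below come from.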
I tested both strategies on Hopper and Blackwell, running diffusion, MoE, and TTS workloads in parallel. Some setups survived. Others didn't.
This talk digs into what actually happened: where memory isolation falls apart, which configs crash, and what survives under load.
By the end, you'll know: