Serving multiple models on a single GPU sounds great until something segfaults.
Two approaches dominate parallel inference on NVIDIA GPUs: MIG (Multi-Instance GPU, hardware partitioning) and MPS (Multi-Process Service, software sharing). Both promise efficient GPU sharing. Both have trade-offs, and their behavior changes with GPU architecture.
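The two modes aren't even switched on the same way, which hints at how different their failure modes are. Here's a minimal sketch of each switch, assuming NVIDIA's `nvidia-ml-py` (pynvml) bindings and the stock `nvidia-cuda-mps-control` daemon; the device index and pipe directory are illustrative choices, not prescriptions.

```python
import os
import subprocess

import pynvml  # NVIDIA's nvidia-ml-py bindings


def enable_mig(index: int = 0) -> None:
    """Flip the hardware-partitioning (MIG) mode on one device.

    Needs admin privileges, and the change stays pending until the GPU
    resets; the card must then be carved into MIG instances (e.g. with
    `nvidia-smi mig -cgi ...`) before any workload can run on it.
    """
    pynvml.nvmlInit()
    try:
        gpu = pynvml.nvmlDeviceGetHandleByIndex(index)
        pynvml.nvmlDeviceSetMigMode(gpu, pynvml.NVML_DEVICE_MIG_ENABLE)
        current, pending = pynvml.nvmlDeviceGetMigMode(gpu)
        print(f"MIG mode: current={current} pending={pending}")
    finally:
        pynvml.nvmlShutdown()


def start_mps(pipe_dir: str = "/tmp/nvidia-mps") -> None:
    """Start the software-sharing (MPS) control daemon.

    CUDA clients launched with the same pipe directory get funneled
    through one shared context on the GPU instead of time-slicing
    separate contexts. The pipe directory here is illustrative.
    """
    os.environ["CUDA_MPS_PIPE_DIRECTORY"] = pipe_dir
    subprocess.run(["nvidia-cuda-mps-control", "-d"], check=True)
```

MIG gives each partition its own memory and fault isolation at the cost of a fixed carve-up; MPS keeps the full GPU flexible but puts every client in one shared context, which is exactly where the isolation questions below come from.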
I tested both strategies on Hopper and Blackwell, running diffusion, MoE, and TTS workloads in parallel. Some setups survived. Others didn't.
This talk digs into what actually happened: where memory isolation falls apart, which configs crash, and what survives under load.
By the end, you'll know: