Serving multiple models on a single GPU sounds great until something segfaults.
Two approaches dominate for parallel inference: MIG, which partitions the GPU in hardware into isolated instances with dedicated memory and compute, and MPS, which shares the whole GPU in software among cooperating processes. Both promise efficient GPU sharing.
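For context, here is a minimal sketch of how each mode is switched on with standard NVIDIA tooling. The MIG profile names assume an A100 80GB; the device index and partition sizes are illustrative, not a recommendation from the talk.

```shell
# MIG: hardware partitioning -- carve GPU 0 into isolated instances.
nvidia-smi -i 0 -mig 1                       # enable MIG mode (may require a GPU reset)
nvidia-smi mig -i 0 -cgi 3g.40gb,3g.40gb -C  # create two GPU instances plus compute instances

# MPS: software sharing -- one control daemon multiplexes clients onto the full GPU.
nvidia-smi -i 0 -mig 0                       # MIG must be off; the two modes don't combine
nvidia-cuda-mps-control -d                   # start the MPS control daemon
# CUDA clients launched afterwards transparently dispatch through the MPS server.
```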
I put both strategies to the test by running video generation workloads in parallel on a single GPU.
This talk digs into what actually happened: where things worked, where memory isolation fell apart, which configs crashed, and what survives under load.
By the end, you'll know:
- when MIG's hardware isolation is worth giving up part of the GPU
- where MPS's lack of memory isolation falls apart under concurrent load
- which configurations crashed in testing and which survived