Observability for AI Workloads on HPC: Beyond GPU Utilization Metrics

2026-02-01T13:10:00+01:00 for 00:10

When you run LLMs or large-scale ML training on HPC clusters, traditional monitoring falls short. GPU utilization at 95% tells you nothing about model quality. Memory bandwidth looks healthy while your inference latency silently degrades. Your job scheduler reports success while concept drift erodes prediction accuracy. This talk introduces a practical observability framework specifically designed for AI workloads on HPC infrastructure, what I call "Cognitive SLIs" (Service Level Indicators for AI systems). I'll cover three critical gaps in current HPC monitoring: 1. Model-aware metrics that matter 2. GPU observability beyond utilization 3. Energy and cost accountability

The demo shows a complete stack built with open source tools: Victoria metrics with custom AI-specific exporters, Grafana dashboards designed for ML engineers (not just sysadmins), and OpenTelemetry instrumentation patterns for PyTorch/JAX workloads.

Attendees will leave with the following resources :

1)Architecture patterns for instrumenting HPC AI workloads 2) Victoria Metrics recording rules and alerting strategies for ML metrics 3)Grafana dashboard templates (GitHub repo provided) 4) Understanding of how AI Act logging requirements intersect with HPC operations

View on FOSDEM site