Nvidia has released its KAI Scheduler as open source under the Apache 2.0 license. This Kubernetes-native tool manages AI workloads on GPUs and CPUs, helping handle fluctuating GPU demand, cutting wait times for compute access, and providing resource guarantees so teams reliably receive their allocated share of GPUs.
This scheduler supports the entire AI lifecycle, managing everything from small interactive jobs to large-scale training and inference, all within the same cluster. It ensures optimal resource allocation while maintaining fairness among different applications needing GPU access. Administrators can dynamically allocate GPU resources to workloads, and KAI Scheduler can operate alongside other schedulers in a Kubernetes environment.
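Running KAI Scheduler alongside Kubernetes' default scheduler relies on the standard `spec.schedulerName` field: only pods that opt in are handled by KAI, while everything else stays with the default scheduler. A minimal sketch is below; the scheduler name and the queue label key and value are assumptions drawn from the project's examples and may differ across versions.

```yaml
# Hypothetical pod that opts in to KAI Scheduler; other pods in the
# cluster continue to use the default kube-scheduler untouched.
apiVersion: v1
kind: Pod
metadata:
  name: training-pod
  labels:
    kai.scheduler/queue: team-a   # assumed label key; check your version's docs
spec:
  schedulerName: kai-scheduler    # assumed scheduler name from the project docs
  containers:
  - name: trainer
    image: nvcr.io/nvidia/pytorch:24.03-py3   # illustrative image
    resources:
      limits:
        nvidia.com/gpu: 1
```

Because scheduler selection is per-pod, a cluster can be migrated to KAI Scheduler workload by workload rather than all at once.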
Ronen Dar, Nvidia’s VP of software systems, and Ekin Karabulut, a data scientist at Nvidia, shared that KAI Scheduler continuously recalibrates fair-share values and adjusts quotas in real time. This means it matches workload demands on the fly, which reduces the need for constant manual oversight.
For machine learning engineers, the scheduler decreases wait times through a combination of gang scheduling, GPU sharing, and a hierarchical queuing system. Users can submit batches of jobs that launch automatically when resources become available, balancing priorities and fairness effectively.
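The hierarchical queuing system is configured through a Queue custom resource: parent queues (e.g. a department) hold child queues (e.g. teams), each with a guaranteed quota, a weight for dividing idle over-quota capacity, and an optional hard limit. The sketch below is based on the CRD shape in the project's repository; the API group, field names, and queue names are assumptions and should be checked against the released CRDs.

```yaml
# Hypothetical two-level queue hierarchy: a department queue with a
# guaranteed GPU quota, and a team queue nested beneath it.
apiVersion: scheduling.run.ai/v2   # assumed API group from the project repo
kind: Queue
metadata:
  name: department-a
spec:
  resources:
    gpu:
      quota: 8            # GPUs guaranteed to this department
      overQuotaWeight: 1  # relative share of idle capacity beyond quota
      limit: 16           # hard cap on total GPUs
---
apiVersion: scheduling.run.ai/v2
kind: Queue
metadata:
  name: team-a
spec:
  parentQueue: department-a   # nests this queue under the department
  resources:
    gpu:
      quota: 4
      overQuotaWeight: 2
      limit: -1           # -1: no cap beyond the parent's limit
```

Jobs submitted to `team-a` first draw on that team's quota, then compete for the department's spare capacity in proportion to `overQuotaWeight`, which is how batches queued in advance launch automatically as resources free up.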
KAI Scheduler tackles fluctuations in GPU and CPU demands using techniques like bin packing and consolidation, which enhance compute utilization by reducing resource fragmentation. It packs smaller tasks into underused GPUs and CPUs, reallocates tasks across nodes, and spreads workloads to minimize strain on any single node while maximizing overall resource availability.
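Packing small tasks onto underused GPUs is possible because KAI Scheduler supports fractional GPU sharing, so several pods can land on one device. A sketch of a fractional request follows; the `gpu-fraction` annotation key is an assumption taken from the project's quick-start examples, and the queue label and image are illustrative.

```yaml
# Hypothetical pod requesting half a GPU, allowing the scheduler to
# bin-pack two such pods onto a single device.
apiVersion: v1
kind: Pod
metadata:
  name: inference-pod
  annotations:
    gpu-fraction: "0.5"           # assumed annotation: half of one GPU's memory
  labels:
    kai.scheduler/queue: team-a   # assumed label key
spec:
  schedulerName: kai-scheduler
  containers:
  - name: server
    image: nvcr.io/nvidia/tritonserver:24.03-py3   # illustrative image
    # note: no nvidia.com/gpu limit here; the fraction annotation
    # replaces the whole-GPU resource request in this sketch
```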
In shared clusters, KAI Scheduler addresses a common problem: researchers claiming more GPUs than they need early in the day to secure access, leaving hardware idle. By enforcing resource guarantees through quotas, it prevents this kind of hoarding and improves overall cluster efficiency.
Additionally, KAI Scheduler features a built-in podgrouper that automatically connects with tools like Kubeflow, Ray, Argo, and the Training Operator, simplifying configuration and speeding up development.