AI workloads are unlike anything the enterprise has seen before, and their impact on storage varies across the phases of an AI project. After intensive training, a model is put to work on inference based on what it has learned, and along the way customers must weigh factors such as the AI frameworks in use, the storage demands of retrieval-augmented generation (RAG), and checkpointing.
Charlie Boyle, Nvidia’s vice-president and general manager of DGX Systems, discussed these challenges and offered practical tips for customers embarking on AI projects at the recent Pure Storage Accelerate event in Las Vegas. Among them: understanding good and bad data is crucial to AI success, which means distinguishing data that adds value from information that has gone out of date.
For those starting out with AI, Boyle recommends beginning with existing models that can be fine-tuned for specific needs, rather than building foundational models from scratch. For customers looking to put AI to work, the key is to take ready-made applications and models and customize them with their own data, with no need for extensive coding or an advanced degree in AI.
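To make that approach concrete, here is a minimal, hypothetical sketch of the fine-tuning pattern, assuming the Hugging Face transformers and datasets libraries, with a small published checkpoint (distilbert-base-uncased) standing in for whatever model a customer chooses; the dataset and hyperparameters are placeholders, not anything Boyle specified.

```python
# Hypothetical sketch: fine-tune an existing pre-trained model on your
# own data rather than training a foundational model from scratch.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification,
                          AutoTokenizer, Trainer, TrainingArguments)

# Start from a published checkpoint instead of random weights.
model_name = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name, num_labels=2)

# Placeholder dataset; in practice this would be the customer's own data.
dataset = load_dataset("imdb")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True,
                     padding="max_length", max_length=256)

tokenized = dataset.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="finetune-out",
    num_train_epochs=1,
    per_device_train_batch_size=8,
)

trainer = Trainer(
    model=model,
    args=args,
    # A small slice keeps the sketch quick to run end to end.
    train_dataset=tokenized["train"].shuffle(seed=42).select(range(1000)),
)
trainer.train()
```

The point of the pattern is that the heavy lifting, the pre-trained weights, already exists; the customer's effort goes into supplying and curating their own data.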
In terms of I/O profile, the demands of AI workloads vary across training, fine-tuning, inference, RAG and checkpointing. Training large models from scratch needs fast storage, and checkpointing during training runs is I/O-intensive and critical to minimizing data loss if a system fails. During a checkpoint, all compute stops while the write completes, so the timing and frequency of checkpoints are crucial to ensuring data integrity without stalling the run.
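To illustrate why those writes are costly, here is a hedged sketch of synchronous checkpointing in a PyTorch training loop; the model, interval and file names are hypothetical stand-ins. Compute pauses while model and optimizer state are serialized to storage, so write bandwidth determines the length of the pause.

```python
# Hypothetical sketch: synchronous checkpointing in a training loop.
# Training (compute) stops while the checkpoint is written, which is
# why checkpoint writes are I/O-intensive and why fast storage and
# well-chosen checkpoint intervals matter.
import time
import torch
import torch.nn as nn

model = nn.Linear(4096, 4096)           # stand-in for a large model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
checkpoint_every = 100                  # steps between checkpoints (tunable)

for step in range(1, 1001):
    x = torch.randn(32, 4096)
    loss = model(x).pow(2).mean()       # dummy objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    if step % checkpoint_every == 0:
        # Compute stops here: the full training state is serialized to
        # storage before the loop resumes, so storage write bandwidth
        # sets how long the run is stalled.
        t0 = time.time()
        torch.save(
            {"step": step,
             "model": model.state_dict(),
             "optimizer": optimizer.state_dict()},
            f"checkpoint_{step:06d}.pt",
        )
        print(f"step {step}: checkpoint written in {time.time() - t0:.2f}s")
```

In practice, teams tune the checkpoint interval against the cost of a failure: checkpoint too often and the run spends its time writing; too rarely and a crash loses hours of training.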