Friday, October 18, 2024

Google Unveils Parallelstore: A New File Storage Solution for Cloud AI Training

Google Cloud Platform (GCP) has launched Parallelstore, a managed parallel file storage service designed for I/O-intensive workloads, particularly artificial intelligence (AI) applications. It is built on the Distributed Asynchronous Object Storage (DAOS) architecture, an open source project developed by Intel. Intel originally designed DAOS to work with its Optane persistent memory, a product that has since been discontinued.

DAOS functions as a parallel file system spread across multiple storage nodes, with a metadata store held in persistent memory. It replicates files across many nodes so that clients can read them in parallel with minimal latency. This matters to AI developers, whose training jobs depend on fast access to data.

Even though Optane is gone, DAOS still leverages some of Intel’s technology, such as its communications protocol, Intel Omni-Path. This protocol operates similarly to InfiniBand and uses Intel cards to connect compute nodes. On a read or write request, these nodes first contact the metadata servers to locate the file, then transfer the data via Remote Direct Memory Access (RDMA) over Converged Ethernet (RoCE).
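The two-step pattern described above (look up the file's location in metadata, then pull the data from several nodes at once) can be sketched with a toy model. This is not the DAOS or Parallelstore API: the `ToyCluster` class, its method names, and the round-robin placement are all hypothetical, threads stand in for RDMA transfers, and the sketch stripes chunks across nodes without replicating them.

```python
from concurrent.futures import ThreadPoolExecutor


class ToyCluster:
    """Hypothetical in-memory stand-in for a parallel file system."""

    def __init__(self, num_nodes, chunk_size):
        # Each "storage node" is just a dict of chunk_id -> bytes.
        self.nodes = [dict() for _ in range(num_nodes)]
        self.chunk_size = chunk_size
        # Metadata store: file name -> ordered list of (node_index, chunk_id).
        self.metadata = {}

    def write(self, name, data):
        placement = []
        for offset in range(0, len(data), self.chunk_size):
            chunk_no = offset // self.chunk_size
            node_index = chunk_no % len(self.nodes)  # round-robin placement
            chunk_id = f"{name}#{chunk_no}"
            self.nodes[node_index][chunk_id] = data[offset:offset + self.chunk_size]
            placement.append((node_index, chunk_id))
        self.metadata[name] = placement

    def read(self, name):
        # Step 1: ask the metadata store where the chunks live.
        placement = self.metadata[name]

        def fetch(location):
            node_index, chunk_id = location
            return self.nodes[node_index][chunk_id]

        # Step 2: fetch chunks from the nodes concurrently.
        # pool.map returns results in the same order as the placement list.
        with ThreadPoolExecutor(max_workers=len(self.nodes)) as pool:
            chunks = list(pool.map(fetch, placement))
        return b"".join(chunks)


cluster = ToyCluster(num_nodes=4, chunk_size=8)
cluster.write("sample.bin", bytes(range(100)))
restored = cluster.read("sample.bin")
```

In a real deployment the metadata lookup and the data transfer travel over the network to separate services; the sketch only shows why keeping the lookup small and the data path parallel lets many clients read at once.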

Barak Epstein, GCP’s product director, highlighted that this setup maximizes data flow to GPUs and TPUs. He noted that Parallelstore supports the read/write needs of thousands of virtual machines, GPUs, and TPUs, catering to a wide range of AI and high-performance computing tasks. At maximum deployment size, users can expect throughput of up to 115 GBps, three million read IOPS, and one million write IOPS, with latency of about 0.3 milliseconds.
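To relate the quoted IOPS and latency figures, Little's law (average concurrency equals arrival rate times average time in the system) gives a rough sense of how many requests must be in flight to reach those rates. The sketch below assumes the 0.3 millisecond latency also holds at peak IOPS, which the announcement does not state; the function name is illustrative.

```python
def in_flight_requests(iops, latency_seconds):
    """Little's law: average concurrency = request rate x average latency."""
    return iops * latency_seconds


READ_IOPS = 3_000_000   # quoted maximum read IOPS
WRITE_IOPS = 1_000_000  # quoted maximum write IOPS
LATENCY_S = 0.3e-3      # quoted latency of about 0.3 ms (assumed to hold at peak)

print(round(in_flight_requests(READ_IOPS, LATENCY_S)))   # 900
print(round(in_flight_requests(WRITE_IOPS, LATENCY_S)))  # 300
```

Under that assumption, sustaining the peak read rate takes on the order of 900 concurrent outstanding requests across all clients, which is consistent with the service being aimed at thousands of VMs and accelerators rather than a single host.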

Epstein pointed out that Parallelstore is also effective for small files, providing random, distributed access to many clients. AI model training can run nearly four times faster than with traditional machine learning data loaders.

To prepare for this, GCP encourages customers to first upload their data to Google Cloud Storage, where it is accessible to a range of GCP use cases. Using the Storage Insights Dataset service from the Gemini AI suite, customers can pick out data suited for AI processing. Once selected, this training data transfers to Parallelstore at 20 GBps. For smaller files (less than 32 MB), the transfer can handle 5,000 files per second.
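The quoted import rates translate into load times with simple division. The sketch below assumes "GBps" means decimal gigabytes per second and that the quoted rates are sustained for the whole transfer; the dataset sizes (100 TB and ten million small files) are hypothetical examples, not figures from Google.

```python
def bulk_transfer_seconds(total_bytes, rate_bytes_per_s=20e9):
    """Time to move a dataset at the quoted 20 GBps bulk rate."""
    return total_bytes / rate_bytes_per_s


def small_file_transfer_seconds(file_count, files_per_s=5000):
    """Time to move files under 32 MB at the quoted 5,000 files per second."""
    return file_count / files_per_s


TB = 10**12  # decimal terabyte, an assumption about the units

# Hypothetical 100 TB training set.
print(bulk_transfer_seconds(100 * TB))          # 5000.0 seconds, about 83 minutes

# Hypothetical corpus of ten million small files.
print(small_file_transfer_seconds(10_000_000))  # 2000.0 seconds, about 33 minutes
```

For collections of small files, the per-file rate rather than the byte rate is likely to set the load time, so both estimates are worth checking when planning an import.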

Beyond AI, Parallelstore is available for Kubernetes clusters through GCP’s Google Kubernetes Engine (GKE) with dedicated CSI drivers. This integration allows administrators to manage Parallelstore volumes just like any other GKE storage.

DAOS separates the data and control planes and keeps I/O metadata and indexing workloads distinct from bulk storage. It uses fast persistent memory for metadata and NVMe solid-state drives for bulk data. Intel claims that DAOS’s read/write I/O performance scales almost linearly as client I/O requests increase, making it especially suitable for cloud and shared environments.