Saturday, January 18, 2025

Key-Value Flash: Enhancing Data Storage Efficiency

Envision flash storage that maintains data in the exact format utilized by applications. This is the potential of key-value flash media, as proposed by a major drive manufacturer, several researchers, and startups, along with the NVMe Key-Value command set.

However, practical implementations of key-value storage are relatively scarce. While there has been research and the development of a command set for key-value in NVMe, the actual products remain limited. In 2019, Samsung introduced a prototype key-value drive, which later evolved into a spin-off named Stellus, initially aimed at creating a key-value storage array. Unfortunately, this venture appears to have faded into obscurity. More recently, startup QiStor has proposed plans to commercialize storage software and FPGA chips for key-value storage, asserting a significant market opportunity.

The idea behind key-value storage is that keeping data in key-value form all the way down to the media can deliver better efficiency, speed, energy conservation, and durability than the traditional layered input/output (I/O) path. Today, applications and hosts must translate storage I/O into logical block addresses (LBAs), which the drive uses to locate data during reads and writes. LBA dates back to the early days of spinning-disk hard drives. Developers of key-value stores recognized the I/O inefficiencies that result from this approach.

At present, an application that uses a key-value database first talks to the database, which then translates key-value addressing through the host file system into LBAs to pinpoint the data’s physical location on the drive. This process involves several translation steps that could be removed. LBA also introduces inefficiencies of its own that key-value storage could eliminate.
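The layered lookup described above can be sketched as a toy model. Everything here is an illustrative assumption rather than how any particular database or file system works: the 4 KB block size, the single contiguous extent per file, and the names db_index, fs_extents, and key_to_lba are all hypothetical.

```python
BLOCK_SIZE = 4096  # assumed logical block size in bytes

# Layer 1 (hypothetical): the database maps each key to a file and byte offset.
db_index = {"user:42": ("data.db", 8192)}

# Layer 2 (hypothetical): the file system maps each file to the first LBA
# of its extent. One contiguous extent per file is a simplification.
fs_extents = {"data.db": 1000}


def key_to_lba(key):
    """Resolve a key to the logical block address the drive is asked for."""
    file_name, offset = db_index[key]       # database lookup
    start_lba = fs_extents[file_name]       # file system lookup
    return start_lba + offset // BLOCK_SIZE  # byte offset -> block number


lba = key_to_lba("user:42")
print(lba)
```

Even in this simplified form, one key lookup passes through two host-side tables before the drive sees a request, and the drive then performs its own mapping from LBA to a physical flash location.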

Flash storage frequently suffers from reduced durability because of continual erasures and rewrites. The problem is made worse by the mismatch between LBA-sized blocks and the larger blocks that flash erases: data cannot simply be overwritten in place, so when it changes, the still-valid data around it must be relocated and rewritten and the old block erased. As a drive fills up, a single host write can trigger multiple writes inside the device, a side effect of garbage collection, in which data is shuffled around to free up space. This cycle induces wear and shortens the drive’s lifespan.
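A rough worked example shows how one host write can turn into several device writes. It assumes a simplified model in which garbage collection reclaims one erase block at a time and must copy every still-valid page out of that block before erasing it. The function name and the page counts are hypothetical, and real drives behave in more complicated ways.

```python
def write_amplification(pages_per_block, valid_pages):
    """Flash writes per host write when reclaiming one erase block.

    Simplified model: the block's valid pages are copied elsewhere, the
    block is erased, and only the remaining pages are free for new host data.
    """
    free_pages = pages_per_block - valid_pages
    if free_pages <= 0:
        raise ValueError("a fully valid block frees no space")
    # Total flash writes = copied valid pages + new host pages.
    return (valid_pages + free_pages) / free_pages


# Assumed figures: 64 pages per erase block, 48 of them still valid.
print(write_amplification(64, 48))  # 4.0
```

Under these assumptions, reclaiming the block costs 48 page copies to make room for 16 pages of new data, so the flash absorbs four page writes for every page the host writes.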

In contrast, key-value storage lets the application interact directly with the media, with no translation through the operating system and file system to the media’s LBAs. The host does not need to know which physical block holds the data it wants. Instead, the device manages data placement itself and knows where its values are stored. The host, operating system, and file system are not involved in this process. When asked for a key, the device consults its internal mapping tables to locate the corresponding value.
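A minimal sketch of that idea follows, assuming a device that appends values to its media and keeps a private key-to-location table. The class ToyKeyValueDevice and its store, retrieve, and delete methods are hypothetical names chosen for illustration; they are not the NVMe Key-Value command set or any vendor's interface.

```python
class ToyKeyValueDevice:
    """Toy model of a drive that accepts keys instead of block addresses."""

    def __init__(self):
        self._media = bytearray()  # stands in for flash, written append-only
        self._index = {}           # key -> (offset, length), private to the device

    def store(self, key: bytes, value: bytes) -> None:
        # The device, not the host, decides where the value lands.
        offset = len(self._media)
        self._media += value
        self._index[key] = (offset, len(value))

    def retrieve(self, key: bytes) -> bytes:
        # Look the key up in the internal mapping table, then read the value.
        offset, length = self._index[key]
        return bytes(self._media[offset:offset + length])

    def delete(self, key: bytes) -> None:
        # Only the mapping is dropped; reclaiming the space is left out here.
        del self._index[key]


device = ToyKeyValueDevice()
device.store(b"user:42", b"alice")
device.store(b"user:42", b"alice-v2")  # an update is written to a new location
print(device.retrieve(b"user:42"))
```

The host only ever supplies a key and a value. Offsets, placement, and the handling of stale versions stay inside the device, which is the part of the work the file system and LBA layers would otherwise carry.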

Key-value storage is increasingly prevalent and is often associated with formats such as JSON, the etcd datastore in Kubernetes, and data types in programming languages like JavaScript and Python, forming the foundation for NoSQL databases. In this model, the key represents a variable’s name, while the value can be its corresponding value or values. Keys and values can vary in length and data types—ranging from numbers and characters to images and audio files—and can even be nested, allowing a key to have a value that itself is another key with its own set of values.
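As a small illustration of nested keys and mixed value types, here is a Python dictionary round-tripped through JSON using the standard json module. The record contents are made up for the example.

```python
import json

# A key can hold a number, a string, a list, or another set of key-value pairs.
record = {
    "name": "sensor-7",
    "reading": 21.5,
    "tags": ["lab", "north"],
    "location": {"building": "B", "room": {"floor": 2, "number": 214}},
}

encoded = json.dumps(record)   # serialize to a JSON string
decoded = json.loads(encoded)  # parse it back into nested dictionaries

# Follow the keys down through the nested values.
room_number = decoded["location"]["room"]["number"]
print(room_number)  # 214
```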