Thursday, February 20, 2025

DeepSeek-R1: Overcoming Budgeting Challenges for On-Premise Deployments

IT leaders face a tough decision when weighing the cybersecurity risks of large language models (LLMs) like ChatGPT. They can either use these models directly from the cloud or opt for open-source LLMs that can be hosted on their own servers or in a private cloud.

For LLMs to operate effectively, they must run in-memory. This usually means investing heavily in graphics processing units (GPUs) to meet the significant memory demands. Take Nvidia’s H100 GPU, which carries 80GB of high-bandwidth memory and draws 350 watts of power.

China’s DeepSeek has shown that its R1 LLM can compete with leading U.S. models, even without the latest GPU technology, although it still relies on GPU acceleration. But setting up a private instance of DeepSeek requires hefty hardware investment. Running the full 671-billion-parameter R1 model in memory calls for roughly 768GB, which means buying 10 H100 GPUs (10 × 80GB = 800GB) and puts the AI acceleration hardware alone at about $250,000. Even with less powerful GPUs, you’re still looking at over $100,000 for a server capable of running the model in full.
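As a sanity check on those figures, the sizing arithmetic is simple enough to sketch in a few lines of Python. The bytes-per-parameter, overhead factor and per-card price below are assumptions for illustration, not vendor figures.

```python
import math

# Rough sizing for running DeepSeek-R1 (671B parameters) fully in memory.
# Assumption: ~1 byte per parameter (8-bit weights) plus ~15% overhead for
# activations and KV cache, landing near the ~768GB figure cited above.
PARAMS = 671e9
BYTES_PER_PARAM = 1.0      # 8-bit quantised weights (assumption)
OVERHEAD = 1.15            # activations / KV cache headroom (assumption)

required_gb = PARAMS * BYTES_PER_PARAM * OVERHEAD / 1e9
print(f"Approximate memory needed: {required_gb:.0f} GB")      # ~770 GB

H100_MEMORY_GB = 80
H100_PRICE_USD = 25_000    # rough street price per card (assumption)

gpus_needed = math.ceil(768 / H100_MEMORY_GB)
print(f"H100s needed: {gpus_needed}")                           # 10
print(f"Accelerator cost: ${gpus_needed * H100_PRICE_USD:,}")   # ~$250,000
```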

Public cloud options exist, too. Azure provides access to Nvidia H100s with 900GB of memory for $27.167 per hour; used only during working hours, that adds up to nearly $46,000 a year, and committing to a three-year plan could lower it to about $23,000 annually. On Google Cloud, the Nvidia T4 GPU offers a more budget-friendly route at $0.35 per hour per GPU. But each T4 carries only 16GB of memory, so fitting DeepSeek-R1 entirely in memory calls for 48 of them, raising costs to around $16.80 an hour. A long-term commitment brings that figure down significantly.
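To put the two cloud routes on a like-for-like annual basis, a small calculation such as the one below can help. The hourly rates are the figures quoted above; the number of billed hours per year is an assumption you would tune to your own workload.

```python
# Rough annual-cost comparison for the cloud options discussed above.
# Hourly rates are the quoted on-demand prices; billed hours are assumed.

def annual_cost(hourly_rate_usd: float, hours_per_year: int) -> float:
    """Annual spend for an instance billed by the hour."""
    return hourly_rate_usd * hours_per_year

BILLED_HOURS = 1_700  # assumption: roughly a full-time working year

azure_h100 = annual_cost(27.167, BILLED_HOURS)        # ~$46,000
gcp_t4_x48 = annual_cost(48 * 0.35, BILLED_HOURS)     # 48 T4s at $0.35/h -> ~$28,600

print(f"Azure H100 instance, on demand: ${azure_h100:,.0f}/year")
print(f"GCP 48x T4, on demand:          ${gcp_t4_x48:,.0f}/year")
print(f"Azure, 3-year commitment:       ${azure_h100 * 0.5:,.0f}/year  (assumes ~50% discount)")
```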

IT managers can cut costs further by using general-purpose central processing units (CPUs) in lieu of expensive GPUs, especially when DeepSeek-R1 is used solely for AI inference. Matthew Carrigan, a machine learning engineer at Hugging Face, suggested a setup featuring two AMD Epyc processors and 768GB of memory for only about $6,000. He reported that this configuration achieves six to eight tokens per second, depending on factors such as query length, and demonstrated near-real-time querying on it, noting the importance of memory in speeding up responses.
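Setups like Carrigan’s typically serve the model through a CPU-friendly runtime such as llama.cpp. A minimal sketch using the llama-cpp-python bindings is shown below; the model path, context size and thread count are placeholders, and the model would first need to be downloaded as a quantised GGUF file.

```python
# Minimal CPU-only inference sketch using the llama-cpp-python bindings.
# The GGUF path is a hypothetical local file; n_threads should match the
# physical cores available on the dual-Epyc box.
from llama_cpp import Llama

llm = Llama(
    model_path="/models/deepseek-r1-q8_0.gguf",  # placeholder path
    n_ctx=4096,     # context window
    n_threads=64,   # match physical CPU cores
)

output = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Summarise our Q3 incident reports."}],
    max_tokens=256,
)
print(output["choices"][0]["message"]["content"])
```

On a box like this, throughput is bounded largely by how quickly weights can be streamed from memory, which is why the 768GB of system RAM figures so prominently in the build.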

For those considering more traditional server options, like prebuilt models from Dell or HPE, costs can climb significantly based on the selected configurations.

Another approach to taming memory costs involves a multi-tier memory system managed by custom chips. California startup SambaNova has created a Reconfigurable Dataflow Unit (RDU) that significantly streamlines the hardware needed for models like DeepSeek-R1. SambaNova CEO Rodrigo Liang said the design could shrink the footprint from 40 racks down to a single rack containing 16 RDUs.

Recently, SambaNova partnered with Saudi Telecom to establish Saudi Arabia’s first LLM-as-a-service cloud platform, marking a move toward sovereign AI capabilities. The collaboration shows how governments can weigh different options when building their AI infrastructure. DeepSeek illustrates that alternatives to bulky, expensive GPU setups can deliver equally effective results, letting organizations run powerful models without breaking the bank.