News
Dec 30, 2025
Artificial Intelligence
Asia
NewDecoded
7 min read
Image by Alibaba Cloud
Alibaba Cloud Tair, SGLang, and the Mooncake team have launched HiCache, a hierarchical Key-Value (KV) cache infrastructure designed to eliminate the memory bottlenecks of large language model inference. This system tackles the "memory wall" faced by autonomous AI agents that require massive amounts of context data for multi-step reasoning and long-term memory. By utilizing a three-tier architecture, HiCache expands capacity from gigabytes to petabytes while maintaining the high speeds necessary for real-time interaction.
Traditional inference systems often struggle with state bloat as context lengths grow, causing GPUs to run out of high-bandwidth memory. HiCache solves this by treating storage as an extension of compute, moving frequently accessed data between GPU memory, host CPU memory, and remote distributed storage. This approach keeps historical data available without expensive and redundant re-computation.
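To make the tiering concrete, here is a minimal Python sketch of the idea; the class and field names are hypothetical and do not reflect HiCache's actual API. A lookup walks from GPU memory to host memory to remote storage, promotes hits back toward the GPU, and demotes cold entries instead of discarding them.

```python
from dataclasses import dataclass, field
from enum import Enum


class Tier(Enum):
    GPU_HBM = 0      # fastest, smallest (gigabytes)
    HOST_DRAM = 1    # larger, slower
    REMOTE_FS = 2    # distributed storage, effectively unbounded


@dataclass
class TieredKVCache:
    """Toy three-tier cache: look up a prefix hash, promote hits toward the
    GPU, and demote cold entries downward instead of recomputing them."""
    gpu: dict = field(default_factory=dict)
    host: dict = field(default_factory=dict)
    remote: dict = field(default_factory=dict)
    gpu_capacity: int = 4

    def get(self, prefix_hash: str):
        # Check tiers from fastest to slowest.
        for tier, store in ((Tier.GPU_HBM, self.gpu),
                            (Tier.HOST_DRAM, self.host),
                            (Tier.REMOTE_FS, self.remote)):
            if prefix_hash in store:
                kv = store[prefix_hash]
                if tier is not Tier.GPU_HBM:
                    self._promote(prefix_hash, kv)   # reload instead of recompute
                return kv
        return None                                   # true miss -> prefill on GPU

    def put(self, prefix_hash: str, kv_blocks):
        self._promote(prefix_hash, kv_blocks)

    def _promote(self, prefix_hash: str, kv_blocks):
        # Make room in GPU HBM by demoting the oldest entry to host memory.
        if len(self.gpu) >= self.gpu_capacity:
            victim, blocks = next(iter(self.gpu.items()))
            del self.gpu[victim]
            self.host[victim] = blocks                # offload, never discard
        self.gpu[prefix_hash] = kv_blocks


cache = TieredKVCache()
cache.put("system-prompt-v1", ["kv-block-0", "kv-block-1"])
print(cache.get("system-prompt-v1"))
```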
The technical backbone of this project includes DeepSeek's 3FS (Fire-Flyer File System), which provides the extreme throughput required for remote data loading. By combining RDMA networking with NVMe SSDs, the system achieves read bandwidth high enough that loading a cached prefix is faster than recomputing it from scratch. This paradigm shift, known as "Store-as-Compute," effectively removes the hard physical limits on how much information a model can retain.
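The store-as-compute claim can be sanity-checked with a back-of-envelope calculation. The figures below are illustrative assumptions, not measurements from the project; they simply show that once aggregate read bandwidth is high enough, reloading a long prefix takes a fraction of the time that recomputing it would.

```python
# Back-of-envelope check of "Store-as-Compute": loading a cached prefix is
# worthwhile whenever (bytes / read_bandwidth) beats the prefill time.
# All numbers below are illustrative assumptions, not measured figures.

context_tokens = 128_000
kv_bytes_per_token = 160 * 1024          # assumed KV footprint per token, in bytes
prefill_tokens_per_s = 20_000            # assumed GPU prefill throughput
read_bandwidth = 40 * 1024**3            # assumed aggregate read bandwidth, bytes/s (~40 GiB/s)

recompute_s = context_tokens / prefill_tokens_per_s
load_s = context_tokens * kv_bytes_per_token / read_bandwidth

print(f"recompute: {recompute_s:.2f}s, load: {load_s:.2f}s")
# Loading wins whenever load_s < recompute_s, which is what turns remote
# KV-cache tiers into an extension of compute rather than a fallback.
```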
Real-world results from Novita AI demonstrate the immediate impact of the new infrastructure on production workloads. After integrating HiCache into their serving environment, the company reported that cache hit rates jumped from 40 percent to 80 percent. This optimization led to a 56 percent reduction in the time to first token and allowed the system to handle twice the amount of query traffic on the same hardware.
Beyond simple performance gains, HiCache introduces a global sharing mechanism that allows multiple AI agents to reuse common prefixes such as system prompts and historical context. The Alibaba Cloud Tair team is also developing a standalone KVCache Manager to provide unified metadata management across different inference engines such as vLLM and TensorRT-LLM. Together, these efforts create a scalable foundation for the next generation of digital assistants capable of deep document analysis.
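A rough sketch of what such shared metadata could look like is shown below; the class and method names are invented for illustration and do not reflect the actual KVCache Manager interface. The essential idea is a registry that maps a hash of a token prefix to the storage location of its computed KV blocks, so any agent or engine can reuse the entry instead of re-prefilling it.

```python
import hashlib
from collections import defaultdict


class SharedPrefixRegistry:
    """Toy metadata registry: once any agent has computed a prefix, later
    requests can look up where its KV blocks live and reuse them."""

    def __init__(self):
        self._locations = {}                 # prefix hash -> storage URI
        self._hits = defaultdict(int)

    @staticmethod
    def prefix_key(token_ids) -> str:
        return hashlib.sha256(repr(list(token_ids)).encode()).hexdigest()

    def publish(self, token_ids, storage_uri: str) -> str:
        key = self.prefix_key(token_ids)
        self._locations[key] = storage_uri
        return key

    def lookup(self, token_ids):
        key = self.prefix_key(token_ids)
        uri = self._locations.get(key)
        if uri is not None:
            self._hits[key] += 1             # shared hit: no redundant prefill
        return uri


registry = SharedPrefixRegistry()
system_prompt = [101, 7, 9, 42]              # stand-in for a tokenized system prompt
registry.publish(system_prompt, "remote://kvcache/system-prompt-v1")
print(registry.lookup(system_prompt))        # a second agent reuses the same entry
```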
Looking ahead, the roadmap includes support for hybrid model architectures and sparse attention frameworks to further optimize resource utilization. These advancements aim to keep the computational cost of AI manageable even as coding tasks push sequence lengths to millions of tokens. The project highlights a shift where model state is treated as a storable and schedulable database asset rather than fleeting temporary data.
Alibaba Cloud Tair has collaborated with the SGLang community and the Mooncake team to launch HiCache, a next-generation hierarchical infrastructure for agentic AI. This system addresses the critical memory bottlenecks of large language models by integrating GPU memory, host memory, and remote distributed storage into a unified pool. By breaking the memory wall, HiCache enables models to maintain the massive contexts required for multi-turn reasoning and complex task planning.

The architecture is built on the philosophy of store-as-compute, where loading cached data from high-speed storage is faster than recalculating it on a GPU. HiCache utilizes a three-tier hierarchy that seamlessly offloads data to DeepSeek 3FS, a distributed file system capable of massive aggregated bandwidth. This ensures that even ultra-long contexts remain accessible without exhausting expensive on-card hardware resources.
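As a rough illustration of the offload path, the sketch below persists KV blocks to a mounted filesystem path. The mount point, layout, and serialization are assumptions made for the example; the real system targets DeepSeek 3FS over RDMA rather than a local directory.

```python
import pickle
import tempfile
from pathlib import Path

# Stand-in for a distributed-filesystem mount; a temporary directory keeps
# the demo self-contained (an assumption for the example, not the deployment).
REMOTE_MOUNT = Path(tempfile.mkdtemp(prefix="kvcache-"))


def offload(prefix_hash: str, kv_blocks) -> Path:
    """Persist KV blocks to the remote tier instead of discarding them."""
    path = REMOTE_MOUNT / prefix_hash[:2] / f"{prefix_hash}.kv"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(pickle.dumps(kv_blocks))
    return path


def reload(path: Path):
    """Read the blocks back later; ultra-long contexts stay addressable."""
    return pickle.loads(path.read_bytes())


saved = offload("a1b2c3", ["kv-block-0", "kv-block-1"])
print(reload(saved))
```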
Technical breakthroughs such as the HiRadixTree index and zero-copy transmission are central to the performance of this new paradigm. These tools allow the system to index cached tokens across different tiers and transfer data without extra copies through kernel buffers. Asynchronous prefetching further hides latency by loading necessary data into the GPU while a request is still waiting in the processing queue.
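The snippet below sketches the asynchronous prefetching idea in plain Python asyncio, with made-up helper names rather than HiCache's real interfaces: transfers toward GPU memory start as soon as requests are admitted, so the copy overlaps with queueing delay instead of adding to it.

```python
import asyncio

# Minimal sketch of asynchronous prefetching, under assumed helper names
# (the real HiCache/SGLang interfaces differ). KV blocks are pulled toward
# GPU memory while requests still sit in the scheduling queue, so prefill
# finds them already resident instead of waiting on a cold load.


async def fetch_to_gpu(prefix_hash: str) -> str:
    await asyncio.sleep(0.05)                  # stands in for host/remote -> HBM copy
    return f"kv[{prefix_hash}]"


async def decode(request_id: str, kv: str) -> None:
    await asyncio.sleep(0.02)                  # stands in for decode compute
    print(f"{request_id} served using {kv}")


async def main() -> None:
    pending = [("req-0", "prefix-a"), ("req-1", "prefix-b"), ("req-2", "prefix-c")]
    # Start the loads the moment requests are admitted to the queue ...
    prefetches = {rid: asyncio.create_task(fetch_to_gpu(ph)) for rid, ph in pending}
    # ... so by the time each request is scheduled, its KV is usually resident.
    for request_id, _ in pending:
        kv = await prefetches[request_id]      # overlapped with earlier requests' decode
        await decode(request_id, kv)


asyncio.run(main())
```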
Production results from Novita AI demonstrate the dramatic impact of integrating HiCache into the SGLang framework. The platform reported that cache hit rates surged from 40 percent to 80 percent, while the average time to first token fell by 56 percent. These improvements directly translated into a doubling of the total queries per second that the infrastructure could handle.

Beyond performance, HiCache introduces a scalable way to persist session state across multiple agents. Traditional inference often results in redundant computation when different tasks share parts of the same context or instruction set. The hierarchical approach allows for global cache sharing, ensuring that once a state is computed, it can be reused across the entire network to reduce costs.
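The sketch below illustrates that reuse with a toy prefix trie; the structure and names are illustrative and not the project's HiRadixTree implementation. Two agents that share a system prompt hit the same cached prefix, and only the divergent suffix needs fresh prefill.

```python
# Toy prefix index: requests that share a leading token sequence reuse the
# cached portion, so only the divergent suffix is computed from scratch.


class PrefixIndex:
    def __init__(self):
        self.root = {}                      # token id -> child node

    def insert(self, token_ids):
        node = self.root
        for tok in token_ids:
            node = node.setdefault(tok, {})

    def longest_cached_prefix(self, token_ids) -> int:
        node, matched = self.root, 0
        for tok in token_ids:
            if tok not in node:
                break
            node, matched = node[tok], matched + 1
        return matched


index = PrefixIndex()
shared_prompt = [101, 7, 9, 42]             # stand-in for a tokenized system prompt
index.insert(shared_prompt + [5, 6])        # agent A's earlier request
hit = index.longest_cached_prefix(shared_prompt + [8, 8, 8])  # agent B's new request
print(f"reused {hit} of {len(shared_prompt) + 3} tokens")     # only the suffix is prefilled
```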
The emergence of HiCache marks a transition from stateless chatbots to state-heavy AI agents capable of long-term memory. In the broader industry, the cost of re-computing historical data has become a primary barrier to deploying sophisticated agents at scale. By turning the KVCache from a temporary byproduct into a persistent and shareable asset, Alibaba Cloud and SGLang are providing the economic foundation needed for the next wave of autonomous AI applications. You can find more details on the Alibaba Cloud Blog.