• On October 23, 2025, Alibaba Cloud announced that it had presented, at the ACM Symposium on Operating Systems Principles (SOSP 2025) in Seoul, its Aegaeon system, which raises GPU utilization for concurrent AI inference workloads, cutting the number of required GPUs by up to 82% while maintaining high performance.
  • According to research conducted jointly with Peking University, Aegaeon is described as a “multi-model serving system” capable of token-level autoscaling, letting up to seven models share a single GPU, whereas current systems manage only two or three.
  • The system works by actively offloading the currently running model and loading a waiting model when a new request arrives, meeting the Service Level Objective (SLO) while avoiding “Head-of-Line (HOL) blocking,” where one long-running request holds up everything queued behind it.
  • In internal testing, Alibaba cut the number of GPUs needed to serve dozens of AI models in its marketplace from 1,192 to 213, a reduction of (1,192 − 213) / 1,192 ≈ 82%.
  • Tests on models of up to 72 billion parameters showed a 1.5- to 9-fold increase in performance, depending on the task type.
  • The test environment comprised 2 nodes, each with 8 Nvidia H800 80GB GPUs (16 GPUs total), 2TB of DDR5 RAM, and 192 Intel Xeon Platinum 8469C CPU cores, with GPUs connected via NVLink. Alibaba is reported to use its proprietary eRDMA network to accelerate data transfer between GPUs.
  • The paper points out that 90% of the models in Alibaba’s Model Studio are called only infrequently, yet they account for 17.7% of GPU resources, a significant waste under fixed reservation mechanisms.
  • Aegaeon differs from current methods:
    • Multiplexing (packing multiple models onto one GPU) is limited by GPU memory; a back-of-the-envelope calculation after this list makes the limit concrete.
    • Traditional autoscaling makes its decisions at coarse time or request boundaries, not per token, so it remains less efficient.
  • Aegaeon overcomes these limits by making autoscaling decisions at the level of individual tokens, the smallest unit of AI inference (a minimal scheduling sketch after this list illustrates the idea).
  • Despite the breakthrough, commentators in tech circles note that Aegaeon has not caused the same ripple as DeepSeek V3, the Chinese model that shocked the industry earlier this year with a reported training cost of only $5.6 million.
  • A report from The Register notes that “US hyperscalers” such as Google, Amazon, or Microsoft may have similar solutions but have not yet disclosed them, treating GPU optimization as a strategic secret.
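To make the memory limit behind the multiplexing point concrete, here is a back-of-the-envelope check; the model sizes and the FP16 assumption of 2 bytes per parameter are illustrative choices, not figures from the paper. Weights alone for a 72B-parameter model already exceed one 80GB H800, and mid-sized models leave room for only a few co-resident copies, which is why static packing tops out at two to three models and Aegaeon must swap weights in and out instead.

```python
GPU_MEM_GB = 80  # one Nvidia H800

def weights_gb(params_billion: float, bytes_per_param: int = 2) -> float:
    """Approximate weight footprint in GB at FP16/BF16 (2 bytes/param).
    Ignores KV cache and activations, which only tighten the squeeze."""
    return params_billion * bytes_per_param  # 1e9 params * bytes, divided by 1e9 B/GB

for p in (7, 14, 72):
    gb = weights_gb(p)
    print(f"{p:>3}B model: ~{gb:.0f} GB weights -> {int(GPU_MEM_GB // gb)} fit in {GPU_MEM_GB} GB")
#   7B model: ~14 GB weights -> 5 fit in 80 GB
#  14B model: ~28 GB weights -> 2 fit in 80 GB
#  72B model: ~144 GB weights -> 0 fit in 80 GB
```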
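The paper’s code is not public, so the following is a minimal, hypothetical Python sketch of the token-level autoscaling idea described above, not Alibaba’s implementation; the names (`Request`, `TokenLevelScheduler`), the swap cost, and the per-token SLO budget are all assumptions. The point it illustrates: after every decoded token the scheduler re-checks deadlines and may preempt the resident model in favor of a queued request for another model, so one long generation cannot cause the head-of-line blocking that request-granularity scheduling suffers from.

```python
import heapq
import time
from dataclasses import dataclass, field

@dataclass(order=True)
class Request:
    deadline: float                       # absolute SLO deadline for the next token
    model_id: str = field(compare=False)
    tokens_left: int = field(compare=False)

class TokenLevelScheduler:
    """Hypothetical token-level autoscaler (illustrative, not Aegaeon's code)."""

    def __init__(self, swap_cost_s: float = 0.5):
        self.swap_cost_s = swap_cost_s    # assumed model offload + load cost
        self.active_model = None
        self.queue: list[Request] = []    # min-heap ordered by SLO deadline

    def submit(self, req: Request) -> None:
        heapq.heappush(self.queue, req)

    def step(self) -> None:
        """Decode one token for the most urgent request, swapping models if needed."""
        if not self.queue:
            return
        req = heapq.heappop(self.queue)
        if req.model_id != self.active_model:
            # Token-level preemption: offload the resident model and activate the
            # one the most urgent request needs. A request-level scheduler would
            # wait for the running request to finish, causing HOL blocking.
            print(f"swap {self.active_model} -> {req.model_id} (~{self.swap_cost_s}s)")
            self.active_model = req.model_id
        req.tokens_left -= 1
        print(f"token for {req.model_id}, {req.tokens_left} left")
        if req.tokens_left > 0:
            req.deadline = time.monotonic() + 0.1  # assumed per-token SLO budget
            heapq.heappush(self.queue, req)

if __name__ == "__main__":
    s = TokenLevelScheduler()
    now = time.monotonic()
    s.submit(Request(now + 0.2, "model-a", tokens_left=2))
    s.submit(Request(now + 0.1, "model-b", tokens_left=1))
    for _ in range(3):
        s.step()  # model-b runs first (earlier deadline), then model-a
```

In the real system the swap decision would also have to account for KV-cache handling and whether the swap cost fits within the remaining SLO slack; the sketch deliberately omits those details.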

📌 In short: with Aegaeon, announced on October 23, 2025, Alibaba Cloud cut the GPUs required for concurrent AI inference by up to 82% while maintaining high performance, demonstrating China’s rapid progress in optimizing GPU infrastructure for generative AI with up to seven models per GPU. This technology not only reduces the cost of serving billions of AI inference requests but could also reshape the global AI cloud market, where GPU optimization is becoming the most decisive competitive weapon.
