- Huawei’s Computing Systems Lab in Zurich has announced SINQ (Sinkhorn-Normalized Quantization), a new quantization method for large language models (LLMs) that reduces memory usage by 60–70% without degrading output quality.
- SINQ is designed to be calibration-free, easy to integrate, and much faster than previous methods. Huawei has released the code as open source on GitHub and Hugging Face under the Apache 2.0 license, allowing businesses to use, modify, and deploy it commercially at no cost.
- This technique lets models that previously required over 60 GB of GPU memory run in roughly 20 GB, enabling LLMs to operate on consumer GPUs like the RTX 4090 ($1,600) instead of the A100 80GB ($19,000) or H100 (>$30,000).
- For cloud services, the cost savings are significant: an A100 costs $3–$4.50/hour, while a 24 GB GPU like the RTX 4090 costs only about $1–$1.50/hour, saving thousands of dollars on long-term inference tasks.
- SINQ is built on two main innovations, illustrated in the code sketch after this list:
- Dual-Axis Scaling: Uses two separate scaling vectors for rows and columns, which helps reduce quantization errors caused by outliers.
- Sinkhorn-Knopp Normalization: A fast normalization algorithm that reduces “matrix imbalance”—a new metric more effective than kurtosis for optimizing quantization quality.
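As a rough illustration of how these two ideas fit together, here is a minimal NumPy sketch: alternating row/column rescaling in the spirit of Sinkhorn-Knopp, followed by plain int4 quantization of the balanced matrix. This is not Huawei's reference implementation; the function names are hypothetical, and the choice of per-row/per-column standard deviation as the balancing statistic is an assumption.

```python
import numpy as np

def sinkhorn_normalize(W, n_iters=10, eps=1e-8):
    """Alternately rescale rows and columns until their spreads are balanced.
    Returns the normalized matrix plus the accumulated dual-axis scales,
    so that W ~= np.outer(row_scale, col_scale) * W_norm."""
    row_scale = np.ones(W.shape[0], dtype=W.dtype)
    col_scale = np.ones(W.shape[1], dtype=W.dtype)
    W_norm = W.copy()
    for _ in range(n_iters):
        r = W_norm.std(axis=1) + eps          # per-row spread
        W_norm /= r[:, None]
        row_scale *= r
        c = W_norm.std(axis=0) + eps          # per-column spread
        W_norm /= c[None, :]
        col_scale *= c
    return W_norm, row_scale, col_scale

def quantize_int4(W):
    """Quantize the balanced matrix to 4-bit integers (-8..7)."""
    W_norm, row_scale, col_scale = sinkhorn_normalize(W)
    step = np.abs(W_norm).max() / 7.0         # one uniform step suffices
    q = np.clip(np.round(W_norm / step), -8, 7).astype(np.int8)
    return q, step, row_scale, col_scale

def dequantize(q, step, row_scale, col_scale):
    return q.astype(np.float32) * step * np.outer(row_scale, col_scale)

# An outlier row no longer dominates the quantization grid, because the
# row scale absorbs it before the int4 rounding happens.
W = np.random.randn(64, 64).astype(np.float32)
W[3] *= 50.0
q, step, r, c = quantize_int4(W)
print("mean abs error:", np.abs(W - dequantize(q, step, r, c)).mean())
```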
- Test results on Qwen3, LLaMA, and DeepSeek show that SINQ lowers perplexity and flip rate relative to other quantization baselines, nearly matching full-precision models, while quantizing models roughly 30 times faster than AWQ (both metrics are sketched below).
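For reference, here is how the two metrics are commonly defined. The flip-rate definition below, the share of benchmark answers that change versus the full-precision model, is a common reading; the paper's exact formulation may differ.

```python
import numpy as np

def perplexity(token_nlls):
    """exp of the mean per-token negative log-likelihood (natural log)."""
    return float(np.exp(np.mean(token_nlls)))

def flip_rate(fp_answers, quant_answers):
    """Fraction of answers where the quantized model disagrees with the
    full-precision model, regardless of which one is correct."""
    return float(np.mean(np.asarray(fp_answers) != np.asarray(quant_answers)))

print(perplexity([2.1, 1.7, 2.4]))            # ~7.90
print(flip_rate(list("ABCD"), list("ABDD")))  # 0.25: one answer flipped
```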
- SINQ also supports non-uniform quantization (NF4, sketched below) and can be combined with AWQ to form an A-SINQ variant with even higher accuracy.
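To make "non-uniform" concrete: an NF4-style code places its 16 levels at quantiles of a normal distribution, so codes are densest where normally distributed weights cluster. The level construction below is a simplified assumption, not the exact NF4 table used in practice.

```python
import numpy as np
from scipy.stats import norm

def nf4_style_levels():
    """16 levels at evenly spaced normal quantiles, rescaled to [-1, 1]."""
    p = np.linspace(0.5 / 16, 1 - 0.5 / 16, 16)
    levels = norm.ppf(p)
    return levels / np.abs(levels).max()

def quantize_nf4_style(w):
    """Map each weight to the index of its nearest non-uniform level."""
    levels = nf4_style_levels()
    scale = np.abs(w).max()                    # absmax scaling to [-1, 1]
    idx = np.abs(w[:, None] / scale - levels).argmin(axis=1)
    return idx.astype(np.uint8), scale, levels

w = np.random.randn(4096).astype(np.float32)
idx, scale, levels = quantize_nf4_style(w)
print("mean abs error:", np.abs(w - levels[idx] * scale).mean())
```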
- Huawei provides sample code, utilities for saving and loading quantized weights, and lm-eval integration, with pre-quantized models planned for the Hugging Face Hub soon (a hypothetical usage flow follows).
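An end-to-end flow might look like the following. Only the Hugging Face transformers calls are standard; the sinq package name, the quantize() entry point, its arguments, and the model choice are illustrative assumptions, so consult the GitHub README for the real API.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen3-8B"  # model choice is an assumption
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Calibration-free: no calibration dataset is needed at this step.
# from sinq import quantize            # hypothetical import; see the repo
# model = quantize(model, bits=4)      # hypothetical call and arguments

inputs = tokenizer("Quantization shrinks models by", return_tensors="pt")
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=32)[0]))
```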
📌 Summary: With its new SINQ quantization method, Huawei is democratizing the ability to run LLMs on common hardware, cutting memory use by 60–70% and GPU costs roughly threefold. This open-source, fast, easy-to-use, and calibration-free solution could become a new standard in AI quantization, expanding opportunities for both individuals and small businesses to deploy LLMs.

