
AWQ: ACTIVATION-AWARE WEIGHT QUANTIZATION FOR ON-DEVICE LLM COMPRESSION AND ACCELERATION

Ji Lin^{*1} Jiaming Tang^{*12} Haotian Tang^{†1} Shang Yang^{†1} Wei-Ming Chen^{3} Wei-Chen Wang^{1} Guangxuan Xiao^{1} Xingyu Dang^{14} Chuang Gan^{56} Song Han^{13}
https://github.com/mit-han-lab/llm-awq

Abstract

Large language models (LLMs) have fundamentally transformed the capabilities of numerous applications, from natural language processing to more intricate domain-specific tasks in robotics and autonomous driving. Moreover, the importance of on-device LLMs has grown significantly in recent years. Running LLMs on edge devices not only promises reduced latency and improved user experience but also aligns with the increasing need for user privacy, as data processing can occur locally. However, the astronomical model sizes of modern LLMs and constraints of the edge devices, primarily in terms of memory size and bandwidth, pose significant deployment challenges. In this paper, we propose Activation-aware Weight Quantization (AWQ), a hardware-friendly approach for LLM low-bit weight-only quantization. Our method is based on the observation that weights are not equally important: protecting only 1% of salient weights can greatly reduce quantization error. We then propose to search for the optimal per-channel scaling that protects the salient weights by observing the activation, not the weights. AWQ does not rely on any backpropagation or reconstruction, so it can well preserve LLMs' generalization ability on different domains and modalities, without overfitting to the calibration set. AWQ outperforms existing work on various language modeling and domain-specific benchmarks (coding and math). Thanks to better generalization, it achieves excellent quantization performance for instruction-tuned LMs and, for the first time, multi-modal LMs. Alongside AWQ, we implement TinyChat, an efficient and flexible inference framework tailored for on-device LLMs/VLMs, offering more than 3× speedup over the Huggingface FP16 implementation on both desktop and mobile GPUs. It also democratizes the deployment of the 70B Llama-2 model on mobile GPUs.

1 INTRODUCTION

Deploying large language models (LLMs) directly on edge devices is crucial. On-device usage eliminates delays caused by sending data to a cloud server and enables LLMs to operate offline, which is beneficial for real-time applications like virtual assistants, chatbots, and autonomous vehicles. The operational costs associated with maintaining and scaling centralized cloud infrastructure can also be reduced. On-device LLM also enhances data security by keeping sensitive information local, reducing the chance of data breaches. LLMs, grounded in transformer-based architectures (Vaswani et al., 2017), have gathered significant attention for their impressive performance across diverse benchmarks (Brown et al., 2020; Zhang et al., 2022; Touvron et al., 2023a; Scao et al., 2022).
Figure 1. We introduce AWQ, a versatile weight quantization method for LLMs. To implement AWQ, we developed TinyChat to deploy 4-bit quantized LLMs into various edge platforms, achieving a 3-4× performance boost compared to FP16. Notably, we've also manufactured a TinyChat computer, powered by TinyChat, which contains an NVIDIA Jetson Orin Nano with only 8 GB of memory and 15 W power consumption. Demo: https://youtu.be/z91a8DrfgEw.

However, the large model size leads to high serving costs. For example, GPT-3 has 175B parameters, which is 350GB in FP16, while the latest H100 GPU only has 96GB of memory, let alone edge devices.

Low-bit weight quantization for LLMs can significantly reduce the memory footprint of on-device LLM inference, but it is hard. Quantization-aware training (QAT) is not efficient due to the high training cost, while post-training quantization (PTQ) suffers from large accuracy degradation under a low-bit setting. The closest work is GPTQ (Frantar et al., 2022), which uses second-order information to perform error compensation. However, it may overfit the calibration set during reconstruction, distorting the learned features on out-of-distribution domains (Figure 8), which is problematic since LLMs are generalist models.
In this paper, we propose Activation-aware Weight Quantization (AWQ), a hardware-friendly low-bit weight-only quantization method for LLMs. Our method is based on the observation that weights are not equally important for LLMs' performance. There is a small fraction (0.1%-1%) of salient weights; skipping the quantization of these salient weights will significantly reduce the quantization loss (Table 1). To find the salient weight channels, the insight is that we should refer to the activation distribution instead of the weight distribution, even though we are doing weight-only quantization: weight channels corresponding to larger activation magnitudes are more salient since they process more important features. To avoid the hardware-inefficient mixed-precision implementation, we analyze the error from weight quantization and derive that scaling up the salient channels can reduce their relative quantization error (Equation 2). Following this intuition, we designed a per-channel scaling method to automatically search for the optimal scaling that minimizes the quantization error under full-weight quantization. AWQ does not rely on any backpropagation or reconstruction, so it can well preserve LLMs' generalization ability on various domains and modalities without overfitting to the calibration set.
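As an illustration of this search (not the paper's exact implementation), the following minimal NumPy sketch parameterizes the per-input-channel scales by a single exponent `alpha` applied to average activation magnitudes and picks the `alpha` that minimizes the output error of a simulated low-bit layer on calibration data. The helper names, the single-exponent search space, and the mean-squared-error objective are assumptions made for this sketch.

```python
import numpy as np

def pseudo_quantize(w, n_bits=3, group_size=128):
    """Simulated round-to-nearest (RTN) quantization with per-group, max-based
    scalers (cf. Eq. (1)); returns dequantized weights so errors can be measured."""
    out_f, in_f = w.shape
    wg = w.reshape(out_f, in_f // group_size, group_size)
    delta = np.abs(wg).max(axis=-1, keepdims=True) / 2 ** (n_bits - 1)
    return (delta * np.round(wg / delta)).reshape(out_f, in_f)

def search_channel_scales(w, x_calib, n_bits=3, group_size=128, n_grid=20):
    """Grid-search per-input-channel scales s = act_mag**alpha that minimize the
    output error of Q(w * s) applied to (x / s) on calibration activations."""
    act_mag = np.abs(x_calib).mean(axis=0) + 1e-8   # per-channel activation magnitude
    y_ref = x_calib @ w.T                           # full-precision reference output
    best_err, best_s = np.inf, np.ones_like(act_mag)
    for alpha in np.linspace(0.0, 1.0, n_grid):
        s = act_mag ** alpha                        # larger activation -> stronger protection
        y_q = (x_calib / s) @ pseudo_quantize(w * s, n_bits, group_size).T
        err = np.mean((y_ref - y_q) ** 2)
        if err < best_err:
            best_err, best_s = err, s
    return best_s
```

At inference time, the division of the activations by `s` can usually be folded into the preceding operator rather than executed explicitly; it is written out here only to keep the simulation transparent.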
To implement AWQ, we designed TinyChat, an efficient inference framework to convert the theoretical memory savings of 4-bit LLMs into measured speedup. Our framework significantly speeds up linear layers through on-the-fly dequantization. We also take advantage of efficient 4-bit weight packing and kernel fusion to minimize the inference overhead (e.g., intermediate DRAM access and kernel launch overhead), such that we can better realize the speedup from quantizing the weights to 4-bit, even though the hardware is byte-aligned.
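To make the byte-alignment point concrete, the sketch below packs two unsigned 4-bit weights into each byte and shows the unpack-and-dequantize step that a fused kernel would perform on the fly before the FP16 multiplication. The layout, helper names, and zero-point convention are illustrative assumptions, not TinyChat's actual storage format.

```python
import numpy as np

def pack_int4(q):
    """Pack quantized weights q in [0, 15] two per byte (low nibble first)."""
    q = np.asarray(q, dtype=np.uint8).reshape(-1, 2)
    return (q[:, 0] | (q[:, 1] << 4)).astype(np.uint8)

def unpack_dequant_int4(packed, scale, zero_point):
    """On-the-fly dequantization that a fused GEMM kernel would apply per weight group."""
    lo = packed & 0x0F
    hi = (packed >> 4) & 0x0F
    q = np.stack([lo, hi], axis=1).reshape(-1).astype(np.float32)
    return (q - zero_point) * scale

# Round-trip check for one weight group with an arbitrary scale and zero point.
w_q = np.random.randint(0, 16, size=128)
assert np.allclose(unpack_dequant_int4(pack_int4(w_q), scale=0.01, zero_point=8),
                   (w_q - 8) * 0.01)
```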
Experiments show that AWQ outperforms existing work on various tasks for different model families (e.g., LLaMA (Touvron et al., 2023a), OPT (Zhang et al., 2022)) and model sizes. Thanks to better generalization, it also achieves good quantization performance for instruction-tuned LMs (e.g., Vicuna) and, for the first time, multi-modal LMs (OpenFlamingo (Awadalla et al., 2023)). TinyChat further translates the ~4× lower memory footprint into measured speedup. On desktop, laptop, and mobile GPUs, we consistently observe a 3.2-3.3× average speedup compared to the FP16 implementation by Huggingface across a diverse spectrum of LLMs.

Furthermore, it facilitates effortless deployment of the Llama-2-70B model on a single NVIDIA Jetson Orin with 64GB of memory. It also democratizes 13-billion-parameter LLMs, which run at an interactive pace of 30 tokens/second on a laptop RTX 4070 GPU with only 8GB of memory. AWQ has been widely adopted by various open-source LLM serving solutions including FastChat, vLLM, HuggingFace TGI, LMDeploy, etc.
2 RELATED WORK

Model quantization methods. Quantization reduces the bit-precision of deep learning models (Han et al., 2016; Jacob et al., 2018; Nagel et al., 2019; Wang et al., 2019; Nagel et al., 2020; Lin et al., 2020), which helps to reduce the model size and accelerate inference. Quantization techniques generally fall into two categories: quantization-aware training (QAT, which relies on backpropagation to update the quantized weights) (Bengio et al., 2013; Gholami et al., 2021; Nagel et al., 2021; Choi et al., 2018) and post-training quantization (PTQ, usually training-free) (Jacob et al., 2018; Nagel et al., 2019; 2020). QAT methods cannot easily scale up to large models like LLMs. Therefore, people usually use PTQ methods to quantize LLMs.

Quantization of LLMs. People study two settings for LLM quantization: (1) W8A8 quantization, where both activations and weights are quantized to INT8 (Dettmers et al., 2022; Xiao et al., 2022; Yao et al., 2022; Wei et al., 2022a; 2023); (2) low-bit weight-only quantization (e.g., W4A16), where only weights are quantized into low-bit integers (Frantar et al., 2022; Dettmers & Zettlemoyer, 2022; Sheng et al., 2023; Park et al., 2022). We focus on the second setting in this work since it not only reduces the hardware barrier (requiring a smaller memory size) but also speeds up token generation (remedies the memory-bound workload); see the back-of-the-envelope estimate below. Apart from the vanilla round-to-nearest baseline (RTN), GPTQ (Frantar et al., 2022) is the closest to our work. However, the reconstruction process of GPTQ leads to an over-fitting issue on the calibration set and may not preserve the generalist abilities of LLMs for other modalities and domains. It also requires a reordering trick to work for some models (e.g., LLaMA-7B (Touvron et al., 2023a) and OPT-66B (Zhang et al., 2022)). Apart from quantization methods designed for general-purpose hardware, SpAtten (Wang et al., 2020) designs a progressive approach to gradually increase the number of bits used in softmax calculation.
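The following back-of-the-envelope estimate (illustrative, assumed numbers rather than measurements) shows why the generation stage is memory-bound and why shrinking the weights helps: each generated token must stream essentially all weights from GPU memory, so the weight footprint sets a lower bound on per-token latency.

```python
# Illustrative lower bound: time to stream all weights once per generated token.
params = 7e9                 # assumed 7B-parameter model
bandwidth_bytes_s = 1.0e12   # assumed ~1 TB/s of GPU memory bandwidth

for name, bytes_per_weight in [("FP16 (W16A16)", 2.0), ("INT4 (W4A16)", 0.5)]:
    footprint_gb = params * bytes_per_weight / 1e9
    ms_per_token = params * bytes_per_weight / bandwidth_bytes_s * 1e3
    print(f"{name}: {footprint_gb:.1f} GB of weights, >= {ms_per_token:.1f} ms/token")
# FP16 (W16A16): 14.0 GB of weights, >= 14.0 ms/token
# INT4 (W4A16): 3.5 GB of weights, >= 3.5 ms/token
```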

System support for low-bit quantized LLMs. Low-bit quantized LLMs have been a popular setting to reduce inference costs, and several systems provide support to achieve a practical speed-up. GPTQ (Frantar et al., 2022) provides INT3 kernels for OPT models, and GPTQ-for-LLaMA extends kernel support to INT4 reordered quantization with the help of Triton (Tillet et al., 2019).

Figure 2. We observe that we can find 1% of the salient weights in LLMs based on the activation distribution (middle). Keeping the salient weights in FP16 can significantly improve the quantized performance (PPL from 43.2 (left) to 13.0 (middle)), but the mixed-precision format is not hardware-efficient. We follow the activation-awareness principle and propose AWQ (right). AWQ performs per-channel scaling to protect the salient weights and reduce quantization error. We measure the perplexity of OPT-6.7B under INT3-g128 quantization.

FlexGen (Sheng et al., 2023), llama.cpp, and exllama perform group-wise INT4 quantization to reduce I/O costs and offloading. FasterTransformer implements FP16×INT4 GEMM for weight-only per-tensor quantization but does not support group quantization. LUT-GEMM (Park et al., 2022) performs bitwise computation on GPU CUDA cores with the help of lookup tables. Our concurrent work, MLC-LLM (MLC Team, 2023), offers strong results on multiple edge CPU and GPU platforms thanks to the powerful TVM (Chen et al., 2018; Feng et al., 2023) backend.

3 AWQ: ACTIVATION-AWARE WEIGHT QUANTIZATION

Quantization maps a floating-point number into lower-bit integers. It is an effective method to reduce the model size and inference costs of LLMs (Dettmers et al., 2022; Frantar et al., 2022; Yao et al., 2022; Xiao et al., 2022). In this section, we first propose a weight-only quantization method to improve accuracy without training/regression by protecting more "important" weights. We then develop a data-driven method to search for the optimal scaling that reduces quantization errors (Figure 2).

3.1 Improving LLM Quantization by Preserving 1% Salient Weights

We observe that the weights of LLMs are not equally important: there is a small fraction of salient weights that are much more important for LLMs' performance compared to others. Skipping the quantization of these salient weights can help bridge the performance degradation due to the quantization loss without any training or regression (Figure 2(b)). To verify the idea, we benchmark the performance of quantized LLMs when skipping part of the weight channels in Table 1. We measured the performance of INT3 quantized models while keeping some ratio of weight channels in FP16.
A widely used method to determine the importance of weights is to look at their magnitude or $L_2$-norm (Han et al., 2015; Frankle & Carbin, 2018). But we find that skipping the weight channels with large norm (i.e., FP16% (based on W)) does not significantly improve the quantized performance, leading to a similar marginal improvement as random selection. Interestingly, selecting weights based on activation magnitude can significantly improve the performance despite keeping only 0.1%-1% of channels in FP16. We hypothesize that the input features with larger magnitudes are generally more important. Keeping the corresponding weights in FP16 can preserve those features, which contributes to better model performance.
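A minimal sketch of this protection experiment, under simplifying assumptions: input channels are ranked by their average activation magnitude over a calibration batch, the weight columns of the top fraction are kept in full precision, and everything else goes through simulated RTN. The helper names and the simulation itself are illustrative.

```python
import numpy as np

def pseudo_quantize(w, n_bits=3, group_size=128):
    """Simulated RTN quantization (cf. Eq. (1)) applied per group of input channels."""
    out_f, in_f = w.shape
    wg = w.reshape(out_f, in_f // group_size, group_size)
    delta = np.abs(wg).max(axis=-1, keepdims=True) / 2 ** (n_bits - 1)
    return (delta * np.round(wg / delta)).reshape(out_f, in_f)

def protect_salient_channels(w, x_calib, keep_ratio=0.01, n_bits=3, group_size=128):
    """Quantize w, but keep the weight columns of the top `keep_ratio` most-activated
    input channels in full precision (the mixed-precision setting of Table 1)."""
    act_mag = np.abs(x_calib).mean(axis=0)        # per-input-channel activation magnitude
    n_keep = max(1, int(keep_ratio * act_mag.size))
    salient = np.argsort(-act_mag)[:n_keep]       # indices of the most-activated channels
    w_q = pseudo_quantize(w, n_bits, group_size)
    w_q[:, salient] = w[:, salient]               # restore salient columns to full precision
    return w_q
```

Swapping `act_mag` for a per-channel weight norm reproduces the "based on W" selection, which Table 1 shows to be far less effective.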

Limitations: Although keeping 0.1% of weights in FP16 can improve the quantized performance without a noticeable increase in model size (measured in total bits), such a mixed-precision data type makes the system implementation difficult. We need to come up with a method to protect the important weights without actually keeping them in FP16.

3.2 Protecting Salient Weights by Activation-aware Scaling

We propose an alternative method to reduce the quantization error of the salient weights by per-channel scaling, which does not suffer from the hardware inefficiency issue.

Analyzing the quantization error.

We start by analyzing the error from weight-only quantization. Consider a group/block of weights $\mathbf{w}$; the linear operation can be written as $y=\mathbf{w}\mathbf{x}$, and the quantized counterpart is $y=Q(\mathbf{w})\mathbf{x}$. Specifically, the quantization function is defined as:
$$Q(\mathbf{w})=\Delta \cdot \operatorname{Round}\left(\frac{\mathbf{w}}{\Delta}\right), \quad \Delta=\frac{\max(|\mathbf{w}|)}{2^{N-1}} \qquad (1)$$
where $N$ is the number of quantization bits, and $\Delta$ is the quantization scaler determined by the absolute maximum value.
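For reference, a direct NumPy transcription of this definition for a single group of weights (a sketch only; a deployed kernel would also clamp the rounded values to the representable signed-integer range):

```python
import numpy as np

def quantize_group(w, n_bits):
    """Eq. (1): max-based scaler and round-to-nearest; returns the dequantized values."""
    delta = np.abs(w).max() / 2 ** (n_bits - 1)
    return delta * np.round(w / delta)

w = np.array([0.3, -1.2, 0.05, 0.7], dtype=np.float32)
print(quantize_group(w, n_bits=3))  # ~[0.3, -1.2, 0.0, 0.6]: integer multiples of delta = 1.2 / 4
```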
| PPL ↓ | FP16 | RTN (w3-g128) | FP16% (based on act.) | | | FP16% (based on W) | | | FP16% (random) | | |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| | | | 0.1% | 1% | 3% | 0.1% | 1% | 3% | 0.1% | 1% | 3% |
| OPT-1.3B | 14.62 | 119.00 | 25.03 | 16.91 | 16.68 | 108.71 | 98.55 | 98.08 | 119.76 | 109.38 | 61.49 |
| OPT-6.7B | 10.86 | 23.54 | 11.58 | 11.39 | 11.36 | 23.41 | 22.37 | 22.45 | 23.54 | 24.23 | 24.22 |
| OPT-13B | 10.13 | 46.04 | 10.51 | 10.43 | 10.42 | 46.07 | 48.96 | 54.49 | 44.87 | 42.00 | 39.71 |
Table 1. Keeping a small fraction of weights (0.1%-1%) in FP16 significantly improves the performance of the quantized models over round-to-nearest (RTN). It is only effective when we select the important weights in FP16 by looking at the activation distribution instead of the weight distribution. We highlight results with a decent perplexity in green. We used INT3 quantization with a group size of 128 and measured the WikiText perplexity (↓).
| OPT-6.7B | $s=1$ | $s=1.25$ | $s=1.5$ | $s=2$ | $s=4$ |
| :--- | :---: | :---: | :---: | :---: | :---: |
| proportion of $\Delta' \neq \Delta$ | 0% | 2.8% | 4.4% | 8.2% | 21.2% |
| average $\Delta'/\Delta$ | 1 | 1.005 | 1.013 | 1.038 | 1.213 |
| average $\frac{\Delta'}{\Delta} \cdot \frac{1}{s}$ | 1 | 0.804 | 0.676 | 0.519 | **0.303** |
| Wiki-2 PPL | 23.54 | 12.87 | 12.48 | **11.92** | 12.36 |
Table 2. Statistics when multiplying the 1% salient channels by $s>1$. Scaling up the salient channels significantly improves the perplexity (23.54 to 11.92). As $s$ goes larger, the percentage of changed $\Delta$ increases, and the error reduction rate for salient channels also increases. However, the best perplexity is achieved at $s=2$, since further increasing $s$ will increase the quantization error for non-salient channels.

Now consider a weight element $w \in \mathbf{w}$. If we multiply $w$ with $s>1$ and inversely scale $x$, we will have $Q(w \cdot s)(x / s)$, which is:
w w ww 乘以 s > 1 s > 1 s > 1s>1 ,并逆向缩放 x x xx ,我们将得到 Q ( w s ) ( x / s ) Q ( w s ) ( x / s ) Q(w*s)(x//s)Q(w \cdot s)(x / s)
$$Q(w \cdot s) \cdot \frac{x}{s}=\Delta' \cdot \operatorname{Round}\left(\frac{w s}{\Delta'}\right) \cdot x \cdot \frac{1}{s} \qquad (2)$$
where $\Delta'$ is the new quantization scaler after applying $s$. We empirically find that: (1) The expected error from $\operatorname{Round}(\cdot)$ (denoted as $\operatorname{RoundErr}(\cdot)$) does not change: since the round function maps a floating-point number to an integer, the error is roughly uniformly distributed over $[0, 0.5]$, resulting in an average error of 0.25, i.e., $\operatorname{RoundErr}(\cdot) \sim 0.25$. (2) Scaling up a single element $w$ usually does not change the maximum value of the group $\mathbf{w}$; therefore we have $\Delta' \approx \Delta$. (3) As $\Delta$ and $x$ are represented in FP16, they have no quantization error. Consequently, the quantization errors from Equations 1 and 2 can be expressed as:
$$\operatorname{Err}(Q(w) x)=\Delta \cdot \operatorname{RoundErr}\left(\frac{w}{\Delta}\right) \cdot x$$
$$\operatorname{Err}\left(Q(w \cdot s)\left(\frac{x}{s}\right)\right)=\Delta' \cdot \operatorname{RoundErr}\left(\frac{w s}{\Delta'}\right) \cdot x \cdot \frac{1}{s}$$
The ratio of the new error to the original error is $\frac{\Delta'}{\Delta} \cdot \frac{1}{s}$. Given $\Delta' \approx \Delta$ and $s>1$, the relative error is smaller for the salient weight $w$.
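A quick numerical sanity check of this ratio (a sketch on synthetic Gaussian weights rather than OPT-6.7B, so only the trend is meaningful): scale roughly 1% of the positions in each group by $s$, quantize, undo the scale, and compare the per-element error on those positions against plain RTN.

```python
import numpy as np

rng = np.random.default_rng(0)
n_bits, s = 3, 2.0
w = rng.normal(size=(4096, 128)).astype(np.float32)     # 4096 groups of 128 weights
salient = np.zeros(128, dtype=bool)
salient[rng.choice(128, size=1, replace=False)] = True  # ~1% of the channels

def rtn(w, n_bits):
    delta = np.abs(w).max(axis=-1, keepdims=True) / 2 ** (n_bits - 1)
    return delta * np.round(w / delta)                   # dequantized weights

scale = np.where(salient, s, 1.0).astype(np.float32)
err_plain = np.abs(rtn(w, n_bits) - w)                   # error of Eq. (1), per element
err_scaled = np.abs(rtn(w * scale, n_bits) / scale - w)  # error of Eq. (2), scale undone
ratio = err_scaled[:, salient].mean() / err_plain[:, salient].mean()
print(f"salient-weight error ratio ~ {ratio:.2f} (roughly Delta'/Delta * 1/s, i.e. < 1 for s > 1)")
```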
To verify the idea, we multiply the 1% salient channels with $s>1$ for the OPT-6.7B model, and measure the change in $\Delta$ for each group in Table 2.
| OPT (PPL ↓) | 1.3B | 2.7B | 6.7B | 13B | 30B |
| :--- | :---: | :---: | :---: | :---: | :---: |
| FP16 | 14.62 | 12.47 | 10.86 | 10.13 | 9.56 |
| RTN | 119.47 | 298.00 | 23.54 | 46.04 | 18.80 |
| 1% FP16 | 16.91 | 13.69 | **11.39** | **10.43** | 9.85 |
| $s=2$ | 18.63 | 14.94 | 11.92 | 10.80 | 10.32 |
| AWQ | **16.32** | **13.58** | **11.39** | 10.56 | **9.77** |
Table 3. AWQ protects salient weights and reduces quantization error by using a scaling-based method. It consistently outperforms round-to-nearest quantization (RTN) and achieves comparable performance to mixed-precision (1% FP16) while being more hardware-friendly. We use 3-bit quantization with a group size of 128.

We find that scaling up the salient channels is quite effective: the perplexity improves from 23.54 for $s=1$ (simply RTN) to 11.92 for $s=2$