AWQ: $\underset{―}{A C T I V A T I O N - A W A R E W E I G H T Q U A N T I Z A T I O N F O R}$ ON-DEVICE LLM COMPRESSION AND ACCELERATION
设备内压缩和加速

Ji Lin $^{* 1}$ Jiaming Tang $^{* 12}$ Haotian Tang $^{† 1}$ Shang Yang $^{† 1}$ Wei-Ming Chen $^{3}$ Wei-Chen Wang $^{1}$ Guangxuan Xiao $^{1}$ Xingyu Dang $^{14}$ Chuang Gan $^{56}$ Song Han $^{13}$
吉林姜明唐唐浩天尚阳陈伟明王维琛萧广轩党兴宇干创韩松https://github.com/mit-han-lab/llm-awq

Abstract 摘要

Large language models (LLMs) have fundamentally transformed the capabilities of numerous applications, from natural language processing to more intricate domain-specific tasks in robotics and autonomous driving. Moreover, the importance of on-device LLMs has grown significantly in the recent years. Running LLMs on edge devices not only promises reduced latency and improved user experience but also aligns with the increasing need for user privacy, as data processing can occur locally. However, the astronomical model sizes of modern LLMs and constraints of the edge devices, primarily in terms of memory size and bandwidth, pose significant deployment challenges. In this paper, we propose Activation-aware Weight Quantization (AWQ), a hardware-friendly approach for LLM low-bit weight-only quantization. Our method is based on the observation that weights are not equally important: protecting only $1 %$ of salient weights can greatly reduce quantization error. We then propose to search for the optimal per-channel scaling that protects the salient weights by observing the activation, not weights. AWQ does not rely on any backpropagation or reconstruction, so it can well preserve LLMs’ generalization ability on different domains and modalities, without overfitting to the calibration set. AWQ outperforms existing work on various language modeling and domain-specific benchmarks (coding and math). Thanks to better generalization, it achieves excellent quantization performance for instruction-tuned LMs and, for the first time, multi-modal LMs. Alongside AWQ, we implement TinyChat, an efficient and flexible inference framework tailored for on-device LLM/VLMs, offering more than $3 \times$ speedup over the Huggingface FP16 implementation on both desktop and mobile GPUs. It also democratizes the deployment of the 70B Llama-2 model on mobile GPUs.
大型语言模型(code1001)从根本上改变了许多应用程序的能力,从自然语言处理到机器人和自动驾驶等更复杂的领域特定任务。此外,设备端(code1002)的重要性在近年来也显著增加。在边缘设备上运行(code1003)不仅能够减少延迟和改善用户体验,还与日益增长的用户隐私需求相符,因为数据处理可以在本地进行。然而,现代(code1004)的天文级模型尺寸和边缘设备的约束,主要是在内存大小和带宽方面,给部署带来了重大挑战。在本文中,我们提出了激活感知权重量化(AWQ),这是一种面向硬件的(code1005)低位权重量化方法。我们的方法基于这样一个观察:权重并非同等重要,只保护(code0)个重要权重就可以大大减少量化误差。我们提出通过观察激活而不是权重来搜索最佳的每通道缩放因子,从而保护重要权重。AWQ 不依赖于任何反向传播或重建,因此可以很好地保持(code1006)在不同领域和模态上的泛化能力,而不会过拟合校准集。AWQ 在各种语言建模和领域特定的基准测试(编码和数学)上表现优于现有工作。得益于更好的泛化,它实现了卓越的量化性能,用于指令调优的 LM 和多模态 LM。与 AWQ 一起,我们实现了 TinyChat,这是一个针对设备端(code1007)/VLM 的高效灵活的推理框架,在台式机和移动 GPU 上相比 Huggingface FP16 实现都提供了超过(code1)倍的加速。它还使 70B Llama-2 模型在移动 GPU 上的部署民主化。

1 InTRODUCTION 介绍

Deploying large language models (LLMs) directly on edge devices is crucial. On-device usage eliminates delays caused by sending data to a cloud server and enables LLMs to operate offline, which is beneficial for real-time applications like virtual assistants, chatbots, and autonomous vehicles. The operational costs associated with maintaining and scaling centralized cloud infrastructure can also be reduced. On-device LLM also enhances data security by keeping sensitive information local, reducing the chance of data breaches. LLMs, grounded in transformer-based architectures (Vaswani et al., 2017), have gathered significant attention for their impressive performance across diverse benchmarks (Brown et al., 2020; Zhang et al., 2022; Touvron
在边缘设备上直接部署大型语言模型至关重要。本地使用可以消除将数据发送到云服务器所产生的延迟,并使设备能在离线情况下运行,这对实时应用如虚拟助手、聊天机器人和自主车辆非常有利。此外,还可以减少维护和扩展集中式云基础设施所产生的运营成本。本地使用还通过将敏感信息保留在本地来增强数据安全性,降低数据泄露的风险。基于变换器架构的大型语言模型在各种基准测试中表现出色,受到了广泛关注。

Figure 1. We introduce AWQ, a versatile weight quantization method for LLM. To implement AWQ, we developed TinyChat to deploy 4 -bit quantized LLMs into various edge platforms, achieving a 3-4

\times

performance boost compared to FP16. Notably, we’ve also manufactured a TinyChat computer, powered by TinyChat, which contains an NVIDIA Jetson Orin Nano with only 8 GB of memory and 15 W power consumption. Demo: https://youtu.be/z91a8DrfgEw.
我们介绍 AWQ,一种多功能的权重量化方法,用于LLM。为实现 AWQ,我们开发了 TinyChat 以部署 4 位量化LLMs到各种边缘平台,与 FP16 相比获得 3-4 倍

\times

性能提升。值得注意的是,我们还制造了搭载 TinyChat 的 TinyChat 电脑,配备有 NVIDIA Jetson Orin Nano,仅 8GB 内存和 15W 功耗。演示:https://youtu.be/z91a8DrfgEw。
et al., 2023a; Scao et al., 2022). However, the large model size leads to the high serving costs. For example, GPT-3 has 175B parameters, which is 350GB in FP16, while the latest H100 GPU only has 96GB memory, let alone edge devices.
等人，2023 年 a; Scao 等人，2022 年)。然而,大型模型尺寸导致了高昂的服务成本。例如,GPT-3 有 1,750 亿个参数,以 FP16 格式占用 350GB 空间,而最新的 H100 GPU 仅有 96GB 内存,更不用说边缘设备了。
Low-bit weight quantization for LLMs can significantly reduce the memory footprint of on-device LLM inference but
低位权重量化可以大大减小设备上LLMs推理的内存占用,但是
is hard. Quantization-aware training (QAT) is not efficient due to the high training cost, while post-training quantization (PTQ) suffers from large accuracy degradation under a low-bit setting. The closest work is GPTQ (Frantar et al., 2022), which uses second-order information to perform error compensation. However, it may overfit the calibration set during reconstruction, distorting the learned features on out-of-distribution domains (Figure 8), which is problematic since LLMs are generalist models.
很难。感知量化培训(QAT)由于高昂的培训成本而效率不高,而培训后量化(PTQ)在低位设置下则会遭受大幅准确性下降。最接近的工作是 GPTQ(Frantar 等人,2022 年),它使用二阶信息来执行误差补偿。但是,它可能会过度拟合校准集 during 重建,扭曲了分布外域上的学习特征(图 8),这是个问题,因为LLMs是通用模型。

In this paper, we propose Activation-aware Weight Quantization (AWQ), a hardware-friendly low-bit weight-only quantization method for LLMs. Our method is based on the observation that weights are not equally important for LLMs’ performance. There is a small fraction (

0.1 % - 1 %)

of salient weights; skipping the quantization of these salient weights will significantly reduce the quantization loss (Table 1). To find the salient weight channels, the insight is that we should refer to the activation distribution instead of the weight distribution, despite we are doing weightonly quantization: weight channels corresponding to larger activation magnitudes are more salient since they process more important features. To avoid the hardware-inefficient mixed-precision implementation, we analyze the error from weight quantization and derive that scaling up the salient channels can reduce their relative quantization error (Equation 2). Following the intuition, we designed a per-channel scaling method to automatically search for the optimal scaling that minimizes the quantization error under full-weight quantization. AWQ does not rely on any backpropagation or reconstruction, so it can well preserve LLMs’ generalization ability on various domains and modalities without overfitting to the calibration set.
LLMs LLMs

0.1 % - 1 %)

LLMs

To implement AWQ, we designed TinyChat, an efficient inference framework to convert theoretical memory savings from 4-bit LLM to measured speedup. Our framework significantly speeds up linear layers through on-the-fly dequantization. We also take advantage of efficient 4-bit weight packing and kernel fusion to minimize the inference overhead (e.g., intermediate DRAM access and kernel launch overhead), such that we can better realize the speed up from quantizing the weights to 4 -bit, despite the computer is byte-aligned.
我们设计了 TinyChat，这是一个高效的推理框架,可将理论内存节省从 4 位 LLM 转换为实测加速。我们的框架通过即时反量化大幅加快了线性层。我们还利用高效的 4 位权重打包和内核融合,最小化了推理开销(例如中间 DRAM 访问和内核启动开销),从而更好地实现了将权重量化到 4 位的加速,尽管计算机是字节对齐的。

Experiments show that AWQ outperforms existing work on various tasks for different model families (e.g., LLaMA (Touvron et al., 2023a), OPT (Zhang et al., 2022)) and model sizes. Thanks to better generalization, it also achieves good quantization performance for instructiontuned LMs (e.g., Vicuna) and, for the first time, multi-modal LMs (OpenFlamingo (Awadalla et al., 2023)). TinyChat further translates the

\sim 4 \times

lower memory footprint to measured speedup. On desktop, laptop and mobile GPUs, we consistently observe a 3.2-3.3

\times

average speedup compared
实验表明,AWQ 在不同模型家族的各种任务上优于现有工作(例如,LLaMA(Touvron 等人,2023 年 a)、OPT(Zhang 等人,2022 年))和不同大小的模型。由于更好的泛化,它还在指令调优的 LM(例如,Vicuna)和多模态 LM(OpenFlamingo(Awadalla 等人,2023 年))上实现了良好的量化性能。TinyChat 进一步将较低的内存占用转化为测量到的加速。在台式机、笔记本电脑和移动 GPU 上,我们一致观察到平均 3.2-3.3 倍的加速。
to the FP16 implementation by Huggingface across a diverse spectrum of LLMs. Furthermore, it facilitates effortless deployment of the Llama-2-70B model on a single NVIDIA Jetson Orin with 64GB of memory. It also democratizes 13 billion parameter LLM at an interactive pace of 30 tokens/second on a laptop RTX 4070 GPU with only 8GB of memory. AWQ has been widely adopted by various opensource LLM serving solutions including FastChat, vLLM, HuggingFace TGI, LMDeploy, etc.
华为飞腾公司关于广泛模型光谱的 FP16 实施。此外,它还促进了在单个 NVIDIA Jetson Orin 具有 64GB 内存的设备上轻松部署 Llama-2-70B 模型。它还使 130 亿参数模型以每秒 30 个标记的交互速度在仅 8GB 内存的笔记本电脑 RTX 4070 GPU 上民主化。AWQ 已被广泛采用于 FastChat、vLLM、HuggingFace TGI、LMDeploy 等各种开源模型服务解决方案。

Model quantization methods. Quantization reduces the bit-precision of deep learning models (Han et al., 2016; Jacob et al., 2018; Nagel et al., 2019; Wang et al., 2019; Nagel et al., 2020; Lin et al., 2020), which helps to reduce the model size and accelerate inference. Quantization techniques generally fall into two categories: quantization-aware training (QAT, which relies on backpropagation to update the quantized weights) (Bengio et al., 2013; Gholami et al., 2021; Nagel et al., 2021; Choi et al., 2018) and post-training quantization (Jacob et al., 2018; Nagel et al., 2019; 2020) (PTQ, usually training-free). The QAT methods cannot easily scale up to large models like LLMs. Therefore, people usually use PTQ methods to quantize LLMs.
模型量化方法。量化可以减少深度学习模型的位精度(Han et al., 2016; Jacob et al., 2018; Nagel et al., 2019; Wang et al., 2019; Nagel et al., 2020; Lin et al., 2020),这有助于减小模型大小并加快推理速度。量化技术通常分为两类:量化感知训练(QAT,依赖于反向传播来更新量化权重)(Bengio et al., 2013; Gholami et al., 2021; Nagel et al., 2021; Choi et al., 2018)和训练后量化(PTQ,通常无需训练)(Jacob et al., 2018; Nagel et al., 2019; 2020)。QAT 方法很难扩展到大型模型如LLMs。因此,人们通常使用 PTQ 方法来量化LLMs。
Quantization of LLMs. People study two settings for LLM quantization: (1) W8A8 quantization, where both activation and weights are quantized to INT8 (Dettmers et al., 2022; Xiao et al., 2022; Yao et al., 2022; Wei et al., 2022a; 2023); (2) Low-bit weight-only quantization (e.g., W4A16), where only weights are quantized into low-bit integers (Frantar et al., 2022; Dettmers & Zettlemoyer, 2022; Sheng et al., 2023; Park et al., 2022). We focus on the second setting in this work since it not only reduces the hardware barrier (requiring a smaller memory size) but also speeds up the token generation (remedies memory-bound workload). Apart from the vanilla round-to-nearest baseline (RTN), GPTQ (Frantar et al., 2022) is the closest to our work. However, the reconstruction process of GPTQ leads to an over-fitting issue to the calibration set and may not preserve the generalist abilities of LLMs for other modalities and domains. It also requires a reordering trick to work for some models (e.g., LLaMA-7B (Touvron et al., 2023a) and OPT66B (Zhang et al., 2022)). Apart from quantiztion methods designed for general-purporse hardware, SpAtten (Wang et al., 2020) designs a progressive approach to gradually increase the number of bits used in softmax calculation.
数值化 LLMs。人们研究了两种设置的 LLM 量化:(1) W8A8 量化,其中激活和权重都量化为 INT8 (Dettmers 等人,2022;Xiao 等人,2022;Yao 等人,2022;Wei 等人,2022a;2023);(2) 低位权重量化 (例如 W4A16),其中只有权重被量化为低位整数(Frantar 等人,2022;Dettmers & Zettlemoyer,2022;Sheng 等人,2023;Park 等人,2022)。我们在这项工作中关注第二种设置,因为它不仅降低了硬件障碍(需要较小的内存大小),而且加快了令牌生成(补救了受内存限制的工作负载)。除了普通的四舍五入到最近基准(RTN)之外,GPTQ (Frantar 等人,2022)是最接近我们工作的。然而,GPTQ 的重建过程导致了对校准集的过度拟合,可能无法保持 LLMs 对其他模态和领域的概括能力。它还需要一个重新排序的技巧才能适用于某些模型(例如 LLaMA-7B(Touvron 等人,2023a)和 OPT66B(Zhang 等人,2022))。除了针对通用硬件设计的量化方法外,SpAtten(Wang 等人,2020)还设计了一种渐进式方法,逐步增加用于 softmax 计算的位数。
System support for low-bit quantized LLMs. Low-bit quantized LLMs have been a popular setting to reduce inference costs. There are some system supports to achieve a practical speed-up. GPTQ (Frantar et al., 2022) provides INT3 kernels for OPT models and GPTQ-for-LLaMA extends kernel support for INT4 reordered quantization with the help of Triton (Tillet et al., 2019). FlexGen (Sheng et al.,
系统支持低比特量化LLMs。低比特量化LLMs一直是一种流行的设置,可以降低推理成本。有一些系统支持来实现实际的速度提升。GPTQ(Frantar 等,2022)为 OPT 模型提供 INT3 内核,GPTQ-for-LLaMA 在 Triton(Tillet 等,2019)的帮助下扩展了 INT4 重排量化的内核支持。FlexGen(Sheng 等,

Figure 2. We observe that we can find

1 %

of the salient weights in LLMs based on the activation distribution (middle). Keeping the salient weights in FP16 can significantly improve the quantized performance (PPL from 43.2 (left) to 13.0 (middle)), but the mixed-precision format is not hardware-efficient. We follow the activation-awareness principle and propose AWQ (right). AWQ performs per-channel scaling to protect the salient weights and reduce quantization error. We measure the perplexity of OPT-6.7B under INT3-g128 quantization.
我们观察到,我们可以根据激活分布(中间)找到显著权重的

1 %

。保持显著权重在 FP16 中可以显著提高量化性能(PPL 从 43.2(左)到 13.0(中)),但混合精度格式并不高效。我们遵循激活感知原则,提出了 AWQ(右)。AWQ 执行每通道缩放以保护显著权重并降低量化误差。我们测量 OPT-6.7B 在 INT3-g128 量化下的困惑度。
2023), llama. cpp* and exllama

^{†}

perform group-wise INT4 quantization to reduce I/O costs and offloading. FasterTransformer implements FP16×INT4 GEMM for weightonly per-tensor quantization but does not support group quantization. LUT-GEMM (Park et al., 2022) performs bitwise computation on GPU CUDA cores with the help of lookup tables. Our concurrent work, MLC-LLM (MLCTeam, 2023) offers strong results on multiple edge CPU and GPU platforms thanks to the powerful TVM (Chen et al., 2018; Feng et al., 2023) backend.
2023 年), 羊驼. cpp*和 exllama

^{†}

执行组合 INT4 量化以降低 I/O 成本和卸载。FasterTransformer 实现了 FP16×INT4 GEMM 用于仅权重的分每张量量化,但不支持组量化。LUT-GEMM(Park 等人,2022)在 GPU CUDA 内核上执行位运算,借助查找表。我们并行的工作,MLC-LLM(MLCTeam,2023)依靠强大的 TVM(Chen 等人,2018;Feng 等人,2023)后端在多个边缘 CPU 和 GPU 平台上提供了出色的结果。

3 AWQ: ACtivation-aWare WeIGht QUANTIZATION
3 AWQ：激活感知权重量化

Quantization maps a floating-point number into lower-bit integers. It is an effective method to reduce the model size and inference costs of LLMs (Dettmers et al., 2022; Frantar et al., 2022; Yao et al., 2022; Xiao et al., 2022). In this section, we first propose a weight-only quantization method to improve accuracy without training/regression by protecting more “important” weights. And then develop a data-driven method to search for the optimal scaling that reduces quantization errors (Figure 2).
量化将浮点数映射到较低位的整数。这是一种有效的方法,可以减少模型大小和推理成本。在本节中,我们首先提出一种只量化权重的方法,通过保护更"重要"的权重来提高精度,无需训练或回归。然后开发了一种数据驱动的方法,搜索可以减少量化误差的最佳缩放因子(图 2)。

3.1 Improving LLM Quantization by Preserving 1% Salient Weights
通过保留 1%重要权重改进LLM量化

We observe that the weights of LLMs are not equally important: there is a small fraction of salient weights that are much more important for LLMs’ performance compared to others. Skipping the quantization of these salient weights can help bridge the performance degradation due to the quantization loss without any training or regression (Figure 2(b)). To verify the idea, we benchmark the performance of quantized LLMs when skipping part of the weight channels in Table 1. We measured the performance of INT3 quantized models while keeping some ratios of
我们观察到，LLMs的权重并不是同等重要的:有一小部分显著权重比其他权重对LLMs的性能影响要大得多。跳过对这些显著权重进行量化可以帮助缓解由于量化损失而造成的性能退化,无需任何训练或回归(图 2(b))。为了验证这个想法,我们在表 1 中对部分权重通道进行跳过量化的量化LLMs模型的性能进行了基准测试。我们测量了 INT3 量化模型的性能,同时保留了一些权重比例。

weight channels in FP16. A widely used method to determine the importance of weights is to look at its magnitude or

L_{2}

-norm (Han et al., 2015; Frankle & Carbin, 2018). But we find skipping the weight channels with large norm (i.e., FP16% (based on W)) does not significantly improve the quantized performance, leading to a similar marginal improvement as random selection. Interestingly, selecting weights based on activation magnitude can significantly improve the performance despite keeping only

0.1 % - 1 %

of channels in FP16. We hypothesize that the input features with larger magnitudes are generally more important. Keeping the corresponding weights in FP16 can preserve those features, which contributes to better model performance.
以 FP16 表示的权重通道。确定权重重要性的广泛方法是观察其大小或

L_{2}

范数(Han et al., 2015; Frankle & Carbin, 2018)。但我们发现跳过具有大范数的权重通道(即 FP16%(基于 W))并不能明显改善量化性能,其带来的边际改善与随机选择相似。有趣的是,基于激活大小选择权重可显著改善性能,尽管只保留了

0.1 % - 1 %

的通道采用 FP16。我们假设具有较大幅度的输入特征通常更为重要。保留相应的权重采用 FP16 可以保留这些特征,从而有助于提高模型性能。
Limitations: Despite keeping

0.1 %

of weights in FP16 can improve the quantized performance without a noticeable increase in model size (measured in total bits), such a mixedprecision data type will make the system implementation difficult. We need to come up with a method to protect the important weights without actually keeping them as FP16.
尽管在 FP16 中保留权重可以在模型大小(以总比特为单位)几乎不增加的情况下提高量化性能,但这种混合精度数据类型会使系统实现变得困难。我们需要想出一种方法来保护关键权重,而无需将它们实际保留为 FP16。

3.2 Protecting Salient Weights by Activation-aware Scaling
通过激活感知缩放保护显著权重

We propose an alternative method to reduce the quantization error of the salient weight by per-channel scaling, which does not suffer from the hardware inefficiency issue.
我们提出了一种替代方法,通过按通道缩放来减少显著权重的量化误差,不会遭受硬件效率低下的问题。

Analyzing the quantization error.
分析量化误差。

We start by analyzing the error from weight-only quantization. Consider a group/block of weight

w

; the linear operation can be written as

y = w x

, and the quantized counterpart is

y = Q (w) x

. Specifically, the quantization function is defined as:
我们通过分析仅重量量化的错误开始。考虑一组/块重量

w

; 线性操作可以写为

y = w x

，量化对应物为

y = Q (w) x

。具体地说,量化函数定义为:

Q (w) = Δ \cdot Round (\frac{w}{Δ}), Δ = \frac{max (| w |)}{2^{N - 1}}

where

N

is the number of quantization bits, and

Δ

is the quantization scaler determined by the absolute maximum value. Now consider a weight element

w \in w

, if we mul-
量化位数为

N

，量化缩放因子由最大绝对值确定为

Δ

。现考虑权重元素

w \in w

，如果我们将其乘以

PPL $↓$ 人群 $↓$	FP16	$\begin{matrix} RTN \\ (w3-g128) \end{matrix}$	FP16% (based on act.) 16% (基于实际).			FP16% (based on W) 16% (基于 W)			FP16% (random) 半精度 16% (随机)
PPL $↓$ 人群 $↓$	FP16	$\begin{matrix} RTN \\ (w3-g128) \end{matrix}$	0.1%	1%	3%	0.1%	1%	3%	0.1%	1%	3%
OPT-1.3B	14.62	119.00	25.03	16.91	16.68	108.71	98.55	98.08	119.76	109.38	61.49
OPT-6.7B	10.86	23.54	11.58	11.39	11.36	23.41	22.37	22.45	23.54	24.23	24.22
OPT-13B	10.13	46.04	10.51	10.43	10.42	46.07	48.96	54.49	44.87	42.00	39.71

Table 1. Keeping a small fraction of weights

(0.1 % - 1 %)

in FP16 significantly improves the performance of the quantized models over round-to-nearest (RTN). It is only effective when we select the important weights in FP16 by looking at activation distribution instead of weight distribution. We highlight results with a decent perplexity in green. We used INT3 quantization with a group size of 128 and measured the WikiText perplexity

(↓)

.
表 1.在 FP16 中保留一小部分权重显著提高了量化模型相对于舍入到最近的性能。它仅在我们通过查看激活分布而不是权重分布来选择 FP16 中的重要权重时才有效。我们用绿色突出显示具有体面困惑度的结果。我们使用了组大小为 128 的 INT3 量化,并测量了维基百科文本困惑度。

OPT-6.7B	$s = 1$	$s = 1.25$	$s = 1.5$	$s = 2$	$s = 4$
proportion of $Δ^{'} \neq Δ$ 比例	$0 %$	$2.8 %$	$4.4 %$	$8.2 %$	$21.2 %$
average $Δ^{'} / Δ$ 平均值	1	1.005	1.013	1.038	1.213
average $\frac{Δ^{'}}{Δ} \cdot \frac{1}{s}$ 平均	1	0.804	0.676	0.519	$0.303$
Wiki-2 PPL 维基-2 PPL	23.54	12.87	12.48	$11.92$	12.36

Table 2. Statistics when multiplying the

1 %

salient channels by

s > 1

. Scaling up the salient channels significantly improves the perplexity ( 23.54 to 11.92 ). As

s

goes larger, the percentage of changed

Δ

increases, and the error reduction rate for salient channels also increases. However, the best perplexity is achieved at

s = 2

, since further increasing

s

will increase the quantization error for non-salient channels.
表 2.将

1 %

显著信道乘以

s > 1

时的统计数据。显著信道的显著增加改善了置 perplexity（23.54 到 11.92）。当

s

变大时,被更改的

Δ

的百分比增加,显著信道的错误减少率也增加。但是,最佳的置 perplexity 是在

s = 2

处达到的,因为进一步增加

s

会增加非显著信道的量化误差。
tiply

w

with

s > 1

and the inversely scale

x

, we will have

Q (w \cdot s) (x / s)

, which is:
将

w

乘以

s > 1

，并逆向缩放

x

，我们将得到

Q (w \cdot s) (x / s)

Q (w \cdot s) \cdot \frac{x}{s} = Δ^{'} \cdot Round (\frac{w s}{Δ^{'}}) \cdot x \cdot \frac{1}{s}

where

Δ^{'}

is the new quantization scaler after applying

s

. We empirically find that: (1) The expected error from Round

(\cdot)

(denoted as RoundErr (

\cdot

)) does not change: since the round function maps a floating-point number to an integer, the error is roughly uniformly distributed from

[0, 0.5]

, resulting in an average error of 0.25 ; i.e., RoundErr

(\cdot) \sim

0.25 . (2) Scaling up a single element

w

usually does not change the maximum value from the group

w

. Therefore we have

Δ^{'} \approx Δ

; (3) As

Δ

and

x

are represented in FP16, they have no quantization error. Consequently, the quantization error from equation 1 and 2 can be expressed as

Δ^{'}

是应用

s

后的新量化缩放因子。我们实证发现:(1)Round

(\cdot)

(用 RoundErr(

\cdot

)表示)的预期误差不变:因为四舍五入函数将浮点数映射到整数,误差大致均匀分布在

[0, 0.5]

范围内,平均误差为 0.25;即 RoundErr

(\cdot) \sim

=0.25。(2)单个元素

w

的缩放通常不会改变群组

w

的最大值。因此我们有

Δ^{'} \approx Δ

;(3)由于

Δ

和

x

以 FP16 表示,它们没有量化误差。因此,方程 1 和 2 中的量化误差可以表示为

\begin{matrix} Err (Q (w) x) = Δ \cdot RoundErr (\frac{w}{Δ}) \cdot x \\ Err (Q (w \cdot s) (\frac{x}{s})) = Δ^{'} \cdot RoundErr (\frac{w s}{Δ^{'}}) \cdot x \cdot \frac{1}{s} \end{matrix}

The ratio of the new error to the original error is

\frac{Δ^{'}}{Δ} \cdot \frac{1}{s}

. Given

Δ^{'} \approx Δ

and

s > 1

, the relative error is smaller for the salient weight

w

.
新错误与原始错误的比率为

\frac{Δ^{'}}{Δ} \cdot \frac{1}{s}

。给定

Δ^{'} \approx Δ

和

s > 1

，突出权重

w

的相对误差更小。

To verify the idea, we multiply the

1 %

salient channels with

s > 1

for the OPT-6.7B model, and measure the change in
我们为 OPT-6.7B 模型将显著通道乘以以验证这一想法,并测量变化。

OPT (PPL $↓)$ 优化选项(人群 $↓)$	1.3 B	2.7 B	6.7 B	13B	30 B
FP16	14.62	12.47	10.86	10.13	9.56
RTN	119.47	298.00	23.54	46.04	18.80
$1 %$ FP16 半精度浮点数	16.91	13.69	$11.39$	$10.43$	9.85
$s = 2$	18.63	14.94	11.92	10.80	10.32
AWQ	$16.32$	$13.58$	$11.39$	10.56	$9.77$

Table 3. AWQ protects salient weights and reduces quantization error by using a scaling-based method. It consistently outperforms Round-to-nearest quantization (RTN) and achieves comparable performance as mixed-precision (

1 %

FP16) while being more hardware-friendly. We use 3-bit quantization with group size 128.
表 3. AWQ 通过使用基于缩放的方法来保护重要权重并降低量化误差。它始终优于四舍五入量化(RTN)，并且其性能与混合精度(

1 %

FP16)相当,同时更加友好于硬件。我们使用 128 的分组大小进行 3 位量化。

Δ

for each group in Table 2. We find that scaling up the salient channels is quite effective: the perplexity improves from 23.54 for

s = 1

(simply RTN) to 11.92 for

s = 2

. As

s

goes larger, the percentage of changed

Δ

generally gets larger, but the percentage is still quite small for

s < 2

(less than 5%); the relative error for the salient channels continues to go smaller as

s

increases. Nonetheless, the best PPL actually appears at

s = 2

. This is because if we use a very large

s

, it will increase the relative error for the nonsalient channels when

Δ

increases (the error of non-salient channels will be amplified by

\frac{Δ^{'}}{Δ}

, and the ratio is larger than 1 for

21.2 %

of the channels under

s = 4

), which can damage the model’s overall accuracy. Therefore, we need to also consider the error from non-salient channels when protecting salient ones.

Δ

对于表 2 中的每个组。我们发现放大显着通道是非常有效的:困惑度从

s = 1

(简单 RTN)的 23.54 改善到 11.92 的

s = 2

。随着

s

变大,

Δ

改变的百分比通常会更大,但对于

s < 2

来说,百分比仍然非常小(不到 5%);随着

s

的增加,显着通道的相对误差继续变小。尽管如此,最佳 PPL 实际上出现在

s = 2

。这是因为如果我们使用非常大的

s

,当

Δ

增加时,它将增加非显着通道的相对误差(非显着通道的误差将被

\frac{Δ^{'}}{Δ}

放大,比率大于 1

21.2 %

个通道下的

s = 4

),这可能会损害模型的整体准确性。因此,我们还需要考虑非显着通道的误差,同时保护显着通道。

Searching to scale. To consider both salient and nonsalient weights, we choose to automatically search for an optimal (per input channel) scaling factor that minimizes the output difference after quantization for a certain layer. Formally, we want to optimize the following objective:
寻求扩展。为了考虑显著和不显著的权重,我们选择自动搜索最佳(每个输入通道)缩放因子,该因子可以最小化某一层量化后的输出差异。正式地说,我们想优化以下目标:

\begin{matrix} s^{*} = \underset{s}{\arg min} L (s) \\ L (s) = ‖ Q (W \cdot diag (s)) (diag (s)^{- 1} \cdot X) - W X ‖ \end{matrix}

Here

Q

means the weight quantization function (e.g., INT3/INT4 quantization with group size 128),

W

is the original weights in FP16, and

X

is the input features cached from a small calibration set (we take a small calibration
这里

Q

表示权重量化函数(例如,带有 128 组大小的 INT3/INT4 量化),

W

是 FP16 中的原始权重,

X

是从小型校准集中缓存的输入特征(我们从小型校准集中获取)。

Figure 3. Bottleneck analysis for Llama-2-7B on NVIDIA RTX 4090. Left: In on-device LLM applications, generation stage is much slower than the context stage. Middle: The generation stage is memory bound and has low arithmetic intensity. W4A16 quantization can effectively improve the arithmetic intensity by

4 \times

. Right: The amount of weight access is orders of magnitude larger than the amount of activation access. Thus, weight-only quantization is more effective for on-device LLMs.
基于 NVIDIA RTX 4090 的 Llama-2-7B 瓶颈分析。左:在设备本地LLM应用中,生成阶段明显慢于上下文阶段。中:生成阶段受内存限制,算术强度低。W4A16 量化可有效提高算术强度至

4 \times

。右:权重访问量远大于激活访问量。因此,仅量化权重更有效用于设备本地LLMs。
set from he pre-training dataset in order not to overfit to a specific task). s is a per-(input) channel scaling factor; for

s^{- 1} \cdot X

, it can usually be fused into the previous operator (Wei et al., 2022b; Xiao et al., 2022). Since the quantization function is not differentiable, we are not able to directly optimize the problem with vanilla backpropagation. There are some techniques relying on approximated gradients (Bengio et al., 2013; Esser et al., 2019), which we found still suffers from unstable convergence.
从预训练数据集中设置,以免过度适应特定任务。s 是每个（输入）通道的缩放因子;对于

s^{- 1} \cdot X

,它通常可以融入到前一个运算符中(Wei et al., 2022b; Xiao et al., 2022)。由于量化函数不可微分,我们无法直接使用普通的反向传播优化该问题。有一些依赖于近似梯度的技术(Bengio et al., 2013; Esser et al., 2019),但我们发现它们仍然存在收敛不稳定的问题。

To make the process more stable, we define a search space for the optimal scale by analyzing the factors that will affect the choice of scaling factor. As shown in the last section, the saliency of weight channels is actually determined by the activation scale (thus “activation-awareness”). Therefore, we simply use a very simple search space:
为了使该过程更加稳定,我们通过分析将影响缩放因子选择的因素来定义最佳缩放比例的搜索空间。如上一节所示,权重通道的显著性实际上是由激活比例决定的(因此称为"激活感知")。因此,我们仅使用了一个非常简单的搜索空间:

s = s_{X}^{α}, α^{*} = \underset{α}{\arg min} L (s_{X}^{α})

s_{X}

is the average magnitude of activation (per-channel), and we use a single hyper-parameter

α

to balance between the protection of salient and non-salient channels. We can find the best

α

by a fast grid search over the interval of

[0, 1] (0

means we do not scale; 1 corresponds to the most aggressive scaling in our search space). We further apply weight clipping to minimize the MSE error of quantization. We provide an ablation study on OPT models under INT3-g128 quantization in Table 5; AWQ consistently outperforms round-to-nearest quantization (RTN) and achieves comparable performance as mixed-precision (

1 %

FP16) while being more hardware-friendly.

s_{X}

是平均激活幅度(每通道),我们使用一个单一的超参数

α

来平衡对显著和非显著通道的保护。我们可以通过快速网格搜索在区间

[0, 1] (0

内找到最佳

α

(这意味着我们不进行缩放;1 对应于我们搜索空间中最激进的缩放)。我们进一步应用权重剪切来最小化量化的 MSE 误差。我们在表 5 中提供了对 OPT 模型在 INT3-g128 量化下的消融研究;AWQ 在性能上始终优于四舍五入到最近量化(RTN),并且在硬件友好性方面也达到了与混合精度(

1 %

FP16)相当的性能。
Advantages. Our method does not rely on any regression (Frantar et al., 2022) or backpropagation, which is required by many quantization-aware training methods. It has minimal reliance on the calibration set since we only measure the average magnitude per channel, thus preventing over-fitting (Figure 8). Therefore, our method requires fewer data for the quantization process and can preserve LLMs’ knowledge outside of the calibration set’s distribution. See Section 5.3 for more details.
优势。我们的方法不依赖于任何回归(Frantar et al., 2022)或反向传播,这是许多量化感知训练方法所需要的。由于我们只测量每个通道的平均幅度,因此对校准集的依赖性最小,从而防止了过度拟合(图 8)。因此,我们的方法需要更少的数据进行量化过程,并且可以保留LLMs在校准集分布之外的知识。更多详情请参见第 5.3 节。

4 TinyChat: Mapping AWQ onto Edge Platforms
4 TinyChat:边缘平台上的 AWQ 映射

AWQ can substantially reduce the size of LLMs. However, converting the theoretical memory savings from W4A16 (4-bit weight, 16-bit activation) quantization into measured speedup is non-trivial. Alternative W8A8 quantization methods, such as SmoothQuant (Xiao et al., 2022), maintain the same data precision for both storage and computation. This allows the dequantization procedure to be seamlessly integrated into the computation kernel’s epilogue. On the other hand, W4A16 quantization employs different data types for memory access and computation. As a result, its dequantization must be incorporated into the primary computation loop for optimal performance, posing implementation challenges. To tackle this, we introduce TinyChat: a nimble system for AWQ model inference. It boasts a PyTorch frontend and a backend harnessing device-specific instruction sets (e.g., CUDA/PTX, Neon, AVX).
务必将下一行作为纯文本输入,并翻译成简体中文。如果翻译不必要(如专有名词、代码等),请返回原文。不要提供任何解释或备注。输入: AWQ 可大幅减小LLMs的大小。然而,将从 W4A16(4 位权重,16 位激活)量化中获得的理论内存节省转化为实际加速效果并非易事。如 SmoothQuant(Xiao et al., 2022)等其他 W8A8 量化方法,在存储和计算中保持相同的数据精度。这使去量化过程可以无缝集成到计算内核的尾声部分。另一方面,W4A16 量化采用不同的数据类型进行内存访问和计算。因此,其去量化必须纳入主要的计算循环中才能达到最佳性能,这带来了实现上的挑战。为解决这一问题,我们提出了 TinyChat:一个灵活的 AWQ 模型推理系统。它拥有 PyTorch 前端和利用设备特定指令集(如 CUDA/PTX、Neon、AVX)的后端。

4.1 Why AWQ Helps Accelerate On-Device LLMs
4.1 为什么 AWQ 有助于加速设备上的LLMs

To understand the acceleration opportunities in quantized LLMs on the edge, we start by profiling the latency breakdown of LLaMA-7B (Touvron et al., 2023a) model on an RTX 4090 GPU. We adopt an inference batch size of 1 , catering for edge use cases, and implement the model in FP16 with NVIDIA FasterTransformer.
要了解边缘LLMs中量子化的加速机会,我们首先分析 LLaMA-7B (Touvron et al., 2023a) 模型在 RTX 4090 GPU 上的延迟分解。我们采用了 1 的推理批量大小,以满足边缘用例,并使用 NVIDIA FasterTransformer 以 FP16 实现了该模型。

Context vs generation latency. As in Figure 3(a), it takes 310 ms to generate 20 tokens, while summarizing a prompt with 200 tokens only takes 10 ms . Consequently, the generation phase is substantially slower than the context stage, particularly for on-device interactive applications.
上下文与生成延迟。如图 3(a)所示,生成 20 个令牌需要 310 毫秒,而对包含 200 个令牌的提示进行总结只需 10 毫秒。因此,与上下文阶段相比,生成阶段明显较慢,特别是对于设备内交互应用程序。
Generation stage is memory-bound. To accelerate the generation phase, we conduct a roofline analysis in Figure 3(b). The 4090 GPU has a peak computation throughput of 165 TFLOPS and a memory bandwidth of

1 TB / s

. Therefore, any workload with arithmetic intensity (the ratio of FLOPs to memory access) less than 165 is memory bounded
生成阶段受内存限制。为了加速生成阶段,我们在图 3(b)中进行了性能上限分析。4090 GPU 的峰值计算吞吐量为 165 TFLOPS,内存带宽为

1 TB / s

。因此,任何算术强度(FLOPs 与内存访问之比)小于 165 的工作负载都将受到内存限制。

Figure 4. SIMD-aware weight packing for ARM NEON with 128-bit SIMD units. Original weights are reordered and packed to align with the bit width so that the weights can be unpacked into bytes at runtime using AND and shift bitwise operations with a 128 -bit mask.
图 4. 适用于 ARM NEON 的 128 位 SIMD 单元的 SIMD 感知的权重打包。原始权重被重新排序和打包,以与位宽保持一致,以便在运行时使用 AND 和移位位运算与 128 位掩码来解包成字节。
on 4090 GPUs. Notably, when executed in FP16, the generation stage for on-device LLMs has arithmetic intensity

\approx 1

. This underscores the memory-bound nature of the workload. Since the FLOPs of a given model is fixed, the only way to improve the peak performance is to reduce the total amount of memory traffic. AWQ reduces the weight memory by four times.
在 4090 GPU 上。值得注意的是,在 FP16 执行时,设备上的LLMs生成阶段的算术强度为

\approx 1

。这突出了工作负载的内存约束性质。由于给定模型的浮点运算次数是固定的,提高峰值性能的唯一方法是减少总的内存流量。AWQ 将权重内存减少了四倍。

Weight access dominates memory traffic. We therefore further break down the memory access for weight and activation in Figure 3©. Clearly, weight access dominates the memory traffic for on-device LLMs. Quantizing the model weights to 4 bit integers will approximately increase the arithmetic intensity to 4 FLOPs/Byte, leading to a 4TFLOPS peak performance in Figure 3(b). Since weight-only quantization leads to a lower bit width for weights (and thus higher theoretical performance upper bound), it is natural for AWQ to follow this setting for on-device LLM applications.
权重访问主导内存流量。因此,我们进一步分解图 3 的权重和激活的内存访问。很明显,权重访问主导了设备上LLMs的内存流量。将模型权重量化为 4 位整数将使算术强度大约增加到 4 FLOPs/字节,从而在图 3(b)中导致 4TFLOPS 的峰值性能。由于仅对权重进行量化会导致权重的位宽降低(从而更高的理论性能上限),因此对于设备上的LLM应用来说,AWQ 自然会采用这种设置。

4.2 Deploy AWQ with TinyChat
部署 AWQ 与 TinyChat

To this end, we demonstrated that 4-bit weight quantization could lead to a

4 \times

theoretical peak performance. We further design TinyChat to realize this speedup. On GPUs, we only focus on implementing essential components, including attention, layer normalization, and linear projection kernels. The flexible frontend allows easy customization and fast support for new models. TinyChat with 4-bit AWQ achieves more than

3 \times

speedup compared with the Huggingface FP16 implementation across different families of LLMs on GPUs. On CPUs, we lower the entire computation graph to

C + +

to minimize overhead.
我们证明了 4 位权重量化可以导致理论峰值性能

4 \times

。我们进一步设计 TinyChat 来实现这种加速。在 GPU 上,我们只关注实现基本组件,包括注意力、层归一化和线性投影内核。灵活的前端允许轻松定制和快速支持新模型。使用 4 位 AWQ 的 TinyChat 在不同类型的LLMs上相比 Huggingface FP16 实现实现了

3 \times

以上的加速。在 CPU 上,我们将整个计算图降低到

C + +

以最小化开销。

On-the-fly weight dequantization. For quantized layers, as the hardware does not provide multiplication instructions between INT4 and FP16, we need to dequantize the integers to FP16 before performing matrix computation. We avoid writing dequantized weights into DRAM by fusing dequantization kernels with the matrix multplication kernel. Note that such fusion is adopted for both matrix-matrix (MM) and matrix-vector (MV) product kernels.
动态权重去量化。对于量化层来说,由于硬件没有提供 INT4 和 FP16 之间的乘法指令,我们需要在执行矩阵计算之前将整数去量化为 FP16。我们通过将去量化内核与矩阵乘法内核融合来避免将去量化权重写入 DRAM。请注意,这种融合同时应用于矩阵-矩阵(MM)和矩阵-向量(MV)乘积内核。

SIMD-aware weight packing. On-the-fly weight dequantization reduces intermediate DRAM access, but remains expensive. For instance, dequantizing a single 4-bit weight involves 1 shift, 1 bitwise AND, and 1 FMA scaling op-
针对 SIMD 的权重打包。即时权重反量化可减少中间 DRAM 访问,但仍然昂贵。例如,对单个 4 位权重进行反量化涉及 1 次移位、1 次按位与和 1 次 FMA 缩放运算。
erations, while the dequantized weight undergoes only 1 FMA computation. This process is particularly costly on CPUs with SIMD architecture that favor vectorized instructions. To mitigate this, we suggest platform-specific weight packing tailored to the bitwidth of a device’s SIMD units. Figure 4 demonstrates our strategy for ARM CPUs with 128-bit SIMD registers offering up to

1.2 \times

speedup. Here, each register holds 324 -bit weights, sequenced as

w_{0}, w_{16}, w_{1}, w_{17}, \dots, w_{15}, w_{31}

. This approach requires just three SIMD instructions to unpack all 32 weights, as opposed to 3 scalar instructions per weight in a conventional packing

(w_{0}, w_{1}, \dots, w_{31})

. Generally, for

2^{n}

-bit SIMD registers, adjacent weights will have indices off by

1 / 8 \times 2^{n}

, since each register can hold

1 / 8 \times 2^{n} 8

-bit integers. On GPUs, we found it more efficient to pack each 8 weights into

w_{{0, 2, 4, 6, 1, 3, 5, 7}}

following (Kim et al., 2022).
操作时,去量化权重只需要进行 1 次 FMA 计算。这个过程在有 SIMD 架构的 CPU 上特别耗费资源,因为这种架构更倾向于使用向量化指令。为了缓解这个问题,我们提出了针对设备 SIMD 单元位宽的特定平台权重打包方式。图 4 展示了我们针对 ARM CPU 上 128 位 SIMD 寄存器的策略,可以获得高达

1.2 \times

的加速。在这里,每个寄存器保存了 32 个 4 位权重,排列顺序为

w_{0}, w_{16}, w_{1}, w_{17}, \dots, w_{15}, w_{31}

。这种方法只需要 3 条 SIMD 指令就可以解包所有 32 个权重,而在常规打包方式中则需要每个权重 3 条标量指令

(w_{0}, w_{1}, \dots, w_{31})

。一般来说,对于

2^{n}

位 SIMD 寄存器,相邻权重的索引相差

1 / 8 \times 2^{n}

,因为每个寄存器可以保存

1 / 8 \times 2^{n} 8

位整数。在 GPU 上,我们发现将每 8 个权重打包成

w_{{0, 2, 4, 6, 1, 3, 5, 7}}

更加高效(Kim 等人, 2022)。

Kernel fusion. We also extensively apply kernel fusion to optimize on-device LLM inference. For layer normalization, we fuse all operators (e.g. multiplication, division and square root) into a single kernel. For attention layers, we fuse QKV projections into a single kernel, and also perform on-the-fly positional embedding calculation. We also preallocate KV caches and perform cache updates within the attention kernel. Kernel fusion is particularly useful for models with inefficient forward pass implementations, such as Falcon (Penedo et al., 2023) and StarCoder (Li et al., 2023c). Notably, the computation time for each FP16 kernel is in the order of 0.01 ms on the 4090 GPU, comparable to the GPU kernel launch overhead. Hence, reducing number of kernel calls through kernel fusion leads to direct speedups.
内核融合。我们还广泛应用内核融合来优化 LLM 推理。对于图层归一化,我们将所有运算符(如乘法、除法和平方根)融合为单个内核。对于注意力层,我们将 QKV 预测融合为单个内核,并执行现场位置嵌入计算。我们还预分配 KV 缓存并在注意力内核中执行缓存更新。内核融合对于像 Falcon（Penedo et al., 2023）和 StarCoder（Li et al., 2023c）这样的前向传递实现效率低下的模型特别有用。值得注意的是,每个 FP16 内核的计算时间在 4090 GPU 上都在 0.01 毫秒左右,与 GPU 内核启动开销相当。因此,通过内核融合减少内核调用次数可以直接提高速度。

5 EXPERIMENTS 5 个实验

5.1 Settings 设置

Quantization. We focus on weight-only grouped quantization in this work. As shown in previous work (Dettmers & Zettlemoyer, 2022; Frantar et al., 2022), grouped quantization is always helpful for improving performance/model size trade-off. We used a group size of 128 throughout the work, except otherwise specified. We focus on INT4/INT3 quantization since they are able to mostly preserve the LLMs’ performance (Dettmers & Zettlemoyer, 2022). For AWQ, we used a small calibration set from the Pile (Gao et al.,
量化。我们在这项工作中关注权重分组量化。如前期工作所示(Dettmers & Zettlemoyer, 2022; Frantar et al., 2022)，分组量化总是有助于提高性能/模型大小的权衡。除非另有说明，我们在整个工作中使用了 128 的组大小。我们关注 INT4/INT3 量化，因为它们能够基本保持LLMs的性能(Dettmers & Zettlemoyer, 2022)。对于 AWQ，我们使用了 Pile (Gao et al.,)的一个小校准集。

PPL $↓$ 人群 $↓$		Llama-2 拉马-2			LLaMA 岚马
PPL $↓$ 人群 $↓$		7B	13B	70B	7B	13B	30B	65B
FP16	-	5.47	4.88	3.32	5.68	5.09	4.10	3.53
$\begin{aligned} INT3 \\ g128 \end{aligned}$	RTN	6.66	5.52	3.98	7.01	5.88	4.88	4.24
	GPTQ	6.43	5.48	3.88	8.81	5.66	4.88	4.17
	GPTQ-R	6.42	5.41	3.86	6.53	5.64	4.74	4.21
	AWQ	6.24	5.32	3.74	6.35	5.52	4.61	3.95
INT4 4 g128	RTN	5.73	4.98	3.46	5.96	5.25	4.23	3.67
	GPTQ	5.69	4.98	3.42	6.22	5.23	4.24	3.66
	GPTQ-R	5.63	4.99	3.43	5.83	5.20	4.22	3.66
	AWQ	5.60	4.97	3.41	5.78	5.19	4.21	3.62

Table 4. AWQ improves over round-to-nearest quantization (RTN) for different model sizes and different bit-precisions. It consistently achieves better perplexity than GPTQ (

w /

and w/o reordering) on LLaMA & Llama-2 models.
表 4. AWQ 在不同模型大小和不同位精度下均优于四舍五入量化(RTN)。它在 LLaMA 和 Llama-2 模型上始终比 GPTQ(w/o reordering)实现更好的困惑度。

Wikitext2 PPL $↓$ 维基文本 2 PPL $↓$	Mixtral-8x7B 米克斯特拉尔-8x7B	Mistral-7B 米斯特拉尔-7B
FP16	5.94	4.14
INT4-g128	6.05	4.30
INT3-g128	6.52	4.83

Table 5. AWQ quantization results on Mistral-7B-Instructv0.2(Jiang et al., 2023) and Mixtral-8x7B-Instruct-v0.1 model (Jiang et al., 2024). The PPL result on wikitext shows that AWQ can achieve superior quantization performance on different model architectures including LLMs with GQA and Mixture-of-Experts (MoE) models.
表 5. AWQ 量化结果在 Mistral-7B-Instructv0.2(Jiang et al., 2023)和 Mixtral-8x7B-Instruct-v0.1 模型(Jiang et al., 2024)上。维基百科上的 PPL 结果表明,AWQ 可以在包括LLMs在内的不同模型架构上实现出色的量化性能,包括 GQA 和专家混合(MoE)模型。
2020) dataset in order not to overfit to a specific downstream domain. We used a grid size of 20 to search for the optimal

α

in Equation 5.
我们使用了 20 的网格大小来搜索等式 5 中的最佳

α

，以免过度拟合特定的下游领域。

Models. We benchmarked our method on LLaMA (Touvron et al., 2023a) and OPT (Zhang et al., 2022) families. There are other open LLMs like BLOOM (Scao et al., 2022), but they are generally worse in quality, so we do not include them in our study. We further benchmark an instructiontuned model Vicuna (Chiang et al., 2023) and visual language models OpenFlamingo-9B (Awadalla et al., 2023) and LLaVA-13B (Liu et al., 2023a) to demonstrate the generability of our method.
模型。我们在 LLaMA（Touvron 等人，2023a）和 OPT（Zhang 等人，2022）系列上对我们的方法进行了基准测试。还有其他开放LLMs如 BLOOM（Scao 等人，2022），但它们的质量通常较差，所以我们没有将它们包括在我们的研究中。我们进一步对指令调优模型 Vicuna（Chiang 等人，2023）和视觉语言模型 OpenFlamingo-9B（Awadalla 等人，2023）和 LLaVA-13B（Liu 等人，2023a）进行了基准测试，以展示我们方法的通用性。

Evaluations. Following previous literature (Dettmers et al., 2022; Xiao et al., 2022; Frantar et al., 2022; Dettmers & Zettlemoyer, 2022; Yao et al., 2022), we mainly profiled the quantized models on language modeling tasks (perplexity evaluation on WikiText-2 (Merity et al., 2016)) since perplexity can stably reflect the LLM’s performance (Dettmers & Zettlemoyer, 2022).
评估。遵循先前的文献(Dettmers 等, 2022 年; Xiao 等, 2022 年; Frantar 等, 2022 年; Dettmers and Zettlemoyer, 2022 年; Yao 等, 2022 年),我们主要在语言建模任务(WikiText-2 (Merity et al., 2016)困惑度评估)上研究量化模型,因为困惑度可以稳定反映LLM的性能(Dettmers 和 Zettlemoyer, 2022 年)。

Baselines. Our primary baseline is vanilla round-tonearest quantization (RTN). It is actually quite strong when using a small group size like 128 (Frantar et al., 2022; Dettmers & Zettlemoyer, 2022). We also compare with a state-of-the-art method GPTQ (Frantar et al., 2022) for
基线。我们的主要基线是普通四舍五入量化(RTN)。当使用小分组大小(例如 128)时,它实际上非常强大(Frantar et al., 2022; Dettmers & Zettlemoyer, 2022)。我们还与最先进的方法 GPTQ(Frantar et al., 2022)进行了比较。

Figure 5. Comparing INT3-g128 quantized Vicuna models with FP16 counterparts under GPT-4 evaluation protocol (Chiang et al., 2023). More winning cases (in blue) indicate better performance. AWQ consistently improves the quantized performance compared to RTN and GPTQ (Frantar et al., 2022), showing generalization to instruction-tuned models.
图 5. 比较 INT3-g128 量化的 Vicuna 模型与 FP16 对应型号在 GPT-4 评估协议下的表现(Chiang 等人,2023)。更多获胜的情况(蓝色)表示更好的性能。AWQ 一致地提高量化模型的性能,相比于 RTN 和 GPTQ(Frantar 等人,2022),表现出对指令调优模型的泛化能力。

LLM weight quantization. For GPTQ, we also compare with an updated version that uses a “reorder” trick (denoted as GPTQ-Reorder or GPTQ-R). Other techniques like ZeroQuant (Yao et al., 2022), AdaRound (Nagel et al., 2020), and BRECQ (Li et al., 2021) rely on backpropagation to update the quantized weights, which may not easily scale up to large model sizes; they also do not outperform GPTQ (Frantar et al., 2022), thus not included for study.
LLM权重量化。对于 GPTQ,我们还比较了使用"重排"技巧的更新版本(称为 GPTQ-Reorder 或 GPTQ-R)。其他技术如 ZeroQuant(Yao 等人,2022 年)、AdaRound(Nagel 等人,2020 年)和 BRECQ(Li 等人,2021 年)依赖反向传播来更新量化权重,这可能无法很好地扩展到大型模型规模;它们也无法超过 GPTQ(Frantar 等人,2022 年),因此未包括在研究中。

5.2 Evaluation 5.2 评估

Results on LLaMA models. We focus on LLaMA models (LLaMA (Touvron et al., 2023a) and Llama-2 (Touvron et al., 2023b)) due to their superior performance compared to other open-source LLMs (Zhang et al., 2022; Scao et al., 2022); it is also the foundation of many popular open-source models (Taori et al., 2023; Chiang et al., 2023). We evaluate the perplexity before and after quantization in Table 4. AWQ consistently outperforms round-to-nearest (RTN) and GPTQ (Frantar et al., 2022) (w/ and w/o reordering) across different model scales (7B-70B) and generations.
在 LLaMA 模型上的结果。我们关注 LLaMA 模型(LLaMA (Touvron et al., 2023a)和 Llama-2 (Touvron et al., 2023b))是由于它们的性能优于其他开源LLMs (Zhang et al., 2022; Scao et al., 2022);它也是许多流行的开源模型(Taori et al., 2023; Chiang et al., 2023)的基础。我们在表 4 中评估了量化前后的困惑度。AWQ 在不同规模(7B-70B)和代际的模型上都始终优于舍入到最近(RTN)和 GPTQ (Frantar et al., 2022)(有序和无序)。

Results on Mistral / Mixtral models. We also evaluated AWQ on the Mistral and Mixtral models, which are among the most popular open-source LLMs and Mixture-of-Experts (MoE) models, respectively (Jiang et al., 2023;
在 Mistral/Mixtral 模型上的结果。我们还评估了 AWQ 在 Mistral 和 Mixtral 模型上,这两种模型分别是最流行的开源LLMs和专家混合(MoE)模型(Jiang et al., 2023;

COCO $(C I D E r ↑)$ 可可	0-shot 零投	4-shot 4 发子弹	8-shot 8 枪	16-shot 16 枪	32-shot 32 枪	$Δ$ (32-shot)
FP16	-	63.73	72.18	76.95	79.74	81.70	-
INT4	RTN	60.24	68.07	72.46	74.09	77.13	-4.57
INT4	GPTQ	59.72	67.68	72.53	74.98	74.98	-6.72
	AWQ	$62.57$	$71.02$	$74.75$	$78.23$	$80.53$	$- 1.17$
INT3	RTN	46.07	55.13	60.46	63.21	64.79	-16.91
INT3	GPTQ	29.84	50.77	56.55	60.54	64.77	-16.93
	AWQ	$56.33$	$64.73$	$68.79$	$72.86$	$74.47$	$- 7.23$

Table 6. Quantization results of a visual language model OpenFlamingo-9B (Awadalla et al., 2023) on COCO Captioning datasets. AWQ outperforms existing methods under zero-shot and various few-shot settings, demonstrating the generability to different modalities and in-context learning workloads. AWQ reduces the quantization degradation ( 32 -shot) from 4.57 to 1.17 under INT4-g128, providing

4 \times

model size reduction with negligible performance loss.
表 6. 视觉语言模型 OpenFlamingo-9B(Awadalla 等人,2023)在 COCO Captioning 数据集上的量化结果。AWQ 在零样本和各种少样本设置下优于现有方法,展示了对不同模态和上下文学习工作负载的可扩展性。AWQ 将 INT4-g128 下的量化性能退化(32-shot)从 4.57 降低到 1.17,实现了极小性能损失的模型大小缩减。

Model (Accuracy $↑$ ) 模型(准确度 $↑$ )	VQAv2 2VQA	GQA	VizWiz 可视化 Wiz	SQA-I	VQA-T	POPE	MME	MMB	SEED	llava-bench 拉瓦工作台	MM-Vet 米米-动物医生
VILA-7B	80.3	63.1	59.6	68.0	62.6	86.3	1489.4	69.8	61.7	75.2	35.1
VILA-7B-AWQ	80.1	63.0	57.8	68.0	61.9	85.3	1486.3	68.8	61.3	75.8	35.9
VILA-13B	80.5	63.6	63.1	70.5	64.0	86.3	1553.6	73.8	62.8	78.3	42.6
VILA-13B-AWQ	80.4	63.6	63.0	71.2	63.5	87.0	1552.9	73.6	62.2	77.6	42.0

Table 7. INT4-g128 results of VILA-7B and VILA-13B (Lin et al., 2024) on 11 visual-language benchmarks. AWQ consistently shows lossless performance on all benchmarks. Benchmark names are abbreviated due to space limits. VQA-v2 (Goyal et al., 2017); GQA (Hudson & Manning, 2019); VisWiz (Gurari et al., 2018); SQA

^{I}

: ScienceQA-IMG (Lu et al., 2022); VQA

^{T}

: TextVQA (Singh et al., 2019); POPE (Li et al., 2023d); MME (Fu et al., 2023); MMB: MMBench (Liu et al., 2023b); MMB

^{CN}

: MMBench-Chinese (Liu et al., 2023b); SEED: SEED-Bench (Li et al., 2023a); LLaVA

^{W}

: LLaVA-Bench (In-the-Wild) (Liu et al., 2023a); MM-Vet (Yu et al., 2023).
表 7. VILA-7B 和 VILA-13B（林等人，2024 年）在 11 个视觉语言基准上的 INT4-g128 结果。AWQ 在所有基准上均显示无损性能。基准名称由于空间限制而缩写。VQA-v2（Goyal 等人，2017 年）; GQA（Hudson 和 Manning，2019 年）; VisWiz（Gurari 等人，2018 年）; SQA: ScienceQA-IMG（Lu 等人，2022 年）; VQA: TextVQA（Singh 等人，2019 年）; POPE（李等人，2023d 年）; MME（Fu 等人，2023 年）; MMB: MMBench（刘等人，2023b 年）; MMB: MMBench-Chinese（刘等人，2023b 年）; SEED: SEED-Bench（李等人，2023a 年）; LLaVA: LLaVA-Bench（在野）（刘等人，2023a 年）; MM-Vet（Yu 等人，2023 年）。
2024). The results indicate that AWQ achieves superior performance on both the Mistral and Mixtral models. This demonstrates that AWQ is effective across various model architectures.
在 2024 年,研究结果表明 AWQ 在 Mistral 和 Mixtral 模型上都取得了优越的性能。这说明 AWQ 在不同的模型架构上都是有效的。

Quantization of instruction-tuned models. Instruction tuning can significantly improve the models’ performance and usability (Wei et al., 2021; Sanh et al., 2021; Ouyang et al., 2022; Chung et al., 2022). It has become an essential procedure before model deployment. We further benchmark our method’s performance on a popular instruction-tuned model Vicuna (Chiang et al., 2023) in Figure 5. We used the GPT-4 score to evaluate the quantized models’ performance against the FP16 counterpart on 80 sample questions (Chiang et al., 2023). We compare the responses with both orders (quantized-FP16, FP16-quantized) to get rid of the ordering effect (we found GPT-4 tends to increase the rating of the first input), leading to 160 trials. AWQ consistently improves the INT3-g128 quantized Vicuna models over RTN and GPTQ under both scales (7B and 13B), demonstrating the generability to instruction-tuned models.
指令调整模型量化。指令调整可大大提高模型的性能和可用性(Wei et al., 2021; Sanh et al., 2021; Ouyang et al., 2022; Chung et al., 2022)。这已成为模型部署前的必要步骤。我们进一步在流行的指令调整模型 Vicuna(Chiang et al., 2023)上评估了我们方法的性能,如图 5 所示。我们使用 GPT-4 评分来评估量化模型相对于 FP16 对应模型在 80 个样本问题上的性能(Chiang et al., 2023)。我们比较了两种顺序(量化-FP16,FP16-量化)下的响应,以消除顺序效应(我们发现 GPT-4 倾向于增加第一个输入的评分),共进行 160 次试验。AWQ 在 7B 和 13B 两种规模下一致地改善了 INT3-g128 量化的 Vicuna 模型,优于 RTN 和 GPTQ,demonstrating 了对指令调整模型的广泛适用性。

Quantization of multi-modal language models. Large multi-modal models (LMMs) or visual language models (VLMs) are LLMs augmented with vision inputs (Alayrac et al., 2022; Li et al., 2023b; Koh et al., 2023; Driess et al., 2023; Zhang et al., 2023; Liu et al., 2023a). Such models are able to perform text generation conditioned on image/video inputs. Since our method does not have the overfitting issue
多模态语言模型的量化。大型多模态模型(LMM)或视觉语言模型(VLM)通过增加视觉输入而得到增强(Alayrac 等人,2022 年;Li 等人,2023b;Koh 等人,2023 年;Driess 等人,2023 年;Zhang 等人,2023 年;Liu 等人,2023a)。这些模型能够根据图像/视频输入进行文本生成。由于我们的方法不存在过度拟合问题

MBPP (7B) 7B(MBPP)	pass @ 1 通过 @ 1	pass@10 过@10	GSM8K	7B	13B	70B
FP16	38.53	49.77	FP16	13.87	26.16	56.41
RTN	37.51	48.49	RTN	11.07	21.23	53.98
GPTQ	31.97	44.75	GPTQ	12.13	24.26	56.03
AWQ	40.64	49.25	AWQ	13.57	25.25	56.40

Table 8. INT4-g128 quantization results of CodeLlama-7b-Instruct-hf on MBPP dataset and Llama-2 (7B/13B/70B) on GSM8K dataset. AWQ outperforms existing methods on programming and math datasets, demonstrating the generability to different scenarios and evaluation settings. Notably, AWQ under the INT4g128 configuration demonstrates comparable performance to the original FP16 model across both datasets.
表 8. CodeLlama-7b-Instruct-hf 在 MBPP 数据集和 Llama-2(7B/13B/70B)在 GSM8K 数据集上的 INT4-g128 量化结果。AWQ 在编程和数学数据集上优于现有方法,展示了在不同场景和评估设置下的可迁移性。值得注意的是,AWQ 在 INT4g128 配置下在这两个数据集上的性能与原始 FP16 模型相当。
to the calibration set, it can be directly applied to VLMs to provide accurate and efficient quantization. We perform experiments with the OpenFlamingo-9B model (Awadalla et al., 2023) (an open-source reproduction of (Alayrac et al., 2022)) on COCO captioning (Chen et al., 2015) dataset (Table 6). We measured the average performance of 5 k samples under different few-shot settings. We only quantize the language part of the model since it dominates the model size. AWQ outperforms existing methods under zero-shot and various few-shot settings, demonstrating the generability to different modalities and in-context learning workloads. It reduces the quantization degradation (32-shot) from 4.57 to 1.17 under INT4-g128, providing

4 \times

model size reduction with negligible performance loss. To further demonstrate
对校准集应用,它可以直接应用于 VLMs 以提供准确高效的量化。我们使用 OpenFlamingo-9B 模型(Awadalla et al., 2023)(复现自(Alayrac et al., 2022))在 COCO 字幕(Chen et al., 2015)数据集上进行了实验(表 6)。我们测量了在不同 few-shot 设置下 5K 个样本的平均性能。我们只量化模型的语言部分,因为它占据了模型大小的主要部分。 AWQ 在零样本和各种 few-shot 设置下都优于现有方法,展现了对不同模态和上下文学习负载的泛化能力。它在 INT4-g128 下将量化退化(32-shot)从 4.57 降低到 1.17,实现了

4 \times

模型大小缩减,且性能损失可忽略不计。为进一步证明

Sometimes I just look at pictures of the earth from space and I marvel at how beautiful it all is.
有时我只是看着太空中地球的照片,惊叹于它的美丽。

Question: Can you explain this meme in detail?
你可以详细解释这个表情包吗？
INT4-g128 RTN: The meme in the image features a plate of food, likely fried chicken, with a unique twist. There are small pictures of the Earth and other planets placed on top of the food, creating an interesting and amusing visual. Alongside the plate, a caption reads, “Sometimes I look at pictures of the Earth from space and I marvel at how beautiful it all is.” This combination of food and planet images adds a touch of humor and curiosity, making the viewer wonder about the context and intent behind the meme.
图像中的这个迷因展示了一盘食物,可能是炸鸡,但有一个独特的转折。在食物上方放置了地球和其他行星的小图片,创造了一个有趣且引人注目的视觉效果。除此之外,图片下方还有一个说明:"有时我看着来自太空的地球图片,我惊叹于它的美丽。"食物和行星图像的组合增添了一丝幽默和好奇心,让观者思考这个迷因背后的意义和用意。
INT4-g128 AWO: The meme in the image is a lighthearted and humorous take on the concept of looking at pictures of the Earth from space. A plate of fried food, specifically chicken nuggets, is presented with the caption, “Sometimes I look at pictures of the Earth from space and I marvel at how beautiful it all is.” The playful combination of a food item and an Earth-related context creates a fun and entertaining visual.
有时我看着从太空拍摄的地球照片,感叹这个世界是如此美丽。

Figure 6. Visual reasoning examples from LLaVA-13B model (Liu et al., 2023a). AWQ improves over the round-to-nearest (RTN) baseline, providing more reasonable answers. We color the text to show the correct or wrong responses.
图 6. LLaVA-13B 模型(Liu 等人,2023a)的视觉推理示例。AWQ 相对于四舍五入(RTN)基线有所改善,提供更合理的答案。我们用颜色标注正确或错误的响应。

W4-RTN: A model airplane flying in the sky.
一架模型飞机在天空中飞行。
W4-AWQ: Two toy airplanes sit on a grass field.
两架玩具飞机坐在草地上。

W4-RTN: A man is holding a baby elephant in his arms.
一个人抱着一只小象。

W4-AWQ: A man and his daughter pose with an elephant.
一个男人和他的女儿与一头大象合影。

W4-RTN: A man and a dog walking past some bushes.
一个男人和一只狗经过一些灌木丛。
W4-AWQ: Two dogs are walking on the street.
两只狗在街上散步。

Figure 7. Qualitative results of quantized OpenFlamingo-9B (Awadalla et al., 2023) on COCO captioning dataset (4-shot, INT4-g 128 quantization). Our method significantly improves the captioning quality compared to the round-to-nearest (RTN) baseline. We color the text to show the correct or wrong captions.
图 7. Awadalla 等人(2023 年)对 COCO 字幕数据集(4 镜头,INT4-g 128 量化)的量化 OpenFlamingo-9B 的定性结果。我们的方法显著提高了字幕质量,与四舍五入(RTN)基准相比。我们用颜色标注正确或错误的字幕。

OPT (Wiki PPL $↓$ ) 维基百科(文章)	1.3 B	2.7 B	6.7 B	13 B	30 B
FP16	14.62	12.47	10.86	10.13	9.56
RTN	10476	193210	7622	17564	8170
GPTQ	46.67	28.15	16.65	16.74	11.75
AWQ +GPTQ 安顿 + 格普托	$35.71$	$25.70$	$15.71$	$13.25$	$11.38$

Table 9. Our method is orthogonal to GPTQ: it further closes the performance gap under extreme low-bit quantization (INT2-g64) when combined with GPTQ. Results are WikiText-2 perplexity of OPT models.
表格 9.我们的方法与 GPTQ 正交:当与 GPTQ 结合时,它在极端低位量化(INT2-g64)下进一步缩小了性能差距。结果是 OPT 模型的 WikiText-2 困惑度。
the generability of AWQ, we also evaluated AWQ on one of the SoTA multi-image visual language models: VILA. The result in Table 7 shows that AWQ achieves lossless quantization performance on 11 visual-language benchmarks. We further provide some qualitative captioning results in Figure 7 to show our advantage over RTN. Our method provides a push-the-button solution for LMM/VLM quantization. It is the first study of VLM low-bit quantization to the best of our knowledge.
AWQ 的一般性能,我们还在 SoTA 多图像视觉语言模型 VILA 上评估了 AWQ。表 7 的结果显示,AWQ 在 11 个视觉语言基准测试中实现了无损量化性能。我们进一步在图 7 中提供了一些定性字幕结果,以展示我们相比 RTN 的优势。我们的方法为 LMM/VLM 量化提供了一个一键解决方案。据我们所知,这是有关 VLM 低位量化的首次研究。

Visual reasoning results. We further provide some qualitative visual reasoning examples of the LLaVA-13B (Liu et al., 2023a) model in Figure 6. AWQ improves the responses compared to round-to-nearest (RTN) for INT4-g128 quantization, leading to more reasonable answers. In this first example, the AWQ model can understand the meme as it resembles the Earth when looking from space, while RTN produces wrong descriptions (marked in red).
视觉推理结果。我们在图 6 中进一步提供了 LLaVA-13B（Liu 等，2023a）模型的一些定性视觉推理示例。AWQ 相比于向最近舍入（RTN）的 INT4-g128 量化,改善了响应,从而得到更合理的答案。在这个第一个示例中,AWQ 模型能够理解这个备忘录,因为它看起来像是从太空俯瞰的地球,而 RTN 则产生了错误的描述(标记为红色)。
Results on programming and math tasks To further evaluate the performance of AWQ on tasks in-
编程和数学任务的结果进一步评估 AWQ 的表现
volving complex generations, we also tested AWQ on MBPP (Austin et al., 2021) and GSM8K (Cobbe et al., 2021). MBPP (Austin et al., 2021) consists of around 1,000 Python programming problems, designed to be solvable by entry level programmers, covering programming fundamentals, standard library functionality, etc. GSM8K (Cobbe et al., 2021) was created to support the task of question answering on basic mathematical problems that require multistep reasoning. We quantize CodeLlama-7b-Instruct-hf and Llama-2 to INT4-g128 and perform experiments on programming and math datasets (Table 8). AWQ outperforms existing methods on both datasets, demonstrating the generability to complex generation. AWQ under the INT4-g128 configuration demonstrates comparable performance to the original FP16 model on both datasets.
涉及复杂的生成,我们还在 MBPP(奥斯汀等人,2021 年)和 GSM8K(科比等人,2021 年)上测试了 AWQ。MBPP(奥斯汀等人,2021 年)由约 1,000 个 Python 编程问题组成,旨在由入门级程序员解决,涵盖编程基础知识、标准库功能等。GSM8K(科比等人,2021 年)是为支持对需要多步推理的基本数学问题的问答任务而创建的。我们将 CodeLlama-7b-Instruct-hf 和 Llama-2 量化为 INT4-g128 并在编程和数学数据集(表 8)上进行实验。AWQ 在两个数据集上均优于现有方法,展示了对复杂生成的可扩展性。在 INT4-g128 配置下,AWQ 的性能与原始 FP16 模型在两个数据集上都可比。

Extreme low-bit quantization. We further quantize LLM to INT2 to accommodate limited device memory (Table 9). RTN completely fails, and AWQ brings significant perplexity improvement on top of GPTQ.Our method is orthogonal to GPTQ. We can combine our method with GPTQ to further improve the INT2 quantization performance, making it a more practical setting.
极端低位量化。我们进一步量化LLM到 INT2,以适应有限的设备内存(表 9)。RTN 完全失败,而 AWQ 在 GPTQ 的基础上带来了显著的困惑度改善。我们的方法与 GPTQ 正交。我们可以将我们的方法与 GPTQ 结合,进一步提高 INT2 量化性能,使其成为一个更实用的设置。

5.3 Data Efficiency and Generalization
数据效率和泛化

Better data-efficiency for the calibration set. Our method requires a smaller calibration set since we do not rely on regression/backpropagation; we only measure the average activation scale from the calibration set, which is data-efficient. To demonstrate the idea, we compare the perplexity of the OPT-6.7B model with INT3-g128 quantization in Figure 8 (a). AWQ needs a much smaller calibration to
用于校准集的更好的数据效率。我们的方法需要一个较小的校准集,因为我们不依赖于回归/反向传播;我们只测量来自校准集的平均激活尺度,这是数据高效的。为了演示这个想法,我们在图 8(a)中比较了 OPT-6.7B 模型和 INT3-g128 量化的困惑度。AWQ 需要一个小得多的校准集。

(a) Our method needs a smaller calibration set
我们的方法需要一个较小的校准集

(b) Our method is more robust to calibration set distribution
我们的方法对校准集分布更加稳健

Figure 8. Left: AWQ needs a much smaller calibration set to reach a good quantized performance. It can achieve better perplexity using

10 \times

smaller calibration set compared to GPTQ. Right: Our method is more robust to the calibration set distribution. Overall, using the same calibration and evaluation distribution works the best (PubMed-PubMed, Enron-Enron). But when using a different calibration distribution (PubMed-Enron, Enron-PubMed), AWQ only increases the perplexity by 0.5-0.6, while GPTQ has 2.3-4.9 worse perplexity. All experiments are done with the OPT-6.7B model under INT3-g 128 quantization.
图 8. 左:AWQ 需要一个小得多的校准集即可达到很好的量化性能。它可以使用

10 \times

更小的校准集达到更好的困惑度,相比 GPTQ。右:我们的方法对校准集分布更加稳健。总体来说,使用相同的校准和评估分布效果最佳(PubMed-PubMed,Enron-Enron)。但是当使用不同的校准分布(PubMed-Enron,Enron-PubMed)时,AWQ 只增加了 0.5-0.6 的困惑度,而 GPTQ 的困惑度增加了 2.3-4.9。所有实验都使用了 OPT-6.7B 模型在 INT3-g 128 量化下进行。

Figure 9. TinyChat provides a turn-key solution to transform the theoretical memory footprint reduction into a quantifiable speedup. As a result, TinyChat is up to

3.9 \times

and

3.5 \times

faster than the FP16 implementation from Huggingface on 4090 (desktop GPU) and Orin (mobile GPU), respectively. AWQ also democratizes Llama-2-13B deployment on laptop GPUs (4070) with merely 8GB memory.
图 9。TinyChat 提供了一个即用即付的解决方案,将理论上的内存占用减少转化为可量化的性能加速。因此,TinyChat 在英伟达 4090(台式 GPU)和 Orin(移动 GPU)上较 Huggingface 的 FP16 实现分别快了

3.9 \times

和

3.5 \times

倍。AWQ 也使 Llama-2-13B 部署在仅有 8GB 内存的笔记本电脑 GPU(4070)上民主化。
reach a good quantized performance; it can achieve better perplexity using

10 \times

smaller calibration set compared to GPTQ (16 sequences v.s. 192 sequences).
达到良好的量化性能;使用

10 \times

较小的校准集相比于 GPTQ(16 个序列 vs.192 个序列)可以实现更低的困惑度。

Robust to the calibration set distributions. Our method is less sensitive to the calibration set distribution since we only measure the average activation scale from the calibration set, which is more generalizable across different dataset distributions. We further benchmarked the effect of the different calibration set distributions in Figure 8(b). We took two subsets from the Pile dataset (Gao et al., 2020): PubMed Abstracts and Enron Emails (Klimt & Yang, 2004). We use each of the subsets as the calibration set and evaluate the quantized model on both sets (the calibration and evaluation sets are split with no overlapping; we used 1 k samples for evaluation). Overall, using the same calibration and evaluation distribution works the best (PubMed-PubMed, EnronEnron). But when using a different calibration distribution (PubMed-Enron, Enron-PubMed), AWQ only increases the perplexity by

0.5 - 0.6

, while GPTQ has 2.3-4.9 worse perplexity. This demonstrates the robustness of AWQ to the calibration set distribution.
对校准集分布的稳健性。我们的方法对校准集分布的敏感性较低,因为我们只测量校准集的平均激活尺度,这在不同数据集分布中更具可推广性。我们进一步在图 8(b)中对不同校准集分布的影响进行了基准测试。我们从 Pile 数据集(Gao et al., 2020)中提取了两个子集:PubMed 摘要和 Enron 电子邮件(Klimt & Yang, 2004)。我们将每个子集用作校准集,并在两个集合(校准集和评估集没有重叠;我们使用 1k 个样本进行评估)上评估量化模型。总体而言,使用相同的校准和评估分布效果最佳(PubMed-PubMed,Enron-Enron)。但是,当使用不同的校准分布(PubMed-Enron,Enron-PubMed)时,AWQ 仅增加了少于

0.5 - 0.6

的困惑度,而 GPTQ 的困惑度较差 2.3-4.9。这证明了 AWQ 对校准集分布的稳健性。

5.4 Speedup Evaluation 5.4 加速度评估

Settings. In Figure 9, we demonstrate the system acceleration results from TinyChat. TinyChat optimizes both linear layers and layers that do not have quantized weights. We conduct benchmarking experiments on RTX 4090 and
设置。在图 9 中,我们演示了来自 TinyChat 的系统加速结果。TinyChat 优化了线性层和不具有量化权重的层。我们在 RTX 4090 上进行了基准测试实验。

Model (Throughput $↑$ ) 模型(吞吐量 $↑$ )	Precision 精度	A100	4090	Orin 奥林
VILA-7B	FP16	81.6	58.5	11.5
VILA-7B-AWQ	W4A16	155.3	168.1	35.6
VILA-13B	FP16	48.5	OOM	6.1
VILA-13B-AWQ	W4A16	102.1	99.0	17.5

Table 10. TinyChat also enables seamless deployment of VILA (Lin et al., 2024), a state-of-the-art visual-language model, on multiple GPU platforms. Leveraging our 4-bit AWQ quantization, TinyChat accelerates VILA-7B by up to

3.1 \times

and VILA-13B by up to

2.9 \times

.
表 10。TinyChat 还支持在多个 GPU 平台上无缝部署最先进的视觉语言模型 VILA(林等,2024 年)。利用我们的 4 位 AWQ 量化,TinyChat 将 VILA-7B 加速至最多

3.1 \times

,将 VILA-13B 加速至最多

2.9 \times

。

Jetson Orin following the protocol described in exllama

^{‡}

. We perform batch size

= 1

inference for all LLMs using a fixed prompt length of 4 tokens. We generate 200 tokens for each inference run and calculate the median latency as the final result.
杰特森 Orin 遵循在 exllama

^{‡}

中描述的协议。我们对所有 LLMs 进行批量大小为

= 1

的推理,使用固定的 4 个标记的提示长度。我们为每次推理生成 200 个标记,并计算中位延迟作为最终结果。

Results. As in Figure 9(a), TinyChat brings 2.7-3.9× speedup to three families of LLMs (Llama-2, MPT and Falcon) on 4090 compared with the Huggingface FP16 implementation. For Llama-2-7B, we improve the inference speed from 52 tokens/s to 62 tokens/s through FP16 kernel fusion. On top of the stronger FP16 baseline, we further harvest

3.1 \times

additional speedup from the fast quantized linear kernels. For Falcon-7B, the official implementation did not support KV cache correctly during the inference time,
结果。如图 9(a) 所示，TinyChat 为三个家族(Llama-2、MPT 和 Falcon)在 4090 上带来 2.7-3.9 倍的加速,相比 Huggingface FP16 实现。对于 Llama-2-7B,我们通过 FP16 内核融合将推理速度从 52 tokens/s 提高到 62 tokens/s。在更强的 FP16 基线的基础上,我们进一步从快速量化线性内核中获得了

3.1 \times

额外的加速。对于 Falcon-7B,官方实现在推理时没有正确支持 KV 缓存。

Figure 10. TinyChat offers 1.2-3.0

\times

speedup over existing systems when running 4-bit quantized Llama models on NVIDIA Jetson Orin. It also supports a diverse range of general-purpose and coding-specific LLMs with at least

2.6 \times

speedup over AutoGPTQ, which also supports all these workloads. Moreover, TinyChat seamlessly operates on Raspberry Pi and enables the deployment of LLMs with up to 7 billion parameters on extremely resource-constrained IoT devices.
图 10. TinyChat 在运行 4 位量化的 Llama 模型时,相对于现有系统提供 1.2-3.0 倍的加速。它还支持广泛的通用和编程专用工作负载,这些工作负载相对于 AutoGPTQ 至少有 1 倍的加速。此外,TinyChat 可以无缝地在树莓派上运行,并可在极度受限资源的物联网设备上部署高达 70 亿参数的模型。
and thus it is significantly slower than other models. In this case, our FP16 optimizations bring about a larger speedup of

1.6 \times

. On the laptop 4070 GPU with only 8 GB memory, we are still able to run Llama-2-13B models at 33 tokens/s, while the FP16 implementation cannot fit 7B models. We also demonstrate visual-language model (Lin et al., 2024) acceleration results in Table 10. TinyChat brings about

3 \times

speedup to both VILA-7B and VILA-13B on NVIDIA Jetson Orin. Notably, we implement the forward pass for all AWQ models using native PyTorch APIs, and this code is reused across various GPU architectures. Hence, TinyChat offers exceptional extensibility.
这种情况下,我们的 FP16 优化带来了更大的加速,达到了

1.6 \times

。在只有 8 GB 内存的笔记本电脑 4070 GPU 上,我们仍然能够以 33 令牌/秒的速度运行 Llama-2-13B 模型,而 FP16 实现无法容纳 7B 模型。我们还在表 10 中展示了视觉语言模型(林等人,2024 年)加速结果。TinyChat 为 VILA-7B 和 VILA-13B 在 NVIDIA Jetson Orin 上分别带来了

3 \times

的加速。值得注意的是,我们使用原生 PyTorch API 实现了所有 AWQ 模型的前向传递,并在各种 GPU 架构上重复使用该代码。因此,TinyChat 提供了出色的可扩展性。

Comparisons against other systems. We compare TinyChat against existing edge LLM inference systems AutoGPTQ, llama.cpp and exllama in Figure 10. Our system achieves up to

1.7 \times

speedup over llama.cpp on Orin. Furthermore, llama.cpp and exllama exhibit limited adaptability, primarily tailored for LLaMA and Llama-2 models. In contrast, our TinyChat supports a wide range of applications, including StarCoder (Li et al., 2023c), StableCode (GPTNeoX) (Black et al., 2022), Mistral (Jiang et al., 2023), and Falcon (Penedo et al., 2023) while consistently delivering significant speedup over AutoGPTQ. TinyChat even democratizes LLM deployment on extremely resource-constrained Raspberry Pi 4B, achieving 0.7 tokens/s for 7B models.
与其他系统的比较。我们在图 10 中将 TinyChat 与现有的边缘LLM推理系统 AutoGPTQ、llama.cpp 和 exllama 进行了比较。我们的系统在 Orin 上最高可达

1.7 \times

倍的加速度超过 llama.cpp。此外,llama.cpp 和 exllama 的可适应性受限,主要针对 LLaMA 和 Llama-2 模型。相比之下,我们的 TinyChat 支持广泛的应用,包括 StarCoder(Li 等人,2023c)、StableCode(GPTNeoX)(Black 等人,2022)、Mistral(Jiang 等人,2023)和 Falcon(Penedo 等人,2023),并且在 AutoGPTQ 上始终保持显著的加速度。TinyChat 甚至使在极度资源受限的树莓派 4B 上部署LLM成为可能,实现了 7B 模型每秒 0.7 个令牌的性能。

6 Conclusion 结论

In this work, we propose Activation-aware Weight Quantization (AWQ), a simple yet effective method for low-bit weight-only LLM compression. Based on the observation that weights are not equally important in LLMs, AWQ performs per-channel scaling to reduce the quantization loss of salient weights. AWQ does not over-fit the calibration set and preserves the generalist abilities of LLMs in various domains and modalities. It outperforms existing work on language modeling and is applicable to instruction-tuned LMs and multi-modal LMs. Our TinyChat system further translates the theoretical memory savings achieved by AWQ into 3.2-3.3

\times

measured speedups over the FP16 implemen-
在这项工作中,我们提出了激活感知权重量化(AWQ),这是一种简单而有效的低比特权重仅压缩方法。基于权重在不同重要性的观察,AWQ 执行每通道缩放以降低显著权重的量化损失。AWQ 不会过度拟合校准集,并保留了在各种领域和模态中 LM 的泛化能力。它在语言建模方面优于现有工作,并适用于指令调整的 LM 和多模态 LM。我们的 TinyChat 系统进一步将 AWQ 实现的理论内存节省转化为相对于 FP16 实现的 3.2-3.3 倍测量加速。
tations from Huggingface on desktop and mobile GPUs, democratizing LLM deployment on the edge.
在桌面和移动 GPU 上使用拥抱脸的模型部署在边缘计算设备上,民主化了LLM。

REFERENCES 参考文献

Alayrac, J.-B., Donahue, J., Luc, P., Miech, A., Barr, I., Hasson, Y., Lenc, K., Mensch, A., Millican, K., Reynolds, M., et al. Flamingo: a visual language model for few-shot learning. Advances in Neural Information Processing Systems, 35:23716-23736, 2022.
阿雷亚拉克，J.-B.，多纳休，J.，卢克，P.，米耶克，A.，巴尔，I.，哈松，Y.，连茨，K.，门斯奇，A.，米尔利坎，K.，雷诺兹，M.等。Flamingo：一种用于少样本学习的视觉语言模型。神经信息处理系统进展，35:23716-23736，2022。

Austin, J., Odena, A., Nye, M., Bosma, M., Michalewski, H., Dohan, D., Jiang, E., Cai, C., Terry, M., Le, Q., and Sutton, C. Program synthesis with large language models, 2021.
奥斯汀、J.、奥德内、A.、奈、M.、博斯马、M.、米卡洛夫斯基、H.、多汉、D.、江、E.、蔡、C.、特里、M.、黎、Q.、和萨顿、C.大型语言模型的程序合成,2021。

Awadalla, A., Gao, I., Gardner, J., Hessel, J., Hanafy, Y., Zhu, W., Marathe, K., Bitton, Y., Gadre, S., Jitsev, J., Kornblith, S., Koh, P. W., Ilharco, G., Wortsman, M., and Schmidt, L. Openflamingo, March 2023. URL https: //doi.org/10.5281/zenodo. 7733589.
阿瓦达拉，A.，高，I.，加德纳，J.，赫塞尔，J.，哈纳菲，Y.，朱，W.，马拉特，K.，比顿，Y.，加德雷，S.，吉特塞夫，J.，科恩布利斯，S.，高，P. W.，伊尔哈科，G.，沃茨曼，M.，和施密特，L. Openflamingo, 2023 年 3 月。链接 https://doi.org/10.5281/zenodo.7733589.

Bengio, Y., Léonard, N., and Courville, A. Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432, 2013.
本吉奥, Y., 利奥纳德, N., 和库尔维, A.估计或传播通过随机神经元进行有条件计算的梯度。arXiv 预印本 arXiv:1308.3432, 2013 年。

Black, S., Biderman, S., Hallahan, E., Anthony, Q., Gao, L., Golding, L., He, H., Leahy, C., McDonell, K., Phang, J., et al. Gpt-neox-20b: An open-source autoregressive language model. arXiv preprint arXiv:2204.06745, 2022.
黑色,S.,比德曼,S.,哈拉罕,E.,安东尼,Q.,高,L.,戈尔丁,L.,贺,H.,利亚希,C.,麦克唐奈,K.,潘,J.,等。 Gpt-neox-20b:一个开源的自回归语言模型。arXiv 预印本 arXiv:2204.06745,2022.

Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A., Sutskever, I., and Amodei, D. Language models are few-shot learners. In Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., and Lin, H. (eds.), Advances in Neural Information Processing
布朗, T., 曼恩, B., 赖德, N., 苏比亚, M., 卡普兰, J. D., 达里瓦尔, P., 尼拉坎塔, A., 夏姆, P., 萨斯特里, G., 阿斯克尔, A., 阿加尔瓦尔, S., 赫伯特-沃斯, A., 克鲁格, G., 海宁汉, T., 查尔德, R., 拉梅什, A., 齐格勒, D., 吴, J., 温特, C., 赫塞, C., 陈, M., 西格勒, E., 利特温, M., 格雷, S., 切斯, B., 克拉克, J., 伯纳, C., 麦坎德利什, S., 拉德福德, A., 萨茨科维尔, I., 和阿莫迪, D. 语言模型是少量学习者。

$^{*}$ : Algorithm co-lead, $^{†}$ : system co-lead. $^{1}$ MIT $^{2}$ Shanghai Jiao Tong University $^{3}$ NVIDIA $^{4}$ Tsinghua University $^{5}$ MIT-IBM Watson AI Lab $^{6}$ UMass Amherst. Correspondence to: Song Han songhan@mit.edu.
$^{*}$ : 算法联合负责人, $^{†}$ : 系统联合负责人. $^{1}$ 麻省理工学院 $^{2}$ 上海交通大学 $^{3}$ NVIDIA $^{4}$ 清华大学 $^{5}$ 麻省理工学院-IBM 沃森人工智能实验室 $^{6}$ 马萨诸塞大学阿默斯特分校. 通讯作者: Song Han songhan@mit.edu.

Proceedings of the $5^{th}$ MLSys Conference, Santa Clara, CA, USA, 2024. Copyright 2024 by the author(s).
2024 年圣克拉拉(加利福尼亚州),美国 MLSys 会议论文集。 2024 年版权归作者所有。
*https://github.com/ggerganov/llama.cpp
https://github.com/ggerganov/llama.cpp
$^{†}$ https://github.com/turboderp/exllama
"https://github.com/turboderp/exllama
https://github.com/turboderp/exllama

Abstract 摘要

1 InTRODUCTION 介绍

2 Related Work 相关工作

3 AWQ: ACtivation-aWare WeIGht QUANTIZATION3 AWQ：激活感知权重量化

3.1 Improving LLM Quantization by Preserving 1% Salient Weights通过保留 1%重要权重改进LLM量化

3.2 Protecting Salient Weights by Activation-aware Scaling通过激活感知缩放保护显著权重

Analyzing the quantization error.分析量化误差。

4 TinyChat: Mapping AWQ onto Edge Platforms4 TinyChat:边缘平台上的 AWQ 映射

4.1 Why AWQ Helps Accelerate On-Device LLMs4.1 为什么 AWQ 有助于加速设备上的LLMs

4.2 Deploy AWQ with TinyChat部署 AWQ 与 TinyChat