Queueing in the Linux Network Stack
By Dan Siemon, September 23, 2013
Packet queues are a core component of any network stack or device. They
allow for asynchronous modules to communicate, increase performance and
have the side effect of impacting latency. This article aims to explain
where IP packets are queued on the transmit path of the Linux network
stack, how interesting new latency-reducing features, such as BQL, operate
and how to control buffering for reduced latency.

Figure 1. Simplified High-Level Overview of the Queues on the Transmit
Path of the Linux Network Stack
Driver Queue (aka Ring Buffer)
Between the IP stack and the network interface controller (NIC) lies the
driver queue. This queue typically is implemented as a first-in, first-out
(FIFO) ring buffer (http://en.wikipedia.org/wiki/Circular_buffer)—just think of it as a
fixed-sized buffer. The driver
queue does not contain the packet data. Instead, it consists of descriptors
that point to other data structures called socket kernel buffers (SKBs,
http://vger.kernel.org/%7Edavem/skb.html),
which hold the packet data and are used throughout the kernel.

Figure 2. Partially Full Driver Queue with Descriptors Pointing to SKBs
The input source for the driver queue is the IP stack that queues IP
packets. The packets may be generated locally or received on one NIC to be
routed out another when the device is functioning as an IP router. Packets
added to the driver queue by the IP stack are dequeued by the hardware
driver and sent across a data bus to the NIC hardware for transmission.
The reason the driver queue exists is to ensure that whenever the
system has data to transmit it is available to the NIC for immediate
transmission. That is, the driver queue gives the IP stack a location
to queue data asynchronously from the operation of the hardware. An
alternative design would be for the NIC to ask the IP stack for data
whenever the physical medium is ready to transmit. Because responding
to this request cannot be instantaneous, this design wastes valuable
transmission opportunities resulting in lower throughput. The opposite
of this design approach would be for the IP stack to wait after a packet
is created until the hardware is ready to transmit. This also is not
ideal, because the IP stack cannot move on to other work.
Huge Packets from the Stack
Most NICs have a fixed maximum transmission unit (MTU), which is the
biggest frame that can be transmitted by the physical media. For Ethernet,
the default MTU is 1,500 bytes, but some Ethernet networks support Jumbo
Frames (http://en.wikipedia.org/wiki/Jumbo_frame) of up to 9,000 bytes. Inside the IP network stack, the MTU can
manifest as a limit on the size of the packets that are sent to the
device for transmission. For example, if an application writes 2,000
bytes to a TCP socket, the IP stack needs to create two IP packets
to keep the packet size less than or equal to a 1,500 MTU. For large
data transfers, the comparably small MTU causes a large number of small
packets to be created and transferred through the driver queue.
In order to avoid the overhead associated with a large number of packets
on the transmit path, the Linux kernel implements several optimizations:
TCP segmentation offload (TSO), UDP fragmentation offload (UFO) and
generic segmentation offload (GSO). All of these optimizations allow the
IP stack to create packets that are larger than the MTU of the outgoing
NIC. For IPv4, packets as large as the IPv4 maximum of 65,535 bytes can
be created and queued to the driver queue. In the case of TSO and UFO,
the NIC hardware takes responsibility for breaking the single large
packet into packets small enough to be transmitted on the physical
interface. For NICs without hardware support, GSO performs the same
operation in software immediately before queueing to the driver queue.
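To check which of these offloads are active on a given interface, ethtool (covered in more detail later in this article) can report the current settings. The interface name eth0 and the trimmed output below are illustrative:
$ ethtool -k eth0 | grep -E 'segmentation|fragmentation|generic-receive'
tcp-segmentation-offload: on
udp-fragmentation-offload: off
generic-segmentation-offload: on
generic-receive-offload: on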
Recall from earlier that the driver queue contains a fixed number of
descriptors that each point to packets of varying sizes. Since TSO,
UFO and GSO allow for much larger packets, these optimizations have the
side effect of greatly increasing the number of bytes that can be queued
in the driver queue. Figure 3 illustrates this concept in contrast with
Figure 2.

Figure 3. Large packets can be sent to the NIC when TSO, UFO or GSO
are enabled. This can greatly increase the number of bytes in the
driver queue.
Although the focus of this article is the transmit path, it is worth
noting that Linux has receive-side optimizations that operate
similarly to TSO, UFO and GSO and share the goal of reducing per-packet
overhead. Specifically, generic receive offload (GRO,
http://vger.kernel.org/%7Edavem/cgi-bin/blog.cgi/2010/08/30) allows the NIC
driver to combine received packets into a single large packet that is
then passed to the IP stack. When the device forwards these large packets,
GRO allows the original packets to be reconstructed, which is necessary
to maintain the end-to-end nature of the IP packet flow. However, there
is one side effect: when the large packet is broken up, it results in
several packets for the flow being queued at once. This
"micro-burst"
of packets can negatively impact inter-flow latency.
Despite its necessity and benefits, the queue between the IP stack and
the hardware introduces two problems: starvation and latency.
If the NIC driver wakes to pull packets off of the queue for transmission
and the queue is empty, the hardware will miss a transmission opportunity,
thereby reducing the throughput of the system. This is referred to
as starvation. Note that an empty queue when the system does not
have anything to transmit is not starvation—this is normal. The
complication associated with avoiding starvation is that the IP stack
that is filling the queue and the hardware driver draining the queue run
asynchronously. Worse, the duration between fill or drain events varies
with the load on the system and external conditions, such as the network
interface's physical medium. For example, on a busy system, the IP stack
will get fewer opportunities to add packets to the queue, which increases
the chances that the hardware will drain the queue before more packets
are queued. For this reason, it is advantageous to have a very large
queue to reduce the probability of starvation and ensure high throughput.
Although a large queue is necessary for a busy system to maintain high
throughput, it has the downside of allowing for the introduction of a
large amount of latency.
Figure 4 shows a driver queue that is almost full with TCP segments
for a single high-bandwidth, bulk traffic flow (blue). Queued last is
a packet from a VoIP or gaming flow (yellow). Interactive applications
like VoIP or gaming typically emit small packets at fixed intervals that
are latency-sensitive, while a high-bandwidth data transfer generates a
higher packet rate and larger packets. This higher packet rate can fill
the queue between interactive packets, causing the transmission of the
interactive packet to be delayed.

Figure 4. Interactive Packet (Yellow) behind Bulk Flow Packets (Blue)
To illustrate this behaviour further, consider a scenario based on the
following assumptions:
- A network interface that is capable of transmitting at 5 Mbit/sec or 5,000,000 bits/sec.
- Each packet from the bulk flow is 1,500 bytes or 12,000 bits.
- Each packet from the interactive flow is 500 bytes.
- The depth of the queue is 128 descriptors.
- There are 127 bulk data packets and one interactive packet queued last.
Given the above assumptions, the time required to drain the 127 bulk
packets and create a transmission opportunity for the interactive packet
is (127 * 12,000) / 5,000,000 = 0.304 seconds (304 milliseconds for those
who think of latency in terms of ping results). This amount of latency
is well beyond what is acceptable for interactive applications, and this
does not even represent the complete round-trip time—it is only the
time required to transmit the packets queued before the interactive one. As
described earlier, the size of the packets in the driver queue can be
larger than 1,500 bytes, if TSO, UFO or GSO are enabled. This makes the
latency problem correspondingly worse.
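The same arithmetic can be repeated quickly for other link rates. The one-liner below (a rough sketch using bc) shows that the 127 queued bulk packets take roughly 304 ms, 152 ms and 15 ms to drain at 5, 10 and 100 Mbit/sec, respectively:
$ for r in 5000000 10000000 100000000; do echo "scale=3; (127 * 12000) / $r" | bc; done
.304
.152
.015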
The large latency introduced by over-sized, unmanaged queues is known as
Bufferbloat (http://en.wikipedia.org/wiki/Bufferbloat). For a more
detailed explanation of this phenomenon, see the
Resources for this article.
As the above discussion illustrates, choosing the correct size for
the driver queue is a Goldilocks problem—it can't be too small, or
throughput suffers; it can't be too big, or latency suffers.
Byte Queue Limits (BQL)
Byte Queue Limits (BQL) is a new feature in recent Linux kernels
(> 3.3.0) that attempts to solve the problem of driver queue sizing
automatically. This is accomplished by adding a layer that enables and
disables queueing to the driver queue based on calculating the minimum
queue size required to avoid starvation under the current system
conditions. Recall from earlier that the smaller the amount of queued
data, the lower the maximum latency experienced by queued packets.
It is key to understand that the actual size of the driver queue is not
changed by BQL. Rather, BQL calculates a limit of how much data (in bytes)
can be queued at the current time. Any bytes over this limit must be
held or dropped by the layers above the driver queue.
A real-world example may help provide a sense of how much BQL affects
the amount of data that can be queued. On one of the author's servers, the
driver queue size defaults to 256 descriptors. Since the Ethernet MTU is
1,500 bytes, this means up to 256 * 1,500 = 384,000 bytes can be queued
to the driver queue (TSO, GSO and so forth are disabled, or this would be much
higher). However, the limit value calculated by BQL is 3,012 bytes. As you
can see, BQL greatly constrains the amount of data that can be queued.
BQL reduces network latency by limiting the amount of data in the driver
queue to the minimum required to avoid starvation. It also has the
important side effect of moving the point where most packets are queued
from the driver queue, which is a simple FIFO, to the queueing discipline
(QDisc) layer, which is capable of implementing much more complicated
queueing strategies.
Queueing Disciplines (QDisc)
The driver queue is a simple first-in, first-out (FIFO) queue. It treats
all packets equally and has no capabilities for distinguishing between
packets of different flows. This design keeps the NIC driver software
simple and fast. Note that more advanced Ethernet and most wireless
NICs support multiple independent transmission queues, but similarly, each
of these queues is typically a FIFO. A higher layer is responsible for
choosing which transmission queue to use.
Sandwiched between the IP stack and the driver queue is the queueing discipline (QDisc) layer (Figure 1). This layer implements the traffic management capabilities of the Linux kernel, which include traffic classification, prioritization and rate shaping. The QDisc layer is configured through the somewhat opaque tc command. There are three key concepts to understand in the QDisc layer: QDiscs, classes and filters.
The QDisc is the Linux abstraction for traffic queues, which are more complex than the standard FIFO queue. This interface allows the QDisc to carry out complex queue management behaviors without requiring the IP stack or the NIC driver to be modified. By default, every network interface is assigned a pfifo_fast QDisc (http://lartc.org/howto/lartc.qdisc.classless.html), which implements a simple three-band prioritization scheme based on the TOS bits. Despite being the default, the pfifo_fast QDisc is far from the best choice, because it defaults to having very deep queues (see txqueuelen below) and is not flow aware.
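You can see which QDisc is currently attached to an interface with tc. On an unmodified system, the output typically resembles the following (the interface name and exact output are illustrative):
$ tc qdisc show dev eth0
qdisc pfifo_fast 0: root refcnt 2 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1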
The second concept, which is closely related to the QDisc, is the class. Individual QDiscs may implement classes in order to handle subsets of the traffic differently—for example, the Hierarchical Token Bucket (HTB, http://lartc.org/manpages/tc-htb.html) QDisc allows the user to configure multiple classes, each with a different bitrate, and direct traffic to each as desired. Not all QDiscs have support for multiple classes. Those that do are referred to as classful QDiscs, and those that do not are referred to as classless QDiscs.
Filters (also called classifiers) are the mechanism used to direct traffic to a particular QDisc or class. There are many different filters of varying complexity. The u32 filter (http://www.lartc.org/lartc.html#LARTC.ADV-FILTER.U32) is the most generic, and the flow filter is the easiest to use.
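To give a sense of how QDiscs, classes and filters fit together, here is a minimal sketch that attaches an HTB QDisc, defines two classes with different rates and uses a u32 filter to steer traffic to destination port 22 into the second class. The interface name, handles, rates and port are arbitrary examples, not recommendations:
# Attach a classful HTB QDisc as the root; unclassified traffic goes to class 1:10.
tc qdisc add dev eth0 root handle 1: htb default 10
# Define two classes with different rates.
tc class add dev eth0 parent 1: classid 1:10 htb rate 4mbit
tc class add dev eth0 parent 1: classid 1:20 htb rate 1mbit
# Steer traffic with destination port 22 into class 1:20.
tc filter add dev eth0 parent 1: protocol ip u32 match ip dport 22 0xffff flowid 1:20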
Buffering between the Transport Layer and the Queueing Disciplines
In looking at the figures for this article, you may have noticed that there are no packet queues above the QDisc layer. The network stack places packets directly into the QDisc or else pushes back on the upper layers (for example, socket buffer) if the queue is full. The obvious question that follows is what happens when the stack has a lot of data to send? This can occur as the result of a TCP connection with a large congestion window or, even worse, an application sending UDP packets as fast as it can. The answer is that for a QDisc with a single queue, the same problem outlined in Figure 4 for the driver queue occurs. That is, the high-bandwidth or high-packet rate flow can consume all of the space in the queue causing packet loss and adding significant latency to other flows. Because Linux defaults to the pfifo_fast QDisc, which effectively has a single queue (most traffic is marked with TOS=0), this phenomenon is not uncommon.
As of Linux 3.6.0, the Linux kernel has a feature called TCP Small Queues that aims to solve this problem for TCP. TCP Small Queues adds a per-TCP-flow limit on the number of bytes that can be queued in the QDisc and driver queue at any one time. This has the interesting side effect of causing the kernel to push back on the application earlier, which allows the application to prioritize writes to the socket more effectively. At the time of this writing, it is still possible for single flows from other transport protocols to flood the QDisc layer.
Another partial solution to the transport layer flood problem, which is transport-layer-agnostic, is to use a QDisc that has many queues, ideally one per network flow. Both the Stochastic Fairness Queueing (SFQ, http://crpppc19.epfl.ch/cgi-bin/man/man2html?8+tc-sfq) and Fair Queueing with Controlled Delay (fq_codel, http://linuxmanpages.net/manpages/fedora18/man8/tc-fq_codel.8.html) QDiscs fit this problem nicely, as they effectively have a queue-per-network flow.
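If your kernel includes fq_codel, replacing the default pfifo_fast QDisc is a one-line change (eth0 is a placeholder for your interface name):
# Replace the root QDisc with fq_codel, which combines per-flow queues with the CoDel AQM.
tc qdisc replace dev eth0 root fq_codel
# Verify the change.
tc qdisc show dev eth0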
How to Manipulate the Queue Sizes in Linux
Driver Queue:
The ethtool command (http://linuxmanpages.net/manpages/fedora12/man8/ethtool.8.html) is used to control the driver queue size for Ethernet devices. ethtool also provides low-level interface statistics as well as the ability to enable and disable IP stack and driver features.
The -g flag to ethtool displays the driver queue (ring) parameters:
# ethtool -g eth0
Ring parameters for eth0:
Pre-set maximums:
RX: 16384
RX Mini: 0
RX Jumbo: 0
TX: 16384
Current hardware settings:
RX: 512
RX Mini: 0
RX Jumbo: 0
TX: 256
You can see from the above output that the driver for this NIC defaults to 256 descriptors in the transmission queue. Early in the Bufferbloat investigation, it often was recommended to reduce the size of the driver queue in order to reduce latency. With the introduction of BQL (assuming your NIC driver supports it), there no longer is any reason to modify the driver queue size (see below for how to configure BQL).
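If your NIC driver does not support BQL and you still want to experiment with a smaller transmission ring, ethtool can also change the ring sizes. This is only a sketch; the value 128 is an arbitrary example, and not every driver allows resizing:
# Reduce the transmit ring to 128 descriptors (requires driver support).
ethtool -G eth0 tx 128
# Confirm the new setting.
ethtool -g eth0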
ethtool also allows you to view and manage optimization features, such
as TSO, GSO, UFO and GRO, via the -k and -K flags. The -k flag displays
the current offload settings and -K modifies them.
As discussed above, some optimization features greatly increase the number of bytes that can be queued in the driver queue. You should disable these optimizations if you want to optimize for latency over throughput. It's doubtful you will notice any CPU impact or throughput decrease when disabling these features unless the system is handling very high data rates.
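For example, on a hypothetical interface eth0, the offloads discussed above can be switched off as follows (feature availability varies by driver; UFO in particular is not present on all of them):
# Disable segmentation and receive offloads so that packets queued to the
# driver queue are no larger than the MTU.
ethtool -K eth0 tso off gso off gro off
# Disable UDP fragmentation offload where the driver supports it.
ethtool -K eth0 ufo off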
Byte Queue Limits (BQL):
The BQL algorithm is self-tuning, so you probably don't need to modify its configuration. BQL state and configuration can be found in a /sys directory based on the location and name of the NIC. For example: /sys/devices/pci0000:00/0000:00:14.0/net/eth0/queues/tx-0/byte_queue_limits.
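Using the example path above, the per-queue BQL state can be inspected directly. The listing below is a sketch; the limit value of 3,012 bytes matches the example from earlier in this article, but the numbers on your system will differ:
$ cd /sys/devices/pci0000:00/0000:00:14.0/net/eth0/queues/tx-0/byte_queue_limits
$ ls
hold_time  inflight  limit  limit_max  limit_min
$ cat limit
3012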
To place a hard upper limit on the number of bytes that can be queued, write the new value to the limit_max file:
echo "3000" > limit_max
What Is txqueuelen?
Often in early Bufferbloat discussions, the idea of statically reducing the NIC transmission queue was mentioned. The txqueuelen field in the ifconfig command's output or the qlen field in the ip command's output shows the current size of the transmission queue:
$ ifconfig eth0
eth0 Link encap:Ethernet HWaddr 00:18:F3:51:44:10
inet addr:69.41.199.58 Bcast:69.41.199.63 Mask:255.255.255.248
inet6 addr: fe80::218:f3ff:fe51:4410/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:435033 errors:0 dropped:0 overruns:0 frame:0
TX packets:429919 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:65651219 (62.6 MiB) TX bytes:132143593 (126.0 MiB)
Interrupt:23
$ ip link
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 16436 qdisc noqueue state UNKNOWN
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
link/ether 00:18:f3:51:44:10 brd ff:ff:ff:ff:ff:ff
The length of the transmission queue in Linux defaults to 1,000 packets, which is a large amount of buffering, especially at low bandwidths.
The interesting question is what queue does this value control? One might guess that it controls the driver queue size, but in reality, it serves as a default queue length for some of the QDiscs. Most important, it is the default queue length for the pfifo_fast QDisc, which is the default. The "limit" argument on the tc command line can be used to ignore the txqueuelen default.
The length of the transmission queue is configured with the ip or ifconfig commands:
ip link set txqueuelen 500 dev eth0
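The equivalent with ifconfig, for systems where that tool is still preferred, looks like this (eth0 and the value 500 simply mirror the example above):
ifconfig eth0 txqueuelen 500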
Queueing Disciplines:
As introduced earlier, the Linux kernel has a large number of queueing
disciplines (QDiscs), each of which implements its own packet queues and
behaviour. Describing the details of how to configure each of the QDiscs
is beyond the scope of this article. For full details, see the tc man page
(man tc). You can find details for each QDisc in man tc qdisc-name
(for example, man tc htb or man tc fq_codel).
TCP Small Queues:
The per-socket TCP queue limit can be viewed and controlled with the following /proc file: /proc/sys/net/ipv4/tcp_limit_output_bytes.
You should not need to modify this value in any normal situation.
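For completeness, viewing and changing the limit looks like the following; the values shown are purely illustrative:
$ cat /proc/sys/net/ipv4/tcp_limit_output_bytes
131072
# sysctl -w net.ipv4.tcp_limit_output_bytes=262144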
Oversized Queues Outside Your Control
Unfortunately, not all of the over-sized queues that will affect your Internet performance are under your control. Most commonly, the problem will lie in the device that attaches to your service provider (such as DSL or cable modem) or in the service provider's equipment itself. In the latter case, there isn't much you can do, because it is difficult to control the traffic that is sent toward you. However, in the upstream direction, you can shape the traffic to slightly below the link rate. This will stop the queue in the device from having more than a few packets. Many residential home routers have a rate limit setting that can be used to shape below the link rate. Of course, if you use Linux on your home gateway, you can take advantage of the QDisc features to optimize further. There are many examples of tc scripts on-line to help get you started.
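As a starting point, a minimal sketch of upstream shaping on a Linux gateway might look like the following. It assumes eth0 is the interface facing the modem and a 10 Mbit/s uplink; both are placeholders, and the rate should be set slightly below your measured uplink speed:
# Root HTB QDisc; all traffic falls into class 1:1 by default.
tc qdisc add dev eth0 root handle 1: htb default 1
# Shape to just under the uplink rate so the modem's queue stays short.
tc class add dev eth0 parent 1: classid 1:1 htb rate 9500kbit
# Use fq_codel within the class to keep per-flow latency low.
tc qdisc add dev eth0 parent 1:1 fq_codel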
Summary
Queueing in packet buffers is a necessary component of any packet network, both within a device and across network elements. Properly managing the size of these buffers is critical to achieving good network latency, especially under load. Although static queue sizing can play a role in decreasing latency, the real solution is intelligent management of the amount of queued data. This is best accomplished through dynamic schemes, such as BQL and active queue management (AQM, http://en.wikipedia.org/wiki/Active_queue_management) techniques like Codel. This article outlines where packets are queued in the Linux network stack, how features related to queueing are configured and provides some guidance on how to achieve low latency.
Acknowledgements
Thanks to Kevin Mason, Simon Barber, Lucas Fontes and Rami Rosen for reviewing this article and providing helpful feedback.
Resources
Controlling Queue Delay: http://queue.acm.org/detail.cfm?id=2209336
Bufferbloat: Dark Buffers in the Internet: http://cacm.acm.org/magazines/2012/1/144810-bufferbloat/fulltext
Bufferbloat Project: http://www.bufferbloat.net
Linux Advanced Routing and Traffic Control How-To (LARTC): http://www.lartc.org/howto