
What is an L3 Master Device?

David Ahern
Cumulus Networks
Mountain View, CA, USA
dsa@cumulusnetworks.com

Abstract

The L3 Master Device (l3mdev) concept was introduced to the Linux networking stack in v4.4. While it was created for the VRF implementation, it is a separate API that can be leveraged by other drivers that want to influence FIB lookups or want to manipulate packets at layer 3. This paper discusses the implementation, the hooks in the IPv4 and IPv6 networking stack and the driver API, why they are needed and what opportunities they provide to network drivers. The VRF driver is used as an example of what can be done in each of the driver hooks.

Keywords

l3mdev, VRF, IPv4, IPv6

Introduction

The L3 Master Device (l3mdev for short) idea evolved from the initial Virtual Routing and Forwarding (VRF) implementation for the Linux networking stack.

The concept was created to generalize the changes made to the core IPv4 and IPv6 code into an API that can be leveraged by devices that operate at layer 3 (L3).
The primary motivation for l3mdev devices is to create L3 domains that correlate to a specific FIB table (Figure 1). Network interfaces can be enslaved to an l3mdev device, uniquely associating those interfaces with an L3 domain. Packets going through interfaces enslaved to an l3mdev device use the FIB table configured for that device for routing, forwarding and addressing decisions. The key here is that the enslavement only affects layer 3 decisions.
Drivers leveraging the l3mdev operations can get access to packets at layer 3, similar to the rx-handler available to layer 2 devices such as bridges and bonds. Drivers can use the hooks to implement device-based networking features that apply to the entire L3 domain.
Userspace programs can use well-known POSIX APIs to specify which domain to use when sending packets. This is required since layer 3 network addresses and routes are local to the L3 domains.
Finally, administration, monitoring and debugging for l3mdev devices follow the existing paradigms for Linux networking.

The code references in this paper are for net-next in what will be the 4.9 kernel. Older kernels operate similarly though the driver operations and hooks into the kernel are slightly different. As of this writing there are 2 drivers using the l3mdev API: VRF and IPvlan. This paper uses the VRF driver for examples of what can be done with the driver operations.

Layer 3 Master Devices

The l3mdev feature is controlled by the kernel configuration option CONFIG_NET_L3_MASTER_DEV under Networking support -> Networking options. It must be set to enable drivers that leverage the l3mdev infrastructure (e.g., VRF and IPvlan).
L3 master devices are created like any other network device (e.g., RTM_NEWLINK and 'ip link add ...'); required attributes are a function of the device type.

For example, a VRF device requires a table id that is associated with the VRF device while IPvlan does not since it does not use the FIB table aspect of l3mdev.
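For instance, a VRF device named red associated with table 1001 can be created with iproute2:

$ ip link add red type vrf table 1001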
l3mdev devices have the IFF_L3MDEV_MASTER flag set in priv_flags; devices enslaved to an l3mdev device have IFF_L3MDEV_SLAVE set. These flags are leveraged in the fast path to determine if l3mdev hooks are relevant for a particular flow using the helpers netif_is_l3_master and netif_is_l3_slave.
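As a rough illustration (this is not code from the kernel), a fast-path style check built on these helpers could look like the following:

static bool dev_in_l3_domain(const struct net_device *dev)
{
	/* True if the device is an L3 master device or is enslaved to
	 * one, i.e., l3mdev hooks may be relevant for this device. */
	return netif_is_l3_master(dev) || netif_is_l3_slave(dev);
}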
The driver operations are defined in include/net/l3mdev.h, struct l3mdev_ops. As of v4.9 the handlers are:
l3mdev_fib_table - returns the FIB table for the L3 domain,
l3mdev_l3_rcv - Rx hook in the network layer,
l3mdev_l3_out - Tx hook in the network layer, and
l3mdev_link_scope_lookup - route lookup for IPv6 link-local and multicast addresses.
Drivers using the l3mdev infrastructure only need to implement the handlers of interest. For example, IPvlan only implements the l3mdev_l3_rcv hook for its l3s mode, while the VRF driver implements all of them.
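For reference, the l3mdev_ops structure as of v4.9 looks approximately as follows (paraphrased; see include/net/l3mdev.h in the kernel tree for the authoritative definition):

struct l3mdev_ops {
	u32 (*l3mdev_fib_table)(const struct net_device *dev);
	struct sk_buff * (*l3mdev_l3_rcv)(struct net_device *dev,
					  struct sk_buff *skb, u16 proto);
	struct sk_buff * (*l3mdev_l3_out)(struct net_device *dev,
					  struct sock *sk, struct sk_buff *skb,
					  u16 proto);
	/* IPv6 route lookup for link-scope addresses */
	struct dst_entry * (*l3mdev_link_scope_lookup)(const struct net_device *dev,
						       struct flowi6 *fl6);
};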
The l3mdev_fib_table and l3mdev_link_scope_lookup operations are discussed in the next section. The l3mdev_l3_rcv and l3mdev_l3_out operations are discussed in the Layer 3 Packet Processing section.

Figure 1. An L3 domain defined by an l3mdev device.

Layer 3 Routing and Forwarding

The primary motivation for Layer 3 master devices is to create L3 domains represented by an associated FIB table (Figure 1). Network interfaces can be enslaved to the l3mdev device, making them part of the domain for layer 3 routing and forwarding decisions. A key point is that the domains and enslaving an interface to those domains affect only layer 3 decisions. The association with an l3mdev device has no impact on Layer 2 applications such as lldpd sending and receiving packets over the enslaved network interfaces (e.g., for neighbor discovery).

Network Addresses

Network addresses for interfaces enslaved to an l3mdev device are local to the L3 domain. When selecting a source address for outgoing packets, only addresses associated with interfaces in the L3 domain are considered during the selection process.

By extension this means applications need to specify which domain to use when communicating over IPv4 or IPv6. This is discussed in the 'Userspace API' section below.
Since the l3mdev device is also a network interface, it too can have network addresses (e.g., loopback addressing commonly used for routing protocols). Those addresses are also considered for source address selection on outgoing connections. The l3mdev device is treated as the loopback device for the domain, and the IPv4 stack allows the loopback address (127.0.0.1) on an l3mdev device.

FIB Tables

The FIB table id for an l3mdev device is retrieved using the l3mdev_fib_table driver operation. Since l3mdev_fib_table is called in the fast path, the operation should only return the table id for the device, presumably stored in a private struct attached to the net_device.
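A minimal implementation of this operation, for a hypothetical driver whose private data holds the table id (struct my_l3dev and tb_id below are illustrative names, not the VRF driver's), could be:

struct my_l3dev {
	u32 tb_id;	/* FIB table backing this L3 domain */
};

static u32 my_l3mdev_fib_table(const struct net_device *dev)
{
	struct my_l3dev *priv = netdev_priv(dev);

	return priv->tb_id;
}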
Local and connected routes for network addresses added to interfaces enslaved to an l3mdev device are automatically moved to the FIB table associated with the l3mdev device when the network interface is brought up. This means that the l3mdev FIB table has all routes for the domain - local, broadcast, and unicast.

Additional routes can be added to the FIB table either statically (e.g., ip route add table N) or using protocol suites such as quagga.

Policy Routing and FIB Rules

The Linux networking stack has supported policy routing with FIB rules since v2.2. Rules can use the oif (outgoing interface index) or iif (incoming interface index) to direct lookups to a specific table. The l3mdev code leverages this capability to direct lookups to the table associated with the L3 master device.
For this to work the flow structure, which contains parameters to use for FIB lookups, needs to have either oif or iif set to the interface index of the l3mdev device. This is accomplished in multiple ways: for locally originated traffic the oif is originally set based on either the socket (sk_bound_dev_if or uc_index) or cmsg and IP_PKTINFO. For responses and forwarded traffic, the original iif or oif is based on the ingress device.
l3mdev has several operations for updating the oif or iif from an enslaved interface to the L3 master device. Generically, this is done in the IPv4 and IPv6 stacks with calls to l3mdev_update_flow before calling fib_rules_lookup. Unfortunately, there are several special cases where the oif or iif is not set in the flow. These have to be handled directly with calls to l3mdev_master_ifindex and related helper functions.
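The following simplified sketch (not the kernel's actual l3mdev_update_flow code) shows the idea for an IPv4 flow: if the oif refers to an enslaved interface, it is replaced with the master's ifindex so the l3mdev FIB rule can match.

/* Simplified illustration; must run under rcu_read_lock(). */
static void redirect_oif_to_l3mdev(struct net *net, struct flowi4 *fl4)
{
	struct net_device *dev;

	if (!fl4->flowi4_oif)
		return;

	dev = dev_get_by_index_rcu(net, fl4->flowi4_oif);
	if (dev && netif_is_l3_slave(dev))
		fl4->flowi4_oif = l3mdev_master_ifindex_rcu(dev);
}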
FIB rules can be written per l3mdev device (e.g., an oif and/or iif rule per device) to direct lookups to a specific table:
$ ip rule add oif blue table 1001
$ ip rule add iif blue table 1001
Alternatively, a single l3mdev rule can be used to direct lookups to the table associated with the l3mdev device:
$ ip rule add l3mdev pref 1000
The l3mdev rule was designed to address the scalability problems of having 1 or 2 rules per device, since the rules are evaluated linearly on each FIB lookup. With the l3mdev rule, a single rule covers all l3mdev devices as the table id is retrieved from the device.
If an l3mdev rule exists, l3mdev_fib_rule_match is called to determine if the flow structure's oif or iif references an l3mdev device. If so, the l3mdev_fib_table driver operation is used to retrieve the table id. It is saved to the fib_lookup_arg, and the lookup is directed to that table.
The l3mdev rule is specified by adding the FRA_L3MDEV attribute with a value of 1 in RTM_NEWRULE and RTM_DELRULE messages.
The VRF driver adds the l3mdev rule with a preference of 1000 when the first VRF device is created. That rule can be deleted and added with a different priority if desired.
Special consideration is needed for IPv6 link-local and multicast addresses. For these addresses, the flow struct cannot be updated to the l3mdev device, as the enslaved device index is needed for an exact match (e.g., the link-local address for the specific interface is needed). In this case, the l3mdev driver needs to do the lookup directly in the FIB table for the device. The VRF driver and its vrf_link_scope_lookup function are an example of how to do this.
Also, IPv6 link-local addresses are not added to l3mdev devices by the kernel, and the stack does not insert IPv6 multicast routes for those devices. The VRF driver, for example, specifically fails route lookups for IPv6 link-local or multicast addresses on a VRF device.

Layer 3 Packet Processing

Packets are passed to l3mdev devices on ingress and egress if the driver implements the l3mdev_l3_rcv and l3mdev_l3_out handlers.

Rx

Packets are passed to the l3mdev driver in the IPv4 or IPv6 receive handlers after the netfilter hook for NF_INET_PRE_ROUTING (Figure 2).

At this point the IPv4 and IPv6 receive functions have done basic sanity checks on the skb, and the skb device (skb->dev) is set to the ingress device. The l3mdev driver can modify the skb or its metadata as needed based on relevant features. If it returns NULL, the skb is assumed to be consumed by the driver and no further processing is done. The operation is called before invoking the input function for the dst attached to the skb, which means the driver can set (or change) the dst if desired, hence altering the next function called on it.
The l3mdev_l3_rcv hook is the layer 3 equivalent to the rx-handler commonly used for layer 2 devices such as bonds and bridges. By passing the skb to the l3mdev handler in the networking stack at layer 3, drivers do not need to duplicate network layer checks on skbs. Furthermore, it allows the IPv4 and IPv6 layers to save the original ingress device index to the skb control buffer prior to calling the l3mdev_l3_rcv hook.

This is essential for datagram applications that require the ingress device index and not the l3mdev index (the latter is easily derived from the former via the master attribute).
Figure 2. l3mdev receive hook.
Figure 3. Receive path for VRF.
Prior to the l3mdev hooks, drivers relied on the rx-handler and duplicating network layer code. That design had other limitations, such as preventing a device with a macvlan or ipvlan from also being placed in an L3 domain. With this hook both are possible (e.g., eth2 can be assigned to a VRF and eth2 can be the parent device for macvlans).
Figure 3 shows the operations done by the VRF driver. It uses the l3mdev receive functions to switch the skb device to the VRF device to influence socket lookups, and the skb is run through the network taps, allowing tcpdump on a VRF device to see all packets for the domain.
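A bare-bones l3mdev_l3_rcv handler in the spirit of Figure 3 (illustrative only, not the actual VRF code) might simply switch the skb to the l3mdev device:

static struct sk_buff *my_l3mdev_l3_rcv(struct net_device *l3dev,
					struct sk_buff *skb, u16 proto)
{
	/* Make subsequent socket lookups see the L3 domain's device. */
	skb->dev = l3dev;

	/* Returning NULL would tell the stack the skb was consumed;
	 * returning the skb lets normal IPv4/IPv6 processing continue. */
	return skb;
}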
Figure 4. l3mdev transmit hook.
Figure 5. Transmit path for VRF.

As mentioned earlier, IPv6 link-local and multicast addresses need special handling. VRF uses the l3mdev_ip6_rcv function to do the ingress lookup directly in the associated FIB table and set the dst on the skb.
The skb is then run through NF_HOOK for NF_INET_PRE_ROUTING. This allows netfilter rules that look at the VRF device. (Additional netfilter hooks may be added in the future.)

Tx

For the transmit path, packets are passed to the l3mdev driver in the IPv4 or IPv6 layer in the local_out functions, before the netfilter hook for NF_INET_LOCAL_OUT. As with the Rx path, the l3mdev driver can modify the skb or its metadata as needed based on relevant features. If it returns NULL, the skb is consumed by the l3mdev driver and no further processing is done. Since the l3mdev_l3_out hook is called before dst_output, an l3mdev driver can change the dst attached to the skb, thereby impacting the next function (dst->output) invoked after the netfilter hook.
The VRF driver uses the l3mdev_l3_out handler to implement device-based features for the L3 domain (Figure 5). It accomplishes this by using the vrf_l3_out handler to switch the skb dst to its per-VRF device dst and then returning. The VRF dst has the output function pointing back to the VRF driver.
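A sketch of that approach (illustrative; the cached per-device dst below is a hypothetical field, not the VRF driver's actual data structure):

struct my_l3dev_priv {
	struct dst_entry *dst;	/* per-device dst whose output() re-enters the driver */
};

static struct sk_buff *my_l3mdev_l3_out(struct net_device *l3dev,
					struct sock *sk,
					struct sk_buff *skb, u16 proto)
{
	struct my_l3dev_priv *priv = netdev_priv(l3dev);

	skb_dst_drop(skb);
	skb_dst_set(skb, dst_clone(priv->dst));

	return skb;
}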
The skb proceeds down the stack with dst->dev pointing to the VRF device. Netfilter, qdisc and tc rules and network taps are evaluated based on this device. Finally, the skb makes it to the vrf_xmit function, which resets the dst based on a FIB lookup.

It goes through the netfilter LOCAL_OUT hook again, this time with the real Tx device, and then back to dst_output for the real Tx path.
This additional processing comes with a performance penalty, but that is a design decision within the VRF driver and is a separate topic from the l3mdev API. The relevant point here is to illustrate what can be done in an l3mdev driver.

Userspace API

As mentioned in the "Layer 3 Routing and Forwarding" section, network addresses and routes are local to an L3 domain. Accordingly, userspace programs communicating over IPv4 and IPv6 need to specify which domain to use.

If an l3mdev device (L3 domain) is not specified, the default table is used for lookups, and only addresses for interfaces not enslaved to an l3mdev device are considered.
Since the domains are defined by network devices, userspace can use the age-old POSIX APIs: SO_BINDTODEVICE for sockets, or cmsg and IP_PKTINFO for datagram sockets. The former binds the socket to the l3mdev device while the latter applies to a single sendmsg. In both cases, the scope of the send is limited to a specific L3 domain, affecting source address selection and route lookups as mentioned earlier.
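For example, scoping a socket to the L3 domain of the VRF device red with SO_BINDTODEVICE is a one-line setsockopt (a minimal userspace sketch; on kernels of this era it requires CAP_NET_RAW):

#include <string.h>
#include <sys/socket.h>

static int bind_sock_to_l3_domain(int fd, const char *l3mdev_name)
{
	/* e.g., l3mdev_name = "red"; route and address decisions for
	 * this socket are then made within that L3 domain. */
	return setsockopt(fd, SOL_SOCKET, SO_BINDTODEVICE,
			  l3mdev_name, strlen(l3mdev_name) + 1);
}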
On ingress, the skb device index is set to the l3mdev device index, so only unbound sockets (wildcard) or sockets bound to the l3mdev device will match on a socket lookup.
The tcp_l3mdev_accept sysctl allows a TCP server to bind to a port globally, i.e., across all L3 domains. Any connections accepted by it are bound to the L3 domain in which the connection originates.

This enables users to have a choice: run a TCP-based daemon per L3 domain or run a single daemon across all domains with client connections bound to an L3 domain.

Performance Overhead

The l3mdev hooks into the core networking stack were written such that if a user does not care about L3 master devices, the feature completely compiles out. This by definition means the existence of the code has no effect on performance.
When the L3_MASTER_DEVICE config is enabled in the kernel, the hooks have been written to minimize the overhead, leveraging device flags and organizing the checks with the most likely paths first.

The overhead in this case is mostly extra device lookups on the oif or iif in the flow struct and checking the priv_flags of a device (IFF_L3MDEV_MASTER and IFF_L3MDEV_SLAVE) to determine if the device is l3mdev related.
When an l3mdev is enabled (e.g., a VRF device is created and an interface is enslaved to it), the performance overhead is dictated by the driver and what it chooses to do.
The intent of this performance comparison is to examine the overhead of the l3mdev hooks in the packet path. Latency tests such as netperf's UDP_RR with 1-byte payloads stress the overhead of the l3mdev code and drivers such as VRF. This study compared three cases:
1. l3mdev disabled (kernel config option CONFIG_NET_L3_MASTER_DEV is not set).
2. l3mdev compiled in, but no l3mdev device instances are created.
3. l3mdev compiled in with a minimal VRF device driver.
For case 3, the VRF driver was reduced to only influencing FIB lookups and switching the skb dev on ingress for socket lookups. This means all of the Rx and Tx processing discussed earlier (e.g., the nf_hooks and switching the dst in the output path) was removed.

The overhead of VRF and options to improve it are a separate study.
Data were collected using a VM with virtio+vhost. The vcpus, vhost threads and netperf command were restricted to specific cpus in the first NUMA node of the host using taskset. The kernel for the VM was the net-next tree at commit fad64eff4.
Figure 6 shows the average UDP_RR transactions/sec for three 30-second runs. Comparing the results for cases 1 and 2, the overhead of enabling the L3_MASTER_DEVICE kernel configuration option is roughly for IPv4 and for IPv6.
The overhead of the l3mdev code in the fast path is the difference in UDP_RR transactions per second between cases 1 and 3. With the minimal VRF driver, a single VRF with traffic through an enslaved interface has about a 3.6% performance hit for IPv4 compared to no l3mdev support at all, while IPv6 shows a gain. A cursory review of perf output suggests one reason IPv6 shows a gain is less time spent looking for a source address. Specifically, the l3mdev case spends less time in ipv6_get_saddr_eval as only interfaces in the L3 domain are considered.

Figure 6. UDP_RR performance.
Despite the best efforts to control variability, there is some difference between netperf runs. Figure 6 does show the general trend of what to expect with the current l3mdev code, which is a worst case loss in performance of a few percent.
Furthermore, UDP_RR is a worst case test as it exercises the overhead in both FIB lookups and the l3mdev hooks in the Rx and Tx paths for each packet. Other tests such as TCP_RR show less degradation in performance since connected sockets avoid FIB lookups per packet.

Administration, Monitoring and Debugging

l3mdev devices follow the Linux networking paradigms established by devices such as bridges and bonds. Accordingly, l3mdev devices are created, configured, monitored and destroyed using the existing Linux networking APIs and tools such as iproute2.
As a master device, the rtnetlink features for MASTER_DEVICE apply to l3mdev devices as well. For example, filters can be passed in the link request to only show devices enslaved to the l3mdev device:
$ ip link show master red
or to only show addresses for devices enslaved to an L3 domain
$ ip address show master red
or neighbor entries for the L3 domain:
$ ip neighbor show master red
Furthermore, tools such as ss can use the inet diag API with a device filter to list sockets bound to an l3mdev device.
Maintaining existing semantics is another key feature of l3mdev and, by extension, the VRF implementation.

Differences by Kernel Version

The l3mdev API is fairly new (about a year old at the time of this writing) and to date has been driven by what is needed for the VRF driver. A short summary by kernel version:
v4.4 Initial l3mdev API added.
v4.5 tcp_l3mdev_accept sysctl added.
v4.7 l3mdev_l3_rcv driver operation added.
v4.8 l3mdev FIB rule added.
v4.9 Overhaul of the l3mdev changes for FIB lookups; l3mdev_l3_out operation added.

Conclusions

The L3 Master Device (l3mdev) is a layer 3 API that can be leveraged by network drivers that want to influence FIB lookups or manipulate packets at L3. The concept has been driven by the VRF implementation but is by no means limited to it.

The l3mdev API will continue to evolve for VRF as well as any future drivers that wish to take advantage of its capabilities.

Author Biography

David Ahern is a Member of Technical Staff at Cumulus Networks.