Process Architecture
FRR is a suite of daemons that serve different functions. This document
describes the internal architecture of these daemons, focusing on their general
design patterns, and especially on how threads are used in the daemons that use
them.
Overview
The fundamental pattern used in FRR daemons is an event loop. Some daemons use kernel threads. In these
daemons, each kernel thread runs its own event loop. The event loop
implementation is constructed to be thread safe and to allow threads other than
its owning thread to schedule events on it. The rest of this document describes
these two designs in detail.
Terminology
Because this document describes the architecture for kernel threads as well as
the event system, a digression on terminology is in order here.
Historically Quagga’s loop system was viewed as an implementation of userspace
threading. Because of this design choice, the names for various datastructures
within the event system are variations on the term “thread”. The primary
datastructure that holds the state of an event loop in this system is called a
“threadmaster”. Events scheduled on the event loop - what would today be called
an ‘event’ or ‘task’ in systems such as libevent - are called “threads” and the
datastructure for them is struct event. To add to the confusion, these
“threads” have various types, one of which is “event”. To hopefully avoid some
of this confusion, this document refers to these “threads” as a ‘task’ except
where the datastructures are explicitly named. When they are explicitly named,
they will be formatted like this to differentiate from the conceptual names.
When speaking of kernel threads, the term used will be “pthread” since FRR’s
kernel threading implementation uses the POSIX threads API.
Event Architecture
This section presents a brief overview of the event model as currently
implemented in FRR. This doc should be expanded and broken off into its own
section. For now it provides basic information necessary to understand the
interplay between the event system and kernel threads.
The core event system is implemented in lib/event.c and lib/frrevent.h. The
primary structure is struct event_loop, hereafter referred to as a
threadmaster. A threadmaster is a global state object, or context, that holds
all the tasks currently pending execution as well as statistics on tasks that
have already executed. The event system is driven by adding tasks to this data
structure and then calling a function to retrieve the next task to execute. At
initialization, a daemon will typically create one threadmaster, add a small
set of initial tasks, and then run a loop to fetch each task and execute it.
These tasks have various types corresponding to their general action. The types
are given by integer macros in frrevent.h and are:
EVENT_READ
    Task which waits for a file descriptor to become ready for reading and then
    executes.
EVENT_WRITE
    Task which waits for a file descriptor to become ready for writing and then
    executes.
EVENT_TIMER
    Task which executes after a certain amount of time has passed since it was
    scheduled.
EVENT_EVENT
    Generic task that executes with high priority and carries an arbitrary
    integer indicating the event type to its handler. These are commonly used
    to implement the finite state machines typically found in routing
    protocols.
EVENT_READY
    Type used internally for tasks on the ready queue.
EVENT_UNUSED
    Type used internally for struct event objects that aren’t being used. The
    event system pools struct event to avoid heap allocations; this is the type
    they have when they’re in the pool.
EVENT_EXECUTE
    Just before a task is run its type is changed to this. This is used to show
    X as the type in the output of show event cpu.
The programmer never has to work with these types explicitly. Each type of task
is created and queued via special-purpose functions (actually macros, but
irrelevant for the time being) for the specific type. For example, to add an
EVENT_READ task, you would call
event_add_read(struct event_loop *master, int (*handler)(struct event *), void *arg, int fd, struct event **ref);
The struct event is then created and added to the appropriate internal
datastructure within the threadmaster. Note that the READ and WRITE tasks are
independent - a READ task only tests for readability, for example.
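To make this concrete, below is a minimal sketch of scheduling and re-arming a
read task. It assumes the handler signature used by current FRR (a void
function taking a struct event *), the EVENT_ARG()/EVENT_FD() accessor macros
from frrevent.h, and that struct event carries a pointer to its owning event
loop; the my_ctx structure and function names are hypothetical. FRR tasks fire
once and are then consumed, so a handler that wants further notifications must
re-add itself.

#include "frrevent.h"

/* Hypothetical per-connection context; any caller-managed object works. */
struct my_ctx {
        int sock_fd;
        struct event *t_read; /* reference slot, cleared when the task runs */
};

/* Handler invoked once the descriptor is readable. */
static void my_read_handler(struct event *ev)
{
        struct my_ctx *ctx = EVENT_ARG(ev);
        int fd = EVENT_FD(ev);

        /* ... read from fd and process the data ... */

        /* Tasks are one-shot: re-arm to hear about the next readable event. */
        event_add_read(ev->master, my_read_handler, ctx, fd, &ctx->t_read);
}

/* Initial scheduling, e.g. from daemon startup code. */
static void my_conn_start(struct event_loop *master, struct my_ctx *ctx)
{
        event_add_read(master, my_read_handler, ctx, ctx->sock_fd,
                       &ctx->t_read);
}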
The Event Loop
To use the event system, after creating a threadmaster the program adds an
initial set of tasks. As these tasks execute, they add more tasks that execute
at some point in the future. This sequence of tasks drives the lifecycle of the
program. When no more tasks are available, the program dies. Typically at
startup the first task added is an I/O task for VTYSH as well as any network
sockets needed for peerings or IPC.
To retrieve the next task to run, the program calls event_fetch().
event_fetch() internally computes which task to execute next based on
rudimentary priority logic. Events (type EVENT_EVENT) execute with the highest
priority, followed by expired timers and finally I/O tasks (type EVENT_READ and
EVENT_WRITE). When scheduling a task, a function and an arbitrary argument are
provided. The task returned from event_fetch() is then executed with
event_call().
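In skeletal form, the fetch-execute cycle looks like the sketch below. This is
a freestanding illustration rather than FRR's actual main loop (real daemons
initialize through libfrr and run frr_run()); it assumes event_master_create()
from lib/event.c, and startup_task is a hypothetical name.

#include "frrevent.h"

static void startup_task(struct event *ev)
{
        /* First task: open sockets, schedule further tasks, etc. */
}

int main(int argc, char **argv)
{
        struct event_loop *master = event_master_create("main");
        struct event task;

        /* Seed the loop with an initial high-priority event task. */
        event_add_event(master, startup_task, NULL, 0, NULL);

        /* Fetch-execute loop: event_fetch() blocks until a task is ready,
         * copies it into the provided storage, and returns it so that
         * event_call() can invoke the handler. */
        while (event_fetch(master, &task))
                event_call(&task);

        return 0;
}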
The following diagram illustrates a simplified version of this infrastructure.
The series of “task” boxes represents the current ready task queue. The various
other queues for other types are not shown. The fetch-execute loop is
illustrated at the bottom.
Mapping the general names used in the figure to specific FRR functions:
task is struct event *
fetch is event_fetch()
exec() is event_call()
cancel() is event_cancel()
schedule() is any of the various task-specific event_add_* functions
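Of these, cancellation has not been shown yet. A pending task is cancelled
through the same reference slot that the event_add_* call filled in; a two-line
sketch, where ctx->t_timer is a hypothetical reference:

/* Cancels the task if it is still pending and NULLs out the reference. */
event_cancel(&ctx->t_timer);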
Adding tasks is done with various task-specific function-like macros. These
macros wrap underlying functions in event.c to provide additional information
added at compile time, such as the line number the task was scheduled from,
that can be accessed at runtime for debugging, logging and informational
purposes. Each task type has its own specific scheduling function that follows
the naming convention event_add_<type>; see frrevent.h for details.
There are some gotchas to keep in mind:
I/O tasks are keyed off the file descriptor associated with the I/O operation.
This means that for any given file descriptor, only one of each type of I/O
task (EVENT_READ and EVENT_WRITE) can be scheduled. For example, scheduling two
write tasks one after the other will overwrite the first task with the second,
resulting in total loss of the first task and difficult bugs; see the sketch
after this list.
Timer tasks are only as accurate as the monotonic clock provided by the
underlying operating system.
Memory management of the arbitrary handler argument passed in the schedule call
is the responsibility of the caller.
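The sketch below illustrates the first gotcha: guarding a write task behind its
reference slot so a second schedule call cannot silently replace an outstanding
one. It relies on the event system clearing the reference when the task
executes or is cancelled; the conn structure and function names are
hypothetical.

#include "frrevent.h"

/* Hypothetical connection context. */
struct conn {
        int fd;
        struct event *t_write; /* non-NULL while a write task is pending */
};

static void conn_writable(struct event *ev)
{
        struct conn *c = EVENT_ARG(ev);

        /* ... flush buffered output to c->fd; re-arm if data remains ... */
}

static void conn_schedule_write(struct event_loop *master, struct conn *c)
{
        /* Only one EVENT_WRITE task may exist per fd; scheduling a second
         * would clobber the first. The reference slot doubles as a
         * "write already pending" flag because the event system NULLs it
         * out when the task runs or is cancelled. */
        if (c->t_write == NULL)
                event_add_write(master, conn_writable, c, c->fd, &c->t_write);
}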
Kernel Thread Architecture
Efforts have begun to introduce kernel threads into FRR to improve performance
and stability. Naturally a kernel thread architecture has long been seen as
orthogonal to an event-driven architecture, and the two do have significant
overlap in terms of design choices. Since the event model is tightly integrated
into FRR, careful thought has been put into how pthreads are introduced, what
role they fill, and how they will interoperate with the event model.
Design Overview
Each kernel thread behaves as a lightweight process within FRR, sharing the
same process memory space. On the other hand, the event system is designed to
run in a single process and drive serial execution of a set of tasks. With this
consideration, a natural choice is to implement the event system within each
kernel thread. This allows us to leverage the event-driven execution model with
the currently existing task and context primitives. In this way the familiar
execution model of FRR gains the ability to execute tasks simultaneously while
preserving the existing model for concurrency.
The following figure illustrates the architecture with multiple pthreads, each
running their own threadmaster-based event loop.
Each roundrect represents a single pthread running the same event loop
described under Event Architecture. Note the arrow from the exec() box on the
right to the schedule() box in the middle pthread. This illustrates code
running in one pthread scheduling a task onto another pthread's threadmaster.
A global lock for each threadmaster is used to synchronize these operations.
The pthread names are examples.
Kernel Thread Wrapper
The basis for the integration of pthreads and the event system is a lightweight
wrapper for both systems implemented in lib/frr_pthread.[ch]. The header
provides a core datastructure, struct frr_pthread, that encapsulates structures
from both POSIX threads and the event system (event.c, frrevent.h). In
particular, this datastructure has a pointer to a threadmaster that runs within
the pthread. It also has fields for a name as well as start and stop functions
whose signatures are similar to the corresponding POSIX arguments for
pthread_create().
Calling frr_pthread_new() creates and registers a new frr_pthread. The returned
structure has a pre-initialized threadmaster, and its start and stop functions
are initialized to defaults that will run a basic event loop with the given
threadmaster. Calling frr_pthread_run() starts the thread with the start
function. From there, the model is the same as the regular event model. To
schedule tasks on a particular pthread, simply use the regular event.c
functions as usual and provide the threadmaster pointed to from the
frr_pthread. As part of implementing the wrapper, the event.c functions were
made thread-safe. Consequently, it is safe to schedule events on a
threadmaster belonging to the calling thread as well as on one belonging to any
other pthread. This serves as the basis for inter-thread communication and
boils down to a slightly more complicated method of message passing, where the
messages are the regular task events as used in the event-driven model. The
only difference is task cancellation: cancelling a task currently scheduled on
a threadmaster belonging to a different pthread requires event_cancel_async()
instead of event_cancel(). This is necessary to avoid race conditions in the
specific case where one pthread wants to guarantee that a task on another
pthread is cancelled before proceeding.
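Putting the pieces together, a minimal sketch of spawning a worker pthread and
handing it work might look as follows. It assumes the frr_pthread_new(),
frr_pthread_run() and frr_pthread_wait_running() signatures from
lib/frr_pthread.h, with NULL attributes selecting the default start/stop
functions; worker_task and spawn_worker are hypothetical names.

#include "frrevent.h"
#include "frr_pthread.h"

/* Hypothetical task that will run inside the worker's event loop. */
static void worker_task(struct event *ev)
{
        /* ... executed on the worker pthread ... */
}

static void spawn_worker(void)
{
        /* NULL attributes select the default start/stop functions, which
         * run a basic event loop on the embedded threadmaster. */
        struct frr_pthread *fpt =
                frr_pthread_new(NULL, "My worker thread", "my_worker");

        frr_pthread_run(fpt, NULL);
        frr_pthread_wait_running(fpt);

        /* Scheduling is thread-safe: any pthread may add tasks to the
         * worker's threadmaster with the usual event_add_* calls. To
         * cancel such a task from this pthread, event_cancel_async()
         * must be used rather than event_cancel(). */
        event_add_event(fpt->master, worker_task, NULL, 0, NULL);
}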
In addition, the existing commands to show statistics and other information for
tasks within the event driven model have been expanded to handle multiple
pthreads; running show event cpu will display the usual event breakdown, but it
will do so for each pthread running in the program. For example, BGPD runs a
dedicated I/O pthread and shows the following output for show event cpu:
frr# show event cpu
Event statistics for bgpd:
Showing statistics for pthread main
------------------------------------
CPU (user+system): Real (wall-clock):
Active Runtime(ms) Invoked Avg uSec Max uSecs Avg uSec Max uSecs Type Thread
0 1389.000 10 138900 248000 135549 255349 T subgroup_coalesce_timer
0 0.000 1 0 0 18 18 T bgp_startup_timer_expire
0 850.000 18 47222 222000 47795 233814 T work_queue_run
0 0.000 10 0 0 6 14 T update_subgroup_merge_check_thread_cb
0 0.000 8 0 0 117 160 W zclient_flush_data
2 2.000 1 2000 2000 831 831 R bgp_accept
0 1.000 1 1000 1000 2832 2832 E zclient_connect
1 42082.000 240574 174 37000 178 72810 R vtysh_read
1 152.000 1885 80 2000 96 6292 R zclient_read
0 549346.000 2997298 183 7000 153 20242 E bgp_event
0 2120.000 300 7066 14000 6813 22046 T (bgp_holdtime_timer)
0 0.000 2 0 0 57 59 T update_group_refresh_default_originate_route_map
0 90.000 1 90000 90000 73729 73729 T bgp_route_map_update_timer
0 1417.000 9147 154 48000 132 61998 T bgp_process_packet
300 71807.000 2995200 23 3000 24 11066 T (bgp_connect_timer)
0 1894.000 12713 148 45000 112 33606 T (bgp_generate_updgrp_packets)
0 0.000 1 0 0 105 105 W vtysh_write
0 52.000 599 86 2000 138 6992 T (bgp_start_timer)
1 1.000 8 125 1000 164 593 R vtysh_accept
0 15.000 600 25 2000 15 153 T (bgp_routeadv_timer)
0 11.000 299 36 3000 53 3128 RW bgp_connect_check
Showing statistics for pthread BGP I/O thread
----------------------------------------------
CPU (user+system): Real (wall-clock):
Active Runtime(ms) Invoked Avg uSec Max uSecs Avg uSec Max uSecs Type Thread
0 1611.000 9296 173 13000 188 13685 R bgp_process_reads
0 2995.000 11753 254 26000 182 29355 W bgp_process_writes
Showing statistics for pthread BGP Keepalives thread
-----------------------------------------------------
CPU (user+system): Real (wall-clock):
Active Runtime(ms) Invoked Avg uSec Max uSecs Avg uSec Max uSecs Type Thread
No data to display yet.
Attentive readers will notice that there is a third thread, the Keepalives
thread. This thread is responsible for – surprise – generating keepalives for
peers. However, there are no statistics shown for that thread. Although the
pthread uses the frr_pthread wrapper, it opts not to use the embedded
threadmaster facilities. Instead it replaces the start and stop functions with
custom functions. This was done because the threadmaster facilities introduce a
small but significant amount of overhead relative to the pthread's task. In
this case, since the pthread does not need the event-driven model and does not
need to receive tasks from other pthreads, it is simpler and more efficient to
implement it outside of the provided event facilities. The point to take away
from this example is that while the facilities to make using pthreads within
FRR easy are already implemented, the wrapper is flexible and allows usage of
other models while still integrating with the rest of the FRR core
infrastructure. Starting and stopping this pthread works the same as it does
for any other frr_pthread; the only difference is that event statistics are not
collected for it, because there are no events.
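For pthreads like the keepalives thread that bypass the threadmaster, custom
start and stop functions are supplied through the attributes passed to
frr_pthread_new(). The sketch below shows the general shape of such an override
under the struct frr_pthread_attr interface in lib/frr_pthread.h; it
deliberately glosses over the running-flag handshake that real implementations
(for example bgpd's bgp_keepalives.c) perform, and the names are hypothetical.

#include "frr_pthread.h"

/* Custom thread body: no threadmaster, just a plain loop. */
static void *my_custom_start(void *arg)
{
        struct frr_pthread *fpt = arg;

        frr_pthread_set_name(fpt); /* apply the OS-visible thread name */

        while (/* daemon-specific run condition */ 1) {
                /* ... periodic work, condition-variable waits, etc. ... */
        }
        return NULL;
}

static int my_custom_stop(struct frr_pthread *fpt, void **result)
{
        /* Signal the loop to exit (omitted), then reap the pthread. */
        pthread_join(fpt->thread, result);
        return 0;
}

static const struct frr_pthread_attr my_attr = {
        .start = my_custom_start,
        .stop = my_custom_stop,
};

/* Created and run exactly like any other frr_pthread:
 *   fpt = frr_pthread_new(&my_attr, "My raw thread", "my_raw");
 *   frr_pthread_run(fpt, NULL);
 */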
Notes on Design and Documentation
Because of the choice to embed the existing event system into each pthread
within FRR, at this time there is not integrated support for other models of
pthread use such as divide and conquer. Similarly, there is no explicit support
for thread pooling or similar higher level constructs. The currently existing
infrastructure is designed around the concept of long-running worker threads
responsible for specific jobs within each daemon. This is not to say that
divide and conquer, thread pooling, etc. could not be implemented in the
future. However, designs in this direction must be very careful to take into
account the existing codebase. Introducing kernel threads into programs that
have been written under the assumption of a single thread of execution must be
done very carefully to avoid insidious errors and to ensure the program remains
understandable and maintainable.
In keeping with these goals, future work on kernel threading should be
extensively documented here and FRR developers should be very careful with
their design choices, as poor choices tightly integrated can prove to be
catastrophic for development efforts in the future.