Main Page
Revision 1.2 (GPGPU-Sim 3.1.1)
Editors: Tor M. Aamodt, Wilson W.L. Fung, Tayler H. Hetherington
Introduction
This manual provides documentation for GPGPU-Sim 3.x, a cycle-level GPU performance simulator that focuses on "GPU computing" (general purpose computation on GPUs). GPGPU-Sim 3.x is the latest version of GPGPU-Sim. It includes many enhancements to GPGPU-Sim 2.x. If you are trying to install GPGPU-Sim, please refer to the README file in the GPGPU-Sim distribution you are using. The README for the most recent version of GPGPU-Sim can also be browsed online here.
This manual contains three major parts:
- A Microarchitecture Model section that describes the microarchitecture that GPGPU-Sim 3.x models.
- A Usage section that provides documentation on how to use GPGPU-Sim. This section provides information on the following:
  - Different modes of simulation
  - Configuration options (how to change high-level parameters of the simulated microarchitecture)
  - Simulation output (e.g., microarchitecture statistics)
  - Visualizing microarchitecture behavior (useful for performance debugging)
  - Strategies for debugging GPGPU-Sim when performance simulation crashes or deadlocks due to errors in the timing model
- A Software Design section that explains the internal software design of GPGPU-Sim 3.x. The goal of that section is to provide a starting point for the users to extend GPGPU-Sim for their own research.
If you use GPGPU-Sim in your work please cite our ISPASS 2009 paper:
Ali Bakhoda, George Yuan, Wilson W. L. Fung, Henry Wong, Tor M. Aamodt, Analyzing CUDA Workloads Using a Detailed GPU Simulator, in IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), Boston, MA, April 19-21, 2009.
To help reviewers you should indicate the version of GPGPU-Sim you used (e.g., "GPGPU-Sim version 3.1.0", "GPGPU-Sim version 3.0.2", "GPGPU-Sim version 2.1.2b", etc...).
The GPGPU-Sim 3.x source is available under a BSD style copyright from GitHub.
GPGPU-Sim version 3.1.0 running PTXPlus has a correlation of 98.3% and 97.3% versus GT200 and Fermi hardware on the RODINIA benchmark suite with scaled down problem sizes (see Figure 1 and Figure 2).
Please submit bug reports through the GPGPU-Sim Bug Tracking System. If you have further questions after reading the manual and searching the bugs database, you may want to sign up to the GPGPU-Sim Google Group.
Besides this manual, you may also want to consult the slides from our tutorial at ISCA 2012.
Contributors
Contributing Authors to this Manual
Tor M. Aamodt, Wilson W. L. Fung, Inderpreet Singh, Ahmed El-Shafiey, Jimmy Kwa, Tayler Hetherington, Ayub Gubran, Andrew Boktor, Tim Rogers, Ali Bakhoda, Hadi Jooybar
Contributors to GPGPU-Sim version 3.x
Tor M. Aamodt, Wilson W. L. Fung, Jimmy Kwa, Andrew Boktor, Ayub Gubran, Andrew Turner, Tim Rogers, Tayler Hetherington
Microarchitecture Model
This section describes the microarchitecture modelled by GPGPU-Sim 3.x. The model is more detailed than the timing model in GPGPU-Sim 2.x. Some of the new details result from examining various NVIDIA patents. This includes the modelling of instruction fetching, scoreboard logic, and register file access. Other improvements in 3.x include a more detailed texture cache model based upon the prefetching texture cache architecture. The overall microarchitecture is first described, then the individual components including SIMT cores and clusters, interconnection network and memory partitions.
Overview
GPGPU-Sim 3.x runs program binaries that are composed of a CPU portion and a GPU portion. However, the microarchitecture (timing) model in GPGPU-Sim 3.x reports the cycles where the GPU is busy--it does not model either CPU timing or PCI Express timing (i.e. memory transfer time between CPU and GPU). Several efforts are under way to provide combined CPU plus GPU simulators where the GPU portion is modeled by GPGPU-Sim. For example, see http://www.fusionsim.ca/.
GPGPU-Sim 3.x models GPU microarchitectures similar to those in the NVIDIA GeForce 8x, 9x, and Fermi series. The intention of GPGPU-Sim is to provide a substrate for architecture research rather than to exactly model any particular commercial GPU. That said, GPGPU-Sim 3.x has been calibrated against the NVIDIA GT200 and NVIDIA Fermi GPU architectures.
Accuracy
We calculated the correlation of the IPC (Instructions per Clock) versus that of real NVIDIA GPUs. When configured to use the native hardware instruction set (PTXPlus, using the -gpgpu_ptx_convert_to_ptxplus option), GPGPU-Sim 3.1.0 obtains an IPC correlation of 98.3% (Figure 1) and 97.3% (Figure 2) respectively on a scaled down version of the RODINIA benchmark suite (about 260 kernel launches). All of the benchmarks described in [Che et. al. 2009] were included in our tests in addition to some other benchmarks from later versions of RODINIA. Each data point in Figure 1 and Figure 2 represents an individual kernel launch. Average absolute errors are 35% and 62% respectively due to some outliers.
We have included our spreadsheet used to calculate those correlations to demonstrate how the correlation coefficients were computed: File:Correlation.xls.
Top-Level Organization
The GPU modeled by GPGPU-Sim is composed of Single Instruction Multiple Thread (SIMT) cores connected via an on-chip connection network to memory partitions that interface to graphics GDDR DRAM.
An SIMT core models a highly multithreaded pipelined SIMD processor roughly equivalent to what NVIDIA calls a Streaming Multiprocessor (SM) or what AMD calls a Compute Unit (CU). The organization of an SIMT core is described in Figure 3 below.
Clock Domains
GPGPU-Sim supports four independent clock domains: (1) the SIMT Core Cluster clock domain, (2) the interconnection network clock domain, (3) the L2 cache clock domain, which applies to all logic in the memory partition unit except DRAM, and (4) the DRAM clock domain.
Clock frequencies can have any arbitrary value (they do not need to be multiples of each other). In other words, we assume the existence of synchronizers between clock domains.
In the GPGPU-Sim 3.x simulation model, units in adjacent clock domains communicate through clock crossing buffers that are filled at the source domain's clock rate and drained at the destination domain's clock rate.
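As an illustration of this scheme, the sketch below advances two free-running clock domains and moves packets through a finite clock-crossing buffer. The class, variable names, and frequencies are illustrative assumptions and do not correspond to GPGPU-Sim's actual source.

```cpp
#include <cstdio>
#include <queue>

// Minimal sketch of a clock-crossing buffer between two clock domains.
// Names, sizes, and frequencies are illustrative, not GPGPU-Sim's code.
struct ClockCrossingBuffer {
    std::queue<int> fifo;     // packets in flight between the two domains
    unsigned capacity = 8;    // finite buffer: the source stalls when it is full

    bool can_push() const { return fifo.size() < capacity; }
    void push(int pkt) { fifo.push(pkt); }                      // source clock rate
    bool empty() const { return fifo.empty(); }
    int  pop() { int p = fifo.front(); fifo.pop(); return p; }  // destination rate
};

int main() {
    // Each domain ticks whenever its next edge is the earliest pending event,
    // so the frequencies need not be integer multiples of each other.
    double src_period = 1.0 / 1.4, dst_period = 1.0 / 0.9;      // arbitrary GHz
    double src_time = 0.0, dst_time = 0.0;
    ClockCrossingBuffer buf;
    for (int pkt = 0, step = 0; step < 40; ++step) {
        if (src_time <= dst_time) {                  // source domain ticks: fill
            if (buf.can_push()) buf.push(pkt++);
            src_time += src_period;
        } else {                                     // destination domain ticks: drain
            if (!buf.empty()) std::printf("drained packet %d\n", buf.pop());
            dst_time += dst_period;
        }
    }
    return 0;
}
```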
SIMT Core Clusters
The SIMT Cores are grouped into SIMT Core Clusters. The SIMT Cores in a SIMT Core Cluster share a common port to the interconnection network as shown in Figure 4.
As illustrated in Figure 4, each SIMT Core Cluster has a single response FIFO which holds the packets ejected from the interconnection network. The packets are directed to either a SIMT Core's instruction cache (if it is a memory response servicing an instruction fetch miss) or its memory pipeline (LDST unit). The packets exit in FIFO fashion. The response FIFO is stalled if a core is unable to accept the packet at the head of the FIFO. For generating memory requests at the LDST unit, each SIMT Core has its own injection port into the interconnection network. The injection port buffer, however, is shared by all the SIMT Cores in a cluster.
SIMT Cores
Figure 5 below illustrates the SIMT core microarchitecture simulated by GPGPU-Sim 3.x. An SIMT core models a highly multithreaded pipelined SIMD processor roughly equivalent to what NVIDIA calls a Streaming Multiprocessor (SM) [1] or what AMD calls a Compute Unit (CU) [2]. A Stream Processor (SP) or a CUDA Core would correspond to a lane within an ALU pipeline in the SIMT core.
This microarchitecture model contains many details not found in earlier versions of GPGPU-Sim. The main differences include:
- A new front-end that models instruction caches and separates the warp scheduling (issue) stage from the fetch and decode stage
- Scoreboard logic enabling multiple instructions from a single warp to be in the pipeline at once
- A detailed model of an operand collector that schedules operand accesses to single-ported register file banks (used to reduce the area and power of the register file)
- A flexible model that supports multiple SIMD functional units. This allows memory instructions and ALU instructions to operate in different pipelines.
The following subsections describe the details in Figure 5 by going through each stage of the pipeline.
Front End
As described below, the major stages in the front end include instruction cache access and instruction buffering logic, scoreboard and scheduling logic, and the SIMT stack.
Fetch and Decode
The instruction buffer (I-Buffer) block in Figure 5 is used to buffer instructions after they are fetched from the instruction cache. It is statically partitioned so that every warp running on the SIMT core has dedicated storage for its instructions. In the current model, each warp has two I-Buffer entries. Each I-Buffer entry has a valid bit, a ready bit, and a single decoded instruction for this warp. The valid bit of an entry indicates that there is a non-issued decoded instruction within this entry in the I-Buffer, while the ready bit indicates that the decoded instruction of this warp is ready to be issued to the execution pipeline. Conceptually, the ready bit is set in the schedule and issue stage using the scoreboard logic and the availability of hardware resources (in the simulator software, rather than actually setting a ready bit, a readiness check is performed). The I-Buffer is initially empty with all valid bits and ready bits deactivated.
A warp is eligible for instruction fetch if it does not have any valid instructions within the I-Buffer. Eligible warps are scheduled to access the instruction cache in round robin order. Once selected, a read request is sent to the instruction cache with the address of the next instruction in the currently scheduled warp. By default, two consecutive instructions are fetched. Once a warp is scheduled for an instruction fetch, its valid bit in the I-Buffer is activated until all the fetched instructions of this warp are issued to the execution pipeline.
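The round-robin selection of a fetch-eligible warp can be summarized with the following sketch; the structures and field names are hypothetical, not GPGPU-Sim's shader model code.

```cpp
#include <cstdio>
#include <vector>

// Minimal sketch of round-robin fetch scheduling across warps.
// Field and function names are illustrative, not GPGPU-Sim's actual code.
struct WarpState {
    bool ibuffer_valid = false;   // I-Buffer still holds non-issued instructions
    bool finished      = false;   // all threads done, no outstanding writes
    unsigned next_pc   = 0;       // address of the next instruction to fetch
};

// Start searching just after the warp served last time; return -1 if no warp
// is eligible (every warp is finished or already has valid I-Buffer entries).
int select_fetch_warp(const std::vector<WarpState>& warps, int last_fetched) {
    int n = static_cast<int>(warps.size());
    for (int i = 1; i <= n; ++i) {
        int w = (last_fetched + i) % n;
        if (!warps[w].finished && !warps[w].ibuffer_valid)
            return w;
    }
    return -1;
}

int main() {
    std::vector<WarpState> warps(4);
    warps[1].ibuffer_valid = true;      // warp 1 still has buffered instructions
    int w = select_fetch_warp(warps, /*last_fetched=*/0);
    if (w >= 0)
        std::printf("fetch 2 consecutive instructions for warp %d at pc %u\n",
                    w, warps[w].next_pc);
    return 0;
}
```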
The instruction cache is a read-only, non-blocking set-associative cache that can model both FIFO and LRU replacement policies with on-miss or on-fill allocation policies. A request to the instruction cache results in either a hit, miss or a reservation fail. The reservation fail results if either the miss status holding register (MSHR) is full or there are no replaceable blocks in the cache set because all block are reserved by prior pending requests (see section Caches for more details). In both cases of hit and miss the round robin fetch scheduler moves to the next warp. In case of hit, the fetched instructions are sent to the decode stage. In the case of a miss a request will be generated by the instruction cache. When the miss response is received the block is filled into the instruction cache and the warp will again need to access the instruction cache. While the miss is pending, the warp does not access the instruction cache.
A warp finishes execution and is not considered by the fetch scheduler anymore if all its threads have finished execution without any outstanding stores or pending writes to local registers. The thread block is considered done once all warps within it are finished and have no pending operations. Once all thread blocks dispatched at a kernel launch finish, then this kernel is considered done.
At the decode stage, the recently fetched instructions are decoded and stored in their corresponding entries in the I-Buffer, waiting to be issued.
The simulator software design for this stage is described in Fetch and Decode.
Instruction Issue
A second round robin arbiter chooses a warp to issue from the I-Buffer to the rest of the pipeline. This round robin arbiter is decoupled from the round robin arbiter used to schedule instruction cache accesses. The issue scheduler can be configured to issue multiple instructions from the same warp per cycle. Each valid instruction (i.e. decoded and not issued) in the currently checked warp is eligible for issuing if (1) its warp is not waiting at a barrier, (2) it has valid instructions in its I-Buffer entries (valid bit is set), (3) the scoreboard check passes (see section Scoreboard for more details), and (4) the operand access stage of the instruction pipeline is not stalled.
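A minimal sketch of this four-part eligibility check is shown below, assuming hypothetical structure and function names (the real check lives in the simulator's issue logic):

```cpp
#include <vector>

// Minimal sketch of the four-part issue-eligibility check described above.
// All structure and function names are illustrative assumptions.
struct DecodedInst {
    bool valid = false;                       // decoded and not yet issued
    std::vector<unsigned> src_regs, dst_regs; // architectural register operands
};

struct WarpIssueState {
    bool at_barrier = false;                  // warp is waiting on __syncthreads()
    DecodedInst ibuffer[2];                   // two I-Buffer entries per warp
};

// Stub for the RAW/WAW check covered in the Scoreboard section below.
bool scoreboard_passes(unsigned /*warp_id*/, const DecodedInst& /*inst*/) {
    return true;
}

bool can_issue(const WarpIssueState& w, unsigned warp_id, int entry,
               bool operand_stage_stalled) {
    const DecodedInst& inst = w.ibuffer[entry];
    return !w.at_barrier                       // (1) not waiting at a barrier
        && inst.valid                          // (2) valid, non-issued instruction
        && scoreboard_passes(warp_id, inst)    // (3) no register hazards
        && !operand_stage_stalled;             // (4) operand collector can accept it
}
```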
Memory instructions (load, store, or memory barriers) are issued to the memory pipeline. For other instructions, the issue logic always prefers the SP pipe for operations that can use both the SP and SFU pipelines. If a control hazard is detected, the instructions in the I-Buffer corresponding to this warp are flushed. The warp's next PC is updated to point to the next instruction (assuming all branches are not-taken). For more information about handling control flow, refer to SIMT Stack.
At the issue stage barrier operations are executed. Also, the SIMT stack is updated (refer to SIMT Stack for more details) and register dependencies are tracked (refer to Scoreboard for more details). Warps wait for barriers ("__syncthreads()") at the issue stage.
SIMT Stack
A per-warp SIMT stack is used to handle the execution of branch divergence on single-instruction, multiple-thread (SIMT) architectures. Since divergence reduces the efficiency of these architectures, different techniques can be adopted to reduce this effect. One of the simplest techniques is the post-dominator stack-based reconvergence mechanism. This technique synchronizes the divergent branches at the earliest guaranteed reconvergence point in order to increase the efficiency of the SIMT architecture. Like previous versions of GPGPU-Sim, GPGPU-Sim 3.x adopts this mechanism.
Each entry of the SIMT stack represents a different divergence level. At each divergent branch, a new entry is pushed onto the top of the stack. The top-of-stack entry is popped when the warp reaches its reconvergence point. Each entry stores the target PC of the new branch, the immediate post-dominator reconvergence PC, and the active mask of the threads that diverge to this branch. In our model, the SIMT stack of each warp is updated after each instruction issue of this warp. In the case of no divergence, the target PC is simply updated to the next PC. In the case of divergence, new entries are pushed onto the stack with the new target PC, the active mask corresponding to the threads that diverge to this PC, and their immediate reconvergence point PC. Hence, a control hazard is detected if the next PC at the top entry of the SIMT stack does not equal the PC of the instruction currently under check.
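The following sketch captures the push/pop behaviour described above, using illustrative types (a 32-thread active mask and hypothetical member names) rather than GPGPU-Sim's actual SIMT stack class:

```cpp
#include <bitset>
#include <cstdint>
#include <vector>

// Minimal sketch of a post-dominator based per-warp SIMT stack.
// Types and member names are illustrative, not GPGPU-Sim's implementation.
using ActiveMask = std::bitset<32>;   // one bit per thread in the warp

struct SimtStackEntry {
    uint64_t   next_pc;        // PC the threads in this entry execute next
    uint64_t   reconv_pc;      // immediate post-dominator reconvergence PC
    ActiveMask active_mask;    // threads that diverged down this path
};

struct SimtStack {
    std::vector<SimtStackEntry> stack;

    SimtStack(uint64_t start_pc, ActiveMask all_threads) {
        stack.push_back({start_pc, UINT64_MAX, all_threads});   // whole-warp entry
    }

    // Called after a branch issues; the parent entry is split on divergence.
    void on_branch(uint64_t taken_pc, uint64_t not_taken_pc, uint64_t reconv_pc,
                   ActiveMask taken, ActiveMask not_taken) {
        if (taken.none() || not_taken.none()) {        // no divergence: just advance
            stack.back().next_pc = taken.any() ? taken_pc : not_taken_pc;
            return;
        }
        stack.back().next_pc = reconv_pc;              // parent resumes at reconvergence
        stack.push_back({not_taken_pc, reconv_pc, not_taken});
        stack.push_back({taken_pc,     reconv_pc, taken});   // taken path runs first
    }

    // Pop when the warp reaches the reconvergence point of the top entry.
    void on_reconverge(uint64_t pc) {
        if (stack.size() > 1 && pc == stack.back().reconv_pc) stack.pop_back();
    }

    // Control hazard: the fetched PC disagrees with the PC the stack expects.
    bool control_hazard(uint64_t fetched_pc) const {
        return fetched_pc != stack.back().next_pc;
    }
};
```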
See Dynamic Warp Formation: Efficient MIMD Control Flow on SIMD Graphics Hardware for more details.
Note that it is known that NVIDIA and AMD actually modify the contents of their divergence stack using special instructions. These divergence stack instructions are not exposed in PTX but are visible in the actual hardware SASS instruction set (visible using decuda or NVIDIA's cuobjdump). When the current version of GPGPU-Sim 3.x is configured to execute SASS via PTXPlus (see PTX vs. PTXPlus) it ignores these low level instructions and instead a comparable control flow graph is created to identify immediate post-dominators. We plan to support execution of the low level branch instructions in a future version of GPGPU-Sim 3.x.
Scoreboard
The scoreboard algorithm checks for WAW and RAW dependency hazards. As explained above, the registers written to by a warp are reserved at the issue stage. The scoreboard is indexed by warp ID and stores the required register numbers in the entry corresponding to that warp ID. The reserved registers are released at the writeback stage.
As mentioned above, the decoded instruction of a warp is not scheduled for issue until the scoreboard indicates no WAW or RAW hazards exist. The scoreboard detects WAW and RAW hazards by tracking which registers will be written to by an instruction that has issued but not yet written its results back to the register file.
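A minimal sketch of such a scoreboard follows, with hypothetical class and method names (the real implementation is in the simulator source):

```cpp
#include <set>
#include <vector>

// Minimal sketch of the per-warp scoreboard used for RAW/WAW hazard checks.
// Class and method names are illustrative, not GPGPU-Sim's actual code.
class Scoreboard {
    std::vector<std::set<unsigned>> pending_writes_;   // indexed by warp id
public:
    explicit Scoreboard(unsigned num_warps) : pending_writes_(num_warps) {}

    // At issue: reserve every destination register of the instruction.
    void reserve(unsigned wid, const std::vector<unsigned>& dst_regs) {
        for (unsigned r : dst_regs) pending_writes_[wid].insert(r);
    }

    // At writeback: release the destination registers.
    void release(unsigned wid, const std::vector<unsigned>& dst_regs) {
        for (unsigned r : dst_regs) pending_writes_[wid].erase(r);
    }

    // RAW: a source is still pending; WAW: a destination is still pending.
    bool has_hazard(unsigned wid, const std::vector<unsigned>& srcs,
                    const std::vector<unsigned>& dsts) const {
        const std::set<unsigned>& pending = pending_writes_[wid];
        for (unsigned r : srcs) if (pending.count(r)) return true;
        for (unsigned r : dsts) if (pending.count(r)) return true;
        return false;
    }
};
```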
Register Access and the Operand Collector
Various NVIDIA patents describe a structure called an "operand collector". The operand collector is a set of buffers and arbitration logic used to provide the appearance of a multiported register file using multiple banks of single-ported RAMs. The overall arrangement saves energy and area, which is important for improving throughput. Note that AMD also uses banked register files, but the compiler is responsible for ensuring these are accessed so that no bank conflicts occur.
Figure 6 provides an illustration of the detailed way in which GPGPU-Sim 3.x models the operand collector.
After an instruction is decoded, a hardware unit called a collector unit is allocated to buffer the source operands of the instruction.
The collector units are not used to eliminate name dependencies via register renaming, but rather as a way to space register operand accesses out in time so that no more than one access to a bank occurs in a single cycle. In the organization shown in the figure, each of the four collector units contains three operand entries. Each operand entry has four fields: a valid bit, a register identifier, a ready bit, and operand data. Each operand data field can hold a single 128 byte source operand composed of 32 four byte elements (one four byte value for each scalar thread in a warp). In addition, the collector unit contains an identifier indicating which warp the instruction belongs to. The arbitrator contains a read request queue for each bank to hold access requests until they are granted.
When an instruction is received from the decode stage and a collector unit is available it is allocated to the instruction and the operand, warp, register identifier and valid bits are set. In addition, source operand read requests are queued in the arbiter. To simplify the design, data being written back by the execution units is always prioritized over read requests. The arbitrator selects a group of up to four non-conflicting accesses to send to the register file. To reduce crossbar and collector unit area the selection is made so that each collector unit only receives one operand per cycle.
As each operand is read out of the register file and placed into the corresponding collector unit, a “ready bit” is set. Finally, when all the operands are ready the instruction is issued to a SIMD execution unit.
In our model, each back-end pipeline (SP, SFU, MEM) has a set of dedicated collector units, and they share a pool of general collector units. The number of units available to each pipeline and the capacity of the pool of the general units are configurable.
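The sketch below illustrates one cycle of the arbitration described above: at most one read per register file bank and one operand per collector unit is granted. The bank-mapping function and all names are assumptions for illustration only.

```cpp
#include <cstdint>
#include <set>
#include <utility>
#include <vector>

// Minimal sketch of one arbitration cycle over banked register files.
// Names, sizes, and the bank-mapping function are illustrative assumptions.
struct OperandEntry {
    bool     valid = false;   // entry holds an outstanding source operand
    bool     ready = false;   // operand has already been read from the register file
    unsigned reg   = 0;       // architectural register identifier
};

struct CollectorUnit {
    unsigned warp_id = 0;
    OperandEntry operands[3]; // three source operands per instruction (as in Figure 6)
};

unsigned bank_of(unsigned reg, unsigned warp_id, unsigned num_banks) {
    return (reg + warp_id) % num_banks;   // simple warp-skewed mapping (assumption)
}

// Grant at most one read per bank and one operand per collector unit this cycle;
// writebacks are assumed to have claimed their banks before reads are considered.
std::vector<std::pair<unsigned, unsigned>>      // (collector index, operand index)
arbitrate_reads(const std::vector<CollectorUnit>& cus, unsigned num_banks) {
    std::set<unsigned> banks_used, cus_used;
    std::vector<std::pair<unsigned, unsigned>> grants;
    for (unsigned c = 0; c < cus.size(); ++c) {
        for (unsigned o = 0; o < 3; ++o) {
            const OperandEntry& e = cus[c].operands[o];
            if (!e.valid || e.ready) continue;                  // nothing pending here
            unsigned b = bank_of(e.reg, cus[c].warp_id, num_banks);
            if (banks_used.count(b) || cus_used.count(c)) continue;
            banks_used.insert(b);
            cus_used.insert(c);
            grants.push_back(std::make_pair(c, o));             // read granted this cycle
        }
    }
    return grants;
}
```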
ALU Pipelines
GPGPU-Sim v3.x models two types of ALU functional units.
- SP units execute all types of ALU instructions except transcendentals.
- SFU units execute transcendental instructions (sine, cosine, log, etc.).
Both types of units are pipelined and SIMDized. The SP unit can usually execute one warp instruction per cycle, while the SFU unit may only execute a new warp instruction every few cycles, depending on the instruction type. For example, the SFU unit can execute a sine instruction every 4 cycles or a reciprocal instruction every 2 cycles. Different types of instructions also have different execution latencies.
Each SIMT core has one SP unit and one SFU unit. Each unit has an independent issue port from the operand collector. Both units share the same output pipeline register that connects to a common writeback stage. There is a result bus allocator at the output of the operand collector to ensure that the units will never be stalled due to the shared writeback. Each instruction will need to allocate a cycle slot in the result bus before being issued to either unit. Notice that the memory pipeline has its own writeback stage and is not managed by this result bus allocator.
The software design section contains more implementation detail of the model.
Memory Pipeline (LDST unit)
GPGPU-Sim supports the various memory spaces in CUDA as visible in PTX. In our model, each SIMT core has 4 different on-chip level 1 memories: shared memory, data cache, constant cache, and texture cache. The following table shows which on-chip memories service which type of memory access:
| Core Memory | PTX Accesses |
|---|---|
| Shared memory (R/W) | CUDA shared memory (OpenCL local memory) accesses only |
| Constant cache (Read Only) | Constant memory and parameter memory |
| Texture cache (Read Only) | Texture accesses only |
| Data cache (R/W - evict-on-write for global memory, writeback for local memory) | Global and Local memory accesses (Local memory = Private data in OpenCL) |
Although these are modelled as separate physical structures, they are all components of the memory pipeline (LDST unit) and therefore they all share the same writeback stage. The following describes how each of these spaces is serviced:
- Texture Memory - Accesses to texture memory are cached in the L1 texture cache (reserved for texture accesses only) and also in the L2 cache (if enabled). The L1 texture cache is a special design described in the 1998 paper [3]. Threads on the GPU cannot write to the texture memory space, thus the L1 texture cache is read-only.
- Shared Memory - Each SIMT core contains a configurable amount of shared scratchpad memory that can be shared by threads within a thread block. This memory space is not backed by any L2, and is explicitly managed by the programmer.
- Constant Memory - Constants and parameter memory are cached in a read-only constant cache.
- Parameter Memory - See above.
- Local Memory - Cached in the L1 data cache and backed by the L2. Treated in a fashion similar to global memory below, except that values are written back on eviction since there can be no sharing of local (private) data.
- Global Memory - Global and local accesses are both serviced by the L1 data cache. Accesses by scalar threads from the same warp are coalesced on a half-warp basis as described in the CUDA 3.1 programming guide [4]. These accesses are processed at a rate of 2 per SIMT core cycle, such that a memory instruction that is perfectly coalesced into 2 accesses (one per half-warp) can be serviced in a single cycle. Instructions that generate more than 2 accesses access the memory system at a rate of 2 per cycle. So, if a memory instruction generates 32 accesses (one per lane in the warp), it will take at least 16 SIMT core cycles to move the instruction to the next pipeline stage (see the sketch after this list).
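A small worked example of the access-rate arithmetic above, with illustrative function and variable names:

```cpp
#include <cstdio>

// Worked example of the access rate above: with 2 coalesced accesses serviced
// per SIMT core cycle, an instruction that generates N accesses occupies the
// LDST unit for at least ceil(N / 2) cycles. Names are illustrative.
unsigned ldst_cycles_for(unsigned num_coalesced_accesses) {
    const unsigned accesses_per_cycle = 2;
    return (num_coalesced_accesses + accesses_per_cycle - 1) / accesses_per_cycle;
}

int main() {
    std::printf("perfectly coalesced (2 accesses): %u cycle(s)\n", ldst_cycles_for(2));
    std::printf("fully diverged (32 accesses):     %u cycle(s)\n", ldst_cycles_for(32));
    return 0;
}
```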
The subsections below describe the first level memory structures.
L1 Data Cache
The L1 data cache is a private, per-SIMT-core, non-blocking first level cache for local and global memory accesses. The L1 cache is not banked and is able to service two coalesced memory requests per SIMT core cycle. An incoming memory request must not span two or more cache lines in the L1 data cache. Note also that the L1 data caches are not coherent.
The table below summarizes the write policies for the L1 data cache.
| L1 data cache write policy | Local Memory | Global Memory |
|---|---|---|
| Write Hit | Write-back | Write-evict |
| Write Miss | Write no-allocate | Write no-allocate |
For local memory, the L1 data cache acts as a write-back cache with write no-allocate. For global memory, write hits cause eviction of the block. This mimics the default policy for global stores as outlined in the PTX ISA specification [5].
Memory accesses that hit in the L1 data cache are serviced in one SIMT core clock cycle. Missed accesses are inserted into a FIFO miss queue. One fill request per SIMT clock cycle is generated by the L1 data cache (given the interconnection injection buffers are able to accept the request).
The cache uses Miss Status Holding Registers (MSHR) to hold the status of misses in progress. These are modeled as a fully-associative array. Redundant accesses to the memory system that take place while one request is in flight are merged in the MSHRs. The MSHR table has a fixed number of MSHR entries. Each MSHR entry can service a fixed number of miss requests for a single cache line. The number of MSHR entries and maximum number of requests per entry are configurable.
A memory request that misses in the cache is added to the MSHR table and a fill request is generated if there is no pending request for that cache line. When a fill response to the fill request is received at the cache, the cache line is inserted into the cache and the corresponding MSHR entry is marked as filled. Responses for filled MSHR entries are generated at one request per cycle. Once all the requests waiting at the filled MSHR entry have been responded to and serviced, the MSHR entry is freed.
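A minimal sketch of an MSHR table with miss merging, a configurable entry count, and a per-entry merge limit is shown below. The interface is a hypothetical simplification of the simulator's implementation.

```cpp
#include <cstdint>
#include <unordered_map>
#include <vector>

// Minimal sketch of an MSHR table that merges redundant misses to the same
// cache line. The interface and sizes are illustrative, simplified relative
// to the simulator's implementation.
class MshrTable {
    struct Entry { std::vector<uint64_t> waiting; bool filled = false; };
    std::unordered_map<uint64_t, Entry> entries_;   // keyed by cache-line address
    unsigned max_entries_, max_merged_;
public:
    MshrTable(unsigned max_entries, unsigned max_merged)
        : max_entries_(max_entries), max_merged_(max_merged) {}

    // Accept a miss if it can be merged or a new entry is free; sets need_fill
    // when a fill request must be sent to the next memory level. Returning
    // false corresponds to a "reservation fail".
    bool on_miss(uint64_t line, uint64_t req_id, bool& need_fill) {
        need_fill = false;
        auto it = entries_.find(line);
        if (it != entries_.end()) {                        // merge with in-flight miss
            if (it->second.waiting.size() >= max_merged_) return false;
            it->second.waiting.push_back(req_id);
            return true;
        }
        if (entries_.size() >= max_entries_) return false; // table full
        need_fill = true;                                  // first miss to this line
        entries_[line].waiting.push_back(req_id);
        return true;
    }

    // Fill response arrived: the line is now present in the cache.
    void on_fill(uint64_t line) { entries_[line].filled = true; }

    // Respond to one waiting request per cycle; free the entry once drained.
    bool pop_response(uint64_t line, uint64_t& req_id) {
        auto it = entries_.find(line);
        if (it == entries_.end() || !it->second.filled || it->second.waiting.empty())
            return false;
        req_id = it->second.waiting.back();
        it->second.waiting.pop_back();
        if (it->second.waiting.empty()) entries_.erase(it);
        return true;
    }
};
```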
Texture Cache
The texture cache model is a prefetching texture cache. Texture memory accesses exhibit mostly spatial locality and this locality has been found to be mostly captured with about 16 KB of storage. In realistic graphics usage scenarios many texture cache accesses miss. The latency to access texture in DRAM is on the order of many 100's of cycles. Given the large memory access latency and small cache size the problem of when to allocate lines in the cache becomes paramount. The prefetching texture cache solves the problem by temporally decoupling the state of the cache tags from the state of the cache blocks. The tag array represents the state the cache will be in when misses have been serviced 100's of cycles later. The data array represents the state after misses have been serviced. The key to making this decoupling work is to use a reorder buffer to ensure returning texture miss data is placed into the data array in the same order the tag array saw the access. For more details see the original paper.
Constant (Read only) Cache
Accesses to constant and parameter memory run through the L1 constant cache. This cache is implemented with a tag array and is like the L1 data cache with the exception that it cannot be written to.
Thread Block / CTA / Work Group Scheduling
Thread blocks, Cooperative Thread Arrays (CTAs) in CUDA terminology or Work Groups in OpenCL terminology, are issued to SIMT Cores one at a time. Every SIMT Core clock cycle, the thread block issuing mechanism selects and cycles through the SIMT Core Clusters in a round robin fashion. For each selected SIMT Core Cluster, the SIMT Cores are selected and cycled through in a round robin fashion. For every selected SIMT Core, a single thread block will be issued to the core from the selected kernel if there are enough free resources on that SIMT Core.
If multiple CUDA Streams or command queues are used in the application, then multiple kernels can be executed concurrently in GPGPU-Sim. Different kernels can be executed across different SIMT Cores; a single SIMT Core can only execute thread blocks from a single kernel at a time. If multiple kernels are concurrently being executed, then the selection of the kernel to issue to each SIMT Core is also round robin. Concurrent kernel execution on CUDA architectures is described in the NVIDIA CUDA Programming Guide.
Interconnection Network
The interconnection network is responsible for communication between the SIMT core clusters and the memory partition units. To simulate the interconnection network we have interfaced the "booksim" simulator to GPGPU-Sim. Booksim is a stand-alone network simulator that can be found here. Booksim is capable of simulating virtual channel based Tori and Fly networks and is highly configurable. It can be best understood by referring to the book "Principles and Practices of Interconnection Networks" by Dally and Towles.
We refer to our modified version of booksim as Intersim. Intersim has its own clock domain. The original booksim only supports a single interconnection network. We have made some changes to be able to simulate two interconnection networks: one for traffic from the SIMT core clusters to the memory partitions and one for traffic from the memory partitions back to the SIMT core clusters. This is one way of avoiding circular dependencies that might cause protocol deadlocks in the system. Another way would be having dedicated virtual channels for request and response traffic on a single physical network, but this capability is not fully supported in the current version of our public release.
Note: A newer version of Booksim (Booksim 2.0) is now available from Stanford, but GPGPU-Sim 3.x does not yet use it.
Please note that SIMT Core Clusters do not communicate directly with each other, and hence there is no notion of coherence traffic in the interconnection network. There are only four packet types: (1) read requests and (2) write requests sent from SIMT core clusters to memory partitions, and (3) read replies and (4) write acknowledgements sent from memory partitions to SIMT Core Clusters.
Concentration
SIMT core Clusters act as external concentrators in GPGPU-Sim. From the interconnection network's point of view a SIMT core cluster is a single node and the routers connected to this node only have one injection and one ejection port.
Interface with GPGPU-Sim
The interconnection network, regardless of its internal configuration, provides a simple interface to communicate with the SIMT core clusters and memory partitions that are connected to it. To inject packets, SIMT core clusters or memory controllers first check whether the network has enough buffer space to accept their packet and then send the packet to the network. For ejection, they check whether there is a packet waiting for ejection in the network and then pop it. These actions happen in each unit's clock domain. The serialization of packets is handled inside the network interface, e.g. a SIMT core cluster injects a packet in the SIMT core cluster clock domain but the router accepts only one flit per interconnect clock cycle. More implementation details can be found in the Software design section.
Memory Partition
The memory system in GPGPU-Sim is modelled by a set of memory partitions. As shown in Figure 7, each memory partition contains an L2 cache bank, a DRAM access scheduler, and the off-chip DRAM channel. The functional execution of atomic operations also occurs in the memory partitions, in the Atomic Operation Execution phase. Each memory partition is assigned a sub-set of the physical memory addresses for which it is responsible. By default, the global linear address space is interleaved among partitions in chunks of 256 bytes. This partitioning of the address space, along with the detailed mapping of the address space to DRAM rows, banks, and columns in each partition, is configurable and described in the Address Decoding section.
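For illustration, the default 256-byte interleaving can be thought of as the mapping below. This is a simplification with an assumed function name; the configurable address decoding described later also extracts row, bank, and column bits.

```cpp
#include <cstdint>
#include <cstdio>

// Illustration of the default interleaving: the global linear address space is
// spread across memory partitions in 256-byte chunks. The function name is an
// assumption; the real, configurable decoding also extracts row/bank/column bits.
unsigned partition_of(uint64_t addr, unsigned num_partitions) {
    const uint64_t chunk_size = 256;                 // bytes per interleave chunk
    return static_cast<unsigned>((addr / chunk_size) % num_partitions);
}

int main() {
    const unsigned num_partitions = 8;               // e.g., a GT200-like setup
    for (uint64_t addr = 0; addr < 4 * 256; addr += 256)
        std::printf("address 0x%llx -> partition %u\n",
                    static_cast<unsigned long long>(addr),
                    partition_of(addr, num_partitions));
    return 0;
}
```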
The L2 cache (when enabled) services the incoming texture and (when configured to do so) non-texture memory requests. Note the Quadro FX 5800 (GT200) configuration enables the L2 for texture references only. On a cache miss, the L2 cache bank generates memory requests to the DRAM channel to be serviced by off-chip GDDR3 DRAM.
The subsections below describe in more detail how traffic flows through the memory partition along with the individual components mentioned above.
Memory Partition Connections and Traffic Flow
Figure 7 above shows the three sub-components inside a single Memory Partition and the various FIFO queues that facilitate the flow of memory requests and responses between them.
The memory request packets enter the Memory Partition from the interconnect via the ICNT->L2 queue. Non-texture accesses are directed through the Raster Operations Pipeline (ROP) queue to model a minimum pipeline latency of 460 L2 clock cycles, as observed by a GT200 micro-benchmarking study. The L2 cache bank pops one request per L2 clock cycle from the ICNT->L2 queue for servicing. Any memory requests for the off-chip DRAM generated by the L2 are pushed into the L2->dram queue. If the L2 cache is disabled, packets are popped from the ICNT->L2 and pushed directly into the L2->DRAM queue, still at the L2 clock frequency. Fill requests returning from off-chip DRAM are popped from DRAM->L2 queue and consumed by the L2 cache bank. Read replies from the L2 to the SIMT core are pushed through the L2->ICNT queue.
The DRAM latency queue is a fixed-latency queue that models the minimum latency difference between an L2 access and a DRAM access (an access that has missed the L2 cache). This latency is observed via micro-benchmarking, and this queue simply models that observation (rather than the real hardware mechanism that causes the delay). Requests exiting the L2->DRAM queue reside in the DRAM latency queue for a fixed number of SIMT core clock cycles. Each DRAM clock cycle, the DRAM channel can pop a memory request from the DRAM latency queue to be serviced by off-chip DRAM, and push one serviced memory request into the DRAM->L2 queue.
Note that ejection from the interconnect to Memory Partition (ROP or ICNT->L2 queues) occurs in L2 clock domain while injection into the interconnect from Memory Partition (L2->ICNT queue) occurs in the interconnect (ICNT) clock domain.
L2 Cache Model and Cache Hierarchy
The L2 cache model is very similar to the L1 data caches in SIMT cores (see that section for more details). When enabled to cache the global memory space data, the L2 acts as a read/write cache with write policies as summarized in the table below.
| L2 cache write policy | Local Memory | Global Memory |
|---|---|---|
| Write Hit | Write-back for L1 write-backs | Write-evict |
| Write Miss | Write no-allocate | Write no-allocate |
Additionally, note that the L2 cache is a unified last level cache that is shared by all SIMT cores, whereas the L1 caches are private to each SIMT core.
The private L1 data caches are not coherent (the other L1 caches are for read-only address spaces). The cache hierarchy in GPGPU-Sim is non-inclusive non-exclusive. Additionally, a non-decreasing cache line size going down the cache hierarchy (i.e. increasing cache level) is enforced. A memory request from the first level cache also cannot span across two cache lines in the L2 cache. These two restrictions ensure:
- A request from a lower level cache can be serviced by one cache line in the higher level cache. This ensures that requests from the L1 can be serviced atomically by the L2 cache.
- Atomic operations do not need to access multiple blocks at the L2
These restrictions simplify the cache design and avoid the live-lock related issues that would arise from servicing a request from the L1 non-atomically.
Atomic Operation Execution Phase
The Atomic Operation Execution phase is a very idealized model of atomic instruction execution. Atomic instructions with non-conflicting memory accesses that were coalesced into one memory request are executed at the memory partition in one cycle. In the performance model, we currently model an atomic operation as a global load operation that skips the L1 data cache. This generates all the necessary register writeback traffic (and data hazard stalls) within the SIMT core. At L2 cache, the atomic operation marks the accessed cache line dirty (changing its status to modified) to generate the writeback traffic to DRAM. If L2 cache is not enabled (or used for texture access only), then no DRAM write traffic will be generated for atomic operations (a very idealized model).
DRAM Scheduling and Timing Model
GPGPU-Sim models both DRAM scheduling and timing. GPGPU-Sim implements two open page mode DRAM schedulers: a FIFO (First In First Out) scheduler and a FR-FCFS (First-Ready First-Come-First-Serve) scheduler, both described below. These can be selected using the configuration option -gpgpu_dram_scheduler.
FIFO Scheduler
The FIFO scheduler services requests in the order they are received. This tends to cause a large number of precharges and activates, and hence results in poorer performance, especially for applications that generate a large amount of memory traffic relative to the amount of computation they perform.
FR-FCFS
The First-Ready First-Come-First-Served scheduler gives higher priority to requests to a currently open row in any of the DRAM banks. The scheduler schedules all requests in the queue to open rows first. If no such request exists, it opens a new row for the oldest request. The code for this scheduler is located in dram_sched.h/.cc.
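A minimal sketch of this selection policy is shown below, assuming hypothetical structures (the actual implementation is in dram_sched.h/.cc):

```cpp
#include <cstdint>
#include <deque>
#include <vector>

// Minimal sketch of FR-FCFS request selection for one DRAM channel.
// Structures and names are illustrative, not the code in dram_sched.h/.cc.
struct DramRequest { unsigned bank; uint64_t row; };
struct BankState   { bool row_open = false; uint64_t open_row = 0; };

// Return the index of the oldest request that hits a currently open row;
// if none exists, return the oldest request overall (which opens a new row).
int select_request(const std::deque<DramRequest>& queue,
                   const std::vector<BankState>& banks) {
    for (int i = 0; i < static_cast<int>(queue.size()); ++i) {
        const BankState& b = banks[queue[i].bank];
        if (b.row_open && b.open_row == queue[i].row)
            return i;                                // row hit: first-ready wins
    }
    return queue.empty() ? -1 : 0;                   // fall back to oldest request
}
```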
DRAM Timing Model
GPGPU-Sim accurately models graphics DRAM memory. Currently GPGPU-Sim 3.x models GDDR3 DRAM, though we are working on adding a detailed GDDR5 model. The following DRAM timing parameters can be set using the option -gpgpu_dram_timing_opt nbk:tCCD:tRRD:tRCD:tRAS:tRP:tRC:CL:WL:tCDLR:tWR. Currently, we do not model the timing of DRAM refresh operations. Please refer to GDDR3 specifications for more details about each parameter.
- nbk: number of banks
- tCCD: Column to Column Delay (RD/WR to RD/WR, different banks)
- tRRD: Row to Row Delay (Active to Active, different banks)
- tRCD: Row to Column Delay (Active to RD/WR/RTR/WTR/LTR)
- tRAS: Active to PRECHARGE command period
- tRP: PRECHARGE command period
- tRC: Active to Active command period (same bank)
- CL: CAS Latency
- WL: WRITE latency
- tCDLR: Last data-in to Read Delay (switching from write to read)
- tWR: WRITE recovery time
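For example, a gpgpusim.config entry for this option might look like the line below. The values shown are purely illustrative placeholders, not a validated GDDR3 timing set; the configuration files shipped with GPGPU-Sim contain calibrated values.
-gpgpu_dram_timing_opt 8:2:6:12:25:10:35:10:6:6:12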
In our model, commands for each memory bank are scheduled in a round-robin fashion. The banks are arranged in a circular array with a pointer to the bank with the highest priority. The scheduler goes through the banks in order and issues commands. Whenever an activate or precharge command is issued for a bank, the priority pointer is set to the next bank, guaranteeing that pending commands for other banks will eventually be scheduled.
Instruction Set Architecture (ISA)
PTX and SASS
GPGPU-Sim simulates the Parallel Thread eXecution (PTX) instruction set used by NVIDIA. PTX is a pseudo-assembly instruction set; i.e. it does not execute directly on the hardware. ptxas is the assembler released by NVIDIA to assemble PTX into the native instruction set run by the hardware (SASS). Each hardware generation supports a different version of SASS. For this reason, PTX is compiled into multiple versions of SASS that correspond to different hardware generations at compile time. Despite that, the PTX code is still embedded into the binary to enable support for future hardware. At runtime, the runtime system selects the appropriate version of SASS to run based on the available hardware. If there is none, the runtime system invokes a just-in-time (JIT) compiler on the embedded PTX to compile it into the SASS corresponding to the available hardware.
PTXPlus
GPGPU-Sim is capable of running PTX. However, since PTX is not the actual code that runs on the hardware, there is a limit to how accurate it can be. This is mainly due to compiler passes such as strength reduction, instruction scheduling, and register allocation, to mention a few.
To enable running SASS code in GPGPU-Sim, new features had to be added:
- New addressing modes
- More elaborate condition codes and predicates
- Additional instructions
- Additional datatypes
In order to avoid developing and maintaining two parsers and two functional execution engines (one for PTX and the other for SASS), we chose to extend PTX with the required features in order to provide a one-to-one mapping to SASS. PTX along with these extensions is called PTXPlus. To run SASS, we perform a syntax conversion from SASS to PTXPlus.
PTXPlus has a very similar syntax when compared to PTX with the addition of new addressing modes, more elaborate condition codes and predicates, additional instructions and more data types. It is important to keep in mind that PTXPlus is a superset of PTX, which means that valid PTX is also valid PTXPlus. More details about the exact differences between PTX and PTXPlus can be found in #PTX vs. PTXPlus.
From SASS to PTXPlus
When the configuration file instructs GPGPU-Sim to run SASS, a conversion tool, cuobjdump_to_ptxplus, is used to convert the SASS embedded within the binary to PTXPlus. For the full details of the conversion process see #PTXPlus Conversion. The PTXPlus is then used in the simulation. When SASS is converted to PTXPlus, only the syntax is changed; the instructions and their order are preserved exactly as in the SASS. Thus, the effect of compiler optimizations applied to the native code is fully captured. Currently, GPGPU-Sim only supports the conversion of GT200 SASS to PTXPlus.
Using GPGPU-Sim
Refer to the README file in the top level GPGPU-Sim directory for instructions on building and running GPGPU-Sim 3.x. This section provides other important guidance on using GPGPU-Sim 3.x, covering topics such as different simulation modes, how to modify the timing model configuration, a description of the default simulation statistics, and description of approaches for analyzing bugs at the functional level via tracing simulation state and a GDB-like interface. GPGPU-Sim 3.x also provides extensive support for debugging performance simulation bugs including both a high level microarchitecture visualizer and cycle by cycle pipeline state visualization. Next, we describe strategies for debugging GPGPU-Sim when it crashes or deadlocks in performance simulation mode.
Finally, this section concludes with answers to frequently asked questions.
Simulation Modes
By default most users will want to use GPGPU-Sim 3.x to estimate the number of GPU clock cycles it takes to run an application. This is known as performance simulation mode. When trying to run a new application on GPGPU-Sim it is always possible that the application may not run correctly--i.e., it may generate the wrong output. To help debug such applications, GPGPU-Sim 3.x also supports a fast functional-simulation-only mode. This mode may also be helpful for compiler research and/or when making changes to the functional simulation engine. Orthogonal to the distinction between performance and functional simulation, GPGPU-Sim 3.x also supports execution of the native hardware ISA on NVIDIA GPUs (currently GT200 and earlier only), via an extended PTX syntax we call PTXPlus. The following subsections describe these features in turn.
| CUDA Version | PTX | PTXPlus | cuobjdump+PTX | cuobjdump+PTXPlus |
|---|---|---|---|---|
| 2.3 | ? | No | No | No |
| 3.1 | Yes | No | No | No |
| 4.0 | No | No | Yes | Yes |
Performance Simulation
Performance simulation is the default simulation mode. It collects performance statistics, at the cost of slower simulation speed. GPGPU-Sim simulates the microarchitecture described in the Microarchitecture Model section.
To select the performance simulation mode, add the following line to the gpgpusim.config file:
-gpgpu_ptx_sim_mode 0
For information on understanding the simulation output, refer to the section on understanding simulation output.
Pure Functional Simulation
Pure functional simulation runs faster than performance simulation, but it only executes the CUDA/OpenCL program and does not collect performance statistics.
To select the pure functional simulation mode, add the following line to the gpgpusim.config file:
-gpgpu_ptx_sim_mode 1
Alternatively, you can set the environment variable PTX_SIM_MODE_FUNC to 1, then execute the program normally as you would in performance simulation mode.
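For example, the following shell commands would run an application in pure functional simulation without editing gpgpusim.config (the application name myCudaApp is a placeholder for a CUDA binary linked against GPGPU-Sim's libcudart):
export PTX_SIM_MODE_FUNC=1
./myCudaApp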
Since it simulates only the functionality of a GPU device, GPGPU-Sim's pure functional simulation mode executes a CUDA/OpenCL program as if it were running on a real GPU device; no performance measurements are collected in this mode, and only the regular output of the GPU program is shown. As you would expect, pure functional simulation is significantly faster than performance simulation (about 5~10 times faster).
This mode is very useful if you want to quickly check that your code works correctly on GPGPU-Sim, or if you want to experiment with CUDA/OpenCL without having a real GPU compute device. Pure functional simulation supports the same CUDA versions as performance simulation (CUDA v3.1, and CUDA v2.3 for PTXPlus). The pure functional simulation mode executes programs as a group of warps: the warps of each Cooperative Thread Array (CTA) are executed until they all finish or all wait at a barrier; in the latter case, once all the warps have reached the barrier they are cleared to proceed past it.
Software design details for Pure Functional Simulation can be found below.
Interactive Debugger Mode
Interactive debugger mode offers a GDB-like interface for debugging functional behavior in GPGPU-Sim. Currently, however, it only works with performance simulation.
To enable interactive debugger mode, set the environment variable GPGPUSIM_DEBUG to 1. The supported commands are:
Command | Description
---|---
dp <id> | Dump pipeline: display the state (pipeline contents) of the SIMT core <id>.
q | Quit.
b <file>:<line> <thread uid> | Set a breakpoint at <file>:<line> for the thread with <thread uid>.
d <uid> | Delete breakpoint.
s | Single-step execution to the next core cycle for all cores.
c | Continue execution without single stepping.
w <address> | Set a watchpoint at <address>. <address> is specified as a hexadecimal number.
l | List PTX around the current breakpoint.
h | Display help message.
It is implemented in files debug.h and debug.cc.
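A minimal session might look like the following sketch (the PTX file name, line number, and thread uid are illustrative placeholders, not values from a real benchmark):
export GPGPUSIM_DEBUG=1
./myCudaApp
# at the debugger prompt, commands from the table above can be entered, e.g.:
#   b vectorAdd.ptx:20 1    (break at line 20 of a hypothetical vectorAdd.ptx for thread 1)
#   s                       (advance all cores by one cycle)
#   dp 0                    (dump the pipeline of SIMT core 0)
#   c                       (continue execution)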
Cuobjdump Support
As of GPGPU-Sim version 3.1.0, support for using cuobjdump was added. cuobjdump is a tool provided by NVIDIA to extract information such as SASS and PTX from binaries. GPGPU-Sim supports using cuobjdump to extract the information it needs to run either SASS or PTX instead of obtaining it through the cubin files. Using cuobjdump is supported only with CUDA 4.0, and it is enabled by default if the simulator is compiled with CUDA 4.0. To enable/disable cuobjdump, add one of the following options to your configuration file:
# disable cuobjdump
-gpgpu_ptx_use_cuobjdump 0
# enable cuobjdump
-gpgpu_ptx_use_cuobjdump 1
PTX vs. PTXPlus
By default, GPGPU-Sim 3.x simulates PTX instructions. However, when executing on an actual GPU, PTX is recompiled to a native GPU ISA (SASS). This recompilation is not fully accounted for in the simulation of normal PTX instructions. To address this issue we created PTXPlus. PTXPlus is an extended form of PTX, introduced by GPGPU-Sim 3.x, that allows a near 1-to-1 mapping of most GT200 SASS instructions to PTXPlus instructions. It includes new instructions and addressing modes that do not exist in regular PTX. When conversion to PTXPlus is activated, the SASS instructions that make up the program are translated into PTXPlus instructions that can be simulated by GPGPU-Sim. Using the PTXPlus conversion option can lead to significantly more accurate results. However, conversion to PTXPlus does not yet fully support all programs that could be simulated using normal PTX. Currently, only CUDA Toolkit versions 4.0 and later are supported for conversion to PTXPlus.
SASS is the term NVIDIA uses for the native instruction set used by its GPUs, according to their released documentation of the instruction sets. This documentation can be found in the file "cuobjdump.pdf" released with the CUDA Toolkit.
To extract the SASS from an executable, GPGPU-Sim uses cuobjdump -- a tool released along with the CUDA Toolkit by NVIDIA that extracts PTX, SASS and other information from CUDA executables. GPGPU-Sim 3.x includes a stand-alone program called cuobjdump_to_ptxplus that is invoked to convert the output of cuobjdump into PTXPlus, which GPGPU-Sim can simulate. cuobjdump_to_ptxplus is a program written in C++ and its source is provided with the GPGPU-Sim distribution. See the PTXPlus Conversion section for a detailed description of the PTXPlus conversion process. Currently, cuobjdump_to_ptxplus supports the conversion of SASS for sm versions < sm_20.
To enable PTXPlus simulation, add the following line to the gpgpusim.config file:
-gpgpu_ptx_convert_to_ptxplus 1
Additionally, the converted PTXPlus can be saved to files named "_#.ptxplus" by adding the following line to the gpgpusim.config file:
-gpgpu_ptx_save_converted_ptxplus 1
To turn off either feature, either remove the line or change the value from 1 to 0. More details about using PTXPlus can be found in PTXPlus support. If the option above is enabled, GPGPU-Sim will attempt to convert the SASS code to PTXPlus and then run the resulting PTXPlus. However, as mentioned before, not all programs are supported in this mode.
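Putting these options together, a gpgpusim.config excerpt for running converted SASS might look like the following sketch (whether it works depends on your CUDA version and benchmark, as noted above):
-gpgpu_ptx_use_cuobjdump 1
-gpgpu_ptx_convert_to_ptxplus 1
-gpgpu_ptx_save_converted_ptxplus 1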
The subsections below describe the additions we made to PTX to obtain PTXPlus.
Addressing Modes
To support GT200 SASS, PTXPlus increases the number of addressing modes available to most instructions. Non-load/non-store instructions are now able to access memory directly. The following instruction adds the value in register $r0 to the value stored in shared memory at address 0x0010 and stores the result in register $r1:
add.u32 $r1, s[0x0010], $r0;
Operands such as s[$r2] or s[$ofs1+0x0010] can also be used. PTXPlus also introduces the following addressing modes that are not present in original PTX:
- g = global memory
- s = shared memory
- ce#c# = constant memory (the first number is the kernel number, the second number is the constant segment)
g[$ofs1+$r0]       //global memory address determined by the sum of register $ofs1 and register $r0
s[$ofs1+=0x0010]   //shared memory address determined by the value in register $ofs1; register $ofs1 is then incremented by 0x0010
ce1c2[$ofs1+=$r1]  //the first kernel's second constant segment; the address is determined by the value in register $ofs1, which is then incremented by the value in register $r1
The implementation details of these addressing modes are described in PTXPlus Implementation.
New Data Types
Instructions have also been upgraded to more accurately represent how 64-bit and 128-bit values are stored across multiple 32-bit registers. The least significant 32 bits are stored in the far left register while the most significant 32 bits are stored in the far right register. The following is a list of the new data types, followed by an example add instruction that adds two 64-bit floating point numbers:
- .ff64 = PTXPlus version of a 64-bit floating point number
- .bb64 = PTXPlus version of a 64-bit untyped value
- .bb128 = PTXPlus version of a 128-bit untyped value
add.rn.ff64 {$r0,$r1}, {$r2,$r3}, {$r4,$r5};
PTXPlus Instructions
Instruction | Description
---|---
nop | Do nothing
andn | a andn b = a and ~b
norn | a norn b = a nor ~b
orn | a orn b = a or ~b
nandn | a nandn b = a nand ~b
callp | A new call instruction added in PTXPlus. It jumps to the indicated label.
retp | A new return instruction added in PTXPlus. It jumps back to the instruction after the previous callp instruction.
breakaddr | Pushes the address indicated by the operand onto the thread's break address stack.
break | Jumps to the address at the top of the thread's break address stack and pops off the entry.
PTXPlus Condition Codes and Instruction Predication
Instead of the normal true-false predicate system in PTX, SASS instructions use 4-bit condition codes to specify more complex predicate behaviour. As such, PTXPlus uses the same 4-bit predicate system. GPGPU-Sim uses the predicate translation table from decuda for simulating PTXPlus instructions.
The highest bit represents the overflow flag, followed by the carry flag and the sign flag. The last and lowest bit is the zero flag. Separate condition codes can be stored in separate predicate registers, and instructions can indicate which predicate register to use or modify. The following instruction adds the value in register $r0 to the value in register $r1 and stores the result in register $r2. At the same time, the appropriate flags are set in predicate register $p0.
add.u32 $p0|$r2, $r0, $r1;
Different test conditions can be used on predicated instructions. For example the next instruction is only performed if the carry flag bit in predicate register $p0 is set:
@$p0.cf add.u32 $r2, $r0, $r1;
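Combining the two examples above, a short illustrative sequence first updates the condition code in $p0 and then predicates a second instruction on the carry flag it produced (registers $r3 and $r4 here are arbitrary):
add.u32 $p0|$r2, $r0, $r1;      // add and set condition-code flags in $p0
@$p0.cf add.u32 $r4, $r2, $r3;  // executes only if the previous add produced a carry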
Parameter and Thread ID (tid) Initialization
PTXPlus does not use an explicit parameter state space to store the kernel parameters. Instead, the input parameters are copied in order into shared memory starting at address 0x0010. The copying of parameters is performed during GPGPU-Sim's thread initialization process, which occurs when a thread block is issued to a SIMT core, as described in Thread Block / CTA / Work Group Scheduling. The Kernel Launch: Parameter Hookup section describes the implementation of this procedure. Also during this process, the values of the special registers %tid.x, %tid.y and %tid.z are packed into register $r0 with the following layout:
Register $r0:  [31:26] %tid.z | [25:16] %tid.y | [15:10] NA | [9:0] %tid.x
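As an illustration of how PTXPlus code can make use of this layout, the sketch below reads the first kernel parameter and masks out %tid.x. It relies only on the shared-memory operand form and register layout described above, and it assumes that PTX-style mov and and instructions are available in PTXPlus; the mask value simply reflects the bit ranges shown.
mov.u32 $r1, s[0x0010];          // read the first kernel parameter from shared memory
and.b32 $r2, $r0, 0x000003ff;    // extract %tid.x from bits [9:0] of $r0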
Debugging via Prints and Traces
There are two built-in facilities for debugging GPGPU-Sim. The first mechanism is environment variables. This is useful for debugging elements of GPGPU-Sim that run before the configuration file (gpgpusim.config) is parsed; however, it can be a clumsy way to implement tracing information in the performance simulator. As of version 3.2.1, GPGPU-Sim includes a tracing system implemented in 'src/trace.h', which allows the user to turn traces on and off via the config file and enable traces by their string name. Both of these systems are described below. Please note that many of the environment variable prints could be implemented via the tracing system, but exist as environment variables because they are in legacy code. Also, GPGPU-Sim prints a large amount of information that is not controlled through the tracing system, which is likewise a result of legacy code.
Environment Variables for Debugging
Some behavior of GPGPU-Sim 3.x relevant to debugging can be configured via environment variables.
When debugging, it may be helpful to have the simulator generate additional information about what is going on and print it to standard output. This is done using the following environment variable:
export PTX_SIM_DEBUG=<#>    # enable debugging and set the verbosity level
The currently supported levels are enumerated below:
Level | Description
---|---
1 | Verbose logs for CTA allocation
2 | Print verbose output from dominator analysis
3 | Verbose logs for GPU malloc/memcpy/memset
5 | Display the instruction executed
6 | Display the register(s) modified by each executed instruction
10 | Display the entire register file of the thread executing the instruction
50 | Print verbose output from control flow analysis
100 | Print the loaded PTX files
If a benchmark does not run correctly on GPGPU-Sim, you may need to debug the functional simulation engine. The way we do this is to print out the functional state of a single thread that generates incorrect output. To enable printing the functional simulation state for a single thread, use the following environment variable (and set the appropriate level for PTX_SIM_DEBUG):
export PTX_SIM_DEBUG_THREAD_UID=<#>    # ID of the thread to debug
Other environment configuration options:
export PTX_SIM_USE_PTX_FILE=<non-empty-string>    # override the PTX embedded in the binary and revert to the old strategy of looking for *.ptx files (good for hand-tweaking PTX)
export PTX_SIM_KERNELFILE=<filename>              # use this to specify the name of the PTX file
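For example, to watch every instruction executed by one particular thread, the debug level and thread UID variables can be combined as follows (the thread UID and application name are illustrative placeholders):
export PTX_SIM_DEBUG=6
export PTX_SIM_DEBUG_THREAD_UID=32
./myCudaApp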
GPGPU-Sim debug tracing
The tracing system is controlled by variables in the gpgpusim.config file:
Variable | Values | Description
---|---|---
trace_enabled | 0 or 1 | Globally enable or disable all tracing. If enabled, the selected trace_components are printed.
trace_components | <WARP_SCHEDULER,SCOREBOARD,...> | A comma-separated list of tracing elements to enable; a complete list is available in src/trace_streams.tup.
trace_sampling_core | <0 through num_cores-1> | For elements associated with a given shader core (such as the warp scheduler or scoreboard), only print traces from this core.
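As a sketch (assuming these variables are given in gpgpusim.config with the same leading-dash syntax as the other options in this manual), enabling scheduler and scoreboard traces for core 0 might look like:
-trace_enabled 1
-trace_components WARP_SCHEDULER,SCOREBOARD
-trace_sampling_core 0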
The code files that implement the system are:
File | Description
---|---
src/trace_streams.tup | Lists the names of each print stream
src/trace.cc | Some setup implementation and initialization
src/trace.h | Defines all the high level interfaces for the tracing system
src/gpgpu-sim/shader_trace.h | Defines some convenient prints for debugging a specific shader core
Configuration Options
Configuration options are passed into GPGPU-Sim with gpgpusim.config and an interconnection configuration file (specified with option -inter_config_file inside gpgpusim.config). GPGPU-Sim 3.0.2 comes with calibrated configuration files in the configs directory for the NVIDIA GT200 (configs/QuadroFX5800/) and Fermi (configs/Fermi/).
Here is a list of the configuration options:
Simulation Run Configuration
Option | Description
---|---
-gpgpu_max_cycle <# cycles> | Terminate GPU simulation early after a maximum number of cycles is reached (0 = no limit)
-gpgpu_max_insn <# insns> | Terminate GPU simulation early after a maximum number of instructions (0 = no limit)
-gpgpu_ptx_sim_mode <0=performance (default), 1=functional> | Select between performance or functional simulation (note that functional simulation may incorrectly simulate some PTX code that requires each element of a warp to execute in lock-step)
-gpgpu_deadlock_detect <0=off, 1=on (default)> | Stop the simulation at deadlock
-gpgpu_max_cta | Terminate GPU simulation early (0 = no limit)
-gpgpu_max_concurrent_kernel | Maximum number of kernels that can run concurrently on the GPU
Statistics Collection Options
Option | Description
---|---
-gpgpu_ptx_instruction_classification <0=off, 1=on (default)> | Enable instruction classification
-gpgpu_runtime_stat <frequency>:<flag> | Display runtime statistics
-gpgpu_memlatency_stat | Collect memory latency statistics (0x2 enables MC, 0x4 enables queue logs)
-visualizer_enabled <0=off, 1=on (default)> | Turn on visualizer output (use the AerialVision visualizer tool to plot the data saved in the log)
-visualizer_outputfile <filename> | Specifies the output log file for the visualizer. Set to NULL for an automatically generated filename (done by default).
-visualizer_zlevel <compression level> | Compression level of the visualizer output log (0=no compression, 9=max compression)
-save_embedded_ptx | Saves the PTX files embedded in the binary as <n>.ptx
-enable_ptx_file_line_stats <0=off, 1=on (default)> | Turn on PTX source line statistic profiling
-ptx_line_stats_filename <output file name> | Output file for PTX source line statistics
-gpgpu_warpdistro_shader | Specify which shader core to collect the warp size distribution from
-gpgpu_cflog_interval | Interval between each snapshot in the control flow logger
-keep | Keep intermediate files created by GPGPU-Sim when interfacing with external programs
High-Level Architecture Configuration (see the ISPASS paper for more details on what is being modeled)
Option | Description
---|---
-gpgpu_n_mem <# memory controllers> | Number of memory controllers (DRAM channels) in this configuration. Read #Topology Configuration before modifying this option.
-gpgpu_clock_domains <Core Clock>:<Interconnect Clock>:<L2 Clock>:<DRAM Clock> | Clock domain frequencies in MHz (see #Clock Domain Configuration)
-gpgpu_n_clusters | Number of processing clusters
-gpgpu_n_cores_per_cluster | Number of SIMD cores per cluster
Additional Architecture Configuration
Option | Description
---|---
-gpgpu_n_cluster_ejection_buffer_size | Number of packets in the ejection buffer
-gpgpu_n_ldst_response_buffer_size | Number of response packets in the LD/ST unit ejection buffer
-gpgpu_coalesce_arch | Coalescing architecture (default = 13, anything else is off for now)
Scheduler
Option | Description
---|---
-gpgpu_num_sched_per_core | Number of warp schedulers per core
-gpgpu_max_insn_issue_per_warp | Maximum number of instructions that the scheduler can issue per warp in one cycle
Shader Core Pipeline Configuration
Option | Description
---|---
-gpgpu_shader_core_pipeline <# threads/shader core>:<warp size>:<pipeline SIMD width> | Shader core pipeline configuration
-gpgpu_shader_registers <# registers/shader core, default=8192> | Number of registers per shader core. Limits the number of concurrent CTAs.
-gpgpu_shader_cta <# CTAs/shader core, default=8> | Maximum number of concurrent CTAs in a shader
-gpgpu_simd_model <1=immediate post-dominator, others are not supported for now> | SIMD branch divergence handling policy
-ptx_opcode_latency_int/fp/dp <ADD,MAX,MUL,MAD,DIV> | Opcode latencies
-ptx_opcode_initiation_int/fp/dp <ADD,MAX,MUL,MAD,DIV> | Opcode initiation interval. For this number of cycles the inputs of the ALU are held constant and the ALU cannot consume new values; e.g., if this value is 4, the unit can consume new values once every 4 cycles.
Memory Sub-System Configuration
Option | Description
---|---
-gpgpu_perfect_mem <0=off (default), 1=on> | Enable perfect memory mode (zero memory latency with no cache misses)
-gpgpu_tex_cache:l1 <nsets>:<bsize>:<assoc>,<rep>:<wr>:<alloc>:<wr_alloc>,<mshr>:<N>:<merge>,<mq>:<fifo_entry> | Texture cache (read-only) configuration. Eviction policy: L = LRU, F = FIFO, R = Random.
-gpgpu_const_cache:l1 <nsets>:<bsize>:<assoc>,<rep>:<wr>:<alloc>:<wr_alloc>,<mshr>:<N>:<merge>,<mq> | Constant cache (read-only) configuration. Eviction policy: L = LRU, F = FIFO, R = Random.
-gpgpu_cache:il1 <nsets>:<bsize>:<assoc>,<rep>:<wr>:<alloc>:<wr_alloc>,<mshr>:<N>:<merge>,<mq> | Shader L1 instruction cache configuration. Eviction policy: L = LRU, F = FIFO, R = Random.
-gpgpu_cache:dl1 <nsets>:<bsize>:<assoc>,<rep>:<wr>:<alloc>:<wr_alloc>,<mshr>:<N>:<merge>,<mq> -- set to "none" for no DL1 -- | L1 data cache (for global and local memory) configuration. Eviction policy: L = LRU, F = FIFO, R = Random.
-gpgpu_cache:dl2 <nsets>:<bsize>:<assoc>,<rep>:<wr>:<alloc>:<wr_alloc>,<mshr>:<N>:<merge>,<mq> | Unified banked L2 data cache configuration. This specifies the configuration for the L2 cache bank in one of the memory partitions. The total L2 cache capacity = <nsets> x <bsize> x <assoc> x <# memory controllers>.
-gpgpu_shmem_size <shared memory size, default=16kB> | Size of shared memory per SIMT core (aka shader core)
-gpgpu_shmem_warp_parts | Number of portions a warp is divided into for the shared memory bank conflict check
-gpgpu_flush_cache <0=off (default), 1=on> | Flush the caches at the end of each kernel call
-gpgpu_local_mem_map | Mapping from the local memory space address to the simulated GPU physical address space (default = enabled)
-gpgpu_num_reg_banks | Number of register banks (default = 8)
-gpgpu_reg_bank_use_warp_id | Use the warp ID in mapping registers to banks (default = off)
-gpgpu_cache:dl2_texture_only | L2 cache used for texture only (0=no, 1=yes, default=1)
Operand Collector Configuration
Option | Description
---|---
-gpgpu_operand_collector_num_units_sp | Number of SP collector units (default = 4)
-gpgpu_operand_collector_num_units_sfu | Number of SFU collector units (default = 4)
-gpgpu_operand_collector_num_units_mem | Number of MEM collector units (default = 2)
-gpgpu_operand_collector_num_units_gen | Number of generic collector units (default = 0)
-gpgpu_operand_collector_num_in_ports_sp | Number of SP collector unit input ports (default = 1)
-gpgpu_operand_collector_num_in_ports_sfu | Number of SFU collector unit input ports (default = 1)
-gpgpu_operand_collector_num_in_ports_mem | Number of MEM collector unit input ports (default = 1)
-gpgpu_operand_collector_num_in_ports_gen | Number of generic collector unit input ports (default = 0)
-gpgpu_operand_collector_num_out_ports_sp | Number of SP collector unit output ports (default = 1)
-gpgpu_operand_collector_num_out_ports_sfu | Number of SFU collector unit output ports (default = 1)
-gpgpu_operand_collector_num_out_ports_mem | Number of MEM collector unit output ports (default = 1)
-gpgpu_operand_collector_num_out_ports_gen | Number of generic collector unit output ports (default = 0)
DRAM/Memory Controller Configuration
Option | Description
---|---
-gpgpu_dram_scheduler <0 = fifo, 1 = fr-fcfs> | DRAM scheduler type
-gpgpu_frfcfs_dram_sched_queue_size <# entries> | DRAM FRFCFS scheduler queue size (0 = unlimited (default); # entries per chip). (Note: the FIFO scheduler queue size is fixed to 2.)
-gpgpu_dram_return_queue_size <# entries> | DRAM request return queue size (0 = unlimited (default); # entries per chip)
-gpgpu_dram_buswidth <# bytes/DRAM bus cycle, default=4 bytes, i.e. 8 bytes/command clock cycle> | Bus bandwidth of a single DRAM chip at the command bus frequency (default = 4 bytes, i.e. 8 bytes per command clock cycle). The number of DRAM chips per memory controller is set by option -gpgpu_n_mem_per_ctrlr, so each memory partition has (gpgpu_dram_buswidth x gpgpu_n_mem_per_ctrlr) bytes of DRAM data bus pins. For example, the Quadro FX5800 has a 512-bit DRAM data bus, which is divided among 8 memory partitions; each memory partition therefore has 512/8 = 64 bits of DRAM data bus, split between 2 DRAM chips, so each chip has a 32-bit (4-byte) DRAM bus width. We therefore set -gpgpu_dram_buswidth to 4.
-gpgpu_dram_burst_length <# bursts per DRAM request> | Burst length of each DRAM request (default = 4 data clock cycles, which run at 2X the command clock frequency in GDDR3)
-gpgpu_dram_timing_opt <nbk:tCCD:tRRD:tRCD:tRAS:tRP:tRC:CL:WL:tWTR> | DRAM timing parameters
-gpgpu_mem_address_mask <address decoding scheme> | Obsolete: select a different address decoding scheme to spread memory accesses across different memory banks (0 = old addressing mask, 1 = new addressing mask, 2 = new addressing mask + flipped bank select and chip select bits)
-gpgpu_mem_addr_mapping dramid@<start bit>;<memory address map> | Mapping of memory addresses to the DRAM model. See the configuration file for the Quadro FX 5800 for an example.
-gpgpu_n_mem_per_ctrlr <# DRAM chips/memory controller> | Number of DRAM chips per memory controller (aka DRAM channel)
-gpgpu_dram_partition_queues | i2$:$2d:d2$:$2i
-rop_latency <# minimum cycles before L2 cache access> | Minimum latency (in core clock cycles) between when a memory request arrives at the memory partition and when it accesses the L2 cache / moves into the queue to access DRAM. It models the minimum L2 hit latency.
-dram_latency <# minimum cycles after L2 cache access and before DRAM access> | Minimum latency (in core clock cycles) between when a memory request has accessed the L2 cache and when it is pushed into the DRAM scheduler. This option works together with -rop_latency to model the minimum DRAM access latency (= rop_latency + dram_latency).
Interconnection Configuration
Option | Description
---|---
-inter_config_file <path to interconnection config file> | The file containing the interconnection network simulator's options. For more details about interconnection configuration see the manual provided with the original code at [6]. NOTE that options under "4.6 Traffic" and "4.7 Simulation parameters" should not be used in our simulator. Also see #Interconnection Configuration.
-network_mode | Interconnection network mode to be used (default = 1)
PTX Configurations
Option | Description
---|---
-gpgpu_ptx_use_cuobjdump | Use cuobjdump to extract PTX/SASS (0=no, 1=yes). Only allowed with CUDA 4.0.
-gpgpu_ptx_convert_to_ptxplus | Convert the embedded PTX to PTXPlus (0=no, 1=yes)
-gpgpu_ptx_save_converted_ptxplus | Save the converted PTXPlus to a file (0=no, 1=yes)
-gpgpu_ptx_force_max_capability | Force maximum compute capability (default 0)
-gpgpu_ptx_inst_debug_to_file | Dump executed instructions' debug information to a file (0=no, 1=yes)
-gpgpu_ptx_inst_debug_file | Output file for the executed instructions' debug information
-gpgpu_ptx_inst_debug_thread_uid | Thread UID for executed instructions' debug output
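To make the table concrete, a small gpgpusim.config excerpt drawing on a few of the options above might look like the following sketch. The values are illustrative and only loosely based on the shipped QuadroFX5800 configuration; consult configs/QuadroFX5800/gpgpusim.config for the authoritative values.
-gpgpu_n_clusters 10
-gpgpu_n_cores_per_cluster 3
-gpgpu_n_mem 8
-gpgpu_shader_registers 16384
-gpgpu_shader_cta 8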
Interconnection Configuration
GPGPU-Sim 3.x uses the booksim router simulator to model the interconnection network. For the most part you will want to consult the booksim documentation for how to configure the interconnect. However, below we list special considerations that need to be taken into account to ensure your modifications work with GPGPU-Sim.
Topology Configuration
Note that the total number of network nodes specified in the interconnection network config file should match the total number of nodes in GPGPU-Sim. GPGPU-Sim's total number of nodes is the sum of the SIMT core cluster count and the number of memory controllers. For example, in the QuadroFX5800 configuration there are 10 SIMT core clusters and 8 memory controllers, for a total of 18 nodes. Therefore, in the interconnection config file the network also has 18 nodes, as demonstrated below:
topology = fly;
k = 18;
n = 1;
routing_function = dest_tag;
The configuration snippet above sets up a single-stage butterfly network with destination tag routing and 18 nodes. Generally, in both butterfly and mesh networks the total number of network nodes is k^n (e.g., 18^1 = 18 above, or 6^2 = 36 for the 6x6 mesh below).
Note that if you choose to use a mesh network you will want to consider configuring the memory controller placement. In the current release there are a few predefined mappings that can be enabled by setting "use_map=1;". In particular, the mesh network used in our ISPASS 2009 paper can be configured using this setting and the following topology:
- a 6x6 mesh network (topology=mesh, k=6, n=2): 28 SIMT cores + 8 DRAM channels, assuming the SIMT core cluster size is one (see the snippet below)
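A minimal interconnection config excerpt for that layout could look like the sketch below; only the lines shown here differ from the butterfly example above, and any other required booksim options are left at their defaults.
topology = mesh;
k = 6;
n = 2;
use_map = 1;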
You can create your own mappings by modifying create_node_map() in interconnect_interface.cpp (and setting use_map=1).
Booksim options added by GPGPU-Sim
These options are specific to GPGPU-Sim and not part of the original booksim:
- perfect_icnt: if set, the interconnect is not simulated; all packets injected into the network appear at their destination after one cycle. This is true even when multiple sources send packets to one destination.
- fixed_lat_per_hop: similar to perfect_icnt above, except that the packet appears at the destination after "Manhattan distance hop count times fixed_lat_per_hop" cycles.
- use_map: changes the way memory and shader cores are placed. See Topology Configuration.
- flit_size: specifies the flit size in bytes. This is used to determine the number of flits per packet based on the packet size passed to the icnt_push() functions.
- network_count: number of independent interconnection networks. Should be set to 2 unless you know what you are doing.
- output_extra_latency: adds extra cycles to each router. Used to create Figure 10 of the ISPASS paper.
- enable_link_stats: prints extra statistics for each link.
- input_buf_size: input buffer size of each node in flits. If left zero, the simulator will set it automatically. See "Injecting a packet from the outside world to network".
- ejection_buffer_size: ejection buffer size. If left zero, the simulator will set it automatically. See "Ejecting a packet from network to the outside world".
- boundary_buffer_size: boundary buffer size. If left zero, the simulator will set it automatically. See "Ejecting a packet from network to the outside world".
The following four options were set using #define in the original booksim, but we have made them configurable via intersim's config file:
- MATLAB_OUTPUT (generates Matlab-friendly outputs), DISPLAY_LAT_DIST (shows a distribution of packet latencies), DISPLAY_HOP_DIST (shows a distribution of hop counts), DISPLAY_PAIR_LATENCY (shows average latency for each source-destination pair)
Booksim Options ignored by GPGPU-Sim
Please note that the following options, which are part of the original booksim, are either ignored or should not be changed from their defaults.
- Traffic options (section 4.6 of the booksim manual): injection_rate, injection_process, burst_alpha, burst_beta, "const_flit_per_packet", traffic
- Simulation parameters (section 4.7 of the booksim manual): sim_type, sample_period, warmup_periods, max_samples, latency_thres, sim_count, reorder
Clock Domain Configuration
GPGPU-Sim supports four clock domains that can be controlled by the -gpgpu_clock_domains option:
- DRAM clock domain = frequency of the real DRAM clock (command clock) and not the data clock (which is 2x the command clock frequency)
- SIMT core cluster clock domain = frequency of the pipeline stages in a core clock (i.e. the rate at which simt_core_cluster::core_cycle() is called)
- Icnt clock domain = frequency of the interconnection network (usually this can be regarded as the core clock in NVIDIA GPU specs)
- L2 clock domain = frequency of the L2 cache (we usually set this equal to the ICNT clock frequency)
Note that in GPGPU-Sim the width of the pipeline is equal to the warp size. To compensate for this we adjust the SIMT core cluster clock domain. For example, we model the superpipelined stages in NVIDIA's Quadro FX 5800 (GT200) SM, which run at the fast clock rate (1GHz+), with a single slower pipeline stage running at 1/4 of that frequency. So the 1.3GHz shader clock rate of the FX 5800 corresponds to a 325MHz (1300MHz / 4) SIMT core clock in GPGPU-Sim.
The DRAM clock domain is specified as the frequency of the command clock. To simplify the peak memory bandwidth calculation, most GPU specs report the data clock, which runs at 2X the command clock frequency. For example, the Quadro FX5800 has a memory data clock of 1600MHz, while the command clock runs at only 800MHz. Therefore, our configuration sets the DRAM clock domain to 800.0MHz.
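Putting the four domains together, the corresponding gpgpusim.config line for a QuadroFX5800-like setup would look like the sketch below; the exact values are illustrative, and the shipped configs/QuadroFX5800 file is the authoritative reference.
# <Core Clock>:<Interconnect Clock>:<L2 Clock>:<DRAM Clock> in MHz
-gpgpu_clock_domains 325.0:650.0:650.0:800.0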
clock Special Register
In PTX, there is a special register %clock that reads a clock cycle counter. On the hardware, this register is called SR1. It is a clock cycle counter that silently wraps around. In Quadro (GT200), this counter is incremented twice per scheduler clock; in Fermi, it is incremented only once per scheduler clock. GPGPU-Sim returns a value for a counter that is incremented once per scheduler clock (which is the same as the SIMT core clock).
In PTXPlus, the NVIDIA compiler generates the instructions accessing the %clock register as follows:
//SASS accessing clock register
S2R R1, SR1
SHL R1, R1, 0x1
//PTXPlus accessing clock register
mov %r1, %clock
shl %r1, %r1, 0x1
This basically multiplies the value in the clock register by two. In PTX, however, the clock register is accessed directly. These conditions must be taken into consideration when calculating results based on the clock register.
//PTX accessing clock register
mov r1, %clock
Understanding Simulation Output
At the end of each CUDA grid launch, GPGPU-Sim prints out the performance statistics to the console (stdout). These performance statistics provide insights into how the CUDA application performs with the simulated GPU architecture.
Here is a brief list of the important performance statistics:
General Simulation Statistics
Statistic | Description
---|---
gpu_sim_cycle | Number of cycles (in core clock) required to execute this kernel.
gpu_sim_insn | Number of instructions executed in this kernel.
gpu_ipc | gpu_sim_insn / gpu_sim_cycle
gpu_tot_sim_cycle | Total number of cycles (in core clock) simulated for all the kernels launched so far.
gpu_tot_sim_insn | Total number of instructions executed for all the kernels launched so far.
gpu_tot_ipc | gpu_tot_sim_insn / gpu_tot_sim_cycle
gpu_total_sim_rate | gpu_tot_sim_insn / wall_time
Simple Bottleneck Analysis
These performance counters track stall events at different high-level parts of the GPU. In combination, they give a broad sense of where the bottleneck is in the GPU for an application. Figure 8 illustrates a simplified flow of memory requests through the memory sub-system in GPGPU-Sim.
Here is the description of each counter:
Statistic | Description
---|---
gpgpu_n_stall_shd_mem | Number of pipeline stall cycles at the memory stage.
gpu_stall_dramfull | Number of cycles that the interconnect output to a DRAM channel is stalled.
gpu_stall_icnt2sh | Number of cycles that the DRAM channels are stalled due to interconnect congestion.
Memory Access Statistics
Statistic | Description
---|---
gpgpu_n_load_insn | Number of global/local load instructions executed.
gpgpu_n_store_insn | Number of global/local store instructions executed.
gpgpu_n_shmem_insn | Number of shared memory instructions executed.
gpgpu_n_tex_insn | Number of texture memory instructions executed.
gpgpu_n_const_mem_insn | Number of constant memory instructions executed.
gpgpu_n_param_mem_insn | Number of parameter read instructions executed.
gpgpu_n_cmem_portconflict | Number of constant memory bank conflicts.
maxmrqlatency | Maximum memory queue latency (amount of time a memory request spent in the DRAM memory queue).
maxmflatency | Maximum memory fetch latency (round trip latency from the shader core to DRAM and back).
averagemflatency | Average memory fetch latency.
max_icnt2mem_latency | Maximum latency for a memory request to traverse from a shader core to the destination DRAM channel.
max_icnt2sh_latency | Maximum latency for a memory request to traverse from a DRAM channel back to the specified shader core.
Memory Sub-System Statistics
Statistic | Description
---|---
gpgpu_n_mem_read_local | Number of local memory reads placed on the interconnect from the shader cores.
gpgpu_n_mem_write_local | Number of local memory writes placed on the interconnect from the shader cores.
gpgpu_n_mem_read_global | Number of global memory reads placed on the interconnect from the shader cores.
gpgpu_n_mem_write_global | Number of global memory writes placed on the interconnect from the shader cores.
gpgpu_n_mem_texture | Number of texture memory reads placed on the interconnect from the shader cores.
gpgpu_n_mem_const | Number of constant memory reads placed on the interconnect from the shader cores.
Control-Flow Statistics
GPGPU-Sim reports the warp occupancy distribution, which measures the performance penalty due to branch divergence in the CUDA application. This information is reported on a single line following the text "Warp Occupancy Distribution:". Alternatively, you may want to grep for W0_Idle. The distribution is displayed in the format <bin>:<cycle count>. Here is the meaning of each bin:
Bin | Description
---|---
Stall | The number of cycles when the shader core pipeline is stalled and cannot issue any instructions.
W0_Idle | The number of cycles when all available warps are issued to the pipeline and are not ready to execute the next instruction.
W0_Scoreboard | The number of cycles when all available warps are waiting for data from memory.
WX (where X = 1 to 32) | The number of cycles when a warp with X active threads is scheduled into the pipeline.
Code that has no branch divergence should result in no cycles in the WX bins where X is between 1 and 31. See Dynamic Warp Formation: Efficient MIMD Control Flow on SIMD Graphics Hardware for more detail.
DRAM Statistics
By default, GPGPU-Sim reports the following statistics for each DRAM channel:
Statistic | Description
---|---
n_cmd | Total number of command cycles the memory controller in a DRAM channel has elapsed. The controller can issue a single command per command cycle.
n_nop | Total number of NOP commands issued by the memory controller.
n_act | Total number of Row Activation commands issued by the memory controller.
n_pre | Total number of Precharge commands issued by the memory controller.
n_req | Total number of memory requests processed by the DRAM channel.
n_rd | Total number of read commands issued by the memory controller.
n_write | Total number of write commands issued by the memory controller.
bw_util | DRAM bandwidth utilization = 2 * (n_rd + n_write) / n_cmd
n_activity | Total number of active cycles, i.e. command cycles when the memory controller has a pending request in its queue.
dram_eff | DRAM efficiency = 2 * (n_rd + n_write) / n_activity (i.e. DRAM bandwidth utilization when there is a pending request waiting to be processed)
mrqq: max | Maximum memory request queue occupancy (i.e. maximum number of pending entries in the queue).
mrqq: avg | Average memory request queue occupancy (i.e. average number of pending entries in the queue).
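As a worked example with purely illustrative numbers: if a channel reports n_cmd = 100000, n_activity = 40000, n_rd = 6000 and n_write = 4000, then bw_util = 2 * (6000 + 4000) / 100000 = 0.2, while dram_eff = 2 * (6000 + 4000) / 40000 = 0.5. The gap indicates that the channel is idle much of the time, but is reasonably efficient when requests are pending.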
Cache Statistics
For each cache (data cache, constant cache, texture cache, and so on), GPGPU-Sim reports the following statistics:
- Access = Total number of accesses to the cache.
- Miss = Total number of misses at the cache. The number in parentheses is the cache miss rate.
- PendingHit = Total number of pending hits at the cache. A pending hit access has hit a cache line in the RESERVED state, which means there is already an in-flight memory request sent by a previous cache miss on the same line. This access can be merged with that previous memory access so that it does not produce memory traffic. The number in parentheses is the ratio of accesses that exhibit pending hits.
Notice that pending hits are not counted as cache misses. Also, we do not count pending hits for caches that employ an allocate-on-fill policy (e.g. read-only caches such as the constant cache and texture cache).
GPGPU-Sim also calculates the total miss rate for all instances of the L1 data cache:
- total_dl1_misses
- total_dl1_accesses
- total_dl1_miss_rate
Notice that the L1 total miss rate data should be ignored when the L1 data cache is turned off (-gpgpu_cache:dl1 none).
Interconnect Statistics
In GPGPU-Sim, the user can configure whether to run all traffic on a single interconnection network, or on two separate physical networks (one relaying data from the shader cores to the DRAM channels and the other relaying the data back). (The motivation for using two separate networks, besides increasing bandwidth, is often to avoid "protocol deadlock" which otherwise requires additional dedicated virtual channels.) GPGPU-Sim reports the following statistics for each individual interconnection network:
Statistic | Description
---|---
average latency | Average latency for a single flit to traverse from a source node to a destination node.
average accepted rate | Measured average throughput of the network relative to its total input channel throughput. Notice that when using two separate networks for traffic in different directions, some nodes will never inject data into the network (i.e. the output-only nodes, such as the DRAM channels on the cores-to-DRAM network). To get the real ratio, the total input channel throughput from these nodes should be ignored; that means one should multiply this rate by the ratio (total # nodes / # input nodes in this network) to get the real average accepted rate. Note that by default we use two separate networks, which is set by the network_count option in the interconnection network config file. The two networks serve to break circular dependencies that might cause deadlocks.
min accepted rate | Always 0, as there are nodes that do not inject flits into the network, because we simulate two separate networks for traffic in different directions.
latency_stat_0_freq | A histogram showing the distribution of the latency of flits traversing the network.
Note: The accepted traffic, or throughput, of a network is the amount of traffic delivered to the destination terminals of the network. If the network is below saturation, all of the offered traffic is accepted by the network and the offered traffic equals the throughput of the network. The interconnect simulator calculates the accepted rate of each node by dividing the total number of packets received at the node by the total number of network cycles.
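For example, on the cores-to-DRAM network of the 18-node QuadroFX5800 configuration described under Topology Configuration, only the 10 SIMT core clusters inject packets, so the reported average accepted rate would be scaled by 18/10 = 1.8 to obtain the real value.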
Visualizing High-Level GPGPU-Sim Microarchitecture Behavior
AerialVision is a GPU performance analysis tool for GPGPU-Sim that is included with the GPGPU-Sim source code; it was introduced in an ISPASS 2010 paper. A detailed manual describing how to use AerialVision can be found here.
Two examples of the type of high level analysis possible with AerialVision are illustrated below. Figure 9 illustrates the use of AerialVision to understand microarchitecture behavior versus time. The top row is average memory access latency versus time, the second row plots load per SIMT core versus time (the vertical axis is the SIMT core, and color represents average instructions per cycle), and the bottom row shows load per memory controller channel. Figure 10 illustrates the use of AerialVision to understand microarchitecture behavior at the source code level. This helps identify lines of code associated with uncoalesced memory accesses or branch divergence.
To get GPGPU-Sim to generate a visualizer log file for the Time Lapse View, add the following option to gpgpusim.config:
-visualizer_enabled 1
The sampling frequency in this log file can be set with the option -gpgpu_runtime_stat. One can also specify the name of the visualizer log file with the option -visualizer_outputfile.
To generate the output for the Source Code Viewer, add the following option to gpgpusim.config:
-enable_ptx_file_line_stats 1
One can specify the output file name with the option -ptx_line_stats_filename.
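Combining the options above, a gpgpusim.config excerpt that produces both AerialVision inputs might look like the following sketch; the file names and the sampling value passed to -gpgpu_runtime_stat are illustrative placeholders.
-visualizer_enabled 1
-visualizer_outputfile my_app.viz.log.gz
-gpgpu_runtime_stat 500
-enable_ptx_file_line_stats 1
-ptx_line_stats_filename my_app.ptx_line_stats.txt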
If you use plots generated by AerialVision in your publications, please cite the above linked ISPASS 2010 paper.
Visualizing Cycle by Cycle Microarchitecture Behavior
GPGPU-Sim 3.x provides a set of GDB macros that can be used to visualize the detailed state of each SIMT core and memory partition. By setting the global variable "g_single_step" in GDB to the shader clock cycle at which you would like to start "single stepping" the shader clock, these macros can be used to observe how the microarchitecture state changes cycle by cycle. You can use the macros any time GDB has stopped the simulator, but the global variable "g_single_step" is used in gpu-sim.cc to trigger a call to a hard-coded software breakpoint instruction placed after all shader cores have advanced simulation by one cycle. Stopping the simulation here tends to lead to state that is easier to interpret.
Visibility at this level is useful for debugging and can help you gain deeper insight into how the simulated GPU microarchitecture works.
These GDB macros are available in the .gdbinit file that comes with the GPGPU-Sim distribution. To use these macros, either copy the file to your home directory or to the same directory where GDB is launched. GDB will automatically detect the presence of the macro file, load it and display the following message:
** loading GPGPU-Sim debugging macros... **
Macro | Description |
---|---|
dp <core_id> | Display pipeline state of the SIMT core with ID=<core_id>. See below for an example of the display. |
dpc <core_id> | Display the pipeline state, then continue to the next breakpoint.
This version is useful if you set "g_single_step" to trigger the hard coded breakpoint where gpu_sim_cycle is incremented in gpgpu-sim::cycle() in src/gpgpu-sim/gpu-sim.cc. Repeatedly hitting enter will advance to show the pipeline contents in successive cycles.
|
dm <partition_id> | Display the internal states of a memory partition with ID=<partition_id>. |
ptxdis <start_pc> <end_pc> | Display PTX instructions between <start_pc> and <end_pc>. |
ptxdis_func <core_id> <thread_id> | Display all PTX instructions inside the kernel function executed by thread <thread_id> in SIMT core <core_id>. |
ptx_tids2pcs <thread_ids> <length> <core_id> | Display the current PCs of the threads in SIMT core <core_id> represented by an array <thread_ids> of length=<length>. |
Example of the output of dp:
Debugging Errors in Performance Simulation
This section describes strategies for debugging errors that crash or deadlock GPGPU-Sim while running in performance simulation mode.
If you ran a benchmark we haven't tested and it crashed, we encourage you to file a bug report.
However, what should you do if GPGPU-Sim crashes or deadlocks after you have made your own changes? This section describes some generic techniques we use at UBC for debugging errors in the GPGPU-Sim timing model.
Segmentation Faults, Aborts and Failed Assertions
Deadlocks
Frequently Asked Questions
Question:
Which Linux platform is best for GPGPU-Sim?
Answer:
Currently we use SUSE 11.3 for developing GPGPU-Sim. However, many people have run it successfully on other distributions, and in principle there should be no problems in doing so. We have done only minor testing of GPGPU-Sim on Ubuntu 10.04 LTS.
Question:
Can GPGPU-Sim simulate graphics?
Answer:
We have some plans to model graphics in the future (no ETA on when that might be available).
Question:
Is there an option to enable a pure functional simulation without the timing model?
Answer:
Yes, it is available now. It was removed when we first refactored the GPGPU-Sim 2.x code base into GPGPU-Sim 3.x to simplify the development process, but it has since been reintroduced.
Question:
How do I change configuration files?
Answer:
GPGPU-Sim searches for a file called gpgpusim.config in the current directory. If you need to change the configuration file on the fly, you can create a new directory, create a symbolic link to the configuration file in that directory, and use that directory as your working directory when running GPGPU-Sim. Changing the symbolic link to point to another file will change the file seen by GPGPU-Sim.
Question:
Can I run OpenCL applications on GPGPU-Sim if I don't have a GPU?
Do I need a GPU to run GPGPU-Sim?
Answer:
Building and running GPGPU-Sim does not require the presence of a GPU on your system. However, running OpenCL applications requires the NVIDIA driver, which in turn requires the presence of a graphics card. Despite that, we have included support for performing the compilation of OpenCL applications on a remote machine. This means that you only need access to a remote machine that has a graphics card; the machine you are actually using to run GPGPU-Sim does not need one.
Question:
I got a parsing error from GPGPU-Sim for an OpenCL program that runs fine on a real GPU. What is going on?
Answer: Unlike most CUDA applications, which contain compiled PTX code in the binary, the kernel code in an OpenCL program is compiled by the video driver at runtime. Depending on the version of the video driver, different versions of PTX may be produced from the same OpenCL kernel. GPGPU-Sim 3.x is developed with NVIDIA driver version 256.40 (see the README file that comes with GPGPU-Sim). The newer driver that comes with CUDA Toolkit 4.0 or later has introduced some new keywords in PTX. Some of these keywords are not difficult to support in GPGPU-Sim (such as .ptr), while others are not as simple (such as the use of %envreg, which is set up by the driver). For now, we recommend downgrading the video driver, or setting up a remote machine with the supported driver version for OpenCL compilation.
Question:
Does/Will GPGPU-Sim support CUDA 4.0?
Answer:
Supporting CUDA 4 is something we are currently working on implementing (as usual, no ETA on when it will be released). Using multiple versions of CUDA is not hard with GPGPU-Sim 3.x: the setup_environment script for GPGPU-Sim 3.x can be modified to point to the 3.1 installation, so you do not need to modify your .bashrc.
Question:
Why are there two networks (reply and request)?
Answer:
Those two networks do not necessarily need to be two different physical networks; they can be two different logical networks, e.g., each one can use a dedicated set of virtual channels on a single physical network. If the request and reply traffic shares the same network, a "protocol deadlock" may happen. To understand this better, read Section 14.1.5 of Dally's interconnection networks book.
Question:
Is it normal to get 'NaN' in the simulator output?
Answer:
You may see 'NaN' for the cache miss rates when a cache module has never been accessed.
Question:
Why do all CTAs finish at cycle X, while gpu_sim_cycle says (X + Y)? (i.e., why is GPGPU-Sim still simulating after all the CTAs/shader cores are done?)
Answer:
The difference between when a CTA is considered finished by GPGPU-Sim and when GPGPU-Sim thinks the simulation is done can be due to global memory write traffic. Basically, it takes some time from issuing a write command until that command is processed by the memory system.
Question:
How do I calculate the peak off-chip DRAM bandwidth for a given GPGPU-Sim configuration?
Answer:
Peak off-chip DRAM bandwidth = gpgpu_n_mem * gpgpu_n_mem_per_ctrlr * gpgpu_dram_buswidth * DRAM Clock * 2 (for DDR)
- gpgpu_n_mem = Number of memory channels in the GPU (each memory channel has an independent controller for DRAM command scheduling)
- gpgpu_n_mem_per_ctrlr = Number of DRAM chips attached to a memory channel (default = 2, for 64-bit memory channel)
- gpgpu_dram_buswidth = Bus width of each DRAM chip (default = 32-bit = 4 bytes)
- DRAM Clock = the real clock of the DRAM chip (as opposed to the effective clock used in marketing - See #Clock Domain Configuration)
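As a worked example with purely hypothetical values (not a shipped configuration): for gpgpu_n_mem = 8, gpgpu_n_mem_per_ctrlr = 2, gpgpu_dram_buswidth = 4 bytes and a 600 MHz DRAM command clock,

Peak off-chip DRAM bandwidth = 8 * 2 * 4 bytes * 600 MHz * 2 = 76.8 GB/s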
Question:
How do I get the DRAM utilization?
Answer:
Each memory controller prints out some statistics at the end of the simulation using "dram_print()". DRAM utilization is "bw_util". Take the average of this number across all the memory controllers (the number for each controller can differ if each DRAM channel gets a different amount of memory traffic).
Inside the simulator's code, 'bwutil' is incremented by 2 for every read or write operation because it takes two DRAM command cycles to service a single read or write operation (given burst length = 4).
Question:
Why isn't DRAM utilization improving with more shader cores (with the same number of DRAM channels) for a memory-limited application?
Answer:
DRAM utilization may not improve with more in-flight threads for many reasons. One reason could be the DRAM precharge/activate overheads.
(See, e.g., Complexity Effective Memory Access Scheduling for Many-Core Accelerator Architectures.)
Question:
How do I get the interconnect utilization?
Answer:
The definition of the interconnect's utilization highly depends on the topology of the interconnection network itself, so it is quite difficult to give a single "utilization" metric that is consistent across all types of topology. If you are investigating whether the interconnect is the bottleneck of an application, you may want to look at gpu_stall_icnt2sh and gpu_stall_sh2icnt instead.
The throughput (accepted rate) is also a good indicator of the utilization of each network. Note that by default we use two separate networks for the traffic from SIMT core clusters to memory partitions and the traffic heading back; therefore you will see two accepted rate numbers reported at the end of simulation (one for each network). See #Interconnect Statistics for more detail.
Question:
Does/Will GPGPU-Sim 3.x support DWF (dynamic warp formation) or TBC (thread block compaction)?
Answer:
The current release of GPGPU-Sim 3.x supports neither DWF nor TBC. For now, only PDOM is supported. We are currently working on implementing TBC on GPGPU-Sim 3.x (the evaluations of TBC in the HPCA 2011 paper were done on GPGPU-Sim 2.x). While we do not have any plan to release the implementation yet, it may be released under a separate branch in the future (this depends on how modular the final implementation is).
Question:
Why is this simulator claimed to be timing accurate/cycle accurate? How can I verify this?
Answer:
A cycle-accurate simulator reports the timing behavior of the simulated architecture - it is possible for the user to stop the simulator at cycle boundaries and observe its state. All the hardware behavior within a cycle is approximated with C/C++ (as opposed to implementing it in an HDL) to speed up simulation. It is also common for an architectural simulator to simplify some detailed implementation covering corner cases of a hardware design in order to emphasize what dictates the overall performance of a system - this is what we try to achieve with GPGPU-Sim.
So, like all other cycle-accurate simulators used for architectural research/development, we do not guarantee 100% matching with real GPUs.
The normal way to verify a simulator would involve comparing the reported timing result of an application running on the simulator against the measured runtime of the same application running on the actual hardware simulation target. With PTX-ISA, this is a little tricky, because PTX-ISA is recompiled by the GPU driver into the native GPU ISA for execution on the actual GPU, whereas GPGPU-Sim executes PTX-ISA directly. Also, the limited amount of publicly available information on the actual NVIDIA GPU microarchitecture has posed a big challenge to implementing exactly matching behavior in the simulator.
(i.e., we do not know what is actually implemented inside a GPU. We just implement our best guess in the simulator!)
Nevertheless, we have been continually trying to improve the accuracy of our architecture model. In our ISPASS 2009 paper, we compared the simulated timing performance of various benchmarks against their hardware runtime on a GeForce 8600GT. The correlation coefficient was calculated to be 0.899. GPGPU-Sim 3.0 has been calibrated against an NVIDIA GT 200 GPU and shows an IPC correlation of 0.976 on a subset of applications from the NVIDIA CUDA SDK. We welcome feedback from users regarding the accuracy of GPGPU-Sim.
Software Design of GPGPU-Sim
To perform substantial architecture research with GPGPU-Sim 3.x you will need to modify the source code. This section documents the high-level software design of GPGPU-Sim 3.x, which differs from version 2.x. In addition to the software descriptions found here, you may find it helpful to consult the doxygen-generated documentation for GPGPU-Sim 3.x. Please see the README file for instructions on building the doxygen documentation from the GPGPU-Sim 3.x source.
Below we summarize the source code organization, the command-line option parser, and the object-oriented abstract hardware model that provides an interface between the software organization of the performance simulation engine and that of the functional simulation engine. Finally, we describe the software design of the interface with CUDA/OpenCL applications.
File list and brief description
GPGPU-Sim 3.x consists of three major modules (each located in its own directory):
- cuda-sim - The functional simulator that executes PTX kernels generated by NVCC or the OpenCL compiler
- gpgpu-sim - The performance simulator that simulates the timing behavior of a GPU (or other many-core accelerator architectures)
- intersim - The interconnection network simulator adopted from Bill Dally's BookSim
Here are the files in each module:
Overall/Utilities
File name | Description |
---|---|
Makefile | Makefile that builds gpgpu-sim and calls the other Makefiles in cuda-sim and intersim. |
abstract_hardware_model.h abstract_hardware_model.cc |
Provide a set of classes that interface between the functional and timing simulators. |
debug.h debug.cc |
Implements the Interactive Debugger Mode. |
gpgpusim_entrypoint.c | Contains functions that interface with the CUDA/OpenCL API stub libraries. |
option_parser.h option_parser.cc |
Implements the command-line option parser. |
stream_manager.h stream_manager.cc |
Implements the stream manager to support CUDA streams. |
tr1_hash_map.h | Wrapper code for std::unordered_map in C++11. Falls back to std::map or GNU hash_map if the compiler does not support C++11. |
.gdb_init | Contains macros that simplify low-level visualization of simulation states with GDB |
cuda-sim
File name | Description |
---|---|
Makefile | Makefile for cuda-sim. Called by Makefile one level up. |
cuda_device_print.h cuda_device_print.cc |
Implementation to support printf() within CUDA device functions (i.e., calling printf() within GPU kernel functions). Please note that the device printf works only with CUDA 3.1. |
cuda-math.h | Contains interfaces to CUDA Math header files. |
cuda-sim.h cuda-sim.cc |
Implements the interface between gpgpu-sim and cuda-sim. It also contains a standalone simulator for functional simulation. |
instructions.h instructions.cc |
This is where the emulation code of all PTX instructions is implemented. |
memory.h memory.cc |
Functional memory space emulation. |
opcodes.def | DEF file that links together the various pieces of information for each instruction (e.g., string name, implementation, internal opcode, ...). |
opcodes.h | Defines an enum for each PTX instruction. |
ptxinfo.l ptxinfo.y |
Lex and Yacc files for parsing the ptxinfo file (to obtain kernel resource requirements). |
ptx_ir.h ptx_ir.cc |
Static structures in CUDA - kernels, functions, symbols, etc. Also contains code to perform static analysis for extracting immediate post-dominators from kernels at load time. |
ptx.l ptx.y |
Lex and Yacc files for parsing .ptx files and the embedded cubin structure to obtain the PTX code of the CUDA kernels. |
ptx_loader.h ptx_loader.cc |
Contains functions for loading, parsing, and printing the PTX and PTX info files. |
ptx_parser.h ptx_parser.cc |
Contains functions called by Yacc during parsing that create the infrastructure needed for functional and performance simulation. |
ptx_parser_decode.def | Contains the token definitions of the parser used in PTX extraction. |
ptx_sim.h ptx_sim.cc |
Dynamic structures in CUDA - grids, CTAs, threads. |
ptx-stats.h ptx-stats.cc |
PTX source line profiler |
gpgpu-sim
File name | Description |
---|---|
Makefile | Makefile for gpgpu-sim. Called by Makefile one level up. |
addrdec.h addrdec.cc |
Address decoder - maps a given address to a specific row, bank, and column in a DRAM channel. |
delayqueue.h | An implementation of a flexible pipelined queue. |
dram_sched.h dram_sched.cc |
FR-FCFS DRAM request scheduler. |
dram.h dram.cc |
DRAM timing model + interface to other parts of gpgpu-sim. |
gpu-cache.h gpu-cache.cc |
Cache model for GPGPU-Sim |
gpu-misc.h gpu-misc.cc |
Contains misc. functionality that is needed by parts of gpgpu-sim |
gpu-sim.h gpu-sim.cc |
Glues the different timing models in GPGPU-Sim into one. It contains the implementation supporting multiple clock domains and implements the thread block dispatcher. |
histogram.h histogram.cc |
Defines several classes that implement different kinds of histograms. |
icnt_wrapper.h icnt_wrapper.c |
Interconnection network interface for gpgpu-sim. It provides a completely decoupled interface that allows intersim to work as an interconnection network timing simulator for gpgpu-sim. |
l2cache.h l2cache.cc |
Implements the timing model of a memory partition. It also implements the L2 cache and interfaces it with the rest of the memory partition (e.g. DRAM timing model). |
mem_fetch.h mem_fetch.cc |
Defines mem_fetch, a communication structure that models a memory request. |
mem_fetch_status.tup | Defines the status of a memory request. |
mem_latency_stat.h mem_latency_stat.cc |
Contains various code for memory system statistic collection. |
scoreboard.h scoreboard.cc |
Implements the scoreboard used in SIMT core. |
shader.h shader.cc |
SIMT core timing model. It calls cuda-sim for functional simulation of a particular thread, and cuda-sim returns performance-sensitive information for the thread. |
stack.h stack.cc |
Simple stack used by immediate post-dominator thread scheduler. (deprecated) |
stats.h | Defines the enums that categorize memory accesses and various stall conditions at the memory pipeline. |
stat-tool.h stat-tool.cc |
Implements various tools for performance measurements. |
visualizer.h visualizer.cc |
Outputs dynamic statistics for the visualizer. |
intersim
Only files modified from the original BookSim are listed.
File name | Description |
---|---|
booksim_config.cpp | intersim's configuration options are defined here and given default values. |
flit.hpp | Modified to add the capability of carrying data in the flits. Flits also know which network they belong to. |
interconnect_interface.cpp interconnect_interface.h |
The interface between GPGPU-Sim and intersim is implemented here. |
iq_router.cpp iq_router.hpp |
Modified to add support for output_extra_latency (used to create Figure 10 of the ISPASS 2009 paper). |
islip.cpp | Some minor edits to fix an out-of-bounds array error. |
Makefile | Modified to create a library instead of the standalone network simulator. |
stats.cpp stats.hpp |
Stat collection functions are in this file. We have made some minor tweaks, e.g., a new function called NeverUsed was added that tells whether a particular stat was ever updated. |
statwraper.cpp statwraper.h |
A wrapper that enables using the stat collection capabilities implemented in the Stat class in stats.cpp from C files. |
trafficmanager.cpp trafficmanager.hpp |
Heavily modified from the original BookSim. Many high-level operations are done here. |
Option Parser
As you modify GPGPU-Sim for your research, you will likely add features that you want to configure differently in different simulations.
GPGPU-Sim 3.x provides a generic command-line option parser that allows different software modules to register their options through a simple interface. The option parser is instantiated in gpgpu_ptx_sim_init_perf() in gpgpusim_entrypoint.cc. Options are added in gpgpu_sim_config::reg_options() using the function:
void option_parser_register(option_parser_t opp, const char *name, enum option_dtype type, void *variable, const char *desc, const char *defaultvalue);
Here is the description for each parameter:
- option_parser_t opp - The option parser identifier.
- const char *name - The string that identifies the command-line option.
- enum option_dtype type - Data type of the option. It can be one of the following:
  - int
  - unsigned int
  - long long
  - unsigned long long
  - bool (as int in C)
  - float
  - double
  - c-string (a.k.a. char*)
- void *variable - Pointer to the variable.
- const char *desc - Description of the option, as displayed when the registered options are printed.
- const char *defaultvalue - Default value of the option (the string value will be automatically parsed). You can set this to NULL for c-string variables.
Look inside gpgpu-sim/gpu-sim.cc for examples.
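For illustration, the following is a minimal sketch of registering a new option in the style of gpgpu_sim_config::reg_options(). The option name and the member variable are hypothetical, and OPT_UINT32 is assumed to be the matching data-type constant declared in option_parser.h:

```cpp
// Hypothetical example: m_my_queue_size is assumed to be an unsigned
// member of the configuration class, and "-gpgpu_my_queue_size" is an
// illustrative option name that does not exist in the GPGPU-Sim source.
option_parser_register(opp, "-gpgpu_my_queue_size", OPT_UINT32,
                       &m_my_queue_size,
                       "size of my hypothetical queue (entries)",
                       "8");
```

With this registration in place, adding "-gpgpu_my_queue_size 16" to gpgpusim.config would set the bound variable to 16 when the options are parsed.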
The option parser is implemented with the OptionParser class in option_parser.cc (exposed in the C interface as option_parser_t in option_parser.h).
Here is the full C interface that is used by the rest of GPGPU-Sim:
- option_parser_register() - Creates a binding between an option name (a string) and a variable in the simulator. The variable is passed by reference (pointer), and it will be modified when option_parser_cmdline() or option_parser_cfgfile() is called. Notice that each option can only be bound to a single variable.
- option_parser_cmdline() - Parses the given command-line options. Calls option_parser_cfgfile() for the option -config <config filename>.
- option_parser_cfgfile() - Parses a given file containing the configuration options.
- option_parser_print() - Dumps all the registered options and their parsed values.
Only one OptionParser object is instantiated in GPGPU-Sim, inside gpgpu_ptx_sim_init_perf() in gpgpusim_entrypoint.cc. This OptionParser object converts the simulation options in gpgpusim.config into variable values that can be accessed within the simulator. Different modules in GPGPU-Sim register their options with the OptionParser object (i.e., specifying which option corresponds to which variable in the simulator). After that, the simulator calls option_parser_cmdline() to parse the simulation options contained in gpgpusim.config.
Internally, the OptionParser class contains a set of OptionRegistry objects. OptionRegistry is a template class that uses the >> operator for parsing. Each instance of OptionRegistry is responsible for parsing one option into a particular type of variable. Currently the parser only supports the following data types, but it is possible to extend support to more complex data types by overloading the >> operator, as sketched after the list below:
- 32-bit/64-bit integers
- floating point numbers (float and double)
- booleans
- c-style strings (char*)
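As a hedged sketch of how such an extension could look (the dim2 type and its "WxH" string format are illustrative and not part of the GPGPU-Sim source), a new type only needs a stream extraction operator:

```cpp
#include <iostream>
#include <sstream>

// Illustrative composite option type: a 2D size parsed from strings
// such as "16x8". Not part of GPGPU-Sim; shown only to demonstrate the
// operator>> extension point described above.
struct dim2 {
    unsigned x, y;
};

// A templated registry that parses with operator>> can handle dim2 once
// this overload exists.
std::istream &operator>>(std::istream &in, dim2 &d) {
    char sep;
    in >> d.x >> sep >> d.y; // expects the form "<x>x<y>", e.g. "16x8"
    return in;
}

int main() {
    std::istringstream option_value("16x8");
    dim2 d;
    option_value >> d;
    std::cout << d.x << " by " << d.y << std::endl; // prints "16 by 8"
    return 0;
}
```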
Abstract Hardware Model
The files abstract_hardware_model{.h,.cc} provide a set of classes that interface between the functional and timing simulators.
Hardware Abstraction Model Objects
Enum Name | Description |
---|---|
_memory_space_t | Memory space type (register, local, shared, global, ...) |
uarch_op_t | Type of operation (ALU_OP, SFU_OP, LOAD_OP, STORE_OP, ...) |
_memory_op_t | Defines whether an instruction accesses memory (load or store). |
cudaTextureAddressMode | CUDA texture address modes. It can specify wrap, clamp-to-edge, mirror, or border address mode. |
cudaTextureFilterMode | CUDA texture filter modes (point or linear filter mode). |
cudaTextureReadMode | CUDA texture read mode. Specifies whether the texture is read as the element type or as a normalized float. |
cudaChannelFormatKind | Data type used by the CUDA runtime to specify the channel format. This is an argument of cudaCreateChannelDesc(...). |
mem_access_type | Different types of accesses in the timing simulator to different types of memories. |
cache_operator_type | Different types of L1 data cache access behavior provided by PTX |
divergence_support_t | Different control flow divergence handling models (post-dominator is supported). |
Class Name | Description |
---|---|
class kernel_info_t | Holds information about a kernel, such as the kernel function (function_info), the grid and block dimensions, and the list of active threads inside that kernel (ptx_thread_info). |
class core_t | Abstract base class of a core for both the functional and performance models. shader_core_ctx (the class that implements a SIMT core in the timing model) is derived from this class. |
struct cudaChannelFormatDesc | Channel descriptor structure. It keeps the channel format and the number of bits of each component. |
struct cudaArray | Structure for saving array data in GPGPU-Sim. It is used whenever the main program calls cuda_malloc, cuda_memcopy, cuda_free, etc. |
struct textureReference | Data type used by the CUDA runtime to specify texture references. |
class gpgpu_functional_sim_config | Functional simulator configuration options. |
class gpgpu_t | The top-level class that implements a functional GPU simulator. It contains the functional simulator configuration (gpgpu_functional_sim_config) and holds the actual buffers that implement the global/texture memory spaces (instances of class memory_space). It has a set of member functions that manage the simulated GPU memory space (malloc, memcpy, texture binding, ...). These member functions are called by the CUDA/OpenCL API implementations. Class gpgpu_sim (the top-level GPU timing simulation model) is derived from this class. |
struct gpgpu_ptx_sim_kernel_info | Holds properties of a kernel function such as the PTX version and target machine. Also holds the amount of memory and the number of registers used by that kernel. |
struct gpgpu_ptx_sim_arg | Holds information about kernel arguments/parameters, which are set in cudaSetupArgument(...) and _cl_kernel::bind_args(...) for CUDA and OpenCL respectively. |
class memory_space_t | Information about a memory space, such as the type of memory and the number of banks for this memory space. |
class mem_access_t | Contains information about each memory access in the timing simulator: the type of memory access, the requested address, the size of the data, and the active mask of the threads inside the warp accessing memory. This class is used as one of the parameters of the mem_fetch class, which is instantiated for each memory access. It is used for interfacing between two different levels of memory and for passing requests through the interconnect. |
struct dram_callback_t | This class is responsible for atomic operations. Its function pointer is set to atom_callback(...) during functional simulation (in atom_impl()). During timing simulation, this function pointer is called when the memory partition unit in the L2 cache / memory controller code pops a request towards the interconnect. The function computes the result of the atomic operation and saves it in memory. |
class inst_t | Base class of all instruction classes. This class contains information about the type and size of the instruction, the address of the instruction, its inputs and outputs, its latency, and its memory scope (memory_space_t). |
class warp_inst_t | Instruction data needed during timing simulation. Each instruction (ptx_instruction), which is derived from warp_inst_t, contains data for both timing and functional simulation. ptx_instruction is filled during functional simulation. After that point the program needs only timing information, so the ptx_instruction is used as a warp_inst_t (some data is dropped) for timing simulation. warp_inst_t is derived from inst_t. It holds the warp_id, the active thread mask inside the warp, the list of memory accesses (mem_access_t), and information about the threads inside that warp (per_thread_info). |
GPGPU-sim - Performance Simulation Engine
In GPGPU-Sim 3.x the performance simulation engine is implemented via numerous classes defined and implemented in the files under <gpgpu-sim_root>/src/gpgpu-sim/. These classes are brought together via the top-level class gpgpu_sim, which is derived from gpgpu_t (its functional simulation counterpart). In the current version of GPGPU-Sim 3.x, only one instance of gpgpu_sim, g_the_gpu, is present in the simulator. Simultaneous simulation of multiple GPUs is not currently supported but may be provided in future versions.
This section describes the various classes in the performance simulation engine. These include a set of software objects that model the microarchitecture described earlier. This section also describes how the performance simulation engine interfaces with the functional simulation engine, how it interfaces with AerialVision, and various non-trivial software designs we employ in the performance simulation engine.
Performance Model Software Objects
One of the more significant changes in GPGPU-Sim 3.x versus 2.x is the introduction of a (mostly) object-oriented C++ design for the performance simulation engine. The high-level design of the various classes used to implement the performance simulation engine is described in this subsection. These classes closely correspond to the hardware blocks described earlier.
SIMT Core Cluster Class
The SIMT core clusters are modelled by the simt_core_cluster class. This class contains an array of SIMT core objects in m_core. The simt_core_cluster::core_cycle() method simply cycles each of the SIMT cores in order. The simt_core_cluster::icnt_cycle() method pushes memory requests into the SIMT Core Cluster's response FIFO from the interconnection network. It also pops the requests from the FIFO and sends them to the appropriate core's instruction cache or LDST unit. The simt_core_cluster::icnt_inject_request_packet(...) method provides the SIMT cores with an interface to inject packets into the network.
SIMT Core Class
The SIMT core microarchitecture shown in Figure 5 is implemented with the class shader_core_ctx in shader.h/cc. Derived from class core_t (the abstract functional class for a core), this class combines all the different objects that implement the various parts of the SIMT core microarchitecture model:
- A collection of shd_warp_t objects which model the simulation state of each warp in the core.
- A SIMT stack, a simt_stack object, for each warp to handle branch divergence.
- A set of scheduler_unit objects, each responsible for selecting one or more instructions from its set of warps and issuing those instructions for execution.
- A Scoreboard object for detecting data hazards.
- An opndcoll_rfu_t object, which models an operand collector.
- A set of simd_function_unit objects, which implement the SP unit and the SFU unit (the ALU pipelines).
- A ldst_unit object, which implements the memory pipeline.
- A shader_memory_interface which connects the SIMT core to the corresponding SIMT core cluster. Each memory request goes through this interface to be serviced by one of the memory partitions.
Every core cycle, shader_core_ctx::cycle() is called to simulate one cycle of the SIMT core. This function calls a set of member functions that simulate the core's pipeline stages in reverse order to model the pipelining effect (a minimal sketch follows the list below):
- fetch()
- decode()
- issue()
- read_operand()
- execute()
- writeback()
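The following self-contained sketch illustrates why simulating the stages in reverse order models pipelining; the two-register toy pipeline and its names are purely illustrative and are not taken from the GPGPU-Sim source:

```cpp
#include <cstdio>

// Each pipeline register holds at most one in-flight "instruction"
// (an int here). Updating the later stage before the earlier one means a
// value written this cycle is only consumed by the next stage one cycle later.
int fetch_to_decode = 0;   // register between the fetch and decode stages
int decode_to_issue = 0;   // register between the decode and issue stages

int main() {
    int next_inst = 1;
    for (int cycle = 1; cycle <= 3; ++cycle) {
        // "issue" (later stage) consumes first...
        if (decode_to_issue) {
            std::printf("cycle %d: issuing instruction %d\n", cycle, decode_to_issue);
            decode_to_issue = 0;
        }
        // ...then "decode" advances the earlier register...
        if (fetch_to_decode && !decode_to_issue) {
            decode_to_issue = fetch_to_decode;
            fetch_to_decode = 0;
        }
        // ...and "fetch" injects a new instruction last.
        if (!fetch_to_decode)
            fetch_to_decode = next_inst++;
    }
    return 0; // instruction 1, fetched in cycle 1, issues in cycle 3
}
```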
The various pipeline stages are connected via a set of pipeline registers which are pointers to warp_inst_t objects (with the exception of Fetch and Decode, which are connected via an ifetch_buffer_t object).
Each shader_core_ctx object refers to a common shader_core_config object when accessing configuration options specific to the SIMT core. All shader_core_ctx objects also link to a common instance of a shader_core_stats object which keeps track of a set of performance measurements for all the SIMT cores.
Fetch and Decode Software Model
This section describes the software that models Fetch and Decode.
The I-Buffer shown in Figure 3 is implemented as an array of shd_warp_t objects inside shader_core_ctx. Each shd_warp_t has a set m_ibuffer of I-Buffer entries (ibuffer_entry) holding a configurable number of instructions (the maximum number of instructions allowed to be fetched in one cycle). Also, shd_warp_t has flags that are used by the schedulers to determine the eligibility of the warp for issue. The decoded instructions are stored in an ibuffer_entry as pointers to warp_inst_t objects. The warp_inst_t holds information about the type of operation of the instruction and the operands used.
Also, in the fetch stage, the shader_core_ctx::m_inst_fetch_buffer variable acts as a pipeline register between the fetch (instruction cache access) and the decode stage.
If the decode stage is not stalled (i.e., shader_core_ctx::m_inst_fetch_buffer is free of valid instructions), the fetch unit works. The outer for loop implements the round-robin scheduler; the last scheduled warp id is stored in m_last_warp_fetched. The first if-statement checks whether the warp has finished execution, while inside the second if-statement the actual fetch from the instruction cache (in case of a hit) or the memory access generation (in case of a miss) is done. The second if-statement mainly checks that there are no valid instructions already stored in the entry that corresponds to the currently checked warp.
The decode stage simply checks shader_core_ctx::m_inst_fetch_buffer and starts to store the decoded instructions (the current configuration decodes up to two instructions per cycle) in the instruction buffer entry (m_ibuffer, an object of shd_warp_t::ibuffer_entry) that corresponds to the warp in shader_core_ctx::m_inst_fetch_buffer.
Schedule and Issue Software Model
Within each core, there is a configurable number of scheduler units. The function shader_core_ctx::issue() iterates over these units, where each one of them executes scheduler_unit::cycle(), in which a round-robin algorithm is applied to the warps. In scheduler_unit::cycle(), the instruction is issued to its suitable execution pipeline using the function shader_core_ctx::issue_warp(). Within this function, instructions are functionally executed by calling shader_core_ctx::func_exec_inst(), and the SIMT stack (m_simt_stack[warp_id]) is updated by calling simt_stack::update(). Also, in this function, warps are held/released at barriers by shd_warp_t::set_membar() and barrier_set_t::warp_reaches_barrier. On the other hand, registers are reserved by Scoreboard::reserveRegisters() to be used later by the scoreboard algorithm.
scheduler_unit::m_sp_out, scheduler_unit::m_sfu_out, and scheduler_unit::m_mem_out point to the first pipeline register between the issue stage and the execution stage of the SP, SFU, and Mem pipelines respectively. That is why they are checked before issuing any instruction to its corresponding pipeline using shader_core_ctx::issue_warp().
SIMT Stack Software Model
For each scheduler unit there is an array of SIMT stacks. Each SIMT stack corresponds to one warp. In scheduler_unit::cycle(), the top-of-stack entry of the SIMT stack of the scheduled warp determines the issued instruction. The program counter of the top-of-stack entry is normally consistent with the program counter of the next instruction in the I-Buffer that corresponds to the scheduled warp (refer to SIMT Stack). Otherwise, in case of a control hazard, they will not match and the instructions within the I-Buffer are flushed.
The implementation of the SIMT stack is in the simt_stack class in shader.h. The SIMT stack is updated after each issue using the function simt_stack::update(...). This function implements the algorithm required at divergence and reconvergence points. Functional execution (refer to Instruction Execution) is performed at the issue stage before updating the SIMT stack. This allows the issue stage to have the next PC of each thread and, hence, to update the SIMT stack as required.
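The following is a self-contained sketch of the reconvergence-stack idea that simt_stack::update(...) implements; the data layout, function signature, and the 4-thread warp are illustrative, not the actual GPGPU-Sim code:

```cpp
#include <bitset>
#include <vector>

// Illustrative SIMT reconvergence stack for a 4-thread warp.
struct SimtEntry {
    unsigned pc;                 // next PC for the threads in this entry
    unsigned rpc;                // reconvergence PC (immediate post-dominator)
    std::bitset<4> active_mask;  // threads active on this path
};

struct SimtStackSketch {
    std::vector<SimtEntry> stack;

    // Called after a (potentially divergent) branch executes.
    // taken_mask: threads taking the branch; target: branch target PC;
    // fallthrough: not-taken PC; rpc: reconvergence PC of the branch.
    void update(std::bitset<4> taken_mask, unsigned target,
                unsigned fallthrough, unsigned rpc) {
        taken_mask &= stack.back().active_mask;
        std::bitset<4> not_taken = stack.back().active_mask & ~taken_mask;
        if (taken_mask.none() || not_taken.none()) {
            // no divergence: all active threads go the same way
            stack.back().pc = taken_mask.none() ? fallthrough : target;
        } else {
            // divergence: the top entry becomes the reconvergence entry,
            // then one entry per side of the branch is pushed
            stack.back().pc = rpc;
            stack.push_back({fallthrough, rpc, not_taken});
            stack.push_back({target, rpc, taken_mask});
        }
        // reconvergence: pop entries whose PC reached their reconvergence PC
        while (stack.size() > 1 && stack.back().pc == stack.back().rpc)
            stack.pop_back();
    }
};

int main() {
    SimtStackSketch s;
    s.stack.push_back({0x10, 0, std::bitset<4>("1111")});
    // threads 0 and 1 branch to 0x40, the others fall through to 0x20,
    // and all paths reconverge at 0x80
    s.update(std::bitset<4>("0011"), 0x40, 0x20, 0x80);
    return 0;
}
```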
Scoreboard Software Model
The scoreboard unit is instantiated in shader_core_ctx as a member object and passed to scheduler_unit by reference (pointer). It stores the shader core id and a register table indexed by warp ids. This register table tracks the registers reserved by each warp. The functions Scoreboard::reserveRegisters(...), Scoreboard::releaseRegisters(...), and Scoreboard::checkCollision(...) are used to reserve registers, release registers, and check for collisions before issuing a warp, respectively.
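A hedged sketch of this bookkeeping is shown below; the container choices and method names are illustrative and do not mirror the actual Scoreboard interface:

```cpp
#include <set>
#include <vector>

// Illustrative per-warp scoreboard: each warp owns the set of destination
// registers with writes still in flight. An instruction collides if any
// register it reads or writes is still reserved.
struct ScoreboardSketch {
    std::vector<std::set<unsigned>> reserved; // one register set per warp

    explicit ScoreboardSketch(unsigned num_warps) : reserved(num_warps) {}

    void reserve(unsigned warp, const std::vector<unsigned> &dst_regs) {
        for (unsigned r : dst_regs) reserved[warp].insert(r);
    }
    void release(unsigned warp, const std::vector<unsigned> &dst_regs) {
        for (unsigned r : dst_regs) reserved[warp].erase(r);
    }
    bool collides(unsigned warp, const std::vector<unsigned> &regs) const {
        for (unsigned r : regs)
            if (reserved[warp].count(r)) return true;
        return false;
    }
};

int main() {
    ScoreboardSketch sb(/*num_warps=*/2);
    sb.reserve(0, {5});                   // warp 0 has a pending write to r5
    bool stall = sb.collides(0, {5, 6});  // true: r5 is still in flight
    sb.release(0, {5});                   // writeback releases the register
    return stall ? 0 : 1;
}
```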
Operand Collector Software Model
The operand collector is modeled as one stage in the main pipeline executed by the function shader_core_ctx::cycle(). This stage is represented by the shader_core_ctx::read_operands() function. Refer to ALU Pipeline for more details about the interfaces of the operand collector.
The class opndcoll_rfu_t models the operand collector based register file unit. It contains classes that abstract the collector unit sets, the arbiter, and the dispatch units.
opndcoll_rfu_t::allocate_cu(...) is responsible for allocating a warp_inst_t to a free operand collector unit within its assigned set of operand collectors. It also adds read requests for all source operands to their corresponding bank queues in the arbiter.
opndcoll_rfu_t::allocate_reads(...) processes the read requests that do not have conflicts; in other words, read requests that are to different register banks and do not go to the same operand collector are popped from the arbiter queues. This also accounts for the priority of write requests over read requests.
The function opndcoll_rfu_t::dispatch_ready_cu() dispatches the operand registers of ready operand collectors (those that have collected all their operands) to the execute stage.
The function opndcoll_rfu_t::writeback( const warp_inst_t &inst ) is called at the writeback stage of the memory pipeline. It is responsible for the allocation of writes.
This summarizes the main functions used to model the operand collector; more details can be found in the implementation of the opndcoll_rfu_t class in shader.cc and shader.h.
ALU Pipeline Software Model
The timing models of the SP unit and the SFU unit are mostly implemented in the pipelined_simd_unit class defined in shader.h. The specific classes modelling the units (the sp_unit and sfu classes) are derived from this class, with an overridden can_issue() member function specifying the types of instructions executable by the unit.
The SP unit is connected to the operand collector unit via the OC_EX_SP pipeline register; the SFU unit is connected to the operand collector unit via the OC_EX_SFU pipeline register. Both units share a common writeback stage via the WB_EX pipeline register. To prevent the two units from stalling due to a writeback-stage conflict, each instruction going into either unit has to allocate a slot in the result bus (m_result_bus) before it is issued into the destined unit (see shader_core_ctx::execute()).
The following figure provides an overview of how pipelined_simd_unit models the throughput and latency of different types of instructions.
In each pipelined_simd_unit, the issue(warp_inst_t*&) member function moves the contents of the given pipeline register into m_dispatch_reg. The instruction then waits at m_dispatch_reg for initiation_interval cycles. In the meantime, no other instruction can be issued into this unit, so this wait models the throughput of the instruction. After the wait, the instruction is dispatched to the internal pipeline registers m_pipeline_reg for latency modelling. The dispatching position is determined so that the time spent in m_dispatch_reg is accounted towards the latency as well. Every cycle, the instructions advance through the pipeline registers and eventually into m_result_port, which is the shared pipeline register leading to the common writeback stage for both the SP and SFU units.
The throughput and latency of each type of instruction are specified at ptx_instruction::set_opcode_and_latency() in cuda-sim.cc. This function is called during pre-decode.
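The following self-contained sketch illustrates the dispatch-register/shift-register arrangement described above; the names and the example latency values are illustrative and are not taken from the actual pipelined_simd_unit code:

```cpp
#include <cstdio>
#include <vector>

// Illustrative latency/throughput model: one dispatch register plus a
// shift register of pipeline stages.
struct SimdUnitSketch {
    struct Inst { int id; unsigned latency; unsigned initiation_interval; };

    std::vector<Inst*> pipeline;   // analogue of m_pipeline_reg (index = cycles left)
    Inst *dispatch = nullptr;      // analogue of m_dispatch_reg
    unsigned dispatch_wait = 0;    // cycles the instruction still occupies dispatch

    explicit SimdUnitSketch(unsigned max_latency) : pipeline(max_latency, nullptr) {}

    bool can_issue() const { return dispatch == nullptr; }

    void issue(Inst *inst) {
        dispatch = inst;
        dispatch_wait = inst->initiation_interval; // models throughput
    }

    void cycle() {
        // the oldest instruction leaves towards the shared writeback port
        if (pipeline[0]) {
            std::printf("inst %d completes\n", pipeline[0]->id);
            pipeline[0] = nullptr;
        }
        // shift the remaining stages forward by one cycle
        for (size_t i = 1; i < pipeline.size(); ++i) {
            pipeline[i - 1] = pipeline[i];
            pipeline[i] = nullptr;
        }
        // after the initiation interval, place the instruction into the slot
        // that makes its total latency (including time in dispatch) correct
        if (dispatch && --dispatch_wait == 0) {
            pipeline[dispatch->latency - dispatch->initiation_interval] = dispatch;
            dispatch = nullptr; // the unit can now accept a new instruction
        }
    }
};

int main() {
    SimdUnitSketch sfu(/*max_latency=*/8);
    SimdUnitSketch::Inst inst{0, /*latency=*/6, /*initiation_interval=*/2};
    if (sfu.can_issue()) sfu.issue(&inst);
    for (int c = 0; c < 8; ++c) sfu.cycle(); // the result emerges after the 6-cycle latency
    return 0;
}
```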
Memory Stage Software Model
The ldst_unit class inside shader.cc implements the memory stage of the shader pipeline.
The class instantiates and operates on all the in-shader memories: texture (m_L1T), constant (m_L1C) and data (m_L1D). ldst_unit::cycle() implements the guts of the unit's operation and is pumped m_config->mem_warp_parts times per core cycle. This is so fully coalesced memory accesses can be processed in one shader cycle. ldst_unit::cycle() processes the memory responses from the interconnect (stored in m_response_fifo), filling the caches and marking stores as complete. The function also cycles the caches so they can send their requests for missed data to the interconnect.
Cache accesses to each type of L1 memory are done in shared_cycle(), constant_cycle(), texture_cycle() and memory_cycle() respectively. memory_cycle() is used to access the L1 data cache. Each of these functions then calls process_memory_access_queue(), a universal function that pulls an access off the instruction's internal access queue and sends the request to the cache. If the access cannot be processed in this cycle (i.e., it neither misses nor hits in the cache, which can happen when various system queues are full or when all the lines in a particular way have been reserved and are not yet filled), then the access is attempted again next cycle.
It is worth noting that not all instructions reach the writeback stage of the unit. All store instructions, and load instructions for which all requested cache blocks hit, exit the pipeline in the cycle function. This is because they do not have to wait for a response from the interconnect and can bypass the writeback logic that book-keeps which cache lines were requested by the instruction and which have been returned.
Cache Software Model
gpu-cache.h implements all the caches used by the ldst_unit. Both the constant cache and the data cache contain a member tag_array object which implements the reservation and replacement logic. The probe() function checks for a block address without affecting the LRU position of the data in question, while access() models a look-up that affects the LRU position and is the function that generates the miss and access statistics. MSHRs are modeled with the mshr_table class, which emulates a fully associative table with a finite number of merged requests. Requests are released from the MSHR through the next_access() function.
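The following sketch illustrates the merging behaviour described above; the structure and method names are illustrative and do not mirror the actual mshr_table interface:

```cpp
#include <list>
#include <map>

// Illustrative MSHR table: a finite, fully associative table keyed by block
// address; requests that miss on a block with an outstanding miss are merged.
struct MshrSketch {
    std::map<unsigned long long, std::list<unsigned>> entries; // block -> merged request ids
    unsigned max_entries;
    unsigned max_merged;

    MshrSketch(unsigned n_entries, unsigned n_merged)
        : max_entries(n_entries), max_merged(n_merged) {}

    // is there already an outstanding miss for this block?
    bool probe(unsigned long long block) const { return entries.count(block) != 0; }

    // can this request be accepted (either merged or given a new entry)?
    bool full(unsigned long long block) const {
        auto it = entries.find(block);
        if (it != entries.end()) return it->second.size() >= max_merged;
        return entries.size() >= max_entries;
    }

    void add(unsigned long long block, unsigned request_id) {
        entries[block].push_back(request_id); // merge with any existing entry
    }

    // when the fill returns, the merged requests are released one by one
    unsigned next_access(unsigned long long block) {
        unsigned id = entries[block].front();
        entries[block].pop_front();
        if (entries[block].empty()) entries.erase(block);
        return id;
    }
};

int main() {
    MshrSketch mshr(/*entries=*/32, /*merged per entry=*/4);
    if (!mshr.full(0x100)) mshr.add(0x100, 1);  // first miss allocates an entry
    if (!mshr.full(0x100)) mshr.add(0x100, 2);  // second request merges into it
    while (mshr.probe(0x100)) mshr.next_access(0x100);
    return 0;
}
```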
The read_only_cache class is used for the constant cache and as the base class for the data_cache class. This hierarchy can be somewhat confusing because the R/W data cache extends the read_only_cache. The only reason for this is that they share much of the same functionality, with the exception of the access function, which has to deal with writes in the data_cache. The L2 cache is also implemented with the data_cache class.
The tex_cache class implements the texture cache outlined in the architectural description above. It does not use the tag_array or mshr_table since its operation is significantly different from that of a conventional cache.
Thread Block / CTA / Work Group Scheduling
The scheduling of Thread Blocks to SIMT cores occurs in shader_core_ctx::issue_block2core(...). The maximum number of thread blocks (or CTAs, or Work Groups) that can be concurrently scheduled on a core is calculated by the function shader_core_config::max_cta(...). This function determines the maximum number of thread blocks that can be concurrently assigned to a single SIMT core based on the number of threads per thread block specified by the program, the per-thread register usage, the shared memory usage, and the configured limit on the maximum number of thread blocks per core. Specifically, for each of these criteria, the number of thread blocks that could be assigned to the SIMT core if that criterion were the only limit is computed. The minimum of these values is the maximum number of thread blocks that can be assigned to the SIMT core.
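A minimal sketch of that minimum-over-all-limits computation (the parameter names are illustrative; in the simulator the limits come from the shader core configuration and the kernel's resource usage):

// Sketch only: take the minimum over the thread, register, shared-memory and CTA-count limits.
unsigned max_cta_sketch(unsigned threads_per_cta, unsigned regs_per_thread,
                        unsigned shmem_per_cta, unsigned warp_size,
                        unsigned max_threads_per_core, unsigned regfile_size,
                        unsigned shmem_size, unsigned max_cta_per_core) {
    unsigned padded_cta = ((threads_per_cta + warp_size - 1) / warp_size) * warp_size;
    unsigned by_threads = max_threads_per_core / padded_cta;
    unsigned by_regs    = regs_per_thread ? regfile_size / (padded_cta * regs_per_thread)
                                          : max_cta_per_core;
    unsigned by_shmem   = shmem_per_cta ? shmem_size / shmem_per_cta : max_cta_per_core;
    unsigned result = by_threads;
    if (by_regs < result)  result = by_regs;
    if (by_shmem < result) result = by_shmem;
    if (max_cta_per_core < result) result = max_cta_per_core;
    return result;
}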
In shader_core_ctx::issue_block2core(...), the thread block size is first padded to be an exact multiple of the warp size. Then a range of free hardware thread ids is determined. The functional state for each thread is initialized by calling ptx_sim_init_thread. The SIMT stacks and warp states are initialized by calling shader_core_ctx::init_warps.
When each thread finishes, the SIMT core calls register_cta_thread_exit(...) to update the active thread block's state. When all threads in a thread block have finished, the same function decreases the count of thread blocks active on the core, allowing more thread blocks to be scheduled in the next cycle. New thread blocks to be scheduled are selected from pending kernels.
Interconnection Network
The interconnection network interface consists of the functions listed below. These functions are implemented in interconnect_interface.cpp and wrapped in icnt_wrapper.cpp. The original intention for having icnt_wrapper.cpp was to allow other network simulators to hook up to GPGPU-Sim. A typical call sequence using these functions is sketched after the list.
- init_interconnect(): Initializes the network simulator. Its inputs are the interconnection network's configuration file and the number of SIMT core clusters and memory nodes.
- interconnect_push(): Specifies a source node, a destination node, a pointer to the packet to be transmitted, and the packet size (in bytes).
- interconnect_pop(): Takes a node number as input and returns a pointer to the packet that was waiting to be ejected at that node. If there is no packet, it returns NULL.
- interconnect_has_buffer(): Takes a node number and the size of the packet to be sent as input and returns true if the input buffer of the source node has enough space.
- advance_interconnect(): Should be called every interconnection clock cycle. As the name suggests, it performs all the internal steps of the network for one cycle.
- interconnect_busy(): Returns true if there is a packet in flight inside the network.
- interconnect_stats(): Prints network statistics.
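A typical call sequence on the sending side looks roughly like the sketch below; the argument types are assumed to match the descriptions above, and the node IDs and packet are illustrative:

// Illustrative call sequence only, not actual GPGPU-Sim code.
void icnt_example_cycle(unsigned src, unsigned dst, void *packet, unsigned nbytes) {
    if (interconnect_has_buffer(src, nbytes)) {       // room in the source input buffer?
        interconnect_push(src, dst, packet, nbytes);  // hand the packet to the network
    }
    advance_interconnect();                           // one interconnect clock cycle
    void *ejected = interconnect_pop(dst);            // NULL when nothing is waiting
    if (ejected) { /* hand the packet to the SIMT core cluster or memory partition */ }
}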
Clock domain crossing for intersim
Ejecting a packet from the network
We effectively have a two-stage buffer per virtual channel at the output. The first stage contains a buffer per virtual channel with the same capacity as the buffers internal to the network; the second stage, again one buffer per virtual channel, is where we cross from one clock domain to the other: flits are pushed into the second-stage buffer in the interconnect clock domain, and whole packets are removed from it in the shader or L2/DRAM clock domain. We return a credit only when we are able to move a flit from the first-stage buffer to the second-stage buffer (and this occurs at the interconnect clock frequency).
Ejection interface details
Here is a more detailed explanation of the clock boundary implementation. At the ejection port of each router we have as many buffers as there are virtual channels (VCs), and the size of each buffer is exactly equal to the VC buffer size. These are the first-stage buffers mentioned above. Let's call the second stage of buffers (again, one per VC) the boundary buffers. By default each boundary buffer holds 16 flits (this is a configurable option called boudry_buf_size). When a router tries to eject a flit, the flit is put in the corresponding first-stage buffer based on the VC it is coming from (no credit is sent back yet). The boundary buffers are then checked for space; if there is space, a flit is popped from the corresponding ejection buffer and pushed into the boundary buffer (this is done for all buffers in the same cycle). At this point the flit is also pushed into a credit return queue. Routers can pop one flit per network cycle from this credit return queue and generate its corresponding credit. The shader (or L2/DRAM) side pops the boundary buffer every shader (or DRAM/L2) cycle and gets a full packet, i.e., if the packet is 4 flits it frees up 4 slots in the boundary buffer; if it is 1 flit it frees up only 1 slot. Since there are as many boundary buffers as VCs, the shader (or DRAM) pops them in round-robin order (it can only get one packet per cycle). In this design the first-stage buffer always has space for the flits coming from the router, and as the boundary buffers fill up the backward flow of credits stops.
Note that the implementation described above is just our way of implementing the interface logic in the simulator and not necessarily the way the network interface is actually implemented in real hardware.
Injecting a packet into the network
Each node of the network has an input buffer. Its size is configurable via the input_buffer_size option in the interconnect config file. To inject a packet into the interconnect, the input buffer capacity is first checked by calling interconnect_has_buffer(). If there is enough space, the packet is pushed into the interconnect by calling interconnect_push(). These steps are done in the shader clock domain (in the memory stage) for SIMT core nodes and in the interconnect clock domain for memory nodes.
Every time the advance_interconnect() function is called (in the interconnect clock domain), flits are taken out of the input buffer of each node and start traveling through the network (if possible).
Memory Partition
The Memory Partition is modelled by the memory_partition_unit class defined inside l2cache.h and l2cache.cc. These files also define an extended version of the mem_fetch_allocator, partition_mf_allocator, for generation of mem_fetch objects (memory requests) by the Memory Partition and L2 cache.
Of the sub-components described in the Memory Partition micro-architecture model section, the member object of type data_cache models the L2 cache and the one of type dram_t models the off-chip DRAM channel. The various queues are modelled using the fifo_pipeline class. The minimum-latency ROP queue is modelled as a queue of rop_delay_t structs. Each rop_delay_t struct stores the minimum time at which its memory request can exit the ROP queue (push time + constant ROP delay). The m_request_tracker object tracks all in-flight requests not yet fully serviced by the Memory Partition, to determine whether the Memory Partition is currently active.
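A minimal sketch of such a minimum-latency queue (the field names, queue type and constants are illustrative, not copied from l2cache.h):

#include <queue>

// Illustrative stand-in for the ROP delay queue described above.
struct rop_delay_entry {
    unsigned long long ready_cycle;  // push cycle + fixed ROP latency
    void *request;                   // the memory request being delayed
};

void rop_push(std::queue<rop_delay_entry> &ropq, void *mf,
              unsigned long long now, unsigned rop_latency) {
    ropq.push({now + rop_latency, mf});
}

void *rop_pop_if_ready(std::queue<rop_delay_entry> &ropq, unsigned long long now) {
    if (!ropq.empty() && ropq.front().ready_cycle <= now) {
        void *mf = ropq.front().request;
        ropq.pop();
        return mf;
    }
    return nullptr;  // head of queue has not yet waited out its minimum ROP delay
}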
The Atomic Operation Unit does not have an associated class. This component is modelled simply by functionally executing the atomic operations of memory requests leaving the L2->icnt queue. The next section presents further details.
Memory Partition Connections and Traffic Flow
The gpgpu_sim::cycle() method clocks all the architectural components in GPGPU-Sim, including the Memory Partition's queues, DRAM channel and L2 cache bank.
The code segment
::icnt_push( m_shader_config->mem2device(i), mf->get_tpc(), mf, response_size );
m_memory_partition_unit[i]->pop();
injects memory requests into the interconnect from the Memory Partition's L2->icnt queue. The call to memory_partition_unit::pop() functionally executes atomic instructions. The request tracker also discards the entry for that memory request here indicating that the Memory Partition is done servicing this request.
The call to memory_partition_unit::dram_cycle() moves memory requests from the L2->dram queue to the DRAM channel, from the DRAM channel to the dram->L2 queue, and cycles the off-chip GDDR3 DRAM memory.
The call to memory_partition_unit::push() ejects packets from the interconnection network and passes them to the Memory Partition. The request tracker is notified of the request. Texture accesses are pushed directly into the icnt->L2 queue, while non-texture accesses are pushed into the minimum latency ROP queue. Note that the push operations into both the icnt->L2 and ROP queues are throttled by the size of icnt->L2 queue as defined in the memory_partition_unit::full() method.
The call to memory_partition_unit::cache_cycle() clocks the L2 cache bank and moves requests into or out of the L2 cache. The next section describes the internals of memory_partition_unit::cache_cycle().
L2 Cache Model
Inside memory_partition_unit::cache_cycle(), the call
mem_fetch *mf = m_L2cache->next_access();
generates replies for memory requests waiting in filled MSHR entries, as described in the MSHR description. Fill responses, i.e. response messages to memory requests generated by the L2 on read misses, are passed to the L2 cache by popping from the dram->L2 queue and calling
m_L2cache->fill(mf,gpu_sim_cycle+gpu_tot_sim_cycle);
Fill requests that are generated by the L2 due to read misses are popped from the L2's miss queue and pushed into the L2->dram queue by calling
m_L2cache->cycle();
The L2 access for a memory request exiting the icnt->L2 queue is done by the call
enum cache_request_status status = m_L2cache->access(mf->get_partition_addr(),mf,gpu_sim_cycle+gpu_tot_sim_cycle,events);
On an L2 cache hit, a response is immediately generated and pushed into the L2->icnt queue. On a miss, no request is generated here, as the code internal to the cache class has already generated a memory request in its miss queue. If the L2 cache is disabled, then memory requests are pushed straight from the icnt->L2 queue to the L2->dram queue.
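The branching described above can be summarized with a small sketch; the status values and queue types are simplified stand-ins for the simulator's own:

#include <queue>

enum l2_status_sketch { L2_HIT_SK, L2_MISS_SK };

void handle_l2_access_sketch(l2_status_sketch status, bool l2_enabled, void *mf,
                             std::queue<void*> &L2_icnt_queue,
                             std::queue<void*> &L2_dram_queue) {
    if (!l2_enabled) {            // L2 disabled: forward straight to DRAM
        L2_dram_queue.push(mf);
        return;
    }
    if (status == L2_HIT_SK) {
        L2_icnt_queue.push(mf);   // reply can be generated immediately
    }
    // On a miss nothing is pushed here: the cache has already queued a fill
    // request in its internal miss queue (drained later by m_L2cache->cycle()).
}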
Also in memory_partition_unit::cache_cycle(), memory requests are popped from the ROP queue and inserted into the icnt->L2 queue.
DRAM Scheduling and Timing Model
The DRAM timing model is implemented in the files dram.h and dram.cc. The timing model also includes an implementation of a FIFO scheduler. The more complicated FRFCFS scheduler is located in dram_sched.h and dram_sched.cc.
The function dram_t::cycle() represents a DRAM cycle. In each cycle, the DRAM pops a request from the request queue and then calls the scheduler function to allow the scheduler to select a request to be serviced based on the scheduling policy. Before requests are sent to the scheduler, they wait in the DRAM latency queue for a fixed number of SIMT core cycles. This functionality is also implemented inside dram_t::cycle().
case DRAM_FIFO: scheduler_fifo(); break;
case DRAM_FRFCFS: scheduler_frfcfs(); break;
The DRAM timing model then checks if any bank is ready to issue a new request based on the different timing constraints specified in the configuration file. Those constraints are represented in the DRAM model by variables similar to this one
unsigned int CCDc; //Column to Column Delay
Those variables are decremented at the end of each cycle. An action is only taken when all of its constraint variables have reached zero, and each taken action resets a set of constraint variables to their original configured values. For example, when a column is activated, the variable CCDc is reset to its original configured value and then decremented by one every cycle; we cannot schedule a new column until this variable reaches zero. The macro DEC2ZERO decrements a variable until it reaches zero and then keeps it at zero until another action resets it.
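A minimal sketch of this saturating-counter bookkeeping; the macro mirrors the behaviour described in the text, and the surrounding struct is purely illustrative:

// Decrement a constraint counter, saturating at zero (behaviour as described above).
#define DEC2ZERO(x) ((x) = (x) ? (x) - 1 : 0)

// Example: column-to-column delay bookkeeping for one bank (illustrative only).
struct bank_timing_sketch {
    unsigned CCDc;   // cycles until the next column command is allowed
    unsigned tCCD;   // configured column-to-column delay
    bool can_issue_column() const { return CCDc == 0; }
    void issue_column()  { CCDc = tCCD; }    // taken action resets the constraint
    void end_of_cycle()  { DEC2ZERO(CCDc); } // decremented at the end of each cycle
};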
Interface between CUDA-Sim and GPGPU-Sim
The timing simulator (GPGPU-Sim) interfaces with the functional simulator (CUDA-sim) through the ptx_thread_info class. The m_thread member variable of the SIMT core class shader_core_ctx is an array of ptx_thread_info and maintains the functional state of all threads active in that SIMT core. The timing model communicates with the functional model through the warp_inst_t class, which represents a dynamic instance of an instruction being executed by a single warp.
The timing model communicates with the functional model at the following three stages of simulation.
Decoding
In the decoding stage at shader_core_ctx::decode(), the timing simulator obtains the instruction from the functional simulator given a PC. This is done by calling the ptx_fetch_inst function.
Instruction execution
- Functional execution: The timing model advances the functional state of a thread by one instruction by calling the ptx_exec_inst method of class ptx_thread_info. This is done inside core_t::execute_warp_inst_t. The timing simulator passes in the dynamic instance of the instruction to execute, and the functional model advances the thread's state accordingly.
- SIMT stack update: After functional execution of an instruction for a warp, the timing model updates the next PC in the SIMT stack by requesting it from the functional model. This happens inside simt_stack::update (a conceptual sketch of a reconvergence stack follows this list).
- Atomic callback: If the instruction is an atomic operation, functional execution does not take place in core_t::execute_warp_inst_t. Instead, during the functional execution stage the functional simulator stores a pointer to the atomic instruction in the warp_inst_t object by calling warp_inst_t::add_callback. The timing simulator executes this callback function as the request is leaving the L2 cache (see Memory Partition Connections and Traffic Flow).
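As a conceptual, self-contained illustration of what a SIMT reconvergence stack has to do at a divergent branch (this is not the simt_stack implementation, just the idea of pushing taken/not-taken entries that reconverge at the immediate post-dominator):

#include <cstdint>
#include <stack>
#include <vector>

struct simt_entry { uint64_t pc; uint64_t reconv_pc; std::vector<bool> mask; };

void branch_update_sketch(std::stack<simt_entry> &st,
                          uint64_t taken_pc, uint64_t not_taken_pc, uint64_t reconv_pc,
                          const std::vector<bool> &taken_mask) {
    simt_entry top = st.top();
    st.pop();
    std::vector<bool> not_taken(top.mask.size()), taken(top.mask.size());
    bool any_taken = false, any_not_taken = false;
    for (size_t i = 0; i < top.mask.size(); ++i) {
        taken[i]     = top.mask[i] && taken_mask[i];
        not_taken[i] = top.mask[i] && !taken_mask[i];
        any_taken     |= static_cast<bool>(taken[i]);
        any_not_taken |= static_cast<bool>(not_taken[i]);
    }
    if (any_taken && any_not_taken) {                        // divergence
        st.push({reconv_pc, top.reconv_pc, top.mask});       // reconvergence entry
        st.push({not_taken_pc, reconv_pc, not_taken});       // fall-through path
        st.push({taken_pc, reconv_pc, taken});               // taken path executes first
    } else {                                                 // uniform branch: just update the PC
        top.pc = any_taken ? taken_pc : not_taken_pc;
        st.push(top);
    }
}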
Launching Thread Blocks
When new thread blocks are launched in shader_core_ctx::issue_block2core, the timing simulator initializes the per-thread functional state by calling the functional model method ptx_sim_init_thread. Additionally, the timing model also initializes the SIMT stack and warp states by fetching the starting PC from the functional model.
Address Decoding
Address decoding is responsible for translating linear addresses to raw addresses, which are used to access the appropriate row, column, and bank in DRAM. Address decoding is also responsible for determining which memory controller to send the memory request to. The code for address decoding is found in addrdec.h and addrdec.cc, located in "gpgpu-sim_root/src/gpgpu-sim/". When a load or store instruction is encountered in the kernel code, a "memory fetch" object is created (defined in mem_fetch.h/mem_fetch.cc). Upon creation, the mem_fetch object decodes the linear address by calling addrdec_tlx(new_addr_type addr /*linear address*/, addrdec_t *tlx /*raw address struct*/).
The interpretation of the linear address can be set to one of 13 predefined configurations by setting "-gpgpu_mem_address_mask" in a "gpgpusim.config" file to one of (0, 1, 2, 3, 5, 6, 14, 15, 16, 100, 103, 106, 160). These configurations specify the bit masks used to extract the chip (memory controller), row, col, bank, and burst from the linear address. A custom mapping can be chosen by setting "-gpgpu_mem_addr_mapping" in a "gpgpusim.config" file to a desired mapping, such as
-gpgpu_mem_addr_mapping dramid@8;00000000.00000000.00000000.00000000.0000RRRR.RRRRRRRR.RRBBBCCC.CCCSSSSS
Where R(r)=row, B(b)=bank, C(c)=column, S(s)=Burst, and D(d) [not shown]=chip.
Also, dramid@<#> means that the address decoder will insert the dram/chip ID starting at bit <#> (counting from LSB) -- i.e., dramid@8 will start at bit 8.
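To make the mask-based decoding concrete, here is a small self-contained sketch that gathers the address bits selected by a mask into one field; the example mask is made up, and the real masks and decoder live in addrdec.cc:

#include <cstdint>

// Gather the bits of 'addr' selected by 'mask' into a densely packed value
// (the same idea the decoder uses for the row/col/bank/chip/burst fields).
uint64_t extract_field(uint64_t addr, uint64_t mask) {
    uint64_t result = 0;
    unsigned out_bit = 0;
    for (unsigned bit = 0; bit < 64; ++bit) {
        if (mask & (1ULL << bit)) {
            if (addr & (1ULL << bit)) result |= (1ULL << out_bit);
            ++out_bit;
        }
    }
    return result;
}
// Example (hypothetical mask): uint64_t bank = extract_field(addr, 0x0000000000038000ULL);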
Output to AerialVision Performance Visualizer
In gpgpu_sim::cycle(), gpgpu_sim::visualizer_printstat() (in gpgpu-sim/visualizer.cc) is called every sampling interval to append a snapshot of the monitored performance metrics to a log file. This log file is the input for the time-lapse view in AerialVision. The log file is compressed via zlib as it is created to minimize disk usage. The sampling interval can be configured by the option -gpgpu_runtime_stat.
gpgpu_sim::visualizer_printstat() calls a set of functions to sample the performance metrics in various modules:
- cflog_visualizer_gzprint(): Generates data for the PC-Histogram (see the ISPASS 2010 paper for details).
- shader_CTA_count_visualizer_gzprint(): The number of CTAs active in each SIMT core.
- shader_core_stats::visualizer_print(): Performance metrics for SIMT cores, including the warp cycle breakdown.
- memory_stats_t::visualizer_print(): Performance metrics for memory accesses.
- memory_partition_unit::visualizer_print(): Performance metrics for each memory partition. Calls dram_t::visualizer_print().
- time_vector_print_interval2gzfile(): Latency distribution for memory accesses.
The PC-Histogram is implemented using two classes: thread_insn_span and thread_CFlocality. Both classes can be found in gpgpu-sim/stat-tool.{h,cc}. It is interfaced to the SIMT cores via a C interface:
- create_thread_CFlogger(): Creates one thread_CFlocality object for each SIMT core.
- cflog_update_thread_pc(): Updates the PC of a thread. The new PC is added to the list of PCs touched by this thread.
- cflog_visualizer_gzprint(): Outputs the PC-Histogram of the current sampling interval to the log file.
Histogram
GPGPU-Sim provides several histogram data types that simplify generating a value breakdown for any metric. These histogram classes are implemented in histogram.{h,cc}:
- binned_histogram: The base histogram, with each unique integer value occupying a bin.
- pow2_histogram: A power-of-two histogram, with each bin representing log2 of the input value. This is useful when the value of a metric can span a large range (differing by orders of magnitude).
- linear_histogram: A histogram with each bin representing a range of values specified by the stride.
All of the histogram classes offer the same interface:
- constructor([stride], name, nbins, [bins]): Creates a histogram with the given name, with nbins bins, and with bins located at the given pointer (optional). The stride option is only available to linear_histogram.
- reset_bins(): Resets all the bins to zero.
- add2bin(sample): Adds a sample to the histogram.
- fprint(fout): Prints the histogram to the given file handle. Here is the output format (a brief usage example follows the format line):
<name> = <number of samples in each bin> max=<maximum among the samples> avg=<average value of the samples>
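For example, a power-of-two histogram of memory latencies could be used roughly as follows; the calls follow the interface listed above, but the exact constructor signature and header should be checked against histogram.h:

#include <cstdio>
#include "histogram.h"   // assumed location of the histogram classes

void histogram_usage_example() {
    pow2_histogram lat_histo("mem_latency", 32 /* nbins */);
    lat_histo.add2bin(12);     // falls into bin log2(12) = 3
    lat_histo.add2bin(400);    // falls into bin log2(400) = 8
    lat_histo.fprint(stdout);  // prints "mem_latency = ..." in the format shown above
}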
Dump Pipeline
See #Visualizing_Cycle_by_Cycle_Microarchitecture_Behavior for how this is used.
The top level dump pipeline code is implemented in gpgpu_sim::pipeline(...) in gpgpu-sim/gpu-sim.cc. It calls shader_core_ctx::display_pipeline(...) in gpgpu-sim/shader.cc for the pipeline states in each SIMT core. For memory partition states, it calls memory_partition_unit::print(...) in gpgpu-sim/l2cache.cc.
CUDA-sim - Functional Simulation Engine
The src/cuda-sim directory contains files that implement the functional simulation engine used by GPGPU-Sim. For increased flexibility the functional simulation engine interprets instructions at the level of individual scalar operations per vector lane.
Key Objects Descriptions
kernel_info (<gpgpu-sim_root>/src/abstract_hardware_model.h/cc):
- The kernel_info_t object contains the GPU grid and block dimensions, the function_info object associated with the kernel entry point, and memory allocated for the kernel arguments in param memory.
ptx_cta_info (<gpgpu-sim_root>/src/ptx_sim.h/cc):
- Contains the thread state (ptx_thread_info) for the set of threads within a cooperative thread array (CTA) (or workgroup in OpenCL).
ptx_thread_info (<gpgpu-sim_root>/src/ptx_sim.h/cc):
- Contains functional simulation state for a single scalar thread (work item in OpenCL). This includes the following:
  - Register value storage
  - Local memory storage (private memory in OpenCL)
  - Shared memory storage (local memory in OpenCL). Notice that all scalar threads from the same thread block/workgroup access the same shared memory storage.
  - Program counter (PC)
  - Call stack
  - Thread IDs (the software ID within a grid launch, and the hardware ID indicating which hardware thread slot it occupies in the timing model)
The current functional simulation engine was developed to support NVIDIA's PTX. PTX is essentially a low level compiler intermediate representation but not the actual machine representation used by NVIDIA hardware (which is known as SASS). Since PTX does not define a binary representation, GPGPU-Sim does not store a binary view of instructions (e.g., as you would learn about when studying instruction set design in an undergraduate computer architecture course). Instead, the text representation of PTX is parsed into a list of objects somewhat akin to a low level compiler intermediate representation.
Individual PTX instructions are found inside PTX functions that are either kernel entry points or subroutines that can be called on the GPU. Each PTX function has a function_info object:
function_info (<gpgpu-sim_root>/src/cuda-sim/ptx_ir.h/cc):
- Contains a list of static PTX instructions (ptx_instruction objects) that can be functionally simulated.
- For kernel entry points, stores each of the kernel arguments in a map, m_ptx_kernel_param_info; however, this might not always be the case for OpenCL applications. In OpenCL, the associated constant memory space can be allocated in two ways: it can be explicitly initialized in the .ptx file where it is declared, or it can be allocated using clCreateBuffer on the host. In the latter case, the .ptx file will contain a global declaration of the parameter, but with an unknown array size. Thus, the symbol's address will not be set and needs to be set in the function_info::add_param_data(...) function before executing the PTX. In this case, the address of the kernel argument is stored in a symbol table in the function_info object.
The list below describes the class hierarchy used to represent instructions in GPGPU-Sim 3.x. The hierarchy was designed to support future expansion of instruction sets beyond PTX and to isolate functional simulation objects from the timing model.
inst_t (<gpgpu-sim_root>/src/abstract_hardware_model.h/cc):
- Contains an abstract view of a static instruction relevant to the microarchitecture. This includes the opcode type, source and destination register identifiers, instruction address, instruction size, reconvergence point instruction address, instruction latency and initiation interval, and for memory operations, the memory space accessed.
warp_inst_t (<gpgpu-sim_root>/src/abstract_hardware_model.h/cc) (derived from inst_t):
- Contains the view of a dynamic instruction relevant to the microarchitecture. This includes per lane dynamic information such as mask status and memory address accessed. To support accurate functional execution of global memory atomic operations this class includes a callback interface to functionally simulate atomic memory operations when they reach the DRAM interface in performance simulation.
ptx_instruction (<gpgpu-sim_root>/src/cuda-sim/ptx_ir.h/cc) (derived from warp_inst_t):
- Contains the full state of a dynamic instruction including the interfaces required for functional simulation.
To support functional simulation GPGPU-Sim must access data in the various memory spaces defined in the CUDA and OpenCL memory models. This requires both a way to name locations and a place to store the values in those locations.
For naming locations, GPGPU-Sim initially builds up a "symbol table" representation while parsing the input PTX. This is done using the following classes:
symbol_table (<gpgpu-sim_root>/src/cuda-sim/ptx_ir.h/cc):
- Contains a mapping from the textual representation of a memory location in PTX (e.g., "%r2", "input_data", etc...) to a symbol object that contains information about the data type and location.
symbol (<gpgpu-sim_root>/src/cuda-sim/ptx_ir.h/cc):
- Contains information about the name and type of data and its location (address or register identifier) in the simulated GPU memory space. Also tracks where the name was declared in the PTX source.
type_info and type_info_key (<gpgpu-sim_root>/src/cuda-sim/ptx_ir.h/cc):
- Contains information about the type of a data object (used during instruction interpretation).
operand_info (<gpgpu-sim_root>/src/cuda-sim/ptx_ir.h/cc):
- A wrapper class containing a source operand for an instruction which may be either a register identifier, memory operand (including displacement mode information), or immediate operand.
Storage of dynamic data values used in functional simulation uses different classes for registers and memory spaces. Register values are contained in ptx_thread_info::m_regs which is a mapping from symbol pointer to a C union called ptx_reg_t. Registers are accessed using the method ptx_thread_info::get_operand_value() which uses operand_info as input. For memory operands this method returns the effective address of the memory operand. Each memory space in the programming model is contained in an object of type memory_space. Memory spaces visible to all threads in the GPU are contained in gpgpu_t and accessed via interfaces in ptx_thread_info (e.g., ptx_thread_info::get_global_memory).
memory_space (<gpgpu-sim_root>/src/cuda-sim/memory.h/cc):
- Abstract base class for implementing memory storage for functional simulation state.
memory_space_impl (<gpgpu-sim_root>/src/cuda-sim/memory.h/cc):
- To optimize functional simulation performance, memory is implemented using a hash table. The hash table block size is a template argument for the template class memory_space_impl.
PTX extraction
Depending on the configuration file, PTX is extracted either from cubin files or using cuobjdump. This section describes the flow of information for extracting PTX and other information. Figure 14 shows the possible flows for the extraction.
From cubin
__cudaRegisterFatBinary( void *fatCubin ) in cuda_runtime_api.cc is the function responsible for extracting PTX. It is called by the program for each CUDA file. FatbinCubin is a structure that contains different versions of the PTX and cubin corresponding to that CUDA file. GPGPU-Sim extracts the newest version of PTX that is not newer than forced_max_capability (defined in the simulation parameters).
Using cuobjdump
In CUDA version 4.0 and later, the fat cubin used to extract the PTX and SASS is no longer available. Instead, cuobjdump is used. cuobjdump is a tool provided by NVIDIA along with the CUDA toolkit that can extract the PTX, SASS and other information from the executable. If the option -gpgpu_ptx_use_cuobjdump is set to "1", then GPGPU-Sim will invoke cuobjdump to extract the PTX, SASS and other information from the binary. If conversion to PTXPlus is enabled, the simulator will invoke cuobjdump_to_ptxplus to convert the SASS to PTXPlus. The resulting program is then loaded.
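For example, the corresponding line in gpgpusim.config simply enables the option described above:

-gpgpu_ptx_use_cuobjdump 1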
PTX/PTXPlus loading
When the PTX/PTXPlus program is ready, gpgpu_ptx_sim_load_ptx_from_string(...) is called. This function uses Lex/Yacc to parse the PTX code and creates a symbol table for that PTX file. Then add_binary(...) is called; this function adds the created symbol table to the CUctx structure, which saves all function and symbol table information. The function gpgpu_ptxinfo_load_from_string(...) is invoked to extract information from the ptxinfo file. It runs ptxas (the PTX assembler tool from the CUDA Toolkit) on the PTX file and parses the output using Lex and Yacc, extracting information such as the number of registers used by each kernel. Finally, gpgpu_ptx_sim_convert_ptx_to_ptxplus(...) is invoked to create PTXPlus.
The __cudaRegisterFunction(...) function is invoked by the application for each device function. It generates a mapping between device and host functions. Inside register_function(...), GPGPU-Sim searches for the symbol table associated with the fatCubin in which the device function is located, and generates a map between the kernel entry point and the CUDA application's function address (the host function).
PTXPlus support
This subsection describes how PTXPlus is implemented in GPGPU-Sim 3.x.
PTXPlus Conversion
GPGPU-Sim version 3.1.0 and later implement support for native hardware ISA execution (PTXPlus) by using NVIDIA's 'cuobjdump' utility. Currently, PTXPlus is only supported with CUDA 4.0.
When PTXPlus is enabled, the simulator uses cuobjdump to extract into text format the embedded SASS (NVIDIA's hardware ISA) image included in CUDA binaries. This text representation of SASS is then converted to our own extension of PTX, called PTXPlus, using a separate executable called cuobjdump_to_ptxplus. The conversion process needs more information than is available in the SASS text representation; this information is acquired from the ELF and PTX code, also extracted using cuobjdump. cuobjdump_to_ptxplus bundles all this information into a single PTXPlus file. Figure 14 depicts the slight differences in run-time execution flow when using PTXPlus. Note that no changes are required in the compilation process of CUDA executables; the conversion process is handled entirely by GPGPU-Sim at run time. Note also that the flow illustrated in Figure 14 is different from the one illustrated in Figure 3(b) of our ISPASS 2009 paper.
The translation from PTX to PTXPlus is performed by gpgpu_ptx_sim_convert_ptx_and_sass_to_ptxplus(), located in ptx_loader.cc. gpgpu_ptx_sim_convert_ptx_and_sass_to_ptxplus() is called by usecuobjdump(), which passes in the SASS, PTX and ELF information. gpgpu_ptx_sim_convert_ptx_and_sass_to_ptxplus() calls cuobjdump_to_ptxplus on those inputs. cuobjdump_to_ptxplus uses the three inputs to create the final PTXPlus version of the original program, which is returned from gpgpu_ptx_sim_convert_ptx_and_sass_to_ptxplus().
Operation of cuobjdump_to_ptxplus
cuobjdump_to_ptxplus uses three files to generate PTXPlus. First, before cuobjdump_to_ptxplus is executed, GPGPU-Sim parses the information output by NVIDIA's cuobjdump and divides that information into multiple files. For each section (a section corresponds to one CUDA binary), three files are generated: .ptx, .sass and .elf. These files are merely a split of the output of cuobjdump so that it can be easily handled by cuobjdump_to_ptxplus. A description of each is provided below:
- .ptx: contains the PTX code corresponding to the CUDA binary
- .sass: contains the SASS generated by building the PTX code
- .elf: contains a textual dump of the ELF object
cuobjdump_to_ptxplus takes the three files corresponding to a single binary as input and generates a PTXPlus file. Multiple calls are made to cuobjdump_to_ptxplus to convert multiple binaries as needed. Each of the files are parsed, and an elaborate intermediate representation is generated. Multiple functions are then called to output this representation in the form of a PTXPlus file. Below is a description of the information extracted from each of the files:
- .ptx: The ptx file is used to extract information about the available kernels, their function signatures, and information about textures.
- .sass: The sass file is used to extract the actual instructions that will be converted to PTXPlus instructions in the output PTXPlus file.
- .elf: The elf file is used to extract constant and local memory values as well as constant memory pointers.
PTXPlus Implementation
ptx_thread_info::get_operand_value() in instructions.cc determines the current value of an input operand. The following extensions to get_operand_value are meant for PTXPlus execution.
- If a register operand has a ".lo" modifier, only the lower 16 bits are read. If a register operand has a ".hi" modifier, only the upper 16 bits are read. This information is stored in the m_operand_lohi property of the operand: a value of 0 is the default, a value of 1 means a ".lo" modifier, and a value of 2 means a ".hi" modifier (a small sketch of this selection follows the list).
- For PTXPlus style 64-bit or 128-bit operands, the get_reg() function is passed each register name and the final result is constructed by combining the data in each register.
- The return value of get_double_operand_type() indicates the use of one of the new ways of determining the memory address:
  - If it is 1, the values of two registers must be added together.
  - If it is 2, the address is stored in a register and the value in that register is post-incremented by the value in a second register.
  - If it is 3, the address is stored in a register and the value in that register is post-incremented by a constant.
- For memory operands, the first half of the get_operand_value() function calculates the memory address to access. This value is stored in result, which is used as the address; the appropriate data is fetched from the address space indicated by the operand and returned. For post-incrementing memory accesses, the register holding the address is also incremented in get_operand_value().
- If the operand is not a memory operand, the value of the register is returned.
- get_operand_value checks for a negative sign on the operand and, if necessary, negates finalResult before returning it.
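A minimal illustration of the ".lo"/".hi" selection from the first item above; the 0/1/2 encoding follows the text, while the function itself is a stand-in rather than the simulator's code:

#include <cstdint>

// 0 = full register, 1 = ".lo" (lower 16 bits), 2 = ".hi" (upper 16 bits).
uint32_t apply_lohi_sketch(uint32_t reg_value, int operand_lohi) {
    switch (operand_lohi) {
        case 1:  return reg_value & 0xFFFF;          // ".lo"
        case 2:  return (reg_value >> 16) & 0xFFFF;  // ".hi"
        default: return reg_value;                   // no modifier
    }
}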
Control Flow Analysis + Pre-decode
Each kernel function is analyzed and pre-decoded as it is loaded into GPGPU-Sim. When the PTX parser detects the end of a kernel function, it calls function_info::ptx_assemble() (in cuda-sim/cuda-sim.cc). This function does the following:
- Assign each instruction in the function a unique PC
- Resolve each branch label in the function to a corresponding instruction/PC
  - i.e., determine the branch target for each branch instruction
- Create the control flow graph for the function
- Perform control-flow analysis
- Pre-decode each instruction (to speed up simulation)
Creation of the control flow graph is done via two member functions in function_info:
- create_basic_blocks() groups individual instructions into basic blocks (basic_block_t).
- connect_basic_blocks() connects the basic blocks to form a control flow graph.
After creating the control flow graph, two control flow analyses are performed.
- Determine the target of each break instruction:
  - This is a makeshift solution to support the break instruction in PTXPlus, which implements break statements in while-loops. A long-term solution is to extend the SIMT stack with proper break entries.
  - The goal is to determine the latest breakaddr instruction that precedes each break instruction, assuming that the code has structured control flow. This information can be determined by traversing upstream through the dominator tree (constructed by calling the member functions find_dominators() and find_idominators()). However, the control flow graph changes after break instructions are connected to their targets. The current solution is to perform this analysis iteratively until both the dominator tree and the break targets become stable.
  - The algorithm for finding dominators is described in Muchnick's Advanced Compiler Design and Implementation (Figure 7.14 and Figure 7.15).
- Find the immediate post-dominator of each branch instruction:
  - This information is used by the SIMT stack as the reconvergence point at a divergent branch.
  - The analysis is done by calling the member functions find_postdominators() and find_ipostdominators(). The algorithm is described in Muchnick's Advanced Compiler Design and Implementation (Figure 7.14 and Figure 7.15).
Pre-decode is performed by calling ptx_instruction::pre_decode() for each instruction. It extracts information that is useful to the timing simulator.
- Detect LD/ST instructions.
- Determine whether the instruction writes to a destination register.
- Obtain the reconvergence PC if this is a branch instruction.
- Extract the register operands of the instruction.
- Detect predicated instructions.
The extracted information is stored inside the ptx_instruction object corresponding to the instruction. This speeds up simulation because all scalar threads in a kernel launch execute the same kernel function; extracting this information once as the function is loaded is significantly more efficient than repeating the same extraction for each thread during simulation.
Memory Space Buffer
In CUDA-sim, the various memory spaces in CUDA/OpenCL are implemented functionally with memory space buffers (the memory_space_impl class in memory.h).
- All of the global, texture, and constant memory spaces are implemented with a single memory_space object inside the top-level gpgpu_t class (as member object m_global_memory).
- The local memory space of each thread is contained in the ptx_thread_info object corresponding to that thread.
- The shared memory space is common to the entire CTA (thread block), and a unique memory_space object is allocated for each CTA when it is dispatched for execution (in function ptx_sim_init_thread()). The object is deallocated when the CTA has completed execution.
The memory_space_impl class implements the read-write interface defined by the abstract class memory_space. Internally, each memory_space_impl object contains a set of memory pages (implemented by the class template mem_storage). It uses an STL unordered map (falling back to an STL map if unordered map is not available) to associate pages with their corresponding addresses. Each mem_storage object is an array of bytes with read and write functions. Initially, each memory_space object is empty, and pages are allocated on demand as addresses within them are accessed (either via an LD/ST instruction or cudaMemcpy()).
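The on-demand page allocation can be pictured with the following self-contained sketch; the page size and container choice are illustrative, and the real implementation is the mem_storage/memory_space_impl pair in memory.h/cc:

#include <cstdint>
#include <unordered_map>
#include <vector>

// Illustrative page-based backing store: pages are created the first time
// an address inside them is written, mirroring the on-demand behaviour above.
class paged_memory_sketch {
    static const uint64_t PAGE_SIZE = 4096;  // hypothetical block size
    std::unordered_map<uint64_t, std::vector<uint8_t>> m_pages;
public:
    void write(uint64_t addr, const void *data, size_t len) {
        const uint8_t *src = static_cast<const uint8_t*>(data);
        for (size_t i = 0; i < len; ++i) {
            std::vector<uint8_t> &page = m_pages[(addr + i) / PAGE_SIZE];
            if (page.empty()) page.resize(PAGE_SIZE, 0);   // allocate on first touch
            page[(addr + i) % PAGE_SIZE] = src[i];
        }
    }
    void read(uint64_t addr, void *data, size_t len) const {
        uint8_t *dst = static_cast<uint8_t*>(data);
        for (size_t i = 0; i < len; ++i) {
            auto it = m_pages.find((addr + i) / PAGE_SIZE);
            dst[i] = (it == m_pages.end()) ? 0 : it->second[(addr + i) % PAGE_SIZE];
        }
    }
};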
The implementation of memory_space, memory_space_impl and mem_storage can be found in files memory.h and memory.cc.
Global/Constant Memory Initialization
In CUDA, a programmer can declare device variables that are accessible to all kernel/device functions. These variables can be in the global (e.g. x_d) or constant memory space (e.g. y_c):
__device__ int x_d = 100;
__constant__ int y_c = 70;
These variables, and their initial values, are compiled into PTX variables.
In GPGPU-Sim, these variables are parsed via the PTX parser into (symbol, value) pairs. After all the PTX is loaded, two functions, load_stat_globals(...) and load_constants(...), are called to assign each variable a memory address in the simulated global memory space and to copy the initial value to the assigned memory location. The two functions are located inside cuda_runtime_api.cc.
These variables can also be declared as __global__ in CUDA. In this case, they are accessible by both the host (CPU) and the device (GPU). CUDA accomplishes this by keeping two copies of the same variable, one in host memory and one in device memory. The linkage between the two copies is established using the function __cudaRegisterVar(...). GPGPU-Sim intercepts calls to this function to acquire this information and establishes a similar linkage by calling gpgpu_ptx_sim_register_const_variable(...) or gpgpu_ptx_sim_register_global_variable(...) (implemented in cuda-sim/cuda-sim.cc). With this linkage established, the host may call cudaMemcpyToSymbol() or cudaMemcpyFromSymbol() to access these __global__ variables. GPGPU-Sim implements these functions with gpgpu_ptx_memcpy_symbol(...) in cuda-sim/cuda-sim.cc.
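For instance, host code updating the variables declared in the earlier snippet would use the standard runtime calls, which GPGPU-Sim services through gpgpu_ptx_memcpy_symbol(...):

// Host-side usage sketch (assumes the declarations from the snippet above are in scope).
void copy_symbols_example() {
    int host_val = 42;
    cudaMemcpyToSymbol(y_c, &host_val, sizeof(host_val));    // host -> simulated device
    cudaMemcpyFromSymbol(&host_val, x_d, sizeof(host_val));  // simulated device -> host
}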
Notice that __cudaRegisterVar(...) is not part of CUDA Runtime API, and future versions of CUDA may implement __global__ variables in a different way. In that case, GPGPU-Sim will need to be modified to support the new implementation.
Kernel Launch: Parameter Hookup
Kernel parameters in GPGPU-Sim are set using the same methods as in regular CUDA and OpenCL applications:
kernel_name<<<x,y>>>(param1, param2, ..., paramN) and clSetKernelArg(kernel_name, arg_index, arg_size, arg_value)
respectively.
Another method of passing kernel arguments in CUDA is cudaSetupArgument(void* arg, size_t count, size_t offset). This function pushes count bytes of the argument pointed to by arg at offset bytes from the start of the parameter-passing area, which starts at offset 0. The arguments are stored at the top of the execution stack. For example, if a CUDA kernel takes three arguments a, b and c (in this order), the offset of a is 0, the offset of b is sizeof(a), and the offset of c is sizeof(a)+sizeof(b), as shown in the sketch below.
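The sketch below illustrates the offset computation using the (now deprecated) cudaConfigureCall()/cudaSetupArgument()/cudaLaunch() path for a hypothetical kernel my_kernel(int a, int b, float c); it targets the CUDA toolkits contemporary with GPGPU-Sim 3.x, and the exact cudaLaunch() signature varies across toolkit versions.

```cpp
#include <cuda_runtime.h>

// Hypothetical kernel used only to illustrate the argument offsets.
__global__ void my_kernel(int a, int b, float c) { /* ... */ }

int main() {
    int   a = 1, b = 2;
    float c = 3.0f;

    dim3 grid(1), block(32);
    cudaConfigureCall(grid, block, 0 /*sharedMem*/, 0 /*stream*/);
    cudaSetupArgument(&a, sizeof(a), 0);                       // a at offset 0
    cudaSetupArgument(&b, sizeof(b), sizeof(a));               // b at offset sizeof(a)
    cudaSetupArgument(&c, sizeof(c), sizeof(a) + sizeof(b));   // c at offset sizeof(a)+sizeof(b)
    // Older toolkits identify the kernel by name or by its host stub address.
    cudaLaunch(reinterpret_cast<const char *>(my_kernel));
    cudaDeviceSynchronize();
    return 0;
}
```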
For both CUDA and OpenCL, GPGPU-Sim creates a gpgpu_ptx_sim_arg object per kernel argument and maintains a list of all kernel arguments. Prior to executing the kernel, an initialization function, gpgpu_cuda_ptx_sim_init_grid(...) or gpgpu_opencl_ptx_sim_init_grid(...), is called to set up the GPU grid dimensions and parameters. Two main objects are used within these functions: the function_info and kernel_info_t objects described above. In the init_grid functions, the kernel arguments are added to the function_info object by calling function_info::add_param_data(arg #, gpgpu_ptx_sim_arg *).
After adding all of the parameters to the function_info object, function_info::finalize(...) is called, which copies over the kernel arguments, stored in the function_info::m_ptx_kernel_param_info map, into the parameter memory allocated in the kernel_info_t object mentioned above. If it was not done previously in function_info::add_param_data(...), the address of each kernel argument is added to the symbol table in the function_info object.
PTXPlus support requires copying kernel parameters to shared memory. The kernel parameters can be copied from Param memory to Shared memory by calling the function_info::param_to_shared(shared_mem_ptr, symbol_table) function. This function iterates over the kernel parameters stored in the function_info::m_ptx_kernel_param_info map and copies each parameter from Param memory to the appropriate location in Shared memory pointed to by ptx_thread_info::shared_mem_ptr.
The function_info::add_param_data(...), function_info::finalize(...), function_info::param_to_shared(...), and gpgpu_opencl_ptx_sim_init_grid(...) functions are defined in <gpgpu-sim_root>/distribution/src/cuda-sim/cuda-sim.cc. The gpgpu_cuda_ptx_sim_init_grid(...) function is implemented in <gpgpu-sim_root>/distribution/libcuda/cuda_runtime_api.cc.
Generic Memory Space
Generic addressing is a feature that was introduced in NVIDIA's PTX 2.0, with generic addressing supported by instructions ld, ldu, st, prefetch, prefetchu, isspacep, cvta, atom, and red. In generic addressing, an address maps to global memory unless it falls within the local memory window or the shared memory window. Within these windows, an address maps to the corresponding location in local or shared memory, i.e. to the address formed by subtracting the window base from the generic address to form the offset in the implied state space. So an instruction can use generic addressing to deal with addresses that may correspond to global, local or shared memory space.
Generic addressing in GPGPU-Sim is supported for the instructions ld, ldu, st, isspacep and cvta. The functions generic_to_{shared, local, global}, {shared, local, global}_to_generic and isspace_{shared, local, global} (all defined in cuda-sim.cc) are used to support generic addressing for these instructions.
The identifiers SHARED_GENERIC_START, SHARED_MEM_SIZE_MAX, LOCAL_GENERIC_START, TOTAL_LOCAL_MEM_PER_SM, LOCAL_MEM_SIZE_MAX, GLOBAL_HEAP_START and STATIC_ALLOC_LIMIT, defined in abstract_hardware_model.h, specify the boundaries (windows) of the different memory spaces. They are used to derive the different address spaces needed to support generic addressing.
The following table shows an example of how the spaces are defined in the code:
Identifier | Value |
---|---|
GLOBAL_HEAP_START | 0x80000000 |
SHARED_MEM_SIZE_MAX | 64*1024 |
LOCAL_MEM_SIZE_MAX | 8*1024 |
MAX_STREAMING_MULTIPROCESSORS | 64 |
MAX_THREAD_PER_SM | 2048 |
TOTAL_LOCAL_MEM_PER_SM | MAX_THREAD_PER_SM*LOCAL_MEM_SIZE_MAX |
TOTAL_SHARED_MEM | MAX_STREAMING_MULTIPROCESSORS*SHARED_MEM_SIZE_MAX |
TOTAL_LOCAL_MEM | MAX_STREAMING_MULTIPROCESSORS*MAX_THREAD_PER_SM*LOCAL_MEM_SIZE_MAX |
SHARED_GENERIC_START | GLOBAL_HEAP_START-TOTAL_SHARED_MEM |
LOCAL_GENERIC_START | SHARED_GENERIC_START-TOTAL_LOCAL_MEM |
STATIC_ALLOC_LIMIT | GLOBAL_HEAP_START - (TOTAL_LOCAL_MEM+TOTAL_SHARED_MEM) |
Notice that with this address space partitioning, each thread may only have up to 8kB of local memory (LOCAL_MEM_SIZE_MAX). With CUDA compute capability 1.3 and below, each thread can have up to 16kB of local memory. With CUDA compute capability 2.0, this limit has increased to 512kB [7]. The user may increase LOCAL_MEM_SIZE_MAX to support applications that require more than 8kB of local memory per thread. However, one should always ensure that GLOBAL_HEAP_START > (TOTAL_LOCAL_MEM + TOTAL_SHARED_MEM). Failure to do so may result in erroneous simulation behavior.
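The following self-contained sketch illustrates the window arithmetic implied by the table above; the classify_generic() helper and its return type are hypothetical simplifications (the real generic_to_*/isspace_* functions also account for the per-SM and per-thread layout of local memory).

```cpp
#include <cstdio>

// Illustrative constants taken from the table above.
const unsigned long long GLOBAL_HEAP_START    = 0x80000000ULL;
const unsigned long long SHARED_MEM_SIZE_MAX  = 64 * 1024;
const unsigned long long LOCAL_MEM_SIZE_MAX   = 8 * 1024;
const unsigned long long MAX_SM               = 64;     // MAX_STREAMING_MULTIPROCESSORS
const unsigned long long MAX_THREAD_PER_SM    = 2048;
const unsigned long long TOTAL_SHARED_MEM     = MAX_SM * SHARED_MEM_SIZE_MAX;
const unsigned long long TOTAL_LOCAL_MEM      = MAX_SM * MAX_THREAD_PER_SM * LOCAL_MEM_SIZE_MAX;
const unsigned long long SHARED_GENERIC_START = GLOBAL_HEAP_START - TOTAL_SHARED_MEM;
const unsigned long long LOCAL_GENERIC_START  = SHARED_GENERIC_START - TOTAL_LOCAL_MEM;

enum space_kind { SPACE_GLOBAL, SPACE_SHARED, SPACE_LOCAL };

// Hypothetical helper: classify a generic address and return the offset
// within the implied space (addresses outside both windows map to global).
space_kind classify_generic(unsigned long long addr, unsigned long long &offset) {
    if (addr >= SHARED_GENERIC_START && addr < SHARED_GENERIC_START + TOTAL_SHARED_MEM) {
        offset = addr - SHARED_GENERIC_START;   // inside the shared-memory window
        return SPACE_SHARED;
    }
    if (addr >= LOCAL_GENERIC_START && addr < LOCAL_GENERIC_START + TOTAL_LOCAL_MEM) {
        offset = addr - LOCAL_GENERIC_START;    // inside the local-memory window
        return SPACE_LOCAL;
    }
    offset = addr;                              // everything else is global memory
    return SPACE_GLOBAL;
}

int main() {
    unsigned long long off;
    space_kind kind = classify_generic(SHARED_GENERIC_START + 0x100, off);
    std::printf("kind=%d offset=0x%llx\n", kind, off);  // shared space, offset 0x100
    return 0;
}
```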
For more information about generic addressing, refer to NVIDIA's PTX: Parallel Thread Execution ISA Version 2.0 manual [8].
Instruction Execution
After parsing, instructions used for functional execution are represented as ptx_instruction objects contained within a function_info object (see cuda-sim/ptx_ir.{h,cc}). Each scalar thread is represented by a ptx_thread_info object. Executing an instruction functionally is mainly accomplished by calling ptx_thread_info::ptx_exec_inst().
The abstract class core_t contains the most basic data structures and procedures required for functional instruction execution. It is the base class of shader_core_ctx and functionalCoreSim, which are used for performance simulation and pure functional simulation respectively. The most important members of core_t are objects of types simt_stack and ptx_thread_info, which are used during functional execution to keep track of each warp's branch divergence and to handle the execution of each thread's instructions.
Instruction execution starts by initializing the scalar threads using the function ptx_sim_init_thread() (in cuda-sim.cc); the scalar threads are then executed warp by warp using ptx_thread_info::ptx_exec_inst(). Threads are tracked as warps using one simt_stack object per warp of scalar threads (this is the model assumed here; other models could be used instead). The simt_stack indicates which threads are active and which instruction to execute in each cycle, allowing the scalar threads to be executed as warps.
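The toy program below illustrates this warp-level execution model; the types and names are simplified stand-ins, not the real simt_stack/ptx_thread_info classes.

```cpp
#include <bitset>
#include <cstdio>
#include <stack>
#include <vector>

// Toy illustration (not GPGPU-Sim code): the top of a per-warp SIMT stack
// supplies the PC and the active mask, and only the active lanes execute
// the instruction functionally.
constexpr unsigned WARP_SIZE = 32;

struct simt_entry {
    unsigned pc;                      // instruction to execute for this entry
    std::bitset<WARP_SIZE> active;    // lanes that execute at this PC
};

struct toy_thread {
    unsigned tid;
    void exec_inst(unsigned pc) {     // stand-in for ptx_thread_info::ptx_exec_inst()
        std::printf("thread %2u executes the instruction at pc=%u\n", tid, pc);
    }
};

int main() {
    std::vector<toy_thread> warp(WARP_SIZE);
    for (unsigned i = 0; i < WARP_SIZE; ++i) warp[i].tid = i;

    std::stack<simt_entry> simt_stack;
    simt_entry top;
    top.pc = 0;
    top.active.set();                 // all 32 lanes active before any divergence
    simt_stack.push(top);

    // Execute one warp instruction: only lanes active on the stack top run it.
    const simt_entry &e = simt_stack.top();
    for (unsigned lane = 0; lane < WARP_SIZE; ++lane)
        if (e.active.test(lane))
            warp[lane].exec_inst(e.pc);
    return 0;
}
```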
ptx_thread_info::ptx_exec_inst() is where instructions are actually executed functionally. The instruction opcode is checked and the corresponding function is called; the file opcodes.def lists the functions used to execute each instruction. Every instruction function takes two parameters of types ptx_instruction and ptx_thread_info, which hold the data for the instruction and for the executing thread respectively.
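The following self-contained toy shows this dispatch pattern in spirit: an X-macro list of opcodes generates the opcode enumeration, one handler per opcode, and the dispatch switch, with each handler taking the instruction and the executing thread. The macro, handler names and types here are invented for illustration and do not match opcodes.def exactly.

```cpp
#include <cstdio>

// Toy X-macro list of opcodes and their handler functions.
#define TOY_OPCODES(X) \
    X(ADD, add_impl)   \
    X(MUL, mul_impl)   \
    X(LD,  ld_impl)

struct toy_instruction { int opcode; };
struct toy_thread      { unsigned tid; };

enum toy_opcode {
#define ENUM_ENTRY(op, fn) OP_##op,
    TOY_OPCODES(ENUM_ENTRY)
#undef ENUM_ENTRY
};

// Each handler takes the instruction and the executing thread, mirroring the
// (ptx_instruction, ptx_thread_info) pair described above.
#define DEFINE_HANDLER(op, fn)                                       \
    void fn(const toy_instruction &inst, toy_thread &thread) {       \
        (void)inst;                                                  \
        std::printf("thread %u: executing %s\n", thread.tid, #op);   \
    }
TOY_OPCODES(DEFINE_HANDLER)
#undef DEFINE_HANDLER

void exec_inst(const toy_instruction &inst, toy_thread &thread) {
    switch (inst.opcode) {
#define CASE_ENTRY(op, fn) case OP_##op: fn(inst, thread); break;
        TOY_OPCODES(CASE_ENTRY)
#undef CASE_ENTRY
    }
}

int main() {
    toy_thread t;      t.tid = 0;
    toy_instruction i; i.opcode = OP_MUL;
    exec_inst(i, t);   // prints: thread 0: executing MUL
    return 0;
}
```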
Information is communicated back from ptx_exec_inst() to the function that executes the warp by modifying the warp_inst_t parameter that is passed to ptx_exec_inst() by reference. For atomic operations, the executed warp instruction is marked as atomic and a callback is registered with the warp_inst_t; the warp-execution function then checks the atomic flag and invokes the callbacks that actually perform the atomic operations (see functionalCoreSim::executeWarp() in cuda-sim.cc).
As one might expect, more information is communicated back in performance simulation than in pure functional simulation. For further details on functional execution, see the pure functional path implemented by the functionalCoreSim class in cuda-sim{.h,.cc}.
Interface to Source Code View in AerialVision
The Source Code View in AerialVision is a view where different kinds of metrics can be plotted against PTX source code. For example, one could plot the DRAM traffic generated by each line of PTX source code. GPGPU-Sim exports the statistics needed to construct the Source Code View into statistics files that are read by AerialVision.
If the options "-enable_ptx_file_line_stats 1" and "-visualizer_enabled 1" are set, GPGPU-Sim will save these statistics to a file. The name of the file can be specified using the option "-ptx_line_stats_filename filename".
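For example, the relevant entries in the GPGPU-Sim configuration file might look like the following (the output file name shown here is just an example):

```
-enable_ptx_file_line_stats 1
-visualizer_enabled 1
-ptx_line_stats_filename gpgpu_inst_stats.txt
```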
For each line in the executed ptx files, one line is added to the line stats file in the following format:
kernel line : count latency dram_traffic smem_bk_conflicts smem_warp gmem_access_generated gmem_warp exposed_latency warp_divergence
Using AerialVision, the statistics collected through this interface can be plotted and viewed in different ways. AerialVision can also map these statistics back to CUDA C++ source files. Please refer to the AerialVision manual for more details.
This functionality is implemented in src/cuda-sim/ptx-stats.h(.cc). The statistics for each PTX line are held in an instance of the class ptx_file_line_stats. The function void ptx_file_line_stats_write_file() is responsible for printing the statistics to the statistics file in the above format. A number of other functions, such as void ptx_file_line_stats_add_dram_traffic(unsigned pc, unsigned dram_traffic), are called by the rest of the simulator to record different statistics about PTX source code lines.
Pure Functional Simulation
Pure functional simulation (bypassing performance simulation) is implemented in cuda-sim{.h,.cc}, in the function gpgpu_cuda_ptx_sim_main_func(...), using the functionalCoreSim class. The functionalCoreSim class derives from the abstract class core_t, which contains many of the functional-simulation data structures and procedures used by both pure functional simulation and performance simulation.
Interface with the outside world
GPGPU-Sim is compiled into stub libraries that are dynamically linked to the CUDA/OpenCL application at runtime. These libraries intercept the calls intended for the CUDA/OpenCL runtime environment, initialize the simulator, and run the kernels on the simulator instead of on the hardware.
Entry Point and Stream Manager
GPGPU-Sim is initialized by the function GPGPUSim_Init(), which is called when the CUDA or OpenCL application performs its first CUDA/OpenCL API call. Our implementations of the CUDA/OpenCL API functions either call GPGPUSim_Init() directly or call GPGPUSim_Context(), which in turn calls GPGPUSim_Init(). An example of an API call that calls GPGPUSim_Context() is cudaMalloc(). Note that, through the use of static variables, the initialization inside GPGPUSim_Init() is not repeated every time cudaMalloc() is called.
The first call to GPGPUSim_Init() calls the function gpgpu_ptx_sim_init_perf() located in gpgpusim_entrypoint.cc. Inside gpgpu_ptx_sim_init_perf(), all environment variables, command-line parameters and configuration files are processed. Based on these options, a gpgpu_sim object is instantiated and assigned to the global variable g_the_gpu, and a stream_manager object is instantiated and assigned to the global variable g_stream_manager.
GPGPUSim_Init() also calls the start_sim_thread() function located in gpgpusim_entrypoint.cc. start_sim_thread() starts a new pthread that is responsible for running the simulation. For OpenCL applications, the simulator pthread runs gpgpu_sim_thread_sequential(), which simulates the execution of kernels one at a time. For CUDA applications, the simulator pthread runs gpgpu_sim_thread_concurrent(), which simulates the concurrent execution of multiple kernels. The maximum number of kernels that may execute concurrently on the simulated GPU is configured by the option '-gpgpu_max_concurrent_kernel'.
gpgpu_sim_thread_sequential() will wait for a start signal to start the simulation (g_sim_signal_start). The start signal is set by the gpgpu_opencl_ptx_sim_main_perf() function used to start OpenCL performance simulation.
gpgpu_sim_thread_concurrent() initializes the performance-simulator structures once and then enters a loop, waiting for a job from the stream manager (implemented by the class stream_manager in stream_manager.h/cc). The stream manager itself receives jobs from CUDA API calls such as cudaStreamCreate() and cudaMemcpy(), and from kernel launches.
CUDA runtime library (libcudart)
When building a CUDA application, NVIDIA's nvcc translates each kernel launch into a series of calls to the CUDA Runtime API, which prepares the GPU hardware for kernel execution. libcudart.so is the library provided by NVIDIA that implements this API. In order for GPGPU-Sim to intercept those calls and run the kernels on the simulator, GPGPU-Sim implements this library as well. The implementation resides in libcuda/cuda_runtime_api.cc, and the resulting shared object resides in <gpgpu-sim_root>/lib/<build_type>/libcudart.so. By including this path in your LD_LIBRARY_PATH, you instruct your system to dynamically link against GPGPU-Sim at runtime instead of the NVIDIA-provided library, thus allowing the simulator to run your code. Setting LD_LIBRARY_PATH should be done through the setup_environment script, as instructed in the README file distributed with GPGPU-Sim. Our implementation of libcudart is not compatible with all versions of CUDA, because NVIDIA uses different interfaces in different versions.
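Conceptually, every intercepted entry point follows the same pattern: make sure the simulator is initialized, then service the request against the simulated GPU. The toy stub below illustrates only the idea; the type and helper names are stand-ins, and the real cudaMalloc() implementation in libcuda/cuda_runtime_api.cc is considerably more involved.

```cpp
#include <cstddef>
#include <cstdio>
#include <cstdlib>

// Toy sketch of the interception idea (not the real libcuda code): the stub
// library exports the same symbol as NVIDIA's libcudart, so the application's
// call resolves to this function when LD_LIBRARY_PATH points at GPGPU-Sim.
typedef int toy_cudaError_t;                 // stand-in for cudaError_t
static const toy_cudaError_t TOY_SUCCESS = 0;

static bool g_sim_initialized = false;

static void toy_GPGPUSim_Init() {            // stand-in for GPGPUSim_Init()
    if (!g_sim_initialized) {
        std::printf("first API call: initializing the simulator\n");
        g_sim_initialized = true;
    }
}

extern "C" toy_cudaError_t cudaMalloc(void **devPtr, std::size_t size) {
    toy_GPGPUSim_Init();                     // the first API call initializes GPGPU-Sim
    *devPtr = std::malloc(size);             // toy stand-in for allocating simulated global memory
    return TOY_SUCCESS;
}
```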
OpenCL library (libopencl)
Similar to libcudart described above, libopencl is a library included with GPGPU-Sim that implements the OpenCL API normally provided by libOpenCL.so. GPGPU-Sim currently supports OpenCL v1.1. OpenCL function calls are intercepted by GPGPU-Sim and handled by the simulator instead of the physical hardware. The resulting shared object resides in <gpgpu-sim_root>/lib/<build_type>/libOpenCL.so. By including this path in your LD_LIBRARY_PATH, you instruct your system to dynamically link against GPGPU-Sim at runtime instead of the NVIDIA-provided library, thus allowing the simulator to run your code. Setting LD_LIBRARY_PATH should be done through the setup_environment script, as instructed in the README file in the v3.x directory.
Since GPGPU-Sim executes PTX, OpenCL kernels must be compiled and converted into PTX. This is handled by nvopencl_wrapper.cc (found in <gpgpu-sim_root>/distribution/libopencl/). The OpenCL kernel is passed to the nvopencl_wrapper, compiled using the standard OpenCL clCreateProgramWithSource(...) and clBuildProgram(...) functions, converted into PTX and stored in a temporary PTX file (_ptx_XXXXXX), which is then read into GPGPU-Sim. Compiling OpenCL applications requires a physical device capable of supporting OpenCL, so it may be necessary to perform the compilation on a remote system containing such a device. GPGPU-Sim supports this through the OPENCL_REMOTE_GPU_HOST environment variable. If necessary, the compilation and conversion to PTX are performed on the remote system, and the resulting PTX files are returned to the local system to be read into GPGPU-Sim.
The following table provides a list of OpenCL functions currently implemented in GPGPU-Sim. See the OpenCL specifications document for more details on the behaviour of these functions. The OpenCL API implementation for GPGPU-Sim can be found in <gpgpu-sim_root>/distribution/libopencl/opencl_runtime_api.cc.
OpenCL API |
---|
clCreateContextFromType(cl_context_properties *properties, cl_ulong device_type, void (*pfn_notify)(const char *, const void *, size_t, void *), void * user_data, cl_int * errcode_ret) |
clCreateContext( const cl_context_properties * properties, cl_uint num_devices, const cl_device_id *devices, void (*pfn_notify)(const char *, const void *, size_t, void *), void * user_data, cl_int * errcode_ret) |
clGetContextInfo(cl_context context, cl_context_info param_name, size_t param_value_size, void * param_value, size_t * param_value_size_ret ) |
clCreateCommandQueue(cl_context context, cl_device_id device, cl_command_queue_properties properties, cl_int * errcode_ret) |
clCreateBuffer(cl_context context, cl_mem_flags flags, size_t size , void * host_ptr, cl_int * errcode_ret ) |
clCreateProgramWithSource(cl_context context, cl_uint count, const char ** strings, const size_t * lengths, cl_int * errcode_ret) |
clBuildProgram(cl_program program, cl_uint num_devices, const cl_device_id * device_list, const char * options, void (*pfn_notify)(cl_program /* program */, void * /* user_data */), void * user_data ) |
clCreateKernel(cl_program program, const char * kernel_name, cl_int * errcode_ret) |
clSetKernelArg(cl_kernel kernel, cl_uint arg_index, size_t arg_size, const void * arg_value ) |
clEnqueueNDRangeKernel(cl_command_queue command_queue, cl_kernel kernel, cl_uint work_dim, const size_t * global_work_offset, const size_t * global_work_size, const size_t * local_work_size, cl_uint num_events_in_wait_list, const cl_event * event_wait_list, cl_event * event) |
clEnqueueReadBuffer(cl_command_queue command_queue, cl_mem buffer, cl_bool blocking_read, size_t offset, size_t cb, void * ptr, cl_uint num_events_in_wait_list, const cl_event * event_wait_list, cl_event * event ) |
clEnqueueWriteBuffer(cl_command_queue command_queue, cl_mem buffer, cl_bool blocking_write, size_t offset, size_t cb, const void * ptr, cl_uint num_events_in_wait_list, const cl_event * event_wait_list, cl_event * event ) |
clReleaseMemObject(cl_mem /* memobj */) |
clReleaseKernel(cl_kernel /* kernel */) |
clReleaseProgram(cl_program /* program */) |
clReleaseCommandQueue(cl_command_queue /* command_queue */) |
clReleaseContext(cl_context /* context */) |
clGetPlatformIDs(cl_uint num_entries, cl_platform_id *platforms, cl_uint *num_platforms ) |
clGetPlatformInfo(cl_platform_id platform, cl_platform_info param_name, size_t param_value_size, void * param_value, size_t * param_value_size_ret ) |
clGetDeviceIDs(cl_platform_id platform, cl_device_type device_type, cl_uint num_entries, cl_device_id * devices, cl_uint * num_devices ) |
clGetDeviceInfo(cl_device_id device, cl_device_info param_name, size_t param_value_size, void * param_value, size_t * param_value_size_ret) |
clFinish(cl_command_queue /* command_queue */) |
clGetProgramInfo(cl_program program, cl_program_info param_name, size_t param_value_size, void * param_value, size_t * param_value_size_ret ) |
clEnqueueCopyBuffer(cl_command_queue command_queue, cl_mem src_buffer, cl_mem dst_buffer, size_t src_offset, size_t dst_offset, size_t cb, cl_uint num_events_in_wait_list, const cl_event * event_wait_list, cl_event * event ) |
clGetKernelWorkGroupInfo(cl_kernel kernel, cl_device_id device, cl_kernel_work_group_info param_name, size_t param_value_size, void * param_value, size_t * param_value_size_ret ) |
clWaitForEvents(cl_uint /* num_events */, const cl_event * /* event_list */) |
clReleaseEvent(cl_event /* event */) |
clGetCommandQueueInfo(cl_command_queue command_queue, cl_command_queue_info param_name, size_t param_value_size, void * param_value, size_t * param_value_size_ret ) |
clFlush(cl_command_queue /* command_queue */) |
clGetSupportedImageFormats(cl_context context, cl_mem_flags flags, cl_mem_object_type image_type, cl_uint num_entries, cl_image_format * image_formats, cl_uint * num_image_formats) |
clEnqueueMapBuffer(cl_command_queue command_queue, cl_mem buffer, cl_bool blocking_map, cl_map_flags map_flags, size_t offset, size_t cb, cl_uint num_events_in_wait_list, const cl_event * event_wait_list, cl_event * event, cl_int * errcode_ret ) |
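As a usage reference, the following minimal host program exercises a subset of the calls listed above (the kernel source, buffer size and scale factor are arbitrary, and error checking is omitted for brevity):

```cpp
#include <cstdio>
#include <CL/cl.h>

static const char *src =
    "__kernel void scale(__global float *x, float f) {"
    "    int i = get_global_id(0); x[i] *= f; }";

int main() {
    float data[64];
    for (int i = 0; i < 64; ++i) data[i] = (float)i;

    cl_platform_id platform; clGetPlatformIDs(1, &platform, NULL);
    cl_device_id   device;   clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);
    cl_int err;
    cl_context       ctx   = clCreateContext(NULL, 1, &device, NULL, NULL, &err);
    cl_command_queue queue = clCreateCommandQueue(ctx, device, 0, &err);

    cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,
                                sizeof(data), data, &err);

    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, &err);
    clBuildProgram(prog, 1, &device, NULL, NULL, NULL);
    cl_kernel kernel = clCreateKernel(prog, "scale", &err);

    float factor = 2.0f;
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &buf);
    clSetKernelArg(kernel, 1, sizeof(float), &factor);

    size_t global = 64;
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, NULL, 0, NULL, NULL);
    clEnqueueReadBuffer(queue, buf, CL_TRUE, 0, sizeof(data), data, 0, NULL, NULL);
    clFinish(queue);
    std::printf("data[3] = %f\n", data[3]);   // expect 6.0

    clReleaseMemObject(buf); clReleaseKernel(kernel); clReleaseProgram(prog);
    clReleaseCommandQueue(queue); clReleaseContext(ctx);
    return 0;
}
```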