The Go Blog
Getting to Go: The Journey of Go's Garbage Collector
This is the transcript from the keynote I gave at the International Symposium
on Memory Management (ISMM) on June 18, 2018.
For the past 25 years ISMM has been the premier venue for publishing memory
management and garbage collection papers and it was an honor to have been
invited to give the keynote.
Abstract
The Go language features, goals, and use cases have forced us to rethink
the entire garbage collection stack and have led us to a surprising place.
The journey has been exhilarating. This talk describes our journey.
It is a journey motivated by open source and Google’s production demands.
Included are side hikes into dead end box canyons where numbers guided us home.
This talk will provide insight into the how and the why of our journey,
where we are in 2018, and Go’s preparation for the next part of the journey.
Bio
Richard L. Hudson (Rick) is best known for his work in memory management
including the invention of the Train,
Sapphire, and Mississippi Delta algorithms as well as GC stack maps which
enabled garbage collection in statically typed languages such as Modula-3, Java, C#, and Go.
Rick is currently a member of Google’s Go team where he is working on Go’s
garbage collection and runtime issues.
Contact: rlh@golang.org
Comments: See the discussion on golang-dev.
The Transcript
Rick Hudson here.
This is a talk about the Go runtime and in particular the garbage collector.
I have about 45 or 50 minutes of prepared material and after that we will
have time for discussion and I’ll be around so feel free to come up afterwards.
Before I get started I want to acknowledge some people.
A lot of the good stuff in the talk was done by Austin Clements.
Other people on the Cambridge Go team, Russ,
Than, Cherry, and David have been an engaging,
exciting, and fun group to work with.
We also want to thank the 1.6 million Go users worldwide for giving us interesting problems to solve.
Without them a lot of these problems would never come to light.
And finally I want to acknowledge Renee French for all these nice Gophers
that she has been producing over the years.
You will see several of them throughout the talk.
Before we get revved up and going on this stuff we really have to show what GC’s view of Go looks like.
Well first of all Go programs have hundreds of thousands of stacks.
They are managed by the Go scheduler and are always preempted at GC safepoints.
The Go scheduler multiplexes goroutines onto OS threads, which hopefully
run with one OS thread per hardware thread.
We manage the stacks and their size by copying them and updating pointers in the stack.
It’s a local operation so it scales fairly well.
The next thing that is important is the fact that Go is a value-oriented
language in the tradition of C-like systems languages rather than a reference-oriented
language in the tradition of most managed runtime languages.
For example, this shows how a type from the tar package is laid out in memory.
All of the fields are embedded directly in the Reader value.
This gives programmers more control over memory layout when they need it.
One can collocate fields that have related values which helps with cache locality.
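To make that concrete, here is a hedged sketch of the difference; these types are simplified stand-ins for illustration, not the real archive/tar definitions.

    // A simplified stand-in, not the actual tar package types.
    type Header struct {
        Name string
        Size int64
    }

    // Value-oriented: the Header is embedded directly in the Reader,
    // so one Reader is a single contiguous block of memory and
    // related fields share cache lines.
    type Reader struct {
        hdr Header
        off int64
    }

    // Reference-oriented (the style of most managed runtimes): the
    // Header would be a separate heap object behind a pointer.
    type refReader struct {
        hdr *Header
        off int64
    }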
Value-orientation also helps with the foreign function interfaces.
We have a fast FFI with C and C++. Obviously Google has a tremendous number
of facilities available but they are written in C++.
Go couldn’t wait to reimplement all of these things in Go so Go had to have
access to these systems through the foreign function interface.
This one design decision has led to some of the more amazing things that
have to go on with the runtime.
It is probably the most important thing that differentiates Go from other GCed languages.
Of course Go can have pointers and in fact they can have interior pointers.
Such pointers keep the entire value live and they are fairly common.
We also have an ahead-of-time compilation system so the binary contains the entire runtime.
There is no JIT recompilation. There are pluses and minuses to this.
First of all, reproducibility of program execution is a lot easier which
makes moving forward with compiler improvements much faster.
On the sad side of it we don’t have the chance to do feedback optimizations as you would with a JITed system.
So there are pluses and minuses.
Go comes with two knobs to control the GC.
The first one is GCPercent. Basically this is a knob that adjusts how much
CPU you want to use and how much memory you want to use.
The default is 100 which means that half the heap is dedicated to live memory
and half the heap is dedicated to allocation.
You can modify this in either direction.
MaxHeap, which is not yet released but is being used and evaluated internally,
lets the programmer set what the maximum heap size should be.
Out of memory, OOMs, are tough on Go; temporary spikes in memory usage should
be handled by increasing CPU costs, not by aborting.
Basically if the GC sees memory pressure it informs the application that
it should shed load.
Once things are back to normal the GC informs the application that it can
go back to its regular load.
MaxHeap also provides a lot more flexibility in scheduling.
Instead of always being paranoid about how much memory is available the
runtime can size the heap up to the MaxHeap.
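From user code, the released knob looks like this; GOGC is the environment variable form and debug.SetGCPercent the programmatic form. This is a minimal sketch of its use, not a recommendation for any particular value.

    package main

    import "runtime/debug"

    func main() {
        // GOGC=100 is the default: the heap may grow to twice the
        // live set before the next collection starts. Lowering it
        // trades CPU for a smaller heap; raising it does the reverse.
        old := debug.SetGCPercent(50)
        defer debug.SetGCPercent(old)

        // MaxHeap, the second knob discussed above, had not shipped;
        // it was only being used and evaluated internally, so no
        // public API appears here.
    }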
This wraps up our discussion on the pieces of Go that are important to the garbage collector.
So now let’s talk about the Go runtime and how we got to where we are.
So it’s 2014. If Go does not solve this GC latency problem somehow then
Go isn’t going to be successful. That was clear.
Other new languages were facing the same problem.
Languages like Rust went a different way but we are going to talk about
the path that Go took.
Why is latency so important?
The math is completely unforgiving on this.
A 99%ile isolated GC latency service level objective (SLO),
such as 99% of the time a GC cycle takes < 10ms,
just simply doesn’t scale.
What matters is latency during an entire session or through the course of
using an app many times in a day.
Assume a session that browses several web pages ends up making 100 server
requests during a session or it makes 20 requests and you have 5 sessions
packed up during the day.
In that situation only 37% of users will have a consistent sub 10ms experience
across the entire session.
If you want 99% of those users to have a sub 10ms experience,
as we are suggesting, the math says you really need to target 4 9s or the 99.99%ile.
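A quick check of that arithmetic, using the 100-requests-per-session figure from the example above:

    package main

    import (
        "fmt"
        "math"
    )

    func main() {
        // Chance that all 100 requests in a session individually meet
        // the latency objective, given per-request probability p.
        session := func(p float64) float64 { return math.Pow(p, 100) }

        fmt.Printf("%.0f%%\n", 100*session(0.99))   // ~37% of sessions
        fmt.Printf("%.0f%%\n", 100*session(0.9999)) // ~99% of sessions
    }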
So it’s 2014 and Jeff Dean had just come out with his paper called ‘The
Tail at Scale’ which digs into this further.
It was being widely read around Google since it had serious ramifications
for Google going forward and trying to scale at Google scale.
We call this problem the tyranny of the 9s.
So how do you fight the tyranny of the 9s?
A lot of things were being done in 2014.
If you want 10 answers ask for several more and take the first 10 and those
are the answers you put on your search page.
If the request exceeds 50%ile reissue or forward the request to another server.
If GC is about to run, refuse new requests or forward the requests to another
server until GC is done.
And so forth and so on.
All these workarounds come from very clever people with very real problems
but they didn’t tackle the root problem of GC latency.
At Google scale we had to tackle the root problem. Why?
Redundancy wasn’t going to scale, redundancy costs a lot. It costs new server farms.
We hoped we could solve this problem and saw it as an opportunity to improve
the server ecosystem and in the process save some of the endangered corn
fields and give some kernel of corn the chance to be knee high by the fourth
of July and reach its full potential.
So here is the 2014 SLO. Yes, it was true that I was sandbagging,
I was new on the team, it was a new process to me,
and I didn’t want to over promise.
Furthermore presentations about GC latency in other languages were just plain scary.
The original plan was to do a read-barrier-free concurrent copying GC.
That was the long term plan. There was a lot of uncertainty about the overhead
of read barriers so Go wanted to avoid them.
But short term 2014 we had to get our act together.
We had to convert all of the runtime and compiler to Go.
They were written in C at the time. No more C,
no long tail of bugs due to C coders not understanding GC but having a cool
idea about how to copy strings.
We also needed something quickly and focused on latency but the performance
hit had to be less than the speedups provided by the compiler.
So we were limited. We had basically a year of compiler performance improvements
that we could eat up by making the GC concurrent.
But that was it. We couldn’t slow down Go programs.
That would have been untenable in 2014.
So we backed off a bit. We weren’t going to do the copying part.
The decision was to do a tri-color concurrent algorithm.
Earlier in my career Eliot Moss and I had done the journal proofs showing
that Dijkstra’s algorithm worked with multiple application threads.
We also showed we could knock off the STW problems,
and we had proofs that it could be done.
We were also concerned about compiler speed,
that is the code the compiler generated.
If we kept the write barrier turned off most of the time the compiler optimizations
would be minimally impacted and the compiler team could move forward rapidly.
Go also desperately needed short term success in 2015.
So let’s look at some of the things we did.
We went with a size segregated span. Interior pointers were a problem.
The garbage collector needs to efficiently find the start of the object.
If it knows the size of the objects in a span it simply rounds down to that
size and that will be the start of the object.
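In code the rounding is just this; a minimal sketch with illustrative names, not the runtime’s actual internals.

    // objectBase recovers the start of an object from an interior
    // pointer p, given a span's base address and its fixed element
    // size. All objects in a span share one size, so rounding the
    // offset down to a multiple of that size finds the object start.
    func objectBase(spanBase, elemSize, p uintptr) uintptr {
        return spanBase + (p-spanBase)/elemSize*elemSize
    }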
Of course size segregated spans have some other advantages.
Low fragmentation: Experience with C, besides Google’s TCMalloc and Hoard,
I was intimately involved with Intel’s Scalable Malloc and that work gave
us confidence that fragmentation was not going to be a problem with non-moving allocators.
Internal structures: We fully understood and had experience with them.
We understood how to do size segregated spans,
we understood how to do low or zero contention allocation paths.
Speed: Non-copy did not concern us, allocation admittedly might be slower
but still in the order of C.
It might not be as fast as bump pointer but that was OK.
We also had this foreign function interface issue.
If we didn’t move our objects then we didn’t have to deal with the long
tail of bugs you might encounter if you had a moving collector as you attempt
to pin objects and put levels of indirection between C and the Go object
you are working with.
The next design choice was where to put the object’s metadata.
We needed to have some information about the objects since we didn’t have headers.
Mark bits are kept on the side and used for marking as well as allocation.
Each word has 2 bits associated with it to tell you if it was a scalar or
a pointer inside that word.
It also encoded whether there were more pointers in the object so we could
stop scanning objects sooner than later.
We also had an extra bit encoding that we could use as an extra mark bit
or to do other debugging things.
This was really valuable for getting this stuff running and finding bugs.
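A hedged sketch of such a side table follows; the real runtime’s encoding differs in detail, but the shape is two metadata bits per heap word kept off to the side.

    // Two bits per heap word: one says whether the word holds a
    // pointer, one whether there are more pointers further into the
    // object (so scanning can stop early). Purely illustrative.
    type sideBits struct {
        bits []byte // 4 words described per byte
    }

    func (s sideBits) isPointer(word int) bool {
        return s.bits[word/4]>>(uint(word%4)*2)&1 == 1
    }

    func (s sideBits) morePointers(word int) bool {
        return s.bits[word/4]>>(uint(word%4)*2+1)&1 == 1
    }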
So what about write barriers? The write barrier is on only during the GC.
At other times the compiled code loads a global variable and looks at it.
Since the GC was typically off the hardware correctly speculates to branch
around the write barrier.
When we are inside the GC that variable is different,
and the write barrier is responsible for ensuring that no reachable objects
get lost during the tri-color operations.
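Roughly, what the compiled code does looks like this; a minimal sketch where writeBarrierEnabled and shade are illustrative stand-ins for the runtime’s internals.

    package main

    import "unsafe"

    // writeBarrierEnabled mirrors the global the compiled code loads;
    // it is set only while a collection is running.
    var writeBarrierEnabled bool

    // shade is a stand-in for the runtime's marking work.
    func shade(p unsafe.Pointer) {}

    // writePointer sketches what the compiler emits around a pointer
    // store: the flag is usually false, so the hardware predicts the
    // branch around the barrier, as described above.
    func writePointer(slot *unsafe.Pointer, ptr unsafe.Pointer) {
        if writeBarrierEnabled {
            shade(ptr) // keep reachable objects from being lost
        }
        *slot = ptr
    }

    func main() {}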
The other piece of this code is the GC Pacer.
It is some of the great work that Austin did.
It is basically based on a feedback loop that determines when to best start a GC cycle.
If the system is in a steady state and not in a phase change,
marking will end just about the time memory runs out.
That might not be the case so the Pacer also has to monitor the marking
progress and ensure allocation doesn’t overrun the concurrent marking.
If need be, the Pacer slows down allocation while speeding up marking.
At a high level the Pacer stops the Goroutine,
which is doing a lot of the allocation, and puts it to work doing marking.
The amount of work is proportional to the Goroutine’s allocation.
This speeds up the garbage collector while slowing down the mutator.
When all of this is done the Pacer takes what it has learnt from this GC
cycle as well as previous ones and projects when to start the next GC.
It does much more than this but that is the basic approach.
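A hedged sketch of the assist mechanism just described; assistRatio and the helpers are illustrative stand-ins, not the Pacer’s real state.

    // assistRatio is illustrative: marking work owed per byte
    // allocated while a collection is active. The real Pacer derives
    // it from its feedback loop each cycle.
    var assistRatio = 1.0

    func doMarkWork(bytes int)  {} // stand-in for scanning and marking
    func allocate(size uintptr) {} // stand-in for the allocator

    // mallocSketch: an allocating goroutine pays for its allocation
    // with a proportional amount of marking work, slowing the mutator
    // while speeding up the collector.
    func mallocSketch(size uintptr, gcActive bool) {
        if gcActive {
            doMarkWork(int(float64(size) * assistRatio))
        }
        allocate(size)
    }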
The math is absolutely fascinating, ping me for the design docs.
If you are doing a concurrent GC you really owe it to yourself to look at
this math and see if it’s the same as your math.
If you have any suggestions let us know.
(See Go 1.5 concurrent garbage collector pacing and Proposal: Separate soft and hard heap size goal.)
Yes, so we had successes, lots of them. A younger crazier Rick would have
taken some of these graphs and tattooed them on my shoulder, I was so proud of them.
This is a series of graphs that was done for a production server at Twitter.
We of course had nothing to do with that production server.
Brian Hatfield did these measurements and oddly enough tweeted about them.
On the Y axis we have GC latency in milliseconds.
On the X axis we have time. Each of the points is a stop the world pause
time during that GC.
On our first release, which was in August of 2015,
we saw a drop from around 300 - 400 milliseconds down to 30 or 40 milliseconds.
This was good, order of magnitude good.
We are going to change the Y-axis here radically from 0 to 400 milliseconds down to 0 to 50 milliseconds.
This is 6 months later. The improvement was largely due to systematically
eliminating all the O(heap) things we were doing during the stop the world time.
This was our second order of magnitude improvement as we went from 40 milliseconds down to 4 or 5.
There were some bugs in there that we had to clean up and we did this during
a minor release 1.6.3.
This dropped latency down to well under 10 milliseconds, which was our SLO.
We are about to change our Y-axis again, this time down to 0 to 5 milliseconds.
So here we are, this is August of 2016, a year after the first release.
Again we kept knocking off these O(heap size) stop the world processes.
We are talking about an 18Gbyte heap here.
We had much larger heaps and as we knocked off these O(heap size) stop the world pauses,
the size of the heap could obviously grow considerably without impacting latency.
So this was a bit of a help in 1.7.
The next release was in March of 2017. We had the last of our large latency
drops which was due to figuring out how to avoid the stop the world stack
scanning at the end of the GC cycle.
That dropped us into the sub-millisecond range.
Again the Y axis is about to change to 1.5 milliseconds and we see our third
order of magnitude improvement.
The August 2017 release saw little improvement.
We know what is causing the remaining pauses.
The SLO whisper number here is around 100-200 microseconds and we will push towards that.
If you see anything over a couple hundred microseconds then we really want
to talk to you and figure out whether it fits into the stuff we know about
or whether it is something new we haven’t looked into.
In any case there seems to be little call for lower latency.
It is important to note these latency levels can happen for a wide variety
of non-GC reasons and as the saying goes “You don’t have to be faster than the bear,
you just have to be faster than the guy next to you.”
There was no substantial change in the Feb'18 1.10 release just some clean-up and chasing corner cases.
So a new year and a new SLO. This is our 2018 SLO.
We changed the CPU objective from total CPU to CPU used during a GC cycle.
The heap is still at 2x.
We now have an objective of 500 microseconds stop the world pause per GC cycle. Perhaps a little sandbagging here.
GC assists would continue to be proportional to allocation.
The Pacer had gotten much better so we looked to see minimal GC assists in a steady state.
We were pretty happy with this. Again this is not an SLA but an SLO so it’s an objective,
not an agreement, since we can’t control such things as the OS.
That’s the good stuff. Let’s shift and start talking about our failures.
These are our scars; they are sort of like tattoos and everyone gets them.
Anyway they come with better stories so let’s do some of those stories.
Our first attempt was to do something called the request oriented collector or ROC. The hypothesis can be seen here.
So what does this mean?
Goroutines are lightweight threads that look like Gophers,
so here we have two Goroutines.
They share some stuff such as the two blue objects there in the middle.
They have their own private stacks and their own selection of private objects.
Say the guy on the left wants to share the green object.
The goroutine puts it in the shared area so the other Goroutine can access it.
They can hook it to something in the shared heap or assign it to a global
variable and the other Goroutine can see it.
Finally the Goroutine on the left goes to its death bed, it’s about to die, sad.
As you know you can’t take your objects with you when you die.
You can’t take your stack either. The stack is actually empty at this time
and the objects are unreachable so you can simply reclaim them.
The important thing here is that all actions were local and did not require
any global synchronization.
This is fundamentally different than approaches like a generational GC,
and the hope was that the scaling we would get from not having to do that
synchronization would be sufficient for us to have a win.
The other issue that was going on with this system was that the write barrier was always on.
Whenever there was a write, we would have to see if it was writing a pointer
to a private object into a public object.
If so, we would have to make the referent object public and then do a transitive
walk of reachable objects making sure they were also public.
That was a pretty expensive write barrier that could cause many cache misses.
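A hedged sketch of that publish operation; the type and the walk illustrate the hypothesis, they are not the actual prototype’s code.

    // Illustrative ROC object: objects are private to the goroutine
    // that allocated them until published.
    type rocObj struct {
        public bool
        refs   []*rocObj
    }

    // publish makes the referent public, then transitively publishes
    // everything reachable from it. This ran on every pointer write
    // into a public object, which is why the barrier was expensive
    // and cache-unfriendly.
    func publish(o *rocObj) {
        if o == nil || o.public {
            return
        }
        o.public = true
        for _, r := range o.refs {
            publish(r)
        }
    }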
That said, wow, we had some pretty good successes.
This is an end-to-end RPC benchmark. The mislabeled Y axis goes from 0 to
5 milliseconds (lower is better),
anyway that is just what it is.
The X axis is basically the ballast or how big the in-core database is.
As you can see if you have ROC on and not a lot of sharing,
things actually scale quite nicely.
If you don’t have ROC on it wasn’t nearly as good.
But that wasn’t good enough, we also had to make sure that ROC didn’t slow
down other pieces of the system.
At that point there was a lot of concern about our compiler and we could
not slow down our compilers.
Unfortunately the compilers were exactly the programs that ROC did not do well at.
We were seeing 30, 40, 50% and more slowdowns and that was unacceptable.
Go is proud of how fast its compiler is so we couldn’t slow the compiler down,
certainly not this much.
We then went and looked at some other programs.
These are our performance benchmarks. We have a corpus of 200 or 300 benchmarks
and these were the ones the compiler folks had decided were important for
them to work on and improve.
These weren’t selected by the GC folks at all.
The numbers were uniformly bad and ROC wasn’t going to become a winner.
It’s true we scaled but we only had 4 to 12 hardware thread systems so we
couldn’t overcome the write barrier tax.
Perhaps in the future when we have 128 core systems and Go is taking advantage of them,
the scaling properties of ROC might be a win.
When that happens we might come back and revisit this,
but for now ROC was a losing proposition.
So what were we going to do next? Let’s try the generational GC.
It’s an oldie but a goodie. ROC didn’t work so let’s go back to stuff we
have a lot more experience with.
We weren’t going to give up our latency, we weren’t going to give up the
fact that we were non-moving.
So we needed a non-moving generational GC.
So could we do this? Yes, but with a generational GC,
the write barrier is always on.
When the GC cycle is running we use the same write barrier we use today,
but when GC is off we use a fast GC write barrier that buffers the pointers
and then flushes the buffer to a card mark table when it overflows.
So how is this going to work in a non-moving situation? Here is the mark / allocation map.
Basically you maintain a current pointer.
When you are allocating you look for the next zero and when you find that
zero you allocate an object in that space.
You then update the current pointer to the next 0.
You continue until at some point it is time to do a generation GC.
You will notice that if there is a one in the mark/allocation vector then
that object was alive at the last GC so it is mature.
If it is zero and you reach it then you know it is young.
So how do you do promotion? If you find something marked with a 1 pointing
to something marked with a 0 then you promote the referent simply by setting that zero to a one.
You have to do a transitive walk to make sure all reachable objects are promoted.
When all reachable objects have been promoted the minor GC terminates.
Finally, to finish your generational GC cycle you simply set the current
pointer back to the start of the vector and you can continue.
All the zeros weren’t reached during that GC cycle so they are free and can be reused.
As many of you know this is called ‘sticky bits’ and was invented by Hans
Boehm and his colleagues.
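Putting the pieces just described together, here is a hedged sketch of the sticky-bit scheme: one bit per slot, shared by the allocator and the minor GC. Names are illustrative and bounds handling is omitted.

    type stickySpan struct {
        marks []bool // 1 = survived the last GC (mature), 0 = free or young
        cur   int    // the current allocation pointer
    }

    // alloc scans for the next zero bit and allocates that slot.
    func (s *stickySpan) alloc() int {
        for s.cur < len(s.marks) && s.marks[s.cur] {
            s.cur++
        }
        i := s.cur
        s.cur++
        return i
    }

    // promote flips a young object's bit to one; the caller does the
    // transitive walk over its referents.
    func (s *stickySpan) promote(i int) { s.marks[i] = true }

    // finishMinorGC resets the allocation pointer; every slot still
    // zero was unreachable and is free for reuse.
    func (s *stickySpan) finishMinorGC() { s.cur = 0 }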
So what did the performance look like? It wasn’t bad for the large heaps.
These were the benchmarks that the GC should do well on. This was all good.
We then ran it on our performance benchmarks and things didn’t go as well. So what was going on?
The write barrier was fast but it simply wasn’t fast enough.
Furthermore it was hard to optimize for. For example,
write barrier elision can happen if there is an initializing write between
when the object was allocated and the next safepoint.
But we were having to move to a system where we have a GC safepoint at every
instruction so there really wasn’t any write barrier that we could elide going forward.
We also had escape analysis and it was getting better and better.
Remember the value-oriented stuff we were talking about? Instead of passing
a pointer to a function we would pass the actual value.
Because we were passing a value, escape analysis would only have to do intraprocedural
escape analysis and not interprocedural analysis.
Of course in the case where a pointer to the local object escapes, the object would be heap allocated.
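A small example of that distinction; the names are mine, but the behavior is standard escape analysis.

    type point struct{ x, y int }

    // The argument is a value, so the analysis stays intraprocedural:
    // p lives in the caller's frame and never touches the heap.
    func sum(p point) int { return p.x + p.y }

    // A pointer to the local escapes, so the compiler heap-allocates p.
    func leak() *point {
        p := point{1, 2}
        return &p
    }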
It isn’t that the generational hypothesis isn’t true for Go,
it’s just that the young objects live and die young on the stack.
The result is that generational collection is much less effective than you
might find in other managed runtime languages.
So these forces against the write barrier were starting to gather.
Today, our compiler is much better than it was in 2014.
Escape analysis is picking up a lot of those objects and sticking them on
the stack: objects that the generational collector would have helped with.
We started creating tools to help our users find objects that escaped and
if it was minor they could make changes to the code and help the compiler
allocate on the stack.
Users are getting more clever about embracing value-oriented approaches
and the number of pointers is being reduced.
Arrays and maps hold values and not pointers to structs. Everything is good.
But that’s not the main compelling reason why write barriers in Go have an uphill fight going forward.
Let’s look at this graph. It’s just an analytical graph of mark costs.
Each line represents a different application that might have a mark cost.
Say your mark cost is 20%, which is pretty high but it’s possible.
The red line is 10%, which is still high.
The lower line is 5% which is about what a write barrier costs these days.
So what happens if you double the heap size? That’s the point on the right.
The cumulative cost of the mark phase drops considerably since GC cycles are less frequent.
The write barrier costs are constant so the cost of increasing the heap
size will drive that marking cost underneath the cost of the write barrier.
Here is a more common cost for a write barrier,
which is 4%, and we see that even with that we can drive the cost of the
mark barrier down below the cost of the write barrier by simply increasing the heap size.
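The arithmetic behind the graph, under the stated assumption that mark cost scales with GC frequency while the barrier tax is flat:

    // Doubling the heap roughly halves GC frequency, so the amortized
    // mark cost halves while the write barrier cost stays constant.
    // Illustrative numbers taken from the graph above.
    func amortizedMarkCost(baseMarkCost, heapGrowth float64) float64 {
        return baseMarkCost / heapGrowth
    }

    // amortizedMarkCost(0.05, 2) == 0.025: a 5% mark cost drops below
    // a 4% always-on write barrier once the heap is doubled.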
The real value of generational GC is that,
when looking at GC times, the write barrier costs are ignored since they
are smeared across the mutator.
This is generational GC’s great advantage,
it greatly reduces the long STW times of full GC cycles but it doesn’t necessarily improve throughput.
Go doesn’t have this stop the world problem so it had to look more closely
at the throughput problems and that is what we did.
That’s a lot of failure and with such failure comes food and lunch.
I’m doing my usual whining “Gee wouldn’t this be great if it wasn’t for the write barrier.”
Meanwhile Austin has just spent an hour talking to some of the HW GC folks
at Google and he was saying we should talk to them and try and figure out
how to get HW GC support that might help.
Then I started telling war stories about zero-fill cache lines,
restartable atomic sequences, and other things that didn’t fly when I was
working for a large hardware company.
Sure we got some stuff into a chip called the Itanium,
but we couldn’t get them into the more popular chips of today.
So the moral of the story is simply to use the HW we have.
Anyway that got us talking, what about something crazy?
What about card marking without a write barrier? It turns out that Austin
has these files and he writes into these files all of his crazy ideas that
for some reason he doesn’t tell me about.
I figure it is some sort of therapeutic thing.
I used to do the same thing with Eliot. New ideas are easily smashed and
one needs to protect them and make them stronger before you let them out into the world.
Well anyway he pulls this idea out.
The idea is that you maintain a hash of mature pointers in each card.
If pointers are written into a card, the hash will change and the card will
be considered marked.
This would trade the cost of write barrier off for cost of hashing.
But more importantly it’s hardware aligned.
Today’s modern architectures have AES (Advanced Encryption Standard) instructions.
One of those instructions can do encryption-grade hashing and with encryption-grade
hashing we don’t have to worry about collisions if we also follow standard
encryption policies.
So hashing is not going to cost us much but we have to load up what we are going to hash.
Fortunately we are walking through memory sequentially so we get really
good memory and cache performance.
If you have a DIMM and you hit sequential addresses,
then it’s a win because they will be faster than hitting random addresses.
The hardware prefetchers will kick in and that will also help.
Anyway we have 50 years, 60 years of designing hardware to run Fortran,
to run C, and to run the SPECint benchmarks.
It’s no surprise that the result is hardware that runs this kind of stuff fast.
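A hedged sketch of the idea; the design uses AES instructions for encryption-grade hashing, and the FNV hash here is purely a stand-in so the sketch runs anywhere.

    package main

    import "hash/fnv"

    const cardBytes = 512 // illustrative card size

    // cardHash summarizes one card's contents. FNV stands in for the
    // AES-based encryption-grade hash described above.
    func cardHash(card []byte) uint64 {
        h := fnv.New64a()
        h.Write(card)
        return h.Sum64()
    }

    // dirtyCards walks the heap sequentially (prefetcher-friendly, as
    // noted above) and reports cards whose hash changed since the
    // last cycle, updating the stored hashes as it goes.
    func dirtyCards(heap []byte, prev []uint64) (dirty []int) {
        for c := 0; (c+1)*cardBytes <= len(heap); c++ {
            if h := cardHash(heap[c*cardBytes : (c+1)*cardBytes]); h != prev[c] {
                dirty = append(dirty, c)
                prev[c] = h
            }
        }
        return dirty
    }

    func main() {}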
We took the measurement. This is pretty good. This is the benchmark suite for large heaps which should be good.
We then said what does it look like for the performance benchmark? Not so good,
a couple of outliers.
But now we have moved the write barrier from always being on in the mutator
to running as part of the GC cycle.
Now making a decision about whether we are going to do a generational GC
is delayed until the start of the GC cycle.
We have more control there since we have localized the card work.
Now that we have the tools we can turn it over to the Pacer,
and it could do a good job of dynamically cutting off programs that fall
to the right and do not benefit from generational GC.
But is this going to win going forward? We have to know or at least think
about what hardware is going to look like going forward.
What are the memories of the future?
Let’s take a look at this graph. This is your classic Moore’s law graph.
You have a log scale on the Y axis showing the number of transistors in a single chip.
The X-axis is the years between 1971 and 2016.
I will note that these are the years when someone somewhere predicted that
Moore’s law was dead.
Dennard scaling had ended frequency improvements ten years or so ago.
New processes are taking longer to ramp. So instead of 2 years they are
now 4 years or more.
So it’s pretty clear that we are entering an era of the slowing of Moore’s law.
Let’s just look at the chips in the red circle. These are the chips that are the best at sustaining Moore’s law.
They are chips where the logic is increasingly simple and duplicated many times.
Lots of identical cores, multiple memory controllers and caches,
GPUs, TPUs, and so forth.
As we continue to simplify and increase duplication we asymptotically end
up with a couple of wires,
a transistor, and a capacitor.
In other words a DRAM memory cell.
Put another way, we think that doubling memory is going to be a better value than doubling cores.
Original graph
at www.kurzweilai.net/ask-ray-the-future-of-moores-law.
Let’s look at another graph focused on DRAM.
These are numbers from a recent PhD thesis from CMU.
If we look at this we see that Moore’s law is the blue line.
The red line is capacity and it seems to be following Moore’s law.
Oddly enough I saw a graph that goes all the way back to 1939 when we were
using drum memory and that capacity and Moore’s law were chugging along
together so this graph has been going on for a long time,
certainly longer than probably anybody in this room has been alive.
If we compare this graph to CPU frequency or the various Moore’s-law-is-dead graphs,
we are led to the conclusion that memory,
or at least chip capacity, will follow Moore’s law longer than CPUs.
Bandwidth, the yellow line, is related not only to the frequency of the
memory but also to the number of pins one can get off of the chip so it’s
not keeping up as well but it’s not doing badly.
Latency, the green line, is doing very poorly,
though I will note that latency for sequential accesses does better than
latency for random access.
(Data from “Understanding and Improving the Latency of DRAM-Based Memory Systems,”
Kevin K. Chang, Ph.D. thesis in Electrical and Computer Engineering,
Carnegie Mellon University, Pittsburgh, PA, May 2017.
See Kevin K. Chang’s thesis.
The original graph in the introduction was not in a form that I could draw
a Moore’s law line on it easily so I changed the X-axis to be more uniform.)
Let’s go to where the rubber meets the road.
This is actual DRAM pricing and it has generally declined from 2005 to 2016.
I chose 2005 since that is around the time when Dennard scaling ended and
along with it frequency improvements.
If you look at the red circle, which is basically the time our work to reduce
Go’s GC latency has been going on,
we see that for the first couple of years prices did well.
Lately, not so good, as demand has exceeded supply leading to price increases
over the last two years.
Of course, transistors haven’t gotten bigger and in some cases chip capacity
has increased so this is driven by market forces.
RAMBUS and other chip manufacturers say that moving forward we will see
our next process shrink in the 2019-2020 time frame.
I will refrain from speculating on global market forces in the memory industry
beyond noting that pricing is cyclic and in the long term supply has a tendency to meet demand.
Long term, it is our belief that memory pricing will drop at a rate that is much faster than CPU pricing.
(Sources https://hblok.net/blog/ and https://hblok.net/storage_data/storage_memory_prices_2005-2017-12.png)
Let’s look at this other line. Gee it would be nice if we were on this line.
This is the SSD line. It is doing a better job of keeping prices low.
The material physics of these chips is much more complicated than with DRAM.
The logic is more complex; instead of one transistor per cell there are half a dozen or so.
Going forward there is a line between DRAM and SSD where NVRAM such as Intel’s
3D XPoint and Phase Change Memory (PCM) will live.
Over the next decade increased availability of this type of memory is likely
to become more mainstream and this will only reinforce the idea that adding
memory is the cheap way to add value to our servers.
More importantly we can expect to see other competing alternatives to DRAM.
I won’t pretend to know which one will be favored in five or ten years but
the competition will be fierce and heap memory will move closer to the highlighted blue SSD line here.
All of this reinforces our decision to avoid always-on barriers in favor of increasing memory.
So what does all this mean for Go going forward?
We intend to make the runtime more flexible and robust as we look at corner cases that come in from our users. The hope is to tighten the scheduler down and get better determinism and fairness but we don’t want to sacrifice any of our performance.
We also do not intend to increase the GC API surface. We’ve had almost a decade now and we have two knobs and that feels about right. There is not an application that is important enough for us to add a new flag.
We will also be looking into how to improve our already pretty good escape analysis and optimize for Go’s value-oriented programming. Not only in the programming but in the tools we provide our users.
Algorithmically, we will focus on parts of the design space that minimize the use of barriers, particularly those that are turned on all the time.
Finally, and most importantly, we hope to ride Moore’s law’s tendency to favor RAM over CPU certainly for the next 5 years and hopefully for the next decade.
So that’s it. Thank you.
P.S. The Go team is looking to hire engineers to help develop and maintain the Go runtime and compiler toolchain.
Interested? Have a look at our open positions.