Hunting for Memory Leaks in asyncio Applications
Asyncio allows you to write high-performance network applications in Python. It’s been a breath of fresh air in reducing the complexity and resource abuse common in sync applications. However, bugs happen, and they happen to asyncio apps too.
Sailing into the last two weeks of 2017, which I fully intended to spend experimenting with various eggnog recipes, I was alerted by our DevOps team that our asyncio app was consuming 10GB of memory. That is approximately 100 times more than it should!
I reviewed all recent changes and nothing obvious jumped out at me. To make matters worse, I had no idea what was triggering the memory allocation, since a restart would fix the problem and it could take several hours for the app to grow again. But grow it did, and it ate all the memory it could.
Code reviews and minor bug fixes did nothing to change this.
I knew I was facing the dreaded question.
What is the application actually doing?
I realized I needed an endpoint where I could examine such stats at runtime (since that is where the problem happens). So I wrote a simple view called /stats where I started displaying various useful info.
How many coroutines does the application use?
Here we are helped by the asyncio.Task.all_tasks method, which returns a set of all currently running tasks. Better yet, we can inspect these tasks to gain more insight. We can figure out whether a coroutine is cancelled, whether it has an exception and, most important in our case, what actually created it by examining the stack.
from collections import OrderedDict


def show_coro(c):
    data = OrderedDict([
        ('txt', str(c)),
        ('type', str(type(c))),
        ('done', c.done()),
        ('cancelled', False),
        ('stack', None),
        ('exception', None),
    ])
    if not c.done():
        # Still running: capture where the coroutine currently is.
        data['stack'] = [format_frame(x) for x in c.get_stack()]
    else:
        if c.cancelled():
            data['cancelled'] = True
        else:
            data['exception'] = str(c.exception())
    return data


def format_frame(f):
    keys = ['f_code', 'f_lineno']
    return OrderedDict([(k, str(getattr(f, k))) for k in keys])
This will give us output that can look something like this.
{
    "txt": "<Task pending coro=<writer_process() running at /var/www/app/server/influx.py:52> wait_for=<Future pending cb=[Task._wakeup()]>>",
    "type": "<class 'asyncio.tasks.Task'>",
    "done": false,
    "cancelled": false,
    "stack": [
        {
            "f_code": "<code object writer_process at 0x7f04df9a6b70, file \"/var/www/app/server/influx.py\", line 36>",
            "f_lineno": "52"
        }
    ],
    "exception": null
},
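Wiring this into the /stats view is then just a matter of running show_coro over every task. A rough sketch, assuming an aiohttp-style JSON handler (the handler name and response shape are illustrative, not from the original code):

import asyncio

from aiohttp import web


async def stats(request):
    # Gather diagnostic info for every task the event loop currently knows about.
    coros = [show_coro(t) for t in asyncio.Task.all_tasks()]
    return web.json_response({
        'num_coroutines': len(coros),
        'coroutines': coros,
    })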
Where is the memory going? What is being allocated?
Here we are helped by the tracemalloc module, which shows us what is being allocated and where.
if tracemalloc.is_tracing():
    snapshot = tracemalloc.take_snapshot()
    top_stats = snapshot.statistics('lineno')
    view_data['top_10_trace'] = [str(x) for x in top_stats[:10]]
The output may look like this.
"top_10_trace": [
"/usr/lib/python3.5/site-packages/raven/breadcrumbs.py:47: size=40.6 KiB, count=100, average=416 B",
"/usr/lib/python3.5/logging/__init__.py:1270: size=35.3 KiB, count=468, average=77 B",
"/usr/lib/python3.5/json/decoder.py:355: size=28.5 KiB, count=449, average=65 B",
...
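Note that the snippet above only reports anything when tracing is active, so tracemalloc has to be switched on at application startup (or via the PYTHONTRACEMALLOC environment variable). A minimal sketch:

import tracemalloc

# Start recording allocations; the argument is how many stack frames
# to store per trace (the default is 1).
tracemalloc.start(25)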
The answers to these two questions were all I needed to solve my troubles.
First of all, tracemalloc tipped me off that I had an extremely high number of allocations tied up in search result objects. This was unexpected, since the search results should be freed after being returned to the user.
Close examination of what happens with the search results object made me realize that we also log a number of results to our statistics database, and we do so in a separate coroutine (so as not to block the response to the user with what is just our internal statistics collection).
That led me to the conclusion that perhaps something was wrong with those logging coroutines. Under normal circumstances, one node would run between 15 and 40 coroutines; when the huge memory increase happened, the number of coroutines was 3,000 or more.
Mystery solved! The whole problem was caused by our statistics database suffering temporary failures and being unresponsive for short periods of time. When that happened, our application had a large number of coroutines running, all waiting to write to the stats database.
The optimization and fix was simple.
When logging the number of search results, we don’t need to pass in the entire result set, just the number of results.
Second, and more importantly, ensure we do not have long-lived logging coroutines running. We can specify a timeout by using asyncio.wait_for(coroutine, 5). By using this feature, occasional slowdowns of external services don’t create performance problems in our applications, since we’ll time out after 5 seconds.
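For instance, the stats write might be wrapped along these lines (a rough sketch; the coroutine and parameter names are illustrative, not from the actual codebase):

import asyncio
import logging


async def log_search_stats(stats_db, num_results):
    # Pass only the count, not the full result set, so nothing large is retained.
    await stats_db.write({'num_results': num_results})


async def record_stats(stats_db, results):
    try:
        # Give the stats write at most 5 seconds; a slow stats database
        # should never leave long-lived coroutines piling up.
        await asyncio.wait_for(log_search_stats(stats_db, len(results)), 5)
    except asyncio.TimeoutError:
        logging.warning('Timed out writing search stats')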
The application has now been running for days and only consuming 100MB of memory.
May your code be leak-free in 2018, but if it does leak, you know what to do.
Keep following us on Medium and Twitter for more snippets of the Tech world according to G!
Want to help G Adventures change people’s lives and travel the world? Check out all our jobs and apply today.