
ACKNOWLEDGMENTS

I am very grateful to several people for their help with this book. I especially thank the three anonymous reviewers who read through various drafts with great care. Their suggestions and corrections have improved this book beyond measure. I also thank Elizabeth Swayze and her team at the MIT Press for all their support and work on turning a rough proposal into this book.

INTRODUCTION

The Digital Revolution marks the shift from analog to digital technology. This revolution changed, and is still changing, the way we live, work, and communicate.
At the beginning of the twentieth century, telephones, radio, television, movies, and computing machines were analog; by the end of the century, they were digital. Computers shrank from room-filling machines owned by large companies to small personal devices. Then the internet connected our computers, and the Web gave us easy access to vast amounts of information. AI chatbots now help us organize and synthesize this sea of knowledge into manageable and useful chunks.
Phones became digital, mobile, and smart. Communication became easy, cheap, and fast. We could now not just chat but video chat. Work from home became possible.
One can examine the Digital Revolution through various lenses. We could look at the causes (the scientific and technological innovations that helped bring about the revolution) or the effects (the changes the revolution brought to society), but this book is about the deep theories that lie at the very heart of the Information Age. Most of us know something about the grand theories of physics that transformed humanity's views of the universe at the start of the twentieth century: quantum mechanics and general relativity. Less familiar are the brilliant theories underlying the Digital Revolution.
This book is about these theories and the elegant mathematics that gives deep insight into the digital world. Much of this is not widely known or appreciated. Communication and computation have become a taken-for-granted part of our modern world. We rarely stop to ask some of the fundamental questions: What is computation? What is information? What advantages does digital information have over analog? How do we convert analog signals into digital ones? What is an algorithm? What is a universal computer? How can a machine learn? But the answers to these questions are fascinating. Mathematics is the key that helps us deepen our understanding. This book is a journey through these mathematical ideas that underpinned the Information Age.
Can we describe the theory without mathematics? We can try, but the result is often overstretched metaphors that don't yield real understanding. Mathematics is not there to obfuscate, complicate, and intimidate. Quite the opposite. Mathematics should simplify and elucidate.
The good news is that we will see some truly beautiful mathematics, and it is fairly elementary. Many of the ideas may be new, but the numerous examples should help to illustrate and make them comprehensible. Most of the book should be understandable with a high school mathematics background. In a few places, I felt readers who had taken a STEM mathematics course at university might appreciate a little more information. I put these additions into the appendices. However, these are not essential to understanding the basic ideas.
The four major themes of the book are information, communication, computation, and learning. We usually start with a simple mathematical model of an important concept. A powerful theory takes the simple model and deduces surprising and useful consequences. It reveals a deep underlying structure connecting concepts from what, at first, appear unrelated areas. We will see this over and over.
The goal is to present the concepts using the least amount of mathematics. Instead of full-blown general theories, we will often look at the simplest example that illustrates the basic idea. But nothing is oversimplified. The goal is to understand the concepts and see connections, but not get lost in technical details.
I often include historical anecdotes to give a sense of the technology at that time, its impact, and the problems which needed to be solved. These historical stories are there for illumination. This is not a history book but a book about ideas.
Here is an outline of some of these ideas and the order in which we meet them.

CHAPTER 1

This book is about the Digital Revolution that started in the twentieth century. But there have been other digital revolutions that also transformed society. The first was the invention of writing and alphabets. Of course, this was before written history, but a revolution occurred. Communication could now be done digitally, using a finite set of symbols, allowing it to take place over long distances and over long intervals of time.
Chapter 1 briefly looks at alphabets before turning to the telegraph and its precursors. The telegraph introduces us to Samuel Morse and his eponymous code, but also to the relay, which plays an important role in the development of computers. It ends with a brief description of the analog revolution that followed: the telephone superseding the telegraph, the invention of radio and television, and the adoption of record players, cassette players, and movies. All were analog. These would all change to being digital. But why?

CHAPTER 2

In chapter 2, we start our study of information theory. What is information? How should it be defined and quantified?
In everyday English, information has a wide variety of meanings. The amount of information conveyed is often subjective: a message in Chinese might convey something important to a native speaker but nothing to someone who doesn't speak the language. Here, "information" seems related to both meaning and understanding. However, we want to talk about the amount of information in a message no matter who or what sends or receives it, for example, communication between two machines.
In mathematical theory, information is separated from meaning and understanding. This definition gives us an objective way to measure the amount of information in a message. At first, this definition seems rather simplistic, but, as we see later, it captures the fundamental idea and has powerful consequences.
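As a concrete preview of this objective measure (a sketch of mine, not an excerpt from the book, using Shannon's standard definition of self-information), an event of probability p conveys -log2(p) bits, so rarer events carry more information, independent of what the message means:

```python
import math

def information_bits(p):
    """Shannon's self-information: the number of bits conveyed by an
    event of probability p (0 < p <= 1), independent of its meaning."""
    return -math.log2(p)

# A fair coin flip conveys exactly one bit ...
print(information_bits(1 / 2))     # 1.0
# ... while a 1-in-1024 event conveys ten bits,
print(information_bits(1 / 1024))  # 10.0
# and a certain event conveys no information at all.
print(information_bits(1))         # 0 bits
```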

CHAPTER 3

Chapter 3 extends the ideas from the previous chapter. We see the number of bits of information in a message might be fewer than the number of binary digits in the message. This gives us a measure of redundancy. We then see how that redundancy can be removed, leaving a shorter message that contains all the information. This is the idea behind lossless compression.
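As a toy illustration of the lossless idea (my own example; the book develops more powerful schemes), run-length encoding removes one obvious kind of redundancy, runs of repeated symbols, while the decoder recovers an exact copy of the original:

```python
def rle_encode(s):
    """Collapse each run of a repeated symbol into a [symbol, count] pair."""
    runs = []
    for ch in s:
        if runs and runs[-1][0] == ch:
            runs[-1][1] += 1
        else:
            runs.append([ch, 1])
    return runs

def rle_decode(runs):
    """Expand the pairs back into the original string: nothing is lost."""
    return "".join(ch * count for ch, count in runs)

message = "000000011110000000111"
packed = rle_encode(message)   # [['0', 7], ['1', 4], ['0', 7], ['1', 3]]
assert rle_decode(packed) == message   # an exact copy is recovered
```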
Compression is important for both digital communication and storage. We want to reduce the size of a file before sending or storing it: we want information conveyed as quickly as possible and stored in the least amount of space. However, we don't want to lose any of the information, and we need the ability to decompress the file to obtain an exact copy of the original.

CHAPTER 4

When we transmit or store data, errors will occur. We construct our equipment to minimize errors, but it is impossible to eliminate them entirely. There is always some outside interference that can occasionally corrupt some of our data. We call this interference noise.
When we receive a message, we would like to know whether there are any errors in it. If there are errors, we would like to know where they are and how to correct them. Error-correcting codes do exactly this.
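The simplest such code makes the idea concrete (a toy sketch of mine, not necessarily one of the chapter's examples): repeat every bit three times, and let a majority vote at the receiver both detect and correct any single flipped bit within a triple:

```python
def encode(bits):
    """Triple-repetition code: transmit each bit three times."""
    return [b for b in bits for _ in range(3)]

def decode(received):
    """Majority vote over each triple corrects a single error in it."""
    return [1 if sum(received[i:i + 3]) >= 2 else 0
            for i in range(0, len(received), 3)]

message = [1, 0, 1, 1]
sent = encode(message)
sent[4] ^= 1                     # noise flips one bit in transit
assert decode(sent) == message   # the error is found and corrected
```

The price is a threefold increase in message length; the codes studied in the chapter achieve far better trade-offs.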
After looking at some examples of error-correcting codes, we turn to Claude Shannon's Noisy-Channel Coding Theorem. This stunning result shows that it is possible to transmit information along a channel with noise. Each channel has a speed limit. This limit depends on how much noise, or interference, there is on the channel. Shannon shows that if you stay under the speed limit, you can transmit information as accurately as you want.
We give a simple statement of the theorem and explain what it means, followed by a sketch proof of the theorem.

CHAPTER 5

We now know how to compress and send digital messages accurately. The next step is to send them securely. This chapter looks at encryption.
We look at both symmetric-key and public-key encryption, explaining the methods used whenever we conduct business over the internet. We also describe digital signatures and certificates: it's not enough to know we have a secure communication channel; we must also know the person on the other end is who they claim to be.

CHAPTER 6

The previous four chapters considered digital information. We know how to compress and transmit it securely and accurately. But what about analog signals? Sound is conveyed by waves, and light by electromagnetic waves; waves are analog, not digital.
Chapter 6 shows how we can convert analog information to digital information. Once in digital format, we can use our digital theory to compress and send the information accurately and securely. Analog-to-digital conversion enables us to use digital technology on real-world analog data.
We discuss two of the most important concepts in digital signal theory: the Nyquist-Shannon Sampling Theorem and the Discrete Fourier Transform.
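To make the Discrete Fourier Transform less abstract, here is its defining sum computed directly (a sketch of mine with a signal I chose; chapter 6 gives the real treatment). Sampling one full cycle of a cosine at eight points puts all of its energy into frequency bins 1 and 7:

```python
import cmath
import math

def dft(x):
    """Discrete Fourier Transform by its defining sum,
    X[k] = sum over n of x[n] * exp(-2*pi*i*k*n/N).
    This takes O(N^2) operations; the Fast Fourier Transform of
    appendix B computes the same values much faster."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / N)
                for n in range(N))
            for k in range(N)]

N = 8
samples = [math.cos(2 * math.pi * n / N) for n in range(N)]
spectrum = dft(samples)
# Only the bins matching the cosine's frequency carry energy.
peaks = [k for k, X in enumerate(spectrum) if abs(X) > 1e-9]
print(peaks)  # bins 1 and N-1
```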
Appendix B follows from this chapter. It describes a Fast Fourier Transform, an algorithm for calculating the Discrete Fourier Transform quickly.

CHAPTER 7

In the previous chapters we mentioned algorithms on several occasions. In chapter 7, we look at two papers that tell us something fundamental about algorithms and computing.
The first, a paper by Claude Shannon, "A Symbolic Analysis of Relay and Switching Circuits," shows the equivalence of circuits with switches and relays to the propositional calculus of logic. It forms the foundation of the design of logic networks. Shannon's paper was written before the invention of the transistor. The transistor has now replaced the relay in logic circuits, but Shannon's paper explains the important role they play.
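The correspondence Shannon established can be shown in a few lines (my illustration of the idea, not an excerpt from his paper): treat a closed switch as True and an open one as False, and then switches wired in series compute AND while switches wired in parallel compute OR:

```python
def series(a, b):
    """Two switches in series conduct only when both are closed: AND."""
    return a and b

def parallel(a, b):
    """Two switches in parallel conduct when either is closed: OR."""
    return a or b

def circuit(a, b, c):
    """(a in series with b), in parallel with c: this wiring computes
    the propositional formula (a AND b) OR c."""
    return parallel(series(a, b), c)

# The physical circuit and the logical formula agree on every input.
for a in (False, True):
    for b in (False, True):
        for c in (False, True):
            assert circuit(a, b, c) == ((a and b) or c)
```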
The second paper is by Alan Turing, "On Computable Numbers, with an Application to the Entscheidungsproblem." Here, Turing gives his definition of algorithm and shows there is a universal computer that can perform any algorithm. He famously also shows there are easy-to-state problems a computer cannot solve. This paper forms the foundation of theoretical computer science.
An appendix follows from this chapter. It explains the halting problem. Given a program and an input, we would like to know whether the program will eventually stop. If the program is not going to halt on the input, we would like to know this in advance, so we don't start running it! The appendix shows there is no algorithm that can tell us whether an arbitrary program running on an arbitrary input will halt. This is a fundamental result in computer science showing the limitations of what algorithms can achieve.
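The core of that argument can be sketched in a few lines (my paraphrase of Turing's diagonal construction, not the appendix's own text): given any claimed halting decider, build a program that does the opposite of whatever the decider predicts about that very program, so the decider must be wrong somewhere:

```python
def make_trouble(halts):
    """Given a claimed decider halts(program) -> bool, construct a
    program that the decider is guaranteed to get wrong."""
    def trouble():
        if halts(trouble):
            while True:      # the decider said we halt, so loop forever
                pass
        # the decider said we loop forever, so halt immediately
    return trouble

# Try a candidate decider that predicts "never halts" for every program.
predicts_never = lambda program: False
t = make_trouble(predicts_never)
t()   # returns at once: the prediction was wrong about t itself.
# A decider that answered True would be wrong too (t would then loop),
# so no single algorithm can be right about every program and input.
```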

CHAPTER 8

The previous chapter looked at algorithms, explaining what they can and cannot do. We now look at how algorithms can "learn." Chapter 8 introduces the basic concepts: We have an underlying model involving a number of parameters. If the parameters are chosen correctly, the model will fit the data well. The process of choosing the parameters involves an iterative process of locating the minimum of a function.
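The iterative minimum-finding described above can be sketched with made-up data (a toy example of mine, not one from the book): nudge a line's slope and intercept downhill on the squared-error function until the model fits the data:

```python
# Fit y = m*x + b to data by gradient descent on the squared error.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]        # generated from y = 2x + 1

m, b = 0.0, 0.0                  # start from a poor guess
step = 0.05                      # size of each downhill move
for _ in range(2000):
    # Partial derivatives of sum((m*x + b - y)^2) with respect to m, b.
    grad_m = sum(2 * (m * x + b - y) * x for x, y in zip(xs, ys))
    grad_b = sum(2 * (m * x + b - y) for x, y in zip(xs, ys))
    m -= step * grad_m
    b -= step * grad_b

print(round(m, 2), round(b, 2))  # converges to 2.0 and 1.0
```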
We start with two simple examples of fitting a line to data. These examples illustrate the difference between supervised and unsupervised learning.

CHAPTER 9

Chapter 9, the final chapter, looks at neural networks. We start with the initial work by Warren McCulloch and Walter Pitts, showing neural networks, like Shannon's switching circuits, are equivalent to the propositional calculus for logic. Then we turn to learning.
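A McCulloch-Pitts unit is simple enough to write out (a sketch with weights and thresholds I chose myself): it fires when the weighted sum of its 0/1 inputs reaches a threshold, and different settings reproduce the logical connectives, which is the equivalence the chapter describes:

```python
def neuron(weights, threshold):
    """A McCulloch-Pitts unit: output 1 (fire) when the weighted sum
    of the 0/1 inputs reaches the threshold, otherwise output 0."""
    def fire(*inputs):
        total = sum(w * x for w, x in zip(weights, inputs))
        return 1 if total >= threshold else 0
    return fire

AND = neuron([1, 1], threshold=2)   # fires only if both inputs are on
OR = neuron([1, 1], threshold=1)    # fires if either input is on
NOT = neuron([-1], threshold=0)     # an active input inhibits firing

assert [AND(a, b) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]] == [0, 0, 0, 1]
assert [OR(a, b) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]] == [0, 1, 1, 1]
assert (NOT(0), NOT(1)) == (1, 0)
```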
We study a small neural network with just four nodes. This example illustrates how the process works but is too small to show the power of neural networks to find patterns in large datasets. We indicate how a larger neural network can recognize handwritten digits.

1 DIGITAL REVOLUTIONS

The term digital revolution denotes the shift from mechanical and analog technology to digital technology, ushering in the Information Age. The change is so profound that we sometimes refer to it as the third industrial revolution (the first industrial revolution having started with the mechanization of the textile industry, and the second beginning with the moving assembly line of Henry Ford).
This Digital Revolution involves both computation and information. There are mathematical theories of both. Two remarkable papers were published in the first half of the twentieth century that are now regarded as foundational. One, by Alan Turing, forms the basis of the theory of computation. The other, by Claude Shannon, forms the basis of the theory of information. Both Shannon and Turing were mathematicians, and they wrote their theories in the language of mathematics. To understand these theories, and to see how one idea leads to the next, mathematics is essential. In this book, we present the fundamental ideas as simply as possible, often giving special cases rather than the much more complicated general case, but the goal is to give a genuine understanding of some of the deepest ideas of the twentieth century: simplification, without oversimplification.
This book is about mathematical theories, starting with definitions of fundamental ideas and then using logical arguments to deduce deep and surprising results. The initial definitions formalize concepts that we have used for millennia. To define terms such as computation, information, and learning, the starting point is to consider what humans do in each case and then to extract the basic essential qualities to model with mathematics. The initial definitions are often remarkably simple and, at first glance, might seem like hopeless oversimplifications, but the power of a successful theory is that from modest beginnings, it tells us surprising new things that could not have been obtained without it.
This chapter gives some historical background about earlier digital revolutions-those caused by the invention of writing and the telegraph. These did not involve mathematics. However, in later chapters, as we introduce the mathematical theory, we will often refer to this chapter's basic ideas and the problems that needed to be surmounted.

1.1 ANALOG VERSUS DIGITAL

In movies, when the director wants to introduce digital content, the scene often focuses on a screen in which a list of 0s and 1s appears, starting from left to right and rapidly continuing until the whole screen is filled. The list typically continues for a few more seconds before finally coming to an end. This depiction correctly tells us something about digital communication: We have a list of symbols. The list might be long, but it eventually ends. It is a finite list, not infinite.
Mathematicians use the term string for this list, referring to the list as a string of 0s and 1s. The word string conjures something long and thin, which seems appropriate, but in some ways the word is not ideal. One important point about a mathematical string is that it has a first symbol, then a second symbol, a third, and so on. It is discrete. Physical strings don't have that property. They are continuous. It makes little sense to talk of the third point of a physical string. Perhaps chain would have been a better choice of word. We could then talk about the first link, the second link, and so on. But string is the standard word, and we will use it throughout the book to mean a finite list.
Another movie trope occurs when a character has been hospitalized. The scene often starts with a hospital monitor. Here, the heart rate is depicted graphically by the drawing of a wavy line peaking as the patient's heart contracts. The monitor also beeps, but it is the wavy line being continuously traced that gives a good example of analog communication. The wavy line is not discrete. It is continuous, there are no gaps in the graph. This gives us one important distinction between analog and digital communication. Analog is a continuous process while digital is discrete. Digital and analog clocks give another example of the distinction.
An old-fashioned analog clock has hands that continuously rotate around the clock face. Between three minutes past ten and four minutes past ten, there will be an instant when the hands depict three and a half minutes past ten. Most digital clocks only show the time at discrete second intervals. Such a clock doesn't display three and a half minutes past ten, only the whole number of seconds that comes before it, and then the second after it.
The mathematical theory of information looks at strings of symbols. We usually think of the symbols as 0 and 1, but there is no reason to restrict to these two; any other pair of symbols would do. Indeed, we don't have to restrict to just two symbols. We can construct our strings using any number of symbols, provided the number is finite and greater than one.
There have been earlier digital revolutions in which mathematics did not play a central role. It helps to look at historical examples to see how digitization has revolutionized human society. The first digital revolution was the invention of writing-the conversion of language from oral to writtenprobably the earliest example of analog-to-digital conversion. We begin by looking at this in more detail. Then we turn to the telegraph, which, along with railroads, changed society once again, connecting us even more.
Digital communication was an important part of the nineteenth century, but towards the end of the nineteenth and the start of the twentieth century, there was an analog revolution. Major technological advances (radio, television, telephone, cinema, photography, vinyl records, and even some computers) were analog. We need to know a little of this history to understand the construction of the mathematical theories developed later.

1.2 WRITING AND ALPHABETS

All human societies develop languages. We need to communicate with one another and humans have evolved to exchange information using speech. Not all societies develop writing. It's not necessary if your community is small and homogenous, but once a community grows, writing becomes increasingly useful. Written language extends spoken language through both space and time. We can transport a written message to a recipient well out of earshot of the sender. It can be read years later to remind everyone exactly what was said. Written language allows us to store information.
I have written this book in English using the standard alphabet, but not all written languages use alphabets. Mandarin Chinese, for example, uses logograms, where each symbol represents the idea behind the word it stands for. Alphabets, on the other hand, contain symbols that represent the smallest atom of the sound of the spoken language: the phoneme. English is an amalgam of other languages, taking words from French, German, and Scandinavian roots. It has many phonemes, approximately forty-five. Besides using a single letter to denote a single phoneme, we can use combinations of letters to get sounds such as ch, sh, and th. This means we can restrict our alphabet to a more manageable size. Standard English uses twenty-six letters, and most alphabets are of a similar size. Learning the alphabet means learning the names of the letters and the related task of associating phonemes to them. Alphabets are a remarkable construction. Phonemes are much more basic than syllables, but they contain no information about pitch, tone, or volume. However, the phonetic structure helps when we first learn to read. We start by reading each letter and pronouncing the associated phoneme, gradually building up to the sound of the word being read.
Our Latin alphabet comes from the Greek, the word alphabet coming from alpha and beta, the first two letters of the Greek alphabet. The Greek alphabet came, in turn, from the Phoenician one. Phoenicians were traders and needed to communicate with people speaking many languages. They probably derived their alphabet from one used by even earlier traders; the exact origins are not entirely clear. However, the symbols used in the initial alphabets came from Egyptian hieroglyphics. Hieroglyphics started out as pictograms but grew more sophisticated as the Egyptians developed a true writing system capable of expressing more than simple concepts. We can trace our letter A back to a pictogram of a cow's head. (If you invert our usual symbol, you can see the horns.) The history of the letter A also tells us something else important about alphabets.
The Phoenician language is Semitic and is not based on syllables but on groups of consonants. Its alphabet contains no vowels, consisting entirely of consonants. Ancient Greek, like modern English, is much more vowel dependent. Omitting vowels would make the words ambiguous, so the Greeks repurposed some of the Phoenician letters for vowels. Our letter A was one of these.
Different languages can have quite different atomic components, but we can adjust the meanings of letters to fit the language. Though alphabets may vary, they all have the same fundamental properties: there are a fixed, finite number of symbols, and strings of these symbols can completely describe the language. From a modern mathematical perspective, messages comprising strings of symbols from a finite, fixed alphabet are called digital, as opposed to analog, where quantities can vary continuously. When we talk, we can vary how loudly we say each word, how quickly we say it, and add intonations; all can be changed continuously, not discretely. The remarkable property of alphabets is that in turning spoken language into written language, they extract important discrete information while ignoring some analog parts. Of course, how we say our words is important and carries meaning, but we can write additional words that convey this meaning in writing. For example, "'Help!' he cried urgently" is how we would write about someone falling down a hole, but the person shouting for help doesn't need to convey the urgency by adding additional words. It is conveyed by how the word is shouted.
As we move to a mathematical description of communication, we need to define our terms more precisely. We will look at strings of symbols. The word "alphabet" will describe the complete list of all possible symbols that are allowed in these strings. So, for example, if we are describing the string of symbols that makes up the text of this book, we include not only the standard uppercase and lowercase alphabet but also all the punctuation symbols and the space. In English, we use the space to denote the end of words and sentences. It doesn't initially seem like a symbol because it doesn't require ink to write it, but it occupies space. The space key is the largest on the standard keyboard. This is because in English and many other languages it is the most used symbol; in English we use it more than the letter E. Typically, we study strings of 0s and 1s. Here, our alphabet comprises just the two symbols, 0 and 1. It might seem strange at first to use the word "alphabet" for two symbols that usually represent numbers and not letters, but it is standard and helps us remember that we are using them as symbols and not as numbers.
As we noted above, writing enables us to communicate over long distances. Once people have the ability to transmit information, the speed of transmission becomes important. Recall, for example, ancient Greek messengers running long distances to convey military information, or the American Pony Express. People usually want important information to be conveyed as quickly as possible. The invention of the telegraph was an important step in this direction.

1.3 TELEGRAPH

When we think of a telegraph, we often picture someone tapping out Morse code. Samuel Morse, a professor of painting and sculpture at the University of the City of New York, invented a code that revolutionized telegraphy. However, telegraphy predates Morse and has an interesting history.
The ancient Greeks built towers on mountaintops, each within sight of its nearest neighbors. If flames were lit on a tower within your sight, you would light the beacon on your mountaintop. This was a binary system and was limited to just one bit of information. The beacon was either on fire or it wasn't. The meaning, such as "the army is advancing," had to be prearranged. This is a very simple system, but it is effective, and it is one still used today for traffic lights.
Semaphore is more general. A soldier or sailor conveys information using flags held in each hand; each letter of the alphabet corresponds to a position of the arms. In 1792, Claude Chappe, a French inventor, demonstrated a mechanical system based on this idea. Chappe also coined several words. He coined the word semaphore, meaning sign bearer, and used the word tachygraph, fast writing, for his own system. However, the army, which adopted Chappe's system, invented its own word, telegraph, emphasizing distance over speed.
In 1792, France was in the middle of its revolution. To say this year was tumultuous is probably an understatement. France declared war on Austria. Prussia declared war on France. The guillotine was approved for executions. Louis XVI was arrested, and the monarchy abolished. The French Republican calendar began: it was now year 1.
Chappe's idea was to build towers with masts. Attached to the top of each mast was a crossarm, with an arm attached to each end. The operator could pivot the arms and crossarm to various positions. Neighboring operators would observe the configurations using telescopes and pass the message along. Chappe's brother was a member of the Legislative Assembly and got it to approve construction of a line between Paris and Lille. They built this and had it working in 1792. It successfully conveyed information to Paris about the war against the Austrians. Other lines were quickly built throughout France, and neighboring countries began adopting the system. Napoleon Bonaparte used it extensively. The optical telegraph was an important advance in communication, but within a few years it would become outdated, replaced by the electrical one.
By the end of the eighteenth century, many people were experimenting with electricity and realizing that they could use it as a medium to transmit messages. Some of the earliest attempts used static electricity. The Swiss mathematician and engineer George-Louis Le Sage constructed a telegraph of this type between two rooms of his house in 1774. It used twenty-six wires, one for each letter, and worked by building up static electricity on a glass rod that was then touched to the appropriate wire. At the other end were flakes of gold leaf that moved in the presence of the charge. It worked, but it was more of a novelty than a useful device. Just a few years later, in 1800, Alessandro Volta invented the battery. This changed everything. Now electrical circuits could be connected to the battery, and the current turned on and off using switches. The discovery that electricity could move magnetized needles gave a convenient way of detecting the signal at the end of the wire.
William Fothergill Cooke and Charles Wheatstone developed the first commercial telegraph in England. Their telegraph used needles, each of which could be in three positions: a vertical neutral position and two diagonal positions. Initially, each terminal had five needles cleverly arranged on a metal plate engraved with letters of the alphabet in such a way that two diagonal needles could point to any letter. They patented their system and demonstrated a preliminary version on the railway line between the London stations of Euston and Camden Town in 1837. It was used to signal when carriages needed to be hauled up an incline to be attached to a locomotive. This worked but was rejected as being too complicated for communicating over a short distance and was replaced with a simpler pneumatic system using whistles. However, the following year their telegraph was used to connect the stations of Paddington and West Drayton, a distance of thirteen miles.
One advantage of Cooke and Wheatstone's telegraph was that hardly any training was needed to operate it. A disadvantage was that it required five parallel wires to be run between the telegraph terminals. As wires failed, Cooke and Wheatstone realized a simpler system with fewer connections would increase reliability. Over the next few years, they gradually reduced the number of wires until they had a system that needed only two wires between the terminals. With two wires, the telegraph could move only one needle, so there needed to be a code relating a sequence of needle directions to letters of the alphabet. The system was more reliable, but the operators now needed to learn the code. The two-wire system was in place between Slough and Paddington in 1843, when a remarkable event thrust it into the public eye. Before we describe this, we should note the three least used letters in English are J, Q, and Z. The telegraph omitted these three letters. If these letters were needed, the operator substituted the letters G, K, and S.
For many years, John Tawell worked for a Quaker-owned business in London, eventually becoming a Quaker himself. But he had a criminal streak. In 1814, he was arrested for forging banknotes. This was a capital offense, but the bank owners didn't want adverse publicity and so, as they too were Quakers and opposed the death penalty, they quietly arranged to have Tawell transported to the penal colony of Australia.
In Australia, Tawell seemed to turn his life around. He became a successful businessman, running a chemist shop. While there he discovered philanthropy, helping to found the first Quaker community in the country. He returned to London in 1838, a wealthy man. Unfortunately, his wife was terminally ill with tuberculosis. Tawell hired a nurse, Sarah Hart, to take care of her. After his wife died, Tawell began a secret affair with Hart, which went on for many years, continuing even after Tawell remarried another woman. He and Hart had two children. Tawell supported the secret family financially.
In 1844, Tawell ran into financial difficulties, and decided his best course of action was to murder Hart. This he did, by slipping prussic acid into her beer. A neighbor heard Hart's screams and observed Tawell leaving the house. The neighbor quickly alerted others who followed Tawell, who rushed to a nearby train station, bought a ticket, and boarded a train heading to central London. The pursuers alerted the station master that a murderer might have just escaped. The quick-thinking station master telegraphed the following message:
A MURDER HAS GUST BEEN COMMITTED AT SALT HILL AND THE SUSPECTED MURDERER WAS SEEN TO TAKE A FIRST CLASS TICKET TO LONDON BY THE TRAIN WHICH LEFT SLOUGH AT 742 PM HE IS IN THE GARB OF A KWAKER WITH A GREAT COAT ON WHICH REACHES NEARLY DOWN TO HIS FEET HE IS IN THE LAST COMPARTMENT OF THE SECOND CLASS COMPARTMENT
In Paddington there was some initial difficulty understanding what KWAKER meant, and the message had to be transmitted more than once, but the operator gave the message to William Williams of the railway police. Williams followed Tawell as he left the train, noting where he was lodging for the night. The following day, Williams, accompanied by Inspector Wiggins of the Metropolitan Police, arrested Tawell.
Tawell was tried, sentenced, and hanged. The story, of course, made headlines in the papers. Besides the gory details, it publicized the power of the telegraph. The arrest of Tawell was a harbinger of change. Information could be transmitted quickly, and what initially seemed to be a novel contraption would soon be deemed essential.
From a modern perspective, some elements of the story seem strange. Why didn't Tawell get rid of his distinctive coat? Why wasn't he arrested immediately? But these questions just illustrate how important sending information quickly is, and how the telegraph was the first step toward the speed of communication that we take for granted today. Sending a letter through the mail was the standard method of communication; when the train left the station with Tawell on board, he thought he was going to outrun any information about his crime, so there was no need for disguise. When Williams understood a murderer needed to be arrested, he knew he needed to organize backup, and this involved going to the local police station to discuss the situation in person. The pace of daily life was slower than now. The telegraph would help to change this.
We often credit Samuel Morse with inventing the telegraph. This seems to be for two reasons. The first is that Morse was convinced the idea originated with him and, since his brothers published The New York Observer, he could publicize his invention and claim he was the originator, both for himself and for America-Morse had a remarkable talent for self-promotion. The second reason is that, although many people devised electrical telegraphs, Morse's version eventually developed into one that was better than the competitors' and was the one that became widely adopted. Let us examine how his invention came to be:
In 1832, Morse returned from a two-year European trip studying art and visiting artists' studios. On the ship home during an after-dinner discussion, one passenger, Dr. Charles Jackson, described some recent experiments involving electricity. Morse realized electricity could be used to send signals.
Morse, who was in his mid-40s, was a well-established painter based in New York. At that time in the United States, painting usually meant portraiture. Morse was successful at this and had been commissioned to paint many people, including President James Monroe and the Marquis de Lafayette. Morse believed painting was much more than portraiture and that he should encourage the art form to be more widely appreciated. He gave public lectures, introducing New Yorkers to the work of the great European masters. He also helped found, and became president of, the National Academy of Design, which quickly became the most important art institution in America.
When Morse arrived back in New York after his European trip, he started work in secret on designing his apparatus. Five years later, when he read newspaper reports of other telegraphs, he realized he had to announce that he was the original inventor of the electrical telegraph and had to display his prototype. His telegraph was very primitive (he would improve every part over the next few years), but it had two important features that would remain throughout. The first was that it comprised just one electrical circuit. Initially, this meant there were two wires connecting the terminals, but later, when people understood you could ground the apparatus and use the earth as the return wire, only one wire was needed. The other distinguishing feature was a pen on the receiving end, which, when deflected by the electrical impulse, marked a moving paper tape. The advantage over other types of telegraphs was a record of the message: if the operator missed a symbol or misunderstood what was being transmitted, he could refer to the paper record.
Morse was a painter, not an engineer nor a scientist. He knew he needed help with both engineering and financing. He met both needs when he partnered with Alfred Vail. Vail's name is not as widely known, but he played an important role in the design and implementation of the first machines.
One problem the telegraph faced was that wires had resistance, and over long distances the voltage dropped so much that the signal became unreadable. Morse's idea was that each telegraph line would be about twenty miles long, but you could extend the length of the total system by stringing lines together. After twenty miles, the output would become the next telegraph's input. The output of the second telegraph would become the input of the third, and so on. Importantly, each telegraph would have its own battery, so the signal would be readable at the end. This can be compared to a relay race, where the baton is passed from one runner to the next. Morse knew what he wanted and tasked Vail with the design.
Vail knew Joseph Henry's work on electromagnetism had essentially solved the problem. A current running through a coil forms a magnetic field which can attract the moving part of a switch. This was exactly what was needed. You could have a switch at the start of the second telegraph, remotely mirroring the switch of the first. This remote switch became called a relay. Relays meant the signal could be amplified at each stage and so there was no limit to the length a signal could be sent. Even though this was really Henry's idea, Morse, being Morse, realized relays would be important and patented them.
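Why relays remove the distance limit can be sketched in a few lines. The numbers below (attenuation rate, relay threshold, battery voltage) are invented for illustration; the point is only that each relay restores the signal to full battery strength, so the chain can be extended indefinitely.

```python
def attenuate(voltage, miles, loss_per_mile=0.05):
    """Signal strength decays along the wire (illustrative exponential loss)."""
    return voltage * (1 - loss_per_mile) ** miles

def relay(voltage, threshold=0.2, battery=1.0):
    """A relay closes its switch if the incoming signal is strong enough to
    move the electromagnet, then retransmits at full battery strength."""
    return battery if voltage >= threshold else 0.0

# Sent directly, the signal is hopelessly weak after 200 miles.
direct = attenuate(1.0, 200)

# With a relay every twenty miles, it is regenerated at each stage.
signal = 1.0
for _ in range(10):              # ten 20-mile segments = 200 miles
    signal = relay(attenuate(signal, 20))

print(direct)   # tiny fraction of a volt, far below any usable threshold
print(signal)   # restored to full strength at every relay
```

Because the relay re-creates the signal rather than merely passing it on, errors of attenuation do not accumulate, which is exactly what made unlimited-distance telegraphy possible.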
Morse code also evolved. Initially, the idea was to transmit just what are now known as dots. Pauses separated sequences of the dots. The receiving operator counted the number of dots in the sequence. For example, a message might be something like dot dot pause dot dot dot pause dot long pause. The operator would read this as 231, which meant it was the two hundred and thirty-first word in the accompanying dictionary. The operator would read off numbers, look them up in the dictionary, and then write down the associated word.
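The dot-counting scheme above can be sketched as follows. This is a toy reconstruction: the one-entry codebook is hypothetical (real operators used a printed dictionary), and a single space and " / " stand in for the short and long pauses.

```python
def decode_dots(message, dictionary):
    """Decode Morse's early dot-counting scheme: short pauses separate
    groups of dots, long pauses end a number, and each number indexes
    a word in an agreed dictionary."""
    words = []
    for number_str in message.split(" / "):            # long pause ends a number
        digits = [str(len(group)) for group in number_str.split(" ")]
        words.append(dictionary[int("".join(digits))])
    return " ".join(words)

# Hypothetical codebook entry: word number 231.
codebook = {231: "army"}

# "dot dot, pause, dot dot dot, pause, dot" -> digits 2, 3, 1 -> entry 231.
print(decode_dots(".. ... .", codebook))  # -> army
```

The scheme works, but every word costs several digit groups plus a dictionary lookup, which is part of why the dot-dash letter code that followed was so much faster.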
Morse realized that besides sending a pulse of electricity, he could send longer-lasting bursts: dashes. He thought about having dashes of various lengths but eventually settled on the standard dot-dash system. The assignment of dots and dashes evolved. Morse's original assignment worked, but he wanted his telegraph to be faster than competing systems. It made sense to assign shorter strings of dots and dashes to letters of the alphabet that are used frequently and longer strings to letters that are used less often. To find the frequency of letters, Vail visited the print shop of a local newspaper. Printing used movable type, which was stored in a frame. The frame had compartments for each letter, and the size of a compartment varied depending on how often the pieces of type it held were used. Vail estimated the number of each of the letters in each of the compartments and used this to rank the letters according to how often they were used. Once they had the frequencies, Morse and Vail changed their previous assignment of dots and dashes so that the lengths of the strings corresponded to the frequencies of the letters.
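The effect of that frequency-based assignment is still visible in the familiar code. The snippet below uses International Morse code (a later standardization of Morse and Vail's system) and rough modern frequency estimates rather than Vail's type-case counts: the common letters get the shortest codes.

```python
# A few International Morse codes, with approximate relative frequencies
# of the letters in English text. E and T, the two most common letters,
# get the two one-symbol codes; rare letters need four symbols.
morse = {"E": ".", "T": "-", "A": ".-", "N": "-.",
         "Q": "--.-", "J": ".---", "Z": "--.."}
freq = {"E": 0.127, "T": 0.091, "A": 0.082, "N": 0.067,
        "Q": 0.001, "J": 0.002, "Z": 0.001}

# Listing the letters from most to least frequent shows the code
# lengths rising as the frequencies fall.
for letter in sorted(morse, key=lambda c: -freq[c]):
    print(letter, morse[letter], len(morse[letter]))
```

This is the same principle, matching code length to symbol frequency, that Shannon would later make precise; Morse and Vail arrived at it by counting printer's type.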
Morse kept showing his telegraph to anyone interested. These demonstrations eventually paid off when, in 1843, the United States Congress approved funding for Morse to construct his first telegraph line, running between Washington, DC and Baltimore. The choice of these two cities could not have been better. In 1844, as the line was being completed, the Whig convention for choosing the presidential and vice-presidential candidates was being held in Baltimore. Morse, realizing people in Washington were eager to hear news of what was happening, started sending messages using the telegraph. This was much quicker than the train. As news spread that this was the best way to hear about the latest events, crowds built up at both ends of the telegraph.
The Democratic convention, also in Baltimore, started shortly afterwards. The meeting was contentious. Martin Van Buren was a leading contender for the presidential position. James Polk was in the running as Van Buren's vice president. In the first round of voting, Van Buren received the most votes but failed to secure the two-thirds majority needed to win the nomination. From this point on, things became complicated. There were many rounds of voting, all ending in a deadlock. Eventually, Polk, whom nobody had initially considered for the presidential role, was selected as a compromise candidate whom most people could get behind. People in Washington were avidly following the Baltimore convention. They mobbed Morse and his telegraph.
After they chose Polk as the presidential candidate, the convention had to choose someone to run as vice president. Again, Morse and his telegraph played a central role. The convention chose Silas Wright. Wright was not at the convention but in Washington. He learned of his nomination by telegraph. He also declined by telegraph.
Morse and his telegraph were now famous. After this, the number of telegraph lines grew exponentially.

1.4 THE ANALOG REVOLUTION

The theory we have looked at so far concerns digital data. As we noted before, written communication is digital. During the nineteenth century, the telegraph transformed the speed of communication and this, along with faster transportation with the construction of railroads, made the world much more connected. But the telegraph didn't change the nature of communication. It was still digital.
Telegraphs were useful for conveying urgent messages, but there was no widespread demand to have telegraphs installed in private homes. This changed with the telephone, when people could actually hear their friends and family on the other end of the line. People wanted their own telephones where they could have a conversation, both parties hearing the other's voice. In addition, people wanted to listen to music on vinyl records (later cassettes). They wanted to listen to the radio and watch television. These were all information technologies that changed life in our homes. They were all analog. They would all become digital.

1.5 THE WAY FORWARD

Why did analog technology become digital? What advantage does digital communication have over analog? These are questions at the heart of the Digital Revolution, and these are the questions we answer in this book.
Morse constructed his code to convey written English efficiently, assigning shorter strings of dots and dashes to more commonly used letters. But written English contains both information and redundancy. Can we shorten messages by keeping the information and eliminating the redundancy? As we shall see, the answer is yes. But before we can do this, we need to define information and redundancy.
Napoleon Bonaparte conveyed military information using the optical telegraph. He faced the same problem as a person today connecting to public Wi-Fi. How do you stop eavesdroppers from accessing information you would like to keep secret? Clearly, you need to use encryption, but how should this be done?
Errors occur during transmission. If the message is not encrypted, humans can sometimes detect errors: the word KWAKER seemed like a typo, and repeated transmissions between the receiver and sender eventually led to the receiver realizing the intended word was QUAKER. But we want an error-correction method that works automatically, even on encrypted messages. We show how this can be done.
In the following chapters, we outline the theories underlying the Digital Revolution, ushering in the Information Age. We start by seeing how digital messages can be conveyed efficiently, securely, and without errors. We begin with information. How can this be defined mathematically, and what mathematical properties does it have?

2 INFORMATION

We begin our study of the mathematical theory of information. The first task is to define information in such a way that we can objectively measure the amount of information in a message.
Our starting point is a remarkable paper by Claude Shannon, titled "A Mathematical Theory of Communication." In 1948, Shannon was working at Bell Labs and published his work in the Bell System Technical Journal. It was immediately recognized as being both groundbreaking and highly original. From the publication of this paper onwards, Shannon was regarded as a genius by those who understood it.
One of the many who realized the importance of the paper was Warren Weaver. Weaver was director of the Division of Natural Sciences at the Rockefeller Foundation. Throughout his life, Weaver was a powerful advocate for educating the public about science. When Shannon's paper, unchanged except for some minor corrections, was published as a book in 1949, Weaver wrote the introduction. The book has been in print ever since. One change from the original is the title. At first glance, it may seem minor, but the article speaks volumes. Shannon's original paper was called A Mathematical Theory of Communication. To emphasize the importance of the ideas, it became The Mathematical Theory of Communication.
We will devote the next three chapters to this paper, but first, we mention the title. Shannon called it A Mathematical Theory of Communication. He is clearly stating this is a mathematical theory. Mathematics is central. The definition of information provides the beginning of the path along which we will travel, and this chapter concentrates on how he defines it. But, as we will see in the subsequent chapters, it is mathematics that guides us, showing connections we cannot see without it. Shannon's title also tells us this is a theory about communication, and this is indeed what he is interested in. How should we communicate efficiently? When we communicate, we want to convey information, and we want to do this effectively. Shannon needs the concept of information for his theory of communication, but his primary focus is on communication.

2.1 BELL LABS

For most of the twentieth century, AT&T (American Telephone and Telegraph) had a monopoly on telephone service in the United States. In exchange for being allowed to have a monopoly, AT&T guaranteed phone service to rural areas that would be subsidized by the rest of the network. (Interestingly, the president of AT&T, who arranged the agreement, was Theodore Vail, a cousin of Morse's partner, Alfred Vail.)
This was a delicate balance. A monopoly meant AT&T could set its prices, but if its profits were excessive, there would be complaints about lack of competition and then the US regulators could end it. This is what eventually happened. It was broken up in 1982, after the monopoly had been in place for almost a century.
In 1925, AT&T founded a major research laboratory, Bell Telephone Laboratories, commonly called just Bell Labs. Since AT&T was highly profitable, Bell Labs could afford to pay employees more than if they were at an academic institution. Because AT&T had no competitors, there wasn't pressure for researchers to look only to the next year or two. Instead, AT&T could spend money on projects that might not come to fruition for a decade or more. Positions at the Labs were prestigious, well paid, and with a fair amount of freedom in research topics. This meant they could hire some of the best graduating PhDs each year.
A mathematics department was formed within the Labs. Initially, the intent was for mathematicians to play a supportive role in helping scientists with mathematical problems related to the telephone system. But, naturally, being a group of talented people, they began working on problems they found mathematically interesting.
After Shannon finished his PhD in mathematics at MIT, he was awarded a fellowship at the Institute for Advanced Study in Princeton, New Jersey.
He started in September 1940, but just days later, President Franklin D. Roosevelt passed the Selective Training and Service Act requiring men between the ages of 21 and 35 to register for the draft. Shannon realized he was not suited for combat but that his mathematical ability might be useful for military applications. He had had a couple of summer internships at Bell Labs and knew the scientists there were involved in many aspects of the upcoming war. He was soon offered a full-time position at the Labs, which he gladly accepted. Initially, he worked in ballistics, then doing important work in cryptography.
After the war, Shannon turned his attention to communication. Researchers at the Labs had produced many papers on the best way to send signals along wires, finding optimal solutions for practical problems. But Shannon was interested in finding a general theory of communication. The result was the famous paper in which he presented a complete theory.

2.2 A MATHEMATICAL THEORY

As we move to a mathematical description, we need to define our terms more precisely. We will look at strings of symbols. As we mentioned in the previous chapter, the word alphabet describes the complete list of all symbols that are allowed in these strings. So, for example, if we are describing the string of symbols that make the text of this book, we include not only the standard uppercase and lowercase alphabet, but we also include all the punctuation symbols and the space.
The alphabet we will practically always use comprises just two symbols: 0 and 1. But, as we have seen, knowing the letters of the alphabet is only a small part of learning how to write. We also need to know how to assign the letters to the data.

ZEROS AND ONES, BUT NO GAPS.

We will use binary, the alphabet consisting solely of 0 and 1. Why? There are two important reasons, one practical and one mathematical. The practical reason is that much of communication in electronics is based on the switch. A switch has two states, on and off, and so it is ideally suited for dealing with a binary alphabet. We can assign 0 to one state and 1 to the other.
It is possible to encode any alphabet into binary. For example, the ASCII (American Standard Code for Information Interchange) code for A is 01000001. Given any alphabet, we can devise a way of assigning a string of 0s and 1s to each letter. As we shall see later, this must be done carefully, but the important point is that it can be done.
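To make the encoding concrete, here is a small Python sketch (mine, not the book's; the names `to_binary` and `from_binary` are just illustrative) that turns text into 8-bit ASCII codewords and back:

```python
# Encode each character as its 8-bit ASCII codeword, then decode again.
def to_binary(text):
    return "".join(format(ord(ch), "08b") for ch in text)

def from_binary(bits):
    # Read the string back 8 bits at a time.
    return "".join(chr(int(bits[i:i + 8], 2)) for i in range(0, len(bits), 8))

print(to_binary("A"))                 # 01000001
print(from_binary(to_binary("BIT")))  # BIT
```

The round trip loses nothing, which is exactly the point: any alphabet can be carried by strings of 0s and 1s.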
Two symbols are the minimum number of symbols an alphabet must have to encode every other alphabet. At first glance, it might seem we could get away with just one symbol. What if we just use the symbol 1, and encoded A with 1, B with 11, C with 111, and so on? The problem with this is that if I send 111, you cannot tell whether I meant to send AAA, AB, or C. Of course, if we adopted spaces, it would clear the ambiguity, but as we noted, the space is another symbol, and if we were to include it, we would be back to a binary system. The mathematical reason for using binary is that it is the smallest alphabet that can encode any other alphabet.

BINARY DIGITS

The decimal digits are 0, 1, 2, 3, 4, 5, 6, 7, 8, and 9. We use these ten digits every day. There are just two binary digits, 0 and 1. Shannon introduced the term bit as a portmanteau of the two words binary digit in his classic work. He attributes it to John Tukey, who used it previously in a Bell Labs memo.
What can be said about the bit? First, the two positions of a switch can be represented as 0 and 1. So, it is natural to connect bits to switches. Second, just as we can use decimal digits to represent all the real numbers, we can use binary digits to represent them. It might appear there is not much more to say, but there is. There is another meaning of the term bit we use when talking about information. Sometimes the two meanings overlap, but sometimes they don't. An example will help to illustrate what a bit of information is and why it is sometimes different from a binary digit thought of as a digit. After the example, we will give the definition of what a bit of information is.

2.3 A BIT OF INFORMATION VERSUS A BINARY DIGIT

We will consider three scenarios. In each, I will send you a brief text at the start of the day. In the first, I toss a coin. If it lands heads up, I send you a 0. If heads down, I send a 1. So, you receive a binary digit each morning and, though we have yet to define bits of information, you also receive 1 bit of information telling you the result of my coin toss. In the second scenario, I toss the coin twice. There are four possible outcomes, which we can denote

by HH, HT, TH, and TT, where H denotes heads up and T denotes tails up. As before, I am going to assign 0 to H and 1 to T and so will send you one of four two-bit texts: 00, 01, 10, or 11. As you might guess, each of these corresponds to 2 bits of information, the two outcomes from tossing the coin twice. The third case comes from adding a third toss. Now there are eight possibilities (HHH, HHT, HTH, HTT, THH, THT, TTH, TTT) and I send you the appropriate digit string of 0s and 1s. You, of course, are now receiving 3 bits of information. For example, if you receive 000, you have 3 bits of information telling you my coin landed heads up all three times. Now that we have the initial setup, let's change things slightly.
As in the third scenario, I am going to toss my coin three times, but this time, if it lands heads up three times in a row, I will send just a 0. For the other seven possible outcomes, I send just a 1. Each day you will receive just one binary digit, but on days you receive a 0 you know my coin landed heads up all three times, and, as we noted above, this is 3 bits of information. Of course, on days you receive a 1, you receive much less than 3 bits of information: you only know that it could correspond to one of seven possible outcomes. We will give the mathematical definition of information shortly, but it is clear in this case that the binary digit 0 conveys not 1 but 3 bits of information, while the binary digit 1 conveys some information, but not a lot.
Let's look at these experiments from a more mathematical perspective. When we tossed the coin once, there were just two outcomes, H or T, each of which corresponds to 1 bit of information. Note, 2 = 2¹. When we tossed the coin twice, there were four outcomes, each of which corresponds to 2 bits of information. Note, 4 = 2². Tossing the coin three times gives eight possible outcomes, each of which corresponds to 3 bits of information. Again, note 8 = 2³. There is a pattern and so we can generalize. If we toss the coin n times, there will be 2ⁿ outcomes, and each of these outcomes, a string of length n, will give n bits of information. The number of tosses is exactly the number of bits of information. We can switch the emphasis from the number of tosses to the number of outcomes. If we toss the coin several times and are told there are N equally likely outcomes, then we can calculate the number of tosses, or, equivalently, the number of bits of information, b, by solving the equation 2ᵇ = N. For example, if we are told there are 32 possible outcomes, then we solve 2ᵇ = 32 to discover each outcome gives 5 bits of information. If you have seen exponential functions and logarithms, you may recall that 2ᵇ = N and b = log₂ N convey the same information; we solve one for N, the other for b. We can summarize this as saying: If there are N equally likely outcomes, then each outcome gives log₂ N bits of information.
In our examples so far, our number of outcomes has been 2 raised to an integer power, but we will use our formula for any positive integer value N. So, for example, rolling a fair die yields one of six equally likely possible outcomes. The amount of information from one roll is log₂ 6. This is 2.585 to three decimal places.
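In Python (a sketch under the definition above, not code from the book; the function name is mine), the information in one of N equally likely outcomes is simply log₂ N:

```python
import math

def bits_of_information(n_outcomes):
    """Bits conveyed by one of n equally likely outcomes: log2(n)."""
    return math.log2(n_outcomes)

print(bits_of_information(8))            # three coin tosses: 3.0
print(bits_of_information(32))           # 5.0
print(round(bits_of_information(6), 3))  # one roll of a fair die: 2.585
```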
Most of us are not familiar with logarithms to the base 2, but it is straightforward to calculate approximations by using the fact that logarithms are increasing functions: the logarithm of a larger number is larger than the logarithm of a smaller one. Clearly, six is greater than four and less than eight. We know one of four possible outcomes gives 2 bits of information, and one of eight possible outcomes gives 3 bits of information; consequently, one of six possible outcomes will give us more than 2 but fewer than 3 bits of information. We can't see intuitively that it will give 2.585 bits of information, but we can quickly deduce it will be between 2 and 3 bits. In general, if a number lies between 2ᵃ and 2ᵃ⁺¹, its logarithm to base 2 will lie between a and a + 1.
Figure 2.1 shows the graph of y = log₂ x, which we can also label as x = 2ʸ. To find the logarithm of a number N, you locate N on the x-axis, then move vertically until you meet the graph, and finally travel horizontally from that point to log₂ N on the y-axis.
Doubling the number of outcomes always results in the information per outcome increasing by 1 bit. Since one out of six outcomes gives 2.585 bits of information, we can deduce one out of twelve possible outcomes gives 3.585 bits of information, one out of twenty-four gives 4.585 bits, one out of three gives 1.585 bits, and so on.

2.4 THE MATHEMATICAL DEFINITION OF THE NUMBER OF BITS OF INFORMATION

We now know how to deal with cases where the pieces of information are all equally likely, but most times, this won't be true. Some pieces of information occur more often than others. We need to include this fact. We generalize our definition of information by incorporating the probability
FIGURE 2.1
Graph of y = log₂ x or, equivalently, x = 2ʸ.
we will receive a particular piece. When we have N possible outcomes that are all equally likely, each occurs with probability 1/N. We said the number of bits of information is log₂ N. We can rewrite this as log₂(1/(1/N)): the logarithm of the reciprocal of the probability. This gives our definition:
Suppose I occurs with probability p, then I conveys log₂(1/p) bits of information.
There are several equivalent ways of writing log₂(1/p):

log₂(1/p) = log₂ p⁻¹ = −log₂ p
We will usually express the above definition as:

Suppose I occurs with probability p, then I conveys −log₂ p bits of information.

This formulation is easier to write because we don't have the reciprocal, but that minus sign can cause confusion. Remember p is usually between 0 and 1, so log₂ p is usually negative, and, consequently, −log₂ p is positive, not negative.
If p = 1, our formula gives 0 bits of information. This makes sense. The event is bound to occur, so when it occurs it doesn't tell us anything. If p is 0, our formula doesn't make sense, but we don't need to worry about how much information we receive when there is no possibility of receiving it! We receive 0 bits.
We now return to the example where I toss a coin three times and send you 0 if I get three heads and 1 in the seven other cases. The probability I get three heads is 1/8 and the probability I don't is 7/8. Consequently, the amount of information you receive if you receive a 0 is log₂ 8, which is 3. This agrees with what we said before, but now we can calculate the number of bits of information from receiving a 1. This is log₂(8/7), which, to three decimal places, is 0.193.
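The same arithmetic in a short Python sketch (mine, not the book's; the function name anticipates the term surprisal introduced below):

```python
import math

def surprisal(p):
    """Information, in bits, of an event with probability p: log2(1/p)."""
    return math.log2(1 / p)

print(surprisal(1 / 8))            # receiving a 0 (three heads): 3.0
print(round(surprisal(7 / 8), 3))  # receiving a 1: 0.193
```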
Usually, we will transmit information using 1s and 0s, but we can use any symbols, and any number of symbols. The amount of information we get from receiving a certain symbol is calculated from the probability of receiving the symbol. As the probability decreases, the number of bits of information increases.
Another term that is occasionally used for information is surprisal. The more unlikely an event is, the higher the level of surprise, the more bits of information the event conveys. Hearing you didn't win the lottery is not surprising and carries little information; hearing you won is both surprising and contains a lot of information! However, we must be careful with both the terms surprise and information. They both have many meanings in everyday English. Often these are subjective. For our mathematical theory, we need an unambiguous, objective definition. From now on, the information given by an event is the base 2 logarithm of the reciprocal of the probability of the event occurring.
When we are sending 0s and 1s, each 0 and 1 will give us 1 bit of information only if the probability of each is a half. If the probability of receiving a
0 is not the same as receiving a 1, then one of the two symbols gives more than 1 bit of information, and the other less. However, we are not usually interested in sending and receiving just one symbol. Datasets and communication rarely involve just one symbol; usually there are thousands. We are interested in the total number of bits of information sent. This leads to calculating the average number of bits of information per symbol received. This is what we do next.

2.5 ENTROPY—THE AVERAGE INFORMATION PER SYMBOL

Consider our example of sending 0s and 1s, with associated probabilities of 1/8 and 7/8. We know 0 gives 3 bits of information and 1 gives 0.193 bits. Now suppose we have a long string. What is the average amount of information per symbol? It is not the average of 3 and 0.193, because we will not have an equal number of 0s and 1s. The proportions correspond to the probabilities. We get 0s about one-eighth of the time and 1s seven-eighths of the time, so one-eighth of the time we receive 3 bits and seven-eighths of the time 0.193 bits. Consequently, on average we receive (1/8) × 3 + (7/8) × 0.193 ≈ 0.544 bits per symbol. This quantity is called the entropy of the message. Entropy, the average information per symbol, is a fundamental quantity that, as we shall see, much of information theory is built on.
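We can check this average in Python (a sketch, not code from the book; `entropy` simply weights each symbol's information by its probability):

```python
import math

def entropy(probs):
    """Average information per symbol: sum of p * log2(1/p) over the symbols."""
    return sum(p * math.log2(1 / p) for p in probs if p > 0)

print(round(entropy([1 / 8, 7 / 8]), 3))  # 0.544
print(entropy([1 / 2, 1 / 2]))            # a fair coin: 1.0
```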
More generally, let p denote the probability of receiving a 0 and q the probability of receiving a 1. Each 0 gives log₂(1/p) bits of information. Each 1 gives log₂(1/q) bits. The proportion of 0s and 1s in a long string will be given by these probabilities. The average amount of information per symbol is p log₂(1/p) + q log₂(1/q). We define this to be the entropy H.
Since this formula is important in what follows, we will restate it in bold. The entropy, the average amount of information per symbol, is given by:

H = p log₂(1/p) + q log₂(1/q)
We know probabilities sum to 1, so q is equal to 1 − p. This means we can rewrite the formula for entropy:

H = p log₂(1/p) + (1 − p) log₂(1/(1 − p))
One thing we should note is that we are assuming the probability of receiving a 0 or 1 does not change throughout the string. We are assuming, no matter what we have received before, the probability of receiving a 0 is always p. The technical name for this property is independence. The probabilities of receiving a 0 or 1 do not depend on what has been sent previously. Messages in real life rarely have this property. If we are sending a message written in English, the probability of receiving the letter q is quite low if we have just received a q. On the other hand, the probability of receiving a u after just receiving a q is high. Assuming that the letters in our alphabet are independent of what we have received before is a simplification, but it allows us to give a definition of entropy that concisely contains the essential idea.
If this is not the case, then we need to calculate the information associated with a message symbol using conditional probability: the probability we will receive the symbol given the portion of the message we have received so far. We calculate the entropy as above but use the appropriate conditional probabilities at each stage.

2.6 ENGLISH LANGUAGE

Shannon studied texts in English. In his book, he gives several examples of strings using the standard 26 letters along with a space. He considered various randomly generated approximations to English in which the probability of each letter corresponds to the frequency with which the letter occurs. In the first-order approximation, the letters are independent: the prior letter has no effect on the subsequent one. He generated:
OCRO HLI RGWR NMIELWIS EU LL NBNESEBYA TH EEI ALHENHTTPA OOBTTVA NAH BRL.
In the second-order approximation, the probabilities are no longer independent. The probability of the next letter depends on the last letter received, but not on prior letters. Pairs of letters now occur with the frequencies they occur in English. Shannon describes a practical method of generation: ". . . one opens a book at random and selects a letter at random on that page. The letter is recorded. Then the book is opened at another page, and one reads until this letter is encountered. The succeeding letter is then recorded. Turning to another page, this second letter is searched for, and the succeeding letter recorded, etc." Using this method, he generated:
ON IE ANTSOUTINYS ARE T INCTORE ST BE S DEAMY ACHIN D ILONASIVE TUCOOWE AT TEASONARE FUSO TIZIN ANDY TOBE SEACE CTISBE.
In the third-order approximation, the probability of a letter depends on the previous two letters. Triples of letters occur with correct frequencies. Shannon obtained:
IN NO IST LAT WHEY CRATICT FROURE BIRS GROCID PONDENOME OF DEMONSTURES OF THE REPTAGIN IS REGOACTIONA OF CRE.
We are still a long way from recognizable English, but at each stage, the approximations become noticeably better.
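Shannon's first-order construction is easy to imitate in Python. This sketch (mine, not from the book) uses a handful of illustrative letter frequencies, not Shannon's exact table, and only a few symbols rather than all 27:

```python
import random

# Rough relative frequencies for a few common symbols (illustrative values;
# a faithful run would use all 26 letters plus the space).
FREQ = {" ": 18.0, "e": 12.7, "t": 9.1, "a": 8.2, "o": 7.5,
        "i": 7.0, "n": 6.7, "s": 6.3, "h": 6.1, "r": 6.0}

def first_order(n, table=FREQ, seed=None):
    """Draw n symbols independently, weighted by frequency."""
    rng = random.Random(seed)
    symbols = list(table)
    weights = [table[s] for s in symbols]
    return "".join(rng.choices(symbols, weights=weights, k=n))

print(first_order(40, seed=1))
```

Because each draw ignores the previous one, the output has English-like letter proportions but no English-like structure, just as in Shannon's first example above.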
Nowadays, Shannon's approach has been extended into software that is often embedded in text editors. After typing a few characters, the software calculates the probabilities of what we are going to type next and suggests options for the completion of the word we are typing, sometimes even suggesting the next few words in the phrase. Clearly, a sentence written in English is far from a random string of symbols. Indeed, once certain symbols have been typed, we can be fairly certain what the next few will be.
Entropy calculations involving strings of symbols where the probability of receiving a symbol depends on what someone has sent beforehand can become rather messy. We will stick to strings where the probability of receiving a symbol is independent of what has happened before, what Shannon calls first-order approximations. We do this for two reasons. First, the mathematics is much simpler, but it still illustrates the important concepts. Second, these types of first-order approximations convey the surprising nature of information in a much clearer way. It is not too surprising that we can replace a string written in English with a shorter string and not lose any information. It is far more surprising, as we shall see in the next chapter, that a string, where the letters are generated randomly and are independent of previous letters, can sometimes be replaced with a shorter string containing the same information.

2.7 MAXIMIZING ENTROPY

We return to the case where we are receiving 0 with probability p and 1 with probability 1 − p.
These probabilities do not vary. They do not depend on what has gone before. It is a simplification, but it allows us to give a concise formula for the entropy. It will allow us to see the central ideas without getting lost in the weeds.
FIGURE 2.2
Graph of entropy versus probability.
As we noted, for this case, the entropy is given by H = p log₂(1/p) + (1 − p) log₂(1/(1 − p)). Figure 2.2 shows the graph of entropy versus the probability. There are several points to observe. The first is that the graph is symmetric about the line p = 1/2. This is not surprising. It says the average information is unchanged if we interchange the roles of 0 and 1. The second point is, though H(0) and H(1) are not defined, it makes sense to define them both to be 0. These two values extend the graph continuously. (Technically, we take the limits as the values of p approach 0 from the right and 1 from the left.) The third, and most important, observation is entropy is maximized when p = 1/2. For every other value, the entropy is less. The value of the entropy when p = 1/2 is 1.
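A quick numerical scan (a Python sketch, not from the book) confirms the peak sits at p = 1/2:

```python
import math

def H(p):
    """Entropy of a 0/1 source with P(0) = p."""
    if p in (0.0, 1.0):
        return 0.0  # defined as 0 by the limiting argument in the text
    return p * math.log2(1 / p) + (1 - p) * math.log2(1 / (1 - p))

# Scan p from 0 to 1 in steps of 0.001 and find where entropy peaks.
best_p = max((i / 1000 for i in range(1001)), key=H)
print(best_p, H(best_p))  # 0.5 1.0
```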
Suppose we receive a long string of 0s and 1s. Our third observation says this string can convey at most 1 bit of information per symbol, and this will occur if the 0s and 1s occur at random with equal probability. At first sight, this seems strange. If you ask someone how much information is contained in a random string, they will probably say none. Randomness is often used synonymously with meaninglessness. We talk of random acts of violence, for example. But Shannon is saying a random string contains

more information than any nonrandom string. How do we reconcile these seemingly contradictory ideas?
We have three terms that need to be clearly understood: meaningfulness, randomness, and information. Meaningfulness is subjective. Something that might convey a lot of meaning to you might convey nothing to me. We are not giving a mathematical definition of this term, and our model has nothing to say about it. Randomness is mathematical. We are saying the next symbol in our string is completely independent of the one we have just received. If we have just received a 0, we have no information about what will come next. It is equally likely to be a 0 as a 1. Entropy, the average information per symbol, is also defined mathematically. The nagging question that remains is whether our mathematical definition of information connects to what we mean when we talk about information. As we delve further into the theory, it will become clearer that it does. The following example helps to illustrate the next steps we should take and shows that the definition of information is both sensible and useful.
We return to the example where I toss a coin three times and send you a 0 if I get three heads and a 1 in the other cases. Instead of sending you 0s and 1s, I will send you Ys and Ns. If I get three heads, I send a Y, and I send you an N in the other cases. Of course, nothing much has changed from the original problem. I have exchanged 0 and 1 for Y and N. Now suppose I perform a thousand repetitions of the experiment of tossing a coin three times, and send you the string of a thousand symbols, each of which is a Y or an N. As we noted above, the probability of sending a Y is 1/8 and the probability of an N is 7/8. We calculated the entropy of this system to be 0.544 bits per symbol. Since the string comprises a thousand symbols, it contains about 544 bits of information.
From our earlier work, we know it is possible to send 1 bit of information with 0s and 1s if each of these is sent with equal probability. We can deduce from this that it is possible to send 544 bits of information using a string of 0s and 1s of length 544, each one of which carries 1 bit of information. The string of 0s and 1s of length 544 and the string of Ys and Ns of length 1000 both contain 544 bits of information. Can we arrange things so that they carry the same information?
Suppose instead of sending Ys and Ns, we agree I should send you a different string consisting of 0s and 1s. The situation is: I have a string of a thousand symbols comprising Ys and Ns. We want you to end up with exactly this string. To do this, I am going to convert my string of Ys and Ns to 0s and 1s. I send you the new string. You will convert the string of 0s and 1s back to get the string of Ys and Ns. It looks as though we are going around in circles, but bear with me!
We know the initial and final strings contain a thousand symbols and 544 bits of information. However, we also know that with 0s and 1s it is possible to send 1 bit per symbol on average. This, in turn, leads to the question of whether it is possible to convert the original string of a thousand symbols into one comprising just 544 symbols, transmit the shorter string, and then reconstruct the original string. Can we compress a string that conveys 0.544 bits per symbol into a shorter string that conveys 1 bit per symbol?
Messages in English contain redundancy and can be shortened. This is because English has rules and patterns we can exploit. We could eliminate the u after any q, for example. But does our string of Ys and Ns contain redundancy? It is generated by a random method. There is no pattern. Surprisingly, Shannon discovered strings like this contain redundancy and can be shortened. It is possible to compress the string to get a new one that conveys almost one bit per 0 and 1. I urge you to read through this paragraph again carefully. Shannon is saying something that is not at all obvious. He is saying if the entropy of a string is less than 1, then there is a way of compressing the string. The ideal is to have a string that conveys 1 bit per symbol, and it is possible to compress to almost reach that. This is Shannon's First Theorem (sometimes called the Source Coding Theorem or the Noiseless Coding Theorem). We are familiar with data compression. Text and image files are almost always compressed before being stored or transmitted. Shannon's First Theorem is the basic result of compression: you take an initial string that contains less than 1 bit of information per symbol and compress until you are close to having 1 bit per symbol.
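To see why such a string is compressible, here is one simple scheme in Python (a sketch of the idea, mine rather than the book's construction): encode pairs of symbols, giving the shortest codeword to the likeliest pair. With P(Y) = 1/8, the pair NN occurs with probability 49/64, NY and YN with 7/64 each, and YY with 1/64, so the expected cost is (49·1 + 7·2 + 7·3 + 1·3)/64 ≈ 1.36 bits per pair, about 0.68 bits per original symbol: already well below 1, though still above the 0.544-bit floor.

```python
import random

# A prefix code on PAIRS of Y/N symbols: likelier pairs get shorter codewords.
CODE = {"NN": "0", "NY": "10", "YN": "110", "YY": "111"}
DECODE = {bits: pair for pair, bits in CODE.items()}

def compress(s):
    # Requires an even-length string of Ys and Ns.
    return "".join(CODE[s[i:i + 2]] for i in range(0, len(s), 2))

def decompress(bits):
    out, buf = [], ""
    for b in bits:
        buf += b
        if buf in DECODE:  # prefix code: no codeword begins another
            out.append(DECODE[buf])
            buf = ""
    return "".join(out)

rng = random.Random(0)
msg = "".join("Y" if rng.random() < 1 / 8 else "N" for _ in range(1000))
packed = compress(msg)
assert decompress(packed) == msg  # no information is lost
print(len(msg), "symbols ->", len(packed), "bits")  # roughly 680 bits
```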
We now have an intuitive idea of the theorem, but why is it true? How do we compress strings? These are the questions we turn to in the next chapter.

3 INFORMATION, REDUNDANCY, AND COMPRESSION

Information, redundancy, and compression are three closely related concepts. A message usually contains information, but often there is redundancy. We would like to remove redundancy from the message while keeping all the information, resulting in a more concise, shorter message.
Compression is important for both storage and data transfer. When we download a file, it is compressed before it is sent and then decompressed before we interact with it; compression allows large files to be sent quickly. We can also save storage space by keeping the compressed files rather than the uncompressed ones.
Given a message of 0s and 1s, Shannon's theory gives us a mathematically defined way of calculating the information and the average amount of information conveyed per symbol, the entropy, denoted by H. We also know that a message of 0s and 1s has a maximum possible entropy of 1 bit per symbol. This means we can quantify redundancy. If we let R denote redundancy, then R = 1 - H. The goal of compression is to shorten the message, so it contains all the information but hardly any of the redundancy.
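The entropy and redundancy of the coin example from the last chapter (a Y with probability 1/8, an N with probability 7/8) can be checked in a few lines of Python. This is a sketch; the variable names are chosen here:

```python
from math import log2

p = 1 / 8  # probability of a Y (three heads in three tosses)

# Entropy in bits per symbol, from Shannon's formula
H = -(p * log2(p) + (1 - p) * log2(1 - p))

# Redundancy: how far the entropy falls short of 1 bit per symbol
R = 1 - H

print(round(H, 3), round(R, 3))  # 0.544 0.456
```

The printed values match the 0.544 bits per symbol quoted in the text.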
In this chapter, we generate a string, calculate its entropy and redundancy, then see how the string can be compressed. It is the relation among entropy, redundancy, and compression that shows the power of Shannon's definition of information. So, this chapter is important conceptually, but data compression, though we are rarely aware of it, is essential to much of our digital life. Our primary emphasis in this book is on concepts and not on applications, but since compression is ubiquitous and not widely appreciated, we will look at one example showing where compression is used.

3.1 NEED FOR COMPRESSION

I bought my first computer in the 1980s. To send and receive data, you needed a separate modem that connected the computer to the telephone jack. My modem was one of the fastest, able to send 2,400 bits per second. In the present, my laptop is connected to the internet via Wi-Fi and can download data at around 200 megabits per second. This is around 83,000 times faster than my initial modem. Put another way, the amount of data I can now download in one second would have taken about a day with my original modem. What am I doing that requires all this speed? Clearly, I am transferring much larger data files. One important example is video. In the 1980s, photographs and videos were analog. There was no easy way to convert them to digital files, and so most people didn't see any connection between video and computers. Nowadays, we record most photos and videos digitally. We are at an age when we can not only send videos to one another but can stream them. This fairly new ability is enabled by high data transmission speeds. We can watch movies as they are downloading. But let's look at this in a little more detail.
A 4K picture has a resolution of 3840 × 2160 pixels. (There are several slightly different 4K formats, but they all have approximately 4000 pixels horizontally; this is the 4K.) Each pixel records the intensity of three colors: red, green, and blue. The intensities can take on one of 2^8 = 256 values, which corresponds to 8 bits of information. Each second of movie contains sixty frames. This means one second of video will have 3840 × 2160 × 3 × 8 × 60 bits of information, about 12 billion bits! And we haven't included the audio! How can we stream 4K video with current internet speeds? My Wi-Fi is reasonably fast at 200 million bits per second, but this is about sixty times slower than the 12 billion bits per second that 4K video seems to require. The answer, of course, is compression. We compress the data before transmitting it and decompress before we watch it. Our current data speeds may initially seem fast, and the storage on our phones and laptops large, but we also need effective compression. Most of us know how much storage we have, we know the speed of our internet, but we know almost nothing about the equally important third component, compression.
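The arithmetic is easy to verify; a quick sketch, assuming the common 3840 × 2160 pixel 4K format:

```python
# Uncompressed bits for one second of 4K video:
# width x height x 3 colors x 8 bits per color x 60 frames
bits_per_second = 3840 * 2160 * 3 * 8 * 60
print(bits_per_second)  # 11943936000, about 12 billion

wifi = 200_000_000  # a 200 megabit per second connection
print(bits_per_second / wifi)  # roughly 60 times too slow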
There are two types of compression: lossy and lossless. Both types produce shorter files by eliminating some of the redundancy. In lossy compression, we lose some of the information. Lossless compression, as the name suggests, is compression in which all the information is retained in the compressed file. Image files, such as JPEG, use lossy compression. The compressed image does not contain the same information as the original. If you pixel peep and compare the compressed and uncompressed images, there will be differences, but if the compression is done well, the casual observer won't notice them. In lossy compression, we are eliminating redundancy but also some of the less important information. However, sometimes we need all the information. A good example is text files, where we use an algorithm like zip to compress. We want to decompress a compressed text file and get exactly the same text, not a fairly good approximation. For text files, we need lossless compression. We need to preserve the information in its entirety. Lossless compression is, for me, the more interesting case, and this is what we will study.
It is clear that if the file has some underlying pattern, then we can exploit it to reduce the size of the file. For example, if an image has a large expanse of blue sky, the RGB values of those pixels will all be the same. So, we don't need to send the values separately for each individual pixel; we can give a list of a lot of pixels with the same values. Neighboring pixels often have similar RGB values, so we can often exploit this to shorten the overall file. This makes intuitive sense. What is not intuitively obvious is how we can reduce the size of the file when there is no underlying pattern in such a way that all the information is preserved. Once we have the definitions of information and entropy, mathematics gives us a way forward. To see this, we return to the example from the end of the last chapter.

3.2 THE COIN EXAMPLE

I repeatedly perform an experiment in which I toss a coin three times. In fact, I perform the experiment exactly one thousand times. I send you Ys and Ns. If I get three heads, I send a Y. In the other cases, I send an N. You will receive a string of length 1000, consisting of Ys and Ns. As we noted before, the probability of sending a Y is 1/8 and the probability of an N is 7/8. We calculated the entropy of this system to be 0.544 bits per symbol. Since the string consists of a thousand symbols, it contains about 544 bits of information. The goal is for us to devise a method of compressing the string, so we end up with a shorter string that still has the same information

about the coin tosses. To help with clarity, we consider the original string to comprise Ys and Ns and the compressed one to consist of 0s and 1s. It helps to use a concrete example. Below, I generated a string of length 1000 which we will show how to compress.
My string with added spaces between the letters to help with legibility:
N N N N N N N N N N N N N N N NYNNNNNNNNNNNNNNN YYNNNYNNNNYNNNNNYNNNNNNNNNNNNNNYN NNNNNNNNNNNNNNNNNYYNNNNNNNNNNNNN N NYNNNNNNYNNNNNNNNNNNNYNNNNNNNNNY NNNNNNNNNNNNNNNYNNNNNNNNYNNNNNNN NNNNNNNNNNNNNNNYNNNNNNYYNNNNNYYNN NNYNNNNNNNNNNNNNNNNNNNNNNNNYNYNNN NNNNNNNNNNNNNNNNNNNYYNNNNNNNYNNNN

N NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNYNNNNYNNYNNNNNNNNNNNNNNNNNNNNNN

N N NNNNNNNNNNNNNNYYNNNNYNNNNNNNNNN N N N N N N N N N NYNNNNNNNNNNNNNNNNNNNNYY NNNNYNYNNNYNNNNNNYNYNNNNNNNNNNNNN N NNYNNNNNNNNNNYNNNNNNNYNNYNNNNNNN NNNNNYYNNNNYNNNNNNNNNNNNNNNNNNNNN NNYNNNNNNNNNNNNNNNYNNYNNNNNNNYYYN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN N N N NNNYYNYNNNNNNNNNNNNNYYNNNNNNNN NNYNNNNNNNNNNNNNNNNNNYNNNNNNNYYNN NNNNNNNNNNYNNNNYNNNNNNNNNNYNNNYNN N N N Y N N N N N Y N N N N N N N N N N Y N N N N N N N N Y N N N NNYNNNNNNNNNNNNNNNNNNNNNYNNNNNYNN N YYNYNNNYNNNYYNNNNNYNNNNNNNNYYNNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
We see immediately there are far more Ns than Ys. In fact, there are 887 Ns and 113 Ys. We expect Ys on average one-eighth of the time. Dividing 1000 by 8 gives 125. The number of Ys is less than this, but not by too much. This gives a quick check that I generated the string correctly, but it also reminds us that our dataset is quite small.
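A string like the one above can be reproduced in spirit (not digit for digit) by simulating the experiment. A sketch, with the seed chosen arbitrarily for reproducibility:

```python
import random

random.seed(1)  # arbitrary seed, so the run is reproducible

def experiment():
    """Toss a fair coin three times; report Y only for three heads."""
    heads = sum(random.random() < 0.5 for _ in range(3))
    return "Y" if heads == 3 else "N"

s = "".join(experiment() for _ in range(1000))
print(s.count("Y"), s.count("N"))  # counts near the expected 125 and 875
```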
We will give a clear statement of Shannon's First Theorem a little later, but it essentially tells us that if we have an arbitrarily long string of Ys and Ns generated in this way, we can compress to a string of 0s and 1s that has entropy of 1 bit per symbol. This means not only will the string of 0s and 1s contain the same information in a shorter string, but, as we saw in the last chapter, the string of 0s and 1s will look like a random string in which both 0 and 1 are equally likely to occur. An outside observer who is unaware of our method of compression would not be able to see any pattern and would consider the string to be random. Before continuing, we should say a little about what "arbitrarily long" mentioned in the first sentence of this paragraph means.
Shannon's theorem involves limits that exist as strings get longer and longer. For our example, his theorem does not say we can compress our string of 1000 symbols to one that contains 544 symbols. Instead, it tells us that if we look at strings of longer and longer lengths, we will be able to compress them so that the compression factor becomes closer and closer to 0.544. Later, we will use the ideas underlying the proof of Shannon's theorem to compress our string to 606 symbols. Our goal is to understand how compression works.

3.3 COMPRESSION: SENDING YN-BLOCKS TO 01-BLOCKS

We now have a string of 1,000 symbols containing 544 bits of information. Shannon's First Theorem tells us we should be able to compress to a shorter string of 0s and 1s: the compressed string should have a length of about 544. As we noted above, the theorem involves arbitrarily long strings, and so we might not be able to compress down to a length of 544 for our string, which is rather short for data. To make things easier, both to find an achievable number and to have simpler numbers to work with, let us see how we might compress our string to one of about length 600. What properties must the compressed string have?
Imagine breaking our initial string into blocks of length 10. Instead of thinking of it as a string of 1000 symbols, we will consider it a string of 100 blocks. (For our string the first block is NNNNNNNNNN.) Since we are compressing 100 blocks to a string of 600 0s and 1s, each block will be sent to a block of 0s and 1s of length 6 on average. Since this is important, we restate it as: Each YN-block, containing 10 symbols, will be sent to a block of 0s and 1s. The 01-blocks will have length 6 on average. This follows logically from the statement of the theorem. But it gives us a way forward.
Our first attempt might be devising a way of sending blocks of length 10 consisting of Ys and Ns to blocks of length 6 comprising 0s and 1s, but this doesn't work. The number of distinct blocks of length 10 is 2^10 = 1,024: the first symbol can be one of two things, so can the second, and so on. The number of blocks of length 6 is 2^6 = 64. Since 2^6 is less than 2^10, there is no way of sending the blocks of length 10 to blocks of length 6 without sending two distinct blocks of length 10 to the same block of length 6. But we need distinct blocks of length 10 to be sent to distinct blocks of length 6, otherwise we won't be able to decompress our string uniquely. After this observation, we should look at our previous statement about blocks and read it carefully.
We said each block of 10 containing the symbols Y and N will be sent to a block of 0s and 1s of length 6 on average. Clearly, the on average must be important, and this makes sense. Some blocks of length 10 occur more frequently than others. An N occurs with probability 7/8 and a Y with probability 1/8. The string of 10 Ns will occur with probability (7/8)^10, about 0.26; whereas a string of 10 Ys occurs with probability (1/8)^10, which is minuscule, about 9 × 10^-10. The string of 10 Ns will occur often; the string of 10 Ys will occur extremely infrequently. To end up with a short string of 0s and 1s, it makes sense to send the string of 10 Ns to a short block of 0s and 1s and the string of 10 Ys to a much longer block of 0s and 1s. We should send blocks of Ys and Ns that occur more often to shorter blocks of 0s and 1s than those that occur less frequently. The less frequent blocks of Ys and Ns should be sent to longer blocks of 0s and 1s. This is just common sense. It's exactly what Morse did when devising his eponymous code.
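The two extreme block probabilities quoted above work out as follows (a quick check):

```python
p_Y, p_N = 1 / 8, 7 / 8

p_ten_Ns = p_N ** 10  # the all-N block: about 0.26, so it appears often
p_ten_Ys = p_Y ** 10  # the all-Y block: 2^-30, about 9 x 10^-10

print(p_ten_Ns, p_ten_Ys)
```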
In a while we will describe an algorithm that makes this assignment in an optimal way. But first we look at some potential problems that arise when the blocks of 0s and 1s don't have the same length.

3.4 CODING

We will temporarily consider a new example. We have an alphabet consisting of the four letters A, B, C, and D. We want to assign strings of 0s and 1s to each of the letters in the alphabet.
When assigning 0s and 1s, there are several problems that we need to solve. To illustrate, it helps to introduce some terminology. We will assign to each letter of our original alphabet a string of 0s and 1s. These strings are called code words. We might, for example, assign A to 0, B to 1, C to 10, and D to 11. Here, 0, 1, 10, and 11 are the code words. This assignment is bad. Let's see why.
Suppose we receive the string 0110. Does this come from ABBA, ABC, or ADA? All three strings encode to 0110. This coding leads to ambiguity. We need to design our codes so that they are unambiguous. There are many ways of doing this. Morse used dots, dashes, and, importantly, spaces. The letter E is a single dot followed by a brief pause in international Morse code; T is a single dash followed by a brief pause; and A is a dot followed by a dash and then a pause. It is the pauses that distinguish ET from A. This is what we do in written language. We leave a space between words. Morse's choice of pauses reflects this. This choice meant that operators did not usually look at the tape but could recognize the symbol being sent by sound alone. Many people learnt Morse by ear, rather than by sight. Using pauses or gaps was an excellent idea for Morse, but, as we have mentioned before, the pause adds another symbol to the alphabet. We want to use only 0s and 1s, so this approach does not work for us.
Another way of avoiding ambiguity is to make all the code words the same length. A byte is a string of 8 bits. Unicode is a way of assigning bytes to symbols. Each letter of all the world's major writing systems has a Unicode code word, so do many symbols and emojis. Unicode has been expanded over the years, but all versions have a dictionary that links symbols to their code words written in terms of bytes. Nature has also adopted the constant length approach in our genetic code. RNA, unlike DNA, comprises a single strand of organic molecules called nucleotides. The nucleotides can be one of four different types, denoted by A, C, G, and U. Consequently, we can think of a molecule of RNA as a string written using these four letters. RNA describes how to construct proteins by adding a sequence of amino acids. The letters of the RNA string are read off three at a time. Sequences of three

letters, called codons, describe amino acids. For example, UCU is a code word for serine; CGA codes arginine. There are also codons that tell when to start and stop adding amino acids. The important point for us is amino acids are encoded by exactly three RNA nucleotides. Some amino acids have more than one codon associated with them, but given a codon, there is a unique amino acid. The system is unambiguous (given a string of codons, there is only one string of amino acids), but it is redundant (it is possible for two different strings of codons to give the same string of amino acids). This approach of having code words of the same length is useful in avoiding ambiguity, but it will not work for us. We noted earlier, to compress our data, we needed to assign shorter words to commonly used symbols and longer words to those used less often. You cannot compress if every code word has the same length. We need a fresh approach.

3.5 INSTANTANEOUS CODES

We will start by considering a couple of examples of encoding three letters A, B, and C. In the first example, the code word for A is 1, for B is 011, and for C is 010. Suppose we receive the string 0111010. This can be unambiguously decoded.
For the second example, the code words are: 1 for A, 110 for B, and 010 for C. Suppose we receive the string 1101010. This can also be decoded uniquely. But the first coding is much better. We will examine this a little more carefully.
In the first example, no code word is the beginning of another code word. We say no code word is a prefix of another code word. When we receive the first three symbols 011, we know immediately this corresponds to B. After receiving the next symbol, 1, we know immediately it corresponds to A. Codes with the property that no code word is the prefix of another code word are instantaneous. To illustrate this idea further, we will look at the second example, which is not instantaneous.
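An instantaneous code can be decoded symbol by symbol with nothing more than a lookup. A minimal sketch, using the first example's code as reconstructed here (A to 1, B to 011, C to 010); the function name is chosen for illustration:

```python
code = {"A": "1", "B": "011", "C": "010"}
decode_map = {word: letter for letter, word in code.items()}

# No code word is a prefix of another, so decoding is instantaneous.
assert not any(u != w and w.startswith(u)
               for u in code.values() for w in code.values())

def decode(bits):
    out, buffer = [], ""
    for b in bits:
        buffer += b
        if buffer in decode_map:  # a complete code word: emit it at once
            out.append(decode_map[buffer])
            buffer = ""
    assert buffer == "", "message ended mid code word"
    return "".join(out)

print(decode("0111010"))  # BAC
```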
We start by receiving 1. This is the code word for A but also the start of the code word for B, so we need to receive more information. After receiving 1101, we still cannot determine the starting letter. It might have come from AAC or BA or BB. The next symbol gives 11010. Unfortunately, that still doesn't clear things up. It could come from AAC or BAC. Adding the next symbol gives 110101. This might have come from AACA, AACB, or BAC. After

receiving the ultimate symbol, and knowing we have the complete message, we can narrow it down to BAC.
The second example has code words that are not suffixes of the other code words. Once we have the complete message, we can read it off immediately by starting the decoding at the end and working to the beginning of the message. This shows the coding is not ambiguous, but you must wait until the complete message is received until you can determine the correctly decoded message. This works but is clearly inferior to having an instantaneous code that can be decoded as the message is being received.
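Decoding from the end can also be mechanized: reversing the message turns a suffix-free code into a prefix-free one that is read left to right. A sketch, using the second example's code words as reconstructed here (A to 1, B to 110, C to 010):

```python
code = {"A": "1", "B": "110", "C": "010"}

# Reverse each code word; since no code word is a suffix of another,
# the reversed code is prefix-free and can be decoded greedily.
reversed_map = {word[::-1]: letter for letter, word in code.items()}

def decode_from_end(bits):
    out, buffer = [], ""
    for b in reversed(bits):
        buffer += b
        if buffer in reversed_map:
            out.append(reversed_map[buffer])
            buffer = ""
    assert buffer == "", "message ended mid code word"
    return "".join(reversed(out))

print(decode_from_end("1101010"))  # BAC
```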
Ideally, we want an instantaneous coding that assigns strings to letters according to the probability of the letter being used, shorter strings assigned to high probability letters. David Huffman invented an algorithm for doing exactly this.

3.6 HUFFMAN CODING

There are many methods of coding. The method we will use was invented by David Huffman in 1951 while he was a student at MIT. This is well after Shannon wrote his landmark paper in which he used a different method. However, Huffman's algorithm has an elegant simplicity.
The algorithm has three parts: construction of a tree, labelling the branches, and reading off the code words.

CONSTRUCTING THE TREE

The tree has two components: nodes and branches. We denote nodes by small circles. Branches will be drawn as lines connecting nodes. The tree has one node at the bottom, called the root, and many nodes at the top, called leaves. We will start at the top with the leaves and work our way down. (Usually, mathematical trees are drawn the other way up, with the root at the top, but since we start our construction with the leaves, it is easier to draw if we put them at the top and work down; it also makes it look more like a tree!)
The aim is to give a binary coding for an alphabet with known probabilities. Our tree construction starts with this. For each letter of our alphabet, we draw a node and assign the associated probability. These nodes are the initial nodes under consideration. They are the leaves of the tree.
The algorithm gives an iterative process. At each step, find two nodes under consideration with the smallest probabilities. Construct a new node
FIGURE 3.1
The leaves for the tree.
beneath these two nodes and add branches from the two nodes to the new node. The new node's probability is the sum of the two connecting nodes. The two nodes are removed from the nodes under consideration and the new node is added.
Since we remove two nodes and add one, the number of nodes under consideration decreases by one at each iteration. We keep going until we are left with only one node under consideration. This is the root.
An example will help to make this clear. (We need this example later.) There are eight letters. We will call them A, B, C, D, E, F, G, and H. The letter A has probability 1/512. Letters B, C, and D have probability 7/512. Letters E, F, and G have probability 49/512. Finally, H has probability 343/512.
We start our tree construction with eight nodes corresponding to these letters.
We now start the algorithm for finding the branches by finding the two nodes with the smallest probabilities. This means we must pick A, but there is a choice for the second node: B, C, or D. We can pick any of the three. I picked B because I didn't want to have branches crossing later in the construction. (When drawing a tree, you don't want branches to cross. If you have a tree with crossed branches, you can always rearrange the order of your nodes in a way that uncrosses them.)
Figures 3.2 to 3.8 show each of the iterative steps until we eventually reach the root.
We have done the hard part. The next two steps are easy.

LABELING THE BRANCHES

Each node that is not a leaf has two branches attached above it. Label one with 0 and the other with 1. It doesn't matter how this is done, but I will always label the left branch with 0. The figure below shows the labeled tree.
FIGURE 3.2
Huffman algorithm: step 1.

FIGURE 3.3
Huffman algorithm: step 2.

FIGURE 3.4
Huffman algorithm: step 3.

FIGURE 3.5
Huffman algorithm: step 4.

FIGURE 3.6
Huffman algorithm: step 5.

FIGURE 3.7
Huffman algorithm: step 6.

FIGURE 3.8
Huffman algorithm: step 7.

FIGURE 3.9
Labeling the branches of the Huffman tree.

READING THE CODE WORDS

There is exactly one path from the root to each of the leaves. (We don't allow backtracking.) For each node, find the path and write the labels of the branches on the path that starts at the root and ends at the leaf.
This gives
Letter   Code word
A        00000
B        00001
C        00010
D        00011
E        001
F        010
G        011
H        1
We can see that Huffman coding has properties we want. First, no code word is the prefix of any other code word. (The prefixes of any code word correspond to the nodes on the path from the leaf to the root. We are only using the leaves for the code words.) This ensures our coding is unambiguous. Second, the algorithm assigns longer words to low probability letters and shorter words to letters with higher probabilities. We can say a little more about this assignment.
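Huffman's algorithm is short enough to sketch in code. The version below builds a code for the eight-letter example; the 0/1 labels on some branches may differ from the figures, but the code word lengths, and hence the average, come out the same. The function name is chosen here:

```python
import heapq
import itertools

def huffman_code(probs):
    """Return a prefix-free binary code (symbol -> code word) for the
    given symbol -> probability mapping, built by Huffman's algorithm."""
    counter = itertools.count()  # tie-breaker for equal probabilities
    heap = [(p, next(counter), {s: ""}) for s, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p0, _, left = heapq.heappop(heap)   # two smallest probabilities
        p1, _, right = heapq.heappop(heap)
        # Merge the two subtrees under a new node, prepending a branch label
        merged = {s: "0" + w for s, w in left.items()}
        merged.update({s: "1" + w for s, w in right.items()})
        heapq.heappush(heap, (p0 + p1, next(counter), merged))
    return heap[0][2]

probs = {"A": 1/512, "B": 7/512, "C": 7/512, "D": 7/512,
         "E": 49/512, "F": 49/512, "G": 49/512, "H": 343/512}
codes = huffman_code(probs)
average = sum(probs[s] * len(w) for s, w in codes.items())
print(sorted(len(w) for w in codes.values()), average)  # lengths 1,3,3,3,5,5,5,5
```

The word lengths match the table above, and the average length works out to 894/512, about 1.746, as computed in the next paragraph.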
Suppose we have a very long initial message generated using these letters and probabilities. Let n denote the number of letters in this original message, and we will calculate the number of 0s and 1s in the encoded message. We expect A to occur about n/512 times, B to occur about 7n/512 times, and so on. Each time A occurs, we replace it with a string of five binary digits. This means that if we consider As, there are n/512 in the original message that contribute 5n/512 binary digits in the encoded message. We now go through the contribution each letter gives to calculate the total length of the encoded message:

(5 × n/512) + (5 × 7n/512) + (5 × 7n/512) + (5 × 7n/512) + (3 × 49n/512) + (3 × 49n/512) + (3 × 49n/512) + (1 × 343n/512).
We can simplify this a little:

(5 × 22 + 3 × 147 + 343) × n/512.

This is 894n/512, approximately 1.746n. From this we can calculate the average code word length. The original message had n symbols. We ended with a final message of length 1.746n, so the average code word length is 1.746.
This average is the best possible. No other binary coding of the alphabet, with the property that no code word is a prefix of another code word, has an average word length of less than 1.746. Huffman codes are optimal. You might find another coding with the same average length, but you cannot find one with a shorter average word length.
We now have all the pieces in place to understand Shannon's First Theorem and to compress our sample dataset.

3.7 SHANNON'S FIRST THEOREM

To compress data, Shannon proposed the following steps.
• First, instead of taking each symbol separately, take blocks of symbols. All blocks have the same length, and we will denote this length by N.
  • Calculate the probabilities associated with each of the blocks.
  • Find an optimal code for the blocks. (We will use Huffman coding.)
  • Calculate the average word length of the coding per block.
  • Divide the average word length by the length of the block to get the average code word length per original symbol. We will call this $L_n$.
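The steps above can be sketched in code. This is a minimal illustration, not the book's own program: it reuses a Huffman-length helper and assumes a two-letter source with probabilities 0.9 and 0.1, the example worked through below.

```python
import heapq
from itertools import count, product

def huffman_lengths(probs):
    """Code word lengths of a Huffman code for the given probabilities."""
    tick = count()  # tie-breaker for the heap
    heap = [(p, next(tick), {s: 0}) for s, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p1, _, d1 = heapq.heappop(heap)
        p2, _, d2 = heapq.heappop(heap)
        heapq.heappush(heap, (p1 + p2, next(tick),
                              {s: d + 1 for s, d in {**d1, **d2}.items()}))
    return heap[0][2]

def L(letter_probs, n):
    """Average code word length per original symbol, blocks of length n."""
    # Steps 1-2: form every block of length n and its probability.
    blocks = {}
    for combo in product(letter_probs, repeat=n):
        p = 1.0
        for letter in combo:
            p *= letter_probs[letter]
        blocks["".join(combo)] = p
    # Step 3: an optimal (Huffman) code for the blocks.
    lengths = huffman_lengths(blocks)
    # Steps 4-5: average word length per block, divided by n.
    return sum(blocks[b] * lengths[b] for b in blocks) / n

source = {"A": 0.9, "B": 0.1}  # assumed two-letter source, as below
print(L(source, 1))            # 1.0
print(round(L(source, 2), 3))  # 0.645
```

Running `L` for increasing block lengths shows the per-symbol average falling toward the entropy of the source, which is exactly what the theorem asserts.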

Shannon's First Theorem: $L_n \to H$ as $n \to \infty$, where $H$ is the entropy.

To illustrate, we will calculate $L_n$ for block lengths 1, 2, and 3 using our example where $A$ has probability $0.9$ and $B$ has probability $0.1$.

BLOCK LENGTH 1

There is not much to do in this case. The blocks of length 1 are just the letters themselves, and $n = 1$. The coding simply assigns 0 to one of them and 1 to the other. The average code word length is 1, so $L_1 = 1$. This tells us the coding doesn't result in any compression, which is just what we expect.
111111111111111011111111111111010111110111111111110111111 11011100111011110111110111111111111110111111111111111111001 11111111111110111111011111111111011111111101111111111111 10111111110111111111111111111111101111110011111001111011111 1111111111111111101011111111111111111111001111111011111 111011111111011111111111111111111111111111111111111111111 11110111011101111101110111111101111101101111011011111111111 11111111111111111101111111101111111111111111111111111111111 0011110111111111011101101101111101110110111111101111111111 10111111111111111111001111010111011111101011111111111111 10111111111011111110110111111111110011110111111111111111 1111111011111111111111101101111111000111111111111111111111 111111111111111100101111111111110011111111110111111111111 11111101111111001101111111111111111111011111101011111111111 110111101111111110111011111011111011111111110111111101111 10111111111111111111111011111011100101110111001111101111111 10011111111111111111111111111111111010111101111111111
FIGURE 3.10
Tree for blocks of length 2.
Just glancing at this tells us there are far more 1s than 0s. In fact, there are 887 1s and only 113 0s. As we improve our compression, we expect the encoded message to approach a random string of 0s and 1s, each digit occurring with equal probability.
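For comparison with Shannon's First Theorem, we can compute the entropy of this source. Assuming the two letters have probabilities 0.9 and 0.1, consistent with the 887/113 split just observed, a quick calculation gives $H \approx 0.469$, well below $L_1 = 1$, so there is plenty of room for compression:

```python
from math import log2

def entropy(probs):
    """Shannon entropy in bits per symbol."""
    return -sum(p * log2(p) for p in probs)

# Assumed probabilities 0.9 and 0.1 for the two letters.
H = entropy([0.9, 0.1])
print(round(H, 3))  # 0.469
```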

BLOCK LENGTH 2

There are four blocks: $AA$, $AB$, $BA$, and $BB$. The probability $AA$ occurs is $0.9 \times 0.9 = 0.81$. Both $AB$ and $BA$ have probability $0.9 \times 0.1 = 0.09$. Finally, the probability $BB$ occurs is $0.1 \times 0.1 = 0.01$. Our next step is to find an optimal coding.
I constructed a tree for Huffman coding, as shown in figure 3.10.
This means that we replace $BB$ with 000, $BA$ with 001, $AB$ with 01, and $AA$ with 1. The average word length per block is $0.81 \times 1 + 0.09 \times 2 + 0.09 \times 3 + 0.01 \times 3 = 1.29$. The blocks have length 2, so to get the average word length per letter, we divide by 2 to obtain $L_2 = 0.645$.
Making the appropriate substitutions gives the following string:
1111111100111111110010011100111111001111011010011001101110
1111111100111111111010011111111001110111111100111110011111
1110011110111111111111001110100111000110011111111111101011 1111111111000111011111001111011111111111111111111111111001
1001100111001100111100111001011100101111111111111110111110
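The substitution and its reversal can be illustrated in a few lines of Python. The 0/1 labels assigned to each block here are one consistent reading of the tree in figure 3.10 (only the code word lengths are forced by the tree), and the sample message is hypothetical:

```python
# Block-2 Huffman code; the 0/1 labels are one consistent reading of
# the tree in figure 3.10 (only the lengths are forced by the tree).
code = {"AA": "1", "AB": "01", "BA": "001", "BB": "000"}

def encode(message, code, n=2):
    """Replace each block of n letters by its code word."""
    return "".join(code[message[i:i + n]] for i in range(0, len(message), n))

def decode(bits, code):
    """Greedy left-to-right decoding; it never needs to backtrack
    because no code word is a prefix of another."""
    inverse = {w: block for block, w in code.items()}
    out, word = [], ""
    for bit in bits:
        word += bit
        if word in inverse:
            out.append(inverse[word])
            word = ""
    return "".join(out)

message = "AAAAABBAAABA"              # hypothetical sample message
bits = encode(message, code)
assert decode(bits, code) == message  # round-trips unambiguously
```

The greedy decoder works precisely because the code is prefix-free: as soon as the bits read so far match a code word, that match is the only possible one.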