Welcome to Bayesian Methods for Hackers. The full Github repository is available at github/Probabilistic-Programming-and-Bayesian-Methods-for-Hackers. The other chapters can be found on the project's homepage. We hope you enjoy the book, and we encourage any contributions!
Bayesian Methods for Hackers is now a published book by Addison-Wesley, available on Amazon!
You are a skilled programmer, but bugs still slip into your code. After a particularly difficult implementation of an algorithm, you decide to test your code on a trivial example. It passes. You test the code on a harder problem. It passes once again. And it passes the next, even more difficult, test too! You are starting to believe that there may be no bugs in this code...
If you think this way, then congratulations, you already are thinking Bayesian! Bayesian inference is simply updating your beliefs after considering new evidence. A Bayesian can rarely be certain about a result, but he or she can be very confident. Just like in the example above, we can never be 100% sure that our code is bug-free unless we test it on every possible problem; something rarely possible in practice. Instead, we can test it on a large number of problems, and if it succeeds we can feel more confident about our code, but still not certain. Bayesian inference works identically: we update our beliefs about an outcome; rarely can we be absolutely sure unless we rule out all other alternatives.
Bayesian inference differs from more traditional statistical inference by preserving uncertainty. At first, this sounds like a bad statistical technique. Isn't statistics all about deriving certainty from randomness? To reconcile this, we need to start thinking like Bayesians.
The Bayesian world-view interprets probability as a measure of believability in an event, that is, how confident we are in an event occurring. In fact, we will see in a moment that this is the natural interpretation of probability.
For this to be clearer, we consider an alternative interpretation of probability: the frequentist interpretation, known as the more classical version of statistics, assumes that probability is the long-run frequency of events (hence the bestowed title). For example, the probability of plane accidents under a frequentist philosophy is interpreted as the long-term frequency of plane accidents. This makes logical sense for many probabilities of events, but becomes more difficult to understand when events have no long-term frequency of occurrences. Consider: we often assign probabilities to outcomes of presidential elections, but the election itself only happens once! Frequentists get around this by invoking alternative realities and saying that across all these realities, the frequency of occurrences defines the probability.
Bayesians, on the other hand, have a more intuitive approach. Bayesians interpret a probability as a measure of belief, or confidence, of an event occurring. Simply, a probability is a summary of an opinion. An individual who assigns a belief of 0 to an event has no confidence that the event will occur; conversely, assigning a belief of 1 implies that the individual is absolutely certain of an event occurring. Beliefs between 0 and 1 allow for weightings of other outcomes. This definition agrees with the probability of a plane accident example, for having observed the frequency of plane accidents, an individual's belief should be equal to that frequency, excluding any outside information. Similarly, under this definition of probability being equal to beliefs, it is meaningful to speak about probabilities (beliefs) of presidential election outcomes: how confident are you candidate A will win?
Notice in the paragraph above, I assigned the belief (probability) measure to an individual, not to Nature. This is very interesting, as this definition leaves room for conflicting beliefs between individuals. Again, this is appropriate for what naturally occurs: different individuals have different beliefs of events occurring, because they possess different information about the world. The existence of different beliefs does not imply that anyone is wrong. Consider the following examples demonstrating the relationship between individual beliefs and probabilities:
I flip a coin, and we both guess the result. We would both agree, assuming the coin is fair, that the probability of Heads is 1/2. Assume, then, that I peek at the coin. Now I know for certain what the result is: I assign probability 1.0 to either Heads or Tails (whichever it is). Now what is your belief that the coin is Heads? My knowledge of the outcome has not changed the coin's results. Thus we assign different probabilities to the result.
Your code either has a bug in it or not, but we do not know for certain which is true, though we have a belief about the presence or absence of a bug.
A medical patient is exhibiting symptoms $x$, $y$ and $z$. There are a number of diseases that could be causing all of them, but only a single disease is present. A doctor has beliefs about which disease, but a second doctor may have slightly different beliefs.
This philosophy of treating beliefs as probability is natural to humans. We employ it constantly as we interact with the world and only see partial truths, but gather evidence to form beliefs. Alternatively, you have to be trained to think like a frequentist.
To align ourselves with traditional probability notation, we denote our belief about event $A$ as $P(A)$. We call this quantity the prior probability.
John Maynard Keynes, a great economist and thinker, said "When the facts change, I change my mind. What do you do, sir?" This quote reflects the way a Bayesian updates his or her beliefs after seeing evidence. Even — especially — if the evidence is counter to what was initially believed, the evidence cannot be ignored. We denote our updated belief as $P(A|X)$, interpreted as the probability of $A$ given the evidence $X$. We call the updated belief the posterior probability so as to contrast it with the prior probability. For example, consider the posterior probabilities (read: posterior beliefs) of the above examples, after observing some evidence $X$:
1. $P(A)$: the coin has a 50 percent chance of being Heads. $P(A|X)$: You look at the coin, observe a Heads has landed, denote this information $X$, and trivially assign probability 1.0 to Heads and 0.0 to Tails.
2. $P(A)$: This big, complex code likely has a bug in it. $P(A|X)$: The code passed all $X$ tests; there still might be a bug, but its presence is less likely now.
3. $P(A)$: The patient could have any number of diseases. $P(A|X)$: Performing a blood test generated evidence $X$, ruling out some of the possible diseases from consideration.
It's clear that in each example we did not completely discard the prior belief after seeing new evidence $X$, but we re-weighted the prior to incorporate the new evidence (i.e. we put more weight, or confidence, on some beliefs versus others).
By introducing prior uncertainty about events, we are already admitting that any guess we make is potentially very wrong. After observing data, evidence, or other information, we update our beliefs, and our guess becomes less wrong. This is the alternative side of the prediction coin, where typically we try to be more right.
If frequentist and Bayesian inference were programming functions, with inputs being statistical problems, then the two would be different in what they return to the user. The frequentist inference function would return a number, representing an estimate (typically a summary statistic like the sample average etc.), whereas the Bayesian function would return probabilities.
For example, in our debugging problem above, calling the frequentist function with the argument "My code passed all $X$ tests; is my code bug-free?" would return a YES. On the other hand, asking our Bayesian function "Often my code has bugs. My code passed all $X$ tests; is my code bug-free?" would return something very different: probabilities of YES and NO. The function might return:
YES, with probability 0.8; NO, with probability 0.2
This is very different from the answer the frequentist function returned. Notice that the Bayesian function accepted an additional argument: "Often my code has bugs". This parameter is the prior. By including the prior parameter, we are telling the Bayesian function to include our belief about the situation. Technically this parameter in the Bayesian function is optional, but we will see excluding it has its own consequences.
As we acquire more and more instances of evidence, our prior belief is washed out by the new evidence. This is to be expected. For example, if your prior belief is something ridiculous, like "I expect the sun to explode today", and each day you are proved wrong, you would hope that any inference would correct you, or at least align your beliefs better. Bayesian inference will correct this belief.
Denote $N$ as the number of instances of evidence we possess. As we gather an infinite amount of evidence, say as $N \rightarrow \infty$, our Bayesian results (often) align with frequentist results. Hence for large $N$, statistical inference is more or less objective. On the other hand, for small $N$, inference is much more unstable: frequentist estimates have more variance and larger confidence intervals. This is where Bayesian analysis excels. By introducing a prior, and returning probabilities (instead of a scalar estimate), we preserve the uncertainty that reflects the instability of statistical inference of a small-$N$ dataset.
One may think that for large $N$, one can be indifferent between the two techniques since they offer similar inference, and might lean towards the computationally-simpler frequentist methods. An individual in this position should consider the following quote by Andrew Gelman (2005)[1], before making such a decision:
Sample sizes are never large. If $N$ is too small to get a sufficiently-precise estimate, you need to get more data (or make more assumptions). But once $N$ is "large enough," you can start subdividing the data to learn more (for example, in a public opinion poll, once you have a good estimate for the entire country, you can estimate among men and women, northerners and southerners, different age groups, etc.). $N$ is never enough because if it were "enough" you'd already be on to the next problem for which you need more data.
Are frequentist methods incorrect then? No.
Frequentist methods are still useful or state-of-the-art in many areas. Tools such as least squares linear regression, LASSO regression, and expectation-maximization algorithms are all powerful and fast. Bayesian methods complement these techniques by solving problems that these approaches cannot, or by illuminating the underlying system with more flexible modeling.
Paradoxically, big data's predictive analytic problems are actually solved by relatively simple algorithms [2][4]. Thus we can argue that big data's prediction difficulty does not lie in the algorithm used, but instead in the computational difficulties of storage and execution on big data. (One should also consider Gelman's quote from above and ask "Do I really have big data?")
The much more difficult analytic problems involve medium data and, especially troublesome, really small data. Using a similar argument as Gelman's above, if big data problems are big enough to be readily solved, then we should be more interested in the not-quite-big enough datasets.
We are interested in beliefs, which can be interpreted as probabilities by thinking Bayesian. We have a prior belief in event $A$, beliefs formed by previous information, e.g., our prior belief about bugs being in our code before performing tests.
Secondly, we observe our evidence. To continue our buggy-code example: if our code passes $X$ tests, we want to update our belief to incorporate this. We call this new belief the posterior probability. Updating our belief is done via the following equation, known as Bayes' Theorem, after its discoverer Thomas Bayes:

$$P(A \mid X) = \frac{P(X \mid A)\, P(A)}{P(X)}$$
The above formula is not unique to Bayesian inference: it is a mathematical fact with uses outside Bayesian inference. Bayesian inference merely uses it to connect prior probabilities $P(A)$ with updated posterior probabilities $P(A|X)$.
Every statistics text must contain a coin-flipping example, so I'll use it here to get it out of the way. Suppose, naively, that you are unsure about the probability of heads in a coin flip (spoiler alert: it's 50%). You believe there is some true underlying ratio, call it $p$, but have no prior opinion on what $p$ might be.
We begin to flip a coin, and record the observations: either $H$ or $T$. This is our observed data. An interesting question to ask is how our inference changes as we observe more and more data. More specifically, what do our posterior probabilities look like when we have little data, versus when we have lots of data?
Below we plot a sequence of updating posterior probabilities as we observe increasing amounts of data (coin flips).
"""
The book uses a custom matplotlibrc file, which provides the unique styles for
matplotlib plots. If executing this book, and you wish to use the book's
styling, provided are two options:
1. Overwrite your own matplotlibrc file with the rc-file provided in the
book's styles/ dir. See http://matplotlib.org/users/customizing.html
2. Also in the styles is bmh_matplotlibrc.json file. This can be used to
update the styles in only this notebook. Try running the following code:
import json, matplotlib
s = json.load( open("../styles/bmh_matplotlibrc.json") )
matplotlib.rcParams.update(s)
"""
# The code below can be passed over, as it is currently not important, plus it
# uses advanced topics we have not covered yet. LOOK AT PICTURE, MICHAEL!
%matplotlib inline
from IPython.core.pylabtools import figsize
import numpy as np
from matplotlib import pyplot as plt
figsize(11, 9)
import scipy.stats as stats
dist = stats.beta
n_trials = [0, 1, 2, 3, 4, 5, 8, 15, 50, 500]
data = stats.bernoulli.rvs(0.5, size=n_trials[-1])
x = np.linspace(0, 1, 100)
# For the already prepared, I'm using Binomial's conj. prior.
for k, N in enumerate(n_trials):
    sx = plt.subplot(len(n_trials) // 2, 2, k + 1)
    plt.xlabel("$p$, probability of heads") \
        if k in [0, len(n_trials) - 1] else None
    plt.setp(sx.get_yticklabels(), visible=False)
    heads = data[:N].sum()
    # posterior of a uniform Beta(1, 1) prior after `heads` heads in N tosses
    y = dist.pdf(x, 1 + heads, 1 + N - heads)
    plt.plot(x, y, label="observe %d tosses,\n %d heads" % (N, heads))
    plt.fill_between(x, 0, y, color="#348ABD", alpha=0.4)
    plt.vlines(0.5, 0, 4, color="k", linestyles="--", lw=1)
    leg = plt.legend()
    leg.get_frame().set_alpha(0.4)
    plt.autoscale(tight=True)

plt.suptitle("Bayesian updating of posterior probabilities",
             y=1.02,
             fontsize=14)
plt.tight_layout()
The posterior probabilities are represented by the curves, and our uncertainty is proportional to the width of the curve. As the plot above shows, as we start to observe data our posterior probabilities start to shift and move around. Eventually, as we observe more and more data (coin-flips), our probabilities will tighten closer and closer around the true value of $p=0.5$ (marked by a dashed line).
Notice that the plots are not always peaked at 0.5. There is no reason they should be: recall we assumed we did not have a prior opinion of what $p$ is. In fact, if we observe quite extreme data, say 8 flips and only 1 observed heads, our distribution would look very biased away from lumping around 0.5 (with no prior opinion, how confident would you feel betting on a fair coin after observing 8 tails and 1 head?). As more data accumulates, we would see more and more probability being assigned at $p=0.5$, though never all of it.
The next example is a simple demonstration of the mathematics of Bayesian inference.
Let $A$ denote the event that our code has no bugs in it. Let $X$ denote the event that the code passes all debugging tests. For now, we will leave the prior probability of no bugs as a variable, i.e. $P(A) = p$.
We are interested in $P(A|X)$, i.e. the probability of no bugs, given our debugging tests $X$ pass. To use the formula above, we need to compute some quantities.
What is $P(X|A)$, i.e., the probability that the code passes $X$ tests given there are no bugs? Well, it is equal to 1, for code with no bugs will pass all tests.
$P(X)$ is a little bit trickier: the event $X$ can be divided into two possibilities, event $X$ occurring even though our code indeed has bugs (denoted $\sim A$, spoken not $A$), or event $X$ without bugs ($A$). $P(X)$ can be represented as:

\begin{align}
P(X) & = P(X \text{ and } A) + P(X \text{ and } \sim A) \\
& = P(X|A)P(A) + P(X|\sim A)P(\sim A) \\
& = P(X|A)p + P(X|\sim A)(1-p)
\end{align}
We have already computed $P(X|A)$ above. On the other hand, $P(X|\sim A)$ is subjective: our code can pass tests but still have a bug in it, though the probability there is a bug present is reduced. Note this is dependent on the number of tests performed, the degree of complication in the tests, etc. Let's be conservative and assign $P(X|\sim A) = 0.5$. Then

\begin{align}
P(A|X) & = \frac{1 \cdot p}{1 \cdot p + 0.5(1-p)} \\
& = \frac{2p}{1+p}
\end{align}
This is the posterior probability. What does it look like as a function of our prior, $p \in [0,1]$?
figsize(12.5, 4)
p = np.linspace(0, 1, 50)
plt.plot(p, 2 * p / (1 + p), color="#348ABD", lw=3)
# plt.fill_between(p, 2*p/(1+p), alpha=.5, facecolor=["#A60628"])
plt.scatter(0.2, 2 * (0.2) / 1.2, s=140, c="#348ABD")
plt.xlim(0, 1)
plt.ylim(0, 1)
plt.xlabel("Prior, $P(A) = p$")
plt.ylabel("Posterior, $P(A|X)$, with $P(A) = p$")
plt.title("Is my code bug-free?")
We can see the biggest gains if we observe the $X$ tests passed when the prior probability, $p$, is low. Let's settle on a specific value for the prior. I'm a strong programmer (I think), so I'm going to give myself a realistic prior of 0.20; that is, there is a 20% chance that I write code bug-free. To be more realistic, this prior should be a function of how complicated and large the code is, but let's pin it at 0.20. Then my updated belief that my code is bug-free is $2(0.2)/(1+0.2) \approx 0.33$.
Recall that the prior is a probability: $p$ is the prior probability that there are no bugs, so $1-p$ is the prior probability that there are bugs.
Similarly, our posterior is also a probability, with $P(A|X)$ the probability there is no bug given we saw all tests pass, hence $1-P(A|X)$ is the probability there is a bug given all tests passed. What does our posterior probability look like? Below is a chart of both the prior and the posterior probabilities.
figsize(12.5, 4)
colours = ["#348ABD", "#A60628"]
prior = [0.20, 0.80]
posterior = [1. / 3, 2. / 3]
plt.bar([0, .7], prior, alpha=0.70, width=0.25,
color=colours[0], label="prior distribution",
lw="3", edgecolor=colours[0])
plt.bar([0 + 0.25, .7 + 0.25], posterior, alpha=0.7,
width=0.25, color=colours[1],
label="posterior distribution",
lw="3", edgecolor=colours[1])
plt.ylim(0,1)
plt.xticks([0.20, .95], ["Bugs Absent", "Bugs Present"])
plt.title("Prior and Posterior probability of bugs present")
plt.ylabel("Probability")
plt.legend(loc="upper left");
Notice that after we observed $X$ occur, the probability of bugs being absent increased. By increasing the number of tests, we can approach confidence (probability 1) that there are no bugs present.
This was a very simple example of Bayesian inference and Bayes rule. Unfortunately, the mathematics necessary to perform more complicated Bayesian inference only becomes more difficult, except for artificially constructed cases. We will later see that this type of mathematical analysis is actually unnecessary. First we must broaden our modeling tools. The next section deals with probability distributions. If you are already familiar, feel free to skip (or at least skim), but for the less familiar the next section is essential.
Let's quickly recall what a probability distribution is: let $Z$ be some random variable. Then associated with $Z$ is a probability distribution function that assigns probabilities to the different outcomes $Z$ can take. Graphically, a probability distribution is a curve where the probability of an outcome is proportional to the height of the curve. You can see examples in the first figure of this chapter.
We can divide random variables into three classifications:
$Z$ is discrete: Discrete random variables may only assume values on a specified list. Things like populations, movie ratings, and number of votes are all discrete random variables. Discrete random variables become clearer when we contrast them with...
$Z$ is continuous: Continuous random variables can take on arbitrarily exact values. For example, temperature, speed, time, and color are all modeled as continuous variables because you can progressively make the values more and more precise.
$Z$ is mixed: Mixed random variables assign probabilities to both discrete and continuous outcomes, i.e. they combine the above two categories.
Expected value (EV) is one of the most important concepts in probability. The EV for a given probability distribution can be described as "the mean value in the long run for many repeated samples from that distribution." To borrow a metaphor from physics, a distribution's EV acts like its "center of mass." Imagine repeating the same experiment many times over, and taking the average over each outcome. The more you repeat the experiment, the closer this average will become to the distribution's EV. (Side note: as the number of repeated experiments goes to infinity, the difference between the average outcome and the EV becomes arbitrarily small.)
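To make this concrete, here is a small simulation (an illustration added here, not from the original text): a fair six-sided die has an EV of 3.5, and the running average of simulated rolls drifts toward that value as the number of rolls grows.

import numpy as np

# The running average of fair-die rolls approaches the die's EV of 3.5.
np.random.seed(42)  # fixed seed, for reproducibility
rolls = np.random.randint(1, 7, size=100000)  # uniform on {1, ..., 6}
for n in [10, 100, 1000, 100000]:
    print("average of %6d rolls: %.3f" % (n, rolls[:n].mean()))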
If $Z$ is discrete, then its distribution is called a probability mass function, which measures the probability that $Z$ takes on the value $k$, denoted $P(Z=k)$. Note that the probability mass function completely describes the random variable $Z$; that is, if we know the mass function, we know how $Z$ should behave. There are popular probability mass functions that consistently appear: we will introduce them as needed, but let's introduce the first very useful probability mass function. We say $Z$ is Poisson-distributed if:

$$P(Z = k) = \frac{\lambda^k e^{-\lambda}}{k!}, \;\; k = 0, 1, 2, \dots$$
$\lambda$ is called a parameter of the distribution, and it controls the distribution's shape. For the Poisson distribution, $\lambda$ can be any positive number. By increasing $\lambda$, we add more probability to larger values, and conversely by decreasing $\lambda$ we add more probability to smaller values. One can describe $\lambda$ as the intensity of the Poisson distribution.
Unlike $\lambda$, which can be any positive number, the value $k$ in the above formula must be a non-negative integer, i.e., $k$ must take on values 0, 1, 2, and so on. This is very important, because if you wanted to model a population you could not make sense of populations with 4.25 or 5.612 members.
If a random variable $Z$ has a Poisson mass distribution, we denote this by writing

$$Z \sim \text{Poi}(\lambda)$$
One useful property of the Poisson distribution is that its expected value is equal to its parameter, i.e.:

$$E[\, Z \mid \lambda \,] = \lambda$$
We will use this property often, so it's useful to remember. Below, we plot the probability mass distribution for different $\lambda$ values. The first thing to notice is that by increasing $\lambda$, we add more probability of larger values occurring. Second, notice that although the graph ends at 15, the distributions do not. They assign positive probability to every non-negative integer.
figsize(12.5, 4)
import scipy.stats as stats
a = np.arange(16)
poi = stats.poisson
lambda_ = [1.5, 4.25]
colours = ["#348ABD", "#A60628"]
plt.bar(a, poi.pmf(a, lambda_[0]), color=colours[0],
label="$\lambda = %.1f$" % lambda_[0], alpha=0.60,
edgecolor=colours[0], lw="3")
plt.bar(a, poi.pmf(a, lambda_[1]), color=colours[1],
label="$\lambda = %.1f$" % lambda_[1], alpha=0.60,
edgecolor=colours[1], lw="3")
plt.xticks(a + 0.4, a)
plt.legend()
plt.ylabel("probability of $k$")
plt.xlabel("$k$")
plt.title("Probability mass function of a Poisson random variable; differing \
$\lambda$ values")
Instead of a probability mass function, a continuous random variable has a probability density function. This might seem like unnecessary nomenclature, but the density function and the mass function are very different creatures. An example of a continuous random variable is a random variable with exponential density. The density function for an exponential random variable looks like this:

$$f_Z(z \mid \lambda) = \lambda e^{-\lambda z}, \;\; z \ge 0$$
Like a Poisson random variable, an exponential random variable can take on only non-negative values. But unlike a Poisson variable, the exponential can take on any non-negative value, including non-integral values such as 4.25 or 5.612401. This property makes it a poor choice for count data, which must be an integer, but a great choice for time data, temperature data (measured in Kelvins, of course), or any other precise and positive variable. The graph below shows two probability density functions with different $\lambda$ values.
When a random variable $Z$ has an exponential distribution with parameter $\lambda$, we say $Z$ is exponential and write

$$Z \sim \text{Exp}(\lambda)$$
Given a specific $\lambda$, the expected value of an exponential random variable is equal to the inverse of $\lambda$, that is:

$$E[\, Z \mid \lambda \,] = \frac{1}{\lambda}$$
a = np.linspace(0, 4, 100)
expo = stats.expon
lambda_ = [0.5, 1]
for l, c in zip(lambda_, colours):
    plt.plot(a, expo.pdf(a, scale=1. / l), lw=3,
             color=c, label="$\lambda = %.1f$" % l)
    plt.fill_between(a, expo.pdf(a, scale=1. / l), color=c, alpha=.33)
plt.legend()
plt.ylabel("PDF at $z$")
plt.xlabel("$z$")
plt.ylim(0, 1.2)
plt.title("Probability density function of an Exponential random variable;\
differing $\lambda$");
But what is $\lambda$? This question is what motivates statistics. In the real world, $\lambda$ is hidden from us. We see only $Z$, and must go backwards to try and determine $\lambda$. The problem is difficult because there is no one-to-one mapping from $Z$ to $\lambda$. Many different methods have been created to solve the problem of estimating $\lambda$, but since $\lambda$ is never actually observed, no one can say for certain which method is best!
Bayesian inference is concerned with beliefs about what $\lambda$ might be. Rather than try to guess $\lambda$ exactly, we can only talk about what $\lambda$ is likely to be by assigning a probability distribution to $\lambda$.
This might seem odd at first. After all, $\lambda$ is fixed; it is not (necessarily) random! How can we assign probabilities to values of a non-random variable? Ah, we have fallen for our old, frequentist way of thinking. Recall that under Bayesian philosophy, we can assign probabilities if we interpret them as beliefs. And it is entirely acceptable to have beliefs about the parameter $\lambda$.
Let's try to model a more interesting example, one that concerns the rate at which a user sends and receives text messages:
You are given a series of daily text-message counts from a user of your system. The data, plotted over time, appears in the chart below. You are curious to know if the user's text-messaging habits have changed over time, either gradually or suddenly. How can you model this? (This is in fact my own text-message data. Judge my popularity as you wish.)
figsize(12.5, 3.5)
count_data = np.loadtxt("data/txtdata.csv")
n_count_data = len(count_data)
plt.bar(np.arange(n_count_data), count_data, color="#348ABD")
plt.xlabel("Time (days)")
plt.ylabel("count of text-msgs received")
plt.title("Did the user's texting habits change over time?")
plt.xlim(0, n_count_data);
Before we start modeling, see what you can figure out just by looking at the chart above. Would you say there was a change in behaviour during this time period?
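One rough numerical check (a quick sketch added here, nothing rigorous) is to compare the average daily count in the first and second halves of the observation period:

# A rough check: compare average daily counts in each half of the period.
half = n_count_data // 2
print("first-half mean: %.2f, second-half mean: %.2f"
      % (count_data[:half].mean(), count_data[half:].mean()))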
How can we start to model this? Well, as we have conveniently already seen, a Poisson random variable is a very appropriate model for this type of count data. Denoting day $i$'s text-message count by $C_i$,

$$C_i \sim \text{Poisson}(\lambda)$$
We are not sure what the value of the $\lambda$ parameter really is, however. Looking at the chart above, it appears that the rate might become higher late in the observation period, which is equivalent to saying that $\lambda$ increases at some point during the observations. (Recall that a higher value of $\lambda$ assigns more probability to larger outcomes. That is, there is a higher probability of many text messages having been sent on a given day.)
How can we represent this observation mathematically? Let's assume that on some day during the observation period (call it $\tau$), the parameter $\lambda$ suddenly jumps to a higher value. So we really have two $\lambda$ parameters: one for the period before $\tau$, and one for the rest of the observation period. In the literature, a sudden transition like this would be called a switchpoint:

$$
\lambda =
\begin{cases}
\lambda_1 & \text{if } t \lt \tau \\
\lambda_2 & \text{if } t \ge \tau
\end{cases}
$$
If, in reality, no sudden change occurred and indeed $\lambda_1 = \lambda_2$, then the $\lambda$s' posterior distributions should look about equal.
We are interested in inferring the unknown $\lambda$s. To use Bayesian inference, we need to assign prior probabilities to the different possible values of $\lambda$. What would be good prior probability distributions for $\lambda_1$ and $\lambda_2$? Recall that $\lambda$ can be any positive number. As we saw earlier, the exponential distribution provides a continuous density function for positive numbers, so it might be a good choice for modeling $\lambda_i$. But recall that the exponential distribution takes a parameter of its own, so we'll need to include that parameter in our model. Let's call that parameter $\alpha$.
$\alpha$ is called a hyper-parameter or parent variable. In literal terms, it is a parameter that influences other parameters. Our initial guess at $\alpha$ does not influence the model too strongly, so we have some flexibility in our choice. A good rule of thumb is to set the exponential parameter equal to the inverse of the average of the count data. Since we're modeling $\lambda$ using an exponential distribution, we can use the expected value identity shown earlier to get:

$$\frac{1}{N}\sum_{i=0}^{N} C_i \approx E[\, \lambda \mid \alpha \,] = \frac{1}{\alpha}$$
An alternative, and something I encourage the reader to try, would be to have two priors: one for each $\lambda_i$. Creating two exponential distributions with different $\alpha$ values reflects our prior belief that the rate changed at some point during the observations. (A sketch of this alternative appears after the PyMC code below.)
What about $\tau$? Because of the noisiness of the data, it's difficult to pick out a priori when $\tau$ might have occurred. Instead, we can assign a uniform prior belief to every possible day. This is equivalent to saying

\begin{align}
& \tau \sim \text{DiscreteUniform}(1, 70) \\
& \Rightarrow P(\tau = k) = \frac{1}{70}
\end{align}
So after all this, what does our overall prior distribution for the unknown variables look like? Frankly, it doesn't matter. What we should understand is that it's an ugly, complicated mess involving symbols only a mathematician could love. And things will only get uglier the more complicated our models become. Regardless, all we really care about is the posterior distribution.
We next turn to PyMC, a Python library for performing Bayesian analysis that is undaunted by the mathematical monster we have created.
PyMC is a Python library for programming Bayesian analysis [3]. It is a fast, well-maintained library. The only unfortunate part is that its documentation is lacking in certain areas, especially those that bridge the gap between beginner and hacker. One of this book's main goals is to solve that problem, and also to demonstrate why PyMC is so cool.
We will model the problem above using PyMC. This type of programming is called probabilistic programming, an unfortunate misnomer that invokes ideas of randomly-generated code and has likely confused and frightened users away from this field. The code is not random; it is probabilistic in the sense that we create probability models using programming variables as the model's components. Model components are first-class primitives within the PyMC framework.
B. Cronin [5] has a very motivating description of probabilistic programming:
Another way of thinking about this: unlike a traditional program, which only runs in the forward directions, a probabilistic program is run in both the forward and backward direction. It runs forward to compute the consequences of the assumptions it contains about the world (i.e., the model space it represents), but it also runs backward from the data to constrain the possible explanations. In practice, many probabilistic programming systems will cleverly interleave these forward and backward operations to efficiently home in on the best explanations.
Because of the confusion engendered by the term probabilistic programming, I'll refrain from using it. Instead, I'll simply say programming, since that's what it really is.
PyMC code is easy to read. The only novel thing should be the syntax, and I will interrupt the code to explain individual sections. Simply remember that we are representing the model's components ($\tau, \lambda_1, \lambda_2$) as variables:
import pymc as pm
alpha = 1.0 / count_data.mean() # Recall count_data is the
# variable that holds our txt counts
lambda_1 = pm.Exponential("lambda_1", alpha)
lambda_2 = pm.Exponential("lambda_2", alpha)
tau = pm.DiscreteUniform("tau", lower=0, upper=n_count_data)
In the code above, we create the PyMC variables corresponding to $\lambda_1$ and $\lambda_2$. We assign them to PyMC's stochastic variables, so-called because they are treated by the back end as random number generators. We can demonstrate this fact by calling their built-in random() methods.
print("Random output:", tau.random(), tau.random(), tau.random())
Random output: 64 5 12
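As an aside, the two-prior alternative suggested earlier might be sketched as follows. This is a hypothetical variant, not part of our model: the half-split heuristic for choosing each $\alpha$, and the `_alt` variable names, are arbitrary choices for illustration.

# Hypothetical sketch: one hyper-parameter per lambda_i, each alpha
# estimated (arbitrarily) from one half of the data.
alpha_1 = 1.0 / count_data[:n_count_data // 2].mean()
alpha_2 = 1.0 / count_data[n_count_data // 2:].mean()
lambda_1_alt = pm.Exponential("lambda_1_alt", alpha_1)
lambda_2_alt = pm.Exponential("lambda_2_alt", alpha_2)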
@pm.deterministic
def lambda_(tau=tau, lambda_1=lambda_1, lambda_2=lambda_2):
    out = np.zeros(n_count_data)
    out[:tau] = lambda_1  # lambda before tau is lambda_1
    out[tau:] = lambda_2  # lambda after (and including) tau is lambda_2
    return out
This code creates a new function lambda_, but really we can think of it as a random variable: the random variable $\lambda$ from above. Note that because lambda_1, lambda_2 and tau are random, lambda_ will be random. We are not fixing any variables yet.
@pm.deterministic is a decorator that tells PyMC this is a deterministic function. That is, if the arguments were deterministic (which they are not), the output would be deterministic as well. Deterministic functions will be covered in Chapter 2.
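For a taste of what the decorator does, here is a toy deterministic variable, separate from our model (the names here are hypothetical): given its input, its output involves no additional randomness.

# Toy example: a deterministic transformation of a stochastic parent.
some_rate = pm.Exponential("some_rate", 1.0)

@pm.deterministic
def doubled_rate(rate=some_rate):
    # Deterministic given `rate`: all randomness comes from the parent.
    return 2 * rate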
observation = pm.Poisson("obs", lambda_, value=count_data, observed=True)
model = pm.Model([observation, lambda_1, lambda_2, tau])
The variable observation combines our data, count_data, with our proposed data-generation scheme, given by the variable lambda_, through the value keyword. We also set observed = True to tell PyMC that this should stay fixed in our analysis. Finally, PyMC wants us to collect all the variables of interest and create a Model instance out of them. This makes our life easier when we retrieve the results.
The code below will be explained in Chapter 3, but I show it here so you can see where our results come from. One can think of it as a learning step. The machinery being employed is called Markov Chain Monte Carlo (MCMC), which I also delay explaining until Chapter 3. This technique returns thousands of random variables from the posterior distributions of $\lambda_1$, $\lambda_2$ and $\tau$. We can plot a histogram of the random variables to see what the posterior distributions look like. Below, we collect the samples (called traces in the MCMC literature) into histograms.
# Mysterious code to be explained in Chapter 3.
mcmc = pm.MCMC(model)
mcmc.sample(40000, 10000, 1)
[-----------------100%-----------------] 40000 of 40000 complete in 6.5 sec
lambda_1_samples = mcmc.trace('lambda_1')[:]
lambda_2_samples = mcmc.trace('lambda_2')[:]
tau_samples = mcmc.trace('tau')[:]
figsize(12.5, 10)
# histogram of the samples:
ax = plt.subplot(311)
ax.set_autoscaley_on(False)
plt.hist(lambda_1_samples, histtype='stepfilled', bins=30, alpha=0.85,
label="posterior of $\lambda_1$", color="#A60628", density=True)
plt.legend(loc="upper left")
plt.title(r"""Posterior distributions of the variables
$\lambda_1,\;\lambda_2,\;\tau$""")
plt.xlim([15, 30])
plt.xlabel("$\lambda_1$ value")
ax = plt.subplot(312)
ax.set_autoscaley_on(False)
plt.hist(lambda_2_samples, histtype='stepfilled', bins=30, alpha=0.85,
label="posterior of $\lambda_2$", color="#7A68A6", density=True)
plt.legend(loc="upper left")
plt.xlim([15, 30])
plt.xlabel("$\lambda_2$ value")
plt.subplot(313)
w = 1.0 / tau_samples.shape[0] * np.ones_like(tau_samples)
plt.hist(tau_samples, bins=n_count_data, alpha=1,
label=r"posterior of $\tau$",
color="#467821", weights=w, rwidth=2.)
plt.xticks(np.arange(n_count_data))
plt.legend(loc="upper left")
plt.ylim([0, .75])
plt.xlim([35, len(count_data) - 20])
plt.xlabel(r"$\tau$ (in days)")
plt.ylabel("probability");
Recall that Bayesian methodology returns a distribution. Hence we now have distributions to describe the unknown $\lambda$s and $\tau$. What have we gained? Immediately, we can see the uncertainty in our estimates: the wider the distribution, the less certain our posterior belief should be. We can also see what the plausible values for the parameters are: $\lambda_1$ is around 18 and $\lambda_2$ is around 23. The posterior distributions of the two $\lambda$s are clearly distinct, indicating that it is indeed likely that there was a change in the user's text-message behaviour.
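These plausible values can be read directly off the samples; for example, a quick summary sketch:

# Posterior means and central 95% intervals, straight from the samples.
print("lambda_1 mean: %.2f" % lambda_1_samples.mean())
print("lambda_2 mean: %.2f" % lambda_2_samples.mean())
print("lambda_1 95% interval:", np.percentile(lambda_1_samples, [2.5, 97.5]))
print("lambda_2 95% interval:", np.percentile(lambda_2_samples, [2.5, 97.5]))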
What other observations can you make? If you look at the original data again, do these results seem reasonable?
Notice also that the posterior distributions for the $\lambda$s do not look like exponential distributions, even though our priors for these variables were exponential. In fact, the posterior distributions are not really of any form that we recognize from the original model. But that's OK! This is one of the benefits of taking a computational point of view. If we had instead done this analysis using mathematical approaches, we would have been stuck with an analytically intractable (and messy) distribution. Our use of a computational approach makes us indifferent to mathematical tractability.
Our analysis also returned a distribution for $\tau$. Its posterior distribution looks a little different from the other two because it is a discrete random variable, so it doesn't assign probabilities to intervals. We can see that near day 45, there was a 50% chance that the user's behaviour changed. Had no change occurred, or had the change been gradual over time, the posterior distribution of $\tau$ would have been more spread out, reflecting that many days were plausible candidates for $\tau$. By contrast, in the actual results we see that only three or four days make any sense as potential transition points.
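We can quantify this directly from the trace: the fraction of posterior samples landing on each day is that day's posterior probability. A small sketch:

# Posterior probability of each candidate switchpoint day, from the samples.
tau_counts = np.bincount(tau_samples.astype(int), minlength=n_count_data)
tau_probs = tau_counts / float(len(tau_samples))
for day in np.argsort(tau_probs)[::-1][:4]:  # the four most probable days
    print("day %d: posterior probability %.3f" % (day, tau_probs[day]))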
Why would we want samples from the posterior, anyways? We will deal with this question for the remainder of the book, and it is an understatement to say that it will lead us to some amazing results. For now, let's end this chapter with one more example.
We'll use the posterior samples to answer the following question: what is the expected number of texts at day $t$, $0 \le t \le 70$? Recall that the expected value of a Poisson variable is equal to its parameter $\lambda$. Therefore, the question is equivalent to what is the expected value of $\lambda$ at time $t$?
In the code below, let $i$ index samples from the posterior distributions. Given a day $t$, we average over all possible $\lambda_i$ for that day $t$, using $\lambda_i = \lambda_{1,i}$ if $t \lt \tau_i$ (that is, if the behaviour change has not yet occurred), else we use $\lambda_i = \lambda_{2,i}$.
figsize(12.5, 5)
# tau_samples, lambda_1_samples, lambda_2_samples contain
# N samples from the corresponding posterior distribution
N = tau_samples.shape[0]
expected_texts_per_day = np.zeros(n_count_data)
for day in range(0, n_count_data):
    # ix is a bool index of all tau samples corresponding to
    # the switchpoint occurring prior to value of 'day'
    ix = day < tau_samples
    # Each posterior sample corresponds to a value for tau.
    # For each day, that value of tau indicates whether we're "before"
    # (in the lambda_1 "regime") or
    # "after" (in the lambda_2 "regime") the switchpoint.
    # By taking the posterior sample of lambda_1/2 accordingly, we can average
    # over all samples to get an expected value for lambda on that day.
    # As explained, the "message count" random variable is Poisson-distributed,
    # and therefore lambda (the Poisson parameter) is the expected value of
    # "message count".
    expected_texts_per_day[day] = (lambda_1_samples[ix].sum()
                                   + lambda_2_samples[~ix].sum()) / N
plt.plot(range(n_count_data), expected_texts_per_day, lw=4, color="#E24A33",
label="expected number of text-messages received")
plt.xlim(0, n_count_data)
plt.xlabel("Day")
plt.ylabel("Expected # text-messages")
plt.title("Expected number of text-messages received")
plt.ylim(0, 60)
plt.bar(np.arange(len(count_data)), count_data, color="#348ABD", alpha=0.65,
label="observed texts per day")
plt.legend(loc="upper left");
Our analysis shows strong support for believing the user's behavior did change ($\lambda_1$ would have been close in value to $\lambda_2$ had this not been true), and that the change was sudden rather than gradual (as demonstrated by $\tau$'s strongly peaked posterior distribution). We can speculate what might have caused this: a cheaper text-message rate, a recent weather-to-text subscription, or perhaps a new relationship. (In fact, the 45th day corresponds to Christmas, and I moved away to Toronto the next month, leaving a girlfriend behind.)
1. Using lambda_1_samples and lambda_2_samples, what is the mean of the posterior distributions of $\lambda_1$ and $\lambda_2$?
# type your code here.
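One possible solution (a sketch; the posterior mean is simply the average of the trace):

# Possible solution: posterior means are the averages of the samples.
print("lambda_1 posterior mean:", lambda_1_samples.mean())
print("lambda_2 posterior mean:", lambda_2_samples.mean())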
2. What is the expected percentage increase in text-message rates? Hint: compute the mean of lambda_1_samples/lambda_2_samples. Note that this quantity is very different from lambda_1_samples.mean()/lambda_2_samples.mean().
# type your code here.
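One possible solution sketch, averaging the sample-wise relative increase (note this differs from dividing the two means, since the expectation of a ratio is not the ratio of expectations):

# Possible solution: average the relative increase sample-by-sample.
relative_increase = (lambda_2_samples - lambda_1_samples) / lambda_1_samples
print("expected percentage increase: %.1f%%" % (100 * relative_increase.mean()))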
3. What is the mean of $\lambda_1$ given that we know $\tau$ is less than 45? That is, suppose we have been given new information that the change in behaviour occurred prior to day 45. What is the expected value of $\lambda_1$ now? (You do not need to redo the PyMC part. Just consider all instances where tau_samples < 45.)
# type your code here.
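One possible solution (a sketch: condition on the new information by keeping only the consistent samples):

# Possible solution: restrict to samples with tau < 45, then average lambda_1.
ix = tau_samples < 45
print("E[lambda_1 | tau < 45]: %.3f" % lambda_1_samples[ix].mean())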
[3] Patil, A., Huard, D., Fonnesbeck, C.J. (2010). PyMC: Bayesian Stochastic Modelling in Python. Journal of Statistical Software, 35(4), pp. 1-81.
from IPython.core.display import HTML
def css_styling():
    styles = open("../styles/custom.css", "r").read()
    return HTML(styles)
css_styling()