非正态响应模型 - 第 1 部分：广义线性模型

Preamble 序言

In our previous sessions, we used linear models to explore the relationship between multiple predictors $X_{1}$ , $X_{2}$ ,and a continuous response $Y$ . However, the assumptions of the linear model are not always well-suited to the data that we are interested in. For example, if our response $Y$ is a count variable (e.g., number of flies per cow, number of seeds in a 1-m $^{2}$ plot), the mean $μ$ should be strictly non-negative and integer. If our response $Y$ is a binary variable (success/failure, true/false, present/absent, dead/alive, which we often code as 0/1), the mean $μ$ represents the probability of success and $μ$ must, therefore, be constrained between 0 and 1.
在之前的会议中，我们使用线性模型来探索多个预测变量 $X_{1}$ 之间的关系， $X_{2}$ 以及连续响应 $Y$ 。然而，线性模型的假设并不总是适合我们感兴趣的数据。例如，如果我们的响应 $Y$ 是一个计数变量（例如，每头奶牛的苍蝇数量，1 m $^{2}$ 图中的种子数量），则均值 $μ$ 应严格为非负数和整数。如果我们的响应 $Y$ 是一个二元变量（成功/失败、真/假、存在/不存在、死/活，我们通常将其编码为 0/1），则均值 $μ$ 表示成功的概率， $μ$ 因此必须约束在 0 和 1 之间。

In addition, the distribution of observations $Y$ around $μ$ does not always follow a Normal distribution. For example, count data are strictly discrete (i.e., $Y$ can be only have integer values 0, 1, 2, 3), whereas binary data can only have two options (0 or 1 value). When the response data is discrete or binary, we need to use discrete or binary distributions to model them in our analyses.
此外， $Y$ 周围 $μ$ 观测值的分布并不总是遵循正态分布。例如，计数数据是严格离散的（即 $Y$ 只能有整数值 0、1、2、3），而二进制数据只能有两个选项（0 或 1 个值）。当响应数据是离散或二进制时，我们需要在分析中使用离散或二元分布来对它们进行建模。

In the linear model that we have used so far, the mean $μ$ can range from negative infinity to positive infinity and assumes the residuals follow a normal distribution. However, when the response data do not map onto that range or cannot produce normally distributed residuals, this can cause problems. Generalized Linear Models (GLMs) generalize the linear model to deal with these cases.
在我们目前使用的线性模型中，均值 $μ$ 的范围可以从负无穷大到正无穷大，并假设残差服从正态分布。但是，当响应数据未映射到该范围或无法生成正态分布残差时，这可能会导致问题。广义线性模型（GLM）推广线性模型以处理这些情况。

In this session we will introduce you to the use of GLMs using count data and binary data. Upon completion of this session, you should be able to:
在本次会议中，我们将向您介绍如何使用计数数据和二进制数据来使用 GLM。完成此会话后，您应该能够：

generate and interpret linear models for count and binary data.
生成和解释计数和二进制数据的线性模型。
understand how to interpret the models and relate that to their biological meaning.
了解如何解释模型并将其与它们的生物学意义联系起来。

Assessment 评估

This prac will be assessed. There are 5 questions worth 12 points at the end of the prac. You will need to provide answers for each of the questions. Your answers may require written text, R code and R output, R graphics, or some combination of these.
将对此进行评估。练习结束时有 5 个问题，每 12 分。您需要为每个问题提供答案。你的答案可能需要书面文本、R 代码和 R 输出、R 图形或这些的某种组合。

Revisiting the linear model
重新审视线性模型

By now you should all be familiar with the equation for a linear model:
到现在为止，你们应该都熟悉线性模型的方程：

Y \sim β_{0} + β_{1} X + ϵ ϵ \sim N (0, σ^{2})

To better understand the link between the linear model and the generalized linear model, it is useful to re-write the linear model shown above using the following set of three equations:
为了更好地理解线性模型和广义线性模型之间的联系，使用以下三个方程组重写上面显示的线性模型是很有用的：

$Y \sim N (μ, σ^{2})$

This states that the observations $Y$ follow a Normal distribution with mean $μ$ and variance $σ^{2}$ .
这说明观测值 $Y$ 遵循具有均值 $μ$ $σ^{2}$ 和方差的正态分布。

$μ = f (l p)$

The mean ( $μ$ ) is a function of an intermediate quantity called the linear predictor, $l p$ . In the specific case of the linear model, the function $f$ is the identity function; that is, $μ = l p$ .
均值（ $μ$ ）是称为线性预测变量的中间量的函数。 $l p$ 在线性模型的具体情况下，函数 $f$ 是恒等函数;也就是说， $μ = l p$ .

$l p = β_{0} + β_{1} X_{1}$

The linear predictor, $l p$ , is the linear combination of the predictors represented as the equation of the line that we are familiar with.
线性预测变量 $l p$ 是预测变量的线性组合，表示为我们熟悉的直线方程。

Thus the mean of $Y$ , $μ$ , equals the linear predictor $l p$ , which equals $β_{0} + β_{1} X_{1}$ . Or, alternatively:
因此 $Y$ ，的 $μ$ 均值等于线性预测变量，线性预测变量 $l p$ 等于 $β_{0} + β_{1} X_{1}$ 。或者，或者：

Y \sim N (β_{0} + β_{1} X_{1}, σ^{2})

Generalising the linear model: Generalized Linear Model
推广线性模型：广义线性模型

The Generalized Linear Model (GLM) generalises the linear model to accommodate different types of response variables. While we keep the linear combination of predictors (the third of the three equations above), we allow for different stochastic distributions of observations $Y$ around the mean $μ$ , and we allow for different functions linking $μ$ to the $l p$ :
广义线性模型（GLM）对线性模型进行推广，以适应不同类型的响应变量。虽然我们保留了预测变量的线性组合（上述三个方程中的第三个），但我们允许 $Y$ 观测值在均值 $μ$ 附近进行不同的随机分布，并且我们允许链接 $μ$ 到 $l p$ ：

$Y \sim D i s t r i b u t i o n (μ, . . .)$

This says that the observations $Y$ follow a probability distribution with mean $μ$ . There are a range of common distributions for GLMs, including Poisson and Binomial distributions, which we discuss below.
这表示观测值 $Y$ 遵循均值 $μ$ 的概率分布。GLM 有一系列常见的分布，包括 Poisson 分布和二项分布，我们将在下面讨论。

$μ = f (l p)$

There is a function linking $μ$ to $l p$ . This maps the $μ$ onto the $l p$ . For a linear model, $μ$ and the $l p$ are the same. However, when a linear model is not appropriate, then the function linking the two is central to the implementation of the GLM. In effect, the function linearises the predictor.
有一个链接 $μ$ 到 $l p$ 的函数。这会将 $μ$ 映射到 . $l p$ 对于线性模型， $μ$ 和 $l p$ 是相同的。但是，当线性模型不合适时，将两者联系起来的功能是 GLM 实现的核心。实际上，该函数使预测变量线性化。

$l p = β_{0} + β_{1} X_{1}$

The linear predictor, as before.
线性预测变量，如前所述。

So, let’s look at how we implement a GLM and how these three equations work when the model that we are using is not the standard linear model.
因此，让我们看看我们如何实现 GLM，以及当我们使用的模型不是标准线性模型时，这三个方程是如何工作的。

Dealing with count response data: Poisson GLMs
处理计数响应数据：Poisson GLM

Recall that the standard distribution for count response data is the Poisson distribution. The Poisson distribution is typically described by a single parameter $λ$ (lambda), which describes both the mean ( $μ$ ) and variance ( $σ^{2}$ ) in the data. For consistency with the three-equation structure described above, we will use $μ$ to describe the mean instead of $λ$ :
回想一下，计数响应数据的标准分布是泊松分布。泊松分布通常由单个参数 $λ$ （lambda）描述，该参数同时描述数据中的均值（ $μ$ ）和方差（ $σ^{2}$ ）。为了与上述三方程结构保持一致，我们将使用 $μ$ 平均值来描述平均值，而不是 $λ$ ：

$Y \sim P o i s s o n (μ)$

Because the mean and variance are equal, as the mean increases, so to does the variance. This is a problem for standard linear models. In an earlier prac, we addressed this issue by log-transforming the data.
因为均值和方差相等，所以随着均值的增加，方差也相等。这是标准线性模型的问题。在早期的实践中，我们通过对数据进行对数转换来解决这个问题。

Since count data are strictly positive, we need to ensure that $μ$ remains positive. This can be done by using a log-link to relate $μ$ and $l p$ : $l o g (μ) = l p$ . In practice, we first compute $l p$ and then use the inverse of the log-link (i.e., the exponential function) to compute $μ$ from $l p$ :
由于计数数据严格为正数，因此我们需要确保保持 $μ$ 正数。这可以通过使用 log-link 来关联 $μ$ 和 $l p$ ： $l o g (μ) = l p$ 来完成。在实践中，我们首先计算 $l p$ ，然后使用对数链接的逆函数（即指数函数） $μ$ 从以下公式 $l p$ 进行计算：

$μ = e x p (l p)$

The exponential function ensures that $μ > 0$ , which is necessary because, as previously discussed, count data must be non-negative. You can’t have -3 frogs…
指数函数确保 $μ > 0$ ，这是必要的，因为如前所述，计数数据必须是非负数。你不能有 -3 只青蛙......

$l p = b_{0} + b_{1} \times X_{1}$

The linear predictor, as before.
线性预测变量，如前所述。

###Model assumptions ###Model 假设

In our linear models, there were clear model assumtions that had to be met in order to understand if the model could be trustfully interpreted. For GLMs there are fewer assumptions that must be met:
在我们的线性模型中，必须满足明确的模型假设，才能理解模型是否可以被可信地解释。对于 GLM，必须满足的假设较少：

the data are independent
数据是独立的
the data (the response variable) fit the distribution
数据（响应变量）拟合分布

You can look at the distribution of the data (look back to the prac from week 3 on data distributions)
您可以查看数据的分布（回顾第 3 周关于数据分布的 prac）
You can ensure that the data meet the criteria for the distribution (e.g., binomial data = binomial distribution; count data = Poisson distribution; proportion data, bounded by 0 and 1 = beta distribution: look to the week 3 prac)
您可以确保数据符合分布标准（例如，二项式数据 = 二项分布;计数数据 = 泊松分布;比例数据，以 0 和 1 为界 = β 分布：查看第 3 周练习）

Example 1: Counting seeds in the forest
示例 1：计算森林中的种子

Victoria forests experience fires that occur at irregular intervals. Tree species have evolved a range of strategies to persist in fire-prone environments. One strategy for species is to store seeds in the soil, where they are insulated from the heat of fires. A common species in the tall open wet forests of Victoria that uses such a strategy is silver wattle (Acacia dealbata).
维多利亚森林的火灾不定期发生。树种已经进化出一系列策略，以在火灾多发的环境中持续存在。物种的一种策略是将种子储存在土壤中，在那里它们可以免受火灾的热量。在维多利亚州高大开阔的潮湿森林中，使用这种策略的常见物种是银荆（Acacia dealbata）。

For silver wattle, collecting soil samples and counting the number seeds in the soil gives us an idea of whether the species will be able to persist if a fire burns through the landscape in the coming years. However, collecting soil samples and counting seeds is time consuming. For routine checks of the density of seeds in the soil seed bank, it would be useful to find a more cost-effective method. Intuitively, we expect to find silver wattle seeds in the soils underneath silver wattle trees. However, because silver wattle are short-lived species and the seeds can persist for decades, it is possible that seeds may occur in areas where no living trees currently exist. So, it is worth testing whether seed density is higher under living trees within the forest.
对于银荆来说，收集土壤样本并计算土壤中的种子数量可以让我们了解如果未来几年大火烧毁景观，该物种是否能够持续存在。然而，收集土壤样本和计算种子非常耗时。对于土壤种子库中种子密度的常规检查，找到一种更具成本效益的方法将是有用的。直观地说，我们希望在银荆树下的土壤中找到银荆种子。然而，由于银荆是短命物种，种子可以持续数十年，因此种子可能会出现在目前没有活树的地区。因此，值得测试的是，在森林内的活树下，种子密度是否更高。

To do this, field crews installed plots along a gradient of silver wattle tree cover (tree_cover). In each plot, the field crew also collected a $20 \times 20 \times 10$ cm soil sample and counted the number of silver wattle seeds in the soil sample (seed_count).
为此，现场工作人员沿着银荆树覆盖物的梯度（ tree_cover ）安装了地块。在每个地块中，野外工作人员还收集了一 $20 \times 20 \times 10$ 厘米的土壤样本，并计算了土壤样本中银荆种子的数量（ seed_count ）。

Scientific hypothesis: Tree cover influences the number of silver wattle seeds in the soil seed bank. Statistical hypothesis: The slope of the relationship between tree cover and seed count is non-zero.
科学假设：树木覆盖影响土壤种子库中银荆种子的数量。统计假设：树木覆盖率和种子数之间关系的斜率不为零。

Let’s load the data and have a look at what the field crew found.
让我们加载数据，看看现场工作人员发现了什么。

head(data_seed)

##   seed_count tree_cover
## 1          1       29.5
## 2          0       39.7
## 3          4       59.0
## 4          6       91.2
## 5          1       23.4
## 6          3       90.2

summary(data_seed)

##    seed_count      tree_cover   
##  Min.   : 0.00   Min.   : 5.30  
##  1st Qu.: 1.00   1st Qu.:33.23  
##  Median : 2.00   Median :52.80  
##  Mean   : 2.75   Mean   :53.17  
##  3rd Qu.: 3.00   3rd Qu.:73.70  
##  Max.   :19.00   Max.   :99.20

Note how seed_count in our soil sample only takes integer values (0, 1, 2, 3, …). By definition, this will always be the case for count responses. Also, note that the minimum value for seed_count is zero. Count responses must be strictly non-negative (i.e., zero or positive); they cannot take negative values.
请注意，在我们的土壤样本中，如何 seed_count 只取整数值（0、1、2、3、...）。根据定义，计数响应将始终如此。另请注意，的 seed_count 最小值为零。计数响应必须严格为非负数（即零或正数）;它们不能取负值。

We want to investigate the effect of silver wattle tree cover on the seed count in our soil samples. Let’s have a look at the data:
我们想研究银荆树覆盖对土壤样品中种子数量的影响。让我们看一下数据：

ggplot(data = data_seed, aes(x = tree_cover, y = seed_count)) + 
  geom_point()

Tree cover of silver wattle appears to be a decent predictor of silver wattle seed density in the soil seed bank. Notice again how the distribution of seed_count is discrete (it only takes integer values) and is positive.
银荆的树木覆盖似乎是土壤种子库中银荆种子密度的一个很好的预测指标。再次注意的 seed_count 分布是离散的（它只接受整数值）并且是正的。

Now let’s fit a statistical model the data. Because we have a count data response, a generalized linear model (GLM) with a Poisson distribution and log-link function is appropriate. We use the glm() function to do this:
现在让我们拟合一个统计模型，即数据。由于我们有一个计数数据响应，因此具有泊松分布和对数链接函数的广义线性模型（GLM）是合适的。我们使用该 glm() 函数来执行此操作：

glm_seed_fit <- glm(seed_count ~ tree_cover, 
                    family = poisson(link = 'log'), 
                    data = data_seed)

You should note that this differs from the lm() syntax in a few ways. First, we use the glm() function instead of the lm() function. Second, we have to specify the stochastic distribution and the link function in the family statement. However, in terms of the specification of the predictor and response variables, the glm() and lm() functions are identical. Let’s look at the summary output:
您应该注意，这与 lm() 语法在几个方面有所不同。首先，我们使用 glm() 函数而不是 lm() 函数。其次，我们必须在 family 语句中指定随机分布和链接函数。但是，就预测变量和响应变量的规范而言， glm() 和 lm() 函数是相同的。让我们看一下摘要输出：

summary(glm_seed_fit)

## 
## Call:
## glm(formula = seed_count ~ tree_cover, family = poisson(link = "log"), 
##     data = data_seed)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.9727  -0.9578  -0.1705   0.5006   3.0341  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -0.952654   0.273527  -3.483 0.000496 ***
## tree_cover   0.031351   0.003695   8.485  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 160.253  on 59  degrees of freedom
## Residual deviance:  74.255  on 58  degrees of freedom
## AIC: 219.52
## 
## Number of Fisher Scoring iterations: 5

This summary table looks a little different from the ones you’ve seen before. Notice that, instead of the summary of residuals in the second chunk of output for the linear models, we now have deviance residuals. And in place of the $F$ -test and adjusted R $^{2}$ values, we have null and residual deviances and an AIC score. Deviance is a generalization of the concept of the sum-of-squares for non-normal models. However, the details of how deviance is calculated are beyond the scope of this unit.
此汇总表看起来与您之前看到的汇总表略有不同。请注意，对于线性模型，我们现在使用的不是线性模型的第二块输出中的残差汇总，而是偏差残差。代替 $F$ -test 和调整后的 R $^{2}$ 值，我们有零偏差和残差偏差以及 AIC 分数。偏差是非正态模型的平方和概念的推广。但是，如何计算偏差的细节超出了本单元的范围。

From the summary table, we can see that the equation for the linear predictor $l p$ is:
从汇总表中，我们可以看到线性预测变量 $l p$ 的方程为：

- 0.953 + 0.031 \times t r e e c o v e r

The effect of tree cover is positive (0.031) and significantly different from zero ( $p$ -value << 0.001). This is consistent with what we saw in the figure — seed count increases with tree cover, but the variability in seed count also increases with increasing tree cover.
树木覆盖的影响为正（0.031），与零（ $p$ -值<<0.001）显著不同。这与我们在图中看到的情况一致——种子数量随着树木覆盖率的增加而增加，但种子数量的变异性也随着树木覆盖率的增加而增加。

If we want to predict the mean seed count $μ$ for a given tree cover value, we have to follow three steps:
如果我们想预测给定树木覆盖值的平均种子计数 $μ$ ，我们必须遵循三个步骤：

Chose the value of tree_cover that you want to use in your prediction.
选择要在预测中使用的 tree_cover 值。
Compute the linear predictor $l p$ : $l p = - 0.953 + 0.031 \times tree_cover$ .
计算线性预测变量 $l p$ ： $l p = - 0.953 + 0.031 \times tree_cover$ .
Compute $μ$ from the $l p$ using the inverse of the log-link (i.e., the exponential function): $μ = e x p (l p)$ .
$μ$ $l p$ 使用对数链接的逆函数（即指数函数） $μ = e x p (l p)$ 进行计算：。

To make plotting the GLM easier and our code more general (e.g., to deal with models that have more than one predictor), we will store the results in a dataframe that we call the prediction_matrix:
为了使绘制 GLM 更容易，我们的代码更通用（例如，处理具有多个预测变量的模型），我们将结果存储在一个数据帧中， prediction_matrix 我们称之为：

prediction_matrix <- data.frame(tree_cover = 2:100)
prediction_matrix$lp <- -0.952654 + 0.031351 * prediction_matrix$tree_cover
prediction_matrix$mu <- exp(prediction_matrix$lp)

Note that we state the name of the predictor variable when we create the prediction matrix and provide a series of values to predict with.
请注意，我们在创建预测矩阵时声明预测变量的名称，并提供一系列用于预测的值。

Alternatively, we can also use the predict function to do this (and skip the second and third lines in the previous code chunk):
或者，我们也可以使用该 predict 函数来执行此操作（并跳过上一个代码块中的第二行和第三行）：

prediction_matrix$mu2 <- predict(glm_seed_fit, newdata=prediction_matrix, type="response")

The predict() function is a very powerful tool that you will need to use with GLMs, so it is important that you understand how it works. The basic syntax is simple. You call the predict() function, specify the model that you want to predict from, and then provide the new dataframe to use. The key element of the newdata= argument is that the new dataframe must include a variable that has the same name as the predictor variable in the original model specification.
该 predict() 功能是一个非常强大的工具，您需要将其与 GLM 一起使用，因此了解它的工作原理非常重要。基本语法很简单。调用函数 predict() ，指定要从中预测的模型，然后提供要使用的新数据帧。该 newdata= 参数的关键元素是，新数据帧必须包含与原始模型规范中的预测变量同名的变量。

We can also specify the scale for the values that predict() generates. The two options that you are likely to use are "link" (the default) and "response". The default "link" returns the predictions on the scale of the linear predictors (which is not necessarily a linear scale..!). For the Poisson regression example that we used here, that would be the log-transformed responses. If you want them on the original linear scale that they were measured in, there are two options: 1) Use the default "link" option and convert the values using exp() (for the log-link case in the seed count example). Or you can just use type = "response" (for any GLM).
我们还可以指定 predict() 生成值的比例。您可能使用的两个选项是 "link" （默认值）和 "response" 。默认值 "link" 返回线性预测变量尺度上的预测（不一定是线性尺度..！）。对于我们在这里使用的泊松回归示例，这将是对数转换的响应。如果您希望它们在测量它们的原始线性刻度上，有两个选项：1）使用默认 "link" 选项并使用（ exp() 对于种子计数示例中的对数链接情况）转换值。或者您可以只使用 type = "response" （对于任何 GLM）。

Once we have our prediction matrix with the associated $l p$ and the predicted $μ$ values, we can add the model predictions to the original data that we plotted earlier.
一旦我们有了包含相关 $l p$ 值和预测 $μ$ 值的预测矩阵，我们就可以将模型预测添加到我们之前绘制的原始数据中。

ggplot(data = data_seed, aes(x = tree_cover, y = seed_count)) + 
  geom_point() + 
  geom_line(data = prediction_matrix, aes(tree_cover, mu), colour = "blue")

And now you have the GLM model with a Poisson distribution and log-link for count data. You have interpreted the results from the summary() function and predicted the mean seed count value that we would expect at any value of tree cover within the range of the data set.
现在，您拥有了具有泊松分布和计数数据对数链接的 GLM 模型。您已经解释了 summary() 函数的结果，并预测了我们在数据集范围内的任何树木覆盖值下所期望的平均种子计数值。

If we want to print out a nicely formatted table of the model outputs, we can use the following from the parameters package:
如果我们想打印出一个格式良好的模型输出表，我们可以使用 parameters 包中的以下内容：

print_html(model_parameters(glm_seed_fit), summary = TRUE)

Parameter 参数	Coefficient 系数	SE	95% CI	z	p
(Intercept) （拦截）	-0.95	0.27	(-1.51, -0.44)	-3.48	< .001 <.001
tree cover 树木覆盖	0.03	3.69e-03 3.69E-03型	(0.02, 0.04)	8.48	< .001 <.001

Example 2: Frog abundance across microhabitats
示例2：不同微生境的青蛙丰度

In the previous example, we had a continuous predictor (tree cover) and a count-based response variable (number of seeds in a soil sample). We used the Poisson GLM in much the same way that we would a linear regression. Indeed, the final output was a fitted line, albeit a curved one. We can also use Poisson GLMs when we have a categorical predictor – much like we use a linear model for an ANOVA.
在前面的示例中，我们有一个连续预测变量（树木覆盖率）和一个基于计数的响应变量（土壤样本中的种子数量）。我们使用泊松 GLM 的方式与线性回归大致相同。事实上，最终的输出是一条拟合线，尽管是一条弯曲的线。当我们有分类预测变量时，我们也可以使用泊松 GLM——就像我们使用线性模型进行方差分析一样。

In this example, we will look at the abundance of frogs encountered on different microhabitats. The response variable is count of frogs encountered during field surveys throughout the year at different sites. Field surveys occurred for 30 minutes and all frogs that were seen within that 30 minutes were recorded, along with the microhabitat that they were found in. The study included 460 observations and 6 different microhabitats. So, each microhabitat will have multiple frog counts. We are interested in whether there are differences in the number of frogs amongst the microhabitats.
在这个例子中，我们将看看在不同微生境中遇到的青蛙的丰度。响应变量是全年在不同地点进行实地调查时遇到的青蛙数量。实地调查持续了30分钟，记录了在这30分钟内看到的所有青蛙，以及它们被发现的微生境。该研究包括 460 个观察结果和 6 个不同的微生境。因此，每个微生境都会有多个青蛙数量。我们感兴趣的是微生境中青蛙的数量是否存在差异。

Scientific hypothesis Microhabitat influences the abundance of frogs.
科学假说：微生境影响青蛙的丰度。

Statistical hypothesis There are no differences in the mean counts of frogs across the different microhabitats.
统计假设：不同微生境中青蛙的平均数量没有差异。

Let’s start by looking at the data:
让我们先看一下数据：

frog_count$microhabitat<-as.factor(frog_count$microhabitat)
head(frog_count)

##   X microhabitat count
## 1 1         bank     1
## 2 2         bank    13
## 3 3         bank     2
## 4 4         bank     1
## 5 5         bank     1
## 6 6         bank     1

summary(frog_count)

##        X                      microhabitat     count       
##  Min.   :  1.0   bank               : 86   Min.   : 1.000  
##  1st Qu.:115.8   elevated vegetation: 45   1st Qu.: 1.000  
##  Median :230.5   emergent vegetation: 67   Median : 3.000  
##  Mean   :230.5   leaf litter        : 83   Mean   : 5.807  
##  3rd Qu.:345.2   road               : 17   3rd Qu.: 7.250  
##  Max.   :460.0   water              :162   Max.   :42.000

Now let’s look at the distribution of the data:
现在让我们看一下数据的分布情况：

ggplot(frog_count, aes(x = count)) +
  geom_histogram()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

#or you can use

hist(frog_count$count, breaks=30)

Let’s start by running a linear model and skipping straight to checking the model assumptions:
让我们从运行线性模型开始，然后直接跳到检查模型假设：

frog_lm1 <- lm(glm(count ~ microhabitat, data = frog_count))
check_model(frog_lm1)

Clearly, there are some serious (!) problems with the data. It is certainly not normally distributed and the variance appears to increase over the range of fitted values.
显然，数据存在一些严重的（！）问题。它肯定不是正态分布的，并且方差似乎在拟合值范围内增加。

Because the response data are count data, we will use a Poisson GLM to see if we can resolve the issue. The alternative is to log-transform the data, but with count data, particularly if the dataset contains many zeros, that may lead to other problems. So we’ll go with the Poisson GLM:
由于响应数据是计数数据，因此我们将使用 Poisson GLM 来查看是否可以解决问题。另一种方法是对数据进行日志转换，但使用计数数据，特别是如果数据集包含许多零，则可能会导致其他问题。因此，我们将使用 Poisson GLM：

frog_count_glm<-glm(count ~ microhabitat, 
                    data=frog_count, 
                    family=poisson(link="log"))

Anova(frog_count_glm)

## Analysis of Deviance Table (Type II tests)
## 
## Response: count
##              LR Chisq Df Pr(>Chisq)    
## microhabitat   566.25  5  < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

In base R the anova() function provides a summary of the main effects and interaction terms in a linear model. However, for GLMs it does not do this. If we want similar information for the glm() model, we need to use the Anova() function in the car package. There are some key differences in how the statistical test is calculated (e.g., the test statistic is $χ^{2}$ , not $F$ ); however, the interpretation is the same as the interpretation of the anova() output for a linear model. In this case, it indicates that there is at least one significant difference in the group means of the frog counts by microhabitat.
在基础上，该 anova() 函数提供了线性模型中主要效应和交互项的摘要。但是，对于 GLM，它不会这样做。如果我们想要 glm() 模型的类似信息，我们需要使用 car 包中的 Anova() 函数。统计检验的计算方式存在一些关键差异（例如，检验统计量为 $χ^{2}$ ，而不是 $F$ ）;但是，该解释与线性模型 anova() 的输出解释相同。在这种情况下，它表明微生境的青蛙计数的组均值至少存在一个显着差异。

We can look at the summary output to see if it provides any indication of the differences. Recall that the parameter estimates are 1) all relative to the first group (in alphabetical order – bank, in this case) and 2) the parameter estimates are on the scale of the link function, not the linear predictor.
我们可以查看摘要输出，看看它是否提供了任何差异的指示。回想一下，参数估计值是 1）全部相对于第一组（在本例中按字母顺序 – bank ）和 2）参数估计值在链接函数的尺度上，而不是线性预测变量上。

summary(frog_count_glm)

## 
## Call:
## glm(formula = count ~ microhabitat, family = poisson(link = "log"), 
##     data = frog_count)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.4975  -1.5006  -0.7870   0.8814   9.4900  
## 
## Coefficients:
##                                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                      1.18756    0.05955  19.943  < 2e-16 ***
## microhabitatelevated vegetation -0.00379    0.10173  -0.037    0.970    
## microhabitatemergent vegetation  0.01492    0.08961   0.166    0.868    
## microhabitatleaf litter          0.38506    0.07776   4.952 7.34e-07 ***
## microhabitatroad                 0.62362    0.11472   5.436 5.45e-08 ***
## microhabitatwater                1.04801    0.06485  16.159  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 2816.0  on 459  degrees of freedom
## Residual deviance: 2249.7  on 454  degrees of freedom
## AIC: 3690
## 
## Number of Fisher Scoring iterations: 5

If we want to determine the mean frog counts for each microhabitat and identify which microhabitats differ from each other in frog count, we can use emmeans(), as we did for linear models.
如果我们想确定每个微生境的平均青蛙数量，并确定哪些微生境在青蛙数量上彼此不同，我们可以像线性模型一样使用 emmeans() 。

frog_emm <- emmeans(frog_count_glm, ~microhabitat, type="response")
frog_emm

##  microhabitat        rate    SE  df asymp.LCL asymp.UCL
##  bank                3.28 0.195 Inf      2.92      3.69
##  elevated vegetation 3.27 0.269 Inf      2.78      3.84
##  emergent vegetation 3.33 0.223 Inf      2.92      3.80
##  leaf litter         4.82 0.241 Inf      4.37      5.32
##  road                6.12 0.600 Inf      5.05      7.41
##  water               9.35 0.240 Inf      8.89      9.83
## 
## Confidence level used: 0.95 
## Intervals are back-transformed from the log scale

This provides the group means on the response scale (ie, the actual number of counts per 30-minute survey). If we want to see all pairwise comparisons, we can use the pairs() function on the frog_emm object:
这提供了响应量表上的组均值（即每 30 分钟调查的实际计数数）。如果我们想查看所有成对比较，我们可以在 frog_emm 对象上使用该 pairs() 函数：

pairs(frog_emm)

##  contrast                                  ratio     SE  df null z.ratio
##  bank / elevated vegetation                1.004 0.1021 Inf    1   0.037
##  bank / emergent vegetation                0.985 0.0883 Inf    1  -0.166
##  bank / leaf litter                        0.680 0.0529 Inf    1  -4.952
##  bank / road                               0.536 0.0615 Inf    1  -5.436
##  bank / water                              0.351 0.0227 Inf    1 -16.159
##  elevated vegetation / emergent vegetation 0.981 0.1043 Inf    1  -0.176
##  elevated vegetation / leaf litter         0.678 0.0654 Inf    1  -4.032
##  elevated vegetation / road                0.534 0.0684 Inf    1  -4.897
##  elevated vegetation / water               0.349 0.0302 Inf    1 -12.175
##  emergent vegetation / leaf litter         0.691 0.0577 Inf    1  -4.429
##  emergent vegetation / road                0.544 0.0646 Inf    1  -5.126
##  emergent vegetation / water               0.356 0.0255 Inf    1 -14.404
##  leaf litter / road                        0.788 0.0867 Inf    1  -2.167
##  leaf litter / water                       0.515 0.0290 Inf    1 -11.793
##  road / water                              0.654 0.0663 Inf    1  -4.187
##  p.value
##   1.0000
##   1.0000
##   <.0001
##   <.0001
##   <.0001
##   1.0000
##   0.0008
##   <.0001
##   <.0001
##   0.0001
##   <.0001
##   <.0001
##   0.2531
##   <.0001
##   0.0004
## 
## P value adjustment: tukey method for comparing a family of 6 estimates 
## Tests are performed on the log scale

Each pairwise comparison is tested against a null hypothesis of no difference. If the $p$ -value is less than 0.05, then we conclude that there is evidence of a difference in frog counts between the two microhabitats being compared. All 15 pairwise comparisons are shown here.
每个成对比较都根据无差异的零假设进行检验。如果 $p$ -value 小于 0.05，那么我们得出结论，有证据表明所比较的两个微生境之间的青蛙数量存在差异。此处显示了所有 15 个成对比较。

We can also plot the frog_emm object:
我们还可以绘制 frog_emm 对象：

plot(frog_emm, xlab="Frog count (n)", ylab="Microhabitat")

The estimated marginal means show three distinct groups of microhabitat types in terms of frog counts. The greatest number of frogs were found in water, the fewest frogs were found on the bank of the water body or in vegetation. Intermediate numbers of frogs were found in leaf litter and road habitats.
估计的边际均值显示了在青蛙数量方面有三组不同的微生境类型。在水中发现的青蛙数量最多，在水体岸边或植被中发现的青蛙最少。在落叶和道路栖息地中发现了中等数量的青蛙。

Reporting the results: We compared frog count data from six potential microhabitat sites. We used a generalised linear model with a Poisson log-link function to account for the count data. We found a significant effect of microhabitat on frog counts ( $χ_{d f = 5}^{2}$ = 566.25, $p$ << 0.0001). Pairwise comparisons showed three distinct groups of microhabitats based on the frog count data: 1) water, 2) roads and leaf litter, and 3) banks, emergent vegetation, and elevated vegetation.
报告结果：我们比较了来自六个潜在微生境地点的青蛙计数数据。我们使用带有泊松对数链接函数的广义线性模型来解释计数数据。我们发现微生境对青蛙数量有显着影响（ $χ_{d f = 5}^{2}$ = 566.25， $p$ << 0.0001）。根据青蛙数量数据，成对比较显示了三组不同的微生境：1）水，2）道路和落叶，以及3）河岸、新兴植被和高架植被。

Dealing with binary response data: Logistic GLMs
处理二进制响应数据：逻辑 GLM

In many management and research disciplines, binary data is both common and important. We consider data to be binary if it can be written in terms of two mutually exclusive outcomes (e.g., true/false, presence/absence, alive/dead). By convention, we represent binary data as a dummy variable with values of 0 and 1 (e.g., alive = 1, dead = 0).
在许多管理和研究学科中，二进制数据既常见又重要。如果数据可以用两个相互排斥的结果（例如，真/假、存在/不存在、活/死）来写，我们就会认为数据是二进制的。按照惯例，我们将二进制数据表示为值为 0 和 1 的虚拟变量（例如，alive = 1，dead = 0）。

When dealing with binary response data $Y$ , we can model the observations as coming from a binomial distribution with mean $μ$ , where $μ$ is the probability of sampling a one, and $1 - μ$ is the probability of sampling a zero. So, the three components of the generalised linear model for binary data look like this:
在处理二元响应数据 $Y$ 时，我们可以将观测值建模为来自均值 $μ$ 的二项分布，其中 $μ$ 是抽样 1 的概率， $1 - μ$ 是抽样 0 的概率。因此，二进制数据的广义线性模型的三个组成部分如下所示：

$Y \sim B i n o m i a l (μ)$

Observations $Y$ follow a Binomial distribution with mean $μ$ .
观测值 $Y$ 遵循均值 $μ$ 的二项分布。

$μ = \frac{1}{1 + e^{- l p}}$ Logistic function linking $μ$ to $l p$ .
$μ = \frac{1}{1 + e^{- l p}}$ 链接 $μ$ 到 $l p$ 的逻辑函数。

Since $μ$ is a probability, it needs to be constrained between 0 and 1. This can be done by using a logit function to link $μ$ and the linear predictor, $l p$ :
由于 $μ$ 是一个概率，因此需要将其约束在 0 和 1 之间。这可以通过使用 logit 函数链接 $μ$ 和线性预测器来完成， $l p$ ：

l p = l o g i t (μ) = l o g (\frac{μ}{1 - μ})

In practice, though, we first compute the $l p$ and then use the inverse of the logit-link (which is known as the logistic function) to compute $μ$ from the $l p$ .
然而，在实践中，我们首先计算 $l p$ ，然后使用 logit-link 的逆函数（称为逻辑函数） $μ$ 从计算。 $l p$

μ = \frac{1}{1 + e^{- l p}}

Or if we expand it to its full form:

μ = \frac{1}{1 + e^{(l o g \frac{μ}{1 - μ})}}

μ = \frac{1}{1 + e^{- l p}}

或者，如果我们将其扩展为完整形式：

μ = \frac{1}{1 + e^{(l o g \frac{μ}{1 - μ})}}

Note that other link functions are possible for the Binomial distribution, but the default link is the logit-link.
请注意，二项分布可以使用其他链接函数，但默认链接是 logit-link。

$l p = b_{0} + b_{1} X_{1}$

The linear predictor, as before.
线性预测变量，如前所述。

Exercise 3. A story of flies and sheep hygiene.
练习 3.一个关于苍蝇和绵羊卫生的故事。

In this exercise, we consider how a binary response variable is affected by categorical predictors.
在本练习中，我们将考虑分类预测变量如何影响二元响应变量。

According to the United Nations’s Food and Agriculture Organization (FAO), Australia is the second-largest producer of sheep in the world. It has >100 million sheep (roughly four sheep per person!). When raised in such high densities, sheep, like other livestock animals, are prone to various diseases. In Australia, myiasis (commonly called flystrike) is a particularly unpleasant affliction in sheep. It is caused when flies deposit their eggs on a sheep, the eggs hatch, and the fly larvae grow inside the sheep’s body and feed on the sheep’s tissues. It’s gross.
根据联合国粮食及农业组织（FAO）的数据，澳大利亚是世界第二大绵羊生产国。它有>1亿只羊（大约每人四只羊！当在如此高的密度下饲养时，绵羊和其他牲畜一样，容易患上各种疾病。在澳大利亚，蝇蛆病（通常称为蝇击）是绵羊特别令人不快的疾病。当苍蝇将卵产在绵羊身上，卵孵化，苍蝇幼虫在绵羊体内生长并以绵羊的组织为食时，就会引起这种情况。这太恶心了。

Australian farmers seek to reduce the prevalence of flystrike on sheep livestock for animal welfare and economic reasons. Since flies are attracted by the mix of feces and urine that gets stuck in the wool surrounding the sheep’s buttocks, one way to reduce flystrike infection is mulesing: removing strips of wool-bearing skin around the buttocks of the sheep. While mulesing has proved effective in the past, issues surrounding animal welfare have required the industry to look for alternatives. One potential alternative treatment is crutching: the periodic removal of wool around the tail and between the rear legs of a sheep.
出于动物福利和经济原因，澳大利亚农民试图减少飞蝇对绵羊牲畜的流行。由于苍蝇会被粘在绵羊臀部周围的羊毛中的粪便和尿液的混合物所吸引，因此减少苍蝇感染的一种方法是骡子：去除绵羊臀部周围的羊毛皮条。虽然骡子在过去被证明是有效的，但围绕动物福利的问题要求该行业寻找替代品。一种潜在的替代疗法是拐杖：定期去除绵羊尾巴周围和后腿之间的羊毛。

There are several practical questions that need to be addressed to determine what is the best course of action for farmers raising sheep:
有几个实际问题需要解决，以确定养羊农民的最佳行动方案是什么：

Is mulesing better than no treatment at all?
骡子总比不治疗好吗？
Is crutching as effective as mulesing?
拐杖和骡子一样有效吗？
Is a combination of mulesing and crutching better than mulesing alone?
骡子和拐杖的组合比单独骡子好吗？

To answer these questions, research scientists at the University of Melbourne conducted an experiment. The researchers first looked for sheep that had different treatments (mulesing, crutching, a combination of mulesing and crutching, and no treatment). The researchers then looked for the presence of flystrike on each sheep. The binary response variable was coded as $Y = 1$ when flystrike was present and $Y = 0$ when flystrike was absent. The data are in the data_fly dataframe:
为了回答这些问题，墨尔本大学的研究科学家进行了一项实验。研究人员首先寻找了具有不同治疗方法（骡子，拐杖，骡子和拐杖的组合，以及没有治疗）的绵羊。然后，研究人员在每只羊身上寻找苍蝇袭击的存在。二元响应变量被编码为 $Y = 1$ 何时存在 flystrike 和 $Y = 0$ 何时不存在 flystrike。数据位于数据帧中 data_fly ：

head(data_fly)

##   X flystrike          treatment
## 1 1         0 mulesing+crutching
## 2 2         0          crutching
## 3 3         1 mulesing+crutching
## 4 4         1           mulesing
## 5 5         1          crutching
## 6 6         0           mulesing

Let’s plot the data to see what it looks like. Remember that treatment is the predictor and the presence of flies (flystrike) is the (binary) response.
让我们绘制数据，看看它是什么样子的。请记住， treatment 这是预测因子，苍蝇（ flystrike ）的存在是（二元）响应。

ggplot(data = data_fly, aes(x = treatment, y = flystrike)) + 
  geom_point()

This is not very helpful..!
这不是很有帮助..！

Since the values on the y-axis are either zero (no flies) or one (presence of flies), all of the points sit on top of each other. So, it’s impossible to see any patterns in the data. We can work around this by adding a bit of noise to the plot by using geom_jitter() instead of geom_point():
由于 y 轴上的值为零（没有苍蝇）或 1（存在苍蝇），因此所有点都位于彼此的顶部。因此，不可能在数据中看到任何模式。我们可以通过使用 geom_jitter() 而不是 geom_point() ：

ggplot(data = data_fly, aes(x = treatment, y = flystrike, col = treatment)) +
  geom_jitter(height = 0.1, width = 0.2)

That’s a bit clearer. It looks like sheep who received no treatment have the greatest prevalence of flystrike, whereas sheep that received the combined mulesing+crutching treatments had the lowest. Sheep that received either crutching or mulesing appear to be intermediate. We can check this by using tapply() to compute the mean for each treatment:
这更清楚一些。看起来没有接受治疗的绵羊的蝇击患病率最高，而接受骡子+拐杖联合治疗的绵羊的患病率最低。拄着拐杖或骡子的绵羊似乎是中间的。我们可以通过计算 tapply() 每个处理的平均值来检查这一点：

round(tapply(data_fly$flystrike, data_fly$treatment, FUN = mean), 2)

##          crutching           mulesing mulesing+crutching       no treatment 
##               0.54               0.59               0.19               0.83

While this is a useful summary of the data, it is unclear whether the probability of flystrike infestation is significantly different among treatments (e.g., due to unequal sample sizes). We can use a GLM to quantify this. Since the response data is binary, we will model the observations as coming from a Binomial distribution with mean $μ$ . Since $μ$ is the probability of flystrike presence, it needs to be constrained to the 0–1 range. This is done by using a logit-link function.
虽然这是对数据的有用总结，但尚不清楚不同治疗之间苍蝇侵扰的概率是否显着不同（例如，由于样本量不相等）。我们可以使用 GLM 来量化这一点。由于响应数据是二进制的，我们将观测值建模为来自均值 $μ$ 的二项分布。由于 $μ$ 是 flystrike 存在的概率，因此需要将其限制在 0-1 范围内。这是通过使用 logit-link 函数完成的。

We specify the GLM with Binomial distribution and a logit link function as:
我们将具有二项分布和 logit 链接函数的 GLM 指定为：

fly_glm <- glm(flystrike ~ treatment, 
               family = binomial(link = 'logit'), 
               data = data_fly)

We’ll use the Anova() function from the car package here, as we did with the frogs in the previous example.
我们将在此处使用包中的 Anova() car 函数，就像我们在上一个示例中对青蛙所做的那样。

Anova(fly_glm)

## Analysis of Deviance Table (Type II tests)
## 
## Response: flystrike
##           LR Chisq Df Pr(>Chisq)    
## treatment   43.625  3  1.813e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Clearly, there is at least one significant pairwise difference amongst the groups in terms of mean probability of flystrike. We can obtain parameter estimates, their associated uncertainties, and inference about the efficacy of different treatments relative to the reference treatment using the summary() function, just as we did for linear models.
显然，就苍蝇袭击的平均概率而言，各组之间至少存在一个显着的成对差异。我们可以使用该 summary() 函数获得参数估计值、其相关的不确定性以及有关不同处理相对于参考处理的疗效的推断，就像我们对线性模型所做的那样。

summary(fly_glm)

## 
## Call:
## glm(formula = flystrike ~ treatment, family = binomial(link = "logit"), 
##     data = data_fly)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8930  -1.2435   0.6039   1.0219   1.8328  
## 
## Coefficients:
##                             Estimate Std. Error z value Pr(>|z|)  
## (Intercept)                   0.1542     0.5563   0.277   0.7817  
## treatmentmulesing             0.2231     0.5871   0.380   0.7039  
## treatmentmulesing+crutching  -1.6275     0.6491  -2.507   0.0122 *
## treatmentno treatment         1.4553     0.7413   1.963   0.0496 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 304.82  on 219  degrees of freedom
## Residual deviance: 261.20  on 216  degrees of freedom
## AIC: 269.2
## 
## Number of Fisher Scoring iterations: 4

As we can see, there is no estimate for treatment crutching in the summary of the model. This is because crutching is the first treatment based on alphabetical order and so is considered to be the reference treatment for the summary output. Because this is a randomised experiment with a control (i.e., a no treatment category), we would typically want our reference group to be the control. However, because this study is trying to determine whether certain other treatments are better than mulesing (due to the animal welfare concerns), in this case it makes sense to have mulesing as our reference category.
正如我们所看到的，在模型的摘要中没有对治疗 crutching 的估计。这是因为 crutching 这是第一个基于字母顺序的处理，因此被认为是汇总输出的参考处理。因为这是一个随机实验，有对照组（即无治疗类别），我们通常希望我们的参考组是对照组。然而，由于这项研究试图确定某些其他治疗方法是否比骡子更好（由于动物福利问题），在这种情况下，将骡子作为我们的参考类别是有意义的。

We can do this with factor() and specifying the order of the groups in the levels= argument. This will adjust the order in which they are presented in the model output and graphics.
我们可以使用 factor() 并指定 levels= 参数中组的顺序来做到这一点。这将调整它们在模型输出和图形中的显示顺序。

data_fly$treatment_reorder <- factor(data_fly$treatment, levels=c("mulesing", "crutching", "mulesing+crutching", "no treatment"))

fly_glm <- glm(flystrike ~ treatment_reorder, family = binomial(link = 'logit'), data = data_fly)

summary(fly_glm)

## 
## Call:
## glm(formula = flystrike ~ treatment_reorder, family = binomial(link = "logit"), 
##     data = data_fly)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8930  -1.2435   0.6039   1.0219   1.8328  
## 
## Coefficients:
##                                     Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                           0.3773     0.1874   2.013   0.0441 *  
## treatment_reordercrutching           -0.2231     0.5871  -0.380   0.7039    
## treatment_reordermulesing+crutching  -1.8506     0.3832  -4.829 1.37e-06 ***
## treatment_reorderno treatment         1.2321     0.5245   2.349   0.0188 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 304.82  on 219  degrees of freedom
## Residual deviance: 261.20  on 216  degrees of freedom
## AIC: 269.2
## 
## Number of Fisher Scoring iterations: 4

This is better. 这样更好。

In the summary table, the parameter value associated with the reference treatment is the intercept = 0.3773.
在汇总表中，与参考处理关联的参数值为截距 = 0.3773。

The other parameters describe the difference between the reference treatment and the other treatments. For example, treatment_reorderno treatment = 1.2321 means that the no treatment treatment is 1.2321 points above the reference treatment and has a $p$ -value of 0.0188. We conclude that there is evidence against the null hypothesis of no difference between mulesing and no treatment (p = 0.0188). In addition, the observed difference favors mulesing over no treatment because there is less flystrike for sheep that received the mulesing treatment compared to those that received than no treatment.
其他参数描述了参考处理与其他处理之间的差异。例如， treatment_reorderno treatment = 1.2321 表示 no treatment 处理比参考处理高 1.2321 点， $p$ 并且 -value 为 0.0188。我们得出结论，有证据反对骡子和不治疗之间没有差异的原假设（p = 0.0188）。此外，观察到的差异有利于骡子而不是不治疗，因为与接受骡子治疗的绵羊相比，接受骡子治疗的绵羊的苍蝇袭击更少。

For the reference treatment, the $l p$ is simply the intercept. For the other treatments, we have to manually compute the $l p$ as the sum of the intercept and the different coefficient term. For example, for the “no treatment”, $l p$ = 0.3773 + 1.2321 = 1.6094.
对于参考处理，只是 $l p$ 截距。对于其他处理，我们必须手动计算 $l p$ 截距和不同系数项的总和。例如，对于“不治疗”， $l p$ = 0.3773 + 1.2321 = 1.6094。

We can either calculate the $l p$ by hand (as we did for the no treatment intercept), or have R help us.
我们可以手动计算（ $l p$ 就像我们对无处理截距所做的那样），或者让 R 帮助我们。

You can access the estimated coefficients of your model using coef(fly_glm)[1] for the first coefficient, coef(fly_glm)[2] for the second coefficient, and so on. Use these to compute the value of lp associated with each treatment. So, the $l p$ values for each treatment are:
您可以使用 coef(fly_glm)[1] 第一个系数、 coef(fly_glm)[2] 第二个系数等来访问模型的估计系数。使用这些来计算与每个处理 lp 相关的值。因此，每个处理的 $l p$ 值为：

lp_1 <- coef(fly_glm)[1]
lp_2 <- coef(fly_glm)[1] + coef(fly_glm)[2]
lp_3 <- coef(fly_glm)[1] + coef(fly_glm)[3]
lp_4 <- coef(fly_glm)[1] + coef(fly_glm)[4]
c(lp_1, lp_2, lp_3, lp_4)

## (Intercept) (Intercept) (Intercept) (Intercept) 
##   0.3772942   0.1541507  -1.4733057   1.6094379

To obtain the $μ$ values for each treatment, we need to apply the inverse-logit function to $l p$ :
为了获得每个处理 $μ$ 的值，我们需要将 inverse-logit 函数应用于 $l p$ ：

mu_1 <- 1 / (1 + exp(-lp_1))
mu_2 <- 1 / (1 + exp(-lp_2))
mu_3 <- 1 / (1 + exp(-lp_3))
mu_4 <- 1 / (1 + exp(-lp_4))
c(mu_1, mu_2, mu_3, mu_4)

## (Intercept) (Intercept) (Intercept) (Intercept) 
##   0.5932203   0.5384615   0.1864407   0.8333333

So, what are these values? These are the mean probabilities of flystrike for each of the four treatments. For mulesing, the mean probability of flystrike based on the model is 59.3%; for crutching it is 53.8%; for mulesing + crutching it is 18.6%; and for no treatment it is 83.3%. These are the same values that we calculated directly using the tapply() function.
那么，这些价值是什么？这些是四种处理中每一种的苍蝇袭击的平均概率。对于骡子，基于模型的飞击平均概率为59.3%;拄拐杖为53.8%;骡子+拐杖为18.6%;不治疗为83.3%。这些值与我们直接使用该 tapply() 函数计算的值相同。

A simpler way of doing this is to use the emmeans() function. It provides the $μ$ and confidence intervals for each treatment. However, because the GLM transforms the data in the modelling process, it is critical to specify type = "response" in the emmeans() call to ensure that the $l p$ is transformed from the logit scale to the linear scale.
一种更简单的方法是使用该 emmeans() 函数。它提供了每个处理的 $μ$ 和置信区间。但是，由于 GLM 在建模过程中转换数据，因此在 emmeans() 调用 $l p$ 中指定 type = "response" 以确保从 logit 刻度转换为线性刻度至关重要。

fly_emm <- emmeans(fly_glm, ~treatment_reorder, type="response")
fly_emm

##  treatment_reorder   prob     SE  df asymp.LCL asymp.UCL
##  mulesing           0.593 0.0452 Inf     0.502     0.678
##  crutching          0.538 0.1383 Inf     0.282     0.776
##  mulesing+crutching 0.186 0.0507 Inf     0.106     0.306
##  no treatment       0.833 0.0680 Inf     0.657     0.929
## 
## Confidence level used: 0.95 
## Intervals are back-transformed from the logit scale

We can use the pairs() function to identify which treatments are significantly different:
我们可以使用该 pairs() 函数来识别哪些治疗方法有显着差异：

pairs(fly_emm)

##  contrast                            odds.ratio     SE  df null z.ratio p.value
##  mulesing / crutching                    1.2500 0.7338 Inf    1   0.380  0.9813
##  mulesing / (mulesing+crutching)         6.3636 2.4387 Inf    1   4.829  <.0001
##  mulesing / no treatment                 0.2917 0.1530 Inf    1  -2.349  0.0872
##  crutching / (mulesing+crutching)        5.0909 3.3043 Inf    1   2.507  0.0588
##  crutching / no treatment                0.2333 0.1730 Inf    1  -1.963  0.2021
##  (mulesing+crutching) / no treatment     0.0458 0.0272 Inf    1  -5.198  <.0001
## 
## P value adjustment: tukey method for comparing a family of 4 estimates 
## Tests are performed on the log odds ratio scale

The statistic returned here is the odds ratio. The odds ratio is interpreted as how many times more is something likely to happen from one group in the pair to the other. For example, the odds ratio for the comparison between mulesing and the combined mulesing and crutching treatment is 6.36. This means that sheep that received mulesing had 6.36 times more flystrike than sheep that recieved mulesing + crutching.
此处返回的统计数据是比值比。比值比被解释为从一对中的一组到另一组可能发生的事情多多少倍。例如，骡子与骡子和拐杖联合治疗之间的比较比值比为 6.36。这意味着接受骡子的绵羊比接受骡子+拐杖的绵羊的苍蝇攻击次数多 6.36 倍。

However, if the odds ratio is less than 1, then we need to take the inverse value to facilitate interpretation. For example, the odds ratio for the comparison between the combined mulesing and crutching and no treatment is 0.0458. To calculate the difference in flystrike, we take the inverse of 0.0458, which is 1/0.0458 = 21.83. However, the interpretation then becomes how LESS likely is flystrike to occur. In this case, the combined treatment is 21.83 times less likely to result in flystrike than no treatment.
但是，如果比值比小于 1，那么我们需要取反值以方便解释。例如，骡子和拐杖联合治疗与不治疗之间的比较比值比为 0.0458。为了计算 flystrike 的差异，我们取 0.0458 的倒数，即 1/0.0458 = 21.83。然而，解释就变成了苍蝇袭击发生的可能性有多小。在这种情况下，联合治疗导致苍蝇袭击的可能性比不治疗低 21.83 倍。

If you look at the $p$ -values at the end of each row, you will see that the only significant differences are between the combined mulesing + crutching treatment and mulesing or no treatment. It is worth noting that there is not a significant difference between mulesing and no treatment. It is also worth noting that the occurrence of flystrike was much more variable for the crutching treatment than any of the other treatments.
如果您查看每行末尾的 $p$ -值，您会发现唯一的显着差异是骡子 + 拐杖联合治疗和骡子或不处理之间的差异。值得注意的是，骡子和不治疗之间没有显着差异。还值得注意的是，拐杖治疗的苍蝇撞击的发生率比其他任何治疗都要多得多。

We can plot the emmeans object to see what this looks like. Note that the orientation is reversed compared to how we plotted the data earlier. So, the x-axis label and y-axis labels have to be reversed.
我们可以绘制对象， emmeans 看看它是什么样子的。请注意，与我们之前绘制数据的方式相比，方向是相反的。因此，必须颠倒 x 轴标签和 y 轴标签。

plot(fly_emm, 
     xlab="Probability of flystrike", 
     ylab="Treatment")

If we are using the mulesing treatment as the baseline, it is clear that mulesing and crutching together are the most effective treatment. It also demonstrates that crutching alone leads to the most variable outcomes and that mulesing alone is not statistically different from no treatment.
如果我们以 mulesing 治疗为基线，很明显，骡子和拐杖是最有效的治疗方法。它还表明，单独使用拐杖会导致最可变的结果，并且单独使用骡子与不治疗在统计学上没有区别。

Reporting the results: To test the effectiveness of several treatments to prevent myiasis, agricultural scientists compared the current standard treatment, mulesing, against two other treatments and a no-treatment control. Because the response variable was binomial (i.e., the presence or absence of flystrike), they used a generalised linear model with a logit link function. They found a significant effect of treatment on the presence or absence of flystrike ( $χ_{d f = 3}^{2}$ = 43.625, $p$ << 0.0001). Pairwise comparisons of all treatments showed that the combined mulesing and crutching treatment had significantly less incidence of flystrike than mulesing alone ( $p$ < 0.001) or no treatment at all ( $p$ < 0.001). The mulesing treatment had 6.36 (times as much flystrike as the combined crutching and mulesing. Crutching alone was not significantly different from any of the other treatments. It is worth noting that mulesing alone was not significantly different from no treatment ( $p$ < 0.087) despite having long been the standard treatment for flystrike in Australian sheep.
报告结果：为了测试几种预防蝇蛆病的治疗方法的有效性，农业科学家将目前的标准治疗方法骡子与其他两种治疗方法和无治疗对照进行了比较。由于响应变量是二项式的（即是否存在 flystrike），因此他们使用了具有 logit 链接函数的广义线性模型。他们发现治疗对是否存在苍蝇撞击有显着影响（ $χ_{d f = 3}^{2}$ = 43.625， $p$ <<0.0001）。所有治疗的成对比较表明，骡子和拐杖联合治疗的苍蝇发生率明显低于单独骡子治疗（ $p$ < 0.001）或根本不治疗（ $p$ < 0.001）。骡子治疗有 6.36（是拐杖和骡子组合的飞击次数的倍。单独拄拐杖与任何其他治疗方法没有显着差异。值得注意的是，单独使用骡子与不治疗（ $p$ < 0.087）没有显着差异，尽管长期以来一直是澳大利亚绵羊蝇击的标准治疗方法。

Assessed questions 评估问题

NOTE: You have now been analysing data and presenting R code and results for many weeks. In previous assessed pracs, we have asked you to explain each step and what you have done. This ensured that we could see that you understood the workflow. Going forward, we will be less focused on you narrating the individual steps and more interested in you describing (concisely) how the statistical analyses supported your conclusions. We have been providing you with examples of this (Reporting the Results) in the worked examples over the past 3-4 weeks. You do need to show your R code and its outputs, but we we will increasingly weight the points towards correct inference informed by the statistical analyses, rather than the specific steps that you take to do the statistical analyses.
注意：您现在已经分析数据并显示 R 代码和结果数周了。在之前评估的实践中，我们要求您解释每个步骤以及您做了什么。这确保了我们可以看到您了解工作流程。展望未来，我们将不再关注您叙述各个步骤，而更感兴趣的是您（简明扼要地）描述统计分析如何支持您的结论。在过去的 3-4 周内，我们一直在工作示例中为您提供这方面的例子（报告结果）。你确实需要显示你的 R 代码及其输出，但我们将越来越多地将这些点权重到统计分析所告知的正确推理上，而不是你为进行统计分析而采取的具体步骤。

Exercise 4. Investigating the impact of female fish size on number of eggs produced
练习 4.调查雌鱼大小对产卵数量的影响

Size is often important in female fecundity. Typically, we would expect that the larger the female, the more eggs they are able to produce. A research scientist conducted an experiment to assess if female size (mass in grams) of the European anchovy affects the number of eggs inside one ovary.
体型在雌性繁殖力中通常很重要。通常，我们会期望雌性越大，它们能够产生的卵就越多。一位研究科学家进行了一项实验，以评估欧洲鳀鱼的雌性大小（以克为单位的质量）是否会影响一个卵巢内的卵子数量。

The data frame is eggs, where size_g is mass of the female in grams, and count is the number of eggs.
数据框是 eggs ，其中 size_g 是雌性的质量（以克为单位）， count 是卵子的数量。

QUESTION 1: (1 point) Plot the data and describe what you see.
问题 1：（1 分）绘制数据并描述您看到的内容。

QUESTION 2: (5 points) Identify the scientific and statistical hypotheses and then use a GLM for count data to assess them. Make a plot of the data and the statistical model. Interpret and summarise the statistical findings and the conclusions they support.
问题 2：（5 分）确定科学和统计假设，然后使用 GLM 进行计数数据来评估它们。绘制数据和统计模型的图。解释和总结统计结果及其支持的结论。

Exercise 5. More frogs, but this time about infectious disease
练习 5.更多的青蛙，但这次是关于传染病的

Around the world frog populations are declining — they are amongst the most threatened vertebrate taxa globally. One of the major causes of these declines is the fungal pathogen Batrachochytrium dendrobatidis, which infects the skin of the frogs, causing the skin to thicken. This disrupts critical physiological processes of the amphibian skin such as water absorption and ion transport. Infection can lead to the fatal disease, chytridiomycosis.
在世界各地，青蛙的数量正在下降——它们是全球最受威胁的脊椎动物分类群之一。这些下降的主要原因之一是真菌病原体 Batrachochytrium dendrobatidis，它感染青蛙的皮肤，导致皮肤增厚。这破坏了两栖动物皮肤的关键生理过程，例如吸水和离子传输。感染可导致致命的疾病，壶菌病。

One thing that can help protect frogs at risk of infection and potentially mitigate declines due to chytridiomycosis is to understand when the animals are at risk. That is, are there environmental factors that influence when infection occurs?
有一件事可以帮助保护有感染风险的青蛙，并可能减轻由于壶菌病引起的下降，那就是了解动物何时处于危险之中。也就是说，是否有影响感染发生的环境因素？

One environmental variable that has been proposed to influence general disease dynamics is temperature. For frogs and chytridiomycosis, it has been hypothesized that infection is higher when the frogs begin breeding in the Spring, during the cooler months of the year.
一个被提出影响一般疾病动态的环境变量是温度。对于青蛙和壶菌病，据推测，当青蛙在春季开始繁殖时，在一年中较凉爽的月份，感染率会更高。

To test this, a study was conducted over two years in Pennsylvania, USA. Field surveys were carried out monthly across multiple sites for the nine months during which frogs at this location are active. The data is provided in the frog_infection dataframe. Any time a frog was encountered its temperature was recorded (temp) in degrees C using an infrared thermometer. The frog was then captured and swabbed for infection. Swabs were analyzed in the lab for presence or absence of B. dendrobatidis (infection).
为了验证这一点，在美国宾夕法尼亚州进行了一项为期两年的研究。在该地点的青蛙活跃的九个月中，每月在多个地点进行实地调查。 frog_infection 数据在数据帧中提供。每当遇到青蛙时，都会用红外测温仪记录其体温（摄氏度 temp ）。然后捕获青蛙并擦拭感染。在实验室中分析拭子是否存在 B. dendrobatidis （ infection ）。

QUESTION 3: (4 points) Identify the scientific and statistical hypotheses, conduct a logistic regression to assess your statistical hypothesis, and interpret the results.
问题 3：（4 分）确定科学和统计假设，进行逻辑回归以评估您的统计假设，并解释结果。

QUESTION 4: (1 point) Plot the data and the model for the GLM. Provide the logistic regression equation.
问题 4：（1 分）绘制 GLM 的数据和模型。提供逻辑回归方程。

Note: When plotting a binomial glm you can use stat_smooth(). To do this, use the notation stat_smooth(method="glm", formula=y~x, method.args = list(family="binomial")) as you would use geom_smooth(method="lm") for a linear regression in your ggplot call. Or you can use a prediction matrix approach is described for the Poisson example above.
注意：绘制二项式 glm 时，可以使用 stat_smooth（）。为此，请使用与 ggplot 调用中的线性回归相同的 geom_smooth(method="lm") 表示法 stat_smooth(method="glm", formula=y~x, method.args = list(family="binomial")) 。或者，您可以使用上面针对泊松示例描述的预测矩阵方法。

QUESTION 5: (1 point) 问题 5：（1 分）

Given that: + frogs are often solitary and only come together to breed + breeding often occurs in the cooler months of the year (early spring) + the pathogen is transmitted both by the environment (through infected water) and through direct contact
鉴于： + 青蛙通常是孤独的，只聚集在一起繁殖 + 繁殖通常发生在一年中较凉爽的月份（早春） + 病原体通过环境（通过受感染的水）和直接接触传播

interpret the findings from your analysis on the risk of infection by the pathogen B. dendrobatidis.
解释您对病原体 B. dendrobatidis 感染风险的分析结果。

Models for Non-Normal Responses - Part 1: Generalised linear models
非正态响应模型 - 第 1 部分：广义线性模型

Laura Brannelly and Patrick Baker
劳拉·布兰纳利和帕特里克·贝克

06 May 2024 06 五月 2024

Preamble 序言

Assessment 评估

Revisiting the linear model
重新审视线性模型

Generalising the linear model: Generalized Linear Model
推广线性模型：广义线性模型

Dealing with count response data: Poisson GLMs
处理计数响应数据：Poisson GLM

Example 1: Counting seeds in the forest
示例 1：计算森林中的种子

Example 2: Frog abundance across microhabitats
示例2：不同微生境的青蛙丰度

Dealing with binary response data: Logistic GLMs
处理二进制响应数据：逻辑 GLM

Exercise 3. A story of flies and sheep hygiene.
练习 3.一个关于苍蝇和绵羊卫生的故事。

Assessed questions 评估问题

Exercise 4. Investigating the impact of female fish size on number of eggs produced
练习 4.调查雌鱼大小对产卵数量的影响

Exercise 5. More frogs, but this time about infectious disease
练习 5.更多的青蛙，但这次是关于传染病的

Models for Non-Normal Responses - Part 1: Generalised linear models非正态响应模型 - 第 1 部分：广义线性模型

Laura Brannelly and Patrick Baker劳拉·布兰纳利和帕特里克·贝克

06 May 2024 06 五月 2024

Preamble 序言

Assessment 评估

Revisiting the linear model重新审视线性模型

Generalising the linear model: Generalized Linear Model推广线性模型：广义线性模型

Dealing with count response data: Poisson GLMs处理计数响应数据：Poisson GLM

Example 1: Counting seeds in the forest示例 1：计算森林中的种子

Example 2: Frog abundance across microhabitats示例2：不同微生境的青蛙丰度

Dealing with binary response data: Logistic GLMs处理二进制响应数据：逻辑 GLM

Exercise 3. A story of flies and sheep hygiene.练习 3.一个关于苍蝇和绵羊卫生的故事。

Assessed questions 评估问题

Exercise 4. Investigating the impact of female fish size on number of eggs produced练习 4.调查雌鱼大小对产卵数量的影响

Exercise 5. More frogs, but this time about infectious disease练习 5.更多的青蛙，但这次是关于传染病的

Models for Non-Normal Responses - Part 1: Generalised linear models
非正态响应模型 - 第 1 部分：广义线性模型

Laura Brannelly and Patrick Baker
劳拉·布兰纳利和帕特里克·贝克

Revisiting the linear model
重新审视线性模型

Generalising the linear model: Generalized Linear Model
推广线性模型：广义线性模型

Dealing with count response data: Poisson GLMs
处理计数响应数据：Poisson GLM

Example 1: Counting seeds in the forest
示例 1：计算森林中的种子

Example 2: Frog abundance across microhabitats
示例2：不同微生境的青蛙丰度

Dealing with binary response data: Logistic GLMs
处理二进制响应数据：逻辑 GLM

Exercise 3. A story of flies and sheep hygiene.
练习 3.一个关于苍蝇和绵羊卫生的故事。

Exercise 4. Investigating the impact of female fish size on number of eggs produced
练习 4.调查雌鱼大小对产卵数量的影响

Exercise 5. More frogs, but this time about infectious disease
练习 5.更多的青蛙，但这次是关于传染病的