BUSS6002 W7 Feature Engineering and NLP

Prepared by Ella Luo

Part 1. Feature Engineering

What is feature engineering

The process of constructing predictors from raw data, based on domain knowledge or algorithm requirements
Good feature engineering can significantly improve model performance
Example: polynomial regression

2. Log Transformation

Original linear regression:

$\text{AmountSpent} = β_{0} + β_{1} \times \text{Salary} + ε$

We can see patterns in the residual plot, which indicate a non-linear relationship
We can also see heteroscedasticity: the variance of the errors is not constant

Log transformation:

$\log (\text{AmountSpent}) = β_{0} + β_{1} \times \log (\text{Salary}) + ε$

The residual plot shows no pattern
The variance is now almost constant
A log transformation can help to:
Remove certain types of non-linearity
Stabilise the variance

Retransformation bias:

The model after the log transformation: $\log (y) = x^{⊤} \hat{β} + ε$
Thus, if we need to predict $y$ on its original scale:
Naive prediction: ${\hat{y}}_{naive} = \exp (\hat{\log (y)}) = \exp (x^{⊤} \hat{β})$
Optimal prediction: ${\hat{y}}_{optimal} = E (y ∣ x) = \exp (x^{⊤} \hat{β}) E [\exp (ε)] = {\hat{y}}_{naive} \times E [\exp (ε)]$
According to Jensen's inequality, and because $E (ε) = 0$: $E [\exp (ε)] > \exp [E (ε)] = 1$
Thus, ${\hat{y}}_{naive} < {\hat{y}}_{optimal}$, i.e. the naive prediction is biased downward
Bias-corrected prediction, replacing $E [\exp (ε)]$ with the sample mean of the exponentiated residuals $ε_{i}$ of the log-scale regression:
${\hat{y}}_{unbiased} := {\hat{y}}_{naive} \times \frac{1}{n} \sum_{i = 1}^{n} \exp (ε_{i}) = \exp (x^{⊤} \hat{β}) \frac{1}{n} \sum_{i = 1}^{n} \exp (ε_{i})$
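
The correction above can be sketched in a few lines of Python. This is a minimal illustration, not a full regression workflow: the coefficients, the new observation and the residuals are made-up numbers (assumptions), and the function names are hypothetical.

```python
import math

def naive_prediction(beta, x):
    """Back-transform the log-scale prediction directly: exp(x' beta)."""
    return math.exp(sum(b * xi for b, xi in zip(beta, x)))

def corrected_prediction(beta, x, residuals):
    """Scale the naive prediction by the sample mean of exp(residual)."""
    correction = sum(math.exp(e) for e in residuals) / len(residuals)
    return naive_prediction(beta, x) * correction

# Hypothetical values, for illustration only.
beta_hat = [1.0, 0.5]               # intercept first, then one slope
x_new = [1.0, 2.0]                  # leading 1.0 multiplies the intercept
residuals = [-0.4, 0.1, 0.3, 0.0]   # log-scale residuals (they sum to zero)

naive = naive_prediction(beta_hat, x_new)
corrected = corrected_prediction(beta_hat, x_new, residuals)
print(naive, corrected)
```

Because the mean of exp(residual) exceeds 1 whenever the residuals are not all zero, the corrected prediction is larger than the naive one, which matches the Jensen's inequality argument.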

3. Dealing with Categorical Predictors:

Use dummy variables to handle qualitative predictors.

For example, for the predictor Gender with categories {'Female', 'Male'}, a dummy variable can be constructed as:

$\text{Female} = \begin{cases} 1, & \text{if Gender = 'Female'} \\ 0, & \text{if Gender = 'Male'} \end{cases}$

$k$ categories need only $k - 1$ dummy variables
Reference or baseline category: the one that is not explicitly represented by a dummy variable
Each dummy is compared with the reference category when interpreting coefficients

Consider the following regression with the dummy variable Female (1 if Gender = 'Female', 0 if Gender = 'Male'):

$\text{AmountSpent} = β_{0} + β_{1} \times \text{Salary} + β_{2} \times \text{Children} + β_{3} \times \text{Catalogs} + β_{4} \times \text{Female} + ε$

This means that the fitted model is:

$\widehat{\text{AmountSpent}} = \begin{cases} {\hat{β}}_{0} + {\hat{β}}_{4} + {\hat{β}}_{1} \times \text{Salary} + {\hat{β}}_{2} \times \text{Children} + {\hat{β}}_{3} \times \text{Catalogs}, & \text{for female} \\ {\hat{β}}_{0} + {\hat{β}}_{1} \times \text{Salary} + {\hat{β}}_{2} \times \text{Children} + {\hat{β}}_{3} \times \text{Catalogs}, & \text{for male} \end{cases}$

Interpretation: all else being equal, a female customer spends on average ${\hat{β}}_{4}$ more than a male customer
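
As a rough sketch of how $k$ categories become $k - 1$ dummy columns, the following Python uses only the standard library. The function name `dummy_encode` and the example data are assumptions made for illustration; in practice a data-frame library would usually do this step.

```python
def dummy_encode(values, reference):
    """Return (levels, rows): one 0/1 column per level except the reference."""
    levels = sorted(set(values) - {reference})
    rows = [[1 if v == level else 0 for level in levels] for v in values]
    return levels, rows

# Two categories -> one dummy column; 'Male' is the baseline.
gender = ["Female", "Male", "Male", "Female"]
gender_levels, gender_rows = dummy_encode(gender, reference="Male")
print(gender_levels, gender_rows)

# Three categories -> two dummy columns; 'North' is the baseline (hypothetical data).
region = ["North", "South", "West", "South"]
region_levels, region_rows = dummy_encode(region, reference="North")
print(region_levels, region_rows)
```

An observation in the reference category has zeros in every dummy column, which is why its effect is absorbed into the intercept.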

2) Interactions:

Original regression:

$\text{AmountSpent} = β_{0} + β_{1} \times \text{Salary} + β_{2} \times \text{Children} + ε$

Interpretation of $β_{1}$: holding the number of children constant, average spending increases by ${\hat{β}}_{1}$ for each one-dollar increase in salary

Regression with interaction:

$\text{AmountSpent} = β_{0} + β_{1} \times \text{Salary} + β_{2} \times \text{Children} + β_{3} \times \text{Salary} \times \text{Children} + ε$

Interpretation: when salary increases by one dollar, average spending increases by (${\hat{β}}_{1} + {\hat{β}}_{3} \times$ Children), so the effect of salary depends on the number of children
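
The interpretation can be checked numerically. The sketch below uses invented coefficients (an assumption, not estimates from any data set) and compares the change in the prediction for a one-dollar salary increase with ${\hat{β}}_{1} + {\hat{β}}_{3} \times$ Children.

```python
def predict(beta, salary, children):
    """Fitted value of the regression with a Salary x Children interaction."""
    b0, b1, b2, b3 = beta
    return b0 + b1 * salary + b2 * children + b3 * salary * children

def salary_effect(beta, children):
    """Change in predicted spending per extra dollar of salary."""
    return beta[1] + beta[3] * children

# Hypothetical coefficients, for illustration only.
beta_hat = (100.0, 0.02, -50.0, -0.003)

for children in (0, 2):
    change = predict(beta_hat, 50001, children) - predict(beta_hat, 50000, children)
    print(children, change, salary_effect(beta_hat, children))
```

With these made-up numbers the salary effect shrinks as the number of children grows, because the interaction coefficient is negative.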

3) Polynomial Regression

Allows us to model a non-linear relationship between the response and a predictor

$y = β_{0} + β_{1} x + β_{2} x^{2} + \dots + β_{p} x^{p} + ε$
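
Polynomial regression is still linear in the coefficients: the feature engineering step only adds the powers of $x$ as extra columns. A minimal sketch of that step (the function name is an assumption):

```python
def polynomial_features(x, degree):
    """Expand each value into the columns [x, x^2, ..., x^degree]."""
    return [[xi ** p for p in range(1, degree + 1)] for xi in x]

features = polynomial_features([2, 3], degree=3)
print(features)
```

Each row can then be passed to an ordinary least squares routine, with the intercept handled as usual.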

Part 2. Natural Language Processing (NLP):

What is NLP (Natural Language Processing):

NLP is the automated handling of natural human language, such as speech or text
The aim is for computers to understand the meaning and information in text (unstructured data)

Regular Expressions (Regex) - Not really NLP:

Can represent highly complex deterministic queries about sequences of numbers, characters and special symbols.
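
As a small illustration of a deterministic query, the following uses Python's built-in `re` module. The example sentence and both patterns are invented for this sketch.

```python
import re

text = "Order #A123 cost $45.50; order #B77 cost $8."

# Dollar amounts: a '$' followed by digits and an optional decimal part.
amounts = re.findall(r"\$(\d+(?:\.\d+)?)", text)

# Order ids: a '#' followed by one capital letter and one or more digits.
order_ids = re.findall(r"#([A-Z]\d+)", text)

print(amounts, order_ids)
```

The patterns match character sequences exactly as specified; nothing about the meaning of the text is learned or inferred.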

Structuring Unstructured Data

Keywords are called tokens
A token is a useful semantic unit for processing
In English, tokens are often identified by the whitespace and/or punctuation around them

(1) Bag-of-words (BOW):

Raw text → bag-of-words representation

Extracts features from text that describe the occurrence of words within a document
Creates a vocabulary of known words
Provides a measure of the presence of known words
Does not capture word order (only captures occurrence)
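
A minimal bag-of-words sketch using only the standard library. The tokenizer is a simple assumption (lower-casing and splitting on anything that is not a letter, digit or apostrophe), and the two example sentences are invented.

```python
import re
from collections import Counter

def tokenize(text):
    """Lower-case the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())

def bag_of_words(docs):
    """Return the sorted vocabulary and one count vector per document."""
    tokenized = [tokenize(d) for d in docs]
    vocab = sorted({t for doc in tokenized for t in doc})
    rows = []
    for doc in tokenized:
        counts = Counter(doc)
        rows.append([counts[t] for t in vocab])
    return vocab, rows

docs = ["The dog chased the cat", "The cat chased the dog"]
vocab, rows = bag_of_words(docs)
print(vocab)
print(rows)
```

The two sentences mean different things but receive identical count vectors, which shows that word order is lost.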
(2) Term frequency - inverse document frequency (TF-iDF):

$\text{TF-iDF} (t, d) = \frac{\text{number of times token } t \text{ appears in document } d}{\text{number of tokens in document } d} \times \ln \left( \frac{\text{number of documents}}{\text{number of documents containing token } t} \right)$

Translates counts into a measure of how "important" (i.e. unique) a token is to a document relative to a wider collection or corpus.
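
The formula can be computed directly. The sketch below follows the definition given here (libraries often use smoothed variants, so their numbers may differ); the tiny corpus is invented and documents are assumed to be already tokenized.

```python
import math

def tf_idf(token, doc, corpus):
    """Term frequency in `doc` times ln(N / number of docs containing token)."""
    tf = doc.count(token) / len(doc)
    n_containing = sum(1 for d in corpus if token in d)
    if n_containing == 0:
        return 0.0  # token never appears, so it carries no weight
    return tf * math.log(len(corpus) / n_containing)

# Hypothetical corpus of three tokenized documents.
corpus = [
    ["the", "dog", "barks"],
    ["the", "cat", "meows"],
    ["the", "dog", "sleeps"],
]

for token in ("the", "dog", "barks"):
    print(token, tf_idf(token, corpus[0], corpus))
```

A token that appears in every document ("the") scores zero, while a token found in only one document ("barks") scores highest.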

(3) N-gram:

A sequence of $n$ tokens
1-gram: unigram
2-gram: bigram
3-gram: trigram
Maintains some of the structure, context and possibly the meaning of the original text
Compatible with BOW and TF-iDF (n-grams can be used as the tokens within BOW and TF-iDF)
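
Extracting n-grams from a token list is a sliding window. A minimal sketch (the function name and example sentence are assumptions):

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "one does not simply walk into mordor".split()
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)
print(bigrams[:2])
print(trigrams[:2])
```

The resulting tuples can be counted in the same way as single tokens, which is how n-grams plug into BOW or TF-iDF.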

Reduce complexity in unstructured data:

(1) Stop Words:

Filter out words that tend to add little to the meaning of a text.
Stop words are usually the most common words in a language
There is no single universal list of stop words used by all natural language processing tools.
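
A minimal stop-word filter. The list below is a short, hand-picked assumption for illustration; real toolkits ship their own (differing) lists.

```python
# Hypothetical, deliberately tiny stop-word list.
STOP_WORDS = {"the", "a", "an", "is", "of", "to", "and", "in"}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep only the tokens that are not in the stop-word list."""
    return [t for t in tokens if t.lower() not in stop_words]

filtered = remove_stop_words("The cat is in the garden".split())
print(filtered)
```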

(2) Removing Grammar with Python's NLTK:

NLTK: the Natural Language Toolkit
Stemming: removes the ends of words
Lemmatization: reduces words to their base form properly, using a vocabulary and morphological analysis of words
Stemming and lemmatization are used to remove variation in language
These are not statistical or learned algorithms
Porter 2 (the Snowball English stemmer) is generally the best choice of stemmer for English
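
To make the difference between the two ideas concrete, here is a deliberately crude toy in plain Python. It is not the Porter 2 algorithm and does not use NLTK: the suffix list and the lemma dictionary are assumptions chosen only to show rule-based chopping versus vocabulary lookup.

```python
# Toy rules only; a real stemmer has many more rules and exceptions.
SUFFIXES = ("ing", "ed", "es", "s")

def crude_stem(word):
    """Chop the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

# Hypothetical mini-vocabulary mapping word forms to dictionary forms.
LEMMAS = {"better": "good", "ran": "run", "mice": "mouse"}

def toy_lemmatize(word):
    """Look the word up in the vocabulary; fall back to the word itself."""
    return LEMMAS.get(word, word)

for w in ("walking", "walked", "walks", "running", "mice"):
    print(w, crude_stem(w), toy_lemmatize(w))
```

The stem of "running" comes out as "runn", which is not a word: stemming only cuts characters, whereas lemmatization can map an irregular form such as "mice" to "mouse" because it relies on a vocabulary.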

State of the Art:

N-gram models need sufficiently large training data (the larger the n, the harder it is to estimate probabilities)
Many researchers use 5-grams or 6-grams

Sentiment Analysis
(1) VADER (Valence Aware Dictionary and sEntiment Reasoner)

A lexicon- and rule-based sentiment analysis tool that is specifically tuned for social media (Twitter).
Assigns positive or negative points to each word, then adds them up
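
The "assign points, then add them up" idea can be sketched as follows. This is not VADER itself: the four-word lexicon and its scores are invented, and VADER's additional rules (for example for negation or emphasis) are not modelled here.

```python
# Hypothetical mini-lexicon: word -> sentiment points.
LEXICON = {"good": 2.0, "great": 3.0, "bad": -2.0, "terrible": -3.0}

def lexicon_score(text):
    """Sum the points of every word found in the lexicon."""
    words = (w.strip(".,!?").lower() for w in text.split())
    return sum(LEXICON.get(w, 0.0) for w in words)

print(lexicon_score("The movie was good"))
print(lexicon_score("Great food, terrible service!"))
```

In the second sentence the positive and negative words cancel out, which illustrates why a plain sum is only a starting point.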
(2) Word Embedding
Problem: a frequency representation of words loses a lot of information
Dog vs poodle
Dog vs cat
Idea: represent words as vectors in a multidimensional space
Similar words should be closer together
Solution: Word2Vec
Uses shallow neural networks to calculate these vectors.
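
"Closer" is commonly measured with cosine similarity between the word vectors. The sketch below uses hand-made three-dimensional vectors (an assumption; real Word2Vec vectors are learned and have many more dimensions) only to show the comparison.

```python
import math

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical embeddings, invented for illustration.
vectors = {
    "dog": [0.9, 0.8, 0.1],
    "poodle": [0.85, 0.9, 0.05],
    "car": [0.1, 0.0, 0.95],
}

print(cosine(vectors["dog"], vectors["poodle"]))
print(cosine(vectors["dog"], vectors["car"]))
```

With these made-up vectors, "dog" is far more similar to "poodle" than to "car", which is the behaviour a good embedding is expected to show.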

Part 3. Ethics in Data Science

1. Ethics in the Age of Algorithmic Decisions

Application of AI (Artificial Intelligence)
Not only a technological advance, but also a social change
Data science and AI may harm society, for example through harm to potentially vulnerable populations, misuse of personal information, and political manipulation.
Bias and discrimination can be built into AI algorithms; AI reflects the values of its creators.
Ethical judgement:
Ethics is concerned with making judgements about intentions and actions
Normative theory: the norms of society, how things ought to be
Descriptive theory: explains how things are
Ethical judgements are normative, not descriptive
Morality:
A system of rules that guide human conduct, and principles for evaluating those rules.
Morality is a system comprising values and principles, just like policies.
Ethics vs Morals
Ethics: the rules that a social system provides us with
Morals: our own principles
Ethics vs Values

| Ethics | Values |
| --- | --- |
| Set of moral principles | Principles or standards of behavior |
| Professional | Personal |
| Influenced by different professions and organizations | Influenced by family, background, etc. |
| Can vary according to profession | Vary according to the individual |

2. Ethics in Data Science:

Data used in training may contain bias
Algorithms learn from these biases, and the conclusions drawn from the biased data are then used to support future decisions.
Usability of data (data legality):
Fair use agreement:
Harvesting data when a policy is in place that explicitly forbids it
Harvesting data behind a login page/portal without a policy
Harvesting public data that is not explicitly linked anywhere
Harvesting public social media data that is plainly visible is probably legal, but might be ethical or unethical
UK Gov - 5 principles of ethical AI:
AI should be developed for the common good and benefit of humanity
AI should operate on principles of intelligibility and fairness
AI should not be used to diminish the data rights or privacy of individuals, families, or communities
All citizens should have the right to be educated to enable them to flourish mentally, emotionally, and economically alongside AI
The autonomous power to hurt, destroy, or deceive human beings should never be vested in AI.
Five Common Principles of Responsible AI:

Beneficence
- promote well-being, preserve dignity, and sustain the planet

Non-maleficence
- privacy, security and 'capability caution'
- avoid misuse: avoid being 'legal, but immoral'

Autonomy
- the power to decide
- the autonomous power to hurt, destroy or deceive human beings should never be vested in AI
- any delegation should remain overridable

Justice
- promoting prosperity, preserving solidarity, avoiding unfairness
- defend against the risk of bias and threats to "solidarity," including "systems of mutual assistance such as in social insurance and healthcare"

Explicability
- 'intelligibility'
- 'accountability'

The field of data ethics offers guiding principles for business professionals who handle data.
Ownership: an individual owns their personal information
Transparency: data subjects have a right to know how you plan to collect, store, and use their data. When gathering data, exercise transparency.
Privacy: even if a customer gives your company consent to collect, store, and analyse their personally identifiable information (PII), that doesn’t mean that it should be made publicly available.
Intention: before collecting data, ask yourself why you need it, what you’ll gain from it, and what changes you’ll be able to make after analysis. If your intention is to hurt others, profit from your subjects’ weaknesses, or pursue any other malicious goal, it’s not ethical to collect their data.
Outcome: even when intentions are good, the outcome of data analysis can cause inadvertent harm to individuals or groups of people. This is called disparate impact, which is outlined in the Civil Rights Act as unlawful.

The bottom line in AI:
“We don’t give advice.
We can’t tell anybody what to do.
Instead we say - what we know about this problem at this time.
And here are the consequences of these actions.”

Practice for Week 7.

1. Developing a classification model for topics in a large corpus of social media posts, you trained a model based on data that represented each document through the count of each token it contains. Dissatisfied with the model’s performance, you decide that you need to represent the meaning of the texts more precisely. Which of the following processes would you choose to achieve this?
A. Encode texts as bi-grams.
B. Prepare documents with stemming.
C. Remove all stop words.
D. Engineer features with VADER.
2. You are tasked to categorise a large set of research papers according to their research disciplines. Which of the following is the most suitable choice to distinguish these documents?
A. TF-iDF
B. VADER
C. Bag-of-Words
D. Regular Expressions (RegEx)
3. The following regression model is estimated via least squares, using a data set containing 200 observations:

$\log (y) = 3 + 2 x_{1} + 0.9 x_{2} + ε$

The sum of the residuals transformed by the natural exponential function, $\exp (a) := e^{a}$, is calculated to be 226, i.e., $\sum_{i = 1}^{200} \exp (ε_{i}) = 226$. For a newly observed pair of feature values $x_{1} = 0.5$ and $x_{2} = 2$, which of the following is an unbiased prediction of $y$ (to 2 decimal places)?
A. 373.24
B. 5.80
C. 330.30
D. 331.43
4. Typos and semantic and grammatical errors can be readily identified by current spellcheckers. But they are only as good as the data used to train them. Take the statement “One does not simply wlak into Mordor.” How do you expect algorithms to perform if they were trained on bag-of-words, bi-grams, or tri-grams?
A. All will be similar in performance.
B. Bi-grams will be better than the others.
C. Uni-grams will be better than the others.
D. Tri-grams will be better than the others.
5. Different natural language processing tools serve different purposes. Which of these techniques is least well suited to capture the meaning of elements of a text?
A. TF-iDF
B. Regular Expressions (RegEx)
C. VADER
D. Word embedding
6. If you wanted to explain how TF-iDF works, which of the statements below would be most accurate?
A. TF-iDF calculates a high-dimensional numerical space where words with similar meanings are located close to one another.
B. TF-iDF removes ambiguity by mapping the valence of sentences on a continuous range.
C. TF-iDF uses stop-words to reduce the dimensionality in a similar way to lemmatisation and stemming.
D. TF-iDF adds information to unstructured data by calculating to what degree specific tokens are able to distinguish between different documents.
7. The transformation commonly applied to categorical variables is known as:
A. Baseline variable
B. Dummy variable
C. Log transformation
D. Interaction terms
8. You used word2vec to engineer feature vectors for a collection of documents. To check if it worked, you compare the vectors for word pairs. What would you expect to find for the pair “mother” and “daughter”?
A. They would be related in the same way as “king” and “queen”.
B. They would be orthogonal to each other.
C. They would be on opposite ends of the vector space.
D. They would be related in the same way as “dog” and “puppy”.
9. Different natural language processing tools serve different purposes. Which of these techniques is least well suited to capture the meaning of elements of a text?
A. Regular Expressions (RegEx)
B. Word embedding
C. TF-iDF
D. VADER
10. Consider the following linear regression model: $\log (y) = {\hat{β}}_{0} + {\hat{β}}_{1} x + ε$. For a given value of $x$, the prediction of $y$ via $\hat{y} = \exp ({\hat{β}}_{0} + {\hat{β}}_{1} x)$ is:
A. Upward biased
B. Unbiased
C. Optimal in terms of mean squared error
D. Downward biased
11. For a linear regression model $y = x^{⊤} β + ε$, the optimal prediction of $y$ that gives the smallest mean squared error is $E (y ∣ x)$.
A. True
B. False
