The model after log-transform: $\log(y)=\mathbf{x}^{\top}\widehat{\boldsymbol{\beta}}+\varepsilon$
Thus, if we need to predict $\hat{y}$:
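Exponentiating the fitted log-scale value alone tends to under-predict $y$, because $\mathbb{E}[\exp(\varepsilon)] \geq 1$ when the errors have mean zero. One common correction multiplies the naive prediction by the average of the exponentiated residuals. The sketch below assumes that this smearing-type adjustment is the one intended here, which the practice question on exponentiated residuals suggests. The function name `smearing_predict` and the simulated data are illustrative only.

```python
import numpy as np

def smearing_predict(x, y, x_new):
    """Fit OLS of log(y) on x, then predict y at x_new on the original scale.

    Returns (naive, corrected), where corrected = naive * mean(exp(residuals)).
    """
    X1 = np.column_stack([np.ones(len(x)), x])            # add intercept column
    beta, *_ = np.linalg.lstsq(X1, np.log(y), rcond=None)
    resid = np.log(y) - X1 @ beta                         # log-scale residuals
    smear = np.mean(np.exp(resid))                        # (1/n) * sum(exp(e_i))
    features = np.concatenate([[1.0], np.atleast_1d(x_new)])
    naive = np.exp(features @ beta)                       # exp(x' beta_hat)
    return naive, naive * smear

# Simulated data (an assumption for illustration, not from the notes).
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 2.0, size=500)
y = np.exp(1.0 + 0.5 * x + rng.normal(0.0, 0.6, size=500))

naive, corrected = smearing_predict(x, y, 1.0)
print(naive, corrected)
```

Because OLS residuals from a model with an intercept average to zero, the smearing factor is at least 1, so the corrected prediction is never below the naive one.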
$$\text{Female}=\begin{cases}1, & \text{if Gender}=\text{'Female'} \\ 0, & \text{if Gender}=\text{'Male'}\end{cases}$$
This means that:
$$\text{AmountSpent}=\begin{cases}\hat{\beta}_{0}+\hat{\beta}_{4}+\hat{\beta}_{1}\times\text{Salary}+\hat{\beta}_{2}\times\text{Children}+\hat{\beta}_{3}\times\text{Catalogs}+\varepsilon, & \text{for female} \\ \hat{\beta}_{0}+\hat{\beta}_{1}\times\text{Salary}+\hat{\beta}_{2}\times\text{Children}+\hat{\beta}_{3}\times\text{Catalogs}+\varepsilon, & \text{for male}\end{cases}$$
Interpretation: all else being equal, a female customer spends on average $\hat{\beta}_{4}$ more than a male customer.
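The dummy encoding above can be sketched in a few lines. The data below are made up and noise-free, built from known coefficients so that the fitted gender gap is easy to check. The coefficient ordering follows the equation above, with the Female dummy as $\hat{\beta}_{4}$.

```python
import numpy as np

# Hypothetical customers (values invented for illustration).
gender = ["Female", "Male", "Female", "Male", "Female", "Male"]
salary = np.array([30.0, 45.0, 60.0, 52.0, 80.0, 38.0])   # in thousands
children = np.array([0.0, 1.0, 2.0, 0.0, 3.0, 1.0])
catalogs = np.array([6.0, 12.0, 18.0, 24.0, 12.0, 6.0])

# Dummy encoding: 1 for 'Female', 0 for 'Male' (the baseline category).
female = np.array([1.0 if g == "Female" else 0.0 for g in gender])

# Noise-free response built from known coefficients.
amount_spent = 100 + 10 * salary - 20 * children + 5 * catalogs + 50 * female

# Column order: intercept, Salary, Children, Catalogs, Female (beta_0 ... beta_4).
X = np.column_stack([np.ones(len(gender)), salary, children, catalogs, female])
beta_hat, *_ = np.linalg.lstsq(X, amount_spent, rcond=None)

print(np.round(beta_hat, 6))
```

Here `beta_hat[4]` recovers, up to floating-point error, the gap of 50 that was built into the data: the difference in spending between a female and a male customer with the same salary, children and catalogs.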
2) Interactions:
Original regression:
$$\text{AmountSpent}=\beta_{0}+\beta_{1}\times\text{Salary}+\beta_{2}\times\text{Children}+\varepsilon$$
Interpretation of $\beta_{1}$: average spending increases by $\hat{\beta}_{1}$ for each one-dollar increase in salary, holding the number of children constant.
Interpretation (with an interaction term): when salary increases by one dollar, average spending increases by $(\hat{\beta}_{1}+\hat{\beta}_{3}\times\text{Children})$.
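The regression with the interaction term is not written out here. A form consistent with the stated marginal effect, assuming $\beta_{3}$ is the coefficient on the product Salary × Children, would be:

```latex
\begin{aligned}
\text{AmountSpent} &= \beta_{0} + \beta_{1} \times \text{Salary} + \beta_{2} \times \text{Children}
  + \beta_{3} \times \text{Salary} \times \text{Children} + \varepsilon \\
\frac{\partial\, \mathbb{E}(\text{AmountSpent})}{\partial\, \text{Salary}}
  &= \beta_{1} + \beta_{3} \times \text{Children}
\end{aligned}
```

Under this assumed form, the effect of salary is no longer a single number: for a customer with two children, for instance, the estimated effect of one extra dollar of salary is $\hat{\beta}_{1}+2\hat{\beta}_{3}$.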
3) Polynomial Regression
Allows us to model a nonlinear relationship between the response and a predictor.
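A polynomial regression is still linear in its coefficients: powers of the predictor are simply added as extra features. A minimal sketch using NumPy's `polyfit`, on made-up noise-free data generated from a known quadratic:

```python
import numpy as np

x = np.linspace(-2.0, 2.0, 9)
y = 2.0 * x**2 - 3.0 * x + 1.0      # exact quadratic, no noise

# np.polyfit returns coefficients from the highest power down to the constant.
coefs = np.polyfit(x, y, deg=2)

# Evaluate the fitted polynomial at a new predictor value.
y_hat = np.polyval(coefs, 1.5)
print(coefs, y_hat)
```

With real, noisy data the recovered coefficients would only approximate the underlying curve, and the degree would have to be chosen with care to avoid overfitting.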
Can represent highly complex deterministic queries about sequences of numbers, characters and special symbols.
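The description above matches regular expressions (RegEx), which appear among the tools in the practice questions. Assuming that is what is meant, a minimal sketch with Python's standard `re` module looks like this (the sample text and pattern are invented for illustration):

```python
import re

text = "Orders placed on 2023-01-15 and 2023-02-03; ref #A-17."

# Four digits, a hyphen, two digits, a hyphen, two digits, on word boundaries.
date_pattern = r"\b\d{4}-\d{2}-\d{2}\b"

dates = re.findall(date_pattern, text)
print(dates)
```

The match is purely deterministic: the pattern describes the shape of the character sequence and knows nothing about what the text means.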
Structuring Unstructured Data
Keywords are called tokens
A token is a useful semantic unit for processing
In English: often identified by the whitespace and/or punctuation around it
(1) Bag-of-words (BOW):
Raw text → bag-of-words representation
Extract features from text that describe the occurrence of words within a document
Create a vocabulary of known words
A measure of the presence of known words
Does not capture word order (only captures occurrence)
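The bag-of-words steps can be sketched with the standard library alone. The tokenizer below is a deliberately simple assumption (lowercase, alphabetic runs only), and the two documents are made up.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

docs = ["The dog chased the cat.", "The cat sat."]
tokenized = [tokenize(d) for d in docs]

# Vocabulary of known words, in a fixed (alphabetical) order.
vocab = sorted({tok for doc in tokenized for tok in doc})

def bow_vector(tokens):
    """Count how often each vocabulary word occurs in the token list."""
    counts = Counter(tokens)
    return [counts[w] for w in vocab]

vectors = [bow_vector(doc) for doc in tokenized]
print(vocab)
print(vectors)
```

Since only counts are kept, "The cat chased the dog" would produce exactly the same vector as the first document, which is the loss of word order noted above.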
(2) Term frequency - inverse document frequency (TF-iDF):
$$\frac{N \text{ of token } t \text{ in document } d}{N \text{ of tokens in document } d} \times \ln\left(\frac{N \text{ of documents}}{N \text{ of documents containing token } t}\right)$$
Translates counts into a measure of how “important” (i.e. unique) a token is to a document relative to a wider collection or corpus.
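The formula can be implemented directly. This is a minimal sketch on a made-up, pre-tokenized corpus. Library implementations often add smoothing or normalization, so their numbers may differ from this plain version.

```python
import math

def tf_idf(token, doc, corpus):
    """TF-iDF of `token` in `doc`, following the formula above.

    `doc` is a list of tokens and `corpus` is a list of such documents.
    The token must occur in at least one document of the corpus.
    """
    tf = doc.count(token) / len(doc)                     # term frequency
    n_containing = sum(1 for d in corpus if token in d)  # document frequency
    idf = math.log(len(corpus) / n_containing)           # natural log, as above
    return tf * idf

corpus = [
    ["the", "dog", "barks"],
    ["the", "cat", "meows"],
    ["the", "dog", "and", "the", "cat"],
]

print(tf_idf("the", corpus[0], corpus))    # appears in every document
print(tf_idf("barks", corpus[0], corpus))  # appears in one document only
```

A token that occurs in every document gets a score of zero, because $\ln(1)=0$, while a token confined to one document scores highest there.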
(3) N-gram:
A sequence of $n$ tokens
1-gram: unigram
2-gram: bigram
3-gram: trigram
Maintains the structure, context and possibly the meaning of the original text
Compatible with BoW and TF-iDF (can be used within BoW and TF-iDF)
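N-grams can be generated with a sliding window over the token list. A minimal sketch (the helper name `ngrams` is illustrative):

```python
def ngrams(tokens, n):
    """Return all runs of n consecutive tokens, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "one does not simply walk into mordor".split()

unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)
print(bigrams)
```

Each tuple can then be treated as a single "token" when building bag-of-words counts or TF-iDF scores, which is how n-grams carry some local word order into those representations.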
Reduce complexity in unstructured data:
(1) Stop Words:
Filter out words that tend to add little to the meaning of a text.
Stop words usually refers to the most common words in a language
There is no single universal list of stop words used by all natural language processing tools.
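Stop-word filtering is a simple set lookup. The list below is a tiny hand-picked assumption for illustration. Real toolkits ship their own, differing lists.

```python
# Illustrative stop-word list only; not taken from any particular library.
STOP_WORDS = {"the", "a", "an", "is", "of", "to", "and", "in"}

def remove_stop_words(tokens):
    """Drop tokens that appear in the stop-word list (case-insensitive)."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

tokens = "the meaning of a text is in the content words".split()
filtered = remove_stop_words(tokens)
print(filtered)
```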
Lemmatization: done properly, with the use of a vocabulary and morphological analysis of words.
Stemming and lemmatization are used to remove variations in language
This is not a statistical or learned algorithm
Porter 2 (Snowball English Stemmer) is generally the best choice for English
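To show what "rule-based, not learned" means, here is a toy suffix-stripping stemmer. It is emphatically not the Porter 2 algorithm, which applies a much larger and more careful set of rules. The suffix list and the minimum stem length are assumptions made for this sketch.

```python
# Toy rules, checked in this order; NOT the Porter 2 rule set.
SUFFIXES = ("ing", "ed", "es", "s")

def toy_stem(word):
    """Strip the first matching suffix, keeping a stem of at least 3 letters."""
    w = word.lower()
    for suffix in SUFFIXES:
        if w.endswith(suffix) and len(w) - len(suffix) >= 3:
            return w[: -len(suffix)]
    return w

words = ["walk", "walks", "walked", "walking"]
stems = [toy_stem(w) for w in words]
print(stems)
```

All four surface forms collapse to the same stem, so they would be counted as one token. No training data is involved: the behaviour is fixed entirely by the hand-written rules.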
State of the Art:
Sufficiently large training data is needed (the larger the n of an n-gram, the harder it is to estimate probabilities)
Many researchers use 5-grams or 6-grams
Sentiment Analysis
(1) VADER (Valence Aware Dictionary and Sentiment Reasoner)
A lexicon and rule-based sentiment analysis tool that is specifically attuned to social media (Twitter).
Assigns positive or negative points to each word, then adds them up
VADER is specially tuned for Twitter
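The "assign points, then add them up" idea can be sketched with a toy lexicon. The words and scores below are invented for illustration and are not VADER's actual lexicon. The real tool also applies rules for things like negation, capitalization and punctuation, which this sketch omits.

```python
import re

# Hypothetical valence scores; NOT taken from the VADER lexicon.
LEXICON = {"love": 3, "great": 3, "good": 2, "bad": -2, "awful": -3}

def lexicon_score(text):
    """Sum the valence of every known word; unknown words count as 0."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return sum(LEXICON.get(tok, 0) for tok in tokens)

positive = lexicon_score("I love this great phone")
negative = lexicon_score("Awful battery, bad screen")
print(positive, negative)
```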
(2) Word Embedding
Problem: frequency representation of words loses a lot of information
Dog vs poodle
Dog vs cat
Idea: represent words as vectors in multidimensional space
Similar words should be closer together
Solution: Word2Vec
Uses shallow neural networks to calculate these vectors.
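Closeness between word vectors is commonly measured with cosine similarity. The three-dimensional vectors below are hand-made for illustration, not the output of a trained Word2Vec model, which would typically have hundreds of dimensions.

```python
import math

# Made-up embeddings; a trained model would learn these from text.
vectors = {
    "dog":    [0.90, 0.80, 0.10],
    "poodle": [0.85, 0.90, 0.15],
    "car":    [0.10, 0.00, 0.95],
}

def cosine_similarity(u, v):
    """Cosine of the angle between u and v (1 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

sim_dog_poodle = cosine_similarity(vectors["dog"], vectors["poodle"])
sim_dog_car = cosine_similarity(vectors["dog"], vectors["car"])
print(sim_dog_poodle, sim_dog_car)
```

In a count-based representation, "dog" and "poodle" are just two unrelated columns. In an embedding space their similarity becomes something that can be measured.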
PART 3. Ethics In Data Science
1. Ethics in the Age of Algorithmic Decisions
Application of AI (Artificial Intelligence)
Not only a technological advance, but also a social change
Data science and AI may harm society, for example by harming potentially vulnerable populations, misusing personal information, and enabling political manipulation.
Bias and discrimination are built into AI algorithms; AI reflects the values of its creators.
Ethical judgement:
Ethics is concerned with making judgements about intentions and actions
Normative theory: the norms of society, how things ought to be
Descriptive theory: explains how things are
Ethical judgements are normative, not descriptive
Morality:
A system of rules that guide human conduct, and principles for evaluating those rules.
Morality is a system
Comprising values and principles. Just like policies.
Ethics vs Morals
Ethics: the rules that a social system provides us with
Morals: our own principles
Ethics vs values
| Ethics | Values |
| :---: | :---: |
| Set of moral principles | Principles or standards of behavior |
| Professional | Personal |
| Influenced by different professions and organizations | Influenced by family, background, etc. |
| Can vary according to profession | Vary according to the individual |
2. Ethics in Data Science:
Data used in training may contain bias
Algorithms learn from these biases and use conclusions drawn from the biased data to support future decisions.
Usability of data (Data Legality):
Fair Use Agreement:
Harvesting data when a policy is in place explicitly forbidding it
Harvesting data behind a login page/portal without a policy
Harvesting public data that is not explicitly linked anywhere
Harvesting public social media data that is plainly visible may be ethical or unethical, but is probably legal
UK Gov - 5 principles of ethical AI:
Should be developed for the common good and benefit of humanity
Should operate on principles of intelligibility and fairness
Should not be used to diminish the data rights or privacy of individuals, families, or communities
All citizens should have the right to be educated to enable them to flourish mentally, emotionally, and economically alongside AI
The autonomous power to hurt, destroy, or deceive human beings should never be vested in AI.
Five Common Principles of Responsible AI:
| Principle | Description |
| :---: | :--- |
| Beneficence | Promote well-being, preserve dignity, and sustain the planet |
| Non-maleficence | - Privacy, security and 'capability caution' <br> - Avoid misuse: avoid being 'legal, but immoral' |
| Autonomy | - The power to decide <br> - Autonomous power to hurt, destroy or deceive human beings should never be vested in AI <br> - Any delegation should remain overridable |
| Justice | - Promoting prosperity, preserving solidarity, avoiding unfairness <br> - Defend against the risk of bias and threats to "solidarity," including "systems of mutual assistance such as in social insurance and healthcare" |
| Explicability | - 'Intelligibility' <br> - 'Accountability' |
The field of data ethics offers guiding principles for business professionals who handle data.
Ownership: an individual owns their personal information
Transparency: data subjects have a right to know how you plan to collect, store, and use their data. When gathering data, exercise transparency.
Privacy: Even if a customer gives your company consent to collect, store, and analyse their personally identifiable information (PII), that doesn’t mean that it should be made publicly available.
Intention: Before collecting data, ask yourself why you need it, what you’ll gain from it, and what changes you’ll be able to make after analysis. If your intention is to hurt others, profit from your subjects’ weaknesses, or pursue any other malicious goal, it’s not ethical to collect their data.
Outcome: Even when intentions are good, the outcome of data analysis can cause inadvertent harm to individuals or groups of people. This is called a disparate impact, which is outlined in the Civil Rights Act as unlawful.
The bottom line in AI:
“We don’t give advice.
We can’t tell anybody what to do.
Instead we say - what we know about this problem at this time.
And here are the consequences of these actions.”
Practice for Week 7.
1. While developing a classification model for topics in a large corpus of social media posts, you trained a model on data that represented each document through the count of each token it contains. Dissatisfied with the model’s performance, you decide that you need to represent the meaning of the texts more precisely. Which of the following processes would you choose to achieve this?
A. Encode texts as bi-grams.
B. Prepare documents with stemming.
C. Remove all stop words.
D. Engineer features with VADER.
2. You are tasked with categorising a large set of research papers according to their research disciplines. Which of the following is the most suitable choice to distinguish these documents?
A. TF-iDF
B. VADER
C. Bag-of-Words
D. Regular Expressions (RegEx)
3. The following regression model is estimated via least squares, using a data set containing 200 observations:
The sum of the residuals transformed by the natural exponential function, $\exp(a):=e^{a}$, is calculated to be 226, i.e., $\sum_{i=1}^{200}\exp(\varepsilon_{i})=226$. For a newly observed pair of feature values $x_{1}=0.5$ and $x_{2}=2$, which of the following is an unbiased prediction of $y$ (to 2 decimal places)?
A. 373.24
B. 5.80
C. 330.30
D. 331.43
4. Typos and semantic and grammatical errors can be readily identified by current spellcheckers. But they are only as good as the data used to train them. Take the statement “One does not simply wlak into Mordor.” How do you expect algorithms to perform if they were trained on bag-of-words, bi-grams, or tri-grams?
A. All will be similar in performance.
B. Bi-grams will be better than the others.
C. Uni-grams will be better than the others.
D. Tri-grams will be better than the others.
5. Different natural language processing tools serve different purposes. Which of these techniques is least well suited to capture the meaning of elements of a text?
A. TF-iDF
B. Regular Expressions (RegEx)
C. VADER
D. Word embedding
6. If you wanted to explain how TF-iDF works, which of the statements below would be most accurate?
A. TF-iDF calculates a high-dimensional numerical space where words with similar meanings are located close to one another.
B. TF-iDF removes ambiguity by mapping the valence of sentences on a continuous range.
C. TF-iDF uses stop-words to reduce the dimensionality in a similar way to lemmatisation and stemming.
D. TF-iDF adds information to unstructured data by calculating to what degree specific tokens are able to distinguish between different documents.
7. The transformation commonly applied to categorical variables is known as:
A. Baseline variable
B. Dummy variable
C. Log transformation
D. Interaction terms
8. You used word2vec to engineer feature vectors for a collection of documents. To check if it worked, you compare the vectors for word pairs. What would you expect to find for the pair “mother” and “daughter”?
A. They would be related in the same way as “king” and “queen”.
B. They would be orthogonal to each other.
C. They would be on opposite ends of the vector space.
D. They would be related in the same way as “dog” and “puppy”.
9. Different natural language processing tools serve different purposes. Which of these techniques is least well suited to capture the meaning of elements of a text?
A. Regular Expressions (RegEx)
B. Word embedding
C. TF-iDF
D. VADER
10. Consider the following linear regression model: $\log(y)=\hat{\beta}_{0}+\hat{\beta}_{1}x+\varepsilon$. For a given value of $x$, the prediction of $y$ via $\hat{y}=\exp(\hat{\beta}_{0}+\hat{\beta}_{1}x)$ is ______.
A. Upward biased
B. Unbiased
C. Optimal with the mean squared error
D. Downward biased
11. For a linear regression model $y=\mathbf{x}^{\top}\boldsymbol{\beta}+\varepsilon$, the optimal prediction of $y$ that gives the smallest mean squared error is $\mathbb{E}(y \mid \mathbf{x})$.
A. True
B. False