
BUSS6002 W7 Feature Engineering and NLP

Prepared by Ella Luo

Part 1. Feature Engineering

1. What is feature engineering?

  • The process of constructing predictors from raw data, based on domain knowledge or algorithm requirements
  • Good feature engineering can significantly improve model performance
  • Example: polynomial regression

2. Log Transformation

Original linear regression: $\text{AmountSpent} = \beta_{0} + \beta_{1} \times \text{Salary} + \varepsilon$

  • The residual plots show clear patterns, which indicate a non-linear relationship
  • The residuals also show heteroscedasticity: the variance of the errors is not constant
Log transformation: $\log(\text{AmountSpent}) = \beta_{0} + \beta_{1} \times \log(\text{Salary}) + \varepsilon$


  • After the transformation, the residual plot shows no pattern
  • The variance of the residuals is now almost constant
  • Log transformation can help:
    • Remove certain types of non-linearity
    • Stabilise variance
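The variance-stabilising effect can be checked numerically. The following is a minimal sketch on synthetic data, not the course dataset: the coefficients (1.0 and 0.5), the noise level, the seed, and the helper names `fit_simple_ols` and `residual_variance_ratio` are all assumptions made for illustration. It fits the same one-predictor regression on the raw scale and on the log scale, then compares the residual variance for high-salary and low-salary customers.

```python
import math
import random
import statistics


def fit_simple_ols(x, y):
    """Return (intercept, slope) of the least-squares line y = b0 + b1 * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def residual_variance_ratio(x, y):
    """Fit y on x, then return Var(residuals | high x) / Var(residuals | low x)."""
    b0, b1 = fit_simple_ols(x, y)
    # Sort observations by the predictor and split them at the median.
    ordered = sorted((xi, yi - (b0 + b1 * xi)) for xi, yi in zip(x, y))
    half = len(ordered) // 2
    low = [res for _, res in ordered[:half]]
    high = [res for _, res in ordered[half:]]
    return statistics.variance(high) / statistics.variance(low)


# Synthetic data with multiplicative noise (assumed, for illustration only).
rng = random.Random(1)
salary = [math.exp(rng.uniform(9.0, 12.0)) for _ in range(2000)]
spent = [math.exp(1.0 + 0.5 * math.log(s) + rng.gauss(0.0, 0.4)) for s in salary]

# A ratio far from 1 signals heteroscedasticity; a ratio near 1 signals stable variance.
raw_ratio = residual_variance_ratio(salary, spent)
log_ratio = residual_variance_ratio(
    [math.log(s) for s in salary], [math.log(a) for a in spent]
)
print(f"raw scale: {raw_ratio:.2f}, log scale: {log_ratio:.2f}")
```

On data generated this way, the raw-scale ratio should sit well above 1 while the log-scale ratio should be close to 1, which mirrors the two residual plots described above.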

Retransformation bias:

  • The model after log-transform: $\log(y) = \mathbf{x}^{\top} \widehat{\boldsymbol{\beta}} + \varepsilon$
  • Thus, if we need to predict $\hat{y}$:
  • Naïve prediction: $\hat{y}_{\text{naive}} = \exp(\widehat{\log(y)}) = \exp(\mathbf{x}^{\top} \widehat{\boldsymbol{\beta}})$
  • Optimal prediction: $\hat{y}_{\text{optimal}} = \mathbb{E}(y \mid \mathbf{x}) = \exp(\mathbf{x}^{\top} \widehat{\boldsymbol{\beta}})\, \mathbb{E}[\exp(\varepsilon)] = \hat{y}_{\text{naive}} \times \mathbb{E}[\exp(\varepsilon)]$
  • According to Jensen's inequality (exp is convex and $\mathbb{E}(\varepsilon) = 0$): $\mathbb{E}[\exp(\varepsilon)] > \exp[\mathbb{E}(\varepsilon)] = 1$
  • Thus, $\hat{y}_{\text{naive}} < \hat{y}_{\text{optimal}}$: the naïve prediction systematically underestimates
  • Correcting the bias with the sample average of $\exp(\varepsilon_{i})$: $\hat{y}_{\text{unbiased}} := \hat{y}_{\text{naive}} \times \frac{1}{n} \sum_{i=1}^{n} \exp(\varepsilon_{i}) = \exp(\mathbf{x}^{\top} \widehat{\boldsymbol{\beta}})\, \frac{1}{n} \sum_{i=1}^{n} \exp(\varepsilon_{i})$
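The correction above can be sketched in a few lines of Python. This is an illustrative example on synthetic data, assuming a single predictor and using the fitted residuals in place of $\varepsilon_{i}$; the coefficients, noise level, seed, and the names `fit_simple_ols`, `predict_naive`, and `predict_corrected` are hypothetical and not part of the course material.

```python
import math
import random


def fit_simple_ols(x, y):
    """Return (intercept, slope) of the least-squares line y = b0 + b1 * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Synthetic log-log data (assumed, for illustration only).
rng = random.Random(0)
log_salary = [rng.uniform(9.0, 12.0) for _ in range(2000)]
log_spent = [1.0 + 0.5 * s + rng.gauss(0.0, 0.8) for s in log_salary]

# Fit the model on the log scale and keep its residuals.
b0, b1 = fit_simple_ols(log_salary, log_spent)
residuals = [ls - (b0 + b1 * s) for s, ls in zip(log_salary, log_spent)]

# Sample average of exp(residual): the estimate of E[exp(eps)].
smearing = sum(math.exp(e) for e in residuals) / len(residuals)


def predict_naive(salary):
    """Back-transform the log-scale prediction without any correction."""
    return math.exp(b0 + b1 * math.log(salary))


def predict_corrected(salary):
    """Naive prediction multiplied by the estimated E[exp(eps)]."""
    return predict_naive(salary) * smearing


print(f"correction factor: {smearing:.3f}")
print(f"naive: {predict_naive(50000.0):.1f}, corrected: {predict_corrected(50000.0):.1f}")
```

Because the residuals of a regression with an intercept average to zero, the correction factor is at least 1, so the corrected prediction is never below the naïve one, in line with the Jensen's inequality argument.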

3. Dealing with Categorical Predictors

1) Using dummy variables to handle qualitative predictors:

For example, for the Gender predictor: { 'Female', 'Male' }

A dummy variable can be constructed by:

$$\text{Female} = \begin{cases} 1, & \text{if Gender} = \text{'Female'} \\ 0, & \text{if Gender} = \text{'Male'} \end{cases}$$
  • $k$ categories need only $k-1$ dummy variables
  • Reference or baseline category: the one that is not explicitly represented by a dummy variable
  • Each dummy coefficient is interpreted relative to the reference category
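The $k-1$ rule can be illustrated with a small pure-Python sketch. The helper `dummy_encode` and the `region` example are hypothetical, introduced only for illustration; in practice a library routine would normally be used instead.

```python
def dummy_encode(values, reference):
    """Encode a categorical column as k-1 dummy columns, omitting `reference`."""
    categories = sorted(set(values))
    if reference not in categories:
        raise ValueError(f"reference category {reference!r} not found in data")
    # Every category except the reference gets its own 0/1 column.
    kept = [c for c in categories if c != reference]
    return {c: [1 if v == c else 0 for v in values] for c in kept}


# Two categories -> one dummy variable ('Male' is the reference).
gender = ["Female", "Male", "Male", "Female"]
gender_dummies = dummy_encode(gender, reference="Male")
print(gender_dummies)

# Three categories -> two dummy variables ('North' is the reference).
region = ["North", "South", "West", "South"]
region_dummies = dummy_encode(region, reference="North")
print(region_dummies)
```

A row whose dummies are all 0 belongs to the reference category, which is why a separate column for it would be redundant alongside the intercept.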

Consider the following regression with dummy variable Female:

  • $\text{AmountSpent} = \beta_{0} + \beta_{1} \times \text{Salary} + \beta_{2} \times \text{Children} + \beta_{3} \times \text{Catalogs} + \beta_{4} \times \text{Female} + \varepsilon$
  • $\text{Female} = \begin{cases} 1, & \text{if Gender} = \text{'Female'} \\ 0, & \text{if Gender} = \text{'Male'} \end{cases}$

This means that:

$$\text{AmountSpent} = \begin{cases} \hat{\beta}_{0} + \hat{\beta}_{4} + \hat{\beta}_{1} \times \text{Salary} + \hat{\beta}_{2} \times \text{Children} + \hat{\beta}_{3} \times \text{Catalogs} + \varepsilon, & \text{for female} \\ \hat{\beta}_{0} + \hat{\beta}_{1} \times \text{Salary} + \hat{\beta}_{2} \times \text{Children} + \hat{\beta}_{3} \times \text{Catalogs} + \varepsilon, & \text{for male} \end{cases}$$
  • Interpretation: all else being equal, a female customer spends on average $\hat{\beta}_{4}$ more than a male customer

2) Interactions:

Original regression:

$$\text{AmountSpent} = \beta_{0} + \beta_{1} \times \text{Salary} + \beta_{2} \times \text{Children} + \varepsilon$$
  • Interpretation of $\beta_{1}$: average spending increases by $\hat{\beta}_{1}$ for each one-dollar increase in salary, holding the number of children constant

Regression with interaction:

$$\text{AmountSpent} = \beta_{0} + \beta_{1} \times \text{Salary} + \beta_{2} \times \text{Children} + \beta_{3} \times \text{Salary} \times \text{Children} + \varepsilon$$
  • Interpretation: when salary increases by one dollar, average spending increases by $(\hat{\beta}_{1} + \hat{\beta}_{3} \times \text{Children})$, so the effect of salary depends on the number of children
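The interpretation follows from collecting the Salary terms of the regression with interaction. A short derivation, using the same symbols as the model and assuming $\mathbb{E}(\varepsilon) = 0$:

```latex
\begin{aligned}
\text{AmountSpent}
  &= \beta_{0} + \beta_{1} \times \text{Salary} + \beta_{2} \times \text{Children}
     + \beta_{3} \times \text{Salary} \times \text{Children} + \varepsilon \\
  &= \beta_{0} + (\beta_{1} + \beta_{3} \times \text{Children}) \times \text{Salary}
     + \beta_{2} \times \text{Children} + \varepsilon \\[4pt]
\frac{\partial\, \mathbb{E}(\text{AmountSpent})}{\partial\, \text{Salary}}
  &= \beta_{1} + \beta_{3} \times \text{Children}
\end{aligned}
```

With no children the slope on Salary is $\beta_{1}$ alone, and each additional child shifts that slope by $\beta_{3}$.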

3) Polynomial Regression

Polynomial regression allows us to model a nonlinear relationship between the response and a predictor:
$$y = \beta_{0} + \beta_{1} x + \beta_{2} x^{2} + \cdots + \beta_{p} x^{p} + \varepsilon$$
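Polynomial regression is still linear in the coefficients, so it can be fitted by ordinary least squares on the constructed features $1, x, x^{2}, \ldots, x^{p}$. The sketch below shows this in pure Python by solving the normal equations; the helper names and the noise-free example data are assumptions for illustration, and a numerical library would normally be preferred for real data.

```python
def polynomial_features(x, degree):
    """Return one row [1, x, x^2, ..., x^degree] for each value in x."""
    return [[xi ** p for p in range(degree + 1)] for xi in x]


def solve_linear_system(a, b):
    """Solve a @ coef = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        # Move the row with the largest entry in this column to the pivot position.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    # Back-substitution on the upper-triangular system.
    coef = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * coef[c] for c in range(r + 1, n))
        coef[r] = (aug[r][n] - tail) / aug[r][r]
    return coef


def fit_polynomial(x, y, degree):
    """Least-squares estimates of beta_0..beta_p via the normal equations."""
    design = polynomial_features(x, degree)
    k = degree + 1
    xtx = [[sum(row[i] * row[j] for row in design) for j in range(k)] for i in range(k)]
    xty = [sum(row[i] * yi for row, yi in zip(design, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Noise-free quadratic data (assumed): y = 1 + 2x + 3x^2.
x = [-2.0, -1.0, 0.0, 1.0, 2.0]
y = [1.0 + 2.0 * xi + 3.0 * xi ** 2 for xi in x]
beta = fit_polynomial(x, y, degree=2)
print(beta)
```

Since the example data contain no noise, the fitted coefficients should recover the generating values 1, 2, and 3 up to floating-point error.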