Week 10: Model Selection

Summary

Polynomial Regression

  • Polynomial regression allows us to build a nonlinear model.

  • These models can include terms of degree greater than 1:

y = \beta_0 + \beta_1 x + \beta_2 x^2 + \cdots + \beta_n x^n + \epsilon

  • They can also include new features created by multiplying together original features, e.g. x_2^2 or x_1^2 x_4.

  • Polynomial regression is a special case of linear regression.
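A minimal sketch of these ideas in sklearn (the data below is invented for illustration — a noiseless quadratic, so the coefficients can be recovered exactly):

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# Invented data: y is a quadratic function of x with no noise
x = np.linspace(-3, 3, 50).reshape(-1, 1)
y = 2 + 0.5 * x.ravel() - 1.5 * x.ravel() ** 2

# Expand x into the features [x, x^2]; the model is still linear in the betas
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(x)

model = LinearRegression().fit(X_poly, y)
```

Because the model is linear in its coefficients, ordinary least squares on the expanded features is exactly linear regression — which is why polynomial regression is a special case of it.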

Overfitting and Underfitting

  • Overfitting (high variance): The model is overly complex, memorises the training data, and is unable to generalise to new data.

  • Underfitting (high bias): The model is not complex enough and is unable to learn all the trends in the data.
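A sketch of both failure modes, using polynomial degree as the complexity knob (the noisy quadratic data is invented for illustration). Because higher-degree models contain the lower-degree ones, training error can only go down as complexity grows — low training error alone does not mean the model generalises:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Invented data: a quadratic trend plus noise
rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 30).reshape(-1, 1)
y = 1 + 2 * x.ravel() ** 2 + rng.normal(scale=0.1, size=30)

def train_mse(degree):
    """Training-set MSE of a polynomial model of the given degree."""
    X = PolynomialFeatures(degree).fit_transform(x)
    model = LinearRegression().fit(X, y)
    return mean_squared_error(y, model.predict(X))

# A straight line underfits the curve; a degree-15 model can chase the noise
underfit_mse, good_mse, overfit_mse = train_mse(1), train_mse(2), train_mse(15)
```

The underfit line has high training error (high bias); the degree-15 model drives training error below the noise level by memorising the sample (high variance).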

Hyperparameters

  • Model parameters: What your model learns during training, e.g. the β parameters in regression.

  • Hyperparameters: Control how your model will learn during training and must be assigned prior to training, e.g. k in k-means clustering.
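The distinction in code, using the k-means example from the bullet above (the two-blob data is invented for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans

# Invented 2-D data: two tight, well-separated blobs
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 0.1, (20, 2)), rng.normal(5, 0.1, (20, 2))])

# k (n_clusters) is a hyperparameter: we must choose it before calling .fit()
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# The cluster centres are model parameters: learned during .fit()
centres = kmeans.cluster_centers_
```

Nothing in training can change k; it is fixed up front, and methods like holdout (below) exist precisely to choose such values well.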

Holdout

A method for selecting optimal hyperparameters.

  • Training data: data your model learns from during the .fit() process.

  • Validation data: data you use to determine the best hyperparameters. For each candidate setting you .fit() on the training data and .predict() on the validation data.

  • Test data: data you use for the final evaluation of your chosen model, via .predict().
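The three roles above can be sketched end to end — here selecting a polynomial degree by holdout (the data and the candidate degrees are invented for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Invented data: a quadratic trend plus noise
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, (200, 1))
y = 1 - 2 * x.ravel() + 3 * x.ravel() ** 2 + rng.normal(scale=0.1, size=200)

# 60/20/20 train/validation/test split
x_tv, x_test, y_tv, y_test = train_test_split(x, y, test_size=0.2, random_state=0)
x_train, x_vali, y_train, y_vali = train_test_split(x_tv, y_tv, test_size=0.25, random_state=0)

# Fit on the training set, score each candidate degree on the validation set
val_mse = {}
for degree in [1, 2, 3, 5, 10]:
    poly = PolynomialFeatures(degree)
    model = LinearRegression().fit(poly.fit_transform(x_train), y_train)
    val_mse[degree] = mean_squared_error(y_vali, model.predict(poly.transform(x_vali)))

best_degree = min(val_mse, key=val_mse.get)

# Only after choosing the hyperparameter do we touch the test set, once
poly = PolynomialFeatures(best_degree)
model = LinearRegression().fit(poly.fit_transform(x_train), y_train)
test_mse = mean_squared_error(y_test, model.predict(poly.transform(x_test)))
```

The test set is held back until the very end so that the final score is not biased by the hyperparameter search.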

Train-vali-test split

To split the dataset into train, validation and test sets you can use sklearn's train_test_split().

from sklearn.model_selection import train_test_split

# First split off the test set, then split what remains into train and validation
X_tv, X_test, y_tv, y_test = train_test_split(X, y, test_size=test_size, random_state=seed)
X_train, X_vali, y_train, y_vali = train_test_split(X_tv, y_tv, test_size=validation_size, random_state=seed)
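Note that the second split's fraction applies to what is left after the first. A runnable example with invented data: holding out 20% for testing and then 25% of the remainder for validation gives a 60/20/20 split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Invented data: 50 rows of features and targets
X = np.arange(100).reshape(50, 2)
y = np.arange(50)

# 20% of 50 rows -> 10 test rows; 25% of the remaining 40 -> 10 validation rows
X_tv, X_test, y_tv, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_vali, y_train, y_vali = train_test_split(X_tv, y_tv, test_size=0.25, random_state=0)
```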

Evaluation Metrics

\mathrm{MSE} = \frac{\sum_{i=1}^{N} (y_i - \hat{y}_i)^2}{N}

from sklearn.metrics import mean_squared_error as mse
mse(y_true, y_pred)  # first argument: true values, second: predicted values
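As a quick check that the formula and the library agree, the MSE can also be computed by hand (the values below are invented):

```python
import numpy as np
from sklearn.metrics import mean_squared_error as mse

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

# MSE = (1/N) * sum of squared residuals, matching the formula above
manual = np.sum((y_true - y_pred) ** 2) / len(y_true)
```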