Week 10: Model Selection

Summary

Polynomial Regression

  • Polynomial regression allows us to build a nonlinear model.

  • These models can include terms of degree greater than 1:

y = \beta_0 + \beta_1 x + \beta_2 x^2 + \cdots + \beta_n x^n + \epsilon

  • They can also include new features created by multiplying together original features, e.g. x_2^2 or x_1^2 x_4.

  • Polynomial regression is a special case of linear regression.
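A minimal sketch of these ideas in sklearn (the data below is invented for illustration — a noiseless quadratic, so the coefficients can be recovered exactly):

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# Invented data: y is a quadratic function of x with no noise
x = np.linspace(-3, 3, 50).reshape(-1, 1)
y = 2 + 0.5 * x.ravel() - 1.5 * x.ravel() ** 2

# Expand x into the features [x, x^2]; the model is still linear in the betas
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(x)

model = LinearRegression().fit(X_poly, y)
```

Because the model is linear in its coefficients, ordinary least squares on the expanded features is exactly linear regression — which is why polynomial regression is a special case of it.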

Overfitting and Underfitting

  • Overfitting (high variance): The model is overly complex, memorises the training data, and is unable to generalise to new data.

  • Underfitting (high bias): The model is not complex enough and is unable to learn all the trends in the data.
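A sketch of both failure modes, using polynomial degree as the complexity knob (the noisy quadratic data is invented for illustration). Because higher-degree models contain the lower-degree ones, training error can only go down as complexity grows — low training error alone does not mean the model generalises:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Invented data: a quadratic trend plus noise
rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 30).reshape(-1, 1)
y = 1 + 2 * x.ravel() ** 2 + rng.normal(scale=0.1, size=30)

def train_mse(degree):
    """Training-set MSE of a polynomial model of the given degree."""
    X = PolynomialFeatures(degree).fit_transform(x)
    model = LinearRegression().fit(X, y)
    return mean_squared_error(y, model.predict(X))

# A straight line underfits the curve; a degree-15 model can chase the noise
underfit_mse, good_mse, overfit_mse = train_mse(1), train_mse(2), train_mse(15)
```

The underfit line has high training error (high bias); the degree-15 model drives training error below the noise level by memorising the sample (high variance).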

Hyperparameters

  • Model parameters: What your model learns during training, e.g. the β parameters in regression.

  • Hyperparameters: Control how your model will learn during training and must be assigned prior to training, e.g. k in k-means clustering.
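The distinction in code, using the k-means example from the bullet above (the two-blob data is invented for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans

# Invented 2-D data: two tight, well-separated blobs
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 0.1, (20, 2)), rng.normal(5, 0.1, (20, 2))])

# k (n_clusters) is a hyperparameter: we must choose it before calling .fit()
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# The cluster centres are model parameters: learned during .fit()
centres = kmeans.cluster_centers_
```

Nothing in training can change k; it is fixed up front, and methods like holdout (below) exist precisely to choose such values well.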

Holdout

A method for selecting optimal hyperparameters.

  • Training data: data your model learns from during the .fit() process.

  • Validation data: data you use to determine the best hyperparameters. For each candidate setting you .fit() on the training data and .predict() on the validation data.

  • Test data: data you use for the final evaluation of your chosen model, via .predict().
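The three roles above can be sketched end to end — here selecting a polynomial degree by holdout (the data and the candidate degrees are invented for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Invented data: a quadratic trend plus noise
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, (200, 1))
y = 1 - 2 * x.ravel() + 3 * x.ravel() ** 2 + rng.normal(scale=0.1, size=200)

# 60/20/20 train/validation/test split
x_tv, x_test, y_tv, y_test = train_test_split(x, y, test_size=0.2, random_state=0)
x_train, x_vali, y_train, y_vali = train_test_split(x_tv, y_tv, test_size=0.25, random_state=0)

# Fit on the training set, score each candidate degree on the validation set
val_mse = {}
for degree in [1, 2, 3, 5, 10]:
    poly = PolynomialFeatures(degree)
    model = LinearRegression().fit(poly.fit_transform(x_train), y_train)
    val_mse[degree] = mean_squared_error(y_vali, model.predict(poly.transform(x_vali)))

best_degree = min(val_mse, key=val_mse.get)

# Only after choosing the hyperparameter do we touch the test set, once
poly = PolynomialFeatures(best_degree)
model = LinearRegression().fit(poly.fit_transform(x_train), y_train)
test_mse = mean_squared_error(y_test, model.predict(poly.transform(x_test)))
```

The test set is held back until the very end so that the final score is not biased by the hyperparameter search.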

Train-vali-test split

To split the dataset into train, validation and test sets you can use sklearn's train_test_split().

from sklearn.model_selection import train_test_split

# First split off the test set, then split what remains into train and validation
X_tv, X_test, y_tv, y_test = train_test_split(X, y, test_size=test_size, random_state=seed)
X_train, X_vali, y_train, y_vali = train_test_split(X_tv, y_tv, test_size=validation_size, random_state=seed)
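Note that the second split's fraction applies to what is left after the first. A runnable example with invented data: holding out 20% for testing and then 25% of the remainder for validation gives a 60/20/20 split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Invented data: 50 rows of features and targets
X = np.arange(100).reshape(50, 2)
y = np.arange(50)

# 20% of 50 rows -> 10 test rows; 25% of the remaining 40 -> 10 validation rows
X_tv, X_test, y_tv, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_vali, y_train, y_vali = train_test_split(X_tv, y_tv, test_size=0.25, random_state=0)
```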

Evaluation Metrics

\mathrm{MSE} = \frac{\sum_{i=1}^{N} (y_i - \hat{y}_i)^2}{N}

from sklearn.metrics import mean_squared_error as mse
mse(y_true, y_pred)  # first argument: true values, second: predicted values
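As a quick check that the formula and the library agree, the MSE can also be computed by hand (the values below are invented):

```python
import numpy as np
from sklearn.metrics import mean_squared_error as mse

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

# MSE = (1/N) * sum of squared residuals, matching the formula above
manual = np.sum((y_true - y_pred) ** 2) / len(y_true)
```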