Week 8: Feature Engineering

Summary

Dummy Variables

pd.get_dummies(data)
  • The full set of dummy columns for a variable is linearly dependent (each row sums to 1). To remove this redundancy, pass drop_first = True, which drops the first category of each variable:

pd.get_dummies(data, drop_first = True)
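
As a minimal sketch of the two calls above, the snippet below encodes a small, made-up column. The DataFrame, the column name `colour`, and its values are illustrative assumptions, not data from the course.

```python
import pandas as pd

# Hypothetical toy data; the column name and values are illustrative only.
data = pd.DataFrame({"colour": ["red", "green", "blue", "green"]})

# One indicator column per category: colour_blue, colour_green, colour_red.
full = pd.get_dummies(data)

# drop_first=True drops the first category (here "blue"),
# which then acts as the baseline level.
reduced = pd.get_dummies(data, drop_first=True)

print(list(full.columns))
print(list(reduced.columns))
```

A row with zeros in every remaining column belongs to the dropped baseline category, so no information is lost.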

Binning

pd.cut(series, bins = n_bins)  # n_bins: the number of equal-width bins
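
A short sketch of binning follows, assuming a made-up Series of ages. The values, bin edges, and labels are hypothetical. Passing an integer to `bins` gives equal-width bins over the observed range, while passing a list gives explicit bin edges.

```python
import pandas as pd

# Hypothetical ages, for illustration only.
ages = pd.Series([3, 17, 25, 42, 68, 90])

# Three equal-width bins over the observed range (3 to 90).
# labels=False returns the integer bin index instead of an interval.
equal_width = pd.cut(ages, bins=3, labels=False)

# Explicit edges with custom labels; intervals are right-inclusive by default,
# so these are (0, 18], (18, 65], (65, 120].
labelled = pd.cut(ages, bins=[0, 18, 65, 120], labels=["child", "adult", "senior"])

print(list(equal_width))
print(list(labelled))
```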

Text Analytics

Terminology:

  • token: a word, also called a feature or a term.

  • document: a sequence of tokens, usually a small chunk of text such as a sentence.

  • corpus: a collection of documents. This is the entire text collection, i.e. the dataset.
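
To make the three terms concrete, here is a tiny illustration with a made-up corpus. The sentences are hypothetical, and the whitespace split is a deliberately naive tokeniser used only for this sketch.

```python
# Hypothetical toy corpus: a list of documents (here, short sentences).
corpus = [
    "The cat sat on the mat",
    "The dog chased the cat",
]

# Naive tokenisation: lower-case each document and split on whitespace.
tokens_per_document = [document.lower().split() for document in corpus]

print(len(corpus))             # number of documents in the corpus
print(tokens_per_document[0])  # tokens of the first document
```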

Bag-of-words

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Learn the vocabulary from the corpus, ignoring English stop words
count_vectorizer = CountVectorizer(stop_words = 'english')
count_vectorizer.fit(corpus)

# Sparse document-term matrix of token counts
bow = count_vectorizer.transform(corpus)

# Vocabulary terms, in column order
count_vectorizer.get_feature_names_out()

# Dense view of the counts
bow.todense()

# Counts as a DataFrame with one column per term
bow_df = pd.DataFrame(bow.todense(), columns = count_vectorizer.get_feature_names_out())

Term frequency-inverse document frequency (TF-IDF)

  • The term frequency (TF) measures how often a word, w, occurs in a document.

$$tf_i(w) = \frac{f}{n_t}$$

  • The inverse document frequency (IDF) measures how rare the word is across the documents in the corpus.

$$idf(w) = \ln\left(\frac{n_d}{df(w)}\right)$$

  • The TF-IDF score:

    • boosts the counts or frequencies of uncommon words (making those words more significant).

    • shrinks the counts of common words.

$$tfidf_i(w) = tf_i(w) \cdot idf(w)$$
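
The three formulas can be checked by hand on a small corpus. The notes do not define the symbols, so this sketch assumes the usual reading: f is the number of times w occurs in document i, n_t is the number of tokens in document i, n_d is the number of documents, and df(w) is the number of documents containing w. The corpus and the function names are illustrative.

```python
import math

# Hypothetical toy corpus, already lower-cased; tokenised by whitespace.
corpus = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the bird sang",
]
documents = [doc.split() for doc in corpus]

def tf(word, tokens):
    """Term frequency: occurrences of the word / tokens in the document."""
    return tokens.count(word) / len(tokens)

def idf(word, documents):
    """Inverse document frequency: ln(n_d / df(w)).

    Assumes the word occurs in at least one document (df > 0).
    """
    df = sum(1 for tokens in documents if word in tokens)
    return math.log(len(documents) / df)

def tfidf(word, i, documents):
    """TF-IDF of the word in document i."""
    return tf(word, documents[i]) * idf(word, documents)

for word in ["the", "cat", "mat"]:
    print(word, round(tfidf(word, 0, documents), 4))
```

Under these definitions "the" occurs in every document, so its IDF is ln(1) = 0 and its TF-IDF score is 0, while "mat", which occurs in only one document, scores higher than "cat", which occurs in two.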

from sklearn.feature_extraction.text import TfidfVectorizer

# Learn the vocabulary and IDF weights from the corpus
tfidf_vectorizer = TfidfVectorizer(stop_words = 'english')
tfidf_vectorizer.fit(corpus)

# Sparse document-term matrix of TF-IDF scores
tfidf = tfidf_vectorizer.transform(corpus)
# Apply the already fitted vectorizers to new text
count_vectorizer.transform(new_corpus)
tfidf_vectorizer.transform(new_corpus)