(1) BUSS6002 - Education Course

Week 8: Feature Engineering

Summary

Dummy Variables

pd.get_dummies(data)

The full set of dummy columns is perfectly collinear (each row sums to 1). To remove this redundancy we drop the first level with drop_first = True:

pd.get_dummies(data, drop_first = True)
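As an illustration of the two calls above, here is a minimal sketch on a made-up DataFrame. The column name `colour` and its values are assumptions chosen for the example, not course data.

```python
import pandas as pd

# Hypothetical toy data: one categorical column with three levels.
data = pd.DataFrame({"colour": ["red", "green", "blue", "green"]})

# One indicator column per level.
full = pd.get_dummies(data)

# Same encoding, but the first level (in sorted order) is dropped.
reduced = pd.get_dummies(data, drop_first=True)

print(list(full.columns))     # ['colour_blue', 'colour_green', 'colour_red']
print(list(reduced.columns))  # ['colour_green', 'colour_red']
```

A row with zeros in every remaining column then stands for the dropped level (here `blue`), so no information is lost.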

Binning

pd.cut(series, bins = n_bins)

Here series is a pandas Series and n_bins is the number of equal-width bins.
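A short sketch of binning with pd.cut follows. The `ages` values and the label names are assumptions made up for the example.

```python
import pandas as pd

# Hypothetical numeric feature.
ages = pd.Series([18, 25, 33, 47, 62, 80])

# Three equal-width intervals spanning the range of the data.
binned = pd.cut(ages, bins=3)

# The same bins, with readable labels instead of interval objects.
labelled = pd.cut(ages, bins=3, labels=["low", "mid", "high"])

print(binned.value_counts(sort=False))
print(list(labelled))  # ['low', 'low', 'low', 'mid', 'high', 'high']
```

The result is a categorical Series, so it can be passed to pd.get_dummies if indicator columns are needed.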

Text Analytics

Terminology:

token: a word, feature or term.
document: made up of tokens. Usually a small chunk of text such as a sentence.
corpus: made up of documents. This is the entire text collection, i.e. the dataset.

Bag-of-words

Implementation with sklearn

from sklearn.feature_extraction.text import CountVectorizer

count_vectorizer = CountVectorizer(stop_words = 'english')
count_vectorizer.fit(corpus)
bow = count_vectorizer.transform(corpus)

Formatting the output

count_vectorizer.get_feature_names_out()

bow.todense()

Converting to a pandas DataFrame

bow_df = pd.DataFrame(bow.todense(), columns = count_vectorizer.get_feature_names_out())

Term frequency-inverse document frequency (TF-IDF)

The term frequency (TF) measures how often a word, $w$, occurs in document $i$.

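$tf_i(w) = \frac{f}{n_t}$

where $f$ is the number of times $w$ occurs in document $i$ and $n_t$ is the total number of tokens in that document.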

The inverse document frequency (IDF) measures how rare the word is across the documents in the corpus.

$idf(w) = \ln\left(\frac{n_d}{df(w)}\right)$

where $n_d$ is the number of documents in the corpus and $df(w)$ is the number of documents that contain $w$.

The TF-IDF score:
- boosts the counts (or frequencies) of uncommon words, making them more significant
- shrinks the counts of common words.

$tfidf_i(w) = tf_i(w) \cdot idf(w)$
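To make the three formulas concrete, here is a small hand-rolled sketch in plain Python. The tokenised `corpus` and the helper names `tf`, `idf` and `tfidf` are assumptions introduced for illustration. Note that sklearn's TfidfVectorizer uses a smoothed IDF and normalises each row by default, so its numbers will generally differ from this textbook version.

```python
import math

# Hypothetical corpus, already split into tokens.
corpus = [
    ["cat", "sat", "mat"],
    ["dog", "chased", "cat"],
    ["dog", "barked"],
]

n_d = len(corpus)  # number of documents


def tf(word, doc):
    # f / n_t: occurrences of the word over the number of tokens in the document.
    return doc.count(word) / len(doc)


def idf(word):
    # ln(n_d / df(w)); assumes the word occurs in at least one document.
    df = sum(1 for doc in corpus if word in doc)
    return math.log(n_d / df)


def tfidf(word, doc):
    return tf(word, doc) * idf(word)


# "cat" appears in 2 of 3 documents, "sat" in only 1, so "sat" scores higher.
print(round(tfidf("cat", corpus[0]), 4))  # 0.1352
print(round(tfidf("sat", corpus[0]), 4))  # 0.3662
```

Both words have the same term frequency in the first document (1/3), so the difference in score comes entirely from the IDF term.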

Implementation with sklearn

from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer(stop_words = 'english')
tfidf_vectorizer.fit(corpus)
tfidf = tfidf_vectorizer.transform(corpus)

Transform a new corpus

count_vectorizer.transform(new_corpus)
tfidf_vectorizer.transform(new_corpus)