Week 8: Feature Engineering
Summary
Dummy Variables
To reduce interdependence (multicollinearity) between dummy variables, we set drop_first = True when creating them (for example in pd.get_dummies). This drops the first category level, which then acts as the baseline.
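As a minimal sketch of the effect of drop_first, assuming the dummies are created with pd.get_dummies: the DataFrame and its "colour" column below are hypothetical toy data, not taken from the course material.

```python
import pandas as pd

# Hypothetical toy data: one categorical column with three levels.
df = pd.DataFrame({"colour": ["red", "green", "blue", "green"]})

# Without drop_first: one indicator column per level (three columns).
full = pd.get_dummies(df, columns=["colour"])

# With drop_first=True: the first level in sorted order ("blue") is dropped.
# A "blue" row is then represented by all remaining indicators being 0/False.
reduced = pd.get_dummies(df, columns=["colour"], drop_first=True)

print(list(full.columns))     # ['colour_blue', 'colour_green', 'colour_red']
print(list(reduced.columns))  # ['colour_green', 'colour_red']
```

With all three columns kept, any one of them can be derived from the other two, which is the interdependence that drop_first removes.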
Binning
pd.cut(Series, bins = number of bins)
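The following sketch shows two common ways of calling pd.cut: with an integer number of bins, and with explicit bin edges. The age values, bin edges, and labels are made-up illustrations.

```python
import pandas as pd

# Hypothetical toy data.
ages = pd.Series([3, 17, 25, 42, 68, 90])

# bins as an integer: the observed range is split into 3 equal-width intervals.
equal_width = pd.cut(ages, bins=3)

# bins as a list of edges: intervals are (0, 18], (18, 65], (65, 120]
# (right-inclusive by default), each given a readable label.
labelled = pd.cut(ages, bins=[0, 18, 65, 120], labels=["child", "adult", "senior"])

print(equal_width.cat.categories)
print(labelled.tolist())  # ['child', 'child', 'adult', 'adult', 'senior', 'senior']
```

Passing an integer is convenient for a quick look at the data, while explicit edges are usually easier to interpret and stay the same when new data arrives.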
Text Analytics
Terminology:
token: a word/a feature/a term
document: made up of tokens. Usually a small chunk of text such as a sentence.
corpus: made up of documents. This is the entire text collection, i.e. the dataset.
Term frequency-inverse document frequency (TF-IDF)
The term frequency (TF) measures how often a word occurs in a document.
The inverse document frequency (IDF) measures how rare the word is across the documents in the corpus.
The TF-IDF score:
boosts the counts or frequency of uncommon words (making those words more significant)
shrinks the counts of common words.
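To make the boosting and shrinking concrete, here is a small hand-rolled sketch. It assumes one common textbook definition (TF as the count divided by document length, IDF as the natural log of the number of documents over the number of documents containing the term). Libraries such as scikit-learn use a smoothed and normalised variant by default, so their numbers will differ. The corpus and the helper names tf, idf and tf_idf are hypothetical.

```python
import math

# Hypothetical toy corpus: each document is already split into tokens.
corpus = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["the", "bird", "flew"],
]

def tf(term, doc):
    # Share of the document's tokens that are this term.
    return doc.count(term) / len(doc)

def idf(term, docs):
    # Assumes the term occurs in at least one document (otherwise division by zero).
    docs_with_term = sum(1 for d in docs if term in d)
    return math.log(len(docs) / docs_with_term)

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

# "the" appears in every document, so its IDF is log(1) = 0 and its score vanishes.
common = tf_idf("the", corpus[0], corpus)
# "cat" appears in only one document, so it keeps a positive score: (1/3) * log(3).
rare = tf_idf("cat", corpus[0], corpus)

print(common, rare)
```

Both words occur once in the first document, so they have the same term frequency; the difference in score comes entirely from the IDF factor.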
from sklearn.feature_extraction.text import TfidfVectorizer
# Ignore common English stop words such as "the" and "and".
tfidf_vectorizer = TfidfVectorizer(stop_words = 'english')
# Learn the vocabulary and IDF weights from the corpus (a list of text documents).
tfidf_vectorizer.fit(corpus)
# Convert each document into a row of TF-IDF scores.
tfidf = tfidf_vectorizer.transform(corpus)