
Undergraduate Thesis of the School of Management Science and Engineering, Shandong University of Finance and Economics

School of Management Science and Engineering

Big Data Analysis and Prediction Techniques

Semester Project

Title: Spam Message Classification

School: School of Management Science and Engineering

Major: Information Management and Information Systems

Class: Information Management Class 1

Student ID: 20190611636

Name: Zhang Zhiyan

Supervisor: Sun Lingyun

Prepared by the School of Management Science and Engineering, Shandong University of Finance and Economics

June 2021


Title: Spam Message Classification

ABSTRACT

Spam messages are text messages sent to users without their consent that users do not wish to receive, or that users cannot refuse according to their own wishes. They mainly include: (1) commercial, advertising, and similar messages sent to users without their consent; (2) other messages that violate industry self-regulation norms.

The proliferation of spam messages has seriously affected people's daily lives, the image of mobile operators, and even social stability. For example, a fake base station can send messages to 100,000 phones within a three-kilometer radius. Users can now intercept such messages with mobile security software.

This project uses the Naive Bayes model to classify, predict, and visualize an SMS dataset. Through the experiment, the goals are to understand the Bayesian probability formula, grasp the principles of the Bayesian classification algorithm, master the processing of unstructured text, and master spam classification with Naive Bayes.

Keywords: spam messages, Naive Bayes model, unstructured text, data visualization, data prediction


Table of Contents

I. Problem Description
(1) Problem Overview
(2) Data Overview
(3) Prediction
(4) Flowchart

II. Data Preprocessing
(1) Data Overview
(2) Data Preprocessing
1. Import the dataset
2. Import the data
3. Handle missing values
4. Split the data into independent and dependent variables
5. Label encoding
6. Split the dataset
7. Convert the data format
8. Text vectorization

III. Building the Model

IV. Evaluating the Model

V. Visualizing Word Frequency

VI. Predicting New Data

VII. Drawing a Word Cloud
(1) First, using Python
(2) Using WordArt

I. Problem Description

(1) Problem Overview

Implement spam message classification using Naive Bayes.

Spam messages are text messages sent to users without their consent that users do not wish to receive, or that users cannot refuse according to their own wishes. They mainly include: (1) commercial, advertising, and similar messages sent to users without their consent; (2) other messages that violate industry self-regulation norms.

The experiment provides a spam SMS dataset with 5,572 rows and 2 columns, containing a large number of messages. The task is to classify and predict the messages and to visualize the data. The goals are to understand the Bayesian probability formula, grasp the principles of the Bayesian classification algorithm, master the processing of unstructured text, and master spam classification with Naive Bayes.

(2) Data Overview

The dataset has 5,572 rows and 2 columns. The first column, label, is the tag of the message; it takes 2 values: ham for normal messages and spam for junk messages. The second column, content, is the message text, which is unstructured.

(3) Prediction

The model predicts whether a given message is spam: text can be entered and classified as spam or a normal message. The .predict method from Python's sklearn is used. The general steps are: create a classifier object, fit it on the training data, call it on new data X to make predictions, and output the results.

(4) Flowchart:

First, preprocess the data; then build the model; then evaluate it, visualize the data, and predict new data; finally, draw a word cloud.

Figure 1-1

II. Data Preprocessing

(1) Data Overview

Figure 2-1

As the figure above shows, the dataset has 2 columns. The first column, label, is the tag of the message; it takes 2 values: ham for normal messages and spam for junk messages. The second column, content, is the message text, which is unstructured.

(2) Data Preprocessing

1. Import the dataset

First, import the libraries.

Figure 2-2

NLTK:

NLTK is a leading platform for building Python programs that work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text-processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning. NLTK is a well-known natural language processing library for Python, with built-in corpora, part-of-speech resources, and functions for classification, tokenization, and more. It has been called "a wonderful tool for teaching, and working in, computational linguistics using Python" and "an amazing library to play with natural language."

Method 1: nltk.download()

Figure 2-3

Method 2: manual download

Figure 2-4

Figure 2-5
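
A minimal sketch of these two download options (the specific resource names are assumptions):

import nltk

# Method 1: download the needed NLTK resources programmatically
nltk.download('stopwords')
nltk.download('punkt')
# Method 2: place the downloaded packages manually under the nltk_data directory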

2. Import the data

Figure 2-6

As the figure above shows, the dataset has 5,572 rows and 2 columns.
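
A sketch of the loading step with pandas; the file name sms_spam.csv and the read options are assumptions:

import pandas as pd

# Load the SMS dataset into a DataFrame with the two columns described above
dataset = pd.read_csv('sms_spam.csv', encoding='latin-1', names=['label', 'content'])
print(dataset.shape)   # expected: (5572, 2)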

3. Handle missing values

Figure 2-7

Execute the code block.

Inspecting the value of the null_stat variable shows that there is no missing data.
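
A sketch of what that check likely looks like, continuing the sketch above (the variable name null_stat comes from the text):

# Count missing values per column; both counts should be 0
null_stat = dataset.isnull().sum()
print(null_stat)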

4. Split the data into independent and dependent variables

The dataset has 2 columns: the content column is taken as the independent variable and the label column as the dependent variable.

Figure 2-8
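
A sketch of this split (column names follow the data overview above):

X = dataset['content']   # independent variable: the message text
y = dataset['label']     # dependent variable: the ham/spam tag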

5. Label encoding

Since the strings spam and ham cannot be used directly, y is label-encoded.

Figure 2-9

Execute the code block.

After execution, the values ham and spam stored in y become 0 and 1.
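
A sketch of the encoding step; using sklearn's LabelEncoder here is an assumption (a simple mapping would work equally well):

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
y = le.fit_transform(y)   # alphabetical order: ham -> 0, spam -> 1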

6. Split the dataset

Use sklearn to split the dataset into a training set and a test set.

data: the dataset to be split

random_state: sets the random seed so that the same split is generated on every run

test_size: the proportion of the data assigned to the test set (a 2:8 test-to-train split, i.e. test_size=0.2, is common)

Figure 2-10
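
A sketch of the split, continuing the sketches above (the random_state value is an assumption):

from sklearn.model_selection import train_test_split

# 80% training, 20% test, reproducible across runs
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)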

Training set and test set

Training set: the data samples used to fit the model and tune its parameters.

Test set: used to evaluate the generalization ability of the final model. It must not be used as a basis for algorithm-related choices such as parameter tuning or feature selection; its role is confined to the testing stage.

For example:

Training set: the student's textbook; students acquire knowledge from its content.

Validation set: homework; homework reveals how different students are learning and how quickly they are progressing.

Test set: the exam; the exam uses questions never seen before and tests the students' ability to generalize what they have learned.

7. Convert the data format

Convert the data to the list type.

Figure 2-11
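
A sketch of this conversion (assuming X_train and X_test are pandas Series at this point):

# Convert the pandas Series into plain Python lists of strings
X_train = X_train.tolist()
X_test = X_test.tolist()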

8. Text vectorization

Since the model cannot process raw text directly, the text must first be converted into numerical values.

First, use the bag-of-words model for text vectorization.

The raw data of text analysis cannot be fed to an algorithm directly: it is a collection of symbols, whereas most algorithms expect fixed-length numerical feature vectors rather than text documents of varying lengths. Text vectorization is therefore the foundation of text analysis.

Text vectorization (also known as the "word vector model" or "vector space model") represents text as real-valued vectors that a computer can process. Depending on the granularity, text features can be represented at the level of characters, words, sentences, or documents. Text vectorization methods fall mainly into discrete representations and distributed representations.

Text vectorization is implemented with sklearn; the code is as follows:

Figure 2-12

After execution, there are 7,809 unique words.

Figure 2-13
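
A sketch of the bag-of-words step (whether stop words were removed here is an assumption):

from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(stop_words='english')   # stop-word removal is an assumption
X_train_counts = cv.fit_transform(X_train)   # learn the vocabulary on the training set
X_test_counts = cv.transform(X_test)         # reuse the same vocabulary on the test set
print(len(cv.vocabulary_))                   # vocabulary size; the text reports 7,809 unique words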

Next, to make the vectorization more accurate, TF-IDF is applied on top of the counts. TF-IDF preprocessing is done with sklearn; the code is as follows:

Figure 2-14
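
A sketch of the TF-IDF step, assuming sklearn's TfidfTransformer is applied to the count matrices above:

from sklearn.feature_extraction.text import TfidfTransformer

tfidf = TfidfTransformer()
X_train_tfidf = tfidf.fit_transform(X_train_counts)  # fit IDF weights on the training counts
X_test_tfidf = tfidf.transform(X_test_counts)        # apply the same weights to the test counts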

TF-IDF stands for Term Frequency - Inverse Document Frequency. It consists of two parts: TF and IDF.

TF: term frequency, i.e. how often each word occurs in a text.

IDF: inverse document frequency, which captures the fact that some words occur frequently yet matter less than others. In short, IDF reflects how often a word appears across all texts: if a word appears in many texts, its IDF should be low; conversely, if it appears in relatively few texts (for example, specialized terminology), its IDF should be high. In the extreme case where a word appears in every text, its IDF should be 0. The IDF formula is as follows:

IDF(x) = log( N / N(x) )

where N is the total number of texts in the corpus and N(x) is the number of texts in the corpus that contain the word x.

III. Building the Model

Build the Naive Bayes model.

Note: since the task here is text classification, the MultinomialNB class is chosen rather than the GaussianNB class.

Figure 3-1
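
A sketch of the model-building step, trained on the TF-IDF features produced above:

from sklearn.naive_bayes import MultinomialNB

clf = MultinomialNB()             # alpha=1.0 and fit_prior=True by default
clf.fit(X_train_tfidf, y_train)   # fit on the vectorized training messages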

The Naive Bayes formula:

P(Y | X) = P(X | Y) P(Y) / P(X)

From a machine-learning perspective, X is read as "has a certain feature" and Y as "class label." In the simplest binary classification problem (a yes/no decision), Y is read as the label "belongs to a certain class." Bayes' formula then takes the following form:

P("belongs to a class" | "has a feature") = P("has the feature" | "belongs to the class") × P("belongs to the class") / P("has the feature")

In spam message recognition:

P("is spam" | "this sentence") = P("this sentence" | "is spam") × P("is spam") / P("this sentence")

Substituting and simplifying gives:

P("is spam" | "this sentence") = (occurrences of the sentence in spam messages) / (occurrences in spam messages + occurrences in normal messages)

MultinomialNB assumes that the prior probability of the features follows a multinomial distribution, as in the following formula:

P(X_j = x_jl | Y = C_k) = (x_jl + λ) / (m_k + nλ)

Here, P(X_j = x_jl | Y = C_k) is the conditional probability of the l-th value of the j-th feature in the k-th class, and m_k is the number of training samples whose output is class k. λ is a constant greater than 0, usually set to 1 (Laplace smoothing), although other values are possible.

alpha: an optional floating-point parameter, defaulting to 1.0. It adds Laplace smoothing, i.e. the λ in the formula above; setting it to 0 disables smoothing.

fit_prior: an optional Boolean parameter, defaulting to True. It indicates whether class prior probabilities should be considered. If False, all classes receive the same prior probability. Otherwise, the priors can be supplied via the third parameter, class_prior, or, if class_prior is omitted, MultinomialNB estimates them from the training samples, in which case the prior is:

P(Y = C_k) = m_k / m

where m is the total number of training samples and m_k is the number of training samples whose output is class k.
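
A short illustration of the three prior-handling options described above (the prior values shown are assumptions):

from sklearn.naive_bayes import MultinomialNB

clf_uniform = MultinomialNB(fit_prior=False)                          # uniform class priors
clf_custom = MultinomialNB(fit_prior=True, class_prior=[0.9, 0.1])    # user-supplied priors
clf_learned = MultinomialNB(fit_prior=True)                           # priors estimated as m_k / m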

IV. Evaluating the Model

After building the model, its accuracy needs to be evaluated. The code is as follows:

Figure 4-1
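
A sketch of the evaluation step, reporting accuracy on the test set (the use of a confusion matrix is an assumption):

from sklearn.metrics import accuracy_score, confusion_matrix

y_pred = clf.predict(X_test_tfidf)
print(accuracy_score(y_test, y_pred))    # fraction of correctly classified messages
print(confusion_matrix(y_test, y_pred))  # TP/TN/FP/FN breakdown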

Execute the code block.

The results after execution are shown in the figures below.

Figure 4-2

Figure 4-3

Model evaluation is used to assess how good a model is.

The model's error on the training set is usually called the training error or empirical error, while the error on new samples is called the generalization error. Clearly, the goal of machine learning is to obtain a learner with a small generalization error. In practice, however, new samples are unknown, so one can only keep the training error as small as possible. Therefore, to obtain a model with small generalization error, the dataset is usually split into mutually independent training, validation, and test sets when the model is built: during training, the validation set is used to evaluate the model and update hyperparameters accordingly, and after training, the test set is used to assess the performance of the final model.

Model comparison:

Comparing models within a single training run.

Comparing models across multiple training runs.

Comparing models built with different algorithms.

Accuracy: the proportion of correctly classified samples out of all samples, given by the following formula:

ACC = (TP + TN) / (TP + TN + FP + FN)

True Positive (TP): the number of positive samples predicted as positive.

True Negative (TN): the number of negative samples predicted as negative.

False Positive (FP): the number of negative samples predicted as positive.

False Negative (FN): the number of positive samples predicted as negative.

Positive sample: a sample of the class that is to be predicted as 1.

Negative sample: a sample of the class that is to be predicted as 0.

Accuracy is the simplest and most intuitive evaluation metric for classification, but it has an obvious flaw: when, say, 99% of the samples are negative, a classifier that predicts every sample as negative still achieves 99% accuracy. So when the class proportions are highly imbalanced, the majority class dominates the accuracy.

V. Visualizing Word Frequency

Visualizing the high-frequency words of spam messages and of normal messages separately helps show which words appear often in each class.

Before visualization, the data needs to be separated by class.

Based on the spam and ham labels in the dataset, create two datasets storing the spam messages and the normal messages respectively. The code is as follows:

Figure 5-1
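
A sketch of that split (the variable names are assumptions):

# Separate the messages by their label
spam_data = dataset[dataset['label'] == 'spam']
ham_data = dataset[dataset['label'] == 'ham']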

Result:

Figure 5-2

Figure 5-3

After separating the dataset, visualize it.

Data visualization aims above all to convey and communicate information clearly and effectively by graphical means. This does not mean that visualization must be dull in order to be functional, or extremely complex in order to look dazzling. To convey ideas effectively, aesthetic form and function need to go hand in hand, offering insight into fairly sparse yet complex datasets by communicating their key aspects intuitively. Designers, however, often fail to balance design and function, producing flashy visualizations that miss their main purpose: conveying and communicating information.

Data visualization is closely related to infographics, information visualization, scientific visualization, and statistical graphics. It is currently an extremely active and important area in research, teaching, and development. The term "data visualization" unifies the mature field of scientific visualization with the younger field of information visualization.

The basic idea of data visualization is to represent every data item in a database as a single graphical element, with large datasets forming a data image; each attribute of the data is represented as a dimension of multidimensional data, so the data can be observed from different dimensions and thus examined and analyzed in greater depth.

The visualization code is as follows:

Figure 5-4

Figure 5-5

Figure 5-6
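
A sketch of how the spam word frequencies might be plotted (the stop-word filtering and the top-20 cutoff are assumptions):

from collections import Counter
import matplotlib.pyplot as plt
from nltk.corpus import stopwords

# Tokenize the spam messages; drop stop words and non-alphabetic tokens
sw = set(stopwords.words('english'))
words = [w for w in ' '.join(spam_data['content']).lower().split()
         if w.isalpha() and w not in sw]

top = Counter(words).most_common(20)   # the 20 most frequent words
plt.bar([w for w, c in top], [c for w, c in top])
plt.xticks(rotation=90)
plt.title('Top words in spam messages')
plt.tight_layout()
plt.show()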

As the figure above shows, "call" is the most frequent word in spam messages, with "free" in second place.

VI. Predicting New Data

Finally, make predictions on message content to judge whether a message is spam.

Big data prediction is the core application of big data, extending prediction in the traditional sense to "nowcasting." Its advantage is that it turns a very difficult prediction problem into a relatively simple description problem, something traditional small datasets cannot achieve. From a predictive standpoint, the results of big data prediction are not merely simple, objective conclusions for handling day-to-day business; they can also support business decision-making.

Figure 6-1
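
A sketch of how a new message could be classified, reusing the fitted vectorizer, TF-IDF transformer, and classifier from the sketches above (the helper name and sample text are assumptions):

def predict_message(text):
    counts = cv.transform([text])        # bag-of-words counts for the new message
    features = tfidf.transform(counts)   # TF-IDF weighting with the fitted weights
    return 'spam' if clf.predict(features)[0] == 1 else 'ham'

print(predict_message('Free entry! Call now to claim your prize'))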

VII. Drawing a Word Cloud

(1) First, using Python

A word cloud is a visual representation of high-frequency keywords. Through the combination of text, color, and shape, it produces a striking visual effect while conveying valuable information.

A word cloud visually highlights the "keywords" that occur most frequently in a text by forming a "keyword cloud" or "keyword rendering."

A word cloud filters out the bulk of the textual information, letting readers grasp the gist of a text at a glance.

Figure 7-1

Figure 7-2
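
A sketch of the Python step, assuming the wordcloud package is applied to the spam messages:

from wordcloud import WordCloud
import matplotlib.pyplot as plt

text = ' '.join(spam_data['content'])
wc = WordCloud(width=800, height=400, background_color='white').generate(text)
plt.imshow(wc, interpolation='bilinear')
plt.axis('off')
plt.show()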

(2) Using WordArt

Figure 7-3

Figure 7-5

The result is clear and readable: the word frequencies are apparent at a glance, and the shape is visually appealing.