
Undergraduate Thesis of the School of Management Science and Engineering, Shandong University of Finance and Economics

School of Management Science and Engineering

Big Data Analysis and Prediction Techniques

Semester project

Title: Spam Message Classification

School: School of Management Science and Engineering

Major: Information Management and Information Systems

Class: Information Management Class 1

Student ID: 20190611636

Name: Zhang Zhiyan

Supervisor: Sun Lingyun

Shandong University of Finance and Economics, School of Management Science and Engineering

June 2021


Title: Spam Message Classification

Abstract

Spam messages are short messages sent to users without their consent that they do not wish to receive, or that they cannot refuse according to their own wishes. They mainly include the following: (1) commercial, advertising, and similar messages sent to users without their consent; (2) other messages that violate industry self-regulation norms.

Spam messages are rampant, seriously affecting people's normal lives, the image of operators, and even social stability. For example, fake base stations can send messages to 100,000 phones within a three-kilometer radius. Now users can use mobile security software to intercept such messages.

This assignment involves classifying, predicting, and visualizing a text message dataset based on the Naive Bayes model. Through experiments, understand the Bayesian probability formula and grasp the principles of the Bayesian classification algorithm.

Master the processing of unstructured text, master the use of Naive Bayes to classify text messages.

Keywords: spam messages, Naive Bayes model, unstructured text, data visualization, data prediction



I. Problem Description

(1) Problem Overview

(2) Data Overview

(3) Prediction

(4) Flowchart

II. Data Preprocessing

(1) Data Overview

(2) Data Preprocessing

1. Import the dataset

2. Import the data

3. Handle missing values

4. Divide the data into independent and dependent variables

5. Character encoding

6. Split the dataset

7. Convert the data format

8. Text vectorization

III. Building the Model

IV. Evaluating the Model

V. Visualizing Word Frequency

VI. Predicting New Data

VII. Drawing a Word Cloud

(1) Using Python

(2) Using WordArt

Problem Description

Problem Overview

Implement spam classification using Naive Bayes.

Spam messages refer to text messages sent to users without their consent, which users do not wish to receive, or text messages that users cannot refuse to receive according to their own wishes. They mainly include the following attributes: (1) Commercial or advertising text messages sent to users without their consent; (2) Other text messages that violate industry self-regulatory norms.

This experiment uses a provided spam SMS dataset with 5,572 rows and 2 columns, containing a large number of messages; the task is to classify and predict the messages and to visualize the data. The aims of the experiment are to understand the Bayesian probability formula, master the principles of the Bayesian classification algorithm, master the processing of unstructured text, and master the use of Naive Bayes to classify SMS messages.

Data Overview

The dataset has 5,572 rows and 2 columns. The first column, label, is the tag of the message: ham marks a normal message and spam marks a junk message. The second column, content, holds the message text, which is unstructured.


Predict the content of a text message and determine whether it is spam: text can be input and classified as either spam or a normal message. The .predict method from Python's sklearn is used. The general steps are: create a classifier object, fit it on the training data, use the fitted model to predict labels, and output the prediction results.
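The steps just listed can be sketched with a sklearn Pipeline; the toy messages and labels here are illustrative stand-ins for the real dataset.

```python
# Minimal sketch of the workflow described above:
# create the classifier -> fit on training data -> predict -> output the result.
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

messages = ["win a free prize now", "are we meeting for lunch",
            "free cash offer inside", "see you at home tonight"]
labels = ["spam", "ham", "spam", "ham"]

model = Pipeline([("vectorize", CountVectorizer()),   # text -> count vectors
                  ("classify", MultinomialNB())])     # Naive Bayes classifier
model.fit(messages, labels)                           # fit on the training data
pred = model.predict(["free prize waiting"])          # predict a new message
print(pred)                                           # -> ['spam']
```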


First, perform data preprocessing, then build the model, followed by model evaluation, data visualization, predicting new data, and finally drawing a word cloud.

Figure 1-1

Data preprocessing

(1) Data Overview

Figure 2-1

From the above figure, the dataset has 2 columns. The first column is the label, which is the tag of the message. There are 2 types of tags: the first type is ham, which means normal message, and the second type is spam, which means junk message. The second column of the dataset is the content, which is the message content. The message content is unstructured text.

(2) Data Preprocessing

Import dataset

First, import the library

Figure 2-2


NLTK is a leading platform for building Python programs that work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text-processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning. NLTK is a well-known natural language processing library for Python that ships with its own corpora and part-of-speech resources, and has built-in support for classification, tokenization, and more. It has been called "a wonderful tool for teaching, and working in, computational linguistics using Python" and "an amazing library to play with natural language."


Figure 2-3

2. Manual download

Figure 2-4

Figure 2-5

Import data

Figure 2-6

From the above figure, it can be seen that the dataset has 5572 rows and 2 columns.

Handling missing values

Figure 2-7

Execute code block.

By checking the value of the null_stat variable, it was found that there is no missing data.
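The missing-value check can be sketched as follows; the DataFrame here is a tiny stand-in for the real dataset, and null_stat follows the variable name mentioned above.

```python
# Sketch of the missing-value check: isnull().sum() counts missing
# entries per column; all zeros means no missing data.
import pandas as pd

df = pd.DataFrame({
    "label": ["ham", "spam", "ham"],
    "content": ["ok see you", "free prize!!", "on my way"],
})
null_stat = df.isnull().sum()  # per-column count of missing values
print(null_stat)
```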

Divide the data into independent variables and dependent variables

The dataset now has 2 columns, with the content column as the independent variable and the label column as the dependent variable.

Figure 2-8
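A sketch of this split into variables, using the column names from the data description (the two-row DataFrame is illustrative):

```python
# Separate the content column (independent variable X) from the
# label column (dependent variable y).
import pandas as pd

df = pd.DataFrame({
    "label": ["ham", "spam"],
    "content": ["see you soon", "win cash now"],
})
X = df["content"]  # message text
y = df["label"]    # target class
print(list(y))
```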

Character encoding

Since the string labels spam and ham cannot be used directly by the model, y is label-encoded.

Figure 2-9

Execute code block.

After the code is executed, the ham and spam labels stored in y become 0 and 1, respectively.
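This encoding step can be sketched with sklearn's LabelEncoder, which assigns integers in alphabetical order, so ham becomes 0 and spam becomes 1:

```python
# Encode the string labels as integers for the model.
from sklearn.preprocessing import LabelEncoder

y = ["ham", "spam", "ham"]
encoder = LabelEncoder()
y_encoded = encoder.fit_transform(y)  # alphabetical: ham -> 0, spam -> 1
print(y_encoded)                      # -> [0 1 0]
```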

Splitting the dataset

By using sklearn, split the dataset into a training set and a test set.

data: the dataset to be split

random_state: the random seed, ensuring the same split is produced on every run

test_size: the proportion of the data assigned to the test set (an 8:2 train/test split is common)

Figure 2-10
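The split described above can be sketched as follows; test_size=0.2 sends 20% of the samples to the test set, and random_state fixes the shuffle.

```python
# 8:2 train/test split with a fixed random seed for reproducibility.
from sklearn.model_selection import train_test_split

data = list(range(10))
labels = [0, 1] * 5
X_train, X_test, y_train, y_test = train_test_split(
    data, labels, test_size=0.2, random_state=42)
print(len(X_train), len(X_test))  # -> 8 2
```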

Training set and test set

Training set: Data samples used for model fitting, used to adjust parameters in the network.

Test set: used to evaluate the generalization ability of the final model. It must not be used as a basis for algorithm-related choices such as parameter tuning or feature selection; its role appears only in the final testing stage.

For example:

Training set-------Student's textbook; students grasp knowledge based on the content in the textbook.

Validation set-------Homework; through homework, one can understand the learning situation of different students and the speed of their progress.

Test set-------Exam; the questions on the exam are ones that have never been seen before, testing the students' ability to infer and apply knowledge.

Convert data format

Convert the data into a list type.

Figure 2-11

Text vectorization

Since the model cannot process raw text directly, the text must first be converted into numerical values.

First, use the bag-of-words model for text vectorization.

The raw data for text analysis cannot be directly fed to the algorithm. This raw data is a set of symbols, as most algorithms expect input in the form of fixed-length numerical feature vectors rather than text files of varying lengths. Therefore, text vectorization is the foundation of text analysis.

Text vectorization (also known as "word vector model" or "vector space model") refers to representing text as real number vectors that can be recognized by computers. Depending on the granularity, text features can be represented at the levels of characters, words, sentences, or paragraphs. The methods of text vectorization are mainly divided into discrete representation and distributed representation.

Using sklearn to achieve text vectorization, the code is as follows:

Figure 2-12

After execution, it was found that there were 7,809 unique words.

Figure 2-13
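The bag-of-words step can be sketched with CountVectorizer on two toy messages (the real run on the full dataset produced the 7,809-word vocabulary mentioned above):

```python
# Bag-of-words: build the vocabulary and turn each message into a
# fixed-length vector of word counts.
from sklearn.feature_extraction.text import CountVectorizer

msgs = ["free prize call now", "call me when free"]
cv = CountVectorizer()
counts = cv.fit_transform(msgs)
print(sorted(cv.vocabulary_))  # the unique words found
print(counts.toarray())        # one count vector per message
```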

Next, in order to make text vectorization more accurate, TF-IDF is used again for text vectorization. Here, TF-IDF preprocessing is performed using sklearn, and the code is as follows:

Figure 2-14

TF-IDF stands for Term Frequency–Inverse Document Frequency. It consists of two parts: TF and IDF.

TF: term frequency, the count of occurrences of each word in the text.

IDF: inverse document frequency, which captures the fact that some words appear frequently yet matter less than others. In short, IDF reflects how commonly a word appears across all texts: if a word appears in many texts, its IDF value should be low; conversely, if a word appears in relatively few texts (for example, specialized terminology), its IDF value should be high. In the extreme case where a word appears in every text, its IDF value should be 0. The formula for IDF is:

IDF(x) = log( N / N(x) )

where N is the total number of texts in the corpus and N(x) is the number of texts in the corpus that contain the word x.
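The TF-IDF step can be sketched on top of the bag-of-words counts; note that sklearn's TfidfTransformer by default uses a smoothed variant of the formula above, ln((N+1)/(N(x)+1)) + 1, rather than plain log(N/N(x)).

```python
# TF-IDF weighting of bag-of-words counts: words that appear in every
# message (like "call" below) receive the lowest IDF.
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

msgs = ["free prize call now", "call me when free", "call tomorrow"]
counts = CountVectorizer().fit_transform(msgs)
tfidf = TfidfTransformer().fit_transform(counts)  # smoothed IDF by default
print(tfidf.shape)  # (messages, vocabulary size)
```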

Building a model

Build a Naive Bayes model.

Note: Since this is used for text classification, the MultinomialNB class is chosen instead of the GaussianNB class.

Figure 3-1

Naive Bayes formula:

P(Y|X) = P(X|Y) P(Y) / P(X)

From the perspective of machine learning, X is understood as "having a certain feature" and Y as the "category label." In the simplest binary classification problem (a yes-or-no decision), Y is understood as the label "belongs to a certain category," so Bayes' formula becomes the following form:

P(belongs to a certain category | has a certain feature) = P(has a certain feature | belongs to a certain category) × P(belongs to a certain category) / P(has a certain feature)

In spam message recognition:

P(spam | message content) = P(message content | spam) × P(spam) / P(message content)

Treating the message content as its individual words and assuming, naively, that the words are conditionally independent given the class, substituting and simplifying gives:

P(spam | w1, w2, …, wn) ∝ P(spam) × P(w1 | spam) × P(w2 | spam) × … × P(wn | spam)
MultinomialNB assumes that the conditional probability of each feature follows a multinomial distribution, as in the following formula:

P(Xj = xjl | Y = Ck) = (m_kjl + λ) / (m_k + n·λ)

where P(Xj = xjl | Y = Ck) is the conditional probability of the l-th value of the j-th feature in the k-th category, m_kjl is the number of class-k training samples whose j-th feature takes the value xjl, m_k is the number of training samples belonging to the k-th class, and n is the number of values the feature can take. λ is a constant greater than 0, often taken as 1, which gives Laplace smoothing; other values are also possible.

alpha: optional float parameter, default 1.0; this is the Laplace smoothing term, i.e., λ in the formula above. Setting it to 0 disables smoothing.

fit_prior: optional Boolean parameter, default True; it indicates whether to learn class prior probabilities. If False, all classes are given the same prior. Otherwise, the prior can be supplied through the third parameter, class_prior; if class_prior is omitted, MultinomialNB estimates the priors from the training samples. In that case the prior probability is:

P(Y = Ck) = m_k / m

where m is the total number of training samples and m_k is the number of training samples whose output is the k-th category.

Evaluating the Model

After building the model, it is necessary to evaluate its accuracy. The code is as follows:

Figure 4-1

Execute code block.

The result after execution is as shown in the figure

Figure 4-2

Figure 4-3
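The evaluation step can be sketched with sklearn's metrics; the y_test and y_pred lists here are illustrative stand-ins for the real test labels and model predictions.

```python
# Accuracy and confusion matrix on a toy set of true vs. predicted labels.
from sklearn.metrics import accuracy_score, confusion_matrix

y_test = [0, 0, 1, 1, 0]   # true labels (0 = ham, 1 = spam)
y_pred = [0, 0, 1, 0, 0]   # model predictions
acc = accuracy_score(y_test, y_pred)
print(acc)                             # -> 0.8 (4 of 5 correct)
print(confusion_matrix(y_test, y_pred))  # rows: true class, cols: predicted
```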

Model evaluation is used to assess the quality of the model.

The error of the model on the training set is usually referred to as training error or empirical error, while the error on new samples is called generalization error. Clearly, the goal of machine learning is to obtain a learner with a small generalization error. However, in practical applications, new samples are unknown, so we can only minimize the training error as much as possible. Therefore, to obtain a model with a small generalization error, the dataset is typically split into mutually independent training, validation, and test datasets when constructing a machine model. During the training process, the validation dataset is used to evaluate the model and update hyperparameters accordingly, and after training, the test dataset is used to assess the performance of the final trained model.

Model comparisons fall into three cases:

comparison of models within a single training process;

comparison of models across multiple training runs;

comparison of models built with different algorithms.

Accuracy: accuracy refers to the proportion of correctly classified samples among all samples, with the following formula:

Accuracy = (TP + TN) / (TP + TN + FP + FN)
True Positive (TP): the number of positive samples predicted as positive.

True Negative (TN): the number of negative samples predicted as negative.

False Positive (FP): the number of negative samples predicted as positive.

False Negative (FN): the number of positive samples predicted as negative.

Positive sample: a sample whose true label is the class of interest (label 1).

Negative sample: a sample whose true label is the other class (label 0).

Accuracy is the simplest and most intuitive evaluation metric in classification problems, but it has obvious flaws. For example, when negative samples account for 99%, a classifier that predicts all samples as negative can also achieve 99% accuracy. Therefore, when the sample proportions of different classes are very imbalanced, the samples with a larger proportion have a greater impact on accuracy.
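The flaw described above can be shown numerically: with 99% negative samples, a classifier that predicts everything as negative still scores 99% accuracy.

```python
# Accuracy is misleading on imbalanced data: the "always negative"
# classifier gets 99% accuracy while catching zero positives.
from sklearn.metrics import accuracy_score

y_true = [1] * 1 + [0] * 99  # 1% positive, 99% negative
y_pred = [0] * 100           # predict everything as negative
acc = accuracy_score(y_true, y_pred)
print(acc)                   # -> 0.99
```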

Visualize word frequency

Visualizing the high-frequency words in spam messages and normal messages separately helps to understand which words frequently appear in spam messages and which words frequently appear in normal messages.

Before visualization, the data needs to be classified.

Based on the labels spam and ham in the dataset, create two datasets to store spam messages and normal messages respectively. The code is as follows:

Figure 5-1


Figure 5-2

Figure 5-3
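The separation by label can be sketched with boolean indexing on the label column (column names follow the data description; the three-row DataFrame is illustrative):

```python
# Split the dataset into one DataFrame per class before visualization.
import pandas as pd

df = pd.DataFrame({
    "label": ["ham", "spam", "ham"],
    "content": ["see you", "free prize", "on my way"],
})
spam_df = df[df["label"] == "spam"]  # junk messages only
ham_df = df[df["label"] == "ham"]    # normal messages only
print(len(spam_df), len(ham_df))     # -> 1 2
```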

Visualize after categorizing the dataset.

Data visualization primarily aims to convey and communicate information clearly and effectively through graphical means. However, this does not necessarily mean that data visualization must be dull and tedious to achieve its functional purpose, or extremely complex to appear visually stunning. To effectively convey ideas and concepts, aesthetic form and function need to go hand in hand, providing deep insights into relatively sparse and complex datasets by intuitively conveying key aspects and features. However, designers often fail to strike a good balance between design and functionality, resulting in flashy but impractical forms of data visualization that do not achieve their main purpose, which is to convey and communicate information.

Data visualization is closely related to infographics, information visualization, scientific visualization, and statistical graphics. Currently, in the fields of research, teaching, and development, data visualization is an extremely active and critical aspect. The term "data visualization" has achieved the unification of the mature field of scientific visualization and the relatively younger field of information visualization.

The basic idea of data visualization technology is to represent each data item in the database as a single graphical element. A large dataset forms a data image, and the various attribute values of the data are represented in the form of multidimensional data, allowing observation of the data from different dimensions for deeper observation and analysis.

The visualization code is as follows:

Figure 5-4

Figure 5-5

Figure 5-6

According to the above figure, the word "call" appears the most in spam messages, with "free" coming in second.
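The tally behind such a chart can be sketched with a Counter; the toy spam messages here are chosen so that "call" leads and "free" comes second, as in the observation above.

```python
# Count word frequencies in the spam messages; the top entries can then
# be drawn as a bar chart.
from collections import Counter

spam_msgs = ["call now for a free prize", "free call offer", "call to claim"]
words = " ".join(spam_msgs).split()
freq = Counter(words)
print(freq.most_common(2))  # -> [('call', 3), ('free', 2)]
```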

Predict new data

Finally, predict the content of the text message to determine whether it is spam.

Big data prediction is the core application of big data, expanding traditional prediction to "current measurement." The advantage of big data prediction lies in transforming a very difficult prediction problem into a relatively simple descriptive problem, which is something that traditional small datasets cannot achieve. From a predictive perspective, the results obtained from big data prediction are not only simple and objective conclusions for handling real-world business but also can be used to aid in business decision-making.

Figure 6-1
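Prediction on a brand-new message can be sketched as follows; the key point is that the new text must pass through the same fitted vectorizer as the training data before .predict is called (messages and labels here are illustrative).

```python
# Predict whether an unseen message is spam (1) or ham (0).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

train = ["free prize call now", "are we still on for lunch",
         "free cash offer call", "see you at home tonight"]
y = [1, 0, 1, 0]  # 1 = spam, 0 = ham

cv = CountVectorizer()
clf = MultinomialNB().fit(cv.fit_transform(train), y)

new_msg = ["claim your free prize"]
pred = clf.predict(cv.transform(new_msg))  # transform, do NOT re-fit
print(pred)                                # -> [1]
```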

Generate a word cloud

(1) Using Python

A word cloud is a visual representation used to display high-frequency keywords. By combining text, colors, and graphics, it creates a visually impactful effect and can convey valuable information.

A word cloud is the visual highlighting of "keywords" that appear frequently in online texts by forming a "keyword cloud" or "keyword rendering."

The word cloud filters out a large amount of text information, allowing web page viewers to grasp the main idea of the text at a glance.

Figure 7-1

Figure 7-2

(2) Using WordArt

Figure 7-3

Figure 7-5

The result is clear and concise: word frequencies can be seen at a glance, and the image is attractive.