Table 3-4 Daily measured AQI and primary pollutant at monitoring site A, August 25 to 28, 2020 (calculated from daily measured data)
Table 3-5 lists the daily IAQI of each pollutant at monitoring site A from August 25 to 28, 2020, calculated from the daily averages of the hourly measured data.
Table 3-5 Daily measured IAQI and air quality level at monitoring site A, August 25 to 28, 2020 (calculated from hourly measured data)
The resulting AQI and primary pollutant at monitoring site A for August 25 to 28, 2020, based on the averages of the hourly measured data, are shown in Table 3-6.
Table 3-6 Daily measured AQI and primary pollutant at monitoring site A, August 25 to 28, 2020 (calculated from hourly measured data)
The results show that the AQI calculated from the daily measured data and the AQI calculated from the averages of the hourly measured data are very close, and that the primary pollutants are identical.
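The step from pollutant concentrations to IAQI, AQI, and primary pollutant can be illustrated with a minimal sketch. It assumes the piecewise-linear IAQI definition and the 24-hour PM2.5 breakpoints of the Chinese standard HJ 633-2012; the function names are hypothetical, the other pollutants would need their own breakpoint tables, and integer rounding of the IAQI is left out.

```python
# Sketch only: IAQI by linear interpolation between breakpoints, AQI as the
# largest IAQI. Breakpoints are the 24-hour PM2.5 limits assumed from
# HJ 633-2012; function names are illustrative, not from the paper.
IAQI_LEVELS = [0, 50, 100, 150, 200, 300, 400, 500]
PM25_BREAKPOINTS = [0, 35, 75, 115, 150, 250, 350, 500]  # ug/m^3


def iaqi(concentration, breakpoints, levels=IAQI_LEVELS):
    """Interpolate linearly between the two breakpoints that bracket the value."""
    if concentration < 0:
        raise ValueError("concentration must be non-negative")
    if concentration >= breakpoints[-1]:
        return float(levels[-1])  # cap at the top of the scale
    for k in range(len(breakpoints) - 1):
        bp_lo, bp_hi = breakpoints[k], breakpoints[k + 1]
        if bp_lo <= concentration <= bp_hi:
            frac = (concentration - bp_lo) / (bp_hi - bp_lo)
            return levels[k] + frac * (levels[k + 1] - levels[k])


def aqi_and_primary(iaqi_by_pollutant):
    """AQI is the largest IAQI; a primary pollutant is named only if AQI > 50."""
    aqi = max(iaqi_by_pollutant.values())
    if aqi > 50:
        primary = [p for p, v in iaqi_by_pollutant.items() if v == aqi]
    else:
        primary = []
    return aqi, primary
```

For example, `aqi_and_primary({"PM2.5": iaqi(55, PM25_BREAKPOINTS), "O3": 60.0})` names PM2.5 as the primary pollutant because its IAQI is the larger of the two.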
4. Analysis and Solution of Problem 2
4.1 Analysis of Problem 2
Building an air quality forecasting model requires measurements of the relevant variables that reflect actual conditions and are accurate to within a controlled margin of error. In practice, however, problems inevitably arise while monitoring-station equipment measures, records, and exports data, so the data finally obtained (the raw data) fall short of the clean data one would like to have. For example, the equipment operates under varied and fairly complex field conditions, which leaves a certain amount of bad data in the raw records, including continuous or intermittent missing values and value drift (readings that are too high or too low). Scientific and effective preprocessing of the bad data is therefore decisive for building the air quality forecasting model.
For the data collected at different sites by the real-time database, some monitoring stations have problems: some stations contain data for only part of the time period, some have entirely or partially null data, and some have values that exceed the permitted limits. The raw data can therefore be used only after processing. The meteorological data collected by the station equipment, namely 2 m near-surface temperature (°C), surface temperature (K), humidity (%), 10 m near-surface wind speed (m/s), and atmospheric pressure (kPa), generally fluctuate, but they can be regarded as unchanged over a sufficiently short time span.
4.2 Solving Problem 2
4.2.1 Data Preprocessing
The raw concentration data for carbon monoxide (CO), sulfur dioxide ($\mathrm{SO}_2$), nitrogen dioxide ($\mathrm{NO}_2$), ozone ($\mathrm{O}_3$), particulate matter with a diameter below $10\,\mu\mathrm{m}$ ($\mathrm{PM}_{10}$), and particulate matter with a diameter below $2.5\,\mu\mathrm{m}$ ($\mathrm{PM}_{2.5}$) are visualized in Figure 4-1.
Figure 4-1 Visualization of the raw pollutant concentration data
The raw data show three kinds of problems. First, because of equipment commissioning and maintenance at the monitoring stations, the measured data are partly or wholly missing over some continuous periods. Second, under the influence of incidental factors at or near a station, the measured value for a particular hour (or day) deviates from the normal distribution of the data. Third, the problem statement provides five monitored meteorological indicators (temperature, humidity, air pressure, wind direction, and wind speed), but because the equipment differs between stations, some indicators are unavailable at certain stations.
Based on the above analysis, each type of bad data in the raw data of Annex 1 is analyzed and handled in turn. Outliers in the raw data of Annex 1 should be removed, and the Pauta criterion ($3\sigma$ criterion) can serve as the removal rule. The raw data are therefore preprocessed for the following anomalies:
(1) Missing values recorded as null
During data collection, some sites show continuous or intermittent missing values recorded as null. These are filled mainly with the average of the values within one hour before and after the missing value.
(2) Values outside the permitted limits (below the minimum or above the maximum)
In practice a variable may fall outside the minimum-maximum range required by the appendix. Such out-of-range values are not used as recorded: when the data from different stations are averaged to obtain the final value, each out-of-range value is replaced by the minimum or maximum limit.
(3) Values outside the $3\sigma$ interval
It is reasonable to assume that the variables measured by the station equipment evolve continuously, without jumps. "Jumps" of this kind nevertheless appear in the raw data, probably because of problems in the acquisition process. They are therefore removed: when a station's variable data are averaged to obtain the final value, data outside the $3\sigma$ range are ignored.
(4) Inconsistent units
Units are inconsistent across the annexes. For example, air pressure is given in kPa in the primary forecast data but in MBar (millibar) in the measured data; the conversion is $1\ \mathrm{kPa} = 10\ \mathrm{mbar}$.
The processed data and the daily AQI values are given in Annexes 1 and 2, and the processed data are visualized in Figure 4-2.
Figure 4-2 Visualization of the processed pollutant concentration data
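The four preprocessing rules (1) to (4) of this section can be sketched as follows. This is only an illustration under stated assumptions: each variable is held as a plain list with `None` marking a missing value, the function names are hypothetical, the limits passed to `clip_to_limits` would come from the appendix, and the 3-sigma rule uses the population standard deviation of the series itself.

```python
from statistics import mean, pstdev


def fill_missing(series):
    """Rule (1): replace None by the mean of the valid values one step before and after."""
    filled = list(series)
    for i, value in enumerate(series):
        if value is None:
            neighbours = [series[j] for j in (i - 1, i + 1)
                          if 0 <= j < len(series) and series[j] is not None]
            if neighbours:  # a gap with no valid neighbour is left as it is
                filled[i] = mean(neighbours)
    return filled


def clip_to_limits(series, lower, upper):
    """Rule (2): values outside [lower, upper] are set to the nearest limit."""
    return [v if v is None else min(max(v, lower), upper) for v in series]


def drop_3sigma_outliers(series):
    """Rule (3): mark values farther than 3 sigma from the mean as missing."""
    valid = [v for v in series if v is not None]
    mu, sigma = mean(valid), pstdev(valid)
    return [v if v is None or abs(v - mu) <= 3 * sigma else None
            for v in series]


def kpa_to_mbar(series):
    """Rule (4): convert pressure from kPa to mbar (1 kPa = 10 mbar)."""
    return [v if v is None else v * 10.0 for v in series]
```

One plausible order is to convert units first, then clip to the limits, then drop 3-sigma outliers, and finally fill the remaining gaps, so that removed outliers are also interpolated; the paper does not state the order it used.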
4.2.2 Model Principle and Framework
k-means clustering algorithm model
Given a data set $X=\{X_1, X_2, \ldots, X_n\}$ of $n$ data points, where $X_i \in \mathbb{R}^d$, and the number $K$ of subsets to generate, the K-Means clustering algorithm organizes the data objects into $K$ partitions $C=\{c_k,\ k=1,2,\ldots,K\}$. Each partition represents a class $c_k$, and each class $c_k$ has a class center $\mu_k$. With the Euclidean distance as the similarity and distance criterion, the sum of squared distances from the points in a class to its center $\mu_k$ is

$$J(c_k)=\sum_{X_i \in c_k}\left\|X_i-\mu_k\right\|^2 .$$

The clustering objective is to minimize the total sum of squared distances over all classes,

$$J(C)=\sum_{k=1}^{K} J(c_k)=\sum_{k=1}^{K}\sum_{i=1}^{n} d_{ki}\left\|X_i-\mu_k\right\|^2 ,$$

where

$$d_{ki}=\begin{cases}1, & X_i \in c_k \\ 0, & X_i \notin c_k\end{cases}$$
By the least-squares method and the Lagrange principle, the cluster center $\mu_k$ should be taken as the mean of the data points in class $c_k$.
The K-Means algorithm starts from an initial partition into K classes and then reassigns the data points among the classes so as to reduce the total sum of squared distances. Because this total tends to decrease as the number of classes K increases, its minimum is meaningful only for a fixed K. The following definitions are used:
(1) Distance between two data objects
The Euclidean distance is used:

$$d(x_i, x_j)=\sqrt{\sum_{l=1}^{d}\left(x_{il}-x_{jl}\right)^2}$$

where $x_{il}$ is the $l$-th attribute of object $x_i$.
(2) Criterion function E
For the K-means algorithm, the criterion function E, that is, the sum of squared errors (SSE), is normally used as the objective function that measures clustering quality:

$$E=\sum_{i=1}^{k}\sum_{x \in C_i} d\left(x, m_i\right)^2$$

where $C_i$ is the $i$-th cluster, $m_i$ is its center, and $d(\cdot,\cdot)$ denotes the distance between two objects.
For the same value of k, a smaller SSE means that the objects within each cluster are more concentrated. Across different values of k, a larger k should give a smaller SSE.
Implementation steps of the k-means clustering algorithm
First, k objects are selected at random, each representing the initial mean (center) of a cluster. Each remaining object is assigned to the nearest (most similar) cluster according to its distance from the cluster centers, after which the mean of each cluster is recomputed to give the updated center. This is repeated until the criterion function converges. The squared-error criterion is normally used, that is, the sum over all clusters and all objects of the squared distance from each object to its cluster center; this criterion tends to make the k resulting clusters as compact and as well separated as possible.
Steps:
Input: the number of clusters k and a database X containing n data objects.
Output: k clusters that satisfy the minimum-variance criterion.
Procedure:
Step 1 Arbitrarily select k of the n data objects as the initial cluster centers.
Step 2 Assign each object to the most similar cluster, based on the mean of the objects in each cluster.
Step 3 Update the cluster means, that is, compute the mean of the objects in each cluster.
Step 4 Repeat Steps 2 and 3 until no cluster changes any more [4].
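Steps 1 to 4 above can be sketched in a few lines of plain Python. The paper's own implementation is in Matlab and is not shown, so this is an illustrative version only; the function names, the fixed random seed, the iteration cap `max_iter`, and the handling of empty clusters (the old center is kept) are all assumptions.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, max_iter=100, seed=0):
    """Return (centers, labels, sse) for a list of equal-length tuples."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # Step 1: k objects as initial centers
    labels = None
    for _ in range(max_iter):
        # Step 2: assign every object to its nearest center
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centers[c]))
            for p in points
        ]
        if new_labels == labels:  # Step 4: stop when no assignment changes
            break
        labels = new_labels
        # Step 3: recompute each center as the mean of its members
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # assumption: an empty cluster keeps its old center
                centers[c] = tuple(sum(col) / len(members)
                                   for col in zip(*members))
    # Criterion function E (SSE) for the final partition
    sse = sum(squared_distance(p, centers[lab])
              for p, lab in zip(points, labels))
    return centers, labels, sse
```

Because the result depends on the random initial centers, it is common to run the procedure several times and keep the partition with the smallest SSE.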
4.2.3 Clustering Results
The results obtained with the k-means algorithm in Matlab are as follows.
With pollutant emissions held constant, the AQI of an area falls when meteorological conditions favor the diffusion or deposition of pollutants and rises otherwise. High temperature, low pressure, low humidity, and high wind speed all favor the removal and diffusion of pollutants; a change in any one factor changes ambient air quality, and different pollutants respond to meteorological conditions to different degrees. Therefore, according to how strongly each meteorological condition affects pollutant concentrations, cluster analysis was applied to each meteorological variable together with the corresponding calculated AQI values in order to partition the meteorological conditions. Cluster centers are reported to three decimal places:
1. Temperature, two classes: high temperature (cluster center 27.691 °C) and low temperature (cluster center 17.921 °C).
2. Humidity, two classes: high humidity (cluster center 72.988%) and low humidity (cluster center 45.970%).
3. Air pressure, two classes: high pressure (cluster center 1016.784 MBar) and low pressure (cluster center 1006.346 MBar).
4. Wind speed, three classes: gentle breeze (cluster center 2.226 m/s), light breeze (cluster center 1.530 m/s), and light air (cluster center 0.994 m/s).
By analyzing the clustering results, the meteorological conditions can be classified as follows:
1. By temperature, two classes. The first class is high temperature, with a mean of 27.691 °C, which favors the diffusion of pollutants: warm air in the lower layer produces strong air movement and vigorous low-level turbulence, which carries pollutants upward, lowers near-surface concentrations, and so lowers the air quality index; the AQI ranges over (0, 60). The second class is low temperature, with a mean of 17.921 °C, which hinders diffusion, that is, pollutant concentrations rise as ambient temperature falls; the AQI ranges over (60, 150).
2. By humidity, two classes. The first class is high humidity, with a mean of 72.988%. Water vapor adsorbs pollutants; especially during precipitation, the high water vapor content of the air increases the mass of $\mathrm{PM}_{10}$ and $\mathrm{PM}_{2.5}$ particles so that suspended particulate matter settles to the ground, which lowers $\mathrm{PM}_{10}$ and $\mathrm{PM}_{2.5}$ concentrations. A high-humidity environment therefore promotes the dispersal of pollutants and lowers their concentration; the AQI ranges over (0, 65). The second class is low humidity, with a mean of 45.970%, which hinders dispersal, so pollutant concentrations are higher; the AQI ranges over (65, 150).
3. By air pressure, two classes. The first class is high pressure, with a mean of 1016.784 MBar. Atmospheric pressure is positively correlated with pollutant concentration: under high pressure the atmosphere is stably stratified and the air sinks, which hinders the vertical diffusion of pollutants, so concentrations build up; the AQI ranges over (55, 150). The second class is low pressure, with a mean of 1006.346 MBar; rising air favors the diffusion of pollutants, and the AQI ranges over (0, 55).
4. By wind speed, three classes. The first class is gentle breeze, with a mean wind speed of 2.226 m/s. Wind favors the horizontal diffusion of pollutants, and the higher the wind speed, the stronger the horizontal diffusion and the lower the pollutant concentration; the AQI ranges over (0, 50). The second class is light breeze, with a mean wind speed of 1.530 m/s; pollutant concentrations are higher, and the AQI ranges over (50, 100). The third class is light air, with a mean wind speed of 0.994 m/s; the wind speed is too low, mixing dominates over diffusion, and pollutants disperse poorly, so concentrations and the corresponding AQI are high, with the AQI ranging over (100, 150).
The clustering results for temperature, humidity, air pressure, and wind speed are shown in Figure 4-3.
Figure 4-3 Clustering results for temperature, humidity, air pressure, and wind speed
After each meteorological variable had been clustered separately against the AQI values, the data as a whole were also clustered. k-means requires the number of clusters to be specified in advance, and the appropriate number is not known beforehand. A maximum of 13 clusters was therefore set, the clustering was repeated for the candidate cluster numbers, and the results were evaluated with the cluster validity indices DB and SSE. Three clusters were finally selected, dividing the data into three classes; the cluster centers are listed in Table 4-1 (to three decimal places):
From the clustering results, the meteorological conditions can be divided into the following three classes:
Class 1: mean temperature 26.60 °C, mean humidity 56.876%, mean air pressure 1011.303 MBar, mean wind speed 1.575 m/s, and mean wind direction 79.178°. Under these conditions the mean concentrations of $\mathrm{SO}_2$, $\mathrm{NO}_2$, and $\mathrm{PM}_{10}$ are low, but the concentrations of $\mathrm{PM}_{2.5}$, $\mathrm{O}_3$, and CO are the highest of the three classes.
Class 2: mean temperature 25.100 °C, mean humidity 67.665%, air pressure 1010.3 MBar, mean wind speed 1.343 m/s, and mean wind direction 288.373°. Under these conditions the mean $\mathrm{PM}_{2.5}$ concentration is the highest of the three classes, and all other pollutant concentrations are intermediate.
Class 3: mean temperature 23.081 °C, mean humidity 72.649%, air pressure 1012.385 MBar, mean wind speed 1.360 m/s, and mean wind direction 50.188°. Under these conditions the $\mathrm{PM}_{10}$ concentration is the highest of the three classes, and all other pollutant concentrations are at their lowest levels.
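The selection of the cluster number described above rests on two validity measures. The sketch below computes both for a given partition, assuming that "DB" denotes the Davies-Bouldin index (the paper does not spell the name out); the function names are hypothetical and at least two non-coincident clusters are required. In a full selection loop, each candidate number of clusters up to 13 would be clustered and scored this way, with a lower Davies-Bouldin value indicating more compact, better separated clusters.

```python
import math


def centroid(members):
    """Component-wise mean of a non-empty list of tuples."""
    return tuple(sum(col) / len(members) for col in zip(*members))


def sse_and_db(points, labels):
    """Return (SSE, Davies-Bouldin index) for a labelled partition."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    centers = {lab: centroid(m) for lab, m in clusters.items()}

    # SSE: squared distance of every point to its own cluster center
    sse = sum(math.dist(p, centers[lab]) ** 2
              for p, lab in zip(points, labels))

    # Scatter of a cluster: mean distance of its members to the center
    scatter = {lab: sum(math.dist(p, centers[lab]) for p in m) / len(m)
               for lab, m in clusters.items()}

    # For each cluster take the worst (largest) similarity ratio to any
    # other cluster, then average those worst cases over all clusters
    worst = [
        max((scatter[i] + scatter[j]) / math.dist(centers[i], centers[j])
            for j in centers if j != i)
        for i in centers
    ]
    return sse, sum(worst) / len(worst)
```

SSE alone always falls as the number of clusters grows, which is why it is usually read together with an index such as Davies-Bouldin that also penalizes poorly separated clusters.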
5. Analysis and Solution of Problem 3
5.1 Analysis of Problem 3
An effective air quality forecasting system helps people obtain information about future pollutant concentrations, formulate corresponding prevention and control strategies, and forecast pollutant concentration levels accurately in advance. At present, the WRF-CMAQ modeling system is commonly used for air quality forecasting. The WRF-CMAQ model consists of two parts, WRF and CMAQ. WRF is a mesoscale numerical weather prediction system that supplies the meteorological field data CMAQ requires. CMAQ is a three-dimensional Eulerian atmospheric chemistry and transport modeling system: from the meteorological information provided by WRF and the pollutant emission inventory of the domain, it simulates the evolution of pollutants on the basis of physical and chemical reaction principles and produces forecasts for specific time points or periods [6].
However, because of uncertainty in the simulated meteorological fields and the emission inventory, and because the formation mechanisms of pollutants such as ozone are not fully understood, the results of the WRF-CMAQ forecast model are not satisfactory [7]. The concept of secondary modeling is therefore introduced: starting from the simulation results of a primary forecast model such as WRF-CMAQ, further data sources are combined in a second model to improve forecast accuracy. Actual meteorological conditions strongly affect air quality (for example, lower humidity favors ozone formation), and the variation of measured pollutant concentrations is a useful reference for air quality forecasting, so the meteorological and pollutant data obtained at air quality monitoring sites are used in the secondary modeling to optimize the forecast model.
Predicting the concentration of pollutants in the air is a complex problem. At present, the main models used for urban ambient air quality forecasting in China include multiple linear regression, artificial neural networks, NAQPMS, CAMx, WRF-Chem, and multi-model ensemble forecasting systems [8]. Studies in China and abroad show that neural networks forecast the temporal trend of air pollutant concentrations better than regression models and give better forecasting results. Neural network forecasting models are suited to cities where numerical air quality forecasting is currently difficult and where the accuracy of regression models does not meet requirements. This paper therefore builds a BP neural network air quality prediction model by secondary modeling on top of the WRF-CMAQ primary forecast results.
5.2 Solving Problem 3
5.2.1 Model Principle and Framework
BP neural network
The BP algorithm consists of two processes: forward computation of the data flow (forward propagation) and back propagation of the error signal. In forward propagation the signal travels from the input layer to the hidden layer to the output layer, and the state of the neurons in each layer affects only the neurons in the next layer. If the desired output is not obtained at the output layer, the algorithm switches to back propagation of the error signal. By alternating these two processes, gradient descent on the error function is carried out in the weight space, iteratively searching for a set of weight vectors that minimizes the network error function and thereby completing information extraction and memorization.
Figure 5-1 Topology of a three-layer neural network
Let the input layer of the BP network have $n$ nodes, the hidden layer $q$ nodes, and the output layer $m$ nodes; let the weights between the input and hidden layers be $v_{ki}$ and those between the hidden and output layers be $w_{jk}$, as shown in Figure 5-1. With hidden-layer transfer function $f_1(\cdot)$ and output-layer transfer function $f_2(\cdot)$, the output of hidden node $k$ is (with the threshold absorbed into the sum)

$$z_k=f_1\left(\sum_{i=0}^{n} v_{ki}\,x_i\right),\quad k=1,2,\ldots,q$$

and the output of output node $j$ is

$$y_j=f_2\left(\sum_{k=0}^{q} w_{jk}\,z_k\right),\quad j=1,2,\ldots,m$$

At this point the BP network has completed an approximate mapping from an n-dimensional space vector to an m-dimensional space [10].
Back propagation
(1) Define the error function
Let $P$ learning samples be given, denoted $x^1, x^2, \ldots, x^p, \ldots, x^P$. Feeding the $p$-th sample into the network gives the outputs $y_j^p$ $(j=1,2,\ldots,m)$. With a squared error function, the error of the $p$-th sample is

$$E_p=\frac{1}{2}\sum_{j=1}^{m}\left(t_j^p-y_j^p\right)^2$$

where $t_j^p$ is the desired output.
For the $P$ samples, the global error is

$$E=\frac{1}{2}\sum_{p=1}^{P}\sum_{j=1}^{m}\left(t_j^p-y_j^p\right)^2=\sum_{p=1}^{P}E_p$$

The cumulative-error BP algorithm adjusts $w_{jk}$ so that the global error $E$ decreases, that is,

$$\Delta w_{jk}=-\eta\,\frac{\partial E}{\partial w_{jk}}=-\eta\sum_{p=1}^{P}\frac{\partial E_p}{\partial w_{jk}}$$

where $\eta$ is the learning rate.
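The forward pass and the cumulative-error weight update can be made concrete with a small pure-Python sketch of a three-layer network. Several choices here are assumptions rather than details from the paper: a sigmoid hidden transfer function, a linear output transfer function, a learning rate `eta`, uniform initial weights in [-0.5, 0.5], and the class and method names themselves. Index 0 of every weight row holds the threshold, which is how the threshold is absorbed into the sum.

```python
import math
import random


def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))


class ThreeLayerBP:
    def __init__(self, n_in, n_hidden, n_out, seed=0):
        rng = random.Random(seed)
        # v[k][i]: input i -> hidden k; w[j][k]: hidden k -> output j
        self.v = [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)]
                  for _ in range(n_hidden)]
        self.w = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]
                  for _ in range(n_out)]

    def forward(self, x):
        xb = [1.0] + list(x)                      # x_0 = 1 carries the threshold
        z = [sigmoid(sum(vk[i] * xb[i] for i in range(len(xb))))
             for vk in self.v]
        zb = [1.0] + z                            # z_0 = 1 carries the threshold
        y = [sum(wj[k] * zb[k] for k in range(len(zb))) for wj in self.w]
        return xb, zb, y

    def global_error(self, samples):
        """E = 1/2 * sum over samples and outputs of (t - y)^2."""
        return 0.5 * sum((t_j - y_j) ** 2
                         for x, t in samples
                         for t_j, y_j in zip(t, self.forward(x)[2]))

    def train_step(self, samples, eta=0.05):
        """One cumulative-error update: sum gradients over all samples first."""
        grad_w = [[0.0] * len(row) for row in self.w]
        grad_v = [[0.0] * len(row) for row in self.v]
        for x, t in samples:
            xb, zb, y = self.forward(x)
            delta_out = [y_j - t_j for y_j, t_j in zip(y, t)]  # dE_p/dy_j
            for j, d in enumerate(delta_out):
                for k in range(len(zb)):
                    grad_w[j][k] += d * zb[k]
            for k in range(len(self.v)):
                back = sum(delta_out[j] * self.w[j][k + 1]
                           for j in range(len(self.w)))
                delta_hidden = back * zb[k + 1] * (1.0 - zb[k + 1])
                for i in range(len(xb)):
                    grad_v[k][i] += delta_hidden * xb[i]
        # Gradient descent on the accumulated error
        for j, row in enumerate(self.w):
            for k in range(len(row)):
                row[k] -= eta * grad_w[j][k]
        for k, row in enumerate(self.v):
            for i in range(len(row)):
                row[i] -= eta * grad_v[k][i]
```

Calling `train_step` repeatedly on a list of `(inputs, targets)` pairs and monitoring `global_error` shows the error falling as the weights are adjusted; with real data the inputs would normally be scaled to comparable ranges first.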
The BP neural network prediction model is an air quality prediction model of the statistical forecasting type: it analyzes historical air quality data and meteorological conditions to find the underlying patterns and uses them to make the prediction [9]. Compared with numerical prediction models, this approach relies mainly on analyzing regularities in historical meteorological data and monitored pollutant concentrations, and it uses meteorological forecast products to predict pollutant concentrations. Its analysis methods are flexible and varied, and it is widely applicable. The structure of the BP neural network prediction model is shown in Figure 5-2.
Figure 5-2 Structure of the BP neural network prediction model
Problem 3 requires a prediction model that meets two objectives: minimizing the relative error of the forecast AQI and maximizing the accuracy of the predicted primary pollutant. A BP neural network is used to predict the concentrations of the six pollutants in the air. The inputs are the measured concentrations of the six pollutants together with the measured temperature, humidity, air pressure, wind speed, and wind direction; the corresponding quantitative evaluation value is the AQI. The hidden layer is set to 4 units and the output layer to 1. The network is corrected continually during training, with the number of training steps set to 10000. The flowchart of the BP neural network learning and training algorithm is shown in Figure 5-3, and the BP neural network model for Problem 3 is shown in Figure 5-4.
Figure 5-3 Flowchart of the BP neural network learning and training algorithm
Figure 5-4 BP neural network model for Problem 3
The neural network was trained in Matlab, and the resulting model was tested several times. The results below use seven months of data to predict the concentrations of the six air pollutants from January 1 to January 3, 2021, together with the corresponding AQI values. Compared with the primary forecast data and the measured data, the relative error of the AQI predicted by the secondary model is smaller than that of the primary forecast, and the predicted primary pollutant agrees with the measured data. The training results are shown in Table 5-1. Detailed data are given in Annex 3 5.
Table 5-1 Test results of the pollutant concentration predictions
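The two evaluation criteria named for Problem 3 can be written down in a short sketch. The paper does not give its exact formulas, so the definitions here are assumptions: relative error as the absolute deviation divided by the measured value, and primary-pollutant accuracy as the share of days on which the predicted and measured primary pollutants coincide. The function names are hypothetical.

```python
def relative_error(predicted_aqi, measured_aqi):
    """Absolute deviation of the forecast AQI relative to the measured AQI."""
    return abs(predicted_aqi - measured_aqi) / measured_aqi


def primary_pollutant_accuracy(predicted, measured):
    """Fraction of days on which the predicted primary pollutant is correct."""
    hits = sum(1 for p, m in zip(predicted, measured) if p == m)
    return hits / len(measured)
```

Applied to a primary forecast and a secondary forecast of the same days, the two functions give the side-by-side comparison that the text describes.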