Symbol | Description |
$s_i = (x_i, y_i)$ | Location coordinates of the $i$-th data point in the original dataset |
$Z(s_i)$ | Value of the F1 target variable at the $i$-th data point in the original dataset |
$N$ | Total number of data points in the original dataset |
$n$ | Sample size after random uniform resampling |
$\hat{Z}(s_0)$ | Value of the F1 target variable at location $s_0$, estimated by kriging interpolation |
$e$ | Estimation error |
$\mu$ | Lagrange multiplier |
$N(h)$ | Number of pairs of data points separated by the distance $h$ |
$\lambda_i$ | Weight coefficient |
$r$ | Pearson correlation coefficient |
3.2 Basic Assumptions
1. The data provided with the problem are assumed to be true and reliable.
2. The error introduced by the handling of outliers is assumed to be negligible.
3. The data used for analysis are assumed to be randomly drawn and representative, so that they reflect the characteristics of the population.
4. Models
4.1 Model for Problem 1
4.1.1 Symbol definitions
$N$: the total number of data points in the original dataset;
$n$: the sample size after random uniform resampling;
$s_i = (x_i, y_i)$: the location coordinates of the $i$-th data point in the original dataset;
$Z(s_i)$: the value of the F1 target variable at the $i$-th data point in the original dataset;
$\hat{Z}(s_0)$: the value of the F1 target variable at location $s_0$, estimated by kriging interpolation;
$e$: the estimation error, namely $e = \hat{Z}(s_0) - Z(s_0)$.
4.1.2 Kriging interpolation model
Semivariogram
The semivariogram is commonly estimated as
$$\gamma(h) = \frac{1}{2N(h)} \sum_{i=1}^{N(h)} \left[ Z(s_i) - Z(s_i + h) \right]^2,$$
where $N(h)$ is the number of pairs of data points separated by the distance $h$.
Kriging equations
$$\sum_{j=1}^{n} \lambda_j \gamma(s_i - s_j) + \mu = \gamma(s_i - s_0), \quad i = 1, \ldots, n, \qquad \sum_{j=1}^{n} \lambda_j = 1,$$
where $\lambda_j$ are the weight coefficients and $\mu$ is the Lagrange multiplier. The kriging estimate is then the weighted sum $\hat{Z}(s_0) = \sum_{j=1}^{n} \lambda_j Z(s_j)$.
Chapter 3 of Noel Cressie's book Statistics for Spatial Data, Revised Edition describes the ordinary kriging and cokriging algorithms in detail. The relevant material and formulas are reproduced below (the cokriging algorithm is described in detail under Problem 3).
Figure 3-1 Detailed description of the ordinary kriging algorithm
Figure 3-2 Detailed description of the ordinary kriging algorithm
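The semivariogram estimator and the ordinary kriging system can be sketched in a few lines of plain Python. This is a minimal illustration rather than the MATLAB code used for the results: the function names, the distance tolerance `tol`, the small Gaussian-elimination solver, the four sample points and the linear variogram $\gamma(h) = h$ are all assumptions made for the example.

```python
import math

def empirical_semivariogram(points, values, lag, tol):
    """Estimate gamma(h) at h = lag from all pairs whose separation is within tol of lag."""
    total, count = 0.0, 0
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if abs(math.dist(points[i], points[j]) - lag) <= tol:
                total += (values[i] - values[j]) ** 2
                count += 1
    # count plays the role of N(h); NaN signals that no pair fell into the bin
    gamma = total / (2 * count) if count else float("nan")
    return gamma, count

def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if m[pivot][col] == 0:
            raise ValueError("singular kriging system")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def ordinary_kriging(points, values, target, gamma):
    """Solve sum_j lambda_j gamma_ij + mu = gamma_i0 with sum_j lambda_j = 1."""
    n = len(points)
    a = [[gamma(math.dist(points[i], points[j])) for j in range(n)] + [1.0]
         for i in range(n)]
    a.append([1.0] * n + [0.0])          # unbiasedness constraint row
    b = [gamma(math.dist(p, target)) for p in points] + [1.0]
    solution = solve_linear(a, b)
    weights, mu = solution[:n], solution[n]
    estimate = sum(w * z for w, z in zip(weights, values))
    return estimate, weights, mu

# Hypothetical data and a linear variogram gamma(h) = h (assumed, not fitted)
pts = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
vals = [1.0, 2.0, 3.0, 4.0]
estimate, weights, mu = ordinary_kriging(pts, vals, (0.5, 0.5), lambda h: h)
print(estimate, weights, mu)
```

In practice the variogram model passed as `gamma` would be fitted to the empirical semivariogram first; the linear model here only keeps the example short.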
4.1.3 Linear interpolation model
Using the values at the sampling points, we estimate the target variable by linear interpolation, implemented with griddata:
$$\hat{Z}(s_0) = \sum_{i} w_i Z(s_i),$$
where $\hat{Z}(s_0)$ is the interpolated estimate, $w_i$ is the weight of sampling point $s_i$ with respect to $s_0$, and $Z(s_i)$ is the value of the target variable at sampling point $s_i$.
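For triangulation-based linear interpolation, the weights inside a single triangle are the barycentric coordinates of the query point. The sketch below shows that one step only and assumes the enclosing triangle is already known; the triangulation itself is omitted. The function names and the sample triangle are assumptions for illustration, not the griddata internals.

```python
def barycentric_weights(triangle, p):
    """Return the three barycentric weights of point p in the given triangle."""
    (x1, y1), (x2, y2), (x3, y3) = triangle
    det = (y2 - y3) * (x1 - x3) + (x3 - x2) * (y1 - y3)
    if det == 0:
        raise ValueError("degenerate triangle")
    w1 = ((y2 - y3) * (p[0] - x3) + (x3 - x2) * (p[1] - y3)) / det
    w2 = ((y3 - y1) * (p[0] - x3) + (x1 - x3) * (p[1] - y3)) / det
    return w1, w2, 1.0 - w1 - w2

def linear_interpolate(triangle, values, p):
    """Weighted sum of the vertex values, i.e. the sum of w_i * Z(s_i)."""
    weights = barycentric_weights(triangle, p)
    return sum(w * z for w, z in zip(weights, values))

# Hypothetical triangle with vertex values taken from the plane z = 1 + x + 2y
tri = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
z = [1.0, 2.0, 3.0]
print(linear_interpolate(tri, z, (0.25, 0.25)))
```

The weights sum to one and are all non-negative exactly when the query point lies inside the triangle, which is why this form of interpolation never extrapolates beyond the range of the three vertex values.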
4.2 Model for Problem 2
The Pearson correlation coefficient is computed as
$$r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2} \, \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}},$$
where $X$ is the target variable, $Y$ is the covariate, and $\bar{x}$ and $\bar{y}$ are their sample means. Moreover, $-1 \le r \le 1$: when $r = 1$, $X$ and $Y$ are perfectly positively correlated; when $r = -1$, they are perfectly negatively correlated; when $r = 0$, there is no linear correlation between them.
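The coefficient can be computed directly from its definition. The following is a minimal sketch with made-up data; the function name `pearson_r` and the example vectors are assumptions for illustration.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length of at least 2")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant variable")
    return sxy / math.sqrt(sxx * syy)

# Hypothetical target variable and covariate
target = [1.0, 2.0, 3.0, 4.0]
covariate = [2.0, 4.0, 6.0, 8.0]
print(pearson_r(target, covariate))
```

Ranking the candidate covariates by the absolute value of this coefficient gives the selection rule used for Problem 2.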
4.3 Model for Problem 3
4.3.1 Cokriging algorithm
Suppose that the data are $k \times 1$ vectors $Z(s_1), \ldots, Z(s_n)$; the remainder of the formulation is given in Figures 3-3 to 3-8.
Figure 3-3 Cokriging algorithm
Figure 3-4 Cokriging algorithm
Figure 3-5 Cokriging algorithm
Figure 3-6 Cokriging algorithm
Figure 3-7 Cokriging algorithm
Figure 3-8 Cokriging algorithm
4.3.2 Machine learning algorithms
(1) K-nearest neighbor (KNN)
1. Algorithm overview
The k-nearest neighbor (KNN) classification algorithm is one of the simplest classification methods in data mining and a well-known statistical method for pattern recognition, and it holds an important place among machine learning classifiers. It is theoretically mature: it is one of the simplest machine learning algorithms, the most basic instance-based learning method, and among the better-performing text classification algorithms.
"K nearest neighbors" means the k closest neighbors: each sample can be represented by its k nearest neighbors. Cover and Hart proposed the original nearest neighbor algorithm in 1967. KNN is a classification algorithm that belongs to instance-based learning and is a form of lazy learning: it has no explicit learning process, that is, no training phase. The dataset already carries class labels and feature values, and a new sample is processed directly when it arrives. This contrasts with eager learning.
2. Algorithm idea
KNN classifies a sample by measuring the distances between feature vectors.
The idea is as follows: if the majority of the k nearest samples of a given sample in feature space belong to one class, the sample is assigned to that class. In KNN the selected neighbors are all objects that have already been correctly classified, and the class of the sample to be classified is decided only from the classes of its one or few nearest samples.
The algorithm assumes that all instances correspond to points in the n-dimensional Euclidean space $\mathbb{R}^n$. It computes the distance from a point to all other points, takes the K nearest points, and assigns the point to the class that accounts for the largest share of those K points.
The algorithm involves three main factors: the instance set, the distance or similarity measure, and the size of k.
The nearest neighbors of an instance are defined by the standard Euclidean distance. More precisely, an arbitrary instance $x$ is represented by the feature vector
$$\langle a_1(x), a_2(x), \ldots, a_n(x) \rangle,$$
where $a_r(x)$ is the value of the $r$-th attribute of instance $x$. The distance between two instances $x_i$ and $x_j$ is defined as $d(x_i, x_j)$, where
$$d(x_i, x_j) = \sqrt{\sum_{r=1}^{n} \left( a_r(x_i) - a_r(x_j) \right)^2}.$$
Put simply, KNN works like this: given a set of data whose classes are already known, when a new data point arrives, compute its distance to every point in the training data, pick the K training points closest to it, look at which classes those points belong to, and assign the new point to a class by majority vote.
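The majority-vote procedure described above can be sketched as follows. This is a minimal illustration with Euclidean distance; the function name `knn_classify`, the default `k=3` and the toy training set are assumptions, and ties between classes are broken in favour of the class of the nearer neighbor.

```python
import math
from collections import Counter

def knn_classify(train_x, train_y, query, k=3):
    """Assign query to the majority class among its k nearest training points."""
    # Sort the training indices by Euclidean distance to the query
    order = sorted(range(len(train_x)), key=lambda i: math.dist(train_x[i], query))
    # Count the class labels of the k nearest neighbors
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]

# Hypothetical two-class training set
train_x = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
train_y = ["A", "A", "A", "B", "B", "B"]
print(knn_classify(train_x, train_y, (0.2, 0.1)))
print(knn_classify(train_x, train_y, (5.5, 5.5)))
```

Because there is no training phase, every query scans the whole training set; for large datasets a spatial index such as a k-d tree is the usual remedy.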
(2) Davies-Bouldin index (DBI)
The Davies-Bouldin index, also known as the classification accuracy index, is a metric proposed by David L. Davies and Donald W. Bouldin for evaluating the quality of a clustering.
It takes into account both the similarity of samples within a cluster and the dissimilarity of samples between clusters; the smaller its value, the more effective the clustering. Suppose we have m sequences that an algorithm groups into n clusters, and we evaluate the clustering with the DBI, defined as
$$\mathrm{DBI} = \frac{1}{n} \sum_{i=1}^{n} \max_{j \ne i} \frac{S_i + S_j}{M_{ij}},$$
where $S_i$ is the average distance from the samples in cluster $i$ to its centroid and $M_{ij}$ is the distance between the centroids of clusters $i$ and $j$.
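A direct computation of the Davies-Bouldin index is sketched below, taking the within-cluster scatter as the mean Euclidean distance to the centroid and the between-cluster separation as the distance between centroids. The function name and the toy clusters are assumptions for illustration; the sketch also assumes that no two clusters share a centroid, since the ratio would otherwise be undefined.

```python
import math

def davies_bouldin(points, labels):
    """Davies-Bouldin index; smaller values indicate better-separated clusters."""
    clusters = {}
    for p, label in zip(points, labels):
        clusters.setdefault(label, []).append(p)
    if len(clusters) < 2:
        raise ValueError("at least two clusters are required")
    # Centroid of each cluster (coordinate-wise mean)
    centroids = {lab: tuple(sum(c) / len(pts) for c in zip(*pts))
                 for lab, pts in clusters.items()}
    # Scatter: mean distance of the cluster's points to its centroid
    scatter = {lab: sum(math.dist(p, centroids[lab]) for p in pts) / len(pts)
               for lab, pts in clusters.items()}
    total = 0.0
    for i in clusters:
        # Worst-case similarity ratio of cluster i against every other cluster
        total += max((scatter[i] + scatter[j]) / math.dist(centroids[i], centroids[j])
                     for j in clusters if j != i)
    return total / len(clusters)

# Hypothetical, well-separated clusters
pts = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
labs = [0, 0, 1, 1]
print(davies_bouldin(pts, labs))
```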
The covariate idea is incorporated into the machine learning algorithms, and the correlation index is then tested.
4.4 Model for Problem 4
In Attachment 2, the F2 target variable is incompletely sampled: the number of sampling points is not sufficient to describe every grid point in the target area. The goal is to use the best interpolation method identified in Problem 3 to estimate the target variable over the entire study area and to present the result as a contour plot.
In Problem 3, we used several interpolation methods to estimate the spatial distribution of the target variable and selected the most suitable one according to the estimation error (MSE).
5. Model testing
5.1 Model test for Problem 1
Processing in MATLAB gives the following results:
(1) The target variable is randomly and uniformly resampled, and the resampled values are used to estimate the spatial variable at the unsampled locations. The results are shown as contour plots:
Fig. 1 Interpolation results for different numbers of sample points
Varying the sample size, the relationship between sample size and estimation error is shown in the figure below:
Fig. 2 Relationship between sample size and estimation error
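The sample-size experiment can be outlined as below. This is only a sketch under stated assumptions: a synthetic field stands in for the F1 data, nearest-neighbor interpolation stands in for kriging to keep the example short, and the error is taken as the mean squared error over all grid points. None of these choices, nor the function names, are taken from the MATLAB code behind the figures.

```python
import math
import random

def nearest_estimate(sample_pts, sample_vals, target):
    """Value of the sampled point closest to target (nearest-neighbor interpolation)."""
    i = min(range(len(sample_pts)), key=lambda k: math.dist(sample_pts[k], target))
    return sample_vals[i]

def resampling_mse(points, values, n, rng):
    """Draw n points uniformly without replacement, interpolate, return the MSE."""
    idx = rng.sample(range(len(points)), n)
    sample_pts = [points[i] for i in idx]
    sample_vals = [values[i] for i in idx]
    errors = [(nearest_estimate(sample_pts, sample_vals, p) - z) ** 2
              for p, z in zip(points, values)]
    return sum(errors) / len(errors)

# Synthetic 20 x 20 field used as a hypothetical stand-in for the real dataset
grid = [(x, y) for x in range(20) for y in range(20)]
field = [math.sin(x / 4.0) + math.cos(y / 5.0) for x, y in grid]

rng = random.Random(0)  # fixed seed so the run is repeatable
for n in (10, 50, 200, 400):
    print(n, resampling_mse(grid, field, n, rng))
```

Averaging the error over several random draws for each sample size would smooth the curve, since a single draw can be unrepresentative at small sample sizes.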
5.2 Model test for Problem 2
Processing in MATLAB gives the following results:
Fig. 1 Comparison of the correlations between the four target variables and the auxiliary variables
As Fig. 1 shows, covariate 1 and covariate 4 correlate relatively strongly with the target variable, so we select covariate 1 and covariate 4 as the covariates for estimating the target variable.
5.3 Model test for Problem 3
Processing in MATLAB gives the following results:
Fig. 1 Relationship between sample size and estimation error for the four methods
Fig. 2 Comparison of the four interpolation methods with 1000 sampling points
Fig. 3 Comparison of the four interpolation methods with 500 sampling points
Fig. 4 Comparison of the four interpolation methods with 200 sampling points
Fig. 5 Comparison of the four interpolation methods with 50 sampling points
Fig. 6 Correlation between the four interpolation results and the original data
5.4 Model test for Problem 4
(1) The nearest interpolation method, run in MATLAB, gives the following figure:
Fig. 1 Values completed by the nearest method
(2) The cokriging interpolation method, run in MATLAB, gives the following figure:
Fig. 2 Values completed by the cokriging method
6. Sensitivity analysis
6.1 Sensitivity analysis for Problem 1
In the local sensitivity analysis, all other parameters are held fixed while one parameter at a time is varied step by step over its range, and kriging interpolation is run for each value to obtain the corresponding interpolation result. The estimation error is then computed to assess the accuracy of each result, and the sensitivity of the parameter is judged from the slope and trend of the resulting error curve: a steep slope indicates a sensitive parameter, a flat one an insensitive parameter.
In the global sensitivity analysis (Sobol method), the total-order and first-order sensitivity indices of each parameter are displayed in a bar chart or radar chart. A high total-order index for the variogram range indicates that the range not only strongly affects the interpolation result by itself but also affects it strongly through its interactions with the other parameters.
Model optimization: for sensitive parameters, more advanced parameter estimation methods can be considered, or more attention can be paid during data collection to the factors related to these parameters, in order to improve the accuracy and stability of the model. For relatively insensitive parameters, the choice of value can be simplified within a certain range to improve computational efficiency.
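The one-at-a-time local analysis described in this subsection can be sketched as follows. The error function here is a hypothetical closed-form stand-in for "run kriging with these variogram parameters and return the estimation error"; the parameter names `nugget`, `sill` and `vrange`, their base values and their grids are likewise assumptions.

```python
def one_at_a_time(model, base, name, grid):
    """Vary one parameter over grid while holding the others at their base values."""
    curve = []
    for value in grid:
        params = dict(base)
        params[name] = value
        curve.append((value, model(**params)))
    return curve

def max_abs_slope(curve):
    """Largest absolute finite-difference slope along the error curve."""
    return max(abs((y2 - y1) / (x2 - x1))
               for (x1, y1), (x2, y2) in zip(curve, curve[1:]))

def toy_error(nugget, sill, vrange):
    # Hypothetical stand-in for the kriging estimation error
    return 0.2 + 0.05 * nugget + 0.01 * sill + 0.005 * (vrange - 10.0) ** 2

base = {"nugget": 0.1, "sill": 1.0, "vrange": 10.0}
grids = {
    "nugget": [0.0, 0.5, 1.0],
    "sill": [0.5, 1.0, 1.5],
    "vrange": [6.0, 10.0, 14.0],
}
slopes = {name: max_abs_slope(one_at_a_time(toy_error, base, name, grid))
          for name, grid in grids.items()}
print(slopes)
```

Comparing raw slopes across parameters is only meaningful when the parameters share a scale; normalizing each parameter by its range first is a common refinement.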
6.2 Sensitivity analysis for Problem 2
Stability judgment: if the Pearson correlation coefficient stays relatively stable as a parameter changes, the linear correlation between the variables is insensitive to that parameter. Otherwise, the linear correlation between the variables is sensitive to it.
Model optimization: in practice, if the dataset is large or has many variables, dimensionality reduction can be applied as a preprocessing step before the correlation analysis, which improves both efficiency and accuracy.
6.3 Sensitivity analysis for Problem 3
Local sensitivity: the local sensitivity of the output $y$ to a feature $x_i$ is measured by the partial derivative $\partial y / \partial x_i$. The larger the absolute value of the partial derivative, the greater the influence of the feature on the output near that point.
Global sensitivity: for a given machine learning model, the Monte Carlo simulation method first generates a large number of random sample points within the reasonable range of each input variable and then computes the output for each sample point. The global sensitivity of the output to each input variable is then assessed with statistical measures such as the correlation coefficient between the variable and the output or its contribution to the output variance.
Model optimization: finding suitable hyperparameters allows the model to perform well on both the training data and the test data.
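Both sensitivity measures in this subsection can be illustrated on a toy model. In the sketch below, the local sensitivity is a central finite-difference approximation of the partial derivative, and the global sensitivity is the absolute correlation between each input and the output over uniform Monte Carlo samples. The toy model, the input bounds, the sample count and the function names are assumptions chosen for the example, not the models fitted in Problem 3.

```python
import math
import random

def partial_derivative(f, x, i, h=1e-5):
    """Central finite-difference estimate of the partial derivative of f at x."""
    up, down = list(x), list(x)
    up[i] += h
    down[i] -= h
    return (f(up) - f(down)) / (2 * h)

def correlation(a, b):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / math.sqrt(var_a * var_b)

def monte_carlo_sensitivity(f, bounds, n_samples=2000, seed=0):
    """Absolute input-output correlation for each input under uniform sampling."""
    rng = random.Random(seed)
    xs = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_samples)]
    ys = [f(x) for x in xs]
    return [abs(correlation(column, ys)) for column in zip(*xs)]

def toy_model(x):
    # Hypothetical model: the output depends strongly on x[0], weakly on x[1]
    return 3.0 * x[0] + 0.1 * x[1]

print(partial_derivative(toy_model, [0.5, 0.5], 0))
print(monte_carlo_sensitivity(toy_model, [(0.0, 1.0), (0.0, 1.0)]))
```

Correlation captures only the linear part of an input's influence; for strongly nonlinear models a variance-based measure is the more reliable choice.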
6.4 Sensitivity analysis for Problem 4
Sensitivity to variable correlation: cokriging relies on the correlation between multiple variables to improve interpolation accuracy. The strength and sign (positive or negative) of the correlation between the variables are sensitive factors. If the correlation between the variables is estimated inaccurately, the entire interpolation process is affected.
7. Strengths and weaknesses of the models
7.1 Strengths
Kriging interpolation is flexible and makes full use of the spatial correlation of the variable, which it describes through the semivariogram. It captures the spatial structure of the data well and interpolates effectively when the data have a pronounced spatial pattern.
The Pearson correlation coefficient is relatively intuitive to compute and easy to understand and implement in practice.
Machine learning methods learn patterns and regularities automatically from large amounts of data and can adapt to different types of data and problems.
When the variables are spatially correlated, cokriging combines this information and estimates the values at unknown points more accurately than kriging with a single variable.
7.2 Weaknesses
The computation of the kriging interpolation model is relatively complex, and the semivariogram parameters are difficult to determine.
When the data contain a nonlinear relationship, the Pearson correlation coefficient can mislead us into concluding that the variables are unrelated.
The performance of machine learning algorithms depends heavily on the quality and quantity of the data, and the model's decision process and outputs are hard to interpret.
Cokriging requires spatial data for several correlated variables and places high demands on data quality.
References
[1] Cressie N. Statistics for Spatial Data[J]. Terra Nova, 1992. DOI:10.1111/j.1365-3121.1992.tb00605.x.
[2] Li Minli. Exploring the application of linear dependence theorems in advanced algebra[J]. Journal of Hubei Adult Education Institute, 2024, 30(05): 129-134. DOI:10.16019/j.cnki.cn42-1578/g4.2024.05.002.
[3] Su Chuanyun, He Xueming, Bi Sicun, et al. Application of the kriging interpolation method at the 9201 working face of Dongyi Coal Mine[J]. Energy Research and Management, 2024, 16(03): 140-145. DOI:10.16056/j.2096-7705.2024.03.022.
[4] Zhang Hanqi, Xu Yue, Zhong Min, et al. Low sonic boom optimization design based on a CoKriging surrogate model[J/OL]. Advances in Aeronautical Science and Engineering, 1-11[2024-11-17]. http://kns.cnki.net/kcms/detail/61.1479.V.20240403.1355.004.html.