
Problem Chosen
B

2024
ShuWei Cup
Summary Sheet

Team Control Number
2024111130780

Summary

Co-kriging is now commonly used to estimate the value of a spatial variable at unsampled locations. In engineering practice, however, different measurement methods give different measured values, and some variables are costly and hard to measure while others are cheap and easy to measure. To study the co-estimation of spatial variables, this paper uses mathematical models and algorithms to propose suitable methods for the co-estimation of spatial attribute data.

For Problem 1, we first select samples by random uniform resampling. We then build a kriging interpolation model, which is solved through the semivariogram and the kriging equations, and we estimate the spatial variable at unsampled locations by linear interpolation with griddata. The results are presented as contour plots. We then vary the sample size to obtain the relationship between sample size and estimation error, which is plotted as a line chart.

For Problem 2, we compute Pearson correlation coefficients on the data in Annex 1 to select the covariates used for estimation. The analysis shows that covariate 1 and covariate 4 are the most strongly correlated with the target variable, so we select them as the estimation covariates.

For Problem 3, as in Problem 1, we randomly and uniformly resample the target variable and the covariates, use the resampled values to estimate the spatial variable at unsampled locations, and present the results as contour plots. We also vary the sample size and explore the relationship between sample size and estimation error. To compare methods, we first continue with the kriging-based estimation from Problem 1 and then use machine learning algorithms that account for the influence of the covariates on the target variable.

For Problem 4, we estimate the trend of the target variable with the best-performing estimation model identified in Problem 3, using the covariates related to the target variable and their spatial distribution to infer its unsampled values. The results are presented as contour plots.

Keywords: covariates; kriging; spatial interpolation; Pearson correlation coefficient; linear interpolation; machine learning algorithms; correlation analysis

Contents

1. Introduction
1.1 Background
1.2 Work
2. Problem Analysis
2.1 Data Analysis
2.2 Analysis of Problem 1
2.3 Analysis of Problem 2
2.4 Analysis of Problem 3
2.5 Analysis of Problem 4
3. Symbols and Assumptions
3.1 Symbols
3.2 Basic Assumptions
4. Models
4.1 Model for Problem 1
4.2 Model for Problem 2
4.3 Model for Problem 3
4.4 Model for Problem 4
5. Model Testing
5.1 Model Test for Problem 1
5.2 Model Test for Problem 2
5.3 Model Test for Problem 3
5.4 Model Test for Problem 4
6. Sensitivity Analysis
6.1 Sensitivity Analysis for Problem 1
6.2 Sensitivity Analysis for Problem 2
6.3 Sensitivity Analysis for Problem 3
6.4 Sensitivity Analysis for Problem 4
7. Model Advantages and Disadvantages
7.1 Advantages
7.2 Disadvantages
References
Appendix


1. Introduction

1.1 Background

With the development of artificial intelligence and machine learning methods, co-kriging is commonly used to estimate values at unsampled locations. In engineering practice, measurements of the same physical quantity differ when the measurement principles differ. These values nevertheless remain spatially correlated, which means the correlation can be exploited in data processing and analysis. In engineering research, an undersampled spatial variable is often estimated jointly with related variables. Co-kriging is easy to understand in theory, but it is hard to apply because it requires cross-covariances or cross-variograms. Based on the annexes, we need to explore the correlation characteristics of the spatial variables and find suitable estimation methods, in order to become proficient in the co-estimation of spatial variables.

1.2 Work

To study the co-estimation of spatial variables, we need to build mathematical models that solve the following problems:

Problem 1: Using the data in Annex 1, focus on the F1 target variable and study its pattern of variation. This involves two tasks. The first is to randomly and uniformly resample the target variable, use the resampled data to estimate the variable at unsampled locations, and present the results as a contour plot. The second is to study how the sample size affects the estimation error and to find the relationship between the two.

Problem 2: Using the data in Annex 1, study the degree of correlation between the target variable and the covariates, and select two covariates for estimating the target variable based on the computed results.

Problem 3: Based on the data in Annex 1 and the covariates selected in Problem 2, study the spatial pattern of variation of the F1 target variable. This involves three tasks. The first is to randomly and uniformly resample the selected covariates and the F1 target variable, estimate the spatial variable at unsampled locations, and present the results as contour plots. The second is to study the relationship between sample size and error by varying the sample size. Finally, at least two methods must be compared.

Problem 4: Using Annex 2, choose a suitable method from Problem 3 to estimate the trend of the target variable, and present the results as a contour plot.

2. Problem Analysis

2.1 Data Analysis

We used MATLAB to visualize F1_target_variable from the txt file in Annex 1. Visualizing the target variable in Annex 2 gives a scatter plot.

2.2 Analysis of Problem 1

Problem 1 asks us to study the pattern of variation of the F1 target variable. We use random uniform resampling to select a subset of the data points in the original data set as samples, and then apply kriging interpolation and linear interpolation to these samples to estimate the spatial variable at unsampled locations. We then vary the sample size and explore the relationship between sample size and estimation error.
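The procedure described for Problem 1 can be sketched as follows. This is only an illustration under stated assumptions: the field is synthetic rather than the Annex 1 data, and a nearest-sample estimator stands in for kriging or linear interpolation so that the example stays short. The helper names `resample`, `nearest_value`, and `mse` are hypothetical.

```python
import math
import random

def resample(points, n, seed=0):
    """Draw n points uniformly at random, without replacement."""
    rng = random.Random(seed)
    return rng.sample(points, n)

def nearest_value(samples, x, y):
    """Stand-in estimator (assumption): value of the closest sampled point."""
    return min(samples, key=lambda p: (p[0] - x) ** 2 + (p[1] - y) ** 2)[2]

def mse(points, samples):
    """Mean squared error of the estimator over every original point."""
    total = sum((nearest_value(samples, x, y) - z) ** 2 for x, y, z in points)
    return total / len(points)

# Synthetic smooth field on a 20 x 20 grid (hypothetical, not Annex 1).
grid = [(i, j, math.sin(i / 4.0) + math.cos(j / 5.0))
        for i in range(20) for j in range(20)]

# Estimation error for several sample sizes.
errors = {n: mse(grid, resample(grid, n)) for n in (10, 50, 200, 400)}
for n, e in errors.items():
    print(f"n = {n:3d}  MSE = {e:.4f}")
```

When every grid point is sampled (n = 400 here), each point is its own nearest sample and the error is zero. Smaller samples leave more points to be estimated from neighbors.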

2.3 Analysis of Problem 2

Problem 2 asks us to study the correlation between the target variable and the covariates. Covariates are other variables that are spatially correlated with the target variable to some degree. Measuring their correlation with the target variable helps in understanding how the spatial variable changes. Here we analyze the relationship between each covariate and the target variable using the Pearson correlation coefficient, and select the two covariates most strongly correlated with the target variable as the estimation covariates.

2.4 Analysis of Problem 3

Problem 3 is similar to Problem 1: we randomly and uniformly resample the target variable and the covariates, and use the resampled values to estimate the spatial variable at unsampled locations. The results are again presented as contour plots. We also need to vary the sample size and explore the relationship between sample size and estimation error. To compare performance, at least two methods must be used. One option is to continue with the kriging-based estimation from Problem 1. Another is to use machine learning algorithms that account for the influence of the covariates on the target variable.

2.5 Analysis of Problem 4

In Problem 4, the target variable in Annex 2 is insufficiently sampled, so its trend must be estimated with the best method identified in Problem 3. The unsampled values of the target variable are inferred from the covariates related to it and from their spatial distribution, by fitting the estimation model that performed best in Problem 3. The results are again presented as contour plots.

3. Symbols and Assumptions

3.1 Symbols

Symbol            Description
$(x_i, y_i)$      Location coordinates of the $i$-th data point in the original data set
$z_i$             F1 target variable value of the $i$-th data point in the original data set
$N$               Total number of data points in the original data set
$n$               Sample size after random uniform resampling
$\hat{z}(x, y)$   F1 target variable value at location $(x, y)$ estimated by kriging interpolation
$\varepsilon$     Estimation error
$N(h)$            Number of data point pairs separated by distance $h$
$\lambda_i$       Weight coefficient
$\mu$             Lagrange multiplier
$r$               Pearson correlation coefficient

3.2 Basic Assumptions

1. The data provided with the problem are assumed to be true and reliable.

2. The error introduced by the handling of outliers is assumed to be negligible.

3. The data used for analysis are assumed to be randomly drawn and representative, so that they reflect the characteristics of the population.

4. Models

4.1 Model for Problem 1

4.1.1 Symbol definitions:

$N$: the total number of data points in the original data set;

$n$: the sample size after random uniform resampling;

$(x_i, y_i)$: the location coordinates of the $i$-th data point in the original data set;

$z_i$: the F1 target variable value of the $i$-th data point in the original data set;

$\hat{z}(x, y)$: the F1 target variable value at location $(x, y)$ estimated by kriging interpolation;

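$\varepsilon$: the estimation error, measured as the mean squared error between the estimate $\hat{z}(x_i, y_i)$ and the observed value $z_i$ over the $N$ original data points, namely

$\varepsilon = \frac{1}{N} \sum_{i=1}^{N} \left( \hat{z}(x_i, y_i) - z_i \right)^2$  (1)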

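4.1.2 Kriging interpolation model

Semivariogram

With $s_i = (x_i, y_i)$ denoting the location of the $i$-th data point and $z_i$ its target variable value, the commonly used estimator is:

$\gamma(h) = \frac{1}{2N(h)} \sum_{\lVert s_i - s_j \rVert = h} (z_i - z_j)^2$  (2)

where $N(h)$ is the number of data point pairs separated by distance $h$.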
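As an illustration of the semivariogram estimator, the sketch below computes it from a list of (x, y, z) tuples. Real data rarely contain pairs at exactly distance h, so the sketch assumes each pair is assigned to the nearest multiple of a lag width. That binning rule, the function name, and the four sample points are assumptions made for the example.

```python
import math
from collections import defaultdict

def empirical_semivariogram(points, lag_width, n_lags):
    """Return {lag: (gamma, pair_count)} for lags lag_width, 2*lag_width, ..."""
    sq_diff_sum = defaultdict(float)
    pair_count = defaultdict(int)
    for a in range(len(points)):
        xa, ya, za = points[a]
        for b in range(a + 1, len(points)):
            xb, yb, zb = points[b]
            h = math.hypot(xa - xb, ya - yb)
            k = int(h / lag_width + 0.5)      # nearest lag index (assumed binning)
            if 1 <= k <= n_lags:
                sq_diff_sum[k] += (za - zb) ** 2
                pair_count[k] += 1
    # gamma(h) = sum of squared differences / (2 * number of pairs)
    return {k * lag_width: (sq_diff_sum[k] / (2 * pair_count[k]), pair_count[k])
            for k in sorted(pair_count)}

# Four hypothetical points on a line, one unit apart.
points = [(0.0, 0.0, 1.0), (1.0, 0.0, 2.0), (2.0, 0.0, 4.0), (3.0, 0.0, 7.0)]
gamma = empirical_semivariogram(points, lag_width=1.0, n_lags=3)
for lag, (value, count) in gamma.items():
    print(f"h = {lag:.1f}  gamma = {value:.3f}  pairs = {count}")
```

For these points the lag-2 bin holds two pairs with squared differences 9 and 25, so its semivariance is 34 / 4 = 8.5.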

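Kriging equations

$\sum_{j=1}^{n} \lambda_j \gamma(\lVert s_i - s_j \rVert) + \mu = \gamma(\lVert s_i - s_0 \rVert), \quad i = 1, \ldots, n$  (3)

$\sum_{i=1}^{n} \lambda_i = 1$  (4)

where $\lambda_i$ is the weight coefficient of the sample at location $s_i$, $s_0$ is the location being estimated, $\gamma$ is the semivariogram, and $\mu$ is the Lagrange multiplier.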
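To show how the kriging system is solved in practice, here is a minimal sketch for a single target location using NumPy. It assumes an exponential semivariogram with made-up parameters (`sill`, `rng`, `nugget`), not a model fitted to the Annex data. The function names are hypothetical.

```python
import numpy as np

def exponential_variogram(h, sill=1.0, rng=5.0, nugget=0.0):
    """Assumed model: gamma(h) = nugget + sill * (1 - exp(-h / rng)), gamma(0) = 0."""
    h = np.asarray(h, dtype=float)
    return np.where(h > 0, nugget + sill * (1.0 - np.exp(-h / rng)), 0.0)

def ordinary_kriging(xy, z, target, variogram=exponential_variogram):
    """Solve the kriging system for one location; return (estimate, weights, mu)."""
    xy = np.asarray(xy, dtype=float)
    z = np.asarray(z, dtype=float)
    n = len(z)
    # Left-hand side: sample-to-sample semivariances, bordered by ones
    # for the constraint that the weights sum to 1.
    dists = np.linalg.norm(xy[:, None, :] - xy[None, :, :], axis=2)
    A = np.ones((n + 1, n + 1))
    A[:n, :n] = variogram(dists)
    A[n, n] = 0.0
    # Right-hand side: sample-to-target semivariances, then the constant 1.
    b = np.ones(n + 1)
    b[:n] = variogram(np.linalg.norm(xy - np.asarray(target, dtype=float), axis=1))
    solution = np.linalg.solve(A, b)
    weights, mu = solution[:n], solution[n]
    return float(weights @ z), weights, float(mu)

# Four hypothetical samples at the corners of a unit square.
xy = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
z = [1.0, 2.0, 3.0, 4.0]
estimate, weights, mu = ordinary_kriging(xy, z, (0.5, 0.5))
print(estimate, weights)
```

At the center of the square all four samples are equidistant, so the weights are equal and the estimate is the plain average. With a zero nugget, estimating at a sample location returns that sample's value.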

Chapter 3 of Noel Cressie's book Statistics for Spatial Data, Revised Edition describes ordinary kriging and co-kriging in detail. The relevant material and formulas are as follows:

(Co-kriging is described in detail under Problem 3.)

Figure 3-1 Detailed description of the ordinary kriging algorithm

Figure 3-2 Detailed description of the ordinary kriging algorithm

4.1.3 Linear interpolation model

We estimate the target variable at unsampled locations by linear interpolation of the values at the sampling points, implemented with griddata:

$\hat{z} = \sum_{i} w_i z_i$  (5)

where $\hat{z}$ is the interpolated estimate, $w_i$ is the weight of sampling point $i$, and $z_i$ is the target variable value at that sampling point.
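The weighted sum above can be made concrete for a single triangle. Triangulation-based linear interpolation, which is how the linear method of griddata works, uses the barycentric coordinates of the query point inside its enclosing triangle as the weights. The sketch below assumes the enclosing triangle is already known and uses hypothetical function names. It does not build the triangulation.

```python
def barycentric_weights(p, a, b, c):
    """Weights (w_a, w_b, w_c) of point p relative to triangle a, b, c."""
    (px, py), (ax, ay), (bx, by), (cx, cy) = p, a, b, c
    det = (by - cy) * (ax - cx) + (cx - bx) * (ay - cy)
    w_a = ((by - cy) * (px - cx) + (cx - bx) * (py - cy)) / det
    w_b = ((cy - ay) * (px - cx) + (ax - cx) * (py - cy)) / det
    return w_a, w_b, 1.0 - w_a - w_b

def interpolate(p, vertices, values):
    """Linear interpolation: weighted sum of the vertex values."""
    weights = barycentric_weights(p, *vertices)
    return sum(w * v for w, v in zip(weights, values))

# Hypothetical triangle with values taken from the plane z = 1 + x + 2y.
vertices = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
values = [1.0, 2.0, 3.0]
weights = barycentric_weights((0.25, 0.25), *vertices)
estimate = interpolate((0.25, 0.25), vertices, values)
print(weights, estimate)
```

Because the vertex values lie on a plane, the interpolated value at (0.25, 0.25) matches the plane there, 1 + 0.25 + 0.5 = 1.75.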

4.2 Model for Problem 2

Computing the Pearson correlation coefficient

The formula is:

$r = \frac{\sum_{i=1}^{n} (z_i - \bar{z})(c_i - \bar{c})}{\sqrt{\sum_{i=1}^{n} (z_i - \bar{z})^2} \sqrt{\sum_{i=1}^{n} (c_i - \bar{c})^2}}$  (6)

where $z$ is the target variable, $c$ is the covariate, and $\bar{z}$ and $\bar{c}$ are their sample means. Moreover, $-1 \le r \le 1$: when $r = 1$, $z$ and $c$ are perfectly positively correlated; when $r = -1$, they are perfectly negatively correlated; when $r = 0$, there is no linear correlation between them.
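The selection step can be sketched as below: compute r between the target and each covariate, then keep the two with the largest absolute value. Ranking by |r| is an assumption made here so that strong negative correlations also count. The data and the names `cov_1` to `cov_4` are invented for the example and are not the Annex 1 values.

```python
import math

def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

# Hypothetical target and covariates.
target = [1.0, 2.0, 3.0, 4.0, 5.0]
covariates = {
    "cov_1": [2.1, 3.9, 6.2, 8.0, 9.8],
    "cov_2": [5.0, 1.0, 4.0, 2.0, 3.0],
    "cov_3": [3.0, 3.0, 2.0, 4.0, 3.0],
    "cov_4": [9.0, 7.2, 5.1, 2.9, 1.0],
}

scores = {name: pearson(target, values) for name, values in covariates.items()}
# Keep the two covariates with the strongest linear relationship.
selected = sorted(scores, key=lambda name: abs(scores[name]), reverse=True)[:2]
print(scores)
print(selected)
```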

4.3 Model for Problem 3

4.3.1 Cokriging algorithm

Suppose that the data are $k \times 1$ vectors $Z(s_1), \ldots, Z(s_n)$ and write

$Z \equiv (Z(s_1), \ldots, Z(s_n))'$  (7)

Figure 3-3 Cokriging algorithm

Figure 3-4 Cokriging algorithm

Figure 3-5 Cokriging algorithm

Figure 3-6 Cokriging algorithm

Figure 3-7 Cokriging algorithm

Figure 3-8 Cokriging algorithm

4.3.2 Machine learning algorithms

(1) K-Nearest Neighbor (KNN)

1. Algorithm overview

The K-nearest neighbor (KNN) classification algorithm is one of the simplest methods in data mining classification and a well-known statistical method for pattern recognition, and it holds an important place among machine learning classifiers. It is theoretically mature: it is one of the simplest machine learning algorithms, the most basic instance-based learning method, and one of the better text classification algorithms.

"K nearest neighbors" means that each sample can be represented by its k closest neighbors. Cover and Hart proposed the original nearest neighbor algorithm in 1967. KNN is a classification algorithm that belongs to instance-based learning and to lazy learning: it has no explicit training phase. The data set already carries class labels and feature values, and a new sample is processed directly when it arrives. This contrasts with eager learning.

2. Algorithm idea

KNN classifies a sample by measuring the distance between feature values.

The idea is that if most of the k nearest samples to a given sample in feature space belong to one class, the sample is assigned to that class as well. The neighbors used are all objects that have already been classified correctly. The class decision depends only on the classes of the one or few nearest samples.

The algorithm assumes that all instances correspond to points in the n-dimensional Euclidean space $\mathbb{R}^n$. It computes the distance between a point and all other points, takes the K points closest to it, and assigns the point to the class with the largest share among those K points.

The algorithm involves three main factors: the instance set, the distance or similarity measure, and the value of k.

The nearest neighbors of an instance are defined by the standard Euclidean distance. More precisely, an arbitrary instance $x$ is described by the feature vector

$\langle a_1(x), a_2(x), \ldots, a_n(x) \rangle$

where $a_r(x)$ is the value of the $r$-th attribute of instance $x$. The distance between two instances $x_i$ and $x_j$ is defined as $d(x_i, x_j)$, where

$d(x_i, x_j) = \sqrt{\sum_{r=1}^{n} \left( a_r(x_i) - a_r(x_j) \right)^2}$

Put simply, KNN starts from a set of data whose classes are already known. When a new data point arrives, its distance to every point in the training data is computed, the K closest training points are picked, and the new point is assigned to the class held by the majority of them.
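The description above maps onto a few lines of code. The sketch below is a generic majority-vote KNN classifier with Euclidean distance. The training points, labels, and the function name `knn_classify` are made up for illustration and are not the configuration used in this paper.

```python
import math
from collections import Counter

def knn_classify(train, query, k=3):
    """train: list of (feature_vector, label). Returns the majority label
    among the k training points closest to query (Euclidean distance)."""
    by_distance = sorted(train, key=lambda item: math.dist(item[0], query))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Hypothetical labeled points forming two well-separated groups.
train = [
    ((1.0, 1.0), "low"), ((1.0, 2.0), "low"), ((2.0, 1.0), "low"),
    ((8.0, 8.0), "high"), ((8.0, 9.0), "high"), ((9.0, 8.0), "high"),
]

print(knn_classify(train, (2.0, 2.0)))   # nearest three are all "low"
print(knn_classify(train, (7.0, 7.0)))   # nearest three are all "high"
```

For a continuous quantity such as a spatial variable, the vote would typically be replaced by an average of the neighbors' values.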

(2) Davies-Bouldin index (DBI)

The Davies-Bouldin index, also called the classification accuracy index, is a measure proposed by David L. Davies and Donald W. Bouldin for evaluating the quality of clustering algorithms.

It takes into account both the similarity of samples within a class and the difference between classes. The smaller its value, the more effective the clustering. Suppose we have m sequences that an algorithm clusters into n classes, and we evaluate the clustering with DBI. It is defined as follows:

$DBI = \frac{1}{n} \sum_{i=1}^{n} \max_{j \neq i} \frac{S_i + S_j}{M_{ij}}$  (9)

where $S_i$ is the average distance from the members of class $i$ to its centroid and $M_{ij}$ is the distance between the centroids of classes $i$ and $j$.
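A small sketch of the index may help. It assumes the usual definition in which the within-class scatter is the mean Euclidean distance to the centroid and the between-class separation is the distance between centroids. The two example clusterings and the function names are hypothetical.

```python
import math

def centroid(cluster):
    """Coordinate-wise mean of the points in one cluster."""
    dim = len(cluster[0])
    return tuple(sum(p[d] for p in cluster) / len(cluster) for d in range(dim))

def davies_bouldin(clusters):
    """Davies-Bouldin index of a list of clusters (lists of points)."""
    centers = [centroid(c) for c in clusters]
    # Scatter: mean distance from the members of a cluster to its centroid.
    scatter = [sum(math.dist(p, ctr) for p in c) / len(c)
               for c, ctr in zip(clusters, centers)]
    n = len(clusters)
    total = 0.0
    for i in range(n):
        # Worst-case similarity of cluster i to any other cluster.
        total += max((scatter[i] + scatter[j]) / math.dist(centers[i], centers[j])
                     for j in range(n) if j != i)
    return total / n

far_apart = [[(0.0, 0.0), (0.0, 2.0)], [(10.0, 0.0), (10.0, 2.0)]]
close_by = [[(0.0, 0.0), (0.0, 2.0)], [(4.0, 0.0), (4.0, 2.0)]]
dbi_far = davies_bouldin(far_apart)
dbi_close = davies_bouldin(close_by)
print(dbi_far, dbi_close)
```

Both clusterings have the same scatter (1 per cluster), so the index differs only through the centroid distance: 2 / 10 for the well-separated pair and 2 / 4 for the closer pair.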

The covariate idea is incorporated into the machine learning algorithms, and the correlation index is then tested.

4.4 Model for Problem 4

In Annex 2, the F2 target variable is incompletely sampled: the number of sampling points is not enough to describe all grid points in the target area. The goal in this situation is to estimate the target variable over the whole study area with the best interpolation method found in Problem 3, and to present the result as a contour plot.

In Problem 3, we used different interpolation methods to estimate the spatial distribution of the target variable and judged which method is most suitable by the estimation error (MSE).
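The selection rule can be illustrated with a short sketch: compute the MSE of each candidate method against known values and keep the method with the smallest error. The method names and numbers below are placeholders and are not results from this paper.

```python
def mse(estimates, truth):
    """Mean squared error between estimated and reference values."""
    return sum((e - t) ** 2 for e, t in zip(estimates, truth)) / len(truth)

# Hypothetical reference values and estimates from two candidate methods.
truth = [1.0, 2.0, 3.0, 4.0]
estimates_by_method = {
    "method_a": [1.1, 2.1, 2.9, 4.2],
    "method_b": [1.5, 2.5, 2.0, 4.5],
}

errors = {name: mse(est, truth) for name, est in estimates_by_method.items()}
best = min(errors, key=errors.get)   # method with the lowest MSE
print(errors)
print(best)
```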

5. Model Testing

5.1 Model test for Problem 1

Processing in MATLAB gave the following results:

(1) The target variable is randomly and uniformly resampled, and the resampled values are used to estimate the spatial variable at unsampled locations. The results are presented as contour plots:

Fig. 1 Interpolation results for different numbers of sample points

(2) Varying the sample size gives the relationship between sample size and estimation error shown in the figure below:

Fig. 2 Relationship between sample size and estimation error

5.2 Model test for Problem 2

Processing in MATLAB gave the following results:

Fig. 1 Comparison of the correlations between the target variable and the four covariates

Fig. 1 shows that covariate 1 and covariate 4 are the most strongly correlated with the target variable, so we select covariate 1 and covariate 4 as the estimation covariates for the target variable.

5.3 Model test for Problem 3

Processing in MATLAB gave the following results:

Fig. 1 Relationship between sample size and estimation error for the four methods

Fig. 2 Comparison of the four interpolation methods with 1000 sampling points

Fig. 3 Comparison of the four interpolation methods with 500 sampling points

Fig. 4 Comparison of the four interpolation methods with 200 sampling points

Fig. 5 Comparison of the four interpolation methods with 50 sampling points

Fig. 6 Correlation between each of the four interpolation results and the original data

5.4 Model test for Problem 4

(1) The nearest interpolation method in MATLAB gives the following figure:

Fig. 1 Values completed by the nearest method

(2) The cokriging interpolation method in MATLAB gives the following figure:

Fig. 2 Values completed by the cokriging method

6. Sensitivity Analysis

6.1 Sensitivity analysis for Problem 1

Local sensitivity analysis: all other parameters are held fixed while one parameter at a time is varied step by step over its range. Kriging interpolation is run for each value, and an estimation error metric is computed from the interpolation result to assess its accuracy. The sensitivity of the parameter is judged from the slope and trend of the resulting curve: a steep slope indicates a sensitive parameter, and a flat one an insensitive parameter.

Global sensitivity analysis (Sobol method): the total-order and first-order sensitivity indices of each parameter are displayed in a bar chart or radar chart. If the total-order index of the variogram range is high, the range not only has a large effect on the interpolation result by itself, but its interactions with other parameters also strongly affect the result.

Model optimization: for sensitive parameters, more advanced parameter estimation methods can be considered, or more attention can be paid to the factors related to these parameters during data collection, to improve the accuracy and stability of the model. For relatively insensitive parameters, the choice of values can be simplified within a certain range to make the model cheaper to compute.

6.2 Sensitivity analysis for Problem 2

Stability: if the Pearson correlation coefficient stays relatively stable as a parameter changes, the linear correlation between the variables is insensitive to that parameter. Otherwise, the linear correlation is sensitive to it.

Model optimization: in practice, if the data set is large or there are many variables, dimensionality reduction can be used to preprocess the data before the correlation analysis, which improves the efficiency and accuracy of the analysis.

6.3 Sensitivity analysis for Problem 3

Local sensitivity: the local sensitivity of the output $y$ to a feature $x_i$ is measured by the partial derivative $\partial y / \partial x_i$. The larger the absolute value of the partial derivative, the greater the influence of the feature on the output near that point.

Global sensitivity: with the Monte Carlo simulation method, a large number of sample points are first generated at random within the reasonable range of the input variables of a given machine learning model, and the output is computed for each sample point. Statistical analysis, such as the correlation coefficient between each variable and the output or its contribution to the variance, is then used to assess the global sensitivity of the output to each input variable.

Model optimization: finding suitable hyperparameters lets the model perform well on both the training data and the test data.
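Both kinds of analysis can be sketched on a toy function. The example assumes a hypothetical model with a strong dependence on its first input and a weak one on its second. It estimates the partial derivatives by central differences and uses the input-output correlation over uniform random samples as the global measure. None of the names or numbers come from the paper.

```python
import math
import random

def model(x):
    """Hypothetical fitted model: strong effect of x[0], weak effect of x[1]."""
    return 5.0 * x[0] + 0.5 * x[1] ** 2

def local_sensitivity(f, x, index, step=1e-4):
    """Central-difference estimate of the partial derivative of f at x."""
    up, down = list(x), list(x)
    up[index] += step
    down[index] -= step
    return (f(up) - f(down)) / (2 * step)

def pearson(a, b):
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / math.sqrt(var_a * var_b)

def global_sensitivity(f, bounds, n_samples=5000, seed=0):
    """Monte Carlo: sample inputs uniformly, correlate each input with the output."""
    rng = random.Random(seed)
    inputs = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_samples)]
    outputs = [f(x) for x in inputs]
    return [pearson([x[i] for x in inputs], outputs) for i in range(len(bounds))]

local = [local_sensitivity(model, [1.0, 2.0], i) for i in range(2)]
overall = global_sensitivity(model, [(0.0, 1.0), (0.0, 1.0)])
print(local)
print(overall)
```

At the point (1, 2) the partial derivatives are 5 and 2. Over the unit square the first input dominates the output variance, so its correlation with the output is close to 1 while the second input's is small.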

6.4 Sensitivity analysis for Problem 4

Sensitivity to variable correlation: cokriging relies on the correlation between several variables to improve interpolation accuracy. The strength and sign of that correlation (positive or negative) are sensitive factors. If the correlation between the variables is estimated inaccurately, the whole interpolation is affected.

7. Model Advantages and Disadvantages

7.1 Advantages

Kriging interpolation is flexible and makes full use of the spatial correlation of spatial variables, which it describes through the semivariogram. It captures the spatial structure of the data well and interpolates well for data with clear spatial distribution characteristics.

The Pearson correlation coefficient is relatively intuitive to compute and easy to understand and implement in practice.

Machine learning methods can learn patterns and regularities automatically from large amounts of data and can adapt to different types of data and problems.

Cokriging can combine information from several variables when they are spatially correlated, and so estimates the value at an unknown point more accurately than kriging with a single variable.

7.2 Disadvantages

The computation in the kriging interpolation model is relatively complex, and the semivariogram parameters are hard to determine.

When the relationship in the data is nonlinear, the Pearson correlation coefficient can mislead us into believing that the variables are unrelated.

The performance of machine learning algorithms depends heavily on the quality and quantity of the data, and the decision process and outputs of the model are hard to interpret.

Cokriging requires spatial data for several related variables and places high demands on data quality.
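The limitation of the Pearson coefficient noted above can be seen in a tiny worked example. Here y is completely determined by x, yet the linear correlation is zero because the relationship is symmetric rather than linear. The data are invented for illustration.

```python
import math

def pearson(a, b):
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / math.sqrt(var_a * var_b)

x = [-2.0, -1.0, 0.0, 1.0, 2.0]
y = [v ** 2 for v in x]   # y depends on x exactly, but not linearly
r = pearson(x, y)
print(r)
```

The positive and negative halves of x contribute covariance terms that cancel, so r is zero even though knowing x fixes y.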

References

[1] Cressie N. Statistics for Spatial Data[J]. Terra Nova, 1992. DOI:10.1111/j.1365-3121.1992.tb00605.x.

[2] Li Minli. Exploring the application of the linear correlation theorem in higher algebra[J]. Journal of Hubei Adult Education Institute, 2024, 30(05): 129-134. DOI:10.16019/j.cnki.cn42-1578/g4.2024.05.002.

[3] Su Chuanyun, He Xueming, Bi Sicun, et al. Application of the kriging interpolation method in the 9201 working face of Dongyi Coal Mine[J]. Energy Research and Management, 2024, 16(03): 140-145. DOI:10.16056/j.2096-7705.2024.03.022.

[4] Zhang Hanqi, Xu Yue, Zhong Min, et al. Low sonic boom optimization design based on a CoKriging surrogate model[J/OL]. Advances in Aeronautical Science and Engineering, 1-11[2024-11-17]. http://kns.cnki.net/kcms/detail/61.1479.V.20240403.1355.004.html.