Articles | Volume 11, issue 3
https://doi.org/10.5194/essd-11-1239-2019
Data description paper | 21 Aug 2019

A machine-learning-based global sea-surface iodide distribution
Tomás Sherwen, Rosie J. Chance, Liselotte Tinel, Daniel Ellis, Mat J. Evans, and Lucy J. Carpenter
Abstract

Iodide in the sea-surface plays an important role in the Earth system. It modulates the oxidising capacity of the troposphere and provides iodine to terrestrial ecosystems. However, our understanding of its distribution is limited due to a paucity of observations. Previous efforts to generate global distributions have generally fitted sea-surface iodide observations to relatively simple functions using proxies for iodide such as nitrate and sea-surface temperature. This approach fails to account for coastal influences and variation in the bio-geochemical environment. Here we use a machine learning regression approach (random forest regression) to generate a high-resolution (0.125° × 0.125°, 12.5 km × 12.5 km), monthly dataset of present-day global sea-surface iodide. As the dependent variable, we use a compilation of iodide observations (1967–2018) that has a 45 % larger sample size than has been used previously. As the independent variables, we use co-located ancillary parameters (temperature, nitrate, phosphate, salinity, shortwave radiation, topographic depth, mixed layer depth, and chlorophyll a) from global climatologies. We investigate the regression models generated using different combinations of ancillary parameters and select the 10 best-performing models to be included in an ensemble prediction. We then use this ensemble of models, combined with global fields of the ancillary parameters, to predict new high-resolution monthly global sea-surface iodide fields representing the present day. Sea-surface temperature is the most important variable in all 10 models. We estimate a global average sea-surface iodide concentration of 106 nM (with an uncertainty of ∼20 %), which is within the range of previous estimates (60–130 nM). Similar to previous work, higher concentrations are predicted for the tropics than for the extra-tropics. Unlike the previous parameterisations, higher concentrations are also predicted for shallow areas such as coastal regions and the South China Sea. Compared to previous work, the new parameterisation better captures observed variability. The iodide concentrations calculated here are significantly higher (40 % on a global basis) than those from the commonly used MacDonald et al. (2014) parameterisation, with implications for our understanding of iodine in the atmosphere. We envisage that these fields could be used to represent present-day sea-surface iodide concentrations in applications such as climate and air-quality modelling. The global iodide dataset is made freely available to the community (https://doi.org/10/gfv5v3, Sherwen et al., 2019), and as new observations are made, we will update the global dataset through a “living data” model.

1 Introduction

Iodine in seawater exists in two major forms, iodide (I⁻) and iodate (IO₃⁻). Total inorganic iodine (I⁻ + IO₃⁻) remains approximately constant across most of the oceans, but the ratio of iodide to iodate varies. Chance et al. (2014) have shown that the ratio varies with latitude, depth, and oxygen level. A small amount of iodine (<10 %) is thought to be present in organic forms in the open ocean (e.g. Wong, 1991); however, this may be a larger fraction in coastal waters (e.g. Wong and Cheng, 1998). The processes controlling the distribution of the ratio between iodide and iodate remain poorly understood (Chance et al., 2014).

A reason for gaps in our understanding is that the observational dataset of iodide and iodate remains relatively sparse (Chance et al., 2014, 2019a). Despite this paucity of observations, iodine's role in the Earth system has driven multidisciplinary interest in the distribution of iodine compounds in seawater from a number of different research communities, including paleoceanography (e.g. Lu et al., 2016, 2018; Zhou et al., 2015), atmospheric composition (e.g. Ganzeveld et al., 2009; Saiz-Lopez et al., 2014; Sherwen et al., 2016a), and air-quality prediction (e.g. Luhar et al., 2017, 2018; Sarwar et al., 2015; Sherwen et al., 2017b).

The atmospheric science community has seen a particularly large growth in interest in iodine chemistry in the atmosphere and at the sea-surface, as sea-surface I⁻ is believed to be the main driver of atmospheric iodine emissions. The reaction of I⁻ with ozone in the sea-surface micro-layer removes ozone from the atmosphere (dry deposition) (Ganzeveld et al., 2009) and results in the emission of inorganic iodine (HOI and I₂) into the atmosphere (Carpenter et al., 2013), which can subsequently catalytically destroy ozone (Chameides and Davis, 1980). Models have shown that this can lead to a feedback mechanism on the increase in ozone between the pre-industrial and the present day, counteracting human-driven increases in tropospheric ozone (Prados-Roman et al., 2015; Sherwen et al., 2017a). A number of model studies have discussed the impact of ocean-sourced iodine on atmospheric composition in the context of air quality (Gantt et al., 2017; Sarwar et al., 2016; Sherwen et al., 2017b), climate (Sherwen et al., 2017b; Saiz-Lopez et al., 2012), aerosols (Sherwen et al., 2017a), and stratospheric ozone (Saiz-Lopez et al., 2015). These atmospheric modelling studies have used relatively simple parameterisations for predictions of sea-surface iodide.

Early parameterisations for sea-surface iodide were based on limited datasets and used either an observed range of iodide concentrations (Coleman et al., 2010; Chang et al., 2004) or a reported relationship with biogeochemical parameters (e.g. chlorophyll in Oh et al., 2008, or nitrate in Ganzeveld et al., 2009). However, more recent attempts (Chance et al., 2014; MacDonald et al., 2014) have focused on using correlation analysis to fit compilations of observed iodide concentrations to a variety of commonly measured sea-surface variables, notably sea-surface temperature, but also chlorophyll, salinity, and nitrate. A summary of parameterisations that have been used in previous studies is given in Appendix Table A1. Compilation of all available observations confirmed a strong latitudinal gradient and identified sea-surface temperature as the strongest single predictor of iodide concentration (Chance et al., 2014). This approach has led to Eq. (1) from Chance et al. (2014) and Eq. (2) from MacDonald et al. (2014).

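$$[\mathrm{I^-_{aq}}]\ (\mathrm{nM}) = 0.225\,T(^{\circ}\mathrm{C})^{2} + 19 \tag{1}$$

$$[\mathrm{I^-_{aq}}]\ (\mathrm{nM}) = 1.46 \times 10^{6}\,\exp\!\left(\frac{-9134}{T(\mathrm{K})}\right) \times 1 \times 10^{9} \tag{2}$$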

Figure 1 shows the global annual mean distribution of sea-surface iodide calculated using these parameterisations (Eqs. 1 and 2) and sea-surface temperature fields (Locarnini et al., 2013). Although both equations predict a similar distribution (higher concentrations in tropical waters and lower in polar waters), Eq. (1) generally predicts iodide concentrations 2–4 times higher than Eq. (2). In developing Eq. (1), Chance et al. (2014) compiled iodide observations from both coastal and non-coastal sites. However, Eq. (2) used a relatively small subset (14 %) of these observations, which did not include coastal sites, and this may explain the lower concentrations. Equation (2) also has an Arrhenius form, which could be interpreted to suggest that iodide concentrations are controlled by abiotic reaction kinetics. However, this has not been demonstrated, and Chance et al. (2014) discussed how microbiological activity and oceanic mixing are currently thought to be the primary controls. The choice of parameterisation (Eq. 2 versus Eq. 1) results in a difference of 50 % in the calculated global emissions of iodine into the atmosphere (Sherwen et al., 2016a).
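As a minimal sketch of how the two temperature-only parameterisations compare, the snippet below evaluates Eqs. (1) and (2) at a few sea-surface temperatures. The function names and the choice of example temperatures are illustrative assumptions, not part of the original work; only the coefficients come from the equations above.

```python
import math


def iodide_chance2014(sst_celsius):
    """Eq. (1): iodide (nM) from sea-surface temperature in degrees Celsius."""
    return 0.225 * sst_celsius ** 2 + 19.0


def iodide_macdonald2014(sst_kelvin):
    """Eq. (2): iodide (nM) from sea-surface temperature in kelvin.

    The Arrhenius-type expression gives mol/L; the factor 1e9 converts to nM.
    """
    return 1.46e6 * math.exp(-9134.0 / sst_kelvin) * 1e9


# Example temperatures are arbitrary illustrative values (assumption).
for sst_c in (0.0, 15.0, 25.0):
    eq1 = iodide_chance2014(sst_c)
    eq2 = iodide_macdonald2014(sst_c + 273.15)
    print(f"{sst_c:5.1f} degC  Eq.1 = {eq1:6.1f} nM  "
          f"Eq.2 = {eq2:6.1f} nM  ratio = {eq1 / eq2:.1f}")
```

Note the unit difference between the two inputs: Eq. (1) takes degrees Celsius and Eq. (2) takes kelvin.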


Figure 1. Annual average sea-surface iodide concentrations predicted by (a) Eq. (1) from Chance et al. (2014) and (b) Eq. (2) from MacDonald et al. (2014). Temperature fields used to make spatial predictions were from the World Ocean Atlas (Locarnini et al., 2013). Earth raster and vector map data used are freely available from Natural Earth (http://www.naturalearthdata.com/, last access: 23 July 2019).

Table 1 reference column (remaining table body not recovered): Locarnini et al. (2013); Zweng et al. (2013); Garcia et al. (2010); Becker et al. (2009); Smith and Sandwell (1997); Garcia et al. (2014); Garcia et al. (2014); Garcia et al. (2014); OBPG (2014); Monterey and Levitus (1997); Large and Yeager (2009)

Table 1. Ancillary variables extracted onto a global 0.125° × 0.125° (12.5 km × 12.5 km) grid on a monthly basis.

Expansion of acronyms. WOA: World Ocean Atlas, SeaWiFS: Sea-viewing Wide Field-of-view Sensor, GEBCO: General Bathymetric Chart of the Oceans, NOMADS: NOAA National Operational Model Archive and Distribution System. (*) Three available mixed layer depth (MLD) definitions in WOA (vd: variable potential density, pt: potential temperature, pd: potential density) were processed from comma-separated value (CSV) to NetCDF files and extracted. Following Chance et al. (2014), the monthly sum and maximum MLD were also computed (vd, pt, pd) and used for building predictions of iodide. When just the variable MLD is shown, it is MLD as defined by potential temperature. n/a: not applicable.


Considering the need for spatially resolved sea-surface iodide fields by models and the paucity of observations, parameterisations are required that can yield predictions from ancillary variables. This is a regression problem and a number of approaches are available. Conventional linear and linear multivariate approaches have been used in the past (e.g. see summary in Appendix Table A1). However, they need to assume a functional relationship between the dependent and independent variables. Another approach is machine learning, which uses algorithms to build predictive models. These algorithms take a different approach and use a non-parametric formulation. Machine learning approaches range from interpretable options such as the random forest algorithm (Breiman, 2001) to less interpretable ones such as artificial neural networks (Gardner and Dorling, 1998). On the more interpretable end, machine learning algorithms are being used increasingly within environmental sciences, with recent examples including linear ridge regression and random forest models to replace computationally expensive processes (Keller and Evans, 2019; Nowack et al., 2018) and Gaussian process emulation to explore model biases on a global scale (Lee et al., 2011; Revell et al., 2018).

Table 2. Splits of dataset used to evaluate outliers and their performance against the withheld data. The root mean square error (RMSE) statistic is given as the mean of the performance against the withheld data for 20 different models built from 20 different pseudo-random initialisations (Sect. 3.2). The model used here includes ancillary variables of temperature, depth, and salinity, which were thought to intuitively give a reasonable result. “No.” gives the number of samples in each dataset.


Here, we use a recently expanded compilation of sea-surface iodide observations (Chance et al., 2019a) to build a new sea-surface iodide parameterisation using a data-driven machine learning approach. We choose to use the random forest regressor (RFR) algorithm (Breiman, 2001; Pedregosa et al., 2011), which is relatively simple and produces results that are also easy to understand. We aim to be able to predict global sea-surface iodide based on observations and ancillary physical and chemical variables (e.g. sea-surface temperature, depth, and salinity) from a number of publicly available sources. We first describe the input datasets we use (Sect. 2), then we explain the methodology taken (Sect. 3), and finally we present the predictions at observational locations and globally (Sect. 4). This product should be considered a present-day climatology representing the period of the iodide observations (1967–2018), which we envisage could be useful in applications such as climate and air-quality modelling. We make the resulting high-resolution, global, monthly dataset of predicted iodide available to the community (Sherwen et al., 2019; https://doi.org/10/gfv5v3). When new observations become available, they will be incorporated into the model, and updated versions will be provided through a “living data” model.

2 Input datasets

Chance et al. (2019a) provide a compilation of the available 1342 sea-surface (<20 m depth) iodide observations between 1967 and 2018. The dataset is available from the British Oceanographic Data Centre (BODC, Chance et al., 2019b; https://doi.org/10/czhx). It includes 45 % more data points, and has greater spatial coverage, than the previous compilation of 925 observations (Chance et al., 2014). Observations are categorised in Chance et al. (2019a) as “coastal” or “non-coastal”, according to the designation of their static Longhurst biogeochemical province (Longhurst, 1998). We adopt the same categorisation here. This sea-surface iodide dataset then forms the dependent variable for our regression. We assume no inter-annual variability and use all data from all years (1967–2018) in this work.
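As a quick check, the quoted 45 % increase follows directly from the two sample sizes given above (1342 and 925 observations):

$$\frac{1342 - 925}{925} = \frac{417}{925} \approx 0.45$$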

We require a number of physical, chemical, and biological parameters as the independent variables in our regression models. Consistent in situ measurements of these parameters are not available for the iodide observations. Thus we have used a number of ancillary datasets (Table 1) to provide this information. There are a number of criteria for these datasets: they need to be available as a gridded product at a resolution appropriately similar to the desired resolution of the predicted fields, they need to represent potential processes that could control iodide concentrations, and they need to be in some way orthogonal to the other independent variables. Gridded datasets of dissolved organic carbon (e.g. Roshan and DeVries, 2017) and phytoplankton primary productivity (e.g. Behrenfeld and Falkowski, 1997) may have some usefulness, but they themselves are built using statistical models with other variables and thus we do not use those here. The selected ancillary variables (Table 1) were first extracted from their native resolution using the nearest-neighbour method onto a consistent high-resolution monthly grid (0.125° × 0.125°, 12.5 km × 12.5 km). This horizontal resolution was used as this is the highest resolution of the current generation of global atmospheric chemistry simulations (Hu et al., 2018) and is also used for regional-scale air-quality studies (e.g. Li et al., 2019). We calculate monthly means because the chemical lifetime of iodide in the surface oceans is thought to be at least several months (Campos et al., 1996; Žic et al., 2013), and possibly years (Edwards and Truesdale, 1997; Tsunogai and Henmi, 1971). Indeed, the lifetime of iodide is thought to be sufficiently long that, where deep vertical mixing occurs on a seasonal timescale, this may be the dominant loss process from surface waters (e.g. Chance et al., 2010). The values for bathymetric ocean depth were set to a minimum depth of 2 m, to remove terrestrial locations, and the same value was used for all months.
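The nearest-neighbour extraction onto a finer grid can be sketched as follows. This is a pure-Python illustration on a tiny two-by-two field, assuming regular grids described by their cell-centre coordinates; the helper names (`cell_centres`, `nearest_index`, `regrid_nearest`) and the toy data are hypothetical and are not taken from the authors' code.

```python
def cell_centres(start, stop, step):
    """Cell-centre coordinates of a regular grid spanning [start, stop)."""
    n_cells = round((stop - start) / step)
    return [start + step * (i + 0.5) for i in range(n_cells)]


def nearest_index(coords, value):
    """Index of the coordinate closest to value."""
    return min(range(len(coords)), key=lambda i: abs(coords[i] - value))


def regrid_nearest(field, src_lats, src_lons, dst_lats, dst_lons):
    """Copy, for every destination cell, the value of the nearest source cell."""
    lat_idx = [nearest_index(src_lats, lat) for lat in dst_lats]
    lon_idx = [nearest_index(src_lons, lon) for lon in dst_lons]
    return [[field[i][j] for j in lon_idx] for i in lat_idx]


# Toy example: a 1-degree field covering 0-2 degrees in latitude and longitude.
coarse_lats = cell_centres(0.0, 2.0, 1.0)
coarse_lons = cell_centres(0.0, 2.0, 1.0)
coarse_field = [[1.0, 2.0],
                [3.0, 4.0]]

fine_lats = cell_centres(0.0, 2.0, 0.125)
fine_lons = cell_centres(0.0, 2.0, 0.125)
fine_field = regrid_nearest(coarse_field, coarse_lats, coarse_lons,
                            fine_lats, fine_lons)
```

Nearest-neighbour extraction copies values rather than interpolating them, so no new values are introduced into the ancillary fields. A full global field would normally be handled with an array library rather than nested lists.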

For each iodide observation, the nearest point in space and time was extracted from the high-resolution gridded ancillary data. For the 31 iodide observations where a month was not available (Luther and Cole, 1988; Tsunogai and Henmi, 1971; Wong and Cheng, 1998), an arbitrary month was chosen (March for Northern Hemisphere observations and September for Southern Hemisphere observations). Outliers within the observations are removed as described in Sect. 3.3. A further single dataset (Truesdale et al., 2003) was also excluded from this analysis. This is discussed in Appendix Sect. A1.
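The month-filling rule described above can be written as a small helper. This is a sketch only: the function name is hypothetical, and treating observations exactly on the equator as northern hemispheric is an assumption that the text does not specify.

```python
def fill_missing_month(month, latitude):
    """Return the observation month, substituting a default when it is unknown.

    March (3) is used for northern hemispheric observations and
    September (9) for southern hemispheric ones. Latitude 0 is treated
    as northern here, which is an assumption.
    """
    if month is not None:
        return month
    return 3 if latitude >= 0.0 else 9


# Illustrative (month, latitude) pairs, not real observations.
observations = [(None, 41.5), (None, -33.0), (7, 12.0)]
months = [fill_missing_month(m, lat) for m, lat in observations]
```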

3 Methods

Here we first explain the way in which we use the machine learning algorithm (Sect. 3.1). We then explain how we have calculated uncertainty (Sect. 3.2), how observations considered outliers have been removed from the data (Sect. 3.3), and how we have decided which ancillary variables (temperature, salinity, etc.) to use as independent variables for an ensemble prediction (Sect. 3.4). Finally we describe the interpretable ensemble prediction model that results from this methodology in both numerical and graphical terms (Sect. 3.5).

3.1 Random forest regressor algorithm

As the aim here is to predict a continuous numerical value for sea-surface iodide, a regression approach is taken. As discussed in the introduction, previous approaches have been made to parameterise sea-surface iodide, and the most commonly used relationships employ sea-surface temperature as the predictor variable. Here we take a different multivariate and non-parametric approach, using the computationally cheap and interpretable random forest regressor (RFR) algorithm (Breiman, 2001; Pedregosa et al., 2011).

Random forest regression is based on finding a number of decision trees, which predict the dependent variable. All of the trees contribute to the prediction, and they are collectively referred to as a “forest”. These trees can be explained as a record of the way the algorithm has linearly traversed a subset of the training data, splitting the data into two parts at each decision point or “node” in a way that minimises the internal differences of the parts. The best split is chosen between the available variables based on an error metric (e.g. mean square error), and this process is continued until a criterion of purity is reached or a minimum number of data points are left from a split. Building each tree is essentially a classification problem. The prediction of the forest is the mean of the predictions of all of the different decision trees, which turns the result into a continuous, regression-style output. More details of this approach can be found in Friedman et al. (2009).
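The two ingredients described above, choosing a split by an error metric and averaging the predictions of the trees, can be illustrated with a short pure-Python sketch. It handles a single predictor and a single split only, the function names are hypothetical, and the toy numbers are invented for illustration; it is not the implementation used by the authors.

```python
def sum_squared_error(values):
    """Sum of squared deviations from the mean (zero for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(x, y):
    """Find the threshold on x that minimises the mean square error of y.

    Returns (threshold, mse), where mse is the pooled mean square error of
    the two parts after splitting at x <= threshold versus x > threshold.
    """
    pairs = sorted(zip(x, y))
    best = None
    for k in range(1, len(pairs)):
        if pairs[k - 1][0] == pairs[k][0]:
            continue  # cannot split between identical predictor values
        left = [target for _, target in pairs[:k]]
        right = [target for _, target in pairs[k:]]
        mse = (sum_squared_error(left) + sum_squared_error(right)) / len(pairs)
        threshold = (pairs[k - 1][0] + pairs[k][0]) / 2.0
        if best is None or mse < best[1]:
            best = (threshold, mse)
    return best


def forest_prediction(tree_predictions):
    """A forest's prediction is the mean of its trees' predictions."""
    return sum(tree_predictions) / len(tree_predictions)


# Invented toy data: temperature (degC) against iodide (nM).
temperature = [5.0, 8.0, 12.0, 22.0, 25.0, 28.0]
iodide = [30.0, 30.0, 30.0, 150.0, 150.0, 150.0]
threshold, mse = best_split(temperature, iodide)
```

In a full tree the same search is repeated over every available variable at every node, and each child part is then split again until a stopping criterion is met.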

This approach differs from previous approaches which have individually tested proposed relationships and selected the best-performing model(s) as a parameterisation (e.g. Table A1). Here, an algorithm uses the data it is provided to build a model that gives a prediction, and therefore it is the data themselves that define the model that is used to predict new values. A key difference of this approach is also that only a subset, the training set, is used to build the model, and the rest (or withheld set) is then used to test the performance of the model. Here we use 80 % of the data for the training set and use the remaining 20 % as the withheld set (also commonly referred to as the “testing set”).

To ensure that the models built are generalisable and mitigate overfitting, the random forest approach used here artificially increases the randomness within the forest (Pedregosa et al., 2011). This is done by randomly combining single decision trees by an approach referred to as “bootstrap aggregation” or “bagging” (Breiman, 2001; Tong et al., 2003). This additional bagging approach randomly samples observations within the training dataset and so mitigates overfitting of the trees to the dataset (Friedman et al., 2009). Furthermore, a stratified sampling approach was taken. Specifically, the overall dataset was split into quartiles according to iodide concentration value, and training data were randomly selected from each quartile. This approach was used to maintain the same statistical distribution in the training/testing data as the overall dataset.
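A stratified 80/20 split of this kind could look like the following sketch, which sorts the observations by iodide concentration, cuts them into four equal-count bins, and draws the training fraction at random from each bin. The function name, the equal-count binning, and the seed handling are assumptions made for illustration.

```python
import random


def stratified_split(values, train_fraction=0.8, n_bins=4, seed=0):
    """Split indices into training and withheld sets, stratified by value.

    The indices are ordered by value and cut into n_bins equal-count bins
    (quartiles when n_bins is 4); train_fraction of each bin is then drawn
    at random for training and the rest is withheld.
    """
    rng = random.Random(seed)
    order = sorted(range(len(values)), key=lambda i: values[i])
    train, withheld = [], []
    for b in range(n_bins):
        lower = b * len(order) // n_bins
        upper = (b + 1) * len(order) // n_bins
        bin_indices = order[lower:upper]
        rng.shuffle(bin_indices)
        cut = round(train_fraction * len(bin_indices))
        train.extend(bin_indices[:cut])
        withheld.extend(bin_indices[cut:])
    return sorted(train), sorted(withheld)


# Invented concentrations (nM), used only to exercise the function.
concentrations = [float(v) for v in range(100)]
train_idx, withheld_idx = stratified_split(concentrations, seed=1)
```

Changing `seed` gives a different pseudo-random split while keeping the same proportion from each quartile.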

Machine learning algorithms can generally be tuned to increase performance using settings called hyperparameters. However, random forests are known to generally perform well without tuning. The default hyperparameters were therefore used here (Pedregosa et al., 2011), except for increasing the number of trees (“n_estimators”) from 10 to 500. Mean square error (MSE) was used as the criterion for evaluating each split (also referred to as a “node”). The maximum number of “features” (the ancillary variables provided to the algorithm, such as temperature or nitrate concentration) considered when looking for the best split is set to the number provided to the algorithm. The number of splits a tree is allowed to make (“max_depth”) is not restricted, and further nodes are made until leaves contain fewer than two samples (“min_samples_split”) and a minimum of one (“min_samples_leaf”). All the random forest models are built using bootstrapping.
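The settings described above can be collected into a dictionary keyed by scikit-learn's `RandomForestRegressor` argument names. This is a sketch of the configuration as read from the text, not the authors' script. The value used for `criterion` depends on the installed scikit-learn release: recent releases call the MSE criterion `"squared_error"`, whereas older ones used `"mse"`.

```python
# Hyperparameters as described in the text, keyed by scikit-learn argument names.
rfr_settings = {
    "n_estimators": 500,           # number of trees, raised from the old default of 10
    "criterion": "squared_error",  # mean square error; named "mse" in older releases
    "max_features": None,          # None means all provided features are considered
    "max_depth": None,             # tree depth is not restricted
    "min_samples_split": 2,        # a node with fewer than two samples is not split
    "min_samples_leaf": 1,         # every leaf keeps at least one sample
    "bootstrap": True,             # each tree is built on a bootstrap sample
}

# With scikit-learn installed, a model could then be created with:
#     from sklearn.ensemble import RandomForestRegressor
#     model = RandomForestRegressor(**rfr_settings)
```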

3.2 Error and uncertainty calculations

Understanding the errors and uncertainties in the global iodide distribution is important because the modelled Earth system may be sensitive to this value. We consider three sources of error in our predictions: the “dataset selection” error due to the splitting of the dataset into training and withheld parts, the “model selection” error due to the choice of independent (ancillary) variables, and the “observational error” on the iodide measurements.

To quantify the range of the dataset selection error, we construct models from 20 pseudo-random splits of the dataset into training and withheld parts. The hyperparameters and input ancillary variables are kept the same for the generation of the 20 models, so that the only difference between the models is the training dataset. These 20 models are then used to predict the withheld data. Performance metrics (e.g. root mean square error (RMSE) and average absolute prediction) can then be calculated for each model. This gives a range of 20 values, which can then be converted to a percentage range as the error. This is done by dividing the largest range in predicted values for a model by the minimum predicted value, to give a maximum value for the range of error, and then taking the smallest range in the predicted values and dividing this by the maximum value, to give a minimum value for the error range. Significant differences between the model's performance metrics would suggest important sensitivity to the training/withheld dataset splits.
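One possible reading of the percentage-range calculation is sketched below: the spread (maximum minus minimum) of a metric across the model builds is divided by the smallest value to give the upper end of the error range, and by the largest value to give the lower end. The text leaves room for other interpretations, so this should be taken as an assumption; the function names and the example RMSE values are invented for illustration.

```python
import math


def rmse(predicted, observed):
    """Root mean square error between two equal-length sequences."""
    squared = [(p - o) ** 2 for p, o in zip(predicted, observed)]
    return math.sqrt(sum(squared) / len(squared))


def percent_error_range(values):
    """Convert the spread of a metric across model builds to a percentage range.

    Returns (lower, upper): the spread divided by the largest value and by
    the smallest value, respectively, both expressed in percent.
    """
    spread = max(values) - min(values)
    return 100.0 * spread / max(values), 100.0 * spread / min(values)


# Invented RMSE values (nM) standing in for the 20 pseudo-random model builds.
rmse_per_build = [36.0, 37.5, 38.2, 40.0]
lower_pct, upper_pct = percent_error_range(rmse_per_build)
```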

We define the “model selection” error as the uncertainty resulting from the choice of input ancillary variables. A number of combinations of input variables are possible in generating the models, and each will generate a different prediction. We quantify this error as the difference in performance against the withheld dataset and prediction value (e.g. average global value). Similarly to our calculation of the dataset selection error, this can be converted to percentage error by considering the range in these values and dividing them by minimum and maximum values.

For the observational error we refer to Chance et al. (2019a), who provide individual error estimates for each of the iodide observations in the data compilation. Over half (51 %) of the data points have an error of 5 % or less, and a further ∼25 % have an uncertainty in the range of 5 %–10 %. We therefore use a value of 10 % as a conservative estimate of the observational error.

3.3 Outlier identification and removal

Our dataset consists of values for ancillary variables and iodide concentration for all of the 1342 measurement locations in the observational dataset (Sect. 2). As discussed in Sect. 3.1, we split this dataset into two parts: (i) a training set for use in building and optimising models, and (ii) a withheld set to evaluate the models built. Particular care was taken to ensure the withheld and training datasets were representative of the entire dataset in the way the models are built, therefore improving performance and generalisability to unseen data (see Sect. 3.1).

We take a random forest regressor model built with variables that were intuitively assumed to give a reasonable ability to differentiate the observations (using depth, temperature, and salinity as the independent variables – abbreviated to “RFR(DEPTH+TEMP+SAL)” following Table 1). The RFR(DEPTH+TEMP+SAL) model was then used to explore the variation of error in the predictions using the dataset selection error approach described in Sect. 3.2. This builds multiple versions of the same model with different splits of training and test data and yields a distribution of root mean square error in the predicted iodide for withheld data as summarised in the final column of Table 2 and shown graphically in Appendix Fig. A1.

We define outliers here as values greater than the third quartile plus 1.5 times the interquartile range (Frigge et al., 1989). Removing the 49 values categorised as outliers (>309.5 nM) greatly reduces the RMSE of the ensemble prediction, from 95.1 to 37.6 nM (Table 2). This is shown graphically in Appendix Fig. A1, alongside the other subsets of the data explored (Table 2). This demonstrates that the high values are not represented well enough in the dataset to be captured by the RFR approach. The removal of these high values from the dataset can also be justified because the driver of these concentrations is not yet well understood (Chance et al., 2014, 2019c; Cutter et al., 2018).
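
As a minimal sketch of this outlier definition, the upper fence (third quartile plus 1.5 times the interquartile range) can be computed as below. The function names and the example concentrations are invented for illustration, and the quartile interpolation method (`method="inclusive"`) is an assumption: different quartile conventions give slightly different fences, so this sketch is not expected to reproduce the 309.5 nM threshold without the original data and convention.

```python
import statistics


def upper_outlier_threshold(values):
    """Return Q3 + 1.5 * IQR, the upper fence used to flag high outliers."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q3 + 1.5 * (q3 - q1)


def split_outliers(values):
    """Separate values into (kept, outliers) using the upper fence only."""
    fence = upper_outlier_threshold(values)
    kept = [v for v in values if v <= fence]
    outliers = [v for v in values if v > fence]
    return kept, outliers


# Made-up concentrations in nM, for illustration only.
demo_iodide = [1, 2, 3, 4, 5, 6, 7, 8, 100]
demo_kept, demo_outliers = split_outliers(demo_iodide)
print(upper_outlier_threshold(demo_iodide), demo_outliers)
```

Only the upper fence is applied here, matching the definition in the text, which flags unusually high values rather than low ones.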

Removing these outliers reduces the spread in RMSE (third quartile minus first quartile) across the 20 independent model builds from 48.2 to 2.3 nM. Once these outliers are excluded, only more modest changes in average RMSE are seen if models are built using only coastal or only non-coastal data. Figure A1 shows that the same holds when lower-salinity data, which are indicative of estuarine water, are also removed (“Salinity ≥30 PSU and no outliers”). This highlights the ability of this approach to predict iodide in different biogeochemical regions (i.e. not just coastal or non-coastal locations).
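
One reading of the quantity quoted as “third quartile–first quartile” can be written out as follows. The notation is an assumption introduced here for illustration: $k$ indexes the 20 independent model builds, $n_k$ is the number of withheld observations in build $k$, and $\hat{y}_{k,i}$ and $y_{k,i}$ are the predicted and observed iodide concentrations.

```latex
\mathrm{RMSE}_k = \sqrt{\frac{1}{n_k}\sum_{i=1}^{n_k}\left(\hat{y}_{k,i} - y_{k,i}\right)^2},
\qquad k = 1, \dots, 20,
\qquad
\Delta_{\mathrm{RMSE}} = Q_3\left(\{\mathrm{RMSE}_k\}\right) - Q_1\left(\{\mathrm{RMSE}_k\}\right)
```

Under this reading, $\Delta_{\mathrm{RMSE}}$ measures how strongly the model error depends on which observations happen to be withheld, rather than the size of the error itself.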

A single additional dataset of 19 observations from the Skagerrak strait (Truesdale et al., 2003) was removed because it exerted a disproportionate influence on iodide prediction in high northern latitudes (65° N), an area that is almost entirely unconstrained by local observations. We note that the Skagerrak is relatively unusual oceanographically, being an estuarine location with high ship traffic, and is considered unlikely to be an analogue for iodine speciation in the Arctic. This decision is discussed further in the Appendix (Sect. A1), and the predictions made including this dataset are also provided in the shared output (Sect. 5).

From here, only the 1293 observational points excluding outliers and the data from the Skagerrak strait (Truesdale et al., 2003) are used.

3.4 Selection of ancillary variables and building an ensemble prediction

To decide which ancillary variables (temperature, salinity, etc.; see Table 1 and Sect. 2) should be used to predict sea-surface iodide concentration, RFR models were built and evaluated with different combinations of variables. Thirty-eight combinations were considered (see first column of Appendix Table A2).
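
The selection step can be sketched as below, under several assumptions. The sketch enumerates every combination of three variables, whereas the 38 combinations actually considered are those listed in Appendix Table A2. The RMSE values are placeholders rather than results, and combining the selected models by a simple mean is an assumption about how the ensemble prediction is formed. All function names are hypothetical.

```python
from itertools import combinations
from statistics import fmean


def candidate_feature_sets(variables, min_size=1):
    """All combinations of the given variables with at least min_size members."""
    return [
        combo
        for size in range(min_size, len(variables) + 1)
        for combo in combinations(variables, size)
    ]


def select_best(scores, n_best):
    """Return the n_best feature sets with the lowest withheld-data RMSE."""
    return sorted(scores, key=scores.get)[:n_best]


def ensemble_predict(models, row):
    """Average the predictions of the selected models for one location (assumed)."""
    return fmean(model(row) for model in models)


variables = ["TEMP", "DEPTH", "SAL"]
feature_sets = candidate_feature_sets(variables)

# Placeholder RMSE values in nM, invented for illustration; in practice each
# value would come from evaluating the corresponding model on withheld data.
placeholder_rmse = {combo: 40.0 + 5.0 * i for i, combo in enumerate(feature_sets)}
best = select_best(placeholder_rmse, n_best=2)
print(best)
```

Ranking by withheld-data RMSE mirrors the criterion stated in the text; the number of models retained (`n_best`) is a free parameter of the sketch.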

The top 20 performing models, based on their root mean square error against the withheld data, are plotted in Fig. 2, alongside existing parameterisations. The standard deviation for all predicted values is also shown to illustrate variation in the predictions. A complete list of all models built here and their performance is given in the appendix (Table