Elsevier

Water Research

Volume 205, 15 October 2021, 117666

Review
Machine learning in natural and engineered water systems

https://doi.org/10.1016/j.watres.2021.117666

Highlights

  • Reviewed applications of machine learning in natural water systems.
  • Reviewed applications of machine learning in engineered water systems.
  • Briefly introduced the establishment of machine learning models.
  • Analyzed the advantages and disadvantages of common algorithms.
  • Made suggestions on intelligent development of water science with machine learning.

Abstract

Water resources of desired quality and quantity are the foundation for human survival and sustainable development. To better protect the water environment and conserve water resources, efficient water management, purification, and transportation are of critical importance. In recent years, machine learning (ML) has exhibited its practicability, reliability, and high efficiency in numerous applications; furthermore, it has solved conventional and emerging problems in both natural and engineered water systems. For example, ML can predict various water quality indicators in situ and real-time by considering the complex interactions among water-related variables. ML approaches can also solve emerging pollution problems with proven rules or universal mechanisms summarized from the related research. Moreover, by applying image recognition technology to analyze the relationships between image information and physicochemical properties of the research object, ML can effectively identify and characterize specific contaminants. In view of the bright prospects of ML, this review comprehensively summarizes the development of ML applications in natural and engineered water systems. First, the concept and modeling steps of ML are briefly introduced, including data preparation, algorithm selection and model evaluation. In addition, comprehensive applications of ML in recent studies, including predicting water quality, mapping groundwater contaminants, classifying water resources, tracing contaminant sources, and evaluating pollutant toxicity in natural water systems, as well as modeling treatment techniques, assisting characterization analysis, purifying and distributing drinking water, and collecting and treating sewage water in engineered water systems, are summarized. 
Finally, the advantages and disadvantages of commonly used algorithms are analyzed according to their structures and mechanisms, and recommendations on the selection of ML algorithms for different studies, as well as prospects on the application and development of ML in water science are proposed. This review provides references for solving a wider range of water-related problems and brings further insights into the intelligent development of water science.

Keywords

Machine learning
Natural water systems
Engineered water systems


1. Introduction

Water is one of the most indispensable material resources on which the survival and development of humanity depend. However, the increasing discharge of pollutants caused by human activities has led to growing water pollution, which threatens the sustainable development of ecosystems and human society. To protect ecological safety and human health from water pollution, a range of measures have been adopted. In natural water systems, the quality of various waters (e.g., rivers, lakes, groundwaters, and seawaters) is closely monitored for better management and utilization of water resources. In addition, water pollution is controlled through the identification, source tracking, and toxicity evaluation of pollutants. In engineered water systems, raw water taken from natural waters is purified in drinking water treatment plants (DWTPs) using various treatment processes (e.g., coagulation, sedimentation, filtration, and disinfection) to remove contaminants, and is then supplied to users via distribution systems. Additionally, sewage and wastewater are collected and treated in wastewater treatment plants (WWTPs) through a series of physical, chemical, and biological processes to reduce pollution of the urban and natural environment. In the course of natural water management, water resource utilization, and polluted water treatment, a large number of studies have been conducted, gradually developing into fields related to natural and engineered water systems.
Currently, “the fourth industrial revolution”, combined with big data and artificial intelligence, is expected to bring immense changes to human society. Machine learning (ML) is one of the technical approaches of artificial intelligence, and it is built on various algorithms grounded in mathematical and statistical knowledge. ML can predict the status of new data by summarizing the underlying relationships and rules within known data, and its prediction performance improves as the amount of data increases and the algorithms are iterated (Vamathevan et al., 2019). ML can solve complex problems involving massive nonlinear processes or combinatorial spaces that conventional methods cannot solve, or can solve only at great cost in time and resources. Therefore, ML has been widely used in fields such as computer vision, speech recognition, natural language processing, and robotic control (Jordan and Mitchell, 2015a).
In recent years, ML has also been applied to solve a wide variety of problems in the water science field. To understand how ML is applied in this field, and the applicability and feasibility of various algorithms for solving water-related problems, it is necessary to review and summarize the existing studies, and several papers have already done so. For example, many publications have reviewed the application of artificial neural networks (ANNs) and other algorithms, such as the fuzzy inference system (FIS), evolutionary algorithms, the support vector machine (SVM), random forest (RF), decision tree (DT), and ML coupled with wavelet transformation or optimization algorithms, for modeling water quality in rivers, lakes, and groundwater (Chau 2006; Che Osmi et al., 2016; Chen et al., 2020b; Ighalo et al., 2020; Maier and Dandy, 2000; Maier et al., 2010, 2014; Nicklow et al., 2010; Ostfeld and Solomatine, 2008; Raghavendra and Deka, 2014; Rajaee et al., 2020; Tiyasha, Tung and Yaseen, 2020). In addition, the applications of ML in other areas of water-related research have also been reviewed, such as remote sensing for monitoring water quality (Sagan et al., 2020; Wagle et al., 2020), drinking water treatment (Dogo et al., 2019; Li et al., 2021), seawater desalination (Al Aani et al., 2019), and wastewater treatment and pollutant removal (Fan et al., 2018; Khataee and Kasiri, 2011; Wang et al., 2021b; Yaseen 2021). All of these reviews help researchers understand and expand the application of artificial intelligence in the water science field. However, many other published applications of ML, such as groundwater contaminant mapping (Podgorski and Berg, 2020), contaminant source tracing (Balleste et al., 2020), pollutant toxicity evaluation (Wang et al., 2021d), contaminant identification (Baek et al., 2021), and others, have not been summarized and discussed. Moreover, many review papers have focused on the application of a single algorithm and thus lack a comparison of its advantages and disadvantages with those of other algorithms. Many ML algorithms have been developed, and their performance on practical tasks varies. Therefore, an analysis of the scope of applicability of various algorithms for different water-related problems is required.
This article provides a comprehensive review of ML applications in related fields of water science and summarizes representative studies from recent years. First, the process of establishing ML models, including data preparation, algorithm selection, and model evaluation, is briefly introduced. In addition, the representative applications of ML in water quality prediction and management in natural water systems, as well as in technology development and operation monitoring in engineered water systems, are summarized. Specifically, in natural water systems, ML has been used to predict water quality indicators, map pollutant distribution in groundwater, classify water resources, trace contaminant sources, and evaluate pollutant toxicities. In engineered water systems, ML has been applied to optimize adsorption and oxidation processes, assist laboratory characterization analyses, and improve the operation and management of drinking water purification and distribution, as well as wastewater collection and treatment. More importantly, the advantages and disadvantages of representative algorithms are discussed, and their applicability to different data and research tasks is analyzed by comparing their structures and mechanisms. Finally, current research hotspots, challenges, and prospects of ML combined with water utilization and pollution control are discussed.

2. An overview of machine learning

ML refers to a technology that uses programmed algorithms to learn the hidden associations within given data through automatic mathematical analysis, and then uses this learned experience to predict patterns in new data (Jordan and Mitchell, 2015b). In general, to recognize the rules underlying the known data as accurately as possible, the data should first be properly treated to generate the dataset. Then an appropriate ML algorithm is selected according to the characteristics of the input data and the requirements of the output data. The selected algorithm is then trained with the prepared data and evaluated to adjust its hyper-parameters, thus generating the desired model. Afterward, the resulting ML model can make predictions on new data. The establishment of the ML model is briefly introduced in the subsequent sections; readers who want more detail about ML algorithms can refer to specialized books and papers on statistics or machine learning (Hastie et al., 2008; Mohri et al., 2012).

2.1. Data preparation

An ideal ML application requires appropriate and well-trained models, but the quality and quantity of data are also vitally important. Currently, data can be crawled from the internet, collected from the literature, retrieved from open-source databases, or recorded from experiments. However, raw data typically contain missing values, errors, duplicates, or noise, which should be handled through data cleansing. Then, feature engineering is conducted based on professional background knowledge to select or extract data features according to the demands of the task. Finally, the prepared data is typically divided into a training set, a validation set (for some algorithms), and a test set. The training set is used to train the model based on the selected ML algorithm, while the validation set is used to tune hyper-parameters to optimize the trained model. After the training process, the predictive and generalization ability of the trained model is evaluated on the test set by comparing the predicted outputs with their corresponding known results.
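The cleansing and splitting steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the record layout (temperature, pH, DO), the 70/15/15 split ratio, and the function names `clean` and `split` are hypothetical and do not come from the reviewed studies.

```python
import random

def clean(records):
    """Drop records with missing values and exact duplicates, keeping order."""
    seen = set()
    cleaned = []
    for rec in records:
        if any(value is None for value in rec):
            continue  # missing value
        if rec in seen:
            continue  # duplicate of an earlier record
        seen.add(rec)
        cleaned.append(rec)
    return cleaned

def split(records, train=0.7, val=0.15, seed=0):
    """Shuffle, then split into training, validation, and test sets."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train)
    n_val = int(len(shuffled) * val)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])

# Hypothetical records: (temperature in deg C, pH, dissolved oxygen in mg/L)
raw = [(18.2, 7.1, 8.9), (18.2, 7.1, 8.9), (21.0, None, 7.4)] + [
    (15.0 + i, 7.0, 9.5 - 0.1 * i) for i in range(20)
]
data = clean(raw)
train_set, val_set, test_set = split(data)
```

Random shuffling suits independent samples; for time-series data, a chronological split is usually the safer choice so that the test set lies strictly after the training period.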

2.2. Algorithm selection

The rise of ML can be traced back to 1949, when Donald Hebb proposed the Hebbian learning theory (Hebb, 1949), after which a wide range of ML algorithms were developed. In general, ML algorithms can be classified into three categories according to the types of data and the requirements of the work: supervised, unsupervised, and reinforcement learning (some classifications also include semi-supervised learning). Moreover, the applications of deep learning (DL) for water utilization and pollution control are also discussed because DL is closely related to ML. Fig. 1 outlines the reviewed ML algorithms and their applications in the fields of natural and engineered water systems.

Fig. 1. Reviewed machine learning algorithms and their applications in natural and engineered water systems.

Supervised learning algorithms are typically applied to labeled data to predict continuous values using regression or discrete categories using classification. For regression analysis, the least-squares method (LSM) has long been used in many algorithms, as it finds the function parameters that minimize the sum of squared errors between the predicted and actual values. Based on the LSM, linear regression, the simplest ML algorithm, is typically applied when the dataset is relatively small and the relationship within the data is linear (Fig. 2a). When dealing with nonlinear relationships, polynomial regression, which is also based on LSM, is a better choice, as it can flexibly fit nonlinear data by adjusting the powers of the variables (Fig. 2b). Linear and polynomial regressions offer rapid modeling and intuitive interpretation, although least-squares fits are sensitive to outliers. In addition to linear and polynomial regressions, ridge regression, Lasso regression, and ElasticNet regression are also common regression algorithms used in other fields. For classification tasks, the naïve Bayesian (NB) classifier and logistic regression (LR) have long been used. The NB classifier calculates the required posterior probability from the existing prior probability of the event; the newly obtained probability is then used as the updated prior in subsequent tasks. LR applies the sigmoid function to map the predicted values into the range from 0 to 1, thus calculating the probability of an event, and compares this probability with a selected threshold value (usually 0.5) to generate the predicted binary outcome (yes or no).
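As a minimal sketch of the two ideas in this paragraph, the following Python code fits a straight line by the closed-form least-squares solution and applies the sigmoid-plus-threshold step used in logistic regression. The function names and the toy data are assumptions made for illustration; fitting the logistic regression weights themselves is not shown.

```python
import math

def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept (closed form)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def sigmoid(z):
    """Numerically stable logistic function mapping any real z into (0, 1)."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def classify(z, threshold=0.5):
    """Turn a linear score z into a probability, then into a yes/no label."""
    return sigmoid(z) >= threshold

# Toy data lying exactly on y = 2x + 1
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
slope, intercept = fit_line(xs, ys)
```

With the default threshold of 0.5, `classify` returns True exactly when the linear score is non-negative, because the sigmoid of 0 is 0.5.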

Fig. 2. A chart of the common algorithms applied in this review.

In addition to the algorithms described above, which were developed specifically for regression or classification, many other algorithms can be used for both. The k-nearest neighbors (kNN) algorithm predicts the value or category of a sample from its nearest neighbors in the feature space. For regression, the average value of the k nearest neighbors is taken as the prediction, while for classification, the most frequent category among the k nearest neighbors is output (Fig. 2c) (Altman, 1992). Additionally, the support vector machine (SVM) aims to find a hyperplane that separates the samples by maximizing the margin between two categories of samples (Fig. 2d). Depending on whether it is used for classification or regression, SVM can be divided into support vector classification (SVC) and support vector regression (SVR) (Kadyrova and Pavlova, 2014). The decision tree (DT) is also a popular algorithm with a tree structure. It is composed of a root node, several internal nodes, and leaf nodes, which respectively represent the set of all samples, attribute tests, and decision results (Fig. 2e). The DT algorithm begins its decision-making process at the root node, compares the tested data with the attribute test at each node, and selects the next branch according to the result of that test. It finally outputs the result of the leaf node it reaches as the decision (Myles et al., 2004). Because DTs can solve both classification and regression problems, a widely used variant is called the classification and regression tree (CART). Moreover, to improve the performance of the DT algorithm, many derived algorithms, such as boosted decision trees (BDTs), gradient boosted decision trees (GBDTs), and random forests (RF), have been developed. In particular, RF is an ensemble algorithm comprising many DTs, in which each tree is built on a random sample of the input data and independently generates a prediction. After all the trees output their decisions, the result with the most votes is taken as the final prediction of the RF model (Fig. 2f) (Breiman, 2001).

Aside from the above algorithms, the artificial neural network (ANN) was the most frequently used algorithm in the reviewed studies. The perceptron is the structural unit of an ANN; it is composed of input cells, an output cell, and the weighted connections between them. By adjusting the values of the connection weights, different inputs can exert different effects on the output (Fig. 2g) (Rosenblatt, 1957). The perceptron algorithm is a linear classification model that is suitable for linearly separable data. By combining multiple perceptrons and introducing hidden layers and activation functions, the multilayer perceptron (MLP) algorithm was proposed, which is capable of treating multi-dimensional data (Fig. 2h) (Clark, 1991). However, MLP is a global approximation algorithm in which all the weights in the network are readjusted every time a sample is learned. Therefore, MLP converges slowly, and it also tends to fall into local optima. The radial basis function neural network (RBF NN) is another common ANN algorithm (Fig. 2i). It employs the radial basis function as the activation function and only adjusts the connection weights in a specified domain. Therefore, RBF NN is reported to converge quickly and to avoid the local optimum problem (Lee and Chang, 2003). Moreover, the adaptive neuro-fuzzy inference system (ANFIS) is also a commonly used ANN-derived algorithm. ANFIS can be defined as a multi-layer feed-forward network that employs fuzzy inference to map the input space to the output space. ANFIS allows the realization of a highly non-linear mapping and is considered superior to common linear methods in modeling non-linear time series (Jang, 1993).
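The kNN rule described above (average of the k nearest neighbors for regression, majority category for classification) can be sketched as follows. The Euclidean distance metric, the function names, and the toy samples are assumptions chosen for illustration, not details taken from the reviewed studies.

```python
import math
from collections import Counter

def k_nearest(train, query, k):
    """Return the k (features, target) pairs closest to query (Euclidean)."""
    return sorted(train, key=lambda pair: math.dist(pair[0], query))[:k]

def knn_regress(train, query, k=3):
    """Predict a value as the mean target of the k nearest neighbors."""
    neighbors = k_nearest(train, query, k)
    return sum(target for _, target in neighbors) / len(neighbors)

def knn_classify(train, query, k=3):
    """Predict a category as the most frequent label among the k nearest."""
    neighbors = k_nearest(train, query, k)
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Hypothetical one-feature regression samples: ((feature,), target)
reg_train = [((1.0,), 10.0), ((2.0,), 20.0), ((3.0,), 30.0), ((10.0,), 100.0)]
reg_pred = knn_regress(reg_train, (2.0,), k=3)

# Hypothetical two-feature classification samples: ((f1, f2), label)
cls_train = [((0.0, 0.0), "clean"), ((0.1, 0.2), "clean"),
             ((1.0, 1.0), "polluted"), ((0.9, 1.1), "polluted"),
             ((1.2, 0.8), "polluted")]
cls_pred = knn_classify(cls_train, (0.05, 0.1), k=3)
```

The same majority-vote step, applied to the outputs of many trees instead of the labels of neighbors, is how a random forest classifier combines its individual decisions.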
Unsupervised learning algorithms are often applied to reveal the intrinsic characteristics and rules of unlabeled sample data. They are typically applied in dimensionality reduction, clustering, and anomaly detection (also known as outlier detection). Principal component analysis (PCA) is a representative method for unsupervised dimensionality reduction. As its name implies, PCA aims to find the most essential characteristics, or to generate new characteristics, to describe the original dataset, thus reducing the dimensionality of the dataset and increasing interpretability with minimal information loss (Jolliffe and Cadima, 2016). K-Means is a common method for clustering analysis. It finds a partition that organizes data into distinct groups by minimizing the squared error between the empirical mean of each cluster and the points in that cluster (Jain, 2010). Isolation forest, Gaussian distribution-based methods, and the local outlier factor (LOF) are common algorithms for anomaly detection. They are typically used to detect samples that are sparsely distributed and far from the majority of the data (Ariyaluran Habeeb et al., 2019). Unlike supervised learning algorithms, which predict the data directly, unsupervised learning algorithms are typically applied for data pretreatment in the studies reviewed herein.
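To make the K-Means objective concrete, here is a minimal sketch of the standard iterative procedure (often called Lloyd's algorithm): assign each point to its nearest centroid, then move each centroid to the mean of its members. Using the first k points as initial centroids and a fixed iteration count are simplifying assumptions; practical implementations use more careful initialization and a convergence check.

```python
import math

def kmeans(points, k, iterations=20):
    """Cluster points into k groups; returns (centroids, labels)."""
    # Assumption: the first k points serve as the initial centroids.
    centroids = [tuple(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                  for p in points]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(dim) / len(members)
                                     for dim in zip(*members))
    return centroids, labels

# Hypothetical two-feature samples forming two well-separated groups
points = [(1.0, 1.0), (9.0, 9.0), (1.2, 0.8),
          (0.8, 1.1), (9.1, 8.8), (8.9, 9.2)]
centroids, labels = kmeans(points, k=2)
```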

2.3. Model evaluation

Many approaches have been proposed to evaluate the performance of ML models. The evaluation parameters for regression algorithms primarily include bias, variance, the mean absolute error (MAE), the mean squared error (MSE), and R-squared (R2), among others. Bias is the difference between the predicted result and its actual value, while variance represents the degree of deviation from the mean of the total sample. MAE is the average of the absolute values of the bias for each sample. It reflects the actual prediction error because positive and negative errors cannot cancel each other. MSE is the average of the squared bias for each sample. To keep the evaluation indicator on the same order of magnitude as the sample values, the square root of MSE is often taken to obtain the root mean squared error (RMSE), which is also a commonly used measure of performance. Moreover, R2, also known as the coefficient of determination, describes how well the regression function fits the observed values. The value of R2 typically lies between 0 and 1, and the closer it is to 1, the better the model fits the data.
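The regression metrics above can be written directly from their definitions. The following is a small sketch with made-up observed and predicted values; the function names are chosen for illustration only.

```python
import math

def mean_error(actual, predicted):
    """Signed average error; positive and negative errors can cancel."""
    return sum(a - p for a, p in zip(actual, predicted)) / len(actual)

def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def mse(actual, predicted):
    """Mean squared error."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Root mean squared error, in the same units as the data."""
    return math.sqrt(mse(actual, predicted))

def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_a = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_a) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot

# Hypothetical observed and predicted values
actual = [2.0, 4.0, 6.0, 8.0]
predicted = [3.0, 4.0, 5.0, 8.0]
```

For these values the signed mean error is 0 even though two predictions are wrong, which is the cancellation problem that MAE avoids. With this definition of R2, a model that predicts worse than the mean of the observations yields a negative value.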
For classification algorithms, accuracy is the most widely used evaluation indicator; it is the proportion of correctly classified samples in the total number of samples. In addition, precision (P) and recall (R) are also widely used for evaluating classifiers. The predicted results can be divided into true positive (TP), false positive (FP), true negative (TN), and false negative (FN) according to their real and predicted categories (Table 1). P is defined as the proportion of correctly classified positive samples among all samples predicted to be positive, P = TP/(TP + FP), while R is the proportion of correctly classified positive samples among all samples that are actually positive, R = TP/(TP + FN). For an ideal classifier, both P and R are expected to be as high as possible. In practice, however, raising one usually lowers the other. Therefore, the Fβ score, the weighted harmonic mean of P and R, is applied to balance them. Fβ is given by Eq. (1), where β measures the relative importance of R to P:

$$F_\beta = \frac{(1 + \beta^2) \times P \times R}{(\beta^2 \times P) + R} \tag{1}$$

R matters more than P when β > 1, and conversely, P matters more than R when β < 1. When β = 1, Fβ reduces to the F1 score, which weights R and P equally.

Table 1. Confusion matrix.

| True classification | Predicted: True     | Predicted: False    |
|---------------------|---------------------|---------------------|
| True                | True Positive (TP)  | False Negative (FN) |
| False               | False Positive (FP) | True Negative (TN)  |
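As a worked illustration of Eq. (1) and the confusion matrix in Table 1, the sketch below computes accuracy, precision, recall, and Fβ from the four counts, using the standard definitions P = TP/(TP + FP) and R = TP/(TP + FN). The counts are hypothetical and the function names are chosen for illustration.

```python
def accuracy(tp, fp, tn, fn):
    """Share of all samples that were classified correctly."""
    return (tp + tn) / (tp + fp + tn + fn)

def precision(tp, fp):
    """Share of predicted positives that are truly positive."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Share of actual positives that were found."""
    return tp / (tp + fn)

def f_beta(p, r, beta=1.0):
    """Weighted harmonic mean of precision p and recall r, as in Eq. (1)."""
    b2 = beta ** 2
    return (1 + b2) * p * r / (b2 * p + r)

# Hypothetical confusion-matrix counts
tp, fp, tn, fn = 40, 10, 30, 20
p = precision(tp, fp)
r = recall(tp, fn)
f1 = f_beta(p, r)            # beta = 1: P and R weighted equally
f2 = f_beta(p, r, beta=2.0)  # beta > 1: recall weighted more heavily
```

Here recall is lower than precision, so the recall-weighted F2 score comes out below F1.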

3. Water quality prediction and management in natural water systems

Natural waters, including rivers, lakes, groundwaters, and seas, are the most important water sources for human life and productive activities. To better manage and utilize natural water resources, different countries and regions have adopted various evaluation schemes based on a series of water quality indicators (WQIs). Physical indicators include temperature, color, turbidity, electric conductivity (EC), suspended solids (SS), and total solids (TS). Chemical indicators include pH, dissolved oxygen (DO), biochemical oxygen demand (BOD), chemical oxygen demand (COD), total organic carbon (TOC), alkalinity, ammonia nitrogen, total phosphorus (TP), and total nitrogen (TN). Biological indicators include chlorophyll a (Chl-a) and fecal indicator bacteria (FIB, e.g., total coliform, E. coli, and Enterococcus) (Uddin et al., 2021). In recent years, many studies have applied ML approaches to water quality prediction and water resource management in natural water environments. The relevant representative applications of ML are summarized below, including predicting WQIs (e.g., DO, FIB, Chl-a, or multiple indicators), mapping pollutant distribution in groundwater, classifying water resources according to different standards, tracing contamination sources, and evaluating pollutant toxicity.

3.1. Predicting water quality

The dissolved oxygen (DO) concentration is a critical parameter for evaluating the ecological health of an aquatic environment. This concentration results from an equilibrium between oxygen-producing (e.g., photosynthesis and air diffusing) and oxygen-consuming (e.g., aerobic respiration, nitrification, and chemical oxidation) processes in an aquatic environment (Olyaie et al., 2017). Under the influence of these complex processes, it is difficult for conventional process-based models or statistical approaches to simulate the DO level. On the contrary, data-driven ML does not consider the accumulation mechanism of DO, but only analyzes the statistical and mathematical relationship between different parameters. For example, WQIs such as temperature, pH, EC, and discharge were selected as inputs to train two types of ANN models for predicting DO concentrations downstream and upstream of Fountain Creek. The RBF NN performed better with a higher R2 value for both sampling stations than the MLP model (Ay and Kisi, 2012). However, the RBF NN was not as accurate as the MLP in predicting the DO concentration in the Mediterranean Sea along with Gaza, which might have been attributed to the small database, as RBF NN was more sensitive to the data volume (Zaqoot et al., 2009). To improve the performance of the RBF NN, the general regression neural network (GRNN) modified the structure of the RBF by replacing its weight connection between the hidden layer and the output layer with a summation layer to reduce the need for data volumes. In predicting the DO concentration in the Danube River, the performances of the MLP, GRNN, and recurrent neural network (RNN) were examined using only small amount of data. Results showed that the GRNN obtained better performance than the MLP did. However, the GRNN was inferior to the RNN model that was equipped with an extra layer to store and transform the previous input information as a decision reference when needed (Fig. 2J) (J and W, 2011). 
The tested RNN provided considerably better predictions of the DO, with all results within an error of ±10% (Antanasijevic et al., 2013b). However, because previous information is stored sequentially in an RNN, accumulating too much information makes it difficult to learn from distant inputs, thus causing long-term dependency problems. To solve this problem, the long short-term memory (LSTM) algorithm was developed; it stores previous information in memory cells that are opened by gates to transfer information when needed (Fig. 2K) (Greff et al., 2017). The LSTM was applied to predict river DO concentrations at the continental scale using the CAMELS-chem database, which collected DO information from 236 minimally disturbed watersheds across the U.S. Ultimately, the proposed LSTM model achieved a satisfactory prediction with a mean and median Nash-Sutcliffe Efficiency (NSE) of 0.60 and 0.78, respectively (Fig. 3A) (Zhi et al., 2021). Because studies differ in spatial location and scale, as well as in the choice of variables and the amount of data, comparing them is difficult. However, direct comparisons within the same study and a cross-comparison of the studies listed in Table S1 indicated that MLP performed well with small datasets (data volume < 1000), while SVM performed better with larger datasets (data volume > 1000). Moreover, considering that the variables used for predicting DO are typically time series, the LSTM is also recommended, as it is designed for time-series tasks and can carry useful information from the past to the future. In addition, it relies on fewer environmental variables, which saves the cost of data collection.
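The Nash-Sutcliffe Efficiency used to score the LSTM above compares the squared prediction error with the spread of the observations around their mean. The following is a minimal sketch of that standard definition, not code from the reviewed study; the function name and the DO values (in mg/L) are hypothetical and chosen only for illustration.

```python
def nash_sutcliffe_efficiency(observed, simulated):
    """NSE = 1 - sum((obs - sim)^2) / sum((obs - mean(obs))^2)."""
    if len(observed) != len(simulated) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_obs = sum(observed) / len(observed)
    squared_error = sum((o - s) ** 2 for o, s in zip(observed, simulated))
    spread = sum((o - mean_obs) ** 2 for o in observed)
    if spread == 0:
        raise ValueError("NSE is undefined when all observations are equal")
    return 1.0 - squared_error / spread


# Hypothetical DO observations in mg/L (illustrative values only).
observed_do = [8.0, 7.5, 6.0, 9.0]
mean_do = sum(observed_do) / len(observed_do)

# A perfect model reproduces every observation.
nse_perfect = nash_sutcliffe_efficiency(observed_do, observed_do)
# A model that always predicts the observed mean is the usual baseline.
nse_mean_only = nash_sutcliffe_efficiency(observed_do, [mean_do] * len(observed_do))
```

An NSE of 1 indicates a perfect match, 0 indicates a model no better than predicting the observed mean, and negative values indicate a model worse than that baseline.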

Fig. 3. The applications of ML in predicting water quality in the natural water environment. (A) The application of the LSTM model in predicting river DO concentrations across the continental United States. Reproduced from (Zhi et al. 2021) with permission. Copyright (2021) American Chemical Society. (B) Balanced data for Clarks Beach, green dot: below sample, blue dot: above sample, brown dot: ADASYN sample. Reproduced from (Xu et al. 2020) with permission. Copyright (2020) Elsevier Ltd. (C) Monthly mean Chl-a concentration derived from Aqua-MODIS by BME/SVR. Reproduced from (He et al. 2020) with permission. Copyright (2019) Elsevier Ltd. (D) RGB channel separation diagram of a pond water sample. Reproduced from (Li et al. 2020b) with permission. Copyright (2020) Elsevier Ltd. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

Drinking water or recreational water polluted with feces has the potential to cause gastrointestinal and respiratory diseases (Haile et al., 1999). Total coliform, fecal coliform (or E. coli), and Enterococcus are typically used as fecal indicator bacteria (FIB) to characterize fecal contamination in water. However, common approaches for quantifying FIB, such as multiple-tube fermentation, membrane filtration, and specific enzymatic detection, usually require 18–72 h, while emerging techniques such as flow cytometry, ATP assays, online optical sensors, and quantitative PCR are costly (García-Alba et al., 2019). In contrast, data-driven ML can provide real-time predictions that rely on easily available information about environmental conditions (e.g., precipitation, discharge, and tide); representative studies are reviewed below. For instance, an MLP model was used to fill in the missing microbial data in a dataset based on related physical, chemical, and bacteriological data from the Kentucky River. The MLP was more accurate, with smaller error values, than conventional imputation or regression models. Additionally, the proposed MLP model accurately classified observational data into two defined ranges for fecal coliform concentrations in a river system, especially when the concentration of FIB was below 200 CFU/100 mL (Chandramouli et al., 2007). Moreover, when the safe concentration of FIB was set as the classification threshold, ML could also be applied to water quality warnings for recreational surface waters. For example, when monitoring fecal pollution of coastal water in southern California, an MLP accurately predicted the levels of three FIBs, with false positive (FP) and false negative (FN) rates of less than 10%, which is sufficient for making rapid and reliable decisions regarding beach closures or advisories (He and He, 2008). 
In addition to ANN models, other algorithms, such as CART, RF, and the Bayesian network, have also been applied to predict FIB concentrations to provide information about water safety (Thoe et al., 2014). Although the models described above achieved high overall accuracy, imbalance in the dataset might have reduced the sensitivity of the predictions. Data imbalance arises because most of the data in a dataset typically fall below the guideline threshold, while only a minority exceed it. This increases the possibility of over-fitting in the majority class and information loss in the minority class when training a model (Batista et al., 2004). To solve this problem, an adaptive synthetic (ADASYN) sampling algorithm was adopted to create more samples at the boundary between the two classes using linear interpolation (Fig. 3B). Afterward, the proposed kNN, BDT, and MLP models all provided favorable predictions, with a relatively high sensitivity of approximately 75% and an overall accuracy of greater than 90% (Xu et al., 2020). In addition to ADASYN, the synthetic minority oversampling technique (SMOTE) was recently applied to solve the data imbalance problem in FIB prediction. SMOTE synthesizes new samples for the minority class based on the nearest-neighbor idea. Combining the six tested algorithms with SMOTE improved the performance of each. For example, RF presented a 50% higher true positive rate with respect to the baseline (Bourel et al., 2021). The approaches proposed in these two studies provided new insight into data preparation for training ML models. Moreover, no matter which algorithms were applied for FIB prediction, accuracy was always used to evaluate the performance of the proposed models. 
However, accuracy might not be enough to assess models when the potential health effects of a high false negative (FN) rate are assumed to outweigh the risk of an unnecessary recreational water closure caused by a high false positive (FP) rate. Therefore, Stidson et al. (2012) attached more importance to the FN rate than to other metrics, because a high FN rate means that dangerous situations are predicted as safe ones (Motamarri and Boccelli, 2012). Accordingly, when evaluating the performance of an ML model, diversified evaluation criteria based on actual requirements should be considered. According to Table S2, MLP was the most widely used algorithm in previous studies that modeled FIB, but algorithms based on DT, such as BDT, CART, and RF, performed better in studies that addressed the data imbalance problem. In addition, DT-based models were more interpretable than neural network models, and therefore more instructive for understanding and preventing FIB from exceeding the safety standard. Therefore, DT-based models combined with data imbalance processing techniques are more suitable for FIB predictions.
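The two ideas in this discussion, interpolating new minority-class samples and looking beyond accuracy, can be sketched in a few lines. The code below is an illustrative simplification, not the ADASYN or SMOTE implementation used in the cited studies: `smote_like` only interpolates between a minority sample and one of its nearest minority neighbors, and all names and values are hypothetical.

```python
import random


def smote_like(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points on segments between a minority
    sample and one of its k nearest minority-class neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        others = [p for j, p in enumerate(minority) if j != i]
        # Sort the remaining minority samples by squared distance to the base.
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, neighbour)))
    return synthetic


def confusion_rates(y_true, y_pred):
    """Accuracy, sensitivity, and FN rate, with 1 = exceedance (positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    positives = tp + fn
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / positives if positives else float("nan"),
        "fn_rate": fn / positives if positives else float("nan"),
    }


# Hypothetical exceedance samples described by two scaled features.
minority_samples = [(1.0, 2.0), (2.0, 3.0), (3.0, 1.0)]
new_samples = smote_like(minority_samples, n_new=5)

# A model that always predicts "safe" on an imbalanced set of 20 samples.
y_true = [0] * 18 + [1] * 2
rates = confusion_rates(y_true, [0] * 20)
```

In this toy case the always-safe model reaches 90% accuracy while missing every exceedance (sensitivity 0, FN rate 1), which is the situation that motivates reporting the FN rate alongside accuracy.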
An appropriate amount of phytoplankton in an aquatic environment can improve water quality by producing oxygen and absorbing carbon dioxide through photosynthesis (Durham et al., 2015). However, eutrophication of water bodies promotes the rapid growth of phytoplankton, leading to aquatic environmental problems such as algal blooms (Ma et al., 2015). Algal blooms shade benthic primary producers from sunlight, reduce DO levels, and accumulate toxic metabolites that threaten ecosystems and human beings (Conley et al., 2009). Therefore, forecasting algal blooms or determining the key factors that control bloom production is important to prevent their deleterious effects (Recknagel et al., 1997). Various models have been proposed for this purpose, including empirical models, deterministic models, time-series analysis models, and fuzzy-logic models (Yabunaka et al., 1997). ML has also been applied to predict algal blooms by identifying the environmental factors that affect the growth and accumulation of algal biomass in order to simulate the nonlinear processes of algal blooms (Karul et al., 2000; Lee et al., 2003). However, the environmental factors reported to control algal biomass vary between studies. For example, flow and temperature have been reported to be predominant in determining the initiation and duration of cyanobacterial growth, while water color controls the magnitude of the population growth peak (Maier et al., 1998). Additionally, the importance of the controlling factors differed among cyanobacterial genera. Specifically, Anabaena and Microcystis were most sensitive to average flow, Planktolyngbya and Oscillatoria to temperature, and Cylindrospermopsis to water color. 
Therefore, because of the diversity of phytoplankton that make up algal blooms, it has been difficult to predict the intensity and frequency of algal blooms using a single environmental variable, or to control algal biomass by adjusting a few influential factors (Nelson et al., 2018). In the last few decades, the development of remote sensing technology has made it possible to directly observe and investigate algal blooms at large spatial and temporal scales. Recently, ML has also been applied to analyze remote sensing data (Lary et al., 2016). For example, moderate resolution imaging spectroradiometer data were applied to estimate the Chl-a concentration within the Gulf of St. Lawrence of Canada (Fig. 3C). In this study, remote sensing with 10 types of reflectance (Rrs) was combined with eight different ML algorithms to predict algal blooms. The comparison revealed that the SVR model obtained the best performance using Rrs at 412, 443, 488, 531, and 678 nm (He et al., 2020). In addition, other remote sensing data, such as SeaWiFS, have also been applied for estimating Chl-a concentrations to forecast algal blooms (Campsvalls et al., 2006; Keiner 2010). Besides supporting Chl-a prediction, remote sensing technology can also provide hyperspectral images for simulating the concentration of algal cells. In a study that predicted cyanobacterial cell concentrations, Pyo et al. (2021) developed a convolutional neural network (CNN) model with a convolutional block attention module (CNNan) that emphasizes valuable information and suppresses insignificant information to improve the performance of the CNN for image recognition. Compared to a traditional hydrodynamic model and a plain CNN, CNNan performed better, with a higher NSE and a smaller RMSE (Table S3) (Pyo et al., 2021). As mentioned in this section, the factors that affect the growth of different algae are different, and a natural water body often contains several types of algae at the same time. 
Consequently, screening all available water-related parameters to determine the influential ones is often necessary prior to an algae or Chl-a prediction, which is time- and labor-intensive. Remote sensing and image recognition technologies do not consider complex water quality parameters, but directly use color information, which is one of the major characteristics that distinguish water bodies rich in Chl-a or algae from other polluted water bodies. Therefore, algorithms based on remote sensing information or hyperspectral images are undoubtedly the optimal choice to predict the concentration of algae or Chl-a.
Aside from single WQIs such as DO, FIB, and Chl-a introduced above, indicators such as BOD, COD, TOC, TP, TN, and TS are more common in water quality evaluations, and ML has also been used to predict these water quality indicators. For instance, 13 environmental variables were adopted to develop an MLP model to simulate the daily concentrations of TOC, TN, and TP in three rivers. The results showed that the concentrations of the three WQIs were all estimated with R² values greater than 0.75. Moreover, the future carbon and nitrogen loads under climate change scenarios were predicted, and increases in TOC and TN under notable change scenarios were reported (Holmberg et al., 2006). Recently, an RF model was used to predict the concentrations of nutrients (N and P) in both rural and urban catchments. By analyzing the impact of each variable in the model and removing the inessential ones, the number of variables was minimized to reduce the cost of surrogate sensors. In this manner, both the low cost and the high accuracy of the prediction were guaranteed (Castrillo and Garcia, 2020). In addition to rivers, the water quality of other water bodies has also been simulated, including a lake (Barzegar et al., 2020), a reservoir (Zhao et al., 2007), coastal water (Palani et al., 2008), and even stormwater runoff (Najah Ahmed et al., 2019). In particular, the water quality index, rather than individual WQIs, has been utilized in studies to predict the water quality in wetlands using MLP, RBF NN, SVM, and CART models. The water quality index is a single number calculated from a set of physicochemical water indicators to normalize the water quality (Mohammadpour et al., 2015, 2016). Interestingly, image analysis has also been applied to determine the quantitative relationships between image information and water quality. In a study that analyzed animal wastewater quality, the linear regression (LR), stochastic gradient descent (SGD), and Ridge regression algorithms were applied. 
The RGB color index extracted from the water sample images was used to analyze the correlations between the spectral rate and the WQIs, including N, P, TS, and total coliform (Fig. 3D). The optimal R² values were as high as 0.98 for TS and 0.96 for total coliform, which demonstrated that this was an instructive method for rapid and cheap analysis of water quality (Li et al., 2020b). Due to random or systematic errors, data obtained from site monitoring or experimental processes may be affected by noise, which introduces uncertainty into the predictions. To eliminate the effects of noise on data, the ANFIS algorithm was improved by combining it with the wavelet de-noising technique (WDT). The results indicated that WDT-ANFIS was the fastest and most precise method for processing large volumes of non-linear and non-parametric data compared with the MLP, RBF NN, and ANFIS models (Najah Ahmed et al., 2019). Therefore, to optimize the performance of ML, more advanced analytical and data processing techniques are worth considering. According to Table S4, the tested algorithms, e.g., MLP, RBF NN, SVM, and DT, performed well in predicting various WQIs in the reviewed studies, which might be due to the close chemical and physical relationships between the selected water quality variables and the predicted parameters. Although the performance of the various algorithms seems acceptable, noise in the data is an unavoidable problem, and it could be addressed by WDT to improve the performance of ML algorithms. Moreover, image recognition technology and DL models have once again appeared in recent studies, which might indicate the developmental direction of predicting water quality using ML.
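To make the wavelet de-noising step concrete, the sketch below applies a single-level Haar transform with soft thresholding of the detail coefficients. It is only an illustration of the general idea: the wavelet family, decomposition depth, and threshold used in the cited WDT-ANFIS study are not given in this section, so the Haar wavelet, the threshold value, and the sample readings are all assumptions.

```python
import math


def haar_denoise(signal, threshold):
    """Single-level Haar wavelet de-noising with soft thresholding.

    Each pair of neighbouring samples is split into an approximation
    (scaled sum) and a detail (scaled difference); small details are
    shrunk toward zero before the pair is reconstructed.
    """
    if len(signal) % 2 != 0:
        raise ValueError("signal length must be even")
    root2 = math.sqrt(2.0)
    denoised = []
    for i in range(0, len(signal), 2):
        approx = (signal[i] + signal[i + 1]) / root2
        detail = (signal[i] - signal[i + 1]) / root2
        # Soft threshold: shrink the magnitude by `threshold`, floor at zero.
        detail = math.copysign(max(abs(detail) - threshold, 0.0), detail)
        denoised.append((approx + detail) / root2)
        denoised.append((approx - detail) / root2)
    return denoised


# Hypothetical noisy sensor readings (illustrative values only).
readings = [5.0, 5.2, 7.9, 8.1]
smoothed = haar_denoise(readings, threshold=1.0)
unchanged = haar_denoise(readings, threshold=0.0)
```

With a threshold larger than every detail coefficient, each pair collapses to its mean; with a threshold of zero the signal is reconstructed unchanged. A practical pipeline would typically use several decomposition levels and a data-driven threshold before passing the cleaned series to the learning model.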

3.2. Mapping groundwater contaminants

Groundwater contamination is a public concern because of its potential health threats to billions of people globally who rely on groundwater for drinking. However, the worldwide extent of polluted regions is still unknown due to a lack of sufficient field data reflecting the spatial distribution and concentration of contaminants. To solve this problem, various research methods have been developed. In the past, geostatistical and statistical interpolation methods, such as Kriging, inverse distance weighting, global polynomial interpolation, sequential Gaussian simulations, and the Thiessen polygon, were applied to simulate the spatial distribution of contaminants in groundwater (Bindal and Singh, 2019). However, these methods depend too much on accurate field data, which are sometimes unattainable or sparse. Moreover, they do not consider the spatial dependency of data, which makes their performance unsatisfactory (Shaji et al., 2020). In recent years, ML has also been widely applied to predict the distribution of contaminants such as arsenic (As), fluoride, manganese, and E. coli in groundwater by analyzing the hidden relationships between contaminants and their direct or indirect influencing factors. Because of the serious health threat that As poses to humans, and its wide distribution throughout Earth's crust and the hydrosphere, the prediction of the spatial distribution of arsenic in groundwater by ML has been highlighted (Shaji et al., 2020).
Geologic and geochemical settings are primary factors that affect the occurrence of As in the lithosphere. Geologic factors such as specific rocks (i.e., mafic and granitic rocks (Ayotte et al., 2006), basalt, andesite, and rhyolite (Bretzler et al., 2017)) and soils (i.e., medium-textured soils (Winkel et al., 2011; Yang et al., 2014), organic-rich deposits (Winkel et al., 2008, 2011), and Holocene fluvial sediments (Podgorski et al., 2017; Rodríguez-Lado et al., 2013; Yang et al., 2014)), were reported to be the primary causes of arsenic accumulation. Moreover, geochemical conditions, such as the level of soil pH (Lee et al., 2009; Podgorski et al., 2017; Twarakavi and Kaluarachchi, 2006; Yang et al., 2012), DO (Yang et al., 2012), and other chemical substances including fluoride, bromide, chloride, iron, manganese, ammonia, nitrate, nitrite, phosphate, and sulfate (Lee et al., 2009; Yang et al., 2012) were also reported to be crucial factors that affect the occurrence of As in groundwater. By introducing these relevant environmental factors as variables, ML has achieved satisfactory performance in predicting the spatial distribution of arsenic in groundwater around the world. 
The countries studied include the United States (Anning et al., 2012; Ayotte et al., 2017, 2016, 2006; Frederick et al., 2016; Kim et al., 2011; Meliker et al., 2008; Twarakavi and Kaluarachchi 2006; Yang et al., 2014, 2012), China (Lee et al., 2009; Rodríguez-Lado et al., 2013; Zhang et al., 2012, 2013), Canada (Dummer et al., 2015), India (Bindal and Singh, 2019; Purkait 2008), Pakistan (Podgorski et al., 2017), Burkina Faso (Bretzler et al., 2017), Bangladesh (Tan et al., 2020), and Cambodia (Lado et al., 2008), as well as regions such as Southeast Asia (Bangladesh, Myanmar, Thailand, Laos, Cambodia, Vietnam, Sumatra, Red River, and Mekong deltas) (Cha et al., 2016; Cho et al., 2011; Chowdhury et al., 2010; Hossain and Piantanakulchai, 2013; Lado et al., 2008; Tan et al., 2020; Winkel et al., 2008, 2011), and the entire globe (Fig. 4A) (Podgorski and Berg, 2020). In these studies, logistic regression (LR) was the most frequently used algorithm, achieving an accuracy of approximately 70% in the corresponding studies. However, LR has shortcomings in dealing with outliers, large feature spaces, and multicollinearity among variables, which cause under-fitting or low accuracy (Mood and Carina, 2010). In contrast, the CART is immune to multicollinearity, as the decision at each node of the tree is made based on a single feature (Myles et al., 2004). In addition, the CART can handle outliers and missing data by accommodating surrogates. Therefore, the CART is capable of making more accurate predictions (Breiman et al., 1984). For example, the CART model achieved an accuracy of 78.25% in predicting the distribution of groundwater arsenic contamination in Bangladesh (Hossain and Piantanakulchai, 2013). Moreover, the CART has also been used to investigate the factors governing elevated groundwater arsenic concentrations across the western and the eastern United States, and the significance of aridity and pH was revealed (Frederick et al., 2016). 
To improve the prediction ability of the CART, a weak-learner ensemble model called the boosted regression tree (BRT) was recently developed by combining the CART with an adaptive boosting method (Elith et al., 2008). The BRT model captured the corresponding major geochemical processes with an accuracy of up to 91% in predicting the probabilities of groundwater arsenic in the central valley of California (Ayotte et al., 2016). In addition, the BRT also obtained an outstanding performance with an accuracy value of approximately 90.5% in comparison with the LR model and the RF model to predict the groundwater arsenic distribution in Bangladesh. Likewise, the RF model delivered an excellent result with a prediction accuracy of 90% in this comparison (Tan et al., 2020). Moreover, the RF model was also applied to predict the global groundwater arsenic concentrations. In this study, 52 types of parameters widely covering climate, soil, geology, and topography that were known or hypothesized to be related to the accumulation and release of arsenic were initially selected to be the predictor variables. By using a relative importance analysis, soil parameters (i.e., topsoil clay, subsoil sand, pH, and fluvisols), climate variables (i.e., precipitation, actual and potential evapotranspiration and combinations thereof with temperature), and the topographic wetness index were ultimately determined as influential variables for the distribution of arsenic (Podgorski and Berg, 2020). In addition to arsenic, the RF model was also applied to predict the distribution of fluoride in groundwater throughout India (Fig. 4B) (Podgorski et al., 2018). The proposed RF model obtained a more accurate prediction with an overall prediction accuracy of 91% (Podgorski et al., 2018), compared to a previous study mapping the distribution of worldwide groundwater fluoride using the ANFIS method (Amini et al., 2008). 
In addition to arsenic and fluoride, the distribution of manganese in the northern continental United States was also predicted using a BRT model with a total accuracy of 83% (Fig. 4C). Environmental variables such as estimated recharge, probability of high Fe > 100 μg/L, baseflow index, and the mean annual precipitation were shown to have high relative importance in determining the probability of high manganese (Erickson et al., 2021). Recently, a generalized additive model for location, scale, and shape (GAMLSS) regression model was applied to predict the E. coli concentration in wells across Ontario, Canada. A dataset containing E. coli concentration, well characteristics (well depth, location), and hydrogeological characteristics (bottom of well stratigraphy, and specific capacity) from 253,136 independent wells was established to train the GAMLSS model. The regression analysis revealed that bedrock wells drilled in sedimentary and igneous rock were more susceptible to contamination events, which helps to improve the understanding of aquifer vulnerability to contamination and to assess water quality (White et al., 2021). In studies that mapped the groundwater contaminant distribution, the ML model did not need to predict the exact contaminant concentration at a given point, but only to judge whether the contaminant concentration at that point exceeded the threshold (e.g., > 10 μg/L for As) according to the environmental variables. Therefore, the essence of this type of research is classification, which explains why the commonly used algorithms are LR and those based on DT. The LR algorithm is robust to small noise in the data and is not strongly affected by slight multicollinearity, while serious multicollinearity can be addressed by combining LR with L2 regularization. 
Therefore, LR performed well in dealing with environmental variables influencing the distribution of As, even though there are typically interactions among geologic and geochemical variables. However, LR covers all the features and will suffer performance decline when the feature space becomes larger. In contrast, the DT model is not limited by the size of the feature space because it can select the optimal feature according to the information gain. However, the DT model is designed to create branches for the training set, thus making it prone to over-fitting. RF overcomes this shortcoming by randomness in taking samples, determining features, and building trees. Therefore, DT-based models, especially RF, theoretically perform better than other classification algorithms. This is why DT-based models have been more popular in recent studies (Table S5).
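The statement that a decision tree selects a feature "according to the information gain" can be illustrated with a single split on one variable. The sketch below is a generic textbook computation, not the procedure of any cited study; the pH values and exceedance labels are hypothetical, and note that CART implementations commonly use the Gini index rather than entropy.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(values, labels, threshold):
    """Entropy reduction from splitting samples at values <= threshold."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    n = len(labels)
    weighted = sum(len(part) / n * entropy(part) for part in (left, right) if part)
    return entropy(labels) - weighted


def best_threshold(values, labels):
    """Try midpoints between sorted unique values; return the best one."""
    unique = sorted(set(values))
    midpoints = [(a + b) / 2 for a, b in zip(unique, unique[1:])]
    return max(midpoints, key=lambda t: information_gain(values, labels, t))


# Hypothetical groundwater pH values; label 1 = arsenic above the threshold.
ph = [6.5, 6.8, 7.0, 7.9, 8.2, 8.5]
exceeds = [0, 0, 0, 1, 1, 1]

split = best_threshold(ph, exceeds)
gain = information_gain(ph, exceeds, split)
```

In this toy dataset the split falls between pH 7.0 and 7.9 and removes all uncertainty (a gain of 1 bit). A full tree repeats this search over every feature at every node, and RF repeats it across many trees built from random subsets of samples and features.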

Fig. 4. The applications of ML in mapping groundwater contaminants. (A) Modeled probability of arsenic concentration in groundwater exceeding 10 μg/L for the entire globe. Reproduced from (Podgorski and Berg 2020) with permission. Copyright (2020) The American Association for the Advancement of Science. (B) Areas of aquifers with fluoride concentrations exceeding 1.5 mg/L in India, and neighboring countries of Bangladesh, Bhutan, Nepal, and Sri Lanka.  Reproduced from (Podgorski et al. 2018) with permission from ACS AuthorChoice. (C) Modeled probability of high Mn indicated by Mn > 300 μg/L in the northern continental United States. Reproduced from (Erickson et al. 2021) with permission from ACS AuthorChoice.

3.3. Classifying water resources

To better manage and protect water resources, different countries and regions have adopted various standards to classify water resources according to their water quality. For example, in the National Environmental Quality Standards of Surface Water (GB3838-2002, China), water resources are divided into five categories by 24 fundamental WQIs. Each category of water resource has corresponding applications according to its water quality. However, measuring all WQIs listed in the standard is costly. Therefore, it is meaningful to classify water quality with fewer parameters that are more indicative. ML is good at evaluating the variable importance to the prediction target and has been applied to identify the key WQIs to classify water resources.
Initially, ANFIS was applied to classify the water samples of all major river basins across China using the GB3838-2002 as the classification standard. DO, COD, and NH3-N were determined to be the influential variables, and 89.59% of the data were correctly classified (Yan et al., 2010). In addition to the GB3838-2002, other standards such as the Canadian Council of Ministers of the Environment (CCME) WQI (Canada) and the National Sanitation Foundation (NSF) WQI (U.S.) were also applied to classify water resources. For example, the SVM, the probabilistic neural network (PNN), and the kNN model were applied, using the CCME WQI as the classification standard, to classify the water samples of the primary aquifer in the Tehran plain. The results revealed that the kNN algorithm was the weakest classifier, with the highest total number and total value of errors. In contrast, PNN and SVM were more appropriate for small sets of samples, and SVM presented the best performance with no errors in the calibration and validation phases (Modaresi and Araghinejad, 2014). However, in another study classifying the water of the Tiaoxi River in the Lake Taihu Watershed, the classification accuracy of SVM fell from 99.56% to 61.60% when the number of indicators was raised from one to five, indicating that SVM handled data with many features poorly (Li et al., 2013). In addition, when classifying water samples from the Karoon River in Iran using the NSF WQI as the standard, the training process of SVM was reported to be more difficult and time-consuming. Therefore, the PNN model was ultimately recommended for water quality classification due to its combined ability to reduce the sampling costs and computation time (Dezfooli et al., 2017). 
To further identify a better model and the key WQIs, a more comprehensive comparison was conducted among 10 ML algorithms using big data (33,612 observations) from the major rivers and lakes in China, with GB3838-2002 as the standard. Different algorithms were compared using different WQI groups as prediction indicators. Ultimately, two WQI sets (DO, CODMn, and NH3-N; and CODMn and NH3-N) were identified as the key water parameters. The performances of the DT, RF, and deep cascade forest (DCF) were shown to be better than those of the other tested algorithms, including LR, LDA, SVM, CART, NB, and kNN (Fig. 5A) (Chen et al., 2020a). In addition to water quality, spatial or temporal factors can also be regarded as standards in water classification. For example, water samples from the Gomti River were classified in terms of sampling sites (spatial) and months (temporal), thereby identifying similar ones in the monitoring network to reduce the number of sampling sites and the annual sampling frequency. In this way, a data reduction of 92.5% was achieved without compromising the output quality (Singh et al., 2011). The nonlinear boundary problem caused by the variety of water quality parameters is one of the most distinct features of water resource classification. SVM relies only on the samples nearest the boundary (the support vectors) to establish the desired separation curve, so it can deal with nonlinear decision boundaries. Therefore, compared with its application in mapping groundwater contaminants, SVM was more commonly used in classifying water resources (Tables S5 and S6). However, training a support vector machine on large datasets is very time-consuming, and it is sometimes difficult to find a suitable kernel function. Moreover, DT-based models exhibited better performance than SVM, especially in tasks with large datasets (Chen et al., 2020a). As a result, DT-based models are more suitable for classifying water resources.
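As a minimal sketch of classification with a reduced indicator set, the Python snippet below labels a sample with a single-factor (worst-indicator) rule using only DO, CODMn, and NH3-N, and then trains a small kNN classifier on those labels. The class limits are quoted as they are commonly tabulated for GB3838-2002 and should be verified against the standard itself; the sample values, the function names, and the use of unscaled Euclidean distance are illustrative assumptions, not the setup of any reviewed study.

```python
import math
from collections import Counter

# Class I-V limits in mg/L (assumed values; verify against GB3838-2002).
DO_MIN = [7.5, 6.0, 5.0, 3.0, 2.0]        # DO must be at or above the limit
CODMN_MAX = [2.0, 4.0, 6.0, 10.0, 15.0]   # CODMn must be at or below the limit
NH3N_MAX = [0.15, 0.5, 1.0, 1.5, 2.0]     # NH3-N must be at or below the limit


def grade(value, limits, higher_is_better=False):
    """Return 1-5 for Classes I-V, or 6 if the value fails the Class V limit."""
    for idx, limit in enumerate(limits, start=1):
        ok = value >= limit if higher_is_better else value <= limit
        if ok:
            return idx
    return 6


def label_sample(do, codmn, nh3n):
    """Single-factor rule: the worst indicator decides the class."""
    return max(grade(do, DO_MIN, higher_is_better=True),
               grade(codmn, CODMN_MAX),
               grade(nh3n, NH3N_MAX))


def knn_predict(train, query, k=3):
    """Majority vote among the k nearest labeled samples."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical (DO, CODMn, NH3-N) measurements, labeled by the rule above.
SAMPLES = [(8.0, 1.5, 0.10), (7.6, 1.9, 0.12), (6.5, 3.0, 0.30),
           (6.2, 3.5, 0.45), (5.5, 5.0, 0.80), (5.1, 5.8, 0.90),
           (3.5, 8.0, 1.20), (2.5, 12.0, 1.80)]
TRAIN = [(s, label_sample(*s)) for s in SAMPLES]

if __name__ == "__main__":
    print(knn_predict(TRAIN, (7.7, 1.7, 0.11)))
```

In practice the indicators would be standardized before computing distances, and a tree-based model could replace kNN given the comparison reported above.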

Fig. 5. The applications of ML in water resource and pollutant management in the natural water environment. (A) The surface water quality prediction performance of DT, RF, and DCF using (a) DO, CODMn, and NH3-N; (b) CODMn and NH3-N.  Reproduced from (Chen et al. 2020a) with permission. Copyright (2020) Elsevier Ltd. (B) Schematic of a high-throughput DNA-sequencing approach for determining sources of fecal bacteria in the Lake Superior estuary. Reproduced from (Brown et al. 2017) with permission. Copyright (2017) American Chemical Society. (C) Schematic of tracking the sources of antibiotic resistance genes in an urban stream during wet weather. Reproduced from (Baral et al. 2018) with permission. Copyright (2018) American Chemical Society. (D) Schematic of tracing the sources of sediment based on the correlation between the sediment bacteria's alpha diversity, aquatic environmental variables, and aquatic sediment in Dongting Lake. Reproduced from (Zhang et al. 2019a) with permission. Copyright (2019) American Chemical Society. (E) Schematic of the computational process used to generate and validate microbial source tracking models with Ichnaea®. Reproduced from (Balleste et al. 2020) with permission. Copyright (2020) Elsevier Ltd. (F) Schematic of a perturbation theory machine learning (PTML) based QSTR approach for predicting the genotoxicity of metal oxide nanomaterials. Reproduced from (Halder et al. 2020) with permission. Copyright (2019) Elsevier Ltd.

3.4. Tracing contaminant sources

The accurate detection of contaminant sources in water environments is a critical step for contamination prevention and remediation. Many approaches, such as the response matrix, contaminant transport simulation, and Tikhonov regularization have been proposed to identify pollution sources (Singh et al., 2004), and ML has also joined in this attempt. For instance, an MLP was applied to simulate the time series of E. coli concentrations in a water system to locate discharge points. By inversely interpreting the transport patterns of E. coli, the source locations where E. coli was introduced into the given system were identified with an accuracy of 75% (Kim et al., 2008). Moreover, in a recent study using the RF model to identify water sample sources from three different river ecosystems, the responses of the aquatic microbial community to variations in water quality caused by pollution discharge were considered. Environmental physicochemical indices, microbiological indices, and their combination were applied as inputs to train the classifier. Microbiological indices-based models obtained the best predictions by using the abundances of the top 30 bacteria as predictor variables. With the increasing development of gene sequencing technology, the method proposed in this study provided an economical and rapid approach to trace water sample sources based on the abundance of microbial communities (Wang et al., 2021a). Recently, ML classification programs specifically designed for contamination tracing such as SourceTracker, have gained many applications. SourceTracker can estimate the relative contribution of a specific potential source to an environmental sink based on the Bayesian approach (Knights et al., 2011). To track the sources of fecal bacteria in the Lake Superior estuary, community-based microbial source-tracking using SourceTracker was conducted. 
The high-throughput DNA sequencing of a fecal sample collection was used to establish a fecal library that was utilized to understand the fecal microbiome composition, as well as the marker specificity and sensitivity in several animals. It was found that fecal bacteria in the Lake Superior estuary were primarily attributed to wastewater effluent and, to a lesser extent, geese and gull wastes (Fig. 5B) (Brown et al., 2017). SourceTracker was also combined with the shotgun metagenomic technique to analyze the sources of antibiotic resistance genes (ARGs) in an urban stream during wet weather. The relative contributions of both microbes and ARGs in the sink environment were estimated using the abundances of microbial taxa and ARGs provided by shotgun metagenomics. The results revealed that storm drain outfall waters were the largest contributor of both microbes and ARGs in the urban stream, while wash-off from streets was the largest contributor of microbes and ARGs in the storm drain outfall water (Fig. 5C) (Baral et al., 2018). Moreover, SourceTracker was applied to analyze the sources of sediment pollution in Lake Dongting. A metagenomic analysis characterized the difference among community compositions of source sediment samples. Then the specific sources of sediment were identified by SourceTracker based on the inseparable relationships between sediment and adsorbed microorganisms (Fig. 5D) (Zhang et al., 2019a). Additionally, the significant relationships between sediment and microbial community were also used to investigate phosphorus sources in Lake Dongting. By analyzing phosphorus source contributions in interconnected river-lake systems using SourceTracker, a novel framework for nutrient source-tracking was established to develop effective management and control strategies for both sediment and eutrophication in river-lake systems (Gu et al., 2020). Ichnaea® is another ML-based software utilized to improve tracking the fecal pollution sources in water. 
Unlike SourceTracker, which depends on comparing the obtained gene information with a known gene library, Ichnaea® relies on library-independent markers and the abundance of fecal indicators or host-specific markers. For example, in tracking the sources of fecal pollution in a given water body, different fecal indicators and source tracking markers were applied to provide host information. Ichnaea® correctly distinguished fecal pollution of human or non-human origin from that of several other origins (Fig. 5E) (Balleste et al., 2020). Compared with the research directions introduced above, fewer studies have investigated the use of ML in tracing contaminant sources. Nevertheless, most of the studies reviewed in this paper adopted ready-made ML-based contaminant tracing tools, e.g., SourceTracker and Ichnaea® (Table S7). Such tools not only reduce the burden of developing specialized ML models but also allow researchers who are not proficient in programming and ML to use efficient contaminant tracing methods. This provides a reference for lowering the threshold of applying ML for environmental researchers.
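The idea of estimating how much each source contributes to a sink can be illustrated with a simplified mixture model. The sketch below is not SourceTracker's algorithm (which uses a Bayesian approach and also handles unknown sources); it is a plain expectation-maximization estimate of mixing weights under the assumption that each source has a fixed, known taxon profile. The function name, the source names, and all abundance values are hypothetical.

```python
def estimate_source_proportions(sink_counts, source_profiles, n_iter=500):
    """EM estimate of mixing weights for fixed source profiles.

    sink_counts: taxon counts observed in the sink sample.
    source_profiles: dict of source name -> taxon relative abundances
    (each list sums to 1 and is aligned with sink_counts).
    """
    names = list(source_profiles)
    weights = {name: 1.0 / len(names) for name in names}  # uniform start
    for _ in range(n_iter):
        assigned = {name: 0.0 for name in names}
        for j, count in enumerate(sink_counts):
            # E-step: share of taxon j attributed to each source.
            joint = {n: weights[n] * source_profiles[n][j] for n in names}
            norm = sum(joint.values())
            if norm == 0.0:
                continue  # taxon absent from every source: cannot be attributed
            for n in names:
                assigned[n] += count * joint[n] / norm
        # M-step: weights are the attributed fractions of all attributed reads.
        attributed = sum(assigned.values())
        weights = {n: assigned[n] / attributed for n in names}
    return weights


# Hypothetical taxon profiles for two sources and a sink mixed 70:30.
PROFILES = {
    "wastewater": [0.6, 0.3, 0.1, 0.0],
    "goose": [0.1, 0.2, 0.3, 0.4],
}
SINK = [450, 270, 160, 120]

if __name__ == "__main__":
    print(estimate_source_proportions(SINK, PROFILES))
```

Because the sink counts here were constructed as an exact 70:30 mixture, the estimate should approach those proportions; real sink samples are noisier and usually contain taxa from unsampled sources.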

3.5. Evaluating pollutant toxicity

Toxic pollutants discharged into the natural water environment will inevitably cause harm to aquatic ecosystems and ultimately poison humans through the food chain. Therefore, it is crucial to assess the toxicity of pollutants on aquatic organisms and humans. Animal testing is one of the primary approaches to evaluate the toxicity of chemical substances. However, its cost in time, money, and labor limits its applications. In addition, the recent rise of animal protectionism has made animal testing more difficult to implement (Takata et al., 2020). To reduce the costs and uncertainties of conventional toxicity evaluation methods, high-throughput computer technologies have become increasingly popular because of their high efficiency and convenience in data-driven toxicity assessments. The quantitative structure-activity relationship (QSAR) builds a quantitative relationship between the structural or physicochemical characteristics of chemicals and their properties or activities, including toxicity (Wu and Wang, 2018). However, it is increasingly difficult for the prior knowledge-based QSAR method to process tasks with increasing amounts of data, which makes it attractive to introduce ML to assist and improve the QSAR in predicting the toxicity of pollutants. ML-QSAR approaches collect large amounts of data from the related literature to reveal universal rules or mechanisms of correlative chemical reactions or processes, thus guiding the modeling of similar reactions. Therefore, ML-QSAR has been widely applied for evaluating pollutant toxicity. Pesticides are organics that possess high toxicity against various organisms. In a recent study, five ML models (i.e., DT, NB, kNN, RF, and SVM) were each combined with the QSAR to predict the concentration for 50% of the maximal effect (EC50) of 639 diverse pesticides against Daphnia magna. 
A total of 365 molecular descriptors and seven molecular fingerprints (MF) were applied to describe the characteristics of the tested pesticides. MF referred to a method to encode the molecules into a mathematical format that is readable for computational programming according to their molecular structure. The best performance was obtained using the SVM model with a prediction accuracy of 0.848 in the test set verification (He et al., 2019). Endocrine-disrupting chemicals (EDCs) are another type of organic that can cause adverse effects on the normal function of homeostasis, reproduction, and metabolism by disordering the endocrine system of animals and humans (Lister and Van Der Kraak, 2001). A DL-based QSAR model was developed to classify and predict the toxicological effects of EDCs on sex-hormone binding globulin (SHBG) and estrogen receptors (ER). Datasets assessing diverse qualitative responses of the SHBG and ER to the tested EDCs were collected from the relevant literature. The proposed DL-QSAR model exhibited more satisfactory performance than MLR and SVM in simulating the toxicity of EDCs, with R2 values of 0.86 for SHBG and 0.84 for ER (Heo et al., 2019). Moreover, chemicals can produce endocrine-disrupting effects by binding to the peroxisome proliferator-activated receptor γ (PPARγ). To identify the binding affinity of chemicals with PPARγ, recently, QSAR was also utilized to provide molecular descriptors for the construction of ML models (Wang et al., 2021d). In addition, a DNN-based DL model was developed for the toxicity classification of 1317 chemicals to screen for environmental estrogens. By introducing a novel three-dimensional molecular surface point cloud with electrostatic potential to describe chemical structures, this model recognized the active and inactive chemicals at accuracies of 82.8 and 88.9%, respectively (Wang et al., 2021c). In addition to organics, nanoparticles (NPs) are also of concern in environmental toxicology. 
NPs are emerging detrimental pollutants that can cause harmful effects, such as oxidative stress and lipid peroxidation on aquatic organisms by generating reactive oxygen species, due to their specific surface area or photocatalytic activity. To predict the toxicity of NPs, a perturbation theory machine learning (PTML)-based QSAR was applied to estimate the genotoxicity of metal oxide NPs. The genotoxicity information of 78 metal oxide NPs against 32 different biological targets was collected as a dataset. In addition, chemical periodic table-based descriptors, quantum chemical properties, and constitutional descriptors were adopted to describe the tested NPs. The results revealed that 97.81% of the cases in the test set were correctly classified, and the toxicity of nearly all the 78 NPs was correctly predicted in the external validation. Moreover, because the proposed PTML-QSAR model was built based on the relevant physicochemical descriptors, it could also provide reliable insights in terms of the genotoxicity mechanisms of NPs (Fig. 5F) (Halder et al., 2020). In addition to combining with QSAR, ML is sufficient to predict the toxicity of pollutants independently. For example, the 50% lethal concentrations (LC50) of 400 compounds were predicted using the RF, SVM, and XGBoost algorithms based on 12 types of MF. The optimal prediction results were achieved using the SVM model with an accuracy of 92.2%, which was better than those generated by the QSAR or ANN methods (Ai et al., 2019). Additionally, the performance of kNN, SVM, ANN, RF, AdaBoost, and GBM for predicting 50% hazardous concentrations (HC50) of 578 chemical substances were compared. The RF model was found to outperform the other models, including the QSAR and ECOSAR methods, as it could explain 63% of the variability of HC50 (Hou et al., 2020). 
In the reviewed studies above, various environmental variables were required to form the feature space to train the ML model, because the predicted objectives (e.g., water quality and classification, contaminant distribution and sources) were closely related to the surrounding environmental factors. However, toxicity is an attribute of the pollutant itself with respect to organisms. Consequently, when training a model to predict the toxicity of chemicals, the characteristics of the tested chemical itself are required to form the feature space. Several methods were adopted in the reviewed studies to describe the molecular characteristics, including molecular descriptors and molecular fingerprints, and these are also basic research tools for the QSAR methods. Therefore, ML models combined with QSAR were often used for predicting the toxicity of pollutants, and they obtained good prediction results (Table S8). However, as studies on pollutant toxicity prediction deepen, more advanced descriptors and modeling patterns, such as higher-dimensional descriptors and transformer architectures, should be explored.
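To make the role of molecular fingerprints concrete, the following sketch represents each chemical as the set of its "on" fingerprint bits, compares chemicals with the Tanimoto (Jaccard) similarity, and classifies a query by a similarity-weighted vote of its most similar labeled neighbors. The bit sets, toxicity labels, and function names are hypothetical; in practice the fingerprints would be generated by a cheminformatics toolkit, and none of the reviewed studies is claimed to use exactly this classifier.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto (Jaccard) similarity between two sets of on-bit indices."""
    if not fp_a and not fp_b:
        return 1.0  # two empty fingerprints are treated as identical
    return len(fp_a & fp_b) / len(fp_a | fp_b)


def predict_toxic(query_fp, labeled_fps, k=3):
    """Similarity-weighted vote of the k most similar labeled chemicals."""
    ranked = sorted(labeled_fps,
                    key=lambda item: tanimoto(query_fp, item[0]),
                    reverse=True)[:k]
    score = sum(tanimoto(query_fp, fp) * (1 if toxic else -1)
                for fp, toxic in ranked)
    return score > 0


# Hypothetical library: (set of on-bit indices, is_toxic).
LIBRARY = [
    ({1, 4, 7, 9}, True),
    ({1, 4, 8, 9}, True),
    ({2, 3, 5}, False),
    ({2, 5, 6}, False),
    ({3, 5, 6, 10}, False),
]

if __name__ == "__main__":
    print(predict_toxic({1, 4, 9, 11}, LIBRARY))
```

The same feature representation can feed any of the classifiers mentioned above (e.g., SVM or RF); the nearest-neighbor vote is used here only because it needs no external library.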

4. Technology optimization and operation monitoring in engineered water systems

Engineered water systems refer to a series of processes that consist of taking water from natural waters, drinking water purification and distribution, sewage or wastewater collection and treatment, and water reuse or draining back to nature in modern cities and towns (Lu et al., 2016). Effective management of engineered water systems will be of great help in providing clean water to users and preventing ecological pollution. To achieve this, ML has been applied widely in each link of engineered water systems, and related studies and applications of ML are reviewed in this section. In addition, ML has also been combined with adsorption and oxidation techniques, and the characterization analysis of pollutants, which are also necessary steps in engineered water systems. Therefore, the applications of ML in modeling adsorption and oxidation, and assisting characterization analysis, are also reviewed below.

4.1. Modeling treatment technique

Lab-scale development and optimization are fundamental for actual applications of treatment technologies for drinking water or wastewater in DWTPs and WWTPs, especially for the removal of emerging pollutants requiring specific treatment techniques. Adsorption and oxidation are common processes in both laboratory experiments and treatment plants to remove pollutants with low biodegradability or biological toxicity. Accurate simulation of treatment processes is beneficial for optimizing the reactant dosage, selecting an appropriate approach, or scaling up reactors for practical use. Some representative applications of ML for modeling the adsorption and oxidation processes to improve the pollutant removal efficiency are summarized below.
Adsorption is a popular process in wastewater treatment, especially for removing pollutants that are immune or even noxious to conventional biological treatment processes, such as heavy metals and some types of organic matter (Dabrowski, 2001). For example, in predicting the removal efficiency of Pb (II) adsorbed by red mud, the adsorbent dosage, contact time, and pH were used as variables to train an MLP model. The results were then compared with the response surface methodology (RSM). The RSM is a common method used to assess the effects of independent variables on the reaction process (Yetilmezsoy et al., 2009). The results showed that the MLP exhibited a better performance for predicting the lead removal efficiency than RSM did, with R2 values of 0.898 for the MLP and 0.672 for the RSM (Geyikçi et al., 2012). Moreover, the performance of ML could be further improved by combining it with optimization algorithms. The genetic algorithm (GA) is a global and parallel optimization algorithm that can automatically acquire and accumulate the knowledge of a search space during the process of searching, thus controlling the searching process adaptively to obtain the best solution (Álvarez et al., 2016). In the adsorption of Cu (II) by reduced graphene oxide-supported nanoscale zero-valent iron (nZVI/rGO) magnetic nanocomposites, MLP-GA performed well with an MAE value of 1.13%, while the values were 3.64% and 7.44% for the MLP and RSM methods, respectively. Moreover, the combination of MLP and particle swarm optimization (PSO) obtained a better prediction with a minimum MAE value of 0.46%. PSO is another optimization algorithm through which global optimization is realized by group iteration based on the interaction among and update of particles (each represents a possible solution to a problem) (Kennedy and Eberhart, 1995). 
By using the MLP-PSO to identify and adjust the critical parameters controlling adsorption processes, the Cu (II) removal efficiency was improved by 3.15% and 8.54% compared to that of the MLP-GA model and the RSM model (Fan et al., 2017). Aside from heavy metals, ML was also applied in modeling the adsorption of some types of organic matter. For instance, triclosan (TCS) is a broad-spectrum chlorinated antibacterial ingredient that poses high health risks and ecological toxicities. The adsorption of TCS by a novel host-guest complex (MWCNT/PEG/b-CD) was simulated using the GRNN, ANFIS, and RSM models, among which ANFIS showed a better prediction ability than the other two models. Moreover, GA was introduced to optimize the experimental design, and the maximum TCS removal efficiency was improved to 99.50% (Azqhandi et al., 2019). Synthetic dye is another type of organic compound that can cause environmental pollution (Salleh et al., 2011). In early 2017, Ghaedi and Vafaei summarized the applications of ANN models for simulating synthetic dye adsorption. Their review concluded that four types of ML models, including MLP, ANFIS, SVM, and hybrids with GA or PSO optimization, were primarily used in previous studies (Ghaedi and Vafaei, 2017). The better performance of the hybrid networks with optimization algorithms was confirmed in this review, as well as in a study published afterward (Jun et al., 2020).
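The particle-update rule behind PSO can be sketched in a few lines. In the example below, each particle is a candidate set of operating conditions, and the swarm searches for the conditions that maximize a predicted removal efficiency. The quadratic `predicted_removal` function is a hypothetical stand-in for a trained MLP surrogate, and the inertia and acceleration coefficients are common textbook defaults rather than values taken from the reviewed studies.

```python
import random


def pso_minimize(objective, bounds, n_particles=30, n_iter=200,
                 inertia=0.7, c_personal=1.5, c_global=1.5, seed=0):
    """Basic global-best PSO; returns (best_position, best_value)."""
    rng = random.Random(seed)
    dim = len(bounds)
    pos = [[rng.uniform(lo, hi) for lo, hi in bounds]
           for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    best_pos = [p[:] for p in pos]               # personal bests
    best_val = [objective(p) for p in pos]
    g = min(range(n_particles), key=best_val.__getitem__)
    g_pos, g_val = best_pos[g][:], best_val[g]   # global best
    for _ in range(n_iter):
        for i in range(n_particles):
            for d, (lo, hi) in enumerate(bounds):
                r1, r2 = rng.random(), rng.random()
                # Velocity mixes momentum, pull to personal best, pull to global best.
                vel[i][d] = (inertia * vel[i][d]
                             + c_personal * r1 * (best_pos[i][d] - pos[i][d])
                             + c_global * r2 * (g_pos[d] - pos[i][d]))
                pos[i][d] = min(max(pos[i][d] + vel[i][d], lo), hi)
            val = objective(pos[i])
            if val < best_val[i]:
                best_pos[i], best_val[i] = pos[i][:], val
                if val < g_val:
                    g_pos, g_val = pos[i][:], val
    return g_pos, g_val


def predicted_removal(x):
    """Hypothetical surrogate: removal (%) as a function of pH and dosage (g/L)."""
    ph, dose = x
    return 95.0 - 2.0 * (ph - 5.0) ** 2 - 40.0 * (dose - 0.5) ** 2


if __name__ == "__main__":
    best, neg_removal = pso_minimize(lambda x: -predicted_removal(x),
                                     bounds=[(2.0, 9.0), (0.1, 1.0)])
    print(best, -neg_removal)
```

Maximizing removal is expressed as minimizing its negative, so the same routine could also be used to tune model hyperparameters by passing a validation error as the objective.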
In addition to batch experiments in a single study, datasets compiled from the published literature have also been used to develop ML algorithms. For instance, to train a deep learning neural network (DLNN), a dataset containing 200,000 sample scenarios was generated from over ten years of studies that used a carbon-based adsorbent (i.e., carbon nanotube (CNT), activated carbon, biochar, graphene, carbonaceous, or graphite) to adsorb anionic, cationic, and zwitterionic ionizable organic compounds. The proposed DLNN model exhibited a strong generalization and forecasting ability for the adsorption of ionizable organic compounds by a wide range of carbonaceous materials, with R2 values exceeding 0.98 and 0.91 for predicting the Freundlich coefficient KF and the exponent n, respectively (Sigmund et al., 2020). Another study mined the literature for the adsorption data of 165 organic chemicals on 50 biochars, 34 CNTs, 35 granular activated carbons (GAC), and 30 polymeric resins to train a neural network with the poly-parameter linear free energy relationships (pp-LFER). The results showed that the proposed NN-LFER model accurately simulated the tested adsorption processes, with R2 values ranging from 0.86 to 0.91 (Fig. 6A) (Zhang et al., 2020). More notably, a graphical user interface was provided in these two studies for users who are not skilled in computer operation. This makes it much more convenient for researchers and practitioners in the fields of water purification and pollution control to select an appropriate sorbent for a given contaminant based on their requirements.
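For readers unfamiliar with the two Freundlich parameters mentioned above, the sketch below shows how KF and n are conventionally obtained from isotherm data by linear regression in log-log space, assuming the form q = KF · C^n (some authors write the exponent as 1/n instead). The data points are synthetic and the function name is illustrative; the reviewed DLNN predicts these parameters rather than fitting them this way.

```python
import math


def fit_freundlich(conc, uptake):
    """Least-squares fit of log10(q) = log10(KF) + n * log10(C)."""
    xs = [math.log10(c) for c in conc]
    ys = [math.log10(q) for q in uptake]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    n = sxy / sxx                         # slope is the Freundlich exponent
    kf = 10 ** (y_mean - n * x_mean)      # intercept gives the coefficient
    return kf, n


if __name__ == "__main__":
    # Synthetic equilibrium concentrations and uptakes generated with KF = 2.0, n = 0.6.
    conc = [0.1, 1.0, 10.0, 100.0]
    uptake = [2.0 * c ** 0.6 for c in conc]
    print(fit_freundlich(conc, uptake))
```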

Fig. 6. The applications of ML in modeling adsorption and oxidation processes, and assisting characterization analysis. (A) Predicting aqueous adsorption of organic compounds onto biochars, carbon nanotubes, granular activated carbons, and resins with machine learning. Reproduced from (Zhang et al. 2020) with permission. Copyright (2020) American Chemical Society. (B) Schematic for prediction of oxidant exposures and micropollutant abatement during ozonation. Reproduced from (Cha et al. 2020) with permission. Copyright (2021) American Chemical Society. (C) Schematic for the understanding of DOM reactivity in freshwater. Reproduced from (Herzsprung et al. 2020) with permission. Copyright (2020) American Chemical Society. (D) Fully CNN for detection and counting of diatoms after short-term field exposure. Reproduced from (Krause et al. 2020) with permission. Copyright (2020) American Chemical Society. (E) Hyperspectral imaging-based method for rapid detection of microplastics in the intestinal tracts of fish. Reproduced from (Zhang et al. 2019c) with permission. Copyright (2019) American Chemical Society. (F) Rapid identification of marine plastic debris via spectroscopic techniques and ML classifiers. Reproduced from (Michel et al. 2020) with permission. Copyright (2020) American Chemical Society.

Apart from adsorption processes, advanced oxidation processes (AOPs) are also effective technologies to treat pollutants with high chemical stabilities and low biodegradabilities. The development of AOPs can be traced back to the late 19th century, when H. J. H. Fenton discovered that hydrogen peroxide, catalyzed by ferrous ions, could generate hydroxyl radicals (OH▪) to oxidize organic contaminants, a process that became known as the Fenton reaction (Nguyen et al., 2020). Since then, an increasing number of AOPs have been developed based on various approaches, including ozone-based, ultraviolet (UV)-based, photocatalytic, electrochemical, and physical methods (Bolton et al., 1996; Miklos et al., 2018). Currently, ML is also applied to simulate the oxidation processes for removing organic pollutants. Khataee and Kasiri reviewed early applications of ANN in modeling homogeneous and heterogeneous nanocatalytic AOPs that included photocatalytic, photooxidative, and electrochemical treatment processes. The authors confirmed that ANN was an effective and simple approach to describe AOPs without considering complex effects such as the radiant energy balance, mass transfer, and the spatial distribution of the absorbed radiation (Khataee and Kasiri, 2010). Here, studies published after that review or covering content outside its scope are summarized to present the applications of more ML models in a wider range of AOPs. For the degradation of antibiotics including amoxicillin, ampicillin, and cloxacillin with the Fenton reaction, an MLP model was used to simulate the reaction processes in terms of COD removal. An accurate prediction of the COD removal efficiency was achieved with an R2 of 0.997. Additionally, the sensitivity analysis showed that the H2O2/Fe2+ molar ratio was the most influential parameter for antibiotic degradation. This result pointed out the direction for further improving the removal efficiency (Elmolla et al., 2010). 
Building on the Fenton reaction, photo-Fenton processes were developed by exposing the reaction system to solar or UV light, both of which can speed up the generation of hydroxyl radicals in Fenton reactions (Kavitha and Palanivelu, 2004). During the degradation of oil-contaminated wastewater by the homogeneous photo-Fenton (UV/H2O2/Fe2+) process, the MLP model accurately predicted the oil removal efficiency. In addition, pH was found to be the most influential parameter affecting oil degradation, compared with H2O2, Fe2+, oil concentration, irradiation time, and temperature (Mustafa et al., 2014). Besides facilitating the generation of hydroxyl radicals in photo-Fenton processes, light can also drive photocatalytic processes by providing photons to excite the lone electron on a semiconductor (e.g., TiO2 and ZnO) to create an electron-hole pair (e-h+). Then, a series of chain oxidative-reductive reactions will occur to degrade the contaminants on the semiconductor surface with electron-hole pairs (Chong et al., 2010). During the photocatalytic degradation of m-cresol by synthesized Mn-doped ZnO nanoparticles under visible-light irradiation, three types of MLP with different architectures were developed to simulate the reaction processes. A comparison between the predicted output and the experimental results showed that the ANN trained with quick propagation accurately simulated the degradation with an R2 of 0.9938. Moreover, the predicted optimal parameter values were used to optimize the reaction conditions, under which the actual removal efficiency of m-cresol was improved to 99% (Abdollahi et al., 2014). In addition to Fenton and photocatalytic processes, ozonation is another popular AOP technology. During the degradation of micropollutants (MPs) by the ozonation process, the RF model was applied to predict the oxidant exposures and MP abatement. 
Parameters including pH, alkalinity, DOC, and fluorescence excitation-emission matrix (FEEM) data were adopted as input variables. Ultimately, the proposed RF model obtained an accurate prediction with R2 values of 0.798 and 0.772 for O3 and OH▪ exposure, respectively, and 0.904 for MP abatement (Fig. 6B) (Cha et al., 2020). Ozonation could also be combined with other processes to enhance the ability of treating pollutants. For instance, a synthesized Dy2O3/TiO2/graphite nanocomposite was used to assist the photocatalytic ozonation process for the degradation of textile dyeing in wastewater, while O3/H2O2/Zr-pumice was applied in the decolorization of Rhodamine B dye. For both studies, the MLP models were applied to predict the removal efficiency, and achieved excellent accuracy with R2 values of 0.99 and 0.9948 (Sheydaei et al., 2020; Shokoohi et al., 2020).
Data mining in AOP-related studies has also been conducted. For instance, 446 data points concerning the photo-degradation rate constants (-log(K)) of organic contaminants in a TiO2-UV photocatalyst system were collected. Then, an MLP model was trained using a variety of reaction parameters, including ultraviolet intensity, TiO2 dosage, initial concentration, and the MF information of 78 tested organic contaminants. The proposed MLP model showed its reliability and stability, as the predicted photo-degradation rate constants were in good agreement with experimental data with an average R2 value of 0.873 (Jiang et al., 2020). In another study, an MLP was compared with the quantitative structure-property relationship (QSPR) method to model hydroxyl radical rate constants (kHO▪) using a diverse dataset of 457 water contaminants from 27 chemical classes. With a total of 785 molecular descriptors as variables, QSPR achieved a simulation of kHO▪ with an R2 of 0.724, while the MLP performed better with an R2 of 0.847 (Borhani et al., 2016). Moreover, based on this dataset, a DNN combined with MF was developed to predict the kHO▪ of another 46 organic pollutants and obtained a prediction accuracy of 0.724 (Zhong et al., 2020). The accurate prediction of reaction rate constants, such as degradation rate constants and hydroxyl radical rate constants, will undoubtedly help researchers select appropriate reaction types and reactants and design more efficient AOP-based water treatment units. The summary in Tables S9 and S10 indicates that MLP is the most frequently used algorithm for the simulation of adsorption and oxidation reactions, especially in earlier studies with small datasets. MLP is one of the most basic ANN algorithms, and it can approximate any nonlinear functional relation with high precision. In addition, MLP has strong robustness and fault tolerance to noisy datasets. 
Additionally, after 30 years of development, many MLP-based programs and software packages can be used directly by researchers who are not familiar with programming. Therefore, MLP has been widely used for tasks with smaller datasets and fewer features, such as the data generated by batch experiments in the single studies reviewed herein. In these studies, optimization algorithms (e.g., GA and PSO) played important roles in improving the performance of MLP, which is an experience worth learning from. However, for tasks with large datasets and many features, such as data collected from many similar studies, an ANN with more hidden layers, that is, a DNN, has a stronger ability to model objects or abstract features, and can also simulate more complex models. Therefore, DNNs are increasingly popular in studies simulating various reaction processes, especially as the computing power of computers continues to grow.
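The R2 and MAE values quoted throughout this section can be computed as in the sketch below. It assumes the usual definitions (R2 as the coefficient of determination, one minus the ratio of residual to total sum of squares, and MAE as the mean absolute difference); individual studies may define R2 differently, for example as the squared correlation coefficient, so reported values are not always directly comparable. The numbers are illustrative only.

```python
def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


def mean_absolute_error(observed, predicted):
    """Average absolute difference between observations and predictions."""
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / len(observed)


if __name__ == "__main__":
    # Hypothetical removal efficiencies (%): measured vs. model output.
    measured = [62.0, 75.0, 81.0, 90.0]
    modeled = [60.0, 77.0, 80.0, 92.0]
    print(r_squared(measured, modeled), mean_absolute_error(measured, modeled))
```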

4.2. Assisting characterization analysis

In addition to modeling and optimizing physical and chemical treatment processes, ML has also been applied in assisting characterization analyses. By processing data or images produced by various analytical instruments, ML approaches can deepen and broaden the analysis information, or reduce the workload and improve the efficiency of testers. For example, to improve the predictability of drinking water disinfection by-products (DBPs), parallel factor analysis, PCA, and autoencoders (AE) were applied to process high-dimensional fluorescence data of DBPs. Then a neural network approach was used to identify fluorescence regions associated with DBP formation, thus estimating the concentrations of DBPs. The AE-NN was found to provide more accurate predictions for both trihalomethanes (THMs) and haloacetic acids (HAAs) when compared with common organic measuring methods. This work offered a promising way to rapidly quantify DBPs or other organics by analyzing fluorescence EEM data (Peleato et al., 2018). In addition to the direct determination of DBPs, determining the relationships between the molecular properties of dissolved organic matter (DOM) and their reactivity as potential precursors of DBPs is also instructive for DBP control and management. For instance, Fourier-transform ion cyclotron resonance mass spectrometry (FT-ICR-MS) was used to provide chemical information about DOM molecules. Then the RF model was introduced to identify the molecular reactivity classes based on the chemical properties and peak intensities of the FT-ICR-MS. Finally, a classification approach based on molecular formulas rather than a prior definition of biogeochemical reactivity was applied to discriminate biogeochemical reaction types. With this approach, potential precursor DOM of DBPs could be distinguished and specifically removed in drinking water treatment processes (Fig. 6C) (Herzsprung et al., 2020). 
Apart from acting as DBP precursors, some organic micropollutants (MPs), including various anthropogenic pharmaceuticals and industrial chemicals, are also a non-negligible threat to water quality. High-resolution mass spectrometry (HRMS) has been commonly used to measure the concentration of MPs. However, the high cost of the stable isotope-labeled (SIL) standard used in HRMS restricts its economic efficiency in measuring MPs. To reduce the cost of HRMS analysis, DL and ML models were proposed to estimate the concentrations of five MPs without SIL standards, including sulpiride, metformin, benzotriazole, tebuconazole, and niflumic acid. First, 35 alternative MS subsets were selected from hundreds of MS results to examine their correlation with the SIL standard peak for determining the specific MS subset to replace SIL standards. Then four convolutional neural network (CNN) models, as well as the RF, SVM, and ANN models, were applied to estimate concentrations of the tested MPs by analyzing their mass spectrum using the selected MS subset. The results revealed that the CNN model, ResNet 101, achieved the best performance in estimating the concentrations of the five MPs, with an average R2 value of 0.84 (Baek et al., 2021). The CNN algorithm has also been applied in image recognition to classify and count diatoms on fouled surfaces during short-term field exposures. The acquired images were saved as two-channel red-green-blue (RGB) images, with the phase-contrast images encoded in the blue channel and the fluorescence images encoded in the green one. The proposed DL model could analyze the tested images and distinguish diatoms from sand, dirt, and bacteria on fouled surfaces in only 30 min, whereas manual counting took 180 h (Fig. 6D) (Krause et al., 2020). Microplastics are a type of emerging harmful pollutant derived from the fragmentation of plastics. ML approaches have also been applied to identify, quantify, and classify microplastics.
For example, near-infrared hyperspectral imaging (NIR-HSI) was used to characterize the intestinal tract content of crucian carps (Carassius carassius) that were exposed to water that contained various tested microplastics. An SVM classifier was developed to detect and classify the microplastics according to their NIR-HSI. The validation with Raman spectroscopy indicated that the SVM model performed well in classifying five different microplastics with high precision (97.19%) and recall (91.98%) values (Fig. 6E) (Zhang et al., 2019c). Moreover, the performance of the SVM, kNN, and LDA algorithms for identifying consumer plastics (CP) and marine plastic debris (MPD) was compared. Four spectroscopic techniques, including attenuated total reflectance-Fourier transform infrared spectroscopy (ATR-FTIR), NIR reflectance spectroscopy, laser-induced breakdown spectroscopy (LIBS), and X-ray fluorescence spectroscopy (XRF), were applied to characterize the tested plastics. The comparison results revealed that ATR-FTIR combined with kNN classifiers obtained the optimal classification accuracies of 95% and 99% for CP and MPD, respectively (Fig. 6F) (Michel et al., 2020). To assist the characterization analysis, ML is primarily used to deal with data or images generated by various analysis equipment. For image processing, CNNs are undoubtedly a popular and powerful option. Data processing tasks can likewise be divided into regression prediction (e.g., DBP and MP concentration estimation) and classification prediction (e.g., DBP precursor and microplastic classification). Based on the above analysis of the regression and classification algorithms, it is not difficult to select an appropriate regression or classification algorithm according to the size of the dataset and the number of features. For assisting the characterization analysis with ML, the real difficulty is to find the point at which ML technology and traditional analytical methods can be combined.
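The classification workflow described above (a kNN classifier applied to spectra, evaluated by precision and recall) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the three-band "spectra", the class labels `PE` and `PS`, and the function names `knn_predict` and `precision_recall` are hypothetical and are not taken from the cited studies, which used full spectra and tuned classifiers.

```python
import math
from collections import Counter


def knn_predict(train_x, train_y, query, k=3):
    """Return the majority label among the k training spectra nearest to query."""
    ranked = sorted(
        (math.dist(x, query), label) for x, label in zip(train_x, train_y)
    )
    nearest_labels = [label for _, label in ranked[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


def precision_recall(y_true, y_pred, positive):
    """Precision and recall for one class treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Made-up three-band reflectance values (hypothetical, illustration only).
train_x = [(0.9, 0.1, 0.2), (0.8, 0.2, 0.1), (0.85, 0.15, 0.25),
           (0.1, 0.9, 0.7), (0.2, 0.8, 0.6), (0.15, 0.85, 0.75)]
train_y = ["PE", "PE", "PE", "PS", "PS", "PS"]
test_x = [(0.88, 0.12, 0.18), (0.12, 0.88, 0.72), (0.5, 0.55, 0.4)]
test_y = ["PE", "PS", "PS"]

pred = [knn_predict(train_x, train_y, q) for q in test_x]
precision, recall = precision_recall(test_y, pred, positive="PS")
print(pred, precision, recall)
```

The third test spectrum lies between the two clusters and is misclassified, which lowers recall for `PS` while leaving its precision untouched; this is why the studies above report both values rather than accuracy alone.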

4.3. Purifying and distributing drinking water

In drinking water treatment plants (DWTPs), conventional treatment processes including coagulation, sedimentation, filtration, and disinfection, are designed to supply adequate and safe drinking water to users (LaPara et al., 2015). However, demand for high-quality drinking water and the gradual deterioration of raw water has brought requirements for more efficient treatment and management in DWTPs. Comprehensive and timely monitoring of water quality is essential for improving the management and automation development of DWTPs. To this end, ML has been applied to simulate water quality in a series of treatment units to provide real-time information on the operational conditions of DWTPs (Antanasijevic et al., 2013a). Some of the relevant studies are reviewed below.
Pre-oxidation is an optional process that depends on the quality of the raw water. It can be conducted prior to the coagulation procedure to oxidize iron, manganese, or taste and odor substances, and to control microorganisms or DBP precursors in raw water (Hu et al., 2018). Following pre-oxidation, coagulants are added to coagulate the insoluble contaminants by charge neutralization, enmeshment, or sweep to form particles and flocs that can be removed during subsequent sedimentation and filtration processes (Hogg, 2000). Therefore, an appropriate dosage of oxidants and coagulants is critical for the entire drinking water treatment process. However, the dosage of oxidants or coagulants is determined by complicated factors, such as NOM, turbidity, conductivity, and temperature (Heddam et al., 2012). Consequently, there is neither a simple empirical approach nor an accepted mathematical model to determine the optimal dosage for both oxidants and coagulants (Hu et al., 2018). Currently, ML approaches have been applied for interpreting relationships between the raw water characteristics and the demands of oxidants or coagulants. For pre-oxidizing oxidants, an MLP model was developed to predict the KMnO4 demand. In addition to an accurate simulation of the KMnO4 dosing rate, the proposed model also found that turbidity had the highest impact on KMnO4 demand, while the inflow rate had the least (Fig. 7A) (Godo-Pla et al., 2019). For coagulants, the MLP and ANFIS models were compared for simulating the poly aluminum chloride (PAC) dosage. The ANFIS model predicted the PAC dosage well with the previous day's PAC dosage as the only input. However, in an emergency such as high turbidity caused by a rainstorm, the MLP produced more stable predictions (Fig. 7B) (Wu and Lo, 2008). After coagulation, the sedimentation process removes the formed particles and flocs with high density, while the filtration process removes the remaining suspended ones.
An MLP model successfully predicted the post-filtration particle counts for three different size ranges in a dual-media filter of a DWTP (Griffiths and Andrews, 2011). Another MLP model was combined with GA optimization to predict the transmembrane pressure of a ceramic membrane microfiltration system. The proposed model optimized the settings for filtration time, flux, and aluminum dosage, thus minimizing the operational cost, with a reduction of approximately 15% (Strugholtz et al., 2008). Membrane filtration has gained increasing popularity in water treatment processes due to its strong purification ability. However, a major problem facing the membrane filtration process is membrane fouling, which reduces the purification ability and shortens the working life of the membrane. To monitor and predict membrane fouling, an MLP model was used to simulate the membrane resistance during the nanofiltration process. The developed model predicted 93% of the experimental data with an absolute relative error < 5%, which sufficed to reduce or even replace expensive site-specific tests (Shetty and Chellam, 2003). In addition, image analysis was also used to investigate membrane fouling. A DNN model was trained using thousands of high-resolution fouling layer images that were generated using optical coherence tomography to characterize membrane fouling during nanofiltration (NF) and reverse osmosis (RO) filtration. The proposed DNN accurately reproduced two- or three-dimensional images of the organic fouling growth with R2 values of 0.99 in the simulation for both fouling growth and flux decline (Fig. 7C) (Park et al., 2019). Moreover, determining the relationships between the physicochemical properties of NOM and their effects on membrane fouling is also helpful to prevent fouling events. Recently, a PCA model was used to analyze the relationships between organic matter and their fouling propensity.
Then a clustering analysis was applied to classify the water resources into three groups according to their low, intermediate, or high fouling potential. The results revealed a correlation between strong fouling events and a combined increase of carbon and protein-like substance contents in water. This conclusion was helpful for membrane users to make a better selection of feed water resources to prevent membrane fouling events (Teychene et al., 2018). Following the pre-oxidation, coagulation, sedimentation, and filtration processes, disinfection is performed to further inactivate microorganisms. Chlorine is the most common disinfectant because of its high efficiency, low cost, and persistence, which meets the residual disinfectant demand. However, chlorine will interact with NOM to generate DBPs, such as trihalomethanes (THMs), haloacetic acids (HAAs), and total organic halide (TOX), and these can act as human mutagens, carcinogens, and teratogens (Hamidin et al., 2008). An MLP model was developed to simulate the concentrations of these three types of DBPs in raw and treated water, and accurate predictions were obtained with R2 values ranging from 0.71 to 0.97 (Fig. 7D) (Kulkarni and Chellam, 2010). In addition to the water quality or operational conditions in various treatment units inside a DWTP, a recent study forecasted the overall performance of DWTPs across China. The authors collected the monthly data of water quality and operational parameters from 45 DWTPs nationwide as the dataset. By using these data, they proposed an MLP model that predicted the water production variation well. This work enabled decision-makers and DWTP managers to revise production plans in response to different raw water quality, water quality standards, and market demands (Zhang et al., 2019b).
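The studies in this paragraph report model quality in two ways: the coefficient of determination (R2) and the share of predictions falling within a relative error bound (e.g., 93% of data with an absolute relative error < 5%). The sketch below shows one common way to compute both. The function names and the four observed/predicted values are hypothetical and chosen only for illustration; the cited papers may define their metrics slightly differently.

```python
def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


def fraction_within(observed, predicted, tol=0.05):
    """Share of predictions whose absolute relative error is below tol."""
    hits = sum(abs(p - o) / abs(o) < tol for o, p in zip(observed, predicted))
    return hits / len(observed)


# Made-up membrane resistance values (arbitrary units, illustration only).
observed = [10.0, 20.0, 30.0, 40.0]
predicted = [10.2, 19.5, 33.0, 40.0]

print(round(r_squared(observed, predicted), 5))
print(fraction_within(observed, predicted))
```

In this toy case a single prediction misses by 10%, so only three of four points fall within the 5% bound even though R2 remains above 0.98; the two metrics answer different questions and are worth reporting together.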

Fig. 7. The applications of ML in water purification and distribution. (A) Representation of predictions for KMnO4 demand time-series.  Reproduced from (Godo-Pla et al. 2019) with permission. Copyright (2019) Elsevier Ltd. (B) The optimal model of ANN for predicting real-time coagulant dosage in water treatment. Reproduced from (Wu and Lo 2008) with permission. Copyright (2008) Elsevier Ltd. (C) The comparison of observed and simulated 3D fouling images (The unit of the axis is μm). Reproduced from (Park et al. 2019) with permission. Copyright (2019) Elsevier Ltd. (D) Comparisons of ANN predictions with experimental measurements for DBP formation in conventionally treated waters. Reproduced from (Kulkarni and Chellam 2010) with permission. Copyright (2010) Elsevier Ltd. (E) Forecasts of chlorine in a water distribution system at Aldinga. Reproduced from (Bowden et al. 2006) with permission. Copyright (2006) Elsevier Ltd. (F) Schematic for predicting DBPs in a DWDS. Reproduced from (Lin et al. 2020) with permission. Copyright (2020) Elsevier Ltd. (G) Actual and predicted search volume for symptoms of gastrointestinal illness with the RF and bagged CART models. Reproduced from (Shortridge and Guikema 2014) with permission. Copyright (2014) Elsevier Ltd.

After the treatment in DWTPs, drinking water of adequate quality and quantity is delivered to consumers via drinking water distribution systems (DWDSs). Therefore, water quality in DWDSs is also important for the entire drinking water supply system. Discoloration is one of the most disturbing problems facing drinking water producers and users. Elevated concentrations of Fe and Mn have been recognized as the major causes of drinking water discoloration (Speight et al., 2019). Therefore, predicting the accumulation potential of Fe and Mn is important for preventing discoloration. An MLP model was developed to investigate the influence of chemical properties of water on the accumulation behaviors of Fe and Mn. The results showed that free chlorine residual (FCR) played complex roles in affecting the accumulation of Fe and Mn. Concretely, a high concentration of FCR would oxidize soluble Fe2+ and Mn2+ to insoluble Fe3+ and Mn4+, thus aggravating discoloration. On the contrary, as a disinfectant, FCR could sterilize the oxidizing bacteria to prevent biological oxidation of Fe and Mn, thus easing discoloration. Therefore, to achieve the equilibrium of chemical and biological oxidation, the FCR concentration was recommended to be 0.8 to 1.8 mg/L in the tested DWDSs (Danso-Amoakoa and Prasad, 2014). Moreover, an appropriate amount of FCR can also restrain the growth of other harmful microbial pathogens, while high dosing will lead to undesired taste, corrosion of the pipe network, and an increase in treatment costs. Therefore, it is beneficial to accurately predict FCR in DWDSs to achieve a balance among bacteriostasis, pleasant water quality, and economic cost (May et al., 2008). In a study that applied the GRNN model to forecast FCR in a DWDS, the flow, temperature, and chlorine residual data were collected as variables. Then a GA method was used to optimize the division of the dataset into training, testing, and validation sets.
The results showed that the GRNN model accurately predicted the FCR up to 72 h in advance with an R2 value of 0.9617 (Fig. 7E) (Bowden et al., 2006). In addition to the FCR, the concentration of DBPs in DWDSs is another hotspot of concern related to the disinfection process. For instance, an RBF NN predicted the concentration of nine different HAAs with an accuracy ranging from 75% to 91% (Fig. 7F) (Lin et al., 2020). Moreover, a Bayesian network found that monochloramine had a negative effect, while DOC and TDS exerted positive effects on the formation of THMs (Li et al., 2020a). In addition to investigating factors that influence the water quality in DWDSs, ML has also been used to detect contamination events and forecast pipeline failures in the distribution networks. For instance, contamination events in DWDSs can be caused by the injection of any potential pollutants. To detect anomalous fluctuations in water quality, an MLP model was used to depict the temporal variations in water quality indicators. Then the Bayesian updating rule was recursively applied to analyze the probability of a related contamination event (Perelman et al., 2012). Moreover, the recursive Bayesian rule was also combined with GA optimization to improve the performance of the MLP-Bayesian. By applying adaptive updating dynamic thresholds to optimize the fixed thresholds in the outliers’ classification module, better performance in the detection of contamination events in DWDSs was obtained (Arad et al., 2013). In addition, pipeline leakage detection and location in DWDSs are also of significance. For instance, the SVM and MLP models were respectively used to simulate the leakage size and location by analyzing the time series behavior of flow and pressure in pipe networks (Makaya and Hensel, 2016). Interestingly, Shortridge and Guikema compared pipe break events with weekly internet search volume for symptoms of gastrointestinal illness using data mining techniques.
The results revealed a positive correlation between elevated risk on public health and pipe failures (Fig. 7G) (Shortridge and Guikema, 2014). This finding emphasized the importance of the application of ML in real-time monitoring of the functional integrity of pipelines. Like water quality predictions in natural water systems, the volume of features involved in a water quality simulation for drinking water treatment and distribution is relatively small, and water quality parameters are interconnected and affect each other through chemical and physical interactions. Therefore, NN-based models (e.g., MLP and ANFIS) are competent for most water quality prediction tasks and have been applied widely (Table S12). Additionally, LSTM based on time series and image recognition technology based on DL have also been applied for predicting membrane fouling. In fact, water quality parameters in drinking water treatment and distribution processes are all time series that provide the feasibility for LSTM to solve more problems in engineered water systems. Moreover, image recognition technology can also be used to investigate processes of visual changes such as coagulation, sedimentation, and discoloration.
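The recursive Bayesian updating mentioned above for contamination event detection can be sketched as follows. This is a generic application of Bayes' rule rather than the exact procedure of the cited studies: the outlier flags are assumed to come from comparing model predictions with measurements, and the prior (0.01), the probability of an outlier during an event (`tpr`), and the probability of an outlier under normal operation (`fpr`) are hypothetical values chosen for illustration.

```python
def bayes_update(prior, is_outlier, tpr=0.8, fpr=0.1):
    """One Bayes-rule update of the event probability given one outlier flag.

    tpr: assumed P(outlier | event); fpr: assumed P(outlier | no event).
    """
    if is_outlier:
        like_event, like_normal = tpr, fpr
    else:
        like_event, like_normal = 1.0 - tpr, 1.0 - fpr
    numerator = like_event * prior
    return numerator / (numerator + like_normal * (1.0 - prior))


def event_probabilities(flags, prior=0.01):
    """Apply the update recursively: each posterior becomes the next prior."""
    history = []
    for flag in flags:
        prior = bayes_update(prior, flag)
        history.append(prior)
    return history


# One normal time step followed by three consecutive outliers (made-up flags).
probs = event_probabilities([False, True, True, True])
print([round(p, 4) for p in probs])
```

A single outlier barely moves the event probability, whereas several consecutive outliers raise it sharply, which is the behavior that makes the recursive rule more robust to isolated sensor noise than a fixed threshold on individual readings.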

4.4. Collecting and treating sewage water

Sewer networks are designed to collect domestic sewage and industrial and hospital wastewater, and transport them to WWTPs for contaminant removal. Therefore, the good condition of a sewer network plays an important role in maintaining the sanitation of an urban environment. Currently, ML approaches have been applied to improve the maintenance of sewer networks. In sewers, fluctuations in flow rate and pressure typically cause debris sedimentation that can gradually form accumulative deposits. Bottom deposits will reduce the transport capacity of pipelines and form anaerobic conditions to cause corrosion or odor problems (Bonakdari and Larrarte, 2006). To monitor the deposition situation and predict the bed loads inside pipelines, an ANFIS model was applied to analyze the effects of sewer geometrical features and flow hydraulic characteristics (Azamathulla et al., 2012). In addition, an RF model was also used to predict the self-cleansing sewer velocity, which allows the flow to spontaneously scour and remove deposited sediments from the pipes. The results showed that the most important variable determining the self-cleansing velocity was the volumetric sediment concentration (Fig. 8A) (Montes et al., 2020). In addition, the multigene genetic programming technique was adopted to estimate the particle Froude number in large sewers. Both works were found to be conducive to self-cleansing sewer design (Safari and Danandeh Mehr, 2018). Recently, an ensemble procedure comprising the Network K-function, geographically weighted Poisson regression, and the RF algorithm was applied to analyze the factors that affect sewer pipe blockages. Explanatory factors, such as material type, number of service connections, self-cleaning velocity, sagging pipes, root intrusion risk, closed-circuit television (CCTV) inspection grade, and distance to restaurants showed significantly varying impacts on the blockage propensity.
In addition, the RF model predicted the blockage recurrence with a 60–80% accuracy in one of the studied cities (Okwori et al., 2021). Moreover, ML has also been applied in sewer defect detection. Closed circuit television (CCTV) is also a popular tool for visual inspection of the internal conditions of pipelines. However, CCTV inspection relies heavily on human labor to perform video analysis, which is time-consuming, labor-intensive, and error-prone. In this case, the SVM model based on anomalous frame recognition and classification was proposed to improve the efficiency and accuracy of CCTV inspections (Ye et al., 2019). Moreover, better performance in defect recognition and classification was achieved by applying a CNN model, which obtained an accuracy of 96.33% (Fig. 8B) (Hassan et al., 2019).

Fig. 8. The applications of ML in wastewater collection and treatment. (A) Variable importance estimated by RF model in determining self-cleansing velocity: (a) without deposited bed; (b) with a deposited bed. Reproduced from (Montes et al. 2020) with permission. Copyright (2020) Elsevier Ltd. (B) Overview of the defect classification and location recognition framework for sewer line assessment system. Reproduced from (Hassan et al. 2019) with permission. Copyright (2019) Elsevier Ltd. (C) Schematic flow diagram of the BPANN-AO + ALK/ULS system to remove ammonia nitrogen. Reproduced from (Yang et al. 2021) with permission. Copyright (2020) Elsevier Ltd. (D) Observed and predicted sludge volume index in Batna wastewater treatment plant. Reproduced from (Djeddou and Achour 2015) with permission. Copyright (2015) Larhyss Journal. (E) Comparison of experimental and ANN model predicted values for trichloroethylene concentration. Reproduced from (Baskaran et al. 2020) with permission. Copyright (2019) Elsevier Ltd. (F) Comparison of the measured and ANN predicted results for mercury removal efficiency. Reproduced from (Yaqub and Lee 2020) with permission. Copyright (2019) Elsevier Ltd. (G) Forecast N2O concentrations from (a) the DNN-based model and (b) the LSTM-based model over the fixed prediction horizon (1 day). Reproduced from (Hwangbo et al. 2021) with permission. Copyright (2021) Elsevier Ltd.

When transported into the WWTPs, sewage and wastewater will be treated by a series of treatment processes. However, increasingly stringent sewage discharge standards have brought demands for more advanced treatment processes and more efficient management in WWTPs. Like DWTPs, timely and readily obtaining water quality information helps improve the operation and management of a WWTP. To achieve this goal, ML has been used to simulate the water quality and operation status in various treatment processes in different WWTPs.
In studies applying ML to predict the performance of WWTPs, water quality indicators such as COD, BOD, phosphorus, and ammonia have typically been adopted to evaluate the effluent quality (Khatri et al., 2019). For example, an MLP model accurately predicted the effluent TN, TP, and COD concentrations in small-scale WWTPs in Korea's rural areas, thus realizing real-time remote supervision of WWTPs (Lee et al., 2008). Similarly, the ANN model was also applied for predicting the effluent quality of processes including a membrane bioreactor (MBR) (Giwa et al., 2016), a sequential batch reactor (SBR) (Khatri et al., 2019), an anoxic/oxic (AO) system (Yang et al., 2021), and aerobic granular sludge (AGS) reactors (Mahmod and Wahab, 2017). Additionally, some other algorithms, such as SVR (Seshan et al., 2014), ANFIS (Perendeci et al., 2009), and extreme learning machine (ELM) (Lotfi et al., 2019) have also obtained encouraging results for predicting effluent water quality. In a combined lysis-cryptic and biological nitrogen removal system, an MLP model realized an accurate simulation of the process with an R2 value of 0.9513. By adjusting the sludge lysate return ratio in real time, the optimized system removed greater than 97–99% of the ammonia nitrogen with zero excess sludge production (Fig. 8C) (Yang et al., 2021). Another study used molecular data to train the SVR model to predict the effluent COD, ammonia, nitrate, and the removal of 3-chloroaniline. Molecular data were generated from a terminal restriction fragment length polymorphism analysis targeting the 16S rRNA and amoA genes from the sludge community. The results showed that the proposed SVR model simulated the effluent water quality with R2 values ranging from 0.89 to 0.97 for the four tested parameters (Seshan et al., 2014).
Similarly, metagenomic information was also applied to investigate the strength of associations between ARGs and different bacterial taxa using the RF model that revealed that genera including Bacteroides, Clostridium, and Streptococcus were primarily the hosts of the selected ARGs (Sun et al., 2021). Moreover, comparisons among different algorithms have been conducted. SVM was found to perform better for predicting effluent total ammonia nitrogen (TAN) than ANN (Alejo et al., 2018), while it was better for simulating the concentration of Kjeldahl nitrogen (KN) than ANFIS (Manu and Thalla, 2017). The superior ability of SVR compared to these two algorithms was also reported when predicting the performance of AGS reactors (Zaghloul et al., 2020a). In contrast to these comparisons, a multi-stage model that combined several ML algorithms was developed for predicting effluent COD, NH4-N, and PO43−. During the first stage of this model, the selected variables were input into the ANN, SVR, and ANFIS models. Then the outputs were combined as inputs for the subsequent ensemble algorithms, of which the best outputs were determined to be the final prediction. This approach improved the performance of ML models with a small dataset by combining the advantages of different algorithms instead of discussing and selecting the best single one (Zaghloul et al., 2020b).
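The multi-stage idea described above, in which the outputs of several base models are passed to candidate ensemble schemes and the best-performing scheme supplies the final prediction, can be sketched in simplified form. The sketch is an assumption-laden stand-in, not the cited method: the base model outputs are made-up numbers standing in for ANN, SVR, and ANFIS predictions, and the two candidate schemes (a simple mean and an inverse-RMSE weighted mean) are hypothetical substitutes for the ensemble algorithms used in the study.

```python
import math


def rmse(observed, predicted):
    """Root mean square error between two equally long sequences."""
    return math.sqrt(
        sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed)
    )


def combine(base_preds, weights):
    """Weighted average of the base model outputs for every sample."""
    total = sum(weights)
    n_samples = len(base_preds[0])
    return [
        sum(w * preds[i] for w, preds in zip(weights, base_preds)) / total
        for i in range(n_samples)
    ]


def select_ensemble(observed, base_preds):
    """Score each candidate weighting scheme and return the best one."""
    errors = [rmse(observed, preds) for preds in base_preds]
    candidates = {
        "simple_mean": [1.0] * len(base_preds),
        "inverse_rmse": [1.0 / max(e, 1e-12) for e in errors],
    }
    scores = {
        name: rmse(observed, combine(base_preds, weights))
        for name, weights in candidates.items()
    }
    best = min(scores, key=scores.get)
    return best, scores[best]


# Made-up effluent COD values (mg/L) and stand-in first-stage model outputs.
observed = [50.0, 60.0, 55.0, 70.0]
base_preds = [
    [52.0, 63.0, 54.0, 74.0],  # stand-in for ANN output
    [49.0, 58.0, 56.0, 67.0],  # stand-in for SVR output
    [55.0, 66.0, 60.0, 77.0],  # stand-in for ANFIS output
]

best_name, best_rmse = select_ensemble(observed, base_preds)
print(best_name, round(best_rmse, 3))
```

For brevity the weights are fitted and scored on the same four samples; in practice the second stage would be evaluated on held-out data, since otherwise the selection step favors whichever scheme overfits most.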
In addition to the common indicators mentioned above, some other parameters related to WWTPs have also been predicted. For example, the accumulation of fat, oil, and grease in the sumps of wastewater pumping stations was monitored using CNN-based image recognition technology (Moreno-Rodenas et al., 2021). The sludge volume index (SVI) was predicted to monitor the running conditions of the activated sludge process (Fig. 8D) (Djeddou and Achour, 2015). In addition, the daily flow rates for a WWTP were simulated to better design its treatment units (Najafzadeh and Zeinolabedini, 2019). Moreover, some toxic substances present in effluent have also been predicted. For instance, in a two-phase continuous stirred tank bioreactor (CSTB) used for treating a trichloroethylene (TCE) polluted stream, an MLP model simulated the treatment process with excellent accuracy (R2=0.9923) (Fig. 8E) (Baskaran et al., 2020). The fecal coliform and total coliform removal efficiencies were also predicted with an MLP model in a WWTP using an intermittent cycle extended aeration-sequential batch reactor (ICEAS-SBR) (Khatri et al., 2020). Additionally, ML approaches have also been used to predict the removal of heavy metals including zinc (Rahmanian et al., 2011), chromium (Yaqub et al., 2019), and mercury (Yaqub and Lee, 2020). All three studies provided helpful guidance for future research on the removal of other heavy metals (Fig. 8F). Moreover, DNN and LSTM models were compared for predicting the emission of nitrous oxide (N2O) from a WWTP in Denmark. A total of 750,000 measurements, including the influent flow rate, airflow rate, temperature, ammonium, nitrate, and DO, were collected to train the models. The higher R2 values (0.94 versus 0.76) indicated that the LSTM performed better than the DNN (Fig. 8G). A sensitivity analysis revealed that the temperature, NO3-N, and NH4-N were the most important factors contributing to the N2O concentration and N2O emission rate.
WWTPs are a major contributor to global greenhouse gas emissions because of the strong warming effects of N2O. Therefore, this study was beneficial for understanding the production and emission mechanisms, as well as developing control strategies for N2O to ensure the sustainable operation and development of WWTPs (Hwangbo et al., 2021). Recently, ML has also been applied to model anaerobic digestion (AD) processes. AD is a popular technique that can convert organic waste and wastewater into biomethane to harvest energy. In one such study, six ML algorithms were combined with microbial gene sequencing techniques to predict the methane yield. Genomic data and their corresponding operational parameters from eight research groups were used to train the models. The RF model achieved an accuracy of 0.82 using the combination of operational parameters and genomic data. Moreover, the importance of microbial community members for methane production was quantified for the first time, and the results showed that Chloroflexi, Actinobacteria, Proteobacteria, Fibrobacteres, and Spirochaeta were the top five most influential phyla. This study provided valuable information regarding the AD process for monitoring and controlling methane production (Long et al., 2021). According to Table S15, MLP was undoubtedly a powerful tool for predicting water quality in sewage or wastewater treatment and has been widely used over the past two decades, indicating that it has not been rendered obsolete by the popularity of other emerging powerful algorithms. However, it cannot be ignored that emerging technologies bring more possibilities and efficiency to water quality prediction with ML. For example, DL makes it possible to deal with tasks with large feature spaces, while image recognition develops a non-data prediction framework.
Moreover, the combination of ML and bioinformatics analysis provides richer data and an in-depth perspective for predicting biological treatment processes, which is lacking in the current studies and will require special attention in future research.
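Time-series models such as the LSTM and DNN forecasts discussed in this section are typically trained on windows of past measurements paired with a future target value. The sketch below shows this windowing step only; the function name `make_windows`, the window length, and the short series are hypothetical, and the cited studies may have prepared their inputs differently.

```python
def make_windows(series, n_lags, horizon=1):
    """Turn a time series into (past window, future value) training pairs.

    Each input holds n_lags consecutive values; the target is the value
    `horizon` steps after the end of that window.
    """
    inputs, targets = [], []
    for start in range(len(series) - n_lags - horizon + 1):
        inputs.append(series[start:start + n_lags])
        targets.append(series[start + n_lags + horizon - 1])
    return inputs, targets


# Made-up N2O concentrations at consecutive time steps (illustration only).
series = [0.1, 0.2, 0.4, 0.3, 0.5, 0.6]
X, y = make_windows(series, n_lags=3)
for window, target in zip(X, y):
    print(window, "->", target)
```

Increasing `horizon` shifts the target further into the future, which corresponds to forecasting over a longer prediction horizon at the cost of fewer usable training pairs.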

5. Discussion, conclusions, and prospects

5.1. Discussion and conclusions

The above review lists representative applications of ML in fields of water science, and exhibits the availability and efficiency of ML for solving problems concerning water utilization and pollution control. To better use ML in water-related research, previous research and algorithm applications need to be analyzed to provide other researchers with common characteristics and rules that can be used for reference.
According to the classification and analysis of the studies reviewed above, the combination of ML with water science was found to be realized primarily in the following manners (Table 2): i) Predicting the status of desired contaminants was based on analyzing their interactions with other water-related parameters. In these studies, the ML approaches could be further divided into regression and classification analyses. In studies that adopted regression analyses, the specific concentration of pollutants was predicted. For example, they were utilized for predicting the WQIs in natural and engineered systems, and simulating pollutant concentrations in treatment reaction processes. In studies that applied classification analyses, the category of the target (e.g., the pollutant concentration range or water quality classification) rather than a specific value was determined. For example, classification analyses were utilized for mapping contaminant distributions in groundwater and classifying water resources into different categories according to their water quality; ii) Mining big data from previous studies was used to summarize universal rules and mechanisms, thus guiding similar reactions or processes. These included evaluating the toxicity of emerging pollutants, and applying adsorption or oxidation reactions to remove pollutants or test new reactive materials. In this manner, the valuable information hidden in an increasing number of scientific studies was systematically mined to provide more theoretical guidance for similar studies; iii) Image recognition technology was applied to analyze the relationships between the image information and physicochemical properties of the research object, thus characterizing water quality, identifying specific contaminants, and detecting equipment faults in engineered water systems.
This approach improved the efficiency of analysis methods with spectrum or image as output thereby reducing the difficulty in detecting sewer faults. In particular, this method provides new insights for analyzing water quality in this era when high-definition cameras and remote sensing are increasingly popular.

Table 2. Recommendations on the selection of ML algorithm in different research directions of water science.

| Means | Applications | Algorithm recommended | Algorithm characteristics | Applicable conditions |
| --- | --- | --- | --- | --- |
| Regression | Predicting water quality | MLP | Simple structure; slow convergence, local optimum, black box | Fewer features; data with big volume |
| | Evaluating pollutant toxicity | RBF NN | Strong generalization, fast convergence; complex structure, black box | Data with small volume |
| | Modeling treatment technique | CART | Interpretable, fast training; easy to over-fitting | Data with a balanced sample size for each category |
| | Assisting characterization analysis | RF | Interpretable, fast training, anti-overfitting, no need for feature selection; calculation burden | More features; data with less noise |
| | Purifying and distributing drinking water | LSTM | Long-time memory; complex structure, black box, calculation burden | Data of time series |
| | Collecting and treating sewage water | DNN | Strong ability in fitting, feature extraction, and fault tolerance; complex structure, black box, calculation burden | Data with a huge volume |
| Classification | Classifying water resources | CART | As mentioned above | Fewer features; as mentioned above |
| | Mapping groundwater contaminants | RF | As mentioned above | More features; as mentioned above |
| Data mining | Evaluating pollutant toxicity | DNN | As mentioned above | Usually, the volume of data is huge and molecular descriptors are needed |
| | Modeling treatment technique | | | |
| Image recognition | Predicting water quality | CNN | Automatic feature extraction | The sample is presented in the form of images |
| | Assisting characterization analysis | | | |
| | Purifying and distributing drinking water | | | |
| | Collecting and treating sewage water | | | |
Note: a) Theoretically, data mining and image recognition also belong to regression or classification, but they are discussed separately because of their distinctive characteristics; b) That an algorithm is suitable under a certain condition does not mean that other algorithms are unsuitable for that condition; c) All the algorithms listed in this table can be used for both regression and classification.
Although ML has been recognized as a powerful tool for solving water-related problems, different algorithms have exhibited different performance in these fields. Discussing and analyzing the applicability of commonly used algorithms in different application scenarios can therefore help other researchers choose an appropriate algorithm or further optimize one. To determine the commonly used algorithms, the usage frequency of each algorithm in the 215 reviewed studies was counted, and the usage frequency of these algorithms in water-related research was also retrieved. Both statistical schemes showed that MLP, ANFIS, RBF, SVM, LR, CART, and RF were the commonly used ML algorithms, while LSTM, DNN, and CNN were the popular DL algorithms (Text S1). The advantages and disadvantages of these algorithms in different research directions were therefore compared and analyzed, and the results are briefly presented in Table 2.
In general, MLP was the most widely used algorithm for both regression and classification tasks, especially in earlier studies with a small volume of data and few features. MLP achieved satisfactory performance in predicting pollutant concentrations in natural waters, chemical reaction processes, and water or wastewater treatment. However, although MLP has a strong ability to fit nonlinear problems and is robust to data noise, it suffers from long training times, slow convergence, and convergence to local optima caused by gradient descent. In contrast, RBF NN makes up for these disadvantages of MLP and can even achieve higher approximation accuracy and generalization ability. However, as the volume of training samples increases, the number of hidden layer neurons in an RBF NN becomes higher than that in an MLP, which increases the complexity and computational burden of the RBF network.
Recently, DL has also been applied to problems in water science. Because more hidden layers are introduced, DNN has stronger abilities in nonlinear learning, feature extraction, and fault tolerance. Consequently, DNN has been used to handle tasks with a large volume of data and features, especially in studies that mined data from the literature. Additionally, RNN and LSTM can deal with time series data. Through the switching of its cell gates, LSTM realizes long-term memory, thus alleviating the vanishing and exploding gradient problems during training. Therefore, LSTM can solve more problems in water science, as the data in many fields of water-related research are time series. Although DL models seem able to handle nearly all tasks in water science, they still face the black box problem, which is unavoidable for all NN-based algorithms.
In contrast, DT-based models are more interpretable, as they can report the importance of different features and generate understandable rules. In addition, DT-based models can quickly solve multi-class classification tasks with a large number of features. This is an advantage over LR and SVM: LR cannot handle nonlinear problems or large feature spaces, whereas SVM is time-consuming to train and a suitable kernel function is sometimes difficult to find. Therefore, CART and especially RF models have been widely used for both regression and classification tasks, such as mapping groundwater contaminants and classifying water sources. This is because RF overcomes the problems of over-fitting and data imbalance and is robust to the loss of features.
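The vanishing and exploding gradient problem mentioned above for recurrent networks can be illustrated with simple arithmetic. The Python sketch below is a deliberately simplified scalar picture, not a full network: it assumes that each time step multiplies the back-propagated gradient by one constant factor, and the factor values (0.5, 1.5, and a forget gate of 0.99) are hypothetical. In a plain recurrent unit this factor stands for the recurrent weight times the activation slope; along the LSTM cell-state path c_t = f * c_(t-1) + i * g it is the forget gate f.

```python
def gradient_through_time(factor, steps):
    """Gradient magnitude after back-propagating through `steps` time steps,
    assuming every step multiplies the gradient by the same scalar `factor`."""
    grad = 1.0
    for _ in range(steps):
        grad *= factor
    return grad


STEPS = 50  # hypothetical sequence length

# Plain recurrent unit with a per-step factor below 1: the gradient vanishes.
vanishing = gradient_through_time(0.5, STEPS)

# Plain recurrent unit with a per-step factor above 1: the gradient explodes.
exploding = gradient_through_time(1.5, STEPS)

# LSTM cell-state path with a forget gate close to 1 (an assumed value):
# the gradient survives over many steps.
lstm_path = gradient_through_time(0.99, STEPS)

for name, value in [("vanishing", vanishing),
                    ("exploding", exploding),
                    ("lstm_path", lstm_path)]:
    print(f"{name}: {value:.3e}")
```

Under these assumptions, the first two gradients differ from 1 by many orders of magnitude after 50 steps, while the gated path stays of order 1, which is the intuition behind the long-term memory of LSTM.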
Based on the above comparison and analysis of the commonly used algorithms, recommendations on algorithm selection were made according to their applicability to different scenarios (Table 2). For regression tasks with a small volume of data, both MLP and RBF NN are competent. If model complexity is not a concern, RBF NN may be the better choice because of its stronger generalization, fitting, and convergence. For classification tasks, the CART model is more suitable for data with a balanced sample size for each category, because the information gain is biased toward the feature with more data, which leads to over-fitting. Otherwise, the RF model is more suitable, especially for tasks with a large number of features. Moreover, for tasks with a huge amount of data, DNN and LSTM are more competent.
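The recommendations above can be read as a small set of decision rules. The following Python sketch encodes them as a helper function under stated assumptions: the function name `recommend_algorithm`, its parameters, and the numeric cut-offs `large_data` and `many_features` are hypothetical, since the review gives no numeric thresholds. Choosing LSTM for large time series data and DNN otherwise combines the text with the applicable conditions listed in Table 2. As the table note states, a recommendation does not mean that other algorithms are unsuitable.

```python
def recommend_algorithm(task, n_samples, n_features, time_series=False,
                        balanced_classes=True, prefer_simple=False,
                        large_data=10_000, many_features=20):
    """Suggest an algorithm following the recommendations summarized in the text.

    `large_data` and `many_features` are hypothetical cut-offs chosen for
    illustration; they do not come from the review.
    """
    if task not in ("regression", "classification"):
        raise ValueError("task must be 'regression' or 'classification'")

    # Huge amount of data: DL models; LSTM when the data are a time series.
    if n_samples >= large_data:
        return "LSTM" if time_series else "DNN"

    if task == "regression":
        # Small data: MLP and RBF NN are both competent; RBF NN is preferred
        # unless a simpler model structure is wanted.
        return "MLP" if prefer_simple else "RBF NN"

    # Classification: CART for balanced classes and few features, otherwise RF.
    if balanced_classes and n_features < many_features:
        return "CART"
    return "RF"


# Example: a small, imbalanced classification dataset with many features.
print(recommend_algorithm("classification", 500, 40, balanced_classes=False))
```

Keeping the thresholds as keyword arguments makes the assumptions explicit and lets a user adapt them to the size and character of a specific dataset.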

5.2. Prospects

The development of ML in water science is still at an initial stage; however, ML has shown its feasibility and efficiency in solving complex environmental problems in many studies. To make ML technology more accessible to environmental researchers, thus solving research problems in the water science field more efficiently, joint efforts in terms of algorithm development, data curation, and interdisciplinary cooperation are required.
  • i) As analyzed above, no single algorithm is perfect for all tasks in the water science field. Algorithms with simple structures typically have defects in performance (e.g., MLP and CART), while those with excellent performance often have complex structures, which increases the difficulty of programming and the hardware cost of operation (e.g., DNN). Therefore, given the characteristics of the data in water-related research, such as a moderate amount of time series data, the development of algorithms with simple structures, high performance, and strong interpretability is encouraged. Moreover, graphical user interfaces (e.g., the one designed for modeling adsorption processes) and user-friendly data analytics tools (e.g., SourceTracker) designed specifically for water-related studies can also reduce the cost and difficulties that researchers encounter when using ML techniques.
  • ii) Data mining helps collect data from similar studies to form big data, thus revealing underlying rules or providing a data basis for other big data researchers. However, in traditional research areas, including water science, data from other studies are often difficult to obtain. Open data and data sharing are common ways to provide rich data sources for ML applications. However, open data in the water science field appear insufficient compared with other fields where ML techniques were applied earlier and have been used in more depth, e.g., drug research (Vamathevan et al., 2019), biological research (Camacho et al., 2018), and solid Earth geosciences (Bergen et al., 2019), for which many open data platforms have been developed. Therefore, the concept of open data and data sharing is expected to be accepted and practiced more widely in the water research community, and researchers are encouraged to share their research data where no conflicts of interest or legal and regulatory restrictions apply.
  • iii) The programming and implementation of ML models depend on the researchers' computer skills and mastery of algorithms, which are difficult for most water researchers to acquire in a short time. To lower the threshold for researchers to use ML technologies, interdisciplinary communication and cooperation with data researchers are beneficial. Within such a cross-disciplinary framework, data researchers can provide professional suggestions on data processing and modeling, while water researchers can interpret the output of the model with expert knowledge. Moreover, with the help of data researchers, cutting-edge algorithms can also be used to solve problems in the water science field.

Declaration of Competing Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgment

The authors acknowledge funding from the National Natural Science Foundation of China (52070029, 51961125104).

Appendix. Supplementary materials

References
