In the era of big data, the surge in volume is matched by the challenges of data quality, which resonates across all domains of data-driven and artificial-intelligence-related technologies. This research proposes a method to navigate these challenges by introducing a paradigm based on deep reinforcement learning, capable of discerning value in data across varied contexts. By enabling the strategic selection of optimal data samples, we envision a future where analytics are not just smarter but also more adaptable, allowing decision makers to harness the full potential of their data assets. The implications of this work extend beyond technical realms, offering insights that could shape policies and provide fresh perspectives in data-driven industries.
Highlights
• We introduce a learning-based paradigm for data valuation across scenarios
• Our study identifies varying effects of high/low-quality data on model efficacy
• This method explores inherent and transferable value patterns across datasets
• Analysis reveals data value’s geographic sensitivity in nationwide power forecasts
Summary
Artificial intelligence has substantially improved the efficiency of data utilization across various sectors. However, the insufficient filtering of low-quality data poses challenges to uncertainty management, threatening system stability. In this study, we introduce a data-valuation approach employing deep reinforcement learning to elucidate the value patterns in data-driven tasks. By strategically optimizing with iterative sampling and feedback, our method is effective in diverse scenarios and consistently outperforms the classic methods in both accuracy and efficiency. In China’s wind-power prediction, excluding 25% of the overall dataset deemed low-value led to a 10.5% improvement in accuracy. Utilizing just 42.8% of the dataset, the model discerned 80% of linear patterns, showcasing the data’s intrinsic and transferable value. A nationwide analysis identified a data-value-sensitive geographic belt across 10 provinces, leading to robust policy recommendations informed by variances in power outputs and data values, as well as geographic climate factors.
Artificial intelligence (AI) technologies have revolutionized the utilization of data for constructing machine-learning models, thereby empowering and optimizing real-world research analysis and production processes, leading to tangible benefits.1 For example, in the low-carbon energy revolution, the integration of digital technologies links the production, transmission, consumption, and storage of renewable energy with data and control systems, building the basic support of the modern energy industry.5 However, the sheer volume and diversity of data not only lead to an abundance of low-quality data with escalating processing and storage costs but also pose challenges in training efficiency due to the variable significance of data in AI-based modeling. The existence of low-quality data causes models to learn complex and irrelevant numerical rules. This causes a decrease in the accuracy and the efficiency of these methods in the absence of data screening.8 The instability in data quality resulting from the drastic and unpredictable nature of natural climatic conditions poses challenges in comprehending renewable-energy patterns. This instability jeopardizes the safety and the reliability of power systems.12 By discerning the varying contributions of data of different qualities to practical scenarios, informed decisions can be made regarding the retention or elimination of data based on their quality. This strategic approach to extracting and utilizing “smart data” (the data that are most relevant and beneficial for specific tasks) enhances the refinement of data-driven models while simultaneously maximizing the overall value of the data.15 The criteria for data screening vary significantly depending on the specific application of the data. This variability highlights the urgent need for a flexible, comprehensive data-valuation paradigm. Such a paradigm is essential for advancing data-driven modeling across diverse scenarios.
However, research on data-value assessment is currently limited. Most existing approaches for evaluating dataset quality lack unified quantitative measurements and supervision methods that incorporate actual scenario model outputs.17 Moreover, these approaches do not reliably benchmark the practical usage of data. While some statistical filtering algorithms effectively denoise datasets by identifying outliers, thus improving the distribution for enhanced model generalization,19 these binary classifications fall short in differentiating the nuances of various data points for precise dataset adjustments. More crucially, such methods overlook the influence of usage scenarios on data-value judgment, such as the contribution of outliers to model robustness in extreme-event models. Scholars use the Shannon entropy and the non-noise ratio to describe data quality20 and explain data value through the reduction of uncertainty on renewable energy. However, neither metric is used to explain value creation at the data level. From this standpoint, while the leave-one-out (LOO) method effectively evaluates the marginal utility of an individual data point against the whole dataset, its ability to extend this evaluation to the cumulative value of data is limited, which restricts its use in broader big-data analysis. Considering the complex interrelations among data points, some scholars liken them to players in cooperative games by applying game-theoretical metrics such as the Shapley value.22 However, even though the algorithms have been enhanced by machine learning, the applicability of the Shapley value to large-scale data and complex models remains limited. Furthermore, these methods often exhibit biases in selecting data subsets that maximize value, a limitation stemming from the averaging calculations they rely on. Some scholars have advanced the algorithms by integrating meta-learning and deep-learning principles,24 primarily for homogeneous data structures such as label diagnosis and image recognition, where valuation involves discrete, regular samples.25 However, in fields such as renewable energy, which are characterized by time-series data, the continuous, intermittent, and complex nature of these datasets necessitates a valuation approach that is simultaneously flexible and specialized.26 A general valuation approach should be adaptable and robust to address the potential uncertainties and the instabilities in practical data usage and should be effectively tailored to meet the unique challenges of each specific domain. Furthermore, the role of data valuation extends beyond conventional applications, especially in scenarios with varied objectives. For instance, in real-world analytics, data are crucial for guiding optimal decision making. This drives the need for algorithms that are compatible with a broader spectrum of value representations that go beyond simple predictions.27 This necessity highlights the importance of developing a comprehensive and efficient framework for data valuation that can address the multifaceted and intricate nature of data-value assessment across diverse contexts.
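To make the leave-one-out idea concrete, the following minimal sketch computes the LOO value of each point as the change in utility when that point is dropped from the full dataset. The utility used here (negative squared error of a constant mean predictor on a small validation set) and the toy numbers are assumptions chosen for illustration only; they are not the tasks or metrics used in this study.

```python
from statistics import mean

def utility(train, val):
    """Toy utility (assumption): negative MSE of a mean predictor on a validation set."""
    if not train:
        return float("-inf")
    pred = mean(train)
    return -mean((y - pred) ** 2 for y in val)

def loo_values(train, val):
    """LOO value of point i: U(D) minus U(D without point i)."""
    full = utility(train, val)
    return [full - utility(train[:i] + train[i + 1:], val)
            for i in range(len(train))]

train = [1.0, 1.1, 0.9, 5.0]   # hypothetical data; the last point is an outlier
val = [1.0, 1.0, 1.0]
values = loo_values(train, val)
```

In this toy case the outlier receives a negative value because removing it raises the utility. Each value is measured only against the full dataset, which is why the text notes that LOO says little about the cumulative value of larger groups of points.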
In this research, we establish an effective data-valuation paradigm suitable for a broad spectrum of data-driven scenarios. Within this framework, the values derived from continuously sampled data subsets are interpreted as feedback signals, progressively revealing the intrinsic patterns that determine the value of the data. By applying our methodology to various data-driven contexts—including tasks such as adult income classification, forest fire and obesity regression, and heart failure clustering—we validate the efficacy of our algorithm and demonstrate its adaptability across various data types, models, and performance metrics. These metrics include the accuracy, mean absolute error (MAE), mean squared error (MSE), and sum of distances to cluster centers (distance), which gauges dataset dispersion. Our numerical results indicate that this learning-based approach significantly outperforms traditional methods such as the LOO method and the Shapley value in terms of both data-value calculation efficacy and operational speed. We further apply our framework to the analysis of wind-power data in China, highlighting its practical utility in renewable-energy prediction scenarios. This application not only demonstrates the effectiveness of the method but also aids in uncovering data patterns that can inform policy recommendations. The findings of this study are particularly noteworthy for the exceptional precision in handling uncertainty across diverse datasets. The results of this study demonstrate the inherent and transferable nature of data-value patterns, with a special focus on the geographic variations in data-value sensitivity across Chinese provinces. These insights, supported by an integration of power-based and geographic knowledge, provide a foundation for informed, data-driven regulation recommendations. 
Results
Learning-based data valuation
In this section, we explore a learning-based approach to data valuation by using a deep neural network. The network functions as a value learner, processing individual data features to estimate their respective values. This method utilizes the ability of deep learning to discern intricate patterns within the dataset, facilitating precise data-value estimations. The primary goal is to identify the data subset that maximizes the utility, that is, the performance of the data-based task when it is carried out on that subset. The value of each data point is quantified by its probability of inclusion in this optimal subset; values near 1 indicate a greater likelihood of selection because the point contributes substantially to utility enhancement.
Figure 1 illustrates the iterative training of the value learner, beginning with an equalized value distribution across the data, represented as a random ranking. This iterative process involves repeatedly sampling data subsets based on their current value estimations within each cycle. This method not only augments the stability of the learning process but also optimizes the efficiency of sample utilization. During each iteration, the utility of the selected subsets is assessed in the context of a specific data-driven task. This assessment yields either positive or negative feedback. This feedback is used to adjust the data-valuation strategy, ultimately guiding the learner to prioritize data subsets with maximal utility potential and to devalue less contributive points. In summary, the training process enables the learner to gradually master the strategy of selecting the subset with the highest utility, thereby enhancing model performance and the decision-making capabilities in terms of data valuation.
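A single sampling-and-feedback cycle of the kind described above might look as follows. This is a minimal sketch under stated assumptions: the function names, the toy utility (negative error of the subset mean against a target), and the choice of the full-dataset utility as the reference point for the feedback sign are all hypothetical and are not taken from the study.

```python
import random

def sample_subset(values, rng):
    """Draw a subset: point i is included with probability values[i]."""
    return [i for i, p in enumerate(values) if rng.random() < p]

def feedback(utility_value, baseline):
    """Positive when the sampled subset beats the baseline utility, negative otherwise."""
    return utility_value - baseline

def subset_mean_error(data, indices, target=1.0):
    """Toy utility (assumption): negative error of the subset mean against a target."""
    if not indices:
        return float("-inf")
    chosen = [data[i] for i in indices]
    return -abs(sum(chosen) / len(chosen) - target)

data = [1.0, 1.0, 1.0, 9.0]                # hypothetical data; last point is an outlier
values = [0.5] * len(data)                 # equalized starting values
rng = random.Random(0)
baseline = subset_mean_error(data, range(len(data)))   # utility of the full dataset
subset = sample_subset(values, rng)
signal = feedback(subset_mean_error(data, subset), baseline)
```

A subset that leaves out the outlier yields positive feedback, whereas a subset containing only the outlier yields negative feedback. Repeating this cycle and moving the values in the direction of the feedback is what gradually separates high-value from low-value points.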
Figure 1. LDV paradigm
This figure illustrates our approach, where we use the term "data-based tasks" to emphasize the incorporation of various valuation metrics applicable in data-driven contexts. For simplicity in this visualization, we assume that the data elements contribute independently to the task. Here, circles and crosses symbolize positive and negative contributions, respectively. Plus and minus signs indicate the impact of sampled subsets on the value performance of the task, either enhancing or diminishing the learner’s perceived value for these combinations. The blue bars graphically represent the data-value outputs assigned by the trained learner to each data point in their respective positions. An asterisk symbolizes the optimal learner state achieved post training. The green box depicts the final application phase, where the optimized learner is operationalized. Let us note, however, that real-world scenarios may present more intricate interdependencies among data, leading to a more complex distribution of data values than is depicted here.
In practical scenarios, especially in complex fields such as wind-power prediction, data-value expression often transcends binary categorizations (merely 1s and 0s) due to intricate interrelations such as substitution and complementarity effects among data points. Consequently, data values can manifest as continuous variables that represent the probability of their inclusion in the optimal subset rather than as discrete entities. This continuous nature reflects a nuanced understanding of data value, allowing the data-value learner to generate a distinctive value distribution. This perspective accurately captures the subtleties and complexities inherent in real-world data relationships.
The cornerstone of this learning-based approach is utility feedback obtained from the data subsets. Its general applicability across various data-driven contexts is attributed to the focus on utility estimation. This data-valuation approach necessitates mapping only between data subsets and their utility values, regardless of task complexity or utility metrics. However, a challenge arises due to the absence of gradient information when discretizing the continuous value output of the data learner for the subset in the sampling mechanism. This issue blocks the training process of the neural network, particularly during backpropagation. To overcome this hurdle and to ensure theoretical robustness, the model employs a policy gradient algorithm within a deep reinforcement-learning framework.29 This strategy redefines the learner’s objective as an optimization within the context of reinforcement learning. The algorithm plays a crucial role in making the neural network trainable while expediting the adaptation and refinement of the data-value strategy. To further enhance the training process, techniques such as importance sampling and clipping functions30 are implemented. These methods are instrumental in promoting algorithm convergence and stability, thereby significantly improving the reliability and the effectiveness of the data-valuation model.
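The role of the policy gradient can be illustrated with a small score-function (REINFORCE-style) sketch. Because the sampled subset is discrete, the utility cannot be differentiated with respect to the selection probabilities directly; instead, the gradient of the log-probability of each sampled mask is weighted by the utility it achieved. Several simplifications here are assumptions and not the authors' implementation: one free logit per data point stands in for the deep network, the utility is a toy function, a batch-mean baseline is used for variance reduction, only the logits are clipped, and importance sampling is omitted.

```python
import math
import random

def utility(subset):
    """Toy utility (assumption): negative error of the subset mean against a target of 1.0."""
    if not subset:
        return -8.0                      # an empty subset is treated as the worst case
    return -abs(sum(subset) / len(subset) - 1.0)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_selector(data, steps=300, batch=32, lr=0.2, clip=5.0, seed=0):
    """Learn per-point selection probabilities with a score-function gradient."""
    rng = random.Random(seed)
    logits = [0.0] * len(data)           # equal starting values (p = 0.5 everywhere)
    for _ in range(steps):
        probs = [sigmoid(t) for t in logits]
        # Sample a batch of binary masks; mask[i] = 1 means point i is selected.
        masks = [[1 if rng.random() < p else 0 for p in probs] for _ in range(batch)]
        utils = [utility([x for x, m in zip(data, mask) if m]) for mask in masks]
        baseline = sum(utils) / batch    # batch-mean baseline reduces variance
        for i in range(len(data)):
            # d/d(logit_i) of the log-probability of a Bernoulli mask is (mask_i - p_i).
            grad = sum((u - baseline) * (mask[i] - probs[i])
                       for u, mask in zip(utils, masks)) / batch
            # Clipping the logit keeps probabilities away from exactly 0 or 1.
            logits[i] = max(-clip, min(clip, logits[i] + lr * grad))
    return [sigmoid(t) for t in logits]

data = [1.0, 1.0, 1.0, 1.0, 9.0]         # hypothetical data: four consistent points, one outlier
values = train_selector(data)
```

In this toy setting the learned inclusion probability of the outlier ends up below those of the consistent points, which mirrors the interpretation of data value as the probability of belonging to the optimal subset.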
Experiments for various data-driven tasks
To evaluate the adaptability and the effectiveness of our learning-based data-valuation (LDV) algorithm, we conducted experiments across four distinct datasets from various disciplines, namely the social, natural, life, and medical sciences. These datasets, which are critical to a range of practical tasks, include adult income level classification by using census features,34 forest fire size prediction by using meteorological and geographic data, obesity level estimation based on dietary habits and physical condition, and clustering analysis of the clinical features of heart-failure patients. For these diverse tasks, we selected appropriate machine-learning methods and assessment metrics, as detailed in Table 1. Our comparative analysis focused primarily on the LOO and Shapley value (SV) methods; we evaluated their performance and computational efficiency.
The precision of the data valuation for each method was assessed by sorting and excluding a predefined fraction of the data based on their calculated values. This was followed by reassessing the utility of the remaining subset. Specifically, we removed the data in both ascending and descending order of value ranking and re-evaluated the model performance with the condensed dataset. Optimal valuation significantly enhances the final value by accurately identifying and removing low-value data, while the elimination of high-value points leads to a marked decrease in value. This indicates the precise representation of essential data features.
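The removal protocol described above can be sketched as follows. The helper names, the toy utility, and the example values are hypothetical; in the study, the utility would be the task metric of a model retrained on the remaining data.

```python
def remove_by_value(data, values, fraction, lowest_first=True):
    """Drop `fraction` of the points, starting from the lowest (or highest) valued ones."""
    order = sorted(range(len(data)), key=lambda i: values[i], reverse=not lowest_first)
    n_remove = int(round(fraction * len(data)))
    removed = set(order[:n_remove])
    return [x for i, x in enumerate(data) if i not in removed]

def removal_curve(data, values, utility, fractions, lowest_first=True):
    """Utility of the remaining data after each removal fraction."""
    return [utility(remove_by_value(data, values, f, lowest_first)) for f in fractions]

def toy_utility(subset):
    """Toy utility (assumption): negative error of the subset mean against a target of 1.0."""
    return -abs(sum(subset) / len(subset) - 1.0)

data = [1.0, 1.0, 1.0, 1.0, 9.0]          # hypothetical data; last point is an outlier
values = [0.9, 0.8, 0.9, 0.7, 0.1]        # hypothetical data values from some valuation method
fractions = [0.0, 0.2, 0.4]
low_first = removal_curve(data, values, toy_utility, fractions, lowest_first=True)
high_first = removal_curve(data, values, toy_utility, fractions, lowest_first=False)
```

With a good valuation, the low-first curve rises once the low-value points are gone, while the high-first curve falls; the gap between the two curves is the quantity on which the valuation methods are compared.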
In Table 1, biases in data-value calculations when using LOO and SV are identified; these biases affect the performance in certain tasks. For instance, SV fails to reduce the MSE in the obesity dataset, while LOO fails to increase accuracy in census classification. In contrast, LDV consistently yields significant improvements across datasets, tasks, and valuation methods, outperforming LOO and SV. These improvements are substantial, with LDV performing 6.76 times better than LOO does in the forest fire dataset and 3.77 times better than SV does in the heart-failure dataset. This remarkable improvement is also evident in the decrease in the metrics upon removal of high-value data, where the LDV surpasses the LOO and SV algorithms in terms of performance. For instance, in the census dataset, the LDV outperforms the LOO by 6.9% and the SV by 3.9%.
Regarding computational efficiency, LOO has a shorter execution time due to its deterministic and straightforward calculation process, albeit at the cost of valuation precision. Despite using Monte Carlo methods to expedite SV computations, achieving accurate data-valuation convergence remains time intensive. Conversely, LDV rapidly achieves optimal data valuations through a reinforcement-learning framework, effectively combining exploration and greedy strategies in agent network training. As tasks and metrics grow more complex, the time advantage of LDV becomes more pronounced. For instance, in the two regression problems, the SV execution times are 5.46 and 6.33 times longer than those of LDV, respectively. In clustering tasks with flexible utility definitions, LDV requires only 0.16 times the time taken by SV. Although LDV has a longer runtime than LOO does, its computational cost is reasonable, especially considering that its data valuation is more accurate than that of LOO.
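The cost of the SV baseline becomes clearer from a sketch of a permutation-sampling Monte Carlo estimator, a common way to approximate Shapley values. The exact SV variant and settings used in the comparison are not given in this section, so the estimator below, its parameters, and the toy utility are assumptions for illustration.

```python
import random

def toy_utility(subset):
    """Toy utility (assumption): negative error of the subset mean against a target of 1.0."""
    if not subset:
        return -8.0                        # an empty subset is treated as the worst case
    return -abs(sum(subset) / len(subset) - 1.0)

def monte_carlo_shapley(data, utility, n_permutations=200, seed=0):
    """Average each point's marginal contribution over random orderings."""
    rng = random.Random(seed)
    n = len(data)
    totals = [0.0] * n
    for _ in range(n_permutations):
        order = list(range(n))
        rng.shuffle(order)
        subset, previous = [], utility([])
        for i in order:
            subset.append(data[i])
            current = utility(subset)      # one utility evaluation per added point
            totals[i] += current - previous
            previous = current
    return [t / n_permutations for t in totals]

data = [1.0, 1.0, 1.0, 9.0]                # hypothetical data; last point is an outlier
values = monte_carlo_shapley(data, toy_utility)
```

Every permutation requires one utility evaluation per data point, and in a realistic task each evaluation means retraining a model. That multiplication of retraining runs is the reason convergence of the SV estimate remains time intensive even with Monte Carlo sampling.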
In Figure 2, metric inversion is applied to the vertical axis to effectively align the trend curves. For the LDV classification (Figure 2A) and regression tasks (Figures 2B and 2C), we observed a notable rebound effect. Initially, removing low-value data containing redundant information enhances model performance; this enhancement demonstrates efficient data governance. However, as progressively more data are eliminated and the value of the removed data increases, crucial feature data starts being excluded. This results in a subsequent performance decline, underscoring the necessity of maintaining a diverse dataset for unbiased training and optimal utility in test environments.
Figure 2. Comparative trends in high/low data removal across different methods
(A–D) This figure encapsulates the results from experiments conducted when using the leave-one-out (LOO) method, the Shapley value (SV), and the learning-based data-valuation (LDV) approach across four distinct datasets: (A) adult census, (B) forest fire, (C) obesity, and (D) heart failure. Each subgraph includes two representations of high/low-value removal. The depicted curves represent the mean values; the shaded areas around these curves indicate confidence intervals.
(E) The runtime ratio of various methods with runtime for the SV as the baseline = 1.
In clustering tasks (Figure 2D), the pattern deviates due to the unsupervised nature of these tasks. Clustering lacks the stringent requirement for specific feature data essential for performance validation in supervised tasks. Therefore, no comparable rebound phenomenon is observed. The LDV algorithm effectively identifies and removes isolated data points. This process accelerates the formation of clustering centers. In contrast, LOO and SV are slower at recognizing data contributions in clustering tasks, resulting in a more gradual formation of cluster centers. This distinction underscores the adaptability and the effectiveness of LDV in addressing the distinct requirements of both supervised and unsupervised learning environments. Furthermore, these aspects of the model help to elucidate the governing principles of data value from two distinct perspectives, enhancing our understanding of the intricate role of data across various machine-learning contexts.
In Figures 2A, 2B, and 2C, the removal of high-value data in the LDV exhibits a consistent pattern: as more data are excluded, the inverse value metric correspondingly increases. Over the long term, LDV shows a faster and more stable rate of increase in inverse value than do other methods, highlighting the direct impact of removing high-value data on the utility and efficacy of the model training for the dataset. The removal process, starting with the most critical features, leads to a direct and steady decline in performance. This downward trajectory persists even as the value of the removed data decreases, indicating that the training of the model deteriorates continuously with decreasing data, regardless of the diminishing value of the data being excluded.
In Figure 2D, the ability of LDV to maintain the discreteness of data points notably surpasses that of the other methods, illustrating the effectiveness of our data-valuation framework. Overall, beyond the exemplary performance of LDV in data-removal experiments, the data-value patterns theoretically align with the principles of machine learning. This alignment is significant because it demonstrates the robust understanding and effectiveness of LDV in terms of data valuation. In contrast to the unstable data-value expression observed in LOO and SV, the LDV approach more accurately reflects our expectations of data-value patterns, affirming its superior performance in both comprehension and practical application within data-driven models.
Applicability in wind power prediction
Wind power, heralded as a key player in the global energy shift, is at the forefront due to its sustainability, affordability, and abundant nature.38 Acknowledging the inherent volatility and unpredictability of wind-energy generation, we applied predictive modeling as the data utilization case, and we selected accuracy as a metric for data-value assessment. Accurate wind-power forecasting mitigates uncertainty and facilitates optimized energy dispatch; this approach can yield substantial savings, thus augmenting the efficiency of electricity markets.40 Consequently, the application of data-valuation methodologies is imperative for enhancing forecast precision by leveraging reliable datasets with informative, high-quality data points.42
Our predictive model inputs include wind-power data from the preceding 48 h coupled with daily and hourly meteorological information to forecast subsequent 24-h power generation. We adopted the mean absolute percentage error (MAPE) as a reverse metric to gauge the utility of the prediction model trained on various data subsets. In this section, we focus on data-valuation experiments utilizing wind-power data from Sichuan (SC) Province, which is renowned for its significant power generation and demand.
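The input/output layout and the error metric described above can be sketched as follows. This is a simplified illustration under stated assumptions: only the power series is windowed (the meteorological inputs are omitted), the series is taken to be hourly, and skipping zero-valued targets in the MAPE is a choice made here to avoid division by zero, not a detail reported in the study.

```python
def make_windows(series, n_in=48, n_out=24):
    """Pair each 48-h input window with the 24-h target window that follows it."""
    samples = []
    for start in range(len(series) - n_in - n_out + 1):
        inputs = series[start:start + n_in]
        targets = series[start + n_in:start + n_in + n_out]
        samples.append((inputs, targets))
    return samples

def mape(actual, predicted):
    """Mean absolute percentage error in percent; lower is better (a reverse metric)."""
    terms = [abs((a - p) / a) for a, p in zip(actual, predicted) if a != 0]
    return 100.0 * sum(terms) / len(terms)

series = [float(h) for h in range(1, 101)]   # hypothetical hourly power readings
samples = make_windows(series)
first_inputs, first_targets = samples[0]
```

Because a lower MAPE means a better forecast, the utility fed back to a value learner would be an inverted form of this metric, for example its negative.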
Figure 3A exhibits a data-value pattern consistent with earlier experiments, with the removal of high-value data correlating to a pronounced increase in the wind-power prediction error. Eliminating 40% of the high-value data inflates the median MAPE by 25.8%. Conversely, discarding low-value data reduces the MAPE, peaking at a 4.5% decrease with 25% data removal, equivalent to a 10.5% relative change. This finding suggests that excising superfluous data may indeed bolster the accuracy of renewable-energy forecasts. However, the subsequent rebound in the MAPE above the 25% removal threshold underscores the risk of depleting the training sample to a detrimental level; this depletion can undermine model training efficacy.
Figure 3. Value characteristics of wind-power data in SC
(A) The red and blue boxplots show the MAPE variation from removing the highest- and lowest-value data, respectively, with the median marked by the bold line.
(B) The error trends on the first and last days of the last 2 months from the test set (11.1 and 12.31) illustrate that the absolute prediction error changes when 20% of the highest-value (gray) and lowest-value (green) data are removed compared to that of the full dataset (orange).
Figure 3B illustrates the impact of excising 20% of the valued data on the prediction error across a 24-h forecast period. The baseline error trajectory, derived from training with the complete dataset, pinpoints the primary forecasting challenges as occurring beyond the 10-h mark, where the temporal inertia of the power-generation time series starts to wane. The removal of 20% of the low-value data leads to a tangible reduction in prediction error after 10 h, whereas the excision of an equal proportion of high-value data markedly exacerbates the error. Thus, the strategic value of wind-power data is demonstrated by their ability to confine prediction errors, particularly beyond the critical 10-h threshold. By eliminating low-value data, we reduce noise and enhance the training process of the model. In contrast, removing high-value data impairs the predictive performance of the model by eliminating essential training data. This delineation of results validates the practical application and robustness of our data-valuation methodology in wind-power forecasting, especially from an hourly perspective.
Intrinsicality and transferability in subset valuation
In this study, we investigated value patterns within wind-energy data subsets and their learning performance in terms of value. We split the initial 42.8% of the training set and the remaining portion into two datasets, named alpha and beta. Then, we trained two different learners: (1) by using both alpha and beta, and (2) by using only alpha. These learners were then tasked with evaluating both datasets. The learner trained on both datasets serves as the optimal benchmark and possesses comprehensive training information. We compared its data-value output with that of the learner trained solely on alpha, assessing consistency across both the trained (alpha) and untrained (beta) datasets.
Figure 4 reveals a significant correlation in the data-value outputs between the learners. An $R^2$ value of 0.81 for alpha reflects the stability and intrinsic robustness of our learning module within wind-energy scenarios. This indicates reliable value generalization for the same dataset under varying training conditions. Moreover, an $R^2$ value of 0.76 for beta highlights the learner's strong generalizability, with evidence for the transferability of known value patterns to similar but previously unseen datasets.
Figure 4. Data-value patterns between subsets
The graph illustrates the correlation in data values as assessed by learner i (trained on the alpha dataset and beta dataset) and learner ii (trained only on the alpha dataset), where red represents data points in (A) alpha and blue represents (B) beta. R-squared values are presented to quantify the linear correlation after fitting a linear regression model with ordinary least squares.
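As a small illustration of how such an R-squared figure is obtained, the sketch below fits an ordinary-least-squares line between the value outputs of two learners and reports $R^2$. The two lists of values are invented for the example and `r_squared` is a hypothetical helper name.

```python
def r_squared(x, y):
    """R^2 of an ordinary-least-squares line y ~ intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


# Made-up data values assigned to the same five samples by two learners.
values_learner_i = [0.10, 0.35, 0.50, 0.72, 0.90]    # trained on alpha + beta
values_learner_ii = [0.15, 0.30, 0.55, 0.70, 0.85]   # trained on alpha only
r2 = r_squared(values_learner_i, values_learner_ii)
```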
The implications of such robustness and generalizability are twofold. First, consistent value patterns observed across different training regimens underscore a stable, intrinsic link between wind-power data and their value. This consistency reinforces the importance and necessity of our research into data valuation in the wind-energy domain. Second, the ability to accurately predict the value of new, untrained data demonstrates the transferability of value patterns. Given the extensive computational resources typically required for processing large volumes of wind-power data, the ability to generalize and to transfer learning efficiently is pivotal; these abilities potentially reduce the computational burden associated with extensive training.
Geographic value patterns in China's wind power prediction
In our study, we extend the data-valuation methodology nationally to discern patterns across 25 Chinese provinces. Employing consistent data-value calculations, we integrate the effects of high/low-value data removal into Figure 5, utilizing the province abbreviations listed in Table S1.
Figure 5. Data-value effects on wind-power prediction across 25 provinces in China
Prediction performance across 25 provinces relative to the removal of high-value (red) and low-value (blue) data. The vertical axis shows the MAPE; the horizontal axis represents the percentage of data removed.
The data-value behavior across most Chinese provinces exhibits a consistent pattern. In general, the MAPE decreases initially then increases with the sequential removal of low-value data, while the removal of high-value data leads to a steady increase in the MAPE. This trend is exemplified by Jilin (JL) and Hebei (HE), which exhibit curve patterns similar to those of SC. In addition, the varied geographic and climatic conditions across China indicate that a low data value does not necessarily equate to model interference, as observed in provinces such as Heilongjiang (HL) and Guangdong (GD), where removing low-value data does not substantially decrease the MAPE. However, the universal increase in the MAPE with the removal of high-value data across all 25 provinces emphasizes the critical nature of certain feature data. The marked difference in the MAPE following the removal of equivalent amounts of high- and low-value data accentuates the capacity of the algorithm to discriminate data values. This consistency, despite China's geographic and climatic variability, attests to the adaptability and efficacy of the algorithm on distributed datasets; these benefits underscore the robustness and versatility of our approach.
The results reveal significant regional disparities in the sensitivity of wind-power prediction to data valuation across Chinese provinces, as illustrated by the extent of divergence in the curves for low- and high-value data removal. This sensitivity highlights the varying impacts on prediction accuracy resulting from the elimination of equal quantities of data with differing values. Provinces such as HE and Liaoning (LN) demonstrate pronounced increases in the MAPE difference following data removal, in contrast to the more subdued patterns from Xinjiang (XJ) and Chongqing (CQ). These observed variations in data-value sensitivity across provinces prompted further investigation into the underlying patterns and reasons for these geographic patterns.
In subsequent analysis, we focus on the distributed sensitivity of the data across provinces, with a cap of 40% on the removed data volume. Specific removal points at 10% intervals up to 40% provide insight into the predicted effects created by valued wind-power data. In Figures 6A, 6B, and 6C, color-coded deviations indicate the impact of data valuation on the prediction performance (MAPE) difference, with darker hues denoting greater sensitivity. An observable pattern emerges, with darker shades from southwest to northeast China occurring at 10% removal, indicating a sensitive region. This sensitivity becomes more defined within a narrower belt at 20% removal, shifting from northeast to southwest with increased removal (40%).
Figure 6. Sensitivity analysis of nationwide data values
(A–C) The sensitivity of the data valuation across provinces when (A) 10%, (B) 20%, and (C) 40% of the highest and lowest value data are removed, respectively, with color intensity indicating the extent of MAPE variation.
(D) The overall sensitivity of each province to the data value, with the color intensity reflecting the average impact of removing 10% of the data on model performance. The red line marks the region with the highest sensitivity (S2), dividing China into three zones: S1 (northwest), S2 (central belt), and S3 (southeast).
In Figure 6D, we calculate a generalized sensitivity index for each province by averaging the effect ranging from 10% removal to 40% removal; we identify a continuous geographic belt of heightened sensitivity, labeled S2 within the red boundary. This belt segregates the provinces into three distinct and continuous regions: S2 within the boundary, S1 to the northwest, and S3 to the southeast. This delineation aids in quantifying and in understanding the geographical distribution of data-valuation sensitivity in national wind-power forecasting.
In our investigation of geographic patterns of data sensitivity, we merge data valuation with energy dynamics data to elucidate regional differences across Chinese provinces. Figure 7A displays a scatterplot distinguishing provinces by sensitivity along two variance dimensions. A clear division among provinces in three regions (S1, S2, S3) is visible, with the high data-value sensitivity of S2 distinctively in the upper right. Notably, within S2, Gansu (GS) Province connects the climatic zones of XJ and Inner Mongolia (IM), which exhibit characteristics similar to those of the S1 region.
Figure 7. Explanation of the data-value-sensitive geographic belt formation
(A) Scatterplot for 25 Chinese provinces displaying data-value variance on the horizontal axis and wind-power-generation variance (logarithmic) on the vertical axis, with bubble size indicating sensitivity levels. Provinces are color coded: S1 (blue), S2 (orange), and S3 (green), demarcated by two dotted lines.
(B) Annual wind-power-generation distribution across the three regions, with power-generation percentiles on the horizontal axis and probability on the vertical axis. Decile histograms and kernel density estimation (KDE) curves illustrate distribution disparities.
Horizontally, a broader data-value variance in S2 suggests an unstable system; moreover, the removal of extreme values significantly affects the distribution bounds, altering the central tendency and data spread. This phenomenon underscores how greater value variance enhances sensitivity to data removal, directly influencing model accuracy in predictive tasks, particularly in diverse datasets such as wind-power forecasting. Vertically, there is a notable difference in the wind-power variance between the S1 and S3 regions. The power variance of S1 is considerably greater than that of S3, with S2 falling in between.
For a detailed interpretation, Figure 7B presents the normalized power output distributions for these regions. The peak power metric indicates the most prevalent normalized output level within each region, embodying the percentile relative to the maximum observed output.
Influenced by the subtropical monsoon climate of the southeast plains of China, S3 has the lowest peak power at 15.43%, which aligns with the limited wind resources and consistent low-level outputs with low variance. Consequently, data valuation has a limited role in distinguishing performance due to the predictability of low, stable outputs. Conversely, the peak power of 34.76% in S1 signals substantial wind-energy production with notable variability, influenced by its temperate continental climate. Despite these fluctuations, the cyclical nature of the wind patterns in S1 simplifies predictions, leading to a homogenized data value and a reduced impact of data variations on model performance.
The "data-favored" profile of S2 emerges with its 26.65% peak power, situated between the stability of S3 and the variability of S1. Its location, encompassing both temperate and subtropical climates, results in a complex, less-predictable wind pattern. This variability necessitates precise data valuation for capturing the intricacies of wind generation, for underscoring the critical role of high-quality data, and for removing redundant data for accurate forecasting. This is particularly true for the three most sensitive provinces in S2, namely Yunnan (YN), LN, and JL, for which refined data valuation is essential for enhancing model efficiency and predictive reliability.
Discussion
In this study, we present a learning-based paradigm for data valuation, and we demonstrate significant improvements in model utility and efficiency across various data-driven contexts. By employing deep reinforcement learning in our approach, we select the data subsets with the highest utility, optimizing performance, as shown by comparative analyses across diverse scientific datasets. Specifically, in wind-power prediction, our paradigm proficiently identifies valuable data, influencing model accuracy through the removal of data points and demonstrating consistent data-value patterns across regions.
Our findings reveal intrinsic and transferable data-value patterns, with learners effectively capturing value characteristics despite limited training data. This indicates the underlying nature of the data value and the potential for computational efficiency. Nationwide application of our algorithm has verified the presence of uniform data-value patterns across 25 Chinese provinces. Nonetheless, the degree to which the predictive accuracy of each region reacts to data removal varies, illuminating diverse spatial sensitivities to data valuation. This analysis leads to the identification of a geographically sensitive belt, providing a deeper understanding of the regional nuances in the value patterns of wind-power data.
Our data-valuation framework offers substantial contributions to renewable-energy policy by enhancing wind-power forecasting and by mitigating uncertainty in energy systems. We introduce an efficient approach for data-level governance, optimizing computational resources by effectively evaluating wind-power data quality and implementing strategic data filtering. This approach enhances forecast accuracy, stabilizes the energy grid, and supports potential data transactions.
The framework underscores the importance of targeted data collection and analysis, particularly in data-sensitive regions such as the S2 area in China. Here, the removal of specific valuable data markedly influences the predictive performance. Provinces within this belt are advised to advance their data collection and analytical capabilities, potentially through expanded sensor deployment and advanced processing techniques. Such measures are vital for effectively managing wind-power fluctuations and ensuring robust, field-applicable datasets.
By utilizing data-driven insights, provinces can refine their energy management strategies. The flexibility and comprehensive assessment of our framework can equip policy makers with crucial tools for crafting bespoke renewable-energy policies, thereby promoting a sustainable and economically beneficial energy ecosystem.
Future research directions can broaden the application of this paradigm to address a range of practical decision-making challenges. These include optimizing power dispatch in the face of renewable-energy variability and leveraging enhanced data analysis to improve public health outcomes. Additionally, customizing data governance frameworks to align with specific regional energy policies may increase the effectiveness of data-driven decision making. Such an approach not only translates data value into tangible economic benefits but also bolsters efforts toward the sustainable development goals.
Experimental procedures
Resource availability
Lead contact
For further details, questions, or requests, direct correspondence to Jianxiao Wang at wang-jx@pku.edu.cn.
Materials availability
No physical materials were utilized in the conduct of this research.
The LOO method is a foundational technique in data valuation. It involves assessing the marginal value of an individual data point by comparing it to its complement, as follows:
$$L(i)=V(D)-V(D\setminus\{i\})$$
where $L(i)$ is the LOO value of an individual data point $i$, determined by quantifying the discrepancy in utility between the entire dataset, $V(D)$, and its complement, $V(D\setminus\{i\})$.
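The LOO definition can be illustrated with a few lines of code. This is a sketch under stated assumptions: the utility `V` used here (negative error of the subset mean against a target of 1.0) and the four toy data points are invented for the example and are not the utilities used in the study.

```python
def utility(subset, target=1.0):
    """Toy utility V(S): negative error of the subset mean against a target."""
    return -abs(sum(subset) / len(subset) - target)


def leave_one_out(points, utility_fn):
    """Return L(i) = V(D) - V(D without i) for every data point i."""
    full = utility_fn(points)
    return [full - utility_fn(points[:i] + points[i + 1:])
            for i in range(len(points))]


points = [1.0, 1.1, 0.9, 5.0]   # the last point is an outlier
loo_values = leave_one_out(points, utility)
```

In this toy setting the outlier receives a negative LOO value because removing it raises the utility, whereas the three regular points receive positive values.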
Stemming from efficient allocation of utility in game theory and a more comprehensive consideration of data contributions, the data SV is defined as follows:
$$Q(i)=\sum_{S\subseteq D\setminus\{i\}}\frac{|S|!\,(N-|S|-1)!}{N!}\,\bigl(V(S\cup\{i\})-V(S)\bigr)$$
where $Q(i)$ denotes the SV of an individual data point $i$, treated as a participant in the game. It is computed by averaging the marginal contributions of $i$ over all subsets $S$ of the other data points in the entire dataset $D$ (with $N$ data points), under the metric that defines the utility $V$. Considering its exponential computational complexity, the SV calculation in this study is approximated by using a Monte Carlo simulation with randomized sorting [22]. This method provides an efficient way to conduct an abbreviated computation, maintaining the integrity of the conceptual framework of the SV while addressing the computational demands.
22. Ghorbani, A., Zou, J. Data Shapley: Equitable valuation of data for machine learning.
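A Monte Carlo approximation of the SV with randomized orderings can be sketched as below. It is a generic permutation-sampling estimator, not the exact procedure of the cited work (which may add truncation and other refinements); the toy utility, the convention for the empty set, and the names `monte_carlo_shapley` and `n_permutations` are assumptions made for the example.

```python
import random


def utility(subset, target=1.0):
    """Toy utility V(S): negative error of the subset mean against a target."""
    if not subset:
        return -target  # assumption: an empty set is scored as predicting 0
    return -abs(sum(subset) / len(subset) - target)


def monte_carlo_shapley(points, utility_fn, n_permutations=500, seed=0):
    """Estimate Shapley values by averaging marginal gains over random orders."""
    rng = random.Random(seed)
    n = len(points)
    totals = [0.0] * n
    for _ in range(n_permutations):
        order = list(range(n))
        rng.shuffle(order)                  # one random ordering of the data
        chosen = []
        previous = utility_fn(chosen)
        for idx in order:
            chosen.append(points[idx])
            current = utility_fn(chosen)
            totals[idx] += current - previous   # marginal contribution of idx
            previous = current
    return [t / n_permutations for t in totals]


points = [1.0, 1.1, 0.9, 5.0]   # the last point is an outlier
shapley = monte_carlo_shapley(points, utility)
```

Because the marginal gains telescope within each ordering, the estimates always sum to $V(D)-V(\emptyset)$, which is a convenient sanity check on any implementation.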
In our research, we introduce a data-valuation paradigm that leverages deep reinforcement learning to optimize the utility of the model. We denote the entire dataset as $\mathcal{D}=\{\mathbf{d}_i\}_{i=1}^{N}\sim\mathcal{P}$, where $\mathbf{d}_i\in\mathcal{R}^m$ is an individual data point characterized by an $m$-featured vector. The utility derived from a subset $S$ is defined as $V(S)$. Subset $S$ corresponds to an $N$-dimensional filter vector $\mathbf{s}\in\{0,1\}^N$, where $S=\{\mathbf{d}_i\mid s_i=1,\ i=1,2,\ldots,N\}$. For clarity, we express this as $V(\mathcal{D},\mathbf{s})$ instead of $V(S)$.
The data-value learner, represented as $l_\varphi:\mathcal{D}\to[0,1]$, is optimized to predict the probability of the inclusion of each data element in subset $S$. This probability is interpreted as the value of the data point. A higher $l_\varphi(\mathbf{d}_i)$, approaching 1, suggests a greater likelihood of the data point $\mathbf{d}_i$ being selected for model training, indicating its significant contribution and higher value in practical scenarios. Conversely, a value closer to 0 implies a higher probability of the data point being excluded from the final task or training. We define the filter as a binomial distribution filter $F_B:[0,1]\to\{0,1\}$, indicating whether $\mathbf{d}_i$ is selected ($F_B=1$) or not ($F_B=0$). Hence, $s_i=F_B(l_\varphi(\mathbf{d}_i))$.
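The filtering step can be illustrated with a short sketch. It assumes that the filter draws each $s_i$ independently with success probability $l_\varphi(\mathbf{d}_i)$; the function name `bernoulli_filter` and the example values are hypothetical.

```python
import random


def bernoulli_filter(data_values, rng):
    """Draw s_i in {0, 1} with P(s_i = 1) equal to the value of data point i."""
    return [1 if rng.random() < v else 0 for v in data_values]


rng = random.Random(0)
data_values = [0.95, 0.05, 0.60, 1.0, 0.0]   # made-up learner outputs
s = bernoulli_filter(data_values, rng)
selected_indices = [i for i, keep in enumerate(s) if keep]

# Empirical check: the third point should be kept in about 60% of the draws.
draws = [bernoulli_filter(data_values, rng)[2] for _ in range(10000)]
keep_rate = sum(draws) / len(draws)
```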
Given the nondifferentiable nature of this filtering process, we transform the utility of the data subset into a "sampling signal." This necessitates adopting a reinforcement-learning framework for resolution. We model this as a decision process with a single step, where the entire dataset $\mathcal{D}$ represents the "state," the filter vector $\mathbf{s}$ denotes the "action," the utility $V$ is the "reward" from a data-driven environment, and the learner's parameter $\varphi$ is the strategy to be optimized. Then, the probability of the filter vector $\mathbf{s}$ being actioned based on $l_\varphi(\mathcal{D})$ is $\pi_\varphi(\mathcal{D},\mathbf{s})=\prod_{i=1}^{N}\left[l_\varphi(\mathbf{d}_i)^{s_i}\cdot\left(1-l_\varphi(\mathbf{d}_i)\right)^{1-s_i}\right]$, the result of the binomial distribution filter $F_B$. To maximize the utility of the selected subset, we formulate the optimization problem for training the learner as follows:
$$\max_{\varphi}\ \mathbb{E}_{\mathbf{s}\sim\pi_\varphi(\mathcal{D},\cdot)}\left[V(\mathcal{D},\mathbf{s})\right]$$
Policy gradient and improvements
Our algorithm undergoes theoretical refinement to enhance its efficiency and robustness. We address the issue of the non-differentiability of $V(\mathcal{D},\mathbf{s})$ with respect to $\varphi$ by using the policy gradient method. This technique allows for the effective training of the learner under the reinforcement-learning framework. If $\mathcal{J}(\varphi)$ is denoted as the objective function, the gradient of strategy $\varphi$ is expressed as follows:
$$\nabla_\varphi\mathcal{J}(\varphi)=\mathbb{E}_{\mathbf{s}\sim\pi_\varphi(\mathcal{D},\cdot)}\left[V(\mathcal{D},\mathbf{s})\cdot\nabla_\varphi\log\pi_\varphi(\mathcal{D},\mathbf{s})\right]$$
To enhance the stability of the learning process based on policy gradients, we employ a moving average of the previous rewards, denoted as $\delta$, with a window size $T$ (number of iterations) as the baseline.
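The policy-gradient update with a moving-average baseline can be sketched in plain Python. This is a deliberately simplified stand-in for the learner: it keeps one free logit per data point instead of a neural network, and the toy reward, learning rate, window size, and function names are all assumptions made for the example.

```python
import math
import random
from collections import deque


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def utility(subset, target=1.0):
    """Toy reward V(D, s): negative error of the selected subset's mean."""
    if not subset:
        return -target  # assumption: an empty selection is scored as predicting 0
    return -abs(sum(subset) / len(subset) - target)


def train_values(points, n_iters=3000, lr=0.1, window=20, seed=0):
    """REINFORCE with a moving-average baseline over the last `window` rewards."""
    rng = random.Random(seed)
    logits = [0.0] * len(points)          # one free parameter per data point
    recent_rewards = deque(maxlen=window)
    for _ in range(n_iters):
        probs = [sigmoid(z) for z in logits]
        s = [1 if rng.random() < p else 0 for p in probs]   # sample the filter
        reward = utility([d for d, keep in zip(points, s) if keep])
        baseline = (sum(recent_rewards) / len(recent_rewards)
                    if recent_rewards else 0.0)
        advantage = reward - baseline
        # For a sigmoid policy, d log pi(s) / d logit_i equals s_i - p_i.
        for i in range(len(logits)):
            logits[i] += lr * advantage * (s[i] - probs[i])
        recent_rewards.append(reward)
    return [sigmoid(z) for z in logits]


points = [1.0, 1.1, 0.9, 5.0]   # the last point is an outlier
learned_values = train_values(points)
```

Subtracting the baseline leaves the expected gradient unchanged because the baseline does not depend on the sampled action, while it reduces the variance of individual updates.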
Our analysis reveals that, without specific intervention, the values of all the data points ($\forall\,\mathbf{d}_i\in\mathcal{D}$) are highly likely to skew toward either all 0s or all 1s. Such extrema in the value distribution lead to suboptimal results. This local convergence is due to inadequate exploration of alternative optimum strategies. To counteract this, we introduce a penalty function $p(\varphi;\sigma,t)$ to promote exploration, where $\sigma$ is the penalty factor and $t$ is the threshold near 1.
To further enhance the stability, we propose using repeated sampling to maximize the sample utility. We incorporate importance sampling and a target network within the reinforcement-learning framework, facilitating a transition from on-policy to off-policy methods. With $\tilde{\pi}_\varphi$ denoting the target (old) policy from which the filter vectors are sampled, the modified objective function is
$$\mathbb{E}_{\mathbf{s}\sim\tilde{\pi}_\varphi(\mathcal{D},\cdot)}\left[\frac{\pi_\varphi(\mathcal{D},\mathbf{s})}{\tilde{\pi}_\varphi(\mathcal{D},\mathbf{s})}\,V(\mathcal{D},\mathbf{s})\right]$$
Given the increased variance from off-policy sampling and the potential limitation of updates to new policies, we employ a clipping function to moderate the disparity between old and new policies:
$$\mathrm{clip}\left(\frac{\pi_\varphi(\mathcal{D},\mathbf{s})}{\tilde{\pi}_\varphi(\mathcal{D},\mathbf{s})},\ 1-\eta,\ 1+\eta\right)$$
Here, the clip function $\mathrm{clip}(x,a,b)$ truncates $x$ to the bounds $a$ and $b$, keeping the sensitivity of the policy gradient to larger step sizes within manageable limits set by $\eta$. Ultimately, we integrate the clipped objective with the penalty term to form the final objective function, guiding the training of the data learner's strategy via the policy gradient method.
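The clipped importance ratio can be made concrete with a small sketch. It only implements what the text states, namely the ratio of new to old policy probabilities truncated to $[1-\eta, 1+\eta]$ and multiplied by the utility; the penalty term is omitted, and the function names and the default $\eta=0.2$ are assumptions.

```python
import math


def log_prob(probs, s):
    """Log-probability of filter vector s under independent inclusion probs."""
    return sum(math.log(p) if keep else math.log(1.0 - p)
               for p, keep in zip(probs, s))


def clip(x, a, b):
    """Truncate x to the interval [a, b]."""
    return max(a, min(x, b))


def clipped_objective(new_probs, old_probs, s, utility_value, eta=0.2):
    """Clipped importance ratio times utility for one sampled filter vector."""
    ratio = math.exp(log_prob(new_probs, s) - log_prob(old_probs, s))
    return clip(ratio, 1.0 - eta, 1.0 + eta) * utility_value


# The new policy makes the sampled action 3.24 times more likely than the
# old one, so the ratio is truncated to 1 + eta = 1.2.
value = clipped_objective([0.9, 0.1], [0.5, 0.5], [1, 0], utility_value=2.0)
```

Working with log-probabilities avoids underflow when the product over many data points becomes very small.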
Comparison experiments
The datasets utilized for the comparison experiments in our study were sourced from publicly available repositories comprising adult,34
data. Notably, in the heart-failure clustering analysis, we adopted an unsupervised learning approach, and we defined value by calculating the distance from each point to its nearest cluster center; this approach demonstrates the versatility of our framework in handling nonpredictive tasks.
To construct classification, regression, and clustering models, we used appropriate machine-learning algorithms tailored to each problem type. For classification tasks, we employed eXtreme Gradient Boosting (XGB), a sophisticated ensemble boosting model renowned for its effectiveness in featured data management, numerical pattern extraction, and predictive analysis capabilities [49]. Support vector regression (SVR) was chosen for the regression tasks, and another regression task utilized the light gradient-boosting machine (LGBM), a decision-tree-based gradient-boosting framework noted for its outstanding performance and scalability [50]. K-means clustering was applied for clustering analyses. In each iteration, the models were trained on a selected subset of the data based on the calculated value metric.
49. Chen, T., Guestrin, C. XGBoost: A scalable tree boosting system. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.
50. Ke, G., Meng, Q., Finley, T., ... LightGBM: A highly efficient gradient boosting decision tree. Proceedings of the 31st International Conference on Neural Information Processing Systems.
The inverse metric employed in classification is the "mismatch rate," defined as the complement of accuracy ($1-\text{accuracy}$). For clustering, we computed the metric by aggregating the distances between each data point and its respective cluster center. As the volume of data is reduced, the cumulative distance decreases proportionally. To highlight the unique approach of our algorithm in data valuation, the vertical axis in Figure 2D represents the averaged distance, calculated as the sum of distances divided by the data volume. In our analysis, larger metric values indicate greater overall utility in the dataset. For each dataset, 12 training repetitions were performed for each method, and the results were averaged and are presented in Table 1.
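The two metrics described here are simple to compute, as the following sketch shows. It assumes that each point's "respective" cluster center is its nearest center, as in K-means; the function names and toy inputs are hypothetical.

```python
import math


def mismatch_rate(y_true, y_pred):
    """Complement of accuracy: the fraction of labels predicted incorrectly."""
    wrong = sum(1 for t, p in zip(y_true, y_pred) if t != p)
    return wrong / len(y_true)


def average_center_distance(points, centers):
    """Sum of distances to the nearest cluster center, divided by data volume."""
    total = sum(min(math.dist(p, c) for c in centers) for p in points)
    return total / len(points)


rate = mismatch_rate([1, 0, 1, 1], [1, 1, 1, 0])
avg_dist = average_center_distance([(0.0, 0.0), (2.0, 0.0)],
                                   [(0.0, 0.0), (3.0, 0.0)])
```

Dividing by the data volume keeps the clustering metric comparable across subsets of different sizes, which the raw cumulative distance is not.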
Wind power prediction
We analyzed 17,520 h of wind data from 2017 to 2018, focusing on 25 Chinese provinces and regions and omitting those with insufficient or corrupt data [51]. This subset still covers a significant portion of China, excluding Hong Kong, Macau, Tianjin, Taiwan, Tibet, Qinghai, Beijing, Shanghai, and Hainan. To aid in wind-power forecasting, we integrated multidimensional numerical weather prediction (NWP) data from the National Aeronautics and Space Administration (NASA), encompassing more than 3,300 counties nationwide. Employing cluster averaging, we extracted characteristic NWP data for each province and compiled hourly and daily data to support wind-power predictions for corresponding hours. The predictive model inputs merge these NWP data with historical power-generation data spanning the previous 48 h, setting the stage to forecast hourly power generation for the subsequent 24 h. Leveraging the capabilities of the LGBM [50]
51. Lu, X., McElroy, M.B., Peng, W., ... Challenges faced by China compared with the US in developing wind power.
The MAPE for the prediction model is calculated as follows:
$$\mathrm{MAPE}=\frac{100\%}{n\times 24}\sum_{i=1}^{n}\sum_{t=1}^{24}\left|\frac{P_{i,t}-A_{i,t}}{\max(A_{i,t},\epsilon)}\right|$$
where $n$ is the test set size. $P_{i,t}$ and $A_{i,t}$ denote the predicted and actual wind-power values, respectively, at hour $t$ during the day. We introduced $\epsilon=0.12$ in the MAPE calculation to prevent small fluctuations in low actual power outputs from unduly magnifying the error metric, thus avoiding an overemphasis on prediction accuracy during periods of minimal generation.
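A direct transcription of this error metric is sketched below. It assumes that each test sample is a list of 24 hourly values and that power is normalized so that the floor of 0.12 is on the same scale; the function name `mape` and the toy inputs are hypothetical.

```python
def mape(predicted, actual, eps=0.12):
    """MAPE in percent over n samples of 24 hourly values, with a floor eps
    on the denominator so that near-zero outputs do not dominate."""
    n = len(actual)
    total = 0.0
    for p_day, a_day in zip(predicted, actual):
        for p, a in zip(p_day, a_day):
            total += abs((p - a) / max(a, eps))
    return 100.0 * total / (n * 24)


# Regular output: every hour is over-predicted by 0.1 on an actual of 0.4.
error_regular = mape([[0.5] * 24], [[0.4] * 24])
# Near-zero output: the denominator is floored at eps instead of 0.
error_floored = mape([[0.06] * 24], [[0.0] * 24])
```

Without the floor, the second case would divide by zero; with it, an absolute error of 0.06 is scored as 50%.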
In the data-value learner model, data points are represented by input vectors $x$ consisting of hourly/daily NWP and the preceding 48-h power data, with $y$ representing the subsequent 24-h power data. These inputs are processed through a fully connected neural layer, and a sigmoid activation function computes the data value between 0 and 1:
$$\mathrm{Sigmoid}(x)=\frac{1}{1+e^{-x}}$$
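The forward pass of such a learner can be sketched as a single linear unit followed by the sigmoid. The assumption of exactly one output unit, the three-feature input, and the weights below are illustrative only and do not reflect the dimensions or parameters used in the study.

```python
import math


def sigmoid(z):
    """Logistic function 1 / (1 + exp(-z)), written to avoid overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def data_value(x, weights, bias):
    """One fully connected unit followed by a sigmoid: a value in (0, 1)."""
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return sigmoid(z)


# Hypothetical scaled features (e.g., an NWP variable and two past-power lags).
x = [0.2, -1.0, 0.5]
weights = [0.4, 0.1, -0.3]
bias = 0.0
value = data_value(x, weights, bias)
```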
Sensitivity calculation for each province
To assess the sensitivity of wind-power forecasting to data removal across various provinces, we define $\Delta$ as a metric quantifying the average change in the MAPE per 10% increase in data removal, up to a limit of 40%. The formula for $\Delta$ is given by
$$\Delta=\delta_{10\%}+\frac{\delta_{20\%}}{2}+\frac{\delta_{30\%}}{3}+\frac{\delta_{40\%}}{4}$$
Here, $\delta_{m\%}$ denotes the change in MAPE resulting from the removal of $m\%$ of the highest- or lowest-value data. This methodology allows for a comprehensive assessment of the impact of data valuation in removal experiments and effectively quantifies how data value influences wind-power prediction. Additionally, it serves as an aggregate indicator to evaluate the sensitivity of each province to the data value in relation to the prediction accuracy.
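The sensitivity index can be computed as in the sketch below. It rests on one reading of the formula, namely that each change in MAPE is divided by the number of 10% steps it spans (1, 2, 3, 4) before the terms are summed; the function name and the input numbers are made up for the example.

```python
def sensitivity(delta_by_percent):
    """Sum of MAPE changes, each normalized by its number of 10% steps.

    `delta_by_percent` maps the removal percentage (10, 20, 30, 40) to the
    change in MAPE observed at that removal level.
    """
    return sum(delta_by_percent[m] / (m // 10) for m in (10, 20, 30, 40))


# Made-up MAPE changes (percentage points) for one province.
province_deltas = {10: 1.0, 20: 2.0, 30: 3.0, 40: 4.0}
index = sensitivity(province_deltas)
```

Under this reading, a province whose MAPE changes linearly with the removed fraction contributes the same amount at every removal level.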
Acknowledgments
This work was supported by the National Key Research and Development Program of China (2022YFB2405600) and the National Natural Science Foundation of China (72241420, 52277092).
Author contributions
Y.W., J.W., and J.S. conceived and designed the research. Y.W. and J.W. developed the framework and formulated the theoretical model. Y.W., J.W., and F.G. carried out the data search. Y.W., J.W., and F.G. carried out the simulations. Y.W. and J.W. conducted the prediction analysis. All authors contributed to the discussions on the method and the writing of this article.
Declaration of interests
The authors declare no competing interests.
Transforming big data into smart data: An insight on the use of the k-nearest neighbors algorithm to obtain quality data
Wiley Interdisciplinary Reviews: Data Min. Knowl. Discov. 2019; 9:e1289
Dataset for estimation of obesity levels based on eating habits and physical condition in individuals from Colombia, Peru and Mexico