Unveiling value patterns via deep reinforcement learning in heterogeneous data analytics

Yanzhi Wang; Jianxiao Wang; Feng Gao; Jie Song

doi:10.1016/j.patter.2024.100965

Article; Volume 5, Issue 5, 100965; May 10, 2024; Open access


Yanzhi Wang ¹ ∙ Jianxiao Wang ²,³,⁴ ∙ Feng Gao ¹ ∙ Jie Song ¹,²,³

Affiliations

¹ Department of Industrial Engineering and Management, College of Engineering, Peking University, Beijing 100871, China
² National Engineering Laboratory for Big Data Analysis and Applications, Peking University, Beijing 100871, China
³ PKU-Changsha Institute for Computing and Digital Economy, Changsha 410000, China

Footnotes

⁴ Lead contact

Correspondence

Jianxiao Wang (ORCID 0000-0001-9871-5263): wang-jx@pku.edu.cn
Jie Song: jie.song@pku.edu.cn


The bigger picture

In the era of big data, the surge in volume is matched by the challenges of data quality, which resonates across all domains of data-driven and artificial-intelligence-related technologies. This research proposes a method to navigate these challenges by introducing a paradigm based on deep reinforcement learning, capable of discerning value in data across varied contexts. By enabling the strategic selection of optimal data samples, we envision a future where analytics are not just smarter but also more adaptable, allowing decision makers to harness the full potential of their data assets. The implications of this work extend beyond technical realms, offering insights that could shape policies and provide fresh perspectives in data-driven industries.

Highlights

• We introduce a learning-based paradigm for data valuation across scenarios
• Our study identifies varying effects of high/low-quality data on model efficacy
• This method explores inherent and transferable value patterns across datasets
• Analysis reveals data value’s geographic sensitivity in nationwide power forecasts

Summary

Artificial intelligence has substantially improved the efficiency of data utilization across various sectors. However, the insufficient filtering of low-quality data poses challenges to uncertainty management, threatening system stability. In this study, we introduce a data-valuation approach employing deep reinforcement learning to elucidate the value patterns in data-driven tasks. By strategically optimizing with iterative sampling and feedback, our method is effective in diverse scenarios and consistently outperforms the classic methods in both accuracy and efficiency. In China’s wind-power prediction, excluding 25% of the overall dataset deemed low-value led to a 10.5% improvement in accuracy. Utilizing just 42.8% of the dataset, the model discerned 80% of linear patterns, showcasing the data’s intrinsic and transferable value. A nationwide analysis identified a data-value-sensitive geographic belt across 10 provinces, leading to robust policy recommendations informed by variances in power outputs and data values, as well as geographic climate factors.
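The "iterative sampling and feedback" loop described in the summary can be illustrated with a minimal sketch. It is not the authors' LDV implementation: the deep policy network is replaced here by one selection logit per sample, the update is a plain REINFORCE step with a moving-average baseline, and the data, the mean-predictor model, and all names (`utility`, `learn_selection_probs`) are hypothetical choices made only for illustration.

```python
import math
import random


def utility(train_y, val_y):
    """Negative validation MSE of a mean predictor (predicts 0.0 on an empty set)."""
    pred = sum(train_y) / len(train_y) if train_y else 0.0
    return -sum((v - pred) ** 2 for v in val_y) / len(val_y)


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def learn_selection_probs(train_y, val_y, steps=3000, lr=0.1, seed=0):
    """REINFORCE over per-sample selection logits (a stand-in for a deep policy)."""
    rng = random.Random(seed)
    logits = [0.0] * len(train_y)
    baseline = None
    for _ in range(steps):
        probs = [sigmoid(z) for z in logits]
        # Sampling: draw a subset by keeping each sample with its current probability.
        mask = [1 if rng.random() < p else 0 for p in probs]
        subset = [y for y, m in zip(train_y, mask) if m]
        # Feedback: fit the toy model on the subset and score it on validation data.
        reward = utility(subset, val_y)
        if baseline is None:
            baseline = reward
        advantage = reward - baseline
        baseline = 0.9 * baseline + 0.1 * reward  # moving average reduces variance
        # For a Bernoulli policy, d log pi / d logit_i = mask_i - prob_i.
        for i, (m, p) in enumerate(zip(mask, probs)):
            logits[i] += lr * advantage * (m - p)
    return [sigmoid(z) for z in logits]


# Hypothetical toy data: the last training target is a low-quality outlier.
train_y = [1.0, 1.1, 0.9, 1.0, 5.0]
val_y = [1.0, 1.05, 0.95]
probs = learn_selection_probs(train_y, val_y)
```

In this toy setting the learned selection probability acts as a data value: samples whose inclusion tends to lower the validation score are selected less and less often.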


Introduction

Artificial intelligence (AI) technologies have revolutionized how data are used to build machine-learning models, improving real-world research analysis and production processes and delivering tangible benefits.¹

1. Reichstein, M. ∙ Camps-Valls, G. ∙ Stevens, B. ... Deep learning and process understanding for data-driven earth system science. Nature. 2019; 566:195-204


Subject	Task	Model	Metric	Method	Promotion (within 40% removal)	Declination (within 40% removal)	Running time (s)
Adult census	classification	XGB	accuracy	LOO	–	−8.5%	22
Adult census	classification	XGB	accuracy	SV	+0.5%	−11.5%	126
Adult census	classification	XGB	accuracy	LDV	+1.1%∗¹	−15.4%∗	104
Forest fire	regression	SVR	MAE reduction	LOO	+0.025	−0.056	5
Forest fire	regression	SVR	MAE reduction	SV	+0.089	−0.053	562
Forest fire	regression	SVR	MAE reduction	LDV	+0.169∗	−0.105∗	103
Obesity	regression	LGBM	MSE reduction	LOO	+0.021	−0.388	102
Obesity	regression	LGBM	MSE reduction	SV	–	−0.487	2,448
Obesity	regression	LGBM	MSE reduction	LDV	+0.031∗	−0.611∗	387
Heart failure	cluster	K-means	distance reduction	LOO	+0.068	–	2
Heart failure	cluster	K-means	distance reduction	SV	+0.070	−0.009	320
Heart failure	cluster	K-means	distance reduction	LDV	+0.264∗	−0.013∗	52
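The two baseline methods in the table can be sketched briefly, assuming that LOO denotes leave-one-out valuation and SV the Shapley value, as is usual in the data-valuation literature (the abbreviations are not expanded in this excerpt). The sketch uses a toy mean predictor and hypothetical data rather than the XGB, SVR, LGBM, or K-means models from the experiments, and it estimates the Shapley value by Monte Carlo permutation sampling, which may differ from the estimator the authors used.

```python
import random


def utility(train_y, val_y):
    """Negative validation MSE of a mean predictor (predicts 0.0 on an empty set)."""
    pred = sum(train_y) / len(train_y) if train_y else 0.0
    return -sum((v - pred) ** 2 for v in val_y) / len(val_y)


def loo_values(train_y, val_y):
    """Leave-one-out: value_i = U(all data) - U(all data without sample i)."""
    full = utility(train_y, val_y)
    return [
        full - utility(train_y[:i] + train_y[i + 1:], val_y)
        for i in range(len(train_y))
    ]


def mc_shapley_values(train_y, val_y, n_perms=200, seed=0):
    """Monte Carlo Shapley: average marginal gain of each sample over random orders."""
    rng = random.Random(seed)
    n = len(train_y)
    values = [0.0] * n
    for _ in range(n_perms):
        order = list(range(n))
        rng.shuffle(order)
        subset = []
        prev = utility(subset, val_y)
        for i in order:
            subset.append(train_y[i])
            cur = utility(subset, val_y)
            values[i] += (cur - prev) / n_perms
            prev = cur
    return values


# Hypothetical toy data: the last training target is a low-quality outlier.
train_y = [1.0, 1.1, 0.9, 1.0, 5.0]
val_y = [1.0, 1.05, 0.95]
loo = loo_values(train_y, val_y)
shap = mc_shapley_values(train_y, val_y)
```

LOO needs one retraining per sample, whereas the permutation estimate needs one retraining per sample per permutation. That difference is one plausible reason for the longer SV running times in the table, although the table itself does not state the cause.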


