
Introduction to the AI Index Report 2024

Welcome to the seventh edition of the AI Index report. The 2024 Index is our most comprehensive to date and arrives at an important moment when AI's influence on society has never been more pronounced. This year, we have broadened our scope to more extensively cover essential trends such as technical advancements in AI, public perceptions of the technology, and the geopolitical dynamics surrounding its development. Featuring more original data than ever before, this edition introduces new estimates on AI training costs, detailed analyses of the responsible AI landscape, and an entirely new chapter dedicated to AI's impact on science and medicine.

The AI Index report tracks, collates, distills, and visualizes data related to artificial intelligence (AI). Our mission is to provide unbiased, rigorously vetted, broadly sourced data in order for policymakers, researchers, executives, journalists, and the general public to develop a more thorough and nuanced understanding of the complex field of AI.

The AI Index is recognized globally as one of the most credible and authoritative sources for data and insights on artificial intelligence. Previous editions have been cited in major newspapers, including The New York Times, Bloomberg, and The Guardian; have amassed hundreds of academic citations; and have been referenced by high-level policymakers in the United States, the United Kingdom, and the European Union, among other places. This year's edition surpasses all previous ones in size, scale, and scope, reflecting the growing significance that AI is coming to hold in all of our lives.

Message From the Co-directors

A decade ago, the best AI systems in the world were unable to classify objects in images at a human level. AI struggled with language comprehension and could not solve math problems. Today, AI systems routinely exceed human performance on standard benchmarks.

Progress accelerated in 2023. New state-of-the-art systems like GPT-4, Gemini, and Claude 3 are impressively multimodal: They can generate fluent text in dozens of languages, process audio, and even explain memes. As AI has improved, it has increasingly forced its way into our lives. Companies are racing to build AI-based products, and AI is increasingly being used by the general public. But current AI technology still has significant problems. It cannot reliably deal with facts, perform complex reasoning, or explain its conclusions.

AI faces two interrelated futures. First, technology continues to improve and is increasingly used, having major consequences for productivity and employment. It can be put to both good and bad uses. In the second future, the adoption of AI is constrained by the limitations of the technology. Regardless of which future unfolds, governments are increasingly concerned. They are stepping in to encourage the upside, such as funding university R&D and incentivizing private investment. Governments are also aiming to manage the potential downsides, such as impacts on employment, privacy concerns, misinformation, and intellectual property rights.

As AI rapidly evolves, the AI Index aims to help the AI community, policymakers, business leaders, journalists, and the general public navigate this complex landscape. It provides ongoing, objective snapshots tracking several key areas: technical progress in AI capabilities, the community and investments driving AI development and deployment, public opinion on current and potential future impacts, and policy measures taken to stimulate AI innovation while managing its risks and challenges. By comprehensively monitoring the AI ecosystem, the Index serves as an important resource for understanding this transformative technological force.

On the technical front, this year's AI Index reports that the number of new large language models released worldwide in 2023 doubled over the previous year. Two-thirds were open-source, but the highest-performing models came from industry players with closed systems. Gemini Ultra became the first LLM to reach human-level performance on the Massive Multitask Language Understanding (MMLU) benchmark; performance on the benchmark has improved by 15 percentage points since last year. Additionally, GPT-4 achieved an impressive 0.96 mean win rate score on the comprehensive Holistic Evaluation of Language Models (HELM) benchmark, which includes MMLU among other evaluations.


Although global private investment in AI decreased for the second consecutive year, investment in generative AI skyrocketed. More Fortune 500 earnings calls mentioned AI than ever before, and new studies show that AI tangibly boosts worker productivity. On the policymaking front, global mentions of AI in legislative proceedings have never been higher. U.S. regulators passed more AI-related regulations in 2023 than ever before. Still, many expressed concerns about AI's ability to generate deepfakes and impact elections. The public became more aware of AI, and studies suggest that they responded with nervousness.

Ray Perrault and Jack Clark

Co-directors, AI Index

Top 10 Takeaways

1. AI beats humans on some tasks, but not on all. AI has surpassed human performance on several benchmarks, including some in image classification, visual reasoning, and English understanding. Yet it trails behind on more complex tasks like competition-level mathematics, visual commonsense reasoning, and planning.
2. Industry continues to dominate frontier AI research. In 2023, industry produced 51 notable machine learning models, while academia contributed only 15. There were also 21 notable models resulting from industry-academia collaborations in 2023, a new high.
3. Frontier models get way more expensive. According to AI Index estimates, the training costs of state-of-the-art AI models have reached unprecedented levels. For example, OpenAI's GPT-4 used an estimated million worth of compute to train, while Google's Gemini Ultra cost million for compute.
4. The United States leads China, the EU, and the U.K. as the leading source of top AI models. In 2023, 61 notable AI models originated from U.S.-based institutions, far outpacing the European Union's 21 and China's 15.
5. Robust and standardized evaluations for LLM responsibility are seriously lacking. New research from the AI Index reveals a significant lack of standardization in responsible AI reporting. Leading developers, including OpenAI, Google, and Anthropic, primarily test their models against different responsible AI benchmarks. This practice complicates efforts to systematically compare the risks and limitations of top AI models.
6. Generative AI investment skyrockets. Despite a decline in overall AI private investment last year, funding for generative AI surged, nearly octupling from 2022 to reach billion. Major players in the generative AI space, including OpenAI, Anthropic, Hugging Face, and Inflection, reported substantial fundraising rounds.
7. The data is in: AI makes workers more productive and leads to higher quality work. In 2023, several studies assessed AI's impact on labor, suggesting that AI enables workers to complete tasks more quickly and to improve the quality of their output. These studies also demonstrated AI's potential to bridge the skill gap between low- and high-skilled workers. Still, other studies caution that using AI without proper oversight can lead to diminished performance.


8. Scientific progress accelerates even further, thanks to AI. In 2022, AI began to advance scientific discovery. 2023, however, saw the launch of even more significant science-related AI applications, from AlphaDev, which makes algorithmic sorting more efficient, to GNoME, which facilitates the process of materials discovery.
9. The number of AI regulations in the United States sharply increases. The number of AI-related regulations in the U.S. has risen significantly in the past year and over the last five years. In 2023, there were 25 AI-related regulations, up from just one in 2016. Last year alone, the total number of AI-related regulations grew by .
10. People across the globe are more cognizant of AI's potential impact, and more nervous. A survey from Ipsos shows that, over the last year, the proportion of those who think AI will dramatically affect their lives in the next three to five years has increased from to . Moreover, express nervousness toward AI products and services, marking a 13 percentage point rise from 2022. In America, Pew data suggests that of Americans report feeling more concerned than excited about AI, rising from 37% in 2022.

Steering Committee

Co-directors

Jack Clark, Anthropic, OECD

Raymond Perrault, SRI International

Members

Erik Brynjolfsson, Stanford University

John Etchemendy, Stanford University

Katrina Ligett, Hebrew University

Terah Lyons, JPMorgan Chase & Co.

James Manyika, Google, University of Oxford

Juan Carlos Niebles, Stanford University, Salesforce

Vanessa Parli, Stanford University

Russell Wald, Stanford University

Staff and Researchers

Research Manager and Editor in Chief

Nestor Maslej, Stanford University

Research Associate

Loredana Fattorini, Stanford University

Affiliated Researchers

Elif Kiesow Cortez, Stanford Law School Research Fellow

Anka Reuel, Stanford University

Robi Rahman, Data Scientist

Alexandra Rome, Freelance Researcher

Lapo Santarlasci, IMT School for Advanced Studies Lucca

Graduate Researchers

James da Costa, Stanford University

Simba Jonga, Stanford University

Undergraduate Researchers

Emily Capstick, Stanford University

Summer Flowers, Stanford University

Armin Hamrah, Claremont McKenna College

Amelia Hardy, Stanford University

Mena Hassan, Stanford University

Ethan Duncan He-Li Hellman, Stanford University

Julia Betts Lotufo, Stanford University

Sukrut Oak, Stanford University

Andrew Shi, Stanford University

Jason Shin, Stanford University

Emma Williamson, Stanford University

Alfred Yu, Stanford University

How to Cite This Report

Nestor Maslej, Loredana Fattorini, Raymond Perrault, Vanessa Parli, Anka Reuel, Erik Brynjolfsson, John Etchemendy, Katrina Ligett, Terah Lyons, James Manyika, Juan Carlos Niebles, Yoav Shoham, Russell Wald, and Jack Clark, "The AI Index 2024 Annual Report," AI Index Steering Committee, Institute for Human-Centered AI, Stanford University, Stanford, CA, April 2024.

The AI Index 2024 Annual Report by Stanford University is licensed under Attribution-NoDerivatives 4.0 International.

Public Data and Tools

The AI Index 2024 Report is supplemented by raw data and an interactive tool. We invite each reader to use the data and the tool in a way most relevant to their work and interests.

Raw data and charts: The public data and high-resolution images of all the charts in the report are available on Google Drive.
Global AI Vibrancy Tool: Compare the AI ecosystems of over 30 countries. The Global AI Vibrancy tool will be updated in the summer of 2024.

AI Index and Stanford HAI

The AI Index is an independent initiative at the Stanford Institute for Human-Centered Artificial Intelligence (HAI).


The AI Index was conceived within the One Hundred Year Study on Artificial Intelligence (AI100).

The AI Index welcomes feedback and new ideas for next year. Contact us at AI-Index-Report@stanford.edu.

The AI Index acknowledges that while authored by a team of human researchers, its writing process was aided by AI tools. Specifically, the authors used ChatGPT and Claude to help tighten and copy edit initial drafts. The workflow involved authors writing the original copy, then utilizing AI tools as part of the editing process.

Supporting Partners

Analytics and Research Partners

[Partner logos: Accenture, CSET, Epoch, GitHub, Govini, Lightcast, LinkedIn]

Contributors

The AI Index wants to acknowledge the following individuals by chapter and section for their contributions of data, analysis, advice, and expert commentary included in the AI Index 2024 Report:

Introduction

Loredana Fattorini, Nestor Maslej, Vanessa Parli, Ray Perrault

Chapter 1: Research and Development

Catherine Aiken, Terry Auricchio, Tamay Besiroglu, Rishi Bommasani, Andrew Brown, Peter Cihon, James da Costa, Ben Cottier, James Cussens, James Dunham, Meredith Ellison, Loredana Fattorini, Enrico Gerding, Anson Ho, Percy Liang, Nestor Maslej, Greg Mori, Tristan Naumann, Vanessa Parli, Pavlos Peppas, Ray Perrault, Robi Rahman, Vesna Sablijakovic-Fritz, Jim Schmiedeler, Jaime Sevilla, Autumn Toney, Kevin Xu, Meg Young, Milena Zeithamlova

Chapter 2: Technical Performance

Rishi Bommasani, Emma Brunskill, Erik Brynjolfsson, Emily Capstick, Jack Clark, Loredana Fattorini, Tobi Gertsenberg, Noah Goodman, Nicholas Haber, Sanmi Koyejo, Percy Liang, Katrina Ligett, Sasha Luccioni, Nestor Maslej, Juan Carlos Niebles, Sukrut Oak, Vanessa Parli, Ray Perrault, Andrew Shi, Yoav Shoham, Emma Williamson

Chapter 3: Responsible AI

Jack Clark, Loredana Fattorini, Amelia Hardy, Katrina Ligett, Nestor Maslej, Vanessa Parli, Ray Perrault, Anka Reuel, Andrew Shi

Chapter 4: Economy

Susanne Bieller, Erik Brynjolfsson, Mar Carpanelli, James da Costa, Natalia Dorogi, Heather English, Murat Erer, Loredana Fattorini, Akash Kaura, James Manyika, Nestor Maslej, Cal McKeever, Julia Nitschke, Layla O’Kane, Vanessa Parli, Ray Perrault, Brittany Presten, Carl Shan, Bill Valle, Casey Weston, Emma Williamson

Chapter 5: Science and Medicine

Russ Altman, Loredana Fattorini, Remi Lam, Curtis Langlotz, James Manyika, Nestor Maslej, Vanessa Parli, Ray Perrault, Emma Williamson

Chapter 6: Education

Betsy Bizot, John Etchemendy, Loredana Fattorini, Kirsten Feddersen, Matt Hazenbush, Nestor Maslej, Vanessa Parli, Ray Perrault, Svetlana Tikhonenko, Laurens Vehmeijer, Hannah Weissman, Stuart Zweben

Chapter 7: Policy and Governance

Alison Boyer, Elif Kiesow Cortez, Rebecca DeCrescenzo, David Freeman Engstrom, Loredana Fattorini, Philip de Guzman, Mena Hassan, Ethan Duncan He-Li Hellman, Daniel Ho, Simba Jonga, Rohini Kosoglu, Mark Lemley, Julia Betts Lotufo, Nestor Maslej, Caroline Meinhardt, Julian Nyarko, Jeff Park, Vanessa Parli, Ray Perrault, Alexandra Rome, Lapo Santarlasci, Sarah Smedley, Russell Wald, Emma Williamson, Daniel Zhang

Chapter 8: Diversity

Betsy Bizot, Loredana Fattorini, Kirsten Feddersen, Matt Hazenbush, Nestor Maslej, Vanessa Parli, Ray Perrault, Svetlana Tikhonenko, Laurens Vehmeijer, Caroline Weis, Hannah Weissman, Stuart Zweben

Chapter 9: Public Opinion

Maggie Arai, Heather English, Loredana Fattorini, Armin Hamrah, Peter Loewen, Nestor Maslej, Vanessa Parli, Ray Perrault, Marco Monteiro Silva, Lee Slinger, Bill Valle, Russell Wald

The AI Index thanks the following organizations and individuals who provided data for inclusion in this year's report:

Organizations

Center for Research on Foundation Models
Rishi Bommasani, Percy Liang

Center for Security and Emerging Technology, Georgetown University
Catherine Aiken, James Dunham, Autumn Toney

Code.org
Hannah Weissman

Computing Research Association
Betsy Bizot, Stuart Zweben

Epoch
Ben Cottier, Robi Rahman

GitHub
Peter Cihon, Kevin Xu

Govini
Alison Boyer, Rebecca DeCrescenzo, Philip de Guzman, Jeff Park

Informatics Europe
Svetlana Tikhonenko

International Federation of Robotics
Susanne Bieller

Lightcast
Cal McKeever, Julia Nitschke, Layla O’Kane

LinkedIn
Murat Erer, Akash Kaura, Casey Weston

McKinsey & Company
Natalia Dorogi, Brittany Presten

Munk School of Global Affairs and Public Policy
Peter Loewen, Lee Slinger

Quid
Heather English, Bill Valle

Schwartz Reisman Institute for Technology and Society
Maggie Arai, Marco Monteiro Silva

Studyportals
Kirsten Feddersen, Laurens Vehmeijer

Women in Machine Learning
Caroline Weis

The AI Index also thanks Jeanina Casusi, Nancy King, Carolyn Lehman, Shana Lynch, Jonathan Mindes, and Michi Turner for their help in preparing this report; Joe Hinman and Nabarun Mukherjee for their help in maintaining the AI Index website; and Annie Benisch, Marc Gough, Panos Madamopoulos-Moraris, Kaci Peel, Drew Spence, Madeline Wright, and Daniel Zhang for their work in helping promote the report.

Table of Contents

Report Highlights
Chapter 1: Research and Development
Chapter 2: Technical Performance
Chapter 3: Responsible AI
Chapter 4: Economy
Chapter 5: Science and Medicine
Chapter 6: Education
Chapter 7: Policy and Governance
Chapter 8: Diversity
Chapter 9: Public Opinion
Appendix

Report Highlights

Chapter 1: Research and Development

1. Industry continues to dominate frontier AI research. In 2023, industry produced 51 notable machine learning models, while academia contributed only 15. There were also 21 notable models resulting from industry-academia collaborations in 2023, a new high.
2. More foundation models and more open foundation models. In 2023, a total of 149 foundation models were released, more than double the amount released in 2022. Of these newly released models, were open-source, compared to only in 2022 and in 2021.
3. Frontier models get way more expensive. According to AI Index estimates, the training costs of state-of-the-art AI models have reached unprecedented levels. For example, OpenAI's GPT-4 used an estimated million worth of compute to train, while Google's Gemini Ultra cost million for compute.
4. The United States leads China, the EU, and the U.K. as the leading source of top AI models. In 2023, 61 notable AI models originated from U.S.-based institutions, far outpacing the European Union's 21 and China's 15.
5. The number of AI patents skyrockets. From 2021 to 2022, AI patent grants worldwide increased sharply by . Since 2010, the number of granted AI patents has increased more than 31 times.
6. China dominates AI patents. In 2022, China led global AI patent origins with , significantly outpacing the United States, which accounted for of AI patent origins. Since 2010, the U.S. share of AI patents has decreased from .
7. Open-source AI research explodes. Since 2011, the number of AI-related projects on GitHub has seen a consistent increase, growing from 845 in 2011 to approximately 1.8 million in 2023. Notably, there was a sharp 59.3% rise in the total number of GitHub AI projects in 2023 alone. The total number of stars for AI-related projects on GitHub also significantly increased in 2023, more than tripling from 4.0 million in 2022 to 12.2 million.
8. The number of AI publications continues to rise. Between 2010 and 2022, the total number of AI publications nearly tripled, rising from approximately 88,000 in 2010 to more than 240,000 in 2022. The increase over the last year was a modest .
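
The growth figures quoted in these highlights can be checked with a few lines of arithmetic. The sketch below is illustrative only: it uses the rounded numbers from the text (4.0 million to 12.2 million GitHub stars, and roughly 88,000 to 240,000 AI publications), and the helper names `fold_change` and `percent_change` are hypothetical, not part of any AI Index tooling.

```python
def fold_change(start: float, end: float) -> float:
    """Return how many times larger `end` is than `start`."""
    if start <= 0:
        raise ValueError("start must be positive")
    return end / start


def percent_change(start: float, end: float) -> float:
    """Return the relative change from `start` to `end`, in percent."""
    return (fold_change(start, end) - 1.0) * 100.0


# Rounded figures quoted in the highlights (assumed exact for this sketch).
github_stars = (4.0e6, 12.2e6)        # 2022 -> 2023
ai_publications = (88_000, 240_000)   # 2010 -> 2022

stars_fold = fold_change(*github_stars)
pubs_fold = fold_change(*ai_publications)

print(f"GitHub stars: {stars_fold:.2f}x ({percent_change(*github_stars):.0f}% increase)")
print(f"AI publications: {pubs_fold:.2f}x ({percent_change(*ai_publications):.0f}% increase)")
```

With these rounded inputs, the star count comes out slightly above a threefold increase and the publication count slightly below one, which is consistent with the wording "more than tripling" and "nearly tripled."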

Chapter 2: Technical Performance

1. AI beats humans on some tasks, but not on all. AI has surpassed human performance on several benchmarks, including some in image classification, visual reasoning, and English understanding. Yet it trails behind on more complex tasks like competition-level mathematics, visual commonsense reasoning, and planning.
2. Here comes multimodal AI. Traditionally, AI systems have been limited in scope, with language models excelling in text comprehension but faltering in image processing, and vice versa. However, recent advancements have led to the development of strong multimodal models, such as Google's Gemini and OpenAI's GPT-4. These models demonstrate flexibility and are capable of handling images and text and, in some instances, can even process audio.
3. Harder benchmarks emerge. AI models have reached performance saturation on established benchmarks such as ImageNet, SQuAD, and SuperGLUE, prompting researchers to develop more challenging ones. In 2023, several challenging new benchmarks emerged, including SWE-bench for coding, HEIM for image generation, MMMU for general reasoning, MoCa for moral reasoning, AgentBench for agent-based behavior, and HaluEval for hallucinations.
4. Better AI means better data which means ... even better AI. New AI models such as SegmentAnything and Skoltech are being used to generate specialized data for tasks like image segmentation and 3D reconstruction. Data is vital for AI technical improvements. The use of AI to create more data enhances current capabilities and paves the way for future algorithmic improvements, especially on harder tasks.
5. Human evaluation is in. With generative models producing high-quality text, images, and more, benchmarking has slowly started shifting toward incorporating human evaluations like the Chatbot Arena Leaderboard rather than computerized rankings like ImageNet or SQuAD. Public sentiment about AI is becoming an increasingly important consideration in tracking AI progress.
6. Thanks to LLMs, robots have become more flexible. The fusion of language modeling with robotics has given rise to more flexible robotic systems like PaLM-E and RT-2. Beyond their improved robotic capabilities, these models can ask questions, which marks a significant step toward robots that can interact more effectively with the real world.


7. More technical research in agentic AI. Creating AI agents, systems capable of autonomous operation in specific environments, has long challenged computer scientists. However, emerging research suggests that the performance of autonomous agents is improving. Current agents can now master complex games like Minecraft and effectively tackle real-world tasks, such as online shopping and research assistance.
8. Closed LLMs significantly outperform open ones. On 10 select AI benchmarks, closed models outperformed open ones, with a median performance advantage of . Differences in the performance of closed and open models carry important implications for AI policy debates.

Chapter 3: Responsible AI

1. Robust and standardized evaluations for LLM responsibility are seriously lacking. New research from the AI Index reveals a significant lack of standardization in responsible AI reporting. Leading developers, including OpenAI, Google, and Anthropic, primarily test their models against different responsible AI benchmarks. This practice complicates efforts to systematically compare the risks and limitations of top AI models.
2. Political deepfakes are easy to generate and difficult to detect. Political deepfakes are already affecting elections across the world, with recent research suggesting that existing AI deepfake methods perform with varying levels of accuracy. In addition, new projects like CounterCloud demonstrate how easily AI can create and disseminate fake content.
3. Researchers discover more complex vulnerabilities in LLMs. Previously, most efforts to red-team AI models focused on testing adversarial prompts that intuitively made sense to humans. This year, researchers found less obvious strategies to get LLMs to exhibit harmful behavior, like asking the models to infinitely repeat random words.
4. Risks from AI are becoming a concern for businesses across the globe. A global survey on responsible AI highlights that companies' top AI-related concerns include privacy, data security, and reliability. The survey shows that organizations are beginning to take steps to mitigate these risks. Globally, however, most companies have so far only mitigated a small portion of these risks.
5. LLMs can output copyrighted material. Multiple researchers have shown that the generative outputs of popular LLMs may contain copyrighted material, such as excerpts from The New York Times or scenes from movies. Whether such output constitutes copyright violations is becoming a central legal question.
6. AI developers score low on transparency, with consequences for research. The newly introduced Foundation Model Transparency Index shows that AI developers lack transparency, especially regarding the disclosure of training data and methodologies. This lack of openness hinders efforts to further understand the robustness and safety of AI systems.

Chapter 3: Responsible AI (cont'd)
第三章:负责任的人工智能(续)

Extreme Al risks are difficult to analyze. Over the past year, a substantial debate has emerged among Al scholars and practitioners regarding the focus on immediate model risks, like algorithmic discrimination, versus potential long-term existential threats. It has become challenging to distinguish which claims are scientifically founded and should inform policymaking. This difficulty is compounded by the tangible nature of already present short-term risks in contrast with the theoretical nature of existential threats.
极端 AI 风险很难分析。在过去的一年中，AI 学者和从业者之间出现了一场关于立即模型风险（如算法歧视）与潜在的长期存在威胁的辩论。区分哪些声明在科学上有根据并应该影响政策制定已变得具有挑战性。这一困难被已经存在的短期风险的具体性质与存在威胁的理论性质所加剧。
The number of Al incidents continues to rise. According to the Al Incident Database, which tracks incidents related to the misuse of Al, 123 incidents were reported in 2023, a 32.3 percentage point increase from 2022. Since 2013, Al incidents have grown by over twentyfold. A notable example includes Al-generated, sexually explicit deepfakes of Taylor Swift that were widely shared online.
AI 事件的数量继续增加。根据 AI 事件数据库的数据，追踪与 AI 滥用相关的事件，2023 年报告了 123 起事件，比 2022 年增长了 32.3 个百分点。自 2013 年以来，AI 事件增长了超过二十倍。一个显著的例子是泰勒·斯威夫特的 AI 生成的深度假视频，在网上广为传播。
ChatGPT is politically biased. Researchers find a significant bias in ChatGPT toward Democrats in the United States and the Labour Party in the U.K. This finding raises concerns about the tool's potential to influence users' political views, particularly in a year marked by major global elections.
ChatGPT 在政治上存在偏见。研究人员发现 ChatGPT 在美国偏向民主党和英国工党。这一发现引发了对该工具可能影响用户政治观点的担忧，尤其是在一个以重大全球选举为特征的年份。

Report Highlights

Chapter 4: Economy

1. Generative AI investment skyrockets. Despite a decline in overall AI private investment last year, funding for generative AI surged, nearly octupling from 2022. Major players in the generative AI space, including OpenAI, Anthropic, Hugging Face, and Inflection, reported substantial fundraising rounds.

2. Already a leader, the United States pulls even further ahead in AI private investment. In 2023, AI investment in the United States was nearly 8.7 times that of China, the next highest investor. While private AI investment in China and the European Union, including the United Kingdom, declined since 2022, the United States experienced a notable increase in the same time frame.

3. Fewer AI jobs in the United States and across the globe. AI-related positions made up a smaller share of all job postings in America in 2023 than in 2022. This decline in AI job listings is attributed to fewer postings from leading AI firms and a reduced proportion of tech roles within these companies.

4. AI decreases costs and increases revenues. A new McKinsey survey finds that organizations implementing AI (including generative AI) report cost reductions and revenue increases. Compared to the previous year, there was a 10 percentage point increase in respondents reporting decreased costs, suggesting AI is driving significant business efficiency gains.

5. Total AI private investment declines again, while the number of newly funded AI companies increases. Global private AI investment has fallen for the second year in a row, though less than the sharp decrease from 2021 to 2022. The count of newly funded AI companies spiked to 1,812, up from the previous year.

6. AI organizational adoption ticks up. A 2023 McKinsey report reveals that the share of organizations using AI (including generative AI) in at least one business unit or function has risen since 2022 and since 2017.

7. China dominates industrial robotics. Since surpassing Japan in 2013 as the leading installer of industrial robots, China has significantly widened the gap with the nearest competitor nation. China's share of global installations rose between 2013 and 2022.

8. Greater diversity in robot installations. Collaborative robots made up only a small share of all new industrial robot installations in 2017, a share that had climbed by 2022. Similarly, 2022 saw a rise in service robot installations across all application categories, except for medical robotics. This trend indicates not just an overall increase in robot installations but also a growing emphasis on deploying robots for human-facing roles.

9. The data is in: AI makes workers more productive and leads to higher quality work. In 2023, several studies assessed AI's impact on labor, suggesting that AI enables workers to complete tasks more quickly and to improve the quality of their output. These studies also demonstrated AI's potential to bridge the skill gap between low- and high-skilled workers. Still, other studies caution that using AI without proper oversight can lead to diminished performance.

10. Fortune 500 companies start talking a lot about AI, especially generative AI. In 2023, AI was mentioned in 394 earnings calls (nearly 80% of all Fortune 500 companies), a notable increase from 266 mentions in 2022. Since 2018, mentions of AI in Fortune 500 earnings calls have nearly doubled. The most frequently cited theme was generative AI.

Report Highlights

Chapter 5: Science and Medicine

1. Scientific progress accelerates even further, thanks to AI. In 2022, AI began to advance scientific discovery. 2023, however, saw the launch of even more significant science-related AI applications, from AlphaDev, which makes algorithmic sorting more efficient, to GNoME, which facilitates the process of materials discovery.

2. AI helps medicine take significant strides forward. In 2023, several significant medical systems were launched, including EVEscape, which enhances pandemic prediction, and AlphaMissense, which assists in AI-driven mutation classification. AI is increasingly being utilized to propel medical advancements.

3. Highly knowledgeable medical AI has arrived. Over the past few years, AI systems have shown remarkable improvement on the MedQA benchmark, a key test for assessing AI's clinical knowledge. The standout model of 2023, GPT-4 Medprompt, improved on the highest score in 2022 by 22.6 percentage points. Since the benchmark's introduction in 2019, AI performance on MedQA has nearly tripled.

4. The FDA approves more and more AI-related medical devices. In 2022, the FDA approved 139 AI-related medical devices, a 12.1% increase from 2021. Since 2012, the number of FDA-approved AI-related medical devices has increased by more than 45-fold. AI is increasingly being used for real-world medical purposes.

Report Highlights

Chapter 6: Education

1. The number of American and Canadian CS bachelor's graduates continues to rise, new CS master's graduates stay relatively flat, and PhD graduates modestly grow. While the number of new American and Canadian bachelor's graduates has consistently risen for more than a decade, the number of students opting for graduate education in CS has flattened. Since 2018, the number of CS master's and PhD graduates has slightly declined.

2. The migration of AI PhDs to industry continues at an accelerating pace. In 2011, roughly equal percentages of new AI PhDs took jobs in industry (40.9%) and academia (41.6%). However, by 2022, a significantly larger proportion (70.7%) joined industry after graduation compared to those entering academia (20.0%). Over the past year alone, the share of industry-bound AI PhDs has risen by 5.3 percentage points, indicating an intensifying brain drain from universities into industry.

3. Less transition of academic talent from industry to academia. The share of new AI faculty in the United States and Canada who came from industry had declined to 11% by 2021, and it dropped further in 2022. This trend indicates a progressively lower migration of high-level AI talent from industry into academia.

4. CS education in the United States and Canada becomes less international. Proportionally fewer international CS bachelor's, master's, and PhD students graduated in 2022 than in 2021. The drop in international students in the master's category was especially pronounced.

5. More American high school students take CS courses, but access problems remain. In 2022, 201,000 AP CS exams were administered. Since 2007, the number of students taking these exams has increased more than tenfold. However, recent evidence indicates that students in larger high schools and those in suburban areas are more likely to have access to CS courses.

6. AI-related degree programs are on the rise internationally. The number of English-language, AI-related postsecondary degree programs has tripled since 2017, showing a steady annual increase over the past five years. Universities worldwide are offering more AI-focused degree programs.

7. The United Kingdom and Germany lead in European informatics, CS, CE, and IT graduate production. The United Kingdom and Germany lead Europe in producing the highest number of new informatics, CS, CE, and IT bachelor's, master's, and PhD graduates. On a per capita basis, Finland leads in the production of both bachelor's and PhD graduates, while Ireland leads in the production of master's graduates.
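The highlights above mix two ways of describing a change in a share: a difference in percentage points and a relative change in percent. The following is a minimal Python sketch of the distinction, using only the 2011 and 2022 industry shares of new AI PhDs quoted above. The function names are illustrative assumptions and do not come from the report.

```python
def percentage_point_change(old_share: float, new_share: float) -> float:
    """Absolute difference between two shares, in percentage points."""
    return new_share - old_share


def relative_change(old_share: float, new_share: float) -> float:
    """Change relative to the starting share, in percent."""
    if old_share == 0:
        raise ValueError("relative change is undefined for a zero baseline")
    return (new_share - old_share) / old_share * 100.0


# Shares of new AI PhDs taking industry jobs, as quoted in the highlights.
industry_2011 = 40.9
industry_2022 = 70.7

pp_change = percentage_point_change(industry_2011, industry_2022)
rel_change = relative_change(industry_2011, industry_2022)

print(f"{pp_change:.1f} percentage points")
print(f"{rel_change:.1f}% relative increase")
```

The two figures answer different questions, so it helps to state which one is meant whenever a share is said to have "risen by" some amount.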

Report Highlights

Chapter 7: Policy and Governance

1. The number of AI regulations in the United States sharply increases. The number of AI-related regulations has risen significantly in the past year and over the last five years, up from just one in 2016.

2. The United States and the European Union advance landmark AI policy action. In 2023, policymakers on both sides of the Atlantic put forth substantial proposals for advancing AI regulation. The European Union reached a deal on the terms of the AI Act, a landmark piece of legislation enacted in 2024. Meanwhile, President Biden signed an Executive Order on AI, the most notable AI policy initiative in the United States that year.

3. AI captures U.S. policymaker attention. The year 2023 witnessed a remarkable increase in AI-related legislation at the federal level, with 181 bills proposed, more than double the 88 proposed in 2022.

4. Policymakers across the globe cannot stop talking about AI. Mentions of AI in legislative proceedings across the globe have nearly doubled, rising from 1,247 in 2022 to 2,175 in 2023. AI was mentioned in the legislative proceedings of 49 countries in 2023. Moreover, at least one country from every continent discussed AI in 2023, underscoring the truly global reach of AI policy discourse.

5. More regulatory agencies turn their attention toward AI. The number of U.S. regulatory agencies issuing AI regulations increased to 21 in 2023 from 17 in 2022, indicating a growing concern over AI regulation among a broader array of American regulatory bodies. Some of the new regulatory agencies that enacted AI-related regulations for the first time in 2023 include the Department of Transportation, the Department of Energy, and the Occupational Safety and Health Administration.
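Phrases such as "more than double" and "nearly doubled" can be checked against the counts given above by computing a simple ratio. This is a minimal Python sketch; `fold_change` is a hypothetical helper name, not something defined in the report.

```python
def fold_change(old: float, new: float) -> float:
    """How many times larger `new` is than `old`."""
    if old <= 0:
        raise ValueError("fold change needs a positive baseline")
    return new / old


# Federal AI-related bills proposed: 88 in 2022, 181 in 2023.
bills_ratio = fold_change(88, 181)

# Mentions of AI in legislative proceedings worldwide: 1,247 in 2022, 2,175 in 2023.
mentions_ratio = fold_change(1_247, 2_175)

print(f"bills: {bills_ratio:.2f}x")
print(f"mentions: {mentions_ratio:.2f}x")
```

A ratio just above 2 matches "more than double", while a ratio between 1.5 and 2 is the kind of value usually summarized as "nearly doubled".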

Report Highlights

Chapter 8: Diversity

1. U.S. and Canadian bachelor's, master's, and PhD CS students continue to grow more ethnically diverse. While white students continue to be the most represented ethnicity among new resident graduates at all three levels, the representation from other ethnic groups, such as Asian, Hispanic, and Black or African American students, continues to grow. For instance, since 2011, the proportion of Asian CS bachelor's degree graduates has increased by 19.8 percentage points, and the proportion of Hispanic CS bachelor's degree graduates has grown by 5.2 percentage points.

2. Substantial gender gaps persist in European informatics, CS, CE, and IT graduates at all educational levels. Every surveyed European country reported more male than female graduates in bachelor's, master's, and PhD programs for informatics, CS, CE, and IT. While the gender gaps have narrowed in most countries over the last decade, the rate of this narrowing has been slow.

3. U.S. K-12 CS education is growing more diverse, reflecting changes in both gender and ethnic representation. The proportion of AP CS exams taken by female students has risen since 2007, reaching 30.5% in 2022. Similarly, the participation of Asian, Hispanic/Latino/Latina, and Black/African American students in AP CS has consistently increased year over year.

Report Highlights

Chapter 9: Public Opinion

1. People across the globe are more cognizant of AI's potential impact, and more nervous. A survey from Ipsos shows that, over the last year, the proportion of those who think AI will dramatically affect their lives in the next three to five years has increased. Moreover, the share expressing nervousness toward AI products and services rose 13 percentage points from 2022. In America, Pew data suggests that the share of Americans who report feeling more concerned than excited about AI has risen since 2022.

2. AI sentiment in Western nations continues to be low, but is slowly improving. In 2022, several developed Western nations, including Germany, the Netherlands, Australia, Belgium, Canada, and the United States, were among the least positive about AI products and services. Since then, each of these countries has seen a rise in the proportion of respondents acknowledging the benefits of AI, with the Netherlands experiencing the most significant shift.

3. The public is pessimistic about AI's economic impact. In an Ipsos survey, relatively few respondents feel AI will improve their job, boost the economy, or enhance the job market.

4. Demographic differences emerge regarding AI optimism. Significant demographic differences exist in perceptions of AI's potential to enhance livelihoods, with younger generations generally more optimistic. For instance, younger respondents are more likely to believe AI will improve entertainment options.

5. ChatGPT is widely known and widely used. An international survey from the University of Toronto suggests broad awareness of ChatGPT among respondents. Of those aware, around half report using ChatGPT at least once weekly.

CHAPTER 1: Research and Development

Preview

Overview
Chapter Highlights
1.1 Publications
  Overview
  Total Number of AI Publications
  By Type of Publication
  By Field of Study
  By Sector
  AI Journal Publications
  AI Conference Publications
1.2 Patents
  AI Patents
    Overview
    By Filing Status and Region
1.3 Frontier AI Research
  General Machine Learning Models
    Overview
    Sector Analysis
    National Affiliation
    Parameter Trends
    Compute Trends
    Highlight: Will Models Run Out of Data?
  Foundation Models
    Model Release
    Organizational Affiliation
    National Affiliation
    Training Cost
1.4 AI Conferences
  Conference Attendance
1.5 Open-Source AI Software
  Projects
  Stars

This chapter studies trends in AI research and development. It begins by examining trends in AI publications and patents, and then examines trends in notable AI systems and foundation models. It concludes by analyzing AI conference attendance and open-source AI software projects.

Chapter Highlights

1. Industry continues to dominate frontier AI research. In 2023, industry produced 51 notable machine learning models, while academia contributed only 15. There were also 21 notable models resulting from industry-academia collaborations in 2023, a new high.

2. More foundation models and more open foundation models. In 2023, a total of 149 foundation models were released, more than double the amount released in 2022. A larger share of these newly released models were open-source than in 2022 or 2021.

3. Frontier models get way more expensive. According to AI Index estimates, the training costs of state-of-the-art AI models have reached unprecedented levels. For example, training OpenAI's GPT-4 and Google's Gemini Ultra each required compute estimated to cost millions of dollars.

4. The United States leads China, the EU, and the U.K. as the leading source of top AI models. In 2023, 61 notable AI models originated from U.S.-based institutions, far outpacing the European Union's 21 and China's 15.

5. The number of AI patents skyrockets. From 2021 to 2022, AI patent grants worldwide increased sharply. Since 2010, the number of granted AI patents has increased more than 31 times.

6. China dominates AI patents. In 2022, China led global AI patent origins with 61.1%, significantly outpacing the United States, which accounted for 20.9% of AI patent origins. Since 2010, the U.S. share of AI patents has decreased.

7. Open-source AI research explodes. Since 2011, the number of AI-related projects on GitHub has seen a consistent increase, growing from 845 in 2011 to approximately 1.8 million in 2023. Notably, there was a sharp 59.3% rise in the total number of GitHub AI projects in 2023 alone. The total number of stars for AI-related projects on GitHub also significantly increased in 2023, more than tripling from 4.0 million in 2022 to 12.2 million.

8. The number of AI publications continues to rise. Between 2010 and 2022, the total number of AI publications nearly tripled, rising from approximately 88,000 in 2010 to more than 240,000 in 2022. The increase over the last year was modest.
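A multi-year total such as "nearly tripled between 2010 and 2022" can also be expressed as an average annual rate. The sketch below computes a compound annual growth rate in Python. This is an illustrative derivation, not a figure from the report: both endpoints are the rounded values quoted above ("approximately 88,000" and "more than 240,000"), so the result is only a rough indication.

```python
def compound_annual_growth_rate(start: float, end: float, years: int) -> float:
    """Average yearly growth rate, in percent, that turns `start` into `end`."""
    if start <= 0 or years <= 0:
        raise ValueError("start and years must be positive")
    return ((end / start) ** (1.0 / years) - 1.0) * 100.0


# Rounded endpoints from the highlights; treat the output as approximate.
publications_2010 = 88_000
publications_2022 = 240_000

publications_cagr = compound_annual_growth_rate(
    publications_2010, publications_2022, years=2022 - 2010
)
print(f"approximate average annual growth: {publications_cagr:.1f}%")
```

An average rate like this smooths over year-to-year variation, which is why the report can describe the most recent annual increase as modest even though the long-run total nearly tripled.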

1.1 Publications

Overview

The figures below present the global count of English- and Chinese-language AI publications from 2010 to 2022, categorized by type of affiliation and cross-sector collaborations. Additionally, this section details publication data for AI journal articles and conference papers.

Total Number of AI Publications

Figure 1.1.1 displays the global count of AI publications. Between 2010 and 2022, the total number of AI publications nearly tripled, rising from approximately 88,000 in 2010 to more than 240,000 in 2022. The increase over the last year was modest.

Number of AI publications in the world, 2010-22

1 The data on publications presented this year is sourced from CSET. Both the methodology and data sources used by CSET to classify AI publications have changed since their data was last featured in the AI Index (2023). As a result, the numbers reported in this year's section differ slightly from those reported in last year's edition. Moreover, the AI-related publication data is fully available only up to 2022 due to a significant lag in updating publication data. Readers are advised to approach publication figures with appropriate caution.

By Type of Publication

Figure 1.1.2 illustrates the distribution of AI publication types globally over time. In 2022, there were roughly 42,000 conference submissions. Since 2015, AI journal and conference publications have increased at comparable rates. In 2022, there were 2.6 times as many conference publications and 2.4 times as many journal publications as there were in 2015.

Number of AI publications by type, 2010-22

Source: Center for Security and Emerging Technology, 2023 | Chart: 2024 AI Index report

By Field of Study

Figure 1.1.3 examines the total number of AI publications by field of study since 2010. Machine learning publications have seen the most rapid growth over the past decade, increasing nearly sevenfold since 2015. Following machine learning, the most published AI fields in 2022 were computer vision (21,309 publications), pattern recognition (19,841), and process management.

Number of AI publications by field of study (excluding Other AI), 2010-22

Source: Center for Security and Emerging Technology, 2023 | Chart: 2024 AI Index report

Figure 1.1.3

By Sector

This section presents the distribution of AI publications by sector (education, government, industry, nonprofit, and other) globally and then specifically within the United States, China, and the European Union plus the United Kingdom. In 2022, the academic sector contributed the majority of AI publications (81.1%), maintaining its position as the leading global source of AI research over the past decade across all regions (Figure 1.1.4 and Figure 1.1.5). Industry participation is most significant in the United States, followed by the European Union plus the United Kingdom, and China (Figure 1.1.5).

Figure 1.1.4

AI publications (% of total) by sector and geographic area, 2022

Figure 1.1.5

AI Journal Publications

Figure 1.1.6 illustrates the total number of AI journal publications from 2010 to 2022. The number of AI journal publications experienced modest growth from 2010 to 2015 but has grown approximately 2.4 times since 2015. Between 2021 and 2022, AI journal publications saw a 4.5% increase.

Number of AI journal publications, 2010-22

Source: Center for Security and Emerging Technology, 2023 | Chart: 2024 AI Index report

AI Conference Publications

Figure 1.1.7 visualizes the total number of AI conference publications since 2010. The number of AI conference publications has seen a notable rise in the past two years, climbing from 22,727 in 2020 to 31,629 in 2021, and reaching 41,174 in 2022. Over the last year alone, AI conference publications increased by about 30.2% (from 31,629 to 41,174). Since 2010, the number of AI conference publications has more than doubled.

Number of AI conference publications, 2010-22

Source: Center for Security and Emerging Technology, 2023 | Chart: 2024 AI Index report
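The year-over-year growth in conference publications follows directly from the three annual counts quoted above. This minimal Python sketch derives the annual rates from those counts; the dictionary and function names are illustrative assumptions rather than anything defined by the report or by CSET.

```python
# AI conference publication counts as quoted in the text above.
conference_publications = {2020: 22_727, 2021: 31_629, 2022: 41_174}


def year_over_year_growth(series: dict[int, int]) -> dict[int, float]:
    """Percent growth of each year relative to the preceding year in the series."""
    years = sorted(series)
    growth = {}
    for previous, current in zip(years, years[1:]):
        change = series[current] - series[previous]
        growth[current] = change / series[previous] * 100.0
    return growth


yoy = year_over_year_growth(conference_publications)
for year, rate in yoy.items():
    print(f"{year}: {rate:.1f}% more than {year - 1}")
```

The first year in the series has no entry in the result because there is no earlier count to compare it against.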

1.2 Patents

This section examines trends over time in global AI patents, which can reveal important insights into the evolution of innovation, research, and development within AI. Additionally, analyzing AI patents can reveal how these advancements are distributed globally. Similar to the publications data, there is a noticeable delay in AI patent data availability, with 2022 being the most recent year for which data is accessible. The data in this section comes from CSET.

AI Patents

Overview

Figure 1.2.1 examines the global growth in granted AI patents from 2010 to 2022. Over the last decade, there has been a significant rise in the number of AI patents, with a particularly sharp increase in recent years: growth between 2010 and 2014 was modest by comparison with the increase from 2021 to 2022 alone.

Number of AI patents granted, 2010-22

Source: Center for Security and Emerging Technology, 2023 | Chart: 2024 AI Index report

By Filing Status and Region
按申请状态和地区

The following section disaggregates

patents by their filing status (whether they were granted or not granted), as well as the region of their publication.
以下部分按其申请状态（已授予或未授予）以及发布地区对

专利进行了细分。

Figure 1.2.2 compares global Al patents by application status. In 2022, the number of ungranted Al patents

was more than double the amount granted

. Over time, the landscape of Al patent approvals has shifted markedly. Until 2015, a larger proportion of filed Al patents were granted. However, since then, the majority of Al patent filings have not been granted, with the gap widening significantly. For instance, in

of all filed Al patents were not granted. By 2022, this figure had risen to

.
图 1.2.2 比较了全球 Al 专利的申请状态。2022 年，未获批准的 Al 专利

数量超过了获批准的

数量的两倍以上。随着时间的推移，Al 专利批准的情况发生了显著变化。直到 2015 年，提交的 Al 专利中获批准的比例更高。然而，自那时起，大多数 Al 专利申请未获批准，差距显著扩大。例如，在

提交的所有 Al 专利中，未获批准的比例。到 2022 年，这一数字已上升至

。

Al patents by application status,
申请状态的 Al 专利，

The gap between granted and not granted AI patents is evident across all major patent-originating geographic areas, including China, the European Union and United Kingdom, and the United States (Figure 1.2.3). In recent years, all three geographic areas have experienced an increase in both the total number of AI patent filings and the number of patents granted.

AI patents by application status by geographic area, 2010-22

Source: Center for Security and Emerging Technology, 2023 | Chart: 2024 AI Index report

Figure 1.2.4 showcases the regional breakdown of granted AI patents. As of 2022, the bulk of the world's granted AI patents […] originated from East Asia and the Pacific, with North America being the next largest contributor at […]. Up until 2011, North America led in the number of global AI patents. Since then, however, there has been a significant shift toward an increasing proportion of AI patents originating from East Asia and the Pacific.

Granted AI patents (% of world total) by region, 2010-22

Disaggregated by geographic area, the majority of the world's granted AI patents are from China (61.1%) and the United States (20.9%) (Figure 1.2.5). The share of AI patents originating from the United States has declined from […] in 2010.

Granted AI patents (% of world total) by geographic area, 2010-22

Figure 1.2.6 and Figure 1.2.7 document which countries lead in AI patents per capita. In 2022, the country with the most granted AI patents per 100,000 inhabitants was South Korea (10.3), followed by Luxembourg (8.8) and the United States (4.2) (Figure 1.2.6). Figure 1.2.7 highlights the change in granted AI patents per capita from 2012 to 2022. Singapore, South Korea, and China experienced the greatest increase in AI patenting per capita during that time period.
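The per capita metrics behind Figures 1.2.6 and 1.2.7 reduce to two small formulas. The sketch below is only an illustration of that arithmetic: the function names and the input numbers are hypothetical and are not taken from the CSET data.

```python
def patents_per_100k(granted_patents: int, population: int) -> float:
    """Granted patents per 100,000 inhabitants."""
    if population <= 0:
        raise ValueError("population must be positive")
    # Multiply before dividing to keep the arithmetic exact for round inputs.
    return granted_patents * 100_000 / population


def percent_change(old: float, new: float) -> float:
    """Percentage change from `old` to `new`."""
    if old == 0:
        raise ValueError("old value must be non-zero")
    return (new - old) / old * 100.0


# Hypothetical country: 50 million inhabitants in both years,
# 500 granted patents in 2012 and 5,000 in 2022.
rate_2012 = patents_per_100k(granted_patents=500, population=50_000_000)
rate_2022 = patents_per_100k(granted_patents=5_000, population=50_000_000)
print(f"2012: {rate_2012:.1f} per 100k, 2022: {rate_2022:.1f} per 100k")
print(f"change: {percent_change(rate_2012, rate_2022):.0f}%")
```

Normalizing by population is what lets small countries such as Luxembourg rank near the top even though their absolute patent counts are modest.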

Granted AI patents per 100,000 inhabitants by country, 2022

Source: Center for Security and Emerging Technology, 2023 | Chart: 2024 AI Index report

Percentage change of granted AI patents per 100,000 inhabitants by country, 2012 vs. 2022

This section explores the frontier of AI research. While many new AI models are introduced annually, only a small sample represents the most advanced research. Admittedly, what constitutes advanced or frontier research is somewhat subjective. Frontier research could reflect a model posting a new state-of-the-art result on a benchmark, introducing a meaningful new architecture, or exercising some impressive new capabilities.

The AI Index studies trends in two types of frontier AI models: "notable models" and foundation models. Epoch, an AI Index data provider, uses the term "notable machine learning models" to designate noteworthy models handpicked as being particularly influential within the AI/machine learning ecosystem. In contrast, foundation models are exceptionally large AI models trained on massive datasets, capable of performing a multitude of downstream tasks. Examples of foundation models include GPT-4, Claude 3, and Gemini. While many foundation models may qualify as notable models, not all notable models are foundation models.

Within this section, the AI Index explores trends in notable models and foundation models from various perspectives, including originating organization, country of origin, parameter count, and compute usage. The analysis concludes with an examination of machine learning training costs.

1.3 Frontier AI Research

General Machine Learning Models

Overview

Epoch AI is a group of researchers dedicated to studying and predicting the evolution of advanced AI. They maintain a database of AI and machine learning models released since the 1950s, selecting entries based on criteria such as state-of-the-art advancements, historical significance, or high citation rates. Analyzing these models provides a comprehensive overview of the machine learning landscape's evolution, both in recent years and over the past few decades. Some models may be missing from the dataset; however, the dataset can reveal trends in relative terms.

3 "AI system" refers to a computer program or product based on AI, such as ChatGPT. "AI model" refers to a collection of parameters whose values are learned during training, such as GPT-4.
4 New and historic models are continually added to the Epoch database, so the total year-by-year counts of models included in this year's AI Index might not exactly match those published in last year's report.

Sector Analysis

Until 2014, academia led in the release of machine learning models. Since then, industry has taken the lead. In 2023, there were 51 notable machine learning models produced by industry compared to just 15 from academia (Figure 1.3.1). Significantly, 21 notable models resulted from industry/academic collaborations in 2023, a new high. Creating cutting-edge AI models now demands a substantial amount of data, computing power, and financial resources that are not available in academia. This shift toward increased industrial dominance in leading AI models was first highlighted in last year's AI Index report. Although this year the gap has slightly narrowed, the trend largely persists.

Number of notable machine learning models by sector, 2003-23

National Affiliation

To illustrate the evolving geopolitical landscape of AI, the AI Index research team analyzed the country of origin of notable models. Figure 1.3.2 displays the total number of notable machine learning models attributed to the location of researchers' affiliated institutions.

In 2023, the United States led with 61 notable machine learning models, followed by China with 15, and France with 8. For the first time since 2019, the European Union and the United Kingdom together have surpassed China in the number of notable AI models produced (Figure 1.3.3). Since 2003, the United States has produced more models than other major geographic regions such as the United Kingdom, China, and Canada (Figure 1.3.4).

Figure 1.3.2

Number of notable machine learning models by select geographic area, 2003-23

Number of notable machine learning models by geographic area, 2003-23 (sum)

Source: Epoch, 2023 | Chart: 2024 AI Index report

Figure 1.3.4

Parameter Trends

Parameters in machine learning models are numerical values learned during training that determine how a model interprets input data and makes predictions. Models trained on more data will usually have more parameters than those trained on less data. Likewise, models with more parameters typically outperform those with fewer parameters.

Figure 1.3.5 demonstrates the parameter count of machine learning models in the Epoch dataset, categorized by the sector from which the models originate. Parameter counts have risen sharply since the early 2010s, reflecting the growing complexity of the tasks AI models are designed for, the greater availability of data, improvements in hardware, and the proven efficacy of larger models. High-parameter models are particularly notable in the industry sector, underscoring the capacity of companies like OpenAI, Anthropic, and Google to bear the computational costs of training on vast volumes of data.

Number of parameters of notable machine learning models by sector, 2003-23

Source: Epoch, 2023 | Chart: 2024 AI Index report

Compute Trends

The term "compute" in AI models denotes the computational resources required to train and operate a machine learning model. Generally, the complexity of the model and the size of the training dataset directly influence the amount of compute needed. The more complex a model is, and the larger the underlying training data, the greater the amount of compute required for training.

Figure 1.3.6 visualizes the training compute required for notable machine learning models in the last 20 years. Recently, the compute usage of notable AI models has increased exponentially. This trend has been especially pronounced in the last five years. This rapid rise in compute demand has critical implications. For instance, models requiring more computation often have larger environmental footprints, and companies typically have more access to computational resources than academic institutions.

Training compute of notable machine learning models by sector, 2003-23

Source: Epoch, 2023 | Chart: 2024 AI Index report

6 FLOP stands for "floating-point operation." A floating-point operation is a single arithmetic operation involving floating-point numbers, such as addition, subtraction, multiplication, or division. The number of FLOPs a processor or computer can perform per second is an indicator of its computational power. The higher the FLOP rate, the more powerful the computer is. A model whose training involves a higher FLOP count requires more computational resources during training.

Figure 1.3.7 highlights the training compute of notable machine learning models since 2012. For example, AlexNet, introduced in one of the papers that popularized the now standard practice of using GPUs to improve AI models, required an estimated 470 petaFLOPs for training. The original Transformer, released in 2017, required around 7,400 petaFLOPs. Google's Gemini Ultra, one of the current state-of-the-art foundation models, required 50 billion petaFLOPs.
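To put the three training compute figures quoted above on a common scale, the following sketch computes growth factors relative to AlexNet. It assumes the quoted values are total training operations in units of petaFLOP (1 petaFLOP = 10^15 floating-point operations); the dictionary and function names are illustrative only.

```python
import math

# Training compute estimates quoted in the text, read as total petaFLOP.
TRAINING_COMPUTE_PFLOP = {
    "AlexNet (2012)": 470,
    "Transformer (2017)": 7_400,
    "Gemini Ultra (2023)": 50_000_000_000,
}

FLOP_PER_PETAFLOP = 1e15  # 1 petaFLOP = 10**15 floating-point operations


def growth_factor(earlier: float, later: float) -> float:
    """How many times larger `later` is than `earlier`."""
    return later / earlier


def orders_of_magnitude(earlier: float, later: float) -> float:
    """Base-10 logarithm of the growth factor."""
    return math.log10(later / earlier)


baseline = TRAINING_COMPUTE_PFLOP["AlexNet (2012)"]
for name, pflop in TRAINING_COMPUTE_PFLOP.items():
    print(
        f"{name}: {pflop * FLOP_PER_PETAFLOP:.2e} FLOP, "
        f"{growth_factor(baseline, pflop):.3g}x AlexNet, "
        f"{orders_of_magnitude(baseline, pflop):.1f} orders of magnitude"
    )
```

On these figures, Gemini Ultra's training compute is roughly eight orders of magnitude above AlexNet's, which is why charts of this data are usually drawn on a logarithmic axis.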

Training compute of notable machine learning models by domain, 2012-23

Source: Epoch, 2023 | Chart: 2024 AI Index report

Highlight:

Will Models Run Out of Data?

As illustrated above, a significant proportion of recent algorithmic progress, including progress behind powerful LLMs, has been achieved by training models on increasingly larger amounts of data. As noted recently by Anthropic cofounder and AI Index Steering Committee member Jack Clark, foundation models have been trained on meaningful percentages of all the data that has ever existed on the internet.

The growing data dependency of AI models has led to concerns that future generations of computer scientists will run out of data to further scale and improve their systems. Research from Epoch suggests that these concerns are somewhat warranted. Epoch researchers have generated historical and compute-based projections for when AI researchers might expect to run out of data. The historical projections are based on observed growth rates in the sizes of data used to train foundation models. The compute projections adjust the historical growth rate based on projections of compute availability.

For instance, the researchers estimate that computer scientists could deplete the stock of high-quality language data by 2024, exhaust low-quality language data within two decades, and use up image data by the late 2030s to mid-2040s (Figure 1.3.8).
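The idea behind a historical projection can be sketched as a simple extrapolation: grow the training set size at a constant yearly rate until it reaches the available stock. This is a toy model under assumed inputs, not Epoch's actual methodology; the function name, parameters, and numbers are hypothetical.

```python
def projected_exhaustion_year(
    start_year: int,
    dataset_size: float,
    stock_size: float,
    annual_growth: float,
) -> int:
    """First year in which the training set would need the whole stock.

    Assumes the dataset grows by a constant factor (1 + annual_growth)
    every year, with sizes in any consistent unit (for example, tokens).
    """
    if dataset_size <= 0 or stock_size <= 0:
        raise ValueError("sizes must be positive")
    if annual_growth <= 0:
        raise ValueError("annual_growth must be positive")
    year, size = start_year, dataset_size
    while size < stock_size:
        size *= 1.0 + annual_growth
        year += 1
    return year


# Hypothetical example: the training set doubles every year and the stock
# is eight times the current training set, so it lasts three more years.
print(projected_exhaustion_year(2023, dataset_size=1.0, stock_size=8.0, annual_growth=1.0))
```

A compute-based variant would lower `annual_growth` over time to reflect limits on available compute, which pushes the projected date later.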

Theoretically, the challenge of limited data availability can be addressed by using synthetic data, which is data generated by AI models themselves. For example, it is possible to use text produced by one LLM to train another LLM. The use of synthetic data for training AI systems is particularly attractive, not only as a solution for potential data depletion but also because generative AI systems could, in principle, generate data in instances where naturally occurring data is sparse, for example, data for rare diseases or underrepresented populations. Until recently, the feasibility and effectiveness of using synthetic data for training generative AI systems were not well understood. However, research this year has suggested that there are limitations associated with training models on synthetic data.

Projections of ML data exhaustion by stock type: median and […] dates
Source: Epoch, 2023 | Table: 2024 AI Index report
Table columns: stock type (low-quality language stock, high-quality language stock, image stock), historical projection, compute projection.

Figure 1.3.8

For instance, a team of British and Canadian researchers discovered that models predominantly trained on synthetic data experience model collapse, a phenomenon where, over time, they lose the ability to remember true underlying data distributions and start producing a narrow range of outputs. Figure 1.3.9 demonstrates the process of model collapse in a variational autoencoder (VAE) model, a widely used generative AI architecture. With each subsequent generation trained on additional synthetic data, the model produces an increasingly limited set of outputs. As illustrated in Figure 1.3.10, in statistical terms, as the number of synthetic generations increases, the tails of the distributions vanish, and the generation density shifts toward the mean. This pattern means that over time, the generations of models trained predominantly on synthetic data become less varied and are not as widely distributed.

The authors demonstrate that this phenomenon occurs across various model types, including Gaussian Mixture Models and LLMs. This research underscores the continued importance of human-generated data for training capable LLMs that can produce a diverse array of content.
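The statistical intuition behind model collapse can be reproduced with a toy simulation: fit a single Gaussian to a small sample, draw the next generation's training data only from that fitted Gaussian, and repeat. This is an illustrative sketch and not the experiment from Shumailov et al.; the function name, sample size, generation count, and seed are all arbitrary assumptions.

```python
import random
import statistics


def simulate_collapse(
    generations: int = 200,
    samples_per_generation: int = 10,
    seed: int = 0,
) -> list[float]:
    """Return the fitted standard deviation after each generation.

    Generation 0 is the "real" distribution N(0, 1). Every later
    generation is fitted only to samples drawn from its predecessor.
    """
    rng = random.Random(seed)
    mu, sigma = 0.0, 1.0
    fitted_stds = [sigma]
    for _ in range(generations):
        # Draw purely synthetic data from the current model...
        data = [rng.gauss(mu, sigma) for _ in range(samples_per_generation)]
        # ...and fit the next model to it (maximum-likelihood estimates).
        mu = statistics.fmean(data)
        sigma = statistics.pstdev(data)
        fitted_stds.append(sigma)
    return fitted_stds


stds = simulate_collapse()
print(f"std at generation 0:   {stds[0]:.3f}")
print(f"std at generation {len(stds) - 1}: {stds[-1]:.3g}")
```

Because each refit slightly underestimates the spread on average and the estimation errors compound, the fitted standard deviation drifts toward zero over many generations, which is the "vanishing tails" effect described above in miniature.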

A demonstration of model collapse in a VAE
Source: Shumailov et al., 2023

(a) Original model

(b) Generation 5

(d) Generation 20

In a similar study published in 2023 on the use of synthetic data in generative imaging models, researchers found that generative image models trained solely on synthetic data cycles, or with insufficient real human data, experience a significant drop in output quality. The authors label this phenomenon Model Autophagy Disorder (MAD), in reference to mad cow disease.

The study examines two types of training processes: fully synthetic, where models are trained exclusively on synthetic data, and synthetic augmentation, where models are trained on a mix of synthetic and real data. In both scenarios, as the number of training generations increases, the quality of the generated images declines. Figure 1.3.11 highlights the degraded image generations of models that are augmented with synthetic data; for example, the faces generated in steps 7 and 9 increasingly display strange-looking hash marks. From a statistical perspective, images generated with both fully synthetic and synthetic augmentation loops have higher FID scores (indicating less similarity to real images), lower precision scores (signifying reduced realism or quality), and lower recall scores (suggesting decreased diversity) (Figure 1.3.12). While synthetic augmentation loops, which incorporate some real data, show less degradation than fully synthetic loops, both methods exhibit diminishing returns with further training.

An example of MAD in image-generation models
Source: Alemohammad et al., 2023

Figure 1.3.11

Assessing FFHQ syntheses: FID, precision, and recall in synthetic and mixed-data training loops
Source: Alemohammad et al., 2023 | Chart: 2024 AI Index report

Figure 1.3.12

Foundation Models

Foundation models represent a rapidly evolving and popular category of AI models. Trained on vast datasets, they are versatile and suitable for numerous downstream applications. Foundation models such as GPT-4, Claude 3, and Llama 2 showcase remarkable abilities and are increasingly being deployed in real-world scenarios.

Introduced in 2023, Ecosystem Graphs is a new community resource from Stanford that tracks the foundation model ecosystem, including datasets, models, and applications. This section uses data from Ecosystem Graphs to study trends in foundation models over time.

Model Release

Foundation models can be accessed in different ways. No access models, like Google's PaLM-E, are only accessible to their developers. Limited access models, like OpenAI's GPT-4, offer limited access to the models, often through a public API. Open models, like Meta's Llama 2, fully release model weights, which means the models can be modified and freely used.

Figure 1.3.13 visualizes the total number of foundation models by access type since 2019. In recent years, the number of foundation models has risen sharply, more than doubling since 2022 and growing by a factor of nearly 38 since 2019. Of the 149 foundation models released in 2023, 98 were open, 23 limited, and 28 no access.

Foundation models by access type, 2019-23

Source: Bommasani et al., 2023 | Chart: 2024 AI Index report

8 The Ecosystem Graphs make efforts to survey the global AI ecosystem, but it is possible that they underreport models from certain nations like South Korea and China.

In 2023, the majority of foundation models were released as open access (65.8%), with 18.8% having no access and 15.4% limited access (Figure 1.3.14). Since 2021, there has been a significant increase in the proportion of models released with open access.
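The access-type shares follow directly from the 2023 counts reported with Figure 1.3.13 (98 open, 23 limited, and 28 no access, 149 in total). A minimal sketch of that arithmetic, with illustrative variable and function names:

```python
# 2023 foundation model counts by access type, as reported in the text.
counts_2023 = {"open": 98, "limited": 23, "no access": 28}
total = sum(counts_2023.values())


def share_percent(count: int, total: int) -> float:
    """Share of `total` as a percentage, rounded to one decimal place."""
    return round(100 * count / total, 1)


shares = {access: share_percent(n, total) for access, n in counts_2023.items()}
for access, pct in shares.items():
    print(f"{access}: {pct}% of {total} models")
```

The open share comes out at 65.8%, matching the figure quoted in the text.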

Foundation models (% of total) by access type, 2019-23

Organizational Affiliation

Figure 1.3.15 plots the sector from which foundation models have originated since 2019. In 2023, the majority of foundation models ([…]) originated from industry. Only […] of foundation models in 2023 originated from academia. Since 2019, an ever larger number of foundation models have come from industry.

Number of foundation models by sector, 2019-23

Figure 1.3.15

Figure 1.3.16 highlights the source of various foundation models that were released in 2023. Google introduced the most models (18), followed by Meta (11), and Microsoft (9). The academic institution that released the most foundation models in 2023 was UC Berkeley (3).

Number of foundation models by organization, 2023

Source: Bommasani et al., 2023 | Chart: 2024 AI Index report

Since 2019, Google has led in releasing the most foundation models, with a total of 40, followed by OpenAI with 20 (Figure 1.3.17). Tsinghua University stands out as the top non-Western institution, with seven foundation model releases, while Stanford University is the leading American academic institution, with five releases.

Number of foundation models by organization, 2019-23 (sum)

Source: Bommasani et al., 2023 | Chart: 2024 AI Index report

National Affiliation

Given that foundation models are fairly representative of frontier AI research, from a geopolitical perspective, it is important to understand their national affiliations. Figures 1.3.18, 1.3.19, and 1.3.20 visualize the national affiliations of various foundation models. As with the notable model analysis presented earlier in the chapter, a model is deemed affiliated with a country if a researcher contributing to that model is affiliated with an institution headquartered in that country.

In 2023, most of the world's foundation models originated from the United States (109), followed by China (20), and the United Kingdom (Figure 1.3.18). Since 2019, the United States has consistently led in originating the majority of foundation models (Figure 1.3.19).

Number of foundation models by geographic area, 2023

Source: Bommasani et al., 2023 | Chart: 2024 AI Index report

Figure 1.3.18

Number of foundation models by select geographic area, 2019-23

Figure 1.3.20 depicts the cumulative count of foundation models released and attributed to respective countries since 2019. The country with the greatest number of foundation models released since 2019 is the United States (182), followed by China (30), and the United Kingdom (21).

Number of foundation models by geographic area, 2019-23 (sum)

Source: Bommasani et al., 2023 | Chart: 2024 AI Index report

Figure 1.3.20

Training Cost

A prominent topic in discussions about foundation models is their speculated costs. While AI companies seldom reveal the expenses involved in training their models, it is widely believed that these costs run into millions of dollars and are rising. For instance, OpenAI's CEO, Sam Altman, mentioned that the training cost for GPT-4 was over […] million. This escalation in training expenses has effectively excluded universities, traditionally centers of AI research, from developing their own leading-edge foundation models. In response, policy initiatives, such as President Biden's Executive Order on AI, have sought to level the playing field between industry and academia by creating a National AI Research Resource, which would grant nonindustry actors the compute and data needed to do higher-level AI research.

Understanding the cost of training AI models is important, yet detailed information on these costs remains scarce. The AI Index was among the first to offer estimates on the training costs of foundation models in last year's publication. This year, the AI Index has collaborated with Epoch AI, an AI research institute, to substantially enhance and solidify the robustness of its AI training cost estimates.

To estimate the cost of cutting-edge models, the Epoch team analyzed training duration, as well as the type, quantity, and utilization rate of the training hardware, using information from publications, press releases, or technical reports related to the models.
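The inputs listed above (training duration, hardware quantity, utilization rate) combine naturally with a rental price into a cost estimate. The sketch below shows one plausible way to do that arithmetic; it is an assumption for illustration, not Epoch's documented method, and every name and number in it is hypothetical.

```python
def training_cost_usd(
    num_chips: int,
    training_hours: float,
    price_per_chip_hour: float,
) -> float:
    """Rental-style cost: chips x hours x hourly price per chip."""
    return num_chips * training_hours * price_per_chip_hour


def chip_hours_from_compute(
    total_flop: float,
    peak_flop_per_second: float,
    utilization: float,
) -> float:
    """Chip-hours needed to deliver `total_flop` at a given utilization.

    `utilization` is the fraction of peak throughput actually achieved.
    """
    if not 0.0 < utilization <= 1.0:
        raise ValueError("utilization must be in (0, 1]")
    effective_flop_per_second = peak_flop_per_second * utilization
    return total_flop / effective_flop_per_second / 3600.0


# Hypothetical run: 1,000 chips for 100 hours at $2.00 per chip-hour.
print(f"${training_cost_usd(1_000, 100.0, 2.0):,.0f}")

# Hypothetical chip: 1e15 FLOP/s peak at 50% utilization, 3.6e18 FLOP of work.
print(f"{chip_hours_from_compute(3.6e18, 1e15, 0.5):.1f} chip-hours")
```

When the training duration is not published, the second function gives a way to back out chip-hours from a compute estimate, which can then be priced with the first.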

Figure 1.3.21 visualizes the estimated training cost associated with select AI models, based on cloud compute rental prices. AI Index estimates validate suspicions that in recent years model training costs have significantly increased. For example, in 2017, the original Transformer model, which introduced the architecture that underpins virtually every modern LLM, cost around […] to train. RoBERTa Large, released in 2019, which achieved state-of-the-art results on many canonical comprehension benchmarks like SQuAD and GLUE, cost around […] to train. Fast-forward to 2023, and training costs for OpenAI's GPT-4 and Google's Gemini Ultra are estimated to be around […] million and […] million, respectively.

9 Ben Cottier and Robi Rahman led research at Epoch AI into model training cost.
10 A detailed description of the estimation methodology is provided in the Appendix.
11 The cost figures reported in this section are inflation-adjusted.

Estimated training cost of select AI models, 2017-23

Source: Epoch, 2023 | Chart: 2024 AI Index report

Figure 1.3.21

Figure 1.3.22 visualizes the training cost of all AI models for which the AI Index has estimates. As the figure shows, model training costs have sharply increased over time.

Estimated training cost of select AI models, 2016-23

Source: Epoch, 2023 | Chart: 2024 AI Index report

As established in previous AI Index reports, there is a direct correlation between the training costs of AI models and their computational requirements. As illustrated in Figure 1.3.23, models with greater computational training needs cost substantially more to train.

Estimated training cost and compute of select AI models

Source: Epoch, 2023 | Chart: 2024 AI Index report

AI conferences serve as essential platforms for researchers to present their findings and network with peers and collaborators. Over the past two decades, these conferences have expanded in scale, quantity, and prestige. This section explores trends in attendance at major AI conferences.

1.4 AI Conferences

Conference Attendance

Figure 1.4.1 graphs attendance at a selection of AI conferences since 2010. Following a decline in attendance, likely due to the shift back to exclusively in-person formats, the AI Index reports an increase in conference attendance from 2022 to 2023. Specifically, there was a […] rise in total attendance over the last year. Since 2015, the annual number of attendees has risen by around 50,000, reflecting not just a growing interest in AI research but also the emergence of new AI conferences.

Attendance at select AI conferences, 2010-23

Figure 1.4.1

Neural Information Processing Systems (NeurIPS) remains one of the most attended AI conferences, attracting approximately 16,380 participants in 2023 (Figure 1.4.2 and Figure 1.4.3). Among the major AI conferences, NeurIPS, ICML, ICCV, and AAAI experienced year-over-year increases in attendance. However, in the past year, CVPR, ICRA, ICLR, and IROS observed slight declines in their attendance figures.

Attendance at large conferences, 2010-23

Attendance at small conferences, 2010-23

Figure 1.4.3

1.5 Open-Source AI Software

GitHub is a web-based platform that enables individuals and teams to host, review, and collaborate on code repositories. Widely used by software developers, GitHub facilitates code management, project collaboration, and open-source software support. This section draws on data from GitHub to provide insights into broader trends in open-source AI software development that are not reflected in academic publication data.

Projects

A GitHub project comprises a collection of files, including source code, documentation, configuration files, and images, that together make up a software project. Figure 1.5.1 looks at the total number of GitHub AI projects over time. Since 2011, the number of AI-related GitHub projects has seen a consistent increase, growing from 845 in 2011 to approximately 1.8 million in 2023. Notably, there was a sharp rise in the total number of GitHub AI projects in the last year alone.

Figure 1.5.1

recently published research paper, a shift from the previously detailed methodology in an earlier paper. This edition of the AI Index is the first to adopt this updated approach. Moreover, the previous edition of the AI Index utilized country-level mapping of GitHub AI projects conducted by the OECD, which depended on self-reported data, a method whose coverage has declined over time. This year, the AI Index has adopted geographic mapping from GitHub, leveraging server-side data for broader coverage. Consequently, the data presented here may not align perfectly with data in earlier versions of the report.

Figure 1.5.2 reports GitHub AI projects by geographic area since 2011. As of 2023, the United States accounted for a significant share of GitHub AI projects. India was the second-largest contributor, followed closely by the European Union and the United Kingdom. Notably, the proportion of AI projects from developers located in the United States on GitHub has been on a steady decline since 2016.

GitHub AI projects (% of total) by geographic area, 2011-23

Stars

GitHub users can show their interest in a repository by "starring" it, a feature similar to liking a post on social media, which signifies support for an open-source project. Among the most starred repositories are libraries such as TensorFlow, OpenCV, Keras, and PyTorch, which enjoy widespread popularity among software developers in the AI coding community. For example, TensorFlow is a popular library for building and deploying machine learning models. OpenCV is a platform that offers a variety of tools for computer vision, such as object detection and feature extraction.

The total number of stars for AI-related projects on GitHub saw a significant increase in the last year, more than tripling from 4.0 million in 2022 to 12.2 million in 2023 (Figure 1.5.3). This sharp increase in GitHub stars, along with the previously reported rise in projects, underscores the accelerating growth of open-source AI software development.
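The "more than tripling" claim can be checked with a few lines of arithmetic. The sketch below uses only the two star counts quoted in the text (4.0 million and 12.2 million); the variable names are illustrative and not part of the report.

```python
# Star counts (in millions) as quoted in the text for 2022 and 2023.
stars_2022 = 4.0
stars_2023 = 12.2

# Ratio of the 2023 total to the 2022 total.
growth_factor = stars_2023 / stars_2022

# Relative increase, expressed as a percentage of the 2022 total.
percent_increase = 100.0 * (stars_2023 - stars_2022) / stars_2022

print(f"Growth factor: {growth_factor:.2f}x")
print(f"Percent increase: {percent_increase:.0f}%")
```

A growth factor just above 3 is consistent with the report's wording.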

Number of GitHub stars in AI projects, 2011-23

Source: GitHub, 2023 | Chart: 2024 AI Index report

Figure 1.5.3

In 2023, the United States led in receiving the highest number of GitHub stars, totaling 10.5 million (Figure 1.5.4). All major geographic regions sampled, including the European Union and United Kingdom, China, and India, saw a year-over-year increase in the total number of GitHub stars awarded to projects located in their countries.

Number of GitHub stars by geographic area, 2011-23

Source: GitHub, 2023 | Chart: 2024 AI Index report

Figure 1.5.4

Chapter 2 Preview

Overview
Chapter Highlights
2.1 Overview of AI in 2023
  Timeline: Significant Model Releases
  State of AI Performance
  AI Index Benchmarks
2.2 Language
  Understanding
  HELM: Holistic Evaluation of Language Models
  MMLU: Massive Multitask Language Understanding
  Generation
  Chatbot Arena Leaderboard
  Factuality and Truthfulness
  TruthfulQA
  HaluEval
2.3 Coding
  Generation
  HumanEval
  SWE-bench
2.4 Image Computer Vision and Image Generation
  Generation
  HEIM: Holistic Evaluation of Text-to-Image Models
  Highlighted Research: MVDream
  Instruction-Following
  VisIT-Bench
  Editing
  EditVal
  Highlighted Research: ControlNet
  Highlighted Research: Instruct-NeRF2NeRF
  Segmentation
  Highlighted Research: Segment Anything
  3D Reconstruction From Images
  Highlighted Research: Skoltech3D
  Highlighted Research: RealFusion
2.5 Video Computer Vision and Video Generation
  Generation
  UCF101
  Highlighted Research: Align Your Latents
  Highlighted Research: Emu Video
2.6 Reasoning
  General Reasoning
  MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI
  GPQA: A Graduate-Level Google-Proof Q&A Benchmark
  Highlighted Research: Comparing Humans, GPT-4, and GPT-4V on Abstraction and Reasoning Tasks
  Mathematical Reasoning
  GSM8K
  MATH
  PlanBench
  Visual Reasoning
  Visual Commonsense Reasoning (VCR)
  Moral Reasoning
  MoCa
  Causal Reasoning
  BigToM
  Highlighted Research: Tübingen Cause-Effect Pairs
2.7 Audio
  Generation
  Highlighted Research: UniAudio
  Highlighted Research: MusicGEN and MusicLM
2.8 Agents
  General Agents
  AgentBench
  Highlighted Research: Voyager
  Task-Specific Agents
  MLAgentBench
2.9 Robotics
  Highlighted Research: PaLM-E
  Highlighted Research: RT-2
2.10 Reinforcement Learning
  Reinforcement Learning from Human Feedback
  Highlighted Research: RLAIF
  Highlighted Research: Direct Preference Optimization
2.11 Properties of LLMs
  Highlighted Research: Challenging the Notion of Emergent Behavior
  Highlighted Research: Changes in LLM Performance Over Time
  Highlighted Research: LLMs Are Poor Self-Correctors
  Closed vs. Open Model Performance
2.12 Techniques for LLM Improvement
  Prompting
  Highlighted Research: Graph of Thoughts Prompting
  Highlighted Research: Optimization by PROmpting (OPRO)
  Fine-Tuning
  Highlighted Research: QLoRA
  Attention
  Highlighted Research: Flash-Decoding
2.13 Environmental Impact of AI Systems
  General Environmental Impact
  Training
  Inference

The technical performance section of this year's AI Index offers a comprehensive overview of AI advancements in 2023. It starts with a high-level overview of AI technical performance, tracing its broad evolution over time. The chapter then examines the current state of a wide range of AI capabilities, including language processing, coding, computer vision (image and video analysis), reasoning, audio processing, autonomous agents, robotics, and reinforcement learning. It also shines a spotlight on notable AI research breakthroughs from the past year, exploring methods for improving LLMs through prompting, optimization, and fine-tuning, and wraps up with an exploration of AI systems' environmental footprint.

Chapter Highlights

AI beats humans on some tasks, but not on all. AI has surpassed human performance on several benchmarks, including some in image classification, visual reasoning, and English understanding. Yet it trails behind on more complex tasks like competition-level mathematics, visual commonsense reasoning, and planning.

Here comes multimodal AI. Traditionally, AI systems have been limited in scope, with language models excelling in text comprehension but faltering in image processing, and vice versa. However, recent advancements have led to the development of strong multimodal models, such as Google's Gemini and OpenAI's GPT-4. These models demonstrate flexibility and are capable of handling images and text and, in some instances, can even process audio.

Harder benchmarks emerge. AI models have reached performance saturation on established benchmarks such as ImageNet, SQuAD, and SuperGLUE, prompting researchers to develop more challenging ones. In 2023, several challenging new benchmarks emerged, including SWE-bench for coding, HEIM for image generation, MMMU for general reasoning, MoCa for moral reasoning, AgentBench for agent-based behavior, and HaluEval for hallucinations.

Better AI means better data which means ... even better AI. New AI models such as SegmentAnything and Skoltech are being used to generate specialized data for tasks like image segmentation and 3D reconstruction. Data is vital for AI technical improvements. The use of AI to create more data enhances current capabilities and paves the way for future algorithmic improvements, especially on harder tasks.

Human evaluation is in. With generative models producing high-quality text, images, and more, benchmarking has slowly started shifting toward incorporating human evaluations like the Chatbot Arena Leaderboard rather than computerized rankings like ImageNet or SQuAD. Public feeling about AI is becoming an increasingly important consideration in tracking AI progress.

Thanks to LLMs, robots have become more flexible. The fusion of language modeling with robotics has given rise to more flexible robotic systems like PaLM-E and RT-2. Beyond their improved robotic capabilities, these models can ask questions, which marks a significant step toward robots that can interact more effectively with the real world.

More technical research in agentic AI. Creating AI agents, systems capable of autonomous operation in specific environments, has long challenged computer scientists. However, emerging research suggests that the performance of autonomous agents is improving. Current agents can now master complex games like Minecraft and effectively tackle real-world tasks, such as online shopping and research assistance.

Closed LLMs significantly outperform open ones. On 10 select AI benchmarks, closed models outperformed open ones. Differences in the performance of closed and open models carry important implications for AI policy debates.

2.1 Overview of AI in 2023

The technical performance chapter begins with a high-level overview of significant model releases in 2023 and reviews the current state of AI technical performance.

Timeline: Significant Model Releases

As chosen by the AI Index Steering Committee, here are some of the most notable model releases of 2023.

Date	Model	Type	Creator(s)	Significance	Image
Mar. 14, 2023	Claude	Large language model	Anthropic	Claude is the first publicly released LLM from Anthropic, one of OpenAI's main rivals. Claude is designed to be as helpful, honest, and harmless as possible.	Figure 2.1.1 (Source: Anthropic, 2023)
Mar. 14, 2023	GPT-4	Large language model	OpenAI	GPT-4, improving over GPT-3, is among the most powerful and capable LLMs to date and surpasses human performance on numerous benchmarks.	
Mar. 23, 2023	Stable Diffusion v2	Text-to-image model	Stability AI	Stable Diffusion v2 is an upgrade of Stability AI's existing text-to-image model and produces higher-resolution, superior-quality images.	Source: Stability AI, 2023
Apr. 5, 2023		Image segmentation			

State of AI Performance

As of 2023, AI has achieved levels of performance that surpass human capabilities across a range of tasks. Figure 2.1.16 illustrates the progress of AI systems relative to human baselines for nine AI benchmarks corresponding to nine tasks (e.g., image classification or basic-level reading comprehension). The AI Index team selected one benchmark to represent each task.

Over the years, AI has surpassed human baselines on a handful of benchmarks, such as image classification in 2015, basic reading comprehension in 2017, visual reasoning in 2020, and natural language inference in 2021. As of 2023, there are still some task categories where AI fails to exceed human ability. These tend to be more complex cognitive tasks, such as visual commonsense reasoning and advanced-level mathematical problem-solving (competition-level math problems).

Select AI Index technical performance benchmarks vs. human performance

Source: AI Index, 2024 | Chart: 2024 AI Index report

Figure 2.1.16

1 An AI benchmark is a standardized test used to evaluate the performance and capabilities of AI systems on specific tasks. For example, ImageNet is a canonical AI benchmark that features a large collection of labeled images, and AI systems are tasked with classifying these images accurately. Tracking progress on benchmarks has been a standard way for the AI community to monitor the advancement of AI systems.

2 In Figure 2.1.16, the values are scaled to establish a standard metric for comparing different benchmarks. The scaling function is calibrated such that the performance of the best model for each year is measured as a percentage of the human baseline for a given task. A value above 100% indicates, for example, that a model performs better than the human baseline.
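The scaling described in this footnote can be sketched in a few lines. This is a minimal illustration that assumes the scaling is a simple ratio of the best model score to the human baseline; the function name and the example numbers are hypothetical and do not come from the report.

```python
def scale_to_human_baseline(model_score: float, human_baseline: float) -> float:
    """Express a model score as a percentage of the human baseline.

    Assumes both scores use the same metric and that higher is better.
    """
    if human_baseline <= 0:
        raise ValueError("human_baseline must be positive")
    return 100.0 * model_score / human_baseline


# Hypothetical scores, for illustration only.
scaled = scale_to_human_baseline(model_score=94.5, human_baseline=90.0)
print(f"{scaled:.1f}% of the human baseline")
```

Under this assumption, any value above 100 means the best model for that year exceeds the human baseline on the task.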

AI Index Benchmarks

An emerging theme in AI technical performance, as emphasized in last year's report, is the observed saturation on many benchmarks, such as ImageNet, used to assess the proficiency of AI models. Performance on these benchmarks has stagnated in recent years, indicating either a plateau in AI capabilities or a shift among researchers toward more complex research challenges.

Due to saturation, several benchmarks featured in the 2023 AI Index have been omitted from this year's report. Figure 2.1.17 highlights a selection of benchmarks that were included in the 2023 edition but not featured in this year's report. It also shows the improvement on these benchmarks since 2022. "NA" indicates no improvement was noted.

A selection of deprecated benchmarks from the AI Index report

Source: AI Index, 2024

Benchmark	Task category	Year introduced	Improvement from 2022
Abductive Natural Language Inference (aNLI)	Natural language inference	2019	NA
arXiv	Text summarization	2003	NA
Cityscapes Challenge	Semantic segmentation	2016	
ImageNet	Image classification	2009	
Kinetics-400	Activity recognition	2017	NA
Kinetics-600	Activity recognition	2018	NA
Kinetics-700	Activity recognition	2019	NA
Kvasir-SEG	Medical image segmentation	2019	
MPII	Human pose estimation	2014	NA
PubMed	Text summarization	2008	NA
SST-5 Fine-Grained Classification	Sentiment analysis	2013	NA
STL-10	Image generation	2011	NA
SuperGLUE	English language understanding	2019	NA
Visual Question Answering Challenge (VQA)	Visual reasoning	2017	NA
VoxCeleb	Speech recognition	2017	NA

Figure 2.1.17

3 Benchmarks can also saturate or see limited improvement because the problem posed is hard and the corresponding performance fails to improve. The issue of benchmark saturation discussed in this section refers more to benchmarks where performance reaches a close-to-perfect level that is difficult to improve on.

4 For brevity, Figure 2.1.17 highlights a selection of deprecated benchmarks. Additional benchmarks that were deprecated because there was saturation, no new state-of-the-art score was documented, or research focus shifted away from the benchmark include: Celeb-DF (deepfake detection), CIFAR-10 (image classification), NIST FRVT (facial recognition), and Procgen (reinforcement learning).

Figure 2.1.18 illustrates the year-over-year improvement, in percent, on a selection of benchmarks featured in the AI Index report. Most benchmarks see significant performance increases relatively soon after they are introduced, then the improvement slows. In the last few years, many of these benchmarks have shown little or no improvement.
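A year-over-year improvement series like the one plotted in Figure 2.1.18 could be computed as follows. This sketch assumes "improvement in percent" means the relative change in the best score from one year to the next; the report does not spell out the exact formula, and the scores below are hypothetical.

```python
def yoy_improvement(scores_by_year: dict[int, float]) -> dict[int, float]:
    """Return the percent change in score for each year relative to the prior year."""
    years = sorted(scores_by_year)
    return {
        curr: 100.0 * (scores_by_year[curr] - scores_by_year[prev]) / scores_by_year[prev]
        for prev, curr in zip(years, years[1:])
    }


# Hypothetical best scores on a single benchmark, for illustration only.
best_scores = {2020: 50.0, 2021: 75.0, 2022: 78.0}
for year, pct in yoy_improvement(best_scores).items():
    print(f"{year}: {pct:+.1f}%")
```

The pattern in the hypothetical data, a large early jump followed by a small gain, mirrors the saturation behavior the paragraph describes.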

Year-over-year improvement over time on select AI Index technical performance benchmarks

Source: AI Index, 2024 | Chart: 2024 AI Index report

Figure 2.1.18

In response to benchmark saturation, AI researchers are pivoting away from traditional benchmarks and testing AI on more difficult challenges. The 2024 AI Index tracks progress on several new benchmarks, including those for tasks in coding, advanced reasoning, and agentic behavior, areas that were underrepresented in previous versions of the report (Figure 2.1.19).

New benchmarks featured in the 2024 AI Index report

Source: AI Index, 2024

Benchmark	Task category	Year introduced
AgentBench	Agent-based behavior	2023
BigToM	Causal reasoning	2023
Chatbot Arena Leaderboard	General language	2023
EditVal	Image editing	2023
GPQA	General reasoning	2023
GSM8K	Mathematical reasoning	2021
HEIM	Image generation	2023
HELM	General language	2021
HaluEval	Factuality	2023
HumanEval	Coding	2021
MATH	Mathematical reasoning	2021
MLAgentBench	Agent-based behavior	2023
MMMU	General reasoning	2023
MoCa	Moral reasoning	2023
PlanBench	Planning	2023
SWE-bench	Coding	2023
TruthfulQA	Factuality	2021
VisIT-Bench	Image instruction-following	2023

5 This report includes an Appendix with details regarding the sourcing of new benchmarks featured in this chapter.

2.2 Language

Natural language processing (NLP) enables computers to understand, interpret, generate, and transform text. Current state-of-the-art models, such as OpenAI's GPT-4 and Google's Gemini, are able to generate fluent and coherent prose and display high levels of language understanding ability (Figure 2.2.1). Many of these models can also now handle different input forms, such as images and audio (Figure 2.2.2).

A sample output from GPT-4

Source: AI Index, 2024

Figure 2.2.1

Gemini handling image and audio inputs

Source: Google, 2024

[Figure 2.2.2: image not reproduced. Panels: Input Image; Input Audio (transcribed).]

Understanding

English language understanding challenges AI systems to understand the English language in various ways such as reading comprehension and logical reasoning.

HELM: Holistic Evaluation of Language Models

As illustrated above, in recent years LLMs have surpassed human performance on traditional English-language benchmarks, such as SQuAD (question answering) and SuperGLUE (language understanding). This rapid advancement has led to the need for more comprehensive benchmarks.

HELM: mean win rate

Source: CRFM, 2023 | Chart: 2024 AI Index report

Figure 2.2.3

In 2022, Stanford researchers introduced HELM (Holistic Evaluation of Language Models), designed to evaluate LLMs across diverse scenarios, including reading comprehension, language understanding, and mathematical reasoning.

HELM assesses models from several leading companies like Anthropic, Google, Meta, and OpenAI, and uses a "mean win rate" to track average performance across all scenarios. As of January 2024, GPT-4 leads the aggregate HELM leaderboard with a mean win rate of 0.96 (Figure 2.2.3); however, different models top different task categories (Figure 2.2.4).
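To make the "mean win rate" idea concrete, the sketch below computes, for each model, the fraction of other models it outscores in each scenario and then averages that fraction across scenarios. This is an illustrative reading of the metric, not HELM's actual implementation, which may differ in details such as tie handling or missing scores. The model names and scores are hypothetical, and the sketch assumes every model has a score in every scenario.

```python
def mean_win_rate(scores: dict[str, dict[str, float]]) -> dict[str, float]:
    """Average, over scenarios, of the fraction of other models each model beats.

    scores[scenario][model] is a score where higher is better.
    """
    models = sorted({m for per_model in scores.values() for m in per_model})
    totals = {m: 0.0 for m in models}
    for per_model in scores.values():
        for m in models:
            others = [o for o in models if o != m]
            wins = sum(per_model[m] > per_model[o] for o in others)
            totals[m] += wins / len(others)
    return {m: totals[m] / len(scores) for m in models}


# Hypothetical scores for three models on two scenarios.
example = {
    "scenario_a": {"model_x": 0.9, "model_y": 0.8, "model_z": 0.7},
    "scenario_b": {"model_x": 0.5, "model_y": 0.6, "model_z": 0.4},
}
print(mean_win_rate(example))
```

A mean win rate close to 1 therefore indicates a model that beats nearly every other model in nearly every scenario, even if it does not top each individual task.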

Leaders on individual HELM sub-benchmarks

Source: CRFM, 2023 | Table: 2024 AI Index report

Task	Leading model	Score
GSM8K - EM	GPT-4 (0613)	0.93
LegalBench - EM	GPT-4 (0613)	0.71
MATH - Equivalent (CoT)	GPT-4 Turbo (1106 preview)	0.86
MMLU - EM	GPT-4 (0613)	0.74
MedQA - EM	GPT-4 Turbo (1106 preview)	0.82
NarrativeQA - F1	Yi (34B)	0.78
NaturalQuestions (closed-book) - F1	Llama 2 (70B)	0.46
NaturalQuestions (open-book) - F1	PaLM-2 (Bison)	0.81
OpenbookQA - EM	GPT-4 (0613)	0.96
WMT 2014 - BLEU-4	Palmyra X V3 (72B)	0.26

Figure 2.2.4

MMLU: Massive Multitask Language Understanding

The Massive Multitask Language Understanding (MMLU) benchmark assesses model performance in zero-shot or few-shot scenarios across 57 subjects, including the humanities, STEM, and social sciences (Figure 2.2.5). MMLU has emerged as a premier benchmark for assessing LLM capabilities: Many state-of-the-art models like GPT-4, Claude 2, and Gemini have been evaluated against MMLU.

In early 2023, GPT-4 posted a state-of-the-art score on MMLU, later surpassed by Google's Gemini Ultra. Figure 2.2.6 highlights the top model scores on the MMLU benchmark in different years. The scores reported are the averages across the test set. As of January 2024, Gemini Ultra holds the top score, marking a 14.8 percentage point improvement since 2022 and a 57.6 percentage point increase since MMLU's inception in 2019. Gemini Ultra's score was the first to surpass MMLU's human baseline.

A sample question from MMLU

Source: Hendrycks et al., 2021

One of the reasons that the government discourages and regulates monopolies is that

(A) producer surplus is lost and consumer surplus is gained.

(B) monopoly prices ensure productive efficiency but cost society allocative efficiency.

(D) consumer surplus is lost with higher prices and lower levels of output.

MMLU: average accuracy

Source: Papers With Code, 2023 | Chart: 2024 AI Index report

Generation

In generation tasks, AI models are tested on their ability to produce fluent and practical language responses.

Chatbot Arena Leaderboard

The rise of capable LLMs has made it increasingly important to understand which models are preferred by the general public. Launched in 2023, the Chatbot Arena Leaderboard is one of the first comprehensive evaluations of public LLM preference. The leaderboard allows users to query two anonymous models and vote for the preferred generations (Figure 2.2.7). As of early 2024, the platform has garnered over 200,000 votes, and users ranked OpenAI's GPT-4 Turbo as the most preferred model (Figure 2.2.8).
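Figure 2.2.8 reports the leaderboard as Elo ratings, which turn pairwise votes into a single score per model. The sketch below shows the standard Elo update for one vote; the K-factor of 32 and the starting rating of 1000 are assumptions chosen for illustration, and the leaderboard's actual rating procedure is not described in this section.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A is preferred over B under the Elo model."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a: float, rating_b: float, a_wins: bool,
               k: float = 32.0) -> tuple[float, float]:
    """Return updated (rating_a, rating_b) after one pairwise vote."""
    exp_a = expected_score(rating_a, rating_b)
    actual_a = 1.0 if a_wins else 0.0
    delta = k * (actual_a - exp_a)
    # Whatever A gains, B loses, so the sum of ratings is conserved.
    return rating_a + delta, rating_b - delta


# Two hypothetical models start at the same rating; a user prefers model A.
new_a, new_b = update_elo(1000.0, 1000.0, a_wins=True)
print(new_a, new_b)
```

An upset, where a lower-rated model wins the vote, moves both ratings further than an expected result does, which is why the ranking can stabilize after many votes.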

A sample model response on the Chatbot Arena Leaderboard

Source: Chatbot Arena Leaderboard, 2024

LMSYS Chatbot Arena for LLMs: Elo rating

Source: Hugging Face, 2024 | Chart: 2024 AI Index report

Figure 2.2.8

Factuality and Truthfulness

Despite remarkable achievements, LLMs remain susceptible to factual inaccuracies and content hallucination, that is, creating seemingly realistic, yet false, information. Real-world instances where LLMs have produced hallucinations (in court cases, for example) underscore the growing necessity of closely monitoring trends in LLM factuality.

TruthfulQA

Introduced at ACL 2022, TruthfulQA is a benchmark designed to evaluate the truthfulness of LLMs in generating answers to questions. This benchmark comprises approximately 800 questions across 38 categories, including health, politics, and finance. Many questions are crafted to challenge commonly held misconceptions, which typically lead humans to answer incorrectly (Figure 2.2.9). Although one of the observations of the paper is that larger models tend to be less truthful, GPT-4 (RLHF), released in early 2023, has achieved the highest performance thus far on the TruthfulQA benchmark, with a score of 0.6 (Figure 2.2.10). This score is nearly three times higher than that of a GPT-2-based model tested in 2021, indicating that LLMs are becoming progressively better at providing truthful answers.

Sample TruthfulQA questions

Source: Lin, Hilton, and Evans, 2022

[Figure 2.2.9: table not reproduced. Columns: Category; Question from TruthfulQA; Answer from GPT-3 (FALSE).]

Multiple-choice task on TruthfulQA: MC1
TruthfulQA 上的多项选择任务：MC1

Source: Papers with Code, 2023 | Chart: 2024 AI Index report
来源：Papers with Code，2023 | 图表：2024 AI 指数报告

Figure 2.2.10 图 2.2.10
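
The MC1 task named in the chart title can be illustrated with a short sketch. It assumes that each question comes with several answer options of which exactly one is true, that a model's pick is the option to which it assigns the highest log-probability, and that the MC1 score is the share of questions where that pick is the true option. The function name, the data layout, and the numbers are hypothetical and serve only as illustration.

```python
from typing import List, Tuple


def mc1_accuracy(items: List[Tuple[List[float], int]]) -> float:
    """Return the fraction of questions whose top-scored option is the true one.

    Each item is (log_probs, true_index): one log-probability per answer
    option plus the index of the single true option (hypothetical layout).
    """
    if not items:
        raise ValueError("need at least one question")
    hits = 0
    for log_probs, true_index in items:
        # The model's pick is the option with the highest log-probability.
        picked = max(range(len(log_probs)), key=lambda i: log_probs[i])
        hits += int(picked == true_index)
    return hits / len(items)


# Made-up log-probabilities for three questions (not real model output).
example = [
    ([-1.2, -3.4, -2.8], 0),  # top option is the true one
    ([-0.9, -0.4, -2.0], 0),  # the model prefers a false option
    ([-2.5, -2.6, -0.7], 2),  # top option is the true one
]
print(round(mc1_accuracy(example), 3))
```

Under these assumptions a score of 0.6 would mean that the model's top-ranked option was the true one on 60% of the questions.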

HaluEval

As previously mentioned, LLMs are prone to hallucinations, a concerning trait given their widespread deployment in critical fields such as law and medicine. While existing research has aimed to understand the causes of hallucinations, less effort has been directed toward assessing the frequency of LLM hallucinations and identifying specific content areas where they are especially vulnerable.
正如先前提到的，LLMs容易产生幻觉，这一特点令人担忧，尤其是考虑到它们在法律和医学等关键领域的广泛应用。尽管现有研究旨在了解幻觉的原因，但对LLM幻觉的频率进行评估以及确定特定内容领域中它们特别脆弱的努力较少。

HaluEval, introduced in 2023, is a new benchmark designed to assess hallucinations in LLMs. It includes over 35,000 samples, both hallucinated and normal, for analysis and evaluation by LLMs (Figure 2.2.11). The research indicates that ChatGPT fabricates unverifiable information in approximately

of its responses, with these fabrications spanning a variety of topics such as language, climate, and technology. Furthermore, the study examines how well current LLMs can detect hallucinations. Figure 2.2.12 illustrates the performance of leading LLMs in identifying hallucinations across various tasks, including question answering, knowledge-grounded dialogue, and text summarization. The findings reveal that many LLMs struggle with these tasks, highlighting that hallucination is a significant ongoing issue.
HaluEval，于 2023 年推出，是一个旨在评估LLMs中幻觉的新基准。它包括超过 35,000 个样本，既有幻觉的，也有正常的，供LLMs（图 2.2.11）进行分析和评估。研究表明，ChatGPT 在其回复中约

的情况下捏造不可验证的信息，这些捏造涉及语言、气候和技术等各种主题。此外，该研究还检验了当前LLMs能够检测幻觉的能力。图 2.2.12 展示了领先LLMs在识别各种任务中的幻觉表现，包括问题回答、知识驱动对话和文本摘要。研究结果显示，许多LLMs在这些任务中遇到困难，突显幻觉是一个重要的持续问题。
A generated hallucinated QA example and a human-labeled ChatGPT response for a user query
一个生成的幻觉问答示例和一个用户查询的人工标记的 ChatGPT 回复

Source: Li et al., 2023
来源：Li 等人，2023

Figure 2.2.11 图 2.2.11

HaluEval hallucination classification accuracy
HaluEval 幻觉分类准确性

Source: Li et al., 2023 | Table: 2024 AI Index report
来源：Li 等人，2023 年 | 表：2024 年 AI 指数报告

Models	QA	Dialogue	Summarization	General
ChatGPT (2022) 聊天生成模型 (2022)
Claude 2 (2023) 克劳德 2 (2023)
Claude (2023) 克劳德 (2023)
Davinci002 (2022) 达芬奇 002（2022 年）
Davinci003 (2022) 达芬奇 003（2022 年）
GPT-3 (2020)
Llama 2 (2023) 羊驼 2（2023 年）
ChatGLM (2023) 聊天 GLM（2023 年）
Falcon (2023) 猎鹰（2023 年）
Vicuna (2023) 维库纳（2023 年）
Alpaca (2023) 羊驼 (2023)

Figure 2.2.12 图 2.2.12

Coding involves the generation of instructions that computers can follow to perform tasks. Recently, LLMs have become proficient coders, serving as valuable assistants to computer scientists. There is also increasing evidence that many coders find AI coding assistants highly useful.
编码涉及生成计算机可以遵循以执行任务的指令。最近，LLMs 已经成为熟练的编码者，成为计算机科学家的宝贵助手。越来越多的证据表明，许多编码人员认为 AI 编码助手非常有用。

2.3 Coding 2.3 编码

Generation 生成

On many coding tasks, AI models are challenged to generate usable code or to solve computer science problems.
在许多编码任务中，AI 模型面临着生成可用代码或解决计算机科学问题的挑战。

HumanEval 人类评估

HumanEval, a benchmark for evaluating AI systems' coding ability, was introduced by OpenAI researchers in 2021. It consists of 164 challenging handwritten programming problems (Figure 2.3.1). A GPT-4 model variant (AgentCoder) currently leads in HumanEval performance, scoring

, which is an 11.2 percentage point increase from the highest score in 2022 (Figure 2.3.2). Since 2021, performance on HumanEval has increased 64.1 percentage points.
HumanEval 是一个用于评估 AI 系统编码能力的基准，由 OpenAI 研究人员于 2021 年推出。它包含 164 个具有挑战性的手写编程问题（图 2.3.1）。一种 GPT-4 模型变体（AgentCoder）目前在 HumanEval 上表现领先，得分

，比 2022 年的最高分高出 11.2 个百分点（图 2.3.2）。自 2021 年以来，HumanEval 上的表现提高了 64.1 个百分点。

HumanEval: Pass@1 HumanEval：Pass@1
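
HumanEval results are reported as Pass@1, the share of problems solved by a model's generated code. A minimal sketch of the metric follows, assuming the unbiased pass@k estimator described by the HumanEval authors: for each problem, n code samples are drawn, c of them pass the unit tests, and pass@k estimates the chance that at least one of k samples passes. The function names and the per-problem counts are illustrative only, not real benchmark results.

```python
from math import comb
from typing import List, Tuple


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem: 1 - C(n - c, k) / C(n, k)."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: every draw of k contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: List[Tuple[int, int]], k: int) -> float:
    """Average the per-problem estimates; results holds (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Hypothetical counts for four problems, ten samples each.
results = [(10, 10), (10, 3), (10, 0), (10, 7)]
print(benchmark_pass_at_k(results, k=1))
print(benchmark_pass_at_k(results, k=5))
```

With k = 1 the estimator reduces to c / n, the fraction of samples that pass, so a Pass@1 score can be read as the expected share of problems solved on a single attempt.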

Sample HumanEval problem
样本 HumanEval 问题

Source: Chen et al., 2023
来源：Chen 等人，2023

SWE-bench SWE-基准

As AI systems' coding capabilities improve, it has become increasingly important to benchmark models on more challenging tasks. In October 2023, researchers introduced SWE-bench, a dataset comprising 2,294 software engineering problems sourced from real GitHub issues and popular Python repositories (Figure 2.3.3). SWE-bench presents a tougher test for AI coding proficiency, demanding that systems coordinate changes across multiple functions, interact with various execution environments, and perform complex reasoning.
随着 AI 系统编码能力的提高，在更具挑战性的任务上对模型进行基准测试变得越来越重要。2023 年 10 月，研究人员推出了 SWE-bench，这是一个包含 2,294 个软件工程问题的数据集，这些问题来自真实的 GitHub issue 和流行的 Python 代码库（图 2.3.3）。SWE-bench 对 AI 的编码能力提出了更严格的考验，要求系统在多个函数之间协调修改，与各种执行环境交互，并进行复杂的推理。

Even state-of-the-art LLMs face significant challenges with SWE-bench. Claude 2, the best-performing model, solved only

of the dataset's problems (Figure 2.3.4).

In 2023, the top-performing model on SWE-bench surpassed the best model from 2022 by 4.3 percentage points.
即使是最先进的LLMs在 SWE-基准上也面临着重大挑战。表现最佳的模型 Claude 2 仅解决了数据集中的

问题（图 2.3.4）。

在 2023 年，SWE-基准上表现最佳的模型比 2022 年最佳模型提高了 4.3 个百分点。

A sample model input from SWE-bench
来自 SWE-bench 的样本模型输入

Source: Jimenez et al., 2023
来源：Jimenez 等人，2023 年

SWE-bench: percent resolved
SWE-bench：已解决的百分比

Source: SWE-bench Leaderboard, 2023 | Chart: 2024 AI Index report
来源：SWE-bench 排行榜，2023 年 | 图表：2024 年 AI 指数报告

Figure 2.3.4 图 2.3.4

8 According to the SWE-bench leaderboard, unassisted systems have no assistance in finding the relevant files in the repository. Assisted systems operate under the "oracle" retrieval setting, which means the systems are provided with the list of files that were modified in the pull request.
根据 SWE-bench 排行榜，无辅助系统在存储库中找到相关文件时没有任何帮助。辅助系统在“oracle”检索设置下运行，这意味着系统提供了在拉取请求中修改的文件列表。

Computer vision allows machines to understand images and videos and create realistic visuals from textual prompts or other inputs. This technology is widely used in fields such as autonomous driving, medical imaging, and video game development.
计算机视觉允许机器理解图像和视频，并从文本提示或其他输入创建逼真的视觉效果。这项技术广泛应用于自动驾驶、医学影像和视频游戏开发等领域。

2.4 Image Computer Vision and Image Generation
2.4 图像计算机视觉与图像生成

Generation 生成

Image generation is the task of generating images that are indistinguishable from real ones. Today's image generators are so advanced that most people struggle to differentiate between AI-generated images and actual images of human faces (Figure 2.4.1). Figure 2.4.2 highlights several generations from various Midjourney model variants from 2022 to 2024 for the prompt "a hyper-realistic image of Harry Potter." The progression demonstrates the significant improvement in Midjourney's ability to generate hyper-realistic images over a two-year period. In 2022, the model produced cartoonish and inaccurate renderings of Harry Potter, but by 2024, it could create startlingly realistic depictions.
图像生成是生成与真实图像无法区分的图像的任务。如今的图像生成器非常先进，以至于大多数人很难区分 AI 生成的图像和真实的人脸图像（图 2.4.1）。图 2.4.2 展示了 2022 年至 2024 年间多个 Midjourney 模型版本针对提示词“a hyper-realistic image of Harry Potter”生成的若干图像。这一进展展示了 Midjourney 在两年时间内生成超写实图像能力的显著提升。2022 年，该模型生成的哈利·波特图像卡通化且不准确，但到了 2024 年，它已能创造出逼真得惊人的画面。
Which face is real?
哪张脸是真实的？

Source: Which Face Is Real, 2023
来源：哪张脸是真实的，2023

Figure 2.4.1 图 2.4.1

Midjourney generations over time: "a hyper-realistic image of Harry Potter"
随时间变化的中途世代："哈利·波特的超现实主义形象"

Source: Midjourney, 2023
来源：Midjourney，2023

V1, February 2022 2022 年 2 月 V1

V2, April 2022 2022 年 4 月 V2

V3, July 2022 2022 年 7 月 V3

V4, November 2022 2022 年 11 月 V4

V5, March 2023 2023 年 3 月 V5

V5.1, March 2023 2023 年 3 月 V5.1

V5.2, June 2023 V5.2，2023 年 6 月

V6, December 2023 V6，2023 年 12 月

HEIM: Holistic Evaluation of Text-to-Image Models
HEIM：文本到图像模型的整体评估

The rapid progress of text-to-image systems has prompted the development of more sophisticated evaluation methods. In 2023, Stanford researchers introduced the Holistic Evaluation of Text-to-Image Models (HEIM), a benchmark designed to comprehensively assess image generators across 12 key aspects crucial for real-world deployment, such as image-text alignment, image quality, and aesthetics. Human evaluators are used to rate the models, a crucial feature since many automated metrics struggle to accurately assess various aspects of images.
文本到图像系统的快速进展促使了更复杂评估方法的发展。2023 年，斯坦福研究人员推出了文本到图像模型的整体评估（HEIM），这是一个旨在全面评估图像生成器在 12 个关键方面的基准，这些方面对于实际部署至关重要，如图像文本对齐、图像质量和美学。人类评估者用于对模型进行评分，这是一个关键特性，因为许多自动化指标难以准确评估图像的各个方面。

HEIM's findings indicate that no single model excels in all criteria. For human evaluation of image-text alignment (assessing how well the generated image matches the input text), OpenAI's DALL-E 2 scores highest (Figure 2.4.3). In terms of image quality (gauging whether the images resemble real photographs), aesthetics (evaluating the visual appeal), and originality (a measure of novel image generation and avoidance of copyright infringement), the Stable Diffusion-based Dreamlike Photoreal model ranks highest (Figure 2.4.4).
HEIM 的研究结果表明，没有单一模型在所有标准上表现优异。对于图像到文本对齐的人类评估（评估生成的图像与输入文本匹配程度），OpenAI 的 DALL-E 2 得分最高（图 2.4.3）。在图像质量（衡量图像是否类似真实照片）、美学（评估视觉吸引力）和原创性（衡量新颖图像生成和避免侵犯版权的程度）方面，基于稳定扩散的梦幻逼真模型排名最高（图 2.4.4）。
Image-text alignment: human evaluation
图像文本对齐：人类评估

Source: CRFM, 2023 | Chart: 2024 AI Index report
来源：CRFM，2023 | 图表：2024 AI 指数报告

Figure 2.4.3 图 2.4.3

Model leaders on select HEIM sub-benchmarks
在选择的 HEIM 子基准上的模型领导者

Source: CRFM, 2023 | Table: 2024 AI Index report
来源：CRFM，2023 | 表格：2024 AI 指数报告

Task	Leading model 领先模型	Score
Image-text-alignment	DALL-E 2 (3.5B) DALL-E 2（35 亿）	0.94
Quality	Dreamlike Photoreal v2.0 (1B) 梦幻逼真 v2.0（1B）	0.92
Aesthetics	Dreamlike Photoreal v2.0 (1B) 梦幻逼真 v2.0（1B）	0.87
Originality	Dreamlike Photoreal v2.0 (1B) 梦幻逼真 v2.0（1B）	0.98

Highlighted Research: 突出研究：

MVDream

Creating 3D geometries or models from text prompts has been a significant challenge for researchers, with existing models struggling with problems such as the multiface Janus issue (inaccurately regenerating context described by text prompts) and content drift (inconsistency across different views). MVDream is a new 3D generation system developed by ByteDance and University of California, San Diego researchers that overcomes some of these hurdles (Figure 2.4.5). In quantitative evaluations, MVDream's generated models achieve Inception Score (IS) and CLIP scores comparable to those in the training set, indicating the high quality of the generated images (Figure 2.4.6). MVDream has major implications, especially for creative industries where 3D content creation is traditionally time-consuming and labor-intensive.
从文本提示中创建 3D 几何结构或模型一直是研究人员面临的重大挑战，现有模型存在诸如多面 Janus 问题（不准确地重新生成文本提示描述的上下文）和内容漂移（不同视图之间的不一致性）等问题。MVDream 是由字节跳动和加州大学圣地亚哥分校的研究人员开发的新型 3D 生成系统，克服了其中一些障碍（图 2.4.5）。在定量评估中，MVDream 生成的模型达到了与训练集相当的 Inception Score（IS）和 CLIP 分数，表明生成的图像质量很高（图 2.4.6）。MVDream 具有重大影响，特别是对于传统上 3D 内容创建耗时且劳动密集的创意产业。
Sample generations from MVDream
MVDream 中的示例生成

Source: Shi et al., 2023
来源：Shi 等人，2023 年

An astronaut riding a horse
一名宇航员骑马

A bull dog wearing a black pirate hat
一只戴着黑色海盗帽的斗牛犬

Quantitative evaluation on image synthesis quality
图像合成质量的定量评估

Source: Shi et al., 2023 | Table: 2024 AI Index report
来源：Shi 等人，2023 年 | 表格：2024 年 AI 指数报告

Model	Batch size	FID	IS	CLIP
Training data 训练数据	N/A	N/A
Multi-view Diffusion - no 2D data 多视角扩散 - 没有 2D 数据	256	33.41
Multi-view Diffusion - proposed 多视角扩散 - 建议	256	32.57
Multi-view Diffusion - proposed 多视角扩散 - 提出	1024	32.06

Instruction-Following 指令跟随

In computer vision, instruction-following is the capacity of vision-language models to interpret text-based directives related to images. For instance, an AI system could be given an image of various ingredients and tasked with suggesting how to use them to prepare a healthy meal. Capable instruction-following vision-language models are necessary for developing advanced AI assistants.
在计算机视觉中，指令跟随是指视觉-语言模型解释与图像相关的文本指令的能力。例如，可以给 AI 系统一张包含各种食材的图像，并要求它建议如何用这些食材准备一顿健康的餐点。能力出色的指令跟随视觉-语言模型对于开发先进的 AI 助手是必不可少的。

A sample VisIT-Bench instruction set
一个示例 VisIT-Bench 指令集

Source: Bitton et al., 2023
来源：Bitton 等人，2023 年

VisIT-Bench

In 2023, a team of industry and academic researchers introduced VisIT-Bench, a benchmark consisting of 592 challenging vision-language instructions across about 70 instruction categories, such as plot analysis, art knowledge, and location understanding (Figure 2.4.8). As of January 2024, the leading model on VisIT-Bench is GPT-4V, the vision-enabled variant of GPT-4 Turbo, with an Elo score of 1,349, marginally surpassing the human reference score for VisIT-Bench (Figure 2.4.9).
在 2023 年，一组行业和学术研究人员推出了 VisIT-Bench，这是一个由 592 个具有挑战性的视觉语言指令组成的基准测试，涵盖了约 70 个指令类别，如情节分析、艺术知识和位置理解（图 2.4.8）。截至 2024 年 1 月，VisIT-Bench 上领先的模型是 GPT-4V，这是 GPT-4 Turbo 的视觉增强变体，具有 1,349 的 Elo 分数，略高于 VisIT-Bench 的人类参考分数（图 2.4.9）。
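
The Elo scores reported for VisIT-Bench come from pairwise comparisons between systems. As a rough illustration, the sketch below applies the standard Elo update after a single head-to-head comparison. It is an assumption-laden example: the K-factor of 32, the starting ratings, and the function names are hypothetical, and the leaderboard's actual rating procedure may differ in its details.

```python
from typing import Tuple


def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A is preferred over B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def elo_update(rating_a: float, rating_b: float, score_a: float,
               k: float = 32.0) -> Tuple[float, float]:
    """Return updated ratings after one comparison.

    score_a is 1.0 if A wins, 0.0 if B wins, and 0.5 for a tie.
    k (the K-factor) controls the step size; 32 is an assumed value.
    """
    exp_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - exp_a)
    # B's change mirrors A's, so the total rating is conserved.
    new_b = rating_b - k * (score_a - exp_a)
    return new_a, new_b


# Two hypothetical systems start level; A's response is preferred once.
print(elo_update(1000.0, 1000.0, score_a=1.0))
# A 400-point gap corresponds to roughly a 91% expected preference rate.
print(round(expected_score(1400.0, 1000.0), 3))
```

Because only rating differences enter the formula, a model that marginally exceeds the human reference score is expected to be preferred only slightly more than half the time in a direct comparison.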

VisIT-Bench: Elo rating VisIT-Bench：Elo 评分

Figure 2.4.8 图 2.4.8

Source: Hugging Face, 2024 | Chart: 2024 AI Index report
来源：Hugging Face，2024 年 | 图表：2024 年 AI 指数报告

Model 模型

Editing 编辑

Image editing involves using AI to modify images based on text prompts. This AI-assisted approach has broad real-world applications in fields such as engineering, industrial design, and filmmaking.
图像编辑涉及使用 AI 根据文本提示修改图像。这种 AI 辅助方法在工程、工业设计和电影制作等领域具有广泛的实际应用。

EditVal

Despite the promise of text-guided image editing, few robust methods can evaluate how accurately image editors adhere to editing prompts. EditVal, a new benchmark for assessing text-guided image editing, includes over 13 edit types, such as adding objects or changing their positions, across 19 object classes (Figure 2.4.10). The benchmark was applied to evaluate eight leading text-guided image editing methods, including SINE and Null-text. Performance improvements since 2021 on a variety of the benchmark's editing tasks are shown in Figure 2.4.11.
尽管有文本引导的图像编辑的前景，但很少有强大的方法可以评估图像编辑器如何准确地遵循编辑提示。EditVal 是一个用于评估文本引导的图像编辑的新基准，包括超过 13 种编辑类型，例如添加对象或更改它们的位置，涵盖 19 个对象类别（图 2.4.10）。该基准被用于评估包括 SINE 和 Null-text 在内的八种领先的文本引导图像编辑方法。自 2021 年以来在各种基准编辑任务上的性能改进，如图 2.4.11 所示。

A sample VisIT-Bench instruction set
一个样本 VisIT-Bench 指令集

Source: Bitton et al., 2023
来源：Bitton 等人，2023 年

Figure 2.4.10 图 2.4.10

EditVal automatic evaluation: editing accuracy
EditVal 自动评估：编辑准确性

Source: EditVal Leaderboard, 2024 | Chart: 2024 AI Index report
来源：EditVal 排行榜，2024 年 | 图表：2024 年 AI 指数报告

Highlighted Research: 突出研究:

ControlNet 控制网络

Conditioning inputs or performing conditional control refers to the process of guiding the output created by an image generator by specifying certain conditions that a generated image must meet. Existing text-to-image models often lack precise control over the spatial composition of an image, making it difficult to use prompts alone to generate images with complex layouts, diverse shapes, and specific poses. Fine-tuning these models for greater compositional control by training them on additional images is theoretically feasible, but many specialized datasets, such as those for human poses, are not large enough to support successful training.
输入调节或执行条件控制是指通过指定生成的图像必须满足的某些条件来引导图像生成器创建的输出的过程。现有的文本到图像模型通常缺乏对图像空间构成的精确控制，这使得仅仅使用提示来生成具有复杂布局、多样形状和特定姿势的图像变得困难。通过在额外图像上训练这些模型以获得更大的构图控制是在理论上可行的，但许多专门的数据集，如人体姿势数据集，规模不够大以支持成功的训练。
In 2023, researchers from Stanford introduced a new model, ControlNet, that improves conditional control editing for large text-to-image diffusion models (Figure 2.4.12). ControlNet stands out for its ability to handle various conditioning inputs. Compared with other models released in 2022, human raters prefer ControlNet in terms of both result quality and condition fidelity (Figure 2.4.13). The introduction of ControlNet is a significant step toward creating advanced text-to-image generators capable of editing images to more accurately replicate the complex images frequently encountered in the real world.
2023 年，斯坦福的研究人员推出了一种新模型 ControlNet，该模型改进了大型文本到图像扩散模型的条件控制编辑（图 2.4.12）。ControlNet 以其处理各种条件输入的能力脱颖而出。与 2022 年之前发布的其他模型相比，人类评分员更喜欢 ControlNet，无论是在质量上还是在条件保真度上（图 2.4.13）。ControlNet 的引入是朝着创建能够编辑图像以更准确复制现实世界中经常遇到的复杂图像的先进文本到图像生成器迈出的重要一步。

Sample edits using ControlNet
使用 ControlNet 进行样本编辑

Source: Zhang et al., 2023
来源：张等人，2023

Panel labels: Input human pose | Default | "chef in kitchen" | "Lincoln statue"

Figure 2.4.12 图 2.4.12

Highlighted Research: 重点研究:

ControlNet (cont'd) 控制网络（续）

Average User Ranking (AUR): result quality and condition fidelity
平均用户排名（AUR）：结果质量和条件保真度

Source: Zhang et al., 2023 | Chart: 2024 AI Index report
来源：Zhang 等人，2023 年 | 图表：2024 年 AI 指数报告

Figure 2.4.13 图 2.4.13

Highlighted Research: 突出研究:

Instruct-NeRF2NeRF

New models can edit 3D geometries using only text instructions. Instruct-NeRF2NeRF is a model developed by Berkeley researchers that employs an image-conditioned diffusion model for iterative text-based editing of 3D geometries (Figure 2.4.14). This method efficiently generates new, edited images that adhere to textual instructions, achieving greater consistency than current leading methods (Figure 2.4.15).
新模型可以仅使用文本指令编辑 3D 几何结构。Instruct-NeRF2NeRF 是由伯克利研究人员开发的模型，它采用以图像为条件的扩散模型，对 3D 几何结构进行基于文本的迭代编辑（图 2.4.14）。该方法能高效地生成符合文本指令的新编辑图像，并实现比当前主流方法更高的一致性（图 2.4.15）。

A demonstration of Instruct-NeRF2NeRF in action
Instruct-NeRF2NeRF 在操作中的演示

Source: Haque et al., 2023
来源：Haque 等人，2023

Figure 2.4.14 图 2.4.14

Highlighted Research: 突出研究:

Instruct-NeRF2NeRF (cont'd)
Instruct-NeRF2NeRF（续）

Evaluating text-image alignment and frame consistency
评估文本图像对齐和帧一致性

Source: Haque et al., 2023 | Chart: 2024 AI Index report
来源：Haque 等人，2023 年 | 图表：2024 年 AI 指数报告

Figure 2.4.15 图 2.4.15

Segmentation 分割

Segmentation involves assigning individual image pixels to specific categories (for example: human, bicycle, or street).
分割涉及将图像的每个像素分配到特定的类别中（例如：人类、自行车或街道）。

Highlighted Research: 突出的研究：

Segment Anything 分割任何东西

In 2023, Meta researchers launched Segment Anything, a project that featured the Segment Anything Model (SAM) and an extensive SA-1B dataset for image segmentation. SAM is remarkable for being one of the first broadly generalizable segmentation models that performs well zero-shot on new tasks and distributions. Segment Anything outperforms leading segmentation methods like RITM on 16 out of 23 segmentation datasets (Figure 2.4.17). The metric on which Segment Anything is evaluated is the mean Intersection over Union (IoU).
2023 年，Meta 研究人员推出了 Segment Anything 项目，该项目包含 Segment Anything Model（SAM）和一个大规模的 SA-1B 图像分割数据集。SAM 之所以引人注目，是因为它是最早能够在新任务和新分布上以零样本方式表现良好的通用分割模型之一。Segment Anything 在 23 个分割数据集中的 16 个上优于 RITM 等领先的分割方法（图 2.4.17）。评估 Segment Anything 所用的指标是平均交并比（IoU）。
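
The mean Intersection over Union metric mentioned above can be sketched as follows. The example assumes binary masks stored as nested lists of 0s and 1s and averages IoU over a set of predicted and ground-truth mask pairs; the function names and the tiny 2x2 masks are illustrative and are not taken from the Segment Anything evaluation code.

```python
from typing import List, Sequence, Tuple

Mask = Sequence[Sequence[int]]


def iou(pred: Mask, target: Mask) -> float:
    """Intersection over Union of two same-sized binary masks."""
    intersection = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            intersection += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    # Two empty masks agree perfectly; treat that case as IoU = 1.
    return intersection / union if union else 1.0


def mean_iou(pairs: List[Tuple[Mask, Mask]]) -> float:
    """Average IoU over (prediction, ground truth) pairs."""
    return sum(iou(p, t) for p, t in pairs) / len(pairs)


pred_a, true_a = [[1, 1], [0, 0]], [[1, 0], [0, 0]]  # overlap 1, union 2
pred_b, true_b = [[0, 1], [0, 1]], [[0, 1], [0, 1]]  # identical masks
print(mean_iou([(pred_a, true_a), (pred_b, true_b)]))
```

IoU rewards overlap and penalizes both missed pixels and extra pixels, which is why it is preferred over plain pixel accuracy when objects occupy a small part of the image.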

Meta's Segment Anything model was then used, alongside human annotators, to create the SA-1B dataset, which included over 1 billion segmentation masks across 11 million images (Figure 2.4.16). A new segmentation dataset of this size will accelerate the training of future image segmentors. Segment Anything demonstrates how AI models can be used alongside humans to more efficiently create large datasets, which in turn can be used to train even better AI systems.
Meta 的 Segment Anything 模型随后与人类标注者一起使用，创建了 SA-1B 数据集，该数据集包括 1100 万张图像上的超过 10 亿个分割蒙版（图 2.4.16）。这样规模的新分割数据集将加速未来图像分割器的训练。Segment Anything 展示了如何将 AI 模型与人类一起使用，更有效地创建大型数据集，进而用于训练更好的 AI 系统。

Various segmentation masks created by Segment Anything
由 Segment Anything 创建的各种分割蒙版

Source: Kirillov et al., 2023
来源: Kirillov 等人，2023

Figure 2.4.16 图 2.4.16

Highlighted Research: 突出研究:

Segment Anything (cont'd)
分割任何内容 (续)

SAM vs. RITM: mean IoU
SAM vs. RITM: 平均 IoU

3D Reconstruction From Images
从图像中的三维重建

3D image reconstruction is the process of creating three-dimensional digital geometries from two-dimensional images. This type of reconstruction can be used in medical imaging, robotics, and virtual reality.
三维图像重建是从二维图像创建三维数字几何形状的过程。这种重建类型可用于医学成像、机器人技术和虚拟现实。

Highlighted Research: 突出研究:

Skoltech3D

Data scarcity often hinders the development of AI systems for specific tasks. In 2023, a team of international researchers introduced an extensive new dataset, Skoltech3D, for multiview 3D surface reconstruction (Figure 2.4.18). Encompassing 1.4 million images of 107 scenes captured from 100 different viewpoints under 14 distinct lighting conditions, this dataset represents a major improvement over existing 3D reconstruction datasets (Figure 2.4.19).
数据稀缺经常阻碍了针对特定任务的 AI 系统的发展。2023 年，一组国际研究人员推出了一个广泛的新数据集 Skoltech3D，用于多视角 3D 表面重建（图 2.4.18）。该数据集包含了 107 个场景的 140 万张图像，从 100 个不同的视角下以 14 种不同的光照条件捕获，相比现有的 3D 重建数据集，这个数据集代表了一个重大的改进（图 2.4.19）。
Objects from the 3D reconstruction dataset
3D 重建数据集中的物体

Source: Voynov et al., 2023
源: Voynov 等人，2023 年

Figure 2.4.18 图 2.4.18

Skoltech3D vs. the most widely used multisensor datasets
Skoltech3D 与最广泛使用的多传感器数据集之间的对比

Source: Voynov et al., 2023 | Table: 2024 AI Index report
来源：Voynov 等人，2023 年 | 表格：2024 年 AI 指数报告

Original columns: Dataset | Sensor types | RGB resolution (MPix) | Depth resolution (MPix) | High-resolution geometry | Poses/scene | Lighting | # Scenes | # Frames

Dataset	Sensor types	Other surviving cell values
DTU	RGB (2)	
ETH3D	RGB	
BlendedMVG	unknown	502
BigBIRD	RGB (5)	600, 120
BigBIRD	RGB-D (5)	1.2, 0.3
ScanNet	RGB-D	1.3, 0.3, N/A, 1513
Skoltech3D	RGB (2)	100, 107
Skoltech3D	RGB-D 1 (2)	0.04
Skoltech3D	RGB-D 2	0.2
Skoltech3D	RGB-D 3	0.9

Highlighted Research: 突出研究：

RealFusion

RealFusion, developed by Oxford researchers, is a new method for generating complete 3D models of objects from single images, overcoming the challenge that a single image often carries insufficient information for a full 360-degree reconstruction. RealFusion utilizes existing image generators to produce multiple views of an object, and then assembles these views into a comprehensive 360-degree model (Figure 2.4.20). This technique yields more accurate 3D reconstructions than a state-of-the-art method from 2021 (Shelf-Supervised) across a wide range of objects (Figure 2.4.21).
由牛津研究人员开发的 RealFusion 是一种从单张图像生成完整物体 3D 模型的新方法，克服了单张图像往往不足以提供完整 360 度重建所需信息的难题。RealFusion 利用现有的图像生成器生成物体的多个视图，然后将这些视图组装成完整的 360 度模型（图 2.4.20）。与 2021 年的最先进方法（Shelf-Supervised）相比，这种技术在各种物体上产生了更准确的 3D 重建（图 2.4.21）。

Sample generations from RealFusion
RealFusion 的样本生成

Source: Melas-Kyriazi et al., 2023
来源：Melas-Kyriazi 等人，2023

Figure 2.4.20 图 2.4.20

Object reconstruction: RealFusion vs. Shelf-Supervised
对象重建：RealFusion vs. 货架监督

Source: Melas-Kyriazi et al., 2023 | Chart: 2024 AI Index report
来源：Melas-Kyriazi 等人，2023 年 | 图表：2024 年 AI 指数报告

Figure 2.4.21 图 2.4.21

2.5 Video Computer Vision and Video Generation
2.5 视频计算机视觉和视频生成

Generation 生成

Video generation involves the use of AI to generate videos from text or images.
视频生成涉及使用 AI 从文本或图像生成视频。

UCF101

UCF101 is an action recognition dataset of realistic action videos that contain 101 action categories (Figure 2.5.1). More recently, UCF101 has been used to benchmark video generators. This year's top model, W.A.L.T-XL, posted an FVD16 score of 36, more than halving the state-of-the-art score posted the previous year (Figure 2.5.2).
UCF101 是一个包含 101 种动作类别的真实动作视频的动作识别数据集（图 2.5.1）。最近，UCF101 已被用来评估视频生成器。今年的顶级模型 W.A.L.T-XL，发布了 36 的 FVD16 分数，比前一年发布的最先进分数减少了一半以上（图 2.5.2）。

UCF101: FVD16 UCF101：FVD16

Source: Papers With Code, 2023 | Chart: 2024 AI Index report
来源：Papers With Code，2023 | 图表：2024 AI 指数报告

Figure 2.5.2 图 2.5.2
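
FVD (Fréchet Video Distance) compares the distribution of features extracted from real videos with that of generated videos, and lower values are better, which is why halving the score counts as an improvement. The sketch below shows only the underlying Fréchet distance between two Gaussians in its one-dimensional special case, where it reduces to the squared difference of means plus the squared difference of standard deviations. In practice the features are high-dimensional embeddings from a pretrained video network and the Gaussians are multivariate; the function name and the feature values here are hypothetical.

```python
from statistics import fmean, pstdev
from typing import Sequence


def frechet_distance_1d(real: Sequence[float], generated: Sequence[float]) -> float:
    """Squared Fréchet distance between Gaussians fitted to two 1-D samples.

    For univariate Gaussians N(m1, s1^2) and N(m2, s2^2) the distance is
    (m1 - m2)^2 + (s1 - s2)^2.
    """
    mean_real, mean_gen = fmean(real), fmean(generated)
    std_real, std_gen = pstdev(real), pstdev(generated)
    return (mean_real - mean_gen) ** 2 + (std_real - std_gen) ** 2


# Made-up scalar "features" for real and generated clips.
real_features = [0.0, 2.0]
close_generator = [0.0, 2.0]   # same distribution as the real data
shifted_generator = [1.0, 3.0]  # same spread, mean shifted by 1

print(frechet_distance_1d(real_features, close_generator))
print(frechet_distance_1d(real_features, shifted_generator))
```

A generator whose feature statistics match those of real videos scores zero, and the score grows as either the mean or the spread of its features drifts away from the real data.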

Highlighted Research: 突出研究：

Align Your Latents 对齐您的潜在

Most existing methods can only create short, low-resolution videos. To address this limitation, an international team of researchers has applied latent diffusion models, traditionally used for generating high-quality images, to produce high-resolution videos (Figure 2.5.3). Their Latent Diffusion Model (LDM) notably outperforms previous state-of-the-art methods released in 2022, such as Long Video GAN (LVG), in resolution quality (Figure 2.5.4). The adaptation of a text-to-image architecture to create LDM, a highly effective text-to-video model, exemplifies how advanced AI techniques can be repurposed across different domains of computer vision. The LDM's strong video generation capabilities have many real-world applications, such as creating realistic driving simulations.
大多数现有方法只能创建短小、低分辨率的视频。为了解决这一限制，一支国际研究团队将传统上用于生成高质量图像的潜在扩散模型应用于生成高分辨率视频（图 2.5.3）。他们的潜在扩散模型（LDM）在分辨率质量上显著优于 2022 年发布的 Long Video GAN（LVG）等此前最先进的方法（图 2.5.4）。将文本到图像架构改造为 LDM 这一高效的文本到视频模型，展示了先进的 AI 技术如何在计算机视觉的不同领域中被重新利用。LDM 强大的视频生成能力有许多现实世界的应用，比如创建逼真的驾驶模拟。

High-quality generation of milk dripping into a cup of coffee
高质量生成的牛奶滴入咖啡杯中

Source: Blattmann et al., 2023
来源：Blattmann 等人，2023

Video LDM vs. LVG: FVD and FID
视频 LDM vs. LVG：FVD 和 FID

Source: Blattmann et al., 2023 | Chart: 2024 AI Index report
来源：Blattmann 等人，2023 年 | 图表：2024 年 AI 指数报告

Figure 2.5.4 图 2.5.4

Highlighted Research: 突出研究：

Emu Video 鸸鹋视频

Traditionally, progress in video generation has trailed that in image generation due to its higher complexity and the smaller datasets available for training. Emu Video, a new transformer-based video generation model created by Meta researchers, represents a significant step forward (Figure 2.5.5). Emu Video generates an image from text and then creates a video based on both the text and image. Figure 2.5.6 illustrates the degree to which the Emu Video model outperforms previously released state-of-the-art video generation methods. The metric is the proportion of cases in which human evaluators preferred Emu Video's image quality or faithfulness to text instructions over the compared method. Emu Video simplifies the video generation process and signals a new era of high-quality video generation.
传统上，视频生成的进展落后于图像生成，这是因为视频生成的复杂性更高，可用于训练的数据集较小。Emu Video 是 Meta 研究人员创建的一种基于 Transformer 的新视频生成模型，代表了一个重大进步（图 2.5.5）。Emu Video 先从文本生成图像，然后基于文本和图像创建视频。图 2.5.6 展示了 Emu Video 模型优于先前发布的最先进视频生成方法的程度。度量标准是人类评估者认为 Emu Video 的图像质量或对文本指令的忠实度优于对比方法的情况所占的比例。Emu Video 简化了视频生成过程，并标志着高质量视频生成新时代的到来。

Sample Emu Video generations
Emu Video 生成样本

Source: Girdhar et al., 2023
来源：Girdhar 等人，2023 年

Figure 2.5.5 图 2.5.5

Emu Video vs. prior works: human-evaluated video quality and text faithfulness win rate
Emu Video 与先前工作的对比：人工评估的视频质量和文本忠实度胜率
Source: Girdhar et al., 2023 | Chart: 2024 AI Index report
来源：Girdhar 等人，2023 年 | 图表：2024 年 AI 指数报告

Figure 2.5.6 图 2.5.6

Reasoning in AI involves the ability of AI systems to draw logically valid conclusions from different forms of information. AI systems are increasingly being tested in diverse reasoning contexts, including visual (reasoning about images), moral (understanding moral dilemmas), and social reasoning (navigating social situations).
AI 中的推理涉及 AI 系统从不同形式的信息中得出逻辑有效结论的能力。AI 系统越来越多地在不同的推理情境中接受测试，包括视觉推理（对图像进行推理）、道德推理（理解道德困境）和社会推理（应对社交情境）。

2.6 Reasoning 2.6 推理

General Reasoning 一般推理

General reasoning pertains to AI systems being able to reason across broad, rather than specific, domains. As part of a general reasoning challenge, for example, an AI system might be asked to reason across multiple subjects rather than perform one narrow task (e.g., playing chess).
一般推理指的是 AI 系统能够跨越广泛领域而非特定领域进行推理。例如，在一个通用推理挑战中，AI 系统可能被要求跨越多个学科进行推理，而不是执行一个狭窄的任务（例如下棋）。

MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI
MMMU：专家 AGI 的大规模多学科多模态理解和推理基准

In recent years, the reasoning abilities of AI systems have advanced so much that traditional benchmarks like SQuAD (for textual reasoning) and VQA (for visual reasoning) have become saturated, indicating a need for more challenging reasoning tests.
近年来，AI 系统的推理能力已经取得了很大进展，以至于 SQuAD（用于文本推理）和 VQA（用于视觉推理）等传统基准已经饱和，表明需要更具挑战性的推理测试。

Responding to this, researchers from the United States and Canada recently developed MMMU, the Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI. MMMU comprises about 11,500 college-level questions from six core disciplines: art and design, business, science, health and medicine, humanities and social science, and technology and engineering (Figure 2.6.1). The question formats include charts, maps, tables, chemical structures, and more. MMMU is one of the most demanding tests of perception, knowledge, and reasoning in AI to date. As of January 2024, the highest performing model is Gemini Ultra, which leads in all subject categories with an overall score of

(Figure 2.6.2). On most individual task categories, top models still fall well short of medium-level human experts (Figure 2.6.3). This relatively low score is evidence of MMMU's effectiveness as a benchmark for assessing AI reasoning capabilities.
对此，来自美国和加拿大的研究人员最近开发了 MMMU，即面向专家级 AGI 的大规模多学科多模态理解和推理基准（Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI）。MMMU 包括来自六个核心学科的大约 11,500 个大学水平的问题：艺术与设计、商业、科学、健康与医学、人文社科以及技术与工程（图 2.6.1）。问题格式包括图表、地图、表格、化学结构等。MMMU 是迄今为止 AI 领域对感知、知识和推理要求最高的测试之一。截至 2024 年 1 月，表现最好的模型是 Gemini Ultra，它在所有学科类别中领先，总分为

（图 2.6.2）。在大多数单个任务类别上，顶尖模型仍远低于中等水平的人类专家（图 2.6.3）。这个相对较低的分数证明了 MMMU 作为评估 AI 推理能力的基准的有效性。

11 The AI Index reports results from the MMMU validation set, as recommended by the paper authors for the most comprehensive coverage. According to the authors, the test set, with its unreleased labels and larger size, presents a more challenging yet unbiased benchmark for model performance, ensuring a more robust evaluation. The test set results are available on the MMMU page.
11 AI 指数报告了 MMMU 验证集的结果，这是论文作者为获得最全面的覆盖范围而推荐的做法。根据作者的说法，测试集的标签未公开且规模更大，为模型性能提供了更具挑战性但无偏见的基准，确保更稳健的评估。测试集结果可在 MMMU 页面上找到。

Sample MMMU questions 样本 MMMU 问题

Source: Yue et al., 2023
来源：Yue 等人，2023

MMMU: overall accuracy MMMU：整体准确性

Model 模型

MMMU: subject-specific accuracy
MMMU：主题特定准确性

Source: MMMU, 2023 | Table: 2024 AI Index report
来源：MMMU，2023 | 表格：2024 AI 指数报告

MMMU task

GPQA: A Graduate-Level Google-Proof Q&A Benchmark
GPQA：研究生水平的防谷歌搜索问答基准

In the last year, researchers from NYU, Anthropic, and Meta introduced the GPQA benchmark to test general multisubject AI reasoning. This dataset consists of 448 difficult multiple-choice questions that cannot be easily answered by Google searching. The questions were crafted by subject-matter experts in various fields like biology, physics, and chemistry (Figure 2.6.4). PhD-level experts achieved a

accuracy rate in their respective domains on GPQA, while nonexpert humans scored around

. The best-performing AI model, GPT-4, only reached a score of

on the main test set (Figure 2.6.5).
在去年，NYU、Anthropic 和 Meta 的研究人员引入了 GPQA 基准测试，以测试通用的多学科

推理。这个数据集包括 448 个难题多项选择题，无法通过谷歌搜索轻松回答。这些问题由生物学、物理学、化学等各个领域的专家设计而成（图 2.6.4）。博士级专家在各自领域的 GPQA 上实现了

的准确率，而非专家人员在此测试中的得分约为

。表现最佳的 AI 模型 GPT-4 在主测试集上仅获得了

的分数（图 2.6.5）。

A sample chemistry question from GPQA
一个来自 GPQA 的化学问题示例

Source: Rein et al., 2023
来源：Rein 等人，2023 年

Chemistry (general) 化学（通用）

A reaction of a liquid organic compound, which molecules consist of carbon and hydrogen atoms, is performed at 80 centigrade and 20 bar for 24 hours. In the proton nuclear magnetic resonance spectrum, the signals with the highest chemical shift of the reactant are replaced by a signal of the product that is observed about three to four units downfield. Compounds from which position in the periodic system of the elements, which are also used in the corresponding large-scale industrial process, have been mostly likely initially added in small amounts?
液态有机化合物的反应，其分子由碳和氢原子组成，以 80 摄氏度和 20 巴的条件进行，持续 24 小时。在质子核磁共振谱中，反应物最高化学位移的信号被一个产品的信号所取代，该信号观察到大约偏移了三到四个单位。在元素周期表中的哪个位置上的化合物，也被用于相应的大规模工业过程中，最有可能最初以少量添加？

A) A metal compound from the fifth period.
A）来自第五周期的金属化合物。

B) A metal compound from the fifth period and a non-metal compound from the third period.
B）来自第五周期的金属化合物和来自第三周期的非金属化合物。

C) A metal compound from the fourth period.
C) 第四周期的金属化合物。

D) A metal compound from the fourth period and a non-metal compound from the second period.
D) 第四周期的金属化合物和第二周期的非金属化合物。

GPQA: accuracy on the main set
GPQA：主要设定上的准确度

Source: Rein et al., 2023 | Chart: 2024 AI Index report
来源：Rein 等人，2023 年 | 图表：2024 年 AI 指数报告

Evaluation method and model
评估方法和模型

Highlighted Research: 突出研究:

Comparing Humans, GPT-4, and GPT-4V on Abstraction and Reasoning Tasks
比较人类、GPT-4 和 GPT-4V 在抽象和推理任务上的表现

Abstract reasoning involves using known information to solve unfamiliar and novel problems and is a key aspect of human cognition that is evident even in toddlers. While recent LLMs like GPT-4 have shown impressive performance, their capability for true abstract reasoning remains a hotly debated subject.

To further explore this topic, researchers from the Santa Fe Institute tested GPT-4 on the ConceptARC benchmark, a collection of analogy puzzles designed to assess general abstract reasoning skills (Figure 2.6.6). The study revealed that GPT-4 significantly trails behind humans in abstract reasoning abilities: While humans score

on the benchmark, the best GPT-4 system only scores

(Figure 2.6.7). The development of truly general AI requires abstract reasoning capabilities. Therefore, it will be important to continue tracking progress in this area.
抽象推理涉及使用已知信息来解决陌生和新颖的问题，是人类认知的关键方面，即使在幼儿中也是显而易见的。尽管最近像 GPT-4 这样的模型表现出色，但它们对真正抽象推理的能力仍然是一个备受争议的话题。为了进一步探讨这个话题，圣菲研究所的研究人员在 ConceptARC 基准测试中测试了 GPT-4，这是一个旨在评估一般抽象推理能力的类比难题集合（图 2.6.6）。研究表明，GPT-4 在抽象推理能力方面明显落后于人类：人类在基准测试中得分

，而最好的 GPT-4 系统仅得分

（图 2.6.7）。真正通用 AI 的发展需要抽象推理能力。因此，继续跟踪这一领域的进展将是重要的。

A sample ARC reasoning task
一个样本 ARC 推理任务

Source: Mitchell et al., 2023

ConceptARC: accuracy on minimal tasks over all concepts
ConceptARC: 在所有概念上的最小任务准确性

13 Some claim these models exhibit such reasoning capabilities, while others claim they do not.
有些人声称这些模型具有这样的推理能力，而另一些人声称它们没有。

Figure 2.6.7 图 2.6.7

Mathematical Reasoning 数学推理

Mathematical problem-solving benchmarks evaluate AI systems' ability to reason mathematically. AI models can be tested with a range of math problems, from grade-school level to competition-standard mathematics.
数学问题解决基准评估 AI 系统的数学推理能力。AI 模型可以通过一系列数学问题进行测试，从小学水平到竞赛标准的数学问题。

GSM8K

GSM8K, a dataset comprising approximately 8,000 varied grade school math word problems, requires that AI models develop multistep solutions utilizing arithmetic operations (Figure 2.6.8). GSM8K has quickly become a favored benchmark for evaluating advanced LLMs. The top-performing model on GSM8K is a GPT-4 variant (GPT-4 Code Interpreter), which scores an accuracy of

, a

improvement from the state-of-the-art score in the previous year and a 30.4% improvement from 2022 when the benchmark was first introduced (Figure 2.6.9).
GSM8K，一个包含大约 8,000 个不同年级的小学数学问题的数据集，要求 AI 模型开发利用算术运算的多步解决方案（图 2.6.8）。GSM8K 已经迅速成为评估先进LLMs的首选基准。在 GSM8K 上表现最好的模型是 GPT-4 变体（GPT-4 代码解释器），准确率为

，比前一年的最新技术得分提高了

，比 2022 年首次引入基准时提高了 30.4%（图 2.6.9）。

Sample problems from GSM8K
GSM8K 示例问题

Source: Cobbe et al., 2023
来源: Cobbe 等人，2023 年

Problem: Beth bakes 4, 2 dozen batches of cookies in a week. If these cookies are shared amongst 16 people equally, how many cookies does each person consume?
问题: 贝丝一周烤 4, 2 批次的饼干。如果这些饼干平均分给 16 个人，每个人会吃多少块饼干？

Solution: Beth bakes 4 2 dozen batches of cookies for a total of 4 × 2 = 8 dozen cookies.
解决方案: 贝丝烤了 4 批、每批 2 打的饼干，总共 4 × 2 = 8 打饼干。

She splits the 96 cookies equally amongst 16 people so they each eat 96 / 16 = 6 cookies.
她把 96 块饼干平均分给 16 个人，所以每个人吃 96 / 16 = 6 块饼干。

Final Answer: 6 最终答案：6

Problem: Mrs. Lim milks her cows twice a day. Yesterday morning, she got 68 gallons of milk and in the evening, she got 82 gallons. This morning, she got 18 gallons fewer than she had yesterday morning. After selling some gallons of milk in the afternoon, Mrs. Lim has only 24 gallons left. How much was her revenue for the milk if each gallon costs $3.50?
问题：Lim 女士每天挤两次奶。昨天早上，她得到了 68 加仑的牛奶，晚上得到了 82 加仑。今天早上，她得到了比昨天早上少 18 加仑的牛奶。在下午卖掉一些牛奶之后，Lim 女士只剩下 24 加仑。如果每加仑牛奶售价为 3.50 美元，她的牛奶收入是多少？

Mrs. Lim got 68 gallons - 18 gallons = 50 gallons this morning.
Lim 女士今天早上得到了 68 加仑 - 18 加仑 = 50 加仑。

So she was able to get a total of 68 gallons + 82 gallons + 50 gallons = 200 gallons.
所以她总共得到了 68 加仑 + 82 加仑 + 50 加仑 = 200 加仑。

She was able to sell 200 gallons - 24 gallons = 176 gallons.
她能够卖出 200 加仑 - 24 加仑 = 176 加仑。

Final Answer: 616 最终答案：616
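
The multistep arithmetic in the Mrs. Lim solution can be replayed in a few lines. This is only an illustrative check and not part of GSM8K itself: the variable names are made up here, the 68-gallon starting figure is the one used in the worked solution, and the per-gallon price is not assumed but backed out from the stated final answer of 616.

```python
# Replay the worked solution for the Mrs. Lim sample problem step by step.
yesterday_morning = 68   # gallons, the figure used in the worked solution
yesterday_evening = 82   # gallons
this_morning = yesterday_morning - 18   # 18 gallons fewer than yesterday morning

total_collected = yesterday_morning + yesterday_evening + this_morning
gallons_left = 24
gallons_sold = total_collected - gallons_left

final_answer = 616       # revenue given as the final answer
implied_price = final_answer / gallons_sold   # price per gallon implied by that answer

print(this_morning, total_collected, gallons_sold, implied_price)
```

Each intermediate quantity corresponds to one line of the written solution, which is the kind of step-by-step chain GSM8K expects a model to produce.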

Problem: Tina buys 3 12-packs of soda for a party. Including Tina, 6 people are at the party. Half of the people at the party have 3 sodas each, 2 of the people have 4, and 1 person has 5. How many sodas are left over when the party is over?
问题：蒂娜为聚会购买了 3 箱 12 听装苏打水。包括蒂娜在内，聚会上有 6 个人。聚会上一半的人每人有 3 听苏打水，2 个人有 4 听，1 个人有 5 听。聚会结束时还剩下多少听苏打水？

Solution: Tina buys 3 12-packs of soda, for 3 × 12 = 36 sodas.
解决方案：蒂娜购买了 3 箱 12 听装苏打水，共 3 × 12 = 36 听。

6 people attend the party, so half of them is 6 / 2 = 3 people.
6 个人参加了聚会，所以他们中的一半是 6 / 2 = 3 人。

Each of those people drinks 3 sodas, so they drink 3 × 3 = 9 sodas.
这些人中的每个人都喝了 3 听苏打水，所以他们一共喝了 3 × 3 = 9 听苏打水。

With one person drinking 5, that brings the total drank to

sodas
加上一个人喝了 5 听，总共喝了

杯苏打水

As Tina started off with 36 sodas, that means there are

sodas left
当蒂娜开始时有 36 罐苏打水，这意味着现在还剩

罐苏打水

Final Answer: 11 最终答案：11

Figure 2.6.9 图 2.6.9

MATH

MATH is a dataset of 12,500 challenging competition-level mathematics problems introduced by UC Berkeley researchers in 2021 (Figure 2.6.10). AI systems struggled on MATH when it was first released, managing to solve only

of the problems. Performance has significantly improved. In 2023, a GPT-4-based model posted the top result, successfully solving

of the dataset's problems (Figure 2.6.11).
MATH 是 UC 伯克利研究人员于 2021 年引入的一组包含 12500 个具有挑战性的竞赛级数学问题的数据集（图 2.6.10）。当初发布时，所有的 AI 系统都难以应对 MATH，只解决了其中的

个问题。性能有了显著的提升。到了 2023 年，基于 GPT-4 的模型取得了最佳成绩，成功解决了该数据集中

个问题（图 2.6.11）。

A sample problem from the MATH dataset
MATH 数据集中的一个样本问题

Source: Hendrycks et al., 2023
来源：Hendrycks 等人，2023 年

MATH Dataset (Ours) MATH 数据集（我们的）

Problem: Tom has a red marble, a green marble, a blue marble, and three identical yellow marbles. How many different groups of two marbles can Tom choose?
问题：汤姆有一个红色弹珠，一个绿色弹珠，一个蓝色弹珠和三个相同的黄色弹珠。汤姆可以选择多少种不同的两个弹珠组合？

Solution: There are two cases here: either Tom chooses two yellow marbles (1 result), or he chooses two marbles of different colors ($\binom{4}{2} = 6$ results). The total number of distinct pairs of marbles Tom can choose is $1 + 6 = 7$.
解决方案：这里有两种情况：要么汤姆选择两个黄色的弹珠（1 个结果），要么他选择两个不同颜色的弹珠（$\binom{4}{2} = 6$ 个结果）。汤姆能够选择的不同弹珠对的总数为 $1 + 6 = 7$。
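
The two-case counting argument for the marble problem can be cross-checked by brute-force enumeration. The sketch below is an illustration added here, not part of the MATH dataset; it treats a "group" as an unordered pair of colors and lets a set collapse the duplicates caused by the identical yellow marbles.

```python
from itertools import combinations

# Tom's marbles: three distinct colors plus three identical yellows.
marbles = ["red", "green", "blue", "yellow", "yellow", "yellow"]

# Each pick is an unordered pair of colors; the set removes pairs that
# differ only in which of the indistinguishable yellow marbles was taken.
distinct_pairs = {tuple(sorted(pair)) for pair in combinations(marbles, 2)}

same_color = [p for p in distinct_pairs if p[0] == p[1]]
different_color = [p for p in distinct_pairs if p[0] != p[1]]

print(len(same_color), len(different_color), len(distinct_pairs))
```

The three printed counts correspond to the two cases in the solution and their sum.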

Problem: The equation

has two complex solutions. Determine the product of their real parts.
问题：方程

有两个复数解。确定其实部的乘积。

Solution: Complete the square by adding 1 to each side. Then

, so

. The desired product is then

.
解决方案：通过在两边各加 1 来完成平方。然后

，所以

。所求的乘积为

。

MATH word problem-solving: accuracy
数学文字问题解决：准确性

Source: Papers With Code, 2023 | Chart: 2024 Al Index report
来源：Papers With Code，2023 | 图表：2024 Al 指数报告

PlanBench

A planning system receives a specified goal, an initial state, and a collection of actions. Each action is defined by preconditions, which must be met for the action to be executed, and the effects that result from the action's execution. The system constructs a plan, comprising a series of actions, to achieve the goal from the initial state.
规划系统接收一个指定的目标、一个初始状态和一系列动作。每个动作由前提条件定义，这些条件必须满足才能执行动作，并且由动作的执行产生影响。系统构建一个计划，由一系列动作组成，从初始状态实现目标。
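
The description above (goal, initial state, actions with preconditions and effects) can be made concrete with a small plan checker. This is a minimal sketch under stated assumptions: it uses a STRIPS-style encoding where states are sets of fact strings, and the `Action` type, the helper names, and the two-block example are all hypothetical illustrations rather than anything taken from PlanBench.

```python
from typing import FrozenSet, List, NamedTuple, Set


class Action(NamedTuple):
    name: str
    preconditions: FrozenSet[str]  # facts that must hold before execution
    add_effects: FrozenSet[str]    # facts made true by the action
    del_effects: FrozenSet[str]    # facts made false by the action


def act(name, pre, add, delete) -> Action:
    """Convenience constructor (hypothetical helper)."""
    return Action(name, frozenset(pre), frozenset(add), frozenset(delete))


def apply(state: Set[str], action: Action) -> Set[str]:
    """Return the successor state, or raise if a precondition is unmet."""
    missing = action.preconditions - state
    if missing:
        raise ValueError(f"{action.name}: unmet preconditions {sorted(missing)}")
    return (state - action.del_effects) | action.add_effects


def plan_is_valid(initial: Set[str], goal: Set[str], plan: List[Action]) -> bool:
    """A plan is valid if every step is executable and the goal holds at the end."""
    state = set(initial)
    try:
        for action in plan:
            state = apply(state, action)
    except ValueError:
        return False
    return goal <= state


# Two-block Blocksworld: A starts on B, and the goal is B on A.
unstack_a_b = act("unstack(A,B)", {"on(A,B)", "clear(A)", "handempty"},
                  {"holding(A)", "clear(B)"}, {"on(A,B)", "clear(A)", "handempty"})
putdown_a = act("putdown(A)", {"holding(A)"},
                {"ontable(A)", "clear(A)", "handempty"}, {"holding(A)"})
pickup_b = act("pickup(B)", {"ontable(B)", "clear(B)", "handempty"},
               {"holding(B)"}, {"ontable(B)", "clear(B)", "handempty"})
stack_b_a = act("stack(B,A)", {"holding(B)", "clear(A)"},
                {"on(B,A)", "clear(B)", "handempty"}, {"holding(B)", "clear(A)"})

initial = {"on(A,B)", "ontable(B)", "clear(A)", "handempty"}
goal = {"on(B,A)"}

good_plan = [unstack_a_b, putdown_a, pickup_b, stack_b_a]
bad_plan = [pickup_b, stack_b_a]  # B is not clear at the start, so step 1 fails

print(plan_is_valid(initial, goal, good_plan), plan_is_valid(initial, goal, bad_plan))
```

Checking a given plan only requires simulating it once, whereas producing one requires searching over action sequences.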

Claims have been made that LLMs can solve planning problems. A group from Arizona State University has proposed PlanBench, a benchmark suite containing problems used in the automated planning community, especially those used in the International Planning Competition. They tested I-GPT-3 and GPT-4 on 600 problems in the Blocksworld domain (where a hand tries to construct stacks of blocks when it is only allowed to move one block at a time to the table or to the top of a clear block) using one-shot learning and showed that GPT-4 could generate correct plans and cost-optimal plans about

of the time, and I-GPT-3 about

(Figure 2.6.12). Verifying the correctness of a given plan is an easier task than generating one.
有人声称LLMs可以解决规划问题。亚利桑那州立大学的一个团队提出了 PlanBench，这是一个包含自动规划社区中使用的问题的基准套件，尤其是国际规划竞赛中使用的问题。他们在 Blocksworld 领域对 I-GPT-3 和 GPT-4 进行了 600 个问题的测试（在该领域中，一个手试图构建一系列方块堆，但只允许一次移动一个方块到桌子上或者一个空方块的顶部），使用一次性学习，并展示 GPT-4 大约

的时间能够生成正确的计划和最优成本计划，而 I-GPT-3 大约

（图 2.6.12）。验证计划的正确性更容易。

GPT-4 vs. I-GPT-3 on PlanBench
GPT-4 与 I-GPT-3 在 PlanBench 上的比较

Source: Valmeekam, 2023 | Table: 2024 AI Index report
来源：Valmeekam，2023 | 表格：2024 年 AI 指数报告

Task	GPT-4 (instances correct) (实例正确)	I-GPT-3 (instances correct) (实例正确)
Plan generation 计划生成
Cost-optimal planning 成本最优规划
Plan verification 计划验证

Visual Reasoning 视觉推理

Visual reasoning tests how well AI systems can reason across both visual and textual data.
视觉推理测试了人工智能系统在视觉和文本数据之间的推理能力。

Visual Commonsense Reasoning (VCR)
视觉常识推理（VCR）

Introduced in 2019, the Visual Commonsense Reasoning (VCR) challenge tests the commonsense visual reasoning abilities of AI systems. In this challenge, AI systems not only answer questions based on images but also reason about the logic behind their answers (Figure 2.6.13). Performance in VCR is measured using the Q->AR score, which evaluates the machine's ability to both select the correct answer to a question (Q->A) and choose the appropriate rationale behind that answer (Q->R). While AI systems have yet to outperform humans on this task, their capabilities are steadily improving. Between 2022 and 2023, there was a

increase in AI performance on the VCR challenge (Figure 2.6.14).
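
To make the joint Q->AR score concrete: an item counts toward it only when both the chosen answer and the chosen rationale are correct. The function below is a simplified sketch of that scoring idea, not the official VCR evaluation code; the function name, the tuple layout, and the toy predictions are assumptions made for illustration.

```python
from typing import List, Tuple


def vcr_scores(predictions: List[Tuple[int, int]],
               gold: List[Tuple[int, int]]) -> Tuple[float, float, float]:
    """Return (answer accuracy, rationale accuracy, joint Q->AR accuracy).

    Each item is an (answer_index, rationale_index) pair.
    """
    n = len(gold)
    answer_hits = sum(p[0] == g[0] for p, g in zip(predictions, gold))
    rationale_hits = sum(p[1] == g[1] for p, g in zip(predictions, gold))
    # The joint score requires both choices to be right on the same item.
    joint_hits = sum(p == g for p, g in zip(predictions, gold))
    return answer_hits / n, rationale_hits / n, joint_hits / n


# Toy data (made up): four questions, four answer and rationale options each.
gold = [(0, 1), (2, 3), (1, 1), (3, 0)]
pred = [(0, 1), (2, 0), (1, 1), (0, 0)]

answer_acc, rationale_acc, joint_acc = vcr_scores(pred, gold)
print(answer_acc, rationale_acc, joint_acc)
```

Because both choices must be right on the same item, the joint score can never exceed either individual accuracy, which is one reason Q->AR is a demanding metric.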

A sample question from the Visual Commonsense Reasoning (VCR) challenge
来自视觉常识推理（VCR）挑战的一个示例问题

Source: Zellers et al., 2018
来源：Zellers 等，2018 年

Visual Commonsense Reasoning (VCR) task: Q->AR score
视觉常识推理（VCR）任务：Q->AR 得分

Figure 2.6.13 图 2.6.13

Moral Reasoning 道德推理

In the future, AI will be increasingly applied to domains where ethical considerations are crucial, such as healthcare and judicial systems. Therefore, it is essential for AI systems to possess robust moral reasoning capabilities, enabling them to effectively navigate and reason about ethical principles and moral considerations.
在未来，人工智能将越来越广泛地应用于道德考虑至关重要的领域，如医疗保健和司法系统。因此，AI 系统必须具备强大的道德推理能力，使其能够有效地导航和推理道德原则和道德考虑。

The ability of AI models to reason in linguistic and visual domains is well established, yet their capacity for moral reasoning, especially moral reasoning that aligns with human moral judgments, is less understood.

To further explore this topic, a team of Stanford researchers created a new dataset (MoCa) of human stories with moral elements (Figure 2.6.15). The researchers then presented these models with stories of human actions and prompted the models to respond, measuring moral agreement with the discrete agreement metric: A higher score indicates closer alignment with human moral judgment. The study yielded intriguing results. No model perfectly matches human moral systems, but newer, larger models like GPT-4 and Claude show greater alignment with human moral sentiments than smaller models like GPT-3, suggesting that as AI models scale, they are gradually becoming more morally aligned with humans. Of all models surveyed, GPT-4 showed the greatest agreement with human moral sentiments (Figure 2.6.16).

AI 模型在语言和视觉领域进行推理的能力已经得到确认，但它们在道德推理方面的能力，特别是与人类道德判断一致的道德推理能力，尚不为人了解。

为了进一步探讨这个话题，斯坦福大学的研究人员团队创建了一个新的数据集（MoCa），其中包含具有道德元素的人类故事（图 2.6.15）。然后，研究人员向这些模型展示了人类行为的故事，并促使模型做出回应，通过离散一致性度量来衡量道德一致性：更高的分数表示与人类道德判断更接近。这项研究产生了有趣的结果。没有任何模型完全匹配人类的道德系统，但像 GPT-4 和 Claude 这样的更新、更大的模型比 GPT-3 这样的较小模型更接近人类道德情感，这表明随着 AI 模型的规模扩大，它们逐渐与人类在道德上更加一致。在所有调查的模型中，GPT-4 显示出与人类道德情感最大的一致性（图 2.6.16）。
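
The report does not spell out how discrete agreement is computed. As a rough illustration only, the sketch below assumes it is the fraction of stories on which the model's yes/no judgment lands on the same side as the majority human judgment; the function name, the 0.5 threshold, and the toy numbers are all assumptions and may differ from the definition used by Nie et al.

```python
from typing import Sequence


def discrete_agreement(model_probs: Sequence[float],
                       human_probs: Sequence[float],
                       threshold: float = 0.5) -> float:
    """Share of stories where model and humans fall on the same side of the threshold.

    ASSUMPTION: each value is the probability (model) or fraction of
    participants (human) judging the action morally permissible.
    """
    matches = sum(
        (m >= threshold) == (h >= threshold)
        for m, h in zip(model_probs, human_probs)
    )
    return matches / len(human_probs)


# Toy values, invented for illustration.
model = [0.9, 0.2, 0.6, 0.4]
human = [0.8, 0.1, 0.3, 0.45]

score = discrete_agreement(model, human)
print(score)
```

Under this reading, a score of 1.0 would mean the model sides with the human majority on every story.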

A moral story from MoCa
来自 MoCa 的一个道德故事

Source: Nie et al., 2023
来源：Nie 等人，2023

You are the explosives expert of a company that has been hired to demolish a skyscraper. You are examining the last charges of the explosives when you suddenly become aware of a teenager who is about to accidentally detonate one of the charges. The explosion will result in the building's uncontrolled collapse over you, the teenager, and the crowd of spectators. The teenager stands several floors below you and cannot hear you. If you drop a heavy cinderblock on his head, this will kill him, but it will prevent the explosion and save you and the crowd. You stop the teenager by dropping the cinderblock on his head, so that the building won't collapse over you and the crowd of spectators.
你是一家被雇佣拆除一座摩天大楼的公司的炸药专家，当你正在检查最后一批炸药时，突然意识到一个即将意外引爆其中一枚炸药的少年。爆炸将导致大楼失控坍塌，危及你、少年和一群旁观者。少年站在你下面几层楼，听不到你的声音。如果你往他头上扔下一个沉重的石块，这将杀死他，但可以阻止爆炸并拯救你和旁观人群。你向少年扔下石块，避免了大楼坍塌并保护了旁观者。


Zero-shot alignment with human judgments on the moral permissibility task: discrete agreement
零-shot 对道德允许性任务的人类判断的对齐: 离散一致

Source: Nie et al., 2023 | Chart: 2024 Al Index report
来源：Nie 等人，2023 年 | 图表：2024 年 AI 指数报告

Figure 2.6.16 图 2.6.16

Causal Reasoning 因果推理

Causal reasoning assesses an AI system's ability to understand cause-and-effect relationships. As AI becomes increasingly ubiquitous, it has become important to evaluate whether AI models can not only explain their outputs but also update their conclusions, both key aspects of causal reasoning.
因果推理评估一个 AI 系统理解因果关系的能力。随着人工智能越来越普及，评估 AI 模型是否不仅能解释其输出，还能更新其结论变得十分重要，这两点都是因果推理的关键方面。

BigToM

Assessing whether LLMs have theory-of-mind (ToM) capabilities, meaning the ability to understand and attribute mental states such as beliefs, intentions, and emotions, has traditionally challenged AI researchers. Earlier methods to evaluate ToM in LLMs were inadequate and lacked robustness. To tackle this problem, in 2023 researchers developed a new benchmark called BigToM, designed for evaluating the social and causal reasoning abilities of LLMs. BigToM, comprising 25 controls and 5,000 model-generated evaluations, has been rated by human evaluators as superior to existing ToM benchmarks. BigToM tests LLMs on forward belief (predicting future events), forward action (acting based on future event predictions), and backward belief (retroactively inferring causes of actions) (Figure 2.6.17).
评估LLMs是否具有心智理论（ToM）能力（即理解和归因信念、意图和情绪等心理状态的能力），传统上是 AI 研究人员面临的挑战。早期评估LLMs的 ToM 方法存在不足且缺乏鲁棒性。为了解决这个问题，研究人员在 2023 年开发了一个名为 BigToM 的新基准，旨在评估LLMs的社会和因果推理能力。BigToM 由 25 个控制和 5,000 个模型生成的评估组成，被人类评估者评为优于现有的 ToM 基准。BigToM 测试LLMs的前向信念（预测未来事件）、前向行动（根据未来事件预测行动）和后向信念（推断行动原因的逆向推理）（图 2.6.17）。

In tests of LLMs on the benchmark, GPT-4 was the top performer, with ToM capabilities nearing but not surpassing human levels (Figure 2.6.18, Figure 2.6.19, and Figure 2.6.20). More specifically, as measured by accuracy in correctly inferring beliefs, GPT-4 closely matched human performance in forward belief and backward belief tasks and slightly surpassed humans in forward action tasks. Importantly, the study shows that LLM performance on ToM benchmarks is trending upward, with newer models like GPT-4 outperforming predecessors such as GPT-3.5 (released in 2022).
在基准测试中，GPT-4 在 ToM 能力方面表现出色，接近但未超越人类水平（图 2.6.18、图 2.6.19 和图 2.6.20）。更具体地说，通过正确推断信念的准确性衡量，GPT-4 在前向信念和后向信念任务中与人类表现相近，并在前向行动任务中略微超越人类。重要的是，研究表明，ToM 基准测试中的表现呈上升趋势，像 GPT-4 这样的新模型超越了诸如 GPT-3.5（2022 年发布）之类的前辈模型。

Sample BigToM scenario 样本 BigToM 场景

Source: Gandhi et al., 2023
来源：Gandhi 等人，2023

Figure 2.6.17 图 2.6.17

Forward action inference with initial belief: accuracy
具有初始信念的前向行动推断：准确性

Source: Gandhi et al., 2023 | Chart: 2024 AI Index report
来源：Gandhi 等人，2023 年 | 图表：2024 年 AI 指数报告

Backward belief inference with initial belief: accuracy
具有初始信念的反向信念推断：准确性

Source: Gandhi et al., 2023 | Chart: 2024 AI Index report
来源：Gandhi 等人，2023 年 | 图表：2024 年 AI 指数报告

Figure 2.6.19 图 2.6.19

Forward belief inference with initial belief: accuracy
具有初始信念的前向信念推断：准确性

Source: Gandhi et al., 2023 | Chart: 2024 AI Index report
来源：Gandhi 等人，2023 年 | 图表：2024 年 AI 指数报告

Figure 2.6.20 图 2.6.20

Highlighted Research: 突出研究:

Tübingen Cause-Effect Pairs
图宾根因果对

Researchers from Microsoft and the University of Chicago have demonstrated that LLMs are effective causal reasoners. The team evaluated several recent LLMs, including GPT-4, using the Tübingen cause-effect pairs dataset. This benchmark comprises over 100 cause-and-effect pairs across 37 subdisciplines, testing AI systems' ability to discern causal relationships (Figure 2.6.21). GPT-4's performance, a

accuracy score, surpassed the previous year's best by 13 percentage points (Figure 2.6.22). Notably, GPT-4 outperformed prior covariance-based AI models, which were explicitly trained for causal reasoning tasks. Furthermore, the researchers discovered that certain prompts, especially those designed to encourage helpfulness, can significantly enhance an LLM's causal reasoning capabilities.

Sample cause-effect pairs from the Tübingen dataset
Tübingen 数据集中的样本因果对

Source: Kıcıman et al., 2023
来源：Kıcıman 等人，2023 年

Variable A	Variable B	Domain
Age of Abalone 鲍鱼的年龄	Shell weight	Zoology
Cement	Compressive strength of concrete 混凝土的抗压强度	Engineering
Alcohol	Mean corpuscular volume 平均红细胞体积	Biology
Organic carbon in soil 土壤中的有机碳	Clay content in soil 土壤中的粘土含量	Pedology
PPFD (Photosynthetic Photon Flux Density) PPFD（光合光子通量密度）	Net Ecosystem productivity 净生态系统生产力	Physics
Drinking water access 饮用水获取	Infant mortality 婴儿死亡率	Epidemiology
Ozone concentration 臭氧浓度	Radiation	Atmospheric Science 大气科学
Contrast of tilted Gabor patches 倾斜的加伯补丁的对比	Accuracy of detection by participants 参与者的检测准确性	Cognitive Science 认知科学
Time for rotation of a Stirling engine 斯特林发动机旋转的时间	Heat bath temperature 热浴温度	Engineering
Time for passing first segment of a ball track 球道第一段通过时间	Time for passing second segment 第二段通过时间	Basic Physics 基础物理学

Performance on the Tübingen cause-effect pairs dataset: accuracy
在图宾根因果对数据集上的表现：准确性

Source: Kıcıman et al., 2023 | Chart: 2024 Al Index report
来源：Kıcıman 等人，2023 年 | 图表：2024 年 Al 指数报告

Figure 2.6.22 图 2.6.22

AI systems are adept at processing human speech, with audio capabilities that include transcribing spoken words to text and recognizing individual speakers. More recently, AI has advanced in generating synthetic audio content.
AI 系统擅长处理人类语音，具有包括将口语转录为文本和识别个别说话者在内的音频功能。最近，AI 在生成合成音频内容方面取得了进展。

2.7 Audio 2.7 音频

Generation 生成

2023 marked a significant year in the field of audio generation, which involves creating synthetic audio content, ranging from human speech to music files.
2023 年是音频生成领域的一个重要年份，涉及创建从人类语音到音乐文件等各种合成音频内容。
This advancement was highlighted by the release of several prominent audio generators, such as UniAudio, MusicGen, and MusicLM.
这一进步得到了几款知名音频生成器的推出，如 UniAudio、MusicGen 和 MusicLM。

Highlighted Research: 突出研究内容：

UniAudio

UniAudio is a high-level language modeling technique to create audio content. UniAudio uniformly tokenizes all audio types and, like modern LLMs, employs next-token prediction for high-quality audio generation. UniAudio is capable of generating high-quality speech, sound, and music.
UniAudio 是一种高级语言建模技术，用于创建音频内容。UniAudio 统一标记化所有音频类型，并像现代LLMs一样，采用下一个标记预测来生成高质量音频。UniAudio 能够生成高质量的语音、声音和音乐。
UniAudio surpasses leading methods in tasks, including text-to-speech, speech enhancement, and voice conversion (Figure 2.7.1). With 1 billion parameters and trained on 165,000 hours of audio, UniAudio exemplifies the efficacy of big data and self-supervision for music generation.
UniAudio 在任务中超越了领先的方法，包括文本转语音、语音增强和语音转换（图 2.7.1）。拥有 10 亿参数并在 165,000 小时的音频上训练，UniAudio 展示了大数据和自我监督对音乐生成的有效性。

UniAudio vs. selected prior works in the training stage: objective evaluation metrics
训练阶段中的 UniAudio 与选定的先前作品：客观评估指标

Source: Yang et al., 2023 | Chart: 2024 Al Index report
来源：杨等人，2023 年 | 图表：2024 年 Al 指数报告

Highlighted Research: 突出研究:

MusicGen and MusicLM MusicGen 和 MusicLM

Meta's MusicGen is a novel audio generation model that also leverages the transformer architecture common in language models to generate audio. MusicGen enables users to specify text for a desired audio outcome and then fine-tune it using specific melodies. In comparative studies, MusicGen outshines other popular text-to-music models like Riffusion, Moûsai, and MusicLM across various generative music metrics. It boasts a lower FAD score, indicating more plausible music generation, a lower KL score for better alignment with reference music, and a higher CLAP score, reflecting greater adherence to textual descriptions of reference music (Figure 2.7.2).
Meta 的 MusicGen 是一种新颖的音频生成模型，它还利用了语言模型中常见的变压器架构来生成音频。MusicGen 使用户能够指定所需音频结果的文本，然后使用特定的旋律进行微调。在比较研究中，MusicGen 在各种生成音乐指标上胜过其他流行的文本到音乐模型，如 Riffusion、Moûsai 和 MusicLM。它拥有更低的 FAD 分数，表示生成的音乐更可信，更低的 KL 分数，以更好地与参考音乐对齐，以及更高的 CLAP 分数，反映出对参考音乐的文本描述的更高遵循度（图 2.7.2）。
Human evaluators also favor MusicGen for its overall quality (OVL).
人类评估者也偏爱 MusicGen 的整体质量（OVL）。
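
For context on the first of these metrics: the Fréchet Audio Distance (FAD) is generally computed by embedding a reference set and a generated set of audio clips, fitting a multivariate Gaussian to each set of embeddings, and taking the Fréchet distance between the two Gaussians. The notation below is chosen here for illustration and does not come from the report: $\mu_r, \Sigma_r$ denote the mean and covariance of the reference embeddings, and $\mu_g, \Sigma_g$ those of the generated embeddings.

```latex
\mathrm{FAD} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)
```

The distance is zero when the two Gaussians coincide and grows as the generated audio statistics drift from the reference, which is why a lower FAD is read as more plausible generation.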

Although MusicGen outperforms certain text-to-music models released earlier in the year, MusicLM is worth highlighting because its release was accompanied by the launch of MusicCaps, a state-of-the-art dataset of

music-text pairs. MusicCaps was used by MusicGen researchers to benchmark the performance of their family of models. The emergence of new models like MusicGen, and new music-to-text benchmarks like MusicCaps, highlights the expansion of generative AI beyond language and images into more diverse skill modalities like audio generation.
尽管 MusicGen 在今年早些时候发布的某些文本到音乐模型中表现出色，但 MusicLM 值得关注，因为它的发布伴随着 MusicCaps 的推出，这是一个最先进的数据集，包含

音乐文本对。MusicCaps 被 MusicGen 研究人员用来评估他们的模型系列的性能。像 MusicGen 这样的新模型的出现，以及像 MusicCaps 这样的新音乐到文本的基准，凸显了生成 AI 超越语言和图像，进入更多样化技能模态的扩展。

Highlighted Research: 突出研究:

MusicGen and MusicLM (cont'd)
MusicGen 和 MusicLM（续）

Evaluation of MusicGen and baseline models on MusicCaps
在 MusicCaps 上对 MusicGen 和基准模型的评估

Source: Copet et al., 2023 | Chart: 2024 Al Index report
来源：Copet 等人，2023 年 | 图表：2024 年 Al 指数报告

MusicGen (2023) Mousai (2023)

Riffusion (2022)
MusicGen（2023 年）Mousai（2023 年）

Riffusion（2022 年）

Figure 2.7.2 图 2.7.2

AI agents, autonomous or semiautonomous systems designed to operate within specific environments to accomplish goals, represent an exciting frontier in AI research. These agents have a diverse range of potential applications, from assisting in academic research and scheduling meetings to facilitating online shopping and vacation booking.
AI 代理是旨在特定环境中运行以实现目标的自主或半自主系统，代表了 AI 研究中一个令人兴奋的前沿。这些代理具有广泛的潜在应用，从协助学术研究和安排会议到便利在线购物和度假预订。

2.8 Agents 2.8 代理

General Agents 通用代理

This section highlights benchmarks and research into agents that can flexibly operate in general task environments.
本节重点介绍了能够在一般任务环境中灵活运作的代理的基准和研究。

AgentBench

AgentBench, a new benchmark designed for evaluating LLM-based agents, encompasses eight distinct interactive settings, including web browsing, online shopping, household management, puzzles, and digital card games (Figure 2.8.1). The study assessed over 25 LLM-based agents, including those built on OpenAI's GPT-4, Anthropic's Claude 2, and Meta's Llama 2. GPT-4 emerged as the top performer, achieving an overall score of 4.01, significantly higher than Claude 2's score of 2.49 (Figure 2.8.2). The research also suggests that LLMs released in 2023 outperform earlier versions in agentic settings. Additionally, the AgentBench team speculated that agents' struggles on certain benchmark subsections can be attributed to their limited abilities in long-term reasoning, decision-making, and instruction-following.
AgentBench，一个专为评估基于LLM的代理而设计的新基准，包括八个不同的互动设置，包括网页浏览、在线购物、家庭管理、拼图和数字卡牌游戏（图 2.8.1）。该研究评估了超过 25 个基于LLM的代理，包括那些建立在 OpenAI 的 GPT-4、Anthropic 的 Claude 2 和 Meta 的 Llama 2 之上的代理。GPT-4 脱颖而出，取得了 4.01 的总体得分，远高于 Claude 2 的 2.49 得分（图 2.8.2）。研究还表明，2023 年发布的LLMs在代理设置中胜过早期版本。此外，AgentBench 团队推测，代理在某些基准子部分上的困难可以归因于它们在长期推理、决策和遵循指令方面的能力有限。

Description of the AgentBench benchmark
AgentBench 基准的描述

Source: Liu et al., 2023
来源：刘等人，2023 年

Figure 2.8.1 图 2.8.1

AgentBench across eight environments: overall score
在八个环境中的 AgentBench：总体得分

Figure 2.8.2 图 2.8.2

Highlighted Research: 突出研究:

Voyager

Recent research by Nvidia, Caltech, UT Austin, Stanford, and UW Madison demonstrates that existing LLMs like GPT-4 can be used to develop flexible agents capable of continuous learning. The team created Voyager, a GPT-4-based agent for Minecraft, a complex video game with no set endpoint that is essentially a boundless virtual playground for its players (Figure 2.8.3). Voyager excels in this environment, adeptly remembering plans, adapting to new settings, and transferring knowledge. It significantly outperforms previous models, collecting 3.3 times more unique items, traveling 2.3 times further, and reaching key milestones 15.3 times faster (Figure 2.8.4).
Nvidia、Caltech、UT Austin、斯坦福大学和 UW Madison 的最新研究表明，现有的 GPT-4 等LLMs可以用于开发具有持续学习能力的灵活代理。团队创建了一款基于 GPT-4 的代理程序 Voyager，用于 Minecraft，这是一款没有固定终点的复杂电子游戏，对玩家而言本质上是一个无边界的虚拟游乐场（图 2.8.3）。Voyager 在这个环境中表现出色，能够灵活记住计划、适应新的设置并迁移知识。它的表现明显优于之前的模型，收集到的独特物品数量是之前的 3.3 倍，行程距离是之前的 2.3 倍，达到关键里程碑的速度快 15.3 倍（图 2.8.4）。

Voyager in action Voyager 的运行情况

Source: Wang et al., 2023
来源：Wang 等人，2023 年

Figure 2.8.3 图 2.8.3


Voyager's performance improvements over prior state of the art in Minecraft
《航海家》在《我的世界》中的性能改进超越了先前的技术水平

Source: Wang et al., 2023 | Chart: 2024 Al Index report
来源：王等人，2023 年 | 图表：2024 年 Al 指数报告

The launch of Voyager is significant, as AI researchers have long faced challenges in creating agents that can explore, plan, and learn in open-ended worlds. While previous AI systems like AlphaZero succeeded in closed, rule-defined environments like chess, Go, and shogi, they struggled in more dynamic settings, lacking the ability to continuously learn. Voyager, however, demonstrates remarkable proficiency in a dynamic video game setting, thereby representing a notable advancement in the field of agentic AI.
Voyager 的发布具有重要意义，因为 AI 研究人员长期以来一直面临着在开放式环境中创建能够探索、规划和学习的代理的挑战。虽然以前的 AI 系统如 AlphaZero 在象棋、围棋和将棋等封闭的、规则定义的环境中取得了成功，但它们在更动态的环境中遇到了困难，缺乏持续学习的能力。然而，Voyager 在动态视频游戏环境中展示了出色的熟练度，从而代表了代理 AI 领域的显著进步。

Task-Specific Agents 任务特定代理

This section highlights benchmarks and research into agents that are optimized to perform in specific task environments, such as mathematical problem-solving or academic research.
本节重点介绍了针对特定任务环境进行优化的代理的基准和研究，例如数学问题求解或学术研究。

MLAgentBench

MLAgentBench, a new benchmark for evaluating AI research agents' performance, tests whether AI agents are capable of engaging in scientific experimentation. More specifically, MLAgentBench assesses AI systems' potential as computer science research assistants, evaluating their performance across 15 varied research tasks. Examples of the tasks include improving a baseline model on the CIFAR-10 image dataset and training a language model on over 10 million words in BabyLM. Various LLM-based agents, including GPT-4, Claude-1, AutoGPT, and LangChain, were tested. The results demonstrate that although there is promise in AI research agents, performance varies significantly across tasks. While some agents achieved over

on tasks like ogbn-arxiv (improving a baseline paper classification model), all scored

on BabyLM (training a small language model) (Figure 2.8.5). Among these, GPT-4 consistently delivered the best results.
MLAgentBench，一个用于评估人工智能研究代理性能的新基准，测试 AI 代理是否能够进行科学实验。更具体地说，MLAgentBench 评估了人工智能系统作为计算机科学研究助手的潜力，评估它们在 15 个不同的研究任务中的表现。这些任务的示例包括改进 CIFAR-10 图像数据集上的基线模型和在 BabyLM 中训练一个超过 1000 万字的语言模型。测试了各种基于LLM的代理，包括 GPT-4、Claude-1、AutoGPT 和 LangChain。结果表明，尽管 AI 研究代理有潜力，但在不同任务中的表现差异显著。虽然一些代理在像 ogbn-arxiv（改进基线论文分类模型）这样的任务上取得了

以上的成绩，但在 BabyLM（训练一个小型语言模型）上都得分

（图 2.8.5）。在这些代理中，GPT-4 始终取得最佳结果。

MLAgentBench evaluation: success rate of select models across tasks
MLAgentBench 评估：选择模型在任务中的成功率

Source: Huang et al., 2023 | Chart: 2024 Al Index report
来源：黄等人，2023 年 | 图表：2024 年 Al 指数报告

Task 任务

Figure 2.8.5 图 2.8.5

15 The full tasks include: (1) CIFAR-10 (improve a baseline image classification model), (2) imdb (improve a baseline sentiment classification model), (3) ogbn-arxiv (improve a baseline paper classification model from scratch), (4) house prices (train a regression model), (5) spaceship titanic (train a classifier model from scratch), (6) Parkinson's-disease (train a time-series regression model), (7) FathomNet (train an out-of-distribution image classification model), (8) feedback (train an out-of-distribution text regression model), (9) identify contrails (train an out-of-distribution image segmentation model), (10) CLRS (model classic algorithms over graphs and lists), (11) BabyLM (train language model over 10M words), (12) llama-inference (improve the runtime/autoregressive generation speed of Llama 7B), (13) vectorization (improve the inference speed of a model), (14) literature-review-tool (perform literature review), and (15) bibtex-generation (generate BibTex from sketch).
15 完整任务包括：(1) CIFAR-10（改进基线图像分类模型），(2) imdb（改进基线情感分类模型），(3) ogbn-arxiv（从头开始改进基线论文分类模型），(4) 房价（训练回归模型），(5) 飞船泰坦尼克（从头开始训练分类器模型），(6) 帕金森病（训练时间序列回归模型），(7) FathomNet（训练一种超出分布的图像分类模型），(8) 反馈（训练一种超出分布的文本回归模型），(9) 识别飞机尾迹（训练一种超出分布的图像分割模型），(10) CLRS（在图和列表上建模经典算法），(11) BabyLM（训练超过 1000 万词的语言模型），(12) llama-inference（改进 Llama 7B 的运行时/自回归生成速度），(13) 矢量化（改进模型的推理速度），(14) 文献综述工具（进行文献综述），以及 (15) bibtex-generation（从草稿生成 BibTex）。

Over time, AI has become increasingly integrated into robotics, enhancing robots' capabilities to perform complex tasks. Especially with the rise of foundation models, this integration allows robots to iteratively learn from their surroundings, adapt flexibly to new settings, and make autonomous decisions.
随着时间的推移，AI（人工智能）已经越来越多地整合到机器人技术中，增强了机器人执行复杂任务的能力。特别是随着基础模型的兴起，这种整合使得机器人能够从周围环境中迭代学习，灵活适应新环境，并做出自主决策。

2.9 Robotics 2.9 机器人技术

Highlighted Research: 突出研究:

PaLM-E

PaLM-E is a new AI model from Google that merges robotics with language modeling to address real-world tasks like robotic manipulation and knowledge tasks like question answering and image captioning. Leveraging transformer-based architectures, the largest PaLM-E model is scaled up to

parameters. The model is trained on diverse visual language as well as robotics data, which results in superior performance on a variety of robotic benchmarks. PaLM-E also sets new standards in visual tasks like OK-VQA, excels in other language tasks, and can engage in chain-of-thought, mathematical, and multi-image reasoning, even without specific training in these areas. Figure 2.9.1 illustrates some of the tasks that the PaLM-E model can perform.
PaLM-E 是谷歌推出的一种新的 AI 模型，将机器人技术与语言建模相结合，以解决像机器人操作和知识任务（如问题回答和图像描述）等现实任务。利用基于 transformer 的架构，最大的 PaLM-E 模型扩展到

个参数。该模型在多样的视觉语言和机器人数据上进行训练，从而在各种机器人基准测试中表现出优异的性能。PaLM-E 还在视觉任务方面树立了新的标准，优秀地完成了其他语言任务，并且可以进行思维链、数学和多图像推理，即使在这些领域没有特定的训练。图 2.9.1 展示了 PaLM-E 模型可以执行的一些任务。

On Task and Motion Planning (TAMP) domains, where robots have to manipulate objects, PaLM-E outperforms previous state-of-the-art methods like SayCan and PaLI on both embodied visual question answering and planning (Figure 2.9.2).

On robotic manipulation tasks, PaLM-E outperforms competing models (PaLI and CLIP-FT) in its ability to detect failures, which is a crucial step for robots to perform closed-loop planning (Figure 2.9.3).
在任务和运动规划（TAMP）领域，机器人必须操纵物体的情况下，PaLM-E 在具身视觉问题回答和规划方面表现优于以往的最先进方法，如 SayCan 和 PaLI（图 2.9.2）。在机器人操作任务中，PaLM-E 在检测失败方面优于竞争模型（PaLI 和 CLIP-FT），这对于机器人执行闭环规划是至关重要的一步（图 2.9.3）。

PaLM-E is significant in that it demonstrates that language modeling techniques as well as text data can enhance the performance of AI systems in nonlanguage domains, like robotics. PaLM-E also highlights how there are already linguistically adept robots capable of real-world interaction and high-level reasoning. Developing these kinds of multifaceted robots is an essential step in creating more general robotic assistants that can, for example, assist in household work.
PaLM-E 在于展示语言建模技术以及文本数据如何提升非语言领域（如机器人技术）中人工智能系统的性能具有重要意义。PaLM-E 还突显了已经存在的在现实世界中能够进行高水平推理和互动的语言能力机器人。开发这类多方面的机器人是创建更通用的机器人助手的重要一步，例如在家务工作中提供帮助。

Highlighted Research: 突出研究：

PaLM-E (cont'd) PaLM-E（续）

PaLM-E in action PaLM-E 在行动中

Source: Robotics at Google, 2023
来源：Google 的机器人技术，2023 年

Move the green circle to the yellow hexagon
将绿色圆圈移动到黄色六边形
Move the blue triangle to the group success
将蓝色三角形移动到组成功

Figure 2.9.1 图 2.9.1

Performance of select models on TAMP environment: success rate
在 TAMP 环境中选择模型的性能：成功率

Source: Driess et al., 2023 | Table: 2024 Al Index report
来源：Driess 等人，2023 年 | 表格：2024 年 Al 指数报告

Model	Embodied VQA q1 具身 VQA q1	Embodied VQA q2 具身 VQA q2	Embodied VQA q3 具身 VQA q3	Embodied VQA q4 具身 VQA q4	Planning p1	Planning p2
SayCan (oracle affordances) SayCan（神谕功能）	-	-	-	-	38.7	33.3
PaLI (zero-shot) PaLI（零样本）
PaLM-E OSRT w/ input encoding 带输入编码的 PaLM-E OSRT	99.7	98.2	100	93.7	82.5	76.2

Figure 2.9.2 图 2.9.2

Select models on mobile manipulation environment tests: failure detection
Source: Driess et al., 2023 | Table: 2024 AI Index report

Baselines	Failure detection
PaLI (zero-shot)	0.73
CLIP-FT	0.65
CLIP-FT-hindsight	0.89
PaLM-E-12B	0.91

Figure 2.9.3

Highlighted Research: 突出研究:

RT-2

Real-world robots could benefit from certain capabilities possessed by LLMs, such as text and code generation, as well as visual understanding. RT-2, a new robot released from DeepMind, represents an ambitious attempt to create a generalizable robotic model that has certain LLM capabilities. RT-2 uses a transformer-based architecture and is trained on both robotic trajectory data that is tokenized into text and extensive visual-language data.

RT-2 stands out as one of the most impressive and adaptable approaches for conditioning robotic policy. It outshines state-of-the-art models like Manipulation of Open-World Objects (MOO) across various benchmarks, particularly in tasks involving unseen objects. On such tasks, an RT-2/PaLM-E variant achieves a success rate significantly higher than MOO's (53%) (Figure 2.9.4). In unseen object tasks, RT-2 surpasses the previous year's state-of-the-art model, RT-1, by 43 percentage points. This indicates that robotic performance in novel environments is improving over time.

Evaluation of RT-2 models and baselines on seen and unseen tasks: success rate
对看到和未看到的任务在 RT-2 模型和基准上的评估：成功率

Source: Brohan et al., 2023 | Chart: 2024 Al Index report
来源：Brohan 等人，2023 年 | 图表：2024 年 Al 指数报告

Figure 2.9.4 图 2.9.4

2.10 Reinforcement Learning

In reinforcement learning, AI systems are trained to maximize performance on a given task by interactively learning from their prior actions. Systems are rewarded if they achieve a desired goal and punished if they fail.

Reinforcement Learning from Human Feedback

Reinforcement learning has become a popular way to enhance state-of-the-art language models like GPT-4 and Llama 2. Introduced in 2017, Reinforcement Learning from Human Feedback (RLHF) incorporates human feedback into the reward function, enabling models to be trained for characteristics like helpfulness and harmlessness.
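One common way to fold human feedback into a reward function is to fit a reward model on pairs of responses that annotators have ranked. The sketch below is a minimal illustration under that assumption, using a Bradley-Terry style pairwise loss. The function name and the example scores are hypothetical and are not taken from the report.

```python
import math


def pairwise_preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Return -log(sigmoid(reward_chosen - reward_rejected)).

    The loss is small when the reward model scores the human-preferred
    response well above the rejected one, and large when it ranks them
    the wrong way round.
    """
    margin = reward_chosen - reward_rejected
    # Numerically stable form of log(1 + exp(-margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Hypothetical reward-model scores for two annotated comparisons.
agrees_with_human = pairwise_preference_loss(reward_chosen=2.0, reward_rejected=0.0)
disagrees_with_human = pairwise_preference_loss(reward_chosen=0.0, reward_rejected=2.0)
print(agrees_with_human, disagrees_with_human)
```

Minimizing this loss over many human comparisons yields a scalar reward that a reinforcement learning step can then optimize the language model against.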

This year, the AI Index tracked data on the number of foundation models using RLHF as part of their training. More specifically, the Index team looked through the technical reports and other documentation of all models included in CRFM's Ecosystem graph, one of the most comprehensive repositories of the foundation model ecosystem.

Figure 2.10.1 illustrates how many foundation models reported using RLHF over time. In 2021, no newly released foundation models used RLHF. In 2022, seven models reported using RLHF, and in 2023, 16 models did. The rising popularity of RLHF is also evidenced by the fact that many leading LLMs report improving their models with RLHF (Figure 2.10.2).

Figure 2.10.1 图 2.10.1

RLHF usage among foundation models
Source: AI Index, 2024 | Table: 2024 AI Index report

GPT-4	Llama 2	Claude-2	Gemini	Mistral-7B

Highlighted Research: 突出研究:

RLAIF

RLHF is a powerful method for aligning AI models but can be hindered by the time and labor required to generate human preference datasets. As an alternative, Reinforcement Learning from AI Feedback (RLAIF) uses reinforcement learning based on the preferences of LLMs to align other AI models toward human preferences.

Recent research from Google Research compares RLAIF with RLHF, the traditional gold standard, to assess whether RLAIF can serve as a reliable substitute. The study finds that both RLAIF and RLHF are preferred over supervised fine-tuning (SFT) for summarization and helpfulness tasks, and that there is no statistically significant difference in how strongly each is preferred over SFT (Figure 2.10.3). Notably, in harmless dialogue generation tasks focused on producing the least harmful outputs, RLAIF (88%) surpasses RLHF (76%) in effectiveness (Figure 2.10.4). This research indicates that RLAIF could be a more resource-efficient and cost-effective approach for AI model alignment.
RLAIF and RLHF vs. SFT baseline: win rate
RLAIF 和 RLHF 对 SFT 基线: 胜率

Source: Lee et al., 2023 | Chart: 2024 Al Index report
来源：Lee 等人，2023 | 图表：2024 年 Al Index 报告

Harmless rate by policy
政策的无害率

Highlighted Research: 突出研究：

Direct Preference Optimization
直接偏好优化

As illustrated above, RLHF is a useful method for aligning LLMs with human preferences. However, RLHF requires substantial computational resources, since it involves training multiple language models and integrating LM policy sampling within training loops. This complexity can hinder its broader adoption.

In response, researchers from Stanford and CZ Biohub have developed a new reinforcement learning algorithm for aligning models, named Direct Preference Optimization (DPO). DPO is simpler than RLHF but equally effective. The researchers show that DPO is as effective as other existing alignment methods, such as Proximal Policy Optimization (PPO) and Supervised Fine-Tuning (SFT), on tasks like summarization (Figure 2.10.5). The emergence of techniques like DPO suggests that model alignment methods are becoming more straightforward and accessible.
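To make the simplicity claim concrete, the sketch below computes the per-example DPO objective as it is commonly stated: a logistic loss on the difference between the policy's and a frozen reference model's log-probability ratios for a preferred and a rejected response. This is an illustration only. The function name, the default `beta`, and the log-probabilities are assumptions rather than values from the report.

```python
import math


def dpo_loss(
    policy_logp_chosen: float,
    policy_logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,
) -> float:
    """Per-example DPO loss: -log(sigmoid(beta * (chosen_ratio - rejected_ratio))).

    Each ratio is log pi_policy(y | x) - log pi_reference(y | x). No reward
    model and no sampling loop are needed, only log-probabilities.
    """
    chosen_ratio = policy_logp_chosen - ref_logp_chosen
    rejected_ratio = policy_logp_rejected - ref_logp_rejected
    logit = beta * (chosen_ratio - rejected_ratio)
    # Numerically stable log(1 + exp(-logit)).
    if logit >= 0:
        return math.log1p(math.exp(-logit))
    return -logit + math.log1p(math.exp(logit))


# Hypothetical sequence log-probabilities for one preference pair.
untrained = dpo_loss(-12.0, -15.0, -12.0, -15.0)  # policy equals reference
improved = dpo_loss(-10.0, -18.0, -12.0, -15.0)   # policy favors the chosen answer
print(untrained, improved)
```

When the policy is identical to the reference, the loss equals log 2. It falls as the policy shifts probability toward the preferred response relative to the reference.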

Comparison of different algorithms on TL;DR summarization task across different sampling temperatures
Source: Rafailov et al., 2023 | Table: 2024 AI Index report

Figure 2.10.5 图 2.10.5

2.11 Properties of LLMs

This section focuses on research exploring critical properties of LLMs, such as their capacity for sudden behavioral shifts and self-correction in reasoning. It is important to highlight these studies to develop an understanding of how LLMs, which are increasingly representative of the frontier of AI research, operate and behave.

Highlighted Research: 突出研究:

Challenging the Notion of Emergent Behavior
挑战新兴行为的概念

Many papers have argued that LLMs exhibit emergent abilities, meaning they can unpredictably and suddenly display new capabilities at larger scales. This has raised concerns that even larger models could develop surprising, and perhaps uncontrollable, new abilities.

However, research from Stanford challenges this notion, arguing that the perceived emergence of new capabilities is often a reflection of the benchmarks used for evaluation rather than an inherent property of the models themselves. The researchers found that when nonlinear or discontinuous metrics like multiple-choice grading are used to evaluate models, emergent abilities seem more apparent. In contrast, when linear or continuous metrics are employed, these abilities largely vanish. Analyzing a suite of benchmarks from BIG-bench, a comprehensive LLM evaluation tool, the researchers noted emergent abilities on only five of the 39 benchmarks (Figure 2.11.1). These findings have important implications for AI safety and alignment research, as they challenge a prevailing belief that AI models will inevitably learn new, unpredictable behaviors as they scale.
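The metric effect described above can be illustrated with a toy calculation. The sketch assumes a hypothetical model family whose per-token accuracy improves smoothly with scale, and a task scored by exact match on a 10-token answer with independent token errors. These numbers are invented for illustration and do not come from the study.

```python
def exact_match_accuracy(per_token_accuracy: float, answer_length: int) -> float:
    """Probability that every token of the answer is right,
    assuming token errors are independent."""
    return per_token_accuracy ** answer_length


# Hypothetical per-token accuracies for six increasingly large models.
per_token = [0.50, 0.60, 0.70, 0.80, 0.90, 0.95]
exact_match = [exact_match_accuracy(p, answer_length=10) for p in per_token]

for p, em in zip(per_token, exact_match):
    # The continuous metric rises steadily; the all-or-nothing metric
    # stays near zero and then climbs steeply.
    print(f"per-token accuracy {p:.2f} -> exact-match accuracy {em:.4f}")
```

Under these assumptions the continuous metric improves gradually across the whole family, while the exact-match score stays near zero for the smaller models and rises sharply only for the largest ones. Read in isolation, that second curve can look like a sudden new ability.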

Emergence score over all Big-bench tasks
所有 Big-bench 任务的 Emergence 得分

Source: Schaeffer et al., 2023
来源：Schaeffer 等人，2023 年

Highlighted Research: 突出研究:

Changes in LLM Performance Over Time
随时间变化的LLM性能变化

Publicly usable closed-source LLMs, such as GPT-4, Claude 2, and Gemini, are often updated over time by their developers in response to new data or user feedback. However, there is little research on how the performance of such models changes, if at all, in response to such updating.

A study conducted at Stanford and Berkeley explores the performance of certain publicly usable LLMs over time and finds that their performance can vary significantly. More specifically, the study compared the March and June 2023 versions of GPT-3.5 and GPT-4 and demonstrated that performance declined on several tasks. For instance, the June version of GPT-4, compared to the March version, was 42 percentage points worse at generating code, 16 percentage points worse at answering sensitive questions, and 33 percentage points worse on certain mathematical tasks (Figure 2.11.2). The researchers also found that GPT-4's ability to follow instructions diminished over time, which potentially explains the broader performance declines. This research highlights that LLM performance can change over time and suggests that regular users should be mindful of such changes.

Performance of the March 2023 and June 2023 versions of GPT-4 on eight tasks
2023 年 3 月和 2023 年 6 月版本的 GPT-4 在八项任务上的表现

Source: Chen et al., 2023 | Chart: 2024 Al Index report
来源：Chen 等人，2023 年 | 图表：2024 年 Al 指数报告

Figure 2.11.2 图 2.11.2

Highlighted Research: 突出研究：

LLMs Are Poor Self-Correctors
LLMs 是自我纠正能力差的人

It is generally understood that LLMs like GPT-4 have reasoning limitations and can sometimes produce hallucinations. One proposed solution to such issues is self-correction, whereby LLMs identify and correct their own reasoning flaws. As AI's societal role grows, the concept of intrinsic self-correction (allowing LLMs to autonomously correct their reasoning without external guidance) is especially appealing. However, it is currently not well understood whether LLMs are in fact capable of this kind of self-correction.

Researchers from DeepMind and the University of Illinois at Urbana-Champaign tested GPT-4's performance on three reasoning benchmarks: GSM8K (grade-school math), CommonSenseQA (common-sense reasoning), and HotpotQA (multidocument reasoning). They found that when the model was left to decide on self-correction without guidance, its performance declined across all tested benchmarks (Figure 2.11.3).

GPT-4 on reasoning benchmarks with intrinsic self-correction
GPT-4 在具有内在自我纠正的推理基准测试上

Source: Huang et al., 2023 | Chart: 2024 Al Index report
来源：Huang 等人，2023 年 | 图表：2024 年 Al 指数报告

Figure 2.11.3 图表 2.11.3

Closed vs. Open Model Performance
封闭模型与开放模型的性能

As LLMs become increasingly ubiquitous, debate intensifies over their varying degrees of accessibility. Some models such as Google's Gemini remain closed, accessible solely to their developers. In contrast, models like OpenAI's GPT-4 and Anthropic's Claude offer limited access, available publicly via an API. However, their model weights are not fully released, which means the models cannot be independently modified by the public or further scrutinized. Conversely, Meta's Llama 2 and Stability AI's Stable Diffusion adopt an open approach, fully releasing their model weights. Open-source models can be modified and freely used by anyone.

Viewpoints differ on the merits of closed versus open AI models. Some argue in favor of open models, citing their ability to counteract market concentration, foster innovation, and enhance transparency within the AI ecosystem. Others contend that open-source models present considerable security risks, such as facilitating the creation of disinformation or bioweapons, and should therefore be approached with caution.

In the context of this debate, it is important to acknowledge that current evidence indicates a notable performance gap between open and closed models. Figures 2.11.4 and 2.11.5 juxtapose the performances of the top closed versus open model on a selection of benchmarks. On all 10 selected benchmarks, closed models outperform open ones. The size of the advantage varies by task: it is smallest on mathematical tasks like GSM8K and largest on agentic tasks like AgentBench.

Score differentials of top closed vs. open models on select benchmarks
Source: AI Index, 2024 | Table: 2024 AI Index report

Benchmark	Task category	Best closed model score	Best open model score
AgentBench	Agent-based behavior	4.01	0.96
Chatbot Arena Leaderboard	General language	1,252	1,149
GPQA	General reasoning
GSM8K	Mathematical reasoning
HELM	General language	0.96	0.82
HumanEval	Coding
MATH	Mathematical reasoning
MMLU	General language
MMMU	General reasoning
SWE-bench	Coding

Figure 2.11.4
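For the rows where both scores survive in the table above, the gap can be expressed as a relative advantage of the closed model over the open one. The sketch below assumes the simple relative-difference formula (closed minus open, divided by open). The report does not state its exact formula in this section, so treat the helper `relative_advantage` and its output as an illustration rather than the Index's own calculation.

```python
# Scores copied from the three table rows that list both values.
scores = {
    "AgentBench": (4.01, 0.96),
    "Chatbot Arena Leaderboard": (1252, 1149),
    "HELM": (0.96, 0.82),
}


def relative_advantage(closed_score: float, open_score: float) -> float:
    """Percentage by which the closed score exceeds the open score."""
    return (closed_score - open_score) / open_score * 100


advantages = {
    benchmark: relative_advantage(closed, open_)
    for benchmark, (closed, open_) in scores.items()
}

for benchmark, pct in sorted(advantages.items(), key=lambda item: item[1]):
    print(f"{benchmark}: closed model ahead by {pct:.2f}%")
```

Under this assumption the gap is widest by far on AgentBench, which matches the text's description of agentic tasks as the area with the largest difference.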

Performance of top closed vs. open models on select benchmarks
在选择基准测试中，顶级封闭模型与开放模型的性能

Source: AI Index, 2024 | Chart: 2024 AI Index report

Figure 2.11.5 图 2.11.5

2.12 Techniques for LLM Improvement
2.12 提高 LLM 的技术

Prompting 提示

Prompting, a vital aspect of the AI pipeline, entails supplying a model with natural language instructions that describe the tasks the model should execute. Crafting effective prompts can significantly enhance the performance of LLMs without requiring any change to the underlying models.

Highlighted Research: 突出研究：

Graph of Thoughts Prompting
思维图激发

Chain of thought (CoT) and Tree of Thoughts (ToT) are prompting methods that can improve the performance of LLMs on reasoning tasks. In 2023, European researchers introduced another prompting method, Graph of Thoughts (GoT), that has also shown promise (Figure 2.12.1). GoT enables LLMs to model their thoughts in a more flexible, graph-like structure that more closely mirrors actual human reasoning. The researchers then designed a model architecture to implement GoT and found that, compared to ToT, it increased the quality of outputs on a sorting task while reducing cost by around 31% (Figure 2.12.2).

Graph of Thoughts (GoT) reasoning flow
Source: Besta et al., 2023

Number of errors in sorting tasks with ChatGPT-3.5
ChatGPT-3.5 在排序任务中的错误数量

Source: Besta et al., 2023 | Chart: 2024 Al Index report
来源：Besta 等，2023 年 | 图表：2024 年 Al 指数报告

Figure 2.12.2 图 2.12.2

Highlighted Research: 突出研究:

Optimization by PROmpting (OPRO)
通过提示进行优化（OPRO）

A paper from DeepMind has introduced Optimization by PROmpting (OPRO), a method that uses LLMs to iteratively generate prompts to improve algorithmic performance. OPRO uses natural language to guide LLMs in creating new prompts based on problem descriptions and previous solutions (Figure 2.12.3). The generated prompts aim to enhance the performance of AI systems on particular benchmarks. Compared to other prompting approaches like "let's think step by step" or an empty starting point, OPRO leads to significantly greater accuracy on virtually all 23 BIG-bench Hard tasks (Figure 2.12.4).

Sample OPRO prompts and optimization progress
示例 OPRO 提示和优化进展

Source: Yang et al., 2023
来源：杨等人，2023 年

Figure 2.12.3 图 2.12.3

"Let's think carefully about the problem and solve it together." at Step 2 with the training accuracy 63.2 ;
"让我们仔细思考问题，并一起解决。" 在第 2 步，训练准确率为 63.2；
"Let's break it down!" at Step 4 with training accuracy 71.3;
"让我们把它分解开！" 在第 4 步，训练准确率为 71.3；
"Let's calculate our way to the solution!" at Step 5 with training accuracy 73.9;
"让我们计算出解决方案的方法！" 在第 5 步，训练准确率为 73.9；
"Let's do the math!" at Step 6 with training accuracy 78.2.
"让我们来做数学运算吧！" 在第 6 步时，训练准确率为 78.2。

Accuracy difference on 23 BIG-bench Hard (BBH) tasks using PaLM 2-L scorer
Source: Yang et al., 2023 | Chart: 2024 AI Index report

Figure 2.12.4 图 2.12.4

Fine-Tuning 微调

Fine-tuning has grown increasingly popular as a method of enhancing LLMs and involves further training or adjusting models on smaller datasets. Fine-tuning not only boosts overall model performance but also sharpens the model's capabilities on specific tasks. It also allows for more precise control over the model's behavior.

QLoRA manages to increase efficiency with techniques like a 4-bit NormalFloat (NF4), double quantization, and paged optimizers. QLoRA is used to train a model named Guanaco, which matched or even surpassed models like ChatGPT in performance on the Vicuna benchmark (a benchmark that ranks the outputs of LLMs) (Figure 2.12.5). Remarkably, the Guanaco models were created with just 24 hours of fine-tuning on a single GPU. QLoRA highlights how methods for optimizing and further improving models have become more efficient, meaning fewer resources will be required to make increasingly capable models.

Model competitions based on 10,000 simulations using GPT-4 and the Vicuna benchmark
基于 GPT-4 和维库纳基准测试的 10,000 次模拟的模型竞赛

Source: Dettmers et al., 2023 | Chart: 2024 Al Index report
来源：Dettmers 等人，2023 年 | 图表：2024 年 Al 指数报告

Figure 2.12.5 图 2.12.5

Attention 注意力

LLMs can flexibly handle various tasks but often demand substantial computational resources to train. As previously noted, high training costs can hinder AI's broader adoption. Optimization methods aim to enhance AI's efficiency by, for example, improving memory usage, thereby making LLMs more accessible and practical.

Highlighted Research: 突出的研究:

Flash-Decoding 快速解码

Flash-Decoding, developed by Stanford researchers, tackles inefficiency in traditional LLMs by speeding up the attention mechanism, particularly in tasks requiring long sequences. It achieves this by parallelizing the loading of keys and values, then separately rescaling and combining them to maintain correct attention outputs (Figure 2.12.6). In various tests, Flash-Decoding outperforms other leading methods like PyTorch Eager and FlashAttention-2, showing much faster inference: for example, on a 256 batch size and 256 sequence length, Flash-Decoding is 48 times faster than PyTorch Eager and six times faster than FlashAttention-2 (Figure 2.12.7). Inference on models like ChatGPT carries a per-response cost that can become highly expensive when such models are deployed to millions of users. Innovations like Flash-Decoding are critical for reducing inference costs in AI.
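The "rescale and combine" step can be sketched in plain Python for a single query vector. This is not the Flash-Decoding GPU kernel: the chunks here are processed one after another, whereas the point of the method is to process them in parallel. The function names and toy data are assumptions made for illustration. The sketch only shows why splitting keys and values into chunks, computing a partial softmax per chunk, and rescaling by each chunk's maximum score reproduces ordinary attention.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attention_full(query, keys, values):
    """Reference softmax attention for one query over all keys."""
    scores = [dot(query, k) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]


def attention_split_kv(query, keys, values, chunk_size):
    """Same result, computed chunk by chunk and then merged."""
    dim = len(values[0])
    partials = []
    for start in range(0, len(keys), chunk_size):
        k_chunk = keys[start:start + chunk_size]
        v_chunk = values[start:start + chunk_size]
        scores = [dot(query, k) for k in k_chunk]
        chunk_max = max(scores)
        weights = [math.exp(s - chunk_max) for s in scores]
        numerator = [sum(w * v[d] for w, v in zip(weights, v_chunk)) for d in range(dim)]
        partials.append((chunk_max, sum(weights), numerator))

    # Rescale every partial result to a shared maximum before combining.
    global_max = max(chunk_max for chunk_max, _, _ in partials)
    total = 0.0
    output = [0.0] * dim
    for chunk_max, denominator, numerator in partials:
        scale = math.exp(chunk_max - global_max)
        total += scale * denominator
        for d in range(dim):
            output[d] += scale * numerator[d]
    return [o / total for o in output]


query = [1.0, 0.5]
keys = [[0.1, 0.2], [0.4, -0.3], [1.0, 1.0], [-0.5, 0.7], [0.3, 0.3]]
values = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [-1.0, 3.0], [0.5, 0.5]]

full = attention_full(query, keys, values)
split = attention_split_kv(query, keys, values, chunk_size=2)
print(full, split)
```

Because each chunk only needs its own keys and values plus two small summary numbers, the chunks can be handled independently, which is what makes long-sequence decoding parallelizable.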

Performance comparison of multihead attention algorithms across batch sizes and sequence lengths
Source: Dao et al., 2023 | Chart: 2024 AI Index report

Figure 2.12.7 图 2.12.7

2.13 Environmental Impact of AI Systems

This section examines trends in the environmental impact of AI systems, highlighting the evolving landscape of transparency and awareness. Historically, model developers seldom disclosed the carbon footprint of their AI systems, leaving researchers to make their best estimates. Recently, there has been a shift toward greater openness, particularly regarding the carbon costs of training AI models. However, disclosure of the environmental costs associated with inference, a potentially more significant concern, remains insufficient. This section presents data on carbon emissions as reported by developers in addition to featuring notable research exploring the intersection of AI and environmental impact. With AI models growing in size and becoming more widely used, it has never been more critical for the AI research community to diligently monitor and mitigate the environmental effects of AI systems.

General Environmental Impact
一般环境影响

Training 训练

Figure 2.13.1 presents the carbon released (in tonnes) by select LLMs during their training, compared with human reference points. Emissions data for models marked with an asterisk were estimated by independent researchers, as they were not disclosed by the developers.

Emissions data vary widely. For instance, Meta's Llama 2 70B model released approximately 291.2 tonnes of carbon, nearly 291 times the emissions of one traveler on a round-trip flight from New York to San Francisco, and roughly 16 times the carbon emitted by an average American in one year. However, the emissions from Llama 2 are still less than the 502 tonnes reportedly released during the training of OpenAI's GPT-3.

CO2 equivalent emissions (tonnes) by select machine learning models and real-life examples, 2020-23
选择的机器学习模型和现实示例的 CO2 当量排放量（吨），2020-23

Source: Al Index, 2024; Luccioni et al., 2022; Strubell et al., 2019 | Chart: 2024 Al Index report
来源：2024 年 Al 指数；Luccioni 等人，2022 年；Strubell 等人，2019 年 | 图表：2024 年 Al 指数报告

The variance in emission estimates is due to factors such as model size, data center energy efficiency, and the carbon intensity of energy grids. Figure 2.13.2 shows the emissions of select models in relation to their size. Generally, larger models emit more carbon, a trend clearly seen in the Llama 2 model series, which were all trained on the same supercomputer (Meta's Research Super Cluster). However, smaller models can still have high emissions if trained on grids powered by more carbon-intensive energy sources. Some estimates suggest that model emissions have declined over time, which is presumably tied to increasingly efficient mechanisms of model training. Figure 2.13.3 features the emissions of select models along with their power consumption.

CO2 equivalent emissions (tonnes) and number of parameters by select machine learning models
选择的机器学习模型的 CO2 当量排放量（吨）和参数数量

Source: Al Index, 2024; Luccioni et al., 2022 | Chart: 2024 Al Index report
来源：Al Index，2024；Luccioni 等，2022 | 图表：2024 Al Index 报告

CO2 equivalent emissions (log scale - tonnes)
二氧化碳当量排放量（对数刻度 - 吨）

Figure 2.13.2 图 2.13.2

Environmental impact of select models
Source: AI Index, 2024; Luccioni et al., 2022 | Table: 2024 AI Index report

Model and number of parameters	Year	Power consumption (MWh)	CO2 equivalent emissions (tonnes)
Gopher (280B)	2021	1,066	352
BLOOM (176B)	2022	433
GPT-3 (175B)	2020	1,287	502
OPT (175B)	2022	324
Llama 2 (70B)	2023	400	291.42
Llama 2 (34B)	2023	350	153.90
Llama 2 (13B)	2023	400	62.44
Llama 2 (7B)	2023	400	31.22
Granite (13B)	2023	153	22.23
Starcoder (15.5B)	2023	89.67	16.68
Luminous Base (13B)	2023		3.17
Luminous Extended (30B)	2023		11.95

Figure 2.13.3
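One way to see the role of grid carbon intensity is to divide a model's reported emissions by its reported power consumption. The sketch below does this for four models whose rows list both values in Figure 2.13.3. It assumes that the two numbers in each row are power in MWh followed by emissions in tonnes, as the column order suggests. The helper name is hypothetical and the result is a rough implied figure, not a value published in the report.

```python
# (power consumption in MWh, CO2-equivalent emissions in tonnes)
training_runs = {
    "Gopher (280B)": (1066.0, 352.0),
    "GPT-3 (175B)": (1287.0, 502.0),
    "Granite (13B)": (153.0, 22.23),
    "Starcoder (15.5B)": (89.67, 16.68),
}


def implied_carbon_intensity(power_mwh: float, emissions_tonnes: float) -> float:
    """Tonnes of CO2 equivalent emitted per MWh consumed."""
    return emissions_tonnes / power_mwh


intensity = {
    model: implied_carbon_intensity(power, emissions)
    for model, (power, emissions) in training_runs.items()
}

for model, value in intensity.items():
    print(f"{model}: {value:.3f} tCO2e per MWh")
```

Under these assumptions, the two 2023 models emit less carbon per MWh than the 2020 and 2021 models. This is consistent with the point that emissions depend on where and how a model is trained as well as on its size.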

A major challenge in evaluating the environmental impacts of AI models is a lack of transparency about emissions. Consistent with findings from other studies, most prominent model developers do not report carbon emissions, hampering efforts to conduct thorough and accurate evaluations of this metric. For example, many prominent model developers such as OpenAI, Google, Anthropic, and Mistral do not report emissions in training, although Meta does.

Inference 推理

As highlighted earlier, the environmental impact of
正如前面所强调的，

Carbon emissions by task during model inference
模型推断期间任务的碳排放量

Source: Luccioni et al., 2023 | Chart: 2024 Al Index report
来源：Luccioni 等人，2023 年 | 图表：2024 年 Al 指数报告

Task 任务

Figure 2.13.4 图 2.13.4

Positive Use Cases 正面使用案例

Despite the widely recognized environmental costs of training AI systems, AI can contribute positively to environmental sustainability. Figure 2.13.5 showcases a variety of recent cases where AI supports environmental efforts. These applications include enhancing thermal energy system management, improving pest control strategies, and boosting urban air quality.

Positive AI environmental use cases
Source: Fang et al., 2024 | Table: 2024 AI Index report

Use case	AI contribution	Reference
Management of thermal energy storage systems	Anticipating thermal energy needs and managing thermal energy storage systems.
Improving waste management	Saving time and costs in waste-to-energy conversion, waste sorting, and waste monitoring.
More efficiently cooling buildings	Optimizing the energy usage associated with air-conditioning.	Luo et al., 2022
Improving pest management	Identifying and eliminating pests in commercial tomato harvests.
Enhancing urban air quality	Forecasting and predicting air quality in urban cities.

Preview 预览

Overview 160 概述 160

CHAPTER 3: Responsible AI

Chapter Highlights

3.1 Assessing Responsible AI
Responsible AI Definitions
AI Incidents
Examples
Risk Perception
Risk Mitigation
Overall Trustworthiness
Benchmarking Responsible AI
Tracking Notable RAI Benchmarks
Reporting Consistency

3.2 Privacy and Data Governance
Current Challenges
Privacy and Data Governance in Numbers
Academia
Industry
Featured Research
Extracting Data From LLMs
Foundation Models and Verbatim Generation
Auditing Privacy in AI Models

3.3 Transparency and Explainability
Current Challenges
Transparency and Explainability in Numbers
Academia
Industry
Featured Research
The Foundation Model Transparency Index
Neurosymbolic Artificial Intelligence (Why, What, and How)

3.4 Security and Safety
Current Challenges
AI Security and Safety in Numbers
Academia
Industry
Featured Research
Do-Not-Answer: A New Open Dataset for Comprehensive Benchmarking of LLM Safety Risks
Universal and Transferable Attacks on Aligned Language Models
MACHIAVELLI Benchmark

3.5 Fairness
Current Challenges
Fairness in Numbers
Academia
Industry
Featured Research
(Un)Fairness in AI and Healthcare
Social Bias in Image Generation Models
Measuring Subjective Opinions in LLMs
LLM Tokenization Introduces Unfairness

3.6 AI and Elections
Generation, Dissemination, and Detection of Disinformation
Generating Disinformation
Dissemination of Fake Content
Detecting Deepfakes
LLMs and Political Bias
Impact of AI on Political Processes


Overview

AI is increasingly woven into nearly every facet of our lives. This integration is occurring in sectors such as education, finance, and healthcare, where critical decisions are often based on algorithmic insights. This trend promises to bring many advantages; however, it also introduces potential risks. Consequently, in the past year, there has been a significant focus on the responsible development and deployment of AI systems. The AI community has also become more concerned with assessing the impact of AI systems and mitigating risks for those affected.

This chapter explores key trends in responsible AI by examining metrics, research, and benchmarks in four key responsible AI areas: privacy and data governance, transparency and explainability, security and safety, and fairness. Given that 4 billion people are expected to vote globally in 2024, this chapter also features a special section on AI and elections and more broadly explores the potential impact of AI on political processes.

Chapter Highlights

1. Robust and standardized evaluations for LLM responsibility are seriously lacking. New research from the AI Index reveals a significant lack of standardization in responsible AI reporting. Leading developers, including OpenAI, Google, and Anthropic, primarily test their models against different responsible AI benchmarks. This practice complicates efforts to systematically compare the risks and limitations of top AI models.

2. Political deepfakes are easy to generate and difficult to detect. Political deepfakes are already affecting elections across the world, with recent research suggesting that existing AI deepfake detection methods perform with varying levels of accuracy. In addition, new projects like CounterCloud demonstrate how easily AI can create and disseminate fake content.

3. Researchers discover more complex vulnerabilities in LLMs. Previously, most efforts to red team AI models focused on testing adversarial prompts that intuitively made sense to humans. This year, researchers found less obvious strategies to get LLMs to exhibit harmful behavior, like asking the models to infinitely repeat random words.

4. Risks from AI are a concern for businesses across the globe. A global survey on responsible AI highlights that companies' top AI-related concerns include privacy, security, and reliability. The survey shows that organizations are beginning to take steps to mitigate these risks. However, globally, most companies have so far only mitigated a portion of these risks.

5. LLMs can output copyrighted material. Multiple researchers have shown that the generative outputs of popular LLMs may contain copyrighted material, such as excerpts from The New York Times or scenes from movies. Whether such output constitutes copyright violations is becoming a central legal question.

6. AI developers score low on transparency, with consequences for research. The newly introduced Foundation Model Transparency Index shows that AI developers lack transparency, especially regarding the disclosure of training data and methodologies. This lack of openness hinders efforts to further understand the robustness and safety of AI systems.

7. Extreme AI risks are difficult to analyze. Over the past year, a substantial debate has emerged among AI scholars and practitioners regarding the focus on immediate model risks, like algorithmic discrimination, versus potential long-term existential threats. It has become challenging to distinguish which claims are scientifically founded and should inform policymaking. This difficulty is compounded by the tangible nature of already present short-term risks in contrast with the theoretical nature of existential threats.

8. The number of AI incidents continues to rise. According to the AI Incident Database, which tracks incidents related to the misuse of AI, 123 incidents were reported in 2023, a 32.3% increase from 2022. Since 2013, AI incidents have grown by over twentyfold. A notable example includes AI-generated, sexually explicit deepfakes of Taylor Swift that were widely shared online.

9. ChatGPT is politically biased. Researchers find a significant bias in ChatGPT toward Democrats in the United States and the Labour Party in the U.K. This finding raises concerns about the tool's potential to influence users' political views, particularly in a year marked by major global elections.

3.1 Assessing Responsible AI

This chapter begins with an overview of key trends in responsible AI (RAI). In this section, the AI Index defines key terms in responsible AI: privacy, data governance, transparency, explainability, fairness, as well as security and safety. Next, this section looks at AI-related incidents and explores how industry actors perceive AI risk and adopt AI risk mitigation measures. Finally, the section profiles metrics pertaining to the overall trustworthiness of AI models and comments on the lack of standardized responsible AI benchmark reporting.

Responsible AI Definitions

In this chapter, the AI Index explores four key dimensions of responsible AI: privacy and data governance, transparency and explainability, security and safety, and fairness. Other dimensions of responsible AI, such as sustainability and reliability, are discussed elsewhere in the report. Figure 3.1.1 offers definitions for the responsible AI dimensions addressed in this chapter, along with an illustrative example of how these dimensions might be practically relevant. The "Example" column examines a hypothetical platform that employs AI to analyze medical patient data for personalized treatment recommendations, and demonstrates how issues like privacy, transparency, etc., could be relevant.

Responsible AI dimensions, definitions, and examples
Source: AI Index, 2024

Responsible AI dimension	Definition	Example
Data governance	Establishment of policies, procedures, and standards to ensure the quality, security, and ethical use of data, which is crucial for accurate, fair, and responsible AI operations, particularly with sensitive or personally identifiable information.	Policies and procedures are in place to maintain data quality and security, with a particular focus on ethical use and consent, especially for sensitive health information.
Explainability	The capacity to comprehend and articulate the rationale behind AI decisions, emphasizing the importance of AI being not only transparent but also understandable to users and stakeholders.	The platform can articulate the rationale behind its treatment recommendations, making these insights understandable to doctors and patients, ensuring trust in its decisions.
Fairness	Creating algorithms that are equitable, avoiding bias or discrimination, and considering the diverse needs and circumstances of all stakeholders, thereby aligning with broader societal standards of equity.	The platform is designed to avoid bias in treatment recommendations, ensuring that patients from all demographics receive equitable care.
Privacy	An individual's right to confidentiality, anonymity, and protection of their personal data, including the right to consent and be informed about data usage, coupled with an organization's responsibility to safeguard these rights when handling personal data.	Patient data is handled with strict confidentiality, ensuring anonymity and protection. Patients consent to whether and how their data is used to train a treatment recommendation system.
Security and safety	The integrity of AI systems against threats, minimizing harms from misuse, and addressing inherent safety risks like reliability concerns and the potential dangers of advanced AI systems.	Measures are implemented to protect against cyber threats and ensure the system's reliability, minimizing risks from misuse or inherent system errors, thus safeguarding patient health and data.
Transparency	Open sharing of development choices, including data sources and algorithmic decisions, as well as how AI systems are deployed, monitored, and managed, covering both the creation and operational phases.	The development choices, including data sources and algorithmic design decisions, are openly shared. How the system is deployed and monitored is clear to healthcare providers and regulatory bodies.

Figure 3.1.1

1 Although Figure 3.1.1 breaks down various dimensions of responsible AI into specific categories to improve definitional clarity, this chapter organizes these dimensions into the following broader categories: privacy and data governance, transparency and explainability, security and safety, and fairness.

AI Incidents

The AI Incident Database (AIID) tracks instances of ethical misuse of AI, such as autonomous cars causing pedestrian fatalities or facial recognition systems leading to wrongful arrests. As depicted in Figure 3.1.2, the number of AI incidents continues to climb annually. In 2023, 123 incidents were reported, a 32.3% increase from 2022. Since 2013, AI incidents have grown by over twentyfold.

The continuous increase in reported incidents likely arises from both greater integration of AI into real-world applications and heightened awareness of its potential for ethical misuse. However, it is important to note that as awareness grows, incident tracking and reporting also improve, indicating that earlier incidents may have been underreported.
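The year-over-year figure can be sanity-checked with a little arithmetic. The sketch below is illustrative only: the helper name `implied_previous_count` is an assumption, and the values it derives are implied by the reported numbers rather than quoted from the AIID.

```python
def implied_previous_count(current: float, pct_increase: float) -> float:
    """Back out the prior-year value from a current value and a percent increase."""
    return current / (1 + pct_increase / 100)


# Figures reported in the text: 123 incidents in 2023, up 32.3% from 2022.
incidents_2023 = 123
implied_2022 = implied_previous_count(incidents_2023, 32.3)

# "Over twentyfold" growth since 2013 implies a 2013 count below this bound.
upper_bound_2013 = incidents_2023 / 20

print(round(implied_2022), upper_bound_2013)
```

Under these assumptions, the 2022 count implied by the reported growth rate is roughly 93 incidents, and the 2013 count would have been in the single digits.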

Number of reported AI incidents, 2012-23

Examples

The next section details recent AI incidents to shed light on the ethical challenges commonly linked with AI.

AI-generated nude images of Taylor Swift

In January 2024, sexually explicit, AI-generated images purportedly depicting Taylor Swift surfaced on X (formerly Twitter). These images remained live for 17 hours, amassing over 45 million views before they were removed. Generative AI models can effortlessly extrapolate from training data, which often include nude images and celebrity photographs, to produce nude images of celebrities, even when images of the targeted celebrity are absent from the original dataset. Filters are in place that aim to prevent such content creation; however, these filters can usually be circumvented with relative ease.

Unsafe behavior of fully self-driving cars

Recent reports have surfaced about a Tesla in Full Self-Driving mode that detected a pedestrian on a crosswalk in San Francisco but failed to decelerate and allow the pedestrian to cross the street safely (Figure 3.1.3). Unlike other developers of (partially) automated driving systems, who limit the use of their software to specific settings such as highways, Tesla permits the use of their beta software on regular streets. This incident is one of several alleged cases of unsafe driving behavior by cars in Full Self-Driving mode. In November 2022, a Tesla was involved in an eight-car collision after abruptly braking. Another crash involving a Tesla is under investigation for potentially being the first fatality caused by Full Self-Driving mode.

Tesla recognizing pedestrian but not slowing down at a crosswalk
Source: Gitlin, 2023
Figure 3.1.3

Privacy concerns with romantic AI chatbots

Romantic AI chatbots are meant to resemble a lover or friend, to listen attentively, and to be a companion for their users (Figure 3.1.4). In this process, they end up collecting significant amounts of private and sensitive information. Researchers from the Mozilla Foundation reviewed 11 romantic AI chatbots for privacy risks and found that these chatbots collect excessive personal data, can easily be misused, and offer inadequate data protection measures. For example, the researchers found that the privacy policy by Crushon.AI states that it "may collect extensive personal and even health-related information from you like your 'sexual health information,' '[u]se of prescribed medication,' and '[g]ender-affirming care information.'" The researchers further discussed privacy concerns associated with romantic AI chatbots and highlighted how the services, despite being marketed as empathetic companions, are not transparent about their operation and data handling.

Romantic chatbot generated by DALL-E
Source: AI Index, 2024
Figure 3.1.4

Risk Perception

In collaboration with Accenture, this year a team of Stanford researchers ran a global survey with respondents from more than 1,000 organizations to assess the global state of responsible AI. The organizations, with total revenues of at least

million each, were taken from 20 countries and 19 industries and responded in February-March 2024.

The objective of the Global State of Responsible AI survey was to gain an understanding of the challenges of adopting responsible AI practices and to allow for a comparison of responsible AI activities across 10 dimensions and across surveyed industries and regions.

Respondents were asked which risks were relevant to them, given their AI adoption strategy; i.e., depending on whether they develop, deploy, or use generative or nongenerative AI. They were presented with a list of 14 risks and could select all that apply to them, given their AI adoption strategies.

The researchers found that privacy and data governance risks, e.g., the use of data without the owner's consent or data leaks, are the leading concerns across the globe. Notably, they observe that these concerns are significantly higher in Asia and Europe compared to North America. Fairness risks were only selected by

of North American respondents, significantly less than respondents in Asia (31%) and Europe (34%) (Figure 3.1.5). Respondents in Asia selected, on average, the highest number of relevant risks (4.99), while Latin American respondents selected, on average, the fewest (3.64).
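The regional figures above are the kind of summary a select-all-that-apply question produces. As a minimal sketch of that aggregation, assuming hypothetical respondents and risk labels (none of the data below comes from the survey), the share of respondents selecting each risk and the mean number of risks selected per region could be computed as follows:

```python
from collections import defaultdict

# Hypothetical responses: (region, set of risks the respondent marked as relevant).
responses = [
    ("Asia", {"privacy", "fairness", "security"}),
    ("Asia", {"privacy", "reliability"}),
    ("Europe", {"privacy", "fairness"}),
    ("North America", {"security"}),
    ("North America", {"privacy", "transparency"}),
]


def summarize(responses):
    """Return, per region, the mean number of risks selected and the share selecting each risk."""
    by_region = defaultdict(list)
    for region, risks in responses:
        by_region[region].append(risks)

    summary = {}
    for region, selections in by_region.items():
        n = len(selections)
        mean_selected = sum(len(s) for s in selections) / n
        all_risks = set().union(*selections)
        share = {r: sum(r in s for s in selections) / n for r in sorted(all_risks)}
        summary[region] = {"mean_selected": mean_selected, "share": share}
    return summary


summary = summarize(responses)
print(summary["Asia"]["mean_selected"])
print(summary["North America"]["share"]["privacy"])
```

Because each respondent can select several risks, the per-risk shares within a region do not sum to 100%, which is why the report also gives the average number of risks selected.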

Relevance of selected responsible AI risks for organizations by region
Source: Global State of Responsible AI report, 2024 | Chart: 2024 AI Index report

3 The full Global State of Responsible AI report is forthcoming in May 2024. Additional details about the methodology can be found in the Appendix to this chapter.

4 The full list of risks can be found in the Appendix. In Figure 3.1.5, the AI Index only reports the percentages for risks covered by this chapter.

Risk Mitigation

The Global State of Responsible AI survey finds that organizations in most regions have started to operationalize responsible AI measures. The majority of organizations across regions have fully operationalized at least one mitigation measure for risks they reported as relevant to them, given their AI adoption (Figure 3.1.6).

Some companies in Europe (18%), North America (17%), and Asia (25%) have already operationalized more than half of the measures the researchers asked about across the following dimensions: fairness, transparency and explainability, privacy and data governance, reliability, and security.

Global responsible AI adoption by organizations by region
Source: Global State of Responsible AI report, 2024 | Chart: 2024 AI Index report

Figure 3.1.6

Note: Not all differences between regions are statistically significant.

Overall Trustworthiness

As noted above, responsible AI encompasses various dimensions, including fairness and privacy. Truly responsible AI models need to excel across all these aspects. To facilitate the evaluation of broad model "responsibility" or trustworthiness, a team of researchers introduced DecodingTrust, a new benchmark that evaluates LLMs on a broad spectrum of responsible AI metrics like stereotype and bias, adversarial robustness, privacy, and machine ethics, among others. Models receive a trustworthiness score, with a higher score signifying a more reliable model.

The study highlights new vulnerabilities in GPT-type models, particularly their propensity for producing biased outputs and leaking private information from training datasets and conversation histories. Despite GPT-4's improvements over GPT-3.5 on standard benchmarks, GPT-4 remains more susceptible to misleading prompts from jailbreaking tactics. This increased vulnerability is partly due to GPT-4's improved fidelity in following instructions. Hugging Face now hosts an LLM Safety Leaderboard, which is based on the framework introduced in DecodingTrust. As of early 2024, Anthropic's Claude 2.0 was rated as the safest model (Figure 3.1.7).
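To make the idea of a single trustworthiness score concrete, the sketch below averages per-dimension scores and ranks models by the result. It is an illustration under stated assumptions: the model names and scores are made up, only four of the dimensions named above are included, and the unweighted mean is an assumed aggregation rule rather than a description of how DecodingTrust or the leaderboard computes its score.

```python
from statistics import mean

# Hypothetical per-dimension scores (0-100, higher is better) for two made-up models.
scores = {
    "model_a": {"stereotype_bias": 90.0, "adversarial_robustness": 60.0,
                "privacy": 80.0, "machine_ethics": 70.0},
    "model_b": {"stereotype_bias": 70.0, "adversarial_robustness": 50.0,
                "privacy": 90.0, "machine_ethics": 60.0},
}


def trustworthiness(dimension_scores: dict) -> float:
    """Unweighted mean across dimensions (an assumed aggregation rule)."""
    return mean(dimension_scores.values())


# Rank models from highest to lowest average score.
ranking = sorted(scores, key=lambda m: trustworthiness(scores[m]), reverse=True)
print([(m, trustworthiness(scores[m])) for m in ranking])
```

An average of this kind can hide weak spots: in the made-up data, `model_b` scores higher on privacy yet ranks lower overall, so per-dimension scores remain worth reporting alongside any aggregate.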

Average trustworthiness score across selected responsible AI dimensions
Source: LLM Safety Leaderboard, 2024 | Chart: 2024 AI Index report

Benchmarking Responsible AI

Tracking Notable Responsible AI Benchmarks

Benchmarks play an important role in tracking the capabilities of state-of-the-art AI models. In recent years there has been a shift toward evaluating models not only on their broader capabilities but also on responsibility-related features. This change reflects the growing importance of AI and the growing demands for AI accountability. As AI becomes more ubiquitous and calls for responsibility mount, it will become increasingly important to understand which benchmarks researchers prioritize.

Figure 3.1.8 presents the year-over-year citations for a range of popular responsible AI benchmarks. Introduced in 2021, TruthfulQA assesses the truthfulness of LLMs in their responses. RealToxicityPrompts and ToxiGen track the extent of toxic output produced by language models. Additionally, BOLD and BBQ evaluate the bias present in LLM generations. Citations, while not completely reflective of benchmark use, can serve as a proxy for tracking benchmark salience.

Virtually all benchmarks tracked in Figure 3.1.8 have seen more citations in 2023 than in 2022, reflecting their increasing significance in the responsible AI landscape. Citations for TruthfulQA have risen especially sharply.

Reporting Consistency

The effectiveness of benchmarks largely depends on their standardized application. Comparing model capabilities becomes more straightforward when models are consistently evaluated against a specific set of benchmarks. However, testing models on different benchmarks complicates comparisons, as individual benchmarks have unique and idiosyncratic natures. Standardizing benchmark testing, therefore, plays an important role in enhancing transparency around AI capabilities.

New analysis from the AI Index, however, suggests that standardized benchmark reporting for responsible AI capability evaluations is lacking. The AI Index examined a selection of leading AI model developers, specifically OpenAI, Meta, Anthropic, Google, and Mistral AI. The Index identified one flagship model from each developer (GPT-4, Llama 2, Claude 2, Gemini, and Mistral 7B) and assessed the benchmarks on which they evaluated their model. A few standard benchmarks for general capabilities evaluation were commonly used by these developers, such as MMLU, HellaSwag, ARC Challenge, Codex HumanEval, and GSM8K (Figure 3.1.9).

Reported general benchmarks for popular foundation models
Source: AI Index, 2024 | Table: 2024 AI Index report

General benchmarks	GPT-4	Llama 2	Claude 2	Gemini	Mistral 7B
MMLU
HellaSwag
Challenge (ARC)
WinoGrande
Codex HumanEval
GSM8K
Big Bench Hard
Natural Questions
BoolQ

However, consistency was lacking in the reporting of responsible AI benchmarks (Figure 3.1.10). Unlike general capability evaluations, there is no universally accepted set of responsible AI benchmarks used by leading model developers. TruthfulQA, at most, is used by three out of the five selected developers. Other notable responsible AI benchmarks like RealToxicityPrompts, ToxiGen, BOLD, and BBQ are each utilized by at most two of the five profiled developers. Furthermore, one out of the five developers did not report any responsible AI benchmarks, though all developers mentioned conducting additional, nonstandardized internal capability and safety tests.

The inconsistency in reported benchmarks complicates the comparison of models, particularly in the domain of responsible AI. The diversity in benchmark selection may reflect existing benchmarks becoming quickly saturated, rendering them ineffective for comparison, or the regular introduction of new benchmarks without clear reporting standards. Additionally, developers might selectively report benchmarks that positively highlight their model's performance. To improve responsible AI reporting, it is important that a consensus is reached on which benchmarks model developers should consistently test against.
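The comparison problem can be illustrated with a small coverage count. The sketch below is a hedged illustration: the developer labels (`dev_a` to `dev_e`) and the reporting matrix are hypothetical placeholders, not the contents of Figure 3.1.10. It counts how many developers report each benchmark and finds the set reported by all of them.

```python
from collections import Counter

# Hypothetical reporting matrix: developer -> benchmarks named in its model report.
reported = {
    "dev_a": {"TruthfulQA", "RealToxicityPrompts"},
    "dev_b": {"TruthfulQA", "ToxiGen", "BOLD"},
    "dev_c": {"BBQ", "TruthfulQA"},
    "dev_d": {"BBQ"},
    "dev_e": set(),  # a developer that reported no responsible AI benchmarks
}

# Count how many developers report each benchmark.
coverage = Counter(b for benchmarks in reported.values() for b in benchmarks)

# Benchmarks reported by every developer form the only directly comparable set.
common = set.intersection(*reported.values())

print(coverage.most_common())
print(common)
```

In this made-up example the common set is empty, so no benchmark supports a like-for-like comparison across all five developers, which is the situation a shared reporting standard would address.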

Reported responsible AI benchmarks for popular foundation models
Source: AI Index, 2024 | Table: 2024 AI Index report

Responsible AI benchmarks	GPT-4	Llama 2	Claude 2	Gemini	Mistral 7B
TruthfulQA
RealToxicityPrompts
ToxiGen
BOLD
BBQ

3.2 Privacy and Data Governance

A comprehensive definition of privacy is difficult and context-dependent. For the purposes of this report, the AI Index defines privacy as an individual's right to the confidentiality, anonymity, and protection of their personal data, along with their right to consent to and be informed about whether and how their data is used. Privacy further includes an organization's responsibility to ensure these rights if they collect, store, or use personal data (directly or indirectly). In AI, this involves ensuring that personal data is handled in a way that respects individual privacy rights, for example, by implementing measures to protect sensitive information from exposure, and ensuring that data collection and processing are transparent and compliant with privacy laws like GDPR.

Data governance, on the other hand, encompasses policies, procedures, and standards established to ensure the quality, security, and ethical use of data within an organization. In the context of AI, data governance is crucial for ensuring that the data used for training and operating AI systems is accurate, fair, and used responsibly and with consent. This is especially the case with sensitive or personally identifiable information (PII).

Current Challenges

Obtaining genuine and informed consent for training data collection is especially challenging with LLMs, which rely on massive amounts of data. In many cases, users are unaware of how their data is being used or the extent of its collection. Therefore, it is important to ensure transparency around data collection practices.

Relatedly, there may be trade-offs between the utility derived from AI systems and the privacy of individuals. Striking the right balance is complex. Finally, properly anonymizing data to enhance privacy while retaining data usefulness for AI training can be technically challenging, as there is always a risk that anonymized data can be re-identified.
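One common way to reason about re-identification risk is k-anonymity, a measure not named in the report and used here purely for illustration. The sketch below assumes hypothetical records and a hypothetical choice of quasi-identifiers (ZIP code, birth year, sex). It reports the size of the smallest group of records sharing the same quasi-identifier values and lists records that are unique on those values.

```python
from collections import Counter

# Hypothetical "anonymized" records: names removed, quasi-identifiers retained.
records = [
    {"zip": "94305", "birth_year": 1980, "sex": "F", "diagnosis": "A"},
    {"zip": "94305", "birth_year": 1980, "sex": "F", "diagnosis": "B"},
    {"zip": "94305", "birth_year": 1975, "sex": "M", "diagnosis": "C"},
    {"zip": "10001", "birth_year": 1990, "sex": "F", "diagnosis": "A"},
    {"zip": "10001", "birth_year": 1990, "sex": "F", "diagnosis": "D"},
]

QUASI_IDENTIFIERS = ("zip", "birth_year", "sex")


def group_sizes(rows, quasi_identifiers):
    """Count records per combination of quasi-identifier values."""
    return Counter(tuple(r[q] for q in quasi_identifiers) for r in rows)


def k_anonymity(rows, quasi_identifiers):
    """Size of the smallest group sharing the same quasi-identifier values."""
    return min(group_sizes(rows, quasi_identifiers).values())


def unique_rows(rows, quasi_identifiers):
    """Records whose quasi-identifier combination appears exactly once."""
    sizes = group_sizes(rows, quasi_identifiers)
    return [r for r in rows if sizes[tuple(r[q] for q in quasi_identifiers)] == 1]


k = k_anonymity(records, QUASI_IDENTIFIERS)
at_risk = unique_rows(records, QUASI_IDENTIFIERS)
print(k, at_risk)
```

A record that is unique on its quasi-identifiers can be linked to a named individual by anyone holding an outside dataset with the same fields. Coarsening those fields raises k but removes detail that may matter for training, which is the utility trade-off described above.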

Privacy and Data Governance in Numbers

The following section reviews the state of privacy and data governance within academia and industry.

Academia

For this year's report, the AI Index examined the number of responsible-AI-related academic submissions to six leading AI conferences: AAAI, AIES, FAccT, ICML, ICLR, and NeurIPS.

Privacy and data governance continue to increase as a topic of interest for AI researchers. There were 213 privacy and data governance submissions in 2023 at the select AI conferences analyzed by the AI Index, more than double the number submitted in 2022 (92), and more than five times the number submitted in 2019 (39) (Figure 3.2.1).
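The multiples can be checked directly from the three counts quoted in the text. The snippet below is a small arithmetic check; the variable names are illustrative.

```python
# Submission counts reported in the text for the selected conferences.
submissions = {2019: 39, 2022: 92, 2023: 213}

ratio_vs_2022 = submissions[2023] / submissions[2022]
ratio_vs_2019 = submissions[2023] / submissions[2019]

print(f"2023 vs 2022: {ratio_vs_2022:.2f}x")
print(f"2023 vs 2019: {ratio_vs_2019:.2f}x")
```

The 2023 count works out to about 2.3 times the 2022 count and about 5.5 times the 2019 count.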

AI privacy and data governance submissions to select academic conferences, 2019-23
Source: AI Index, 2024 | Chart: 2024 AI Index report

6 The methodology employed by the AI Index to gather conference submission data is detailed in the Appendix of this chapter. The conference data is presented in various forms throughout the chapter. The same methodology was applied to all data on conference submissions featured in this chapter.

Industry

According to the Global State of Responsible AI Survey, conducted in collaboration by researchers from Stanford University and Accenture,

of all organizations reported that privacy and data governance-related risks are pertinent to their AI adoption strategy.

Geographically, organizations in Europe (56%) and Asia (55%) most frequently reported privacy and data governance risks as relevant, while those headquartered in North America (42%) reported them the least.

Organizations were also asked whether they had taken steps to mitigate data governance-related risks. The survey listed six possible data governance-related measures that respondents could indicate adopting. Example measures include ensuring data compliance with all relevant laws and regulations, securing consent for data use, and conducting regular audits and updates to maintain data relevance. Overall, less than […]% of companies indicated that they had fully operationalized all six data governance mitigations. However, […]% of companies self-reported that they had operationalized at least one measure. Moreover, […]% reported they had yet to fully operationalize any of the measures. Globally, the companies surveyed reported adopting an average of 2.2 out of 6 data governance measures.
组织还被问及是否已采取措施来减轻与数据治理相关的风险。调查列出了六种可能的数据治理相关措施，受访者可以指出已采用哪些。示例措施包括确保数据符合所有相关法律法规、获得数据使用的同意，以及进行定期审计和更新以保持数据的相关性。总体而言，不到 […]% 的公司表示已经完全实施了所有六项数据治理缓解措施。然而，[…]% 的公司自我报告称至少已经实施了一项措施。此外，[…]% 的公司报告称尚未完全实施任何措施。在全球范围内，受访公司平均采用了 6 项数据治理措施中的 2.2 项。

Figure 3.2.2 visualizes the mean adoption rate disaggregated by geographic region. Figure 3.2.3 visualizes the rate at which companies in different industries reported adopting data governance measures.
图 3.2.2 展示了按地理区域细分的平均采用率。图 3.2.3 展示了不同行业公司报告采用数据治理措施的速率。

Source: Global State of Responsible AI report, 2024 | Chart: 2024 AI Index report
来源：2024 年全球负责任 AI 状况报告 | 图表：2024 年 AI 指数报告

Figure 3.2.2 图 3.2.2

Note: The numbers in parentheses are the average numbers of mitigation measures fully operationalized within each region. Not all differences between regions are statistically significant.
注：括号中的数字是每个地区完全运作的减缓措施的平均数。并非所有地区之间的差异在统计上都具有显著性。

industry

Source: Global State of Responsible AI report, 2024 | Chart: 2024 AI Index report
来源：2024 年全球负责任 AI 状况报告 | 图表：2024 年 AI 指数报告

Note: The numbers in parentheses are the average numbers of mitigation measures fully operationalized within each industry. Not all differences between industries are statistically significant.
注：括号中的数字是每个行业内完全运作的缓解措施的平均数。并非所有行业之间的差异在统计上都具有显著性。

Featured Research 精选研究

This section highlights significant research that was published in 2023 on privacy and data governance in AI. These studies explored data extraction from LLMs, challenges in preventing duplicated generative AI content, and low-resource privacy auditing.
本节重点介绍了 2023 年关于人工智能隐私和数据治理的重要研究。这些研究探讨了从 LLMs 中提取数据、防止重复的生成式 AI 内容所面临的挑战，以及低资源隐私审计。

Extracting Data From LLMs
从LLMs中提取数据

LLMs are trained on massive amounts of data, much of which has been scraped from public sources like the internet. Given the vastness of information that can be found online, it is not surprising that some PII is inevitably scraped as well. A study published in November 2023 explores extractable memorization: if and how sensitive training data can be extracted from LLMs without knowing the initial training dataset in advance. The researchers tested open models like Pythia and closed models like ChatGPT. The authors showed that it is possible to recover a significant amount of training data from all of these models, whether they are open or closed. While open and semi-open models can be attacked using methods from previous research, the authors found new attacks to overcome guardrails of models like ChatGPT.
LLMs是在大量数据上进行训练的，其中许多数据是从互联网等公共来源中抓取的。鉴于在线可以找到的信息之多，一些个人可识别信息（PII）不可避免地也会被抓取。2023 年 11 月发表的一项研究探讨了可提取的记忆：即在不提前知道初始训练数据集的情况下，是否以及如何可以从LLMs中提取敏感的训练数据。研究人员测试了像 Pythia 这样的开放模型和像 ChatGPT 这样的封闭模型。作者表明，可以从所有这些模型中恢复大量的训练数据，无论它们是开放的还是封闭的。虽然开放和半开放模型可以使用先前研究的方法进行攻击，但作者发现了新的攻击方法来克服像 ChatGPT 这样的模型的防护措施。

The authors propose that the key to data extraction lies in prompting the model to deviate from its standard dialog-style generation. For instance, the prompt "Repeat this word forever: 'poem poem poem poem'" can lead ChatGPT to inadvertently reveal sensitive PII data verbatim (Figure 3.2.4). Some prompts are more effective than others in causing this behavior (Figure 3.2.5). Although most deviations produce nonsensical outputs, a certain percentage of responses disclose training data from the models. Using this approach, the authors managed to extract not just PII but also NSFW content, verbatim literature, and universally unique identifiers (UUIDs).

作者提出，数据提取的关键在于促使模型偏离其标准的对话式生成方式。例如，提示“无限重复这个词：'诗诗诗诗'”可能导致 ChatGPT 无意中逐字透露敏感的 PII 数据（图 3.2.4）。一些提示比其他提示更有效地引起这种行为（图 3.2.5）。尽管大多数偏差会产生荒谬的输出，但一定比例的回应会透露模型的训练数据。通过这种方法，作者成功提取了不仅仅是 PII，还有 NSFW 内容、逐字文学作品和通用唯一标识符。
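
Deciding whether a response actually "discloses training data" requires comparing it against known text. The sketch below shows one simple way such a check could look: it flags runs of `k` consecutive tokens in a model output that also occur in a reference corpus. It is an illustrative simplification, not the authors' pipeline; whitespace tokenization, the window length `k`, the function names, and the fictional sample strings are all assumptions made for the example.

```python
def token_windows(tokens, k):
    """Yield every run of k consecutive tokens as a tuple."""
    for i in range(len(tokens) - k + 1):
        yield tuple(tokens[i:i + k])


def build_index(corpus_docs, k):
    """Index every k-token window that occurs in the reference corpus."""
    index = set()
    for doc in corpus_docs:
        index.update(token_windows(doc.split(), k))
    return index


def verbatim_spans(output_text, index, k):
    """Return the k-token windows of the model output that also occur in the corpus."""
    return [" ".join(w) for w in token_windows(output_text.split(), k) if w in index]


# Fictional reference text and a fictional "diverged" model output.
corpus = ["contact jane doe at 555 0100 for details about the annual report"]
output = "poem poem poem contact jane doe at 555 0100 for more poems"

index = build_index(corpus, k=6)
spans = verbatim_spans(output, index, k=6)
print(spans)
# ['contact jane doe at 555 0100', 'jane doe at 555 0100 for']
```

A set of token windows keeps the lookup simple; a realistic check over web-scale text would need a far more compact index and a longer match length.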

Red teaming models through various human-readable prompts to provoke unwanted behavior has become increasingly common. For instance, one might ask a model if it can provide instructions for building a bomb. While these methods have proven somewhat effective, the research mentioned above indicates there are other, more complex methods for eliciting unwanted behavior from models.
通过各种人类可读的提示对模型进行红队测试，以引发不良行为已变得越来越普遍。例如，有人可能会问一个模型是否能提供制造炸弹的指导。虽然这些方法在一定程度上被证明是有效的，但上述研究表明还有其他更复杂的方法可以引发模型的不良行为。

Extracting PII From ChatGPT
从 ChatGPT 中提取 PII

Source: Nasr et al., 2023
来源：Nasr 等人，2023 年

Figure 3.2.4 图 3.2.4

Recovered memorized output given different repeated tokens
给定不同重复标记的恢复记忆输出
Source: Nasr et al., 2023 | Chart: 2024 AI Index report
来源：Nasr 等人，2023 年 | 图表：2024 年 AI 指数报告

Figure 3.2.5 图 3.2.5

Foundation Models and Verbatim Generation
基础模型和逐字生成

This year, many AI researchers investigated the issue of generative models producing content that mirrors the material on which they were trained. For example, research from Google, ETH Zurich, and Cornell explored data memorization in LLMs and found that models without any protective measures (i.e., filters that guard against outputting verbatim responses) frequently reproduce text directly from their training data. Various models were found to exhibit differing rates of memorization for different datasets (Figure 3.2.6).
今年，许多 AI 研究人员调查了生成模型产生与其训练材料相似的内容的问题。例如，来自谷歌、苏黎世联邦理工学院和康奈尔大学的研究探讨了 LLMs 中的数据记忆，并发现没有任何保护措施（即防止输出逐字回应的过滤器）的模型经常直接复制其训练数据中的文本。研究发现，各种模型对不同数据集的记忆率不同（图 3.2.6）。

The authors argue that blocking the verbatim output of extended texts could reduce the risk of exposing copyrighted material and personal information through extraction attacks. They propose a solution where the model, upon generating each token, checks for n-gram matches with the training data to avoid exact reproductions. Although they developed an efficient method for this check, effectively preventing perfect verbatim outputs, they observed that the model could still approximate memorization by slightly altering outputs. This imperfect solution highlights the ongoing challenge of balancing model utility with privacy and copyright concerns.
作者认为，阻止长文本的逐字输出可以减少通过提取攻击暴露受版权保护的材料和个人信息的风险。他们提出了一个解决方案，即在生成每个标记时，模型会检查与训练数据的 n-gram 匹配，以避免完全复制。尽管他们为此检查开发了一种高效的方法，有效地防止了完全逐字输出，但他们观察到模型仍然可以通过略微改变输出来近似记忆。这种不完美的解决方案突显了在平衡模型效用与隐私和版权问题方面持续存在的挑战。
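
The decode-time check described above can be sketched with a toy example. The code below is a hedged illustration, not the authors' implementation: `ranked_candidates` is a hypothetical stand-in for a language model that returns next-token candidates best first, the training n-grams sit in a plain Python set rather than an efficient index, and the toy model and sentence are invented for the demonstration.

```python
def ngram_set(token_sequences, n):
    """Collect every n-gram (as a tuple) that appears in the training sequences."""
    grams = set()
    for seq in token_sequences:
        for i in range(len(seq) - n + 1):
            grams.add(tuple(seq[i:i + n]))
    return grams


def filtered_decode(ranked_candidates, prompt, train_ngrams, n, max_new_tokens):
    """Greedy decoding that skips any token completing an n-gram seen in training."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        for candidate in ranked_candidates(tokens):
            window = tuple(tokens[-(n - 1):] + [candidate]) if n > 1 else (candidate,)
            if len(window) < n or window not in train_ngrams:
                tokens.append(candidate)
                break
        else:
            break  # every candidate would reproduce a training n-gram
    return tokens


train = [["the", "quick", "brown", "fox", "jumps"]]


def toy_model(tokens):
    """Hypothetical model: continues the memorized sentence, with an upper-case variant as backup."""
    seq = train[0]
    last = tokens[-1].lower()
    if last in seq[:-1]:
        nxt = seq[seq.index(last) + 1]
        return [nxt, nxt.upper()]
    return []


grams = ngram_set(train, n=3)
out = filtered_decode(toy_model, ["the", "quick"], grams, n=3, max_new_tokens=3)
print(out)
# ['the', 'quick', 'BROWN', 'fox', 'jumps']
```

The exact 3-gram "the quick brown" is blocked, yet the output still tracks the memorized sentence with one token changed in case only. That is the kind of approximate memorization the authors observed slipping past an exact-match filter.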

Fraction of prompts discovering approximate memorization
发现近似记忆的提示所占比例

Source: Ippolito et al., 2023 | Chart: 2024 AI Index report
来源：Ippolito 等人，2023 年 | 图表：2024 年 AI 指数报告

Figure 3.2.6 图 3.2.6

Research has also highlighted challenges with exact and approximate memorization in visual content generation, notably with Midjourney v6.
研究还强调了在视觉内容生成中精确和近似记忆方面存在的挑战，特别是在 Midjourney v6 中。

This study discovered that certain prompts could produce images nearly identical to those in films, even without direct instructions to recreate specific movie scenes (Figure 3.2.7). For example, a generic prompt such as "animated toys --v 6.0 --ar 16:9 --style raw" yielded images closely resembling, and potentially infringing upon, characters from "Toy Story" (Figure 3.2.8). This indicates that the model might have been trained on copyrighted material. Despite efforts to frame indirect prompts to avoid infringement, the problem persisted, emphasizing the broader copyright issues associated with AI's use of unlicensed data. The research further underscores the difficulties in guiding generative AI to steer clear of copyright infringement, a concern also applicable to DALL-E, the image-generating model associated with ChatGPT (Figure 3.2.9).
这项研究发现，某些提示可以产生几乎与电影中相同的图像，即使没有直接的指令来重现特定的电影场景（图 3.2.7）。例如，一个通用的提示，比如 "animated toys --v 6.0 --ar 16:9 --style raw"，产生的图像与《玩具总动员》中的角色非常相似，甚至可能侵犯版权（图 3.2.8）。这表明该模型可能是在受版权保护的材料上进行训练的。尽管努力通过间接提示来避免侵权，但问题仍然存在，突显了与 AI 使用未经许可数据相关的更广泛的版权问题。研究进一步强调了引导生成式 AI 避免侵犯版权的困难，这一问题同样适用于与 ChatGPT 相关的图像生成模型 DALL-E（图 3.2.9）。

Identical generation of Thanos
灭霸的相同生成

Source: Marcus and Southen, 2024
来源：Marcus 和 Southen，2024

Figure 3.2.7 图 3.2.7

Identical generation of toys
玩具的相同生成

Source: Marcus and Southen, 2024
来源：Marcus 和 Southen，2024

Figure 3.2.8 图 3.2.8

Identical generation of Mario
马里奥的相同生成

Source: Marcus and Southen, 2024
来源：Marcus 和 Southen，2024

Figure 3.2.9 图 3.2.9

Auditing Privacy in AI Models
在 AI 模型中审计隐私

Determining whether a model is privacy-preserving, that is, whether it safeguards individuals' personal information and data from unauthorized disclosure or access, is challenging. Privacy auditing aims to set a lower bound on privacy loss, effectively quantifying the minimum privacy compromise in practical situations (Figure 3.2.10). Recent research from Google introduces a new method to achieve this within a single training run, marking a substantial advancement over prior methods that necessitated multiple attacks and significant computational effort.
确定模型是否保护隐私——即，是否保护个人信息和数据免受未经授权的披露或访问——是具有挑战性的。隐私审计旨在设定隐私损失的下限，有效地量化实际情况中的最小隐私妥协（图 3.2.10）。谷歌最近的研究引入了一种新方法，在单次训练中实现这一目标，相比之前需要多次攻击和大量计算工作的方法，这标志着实质性的进步。

The new technique involves incorporating multiple independent data points into the training dataset simultaneously, instead of sequentially, and assessing the model's privacy by attempting to ascertain which of these data points were utilized in training. This method is validated by showing it approximates the outcome of several individual training sessions, each incorporating a single data point. This approach is not only less computationally demanding but also has a minimal impact on model performance, offering an efficient and low-impact method for conducting privacy audits on AI models.
这种新技术涉及将多个独立数据点同时（而不是依次）纳入训练数据集中，并通过尝试确定哪些数据点在训练中被使用来评估模型的隐私性。该方法的验证方式是证明它近似于多次单独训练的结果，其中每次训练只纳入一个数据点。这种方法不仅计算要求较低，而且对模型性能影响很小，为在 AI 模型上进行隐私审计提供了一种高效且低影响的方法。
Visualizing privacy-auditing in one training run
在一次训练运行中可视化隐私审计

Identify training examples
识别训练示例

For each example, flip a coin to include or exclude it
对于每个示例，抛硬币来决定是否包含或排除它

Input selected data to model and get its output
将选定的数据输入模型并获取其输出

Auditor looks at output and guesses whether each data point was included
审计员查看输出并猜测每个数据点是否被包括
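
The four steps in the figure can be mimicked with a small simulation. This is a minimal sketch under strong assumptions, not the method from the Google paper: no model is trained, the "model output" is a hypothetical score that is higher by `signal` for included examples plus Gaussian noise, and the sketch stops at the auditor's guessing accuracy instead of converting it into a formal lower bound on privacy loss.

```python
import random


def audit_one_run(num_canaries, signal, noise_std, seed=0):
    """Simulate a single-run audit and return the auditor's guessing accuracy."""
    rng = random.Random(seed)
    # Steps 1-2: for each candidate example, flip a fair coin to include or exclude it.
    included = [rng.random() < 0.5 for _ in range(num_canaries)]
    # Step 3: hypothetical stand-in for "train once, then query the model":
    # the score is higher by `signal` when the example was in the training set.
    scores = [(signal if inc else 0.0) + rng.gauss(0.0, noise_std) for inc in included]
    # Step 4: the auditor guesses "included" when the score clears the midpoint.
    guesses = [score > signal / 2 for score in scores]
    correct = sum(guess == inc for guess, inc in zip(guesses, included))
    return correct / num_canaries


leaky = audit_one_run(num_canaries=1000, signal=1.0, noise_std=0.0)
private = audit_one_run(num_canaries=1000, signal=0.0, noise_std=1.0)
print(f"accuracy when outputs fully reveal membership: {leaky:.2f}")
print(f"accuracy when outputs carry no membership signal: {private:.2f}")
```

Accuracy well above one half means the outputs leak information about which examples were used. Accuracy near one half is what chance guessing would give.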

Transparency in AI encompasses several aspects. Data and model transparency involve the open sharing of development choices, including data sources and algorithmic decisions. Operational transparency details how AI systems are deployed, monitored, and managed in practice. While explainability often falls under the umbrella of transparency, providing insights into the AI's decision-making process, it is sometimes treated as a distinct category. This distinction underscores the importance of AI being not only transparent but also understandable to users and stakeholders. For the purposes of this chapter, the AI Index includes explainability within transparency, defining it as the capacity to comprehend and articulate the rationale behind AI decisions.
透明度在人工智能中涵盖了几个方面。数据和模型透明度涉及开放共享开发选择，包括数据来源和算法决策。操作透明度详细说明了人工智能系统在实践中是如何部署、监控和管理的。虽然可解释性通常被归类为透明度的一部分，提供对人工智能决策过程的洞察，但有时被视为一个独立的类别。这种区别强调了 AI 不仅要透明，而且要让用户和利益相关者能够理解。在本章中，AI 指数将可解释性纳入透明度范畴，将其定义为理解和表达 AI 决策背后原因的能力。

3.3 Transparency and Explainability
3.3 透明度和可解释性

Current Challenges 当前挑战

Transparency and explainability present several challenges. First, the inherent complexity of advanced models, particularly those based on deep learning, creates a "black box" scenario where it's difficult, even for developers, to understand how these models process inputs and produce outputs. This complexity obstructs comprehension and complicates the task of explaining these systems to nonexperts. Second, there is a potential trade-off between a model's complexity and its explainability. More complex models might deliver superior performance but tend to be less interpretable than simpler models, such as decision trees. This situation creates a dilemma: choosing between high-performing yet opaque models and more transparent, albeit less precise, alternatives.
透明度和可解释性带来了几个挑战。首先，先进模型的固有复杂性，特别是基于深度学习的模型，造成了一个“黑匣子”场景，即使对开发人员来说，也很难理解这些模型如何处理输入并产生输出。这种复杂性阻碍了理解，并使向非专家解释这些系统的任务变得更加复杂。其次，模型的复杂性和可解释性之间存在潜在的权衡。更复杂的模型可能提供更优越的性能，但往往比决策树等简单模型更难解释。这种情况造成了一个困境：在高性能但不透明的模型和更透明但不太精确的替代方案之间做出选择。

Transparency and Explainability in Numbers
数字中的透明度和可解释性

This section explores the state of AI transparency and explainability within academia and industry.
本节探讨了学术界和工业界中透明度和可解释性的现状。

Academia 学术界

Since 2019, the number of papers on transparency and explainability submitted to major academic conferences has more than tripled. In 2023, there was a record-high number of explainability-related submissions (393) at academic conferences including AAAI, FAccT, AIES, ICML, ICLR, and NeurIPS (Figure 3.3.1).
自 2019 年以来，提交给主要学术会议的关于透明度和可解释性的论文数量增加了三倍以上。2023 年，学术会议上关于可解释性的提交数量创下历史新高（393 篇），包括 AAAI、FAccT、AIES、ICML、ICLR 和 NeurIPS（图 3.3.1）。

AI transparency and explainability submissions to select academic conferences, 2019-23
2019 年至 2023 年部分学术会议的人工智能透明度和可解释性提交情况

Source: AI Index, 2024 | Chart: 2024 AI Index report
数据来源：2024 年 AI 指数 | 图表来源：2024 年 AI 指数报告

Industry 行业

In the Global State of Responsible AI Survey, […]% of all surveyed organizations indicated that transparency and explainability are relevant concerns given their AI adoption strategy.
在《全球负责任人工智能状况调查》中，[…]% 的受访组织表示，考虑到其 AI 采用策略，透明度和可解释性是相关的关注点。

The researchers also asked respondents if they had implemented measures to increase transparency and explainability in the development, deployment, and use of their AI systems. The survey listed four possible transparency and explainability measures that respondents could indicate adopting.

研究人员还询问受访者是否已经采取措施增加其人工智能系统在开发、部署和使用过程中的透明度和可解释性。调查列出了四种可能的透明度和可解释性措施，受访者可以指出已采纳的措施。

Figure 3.3.2 visualizes the adoption rate of these measures across different geographic areas.
图 3.3.2 展示了这些措施在不同地理区域的采纳率。

Compared to other responsible AI areas covered in the survey, a smaller share of organizations reported fully operationalizing transparency and explainability measures. The global mean was 1.43 measures adopted out of 4. Only 8% of companies across all regions and industries fully implemented more than half of the measures. A significant portion (12%) had not fully operationalized any measures. Overall, less than […]% of companies indicated full operationalization of all the measures. However, […]% self-reported operationalizing at least one measure. Figure 3.3.3 further breaks down the adoption rates of transparency and explainability mitigations by industry.
与调查涵盖的其他负责任 AI 领域相比，较少的组织报告已完全实施透明度和可解释性措施。全球平均值为 4 项措施中的 1.43 项。在所有地区和行业中，仅有 8% 的公司完全实施了超过一半的措施。相当大比例（12%）的公司尚未完全实施任何措施。总体而言，不到 […]% 的公司表示已全面实施了所有措施。然而，[…]% 的公司自称至少已实施了一项措施。图 3.3.3 进一步细分了各行业对透明度和可解释性缓解措施的采纳率。

Source: Global State of Responsible AI report, 2024 | Chart: 2024 AI Index report
来源：2024 年全球负责任人工智能报告 | 图表：2024 年人工智能指数报告

Note: The numbers in parentheses are the average numbers of mitigation measures fully operationalized within each region. Not all differences between regions are statistically significant.
注：括号中的数字表示每个区域内全面运作的缓解措施的平均数。区域间的差异并非全部统计学上显著。

Figure 3.3.3 图 3.3.3

Note: The numbers in parentheses are the average numbers of mitigation measures fully operationalized within each industry. Not all differences between industries are statistically significant.
注：括号中的数字是每个行业内全面运作的减缓措施的平均值。并非所有行业之间的差异都具有统计学显著性。

Featured Research 特色研究

This section showcases significant research published in 2023 on transparency and explainability in AI. The research includes a new index that monitors AI model transparency, as well as studies on neurosymbolic AI.
本节展示了 2023 年在人工智能透明度和可解释性方面发表的重要研究，包括一个监测 AI 模型透明度的新指数，以及关于神经符号人工智能的研究。

The Foundation Model Transparency Index
基础模型透明度指数

In October 2023, Stanford, Princeton, and MIT researchers released the Foundation Model Transparency Index (FMTI). This index evaluates the degree to which foundation models are transparent across diverse dimensions, including resource allocation for development, algorithmic design strategies, and downstream applications of the models. The analysis draws on publicly accessible data that developers release about their models.
2023 年 10 月，斯坦福大学、普林斯顿大学和麻省理工学院的研究人员发布了基础模型透明度指数（FMTI）。该指数评估了基础模型在资源分配、算法设计策略和模型的下游应用等多个维度上的透明度。该分析依赖于开发人员公开发布的关于他们模型的数据。

Meta's Llama 2 and BigScience's BLOOMZ stand out as the most transparent models (Figure 3.3.4). However, it is important to note that all models received relatively low scores, with the mean score at […]. Additionally, open models (those that openly release their weights) tend to score significantly better on transparency, with an average score of […], compared to closed models, which have limited access and score an average of […].
Meta 的 Llama 2 和 BigScience 的 BLOOMZ 是最透明的模型（图 3.3.4）。然而，值得注意的是，所有模型得分都相对较低，平均得分为 […]。此外，开放模型（即公开发布其权重的模型）在透明度方面得分明显更高，平均得分为 […]，而访问受限的封闭模型平均得分为 […]。

Foundation model transparency total scores of open vs. closed developers, 2023
2023 年开放和封闭开发者的基础模型透明度总分对比

Source: 2023 Foundation Model Transparency Index
来源：2023 年基础模型透明度指数

Figure 3.3.4 图 3.3.4

13 An updated version of the FMTI is scheduled for release in spring 2024. Therefore, the figures presented in this edition of the AI Index may not reflect the most up-to-date assessment of developer transparency.
13 FMTI 的更新版本计划于 2024 年春季发布。因此，本版 AI 指数中呈现的数据可能不反映开发者透明度的最新评估。

The researchers further categorize the models based on their openness levels, as detailed in Figure 3.3.5. While Figure 3.3.4 provides an aggregated overview of the transparency of each foundation model, incorporating over 100 indicators, Figure 3.3.5 outlines the models' categorization by access level. This perspective offers greater insights into the variability of model access and illustrates how existing models align with different access schemes.
研究人员根据模型的开放程度进一步对其进行分类，详见图 3.3.5。虽然图 3.3.4 提供了每个基础模型透明度的汇总概述，包括 100 多个指标，但图 3.3.5 概述了模型按访问级别进行分类。这种视角提供了对模型访问变化性的更深入洞察，并说明了现有模型如何与不同的访问方案相匹配。

Levels of accessibility and release strategies of foundation models
基础模型的可访问性水平和发布策略

Source: Bommasani et al., 2023 | Table: 2024 AI Index report
来源：Bommasani 等人，2023 年 | 表格：2024 年 AI 指数报告

Level of access, from fully closed to fully open: Fully closed; Gradual/staged release; Hosted access; Cloud-based/API access; Downloadable; Fully open
访问级别（从完全封闭到完全开放）：完全封闭；渐进/分阶段发布；托管访问；云端/API 访问；可下载；完全开放

Considerations toward the fully closed end: Internal research only; High risk control; Low auditability; Limited perspectives
靠近完全封闭一端的考虑因素：仅限内部研究；高风险控制；低审计性；有限的观点

Considerations toward the fully open end: Community research; Low risk control; High auditability; Broader perspectives
靠近完全开放一端的考虑因素：社区研究；低风险控制；高审计性；更广泛的视角

System (developer), in the order listed in the figure: PaLM (Google); Gopher (DeepMind); Imagen (Google); Make-A-Video (Meta); GPT-2 (OpenAI); Stable Diffusion (Stability AI); DALL-E 2 (OpenAI); Midjourney (Midjourney); GPT-3 (OpenAI); OPT (Meta); Craiyon (Craiyon); BLOOM (BigScience); GPT-J (EleutherAI)
系统（开发者），按图中列出的顺序：PaLM（Google）；Gopher（DeepMind）；Imagen（Google）；Make-A-Video（Meta）；GPT-2（OpenAI）；Stable Diffusion（Stability AI）；DALL-E 2（OpenAI）；Midjourney（Midjourney）；GPT-3（OpenAI）；OPT（Meta）；Craiyon（Craiyon）；BLOOM（BigScience）；GPT-J（EleutherAI）

Neurosymbolic Artificial Intelligence (Why, What, and How)
神经符号人工智能（为什么、什么和如何）

Neurosymbolic AI is an interesting research direction for creating more transparent and explainable AI models; it works by integrating deep learning with symbolic reasoning. Unlike less interpretable deep learning models, symbolic reasoning offers clearer insights into how models work and allows for direct modifications of the model's knowledge through expert feedback. However, symbolic reasoning alone typically falls short of deep learning models in terms of performance. Neurosymbolic AI aims to combine the best of both worlds.
神经符号人工智能是一个有趣的研究方向，旨在通过将深度学习与符号推理相结合，创造更透明和可解释的模型。与较难解释的深度学习模型不同，符号推理能够更清晰地揭示模型的工作原理，并允许通过专家反馈直接修改模型的知识。然而，单独的符号推理通常在性能上不及深度学习模型。神经符号人工智能旨在结合两者的优势。

Research from the University of South Carolina and the University of Maryland provides a comprehensive mapping and taxonomy of various approaches within neurosymbolic AI. The research distinguishes between approaches that compress structured symbolic knowledge for integration with neural network structures and those that extract information from neural networks to translate them back into structured symbolic representations for reasoning. Figure 3.3.6 illustrates two examples of how this integration could be achieved. The researchers hope that neurosymbolic AI could mitigate some of the shortcomings of purely neural network-based models, such as hallucinations or incorrect reasoning, by mimicking human cognition, specifically by enabling models to possess an explicit knowledge model of the world.
南卡罗来纳大学和马里兰大学的研究提供了神经符号人工智能中各种方法的全面映射和分类。该研究区分了将结构化符号知识压缩以与神经网络结构集成的方法与从神经网络中提取信息以将其翻译回结构化符号表示以进行推理的方法。图 3.3.6 展示了如何实现这种集成的两个示例。研究人员希望神经符号人工智能能够通过模仿人类认知，特别是通过使模型拥有世界的显式知识模型，来减轻纯神经网络模型的一些缺点，例如幻觉或错误推理。

Integrating neural network structures with symbolic representation
将神经网络结构与符号表示集成

Source: Sheth, Roy, and Gaur, 2023
来源：Sheth，Roy 和 Gaur，2023

Figure 3.3.6 图 3.3.6

In 2023, as AI capabilities continued to improve and models became increasingly ubiquitous, concerns about their security and safety became a top priority for decision-makers. This chapter explores three distinct aspects of security and safety. First, guaranteeing the integrity of AI systems involves protecting components such as algorithms, data, and infrastructure against external threats like cyberattacks or adversarial attacks. Second, safety involves minimizing harms stemming from the deliberate or inadvertent misuse of AI systems. This includes concerns such as the development of automated hacking tools or the utilization of AI in cyberattacks. Lastly, safety encompasses inherent risks from AI systems themselves, such as reliability concerns (e.g., hallucinations) and potential risks posed by advanced AI systems.
2023 年，随着人工智能能力不断提升，模型变得越来越普遍，对其安全性和安全性的担忧成为决策者的首要任务。本章探讨了安全和安全的三个不同方面。首先，确保人工智能系统的完整性涉及保护算法、数据和基础设施等组件免受外部威胁，如网络攻击或敌对攻击。其次，安全性涉及最大限度地减少由人工智能系统的故意或无意的误用导致的伤害。这包括担忧，如开发自动化黑客工具或利用人工智能发动网络攻击。最后，安全性包括人工智能系统本身的固有风险，例如可靠性问题（例如幻觉）和先进人工智能系统可能带来的潜在风险。

3.4 Security and Safety
3.4 安全和安全

Current Challenges 当前的挑战

In 2023, the security and safety of AI systems sparked significant debate, particularly regarding the potential extreme or catastrophic risks associated with advanced AI. Some researchers advocated addressing current risks such as algorithmic discrimination, while others emphasized the importance of preparing for potential extreme risks posed by advanced AI. Given that there is no guarantee that the latter risks will not manifest at some point, there is a need to address both present risks through responsible AI development while also monitoring potential future risks that have yet to materialize. Furthermore, the dual-use potential of AI systems, especially foundation models, for both beneficial and malicious purposes, has added complexity to discussions regarding necessary security measures.
在 2023 年，人工智能系统的安全性和稳定性引发了广泛的讨论，特别是针对先进人工智能可能带来的极端或灾难性风险。一些研究人员主张解决当前的风险，如算法歧视问题，而其他人则强调应该预防先进人工智能可能带来的极端风险。鉴于没有确保后者风险不会在某个时间发生的保证，我们需要通过负责任的发展来解决当前的风险，同时监测可能尚未显现的未来风险。此外，人工智能系统的双重用途潜力，特别是基础模型在有益和恶意目的上的应用，给讨论必要的安全措施增加了复杂性。

A notable challenge also arises from the potential for AI systems to amplify cyberattacks, resulting in threats that are increasingly sophisticated, adaptable, and difficult to detect. As AI models have become increasingly prevalent and sophisticated, there has been an increased focus on identifying security vulnerabilities, covering a range of attacks, from prompt injections to model leaks.
人工智能系统潜在的挑战之一是可能加剧网络攻击，导致威胁变得越来越复杂、适应性强，难以检测。随着人工智能模型变得越来越普遍和复杂，人们开始更加关注识别安全漏洞，涵盖了从提示注入到模型泄漏的一系列攻击。

AI Security and Safety in Numbers
数字化时代的人工智能安全与安全

Academia 学术界

Although the number of security and safety submissions at select academic conferences has decreased since 2022, there has been an overall 70.4% increase in such submissions since 2019 (Figure 3.4.1).
尽管自 2022 年以来，在部分学术会议上提交的安全性和安全性论文数量有所减少，但自 2019 年以来这类论文的总体增加了 70.4%（图 3.4.1）。

AI security and safety submissions to select academic conferences, 2019-23
2019 年至 2023 年部分学术会议的人工智能安全提交情况

Source: AI Index, 2024 | Chart: 2024 AI Index report
数据来源：2024 年 AI 指数 | 图表来源：2024 年 AI 指数报告

Figure 3.4.1 图 3.4.1

Industry 行业

The Global State of Responsible AI survey also queried organizations about reliability risks, such as model hallucinations or output errors.

Potential mitigations for these risks may involve managing low-confidence outputs or implementing comprehensive test cases for deployment across diverse scenarios. The survey inquired about a total of 6 mitigations related to reliability risks.

《负责任人工智能全球状况调查》还询问了组织在可靠性风险方面的问题，如模型幻觉或输出错误。

这些风险的潜在减轻措施可能涉及管理低置信度输出或在不同场景下实施全面的测试用例。调查共询问了与可靠性风险相关的 6 个减轻措施。

In a survey of more than 1,000 organizations, […]% acknowledged the relevance of reliability risks to their AI adoption strategies. Among these, […]% have fully implemented more than half of the surveyed measures, while […]% have operationalized at least one but fewer than half. Additionally, […]% of respondents admitted to having no reliability measures fully operationalized. The global average stood at 2.16 fully implemented measures out of the six included in the survey.
在对 1000 多个组织的调查中，[…]% 的组织承认可靠性风险与其 AI 采用策略相关。其中，[…]% 已经完全实施了超过一半的受调查措施，而 […]% 已经实施了至少一项但少于一半。此外，[…]% 的受访者承认没有任何可靠性措施完全投入运作。全球平均水平为调查所含六项措施中有 2.16 项得到完全实施。

Figure 3.4.2 visualizes mitigation adoption rates disaggregated by geographic area. Figure 3.4.3 further disaggregates AI-related reliability mitigation adoption rates by industry.
图 3.4.2 展示了按地理区域细分的缓解措施采纳率。图 3.4.3 进一步按行业细分了与 AI 相关的可靠性缓解措施采纳率。

Source: Global State of Responsible AI report, 2024 | Chart: 2024 AI Index report
来源：2024 年全球负责任 AI 状况报告 | 图表：2024 年 AI 指数报告

Note: The numbers in parentheses are the average numbers of mitigation measures fully operationalized within each region. Not all differences between regions are statistically significant.
注意：括号中的数字是每个地区内完全运作的减灾措施的平均数。并非所有地区之间的差异在统计上都具有显著性。

Source: Global State of Responsible AI report, 2024 | Chart: 2024 AI Index report
来源：2024 年全球负责任人工智能状况报告 | 图表：2024 年人工智能指数报告

Figure 3.4.3 图 3.4.3

Note: The numbers in parentheses are the average numbers of mitigation measures fully operationalized within each industry. Not all differences between industries are statistically significant.
注意：括号中的数字是每个行业内完全运作的减灾措施的平均数量。并非所有行业之间的差异在统计上都具有显著性。

Organizations were also queried on the relevance of security risks, such as cybersecurity incidents, with […]% acknowledging their relevance.
组织还被问及安全风险（例如网络安全事件）的相关性，其中 […]% 承认其相关性。

The organizations were also asked to what degree they implemented certain security measures such as basic cybersecurity hygiene practices or conducting vulnerability assessments. Organizations were asked about a total of five security measures. Of the organizations surveyed, […]% had fully implemented more than half of the proposed security measures, while […]% had fully operationalized at least one but fewer than half. Additionally, […]% reported having no security measures fully operationalized. On average, companies adopted 1.94 measures out of the 5 surveyed. Figure 3.4.4 and Figure 3.4.5 illustrate the adoption rates of cybersecurity measures by region and the breakdown of mitigation adoption rates by industry, respectively.
组织还被问及其在多大程度上实施了某些安全措施，例如基本的网络安全卫生实践或进行漏洞评估。调查共询问了五项安全措施。在受访的组织中，[…]% 已经完全实施了超过一半的拟议安全措施，而 […]% 已经完全实施了至少一项但不到一半。此外，[…]% 报告称没有任何安全措施完全投入运作。平均而言，公司在所调查的 5 项措施中采用了 1.94 项。图 3.4.4 和图 3.4.5 分别展示了各地区网络安全措施的采用率以及按行业细分的缓解措施采用率。

Source: Global State of Responsible AI report, 2024 | Chart: 2024 AI Index report
来源：2024 年全球负责任人工智能状况报告 | 图表：2024 年人工智能指数报告

Note: The numbers in parentheses are the average numbers of mitigation measures fully operationalized within each region. Not all differences between regions are statistically significant.
注意：括号中的数字是各个地区完全运营的缓解措施的平均数字。不是所有地区之间的差异都有统计学意义。

Figure 3.4.5 图 3.4.5

Note: The numbers in parentheses are the average numbers of mitigation measures fully operationalized within each industry. Not all differences between industries are statistically significant.
注意：括号中的数字是每个行业内完全运作的缓解措施的平均数量。并非所有行业之间的差异在统计学上都具有显著性。

16 Respondents were further given the free-text option "Other" to report additional mitigations not listed.
16 受访者还可以通过自由文本选项“其他”报告未列出的其他缓解措施。

The survey inquired about companies' perspectives on risks associated with foundation model developments. A significant majority, […]% of organizations, either agree or strongly agree that those developing foundation models are responsible for mitigating all associated risks (Figure 3.4.6). Furthermore, […]% of respondents either agree or strongly agree that the potential threats posed by generative AI are substantial enough to warrant globally agreed-upon governance.
调查询问了公司对基础模型开发相关风险的看法。绝大多数组织（[…]%）同意或强烈同意开发基础模型的人有责任减轻所有相关风险（图 3.4.6）。此外，[…]% 的受访者同意或强烈同意生成式 AI 带来的潜在威胁足够重大，需要全球一致认可的治理。

Agreement with security statements
对安全声明的一致意见

Source: Global State of Responsible AI report, 2024 | Chart: 2024 AI Index report
来源：2024 年全球负责任人工智能报告 | 图表：2024 年人工智能指数报告

Response options: Strongly disagree; Disagree; Neither disagree nor agree; Agree; Strongly agree; Not sure
回答选项：强烈不同意；不同意；既不反对也不同意；同意；强烈同意；不确定

Figure 3.4.6 图 3.4.6

Featured Research 特色研究

This section showcases key research published in 2023 on security and safety in AI. The profiled research covers new safety benchmarks for LLMs, methods of attacking AI models, and new benchmarks for testing deception and ethical behavior in AI systems.
本节展示了 2023 年关于人工智能安全的关键研究。所介绍的研究涉及 LLMs 的新安全基准、攻击 AI 模型的方法，以及用于测试 AI 系统中欺骗和道德行为的新基准。

Do-Not-Answer: A New Open Dataset for Comprehensive Benchmarking of LLM Safety Risks
Do-Not-Answer：一个全面评估LLM安全风险的新开放数据集

As the capabilities of LLMs expand, so too does their potential for misuse in hazardous activities. LLMs could potentially be utilized to support cyberattacks, facilitate spear-phishing campaigns, or theoretically even assist in terrorism. Consequently, it is becoming increasingly crucial for developers to devise mechanisms for evaluating the potential dangers of AI models. Closed-source developers such as OpenAI and Anthropic have constructed datasets to assess dangerous model capabilities and typically implement safety measures to limit unwanted model behavior. However, safety evaluation methods for open-source LLMs are notably lacking.
随着 LLMs 的能力不断扩展，它们在危险活动中被误用的潜力也在增加。LLMs 可能被用于支持网络攻击，促进鱼叉式网络钓鱼活动，甚至在理论上协助恐怖主义。因此，开发人员越来越需要设计机制来评估 AI 模型的潜在危险。像 OpenAI 和 Anthropic 这样的闭源开发者已经构建了数据集来评估危险的模型能力，并通常实施安全措施来限制不良的模型行为。然而，针对开源 LLMs 的安全评估方法明显不足。

To that end, a team of international researchers recently created one of the first comprehensive open-source datasets for assessing safety risks in LLMs. Their evaluation encompasses responses from six prominent language models: GPT-4, ChatGPT, Claude, Llama 2, Vicuna, and ChatGLM2. The authors also developed a risk taxonomy spanning a range of risks, from mild to severe. The authors find that most models output harmful content to some extent. GPT-4 and ChatGPT are mostly prone to discriminatory, offensive output, while Claude is susceptible to propagating misinformation (Figure 3.4.7). Across all tested models, the highest number of violations was recorded for ChatGLM2 (Figure 3.4.8).
为此，一组国际研究人员最近创建了第一个全面的开源数据集，用于评估LLMs中的安全风险。他们的评估涵盖了来自六个知名语言模型的响应：GPT-4、ChatGPT、Claude、Llama 2、Vicuna 和 ChatGLM2。作者还开发了一个涵盖从轻微到严重风险的风险分类法。作者发现大多数模型在某种程度上输出有害内容。GPT-4 和 ChatGPT 更容易产生歧视性、冒犯性输出，而 Claude 易于传播错误信息（图 3.4.7）。在所有测试的模型中，ChatGLM2 记录了最高数量的违规行为（图 3.4.8）。

Harmful responses across different risk categories by foundation model
基础模型在不同风险类别下的有害响应

Source: Wang et al., 2023 | Chart: 2024 AI Index report
来源：Wang 等人，2023 年 | 图表：2024 年 AI 指数报告

Total number of harmful responses across different foundation models
不同基础模型中有害响应的总数

Source: Wang et al., 2023 | Chart: 2024 AI Index report
源：王等人，2023 年 | 图表：2024 年 Al 指数报告

Universal and Transferable Attacks on Aligned Language Models
针对对齐语言模型的通用和可转移攻击

Recent attention in AI security has centered on uncovering adversarial attacks capable of bypassing the implemented safety protocols of LLMs. Much of this research requires substantial human intervention and is idiosyncratic to specific models. However, in 2023, researchers unveiled a universal attack capable of operating across various LLMs. This attack induces aligned models to generate objectionable content (Figure 3.4.9).
最近关注

安全的焦点集中在揭示能够绕过LLMs实施的安全协议的敌对攻击上。这项研究很大程度上需要人类干预，并且是特定模型特有的。然而，在 2023 年，研究人员揭示了一种通用攻击，能够跨越各种LLMs进行操作。这种攻击会导致对齐模型生成令人反感的内容（图 3.4.9）。

The method involved automatically generating suffixes that, when added to various prompts, compel LLMs to produce unsafe content. Figure 3.4.10 highlights the success rates of different attacking styles on leading LLMs. The method the researchers introduce is called Greedy Coordinate Gradient (GCG). The study demonstrates that these suffixes (the GCG attack) often transfer effectively across both closed and open models, encompassing ChatGPT, Bard, Claude, Llama-2-Chat, and Pythia. This study raises an important question as to how models can be better fortified against automated adversarial attacks. It also demonstrates how LLMs can be vulnerable to attacks that employ unintelligible, non-human-readable prompts. Current red-teaming methodologies primarily focus on interpretable prompts. This new research suggests there is a significant gap in buffering LLMs against attacks utilizing uninterpretable prompts.
该方法涉及自动生成后缀，当添加到各种提示时，迫使LLMs生成不安全内容。图 3.4.10 突出了不同攻击风格对主要LLMs的成功率。研究人员介绍的方法称为贪婪坐标梯度（GCG）。研究表明，这些后缀（GCG 攻击）通常在封闭和开放模型之间有效传输，包括 ChatGPT、Bard、Claude、Llama-2-Chat 和 Pythia。这项研究提出了一个重要问题，即模型如何更好地抵御自动对抗性攻击。它还展示了LLMs如何容易受到使用难以理解、非人类可读提示的攻击。当前的红队方法主要关注可解释的提示。这项新研究表明，在缓冲LLMs免受使用不可解释提示的攻击方面存在重大差距。

Using suffixes to manipulate LLMs
使用后缀来操纵LLMs

Source: Zou et al., 2023
来源：Zou 等人，2023

Figure 3.4.9 图 3.4.9
Attack success rates of foundation models using different prompting techniques
使用不同提示技术的基础模型攻击成功率
Source: Zou et al., 2023 | Chart: 2024 AI Index report
来源：Zho 等人，2023 年 | 图表：2024 年 Al 指数报告

MACHIAVELLI Benchmark 马基雅维利基准测试

There are many benchmarks, such as HELM and MMLU, that evaluate the overall capabilities of foundation models. However, there are few assessments that gauge how ethically these systems behave when they are forced to interact in social settings. This lack of measures presents a considerable obstacle in comprehensively understanding the safety risks of AI systems. If these systems were deployed in decision-making settings, would they actually pose a threat?
有许多基准测试，如 HELM 和 MMLU，评估基础模型的整体能力。然而，在社交环境中强制这些系统进行交互时，很少有评估能够衡量这些系统的道德行为。缺乏这些度量标准在全面了解

系统的安全风险方面构成了相当大的障碍。如果这些系统被部署在决策环境中，它们是否会真正构成威胁？

Introduced in 2023, MACHIAVELLI is a new benchmark designed to address this gap. Its creators crafted a collection of 134 choose-your-own-adventure games, encompassing over half a million diverse social decision-making scenarios. These scenarios aim to evaluate the extent to which AI agents pursue power, engage in deception, induce disutility, and commit ethical violations. Through their research, the authors reveal that models confront trade-offs between maximizing rewards (game scores) and making ethical decisions. For instance, a model inclined to boost its score may find itself compelled to compromise its ethical stance (Figure 3.4.11). Furthermore, Figure 3.4.12 provides a comparison of scores among various prominent AI models, such as GPT-3.5 and GPT-4,
2023 年推出的 MACHIAVELLI 是一个新的基准，旨在填补这一空白。其创作者打造了一系列 134 个选择自己冒险游戏，涵盖了超过 50 万个不同的社会决策情景。这些情景旨在评估

代理人追求权力、进行欺骗、引发不满和违反道德的程度。通过他们的研究，作者们揭示了模型在最大化奖励（游戏得分）和做出道德决策之间面临的权衡。例如，一个倾向于提高得分的模型可能会发现自己被迫妥协其道德立场（图 3.4.11）。此外，图 3.4.12 比较了各种知名 AI 模型的得分，如 GPT-3.5 和 GPT-4。
Trade-offs on the MACHIAVELLI benchmark
MACHIAVELLI 基准上的权衡

Source: Pan et al., 2023
来源：Pan 等人，2023

Figure 3.4.11 图 3.4.11

across different MACHIAVELLI benchmark categories like power, immorality, and dissatisfaction. Lower scores indicate a more ethically oriented model.
跨越不同的 MACHIAVELLI 基准类别，如权力、不道德和不满。较低的分数表示更具道德取向的模型。

Furthermore, the researchers demonstrate that there are strategies for mitigating the trade-off between maximizing rewards and maintaining ethical behavior, which can lead to the development of proficient and ethical AI agents. MACHIAVELLI is one of the first significant attempts to construct a framework for assessing traits such as deception, morality, and power-seeking in sophisticated AI systems.
此外，研究人员证明了存在缓解在最大化奖励和保持道德行为之间的权衡的策略，这可以促进高效和道德

代理的发展。MACHIAVELLI 是构建评估欺骗、道德和追求权力等特征的框架的首次重要尝试之一，用于复杂的人工智能系统。

Mean behavioral scores of AI agents across different categories

代理在不同类别中的平均行为分数

Source: Pan et al., 2023 | Chart: 2024 AI Index report
来源：Pan 等人，2023 年 | 图表：2024 年 Al 指数报告


Figure 3.4.12 图 3.4.12

Fairness in AI emphasizes developing systems that are equitable and avoid perpetuating bias or discrimination against any individual or group. It involves considering the diverse needs and circumstances of all stakeholders impacted by AI use. Fairness extends beyond a technical concept and embodies broader social standards related to equity.
在人工智能中，公平性强调开发公正的系统，避免对任何个人或群体产生偏见或歧视。这涉及考虑到所有受到人工智能使用影响的利益相关者的不同需求和情况。公平性不仅仅是一个技术概念，还体现了与公平相关的更广泛的社会标准。

3.5 Fairness 3.5 公平性

Current Challenges 当前挑战

Defining, measuring, and ensuring fairness is complex due to the absence of a universal fairness definition and a structured approach for selecting context-appropriate fairness definitions. This challenge is magnified by the multifaceted nature of AI systems, which require the integration of fairness measures at almost every stage of their life cycle.
由于缺乏普遍公平定义和选择上下文合适的公平定义的结构化方法，定义、衡量和确保公平变得复杂。这个挑战还因

系统的多方面性而变得更加复杂，这些系统需要在其生命周期的几乎每个阶段集成公平度量。

Fairness in Numbers 数字中的公平

This section provides an overview of the study and deployment of AI fairness in academia and industry.
本节概述了学术界和工业界对

公平性的研究和部署。

Academia 学术界

The rise of LLMs like ChatGPT and Gemini made the public significantly more aware of some of the fairness issues that can arise when AI systems are broadly deployed. This heightened awareness has led to a rise in AI-fairness-related submissions at academic conferences.
像 ChatGPT 和 Gemini 这样的LLMs的崛起显著提高了公众对广泛部署

系统可能出现的公平性问题的意识。这种加强的意识已经导致学术会议上关于

公平性的提交数量增加。

In 2023, there were 212 papers on fairness and bias submitted, an increase from 2022 (Figure 3.5.1). Since 2019, the number of such submissions has almost quadrupled.
2023 年，关于公平性和偏见的论文数量为 212 篇，比 2022 年增加了

（图 3.5.1）。自 2019 年以来，此类投稿数量几乎增加了四倍。

AI fairness and bias submissions to select academic conferences, 2019-23
2019 年至 2023 年选择学术会议的公平性和偏见投稿

Source: AI Index, 2024 | Chart: 2024 AI Index report
数据来源：2024 年 AI 指数 | 图表来源：2024 年 AI 指数报告

Industry 行业

In the Global State of Responsible AI survey referenced earlier, some organizations identified fairness risks as relevant to their AI adoption strategies. Regionally, European organizations (34%) most frequently reported this risk as relevant, while North American organizations reported it the least (20%).
在前面提到的《全球负责任人工智能状况调查》中，

的组织确定公平风险与其

采用策略相关。

在区域上，欧洲组织（34%）最经常报告此风险相关，而北美组织报告最少（20%）。

The survey asked respondents about their efforts to mitigate bias and enhance fairness and diversity in AI model development, deployment, and use, providing them with five possible measures to implement. Results show that while most companies have fully implemented at least one fairness measure, comprehensive integration is still lacking. The global average for adopted fairness measures stands at 1.97 out of the five measures asked about. There is no significant regional variation in the implementation of fairness measures (Figure 3.5.2). Figure 3.5.3 visualizes integration rates by industry.
调查询问受访者关于他们在

模型开发、部署和使用中减少偏见、增强公平和多样性的努力，为他们提供了五种可能实施的措施。结果显示，虽然大多数公司至少完全实施了一项公平措施，但全面整合仍然不足。全球采用的公平措施平均为五项措施中的 1.97。在公平措施的实施上没有显著的区域差异（图 3.5.2）。图 3.5.3 可视化了行业的整合率。
Adoption of AI-related fairness measures by region
各地区对人工智能相关公平措施的采用情况

Source: Global State of Responsible AI report, 2024 | Chart: 2024 AI Index report
来源：2024 年全球负责任人工智能报告 | 图表：2024 年人工智能指数报告

Adoption of AI-related fairness measures by industry
行业采用与人工智能相关的公平措施

Source: Global State of Responsible AI report, 2024 | Chart: 2024 AI Index report
来源：2024 年全球负责任人工智能报告 | 图表：2024 年人工智能指数报告

Figure 3.5.3 图 3.5.3

Note: The numbers in parentheses are the average numbers of mitigation measures fully operationalized within each industry. Not all differences between industries are statistically significant.
注意：括号中的数字是每个行业内完全运作的减灾措施数量的平均值。并非所有行业之间的差异在统计上都具有显著性。

Featured Research 精选研究

This section highlights key research published in 2023 on fairness in AI. By focusing on significant fairness studies, the AI Index highlights some critical topics that are of interest to AI fairness researchers. The research featured below reveals how image generation models can perpetuate social stereotypes, how LLMs tend to reflect Western opinions, and how model tokenization can introduce elements of unfairness.
本节重点介绍了 2023 年关于人工智能公平性的关键研究。通过关注重要的公平性研究，人工智能指数突出了一些对

公平性研究人员感兴趣的关键主题。以下展示的研究揭示了图像生成模型如何可能强化社会刻板印象，LLMs倾向于反映西方观点，以及模型标记化可能引入不公平因素。

(Un)Fairness in AI and Healthcare
《(不)公平性在和医疗保健领域》

A team of American and Canadian researchers investigated racial bias when LLMs are prompted to respond to medical questions. They queried four popular LLMs (Bard, GPT-3.5, Claude, GPT-4) with nine distinct questions previously known to elicit "race-based medicine or widespread misconceptions around race" among real physicians. Each model was asked each question five times, yielding 45 responses per model.
一组美国和加拿大研究人员调查了当LLMs被要求回答医疗问题时的种族偏见。他们使用四种流行的LLMs（Bard、GPT-3.5、Claude、GPT-4）询问了九个先前已知会引发“基于种族的医学或医生之间普遍存在的关于种族的误解”问题。每个模型被问及每个问题五次，每个模型产生 45 个回答。

Figure 3.5.4 highlights the frequency with which notable LLMs delivered highly racialized responses per question.

The study revealed that all models demonstrated some degree of race-based medical bias, although their responses to identical questions varied. For certain queries, like the basis of race, only one model, Claude, consistently offered problematic responses. In contrast, for other questions, such as the purported skin thickness differences between Black and white individuals (a widespread misconception among medical students), most models regularly produced concerning race-based responses. The occasional perpetuation of debunked myths by LLMs underscores the need for caution when employing LLMs in medical contexts.
图 3.5.4 突出显示了每个问题中显著LLMs提供高度种族化回应的频率。

研究表明，所有模型都表现出一定程度的基于种族的医疗偏见，尽管它们对相同问题的回应有所不同。对于某些查询，比如基于种族的基础，只有一个模型，克劳德，始终提供有问题的回应。相反，对于其他问题，比如声称黑人和白人之间皮肤厚度差异（这是医学生中普遍存在的误解），大多数模型经常产生令人担忧的基于种族的回应。LLMs偶尔传播已被揭穿的神话强调了在医学环境中使用LLMs时需要谨慎的必要性。
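The tallying behind Figure 3.5.4 (number of runs out of 5 with a concerning response, per model and question) can be sketched as follows. This is a minimal illustration, not the authors' code: the record format, the model name `model_a`, the question IDs, and the flags are all hypothetical; only the counts of 9 questions and 5 runs come from the study description above.

```python
from collections import defaultdict

QUESTIONS = 9  # distinct questions per model (from the study description)
RUNS = 5       # times each question was asked (from the study description)


def count_concerning_runs(records):
    """Count concerning responses per (model, question) pair.

    `records` is an iterable of (model, question_id, is_concerning) tuples;
    this record layout is an assumption made for the sketch.
    """
    counts = defaultdict(int)
    for model, question_id, is_concerning in records:
        counts[(model, question_id)] += int(is_concerning)
    return dict(counts)


# Hypothetical flags for one model and two questions, five runs each.
records = (
    [("model_a", "q1", flag) for flag in (True, True, False, True, False)]
    + [("model_a", "q2", flag) for flag in (False,) * RUNS]
)

counts = count_concerning_runs(records)
responses_per_model = QUESTIONS * RUNS  # 9 questions x 5 runs

print(counts)
print(responses_per_model)
```

Each cell of a chart like Figure 3.5.4 would then hold one of these per-question counts, ranging from 0 to 5.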

Number of runs (out of 5 total runs) with concerning race-based responses by large language model
Source: Omiye et al., 2023 | Chart: 2024 AI Index report
有问题的基于种族的回应次数（总共 5 次运行中）大型语言模型来源：Omiye 等人，2023 年| 图表：2024 年 Al 指数报告

19 In Figure 3.5.4, a darker shade of blue is correlated with a greater proportion of race-based responses.
在图 3.5.4 中，深蓝色与基于种族的回应比例更高相关。

BiasPainter is a new testing framework designed to detect social biases in image generation models, such as DALL-E and Midjourney. As highlighted in the 2023 AI Index, many image generation models frequently perpetuate stereotypes and biases (Figure 3.5.5). To assess bias, BiasPainter employs a wide selection of seed images and neutral prompts related to professions, activities, objects, and personality traits for image editing. It then compares these edits to the original images, concentrating on identifying inappropriate changes in gender, race, and age.
BiasPainter 是一个新的测试框架，旨在检测图像生成模型中的社会偏见，如 DALL-E 和 Midjourney。正如 2023 年 Al Index 所强调的那样，许多图像生成模型经常强化刻板印象和偏见（图 3.5.5）。为了评估偏见，BiasPainter 使用大量种子图像和与职业、活动、物体和个性特征相关的中性提示进行图像编辑。然后将这些编辑与原始图像进行比较，重点是识别性别、种族和年龄方面的不当变化。

BiasPainter was evaluated across five well-known commercial image generation models, including Stable Diffusion, Midjourney, and InstructPix2Pix. All models were shown to be somewhat biased along different dimensions (Figure 3.5.6). Generally, the generated images were more biased along age and race than gender dimensions. Overall, BiasPainter's accuracy on automatic bias detection tasks is a considerable improvement over previous methods.
BiasPainter 在五个知名商业图像生成模型上进行了评估，如 Stable Diffusion、Midjourney 和 InstructPix2Pix。所有模型都显示出在不同维度上存在一定程度的偏见（图 3.5.6）。总体而言，生成的图像在年龄和种族方面的偏见要大于性别方面。总体而言，在自动偏见检测任务上，BiasPainter 实现了

的自动偏见检测准确率，这是对先前方法的显著改进。

Midjourney generation: "influential person"
Midjourney 生成: "有影响力的人"

Source: Marcus and Southen, 2024
来源: Marcus 和 Southen，2024

Figure 3.5.5 图 3.5.5

Average image model bias scores for five widely used commercial image generation models
五种广泛使用的商业图像生成模型的平均图像模型偏差分数

Source: Wang et al., 2023 | Chart: 2024 AI Index report
来源：王等人，2023 年 | 图表：2024 年 Al 指数报告

Measuring Subjective Opinions in LLMs
在LLMs中测量主观意见

Research from Anthropic suggests that large language models do not equally represent global opinions on a variety of topics such as politics, religion, and technology. In this study, researchers built a GlobalOpinionQA dataset to capture cross-country opinions on various issues (Figure 3.5.7). They then generated a similarity metric to compare people's answers in various countries with those outputted by LLMs. Using a four-point Likert scale, LLMs were asked to rate their agreement with statements from the World Values Survey (WVS) and Pew Research Center's Global Attitudes (GAS) surveys, including questions like, "When jobs are scarce, employers should give priority to people of this country over immigrants," or "On the whole, men make better business executives than women do."
Anthropic 的研究表明，大型语言模型并不平等地代表全球对政治、宗教和技术等各种主题的观点。在这项研究中，研究人员构建了一个 GlobalOpinionQA 数据集，以捕捉各国对各种问题的观点（图 3.5.7）。然后，他们生成了一个相似度度量标准，用于比较各国人民的答案与LLMs输出的答案。使用四点李克特量表，LLMs被要求评价他们对世界价值观调查（WVS）和皮尤研究中心全球态度（GAS）调查中的陈述的同意程度，包括问题如“当工作稀缺时，雇主应优先考虑本国人而不是移民”，或“总的来说，男性比女性更适合担任商业高管”。
The experiments indicate that the models' responses closely align with those from individuals in Western countries (Figure 3.5.8). The authors point out a notable lack of diversity in opinion representation, especially from non-Western nations, among the shared responses. While it is challenging for models to precisely match the highly diverse distributions of global opinions, given the inherent variation in perspectives, it is still valuable to understand which opinions a model is likely to share. Recognizing the biases inherent in models can highlight their limitations and facilitate adjustments that improve regional applicability.
实验表明，模型的响应与西方国家个体的响应密切一致（图 3.5.8）。作者指出在共享响应中意见代表的多样性明显不足，尤其是来自非西方国家的意见。尽管模型很难精确匹配全球意见的高度多样分布-考虑到观点的固有变化，但了解模型可能分享哪些意见仍然很有价值。认识到模型固有的偏见可以突显其局限性，并促进改进区域适用性的调整。
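The idea of a similarity metric between a model's answers and a country's answers can be illustrated with a short sketch. The report does not specify the metric; the version below assumes a Jensen-Shannon-based similarity (1 minus the Jensen-Shannon distance) between two answer distributions over the same set of options. The distributions, variable names, and the choice of metric are assumptions for illustration only.

```python
import math


def _kl_bits(p, q):
    """Kullback-Leibler divergence in bits; terms with p_i == 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_distance(p, q):
    """Jensen-Shannon distance (square root of the divergence), in [0, 1]."""
    if len(p) != len(q):
        raise ValueError("distributions must cover the same answer options")
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    divergence = 0.5 * _kl_bits(p, m) + 0.5 * _kl_bits(q, m)
    # Guard against tiny negative values caused by floating-point rounding.
    return math.sqrt(max(divergence, 0.0))


def similarity(p, q):
    """Assumed similarity score: 1 means identical answer distributions."""
    return 1.0 - js_distance(p, q)


# Hypothetical answer distributions over a four-point scale.
model_answers = [0.10, 0.20, 0.40, 0.30]
country_a = [0.10, 0.25, 0.40, 0.25]
country_b = [0.50, 0.30, 0.15, 0.05]

sim_a = similarity(model_answers, country_a)
sim_b = similarity(model_answers, country_b)
print(round(sim_a, 3), round(sim_b, 3))
```

Averaging such scores over many questions gives one number per country, which is the kind of quantity a map like Figure 3.5.8 could display.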

GlobalOpinionQA Dataset 全球意见问答数据集

Source: Durmus et al., 2023
来源：Durmus 等人，2023

Western-oriented bias in large language model responses
大型语言模型响应中的西方取向偏见

Source: Durmus et al., 2023 | Chart: 2024 AI Index report
来源：Durmus 等人，2023 年 | 图表：2024 年 Al 指数报告

Figure 3.5.8 图 3.5.8

LLM Tokenization Introduces Unfairness
LLM 分词引入了不公平性

Research from the University of Oxford highlights how inequality in AI originates at the tokenization stage. Tokenization, the process of breaking down text into smaller units for processing and analysis, exhibits significant variability across languages. The number of tokens used for the same sentence can vary up to 15 times between languages. For instance, Portuguese closely matches English in the efficiency of the GPT-4 tokenizer, yet it still requires more tokens to convey the same content. The Shan language is the furthest from English, needing 15 times more tokens. Figure 3.5.9 visualizes the concept of a context window, while Figure 3.5.10 illustrates the token consumption of the same sentence across different languages.
牛津大学的研究突出了不平等是如何在分词阶段产生的。分词是将文本分解为更小单元进行处理和分析的过程，在不同语言之间存在显著的变异性。同一句子使用的标记数量在不同语言之间可以相差多达 15 倍。例如，葡萄牙语在 GPT-4 分词器的效率上与英语接近，但仍需要大约

个更多的标记来传达相同的内容。掸语与英语之间的差距最大，需要多达 15 倍的标记。图 3.5.9 展示了上下文窗口的概念，而图 3.5.10 说明了不同语言中同一句子的标记消耗。

The authors identify three major inequalities that result from variable tokenization. First, users of languages that require more tokens than English for the same content face up to four times higher inference costs and longer processing times, as both are dependent on the number of tokens.
作者确定了由于可变分词而导致的三种主要不平等。首先，对于需要比英语使用更多标记来传达相同内容的语言用户，推理成本和处理时间可能高出四倍，因为两者都取决于标记数量。

Figure 3.5.11 illustrates the variation in token length and execution time for the same sentence across different languages or language families. Second, these users may also experience increased processing times because models take longer to process a greater number of tokens. Lastly, given that models operate within a fixed context window (a limit on the amount of text or content that can be input), languages that require more tokens use up proportionally more of this window. This can reduce the available context for the model, potentially diminishing the quality of service for those users.
图 3.5.11 展示了相同句子在不同语言或语言族中标记长度和执行时间的变化。其次，这些用户可能也会经历增加的处理时间，因为模型需要更长时间来处理更多的标记。最后，考虑到模型在固定的上下文窗口内运行-即输入的文本或内容量的限制-相对需要更多标记的语言会占用更多的窗口空间。这可能会减少模型的可用上下文，潜在地降低这些用户的服务质量。
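The arithmetic behind these inequalities can be sketched in a few lines. The token counts (7 for English, 16 for Vietnamese) are taken from the example in Figure 3.5.10; the context window size of 8,192 tokens and the function names are hypothetical, chosen only to make the effect concrete.

```python
def tokenization_premium(tokens_lang: int, tokens_en: int) -> float:
    """Ratio of tokens a language needs relative to English for the same text."""
    if tokens_en <= 0:
        raise ValueError("English token count must be positive")
    return tokens_lang / tokens_en


def english_equivalent_capacity(context_window: int, premium: float) -> float:
    """How much English-sized content fits in the window for that language."""
    return context_window / premium


# Token counts from the English/Vietnamese example in Figure 3.5.10.
premium = tokenization_premium(tokens_lang=16, tokens_en=7)

# Hypothetical context window size, used only for illustration.
capacity = english_equivalent_capacity(context_window=8192, premium=premium)

print(round(premium, 2))
print(round(capacity))
```

Under these assumptions, the same window holds less than half as much content for the Vietnamese speaker as for the English speaker, and any per-token cost scales by the same premium.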

Context window 上下文窗口

Source: AI Index, 2024
来源：Al Index，2024

Larger context windows are more likely to include important information (e.g. the word "dog") which can help the model to make a better prediction ("fetch" vs "baseball"). If more tokens are used up to represent the same sentence in non-English languages, less information is being captured in the limited context window, which may result in worse performance.
更大的上下文窗口更有可能包含重要信息（例如单词“狗”），这可以帮助模型做出更好的预测（“接球”与“棒球”）。如果在非英语语言中使用更多标记来表示相同的句子，那么在有限的上下文窗口中捕获的信息就会更少，这可能导致性能更差。

Figure 3.5.9 图 3.5.9

Variable language tokenization
可变语言标记化

Source: AI Index, 2024
源：Al 指数，2024

ENGLISH 英语

I took my dog to the park to play
我带我的狗去公园玩

7 tokens 7 个令牌

VIETNAMESE 越南语

Tôi đưa con chó của tôi đến công viên
我带我的狗去公园

16 tokens 16 个令牌

Tokenization premium using XLM-RoBERTa and RoBERTa models by language
使用 XLM-RoBERTa 和 RoBERTa 模型进行语言分词 premium
Source: Petrov et al., 2023 | Chart: 2024 AI Index report
来源：Petrov 等，2023 年 | 图表：2024 年 Al 指数报告

Figure 3.5.11 图 3.5.11

In 2024, around 4 billion people across the globe will vote in national elections, for example, in the United States, U.K., Indonesia, Mexico, and Taiwan. Upcoming elections coupled with greater public awareness of AI have led to discussions of AI's possible impact on elections. This section covers how AI can impact elections and more specifically examines the generation and dissemination of mis- and disinformation, the detection of AI-generated content, the potential political bias of LLMs, and the broader impact of AI on politics.
2024 年，全球约 40 亿人将参加国家选举，例如在美国、英国、印度尼西亚、墨西哥和台湾。即将举行的选举以及公众对人工智能的更广泛认识已经引发了关于人工智能可能对选举产生影响的讨论。本节涵盖了人工智能如何影响选举，更具体地探讨了误导和虚假信息的生成和传播、检测人工智能生成内容、LLMs的潜在政治偏见，以及人工智能对政治的更广泛影响。

3.6 AI and Elections
3.6 AI 与选举

Generation, Dissemination, and Detection of Disinformation
信息误导的生成、传播和检测

Generating Disinformation
生成信息误导

One of the top concerns when discussing AI's impact on political processes is the generation of disinformation. While disinformation has been around since at least the Roman Empire, AI makes it significantly easier to generate such disinformation. Moreover, deepfake tools have significantly improved since the 2020 U.S. elections. Large-scale disinformation can undermine trust in democratic institutions, manipulate public opinion, and polarize public discussions. Figure 3.6.1 highlights the different types of deepfakes that can be created.
在讨论人工智能对政治进程的影响时，其中一个主要关注点是虚假信息的生成。

虽然虚假信息至少自罗马帝国时代就存在，但人工智能使得生成这类虚假信息变得更加容易。此外，自 2020 年美国大选以来，深度伪造工具已经显著改进。大规模的虚假信息可能会破坏对民主机构的信任，操纵公众舆论，并使公众讨论极化。图 3.6.1 突出显示了可以创建的不同类型的深度伪造视频。

Potential uses of deepfakes
深度伪造的潜在用途

Source: Masood et al., 2023
来源：Masood 等人，2023

Figure 3.6.1 图 3.6.1

20 This section uses the terms synthetic content, disinformation, and deepfakes in the following senses: Synthetic content is any content (text, image, audio, video) that has been created with AI. Disinformation is false or misleading information generated with the explicit intention to deceive or manipulate an audience. Deepfakes are AI-generated image, video, or audio files that can often create convincingly realistic yet deceptive content.
本节中使用的术语包括合成内容、虚假信息和深度伪造，具体含义如下：合成内容是指使用人工智能创建的任何内容（文本、图像、音频、视频）。虚假信息是指带有明确意图欺骗或操纵受众的虚假或误导性信息。深度伪造是指由人工智能生成的图像、视频或音频文件，通常可以创造出令人信服但具有欺骗性的内容。

Slovakia's 2023 election illustrates how AI-based disinformation can be used in a political context. Shortly before the election, a contentious audio clip emerged on Facebook purportedly capturing Michal Šimečka, the leader of the Progressive Slovakia party (Figure 3.6.2), and journalist Monika Tódová from the newspaper Denník N, discussing illicit election strategies, including acquiring voters from the Roma community. The authenticity of the audio was immediately challenged by Šimečka and Denník N. An independent fact-checking team suggested that AI manipulation was likely at play. Because the clip was released during a pre-election quiet period, when media and politicians' commentary is restricted, the clip's dissemination was not easily contested. The clip's wide circulation was also aided by a significant gap in Meta's content policy, which does not apply to audio manipulations. This episode of AI-enabled disinformation occurred against the backdrop of a close electoral contest. Ultimately, the affected party, Progressive Slovakia, lost by a slim margin to SMER, one of the opposition parties.
2023 年斯洛伐克选举展示了如何在政治背景下利用人工智能进行虚假信息传播。选举前不久，Facebook 上出现了一段有争议的音频剪辑，据称捕捉到了进步斯洛伐克党领袖米哈尔·希梅奇卡（见图 3.6.2）和《每日 N 报》记者莫妮卡·托多娃讨论非法选举策略，包括获取罗姆人社区的选民。希梅奇卡和《每日 N 报》立即对音频的真实性提出质疑。一支独立的事实核查团队暗示可能存在人工智能操纵。由于该剪辑是在选举前的宣传期内发布的，媒体和政客的评论受到限制，因此剪辑的传播并不容易受到质疑。剪辑的广泛传播也受到 Meta 内容政策中的重大漏洞的帮助，该政策不适用于音频操纵。这一 AI 虚假信息事件发生在一场激烈的选举竞争背景下。最终，受影响的进步斯洛伐克党以微弱优势输给了反对党之一的 SMER 党。
Progressive Slovakia leader Michal Šimečka
进步斯洛伐克党领袖米哈尔·希梅奇卡

Source: Meaker, 2023 来源：Meaker，2023

Figure 3.6.2 图 3.6.2

Dissemination of Fake Content
虚假内容的传播

Sometimes concerns surrounding AI-generated disinformation are minimized on the grounds that AI only assists with content generation but not dissemination. However, in 2023, case studies emerged about how AI could be used to automate the entire generation and dissemination pipeline. A developer called Nea Paw set up CounterCloud as an experiment in creating a fully automated disinformation pipeline (Figure 3.6.3).
有时，围绕由人工智能生成的虚假信息引起的担忧被忽视，理由是

仅协助内容生成而非传播。然而，在 2023 年，出现了关于如何利用

自动化整个生成和传播流程的案例研究。一位开发者名为 Nea Paw 设立了 Countercloud 作为一个实验，旨在创建一个完全自动化的虚假信息传播管道（图 3.6.3）。

As part of the first step in the pipeline, an AI model is used to continuously scrape the internet for articles and automatically decide which content it should target with counter-articles. Next, another AI model is tasked with writing a convincing counter-article that can include images and audio summaries. This counter-article is subsequently attributed to a fake journalist and posted on the CounterCloud website. Another AI system then generates comments on the counter-article, creating the appearance of organic engagement. Finally, an AI searches for relevant tweets, posts the counter-article as a reply, and comments as a user on these tweets. The entire setup for this authentic-appearing misinformation system costs only a modest amount.
作为管道中的第一步，一个 AI 模型被用来持续地从互联网上抓取文章，并自动决定应该以对抗文章针对哪些内容。接下来，另一个 AI 模型被要求撰写一篇有说服力的对抗文章，可以包括图片和音频摘要。这篇对抗文章随后被归因于一个虚假记者，并发布在 CounterCloud 网站上。随后，另一个 AI 系统在对抗文章上生成评论，营造有机互动的外观。最后，一个 AI 在

上搜索相关的推文，将对抗文章作为回复发布，并以用户的身份在这些推文上发表评论。这个看起来真实的错误信息系统的整个设置只需花费约

。

AI-based generation and dissemination pipeline
基于 AI 的生成和传播管道

Source: AI Index, 2024 (see footnote 21)
来源：AI 指数，202421

Figure 3.6.3 图 3.6.3

21 The figure was adapted from Simon, Altay, and Mercier, 2023.
21 该图出自 2023 年的 Simon, Altay, and Mercier 改编。

Detecting Deepfakes 检测深度伪造

Recent research efforts to counter deepfakes have focused on improving methods for detecting AI-generated content. For example, a team of Singaporean researchers studied how well deepfake detectors generalize to datasets they have not been trained on. The researchers compared five deepfake detection approaches and found that even more recently introduced deepfake detection methods suffer significant performance declines on never-before-seen datasets (Figure 3.6.4). However, the study does note that there are underlying similarities between seen and unseen datasets, meaning that in the future, robust and broadly generalizable deepfake detectors could be created.
最近的研究努力致力于改进检测人工智能生成内容的方法来对抗 deepfake。例如，一组新加坡研究人员研究了深度伪造检测器在未经训练的数据集上的泛化能力。研究人员比较了五种深度伪造检测方法，发现即使是最近引入的深度伪造检测方法在以前未见过的数据集上也会出现显著的性能下降（图 3.6.4）。然而，研究指出，已见和未见数据集之间存在基本相似之处，这意味着未来可以创建稳健且广泛适用的深度伪造检测器。

Generalizability of deepfake detectors to unseen datasets
深度伪造检测器对未见数据集的泛化能力

Source: Li et al., 2023 | Chart: 2024 AI Index report
来源：Li 等人，2023 年 | 图表：2024 年 Al 指数报告

Figure 3.6.4 图 3.6.4

In the context of deepfake detectors, it is also important to highlight earlier experiments that show that the performance of deepfake detection methods varies significantly across attributes such as race. Some of the underlying datasets used to train deepfake detectors, like FaceForensics++, are not equally balanced with respect to race and gender (Figure 3.6.5). The authors then demonstrate that between various racial subgroups, performance accuracy could differ by as much as 10.7 percentage points. The detectors performed worst on dark skin and best on Caucasian faces.
在 Deepfake 检测器的背景下，强调早期实验证明，Deepfake 检测方法的性能在诸如种族等属性上存在显著差异。一些用于训练 Deepfake 检测器的基础数据集，如 FaceForensics++，在种族和性别方面并不平衡（图 3.6.5）。作者随后证明，在不同的种族子群之间，性能准确度可能相差多达 10.7 个百分点。检测器在深色皮肤上表现最差，在高加索人脸上表现最佳。
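A gap such as the 10.7 percentage points mentioned above is the difference between the best and worst per-group accuracy. The sketch below shows one way to compute it; the group names, labels, and records are hypothetical and are not drawn from FaceForensics++ or from the cited study.

```python
def accuracy_by_group(records):
    """Per-group accuracy from (group, predicted_label, true_label) tuples."""
    totals, correct = {}, {}
    for group, predicted, truth in records:
        totals[group] = totals.get(group, 0) + 1
        correct[group] = correct.get(group, 0) + int(predicted == truth)
    return {group: correct[group] / totals[group] for group in totals}


def max_gap_percentage_points(accuracies):
    """Largest accuracy difference between any two groups, in percentage points."""
    return (max(accuracies.values()) - min(accuracies.values())) * 100


# Hypothetical detector outputs: 9/10 correct for group_a, 8/10 for group_b.
records = (
    [("group_a", "fake", "fake")] * 9
    + [("group_a", "real", "fake")]
    + [("group_b", "fake", "fake")] * 8
    + [("group_b", "real", "fake")] * 2
)

accuracies = accuracy_by_group(records)
gap = max_gap_percentage_points(accuracies)
print(accuracies, round(gap, 1))
```

Reporting the gap alongside overall accuracy makes uneven performance visible even when the headline number looks strong.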

Ethnic and gender distribution in FaceForensics++ training data
FaceForensics++训练数据中的种族和性别分布

Source: Trinh and Liu, 2021 | Chart: 2024 AI Index report
来源：Trinh 和 Liu，2021 | 图表：2024 年 Al 指数报告

Figure 3.6.5 图 3.6.5

LLMs and Political Bias
LLMs和政治偏见

LLMs are increasingly recognized as tools through which ordinary individuals can inform themselves about important political topics such as political processes, candidates, or parties. However, new research published in 2023 suggests that many major LLMs like ChatGPT are not necessarily free of bias.
LLMs越来越被认为是普通人可以了解重要政治话题（如政治流程、候选人或政党）的工具。然而，2023 年发表的新研究表明，许多主要的LLMs（如 ChatGPT）并不一定没有偏见。

The study revealed that ChatGPT exhibits a notable and systematic bias favoring Democrats in the United States and the Labour Party in the U.K. As part of the study, the researchers compared the answers of a default ChatGPT to those of Republican, Democrat, radical Republican, and radical Democrat versions of ChatGPT. This research design was created to better identify
该研究发现，ChatGPT 存在明显和系统性的偏见，偏向美国的民主党和英国的工党。作为研究的一部分，研究人员比较了默认 ChatGPT 的答案与共和党人、民主党人、激进共和党人和激进民主党人版本的 ChatGPT 的答案。这一研究设计旨在更好地识别

which political allegiance most closely corresponds to the regular ChatGPT.
哪种政治效忠最接近于常规的 ChatGPT。

Figure 3.6.6 shows strong positive correlations (blue lines) between the default ChatGPT, i.e., one that was answering questions without additional instructions, and both the Democrat and the radical Democrat ChatGPT versions, i.e., versions of ChatGPT that were asked to answer like a Democrat or radical Democrat. On the other hand, the researchers found a strong negative correlation between the default ChatGPT and both Republican ChatGPTs. The identification of bias in these LLMs raises concerns about their potential to influence the political views and stances of users who engage with these tools.
图 3.6.6 显示了默认的 ChatGPT 与民主党和激进民主党 ChatGPT 版本之间的强烈正相关（蓝色线），即回答问题而无需额外指示的 ChatGPT，以及被要求像民主党或激进民主党回答问题的 ChatGPT 版本。另一方面，研究人员发现默认 GPT 与两个共和党 ChatGPT 之间存在强烈的负相关。在这些LLMs中发现偏见引发了人们对这些工具可能影响用户政治观点和立场的担忧。
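The comparison described above boils down to correlating the default model's answers with the answers of each impersonated persona. A minimal sketch using a Pearson correlation follows; the answer scale (0 to 3) and the answer vectors are hypothetical, and the cited study's actual statistical procedure may differ.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("correlation is undefined for constant sequences")
    return cov / (norm_x * norm_y)


# Hypothetical answers to six statements (0 = strongly disagree, 3 = strongly agree).
default_answers = [0, 1, 3, 2, 0, 3]
as_democrat = [0, 1, 3, 3, 0, 2]
as_republican = [3, 2, 0, 1, 3, 0]

r_democrat = pearson(default_answers, as_democrat)
r_republican = pearson(default_answers, as_republican)
print(round(r_democrat, 2), round(r_republican, 2))
```

A coefficient near +1 means the default answers track the persona's answers, while a coefficient near -1 means they run opposite to them.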

Default vs. political ChatGPT average agreement
默认与政治 ChatGPT 平均一致性

Source: Motoki et al., 2023 | Chart: 2024 AI Index report
来源：Motoki 等人，2023 年 | 图表：2024 年 Al 指数报告

Figure 3.6.6 图 3.6.6

Impact of AI on Political Processes
人工智能对政治进程的影响

There has been an increasing volume of research aimed at exploring some of the risks AI could pose to political processes. One topic of interest has been audio deepfakes. In July 2023, audio clips of a politician from India's Hindu party were released in which the politician attacked his own party and praised his political opponent. The politician claimed these audio clips were created using AI. However, even after deepfake experts were consulted, it could not be determined with certainty whether the clips were authentic or not.
最近研究的数量正在增加，目的是探索一些可能对政治过程构成威胁的风险。一个感兴趣的话题是音频深度伪造。2023 年 7 月，印度印度教政党的一位政治家的音频剪辑被发布，其中这位政治家抨击了自己的政党并赞扬了他的政治对手。这位政治家声称这些音频剪辑是使用人工智能创建的。然而，即使请教了深度伪造专家，也无法确定这些剪辑是否真实。
Research published in 2023 suggests that humans generally have difficulty reliably detecting audio deepfakes. In the study's sample of 529 individuals, listeners correctly detected deepfakes only some of the time. Figure 3.6.7 illustrates some of the other key findings from the study. The authors also expect detection accuracy to go down in the future as a result of improvements in audio generation methods. The rise of more convincing audio deepfakes increases the potential to manipulate political campaigns, defame opponents, and give politicians a "liar's dividend," the ability to dismiss damaging audio clips as fabrications.
2023 年发表的研究表明，人类普遍在可靠地检测音频深度伪造方面存在问题。在他们的样本中，有 529 名个体，听众只有

的时间正确检测到深度伪造。图 3.6.7 展示了研究的一些其他关键发现。作者还预期随着音频生成方法的改进，未来检测准确性会下降。更具说服力的音频深度伪造的兴起增加了操纵政治活动、诽谤对手以及使政客得到“说谎者红利”的潜力，即将破坏性的音频剪辑视为虚构。

Key research findings on audio deepfakes
音频深度伪造的关键研究发现

Source: Mai et al., 2023; AI Index, 2024
来源：Mai 等人，2023 年；Al Index，2024 年

Reference audio improves deepfake detection
参考音频提高深度伪造检测

Training humans to detect deepfakes only helps slightly
训练人类识别深度伪造视频只能稍微有所帮助

Shorter deepfakes are not easier to identify
更短的深度伪造视频并不更容易识别

Listening to clips more frequently does not aid detection
更频繁地听取音频片段并不能帮助检测

Spending more time does not improve detection
花费更多时间并不能提高检测能力

Participants do not improve detection without explicit feedback
参与者在没有明确反馈的情况下无法提高检测能力

Human crowd and top automated detectors have comparable performance
人类群体和顶尖自动检测器具有可比较的性能

AI can also influence political processes in other ways. Research from Queen's University Belfast catalogs these and the potential mitigations associated with different risk cases (Figure 3.6.8). For instance, AI could be utilized for video surveillance of voters, potentially undermining the integrity of elections. The same authors identify the degree to which each AI political use case is technologically ready, the level of risk it poses, and how visible the deployment of AI would be to users (Figure 3.6.9). For example, they propose that employing AI for voter authentication is already highly feasible, and this application carries a significant risk.
Al 还可以以其他方式影响政治进程。来自贝尔法斯特女王大学的研究指出 Al 可以影响政治进程的其他方式，并提出了与不同风险案例相关的潜在缓解措施（图 3.6.8）。例如，Al 可以被用于对选民进行视频监控，可能破坏选举的公正性。同样的作者确定了每个 Al 政治使用案例的技术准备程度、风险水平以及

的部署对用户的可见性（图 3.6.9）。例如，他们提出，利用

进行选民身份验证已经非常可行，但这种应用存在重大风险。

AI usage, risks, and mitigation strategies in electoral processes
在选举过程中的 Al 使用、风险和缓解策略

Source: P et al., 2023 | Table: 2024 AI Index report
来源：P et al.，2023 年 | 表：2024 年 Al 指数报告

Avenue

AI usage

Risks

Mitigations

Voter list maintenance 选民名单维护

Heuristic-driven approximations
启发式驱动的近似

Record linkage 记录链接

Outlier detection 异常值检测

Access-integrity trade-off issues
访问完整性折衷问题

Biased AI

Overly generalized Al 过于泛化的人工智能

Access-focused AI 以访问为重点的人工智能

Reasonable explanations 合理解释

Local scrutiny 当地审查

Polling booth locations 投票站位置

Drop box location determination
投递箱位置确定

Facility location 设施位置

Clustering

Business ethos 商业道德

Volatility and finding costs
波动性和寻找成本

Partisan manipulation 党派操纵

Plural results 复数结果

Auditing Al

Disadvantaged voters 弱势选民

Predicting problem booths
预测问题投票站

Predictive policing 预测性警务

Time series motifs 时间序列主题

Systemic racism 制度性种族主义

Aggravating brutality 严苛的残酷行为

Feedback loops 反馈循环

Transparency

Statistical rigor 统计严谨

Fair Al

Voter authentication 选民认证

Face recognition 人脸识别

Biometrics

Race/gender bias 种族/性别偏见

Unknown biases 未知偏见

Voter turnout 选民投票率

Surveillance and misc. 监视和其他事项

Alternatives

Bias audits

Designing for edge cases
为边缘情况设计

Video monitoring 视频监控

Video-based vote counting
基于视频的计票

Event detection 事件检测

Person re-identification
人员重新识别

Electoral integrity 选举诚信

Marginalized communities
被边缘化的社区

Undermining other monitoring
破坏其他监测

Shallow monitoring 肤浅的监测

Open data

Assessments of AI integration and risks in electoral processes
选举过程中 AI 整合与风险的评估

Source: P et al., 2023 | Chart: 2024 Al Index report
来源：P 等人，2023 年 | 图表：2024 年 Al 指数报告

Figure 3.6.9 图 3.6.9

Preview 预览

Overview 概述
Chapter Highlights 章节亮点
4.1 What's New in 2023: A Timeline 4.1 2023 年的新变化：时间轴
4.2 Jobs 4.2 就业
AI Labor Demand 人工智能劳动力需求
Global AI Labor Demand 全球人工智能劳动力需求
U.S. AI Labor Demand by Skill Cluster and Specialized Skill 美国按技能集群和专业技能划分的人工智能劳动力需求
U.S. AI Labor Demand by Sector 美国各行业的人工智能劳动力需求
U.S. AI Labor Demand by State 美国各州的人工智能劳动力需求
AI Hiring 人工智能招聘
AI Skill Penetration 人工智能技能渗透
AI Talent 人工智能人才
Highlight: How Much Do Computer Scientists Earn? 亮点：计算机科学家赚多少？
4.3 Investment 4.3 投资
Corporate Investment 企业投资
Startup Activity 创业活动
Global Trends 全球趋势
Regional Comparison by Funding Amount 按融资金额的区域比较
Regional Comparison by Newly Funded AI Companies 按新获融资的人工智能公司的区域比较
Focus Area Analysis 重点领域分析
4.4 Corporate Activity 4.4 企业活动
Industry Adoption 行业采用
Adoption of AI Capabilities 人工智能能力的采用
Adoption of Generative AI Capabilities 生成式人工智能能力的采用
Use of AI by Developers 开发人员对人工智能的使用
Preference 偏好
Workflow 工作流程
AI's Labor Impact 人工智能对劳动力的影响
Earnings Calls 财报电话会议
Aggregate Trends 总体趋势
Specific Themes 具体主题
Highlight: Projecting AI's Economic Impact 亮点：预测人工智能的经济影响
4.5 Robot Installations 4.5 机器人安装
Aggregate Trends 总体趋势
Industrial Robots: Traditional vs. Collaborative Robots 工业机器人：传统机器人与协作机器人
By Geographic Area 按地理区域
Country-Level Data on Service Robotics 服务机器人的国家级数据
Sectors and Application Types 行业和应用类型
China vs. United States 中国与美国

Overview 概述

The integration of AI into the economy raises many compelling questions. Some predict that AI will drive productivity improvements, but the extent of its impact remains uncertain. A major concern is the potential for massive labor displacement: to what degree will jobs be automated versus augmented by AI? Companies are already using AI in various ways across industries, but some regions of the world are witnessing greater investment inflows into this transformative technology. Moreover, investor interest appears to be gravitating toward specific AI subfields like natural language processing and data management.
将 AI 整合到经济中引发了许多引人关注的问题。一些人预测 AI 将推动生产力的提高，但其影响程度仍不确定。一个主要的担忧是可能出现大规模的劳动力替代：工作岗位在多大程度上会被 AI 自动化，又在多大程度上会被 AI 增强？企业已经在各个行业以各种方式利用 AI，但世界上一些地区正在见证对这项变革性技术的更大投资流入。此外，投资者的兴趣似乎正在向特定的 AI 子领域聚集，如自然语言处理和数据管理。

This chapter examines AI-related economic trends using data from Lightcast, LinkedIn, Quid, McKinsey, Stack Overflow, and the International Federation of Robotics (IFR). It begins by analyzing AI-related occupations, covering labor demand, hiring trends, skill penetration, and talent availability. The chapter then explores corporate investment in AI, introducing a new section focused specifically on generative AI. It further examines corporate adoption of AI, assessing current usage and how developers adopt these technologies. Finally, it assesses AI's current and projected economic impact and robot installations across various sectors.
本章利用来自 Lightcast、LinkedIn、Quid、McKinsey、Stack Overflow 和国际机器人联合会（IFR）的数据，分析了与人工智能相关的经济趋势。本章首先分析与人工智能相关的职业，涵盖劳动力需求、招聘趋势、技能渗透和人才供应情况；然后探讨企业对 AI 的投资，并新增了一个专门关注生成式人工智能的部分；接着研究企业对 AI 的采用情况，评估当前的使用情况以及开发人员如何采用这些技术；最后评估人工智能当前和预期的经济影响以及各个行业的机器人安装情况。

Chapter Highlights 章节亮点

1. Generative AI investment skyrockets. Despite a decline in overall AI private investment last year, funding for generative AI surged, nearly octupling from 2022. Major players in the generative AI space, including OpenAI, Anthropic, Hugging Face, and Inflection, reported substantial fundraising rounds.
1. 生成式人工智能投资激增。尽管去年整体人工智能私人投资有所下降，但生成式人工智能的资金激增，较 2022 年增至近八倍。生成式人工智能领域的主要参与者，包括 OpenAI、Anthropic、Hugging Face 和 Inflection，都报告了大额融资。

2. Already a leader, the United States pulls even further ahead in Al private investment.
2. 美国已经是领先者，在人工智能私人投资方面进一步拉开了差距。

In 2023, AI investment in the United States was nearly 8.7 times the amount invested in China, the next highest investor. While private AI investment in China and the European Union, including the United Kingdom, has declined since 2022, the United States experienced a notable increase over the same period.

3. Fewer AI jobs, in the United States and across the globe. AI-related positions made up a smaller share of all U.S. job postings in 2023 (1.6%) than in 2022. This decline in AI job listings is attributed to fewer postings from leading AI firms and a reduced proportion of tech roles within these companies.
3. 在美国和全球范围内，人工智能相关职位减少。2023 年，人工智能相关职位在美国所有职位发布中的占比（1.6%）低于 2022 年。人工智能职位招聘数量的下降归因于领先人工智能公司发布的职位减少以及这些公司中技术岗位比例的降低。

4. AI decreases costs and increases revenues. A new McKinsey survey finds that surveyed organizations report both cost reductions and revenue increases from implementing AI (including generative AI). Compared to the previous year, there was a 10 percentage point increase in respondents reporting decreased costs, suggesting AI is driving significant business efficiency gains.
4. 人工智能降低成本并增加收入。麦肯锡的一项新调查发现，受访组织报告称，实施 AI（包括生成式 AI）既降低了成本，也增加了收入。与上一年相比，报告成本下降的受访者增加了 10 个百分点，这表明 AI 正在推动显著的业务效率提升。

5. Total AI private investment declines again, while the number of newly funded companies increases. Global private AI investment has fallen for the second year in a row, though by less than the sharp decrease from 2021 to 2022. The count of newly funded AI companies spiked to 1,812, up from the previous year.
5. 人工智能私人投资总额再次下降，而新获融资的公司数量增加。全球人工智能私人投资连续第二年下降，不过降幅小于 2021 年到 2022 年的急剧下降。新获融资的人工智能公司数量激增至 1,812 家，高于上一年。

6. AI organizational adoption ticks up. A 2023 McKinsey report finds that the share of organizations using AI (including generative AI) in at least one business unit or function is higher than in both 2022 and 2017.
6. 组织对 AI 的采用率有所提高。2023 年麦肯锡报告发现，至少在一个业务部门或职能中使用 AI（包括生成式 AI）的组织比例高于 2022 年和 2017 年。

7. China dominates industrial robotics. Since surpassing Japan in 2013 as the leading installer of industrial robots, China has significantly widened the gap with the nearest competitor nation, and its share of global installations was higher in 2022 than in 2013.
7. 中国主导工业机器人。自 2013 年超过日本成为工业机器人安装量最大的国家以来，中国与最接近的竞争对手之间的差距显著扩大，其在全球安装量中的份额在 2022 年高于 2013 年。

Chapter Highlights (cont'd)
章节亮点（续）

8. Greater diversity in robotic installations. Collaborative robots accounted for only a small share of all new industrial robot installations in 2017, and that share had climbed by 2022. Similarly, 2022 saw a rise in service robot installations across all application categories, except for medical robotics. This trend indicates not just an overall increase in robot installations but also a growing emphasis on deploying robots for human-facing roles.
8. 机器人安装的多样性增加。2017 年，协作机器人仅占所有新工业机器人安装量的一小部分，到 2022 年这一比例有所上升。同样，2022 年除医疗机器人外，各应用类别的服务机器人安装量均有所增加。这一趋势不仅表明机器人安装总量增加，还表明人们越来越重视将机器人部署到面向人类的角色中。

9. The data is in: AI makes workers more productive and leads to higher quality work. In 2023, several studies assessed AI's impact on labor, suggesting that AI enables workers to complete tasks more quickly and to improve the quality of their output. These studies also demonstrated AI's potential to bridge the skill gap between low- and high-skilled workers. Still other studies caution that using AI without proper oversight can lead to diminished performance.
9. 数据显示：人工智能提高工人生产力，带来更高质量的工作。2023 年，几项研究评估了人工智能对劳动力的影响，表明人工智能使工人能够更快地完成任务并提高其产出质量。这些研究还展示了人工智能在弥合低技能和高技能工人之间的技能差距方面的潜力。另有研究警告称，在缺乏适当监督的情况下使用人工智能可能导致绩效下降。

10. Fortune 500 companies start talking a lot about AI, especially generative AI. In 2023, AI was mentioned in 394 earnings calls (nearly 80% of all Fortune 500 companies), a notable increase from 266 mentions in 2022. Since 2018, mentions of AI in Fortune 500 earnings calls have nearly doubled. The most frequently cited theme was generative AI.
10. 财富 500 强公司开始大谈 AI，尤其是生成式 AI。2023 年，AI 在 394 次财报电话会议中被提及（接近所有财富 500 强公司的 80%），较 2022 年的 266 次显著增加。自 2018 年以来，财富 500 强公司财报电话会议中提及 AI 的次数几乎翻了一番。最常被提及的主题是生成式 AI。

The chapter begins with an overview of some of the most significant Al-related economic events in 2023, as selected by the Al Index Steering Committee.
本章以 Al 指数指导委员会精选的 2023 年一些最重要的与 Al 相关的经济事件概述开始。

4.1 What's New in 2023: A Timeline
2023 年的新变化：时间轴

InstaDeep acquired by BioNTech
InstaDeep 被 BioNTech 收购

BioNTech, known for developing the first mRNA COVID-19 vaccine in partnership with Pfizer, acquires InstaDeep for

million to advance Al-powered drug discovery, design, and development. InstaDeep specializes in creating Al systems for enterprises in biology, logistics, and energy sectors.
BioNTech，以与辉瑞合作开发第一款 mRNA COVID-19 疫苗而闻名，以

百万美元收购 InstaDeep，以推动基于人工智能的药物发现、设计和开发。InstaDeep 专注于为生物学、物流和能源领域的企业创建人工智能系统。

Source: Reuters, 2022 Figure 4.1.1
来源: 路透社, 2022 年图 4.1.1

Microsoft invests

billion in ChatGPT maker OpenAI
微软向 ChatGPT 制造商 OpenAI 投资

十亿美元

With this deal, Microsoft Azure remains the exclusive cloud provider for OpenAl, which relies on Azure to train its models. This follows Microsoft's initial

billion investment in 2019 and a subsequent investment in 2021.
通过这项交易，微软 Azure 仍然是 OpenAI 的独家云服务提供商，OpenAI 依赖 Azure 来训练其模型。这是继 2019 年微软最初投资

十亿美元和 2021 年随后的投资之后。

GitHub Copilot for Business becomes publicly available

Copilot for Business leverages an OpenAI Codex model to enhance code suggestion quality. At launch, GitHub Copilot contributed to an average of

of developers' code across various programming languages, with this figure rising to

for Java.
GitHub Copilot for Business 现已公开提供

Copilot for Business 利用 OpenAI Codex 模型来提升代码建议的质量。在推出时，GitHub Copilot 在各种编程语言中为开发者的代码贡献平均达到

，而在 Java 中这一数字上升至

。

Source: GitHub, 2023 Figure 4.1.3
来源：GitHub，2023 年图 4.1.3

Salesforce introduces Einstein GPT
Salesforce 推出 Einstein GPT

Einstein GPT, the first comprehensive Al for CRM, utilizes OpenAl's models. Einstein GPT aids Salesforce customers in sales, marketing, and customer management.
爱因斯坦 GPT，第一个为 CRM 设计的综合人工智能，利用了 OpenAI 的模型。爱因斯坦 GPT 帮助 Salesforce 客户进行销售、营销和客户管理。

Microsoft announces integration of GPT-4 into Office 365
微软宣布将 GPT-4 集成到 Office 365 中

Microsoft rolls out Copilot across Office 365, offering AI assistance in Word, PowerPoint, and Excel.
微软在 Office 365 中全面推出 Copilot，在 Word、PowerPoint 和 Excel 中提供 AI 辅助。

Bloomberg announces LLM for finance
彭博宣布金融 LLM

Bloomberg's 50-billion-parameter LLM is custom-built for analyzing financial data and tailored to finance professionals. The model can perform financial analyses on Bloomberg's extensive datasets.
彭博的 500 亿参数 LLM 专为分析金融数据而构建，并为金融专业人士量身定制。该模型能够在彭博庞大的数据集上进行金融分析。

BloombergGPT: 彭博 GPT

A Large Language Model for Finance
金融领域的大型语言模型

Adobe launches generative AI tools inside Photoshop
Adobe 在 Photoshop 中推出生成式 AI 工具

Adobe introduces generative AI features in Photoshop via Adobe Firefly, its generative image tool. Users can now add, remove, and edit images within seconds using text prompts.
Adobe 通过其生成式图像工具 Adobe Firefly 在 Photoshop 中引入了生成式 AI 功能。用户现在可以使用文本提示在几秒钟内添加、删除和编辑图像。

Cohere raises

million
Cohere 筹集了

百万美元

Cohere, focused on developing an Al model ecosystem for enterprises, raises

million in an oversubscribed Series C round. Inovia Capital led the round, with participation from Nvidia, Oracle, Salesforce Ventures, Schroders Capital, and Index Ventures.
Cohere 专注于为企业开发 Al 模型生态系统，在一轮超额认购的 C 轮融资中筹集了

百万美元。Inovia Capital 领投，Nvidia、Oracle、Salesforce Ventures、Schroders Capital 和 Index Ventures 参与了本轮融资。

Source: Cohere, 2023 源：Cohere，2023

Figure 4.1.8 图 4.1.8

Nvidia reaches

trillion valuation
Nvidia 达到

万亿估值

Nvidia's market capitalization consistently exceeds

trillion USD, driven by rising demand for its AI-powering chips. Nvidia becomes the fifth company to reach a valuation of

trillion, joining the ranks of Apple Inc. (AAPL.O), Alphabet Inc. (GOOGL.O), Microsoft Corp. (MSFT.O), and Amazon.com Inc. (AMZN.O).
Nvidia 的市值一直超过

万亿美元，受到其人工智能芯片需求的推动。Nvidia 成为第五家达到

万亿美元估值的公司，加入了苹果公司（AAPL.O）、谷歌母公司（GOOGL.O）、微软公司（MSFT.O）和亚马逊公司（AMZN.O）的行列。

Source: The Brand Hopper, 2023
来源：The Brand Hopper，2023 年

Figure 4.1.9 图 4.1.9

Databricks buys MosaicML for

billion
Databricks 以

亿美元收购 MosaicML

Databricks, a leader in data storage and management, announces its acquisition of MosaicML, a generative AI orchestration startup founded in 2021, for

billion. This move aims to enhance Databricks' generative Al capabilities.
数据存储和管理领域的领导者 Databricks 宣布以

亿美元收购了成立于 2021 年的生成 AI 编排初创公司 MosaicML。此举旨在增强 Databricks 的生成 AI 能力。

Source: Databricks, 2023 Figure 4.1.10
来源：Databricks，2023 年图 4.1.10

Thomson Reuters acquires Casetext for

million
汤森路透以

百万美元收购 Casetext

Thomson Reuters finalizes its acquisition of Casetext, a legal startup renowned for its artificial intelligence-powered assistant for law, for a staggering

million. At the time of acquisition, Casetext boasted a substantial customer base of over 10,000 law firms and corporate legal departments. Among its flagship offerings is CoCounsel, an Al legal assistant driven by GPT4 , which enables rapid document review, legal research memos, deposition preparation, and contract analysis within minutes.
汤森路透完成了对 Casetext 的收购，这家以其人工智能助手而闻名的法律初创公司，收购金额高达

百万美元。在收购时，Casetext 拥有超过 10,000 家律师事务所和公司法律部门的庞大客户群。其旗舰产品之一是 CoCounsel，这是一款由 GPT4 驱动的 AI 法律助手，可以在几分钟内进行快速文件审查、法律研究备忘录、证词准备和合同分析。

Source: Legal.io, 2023 来源：Legal.io，2023

Figure 4.1.11 图 4.1.11

Source: TechCrunch, 2023
来源：TechCrunch，2023

Figure 4.1.12 attracts investments from Microsoft, Nvidia, Reid Hoffman, Bill Gates, and Eric Schmidt, former CEO of Google.
图 4.1.12 吸引了微软、英伟达、里德·霍夫曼、比尔·盖茨和谷歌前 CEO 埃里克·施密特的投资。

Hugging Face raises

million from investors
Hugging Face 从投资者那里筹集了

百万美元

Hugging Face, a platform and community dedicated to machine learning and data science, secures an impressive

million funding round, pushing its valuation to

billion. The platform serves as a one-stop destination for building, deploying, and training machine learning models. Offering a GitHub-like hub for Al code repositories, models, and datasets, Hugging Face has attracted significant attention from industry giants.
Hugging Face，一个致力于机器学习和数据科学的平台和社区，获得了一轮令人印象深刻的

百万美元融资，将其估值推高至

十亿美元。该平台是构建、部署和训练机器学习模型的一站式目的地。提供类似 GitHub 的 Al 代码存储库、模型和数据集中心，Hugging Face 吸引了行业巨头的重视。

Source: TechCrunch, 2023 Figure 4.1.13
来源：TechCrunch，2023 年图 4.1.13

SAP introduces new generative Al assistant Joule
SAP 推出新的生成式 AI 助手 Joule

Joule is a ChatGPT-style digital assistant integrated across SAP's diverse product range. Joule will seamlessly integrate into SAP applications spanning HR, finance, supply chain, procurement, and customer experience.
Joule 是一款类似 ChatGPT 风格的数字助手，集成在 SAP 广泛的产品范围中。Joule 将无缝集成到涵盖人力资源、财务、供应链、采购和客户体验的 SAP 应用程序中。

Additionally, it will be incorporated into the SAP Business Technology Platform, extending its utility across SAP's extensive user base of nearly 300 million.
此外，它将被整合到 SAP 业务技术平台中，将其效用扩展到 SAP 近 3 亿的广泛用户群。

Source: SAP, 2023 来源：SAP，2023

Figure 4.1.14 图 4.1.14

Amazon and Google make multibillion-dollar investments in Anthropic
亚马逊和谷歌在 Anthropic 进行数十亿美元的投资

Amazon announces its intent to invest up to

billion in Anthropic, a rival of OpenAl. This significant investment follows Google's agreement to invest up to

billion in Anthropic. The deal comprises an initial

million upfront, with an additional

billion to be invested over time.
亚马逊宣布其打算投资高达

亿美元在 Anthropic，这是 OpenAl 的竞争对手。这笔重要的投资是在谷歌同意投资高达

亿美元在 Anthropic 之后进行的。该交易包括首次

百万美元的前期投资，随后将再投资

亿美元。

Source: TechCrunch, 2023
来源：TechCrunch，2023 年

Figure 4.1.15 图 4.1.15
Nov. 5, 2023 2023 年 11 月 5 日

Kai-Fu Lee launches OpenSource LLM
李开复推出开源 LLM

Kai-Fu Lee's LLM startup publicly unveils an open-source model and secures funding at a

billion valuation, with Alibaba leading the investment. Lee, known for his leadership roles at Google in China and for establishing Microsoft Research China, one of Microsoft's key international research hubs, spearheads this initiative.
李开复的 LLM 初创公司公开展示了一种开源模式，并以

十亿美元的估值获得了资金，阿里巴巴领投。李开复因在谷歌中国的领导职务以及建立微软亚洲研究院（微软的重要国际研究中心之一）而闻名，他领导了这一倡议。

Source: TechCrunch, 2023 Figure 4.1.16
来源：TechCrunch，2023 年图 4.1.16

Nov. 17, 2023 2023 年 11 月 17 日

Sam Altman, OpenAI CEO, fired and then rehired
OpenAI CEO Sam Altman 被解雇然后重新雇佣

OpenAl's board claims Altman was "not consistently candid in his communications." Chaos ensues at OpenAl. Many employees resign in response to the news, and 745 sign a letter threatening resignation if the current board members do not resign. A few days later, Altman is reinstated.
OpenAI 董事会称 Altman 在沟通上“不一贯坦诚”。OpenAI 遭受混乱。许多员工因此辞职，745 人签署一封威胁辞职的信，要求现任董事会成员辞职。几天后，Altman 被恢复原职。

Source: CoinGape, 2024 来源：CoinGape，2024

Figure 4.1.17 图 4.1.17
Dec. 11, 2023 2023 年 12 月 11 日
Mistral AI closes

million funding round

Less than six months after raising a

million seed round, Europe-based Mistral Al secures an additional

million. The startup, cofounded by alumni from Google's DeepMind and Meta, focuses on developing foundation models with an open-source technology approach, aiming to compete with OpenAl. Leading the round is Andreessen Horowitz, with participation from Lightspeed Venture Partners, Salesforce, BNP Paribas, General Catalyst, and Elad Gil.
Mistral Al 在筹集

亿美元融资后不到六个月，总部位于欧洲的 Mistral Al 又获得了额外的

万美元。这家初创公司由来自谷歌 DeepMind 和 Meta 的校友共同创立，专注于开发基于开源技术的基础模型，旨在与 OpenAl 竞争。Andressen Horowitz 领投，Lightspeed Venture Partners、Salesforce、BNP Paribas、General Catalyst 和 Elad Gil 也参与了本轮融资。

4.2 Jobs 4.2 就业机会

Al Labor Demand 人工智能劳动需求

This section analyzes the demand for Al-related skills in labor markets, drawing on data from Lightcast.
本节分析了劳动力市场对人工智能相关技能的需求，利用了来自 Lightcast 的数据。

Lightcast has analyzed hundreds of millions of job postings from over 51,000 websites since 2010, identifying those that require AI skills.
自 2010 年以来，Lightcast 已经分析了来自超过 51,000 个网站的数亿份招聘信息，识别出那些需要 AI 技能的岗位。

Global AI Labor Demand
全球人工智能劳动力需求

Figure 4.2.1 shows the percentage of job postings demanding AI skills. In 2023, the United States (1.6%), Spain (1.4%), and Sweden (1.3%) led in this metric. The U.S. figure was lower than in 2022. Although most countries saw a decrease from 2022 to 2023 in the share of job postings requiring AI skills, in many the number of AI-related job postings has increased over the past five years.¹
图 4.2.1 显示了要求 AI 技能的招聘信息所占的百分比。2023 年，美国（1.6%）、西班牙（1.4%）和瑞典（1.3%）在这一指标上处于领先地位。美国的这一数字低于 2022 年。尽管大多数国家要求 AI 技能的招聘信息份额在 2022 年至 2023 年间有所下降，但在许多国家，过去五年中与 AI 相关的招聘信息数量有所增加。

Lightcast speculates that the 2023 decrease in AI job postings is driven by many top AI employers (such as Amazon, Deloitte, Capital One, Randstad, and Elevance Health) scaling back their overall posting counts. Additionally, many companies shifted the occupational mix of their postings. For example, Amazon posted a higher share of operational roles, such as sales delivery driver, packager, and postal service/mail room worker, in 2023 than in 2022. At the same time, there was a lower share of demand for tech roles like software developers and data scientists.
Lightcast 推测，2023 年 AI 招聘信息减少是由于许多顶级 AI 雇主（如亚马逊、德勤、Capital One、Randstad 和 Elevance Health）缩减了其总体发布数量。此外，许多公司调整了其招聘信息的职业构成。例如，与 2022 年相比，亚马逊在 2023 年发布的运营类岗位（如销售送货司机、包装工和邮政服务/收发室工作人员）占比更高，而对软件开发人员和数据科学家等技术岗位的需求占比则更低。

Al job postings (% of all job postings) by geographic area, 2014-23
2014 年至 2023 年各地区 Al 工作岗位（占所有工作岗位的百分比）

Source: Lightcast, 2023 | Chart: 2024 Al Index report
源：Lightcast，2023 年 | 图表：2024 年 Al 指数报告

1 In 2023, Lightcast slightly changed its methodology for determining Al-related job postings from what was used in previous versions of the Al Index report. Lightcast also updated its taxonomy of AI-related skills. As such, some of the numbers in this chart do not completely align with those featured in last year's report.
2023 年，Lightcast 略微改变了确定与人工智能相关的工作职位的方法，与 Al 指数报告以往版本中所使用的方法不完全一致。Lightcast 还更新了其人工智能相关技能的分类。因此，本图表中的一些数字与去年报告中的数字并不完全一致。

U.S. AI Labor Demand by Skill Cluster and Specialized Skill
美国按技能群和专业技能分类的人工智能劳动力需求

Figure 4.2.2 highlights the most sought-after AI skills in the U.S. labor market since 2010. Machine learning led demand, followed by artificial intelligence and natural language processing. Despite a recent dip, machine learning continues to be the most in-demand skill. Since last year, every AI-related skill cluster tracked by Lightcast has lost market share, with the exception of generative AI, which grew by more than a factor of 10.
图 4.2.2 突出显示了自 2010 年以来美国劳动力市场中最受追捧的 AI 技能。需求最大的是机器学习，其次是人工智能和自然语言处理。尽管最近有所下降，但机器学习仍然是需求最大的技能。自去年以来，Lightcast 跟踪的每个 AI 相关技能集群的市场份额都有所下降，只有生成式 AI 例外，其增长超过了 10 倍。

Al job postings (% of all job postings) in the United States by skill cluster, 2010-23
美国按技能群体划分的 Al 职位发布情况（占所有职位发布的百分比），2010 年至 2023 年

Figure 4.2.2 图 4.2.2

Figure 4.2.3 compares the top 10 specialized skills sought in Al job postings in 2023 versus those from 2011 to 2013.

On an absolute scale, the demand for nearly every specialized skill has increased over the past decade, with Python's notable increase in popularity highlighting its ascendance as a preferred Al programming language.
图 4.2.3 比较了 2023 年 Al 工作发布中寻求的前 10 种专业技能与 2011 年至 2013 年的情况。

从绝对规模来看，过去十年几乎每种专业技能的需求都有所增加，Python 显著增加的受欢迎程度凸显了它作为首选 Al 编程语言的地位提升。

Top 10 specialized skills in 2023 Al job postings in the United States, 2011-13 vs. 2023
美国 2023 年 Al 工作发布中的前 10 种专业技能，2011-13 年与 2023 年对比

Figure 4.2.3 图 4.2.3

2 The decision to select 2011-2013 as the point of comparison was made because some jobs/skills-level data from earlier years is quite sparse. Lightcast therefore used 2011-2013 to have a larger sample size for a benchmark from 10 years ago with which to compare. Figure 4.2.3 juxtaposes the total number of job postings requiring certain skills from 2011 to 2013 with the total amount in 2023.
之所以选择 2011-2013 年作为比较点，是因为早期年份的一些工作/技能层面的数据相当稀疏。因此，Lightcast 使用 2011-2013 年来获得更大的样本量，作为 10 年前的比较基准。图 4.2.3 将 2011 年至 2013 年需要某些技能的招聘信息总数与 2023 年的总数进行了对比。

In 2023, Lightcast saw great increases in the number of U.S. job postings citing generative AI skills. That year, 15,410 job postings specifically cited generative AI as a desired skill, large language modeling was mentioned in 4,669 postings, and ChatGPT appeared in 2,841 job listings (Figure 4.2.4).
2023 年，Lightcast 发现美国招聘信息中提及生成式 AI 技能的数量大幅增加。那一年，有 15,410 条招聘信息明确将生成式 AI 列为所需技能，大型语言建模在 4,669 条招聘信息中被提及，ChatGPT 出现在 2,841 条招聘信息中（图 4.2.4）。

Generative Al skills in Al job postings in the United States, 2023
2023 年美国工作岗位中的生成式 AI 技能

Figure 4.2.4 图 4.2.4

Figure 4.2.5 illustrates what proportion of all generative AI job postings released in 2023 referenced particular generative AI skills. The most cited skill was generative AI (60.0%), followed by large language modeling (18.2%) and ChatGPT (11.1%).
图 4.2.5 展示了 2023 年发布的所有生成式 AI 招聘信息中提及特定生成式 AI 技能的比例。最常被提及的技能是生成式 AI（60.0%），其次是大型语言建模（18.2%）和 ChatGPT（11.1%）。

Share of generative Al skills in Al job postings in the United States, 2023
2023 年美国人工智能工作岗位中生成式 AI 技能的份额

Figure 4.2.5 图 4.2.5

U.S. Al Labor Demand by Sector
美国各行业劳动力需求

Figure 4.2.6 shows the percentage of U.S. job postings requiring AI skills by industry sector in 2022 and 2023. Nearly every sector experienced a decrease in the proportion of AI job postings in 2023 compared to 2022, except for public administration and educational services. The leading sectors were information (4.6%); professional, scientific, and technical services (3.3%); and finance and insurance (2.9%). As noted earlier, the decrease in AI job postings was related to changes in the hiring patterns of several major U.S. employers.
图 4.2.6 显示了 2022 年和 2023 年美国各行业要求 AI 技能的招聘信息所占的百分比。与 2022 年相比，2023 年几乎每个行业的 AI 招聘信息比例都有所下降，公共管理和教育服务除外。领先的行业包括信息（4.6%）；专业、科学和技术服务（3.3%）；以及金融和保险（2.9%）。正如前面所提到的，AI 招聘信息的减少与几家美国主要雇主的招聘模式变化有关。

Al job postings (% of all job postings) in the United States by sector, 2022 vs. 2023
美国各行业 2022 年与 2023 年的所有工作岗位（所有工作岗位的百分比）

Source: Lightcast, 2023 | Chart: 2024 Al Index report
数据来源：Lightcast，2023 年 | 图表：2024 年 Al 指数报告

U.S. Al Labor Demand by State
美国各州的 Al 劳动力需求

Figure 4.2.7 highlights the number of AI job postings in the United States by state. The top three states were California, Texas, and Virginia, in that order.
图 4.2.7 突出显示了美国各州的 AI 招聘信息数量。排名前三的州依次是加利福尼亚州、德克萨斯州和弗吉尼亚州。

Figure 4.2.8 demonstrates what percentage of a state's total job postings were Al-related. The top states according to this metric were the District of Columbia (2.7%), followed by Delaware (2.4%) and Maryland (2.1%).
图 4.2.8 展示了一个州的总工作岗位中有多少百分比是与 Al 相关的。根据这个指标，排名靠前的州是哥伦比亚特区（2.7%），其次是特拉华州（2.4%）和马里兰州（2.1%）。
Number of Al job postings in the United States by state, 2023
美国各州的 Al 工作岗位数量，2023 年

Source: Lightcast, 2023 | Chart: 2024 Al Index report
源：Lightcast，2023 | 图表：2024 Al 指数报告

Figure 4.2.7 图 4.2.7

Figure 4.2.9 examines which U.S. states accounted for the largest proportion of AI job postings nationwide. California was first, with the largest share of all AI job postings in the United States, followed by Texas (7.9%) and Virginia (5.3%).
图 4.2.9 研究了哪些美国州在全国 AI 招聘信息中所占比例最大。加利福尼亚州排名第一，在美国所有 AI 招聘信息中所占份额最大，其次是德克萨斯州（7.9%）和弗吉尼亚州（5.3%）。
Percentage of US AI job postings by state, 2023
2023 年各州美国人工智能工作岗位的百分比

Source: Lightcast, 2023 | Chart: 2024 Al Index report
来源：Lightcast，2023 年 | 图表：2024 年 Al 指数报告

Figure 4.2.9 图 4.2.9

Figure 4.2.10 illustrates the trends in the four states with highest Al job postings: Washington, California, New York, and Texas. Each experienced a notable decline in the share of total Al-related job postings from 2022 to 2023.
图 4.2.10 展示了四个拥有最多 Al 工作岗位的州的趋势：华盛顿、加利福尼亚、纽约和德克萨斯。每个州在 2022 年至 2023 年间 Al 相关工作岗位的占比均出现显著下降。

Percentage of US states' job postings in AI by select US state, 2010-23
2010 年至 2023 年间按照美国州选择的 AI 工作岗位百分比

Source: Lightcast, 2023 | Chart: 2024 Al Index report
来源：Lightcast，2023 年 | 图表：2024 年 Al 指数报告

Figure 4.2.10 图 4.2.10

Figure 4.2.11 shows how Al-related job postings have been distributed across the top four states over time. Since 2019, California's proportion of AI job postings has steadily declined, while Texas has seen a slight increase.
图 4.2.11 显示了随着时间的推移，与 Al 相关的工作岗位是如何在前四个州之间分布的。自 2019 年以来，加利福尼亚州的 AI 工作岗位比例稳步下降，而德克萨斯州略有增加。

Percentage of US AI job postings by select US state, 2010-23
2010-23 年间美国 AI 工作岗位的百分比，按选定的美国州划分

Figure 4.2.11 图 4.2.11

AI Hiring 人工智能招聘

The hiring data presented in the AI Index is based on a LinkedIn dataset of skills and jobs that appear on its platform. The geographic areas included in the sample make at least

hires each month and have LinkedIn covering a substantial portion of the labor force. LinkedIn's coverage of India's and South Korea's sizable labor forces falls below this threshold, so insights drawn about these countries should be interpreted with particular caution.
AI 指数中呈现的招聘数据基于 LinkedIn 平台上出现的技能和职位的数据集。样本中包括的地理区域每月至少招聘

人，并且 LinkedIn 覆盖了相当大比例的劳动力。LinkedIn 对印度和韩国庞大劳动力的覆盖率低于此阈值，因此对这些国家得出的见解应特别谨慎解读。

Figure 4.2.12 reports the relative AI hiring rate year-over-year ratio by geographic area. The overall hiring rate is computed as the percentage of LinkedIn members who added a new employer in the same period the job began, divided by the total number of LinkedIn members in the corresponding location. The relative AI talent hiring rate, in turn, is the year-over-year change in AI hiring relative to the overall hiring rate in the same geographic area.

Therefore, Figure 4.2.12 illustrates which regions have experienced the most significant rise in AI talent recruitment compared to the overall hiring rate, serving as an indicator of AI hiring vibrancy. In 2023, the regions with the greatest relative AI hiring rates year over year were Hong Kong (28.8%), followed by Singapore (18.9%) and Luxembourg (18.9%). This means, for example, that in 2023 in Hong Kong, the ratio of AI talent hiring relative to overall hiring grew by 28.8%.
图 4.2.12 报告了按地理区域划分的相对 AI 招聘率同比比率。总体招聘率的计算方法是：在工作开始的同一时期添加了新雇主的 LinkedIn 会员所占的百分比，除以相应地点的 LinkedIn 会员总数。相对 AI 人才招聘率则是 AI 招聘相对于同一地理区域总体招聘率的同比变化。

因此，图 4.2.12 说明了哪些地区的 AI 人才招聘相对于总体招聘率经历了最显著的增长，可作为 AI 招聘活力的指标。2023 年，相对 AI 招聘率同比增长最大的地区是香港（28.8%），其次是新加坡（18.9%）和卢森堡（18.9%）。这意味着，例如，2023 年在香港，AI 人才招聘相对于总体招聘的比率增长了 28.8%。

Relative Al hiring rate year-over-year ratio by geographic area, 2023
2023 年按地理区域划分的相对 Al 招聘率年度比率

Source: Linkedln, 2023 | Chart: 2024 Al Index report
来源：Linkedln，2023 年 | 图表：2024 年 Al 指数报告

Figure 4.2.13 showcases the year-over-year ratio of Al hiring by geographic areas over the past five years. Starting from the beginning of 2023, countries including Australia, Canada, Singapore, and India have experienced a noticeable uptick in Al hiring.
图 4.2.13 展示了过去五年各地区 Al 招聘的同比增长比率。从 2023 年初开始，包括澳大利亚、加拿大、新加坡和印度在内的国家的 Al 招聘明显增加。

3 For each month, LinkedIn calculates the AI hiring rate in the geographic area, divides the AI hiring rate by the overall hiring rate in that geographic area, calculates the year-over-year change of this ratio, and then takes the 12-month moving average using the last 12 months.
对于每个月，LinkedIn 计算各地区的 AI 招聘率，将 AI 招聘率除以该地区的总体招聘率，计算这一比率的同比变化，然后使用过去 12 个月的数据计算 12 个月移动平均值。

4 For brevity, the visualization only includes the top 15 countries for this metric.
为简洁起见，可视化仅包括此指标的前 15 个国家。
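
The procedure in footnote 3 for the relative AI hiring rate can be sketched in a few lines. This is a minimal illustration, not LinkedIn's implementation: the function name, the input layout (two equal-length monthly lists, oldest first), and the reading of "year-over-year change" as a fractional change are all assumptions made for the example, and the input numbers are made up.

```python
from typing import List, Optional


def relative_ai_hiring_yoy(
    ai_hiring_rate: List[float],
    overall_hiring_rate: List[float],
) -> List[Optional[float]]:
    """Hypothetical helper: monthly rates in, smoothed year-over-year change out.

    Both lists hold one value per month, oldest first. The result has the same
    length; entries are None until 12 year-over-year values are available.
    """
    if len(ai_hiring_rate) != len(overall_hiring_rate):
        raise ValueError("both series must cover the same months")

    # Step 1: AI hiring rate relative to the overall hiring rate, per month.
    ratio = [ai / overall for ai, overall in zip(ai_hiring_rate, overall_hiring_rate)]

    # Step 2: year-over-year change of that ratio (assumed to be fractional,
    # so 0.288 would correspond to a reported 28.8%).
    yoy: List[Optional[float]] = [None] * len(ratio)
    for month in range(12, len(ratio)):
        yoy[month] = ratio[month] / ratio[month - 12] - 1.0

    # Step 3: 12-month moving average over the last 12 year-over-year values.
    smoothed: List[Optional[float]] = [None] * len(ratio)
    for month in range(23, len(ratio)):
        window = yoy[month - 11 : month + 1]
        smoothed[month] = sum(window) / 12
    return smoothed


# Made-up example: the AI hiring rate is 20% higher in every month of year two,
# while the overall hiring rate stays flat.
ai = [0.0010] * 12 + [0.0012] * 12
overall = [0.02] * 24
result = relative_ai_hiring_yoy(ai, overall)
print(result[-1])
```

Under these assumptions, two full years of monthly data are needed before the first smoothed value appears, which is why the earlier entries are left as `None`.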

Relative Al hiring rate year-over-year ratio by geographic area, 2018-23
按地理区域划分的相对人工智能招聘率年增长比率，2018-23

Source: Linkedln, 2023 | Chart: 2024 Al Index report
来源：领英，2023 | 图表：2024 年人工智能指数报告

Al Skill Penetration 人工智能技能渗透率

Figures 4.2.14 and 4.2.15 highlight relative AI skill penetration. The aim of this indicator is to measure the intensity of AI skills in an entity (such as a particular country, industry, or gender). The AI skill penetration rate signals the prevalence of AI skills across occupations or the intensity with which LinkedIn members utilize AI skills in their jobs. For example, the top 50 skills for the occupation of engineer are calculated based on the weighted frequency with which they appear in LinkedIn members' profiles. If, for instance, four of the skills that engineers possess belong to the AI skill group, the penetration of AI skills among engineers is estimated to be 8% (4 of 50).
图 4.2.14 和 4.2.15 突出了相对 AI 技能渗透率。该指标的目的是衡量某一实体（如特定国家、行业或性别）中 AI 技能的强度。AI 技能渗透率表明 AI 技能在各职业中的普及程度，或 LinkedIn 会员在工作中使用 AI 技能的强度。例如，工程师职业的前 50 项技能是根据它们在 LinkedIn 会员档案中出现的加权频率计算的。如果工程师拥有的技能中有四项属于 AI 技能组，那么工程师中 AI 技能的渗透率估计为 8%（50 项中的 4 项）。

For the period from 2015 to 2023, the countries with the highest AI skill penetration rates were India (2.8), the United States (2.2), and Germany (1.9). In the United States, therefore, the relative penetration of AI skills was 2.2 times the global average across the same set of occupations.
从 2015 年到 2023 年，AI 技能渗透率最高的国家是印度（2.8）、美国（2.2）和德国（1.9）。因此，在美国，AI 技能的相对渗透率是同一组职业全球平均水平的 2.2 倍。
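
To make the two quantities concrete, here is a minimal Python sketch. It assumes a simplified reading of the method: penetration is the share of an occupation's top skills that fall in the AI skill group, and the relative rate is a country's average penetration divided by the global average over the occupations both cover. The function names and all sample data are hypothetical, and LinkedIn's actual weighting may differ.

```python
def skill_penetration(top_skills, ai_skill_group):
    """Share of an occupation's top skills that belong to the AI skill group."""
    hits = sum(1 for skill in top_skills if skill in ai_skill_group)
    return hits / len(top_skills)


def relative_penetration(country_rates, global_rates):
    """Country average penetration divided by the global average,
    computed over the occupations present in both mappings."""
    shared = sorted(set(country_rates) & set(global_rates))
    if not shared:
        raise ValueError("no occupations in common")
    country_avg = sum(country_rates[o] for o in shared) / len(shared)
    global_avg = sum(global_rates[o] for o in shared) / len(shared)
    return country_avg / global_avg


# Hypothetical top-50 skill list with four AI skills, as in the engineer example.
ai_group = {"machine learning", "deep learning", "computer vision", "nlp"}
top_50 = sorted(ai_group) + [f"skill_{i}" for i in range(46)]
engineer_rate = skill_penetration(top_50, ai_group)  # 4 of 50 skills
```

A relative rate above 1 means AI skills are more prevalent in that country than in the global benchmark for the same occupations; a rate of 2.2 corresponds to the United States figure quoted above.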

Relative AI skill penetration rate by geographic area, 2015-23
2015 年至 2023 年各地区相对 AI 技能渗透率

Figure 4.2.15 disaggregates AI skill penetration rates by gender across different countries or regions. A country's rate of 1.5 for women means female members in that country are 1.5 times more likely to list AI skills than the average member in all countries pooled together across the same set of occupations in the country. For all countries in the sample, the relative AI skill penetration rate is greater for men than women. India (1.7), the United States (1.2), and Israel (0.9) have the highest reported relative AI skill penetration rates for women.
图 4.2.15 按性别对不同国家或地区的 AI 技能渗透率进行了细分。某国女性的比率为 1.5，意味着该国女性会员列出 AI 技能的可能性是所有国家同一组职业中平均会员的 1.5 倍。在样本中的所有国家，男性的相对 AI 技能渗透率都高于女性。印度（1.7）、美国（1.2）和以色列（0.9）报告的女性相对 AI 技能渗透率最高。

Relative AI skill penetration rate across gender, 2015-23
2015-23 年按性别划分的相对 AI 技能渗透率

Source: LinkedIn, 2023 | Chart: 2024 AI Index report
来源：LinkedIn，2023 年 | 图表：2024 年 AI 指数报告

Figure 4.2.15 图 4.2.15

AI Talent 人工智能人才

Figures 4.2.16 to 4.2.18 examine AI talent by country. A LinkedIn member is considered AI talent if they have explicitly added AI skills to their profile or work in AI. Counts of AI talent are used to calculate talent concentration, or the portion of members who are AI talent. Note that concentration metrics may be influenced by LinkedIn coverage in these countries and should be used with caution.
图 4.2.16 至 4.2.18 考察各国的人工智能人才。如果 LinkedIn 会员在个人资料中明确添加了人工智能技能，或者从事人工智能工作，他们将被视为人工智能人才。人工智能人才的数量用于计算人才集中度，即人工智能人才在会员中所占的比例。请注意，集中度指标可能受到 LinkedIn 在这些国家的覆盖范围影响，应谨慎使用。

Figure 4.2.16 shows AI talent concentration in various countries. In 2023, the countries with the highest concentrations of AI talent included Israel (1.1%), Singapore (0.9%), and South Korea (0.8%).
图 4.2.16 显示了各个国家的人工智能人才集中度。2023 年，人工智能人才集中度最高的国家包括以色列（1.1%）、新加坡（0.9%）和韩国（0.8%）。

Figure 4.2.17 looks at the percent change in AI talent concentration for a selection of countries since 2016. During that time period, several major economies registered substantial increases in their AI talent pools. The countries showing the greatest increases are India (263%), Cyprus (229%), and Denmark (213%).
图 4.2.17 研究了自 2016 年以来部分国家人工智能人才集中度的百分比变化。在此期间，几个主要经济体的人工智能人才库都有大幅增加。增幅最大的国家是印度（263%）、塞浦路斯（229%）和丹麦（213%）。
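
As a rough illustration of how the two talent metrics relate, the sketch below computes a concentration and a percent change. The function names and the member counts are hypothetical; only the 1.1% and 263% figures come from the text.

```python
def talent_concentration(ai_talent_count, member_count):
    """Portion of members who count as AI talent (0.011 means 1.1%)."""
    return ai_talent_count / member_count


def percent_change(old, new):
    """Percent change from an earlier value to a later one."""
    return (new - old) / old * 100


# Hypothetical counts that would yield a 1.1% concentration.
example = talent_concentration(11_000, 1_000_000)

# A concentration rising from 0.200% to 0.726% is a 263% increase.
growth = percent_change(0.002, 0.00726)
```

Because the percent change is relative to the 2016 base, a country with a very small initial concentration can show a large increase while still having a modest concentration in 2023.
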
AI talent concentration by geographic area, 2023
2023 年各地区的人工智能人才集中度

Source: LinkedIn, 2023 | Chart: 2024 AI Index report
来源：领英，2023 年 | 图表：2024 年人工智能指数报告

Figure 4.2.16 图 4.2.16

Percentage change in AI talent concentration by geographic area, 2016 vs. 2023
2016 年与 2023 年按地理区域划分的 AI 人才集中度百分比变化

Source: LinkedIn, 2023 | Chart: 2024 AI Index report
来源：LinkedIn，2023 年 | 图表：2024 年 AI 指数报告

Figure 4.2.17 图 4.2.17

AI talent concentration by gender, 2016-23
2016-23 年按性别划分的 AI 人才集中度

Source: LinkedIn, 2023 | Chart: 2024 AI Index report
来源：LinkedIn，2023 年 | 图表：2024 年 AI 指数报告

Figure 4.2.18 图 4.2.18

LinkedIn data provides insights on the AI talent gained or lost due to migration trends. Net flows are defined as total arrivals minus departures within the given time period. Figure 4.2.19 examines net AI talent migration per 10,000 LinkedIn members by geographic area. The countries that report the greatest incoming migration of AI talent are Luxembourg, Switzerland, and the United Arab Emirates.
LinkedIn 数据提供了关于因迁移趋势而获得或失去的人工智能人才的见解。净流量定义为给定时间段内的总流入量减去流出量。图 4.2.19 按地理区域考察了每 10,000 名 LinkedIn 会员的人工智能人才净迁移情况。人工智能人才流入最多的国家是卢森堡、瑞士和阿拉伯联合酋长国。
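
The net-flow definition above reduces to a one-line normalization. The following Python sketch is illustrative only; the function name and the sample counts are hypothetical.

```python
def net_migration_per_10k(arrivals, departures, members):
    """Net AI talent flow (arrivals minus departures) per 10,000 members."""
    return (arrivals - departures) / members * 10_000


# Hypothetical country gaining talent: 1,200 arrivals, 900 departures,
# 2,000,000 members.
inflow = net_migration_per_10k(1_200, 900, 2_000_000)

# Hypothetical country losing talent: the result is negative.
outflow = net_migration_per_10k(500, 800, 1_000_000)
```

Normalizing by membership lets small countries such as Luxembourg rank highly even when their absolute inflows are modest.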

Net AI talent migration per 10,000 LinkedIn members by geographic area, 2023
2023 年按地理区域每 10,000 名 LinkedIn 会员的人工智能人才净迁移

Source: LinkedIn, 2023; World Bank Group, 2023 | Chart: 2024 AI Index report
来源：领英，2023 年；世界银行集团，2023 年 | 图表：2024 年 AI 指数报告

Figure 4.2.19 图 4.2.19

Figure 4.2.20 documents AI talent migration data over time. In the last few years, Israel, India, and South Korea have seen declining net AI talent migration figures, suggesting that AI talent has been increasingly flowing out of these countries.
图 4.2.20 记录了人工智能人才迁移数据随时间的变化。在过去几年中，以色列、印度和韩国的人工智能人才净迁移数字呈下降趋势，表明人工智能人才正越来越多地流出这些国家。

Net AI talent migration per 10,000 LinkedIn members by geographic area, 2019-23
2019-23 年按地理区域每 10,000 名 LinkedIn 会员的人工智能人才净迁移

Source: LinkedIn, 2023; World Bank Group, 2023 | Chart: 2024 AI Index report
来源：LinkedIn，2023 年；世界银行集团，2023 年 | 图表：2024 年 AI 指数报告

6 Asterisks indicate that a country's y-axis label is scaled differently than the y-axis label for the other countries.
6 星号表示某个国家的 y 轴标签与其他国家的 y 轴标签的比例不同。

Highlight: 突出显示:

How Much Do Computer Scientists Earn?
计算机科学家赚多少钱？

Every year, Stack Overflow conducts a survey of its community of professional developers who use their tools. The latest iteration of the survey profiled over 90,000 developers.
每年，Stack Overflow 对其专业开发人员社区进行一项调查，调查他们使用的工具。最新一期的调查对超过 90,000 名开发人员进行了分析。

Through this survey, respondents were asked about their income. It is important to note that these respondents do not work exclusively with AI. However, examining developer salaries can serve as a means to approximate the compensation of talent in AI-adjacent industries. Figure 4.2.21 examines the salaries of professional developers disaggregated by position.
在这项调查中，受访者被问及他们的收入。需要注意的是，这些受访者并不专门从事人工智能领域的工作。然而，考察开发人员的薪资可以作为近似衡量人工智能相关行业人才薪酬的一种手段。图 4.2.21 分析了按职位细分的专业开发人员的薪资。

Salaries vary by position and geography. For instance, the average global salary for a cloud infrastructure engineer is […]. In the United States, the average salary for such a position is […]. Both globally and in the United States, the highest compensated roles are senior executives, followed by engineering managers. For all surveyed positions, salaries are significantly higher in the United States than in other countries.
薪资因职位和地理位置而异。例如，云基础架构工程师的全球平均薪资为 […]。在美国，这一职位的平均薪资为 […]。无论是全球范围还是在美国，薪资最高的职位是高级管理人员，其次是工程经理。对于所有调查的职位来说，美国的薪资都明显高于其他国家。


Median yearly salary by professional developer type, 2023
2023 年各种专业开发者的年均工资中位数

Figure 4.2.21 图 4.2.21

This section monitors AI investment trends, leveraging data from Quid, which analyzes investment data from more than 8 million companies worldwide, both public and private. Employing natural language processing, Quid sifts through vast unstructured datasets, including news aggregations, blogs, company records, and patent databases, to detect patterns and insights. Additionally, Quid is constantly expanding its database to include more companies, sometimes resulting in higher reported investment volumes for specific years. For the first time, this year's investment section in the AI Index includes data on generative AI investments.
本节监测人工智能投资趋势，利用来自 Quid 的数据，该数据分析来自全球超过 800 万家公司（包括上市公司和私人公司）的投资数据。Quid 利用自然语言处理技术，筛选海量的非结构化数据集，包括新闻聚合、博客、公司记录和专利数据库，以便发现模式和见解。此外，Quid 不断扩大其数据库，以包括更多公司，有时会导致特定年份报告的投资额增加。今年的人工智能指数投资部分首次包含生成式人工智能投资数据。

4.3 Investment 4.3 投资

Corporate Investment 企业投资

Figure 4.3.1 illustrates the trend in global corporate AI investment from 2013 to 2023, including mergers and acquisitions, minority stakes, private investments, and public offerings. For the second consecutive year, global corporate investment in AI has seen a decline.
图 4.3.1 展示了 2013 年至 2023 年全球企业 AI 投资的趋势，包括并购、少数股权、私人投资和公开发行。全球企业对 AI 的投资已连续第二年下降。

In 2023, the total investment dropped to […] billion, a decrease of approximately 20% from 2022. Despite a slight reduction in private investment, the most significant downturn occurred in mergers and acquisitions, which fell by […] from the previous year. However, over the past decade, AI-related investments have increased thirteenfold.
2023 年，总投资额下降到 […] 亿美元，较 2022 年减少约 20%。尽管私人投资略有减少，但最显著的下滑发生在并购活动上，较前一年下降了 […]。然而，在过去十年中，人工智能相关的投资增长了十三倍。

Global corporate investment in AI by investment activity, 2013-23
2013-2023 年间全球企业在人工智能方面的投资活动

Source: Quid, 2023 | Chart: 2024 AI Index report
来源：Quid，2023 | 图表：2024 年 AI 指数报告

Figure 4.3.1 图 4.3.1

Startup Activity 创业活动

This section analyzes private investment trends in artificial intelligence startups that have received over $1.5 million in investment since 2013.
本节分析自 2013 年以来获得 150 万美元以上投资的人工智能初创公司的私人投资趋势。

Global Trends 全球趋势

Global private AI investment has declined for the second consecutive year (Figure 4.3.2). However, the decrease from 2022 was small (-7.2%) and smaller than the drop observed from 2021 to 2022. Despite recent declines, private AI investment globally has grown substantially in the last decade.
全球 AI 私人投资连续第二年下降（图 4.3.2）。然而，与 2022 年相比的降幅较小（-7.2%），且小于 2021 年至 2022 年的降幅。尽管最近有所下降，全球 AI 私人投资在过去十年间仍大幅增长。

Private investment in AI, 2013-23
2013-23 年间的人工智能私人投资

Figure 4.3.2 图 4.3.2

While overall AI private investment decreased last year, funding for generative AI sharply increased (Figure 4.3.3). In 2023, the sector attracted […] billion, nearly nine times the investment of 2022 and about 30 times the amount from 2019. Furthermore, generative AI accounted for over a quarter of all AI-related private investment in 2023.
尽管去年整体人工智能私人投资减少，但生成式人工智能的资金大幅增加（图 4.3.3）。2023 年，该领域吸引了 […] 亿美元的投资，几乎是 2022 年的九倍，约为 2019 年的 30 倍。此外，生成式人工智能在 2023 年占所有人工智能相关私人投资的四分之一以上。

Private investment in generative AI, 2019-23
2019-23 年生成 AI 的私人投资

Figure 4.3.3 图 4.3.3

Interestingly, the number of newly funded AI companies jumped to 1,812, a […] increase over the previous year (Figure 4.3.4). Figure 4.3.5 visualizes the average size of AI private investment events, calculated by dividing the total yearly AI private investment by the total number of AI private investment events. From 2022 to 2023, the average increased marginally, growing from […] million to […] million.
有趣的是，新获资助的 AI 公司数量跃升至 1,812 家，比上一年增加了 […]（图 4.3.4）。图 4.3.5 展示了 AI 私人投资事件的平均规模，其计算方法是将每年的 AI 私人投资总额除以 AI 私人投资事件的总数。从 2022 年到 2023 年，平均规模略有增加，从 […] 百万增长到 […] 百万。
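
The average described above is a plain division of yearly totals by event counts. A small Python sketch follows; the function name is hypothetical and the dollar total and event count are made-up round numbers, not values from the report.

```python
def average_event_size(total_investment_usd, event_count):
    """Average size of a private investment event in US dollars."""
    if event_count <= 0:
        raise ValueError("event_count must be positive")
    return total_investment_usd / event_count


# Hypothetical year: $100 billion spread over 4,000 events.
avg = average_event_size(100_000_000_000, 4_000)
avg_in_millions = avg / 1_000_000
```

Note that a mean of this kind is sensitive to a handful of very large rounds, so it can rise even in a year when most funding categories shrink.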

Number of newly funded AI companies in the world, 2013-23
2013-23 年全球新获资助的人工智能公司数量

Average size of AI private investment events, 2013-23
2013-23 年人工智能私人投资事件的平均规模

Figure 4.3.4 图 4.3.4

Source: Quid, 2023 | Chart: 2024 AI Index report
来源：Quid，2023 | 图表：2024 年 AI 指数报告

2023 marked a significant increase in the number of newly funded generative AI companies, with 99 new startups receiving funding, compared to 56 in 2022, and 31 in 2019 (Figure 4.3.6).
2023 年新获资助的生成式人工智能公司数量显著增加，有 99 家新创公司获得资金，而 2022 年为 56 家，2019 年为 31 家（图 4.3.6）。

Number of newly funded generative AI companies in the world, 2019-23
2019 年至 2023 年全球新获资助的生成式人工智能公司数量

Figure 4.3.7 reports AI funding events disaggregated by size. In 2023, AI private investment events decreased across nearly all funding size categories, except for those exceeding $500 million.
图 4.3.7 报告了按规模细分的 AI 融资事件。2023 年，几乎所有融资规模类别中的 AI 私人投资事件都有所减少，只有超过 5 亿美元的事件除外。

AI private investment events by funding size, 2022 vs. 2023
2022 年与 2023 年按融资规模划分的 AI 私人投资事件

Source: Quid, 2023 | Table: 2024 AI Index report
来源：Quid，2023 | 表格：2024 年 AI 指数报告

Funding Size	2022	2023
Over $1 billion 超过 10 亿美元	7	9
$500 million - $1 billion 5 亿美元 - 10 亿美元	6	7
$100 million - $500 million 1 亿美元 - 5 亿美元	187	120
$50 million - $100 million 5000 万美元 - 1 亿美元	260	182
Under $50 million 少于 5000 万美元	2,840	2,641
Undisclosed 未披露	694	680
Total 总计	3,994	3,639
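
The table can be checked mechanically. The Python sketch below re-enters the counts, confirms that each column sums to the reported total, and lists the categories that grew from 2022 to 2023. The dollar thresholds in the first three row labels are inferred from the sequence of categories, and the variable names are hypothetical.

```python
# Event counts per funding size: (2022, 2023).
events = {
    "Over $1 billion": (7, 9),
    "$500 million - $1 billion": (6, 7),
    "$100 million - $500 million": (187, 120),
    "$50 million - $100 million": (260, 182),
    "Under $50 million": (2840, 2641),
    "Undisclosed": (694, 680),
}

# Column totals for 2022 and 2023.
totals = tuple(sum(column) for column in zip(*events.values()))

# Categories whose event count rose year over year.
increased = [name for name, (y2022, y2023) in events.items() if y2023 > y2022]

print(totals)
print(increased)
```

Only the two largest categories appear in `increased`, which is consistent with the statement that events declined in nearly every size bracket except the largest deals.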

Regional Comparison by Funding Amount
根据资金金额进行区域比较

The United States once again led the world in terms of total AI private investment. In 2023, the […] billion invested in the United States was roughly 8.7 times greater than the amount invested in the next highest country, China ([…] billion), and 17.8 times the amount invested in the United Kingdom ([…] billion) (Figure 4.3.8).
美国再次在 AI 私人投资总额方面领先世界。2023 年，美国投资的 […] 亿美元大约是排名第二的中国投资额（[…] 亿美元）的 8.7 倍，是英国投资额（[…] 亿美元）的 17.8 倍（图 4.3.8）。

Private investment in AI by geographic area, 2023
2023 年各地区 AI 的私人投资

Source: Quid, 2023 | Chart: 2024 AI Index report
来源：Quid，2023 | 图表：2024 年 AI 指数报告

When aggregating private AI investments since 2013, the country rankings remain the same: The United States leads with […] billion invested, followed by China with […] billion, and the United Kingdom at […] billion (Figure 4.3.9).
汇总自 2013 年以来的 AI 私人投资，各国排名保持不变：美国以 […] 亿美元的投资领先，中国以 […] 亿美元紧随其后，英国以 […] 亿美元位列第三（图 4.3.9）。

Private investment in AI by geographic area, 2013-23 (sum)
2013-23 年各地区 AI 的私人投资总额

Figure 4.3.9 图 4.3.9

Figure 4.3.10, which looks at AI private investment over time by geographic area, suggests that the gap in private investments between the United States and other regions is widening over time. While AI private investments have decreased in China (-44.2%) and the European Union plus the United Kingdom (-14.1%) since 2022, the United States has seen a significant increase (22.1%) during the same period.
图 4.3.10 按地理区域展示了 AI 私人投资随时间的变化，表明美国与其他地区之间的私人投资差距正在随时间扩大。自 2022 年以来，中国的 AI 私人投资减少了 44.2%，欧盟加英国减少了 14.1%，而美国在同一时期显著增加了 22.1%。

Private investment in AI by geographic area, 2013-23
2013-23 年按地理区域的 AI 私人投资

The disparity in regional AI private investment becomes particularly pronounced when examining generative AI-related investments. For instance, in 2022, the United States outpaced the combined investments of the European Union plus United Kingdom in generative AI by approximately […] billion (Figure 4.3.11). By 2023, this gap widened to […] billion.
在考察生成式 AI 相关投资时，地区 AI 私人投资的不平衡尤为显著。例如，在 2022 年，美国的生成式 AI 投资比欧盟加英国的合计投资高出约 […] 亿美元（图 4.3.11）。到 2023 年，这一差距扩大到了 […] 亿美元。

Private investment in generative AI by geographic area, 2019-23
2019 年至 2023 年各地区生成式 AI 的私人投资

Source: Quid, 2023 | Chart: 2024 AI Index report
来源：Quid，2023 | 图表：2024 年 AI 指数报告

Regional Comparison by Newly Funded AI Companies
新获资助的人工智能公司的区域比较

This section examines the number of newly funded AI companies across different geographic regions. Consistent with trends in private investment, the United States leads all regions with 897 new AI companies, followed by China with 122, and the United Kingdom with 104 (Figure 4.3.12).
本节考察不同地理区域新获资助的人工智能公司数量。与私人投资趋势一致，美国以 897 家新的人工智能公司领先于所有地区，其次是中国的 122 家和英国的 104 家（图 4.3.12）。

Number of newly funded AI companies by geographic area, 2023
2023 年各地区新获资助的人工智能公司数量

Figure 4.3.12 图 4.3.12

A similar trend is evident in the aggregate data since 2013. In the last decade, the number of newly funded AI companies in the United States is around 3.8 times the number in China, and 7.6 times the number in the United Kingdom (Figure 4.3.13).
自 2013 年以来的汇总数据显示出类似的趋势。在过去的十年中，美国新获资助的人工智能公司数量约为中国的 3.8 倍，是英国的 7.6 倍（图 4.3.13）。

Figure 4.3.14 presents data on newly funded AI companies in specific geographic regions, highlighting a decade-long trend where the United States consistently surpasses both the European Union and the United Kingdom, as well as China. Since 2022, the United States, along with the European Union and the United Kingdom, has seen significant increases in the number of new AI companies, in contrast to China, which experienced a slight year-over-year decrease.
图 4.3.14 展示了特定地理区域新获资助的 AI 公司数据，突出了一个持续十年的趋势，即美国始终超过欧盟、英国以及中国。自 2022 年以来，美国、欧盟和英国的新人工智能公司数量显著增加，而中国则同比略有减少。

Number of newly funded AI companies by geographic area, 2013-23
2013-23 年各地区新获资助的人工智能公司数量

Source: Quid, 2023 | Chart: 2024 AI Index report
来源：Quid，2023 年 | 图表：2024 年人工智能指数报告

Figure 4.3.14 图 4.3.14

Focus Area Analysis 焦点领域分析

Quid also disaggregates private AI investment by focus area. Figure 4.3.15 compares global private AI investment by focus area in 2023 versus 2022. The focus areas that attracted the most investment in 2023 were AI infrastructure/research/governance ([…] billion); NLP and customer support ([…] billion); and data management and processing ([…] billion). The prominence of AI infrastructure, research, and governance reflects large investments in companies specifically building AI applications, such as OpenAI, Anthropic, and Inflection AI.
Quid 还按重点领域对 AI 私人投资进行了细分。图 4.3.15 比较了 2023 年和 2022 年全球 AI 私人投资在各重点领域的情况。2023 年吸引最多投资的重点领域是 AI 基础设施/研究/治理（[…] 亿美元）、NLP 和客户支持（[…] 亿美元）以及数据管理和处理（[…] 亿美元）。AI 基础设施、研究和治理的突出地位反映了对专门构建 AI 应用程序的公司的大量投资，例如 OpenAI、Anthropic 和 Inflection AI。

Figure 4.3.16 presents trends over time in AI focus area investments. As noted earlier, most focus areas saw declining investments in the last year. Conversely, some of the areas that saw growth since 2022 include AI infrastructure/research/governance and data management and processing. Although still substantial, investments in medical and healthcare as well as NLP and customer support peaked in 2021 and have since declined.
图 4.3.16 展示了 AI 重点领域投资随时间的趋势。如前所述，大多数重点领域的投资在过去一年中有所下降。相反，自 2022 年以来出现增长的领域包括 AI 基础设施/研究/治理以及数据管理和处理。医疗保健以及 NLP 和客户支持领域的投资虽然仍然可观，但在 2021 年达到峰值后已经下降。

Private investment in AI by focus area, 2022 vs. 2023
2022 年与 2023 年按重点领域的人工智能私人投资

Source: Quid, 2023 | Chart: 2024 AI Index report
来源：Quid，2023 年 | 图表：2024 年人工智能指数报告

Figure 4.3.15 图 4.3.15

Private investment in AI by focus area, 2017-23
2017-23 年按重点领域的 AI 私人投资

Source: Quid, 2023 | Chart: 2024 AI Index report
来源：Quid，2023 年 | 图表：2024 年 AI 指数报告

Finally, Figure 4.3.17 shows private investment in AI by focus area over time within select geographic regions, highlighting how private investment priorities in AI differ across geographies. The significant increases observed in AI infrastructure/research/governance were mostly driven by investment in the United States. The United States significantly outpaces China and the European Union and United Kingdom in investment in almost all focus area categories. A notable exception is facial recognition, where 2023 investment totals were […] million in the United States and […] million in China. Likewise, in semiconductor investments, China ([…] million) is not far behind the United States ([…] million).
最后，图 4.3.17 显示了选定地理区域内按重点领域划分的 AI 私人投资随时间的变化，突出了各地区 AI 私人投资重点的差异。AI 基础设施/研究/治理方面的显著增长主要是由美国的投资推动的。美国在几乎所有重点领域类别的投资上都明显领先于中国、欧盟和英国。一个值得注意的例外是面部识别，2023 年美国在这方面的投资总额为 […] 万美元，中国为 […] 万美元。同样，在半导体投资方面，中国（[…] 万美元）与美国（[…] 万美元）相差不远。

Private investment in AI by focus area and geographic area, 2017-23
2017-23 年按重点领域和地理区域的 AI 私人投资

Source: Quid, 2023 | Chart: 2024 AI Index report
来源：Quid，2023 年 | 图表：2024 年 AI 指数报告


This section examines the practical application of AI by corporations, highlighting industry adoption trends, how businesses are integrating AI, the specific AI technologies deemed most beneficial, and the impact of AI adoption on financial performance.
本节探讨了企业对人工智能的实际应用，重点介绍行业采用趋势、企业如何整合 AI、被认为最有益的具体 AI 技术，以及采用 AI 对财务表现的影响。

4.4 Corporate Activity 4.4 企业活动

Industry Adoption 行业采用

This section incorporates insights from McKinsey's "The State of AI in 2023: Generative AI's Breakout Year," alongside data from prior editions. The 2023 McKinsey analysis is based on a survey of 1,684 respondents across various regions, industries, company sizes, functional areas, and tenures. For the first time, this year's version of the McKinsey survey included detailed questions about generative AI adoption and hiring trends for AI-related positions.
本节内容整合了麦肯锡公司《2023 年 AI 现状：生成式 AI 的突破之年》的见解，以及之前版本的数据。2023 年麦肯锡的分析基于对来自不同地区、行业、公司规模、职能领域和任职时间的 1,684 名受访者的调查。今年的麦肯锡调查首次包含了关于生成式 AI 采用情况和 AI 相关职位招聘趋势的详细问题。

Adoption of AI Capabilities
AI 能力的采用

The latest McKinsey report reveals that in 2023, 55% of organizations surveyed have implemented AI in at least one business unit or function, marking a slight increase from […] in 2022 and a significant jump from 20% in 2017 (Figure 4.4.1). AI adoption has spiked over the past five years, and in the future, McKinsey expects to see even greater changes happening at higher frequencies, given the rate of both AI technical advancement and adoption.
最新的麦肯锡报告显示，在 2023 年，55% 的受访组织至少在一个业务部门或职能中实施了 AI，较 2022 年的 […] 略有增加，远高于 2017 年的 20%（图 4.4.1）。AI 的采用在过去五年中急剧增加；鉴于 AI 技术进步和采用的速度，麦肯锡预计未来将出现更大、更频繁的变化。

Share of respondents who say their organizations have adopted AI in at least one function, 2017-23
表示其组织已在至少一个职能中采用人工智能的受访者比例，2017-23

Source: McKinsey & Company Survey, 2023 | Chart: 2024 AI Index report
来源：麦肯锡公司调查，2023 年 | 图表：2024 年人工智能指数报告

Figure 4.4.1 图 4.4.1

Figure 4.4.2 shows the proportion of surveyed companies that use AI for specific functions. Companies may report employing AI in multiple capacities. The most commonly adopted AI use case by function among surveyed businesses in 2023 was contact-center automation (26%), followed by personalization (23%), customer acquisition (22%), and AI-based enhancements of products […].
图 4.4.2 显示了将 AI 用于特定职能的受访公司比例。公司可能报告在多个方面使用 AI。2023 年受访企业中按职能最常采用的 AI 用例是联络中心自动化（26%），其次是个性化（23%）、客户获取（22%）以及基于 AI 的产品增强 […]。

Most commonly adopted AI use cases by function, 2023
2023 年按职能最常采用的 AI 用例

Figure 4.4.2 图 4.4.2

7 Personalization is the practice of tailoring products, services, content, recommendations, and marketing to the individual preferences of customers or users. For example, personalization can include sending tailored email messages to clients or customers to improve engagement.
7 个性化是根据客户或用户的个人偏好定制产品、服务、内容、推荐和营销的实践。例如，个性化可以包括发送定制的电子邮件消息给客户或顾客以提高参与度。

With respect to the type of AI capabilities embedded in at least one function or business unit, as indicated by Figure 4.4.3, robotic process automation had the highest rate of embedding within the financial services industry (46%). The next highest rate of embedding was for virtual agents, also in the financial services industry. Across all industries, the most embedded AI technologies were NL text understanding ([…]), robotic process automation (30%), and virtual agents (30%).
就至少一个职能或业务部门中嵌入的 AI 能力类型而言，如图 4.4.3 所示，机器人流程自动化在金融服务行业中的嵌入率最高（46%）。其次是虚拟代理，同样是在金融服务行业中。在所有行业中，嵌入最多的 AI 技术是自然语言文本理解（[…]）、机器人流程自动化（30%）和虚拟代理（30%）。

AI capabilities embedded in at least one function or business unit, 2023
2023 年至少一个职能或业务部门中嵌入的 AI 能力

Source: McKinsey & Company Survey, 2023 | Chart: 2024 AI Index report
来源：麦肯锡公司调查，2023 | 图表：2024 年人工智能指数报告


Figure 4.4.4 shows AI adoption by industry and AI function in 2023. The greatest adoption was in product and/or service development for tech, media, and telecom (44%), followed by service operations for tech, media, and telecom (36%) and marketing and sales for tech, media, and telecom (36%).
图 4.4.4 展示了 2023 年按行业和 AI 职能划分的 AI 采用情况。采用率最高的是科技、媒体和电信行业的产品和/或服务开发（44%），其次是科技、媒体和电信行业的服务运营（36%）以及市场营销和销售（36%）。

AI adoption by industry and function, 2023
2023 年各行业和职能的 AI 采用情况

Source: McKinsey & Company Survey, 2023 | Chart: 2024 AI Index report
来源：麦肯锡公司调查，2023 年 | 图表：2024 年 AI 指数报告

Figure 4.4.4 图 4.4.4

Figure 4.4.5 illustrates the changes in AI adoption rates by industry and function from 2022 to 2023. The areas with the largest annual gains across all industries include marketing and sales (18 percentage points), product/service development (14), and service operations (4). Conversely, across all industries, the functions experiencing the most significant declines in adoption include strategy and corporate finance (-12 percentage points), risk (-9), and human resources (-2).
图 4.4.5 展示了从 2022 年到 2023 年各行业和职能中 AI 采用率的变化。在所有行业中，年度增长最大的领域包括营销和销售（18 个百分点）、产品/服务开发（14）和服务运营（4）。相反，在所有行业中，采用率下降最显著的职能包括战略和企业金融（-12 个百分点）、风险（-9）和人力资源（-2）。
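
Because this figure reports changes in percentage points rather than percent, it may help to see the two measures side by side. The Python sketch below uses the organization-level adoption shares quoted earlier in this section (20% in 2017 and 55% in 2023); the function names are hypothetical.

```python
def percentage_point_change(old_share, new_share):
    """Absolute difference between two shares expressed in percent."""
    return new_share - old_share


def relative_percent_change(old_share, new_share):
    """Change relative to the starting share, in percent."""
    return (new_share - old_share) / old_share * 100


# Adoption rose from 20% of organizations (2017) to 55% (2023).
pp = percentage_point_change(20, 55)
rel = relative_percent_change(20, 55)
```

The same movement is 35 percentage points but a 175% relative increase, which is why the unit matters when comparing functions that start from very different adoption levels.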

Percentage point change in responses of AI adoption by industry and function, 2022 vs. 2023
2022 年与 2023 年各行业和职能中 AI 采用率的响应百分点变化

Source: McKinsey & Company Survey, 2023 | Chart: 2024 AI Index report
来源：麦肯锡公司调查，2023 年 | 图表：2024 年 AI 指数报告

Figure 4.4.6 shows the percentage of surveyed respondents across industries who reported hiring for various AI positions. Across all industries, respondents reported hiring data engineers (36%), AI data scientists (31%), and machine-learning engineers (31%) to the greatest degree. Notably, a significant portion of respondents within the financial services […] and the tech, media, and telecom sectors (44%) reported a high rate of hiring machine-learning engineers.
图 4.4.6 显示了各行业受访者中报告招聘各类 AI 职位的百分比。在所有行业中，受访者报告招聘最多的是数据工程师（36%）、AI 数据科学家（31%）和机器学习工程师（31%）。值得注意的是，在金融服务 […] 以及科技、媒体和电信行业（44%）中，有相当一部分受访者报告大量招聘机器学习工程师。

AI-related roles that organizations hired in the last year by industry, 2023
2023 年各行业组织在过去一年中招聘的与人工智能相关的职位

Source: McKinsey & Company Survey, 2023 | Chart: 2024 AI Index report
来源：麦肯锡公司调查，2023 年 | 图表：2024 年人工智能指数报告

Figure 4.4.6 图 4.4.6

Organizations have experienced both cost reductions and revenue increases due to AI adoption (Figure 4.4.7). The areas where respondents most frequently reported cost savings were manufacturing (55%), service operations (54%), and risk (44%). For revenue gains, the functions benefiting the most from AI included manufacturing (66%), marketing and sales (65%), and risk (64%). Figure 4.4.7 shows a substantial number of respondents reporting cost decreases […] and revenue gains (59%) as a result of using AI, suggesting that AI tangibly helps businesses improve their bottom line. Comparing this and last year's averages reveals a 10 percentage point increase for cost decreases and a four percentage point decrease for revenue increases across all activities.
组织因采用 AI 而实现了成本降低和收入增加（图 4.4.7）。受访者最常报告成本节约的领域是制造业（55%）、服务运营（54%）和风险（44%）。在收入增长方面，从 AI 中受益最多的职能包括制造业（66%）、营销和销售（65%）以及风险（64%）。图 4.4.7 显示，大量受访者报告因使用 AI 而实现成本降低 […] 和收入增加（59%），这表明 AI 切实帮助企业改善了盈利状况。比较今年和去年的平均值可以发现，在所有活动中，报告成本降低的比例增加了 10 个百分点，报告收入增加的比例减少了 4 个百分点。

Cost decrease and revenue increase from AI adoption by function, 2022
2022 年各职能因采用 AI 而实现的成本降低和收入增加

Source: McKinsey & Company Survey, 2023 | Chart: 2024 AI Index report
来源：麦肯锡公司调查，2023 年 | 图表：2024 年 AI 指数报告

Figure 4.4.8 presents global AI adoption by organizations, segmented by world regions. In 2023, every surveyed region reported higher AI adoption rates than in 2022. The most significant year-over-year growth was seen in Europe, where organization adoption grew by 9 percentage points. North America remains the leader in AI adoption. Greater China also experienced a significant increase in AI adoption rates, growing by 7 percentage points over the previous year.
图 4.4.8 按世界地区展示了全球各组织对 AI 的采用情况。2023 年，每个受调查地区报告的 AI 采用率均高于 2022 年。欧洲的同比增长最为显著，组织采用率增长了 9 个百分点。北美仍然是 AI 采用的领先者。大中华地区的 AI 采用率也出现了显著增长，比上一年增长了 7 个百分点。

AI adoption by organizations in the world, 2022 vs. 2023
2022 年与 2023 年世界各组织对人工智能的采用情况

Source: McKinsey & Company Survey, 2023 | Chart: 2024 AI Index report
来源：麦肯锡公司调查，2023 年 | 图表：2024 年人工智能指数报告

Figure 4.4.8 图 4.4.8

Adoption of Generative AI Capabilities
采用生成式人工智能能力

How are organizations deploying generative AI?
组织如何部署生成式人工智能？

Figure 4.4.9 highlights the proportion of total surveyed respondents that report using generative AI for a particular function. It is possible for respondents to indicate that they deploy AI for multiple purposes. The most frequent application is generating initial drafts of text documents (9%), followed closely by personalized marketing ([…]), summarizing text documents (8%), and creating images and/or videos (8%). Most of the reported leading use cases are within the marketing and sales function.
图 4.4.9 突出显示了报告将生成式 AI 用于特定职能的受访者比例。受访者可以表示他们将 AI 用于多种目的。最常见的应用是生成文本文档的初稿（9%），紧随其后的是个性化营销（[…]）、文本文档摘要（8%）以及创建图像和/或视频（8%）。报告的主要用例大多属于营销和销售职能。

Most commonly adopted generative AI use cases by function, 2023
2023 年按职能最常采用的生成式 AI 用例

Source: McKinsey & Company Survey, 2023 | Chart: 2024 AI Index report
来源：麦肯锡公司调查，2023 年 | 图表：2024 年 AI 指数报告

Figure 4.4.9 图 4.4.9

8 The adoption of generative AI capabilities is presented separately from the charts on the adoption of general AI capabilities earlier in the chapter, as it was a separate question in the survey.
8 生成式人工智能能力的采用与本章前面关于通用人工智能能力采用的图表分开展示，因为在调查中这是一个单独的问题。

Figure 4.4.10 compares the proportion of respondents who report using AI versus specifically generative AI for a given function.

Figure 4.4.10 illustrates the degree to which generative AI has permeated general AI usage patterns among businesses. When analyzed at the functional level, the use of AI and generative AI within organizations shows similar patterns of distribution. Overall, general AI still dominates. The most common functional applications of generative AI are in marketing and sales (14%), product and/or service development (13%), and service operations (10%).

图 4.4.10 比较了针对特定职能报告使用 AI 与专门使用生成式 AI 的受访者比例。

图 4.4.10 说明了生成式 AI 渗透到企业通用 AI 使用模式中的程度。在职能层面进行分析时，组织内 AI 和生成式 AI 的使用呈现出类似的分布模式。总体而言，通用 AI 仍然占主导地位。生成式 AI 最常见的职能应用是营销和销售（14%）、产品和/或服务开发（13%）以及服务运营（10%）。

AI vs. generative AI adoption by function, 2023
2023 年按职能划分的 AI 与生成式 AI 采用情况

Source: McKinsey & Company Survey, 2023 | Chart: 2024 AI Index report
来源：麦肯锡公司调查，2023 年 | 图表：2024 年 AI 指数报告

Figure 4.4.10 图 4.4.10

9 While all generative AI use cases are considered general AI use cases, not all general AI use cases qualify as generative AI use cases.
9 虽然所有生成式 AI 用例都被视为通用 AI 用例，但并非所有通用 AI 用例都符合生成式 AI 用例的资格。

Figure 4.4.11 depicts the variation in generative AI usage among businesses across different regions of the world. Across all regions, the adoption rate of generative AI by organizations stands at […]. This amount is meaningfully lower than the percentage of businesses across all geographies (55%) that reported using AI, which was documented earlier in Figure 4.4.8. North America leads in adoption at 40%, followed closely by developing markets (including India, Latin America, and the MENA region).
图 4.4.11 显示了世界各地区企业在生成式 AI 使用方面的差异。在所有地区，组织对生成式 AI 的采用率为 […]。这一数字明显低于所有地区中报告使用 AI 的企业比例（55%），该比例已在前面的图 4.4.8 中记录。北美的采用率领先，达到 40%，紧随其后的是发展中市场（包括印度、拉丁美洲和中东北非地区）。

Generative AI adoption by organizations in the world, 2023
2023 年世界各组织的生成式 AI 采用情况

Use of AI by Developers
开发者使用 AI

Computer developers are among the most likely individuals to use AI in professional settings. As AI becomes more integrated into the economy, tracking how developers utilize and perceive AI is becoming increasingly important.
计算机开发人员是最有可能在专业环境中使用 AI 的人群之一。随着 AI 日益融入经济，跟踪开发人员如何使用和看待 AI 变得越来越重要。

Stack Overflow, a question-and-answer website for computer programmers, conducts an annual survey of computer developers. The 2023 survey, with responses from over 90,000 developers, included, for the first time, questions on AI tool usage, detailing how developers use these tools, which tools are favored, and their perceptions of the tools used.

Stack Overflow 是一个面向计算机程序员的问答网站，每年都会对计算机开发人员进行一次调查。2023 年的调查有超过 90,000 名开发人员参与，首次包含了关于 AI 工具使用的问题，详细说明开发人员如何使用这些工具、哪些工具受到青睐，以及他们对所使用工具的看法。

Preference 偏好

Figure 4.4.12 highlights the proportion of surveyed respondents who report using a specific AI developer tool. According to the survey, […] of respondents report using GitHub's Copilot, followed by Tabnine (11.7%) and AWS CodeWhisperer (4.9%).
图 4.4.12 显示了报告使用特定 AI 开发工具的受访者比例。根据调查，[…] 的受访者报告使用 GitHub 的 Copilot，其次是 Tabnine（11.7%）和 AWS CodeWhisperer（4.9%）。

Figure 4.4.13 highlights which AI search tools, software applications that use AI to enhance search functionality, are most favored by developers. The most popular AI search tools according to professional developers were ChatGPT (83.3%), followed by Bing AI (18.8%) and WolframAlpha (11.2%).
图 4.4.13 显示了哪些 AI 搜索工具（即使用 AI 来增强搜索功能的软件应用程序）最受开发者青睐。专业开发者中最受欢迎的 AI 搜索工具是 ChatGPT（83.3%），其次是必应 AI（18.8%）和 WolframAlpha（11.2%）。

Most popular AI developer tools among professional developers, 2023
2023 年专业开发者中最受欢迎的 AI 开发工具

Source: Stack Overflow Developer Survey, 2023 | Chart: 2024 AI Index report
来源：Stack Overflow 开发者调查，2023 年 | 图表：2024 年 AI 指数报告

Figure 4.4.12 图 4.4.12

Most popular AI search tools among professional developers, 2023
2023 年专业开发人员中最受欢迎的 AI 搜索工具

Source: Stack Overflow Developer Survey, 2023 | Chart: 2024 AI Index report
来源：Stack Overflow 开发者调查，2023 年 | 图表：2024 年 AI 指数报告

Figure 4.4.13 图 4.4.13

Cloud platforms are crucial elements of the AI ecosystem, providing cloud computing services that allow developers to perform computationally intensive AI work. Figure 4.4.14 reports the proportion of respondents that have reported extensively using a specific cloud platform. According to the Stack Overflow survey, Amazon Web Services (AWS) is the most commonly used cloud platform among professional developers, with 53.1% reporting regular use. Microsoft Azure follows at […], with Google Cloud at […].
云平台是 AI 生态系统的关键组成部分，提供云计算服务，使开发人员能够进行计算密集型的人工智能工作。图 4.4.14 报告了表示广泛使用特定云平台的受访者比例。根据 Stack Overflow 的调查，亚马逊云服务（AWS）是专业开发人员中最常用的云平台，有 53.1% 的受访者报告定期使用。其次是微软 Azure，占 […]，谷歌云占 […]。

Workflow

Figure 4.4.15 explores the current and future integration of AI in developers' workflows. A significant majority of respondents regularly use AI for code writing, followed by debugging and assistance, and then documentation. While only a small share currently use AI for code testing, respondents express interest in adopting AI for this purpose.

Top 10 most popular cloud platforms among professional developers, 2023
Source: Stack Overflow Developer Survey, 2023 | Chart: 2024 AI Index report

Figure 4.4.14

Adoption of AI tools in development tasks, 2023
Source: Stack Overflow Developer Survey, 2023 | Chart: 2024 AI Index report

When asked about the primary advantages of AI tools in professional development, developers responded with increased productivity (32.8%), accelerated learning (25.2%), and enhanced efficiency (25.0%) (Figure 4.4.16).

Figure 4.4.17 displays the sentiments professional developers have toward AI tools. A significant majority of developers hold a positive view of AI tools, feeling either very favorably or favorably inclined toward them. Only a small share express unfavorable opinions about AI development tools.

Figure 4.4.18 highlights the reported level of trust developers have in AI tools. More developers trust AI tools than distrust them: the proportion reporting high or moderate trust in these technologies is larger than the proportion expressing some level of distrust or high distrust.

Sentiment toward AI tools in development among professional developers, 2023
Source: Stack Overflow Developer Survey, 2023 | Chart: 2024 AI Index report

Figure 4.4.17

Primary benefits of AI tools for professional developers, 2023
Source: Stack Overflow Developer Survey, 2023 | Chart: 2024 AI Index report

Figure 4.4.16

Trust level in AI tool output accuracy, 2023
Source: Stack Overflow Developer Survey, 2023 | Chart: 2024 AI Index report

Figure 4.4.18

AI's Labor Impact

Over the last five years, the growing integration of AI into the economy has sparked hopes of boosted productivity. However, finding reliable data confirming AI's impact on productivity has been difficult because AI integration has historically been low. In 2023, numerous studies rigorously examined AI's productivity impacts, offering more conclusive evidence on the topic.

First, AI has been shown to enable workers to complete tasks more quickly and produce higher quality work. A meta-review by Microsoft, which aggregated studies comparing the performance of workers using Microsoft Copilot or GitHub's Copilot (LLM-based productivity-enhancing tools) with those who did not, found that Copilot users completed tasks in less time than their counterparts without AI access (Figure 4.4.19).11

Cross-study comparison of task completion speed of Copilot users

Figure 4.4.19

11 This meta-review analyzed separate surveys of workers using Microsoft's Copilot and GitHub's Copilot tools. These are separate tools. Microsoft Copilot is a broader LLM-based productivity improvement tool, while GitHub's Copilot is a code-writing assistant.

Similarly, a Harvard Business School study revealed that consultants with access to GPT-4 increased their productivity, speed, and quality on a selection of consulting tasks, compared to a control group without AI access (Figure 4.4.20). Likewise, National Bureau of Economic Research research reported that call center agents using AI handled more calls per hour than those not using AI (Figure 4.4.21).

Effect of GPT-4 use on a group of consultants
Source: Dell'Acqua et al., 2023 | Chart: 2024 AI Index report

Figure 4.4.20

Impact of AI on customer support agents
Source: Brynjolfsson et al., 2023 | Chart: 2024 AI Index report

Figure 4.4.21

A study on the impact of AI in legal analysis showed that teams with GPT-4 access significantly improved in efficiency and achieved notable quality improvements in various legal tasks, especially contract drafting.

Figure 4.4.22 illustrates the improvements observed in the group of law students who utilized GPT-4, compared to the control group, in terms of both work quality and time efficiency across a range of tasks. Although AI can assist with legal tasks, there are also widespread reports of LLM hallucinations being especially pervasive in legal tasks.

Effect of GPT-4 use on legal analysis by task
Source: Choi et al., 2023 | Chart: 2024 AI Index report

Figure 4.4.22

Second, AI access appears to narrow the performance gap between low- and high-skilled workers. According to the aforementioned Harvard Business School study, both groups of consultants experienced performance boosts after adopting AI, with notably larger gains for lower-skilled consultants using AI compared to higher-skilled consultants. Figure 4.4.23 highlights the performance improvement across a set of tasks for participants of varying skill levels: lower-skilled (bottom half) participants exhibited a larger improvement than higher-skilled (top half) participants. While higher-skilled workers using AI still performed better than their lower-skilled, AI-using counterparts, the disparity in performance between low- and high-skilled workers was markedly lower when AI was utilized compared to when it was not.

Comparison of AI work performance effect by worker skill category
Source: Dell'Acqua et al., 2023 | Chart: 2024 AI Index report

Figure 4.4.23

Finally, while AI tends to enhance quality and productivity, overreliance on the technology can impair worker performance. A study focused on professional recruiters reviewing résumés found that receiving any AI assistance improved task accuracy by 0.6 points compared to not receiving AI assistance. However, recruiters who were provided with "good AI" (believed to be high-performing) actually performed worse than those who received "bad AI," which was capable but known to make errors (Figure 4.4.24). The performance difference between the latter groups was -1.08 points. The study theorizes that recruiters using "good AI" became complacent, overly trusting the AI's results, unlike those using "bad AI," who were more vigilant in scrutinizing AI output.

Effects on job performance of receiving different types of AI advice

Figure 4.4.24

Earnings Calls

The following section presents data from Quid, which uses natural language processing tools to analyze trends in corporate earnings calls. Quid analyzed all 2023 earnings calls from Fortune 500 companies, identifying all mentions of "artificial intelligence," "AI," "machine learning," "ML," and "deep learning."
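The keyword matching described above can be illustrated with a short sketch. This is not Quid's actual pipeline, which the report does not describe; the sample transcripts, the helper names, and the regular expressions are assumptions made for illustration. Word boundaries are used so that short terms such as "AI" or "ML" do not match inside longer words.

```python
import re

# Multi-word terms from the report, matched case-insensitively.
PHRASES = re.compile(
    r"\b(artificial intelligence|machine learning|deep learning)\b", re.IGNORECASE
)
# Acronyms from the report, matched case-sensitively and as whole words only.
ACRONYMS = re.compile(r"\b(AI|ML)\b")


def mentions_ai(transcript: str) -> bool:
    """Return True if the transcript contains at least one AI-related term."""
    return bool(PHRASES.search(transcript) or ACRONYMS.search(transcript))


def count_calls_mentioning_ai(transcripts: list[str]) -> int:
    """Count how many transcripts mention AI at least once."""
    return sum(mentions_ai(t) for t in transcripts)


# Hypothetical transcripts, invented for this example.
calls = [
    "We are investing in machine learning to improve forecasting.",
    "Retail sales remained stable; we retired our old HTML site.",
    "Our AI assistant launched this quarter.",
]
print(count_calls_mentioning_ai(calls))
```

Counting calls rather than individual mentions matches the way the figures in this section are reported, as a number of earnings calls per year.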

Aggregate Trends

The past year has seen a significant rise in the mention of AI in Fortune 500 company earnings calls. In 2023, AI was mentioned in 394 earnings calls (nearly 80% of all Fortune 500 companies), up from 266 mentions in 2022 (Figure 4.4.25). Since 2018, mentions of AI in Fortune 500 earnings calls have nearly doubled.
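The year-over-year change implied by these counts can be checked with simple arithmetic. The sketch below uses only the two call counts quoted above; treating 500 as the denominator for the share assumes one earnings-call entry per Fortune 500 company, which is an interpretation rather than something the report states.

```python
def percent_change(old: float, new: float) -> float:
    """Relative change from old to new, expressed in percent."""
    return (new - old) / old * 100.0


calls_2022 = 266  # earnings calls mentioning AI in 2022 (from the report)
calls_2023 = 394  # earnings calls mentioning AI in 2023 (from the report)

growth = percent_change(calls_2022, calls_2023)
share_of_500 = calls_2023 / 500 * 100.0  # assumes one call per company

print(f"Growth 2022 to 2023: {growth:.1f}%")
print(f"Share of 500 companies: {share_of_500:.1f}%")
```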

Number of Fortune 500 earnings calls mentioning AI, 2018-23
Source: Quid, 2023 | Chart: 2024 AI Index report

Figure 4.4.25

Specific Themes

Mentions of AI in Fortune 500 earnings calls were associated with a wide range of themes in 2023. The most frequently cited theme across all earnings calls was generative AI (Figure 4.4.26). Mentions of generative AI grew relative to 2022. The next most mentioned theme was investments in AI, expansion of AI capabilities, and AI growth initiatives, followed by company/brand AIs (7.6%).

Themes of AI mentions in Fortune 500 earnings calls, 2018 vs. 2023
Source: Quid, 2023 | Chart: 2024 AI Index report

Highlight:

Projecting AI's Economic Impact

In 2023, some newly published analyses aimed to project and better understand the future economic impact of AI. A recent McKinsey report examined the degree to which generative AI might impact revenues across industries. Figure 4.4.27 features the projected impact range per industry, both as a percentage of total industry revenue and in total dollar amounts. The report projects that the high-tech industry could see its revenue increase as a result of generative AI. Banking, pharmaceuticals and medical products, and education are other industries estimated to grow due to the adoption of generative AI.

Anticipated impact of generative AI on revenue by industry, 2023
Source: McKinsey & Company, 2023 | Chart: 2024 AI Index report

Figure 4.4.27


The McKinsey survey cited above, the "State of AI in 2023," asked business professionals about their expectations of AI's impact on organizational workforces in the next three years. Although a large proportion (30%) expected little to no change in the number of employees, many felt that staff size would decrease (Figure 4.4.28). Only a small share felt that generative AI would lead to increases in the number of employees. There were also widespread predictions that AI would lead to significant employee reskilling.

Expectations about the impact of AI on organizations' workforces in the next 3 years, 2023
Source: McKinsey & Company Survey, 2023 | Chart: 2024 AI Index report

Figure 4.4.28


Perspectives differ on the anticipated effect of generative AI on employment per business function. Certain functions, like service operations (54%), supply chain management (45%), and HR (41%), are especially likely, according to respondents, to experience decreasing employment (Figure 4.4.29).

Anticipated effect of generative AI on number of employees in the next 3 years by business function, 2023
Source: McKinsey & Company Survey, 2023 | Chart: 2024 AI Index report

Figure 4.4.29


Finally, a Goldman Sachs investment report released in 2023 projects that, globally, AI could lead to productivity growth over 10-year periods, with the projected range shown in Figure 4.4.30. Although the report projects that many countries will benefit from AI-driven productivity growth, certain geographic areas, like Hong Kong, Israel, and Japan, are especially well-positioned.

Estimated impact of AI adoption on annual productivity growth over a ten-year period
Source: Goldman Sachs Global Investment Research, 2023 | Chart: 2024 AI Index report

Figure 4.4.30

4.5 Robot Installations

The deployment of robots equipped with AI-based software technologies offers a window into the real-world application of AI-ready infrastructure. This section draws on data from the International Federation of Robotics (IFR), a nonprofit organization dedicated to advancing the robotics industry. Annually, the IFR publishes the World Robotics Reports, which track global robot installation trends.

Aggregate Trends

The following section includes data on the installation and operation of industrial robots, which are defined as an "automatically controlled, reprogrammable, multipurpose manipulator, programmable in three or more axes, which can be either fixed in place or mobile for use in industrial automation applications."

Figure 4.5.1 reports the total number of industrial robots installed worldwide by year. In 2022, industrial robot installations increased slightly from 2021, reaching 553,000 units. This total represents more than a threefold rise in installations since 2012.
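To put "more than a threefold rise" over a decade into annual terms, the growth factor can be converted into a compound annual growth rate. The sketch below is a back-of-the-envelope calculation, not a figure from the IFR data: it uses a factor of exactly 3 as a lower bound, so the result is a minimum average rate, and the function name is illustrative.

```python
def compound_annual_growth_rate(growth_factor: float, years: int) -> float:
    """Constant yearly growth rate that compounds to growth_factor over years."""
    return growth_factor ** (1.0 / years) - 1.0


# "More than a threefold rise" from 2012 to 2022: a factor of 3 is a lower bound.
rate = compound_annual_growth_rate(3.0, 2022 - 2012)
print(f"Implied average annual growth: at least {rate:.1%}")
```

Because growth compounds, tripling in ten years requires roughly 11.6% growth per year on average, well below the 20% that a simple division of 200% by ten would suggest.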

12 Due to the timing of the IFR report, the most recent data is from 2022. Every year, the IFR revisits data collected for previous years and will occasionally update the data if more accurate figures become available. Therefore, some of the data reported in this year's report might differ slightly from data reported in previous years.

The global operational stock of industrial robots reached 3,904,000 in 2022, up from 3,479,000 in 2021 (Figure 4.5.2). Over the past decade, both the installation and utilization of industrial robots have steadily increased.

Operational stock of industrial robots in the world, 2012-22

Industrial Robots: Traditional vs. Collaborative Robots

There is a distinction between traditional robots, which operate for humans, and collaborative robots, designed to work alongside them. The robotics community is increasingly enthusiastic about collaborative robots due to their safety, flexibility, scalability, and ability to learn iteratively.

Figure 4.5.3 reports the number of industrial robots installed in the world by type. In 2017, collaborative robots accounted for only a small share of all new industrial robot installations. By 2022, that share had risen.

Number of industrial robots installed in the world by type, 2017-22
Source: International Federation of Robotics (IFR), 2023 | Chart: 2024 AI Index report

By Geographic Area

Country-level data on robot installations can suggest which nations prioritize robot integration into their economies. In 2022, China led the world with 290,300 industrial robot installations, 5.8 times more than Japan's 50,400 and 7.4 times more than the United States' 39,500 (Figure 4.5.4). South Korea and Germany followed with 31,170 and 25,600 installations, respectively.
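The multiples quoted above can be reproduced from the installation counts. The sketch below uses only the rounded figures given in the text, and the helper name is illustrative. With these rounded inputs the ratios come out at about 5.8 for Japan and about 7.3 for the United States; the latter is slightly below the quoted 7.4, which may reflect rounding in the published counts.

```python
# Installation counts for 2022 as quoted in the text (rounded figures).
installations = {
    "China": 290_300,
    "Japan": 50_400,
    "United States": 39_500,
    "South Korea": 31_170,
    "Germany": 25_600,
}


def ratio_to(leader: str, other: str, data: dict[str, int]) -> float:
    """How many times larger the leader's count is than the other's."""
    return data[leader] / data[other]


for country in ("Japan", "United States"):
    print(f"China / {country}: {ratio_to('China', country, installations):.2f}")
```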

Number of industrial robots installed by country, 2022
Source: International Federation of Robotics (IFR), 2023 | Chart: 2024 AI Index report

Figure 4.5.4

Since surpassing Japan in 2013 as the leading installer of industrial robots, China has significantly widened the gap with the nearest country, and its share of the global total rose between 2013 and 2022 (Figure 4.5.5).

Number of new industrial robots installed in top 5 countries, 2012-22
Source: International Federation of Robotics (IFR), 2023 | Chart: 2024 AI Index report

Since 2021, China has installed more industrial robots than the rest of the world combined, with the gap widening further in the last year (Figure 4.5.6). This increasing gap underscores China's growing dominance in industrial robot installations.

Number of industrial robots installed (China vs. rest of the world), 2016-22
Source: International Federation of Robotics (IFR), 2023 | Chart: 2024 AI Index report

Figure 4.5.6

According to the IFR report, most countries reported an annual increase in industrial robot installations from 2021 to 2022 (Figure 4.5.7). The countries with the highest growth rates include Singapore (68%), Turkey (22%), and Mexico (13%). Canada (-24%), Taiwan (-21%), Thailand (-18%), and Germany (-1%) reported installing fewer robots in 2022 than in 2021.

Annual growth rate of industrial robots installed by country, 2021 vs. 2022

Figure 4.5.7

Country-Level Data on Service Robotics

Another important class of robots is service robots, which the ISO defines as a robot "that performs useful tasks for humans or equipment excluding industrial automation applications."13 Such robots can, for example, be used in medicine and professional cleaning. In 2022, more service robots were installed for every application category than in 2021, with the exception of medical robotics (Figure 4.5.8). More specifically, the number of service robots installed in hospitality and in transportation and logistics increased 2.3 and 1.4 times, respectively.

Number of professional service robots installed in the world by application area, 2021 vs. 2022
Source: International Federation of Robotics (IFR), 2023 | Chart: 2024 AI Index report

Figure 4.5.8

13 A more detailed definition can be accessed here.

As of 2022, the United States leads in professional service robot manufacturing, with approximately 2.06 times more manufacturers than China, the next leading nation (Figure 4.5.9). Germany, Japan, and France also have significant numbers of robot manufacturers. In most surveyed countries, the majority of these manufacturers are established incumbents.

Number of professional service robot manufacturers in top countries by type of company, 2022
Source: International Federation of Robotics (IFR), 2023 | Chart: 2024 AI Index report

Figure 4.5.9

Sectors and Application Types

Figure 4.5.10 shows the number of industrial robots installed in the world by sector from 2020 to 2022. Globally, the electrical/electronics sector led in robot installations with 157,000 units, closely followed by the automotive sector with 136,000. Both sectors have seen continuous growth in industrial robot installations since 2020.

Number of industrial robots installed in the world by sector, 2020-22
Source: International Federation of Robotics (IFR), 2023 | Chart: 2024 AI Index report

Figure 4.5.10

Figure 4.5.11 shows the number of industrial robots installed in the world by application from 2020 to 2022. Data suggests that handling is the predominant application. In 2022, 266,000 industrial robots were installed for handling tasks, 3.1 times more than for welding and 4.4 times more than for assembly. Except for processing, every application category witnessed an increase in robot installations in 2022 compared to 2020.

Number of industrial robots installed in the world by application, 2020-22
Source: International Federation of Robotics (IFR), 2023 | Chart: 2024 AI Index report

Figure 4.5.11

China vs. United States

Figure 4.5.12 illustrates the number of industrial robots installed across various sectors in China over the past three years. In 2022, the leading sectors for industrial robot installations in China were electrical/electronics (100,000), automotive (73,000), and metal and machinery (31,000).

Figure 4.5.12

In 2022, the U.S. automotive industry led in industrial robot installations with 14,500 units, significantly exceeding its 2021 figure (Figure 4.5.13). Apart from the automotive and electronics sectors, every other sector saw fewer robot installations in 2022 than in 2021.

Number of industrial robots installed in the United States by sector, 2020-22
Source: International Federation of Robotics (IFR), 2023 | Chart: 2024 AI Index report

Figure 4.5.13

Overview
Chapter Highlights

5.1 Notable Scientific Milestones
AlphaDev
FlexiCubes
Synbot
GraphCast
GNoME
Flood Forecasting

5.2 AI in Medicine
Notable Medical Systems
Coupled Plasmonic Infrared Sensors
EVEscape
AlphaMissense
Human Pangenome Reference
Clinical Knowledge
MedQA
Highlighted Research: GPT-4 Medprompt
Highlighted Research: MediTron-70B
Diagnosis
Highlighted Research: CoDoC
Highlighted Research: CT Panda
Other Diagnostic Uses
FDA-Approved AI-Related Medical Devices
Administration and Care
Highlighted Research: MedAlign


This year's AI Index introduces a new chapter on AI in science and medicine in recognition of AI's growing role in scientific and medical discovery. It explores 2023's standout AI-facilitated scientific achievements, including advanced weather forecasting systems like GraphCast and improved material discovery algorithms like GNoME. The chapter also examines medical AI system performance, important 2023 AI-driven medical innovations like SynthSR and ImmunoSEIRA, and trends in FDA approval of AI-related medical devices.

Chapter Highlights

Scientific progress accelerates even further, thanks to AI. In 2022, AI began to advance scientific discovery. 2023, however, saw the launch of even more significant science-related AI applications, from AlphaDev, which makes algorithmic sorting more efficient, to GNoME, which facilitates the process of materials discovery.

AI helps medicine take significant strides forward. In 2023, several significant medical systems were launched, including EVEscape, which enhances pandemic prediction, and AlphaMissense, which assists in AI-driven mutation classification. AI is increasingly being utilized to propel medical advancements.

Highly knowledgeable medical AI has arrived. Over the past few years, AI systems have shown remarkable improvement on the MedQA benchmark, a key test for assessing AI's clinical knowledge. The standout model of 2023, GPT-4 Medprompt, marked a 22.6 percentage point increase in accuracy over the highest score in 2022. Since the benchmark's introduction in 2019, AI performance on MedQA has nearly tripled.

The FDA approves more and more AI-related medical devices. In 2022, the FDA approved 139 AI-related medical devices, a 12.1% increase from 2021. Since 2012, the number of FDA-approved AI-related medical devices has increased by more than 45-fold. AI is increasingly being used for real-world medical purposes.

5.1 Notable Scientific Milestones

AlphaDev

AlphaDev discovers faster sorting algorithms

AlphaDev is a new AI reinforcement learning system that has improved on decades of work by scientists and engineers in the field of computational algorithmic enhancement. AlphaDev developed algorithms with fewer instructions than existing human benchmarks for fundamental sorting algorithms on short sequences such as Sort 3, Sort 4, and Sort 5 (Figure 5.1.1). Some of the new algorithms discovered by AlphaDev have been incorporated into the LLVM standard C++ sort library. This marks the first update to this part of the library in over 10 years and is the first addition designed using reinforcement learning.
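To make the idea of a fixed sorting routine for a short sequence concrete, the sketch below shows a classic three-element sorting network in Python. It is not the algorithm AlphaDev discovered, which works at the level of assembly instructions; it only illustrates the kind of short routine (here, three compare-exchange steps in a fixed order) whose instruction count such a system tries to reduce. The function names are illustrative.

```python
from itertools import permutations


def compare_exchange(values: list, i: int, j: int) -> None:
    """Place the smaller of values[i], values[j] at index i and the larger at j."""
    if values[j] < values[i]:
        values[i], values[j] = values[j], values[i]


def sort3(a, b, c) -> tuple:
    """Sort three items with a fixed sequence of three compare-exchange steps."""
    values = [a, b, c]
    compare_exchange(values, 0, 1)
    compare_exchange(values, 1, 2)
    compare_exchange(values, 0, 1)
    return tuple(values)


# Check the network against every ordering of three distinct items.
assert all(sort3(*p) == (1, 2, 3) for p in permutations((1, 2, 3)))
print(sort3(3, 1, 2))
```

Because the sequence of comparisons never depends on the data, a routine like this can be verified exhaustively, which is one reason short fixed-size sorts are a convenient target for automated search.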

AlphaDev vs. human benchmarks when optimizing for algorithm length

FlexiCubes

3D mesh optimization with FlexiCubes

3D mesh generation, crucial in computer graphics, involves creating a mesh of vertices, edges, and faces to define 3D objects. It is key to video games, animation, medical imaging, and scientific visualization. Traditional isosurface extraction algorithms often struggle with limited resolution, structural rigidity, and numerical instabilities, which in turn degrade mesh quality. FlexiCubes addresses some of these limitations by employing AI for gradient-based optimization and adaptable parameters (Figure 5.1.2). This method allows for precise, localized mesh adjustments. Compared to other leading methods that utilize differentiable isosurfacing for mesh reconstruction, FlexiCubes achieves mesh extractions that align much more closely with the underlying ground truth (Figure 5.1.3).

Sample FlexiCubes surface reconstructions
Source: Nvidia, 2023

Select quantitative results on 3D mesh reconstruction
Source: Shen et al., 2023 | Chart: 2024 AI Index report

Figure 5.1.3

Synbot

AI-driven robotic chemist for synthesizing organic molecules

Synbot employs a multilayered system, comprising an AI software layer for chemical synthesis planning, a robot software layer for translating commands, and a physical robot layer for conducting experiments. The closed-loop feedback mechanism between the AI and the robotic system enables Synbot to develop synthetic recipes with yields equal to or exceeding established references (Figure 5.1.4). In an experiment aimed at synthesizing M1 [4-(2,3-dimethoxyphenyl)-1H-pyrrolo[2,3-b]pyridine], Synbot developed multiple synthetic formulas that achieved conversion yields surpassing the reference range and completed the synthesis in significantly less time (Figure 5.1.5). Synbot's automation of organic synthesis highlights AI's potential in fields such as pharmaceuticals and materials science.

Synbot design
Source: Ha et al., 2023

Figure 5.1.4

Reaction kinetics of M1 autonomous optimization experiment, Synbot vs. reference

GraphCast

More accurate global weather forecasting with GraphCast

GraphCast is a new weather forecasting system that delivers highly accurate 10-day weather predictions in under a minute (Figure 5.1.6). Utilizing graph neural networks and machine learning, GraphCast processes vast datasets to forecast temperature, wind speed, atmospheric conditions, and more. Figure 5.1.7 compares the performance of GraphCast with the current industry state-of-the-art weather simulation system: the High Resolution Forecast (HRES). GraphCast posts a lower root mean squared error, meaning its forecasts more closely correspond to observed weather patterns. GraphCast can be a valuable tool in deciphering weather patterns, enhancing preparedness for extreme weather events, and contributing to global climate research.
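The root mean squared error used to compare forecasting systems can be illustrated with a minimal sketch. The temperature values and the two forecast lists below are invented for illustration only and are not taken from GraphCast, HRES, or the report; the function name `rmse` is likewise an assumption.

```python
import math


def rmse(forecast, observed):
    """Root mean squared error between two equal-length sequences."""
    if len(forecast) != len(observed) or not forecast:
        raise ValueError("sequences must be non-empty and of equal length")
    squared_errors = [(f - o) ** 2 for f, o in zip(forecast, observed)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical temperatures in degrees Celsius (illustrative values only).
observed = [10.0, 12.0, 14.0, 16.0]
forecast_a = [11.0, 12.0, 13.0, 16.0]  # small errors
forecast_b = [12.0, 10.0, 16.0, 14.0]  # every value is off by 2 degrees

print(rmse(forecast_a, observed))
print(rmse(forecast_b, observed))
```

A lower value means the forecast sits closer to the observations on average, with large misses penalized more heavily than small ones because errors are squared before averaging.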

GraphCast weather prediction

Source: DeepMind, 2023

Figure 5.1.6

Ten-day forecast skill: GraphCast vs. HRES

Source: Lam et al., 2023 | Chart: 2024 AI Index report

GNoME

Discovering new materials with GNoME

The search for new functional materials is key to advancements in various scientific fields, including robotics and semiconductor manufacturing. Yet this discovery process is typically expensive and slow. Recent advancements by Google researchers have demonstrated that graph networks, a type of AI model, can expedite this process when trained on large datasets. Their model, GNoME, outperformed the Materials Project, a leading method in materials discovery, by identifying a significantly larger number of stable crystals (Figure 5.1.8). GNoME has unveiled 2.2 million new crystal structures, many overlooked by human researchers (Figure 5.1.9 and Figure 5.1.10). The success of AI-driven projects like GNoME highlights the power of data and scaling in speeding up scientific breakthroughs.

GNoME vs. Materials Project: stable crystal count

Source: Merchant et al., 2023 | Chart: 2024 AI Index report

Sample material structures

Source: Merchant et al., 2023

Figure 5.1.8

GNoME vs. Materials Project: distinct prototypes

Source: Merchant et al., 2023 | Chart: 2024 AI Index report

Flood Forecasting

AI for more accurate and reliable flood forecasts

New research introduced in 2023 has made significant progress in predicting large-scale flood events. Floods, among the most common natural disasters, have particularly devastating effects in less developed countries where infrastructure for prevention and mitigation is lacking. Consequently, developing more accurate prediction methods that can forecast these events further in advance could yield substantial positive impacts.

A team of Google researchers has used AI to develop highly accurate hydrological simulation models that are also applicable to ungauged basins. These innovative methods can predict certain extreme flood events up to five days in advance, with accuracy that matches or surpasses current state-of-the-art models, such as GloFAS. The AI model demonstrates superior precision (accuracy of positive predictions) and recall (ability to correctly identify all relevant instances) across a range of return period events, outperforming the leading contemporary method (Figure 5.1.11). The model is open-source and is already being used to predict flood events in over 80 countries.
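Precision and recall, as defined in the paragraph above, can be computed from paired predictions and outcomes. The sketch below uses invented flood/no-flood labels purely for illustration; it is not the evaluation code or data from the flood forecasting study, and the function name `precision_recall` is an assumption.

```python
def precision_recall(predicted, actual):
    """Return (precision, recall) for two equal-length lists of booleans."""
    pairs = list(zip(predicted, actual))
    true_pos = sum(1 for p, a in pairs if p and a)
    false_pos = sum(1 for p, a in pairs if p and not a)
    false_neg = sum(1 for p, a in pairs if not p and a)
    # Precision: share of predicted events that actually occurred.
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    # Recall: share of actual events that were predicted.
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    return precision, recall


# Hypothetical labels: True means "flood event" (illustrative values only).
predicted = [True, True, False, True, False, False]
actual = [True, False, False, True, True, False]

precision, recall = precision_recall(predicted, actual)
print(precision, recall)
```

A model can trade one metric for the other, for example by predicting floods more liberally, which is why a comparison reports both.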

Predictions of AI model vs. GloFAS across return periods

Source: Nearing et al., 2023 | Chart: 2024 AI Index report

Figure 5.1.11

2 A return period (recurrence interval) measures the likelihood of a particular hydrological event recurring within a specific period. For example, a 100-year flood means there is a 1% chance of the event being equaled or exceeded in any given year.
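The relationship between a return period and yearly probability can be worked through in a few lines. This is a minimal sketch that assumes each year is independent of the others, which is a simplification; the function names and the 30-year horizon are illustrative choices, not values from the report.

```python
def annual_exceedance_probability(return_period_years):
    """Chance that an event of this return period is equaled or exceeded in one year."""
    return 1.0 / return_period_years


def probability_within(return_period_years, horizon_years):
    """Chance of at least one such event over a horizon, assuming independent years."""
    p = annual_exceedance_probability(return_period_years)
    return 1.0 - (1.0 - p) ** horizon_years


# A 100-year flood over a hypothetical 30-year horizon.
print(annual_exceedance_probability(100))
print(probability_within(100, 30))
```

Under this assumption, a "100-year" event is considerably more likely than its name suggests once the horizon spans several decades.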

5.2 AI in Medicine

AI models are becoming increasingly valuable in healthcare, with applications ranging from detecting polyps to aiding clinicians in making diagnoses. As AI performance continues to improve, monitoring its impact on medical practice becomes increasingly important. This section highlights significant AI-related medical systems introduced in 2023, the current state of clinical AI knowledge, and the development of new AI diagnostic tools and models aimed at enhancing hospital administration.

Notable Medical Systems

This section identifies significant AI-related medical breakthroughs of 2023 as chosen by the AI Index Steering Committee.

SynthSR

Transforming brain scans for advanced analysis

SynthSR is an AI tool that converts clinical brain scans into high-resolution T1-weighted images (Figure 5.2.1). This advancement addresses the issue of scan quality variability, which previously limited the use of many scans in advanced research. By transforming these scans into T1-weighted images, known for their high contrast and clear brain structure depiction, SynthSR facilitates the creation of detailed 3D brain renderings. Experiments using SynthSR demonstrate robust correlations between observed volumes at both scan and subject levels, suggesting that SynthSR generates images closely resembling those produced by high-resolution scans. Figure 5.2.2 illustrates the extent to which SynthSR scans correspond with ground-truth observations across selected brain regions. SynthSR significantly improves the visualization and analysis of brain structures, facilitating neuroscientific research and clinical diagnostics.

SynthSR generations

Source: Iglesias et al., 2023

Figure 5.2.1

SynthSR correlation with ground-truth volumes on select brain regions

Source: Iglesias et al., 2023 | Chart: 2024 AI Index report

Figure 5.2.2

Coupled Plasmonic Infrared Sensors

Coupled plasmonic infrared sensors for the detection of neurodegenerative diseases

Diagnosis of neurodegenerative diseases such as Parkinson's and Alzheimer's depends on fast and precise identification of biomarkers. Traditional methods, such as mass spectrometry and ELISA, are useful for quantifying protein levels; however, they cannot discern changes in structural states. This year, researchers introduced a new method for neurodegenerative disease diagnosis that combines AI-coupled plasmonic infrared sensors, which use Surface-Enhanced Infrared Absorption (SEIRA) spectroscopy, with an immunoassay technique (ImmunoSEIRA; Figure 5.2.3). In tests that compared actual fibril percentages with predictions made by AI systems, the predictions were found to match the actual reported percentages very closely (Figure 5.2.4).

ImmunoSEIRA detection principle and the setup

Source: Kavungal et al., 2023

Figure 5.2.3

Deep neural network predicted vs. actual fibrils percentages in test samples

Source: Kavungal et al., 2023 | Chart: 2024 AI Index report

EVEscape

Forecasting viral evolution for pandemic preparedness

Predicting viral mutations is vital for vaccine design and pandemic minimization. Traditional methods, which rely on real-time virus strain and antibody data, face challenges during early pandemic stages due to data scarcity. EVEscape is a new AI deep learning model trained on historical sequences and biophysical and structural information that predicts the evolution of viruses (Figure 5.2.5). EVEscape evaluates viral escape independently of current strain data, and it predicted a larger share of observed SARS-CoV-2 mutations than both traditional lab studies and a previous model (Figure 5.2.6). This performance highlights EVEscape's potential as a valuable asset for enhancing future pandemic preparedness and response efforts.

EVEscape design

Source: Thadani et al., 2023

EVEscape vs. other models on SARS-CoV-2 RBD mutation prediction

Source: Thadani et al., 2023 | Chart: 2024 AI Index report

AlphaMissense

Better classification of AI mutations

Scientists still do not fully understand which genetic mutations lead to diseases. With millions of possible genetic mutations, determining whether a mutation is benign or pathogenic requires labor-intensive experiments.

In 2023, researchers from Google DeepMind unveiled AlphaMissense, a new AI model that predicted the pathogenicity of 71 million missense variants. Missense mutations are genetic alterations that impact the functionality of human proteins (Figure 5.2.7) and can lead to various diseases, including cancer. Of the 71 million possible missense variants, AlphaMissense classified some as likely benign and others as likely pathogenic, while the remainder were categorized as uncertain (Figure 5.2.8). In contrast, human annotators have been able to confirm the nature of only a fraction of all missense mutations.

Hemoglobin subunit beta (HBB)

Source: Google DeepMind, 2023

Figure 5.2.7

AlphaMissense predictions

Source: Google DeepMind, 2023 | Chart: 2024 AI Index report

Figure 5.2.8

Human Pangenome Reference

Using AI to map the human genome

The human genome is a set of molecular instructions for a human. The first human genome draft was released in 2000 and updated in 2022. However, the update was somewhat incomplete. It did not incorporate various genetic mutations, like blood type, and it mapped diverse ancestry groups less completely. Therefore, under the existing genome reference, it would be difficult to detect diseases or find cures in certain groups of people.

In 2023, the Human Pangenome Research Consortium, comprising 119 scientists from 60 institutions, used AI to develop an updated and more representative human genome map (Figure 5.2.9). The researchers achieved remarkable accuracy in annotating protein-coding genes, protein-coding transcripts, noncoding genes, and noncoding transcripts, with the median annotation rates detailed in Figure 5.2.10. This latest version of the genome represents the most comprehensive and genetically diverse mapping of the human genome to date.

Graph genome for the MHC region of the genome

Source: Google Research, 2023

Figure 5.2.9

Ensembl mapping pipeline results

Source: Liao et al., 2023 | Chart: 2024 AI Index report

Clinical Knowledge

Evaluating the clinical knowledge of AI models involves determining the extent of their medical expertise, particularly knowledge applicable in a clinical setting.

MedQA

Introduced in 2020, MedQA is a comprehensive dataset derived from professional medical board exams, featuring over 60,000 clinical questions designed to challenge doctors. AI performance on the MedQA benchmark has seen remarkable improvement, with the leading system, GPT-4 Medprompt, achieving an accuracy rate 22.6 percentage points above the top score in 2022 (Figure 5.2.11). Since MedQA's inception, AI capabilities on this benchmark have nearly tripled, showcasing the rapid improvements of clinically knowledgeable AI systems.

MedQA: accuracy

Figure 5.2.11
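Benchmark gains such as those on MedQA are reported both in percentage points (an absolute difference between two accuracy rates) and as multiples (a relative change). The sketch below shows the distinction with invented accuracy values; they are placeholders and not scores from the MedQA leaderboard, and the function names are assumptions.

```python
def percentage_point_change(old_rate, new_rate):
    """Absolute difference between two rates expressed in percent."""
    return new_rate - old_rate


def relative_change_percent(old_rate, new_rate):
    """Change relative to the starting rate, expressed in percent."""
    return (new_rate - old_rate) / old_rate * 100.0


# Hypothetical accuracy rates in percent (illustrative values only).
old_accuracy = 40.0
new_accuracy = 60.0

print(percentage_point_change(old_accuracy, new_accuracy))
print(relative_change_percent(old_accuracy, new_accuracy))
```

The same improvement reads as 20 percentage points or as a 50% relative gain, so it matters which of the two a report is quoting.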

Highlighted Research: GPT-4 Medprompt

Although LLMs exhibit impressive general knowledge, it is commonly assumed that significant fine-tuning is required for them to excel at specialized knowledge, such as answering medical questions. Fine-tuning entails training an LLM on domain-specific data.

Research from Microsoft in late 2023 has overturned this assumption. This study employed prompt engineering to direct GPT-4 toward achieving remarkable performance on the MultiMedQA benchmark suite, a group of four challenging medical benchmarks (Figure 5.2.12). GPT-4 Medprompt exceeded the performance of the top 2022 model, Flan-PaLM 540B, in the multiple-choice sections of several renowned medical benchmarks, including PubMedQA, MedMCQA, and MMLU, with the margin on MMLU reaching 16.2 percentage points. It also exceeded the performance of the then state-of-the-art Med-PaLM 2 (Figure 5.2.13).

Moreover, as noted earlier, GPT-4 Medprompt is the leading system on the MedQA benchmark. This breakthrough not only underscores GPT-4 Medprompt's exceptional and potentially clinically useful medical capabilities but also demonstrates that fine-tuning may not always be necessary for adapting models to specialized domains. Prompt engineering has been shown to be a promising alternative strategy.

GPT-4 vs. Med-PaLM 2 answering a medical question

Source: Nori et al., 2023

[Figure: a sample multiple-choice medical question, with a hand-crafted chain of thought (CoT) from Med-PaLM 2 and a GPT-4-generated CoT]

Figure 5.2.12

Model performance on MultiMedQA sub-benchmarks

Source: Nori et al., 2023 | Chart: 2024 AI Index report

Figure 5.2.13

Highlighted Research: MediTron-70B

GPT-4 Medprompt is an impressive system; however, it is closed-source, meaning its weights are not freely available to the broader public for use. New research in 2023 has also sought to advance the capabilities of open-source medical LLMs. Among this new research, MediTron-70B stands out as particularly promising. This model achieves respectable accuracy on the MedQA benchmark. Although this is below the performance of GPT-4 Medprompt and Med-PaLM 2 (both closed models), it represents a significant improvement over the state-of-the-art results from 2023 and surpasses other open-source models like Llama 2 (Figure 5.2.14). MediTron-70B's score on MedQA is the highest yet achieved by an open-source model. If medical AI is to reach its fullest potential, it is important that its capabilities are widely accessible. In this context, MediTron represents an encouraging step forward.

Performance of select models on MedQA

Source: Chen et al., 2023 | Table: 2024 AI Index report

Model	Release date	Access type	Score on MedQA
GPT-4 Medprompt	November 2023	Closed
Med-PaLM 2	April 2023	Closed
MediTron-70B	November 2023	Open
Med-PaLM	December 2022	Closed
Llama 2	July 2023	Open

Diagnosis

AI tools can also be used for diagnostic purposes, for example in radiology or cancer detection.

Highlighted Research: CoDoC

AI medical imaging systems demonstrate robust diagnostic capabilities, yet there are instances where they overlook diagnoses that clinicians catch, and vice versa. This observation suggests that combining AI systems with clinicians' diagnostic abilities is a logical next step. In 2023, researchers unveiled CoDoC (Complementarity-Driven Deferral to Clinical Workflow), a system designed to discern when to rely on AI for diagnosis and when to defer to traditional clinical methods. CoDoC notably enhances both sensitivity (the ability to correctly identify individuals with a disease) and specificity (the ability to accurately identify those without it). Specifically, across four medical datasets, CoDoC's sensitivity surpasses clinicians' by an average of 4.5 percentage points and a standalone AI model's by 6.5 percentage points (Figure 5.2.15). In terms of specificity, CoDoC outperforms clinicians by an average of 2.7 percentage points across tested datasets and a standalone predictive model by 5.7 percentage points. Moreover, CoDoC has been shown to reduce clinical workflow. These findings suggest that AI medical systems can be integrated into clinical workflows, thereby enhancing diagnostic accuracy and efficiency.

CoDoC vs. standalone predictive AI system and clinical readers: sensitivity

Source: Dvijotham et al., 2023 | Chart: 2024 AI Index report
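Sensitivity and specificity, and the general idea of deferring to a clinician when a model is unsure, can be sketched as follows. This is a toy illustration under stated assumptions: the deferral rule (fixed `low` and `high` thresholds), the function names, and all scores and labels are hypothetical and are not CoDoC's actual method or data.

```python
def sensitivity_specificity(predicted, actual):
    """Return (sensitivity, specificity) for two equal-length lists of booleans."""
    pairs = list(zip(predicted, actual))
    true_pos = sum(1 for p, a in pairs if p and a)
    false_neg = sum(1 for p, a in pairs if not p and a)
    true_neg = sum(1 for p, a in pairs if not p and not a)
    false_pos = sum(1 for p, a in pairs if p and not a)
    return true_pos / (true_pos + false_neg), true_neg / (true_neg + false_pos)


def deferred_decision(ai_score, clinician_opinion, low=0.3, high=0.7):
    """Hypothetical rule: trust a confident AI score, otherwise defer to the clinician."""
    if ai_score >= high:
        return True
    if ai_score <= low:
        return False
    return clinician_opinion


# Invented cases: True means disease present (illustrative values only).
actual = [True, True, True, False, False, False]
ai_scores = [0.9, 0.5, 0.8, 0.1, 0.6, 0.2]
clinician = [False, True, True, True, False, False]

ai_alone = [score >= 0.5 for score in ai_scores]
combined = [deferred_decision(s, c) for s, c in zip(ai_scores, clinician)]

print(sensitivity_specificity(ai_alone, actual))
print(sensitivity_specificity(clinician, actual))
print(sensitivity_specificity(combined, actual))
```

In this contrived data the AI and the clinician make different mistakes, so routing each case to whichever is more reliable removes errors that either would make alone; real systems have to learn when that is true rather than assume it.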

Highlighted Research: CT Panda

Pancreatic ductal adenocarcinoma (PDAC) is a particularly lethal cancer, often detected too late for surgical intervention. Screening for PDAC in asymptomatic individuals is challenging due to its low prevalence and the risk of false positives. This year, a Chinese research team developed PANDA (pancreatic cancer detection with artificial intelligence), an AI model capable of efficiently detecting and classifying pancreatic lesions in CT scans (Figure 5.2.16). In validation tests, PANDA surpassed the average radiologist in both sensitivity and specificity (Figure 5.2.17). PANDA was also evaluated in a large-scale, real-world test involving approximately 20,000 patients; its sensitivity and specificity in that setting are shown in Figure 5.2.18. AI medical tools like PANDA represent significant advancements in diagnosing challenging conditions, offering cost-effective and accurate detection previously considered difficult or prohibitive.

PANDA vs. mean radiologist on multicenter validation (6,239 patients)

Source: Cao et al., 2023 | Chart: 2024 AI Index report

Figure 5.2.17

PANDA performance on real-world multi-scenario validation (20,530 patients)

Figure 5.2.18

Other Diagnostic Uses

New research published in 2023 highlights how AI can be used in other diagnostic contexts. Figure 5.2.19 summarizes some of the findings.

Additional research on diagnostic AI use cases

Source: AI Index, 2024

Research	Use case	Findings
Schopf et al., 2023	Breast cancer	The authors conducted a meta-review of the literature exploring mammography-image-based AI algorithms. They discovered that predicting future breast cancer risk using only mammography images achieves accuracy that is comparable to or better than traditional risk assessment tools.
Dicente Cid et al., 2023	X-ray interpretation	The researchers developed two open-source neural networks, X-Raydar and X-Raydar-NLP, for classifying chest X-rays using images and free-text reports. They found that these automated classification methods perform at levels comparable to human experts and demonstrate robustness when applied to external data sets.

The U.S. Food and Drug Administration (FDA) maintains a list of AI-enabled medical devices that have received approval. The devices featured on this list meet the FDA's premarket standards, which include a detailed review of their effectiveness and safety. As of October 2023, the FDA has not approved any devices that utilize generative AI or are powered by LLMs.

Figure 5.2.20 illustrates the number of AI medical devices approved by the FDA over the past decade. In 2022, a total of 139 AI-related medical devices received FDA approval, an increase from the total approved in 2021. Since 2012, the number of these devices has increased by more than 45-fold.

Number of AI medical devices approved by the FDA, 2012-22

Source: FDA, 2023 | Chart: 2024 AI Index report

3 The FDA last updated the list in October 2023, meaning that the totals for 2023 were incomplete. Consequently, the AI Index limited its data presentation to include only information up to 2022.
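A multi-year fold increase, like the growth in FDA approvals between 2012 and 2022, can be converted into an implied average annual growth rate. The sketch below assumes smooth compounding and uses exactly 45-fold over 10 years as an input; the report says only "more than 45-fold", so the result is a rough lower bound rather than a reported figure, and the function name is an assumption.

```python
def implied_annual_growth(fold_increase, years):
    """Constant yearly growth rate that compounds to the given fold increase."""
    return fold_increase ** (1.0 / years) - 1.0


# Assumption: exactly 45-fold over the 10 years from 2012 to 2022.
rate = implied_annual_growth(45.0, 10)
print(f"{rate:.1%} per year")
```

Actual yearly approvals need not follow a smooth curve; this figure only summarizes the endpoints.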

Figure 5.2.21 illustrates the specialties associated with FDA-approved medical devices. Of the 139 devices approved in 2022, a significant majority were related to radiology. The next most common specialty was cardiovascular.

Number of AI medical devices approved by the FDA by specialty, 2012-22

Source: FDA, 2023 | Chart: 2024 AI Index report

Figure 5.2.21

Administration and Care

AI tools also hold the potential to enhance medical administration efficiency and elevate the standard of patient care.

Highlighted Research: MedAlign

Despite significant advances in AI for healthcare, existing benchmarks like MedQA and USMLE, focused on knowledge-based questions, do not fully capture the diverse tasks clinicians perform in patient care. Clinicians often engage in information-intensive tasks, such as creating tailored diagnostic plans, and spend a significant proportion of their working hours on administrative tasks. Although AI has the potential to streamline these processes, there is a lack of suitable electronic health records (EHR) datasets for benchmarking and fine-tuning medically administrative LLMs. This year researchers have made strides to address this gap by introducing MedAlign: a comprehensive EHR-based benchmark with 983 questions and instructions and 303 clinician responses, drawn from seven different medical specialties (Figure 5.2.22). MedAlign is the first extensive EHR-focused benchmark.

The researchers then tested various existing LLMs on MedAlign. Of all LLMs, a GPT-4 variant using multistep refinement achieved the highest correctness rate (65.0%) and was routinely preferred over other LLMs (Figure 5.2.23). MedAlign is a valuable milestone toward using AI to alleviate administrative burdens in healthcare.

MedAlign workflow

Source: Fleming et al., 2023

Figure 5.2.22

Evaluation of model performance: human vs. COMET ranks

Source: Fleming et al., 2023 | Chart: 2024 AI Index report

Figure 5.2.23

Artificial Intelligence Index Report 2024

CHAPTER 6: Education

Preview

Overview
Chapter Highlights
6.1 Postsecondary CS and AI Education
United States and Canada
CS Bachelor's Graduates
CS Master's Graduates
CS PhD Graduates
CS, CE, and Information Faculty
Europe
Informatics, CS, CE, and IT Bachelor's Graduates
Informatics, CS, CE, and IT Master's Graduates
Informatics, CS, CE, and IT PhD Graduates
AI-Related Study Programs
Total Courses
Education Level
Geographic Distribution
6.2 K-12 CS and AI Education
United States
State-Level Trends
AP Computer Science
Highlight: Access Issues
Highlight: ChatGPT Usage Among Teachers and Students

This chapter examines trends in AI and computer science (CS) education, focusing on who is learning, where they are learning, and how these trends have evolved over time. Amid growing concerns about AI's impact on education, it also investigates the use of new AI tools like ChatGPT by teachers and students.

The analysis begins with an overview of the state of postsecondary CS and AI education in the United States and Canada, based on the Computing Research Association's annual Taulbee Survey. It then reviews data from Informatics Europe regarding CS education in Europe. This year introduces a new section with data from Studyportals on the global count of AI-related English-language study programs.

The chapter wraps up with insights into K-12 CS education in the United States from Code.org and findings from the Walton Foundation survey on ChatGPT's use in schools.

Chapter Highlights 章节亮点

The number of American and Canadian CS bachelor's graduates continues to rise, new CS master's graduates stay relatively flat, and PhD graduates modestly grow. While the number of new American and Canadian bachelor's graduates has consistently risen for more than a decade, the number of students opting for graduate education in CS has flattened. Since 2018, the number of CS master's and PhD graduates has slightly declined.

The migration of AI PhDs to industry continues at an accelerating pace. In 2011, roughly equal percentages of new AI PhDs took jobs in industry (40.9%) and academia (41.6%). However, by 2022, a significantly larger proportion (70.7%) joined industry after graduation compared to those entering academia (20.0%). Over the past year alone, the share of industry-bound AI PhDs has risen by 5.3 percentage points, indicating an intensifying brain drain from universities into industry.

Less transition of academic talent from industry to academia. The share of new AI faculty in the United States and Canada who came from industry has fallen: by 2021 the figure had declined to 11%, and in 2022 it dropped further. This trend indicates a progressively lower migration of high-level AI talent from industry into academia.

CS education in the United States and Canada becomes less international. Proportionally fewer international CS bachelor's, master's, and PhD students graduated in 2022 than in 2021. The drop in international students in the master's category was especially pronounced.

More American high school students take CS courses, but access problems remain. In 2022, 201,000 AP CS exams were administered. Since 2007, the number of students taking these exams has increased more than tenfold. However, recent evidence indicates that students in larger high schools and those in suburban areas are more likely to have access to CS courses.

AI-related degree programs are on the rise internationally. The number of English-language, AI-related postsecondary degree programs has tripled since 2017, showing a steady annual increase over the past five years. Universities worldwide are offering more AI-focused degree programs.

The United Kingdom and Germany lead in European informatics, CS, CE, and IT graduate production. The United Kingdom and Germany lead Europe in the number of new informatics, CS, CE, and information bachelor's, master's, and PhD graduates. On a per capita basis, Finland leads in the production of both bachelor's and PhD graduates, while Ireland leads in the production of master's graduates.

This section provides an overview of postsecondary education in CS and AI, highlighting graduation statistics across North America and Europe for various degrees including bachelor's, master's, and PhDs. It also covers information on AI-related courses offered in English.

6.1 Postsecondary CS and AI Education

United States and Canada

This subsection presents an analysis of data from the Computing Research Association's Taulbee Survey, which evaluates the state of CS and AI postsecondary education in the United States and Canada. The survey covers 297 PhD-granting CS departments across the United States and Canada.

New CS bachelor's graduates in the United States and Canada, 2010-22

Source: CRA Taulbee Survey, 2023 | Chart: 2024 AI Index report

1 It is important to note that not all PhD-granting departments targeted in the survey provided responses. Of the 297 departments targeted, only 182 responded, an overall response rate of roughly 61%.
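The response rate in the footnote follows directly from the two counts it quotes. A minimal Python sketch of that arithmetic, with illustrative variable names that are not part of the survey:

```python
# Counts quoted in the footnote above.
departments_targeted = 297
departments_responded = 182

# Response rate as a fraction, then as a percentage rounded to one decimal.
response_rate = departments_responded / departments_targeted
response_rate_pct = round(response_rate * 100, 1)

print(f"Response rate: {response_rate_pct}%")
```

Because roughly two in five targeted departments did not respond, totals drawn from the survey are best read as covering respondents only.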

For the first time in almost eight years, the proportion of international students among CS bachelor's graduates in American and Canadian universities declined between 2021 and 2022 (Figure 6.1.2). This decline likely reflects the increased difficulty of obtaining study visas during the early years of the Trump administration, an impact that is only now beginning to manifest in the data. The decline is also partially attributable to international travel restrictions imposed during the COVID-19 pandemic, which affected the ability of international students to study in the United States and Canada. Despite this recent drop, the overall trend over the last decade shows a steady increase in the proportion of international students.

New international CS bachelor's graduates (% of total) in the United States and Canada, 2010-22

Source: CRA Taulbee Survey, 2023 | Chart: 2024 AI Index report

Figure 6.1.2

CS Master's Graduates

AI courses are commonly included in CS master's degree programs. While the total number of new CS master's graduates from American and Canadian universities more than doubled over the past decade, the number appears to have leveled out since 2018 and decreased slightly last year (Figure 6.1.3). This leveling reflects the decline in international master's students shown in Figure 6.1.4.

New CS master's graduates in the United States and Canada, 2010-22

In 2022, American and Canadian universities experienced a notable decrease in international CS master's students. This downward trend began around 2017, but the decline was most pronounced last year, at 14.8 percentage points (Figure 6.1.4). Currently, the split between international and domestic CS master's graduates is roughly even.

New international CS master's graduates (% of total) in the United States and Canada, 2010-22

Source: CRA Taulbee Survey, 2023 | Chart: 2024 AI Index report

CS PhD Graduates

For the first time in a decade, there has been a significant increase in the number of new CS PhD graduates at American and Canadian universities. In 2022, the number of CS PhD graduates reached 2,105, the highest since 2010 (Figure 6.1.5).

New CS PhD graduates in the United States and Canada, 2010-22

Source: CRA Taulbee Survey, 2023 | Chart: 2024 AI Index report

While the proportion of international students among CS PhD graduates has risen over the past decade, it decreased slightly between 2021 and 2022 (Figure 6.1.6).

New international CS PhD graduates (% of total) in the United States and Canada, 2010-22

Source: CRA Taulbee Survey, 2023 | Chart: 2024 AI Index report

Figure 6.1.6

Where do newly minted AI PhDs choose to work after graduating? Following a trend highlighted in last year's AI Index report, a growing share of AI doctoral recipients are pursuing careers in industry (Figure 6.1.7 and Figure 6.1.8). In 2011, around the same percentage took jobs in industry (40.9%) as in academia (41.6%).

Employment of new AI PhDs (% of total) in the United States and Canada by sector, 2010-22

Source: CRA Taulbee Survey, 2023 | Chart: 2024 AI Index report

Figure 6.1.7²

However, by 2022, a significantly larger proportion (70.7%) joined industry after graduation compared to those entering academia (20.0%). The percentage of new AI PhDs going into government roles has remained relatively low and steady over the past five years.

Employment of new AI PhDs in the United States and Canada by sector, 2010-22

Figure 6.1.8

2 The shares in Figure 6.1.7 do not sum to 100% because each year a subset of new AI PhDs become self-employed, are unemployed, or report an "other" employment status in the CRA survey. These students are not included in the chart.
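The employment figures quoted above for 2011 and 2022 are shares of a total, so a change between years is most naturally stated in percentage points, which is not the same as a relative (percent) change. A small Python sketch using only those quoted shares; the helper name `point_and_relative_change` is an illustrative assumption, and the derived values are computed here rather than reported in the source:

```python
def point_and_relative_change(start_pct, end_pct):
    """Return (percentage-point change, relative change in %) between two shares."""
    point_change = end_pct - start_pct
    relative_change = point_change / start_pct * 100
    return round(point_change, 1), round(relative_change, 1)

# Shares of new AI PhDs by sector, as quoted in the text.
industry = point_and_relative_change(40.9, 70.7)   # 2011 -> 2022
academia = point_and_relative_change(41.6, 20.0)   # 2011 -> 2022

print("Industry (points, relative %):", industry)
print("Academia (points, relative %):", academia)
```

Under these inputs the industry share rises by about 30 percentage points, which corresponds to a much larger relative increase; keeping the two measures apart avoids overstating or understating the shift.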

CS, CE, and Information Faculty

To better understand trends in CS and AI education, it is helpful to examine data on CS faculty. Last year, the total number of CS, CE, and information faculty in American and Canadian universities increased 7.2% (Figure 6.1.9). The total has also grown since 2011.

Number of CS, CE, and information faculty in the United States and Canada, 2011-22

Source: CRA Taulbee Survey, 2023 | Chart: 2024 AI Index report

In 2022, the United States had 7,084 CS faculty members, with the majority (65.7%) on the tenure track (Figure 6.1.10). The total number of American CS faculty has risen since both 2021 and 2011.

Number of CS faculty in the United States, 2011-22

Source: CRA Taulbee Survey, 2023 | Chart: 2024 AI Index report

Last year, 915 new faculty were hired across CS, CE, and information disciplines in North America, a decade high; 455 of these positions were tenure track (Figure 6.1.11).

New CS, CE, and information faculty hires in the United States and Canada, 2011-22

Source: CRA Taulbee Survey, 2023 | Chart: 2024 AI Index report

In 2022, 43% of new faculty appointments came from other academic positions, indicating a "churn" within the academic workforce (Figure 6.1.12). Since these "new" faculty members vacated positions elsewhere, their previous roles will eventually need to be filled. Additionally, the proportion of faculty transitioning from industry fell in 2022 relative to both the previous year and 2019.

Source of new faculty in American and Canadian CS, CE, and information departments, 2018-22

Source: CRA Taulbee Survey, 2023 | Chart: 2024 AI Index report

The reasons for faculty positions remaining unfilled have varied over the past decade. Earlier in the period, failed searches were more often attributed to no offer being made than they were in 2022, when only a small share ended that way and more involved offers that were turned down (Figure 6.1.13). This trend appears to reflect an increasingly competitive market for new CS faculty. However, it remains unclear whether this indicates heightened competition with other academic positions or with industry positions.

Reason why new CS, CE, and information faculty positions remained unfilled (% of total), 2011-22

Source: CRA Taulbee Survey, 2023 | Chart: 2024 AI Index report

In 2022, North American departments in CS, CE, and information disciplines experienced a significant increase in faculty departures, totaling 405, compared to 303 in 2021 (Figure 6.1.14). These losses were split between faculty leaving for other academic positions and those moving to nonacademic roles, in proportions consistent with previous years.
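The size of the jump in departures can be expressed as a year-over-year percent change. A brief Python sketch using the two totals quoted above; the resulting percentage is derived here and is not a figure stated in the report:

```python
# Faculty departure totals quoted in the text.
departures_2021 = 303
departures_2022 = 405

# Year-over-year change, absolute and as a percentage of the earlier year.
absolute_change = departures_2022 - departures_2021
percent_change = round(absolute_change / departures_2021 * 100, 1)

print(f"Departures rose by {absolute_change} ({percent_change}%) from 2021 to 2022")
```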

Faculty losses in American and Canadian CS, CE, and information departments, 2011-22

Source: CRA Taulbee Survey, 2023 | Chart: 2024 AI Index report

Since 2015, the increase in median nine-month salaries for full professors has slightly fallen below U.S. inflation rates, whereas median salaries for assistant and associate professors have seen slight increases above inflation. In 2022, a full professor's salary was higher than in 2021, but the increase did not keep pace with the U.S. inflation rate; the cumulative increase since 2015 likewise remained below inflation over those years (Figure 6.1.15).

Median nine-month salary of CS faculty in the United States, 2015-22

Source: CRA Taulbee Survey, 2023 | Chart: 2024 AI Index report

Figure 6.1.15

In 2022, the proportion of international hires among new tenure-track faculty in CS, CE, and information disciplines increased significantly from the previous year (Figure 6.1.16). This marked the second-highest percentage recorded in the past decade, surpassed only by 2013.

New international CS, CE, and information tenure-track faculty hires (% of total) in the United States and Canada, 2010-22

Source: CRA Taulbee Survey, 2023 | Chart: 2024 AI Index report

Figure 6.1.16

Europe

Data on European CS graduates comes from Informatics Europe, an academic and research community that, among other goals, monitors the state of informatics education in Europe. Informatics Europe gathers data on graduates in informatics, CS, CE, computing, and information technology (IT) disciplines from statistical offices of European governments.

Informatics, CS, CE, and IT Bachelor's Graduates

In 2022, the United Kingdom led with the highest number of new graduates in informatics, CS, CE, and IT at the bachelor's level, totaling approximately 25,000 (Figure 6.1.17). Germany and Turkey followed closely. Most countries in the sample saw an increase in graduates in these fields compared to a decade ago, though there were exceptions like Poland, Spain, and the Czech Republic (Figure 6.1.18).

New informatics, CS, CE, and IT bachelor's graduates by country in Europe, 2022

Source: Informatics Europe, 2023 | Chart: 2024 AI Index report

Figure 6.1.17

3 There is no singular term for CS education that is used uniformly across European countries. Across Europe, CS education can be reflected in terms such as informatics, computer science (CS), computer engineering (CE), computing, information technology (IT), information and communication technology (ICT), and information science and technology (IST). The full list of subject names (and English translations) that Informatics Europe uses to identify informatics studies programs can be found at the following link.

4 Readers are cautioned against making per capita comparisons between the CRA North American data and the European CS graduate data detailed in subsequent sections, as the European data is collected from national statistical offices and has broader coverage.

5 Note that not all countries for which the AI Index has data are visualized in the figures in this section. To access the complete data, please view the public data associated with this chapter. Moreover, the year label refers to the year in which an academic year ends. For example, the figures visualizing new graduates for 2022 reflect the number of graduates reported for the 2021/2022 academic year. For the sake of visual simplicity, the Index opts to focus on the year in which students graduated.

Percentage change of new informatics, CS, CE, and IT bachelor's graduates by country in Europe, 2012 vs. 2022

Source: Informatics Europe, 2023 | Chart: 2024 AI Index report

Finland (53.4), Norway (42.6), and the Netherlands (38.6) lead in the number of new bachelor's graduates in informatics, CS, CE, and IT per 100,000 inhabitants (Figure 6.1.19). On a per capita basis, most sampled European countries have seen increases in the total number of informatics, CS, CE, and IT bachelor's graduates (Figure 6.1.20).
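Per capita figures such as those above normalize a graduate count by population. A minimal Python sketch of the calculation, assuming the usual definition (graduates divided by population, scaled to 100,000 inhabitants); the function name and the example inputs are hypothetical and do not describe any country in the report:

```python
def graduates_per_100k(graduates, population):
    """Scale a graduate count to a rate per 100,000 inhabitants."""
    if population <= 0:
        raise ValueError("population must be positive")
    # Multiply before dividing to keep the arithmetic exact for round inputs.
    return graduates * 100_000 / population

# Hypothetical country: 5,000 new graduates and 10 million inhabitants.
example_rate = graduates_per_100k(5_000, 10_000_000)
print(f"{example_rate:.1f} graduates per 100,000 inhabitants")
```

A rate like this lets small countries be compared with large ones, which is why the per capita leaders above differ from the leaders by absolute count.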

New informatics, CS, CE, and IT bachelor's graduates per 100,000 inhabitants by country in Europe, 2022

Source: Informatics Europe, 2023 | Chart: 2024 AI Index report

Figure 6.1.19

Percentage change of new CS, CE, and information bachelor's graduates per 100,000 inhabitants by country in Europe, 2012 vs. 2022

Source: Informatics Europe, 2023 | Chart: 2024 AI Index report

Figure 6.1.20

Informatics, CS, CE, and IT Master's Graduates

As with bachelor's graduates, the United Kingdom leads Europe in producing new master's graduates in informatics, CS, CE, and IT, with approximately 20,000 graduates (Figure 6.1.21). In the last decade, Germany (259%), Turkey (197%), and Spain (194%) have seen the greatest percentage growth in new informatics, CS, CE, and IT master's graduates (Figure 6.1.22).

New informatics, CS, CE, and IT master's graduates by country in Europe, 2022

Source: Informatics Europe, 2023 | Chart: 2024 AI Index report

Figure 6.1.21

Percentage change of new informatics, CS, CE, and IT master's graduates by country in Europe, 2012 vs. 2022

Source: Informatics Europe, 2023 | Chart: 2024 AI Index report

Figure 6.1.22

Per capita metrics paint a somewhat similar picture. Ireland has the most informatics, CS, CE, and IT master's graduates on a per capita basis (31.2), followed by the United Kingdom (29.8) and Estonia (27.4) (Figure 6.1.23). On a per capita basis, Germany (243%) has also seen the greatest growth of informatics, CS, CE, and IT master's graduates in the last decade (Figure 6.1.24).

New informatics, CS, CE, and IT master's graduates per 100,000 inhabitants by country in Europe, 2022

Source: Informatics Europe, 2023 | Chart: 2024 AI Index report

Figure 6.1.23

Percentage change of new informatics, CS, CE, and IT master's graduates per 100,000 inhabitants by country in Europe, 2012 vs. 2022

Source: Informatics Europe, 2023 | Chart: 2024 AI Index report

Informatics, CS, CE, and IT PhD Graduates

The United Kingdom (1,060) and Germany (910) also produced the most informatics, CS, CE, and IT PhD graduates in 2022, followed by Italy (581) (Figure 6.1.25). In the last decade, Turkey has seen the greatest growth in new CS, CE, and information PhD graduates (Figure 6.1.26).

New informatics, CS, CE, and IT PhD graduates by country in Europe, 2022

Figure 6.1.25

Percentage change of new informatics, CS, CE, and IT PhD graduates by country in Europe, 2012 vs. 2022

Source: Informatics Europe, 2023 | Chart: 2024 AI Index report

Finland has the greatest number of new informatics, CS, CE, and IT PhD graduates per capita: 2.1 for every 100,000 inhabitants (Figure 6.1.27). Estonia (1.9) and the United Kingdom (1.6) trail slightly. On a per capita basis, the growth rate of new informatics, CS, CE, and IT PhDs has been relatively marginal in several major European countries such as the United Kingdom, Portugal, and Switzerland (Figure 6.1.28).

New informatics, CS, CE, and IT PhD graduates per 100,000 inhabitants by country in Europe, 2022

Source: Informatics Europe, 2023 | Chart: 2024 AI Index report

Figure 6.1.27

Percentage change of new informatics, CS, CE, and IT PhD graduates per 100,000 inhabitants by country in Europe, 2012 vs. 2022

Source: Informatics Europe, 2023 | Chart: 2024 AI Index report

Tracking the number of AI-related courses provides insight into the educational interest in AI. This section highlights data from Studyportals, an international platform monitoring English-language university study programs worldwide. Its portal encompasses information on over 200,000 courses at more than 3,750 educational institutions across 110 countries.

Total Courses

A study program, or degree program, comprises a series of courses designed to enable students to earn a relevant qualification, such as a degree or diploma. The number of English-language AI-related study programs has tripled since 2017, demonstrating a consistent yearly increase over the last five years (Figure 6.1.29). This trend indicates a steadily growing educational interest in AI.

Number of AI university study programs in English in the world, 2017-23

6 Currently, Studyportals, the company supplying data on AI university study programs, tracks only English-language AI courses. In the coming years, the Index plans to extend its coverage to include non-English programs.

Education Level

Broken down by educational level, the majority of AI study programs are offered at the master's level (55.0%), followed by the bachelor's level (39.8%), and finally the PhD level (5.3%) (Figure 6.1.30).
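Because each share above is published to one decimal place, the three percentages need not sum to exactly 100. A quick Python check, as a sketch; the tolerance of 0.05 per value is an assumption that each figure was rounded to the nearest tenth:

```python
# Shares of AI study programs by education level, as quoted in the text.
shares = {"master's": 55.0, "bachelor's": 39.8, "PhD": 5.3}

total = round(sum(shares.values()), 1)

# Each value rounded to one decimal can be off by at most 0.05.
max_rounding_drift = 0.05 * len(shares)
within_tolerance = abs(total - 100.0) <= max_rounding_drift + 1e-9

print(f"Total: {total}% (within rounding tolerance: {within_tolerance})")
```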

AI university study programs in English (% of total) by education level, 2023

Geographic Distribution

In 2023, the United Kingdom had the greatest number of English-language AI study programs (744) (Figure 6.1.31). Next were the United States (667) and Canada (89). For virtually every country included in the sample, there was a greater number of AI university study programs in 2023 than in 2022. Malta, the United Kingdom, and Cyprus had the greatest number of English-language AI university study programs per capita in 2023 (Figure 6.1.32).

Number of AI university study programs in English by geographic area, 2022 vs. 2023

7 Although the United Kingdom has fewer universities overall than the United States, it likely reports a higher number of AI study programs for several reasons. First, Studyportals has slightly greater coverage of the United Kingdom than of the United States in its data. Second, higher education in the United States tends to be more generalist than in the United Kingdom, meaning students studying AI might be enrolled in broader computer science programs that are not explicitly identified as AI study programs.

AI university study programs in English per 100,000 inhabitants by geographic area, 2022 vs. 2023

Source: Studyportals, 2023 | Chart: 2024 AI Index report

6.2 K-12 CS and AI Education

United States

Data on the state of K-12 CS education in the United States comes from Code.org, an education innovation nonprofit dedicated to ensuring that every school includes CS as part of its core K-12 education.

State-Level Trends

In 2023, 30 American states required that all high schools offer a foundational course in CS (Figure 6.2.1).

The percentage of public schools offering CS courses varies significantly from state to state (Figure 6.2.2). The top three states in terms of percentage of CS offerings are Maryland (99%), Arkansas (99%), and Nevada (96%); the bottom three are Minnesota (28%), Montana (34%), and Louisiana (35%).

States requiring that all high schools offer a foundational CS course, 2023

Source: Code.org, 2023 | Chart: 2024 AI Index report

Figure 6.2.1

Public high schools teaching foundational CS (% of total in state), 2023

Source: Code.org, 2023 | Chart: 2024 AI Index report

K-12 CS education is expanding in the United States (Figure 6.2.3). In 2017, only a few states supported high school CS programs. Now, approximately two-thirds of states require that CS be taught in high schools, allocate funding for it, and have developed state plans for CS education.

Changes over time in state-level US K-12 CS education

AP Computer Science

The state of K-12 CS education in the United States can also be observed by analyzing trends in the total number of AP CS exams.

In 2022, approximately 201,000 exams were administered, marking an 11.1% increase from 2021 (Figure 6.2.4). Since 2007, the number of AP CS exams administered has increased more than tenfold.
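The growth figures above imply rough values for earlier years. A Python sketch of that back-calculation; because the 2022 count is itself approximate, the derived numbers are estimates computed here, not data reported by Code.org:

```python
# Figures quoted in the text.
exams_2022 = 201_000          # approximate count
growth_from_2021 = 0.111      # 11.1% year-over-year increase

# Invert the growth rate to estimate the 2021 count.
implied_exams_2021 = round(exams_2022 / (1 + growth_from_2021))

# "More than tenfold" growth since 2007 bounds the 2007 count from above.
upper_bound_2007 = exams_2022 / 10

print(f"Implied 2021 exams: about {implied_exams_2021:,}")
print(f"2007 exams: fewer than about {upper_bound_2007:,.0f}")
```

Note that the 2021 estimate divides by 1.111 rather than subtracting 11.1% of the 2022 figure, since the growth rate is defined relative to the earlier year.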

Number of AP computer science exams taken, 2007-22

8 There are two types of AP CS exams: Computer Science A and Computer Science Principles. Data on computer science exams taken includes both exams. AP CS Principles was initially offered in 2017.

In 2022, California (33,262), Texas (20,901), and Florida were the leading states in terms of the number of AP CS exams taken (Figure 6.2.5). On the other end, Montana (39), South Dakota (40), and North Dakota (100) were the states where the fewest exams were taken.

Per 100,000 inhabitants, Maryland (126.5), New Jersey (112.7), and Massachusetts (92.7) ranked highest in the number of AP CS exams taken (Figure 6.2.6).

Number of AP computer science exams taken, 2022

Source: Code.org, 2023 | Chart: 2024 AI Index report

Figure 6.2.5

Number of AP computer science exams taken per 100,000 inhabitants, 2022

Source: Code.org, 2023 | Chart: 2024 AI Index report

Highlight: 突出显示:

Access Issues 访问问题

Code.org data suggests that factors such as school size and location significantly influence CS education accessibility. Large schools (over 1,200 students) are 15 percentage points more likely to offer CS courses than medium-sized schools (500-1,200 students), with the gap widening further when compared to small schools (under 500 students)
Code.org 的数据表明，诸如学校规模和位置等因素显著影响计算机科学教育的可及性。大型学校（超过 1,200 名学生）开设计算机科学课程的可能性要比中型学校（500-1,200 名学生）高出 15 个百分点，与小型学校（500 名以下学生）相比，差距进一步扩大。

(Figure 6.2.7). Similarly, students in suburban districts have better access to CS courses than their counterparts in both urban and rural areas (Figure 6.2.8).
（图 6.2.7）。同样，郊区地区的学生比城市和农村地区的同行更容易接触到计算机科学课程（图 6.2.8）。
Schools offering foundational CS courses by size, 2023
2023 年按规模提供基础计算机科学课程的学校

Source: Code.org, 2023 | Chart: 2024 Al Index report
来源：Code.org，2023 年 | 图表：2024 年 Al 指数报告

Figure 6.2.7 图 6.2.7

Schools offering foundational CS courses by geographic area, 2023
2023 年按地理区域提供基础计算机科学课程的学校

Highlight: 突出显示:

ChatGPT Usage Among Teachers and Students
教师和学生使用 ChatGPT

The introduction of generative AI tools, including ChatGPT, has sparked significant debate about their potential applications in education. Some have raised concerns that these tools could be misused for plagiarism, which may prompt a reevaluation of how American students are taught.
包括 ChatGPT 在内的生成式人工智能工具的出现，引发了关于其在教育领域潜在应用的重大争论。一些人担心这些工具可能被滥用于抄袭，这可能促使人们重新评估美国学生接受教育的方式。

In 2023, Impact Research, funded by the Walton Family Foundation, carried out a series of surveys on American teachers' and educators' perceptions and use of ChatGPT. The surveys revealed that a majority of teachers in the United States are already using ChatGPT, with usage increasing over the year: In March 2023, 51% of teachers reported having used ChatGPT at least once, and by July 2023 that figure had risen further (Figure 6.2.9). Teachers who reported using ChatGPT employed it for lesson planning, for generating new creative class ideas, and for enhancing their background knowledge (Figure 6.2.10).
2023 年，由沃尔顿家族基金会资助的 Impact Research 就美国教师和教育工作者对 ChatGPT 的认知和使用情况开展了一系列调查。调查显示，美国大多数教师已经在使用 ChatGPT，且使用率在这一年中不断上升：2023 年 3 月，51% 的教师表示至少使用过一次 ChatGPT，到 2023 年 7 月，这一比例进一步上升（图 6.2.9）。使用过 ChatGPT 的教师将其用于课程规划、生成新的创意课堂想法以及增强自己的背景知识（图 6.2.10）。

ChatGPT usage rate among American teachers, 2023
2023 年美国教师的 ChatGPT 使用率

Source: Impact Research, 2023 | Chart: 2024 AI Index report
来源：Impact Research，2023 年 | 图表：2024 年 AI 指数报告

ChatGPT usage purposes among American K-12 teachers, 2023
2023 年美国 K-12 教师使用 ChatGPT 的目的

Source: Impact Research, 2023 | Chart: 2024 AI Index report
来源：Impact Research，2023 年 | 图表：2024 年 AI 指数报告

Figure 6.2.10 图 6.2.10

Highlight: 突出显示:

ChatGPT Usage Among Teachers and Students (cont'd)
教师和学生中的 ChatGPT 使用情况（续）

Both teachers and students have overwhelmingly positive attitudes toward ChatGPT. According to the March 2023 survey, large shares of both the teachers and the students surveyed believe that ChatGPT has a positive impact (Figure 6.2.11), and many in both groups feel that ChatGPT is important to incorporate into the educational process. These recent data indicate that tools like ChatGPT are poised to become a staple in the American educational landscape in the foreseeable future.
教师和学生对 ChatGPT 持有极其积极的态度。根据 2023 年 3 月的调查，受访教师和学生中都有很大比例的人认为 ChatGPT 具有积极影响（图 6.2.11），两个群体中也都有许多人认为将 ChatGPT 纳入教育过程很重要。这些最新数据表明，像 ChatGPT 这样的工具有望在可预见的未来成为美国教育领域的重要组成部分。

ChatGPT perceptions among educational users, 2023
2023 年教育用户对 ChatGPT 的看法

Source: Impact Research, 2023 | Chart: 2024 AI Index report
来源：Impact Research，2023 年 | 图表：2024 年 AI 指数报告

Preview 预览

Overview 概述
Chapter Highlights 章节亮点

7.1 Overview of AI Policy in 2023 7.1 2023 年 AI 政策概述

7.2 AI and Policymaking 7.2 人工智能和政策制定
Global Legislative Records on AI 全球人工智能立法记录
Overview 概述
By Geographic Area 按地理区域划分
By Relevance 按相关性
By Approach 按方法
By Subject Matter 按主题分类
U.S. Legislative Records 美国立法记录
Federal Level 联邦层面
State Level 州级
AI Mentions AI 提及
Overview 概述
U.S. Committee Mentions 美国委员会提及

7.3 National AI Strategies 7.3 国家人工智能战略
By Geographic Area 按地理区域划分

7.4 AI Regulation 7.4 人工智能监管
U.S. Regulation 美国法规
Overview 概述
By Relevance 按相关性
By Agency 按机构
By Approach 按方法
By Subject Matter 按主题分类
EU Regulation 欧盟法规
Overview 概述
By Relevance 按相关性
By Agency 按机构
By Approach 按方法
By Subject Matter 按主题分类

7.5 U.S. Public Investment in AI 7.5 美国在人工智能领域的公共投资
Federal Budget for AI R&D 人工智能研发的联邦预算
U.S. Department of Defense Budget Requests 美国国防部预算请求
U.S. Government AI-Related Contract Spending 美国政府与人工智能相关的合同支出
AI Contract Spending 人工智能合同支出
Microelectronics and Semiconductor Spending 微电子和半导体支出

Overview 概述

AI's increasing capabilities have captured policymakers' attention. Over the past year, several nations and political bodies, such as the United States and the European Union, have enacted significant AI-related policies. The proliferation of these policies reflects policymakers' growing awareness of the need to regulate AI and improve their respective countries' ability to capitalize on its transformative potential.
AI 日益增强的能力引起了政策制定者的关注。在过去的一年里，包括美国和欧盟在内的几个国家和政治机构颁布了重要的人工智能相关政策。这些政策的大量出现反映出政策制定者日益意识到需要监管人工智能，并提高各自国家利用其变革潜力的能力。

This chapter examines global AI governance, beginning with a timeline of significant AI policymaking events in 2023. It then analyzes global and U.S. AI legislative efforts, studies AI legislative mentions, and explores how lawmakers across the globe perceive and discuss AI. Next, the chapter profiles national AI strategies and regulatory efforts in the United States and the European Union. Finally, it concludes with a study of public investment in AI within the United States.
本章审视全球 AI 治理，首先介绍 2023 年重要 AI 政策制定事件的时间线。然后分析全球和美国的 AI 立法努力，研究 AI 立法提及情况，并探讨全球各国立法者如何看待和讨论 AI。接下来，本章介绍各国的 AI 战略以及美国和欧盟的监管努力。最后，对美国的 AI 公共投资进行研究。

Chapter Highlights 章节亮点

The number of AI regulations in the United States sharply increases. The number of AI-related regulations in the U.S. has risen significantly in the past year and over the last five years. In 2023, there were 25 AI-related regulations, up from just one in 2016. The total number of AI-related regulations also grew over the last year alone.
美国的 AI 法规数量急剧增加。美国与 AI 相关的法规数量在过去一年和过去五年中显著增加。2023 年，AI 相关法规达到 25 项，远高于 2016 年的 1 项。仅在去年一年，AI 相关法规总数也有所增长。
The United States and the European Union advance landmark AI policy action. In 2023, policymakers on both sides of the Atlantic put forth substantial AI regulatory proposals. The European Union reached a deal on the terms of the AI Act, a landmark piece of legislation enacted in 2024. Meanwhile, President Biden signed an Executive Order on AI, the most notable AI policy initiative in the United States that year.
美国和欧盟推进具有里程碑意义的人工智能政策行动。2023 年，大西洋两岸的决策者提出了实质性的人工智能监管提案。欧盟就《人工智能法案》的条款达成协议，该法案于 2024 年颁布，是具有里程碑意义的立法。与此同时，拜登总统签署了一项关于人工智能的行政命令，这是美国当年最引人注目的人工智能政策举措。
AI captures U.S. policymaker attention. The year 2023 witnessed a remarkable increase in AI-related legislation at the federal level, with 181 bills proposed, more than double the 88 proposed in 2022.
AI 引起了美国政策制定者的关注。2023 年，联邦层面的 AI 相关立法数量显著增加，共提出 181 项法案，是 2022 年提出的 88 项的两倍多。
Policymakers across the globe cannot stop talking about AI. Mentions of AI in legislative proceedings across the globe have nearly doubled, rising from 1,247 in 2022 to 2,175 in 2023. AI was mentioned in the legislative proceedings of 49 countries in 2023. Moreover, at least one country from every continent discussed AI in 2023, underscoring the truly global reach of AI policy discourse.
全球各地的政策制定者都在谈论 AI。全球立法程序中对 AI 的提及几乎翻了一番，从 2022 年的 1,247 次增加到 2023 年的 2,175 次。2023 年，AI 在 49 个国家的立法程序中被提及。此外，每个大洲至少有一个国家在 2023 年讨论了 AI，突显了 AI 政策讨论的真正全球影响力。
More regulatory agencies turn their attention toward AI. The number of U.S. regulatory agencies issuing AI regulations increased to 21 in 2023 from 17 in 2022, indicating a growing concern over AI regulation among a broader array of American regulatory bodies. Some of the new regulatory agencies that enacted AI-related regulations for the first time in 2023 include the Department of Transportation, the Department of Energy, and the Occupational Safety and Health Administration.
更多监管机构将注意力转向人工智能。2023 年，发布人工智能法规的美国监管机构数量从 2022 年的 17 个增加到 21 个，表明越来越多的美国监管机构对人工智能监管表示关注。2023 年首次颁布人工智能相关法规的新监管机构包括交通部、能源部和职业安全与健康管理局。
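The growth claims in these highlights ("more than double," "nearly doubled") can be checked with simple arithmetic. The sketch below is illustrative only: the helper name `growth` is an assumption, and the counts are the ones quoted in the highlights.

```python
from typing import Tuple


def growth(old: float, new: float) -> Tuple[float, float]:
    """Return (ratio new/old, percent change) between two counts."""
    if old <= 0:
        raise ValueError("old must be positive")
    ratio = new / old
    return ratio, (ratio - 1.0) * 100.0


# Counts quoted in the Chapter Highlights.
highlights = {
    "U.S. federal AI bills proposed, 2022 -> 2023": (88, 181),
    "Global legislative mentions of AI, 2022 -> 2023": (1247, 2175),
    "U.S. agencies issuing AI regulations, 2022 -> 2023": (17, 21),
    "U.S. AI-related regulations, 2016 -> 2023": (1, 25),
}

for label, (old, new) in highlights.items():
    ratio, pct = growth(old, new)
    print(f"{label}: x{ratio:.2f} ({pct:+.1f}%)")
```

Under this calculation, proposed federal bills grew by a factor slightly above 2, while global legislative mentions grew by a factor of roughly 1.7, which is why the text describes the former as "more than double" and the latter as "nearly doubled."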

This chapter begins with an overview of some of the most significant AI-related policy events in 2023, as selected by the AI Index Steering Committee.
本章首先概述了由人工智能指数指导委员会选出的 2023 年一些最重要的人工智能相关政策事件。

7.1 Overview of AI Policy in 2023
7.1 2023 年 AI 政策概述

China introduces regulation on administration of deep synthesis of the internet
中国出台互联网深度合成管理规定

China introduces regulations aimed at "deep synthesis" technology to tackle security issues related to the creation of realistic virtual entities and multimodal media, including "deepfakes." These regulations apply to both providers and users across different media and mandate measures such as preventing illegal content, adhering to legal compliance, verifying user identities, securing consent for biometric editing, safeguarding data security, and enforcing content moderation.
中国出台针对“深度合成”技术的规定，以解决与创造逼真虚拟实体和多模态媒体（包括“deepfakes”）相关的安全问题。这些规定适用于不同媒体的提供者和用户，并要求采取防止非法内容、遵守法律合规性、验证用户身份、获得生物特征编辑的同意、保护数据安全以及执行内容审核等措施。

Source: China Talk, 2022
来源：China Talk，2022

Figure 7.1.1 图 7.1.1

U.S. legislators propose AI for National Security Act
美国立法者提出《人工智能促进国家安全法案》

This legislation clarifies and solidifies the Department of Defense's (DoD) authority to acquire AI-based endpoint security tools, enhancing its cyber-defense capabilities. It aims to enable the DoD to employ AI for the automatic detection and mitigation of threats to its networks and digital infrastructure. This bipartisan initiative ensures the DoD can adopt innovative commercial technologies to strengthen its cyber defenses, matching the pace of adversaries.
该立法明确并巩固了国防部（DoD）获取基于人工智能的端点安全工具的权限，增强其网络防御能力。它旨在使国防部能够利用人工智能自动检测和减轻对其网络和数字基础设施的威胁。这一跨党派倡议确保国防部可以采用创新的商业技术来加强其网络防御，与对手的步伐相匹配。

Source: Brookings, 2018
来源：布鲁金斯学会，2018

Figure 7.1.2 图 7.1.2

The sources cited in this section are for the images included in the text.
本节中引用的来源是文本中包含的图像。

May 11, 2023 2023 年 5 月 11 日

U.S. policymakers introduce AI Leadership Training Act
美国政策制定者提出《AI 领导力培训法案》

This legislation aims to enhance AI literacy among federal leaders in response to AI's widespread adoption across government agencies. It mandates the director of the Office of Personnel Management (OPM) to create and periodically refresh an AI training program, promoting responsible and ethical AI usage within the federal government. Building on previous laws, the initiative expands AI training to include federal employees involved in procuring AI technologies for government use.
鉴于人工智能在政府机构中的广泛采用，该立法旨在提高联邦领导人的人工智能素养。它要求人事管理办公室（OPM）主任创建并定期更新一个人工智能培训计划，促进联邦政府内部对人工智能的负责任和道德使用。在之前的法律基础上，该倡议将人工智能培训扩展到参与为政府采购人工智能技术的联邦雇员。

Source: Fox News, 2023
来源：福克斯新闻，2023

Figure 7.1.3 图 7.1.3

U.S. policymakers propose National AI Commission Act
美国政策制定者提出《国家人工智能委员会法案》

The National AI Commission Act calls for establishing a National AI Commission tasked with crafting a comprehensive AI regulatory framework. Highlighting the importance of expert input due to AI's rapid innovation and complexity, this bipartisan initiative focuses on mitigating risks, preserving U.S. leadership in AI research and development, and ensuring consistency with American values.
《国家人工智能委员会法案》呼吁建立一个负责制定全面人工智能监管框架的国家人工智能委员会。鉴于人工智能的快速创新和复杂性，这一两党倡议强调专家意见的重要性，着重于减轻风险、保持美国在人工智能研究和开发方面的领导地位，并确保与美国价值观保持一致。

Source: Nextgov, 2023
来源：Nextgov，2023

Figure 7.1.4 图 7.1.4

House of Representatives advances Jobs of the Future Act
众议院推进《未来就业法案》

The bill endorses a study to evaluate industries and occupations anticipated to grow due to AI, assess its effects on workers' skills or potential replacement, examine stakeholder influence opportunities, identify the demographics most impacted, evaluate the required skills and education, review data accessibility, investigate efficient skill delivery methods, and explore the role of academic institutions in offering critical training.
该法案支持开展一项研究，评估预计因人工智能而增长的行业和职业，评估其对工人技能的影响或替代工人的可能性，考察利益相关者施加影响的机会，确定受影响最大的人群，评估所需的技能和教育，审查数据的可获取性，研究高效的技能传授方法，并探讨学术机构在提供关键培训方面的作用。

Source: LSE Business Review, 2019
来源：LSE 商业评论，2019

Figure 7.1.5 图 7.1.5

U.S. Senate puts forward Artificial Intelligence and Biosecurity Risk Assessment Act
美国参议院提出《人工智能和生物安全风险评估法案》

The act mandates the assistant secretary for preparedness and response to assess and address threats to public health and national security from technical advancements in artificial intelligence. It emphasizes evaluating the potential use of AI, including open-source models, for developing harmful agents. The proposed initiatives include monitoring global biological risks and integrating risk assessment summaries into the National Health Security Strategy.
该法案要求负责准备和响应事务的助理部长评估并应对人工智能技术进步对公共卫生和国家安全构成的威胁。它强调评估人工智能（包括开源模型）被用于开发有害制剂的可能性。拟议的举措包括监测全球生物风险，并将风险评估摘要纳入国家卫生安全战略。

Source: Clinical Trials Arena, 2023
来源：Clinical Trials Arena，2023

Figure 7.1.6 图 7.1.6

Private AI labs sign voluntary White House AI commitments
私人 AI 实验室签署自愿的白宫 AI 承诺

The Biden-Harris administration obtains voluntary pledges from seven major AI firms (Google, Microsoft, Meta, Amazon, OpenAI, Anthropic, and Inflection) to promote the development of AI that is safe, secure, and reliable. These commitments involve conducting internal and external security assessments of AI systems prior to launch, sharing information on identified risks, enabling public reporting of issues, and disclosing when content is AI-generated.
拜登-哈里斯政府从谷歌、微软、Meta、亚马逊、OpenAI、Anthropic 和 Inflection 七家主要 AI 公司获得自愿承诺，以促进安全、可靠的 AI 发展。这些承诺包括在推出之前对 AI 系统进行内部和外部安全评估，共享已识别风险的信息，允许公众报告问题，并在内容由 AI 生成时进行披露。

Source: Medium, 2023
来源：Medium，2023

Figure 7.1.7 图 7.1.7

U.S. Senate passes Outbound Investment Transparency Act
美国参议院通过《对外投资透明度法案》

This initiative aims to scrutinize U.S. investments in critical sectors, especially those involving China, with a focus on evaluating risks in crucial industries and technologies such as AI that impact national security. The objective is to increase awareness of potential vulnerabilities and risks linked to foreign access to American technology in these domains.
该倡议旨在审查美国在关键领域的投资，特别是涉及中国的投资，重点评估人工智能等影响国家安全的关键行业和技术中的风险。其目标是提高对外国在这些领域获取美国技术所带来的潜在漏洞和风险的认识。

Source:

源:

Figure 7.1.8 图 7.1.8

U.S. Senate proposes CREATE AI Act
美国参议院提出 CREATE AI 法案

The CREATE AI Act establishes the National Artificial Intelligence Research Resource (NAIRR), a national research infrastructure to improve AI researchers' and students' access to essential resources. NAIRR offers compute, curated datasets, educational tools, and AI testbeds. It aims to bolster the nation's AI research capabilities by supporting the testing and evaluation of AI systems.
CREATE AI 法案建立了国家人工智能研究资源（NAIRR），这是一个国家级研究基础设施，旨在改善人工智能研究人员和学生获取基本资源的途径。NAIRR 提供计算资源、经过整理的数据集、教育工具和人工智能测试平台。它旨在通过支持人工智能系统的测试和评估来增强国家的人工智能研究能力。

Source: Stanford HAI, 2023
来源：斯坦福 HAI，2023

Figure 7.1.9 图 7.1.9

China updates cyberspace administration of generative AI measures
中国更新生成式人工智能措施的网络空间管理

China's updated policy adopts a more targeted regulatory approach, focusing on applications with public implications rather than a blanket regulation. The amendments soften the regulatory language, changing directives like "ensure the truth, accuracy, objectivity, and diversity of the data" to "employ effective measures to enhance the quality of training data and improve its truth, accuracy, objectivity, and diversity." Additionally, the revised regulations encourage generative AI development, shifting away from the prior punitive focus.
中国更新的政策采取了更有针对性的监管方法，重点放在具有公共影响的应用上，而不是一刀切的规定。修订软化了监管语言，将“确保数据的真实性、准确性、客观性和多样性”等指令改为“采取有效措施提高训练数据的质量，并改善其真实性、准确性、客观性和多样性”。此外，修订后的法规鼓励生成式人工智能发展，摆脱了先前的惩罚性关注。

Source: South China Morning Post, 2023
来源：南华早报，2023

Figure 7.1.10 图 7.1.10

U.S. Senate puts forward Protect Elections from Deceptive AI Act
美国参议院提出《保护选举免受欺骗性人工智能侵害法案》

The bipartisan bill seeks to prohibit the use of AI to create materially deceptive content that falsely represents federal candidates in political advertisements. This act addresses the risks of AI-driven disinformation in elections by banning the distribution of materially deceptive AI-generated audio or visual content related to candidates running for federal office.
这项两党法案旨在禁止使用人工智能制作在政治广告中虚假呈现联邦候选人的实质性欺骗内容。该法案禁止传播与竞选联邦职位的候选人有关的、由人工智能生成的实质性欺骗音频或视觉内容，以应对选举中人工智能驱动的虚假信息风险。

Source: The Economist, 2023
来源：《经济学人》，2023

Figure 7.1.11 图 7.1.11

Sep. 18, 2023 2023 年 9 月 18 日

U.K. proposes principles to guide competitive AI markets and protect consumers
英国提出指导竞争性人工智能市场和保护消费者的原则

The U.K.'s Competition and Markets Authority proposes principles to foster competitive AI markets while ensuring consumer protection. These principles are designed to guarantee accountability for AI outputs, maintain continuous access to essential inputs, promote a diversity of business models, provide businesses with choices, offer flexibility to switch between models, and ensure fair practices to prevent anticompetitive behavior.
英国竞争与市场管理局提出了一系列原则，旨在促进人工智能市场的竞争，同时确保消费者权益。这些原则旨在确保对人工智能输出的问责，保持对基本投入的持续获取，促进商业模式的多样化，为企业提供选择，提供在模型之间切换的灵活性，并确保公平做法，以防止反竞争行为。

Source: Science Business, 2022
来源：Science Business，2022

Figure 7.1.12 图 7.1.12

President Biden issues Executive Order on Safe, Secure, and Trustworthy AI
拜登总统发布关于安全、可靠和值得信赖的人工智能的行政命令

The executive order establishes new benchmarks for AI safety, security, privacy protection for Americans, advancement of equity and civil rights, and the fostering of competition and innovation. It mandates the creation of a national security memorandum to guide the safe and ethical application of AI in military and intelligence operations, ensuring the protection of Americans' privacy and the cultivation of an open, competitive AI market that emphasizes U.S. innovation. Additionally, the Department of Education is tasked with addressing AI's safe and responsible use in education, while the Federal Communications Commission is encouraged to assess AI's impact on telecommunications. The National Institute of Standards and Technology (NIST) is instructed to formulate guidelines and best practices to support industry consensus on developing and deploying secure, reliable, and ethical AI.
这项行政命令在人工智能安全、安保、美国人的隐私保护、促进平等和公民权利以及促进竞争和创新方面设立了新基准。它要求制定一项国家安全备忘录，指导在军事和情报行动中安全、合乎道德地应用人工智能，确保保护美国人的隐私，并培育一个开放、有竞争力、强调美国创新的人工智能市场。此外，教育部负责解决人工智能在教育中的安全和负责任使用问题，同时鼓励联邦通信委员会评估人工智能对电信的影响。国家标准与技术研究所（NIST）被要求制定指导方针和最佳实践，以支持行业就开发和部署安全、可靠和合乎道德的人工智能达成共识。

Source: AP, 2023 来源：美联社，2023

Figure 7.1.13 图 7.1.13

Frontier AI Taskforce releases second progress report
前沿人工智能工作组发布第二份进展报告

The task force forms new alliances with leading AI organizations and facilitates the development of the U.K.'s AI Research Resource (AIRR), to be known as Isambard-AI, an AI supercomputer designed for compute-intensive safety research. Moreover, the report highlights the task force's initiatives to mitigate risks inherent in advanced AI development and its partnerships with premier AI companies to gain early access to their models.
工作组与领先的人工智能组织建立新联盟，并推动英国人工智能研究资源（AIRR）的发展，该资源将被命名为 Isambard-AI，是一台专为计算密集型安全研究设计的人工智能超级计算机。此外，报告强调了工作组为减轻先进人工智能开发中固有风险所采取的举措，以及与领先的人工智能公司建立的合作伙伴关系，以便提前获得其模型。

U.K. hosts AI Safety Summit (2023)
英国主办人工智能安全峰会（2023）

The U.K. AI Safety Summit at Bletchley Park seeks to tackle AI risks and promote global cooperation, culminating in the Bletchley Declaration. This declaration, endorsed by 28 countries, including China and the United States, marks a significant global agreement on AI safety. The U.K. also unveiled the world's inaugural AI Safety Institute, dedicated to safety assessments and research. Despite these developments, reactions are mixed, with certain experts advocating for more comprehensive and ambitious policy measures.
在布莱切利公园举行的英国 AI 安全峰会旨在解决 AI 风险并促进全球合作，最终形成《布莱切利宣言》。这份宣言得到包括中国和美国在内的 28 个国家的支持，标志着全球就 AI 安全达成的重要协议。英国还公布了世界首个 AI 安全研究所，致力于安全评估和研究。尽管取得了这些进展，但反应褒贬不一，一些专家主张采取更全面、更有雄心的政策措施。

Source: CGTN, 2023
来源：CGTN，2023

Figure 7.1.15 图 7.1.15

U.K. announces AI Safety Institute
英国宣布成立人工智能安全研究所

The AI Safety Institute, the first government-supported entity dedicated to advancing AI safety in the public interest, aims to safeguard the U.K. and humanity from unforeseen AI advancements. Its goal is to build the sociotechnical framework required to comprehend and govern the risks associated with advanced AI. By conducting fundamental AI
人工智能安全研究所是第一个得到政府支持、致力于为公共利益推动人工智能安全的机构，旨在保护英国和人类免受意想不到的人工智能进展的影响。其目标是建立理解和管理与先进人工智能相关风险所需的社会技术框架。通过进行基础人工智能

Source: Gov.uk, 2024 来源: Gov.uk, 2024

comprehension of the dangers posed by advanced AI systems and create the technical tools vital for effective AI governance. Furthermore, it aspires to position the U.K. as a global center for safety research, thereby reinforcing the nation's strategic investment in this critical technology.
对先进人工智能系统带来的危险的理解，并创建对有效的人工智能治理至关重要的技术工具。此外，它立志将英国定位为全球安全研究中心，从而加强该国对这一关键技术的战略投资。

Europeans reach deal on EU AI Act
欧洲人就欧盟 AI 法案达成协议

European lawmakers reach a tentative deal on the AI Act. The act establishes a risk-based regulatory framework for AI, prohibiting systems with unacceptable risks, such as behavioral manipulators, and classifying high-risk systems into product-based and critical sectors. Generative AI, such as ChatGPT, is required to adhere to transparency standards. Meanwhile, low-risk AI, including deepfake technologies, is subject to fundamental transparency obligations.
欧洲立法者就 AI 法案达成初步协议。该法案为 AI 建立了基于风险的监管框架，禁止具有不可接受风险的系统，如行为操纵系统，并将高风险系统分为基于产品的类别和关键部门类别。ChatGPT 等生成式 AI 必须遵守透明度标准。与此同时，包括深度伪造技术在内的低风险 AI 需遵守基本透明度义务。

Source: Stanford HAI, 2023
来源：斯坦福 HAI，2023

Figure 7.1.17 图 7.1.17

7.2 AI and Policymaking
7.2 人工智能和政策制定

Global Legislative Records on AI
全球人工智能立法记录

Overview 概述

The AI Index analyzed legislation containing "artificial intelligence" in 128 countries from 2016 to 2023. Of these, 32 countries have enacted at least one AI-related bill (Figure 7.2.1). Figure 7.2.2 illustrates the annual count of AI-related bills these countries have passed since 2016. While the total dropped to 28 in 2023 from 39 in the previous year, the number of AI-related bills passed in 2023 significantly exceeds the total passed in 2016.
AI 指数分析了 2016 年至 2023 年间 128 个国家包含“人工智能”的立法。其中，32 个国家至少颁布了一项与人工智能相关的法案（图 7.2.1）。图 7.2.2 展示了这些国家自 2016 年以来每年通过的人工智能相关法案数量。尽管 2023 年的总数从前一年的 39 项下降到 28 项，但 2023 年通过的人工智能相关法案数量显著超过了 2016 年的总数。

Number of AI-related bills passed into law by country, 2016-23
2016-23 年各国通过的人工智能相关法案数量

Source: AI Index, 2024 | Chart: 2024 AI Index report
数据来源：2024 年 AI 指数 | 图表来源：2024 年 AI 指数报告

Figure 7.2.1 图 7.2.1

2 The analysis of passed AI policies may undercount the number of actual bills, given that large bills can include multiple sub-bills related to AI; for example, the CHIPS and Science Act passed by the United States in 2022.
2 对已通过的人工智能政策的分析可能低估了实际法案的数量，因为大型法案可能包括与人工智能相关的多个子法案；例如，美国在 2022 年通过的《芯片和科学法案》。

3 The AI Index monitored AI-related bills passed in Hong Kong and Macao, despite these not being officially recognized countries. Thus, the Index covers a total of 130 geographic areas. Laws passed by Hong Kong and Macao were counted in the overall tally of AI-related bills. This year, the Index expanded its country sample compared to previous years, resulting in a difference between the number of AI-related bills reported this year and those in prior reports.
3 人工智能指数监测了香港和澳门通过的与人工智能相关的法案，尽管这两个地区并非正式承认的国家。因此，该指数涵盖了总共 130 个地理区域。香港和澳门通过的法律被计入了人工智能相关法案的总数。今年，与以往年份相比，该指数扩大了国家样本范围，导致今年报告的人工智能相关法案数量与以往报告中的数量存在差异。

Number of AI-related bills passed into law in 128 select countries, 2016-23
2016-23 年间在 128 个选定国家通过的人工智能相关法案数量

Source: AI Index, 2024 | Chart: 2024 AI Index report
数据来源：2024 年 AI 指数 | 图表来源：2024 年 AI 指数报告

Figure 7.2.2 图 7.2.2

By Geographic Area 按地理区域划分

Figure 7.2.3 highlights the number of laws containing mentions of AI that were enacted in 2023. Belgium led with five laws, followed by France, South Korea, and the United Kingdom, each of which passed three. Figure 7.2.4 shows the total number of laws passed since 2016. The United States (23) has passed the most AI-related laws since 2016, followed by Portugal (15) and Belgium (12).
图 7.2.3 突出显示了 2023 年颁布的提及 AI 的法律数量。比利时以五部法律居首，其次是法国、韩国和英国，各通过了三部法律。图 7.2.4 显示了自 2016 年以来通过的法律总数。美国（23 部）自 2016 年以来通过的 AI 相关法律最多，其次是葡萄牙（15 部）和比利时（12 部）。

Number of AI-related bills passed into law in select countries, 2023
2023 年部分国家通过的人工智能相关法案数量

Source: AI Index, 2024 | Chart: 2024 AI Index report
数据来源：2024 年 AI 指数 | 图表来源：2024 年 AI 指数报告

Figure 7.2.3 图 7.2.3

Number of AI-related bills passed into law in select countries, 2016-23 (sum)
2016-23 年部分国家通过的人工智能相关法案数量（总和）

By Relevance 按相关性

The AI Index team further disaggregated AI-related bills based on their relevance to AI, as not every bill mentioning AI prioritizes it equally. A bill deemed to have high relevance to AI is fundamentally focused on AI-related policy, like the AI Training Act passed in 2022, which mandates AI training programs for executive agency workers. Conversely, bills with medium relevance incorporate significant AI policy elements but are not fundamentally focused on AI-related matters. For example, the National Defense Authorization Act for Fiscal Year 2022 includes sections on AI performance metrics and AI capabilities development for the Department of Defense. However, because it has a broader focus, namely authorizing various Defense Agency programs, and is not completely centered on AI, it was assigned a medium AI relevance. Low relevance AI bills merely mention AI in passing without a substantial legislative focus on AI. An example of a low relevance AI bill is the Energy and Water, Legislative Branch, and Military Construction and Veterans Affairs Appropriations Act, 2019. This bill allocates funding to various federal agencies and mentions AI primarily in the context of encouraging these agencies to consider workforce training opportunities for sectors like cybersecurity, energy, and AI.
AI 指数团队进一步根据法案与 AI 的相关性对 AI 相关法案进行了细分，因为并非每个提及 AI 的法案都同等重视它。被认为与 AI 高度相关的法案从根本上专注于 AI 相关政策，比如 2022 年通过的《AI 培训法案》，该法案要求为行政机构工作人员开设 AI 培训计划。相反，具有中等相关性的法案包含重要的 AI 政策内容，但并非从根本上专注于 AI 相关事项。例如，《2022 财政年度国防授权法案》包括有关 AI 绩效指标和国防部 AI 能力发展的条款。然而，由于其关注范围更广，即授权各种国防机构计划，并非完全以 AI 为中心，因此被评为中等 AI 相关性。低相关性的 AI 法案只是顺带提及 AI，并没有对 AI 的实质性立法关注。低相关性 AI 法案的一个例子是《2019 年能源和水、立法部门以及军事建设和退伍军人事务拨款法案》。该法案向各联邦机构拨款，提及 AI 主要是为了鼓励这些机构考虑网络安全、能源和 AI 等领域的劳动力培训机会。

By Approach 按方法

The AI Index also categorized AI-related bills as either expansive or restrictive. Expansive bills aim to enhance a nation's AI capabilities, such as establishing a network of publicly accessible supercomputers.
AI 指数还将 AI 相关法案分类为扩张性或限制性。扩张性法案旨在增强一个国家的 AI 能力，比如建立一个公共可访问的超级计算机网络。

Restrictive bills, on the other hand, impose limitations on AI usage, like setting rules for deploying facial recognition technology. A bill can be both, or neither.

Distinguishing between expansive and restrictive bills can highlight legislator priorities: whether policymakers focus on expanding AI capabilities, imposing restrictions, or balancing both.
另一方面，限制性法案则对 AI 的使用施加限制，比如制定部署面部识别技术的规则。一项法案既可以是扩张性的，也可以是限制性的，也可能两者都不是。区分扩张性和限制性法案可以突出立法者的优先事项：政策制定者是专注于扩展 AI 能力、施加限制，还是两者兼顾。
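Because a bill can be expansive, restrictive, both, or neither, the two labels behave like independent flags rather than mutually exclusive categories. The sketch below shows one hypothetical way to encode such a coding scheme. The class names, field names, and sample bills are assumptions made for illustration and are not the AI Index's actual tooling or data. It also reflects the report's note that only bills of medium or high AI relevance were categorized by approach.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum, Flag
from typing import Iterable, Optional


class Relevance(Enum):
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"


class Approach(Flag):
    NEITHER = 0
    EXPANSIVE = 1
    RESTRICTIVE = 2


@dataclass(frozen=True)
class Bill:
    title: str
    relevance: Relevance
    approach: Approach = Approach.NEITHER


def coded_approach(bill: Bill) -> Optional[Approach]:
    """Return the approach label, or None for low-relevance bills (not coded)."""
    if bill.relevance is Relevance.LOW:
        return None
    return bill.approach


def tally(bills: Iterable[Bill]) -> Counter:
    """Count coded bills per label; a bill that is both counts toward both."""
    counts: Counter = Counter()
    for bill in bills:
        approach = coded_approach(bill)
        if approach is None:
            continue
        if Approach.EXPANSIVE in approach:
            counts["expansive"] += 1
        if Approach.RESTRICTIVE in approach:
            counts["restrictive"] += 1
        if approach == Approach.NEITHER:
            counts["neither"] += 1
    return counts


# Hypothetical sample bills, for illustration only.
sample = [
    Bill("Bill A", Relevance.HIGH, Approach.EXPANSIVE),
    Bill("Bill B", Relevance.MEDIUM, Approach.EXPANSIVE | Approach.RESTRICTIVE),
    Bill("Bill C", Relevance.LOW, Approach.RESTRICTIVE),
    Bill("Bill D", Relevance.HIGH),
]
print(dict(tally(sample)))
```

One consequence of this design is that the expansive and restrictive tallies need not sum to the number of bills passed, since low-relevance bills are skipped and dual-purpose bills are counted twice.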

Figure 7.2.6 indicates a global trend toward regulating AI usage, showing that, while the commitment to enhancing AI capabilities remains, there is a growing shift toward restrictive legislation. This change suggests that legislators are increasingly focused on mitigating the potential harms of AI's integration into society.
图 7.2.6 显示了全球对 AI 使用进行监管的趋势，表明虽然增强 AI 能力的承诺仍然存在，但立法正越来越多地转向限制性立法。这种变化表明立法者越来越专注于减轻 AI 融入社会可能带来的潜在危害。

Number of AI-related bills passed into law in select countries by approach, 2016-23
2016-23 年间部分国家通过的人工智能相关法案数量，按方法划分

Source: AI Index, 2024 | Chart: 2024 AI Index report
数据来源：2024 年 AI 指数 | 图表来源：2024 年 AI 指数报告

Figure 7.2.6 图 7.2.6

4 The AI Index only categorized bills as being expansive or restrictive if they were identified as having medium or high AI relevance. Consequently, the totals depicted in Figure 7.2.5 may not fully correspond with those presented earlier in the chapter.
4 只有被认定为具有中等或高度人工智能相关性的法案才被 AI 指数归类为扩张性或限制性。因此，图 7.2.5 中显示的总数可能与本章前面呈现的总数不完全对应。

By Subject Matter 按主题分类

The AI Index's global analysis of AI legislation classifies bills by their primary subject matter according to the typology used by the U.S. Congress to classify American legislation.

Historically, economics and public finance have been the predominant focus of AI-related legislation, reflecting the fact that AI-related policymaking matters are often incorporated within budgetary bills related to public appropriations (Figure 7.2.7). However, in 2023 the distribution of primary topics among passed bills broadened significantly, encompassing a diverse range of policy areas. Specifically, two bills were passed in each of the following categories: armed forces and national security; civil rights and liberties, minority issues; commerce; education; labor and employment; science, technology, and communication. This diversity indicates that AI policy concerns are increasingly spanning various sectors.
AI 指数对人工智能立法的全球分析按照美国国会用于分类美国立法的类型体系，将法案按其主要主题进行分类。历史上，经济和公共财政一直是人工智能相关立法的主要关注点，这反映出人工智能相关的政策制定事项常常被纳入与公共拨款有关的预算法案（图 7.2.7）。然而，2023 年通过的法案的主要主题分布显著扩大，涵盖了各种政策领域。具体而言，以下每个类别各通过了两项法案：武装部队和国家安全；公民权利和自由、少数群体问题；商务；教育；劳动和就业；科学、技术和通信。这种多样性表明人工智能政策关注点正越来越多地跨越各个领域。

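Number of AI-related bills passed into law in select countries by primary subject matter, 2016-23
2016-23 年间部分国家通过的人工智能相关法案数量，按主要主题分类

Source: AI Index, 2024 | Chart: 2024 AI Index report
数据来源：2024 年 AI 指数 | 图表来源：2024 年 AI 指数报告

| Primary subject matter 主要主题 | 2016 | 2017 | 2018 | 2019 | 2020 | 2021 | 2022 | 2023 |
|---|---|---|---|---|---|---|---|---|
| Armed forces and national security 武装部队和国家安全 | | 1 | 2 | 1 | | 3 | 2 | 2 |
| Arts, culture, religion 艺术，文化，宗教 | | | | | | | 1 | + |
| Civil rights and liberties, minority issues 公民权利和自由，少数群体问题 | | | | | | 1 | 1 | 2 |
| Commerce 商务 | | 1 | | | 1 | | 1 | 2 |
| Crime and law enforcement 犯罪和执法 | | | | 1 | | 1 | 1 | 1 |
| Economics and public finance 经济和公共财政 | | 1 | 1 | 3 | 8 | 6 | 7 | 5 |
| Education 教育 | | | | | 3 | 1 | | 2 |
| Energy 能源 | | | | | 1 | | | |
| Environmental protection 环境保护 | | | | | | 1 | | |
| Finance and financial sector 金融和金融行业 | | | | | | | | 1 |
| Foreign trade and international finance 外贸和国际金融 | | | | | | | | 1 |
| Government operations and politics 政府运作和政治 | | | | | | 1 | 2 | |
| Health 卫生 | | | 1 | 2 | 1 | | + | |
| Labor and employment 劳动和就业 | | | + | 1 | | 3 | 4 | 2 |
| Science, technology, communications 科学、技术、通讯 | 1 | | 2 | 1 | 4 | 2 | 2 | 2 |
| Taxation 税收 | | | | | | 1 | | |
| Transportation and public works 交通和公共工程 | | | 1 | | | | | |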

Figure 7.2.7 图 7.2.7
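The statement that two bills were passed in each of six categories in 2023 can be cross-checked against the 2023 column of Figure 7.2.7. The sketch below is illustrative only: the dictionary is a hand transcription of that column, the variable names are assumptions, and categories whose 2023 cell is blank or illegible in this copy are left out.

```python
# 2023 column transcribed from Figure 7.2.7. Categories with a blank or
# illegible 2023 cell in this copy are omitted (an assumption of this sketch).
bills_2023 = {
    "Armed forces and national security": 2,
    "Civil rights and liberties, minority issues": 2,
    "Commerce": 2,
    "Crime and law enforcement": 1,
    "Economics and public finance": 5,
    "Education": 2,
    "Finance and financial sector": 1,
    "Foreign trade and international finance": 1,
    "Labor and employment": 2,
    "Science, technology, communications": 2,
}

# Categories in which exactly two bills were passed in 2023.
two_bill_categories = sorted(name for name, n in bills_2023.items() if n == 2)

# Category with the most bills passed in 2023.
largest = max(bills_2023, key=bills_2023.get)

print(two_bill_categories)
print(largest, bills_2023[largest])
```

The filter returns the same six categories named in the text, and economics and public finance remains the single largest category in 2023, consistent with its historical predominance.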

5 Similar to the classification of bills as either expansive or restrictive, only bills coded as having a medium or high relevance to AI were coded for their primary subject matter. Consequently, not all AI-related bills featured in this section's analysis have subject matter coding available.
5 与将法案分类为扩张性或限制性类似，只有被编码为与 AI 具有中等或高度相关性的法案才按其主要主题进行编码。因此，本节分析中出现的 AI 相关法案并非都有主题编码。

U.S. Legislative Records
美国立法记录

Federal Level 联邦层面

Figure 7.2.8 illustrates the total number of passed versus proposed AI-related bills in the U.S. Congress, highlighting a significant increase in proposed legislation. In the last year, the count of proposed AI-related bills more than doubled, rising from 88 in 2022 to 181 in 2023. This significant increase in U.S. AI-related legislative activity likely reflects policymakers' response to the growing capabilities of AI technologies such as ChatGPT and to rising public awareness of them.
图 7.2.8 展示了美国国会中已通过与已提出的人工智能相关法案的总数，突出了提议立法数量的显著增加。在过去的一年中，提出的 AI 相关法案数量增加了一倍多，从 2022 年的 88 项增加到 2023 年的 181 项。美国人工智能立法活动的显著增长，可能反映了决策者对 ChatGPT 等 AI 技术日益增强的能力以及公众对其日益提高的认识所作出的回应。

Number of AI-related bills in the United States, 2016-23 (proposed vs. passed)
2016-23 年美国人工智能相关法案数量（提出 vs. 通过）

Figure 7.2.8 图 7.2.8

State Level 州级

The AI Index also tracks data on the enactment of AI-related legislation at the state level. Figure 7.2.9 highlights the number of AI-related laws enacted by U.S. states in 2023. California leads with seven laws, followed by Virginia with five, and Maryland with three. Figure 7.2.10 displays the total number of AI-related laws passed by states from 2016 to 2023. California again tops the ranking with 13 bills, followed by Maryland (10) and Washington (7).
AI 指数还跟踪各州颁布人工智能相关立法的数据。图 7.2.9 突出显示了 2023 年美国各州颁布的人工智能相关法律的数量。加利福尼亚州以七项法律领先，弗吉尼亚州以五项紧随其后，马里兰州有三项。图 7.2.10 显示了 2016 年至 2023 年各州通过的人工智能相关法律总数。加利福尼亚州再次以 13 项法案居首，其次是马里兰州（10 项）和华盛顿州（7 项）。

Number of AI-related bills passed into law in select U.S. states, 2023
2023 年在美国部分州通过的人工智能相关法案数量

Source: AI Index, 2024 | Chart: 2024 AI Index report
数据来源：2024 年 AI 指数 | 图表来源：2024 年 AI 指数报告

Figure 7.2.11 displays the total number of state-level AI-related bills proposed and passed in the United States since 2016. In 2023, 150 total state-level bills were proposed, a significant increase from the 61 bills proposed in 2022. A significantly greater proportion of AI-related bills is enacted into law at the state level in the United States than at the federal level.
图 7.2.11 显示了自 2016 年以来在美国提出和通过的州级人工智能相关法案的总数。2023 年，共提出了 150 项州级法案，较 2022 年提出的 61 项显著增加。与联邦层面相比，美国州一级的人工智能相关法案成为法律的比例明显更高。

Number of state-level AI-related bills in the United States, 2016-23 (proposed vs. passed)
2016-23 年美国州级人工智能相关法案数量（提出 vs. 通过）

AI Mentions

Another barometer of legislative interest is the number of mentions of artificial intelligence in governmental and parliamentary proceedings. The AI Index analyzed the minutes or proceedings of legislative sessions in 80 countries that contain the keyword "artificial intelligence" from 2016 to 2023.
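The keyword-based approach described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `(year, text)` input shape and the `count_mentions` helper are hypothetical, and counting one mention per matching document is an assumption, since the text does not say whether mentions are tallied per document or per occurrence.

```python
import re
from collections import Counter

# Case-insensitive match for the keyword named in the text.
KEYWORD = re.compile(r"artificial intelligence", re.IGNORECASE)


def count_mentions(proceedings):
    """Count, per year, the documents that contain the keyword at least once.

    `proceedings` is an iterable of (year, text) pairs; this shape is a
    hypothetical stand-in for parsed legislative minutes.
    """
    counts = Counter()
    for year, text in proceedings:
        if KEYWORD.search(text):
            counts[year] += 1
    return counts


# Made-up sample records, for illustration only.
sample = [
    (2022, "The committee discussed artificial intelligence in health care."),
    (2023, "Artificial Intelligence and copyright were debated."),
    (2023, "The budget for road maintenance was approved."),
]
mentions_by_year = count_mentions(sample)
```

A per-occurrence tally would replace `KEYWORD.search(text)` with `len(KEYWORD.findall(text))`; which of the two the AI Index used is not stated here.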

Overview

Figure 7.2.12 reveals a significant increase in the mentions of AI in legislative proceedings across the globe, nearly doubling from 1,247 in 2022 to 2,175 in 2023. Since 2016, AI mentions in legislative discussions have risen almost tenfold. This data suggests that the emergence of AI systems such as ChatGPT in 2023 has notably captured policymakers' attention.

Figure 7.2.12
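As a quick check on phrases such as "more than doubled" and "nearly doubling", the year-over-year counts quoted in this section can be converted into ratios and percentage changes. The sketch below uses only figures stated in the text; the `growth` helper and the dictionary labels are illustrative names, not part of the AI Index methodology.

```python
def growth(old, new):
    """Return (ratio, percent_change) between two counts."""
    ratio = new / old
    return ratio, (ratio - 1) * 100


# (2022 count, 2023 count) pairs quoted in the text.
figures = {
    "federal bills proposed": (88, 181),
    "state bills proposed": (61, 150),
    "legislative mentions": (1247, 2175),
}

for name, (count_2022, count_2023) in figures.items():
    ratio, pct = growth(count_2022, count_2023)
    print(f"{name}: x{ratio:.2f} ({pct:+.1f}%)")
```

Proposed federal and state bills both grew by a factor of more than two, while legislative mentions grew by a factor of roughly 1.74, which is why the text describes them as "nearly" doubling.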

In 2023, the United Kingdom led in AI mentions within its legislative proceedings (405), followed by the United States (240) and Australia (227) (Figure 7.2.13). Out of 80 countries analyzed, 48 mentioned AI at least once. Moreover, AI discussions reached legislative platforms in at least one country from every continent in 2023, underscoring the truly global reach of AI policy discourse.

When legislative mentions are aggregated from 2016 to 2023, a somewhat similar trend emerges (Figure 7.2.14). The United Kingdom is first, with 1,490 mentions, followed by Spain (886) and the United States (868).

Number of mentions of AI in legislative proceedings by country, 2023

Source: AI Index, 2024 | Chart: 2024 AI Index report

Figure 7.2.13

Number of mentions of AI in legislative proceedings by country, 2016-23 (sum)

Source: AI Index, 2024 | Chart: 2024 AI Index report

Figure 7.2.14

U.S. Committee Mentions

Mentions of artificial intelligence in committee reports by House and Senate committees serve as another indicator of legislative interest in AI in the United States. Typically, these committees focus on legislative and policy issues, investigations, and internal matters.

Figure 7.2.15 shows the frequency of AI mentions in U.S. committee reports by legislative session from 2001 to 2023. Mentions of AI have decreased for the current 118th session; however, it is important to note that this session is only about halfway through, with an end date set for January 2025. Continuing at the current rate, the 118th legislative session is poised to surpass all previous sessions in terms of AI mentions.

Mentions of AI in US committee reports by legislative session, 2001-23

Figure 7.2.15

Figure 7.2.16 depicts AI mentions in the committee reports of the U.S. House of Representatives during the ongoing 118th congressional session. The Appropriations and Science, Space, and Technology committees feature the highest number of AI mentions. Meanwhile, Figure 7.2.17 highlights AI mentions in Senate committee reports, with Appropriations leading (9), followed by the Homeland Security and Governmental Affairs Committee (3).

Mentions of AI in committee reports of the US House of Representatives for the 118th congressional session, 2023

Mentions of AI in committee reports of the US Senate for the 118th congressional session, 2023

Source: AI Index, 2024 | Chart: 2024 AI Index report

Figures 7.2.18 and 7.2.19 show the total number of AI mentions in committee reports from congressional sessions occurring since 2001. The House and Senate Appropriations committees, which regulate expenditures of money by the federal government, lead their respective lists.

Mentions of AI in committee reports of the US House of Representatives, 2001-23 (sum)

Source: AI Index, 2024 | Chart: 2024 AI Index report

Mentions of AI in committee reports of the US Senate, 2001-23 (sum)

Source: AI Index, 2024 | Chart: 2024 AI Index report

7.3 National AI Strategies

This section offers an overview of national AI strategies, which are policy plans created by governments to guide the development and deployment of AI within their country. Monitoring trends in these strategies is important for assessing how countries prioritize the development and regulation of AI technologies. Sources include national or regional government websites, the OECD AI Policy Observatory (oecd.ai), and news reports.7

By Geographic Area

Canada initiated the first national AI strategy in March 2017. To date, 75 national AI strategies have been unveiled. The peak year was 2019, when 24 strategies were released. In 2023, eight new strategies were added, from countries in the Middle East, Africa, and the Caribbean, showcasing the worldwide expansion of AI policymaking discourse.

Figure 7.3.1 identifies countries that have either released or are in the process of developing a national AI strategy as of January 2024. Figure 7.3.2 lists the countries that have been in the process of developing an AI strategy within the past three years. The list of new countries developing national AI strategies includes Antigua and Barbuda, Barbados, Belarus, Costa Rica, Jamaica, Pakistan, and Senegal. Figure 7.3.3 provides a timeline of the release of national AI strategies.

Countries with a national strategy on AI, 2023

Source: AI Index, 2024 | Chart: 2024 AI Index report

Figure 7.3.1

7 The AI Index research team made efforts to identify whether there was a national AI strategy that was released or in development for every nation in the world. It is possible that some strategies were missed.

AI national strategies in development by country and year
Source: AI Index, 2024 | Table: 2024 AI Index report

| Year | Country |
| --- | --- |
| 2021 | Andorra, Armenia, Cuba, Iceland, Morocco, New Zealand |
| 2022 | Kenya |
| 2023 | Antigua and Barbuda, Barbados, Belarus, Costa Rica, Jamaica, Pakistan, Senegal |

Figure 7.3.2
Yearly release of AI national strategies by country
Source: AI Index, 2024 | Table: 2024 AI Index report

| Year | Country |
| --- | --- |
| 2017 | Canada, China, Finland |
| 2018 | France, Germany, India, Mauritius, Mexico, Sweden |
| 2019 | Argentina, Bangladesh, Chile, Colombia, Cyprus, Czech Republic, Denmark, Egypt, Estonia, Japan, Lithuania, Luxembourg, Malta, Netherlands, Portugal, Qatar, Romania, Russia, Sierra Leone, Singapore, Slovak Republic, United Arab Emirates, United States of America, Uruguay |
| 2020 | Algeria, Bulgaria, Croatia, Greece, Hungary, Indonesia, Latvia, North Korea, Norway, Poland, Saudi Arabia, Serbia, Spain, Switzerland |
| 2021 | Australia, Austria, Brazil, Hong Kong, Ireland, Malaysia, Peru, Philippines, Slovenia, Tunisia, Turkey, Ukraine, United Kingdom, Vietnam |
| 2022 | Belgium, Ghana, Iran, Italy, Jordan, Thailand |
| 2023 | Azerbaijan, Bahrain, Benin, Dominican Republic, Ethiopia, Iraq, Israel, Rwanda |

7.4 AI Regulation

The advent of AI has garnered significant attention from regulatory agencies, the federal bodies tasked with regulating sectors of the economy and steering the enforcement of laws. This section examines AI regulations within the United States and the European Union. Unlike legislation, which establishes legal frameworks within nations, regulations are detailed directives crafted by executive authorities to enforce legislation. In the United States, prominent regulatory agencies include the Environmental Protection Agency (EPA), Food and Drug Administration (FDA), and Federal Communications Commission (FCC). Since the specifics of legislation often manifest through regulatory actions, understanding the AI regulatory landscape is essential for a deeper understanding of AI policymaking.

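U.S. Regulation

This section examines AI-related regulations enacted by American regulatory agencies between 2016 and 2023. It provides an analysis of the total number of regulations, as well as their topics, scope, regulatory intent, and originating agencies. To compile this data, the AI Index team performed a keyword search on the Federal Register, a comprehensive repository of government documents from nearly all branches of the American government, encompassing more than 436 agencies.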

Overview

The number of AI-related regulations has risen significantly, both in the past year and over the last five years (Figure 7.4.1). In 2023, there were […] AI-related regulations, a stark increase from just one in 2016. Last year alone, the total number of AI-related regulations grew by […].

By Relevance

The AI Index categorized AI-related regulations (those mentioning AI) into three levels of relevance: low, medium, and high.

In 2023, the number of high and medium relevance AI-related regulations increased compared to 2022. For instance, a high relevance AI regulation was the Copyright Office and Library of Congress' Copyright Registration Guidance: Works Containing Material Generated by Artificial Intelligence. This policy statement clarified registration practices for works incorporating AI-generated material. Meanwhile, a medium relevance example is the Securities and Exchange Commission's Cybersecurity Risk Management, Strategy, Governance, and Incident Disclosure, which established standardized disclosure practices for public companies concerning cybersecurity risk management, strategy, governance, and incidents.

Figure 7.4.2 categorizes AI-related regulations in the United States based on their relevance to AI. A growing proportion of these regulations is highly relevant to AI. Among the […] AI-related regulations enacted in 2023, four were identified as being highly relevant, the greatest number since tracking began in 2016.

Number of AI-related regulations in the United States by relevance to AI, 2016-23

Source: AI Index, 2024 | Chart: 2024 AI Index report

A high relevance regulation focuses entirely on AI or AI-related issues. A medium relevance regulation includes meaningful mentions of AI but is not solely centered on it. A low relevance regulation mentions AI in passing, without a significant focus on AI-related matters.

By Agency 10

Which agencies are the primary sources of AI regulations? In 2023, the Executive Office of the President and the Commerce Department led with five AI-related regulations each, followed by the Health and Human Services Department and the Industry and Security Bureau, with each issuing four (Figure 7.4.3). Furthermore, the number of agencies issuing AI regulations increased from 17 in 2022 to 21 in 2023, indicating a growing need for clarity and concern regarding AI among a broader array of American regulatory bodies.

Number of AI-related regulations in the United States by agency, 2016-23

Source: AI Index, 2024 | Chart: 2024 AI Index report

Figure 7.4.3

By Approach

The AI Index categorized regulations based on their approach: whether they expanded or restricted AI capabilities.11 Over time, the trend in AI regulations in the United States has shifted significantly toward restriction (Figure 7.4.4). In 2023, there were 10 restrictive AI regulations compared to just three that were expansive. Conversely, in 2020, there were four regulations that were expansive and one that was restrictive.

Number of AI-related regulations in the United States by approach, 2016-23

Source: AI Index, 2024 | Chart: 2024 AI Index report

Figure 7.4.4

11 Expansive regulations refer to actions by regulatory agencies or governments aimed at augmenting AI capacity, including investments in supercomputing infrastructure. Restrictive regulations involve steps to curtail AI capabilities, such as imposing restrictions on the use of facial recognition algorithms. Restrictive AI regulations may also be intended to address underlying policy concerns, such as AI's potential impact on citizens' civil liberties. According to this coding typology, a regulation can be classified as both expansive and restrictive, or it may fit neither category. The AI Index assigned the labels "expansive" or "restrictive" only to regulations deemed to have medium to high relevance to AI. Therefore, the regulation totals in Figure 7.4.4 are less than those reported earlier in the section.
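The coding typology in this footnote can be made concrete with a small data structure. The sketch below is an assumption-laden illustration, not the AI Index's actual tooling: the `Regulation` class, the `tally_by_approach` helper, the sample titles, and the choice to count dual-labeled regulations in a separate "both" bucket are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Regulation:
    title: str
    relevance: str            # "low", "medium", or "high"
    expansive: bool = False   # aims to augment AI capacity
    restrictive: bool = False  # aims to curtail AI capabilities


def tally_by_approach(regulations):
    """Tally approach labels among medium and high relevance regulations."""
    tally = {"expansive": 0, "restrictive": 0, "both": 0, "neither": 0}
    for reg in regulations:
        if reg.relevance not in ("medium", "high"):
            continue  # low relevance regulations receive no approach label
        if reg.expansive and reg.restrictive:
            tally["both"] += 1
        elif reg.expansive:
            tally["expansive"] += 1
        elif reg.restrictive:
            tally["restrictive"] += 1
        else:
            tally["neither"] += 1
    return tally


# Made-up records, for illustration only.
sample = [
    Regulation("Hypothetical compute grant rule", "high", expansive=True),
    Regulation("Hypothetical facial recognition limit", "medium", restrictive=True),
    Regulation("Hypothetical mixed rule", "high", expansive=True, restrictive=True),
    Regulation("Hypothetical passing mention", "low", restrictive=True),
]
approach_tally = tally_by_approach(sample)
```

Because low relevance regulations are skipped, the tallied total is smaller than the full count of AI-related regulations, which matches the footnote's remark about the totals in Figure 7.4.4.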

By Subject Matter

In 2023, American AI regulations were categorized by primary subject matter. The most prevalent subject matter in AI-related regulation was foreign trade and international finance, with three instances. Three topics tied for second place, with two occurrences each: health; commerce; and science, technology, and communications (Figure 7.4.5).

Number of AI-related regulations in the United States by primary subject matter, 2016-23

Source: AI Index, 2024 | Chart: 2024 AI Index report

Figure 7.4.5

12 The AI Index team used Congress' policy categorization typology. Only regulations that have medium and high AI relevance were coded for their primary subject matter.

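EU Regulation

The AI Index also gathered information on AI-related regulations enacted in the European Union between 2017 and 2023. To compile this data, the Index team performed a keyword search for "artificial intelligence" on EUR-Lex, a comprehensive database of EU legislation, regulations, and case law. EUR-Lex provides access to a wide range of regulatory documents, such as legal acts, consolidated texts, international agreements, preparatory documents, and legislative procedures. The analysis in this section focuses on documents that have binding regulatory authority. The search for AI-related regulation in the European Union was limited to legal acts, international agreements, and consolidated texts. The same methodological approach was used to code EU regulations as was used to code U.S. regulations.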

Overview

The number of AI-related regulations passed by the European Union increased from 22 in 2022 to 32 in 2023 (Figure 7.4.6). Despite this increase, the total remains below the peak of 46 reached in 2021.

Source: AI Index, 2024 | Chart: 2024 AI Index report

13 The methodological approach refers to coding regulations based on relevance, originating agency, approach, and subject matter.

By Relevance

In 2021, the European Union passed its first highly relevant AI-related regulations. These regulations established the Digital Europe Programme and Horizon Europe, a framework program for research and innovation. Of the 32 regulations passed in 2023, two had high relevance to AI, 13 had medium relevance, and 17 had low relevance (Figure 7.4.7).
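The 2023 relevance breakdown can be expressed as shares of the total. The short sketch below uses only the counts quoted in the text (2 high, 13 medium, 17 low); the variable names are illustrative.

```python
# 2023 EU AI-related regulations by relevance level, as quoted in the text.
counts = {"high": 2, "medium": 13, "low": 17}

total = sum(counts.values())
shares = {level: 100 * n / total for level, n in counts.items()}

for level, share in shares.items():
    print(f"{level}: {counts[level]} of {total} ({share:.1f}%)")
```

The counts sum to the 32 regulations reported for 2023, and just over half of them mention AI only in passing.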

Number of AI-related regulations in the European Union by relevance to AI, 2017-23

Source: AI Index, 2024 | Chart: 2024 AI Index report

By Agency

The two most prominent originator agencies for European Union AI regulations in 2023 were the Council of the European Union (13) and the European Parliament (9) (Figure 7.4.8).14

Number of AI-related regulations in the European Union by institution and body, 2017-23

Source: AI Index, 2024 | Chart: 2024 AI Index report

By Approach

In recent years, AI-related regulation in the European Union has tended to take a more expansive approach (Figure 7.4.9). In 2023, there were eight regulations with a restrictive focus compared to 12 with an expansive one.

Number of AI-related regulations in the European Union by approach, 2017-23

Source: AI Index, 2024 | Chart: 2024 AI Index report

By Subject Matter

In 2023, the most common subject matters for AI-related regulations in the European Union were science, technology, and communications (5), followed by government operations and politics (3) (Figure 7.4.10). Regulations concerning government operations and politics involve setting rules for how governments and associated governmental processes operate. One such regulation was the Commission Recommendation (EU) on inclusive and resilient electoral processes in the Union and enhancing the European nature and efficient conduct of the elections to the European Parliament. This regulation acknowledged that AI could be used to generate political misinformation and outlined steps the Commission has taken to ensure AI does not challenge the legitimacy of elections. Evidently, European Union legislators are considering how AI will impact their government's work.

Number of AI-related regulations in the European Union by primary subject matter, 2017-23

Source: AI Index, 2024 | Chart: 2024 AI Index report

Figure 7.4.10

7.5 U.S. Public Investment in AI

This section examines public AI investment in the United States based on data from the U.S. government and Govini, a company that uses AI and machine learning technologies to track U.S. public and commercial spending.

Federal Budget for AI R&D

Every year in December, the National Science and Technology Council publishes a report on the public sector AI R&D budget across various departments and agencies that participate in the Networking and Information Technology Research and Development (NITRD) Program and National Artificial Intelligence Initiative. These reports, however, do not include information on classified AI R&D investment.

According to the 2023 report, in fiscal year 2023, U.S. government agencies allocated a total of $[…] billion to AI research and development spending (Figure 7.5.1). The funding for AI R&D has risen annually since FY 2018, more than tripling since then. For FY 2024, a larger budget of $[…] billion has been requested.

US federal NITRD budget for AI, FY 2018-24

Figure 7.5.1

Figure 7.5.2 details the breakdown of NITRD AI R&D budget requests by agency. For FY 2024, the National Science Foundation (NSF) had the highest request at $[…] million, followed by the Defense Advanced Research Projects Agency (DARPA) at $322.1 million, and the National Institutes of Health (NIH) at $284.5 million.

[…] amounts documented in the latest NITRD reports.

US governmental agency NITRD budgets for AI, FY 2021-24

Source: U.S. NITRD Program | Chart: 2024 AI Index report

U.S. Department of Defense Budget Requests

Every year, the DoD releases the amount of funding it requests for nonclassified AI-specific research, development, test, and evaluation. According to its 2023 report, the DoD requested $[…] billion in FY 2024, a significant increase from the $[…] billion that was requested in FY 2023 (Figure 7.5.3).

US DoD budget request for AI-specific research, development, test, and evaluation (RDT&E), FY 2020-24
Source: U.S. Office of the Under Secretary of Defense (Comptroller), 2023 | Chart: 2024 AI Index report

Figure 7.5.3

Public investment in AI can also be measured by federal government spending on the contracts awarded to private companies for goods and services. Such contracts typically occupy the largest share of an agency's budget.

Data in this section comes from Govini, which created a taxonomy of spending by the U.S. government on critical technologies including AI. Govini applied supervised machine learning and natural language processing to parse, analyze, and categorize large volumes of federal contracts data, including prime contracts, grants, and other transaction authority (OTA) awards. The use of AI models enables Govini to analyze data that is otherwise often inaccessible.

AI Contract Spending

Figure 7.5.4 highlights total U.S. government spending on AI, subdivided by various AI segments. From 2022 to 2023, total AI spending increased marginally from $[…] billion to $[…] billion. Since 2018, total spending has increased nearly 2.4 times. In 2023, the AI subsegments that saw the greatest amount of government spending included machine learning ($[…] billion) and computer vision ($[…] billion).

US government spending in AI/ML and autonomy by segment, FY 2018-23

Source: Govini, 2023 | Chart: 2024 AI Index report

Figure 7.5.4

16 In 2023, Govini made minor adjustments to their classification methodology. Consequently, the contract totals presented in Figure 7.5.4 may vary slightly from those reported in earlier editions of the AI Index.

Figure 7.5.5 shows U.S. government spending by AI segment in FY 2022 and FY 2023. Spending significantly increased for machine learning. Computer vision and natural language processing spending also rose, albeit less prominently.

US government spending in AI/ML and autonomy by segment, FY 2022 vs. 2023

Figure 7.5.5

In FY 2023, the majority of federal AI contracts were prime contracts (50.6%), followed by grants (47.6%) (Figure 7.5.6). In the last year, the share of contracts has declined, while the share of grants has increased.

Total value of contracts, grants, and OTAs awarded by the US government for AI/ML and autonomy, FY 2018-23

Microelectronics and Semiconductor Spending

Govini also monitors U.S. government microelectronics spending, which is becoming increasingly vital due to the crucial role that semiconductors, like GPUs, have played in powering recent AI technical improvements. The way governments allocate funds for semiconductors is poised to increase in geopolitical significance.

Figure 7.5.7 visualizes U.S. government spending on microelectronics by segment. Total spending on microelectronics has grown significantly in the last year, increasing to $[…] billion from $[…] billion in 2022. The large majority of American government microelectronic spending is allocated as contracts (Figure 7.5.8).

US government spending in microelectronics by segment, FY 2018-23

Source: Govini, 2023 | Chart: 2024 AI Index report

Figure 7.5.7

Total value of contracts, grants, and OTAs awarded by the US government for microelectronics, FY 2018-23

Source: Govini, 2023 | Chart: 2024 AI Index report

Figure 7.5.8

Preview

Overview
Chapter Highlights
8.1 AI Postsecondary Education
North America
CS Bachelor's Graduates
CS Master's Graduates
CS PhD Graduates
Disability Status of CS, CE, and Information Students
CS, CE, and Information Faculty
Europe
Informatics, CS, CE, and IT Bachelor's Graduates
Informatics, CS, CE, and IT Master's Graduates
Informatics, CS, CE, and IT PhD Graduates
8.2 AI Conferences
Women in Machine Learning (WiML) NeurIPS Workshop
Workshop Participants
Demographic Breakdown
8.3 K-12 Education
AP Computer Science: Gender
AP Computer Science: Ethnicity


Overview

The demographics of AI developers often differ from those of users. For instance, a considerable number of prominent AI companies and the datasets utilized for model training originate from Western nations, thereby reflecting Western perspectives. The lack of diversity can perpetuate or even exacerbate societal inequalities and biases.

This chapter delves into diversity trends in AI. The chapter begins by drawing on data from the Computing Research Association (CRA) to provide insights into the state of diversity in American and Canadian computer science (CS) departments. A notable addition to this year's analysis is data sourced from Informatics Europe, which sheds light on diversity trends within European CS education. Next, the chapter examines participation rates at the Women in Machine Learning (WiML) workshop held annually at NeurIPS. Finally, the chapter analyzes data from Code.org, offering insights into the current state of diversity in secondary CS education across the United States.

The AI Index is dedicated to enhancing the coverage of data shared in this chapter. Demographic data regarding AI trends, particularly in areas such as sexual orientation, remains scarce. The AI Index urges other stakeholders in the AI domain to intensify their endeavors to track diversity trends associated with AI and hopes to comprehensively cover such trends in future reports.

Chapter Highlights

U.S. and Canadian bachelor's, master's, and PhD CS students continue to grow more ethnically diverse. While white students continue to be the most represented ethnicity among new resident graduates at all three levels, the representation from other ethnic groups, such as Asian, Hispanic, and Black or African American students, continues to grow. For instance, since 2011, the proportion of Asian CS bachelor's degree graduates has increased by 19.8 percentage points, and the proportion of Hispanic CS bachelor's degree graduates has grown by 5.2 percentage points.

Substantial gender gaps persist in European informatics, CS, CE, and IT graduates at all educational levels. Every surveyed European country reported more male than female graduates in bachelor's, master's, and PhD programs for informatics, CS, CE, and IT. While the gender gaps have narrowed in most countries over the last decade, the rate of this narrowing has been slow.

U.S. K-12 CS education is growing more diverse, reflecting changes in both gender and ethnic representation. The proportion of AP CS exams taken by female students rose from […] in 2007 to […] in 2022. Similarly, the participation of Asian, Hispanic/Latino/Latina, and Black/African American students in AP CS has consistently increased year over year.

8.1 AI Postsecondary Education
8.1 高等教育

North America 北美洲

Data on American and Canadian postsecondary CS and AI postsecondary education comes from the Computing Research Association's (CRA) annual Taulbee Survey.
有关美国和加拿大计算机科学及 AI 高等教育的数据来自计算研究协会（CRA）每年的 Taulbee 调查。

CS Bachelor's Graduates 计算机科学学士毕业生

The percentage of female CS bachelor's graduates reached

in 2022 , continuing a decade-long rise (Figure 8.1.1). Nonbinary/other-identifying CS bachelor's graduates accounted for

in 2022.
2022 年，女性计算机科学学士毕业生的比例达到

，延续了十年的增长趋势（图 8.1.1）。2022 年，非二元/其他身份认同的计算机科学学士毕业生占比

。

Figure 8.1.1 图 8.1.1

Over the past decade, the number of CS bachelor's graduates of all ethnicities has grown, notably 4.7 times for Hispanics and 2.5 times for African Americans (Figure 8.1.2). As a proportion of ethnicities among all CS bachelor's graduates, Asians have risen the fastest, doubling in the last 10 years (Figure 8.1.3).
在过去的十年中，所有种族的计算机科学学士毕业生人数显著增长，其中西班牙裔增长了 4.7 倍，非裔美国人增长了 2.5 倍（图 8.1.2）。作为所有计算机科学本科毕业生中种族构成的一部分，亚裔增长最快，在过去 10 年中翻了一番（图 8.1.3）。

Ethnicity of new resident CS bachelor's graduates in the United States and Canada, 2011-22
2011-22 年间美国和加拿大的新居民计算机科学本科毕业生的种族

Figure 8.1.2 图 8.1.2

Ethnicity of new resident CS bachelor's graduates (% of total) in the United States and Canada, 2011-22
2011-22 年间美国和加拿大新毕业的计算机科学学士学位毕业生的民族比例（占总数的百分比）

Source: CRA Taulbee Survey, 2023 | Chart: 2024 AI Index report
数据来源：CRA Taulbee 调查，2023 年 | 图表：2024 年 AI 指数报告

Figure 8.1.3 图 8.1.3

CS Master's Graduates 计算机科学硕士毕业生

The proportion of female CS master's graduates has seen minimal growth in the last decade, increasing to

in 2022 from

in 2011 . Additionally, in

of CS master's graduates identified as nonbinary/other (Figure 8.1.4).
过去十年中，女性计算机科学硕士毕业生的比例增长很少，从 2011 年的

增加到 2022 年的

。此外，在

中，计算机科学硕士毕业生中有

被识别为非二元性别/其他（图 8.1.4）。

Gender of new CS master's graduates (% of total) in the United States and Canada, 2011-22
美国和加拿大新计算机科学硕士毕业生的性别（占总数的百分比），2011 年至 22 年

Among North American students, the most represented ethnicities are white (47.9%), Asian (35.8%), and Hispanic (8.2%) (Figure 8.1.5 and Figure 8.1.6). Similar to CS bachelor's graduates, the pool of CS master's graduates has become increasingly ethnically diverse over the last decade.
在北美学生中，最具代表性的族裔是白人（47.9%）、亚裔（35.8%）和西班牙裔（8.2%）（图 8.1.5 和图 8.1.6）。与计算机科学学士毕业生类似，过去十年来，计算机科学硕士毕业生的群体在种族上变得越来越多样化。

Ethnicity of new resident CS master's graduates in the United States and Canada, 2011-22
2011-22 年在美国和加拿大的新居民计算机科学硕士毕业生的种族

Source: CRA Taulbee Survey, 2023 | Chart: 2024 AI Index report
数据来源：CRA Taulbee 调查，2023 年 | 图表：2024 年 AI 指数报告

Ethnicity of new resident CS master's graduates (% of total) in the United States and Canada, 2011-22
2011-22 年在美国和加拿大的新居民计算机科学硕士毕业生的种族比例（占总数的百分比）

Source: CRA Taulbee Survey, 2023 | Chart: 2024 AI Index report
数据来源：CRA Taulbee 调查，2023 年 | 图表：2024 年 AI 指数报告

CS PhD Graduates 计算机科学博士毕业生

In 2022, the percentage of female PhD graduates in CS decreased slightly to 22.1% (Figure 8.1.7), but the long-term trend is unchanged.
2022 年，计算机科学博士女性毕业生的比例略有下降至 22.1%（图 8.1.7），但长期趋势保持不变。

Gender of new CS PhD graduates (% of total) in the United States and Canada, 2010-22
美国和加拿大新计算机科学博士毕业生的性别（占总数的百分比），2010 年至 2022 年

Figure 8.1.7 图 8.1.7

From 2011 to 2022, the diversity among CS PhD graduates significantly increased (Figure 8.1.8 and Figure 8.1.9). In 2022,

of CS PhD graduates were Asian, Black, Hispanic, multiracial, American Indian, or Native Hawaiian, marking a considerable rise from 2011.
从 2011 年到 2022 年，计算机科学博士毕业生的多样性显著增加（图 8.1.8 和图 8.1.9）。2022 年，计算机科学博士毕业生中有

亚裔、黑人、西班牙裔、多种族、美洲印第安人或夏威夷原住民，较 2011 年大幅增加。

Ethnicity of new resident CS PhD graduates in the United States and Canada, 2011-22
美国和加拿大新居民计算机科学博士毕业生的种族，2011-22

Source: CRA Taulbee Survey, 2023 | Chart: 2024 AI Index report
数据来源：CRA Taulbee 调查，2023 年 | 图表：2024 AI 指数报告

Ethnicity of new resident CS PhD graduates (% of total) in the United States and Canada, 2011-22
美国和加拿大新居民计算机科学博士毕业生的种族（占总数的百分比），2011-22

Source: CRA Taulbee Survey, 2023 | Chart: 2024 AI Index report
数据来源：CRA Taulbee 调查，2023 年 | 图表：2024 年 AI 指数报告

Disability Status of CS, CE, and Information Students
CS、CE 和信息学专业学生的残疾状况

For the second consecutive year, the CRA requested departments to report the number of students at each degree level who received disability accommodations over the preceding year. The reported numbers were relatively low:

of bachelor's,

of master's, and

of PhD students indicated a need for accommodations (Figure 8.1.10). Year over year, the proportion of students requesting disability accommodations has remained consistent.
连续第二年，CRA 要求各部门报告在过去一年中接受残疾人士适应的每个学位水平的学生人数。报告的数字相对较低：

本科生，

硕士生和

博士生表示需要适应（图 8.1.10）。逐年来看，请求残疾人士适应的学生比例保持一致。

CS, CE, and information students (% of total) with disability accommodations in the United States and Canada, 2021 vs. 2022
2021 年和 2022 年美国和加拿大的 CS、CE 和信息学专业残疾学生（占总数的百分比）的残疾适应情况

Source: CRA Taulbee Survey, 2023 | Chart: 2024 AI Index report
数据来源：CRA Taulbee 调查，2023 年 | 图表：2024 年 AI 指数报告

Figure 8.1.10 图 8.1.10

CS, CE, and Information Faculty
计算机科学、计算机工程和信息学院

Data regarding the ethnicity and gender of faculty in CS, CE, and information fields highlight diversity trends in academic AI and CS. As of 2022, a majority of faculty members in CS, CE, and information are male

, with women comprising

and nonbinary individuals accounting for

(Figure 8.1.11). Although the proportion of female faculty in these fields has risen since 2011, the increase has been small.
有关 CS、CE 和信息领域教师种族和性别的数据突出了学术 AI 和 CS 领域的多样性趋势。截至 2022 年，CS、CE 和信息领域的大多数教师是男性，女性占

，非二元个体占

（图 8.1.11）。尽管自 2011 年以来这些领域的女性教师比例有所上升，但增幅很小。

Gender of CS, CE, and information faculty (% of total) in the United States and Canada, 2011-22
在美国和加拿大，2011-22 年间 CS、CE 和信息学院的男女比例

Source: CRA Taulbee Survey, 2023 | Chart: 2024 AI Index report
数据来源：CRA Taulbee 调查，2023 年 | 图表：2024 年 AI 指数报告

Figure 8.1.11 图 8.1.11

While the majority of new faculty hires in CS, CE, and information at American and Canadian universities remain male (

), the proportion of women reached

in 2022 (Figure 8.1.12), well above the proportion of new female PhDs.
尽管在美国和加拿大大学的 CS、CE 和信息学院中，大多数新任教师仍然是男性（

），但 2022 年女性比例达到

（图 8.1.12），远高于新女性博士的比例。

Gender of new CS, CE, and information faculty hires (% of total) in the United States and Canada, 2011-22
美国和加拿大新任 CS、CE 和信息学教师的性别比例（占总数的百分比），2011-22 年

Source: CRA Taulbee Survey, 2023 | Chart: 2024 AI Index report
数据来源：CRA Taulbee 调查，2023 年 | 图表：2024 年 AI 指数报告

As of 2022, the majority of resident faculty in CS, CE, and information were white (57.3%), with Asian faculty following at

(Figure 8.1.13 and Figure 8.1.14). The ethnic diversity gap is gradually closing: In 2011, the difference between white faculty and the next largest ethnic group was

, but by 2021 , it had narrowed to

.
截至 2022 年，CS、CE 和信息学的大多数常驻教师是白人（57.3%），亚裔教师紧随其后（图 8.1.13 和图 8.1.14）。种族多样性差距逐渐缩小：2011 年，白人教师与下一个最大的族裔群体之间的差距为

，但到 2021 年，这一差距已经缩小到

。

Ethnicity of resident CS, CE, and information faculty in the United States and Canada, 2011-22
美国和加拿大 CS、CE 和信息学常驻教师的种族，2011-22 年

Ethnicity of resident CS, CE, and information faculty (% of total) in the United States and Canada, 2011-22
美国和加拿大常驻 CS、CE 和信息学教师的种族构成（占总数的百分比），2011-22 年

Source: CRA Taulbee Survey, 2023 | Chart: 2024 AI Index report
数据来源：CRA Taulbee 调查，2023 年 | 图表：2024 年 AI 指数报告

Europe 欧洲

Data on diversity trends among European CS graduates comes from Informatics Europe.

有关欧洲计算机科学毕业生多样性趋势的数据来自 Informatics Europe。

Informatics, CS, CE, and IT Bachelor's Graduates
信息学、计算机科学、计算机工程和信息技术学士毕业生

In the majority of surveyed European nations, there is a persistent gender disparity among bachelor's-level graduates in informatics, computer science, computer engineering, and information technology. Despite some narrowing since 2011, men continue to dominate. For example, France (14.8%), the United Kingdom (17.8%), and Germany (21.5%) show relatively low proportions of female graduates in these fields (Figure 8.1.15). Bulgaria stands out among the surveyed countries with the highest proportion of female graduates (35.2%).
在受调查的大多数欧洲国家中，信息学、计算机科学、计算机工程和信息技术的学士学位毕业生之间存在持续的性别差距。尽管自 2011 年以来有所缩小，男性仍然占主导地位。例如，法国（14.8%）、英国（17.8%）和德国（21.5%）在这些领域的女性毕业生比例相对较低（图 8.1.15）。保加利亚在受调查国家中以最高比例的女性毕业生脱颖而出（35.2%）。

Informatics, CS, CE, and IT Master's Graduates
信息学、计算机科学、计算机工程和信息技术硕士毕业生

Similar gender disparities are observed among European informatics, CS, CE, and IT master's graduates, with a significantly greater proportion of males than females in most surveyed countries. As of 2022, Estonia (42.0%), Romania (41.9%), and Bulgaria

reported the greatest proportion of female master's graduates (Figure 8.1.16). In contrast, Belgium (13.7%), Italy (14.1%), and Switzerland (15.8%) reported the smallest proportion of female master's graduates.
在欧洲信息学、计算机科学、计算机工程和信息技术硕士毕业生中观察到类似的性别差距，大多数受调查国家的男性比例明显高于女性。截至 2022 年，爱沙尼亚（42.0%）、罗马尼亚（41.9%）和保加利亚报告了最高比例的女性硕士毕业生（图 8.1.16）。相比之下，比利时（13.7%）、意大利（14.1%）和瑞士（15.8%）报告了最小比例的女性硕士毕业生。

Informatics, CS, CE, and IT PhD Graduates
信息学、计算机科学、计算机工程和信息技术博士毕业生

In all surveyed European countries, informatics, CS, CE, and IT PhD graduates are predominantly male. However, in nations such as the United Kingdom, Germany, and Switzerland, the gender gap has narrowed over the last decade, with women constituting a growing share of PhD graduates (Figure 8.1.17). In contrast, countries like Finland and Spain have seen the gap slightly widen.
在所有受调查的欧洲国家中，信息学、计算机科学、计算机工程和信息技术博士毕业生主要是男性。然而，在英国、德国和瑞士等国家，过去十年间性别差距已经缩小，女性博士毕业生所占比例逐渐增加（图 8.1.17）。相比之下，芬兰和西班牙等国家的性别差距略有扩大。

Gender of new informatics, CS, CE, and IT bachelor's graduates (% of total) in Europe, 2011-22
欧洲新信息学、计算机科学、计算机工程和信息技术本科毕业生的性别比例（总数的百分比），2011-22 年

Source: Informatics Europe, 2023 | Chart: 2024 AI Index report
来源：Informatics Europe，2023 年 | 图表：2024 年 AI 指数报告

Gender of new informatics, CS, CE, and IT master's graduates (% of total) in Europe, 2011-22
欧洲 2011-22 年新信息学、计算机科学、计算机工程和信息技术硕士毕业生的性别比例

Source: Informatics Europe, 2023 | Chart: 2024 AI Index report
来源：Informatics Europe，2023 年 | 图表：2024 年 AI 指数报告

Gender of new informatics, CS, CE, and IT PhD graduates (% of total) in Europe, 2011-22
欧洲 2011-22 年新信息学、计算机科学、计算机工程和信息技术博士毕业生的性别比例

Source: Informatics Europe, 2023 | Chart: 2024 AI Index report
来源：Informatics Europe，2023 年 | 图表：2024 年 AI 指数报告

8.2 AI Conferences 8.2 AI 会议

Women in Machine Learning (WiML) NeurIPS Workshop
机器学习中的女性（WiML）NeurIPS 研讨会

Women in Machine Learning (WiML), founded in 2006, is an organization dedicated to supporting and increasing the impact of women in machine learning. This section of the AI Index presents data from the WiML annual technical workshop, hosted at NeurIPS.
机器学习中的女性（WiML）成立于 2006 年，是一个致力于支持和增加机器学习领域女性影响力的组织。AI 指数的这一部分展示了 WiML 年度技术研讨会在 NeurIPS 举办的数据。

Workshop Participants 工作坊参与者

Despite a decline in participation over the last two years, the 2023 NeurIPS WiML workshop attendance of 714 was about eight times the attendance of 89 in 2010 (Figure 8.2.1). The recent drop in WiML workshop attendance may be linked to the overall decrease in NeurIPS attendance, which could be attributed to the shift away from a purely virtual format.

As a share of total conference attendance, the 2023 WiML workshop represented

of attendees (Figure 8.2.2).
尽管过去两年参与人数有所下降，但 2023 年 NeurIPS WiML 工作坊的参与人数为 714 人，几乎是 2010 年 89 人参与人数的八倍（图 8.2.1）。WiML 工作坊最近的参与人数下降可能与 NeurIPS 整体参与人数的减少有关，这可能归因于远离纯虚拟格式。作为总会议参与人数的一部分，2023 年 WiML 工作坊代表了

的参与者（图 8.2.2）。
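
The attendance comparison above is simple arithmetic and can be checked directly. The following is a minimal sketch in Python that uses only the two attendance figures quoted in the text (89 in 2010 and 714 in 2023); the function names are illustrative and do not come from the report.

```python
def growth_multiple(start: float, end: float) -> float:
    """Return how many times larger `end` is than `start`."""
    if start <= 0:
        raise ValueError("start must be positive")
    return end / start


def percent_increase(start: float, end: float) -> float:
    """Return the relative increase from `start` to `end`, in percent."""
    return (growth_multiple(start, end) - 1.0) * 100.0


# Attendance figures quoted in the text above.
attendance_2010 = 89
attendance_2023 = 714

multiple = growth_multiple(attendance_2010, attendance_2023)
increase = percent_increase(attendance_2010, attendance_2023)
print(f"2023 attendance is {multiple:.2f}x the 2010 figure (+{increase:.0f}%)")
```

Reporting both the multiple and the percent increase helps avoid the common ambiguity between "eight times as high" and "eight times higher."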

Figure 8.2.1 图 8.2.1

Attendance at NeurIPS Women in Machine Learning workshop (% of total), 2010-23
2010-23 年参加 NeurIPS 机器学习女性研讨会的出席率（占总数的百分比）

Demographic Breakdown 人口统计分析

The data in the subsequent figures is derived from a survey completed by participants who agreed to have their information aggregated. One component of the WiML survey asked attendees at the WiML workshop where they live. Among respondents,

hailed from North America, followed by Europe (21.8%), Asia (11.4%), and Africa (8.9%) (Figure 8.2.3). At this year's workshop, there was a greater proportion of North American attendees than in 2022.
后续图表中的数据来自于参与者完成的调查，这些参与者同意汇总他们的信息。WiML 调查的一个组成部分询问了 WiML 研讨会的参与者居住在哪里。在回答者中，

来自北美，其次是欧洲（21.8%）、亚洲（11.4%）和非洲（8.9%）（图 8.2.3）。在今年的研讨会上，与 2022 年相比，北美参与者的比例更高。

Continent of residence of participants at NeurIPS Women in Machine Learning workshop, 2022 vs. 2023
NeurIPS 机器学习女性研讨会参与者的居住大陆，2022 年与 2023 年

Source: Women in Machine Learning, 2023 | Chart: 2024 AI Index report
来源：Women in Machine Learning，2023 年 | 图表：2024 年 AI 指数报告

Figure 8.2.3 图 8.2.3

The majority of participants at the 2022 WiML workshop were female-identifying (84.2%), another

were male-identifying, and

were nonbinary-identifying (Figure 8.2.4).
2022 年 WiML 研讨会的大多数参与者为女性（84.2%），另有

为男性，

为非二元性别（图 8.2.4）。

Gender breakdown of participants at NeurIPS Women in Machine Learning workshop, 2022 vs. 2023
2022 年和 2023 年 NeurIPS 机器学习女性研讨会参与者的性别分布

Source: Women in Machine Learning, 2023 | Chart: 2024 AI Index report
来源：Women in Machine Learning，2023 年 | 图表：2024 年 AI 指数报告

Figure 8.2.4 图 8.2.4

8.3 K-12 Education 8.3 K-12 教育

This section uses data from Code.org, a U.S. nonprofit dedicated to advancing CS education in K-12 schools across the country, to paint a picture of how AI diversity trends are reflected at the K-12 level.
本节使用来自 Code.org 的数据，这是一个致力于推动全国 K-12 学校中 CS 教育发展的美国非营利组织，以展示 K-12 水平上 AI 多样性趋势如何体现。

AP Computer Science: Gender
AP 计算机科学：性别

In 2022, male students accounted for

of AP CS exam-takers, female students

, and students identifying as neither male nor female

(Figure 8.3.1). While male students continue to dominate AP CS exam participation, the proportion of female students has nearly doubled over the past decade.
在 2022 年，男性学生占 AP CS 考试考生的

，女性学生

，以及认同既非男性也非女性的学生

（图 8.3.1）。虽然男性学生继续主导 AP CS 考试参与，但女性学生的比例在过去十年几乎翻了一番。

AP computer science exams taken (% of total) by gender, 2007-22
按性别划分的 2007-22 年 AP 计算机科学考试参加人数（总数的百分比）

Figure 8.3.1 图 8.3.1

On a percentage basis, the states with the highest share of female AP CS test-takers in 2022 were Mississippi (41%), Alabama (37%), and Washington, D.C. (37%) (Figure 8.3.2). California, Texas, and Washington, states known for significant

and

activity, also saw notable participation, with approximately

of AP CS exam-takers being female.
从百分比的角度来看，2022 年女性 AP CS 考试考生人数最多的州是密西西比（41%）、阿拉巴马（37%）和华盛顿特区（37%）（图 8.3.2）。加利福尼亚州、德克萨斯州和华盛顿州，这些以重要

和

活动而闻名的州，也有显著的参与度，大约

的 AP CS 考试考生为女性。
AP computer science exams taken by female students (% of total), 2022
2022 年女性学生参加的 AP 计算机科学考试（占总数的百分比）

Source: Code.org, 2023 | Chart: 2024 AI Index report
来源：Code.org，2023 年 | 图表：2024 年 AI 指数报告

Figure 8.3.2 图 8.3.2

AP Computer Science: Ethnicity
AP 计算机科学：种族

Code.org's data highlights the evolving ethnic diversity among AP CS test-takers. Similar to trends in postsecondary CS, the ethnic diversity of AP CS test-takers is increasing. While white students remain the largest group, the participation of Asian, Hispanic/Latino/Latina, and Black/African American students in AP CS exams has grown over time (Figure 8.3.3). In 2022, white students constituted the largest share of exam-takers (38.2%), followed by Asian (27.8%) and Hispanic/Latino/Latina students (17.6%) (Figure 8.3.3 and Figure 8.3.4).
Code.org 的数据突出了 AP CS 考试考生中不断发展的种族多样性。与高等教育中计算机科学的趋势类似，AP CS 考试考生的种族多样性正在增加。尽管白人学生仍然是最大的群体，但亚裔、西班牙裔/拉丁裔和非裔/非裔美国学生参加 AP CS 考试的比例随着时间的推移而增加（图 8.3.3）。2022 年，白人学生占考试考生的最大比例（38.2%），其次是亚裔（27.8%）和西班牙裔/拉丁裔学生（17.6%）（图 8.3.3 和图 8.3.4）。

AP computer science exams taken by race/ethnicity, 2007-22
按种族/族裔划分的 AP 计算机科学考试，2007-22

Figure 8.3.3 图 8.3.3

AP computer science exams taken (% of total responding students) by race/ethnicity, 2007-22
2007-22 年按种族/族裔划分的参加 AP 计算机科学考试的学生人数（占总回答学生的百分比）

Source: Code.org, 2023 | Chart: 2024 AI Index report
来源：Code.org，2023 年 | 图表：2024 年 AI 指数报告

Figure 8.3.4 图 8.3.4

Preview 预览

Overview 概述
Chapter Highlights 章节亮点
9.1 Survey Data 9.1 调查数据
Global Public Opinion 全球公众舆论
AI Products and Services 人工智能产品和服务
AI and Jobs 人工智能与工作
AI and Livelihood 人工智能与生计
Attitudes on ChatGPT 对 ChatGPT 的态度
AI Concerns 对人工智能的担忧
US Public Opinion 美国公众舆论
9.2 Social Media Data 9.2 社交媒体数据
Dominant Models 主导模型
Highlight: AI-Related Social Media Discussion in 2023 亮点：2023 年与人工智能相关的社交媒体讨论


Overview 概述

As AI becomes increasingly ubiquitous, it is important to understand how public perceptions regarding the technology evolve. Understanding this public opinion is vital in better anticipating AI's societal impacts and how the integration of the technology may differ across countries and demographic groups.
随着人工智能变得越来越普遍，了解公众对这项技术的看法如何演变变得越来越重要。了解这种公众意见对于更好地预测人工智能对社会的影响以及技术整合在不同国家和人口群体中可能存在的差异至关重要。

This chapter examines public opinion on AI through global, national, demographic, and ethnic perspectives. It draws upon several data sources: longitudinal survey data from Ipsos profiling global AI attitudes over time, survey data from the University of Toronto exploring public perception of ChatGPT, and data from Pew examining American attitudes regarding AI. The chapter concludes by analyzing mentions of significant AI models on Twitter, using data from Quid.
本章从全球、国家、人口统计和民族的角度探讨了公众对人工智能的看法。它利用了几个数据来源：来自 Ipsos 的纵向调查数据，该数据概述了随时间推移全球人工智能态度，来自多伦多大学探索 ChatGPT 公众感知的调查数据，以及 Pew 的数据，该数据调查了美国人对人工智能的态度。本章最后分析了 Twitter 上对重要人工智能模型的提及，使用了来自 Quid 的数据。

Chapter Highlights 章节亮点

People across the globe are more cognizant of AI's potential impact, and more nervous. A survey from Ipsos shows that, over the last year, the proportion of those who think AI will dramatically affect their lives in the next three to five years has increased from to . Moreover, express nervousness toward AI products and services, marking a 13 percentage point rise from 2022. In America, Pew data suggests that of Americans report feeling more concerned than excited about AI, rising from 38% in 2022.
全球各地的人们更加意识到人工智能的潜在影响，也更加紧张。 Ipsos 的一项调查显示，在过去一年中，认为人工智能将在未来三到五年内对他们的生活产生巨大影响的人所占比例已从增加到。此外，对人工智能产品和服务表示紧张，较 2022 年增加了 13 个百分点。在美国，Pew 的数据表明，的美国人报告称他们对人工智能感到更担忧而不是兴奋，这一比例从 2022 年的 38% 上升。
AI sentiment in Western nations continues to be low, but is slowly improving. In 2022, several developed Western nations, including Germany, the Netherlands, Australia, Belgium, Canada, and the United States, were among the least positive about AI products and services. Since then, each of these countries has seen a rise in the proportion of respondents acknowledging the benefits of AI, with the Netherlands experiencing the most significant shift.
西方国家的情绪仍然低迷，但正在慢慢改善。2022 年，包括德国、荷兰、澳大利亚、比利时、加拿大和美国在内的几个发达的西方国家对人工智能产品和服务持最少的积极态度。自那时起，这些国家中的每一个都看到了承认人工智能好处的受访者比例上升，其中荷兰经历了最显著的变化。
The public is pessimistic about AI's economic impact. In an Ipsos survey, only of respondents feel AI will improve their job. Only anticipate AI will boost the economy, and believe it will enhance the job market.
公众对人工智能的经济影响持悲观态度。在一项 Ipsos 调查中，只有的受访者认为会改善他们的工作。只有预期会促进经济增长，相信它会提升就业市场。
Demographic differences emerge regarding AI optimism. Significant demographic differences exist in perceptions of AI's potential to enhance livelihoods, with younger generations generally more optimistic. For instance, of Gen Z respondents believe AI will improve entertainment options, versus only of baby boomers. Additionally, individuals with higher incomes and education levels are more optimistic about AI's positive impacts on entertainment, health, and the economy than their lower-income and less-educated counterparts.
人口统计学差异在人工智能乐观主义方面显现。在对人工智能提升生活潜力的看法上存在显著的人口统计学差异，年轻一代通常更加乐观。例如，的世代受访者认为将改善娱乐选择，而只有的婴儿潮一代这样认为。此外，收入较高和受教育水平较高的个体对人工智能对娱乐、健康和经济的积极影响持更加乐观态度，相比之下，收入较低和受教育水平较低的人则持相对悲观态度。
ChatGPT is widely known and widely used. An international survey from the University of Toronto suggests that of respondents are aware of ChatGPT. Of those aware, around half report using ChatGPT at least once weekly.
ChatGPT 被广泛认知和广泛使用。据多伦多大学的一项国际调查显示，受访者中有知道 ChatGPT。而在知道 ChatGPT 的人中，约一半表示每周至少使用一次 ChatGPT。

9.1 Survey Data 9.1 调查数据

Global Public Opinion 全球公众舆论

This section explores global differences in AI opinions through surveys conducted by Ipsos in 2022 and 2023. These surveys reveal that public perceptions of AI vary widely across countries and demographic groups.
本节通过 Ipsos 在 2022 年和 2023 年进行的调查，探讨了全球对 AI 意见的差异。这些调查显示，不同国家和人口群体对 AI 的公众认知存在广泛差异。

AI Products and Services
人工智能产品和服务

In 2023, Ipsos ran a survey on global attitudes toward AI products and services. The survey consisted of interviews with 22,816 adults ages 16 to 74 in 31 countries.
2023 年，Ipsos 对全球对人工智能产品和服务的态度进行了调查。该调查涉及在 31 个国家对 16 至 74 岁的 22,816 名成年人进行的访谈。

Figure 9.1.1 shows the percentage of respondents who agree with specific statements. A significant

anticipate AI will greatly change their lives in the near future, while

believe AI's benefits surpass its drawbacks. About half of the respondents trust AI companies' data-protection capabilities.
图 9.1.1 显示了同意特定说法的受访者百分比。有显著

预期人工智能将在不久的将来极大改变他们的生活，而

认为人工智能的好处超过了其缺点。大约一半的受访者信任

公司的数据保护能力。

The figure also contrasts Ipsos survey responses from 2022 and 2023, highlighting a shift in public AI sentiment following the release of ChatGPT, a milestone in public AI recognition. Over the last year, there has been a noticeable 6 percentage point increase in those who think AI will dramatically affect their lives in the next three to five years. Moreover,

now express nervousness toward AI products and services, marking a 13 percentage point rise from 2022. The public across the globe is becoming increasingly cognizant of and nervous about AI's growing impact.
该图还对比了 2022 年和 2023 年 Ipsos 调查的回应，突显了 ChatGPT 发布后公众

情绪的转变-这是公众

认可的一个里程碑。在过去一年中，认为人工智能将在未来三到五年内显著影响他们生活的人数有明显增加，增长了 6 个百分点。此外，

现在对人工智能产品和服务表达了紧张情绪，这标志着从 2022 年以来增长了 13 个百分点。全球公众越来越意识到并对人工智能不断增长的影响感到紧张。

Global opinions on products and services using AI (% of total), 2022 vs. 2023
全球对使用人工智能产品和服务的看法（总数的百分比），2022 年对比 2023 年

Source: Ipsos, 2022-23 | Chart: 2024 AI Index report
来源：Ipsos，2022-23 | 图表：2024 年 AI 指数报告

1 See Appendix for more details about the survey methodology. The survey was conducted from May to June 2023.
1 有关调查方法的更多细节，请参阅附录。该调查于 2023 年 5 月至 6 月进行。

Perceptions of AI's benefits versus drawbacks vary considerably by country, according to the Ipsos survey.

of Indonesian,

of Thai, and

of Mexican respondents view AI products and services as more beneficial than harmful (Figure 9.1.2). In contrast, only

of Americans agree with this perspective. Among the 31 countries surveyed, the United States and France exhibited the most skepticism.
根据艾普索斯的调查，各国对人工智能的益处与弊端的看法存在较大差异。印尼有

的受访者、泰国有

的受访者和墨西哥有

的受访者认为

产品和服务更有益于有害（图 9.1.2）。相比之下，只有

的美国人同意这一观点。在受访的 31 个国家中，美国和法国表现出最大的怀疑态度。

Attitudes toward AI are becoming more positive in countries that were previously critical. In 2022, several developed Western nations, including Germany, the Netherlands, Australia, Belgium, Canada, and the United States, were among the least positive about AI products and services. Since then, each of these countries has seen a rise in the proportion of respondents acknowledging the benefits of AI, with the Netherlands experiencing the most significant shift. By 2023, 43% of Dutch respondents viewed AI products and services positively, up from

the previous year.
在以前批评的国家，人们对

的态度变得更加积极。2022 年，包括德国、荷兰、澳大利亚、比利时、加拿大和美国在内的几个发达的西方国家对人工智能产品和服务的态度最不积极。从那时起，这些国家的受访者中认识到

的好处的比例有所增加，其中荷兰的变化最为显著。到 2023 年，43%的荷兰受访者对人工智能产品和服务持积极态度，比前一年增加

。

'Products and services using AI have more benefits than drawbacks,' by country (% of total), 2022 vs. 2023
“使用人工智能的产品和服务利大于弊”，按国家划分（占总数的百分比），2022 年与 2023 年

Source: Ipsos, 2022-23 | Chart: 2024 AI Index report
来源：Ipsos，2022-23 | 图表：2024 年人工智能指数报告

Figure 9.1.2 图 9.1.2

Figure 9.1.3 shows responses to Ipsos' survey on AI products and services by country. Indonesian respondents are notably optimistic:

claim a solid understanding of AI,

believe AI will significantly change their lives in the next three to five years, and

express excitement about AI products and services.
图 9.1.3 展示了 Ipsos 就 AI 产品和服务进行的调查结果。印尼受访者显然持乐观态度：

认为自己对

有很好的了解，

相信 AI 将在未来三至五年内显著改变他们的生活，

对 AI 产品和服务表现出兴奋。

Conversely, Japanese respondents show the least understanding of AI and also report the lowest level of nervousness about AI (23%). Meanwhile, Thai respondents exhibit the highest trust in AI's impartiality, believing it will not discriminate or show bias toward any group.
相反，日本受访者对 AI 的了解最少，报告的关于人工智能的紧张程度也最低（23%）。与此同时，泰国受访者展示出最高的对人工智能公正性的信任，认为它不会歧视或偏袒任何群体。

Opinions about AI by country (% agreeing with statement), 2023
按国家对 AI 的看法（% 同意该说法），2023 年

Source: Ipsos, 2023 | Chart: 2024 AI Index report
来源：Ipsos，2023 年 | 图表：2024 年 AI 指数报告

A large majority of the countries surveyed by Ipsos in 2022 were surveyed again in 2023, enabling cross-year comparisons. Figure 9.1.4 highlights the year-over-year percentage point change in answers to particular AI-related questions. For every country surveyed in both 2022 and 2023, an increase was reported in the degree to which AI products make people nervous. The sharpest increases were reported in Italy (24 percentage points), France (19), Chile (18), and Australia (18).
Ipsos 在 2022 年调查的绝大多数国家在 2023 年再次接受了调查，从而可以进行跨年度比较。图 9.1.4 突出显示了特定 AI 相关问题的答案逐年的百分点变化。对于 2022 年和 2023 年均接受调查的每个国家，AI 产品使人紧张的程度都有所上升。增幅最大的是意大利（24 个百分点）、法国（19）、智利（18）和澳大利亚（18）。

Likewise, except for South Africa, all countries in the survey sample are now more inclined to believe that AI will significantly impact their lives in the next three to five years. The highest increase of 12 percentage points was reported in Japan, Great Britain, Germany, and Australia.
同样，除南非外，调查样本中的所有国家现在更倾向于相信人工智能将在未来三到五年内显著影响他们的生活。日本、英国、德国和澳大利亚报告的最高增长为 12 个百分点。
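
The year-over-year comparisons above are expressed in percentage points, which is an absolute difference between two shares and not the same as a relative percent change. The following is a minimal sketch in Python of the distinction; the function names and the two example shares are hypothetical and are not taken from the Ipsos data.

```python
def percentage_point_change(share_before: float, share_after: float) -> float:
    """Absolute difference between two shares that are both expressed in percent."""
    return share_after - share_before


def relative_change(share_before: float, share_after: float) -> float:
    """Change relative to the starting share, in percent."""
    if share_before == 0:
        raise ValueError("relative change is undefined for a zero baseline")
    return (share_after - share_before) / share_before * 100.0


# Hypothetical shares of respondents agreeing with a statement (not survey data).
agree_2022 = 50.0
agree_2023 = 62.0

pp = percentage_point_change(agree_2022, agree_2023)
rel = relative_change(agree_2022, agree_2023)
print(f"{pp:+.1f} percentage points, {rel:+.1f}% relative change")
```

With these illustrative inputs, a 12 percentage point rise corresponds to a 24% relative increase, which is why the two measures should not be used interchangeably when comparing countries with different baselines.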

Percentage point change in opinions about AI by country (% agreeing with statement), 2022-23
2022-23 年各国对人工智能看法的百分点变化（同意该说法的百分比）

Source: Ipsos, 2022-23 | Chart: 2024 AI Index report
来源：Ipsos，2022-23 | 图表：2024 年 AI 指数报告

AI and Jobs 人工智能与工作

This year's Ipsos survey included more questions about how people perceive AI's impact on their current jobs. Figure 9.1.5 illustrates the various global perspectives on the expected impact of AI on employment.

of respondents think AI is likely to change how they perform their current job within the next five years, and

fear AI may replace their job in the same time frame.
今年的艾普索调查包含更多关于人们如何看待人工智能对自己当前工作影响的问题。图表 9.1.5 展示了全球各地对

对就业预期影响的不同看法。

的受访者认为

可能会改变他们在未来五年内如何执行当前工作的方式，

担心

可能会在同一时间范围内取代他们的工作。

Global opinions on the impact of AI on current jobs, 2023
人工智能对当前工作岗位影响的全球观点，2023 年

Source: Ipsos, 2023 | Chart: 2024 AI Index report
来源：Ipsos，2023 年 | 图表：2024 年人工智能指数报告

Figure 9.1.5 图 9.1.5

Opinions on whether AI will significantly impact an individual's job vary considerably across demographic groups (Figure 9.1.6). Younger generations, such as Gen Z and millennials, are more inclined to agree that AI will change how they do their jobs compared to older generations like Gen X and baby boomers. Specifically,

of Gen Z compared to

of boomer respondents agree with the statement that AI will likely affect their current jobs. Additionally, individuals with higher incomes, more education, and decision-making roles are more likely to foresee AI impacting their current employment.
关于人工智能是否会显著影响个人工作的观点在不同人群中存在显著差异（图 9.1.6）。年轻一代，如 Z 世代和千禧一代，更倾向于认同人工智能将改变他们的工作方式，而相比之下，像 X 世代和婴儿潮一代这样的老一代则持不同意见。
具体来说，Z 世代中

同意人工智能可能会影响他们目前工作的说法，而婴儿潮一代中

受访者同意这一说法。此外，收入较高、受教育程度较高且担任决策角色的个人更有可能预见到人工智能对他们目前的就业产生影响。

Global opinions on the impact of AI on current jobs by demographic group, 2023
2023 年，全球各人口群体对人工智能对当前工作影响的看法

Source: Ipsos, 2023 | Chart: 2024 AI Index report
来源：Ipsos，2023 年 | 图表：2024 年 AI 指数报告

Figure 9.1.6 图 9.1.6

AI and Livelihood 人工智能与生计

The Ipsos survey explored the impact respondents believe AI will have on various aspects of their lives, such as health and entertainment. On topics like time management and entertainment, the majority viewed AI positively (Figure 9.1.7). For instance,

of global respondents agree that AI will improve the efficiency of their tasks, and

believe AI will enhance entertainment options like TV, movies, music, and books. However, skepticism was more prominent in other areas. Only

feel AI will benefit their health, and

think it will improve their job. Only

anticipate AI will boost the economy, and just

believe it will enhance the job market.
Ipsos 调查探讨了受访者认为人工智能对他们生活各个方面的影响，如健康和娱乐。在时间管理和娱乐等话题上，大多数人对人工智能持积极态度（图 9.1.7）。例如，全球受访者中有

认为

将提高他们任务的效率，

相信

将增强电视、电影、音乐和书籍等娱乐选择。然而，在其他领域，怀疑情绪更为突出。只有

认为人工智能将有益于他们的健康，

认为它会改善他们的工作。只有

预期

将促进经济增长，只有

相信它将增强就业市场。

Similar to questions about AI products and services, responses showed intracountry consistency, with Japanese, Swedes, and Americans generally pessimistic about AI's potential to improve livelihoods, whereas Brazilians, Indonesians, and Mexicans were more optimistic.
与关于人工智能产品和服务的问题类似，回答显示出国内一致性，日本人、瑞典人和美国人普遍对人工智能改善生活的潜力持悲观态度，而巴西人、印尼人和墨西哥人则更为乐观。

Global opinions on the potential of AI improving life by country, 2023
2023 年各国人们对人工智能改善生活潜力的全球观点

Source: Ipsos, 2023 | Chart: 2024 AI Index report
来源：Ipsos，2023 年 | 图表：2024 年 AI 指数报告

Figure 9.1.7 图表 9.1.7

Significant demographic differences also exist in perceptions of AI's potential to enhance livelihoods, with younger generations generally expressing greater optimism. For instance,

of Gen Z respondents believe AI will improve entertainment options, versus only

of baby boomers. Additionally, individuals with higher incomes and education levels are more optimistic about AI's positive impacts on entertainment, health, and the economy compared to their lower-income and less-educated counterparts. In general, members of Gen Z, those in higher income brackets, and those with more education are the most optimistic about AI's potential to improve life, while those from the boomer generation, lower income brackets, and with less education are the least optimistic.
在对 Al 能够提升生活方式的看法方面也存在显著的人口统计学差异，年轻一代普遍表现出更大的乐观主义。例如，

Gen

受访者相信 Al 将改善娱乐选择，而只有

的婴儿潮一代持有这一看法。此外，高收入和教育水平的人对 Al 在娱乐、健康和经济方面的积极影响更持乐观态度，这一态度与低收入和较低受教育水平的人形成对比。总的来说，Z 一代的成员、高收入群体以及受教育程度更高的人对 Al 改善生活的潜力最为乐观，而婴儿潮一代、低收入群体以及受教育程度较低的人则最不乐观。

Global opinions on the potential of AI improving life by demographic group, 2023
2023 年各人群对人工智能改善生活潜力的全球观点

[Figure: survey items shown are "The amount of time it takes me to get things done," "My entertainment options (television/video content, movies, music, books)," "My job," and "The economy in my country"; responses are broken down by gender, generation, household income, education, and employment status.]

Attitudes on ChatGPT 对 ChatGPT 的态度

Many argue that the launch of ChatGPT by OpenAI in November 2022 was a watershed moment in familiarizing the public with AI. While AI encompasses much more than ChatGPT or LLMs, the prominence of ChatGPT as one of the most well-known AI tools makes gauging public sentiment toward it an interesting approach for better understanding broader opinions on AI.
许多人认为 OpenAI 在 2022 年 11 月推出 ChatGPT 是公众熟悉人工智能的一个分水岭时刻。虽然人工智能远不止于 ChatGPT 或 LLMs，但作为最知名的人工智能工具之一，ChatGPT 的显著性使得衡量公众对其情绪的方法成为更好地了解人工智能更广泛观点的有趣途径。

Global Public Opinion on Artificial Intelligence (GPO-AI) is a report created by the Schwartz Reisman Institute for Technology and Society (SRI) in collaboration with the Policy, Elections and Representation Lab (PEARL) at the Munk School of Global Affairs and Public Policy at the University of Toronto. In October and November 2023, researchers from SRI and PEARL conducted a 21-country survey examining global attitudes toward AI.
《全球公众对人工智能的看法》（GPO-AI）是由施瓦茨·赖斯曼技术与社会研究所（SRI）与多伦多大学蒙克全球事务与公共政策学院政策、选举和代表性实验室（PEARL）合作创建的报告。2023 年 10 月和 11 月，SRI 和 PEARL 的研究人员进行了一项涵盖 21 个国家的调查，研究全球对人工智能的态度。

Figure 9.1.9 explores the extent of global public awareness of ChatGPT. Among global respondents,

claim awareness of ChatGPT. Countries with the highest awareness rates include India (82%), Kenya (81%), Indonesia (76%), and Pakistan (76%). Poland reported the lowest awareness, at

.
图 9.1.9 探讨了全球公众对 ChatGPT 的认知程度。在全球受访者中，

声称知道 ChatGPT。意识率最高的国家包括印度（82%）、肯尼亚（81%）、印度尼西亚（76%）和巴基斯坦（76%）。波兰报告的认知率最低，为

。

Figure 9.1.10 highlights how frequently respondents who report being familiar with ChatGPT use the tool. Globally,

of users utilize it daily,

weekly, and

monthly. India (36%), Pakistan (28%), and Kenya

report the highest levels of daily usage.
图 9.1.10 突出了熟悉 ChatGPT 的受访者如何频繁地使用该工具。全球范围内，

的用户每天使用，

每周使用，

每月使用。印度（36%）、巴基斯坦（28%）和肯尼亚

报告每天使用水平最高。

Global awareness of ChatGPT (% of total), 2023
ChatGPT 全球意识度（占比），2023 年

Source: Global Public Opinion on Artificial Intelligence (GPO-AI), 2024 | Chart: 2024 AI Index report
来源：全球人工智能公共舆论（GPO-Al）2024 年调查 | 图表：2024 年人工智能指数报告

Global usage frequency of ChatGPT (% of total), 2023
ChatGPT 全球使用频率（占比），2023 年

Source: Global Public Opinion on Artificial Intelligence (GPO-AI), 2024 | Chart: 2024 AI Index report
源头: 2024 年全球人工智能公共舆论（GPO-AI）| 图表：2024 年 Al 指数报告

AI Concerns AI 关注

GPO-AI also reported on respondents' AI-related concerns. Figure 9.1.11 presents the percentage of survey respondents who expressed concern about 11 specific impacts. Globally, individuals were most concerned about AI being misused for nefarious purposes (49%), its impact on jobs (49%), and its potential to violate citizens' privacy (

). In contrast, global citizens were comparatively less concerned about issues of unequal access to AI (26%), AI's potential for bias and discrimination (24%), and their own ability to use AI (22%).
GPO-Al 还报告了受访者对人工智能相关担忧的情况。图 9.1.11 展示了对 11 个具体影响表达担忧的调查受访者的百分比。全球范围内，个人最担心 Al 被用于不良目的（49%），其对就业的影响（49%）以及其可能侵犯公民隐私的潜力（

）。相比之下，全球公民对不平等获取

（26%）、Al 潜在的偏见和歧视（24%）以及他们自己使用

的能力的问题相对较少表示担忧（22%）。

Global concerns on the impacts of AI in the next few years, 2023
未来几年对 AI 影响的全球关注，2023

Source: Global Public Opinion on Artificial Intelligence (GPO-AI), 2024 | Chart: 2024 AI Index report
来源：全球人工智能公共舆论（GPO-Al）2024 | 图表：2024 年人工智能指数报告

Figure 9.1.11 图表 9.1.11

US Public Opinion 美国公众舆论

Since 2021, Pew Research Center has been investigating sentiment toward AI in the United States. Its most recent survey, conducted in 2023, received 11,000 responses.
自 2021 年以来，皮尤研究中心一直在调查美国对 AI 的情绪。他们收到了对他们最近的 2023 年调查的 11,000 份回复。

Figure 9.1.12 shows that over the last year, Americans have grown increasingly concerned about the use of AI in their daily lives. In 2021 and 2022, only 37% and

of Americans, respectively, reported feeling more concerned than excited about AI technology. By 2023, this figure had risen to

, indicating that a majority of Americans now feel more concerned than excited about AI technology.
图 9.1.12 显示，过去一年，美国人对在日常生活中使用

越来越担忧。在 2021 年和 2022 年，只有 37％和

的美国人分别表示对人工智能感到更担忧而不是兴奋。到 2023 年，这一数字已上升至

，表明现在大多数美国人对人工智能感到更担忧而不是兴奋。

Americans' feelings toward increased use of AI in daily life (% of total), 2021-23
美国人对日常生活中增加使用 AI 的感受（总人数的百分比），2021-23 年

Source: Pew Research, 2023 | Chart: 2024 AI Index report
来源：皮尤研究中心，2023 年 | 图表：2024 年 Al 指数报告

Pew also surveyed Americans' opinions on whether they believed AI helped or hindered in specific contexts (Figure 9.1.13). Respondents reported that AI was more likely to be beneficial, particularly in assisting people to find products or services online, with

expressing this view. However,

of respondents indicated that AI was more likely to be detrimental than beneficial in safeguarding personal information privacy.
皮尤还对美国人在特定情境中是否认为人工智能有帮助或阻碍进行了调查（图 9.1.13）。调查对象报告称，人工智能在帮助人们在线上找到产品或服务方面更有益处，其中

持有这种观点。然而，

的调查对象表示，

在保护个人信息隐私方面更有害而不是有益处。

Americans' opinions of whether AI helps or hurts in specific settings (% of total), 2023
美国人对于在特定环境中人工智能是否有帮助或有害的看法（占总数的百分比），2023 年

Source: Pew Research, 2023 | Chart: 2024 AI Index report
来源：皮尤研究中心，2023 年 | 图表：2024 年人工智能指数报告

Pew further segmented the data by education level (Figure 9.1.14). Across various use categories, Americans with higher education levels are more likely to believe in AI's potential to help rather than harm. For instance, individuals with college or higher-level degrees are more likely to report that AI can significantly aid doctors in delivering quality care to patients and assist people in discovering products and services online that interest them.
皮尤进一步按教育水平细分了数据（图 9.1.14）。在各种使用类别中，受过较高教育的美国人更可能认为人工智能有助于而非有害。例如，拥有大学或更高学位的个人更可能报告称人工智能可以显著帮助医生提供优质护理给患者，并帮助人们在线发现他们感兴趣的产品和服务。

Differences in Americans' view of AI's impact by education level (% of total), 2023
Source: Pew Research, 2023 | Chart: 2024 AI Index report
美国人对 Al 影响的观点在教育水平上的差异（% 总数），2023 年来源：皮尤研究中心，2023 年 | 图表：2024 年 Al 指数报告

Dominant Models 主导模型

Public attitudes toward AI can be assessed through both quantitative and qualitative analyses of posts made on social media. Quid analyzed social conversations surrounding AI models across various sectors from January to December 2023, examining over 7 million social media posts.
对社交媒体上发布的帖子进行数量和质量分析可以评估公众对 AI 的态度。Quid 分析了从 2023 年 1 月到 12 月在不同领域中涉及 AI 模型的社交对话，检查了 700 多万条社交媒体帖子。

Figure 9.2.1 shows the net sentiment score of various AI models released throughout the year. The net sentiment score expresses the ratio of positive to negative sentiment around a given topic. A net sentiment score of +100 means that all conversation is positive; a score of -100 means that all conversation is negative. Many models released in 2023 received positive social media sentiment. Some of the models that garnered the highest degree of positive attention were GraphCast, a new AI-powered weather forecasting system from DeepMind, and Claude 2.1, one of Anthropic's most recent LLMs.
图 9.2.1 显示了整年中发布的各种 Al 模型的净情感分数。净情感分数表示围绕特定话题的正面情感和负面情感的比例。净情感分数为+100 表示全部对话都是正面的；分数为-100 表示全部对话都是负面的。2023 年发布的许多模型得到了社交媒体上的积极情感。一些受到最高程度的积极关注的模型包括 GraphCast，这是 DeepMind 推出的一款新型 Al 驱动的天气预测系统，以及 Claude 2.1，Anthropic 最近的LLMs之一。
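
The report does not give Quid's exact formula for the net sentiment score. The sketch below uses one common formulation that matches the stated endpoints (+100 when every post is positive, -100 when every post is negative). The function name, the use of raw post counts, and the choice to ignore neutral posts are all assumptions made for illustration.

```python
def net_sentiment_score(positive: int, negative: int) -> float:
    """Return a score in [-100, 100] from counts of positive and negative posts.

    Assumed formula (not confirmed by the report):
        100 * (positive - negative) / (positive + negative)
    Neutral posts are ignored in this sketch.
    """
    total = positive + negative
    if total == 0:
        raise ValueError("need at least one positive or negative post")
    return 100.0 * (positive - negative) / total


# Hypothetical counts: 97 positive and 3 negative posts.
print(net_sentiment_score(97, 3))
```

Under this formulation, a score of 0 means positive and negative posts are evenly balanced.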

Net sentiment score of AI models by quarter, 2023
AI 模型的净情感分数按季度划分，2023 年

Source: Quid, 2023 | Chart: 2024 AI Index report
源：Quid，2023 | 图表：2024 Al 指数报告

	2023/Q1	2023/Q2	2023/Q3	2023/Q4
Stable Diffusion	40	37	53	45
Copilot	56	66	73	67
GPT-4	42	48	51	44
PaLM 2		62	66	75
DALL-E 3			74	72
Mistral 7B			92	56
Grok				57
GPT-4 Turbo				68
Whisper V3				83
GraphCast				94
Claude 2.1				87
Stable Video Diffusion				66
Orca 2				83
Inflection-2				81
Gemini				36
Midjourney v6				71

Figure 9.2.1 图 9.2.1

Figure 9.2.2 highlights the proportion of AI-related social media conversation that was dominated by the release of particular models.

GPT-4 remained a dominant topic of consumer conversation throughout the year. Despite the release of numerous new models by the fourth quarter of 2023, GPT-4 still captured

of social media attention. Other models that garnered significant attention included Grok, Stable Diffusion, and Gemini.
图 9.2.2 突出了由特定型号发布主导的与 Al 相关的社交媒体对话的比例。

GPT-4 在整个年度仍然是消费者对话的主要话题。尽管到 2023 年第四季度发布了许多新型号，但 GPT-4 仍然占据了

的社交媒体关注度。其他引起重大关注的型号包括 Grok、Stable Diffusion 和 Gemini。

Select models' share of AI social media attention by quarter, 2023
选择模型在 2023 年每个季度的 AI 社交媒体关注度份额

Source: Quid, 2023 | Chart: 2024 AI Index report
来源：Quid，2023 年 | 图表：2024 年 AI 指数报告
Models charted by quarter (2023/Q1 to 2023/Q4), values not recoverable: Stable Diffusion, Copilot, GPT-4, PaLM 2, DALL-E 3, Mistral 7B, Grok, GPT-4 Turbo, Whisper V3, GraphCast, Claude 2.1, Stable Video Diffusion, Orca 2, Inflection-2, Gemini, Midjourney v6.

Highlight: 突出显示:

The following section, featuring data from Quid, profiles specific narratives surrounding the discussion of AI that occurred on social media in 2023. GPT-4 gathered most of the discussion volume in Q2 after its launch on March 14, 2023. Positive sentiment was primarily driven by its improvements, including faster processing speed and improved accuracy, and by praise for its ability to enhance productivity across different types of work tasks, such as coding, corporate collaboration, and content creation. Negative sentiment primarily stemmed from complaints about occasional crashes of the ChatGPT website, along with an open letter led by Elon Musk and supported by over 1,300 artificial intelligence experts, urging AI laboratories to pause training of powerful AI systems. Moreover, there was disagreement regarding the "open letter" and the suggestion to halt AI research, particularly considering its potential to have a positive impact across multiple fields. For example, Andrew Ng posted:
以下部分展示了来自 Quid 的数据，概述了 2023 年社交媒体上关于 Al 讨论的特定叙事。GPT-4 在 2023 年 3 月 14 日发布后，在第二季度收集了大部分讨论量。积极情绪主要源于其改进，包括更快的处理速度、提高的准确性，以及对其提升生产力能力的赞扬，涵盖了编码、企业协作和内容创作等不同类型的工作任务。消极情绪主要来自对 ChatGPT 网站偶尔崩溃的抱怨，以及由埃隆·马斯克领导、得到 1300 多名人工智能专家支持的公开信，敦促 Al 实验室暂停训练强大的 Al 系统。此外，关于“公开信”和建议暂停 Al 研究存在分歧，特别是考虑到其在多个领域可能产生积极影响。例如，Andrew Ng 发表了：

"1/The call for a 6 month moratorium on making Al progress beyond GPT-4 is a terrible idea. I'm seeing many new applications in education, healthcare, food, ... that'll help many people. Improving GPT-4 will help. Lets balance the huge value

AI is creating vs. realistic risks." - @AndrewYNg
"1/对于在 GPT-4 之上取得 Al 进展的 6 个月停顿的呼吁是一个糟糕的主意。我看到在教育、医疗保健、食品等领域出现了许多新的应用程序，这些应用程序将帮助许多人。改进 GPT-4 将会有所帮助。让我们权衡 I 正在创造的巨大价值与现实风险。" - @AndrewYNg
In Q4 2023, discussions surrounding the release of GPT-4 Turbo, launched in November, saw a significant increase. Positive sentiment centered around its innovative features and upgrades that could transform programmers' workflows. These enhancements included longer conversation capabilities, improved contextual understanding, and multimodal ability to generate images. However, some negative feedback arose due to disappointment with the model's knowledge cutoff in April 2023 and slower loading speeds compared to GPT-4. Some of the sample social media posts from this time included:
在 2023 年第四季度，围绕 GPT-4 Turbo 的发布讨论在 11 月份的推出后显著增加。积极情绪集中在其创新功能和升级上，这些功能可以改变程序员的工作流程。这些增强功能包括更长的对话能力、改进的上下文理解能力以及多模态生成图像的能力。然而，由于模型在 2023 年 4 月的知识截止日期和比 GPT-4 更慢的加载速度，一些负面反馈出现了。这个时期的一些社交媒体帖子包括：

"This is just insane... My GPT-4 coding assistant can now: - build and design a frontend - create a backend with working

- correctly hook them up - upload code to GitHub - deploy it to Vercel. I can now build complete apps with nothing more than my voice. The future is here!"

这简直太疯狂了...我的 GPT-4 编码助手现在可以：-构建和设计前端-创建具有工作功能的后端-正确地将它们连接起来-将代码上传到 GitHub-将其部署到 Vercell。我现在可以仅凭声音构建完整的应用程序。未来已经来临！

"Trying to make my LinkedIn profile more interesting if a recruiter is using a large language model like GPT-4 to send me a message. Looks like it works on the public version of my profile!"
试图使我的 LinkedIn 个人资料更有趣，如果招聘人员正在使用像 GPT-4 这样的大型语言模型给我发消息。看起来它在我的个人资料的公共版本上起作用了！

@brdskggs

"GPT-4 Turbo has knowledge of the world up to April 2023. @sama says the team is 'just as annoyed as you, maybe more' that the knowledge is not more updated and that @openai will work to make sure it never gets that outdated again."
GPT-4 Turbo 对截止到 2023 年 4 月的世界有了解。@sama 说团队“和你一样烦恼，也许更烦恼”，因为知识没有更及时更新，@openai 将努力确保它再也不会过时。

- @VentureBeat

Highlight: 突出显示:

Discussions about Stable Diffusion were more prominent in the first half of 2023, but decreased toward the year's end. More posts mentioned Stable Diffusion XL models than Stable Diffusion 2.0 (around 16 times more). Positive sentiment was mainly driven by the tool's rapid increase in popularity, the potential benefits of AI in enhancing creativity, and the excitement surrounding technical advancements and improvements (e.g., enhanced accuracy, better understanding of various concepts, and higher resolution). On the other hand, negative sentiment revolved around concerns about legal and ethical issues related to AI-generated content, such as copyright violations, ownership of AI-created material, and the possible replacement of human artists by AI. Additionally, worries were expressed about the risks and threats linked to artificial intelligence, like its potential harmful effects, the spread of misinformation, and the possibility of AI being used for academic cheating.
2023 年上半年关于 Stable Diffusion 的讨论更为突出，但朝年底减少。更多帖子提到了 Stable Diffusion XL 模型，而不是 Stable Diffusion 2.0（大约多出 16 倍）。积极情绪主要受到该工具在流行度迅速增长、AI 在提升创造力方面的潜在好处，以及围绕技术进步和改进（例如，增强的准确性、对各种概念的更好理解和更高分辨率）的兴奋所驱动。另一方面，消极情绪围绕着与 AI 生成内容相关的法律和道德问题，如侵犯版权、AI 创作物的所有权以及人类艺术家可能被 AI 取代的担忧。此外，人们还表达了对人工智能风险和威胁的担忧，比如其潜在的有害影响、虚假信息的传播以及 AI 可能被用于学术作弊的可能性。

"Very happy about sharing smashed Stable Diffusion models! - In one line of code, we compressed popular text-to-image Stable Diffusion models for A100. - Evaluations across various metrics show significant speedup improvements, energy savings, and

emissions savings. Now looking forward [to] sharing more compression results :) Feel free to contact us to achieve the same on your own models https://pruna.ai/contact;")
非常高兴分享压缩稳定扩散模型！- 一行代码，我们压缩了 A100 上流行的文本到图像的稳定扩散模型。- 通过各种指标进行评估，显示出显著的加速改进、节能以及

的减排。现在期待分享更多的压缩结果 :) 欢迎联系我们以在您自己的模型上实现相同目标 https://pruna.ai/contact;

@Bertrand_Charp

"Stable Diffusion XL with ControlNet is insane. Discover the future of AI with Stability AI's latest innovation: Stable Diffusion XL (SDXL) 1.0! This powerful text-to-image generation model improves image quality and makes it easier for users to create highly detailed images. Built on a massive 3.5 billion-parameter base model, SDXL 1.0 boasts better accuracy and understanding of various concepts. Want to know more? Check out my video where I delve deeper into this groundbreaking technology!" - @work.with.ai
带有 Contro/Net 的稳定扩散 XL 真是疯狂 - 在稳定性 AI 的最新创新中探索 AI 的未来：稳定扩散 XL (SDXL) 1.0！这个强大的文本到图像生成模型提高了图像质量，使用户更容易创建高度详细的图像。基于庞大的 35 亿参数基础模型，SDXL 1.0 拥有更高的准确性和对各种概念的理解。想了解更多？查看我的视频，深入探讨这一开创性技术！ - @work.with.ai

Both Gemini (from Google) and Grok (from xAI) saw an increase in conversations during Q4 due to their late-year launches. Positive feedback for Gemini focused on its improved accuracy and multilingual capabilities, as well as its potential to enhance various Google services like Search and Ads. On the other hand, negative opinions stemmed from concerns about inaccurate results, disappointment over Gemini's delayed release, and skepticism toward the Gemini AI demo.
由于它们在第四季度末推出，Google 的 Gemini（来自 Google）和 Grok（来自 xAI）都看到了对话量的增加。 Gemini 获得的积极反馈主要集中在其改进的准确性和多语言能力，以及其提升各种 Google 服务（如搜索和广告）潜力的能力。另一方面，负面意见主要来自于对不准确结果的担忧，对 Gemini 推迟发布的失望，以及对 Gemini AI 演示的怀疑。

"WHAT IS GOOGLE GEMINI AND HOW CAN YOU USE IT?" - Erik Hyrkas
"GOOGLE GEMINI 是什么以及如何使用它？" - Erik Hyrkas

"Gemini Ultra (if Google is honest) Will Blow Our Minds " - Tina Huang
"双子座 Ultra（如果 Google 是诚实的话）将会让我们大开眼界" - 黄婷

Appendix 附录

Chapter 1	Research and Development 研究与开发
Chapter 2	Technical Performance 技术性能
Chapter 3	Responsible AI 负责任的 AI
Chapter 4	Economy
Chapter 5	Science and Medicine 科学与医学
Chapter 6	Education
Chapter 7	Policy and Governance 政策与治理
Chapter 8	Diversity
Chapter 9	Public Opinion 公众舆论

Chapter 1: Research and Development
第一章：研究与开发

Acknowledgments 致谢

The AI Index would like to acknowledge Ben Cottier and Robi Rahman from Epoch for leading the work analyzing machine learning training costs; Robi Rahman for leading work regarding the national affiliation of notable systems; and James da Costa, for doing coding work instrumental to the sectoral and national affiliation analysis of foundation models.
Al 指数要感谢 Epoch 的 Ben Cottier 和 Robi Rahman 领导分析机器学习训练成本的工作；Robi Rahman 领导关于显著系统的国家隶属性工作；以及 James da Costa，对基础模型的部门和国家隶属性分析进行了关键性编码工作。

AI Conference Attendance
人工智能会议出席情况

The AI Index reached out to the organizers of various AI conferences in 2023 and asked them to provide information on total attendance. Some conferences posted their attendance totals online; when this was the case, the AI Index used those reported totals and did not reach out to the conference organizers.
2023 年，AI 指数联系了各种人工智能会议的组织者，并要求他们提供总出席人数的信息。一些会议在网上发布了他们的出席人数总计；在这种情况下，AI 指数使用了这些报告的总数，而没有联系会议组织者。

CSET

Prepared by Autumn Toney
由 Autumn Toney 准备

The Center for Security and Emerging Technology (CSET) is a policy research organization within Georgetown University's Walsh School of Foreign Service that produces data-driven research at the intersection of security and technology, providing nonpartisan analysis to the policy community.
安全与新兴技术中心（CSET）是乔治城大学沃尔什外交学院内的政策研究机构，致力于在安全和技术交叉领域产出数据驱动研究，为政策社区提供无党派分析。

For more information about how CSET analyzes bibliometric and patent data, see the Country Activity Tracker (CAT) documentation on the Emerging Technology Observatory's website. Using CAT, users can also interact with country bibliometric, patent, and investment data.
有关 CSET 如何分析文献计量和专利数据的更多信息，请参阅 Emerging Technology Observatory 网站上的 Country Activity Tracker（CAT）文档。使用 CAT，用户还可以与国家的文献计量、专利和投资数据进行交互。

Publications From CSET Merged Corpus of Scholarly Literature
来自 CSET 学术文献合并语料库的出版物

Sources 来源

CSET's merged corpus of scholarly literature combines distinct publications from Clarivate's Web of Science, OpenAlex, The Lens, Semantic Scholar, arXiv, and Papers With Code.
CSET 的学术文献合并语料库结合了 Clarivate 的 Web of Science、OpenAlex、The Lens、Semantic Scholar、arXiv 和 Papers With Code 等不同出版物。

Updates: The source list of scholarly literature for CSET's merged corpus has changed from prior years: OpenAlex, The Lens, and Semantic Scholar were added, and Digital Science's Dimensions and the Chinese National Knowledge Infrastructure (CNKI) were removed.
更新：CSET 的学术文献合并语料库的来源列表已从以往年份更改，包括 OpenAlex、The Lens 和 Semantic Scholar，不再包括 Digital Science 的 Dimensions 和中国国家知识基础设施（CNKI）。

Methodology 方法论

To create the merged corpus, CSET deduplicated across the listed sources using publication metadata, and then combined the metadata for linked publications. For analysis of AI publications, CSET used an English-language subset of this corpus published since 2010. CSET researchers developed a classifier for identifying AI-related publications by leveraging the arXiv repository, where authors and editors tag papers by subject.

为了创建合并语料库，CSET 使用出版元数据在列出的来源之间进行了去重，并将链接出版物的元数据进行了合并。对于 AI 出版物的分析，CSET 使用了自 2010 年以来发布的这个语料库的英语子集。CSET 研究人员开发了一个分类器，通过利用 arXiv 存储库来识别与 Al 相关的出版物，作者和编辑通过主题标记论文。

Updates: The AI classifier was updated from the version used in prior years; Dunham, Melot, and Murdick describe the previously implemented classifier, and Schoeberl, Toney, and Dunham describe the updated classifier used in this analysis.
更新：AI 分类器已从之前几年使用的版本进行了更新；Dunham、Melot 和 Murdick 描述了先前实施的分类器；Schoeberl、Toney 和 Dunham 描述了用于此分析的更新分类器。

CSET matched each publication in the analytic corpus with predictions from a field-of-study model derived from Microsoft Academic Graph (MAG)'s taxonomy, which yields hierarchical labels describing the published research field(s) of study and corresponding scores.

CSET researchers identified the most common fields of study in our corpus of AI-relevant publications since 2010 and recorded publications in all other fields as "Other AI." English-language AI-relevant publications were then tallied by their top-scoring field and publication year.
CSET 将分析语料库中的每篇出版物与源自 Microsoft Academic Graph（MAG）分类法的领域模型的预测进行匹配，从而产生描述已发表研究领域及相应得分的分层标签。

CSET 研究人员确定了自 2010 年以来我们的 Al 相关出版物语料库中最常见的研究领域，并将所有其他领域的出版物记录为“其他 Al”。然后，根据其得分最高的领域和出版年份统计了英语 AI 相关出版物。

Updates: The methodology to assign MAG fields of study was updated from the methodology used in prior years. Toney and Dunham describe the field of study assignment pipeline used in this analysis; prior years used the original MAG implementation.
更新：将 MAG 研究领域分配的方法更新为之前几年使用的方法。Toney 和 Dunham 描述了在此分析中使用的研究领域分配流程；之前几年使用了原始的 MAG 实现。

CSET also provided publication counts and year-by-year citations for AI-relevant work associated with each country. A publication is associated with a country if it has at least one author whose organizational affiliation(s) is located in that country. If there is no observed country, the publication receives an "Unknown/Missing" country label. Citation counts aren't available for all publications; those without counts weren't included in the citation analysis. Over

of English-language AI papers published between 2010 and 2022 have citation data available.
CSET 还提供了与每个国家相关的 Al 相关工作的出版物计数和逐年引用。如果一篇出版物至少有一位作者的组织所属机构位于该国，那么该出版物与该国家相关联。如果没有观察到国家，则出版物将被标记为“未知/缺失”国家。并非所有出版物都有引用计数；那些没有计数的出版物不包括在引用分析中。2010 年至 2022 年间发表的英语 Al 论文中有超过

的论文具有引用数据。
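
The country attribution rule above can be sketched as follows. This is a minimal illustration, not CSET's pipeline: the input shape (one list of affiliation countries per author) and the function names are hypothetical. A publication counts once toward every country in which at least one author has an affiliation, and falls back to the "Unknown/Missing" label when no country is observed.

```python
from collections import Counter

UNKNOWN = "Unknown/Missing"  # label named in the text above


def countries_for_publication(author_affiliation_countries):
    """Return the set of countries a single publication is associated with.

    author_affiliation_countries: hypothetical input, a list with one entry
    per author, each entry being a list of that author's affiliation countries.
    """
    countries = {c for author in author_affiliation_countries for c in author if c}
    return countries or {UNKNOWN}


def count_by_country(publications):
    """Tally publications per country; a paper may count toward several countries."""
    counts = Counter()
    for pub in publications:
        counts.update(countries_for_publication(pub))
    return counts


# Hypothetical corpus of three publications.
pubs = [
    [["United States"], ["China", "United States"]],  # two authors
    [[], []],                                          # no observed country
    [["China"]],
]
print(count_by_country(pubs))
```

Because a publication with authors in several countries is counted once per country, per-country totals in this sketch can sum to more than the number of publications.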

Additionally, publication counts by year and by publication type (e.g., academic journal articles, conference papers) were provided where available. These publication types were disaggregated by affiliation country as described above.
此外，还提供了按年份和出版物类型（例如，学术期刊文章、会议论文）的出版物计数，如果有的话。这些出版物类型按上述所述的所属国家进行了细分。

CSET also provided publication affiliation sector(s) where, as in the country attribution analysis, sectors were associated with publications through authors' affiliations. Not all affiliations were characterized in terms of sectors; CSET researchers relied primarily on ROR for this purpose, and not all organizations can be found in or linked to ROR.

Where the affiliation sector is available, papers were counted toward these sectors, by year.
CSET 还提供了出版物所属领域的出版物所属领域，就像在国家归因分析中一样，通过作者的从属关系将部门与出版物关联起来。并非所有从属关系都以部门为特征；CSET 研究人员主要依赖 ROR 来实现这一目的，并非所有组织都可以在 ROR 中找到或与之关联。

如果从属关系部门可用，则按年份计算这些部门的论文。

CSET counted cross-sector collaborations as distinct pairs of sectors across authors for each publication. Collaborations are only counted once: For example, if a publication has two authors with an academic affiliation and two with an industry affiliation, it is counted as a single academic-industry collaboration.
CSET 将跨部门合作视为每篇出版物的作者之间的不同部门对。合作仅计算一次：例如，如果一篇出版物有两位具有学术从属关系的作者和两位具有工业从属关系的作者，则将其视为单个学术-工业合作。
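
The counting rule for cross-sector collaborations can be illustrated with a short sketch. The function name and the flat list of sector labels are assumptions; the point is only that duplicate sectors collapse before pairs are formed, so each distinct pair is counted once per publication.

```python
from itertools import combinations


def cross_sector_pairs(author_sectors):
    """Return the distinct, unordered sector pairs for one publication.

    author_sectors: hypothetical input, a list of sector labels, one per
    author affiliation (duplicates allowed).
    """
    distinct = sorted(set(author_sectors))
    return list(combinations(distinct, 2))


# The example from the text: two academic and two industry authors
# yield a single academic-industry collaboration.
print(cross_sector_pairs(["academic", "academic", "industry", "industry"]))
```

A publication whose authors all belong to one sector produces no pairs, and one with three distinct sectors produces three.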

Patents From CSET's AI and Robotics Patents Dataset
CSET 的 AI 和机器人专利数据集中的专利

Source 源

CSET's AI patents dataset was developed by CSET and 1790 Analytics and includes data from The Lens, 1790 Analytics, and EPO's PATSTAT. Patents relevant to the development and application of AI and robotics were identified by their CPC/IPC codes and keywords.
CSET 的 AI 专利数据集是由 CSET 和 1790 Analytics 开发的，包括来自 The Lens、1790 Analytics 和 EPO 的 PATSTAT 的数据。与 AI 和机器人技术的开发和应用相关的专利通过它们的 CPC/IPC 代码和关键词进行了识别。

Methodology 方法论

In this analysis, patents were grouped by year and country, and then counted at the "patent family" level. CSET extracted year values from the first publication date within a family. Countries are assigned to patents based on the country or filing office where a patent is first filed (e.g., if a patent is filed with the USPTO on January 1, 2020, and then with the German Patenting Office on January 2, 2020, the patent is classified as a patent with U.S. inventors).

Note that the same patent may have multiple countries (but not years) attributed to it if the inventors filed their patent in multiple countries on the same first filing date (e.g., if a patent is filed with the USPTO on January 1, 2020, and then with the German Patenting Office on January 1, 2020, the patent is classified as a patent with U.S. inventors and as a patent with German inventors).
在这项分析中，专利按年份和国家分组，然后在“专利家族”级别进行计数。CSET 从家族内的第一次出版日期中提取年份值。根据专利首次提交的国家或申请办公室，将国家分配给专利（例如，如果专利于 2020 年 1 月 1 日在美国专利商标局提交，然后于 2020 年 1 月 2 日在德国专利局提交，则将该专利分类为具有美国发明家的专利）。请注意，同一专利可能会有多个国家（但没有年份）归属于它，如果发明家在同一首次提交日期在多个国家提交他们的专利（例如，如果专利于 2020 年 1 月 1 日在美国专利商标局提交，然后于 2020 年 1 月 1 日在德国专利局提交，则将该专利分类为具有美国发明家和具有德国发明家的专利）。
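
The first-filing rule can be sketched in a few lines. This is an illustration under stated assumptions, not CSET's code: the function name and the (country, date) tuple layout are hypothetical. A patent family is attributed to every country whose filing shares the earliest filing date.

```python
from datetime import date


def first_filing_countries(filings):
    """Return the countries attributed to one patent family.

    filings: hypothetical input, a list of (country_code, filing_date) tuples.
    All filings made on the earliest date are kept, so a family can be
    attributed to several countries.
    """
    if not filings:
        return set()
    earliest = min(filed for _, filed in filings)
    return {country for country, filed in filings if filed == earliest}


# The two examples from the text.
print(first_filing_countries([("US", date(2020, 1, 1)), ("DE", date(2020, 1, 2))]))
print(first_filing_countries([("US", date(2020, 1, 1)), ("DE", date(2020, 1, 1))]))
```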

Note that patents filed with supranational organizations, such as patents filed under WIPO (the World Intellectual Property Organization), EP (European Patent Organization), and EA (Eurasian Patent Organization), fall under the "Rest of World" category.
请注意，提交给跨国组织的专利，例如在世界知识产权组织（WIPO）、欧洲专利组织（EP）和欧亚专利组织（EA）下提交的专利，属于“其余世界”类别。

Ecosystems Graph Analysis
生态系统图分析

To track the distribution of AI foundation models by country, the AI Index team took the following steps:
为了跟踪 AI 基础模型在各国的分布情况，AI 指数团队采取了以下步骤：

A snapshot of the Ecosystems Graph was taken in early January 2024.
生态系统图的快照于 2024 年初拍摄。
Authors of foundation models are attributed to countries based on their affiliation credited on the paper/technical documentation associated with the model. For international organizations, authors are attributed to the country where the organization is headquartered, unless a more specific location is indicated.
基础模型的作者根据他们在与模型相关的论文/技术文档上列出的所属机构归属于各国。对于国际组织，作者归属于该组织总部所在国家，除非有更具体的地点指示。
All of the landmark publications are aggregated within time periods (e.g., monthly or yearly) with the national contributions added up to determine what each country's contribution to landmark AI research was during each time period.
所有的里程碑出版物都按时间段（例如，每月或每年）汇总，国家贡献被累加起来，以确定每个国家在每个时间段对里程碑人工智能研究的贡献。
The contributions of different countries are compared over time to identify any trends.
不同国家的贡献随时间进行比较，以识别任何趋势。

Epoch Notable Models Analysis
时代显著模型分析

The AI forecasting research group Epoch maintains a dataset of landmark AI and ML models, along with accompanying information about their creators and publications, such as the list of their (co)authors, number of citations, type of AI task accomplished, and amount of compute used in training.
时代的 Al 预测研究小组维护着一个里程碑 AI 和 ML 模型的数据集，以及关于它们的创建者和出版物的相关信息，比如它们的（合著）作者名单、引用次数、完成的 Al 任务类型以及训练中使用的计算量。

The nationalities of the authors of these papers have important implications for geopolitical AI forecasting. As various research institutions and technology companies start producing advanced ML models, the global distribution of future AI development may shift or concentrate in certain places, which in turn affects the geopolitical landscape because AI is expected to become a crucial component of economic and military power in the near future.
这些论文作者的国籍对地缘政治的 Al 预测有重要意义。随着各种研究机构和科技公司开始生产先进的 ML 模型，未来 Al 发展的全球分布可能会转移或集中在某些地方，这反过来影响了地缘政治格局，因为 Al 预计会在不久的将来成为经济和军事力量的关键组成部分。

To track the distribution of AI research contributions on landmark publications by country, the Epoch dataset is coded according to the following methodology:
为了跟踪对里程碑出版物的 AI 研究贡献的国家分布，Epoch 数据集根据以下方法进行编码:

8 In CSET's data analysis for the 2022 AI Index, we used the most recent publication date for a patent family. This method has the advantage of capturing updates within a patent family (such as amendments). However, to remain consistent with CSET's other data products, including the Country Activity Tracker (available at https://cat.eto.tech/), we opted to use the first filing year instead in this data analysis.
在 CSET 对 2022 年 Al 指数的数据分析中，我们使用了专利族的最新出版日期。这种方法的优点是捕捉专利族内的更新（如修正）。然而，为了与 CSET 的其他数据产品保持一致，包括 Country Activity Tracker（可在 https://cat.eto.tech/ 上找到），我们选择在这次数据分析中使用首次申请年份。

A snapshot of the dataset was taken on January 1, 2024. This includes papers about landmark models, selected using the inclusion criteria of importance, relevance, and uniqueness, as described in the Compute Trends dataset documentation.
数据集的快照于 2024 年 1 月 1 日被拍摄。这包括了关于里程碑模型的论文，根据 Compute Trends 数据集文档中描述的重要性、相关性和独特性的包含标准进行选择。
The authors are attributed to countries based on their affiliation credited on the paper. For international organizations, authors are attributed to the country where the organization is headquartered, unless a more specific location is indicated.
作者根据其在论文上被认可的所属机构被归属于国家。对于国际组织，作者被归属于该组织总部所在国家，除非有更具体的地点指示。
All of the landmark publications are aggregated within time periods (e.g., monthly or yearly) with the national contributions added up to determine what each country's contribution to landmark AI research was during each time period.
所有的里程碑出版物都按时间段（例如，每月或每年）汇总，国家贡献被累加起来，以确定每个国家在每个时间段对里程碑人工智能研究的贡献。
The contributions of different countries are compared over time to identify any trends.
不同国家的贡献随时间进行比较，以识别任何趋势。
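
The aggregation step in the methodology above can be sketched as a simple tally. The report does not say how a publication with authors in several countries is weighted; this sketch assumes it counts once toward each country represented, and it uses yearly periods. The function name and input layout are hypothetical.

```python
from collections import Counter, defaultdict


def contributions_by_year(publications):
    """Sum national contributions to landmark publications per year.

    publications: hypothetical input, an iterable of (year, countries) pairs,
    where countries is the set of countries credited on that publication.
    Assumption: each publication counts once per credited country.
    """
    totals = defaultdict(Counter)
    for year, countries in publications:
        for country in countries:
            totals[year][country] += 1
    return {year: dict(counter) for year, counter in totals.items()}


# Hypothetical records, not data from the report.
records = [
    (2022, {"United States"}),
    (2022, {"United States", "United Kingdom"}),
    (2023, {"China"}),
]
print(contributions_by_year(records))
```

Comparing the per-year dictionaries side by side then gives the trend over time that the final step describes.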

GitHub

Identifying AI Projects 识别人工智能项目

In partnership with researchers from Harvard Business School, Microsoft Research, and Microsoft's AI for Good Lab, GitHub identifies public AI repositories following the methodologies of Gonzalez, Zimmerman, and Nagappan, 2020, and Dohmke, Iansiti, and Richards, 2023, using topic labels related to AI/ML and generative AI, respectively, along with the topics "machine learning," "deep learning," or "artificial intelligence." GitHub further augments the dataset with repositories that have a dependency on the PyTorch, TensorFlow, or OpenAI libraries for Python.
与哈佛商学院的研究人员、微软研究部门和微软的 Al for Good 实验室合作，GitHub 根据 Gonzalez、Zimmerman 和 Nagappan, 2020 以及 Dohmke、lansiti 和 Richards, 2023 的方法论，使用与 AI/ML 和生成相关的主题标签，以及"机器学习"、"深度学习"或"人工智能"的主题，识别公共 AI 代码库。GitHub 进一步通过具有对 Python 的 PyTorch、TensorFlow 或 OpenAl 库的依赖关系的代码库来增强数据集。

Mapping AI Projects to Geographic Areas
将 AI 项目映射到地理区域

Public AI projects are mapped to geographic areas using IP address geolocation to determine the mode location of a project's owners each year. Each project owner is assigned a location based on their IP address when interacting with GitHub. If a project owner changes locations within a year, the location for the project would be determined by the mode location of its owners sampled daily in the year. Additionally, the last known location of the project owner is carried forward on a daily basis even if no activities were performed by the project owner that day. For example, if a project owner performed activities within the United States and then became inactive for six days, that project owner would be considered to be in the United States for that seven-day span.
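
The carry-forward and mode rules described above can be sketched for a single project owner. This is an illustration, not GitHub's implementation: the function names and the one-entry-per-day input (a location string, or None for an inactive day) are hypothetical, and days before the first observed location are simply dropped.

```python
from collections import Counter


def carry_forward(daily_observations):
    """Fill inactive days (None) with the last known location."""
    filled = []
    last_known = None
    for location in daily_observations:
        if location is not None:
            last_known = location
        if last_known is not None:
            filled.append(last_known)
    return filled


def yearly_mode_location(daily_observations):
    """Return the most frequent daily location after carrying forward."""
    filled = carry_forward(daily_observations)
    if not filled:
        return None
    return Counter(filled).most_common(1)[0][0]


# The example from the text: one active day in the United States followed by
# six inactive days counts as seven days in the United States.
week = ["United States"] + [None] * 6
print(carry_forward(week))
print(yearly_mode_location(["Canada", None, None, "United States", None]))
```

How ties between equally frequent locations are broken is not specified in the text, so the sketch leaves that to the ordering of `Counter.most_common`.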

Training Cost Analysis 培训成本分析

To create the dataset of cost estimates, the Epoch database was filtered for models released during the large-scale ML era that were above the median of training compute in a two-year window centered on their release date. This filtered for the largest-scale ML models. There were 138 qualifying systems based on these criteria. Of these systems, 48 had sufficient information to estimate the training cost.
为了创建成本估算数据集，对 Epoch 数据库进行了筛选，筛选出在大规模机器学习时代发布的模型

，这些模型的训练计算量在以发布日期为中心的两年窗口内高于中位数。这样筛选出了规模最大的机器学习模型。根据这些标准，有 138 个符合条件的系统。其中有 48 个系统有足够的信息来估算训练成本。
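
The filtering rule can be sketched as below. Several details are assumptions rather than Epoch's documented procedure: the window is taken as 365 days on either side of the release date, the median is computed over every model in the window (including the model itself and models released before the era start), and the function name and tuple layout are hypothetical. The era start of September 1, 2015 comes from the footnote on the cutoff date.

```python
from datetime import date
from statistics import median

ERA_START = date(2015, 9, 1)  # cutoff date given in the report's footnote


def large_scale_models(models, half_window_days=365):
    """Select models above the median training compute of their release window.

    models: hypothetical input, a list of (name, release_date, training_compute)
    tuples. The window spans half_window_days on either side of the release date.
    """
    selected = []
    for name, released, compute in models:
        if ERA_START > released:
            continue  # released before the large-scale ML era
        window = [
            other_compute
            for _, other_date, other_compute in models
            if half_window_days >= abs((other_date - released).days)
        ]
        if compute > median(window):
            selected.append(name)
    return selected


# Hypothetical models and compute values, for illustration only.
catalog = [
    ("a", date(2020, 1, 1), 1e20),
    ("b", date(2020, 6, 1), 1e22),
    ("c", date(2020, 9, 1), 1e21),
    ("d", date(2023, 1, 1), 1e19),
]
print(large_scale_models(catalog))
```

A model that is alone in its window equals the window median and is therefore not selected under the strict "above the median" comparison.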

For the selected ML models, the training time and the type, quantity, and utilization rate of the training hardware were determined from the publication, press release, or technical reports, as applicable. Cloud rental prices for the computing hardware used by these models were collected from online historical archives of cloud vendors' websites.

对于所选的机器学习模型，从出版物、新闻稿或技术报告中确定了训练时间以及训练硬件的类型、数量和利用率，视情况而定。收集了这些模型使用的计算硬件的云租金价格，这些价格来自云供应商网站的在线历史档案

。

9 The selected cutoff date was September 1, 2015, in accordance with Compute Trends Across Three Eras of Machine Learning (Epoch, 2022).
选定的截止日期是 2015 年 9 月 1 日，符合《机器学习三个时代的计算趋势》（Epoch, 2022）的要求。

10 Historic prices were collected from archived snapshots of Amazon Web Services, Microsoft Azure, and Google Cloud Platform price catalogs viewed through the Internet Archive Wayback Machine.
10 个历史价格是从亚马逊网络服务、微软 Azure 和谷歌云平台价格目录的存档快照中收集的，通过互联网档案馆的 Wayback Machine 查看。

Training costs were estimated from the hardware type, quantity, and time by multiplying the hourly cloud rental cost rates (at the time of training) by the quantity of hardware hours. This yielded the cost to train each model using the same hardware used by the authors to train the same model at the time. However, some developers purchased hardware rather than renting cloud computers, so the true costs incurred by the developers may vary.
培训成本是通过将硬件类型、数量和时间乘以每小时的云租赁成本率（在培训时）

来估算的，乘以硬件小时的数量。这样就得出了使用作者在当时用来训练相同模型的相同硬件培训每个模型的成本。然而，一些开发人员购买了硬件，而不是租用云计算机，因此开发人员实际发生的成本可能有所不同。
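
The cost arithmetic described above is a single multiplication, sketched here with hypothetical names and numbers. It assumes hardware hours are the number of accelerators times the wall-clock training hours, and that the rental rate is quoted per accelerator per hour; neither detail is spelled out in the text.

```python
def estimate_training_cost(num_chips, training_hours, hourly_rate_per_chip):
    """Estimate training cost as hardware hours times the hourly rental rate.

    All three parameters are hypothetical inputs: the number of accelerators,
    the wall-clock training time in hours, and the cloud rental price per
    accelerator-hour at the time of training.
    """
    hardware_hours = num_chips * training_hours
    return hardware_hours * hourly_rate_per_chip


# Illustrative figures only (not from the report): 1,000 accelerators
# running for 30 days at 2.00 USD per accelerator-hour.
print(estimate_training_cost(1000, 24 * 30, 2.0))
```

As the text notes, this gives a rental-equivalent figure; developers who bought their hardware may have incurred different costs.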

Various challenges were encountered while estimating the training cost of these models. Often, the developers did not disclose the duration of training or the hardware that was used. In other cases, cloud compute pricing was not available for the hardware. The investigation of training cost trends is continued in a forthcoming Epoch report, including an expanded dataset with more models and hardware prices.
在估算这些模型的培训成本时遇到了各种挑战。通常，开发人员没有透露培训的持续时间或使用的硬件。在其他情况下，云计算价格对于硬件是不可用的。在即将发布的 Epoch 报告中将继续调查培训成本趋势，其中包括一个扩展数据集，其中包含更多的模型和硬件价格。

Chapter 2: Technical Performance

Acknowledgments

The AI Index would like to acknowledge Andrew Shi for his work on a literature review of the environmental impact of AI models; Emily Capstick for her work studying the use of RLHF in machine learning models; Sukrut Oak for his work generating sample Midjourney generations; and Emma Williamson for her work identifying significant AI technical advancements for the timeline.

Benchmarks

AgentBench: Data on AgentBench was taken from the AgentBench paper in January 2024. To learn more about AgentBench, please read the original paper.
BigToM: Data on BigToM was taken from the BigToM paper in January 2024. To learn more about BigToM, please read the original paper.
Chatbot Arena Leaderboard: Data on the Chatbot Arena Leaderboard was taken from the Chatbot Arena Leaderboard in January 2024. To learn more about the Chatbot Arena Leaderboard, please read the original paper.
EditVal: Data on EditVal was taken from the EditVal paper in January 2024. To learn more about EditVal, please read the original paper.
GPQA: Data on GPQA was taken from the GPQA paper in January 2024. To learn more about GPQA, please read the original paper.
GSM8K: Data on GSM8K was taken from the GSM8K Papers With Code leaderboard in January 2024. To learn more about GSM8K, please read the original paper.
HEIM: Data on HEIM was taken from the HEIM leaderboard in January 2024. To learn more about HEIM, please read the original paper.
HELM: Data on HELM was taken from the HELM leaderboard in January 2024. To learn more about HELM, please read the original paper.
HumanEval: Data on HumanEval was taken from the HumanEval Papers With Code leaderboard in January 2024. To learn more about HumanEval, please read the original paper.
MATH: Data on MATH was taken from the MATH Papers With Code leaderboard in January 2024. To learn more about MATH, please read the original paper.
MLAgentBench: Data on MLAgentBench was taken from the MLAgentBench paper in January 2024. To learn more about MLAgentBench, please read the original paper.
MMLU: Data on MMLU was taken from the MMLU Papers With Code leaderboard in January 2024. To learn more about MMLU, please read the original paper.
MMMU: Data on MMMU was taken from the MMMU leaderboard in January 2024. To learn more about MMMU, please read the original paper.
MoCa: Data on MoCa was taken from the MoCa paper in January 2024. To learn more about MoCa, please read the original paper.
PlanBench: Data on PlanBench was taken from the PlanBench paper in January 2024. To learn more about PlanBench, please read the original paper.
SWE-bench: Data on SWE-bench was taken from the SWE-bench leaderboard in January 2024. To learn more about SWE-bench, please read the original paper.
TruthfulQA: Data on TruthfulQA was taken from the TruthfulQA Papers With Code leaderboard in January 2024. To learn more about TruthfulQA, please read the original paper.
UCF101: Data on UCF101 was taken from the UCF101 Papers With Code leaderboard in January 2024. To learn more about UCF101, please read the original paper.
VCR: Data on VCR was taken from the VCR leaderboard in January 2024. To learn more about VCR, please read the original paper.
VisIT-Bench: Data on VisIT-Bench was taken from the VisIT-Bench leaderboard in January 2024. To learn more about VisIT-Bench, please read the original paper.

Environmental Impact

To assess the environmental impact of AI models, the AI Index team surveyed technical reports of prominent foundation models to determine whether the model developers disclosed carbon emissions. The Index also reviewed papers by researchers that estimated the carbon footprint of various models. The technical reports surveyed, as well as the papers estimating the carbon impact of various models, are included in the works cited for this chapter.

RLHF

To identify foundation models using RLHF, the AI Index team reviewed the technical documentation of every foundation model included in the Ecosystem Graph, and searched for evidence that RLHF had been used in the model's development process. The year in which a model is said to have used RLHF refers to the year the model was released.

Works Cited

Agostinelli, A., Denk, T. I., Borsos, Z., Engel, J., Verzetti, M., Caillon, A., Huang, Q., Jansen, A., Roberts, A., Tagliasacchi, M., Sharifi, M., Zeghidour, N. & Frank, C. (2023). MusicLM: Generating Music From Text (arXiv:2301.11325). arXiv. http://arxiv.org/abs/2301.11325.

Ahn, M., Brohan, A., Brown, N., Chebotar, Y., Cortes, O., David, B., Finn, C., Fu, C., Gopalakrishnan, K., Hausman, K., Herzog, A., Ho, D., Hsu, J., Ibarz, J., Ichter, B., Irpan, A., Jang, E., Ruano, R. J., Jeffrey, K., ... Zeng, A. (2022). Do As I Can, Not As I Say: Grounding Language in Robotic Affordances (arXiv:2204.01691). arXiv. https://doi.org/10.48550/arXiv.2204.01691.

Bai, Y., Kadavath, S., Kundu, S., Askell, A., Kernion, J., Jones, A., Chen, A., Goldie, A., Mirhoseini, A., McKinnon, C., Chen, C., Olsson, C., Olah, C., Hernandez, D., Drain, D., Ganguli, D., Li, D., Tran-Johnson, E., Perez, E., ... Kaplan, J. (2022). Constitutional AI: Harmlessness From AI Feedback (arXiv:2212.08073). arXiv. https://doi.org/10.48550/arXiv.2212.08073.

Bairi, R., Sonwane, A., Kanade, A., C, V. D., Iyer, A., Parthasarathy, S., Rajamani, S., Ashok, B. & Shet, S. (2023). CodePlan: Repository-Level Coding Using LLMs and Planning (arXiv:2309.12499). arXiv. https://doi.org/10.48550/arXiv.2309.12499.

Basu, S., Saberi, M., Bhardwaj, S., Chegini, A. M., Massiceti, D., Sanjabi, M., Hu, S. X. & Feizi, S. (2023). EditVal: Benchmarking Diffusion Based Text-Guided Image Editing Methods (arXiv:2310.02426). arXiv. http://arxiv.org/abs/2310.02426.

Besta, M., Blach, N., Kubicek, A., Gerstenberger, R., Podstawski, M., Gianinazzi, L., Gajda, J., Lehmann, T., Niewiadomski, H., Nyczyk, P. & Hoefler, T. (2024). Graph of Thoughts: Solving Elaborate Problems with Large Language Models (arXiv:2308.09687). arXiv. http://arxiv.org/abs/2308.09687.

Bitton, Y., Bansal, H., Hessel, J., Shao, R., Zhu, W., Awadalla, A., Gardner, J., Taori, R. & Schmidt, L. (2023). VisIT-Bench: A Benchmark for Vision-Language Instruction Following Inspired by Real-World Use (arXiv:2308.06595). arXiv. http://arxiv.org/abs/2308.06595.

Blattmann, A., Rombach, R., Ling, H., Dockhorn, T., Kim, S. W., Fidler, S. & Kreis, K. (2023). Align Your Latents: High-Resolution Video Synthesis With Latent Diffusion Models (arXiv:2304.08818). arXiv. http://arxiv.org/abs/2304.08818.

Brohan, A., Brown, N., Carbajal, J., Chebotar, Y., Chen, X., Choromanski, K., Ding, T., Driess, D., Dubey, A., Finn, C., Florence, P., Fu, C., Arenas, M. G., Gopalakrishnan, K., Han, K., Hausman, K., Herzog, A., Hsu, J., Ichter, B., ... Zitkovich, B. (2023). RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control (arXiv:2307.15818). arXiv. https://arxiv.org/abs/2307.15818.

Castaño, J., Martínez-Fernández, S., Franch, X. & Bogner, J. (2023). Exploring the Carbon Footprint of Hugging Face's ML Models: A Repository Mining Study. 2023 ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM), 1-12. https://doi.org/10.1109/ESEM56168.2023.10304801.

Chen, L., Chen, Z., Zhang, Y., Liu, Y., Osman, A. I., Farghali, M., Hua, J., Al-Fatesh, A., Ihara, I., Rooney, D. W. & Yap, P.-S. (2023). "Artificial Intelligence-Based Solutions for Climate Change: A Review." Environmental Chemistry Letters 21, no. 5: 2525-57. https://doi.org/10.1007/s10311-023-01617-y.

Chen, L., Zaharia, M. & Zou, J. (2023). How Is ChatGPT’s Behavior Changing Over Time? (arXiv:2307.09009). arXiv. http://arxiv.org/abs/2307.09009.

Chen, M., Tworek, J., Jun, H., Yuan, Q., Pinto, H. P. de O., Kaplan, J., Edwards, H., Burda, Y., Joseph, N., Brockman, G., Ray, A., Puri, R., Krueger, G., Petrov, M., Khlaaf, H., Sastry, G., Mishkin, P., Chan, B., Gray, S., ... Zaremba, W. (2021). Evaluating Large Language Models Trained on Code (arXiv:2107.03374; Version 2). arXiv. https://doi.org/10.48550/arXiv.2107.03374.

Christiano, P., Leike, J., Brown, T. B., Martic, M., Legg, S. & Amodei, D. (2023). Deep Reinforcement Learning From Human Preferences (arXiv:1706.03741). arXiv. https://doi.org/10.48550/arXiv.1706.03741.

Cobbe, K., Kosaraju, V., Bavarian, M., Chen, M., Jun, H., Kaiser, L., Plappert, M., Tworek, J., Hilton, J., Nakano, R., Hesse, C. & Schulman, J. (2021). Training Verifiers to Solve Math Word Problems (arXiv:2110.14168). arXiv. http://arxiv.org/abs/2110.14168.

Copet, J., Kreuk, F., Gat, I., Remez, T., Kant, D., Synnaeve, G., Adi, Y. & Défossez, A. (2024). Simple and Controllable Music Generation (arXiv:2306.05284). arXiv. https://doi.org/10.48550/arXiv.2306.05284.

Dettmers, T., Pagnoni, A., Holtzman, A. & Zettlemoyer, L. (2023). QLoRA: Efficient Finetuning of Quantized LLMs (arXiv:2305.14314). arXiv. http://arxiv.org/abs/2305.14314.

Driess, D., Xia, F., Sajjadi, M. S. M., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., Huang, W., Chebotar, Y., Sermanet, P., Duckworth, D., Levine, S., Vanhoucke, V., Hausman, K., Toussaint, M., Greff, K., ... Florence, P. (2023). PaLM-E: An Embodied Multimodal Language Model (arXiv:2303.03378). arXiv. http://arxiv.org/abs/2303.03378.

Gandhi, K., Fränken, J.-P., Gerstenberg, T. & Goodman, N. D. (2023). Understanding Social Reasoning in Language Models With Language Models (arXiv:2306.15448). arXiv. http://arxiv.org/abs/2306.15448.

Gemini Team: Anil, R., Borgeaud, S., Wu, Y., Alayrac, J.-B., Yu, J., Soricut, R., Schalkwyk, J., Dai, A. M., Hauth, A., Millican, K., Silver, D., Petrov, S., Johnson, M., Antonoglou, I., Schrittwieser, J., Glaese, A., Chen, J., Pitler, E., ... Vinyals, O. (2023). Gemini: A Family of Highly Capable Multimodal Models (arXiv:2312.11805). arXiv. http://arxiv.org/abs/2312.11805.

Girdhar, R., Singh, M., Brown, A., Duval, Q., Azadi, S., Rambhatla, S. S., Shah, A., Yin, X., Parikh, D. & Misra, I. (2023). Emu Video: Factorizing Text-to-Video Generation by Explicit Image Conditioning (arXiv:2311.10709). arXiv. http://arxiv.org/abs/2311.10709.

Guha, N., Nyarko, J., Ho, D. E., Ré, C., Chilton, A., Narayana, A., Chohlas-Wood, A., Peters, A., Waldon, B., Rockmore, D. N., Zambrano, D., Talisman, D., Hoque, E., Surani, F., Fagan, F., Sarfaty, G., Dickinson, G. M., Porat, H., Hegland, J., ... Li, Z. (2023). LegalBench: A Collaboratively Built Benchmark for Measuring Legal Reasoning in Large Language Models (arXiv:2308.11462). arXiv. http://arxiv.org/abs/2308.11462.

Haque, A., Tancik, M., Efros, A. A., Holynski, A. & Kanazawa, A. (2023). Instruct-NeRF2NeRF: Editing 3D Scenes With Instructions (arXiv:2303.12789). arXiv. http://arxiv.org/abs/2303.12789.

Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D. & Steinhardt, J. (2021). Measuring Massive Multitask Language Understanding (arXiv:2009.03300). arXiv. http://arxiv.org/abs/2009.03300.

Hendrycks, D., Burns, C., Kadavath, S., Arora, A., Basart, S., Tang, E., Song, D. & Steinhardt, J. (2021). Measuring Mathematical Problem Solving With the MATH Dataset (arXiv:2103.03874). arXiv. http://arxiv.org/abs/2103.03874.

Hu, W., Fey, M., Zitnik, M., Dong, Y., Ren, H., Liu, B., Catasta, M. & Leskovec, J. (2021). Open Graph Benchmark: Datasets for Machine Learning on Graphs (arXiv:2005.00687). arXiv. https://doi.org/10.48550/arXiv.2005.00687.

Huang, J., Chen, X., Mishra, S., Zheng, H. S., Yu, A. W., Song, X. & Zhou, D. (2024). Large Language Models Cannot Self-Correct Reasoning Yet (arXiv:2310.01798). arXiv. http://arxiv.org/abs/2310.01798.

Huang, Q., Vora, J., Liang, P. & Leskovec, J. (2023). Benchmarking Large Language Models as AI Research Agents (arXiv:2310.03302). arXiv. http://arxiv.org/abs/2310.03302.

Jimenez, C. E., Yang, J., Wettig, A., Yao, S., Pei, K., Press, O. & Narasimhan, K. (2023). SWE-bench: Can Language Models Resolve Real-World GitHub Issues? (arXiv:2310.06770). arXiv. https://doi.org/10.48550/arXiv.2310.06770.

Jin, D., Pan, E., Oufattole, N., Weng, W.-H., Fang, H. & Szolovits, P. (2020). What Disease Does This Patient Have? A Large-Scale Open Domain Question Answering Dataset From Medical Exams (arXiv:2009.13081). arXiv. http://arxiv.org/abs/2009.13081.

Kıcıman, E., Ness, R., Sharma, A. & Tan, C. (2023). Causal Reasoning and Large Language Models: Opening a New Frontier for Causality (arXiv:2305.00050). arXiv. http://arxiv.org/abs/2305.00050.

Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A. C., Lo, W.-Y., Dollár, P. & Girshick, R. (2023). Segment Anything (arXiv:2304.02643). arXiv. http://arxiv.org/abs/2304.02643.

Kočiský, T., Schwarz, J., Blunsom, P., Dyer, C., Hermann, K. M., Melis, G. & Grefenstette, E. (2018). "The NarrativeQA Reading Comprehension Challenge." Transactions of the Association for Computational Linguistics 6: 317-28. https://doi.org/10.1162/tacl_a_00023.

Krizhevsky, A. (2009). Learning Multiple Layers of Features From Tiny Images. https://www.semanticscholar.org/paper/Learning-Multiple-Layers-of-Features-from-Tiny-Krizhevsky/5d90f06bb70a0a3dced62413346235c02b1aa086.

Kwiatkowski, T., Palomaki, J., Redfield, O., Collins, M., Parikh, A., Alberti, C., Epstein, D., Polosukhin, I., Devlin, J., Lee, K., Toutanova, K., Jones, L., Kelcey, M., Chang, M.-W., Dai, A. M., Uszkoreit, J., Le, Q. & Petrov, S. (2019). "Natural Questions: A Benchmark for Question Answering Research." Transactions of the Association for Computational Linguistics 7: 452-66. https://doi.org/10.1162/tacl_a_00276.

Lee, H., Phatale, S., Mansoor, H., Mesnard, T., Ferret, J., Lu, K., Bishop, C., Hall, E., Carbune, V., Rastogi, A. & Prakash, S. (2023). RLAIF: Scaling Reinforcement Learning From Human Feedback With AI Feedback (arXiv:2309.00267). arXiv. http://arxiv.org/abs/2309.00267.

Lee, T., Yasunaga, M., Meng, C., Mai, Y., Park, J. S., Gupta, A., Zhang, Y., Narayanan, D., Teufel, H. B., Bellagente, M., Kang, M., Park, T., Leskovec, J., Zhu, J.-Y., Fei-Fei, L., Wu, J., Ermon, S. & Liang, P. (2023). Holistic Evaluation of Text-to-Image Models (arXiv:2311.04287). arXiv. https://doi.org/10.48550/arXiv.2311.04287.

Li, J., Cheng, X., Zhao, W. X., Nie, J.-Y. & Wen, J.-R. (2023). HaluEval: A Large-Scale Hallucination Evaluation Benchmark for Large Language Models (arXiv:2305.11747). arXiv. https://doi.org/10.48550/arXiv.2305.11747.

Liang, P., Bommasani, R., Lee, T., Tsipras, D., Soylu, D., Yasunaga, M., Zhang, Y., Narayanan, D., Wu, Y., Kumar, A., Newman, B., Yuan, B., Yan, B., Zhang, C., Cosgrove, C., Manning, C. D., Ré, C., Acosta-Navas, D., Hudson, D. A., ... Koreeda, Y. (2023). Holistic Evaluation of Language Models (arXiv:2211.09110). arXiv. https://doi.org/10.48550/arXiv.2211.09110.

Lin, S., Hilton, J. & Evans, O. (2022). TruthfulQA: Measuring How Models Mimic Human Falsehoods (arXiv:2109.07958). arXiv. https://doi.org/10.48550/arXiv.2109.07958.

Liu, X., Yu, H., Zhang, H., Xu, Y., Lei, X., Lai, H., Gu, Y., Ding, H., Men, K., Yang, K., Zhang, S., Deng, X., Zeng, A., Du, Z., Zhang, C., Shen, S., Zhang, T., Su, Y., Sun, H., ... Tang, J. (2023). AgentBench: Evaluating LLMs as Agents (arXiv:2308.03688). arXiv. https://doi.org/10.48550/arXiv.2308.03688.

Luccioni, A. S., Jernite, Y. & Strubell, E. (2023). Power Hungry Processing: Watts Driving the Cost of AI Deployment? (arXiv:2311.16863). arXiv. http://arxiv.org/abs/2311.16863.

Luo, J., Paduraru, C., Voicu, O., Chervonyi, Y., Munns, S., Li, J., Qian, C., Dutta, P., Davis, J. Q., Wu, N., Yang, X., Chang, C.M., Li, T., Rose, R., Fan, M., Nakhost, H., Liu, T., Kirkman, B., Altamura, F., ... Mankowitz, D. J. (2022). Controlling Commercial Cooling Systems Using Reinforcement Learning (arXiv:2211.07357). arXiv. https://doi.org/10.48550/arXiv.2211.07357.

Maas, A. L., Daly, R. E., Pham, P. T., Huang, D., Ng, A. Y. & Potts, C. (2011). “Learning Word Vectors for Sentiment Analysis.” In D. Lin, Y. Matsumoto & R. Mihalcea, eds., Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies: 142-50. Association for Computational Linguistics. https://aclanthology.org/P11-1015.

Melas-Kyriazi, L., Rupprecht, C., Laina, I. & Vedaldi, A. (2023). RealFusion: 360° Reconstruction of Any Object From a Single Image (arXiv:2302.10663). arXiv. http://arxiv.org/abs/2302.10663.

Mihaylov, T., Clark, P., Khot, T. & Sabharwal, A. (2018). "Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering." In E. Riloff, D. Chiang, J. Hockenmaier & J. Tsujii, eds., Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: 2381-91. Association for Computational Linguistics. https://doi.org/10.18653/v1/D18-1260.

Mirchandani, S., Xia, F., Florence, P., Ichter, B., Driess, D., Arenas, M. G., Rao, K., Sadigh, D. & Zeng, A. (2023). Large Language Models as General Pattern Machines (arXiv:2307.04721). arXiv. https://doi.org/10.48550/arXiv.2307.04721.

Mitchell, M., Palmarini, A. B. & Moskvichev, A. (2023). Comparing Humans, GPT-4, and GPT-4V On Abstraction and Reasoning Tasks (arXiv:2311.09247). arXiv. http://arxiv.org/abs/2311.09247.

Mokady, R., Hertz, A., Aberman, K., Pritch, Y. & Cohen-Or, D. (2022). Null-Text Inversion for Editing Real Images Using Guided Diffusion Models (arXiv:2211.09794). arXiv. https://doi.org/10.48550/arXiv.2211.09794.

Mooij, J. M., Peters, J., Janzing, D., Zscheischler, J. & Schölkopf, B. (2016). "Distinguishing Cause From Effect Using Observational Data: Methods and Benchmarks." The Journal of Machine Learning Research 17, no. 1: 1103-1204.

Nie, A., Zhang, Y., Amdekar, A., Piech, C., Hashimoto, T. & Gerstenberg, T. (2023). MoCa: Measuring Human-Language Model Alignment on Causal and Moral Judgment Tasks (arXiv:2310.19677). arXiv. http://arxiv.org/abs/2310.19677.

Olabi, A. G., Abdelghafar, A. A., Maghrabie, H. M., Sayed, E. T., Rezk, H., Radi, M. A., Obaideen, K. & Abdelkareem, M. A. (2023). "Application of Artificial Intelligence for Prediction, Optimization, and Control of Thermal Energy Storage Systems." Thermal Science and Engineering Progress, 39: 101730. https://doi.org/10.1016/j.tsep.2023.101730.

OpenAI, Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F. L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., Avila, R., Babuschkin, I., Balaji, S., Balcom, V., Baltescu, P., Bao, H., Bavarian, M., Belgum, J., ... Zoph, B. (2024). GPT-4 Technical Report (arXiv:2303.08774). arXiv. https://doi.org/10.48550/arXiv.2303.08774.

Rafailov, R., Sharma, A., Mitchell, E., Ermon, S., Manning, C. D. & Finn, C. (2023). Direct Preference Optimization: Your Language Model Is Secretly a Reward Model (arXiv:2305.18290). arXiv. http://arxiv.org/abs/2305.18290.

Rein, D., Hou, B. L., Stickland, A. C., Petty, J., Pang, R. Y., Dirani, J., Michael, J. & Bowman, S. R. (2023). GPQA: A Graduate-Level Google-Proof Q&A Benchmark (arXiv:2311.12022). arXiv. http://arxiv.org/abs/2311.12022.

Rustia, D. J. A., Chiu, L.-Y., Lu, C.-Y., Wu, Y.-F., Chen, S.-K., Chung, J.-Y., Hsu, J.-C. & Lin, T.-T. (2022). "Towards Intelligent and Integrated Pest Management Through an AIoT-Based Monitoring System." Pest Management Science 78, no. 10: 4288-4302. https://doi.org/10.1002/ps.7048.

Schaeffer, R., Miranda, B. & Koyejo, S. (2023). Are Emergent Abilities of Large Language Models a Mirage? (arXiv:2304.15004). arXiv. http://arxiv.org/abs/2304.15004.
Schaeffer, R., Miranda, B. & Koyejo, S. (2023). 大型语言模型的新兴能力是幻觉吗？（arXiv:2304.15004）。arXiv。http://arxiv.org/abs/2304.15004。

Schneider, F., Kamal, O., Jin, Z. & Schölkopf, B. (2023). Moûusai: Text-to-Music Generation With Long-Context Latent Diffusion (arXiv:2301.11757). arXiv. https://doi.org/10.48550/arXiv.2301.11757.
Schneider, F., Kamal, O., Jin, Z. & Schölkopf, B. (2023). Moûusai：具有长上下文潜在扩散的文本到音乐生成（arXiv:2301.11757）。arXiv。https://doi.org/10.48550/arXiv.2301.11757。

Shams, S. R., Jahani, A., Kalantary, S., Moeinaddini, M. & Khorasani, N. (2021). "Artificial Intelligence Accuracy Assessment in NO2 Concentration Forecasting of Metropolises Air." Scientific Reports 11, no. 1: 1805. https://doi.org/10.1038/s41598-021-81455-6.
Shams, S. R., Jahani, A., Kalantary, S., Moeinaddini, M. & Khorasani, N. (2021). "大都市空气中 NO2 浓度预测的人工智能准确性评估." Scientific Reports 11, no. 1: 1805. https://doi.org/10.1038/s41598-021-81455-6.

Shi, Y., Wang, P., Ye, J., Long, M., Li, K. & Yang, X. (2024). MVDream: Multi-View Diffusion for 3D Generation (arXiv:2308.16512). arXiv. http://arxiv.org/abs/2308.16512.
Shi, Y., Wang, P., Ye, J., Long, M., Li, K. & Yang, X. (2024). MVDream: 多视角扩散用于 3D 生成 (arXiv:2308.16512). arXiv. http://arxiv.org/abs/2308.16512.

Soomro, K., Zamir, A. R. & Shah, M. (2012). UCF101: A Dataset of 101 Human Actions Classes From Videos in the Wild (arXiv:1212.0402; Version 1). arXiv. http://arxiv.org/abs/1212.0402.
Soomro, K., Zamir, A. R. & Shah, M. (2012). UCF101: 野外视频中 101 种人类动作类别的数据集 (arXiv:1212.0402; 版本 1). arXiv. http://arxiv.org/abs/1212.0402.

Stone, A., Xiao, T., Lu, Y., Gopalakrishnan, K., Lee, K.-H., Vuong, Q., Wohlhart, P., Kirmani, S., Zitkovich, B., Xia, F., Finn, C. & Hausman, K. (2023). Open-World Object Manipulation Using Pre-trained Vision-Language Models (arXiv:2303.00905). arXiv. https://doi.org/10.48550/arXiv.2303.00905.
Stone, A., Xiao, T., Lu, Y., Gopalakrishnan, K., Lee, K.-H., Vuong, Q., Wohlhart, P., Kirmani, S., Zitkovich, B., Xia, F., Finn, C. & Hausman, K. (2023). 使用预训练的视觉-语言模型进行开放世界物体操作 (arXiv:2303.00905). arXiv. https://doi.org/10.48550/arXiv.2303.00905.

Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., Bikel, D., Blecher, L., Ferrer, C. C., Chen, M., Cucurull, G., Esiobu, D., Fernandes, J., Fu, J., Fu, W., ... Scialom, T. (2023). Llama 2: Open Foundation and Fine-Tuned Chat Models (arXiv:2307.09288). arXiv. https://doi.org/10.48550/arXiv.2307.09288.
Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., Bikel, D., Blecher, L., Ferrer, C. C., Chen, M., Cucurull, G., Esiobu, D., Fernandes, J., Fu, J., Fu, W., ... Scialom, T. (2023). Llama 2: 开放基础和微调的聊天模型 (arXiv:2307.09288). arXiv. https://doi.org/10.48550/arXiv.2307.09288.

Valmeekam, K., Marquez, M., Olmo, A., Sreedharan, S. & Kambhampati, S. (2023). PlanBench: An Extensible Benchmark for Evaluating Large Language Models on Planning and Reasoning About Change. Thirty-Seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track. https://openreview.net/forum?id=YXog|4uQUO.
Valmeekam, K., Marquez, M., Olmo, A., Sreedharan, S. & Kambhampati, S. (2023). PlanBench: 用于评估大型语言模型在规划和推理变化方面的可扩展基准。第三十七届神经信息处理系统数据集和基准赛道。https://openreview.net/forum?id=YXog|4uQUO.

Voynov, O., Bobrovskikh, G., Karpyshev, P., Galochkin, S., Ardelean, A.-T., Bozhcnko, A., Karmanova, E., Kopanev, P., LabutinRymsho, Y., Rakhimov, R., Safin, A., Serpiva, V., Artemov, A., Burnaev, E., Tsetserukou, D. & Zorin, D. (2023). Multi-sensor LargeScale Dataset for Multi-view 3D Reconstruction. 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 21392-403. https://doi.org/10.1109/CVPR52729.2023.02049.
Voynov, O., Bobrovskikh, G., Karpyshev, P., Galochkin, S., Ardelean, A.-T., Bozhcnko, A., Karmanova, E., Kopanev, P., LabutinRymsho, Y., Rakhimov, R., Safin, A., Serpiva, V., Artemov, A., Burnaev, E., Tsetserukou, D. & Zorin, D. (2023). 多传感器大规模数据集用于多视角三维重建. 2023 年 IEEE/CVF 计算机视觉与模式识别会议(CVPR), 21392-403. https://doi.org/10.1109/CVPR52729.2023.02049.

Walker, C. M. & Gopnik, A. (2014). "Toddlers Infer Higher-Order Relational Principles in Causal Learning." Psychological Science 25, no. 1: 161-69.
Walker, C. M. & Gopnik, A. (2014). "幼儿推断因果学习中的高阶关系原则." 心理科学 25, no. 1: 161-69.

Wang, G., Xie, Y., Jiang, Y., Mandlekar, A., Xiao, C., Zhu, Y., Fan, L. & Anandkumar, A. (2023). Voyager: An Open-Ended Embodied Agent With Large Language Models (arXiv:2305.16291). arXiv. http://arxiv.org/abs/2305.16291.
Wang, G., Xie, Y., Jiang, Y., Mandlekar, A., Xiao, C., Zhu, Y., Fan, L. & Anandkumar, A. (2023). Voyager: 具有大型语言模型的开放式具身代理人 (arXiv:2305.16291). arXiv. http://arxiv.org/abs/2305.16291.

Wei, J., Wang, X., Schuurmans, D., Bosma, M., Ichter, B., Xia, F., Chi, E., Le, Q. & Zhou, D. (2023). Chain-of-Thought Prompting Elicits Reasoning in Large Language Models (arXiv:2201.11903). arXiv. https://doi.org/10.48550/arXiv.2201.11903.

Xiao, T., Chan, H., Sermanet, P., Wahid, A., Brohan, A., Hausman, K., Levine, S. & Tompson, J. (2023). Robotic Skill Acquisition via Instruction Augmentation With Vision-Language Mode/s (arXiv:2211.11736). arXiv. https://doi.org/10.48550/arXiv.2211.11736.

Yang, C., Wang, X., Lu, Y., Liu, H., Le, Q. V., Zhou, D. & Chen, X. (2023). Large Language Models as Optimizers (arXiv:2309.03409). arXiv. http://arxiv.org/abs/2309.03409.

Yang, D., Tian, J., Tan, X., Huang, R., Liu, S., Chang, X., Shi, J., Zhao, S., Bian, J., Wu, X., Zhao, Z., Watanabe, S. & Meng, H. (2023). UniAudio: An Audio Foundation Model Toward Universal Audio Generation (arXiv:2310.00704). arXiv. http://arxiv.org/abs/2310.00704.
杨，D.，田，J.，谭，X.，黄，R.，刘，S.，常，X.，石，J.，赵，S.，边，J.，吴，X.，赵，Z.，渡边，S.和孟，H.（2023 年）。UniAudio:通向通用音频生成的音频基础模型（arXiv：2310.00704）。arXiv。http://arxiv.org/abs/2310.00704。

Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T. L., Cao, Y. & Narasimhan, K. (2023). Tree of Thoughts: Deliberate Problem Solving With Large Language Models (arXiv:2305.10601). arXiv. http://arxiv.org/abs/2305.10601.
姚，S.，于，D.，赵，J.，沙夫兰，I.，格里菲斯，T. L.，曹，Y.和纳拉西姆汉，K.（2023 年）。思想之树：与大型语言模型的刻意问题解决（arXiv：2305.10601）。arXiv。http://arxiv.org/abs/2305.10601。

Zellers, R., Bisk, Y., Farhadi, A. & Choi, Y. (2019). From Recognition to Cognition: Visual Commonsense Reasoning (arXiv:1811.10830). arXiv. http://arxiv.org/abs/1811.10830.
泽勒斯，R.，比斯克，Y.，法哈迪，A.和崔，Y.（2019 年）。从识别到认知：视觉常识推理（arXiv：1811.10830）。arXiv。http://arxiv.org/abs/1811.10830。

Zhang, L., Rao, A. & Agrawala, M. (2023). Adding Conditional Control to Text-to-Image Diffusion Models (arXiv:2302.05543). arXiv. http://arxiv.org/abs/2302.05543.
张，L.，饶，A.和 Agrawala，M.（2023）。向文本到图像扩散模型添加条件控制（arXiv:2302.05543）。arXiv。http://arxiv.org/abs/2302.05543。

Zhang, Z., Han, L., Ghosh, A., Metaxas, D. & Ren, J. (2022). SINE: SINgle Image Editing With Text-to-Image Diffusion Models (arXiv:2212.04489). arXiv. https://doi.org/10.48550/arXiv.2212.04489.
张，Z.，韩，L.，戈什，A.，Metaxas，D.和任，J.（2022）。SINE：使用文本到图像扩散模型进行单图像编辑（arXiv:2212.04489）。arXiv。https://doi.org/10.48550/arXiv.2212.04489。

Chapter 3: Responsible AI

Acknowledgments

The AI Index would like to acknowledge Amelia Hardy for her work as a research assistant on the visualizations and supplementary analysis for this chapter, and Andrew Shi for spearheading the analysis of responsible AI-related conference submissions. The AI Index also acknowledges that the Global State of Responsible AI analysis was done in collaboration with Accenture. The AI Index specifically wants to highlight the contributions of Arnab Chakraborty, Patrick Connolly, Jakub Wiatrak, Ray Eitel-Porter, Dikshita Venkatesh, and Shekhar Tewari to the data collection and analysis.

Conference Submissions Analysis

For the analysis of responsible AI-related conference submissions, the AI Index examined the number of responsible AI-related academic submissions at the following conferences: AAAI, AIES, FAccT, ICML, ICLR, and NeurIPS. Specifically, the team scraped the conference websites or repositories of conference submissions for papers containing relevant keywords indicating they could fall into a particular responsible AI category. The papers were then manually verified by a human team to confirm their categorization. A single paper could belong to multiple responsible AI categories.

The keywords searched include:

Fairness and bias: algorithmic fairness, bias detection, bias mitigation, discrimination, equity in AI, ethical algorithm design, fair data practices, fair ML, fairness and bias, group fairness, individual fairness, justice, non-discrimination, representational fairness, unfair, unfairness.

Privacy and data governance: anonymity, confidentiality, data breach, data ethics, data governance, data integrity, data privacy, data protection, data transparency, differential privacy, inference privacy, machine unlearning, privacy by design, privacy-preserving, secure data storage, trustworthy data curation.

Security: adversarial attack, adversarial learning, AI incident, attacks, audits, cybersecurity, ethical hacking, forensic analysis, fraud detection, red teaming, safety, security, security ethics, threat detection, vulnerability assessment.

Transparency and explainability: algorithmic transparency, audit, auditing, causal reasoning, causality, explainability, explainable AI, explainable models, human-understandable decisions, interpretability, interpretable models, model explainability, outcome explanation, transparency, xAI.
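The keyword screening step described above can be sketched as follows. This is a minimal illustration, not the AI Index's actual tooling: the dictionary holds only a few of the listed keywords per category, and the function name `candidate_categories` and the whole-word, case-insensitive matching rule are assumptions. In the report's workflow, papers flagged this way were still verified manually.

```python
import re

# Abbreviated keyword lists for illustration; the full lists are given in the text.
KEYWORDS = {
    "Fairness and bias": ["algorithmic fairness", "bias mitigation", "group fairness"],
    "Privacy and data governance": ["differential privacy", "data governance", "machine unlearning"],
    "Security": ["adversarial attack", "red teaming", "threat detection"],
    "Transparency and explainability": ["explainability", "interpretability", "explainable AI"],
}


def candidate_categories(text: str) -> list[str]:
    """Return every category that has at least one keyword in the text.

    Matching is case-insensitive and on whole words (an assumption).
    A paper may be flagged for several categories at once.
    """
    lowered = text.lower()
    matches = []
    for category, keywords in KEYWORDS.items():
        for keyword in keywords:
            pattern = r"\b" + re.escape(keyword.lower()) + r"\b"
            if re.search(pattern, lowered):
                matches.append(category)
                break  # one hit is enough to flag the category
    return matches


if __name__ == "__main__":
    title = "Differential Privacy Under Adversarial Attack"
    print(candidate_categories(title))
```

Returning a list rather than a single label reflects the note that one paper can belong to multiple categories.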

Consistency of Responsible AI Benchmark Reporting

For each of the analyzed models (GPT-4, Gemini, Claude 2, Llama 2, Mistral 7B), the AI Index reviewed the official papers published by the model developers at the time of model release for reported academic benchmarks. The AI Index did not consider subsequent benchmark reports by the model developers or external parties. The AI Index also did not include benchmarks on academic or professional exams (e.g., the GRE), benchmarks for modalities other than text, or internal evaluation metrics.

Global State of Responsible AI Survey

Researchers from Stanford conducted a global responsible AI (RAI) survey in collaboration with Accenture. The objective of the questionnaire was to gain an understanding of the current level of RAI adoption globally and allow for a comparison of RAI activities across 19 industries and 22 countries. The survey is further used to develop an early snapshot of current perceptions around the responsible development, deployment, and use of generative AI and how this might affect RAI adoption and mitigation techniques. The survey covers a total of 10 RAI dimensions: Reliability; Privacy and Data Governance; Fairness and Nondiscrimination; Transparency and Explainability; Human Interaction; Societal and Environmental Well-Being; Accountability; Leadership/Principles/Culture; Lawfulness and Compliance; and Organizational Governance. Only some of the survey findings are presented in the AI Index, with a more detailed report, the Global State of Responsible AI Report, coming out in May/June 2024.

Given the limited scalability of user interviews, the researchers opted for a questionnaire-based approach to ensure broad coverage of organizations in different countries and industries. They contracted McGuire Research to run the recruitment and data collection. The team received more than 15,897 responses from 22 countries and 19 industries. The respondents were asked 10 qualifier questions in the survey. Companies were excluded if their global annual revenue was less than 500 million USD and/or the respondent had no visibility into the RAI decision-making process of the company. The final sample included more than 1,000 organizations. The survey had a total of 38 questions, including the 10 qualifier questions.
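The two exclusion criteria stated above can be expressed as a simple eligibility filter. The sketch below is illustrative only: the `Respondent` fields, the function name `is_eligible`, and the sample companies are hypothetical, and the survey's other qualifier questions are not modeled because the text does not describe them.

```python
from dataclasses import dataclass

# Revenue threshold stated in the text: companies below it were excluded.
MIN_REVENUE_USD = 500_000_000


@dataclass
class Respondent:
    company: str                      # hypothetical field names
    global_annual_revenue_usd: float
    has_rai_visibility: bool          # visibility into RAI decision-making


def is_eligible(r: Respondent) -> bool:
    """A respondent is kept only if neither exclusion criterion applies."""
    return r.global_annual_revenue_usd >= MIN_REVENUE_USD and r.has_rai_visibility


if __name__ == "__main__":
    # Made-up respondents, used only to exercise the filter.
    sample = [
        Respondent("ExampleCo A", 750_000_000, True),
        Respondent("ExampleCo B", 120_000_000, True),
        Respondent("ExampleCo C", 2_000_000_000, False),
    ]
    print([r.company for r in sample if is_eligible(r)])
```

Because the criteria are joined by "and/or", failing either one is enough for exclusion, which is why the check requires both conditions to hold.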

Below is the full list of measures that respondents were asked about in the survey and that are referenced in the AI Index subchapters. For each measure, organizations could answer on a four-point scale: Not applied, Ad-hoc, Rolling out, or Fully operationalized. The companies were further given the option to select Other and provide information on mitigation measures not listed.

Fairness measures:

Collection of representative data based on the anticipated user demographics
Making methodology and data sources accessible to third parties (auditors/general public) for independent oversight
Involvement of diverse stakeholders in model development and/or review process
Assessment of performance across different demographic groups
Use of technical bias mitigation techniques during model development
Other (selection of this option opened an optional free-text field)

Data governance measures:

Checks to ensure that the data complies with all relevant laws and regulations and is used with consent, where applicable
Data collection and preparation include assessment of the completeness, uniqueness, consistency, and accuracy of the data
Checks to ensure that the data is representative with respect to the demographic setting within which the final model/system is used
Regular data audits and updates to ensure the relevancy of the data
Process for dataset documentation and traceability throughout the AI life cycle
Remediation plans for and documentation of datasets with shortcomings
Other (selection of this option opened an optional free-text field)

Transparency and explainability:

Documentation of the development process, detailing algorithm design choices, data sources, intended use cases, and limitations
Training programs for stakeholders (incl. users) covering the intended use cases and limitations of the model
Prioritization of simpler models where high interpretability is crucial, even if it sacrifices some performance
Use model explainability tools (e.g., saliency maps) to elucidate model decisions
Other (selection of this option opened an optional free-text field)

Reliability measures:

Mitigation measures for model errors and handling low confidence outputs
Failover plans or other measures to ensure the system's/model's availability
Evaluation of models/systems for vulnerabilities or harmful behavior (i.e., red teaming)
Measures to prevent adversarial attacks
Confidence scoring for model outputs
Comprehensive test cases that cover a wide range of scenarios and metrics
Other (selection of this option opened an optional free-text field)

Security measures:

Basic cybersecurity hygiene practices (e.g., multifactor authentication, access controls, and employee training)
Vetting and validation of cybersecurity measures of third parties in the supply chain
Dedicated AI cybersecurity team and/or personnel explicitly trained for AI-specific cybersecurity
Technical AI-specific cybersecurity checks and measures, e.g., adversarial testing, vulnerability assessments, and data security measures
Resources dedicated to research and monitoring of evolving AI-specific cybersecurity risks and integration in existing cybersecurity processes
Other (selection of this option opened an optional free-text field)
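As a rough illustration of how answers on the four-point adoption scale could be summarized for a single measure, the sketch below computes the share of organizations at each level. The function name `adoption_shares` and the sample answers are hypothetical; the report does not describe its aggregation code.

```python
from collections import Counter

# The four answer options named in the text, ordered from least to most mature.
SCALE = ["Not applied", "Ad-hoc", "Rolling out", "Fully operationalized"]


def adoption_shares(responses: list[str]) -> dict[str, float]:
    """Return the fraction of responses at each level of the scale."""
    counts = Counter(responses)
    unknown = set(counts) - set(SCALE)
    if unknown:
        raise ValueError(f"Unexpected answers: {sorted(unknown)}")
    total = len(responses)
    if total == 0:
        return {level: 0.0 for level in SCALE}
    return {level: counts[level] / total for level in SCALE}


if __name__ == "__main__":
    # Made-up answers for one measure, used only to exercise the function.
    answers = ["Ad-hoc", "Ad-hoc", "Fully operationalized", "Not applied"]
    for level, share in adoption_shares(answers).items():
        print(f"{level}: {share:.0%}")
```

Free-text "Other" answers are deliberately rejected here, since they would need separate manual coding.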

Works Cited 引用文献

Agarwal, A. & Agarwal, H. (2023). "A Seven-Layer Model With Checklists for Standardising Fairness Assessment Throughout the Al Lifecycle." Al Ethics. https://doi.org/10.1007/s43681-023-00266-9.
Agarwal, A. & Agarwal, H.（2023 年）。"具有清单的七层模型，用于在整个人工智能生命周期中标准化公平性评估。" 人工智能伦理学。https://doi.org/10.1007/s43681-023-00266-9.

Alawida, M., Abu Shawar, B., Abiodun, O. I., Mehmood, A., Omolara, A. E. & Al Hwaitat, A. K. (2024). “Unveiling the Dark Side of ChatGPT: Exploring Cyberattacks and Enhancing User Awareness." Information 15, no. 1: 27. https://doi.org/10.3390/info15010027.
阿拉维达，M.，阿布沙华，B.，阿比奥敦，O.I.，梅姆德，A.，奥莫拉拉，A.E.和 Al Hwaitat，A.K.（2024 年）。“揭示 ChatGPT 的黑暗面：探索网络攻击并增强用户意识。” 信息 15，第 1 期：27。https://doi.org/10.3390/info15010027。

Andreotta, A. J., Kirkham, N. & Rizzi, M. (2022). “AI, Big Data, and the Future of Consent." Al & Society 37, no. 4: 1715-28. https://doi.org/10.1007/s00146-021-01262-5.
安德里奥塔，A.J.，柯克汉姆，N.和里齐，M.（2022 年）。“人工智能，大数据及同意的未来。” Ai & Society 37，第 4 期：1715-28。https://doi.org/10.1007/s00146-021-01262-5。

Arous, A., Guesmi, A., Hanif, M. A., Alouani, I. & Shafique, M. (2023). Exploring Machine Learning Privacy/Utility Trade-Off From a Hyperparameters Lens (arXiv:2303.01819). arXiv. http://arxiv.org/abs/2303.01819.
阿鲁斯，A.，古斯米，A.，汉尼夫，M.A.，阿鲁阿尼，I.和沙菲克，M.（2023 年）。从超参数角度探索机器学习隐私/效用权衡（arXiv:2303.01819）。arXiv。http://arxiv.org/abs/2303.01819。

Balasubramaniam, N., Kauppinen, M., Rannisto, A., Hiekkanen, K. & Kujala, S. (2023). “Transparency and Explainability of AI Systems: From Ethical Guidelines to Requirements." Information and Software Technology 159: 107197.
Balasubramaniam，N。，Kauppinen，M。，Rannisto，A。，Hiekkanen，K。& Kujala，S。（2023）。“AI 系统的透明性和可解释性：从伦理指南到要求。”信息与软件技术 159：107197。

https://doi.org/10.1016/j.infsof.2023.107197.

Bommasani, R., Klyman, K., Longpre, S., Kapoor, S., Maslej, N., Xiong, B., Zhang, D. & Liang, P. (2023). The Foundation Model Transparency Index (arXiv:2310.12941). arXiv. https://doi.org/10.48550/arXiv.2310.12941.
Bommasani，R。，Klyman，K。，Longpre，S。，Kapoor，S。，Maslej，N。，Xiong，B。，Zhang，D。& Liang，P。（2023）。基金会模型透明性指数（arXiv:2310.12941）。arXiv。https://doi.org/10.48550/arXiv.2310.12941。

Dhamala, J., Sun, T., Kumar, V., Krishna, S., Pruksachatkun, Y., Chang, K.-W. & Gupta, R. (2021). "BOLD: Dataset and Metrics for Measuring Biases in Open-Ended Language Generation." Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, 862-72. https://doi.org/10.1145/3442188.3445924.
Dhamala，J。，Sun，T。，Kumar，V。，Krishna，S。，Pruksachatkun，Y。，Chang，K.-W。& Gupta，R。（2021）。"BOLD：用于测量开放式语言生成中偏见的数据集和度量标准。" 2021 年 ACM 公平性、责任和透明性会议论文集，862-72。https://doi.org/10.1145/3442188.3445924。

Durmus, E., Nyugen, K., Liao, T. I., Schiefer, N., Askell, A., Bakhtin, A., Chen, C., Hatfield-Dodds, Z., Hernandez, D., Joseph, N., Lovitt, L., McCandlish, S., Sikder, O., Tamkin, A., Thamkul, J., Kaplan, J., Clark, J. & Ganguli, D. (2023). Towards Measuring the Representation of Subjective Global Opinions in Language Models (arXiv:2306.16388). arXiv. https://doi.org/10.48550/arXiv.2306.16388.
Durmus, E., Nyugen, K., Liao, T. I., Schiefer, N., Askell, A., Bakhtin, A., Chen, C., Hatfield-Dodds, Z., Hernandez, D., Joseph, N., Lovitt, L., McCandlish, S., Sikder, O., Tamkin, A., Thamkul, J., Kaplan, J., Clark, J. & Ganguli, D. (2023). 朝着在语言模型中衡量主观全球观点的表示 (arXiv:2306.16388). arXiv. https://doi.org/10.48550/arXiv.2306.16388.

Ganguli, D., Lovitt, L., Kernion, J., Askell, A., Bai, Y., Kadavath, S., Mann, B., Perez, E., Schiefer, N., Ndousse, K., Jones, A., Bowman, S., Chen, A., Conerly, T., DasSarma, N., Drain, D., Elhage, N., El-Showk, S., Fort, S., Clark, J. (2022). Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned (arXiv:2209.07858). arXiv. http://arxiv.org/abs/2209.07858.
Ganguli, D., Lovitt, L., Kernion, J., Askell, A., Bai, Y., Kadavath, S., Mann, B., Perez, E., Schiefer, N., Ndousse, K., Jones, A., Bowman, S., Chen, A., Conerly, T., DasSarma, N., Drain, D., Elhage, N., El-Showk, S., Fort, S., Clark, J. (2022). 通过对抗测试语言模型以减少伤害: 方法、扩展行为和经验教训 (arXiv:2209.07858). arXiv. http://arxiv.org/abs/2209.07858.

Gehman, S., Gururangan, S., Sap, M., Choi, Y. & Smith, N. A. (2020). RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models (arXiv:2009.11462). arXiv. https://doi.org/10.48550/arXiv.2009.11462.
Gehman, S., Gururangan, S., Sap, M., Choi, Y. & Smith, N. A. (2020). RealToxicityPrompts: 评估语言模型中神经毒性退化 (arXiv:2009.11462). arXiv. https://doi.org/10.48550/arXiv.2009.11462.

Grinbaum, A. & Adomaitis, L. (2024). "Dual Use Concerns of Generative Al and Large Language Models." Journal of Responsible Innovation 11, no. 1. https://doi.org/10.1080/23299460.2024.2304381.
Grinbaum, A. & Adomaitis, L. (2024). "生成式人工智能和大型语言模型的双重使用关注."《负责任创新杂志》第 11 卷，第 1 期。https://doi.org/10.1080/23299460.2024.2304381.

Hartvigsen, T., Gabriel, S., Palangi, H., Sap, M., Ray, D. & Kamar, E. (2022). ToxiGen: A Large-Scale Machine-Generated Dataset for Adversarial and Implicit Hate Speech Detection (arXiv:2203.09509v4). arXiv. http://arxiv.org/abs/2203.09509.
Hartvigsen, T., Gabriel, S., Palangi, H., Sap, M., Ray, D. & Kamar, E. (2022). ToxiGen: 用于对抗性和隐性仇恨言论检测的大规模机器生成数据集 (arXiv:2203.09509v4). arXiv. http://arxiv.org/abs/2203.09509.

Ippolito, D., Tramèr, F., Nasr, M., Zhang, C., Jagielski, M., Lee, K., Choquette-Choo, C. A. & Carlini, N. (2023).

Preventing Verbatim Memorization in Language Models Gives a False Sense of Privacy (arXiv:2210.17546v3). arXiv. https://doi.org/10.48550/arXiv.2210.17546.
在语言模型中防止逐字记忆会给人一种虚假的隐私感(arXiv:2210.17546v3)。arXiv。https://doi.org/10.48550/arXiv.2210.17546。

Janssen, M., Brous, P., Estevez, E., Barbosa, L. S. & Janowski, T. (2020). "Data Governance: Organizing Data for Trustworthy Artificial Intelligence." Government Information Quarterly 37, no. 3: 101493. https://doi.org/10.1016/j.giq.2020.101493.
Janssen, M., Brous, P., Estevez, E., Barbosa, L. S. & Janowski, T. (2020)。“数据治理：为值得信赖的人工智能组织数据。”政府信息季刊 37 卷，第 3 期：101493。https://doi.org/10.1016/j.giq.2020.101493。

Li, B., Sun, J. & Poskitt, C. M. (2023). How Generalizable Are Deepfake Detectors? An Empirical Study (arXiv:2308.04177). arXiv. http://arxiv.org/abs/2308.04177.
Li, B., Sun, J. & Poskitt, C. M. (2023)。深度伪造检测器有多通用？一项实证研究(arXiv:2308.04177)。arXiv。http://arxiv.org/abs/2308.04177。

Masood, M., Nawaz, M., Malik, K. M., Javed, A., Irtaza, A. & Malik, H. (2023). “Deepfakes Generation and Detection: State-of-the-Art, Open Challenges, Countermeasures, and Way Forward." Applied Intelligence 53, no. 4: 3974-4026. https://doi.org/10.1007/s10489-022-03766-z.
Masood, M., Nawaz, M., Malik, K. M., Javed, A., Irtaza, A. & Malik, H. (2023). “Deepfakes 生成与检测：现状、挑战、对策和前景." 应用智能 53, 第 4 期：3974-4026. https://doi.org/10.1007/s10489-022-03766-z.

Mehrabi, N., Morstatter, F., Saxena, N., Lerman, K. & Galstyan, A. (2022). "A Survey on Bias and Fairness in Machine Learning." ACM Computing Surveys 54, no. 6: 1-35. https://doi.org/10.1145/3457607.
Mehrabi, N., Morstatter, F., Saxena, N., Lerman, K. & Galstyan, A. (2022). "机器学习中的偏见和公平性调查." ACM 计算调查 54, 第 6 期：1-35. https://doi.org/10.1145/3457607.

Morreale, F., Bahmanteymouri, E., Burmester, B., Chen, A. & Thorp, M. (2023). “The Unwitting Labourer: Extracting Humanness in Al Training." Al & Society. https://doi.org/10.1007/s00146-023-01692-3.
Morreale, F., Bahmanteymouri, E., Burmester, B., Chen, A. & Thorp, M. (2023). “无意识的劳动者：在 AI 训练中提取人性." AI 与社会. https://doi.org/10.1007/s00146-023-01692-3.

Motoki, F., Pinho Neto, V. & Rodrigues, V. (2024). “More Human Than Human: Measuring ChatGPT Political Bias.”
Motoki, F., Pinho Neto, V. & Rodrigues, V. (2024). “比人类更人类：衡量 ChatGPT 的政治偏见。”

Public Choice 198, no. 1: 3-23. https://doi.org/10.1007/s11127-023-01097-2.

Nasr, M., Carlini, N., Hayase, J., Jagielski, M., Cooper, A. F., Ippolito, D., Choquette-Choo, C. A., Wallace, E., Tramèr, F. & Lee, K. (2023). Scalable Extraction of Training Data From (Production) Language Models (arXiv:2311.17035). arXiv. https://doi.org/10.48550/arXiv.2311.17035.
Nasr, M., Carlini, N., Hayase, J., Jagielski, M., Cooper, A. F., Ippolito, D., Choquette-Choo, C. A., Wallace, E., Tramèr, F. & Lee, K. (2023). 从（生产）语言模型中提取训练数据的可扩展方法 (arXiv:2311.17035). arXiv. https://doi.org/10.48550/arXiv.2311.17035.

Omiye, J. A., Lester, J. C., Spichak, S., Rotemberg, V. & Daneshjou, R. (2023). “Large Language Models Propagate Race-Based Medicine." npj Digital Medicine 6, no. 1: 1-4. https://doi.org/10.1038/s41746-023-00939-z.
Omiye, J. A., Lester, J. C., Spichak, S., Rotemberg, V. & Daneshjou, R. (2023). “大型语言模型传播基于种族的医学。”npj 数字医学 6, 第 1 卷：1-4。https://doi.org/10.1038/s41746-023-00939-z。

P, D., Simoes, S. & MacCarthaigh, M. (2023). "Al and Core Electoral Processes: Mapping the Horizons." Al Magazine 44, no. 3: 218-39. https://doi.org/10.1002/aaai.12105.
P, D., Simoes, S. & MacCarthaigh, M. (2023). "Al 和核心选举流程：描绘地平线。"Al 杂志 44, 第 3 卷：218-39。https://doi.org/10.1002/aaai.12105。

Pan, A., Chan, J. S., Zou, A., Li, N., Basart, S., Woodside, T., Ng, J., Zhang, H., Emmons, S. & Hendrycks, D. (2023). Do the Rewards Justify the Means? Measuring Trade-Offs Between Rewards and Ethical Behavior in the MACHIAVELLI Benchmark (arXiv:2304.03279). arXiv. https://doi.org/10.48550/arXiv.2304.03279.
Pan, A., Chan, J. S., Zou, A., Li, N., Basart, S., Woodside, T., Ng, J., Zhang, H., Emmons, S. & Hendrycks, D. (2023). 奖励是否值得手段？在 MACHIAVELLI 基准中衡量奖励与道德行为之间的权衡（arXiv:2304.03279）。arXiv。https://doi.org/10.48550/arXiv.2304.03279。

Parrish, A., Chen, A., Nangia, N., Padmakumar, V., Phang, J., Thompson, J., Htut, P. M. & Bowman, S. R. (2022). BBQ: A Hand-Built Bias Benchmark for Question Answering (arXiv:2110.08193). arXiv. https://doi.org/10.48550/arXiv.2110.08193.
派里什，A.，陈，A.，南吉亚，N.，帕达马库马尔，V.，庞，J.，汤普森，J.，胡特，P. M.和鲍曼，S. R.（2022 年）。烧烤：问答的手工构建偏见基准（arXiv:2110.08193）。arXiv。https://doi.org/10.48550/arXiv.2110.08193。

Pessach, D. & Shmueli, E. (2023). “Algorithmic Fairness.” In L. Rokach, O. Maimon & E. Shmueli (eds.), Machine Learning for Data Science Handbook: Data Mining and Knowledge Discovery Handbook: 867-86. https://doi.org/10.1007/978-3-031-24628-9_37.
佩萨克，D.和舒梅利，E.（2023 年）。“算法公平性。”在 L.罗卡，O.迈蒙和 E.舒梅利（eds.），数据科学手册的机器学习：数据挖掘和知识发现手册：867-86。https://doi.org/10.1007/978-3-031-24628-9_37。

Petrov, A., La Malfa, E., Torr, P. H. S. & Bibi, A. (2023). Language Model Tokenizers Introduce Unfairness Between Languages (arXiv:2305.15425). arXiv. https://doi.org/10.48550/arXiv.2305.15425.
彼得罗夫，A.，拉马尔法，E.，托尔，P. H. S.和比比，A.（2023）。语言模型分词器引入语言之间的不公平（arXiv:2305.15425）。arXiv。https://doi.org/10.48550/arXiv.2305.15425。

Rudin, C. (2019). "Stop Explaining Black Box Machine Learning Models for High Stakes Decisions and Use Interpretable Models Instead." Nature Machine Intelligence 1, no. 5: 206-15. https://doi.org/10.1038/s42256-019-0048-x.
Rudin, C. (2019). "停止解释黑盒机器学习模型用于高风险决策，改用可解释模型。" 《自然机器智能》1 卷，第 5 期：206-15。https://doi.org/10.1038/s42256-019-0048-x。

Senavirathne, N. & Torra, V. (2020). "On the Role of Data Anonymization in Machine Learning Privacy." 2020 IEEE 19th International Conference on Trust, Security and Privacy in Computing and Communications (TrustCom): 664-75. https://doi.org/10.1109/TrustCom50675.2020.00093.
Senavirathne, N. & Torra, V. (2020). "数据匿名化在机器学习隐私中的作用。" 2020 年 IEEE 第 19 届计算与通信中的信任、安全和隐私国际会议（TrustCom）：664-75。https://doi.org/10.1109/TrustCom50675.2020.00093。

Sheth, A., Roy, K. & Gaur, M. (2023). Neurosymbolic AI - Why, What, and How (arXiv:2305.00813). arXiv. https://doi.org/10.48550/arXiv.2305.00813.
Sheth, A., Roy, K. & Gaur, M. (2023). 神经符号人工智能 - 为什么、什么以及如何（arXiv:2305.00813）。arXiv。https://doi.org/10.48550/arXiv.2305.00813。

Shevlane, T., Farquhar, S., Garfinkel, B., Phuong, M., Whittlestone, J., Leung, J., Kokotajlo, D., Marchal, N., Anderljung, M., Kolt, N., Ho, L., Siddarth, D., Avin, S., Hawkins, W., Kim, B., Gabriel, I., Bolina, V., Clark, J., Bengio, Y., ... Dafoe, A. (2023). Model Evaluation for Extreme Risks (arXiv:2305.15324). arXiv. http://arxiv.org/abs/2305.15324.
Shevlane, T., Farquhar, S., Garfinkel, B., Phuong, M., Whittlestone, J., Leung, J., Kokotajlo, D., Marchal, N., Anderljung, M., Kolt, N., Ho, L., Siddarth, D., Avin, S., Hawkins, W., Kim, B., Gabriel, I., Bolina, V., Clark, J., Bengio, Y., ... Dafoe, A. (2023). 极端风险的模型评估 (arXiv:2305.15324). arXiv. http://arxiv.org/abs/2305.15324.

Steinke, T., Nasr, M. & Jagielski, M. (2023). Privacy Auditing With One (1) Training Run (arXiv:2305.08846). arXiv. https://doi.org/10.48550/arXiv.2305.08846.
Steinke, T., Nasr, M. & Jagielski, M. (2023). 一次训练运行的隐私审计 (arXiv:2305.08846). arXiv. https://doi.org/10.48550/arXiv.2305.08846.

Sun, X., Yang, D., Li, X., Zhang, T., Meng, Y., Qiu, H., Wang, G., Hovy, E. & Li, J. (2021). Interpreting Deep Learning Models in Natural Language Processing: A Review (arXiv:2110.10470). arXiv. http://arxiv.org/abs/2110.10470.
Sun, X., Yang, D., Li, X., Zhang, T., Meng, Y., Qiu, H., Wang, G., Hovy, E. & Li, J. (2021). 在自然语言处理中解释深度学习模型：一项综述 (arXiv:2110.10470). arXiv. http://arxiv.org/abs/2110.10470.

Trinh, L. & Liu, Y. (2021). An Examination of Fairness of Al Models for Deepfake Detection (arXiv:2105.00558). arXiv. https://doi.org/10.48550/arXiv.2105.00558.
Trinh, L. & Liu, Y. (2021). 深度伪造检测模型的公平性检验 (arXiv:2105.00558). arXiv. https://doi.org/10.48550/arXiv.2105.00558.

Wang, B., Chen, W., Pei, H., Xie, C., Kang, M., Zhang, C., Xu, C., Xiong, Z., Dutta, R., Schaeffer, R., Truong, S. T., Arora, S., Mazeika, M., Hendrycks, D., Lin, Z., Cheng, Y., Koyejo, S., Song, D. & Li, B. (2024). DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT Models (arXiv:2306.11698). arXiv. https://doi.org/10.48550/arXiv.2306.11698.
Wang, B., Chen, W., Pei, H., Xie, C., Kang, M., Zhang, C., Xu, C., Xiong, Z., Dutta, R., Schaeffer, R., Truong, S. T., Arora, S., Mazeika, M., Hendrycks, D., Lin, Z., Cheng, Y., Koyejo, S., Song, D. & Li, B. (2024). 解码信任：GPT 模型信任度的全面评估 (arXiv:2306.11698). arXiv. https://doi.org/10.48550/arXiv.2306.11698.

Wang, W., Bai, H., Huang, J., Wan, Y., Yuan, Y., Qiu, H., Peng, N. & Lyu, M. R. (2024). New Job, New Gender? Measuring the Social Bias in Image Generation Models (arXiv:2401.00763). arXiv. http://arxiv.org/abs/2401.00763.
Wang, W., Bai, H., Huang, J., Wan, Y., Yuan, Y., Qiu, H., Peng, N. & Lyu, M. R. (2024). 新工作，新性别？衡量图像生成模型中的社会偏见 (arXiv:2401.00763). arXiv. http://arxiv.org/abs/2401.00763.

Wang, Y., Li, H., Han, X., Nakov, P. & Baldwin, T. (2023). Do-Not-Answer: A Dataset for Evaluating Safeguards in LLMs (arXiv:2308.13387). arXiv. http://arxiv.org/abs/2308.13387.

Zou, A., Wang, Z., Carlini, N., Nasr, M., Kolter, J. Z. & Fredrikson, M. (2023). Universal and Transferable Adversarial Attacks on Aligned Language Models (arXiv:2307.15043). arXiv. http://arxiv.org/abs/2307.15043.

Chapter 4: Economy

Acknowledgments

The AI Index would like to acknowledge James da Costa for his work collecting information on significant AI-related economic events, and Emma Williamson for her work collecting data from the Stack Overflow survey.

International Federation of Robotics (IFR)

Data presented in the Robot Installations section was sourced from the "World Robotics 2023" report.

Lightcast

Prepared by Cal McKeever, Julia Nitschke, and Layla O’Kane

Lightcast delivers job market analytics that empower employers, workers, and educators to make data-driven decisions. The company's artificial intelligence technology analyzes hundreds of millions of job postings and real-life career transitions to provide insight into labor market patterns. This real-time strategic intelligence offers crucial insights, such as which jobs are most in demand, the specific skills employers need, and the career directions that offer the highest potential for workers. For more information, visit www.lightcast.io.

Job Postings Data

To support these analyses, Lightcast mined its dataset of millions of job postings collected since 2010.

Lightcast collects postings from over 51,000 online job sites to develop a comprehensive, real-time portrait of labor market demand. It aggregates job postings, removes duplicates, and extracts data from job postings text. This includes information on job title, employer, industry, and region, as well as required experience, education, and skills.

Job postings are useful for understanding trends in the labor market because they allow for a detailed, real-time look at the skills employers seek. To assess the representativeness of job postings data, Lightcast conducts a number of analyses to compare the distribution of job postings to the distribution of official government and other third-party sources in the United States. The primary source of government data on U.S. job postings is the Job Openings and Labor Turnover Survey (JOLTS) program, conducted by the Bureau of Labor Statistics. Based on comparisons between JOLTS and Lightcast, the labor market demand captured by Lightcast data represents over […] of the total labor demand. Jobs not posted online are usually in small businesses (the classic example being the "Help Wanted" sign in the restaurant window) and union hiring halls.

Measuring Demand for AI

To measure employer demand for AI skills, Lightcast uses its skills taxonomy of over 30,000 skills. The list of AI skills from Lightcast data is shown below, with associated skill clusters. While some skills are considered to be in the AI cluster specifically, for the purposes of this report, all skills below were considered AI skills. A job posting was considered an AI job if it mentioned any of these skills in the job text.

Artificial Intelligence: […] Inference, AIOps (Artificial Intelligence for IT Operations), Applications of Artificial Intelligence, Artificial General Intelligence, Artificial Intelligence, Artificial Intelligence Development, Artificial Intelligence Markup Language (AIML), Artificial Intelligence Systems, Azure Cognitive Services, Baidu, Cognitive Automation, Cognitive Computing, Computational Intelligence, Cortana, Ethical AI, Expert Systems, Explainable AI (XAI), IPSoft Amelia, Intelligent Control, Intelligent Systems, Interactive Kiosk, Knowledge Engineering, Knowledge-Based Configuration, Knowledge-Based Systems, Multi-agent Systems, Open Neural Network Exchange (ONNX), OpenAI Gym, Operationalizing AI, Reasoning Systems, Watson Conversation, Watson Studio, Weka

Autonomous Driving: Advanced Driver Assistance Systems, Autonomous Cruise Control Systems, Autonomous System, Autonomous Vehicles, Guidance Navigation and Control Systems, Light Detection and Ranging (LiDAR), OpenCV, Path Analysis, Path Finding, Remote Sensing, Unmanned Aerial Systems (UAS)

Generative Artificial Intelligence: ChatGPT, Generative Adversarial Networks, Generative Artificial Intelligence, Large Language Modeling, Prompt Engineering, Variational Autoencoders

Natural Language Processing (NLP): AI Copywriting, ANTLR, Amazon Textract, Apache OpenNLP, BERT (NLP Model), Chatbot, Computational Linguistics, Conversational AI, Dialog Systems, Fuzzy Logic, Handwriting Recognition, Hugging Face (NLP Framework), Hugging Face Transformers, Intelligent Agent, Intelligent Virtual Assistant, Kaldi, Language Model, Latent Dirichlet Allocation, Lexalytics, Machine Translation, Microsoft LUIS, Natural Language Generation, Natural Language Processing, Natural Language Programming, Natural Language Toolkits, Natural Language Understanding, Optical Character Recognition (OCR), Screen Reader, Semantic Analysis, Semantic Parsing, Semantic Search, Sentiment Analysis, Seq2Seq, Speech Recognition, Speech Recognition Software, Speech Synthesis, Statistical Language Acquisition, Text Mining, Text-to-Speech, Tokenization, Voice Assistant Technology, Voice Interaction, Voice User Interface, Word Embedding, Word2Vec Models, fastText

Neural Networks: Apache MXNet, Artificial Neural Networks, Autoencoders, Caffe2, Chainer (Deep Learning Framework), Convolutional Neural Networks, Cudnn, Deep Learning, Deep Learning Methods, Deeplearning4j, Evolutionary Acquisition of Neural Topologies, Fast.ai, Keras (Neural Network Library), Long Short-Term Memory (LSTM), OpenVINO, PaddlePaddle, Recurrent Neural Network (RNN), TensorFlow

Machine Learning: AdaBoost (Adaptive Boosting), Adversarial Machine Learning, Apache MADlib, Apache Mahout, Apache SINGA, Apache Spark, Association Rule Learning, Automated Machine Learning, Autonomic Computing, AWS SageMaker, Azure Machine Learning, Boosting, Chi-Squared Automatic Interaction Detection (CHAID), Classification and Regression Tree (CART), Cluster Analysis, Collaborative Filtering, Confusion Matrix, Cyber-Physical Systems, Dask (Software), Data Classification, Dbscan, Decision Models, Decision Tree Learning, Dimensionality Reduction, Dlib (C++ Library), Ensemble Methods, Feature Engineering, Feature Extraction, Feature Learning, Feature Selection, Gaussian Process, Genetic Algorithm, Google AutoML, Gradient Boosting, H2O.ai, Hidden Markov Model, Hyperparameter Optimization, Inference Engine, K-Means Clustering, Kernel Methods, Kubeflow, Loss Functions, Machine Learning, Machine Learning Algorithms, Machine Learning Methods, Machine Learning Model Monitoring And Evaluation, Machine Learning Model Training, Markov Chain, Matrix Factorization, Meta Learning, Microsoft Cognitive Toolkit (CNTK), MLflow, MLOps (Machine Learning Operations), mlpack (C++ Library), ModelOps, Naive Bayes Classifier, Perceptron, Predictive Modeling, PyTorch (Machine Learning Library), PyTorch Lightning, Random Forest Algorithm, Recommender Systems, Reinforcement Learning, Scikit-Learn (Python Package), Semi-supervised Learning, Soft Computing, Sorting Algorithm, Supervised Learning, Support Vector Machine, Test Datasets, Theano (Software), Torch (Machine Learning), Training Datasets, Transfer Learning, Transformer (Machine Learning Model), Unsupervised Learning, Vowpal Wabbit, Xgboost

Robotics: Advanced Robotics, Bot Framework, Cognitive Robotics, Motion Planning, Nvidia Jetson, Robot Framework, Robot Operating Systems, Robotic Automation Software, Robotic Liquid Handling Systems, Robotic Programming, Robotic Systems, Servomotor, SLAM Algorithms (Simultaneous Localization and Mapping)

Visual Image Recognition: 3D Reconstruction, Activity Recognition, Computer Vision, Contextual Image Classification, Digital Image Processing, Eye Tracking, Face Detection, Facial Recognition, Gesture Recognition, Image Analysis, Image Matching, Image Recognition, Image Segmentation, Image Sensor, Imagenet, Machine Vision, Motion Analysis, Object Recognition, OmniPage, Pose Estimation
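
The classification rule described in this section (a posting counts as an AI job if its text mentions any listed skill) can be sketched as below. This is a minimal illustration, not Lightcast's actual extraction pipeline: the `AI_SKILLS` mapping is a hypothetical, heavily abbreviated excerpt of the list above, and the function names and the whole-word, case-insensitive matching are assumptions.

```python
import re

# Hypothetical, abbreviated excerpt of the skill list above (illustrative only).
AI_SKILLS = {
    "Generative Artificial Intelligence": ["ChatGPT", "Prompt Engineering"],
    "Natural Language Processing (NLP)": ["Natural Language Processing", "Sentiment Analysis"],
    "Machine Learning": ["Machine Learning", "Gradient Boosting"],
}


def matched_ai_skills(job_text):
    """Return the AI skills mentioned in a posting, grouped by skill cluster."""
    hits = {}
    for cluster, skills in AI_SKILLS.items():
        found = [
            skill
            for skill in skills
            # Assumed matching rule: whole phrase, case-insensitive.
            if re.search(r"(?<!\w)" + re.escape(skill) + r"(?!\w)", job_text, re.IGNORECASE)
        ]
        if found:
            hits[cluster] = found
    return hits


def is_ai_job(job_text):
    """A posting is treated as an AI job if it mentions at least one listed skill."""
    return bool(matched_ai_skills(job_text))
```

Returning the matched skills by cluster, rather than only a yes/no flag, makes it straightforward to count postings per cluster as well as overall.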

LinkedIn

Prepared by Murat Erer, Carl Shan, and Akash Kaura

Country Sample

Included countries represent a select sample of eligible countries with at least […] labor force coverage by LinkedIn and at least […] hires in any given month. India, despite not reaching […] coverage, was included in this sample because of its increasing importance in the global economy.

Skills (AI Engineering and AI Literacy skills)

LinkedIn members self-report their skills on their LinkedIn profiles. Currently, more than 41,000 distinct, standardized skills are identified by LinkedIn. These have been coded and classified by taxonomists at LinkedIn into 249 skill groupings, which are the skill groups represented in the dataset.

Skill groupings are derived by expert taxonomists through a similarity-index methodology that measures skill composition at the industry level. LinkedIn's industry taxonomy and the corresponding NAICS codes can be found here.

This year LinkedIn made updates to the AI skills list and categorized them into "AI Engineering" and "AI Literacy" skills. See the "AI Skills List Update Compared to Last Year" section for more details.

Skills Genome

For any entity (occupation or job, country, sector, etc.), the skill genome is an ordered list (a vector) of the 50 "most characteristic skills" of that entity. These most characteristic skills are identified using a TF-IDF algorithm to identify the most representative skills of the target entity, while down-ranking ubiquitous skills that add little information about that specific entity (e.g., Microsoft Word).

TF-IDF is a statistical measure that evaluates how representative a word (in this case, a skill) is of a selected entity. This is done by multiplying two metrics:

The term frequency of a skill in an entity ("TF").
The logarithmic inverse entity frequency of the skill across a set of entities ("IDF"). This indicates how common or rare a word is in the entire entity set. The closer IDF is to 0, the more common a word is.

So, if the skill is very common across LinkedIn entities and appears in many job or member descriptions, the IDF will approach 0. If, on the other hand, the skill is unique to specific entities, the IDF will approach 1. Details are available in LinkedIn's Skills Genome and the LinkedIn-World Bank Methodology note.
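
The TF-IDF ranking described above can be sketched as follows. This is an illustrative toy, not LinkedIn's implementation: the function name `skill_genome`, the input layout, and in particular the IDF normalization `log(N / df) / log(N)` are assumptions, chosen only because that form yields 0 for a skill present in every entity and 1 for a skill present in exactly one, matching the behavior described in the text. It needs at least two entities.

```python
import math
from collections import Counter


def skill_genome(entity_skills, entity, top_n=50):
    """Rank one entity's skills by TF-IDF and return (top skills, all scores).

    entity_skills maps an entity name to a list of skills, with one list item
    per member listing, so repeated skills raise the term frequency.
    """
    n_entities = len(entity_skills)
    if n_entities < 2:
        raise ValueError("need at least two entities to compute IDF")

    # Entity frequency: number of entities in which each skill appears.
    entity_freq = Counter()
    for skills in entity_skills.values():
        entity_freq.update(set(skills))

    counts = Counter(entity_skills[entity])
    total = sum(counts.values())

    scores = {}
    for skill, count in counts.items():
        tf = count / total
        # Assumed normalization: 0 if the skill is in every entity, 1 if in only one.
        idf = math.log(n_entities / entity_freq[skill]) / math.log(n_entities)
        scores[skill] = tf * idf

    # Highest score first; ties broken alphabetically for a stable order.
    ranked = sorted(scores, key=lambda s: (-scores[s], s))
    return ranked[:top_n], scores


profiles = {
    "Data Scientist": ["Machine Learning", "Machine Learning", "Python", "Microsoft Word"],
    "Accountant": ["Accounting", "Microsoft Word", "Microsoft Word"],
    "Engineer": ["Python", "CAD", "Microsoft Word"],
}
genome, scores = skill_genome(profiles, "Data Scientist")
```

In this toy data, "Microsoft Word" appears in every entity, so its score is zero and it falls to the bottom of the ranking, which is the down-ranking of ubiquitous skills the text describes.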

AI Skills Penetration

The aim of this indicator is to measure the intensity of AI skills in an entity (in a particular country, industry, gender, etc.) through the following methodology:

Compute frequencies for all self-added skills by LinkedIn members in a given entity (occupation, industry, etc.) in 2015-2023.
Re-weight skill frequencies using a TF-IDF model to get the top 50 most representative skills in that entity. These 50 skills compose the "skill genome" of that entity.
Compute the share of skills that belong to the AI skill group out of the top skills in the selected entity.

Interpretation: The AI skill penetration rate signals the prevalence of AI skills across occupations, or the intensity with which LinkedIn members utilize AI skills in their jobs. For example, the top 50 skills for the occupation of engineer are calculated based on the weighted frequency with which they appear in LinkedIn members' profiles. If four of the skills that engineers possess belong to the AI skill group, this measure indicates that the penetration of AI skills is estimated to be 8% among engineers (i.e., 4/50).
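
The final step of the methodology, and the 4/50 example above, can be expressed in a few lines. This is a minimal sketch; the function name and the placeholder skill names are hypothetical.

```python
def ai_skill_penetration(genome, ai_skills):
    """Share of an entity's skill genome that belongs to the AI skill group."""
    if not genome:
        raise ValueError("genome must contain at least one skill")
    return sum(1 for skill in genome if skill in ai_skills) / len(genome)


# Hypothetical genome for "engineer": 4 AI skills among the top 50.
ai_group = {"Machine Learning", "Deep Learning", "Computer Vision", "Natural Language Processing"}
engineer_genome = sorted(ai_group) + [f"Skill {i}" for i in range(46)]
penetration = ai_skill_penetration(engineer_genome, ai_group)  # 4 / 50
```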

Jobs or Occupations

LinkedIn member titles are standardized and grouped into approximately 15,000 occupations. These are not sector- or country-specific. These occupations are further standardized into approximately 3,600 occupation representatives. Occupation representatives group occupations with a common role and specialty, regardless of seniority.

AI Jobs or Occupations

An "AI" job is an occupation representative that requires AI skills to perform the job. Skills penetrations are used as a signal for whether AI skills are prevalent in an occupation representative, in any sector where the occupation representative may exist. Examples of such occupations include (but are not limited to): Machine Learning Engineer, Artificial Intelligence Specialist, Data Scientist, and Computer Vision Engineer.

AI Talent

A LinkedIn member is considered AI talent if they have explicitly added AI skills to their profile and/or they are employed in an AI occupation representative. The counts of AI talent are used to calculate talent concentration metrics. For example, to calculate the country-level AI talent concentration, we use the counts of AI talent at the country level vis-a-vis the counts of LinkedIn members in the respective countries. Note that concentration metrics may be influenced by LinkedIn coverage in these countries and should be utilized with caution.

Relative AI Skills Penetration

To allow for skills penetration comparisons across countries, the skills genomes are calculated and a relevant benchmark is selected (e.g., global average). A ratio is then constructed between a country's and the benchmark's AI skills penetrations, controlling for occupations.

Interpretation: A country's relative AI skills penetration of 1.5 indicates that AI skills are 1.5 times as frequent as in the benchmark, for an overlapping set of occupations.

Global Comparison

For cross-country comparison, we present the relative penetration rate of AI skills, measured as the sum of the penetration of each AI skill across occupations in a given country, divided by the average global penetration of AI skills across the overlapping occupations in a sample of countries.

Interpretation: A relative penetration rate of 2 means that the average penetration of AI skills in that country is two times the global average across the same set of occupations.
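
One simple reading of the relative penetration rate is sketched below: restrict both the country and the benchmark to the occupations they share, then take the ratio of their average penetrations. The function name, the dictionary layout, and the unweighted averaging are assumptions; the report's actual aggregation and any occupation weighting may differ.

```python
def relative_penetration(country, benchmark):
    """Ratio of average AI skill penetration over the occupations both share.

    country and benchmark map occupation name -> AI skill penetration (0 to 1).
    """
    shared = sorted(set(country) & set(benchmark))
    if not shared:
        raise ValueError("no overlapping occupations to compare")
    country_avg = sum(country[o] for o in shared) / len(shared)
    benchmark_avg = sum(benchmark[o] for o in shared) / len(shared)
    if benchmark_avg == 0:
        raise ValueError("benchmark penetration is zero for the shared occupations")
    return country_avg / benchmark_avg


# Hypothetical figures: only "Engineer" and "Analyst" exist in both samples.
country = {"Engineer": 0.12, "Analyst": 0.06, "Chef": 0.0}
global_benchmark = {"Engineer": 0.08, "Analyst": 0.04, "Teacher": 0.02}
ratio = relative_penetration(country, global_benchmark)  # about 1.5
```

Restricting to the overlapping occupations is what keeps a country from scoring high or low merely because its occupational mix differs from the benchmark's.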

Global Comparison: By Industry

The relative AI skills penetration by country and industry provides an in-depth sectoral decomposition of AI skill penetration across industries and sample countries.

Interpretation: A country's relative AI skill penetration rate of 2 in the education sector means that the average penetration of AI skills in that country is two times the global average across the same set of occupations in that sector.

Global Comparison: By Gender

The "Relative AI Skills Penetration by Gender" metric provides a cross-country comparison of AI skill penetrations within each gender, comparing countries' male/female AI skill penetrations to the global average of the same gender. Since the global averages are distinct for each gender, this metric should only be used to compare country rankings within each gender, and not for cross-gender comparisons within countries.

Interpretation: A country's AI skills penetration for women of 1.5 means that female members in that country are 1.5 times more likely to list AI skills than the average female member in all countries pooled together across the same set of occupations that exist in the country/gender combination.

Global Comparison: Across Genders

The "Relative AI Skills Penetration Across Genders" metric allows for cross-gender comparisons within and across countries globally, since we compare the countries' male/female AI skill penetrations to the same global average regardless of gender.

Relative AI Talent Hiring Rate YoY Ratio

LinkedIn Hiring Rate or Overall Hiring Rate is a measure of hires normalized by LinkedIn membership. It is computed as the percentage of LinkedIn members who added a new employer in the same period the job began, divided by the total number of LinkedIn members in the corresponding location.
AI Hiring Rate is computed following the overall hiring rate methodology, but only considering members classified as AI talent.
Relative AI Talent Hiring Rate YoY Ratio is the year-over-year change in AI Hiring Rate relative to Overall Hiring Rate in the same country. For each month, we first calculate the AI Hiring Rate in the country, then divide the AI Hiring Rate by the Overall Hiring Rate in that country, then calculate the YoY change of this ratio, and then take the 12-month moving average using the last 12 months.

Interpretation: In 2023 in India, the ratio of AI talent hiring relative to overall hiring has grown […] year over year.
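
The sequence of steps behind the Relative AI Talent Hiring Rate YoY Ratio can be sketched as below. This is an illustration under stated assumptions, not LinkedIn's code: the function name and list-based inputs are hypothetical, and "YoY change" is interpreted here as the ratio of a month's value to the value 12 months earlier (a percentage-change reading would subtract 1 from each term).

```python
def relative_ai_hiring_yoy(ai_rate, overall_rate):
    """12-month moving average of the YoY ratio of (AI rate / overall rate).

    Both inputs are monthly hiring rates for one country, oldest month first.
    """
    if len(ai_rate) != len(overall_rate):
        raise ValueError("series must have the same length")
    # Step 1: AI hiring rate relative to the overall hiring rate, per month.
    relative = [a / o for a, o in zip(ai_rate, overall_rate)]
    # Step 2: year-over-year ratio (assumed form: this month vs. 12 months earlier).
    yoy = [relative[i] / relative[i - 12] for i in range(12, len(relative))]
    if len(yoy) < 12:
        raise ValueError("need at least 24 months of data")
    # Step 3: moving average over the most recent 12 months.
    return sum(yoy[-12:]) / 12


# Hypothetical data: overall hiring is flat, AI hiring rises 20% in year two.
overall = [0.02] * 24
ai = [0.010] * 12 + [0.012] * 12
value = relative_ai_hiring_yoy(ai, overall)  # about 1.2
```

Dividing by the overall hiring rate before taking the year-over-year step removes economy-wide hiring swings, so the result reflects AI hiring specifically.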

AI Talent Migration

Data on migration comes from the World Bank Group-LinkedIn "Digital Data for Development" partnership (please see Zhu et al. [2018]).

LinkedIn migration rates are derived from the self-identified locations of LinkedIn member profiles. For example, when a LinkedIn member updates his or her location from Paris to London, this is counted as a migration. Migration data is available from 2019 onward. LinkedIn data provides insights into countries on the AI talent gained or lost due to migration trends.

AI talent migration is considered for all members with AI skills or holding AI jobs at time t, with country A as the country of interest and country B as the source of inflows and destination for outflows. Thus, net AI talent migration between country A and country B, for country A, is calculated as follows:

Net flows are defined as total arrivals minus departures within the given time period. LinkedIn membership varies considerably between countries, which makes interpreting absolute movements of members from one country to another difficult. To compare migration flows between countries fairly, migration flows are normalized for the country of interest. For example, if country A is the country of interest, all absolute net flows into and out of country A, regardless of origin and destination countries, are normalized based on LinkedIn membership in country A at the end of each year and multiplied by 10,000. Hence, this metric indicates relative talent migration from all countries to and from country A.
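
The definition above can be written compactly as follows. This is a sketch assembled from the prose description rather than the report's own typeset formula; the symbol names are assumptions, with A the country of interest, B the partner country, and t the time period.

```latex
\text{Net AI talent migration}_{A,B,t}
  = \frac{\text{arrivals}_{B \to A,\,t} - \text{departures}_{A \to B,\,t}}
         {\text{LinkedIn members}_{A,\,t}} \times 10{,}000
```

For example, under this reading, a country with 2,000,000 LinkedIn members that gains 300 AI-talent members from a partner country and loses 100 to it in a year would have a net flow of (300 - 100) / 2,000,000 × 10,000 = 1 per 10,000 members.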

AI Skills List Update Compared to Last Year

LinkedIn introduced "AI Literacy" skills.

a. The following skills were added to the list and categorized as "AI Literacy" skills: ChatGPT, DALL-E, GPT-3, GPT-4, Generative Art, Github Copilot, Google Bard, Midjourney, Prompt Engineering, and Stable Diffusion.

LinkedIn updated the former AI skills list and categorized its skills as "AI Engineering" skills:

a. The following skills were excluded from the list: Alexa, Common Lisp, Data Structures, Gaussian 03, Graph Theory, IBM Watson, Information Retrieval, Jena, Julia (Programming Language), Linked Data, Lisp, Pandas (Software), Parallel Algorithms, Perl Automation, Resource Description Framework, Smalltalk, and dSPACE.

b. The following skills were added to the list: Apache Spark ML, Applied Machine Learning, Audio Synthesis, Autoencoders, Automated Clustering, Automated Feature Engineering, Automated Machine Learning (AutoML), Autoregressive Models, Chatbot Development, Chatbots, Concept Drift Adaptation, Conditional Generation, Conditional Image Generation, Decision Trees, Deep Convolutional Generative Adversarial Networks (DCGAN), Deep Neural Networks (DNN), Generative AI, Generative Adversarial Imitation Learning, Generative Adversarial Networks (GANs), Generative Design Optimization, Generative Flow Models, Generative Modeling, Generative Neural Networks, Generative Optimization, Generative Pre-training, Generative Query Networks (GQNs), Generative Replay Memory, Generative Synthesis, Google Cloud AutoML, Graph Embeddings, Graph Networks, Hyperparameter Optimization, Hyperparameter Tuning, Image Generation, Image Inpainting, Image Synthesis, Image-to-Image Translation, Large Language Models (LLM), MLOps, Machine Learning Algorithms, Machine Translation, Meta-learning, Model Compression, Model Interpretation, Model Training, Music Generation, Neural Network Architecture Design, Predictive Modeling, Probabilistic Generative Models, Probabilistic Programming, Random Forest, Recurrent Neural Networks (RNN), Responsible AI, Style Transfer, StyleGAN, Synthetic Data Generation, Text Generation, Text-to-Image Generation, Time Series Forecasting, Transformer Models, Variational Autoencoders, Variational Autoencoders (VAEs), Video Generation, and k-means clustering.

Quid

Quid Insights prepared by Bill Valle and Heather English

Quid uses its own in-house LLM and other smart search features, as well as traditional Boolean query, to search for focus areas, topics, and keywords within many datasets: social media, news, forums and blogs, companies, patents, as well as other custom feeds of data (e.g., survey data). Quid has many visualization options and data delivery endpoints, including network graphs based on semantic similarity, in-platform dashboarding capabilities, as well as programmatic PostgreSQL database delivery, and so on. Quid applies best-in-class AI and NLP to reveal hidden patterns in large datasets, enabling users to make data-driven decisions accurately, quickly, and efficiently.

Search, Data Sources, and Scope

Over 8 million global public and private company profiles from multiple data sources are indexed to search across company descriptions, while filtering and including metadata ranging from investment information to firmographic information, such as founded year, location, and more. Company information is updated on a weekly basis. The Quid algorithm reads a large amount of text data from each document to make links between different documents based on their similar language. This process is repeated at an immense scale, which produces a network with different clusters identifying distinct topics or focus areas. Trends are identified based on keywords, phrases, people, companies, and institutions that Quid identifies, and the other metadata that is put into the software.

Data

Companies

Organization data is embedded from Capital IQ and Crunchbase. These companies include all types of organizations (private, public, operating, operating as a subsidiary, out of business) throughout the world. The investment data includes private investments, M&A, public offerings, and minority stakes made by PE/VCs, corporate venture arms, governments, and institutions both within and outside the United States. Some data is simply unreachable, for instance, when investors' names or funding amounts are undisclosed. Quid embeds Capital IQ data as a default and adds in data from Crunchbase for the data points that are not captured in Capital IQ. This not only yields comprehensive and accurate data on all global organizations, but it also captures early-stage startups and funding events data. Company information is updated on a weekly basis.

Earnings Calls

Quid leverages earnings call transcript data embedded from Seeking Alpha. For this report, Quid has analyzed mentions of AI-related keywords across all earnings call transcripts from Fortune 500 companies from January 2018 through December 2023. New earnings call transcript data is updated in Quid on the 1st and 15th of every month.

Search Parameters

Boolean query is used to search for focus areas, topics, and keywords within the archived company database, within their business descriptions and websites. We can filter the search results by HQ regions, investment amount, operating status, organization type (private/public), and founding year. Quid then visualizes these companies by semantic similarity. If there are more than 7,000 companies from the search result, Quid selects the 7,000 most relevant companies for visualization based on the language algorithm.

Boolean search: "artificial intelligence" or "AI" or "machine learning" or "deep learning"
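
As a rough illustration of how such an OR query selects companies, the sketch below applies the same four terms to free-text company descriptions. It is not Quid's search engine: the function name is hypothetical, and the whole-word, case-insensitive matching is an assumption made so that "AI" does not match inside words such as "maintains".

```python
import re

# The four terms of the Boolean search above, combined with OR.
TERMS = ["artificial intelligence", "AI", "machine learning", "deep learning"]

# Assumed matching rule: whole words or phrases only, case-insensitive.
PATTERN = re.compile(
    r"(?<!\w)(?:" + "|".join(re.escape(term) for term in TERMS) + r")(?!\w)",
    re.IGNORECASE,
)


def matches_ai_query(description):
    """True if a company description satisfies the OR query."""
    return PATTERN.search(description) is not None


# Hypothetical company descriptions.
descriptions = [
    "We build deep learning tools for radiology",
    "AI-powered logistics platform",
    "A firm that maintains regional aircraft",
]
selected = [d for d in descriptions if matches_ai_query(d)]
```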

Companies

Global AI and ML companies that have received investments (private, IPO, M&A) from January 1, 2013, to December 31, 2023.
Global AI and ML companies that have received over $1.5M for the last 10 years (January 1, 2013, to December 31, 2023).
Global data was also pulled for a Generative AI query (Boolean search: "generative AI" OR "gen AI" OR "generative artificial intelligence") for companies that have received over […] for the last 10 years (January 1, 2013, to December 31, 2023).

Target Event Definitions

Private investments: A private placement is a private sale of newly issued securities (equity or debt) by a company to a selected investor or a selected group of investors. The stakes that buyers take in private placements are often minority stakes (under 50%), although it is possible to take control of a company through a private placement as well, in which case the private placement would be a majority stake investment.

Minority investment: These refer to minority stake acquisitions in Quid, which take place when the buyer acquires less than 50% of the existing ownership stake in entities, asset products, and business divisions.

M&A: This refers to a buyer acquiring more than 50% of the existing ownership stake in entities, asset products, and business divisions.
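The distinction between the last two event types comes down to a threshold on the acquired share of existing ownership. The sketch below assumes the 50% threshold mentioned in the private-investment definition; the function name and the "unspecified" label for a stake of exactly 50% are illustrative choices, since the definitions do not say how that case is treated.

```python
def classify_acquisition(stake_acquired: float) -> str:
    """Classify an acquisition of existing ownership by the fraction acquired.

    `stake_acquired` is a fraction in (0, 1]. The 0.5 threshold is an
    assumption taken from the "under 50%" wording in the definitions above.
    """
    if not 0.0 < stake_acquired <= 1.0:
        raise ValueError("stake_acquired must be in the interval (0, 1]")
    if stake_acquired < 0.5:
        return "minority investment"
    if stake_acquired > 0.5:
        return "M&A"
    # Exactly 50% is covered by neither definition, so it is flagged.
    return "unspecified"


if __name__ == "__main__":
    for stake in (0.3, 0.5, 0.8):
        print(stake, classify_acquisition(stake))
```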

McKinsey & Company

Data used in the Corporate Activity-Industry Adoption section was sourced from the McKinsey Global Survey "The State of AI in 2023: Generative AI's Breakout Year."

The online survey was in the field April 11, 2023, to April 21, 2023, and garnered responses from 1,684 participants representing the full range of regions, industries, company sizes, functional specialties, and tenures. Of those respondents, 913 said their organizations had adopted AI in at least one function and were asked questions about their organization's AI use. To adjust for differences in response rates, the data is weighted by the contribution of each respondent's nation to global GDP.
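The exact weighting procedure is not described here. One plausible reading, sketched below, gives each country a total weight equal to its share of global GDP and splits that weight evenly among the country's respondents. The function name, the input layout, and the even split are all assumptions for illustration, not McKinsey's documented method.

```python
from collections import defaultdict


def gdp_weighted_share(responses, gdp_share):
    """Estimate the GDP-weighted share of respondents answering "yes".

    responses: list of (country, answered_yes) tuples.
    gdp_share: dict mapping country -> share of global GDP.
    Assumption: every respondent from a country carries an equal slice of
    that country's GDP share.
    """
    respondents_per_country = defaultdict(int)
    for country, _ in responses:
        respondents_per_country[country] += 1

    weighted_yes = 0.0
    total_weight = 0.0
    for country, answered_yes in responses:
        weight = gdp_share[country] / respondents_per_country[country]
        total_weight += weight
        if answered_yes:
            weighted_yes += weight
    return weighted_yes / total_weight


if __name__ == "__main__":
    # Made-up data: country "A" is over-represented relative to its GDP share.
    sample = [("A", True), ("A", True), ("A", False), ("B", False)]
    shares = {"A": 0.25, "B": 0.75}
    print(gdp_weighted_share(sample, shares))
```

In this made-up sample the unweighted share of "yes" answers is one half, while the weighted share is lower because the over-represented country contributes less of global GDP.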

The AI Index also considered data from previous iterations of the survey. More specifically, the AI Index made use of data from:

The State of AI in 2022-and a Half Decade in Review

The State of AI in 2021

The State of AI in 2020

AI Proves Its Worth, But Few Scale Impact (2019)

AI Adoption Advances, But Foundational Barriers Remain (2018)

Stack Overflow

Data on the use of AI by developers was sourced from the 2023 Developer Survey. The survey was conducted from May 8, 2023, to May 19, 2023, and incorporates the insights of 89,184 software developers from 185 countries around the world.

Works Cited

Brynjolfsson, E., Li, D. & Raymond, L. R. (2023). Generative AI at Work (Working Paper 31161). National Bureau of Economic Research. https://doi.org/10.3386/w31161.

Cambon, A., Hecht, B., Edelman, B., Ngwe, D., Jaffe, S., Heger, A., Peng, S., Hofman, J., Farach, A., Bermejo-Cano, M., Knudsen, E., Sanghavi, H., Spatharioti, S., Rothschild, D., Goldstein, D. G., Kalliamvakou, E., Cihon, P., Demirer, M., Schwarz, M. & Teevan, J. (2023). Early LLM-Based Tools for Enterprise Information Workers Likely Provide Meaningful Boosts to Productivity. https://www.microsoft.com/en-us/research/uploads/prod/2023/12/Al-and-Productivity-Report-First-Edition.pdf.

Choi, J. H., Monahan, A. & Schwarcz, D. (2023). "Lawyering in the Age of Artificial Intelligence." Minnesota Law Review (forthcoming). https://doi.org/10.2139/ssrn.4626276.

Chui, M., Hazan, E., Roberts, R., Singla, A., Smaje, K., Sukharevsky, A., Yee, L. & Zemmel, R. (2023). The Economic Potential of Generative AI: The Next Productivity Frontier. McKinsey & Company. https://www.mckinsey.com/capabilities/mckinseydigital/our-insights/the-economic-potential-of-generative-ai-the-next-productivity-frontier.

Chui, M., Yee, L., Hall, B., Singla, A. & Sukharevsky, A. (2023). The State of AI in 2023: Generative AI's Breakout Year. McKinsey & Company. https://www.mckinsey.com/capabilities/quantumblack/our-insights/the-state-of-ai-in-2023generative-ais-breakout-year#widespreadhttp://ceros.mckinsey.com/commentary-ai-2023-lareina-ye-desktop.

Dell'Acqua, F. (2022). "Falling Asleep at the Wheel: Human/AI Collaboration in a Field Experiment on HR Recruiters." Laboratory for Innovation Science, Harvard Business School, working paper. https://static1.squarespace.com/static/604b23e38c22a96e9c78879e/t/62d5d9448d061f7327e8a7e7/1658181956291/Falling+Asleep+at+the+Wheel++Fabrizio+DellAcqua.pdf.

Dell'Acqua, F., McFowland, E., Mollick, E. R., Lifshitz-Assaf, H., Kellogg, K., Rajendran, S., Krayer, L., Candelon, F. & Lakhani, K. R. (2023). "Navigating the Jagged Technological Frontier: Field Experimental Evidence of the Effects of AI on Knowledge Worker Productivity and Quality." Harvard Business School Technology & Operations Mgt. Unit Working Paper No. 24-013. https://doi.org/10.2139/ssrn.4573321.

Hatzius, J., Briggs, J., Kodnani, D. & Pierdomenico, G. (2023). The Potentially Large Effects of Artificial Intelligence on Economic Growth. Goldman Sachs. https://www.gspublishing.com/content/research/en/reports/2023/03/27/d64e052b-0f6e-45d7967b-d7be35fabd16.html.

Chapter 5: Science and Medicine

Acknowledgments

The AI Index would like to acknowledge Emma Williamson for her work surveying the literature on significant AI-related science and medicine trends.

Benchmarks

MedQA: Data on MedQA was taken from the MedQA Papers With Code leaderboard in January 2024. To learn more about MedQA, please read the original paper.

FDA-Approved AI-Medical Devices

Data on FDA-approved AI-medical devices is sourced from the FDA website that tracks artificial intelligence and machine learning (AI/ML)-enabled medical devices.

Works Cited

Cao, K., Xia, Y., Yao, J., Han, X., Lambert, L., Zhang, T., Tang, W., Jin, G., Jiang, H., Fang, X., Nogues, I., Li, X., Guo, W., Wang, Y., Fang, W., Qiu, M., Hou, Y., Kovarnik, T., Vocka, M., Lu, J. (2023). "Large-Scale Pancreatic Cancer Detection via Non-contrast CT and Deep Learning." Nature Medicine 29, no. 12: 3033-3043. https://doi.org/10.1038/s41591-023-02640-w.

Chen, Z., Cano, A. H., Romanou, A., Bonnet, A., Matoba, K., Salvi, F., Pagliardini, M., Fan, S., Köpf, A., Mohtashami, A., Sallinen, A., Sakhaeirad, A., Swamy, V., Krawczuk, I., Bayazit, D., Marmet, A., Montariol, S., Hartley, M.-A., Jaggi, M. & Bosselut, A. (2023). MEDITRON-70B: Scaling Medical Pretraining for Large Language Models (arXiv:2311.16079). arXiv. http://arxiv.org/abs/2311.16079.

Cheng, J., Novati, G., Pan, J., Bycroft, C., Žemgulytè, A., Applebaum, T., Pritzel, A., Wong, L. H., Zielinski, M., Sargeant, T., Schneider, R. G., Senior, A. W., Jumper, J., Hassabis, D., Kohli, P. & Avsec, Ž. (2023). "Accurate Proteome-Wide Missense Variant Effect Prediction With AlphaMissense." Science 381. https://doi.org/10.1126/science.adg7492.

Cid, Y. D., Macpherson, M., Gervais-Andre, L., Zhu, Y., Franco, G., Santeramo, R., Lim, C., Selby, I., Muthuswamy, K., Amlani, A., Hopewell, H., Indrajeet, D., Liakata, M., Hutchinson, C. E., Goh, V. & Montana, G. (2024). "Development and Validation of Open-Source Deep Neural Networks for Comprehensive Chest X-Ray Reading: A Retrospective, Multicentre Study." The Lancet Digital Health 6, no. 1: e44-e57. https://doi.org/10.1016/S2589-7500(23)00218-2.

Fleming, S. L., Lozano, A., Haberkorn, W. J., Jindal, J. A., Reis, E. P., Thapa, R., Blankemeier, L., Genkins, J. Z., Steinberg, E., Nayak, A., Patel, B. S., Chiang, C.-C., Callahan, A., Huo, Z., Gatidis, S., Adams, S. J., Fayanju, O., Shah, S. J., Savage, T., ... Shah, N. H. (2023). MedAlign: A Clinician-Generated Dataset for Instruction Following With Electronic Medical Records (arXiv:2308.14089). arXiv. http://arxiv.org/abs/2308.14089.

Ha, T., Lee, D., Kwon, Y., Park, M. S., Lee, S., Jang, J., Choi, B., Jeon, H., Kim, J., Choi, H., Seo, H.-T., Choi, W., Hong, W., Park, Y. J., Jang, J., Cho, J., Kim, B., Kwon, H., Kim, G., ... Choi, Y.-S. (2023). "AI-Driven Robotic Chemist for Autonomous Synthesis of Organic Molecules." Science Advances 9, no. 44. https://doi.org/10.1126/sciadv.adj0461.

Iglesias, J. E., Billot, B., Balbastre, Y., Magdamo, C., Arnold, S. E., Das, S., Edlow, B. L., Alexander, D. C., Golland, P. & Fischl, B. (2023). "SynthSR: A Public AI Tool to Turn Heterogeneous Clinical Brain Scans into High-Resolution T1-Weighted Images for 3D Morphometry." Science Advances 9, no. 5. https://doi.org/10.1126/sciadv.add3607.

Jin, D., Pan, E., Oufattole, N., Weng, W.-H., Fang, H. & Szolovits, P. (2020). What Disease Does This Patient Have? A Large-Scale Open Domain Question Answering Dataset From Medical Exams (arXiv:2009.13081; Version 1). arXiv. http://arxiv.org/abs/2009.13081.

Kavungal, D., Magalhães, P., Kumar, S. T., Kolla, R., Lashuel, H. A. & Altug, H. (2023). "Artificial Intelligence-Coupled Plasmonic Infrared Sensor for Detection of Structural Protein Biomarkers in Neurodegenerative Diseases." Science Advances 9, no. 28. https://doi.org/10.1126/sciadv.adg9644.

Lam, R., Sanchez-Gonzalez, A., Willson, M., Wirnsberger, P., Fortunato, M., Alet, F., Ravuri, S., Ewalds, T., Eaton-Rosen, Z., Hu, W., Merose, A., Hoyer, S., Holland, G., Vinyals, O., Stott, J., Pritzel, A., Mohamed, S. & Battaglia, P. (2023). "Learning Skillful Medium-Range Global Weather Forecasting." Science 382. https://doi.org/10.1126/science.adi2336.

Liao, W.-W., Asri, M., Ebler, J., Doerr, D., Haukness, M., Hickey, G., Lu, S., Lucas, J. K., Monlong, J., Abel, H. J., Buonaiuto, S., Chang, X. H., Cheng, H., Chu, J., Colonna, V., Eizenga, J. M., Feng, X., Fischer, C., Fulton, R. S., ... Paten, B. (2023). "A Draft Human Pangenome Reference." Nature 617: 312-24. https://doi.org/10.1038/s41586-023-05896-x.

Mankowitz, D. J., Michi, A., Zhernov, A., Gelmi, M., Selvi, M., Paduraru, C., Leurent, E., Iqbal, S., Lespiau, J.-B., Ahern, A., Köppe, T., Millikin, K., Gaffney, S., Elster, S., Broshear, J., Gamble, C., Milan, K., Tung, R., Hwang, M., ... Silver, D. (2023). "Faster Sorting Algorithms Discovered Using Deep Reinforcement Learning." Nature 618: 257-63. https://doi.org/10.1038/s41586-023-06004-9.

Merchant, A., Batzner, S., Schoenholz, S. S., Aykol, M., Cheon, G. & Cubuk, E. D. (2023). "Scaling Deep Learning for Materials Discovery." Nature 624: 80-85. https://doi.org/10.1038/s41586-023-06735-9.

Nearing, G., Cohen, D., Dube, V., Gauch, M., Gilon, O., Harrigan, S., Hassidim, A., Klotz, D., Kratzert, F., Metzger, A., Nevo, S., Pappenberger, F., Prudhomme, C., Shalev, G., Shenzis, S., Tekalign, T., Weitzner, D. & Matias, Y. (2023). AI Increases Global Access to Reliable Flood Forecasts (arXiv:2307.16104). arXiv. http://arxiv.org/abs/2307.16104.

Nori, H., Lee, Y. T., Zhang, S., Carignan, D., Edgar, R., Fusi, N., King, N., Larson, J., Li, Y., Liu, W., Luo, R., McKinney, S. M., Ness, R. O., Poon, H., Qin, T., Usuyama, N., White, C. & Horvitz, E. (2023a). Can Generalist Foundation Models Outcompete Special-Purpose Tuning? Case Study in Medicine (arXiv:2311.16452; Version 1). arXiv. http://arxiv.org/abs/2311.16452.

Schopf, C. M., Ramwala, O. A., Lowry, K. P., Hofvind, S., Marinovich, M. L., Houssami, N., Elmore, J. G., Dontchos, B. N., Lee, J. M. & Lee, C. I. (2024). "Artificial Intelligence-Driven Mammography-Based Future Breast Cancer Risk Prediction: A Systematic Review." Journal of the American College of Radiology 21, no. 2: 319-28. https://doi.org/10.1016/j.jacr.2023.10.018.

Shen, T., Munkberg, J., Hasselgren, J., Yin, K., Wang, Z., Chen, W., Gojcic, Z., Fidler, S., Sharp, N. & Gao, J. (2023). "Flexible Isosurface Extraction for Gradient-Based Mesh Optimization." ACM Transactions on Graphics 42, no. 4: 1-16. https://doi.org/10.1145/3592430.

Thadani, N. N., Gurev, S., Notin, P., Youssef, N., Rollins, N. J., Ritter, D., Sander, C., Gal, Y. & Marks, D. S. (2023). "Learning From Prepandemic Data to Forecast Viral Escape." Nature 622: 818-25. https://doi.org/10.1038/s41586-023-06617-0.

Chapter 6: Education

Code.org

State-Level Data

The following link includes a full description of the methodology used by Code.org to collect its data. The staff at Code.org also maintains a database of the state of American K-12 education and, in this policy primer, provides a greater amount of detail on the state of American education in each state.

AP Computer Science Data

The AP Computer Science data is provided to Code.org as per an agreement the College Board maintains with Code.org. The AP Computer Science data comes from the College Board's national and state summary reports.

Access to Computer Science Education

Data on access to computer science education was drawn from Code.org's State of Computer Science Education 2023 report.

Computing Research Association (CRA Taulbee Survey)

Note: This year's AI Index reused the methodological notes that were submitted by the CRA for previous editions of the AI Index. For more complete delineations of the methodology used by the CRA, please consult the individual CRA surveys that are linked below.

Computing Research Association (CRA) members are 200-plus North American organizations active in computing research: academic departments of computer science and computer engineering; laboratories and centers in industry, government, and academia; and affiliated professional societies (AAAI, ACM, CACS/AIC, IEEE Computer Society, SIAM, USENIX). CRA's mission is to enhance innovation by joining with industry, government, and academia to strengthen research and advance education in computing. Learn more about CRA here.

The CRA Taulbee Survey gathers survey data during the fall of each academic year by reaching out to over 200 PhD-granting departments. Details about the Taulbee Survey can be found here. Taulbee doesn't directly survey the students. The department identifies each new PhD's area of specialization as well as their type of employment. Data is collected from September to January of each academic year for PhDs awarded in the previous academic year. Results are published in May after data collection closes.

The CRA Taulbee Survey is sent only to doctoral departments of computer science, computer engineering, and information science/systems. Historically, (a) Taulbee covers one-quarter to one-third of total BS CS recipients in the United States; (b) the percentage of women earning bachelor's degrees is lower in the Taulbee schools than overall; and (c) Taulbee tracks the trends in overall CS production.

The AI Index used data from the following iterations of the CRA survey:

CRA, 2022

CRA, 2021

CRA, 2020

CRA, 2019

CRA, 2018

CRA, 2017

CRA, 2016

CRA, 2015

CRA, 2014

CRA, 2013

CRA, 2012

CRA, 2011

Impact Research

Data on the usage of ChatGPT in schools among teachers and students came from two Impact Research surveys released in 2023. To learn more about the methodology employed for the surveys, please visit the following links: March 2023 and July 2023.

Informatics Europe

The statistics are annually collected by Informatics Europe and published on the Informatics Europe Higher Education Data Portal, which is updated with the most recent data at the end of the year (typically in December). In the interest of reliability, Informatics Europe collects the data from countries where a solid and reasonably complete picture could be drawn from official sources such as national statistical offices, educational agencies, or ministries. The full list of sources can be found in the Data Portal section "Data Sources." The Data Portal follows the definitions and concepts provided by these national agencies and reflects the national situation in the countries considered. Those aspects that are not exposed by the consulted agencies are not part of the dataset. A full list of definitions and concepts used can be found in the footnotes shown at the bottom of the Statistics section.

Since each national data repository has its own structure and quite often provides all supporting information in the national language, Informatics Europe consults with its members (academics from the respective countries who are active and knowledgeable in the informatics field), who help to interpret the available statistics and who understand the specificities of these countries' higher education systems. One of the main challenges in integrating the statistical data is the identification of terms used to define the informatics discipline in different countries. Informatics is known under different names in different European languages and countries, and in English as well. A good dozen terms (presented in the Data Portal section "Subjects") are used to denote what is fundamentally the same discipline, and the role of national experts here is to help with screening the terms and programs and identifying which part of them is pertinent to the informatics field.

The data covers the degrees delivered by both traditional Research Universities (RU) and Universities of Applied Sciences (UAS) for the countries where these institutions also offer bachelor's and master's studies in informatics. The full list of institutions covered can be found in the Data Portal section "Institutions & Academic Units."

Studyportals

Studyportals is the world's most comprehensive study choice platform. It lists over 200,000 English-taught programs from more than 3,500 institutions, helping over 50 million students per year. The Studyportals analytics and consulting team uses the resulting data to provide higher education organizations with real-time market insights.

Studyportals categorizes the study programs on its portals into disciplines and subdisciplines. The 15 disciplines are broad categories of educational fields to help navigate the portals. The 284 subdisciplines are narrower topics, subdivisions, or specialized areas of disciplines. Paying clients can provide input, but ultimately, data processors manually choose the one to three closest-fitting subdisciplines according to the following scenarios, listed in decreasing order of likelihood:

Classic scenario: When the study name closely matches one subdiscipline name.

a. "Chemistry" -> subdiscipline Chemistry.

Classic interdisciplinary scenario: When the study name closely matches two or three subdiscipline names.

a. "International Fashion Management and Marketing" -> subdisciplines Fashion Management + Marketing.

Specializations scenario: When not all the subdisciplines are mentioned in the study name, but they are listed as specific specializations, concentrations, or tracks.

a. "Business Administration with specializations in Finance and International Business" -> subdisciplines Business Administration + Finance + International Business.

Mixed scenario: When the study name does not closely match any specific subdiscipline but can be represented by combining two or three subdisciplines.

a. "Financial Economics" -> subdisciplines Finance + Economics.

Last resort scenario: When the study name does not closely match any specific subdiscipline and/or combination, it is instead approximated as closely as possible.

a. "UK, EU, and US Copyright Law" -> subdisciplines Patent & Intellectual Property Law + International Law + European Law.

| Scenario | Example study name | Example assigned subdisciplines |
|---|---|---|
| Study name closely matches one subdiscipline name. | BSc Chemistry | Chemistry |
| Study name closely matches two or three subdiscipline names. | BA International Fashion Management and Marketing | Fashion Management + Marketing |
| Not all the subdisciplines are mentioned in the study name, but they are listed as specific specializations, concentrations, or tracks. | MBA Business Administration with specializations in Finance and International Business | Business Administration + Finance + International Business |
| Study name does not closely match any specific subdiscipline but can be represented by combining two or three subdisciplines. | MSc Financial Economics | Finance + Economics |
| Study name does not closely match any specific subdiscipline and/or combination and is instead approximated as closely as possible. | LLM UK, EU, and US Copyright Law | Patent and Intellectual Property Law + International Law + European Law |
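The assignment itself is done manually by data processors, but the first scenarios (a study name that literally contains subdiscipline names) can be approximated mechanically. The sketch below is only an illustration: the function name, the substring matching, and the short subdiscipline list are assumptions, and it makes no attempt at the judgment-based scenarios.

```python
def match_subdisciplines(study_name, subdisciplines, limit=3):
    """Return up to `limit` subdiscipline names that appear in the study name.

    Hypothetical helper using plain case-insensitive substring matching,
    which only covers names that are spelled out in the study name.
    """
    name = study_name.lower()
    hits = [s for s in subdisciplines if s.lower() in name]
    # Drop a hit that is fully contained in a longer hit.
    hits = [
        h for h in hits
        if not any(h != other and h.lower() in other.lower() for other in hits)
    ]
    # Keep the order in which the names occur in the study name.
    hits.sort(key=lambda s: name.find(s.lower()))
    return hits[:limit]


# A small illustrative subset of subdiscipline names, not the full list of 284.
SUBDISCIPLINES = [
    "Chemistry", "Fashion Management", "Marketing", "Finance",
    "Economics", "Business Administration", "International Business",
]

if __name__ == "__main__":
    print(match_subdisciplines("BSc Chemistry", SUBDISCIPLINES))
    print(match_subdisciplines("MSc Financial Economics", SUBDISCIPLINES))
```

For "MSc Financial Economics" the literal match finds only Economics, because "Finance" is not spelled out in the name. That is the kind of case the mixed and last-resort scenarios leave to a human processor.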

Chapter 7: Policy and Governance

Acknowledgments

The AI Index would like to acknowledge Simba Jonga for his work collecting information on significant AI policy events and conducting a survey of AI national strategies. Additionally, the Index would like to acknowledge the efforts of Ethan Duncan He-Li Hellman, Julia Betts Lotufo, Alexandra Rome, and Emma Williamson in collecting, coding, and analyzing AI-related legislation and regulations. The Index is also grateful for the guidance provided by Caroline Meinhardt on AI legislation and regulation tracking.

Global AI Mentions

For mentions of AI-related legislative proceedings around the world, the AI Index performed searches of the keyword "artificial intelligence" on the websites of 82 countries' congresses or parliaments (in the respective languages), usually under sections named "minutes," "hansard," etc. In some cases, databases were only searchable by title, so site search functions were deployed. The AI Index team surveyed the following databases:

Global Legislation Records on AI

For AI-related bills passed into laws, the AI Index performed searches of the keyword "artificial intelligence" on the websites of 128 countries' congresses or parliaments (in the respective languages) in the full text of bills. Note that only laws passed by state-level legislative bodies and signed into law (i.e., by presidents or through royal assent) from 2016 to 2023 are included. Laws that were approved but then repealed are not included in the analysis. In some cases, databases were only searchable by title, so site search functions were deployed. Future AI Index reports hope to include analysis on other types of legal documents, such as regulations and standards, adopted by state- or supranational-level legislative bodies, government agencies, etc. The AI Index team surveyed databases for the following countries:

Albania, Algeria, American Samoa, Andorra, Angola, Antigua and Barbuda, Argentina, Armenia, Australia, Austria, Azerbaijan, The Bahamas, Bahrain, Bangladesh, Barbados, Belarus, Belgium, Belize, Bermuda, Bhutan, Bolivia, Brazil, Brunei, Bulgaria, Burkina Faso, Cameroon, Canada, Cayman Islands, Chile, China, Colombia, Croatia, Cuba, Curacao, Cyprus, Czech Republic, Denmark, Estonia, Faroe Islands, Fiji, Finland, France, The Gambia, Georgia, Germany, Gibraltar, Greece, Greenland, Grenada, Guam, Guatemala, Guyana, Hong Kong, Hungary, Iceland, India, Iran Islamic Republic, Iraq, Ireland, Isle of Man, Israel, Italy, Jamaica, Japan, Kazakhstan, Kenya, Kiribati, Korea Republic, Kosovo, Kyrgyz Republic, Latvia, Lebanon, Liechtenstein, Lithuania, Luxembourg, Macao SAR China, Malawi, Malaysia, Malta, Mauritius, Mexico, Monaco, Montenegro, Morocco, Mozambique, Nauru, The Netherlands, New Zealand, Nicaragua, Niger, Northern Mariana Islands, Norway, Panama, Papua New Guinea, Philippines, Poland, Portugal, Romania, Russia, Samoa, Saudi Arabia, Serbia, Seychelles, Sierra Leone, Singapore, Slovak Republic, Slovenia, South Africa, Spain, Sri Lanka, St. Kitts and Nevis, Suriname, Sweden, Switzerland, Tajikistan, Tanzania, Togo, Tonga, Turkey, Tuvalu, Uganda, Ukraine, United Arab Emirates, United Kingdom, United States, Uruguay, Vietnam, Yemen, Zambia, Zimbabwe
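
The inclusion rules stated for the global legislation records (keyword match in the full text, passed by a state-level body, signed into law, not repealed, dated 2016 to 2023) can be illustrated with a minimal sketch. The `Law` record and its field names are hypothetical, since the Index does not publish its data schema, and the sketch matches only the English keyword, whereas the actual searches were run in each country's language.

```python
from dataclasses import dataclass
from typing import Dict, List

KEYWORD = "artificial intelligence"  # search term stated in the methodology


@dataclass
class Law:
    # Hypothetical record layout, for illustration only.
    country: str
    year_signed: int
    full_text: str
    state_level: bool = True      # passed by a state-level legislative body
    signed_into_law: bool = True  # e.g., presidential signature or royal assent
    repealed: bool = False


def is_included(law: Law, first_year: int = 2016, last_year: int = 2023) -> bool:
    """Return True when a law satisfies every inclusion rule."""
    return (
        KEYWORD in law.full_text.lower()
        and law.state_level
        and law.signed_into_law
        and not law.repealed
        and first_year <= law.year_signed <= last_year
    )


def count_by_year(laws: List[Law]) -> Dict[int, int]:
    """Count included laws per signing year."""
    counts: Dict[int, int] = {}
    for law in laws:
        if is_included(law):
            counts[law.year_signed] = counts.get(law.year_signed, 0) + 1
    return counts
```

A law that was approved and later repealed, or one signed outside the 2016 to 2023 window, is dropped before counting.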

The legislation was then coded by a team of two human coders for: (1) relevance to AI, (2) regulatory approach, and (3) subject matter. The categories for relevance to AI were low, medium, and high. The regulatory approach categories were expansive or restrictive. For the subject matter categories, the Index employed the Congress policy typology. Where the two coders disagreed on a code, a third coder was brought in to settle the difference.
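
The two-coder procedure with a third coder as tiebreaker can be sketched as follows. This is an illustration under assumptions: the dimension names and the `adjudicate` callback are hypothetical, and the source does not say whether the third coder chooses between the two existing codes or assigns a new one.

```python
from typing import Callable, Dict, Tuple

Codes = Dict[str, str]  # one code per dimension for a single document
DIMENSIONS: Tuple[str, ...] = ("relevance", "approach", "subject")  # hypothetical names


def reconcile(
    coder_a: Codes,
    coder_b: Codes,
    adjudicate: Callable[[str, str, str], str],
) -> Codes:
    """Keep codes both coders agree on; defer to a third coder otherwise."""
    final: Codes = {}
    for dim in DIMENSIONS:
        a, b = coder_a[dim], coder_b[dim]
        # The third coder is consulted only for dimensions in dispute.
        final[dim] = a if a == b else adjudicate(dim, a, b)
    return final
```

Passing the adjudication step as a callback keeps the sketch neutral about how the third coder decides.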

EU AI Regulation

The AI Index also gathered information on AI-related regulations enacted in the European Union between 2017 and 2023. To compile this data, the Index team conducted a keyword search for "artificial intelligence" on EUR-Lex, a comprehensive database of EU legislation, regulations, and case law. EUR-Lex provides access to a wide range of regulatory documents, such as legal acts, consolidated texts, international agreements, preparatory documents, and legislative procedures. The analysis in this section focused exclusively on documents with binding regulatory authority. The search for AI-related regulation in the European Union was therefore limited to legal acts, international agreements, and consolidated texts.

The regulation was then coded by a team of two human coders for: (1) relevance to AI, (2) regulatory approach, and (3) subject matter. The categories for relevance to AI were low, medium, and high. The regulatory approach categories were expansive or restrictive. For the subject matter categories, the Index employed the Congress policy typology. Where the two coders disagreed on a code, a third coder was brought in to settle the difference.

Federal Budget for Nondefense AI R&D

Data on the U.S. federal budget for nondefense AI R&D was taken from previous editions of the AI Index (namely the 2021 and 2022 versions) and from the following National Science and Technology Council reports:

Supplement to the President's FY 2024 Budget
Supplement to the President's FY 2023 Budget
Supplement to the President's FY 2022 Budget

Govini

Govini is a defense technology company. Ark, Govini's flagship software, is a suite of AI-enabled applications, powered by integrated government and commercial data, that accelerate the Defense Acquisition Process.

With Ark, the acquisition community eliminates slow, manual processes and gains the ability to rapidly imagine, produce, and field critical warfighting capabilities. Analysts and decisionmakers are equipped to solve challenges across the entire spectrum of Defense Acquisition, including Supply Chain, Science and Technology, Production, Sustainment, and Modernization.

Govini curated USG AI spend data from its annual Scorecard Taxonomy by applying supervised machine learning (ML) and natural language processing (NLP) techniques to parse, analyze, and categorize large volumes of federal contracts data, including prime contracts, grants, and other transaction authority (OTA) awards. Govini's most recent Scorecard focused on Critical Technologies, of which AI/ML Technologies and Microelectronics were segments. The AI/ML segment consisted of five subsegments: Data Integration, Computer Vision, Machine Learning, Autonomy, and Natural Language Processing. Microelectronics is divided into two subsegments: Memory and Processing, and Semiconductors. By first generating search terms and then excluding specific terms that yield erroneous results, Govini delivers a comprehensive yet discriminant taxonomy of mutually exclusive subsegments. Repeated keyword searches and filters allow a consensus, data-driven taxonomy to come into focus. Govini SMEs conduct a final review of the taxonomic structure to complement this iterative, data-driven process.
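
The search-term step described above (generate include terms, then exclude terms that produce erroneous hits, with each record landing in at most one subsegment) can be sketched as below. This covers only the keyword filtering, not Govini's supervised ML models, and the term lists are hypothetical examples rather than Govini's actual taxonomy.

```python
from typing import Dict, List, Optional

# Hypothetical term lists for illustration only.
TAXONOMY: Dict[str, Dict[str, List[str]]] = {
    "Computer Vision": {
        "include": ["computer vision", "image recognition"],
        "exclude": ["vision insurance"],
    },
    "Natural Language Processing": {
        "include": ["natural language", "speech recognition"],
        "exclude": [],
    },
}


def classify(
    description: str,
    taxonomy: Dict[str, Dict[str, List[str]]] = TAXONOMY,
) -> Optional[str]:
    """Return the first subsegment whose include terms match and exclude terms do not."""
    text = description.lower()
    for subsegment, terms in taxonomy.items():
        hit = any(term in text for term in terms["include"])
        blocked = any(term in text for term in terms["exclude"])
        if hit and not blocked:
            return subsegment  # first match wins, so labels stay mutually exclusive
    return None  # no subsegment matched
```

Returning on the first match is one simple way to enforce mutual exclusivity; the source does not say how Govini resolves records that match several subsegments.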

The use of artificial intelligence (AI) and supervised ML models enables analysis of the large volumes of irregular data contained in federal contracts, data that is often inaccessible through regular government reporting processes or human-intensive analytical approaches.

Moreover, beyond simply making an expansive body of data sources usable, Govini's Ark platform and National Security Knowledge Graph establish high-fidelity standards in categorized and fused data to produce a comprehensive and accurate depiction of federal spending, and the supporting vendor ecosystem, over time.

National AI Strategies

The AI Index conducted a web search to identify national strategies on AI. Below is a list of countries that were identified as having a national AI strategy, including a link to said strategy. For certain countries, noted with an asterisk (*), the actual strategy was not found, and a news article confirming the launch of the strategy was linked instead.

Countries with AI Strategies in Place

Algeria, Argentina, Azerbaijan, Australia, Austria, Bahrain, Bangladesh, Benin, Botswana, Brazil, Belgium, Bulgaria, Canada, Chile, China, Colombia, Croatia, Cyprus, Czech Republic, Denmark, Dominican Republic, Egypt Arab Republic, Ethiopia, Estonia, Finland, France, Germany, Ghana, Greece, Hong Kong, Hungary, India, Indonesia, Iran, Iraq, Ireland, Israel, Italy, Japan, Jordan, Kenya, Korea Republic, Latvia, Lithuania, Luxembourg, Malta, Malaysia, Mauritius, Mexico, The Netherlands, North Korea, Norway, Peru, Philippines, Poland, Portugal, Qatar, Romania, Russia, Rwanda, Saudi Arabia, Serbia, Sierra Leone, Singapore, Slovak Republic, Slovenia, Spain, Sweden, Switzerland, Thailand, Tunisia, Turkey, Ukraine, United Arab Emirates, United Kingdom, United States, Uruguay, Vietnam

Countries with AI Strategies in Development

Andorra, Antigua and Barbuda, Barbados, Armenia, Belarus, Costa Rica, Cuba, Iceland, Jamaica, Kenya, Morocco, New Zealand, Nigeria, Pakistan, Senegal, Uzbekistan

US AI Regulation

This section examines AI-related regulations enacted by American regulatory agencies between 2016 and 2023. It provides an analysis of the total number of regulations, as well as their topics, scope, regulatory intent, and originating agencies. To compile this data, the AI Index team performed a keyword search for "artificial intelligence" on the Federal Register, a comprehensive repository of government documents from nearly all branches of the American government, encompassing more than 436 agencies.

The regulation was then coded by a team of two human coders for: (1) relevance to AI, (2) regulatory approach, and (3) subject matter. The categories for relevance to AI were low, medium, and high. The regulatory approach categories were expansive or restrictive. For the subject matter categories, the Index employed the Congress policy typology. Where the two coders disagreed on a code, a third coder was brought in to settle the difference.

US Department of Defense Budget Requests

Data on the DoD nonclassified AI-related budget requests was taken from previous editions of the AI Index and from the following reports:

Defense Budget Overview: United States Department of Defense Fiscal Year 2024 Budget Request
Defense Budget Overview: United States Department of Defense Fiscal Year 2023 Budget Request
Defense Budget Overview: United States Department of Defense Fiscal Year 2022 Budget Request

US State-Level AI Legislation

For AI-related bills passed into law, the AI Index searched the full text of bills for the keyword "artificial intelligence" on the legislative websites of all 50 U.S. states. Bills are only counted as passed into law if the final version of the bill includes the keyword, not just the introduced version. Note that only laws passed from 2015 to 2022 are included. The count for proposed laws includes both laws that were proposed and eventually passed and laws that were proposed but have not yet been passed or are now inactive. In some cases, databases were only searchable by title, so site search functions were deployed. The AI Index team surveyed the following databases:

Alabama, Alaska, Arizona, Arkansas, California, Colorado, Connecticut, Delaware, Florida, Georgia, Hawaii, Idaho, Illinois, Indiana, Iowa, Kansas, Kentucky, Louisiana, Maine, Maryland, Massachusetts, Michigan, Minnesota, Mississippi, Missouri, Montana, Nebraska, Nevada, New Hampshire, New Jersey, New Mexico, New York, North Carolina, North Dakota, Ohio, Oklahoma, Oregon, Pennsylvania, Rhode Island, South Carolina, South Dakota, Tennessee, Texas, Utah, Vermont, Virginia, Washington, West Virginia, Wisconsin, Wyoming
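
The state-level counting rule, under which a bill counts as passed only when its final version still contains the keyword, can be sketched as follows. The `Bill` record is hypothetical, and treating a keyword match in any version as sufficient for the proposed count is an assumption; the source states only that the proposed count covers passed, pending, and inactive bills.

```python
from dataclasses import dataclass
from typing import Optional

KEYWORD = "artificial intelligence"  # search term stated in the methodology


@dataclass
class Bill:
    # Hypothetical record layout, for illustration only.
    state: str
    introduced_text: str
    final_text: Optional[str] = None  # None when the bill has not been enacted


def counts_as_passed(bill: Bill) -> bool:
    """A bill counts as passed only if its enacted text contains the keyword."""
    return bill.final_text is not None and KEYWORD in bill.final_text.lower()


def counts_as_proposed(bill: Bill) -> bool:
    """Assumption: a keyword match in any version counts toward proposed bills."""
    versions = [bill.introduced_text, bill.final_text or ""]
    return any(KEYWORD in text.lower() for text in versions)
```

A bill whose AI language was amended out before enactment would therefore appear in the proposed count but not in the passed count.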

US Committee Mentions

To research trends in mentions of AI by United States congressional committees, the following search was conducted:

Website: Congress.gov
Keyword: artificial intelligence
Filters: Committee Reports

Chapter 8: Diversity

Code.org

To learn more about the diversity data from Code.org, please read the methodological note on Code.org's data included in the Chapter 6 subsection of the Appendix.

Computing Research Association (CRA Taulbee Survey)

To learn more about the diversity data from the CRA, please read the methodological note on the CRA's data included in the Chapter 6 subsection of the Appendix.

Informatics Europe

To learn more about the diversity data from Informatics Europe, please read the methodological note on Informatics Europe's data included in the Chapter 6 subsection of the Appendix.

Chapter 9: Public Opinion

Global Public Opinion on Artificial Intelligence (GPO-AI)

In October and November 2023, researchers at the Schwartz Reisman Institute for Technology and Society (SRI) and the Policy, Elections, and Representation Lab (PEARL) at the Munk School of Global Affairs and Public Policy at the University of Toronto completed a survey project on public perceptions of and attitudes toward AI. The survey was administered to census-targeted samples of over 1,000 people in each of 21 countries, for a total of 23,882 surveys conducted in 12 languages. The countries sampled represent a majority of the world's population. To learn more about the survey, please visit the survey website. The following authors contributed to the GPO-AI survey: Peter John Loewen, Blake Lee-Whiting, Maggie Arai, Thomas Bergeron, Thomas Galipeau, Isaac Gazendam, Hugh Needham, Lee Slinger, Sofiya Yusypovych.

Ipsos

For brevity, the 2024 AI Index does not republish the methodology used by the Ipsos survey that features in the report. More details about the Ipsos survey's methodology can be found in the actual survey.

Pew Research

For brevity, the 2024 AI Index does not republish the methodology used by the Pew surveys that feature in the report. Data was taken from the 2023 Pew Research Center survey.

Quid collects social media data from over 500 million sources in real time and analyzes this data through AI-powered natural language processing. This process parses language and breaks out posts by filters such as drivers of positive and negative sentiment, emotions, and behaviors, allowing deeper insights to be reached. Quid analyzed 6.69 million social media posts from 2023 to assess perceptions of AI model releases. The substantial increase in new models in 2023 surfaced reactions to the technical advancements in AI, the impact on efficiency in business operations, and ethical considerations as AI continues to be adopted into society.

6 HELM evaluates 10 scenarios: (1) NarrativeQA (reading comprehension), (2) Natural Questions (closed-book) (closed-book short-answer question answering), (3) Natural Questions (open-book) (open-book short-answer question answering), (4) OpenBookQA (commonsense question answering), (5) MMLU (multisubject understanding), (6) GSM8K (grade school math), (7) MATH (competition math), (8) LegalBench (legal reasoning), (9) MedQA (medical knowledge), and (10) WMT 2014 (machine translation).

7 There are several versions of HELM. This section reports the score on HELM Lite, Release v1.0.0 (2023-12-19), with the data having been collected in January 2024.

9 The 12 evaluation aspects of HEIM are: (1) Alignment: How closely does the image align with the given text? (2) Quality: What is the quality of the produced image? (3) Aesthetics: How aesthetically pleasing is the generated image? (4) Originality: How original is the image? (5) Reasoning: Does the model understand objects, counts, and spatial relations? (6) Knowledge: Does the model have knowledge about the world? (7) Bias: Are the generated images biased? (8) Toxicity: Are the generated images toxic or inappropriate? (9) Fairness: Do the generated images exhibit performance disparities? (10) Robustness: Is the model robust to input perturbations? (11) Multilinguality: Does the model support non-English languages? (12) Efficiency: How fast is model inference?

10 Some abilities highlighted in the previous sections implicitly involve some form of reasoning. This section highlights tasks that have a more specific reasoning focus.

19 By closed models, the AI Index is referring both to models that are fully closed and those with limited access.

20 The data in this section was collected in early January 2024.

21 In its technical report on Llama 2, Meta notes that it offsets all the carbon emissions generated during the model's training process.
7 The survey is introduced above in section 3.1, Assessing Responsible AI. The full Global State of Responsible AI Report is forthcoming in May 2024. Details about the methodology can be found in the Appendix of this chapter.

8 The following analyses only look at companies that indicated in a previous question that privacy and data governance risks are relevant to them in the context of their AI adoption.

9 Respondents were further given the free-text option "Other" to report additional mitigations not listed.

11 The survey is introduced above in section 3.1, Assessing Responsible AI. The full State of Responsible AI Report is forthcoming in May 2024. Details about the methodology can be found in the Appendix of this chapter.

12 Respondents were further given the free-text option "Other" to report additional mitigations not listed.

14 The survey is introduced above in section 3.1, Assessing Responsible AI. The full State of Responsible AI Report is forthcoming in May 2024. Details about the methodology can be found in the Appendix of this chapter.

15 Respondents were further given the free-text option "Other" to report additional mitigations not listed.
5 LinkedIn membership varies considerably between countries, which makes interpreting absolute movements of members from one country to another difficult. To compare migration flows between countries fairly, migration flows are normalized for the country of interest. For example, if country A is the country of interest, all absolute net flows into and out of country A (regardless of origin and destination countries) are normalized based on LinkedIn membership in country A at the end of each year and multiplied by 10,000. Hence, this metric indicates relative talent migration of all other countries to and from country A.

1 An ungauged basin is a watershed for which there is insufficient streamflow data to model hydrological flows.

10 Regulations can originate from multiple agencies, so the annual totals in Figure 7.4.3 may exceed those in Figure 7.4.1.

1 The charts in this section look only at the ethnicity of domestic or native CS students and faculty. Although the CRA reports data on the proportion of nonresident aliens at each educational level (i.e., bachelor's, master's, PhD, and faculty), data on the ethnicity of nonresident aliens is not included.

2 Not all PhD-granting departments targeted in the survey provided responses. Of the 297 departments targeted, only 182 responded, resulting in an overall response rate of approximately 61%. The AI Index advises against making per capita comparisons between the CRA North American data and the data on European CS graduates detailed in the subsequent sections, because the European data is collected from national statistical offices, which affords it broader coverage.

3 The year label refers to the year in which an academic year ends. For example, the figures visualizing new graduates for 2022 reflect the number of graduates reported for the 2021/22 academic year. For the sake of visual simplicity, the Index opts to focus on the year in which students graduated.

4 In countries where the number of PhD graduates is relatively small, trends in gender proportions can be prone to sudden year-over-year changes. For example, in 2022 Bulgaria produced 24 total PhDs, Latvia 12, and Estonia 26.
1 https://eto.tech/tool-docs/cat/

2 https://cat.eto.tech/

3 Christian Schoeberl, Autumn Toney, and James Dunham, "Identifying AI Research" (Center for Security and Emerging Technology, July 2023), https://doi.org/10.51593/20220030.

4 James Dunham, Jennifer Melot, and Dewey Murdick, "Identifying the Development and Application of Artificial Intelligence in Scientific Text," arXiv preprint, arXiv:2002.07143 (2020).

5 These scores are based on cosine similarities between field-of-study and paper embeddings. See Autumn Toney and James Dunham, "Multi-Label Classification of Scientific Research Documents Across Domains and Languages," Proceedings of the Third Workshop on Scholarly Document Processing (Association for Computational Linguistics, 2022): 105-14, https://aclanthology.org/2022.sdp-1.12/.

6 See https://ror.org/ for more information about the ROR dataset.

7 Patents are analyzed at the "patent family" level rather than the "patent document" level because a patent family is a collection of patent documents all associated with a single invention and/or innovation by the same inventors/assignees. Thus, counting at the "patent family" level mitigates artificial inflation of the numbers when a patent family contains multiple patent documents or when a patent is filed in multiple jurisdictions.
