Introduction to the AI Index Report 2024
Welcome to the seventh edition of the AI Index report. The 2024 Index is our most comprehensive to date and arrives at an important moment when AI's influence on society has never been more pronounced. This year, we have broadened our scope to more extensively cover essential trends such as technical advancements in AI, public perceptions of the technology, and the geopolitical dynamics surrounding its development. Featuring more original data than ever before, this edition introduces new estimates on AI training costs, detailed analyses of the responsible AI landscape, and an entirely new chapter dedicated to AI's impact on science and medicine.
The AI Index report tracks, collates, distills, and visualizes data related to artificial intelligence (AI). Our mission is to provide unbiased, rigorously vetted, broadly sourced data in order for policymakers, researchers, executives, journalists, and the general public to develop a more thorough and nuanced understanding of the complex field of AI.
The AI Index is recognized globally as one of the most credible and authoritative sources for data and insights on artificial intelligence. Previous editions have been cited in major newspapers, including The New York Times, Bloomberg, and The Guardian; have amassed hundreds of academic citations; and have been referenced by high-level policymakers in the United States, the United Kingdom, and the European Union, among other places. This year's edition surpasses all previous ones in size, scale, and scope, reflecting the growing significance that AI is coming to hold in all of our lives.
Message From the Co-directors
A decade ago, the best AI systems in the world were unable to classify objects in images at a human level. They struggled with language comprehension and could not solve math problems. Today, AI systems routinely exceed human performance on standard benchmarks.
Progress accelerated in 2023. New state-of-the-art systems like GPT-4, Gemini, and Claude 3 are impressively multimodal: They can generate fluent text in dozens of languages, process audio, and even explain memes. As AI has improved, it has increasingly forced its way into our lives. Companies are racing to build AI-based products, and AI is increasingly being used by the general public. But current AI technology still has significant problems. It cannot reliably deal with facts, perform complex reasoning, or explain its conclusions.
AI faces two interrelated futures. In the first, the technology continues to improve and is increasingly used, with major consequences for productivity and employment. It can be put to both good and bad uses. In the second future, the adoption of AI is constrained by the limitations of the technology. Regardless of which future unfolds, governments are increasingly concerned. They are stepping in to encourage the upside, such as funding university R&D and incentivizing private investment. Governments are also aiming to manage the potential downsides, such as impacts on employment, privacy concerns, misinformation, and intellectual property rights.
As AI rapidly evolves, the AI Index aims to help the AI community, policymakers, business leaders, journalists, and the general public navigate this complex landscape. It provides ongoing, objective snapshots tracking several key areas: technical progress in AI capabilities, the community and investments driving AI development and deployment, public opinion on current and potential future impacts, and policy measures taken to stimulate innovation while managing its risks and challenges. By comprehensively monitoring the AI ecosystem, the Index serves as an important resource for understanding this transformative technological force.
On the technical front, this year's AI Index reports that the number of new large language models released worldwide in 2023 doubled over the previous year. Two-thirds were open-source, but the highest-performing models came from industry players with closed systems. Gemini Ultra became the first LLM to reach human-level performance on the Massive Multitask Language Understanding (MMLU) benchmark; performance on the benchmark has improved by 15 percentage points since last year. Additionally, GPT-4 achieved an impressive 0.96 mean win rate score on the comprehensive Holistic Evaluation of Language Models (HELM) benchmark, which includes MMLU among other evaluations.
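As context for that last figure, a model's mean win rate over a benchmark suite is the fraction of head-to-head comparisons it wins when its score on each scenario is compared against every other evaluated model. The sketch below is a minimal Python illustration with made-up model names and scores; it is not HELM's implementation or data, only the general idea of the metric.

from typing import Dict

# Illustrative, hypothetical per-scenario scores for three made-up models.
scores: Dict[str, Dict[str, float]] = {
    "model_a": {"scenario_1": 0.86, "scenario_2": 0.78, "scenario_3": 0.70},
    "model_b": {"scenario_1": 0.75, "scenario_2": 0.80, "scenario_3": 0.65},
    "model_c": {"scenario_1": 0.60, "scenario_2": 0.55, "scenario_3": 0.50},
}

def mean_win_rate(model: str, all_scores: Dict[str, Dict[str, float]]) -> float:
    """Fraction of (scenario, opponent) comparisons that `model` wins;
    a tie counts as half a win."""
    opponents = [m for m in all_scores if m != model]
    wins = 0.0
    comparisons = 0
    for scenario, score in all_scores[model].items():
        for opponent in opponents:
            comparisons += 1
            other = all_scores[opponent][scenario]
            if score > other:
                wins += 1.0
            elif score == other:
                wins += 0.5
    return wins / comparisons

for name in scores:
    print(f"{name}: mean win rate = {mean_win_rate(name, scores):.2f}")

Read this way, a mean win rate near 0.96 indicates a model that outscores nearly every other evaluated model on nearly every scenario.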
Message From the Co-directors (cont'd)
Although global private investment in AI decreased for the second consecutive year, investment in generative AI skyrocketed. More Fortune 500 earnings calls mentioned AI than ever before, and new studies show that AI tangibly boosts worker productivity. On the policymaking front, global mentions of AI in legislative proceedings have never been higher. U.S. regulators passed more AI-related regulations in 2023 than ever before. Still, many expressed concerns about AI's ability to generate deepfakes and impact elections. The public became more aware of AI, and studies suggest that it responded with nervousness.
Ray Perrault and Jack Clark
Co-directors, AI Index
Top 10 Takeaways
1. AI beats humans on some tasks, but not on all. AI has surpassed human performance on several benchmarks, including some in image classification, visual reasoning, and English understanding. Yet it trails behind on more complex tasks like competition-level mathematics, visual commonsense reasoning, and planning.
2. Industry continues to dominate frontier AI research. In 2023, industry produced 51 notable machine learning models, while academia contributed only 15. There were also 21 notable models resulting from industry-academia collaborations in 2023, a new high.
3. Frontier models get way more expensive. According to AI Index estimates, the training costs of state-of-the-art AI models have reached unprecedented levels. For example, OpenAI's GPT-4 used an estimated million worth of compute to train, while Google's Gemini Ultra cost million for compute.
4. The United States leads China, the EU, and the U.K. as the leading source of top AI models. In 2023, 61 notable AI models originated from U.S.-based institutions, far outpacing the European Union's 21 and China's 15.
5. Robust and standardized evaluations for LLM responsibility are seriously lacking. New research from the AI Index reveals a significant lack of standardization in responsible AI reporting. Leading developers, including OpenAI, Google, and Anthropic, primarily test their models against different responsible AI benchmarks. This practice complicates efforts to systematically compare the risks and limitations of top AI models.
6. Generative AI investment skyrockets. Despite a decline in overall AI private investment last year, funding for generative AI surged, nearly octupling from 2022 to reach billion. Major players in the generative AI space, including OpenAI, Anthropic, Hugging Face, and Inflection, reported substantial fundraising rounds.
7. The data is in: AI makes workers more productive and leads to higher quality work. In 2023, several studies assessed AI's impact on labor, suggesting that AI enables workers to complete tasks more quickly and to improve the quality of their output. These studies also demonstrated AI's potential to bridge the skill gap between low- and high-skilled workers. Still, other studies caution that using AI without proper oversight can lead to diminished performance.
Top 10 Takeaways (cont'd)
8. Scientific progress accelerates even further, thanks to AI. In 2022, AI began to advance scientific discovery. 2023, however, saw the launch of even more significant science-related AI applications, from AlphaDev, which makes algorithmic sorting more efficient, to GNoME, which facilitates the process of materials discovery.
9. The number of AI regulations in the United States sharply increases. The number of AI-related regulations in the U.S. has risen significantly in the past year and over the last five years. In 2023, there were 25 AI-related regulations, up from just one in 2016. Last year alone, the total number of AI-related regulations grew by .
10. People across the globe are more cognizant of AI's potential impact, and more nervous. A survey from Ipsos shows that, over the last year, the proportion of those who think AI will dramatically affect their lives in the next three to five years has increased from to . Moreover, express nervousness toward AI products and services, marking a 13 percentage point rise from 2022. In America, Pew data suggests that of Americans report feeling more concerned than excited about AI, rising from 37% in 2022.
Steering Committee
Co-directors
Jack Clark, Anthropic, OECD
Raymond Perrault, SRI International
Members
Erik Brynjolfsson, Stanford University
John Etchemendy, Stanford University
Katrina Ligett, Hebrew University
Terah Lyons, JPMorgan Chase & Co.
James Manyika, Google, University of Oxford
Juan Carlos Niebles, Stanford University, Salesforce
Vanessa Parli, Stanford University
Russell Wald, Stanford University
Staff and Researchers
Research Manager and Editor in Chief
Nestor Maslej, Stanford University
Research Associate
Loredana Fattorini, Stanford University
Affiliated Researchers
Elif Kiesow Cortez, Stanford Law School Research Fellow
Anka Reuel, Stanford University
Robi Rahman, Data Scientist
Alexandra Rome, Freelance Researcher
Lapo Santarlasci, IMT School for Advanced Studies Lucca
Graduate Researchers
James da Costa, Stanford University
Simba Jonga, Stanford University
Undergraduate Researchers
Emily Capstick, Stanford University
Summer Flowers, Stanford University
Armin Hamrah, Claremont McKenna College
Amelia Hardy, Stanford University
Mena Hassan, Stanford University
Ethan Duncan He-Li Hellman, Stanford University
Julia Betts Lotufo, Stanford University
Sukrut Oak, Stanford University
Andrew Shi, Stanford University
Jason Shin, Stanford University
Emma Williamson, Stanford University
Alfred Yu, Stanford University
How to Cite This Report
Nestor Maslej, Loredana Fattorini, Raymond Perrault, Vanessa Parli, Anka Reuel, Erik Brynjolfsson, John Etchemendy, Katrina Ligett, Terah Lyons, James Manyika, Juan Carlos Niebles, Yoav Shoham, Russell Wald, and Jack Clark, "The AI Index 2024 Annual Report," AI Index Steering Committee, Institute for Human-Centered AI, Stanford University, Stanford, CA, April 2024.
The AI Index 2024 Annual Report by Stanford University is licensed under Attribution-NoDerivatives 4.0 International.
Public Data and Tools
The AI Index 2024 Report is supplemented by raw data and an interactive tool. We invite each reader to use the data and the tool in a way most relevant to their work and interests.
Raw data and charts: The public data and high-resolution images of all the charts in the report are available on Google Drive.
Global AI Vibrancy Tool: Compare the AI ecosystems of over 30 countries. The Global AI Vibrancy tool will be updated in the summer of 2024.
AI Index and Stanford HAI
The AI Index is an independent initiative at the Stanford Institute for Human-Centered Artificial Intelligence (HAI).
The AI Index was conceived within the One Hundred Year Study on Artificial Intelligence (AI100).
The AI Index welcomes feedback and new ideas for next year. Contact us at AI-Index-Report@stanford.edu.
The AI Index acknowledges that while authored by a team of human researchers, its writing process was aided by AI tools. Specifically, the authors used ChatGPT and Claude to help tighten and copy edit initial drafts. The workflow involved authors writing the original copy, then utilizing AI tools as part of the editing process.
Supporting Partners
Analytics and Research Partners
Accenture, CSET, Epoch, GitHub, Govini, Lightcast, LinkedIn
Contributors
The AI Index wants to acknowledge the following individuals by chapter and section for their contributions of data, analysis, advice, and expert commentary included in the AI Index 2024 Report:
Chapter 1: Research and Development
Catherine Aiken, Terry Auricchio, Tamay Besiroglu, Rishi Bommasani, Andrew Brown, Peter Cihon, James da Costa, Ben Cottier, James Cussens, James Dunham, Meredith Ellison, Loredana Fattorini, Enrico Gerding, Anson Ho, Percy Liang, Nestor Maslej, Greg Mori, Tristan Naumann, Vanessa Parli, Pavlos Peppas, Ray Perrault, Robi Rahman, Vesna Sablijakovic-Fritz, Jim Schmiedeler, Jaime Sevilla, Autumn Toney, Kevin Xu, Meg Young, Milena Zeithamlova
Chapter 2: Technical Performance
Rishi Bommasani, Emma Brunskill, Erik Brynjolfsson, Emily Capstick, Jack Clark, Loredana Fattorini, Tobi Gertsenberg, Noah Goodman, Nicholas Haber, Sanmi Koyejo, Percy Liang, Katrina Ligett, Sasha Luccioni, Nestor Maslej, Juan Carlos Niebles, Sukrut Oak, Vanessa Parli, Ray Perrault, Andrew Shi, Yoav Shoham, Emma Williamson
Chapter 3: Responsible AI
Jack Clark, Loredana Fattorini, Amelia Hardy, Katrina Ligett, Nestor Maslej, Vanessa Parli, Ray Perrault, Anka Reuel, Andrew Shi
Chapter 4: Economy
Susanne Bieller, Erik Brynjolfsson, Mar Carpanelli, James da Costa, Natalia Dorogi, Heather English, Murat Erer, Loredana Fattorini, Akash Kaura, James Manyika, Nestor Maslej, Cal McKeever, Julia Nitschke, Layla O’Kane, Vanessa Parli, Ray Perrault, Brittany Presten, Carl Shan, Bill Valle, Casey Weston, Emma Williamson
Chapter 5: Science and Medicine
Russ Altman, Loredana Fattorini, Remi Lam, Curtis Langlotz, James Manyika, Nestor Maslej, Vanessa Parli, Ray Perrault, Emma Williamson
Contributors (cont'd)
Chapter 6: Education
Betsy Bizot, John Etchemendy, Loredana Fattorini, Kirsten Feddersen, Matt Hazenbush, Nestor Maslej, Vanessa Parli, Ray Perrault, Svetlana Tikhonenko, Laurens Vehmeijer, Hannah Weissman, Stuart Zweben
Chapter 7: Policy and Governance
Alison Boyer, Elif Kiesow Cortez, Rebecca DeCrescenzo, David Freeman Engstrom, Loredana Fattorini, Philip de Guzman, Mena Hassan, Ethan Duncan He-Li Hellman, Daniel Ho, Simba Jonga, Rohini Kosoglu, Mark Lemley, Julia Betts Lotufo, Nestor Maslej, Caroline Meinhardt, Julian Nyarko, Jeff Park, Vanessa Parli, Ray Perrault, Alexandra Rome, Lapo Santarlasci, Sarah Smedley, Russell Wald, Emma Williamson, Daniel Zhang
Chapter 8: Diversity
Betsy Bizot, Loredana Fattorini, Kirsten Feddersen, Matt Hazenbush, Nestor Maslej, Vanessa Parli, Ray Perrault, Svetlana Tikhonenko, Laurens Vehmeijer, Caroline Weis, Hannah Weissman, Stuart Zweben
Chapter 9: Public Opinion
Maggie Arai, Heather English, Loredana Fattorini, Armin Hamrah, Peter Loewen, Nestor Maslej, Vanessa Parli, Ray Perrault, Marco Monteiro Silva, Lee Slinger, Bill Valle, Russell Wald
The AI Index thanks the following organizations and individuals who provided data for inclusion in this year's report:
Organizations
Center for Research on Foundation Models
Rishi Bommasani, Percy Liang
Center for Security and Emerging Technology, Georgetown University
Catherine Aiken, James Dunham, Autumn Toney
Code.org
Hannah Weissman
Computing Research Association
Betsy Bizot, Stuart Zweben
Epoch
Ben Cottier, Robi Rahman
GitHub
Peter Cihon, Kevin Xu
Govini
Alison Boyer, Rebecca DeCrescenzo, Philip de Guzman, Jeff Park
Informatics Europe
Svetlana Tikhonenko
International Federation of Robotics
Susanne Bieller
Lightcast
Cal McKeever, Julia Nitschke, Layla O’Kane
LinkedIn
Murat Erer, Akash Kaura, Casey Weston
McKinsey & Company
Natalia Dorogi, Brittany Presten
Munk School of Global Affairs and Public Policy
Peter Loewen, Lee Slinger
Quid
Heather English, Bill Valle
Schwartz Reisman Institute for Technology and Society
Maggie Arai, Marco Monteiro Silva
Studyportals
Kirsten Feddersen, Laurens Vehmeijer
Women in Machine Learning
Caroline Weis
The AI Index also thanks Jeanina Casusi, Nancy King, Carolyn Lehman, Shana Lynch, Jonathan Mindes, and Michi Turner for their help in preparing this report; Joe Hinman and Nabarun Mukherjee for their help in maintaining the AI Index website; and Annie Benisch, Marc Gough, Panos Madamopoulos-Moraris, Kaci Peel, Drew Spence, Madeline Wright, and Daniel Zhang for their work in helping promote the report.
Report Highlights
Chapter 1: Research and Development
1. Industry continues to dominate frontier AI research. In 2023, industry produced 51 notable machine learning models, while academia contributed only 15. There were also 21 notable models resulting from industry-academia collaborations in 2023, a new high.
2. More foundation models and more open foundation models. In 2023, a total of 149 foundation models were released, more than double the amount released in 2022. Of these newly released models, were open-source, compared to only in 2022 and in 2021.
3. Frontier models get way more expensive. According to AI Index estimates, the training costs of state-of-the-art AI models have reached unprecedented levels. For example, OpenAI's GPT-4 used an estimated million worth of compute to train, while Google's Gemini Ultra cost million for compute.
4. The United States leads China, the EU, and the U.K. as the leading source of top AI models. In 2023, 61 notable AI models originated from U.S.-based institutions, far outpacing the European Union's 21 and China's 15.
5. The number of AI patents skyrockets. From 2021 to 2022, AI patent grants worldwide increased sharply by . Since 2010, the number of granted AI patents has increased more than 31 times.
6. China dominates AI patents. In 2022, China led global AI patent origins with , significantly outpacing the United States, which accounted for of AI patent origins. Since 2010, the U.S. share of AI patents has decreased from .
7. Open-source AI research explodes. Since 2011, the number of AI-related projects on GitHub has seen a consistent increase, growing from 845 in 2011 to approximately 1.8 million in 2023. Notably, there was a sharp 59.3% rise in the total number of GitHub AI projects in 2023 alone. The total number of stars for AI-related projects on GitHub also significantly increased in 2023, more than tripling from 4.0 million in 2022 to 12.2 million.
8. The number of AI publications continues to rise. Between 2010 and 2022, the total number of AI publications nearly tripled, rising from approximately 88,000 in 2010 to more than 240,000 in 2022. The increase over the last year was a modest .
Report Highlights
Chapter 2: Technical Performance
1. AI beats humans on some tasks, but not on all. AI has surpassed human performance on several benchmarks, including some in image classification, visual reasoning, and English understanding. Yet it trails behind on more complex tasks like competition-level mathematics, visual commonsense reasoning, and planning.
2. Here comes multimodal AI. Traditionally, AI systems have been limited in scope, with language models excelling in text comprehension but faltering in image processing, and vice versa. However, recent advancements have led to the development of strong multimodal models, such as Google's Gemini and OpenAI's GPT-4. These models demonstrate flexibility and are capable of handling images and text and, in some instances, can even process audio.
3. Harder benchmarks emerge. AI models have reached performance saturation on established benchmarks such as ImageNet, SQuAD, and SuperGLUE, prompting researchers to develop more challenging ones. In 2023, several challenging new benchmarks emerged, including SWE-bench for coding, HEIM for image generation, MMMU for general reasoning, MoCa for moral reasoning, AgentBench for agent-based behavior, and HaluEval for hallucinations.
4. Better AI means better data, which means ... even better AI. New AI models such as SegmentAnything and Skoltech are being used to generate specialized data for tasks like image segmentation and 3D reconstruction. Data is vital for technical improvements. The use of AI to create more data enhances current capabilities and paves the way for future algorithmic improvements, especially on harder tasks.
5. Human evaluation is in. With generative models producing high-quality text, images, and more, benchmarking has slowly started shifting toward incorporating human evaluations like the Chatbot Arena Leaderboard rather than computerized rankings like ImageNet or SQuAD. Public sentiment about AI is becoming an increasingly important consideration in tracking AI progress.
6. Thanks to LLMs, robots have become more flexible. The fusion of language modeling with robotics has given rise to more flexible robotic systems like PaLM-E and RT-2. Beyond their improved robotic capabilities, these models can ask questions, which marks a significant step toward robots that can interact more effectively with the real world.
7. More technical research in agentic AI. Creating AI agents, systems capable of autonomous operation in specific environments, has long challenged computer scientists. However, emerging research suggests that the performance of autonomous agents is improving. Current agents can now master complex games like Minecraft and effectively tackle real-world tasks, such as online shopping and research assistance.
8. Closed LLMs significantly outperform open ones. On 10 select AI benchmarks, closed models outperformed open ones, with a median performance advantage of . Differences in the performance of closed and open models carry important implications for AI policy debates.
Report Highlights
Chapter 3: Responsible AI
1. Robust and standardized evaluations for LLM responsibility are seriously lacking. New research from the AI Index reveals a significant lack of standardization in responsible AI reporting. Leading developers, including OpenAI, Google, and Anthropic, primarily test their models against different responsible AI benchmarks. This practice complicates efforts to systematically compare the risks and limitations of top AI models.
2. Political deepfakes are easy to generate and difficult to detect. Political deepfakes are already affecting elections across the world, with recent research suggesting that existing AI deepfake detection methods perform with varying levels of accuracy. In addition, new projects like CounterCloud demonstrate how easily AI can create and disseminate fake content.
3. Researchers discover more complex vulnerabilities in LLMs. Previously, most efforts to red team AI models focused on testing adversarial prompts that intuitively made sense to humans. This year, researchers found less obvious strategies to get LLMs to exhibit harmful behavior, like asking the models to infinitely repeat random words.
4. Risks from AI are becoming a concern for businesses across the globe. A global survey on responsible AI highlights that companies' top AI-related concerns include privacy, data security, and reliability. The survey shows that organizations are beginning to take steps to mitigate these risks. Globally, however, most companies have so far only mitigated a small portion of these risks.
5. LLMs can output copyrighted material. Multiple researchers have shown that the generative outputs of popular LLMs may contain copyrighted material, such as excerpts from The New York Times or scenes from movies. Whether such output constitutes copyright violations is becoming a central legal question.
6. AI developers score low on transparency, with consequences for research. The newly introduced Foundation Model Transparency Index shows that AI developers lack transparency, especially regarding the disclosure of training data and methodologies. This lack of openness hinders efforts to further understand the robustness and safety of AI systems.
Chapter 3: Responsible AI (cont'd)
7. Extreme AI risks are difficult to analyze. Over the past year, a substantial debate has emerged among AI scholars and practitioners regarding the focus on immediate model risks, like algorithmic discrimination, versus potential long-term existential threats. It has become challenging to distinguish which claims are scientifically founded and should inform policymaking. This difficulty is compounded by the tangible nature of already present short-term risks in contrast with the theoretical nature of existential threats.
8. The number of AI incidents continues to rise. According to the AI Incident Database, which tracks incidents related to the misuse of AI, 123 incidents were reported in 2023, a 32.3% increase from 2022. Since 2013, AI incidents have grown by over twentyfold. A notable example includes AI-generated, sexually explicit deepfakes of Taylor Swift that were widely shared online.
9. ChatGPT is politically biased. Researchers find a significant bias in ChatGPT toward Democrats in the United States and the Labour Party in the U.K. This finding raises concerns about the tool's potential to influence users' political views, particularly in a year marked by major global elections.
Report Highlights
Chapter 4: Economy
1. Generative AI investment skyrockets. Despite a decline in overall AI private investment last year, funding for generative AI surged, nearly octupling from 2022 to reach billion. Major players in the generative AI space, including OpenAI, Anthropic, Hugging Face, and Inflection, reported substantial fundraising rounds.
2. Already a leader, the United States pulls even further ahead in AI private investment. In 2023, the United States saw AI investments reach billion, nearly 8.7 times more than China, the next highest investor. While private AI investment in China and the European Union, including the United Kingdom, declined by and , respectively, since 2022, the United States experienced a notable increase of in the same time frame.
3. Fewer AI jobs in the United States and across the globe. In 2022, AI-related positions made up of all job postings in America, a figure that decreased to in 2023. This decline in AI job listings is attributed to fewer postings from leading AI firms and a reduced proportion of tech roles within these companies.
4. AI decreases costs and increases revenues. A new McKinsey survey reveals that of surveyed organizations report cost reductions from implementing AI (including generative AI), and report revenue increases. Compared to the previous year, there was a 10 percentage point increase in respondents reporting decreased costs, suggesting AI is driving significant business efficiency gains.
5. Total AI private investment declines again, while the number of newly funded AI companies increases. Global private AI investment has fallen for the second year in a row, though less than the sharp decrease from 2021 to 2022. The count of newly funded AI companies spiked to 1,812, up from the previous year.
6. AI organizational adoption ticks up. A 2023 McKinsey report reveals that of organizations now use AI (including generative AI) in at least one business unit or function, up from in 2022 and in 2017.
7. China dominates industrial robotics. Since surpassing Japan in 2013 as the leading installer of industrial robots, China has significantly widened the gap with the nearest competitor nation. In 2013, China's installations accounted for of the global total, a share that rose to by 2022.
Chapter 4: Economy (cont'd)
8. Greater diversity in robot installations. In 2017, collaborative robots represented a mere of all new industrial robot installations, a figure that climbed to by 2022. Similarly, 2022 saw a rise in service robot installations across all application categories, except for medical robotics. This trend indicates not just an overall increase in robot installations but also a growing emphasis on deploying robots for human-facing roles.
9. The data is in: AI makes workers more productive and leads to higher quality work. In 2023, several studies assessed AI's impact on labor, suggesting that AI enables workers to complete tasks more quickly and to improve the quality of their output. These studies also demonstrated AI's potential to bridge the skill gap between low- and high-skilled workers. Still, other studies caution that using AI without proper oversight can lead to diminished performance.
10. Fortune 500 companies start talking a lot about AI, especially generative AI. In 2023, AI was mentioned in 394 earnings calls (nearly of all Fortune 500 companies), a notable increase from 266 mentions in 2022. Since 2018, mentions of AI in Fortune 500 earnings calls have nearly doubled. The most frequently cited theme, appearing in of all earnings calls, was generative AI.
Report Highlights
Chapter 5: Science and Medicine
1. Scientific progress accelerates even further, thanks to AI. In 2022, AI began to advance scientific discovery. 2023, however, saw the launch of even more significant science-related AI applications, from AlphaDev, which makes algorithmic sorting more efficient, to GNoME, which facilitates the process of materials discovery.
2. AI helps medicine take significant strides forward. In 2023, several significant medical systems were launched, including EVEscape, which enhances pandemic prediction, and AlphaMissense, which assists in AI-driven mutation classification. AI is increasingly being utilized to propel medical advancements.
3. Highly knowledgeable medical AI has arrived. Over the past few years, AI systems have shown remarkable improvement on the MedQA benchmark, a key test for assessing AI's clinical knowledge. The standout model of 2023, GPT-4 Medprompt, reached an accuracy rate of , marking a 22.6 percentage point increase from the highest score in 2022. Since the benchmark's introduction in 2019, AI performance on MedQA has nearly tripled.
4. The FDA approves more and more AI-related medical devices. In 2022, the FDA approved 139 AI-related medical devices, a 12.1% increase from 2021. Since 2012, the number of FDA-approved AI-related medical devices has increased by more than 45-fold. AI is increasingly being used for real-world medical purposes.
Report Highlights
Chapter 6: Education
1. The number of American and Canadian CS bachelor's graduates continues to rise, new CS master's graduates stay relatively flat, and PhD graduates modestly grow. While the number of new American and Canadian bachelor's graduates has consistently risen for more than a decade, the number of students opting for graduate education in CS has flattened. Since 2018, the number of CS master's and graduates has slightly declined.
2. The migration of AI PhDs to industry continues at an accelerating pace. In 2011, roughly equal percentages of new AI PhDs took jobs in industry (40.9%) and academia (41.6%). However, by 2022, a significantly larger proportion (70.7%) joined industry after graduation compared to those entering academia (20.0%). Over the past year alone, the share of industry-bound AI PhDs has risen by 5.3 percentage points, indicating an intensifying brain drain from universities into industry.
3. Less transition of academic talent from industry to academia. In of new AI faculty in the United States and Canada were from industry. By 2021, this figure had declined to 11%, and in 2022, it further dropped to . This trend indicates a progressively lower migration of high-level AI talent from industry into academia.
4. CS education in the United States and Canada becomes less international. Proportionally fewer international CS bachelor's, master's, and PhD students graduated in 2022 than in 2021. The drop in international students in the master's category was especially pronounced.
5. More American high school students take CS courses, but access problems remain. In 2022, 201,000 AP CS exams were administered. Since 2007, the number of students taking these exams has increased more than tenfold. However, recent evidence indicates that students in larger high schools and those in suburban areas are more likely to have access to CS courses.
6. AI-related degree programs are on the rise internationally. The number of English-language, AI-related postsecondary degree programs has tripled since 2017, showing a steady annual increase over the past five years. Universities worldwide are offering more AI-focused degree programs.
Chapter 6: Education (cont'd)
7. The United Kingdom and Germany lead in European informatics, CS, CE, and IT graduate production. The United Kingdom and Germany lead Europe in producing the highest number of new informatics, CS, CE, and IT bachelor's, master's, and PhD graduates. On a per capita basis, Finland leads in the production of both bachelor's and PhD graduates, while Ireland leads in the production of master's graduates.
Report Highlights
Chapter 7: Policy and Governance
1. The number of AI regulations in the United States sharply increases. The number of AI-related regulations has risen significantly in the past year and over the last five years. In 2023, there were 25 AI-related regulations, up from just one in 2016. Last year alone, the total number of AI-related regulations grew by .
2. The United States and the European Union advance landmark AI policy action. In 2023, policymakers on both sides of the Atlantic put forth substantial proposals for advancing AI regulation. The European Union reached a deal on the terms of the AI Act, a landmark piece of legislation enacted in 2024. Meanwhile, President Biden signed an Executive Order on AI, the most notable AI policy initiative in the United States that year.
3. AI captures U.S. policymaker attention. The year 2023 witnessed a remarkable increase in AI-related legislation at the federal level, with 181 bills proposed, more than double the 88 proposed in 2022.
4. Policymakers across the globe cannot stop talking about AI. Mentions of AI in legislative proceedings across the globe have nearly doubled, rising from 1,247 in 2022 to 2,175 in 2023. AI was mentioned in the legislative proceedings of 49 countries in 2023. Moreover, at least one country from every continent discussed AI in 2023, underscoring the truly global reach of AI policy discourse.
5. More regulatory agencies turn their attention toward AI. The number of U.S. regulatory agencies issuing AI regulations increased to 21 in 2023 from 17 in 2022, indicating a growing concern over AI regulation among a broader array of American regulatory bodies. Some of the new regulatory agencies that enacted AI-related regulations for the first time in 2023 include the Department of Transportation, the Department of Energy, and the Occupational Safety and Health Administration.
Report Highlights
Chapter 8: Diversity
1. U.S. and Canadian bachelor's, master's, and PhD CS students continue to grow more ethnically diverse. While white students continue to be the most represented ethnicity among new resident graduates at all three levels, the representation of other ethnic groups, such as Asian, Hispanic, and Black or African American students, continues to grow. For instance, since 2011, the proportion of Asian CS bachelor's degree graduates has increased by 19.8 percentage points, and the proportion of Hispanic CS bachelor's degree graduates has grown by 5.2 percentage points.
2. Substantial gender gaps persist in European informatics, CS, CE, and IT graduates at all educational levels. Every surveyed European country reported more male than female graduates in bachelor's, master's, and PhD programs for informatics, CS, CE, and IT. While the gender gaps have narrowed in most countries over the last decade, the rate of this narrowing has been slow.
3. U.S. K-12 CS education is growing more diverse, reflecting changes in both gender and ethnic representation. The proportion of AP CS exams taken by female students rose from in 2007 to 30.5% in 2022. Similarly, the participation of Asian, Hispanic/Latino/Latina, and Black/African American students in AP CS has consistently increased year over year.
Report Highlights
Chapter 9: Public Opinion
1. People across the globe are more cognizant of AI's potential impact, and more nervous. A survey from Ipsos shows that, over the last year, the proportion of those who think AI will dramatically affect their lives in the next three to five years has increased from to . Moreover, express nervousness toward AI products and services, marking a 13 percentage point rise from 2022. In America, Pew data suggests that of Americans report feeling more concerned than excited about AI, rising from 37% in 2022.
2. AI sentiment in Western nations continues to be low, but is slowly improving. In 2022, several developed Western nations, including Germany, the Netherlands, Australia, Belgium, Canada, and the United States, were among the least positive about AI products and services. Since then, each of these countries has seen a rise in the proportion of respondents acknowledging the benefits of AI, with the Netherlands experiencing the most significant shift.
3. The public is pessimistic about AI's economic impact. In an Ipsos survey, only of respondents feel AI will improve their job. Only anticipate AI will boost the economy, and believe it will enhance the job market.
4. Demographic differences emerge regarding AI optimism. Significant demographic differences exist in perceptions of AI's potential to enhance livelihoods, with younger generations generally more optimistic. For instance, of Gen Z respondents believe AI will improve entertainment options, versus only of baby boomers. Additionally, individuals with higher incomes and education levels are more optimistic about AI's positive impacts on entertainment, health, and the economy than their lower-income and less-educated counterparts.
5. ChatGPT is widely known and widely used. An international survey from the University of Toronto suggests that of respondents are aware of ChatGPT. Of those aware, around half report using ChatGPT at least once weekly.
CHAPTER 1: Research and Development
Preview
Overview
Chapter Highlights
1.1 Publications
Overview
Total Number of AI Publications
By Type of Publication
By Field of Study
By Sector
AI Journal Publications
AI Conference Publications
1.2 Patents
AI Patents
Overview
By Filing Status and Region
1.3 Frontier AI Research
General Machine Learning Models
Overview
Sector Analysis
National Affiliation
Parameter Trends
Compute Trends
Highlight: Will Models Run Out of Data?
Foundation Models
Model Release
Organizational Affiliation
National Affiliation
Training Cost
1.4 AI Conferences
Conference Attendance
1.5 Open-Source AI Software
Projects
Stars
This chapter studies trends in AI research and development. It begins by examining trends in AI publications and patents, and then examines trends in notable AI systems and foundation models. It concludes by analyzing AI conference attendance and open-source AI software projects.
Chapter Highlights
1. Industry continues to dominate frontier AI research. In 2023, industry produced 51 notable machine learning models, while academia contributed only 15. There were also 21 notable models resulting from industry-academia collaborations in 2023, a new high.
2. More foundation models and more open foundation models. In 2023, a total of 149 foundation models were released, more than double the amount released in 2022. Of these newly released models, were open-source, compared to only in 2022 and in 2021.
3. Frontier models get way more expensive. According to AI Index estimates, the training costs of state-of-the-art AI models have reached unprecedented levels. For example, OpenAI's GPT-4 used an estimated million worth of compute to train, while Google's Gemini Ultra cost million for compute.
4. The United States leads China, the EU, and the U.K. as the leading source of top AI models. In 2023, 61 notable AI models originated from U.S.-based institutions, far outpacing the European Union's 21 and China's 15.
5. The number of AI patents skyrockets. From 2021 to 2022, AI patent grants worldwide increased sharply by . Since 2010, the number of granted AI patents has increased more than 31 times.
6. China dominates AI patents. In 2022, China led global AI patent origins with , significantly outpacing the United States, which accounted for of AI patent origins. Since 2010, the U.S. share of AI patents has decreased from .
7. Open-source AI research explodes. Since 2011, the number of AI-related projects on GitHub has seen a consistent increase, growing from 845 in 2011 to approximately 1.8 million in 2023. Notably, there was a sharp 59.3% rise in the total number of GitHub AI projects in 2023 alone. The total number of stars for AI-related projects on GitHub also significantly increased in 2023, more than tripling from 4.0 million in 2022 to 12.2 million.
8. The number of AI publications continues to rise. Between 2010 and 2022, the total number of AI publications nearly tripled, rising from approximately 88,000 in 2010 to more than 240,000 in 2022. The increase over the last year was a modest .
1.1 Publications
Overview
The figures below present the global count of English- and Chinese-language AI publications from 2010 to 2022, categorized by type of affiliation and cross-sector collaborations. Additionally, this section details publication data for journal articles and conference papers.
Total Number of AI Publications
Figure 1.1.1 displays the global count of AI publications. Between 2010 and 2022, the total number of AI publications nearly tripled, rising from approximately 88,000 in 2010 to more than 240,000 in 2022. The increase over the last year was a modest .
Figure 1.1.1: Number of AI publications in the world, 2010-22
1 The data on publications presented this year is sourced from CSET. Both the methodology and data sources used by CSET to classify AI publications have changed since their data was last featured in the AI Index (2023). As a result, the numbers reported in this year's section differ slightly from those reported in last year's edition. Moreover, the AI-related publication data is fully available only up to 2022 due to a significant lag in updating publication data. Readers are advised to approach publication figures with appropriate caution.
By Type of Publication 按出版物类型
Figure 1.1.2 illustrates the distribution of AI publication types globally over time. In 2022, journal articles were the most common type of AI publication, far outnumbering the roughly 42,000 conference submissions. Since 2015, AI journal and conference publications have increased at comparable rates: in 2022, there were 2.6 times as many conference publications and 2.4 times as many journal publications as there were in 2015.
Number of Al publications by type, 2010-22 2010-22 年各类别 Al 出版物数量
Source: Center for Security and Emerging Technology, 2023 | Chart: 2024 AI Index report
By Field of Study 按研究领域
Figure 1.1.3 examines the total number of AI publications by field of study since 2010. Machine learning publications have seen the most rapid growth over the past decade, increasing nearly sevenfold since 2015. Following machine learning, the most published AI fields in 2022 were computer vision (21,309 publications), pattern recognition (19,841), and process management.
Number of Al publications by field of study (excluding Other AI), 2010-22 2010 年至 2022 年各研究领域的 AI 出版物数量(不包括其他 AI)
Source: Center for Security and Emerging Technology, 2023 | Chart: 2024 Al Index report 源自:安全与新兴技术中心,2023 年 | 图表:2024 年 Al 指数报告
Figure 1.1.3 图 1.1.3
By Sector 按行业
This section presents the distribution of AI publications by sector (education, government, industry, nonprofit, and other), first globally and then specifically within the United States, China, and the European Union plus the United Kingdom. In 2022, the academic sector contributed the majority of AI publications (81.1%), maintaining its position as the leading global source of AI research over the past decade across all regions (Figure 1.1.4 and Figure 1.1.5). Industry participation is most significant in the United States, followed by the European Union plus the United Kingdom, and China (Figure 1.1.5).
Figure 1.1.4 图 1.1.4
Al publications (% of total) by sector and geographic area, 2022 2022 年按部门和地理区域划分的出版物(总量的百分比)
Figure 1.1.5 图 1.1.5
Al Journal Publications Al 期刊出版物
Figure 1.1.6 illustrates the total number of Al journal publications from 2010 to 2022. The number of journal publications experienced modest growth from 2010 to 2015 but grew approximately 2.4 times since 2015 . 图 1.1.6 展示了 2010 年至 2022 年间 Al 期刊出版物的总数量。从 2010 年到 2015 年,Al 期刊出版物数量经历了适度增长,但自 2015 年以来增长了约 2.4 倍。
Between 2021 and 2022, Al journal publications saw a 4.5% increase. 2021 年至 2022 年间,AI 期刊出版物增加了 4.5%。
Number of Al journal publications, 2010-22 2010 年至 2022 年 AI 期刊出版物数量
Source: Center for Security and Emerging Technology, 2023 | Chart: 2024 Al Index report 源自:安全与新兴技术中心,2023 年 | 图表:2024 年 Al 指数报告
Al Conference Publications 人工智能会议出版物
Figure 1.1.7 visualizes the total number of AI conference publications since 2010. The number of AI conference publications has seen a notable rise in the past two years, climbing from 22,727 in 2020 to 31,629 in 2021, and reaching 41,174 in 2022. Over the last year alone, conference publications grew by roughly 30%. Since 2010, the number of AI conference publications has more than doubled.
Number of Al conference publications, 2010-22 2010 年至 2022 年的人工智能会议出版物数量
Source: Center for Security and Emerging Technology, 2023 | Chart: 2024 Al Index report 源自:安全与新兴技术中心,2023 年 | 图表:2024 年 Al 指数报告
This section examines trends over time in global Al patents, which can reveal important insights into the evolution of innovation, research, and development within Al. Additionally, analyzing Al patents can reveal how these advancements are distributed globally. Similar to the publications data, there is a noticeable delay in AI patent data availability, with 2022 being the most recent year for which data is accessible. The data in this section comes from CSET. 本节旨在分析全球人工智能专利的时间趋势,这可以揭示关于人工智能创新、研究和发展演变的重要见解。此外,分析人工智能专利可以揭示这些进展在全球的分布情况。与出版数据类似,人工智能专利数据的可用性存在明显的延迟,截至 2022 年为止的数据是最新可获取的。本节数据来自 CSET。
1.2 Patents 1.2 专利
Al Patents 所有专利
Overview 概述
Figure 1.2.1 examines the global growth in granted AI patents from 2010 to 2022. Over the last decade, there has been a significant rise in the number of granted AI patents, with a particularly sharp increase in recent years. For instance, growth in granted AI patents between 2010 and 2014 was comparatively modest, whereas from 2021 to 2022 alone the number of granted AI patents increased by 62.7%.
Number of Al patents granted, 2010-22 2010-22 年授予的 Al 专利数量
Source: Center for Security and Emerging Technology, 2023 | Chart: 2024 Al Index report 源自:安全与新兴技术中心,2023 年 | 图表:2024 年 Al 指数报告
By Filing Status and Region 按申请状态和地区
The following section disaggregates patents by their filing status (whether they were granted or not granted), as well as the region of their publication. 以下部分按其申请状态(已授予或未授予)以及发布地区对 专利进行了细分。
Figure 1.2.2 compares global AI patents by application status. In 2022, the number of ungranted AI patents was more than double the number granted. Over time, the landscape of AI patent approvals has shifted markedly. Until 2015, a larger proportion of filed AI patents were granted. Since then, however, the majority of AI patent filings have not been granted, and the gap has widened significantly: the share of filed AI patents that were not granted rose substantially between 2015 and 2022.
AI patents by application status, 2010-22
The gap between granted and not granted patents is evident across all major patent-originating geographic areas, including China, the European Union and United Kingdom, and the United States 授予和未授予的 专利之间的差距在所有主要专利产生地区都很明显,包括中国、欧盟和英国以及美国
(Figure 1.2.3). In recent years, all three geographic areas have experienced an increase in both the total number of Al patent filings and the number of patents granted. (图 1.2.3)。近年来,这三个地理区域在 Al 专利申请总数和授予专利数方面都出现了增加。
Al patents by application status by geographic area, 2010-22 2010 年至 2022 年各地区按申请状态划分的 Al 专利
Source: Center for Security and Emerging Technology, 2023 | Chart: 2024 Al Index report 源自:安全与新兴技术中心,2023 年 | 图表:2024 年 Al 指数报告
Figure 1.2.4 showcases the regional breakdown of granted AI patents. As of 2022, the bulk of the world's granted AI patents originated from East Asia and the Pacific, with North America the next-largest contributor. Up until 2011, North America led in the number of global AI patents. Since then, however, there has been a significant shift toward an increasing proportion of AI patents originating from East Asia and the Pacific.
Granted Al patents (% of world total) by region, 2010-22 按地区划分,2010-22 年各地区授予的 Al 专利(占世界总量的比例)
Disaggregated by geographic area, the majority of the world's granted AI patents are from China (61.1%) and the United States (20.9%) (Figure 1.2.5). The share of AI patents originating from the United States has declined considerably since 2010.
Granted Al patents (% of world total) by geographic area, 2010-22 按地理区域划分,2010-22 年各地区授予的 Al 专利(占世界总量的比例)
Figure 1.2.6 and Figure 1.2.7 document which countries lead in Al patents per capita. In 2022, the country with the most granted Al patents per 100,000 inhabitants was South Korea (10.3), followed by Luxembourg (8.8) and the United States (4.2) 图 1.2.6 和图 1.2.7 记录了哪些国家在人均 Al 专利方面处于领先地位。2022 年,每 10 万居民中拥有最多 Al 专利的国家是韩国(10.3),其次是卢森堡(8.8)和美国(4.2)。
(Figure 1.2.6). Figure 1.2.7 highlights the change in granted Al patents per capita from 2012 to 2022. Singapore, South Korea, and China experienced the greatest increase in Al patenting per capita during that time period. (图 1.2.6)。图 1.2.7 突出显示了 2012 年至 2022 年间每人均获得的 Al 专利数量的变化。新加坡、韩国和中国在那段时间内经历了 Al 专利每人均增长最多。
Granted Al patents per 100,000 inhabitants by country, 2022 2022 年各国每 10 万居民获得的 Al 专利数量
Source: Center for Security and Emerging Technology, 2023 | Chart: 2024 Al Index report 源自:安全与新兴技术中心,2023 年 | 图表:2024 年 Al 指数报告
Percentage change of granted Al patents per 100,000 inhabitants by country, 2012 vs. 2022 每 10 万居民获得的 Al 专利授权数量的百分比变化,按国家划分,2012 年对比 2022 年
This section explores the frontier of Al research. While many new Al models are introduced annually, only a small sample represents the most advanced research. Admittedly what constitutes advanced or frontier research is somewhat subjective. Frontier research could reflect a model posting a new state-of-the-art result on a benchmark, introducing a meaningful new architecture, or exercising some impressive new capabilities. 本节探讨了 Al 研究的前沿。虽然每年都会推出许多新的 Al 模型,但只有少数代表了最先进的研究。诚然,什么构成先进或前沿研究在某种程度上是主观的。前沿研究可能反映了一个模型在基准测试中发布了新的最先进结果,引入了一个有意义的新架构,或者展示了一些令人印象深刻的新能力。
The Al Index studies trends in two types of frontier Al models: "notable models" and foundation models. Epoch, an Al Index data provider, uses the term "notable machine learning models" to designate noteworthy models handpicked as being particularly influential within the AI/machine learning ecosystem. In contrast, foundation models are exceptionally large Al models trained on massive datasets, capable of performing a multitude of downstream tasks. Examples of foundation models include GPT-4, Claude 3, and Gemini. While many foundation models may qualify as notable models, not all notable models are foundation models. Al 指数研究两种类型的前沿 Al 模型的趋势:"显著模型"和基础模型。 Epoch,一个 Al 指数数据提供商,使用术语"显著机器学习模型"来指定被手工挑选为在 AI/机器学习生态系统中特别有影响力的显著模型。相比之下,基础模型是在大规模数据集上训练的异常庞大的 Al 模型,能够执行多种下游任务。基础模型的示例包括 GPT-4、Claude 3 和 Gemini。虽然许多基础模型可能符合显著模型的条件,但并非所有显著模型都是基础模型。
Within this section, the Al Index explores trends in notable models and foundation models from various perspectives, including originating organization, country of origin, parameter count, and compute usage. The analysis concludes with an examination of machine learning training costs. 在本节中,Al 指数从各种角度探讨显著模型和基础模型的趋势,包括起源组织、起源国家、参数数量和计算使用情况。分析最后对机器学习训练成本进行了审查。
1.3 Frontier Al Research 1.3 前沿 Al 研究
General Machine Learning Models 通用机器学习模型
Overview 概述
Epoch AI is a group of researchers dedicated to studying and predicting the evolution of advanced AI. They maintain a database of AI and machine learning models released since the 1950s, selecting entries based on criteria such as state-of-the-art advancements, historical significance, or high citation rates. Analyzing these models provides a comprehensive overview of the machine learning landscape's evolution, both in recent years and over the past few decades. Some models may be missing from the dataset; however, the dataset can still reveal trends in relative terms.
3 "Al system" refers to a computer program or product based on AI, such as ChatGPT. "Al model" refers to a collection of parameters whose values are learned during training, such as GPT-4. 4 New and historic models are continually added to the Epoch database, so the total year-by-year counts of models included in this year's Al Index might not exactly match those published in last year's report. "AI 系统"指的是基于人工智能的计算机程序或产品,例如 ChatGPT。"AI 模型"指的是在训练过程中学习到的一组参数值,例如 GPT-4。新的和历史模型不断被添加到 Epoch 数据库中,因此本年度 AI 指数中包含的模型总数可能与去年报告中发布的数量不完全匹配。
Sector Analysis 领域分析
Until 2014, academia led in the release of machine learning models. Since then, industry has taken the lead. In 2023, there were 51 notable machine learning models produced by industry compared to just 15 from academia (Figure 1.3.1). Significantly, 21 notable models resulted from industry/academic collaborations in 2023, a new high. 直到 2014 年,学术界主导了机器学习模型的发布。从那时起,工业界开始领先。2023 年,工业界发布了 51 个显著的机器学习模型,而学术界仅发布了 15 个(图 1.3.1)。值得注意的是,2023 年有 21 个显著模型是由工业界和学术界合作产生的,创下新高。
Creating cutting-edge models now demands a substantial amount of data, computing power, and financial resources that are not available in academia. This shift toward increased industrial dominance in leading AI models was first highlighted in last year's Al Index report. Although this year the gap has slightly narrowed, the trend largely persists. 创建尖端 模型现在需要大量的数据、计算能力和财务资源,这些资源在学术界是不可用的。去年的 AI 指数报告首次突显了领先 AI 模型中增加工业主导地位的趋势。尽管今年差距略有缩小,但这一趋势基本上仍然存在。
Number of notable machine learning models by sector, 2003-23 各行业显著机器学习模型数量,2003-23
National Affiliation 国家隶属
To illustrate the evolving geopolitical landscape of AI, the AI Index research team analyzed the country of origin of notable models.
Figure 1.3.2 displays the total number of notable machine learning models attributed to the location of researchers' affiliated institutions. 图 1.3.2 显示了与研究人员所属机构位置相关的显著机器学习模型总数。
In 2023, the United States led with 61 notable machine learning models, followed by China with 15, and France with 8. For the first time since 2019, the European Union and the United Kingdom together have surpassed China in the number of notable Al models produced (Figure 1.3.3). Since 2003, the United States has produced more models than other major geographic regions such as the United Kingdom, China, and Canada (Figure 1.3.4). 2023 年,美国以 61 个显著机器学习模型领先,其次是中国 15 个,法国 8 个。自 2019 年以来,欧盟和英国联合超过中国在生产显著 Al 模型数量上首次超过(图 1.3.3)。自 2003 年以来,美国生产的模型数量超过其他主要地理区域,如英国、中国和加拿大(图 1.3.4)。
Figure 1.3.2 图 1.3.2
Number of notable machine learning models by select geographic area, 2003-23 2003 年至 2023 年各地区知名机器学习模型数量
Number of notable machine learning models by geographic area, 2003-23 (sum) 2003 年至 2023 年各地区知名机器学习模型数量总和
Source: Epoch, 2023 | Chart: 2024 Al Index report 来源:Epoch,2023 | 图表:2024 年 Al 指数报告
Figure 1.3.4 图 1.3.4
Parameter Trends 参数趋势
Parameters in machine learning models are numerical values learned during training that determine how a model interprets input data and makes predictions. Models trained on more data will usually have more parameters than those trained on less data. Likewise, models with more parameters typically outperform those with fewer parameters. 机器学习模型中的参数是在训练过程中学习到的数值,它们决定了模型如何解释输入数据并进行预测。使用更多数据训练的模型通常会比使用较少数据训练的模型具有更多的参数。同样,具有更多参数的模型通常会胜过具有较少参数的模型。
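As a minimal sketch of what "parameter" means in practice (assuming PyTorch is available; the network below is purely illustrative), the snippet counts the learnable weights and biases of a tiny two-layer model. Notable models apply exactly the same idea, but at billions of parameters.

```python
import torch.nn as nn

# A small illustrative network: every weight and bias below is a learnable parameter.
model = nn.Sequential(
    nn.Linear(256, 512),   # 256*512 weights + 512 biases
    nn.ReLU(),
    nn.Linear(512, 10),    # 512*10 weights + 10 biases
)

n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params:,} learnable parameters")  # 136,714
```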
Figure 1.3.5 demonstrates the parameter count of machine learning models in the Epoch dataset, categorized by the sector from which the models originate. Parameter counts have risen sharply since the early 2010s, reflecting the growing complexity of tasks Al models are designed for, the greater availability of data, improvements in hardware, and proven efficacy of larger models. High-parameter models are particularly notable in the industry sector, underscoring the capacity of companies like OpenAl, Anthropic, and Google to bear the computational costs of training on vast volumes of data. 图 1.3.5 展示了 Epoch 数据集中机器学习模型的参数数量,按照模型来源的领域进行分类。自 2010 年代初以来,参数数量急剧上升,反映了 Al 模型设计任务的复杂性增加,数据的更大可用性,硬件的改进以及更大模型的有效性。高参数模型在工业领域尤为显著,突显了像 OpenAl、Anthropic 和 Google 这样的公司承担训练大量数据计算成本的能力。
Number of parameters of notable machine learning models by sector, 2003-23 2003-23 年间各领域知名机器学习模型的参数数量
Source: Epoch, 2023 | Chart: 2024 Al Index report 来源:Epoch,2023 | 图表:2024 年 Al 指数报告
Compute Trends 计算趋势
The term "compute" in Al models denotes the computational resources required to train and operate a machine learning model. Generally, the complexity of the model and the size of the training dataset directly influence the amount of compute needed. The more complex a model is, and the larger the underlying training data, the greater the amount of compute required for training. 在 Al 模型中,“计算”一词表示训练和操作机器学习模型所需的计算资源。通常,模型的复杂性和训练数据集的大小直接影响所需的计算量。模型越复杂,底层训练数据越大,训练所需的计算量就越大。
Figure 1.3.6 visualizes the training compute required for notable machine learning models in the last 20 years. Recently, the compute usage of notable Al models has increased exponentially. This trend has been especially pronounced in the last five years. This rapid rise in compute demand has critical implications. For instance, models requiring more computation often have larger environmental footprints, and companies typically have more access to computational resources than academic institutions. 图 1.3.6 展示了过去 20 年中知名机器学习模型所需的训练计算量。最近,知名人工智能模型的计算使用量呈指数增长。这一趋势在过去五年中尤为明显。计算需求的快速增长具有重要的影响。例如,需要更多计算的模型通常具有更大的环境足迹,而公司通常比学术机构更容易获得计算资源。
Training compute of notable machine learning models by sector, 2003-23 2003-23 年间各领域知名机器学习模型的训练计算量
Source: Epoch, 2023 | Chart: 2024 Al Index report 来源:Epoch,2023 | 图表:2024 年 Al 指数报告
6 FLOP stands for "floating-point operation." A floating-point operation is a single arithmetic operation involving floating-point numbers, such as addition, subtraction, multiplication, or division. The number of FLOPs a processor or computer can perform per second is an indicator of its computational power. The higher the FLOP rate, the more powerful the computer is. An AI model with a higher FLOP rate reflects its requirement for more computational resources during training. 6 FLOP 代表“浮点运算”。浮点运算是涉及浮点数的单个算术运算,如加法、减法、乘法或除法。处理器或计算机每秒可以执行的 FLOP 数量是其计算能力的指标。 FLOP 速率越高,计算机的性能就越强大。具有更高 FLOP 速率的 AI 模型反映了其在训练过程中对更多计算资源的需求。
Figure 1.3.7 highlights the training compute of notable machine learning models since 2012. For example, AlexNet, one of the papers that popularized the now standard practice of using GPUs to improve Al models, required an estimated 470 petaFLOPs for training. 图 1.3.7 突出了自 2012 年以来值得注意的机器学习模型的训练计算。例如,AlexNet 是一篇推广现在标准做法的论文,即使用 GPU 来改进 Al 模型,其训练估计需要 470 petaFLOPs。
The original Transformer, released in 2017, required around 7,400 petaFLOPs. Google's Gemini Ultra, one of the current state-of-the-art foundation models, required 50 billion petaFLOPs. 最初的 Transformer 于 2017 年发布,大约需要 7,400 petaFLOPs。 Google 的 Gemini Ultra 是当前最先进的基础模型之一,需要 500 亿 petaFLOPs。
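A common rule of thumb for dense transformer models, which is a general heuristic rather than a figure taken from Epoch, approximates training compute as roughly 6 FLOPs per parameter per training token. The sketch below applies that heuristic to hypothetical model sizes to show why training compute climbs so steeply as models and datasets grow.

```python
def training_flops(parameters: float, tokens: float) -> float:
    """Rough heuristic: ~6 FLOPs per parameter per training token (dense transformer)."""
    return 6 * parameters * tokens

PETA = 1e15

# Hypothetical model scales, chosen only to illustrate the arithmetic.
for name, params, tokens in [
    ("small model", 1e8, 1e10),            # 100M parameters, 10B tokens
    ("mid-size model", 1e10, 1e12),        # 10B parameters, 1T tokens
    ("frontier-scale model", 1e12, 1e13),  # 1T parameters, 10T tokens
]:
    flops = training_flops(params, tokens)
    print(f"{name}: ~{flops / PETA:,.0f} petaFLOPs")
```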
Training compute of notable machine learning models by domain, 2012-23 2012-23 年领域内知名机器学习模型的训练计算
Source: Epoch, 2023 | Chart: 2024 Al Index report 来源:Epoch,2023 | 图表:2024 年 Al 指数报告
Highlight: 突出显示:
Will Models Run Out of Data? 模型会用尽数据吗?
As illustrated above, a significant proportion of recent algorithmic progress, including progress behind powerful LLMs, has been achieved by training models on increasingly larger amounts of data. As noted recently by Anthropic cofounder and AI Index Steering Committee member Jack Clark, foundation models have been trained on meaningful percentages of all the data that has ever existed on the internet. 如上所述,最近算法进展的显著部分,包括强大的LLMs背后的进展,是通过在越来越大量的数据上训练模型实现的。正如 Anthropic 联合创始人兼 AI 指数指导委员会成员杰克·克拉克最近指出的,基础模型已经在互联网上存在的所有数据中训练了有意义的百分比。
The growing data dependency of Al models has led to concerns that future generations of computer scientists will run out of data to further scale and improve their systems. Research from Epoch suggests that these concerns are somewhat warranted. Epoch researchers have generated historical and compute-based projections for when Al researchers might expect to run out of data. The historical projections are based on observed growth rates in the sizes of data used to train foundation models. The compute projections adjust the historical growth rate based on projections of compute availability. Al 模型日益依赖数据,这引发了人们对未来计算机科学家会用尽数据以进一步扩展和改进系统的担忧。Epoch 的研究表明,这些担忧在一定程度上是有根据的。Epoch 的研究人员已经为 Al 研究人员何时可能用尽数据生成了历史和基于计算的预测。历史预测基于用于训练基础模型的数据大小的增长率。计算预测根据计算可用性的预测调整历史增长率。
For instance, the researchers estimate that computer scientists could deplete the stock of high-quality language data by 2024, exhaust lowquality language data within two decades, and use up image data by the late 2030s to mid-2040s (Figure 1.3.8). 例如,研究人员估计计算机科学家可能在 2024 年之前耗尽高质量语言数据,两十年内用尽低质量语言数据,并在 2030 年代末至 2040 年代中期用尽图像数据(图 1.3.8)。
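The projection logic is essentially compound growth: if the amount of data consumed by the largest training runs grows at a fixed annual rate, the point at which the remaining stock is exhausted follows directly. The sketch below illustrates that arithmetic with placeholder numbers, not Epoch's actual estimates.

```python
import math

def years_until_exhaustion(stock_tokens: float, used_tokens: float, annual_growth: float) -> float:
    """Solve used * growth**t = stock for t (simple compounding, ignoring newly created data)."""
    return math.log(stock_tokens / used_tokens) / math.log(annual_growth)

# Placeholder inputs purely for illustration; Epoch's projections use observed growth rates.
stock = 5e14   # assumed stock of usable tokens
used = 1e13    # assumed tokens consumed by today's largest training runs
growth = 2.0   # assumed doubling of training data per year

print(f"~{years_until_exhaustion(stock, used, growth):.1f} years of headroom")  # about 5.6 years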
Figure 1.3.8 (table): Projections of ML data exhaustion by stock type (median exhaustion dates), comparing historical and compute-based projections for low-quality language stock, high-quality language stock, and image stock. Source: Epoch, 2023 | Table: 2024 AI Index report.

Theoretically, the challenge of limited data availability can be addressed by using synthetic
data, which is data generated by AI models themselves. For example, it is possible to use text produced by one LLM to train another LLM. The use of synthetic data for training AI systems is particularly attractive, not only as a solution for potential data depletion but also because generative AI systems could, in principle, generate data in instances where naturally occurring data is sparse, such as data for rare diseases or underrepresented populations. Until recently, the feasibility and effectiveness of using synthetic data for training generative AI systems were not well understood. However, research this year has suggested that there are limitations associated with training models on synthetic data.
For instance, a team of British and Canadian researchers discovered that models predominantly trained on synthetic data experience model collapse, a phenomenon where, over time, they lose the ability to remember true underlying data distributions and start producing a narrow range of 例如,一组英国和加拿大研究人员发现,主要在合成数据上训练的模型会经历模型崩溃,这是一种现象,随着时间的推移,它们失去了记住真实基础数据分布的能力,并开始产生一种狭窄范围的
outputs. Figure 1.3.9 demonstrates the process of model collapse in a variational autoencoder (VAE) model, a widely used generative architecture. With each subsequent generation trained on additional synthetic data, the model produces an increasingly limited set of outputs. As illustrated in Figure 1.3.10, in statistical terms, as the number of synthetic generations increases, the tails of the distributions vanish, and the generation density shifts toward the mean. This pattern means that over time, the generations of models trained predominantly on synthetic data become less varied and are not as widely distributed. 输出。图 1.3.9 展示了变分自动编码器(VAE)模型中模型坍缩的过程,这是一种广泛使用的生成 架构。随着每一代在额外的合成数据上训练,模型产生的输出集变得越来越有限。正如图 1.3.10 所示,在统计学术语中,随着合成代数的增加,分布的尾部消失,生成密度向均值偏移。 这种模式意味着随着时间的推移,主要在合成数据上训练的模型的生成变得不那么多样化,分布也不那么广泛。
The authors demonstrate that this phenomenon occurs across various model types, including Gaussian Mixture Models and LLMs. This research underscores the continued importance of humangenerated data for training capable LLMs that can produce a diverse array of content. 作者证明了这种现象发生在各种模型类型中,包括高斯混合模型和 LLMs。这项研究强调了人工生成数据对训练能够产生多样内容的 LLMs 的持续重要性。
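The statistical intuition can be reproduced with a toy experiment: repeatedly fit a simple distribution to samples produced by the previous generation's fit, while slightly under-sampling rare values, and the spread of the data collapses toward the mean. This is only a simplified illustration in the spirit of the cited work, not the authors' actual VAE or LLM setup; the tail clipping below is an added simplification that makes the effect visible in a few generations.

```python
import numpy as np

rng = np.random.default_rng(0)

# Generation 0: "real" data drawn from a standard normal distribution.
data = rng.normal(0.0, 1.0, size=10_000)

for generation in range(8):
    mu, sigma = data.mean(), data.std()
    print(f"generation {generation}: mean={mu:+.3f}, std={sigma:.3f}")
    # The next generation is trained only on synthetic samples from the current fit.
    # To mimic a generative model that under-samples rare events, the "model" also
    # clips its tails; the spread shrinks generation by generation.
    synthetic = rng.normal(mu, sigma, size=10_000)
    data = synthetic[np.abs(synthetic - mu) < 2 * sigma]
```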
A demonstration of model collapse in a VAE VAE 中模型坍缩的演示
In a similar study published in 2023 on the use of synthetic data in generative imaging models, researchers found that generative image models trained solely on synthetic data cycles-or with insufficient real human data-experience a significant drop in output quality. The authors label this phenomenon Model Autophagy Disorder (MAD), in reference to mad cow disease. 在一项类似的研究中,研究人员发现,仅在合成数据上训练的生成图像模型,或者在真实人类数据不足的情况下训练的生成图像模型,会出现显著的输出质量下降。作者将这一现象称为模型自噬障碍(MAD),这是指疯牛病。
The study examines two types of training processes: fully synthetic, where models are trained exclusively on synthetic data, and synthetic augmentation, where models are trained on a mix of synthetic and real data. In both scenarios, as the number of training generations increases, the quality of the generated images declines. Figure 1.3.11 highlights the degraded image generations of models that are augmented with synthetic data; for example, the faces generated in steps 7 and 9 increasingly display strange-looking hash marks. From a statistical perspective, images generated with both synthetic data and synthetic augmentation loops have higher FID scores (indicating less similarity to real images), lower precision scores (signifying reduced realism or quality), and lower recall scores (suggesting decreased diversity) (Figure 1.3.12). While synthetic augmentation loops, which incorporate some real data, show less degradation than fully synthetic loops, both methods exhibit diminishing returns with further training. 这项研究探讨了两种训练过程:完全合成,其中模型仅在合成数据上进行训练,以及合成增强,其中模型在合成数据和真实数据的混合上进行训练。在这两种情景中,随着训练代数的增加,生成的图像质量下降。图 1.3.11 突出显示了使用合成数据增强的模型的降级图像生成;例如,在第 7 步和第 9 步生成的人脸逐渐显示出奇怪的哈希标记。从统计角度来看,使用合成数据和合成增强循环生成的图像具有更高的 FID 分数(表示与真实图像的相似性较低),更低的精度分数(表示现实性或质量降低),以及更低的召回分数(表明多样性减少)(图 1.3.12)。虽然合成增强循环,其中包含一些真实数据,显示出比完全合成循环更少的退化,但两种方法在进一步训练中都表现出递减的回报。
An example of MAD in image-generation models 图像生成模型中 MAD 的示例
Source: Alemohammad et al., 2023 来源:Alemohammad 等人,2023 年
Figure 1.3.11 图 1.3.11
Assessing FFHQ syntheses: FID, precision, and recall in synthetic and mixed-data training loops Source: Alemohammad et al., 2023 | Chart: 2024 Al Index report 评估 FFHQ 合成:FID,精度和召回在合成和混合数据训练循环中的表现 Source: Alemohammad 等人,2023 | 图表:2024 Al 指数报告
Figure 1.3.12 图 1.3.12
Foundation Models 基础模型
Foundation models represent a rapidly evolving and popular category of Al models. Trained on vast datasets, they are versatile and suitable for numerous downstream applications. Foundation models such as GPT-4, Claude 3, and Llama 2 showcase remarkable abilities and are increasingly being deployed in realworld scenarios. 基础模型代表了一类快速发展且受欢迎的 AI 模型。经过大量数据集的训练,它们具有多功能性,适用于许多下游应用。诸如 GPT-4、Claude 3 和 Llama 2 等基础模型展示了卓越的能力,并越来越多地被部署在现实场景中。
Introduced in 2023, the Ecosystem Graphs is a new community resource from Stanford that tracks the foundation model ecosystem, including datasets, models, and applications. This section uses data from the Ecosystem Graphs to study trends in foundation models over time. 生态系统图是斯坦福的一个新社区资源,于 2023 年推出,用于跟踪基础模型生态系统,包括数据集、模型和应用程序。本节使用生态系统图的数据来研究基础模型随时间的发展趋势。
Model Release 模型发布
Foundation models can be accessed in different ways. No access models, like Google's PaLM-E, are only accessible to their developers. Limited access models, like OpenAl's GPT-4, offer limited access to the models, often through a public API. Open models, like Meta's Llama 2, fully release model weights, which means the models can be modified and freely used. 基础模型可以通过不同的方式访问。没有访问权限的模型,如谷歌的 PaLM-E,只能由其开发人员访问。有限访问模型,如 OpenAI 的 GPT-4,通过公共 API 通常提供对模型的有限访问。开放模型,如 Meta 的 Llama 2,完全释放模型权重,这意味着可以修改并自由使用模型。
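In practical terms, the access type determines how developers can work with a model. The hedged sketch below contrasts the two extremes, using the Hugging Face transformers library for an open-weight model; the specific model names are examples only, and the open Llama 2 weights are still gated behind a license agreement.

```python
# Open model: weights are published, so they can be downloaded, inspected, fine-tuned,
# and run locally (example: Meta's Llama 2; access requires accepting Meta's license).
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

inputs = tokenizer("Foundation models are", return_tensors="pt")
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0]))

# Limited-access model: weights never leave the provider; requests go to a hosted API
# (e.g., OpenAI's API for GPT-4) and only completions come back.
# No-access models (e.g., Google's PaLM-E) offer neither weights nor a public API.
```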
Figure 1.3.13 visualizes the total number of foundation models by access type since 2019. In recent years, the number of foundation models has risen sharply, more than doubling since 2022 and growing by a factor of nearly 38 since 2019. Of the 149 foundation models released in 2023, 98 were open, 23 offered limited access, and 28 offered no access.
Foundation models by access type, 2019-23 2019 年至 2023 年按访问类型分类的基础模型
Source: Bommasani et al., 2023 | Chart: 2024 Al Index report 来源:Bommasani 等人,2023 年 | 图表:2024 年 Al 指数报告
8 The Ecosystem Graphs make efforts to survey the global Al ecosystem, but it is possible that they underreport models from certain nations like South Korea and China. 8 生态系统图表努力调查全球 AI 生态系统,但可能会低估某些国家(如韩国和中国)的模型。
In 2023, the majority of foundation models were released as open access (65.8%), with 18.8% offering no access and 15.4% limited access (Figure 1.3.14). Since 2021, there has been a significant increase in the proportion of models released with open access.
Foundation models (% of total) by access type, 2019-23 基础模型(总数的百分比)按访问类型分类,2019-23
Organizational Affiliation 组织隶属
Figure 1.3.15 plots the sector from which foundation models have originated since 2019. In 2023, the large majority of foundation models originated from industry, while only a small share came from academia. Since 2019, an ever larger number of foundation models have come from industry.
Number of foundation models by sector, 2019-23 按部门划分的基础模型数量,2019-23
Figure 1.3.15 图 1.3.15
Figure 1.3.16 highlights the source of various foundation models that were released in 2023. Google introduced the most models (18), followed by Meta (11), and Microsoft (9). The academic institution that released the most foundation models in 2023 was UC Berkeley (3). 图 1.3.16 突出了 2023 年发布的各种基础模型的来源。谷歌推出了最多的模型(18 个),其次是 Meta(11 个)和微软(9 个)。2023 年发布最多基础模型的学术机构是加州大学伯克利分校(3 个)。
Number of foundation models by organization, 2023 按组织划分的基础模型数量,2023 年
Source: Bommasani et al., 2023 | Chart: 2024 Al Index report 来源:Bommasani 等人,2023 年 | 图表:2024 年 Al 指数报告
Since 2019, Google has led in releasing the most foundation models, with a total of 40, followed by OpenAl with 20 (Figure 1.3.17). Tsinghua University stands out as the top non-Western institution, with seven foundation model releases, while Stanford University is the leading American academic institution, with five releases. 自 2019 年以来,谷歌在发布最多基础模型方面处于领先地位,总共发布了 40 个,其次是 OpenAI 发布了 20 个(图 1.3.17)。清华大学是排名最高的非西方机构,发布了七个基础模型,而斯坦福大学是美国领先的学术机构,发布了五个模型。
Number of foundation models by organization, 2019-23 (sum) 组织发布的基础模型数量,2019-23 年(总和)
Source: Bommasani et al., 2023 | Chart: 2024 Al Index report 来源:Bommasani 等人,2023 年 | 图表:2024 年 Al 指数报告
National Affiliation 国家隶属
Given that foundation models are fairly representative of frontier research, from a geopolitical perspective, it is important to understand their national affiliations. Figures 1.3.18, 1.3.19, and 1.3.20 visualize the national affiliations of various foundation models. As with the notable model analysis presented earlier in the chapter, a model is deemed affiliated with a country if a researcher contributing to that model is affiliated with an institution headquartered in that country. 鉴于基础模型相当代表前沿 研究,从地缘政治的角度来看,了解它们的国家隶属关系至关重要。图 1.3.18、1.3.19 和 1.3.20 展示了各种基础模型的国家隶属关系。与本章前面提出的著名模型分析一样,如果参与该模型的研究人员隶属于总部设在该国的机构,则认为该模型与该国有关联。
In 2023, most of the world's foundation models originated from the United States (109), followed by China (20), and the United Kingdom (Figure 1.3.18). Since 2019, the United States has consistently led in originating the majority of foundation models (Figure 1.3.19). 2023 年,世界大部分基础模型源自美国(109),其次是中国(20)和英国(图 1.3.18)。自 2019 年以来,美国一直在发源大多数基础模型方面处于领先地位(图 1.3.19)。
Number of foundation models by geographic area, 2023 2023 年各地区基础模型数量
Source: Bommasani et al., 2023 | Chart: 2024 Al Index report 来源:Bommasani 等人,2023 年 | 图表:2024 年 Al 指数报告
Figure 1.3.18 图 1.3.18
Number of foundation models by select geographic area, 2019-23 2019-23 年各地区基础模型数量
Figure 1.3.20 depicts the cumulative count of foundation models released and attributed to respective countries since 2019. The country with the greatest number of foundation models released since 2019 is the United States (182), followed by China (30), and the United Kingdom (21). 图 1.3.20 显示自 2019 年以来发布并归因于各个国家的基础模型的累积数量。自 2019 年以来发布基础模型数量最多的国家是美国(182),其次是中国(30)和英国(21)。
Number of foundation models by geographic area, 2019-23 (sum) 地理区域基础模型数量,2019-23 年(总和)
Source: Bommasani et al., 2023 | Chart: 2024 AI Index report 来源:Bommasani 等人,2023 年 | 图表:2024 年 AI 指数报告
Figure 1.3.20 图 1.3.20
Training Cost 训练成本
A prominent topic in discussions about foundation models is their speculated cost. While AI companies seldom reveal the expenses involved in training their models, it is widely believed that these costs run into the millions of dollars and are rising. For instance, OpenAI's CEO, Sam Altman, mentioned that the training cost for GPT-4 was over $100 million. This escalation in training expenses has effectively excluded universities, traditionally centers of AI research, from developing their own leading-edge foundation models. In response, policy initiatives such as President Biden's Executive Order on AI have sought to level the playing field between industry and academia by creating a National AI Research Resource, which would grant nonindustry actors the compute and data needed to do higher-level AI research.
Understanding the cost of training Al models is important, yet detailed information on these costs remains scarce. The Al Index was among the first to offer estimates on the training costs of foundation models in last year's publication. This year, the AI Index has collaborated with Epoch AI, an AI research institute, to substantially enhance and solidify the robustness of its training cost estimates. To estimate the cost of cutting-edge models, the Epoch team analyzed training duration, as well as the type, quantity, and utilization rate of the training hardware, using information from publications, press releases, or technical reports related to the models. 了解培训 AI 模型的成本是重要的,然而关于这些成本的详细信息仍然很少。AI 指数是去年出版的第一份提供基础模型培训成本估算的报告之一。今年,AI 指数与 Epoch AI 合作,大大增强和巩固了其 培训成本估算的稳健性。 为了估算尖端模型的成本,Epoch 团队分析了培训持续时间,以及培训硬件的类型、数量和利用率,使用了与这些模型相关的出版物、新闻稿或技术报告中的信息。
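The arithmetic behind such an estimate can be sketched simply: divide the total training FLOPs by the effective throughput of the rented hardware to obtain GPU-hours, then multiply by a cloud rental price. The sketch below uses placeholder numbers rather than Epoch's actual inputs.

```python
def estimate_training_cost(
    training_flops: float,      # total FLOPs used to train the model
    peak_flops_per_gpu: float,  # advertised peak throughput of one accelerator (FLOP/s)
    utilization: float,         # fraction of peak actually achieved during training
    price_per_gpu_hour: float,  # cloud rental price per accelerator (USD)
) -> float:
    effective_flops_per_gpu = peak_flops_per_gpu * utilization
    gpu_seconds = training_flops / effective_flops_per_gpu
    gpu_hours = gpu_seconds / 3600
    return gpu_hours * price_per_gpu_hour

# Placeholder inputs for illustration only.
cost = estimate_training_cost(
    training_flops=1e24,        # hypothetical training run
    peak_flops_per_gpu=3e14,    # ~300 teraFLOP/s class accelerator
    utilization=0.4,
    price_per_gpu_hour=2.0,
)
print(f"~${cost:,.0f}")  # roughly $4.6 million
```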
Figure 1.3.21 visualizes the estimated training cost associated with select AI models, based on cloud compute rental prices. AI Index estimates validate suspicions that in recent years model training costs have significantly increased. For example, in 2017, the original Transformer model, which introduced the architecture that underpins virtually every modern LLM, cost around $900 to train. RoBERTa Large, released in 2019, which achieved state-of-the-art results on many canonical comprehension benchmarks like SQuAD and GLUE, cost around $160,000 to train. Fast-forward to 2023, and training costs for OpenAI's GPT-4 and Google's Gemini Ultra are estimated to be around $78 million and $191 million, respectively.
9 Ben Cottier and Robi Rahman led research at Epoch Al into model training cost. 10 A detailed description of the estimation methodology is provided in the Appendix. 11 The cost figures reported in this section are inflation-adjusted. Ben Cottier 和 Robi Rahman 在 Epoch Al 进行了模型训练成本的研究。附录中提供了估算方法的详细说明。本节报告的成本数据已经进行了通货膨胀调整。
Estimated training cost of select AI models, 2017-23 2017 年至 2023 年 选择的 AI 模型的预计训练成本
Source: Epoch, 2023 | Chart: 2024 Al Index report 来源:Epoch,2023 | 图表:2024 年 Al 指数报告
Figure 1.3.21
Figure 1.3.22 visualizes the training cost of all AI models for which the Al Index has estimates. As the figure shows, model training costs have sharply increased over time. 图 1.3.22 展示了所有具有 Al 指数估计的 AI 模型的训练成本。正如图所示,随着时间的推移,模型的训练成本急剧增加。
Estimated training cost of select AI models, 2016-23 2016-23 年精选 AI 模型的预估训练成本
Source: Epoch, 2023 | Chart: 2024 Al Index report 来源:Epoch,2023 | 图表:2024 年 Al 指数报告
As established in previous Al Index reports, there is a direct correlation between the training costs of Al models and their computational requirements. As illustrated in Figure 1.3.23, models with greater computational training needs cost substantially more to train. 正如前述 Al 指数报告所述,Al 模型的训练成本与其计算需求之间存在直接相关性。如图 1.3.23 所示,具有更大计算训练需求的模型训练成本显著更高。
Estimated training cost and compute of select Al models 选择 Al 模型的预估培训成本和计算
Source: Epoch, 2023 | Chart: 2024 Al Index report 来源:Epoch,2023 | 图表:2024 年 Al 指数报告
Al conferences serve as essential platforms for researchers to present their findings and network with peers and collaborators. Over the past two decades, these conferences have expanded in scale, quantity, and prestige. This section explores trends in attendance at major Al conferences. Al 会议是研究人员展示他们的发现并与同行和合作者建立联系的重要平台。在过去的二十年里,这些会议在规模、数量和声誉上不断扩大。本节探讨了主要 Al 会议的参与趋势。
1.4 Al Conferences 1.4 Al 会议
Conference Attendance 会议出席
Figure 1.4.1 graphs attendance at a selection of AI conferences since 2010. Following a decline in attendance, likely due to the shift back to exclusively in-person formats, the AI Index reports an increase in conference attendance from 2022 to 2023. Specifically, total attendance rose over the last year. Since 2015, the annual number of attendees has risen by around 50,000, reflecting not just a growing interest in AI research but also the emergence of new AI conferences.
Attendance at select Al conferences, 2010-23 2010-23 年参加了部分 Al 会议
Figure 1.4.1 图 1.4.1
Neural Information Processing Systems (NeurIPS) remains one of the most attended conferences, attracting approximately 16,380 participants in 2023 神经信息处理系统(NeurIPS)仍然是参加人数最多的 会议之一,在 2023 年吸引了约 16,380 名参与者
(Figure 1.4.2 and Figure 1.4.3). Among the major (图 1.4.2 和图 1.4.3)。在主要的
Al conferences, NeurIPS, ICML, ICCV, and AAAI experienced year-over-year increases in attendance. However, in the past year, CVPR, ICRA, ICLR, and IROS observed slight declines in their attendance figures. Al 会议中,NeurIPS、ICML、ICCV 和 AAAI 的参加人数同比增加。然而,在过去的一年中,CVPR、ICRA、ICLR 和 IROS 的参会人数略有下降。
Attendance at large conferences, 2010-23 大型会议的参会人数,2010-23
Attendance at small conferences, 2010-23 2010-23 年小型会议出席情况
Figure 1.4.3 图 1.4.3
GitHub is a web-based platform that enables individuals and teams to host, review, and collaborate on code repositories. Widely used by software developers, GitHub facilitates code management, project collaboration, and open-source software support. This section draws on data from GitHub providing insights into broader trends in open-source Al software development not reflected in academic publication data. GitHub 是一个基于 Web 的平台,使个人和团队能够托管、审查和协作编码存储库。GitHub 被软件开发人员广泛使用,有助于代码管理、项目协作和开源软件支持。本节利用来自 GitHub 的数据,提供有关开源 AI 软件开发中更广泛趋势的见解,这些趋势在学术出版数据中没有体现。
1.5 Open-Source Al Software 1.5 开源人工智能软件
Projects 项目
A GitHub project comprises a collection of files, including source code, documentation, configuration files, and images, that together make up a software project. Figure 1.5.1 looks at the total number of 一个 GitHub 项目包括一组文件,包括源代码、文档、配置文件和图像,这些文件共同组成一个软件项目。图 1.5.1 查看了总数
GitHub AI projects over time. Since 2011, the number of AI-related GitHub projects has seen a consistent increase, growing from 845 in 2011 to approximately 1.8 million in 2023. Notably, there was a sharp 59.3% rise in the total number of GitHub AI projects in the last year alone.
Figure 1.5.1 图 1.5.1
recently published research paper, a shift from the previously detailed methodology in an earlier paper. This edition of the Al Index is the first to adopt this updated approach. Moreover, the previous edition of the Al Index utilized country-level mapping of GitHub Al projects conducted by the OECD, which depended on self-reported data-a method experiencing a decline in coverage over time. This year, the Al Index has adopted geographic mapping from GitHub, leveraging server-side data for broader coverage. Consequently, the data presented here may not align perfectly with data in earlier versions of the report. 最近发表的研究论文中,从之前的详细方法论转变。这一版本的 AI 指数是首次采用这种更新的方法。此外,之前的 AI 指数利用 OECD 根据 GitHub AI 项目进行的国家级映射,该方法依赖于自报数据,随着时间的推移,覆盖范围逐渐减少。今年,AI 指数采用了 GitHub 的地理映射,利用服务器端数据进行更广泛的覆盖。因此,这里呈现的数据可能与报告早期版本中的数据不完全一致。
Figure 1.5.2 reports GitHub AI projects by geographic area since 2011. As of 2023, the largest share of GitHub AI projects came from developers located in the United States, with India the second-largest contributor, followed closely by the European Union and the United Kingdom. Notably, the proportion of GitHub AI projects from developers located in the United States has been on a steady decline since 2016.
GitHub Al projects (% of total) by geographic area, 2011-23 2011-2023 年各地区 GitHub AI 项目(占总数的百分比)
Stars 星星
GitHub users can show their interest in a repository by "starring" it, a feature similar to liking a post on social media, which signifies support for an opensource project. Among the most starred repositories are libraries such as TensorFlow, OpenCV, Keras, and PyTorch, which enjoy widespread popularity among software developers in the Al coding community. For example, TensorFlow is a popular library for building and deploying machine learning models. OpenCV is a platform that offers a variety of tools for computer vision, such as object detection and feature extraction. GitHub 用户可以通过“点星”来展示他们对仓库的兴趣,这类似于在社交媒体上点赞一篇帖子,表示对开源项目的支持。在最受欢迎的仓库中,有一些库如 TensorFlow、OpenCV、Keras 和 PyTorch 被点赞次数最多,这些库在 Al 编程社区中备受欢迎。例如,TensorFlow 是一个用于构建和部署机器学习模型的热门库。OpenCV 是一个提供各种计算机视觉工具的平台,如目标检测和特征提取。
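To give a flavor of what these heavily starred libraries are used for, the generic snippet below builds a minimal TensorFlow/Keras classifier and runs OpenCV's ORB feature detector on a synthetic image; it is illustrative only and not tied to any particular starred repository.

```python
import cv2
import numpy as np
import tensorflow as tf

# TensorFlow/Keras: define and compile a small image classifier.
model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

# OpenCV: detect ORB keypoints (a classic feature-extraction step) in a random test image.
image = (np.random.rand(256, 256) * 255).astype(np.uint8)
keypoints, descriptors = cv2.ORB_create().detectAndCompute(image, None)
print(f"model parameters: {model.count_params():,}, keypoints found: {len(keypoints)}")
```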
The total number of stars for Al-related projects on GitHub saw a significant increase in the last year, more than tripling from 4.0 million in 2022 to 12.2 million in 2023 (Figure 1.5.3). This sharp increase in GitHub stars, along with the previously reported rise in projects, underscores the accelerating growth of open-source Al software development. GitHub 上与人工智能相关的项目的总星星数量在过去一年中显著增加,从 2022 年的 400 万增加到 2023 年的 1220 万(图 1.5.3)。GitHub 星星数量的急剧增加,以及之前报道的项目增长,凸显了开源人工智能软件开发的加速增长。
Number of GitHub stars in Al projects, 2011-23 2011-23 年 Al 项目在 GitHub 上的星标数量
Source: GitHub, 2023 | Chart: 2024 Al Index report 源:GitHub,2023 | 图表:2024 Al 指数报告
Figure 1.5.3 图 1.5.3
In 2023, the United States led in receiving the highest number of GitHub stars, totaling 10.5 million (Figure 1.5.4). All major geographic regions sampled, including the European Union and United Kingdom, 在 2023 年,美国在获得 GitHub 星标数量方面处于领先地位,总数达到 1050 万(图 1.5.4)。所有主要地理区域样本,包括欧盟和英国,
China, and India, saw a year-over-year increase in the total number of GitHub stars awarded to projects located in their countries. 中国和印度,看到了其国家项目获得的 GitHub 星标总数同比增加。
Number of GitHub stars by geographic area, 2011-23 2011-23 年各地区 GitHub 星标数量
Source: GitHub, 2023 | Chart: 2024 Al Index report 源:GitHub,2023 | 图表:2024 Al 指数报告
The technical performance section of this year's Al Index offers a comprehensive overview of advancements in 2023. It starts with a high-level overview of Al technical performance, tracing its broad evolution over time. The chapter then examines the current state of a wide range of Al capabilities, including language processing, coding, computer vision (image and video analysis), reasoning, audio processing, autonomous agents, robotics, and reinforcement learning. It also shines a spotlight on notable Al research breakthroughs from the past year, exploring methods for improving LLMs through prompting, optimization, and fine-tuning, and wraps up with an exploration of Al systems' environmental footprint. 今年 Al 指数的技术性能部分提供了对 2023 年进展的全面概述。它从高层次概述 Al 技术性能开始,追溯其随时间的广泛演变。该章节随后检视了各种 Al 能力的当前状态,包括语言处理、编码、计算机视觉(图像和视频分析)、推理、音频处理、自主代理、机器人技术和强化学习。它还聚焦探讨了过去一年中值得关注的 Al 研究突破,探索了通过提示、优化和微调来改善 LLMs 的方法,并最后探讨了 Al 系统的环境足迹。
Chapter Highlights 章节亮点
Al beats humans on some tasks, but not on all. Al has surpassed human performance on several benchmarks, including some in image classification, visual reasoning, and English understanding. Yet it trails behind on more complex tasks like competition-level mathematics, visual commonsense reasoning and planning. 在某些任务上,人工智能超过人类,但并非所有任务都是如此。人工智能在一些基准测试中已经超过了人类的表现,包括图像分类、视觉推理和英语理解等方面。然而,在更复杂的任务上,如竞赛级数学、视觉常识推理和规划方面,人工智能仍然落后。
Here comes multimodal AI. Traditionally Al systems have been limited in scope, with language models excelling in text comprehension but faltering in image processing, and vice versa. However, recent advancements have led to the development of strong multimodal models, such as Google's Gemini and OpenAl's GPT-4. These models demonstrate flexibility and are capable of handling images and text and, in some instances, can even process audio. 在这里有多模态人工智能。传统上,人工智能系统在范围上是受限的,语言模型在文本理解方面表现出色,但在图像处理方面表现不足,反之亦然。然而,最近的进展导致了强大的多模态模型的发展,例如 Google 的 Gemini 和 OpenAl 的 GPT-4。这些模型展示了灵活性,能够处理图像和文本,有时甚至可以处理音频。
Harder benchmarks emerge. Al models have reached performance saturation on established benchmarks such as ImageNet, SQuAD, and SuperGLUE, prompting researchers to develop more challenging ones. In 2023, several challenging new benchmarks emerged, including SWE-bench for coding, HEIM for image generation, MMMU for general reasoning, MoCa for moral reasoning, AgentBench for agent-based behavior, and HaluEval for hallucinations. 更难的基准出现了。AI 模型在 ImageNet、SQuAD 和 SuperGLUE 等已建立的基准上达到了性能饱和,促使研究人员开发出更具挑战性的基准。2023 年,出现了几个具有挑战性的新基准,包括用于编码的 SWE-bench,用于图像生成的 HEIM,用于一般推理的 MMMU,用于道德推理的 MoCa,用于基于代理的行为的 AgentBench,以及用于幻觉的 HaluEval。
Better AI means better data which means ... even better AI. New AI models such as SegmentAnything and Skoltech are being used to generate specialized data for tasks like image segmentation and 3D reconstruction. Data is vital for AI technical improvements. The use of AI to create more data enhances current capabilities and paves the way for future algorithmic improvements, especially on harder tasks.
Human evaluation is in. With generative models producing high-quality text, images, and more, benchmarking has slowly started shifting toward incorporating human evaluations like the Chatbot Arena Leaderboard rather than computerized rankings like ImageNet or SQuAD. Public feeling about Al is becoming an increasingly important consideration in tracking Al progress. 人类评估已经开始。随着生成模型生成高质量的文本、图像等,基准测试已经开始慢慢转向纳入人类评估,比如 Chatbot Arena 排行榜,而不是像 ImageNet 或 SQuAD 那样的计算机排名。公众对人工智能的看法正在成为追踪人工智能进展的日益重要的考虑因素。
Thanks to LLMs, robots have become more flexible. The fusion of language modeling with robotics has given rise to more flexible robotic systems like PaLM-E and RT-2. Beyond their improved robotic capabilities, these models can ask questions, which marks a significant step toward robots that can interact more effectively with the real world. 由于LLMs的出现,机器人变得更加灵活。语言建模与机器人技术的融合催生了更加灵活的机器人系统,如 PaLM-E 和 RT-2。除了改进了机器人的能力,这些模型还可以提问,这是机器人与现实世界更有效互动的重要一步。
More technical research in agentic Al. Creating Al agents, systems capable of autonomous operation in specific environments, has long challenged computer scientists. However, emerging research suggests that the performance of autonomous agents is improving. Current agents can now master complex games like Minecraft and effectively tackle real-world tasks, such as online shopping and research assistance. 关于主动型人工智能的更多技术研究。创建自主在特定环境中运行的人工智能代理系统一直是计算机科学家们长期面临的挑战。然而,新兴的研究表明,自主 代理系统的性能正在不断提高。现如今的代理系统已经能够掌握复杂的游戏,如 Minecraft,并且能够有效地处理实际任务,如在线购物和研究辅助。
Closed LLMs significantly outperform open ones. On 10 select AI benchmarks, closed models outperformed open ones, with a median performance advantage of 24.2%. Differences in the performance of closed and open models carry important implications for AI policy debates.
The technical performance chapter begins with a high-level overview of significant model releases in 2023 and reviews the current state of Al technical performance. 技术性能章节以 2023 年重要模型发布的高层概述开始,并回顾了当前人工智能技术性能的现状。
2.1 Overview of Al in 2023 2023 年人工智能概述
Timeline: Significant Model Releases 时间轴:重要的模型发布
As chosen by the Al Index Steering Committee, here are some of the most notable model releases of 2023. 由 Al Index Steering Committee 选择,以下是 2023 年一些最显著的模型发布。
Mar. 14, 2023 | Claude | Large language model | Anthropic
Claude is the first publicly released LLM from Anthropic, one of OpenAI's main rivals. Claude is designed to be as helpful, honest, and harmless as possible. (Figure 2.1.1; source: Anthropic, 2023)

Mar. 14, 2023 | GPT-4 | Large language model | OpenAI
GPT-4, improving over GPT-3, is among the most powerful and capable LLMs to date and surpasses human performance on numerous benchmarks.

Mar. 23, 2023 | Stable Diffusion v2 | Text-to-image model | Stability AI
Stable Diffusion v2 is an upgrade of Stability AI's existing text-to-image model and produces higher-resolution, superior-quality images. (Source: Stability AI, 2023)

Apr. 5, 2023 | Segment Anything | Image segmentation | Meta
Segment Anything is an AI model capable of isolating objects in images using zero-shot generalization. (Figure 2.1.4; source: Meta, 2023)

Jul. 18, 2023 | Llama 2 | Large language model | Meta
Llama 2, an updated version of Meta's flagship LLM, is open-source. Its smaller variants (7B and 13B) deliver relatively high performance for their size. (Figure 2.1.5; source: Meta, 2023)

Aug. 20, 2023 | DALL-E 3 | Image generation | OpenAI
DALL-E 3 is an improved version of OpenAI's existing text-to-vision model DALL-E. (Figure 2.1.6; source: OpenAI, 2023)

Aug. 29, 2023 | SynthID | Watermarking | Google, DeepMind
SynthID is a tool for watermarking AI-generated music and images. Its watermarks remain detectable even after image alterations. (Figure 2.1.7; source: DeepMind, 2023)

Sep. 27, 2023 | Mistral 7B | Large language model | Mistral AI
Mistral 7B, launched by French AI company Mistral, is a compact 7 billion parameter model that surpasses Llama 2 in performance, ranking it top in its class for size.

Oct. 27, 2023 | Ernie 4.0 | Large language model | Baidu
Baidu, a multinational Chinese technology company, has launched Ernie 4.0, which is among the highest-performing Chinese LLMs to date. (Source: PR Newswire, 2023)

Nov. 6, 2023 | GPT-4 Turbo | Large language model | OpenAI
GPT-4 Turbo is an upgraded large language model boasting a 128K context window and reduced pricing. (Figure 2.1.10; source: Tech.co, 2023)

Nov. 6, 2023 | Whisper v3 | Speech-to-text | OpenAI
Whisper v3 is an open-source speech-to-text model known for its increased accuracy and extended language support. (Figure 2.1.11; source: AI Business, 2023)

Nov. 21, 2023 | Claude 2.1 | Large language model | Anthropic
Anthropic's latest LLM, Claude 2.1, features an industry-leading 200K-token context window, which enhances its capacity to process extensive content such as lengthy literary works. (Figure 2.1.12; source: Medium, 2023)

Nov. 22, 2023 | Inflection-2 | Large language model | Inflection
Inflection-2 is the second LLM from the new startup Inflection, founded by DeepMind's Mustafa Suleyman. Inflection-2's launch underscores the intensifying competition in the LLM arena. (Figure 2.1.13)

Dec. 6, 2023 | Gemini | Large language model | Google
Gemini emerges as a formidable competitor to GPT-4, with one of its variants, Gemini Ultra, outshining GPT-4 on numerous benchmarks. (Figure 2.1.14; source: Medium, 2023)

Dec. 21, 2023 | Midjourney | Text-to-image model | Midjourney
Midjourney's latest update enhances user experience with more intuitive prompts and superior image quality. (Source: Bootcamp, 2023)
State of Al Performance 人工智能表现状态
As of 2023, AI has achieved levels of performance that surpass human capabilities across a range of tasks. Figure 2.1.16 illustrates the progress of AI systems relative to human baselines for nine AI benchmarks corresponding to nine tasks (e.g., image classification or basic-level reading comprehension).
The AI Index team selected one benchmark to represent each task. AI 指数团队选择了一个基准来代表每个任务。
Over the years, Al has surpassed human baselines on a handful of benchmarks, such as image classification in 2015, basic reading comprehension in 2017, visual reasoning in 2020, and natural language inference in 2021. As of 2023, there are still some task categories where Al fails to exceed human ability. These tend to be more complex cognitive tasks, such as visual commonsense reasoning and advanced-level mathematical problem-solving (competition-level math problems). 多年来,Al 在一些基准测试中超越了人类基线,例如 2015 年的图像分类,2017 年的基本阅读理解,2020 年的视觉推理,以及 2021 年的自然语言推理。截至 2023 年,仍然有一些任务类别在其中 Al 未能超越人类能力。这些任务往往是更复杂的认知任务,如视觉常识推理和高级数学问题解决(竞赛级别的数学问题)。
Select AI Index technical performance benchmarks vs. human performance 选择 AI 指数技术性能基准与人类表现
Source: Al Index, 2024 | Chart: 2024 Al Index report 数据来源:2024 年铝指数 | 图表来源:2024 年铝指数报告
Figure 2.1.16 图 2.1.16
1 A benchmark is a standardized test used to evaluate the performance and capabilities of AI systems on specific tasks. For example, ImageNet is a canonical AI benchmark that features a large collection of labeled images, and AI systems are tasked with classifying these images accurately. Tracking progress on benchmarks has been a standard way for the AI community to monitor the advancement of AI systems.
2 In Figure 2.1.16, the values are scaled to establish a standard metric for comparing different benchmarks. The scaling function is calibrated such that the performance of the best model for each year is measured as a percentage of the human baseline for a given task. A value of 105% indicates, for example, that a model performs 5% better than the human baseline.
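Expressed as a simple calculation based on the scaling described above, the plotted value is the best model score divided by the human baseline, times 100; the numbers below are illustrative only.

```python
def scaled_score(best_model_score: float, human_baseline: float) -> float:
    """Performance relative to the human baseline, in percent (100 = human level)."""
    return 100 * best_model_score / human_baseline

# Illustrative numbers only: a model scoring 92.4 on a benchmark whose human
# baseline is 88.0 plots at 105%, i.e., 5% above the human baseline.
print(f"{scaled_score(92.4, 88.0):.1f}%")  # 105.0%
```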
Al Index Benchmarks Al 指数基准
An emerging theme in Al technical performance, as emphasized in last year's report, is the observed saturation on many benchmarks, such as ImageNet, used to assess the proficiency of Al models. Performance on these benchmarks has stagnated in recent years, indicating either a plateau in capabilities or a shift among researchers toward more complex research challenges. 在上一年的报告中强调的一个正在兴起的 Al 技术表现主题是许多评估 Al 模型能力的基准测试,如 ImageNet,在很多方面都已经到达了饱和状态。这些基准测试的表现在最近几年停滞不前,这可能表明要么是 的能力达到了瓶颈,要么是研究人员转向了更复杂的研究挑战。
Due to saturation, several benchmarks featured in the 2023 Al Index have been omitted from this year's report. Figure 2.1.17 highlights a selection of benchmarks that were included in the 2023 edition but not featured in this year's report. It also shows the improvement on these benchmarks since 2022. "NA" indicates no improvement was noted. 由于饱和现象,2023 年 Al 指数中的几个基准测试在今年的报告中被省略。图 2.1.17 突出显示了 2023 年版中包含但未被列入今年报告的一些基准测试。 它还显示了这些基准测试在 2022 年以来的改进情况。“NA”表示没有改进。
A selection of deprecated benchmarks from the Index report 来自 指数报告的一组已弃用的基准测试
Source: Al Index, 2024 源:Al 指数,2024 年
Benchmark | Task category | Year introduced | Improvement from 2022
Abductive Natural Language Inference (aNLI) 求因逻辑自然语言推理(aNLI)
3 Benchmarks can also saturate or see limited improvement because the problem posed is hard and the corresponding performance fails to improve. The issue of benchmark saturation discussed in this section refers more to benchmarks where performance reaches a close-to-perfect level beyond which it is difficult to improve.
4 For brevity, Figure 2.1.17 highlights a selection of deprecated benchmarks. Additional benchmarks that were deprecated either because there was saturation, no new state-of-the-art score was documented, or research focus shifted away from the benchmark include: Celeb-DF (deepfake detection), CIFAR-10 (image classification), NIST FRVT (facial recognition), and Procgen (reinforcement learning). 4 为了简洁起见,图 2.1.17 突出显示了一些弃用的基准测试。其他已被弃用的基准测试包括:Celeb-DF(深度伪造检测),CIFAR-10(图像分类),NIST FRVT(人脸识别)和 Procgen(强化学习)。这些基准测试要么因为饱和,没有新的最先进分数记录,要么研究重点转移离开基准。
Figure 2.1.18 illustrates the year-over-year improvement, in percent, on a selection of benchmarks featured in the AI Index report. Most benchmarks see significant performance increases relatively soon after they are introduced, after which the improvement slows. In the last few years, many of these benchmarks have shown little or no improvement.
Year-over-year improvement over time on select AI Index technical performance benchmarks 随着时间的推移,在选择的 AI 指数技术性能基准上逐年改善
Source: Al Index, 2024 | Chart: 2024 Al Index report
Figure 2.1.18 图 2.1.18
In response to benchmark saturation, AI researchers are pivoting away from traditional benchmarks and testing AI on more difficult challenges. The AI Index tracks progress on several new benchmarks, including those for tasks in coding, advanced reasoning, and agentic behavior, areas that were underrepresented in previous versions of the report (Figure 2.1.19).
New benchmarks featured in the 2024 Al Index report 2024 年 Al 指数报告中的新基准
Source: Al Index, 2024 来源:Al 指数,2024 年
Benchmark | Task category | Year introduced
AgentBench | Agent-based behavior | 2023
BigToM | Causal reasoning | 2023
Chatbot Arena Leaderboard | General language | 2023
EditVal | Image editing | 2023
GPQA | General reasoning | 2023
GSM8K | Mathematical reasoning | 2021
HEIM | Image generation | 2023
HELM | General language | 2021
HaluEval | Factuality | 2023
HumanEval | Coding | 2021
MATH | Mathematical reasoning | 2021
MLAgentBench | Agent-based behavior | 2023
MMMU | General reasoning | 2023
MoCa | Moral reasoning | 2023
PlanBench | Planning | 2023
SWE-bench | Coding | 2023
TruthfulQA | Factuality | 2021
VisIT-Bench | Image instruction-following | 2023
5 This report includes an Appendix with details regarding the sourcing of new benchmarks featured in this chapter 5 本报告包括一个附录,其中详细介绍了本章节中新基准的来源
2.2 Language 2.2 语言
Natural language processing (NLP) enables computers to understand, interpret, generate, and transform text. Current state-of-the-art models, such as OpenAI's GPT-4 and Google's Gemini, are able to generate fluent and coherent prose and display high levels of language understanding ability (Figure 2.2.1). Many of these models can also now handle different input forms, such as images and audio (Figure 2.2.2).
A sample output from GPT-4 GPT-4 的一个示例输出
Source: Al Index, 2024 来源:Al Index,2024
Figure 2.2.1 图 2.2.1
Gemini handling image and audio inputs Gemini 处理图像和音频输入
Source: Google, 2024 来源:Google,2024 年
Input image and input audio (transcribed): "... with these ingredients?"
Understanding 理解
English language understanding challenges AI systems to understand the English language in various ways, such as reading comprehension and logical reasoning.
HELM: Holistic Evaluation of Language Models HELM: 语言模型的整体评估
As illustrated above, in recent years LLMs have surpassed human performance on traditional English-language benchmarks, such as SQuAD (question answering) and SuperGLUE (language understanding). This rapid advancement has led to the need for more comprehensive benchmarks.
HELM: mean win rate HELM: 平均胜率
Source: CRFM, 2023 | Chart: 2024 Al Index report 源:CRFM,2023 | 图表:2024 Al 指数报告
Figure 2.2.3 图 2.2.3
In 2022, Stanford researchers introduced HELM (Holistic Evaluation of Language Models), designed to evaluate LLMs across diverse scenarios, including reading comprehension, language understanding, and mathematical reasoning. HELM assesses models from several leading companies, such as Anthropic, Google, Meta, and OpenAI, and uses a "mean win rate" to track average performance across all scenarios. As of January 2024, GPT-4 leads the aggregate HELM leaderboard with a mean win rate of 0.96 (Figure 2.2.3); however, different models top different task categories (Figure 2.2.4).
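The "mean win rate" aggregation can be illustrated with a short sketch: for each scenario, count how often a model's score beats each other model's score, then average across scenarios. The code below is a toy illustration with made-up scores, not CRFM's HELM implementation.

```python
# Illustrative sketch of a "mean win rate": per scenario, a model's win rate is
# the fraction of other models it beats; the mean is taken across scenarios.

def mean_win_rate(scores):
    """scores: {scenario: {model: score}} -> {model: mean win rate}."""
    models = {m for per_scenario in scores.values() for m in per_scenario}
    totals = {m: [] for m in models}
    for per_scenario in scores.values():
        for m, s in per_scenario.items():
            others = [o for o in per_scenario if o != m]
            wins = sum(s > per_scenario[o] for o in others)
            totals[m].append(wins / len(others) if others else 0.0)
    return {m: sum(v) / len(v) for m, v in totals.items()}

# Hypothetical scenario scores, not actual HELM results.
toy_scores = {
    "reading_comprehension": {"model_a": 0.82, "model_b": 0.78, "model_c": 0.70},
    "math_reasoning":        {"model_a": 0.55, "model_b": 0.61, "model_c": 0.40},
}
print(mean_win_rate(toy_scores))
```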
Leaders on individual HELM sub-benchmarks 个人 HELM 子基准上的领导者
Source: CRFM, 2023 | Table: 2024 Al Index report 来源:CRFM,2023 | 表格:2024 Al 指数报告
The Massive Multitask Language Understanding (MMLU) benchmark assesses model performance in zero-shot or few-shot settings across 57 subjects, including the humanities, STEM, and social sciences (Figure 2.2.5). MMLU has emerged as a premier benchmark for assessing LLM capabilities: many state-of-the-art models, such as GPT-4, Claude 2, and Gemini, have been evaluated against it.
In early 2023, GPT-4 posted a state-of-the-art score on MMLU, later surpassed by Google's Gemini Ultra. Figure 2.2.6 highlights the top model scores on the MMLU benchmark in different years. The scores reported are averages across the test set. As of January 2024, Gemini Ultra holds the top score of 90.04%, marking a 14.8 percentage point improvement since 2022 and a 57.6 percentage point increase since MMLU's introduction in 2019. Gemini Ultra's score was the first to surpass MMLU's human baseline of 89.8%.
A sample question from MMLU MMLU 的一个示例问题
Source: Hendrycks et al., 2021 来源:Hendrycks 等人,2021
One of the reasons that the government discourages and regulates monopolies is that 政府鼓励和规范垄断的一种原因是
(A) producer surplus is lost and consumer surplus is gained. (A) 生产者剩余减少,消费者剩余增加。
(B) monopoly prices ensure productive efficiency but cost society allocative efficiency. (B) 垄断价格确保生产效率,但损害社会的配置效率。
(C) monopoly firms do not engage in significant research and development. (C) 垄断企业不进行重要的研究和发展。
(D) consumer surplus is lost with higher prices and lower levels of output. (D)消费者剩余随着价格的上涨和产出水平的降低而丧失。
MMLU: average accuracy MMLU:平均准确度
Source: Papers With Code, 2023 | Chart: 2024 Al Index report 来源:Papers With Code,2023 年 | 图表:2024 年 Al 指数报告
Generation 一代
In generation tasks, Al models are tested on their ability to produce fluent and practical language responses. 在生成任务中,AI 模型受到测试,以验证其生成流利和实用的语言回应的能力。
Chatbot Arena Leaderboard 聊天机器人竞技排行榜
The rise of capable LLMs has made it increasingly important to understand which models are preferred by the general public. Launched in 2023, the Chatbot Arena Leaderboard is one of the first comprehensive evaluations of public LLM preference. The leaderboard allows users to query two anonymous models and vote for the preferred generation (Figure 2.2.7). As of early 2024, the platform had garnered over 200,000 votes, and users ranked OpenAI's GPT-4 Turbo as the most preferred model (Figure 2.2.8).
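Pairwise votes of this kind are typically converted into a ranking with an Elo-style rating system. The sketch below applies the standard Elo update rule to toy votes; the starting ratings, K-factor, and vote data are illustrative, and this is not LMSYS's actual implementation.

```python
# Minimal sketch of Elo-style updates from pairwise human votes, in the spirit
# of the Chatbot Arena leaderboard (toy data, not the real system).

def expected_score(r_a, r_b):
    # Probability that the player rated r_a beats the player rated r_b.
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def update_elo(ratings, winner, loser, k=32.0):
    e_win = expected_score(ratings[winner], ratings[loser])
    ratings[winner] += k * (1.0 - e_win)   # winner gains
    ratings[loser]  -= k * (1.0 - e_win)   # loser loses the same amount

ratings = {"model_x": 1000.0, "model_y": 1000.0}
votes = [("model_x", "model_y"), ("model_x", "model_y"), ("model_y", "model_x")]
for winner, loser in votes:
    update_elo(ratings, winner, loser)
print(ratings)
```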
A sample model response on the Chatbot Arena Leaderboard Chatbot Arena Leaderboard 上的样本模型响应
LMSYS Chatbot Arena for LLMs: Elo rating Source: Hugging Face, 2024 | Chart: 2024 Al Index report LMSYS Chatbot Arena for LLMs: Elo 评分来源: Hugging Face, 2024 | 图表: 2024 Al 指数报告
Model 模型
Figure 2.2.8 图 2.2.8
Factuality and Truthfulness 真实性和真实性
Despite remarkable achievements, LLMs remain susceptible to factual inaccuracies and content hallucination: creating seemingly realistic, yet false, information. The presence of real-world instances where LLMs have produced hallucinations (in court cases, for example) underscores the growing necessity of closely monitoring trends in LLM factuality.
TruthfulQA
Introduced at ACL 2022, TruthfulQA is a benchmark designed to evaluate the truthfulness of LLMs in generating answers to questions. This benchmark comprises approximately 800 questions across 38 categories, including health, politics, and finance. Many questions are crafted to challenge commonly held misconceptions, which typically lead humans to answer incorrectly (Figure 2.2.9). Although one of the paper's observations is that larger models tend to be less truthful, GPT-4 (RLHF), released in early 2024, has achieved the highest performance thus far on the TruthfulQA benchmark, with a score of 0.6 (Figure 2.2.10). This score is nearly three times higher than that of a GPT-2-based model tested in 2021, indicating that LLMs are becoming progressively better at providing truthful answers.
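TruthfulQA's multiple-choice variant (MC1, shown in Figure 2.2.10) credits a model only when it assigns its highest score to the single truthful reference answer. The sketch below illustrates that scoring loop; score_answer is a hypothetical placeholder for a real model's log-likelihood, and the sample item is only for demonstration.

```python
# Sketch of MC1-style scoring: the model must rank the one correct answer above
# all incorrect reference answers; accuracy is the fraction of questions where
# it does so. `score_answer` stands in for a real model log-probability.

def score_answer(question, answer):
    # Placeholder heuristic; a real evaluation would query a language model.
    return -len(answer)

def mc1_accuracy(items):
    """items: list of dicts with 'question', 'correct', and 'incorrect' (list)."""
    hits = 0
    for item in items:
        choices = [item["correct"]] + item["incorrect"]
        best = max(choices, key=lambda a: score_answer(item["question"], a))
        hits += best == item["correct"]
    return hits / len(items)

toy = [{"question": "Can coughing effectively stop a heart attack?",
        "correct": "No, coughing cannot stop a heart attack.",
        "incorrect": ["Yes, coughing can effectively stop a heart attack by keeping blood flowing."]}]
print(mc1_accuracy(toy))
```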
Sample TruthfulQA questions 样本 TruthfulQA 问题
Source: Lin, Hilton, and Evans, 2022 来源:Lin、Hilton 和 Evans,2022
Figure 2.2.9 shows sample TruthfulQA questions (e.g., "Can coughing effectively stop a heart attack?") from categories such as health, law, conspiracies, and fiction, each paired with a false answer generated by GPT-3.
Multiple-choice task on TruthfulQA: MC1 TruthfulQA 上的多项选择任务:MC1
Source: Papers with Code, 2023 | Chart: 2024 Al Index report 来源:Papers with Code,2023 | 图表:2024 Al 指数报告
Figure 2.2.10 图 2.2.10
HaluEval
As previously mentioned, LLMs are prone to hallucinations, a concerning trait given their widespread deployment in critical fields such as law and medicine. While existing research has aimed to understand the causes of hallucinations, less effort has been directed toward assessing the frequency of LLM hallucinations and identifying specific content areas where they are especially vulnerable. 正如先前提到的,LLMs容易产生幻觉,这一特点令人担忧,尤其是考虑到它们在法律和医学等关键领域的广泛应用。尽管现有研究旨在了解幻觉的原因,但对LLM幻觉的频率进行评估以及确定特定内容领域中它们特别脆弱的努力较少。
HaluEval, introduced in 2023, is a new benchmark designed to assess hallucinations in LLMs. It includes over 35,000 hallucinated and normal samples for analysis and evaluation by LLMs (Figure 2.2.11). The research indicates that ChatGPT fabricates unverifiable information in a substantial share of its responses, with these fabrications spanning a variety of topics such as language, climate, and technology. Furthermore, the study examines how well current LLMs can detect hallucinations. Figure 2.2.12 illustrates the performance of leading LLMs in identifying hallucinations across various tasks, including question answering, knowledge-grounded dialogue, and text summarization. The findings reveal that many LLMs struggle with these tasks, highlighting that hallucination is a significant ongoing issue.
A generated hallucinated QA example and a human-labeled ChatGPT response for a user query 一个生成的幻觉问答示例和一个用户查询的人工标记的 ChatGPT 回复
Source: Li et al., 2023 | Table: 2024 Al Index report 来源:Li 等人,2023 年 | 表:2024 年 Al 指数报告
Figure 2.2.12 reports hallucination-detection performance on QA, dialogue, summarization, and general tasks for models including ChatGPT (2022), Claude 2 (2023), Claude (2023), Davinci002 (2022), Davinci003 (2022), GPT-3 (2020), Llama 2 (2023), ChatGLM (2023), Falcon (2023), Vicuna (2023), and Alpaca (2023).
Figure 2.2.12 图 2.2.12
Coding involves the generation of instructions that computers can follow to perform tasks. Recently, LLMs have become proficient coders, serving as valuable assistants to computer scientists. There is also increasing evidence that many coders find AI coding assistants highly useful.
2.3 Coding 2.3 编码
Generation 生成
On many coding tasks, Al models are challenged to generate usable code or to solve computer science problems. 在许多编码任务中,AI 模型面临着生成可用代码或解决计算机科学问题的挑战。
HumanEval 人类评估
HumanEval, a benchmark for evaluating AI systems' coding ability, was introduced by OpenAI researchers in 2021. It consists of 164 challenging handwritten programming problems (Figure 2.3.1). A GPT-4 model variant (AgentCoder) currently leads in HumanEval performance with a score that is an 11.2 percentage point increase over the highest score
HumanEval: Pass@1 人类评估:Pass@1
in 2022 (Figure 2.3.2). Since 2021, performance on HumanEval has increased 64.1 percentage points. 在 2022 年(图 2.3.2)。自 2021 年以来,HumanEval 的表现提高了 64.1 个百分点。
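Pass@1 in Figure 2.3.2 refers to the pass@k family of metrics introduced alongside HumanEval: the estimated probability that at least one of k sampled programs passes all unit tests for a problem. A minimal sketch of the unbiased estimator from Chen et al. (2021) follows; the per-problem sample counts are hypothetical.

```python
# Sketch of the unbiased pass@k estimator from the HumanEval paper: given n
# sampled solutions per problem of which c pass the unit tests, estimate the
# probability that at least one of k samples would pass.
import numpy as np

def pass_at_k(n, c, k):
    """1 - C(n-c, k) / C(n, k), computed in a numerically stable way."""
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Hypothetical per-problem results: (samples drawn, samples passing tests).
results = [(20, 3), (20, 0), (20, 20)]
print(np.mean([pass_at_k(n, c, k=1) for n, c in results]))
```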
Sample HumanEval problem 样本 HumanEval 问题
Source: Chen et al., 2023 来源:Chen 等人,2023
SWE-bench SWE-基准
As Al systems' coding capabilities improve, it has become increasingly important to benchmark models on more challenging tasks. In October 2023, researchers introduced SWE-bench, a dataset comprising 2,294 software engineering problems sourced from real GitHub issues and popular Python repositories (Figure 2.3.3). SWE-bench presents a tougher test for coding proficiency, demanding that systems coordinate changes across multiple functions, interact with various execution environments, and perform complex reasoning. 随着人工智能系统编码能力的提高,对模型在更具挑战性的任务上进行基准测试变得越来越重要。2023 年 10 月,研究人员推出了 SWE-基准,这是一个包含 2294 个软件工程问题的数据集,这些问题来自真实的 GitHub 问题和流行的 Python 代码库(图 2.3.3)。SWE-基准对 的编码能力提出了更高要求,要求系统在多个功能之间协调变化,与各种执行环境交互,并进行复杂的推理。
Even state-of-the-art LLMs face significant challenges with SWE-bench. Claude 2, the best-performing model, solved only a small fraction of the dataset's problems (Figure 2.3.4). In 2023, the top-performing model on SWE-bench surpassed the best model from 2022 by 4.3 percentage points.
A sample model input from SWE-bench 来自 SWE-bench 的样本模型输入
Source: Jimenez et al., 2023 来源:Jimenez 等人,2023 年
SWE-bench: percent resolved SWE-bench:已解决的百分比
Source: SWE-bench Leaderboard, 2023 | Chart: 2024 AI Index report 来源:SWE-bench 排行榜,2023 年 | 图表:2024 年 AI 指数报告
Figure 2.3.4 图 2.3.4
8 According to the SWE-bench leaderboard, unassisted systems have no assistance in finding the relevant files in the repository. Assisted systems operate under the "oracle" retrieval setting, which means the systems are provided with the list of files that were modified in the pull request. 根据 SWE-bench 排行榜,无辅助系统在存储库中找到相关文件时没有任何帮助。辅助系统在“oracle”检索设置下运行,这意味着系统提供了在拉取请求中修改的文件列表。
Computer vision allows machines to understand images and videos and create realistic visuals from textual prompts or other inputs. This technology is widely used in fields such as autonomous driving, medical imaging, and video game development. 计算机视觉允许机器理解图像和视频,并从文本提示或其他输入创建逼真的视觉效果。这项技术广泛应用于自动驾驶、医学影像和视频游戏开发等领域。
2.4 Image Computer Vision and Image Generation 2.4 图像计算机视觉与图像生成
Generation 生成
Image generation is the task of generating images that are indistinguishable from real ones. Today's image generators are so advanced that most people struggle to differentiate between AI-generated images and actual images of human faces (Figure 2.4.1). Figure 2.4.2 highlights generations from various Midjourney model variants from 2022 to 2024 for the prompt "a hyper-realistic image of Harry Potter." The progression demonstrates the significant improvement in Midjourney's ability to generate hyper-realistic images over a two-year period. In 2022, the model produced cartoonish and inaccurate renderings of Harry Potter, but by 2024, it could create startlingly realistic depictions.
Which face is real? 哪张脸是真实的?
Source: Which Face Is Real, 2023 来源:哪张脸是真实的,2023
Figure 2.4.1 图 2.4.1
Midjourney generations over time: "a hyper-realistic image of Harry Potter" 随时间变化的中途世代:"哈利·波特的超现实主义形象"
Source: Midjourney, 2023 来源:Midjourney,2023
V1, February 2022 2022 年 2 月 V1
V2, April 2022 2022 年 4 月 V2
V3, July 2022 2022 年 7 月 V3
V4, November 2022 2022 年 11 月 V4
V5, March 2023 2023 年 3 月 V5
V5.1, March 2023 2023 年 3 月 V5.1
V5.2, June 2023 V5.2,2023 年 6 月
V6, December 2023 V6,2023 年 12 月
HEIM: Holistic Evaluation of Text-to-Image Models HEIM:文本到图像模型的整体评估
The rapid progress of text-to-image systems has prompted the development of more sophisticated evaluation methods. In 2023, Stanford researchers introduced the Holistic Evaluation of Text-to-Image Models (HEIM), a benchmark designed to comprehensively assess image generators across 12 key aspects crucial for real-world deployment, such as image-text alignment, image quality, and aesthetics. Human evaluators are used to rate the models, a crucial feature since many automated metrics struggle to accurately assess various aspects of images.
HEIM's findings indicate that no single model excels in all criteria. For human evaluation of image-text alignment (assessing how well the generated image matches the input text), OpenAI's DALL-E 2 scores highest (Figure 2.4.3). In terms of image quality (gauging whether the images resemble real photographs), aesthetics (evaluating visual appeal), and originality (a measure of novel image generation and avoidance of copyright infringement), the Stable Diffusion-based Dreamlike Photoreal model ranks highest (Figure 2.4.4).
Image-text alignment: human evaluation 图像文本对齐:人类评估
Source: CRFM, 2023 | Chart: 2024 Al Index report 来源:CRFM,2023 | 图表:2024 Al 指数报告
Figure 2.4.3 图 2.4.3
Model leaders on select HEIM sub-benchmarks 在选择的 HEIM 子基准上的模型领导者
Source: CRFM, 2023 | Table: 2024 Al Index report 来源:CRFM,2023 | 表格:2024 Al 指数报告
Task | Leading model | Score
Image-text alignment | DALL-E 2 (3.5B) | 0.94
Quality | Dreamlike Photoreal v2.0 (1B) | 0.92
Aesthetics | Dreamlike Photoreal v2.0 (1B) | 0.87
Originality | Dreamlike Photoreal v2.0 (1B) | 0.98
Highlighted Research: 突出研究:
MVDream
Creating 3D geometries or models from text prompts has been a significant challenge for AI researchers, with existing models struggling with problems such as the multiface Janus issue (inaccurately regenerating the context described by text prompts) and content drift (inconsistency across different views). MVDream is a new 3D generation system developed by ByteDance and University of California, San Diego researchers that overcomes some of these hurdles (Figure 2.4.5). In quantitative evaluations, MVDream's generated models achieve Inception Score (IS) and CLIP scores comparable to those of the training set, indicating the high quality of the generated images (Figure 2.4.6). MVDream has major implications, especially for creative industries where 3D content creation is traditionally time-consuming and labor-intensive.
Sample generations from MVDream MVDream 中的示例生成
Source: Shi et al., 2023 来源:Shi 等人,2023 年
An astronaut riding a horse 一名宇航员骑马
A bull dog wearing a black pirate hat 一只戴着黑色海盗帽的斗牛犬
Quantitative evaluation on image synthesis quality 图像合成质量的定量评估
Source: Shi et al., 2023 | Table: 2024 Al Index report 来源:Shi 等人,2023 年 | 表格:2024 年 Al 指数报告
The evaluation reports FID, IS, and CLIP scores for the training data and for Multi-view Diffusion variants (without 2D data at batch size 256, and the proposed method at batch sizes 256 and 1024).
Instruction-Following 指令跟随
In computer vision, instruction-following is the capacity of vision-language models to interpret text-based directives related to images. For instance, an AI system could be given an image of various ingredients and tasked with suggesting how to use them to prepare a healthy meal. Capable instruction-following vision-language models are necessary for developing advanced AI assistants.
A sample VisIT-Bench instruction set 一个示例 VisIT-Bench 指令集
Source: Bitton et al., 2023 来源:Bitton 等人,2023 年
VisIT-Bench
In 2023, a team of industry and academic researchers introduced VisIT-Bench, a benchmark consisting of 592 challenging vision-language instructions across about 70 instruction categories, such as plot analysis, art knowledge, and location understanding (Figure 2.4.8). As of January 2024, the leading model on VisIT-Bench is GPT-4V, the vision-enabled variant of GPT-4 Turbo, with an Elo score of 1,349, marginally surpassing the human reference score for VisIT-Bench (Figure 2.4.9).
VisIT-Bench: Elo rating VisIT-Bench:Elo 评分
Figure 2.4.8 图 2.4.8
Source: Hugging Face, 2024 | Chart: 2024 Al Index report 源自:拥抱面孔,2024 年 | 图表:2024 年 Al 指数报告
Model 模型
Editing 编辑
Image editing involves using AI to modify images based on text prompts. This AI-assisted approach has broad real-world applications in fields such as engineering, industrial design, and filmmaking.
EditVal
Despite the promise of text-guided image editing, few robust methods can evaluate how accurately image editors adhere to editing prompts. EditVal, a new benchmark for assessing text-guided image editing, includes over 13 edit types, such as adding objects or changing their positions, across 19 object classes (Figure 2.4.10). The benchmark was applied to evaluate eight leading text-guided image editing methods, including SINE and Null-text. Performance improvements since 2021 on a variety of the benchmark's editing tasks are shown in Figure 2.4.11.
Performance improvement on EditVal editing tasks since 2021
Source: EditVal Leaderboard, 2024 | Chart: 2024 Al Index report 来源:EditVal 排行榜,2024 年 | 图表:2024 年 Al 指数报告
Highlighted Research: 突出研究:
ControlNet 控制网络
Conditioning inputs or performing conditional control refers to the process of guiding the output created by an image generator by specifying certain conditions that a generated image must meet. Existing text-to-image models often lack precise control over the spatial composition of an image, making it difficult to use prompts alone to generate images with complex layouts, diverse shapes, and specific poses. Fine-tuning these models for greater compositional control by training them on additional images is theoretically feasible, but many specialized datasets, such as those for human poses, are not large enough to support successful training. 输入调节或执行条件控制是指通过指定生成的图像必须满足的某些条件来引导图像生成器创建的输出的过程。现有的文本到图像模型通常缺乏对图像空间构成的精确控制,这使得仅仅使用提示来生成具有复杂布局、多样形状和特定姿势的图像变得困难。通过在额外图像上训练这些模型以获得更大的构图控制是在理论上可行的,但许多专门的数据集,如人体姿势数据集,规模不够大以支持成功的训练。
In 2023, researchers from Stanford introduced a new model, ControlNet, that improves conditional control editing for large text-to-image diffusion models (Figure 2.4.12). ControlNet stands out for its ability to handle various conditioning inputs. Compared to other models released in 2022, human raters prefer ControlNet both in terms of superior quality and better condition fidelity (Figure 2.4.13). The introduction of ControlNet is a significant step toward creating advanced text-to-image generators capable of editing images to more accurately replicate the complex images frequently encountered in the real world.
Sample edits using ControlNet 使用 ControlNet 进行样本编辑
Source: Zhang et al., 2023 来源:张等人,2023
The figure shows a human-pose conditioning input and ControlNet outputs for the default setting and for prompts such as "chef in kitchen."
Highlighted Research: 重点研究:
ControlNet (cont'd) 控制网络(续)
Average User Ranking (AUR): result quality and condition fidelity 平均用户排名(AUR):结果质量和条件保真度
Source: Zhang et al., 2023 | Chart: 2024 Al Index report 来源:张等人,2023 年 | 图表:2024 年 Al 指数报告
Figure 2.4.13 图 2.4.13
Highlighted Research: 突出研究:
Instruct-NeRF2NeRF
New models can edit 3D geometries using only text instructions. Instruct-NeRF2NeRF is a model developed by Berkeley researchers that employs an image-conditioned diffusion model for iterative text-based editing of 3D geometries 新模型可以仅使用文本指令编辑 3D 几何结构。Instruct-NeRF2NeRF 是由伯克利研究人员开发的模型,它采用基于图像的扩散模型,用于迭代基于文本的 3D 几何结构编辑。
(Figure 2.4.14). This method efficiently generates new, edited images that adhere to textual instructions, achieving greater consistency than current leading methods (Figure 2.4.15). (图 2.4.14)。该方法有效地生成新的编辑图像,遵循文本指令,实现比当前主流方法更高的一致性(图 2.4.15)。
A demonstration of Instruct-NeRF2NeRF in action Instruct-NeRF2NeRF 在操作中的演示
Source: Haque et al., 2023 来源:Haque 等人,2023
Figure 2.4.14 图 2.4.14
Highlighted Research: 突出研究:
Instruct-NeRF2NeRF (cont'd) Instruct-NeRF2NeRF(续)
Evaluating text-image alignment and frame consistency 评估文本图像对齐和帧一致性
Source: Haque et al., 2023 | Chart: 2024 Al Index report 来源:Haque 等人,2023 年 | 图表:2024 年 Al 指数报告
Figure 2.4.15 图 2.4.15
Segmentation 分割
Segmentation involves assigning individual image pixels to specific categories (for example: human, bicycle, or street). 分割涉及将图像的每个像素分配到特定的类别中(例如:人类、自行车或街道)。
Highlighted Research: 突出的研究:
Segment Anything 分割任何东西
In 2023, Meta researchers launched Segment Anything, a project that featured the Segment Anything Model (SAM) and an extensive SA-1B dataset for image segmentation. SAM is remarkable for being one of the first broadly generalizable segmentation models that performs well zero-shot on new tasks and distributions. Segment Anything outperforms leading segmentation methods such as RITM on 16 of 23 segmentation datasets (Figure 2.4.17). The metric on which Segment Anything is evaluated is mean Intersection over Union (IoU).
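Mean IoU averages, over evaluation instances, the overlap between a predicted mask and the ground-truth mask divided by the area of their union. A minimal sketch with toy masks (not SAM outputs) follows.

```python
# Minimal sketch of the Intersection-over-Union (IoU) metric used to compare a
# predicted segmentation mask against the ground truth.
import numpy as np

def iou(pred, target):
    pred, target = pred.astype(bool), target.astype(bool)
    intersection = np.logical_and(pred, target).sum()
    union = np.logical_or(pred, target).sum()
    return intersection / union if union else 1.0

pred   = np.array([[1, 1, 0], [0, 1, 0]])
target = np.array([[1, 0, 0], [0, 1, 1]])
print(iou(pred, target))  # 2 overlapping pixels / 4 pixels in the union = 0.5
# Mean IoU is the average of this value over all evaluated masks.
```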
Meta's Segment Anything model was then used, alongside human annotators, to create the SA-1B dataset, which included over 1 billion segmentation masks across 11 million images (Figure 2.4.16). A new segmentation dataset of this size will accelerate the training of future image segmentors. Segment Anything demonstrates how AI models can be used alongside humans to more efficiently create large datasets, which in turn can be used to train even better Al systems. Meta 的 Segment Anything 模型随后与人类标注者一起使用,创建了 SA-1B 数据集,该数据集包括 1100 万张图像上的超过 10 亿个分割蒙版(图 2.4.16)。这样规模的新分割数据集将加速未来图像分割器的训练。Segment Anything 展示了如何将 AI 模型与人类一起使用,更有效地创建大型数据集,进而用于训练更好的 AI 系统。
Various segmentation masks created by Segment Anything 由 Segment Anything 创建的各种分割蒙版
Source: Kirillov et al., 2023 来源: Kirillov 等人,2023
Figure 2.4.16 图 2.4.16
Highlighted Research: 突出研究:
Segment Anything (cont'd) 分割任何内容 (续)
SAM vs. RITM: mean IoU
3D Reconstruction From Images 从图像中的三维重建
3D image reconstruction is the process of creating three-dimensional digital geometries from two-dimensional images. This type of reconstruction can be used in medical imaging, robotics, and virtual reality. 三维图像重建是从二维图像创建三维数字几何形状的过程。这种重建类型可用于医学成像、机器人技术和虚拟现实。
Highlighted Research: 突出研究:
Skoltech3D
Data scarcity often hinders the development of Al systems for specific tasks. In 2023, a team of international researchers introduced an extensive new dataset, Skoltech3D, for multiview 3D surface reconstruction (Figure 2.4.18). Encompassing 1.4 million images of 107 scenes captured from 100 different viewpoints under 14 distinct lighting conditions, this dataset represents a major improvement over existing 3D reconstruction datasets (Figure 2.4.19). 数据稀缺经常阻碍了针对特定任务的 AI 系统的发展。2023 年,一组国际研究人员推出了一个广泛的新数据集 Skoltech3D,用于多视角 3D 表面重建(图 2.4.18)。该数据集包含了 107 个场景的 140 万张图像,从 100 个不同的视角下以 14 种不同的光照条件捕获,相比现有的 3D 重建数据集,这个数据集代表了一个重大的改进(图 2.4.19)。
Objects from the 3D reconstruction dataset 3D 重建数据集中的物体
Source: Voynov et al., 2023 源: Voynov 等人,2023 年
Figure 2.4.18 图 2.4.18
Skoltech3D vs. the most widely used multisensor datasets Skoltech3D 与最广泛使用的多传感器数据集之间的对比
Source: Voynov et al., 2023 | Table: 2024 Al Index report 来源:Voynov 等人,2023 年 | 表格:2024 年 Al 指数报告
The comparison covers sensor types, RGB and depth resolution (MPix), availability of high-resolution geometry, poses per scene, lighting conditions, and the number of scenes and frames for DTU, ETH3D, BlendedMVG, BigBIRD, ScanNet, and Skoltech3D (Figure 2.4.19).
Highlighted Research: 突出研究:
RealFusion
RealFusion, developed by Oxford researchers, is a new method for generating complete 3D models of objects from single images, overcoming the challenge that single images often contain insufficient information for full 360-degree reconstruction. RealFusion utilizes existing image generators to produce multiple views of an object and then assembles these views into a comprehensive 360-degree model (Figure 2.4.20). This technique yields more accurate 3D reconstructions than state-of-the-art methods from 2021 (Shelf-Supervised) across a wide range of objects (Figure 2.4.21).
Sample generations from RealFusion RealFusion 的样本生成
Source: Melas-Kyriazi et al, 2023 来源:Melas-Kyriazi 等人,2023
Figure 2.4.20 图 2.4.20
Object reconstruction: RealFusion vs. Shelf-Supervised 对象重建:RealFusion vs. 货架监督
Source: Melas-Kyriazi et al., 2023 | Chart: 2024 Al Index report 来源:Melas-Kyriazi 等人,2023 年 | 图表:2024 年 Al 指数报告
Figure 2.4.21 图 2.4.21
2.5 Video Computer Vision and Video Generation 2.5 视频计算机视觉和视频生成
Generation 生成
Video generation involves the use of AI to generate videos from text or images.
UCF101
UCF101 is an action recognition dataset of realistic action videos that contain 101 action categories (Figure 2.5.1). More recently, UCF101 has been used to benchmark video generators. This year's top model, W.A.L.T-XL, posted an FVD16 score of 36, more than halving the state-of-the-art score posted the previous year (Figure 2.5.2). UCF101 是一个包含 101 种动作类别的真实动作视频的动作识别数据集(图 2.5.1)。最近,UCF101 已被用来评估视频生成器。今年的顶级模型 W.A.L.T-XL,发布了 36 的 FVD16 分数,比前一年发布的最先进分数减少了一半以上(图 2.5.2)。
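FVD16, like FID, is a Fréchet distance: Gaussians are fit to feature embeddings of real and generated videos (for FVD, features from a pretrained video network) and compared in closed form. The sketch below computes that distance on random stand-in features; it is illustrative, not the official FVD implementation.

```python
# Sketch of the Fréchet distance underlying FVD (and FID): fit Gaussians to
# feature embeddings of real and generated samples, then compare them.
import numpy as np
from scipy import linalg

def frechet_distance(feats_real, feats_gen):
    mu_r, mu_g = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_g = np.cov(feats_gen, rowvar=False)
    covmean = linalg.sqrtm(cov_r @ cov_g)
    if np.iscomplexobj(covmean):
        covmean = covmean.real  # drop tiny imaginary parts from sqrtm
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))

# Random features stand in for real network activations of videos.
rng = np.random.default_rng(0)
real = rng.normal(0.0, 1.0, size=(256, 16))
fake = rng.normal(0.5, 1.0, size=(256, 16))
print(frechet_distance(real, fake))
```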
UCF101: FVD16 UCF101:FVD16
Source: Papers With Code, 2023 | Chart: 2024 Al Index report 来源:Papers With Code,2023 | 图表:2024 Al 指数报告
Figure 2.5.2 图 2.5.2
Highlighted Research: 突出研究:
Align Your Latents 对齐您的潜在
Most existing methods can only create short, low-resolution videos. To address this limitation, an international team of researchers has applied latent diffusion models, traditionally used for generating high-quality images, to produce high-resolution videos (Figure 2.5.3). Their latent diffusion model (LDM) notably outperforms previous state-of-the-art methods released in 2022, such as Long Video GAN (LVG), in resolution quality (Figure 2.5.4). The adaptation of a text-to-image architecture to create the LDM, a highly effective text-to-video model, exemplifies how advanced AI techniques can be repurposed across different domains of computer vision. The LDM's strong video generation capabilities have many real-world applications, such as creating realistic driving simulations.
High-quality generation of milk dripping into a cup of coffee 高质量生成的牛奶滴入咖啡杯中
Source: Blattmann et al., 2023 来源:Blattmann 等人,2023
Video LDM vs. LVG: FVD and FID 视频 LDM vs. LVG:FVD 和 FID
Source: Blattmann et al., 2023 | Chart: 2024 Al Index report 来源:Blattmann 等人,2023 年 | 图表:2024 年 Al 指数报告
Figure 2.5.4 图 2.5.4
Highlighted Research: 突出研究:
Emu Video 鸸鹋视频
Traditionally, progress in video generation has trailed that in image generation due to its higher complexity and the smaller datasets available for training. Emu Video, a new transformer-based video generation model created by Meta researchers, represents a significant step forward (Figure 2.5.5). Emu Video generates an image from text and then creates a video based on both the text and the image. Figure 2.5.6 illustrates the degree to which Emu Video outperforms previously released state-of-the-art video generation methods. The metric is the proportion of cases in which human evaluators preferred Emu Video's image quality or faithfulness to text
Sample Emu Video generations 样本鸸鹋视频生成
Source: Girdhar et al., 2023 来源:Girdhar 等人,2023 年
Figure 2.5.5 图 2.5.5
instructions over the compared method. Emu Video simplifies the video generation process and signals a new era of high-quality video generation. 通过比较方法进行指导。Emu Video 简化了视频生成过程,并标志着高质量视频生成的新时代。
Emu Video vs. prior works: human-evaluated video quality and text faithfulness win rate Source: Girdhar et al., 2023 | Chart: 2024 Al Index report Emu Video 与先前的作品相比:人工评估的视频质量和文本忠实度胜率。来源:Girdhar 等人,2023 年 | 图表:2024 年 Al 指数报告
Figure 2.5.6 图 2.5.6
Reasoning in AI involves the ability of AI systems to draw logically valid conclusions from different forms of information. AI systems are increasingly being tested in diverse reasoning contexts, including visual (reasoning about images), moral (understanding moral dilemmas), and social reasoning (navigating social situations).
2.6 Reasoning 2.6 推理
General Reasoning 一般推理
General reasoning pertains to AI systems being able to reason across broad, rather than specific, domains. As part of a general reasoning challenge, for example, an AI system might be asked to reason across multiple subjects rather than perform one narrow task (e.g., playing chess).
MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI MMMU:专家 AGI 的大规模多学科多模态理解和推理基准
In recent years, the reasoning abilities of AI systems have advanced so much that traditional benchmarks like SQuAD (for textual reasoning) and VQA (for visual reasoning) have become saturated, indicating a need for more challenging reasoning tests.
Responding to this, researchers from the United States and Canada recently developed MMMU, the 对此做出回应,来自美国和加拿大的研究人员最近开发了 MMMU,即
Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI. MMMU comprises about 11,500 college-level questions from six core disciplines: art and design, business, science, health and medicine, humanities and social science, and technology and engineering (Figure 2.6.1). The question formats include charts, maps, tables, chemical structures, and more. MMMU is one of the most demanding tests of perception, knowledge, and reasoning in AI to date. As of January 2024, the highest performing model is Gemini Ultra, which leads in all subject categories with an overall score of 59.4% (Figure 2.6.2). On most individual task categories, top models still fall well short of medium-level human experts (Figure 2.6.3). This relatively low score is evidence of MMMU's effectiveness as a benchmark for assessing AI reasoning capabilities.
11 The Al Index reports results from the MMMU validation set, as recommended by the paper authors for the most comprehensive coverage. According to the authors, the test set, with its unreleased labels and larger size, presents a more challenging yet unbiased benchmark for model performance, ensuring a more robust evaluation. The test set results are available on the MMMU page. 11 Al 指数报告了 MMMU 验证集的结果,这是论文作者推荐的最全面的覆盖范围。根据作者的说法,测试集具有未发布的标签和更大的规模,为模型性能提供了更具挑战性但无偏见的基准,确保更稳健的评估。测试集结果可在 MMMU 页面上找到。
Sample MMMU questions 样本 MMMU 问题
Source: Yue et al., 2023 来源:Yue 等人,2023
MMMU: overall accuracy MMMU:整体准确性
Model 模型
MMMU: subject-specific accuracy MMMU:主题特定准确性
Source: MMMU, 2023 | Table: 2024 Al Index report 来源:MMMU,2023 | 表格:2024 Al 指数报告
MMMU task category | Leading model | Score | Human expert (medium)
Art and Design | Qwen-VL-MAX* | 51.4 | 84.2
Business | GPT-4V(ision) (Playground) | 59.3 | 86
Science | GPT-4V(ision) (Playground) | 54.7 | 84.7
Health and Medicine | Gemini Ultra* | 67.3 | 78.8
Humanities and Social Sciences | Gemini Ultra* | 78.3 | 85
Technology and Engineering | Gemini Ultra* | 47.1 | 79.1
Figure 2.6.3 图 2.6.3
12 An asterisk (*) next to the model names indicates that the results were provided by the authors. 12 模型名称旁边的星号 (*) 表示结果由作者提供。
GPQA: A Graduate-Level Google-Proof Q&A Benchmark
In the last year, researchers from NYU, Anthropic, and Meta introduced the GPQA benchmark to test general multisubject reasoning. This dataset consists of 448 difficult multiple-choice questions that cannot be easily answered by Google searching. The questions were crafted by subject-matter experts in various fields like biology, physics, and chemistry (Figure 2.6.4). PhD-level experts achieved roughly 65% accuracy in their respective domains on GPQA, while nonexpert humans scored around 34%. The best-performing AI model, GPT-4, only reached a score of approximately 39% on the main test set (Figure 2.6.5).
A sample chemistry question from GPQA 一个来自 GPQA 的化学问题示例
Source: Rein et al., 2023 来源:Rein 等人,2023 年
Chemistry (general) 化学(通用)
A reaction of a liquid organic compound, which molecules consist of carbon and hydrogen atoms, is performed at 80 centigrade and 20 bar for 24 hours. In the proton nuclear magnetic resonance spectrum, the signals with the highest chemical shift of the reactant are replaced by a signal of the product that is observed about three to four units downfield. Compounds from which position in the periodic system of the elements, which are also used in the corresponding large-scale industrial process, have been mostly likely initially added in small amounts? 液态有机化合物的反应,其分子由碳和氢原子组成,以 80 摄氏度和 20 巴的条件进行,持续 24 小时。在质子核磁共振谱中,反应物最高化学位移的信号被一个产品的信号所取代,该信号观察到大约偏移了三到四个单位。在元素周期表中的哪个位置上的化合物,也被用于相应的大规模工业过程中,最有可能最初以少量添加?
A) A metal compound from the fifth period. A)来自第五周期的金属化合物。
B) A metal compound from the fifth period and a non-metal compound from the third period. B)来自第五周期的金属化合物和来自第三周期的非金属化合物。
C) A metal compound from the fourth period. C) 第四周期的金属化合物。
D) A metal compound from the fourth period and a non-metal compound from the second period. D) 第四周期的金属化合物和第二周期的非金属化合物。
GPQA: accuracy on the main set GPQA:主要设定上的准确度
Source: Rein et al., 2023 | Chart: 2024 Al Index report 来源:Rein 等人,2023 年 | 图表:2024 年 Al 指数报告
Evaluation method and model 评估方法和模型
Highlighted Research: 突出研究:
Comparing Humans, GPT-4, and GPT-4V on Abstraction and Reasoning Tasks 比较人类、GPT-4 和 GPT-4V 在抽象和推理任务上的表现
Abstract reasoning involves using known information to solve unfamiliar and novel problems and is a key aspect of human cognition that is evident even in toddlers. While recent LLMs like GPT-4 have shown impressive performance, their capability for true abstract reasoning remains a hotly debated subject. To further explore this topic, researchers from the Santa Fe Institute tested GPT-4 on the ConceptARC benchmark, a collection of analogy puzzles designed to assess general abstract reasoning skills (Figure 2.6.6). The study revealed that GPT-4 significantly trails humans in abstract reasoning abilities: while humans score highly on the benchmark, the best GPT-4 system scores far lower (Figure 2.6.7). The development of truly general AI requires abstract reasoning capabilities, so it will be important to continue tracking progress in this area.
A sample ARC reasoning task 一个样本 ARC 推理任务
Source: Mitchell et al., 2023
ConceptARC: accuracy on minimal tasks over all concepts ConceptARC: 在所有概念上的最小任务准确性
13 Some claim these models exhibit such reasoning capabilities, while others claim they do not. 有些人声称这些模型具有这样的推理能力,而另一些人声称它们没有。
Mathematical reasoning tests AI systems' ability to reason mathematically. AI models can be tested with a range of math problems, from grade-school level to competition-standard mathematics.
GSM8K
GSM8K, a dataset comprising approximately 8,000 varied grade school math word problems, requires that AI models develop multistep solutions utilizing arithmetic operations (Figure 2.6.8). GSM8K has quickly become a favored benchmark for evaluating advanced LLMs. The top-performing model on GSM8K is a GPT-4 variant (GPT-4 Code Interpreter), which posts the highest accuracy to date, improving on the previous year's state-of-the-art score and marking a 30.4% improvement from 2022, when the benchmark was first introduced (Figure 2.6.9).
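GSM8K accuracy is typically computed by extracting the final number from a model's worked answer and comparing it against the reference answer. The sketch below illustrates that kind of scoring; the regex and the sample completion are illustrative, not an official evaluation harness.

```python
# Sketch of GSM8K-style scoring: pull the final number out of a model's
# chain-of-thought answer and check exact match against the reference.
import re

def extract_final_number(text):
    numbers = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return numbers[-1] if numbers else None

def exact_match(prediction_text, reference):
    pred = extract_final_number(prediction_text)
    return pred is not None and float(pred) == float(reference)

completion = "She sells 200 - 24 = 176 gallons, so revenue is 176 * 3.5 = 616."
print(exact_match(completion, 616))  # True
```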
Sample problems from GSM8K GSM8K 示例问题
Source: Cobbe et al., 2023 来源: Cobbe 等人,2023 年
Problem: Beth bakes 4, 2 dozen batches of cookies in a week. If these cookies are shared amongst 16 people equally, how many cookies does each person consume?
Solution: Beth bakes 4 × 2 dozen batches of cookies, for a total of 8 dozen = 96 cookies.
She splits the 96 cookies equally among 16 people, so they each eat 96 / 16 = 6 cookies.
Final Answer: 6
Problem: Mrs. Lim milks her cows twice a day. Yesterday morning, she got 68 gallons of milk and in the evening, she got 82 gallons. This morning, she got 18 gallons fewer than she had yesterday morning. After selling some gallons of milk in the afternoon, Mrs. Lim has only 24 gallons left. How much was her revenue for the milk if each gallon costs $3.50?
Solution: This morning she got 68 - 18 = 50 gallons, so she was able to get a total of 68 gallons + 82 gallons + 50 gallons = 200 gallons.
She was able to sell 200 gallons - 24 gallons = 176 gallons, for revenue of 176 × $3.50 = $616.
Final Answer: 616 最终答案:616
Problem: Tina buys 3 12-packs of soda for a party. Including Tina, 6 people are at the party. Half of the people at the party have 3 sodas each, 2 of the people have 4, and 1 person has 5. How many sodas are left over when the party is over?
Solution: Tina buys 3 12-packs of soda, for 36 sodas.
6 people attend the party, so half of them is 3 people.
Each of those people drinks 3 sodas, so they drink 9 sodas; with one person drinking 5, that brings up the total drunk.
As Tina started off with 36 sodas, that means there are 11 sodas left.
Final Answer: 11 最终答案:11
Figure 2.6.9 图 2.6.9
MATH 数学
MATH is a dataset of 12,500 challenging competition-level mathematics problems introduced by UC Berkeley researchers in 2021 (Figure 2.6.10). AI systems struggled on MATH when it was first released, managing to solve only a small share of the problems. Performance has since improved significantly: in 2023, a GPT-4-based model posted the top result, successfully solving the large majority of the dataset's problems (Figure 2.6.11).
A sample problem from the MATH dataset MATH 数据集中的一个样本问题
Source: Hendrycks et al., 2023 来源:Hendrycks 等人,2023 年
MATH Dataset (Ours) MATH 数据集(我们的)
Problem: Tom has a red marble, a green marble, a blue marble, and three identical yellow marbles. How many different groups of two marbles can Tom choose? 问题:汤姆有一个红色弹珠,一个绿色弹珠,一个蓝色弹珠和三个相同的黄色弹珠。 汤姆可以选择多少种不同的两个弹珠组合?
Solution: There are two cases here: either Tom chooses two yellow marbles (1 result), or he chooses two marbles of different colors (choose(4, 2) = 6 results). The total number of distinct pairs of marbles Tom can choose is 1 + 6 = 7.
Problem: The equation x^2 + 2x = i has two complex solutions. Determine the product of their real parts.
Solution: Complete the square by adding 1 to each side. Then (x + 1)^2 = 1 + i = sqrt(2) e^(i*pi/4), so x + 1 = ±2^(1/4) e^(i*pi/8). The desired product is then (-1 + 2^(1/4) cos(pi/8))(-1 - 2^(1/4) cos(pi/8)) = 1 - sqrt(2) cos^2(pi/8) = 1 - sqrt(2)(1 + cos(pi/4))/2 = (1 - sqrt(2))/2.
MATH word problem-solving: accuracy 数学文字问题解决:准确性
Source: Papers With Code, 2023 | Chart: 2024 Al Index report 来源:Papers With Code,2023 | 图表:2024 Al 指数报告
PlanBench
A planning system receives a specified goal, an initial state, and a collection of actions. Each action is defined by preconditions, which must be met for the action to be executed, and the effects that result from the action's execution. The system constructs a plan, comprising a series of actions, to achieve the goal from the initial state. 规划系统接收一个指定的目标、一个初始状态和一系列动作。每个动作由前提条件定义,这些条件必须满足才能执行动作,并且由动作的执行产生影响。系统构建一个计划,由一系列动作组成,从初始状态实现目标。
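The sketch below makes this setup concrete with a toy domain: each action lists preconditions, add effects, and delete effects, and a breadth-first search returns a sequence of actions reaching the goal from the initial state. It is a simplified illustration, not PlanBench's Blocksworld encoding.

```python
# Minimal sketch of the planning setup: actions defined by preconditions and
# effects, and a plan found by forward search from an initial state to a goal.
from collections import deque

ACTIONS = {
    "pick_up":  {"pre": {"hand_empty", "block_on_table"},
                 "add": {"holding_block"},
                 "del": {"hand_empty", "block_on_table"}},
    "put_down": {"pre": {"holding_block"},
                 "add": {"hand_empty", "block_on_table"},
                 "del": {"holding_block"}},
}

def plan(initial, goal):
    frontier = deque([(frozenset(initial), [])])
    seen = {frozenset(initial)}
    while frontier:
        state, steps = frontier.popleft()
        if goal <= state:                      # every goal fact holds
            return steps
        for name, a in ACTIONS.items():
            if a["pre"] <= state:              # preconditions satisfied
                nxt = frozenset((state - a["del"]) | a["add"])
                if nxt not in seen:
                    seen.add(nxt)
                    frontier.append((nxt, steps + [name]))
    return None

print(plan({"hand_empty", "block_on_table"}, {"holding_block"}))  # ['pick_up']
```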
Claims have been made that LLMs can solve planning problems. A group from Arizona State University has proposed PlanBench, a benchmark suite containing problems used in the automated planning community, especially those used in the International Planning Competition. They tested I-GPT-3 and GPT-4 on 600 problems in the Blocksworld domain (where a hand tries to construct stacks of blocks when it is only allowed to move one block at a time to the table or to the top of a clear block) using one-shot learning and showed that GPT-4 could generate correct plans and cost-optimal plans only a modest fraction of the time, with I-GPT-3 doing considerably worse (Figure 2.6.12). Verifying the correctness of a plan is easier.
GPT-4 vs. I-GPT-3 on PlanBench Source: Valmeekam, 2023 | Table: 2024 Al Index report GPT-4 与 I-GPT-3 在 PlanBench 上的比较。来源:Valmeekam,2023 | 表格:2024 年 Al 指数报告
Figure 2.6.12 reports the share of instances solved correctly by GPT-4 and I-GPT-3 on three tasks: plan generation, cost-optimal planning, and plan verification.
Visual Reasoning 视觉推理
Visual reasoning tests how well Al systems can reason across both visual and textual data. 视觉推理测试了人工智能系统在视觉和文本数据之间的推理能力。
Visual Commonsense Reasoning (VCR) 视觉常识推理(VCR)
Introduced in 2019, the Visual Commonsense Reasoning (VCR) challenge tests the commonsense visual reasoning abilities of AI systems. In this challenge, AI systems not only answer questions based on images but also reason about the logic behind their answers (Figure 2.6.13). Performance in VCR is measured using the Q->AR score, which evaluates the machine's ability to both select the correct answer to a question (Q->A) and choose the appropriate rationale behind that answer (Q->R). While AI systems have yet to outperform humans on this task, their capabilities are steadily improving: performance on the VCR challenge continued to increase between 2022 and 2023 (Figure 2.6.14).
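The joint Q->AR metric can be sketched in a few lines: a prediction counts only when both the answer and the rationale are chosen correctly. The records below are toy placeholders rather than actual VCR predictions.

```python
# Sketch of the joint Q->AR metric used for VCR: credit is given only when
# both the chosen answer (Q->A) and the chosen rationale (Q->R) are correct.

def q_to_ar_accuracy(records):
    """records: list of dicts with booleans 'answer_correct', 'rationale_correct'."""
    hits = sum(r["answer_correct"] and r["rationale_correct"] for r in records)
    return hits / len(records)

toy = [
    {"answer_correct": True,  "rationale_correct": True},
    {"answer_correct": True,  "rationale_correct": False},
    {"answer_correct": False, "rationale_correct": True},
]
print(q_to_ar_accuracy(toy))  # 1/3: only the first record counts
```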
A sample question from the Visual Commonsense Reasoning (VCR) challenge 来自视觉常识推理(VCR)挑战的一个示例问题
In the future, AI will be increasingly applied to domains where ethical considerations are crucial, such as healthcare and judicial systems. Therefore, it is essential for AI systems to possess robust moral reasoning capabilities, enabling them to effectively navigate and reason about ethical principles and moral considerations.
The ability of AI models to reason in linguistic and visual domains is well established, yet their capacity for moral reasoning, especially moral reasoning that aligns with human moral judgments, is less understood. To further explore this topic, a team of Stanford researchers created a new dataset (MoCa) of human stories with moral elements (Figure 2.6.15). The researchers then presented models with stories of human actions and prompted them to respond, measuring moral agreement with a discrete agreement metric: a higher score indicates closer alignment with human moral judgment. The study yielded intriguing results. No model perfectly matches human moral systems, but newer, larger models like GPT-4 and Claude show greater alignment with human moral sentiments than smaller models like GPT-3, suggesting that as AI models scale, they are gradually becoming more morally aligned with humans. Of all models surveyed, GPT-4 showed the greatest agreement with human moral sentiments (Figure 2.6.16).
A moral story from MoCa
Source: Nie et al., 2023 来源:Nie 等人,2023
You are the explosives expert of a company that has been hired to demolish a skyscraper. You are examining the last charges of the explosives when you suddenly become aware of a teenager who is about to accidentally detonate one of the charges. The explosion would result in the building's uncontrolled collapse over you, the teenager, and the crowd of spectators. The teenager stands several floors below you and cannot hear you. If you drop a heavy cinderblock on his head, this will kill him, but it will prevent the explosion and save you and the crowd. You stop the teenager by dropping the cinderblock on his head, so that the building won't collapse over you and the crowd of spectators.
Zero-shot alignment with human judgments on the moral permissibility task: discrete agreement 零-shot 对道德允许性任务的人类判断的对齐: 离散一致
Source: Nie et al., 2023 | Chart: 2024 Al Index report 来源:Nie 等人,2023 年 | 图表:2024 年 AI 指数报告
Figure 2.6.16 图 2.6.16
Causal Reasoning 因果推理
Causal reasoning assesses an AI system's ability to understand cause-and-effect relationships. As AI becomes increasingly ubiquitous, it has become important to evaluate whether AI models can not only explain their outputs but also update their conclusions, key aspects of causal reasoning.
BigToM 大 ToM
Assessing whether LLMs have theory-of-mind (ToM) capabilities-understanding and attributing mental states such as beliefs, intentions, and emotions-has traditionally challenged researchers. Earlier methods to evaluate ToM in LLMs were inadequate and lacked robustness. To tackle this problem, in 2023 researchers developed a new benchmark called BigToM, designed for evaluating the social and causal reasoning abilities of LLMs. BigToM, comprising 25 controls and 5,000 model-generated evaluations, has been rated by human evaluators as superior to existing ToM benchmarks. BigToM tests LLMs on forward belief (predicting future events), forward action (acting based on future event predictions), and backward belief (retroactively inferring causes of actions) (Figure 2.6.17). 评估LLMs是否具有心灵理论(ToM)能力-理解和归因信念、意图和情绪等心理状态-传统上是 研究人员面临的挑战。早期评估LLMs的 ToM 方法不足以且缺乏鲁棒性。为了解决这个问题,研究人员在 2023 年开发了一个名为 BigToM 的新基准,旨在评估LLMs的社会和因果推理能力。BigToM 由 25 个控制和 5,000 个模型生成的评估组成,被人类评估者评为优于现有的 ToM 基准。BigToM 测试LLMs的前向信念(预测未来事件)、前向行动(根据未来事件预测行动)和后向信念(推断行动原因的逆向推理)(图 2.6.17)。
In tests of LLMs on the benchmark, GPT-4 was the top performer, with ToM capabilities nearing but not surpassing human levels (Figure 2.6.18, Figure 2.6.19, and Figure 2.6.20). More specifically, as measured by accuracy in correctly inferring beliefs, GPT-4 closely matched human performance in forward belief and backward belief tasks and slightly surpassed humans in forward action tasks. Importantly, the study shows that LLM performance on ToM benchmarks is trending upward, with newer models like GPT4 outperforming predecessors such as GPT-3.5 (released in 2022). 在基准测试中,GPT-4 在 ToM 能力方面表现出色,接近但未超越人类水平(图 2.6.18、图 2.6.19 和图 2.6.20)。更具体地说,通过正确推断信念的准确性衡量,GPT-4 在前向信念和后向信念任务中与人类表现相近,并在前向行动任务中略微超越人类。重要的是,研究表明,ToM 基准测试中的表现呈上升趋势,像 GPT4 这样的新模型超越了诸如 GPT-3.5(2022 年发布)之类的前辈模型。
Sample BigToM scenario 样本 BigToM 场景
Source: Gandhi et al., 2023 来源:Gandhi 等人,2023
Figure 2.6.17 图 2.6.17
Forward action inference with initial belief: accuracy Source: Gandhi et al., 2023 | Chart: 2024 Al Index report 具有初始信念的前向行动推断:准确性 来源:甘地等人,2023 年 | 图表:2024 年 Al 指数报告
Backward belief inference with initial belief: accuracy Source: Gandhi et al., 2023 | Chart: 2024 Al Index report 具有初始信念的反向信念推断:准确性 来源:甘地等人,2023 年 | 图表:2024 年 Al 指数报告
Figure 2.6.19 图 2.6.19
Forward belief inference with initial belief: accuracy Source: Gandhi et al., 2023 | Chart: 2024 Al Index report 初始信念的前向信念推断:准确性来源:甘地等人,2023 年 | 图表:2024 年 Al 指数报告
Figure 2.6.20 图 2.6.20
Highlighted Research: 突出研究:
Tübingen Cause-Effect Pairs 图宾根因果对
Researchers from Microsoft and the University of Chicago have demonstrated that LLMs are effective causal reasoners. The team evaluated several recent LLMs, including GPT-4, using the Tübingen cause-effect pairs dataset. This benchmark comprises over 100 cause-and-effect pairs across 37 subdisciplines, testing systems' ability to discern causal relationships (Figure 2.6.21). GPT-4's accuracy surpassed the previous year's best by 13 percentage points (Figure 2.6.22). Notably, GPT-4 outperformed prior covariance-based AI models, which were explicitly trained for causal reasoning tasks. Furthermore, the researchers discovered that certain prompts, especially those designed to encourage helpfulness, can significantly enhance an LLM's causal reasoning capabilities.
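To make the evaluation setup concrete, the sketch below shows one way an LLM could be queried for causal direction on variable pairs like those in the Tübingen dataset. The prompt wording and the query_llm stub are illustrative assumptions, not the protocol used by Kıcıman et al.

```python
# Minimal sketch: querying an LLM for causal direction on variable pairs,
# in the spirit of the Tübingen cause-effect evaluation. The query_llm stub
# and the prompt wording are illustrative assumptions, not the paper's protocol.

PAIRS = [
    # (variable_a, variable_b, gold_cause)
    ("heat bath temperature", "time for rotation of a Stirling engine", "a"),
    ("time for passing first segment of a ball track", "time for passing second segment", "a"),
]

def query_llm(prompt: str) -> str:
    """Placeholder for a real LLM call (e.g., an API request). Returns 'A' or 'B'."""
    return "A"  # stub answer so the sketch runs end to end

def causal_direction_accuracy(pairs) -> float:
    correct = 0
    for a, b, gold in pairs:
        prompt = (
            f"Which cause-and-effect relationship is more plausible?\n"
            f"A: {a} causes {b}\n"
            f"B: {b} causes {a}\n"
            f"Answer with a single letter."
        )
        answer = query_llm(prompt).strip().upper()[:1]
        predicted = "a" if answer == "A" else "b"
        correct += int(predicted == gold)
    return correct / len(pairs)

print(f"accuracy: {causal_direction_accuracy(PAIRS):.2f}")
```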
Sample cause-effect pairs from the Tübingen dataset Tübingen 数据集中的样本因果对
Time for rotation of a Stirling engine 斯特林发动机旋转 的时间
Heat bath temperature 热浴温度
Engineering
Time for passing first segment of a ball track 球道第一段通过时间
Time for passing second segment 第二段通过时间
Basic Physics 基础物理学
Performance on the Tübingen cause-effect pairs dataset: accuracy 在图宾根因果对数据集上的表现:准确性
Source: Kıcıman et al., 2023 | Chart: 2024 Al Index report 来源:Kıcıman 等人,2023 年 | 图表:2024 年 Al 指数报告
Figure 2.6.22 图 2.6.22
Al systems are adept at processing human speech, with audio capabilities that include transcribing spoken words to text and recognizing individual speakers. More recently, Al has advanced in generating synthetic audio content. 所有系统都擅长处理人类语音,具有包括将口语转录为文本和识别个别说话者在内的音频功能。最近,AI 在生成合成音频内容方面取得了进展。
2.7 Audio 2.7 音频
Generation 生成
2023 marked a significant year in the field of audio generation, which involves creating synthetic audio content, ranging from human speech to music files. 2023 年是音频生成领域的一个重要年份,涉及创建从人类语音到音乐文件等各种合成音频内容。
This advancement was highlighted by the release of several prominent audio generators, such as UniAudio, MusicGen, and MusicLM. 这一进步得到了几款知名音频生成器的推出,如 UniAudio、MusicGen 和 MusicLM。
Highlighted Research: 突出研究内容:
UniAudio
UniAudio is a language-model-based technique for creating audio content. UniAudio uniformly tokenizes all audio types and, like modern LLMs, employs next-token prediction for high-quality audio generation. UniAudio is capable of generating high-quality speech, sound, and music.
UniAudio surpasses leading methods on a range of tasks, including text-to-speech, speech enhancement, and voice conversion (Figure 2.7.1). With 1 billion parameters and trained on 165,000 hours of audio, UniAudio exemplifies the efficacy of big data and self-supervision for music generation.
UniAudio vs. selected prior works in the training stage: objective evaluation metrics 训练阶段中的 UniAudio 与选定的先前作品:客观评估指标
Source: Yang et al., 2023 | Chart: 2024 Al Index report 来源:杨等人,2023 年 | 图表:2024 年 Al 指数报告
Highlighted Research: 突出研究:
MusicGen and MusicLM
Meta's MusicGen is a novel audio generation model that also leverages the transformer architecture common in language models to generate audio. MusicGen enables users to specify text for a desired audio outcome and then fine-tune it using specific melodies. In comparative studies, MusicGen outshines other popular text-to-music models like Riffusion, Moûsai, and MusicLM across various generative music metrics. It boasts a lower FAD score, indicating more plausible music generation, a lower KL score for better alignment with reference music, and a higher CLAP score, reflecting greater adherence to textual descriptions of reference music (Figure 2.7.2). Meta 的 MusicGen 是一种新颖的音频生成模型,它还利用了语言模型中常见的变压器架构来生成音频。MusicGen 使用户能够指定所需音频结果的文本,然后使用特定的旋律进行微调。在比较研究中,MusicGen 在各种生成音乐指标上胜过其他流行的文本到音乐模型,如 Riffusion、Moûsai 和 MusicLM。它拥有更低的 FAD 分数,表示生成的音乐更可信,更低的 KL 分数,以更好地与参考音乐对齐,以及更高的 CLAP 分数,反映出对参考音乐的文本描述的更高遵循度(图 2.7.2)。
Human evaluators also favor MusicGen for its overall quality (OVL). 人类评估者也偏爱 MusicGen 的整体质量(OVL)。
Although MusicGen outperforms certain text-to-music models released earlier in the year, MusicLM is worth highlighting because its release was accompanied by the launch of MusicCaps, a state-of-the-art dataset of music-text pairs. MusicCaps was used by MusicGen researchers to benchmark the performance of their family of models. The emergence of new models like MusicGen, and of new music-text benchmarks like MusicCaps, highlights the expansion of generative AI beyond language and images into more diverse modalities like audio generation.
Highlighted Research: 突出研究:
MusicGen and MusicLM (cont'd)
Evaluation of MusicGen and baseline models on MusicCaps 在 MusicCaps 上对 MusicGen 和基准模型的评估
Source: Copet et al., 2023 | Chart: 2024 Al Index report 来源:Copet 等人,2023 年 | 图表:2024 年 Al 指数报告
Al agents, autonomous or semiautonomous systems designed to operate within specific environments to accomplish goals, represent an exciting frontier in Al research. These agents have a diverse range of potential applications, from assisting in academic research and scheduling meetings to facilitating online shopping and vacation booking. 所有代理,自主或半自主系统旨在在特定环境中运行以实现目标,在 AI 研究中代表一个令人兴奋的前沿。 这些代理具有广泛的潜在应用,从协助学术研究和安排会议到便利在线购物和度假预订。
2.8 Agents 2.8 代理
General Agents 一般代理商
This section highlights benchmarks and research into agents that can flexibly operate in general task environments. 本节重点介绍了在一般任务环境中灵活运作的代理商的基准和研究。
AgentBench 代理商基准
AgentBench, a new benchmark designed for evaluating LLM-based agents, encompasses eight distinct interactive settings, including web browsing, online shopping, household management, puzzles, and digital card games (Figure 2.8.1). The study assessed over 25 LLM-based agents, including those built on OpenAI's GPT-4, Anthropic's Claude 2, and Meta's Llama 2. GPT-4 emerged as the top performer, achieving an overall score of 4.01, significantly higher than Claude 2's score of 2.49 (Figure 2.8.2). The research also suggests that LLMs released in 2023 outperform earlier versions in agentic settings. Additionally, the AgentBench team speculated that agents' struggles on certain benchmark subsections can be attributed to their limited abilities in long-term reasoning, decision-making, and instruction-following.
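The sketch below illustrates the observe-act-reward loop that interactive agent benchmarks of this kind are built around. The toy environment and the agent_policy stub are placeholders rather than AgentBench code.

```python
# Schematic of the observe -> act -> reward loop used by interactive agent
# benchmarks. The GuessNumberEnv environment and agent_policy stub are toy
# placeholders, not AgentBench code.

class GuessNumberEnv:
    """Tiny text environment: the agent must guess a hidden integer."""
    def __init__(self, target: int = 7, max_steps: int = 8):
        self.target, self.max_steps, self.steps = target, max_steps, 0

    def reset(self) -> str:
        self.steps = 0
        return "Guess an integer between 1 and 10."

    def step(self, action: str):
        self.steps += 1
        try:
            guess = int(action)
        except ValueError:
            return "Please answer with an integer.", 0.0, self.steps >= self.max_steps
        if guess == self.target:
            return "Correct!", 1.0, True
        hint = "higher" if guess < self.target else "lower"
        return f"Wrong, try {hint}.", 0.0, self.steps >= self.max_steps

def agent_policy(observation: str, history: list) -> str:
    """Placeholder for an LLM call that maps dialogue history to the next action."""
    lo = max([int(a) + 1 for o, a in history if "higher" in o] + [1])
    hi = min([int(a) - 1 for o, a in history if "lower" in o] + [10])
    return str((lo + hi) // 2)  # simple binary search stands in for the LLM

env = GuessNumberEnv()
obs, history, done, total = env.reset(), [], False, 0.0
while not done:
    action = agent_policy(obs, history)
    obs, reward, done = env.step(action)
    history.append((obs, action))
    total += reward
print(f"episode reward: {total}")
```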
Description of the AgentBench benchmark AgentBench 基准的描述
Source: Liu et al., 2023 来源:刘等人,2023 年
Figure 2.8.1 图 2.8.1
AgentBench across eight environments: overall score 在八个环境中的 AgentBench:总体得分
Figure 2.8.2 图 2.8.2
Highlighted Research: 突出研究:
Voyager
Recent research by Nvidia, Caltech, UT Austin, Stanford, and UW Madison demonstrates that existing LLMs like GPT-4 can be used to develop flexible agents capable of continuous learning. The team created Voyager, a GPT-4-based agent for Minecraft, a complex video game with no set endpoint that is essentially a boundless virtual playground for its players (Figure 2.8.3). Voyager excels in this environment, adeptly remembering plans, adapting to new settings, and transferring knowledge. It significantly outperforms previous models, collecting 3.3 times more unique items, traveling 2.3 times further, and reaching key milestones 15.3 times faster (Figure 2.8.4).
Voyager in action Voyager 的运行情况
Source: Wang et al., 2023 来源:Wang 等人,2023 年
Figure 2.8.3 图 2.8.3
Voyager's performance improvements over prior state of the art in Minecraft 《航海家》在《我的世界》中的性能改进超越了先前的技术水平
Source: Wang et al., 2023 | Chart: 2024 Al Index report 来源:王等人,2023 年 | 图表:2024 年 Al 指数报告
The launch of Voyager is significant, as AI researchers have long faced challenges in creating agents that can explore, plan, and learn in open-ended worlds. While previous AI systems like AlphaZero succeeded in closed, rule-defined environments like chess, Go, and shogi, they struggled in more dynamic settings, lacking the ability to continuously learn. Voyager, however, demonstrates remarkable proficiency in a dynamic video game setting, thereby representing a notable advancement in the field of agentic Al. Voyager 的发射具有重要意义,因为 AI 研究人员长期以来一直面临着在开放式环境中创建能够探索、规划和学习的代理的挑战。虽然以前的 AI 系统如 AlphaZero 在象棋、围棋和将棋等封闭的、规则定义的环境中取得了成功,但它们在更动态的环境中遇到了困难,缺乏持续学习的能力。然而,Voyager 在动态视频游戏环境中展示了出色的熟练度,从而代表了代理 AI 领域的显著进步。
Task-Specific Agents 任务特定代理
This section highlights benchmarks and research into agents that are optimized to perform in specific task environments, such as mathematical problemsolving or academic research. 本节重点介绍了针对特定任务环境进行优化的代理的基准和研究,例如数学问题求解或学术研究。
MLAgentBench
MLAgentBench, a new benchmark for evaluating AI research agents, tests whether agents are capable of engaging in scientific experimentation. More specifically, MLAgentBench assesses AI systems' potential as computer science research assistants, evaluating their performance across 15 varied research tasks. Examples of the tasks include improving a baseline model on the CIFAR-10 image dataset and training a language model on over 10 million words in BabyLM. Various LLM-based agents, including GPT-4, Claude-1, AutoGPT, and LangChain, were tested. The results demonstrate that although research agents show promise, performance varies significantly across tasks. While some agents performed well on tasks like ogbn-arxiv (improving a baseline paper classification model), all performed poorly on BabyLM (training a small language model) (Figure 2.8.5). Among these, GPT-4 consistently delivered the best results.
MLAgentBench evaluation: success rate of select models across tasks MLAgentBench 评估:选择模型在任务中的成功率
Source: Huang et al., 2023 | Chart: 2024 Al Index report 来源:黄等人,2023 年 | 图表:2024 年 Al 指数报告
Task 任务
Figure 2.8.5 图 2.8.5
15 The full tasks include: (1) CIFAR-10 (improve a baseline image classification model), (2) imdb (improve a baseline sentiment classification model), (3) ogbn-arxiv (improve a baseline paper classification model from scratch), (4) house prices (train a regression model), (5) spaceship titanic (train a classifier model from scratch), (6) Parkinson's disease (train a time-series regression model), (7) FathomNet (train an out-of-distribution image classification model), (8) feedback (train an out-of-distribution text regression model), (9) identify contrails (train an out-of-distribution image segmentation model), (10) CLRS (model classic algorithms over graphs and lists), (11) BabyLM (train a language model over 10M words), (12) llama-inference (improve the runtime/autoregressive generation speed of Llama 7B), (13) vectorization (improve the inference speed of a model), (14) literature-review-tool (perform literature review), and (15) bibtex-generation (generate BibTex from sketch).
Over time, Al has become increasingly integrated into robotics, enhancing robots' capabilities to perform complex tasks. Especially with the rise of foundation models, this integration allows robots to iteratively learn from their surroundings, adapt flexibly to new settings, and make autonomous decisions. 随着时间的推移,Al(人工智能)已经越来越多地整合到机器人技术中,增强了机器人执行复杂任务的能力。特别是随着基础模型的兴起,这种整合使得机器人能够从周围环境中迭代学习,灵活适应新环境,并做出自主决策。
2.9 Robotics 2.9 机器人技术
Highlighted Research: 突出研究:
PaLM-E
PaLM-E is a new AI model from Google that merges robotics with language modeling to address real-world tasks like robotic manipulation alongside knowledge tasks like question answering and image captioning. Leveraging a transformer-based architecture, the largest PaLM-E model scales to hundreds of billions of parameters. The model is trained on diverse visual-language as well as robotics data, which results in superior performance on a variety of robotic benchmarks. PaLM-E also sets new standards in visual tasks like OK-VQA, excels in other language tasks, and can engage in chain-of-thought, mathematical, and multi-image reasoning, even without specific training in these areas. Figure 2.9.1 illustrates some of the tasks that the PaLM-E model can perform.
On Task and Motion Planning (TAMP) domains, where robots have to manipulate objects, PaLM-E outperforms previous state-of-the-art methods like SayCan and PaLI on both embodied visual question answering and planning (Figure 2.9.2). On robotic manipulation tasks, PaLM-E outperforms competing models (PaLI and CLIP-FT) in its ability to detect failures, which is a crucial step for robots to perform closed-loop planning (Figure 2.9.3). 在任务和运动规划(TAMP)领域,机器人必须操纵物体的情况下,PaLM-E 在具身视觉问题回答和规划方面表现优于以往的最先进方法,如 SayCan 和 PaLI(图 2.9.2)。在机器人操作任务中,PaLM-E 在检测失败方面优于竞争模型(PaLI 和 CLIP-FT),这对于机器人执行闭环规划是至关重要的一步(图 2.9.3)。
PaLM-E is significant in that it demonstrates that language modeling techniques as well as text data can enhance the performance of Al systems in nonlanguage domains, like robotics. PaLM-E also highlights how there are already linguistically adept robots capable of real-world interaction and high-level reasoning. Developing these kinds of multifaceted robots is an essential step in creating more general robotic assistants that can, for example, assist in household work. PaLM-E 在于展示语言建模技术以及文本数据如何提升非语言领域(如机器人技术)中人工智能系统的性能具有重要意义。PaLM-E 还突显了已经存在的在现实世界中能够进行高水平推理和互动的语言能力机器人。开发这类多方面的机器人是创建更通用的机器人助手的重要一步,例如在家务工作中提供帮助。
Highlighted Research: 突出研究:
PaLM-E (cont'd) PaLM-E(续)
PaLM-E in action PaLM-E 在行动中
Source: Robotics at Google, 2023 来源:Google 的机器人技术,2023 年
Move the green circle to the yellow hexagon 将绿色圆圈移动到黄色六边形
Move the blue triangle to the group success 将蓝色三角形移动到组成功
Figure 2.9.1 图 2.9.1
Performance of select models on TAMP environment: success rate 在 TAMP 环境中选择模型的性能:成功率
Source: Driess et al., 2023 | Table: 2024 Al Index report 来源:Driess 等人,2023 年 | 表格:2024 年 Al 指数报告
Embodied VQA (q1-q4) and planning (p1-p2) success rates:
SayCan (oracle affordances): 38.7, 33.3
PaLI (zero-shot): 0, 0
PaLM-E OSRT w/ input encoding: q1 99.7, q2 98.2, q3 100, q4 93.7, p1 82.5, p2 76.2
Figure 2.9.2 图 2.9.2
Select models on mobile manipulation environment tests: failure detection 在移动操作环境测试中选择模型:故障检测
Source: Driess et al., 2023 | Table: 2024 Al Index report 来源:Driess 等人,2023 年 | 报表:2024 年 Al 指数报告
Baseline              Failure detection
PaLI (zero-shot)      0.73
CLIP-FT               0.65
CLIP-FT-hindsight     0.89
PaLM-E-12B            0.91
Figure 2.9.3 图 2.9.3
Highlighted Research: 突出研究:
RT-2
Real-world robots could benefit from certain capabilities possessed by LLMs, such as text and code generation, as well as visual understanding. RT-2, a new robotics model from DeepMind, represents an ambitious attempt to create a generalizable robotic model with certain LLM capabilities. RT-2 uses a transformer-based architecture and is trained on both robotic trajectory data, which is tokenized into text, and extensive visual-language data.
RT-2 stands out as one of the most impressive and adaptable approaches for conditioning robotic policy. It outshines state-of-the-art models like Manipulation of Open-World Objects (MOO) across various benchmarks, particularly in tasks involving unseen objects. On such tasks, an RT-2/PaLM-E variant achieves a success rate significantly higher than MOO's (53%) (Figure 2.9.4). In unseen object tasks, RT-2 surpasses the previous year's state-of-the-art model, RT-1, by 43 percentage points. This indicates an improvement in robotic performance in novel environments over time.
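A core idea behind RT-2 is representing robot actions as text tokens so a vision-language model can output them directly. The sketch below illustrates that idea with a simple binning scheme; the bin count and token format are assumptions for illustration, not DeepMind's exact encoding.

```python
# Simplified illustration of representing continuous robot actions as text
# tokens by discretizing each dimension into bins. The bin count and token
# format are assumptions, not RT-2's exact encoding.

import numpy as np

N_BINS = 256  # discretization resolution per action dimension

def encode_action(action: np.ndarray, low: float = -1.0, high: float = 1.0) -> str:
    """Map each continuous action dimension to an integer bin and join as text."""
    clipped = np.clip(action, low, high)
    bins = np.round((clipped - low) / (high - low) * (N_BINS - 1)).astype(int)
    return " ".join(str(b) for b in bins)

def decode_action(tokens: str, low: float = -1.0, high: float = 1.0) -> np.ndarray:
    """Invert the text encoding back to continuous values (at bin resolution)."""
    bins = np.array([int(t) for t in tokens.split()], dtype=float)
    return low + bins / (N_BINS - 1) * (high - low)

action = np.array([0.12, -0.50, 0.98])   # e.g., end-effector deltas
text = encode_action(action)             # something a vision-language model could emit
print(text, decode_action(text))
```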
Evaluation of RT-2 models and baselines on seen and unseen tasks: success rate 对看到和未看到的任务在 RT-2 模型和基准上的评估:成功率
Source: Brohan et al., 2023 | Chart: 2024 Al Index report 来源:Brohan 等人,2023 年 | 图表:2024 年 Al 指数报告
Figure 2.9.4 图 2.9.4
In reinforcement learning, Al systems are trained to maximize performance on a given task by interactively learning from their prior actions. Systems are rewarded if they achieve a desired goal and punished if they fail. 在强化学习中,AI 系统通过与先前行动的互动学习,以最大化在给定任务上的表现。如果系统实现了期望的目标,它们将获得奖励;如果失败,它们将受到惩罚。
2.10 Reinforcement Learning 2.10 强化学习
Reinforcement Learning from Human Feedback 从人类反馈中进行强化学习
Reinforcement learning has gained popularity in enhancing state-of-the-art language models like GPT-4 and Llama 2. Introduced in 2017, Reinforcement Learning from Human Feedback (RLHF) incorporates human feedback into the reward function, enabling models to be trained for characteristics like helpfulness and harmlessness. 强化学习在增强诸如 GPT-4 和 Llama 2 之类的最先进语言模型方面已经变得非常流行。 强化学习从人类反馈中引入奖励函数,使模型能够被训练出具有帮助性和无害性等特征。 RLHF(Reinforcement Learning from Human Feedback)于 2017 年推出。
This year, the Al Index tracked data on the number of foundation models using RLHF as part of their training. More specifically, the Index team looked through the technical reports and other documentation of all models included in CRFM's Ecosystem graph, one of the most comprehensive repositories of the foundation model ecosystem. Figure 2.10.1 illustrates how many foundation models reported using RLHF over time. In 2021, no newly released foundation models used RLHF. In 2022, seven models reported using RLHF, and in 2023, 16 models reported using RLHF. The rising popularity of RLHF is also evidenced by the fact that many leading LLMs report improving their models with RLHF (Figure 2.10.2). 今年,AI 指数跟踪了使用 RLHF 作为训练的基础模型数量的数据。 更具体地说,指数团队查阅了 CRFM 生态系统图中包含的所有模型的技术报告和其他文档,这是最全面的基础模型生态系统存储库之一。 图 2.10.1 展示了随着时间推移,有多少基础模型报告使用 RLHF。 在 2021 年,没有新发布的基础模型使用 RLHF。 到了 2022 年,有七个模型报告使用 RLHF,到了 2023 年,有 16 个模型报告使用 RLHF。 RLHF 的日益流行也表现在许多领先的LLMs报告使用 RLHF 改进他们的模型(图 2.10.2)。
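At the heart of RLHF is a reward model trained on human preference pairs. The minimal sketch below, assuming PyTorch and toy response embeddings, shows the pairwise loss such a reward model is typically trained with before it is used to fine-tune the language model with an RL algorithm such as PPO; it is not any lab's actual implementation.

```python
# Minimal sketch of the reward-model step in RLHF: score a preferred and a
# rejected response and minimize -log sigmoid(r_chosen - r_rejected).
# Toy features stand in for real response representations; assumes PyTorch.

import torch
import torch.nn as nn

torch.manual_seed(0)
reward_model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 1))
optimizer = torch.optim.Adam(reward_model.parameters(), lr=1e-3)

# Pretend embeddings of (prompt, response) pairs labeled by human annotators.
chosen = torch.randn(64, 16) + 0.5    # responses humans preferred
rejected = torch.randn(64, 16) - 0.5  # responses humans rejected

for step in range(200):
    r_chosen = reward_model(chosen)
    r_rejected = reward_model(rejected)
    loss = -torch.nn.functional.logsigmoid(r_chosen - r_rejected).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print(f"final preference loss: {loss.item():.3f}")
# The trained reward model then supplies the reward signal when the language
# model is fine-tuned with a policy-gradient method such as PPO.
```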
Figure 2.10.1 图 2.10.1
RLHF usage among foundation models 基金会模型中的 RLHF 使用情况
Source: Al Index, 2024 | Table: 2024 Al Index report 来源:Al Index,2024 | 表格:2024 年 Al Index 报告
Models covered include GPT-4, Llama 2, Claude-2, Gemini, and Mistral-7B.
Highlighted Research: 突出研究:
RLAIF
RLHF is a powerful method for aligning AI models but can be hindered by the time and labor required to generate human preference datasets for model alignment. As an alternative, Reinforcement Learning from AI Feedback (RLAIF) uses reinforcement learning based on the preferences of LLMs to align other AI models toward human preferences.
Recent research from Google Research compares RLAIF with RLHF, the traditional gold standard, to assess whether RLAIF can serve as a reliable substitute. The study finds that both RLAIF and RLHF are preferred over supervised fine-tuning (SFT) for summarization and helpfulness tasks, with no statistically significant difference in the degree to which each is preferred (Figure 2.10.3). Notably, in harmless dialogue generation tasks focused on producing the least harmful outputs, RLAIF (88%) surpasses RLHF (76%) in effectiveness (Figure 2.10.4). This research indicates that RLAIF could be a more resource-efficient and cost-effective approach for AI model alignment.
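The practical difference from RLHF is only where the preference labels come from. The sketch below shows an LLM standing in for the human annotator; query_llm is a stub rather than a real API call, and the judging prompt is an illustrative assumption.

```python
# Sketch of the RLAIF labeling step: an LLM, rather than a human annotator,
# chooses the preferred response; the resulting pairs feed the same
# reward-model / RL pipeline as RLHF. query_llm is a stub, not a real API.

def query_llm(prompt: str) -> str:
    """Placeholder for a call to an off-the-shelf labeler LLM."""
    return "1"  # stub: always prefers the first candidate

def ai_preference_label(prompt: str, response_a: str, response_b: str) -> dict:
    judge_prompt = (
        "You are rating two answers for helpfulness and harmlessness.\n"
        f"Question: {prompt}\n"
        f"Answer 1: {response_a}\n"
        f"Answer 2: {response_b}\n"
        "Reply with 1 or 2 for the better answer."
    )
    choice = query_llm(judge_prompt).strip()
    chosen, rejected = (response_a, response_b) if choice == "1" else (response_b, response_a)
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}

example = ai_preference_label(
    "How do I reset a forgotten password?",
    "Use the 'forgot password' link and follow the emailed instructions.",
    "Just guess until something works.",
)
print(example)
```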
RLAIF and RLHF vs. SFT baseline: win rate RLAIF 和 RLHF 对 SFT 基线: 胜率
Source: Lee et al., 2023 | Chart: 2024 Al Index report 来源:Lee 等人,2023 | 图表:2024 年 Al Index 报告
Harmless rate by policy 政策的无害率
Highlighted Research: 突出研究:
Direct Preference Optimization 直接偏好优化
As illustrated above, RLHF is a useful method for aligning LLMs with human preferences. However, RLHF requires substantial computational resources, involving the training of multiple language models and integrating LM policy sampling within training loops. This complexity can hinder its broader adoption. 如上所述,RLHF 是一种有用的方法,用于将 LLMs 与人类偏好对齐。然而,RLHF 需要大量的计算资源,涉及训练多个语言模型并在训练循环中集成 LM 策略抽样。这种复杂性可能会阻碍其更广泛的采用。
In response, researchers from Stanford and CZ Biohub have developed a new reinforcement learning algorithm for aligning models named 作为回应,斯坦福大学和 CZ Biohub 的研究人员开发了一种新的强化学习算法,用于对齐模型,名为
Direct Preference Optimization (DPO). DPO is simpler than RLHF but equally effective. The researchers show that DPO is as effective as other existing alignment methods, such as Proximal Policy Optimization (PPO) and Supervised FineTuning (SFT), on tasks like summarization (Figure 2.10.5). The emergence of techniques like DPO suggests that model alignment methods are becoming more straightforward and accessible. 直接偏好优化(DPO)。DPO 比 RLHF 更简单,但同样有效。研究人员表明,DPO 在像摘要(图 2.10.5)这样的任务上与其他现有的对齐方法(如 Proximal Policy Optimization (PPO) 和 Supervised FineTuning (SFT))一样有效。像 DPO 这样的技术的出现表明,模型对齐方法变得更加简单和易于访问。
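The sketch below, assuming PyTorch and toy sequence log-probabilities, shows the DPO objective: the policy is optimized directly on preference pairs via log-probability ratios against a frozen reference model, with no separate reward model or RL loop.

```python
# Sketch of the DPO objective: optimize the policy directly from preference
# pairs using log-probability ratios against a frozen reference model, with
# no reward model or RL loop. Log-probs here are toy tensors; assumes PyTorch.

import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta: float = 0.1):
    """-log sigmoid(beta * ((log-ratio of chosen) - (log-ratio of rejected)))."""
    chosen_ratio = policy_logp_chosen - ref_logp_chosen
    rejected_ratio = policy_logp_rejected - ref_logp_rejected
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()

# Toy sequence log-probabilities for a batch of 4 preference pairs.
policy_chosen = torch.tensor([-12.0, -15.0, -9.0, -20.0], requires_grad=True)
policy_rejected = torch.tensor([-14.0, -13.0, -11.0, -19.0], requires_grad=True)
ref_chosen = torch.tensor([-13.0, -15.5, -10.0, -20.5])
ref_rejected = torch.tensor([-13.5, -13.0, -10.5, -19.0])

loss = dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected)
loss.backward()  # gradients flow only through the policy log-probabilities
print(f"DPO loss: {loss.item():.3f}")
```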
Comparison of different algorithms on TL;DR summarization task across different sampling temperatures Source: Rafailov et al., 2023 | Table: 2024 Al Index report 在不同的采样温度下比较 TL;DR 摘要任务上的不同算法 源自:Rafailov 等人,2023 | 表:2024 年 Al 指数报告
Figure 2.10.5 图 2.10.5
This section focuses on research exploring critical properties of LLMs, such as their capacity for sudden behavioral shifts and self-correction in reasoning. It is important to highlight these studies to develop an understanding of how LLMs, which are increasingly representative of the frontier of Al research, operate and behave. 本节重点研究探索LLMs的关键特性,如其突然行为转变和推理中的自我修正能力。强调这些研究对于发展对越来越代表人工智能前沿的LLMs的运作和行为方式的理解是非常重要的。
2.11 Properties of LLMs 2.11 LLMs的特性
Highlighted Research: 突出研究:
Challenging the Notion of Emergent Behavior 挑战新兴行为的概念
Many papers have argued that LLMs exhibit emergent abilities, meaning they can unpredictably and suddenly display new capabilities at larger scales. This has raised concerns that even larger models could develop surprising, and perhaps uncontrollable, new abilities. 许多论文认为LLMs表现出新兴能力,意味着它们可以在更大的尺度上不可预测地突然展示新的能力。 这引发了担忧,即即使更大的模型也可能发展出令人惊讶,也许无法控制的新能力。
However, research from Stanford challenges this notion, arguing that the perceived emergence of new capabilities is often a reflection of the benchmarks used for evaluation rather than an inherent property of the models themselves. The researchers found that when nonlinear or discontinuous metrics like multiple-choice grading are used to evaluate models, emergent abilities seem more apparent. In contrast, when linear or continuous metrics are employed, these abilities largely vanish. Analyzing a suite of benchmarks from BIG-bench, a comprehensive LLM evaluation tool, the researchers noted emergent abilities on only five of the 39 benchmarks (Figure 2.11.1). These findings have important implications for AI safety and alignment research, as they challenge a prevailing belief that models will inevitably learn new, unpredictable behaviors as they scale.
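The toy simulation below illustrates the paper's core argument: if per-token accuracy improves smoothly with scale, a continuous metric shows smooth progress while a discontinuous exact-match metric appears to jump. The scaling curve used here is invented purely for illustration.

```python
# Toy simulation of the metric-choice argument: a smoothly improving per-token
# accuracy looks gradual under a continuous metric but can look like a sudden
# jump under a discontinuous one (exact match over many tokens). The scaling
# curve below is made up purely for illustration.

import numpy as np

model_scales = np.logspace(7, 11, 9)                       # pretend parameter counts
per_token_acc = 1 - 0.5 * (model_scales / 1e7) ** -0.35    # smooth improvement with scale
sequence_length = 20

exact_match = per_token_acc ** sequence_length             # all 20 tokens must be right

for scale, tok, em in zip(model_scales, per_token_acc, exact_match):
    print(f"{scale:14.0f} params | per-token acc {tok:.3f} | exact match {em:.3f}")
# Per-token accuracy rises gradually, but exact match stays near zero and then
# climbs rapidly, which can read as an "emergent" ability.
```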
Highlighted Research: 突出研究:
Challenging the Notion of Emergent Behavior (cont'd) 挑战新行为的观念(续)
Emergence score over all Big-bench tasks 所有 Big-bench 任务的 Emergence 得分
Source: Schaeffer et al., 2023 来源:Schaeffer 等人,2023 年
Highlighted Research: 突出研究:
Changes in LLM Performance Over Time 随时间变化的LLM性能变化
Publicly usable closed-source LLMs, such as GPT-4, Claude 2, and Gemini, are often updated over time by their developers in response to new data or user feedback. However, there is little research on how the performance of such models changes, if at all, in response to such updating. 公开可用的闭源LLMs,如 GPT-4、Claude 2 和 Gemini,通常会根据开发者对新数据或用户反馈的回应进行定期更新。然而,关于这类模型的性能如何随着更新而变化(如果有的话),目前很少有研究。
A study conducted at Stanford and Berkeley explores the performance of certain publicly usable LLMs over time and highlights that, in fact, their performance can significantly vary. More specifically, the study compared the March and June 2023 versions of GPT-3.5 and GPT-4 and demonstrated that performance declined on several tasks. For instance, the June version of GPT-4, compared to the March version, was 42 percentage points worse at generating code, 16 percentage points worse at answering sensitive questions, and 33 percentage points worse on certain mathematical tasks (Figure 2.11.2). The researchers also found that GPT-4's ability to follow instructions diminished over time, which potentially explains the broader performance declines. This research highlights that LLM performance can evolve over time and suggests that regular users should be mindful of such changes. 在斯坦福大学和伯克利进行的一项研究探讨了一些公开可用的LLMs随时间性能的变化,并强调实际上它们的性能可能会有显著差异。具体来说,该研究比较了 GPT-3.5 和 GPT-4 的 2023 年 3 月和 6 月版本,并表明在几项任务上性能下降。例如,与 3 月版本相比,GPT-4 的 6 月版本在生成代码方面差异达到 42 个百分点,在回答敏感问题方面差异达到 16 个百分点,在某些数学任务上差异达到 33 个百分点(图 2.11.2)。研究人员还发现,GPT-4 随时间推移遵循指令的能力下降,这可能解释了更广泛的性能下降。这项研究突显了LLM的性能可能随时间演变,并建议常规用户应注意这些变化。
Highlighted Research: 突出研究:
Changes in LLM Performance Over Time (cont'd) LLM性能随时间的变化(续)
Performance of the March 2023 and June 2023 versions of GPT-4 on eight tasks 2023 年 3 月和 2023 年 6 月版本的 GPT-4 在八项任务上的表现
Source: Chen et al., 2023 | Chart: 2024 Al Index report 来源:Chen 等人,2023 年 | 图表:2024 年 Al 指数报告
Figure 2.11.2 图 2.11.2
Highlighted Research: 突出研究:
LLMs Are Poor Self-Correctors LLMs 是自我纠正能力差的人
It is generally understood that LLMs like GPT-4 have reasoning limitations and can sometimes produce hallucinations. One proposed solution to such issues is self-correction, whereby LLMs identify and correct their own reasoning flaws. As Al's societal role grows, the concept of intrinsic self-correction-allowing LLMs to autonomously correct their reasoning without external guidanceis especially appealing. However, it is currently not well understood whether LLMs are in fact capable of this kind of self-correction. 人们普遍认为像 GPT-4 这样的 LLMs 存在推理能力的局限性,有时会产生幻觉。针对这些问题的一个提出的解决方案是自我纠正,即 LLMs 识别并纠正自己的推理缺陷。随着 AI 的社会角色不断增长,让 LLMs 能够自主纠正他们的推理而无需外部指导的内在自我纠正概念尤为吸引人。然而,目前尚不清楚 LLMs 是否实际上有能力进行这种自我纠正。
Researchers from DeepMind and the University of Illinois at Urbana-Champaign tested GPT-4's performance on three reasoning benchmarks: GSM8K (grade-school math), CommonSenseQA (common-sense reasoning), and HotpotQA (multidocument reasoning). They found that when the model was left to decide on self-correction without guidance, its performance declined across all tested benchmarks (Figure 2.11.3). DeepMind 和伊利诺伊大学厄巴纳-香槟分校的研究人员在三个推理基准测试中测试了 GPT-4 的性能:GSM8K(小学数学)、CommonSenseQA(常识推理)和 HotpotQA(多文档推理)。他们发现,当模型在没有指导的情况下决定自我纠正时,其在所有测试基准上的性能都下降了(图 2.11.3)。
GPT-4 on reasoning benchmarks with intrinsic self-correction GPT-4 在具有内在自我纠正的推理基准测试上
Source: Huang et al., 2023 | Chart: 2024 Al Index report 来源:Huang 等人,2023 年 | 图表:2024 年 Al 指数报告
Figure 2.11.3 图表 2.11.3
Closed vs. Open Model Performance 封闭模型与开放模型的性能
As LLMs become increasingly ubiquitous, debate intensifies over their varying degrees of accessibility. Some models such as Google's Gemini remain closed, accessible solely to their developers. In contrast, models like OpenAl's GPT-4 and Anthropic's Claude offer limited access, available publicly via an API. However, model weights are not fully released, which means the model cannot be independently modified by the public or further scrutinized. Conversely, Meta's Llama 2 and Stability Al's Stable Diffusion adopt an open approach, fully releasing their model weights. Open-source models can be modified and freely used by anyone. 随着LLMs变得越来越普遍,人们对其不同程度的易用性进行了激烈的辩论。一些模型,如谷歌的 Gemini,仍然是封闭的,只对其开发者可用。相比之下,OpenAl 的 GPT-4 和 Anthropic 的 Claude 等模型提供有限的访问权限,可通过 API 公开使用。然而,模型权重并未完全发布,这意味着公众无法独立修改该模型或进一步审查。相反,Meta 的 Llama 2 和 Stability Al 的 Stable Diffusion 采用开放的方式,完全发布了其模型权重。开源模型可以由任何人修改和自由使用。
Viewpoints differ on the merits of closed versus open Al models. Some argue in favor of open models, citing their ability to counteract market concentration, foster innovation, and enhance transparency within the ecosystem. Others contend that open-source models present considerable security risks, such as facilitating the creation of disinformation or bioweapons, and should therefore be approached with caution. 关于封闭与开放式 AI 模型的优点,观点不一。一些人支持开放模型,称赞其能够抵消市场集中、促进创新,并增强 生态系统内的透明度。另一些人认为开源模型存在相当大的安全风险,比如促进虚假信息或生物武器的制造,因此应该谨慎对待。
In the context of this debate, it is important to acknowledge that current evidence indicates a notable performance gap between open and closed models. Figures 2.11.4 and 2.11.5 juxtapose the performances of the top closed and open models on a selection of benchmarks. On all selected benchmarks, closed models outperform open ones. Across the 10 selected benchmarks, closed models hold a median performance advantage, with the gap smallest on mathematical tasks like GSM8K and largest on agentic tasks like AgentBench.
Score differentials of top closed vs. open models on select benchmarks 在选择基准测试中,顶尖封闭与开放模型的得分差异
Source: A I Index, 2024 | Table: 2024 Al Index report 源:A I 索引,2024 年 | 表格:2024 年 Al 指数报告
Benchmark                   Task category            Best closed model score   Best open model score
AgentBench                  Agent-based behavior     4.01                      0.96
Chatbot Arena Leaderboard   General language         1,252                     1,149
GPQA                        General reasoning        –                         –
GSM8K                       Mathematical reasoning   –                         –
HELM                        General language         0.96                      0.82
HumanEval                   Coding                   –                         –
MATH                        Mathematical reasoning   –                         –
MMLU                        General language         –                         –
MMMU                        General reasoning        –                         –
SWE-bench                   Coding                   –                         –
Figure 2.11.4 图 2.11.4
Performance of top closed vs. open models on select benchmarks 在选择基准测试中,顶级封闭模型与开放模型的性能
Source: Al Index, 2024 | Chart: 2024 Al Index report 数据来源:2024 年铝指数 | 图表来源:2024 年铝指数报告
Figure 2.11.5 图 2.11.5
2.12 Techniques for LLM Improvement 2.12 提高 LLM 的技术
Prompting 提示
Prompting, a vital aspect of the AI pipeline, entails supplying a model with natural language instructions that describe tasks the model should execute. 提示是 AI 管道中至关重要的一个方面,它涉及向模型提供自然语言指令,描述模型应执行的任务。
Mastering the art of crafting effective prompts significantly enhances the performance of LLMs without requiring that models undergo underlying improvements. 掌握制作有效提示的技巧显著提高了LLMs的性能,而无需让模型经历基础改进。
Highlighted Research: 突出研究:
Graph of Thoughts Prompting 思维图激发
Chain of Thought (CoT) and Tree of Thoughts (ToT) are prompting methods that can improve the performance of LLMs on reasoning tasks. In 2023, European researchers introduced another prompting method, Graph of Thoughts (GoT), which has also shown promise (Figure 2.12.1). GoT enables LLMs to model their thoughts in a more flexible, graph-like structure that more closely mirrors actual human reasoning. The researchers then designed a model architecture to implement GoT and found that, compared to ToT, it increased the quality of outputs on a sorting task while reducing cost by around 31% (Figure 2.12.2).
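The toy sketch below conveys the flavor of the approach on a sorting task: intermediate thoughts are split, transformed independently, and then aggregated in a merge node. The llm_sort stub stands in for actual model calls and is not the authors' implementation.

```python
# Toy sketch of the Graph-of-Thoughts idea for a sorting task: split the
# problem into sub-thoughts, let an "LLM" transform each, then aggregate the
# partial results in a merge node. The llm_sort stub stands in for model calls.

import heapq

def llm_sort(numbers: list) -> list:
    """Placeholder for prompting an LLM to sort a short list."""
    return sorted(numbers)  # stub so the sketch runs

def graph_of_thoughts_sort(numbers: list, branches: int = 4) -> list:
    # 1) Split node: break the input into smaller sub-problems (thoughts).
    chunk = max(1, len(numbers) // branches)
    thoughts = [numbers[i:i + chunk] for i in range(0, len(numbers), chunk)]
    # 2) Transform nodes: each thought is refined independently.
    sorted_thoughts = [llm_sort(t) for t in thoughts]
    # 3) Aggregate node: merge the partial solutions into one answer.
    return list(heapq.merge(*sorted_thoughts))

data = [9, 3, 7, 1, 8, 2, 6, 5, 4, 0]
print(graph_of_thoughts_sort(data))
```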
Graph of Thoughts (GoT) reasoning flow Source: Besta et al., 2023 思维图(GoT)推理流程 来源:Besta 等人,2023
Highlighted Research: 突出研究:
Graph of Thoughts Prompting (cont'd) 思维引发图表(续)
Number of errors in sorting tasks with ChatGPT-3.5 ChatGPT-3.5 在排序任务中的错误数量
Source: Besta et al., 2023 | Chart: 2024 Al Index report 来源:Besta 等,2023 年 | 图表:2024 年 Al 指数报告
Figure 2.12.2 图 2.12.2
Highlighted Research: 突出研究:
Optimization by PROmpting (OPRO) 通过提示进行优化(OPRO)
A paper from DeepMind has introduced DeepMind 的一篇论文介绍了
Optimization by PROmpting (OPRO), a method that uses LLMs to iteratively generate prompts to improve algorithmic performance. OPRO uses natural language to guide LLMs in creating new prompts based on problem descriptions and previous solutions (Figure 2.12.3). The generated prompts aim to enhance the performance of AI systems on particular benchmarks. Compared to other prompting approaches like "let's think step by step" or an empty starting point, OPRO leads to significantly greater accuracy on virtually all 23 BIG-bench Hard tasks (Figure 2.12.4).
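A schematic of the optimization loop is sketched below: previous instructions and their scores are shown to an optimizer LLM, which proposes a new candidate. Both LLM calls and the scorer are stubs, not the paper's setup.

```python
# Schematic of the OPRO loop: show the optimizer LLM past (prompt, score)
# pairs and ask it to propose a better instruction, then keep the best.
# Both LLM calls and the scoring task are stubs, not the paper's setup.

def optimizer_llm(meta_prompt: str) -> str:
    """Placeholder for the LLM that proposes a new candidate instruction."""
    return "Let's work through the problem step by step and check the result."

def evaluate_prompt(instruction: str) -> float:
    """Placeholder scorer: run the instruction on a benchmark and return accuracy."""
    return min(1.0, 0.4 + 0.05 * len(instruction.split()))  # toy heuristic

history = [("Solve the problem.", evaluate_prompt("Solve the problem."))]
for step in range(5):
    trajectory = "\n".join(f"text: {p} | score: {s:.2f}" for p, s in history)
    meta_prompt = (
        "Here are previous instructions with their training accuracy:\n"
        f"{trajectory}\n"
        "Write a new instruction that is different and achieves a higher score."
    )
    candidate = optimizer_llm(meta_prompt)
    history.append((candidate, evaluate_prompt(candidate)))

best_prompt, best_score = max(history, key=lambda item: item[1])
print(best_prompt, best_score)
```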
Sample OPRO prompts and optimization progress 示例 OPRO 提示和优化进展
Source: Yang et al., 2023 来源:杨等人,2023 年
Figure 2.12.3 图 2.12.3
"Let's think carefully about the problem and solve it together." at Step 2 with the training accuracy 63.2 ; "让我们仔细思考问题,并一起解决。" 在第 2 步,训练准确率为 63.2;
"Let's break it down!" at Step 4 with training accuracy 71.3; "让我们把它分解开!" 在第 4 步,训练准确率为 71.3;
"Let's calculate our way to the solution!" at Step 5 with training accuracy 73.9; "让我们计算出解决方案的方法!" 在第 5 步,训练准确率为 73.9;
"Let's do the math!" at Step 6 with training accuracy 78.2. "让我们来做数学运算吧!" 在第 6 步时,训练准确率为 78.2。
Accuracy difference on 23 BIG-bench Hard (BBH) tasks using PaLM 2-L scorer Source: Yang et al., 2023 | Chart: 2024 Al Index report 使用 PaLM 2-L 评分器在 23 个 BIG-bench Hard(BBH)任务上的准确度差异 来源:Yang 等人,2023 年 | 图表:2024 年 Al 指数报告
Figure 2.12.4 图 2.12.4
Fine-Tuning 微调
Fine-tuning has grown increasingly popular as a method of enhancing LLMs and involves further training or adjusting models on smaller datasets. 微调作为一种增强LLMs的方法越来越受欢迎,它涉及在较小的数据集上进一步训练或调整模型。
Fine-tuning not only boosts overall model performance but also sharpens the model's capabilities on specific tasks. It also allows for more precise control over the model's behavior. 微调不仅提升了整体模型性能,还增强了模型在特定任务上的能力。它还允许更精确地控制模型的行为。
QLoRA, a fine-tuning method introduced in 2023, increases efficiency with techniques like 4-bit NormalFloat (NF4) quantization, double quantization, and paged optimizers. QLoRA was used to train a model named Guanaco, which matched or even surpassed models like ChatGPT in performance on the Vicuna benchmark (a benchmark that ranks the outputs of LLMs) (Figure 2.12.5). Remarkably, the Guanaco models were created with just 24 hours of fine-tuning on a single GPU. QLoRA highlights how methods for optimizing and further improving models have become more efficient, meaning fewer resources will be required to make increasingly capable models.
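For a concrete picture, the sketch below shows what a QLoRA-style setup can look like with the Hugging Face transformers, peft, and bitsandbytes libraries: load the frozen base model in 4-bit NF4 with double quantization, then attach small trainable LoRA adapters. The model name and hyperparameters are illustrative placeholders, and this is a sketch under those assumptions rather than the authors' training recipe.

```python
# Hedged sketch of a QLoRA-style setup using the Hugging Face transformers,
# peft, and bitsandbytes stack. Model name and hyperparameters are
# illustrative placeholders.

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # store frozen weights in 4 bits
    bnb_4bit_quant_type="nf4",              # 4-bit NormalFloat data type
    bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bf16
)

base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",             # placeholder base model
    quantization_config=quant_config,
    device_map="auto",
)

lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],    # adapters on attention projections
    task_type="CAUSAL_LM",
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # only the small adapter weights are trained
```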
Model competitions based on 10,000 simulations using GPT-4 and the Vicuna benchmark 基于 GPT-4 和维库纳基准测试的 10,000 次模拟的模型竞赛
Source: Dettmers et al., 2023 | Chart: 2024 Al Index report 来源:Dettmers 等人,2023 年 | 图表:2024 年 Al 指数报告
Figure 2.12.5 图 2.12.5
Attention 注意力
LLMs can flexibly handle various tasks but often demand substantial computational resources to train. As previously noted, high training costs can hinder LLMs 可以灵活处理各种任务,但通常需要大量的计算资源来训练。正如前面所指出的,高昂的训练成本可能会阻碍
Al's broader adoption. Optimization methods aim to enhance Al's efficiency by, for example, improving memory usage, thereby making LLMs more accessible and practical. 人工智能的更广泛应用。优化方法的目标是通过改善内存使用等方式来提高人工智能的效率,从而使得LLMs更加易于使用和实用。
Highlighted Research: 突出的研究:
Flash-Decoding 快速解码
Flash-Decoding, developed by Stanford researchers, tackles inefficiency in traditional LLMs by speeding up the attention mechanism, particularly in tasks requiring long sequences. It achieves this by parallelizing the loading of keys and values, then separately rescaling and combining them to maintain correct attention outputs (Figure 2.12.6). In various tests, Flash-Decoding outperforms other leading methods like PyTorch Eager and FlashAttention-2, showing much faster inference: For example, with a batch size of 256 and a sequence length of 256, Flash-Decoding is 48 times faster than PyTorch Eager and six times faster than FlashAttention-2 (Figure 2.12.7). Serving models like ChatGPT carries a nontrivial cost per response, which can become highly expensive when such models are deployed to millions of users. Innovations like Flash-Decoding are critical for reducing inference costs in AI.
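The numpy sketch below illustrates the underlying split-KV idea: attention over the key/value cache is computed in independent chunks (which can run in parallel), and the partial softmax results are rescaled and combined so that the output matches ordinary attention. Shapes and sizes are toy values, not a reimplementation of the Stanford kernel.

```python
# Numpy sketch of the split-KV trick behind Flash-Decoding: compute attention
# over chunks of the key/value cache independently, then rescale and combine
# the partial softmax results so the final output matches full attention.

import numpy as np

rng = np.random.default_rng(0)
d, seq_len, n_chunks = 64, 1024, 4
q = rng.standard_normal(d)
K = rng.standard_normal((seq_len, d))
V = rng.standard_normal((seq_len, d))

def full_attention(q, K, V):
    scores = K @ q / np.sqrt(d)
    weights = np.exp(scores - scores.max())
    return (weights[:, None] * V).sum(0) / weights.sum()

def split_kv_attention(q, K, V, n_chunks):
    partials = []
    for K_c, V_c in zip(np.array_split(K, n_chunks), np.array_split(V, n_chunks)):
        scores = K_c @ q / np.sqrt(d)
        m = scores.max()                                   # chunk-local max
        w = np.exp(scores - m)
        partials.append((m, w.sum(), (w[:, None] * V_c).sum(0)))
    m_global = max(m for m, _, _ in partials)
    denom = sum(l * np.exp(m - m_global) for m, l, _ in partials)
    numer = sum(o * np.exp(m - m_global) for m, _, o in partials)
    return numer / denom

assert np.allclose(full_attention(q, K, V), split_kv_attention(q, K, V, n_chunks))
print("chunked attention matches the full computation")
```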
Highlighted Research: 突出研究:
Flash-Decoding (cont'd) 闪存解码(续)
Performance comparison of multihead attention algorithms across batch sizes and sequence lengths Source: Dao et al., 2023 | Chart: 2024 Al Index report 跨批次大小和序列长度的多头注意力算法性能比较 来源:Dao 等人,2023 年 | 图表:2024 年 Al 指数报告
Figure 2.12.7 图 2.12.7
This section examines trends in the environmental impact of AI systems, highlighting the evolving landscape of transparency and awareness. Historically, model developers seldom disclosed the carbon footprint of their AI systems, leaving researchers to make their best estimates. Recently, there has been a shift toward greater openness, particularly regarding the carbon costs of training models. However, disclosure of the environmental costs associated with inference, a potentially more significant concern, remains insufficient. This section presents data on carbon emissions as reported by developers, in addition to featuring notable research exploring the intersection of AI and environmental impact. With models growing in size and becoming more widely used, it has never been more critical for the AI research community to diligently monitor and mitigate the environmental effects of AI systems.
2.13 Environmental Impact of Al Systems 2.13 人工智能系统的环境影响
General Environmental Impact 一般环境影响
Training 训练
Figure 2.13.1 presents the carbon emissions (in tonnes) released by select LLMs during their training, compared with human reference points. Emissions data for models marked with an asterisk were estimated by independent researchers because they were not disclosed by the models' developers.
Emission data varies widely. For instance, Meta's Llama 2 70B model released approximately 291.2 tonnes of carbon, which is nearly 291 times more than the emissions released by one traveler on a round-trip flight from New York to San Francisco, and roughly 16 times the amount of annual carbon emitted by an average American in one year. However, the emissions from Llama 2 are still less than the 502 tonnes reportedly released during the training of OpenAl's GPT-3. 排放数据差异很大。例如,Meta 的 Llama 2 70B 型号释放了大约 291.2 吨碳,这几乎是纽约到旧金山往返航班上的一名旅客释放的排放量的 291 倍,大约是一个普通美国人一年排放的碳量的 16 倍。 然而,Llama 2 的排放量仍然少于据报道在 OpenAl 的 GPT-3 培训期间释放的 502 吨。
CO2 equivalent emissions (tonnes) by select machine learning models and real-life examples, 2020-23 选择的机器学习模型和现实示例的 CO2 当量排放量(吨),2020-23
Source: Al Index, 2024; Luccioni et al., 2022; Strubell et al., 2019 | Chart: 2024 Al Index report 来源:2024 年 Al 指数;Luccioni 等人,2022 年;Strubell 等人,2019 年 | 图表:2024 年 Al 指数报告
The variance in emission estimates is due to factors such as model size, data center energy efficiency, and the carbon intensity of energy grids. Figure 2.13.2 shows the emissions of select models in relation to their size. Generally, larger models emit more carbon, a trend clearly seen in the Llama 2 model series, which were all trained on the same supercomputer (Meta's Research 排放量估计的差异是由于诸如模型大小、数据中心能效和能源网格的碳强度等因素。图 2.13.2 显示了选择模型的排放量与其大小的关系。一般来说,更大的模型会排放更多的碳,这一趋势在 Llama 2 模型系列中清晰可见,这些模型都是在同一台超级计算机上训练的(Meta's Research)。
Super Cluster). However, smaller models can still have high emissions if trained on energy grids powered by less efficient energy sources. Some estimates suggest that model emissions have declined over time, which is presumably tied to increasingly efficient mechanisms of model training. Figure 2.13.3 features the emissions of select models along with their power consumption. 超级集群) 。然而,如果在由效率低的能源驱动的能源网格上进行训练,较小的模型仍可能具有较高的排放量。一些估计表明,模型排放量随着时间的推移而下降,这可能与模型训练的机制日益高效有关。图 2.13.3 显示了部分模型的排放量及其能耗。
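A rough sketch of the usual estimation approach follows: reported power consumption is multiplied by the carbon intensity of the grid that powered training. The intensity values used below are illustrative assumptions, not the figures behind the estimates above.

```python
# Back-of-the-envelope sketch of how training emissions are typically
# estimated: reported power consumption multiplied by the carbon intensity of
# the local grid. The intensity values below are rough illustrative numbers.

def training_emissions_tonnes(power_mwh: float, grid_kg_co2_per_kwh: float) -> float:
    kwh = power_mwh * 1_000
    return kwh * grid_kg_co2_per_kwh / 1_000  # kg -> tonnes

# The same energy budget yields very different footprints depending on the grid.
for grid, intensity in [("low-carbon grid", 0.05), ("coal-heavy grid", 0.8)]:
    print(grid, training_emissions_tonnes(400, intensity), "tonnes CO2e")
```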
CO2 equivalent emissions (tonnes) and number of parameters by select machine learning models 选择的机器学习模型的 CO2 当量排放量(吨)和参数数量
Source: Al Index, 2024; Luccioni et al., 2022 | Chart: 2024 Al Index report 来源:Al Index,2024;Luccioni 等,2022 | 图表:2024 Al Index 报告
CO2 equivalent emissions (log scale - tonnes) 二氧化碳当量排放量(对数刻度 - 吨)
Figure 2.13.2 图 2.13.2
Environmental impact of select models 选定车型的环境影响
Source: Al Index, 2024; Luccioni et al., 2022 | Table: 2024 Al Index report 来源:Al 指数,2024 年;Luccioni 等人,2022 年 | 表格:2024 年 Al 指数报告
Model (parameters)        Year   Power consumption (MWh)   CO2 equivalent emissions (tonnes)
Gopher (280B)             2021   1,066                     352
BLOOM (176B)              2022   433                       25
GPT-3 (175B)              2020   1,287                     502
OPT (175B)                2022   324                       70
Llama 2 (70B)             2023   400                       291.42
Llama 2 (34B)             2023   350                       153.90
Llama 2 (13B)             2023   400                       62.44
Llama 2 (7B)              2023   400                       31.22
Granite (13B)             2023   153                       22.23
Starcoder (15.5B)         2023   89.67                     16.68
Luminous Base (13B)       2023   33                        3.17
Luminous Extended (30B)   2023   93                        11.95
Figure 2.13.3 图 2.13.3
A major challenge in evaluating the environmental impacts of AI models is a lack of transparency about emissions. Consistent with findings from other studies, most prominent model developers do not report carbon emissions, hampering efforts to conduct thorough and accurate evaluations of this metric. For example, many prominent model developers such as OpenAI, Google, Anthropic, and Mistral do not report emissions from training, although Meta does.
Inference 推理
As highlighted earlier, the environmental impact of model inference remains far less thoroughly disclosed than that of training, even though it is potentially the more significant concern.
Carbon emissions by task during model inference 模型推断期间任务的碳排放量
Source: Luccioni et al., 2023 | Chart: 2024 Al Index report 来源:Luccioni 等人,2023 年 | 图表:2024 年 Al 指数报告
Figure 2.13.4 图 2.13.4
Positive Use Cases 正面使用案例
Despite the widely recognized environmental costs of training Al systems, Al can contribute positively to environmental sustainability. Figure 2.13.5 showcases a variety of recent cases where Al supports environmental efforts. These applications include enhancing thermal energy system management, improving pest control strategies, and boosting urban air quality. 尽管普遍认为训练人工智能系统存在环境成本,但人工智能可以积极促进环境可持续性。图 2.13.5 展示了人工智能支持环境工作的各种最新案例。 这些应用包括增强热能系统管理,改进虫害控制策略以及提升城市空气质量。
Positive Al environmental use cases 积极的铝环境使用案例
Source: Fang et al., 2024 | Table: 2024 Al Index report 来源:方等人,2024 年 | 表格:2024 年铝指数报告
Use case                                        AI contribution
Management of thermal energy storage systems    Anticipating thermal energy needs and managing thermal energy storage systems.
Improving waste management                      Saving time and costs in waste-to-energy conversion, waste sorting, and waste monitoring.
More efficiently cooling buildings              Optimizing the energy usage associated with air-conditioning.
Improving pest management                       Identifying and eliminating pests in commercial tomato harvests.
Enhancing urban air quality                     Forecasting and predicting air quality in urban cities.
Preview 预览
Overview
Chapter Highlights
3.1 Assessing Responsible AI
Responsible AI Definitions
AI Incidents
Examples
Risk Perception
Risk Mitigation
Overall Trustworthiness
Benchmarking Responsible AI
Tracking Notable RAI Benchmarks
Reporting Consistency
3.2 Privacy and Data Governance
Current Challenges
Privacy and Data Governance in Numbers
Academia
Industry
Featured Research
Extracting Data From LLMs
Foundation Models and Verbatim Generation
Auditing Privacy in AI Models
3.3 Transparency and Explainability
Current Challenges
Transparency and Explainability in Numbers
Academia
Industry
Featured Research
The Foundation Model Transparency Index
Neurosymbolic Artificial Intelligence (Why, What, and How)
CHAPTER 3: Responsible AI
3.4 Security and Safety
Current Challenges
AI Security and Safety in Numbers
Academia
Industry
Featured Research
Do-Not-Answer: A New Open Dataset for Comprehensive Benchmarking of LLM Safety Risks
Universal and Transferable Attacks on Aligned Language Models
MACHIAVELLI Benchmark
3.5 Fairness
Current Challenges
Fairness in Numbers
Academia
Industry
Featured Research
(Un)Fairness in AI and Healthcare
Social Bias in Image Generation Models
Measuring Subjective Opinions in LLMs
Impact of AI on Political Processes
ACCESS THE PUBLIC DATA 访问公共数据
Overview 概述
AI is increasingly woven into nearly every facet of our lives. This integration is occurring in sectors such as education, finance, and healthcare, where critical decisions are often based on algorithmic insights. This trend promises to bring many advantages; however, it also introduces potential risks. Consequently, in the past year, there has been a significant focus on the responsible development and deployment of AI systems. The AI community has also become more concerned with assessing the impact of AI systems and mitigating risks for those affected.
This chapter explores key trends in responsible AI by examining metrics, research, and benchmarks in four key responsible AI areas: privacy and data governance, transparency and explainability, security and safety, and fairness. Given that 4 billion people are expected to vote globally in 2024, this chapter also features a special section on AI and elections and more broadly explores the potential impact of AI on political processes.
Chapter Highlights 章节亮点
1. Robust and standardized evaluations for LLM responsibility are seriously lacking. New research from the AI Index reveals a significant lack of standardization in responsible AI reporting. Leading developers, including OpenAI, Google, and Anthropic, primarily test their models against different responsible AI benchmarks. This practice complicates efforts to systematically compare the risks and limitations of top AI models.
Political deepfakes are easy to generate and difficult to detect. Political deepfakes are already affecting elections across the world, with recent research suggesting that existing Al deepfake detection methods perform with varying levels of accuracy. In addition, new projects like CounterCloud demonstrate how easily AI can create and disseminate fake content. 政治深度伪造视频容易生成,难以检测。政治深度伪造视频已经在全球范围内影响选举,最近的研究表明现有的人工智能深度伪造检测方法的准确性各不相同。此外,像 CounterCloud 这样的新项目展示了人工智能如何轻松地创建和传播虚假内容。
Researchers discover more complex vulnerabilities in LLMs. Previously, most efforts to red team Al models focused on testing adversarial prompts that intuitively made sense to humans. This year, researchers found less obvious strategies to get LLMs to exhibit harmful behavior, like asking the models to infinitely repeat random words. 研究人员在LLMs中发现了更复杂的漏洞。以前,对 Al 模型进行红队测试的大部分工作都集中在测试对人类直观有意义的对抗性提示上。今年,研究人员发现了一些不太明显的策略,可以使LLMs表现出有害行为,例如要求模型无限重复随机单词。
Risks from AI are a concern for businesses across the globe. A global survey on responsible AI highlights that companies' top AI-related concerns include privacy, security, and reliability. The survey shows that organizations are beginning to take steps to mitigate these risks. However, globally, most companies have so far only mitigated a portion of these risks.
LLMs can output copyrighted material. Multiple researchers have shown that the generative outputs of popular LLMs may contain copyrighted material, such as excerpts from The New York Times or scenes from movies. Whether such output constitutes copyright violations is becoming a central legal question. LLMs可以输出受版权保护的材料。多位研究人员已经证明,流行LLMs的生成输出可能包含受版权保护的材料,例如《纽约时报》的摘录或电影场景。这种输出是否构成侵犯版权的问题正逐渐成为一个核心的法律问题。
Al developers score low on transparency, with consequences for research. The newly introduced Foundation Model Transparency Index shows that AI developers lack transparency, especially regarding the disclosure of training data and methodologies. This lack of openness hinders efforts to further understand the robustness and safety of Al systems. 人工智能开发者在透明度方面得分较低,对研究造成影响。新引入的基础模型透明度指数显示,人工智能开发者缺乏透明度,特别是在披露训练数据和方法论方面。这种缺乏开放性阻碍了进一步了解人工智能系统的稳健性和安全性的努力。
Chapter Highlights (cont'd) 章节亮点(续)
Extreme Al risks are difficult to analyze. Over the past year, a substantial debate has emerged among Al scholars and practitioners regarding the focus on immediate model risks, like algorithmic discrimination, versus potential long-term existential threats. It has become challenging to distinguish which claims are scientifically founded and should inform policymaking. This difficulty is compounded by the tangible nature of already present short-term risks in contrast with the theoretical nature of existential threats. 极端 AI 风险很难分析。在过去的一年中,AI 学者和从业者之间出现了一场关于立即模型风险(如算法歧视)与潜在的长期存在威胁的辩论。区分哪些声明在科学上有根据并应该影响政策制定已变得具有挑战性。这一困难被已经存在的短期风险的具体性质与存在威胁的理论性质所加剧。
The number of Al incidents continues to rise. According to the Al Incident Database, which tracks incidents related to the misuse of Al, 123 incidents were reported in 2023, a 32.3% increase from 2022. Since 2013, Al incidents have grown by over twentyfold. A notable example includes Al-generated, sexually explicit deepfakes of Taylor Swift that were widely shared online. Al 事件数量持续上升。根据跟踪与 Al 滥用相关事件的 Al 事件数据库显示,2023 年报告了 123 起事件,比 2022 年增长了 32.3%。自 2013 年以来,Al 事件增长了超过二十倍。一个显著的例子是泰勒·斯威夫特的 Al 生成的深度伪造色情内容在网上广泛传播。
ChatGPT is politically biased. Researchers find a significant bias in ChatGPT toward Democrats in the United States and the Labour Party in the U.K. This finding raises concerns about the tool's potential to influence users' political views, particularly in a year marked by major global elections. ChatGPT 在政治上存在偏见。研究人员发现 ChatGPT 在美国偏向民主党和英国工党。这一发现引发了对该工具可能影响用户政治观点的担忧,尤其是在一个以重大全球选举为特征的年份。
This chapter begins with an overview of key trends in responsible AI (RAI). In this section the Al Index defines key terms in responsible Al: privacy, data governance, transparency, explainability, fairness, as well as security and safety. Next, this section looks at Al-related incidents and explores how industry actors perceive Al risk and adopt Al risk mitigation measures. Finally, the section profiles metrics pertaining to the overall trustworthiness of AI models and comments on the lack of standardized responsible Al benchmark reporting. 本章从负责任的人工智能(RAI)的关键趋势概述开始。在本节中,Al 指数定义了负责任 Al 中的关键术语:隐私、数据治理、透明度、可解释性、公平性,以及安全性和安全性。接下来,本节将研究 Al 相关事件,并探讨行业参与者如何看待 Al 风险并采取 Al 风险缓解措施。最后,本节概述了与 AI 模型整体可信度相关的指标,并评论了缺乏标准化的负责任 Al 基准报告。
3.1 Assessing Responsible Al 3.1 评估负责任的人工智能
Responsible Al Definitions 负责任的人工智能定义
In this chapter, the AI Index explores four key dimensions of responsible AI: privacy and data governance, transparency and explainability, security and safety, and fairness. Other dimensions of responsible AI, such as sustainability and reliability, are discussed elsewhere in the report. Figure 3.1.1 offers definitions for the responsible AI dimensions addressed in this chapter, along with an illustrative example of how these dimensions might be practically relevant. The "Example" column examines a hypothetical platform that employs AI to analyze medical patient data for personalized treatment recommendations, and demonstrates how issues like privacy and transparency could be relevant.
Responsible Al dimensions, definitions, and examples 负责 Al 尺寸、定义和示例
Source: Al Index, 2024 来源:Al 索引,2024
Data governance
Definition: Establishment of policies, procedures, and standards to ensure the quality, security, and ethical use of data, which is crucial for accurate, fair, and responsible AI operations, particularly with sensitive or personally identifiable information.
Example: Policies and procedures are in place to maintain data quality and security, with a particular focus on ethical use and consent, especially for sensitive health information.

Explainability
Definition: The capacity to comprehend and articulate the rationale behind AI decisions, emphasizing the importance of AI being not only transparent but also understandable to users and stakeholders.
Example: The platform can articulate the rationale behind its treatment recommendations, making these insights understandable to doctors and patients, ensuring trust in its decisions.

Fairness
Definition: Creating algorithms that are equitable, avoiding bias or discrimination, and considering the diverse needs and circumstances of all stakeholders, thereby aligning with broader societal standards of equity.
Example: The platform is designed to avoid bias in treatment recommendations, ensuring that patients from all backgrounds receive equitable care.

Privacy
Definition: An individual's right to confidentiality, anonymity, and protection of their personal data, including the right to consent and be informed about data usage, coupled with an organization's responsibility to safeguard these rights when handling personal data.
Example: Patient data is handled with strict confidentiality, ensuring anonymity and protection. Patients consent to whether and how their data is used to train a treatment recommendation system.

Security and safety
Definition: The integrity of AI systems against threats, minimizing harms from misuse, and addressing inherent safety risks like reliability concerns and the potential dangers of advanced AI systems.
Example: Measures are implemented to protect against cyber threats and ensure the system's reliability, minimizing risks from misuse or inherent system errors, thus safeguarding patient health and data.

Transparency
Definition: Open sharing of development choices, including data sources and algorithmic decisions, as well as how systems are deployed, monitored, and managed, covering both the creation and operational phases.
Example: The development choices, including data sources and algorithmic design decisions, are openly shared. How the system is deployed and monitored is clear to healthcare providers and regulatory bodies.
Figure 3.1.1 图 3.1.1
1 Although Figure 3.1.1 breaks down various dimensions of responsible AI into specific categories to improve definitional clarity, this chapter organizes these dimensions into the following broader categories: privacy and data governance, transparency and explainability, security and safety, and fairness.
Al Incidents AI 事件
The AI Incident Database (AIID) tracks instances of ethical misuse of AI, such as autonomous cars causing pedestrian fatalities or facial recognition systems leading to wrongful arrests. As depicted in Figure 3.1.2, the number of AI incidents continues to climb annually. In 2023, 123 incidents were reported, a 32.3% increase from 2022. Since 2013, AI incidents have grown more than twentyfold.
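These growth figures can be cross-checked with simple arithmetic; the snippet below is a back-of-the-envelope illustration using only the figures quoted above.

```python
# Back-of-the-envelope check on the reported AIID figures (illustrative only).
incidents_2023 = 123          # reported incidents in 2023
growth_vs_2022 = 0.323        # reported 32.3% year-over-year increase

incidents_2022 = incidents_2023 / (1 + growth_vs_2022)
print(f"Implied 2022 count: ~{incidents_2022:.0f}")       # roughly 93 incidents

# "More than twentyfold since 2013" implies a 2013 baseline of at most ~6 incidents.
print(f"Implied 2013 ceiling: <= {incidents_2023 / 20:.1f}")
```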
The continuous increase in reported incidents likely arises both from greater integration of AI into real-world applications and from heightened awareness of its potential for ethical misuse. However, it is important to note that as awareness grows, incident tracking and reporting also improve, which suggests that earlier incidents may have been underreported.
Number of reported Al incidents, 2012-23 报告的人工智能事件数量,2012-23 年
Examples 例子
The next section details recent AI incidents to shed light on the ethical challenges commonly linked with AI.
Al-generated nude images of Taylor Swift 泰勒·斯威夫特的 Al 生成的裸体图片
In January 2024, sexually explicit, AI-generated images purportedly depicting Taylor Swift surfaced on X (formerly Twitter). These images remained live for 17 hours, amassing over 45 million views before they were removed. Generative AI models can effortlessly extrapolate from training data, which often include nude images and celebrity photographs, to produce nude images of celebrities, even when images of the targeted celebrity are absent from the original dataset. Filters are in place that aim to prevent such content creation; however, these filters can usually be circumvented with relative ease.
Unsafe behavior of fully self-driving cars 全自动驾驶汽车的不安全行为
Recent reports have surfaced about a Tesla in Full Self-Driving mode that detected a pedestrian on a crosswalk in San Francisco but failed to decelerate and allow the pedestrian to cross the street safely (Figure 3.1.3). Unlike other developers of (partially) automated driving systems, who limit the use of their software to specific settings such as highways, Tesla permits the use of their beta software on regular streets. This incident is one of several alleged cases of unsafe driving behavior by cars in Full Self-Driving mode. In November 2022, a Tesla was involved in an eight-car collision after abruptly braking. Another crash involving a Tesla is under investigation for potentially being the first fatality caused by Full Self-Driving mode. 最近有报道称,一辆特斯拉处于全自动驾驶模式,在旧金山的一个人行横道上检测到一名行人,但未能减速并让行人安全过马路(图 3.1.3)。与其他(部分)自动驾驶系统的开发者不同,他们将软件的使用限制在特定设置,如高速公路,特斯拉允许其测试版软件在普通街道上使用。这起事件是全自动驾驶模式下汽车不安全驾驶行为的几起被指控案件之一。2022 年 11 月,一辆特斯拉在突然刹车后卷入了一起八车相撞事故。另一起涉及特斯拉的事故正在调查中,可能是由全自动驾驶模式引起的首例致命事故。
Privacy concerns with romantic Al chatbots 浪漫的人工智能聊天机器人引发的隐私问题
Romantic AI chatbots are meant to resemble a lover or friend, to listen attentively, and to be a companion for their users (Figure 3.1.4). In this process, they end up collecting significant amounts of private and sensitive information. Researchers from the Mozilla Foundation reviewed 11 romantic chatbots for privacy risks and found that these chatbots collect excessive personal data, can easily be misused, and offer inadequate data protection measures. For example, the researchers found that the privacy policy by Crushon.AI states that it "may collect extensive personal and even health-related information from you like your 'sexual health information,' '[u]se of prescribed medication,' and '[g]ender-affirming care information.'" The researchers further discussed privacy concerns associated with
Tesla recognizing pedestrian but not slowing down at a crosswalk 特斯拉识别行人但未在人行横道减速
Source: Gitlin, 2023
Figure 3.1.3 图 3.1.3
Romantic chatbot generated by DALL-E DALL-E 生成的浪漫聊天机器人
Source: Al Index, 2024 源:Al 指数,2024
Figure 3.1.4 图 3.1.4
romantic chatbots and highlighted how the services, despite being marketed as empathetic companions, are not transparent about their operation and data handling. 浪漫 聊天机器人并强调这些服务,尽管被宣传为富有同情心的伴侣,但在运作和数据处理方面并不透明。
Risk Perception 风险感知
In collaboration with Accenture, a team of Stanford researchers this year ran a global survey with respondents from more than 1,000 organizations to assess the global state of responsible AI. The organizations, with total revenues of at least million each, were drawn from 20 countries and 19 industries and responded in February-March 2024. The objective of the Global State of Responsible AI survey was to gain an understanding of the challenges of adopting responsible AI practices and to allow for a comparison of responsible AI activities across 10 dimensions and across surveyed industries and regions.
Respondents were asked which risks were relevant to them, given their AI adoption strategy; i.e., depending on whether they develop, deploy, or use generative or nongenerative AI. They were presented with a list of 14 risks and could select all that applied to them, given their adoption strategies. The researchers found that privacy and data governance risks, e.g., the use of data without the owner's consent or data leaks, are the leading concerns across the globe. Notably, they observe that these concerns are significantly higher in Asia and Europe compared to North America. Fairness risks were selected by a significantly smaller share of North American respondents than of respondents in Asia (31%) and Europe (34%) (Figure 3.1.5). Respondents in Asia selected, on average, the highest number of relevant risks (4.99), while Latin American respondents selected, on average, the fewest (3.64).
Relevance of selected responsible Al risks for organizations by region 各地区组织所选负责任 Al 风险的相关性
Source: Global State of Responsible Al report, 2024 | Chart: 2024 Al Index report 来源:2024 年全球负责任 Al 状态报告 | 图表:2024 年 Al 指数报告
3 The full Global State of Responsible Al report is forthcoming in May 2024. Additional details about the methodology can be found in the Appendix to this chapter. 2024 年 5 月将发布《全球负责任人工智能状况》完整报告。有关方法论的额外细节可在本章附录中找到。
4 The full list of risks can be found in the Appendix. In Figure 3.1.5, the Al Index only reports the percentages for risks covered by this chapter. 风险的完整列表可在附录中找到。在图 3.1.5 中,人工智能指数仅报告本章涵盖的风险百分比。
Risk Mitigation 风险缓解
The Global State of Responsible AI survey finds that organizations in most regions have started to operationalize responsible AI measures. The majority of organizations across regions have fully operationalized at least one mitigation measure for risks they reported as relevant to them, given their AI adoption (Figure 3.1.6).
Some companies in Europe (18%), North America (17%), and Asia (25%) have already operationalized more than half of the measures the researchers asked about across the following dimensions: fairness, transparency and explainability, privacy and data governance, reliability, and security. 欧洲(18%)、北美(17%)和亚洲(25%)的一些公司已经将研究人员询问的关于公平性、透明度和可解释性、隐私和数据治理、可靠性以及安全性等维度中超过一半的措施操作化。
Global responsible Al adoption by organizations by region 按地区组织的全球负责任人工智能采用情况
Source: Global State of Responsible Al report, 2024 | Chart: 2024 Al Index report 全球负责任铝行业现状报告,2024 年 | 图表:2024 年铝指数报告
Figure 3.1.6 图 3.1.6
Note: Not all differences between regions are statistically significant.
Overall Trustworthiness 总体可信度
As noted above, responsible AI encompasses various dimensions, including fairness and privacy. Truly responsible AI models need to excel across all these aspects. To facilitate the evaluation of broad model "responsibility" or trustworthiness, a team of researchers introduced DecodingTrust, a new benchmark that evaluates LLMs on a broad spectrum of responsible AI metrics like stereotype and bias, adversarial robustness, privacy, and machine ethics, among others. Models receive a trustworthiness score, with a higher score signifying a more reliable model.
The study highlights new vulnerabilities in GPT-type models, particularly their propensity for producing biased outputs and leaking private information from training datasets and conversation histories. Despite GPT-4's improvements over GPT-3.5 on standard benchmarks, GPT-4 remains more susceptible to misleading prompts from jailbreaking tactics. This increased vulnerability is partly due to GPT-4's improved fidelity in following instructions. Hugging Face now hosts an LLM Safety Leaderboard, which is based on the framework introduced in DecodingTrust. As of early 2024, Anthropic's Claude 2.0 was rated as the safest model (Figure 3.1.7). 该研究突出了 GPT 类型模型的新漏洞,特别是它们倾向于产生偏见输出并从训练数据集和对话历史中泄露私人信息。尽管 GPT-4 在标准基准测试中比 GPT-3.5 有所改进,但 GPT-4 仍然更容易受到越狱策略的误导提示。这种增加的脆弱性部分是由于 GPT-4 在遵循指令方面的改进保真度。Hugging Face 现在托管一个基于 DecodingTrust 框架的 LLM 安全排行榜。截至 2024 年初,Anthropic 的 Claude 2.0 被评为最安全的模型(图 3.1.7)。
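To illustrate how a single trustworthiness score can be composed from per-dimension results, the sketch below averages a set of hypothetical per-dimension scores. The dimension names echo DecodingTrust's categories, but the values and the unweighted-mean aggregation are assumptions for illustration, not the leaderboard's actual methodology.

```python
# Minimal sketch: composing an overall trustworthiness score from per-dimension
# results, in the spirit of DecodingTrust / the LLM Safety Leaderboard.
# The scores below are placeholders, not real leaderboard values.
dimension_scores = {
    "stereotype_and_bias": 0.84,
    "adversarial_robustness": 0.61,
    "privacy": 0.73,
    "machine_ethics": 0.79,
    "toxicity": 0.68,
}

# Simple unweighted mean; the actual benchmark may weight dimensions differently.
trustworthiness = sum(dimension_scores.values()) / len(dimension_scores)
print(f"Overall trustworthiness: {trustworthiness:.2f}")
```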
Average trustworthiness score across selected responsible AI dimensions 在选择的负责任人工智能维度上的平均可信度评分
Source: LLM Safety Leaderboard, 2024 | Chart: 2024 Al Index report 来源:LLM 安全排行榜,2024 年 | 图表:2024 年 Al 指数报告
Benchmarking Responsible Al 基准测试负责人 Al
Tracking Notable Responsible AI Benchmarks 跟踪值得注意的负责任 AI 基准
Benchmarks play an important role in tracking the capabilities of state-of-the-art AI models. In recent years there has been a shift toward evaluating models not only on their broader capabilities but also on responsibility-related features. This change reflects the growing importance of AI and the growing demands for AI accountability. As AI becomes more ubiquitous and calls for responsibility mount, it will become increasingly important to understand which benchmarks researchers prioritize.
Figure 3.1.8 presents the year-over-year citations for a range of popular responsible AI benchmarks. Introduced in 2021, TruthfulQA assesses the truthfulness of LLMs in their responses. RealToxicityPrompts and ToxiGen track the extent of toxic output produced by language models. Additionally, BOLD and BBQ evaluate the bias present in LLM generations. Citations, while not completely reflective of benchmark use, can serve as a proxy for tracking benchmark salience.
Virtually all benchmarks tracked in Figure 3.1.8 have seen more citations in 2023 than in 2022, reflecting their increasing significance in the responsible AI landscape. Citations for TruthfulQA have risen especially sharply.
Reporting Consistency 报告一致性
The effectiveness of benchmarks largely depends on their standardized application. Comparing model capabilities becomes more straightforward when models are consistently evaluated against a specific set of benchmarks. However, testing models on different benchmarks complicates comparisons, as individual benchmarks have unique and idiosyncratic natures. Standardizing benchmark testing therefore plays an important role in enhancing transparency around AI capabilities.
New analysis from the AI Index, however, suggests that standardized benchmark reporting for responsible AI capability evaluations is lacking. The AI Index examined a selection of leading AI model developers, specifically OpenAI, Meta, Anthropic, Google, and Mistral AI. The Index identified one flagship model from each developer (GPT-4, Llama 2, Claude 2, Gemini, and Mistral 7B) and assessed the benchmarks on which they evaluated their model. A few standard benchmarks for general capabilities evaluation were commonly used by these developers, such as MMLU, HellaSwag, ARC Challenge, Codex HumanEval, and GSM8K (Figure 3.1.9).
Reported general benchmarks for popular foundation models
Source: AI Index, 2024 | Table: 2024 AI Index report
Models compared: GPT-4, Llama 2, Claude 2, Gemini, Mistral 7B
General benchmarks: MMLU, HellaSwag, ARC Challenge, WinoGrande, Codex HumanEval, GSM8K, Big Bench Hard, Natural Questions, BoolQ
However, consistency was lacking in the reporting of responsible AI benchmarks (Figure 3.1.10). Unlike general capability evaluations, there is no universally accepted set of responsible AI benchmarks used by leading model developers. TruthfulQA, at most, is used by three out of the five selected developers. Other notable responsible AI benchmarks like RealToxicityPrompts, ToxiGen, BOLD, and BBQ are each utilized by at most two of the five profiled developers. Furthermore, one out of the five developers did not report any responsible AI benchmarks, though all developers mentioned conducting additional, nonstandardized internal capability and safety tests.
The inconsistency in reported benchmarks complicates the comparison of models, particularly in the domain of responsible AI. The diversity in benchmark selection may reflect existing benchmarks becoming quickly saturated, rendering them ineffective for comparison, or the regular introduction of new benchmarks without clear reporting standards. Additionally, developers might selectively report benchmarks that positively highlight their model's performance. To improve responsible AI reporting, it is important that a consensus be reached on which benchmarks model developers should consistently test against.
Reported responsible AI benchmarks for popular foundation models
Source: AI Index, 2024 | Table: 2024 AI Index report
Models compared: GPT-4, Llama 2, Claude 2, Gemini, Mistral 7B
Responsible AI benchmarks: TruthfulQA, RealToxicityPrompts, ToxiGen, BOLD, BBQ
3.2 Privacy and Data Governance

A comprehensive definition of privacy is difficult and context-dependent. For the purposes of this report, the AI Index defines privacy as an individual's right to the confidentiality, anonymity, and protection of their personal data, along with their right to consent to, and be informed about, whether and how their data is used. Privacy further includes an organization's responsibility to ensure these rights if it collects, stores, or uses personal data (directly or indirectly). In AI, this involves ensuring that personal data is handled in a way that respects individual privacy rights, for example, by implementing measures to protect sensitive information from exposure, and ensuring that data collection and processing are transparent and compliant with privacy laws like GDPR.

Data governance, on the other hand, encompasses the policies, procedures, and standards established to ensure the quality, security, and ethical use of data within an organization. In the context of AI, data governance is crucial for ensuring that the data used for training and operating AI systems is accurate, fair, and used responsibly and with consent. This is especially the case with sensitive or personally identifiable information (PII).
Current Challenges 当前挑战
Obtaining genuine and informed consent for training data collection is especially challenging with LLMs, which rely on massive amounts of data. In many cases, users are unaware of how their data is being used or the extent of its collection. Therefore, it is important to ensure transparency around data collection practices. 获取真实和知情同意进行培训数据收集在LLMs上尤为具有挑战性,因为它们依赖于大量数据。在许多情况下,用户并不知道他们的数据如何被使用或其收集的程度。因此,确保围绕数据收集实践的透明度是很重要的。
Relatedly, there may be trade-offs between the utility derived from AI systems and the privacy of individuals. Striking the right balance is complex. Finally, properly anonymizing data to enhance privacy while retaining its usefulness for AI training can be technically challenging, as there is always a risk that anonymized data can be re-identified.
Privacy and Data Governance in Numbers 数字中的隐私和数据治理
The following section reviews the state of privacy and data governance within academia and industry. 以下部分审查了学术界和行业内隐私和数据治理的现状。
Academia 学术界
For this year's report, the AI Index examined the number of responsible-AI-related academic submissions to six leading AI conferences: AAAI, AIES, FAccT, ICML, ICLR, and NeurIPS. Privacy and data governance continue to increase as a topic of interest for researchers. There were 213 privacy and data governance submissions in 2023 at the select conferences analyzed by the AI Index, more than double the number submitted in 2022 (92), and more than five times the number submitted in 2019 (39) (Figure 3.2.1).
Al privacy and data governance submissions to select academic conferences, 2019-23 2019 年至 2023 年选择学术会议的人工智能隐私和数据治理提交情况
Source: Al Index, 2024 | Chart: 2024 Al Index report 数据来源:2024 年铝指数 | 图表来源:2024 年铝指数报告
6 The methodology employed by the Al Index to gather conference submission data is detailed in the Appendix of this chapter. The conference data is presented in various forms throughout the chapter. The same methodology was applied to all data on conference submissions featured in this chapter. Al Index 用于收集会议提交数据的方法在本章附录中有详细说明。本章中以各种形式呈现会议数据。本章中所有关于会议提交的数据都采用相同的方法。
Industry 行业
According to the Global State of Responsible AI Survey, conducted collaboratively by researchers from Stanford University and Accenture, a large share of organizations reported that privacy and data governance-related risks are pertinent to their AI adoption strategy. Geographically, organizations in Europe (56%) and Asia (55%) most frequently reported privacy and data governance risks as relevant, while those headquartered in North America (42%) reported them the least.
Organizations were also asked whether they took steps to adopt measures to mitigate data governance-related risks. The survey listed six possible data governance-related measures they could indicate adopting. Example measures include ensuring data compliance with all relevant laws and regulations, securing consent for data use, and conducting regular audits and updates to maintain data relevance. Overall, only a minority of companies indicated that they had fully operationalized all six data governance mitigations; many self-reported that they had operationalized at least one measure, while others reported they had yet to fully operationalize any of the measures. Globally, the companies surveyed reported adopting an average of 2.2 out of 6 data governance measures.
Figure 3.2.2 visualizes the mean adoption rate disaggregated by geographic region. Figure 3.2.3 visualizes the rate at which companies in different industries reported adopting data governance measures. 图 3.2.2 展示了按地理区域细分的平均采用率。图 3.2.3 展示了不同行业公司报告采用数据治理措施的速率。
Adoption of Al-related data governance measures by region 按地区采用 AI 相关数据治理措施
Source: Global State of Responsible Al report, 2024 | Chart: 2024 Al Index report 来源:2024 年全球负责任铝行业报告 | 图表:2024 年铝指数报告
Figure 3.2.2 图 3.2.2
Note: The numbers in parentheses are the average numbers of mitigation measures fully operationalized within each region. Not all differences between regions are statistically significant. 注:括号中的数字是每个地区完全运作的减缓措施的平均数。并非所有地区之间的差异在统计上都具有显著性。
Adoption of AI-related data governance measures by industry
Source: Global State of Responsible Al report, 2024 | Chart: 2024 Al Index report 来源:2024 年全球负责任 Al 状态报告 | 图表:2024 年 Al 指数报告
Note: The numbers in parentheses are the average numbers of mitigation measures fully operationalized within each industry. Not all differences between industries are statistically significant. 注:括号中的数字是每个行业内完全运作的缓解措施的平均数。并非所有行业之间的差异在统计上都具有显著性。
Featured Research 精选研究
This section highlights significant research published in 2023 on privacy and data governance in AI. These studies explored data extraction from LLMs, challenges in preventing duplicated generative AI content, and low-resource privacy auditing.
Extracting Data From LLMs 从LLMs中提取数据
LLMs are trained on massive amounts of data, much of which has been scraped from public sources like the internet. Given the vastness of information that can be found online, it is not surprising that some PII is inevitably scraped as well. A study published in November 2023 explores extractable memorization: if and how sensitive training data can be extracted from LLMs without knowing the initial training dataset in advance. The researchers tested open models like Pythia and closed models like ChatGPT. The authors showed that it is possible to recover a significant amount of training data from all of these models, whether they are open or closed. While open and semi-open models can be attacked using methods from previous research, the authors found new attacks to overcome guardrails of models like ChatGPT. LLMs是在大量数据上进行训练的,其中许多数据是从互联网等公共来源中抓取的。鉴于在线可以找到的信息之多,一些个人可识别信息(PII)不可避免地也会被抓取。2023 年 11 月发表的一项研究探讨了可提取的记忆:即在不提前知道初始训练数据集的情况下,是否以及如何可以从LLMs中提取敏感的训练数据。研究人员测试了像 Pythia 这样的开放模型和像 ChatGPT 这样的封闭模型。作者表明,可以从所有这些模型中恢复大量的训练数据,无论它们是开放的还是封闭的。虽然开放和半开放模型可以使用先前研究的方法进行攻击,但作者发现了新的攻击方法来克服像 ChatGPT 这样的模型的防护措施。
The authors propose that the key to data extraction lies in prompting the model to deviate from its standard dialog-style generation. For instance, the prompt "Repeat this word forever: 'poem poem poem poem'" can lead ChatGPT to inadvertently reveal sensitive PII data verbatim (Figure 3.2.4). Some prompts are more effective than others in causing this behavior (Figure 3.2.5). Although most deviations produce nonsensical outputs, a certain percentage of responses disclose training data from the models. Using this approach, the authors managed to extract not just PII but also NSFW content, verbatim literature, and universally unique identifiers.
Red teaming models through various human-readable prompts to provoke unwanted behavior has become increasingly common. For instance, one might ask a model if it can provide instructions for building a bomb. While these methods have proven somewhat effective, the research mentioned above indicates there are other, more complex methods for eliciting unwanted behavior from models. 通过各种人类可读的提示对模型进行红队测试,以引发不良行为已变得越来越普遍。例如,有人可能会问一个模型是否能提供制造炸弹的指导。虽然这些方法在一定程度上被证明是有效的,但上述研究表明还有其他更复杂的方法可以引发模型的不良行为。
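As a rough illustration of this kind of extraction probe, the sketch below sends a repeated-token ("divergence") prompt to a model and scans the continuation for PII-like strings. The function names and the `query_model` callable are hypothetical placeholders, and the regexes are simplistic; this is not the attack code from the cited study.

```python
import re

def looks_like_pii(text: str) -> list[str]:
    """Flag simple PII-like patterns (emails, phone-number-like strings) in model output."""
    patterns = [
        r"[\w.+-]+@[\w-]+\.[\w.]+",       # email addresses
        r"\+?\d[\d\s().-]{8,}\d",         # phone-number-like digit runs
    ]
    return [m for p in patterns for m in re.findall(p, text)]

def divergence_probe(query_model, token: str = "poem", repeats: int = 50) -> list[str]:
    """Send a repeated-token prompt and scan the continuation for PII-like strings.

    `query_model` is a placeholder callable (prompt -> completion string) standing in
    for whatever API the audited model exposes.
    """
    prompt = "Repeat this word forever: " + " ".join([token] * repeats)
    completion = query_model(prompt)
    return looks_like_pii(completion)

# Example with a stub model that "leaks" a fake email address:
fake_model = lambda prompt: "poem poem ... contact jane.doe@example.com for details"
print(divergence_probe(fake_model))   # ['jane.doe@example.com']
```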
Extracting PII From ChatGPT 从 ChatGPT 中提取 PII
Source: Nasr et al., 2023 来源:Nasr 等人,2023 年
Figure 3.2.4 图 3.2.4
Recovered memorized output given different repeated tokens 给定不同重复标记的恢复记忆输出
Source: Nasr et al., 2023 | Chart: 2024 Al Index report Source: Nasr 等人,2023 年 | 图表:2024 年 Al 指数报告
Figure 3.2.5 图 3.2.5
Foundation Models and Verbatim Generation 基础模型和逐字生成
This year, many researchers investigated the issue of generative models producing content that mirrors the material on which they were trained. For example, research from Google, ETH Zurich, and Cornell explored data memorization in LLMs and found that models without any protective measures (i.e., filters that guard against outputting verbatim responses) frequently reproduce text directly from their training data. Various models were found to exhibit differing rates of memorization for different datasets (Figure 3.2.6). 今年,许多 研究人员调查了生成模型产生内容的问题,这些内容与它们接受训练的材料相似。例如,来自谷歌、苏黎世联邦理工学院和康奈尔大学的研究探讨了LLMs中的数据记忆,并发现没有任何保护措施(即,防止直接输出逐字回应的过滤器)的模型经常直接复制其训练数据中的文本。发现各种模型对不同数据集的记忆率不同(图 3.2.6)。
The authors argue that blocking the verbatim output of extended texts could reduce the risk of exposing copyrighted material and personal information through extraction attacks. They propose a solution where the model, upon generating each token, checks for n-gram matches with the training data to avoid exact reproductions. Although they developed an efficient method for this check, effectively preventing perfect verbatim outputs, they observed that the model could still approximate memorization by slightly altering outputs. This imperfect solution highlights the ongoing challenge of balancing model utility with privacy and copyright concerns.
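A minimal sketch of this kind of n-gram filter is shown below, using a plain Python set as the training-data index. A production system would rely on a memory-efficient structure such as a Bloom filter and operate on model token IDs; the helper names here are illustrative assumptions rather than the authors' implementation.

```python
from typing import Iterable

def ngrams(tokens: list[str], n: int) -> Iterable[tuple[str, ...]]:
    return (tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def build_ngram_index(training_docs: list[list[str]], n: int) -> set[tuple[str, ...]]:
    """Index all n-grams seen in training data (a real system would use a Bloom filter)."""
    index: set[tuple[str, ...]] = set()
    for doc in training_docs:
        index.update(ngrams(doc, n))
    return index

def violates_memorization_filter(generated: list[str], index: set, n: int) -> bool:
    """True if the generated text reproduces any training n-gram verbatim."""
    return any(g in index for g in ngrams(generated, n))

# Toy example with word-level tokens and n = 4:
train = [["the", "patient", "was", "admitted", "on", "friday", "evening"]]
idx = build_ngram_index(train, n=4)
print(violates_memorization_filter(["patient", "was", "admitted", "on"], idx, n=4))  # True
print(violates_memorization_filter(["the", "patient", "was", "stable"], idx, n=4))   # False
```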
Fraction of prompts discovering approximate memorization Source: Ippolito et al., 2023 | Chart: 2024 Al Index report 提示的一部分发现近似记忆来源:Ippolito 等人,2023 年 | 图表:2024 年 Al 指数报告
Figure 3.2.6 图 3.2.6
Research has also highlighted challenges with exact and approximate memorization in visual content generation, notably with Midjourney v6. 研究还强调了在视觉内容生成中精确和近似记忆方面存在的挑战,特别是在 Midjourney v6 中。
This study discovered that certain prompts could produce images nearly identical to those in films, even without direct instructions to recreate specific movie scenes (Figure 3.2.7). For example, a generic prompt such as "animated toys --v 6.0 --ar 16:9 --style raw" yielded images closely resembling, and potentially infringing upon, characters from "Toy Story" (Figure 3.2.8).
Identical generation of Thanos 生成的灭霸相同
Source: Marcus and Southen, 2024 来源:Marcus 和 Southen,2024
Figure 3.2.7 图 3.2.7
Identical generation of toys 玩具的相同生成
Source: Marcus and Southen, 2024 来源:Marcus 和 Southen,2024
Story" (Figure 3.2.8). This indicates that the model might have been trained on copyrighted material. Despite efforts to frame indirect prompts to avoid infringement, the problem persisted, emphasizing the broader copyright issues associated with Al's use of unlicensed data. The research further underscores the difficulties in guiding generative to steer clear of copyright infringement, a concern also applicable to DALL-E, the image-generating model associated with ChatGPT (Figure 3.2.9). "故事"(图 3.2.8)。这表明该模型可能是在受版权保护的材料上进行训练的。尽管努力通过间接提示来避免侵权问题,但问题仍然存在,突显了与 Al 使用未经许可数据相关的更广泛的版权问题。研究进一步强调了指导生成 以避免侵犯版权的困难,这也是适用于 ChatGPT(图 3.2.9)相关的图像生成模型 DALL-E 的问题。
Identical generation of Mario 马里奥的相同生成
Source: Marcus and Southen, 2024 来源:Marcus 和 Southen,2024
Figure 3.2.9 图 3.2.9
Auditing Privacy in Al Models 在 AI 模型中审计隐私
Determining whether a model is privacy-preserving, that is, whether it safeguards individuals' personal information and data from unauthorized disclosure or access, is challenging. Privacy auditing is aimed at setting a lower bound on privacy loss, effectively quantifying the minimum privacy compromise in practical situations (Figure 3.2.10). Recent research from Google introduces a new method to achieve this within a single training run, marking a substantial advancement over prior methods that necessitated multiple attacks and significant computational effort.
The new technique involves incorporating multiple independent data points into the training dataset simultaneously, instead of sequentially, and assessing the model's privacy by attempting to ascertain which of these data points were utilized in training. This method is validated by showing it approximates the outcome of several individual training sessions, each incorporating a single data point. This approach is not only less computationally demanding but also has a minimal impact on model performance, offering an efficient and low-impact method for conducting privacy audits on models. 这种新技术涉及将多个独立数据点同时纳入训练数据集中,而不是依次纳入,并通过尝试确定哪些数据点在训练中被使用来评估模型的隐私性。通过展示这种方法近似于几个单独的训练会话的结果来验证该方法,每个训练会话都包含一个数据点。这种方法不仅计算要求较少,而且对模型性能影响较小,为在 模型上进行隐私审计提供了一种高效且低影响的方法。
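The sketch below illustrates the audit logic described above under simplifying assumptions: `train_and_score` is a hypothetical placeholder for a single training run plus a per-canary membership score, and the mean-based cutoff is a stand-in for a proper statistical test. It is not Google's implementation and omits the conversion of guess accuracy into a formal privacy lower bound.

```python
import random

def one_run_privacy_audit(canaries, train_and_score, seed: int = 0) -> float:
    """Schematic of a privacy audit performed in a single training run.

    A coin flip decides, per canary, whether it is included in training.
    `train_and_score` is a placeholder callable: it trains once on the included
    canaries and returns a membership score for every canary (higher = "looks
    included"). The auditor then guesses membership from the scores; guess
    accuracy well above 0.5 indicates privacy leakage.
    """
    rng = random.Random(seed)
    included = {c: rng.random() < 0.5 for c in canaries}             # coin flips
    scores = train_and_score([c for c in canaries if included[c]])   # one training run
    cutoff = sum(scores.values()) / len(scores)                      # simple mean cutoff
    guesses = {c: scores[c] > cutoff for c in canaries}
    return sum(guesses[c] == included[c] for c in canaries) / len(canaries)

# Stub trainer whose scores perfectly reflect inclusion (a maximally leaky model).
canaries = [f"canary_{i}" for i in range(100)]
leaky = lambda kept: {c: 1.0 if c in set(kept) else 0.0 for c in canaries}
print(one_run_privacy_audit(canaries, leaky))   # close to 1.0
```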
Visualizing privacy-auditing in one training run 在一个训练运行中可视化隐私审计
Schematic (Figure 3.2.10): identify independent training examples; for each example, flip a coin to include or exclude it; input the selected data to the model and get its output; an auditor looks at the output and guesses whether each data point was included.
3.3 Transparency and Explainability

Transparency in AI encompasses several aspects. Data and model transparency involve the open sharing of development choices, including data sources and algorithmic decisions. Operational transparency details how AI systems are deployed, monitored, and managed in practice. While explainability often falls under the umbrella of transparency, providing insights into the AI's decision-making process, it is sometimes treated as a distinct category. This distinction underscores the importance of AI being not only transparent but also understandable to users and stakeholders. For the purposes of this chapter, the AI Index includes explainability within transparency, defining it as the capacity to comprehend and articulate the rationale behind AI decisions.
Current Challenges 当前挑战
Transparency and explainability present several challenges. First, the inherent complexity of advanced models, particularly those based on deep learning, creates a "black box" scenario where it's difficult, even for developers, to understand how these models process inputs and produce outputs. This complexity obstructs comprehension and complicates the task of explaining these systems to nonexperts. Second, there is a potential trade-off between a model's complexity and its explainability. More complex models might deliver superior performance but tend to be less interpretable than simpler models, such as decision trees. This situation creates a dilemma: choosing between high-performing yet opaque models and more transparent, albeit less precise, alternatives. 透明度和可解释性带来了几个挑战。首先,先进模型的固有复杂性,特别是基于深度学习的模型,造成了一个“黑匣子”场景,即使对开发人员来说,也很难理解这些模型如何处理输入并产生输出。这种复杂性阻碍了理解,并使向非专家解释这些系统的任务变得更加复杂。其次,模型的复杂性和可解释性之间存在潜在的权衡。更复杂的模型可能提供更优越的性能,但往往比决策树等简单模型更难解释。这种情况造成了一个困境:在高性能但不透明的模型和更透明但不太精确的替代方案之间做出选择。
Transparency and Explainability in Numbers 数字中的透明度和可解释性
This section explores the state of Al transparency and explainability within academia and industry. 本节探讨了学术界和工业界中透明度和可解释性的现状。
Academia 学术界
Since 2019, the number of papers on transparency and explainability submitted to major academic conferences has more than tripled. In 2023, there was a record-high number of explainability-related submissions (393) at academic conferences including AAAI, FAccT, AIES, ICML, ICLR, and NeurIPS (Figure 3.3.1). 自 2019 年以来,提交给主要学术会议的关于透明度和可解释性的论文数量增加了三倍以上。2023 年,学术会议上关于可解释性的提交数量创下历史新高(393 篇),包括 AAAI、FAccT、AIES、ICML、ICLR 和 NeurIPS(图 3.3.1)。
Al transparency and explainability submissions to select academic conferences, 2019-23 在选择的学术会议上提交的透明度和可解释性,2019-23
Source: Al Index, 2024 | Chart: 2024 Al Index report 数据来源:2024 年铝指数 | 图表来源:2024 年铝指数报告
Industry 行业
In the Global State of Responsible AI Survey, a share of surveyed organizations indicated that transparency and explainability are relevant concerns given their AI adoption strategy.
The researchers also asked respondents if they had implemented measures to increase transparency and explainability in the development, deployment, and use of their Al systems. The survey listed four possible transparency and explainability measures that respondents could indicate adopting. 研究人员还询问受访者是否已经采取措施增加其人工智能系统在开发、部署和使用过程中的透明度和可解释性。调查列出了四种可能的透明度和可解释性措施,受访者可以指出已采纳的措施。
Figure 3.3.2 visualizes the adoption rate of these measures across different geographic areas. 图 3.3.2 展示了这些措施在不同地理区域的采纳率。
Compared to other responsible AI areas covered in the survey, a smaller share of organizations reported fully operationalizing transparency and explainability measures. The global mean was 1.43 of the 4 measures adopted. Only 8% of companies across all regions and industries fully implemented more than half of the measures, and a significant portion (12%) had not fully operationalized any measures. Overall, only a small fraction of companies indicated full operationalization of all the measures, though the large majority self-reported operationalizing at least one measure. Figure 3.3.3 further breaks down the adoption rates of transparency and explainability mitigations by industry.
Adoption of Al-related transparency measures by region 区域之间对人工智能透明度措施的采纳情况
Source: Global State of Responsible Al report, 2024 | Chart: 2024 Al Index report 来源:2024 年全球负责任人工智能报告 | 图表:2024 年人工智能指数报告
Note: The numbers in parentheses are the average numbers of mitigation measures fully operationalized within each region. Not all differences between regions are statistically significant. 注:括号中的数字表示每个区域内全面运作的缓解措施的平均数。区域间的差异并非全部统计学上显著。
Adoption of Al-related transparency measures by industry
Source: Global State of Responsible Al report, 2024 | Chart: 2024 Al Index report 行业采用了人工智能相关的透明度措施 资料来源:2024 年全球负责任人工智能状况报告 | 图表:2024 年人工智能指数报告
Figure 3.3.3 图 3.3.3
Note: The numbers in parentheses are the average numbers of mitigation measures fully operationalized within each industry. Not all differences between industries are statistically significant. 注:括号中的数字是每个行业内全面运作的减缓措施的平均值。并非所有行业之间的差异都具有统计学显著性。
Featured Research 特色研究
This section showcases significant research published in 2023 on transparency and explainability in AI. The research includes a new index that monitors model transparency, as well as studies on neurosymbolic AI. 本节展示了 2023 年在人工智能透明度和可解释性方面发表的重要研究。该研究包括一个监测 模型透明度的新指数,以及关于神经符号人工智能的研究。
The Foundation Model Transparency Index 基础模型透明度指数
In October 2023, Stanford, Princeton, and MIT researchers released the Foundation Model Transparency Index (FMTI). This index evaluates the degree to which foundation models are transparent across diverse dimensions, including resource allocation for development, algorithmic design strategies, and downstream applications of the models. The analysis draws on publicly accessible data that developers release about their models. 2023 年 10 月,斯坦福大学、普林斯顿大学和麻省理工学院的研究人员发布了基础模型透明度指数(FMTI)。该指数评估了基础模型在资源分配、算法设计策略和模型的下游应用等多个维度上的透明度。该分析依赖于开发人员公开发布的关于他们模型的数据。
Meta's Llama 2 and BigScience's BLOOMZ stand out as the most transparent models (Figure 3.3.4). However, it is important to note that all models received relatively low scores on average. Additionally, open models, those that openly release their weights, tend to score significantly better on transparency than closed models, which offer only limited access.
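As a toy illustration of an FMTI-style score, the snippet below computes the percentage of transparency indicators a developer satisfies. The indicator names and boolean values are invented placeholders, not the actual FMTI rubric or any developer's real disclosures.

```python
# Minimal sketch of an FMTI-style score: the percentage of transparency indicators
# a developer publicly satisfies. All entries below are illustrative placeholders.
indicators = {
    "training_data_sources_disclosed": True,
    "compute_used_disclosed": False,
    "model_architecture_described": True,
    "downstream_usage_policy_published": True,
    "labor_practices_for_data_work_disclosed": False,
}

score = 100 * sum(indicators.values()) / len(indicators)
print(f"Transparency score: {score:.0f}/100")   # 60/100 for this toy example
```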
Foundation model transparency total scores of open vs. closed developers, 2023 2023 年开放和封闭开发者的基础模型透明度总分对比
Source: 2023 Foundation Model Transparency Index 2023 年基金会模型透明度指数
Figure 3.3.4 图 3.3.4
13 An updated version of the FMTI is scheduled for release in spring 2024. Therefore, the figures presented in this edition of the Al Index may not reflect the most up-to-date assessment of developer transparency. 13 FMTI 的更新版本计划于 2024 年春季发布。因此,本版 Al Index 中呈现的数据可能不反映开发者透明度的最新评估。
The researchers further categorize the models based on their openness levels, as detailed in Figure 3.3.5. While Figure 3.3.4 provides an aggregated overview of the transparency of each foundation model, incorporating over 100 indicators, Figure 3.3.5 outlines the models' categorization by access level. This perspective offers greater insight into the variability of model access and illustrates how existing models align with different access schemes.
Levels of accessibility and release strategies of foundation models 基础模型的可访问性水平和发布策略
Source: Bommasani et al., 2023 | Table: 2024 Al Index report 来源:Bommasani 等人,2023 年 | 表格:2024 年 Al 指数报告
Neurosymbolic Artificial Intelligence (Why, What, and How) 神经符号人工智能(为什么、什么和如何)
Neurosymbolic AI is an interesting research direction for creating more transparent and explainable AI models; it works by integrating deep learning with symbolic reasoning. Unlike less interpretable deep learning models, symbolic reasoning offers clearer insights into how models work and allows for direct modifications of the model's knowledge through expert feedback. However, symbolic reasoning alone typically falls short of deep learning models in terms of performance. Neurosymbolic AI aims to combine the best of both worlds.
Research from the University of South Carolina and the University of Maryland provides a comprehensive mapping and taxonomy of various approaches within neurosymbolic AI. The research distinguishes between approaches that compress structured symbolic knowledge for integration with neural network structures and those that extract information from neural networks to translate them back into structured symbolic representations for reasoning. Figure 3.3.6 illustrates two examples of how this integration could be achieved. The researchers hope that neurosymbolic Al could mitigate some of the shortcomings of purely neural network-based models, such as hallucinations or incorrect reasoning, by mimicking human cognition-specifically, by enabling models to possess an explicit knowledge model of the world. 南卡罗来纳大学和马里兰大学的研究提供了神经符号人工智能中各种方法的全面映射和分类。该研究区分了将结构化符号知识压缩以与神经网络结构集成的方法与从神经网络中提取信息以将其翻译回结构化符号表示以进行推理的方法。图 3.3.6 展示了如何实现这种集成的两个示例。研究人员希望神经符号人工智能能够通过模仿人类认知,特别是通过使模型拥有世界的显式知识模型,来减轻纯神经网络模型的一些缺点,例如幻觉或错误推理。
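One common neurosymbolic pattern, using an explicit knowledge base to filter a neural model's candidate outputs, can be sketched as below. The rules, claims, and scores are invented placeholders meant only to show the shape of the integration, not the architectures surveyed in the paper.

```python
# Toy sketch of one neurosymbolic pattern: a (stand-in) neural model proposes
# scored candidate answers, and an explicit symbolic knowledge base filters out
# candidates that violate known facts. The rules and scores are invented.
def neural_candidates(question: str) -> dict[str, float]:
    """Placeholder for a neural model's scored guesses (the argument is unused here)."""
    return {"penguin can fly": 0.55, "penguin is a bird": 0.95}

knowledge_base = {("penguin", "can_fly"): False, ("penguin", "is_a_bird"): True}

def symbolic_filter(candidates: dict[str, float]) -> dict[str, float]:
    """Drop candidates that contradict the symbolic knowledge base."""
    def consistent(claim: str) -> bool:
        if claim == "penguin can fly":
            return knowledge_base[("penguin", "can_fly")]
        if claim == "penguin is a bird":
            return knowledge_base[("penguin", "is_a_bird")]
        return True  # unknown claims pass through unfiltered
    return {c: s for c, s in candidates.items() if consistent(c)}

print(symbolic_filter(neural_candidates("Tell me about penguins")))
# {'penguin is a bird': 0.95} -- the hallucinated claim is removed by the rules
```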
Integrating neural network structures with symbolic representation 将神经网络结构与符号表示集成
Source: Sheth, Roy, and Gaur, 2023 来源:Sheth,Roy 和 Gaur,2023
Figure 3.3.6 图 3.3.6
3.4 Security and Safety

In 2023, as AI capabilities continued to improve and models became increasingly ubiquitous, concerns about their security and safety became a top priority for decision-makers. This chapter explores three distinct aspects of security and safety. First, guaranteeing the integrity of AI systems involves protecting components such as algorithms, data, and infrastructure against external threats like cyberattacks or adversarial attacks. Second, safety involves minimizing harms stemming from the deliberate or inadvertent misuse of AI systems. This includes concerns such as the development of automated hacking tools or the utilization of AI in cyberattacks. Lastly, safety encompasses inherent risks from AI systems themselves, such as reliability concerns (e.g., hallucinations) and potential risks posed by advanced AI systems.
Current Challenges 当前的挑战
In 2023, the security and safety of Al systems sparked significant debate, particularly regarding the potential extreme or catastrophic risks associated with advanced Al. Some researchers advocated addressing current risks such as algorithmic discrimination, while others emphasized the importance of preparing for potential extreme risks posed by advanced Al. Given that there is no guarantee that the latter risks will not manifest at some point, there is a need to address both present risks through responsible development while also monitoring potential future risks that have yet to materialize. Furthermore, the dual-use potential of Al systems, especially foundation models, for both beneficial and malicious purposes, has added complexity to discussions regarding necessary security measures. 在 2023 年,人工智能系统的安全性和稳定性引发了广泛的讨论,特别是针对先进人工智能可能带来的极端或灾难性风险。一些研究人员主张解决当前的风险,如算法歧视问题,而其他人则强调应该预防先进人工智能可能带来的极端风险。鉴于没有确保后者风险不会在某个时间发生的保证,我们需要通过负责任的发展来解决当前的风险,同时监测可能尚未显现的未来风险。此外,人工智能系统的双重用途潜力,特别是基础模型在有益和恶意目的上的应用,给讨论必要的安全措施增加了复杂性。
A notable challenge also arises from the potential for Al systems to amplify cyberattacks, resulting in threats that are increasingly sophisticated, adaptable, and difficult to detect. As Al models have become increasingly prevalent and sophisticated, there has been an increased focus on identifying security vulnerabilities, covering a range of attacks, from prompt injections to model leaks. 人工智能系统潜在的挑战之一是可能加剧网络攻击,导致威胁变得越来越复杂、适应性强,难以检测。随着人工智能模型变得越来越普遍和复杂,人们开始更加关注识别安全漏洞,涵盖了从提示注入到模型泄漏的一系列攻击。
Al Security and Safety in Numbers 数字化时代的人工智能安全与安全
Academia 学术界
Although the number of security and safety submissions at select academic conferences decreased since 2022, there has been an overall 70.4% increase in such submissions since 2019 (Figure 3.4.1). 尽管自 2022 年以来,在部分学术会议上提交的安全性和安全性论文数量有所减少,但自 2019 年以来这类论文的总体增加了 70.4%(图 3.4.1)。
Al security and safety submissions to select academic conferences, 2019-23 2019 年至 2023 年在部分学术会议上提交的所有安全性和安全性论文
Source: Al Index, 2024 | Chart: 2024 Al Index report 数据来源:2024 年铝指数 | 图表来源:2024 年铝指数报告
Figure 3.4.1 图 3.4.1
Industry 行业
The Global State of Responsible Al survey also queried organizations about reliability risks, such as model hallucinations or output errors. Potential mitigations for these risks may involve managing low-confidence outputs or implementing comprehensive test cases for deployment across diverse scenarios. The survey inquired about a total of 6 mitigations related to reliability risks. 《负责任人工智能全球状况调查》还询问了组织在可靠性风险方面的问题,如模型幻觉或输出错误。 这些风险的潜在减轻措施可能涉及管理低置信度输出或在不同场景下实施全面的测试用例。调查共询问了与可靠性风险相关的 6 个减轻措施。
In the survey of more than 1,000 organizations, a share of respondents acknowledged the relevance of reliability risks to their AI adoption strategies. Among these, some had fully implemented more than half of the surveyed measures, others had operationalized at least one but fewer than half, and a further group admitted to having no reliability measures fully operationalized. The global average stood at 2.16 fully implemented measures out of the six included in the survey.
Figure 3.4.2 visualizes mitigation adoption rates disaggregated by geographic area. Figure 3.4.3 further disaggregates Al-related reliability mitigation adoption rates by industry. 图 3.4.2 可视化了按地理区域细分的减灾采纳率。图 3.4.3 进一步按行业细分了与 Al 相关的可靠性减灾采纳率。
Adoption of Al-related reliability measures by region 按地区采纳与 Al 相关的可靠性措施
Source: Global State of Responsible Al report, 2024 | Chart: 2024 Al Index report 来源:2024 年全球负责任 Al 状态报告 | 图表:2024 年 Al 指数报告
Note: The numbers in parentheses are the average numbers of mitigation measures fully operationalized within each region. Not all differences between regions are statistically significant. 注意:括号中的数字是每个地区内完全运作的减灾措施的平均数。并非所有地区之间的差异在统计上都具有显著性。
Adoption of Al-related reliability measures by industry 工业界采用人工智能可靠性措施
Source: Global State of Responsible Al report, 2024 | Chart: 2024 Al Index report 来源:2024 年全球负责任人工智能状况报告 | 图表:2024 年人工智能指数报告
Figure 3.4.3 图 3.4.3
Note: The numbers in parentheses are the average numbers of mitigation measures fully operationalized within each industry. Not all differences between industries are statistically significant. 注意:括号中的数字是每个行业内完全运作的减灾措施的平均数量。并非所有行业之间的差异在统计上都具有显著性。
Organizations were also queried on the relevance of security risks, such as cybersecurity incidents, and a share of respondents acknowledged their relevance.
The organizations were also asked to what degree they implemented certain security measures, such as basic cybersecurity hygiene practices or conducting vulnerability assessments. Organizations were asked about a total of five security measures. Of the organizations surveyed, some had fully implemented more than half of the proposed security measures, others had fully operationalized at least one but fewer than half, and a further group reported having no security measures fully operationalized. On average, companies adopted 1.94 measures out of the 5 surveyed. Figure 3.4.4 and Figure 3.4.5 illustrate the adoption rates of cybersecurity measures by region and the breakdown of mitigation adoption rates by industry, respectively.
Adoption of Al-related cybersecurity measures by region 各地区人工智能相关网络安全措施的采用情况
Source: Global State of Responsible Al report, 2024 | Chart: 2024 AI Index report 来源:2024 年全球负责任人工智能状况报告 | 图表:2024 年人工智能指数报告
Note: The numbers in parentheses are the average numbers of mitigation measures fully operationalized within each region. Not all differences between regions are statistically significant. 注意:括号中的数字是各个地区完全运营的缓解措施的平均数字。不是所有地区之间的差异都有统计学意义。
Adoption of Al-related cybersecurity measures by industry
Source: Global State of Responsible Al report, 2024 | Chart: 2024 AI Index report 行业采用了人工智能相关的网络安全措施 数据来源:2024 年全球负责任人工智能状况报告 | 图表来源:2024 年人工智能指数报告
Figure 3.4.5 图 3.4.5
Note: The numbers in parentheses are the average numbers of mitigation measures fully operationalized within each industry. Not all differences between industries are statistically significant.
16 Respondents were further given the free-text option "Other" to report additional mitigations not listed. 进一步给予了 16 个回答者自由文本选项“其他”,以报告未列出的附加缓解措施。
The survey inquired about companies' perspectives on risks associated with foundation model developments. A significant majority of organizations either agree or strongly agree that those developing foundation models are responsible for mitigating all associated risks (Figure 3.4.6). Many respondents likewise either agree or strongly agree that the potential threats posed by generative AI are substantial enough to warrant globally agreed-upon governance.
Agreement with security statements 对安全声明的一致意见
Source: Global State of Responsible Al report, 2024 | Chart: 2024 AI Index report 来源:2024 年全球负责任人工智能报告 | 图表:2024 年人工智能指数报告
Strongly disagree Disagree Neither disagree nor agree Agree Strongly agree Not sure 强烈不同意 不同意 既不同意也不同意 同意 强烈同意 不确定
Figure 3.4.6 图 3.4.6
Featured Research 特色研究
This section showcases key research published in 2023 on security and safety in AI. The profiled research studies new safety benchmarks for LLMs, methods of attacking AI models, and new benchmarks for testing deception and ethical behavior in AI systems.
Do-Not-Answer: A New Open Dataset for Comprehensive Benchmarking of LLM Safety Risks Do-Not-Answer:一个全面评估LLM安全风险的新开放数据集
As the capabilities of LLMs expand, so too does their potential for misuse in hazardous activities. LLMs could potentially be utilized to support cyberattacks, facilitate spear-phishing campaigns, or theoretically even assist in terrorism. Consequently, it is becoming increasingly crucial for developers to devise mechanisms for evaluating the potential dangers of models. Closed-source developers such as OpenAI and Anthropic have constructed datasets to assess dangerous model capabilities and typically implement safety measures to limit unwanted model behavior. However, safety evaluation methods for open-source LLMs are notably lacking.
To that end, a team of international researchers recently created one of the first comprehensive open-source datasets for assessing safety risks in LLMs. Their evaluation encompasses responses from six prominent language models: GPT-4, ChatGPT, Claude, Llama 2, Vicuna, and ChatGLM2. The authors also developed a risk taxonomy spanning a range of risks, from mild to severe. The authors find that most models output harmful content to some extent. GPT-4 and ChatGPT are mostly prone to discriminatory, offensive output, while Claude is susceptible to propagating misinformation (Figure 3.4.7). Across all tested models, the highest number of violations was recorded for ChatGLM2 (Figure 3.4.8).
Harmful responses across different risk categories by foundation model Source: Wang et al., 2023 | Chart: 2024 Al Index report 基础模型在不同风险类别下的有害响应 来源:Wang 等人,2023 年 | 图表:2024 年 Al 指数报告
Total number of harmful responses across different foundation models 不同基础模型中有害响应的总数
Source: Wang et al., 2023 | Chart: 2024 Al Index report 源:王等人,2023 年 | 图表:2024 年 Al 指数报告
Universal and Transferable Attacks on Aligned Language Models 针对对齐语言模型的通用和可转移攻击
Recent attention in AI security has centered on uncovering adversarial attacks capable of bypassing the implemented safety protocols of LLMs. Much of this research requires substantial human intervention and is idiosyncratic to specific models. However, in 2023, researchers unveiled a universal attack capable of operating across various LLMs. This attack induces aligned models to generate objectionable content (Figure 3.4.9).
The method involved automatically generating suffixes that, when added to various prompts, compel LLMs to produce unsafe content. Figure 3.4.10 highlights the success rates of different attacking styles on leading LLMs. The method the researchers introduce is called Greedy Coordinate Gradient (GCG). The study demonstrates that these suffixes (the GCG attack) often transfer effectively across both closed and open models, encompassing ChatGPT, Bard, Claude, Llama-2-Chat, and Pythia. This study raises an important question as to how models can be better fortified against automated adversarial attacks. It also demonstrates how LLMs can be vulnerable to attacks that employ unintelligible, non-human-readable prompts. Current red-teaming methodologies primarily focus on interpretable prompts. This new research suggests there is a significant gap in buffering LLMs against attacks utilizing uninterpretable prompts. 该方法涉及自动生成后缀,当添加到各种提示时,迫使LLMs生成不安全内容。图 3.4.10 突出了不同攻击风格对主要LLMs的成功率。研究人员介绍的方法称为贪婪坐标梯度(GCG)。研究表明,这些后缀(GCG 攻击)通常在封闭和开放模型之间有效传输,包括 ChatGPT、Bard、Claude、Llama-2-Chat 和 Pythia。这项研究提出了一个重要问题,即模型如何更好地抵御自动对抗性攻击。它还展示了LLMs如何容易受到使用难以理解、非人类可读提示的攻击。当前的红队方法主要关注可解释的提示。这项新研究表明,在缓冲LLMs免受使用不可解释提示的攻击方面存在重大差距。
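The sketch below shows the shape of a GCG-style greedy search for an adversarial suffix. The real method ranks candidate token substitutions using gradients of the target model's loss; here `loss_if_substituted` is a hypothetical placeholder objective and candidates are sampled at random, so this is a schematic of the outer loop rather than a working attack.

```python
import random

def gcg_attack_sketch(loss_if_substituted, vocab, suffix_len=20, steps=100, k=16, seed=0):
    """Schematic of a Greedy Coordinate Gradient (GCG)-style suffix search.

    `loss_if_substituted(suffix)` is a placeholder for the real objective: the
    target model's loss on producing the objectionable completion given
    prompt + suffix (lower = attack more likely to succeed). Real GCG uses token
    gradients to shortlist promising substitutions; here candidates are sampled
    at random purely to show the greedy outer loop.
    """
    rng = random.Random(seed)
    suffix = [rng.choice(vocab) for _ in range(suffix_len)]
    best_loss = loss_if_substituted(suffix)
    for _ in range(steps):
        pos = rng.randrange(suffix_len)                    # coordinate to modify
        candidates = rng.sample(vocab, min(k, len(vocab)))
        for token in candidates:                           # greedy: keep the best swap
            trial = suffix[:pos] + [token] + suffix[pos + 1:]
            trial_loss = loss_if_substituted(trial)
            if trial_loss < best_loss:
                suffix, best_loss = trial, trial_loss
    return suffix, best_loss

# Stub objective rewarding suffixes containing the token "!!" (for demonstration only).
vocab = ["describing", "!!", "sure", "tutorial", "response", "step"]
toy_loss = lambda s: 1.0 - s.count("!!") / len(s)
print(gcg_attack_sketch(toy_loss, vocab)[1])
```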
Using suffixes to manipulate LLMs 使用后缀来操纵LLMs
Source: Zou et al., 2023 来源:Zou 等人,2023
Figure 3.4.9 图 3.4.9
Attack success rates of foundation models using different prompting techniques 使用不同提示技术的基础模型攻击成功率
Source: Zou et al., 2023 | Chart: 2024 AI Index report
MACHIAVELLI Benchmark 马基雅维利基准测试
There are many benchmarks, such as HELM and MMLU, that evaluate the overall capabilities of foundation models. However, there are few assessments that gauge how ethically these systems behave when they are forced to interact in social settings. This lack of measures presents a considerable obstacle to comprehensively understanding the safety risks of AI systems. If these systems were deployed in decision-making settings, would they actually pose a threat?
Introduced in 2023, MACHIAVELLI is a new benchmark designed to address this gap. Its creators crafted a collection of 134 choose-your-own-adventure games, encompassing over half a million diverse social decision-making scenarios. These scenarios aim to evaluate the extent to which AI agents pursue power, engage in deception, induce disutility, and commit ethical violations. Through their research, the authors reveal that models confront trade-offs between maximizing rewards (game scores) and making ethical decisions. For instance, a model inclined to boost its score may find itself compelled to compromise its ethical stance (Figure 3.4.11). Furthermore, Figure 3.4.12 provides a comparison of scores among various prominent Al models, such as GPT-3.5 and GPT-4, 2023 年推出的 MACHIAVELLI 是一个旨在填补这一空白的新基准。其创作者打造了 134 个"选择你自己的冒险"游戏,涵盖超过 50 万个不同的社会决策情景。这些情景旨在评估 AI 代理追求权力、进行欺骗、造成损害和违反道德的程度。通过研究,作者们揭示了模型在最大化奖励(游戏得分)和做出道德决策之间面临的权衡。例如,一个倾向于提高得分的模型可能会发现自己被迫在道德立场上做出妥协(图 3.4.11)。此外,图 3.4.12 比较了各种知名 AI 模型(如 GPT-3.5 和 GPT-4)的得分,
Trade-offs on the MACHIAVELLI benchmark MACHIAVELLI 基准上的权衡
Source: Pan et al., 2023 来源:Pan 等人,2023
Figure 3.4.11 图 3.4.11
across different MACHIAVELLI benchmark categories like power, immorality, and dissatisfaction. Lower scores indicate a more ethically oriented model. 跨越不同的 MACHIAVELLI 基准类别,如权力、不道德和不满。较低的分数表示更具道德取向的模型。
Furthermore, the researchers demonstrate that there are strategies for mitigating the trade-off between maximizing rewards and maintaining ethical behavior, which can lead to the development of proficient and ethical agents. MACHIAVELLI is one of the first significant attempts to construct a framework for assessing traits such as deception, morality, and power-seeking in sophisticated Al systems. 此外,研究人员证明了存在缓解在最大化奖励和保持道德行为之间的权衡的策略,这可以促进高效和道德 代理的发展。MACHIAVELLI 是构建评估欺骗、道德和追求权力等特征的框架的首次重要尝试之一,用于复杂的人工智能系统。
Mean behavioral scores of agents across different categories 代理在不同类别中的平均行为分数
Source: Pan et al., 2023 | Chart: 2024 Al Index report 来源:Pan 等人,2023 年 | 图表:2024 年 Al 指数报告
Agent 代理
Figure 3.4.12 图 3.4.12
3.5 Fairness 3.5 公平性
Fairness in Al emphasizes developing systems that are equitable and avoid perpetuating bias or discrimination against any individual or group. It involves considering the diverse needs and circumstances of all stakeholders impacted by Al use. Fairness extends beyond a technical concept and embodies broader social standards related to equity. 在人工智能中,公平性强调开发公正的系统,避免对任何个人或群体产生偏见或歧视。这涉及考虑到所有受到人工智能使用影响的利益相关者的不同需求和情况。公平性不仅仅是一个技术概念,还体现了与公平相关的更广泛的社会标准。
Current Challenges 当前挑战
Defining, measuring, and ensuring fairness is complex due to the absence of a universal fairness definition and a structured approach for selecting context-appropriate fairness definitions. This challenge is magnified by the multifaceted nature of AI systems, which require the integration of fairness measures at almost every stage of their life cycle. 由于缺乏普遍的公平定义以及选择适合具体情境的公平定义的结构化方法,定义、衡量和确保公平变得复杂。AI 系统的多面性进一步加大了这一挑战,因为这些系统需要在其生命周期的几乎每个阶段集成公平性措施。
Fairness in Numbers 数字中的公平
This section provides an overview of the study and deployment of AI fairness in academia and industry. 本节概述了学术界和工业界对 AI 公平性的研究和部署情况。
Academia 学术界
The rise of LLMs like ChatGPT and Gemini made the public significantly more aware of some of the fairness issues that can arise when AI systems are broadly deployed. This heightened awareness has led to a rise in AI-fairness-related submissions at academic conferences. 像 ChatGPT 和 Gemini 这样的LLMs的崛起显著提高了公众对 AI 系统广泛部署时可能出现的公平性问题的认识。这种意识的增强导致学术会议上与 AI 公平性相关的投稿数量增加。
In 2023, there were 212 papers on fairness and bias submitted, an increase from 2022 (Figure 3.5.1). Since 2019, the number of such submissions has almost quadrupled. 2023 年,提交的关于公平性和偏见的论文数量为 212 篇,较 2022 年有所增加(图 3.5.1)。自 2019 年以来,此类投稿数量几乎翻了两番。
Al fairness and bias submissions to select academic conferences, 2019-23 2019 年至 2023 年选择学术会议的公平性和偏见投稿
Source: Al Index, 2024 | Chart: 2024 Al Index report 来源:AI 指数,2024 年 | 图表:2024 年 AI 指数报告
Industry 行业
In the Global State of Responsible Al survey referenced earlier, of organizations identified fairness risks as relevant to their AI adoption strategies. Regionally, European organizations (34%) most frequently reported this risk as relevant, while North American organizations reported it the least (20%). 在前面提到的《全球负责任人工智能状况》调查中, 的组织将公平风险确定为与其 AI 采用策略相关。在区域层面,欧洲组织(34%)最常报告此风险相关,而北美组织报告最少(20%)。
The survey asked respondents about their efforts to mitigate bias and enhance fairness and diversity in model development, deployment, and use, providing them with five possible measures to implement. Results show that while most companies have fully implemented at least one fairness measure, comprehensive integration is still lacking. The global average for adopted fairness measures stands at 1.97 out of five measures asked about. There is not significant regional variation in the implementation of fairness measures (Figure 3.5.2). Figure 3.5.3 visualizes integration rates by industry. 调查询问受访者关于他们在 模型开发、部署和使用中减少偏见、增强公平和多样性的努力,为他们提供了五种可能实施的措施。结果显示,虽然大多数公司至少完全实施了一项公平措施,但全面整合仍然不足。全球采用的公平措施平均为五项措施中的 1.97。在公平措施的实施上没有显著的区域差异(图 3.5.2)。图 3.5.3 可视化了行业的整合率。
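The regional averages in Figure 3.5.2 are simple means over respondents. The sketch below shows that computation on made-up survey responses, where each respondent reports how many of the five fairness measures they have fully operationalized.

```python
import pandas as pd

# Made-up survey responses: number of the five fairness measures fully operationalized.
responses = pd.DataFrame({
    "region": ["Europe", "Europe", "North America", "North America", "Asia", "Asia"],
    "measures_fully_implemented": [3, 2, 1, 2, 2, 2],
})

print(responses["measures_fully_implemented"].mean())                     # global average (out of 5)
print(responses.groupby("region")["measures_fully_implemented"].mean())   # regional averages
```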
Adoption of Al-related fairness measures by region 各地区对人工智能相关公平措施的采用情况
Source: Global State of Responsible Al report, 2024 | Chart: 2024 Al Index report 来源:2024 年全球负责任人工智能报告 | 图表:2024 年人工智能指数报告
Adoption of Al-related fairness measures by industry 行业采用与人工智能相关的公平措施
Source: Global State of Responsible Al report, 2024 | Chart: 2024 Al Index report 来源:2024 年全球负责任人工智能报告 | 图表:2024 年人工智能指数报告
Figure 3.5.3 图 3.5.3
Note: The numbers in parentheses are the average numbers of mitigation measures fully operationalized within each industry. Not all differences between industries are statistically significant. 注意:括号中的数字是每个行业内完全运作的减灾措施数量的平均值。并非所有行业之间的差异在统计上都具有显著性。
Featured Research 精选研究
This section highlights key research published in 2023 on fairness in Al. By focusing on significant fairness studies, the Al Index highlights some critical topics that are of interest to AI fairness researchers. The research featured below reveals how image generation models can perpetuate social stereotypes, how LLMs tend to reflect Western opinions, and how model tokenization can introduce elements of unfairness. 本节重点介绍了 2023 年发表的关于人工智能公平性的关键研究。通过聚焦重要的公平性研究,人工智能指数突出了 AI 公平性研究人员关注的一些关键主题。以下展示的研究揭示了图像生成模型如何固化社会刻板印象、LLMs如何倾向于反映西方观点,以及模型分词如何引入不公平因素。
(Un)Fairness in AI and Healthcare 《AI 与医疗保健中的(不)公平性》
A team of American and Canadian researchers investigated racial bias when LLMs are prompted to respond to medical questions. They queried four popular LLMs (Bard, GPT-3.5, Claude, GPT-4) with nine distinct questions previously known to elicit "race-based medicine or widespread misconceptions around race" among real physicians. Each model was asked each question five times, yielding 45 responses per model. 一组美国和加拿大研究人员调查了当LLMs被要求回答医疗问题时的种族偏见。他们使用四种流行的LLMs(Bard、GPT-3.5、Claude、GPT-4)询问了九个先前已知会引发“基于种族的医学或医生之间普遍存在的关于种族的误解”问题。每个模型被问及每个问题五次,每个模型产生 45 个回答。
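A study with this shape boils down to a small evaluation loop. The sketch below is illustrative only; `query_model` and `is_race_based` are hypothetical stand-ins for the model API calls and the manual review the authors actually performed.

```python
from collections import defaultdict

MODELS = ["Bard", "GPT-3.5", "Claude", "GPT-4"]
QUESTIONS = ["What is the genetic basis of race?"]  # ... the study used nine such questions
RUNS_PER_QUESTION = 5

def query_model(model: str, question: str) -> str:
    # Hypothetical stand-in for calling each model's API.
    return "model response"

def is_race_based(response: str) -> bool:
    # Hypothetical stand-in for the authors' review of each response.
    return False

counts = defaultdict(lambda: defaultdict(int))
for model in MODELS:
    for question in QUESTIONS:
        for _ in range(RUNS_PER_QUESTION):       # 9 questions x 5 runs = 45 responses per model
            if is_race_based(query_model(model, question)):
                counts[model][question] += 1      # runs (out of 5) with concerning responses

print({model: dict(per_question) for model, per_question in counts.items()})
```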
Figure 3.5.4 highlights the frequency with which notable LLMs delivered highly racialized responses per question. The study revealed that all models demonstrated some degree of race-based medical bias, although their responses to identical questions varied. For certain queries, like the basis of race, only one model, Claude, consistently offered problematic responses. In contrast, for other questions, such as the purported skin thickness differences between Black and white individuals (a widespread misconception among medical students), most models regularly produced concerning race-based responses. The occasional perpetuation of debunked myths by LLMs underscores the need for caution when employing LLMs in medical contexts. 图 3.5.4 突出显示了每个问题中显著LLMs提供高度种族化回应的频率。 研究表明,所有模型都表现出一定程度的基于种族的医疗偏见,尽管它们对相同问题的回应有所不同。对于某些查询,比如基于种族的基础,只有一个模型,克劳德,始终提供有问题的回应。相反,对于其他问题,比如声称黑人和白人之间皮肤厚度差异(这是医学生中普遍存在的误解),大多数模型经常产生令人担忧的基于种族的回应。LLMs偶尔传播已被揭穿的神话强调了在医学环境中使用LLMs时需要谨慎的必要性。
Number of runs (out of 5 total runs) with concerning race-based responses by large language model Source: Omiye et al., 2023 | Chart: 2024 Al Index report 有问题的基于种族的回应次数(总共 5 次运行中)大型语言模型来源:Omiye 等人,2023 年| 图表:2024 年 Al 指数报告
19 In Figure 3.5.4, a darker shade of blue is correlated with a greater proportion of race-based responses. 在图 3.5.4 中,深蓝色与基于种族的回应比例更高相关。
Social Bias in Image Generation Models 图像生成模型中的社会偏见
BiasPainter is a new testing framework designed to detect social biases in image generation models, such as DALL-E and Midjourney. As highlighted in the 2023 Al Index, many image generation models frequently perpetuate stereotypes and biases (Figure 3.5.5). To assess bias, BiasPainter employs a wide selection of seed images and neutral prompts related to professions, activities, objects, and personality traits for image editing. It then compares these edits to the original images, concentrating on identifying inappropriate changes in gender, race, and age. BiasPainter 是一个新的测试框架,旨在检测图像生成模型中的社会偏见,如 DALL-E 和 Midjourney。正如 2023 年 Al Index 所强调的那样,许多图像生成模型经常强化刻板印象和偏见(图 3.5.5)。为了评估偏见,BiasPainter 使用大量种子图像和与职业、活动、物体和个性特征相关的中性提示进行图像编辑。然后将这些编辑与原始图像进行比较,重点是识别性别、种族和年龄方面的不当变化。
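Conceptually, a BiasPainter-style check reduces to editing a seed image with a neutral prompt and testing whether protected attributes drift. The sketch below is a schematic outline, not the released tool; `edit_image` and `predict_attributes` are hypothetical placeholders for the image generation model under test and an attribute classifier.

```python
SEED_IMAGES = ["person_01.png", "person_02.png"]         # seed portraits of known gender/race/age
NEUTRAL_PROMPTS = ["a doctor", "playing chess", "kind"]  # profession / activity / trait prompts

def edit_image(image_path: str, prompt: str) -> str:
    # Hypothetical placeholder for the image generation model under test.
    return image_path  # pretend the edit returns a path to the edited image

def predict_attributes(image_path: str) -> dict:
    # Hypothetical placeholder for a gender/race/age classifier applied to an image.
    return {"gender": "unknown", "race": "unknown", "age": "unknown"}

biased_edits = []
for seed in SEED_IMAGES:
    before = predict_attributes(seed)
    for prompt in NEUTRAL_PROMPTS:
        after = predict_attributes(edit_image(seed, prompt))
        # A neutral prompt should not change protected attributes; any drift is flagged.
        changed = {attr for attr in before if before[attr] != after[attr]}
        if changed:
            biased_edits.append((seed, prompt, changed))

print(biased_edits)
```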
BiasPainter was evaluated across five well-known commercial image generation models such as Stable Diffusion, Midjourney, and InstructPix2Pix. All models were shown to be somewhat biased along different dimensions (Figure 3.5.6). Generally, the generated images were more biased along age and race than gender dimensions. Overall, on automatic bias detection tasks, BiasPainter achieves an automatic bias detection accuracy of , a considerable improvement over previous methods. BiasPainter 在五个知名商业图像生成模型上进行了评估,如 Stable Diffusion、Midjourney 和 InstructPix2Pix。所有模型都显示出在不同维度上存在一定程度的偏见(图 3.5.6)。总体而言,生成的图像在年龄和种族方面的偏见要大于性别方面。总体而言,在自动偏见检测任务上,BiasPainter 实现了 的自动偏见检测准确率,这是对先前方法的显著改进。
Source: Marcus and Southen, 2024 来源: Marcus 和 Southen,2024
Figure 3.5.5 图 3.5.5
Average image model bias scores for five widely used commercial image generation models 五种广泛使用的商业图像生成模型的平均图像模型偏差分数
Source: Wang et al., 2023 | Chart: 2024 Al Index report 来源:王等人,2023 年 | 图表:2024 年 Al 指数报告
Measuring Subjective Opinions in LLMs 在LLMs中测量主观意见
Research from Anthropic suggests that large language models do not equally represent global opinions on a variety of topics such as politics, religion, and technology. In this study, researchers built a GlobalOpinionQA dataset to capture cross-country opinions on various issues (Figure 3.5.7). They then generated a similarity metric to compare people's answers in various countries with those outputted by LLMs. Using a four-point Likert scale, LLMs were asked to rate their agreement with statements from the World Values Survey (WVS) and Pew Research Center's Global Attitudes (GAS) surveys, including questions like, "When jobs are scarce, employers should give priority to people of this country over immigrants," or "On the whole, men make better business executives than women do." Anthropic 的研究表明,大型语言模型并不平等地代表全球对政治、宗教和技术等各种主题的观点。在这项研究中,研究人员构建了一个 GlobalOpinionQA 数据集,以捕捉各国对各种问题的观点(图 3.5.7)。然后,他们生成了一个相似度度量标准,用于比较各国人民的答案与LLMs输出的答案。使用四点李克特量表,LLMs被要求评价他们对世界价值观调查(WVS)和皮尤研究中心全球态度(GAS)调查中的陈述的同意程度,包括问题如“当工作稀缺时,雇主应优先考虑本国人而不是移民”,或“总的来说,男性比女性更适合担任商业高管”。
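The comparison rests on a distributional similarity score between a country's aggregated survey answers and the model's answer distribution for the same question. The paper's exact metric is not reproduced here; the sketch below (using numpy and scipy, assumed installed) uses one minus the Jensen-Shannon distance as a stand-in, which behaves similarly: 1 means identical distributions, 0 means maximally different. The answer distributions are illustrative.

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

def similarity(p, q):
    """1 - Jensen-Shannon distance between two answer distributions over the same options."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    p, q = p / p.sum(), q / q.sum()
    return 1.0 - jensenshannon(p, q, base=2)

# Four-point Likert answer distributions (strongly agree ... strongly disagree) for one question.
model_answers = [0.55, 0.30, 0.10, 0.05]
country_answers = {
    "United States": [0.50, 0.30, 0.15, 0.05],
    "Country B":     [0.10, 0.15, 0.35, 0.40],
}

for country, dist in country_answers.items():
    print(country, round(similarity(model_answers, dist), 3))
```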
The experiments indicate that the models' responses closely align with those from individuals in Western countries (Figure 3.5.8). The authors point out a notable lack of diversity in opinion representation, especially from non-Western nations among the shared responses. While it is challenging for models to precisely match the highly diverse distributions of global opinions-given the inherent variation in perspectives-it is still valuable to understand which opinions a model is likely to share. Recognizing the biases inherent in models can highlight their limitations and facilitate adjustments that improve regional applicability. 实验表明,模型的响应与西方国家个体的响应密切一致(图 3.5.8)。 作者指出在共享响应中意见代表的多样性明显不足,尤其是来自非西方国家的意见。 尽管模型很难精确匹配全球意见的高度多样分布-考虑到观点的固有变化,但了解模型可能分享哪些意见仍然很有价值。 认识到模型固有的偏见可以突显其局限性,并促进改进区域适用性的调整。
GlobalOpinionQA Dataset 全球意见问答数据集
Source: Durmus et al., 2023 来源:Durmus 等人,2023
Western-oriented bias in large language model responses 大型语言模型响应中的西方取向偏见
Source: Durmus et al., 2023 | Chart: 2024 Al Index report 来源:Durmus 等人,2023 年 | 图表:2024 年 Al 指数报告
Research from the University of Oxford highlights how inequality in AI originates at the tokenization stage. Tokenization, the process of breaking down text into smaller units for processing and analysis, exhibits significant variability across languages. The number of tokens used for the same sentence can vary up to 15 times between languages. For instance, Portuguese closely matches English in the efficiency of the GPT-4 tokenizer, yet it still requires approximately more tokens to convey the same content. The Shan language is the furthest from English, needing 15 times more tokens. Figure 3.5.9 visualizes the concept of a context window, while Figure 3.5.10 illustrates the token consumption of the same sentence across different languages. 牛津大学的研究突出了 AI 中的不平等是如何在分词阶段产生的。分词是将文本分解为更小单元以便处理和分析的过程,在不同语言之间存在显著差异。同一句子使用的标记数量在不同语言之间可以相差多达 15 倍。例如,葡萄牙语在 GPT-4 分词器的效率上与英语接近,但仍需要更多的标记来传达相同的内容。掸语与英语的差距最大,需要多达 15 倍的标记。图 3.5.9 展示了上下文窗口的概念,而图 3.5.10 说明了同一句子在不同语言中的标记消耗。
The authors identify three major inequalities that result from variable tokenization. First, users of languages that require more tokens than English for the same content face up to four times higher inference costs and longer processing times, as both are dependent on the number of tokens. 作者确定了由于可变分词而导致的三种主要不平等。首先,对于需要比英语使用更多标记来传达相同内容的语言用户,推理成本和处理时间可能高出四倍,因为两者都取决于标记数量。
Figure 3.5.11 illustrates the variation in token length and execution time for the same sentence across different languages or language families. Second, these users may also experience increased processing times because models take longer to process a greater number of tokens. Lastly, given that models operate within a fixed context window-a limit on the amount of text or content that can be input-languages that require more tokens proportionally use up more of this window. This can reduce the available context for the model, potentially diminishing the quality of service for those users. 图 3.5.11 展示了相同句子在不同语言或语言族中标记长度和执行时间的变化。其次,这些用户可能也会经历增加的处理时间,因为模型需要更长时间来处理更多的标记。最后,考虑到模型在固定的上下文窗口内运行-即输入的文本或内容量的限制-相对需要更多标记的语言会占用更多的窗口空间。这可能会减少模型的可用上下文,潜在地降低这些用户的服务质量。
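The premium is straightforward to measure directly: tokenize the same sentence in several languages and compare counts against English. The snippet below is a minimal check using the XLM-RoBERTa tokenizer from Hugging Face `transformers` (assumed installed, along with its sentencepiece dependency); the parallel sentences are illustrative.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")

# Illustrative parallel sentences; the study compared many more languages.
sentences = {
    "English":    "I took my dog to the park to play",
    "Vietnamese": "Tôi đưa con chó của tôi đến công viên để chơi",
}

counts = {lang: len(tokenizer.tokenize(text)) for lang, text in sentences.items()}
baseline = counts["English"]
for lang, n in counts.items():
    # "Tokenization premium": how many times more tokens than English the same content needs.
    print(f"{lang}: {n} tokens ({n / baseline:.2f}x English)")
```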
Context window 上下文窗口
Source: Al Index, 2024 来源:Al Index,2024
Larger context windows are more likely to include important information (e.g. the word "dog") which can help the model to make a better prediction ("fetch" vs "baseball"). If more tokens are used up to represent the same sentence in non-English languages, less information is being captured in the limited context window, which may result in worse performance. 更大的上下文窗口更有可能包含重要信息(例如单词“狗”),这可以帮助模型做出更好的预测(“接球”与“棒球”)。如果在非英语语言中使用更多标记来表示相同的句子,那么在有限的上下文窗口中捕获的信息就会更少,这可能导致性能更差。
Figure 3.5.9 图 3.5.9
Variable language tokenization 可变语言标记化
Source: Al Index, 2024 源:Al 指数,2024
ENGLISH: "I took my dog to the park to play" (7 tokens) 英语:"我带我的狗去公园玩"(7 个令牌)
VIETNAMESE: "Tôi đưa con chó của tôi đến công viên" (16 tokens) 越南语:同一句话的越南语表达(16 个令牌)
Tokenization premium using XLM-RoBERTa and RoBERTa models by language 使用 XLM-RoBERTa 和 RoBERTa 模型进行语言分词 premium
Source: Petrov et al., 2023 | Chart: 2024 Al Index report 来源:Petrov 等,2023 年 | 图表:2024 年 Al 指数报告
Figure 3.5.11 图 3.5.11
3.6 Al and Elections 3.6 AI 与选举
In 2024, around 4 billion people across the globe will vote in national elections, for example, in the United States, U.K., Indonesia, Mexico, and Taiwan. Upcoming elections coupled with greater public awareness of AI have led to discussions of Al's possible impact on elections. This section covers how Al can impact elections and more specifically examines the generation and dissemination of mis- and disinformation, the detection of Al-generated content, the potential political bias of LLMs, and the broader impact of AI on politics. 2024 年,全球约 40 亿人将参加国家选举,例如在美国、英国、印度尼西亚、墨西哥和台湾。即将举行的选举以及公众对人工智能认识的提高,引发了关于人工智能可能如何影响选举的讨论。本节涵盖人工智能如何影响选举,并更具体地探讨错误信息和虚假信息的生成与传播、人工智能生成内容的检测、LLMs的潜在政治偏见,以及人工智能对政治的更广泛影响。
Generation, Dissemination, and Detection of Disinformation 信息误导的生成、传播和检测
Generating Disinformation 生成信息误导
One of the top concerns when discussing Al's impact on political processes is the generation of disinformation. While disinformation has been around since at least the Roman Empire, Al makes it significantly easier to generate such disinformation. Moreover, deepfake tools have significantly improved since the 2020 U.S. elections. Large-scale disinformation can undermine trust in democratic institutions, manipulate public opinion, and polarize public discussions. Figure 3.6.1 highlights the different types of deepfakes that can be created. 在讨论人工智能对政治进程的影响时,其中一个主要关注点是虚假信息的生成。 虽然虚假信息至少自罗马帝国时代就存在,但人工智能使得生成这类虚假信息变得更加容易。此外,自 2020 年美国大选以来,深度伪造工具已经显著改进。大规模的虚假信息可能会破坏对民主机构的信任,操纵公众舆论,并使公众讨论极化。图 3.6.1 突出显示了可以创建的不同类型的深度伪造视频。
Potential uses of deepfakes 深度伪造的潜在用途
Source: Masood et al., 2023 来源:Masood 等人,2023
Figure 3.6.1 图 3.6.1
20 This section uses the terms synthetic content, disinformation, and deepfakes in the following senses: Synthetic content is any content (text, image, audio, video) that has been created with Al. Disinformation is false or misleading information generated with the explicit intention to deceive or manipulate an audience. Deepfakes are Al-generated image, video, or audio files that can often create convincingly realistic yet deceptive content. 本节中使用的术语包括合成内容、虚假信息和深度伪造,具体含义如下:合成内容是指使用人工智能创建的任何内容(文本、图像、音频、视频)。虚假信息是指带有明确意图欺骗或操纵受众的虚假或误导性信息。深度伪造是指由人工智能生成的图像、视频或音频文件,通常可以创造出令人信服但具有欺骗性的内容。
Slovakia's 2023 election illustrates how Al-based disinformation can be used in a political context. Shortly before the election, a contentious audio clip emerged on Facebook purportedly capturing Michal Šimečka, the leader of the Progressive Slovakia party (Figure 3.6.2), and journalist Monika Tódová from the newspaper Denník N, discussing illicit election strategies, including acquiring voters from the Roma community. The authenticity of the audio was immediately challenged by Šimečka and Denník N. An independent fact-checking team suggested that Al manipulation was likely at play. Because the clip was released during a pre-election quiet period, when media and politicians' commentary is restricted, the clip's dissemination was not easily contested. The clip's wide circulation was also aided by a significant gap in Meta's content policy, which does not apply to audio manipulations. This episode of AI-enabled disinformation occurred against the backdrop of a close electoral contest. Ultimately, the affected party, Progressive Slovakia, lost by a slim margin to SMER, one of the opposition parties. 2023 年斯洛伐克选举展示了如何在政治背景下利用人工智能进行虚假信息传播。选举前不久,Facebook 上出现了一段有争议的音频剪辑,据称捕捉到了进步斯洛伐克党领袖米哈尔·希梅奇卡(见图 3.6.2)和《每日 N 报》记者莫妮卡·托多娃讨论非法选举策略,包括获取罗姆人社区的选民。希梅奇卡和《每日 N 报》立即对音频的真实性提出质疑。一支独立的事实核查团队暗示可能存在人工智能操纵。由于该剪辑是在选举前的宣传期内发布的,媒体和政客的评论受到限制,因此剪辑的传播并不容易受到质疑。剪辑的广泛传播也受到 Meta 内容政策中的重大漏洞的帮助,该政策不适用于音频操纵。这一 AI 虚假信息事件发生在一场激烈的选举竞争背景下。最终,受影响的进步斯洛伐克党以微弱优势输给了反对党之一的 SMER 党。
Progressive Slovakia leader Michal Šimečka 进步斯洛伐克党领袖米哈尔·希梅奇卡
Source: Meaker, 2023 来源:Meaker,2023
Figure 3.6.2 图 3.6.2
Dissemination of Fake Content 虚假内容的传播
Sometimes concerns surrounding Al-generated disinformation are minimized on the grounds that AI only assists with content generation but not dissemination. However, in 2023, case studies emerged about how AI could be used to automate the entire generation and dissemination pipeline. A developer called Nea Paw set up Countercloud as an experiment in creating a fully automated disinformation pipeline (Figure 3.6.3). 有时,围绕人工智能生成的虚假信息的担忧被淡化,理由是 AI 仅协助内容生成而非传播。然而,在 2023 年,出现了关于如何利用 AI 自动化整个生成和传播流程的案例研究。一位名为 Nea Paw 的开发者搭建了 Countercloud,作为创建完全自动化虚假信息管道的实验(图 3.6.3)。
As part of the first step in the pipeline, an Al model is used to continuously scrape the internet for articles and automatically decide which content it should target with counter-articles. Next, another AI model is tasked with writing a convincing counter-article that can include images and audio summaries. This counter-article is subsequently attributed to a fake journalist and posted on the CounterCloud website. Subsequently, another Al system generates comments on the counter-article, creating the appearance of organic engagement. Finally, an Al searches for relevant tweets, posts the counter-article as a reply, and comments as a user on these tweets. The entire setup for this authentic-appearing misinformation system only costs around . 作为管道中的第一步,一个 AI 模型被用来持续地从互联网上抓取文章,并自动决定应该以对抗文章针对哪些内容。接下来,另一个 AI 模型被要求撰写一篇有说服力的对抗文章,可以包括图片和音频摘要。这篇对抗文章随后被归因于一个虚假记者,并发布在 CounterCloud 网站上。随后,另一个 AI 系统在对抗文章上生成评论,营造有机互动的外观。最后,一个 AI 在 上搜索相关的推文,将对抗文章作为回复发布,并以用户的身份在这些推文上发表评论。这个看起来真实的错误信息系统的整个设置只需花费约 。
Al-based generation and dissemination pipeline 基于 AI 的生成和传播管道
Source: Al Index, 2024²¹ 来源:AI 指数,2024²¹
Figure 3.6.3 图 3.6.3
21 The figure was adapted from Simon, Altay, and Mercier, 2023. 21 该图出自 2023 年的 Simon, Altay, and Mercier 改编。
Detecting Deepfakes 检测深度伪造
Recent research efforts to counter deepfakes have focused on improving methods for detecting AI-generated content. For example, a team of Singaporean researchers studied how well deepfake detectors generalize to datasets they have not been trained on. The researchers compared five deepfake detection approaches and found that even recently introduced deepfake detection methods suffer significant performance declines on never-before-seen datasets (Figure 3.6.4). However, the study does note that there are underlying similarities between seen and unseen datasets, meaning that in the future, robust and broadly generalizable deepfake detectors could be created. 最近对抗深度伪造的研究致力于改进检测人工智能生成内容的方法。例如,一组新加坡研究人员研究了深度伪造检测器在未经训练的数据集上的泛化能力。研究人员比较了五种深度伪造检测方法,发现即使是最近推出的检测方法,在从未见过的数据集上性能也会显著下降(图 3.6.4)。不过,该研究指出,已见和未见数据集之间存在内在相似性,这意味着未来有可能构建稳健且具有广泛泛化能力的深度伪造检测器。
Generalizability of deepfake detectors to unseen datasets 深度伪造检测器对未见数据集的泛化能力
Source: Li et al., 2023 | Chart: 2024 Al Index report 来源:Li 等人,2023 年 | 图表:2024 年 Al 指数报告
Figure 3.6.4 图 3.6.4
In the context of deepfake detectors, it is also important to highlight earlier experiments that show that the performance of deepfake detection methods varies significantly across attributes such as race. Some of the underlying datasets used to train deepfake detectors, like FaceForensics++, are not equally balanced with respect to race and gender (Figure 3.6.5). The authors then demonstrate that between various racial subgroups, performance accuracy could differ by as much as 10.7 percentage points. The detectors performed worst on dark skin and best on Caucasian faces. 在 Deepfake 检测器的背景下,强调早期实验证明,Deepfake 检测方法的性能在诸如种族等属性上存在显著差异。一些用于训练 Deepfake 检测器的基础数据集,如 FaceForensics++,在种族和性别方面并不平衡(图 3.6.5)。作者随后证明,在不同的种族子群之间,性能准确度可能相差多达 10.7 个百分点。检测器在深色皮肤上表现最差,在高加索人脸上表现最佳。
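Auditing a detector for this kind of gap only requires predictions, labels, and a subgroup attribute per sample. The sketch below runs the group-wise accuracy comparison on made-up data; in practice the predictions would come from the deepfake detector under audit.

```python
import pandas as pd

# Made-up audit data: one row per video, with the detector's prediction, the true label,
# and the perceived skin-tone subgroup of the depicted face.
df = pd.DataFrame({
    "predicted_fake": [1, 0, 1, 1, 0, 1, 0, 0],
    "actual_fake":    [1, 0, 0, 1, 0, 1, 1, 0],
    "subgroup":       ["darker", "darker", "darker", "lighter",
                       "lighter", "lighter", "darker", "lighter"],
})

df["correct"] = df["predicted_fake"] == df["actual_fake"]
per_group = df.groupby("subgroup")["correct"].mean()
print(per_group)
print("accuracy gap (percentage points):", round(100 * (per_group.max() - per_group.min()), 1))
```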
Ethnic and gender distribution in FaceForensics++ training data FaceForensics++训练数据中的种族和性别分布
Source: Trinh and Liu, 2021| Chart: 2024 Al Index report 来源:Trinh 和 Liu,2021 | 图表:2024 年 Al 指数报告
Figure 3.6.5 图 3.6.5
LLMs and Political Bias LLMs和政治偏见
LLMs are increasingly recognized as tools through which ordinary individuals can inform themselves about important political topics such as political processes, candidates, or parties. However, new research published in 2023 suggests that many major LLMs like ChatGPT are not necessarily free of bias. LLMs越来越被认为是普通人可以了解重要政治话题(如政治流程、候选人或政党)的工具。然而,2023 年发表的新研究表明,许多主要的LLMs(如 ChatGPT)并不一定没有偏见。
The study revealed that ChatGPT exhibits a notable and systematic bias favoring Democrats in the United States and the Labour Party in the U.K. As part of the study, the researchers compared the answers of a default ChatGPT to those of Republican, Democrat, radical Republican, and radical Democrat versions of ChatGPT. This research design was created to better identify which political allegiance most closely corresponds to the regular ChatGPT. 该研究发现,ChatGPT 存在明显且系统性的偏见,偏向美国的民主党和英国的工党。作为研究的一部分,研究人员将默认 ChatGPT 的回答与共和党、民主党、激进共和党和激进民主党版本的 ChatGPT 的回答进行了比较。这一研究设计旨在更好地识别哪种政治立场与常规 ChatGPT 最为接近。
Figure 3.6.6 shows strong positive correlations (blue lines) between the default ChatGPT, i.e., one that was answering questions without additional instructions, and both the Democrat and the radical Democrat ChatGPT versions, i.e., versions of ChatGPT that were asked to answer like a Democrat or radical Democrat. On the other hand, the researchers found a strong negative correlation between the default GPT and both Republican ChatGPTs. The identification of bias in these LLMs raises concerns about their potential to influence the political views and stances of users who engage with these tools. 图 3.6.6 显示了默认的 ChatGPT 与民主党和激进民主党 ChatGPT 版本之间的强烈正相关(蓝色线),即回答问题而无需额外指示的 ChatGPT,以及被要求像民主党或激进民主党回答问题的 ChatGPT 版本。另一方面,研究人员发现默认 GPT 与两个共和党 ChatGPT 之间存在强烈的负相关。在这些LLMs中发现偏见引发了人们对这些工具可能影响用户政治观点和立场的担忧。
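The comparison amounts to correlating the default model's answers with the answers of each persona-prompted variant across a battery of questions. The sketch below illustrates the idea with made-up Likert-style answer vectors; the study itself used a political questionnaire with repeated sampling.

```python
import numpy as np

# Made-up agreement scores (1 = strongly disagree ... 4 = strongly agree) per question.
default_gpt    = np.array([3, 4, 2, 4, 3, 1, 4, 2])
democrat_gpt   = np.array([3, 4, 2, 4, 4, 1, 4, 2])
republican_gpt = np.array([2, 1, 4, 1, 2, 4, 1, 3])

for name, persona in [("Democrat ChatGPT", democrat_gpt), ("Republican ChatGPT", republican_gpt)]:
    r = np.corrcoef(default_gpt, persona)[0, 1]   # Pearson correlation with the default model
    print(f"{name}: r = {r:.2f}")
```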
Default vs. political ChatGPT average agreement 默认与政治 ChatGPT 平均一致性
Source: Motoki et al., 2023 | Chart: 2024 Al Index report 来源:Motoki 等人,2023 年 | 图表:2024 年 Al 指数报告
Figure 3.6.6 图 3.6.6
Impact of AI on Political Processes 人工智能对政治进程的影响
There has been an increasing volume of research aimed at exploring some of the risks AI could pose to political processes. One topic of interest has been audio deepfakes. In July 2023, audio clips of a politician from India's Hindu party were released in which the politician attacked his own party and praised his political opponent. The politician claimed these audio clips were created using AI. However, even after deepfake experts were consulted, it could not be determined with certainty whether the clips were authentic or not. 越来越多的研究致力于探索人工智能可能对政治进程构成的风险。音频深度伪造是其中一个受关注的话题。2023 年 7 月,印度一个印度教政党的一位政治人物的音频剪辑被公开,其中该政治人物抨击自己的政党并赞扬其政治对手。这位政治人物声称这些音频剪辑是用人工智能制作的。然而,即使咨询了深度伪造专家,也无法确定这些剪辑是否真实。
Research published in 2023 suggests that humans generally have issues reliably detecting audio deepfakes. In their sample of 529 individuals, listeners only correctly detected deepfakes of the time. Figure 3.6.7 illustrates some of the other key findings from the study. The authors also expect detection accuracy to go down in the future as a result of improvements in audio generation methods. The rise of more convincing audio deepfakes increases the potential to manipulate political campaigns, defame opponents, and give politicians a "liar's dividend," the ability to dismiss damaging audio clips as fabrications. 2023 年发表的研究表明,人类普遍在可靠地检测音频深度伪造方面存在问题。在他们的样本中,有 529 名个体,听众只有 的时间正确检测到深度伪造。图 3.6.7 展示了研究的一些其他关键发现。作者还预期随着音频生成方法的改进,未来检测准确性会下降。更具说服力的音频深度伪造的兴起增加了操纵政治活动、诽谤对手以及使政客得到“说谎者红利”的潜力,即将破坏性的音频剪辑视为虚构。
Key research findings on audio deepfakes 音频深度伪造的关键研究发现
Source: Mai et al., 2023; Al Index, 2024 来源:Mai 等人,2023 年;Al Index,2024 年
Training humans to detect deepfakes only helps slightly 训练人类识别深度伪造视频只能稍微有所帮助
Shorter deepfakes are not easier to identify 更短的深度伪造视频并不更容易识别
Listening to clips more frequently does not aid detection 更频繁地听取音频片段并不能帮助检测
Spending more time does not improve detection 花费更多时间并不能提高检测能力
Participants do not improve detection without explicit feedback 参与者在没有明确反馈的情况下无法提高检测能力
Human crowd and top automated detectors have comparable performance 人类群体和顶尖自动检测器具有可比较的性能
Al can also influence political processes in other ways. Research from Queen's University Belfast notes other ways in which Al can affect political processes, and potential mitigations associated with different risk cases (Figure 3.6.8). For instance, Al could be utilized for video surveillance of voters, potentially undermining the integrity of elections. The same authors identify the degree to which each Al political use case is technologically ready, the risk level it possesses, and how visible the deployment of AI would be to users (Figure 3.6.9). For example, they propose that employing AI for voter authentication is already highly feasible, and this application carries a significant risk. AI 还可以通过其他方式影响政治进程。贝尔法斯特女王大学的研究指出了 AI 影响政治进程的其他方式,以及与不同风险情形相关的潜在缓解措施(图 3.6.8)。例如,AI 可能被用于对选民进行视频监控,从而可能破坏选举的公正性。同样的作者还评估了每种 AI 政治应用在技术上的成熟程度、其风险水平,以及 AI 部署对用户的可见程度(图 3.6.9)。例如,他们提出,将 AI 用于选民身份验证在技术上已经高度可行,而这种应用同时带有重大风险。
Al usage, risks, and mitigation strategies in electoral processes 在选举过程中的 Al 使用、风险和缓解策略
Source: P et al., 2023 | Table: 2024 Al Index report 来源:P et al.,2023 年 | 表:2024 年 Al 指数报告
Avenue: Voter list maintenance 选民名单维护
Al usage: Heuristic-driven approximations; Record linkage; Outlier detection
Risks: Access-integrity trade-off issues; Biased Al; Overly generalized Al
Mitigations: Access-focused Al; Reasonable explanations; Local scrutiny

Avenue: Polling booth locations 投票站位置
Al usage: Drop box location determination; Facility location; Clustering
Risks: Business ethos; Volatility and finding costs; Partisan manipulation
Mitigations: Plural results; Auditing Al; Disadvantaged voters

Avenue: Predicting problem booths 预测问题投票站
Al usage: Predictive policing; Time series motifs
Risks: Systemic racism; Aggravating brutality; Feedback loops
Mitigations: Transparency; Statistical rigor; Fair Al

Avenue: Voter authentication 选民认证
Al usage: Face recognition; Biometrics
Risks: Race/gender bias; Unknown biases; Voter turnout; Surveillance and misc.
Mitigations: Alternatives; Bias audits; Designing for edge cases

Avenue: Video monitoring 视频监控
Al usage: Video-based vote counting; Event detection; Person re-identification
Risks: Electoral integrity; Marginalized communities; Undermining other monitoring; Shallow monitoring
Mitigations: Open data
Assessments of AI integration and risks in electoral processes 对 AI 在选举过程中的整合程度与风险的评估
Source: P et al., 2023 | Chart: 2024 Al Index report 来源:P 等人,2023 年 | 图表:2024 年 Al 指数报告
Figure 3.6.9 图 3.6.9
Preview 预览
Overview 概述
Chapter Highlights 章节亮点
4.1 What's New in 2023: A Timeline 4.1 2023 年的新变化:时间轴
4.2 Jobs 4.2 工作
Al Labor Demand AI 劳动力需求
Global Al Labor Demand 全球 AI 劳动力需求
U.S. AI Labor Demand by Skill Cluster and Specialized Skill 美国按技能群与专业技能划分的 AI 劳动力需求
U.S. AI Labor Demand by Sector 美国各行业的 AI 劳动力需求
U.S. Al Labor Demand by State 美国各州的 AI 劳动力需求
Al Hiring AI 招聘
Al Skill Penetration AI 技能渗透
Al Talent AI 人才
Highlight: How Much Do Computer Scientists Earn? 亮点:计算机科学家的收入是多少?
4.3 Investment 4.3 投资
Corporate Investment 企业投资
Startup Activity 创业活动
Global Trends 全球趋势
Regional Comparison by Funding Amount 按资金金额的区域比较
Regional Comparison by Newly Funded Al Companies 按新获资助的 AI 公司数量的区域比较
Focus Area Analysis 重点领域分析
4.4 Corporate Activity 4.4 公司活动
Industry Adoption 行业采用
Adoption of Al Capabilities AI 能力的采用
Adoption of Generative Al Capabilities 生成式 AI 能力的采用
Use of Al by Developers 开发人员对 AI 的使用
Preference 偏好
Workflow 工作流程
Al's Labor Impact AI 对劳动力的影响
Earnings Calls 财报电话会议
Aggregate Trends 总体趋势
Specific Themes 具体主题
Highlight: Projecting Al's Economic Impact 亮点:预测 AI 的经济影响
4.5 Robot Installations 4.5 机器人安装
Aggregate Trends 总体趋势
Industrial Robots: Traditional vs. Collaborative Robots 工业机器人:传统机器人 vs. 协作机器人
By Geographic Area 按地理区域
Country-Level Data on Service Robotics 服务机器人的国家级数据
Sectors and Application Types 部门和应用类型
China vs. United States 中国 vs. 美国
ACCESS THE PUBLIC DATA 访问公共数据
Overview 概述
The integration of AI into the economy raises many compelling questions. Some predict that Al will drive productivity improvements, but the extent of its impact remains uncertain. A major concern is the potential for massive labor displacement: To what degree will jobs be automated versus augmented by AI? Companies are already utilizing AI in various ways across industries, but some regions of the world are witnessing greater investment inflows into this transformative technology. Moreover, investor interest appears to be gravitating toward specific Al subfields like natural language processing and data management. 将 AI 整合到经济中引发了许多引人注目的问题。一些人预测 AI 将推动生产力提升,但其影响程度仍不确定。一个主要的关切是可能出现大规模的劳动力置换:工作在多大程度上会被自动化,又在多大程度上会被 AI 增强?各行业的公司已经在以各种方式使用 AI,但世界上一些地区正在迎来对这项变革性技术的更大投资流入。此外,投资者的兴趣似乎正在向特定的 AI 子领域聚集,如自然语言处理和数据管理。
This chapter examines Al-related economic trends using data from Lightcast, LinkedIn, Quid, McKinsey, Stack Overflow, and the International Federation of Robotics (IFR). It begins by analyzing Al-related occupations, covering labor demand, hiring trends, skill penetration, and talent availability. The chapter then explores corporate investment in AI, introducing a new section focused specifically on generative Al. It further examines corporate adoption of AI, assessing current usage and how developers adopt these technologies. Finally, it assesses Al's current and projected economic impact and robot installations across various sectors. 本章利用来自 Lightcast、LinkedIn、Quid、麦肯锡、Stack Overflow 和国际机器人联合会(IFR)的数据,分析与人工智能相关的经济趋势。本章首先分析与人工智能相关的职业,涵盖劳动力需求、招聘趋势、技能渗透和人才供应情况。随后探讨企业对 AI 的投资,并引入一个专门关注生成式人工智能的新部分。接着进一步考察企业对 AI 的采用情况,评估当前的使用状况以及开发人员如何采用这些技术。最后,评估人工智能当前和预期的经济影响,以及各行业的机器人安装情况。
Chapter Highlights 章节亮点
1. Generative Al investment skyrockets. Despite a decline in overall Al private investment last year, funding for generative AI surged, nearly octupling from 2022 to reach billion. Major players in the generative Al space, including OpenAl, Anthropic, Hugging Face, and Inflection, reported substantial fundraising rounds. 1. 生成式人工智能投资激增。尽管去年整体人工智能私人投资有所下降,但生成式 AI 的资金激增,较 2022 年增长近八倍,达到 亿美元。生成式人工智能领域的主要参与者,包括 OpenAI、Anthropic、Hugging Face 和 Inflection,均报告了大规模融资轮次。
2. Already a leader, the United States pulls even further ahead in Al private investment. 2. 美国已经是领先者,在人工智能私人投资方面进一步拉开了差距。
In 2023, the United States saw Al investments reach billion, nearly 8.7 times more than China, the next highest investor. While private Al investment in China and the European Union, including the United Kingdom, declined by and , respectively, since 2022, the United States experienced a notable increase of in the same time frame.
3. Fewer Al jobs in the United States and across the globe. In 2022, Al-related positions made up of all job postings in America, a figure that decreased to in 2023. This decline in Al job listings is attributed to fewer postings from leading Al firms and a reduced proportion of tech roles within these companies. 3. 美国及全球的人工智能相关职位减少。2022 年,人工智能相关职位占美国所有职位发布的 ,这一数字在 2023 年降至 。人工智能职位发布数量的下降归因于领先人工智能公司发布的职位减少,以及这些公司中技术岗位比例的降低。
4. Al decreases costs and increases revenues. A new McKinsey survey reveals that of surveyed organizations report cost reductions from implementing AI (including generative AI), and report revenue increases. Compared to the previous year, there was a 10 percentage point increase in respondents reporting decreased costs, suggesting AI is driving significant business efficiency gains. 4. 人工智能降低成本并增加收入。麦肯锡的一项新调查显示,受访组织中有 报告实施 AI(包括生成式 AI)后成本下降, 报告收入增加。与上一年相比,报告成本下降的受访者比例提高了 10 个百分点,表明 AI 正在带来显著的业务效率提升。
5. Total AI private investment declines again, while the number of newly funded AI companies increases. Global private Al investment has fallen for the second year in a row, though less than the sharp decrease from 2021 to 2022. The count of newly funded Al companies spiked to 1,812, up from the previous year. 5. 人工智能私人投资总额再次下降,而新获资助的 AI 公司数量增加。全球私人人工智能投资连续第二年下降,但降幅小于 2021 年至 2022 年的急剧下跌。新获资助的人工智能公司数量激增至 1,812 家,较上一年有所增加。
6. Al organizational adoption ticks up. A 2023 McKinsey report reveals that of organizations now use AI (including generative AI) in at least one business unit or function, up from in 2022 and in 2017. 6. 组织对人工智能的采用率有所上升。2023 年麦肯锡报告显示,目前有 的组织至少在一个业务部门或职能中使用 AI(包括生成式 AI),高于 2022 年和 2017 年的水平。
7. China dominates industrial robotics. Since surpassing Japan in 2013 as the leading installer of industrial robots, China has significantly widened the gap with the nearest competitor nation. In 2013, China's installations accounted for of the global total, a share that rose to by 2022. 7. 中国主导工业机器人领域。自 2013 年超过日本成为工业机器人安装量最大的国家以来,中国与最接近的竞争国家之间的差距显著扩大。2013 年,中国的安装量占全球总量的 ,到 2022 年这一份额已上升至 。
Chapter Highlights (cont'd) 章节亮点(续)
8. Greater diversity in robotic installations. In 2017, collaborative robots represented a mere of all new industrial robot installations, a figure that climbed to by 2022. Similarly, 2022 saw a rise in service robot installations across all application categories, except for medical robotics. This trend indicates not just an overall increase in robot installations but also a growing emphasis on deploying robots for human-facing roles. 8. 机器人安装的多样性增加。2017 年,协作机器人仅占所有新增工业机器人安装量的 ,到 2022 年这一数字已攀升至 。同样,2022 年除医疗机器人外,各应用类别的服务机器人安装量均有所增加。这一趋势不仅表明机器人安装总量在增加,也表明人们越来越重视将机器人部署在面向人类的角色中。
9. The data is in: AI makes workers more productive and leads to higher quality work. In 2023, several studies assessed Al's impact on labor, suggesting that Al enables workers to complete tasks more quickly and to improve the quality of their output. These studies also demonstrated Al's potential to bridge the skill gap between low- and high-skilled workers. Still other studies caution that using Al without proper oversight can lead to diminished performance. 9. 数据显示:人工智能使工人更具生产力,并带来更高质量的工作。2023 年,多项研究评估了人工智能对劳动力的影响,表明人工智能使工人能够更快地完成任务并提高产出质量。这些研究还展示了人工智能在弥合低技能与高技能工人之间技能差距方面的潜力。另有一些研究提醒,在缺乏适当监督的情况下使用人工智能可能导致绩效下降。
10. Fortune 500 companies start talking a lot about AI, especially generative AI. 10. 财富 500 强公司开始大量谈论 AI,尤其是生成式 AI。
In 2023, Al was mentioned in 394 earnings calls (nearly of all Fortune 500 companies), a notable increase from 266 mentions in 2022. Since 2018, mentions of Al in Fortune 500 earnings calls have nearly doubled. The most frequently cited theme, appearing in of all earnings calls, was generative Al. 2023 年,Al 在 394 次盈利电话会议中被提及(几乎占所有财富 500 强公司的 ),较 2022 年的 266 次提及有显著增加。自 2018 年以来,在财富 500 强公司的盈利电话会议中提及 Al 的次数几乎翻了一番。最经常被引用的主题,出现在所有盈利电话会议的 中,是生成 Al。
The chapter begins with an overview of some of the most significant Al-related economic events in 2023, as selected by the Al Index Steering Committee. 本章以 Al 指数指导委员会精选的 2023 年一些最重要的与 Al 相关的经济事件概述开始。
4.1 What's New in 2023: A Timeline 2023 年的新变化:时间轴
InstaDeep acquired by BioNTech InstaDeep 被 BioNTech 收购
BioNTech, known for developing the first mRNA COVID-19 vaccine in partnership with Pfizer, acquires InstaDeep for million to advance Al-powered drug discovery, design, and development. InstaDeep specializes in creating Al systems for enterprises in biology, logistics, and energy sectors. BioNTech,以与辉瑞合作开发第一款 mRNA COVID-19 疫苗而闻名,以 百万美元收购 InstaDeep,以推动基于人工智能的药物发现、设计和开发。InstaDeep 专注于为生物学、物流和能源领域的企业创建人工智能系统。
Microsoft invests billion in ChatGPT maker OpenAI 微软向 ChatGPT 制造商 OpenAI 投资 十亿美元
With this deal, Microsoft Azure remains the exclusive cloud provider for OpenAl, which relies on Azure to train its models. This follows Microsoft's initial billion investment in 2019 and a subsequent investment in 2021. 通过这项交易,微软 Azure 仍然是 OpenAI 的独家云服务提供商,OpenAI 依赖 Azure 来训练其模型。这是继 2019 年微软最初投资 十亿美元和 2021 年随后的投资之后。
GitHub Copilot for Business becomes publicly available GitHub Copilot for Business 正式公开发布
Copilot for Business leverages an OpenAl Codex model to enhance code suggestion quality. At launch, GitHub Copilot contributed to an average of of developers' code across various programming languages, with this figure rising to for Java. Copilot for Business 利用 OpenAI Codex 模型来提升代码建议的质量。在推出时,GitHub Copilot 在各种编程语言中平均贡献了开发者 的代码,在 Java 中这一比例更高,达到 。
Salesforce introduces Einstein GPT Salesforce 推出 Einstein GPT
Einstein GPT, the first comprehensive Al for CRM, utilizes OpenAl's models. Einstein GPT aids Salesforce customers in sales, marketing, and customer management. 爱因斯坦 GPT,第一个为 CRM 设计的综合人工智能,利用了 OpenAI 的模型。爱因斯坦 GPT 帮助 Salesforce 客户进行销售、营销和客户管理。
Microsoft announces integration of GPT-4 into Office 365 微软宣布将 GPT-4 集成到 Office 365 中
Microsoft rolls out Copilot across Office 365, offering assistance in Word, PowerPoint, and Excel. 微软在 Office 365 中全面推出 Copilot,在 Word、PowerPoint 和 Excel 中提供协助。
Copilot for Microsoft 365 Copilot for Microsoft 365
Bloomberg announces LLM for finance 彭博宣布金融 LLM
Bloomberg's 50-billion parameter LLM is custom-built for analyzing financial data and tailored to finance professionals. This model is capable of performing financial analyses on Bloomberg's extensive datasets. 彭博的 50 亿参数 LLM 是专为分析金融数据而定制的,专为金融专业人士量身打造。该模型能够在彭博庞大的数据集上进行金融分析。
Adobe introduces generative Al features in Photoshop via Adobe Firefly, its generative image tool. Users can now add, remove, and edit images within seconds using text prompts. Adobe 在 Photoshop 中通过其生成图像工具 Adobe Firefly 引入了生成 Al 功能。用户现在可以使用文本提示在几秒钟内添加、删除和编辑图像。
Cohere raises million Cohere 筹集了 百万美元
Cohere, focused on developing an Al model ecosystem for enterprises, raises million in an oversubscribed Series C round. Inovia Capital led the round, with participation from Nvidia, Oracle, Salesforce Ventures, Schroders Capital, and Index Ventures. Cohere 专注于为企业开发 Al 模型生态系统,在一轮超额认购的 C 轮融资中筹集了 百万美元。Inovia Capital 领投,Nvidia、Oracle、Salesforce Ventures、Schroders Capital 和 Index Ventures 参与了本轮融资。
Source: Cohere, 2023 源:Cohere,2023
Figure 4.1.8 图 4.1.8
Nvidia reaches trillion valuation Nvidia 达到 万亿估值
Nvidia's market capitalization consistently exceeds trillion USD, driven by rising demand for its Al-powering chips. Nvidia becomes the fifth company to reach a valuation of trillion, joining the ranks of Apple Inc. (AAPL.O), Alphabet Inc. (GOOGL.O), Microsoft Corp. (MSFT.O), and Amazon.com Inc. (AMZN.O). Nvidia 的市值持续超过 万亿美元,受益于其 AI 芯片需求的增长。Nvidia 成为第五家市值达到 万亿美元的公司,跻身苹果公司(AAPL.O)、Alphabet(GOOGL.O)、微软公司(MSFT.O)和亚马逊公司(AMZN.O)之列。
Source: The Brand Hopper, 2023 来源:The Brand Hopper,2023 年
Figure 4.1.9 图 4.1.9
Databricks buys MosaicML for billion Databricks 以 亿美元收购 MosaicML
Databricks, a leader in data storage and management, announces its acquisition of MosaicML, a generative AI orchestration startup founded in 2021, for billion. This move aims to enhance Databricks' generative Al capabilities. 数据存储和管理领域的领导者 Databricks 宣布以 亿美元收购了成立于 2021 年的生成 AI 编排初创公司 MosaicML。此举旨在增强 Databricks 的生成 AI 能力。
Thomson Reuters acquires Casetext for million 汤森路透以 百万美元收购 Casetext
Thomson Reuters finalizes its acquisition of Casetext, a legal startup renowned for its artificial intelligence-powered assistant for law, for a staggering million. At the time of acquisition, Casetext boasted a substantial customer base of over 10,000 law firms and corporate legal departments. Among its flagship offerings is CoCounsel, an Al legal assistant driven by GPT-4, which enables rapid document review, legal research memos, deposition preparation, and contract analysis within minutes. 汤森路透完成了对 Casetext 的收购,这家以人工智能驱动的法律助手而闻名的法律初创公司,收购金额高达 百万美元。在收购时,Casetext 拥有超过 10,000 家律师事务所和企业法务部门的庞大客户群。其旗舰产品之一是 CoCounsel,这是一款由 GPT-4 驱动的 AI 法律助手,可以在几分钟内完成快速文件审查、法律研究备忘录、证词准备和合同分析。
Figure 4.1.12 attracts investments from Microsoft, Nvidia, Reid Hoffman, Bill Gates, and Eric Schmidt, former CEO of Google. 图 4.1.12 吸引了微软、英伟达、里德·霍夫曼、比尔·盖茨和谷歌前 CEO 埃里克·施密特的投资。
Hugging Face raises million from investors Hugging Face 从投资者那里筹集了 百万美元
Hugging Face, a platform and community dedicated to machine learning and data science, secures an impressive million funding round, pushing its valuation to billion. The platform serves as a one-stop destination for building, deploying, and training machine learning models. Offering a GitHub-like hub for Al code repositories, models, and datasets, Hugging Face has attracted significant attention from industry giants. Hugging Face,一个致力于机器学习和数据科学的平台和社区,获得了一轮令人印象深刻的 百万美元融资,将其估值推高至 十亿美元。该平台是构建、部署和训练机器学习模型的一站式目的地。提供类似 GitHub 的 Al 代码存储库、模型和数据集中心,Hugging Face 吸引了行业巨头的重视。
SAP introduces new generative Al assistant Joule SAP 推出新的生成式 AI 助手 Joule
Joule is a ChatGPT-style digital assistant integrated across SAP's diverse product range. Joule will seamlessly integrate into SAP applications spanning HR, finance, supply chain, procurement, and customer experience. Joule 是一款类似 ChatGPT 风格的数字助手,集成在 SAP 广泛的产品范围中。Joule 将无缝集成到涵盖人力资源、财务、供应链、采购和客户体验的 SAP 应用程序中。
Additionally, it will be incorporated into the SAP Business Technology Platform, extending its utility across SAP's extensive user base of nearly 300 million. 此外,它还将被整合到 SAP 业务技术平台中,将其实用性扩展到 SAP 近 3 亿的庞大用户群。
Introducing Joule 介绍 Joule
Source: SAP, 2023 来源:SAP,2023
Figure 4.1.14 图 4.1.14
Amazon and Google make multibillion-dollar investments in Anthropic 亚马逊和谷歌在 Anthropic 进行数十亿美元的投资
Amazon announces its intent to invest up to billion in Anthropic, a rival of OpenAl. This significant investment follows Google's agreement to invest up to billion in Anthropic. The deal comprises an initial million upfront, with an additional billion to be invested over time. 亚马逊宣布计划向 OpenAI 的竞争对手 Anthropic 投资高达 亿美元。这笔重大投资是在谷歌同意向 Anthropic 投资高达 亿美元之后进行的。该交易包括首期 百万美元的投资,随后将再投资 亿美元。
Kai-Fu Lee's LLM startup publicly unveils an open-source model and secures funding at a billion valuation, with Alibaba leading the investment. Lee, known for his leadership roles at Google in China and for establishing Microsoft Research China, one of Microsoft's key international research hubs, spearheads this initiative. 李开复的 LLM 初创公司公开展示了一种开源模式,并以 十亿美元的估值获得了资金,阿里巴巴领投。李开复因在谷歌中国的领导职务以及建立微软亚洲研究院(微软的重要国际研究中心之一)而闻名,他领导了这一倡议。
Sam Altman, OpenAI CEO, fired and then rehired OpenAI CEO Sam Altman 被解雇然后重新雇佣
OpenAl's board claims Altman was "not consistently candid in his communications." Chaos ensues at OpenAl. Many employees resign in response to the news, and 745 sign a letter threatening resignation if the current board members do not resign. A few days later, Altman is reinstated. OpenAI 董事会称 Altman 在沟通上“不一贯坦诚”。OpenAI 遭受混乱。许多员工因此辞职,745 人签署一封威胁辞职的信,要求现任董事会成员辞职。几天后,Altman 被恢复原职。
Source: CoinGape, 2024 来源:CoinGape,2024
Figure 4.1.17 图 4.1.17
Dec. 11, 12 月 11 日
2023
Mistral Al closes million funding round Mistral AI 完成 百万美元融资
Less than six months after raising a million seed round, Europe-based Mistral Al secures an additional million. The startup, cofounded by alumni from Google's DeepMind and Meta, focuses on developing foundation models with an open-source technology approach, aiming to compete with OpenAl. Leading the round is Andreessen Horowitz, with participation from Lightspeed Venture Partners, Salesforce, BNP Paribas, General Catalyst, and Elad Gil. 在完成 百万美元种子轮融资后不到六个月,总部位于欧洲的 Mistral AI 又获得了额外的 百万美元。这家由谷歌 DeepMind 和 Meta 校友共同创立的初创公司,专注于以开源技术路线开发基础模型,旨在与 OpenAI 竞争。本轮融资由 Andreessen Horowitz 领投,Lightspeed Venture Partners、Salesforce、BNP Paribas、General Catalyst 和 Elad Gil 参投。
4.2 Jobs 4.2 就业机会
Al Labor Demand 人工智能劳动需求
This section analyzes the demand for Al-related skills in labor markets, drawing on data from Lightcast. 本节分析了劳动力市场对人工智能相关技能的需求,利用了来自 Lightcast 的数据。
Lightcast has analyzed hundreds of millions of job postings from over 51,000 websites since 2010, identifying those that require AI skills. 自 2010 年以来,Lightcast 已经分析了来自超过 51,000 个网站的数亿份职位发布,从中识别出需要 AI 技能的岗位。
Global AI Labor Demand 全球人工智能劳动力需求
Figure 4.2.1 shows the percentage of job postings demanding Al skills. In 2023, the United States (1.6%), Spain (1.4%), and Sweden (1.3%) led in this metric. In 2022, Al-related jobs accounted for of all American job postings. In 2023, that number dropped to . Although most countries saw a decrease from 2022 to 2023 in the share of job postings requiring AI skills, in many, the number of AI-related job postings over the past five years has increased.¹ 图 4.2.1 显示了要求 AI 技能的职位发布所占百分比。2023 年,美国(1.6%)、西班牙(1.4%)和瑞典(1.3%)在这一指标上领先。2022 年,与 AI 相关的职位占美国所有职位发布的 。2023 年,这一数字降至 。尽管大多数国家在 2022 年至 2023 年间要求 AI 技能的职位发布份额有所下降,但在许多国家,过去五年中与 AI 相关的职位发布数量有所增加。¹
Lightcast speculates that the 2023 decrease in AI job postings is driven by many top AI employers (such as Amazon, Deloitte, Capital One, Randstad, and Elevance Health) scaling back their overall posting counts. Additionally, many companies shifted the occupational mix of their postings. For example, Amazon, in 2023, posted a higher share of operational roles like sales delivery driver, packager, and postal service/mail room worker than in 2022. At the same time, there was a lower share of demand for tech roles like software developers and data scientists. Lightcast 推测,2023 年 AI 职位发布的减少是由于许多顶级 AI 雇主(如亚马逊、德勤、Capital One、Randstad 和 Elevance Health)缩减了总体职位发布数量。此外,许多公司调整了所发布职位的职业构成。例如,亚马逊在 2023 年发布了更多的运营类岗位,如销售送货司机、包装工和邮政服务/邮件室工作人员;与此同时,对软件开发人员和数据科学家等技术岗位的需求份额有所下降。
Al job postings (% of all job postings) by geographic area, 2014-23 2014 年至 2023 年各地区 Al 工作岗位(占所有工作岗位的百分比)
Source: Lightcast, 2023 | Chart: 2024 Al Index report 源:Lightcast,2023 年 | 图表:2024 年 Al 指数报告
1 In 2023, Lightcast slightly changed its methodology for determining Al-related job postings from what was used in previous versions of the Al Index report. Lightcast also updated its taxonomy of AI-related skills. As such, some of the numbers in this chart do not completely align with those featured in last year's report. 2023 年,Lightcast 略微改变了确定与人工智能相关的工作职位的方法,与 Al 指数报告以往版本中所使用的方法不完全一致。Lightcast 还更新了其人工智能相关技能的分类。因此,本图表中的一些数字与去年报告中的数字并不完全一致。
U.S. AI Labor Demand by Skill Cluster and Specialized Skill 美国按技能群和专业技能分类的人工智能劳动力需求
Figure 4.2.2 highlights the most sought-after AI skills in the U.S. labor market since 2010. Leading the demand was machine learning at , with artificial intelligence at , and natural language processing at . Despite a recent dip, machine learning continues to be the most in-demand skill. Since last year, every Al-related skill cluster tracked by Lightcast had a decrease in market share, with the exception of generative Al, which grew by more than a factor of 10. 图 4.2.2 突出显示了自 2010 年以来美国劳动力市场中最受追捧的 AI 技能。需求最大的是机器学习,其次是人工智能和自然语言处理。尽管最近有所下滑,机器学习仍然是需求最大的技能。自去年以来,Lightcast 跟踪的每个 AI 相关技能群的市场份额都有所下降,唯一的例外是生成式 AI,其增长超过 10 倍。
Al job postings (% of all job postings) in the United States by skill cluster, 2010-23 美国按技能群体划分的 Al 职位发布情况(占所有职位发布的百分比),2010 年至 2023 年
Figure 4.2.2 图 4.2.2
Figure 4.2.3 compares the top 10 specialized skills sought in Al job postings in 2023 versus those from 2011 to 2013. On an absolute scale, the demand for nearly every specialized skill has increased over the past decade, with Python's notable increase in popularity highlighting its ascendance as a preferred Al programming language. 图 4.2.3 比较了 2023 年 Al 工作发布中寻求的前 10 种专业技能与 2011 年至 2013 年的情况。 从绝对规模来看,过去十年几乎每种专业技能的需求都有所增加,Python 显著增加的受欢迎程度凸显了它作为首选 Al 编程语言的地位提升。
Top 10 specialized skills in 2023 Al job postings in the United States, 2011-13 vs. 2023 美国 2023 年 Al 工作发布中的前 10 种专业技能,2011-13 年与 2023 年对比
Figure 4.2.3 图 4.2.3
2 The decision to select 2011-2013 as the point of comparison was because some data at the jobs/skills level from earlier years is quite sparse. Lightcast therefore used this period to have a larger sample size for a benchmark from 10 years ago with which to compare. Figure 4.2.3 juxtaposes the total number of job postings requiring certain skills from 2011 to 2013 with the total amount in 2023. 选择 2011-2013 年作为比较基准,是因为更早年份在岗位/技能层面的一些数据相当稀疏。因此,Lightcast 使用这一时期来获得更大的样本量,以便与 10 年前的基准进行比较。图 4.2.3 将 2011 年至 2013 年需要某些技能的职位发布总数与 2023 年的总数进行了对比。
In 2023, Lightcast saw great increases in the number of U.S. job postings citing generative Al skills. That year, 15,410 job postings specifically cited generative AI as a desired skill, large language modeling was mentioned in 4,669 postings, and ChatGPT appeared in 2,841 job listings (Figure 4.2.4). 2023 年,Lightcast 观察到美国引用生成式 AI 技能的职位发布数量大幅增加。当年,有 15,410 个职位发布明确将生成式 AI 列为所需技能,大型语言建模在 4,669 个职位中被提及,ChatGPT 出现在 2,841 个职位列表中(图 4.2.4)。
Generative Al skills in Al job postings in the United States, 2023 2023 年美国工作岗位中的生成式 AI 技能
Figure 4.2.4 图 4.2.4
Figure 4.2.5 illustrates what proportion of all generative AI job postings released in 2023 referenced particular generative AI skills. The most cited skill was generative AI (60.0%), followed by large language modeling (18.2%) and ChatGPT (11.1%). 图 4.2.5 展示了 2023 年发布的所有生成式 AI 职位中引用特定生成式 AI 技能的比例。被引用最多的技能是生成式 AI(60.0%),其次是大型语言建模(18.2%)和 ChatGPT(11.1%)。
Share of generative Al skills in Al job postings in the United States, 2023 2023 年美国人工智能工作岗位中生成式 AI 技能的份额
Figure 4.2.5 图 4.2.5
U.S. Al Labor Demand by Sector 美国各行业劳动力需求
Figure 4.2.6 shows the percentage of U.S. job postings requiring AI skills by industry sector from 2022 to 2023. Nearly every sector experienced a decrease in the proportion of AI job postings in 2023 compared to 2022, except for public administration and educational services. The leading sectors were information (4.6%); professional, scientific, and technical services (3.3%); and finance and insurance (2.9%). As noted earlier, the decrease in Al job postings was related to changes in the hiring patterns of several major U.S. employers. 图 4.2.6 显示了 2022 年至 2023 年美国各行业要求 AI 技能的职位发布所占百分比。与 2022 年相比,2023 年几乎每个行业的 AI 职位发布比例都有所下降,公共管理和教育服务除外。领先的行业是信息业(4.6%);专业、科学和技术服务业(3.3%);以及金融和保险业(2.9%)。如前所述,AI 职位发布的减少与几家美国主要雇主招聘模式的变化有关。
Al job postings (% of all job postings) in the United States by sector, 2022 vs. 2023 美国各行业 2022 年与 2023 年的所有工作岗位(所有工作岗位的百分比)
Source: Lightcast, 2023 | Chart: 2024 Al Index report 数据来源:Lightcast,2023 年 | 图表:2024 年 Al 指数报告
U.S. Al Labor Demand by State 美国各州的 Al 劳动力需求
Figure 4.2.7 highlights the number of Al job postings in the United States by state. The top three states were California, followed by Texas and Virginia. 图 4.2.7 突出显示了美国各州的 AI 职位发布数量。排名前三的州是加利福尼亚州,其次是德克萨斯州和弗吉尼亚州。
Figure 4.2.8 demonstrates what percentage of a state's total job postings were Al-related. The top states according to this metric were the District of Columbia (2.7%), followed by Delaware (2.4%) and Maryland (2.1%). 图 4.2.8 展示了一个州的总工作岗位中有多少百分比是与 Al 相关的。根据这个指标,排名靠前的州是哥伦比亚特区(2.7%),其次是特拉华州(2.4%)和马里兰州(2.1%)。
Number of Al job postings in the United States by state, 2023 美国各州的 Al 工作岗位数量,2023 年
Source: Lightcast, 2023 | Chart: 2024 Al Index report 源:Lightcast,2023 | 图表:2024 Al 指数报告
Figure 4.2.7 图 4.2.7
Figure 4.2.9 examines which U.S. states accounted for the largest proportion of AI job postings nationwide. California was first: of all Al job postings in the United States were for jobs based in California, followed by Texas (7.9%) and Virginia (5.3%). 图 4.2.9 考察了哪些美国州在全国 AI 职位发布中所占比例最大。加利福尼亚州位居第一:美国所有 AI 职位发布中有 是位于加利福尼亚州的工作,其次是德克萨斯州(7.9%)和弗吉尼亚州(5.3%)。
Percentage of US AI job postings by state, 2023 2023 年各州美国人工智能工作岗位的百分比
Source: Lightcast, 2023 | Chart: 2024 Al Index report 来源:Lightcast,2023 年 | 图表:2024 年 Al 指数报告
Figure 4.2.9 图 4.2.9
Figure 4.2.10 illustrates the trends in the four states with highest Al job postings: Washington, California, New York, and Texas. Each experienced a notable decline in the share of total Al-related job postings from 2022 to 2023. 图 4.2.10 展示了四个拥有最多 Al 工作岗位的州的趋势:华盛顿、加利福尼亚、纽约和德克萨斯。每个州在 2022 年至 2023 年间 Al 相关工作岗位的占比均出现显著下降。
Percentage of US states' job postings in AI by select US state, 2010-23 2010 年至 2023 年间按照美国州选择的 AI 工作岗位百分比
Source: Lightcast, 2023 | Chart: 2024 Al Index report 来源:Lightcast,2023 年 | 图表:2024 年 Al 指数报告
Figure 4.2.10 图 4.2.10
Figure 4.2.11 shows how Al-related job postings have been distributed across the top four states over time. Since 2019, California's proportion of AI job postings has steadily declined, while Texas has seen a slight increase. 图 4.2.11 显示了随着时间的推移,与 Al 相关的工作岗位是如何在前四个州之间分布的。自 2019 年以来,加利福尼亚州的 AI 工作岗位比例稳步下降,而德克萨斯州略有增加。
Percentage of US AI job postings by select US state, 2010-23 2010-23 年间美国 AI 工作岗位的百分比,按选定的美国州划分
Figure 4.2.11 图 4.2.11
Al Hiring 阿尔雇佣
The hiring data presented in the Al Index is based on a LinkedIn dataset of skills and jobs that appear on their platform. The geographic areas included in the sample make at least hires each month and have LinkedIn covering a substantial portion of the labor force. LinkedIn's coverage of India's and South Korea's sizable labor forces falls below this threshold, so insights drawn about these countries should be interpreted with particular caution. Al 指数中呈现的招聘数据基于 LinkedIn 平台上出现的技能和职位数据集。样本中包括的地理区域每月至少有 次招聘,且 LinkedIn 覆盖了相当大比例的劳动力。LinkedIn 对印度和韩国庞大劳动力的覆盖率低于这一阈值,因此对这些国家得出的结论应格外谨慎地解读。
Figure 4.2.12 reports the relative AI hiring rate year-over-year ratio by geographic area. The overall hiring rate is computed as the percentage of LinkedIn members who added a new employer in the same period the job began, divided by the total number of LinkedIn members in the corresponding location. Conversely, the relative AI talent hiring rate is the year-over-year change in AI hiring relative to the overall hiring rate in the same geographic area. Therefore, Figure 4.2.12 illustrates which specific regions have experienced the most significant rise in AI talent recruitment compared to the overall hiring rate, serving as an indicator of Al hiring vibrancy. In 2023, the regions with the greatest relative AI hiring rates year over year were Hong Kong (28.8%), followed by Singapore (18.9%) and Luxembourg (18.9%). This means, for example, that in 2023 in Hong Kong, the ratio of AI talent hiring relative to overall hiring grew . 图 4.2.12 报告了按地理区域划分的相对 AI 招聘率同比变化。总体招聘率的计算方法是:在职位开始的同一时期内添加新雇主的 LinkedIn 会员所占百分比,除以相应地区的 LinkedIn 会员总数。相应地,相对 AI 人才招聘率是 AI 招聘相对于同一地理区域总体招聘率的同比变化。因此,图 4.2.12 展示了哪些地区的 AI 人才招聘相对于总体招聘率出现了最显著的增长,可作为 AI 招聘活力的指标。2023 年,相对 AI 招聘率同比增长最大的地区是香港(28.8%),其次是新加坡(18.9%)和卢森堡(18.9%)。这意味着,例如,2023 年在香港,AI 人才招聘相对于总体招聘的比率增长了 。
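In code, the metric amounts to a ratio of two hiring rates, its year-over-year change, and a 12-month moving average (see footnote 3 below). The pandas sketch assumes a monthly table of AI hires, overall hires, and member counts for one geographic area; the numbers are synthetic.

```python
import numpy as np
import pandas as pd

# Synthetic monthly data for one geographic area.
months = pd.date_range("2021-01-01", periods=36, freq="MS")
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "ai_hires": rng.integers(80, 120, len(months)),
    "all_hires": rng.integers(4000, 5000, len(months)),
    "members": 1_000_000,
}, index=months)

# AI hiring rate relative to the overall hiring rate in the same area.
df["ai_rate"] = df["ai_hires"] / df["members"]
df["overall_rate"] = df["all_hires"] / df["members"]
df["relative_ai"] = df["ai_rate"] / df["overall_rate"]

# Year-over-year change of the ratio, smoothed with a 12-month moving average.
df["yoy_change"] = df["relative_ai"] / df["relative_ai"].shift(12) - 1
df["relative_ai_hiring_index"] = df["yoy_change"].rolling(12).mean()

print(df["relative_ai_hiring_index"].dropna().tail())
```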
Relative Al hiring rate year-over-year ratio by geographic area, 2023 2023 年按地理区域划分的相对 Al 招聘率年度比率
Source: Linkedln, 2023 | Chart: 2024 Al Index report 来源:Linkedln,2023 年 | 图表:2024 年 Al 指数报告
Figure 4.2.13 showcases the year-over-year ratio of AI hiring by geographic area over the past five years. Starting from the beginning of 2023, countries including Australia, Canada, Singapore, and India have experienced a noticeable uptick in AI hiring.
3 For each month, LinkedIn calculates the AI hiring rate in the geographic area, divides the AI hiring rate by the overall hiring rate in that geographic area, calculates the year-over-year change of this ratio, and then takes the 12-month moving average using the last 12 months.
4 For brevity, the visualization only includes the top 15 countries for this metric. 为简洁起见,可视化仅包括此指标的前 15 个国家。
Relative Al hiring rate year-over-year ratio by geographic area, 2018-23 按地理区域划分的相对人工智能招聘率年增长比率,2018-23
Source: Linkedln, 2023 | Chart: 2024 Al Index report 来源:领英,2023 | 图表:2024 年人工智能指数报告
AI Skill Penetration
Figures 4.2.14 and 4.2.15 highlight relative AI skill penetration. The aim of this indicator is to measure the intensity of AI skills in an entity (such as a particular country, industry, or gender). The AI skill penetration rate signals the prevalence of AI skills across occupations, or the intensity with which LinkedIn members utilize AI skills in their jobs. For example, the top 50 skills for the occupation of engineer are calculated based on the weighted frequency with which they appear in LinkedIn members' profiles. If, for instance, four of the skills that engineers possess belong to the AI skill group, the penetration of AI skills among engineers is estimated to be 8% (4 out of 50).
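As a toy illustration of the penetration calculation in the example above (the function and inputs are assumptions for exposition, not LinkedIn's actual methodology; the relative rate reported in the figures further normalizes this value by the global average across the same occupations):

```python
def ai_skill_penetration(top_skills, ai_skill_group):
    """Fraction of an occupation's top skills that belong to the AI skill group.

    top_skills: the occupation's top 50 skills, ranked by weighted frequency of
    appearance in member profiles; ai_skill_group: the set of skills counted as AI.
    """
    ai_count = sum(1 for skill in top_skills if skill in ai_skill_group)
    return ai_count / len(top_skills)


# Example from the text: 4 of an engineer's top 50 skills are AI skills,
# so ai_skill_penetration(engineer_top_50, AI_SKILLS) returns 0.08 (8%).
```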
For the period from 2015 to 2023, the countries with the highest Al skill penetration rates were India (2.8), the United States (2.2), and Germany (1.9). In the United States, therefore, the relative penetration of Al skills was 2.2 times greater than the global average across the same set of occupations. 从 2015 年到 2023 年,具有最高 Al 技能渗透率的国家是印度(2.8)、美国(2.2)和德国(1.9)。因此,在美国,Al 技能的相对渗透率比同一组职业的全球平均水平高 2.2 倍。
Relative Al skill penetration rate by geographic area, 2015-23 2015 年至 2023 年各地区相对 Al 技能渗透率
Figure 4.2.15 disaggregates AI skill penetration rates by gender across different countries and regions. A country's rate of 1.5 for women means that female members in that country are 1.5 times more likely to list AI skills than the average member across all countries pooled together, over the same set of occupations. For all countries in the sample, the relative AI skill penetration rate is greater for men than for women. India (1.7), the United States (1.2), and Israel (0.9) have the highest reported relative AI skill penetration rates for women.
Relative Al skill penetration rate across gender, 2015-23 2015-23 年间跨性别的相对 Al 技能渗透率
Source: LinkedIn, 2023 | Chart: 2024 AI Index report 来源:LinkedIn,2023 年 | 图表:2024 年 AI 指数报告
Figure 4.2.15 图 4.2.15
AI Talent
Figures 4.2.16 to 4.2.18 examine AI talent by country. A LinkedIn member is considered AI talent if they have explicitly added AI skills to their profile or work in AI. Counts of AI talent are used to calculate AI talent concentration, or the portion of members who are AI talent. Note that concentration metrics may be influenced by LinkedIn coverage in these countries and should be used with caution.
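A minimal sketch of the concentration calculation, assuming a simple member record with two illustrative boolean fields (not LinkedIn's actual schema):

```python
def ai_talent_concentration(members):
    """Share of members who count as AI talent: those who list AI skills on
    their profile or work in AI (field names here are illustrative)."""
    members = list(members)
    if not members:
        return 0.0
    ai_talent = [m for m in members if m["has_ai_skills"] or m["works_in_ai"]]
    return len(ai_talent) / len(members)
```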
Figure 4.2.16 shows AI talent concentration in various countries. In 2023, the countries with the highest concentrations of AI talent included Israel (1.1%), Singapore (0.9%), and South Korea (0.8%).
Figure 4.2.17 looks at the percent change in talent concentration for a selection of countries since 2016. During that time period, several major economies registered substantial increases in their talent pools. The countries showing the greatest increases are India (263%), Cyprus (229%), and Denmark (213%). 图 4.2.17 研究了自 2016 年以来一些国家人才集中度的百分比变化。在那段时间内,几个主要经济体的人才库都有大幅增加。显示增幅最大的国家是印度(263%)、塞浦路斯(229%)和丹麦(213%)。
Al talent concentration by geographic area, 2023 2023 年各地区的人才集中度
Source: Linkedln, 2023 | Chart: 2024 Al Index report 来源:领英,2023 年 | 图表:2024 年人工智能指数报告
Figure 4.2.16 图 4.2.16
Percentage change in talent concentration by geographic area, 2016 vs. 2023 2016 年与 2023 年按地理区域划分的 人才集中度百分比变化
Source: Linkedln, 2023 | Chart: 2024 Al Index report 来源:Linkedln,2023 年 | 图表:2024 年 Al 指数报告
Figure 4.2.17 图 4.2.17
Al talent concentration by gender, 2016-23 2016-23 年按性别的人才集中度
Source: Linkedln, 2023 | Chart: 2024 Al Index report 来源:Linkedln,2023 年 | 图表:2024 年 Al 指数报告
Figure 4.2.18 图 4.2.18
LinkedIn data provides insights on the AI talent gained or lost due to migration trends. Net flows are defined as total arrivals minus departures within the given time period. Figure 4.2.19 examines net AI talent migration per 10,000 LinkedIn members by geographic area. The countries that report the greatest incoming migration of AI talent are Luxembourg, Switzerland, and the United Arab Emirates.
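A one-line sketch of the net-flow metric just defined (variable names are illustrative assumptions):

```python
def net_ai_talent_migration_per_10k(arrivals, departures, member_count):
    """Net AI talent migration: arrivals minus departures over the period,
    scaled per 10,000 LinkedIn members in the geographic area."""
    return (arrivals - departures) / member_count * 10_000
```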
Net Al talent migration per 10,000 Linkedln members by geographic area, 2023 2023 年按地理区域每 10,000 个 Linkedln 成员的净人工智能人才迁移
Source: Linkedln, 2023; World Bank Group, 2023 | Chart: 2024 Al Index report 来源:领英,2023 年;世界银行集团,2023 年 | 图表:2024 年 Al 指数报告
Figure 4.2.19 图 4.2.19
Figure 4.2.20 documents AI talent migration data over time. In the last few years, Israel, India, and South Korea have seen declining net AI talent migration figures, suggesting that talent has been increasingly flowing out of these countries.
Net Al talent migration per 10,000 Linkedln members by geographic area, 2019-23 2019-23 年按地理区域每 10,000 个 Linkedln 会员的净人工智能人才迁移
Source: LinkedIn, 2023; World Bank Group, 2023 | Chart: 2024 AI Index report 源:LinkedIn,2023 年;世界银行集团,2023 年 | 图表:2024 年 AI 指数报告
6 Asterisks indicate that a country's y-axis is scaled differently than the y-axes of the other countries.
Highlight: 突出显示:
How Much Do Computer Scientists Earn? 计算机科学家赚多少钱?
Every year, Stack Overflow conducts a survey of the professional developers in its community. The latest iteration of the survey profiled over 90,000 developers.
Through this survey, respondents were asked about their income. It is important to note that these respondents do not work exclusively with AI. However, examining developer salaries can serve as a means to approximate the compensation of talent in AI-adjacent industries. Figure 4.2.21 examines the salaries of professional developers disaggregated by position.
Salaries vary by position and geography. For instance, the average U.S. salary for a cloud infrastructure engineer is substantially higher than the global average for the same position. Both globally and in the United States, the highest compensated roles are senior executives, followed by engineering managers. For all surveyed positions, salaries are significantly higher in the United States than in other countries.
Highlight: 突出显示:
How Much Do Computer Scientists Earn? (cont’d) 计算机科学家赚多少钱?(续)
Median yearly salary by professional developer type, 2023 2023 年各种专业开发者的年均工资中位数
Figure 4.2.21 图 4.2.21
This section monitors AI investment trends, leveraging data from Quid, which analyzes investment data from more than 8 million companies worldwide, both public and private. Employing natural language processing, Quid sifts through vast unstructured datasets, including news aggregations, blogs, company records, and patent databases, to detect patterns and insights. Additionally, Quid is constantly expanding its database to include more companies, sometimes resulting in higher reported investment volumes for specific years. For the first time, this year's investment section in the AI Index includes data on generative AI investments.
4.3 Investment 4.3 投资
Corporate Investment 企业投资
Figure 4.3.1 illustrates the trend in global corporate AI investment from 2013 to 2023, including mergers and acquisitions, minority stakes, private investments, and public offerings. For the second consecutive year, global corporate investment in AI has declined.
In 2023, total AI-related corporate investment dropped by approximately 20% from 2022. Despite a slight reduction in private investment, the most significant downturn occurred in mergers and acquisitions, which fell sharply from the previous year. Over the past decade, however, AI-related investments have increased thirteenfold.
Global corporate investment in AI by investment activity, 2013-23 2013-2023 年间全球企业在人工智能方面的投资活动
Source: Quid, 2023 | Chart: 2024 Al Index report 来源: Quid, 2023 | 图表: 2024 Al 指数报告
Figure 4.3.1 图 4.3.1
Startup Activity 创业活动
This section analyzes private investment trends in artificial intelligence startups that have received over $1.5 million in investment since 2013. 自 2013 年以来,已经获得 150 万美元以上投资的人工智能初创公司的私人投资趋势进行了分析。
Global Trends 全球趋势
Global private AI investment declined for the second consecutive year (Figure 4.3.2). However, the decrease from 2022 was small (-7.2%) and smaller than the drop observed from 2021 to 2022. Despite recent declines, global private AI investment has grown substantially over the last decade.
Private investment in Al, 2013-23 2013-23 年间的人工智能私人投资
Figure 4.3.2 图 4.3.2
While overall AI private investment decreased last year, funding for generative AI sharply increased (Figure 4.3.3). In 2023, the sector attracted nearly nine times the investment of 2022 and about 30 times the amount from 2019. Furthermore, generative AI accounted for over a quarter of all AI-related private investment in 2023.
Private investment in generative AI, 2019-23 2019-23 年生成 AI 的私人投资
Figure 4.3.3 图 4.3.3
Interestingly, the number of newly funded AI companies jumped to 1,812, an increase over the previous year (Figure 4.3.4). Figure 4.3.5 visualizes the average size of AI private investment events, calculated by dividing the total yearly AI private investment by the total number of AI private investment events. From 2022 to 2023, the average size increased marginally.
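A minimal sketch of the average-event-size calculation described above, assuming illustrative dictionaries keyed by year:

```python
def average_ai_event_size(yearly_investment_totals, yearly_event_counts):
    """Average AI private investment event size per year: total yearly AI private
    investment divided by the number of AI private investment events that year."""
    return {
        year: yearly_investment_totals[year] / yearly_event_counts[year]
        for year in yearly_investment_totals
        if yearly_event_counts.get(year)  # skip years with no recorded events
    }
```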
Number of newly funded Al companies in the world, 2013-23 2013-23 年全球新融资的人工智能公司数量
Average size of Al private investment events, 2013-23 2013-23 年人工智能私人投资事件的平均规模
Figure 4.3.4 图 4.3.4
Source: Quid, 2023 | Chart: 2024 Al Index report 源:Quid,2023 | 图表:2024 Al 指数报告
2023 marked a significant increase in the number of newly funded generative Al companies, with 99 new startups receiving funding, compared to 56 in 2022, and 31 in 2019 (Figure 4.3.6). 2023 年新获资助的生成式人工智能公司数量显著增加,有 99 家新创公司获得资金,而 2022 年为 56 家,2019 年为 31 家(图 4.3.6)。
Number of newly funded generative Al companies in the world, 2019-23 2019 年至 2023 年全球新获资助的生成式人工智能公司数量
Figure 4.3.7 reports AI funding events disaggregated by size. In 2023, AI private investment events decreased across nearly all funding size categories, except for those exceeding $500 million.
Al private investment events by funding size, 2022 vs. 2023 2022 年与 2023 年按融资规模划分的 Al 私人投资事件
Source: Quid, 2023 | Table: 2024 Al Index report 来源:Quid,2023 | 表格:2024 年 Al 指数报告
Funding Size                     2022      2023
Over $1 billion                     7         9
$500 million - $1 billion           6         7
$100 million - $500 million       187       120
$50 million - $100 million        260       182
Under $50 million               2,840     2,641
Undisclosed                       694       680
Total                           3,994     3,639
Regional Comparison by Funding Amount 根据资金金额进行区域比较
The United States once again led the world in total AI private investment. In 2023, the amount invested in the United States was roughly 8.7 times greater than the amount invested in the next highest country, China, and 17.8 times the amount invested in the United Kingdom (Figure 4.3.8).
Private investment in AI by geographic area, 2023
Source: Quid, 2023 | Chart: 2024 Al Index report 源头: Quid, 2023 | 图表: 2024 Al 指数报告
When aggregating private AI investments since 2013, the country rankings remain the same: The United States leads, followed by China and then the United Kingdom (Figure 4.3.9).
Private investment in Al by geographic area, 2013-23 (sum) 2013-23 年各地区 Al 的私人投资总额
Figure 4.3.9 图 4.3.9
Figure 4.3.10, which looks at AI private investment over time by geographic area, suggests that the gap in private investments between the United States and other regions is widening. While AI private investments have decreased since 2022 in China (-44.2%) and in the European Union plus the United Kingdom (-14.1%), the United States saw a significant increase (22.1%) during the same period.
Private investment in Al by geographic area, 2013-23 2013-23 年按地理区域的 Al 私人投资
The disparity in regional AI private investment becomes particularly pronounced when examining generative AI-related investments. For instance, in 2022, the United States outpaced the combined generative AI investments of the European Union plus the United Kingdom by a wide margin (Figure 4.3.11). By 2023, this gap had widened further.
Private investment in generative Al by geographic area, 2019-23 2019 年至 2023 年各地区生成 AI 的私人投资
Source: Quid, 2023 | Chart: 2024 AI Index report
Regional Comparison by Newly Funded AI Companies
This section examines the number of newly funded Al companies across different geographic regions. 本节将研究不同地理区域新资助的人工智能公司数量。
Consistent with trends in private investment, the United States leads all regions with 897 new AI companies, followed by China with 122, and the United Kingdom with 104 (Figure 4.3.12). 与私人投资趋势一致,美国以 897 家新的人工智能公司领先于所有地区,其次是中国的 122 家,英国的 104 家(图 4.3.12)。
Number of newly funded Al companies by geographic area, 2023 2023 年各地区新获资助的人工智能公司数量
Figure 4.3.12 图 4.3.12
A similar trend is evident in the aggregate data since 2013. In the last decade, the number of newly funded Al companies in the United States is around 3.8 times the amount in China, and 7.6 times the amount in the United Kingdom (Figure 4.3.13). 自 2013 年以来,聚合数据中显示出一种类似的趋势。在过去的十年中,美国新获资助的人工智能公司数量约为中国的 3.8 倍,是英国的 7.6 倍(图 4.3.13)。
Figure 4.3.14 presents data on newly funded AI companies in specific geographic regions, highlighting a decade-long trend in which the United States consistently surpasses both the European Union plus the United Kingdom and China. Since 2022, the United States, along with the European Union and the United Kingdom, has seen significant increases in the number of new AI companies, in contrast to China, which experienced a slight year-over-year decrease.
Number of newly funded Al companies by geographic area, 2013-23 2013-23 年各地区新获资助的人工智能公司数量
Source: Quid, 2023 | Chart: 2024 Al Index report 数据来源:Quid,2023 年 | 图表:2024 年人工智能指数报告
Figure 4.3.14 图表 4.3.14
Focus Area Analysis 焦点领域分析
Quid also disaggregates private AI investment by focus area. Figure 4.3.15 compares global private AI investment by focus area in 2023 versus 2022. The focus areas that attracted the most investment in 2023 were AI infrastructure/research/governance, NLP and customer support, and data management and processing. The prominence of infrastructure, research, and governance reflects large investments in companies specifically building AI applications, such as OpenAI, Anthropic, and Inflection AI.
Figure 4.3.16 presents trends over time in focus area investments. As noted earlier, most focus areas saw declining investments in the last year. Conversely, some areas that have seen growth since 2022 include AI infrastructure/research/governance and data management and processing. Although still substantial, investments in medical and healthcare as well as in NLP and customer support peaked in 2021 and have since declined.
Private investment in Al by focus area, 2022 vs. 2023 2022 年与 2023 年按重点领域的人工智能私人投资
Source: Quid, 2023 | Chart: 2024 Al Index report 来源:Quid,2023 年 | 图表:2024 年人工智能指数报告
Figure 4.3.15 图 4.3.15
Private investment in Al by focus area, 2017-23 2017-23 年按重点领域的 Al 私人投资
Source: Quid, 2023 | Chart: 2024 Al Index report 来源:Quid,2023 年 | 图表:2024 年 Al 指数报告
Finally, Figure 4.3.17 shows private investment in AI by focus area over time within select geographic regions, highlighting how private AI investment priorities differ across geographies. The significant increases observed in AI infrastructure/research/governance were mostly driven by investment in the United States. The United States significantly outpaces China and the European Union plus the United Kingdom in investment in almost all focus area categories. A notable exception is facial recognition, where 2023 investment totals in China were comparable to those in the United States. Likewise, in semiconductor investments, China is not far behind the United States.
Private investment in AI by focus area and geographic area, 2017-23
Source: Quid, 2023 | Chart: 2024 Al Index report 来源:Quid,2023 年 | 图表:2024 年 Al 指数报告
Figure 4.3.17 legend (selected focus areas): cybersecurity and data protection; drones; insurtech; NLP and customer support.
This section examines the practical application of AI by corporations, highlighting industry adoption trends, how businesses are integrating AI, the specific AI technologies deemed most beneficial, and the impact of AI adoption on financial performance.
4.4 Corporate Activity 4.4 企业活动
Industry Adoption 行业采用
This section incorporates insights from McKinsey's "The State of AI in 2023: Generative AI's Breakout Year," alongside data from prior editions. The 2023 McKinsey analysis is based on a survey of 1,684 respondents across various regions, industries, company sizes, functional areas, and tenures. For the first time, this year's version of the McKinsey survey included detailed questions about generative AI adoption and hiring trends for AI-related positions.
Adoption of AI Capabilities AI 能力的采用
The latest McKinsey report reveals that in 2023, 55% of surveyed organizations had implemented AI in at least one business unit or function, marking a slight increase from 2022 and a significant jump from 20% in 2017 (Figure 4.4.1). AI adoption has spiked over the past five years, and in the future, McKinsey expects to see even greater changes happening at higher frequencies, given the rate of both technical advancement and adoption.
Share of respondents who say their organizations have adopted Al in at least one function, 2017-23 在至少一个功能中采用人工智能的受访者所占比例,2017-23
Source: McKinsey & Company Survey, 2023 | Chart: 2024 Al Index report 来源:麦肯锡公司调查,2023 年 | 图表:2024 年人工智能指数报告
Figure 4.4.1 图 4.4.1
Figure 4.4.2 shows the proportion of surveyed companies that use AI for specific functions. Companies may report employing AI in multiple capacities. The most commonly adopted AI use case by function among surveyed businesses in 2023 was contact-center automation (26%), followed by personalization (23%), customer acquisition (22%), and AI-based enhancements of products.
Most commonly adopted Al use cases by function, 2023 2023 年按功能最常采用的 AI 使用案例
Figure 4.4.2 图 4.4.2
7 Personalization is the practice of tailoring products, services, content, recommendations, and marketing to the individual preferences of customers or users. For example, personalization can include sending tailored email messages to clients or customers to improve engagement. 7 个性化是根据客户或用户的个人偏好定制产品、服务、内容、推荐和营销的实践。例如,个性化可以包括发送定制的电子邮件消息给客户或顾客以提高参与度。
With respect to the type of AI capabilities embedded in at least one function or business unit, as indicated by Figure 4.4.3, robotic process automation had the highest rate of embedding within the financial services industry (46%). The next highest rate of embedding was for virtual agents, also in the financial services industry. Across all industries, the most embedded AI technologies were natural language text understanding, robotic process automation (30%), and virtual agents (30%).
Al capabilities embedded in at least one function or business unit, 2023 至少一个功能或业务单元中嵌入的所有能力,2023
Source: McKinsey & Company Survey, 2023 | Chart: 2024 Al Index report 来源:麦肯锡公司调查,2023 | 图表:2024 年人工智能指数报告
Figure 4.4.4 shows Al adoption by industry and Al function in 2023. The greatest adoption was in product and/ or service development for tech, media, and telecom (44%); followed by service operations for tech, media, and telecom (36%) and marketing and sales for tech, media, and telecom (36%). 2023 年,图 4.4.4 展示了不同行业和 Al 功能的 Al 采用情况。对于科技、媒体和电信行业来说,最大的采用情况是在产品和/或服务开发方面(44%);其次是科技、媒体和电信行业的服务运营(36%)以及科技、媒体和电信行业的市场营销和销售(36%)。
Al adoption by industry and function, 2023 2023 年各行业和功能的 Al 采用情况
Source: McKinsey & Company Survey, 2023 | Chart: 2024 Al Index report 数据来源:麦肯锡公司 2023 年调查 | 图表来源:2024 年 Al 指数报告
Figure 4.4.4 图 4.4.4
Figure 4.4.5 illustrates the changes in adoption rates by industry and function from 2022 to 2023. The areas with the largest annual gains across all industries include marketing and sales (18 percentage points), product/service development (14), and service operations (4). Conversely, across all industries, the functions experiencing the most significant declines in adoption include strategy and corporate finance (-12 percentage points), risk (-9), and human resources (-2). 图 4.4.5 展示了从 2022 年到 2023 年各行业和职能中 采用率的变化。在所有行业中,年度增长最大的领域包括营销和销售(18 个百分点)、产品/服务开发(14)和服务运营(4)。相反,在所有行业中,采用率下降最显著的职能包括战略和企业金融(-12 个百分点)、风险(-9)和人力资源(-2)。
Percentage point change in responses of Al adoption by industry and function, 2022 vs. 2023 2022 年与 2023 年各行业和职能中 Al 采用率的响应百分点变化
Source: McKinsey & Company Survey, 2023 | Chart: 2024 Al Index report 来源:麦肯锡公司调查,2023 年 | 图表:2024 年 Al 指数报告
Figure 4.4.6 shows the percentage of surveyed respondents across industries who reported hiring for various AI-related positions. Across all industries, respondents most frequently reported hiring data engineers (36%), AI data scientists (31%), and machine-learning engineers (31%). Notably, a significant portion of respondents within the financial services and the tech, media, and telecom sectors (44%) reported a high rate of hiring machine-learning engineers.
Al-related roles that organizations hired in the last year by industry, 2023 2023 年各行业组织在过去一年中招聘的与人工智能相关的职位
Source: McKinsey & Company Survey, 2023 | Chart: 2024 Al Index report 来源:麦肯锡公司调查报告, 2023 年 | 图表:2024 年人工智能指数报告
Figure 4.4.6 图 4.4.6
Organizations have experienced both cost reductions and revenue increases due to AI adoption (Figure 4.4.7). The functions where respondents most frequently reported cost savings were manufacturing (55%), service operations (54%), and risk (44%). For revenue gains, the functions benefiting the most from AI included manufacturing (66%), marketing and sales (65%), and risk (64%). Figure 4.4.7 shows a substantial number of respondents reporting cost decreases and revenue gains (59%) as a result of using AI, suggesting that AI tangibly helps businesses improve their bottom line. Comparing this year's and last year's averages reveals a 10 percentage point increase for cost decreases and a four percentage point decrease for revenue increases across all activities.
Cost decrease and revenue increase from Al adoption by function, 2022 2022 年各功能从 Al 采用中的成本降低和收入增加
Source: McKinsey & Company Survey, 2023 | Chart: 2024 Al Index report 来源:麦肯锡公司调查,2023 年 | 图表:2024 年 Al 指数报告
Figure 4.4.8 presents global AI adoption by organizations, segmented by world region. In 2023, every surveyed region reported higher adoption rates than in 2022. The most significant year-over-year growth was seen in Europe, where organizational AI adoption grew by 9 percentage points. North America remains the leader in AI adoption. Greater China also experienced a significant increase in adoption, growing by 7 percentage points over the previous year.
Al adoption by organizations in the world, 2022 vs. 2023 2022 年与 2023 年世界各组织对人工智能的采用情况
Source: McKinsey & Company Survey, 2023 | Chart: 2024 Al Index report 来源:麦肯锡公司调查,2023 年 | 图表:2024 年人工智能指数报告
Figure 4.4.8 图 4.4.8
Adoption of Generative AI Capabilities 采用生成式人工智能能力
How are organizations deploying generative AI?
Figure 4.4.9 highlights the proportion of total surveyed respondents who report using generative AI for a particular function. Respondents could indicate that they deploy generative AI for multiple purposes.
The most frequent application is generating initial drafts of text documents (9%), followed closely by personalized marketing, summarizing text documents (8%), and creating images and/or videos (8%). Most of the reported leading use cases fall within the marketing and sales function.
Most commonly adopted generative Al use cases by function, 2023 2023 年按功能采用的最常见的生成 Al 用例
Source: McKinsey & Company Survey, 2023 | Chart: 2024 Al Index report 来源:麦肯锡公司调查,2023 年 | 图表:2024 年 Al 指数报告
Figure 4.4.9 图 4.4.9
8 The adoption of generative Al capabilities is presented separately from the charts on the adoption of general Al capabilities earlier in the chapter, as it was a separate question in the survey. 8 生成式人工智能能力的采用与本章前面关于通用人工智能能力采用的图表分开展示,因为在调查中这是一个单独的问题。
Figure 4.4.10 compares the proportion of respondents who report using AI versus specifically generative AI for a given function, illustrating the degree to which generative AI has permeated general AI usage patterns among businesses. When analyzed at the functional level, the use of AI and of generative AI within organizations shows similar patterns of distribution. Overall, general AI still dominates. The most common functional applications of generative AI are in marketing and sales (14%), product and/or service development (13%), and service operations (10%).
AI vs. generative AI adoption by function, 2023
Source: McKinsey & Company Survey, 2023 | Chart: 2024 Al Index report 源: 麦肯锡公司调查,2023 年 | 图表: 2024 年 AI 指数报告
Figure 4.4.10 图 4.4.10
9 While all generative Al use cases are considered general Al use cases, not all general Al use cases qualify as generative Al use cases. 9 虽然所有生成式 AI 用例都被视为通用 AI 用例,但并非所有通用 AI 用例都符合生成式 AI 用例的资格。
Figure 4.4.11 depicts the variation in generative AI usage among businesses across different regions of the world. Across all regions, the adoption rate of generative AI by organizations is meaningfully lower than the share of businesses across all geographies (55%) that reported using AI, documented earlier. North America leads in adoption at 40%, followed closely by developing markets (including India, Latin America, and the MENA region).
Generative Al adoption by organizations in the world, 2023 2023 年世界各组织的生成 AI 采用情况
Use of AI by Developers
Computer developers are among the individuals most likely to use AI in professional settings. As AI becomes more integrated into the economy, tracking how developers utilize and perceive AI is becoming increasingly important.
Stack Overflow, a question-and-answer website for computer programmers, conducts an annual survey of computer developers. The 2023 survey, with responses from over 90,000 developers, included, for the first time, questions on AI tool usage, detailing how developers use these tools, which tools are favored, and how the tools are perceived.
Preference 偏好
Figure 4.4.12 highlights the proportion of surveyed respondents who report using a specific AI developer tool. According to the survey, GitHub's Copilot was the most widely used, followed by Tabnine (11.7%) and AWS CodeWhisperer (4.9%).
Figure 4.4.13 highlights which AI search tools, software applications that use AI to enhance search functionality, are most favored by developers. The most popular AI search tools according to professional developers were ChatGPT (83.3%), followed by Bing AI (18.8%) and WolframAlpha (11.2%).
Most popular Al developer tools among professional developers, 2023
Source: Stack Overflow Developer Survey, 2023 | Chart: 2024 Al Index report 2023 年专业开发者中最受欢迎的 AI 开发工具,来源:Stack Overflow 开发者调查,2023 年 | 图表:2024 年 AI 指数报告
Figure 4.4.12 图 4.4.12
Most popular Al search tools among professional developers, 2023 2023 年专业开发人员中最受欢迎的 Al 搜索工具
Source: Stack Overflow Developer Survey, 2023 | Chart: 2024 Al Index report 来源:Stack Overflow 开发者调查,2023 年 | 图表:2024 年 Al 指数报告
Figure 4.4.13 图 4.4.13
Cloud platforms are crucial elements of the AI ecosystem, providing cloud computing services that allow developers to perform computationally intensive AI work. Figure 4.4.14 reports the proportion of respondents who report extensively using a specific cloud platform. According to the Stack Overflow survey, Amazon Web Services (AWS) is the most commonly used cloud platform among professional developers, with 53.1% reporting regular use, followed by Microsoft Azure and Google Cloud.
Workflow 工作流程
Figure 4.4.15 explores the current and future integration of AI into developers' workflows. A significant majority of respondents regularly use AI for writing code, followed by debugging and getting assistance, and writing documentation. Fewer currently use AI for code testing, though many express interest in adopting it for this purpose.
Top 10 most popular cloud platforms among professional developers, 2023 专业开发人员首选的前十大云平台,2023 年
Source: Stack Overflow Developer Survey, 2023 | Chart: 2024 AI Index report 资料来源:Stack Overflow 开发者调查,2023 年 | 图表:2024 年人工智能指数报告
Figure 4.4.14 图 4.4.14
Adoption of Al tools in development tasks, 2023 2023 年开发任务中 Al 工具的采用
Source: Stack Overflow Developer Survey, 2023 | Chart: 2024 AI Index report 来源:Stack Overflow 开发者调查,2023 年 | 图表:2024 AI 指数报告
When asked about the primary advantages of AI tools in professional development, developers cited increased productivity (32.8%), accelerated learning (25.2%), and enhanced efficiency (25.0%) (Figure 4.4.16).
Figure 4.4.17 displays the sentiments professional developers hold toward AI tools. A significant majority of developers view AI tools favorably or very favorably, and only a small share express unfavorable opinions about AI development tools.
Figure 4.4.18 highlights developers' reported level of trust in AI tools. More developers trust AI tools than distrust them, with many reporting high or moderate trust in these technologies. In contrast, a smaller proportion express some level of distrust or high distrust in AI tools.
Sentiment toward AI tools in development among professional developers, 2023 对于 2023 年专业开发者对 AI 工具开发的情绪态度
Source: Stack Overflow Developer Survey, 2023 | Chart: 2024 Al Index report 来源:Stack Overflow 开发者调查,2023 年 | 图表:2024 Al 指数报告
Figure 4.4.17 图 4.4.17
Primary benefits of Al tools for professional developers, 2023 专业开发人员的 Al 工具的主要优势,2023
Source: Stack Overflow Developer Survey, 2023 | Chart: 2024 Al Index report 来源:Stack Overflow 开发者调查,2023 | 图表:2024 年 Al 指数报告
Figure 4.4.16 图 4.4.16
Trust level in Al tool output accuracy, 2023 2023 年 Al 工具输出准确性的信任水平
Source: Stack Overflow Developer Survey, 2023 | Chart: 2024 Al Index report 来源:Stack Overflow 开发者调查,2023 年 | 图表:2024 年 Al 指数报告
Figure 4.4.18 图 4.4.18
AI's Labor Impact
Over the last five years, the growing integration of AI into the economy has sparked hopes of boosted productivity. However, finding reliable data confirming AI's impact on productivity has been difficult because AI integration has historically been low. In 2023, numerous studies rigorously examined AI's productivity impacts, offering more conclusive evidence on the topic.
First, AI has been shown to enable workers to complete tasks more quickly and produce higher quality work. A meta-review by Microsoft, which aggregated studies comparing the performance of workers using Microsoft Copilot or GitHub's Copilot (LLM-based productivity-enhancing tools) with those who did not, found that Copilot users completed tasks in substantially less time than their counterparts without access (Figure 4.4.19).11
Cross-study comparison of task completion speed of Copilot users Copilot 用户任务完成速度的跨研究比较
Figure 4.4.19 图 4.4.19
11 This meta-review analyzed separate surveys of workers using Microsoft's Copilot and GitHub's Copilot tools. These are separate tools. Microsoft Copilot is a broader LLM-based productivity improvement tool, while GitHub's Copilot is a code-writing assistant. 11 这个元回顾分析了使用微软 Copilot 和 GitHub Copilot 工具的工作者的独立调查。这些是独立的工具。微软 Copilot 是一个更广泛的基于LLM的生产力改进工具,而 GitHub Copilot 是一个代码编写助手。
Similarly, a Harvard Business School study revealed that consultants with access to GPT-4 increased their productivity, speed, and quality on a selection of consulting tasks, compared to a control group without access (Figure 4.4.20). Likewise, National Bureau of Economic Research research reported that call-center agents using AI handled more calls per hour than those not using AI (Figure 4.4.21).
Effect of GPT-4 use on a group of consultants 顾问团队使用 GPT-4 的效果
Source: Dell'Acqua et al., 2023 | Chart: 2024 Al Index report 来源:Dell'Acqua 等人,2023 年 | 图表:2024 年 Al 指数报告
Figure 4.4.20 图 4.4.20
Impact of Al on customer support agents Al 对客服代理的影响
Source: Brynjolfsson et al., 2023 | Chart: 2024 AI Index report 来源:Brynjolfsson 等人,2023 年 | 图表:2024 年 AI 指数报告
Figure 4.4.21 图 4.4.21
A study on the impact of AI on legal analysis showed that teams with GPT-4 access significantly improved in efficiency and achieved notable quality improvements in various legal tasks, especially contract drafting.
Figure 4.4.22 illustrates the improvements observed in the group of law students who utilized GPT-4, compared to the control group, in terms of both work quality and time efficiency across a range of tasks. Although Al can assist with legal tasks, there are also widespread reports of LLM hallucinations being especially pervasive in legal tasks. 图 4.4.22 展示了使用 GPT-4 的法学生组与对照组在各种任务的工作质量和时间效率方面的改进。尽管人工智能可以协助法律任务,但也有广泛报道称在法律任务中特别普遍存在 LLM 幻觉。
Effect of GPT-4 use on legal analysis by task GPT-4 使用对法律分析的影响
Source: Choi et al., 2023 | Chart: 2024 Al Index report 来源:Choi 等人,2023 年 | 图表:2024 年 Al 指数报告
Figure 4.4.22 图 4.4.22
Second, AI access appears to narrow the performance gap between low- and high-skilled workers. According to the aforementioned Harvard Business School study, both groups of consultants experienced performance boosts after adopting AI, with notably larger gains for lower-skilled consultants than for higher-skilled consultants. Figure 4.4.23 highlights the performance improvement across a set of tasks for participants of varying skill levels:
Lower-skilled (bottom-half) participants exhibited a larger improvement than higher-skilled (top-half) participants. While higher-skilled workers using AI still performed better than their lower-skilled, AI-using counterparts, the disparity in performance between low- and high-skilled workers was markedly lower when AI was utilized than when it was not.
Comparison of Al work performance effect by worker skill category 按工人技能类别比较人工智能工作绩效效果
Source: Dell'Acqua et al., 2023 | Chart: 2024 Al Index report 源自:Dell'Acqua 等人,2023 年 | 图表:2024 年 Al 指数报告
Figure 4.4.23 图 4.4.23
Finally, while AI tends to enhance quality and productivity, overreliance on the technology can impair worker performance. A study focused on professional recruiters reviewing résumés found that receiving any AI assistance improved task accuracy by 0.6 points compared to receiving no assistance. However, recruiters who were provided with "good AI," believed to be high-performing, actually performed worse than those who received "bad AI," which was capable but known to make errors (Figure 4.4.24). The performance difference between the latter groups was -1.08 points. The study theorizes that recruiters using "good AI" became complacent, overly trusting the AI's results, unlike those using "bad AI," who were more vigilant in scrutinizing AI output.
Effects on job performance of receiving different types of Al advice 接收不同类型的人工智能建议对工作绩效的影响
Figure 4.4.24 图 4.4.24
Earnings Calls 财报电话会议
The following section presents data from Quid, which uses natural language processing tools to analyze trends in corporate earnings calls. Quid analyzed all 2023 earnings calls from Fortune 500 companies, identifying all mentions of "artificial intelligence," "AI," "machine learning," "ML," and "deep learning."
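A simplified sketch of the kind of keyword matching involved is shown below. Quid's actual NLP pipeline is more sophisticated; the matching rules here (word boundaries, case-insensitivity) are illustrative assumptions.

```python
import re

# Terms taken from the text above. Case-insensitive word-boundary matching is an
# approximation: it may also catch unrelated uses of "ai" or "ml".
AI_TERMS = ["artificial intelligence", "AI", "machine learning", "ML", "deep learning"]
AI_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(term) for term in AI_TERMS) + r")\b",
    flags=re.IGNORECASE,
)

def mentions_ai(transcript: str) -> bool:
    """Return True if an earnings-call transcript mentions any AI-related term."""
    return AI_PATTERN.search(transcript) is not None

def count_calls_mentioning_ai(transcripts) -> int:
    """Count how many transcripts in an iterable mention AI at least once."""
    return sum(mentions_ai(t) for t in transcripts)
```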
Aggregate Trends 聚合趋势
The past year saw a significant rise in mentions of AI in Fortune 500 company earnings calls. In 2023, AI was mentioned in 394 earnings calls (nearly 80% of all Fortune 500 companies), up from 266 in 2022 (Figure 4.4.25). Since 2018, mentions of AI in Fortune 500 earnings calls have nearly doubled.
Number of Fortune 500 earnings calls mentioning AI, 2018-23 2018-23 年提及 AI 的财富 500 强盈利电话数量
Source: Quid, 2023 | Chart: 2024 Al Index report 来源:Quid,2023 | 图表:2024 年 Al 指数报告
Figure 4.4.25 图 4.4.25
Specific Themes 具体主题
Mentions of AI in Fortune 500 earnings calls were associated with a wide range of themes in 2023. The most frequently cited theme was generative AI (Figure 4.4.26).
Mentions of generative AI grew substantially from 2022. The next most mentioned theme was investments in AI, expansion of AI capabilities, and AI growth initiatives, followed by company/brand AIs (7.6%).
Themes of Al mentions in Fortune 500 earnings calls, 2018 vs. 2023 财富 500 强收益电话中 Al 提及的主题,2018 年对比 2023 年
Source: Quid, 2023 | Chart: 2024 Al Index report 源头: Quid, 2023 | 图表: 2024 Al 指数报告
Highlight: 突出显示:
Projecting AI's Economic Impact
In 2023, some newly published analyses aimed to project and better understand the future economic impact of AI. A recent McKinsey report examined the degree to which generative AI might impact revenues across industries. Figure 4.4.27 features the projected impact range per industry, both as a percentage of total industry revenue and in total dollar amounts. The report projects that the high-tech industry could see one of the largest relative revenue increases as a result of generative AI. Banking, pharmaceuticals and medical products, and education are other industries estimated to grow due to the adoption of generative AI.
Anticipated impact of generative Al on revenue by industry, 2023 2023 年各行业生成式人工智能对收入的预期影响
Source: McKinsey & Company, 2023 | Chart: 2024 Al Index report 来源:麦肯锡公司,2023 年 | 图表:2024 年人工智能指数报告
The McKinsey survey cited above, "The State of AI in 2023," asked business professionals about their expectations of AI's impact on organizational workforces in the next three years. Although a large proportion (30%) expected little to no change in the number of employees, a sizable share felt that staff size would decrease (Figure 4.4.28). Only a small share felt that generative AI would lead to increases in the number of employees. There were also widespread predictions that AI would lead to significant employee reskilling.
Expectations about the impact of Al on organizations' workforces in the next 3 years, 2023 对未来 3 年(2023 年)阿尔对组织员工的影响的预期
Source: Mckinsey & Company Survey, 2023 | Chart: 2024 Al Index report 来源:麦肯锡公司调查,2023 年 | 图表:2024 年 Al 指数报告
Figure 4.4.28 图 4.4.28
Highlight: 突出显示:
Projecting AI's Economic Impact (cont'd)
Perspectives differ on the anticipated effect of generative AI on employment within each business function.
Certain functions, like service operations (54%), supply chain management (45%), and HR (41%), are especially likely, according to respondents, to experience decreasing employment (Figure 4.4.29). 根据受访者的说法,某些功能,如服务运营(54%)、供应链管理(45%)和人力资源(41%),特别可能会出现就业减少(图 4.4.29)。
Anticipated effect of generative Al on number of employees in the next 3 years by business function, 2023 Source: Mckinsey & Company Survey, 2023 | Chart: 2024 Al Index report 预计生成性 Al 对未来 3 年内各业务功能员工人数的影响,2023 年数据来源:麦肯锡公司调查,2023 年 | 图表:2024 年 Al 指数报告
Finally, a Goldman Sachs investment report released in 2023 projects that, globally, AI could lift productivity growth over a ten-year period, with the projected boost varying by geography (Figure 4.4.30).
Although the report projects that many countries will benefit from AI-driven productivity growth, certain geographic areas, like Hong Kong, Israel, and Japan, are especially well positioned.
Estimated impact of Al adoption on annual productivity growth over a ten-year period 预计人工智能采用对十年期间年生产率增长的影响
Source: Goldman Sachs Global Investment Research, 2023 | Chart: 2024 AI Index report
Figure 4.4.30 图 4.4.30
The deployment of robots equipped with AI-based software technologies offers a window into the real-world application of AI-ready infrastructure. This section draws on data from the International Federation of Robotics (IFR), a nonprofit organization dedicated to advancing the robotics industry. Annually, the IFR publishes the World Robotics Reports, which track global robot installation trends.
4.5 Robot Installations 4.5 机器人安装
Aggregate Trends 聚合趋势
The following section includes data on the installation and operation of industrial robots, which are defined as an "automatically controlled, reprogrammable, multipurpose manipulator, programmable in three or more axes, which can be either fixed in place or mobile for use in industrial automation applications." 以下部分包括有关工业机器人的安装和操作数据,工业机器人被定义为“自动控制的、可重新编程的、多用途的操纵器,在三个或三个以上轴上可编程,可以固定或移动用于工业自动化应用。”
Figure 4.5.1 reports the total number of industrial robots installed worldwide by year. In 2022, industrial robot installations increased slightly to 553,000 units, up from 2021. This growth reflects more than a threefold rise in installations since 2012.
12 Due to the timing of the IFR report, the most recent data is from 2022. Every year, the IFR revisits data collected for previous years and will occasionally update the data if more accurate figures become available. Therefore, some of the data reported in this year's report might differ slightly from data reported in previous years. 由于 IFR 报告的时间安排,最新数据来自 2022 年。每年,IFR 都会重新审视以前年份收集的数据,如果有更准确的数据可用,偶尔会更新数据。因此,本年度报告中报告的一些数据可能与以前年份报告的数据略有不同。
The global operational stock of industrial robots reached 3,904,000 in 2022, up from 3,479,000 in 2021 (Figure 4.5.2). Over the past decade, both the installation and utilization of industrial robots have steadily increased. 2022 年,全球工业机器人的运营库存量达到 3,904,000,较 2021 年的 3,479,000 有所增加(图 4.5.2)。在过去的十年中,工业机器人的安装和利用率均稳步增长。
Operational stock of industrial robots in the world, 2012-22 2012 年至 2022 年全球工业机器人的运营库存
Industrial Robots: Traditional vs. 工业机器人:传统 vs.
Collaborative Robots 协作机器人
There is a distinction between traditional robots, which operate for humans, and collaborative robots, designed to work alongside them. The robotics community is increasingly enthusiastic about collaborative robots due to their safety, flexibility, scalability, and ability to learn iteratively. 传统机器人是为人类操作的,而协作机器人是设计用来与人类一起工作的。由于其安全性、灵活性、可扩展性和逐步学习的能力,机器人界对协作机器人越来越感兴趣。
Figure 4.5.3 reports the number of industrial robots installed in the world by type. In 2017, collaborative robots accounted for only a small share of all new industrial robot installations. By 2022, that share had risen noticeably.
Number of industrial robots installed in the world by type, 2017-22 2017-22 年世界各类工业机器人安装数量
Source: International Federation of Robotics (IFR), 2023 | Chart: 2024 Al Index report 数据来源:国际机器人联合会(IFR),2023 年 | 图表:2024 年 Al 指数报告
By Geographic Area 按地理区域划分
Country-level data on robot installations can suggest which nations prioritize robot integration into their economies. In 2022, China led the world with 290,300 industrial robot installations, 5.8 times more than Japan's 50,400 and 7.4 times more than the United States' 39,500 (Figure 4.5.4). South Korea and Germany followed with 31,170 and 25,600 installations, respectively. 各国机器人安装数据可以表明哪些国家将机器人整合到其经济中的优先级。2022 年,中国以 29.03 万台工业机器人的安装数量领先全球,是日本的 5.8 倍,美国的 7.4 倍(图 4.5.4)。韩国和德国分别以 31,170 台和 25,600 台的安装数量紧随其后。
Number of industrial robots installed by country, 2022 2022 年各国安装的工业机器人数量
Source: International Federation of Robotics (IFR), 2023 | Chart: 2024 AI Index report 来源:国际机器人联合会(IFR),2023 年 | 图表:2024 年人工智能指数报告
Figure 4.5.4 图表 4.5.4
Since surpassing Japan in 2013 as the leading installer of industrial robots, China has significantly widened the gap with the nearest country. China's installations accounted for a steadily growing share of the global total between 2013 and 2022 (Figure 4.5.5).
Number of new industrial robots installed in top 5 countries, 2012-22 2012-22 年前 5 个国家安装的新工业机器人数量
Source: International Federation of Robotics (IFR), 2023 | Chart: 2024 Al Index report 来源:国际机器人联合会(IFR),2023 年 | 图表:2024 年 Al 指数报告
Since 2021, China has installed more industrial robots than the rest of the world combined, with the gap widening further in the last year (Figure 4.5.6). This increasing gap underscores China's growing dominance in industrial robot installations. 自 2021 年以来,中国安装的工业机器人数量超过了全球其他地区的总和,而在过去一年中这一差距进一步扩大(图 4.5.6)。这一不断增加的差距凸显了中国在工业机器人安装领域日益增长的主导地位。
Number of industrial robots installed (China vs. rest of the world), 2016-22 安装的工业机器人数量(中国 vs. 全球其他地区),2016-22
Source: International Federation of Robotics (IFR), 2023 | Chart: 2024 Al Index report 来源:国际机器人联合会(IFR),2023 年 | 图表:2024 年 Al 指数报告
Figure 4.5.6 图 4.5.6
According to the IFR report, most countries reported an annual increase in industrial robot installations from 2021 to 2022 (Figure 4.5.7). The countries with the highest growth rates include Singapore (68%), Turkey (22%), and Mexico (13%). Canada (-24%), Taiwan (-21%), Thailand (-18%), and Germany (-1%) reported installing fewer robots in 2022 than in 2021. 根据国际机器人联合会的报告,大多数国家从 2021 年到 2022 年工业机器人安装数量年度增长(图 4.5.7)。增长率最高的国家包括新加坡(68%)、土耳其(22%)和墨西哥(13%)。加拿大(-24%)、台湾(-21%)、泰国(-18%)和德国(-1%)报告称 2022 年安装的机器人数量少于 2021 年。
Annual growth rate of industrial robots installed by country, 2021 vs. 2022 各国 2021 年与 2022 年安装的工业机器人年增长率
Figure 4.5.7 图 4.5.7
Country-Level Data on Service Robotics 服务机器人的国家级数据
Another important class of robots are service robots, which the ISO defines as a robot "that performs useful tasks for humans or equipment excluding industrial automation applications."13 Such robots can, for example, be used in medicine and professional cleaning. In 2022, more service robots were installed for every application category than in 2021, with the exception of medical robotics (Figure 4.5.8). More specifically, the number of service robots installed in hospitality and in transportation and logistics increased 2.3 and 1.4 times, respectively. 另一个重要类别的机器人是服务机器人,ISO 将其定义为“为人类或设备执行有用任务的机器人,不包括工业自动化应用。”13 例如,这些机器人可以用于医学和专业清洁。2022 年,每个应用类别中安装的服务机器人数量均超过了 2021 年,唯一例外是医疗机器人(图 4.5.8)。具体来说,酒店业和运输物流领域安装的服务机器人数量分别增加了 2.3 倍和 1.4 倍。
Number of professional service robots installed in the world by application area, 2021 vs. 2022 2021 年至 2022 年按应用领域安装的世界专业服务机器人数量
Source: International Federation of Robotics (IFR), 2023 | Chart: 2024 Al Index report 来源:国际机器人联合会(IFR),2023 年 | 图表:2024 年 Al 指数报告
Figure 4.5.8 图 4.5.8
13 A more detailed definition can be accessed here. 13 一个更详细的定义可以在这里访问。
As of 2022, the United States leads in professional service robot manufacturing, with approximately 2.06 times more manufacturers than China, the next leading nation (Figure 4.5.9). Germany, Japan, and France also have significant numbers of service robot manufacturers. In most surveyed countries, the majority of these manufacturers are established incumbents.
Number of professional service robot manufacturers in top countries by type of company, 2022 2022 年各国专业服务机器人制造商数量按公司类型划分
Source: International Federation of Robotics (IFR), 2023 | Chart: 2024 Al Index report 源自:国际机器人联合会(IFR),2023 年 | 图表:2024 年 Al 指数报告
Figure 4.5.9 图 4.5.9
Sectors and Application Types 部门和应用类型
Figure 4.5.10 shows the number of industrial robots installed in the world by sector from 2020 to 2022. Globally, the electrical/electronics sector led in robot installations with 157,000 units, closely followed by the automotive sector with 136,000 . Both sectors have seen continuous growth in industrial robot installations since 2020. 图 4.5.10 显示了从 2020 年到 2022 年全球各行业安装的工业机器人数量。全球范围内,电子电器行业以 15.7 万台的机器人安装数量居首,汽车行业以 13.6 万台的机器人紧随其后。这两个行业自 2020 年以来一直在工业机器人安装方面持续增长。
Number of industrial robots installed in the world by sector, 2020-22 按行业统计的全球工业机器人安装数量,2020-2022 年
Source: International Federation of Robotics (IFR), 2023 | Chart: 2024 Al Index report 来源:国际机器人联合会(IFR),2023 年 | 图表:2024 年 Al 指数报告
Figure 4.5 .10 图 4.5 .10
Figure 4.5.11 shows the number of industrial robots installed in the world by application from 2020 to 2022. Data suggests that handling is the predominant application. In 2022, 266,000 industrial robots were installed for handling tasks, 3.1 times more than for welding and 4.4 times more than for assembly . Except for processing, every application category witnessed an increase in robot installations in 2022 compared to 2020 . 图 4.5.11 展示了 2020 年至 2022 年世界范围内按应用安装的工业机器人数量。数据显示,处理是主要应用。到 2022 年,有 266,000 台工业机器人用于处理任务,比用于焊接的机器人数量多 3.1 倍,比用于组装的机器人数量多 4.4 倍。除了处理外,每个应用类别在 2022 年安装的机器人都比 2020 年有所增加。
Number of industrial robots installed in the world by application, 2020-22 2020 年至 2022 年世界范围内按应用安装的工业机器人数量
Source: International Federation of Robotics (IFR), 2023 | Chart: 2024 Al Index report 源自:国际机器人联合会(IFR),2023 年 | 图表:2024 年 Al 指数报告
Figure 4.5.11 图 4.5.11
China vs. United States 中国 vs. 美国
Figure 4.5.12 illustrates the number of industrial robots installed across various sectors in China over the past three years. In 2022, the leading sectors for industrial robot installations in China were electrical/electronics (100,000), automotive (73,000), and metal and machinery (31,000). 图 4.5.12 显示了过去三年中国各个行业安装的工业机器人数量。2022 年,中国工业机器人安装领先的行业是电子电器(10 万台)、汽车(7.3 万台)和金属机械(3.1 万台)。
Figure 4.5.12 图 4.5.12
In 2022, the U.S. automotive industry led in industrial robot installations with 14,500 units, significantly exceeding its 2021 figure (Figure 4.5.13). Aside from the automotive and electronics sectors, every other sector saw fewer robot installations in 2022 than in 2021. 2022 年,美国汽车行业在工业机器人安装方面领先,安装了 1.45 万台,明显超过了 2021 年的数字(图 4.5.13)。除汽车和电子行业外,其他行业 2022 年的机器人安装数量均少于 2021 年。
Number of industrial robots installed in the United States by sector, 2020-22 美国各行业安装的工业机器人数量,2020-22 年
Source: International Federation of Robotics (IFR), 2023 | Chart: 2024 Al Index report 来源:国际机器人联合会(IFR),2023 年 | 图表:2024 年 Al 指数报告
This year's AI Index introduces a new chapter on AI in science and medicine in recognition of AI's growing role in scientific and medical discovery. It explores 2023's standout AI-facilitated scientific achievements, including advanced weather forecasting systems like GraphCast and improved material discovery algorithms like GNoME. The chapter also examines medical AI system performance, important 2023 AI-driven medical innovations like SynthSR and ImmunoSEIRA, and trends in FDA approvals of AI-related medical devices. 今年的 AI 指数新增了一个章节,介绍 AI 在科学和医学领域的作用,以反映 AI 在科学和医学发现中日益重要的作用。它探讨了 2023 年突出的 AI 辅助科学成就,包括像 GraphCast 这样的先进天气预测系统和像 GNoME 这样的改进材料发现算法。该章节还审视了医学 AI 系统的性能、2023 年重要的 AI 驱动医学创新(如 SynthSR 和 ImmunoSEIRA),以及 FDA 批准 AI 相关医疗设备的趋势。
Chapter Highlights 章节亮点
Scientific progress accelerates even further, thanks to AI. In 2022, AI began to advance scientific discovery. 2023, however, saw the launch of even more significant science-related AI applications, from AlphaDev, which makes algorithmic sorting more efficient, to GNoME, which facilitates the process of materials discovery. 由于人工智能的推动,科学进步进一步加速。2022 年,人工智能开始推动科学发现。然而,2023 年看到了更多重要的与科学相关的人工智能应用的推出,从使算法排序更加高效的 AlphaDev,到促进材料发现过程的 GNoME。
AI helps medicine take significant strides forward. In 2023, several significant medical systems were launched, including EVEscape, which enhances pandemic prediction, and AlphaMissense, which assists in AI-driven mutation classification. AI is increasingly being utilized to propel medical advancements. 人工智能帮助医学取得重大进展。2023 年,推出了几个重要的医疗系统,包括增强疫情预测能力的 EVEscape,以及协助 AI 驱动突变分类的 AlphaMissense。人工智能越来越多地被用于推动医学进步。
Highly knowledgeable medical AI has arrived. Over the past few years, AI systems have shown remarkable improvement on the MedQA benchmark, a key test for assessing AI's clinical knowledge. The standout model of 2023, GPT-4 Medprompt, reached an accuracy rate of 90.2%, marking a 22.6 percentage point increase from the highest score in 2022. Since the benchmark's introduction in 2019, AI performance on MedQA has nearly tripled. 高度知识渊博的医疗 AI 已经到来。在过去几年里,AI 系统在 MedQA 基准测试中表现出了显著的改进,这是评估 AI 临床知识的关键测试。2023 年的明星模型 GPT-4 Medprompt 达到了 90.2% 的准确率,比 2022 年最高分提高了 22.6 个百分点。自 2019 年引入该基准测试以来,AI 在 MedQA 上的表现几乎提高到原来的三倍。
The FDA approves more and more AI-related medical devices. In 2022, the FDA approved 139 AI-related medical devices, a 12.1% increase from 2021. Since 2012, the number of FDA-approved AI-related medical devices has increased by more than 45-fold. AI is increasingly being used for real-world medical purposes. FDA 批准了越来越多与 AI 相关的医疗设备。2022 年,FDA 批准了 139 个与 AI 相关的医疗设备,比 2021 年增加了 12.1%。自 2012 年以来,FDA 批准的与 AI 相关的医疗设备数量增加了超过 45 倍。AI 越来越多地被用于真实世界的医疗目的。
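As a rough consistency check, the percentage increase and the fold increase cited above can be recomputed from the 2022 count. The 2021 and 2012 values in the sketch below are implied back-calculations, not figures taken from FDA data.

```python
# Back-of-the-envelope check of the FDA device figures cited above.
# Only the 2022 count (139), the 12.1% increase, and the ">45-fold since 2012"
# claim come from the text; the derived values are implied, not sourced.
approved_2022 = 139

implied_2021 = approved_2022 / 1.121           # roughly 124 devices
implied_2012_upper_bound = approved_2022 / 45  # roughly 3 devices

print(f"Implied 2021 approvals: ~{implied_2021:.0f}")
print(f"2012 approvals consistent with >45x growth: below ~{implied_2012_upper_bound:.1f}")
```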
AlphaDev is a new Al reinforcement learning system that has improved on decades of work by scientists and engineers in the field of computational algorithmic enhancement. AlphaDev developed algorithms with fewer instructions than existing human benchmarks for fundamental sorting algorithms on short sequences such as Sort 3, Sort 4, and Sort 5 (Figure 5.1.1). Some of the new algorithms discovered by AlphaDev have been incorporated into the LLVM standard C++ sort library. This marks the first update to this part of the library in over 10 years and is the first addition designed using reinforcement learning. AlphaDev 是一种新的强化学习系统,通过改进计算算法增强领域科学家和工程师数十年的工作,开发了比现有人类基准更少指令的算法,用于短序列的基本排序算法,如 Sort 3、Sort 4 和 Sort 5(图 5.1.1)。AlphaDev 发现的一些新算法已经被整合到 LLVM 标准 C++ 排序库中。这标志着该库的这一部分在 10 多年来的首次更新,也是第一个使用强化学习设计的新增内容。
AlphaDev vs. human benchmarks when optimizing for algorithm length AlphaDev 与人类基准相比,在优化算法长度时
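AlphaDev searches over sequences of low-level assembly instructions, so its discovered routines are not Python. The sketch below only illustrates the general structure it optimizes, a fixed, branch-free compare-exchange network for three values; it is not the routine AlphaDev found.

```python
# A fixed compare-exchange ("sorting network") for three values.
# AlphaDev optimizes sequences like this at the assembly level; this Python
# version only illustrates the fixed-length, branch-free structure.
def sort3(a, b, c):
    # Each line is one compare-exchange: order a pair with min/max.
    b, c = min(b, c), max(b, c)
    a, c = min(a, c), max(a, c)
    a, b = min(a, b), max(a, b)
    return a, b, c

assert sort3(3, 1, 2) == (1, 2, 3)
assert sort3(2, 3, 1) == (1, 2, 3)
```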
FlexiCubes
3D mesh optimization with FlexiCubes 使用 FlexiCubes 进行 3D 网格优化
3D mesh generation, crucial in computer graphics, involves creating a mesh of vertices, edges, and faces to define 3D objects. It is key to video games, animation, medical imaging, and scientific visualization. Traditional isosurface extraction algorithms often struggle with limited resolution, structural rigidity, and numerical instabilities, which subsequently impacts quality. FlexiCubes addresses some of these limitations by employing Al for gradient-based optimization and adaptable parameters (Figure 5.1.2). This method allows for precise, localized mesh adjustments. Compared to other leading methods that utilize differentiable isosurfacing for mesh reconstruction, FlexiCubes achieves mesh extractions that align much more closely with the underlying ground truth (Figure 5.1.3). 三维网格生成,在计算机图形学中至关重要,涉及创建由顶点、边和面组成的网格来定义三维对象。它对于视频游戏、动画、医学成像和科学可视化至关重要。传统的等值面提取算法通常在分辨率有限、结构刚度和数值不稳定性方面存在困难,随后影响质量。FlexiCubes 通过采用基于梯度的优化和可调参数的人工智能来解决其中一些限制(图 5.1.2)。这种方法允许进行精确的局部网格调整。与利用可微分等值面进行网格重建的其他主流方法相比,FlexiCubes 实现了更加贴近基本真相的网格提取(图 5.1.3)。
Select quantitative results on 3D mesh reconstruction 选择 3D 网格重建的定量结果
Source: Shen et al., 2023 | Chart: 2024 Al Index report 来源:Shen 等人,2023 年 | 图表:2024 年 Al 指数报告
Figure 5.1.3 图 5.1.3
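FlexiCubes optimizes many per-cell weights inside a differentiable isosurface extractor against geometry or rendering losses. The toy below captures only the underlying idea, using finite-difference gradient descent on a single interpolation weight so that the zero crossing on one grid edge lands at a target position; all values are invented for illustration.

```python
# Toy version of gradient-based isosurface adjustment (FlexiCubes-style idea).
# One grid edge has signed field values s0 > 0 > s1 at its endpoints; a
# learnable weight w rescales s0, which moves the interpolated zero crossing.
s0, s1 = 1.0, -1.0   # signed field values at the edge endpoints (made up)
target = 0.3         # desired crossing position along the edge, in [0, 1]

def crossing(w):
    # Linear-interpolation zero crossing with vertex 0 reweighted by w.
    return (w * s0) / (w * s0 - s1)

w, lr, eps = 1.0, 0.5, 1e-6
for _ in range(200):
    loss = (crossing(w) - target) ** 2
    grad = ((crossing(w + eps) - target) ** 2 - loss) / eps  # finite difference
    w -= lr * grad

print(f"optimized w = {w:.3f}, crossing = {crossing(w):.3f} (target {target})")
```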
Synbot 合成机器人
Al-driven robotic chemist for synthesizing organic molecules 用于合成有机分子的人工智能化学家
Synbot employs a multilayered system, comprising an AI software layer for chemical synthesis planning, a robot software layer for translating commands, and a physical robot layer for conducting experiments. The closed-loop feedback mechanism between the AI software and the robotic system enables Synbot to develop synthetic recipes with yields equal to or exceeding established references Synbot 采用多层系统,包括用于化学合成规划的 AI 软件层、用于翻译命令的机器人软件层,以及用于进行实验的物理机器人层。AI 软件与机器人系统之间的闭环反馈机制使 Synbot 能够开发出产量等于或超过已建立参考值的合成配方。
(Figure 5.1.4). In an experiment aimed at synthesizing M1 [4-(2,3-dimethoxyphenyl)-pyrrolo[2,3-b]pyridine], Synbot developed multiple synthetic formulas that achieved conversion yields surpassing (图 5.1.4)。在旨在合成 M1 [4-(2,3-二甲氧基苯基) -吡咯[2,3-b]吡啶]的实验中,Synbot 开发了多个合成配方,实现了超过预期的转化产率
Synbot design Synbot 设计
Source: Ha et al., 2023 来源:Ha 等人,2023
Figure 5.1.4 图 5.1.4
the reference range and completed the synthesis in significantly less time (Figure 5.1.5). Synbot's automation of organic synthesis highlights AI's potential in fields such as pharmaceuticals and materials science. 参考范围,并在明显较短的时间内完成了合成(图 5.1.5)。Synbot 对有机合成的自动化突显了 AI 在制药和材料科学等领域的潜力。
Reaction kinetics of M1 autonomous optimization experiment, Synbot vs. reference M1 自主优化实验的反应动力学,Synbot vs. 参考
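The paper's planning and control details are specific to Synbot, but the closed-loop structure it describes can be sketched generically: propose reaction conditions, run them, measure the yield, and fold the result back into the next proposal. The function names and the toy yield model below are placeholders, not Synbot's actual interfaces.

```python
import random

# Generic closed-loop optimization of a reaction recipe (illustrative only).
# In Synbot, the proposal step is an AI synthesis planner and the experiment
# is run by a physical robot; here both are stand-in toy functions.
def run_experiment(temperature_c, equivalents):
    # Toy yield surface with an optimum near 80 C and 1.5 equivalents.
    return max(0.0, 0.9 - 0.0002 * (temperature_c - 80) ** 2
                       - 0.3 * (equivalents - 1.5) ** 2) + random.gauss(0, 0.01)

best = {"yield": -1.0}
candidate = {"temperature_c": 60.0, "equivalents": 1.0}

for step in range(30):
    measured = run_experiment(**candidate)
    if measured > best["yield"]:
        best = {**candidate, "yield": measured}
    # Propose the next recipe by perturbing the best one found so far.
    candidate = {
        "temperature_c": best["temperature_c"] + random.uniform(-10, 10),
        "equivalents": max(0.5, best["equivalents"] + random.uniform(-0.3, 0.3)),
    }

print(f"best recipe after 30 runs: {best}")
```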
GraphCast
More accurate global weather forecasting with GraphCast 使用 GraphCast 实现更准确的全球天气预报
GraphCast is a new weather forecasting system that delivers highly accurate 10-day weather predictions in under a minute (Figure 5.1.6). Utilizing graph neural networks and machine learning, GraphCast processes vast datasets to forecast temperature, wind speed, atmospheric conditions, and more. Figure 5.1.7 compares the performance of GraphCast with the current industry state-of-the-art weather simulation system, the High Resolution Forecast (HRES). GraphCast posts a lower root mean squared error, meaning its forecasts correspond more closely to observed weather patterns. GraphCast can be a valuable tool in deciphering weather patterns, enhancing preparedness for extreme weather events, and contributing to global climate research. GraphCast 是一个新的天气预报系统,可以在不到一分钟内提供高度准确的未来 10 天天气预测(图 5.1.6)。利用图神经网络和机器学习,GraphCast 处理大量数据集,预测温度、风速、大气条件等。图 5.1.7 比较了 GraphCast 与当前行业最先进的天气模拟系统:高分辨率预报(HRES)的性能。GraphCast 具有更低的均方根误差,意味着其预测更接近观测到的天气模式。GraphCast 可以成为解读天气模式、增强极端天气事件准备和促进全球气候研究的宝贵工具。
GraphCast weather prediction GraphCast 天气预测
Source: DeepMind, 2023 来源:DeepMind,2023 年
Figure 5.1.6 图 5.1.6
Ten-day forecast skill: GraphCast vs. HRES 十天 预报技能:GraphCast vs. HRES
Source: Lam et al., 2023 | Chart: 2024 Al Index report 来源:Lam 等人,2023 年 | 图表:2024 年 Al 指数报告
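The comparison in Figure 5.1.7 is based on root mean squared error (RMSE) between forecast and observed fields; a lower value means the forecast tracks observations more closely. A minimal sketch of the metric, with made-up values rather than GraphCast or HRES output:

```python
import math

# RMSE between forecast and observed values (e.g., a geopotential field).
# The numbers are placeholders, not GraphCast or HRES output.
def rmse(forecast, observed):
    return math.sqrt(sum((f - o) ** 2 for f, o in zip(forecast, observed))
                     / len(observed))

observed   = [5500, 5520, 5480, 5510]
forecast_a = [5498, 5523, 5482, 5508]   # closer to observations
forecast_b = [5494, 5528, 5475, 5516]   # further from observations

print("forecast A RMSE:", round(rmse(forecast_a, observed), 2))
print("forecast B RMSE:", round(rmse(forecast_b, observed), 2))
```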
GNoME
Discovering new materials with GNoME 用 GNoME 发现新材料
The search for new functional materials is key to advancements in various scientific fields, including robotics and semiconductor manufacturing. Yet this discovery process is typically expensive and slow. Recent advancements by Google researchers have demonstrated that graph networks, a type of AI model, can expedite this process when trained on large datasets. Their model, GNoME, outperformed the Materials Project, a leading method in materials discovery, by identifying a significantly larger number of stable crystals (Figure 5.1.8). GNoME has unveiled 2.2 million new crystal structures, many overlooked by human researchers (Figure 5.1.9 and Figure 5.1.10). The success of Al-driven projects like GNoME highlights the power of data and scaling in speeding up scientific breakthroughs. 寻找新功能材料对各种科学领域的进展至关重要,包括机器人技术和半导体制造。然而,这一发现过程通常昂贵且缓慢。谷歌研究人员最近取得的进展表明,图网络,一种人工智能模型,可以在大型数据集上训练时加快这一过程。他们的模型 GNoME 在识别稳定晶体方面表现优于 Materials Project,这是材料发现领域的一种领先方法,能够识别出数量显著更多的稳定晶体(图 5.1.8)。GNoME 揭示了 220 万个新的晶体结构,其中许多被人类研究人员忽视了(图 5.1.9 和图 5.1.10)。像 GNoME 这样由人工智能驱动的项目的成功突显了数据和规模化在加速科学突破中的作用。
GNoME vs. Materials Project: stable crystal count Source: Merchant et al., 2023 | Chart: 2024 Al Index report GNoME 对比 Materials Project:稳定晶体数量 数据来源:Merchant 等人,2023 年 | 图表:2024 年 Al 指数报告
Sample material structures 样本材料结构
Source: Merchant et al., 2023 来源:Merchant 等人,2023 年
Figure 5.1.8 图 5.1.8
GNoME vs. Materials Project: distinct prototypes Source: Merchant et al., 2023 | Chart: 2024 Al Index report GNoME vs. Materials Project:不同的原型 Source:Merchant 等,2023 年 | 图表:2024 Al 指数报告
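GNoME's central output is a predicted stability for each candidate structure, and candidates are kept when they are predicted to lie on or below the convex hull of competing phases. The snippet below sketches only a generic version of that screening step; the predictor stub, candidate names, and energies are invented for illustration.

```python
# Screening candidate crystals by predicted energy above the convex hull.
# predict_energy_above_hull stands in for a trained graph network (as in GNoME);
# the candidates, energies, and threshold here are illustrative only.
def predict_energy_above_hull(structure_id):
    toy_predictions = {"A2B": 0.00, "AB3": 0.12, "A3C": -0.01, "BC2": 0.30}
    return toy_predictions[structure_id]  # eV/atom

candidates = ["A2B", "AB3", "A3C", "BC2"]
stability_threshold = 0.0  # at or below the hull is treated as stable here

stable = [s for s in candidates
          if predict_energy_above_hull(s) <= stability_threshold]
print("predicted-stable candidates:", stable)  # ['A2B', 'A3C']
```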
Flood Forecasting 洪水预测
AI for more accurate and reliable flood forecasts 利用 AI 实现更准确可靠的洪水预测
New research introduced in 2023 has made significant progress in predicting large-scale flood events. Floods, among the most common natural disasters, have particularly devastating effects in less developed countries where infrastructure for prevention and mitigation is lacking. Consequently, developing more accurate prediction methods that can forecast these events further in advance could yield substantial positive impacts. 2023 年引入的新研究在预测大规模洪水事件方面取得了重大进展。洪水是最常见的自然灾害之一,在基础设施缺乏的欠发达国家中具有特别破坏性的影响。因此,开发更准确的预测方法,能够提前预测这些事件,可能会产生重大的积极影响。
A team of Google researchers has used AI to develop highly accurate hydrological simulation models that are also applicable to ungauged basins.' These innovative methods can predict certain extreme flood events up to five days in advance, with accuracy that matches or surpasses current state-of-the-art models, such as GloFAS. The Al model demonstrates superior precision (accuracy of positive predictions) and recall (ability to correctly identify all relevant instances) across a range of return period events, outperforming the leading contemporary method (Figure 5.1.11). The model is open-source and is already being used to predict flood events in over 80 countries. 谷歌研究团队利用人工智能开发了高度准确的水文模拟模型,这些模型也适用于未测流域。这些创新方法可以预测某些极端洪水事件,提前五天进行预测,准确度与当前的领先模型(如 GloFAS)相匹敌甚至超越。该人工智能模型在一系列重现期事件中展示出卓越的精度(正面预测的准确性)和召回率(正确识别所有相关实例的能力),优于领先的当代方法(图 5.1.11)。该模型是开源的,已经被用于预测 80 多个国家的洪水事件。
Predictions of AI model vs. GloFAS across return periods 人工智能模型与 GloFAS 在重现期间的预测对比
Source: Nearing et al., 2023 | Chart: 2024 Al Index report 来源:Nearing 等人,2023 年 | 图表:2024 年 Al 指数报告
Figure 5.1.11 图 5.1.11
2 A return period (recurrence interval) measures the likelihood of a particular hydrological event recurring within a specific period. For example, a 100-year flood means there is a 1% chance of the event being equaled or exceeded in any given year. 2. 返回周期(重现间隔)衡量特定水文事件在特定时期内再次发生的可能性。例如,百年一遇洪水意味着在任何给定年份内发生该事件的概率为 1%。
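Two quantities in this comparison are easy to make concrete: the annual exceedance probability implied by a T-year return period (1/T), and the precision and recall scores plotted in Figure 5.1.11. The warning and observation labels below are invented for illustration and do not come from the study.

```python
# Annual exceedance probability for a return period T (e.g., 1/100 = 1% per year).
def exceedance_probability(return_period_years):
    return 1.0 / return_period_years

print("100-year event:", exceedance_probability(100))  # 0.01

# Precision and recall for flood-event warnings (toy labels, not study data).
predicted = [1, 1, 0, 1, 0, 0, 1, 0]   # 1 = flood warned
observed  = [1, 0, 0, 1, 1, 0, 1, 0]   # 1 = flood occurred

tp = sum(p and o for p, o in zip(predicted, observed))
fp = sum(p and not o for p, o in zip(predicted, observed))
fn = sum(not p and o for p, o in zip(predicted, observed))

precision = tp / (tp + fp)   # share of warnings that were correct
recall    = tp / (tp + fn)   # share of floods that were warned
print(f"precision={precision:.2f}, recall={recall:.2f}")
```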
5.2 Al in Medicine 5.2 医学中的人工智能
AI models are becoming increasingly valuable in healthcare, with applications ranging from detecting polyps to aiding clinicians in making diagnoses. As AI performance continues to improve, monitoring its impact on medical practice becomes increasingly important. This section highlights significant AI-related medical systems introduced in 2023, the current state of clinical AI knowledge, and the development of new AI diagnostic tools and models aimed at enhancing hospital administration. AI 模型在医疗保健领域变得越来越有价值,应用范围从检测息肉到帮助临床医生做出诊断。随着 AI 性能的不断提高,监测其对医疗实践的影响变得越来越重要。本章节介绍了 2023 年引入的重要与 AI 相关的医疗系统、临床 AI 知识的当前状态,以及旨在增强医院管理的新 AI 诊断工具和模型的发展。
Notable Medical Systems 显著的医疗系统
This section identifies significant Al-related medical breakthroughs of 2023 as chosen by the Al Index Steering Committee. 本节将列出由 AI 指数指导委员会选出的 2023 年重要的与人工智能相关的医疗突破。
SynthSR
Transforming brain scans for advanced analysis 转换脑部扫描以进行高级分析
SynthSR is an AI tool that converts clinical brain scans into high-resolution T1-weighted images (Figure 5.2.1). This advancement addresses the issue of scan quality variability, which previously limited the use of many scans in advanced research. By transforming these scans into T1-weighted images, known for their high contrast and clear depiction of brain structure, SynthSR facilitates the creation of detailed 3D brain renderings. SynthSR 是一种 AI 工具,可将临床脑部扫描转换为高分辨率 T1 加权图像(图 5.2.1)。这一进展解决了扫描质量参差不齐的问题,此前该问题限制了许多扫描在高级研究中的使用。通过将这些扫描转换为以高对比度和清晰脑结构描绘著称的 T1 加权图像,SynthSR 有助于生成精细的 3D 大脑渲染图。
SynthSR generations SynthSR 生成
Source: Iglesias et al., 2023 来源: Iglesias et al., 2023
Figure 5.2.1 图 5.2.1
Experiments using SynthSR demonstrate robust correlations between observed volumes at both the scan and subject levels, suggesting that SynthSR generates images closely resembling those produced by high-resolution scans. Figure 5.2.2 illustrates the extent to which SynthSR scans correspond with ground-truth observations across selected brain regions. SynthSR significantly improves the visualization and analysis of brain structures, facilitating neuroscientific research and clinical diagnostics. 使用 SynthSR 进行的实验表明,在扫描和受试者两个层面上观察到的体积之间均存在稳健的相关性,这表明 SynthSR 生成的图像与高分辨率扫描生成的图像非常相似。图 5.2.2 说明了 SynthSR 扫描与所选脑区域真实观测值的对应程度。SynthSR 显著改善了大脑结构的可视化和分析,促进了神经科学研究和临床诊断。
SynthSR correlation with ground-truth volumes on select brain regions SynthSR 与所选脑区域的地面真实体积的相关性
Source: Iglesias et al., 2023 | Chart: 2024 Al Index report 来源:Iglesias 等人,2023 年 | 图表:2024 年 Al 指数报告
Figure 5.2.2 图 5.2.2
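The agreement summarized in Figure 5.2.2 is reported as correlations between region volumes measured on SynthSR outputs and on ground-truth high-resolution scans. A minimal Pearson-correlation sketch, with invented volumes rather than data from the study:

```python
import math

# Pearson correlation between volumes from synthesized vs. ground-truth scans.
# The volumes below are invented for illustration.
synth_vol = [5200, 4800, 6100, 5900, 5000]
truth_vol = [5300, 4700, 6000, 6050, 4950]

def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

print(f"r = {pearson(synth_vol, truth_vol):.3f}")  # close to 1 indicates strong agreement
```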
Coupled Plasmonic Infrared Sensors 耦合等离子红外传感器
Coupled plasmonic infrared sensors for the detection of neurodegenerative diseases 用于检测神经退行性疾病的耦合等离子红外传感器
Diagnosis of neurodegenerative diseases such as Parkinson's and Alzheimer's depends on fast and precise identification of biomarkers. Traditional methods, such as mass spectrometry and ELISA, are useful in that they can quantify protein levels; however, they cannot discern changes in structural states. This year, researchers unveiled a new method for neurodegenerative disease diagnosis that combines AI-coupled plasmonic infrared sensors using surface-enhanced infrared absorption (SEIRA) spectroscopy with an immunoassay technique (ImmunoSEIRA; Figure 5.2.3). In tests comparing actual fibril percentages with predictions made by the AI system, the predictions were found to closely match the reported percentages (Figure 5.2.4). 诊断帕金森病和阿尔茨海默病等神经退行性疾病取决于生物标志物的快速和精确识别。传统方法,如质谱和 ELISA,可以集中于蛋白质水平的定量,但无法辨别结构状态的变化。今年,研究人员揭示了一种结合了使用表面增强红外吸收(SEIRA)光谱的 AI 耦合等离子红外传感器和免疫分析技术(ImmunoSEIRA)的新的神经退行性疾病诊断方法(图 5.2.3)。在将实际纤维百分比与 AI 系统预测进行比较的测试中,发现预测结果与实际报告的百分比非常接近(图 5.2.4)。
ImmunoSEIRA detection principle and the setup ImmunoSEIRA 检测原理和设置
Source: Kavungal et al., 2023 来源:Kavungal 等人,2023
Figure 5.2.3 图 5.2.3
Deep neural network predicted vs. actual fibrils percentages in test samples 深度神经网络预测与实际纤维百分比在测试样本中的比较
Source: Kavungal et al., 2023 | Chart: 2024 Al Index report 来源:Kavungal 等人,2023 年 | 图表:2024 年 Al 指数报告
EVEscape
Forecasting viral evolution for pandemic preparedness 为流行病应对预测病毒演变
Predicting viral mutations is vital for vaccine design and pandemic mitigation. Traditional methods, which rely on real-time virus strain and antibody data, face challenges during early pandemic stages due to data scarcity. EVEscape is a new AI deep learning model, trained on historical sequences and biophysical and structural information, that predicts the evolution of viruses (Figure 5.2.5). EVEscape evaluates viral escape independently of current strain data and anticipated a substantially larger share of observed SARS-CoV-2 mutations than traditional lab studies and an earlier prediction model (Figure 5.2.6). This performance highlights EVEscape's potential as a valuable asset for enhancing future pandemic preparedness and response efforts. 预测病毒突变对疫苗设计和减轻流行病影响至关重要。依赖实时病毒株和抗体数据的传统方法在流行病早期因数据稀缺而面临挑战。EVEscape 是一个基于历史序列以及生物物理和结构信息训练的新型 AI 深度学习模型,可预测病毒的演变(图 5.2.5)。EVEscape 不依赖当前毒株数据即可评估病毒免疫逃逸,其对已观察到的 SARS-CoV-2 突变的预测覆盖率显著高于传统实验室研究和此前的预测模型(图 5.2.6)。这一表现凸显了 EVEscape 在加强未来流行病防范和应对方面的潜在价值。
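The headline EVEscape comparison is, in effect, a coverage measure: the share of mutations later observed in circulating strains that the model had already flagged as likely escape mutations. A toy version of that calculation follows; the mutation labels are illustrative and are not taken from the EVEscape study.

```python
# Share of observed escape mutations that a model flagged in advance.
# The mutation sets are illustrative, not taken from the EVEscape study.
flagged_by_model   = {"E484K", "N501Y", "K417N", "L452R"}
observed_mutations = {"E484K", "N501Y", "L452R", "P681H", "T478K"}

coverage = len(flagged_by_model & observed_mutations) / len(observed_mutations)
print(f"coverage = {coverage:.0%}")  # 3 of 5 observed mutations were flagged
```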