Jane X. Wang, Google DeepMind
Marcel Binz, Helmholtz Institute for Human-Centered AI
Zeynep Akata, TU Munich

August 9, 2024
Abstract
Large language models (LLMs) show increasingly advanced emergent capabilities and are being incorporated across various societal domains. Understanding their behavior and reasoning abilities therefore holds significant importance. We argue that a fruitful direction for research is engaging LLMs in behavioral experiments inspired by psychology that have traditionally been aimed at understanding human cognition and behavior. In this article, we highlight and summarize theoretical perspectives, experimental paradigms, and computational analysis techniques that this approach brings to the table. It paves the way for a "machine psychology" for generative artificial intelligence (AI) that goes beyond performance benchmarks and focuses instead on computational insights that move us toward a better understanding and discovery of emergent abilities and behavioral patterns in LLMs. We review existing work taking this approach, synthesize best practices, and highlight promising future directions. We also highlight the important caveats of applying methodologies designed for understanding humans to machines. We posit that leveraging tools from experimental psychology to study AI will become increasingly valuable as models evolve to be more powerful, opaque, multi-modal, and integrated into complex real-world settings.
Introduction
Recent advances in computing power, data availability, and machine learning algorithms have yielded powerful artificial intelligence systems that are used in almost all parts of society. Among these, large language models (LLMs), gigantic neural network architectures trained on large amounts of text, have seen a particularly meteoric rise in their influence. The ability of LLMs to interface directly with natural language has made them accessible to the public in a way that was not seen before, leading to widespread adoption with millions of daily users (Gemini Team et al., 2024; Anthropic, 2024; OpenAI, 2022; OpenAI, 2023a). Also contributing to their rise in influence is that LLMs are wide-ranging in the kinds of tasks they can do - from writing text or code to calling functions, accessing the Internet, retrieving external information, reasoning about complex problems, and many more (Bubeck et al., 2023; Lo et al., 2022; Elkins and Chun, 2020). Recently, LLMs have been extended to interact with other modalities such as vision and speech (Fei et al., 2022; Radford et al., 2023). The ever-growing capabilities of these systems make them challenging but also increasingly important to characterize and understand, especially since these expanding capabilities also bring greater potential for unforeseen harm (Bommasani et al., 2021; Hagendorff, 2024b; Weidinger et al., 2022; Bender et al., 2021; Schramowski et al., 2022).
Figure 1: Overview of key concepts of machine psychology.
Understanding behavioral patterns and emergent abilities in LLMs requires explaining their operating principles. Of the approaches focused on explaining AI systems, many rely on trying to understand the inner workings of these neural networks. This approach, often termed mechanistic interpretability, seeks to investigate LLMs by analyzing how their weights and activation patterns implement the observable behavior. It uses simplifications in terms of data, the model, or both, that make causal interventions possible and the internal mechanisms easier to characterize (Stolfo et al., 2023; Conmy et al., 2023; Wang, Variengien, et al., 2022; Gao et al., 2024). A related set of approaches draws inspiration more directly from neuroscience to characterize broader correlational similarities and differences between the internal processing of LLMs and humans (Hosseini and Fedorenko, 2023; Kumar et al., 2022).
In contrast, this review focuses on the class of approaches that directly study the behavior of LLMs, analyzing relationships between inputs and outputs instead of inspecting the inner workings. This approach includes not only analyses of static trained models, but also experimental manipulations of inputs both during and after training. It also encompasses analyses of inputs and outputs that reveal insights about internal mechanisms, even if those internal mechanisms are not directly inspected. For this set of approaches, experiments can be inspired by human psychology, cognitive science, and the behavioral sciences. This is what we want to term machine psychology (see Figure 1). Over several decades, the mentioned disciplines have developed a wide range of methods and frameworks to understand and characterize observable intelligent behaviors in human and non-human animals (Edwards, 1954; Festinger and Katz, 1953), much of which can now be adapted to LLMs as well.
Thus far, the research community has responded to the challenges of understanding behavioral patterns and growing capabilities in LLMs in several ways (Schwartz, 2022; Zhao, Chen, et al., 2023). The traditional machine learning benchmark-driven approach has released new datasets that capture specific aspects only recently seen emerging in models (Srivastava et al., 2022; Hendrycks et al., 2021; Zellers et al., 2019). Traditional benchmarking aims primarily to enable the community to compare and optimize LLM performance. In contrast, machine psychology research is not primarily interested in increasing (or measuring) an LLM's performance, but rather in understanding behavioral patterns. While traditional natural language processing benchmarks measure abilities such as translation, numerical reasoning, or factual accuracy, machine psychology is also interested in how these observable abilities indirectly reflect the underlying constructs and algorithms (Frank et al., 2024). Understanding these constructs lets us make new predictions about, for example, how the model will generalize, how it will perform with different training data, and what its specific failure modes will be.
The relative importance of behavior-based inspection (or psychology) versus internal inspection (or neuroscience) has been a long-standing debate (Jonas and Kording, 2017). We believe that both approaches have value for understanding both humans and LLMs. Directly inspecting LLMs' behavior, however, does come with multiple advantages. The behavior of LLMs is expressed at the interface of the model, where human users interact, and thus is what we ultimately care about the most (Binz and Schulz, 2023; Chang and Bergen, 2024; Ivanova, 2023). Such behavior is often too complex to predict purely from our current mechanistic understanding of model weights and activation patterns (Grön et al., 2003). Many interesting behaviors are only displayed by large models with billions of parameters (Kaplan et al., 2020; Wei, Tay, et al., 2022), and behavioral methods in psychology that treat behavior directly as the experimental variable of interest scale gracefully with model size. Another practical advantage is that these behavioral approaches can easily be applied by the broader academic community to closed-source state-of-the-art models whose internal workings are not disclosed to the public.
In this article, we review and chart future directions in this emerging field of directly modeling LLM behavior. We outline how established behavioral sciences can guide and inform our understanding of LLMs, and discuss important caveats for when and how to apply methods to LLMs, given that they were originally developed for humans and animals. In the first section, we discuss the theoretical frameworks developed and used in psychology to organize our understanding of intelligence and intelligent behaviors. We then review the many empirical paradigms that have been developed to study and characterize different aspects of intelligent behavior. Finally, we discuss and make recommendations for robust empirical methods both for designing experiments and analyzing behavioral data. We end the article by discussing the potentials and limitations of conducting machine psychology experiments with increasingly capable black-box models.
Theory: Evaluation paradigms for understanding intelligent systems
The traditional framework in machine learning algorithms has revolved around benchmark datasets (Bowman et al., 2015; Russakovsky et al., 2015). These datasets are designed to require specific capabilities (e.g. object recognition, sentiment analysis, etc.) for good performance. Researchers train on a training dataset and evaluate on a held-out test dataset that was not seen during training. This framework does not generalize well to large-scale foundation models for two reasons. First, when using Internet-scale training data for models, this split has become harder to maintain (Li and Flanigan, 2023; Khan et al., 2023). Second, foundation models are only directly trained for next-token prediction but exhibit many other "intelligent" behaviors that can, with some reservations (Schaeffer et al., 2023), be considered emergent. For example, practitioners did not explicitly encode or train for a transformer LLM's ability to learn from a few examples in context (Brown et al., 2020), but it nonetheless arose from the machine learning architecture, data, and learning signal (Chan et al., 2022; Oswald et al., 2023). Emergent behaviors can be difficult to study through the lens of the components that gave rise to them (Anderson, 1972), and the ones that emerge can seem surprising (Wei, Tay, et al., 2022) - the most interesting evaluations are not 'held-out' exemplars of the training task.
Researchers have therefore started building test-only benchmarks - i.e. smaller scale datasets unsuitable for training and intended solely as a test set - to investigate model capabilities, e.g. the BIG-bench comprising more than 200 tests (Srivastava et al., 2022), the Abstraction and Reasoning Challenge (Chollet et al., 2020), as well as many others (Ivanova et al., 2024; Mazumder et al., 2024). In several cases, these benchmarks already resemble evaluation frameworks from the behavioral sciences (Bubeck et al., 2023) - like personality tests, intelligence tests, implicit association tests, etc. that are applied to humans - which similarly do not follow the train-test paradigm. They also tend to fall into two categories. Some evaluations focus on scalar performance metrics, e.g. intelligence quotients. Others focus on characterizing behavior, i.e. the questions are not designed with accuracy in mind, but designed to elicit responses that reveal behavioral strategies, or underlying constructs. In this review, we focus on test-only evaluations that provide this latter kind of understanding, as a novel evaluation paradigm that is starting to gain traction in the machine learning field.
Several such diagnostic evaluations have been developed even for pre-LLM models where, despite the models being trained for specific tasks, how to solve them is not specified. Such diagnostic datasets were used to expose the ways in which learned systems solved tasks - often counter to human intuitions (Geirhos et al., 2020; McCoy, Pavlick, et al., 2019; Hermann and Lampinen, 2020; Dasgupta et al., 2022; Singla and Feizi, 2021). Researchers have also made the case for borrowing from ethology, a branch of zoology that studies the behavior of non-human animals, to explain machine behavior in machine learning systems (Rahwan et al., 2019). However, in the era of LLMs, not only is the how unspecified, but the model abilities themselves are neither directly known nor intentionally engineered. Furthermore, since LLMs can be evaluated via natural language, this can enhance or replace comparatively simpler methods from ethology. This has led to the widespread adoption of language-based diagnostic evaluations, making it easier and more intuitive for practitioners to develop relevant tests.
However, this comes with important caveats. In trying to shed light on the workings of a black-box system that can produce language, it is tempting to use the simplest approach of asking the system about it. Self-report measures have been extensively used in psychology as well, but their reliability is questionable in humans (Jobe, 2003) as well as in LLMs. Properties that such measures usually consider, such as personality, morality, or clinical disorders, are famously sensitive to prompting (Dominguez-Olmedo et al., 2023; Röttger et al., 2024); to the extent that several recent works even simulate groups of humans with different social backgrounds, opinions, and personalities using differently prompted LLMs (Salewski et al., 2023; Park et al., 2022; Argyle et al., 2023; Shanahan et al., 2023). There remains value in using self-report stimuli from psychology - for example, to characterize behavior on a default prompt, as well as to understand how steerable (i.e. sensitive to prompting) models are along these dimensions. But results drawn from these measures should be taken contextually (e.g. as a property of a specific system prompt on a model) instead of as a fundamental or general property of the LLM itself.
In contrast, the empirical tradition in psychology is significantly different from self-reports. This tradition has yielded lasting understanding of natural intelligence (Frank et al., 2024), and is the tradition we argue is the most amenable to transferring insights to machine psychology. In this paradigm, externally observed behavior continues to be the measured experimental variable, but stimuli are designed such that different observed behaviors map onto and measure different internal representations, capabilities, or constructs - like compositionality, theory of mind, logic, causality, etc. A key principle is that experiments are hypothesis-driven: if the agent has representation or construct X, we would expect to see behavior Y, otherwise we would see behavior Z. We highlight two key principles from this tradition that are crucial to keep in mind when performing and interpreting machine psychology evaluations. First, does seeing behavior Y reliably imply having the construct X? To answer this, the design of a good control is crucial - to ensure that behavior Y does not have another explanation and does, in fact, implicate X. A large part of experimental psychology has been coming up with the right controls for these subtle constructs (Boring, 1954), and has been providing a valuable foundation for future research in machine psychology. Second, does the absence of behavior Y indicate the absence of the construct X? This is a more subtle question. Research in psychology often grapples with the fact that human performance can be noisy or biased; for example, humans may make mistakes even on an easy calculation, or produce ungrammatical language colloquially. These should not be taken to mean that they lack the abstract capability for math or language. These inconsistencies led to the concept of the performance-competence distinction (e.g. Chomsky, 1965): that the way humans perform in a particular situation may not fully capture their underlying competence. More recent work has suggested that similar issues apply when assessing the capabilities of machine learning systems (Firestone, 2020), and particularly LLMs (Lampinen, 2022).
Paradigms: The many aspects of intelligent behavior
There are many aspects of intelligent behavior, each of which has been studied by different sub-fields of the behavioral sciences. Each of these has developed domain-specific empirical paradigms. While some of these sub-fields (e.g. motor learning) and paradigms (e.g. pupillometry) are not directly transferable to LLMs since they rely on the existence of a physical body, several of these paradigms are purely linguistic and can be easily transferred. As LLMs expand in the kinds of stimuli they can interpret - e.g. visual (OpenAI, 2023b; Zhang, Huang, et al., 2024; Gemini Team et al., 2024) - and the ways in which they can interact with the world - e.g. embodiment and tool use (Mialon et al., 2023) - the space of transferable paradigms increases. Humans also interact with several modalities, and the paradigms developed to understand us often compare and integrate these modalities (Schulze Buschoff et al., 2023) - e.g. the Stroop test which spans vision and reading capabilities (Scarpina and Tagini, 2017).
In this article, we focus on language-based tests, since these are the most widely used in the current research landscape. Moreover, we believe that even in light of the growing trend toward multi-modal models, language will remain a primary modality due to its fundamental role in models' reasoning processes. We concentrate on four research areas that can inform distinct strands in machine psychology research: heuristics and biases, social interactions, the psychology of language, and learning. Apart from these four areas, there are, of course, multiple other domains of psychology that can also provide valuable paradigms, for instance when investigating creativity in LLMs (Stevenson et al., 2022), clinical psychology (Li, Li, et al., 2022), moral behavior (Khandelwal et al., 2024), and others.
Heuristics and biases
The heuristics and biases framework is one of the most influential research paradigms in psychology (Gigerenzer and Gaissmaier, 2011; Tversky and Kahneman, 1974). Heuristics are mental shortcuts that simplify reasoning or decision-making processes, and this field studies how such shortcuts can help explain both the successes and the biases in human behavior. The large existing literature on heuristics and biases in humans is a fertile ground for examining such shortcuts in the newest generation of LLMs - whose capabilities now overlap more with the human abilities this literature studies. Binz and Schulz (2023) were among the first to use this paradigm to better understand the decision-making processes of LLMs. They found that GPT-3 (Brown et al., 2020) displays some of the same cognitive biases observed in people. Several other works have also been conducted in this vein (Jones and Steinhardt, 2022; Yax et al., 2024; Hagendorff et al., 2023; Macmillan-Scott and Musolesi, 2024; Schulze Buschoff et al., 2023; Hayes et al., 2024; Coda-Forno, Binz, Wang, et al., 2024). Interestingly, there is evidence from several studies showing that, while the previous generation of models frequently exhibited human-like heuristics and biases, these have largely disappeared in the latest generation of LLMs (Chen, Liu, et al., 2023; Hagendorff et al., 2023). The test stimuli were originally designed to be challenging for human study participants and may no longer challenge the growing reasoning abilities of LLMs. This could also be due to leakage into the training set - we discuss this challenge in the section on design and analysis.
The literature on heuristics and biases also suggests that how a problem is phrased can influence how people solve it (Cheng and Holyoak, 1985; Tversky and Kahneman, 1981). It is well-known that LLMs are also susceptible to similar manipulations. For example, Dasgupta et al. (2022) have investigated whether LLMs are affected by the semantic content of logical reasoning problems using several existing tasks from the literature. They found that, like people, LLMs reason more accurately about familiar, believable, or grounded situations, compared to unfamiliar, unbelievable, or abstract problems. Likewise, Schubert et al. (2024) have shown that how LLMs learn in-context depends on the problem formulation.
Finally, people do not simply apply arbitrary heuristics. Instead, they use heuristics that are adapted to the problems they encounter during their everyday interactions with the world (Todd and Gigerenzer, 2012). In the context of LLMs, one can look at how the properties of the training data shape their behavior. For example, Chan et al. (2022) have demonstrated that the presence of in-context learning in LLMs can be traced back to data distributional properties such as burstiness, where items appear in clusters rather than being uniformly distributed over time, and the presence of large numbers of rarely occurring classes. Researchers also proposed that one should try to understand LLMs through the problem they are trained to solve, similarly to how behavioral scientists attempt to understand human cognition through the lens of ecological rationality (Todd and Gigerenzer, 2012; McCoy, Yao, et al., 2023; Jagadish et al., 2024).
Social interactions
Traditionally, developmental psychology explores how humans develop cognitively, socially, and emotionally throughout their lives. This includes studying the various factors that influence development, such as social intelligence or social skills. By applying paradigms from this area of developmental psychology to LLMs, researchers can gain deeper insights into how these models manage complex social interactions. In particular, once LLMs are deployed as chat agents, they should become versed in modeling human communicators. Therefore, it is important to assess the level of social intelligence in LLMs. One example in this context is the application of theory of mind tests to LLMs, where researchers use tasks from human experiments, such as those famously conducted by Wimmer and Perner (1983) and Perner et al. (1987). While early experiments with models such as GPT-3 showed that they struggle to solve theory of mind tasks (Sap et al., 2022), later models demonstrate an increasing ability to reliably infer unobservable mental states in others (Strachan et al., 2024; Holterman and Deemter, 2023; Moghaddam and Honey, 2023). Further related research examines how LLM performance on theory of mind tests compares to that of children (Duijn et al., 2023), LLM ability to handle higher-order theory of mind tasks requiring recursive reasoning about multiple mental states (Street et al., 2024), or measures the robustness of theory of mind test setups against distracting alterations in the tasks LLMs receive as inputs (Ullman, 2023). As theory of mind tests measure, among other things, the ability to understand false beliefs, further research has explored the emerging capability of LLMs to induce false beliefs in other agents (Hagendorff, 2024a), or how LLMs trade off various communicative values like honesty and helpfulness (Liu et al., 2024) - these investigations also contribute to understanding and improving alignment with human values for AI safety (Ji et al., 2023).
The space of relevant paradigms increases as LLMs are allowed to interact through self-reflection (Nair et al., 2023), self-instruction (Wang, Wei, et al., 2022), or in swarms (Zhuge et al., 2023). For example, researchers looked at cooperative and coordinative behavior in LLMs playing games, revealing persistent behavioral signatures in the models (Akata et al., 2023). Similarly, researchers investigated cooperative or competitive LLM behavior in psychology-inspired dilemma situations to assess the ability of LLMs to participate in real-world negotiations (Phelps and Russell, 2024). Another study, which is influenced by works in human social psychology, looked at how multiple LLMs form and evolve networks, investigating micro-level network principles such as preferential attachment or triadic closure, as well as macro-level principles such as community structures (Papachristou and Yuan, 2024). In sum, machine psychology can reveal patterns of social behavior and interaction among LLMs, individually and collectively, be it for problem solving or world simulation (Guo et al., 2024). By drawing from human developmental psychology and social dynamics, researchers can better understand and design LLMs that navigate complex social interactions and exhibit advanced social skills.
Psychology of language
A long history of work has studied the psychology of how humans use and understand language, ranging from how they use semantic and syntactic features to understand a sentence to how they use pragmatic inferences in a discourse context to help interpret what someone has said. Correspondingly, a long-standing body of work has studied how language processing models capture these features of human language processing. Early connectionist works studied these topics in simple recurrent predictive models (Elman, 1991; McClelland et al., 1989); more recently, researchers have applied similar techniques to study LLMs. A wide range of work has studied what models learn about syntax (Linzen and Baroni, 2021), often using methods from psycholinguistics. For example, Wilcox et al. (2023) used psycholinguistics-inspired surprisal measures to show that LLMs learn filler-gap dependencies, a challenging syntactic structure. Other researchers have used related measures to study what LLMs learn about the semantics of entailment (Merrill et al., 2024). Moreover, researchers used psycholinguistic techniques like priming to study how models represent and process language (Prasad et al., 2019; Sinclair et al., 2022), and methods like deconfounded stimuli to identify where models may rely on semantic heuristics rather than syntax (McCoy, Pavlick, et al., 2019). Several recent works (Hu, Floyd, et al., 2023; Ruis, Khan, et al., 2023) studied pragmatic judgments of LLMs, and found that larger models, as well as those with instruction tuning, tend to better approximate human responses and error patterns - though some deficiencies remain. In another study, researchers examined long-form analogies generated by ChatGPT, finding that AI-generated analogies lack some human-like psycholinguistic properties (Seals and Shalin, 2023), particularly in text cohesion, language, and readability. Furthermore, researchers applied garden path sentences - sentences that lead the reader to initially interpret them incorrectly due to their ambiguous structure - to LLMs, showing that the models respond similarly to humans (Aher et al., 2023; Christianson et al., 2001). At a higher level, some researchers have drawn inspiration from aspects of human language development to attempt to identify the causes of the relative data inefficiency of language models (Warstadt et al., 2023; Frank, 2023). In each of these cases, methods and ideas from psychology and psycholinguistics provide guidance on how to assess processes through language behaviors in LLMs, potentially by drawing comparisons between LLMs and humans. 
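To make the surprisal-based approach concrete, the following is a minimal sketch of how per-token surprisal can be read out of a causal language model; the choice of GPT-2 via the Hugging Face transformers library and the garden-path example sentence are illustrative assumptions, not the exact setups used in the studies cited above.

```python
# Minimal sketch: per-token surprisal from a causal language model.
# Assumes the Hugging Face `transformers` library with GPT-2 as a stand-in;
# the cited studies use their own models and stimuli.
import math
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def surprisal(sentence: str) -> list[tuple[str, float]]:
    """Return (token, surprisal in bits) for each token after the first."""
    ids = tokenizer(sentence, return_tensors="pt")["input_ids"]
    with torch.no_grad():
        logits = model(ids).logits  # shape: [1, seq_len, vocab_size]
    log_probs = torch.log_softmax(logits, dim=-1)
    results = []
    for pos in range(1, ids.shape[1]):
        token_id = ids[0, pos]
        logp = log_probs[0, pos - 1, token_id].item()  # p(token | preceding context)
        results.append((tokenizer.decode(token_id), -logp / math.log(2)))
    return results

# Garden-path sentences are expected to show elevated surprisal at the
# disambiguating word ("fell") relative to an unambiguous control sentence.
for token, bits in surprisal("The horse raced past the barn fell."):
    print(f"{token!r}: {bits:.2f} bits")
```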
Learning
The psychology of learning is concerned with how individuals acquire and retain knowledge and skills. At first blush, it may appear that experimental paradigms for the study of learning are less applicable to LLMs, given that the aim of behavioral experiments is often to help uncover the underlying learning algorithm - whereas for LLMs the learning algorithms used in training are designed and already known. However, the behavioral sciences can still benefit from the study of LLMs in this context, since LLMs exhibit learning abilities that were not explicitly designed into the models (they are emergent), and thus one does not understand the underlying learning algorithm. In particular, LLMs exhibit emergent in-context learning - the ability to learn from context (the prompt) without requiring any gradient-based updates in weights (Brown et al., 2020). Understanding in-context learning is a burgeoning field that is rapidly gaining in importance, given the increasing size of LLM context windows and consequent gains in capabilities, e.g. the capability to learn an entire language from context alone (Munkhdalai et al., 2024; Gemini Team et al., 2024), or the ability to overcome safety fine-tuning (Anil et al., 2024; Zheng et al., 2024).
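As a concrete illustration of this in-context learning setting, the snippet below assembles a minimal few-shot prompt in which the task rule must be inferred entirely from the examples in the context, with no weight updates; the toy word-reversal task and the `query_model` call are hypothetical placeholders.

```python
# Minimal sketch of an in-context learning probe. The (hypothetical) rule
# "reverse the word" must be inferred from the examples alone; no gradient
# updates are involved. `query_model` stands in for whichever LLM API is used.
few_shot_examples = [("stone", "enots"), ("river", "revir"), ("cloud", "duolc")]
test_word = "light"

prompt = "\n".join(f"Input: {x}\nOutput: {y}" for x, y in few_shot_examples)
prompt += f"\nInput: {test_word}\nOutput:"
print(prompt)

# response = query_model(prompt)              # hypothetical API call
# print(response.strip() == test_word[::-1])  # did the model infer the rule?
```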
Uncovering the implicit learning algorithm implemented by in-context learning is a burgeoning research field, and utilizes many of the methods common in cognitive science. For example, multiple studies have compared the outputs of transformer in-context learning with the outputs of hypothesized learning algorithms (Oswald et al., 2023; Akyürek et al., 2022). This is a staple of cognitive modeling, and could potentially benefit even further from model comparison procedures from psychology and statistics (Yang, 2006; Arlot and Celisse, 2010; Vrieze, 2012). Recent work in cognitive science has used machine learning to discover new theories of human decision-making (Peterson et al., 2021) - it might be interesting to apply related approaches to in-context learning as well. Researchers might also benefit from considering particular models as normative starting points (Niv, 2009).
Researchers may also wish to understand other interesting and important characteristics of learning, such as inductive biases and generalization, the data dependence of learning, and the dynamics of learning over time. These characteristics are often not obvious even in cases where the learning algorithm is known, and thus researchers would like to understand them not only for in-context learning, but also for other forms of LLM learning, e.g. self-supervised gradient-based learning, reinforcement learning (Ouyang et al., 2022), or "fast" memory retrieval (Borgeaud et al., 2022; Lewis et al., 2020).
To characterize inductive biases and generalization of LLMs, researchers have borrowed both concepts and experimental paradigms from cognitive sciences (Schubert et al., 2024; Coda-Forno, Binz, Akata, et al., 2023) and Bayesian inference (Xie et al., 2022). Studies utilized paradigms for measuring systematic generalization to characterize those capabilities in LLMs, and as inspiration to improve these abilities (Lake and Baroni, 2023; Ruis, Andreas, et al., 2022). Webb et al. (2023) created novel variants of classic analogy problems from cognitive science, in order to examine the analogical capabilities of large language models. Chan et al. (2022) have borrowed ideas and experimental paradigms on "rule-based" vs. "exemplar-based" generalization to characterize the inductive biases of in-weights vs. in-context learning in transformers. Furthermore, researchers borrowed paradigms and measures from developmental psychology to characterize the domains where LLM inductive biases may match those of children, and where they may fall short (including in causal reasoning and innovation) (Kosoy et al., 2023; Yiu et al., 2023).
To characterize the data dependence of in-context learning, existing work has drawn inspiration from research in developmental psychology on skewed and bursty distributions (Chan et al., 2022). An important aspect of data dependence is the structure of data over time (during training). AI researchers have long drawn inspiration from curriculum learning in human and non-human animals to better understand how to structure training data so that earlier learning on easier tasks can scaffold later learning on harder tasks (Bengio et al., 2009). There remain many areas of behavioral research on learning that may serve as rich sources of inspiration on data dependence, e.g. research on repetition and spacing (Dempster, 1989), working memory (Baddeley, 2010; Chai et al., 2018), blocking vs. interleaving tasks (Carvalho and Goldstone, 2015), and continual learning (Greco et al., 2019). Data dependence is particularly interesting for LLMs because text training data (being sourced largely from unstructured web-scale corpora) is very different from the structured training data typically used for traditional discriminative machine learning techniques, and because data is one of the major levers one can manipulate in training LLMs to adjust their behaviors.
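As a concrete illustration of the distributional properties mentioned above, the sketch below contrasts a uniform stream of training items with a bursty one in which items of the same class appear in clusters; the class labels, stream length, and burst length are illustrative assumptions.

```python
# Minimal sketch: a uniform vs. a bursty stream of training items, in the
# spirit of the data-distributional manipulations described above.
# Class labels, stream length, and burst length are illustrative assumptions.
import random

def uniform_stream(classes: list[str], length: int) -> list[str]:
    """Items drawn independently and uniformly over time."""
    return [random.choice(classes) for _ in range(length)]

def bursty_stream(classes: list[str], length: int, burst_len: int = 4) -> list[str]:
    """Items of the same class appear in short clusters rather than uniformly."""
    stream: list[str] = []
    while len(stream) < length:
        stream.extend([random.choice(classes)] * burst_len)
    return stream[:length]

classes = [f"class_{i}" for i in range(100)]  # many rarely occurring classes
print("uniform:", uniform_stream(classes, 12))
print("bursty: ", bursty_stream(classes, 12))
```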
Design and analysis: Good behavioral experimentation
Computer science has not historically been an empirical science. While machine learning (especially since the era of neural network models) has been significantly driven by empirical rather than theoretical work, the settings under which those protocols were developed - a test set that is fixed for all practitioners and is effectively infinitely large - no longer hold in the small test-only behavioral experiments setting. Current LLMs are famously sensitive to small changes in prompt structure and often rely on shallow syntactic heuristics (McCoy, Pavlick, et al., 2019), and studies that are not careful about testing the robustness of their conclusions risk being spurious and non-generalizable. Psychology too has had its own share of reproducibility crises (Open Science Collaboration, 2015; Haibe-Kains et al., 2020), and machine psychology should not share the same fate. In this section, we provide recommendations for sound methodologies in behavioral test settings with LLMs, which should be valuable to practitioners in the field of machine psychology.
Prompting methods and biases
Many studies conducted in the field of machine psychology have a significant shortcoming in common, namely that they do not avoid training data contamination. They use prompts from existing psychology studies and apply them to LLMs without changing their wording, task orders, etc. In this way, LLMs are likely to have already experienced identical or similar tasks during training, thus causing LLMs to simply reproduce known token patterns. When adopting test frameworks from psychology - meaning vignettes, cognitive tasks, or other test setups - researchers must ensure that LLMs have never seen the tests before and go beyond mere memorization. Hence, prompts may indeed be structurally similar to existing tasks, but they should contain new wordings, agents, orders, actions, etc. That being said, some experiments may be procedurally generated (instead of consisting of a static dataset), which makes them inherently less susceptible to data contamination issues (Coda-Forno, Binz, Wang, et al., 2024).
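One way to reduce contamination risk is to generate test items procedurally rather than reusing published vignettes verbatim; the sketch below produces false-belief-style vignettes with fresh names, objects, and locations on every run (the templates and word lists are illustrative assumptions, not a validated test battery).

```python
# Minimal sketch: procedurally generated false-belief vignettes, so the exact
# wording is unlikely to appear verbatim in any training corpus.
# Templates and word lists are illustrative assumptions.
import random

NAMES = ["Aylin", "Marek", "Sofia", "Tomas", "Ines"]
OBJECTS = ["marble", "key", "coin", "ticket"]
CONTAINERS = ["basket", "drawer", "box", "jar"]

def make_vignette(seed: int) -> dict:
    rng = random.Random(seed)
    a, b = rng.sample(NAMES, 2)
    obj = rng.choice(OBJECTS)
    c1, c2 = rng.sample(CONTAINERS, 2)
    story = (
        f"{a} puts the {obj} in the {c1} and leaves the room. "
        f"While {a} is away, {b} moves the {obj} to the {c2}. "
        f"{a} comes back. Where will {a} look for the {obj} first?"
    )
    return {"prompt": story, "correct": c1, "seed": seed}

# 200 structurally similar but newly worded items.
battery = [make_vignette(seed) for seed in range(200)]
print(battery[0]["prompt"])
```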
Another common shortcoming of several existing machine psychology studies is that they rely on small sample sizes or convenience samples, meaning non-systematic sequences of prompts. Sampling biases in the used benchmarks or task datasets, which are especially prevalent in small sample sizes, can diminish the quality of machine psychology studies. This is because slight changes in prompts can change model outputs significantly. Because of this high sensitivity to prompt wording, it is important to test multiple versions of one task and to create representative samples, meaning batteries of varied prompts. Only in this way can one reliably measure whether a certain behavior is systematically reoccurring and generalizable (Yarkoni, 2022). Furthermore, LLMs can succumb to various biases influencing the processing of prompts (Zhao, Wallace, et al., 2021; Chan et al., 2022). Recency biases in LLMs, for instance, lead to a tendency to rely more heavily on information appearing toward the end of prompts. LLMs can also possess a common token bias, meaning that models are biased toward outputting tokens that are common in their training data. Moreover, majority label biases can cause LLMs to be skewed towards labels, classes, or examples that are frequent in a few-shot learning setting. Technical biases like these can at least in part be controlled for when designing prompts or prompt variations that tend to avoid triggering them. If this is not done, LLMs may rely on shortcuts exploiting such biases.
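The counterbalancing this implies can be scripted directly; the sketch below presents a multiple-choice item under every possible answer ordering so that position effects such as recency bias average out across runs (the example question and option labeling are illustrative assumptions).

```python
# Minimal sketch: present a multiple-choice item under all answer orderings so
# that position effects (e.g. recency bias) can be averaged out across runs.
# The example question and labeling scheme are illustrative assumptions.
from itertools import permutations

def counterbalanced_variants(question: str, options: list[str]) -> list[dict]:
    variants = []
    for order in permutations(options):
        lines = [question] + [f"({chr(65 + i)}) {opt}" for i, opt in enumerate(order)]
        variants.append({"prompt": "\n".join(lines), "order": order})
    return variants

variants = counterbalanced_variants(
    "Which of the following numbers is prime?", ["21", "23", "27"]
)
print(len(variants))          # 3! = 6 orderings of the same item
print(variants[0]["prompt"])
```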
Eliciting capabilities with prompts
The standard prompt design, comprising a vignette plus an open- or close-ended question or task, can be enhanced by prefixes or suffixes eliciting improved reasoning capabilities in LLMs. On the other hand, omitting such prefixes and suffixes can lead to underestimations of the model's capabilities. Although it is likely that most specific prompt augmentations have a positive influence on one kind of task but not another, reducing our ability to systematically understand LLM behavior, a few prompt design approaches have nonetheless been found to confer broader performance benefits. Most notably, (zero-shot) chain-of-thought prompting (Wei, Wang, et al., 2022; Kojima et al., 2022) - which simply adds "Let's think step by step" at the end of a prompt - improves reasoning performance. This can be extended even further by generating multiple chain-of-thought reasoning paths and taking the majority response as the final one (Wang, Wei, et al., 2022). Similar to chain-of-thought prompting is least-to-most prompting, which also decomposes problems into a set of subproblems to increase accuracy in LLMs (Zhou et al., 2022). Yet another approach is to frame questions in a multiple-choice format. This was shown to improve reasoning capabilities in some cases (Kadavath et al., 2022), but can also limit them because LLMs might be prompted to provide brief responses, thereby circumventing reasoning in the process of prompt completion. Nevertheless, many prominent NLP benchmarks use multiple-choice formats instead of open-ended questions. Here, one must keep in mind that different expressions of the same concept compete for probability, which can lower the chances of selecting the correct answer (Holtzman et al., 2021). Moreover, one has to consider potential recency biases, which require neutralizing this effect by shuffling the order of answers in multiple test runs to cover all possible combinations. Another method to increase reasoning performance is to utilize the ability for few-shot learning in LLMs (Brown et al., 2020), where the LLM's performance improves after it is shown several examples of a given task in context. Moreover, self-reflection, meaning the automated, recursive critique and subsequent self-improvement of LLM outputs by the LLM itself, is a further technique that can improve reasoning abilities (Nair et al., 2023; Kim et al., 2023). Regarding improvements in symbolic or numeric reasoning, another technique is to prompt LLMs to use code for solving tasks (Zhang, Ge, et al., 2024). Finally, all of the methods mentioned for improving reasoning can not only be leveraged for machine psychology; they can also become objects of study themselves.
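A minimal sketch of two of the techniques above, zero-shot chain-of-thought prompting combined with self-consistency via majority voting, is given below; `query_model` and the answer-extraction convention are hypothetical placeholders rather than a specific API.

```python
# Minimal sketch: zero-shot chain-of-thought with self-consistency (majority
# vote over sampled reasoning paths). `query_model` and the "Answer:" parsing
# convention are hypothetical placeholders, not a specific API.
from collections import Counter

def query_model(prompt: str, temperature: float) -> str:
    """Hypothetical stand-in for the LLM API under study."""
    raise NotImplementedError("plug in the model under study here")

def cot_self_consistency(question: str, k: int = 5) -> str:
    prompt = f"{question}\nLet's think step by step."
    answers = []
    for _ in range(k):
        completion = query_model(prompt, temperature=0.7)  # sample a reasoning path
        answers.append(completion.split("Answer:")[-1].strip())
    return Counter(answers).most_common(1)[0][0]           # majority response
```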
Setting parameters and evaluating outputs
LLMs come with a variety of parameters researchers can set. For example, most models come in a variety of sizes. Analyses across different sizes are valuable: while the largest ones usually have the highest capabilities, some recent works find "inverse scaling" (McKenzie et al., 2023). Moreover, temperature settings control randomness. If exact reproducibility is required, studies should use temperature 0 or assign a seed to ensure complete determinacy. However, this can be prone to (intentional or unintentional) biases in seed choice. The effect of temperature on capabilities has not been established (Renze and Guven, 2024), and reporting averages or "best of K" - considering all the responses over K samples that meet certain simple criteria, e.g. formatting (Chen, Tworek, et al., 2021) - is valuable.
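The sketch below illustrates one way such settings can be handled in practice: a deterministic temperature-0 run for reproducibility alongside K sampled completions, of which only those meeting a simple formatting criterion are kept; `query_model` and the criterion are hypothetical placeholders.

```python
# Minimal sketch: a deterministic run plus a "best of K" sample in which only
# completions meeting a simple formatting criterion are retained for scoring.
# `query_model` and the numeric-answer criterion are hypothetical placeholders.
import re

def query_model(prompt: str, temperature: float, seed: int | None = None) -> str:
    """Hypothetical stand-in for the LLM API under study."""
    raise NotImplementedError("plug in the model under study here")

def evaluate_item(prompt: str, k: int = 10) -> dict:
    greedy = query_model(prompt, temperature=0.0)  # reproducible reference run
    samples = [query_model(prompt, temperature=1.0, seed=i) for i in range(k)]
    well_formed = [s for s in samples if re.search(r"Answer:\s*\S+", s)]
    return {"greedy": greedy, "well_formed_samples": well_formed}
```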
After conducting the experiments, a list of LLM responses must be evaluated and compared with the ground truth. The simplest case is when the results can be framed and scored as a multiple-choice question - though even in this case, scoring the answers so that the model responds directly inline, rather than selecting a choice, can yield more signal (Hu and Levy, 2023). If possible, multiple scoring methods should be compared, to evaluate whether the effects are dependent on the scoring method (Tsvilodub et al., 2024). If the questions must be answered with free-form generations, the evaluation process can still be automated if the results exhibit sufficient simplicity and regularity, meaning that the LLM responses are similar to the ground truth strings in terms of length and wording, which is particularly common when using masked language models. Methods such as testing word overlaps with regular expressions or using metrics such as the F1 score can be employed. State-of-the-art LLMs, however, tend to produce highly variable and comprehensive outputs, which can complicate classification. While stop sequences, token limits, or prompt instructions that interrupt further text generation can facilitate classification by promoting output uniformity, they can also improperly constrain LLM behavior. Therefore, researchers are increasingly relying on LLM-based evaluations of outputs where a single model or multiple stacked model instances perform the classification using carefully crafted instructions. Although this method might still be inaccurate for very comprehensive outputs, a solution is to instruct the LLM under scrutiny to output its final answer or summary after a specific string sequence like "####" (Cobbe et al., 2021). This approach allows the LLM to reason during verbose prompt completions, which is necessary for many prompt engineering techniques such as chain-of-thought reasoning. The classification then only involves processing the string following "####". If this method still proves to be unreliable, evaluations might have to be performed manually, possibly by hiring research assistants or contractors. Following the evaluation, a statistical analysis can be carried out.
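Below is a minimal sketch of the evaluation step just described: the final answer is read off after a "####" delimiter and scored against the ground truth with exact match and a token-overlap F1; the example completion is an illustrative assumption.

```python
# Minimal sketch: extract the final answer after a "####" delimiter and score
# it with exact match and token-overlap F1. The example completion is made up.
def extract_answer(completion: str) -> str:
    return completion.split("####")[-1].strip()

def token_f1(prediction: str, truth: str) -> float:
    pred, gold = prediction.lower().split(), truth.lower().split()
    overlap = sum(min(pred.count(t), gold.count(t)) for t in set(pred))
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(gold)
    return 2 * precision * recall / (precision + recall)

completion = "The trains meet after solving 30 / (40 + 50) hours ... #### 20 minutes"
answer = extract_answer(completion)
print(answer == "20 minutes")                    # exact match
print(round(token_f1(answer, "20 minutes"), 2))  # token-overlap F1
```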
Discussion
Machine psychology provides a new approach to explaining AI. Instead of interpreting a neural network's design components (Barredo Arrieta et al., 2019), one analyzes the relationships between inputs and outputs, i.e. prompt design and prompt completion. Although this may allow the identification of hitherto unknown abilities or behavioral traits in LLMs, interpreting LLM responses comes with a challenge. A strong tendency exists to confer on LLMs mental concepts or psychological terms that were hitherto reserved for human and animal minds. This tendency manifests in common terms like "machine learning," but will become more prevalent in machine psychology when concepts such as reasoning (Huang and Chang, 2022), intuition (Hagendorff et al., 2023), creativity (Stevenson et al., 2022), intelligence (Webb et al., 2023), personality (Miotto et al., 2022), mental illnesses (Li, Li, et al., 2022), etc. are transferred to LLMs. In this context, researchers have urged caution, stressing that the underlying neural mechanisms for these concepts are different in humans and machines (Shanahan, 2022; Mahowald et al., 2024). Moreover, many psychological concepts are normatively laden and can foster mismatches in expectations between AI experts and the public regarding machine capabilities (Shevlin and Halina, 2019). Nevertheless, the problem remains that many abilities of LLMs cannot be reasonably grasped by referring only to the inner workings of their neural architecture.
By adopting a concept from ethnography, one could call such an approach "thin descriptions" (Ryle, 1971; Geertz, 1973), meaning that one only explains internal representations in AI systems, for instance via activation atlases, which visualize how different parts of a neural network respond to various inputs (Carter et al., 2019). From this perspective, using psychological or other anthropocentric terms to explain machine behavior patterns merely hijacks human intuitions. In contrast to thin descriptions, there are "thick descriptions." They imply using psychological terms to add a layer of explainability. LLMs are, like the human brain, black boxes to some extent. Applying psychological terms to them increases explanatory power, even if no direct neural correlates of these terms exist. This holds for humans, too, where the mental terms used to explain behavior do not directly correspond to specific sets of neural activations. By postulating unobservable (mental) states, be it with regard to brains or artificial neural networks, one increases explanatory resources (Sellars, 1997). Thick descriptions help in making sense of LLMs when thin descriptions are insufficient to explain behavioral patterns. Thin descriptions assume that LLMs merely possess syntax or a statistical capacity to associate words (Searle, 1980; Floridi and Chiriatti, 2020; Bender et al., 2021), but not semantics. Thick descriptions, by contrast, assume that LLMs show patterns and regularities that go beyond mere syntax. These patterns can be explained by means of machine psychology.
Beyond potential habituation to the use of terminology borrowed from psychology in the context of machines, machine psychology, as a nascent field of research, aims to identify behavioral patterns, emergent abilities, and mechanisms of decision-making and reasoning in LLMs by treating them as participants in psychology experiments. This new discipline of evaluating LLMs will become even more important when taking multimodal or augmented LLMs into account, meaning LLMs that are allowed to interact with images, external information sources, sensory data, physical objects, and various other tools (Mialon et al., 2023; Schick et al., 2023; Ma et al., 2024). Moreover, once test settings for machine psychology are established, researchers can investigate how LLMs develop over time by administering the same tasks repeatedly, yielding longitudinal data. These data can serve as a baseline for extrapolating trends in the development of reasoning abilities in LLMs. Such estimations may become increasingly important for AI safety and AI alignment research in predicting future behavioral potentials of LLMs. By gaining a deeper understanding of these potentials, machine psychology provides a new approach to AI explainability as well as an important addition to traditional benchmarking methods in natural language processing.
Author contributions
TH and ID conceptualized and led the initial design of the manuscript. TH and ID wrote the initial drafts, with contributions from MB, SCYC, AL, JW, ZA, and ES to flesh out the sections and create the figure. All authors assisted with iterations and edited and reviewed the paper.
References
Aher, Gati, Rosa I. Arriaga, and Adam Tauman Kalai. "Using Large Language Models to Simulate Multiple Humans and Replicate Human Subject Studies". In: Proceedings of the 40th International Conference on Machine Learning. 2023, pp. 1-35.
Akata, Elif, Lion Schulz, Julian Coda-Forno, Seong Joon Oh, Matthias Bethge, and Eric Schulz. "Playing repeated games with Large Language Models". In: arXiv (2023), pp. 1-13.
Akyürek, Ekin, Dale Schuurmans, Jacob Andreas, Tengyu Ma, and Denny Zhou. "What learning algorithm is in-context learning? Investigations with linear models". In: arXiv (2022), pp. 1-29.
Anderson, Philip W. "More is different: Broken symmetry and the nature of the hierarchical structure of science". In: Science 177.4047 (1972), pp. 393-396.
Anil, Cem et al. Many-shot jailbreaking. 2024.
Anthropic. The Claude 3 Model Family: Opus, Sonnet, Haiku. 2024.
Argyle, Lisa P, Ethan C Busby, Nancy Fulda, Joshua R Gubler, Christopher Rytting, and David Wingate. "Out of One, Many: Using Language Models to Simulate Human Samples". In: Political Analysis 31.3 (2023), pp. 337-351.
Arlot, Sylvain and Alain Celisse. "A survey of cross-validation procedures for model selection". In: Statistics Surveys 4 (2010), pp. 40-79.
Barredo Arrieta, Alejandro et al. "Explainable Artificial Intelligence (XAI): Concepts, Taxonomies, Opportunities and Challenges toward Responsible AI". In: Information Fusion 58 (2019), pp. 82-115.
Bender, Emily M, Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell. "On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?" In: Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency. 2021, pp. 610-623.
Bengio, Yoshua, Jérôme Louradour, Ronan Collobert, and Jason Weston. "Curriculum learning". In: Proceedings of the 26th Annual International Conference on Machine Learning. 2009, pp. 41-48.
Binz, Marcel and Eric Schulz. "Using cognitive psychology to understand GPT-3". In: Proceedings of the National Academy of Sciences 120.6 (2023), pp. 1-10.
Bommasani, Rishi et al. "On the opportunities and risks of foundation models". In: arXiv (2021), pp. 1-214.
Borgeaud, Sebastian, Arthur Mensch, Jordan Hoffmann, Trevor Cai, Eliza Rutherford, and Katie Millican. "Improving Language Models by Retrieving from Trillions of Tokens". In: Proceedings of the 39th International Conference on Machine Learning. Ed. by Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvari, Gang Niu, and Sivan Sabato. Vol. 162. 2022, pp. 2206-2240.
Boring, Edwin G. "The Nature and History of Experimental Control". In: The American Journal of Psychology 67.4 (1954), pp. 573-589.
Bowman, Samuel R., Gabor Angeli, Christopher Potts, and Christopher D. Manning. "A large annotated corpus for learning natural language inference". In: Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. Ed. by Lluís Màrquez, Chris Callison-Burch, and Jian Su. 2015, pp. 632-642.
Brown, Tom et al. "Language Models are Few-Shot Learners". In: Advances in Neural Information Processing Systems. Ed. by H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin. Vol. 33. Curran Associates, Inc., 2020, pp. 1877-1901.
Bubeck, Sébastien et al. "Sparks of Artificial General Intelligence: Early experiments with GPT-4". In: arXiv (2023), pp. 1-155.
Carter, Shan, Zan Armstrong, Ludwig Schubert, Ian Johnson, and Chris Olah. "Exploring Neural Networks with Activation Atlases". In: Distill 4.3 (2019).
Carvalho, Paulo F. and Robert L. Goldstone. "The benefits of interleaved and blocked study: different tasks benefit from different schedules of study". In: Psychonomic Bulletin & Review 22.1 (2015), pp. 281-288.
Chai, Wen Jia, Aini Ismafairus Abd Hamid, and Jafri Malin Abdullah. "Working Memory From the Psychological and Neurosciences Perspectives: A Review". In: Frontiers in Psychology 9 (2018), pp. 1-16.
Chan, Stephanie, Adam Santoro, Andrew Lampinen, Jane Wang, Aaditya Singh, Pierre Richemond, James McClelland, and Felix Hill. "Data Distributional Properties Drive Emergent In-Context Learning in Transformers". In: Advances in Neural Information Processing Systems 35 (2022), pp. 18878-18891.
Chang, Tyler A and Benjamin K Bergen. "Language Model Behavior: A Comprehensive Survey". In: Computational Linguistics 50.1 (2024), pp. 293-350.
Chen, Mark, Jerry Tworek, et al. "Evaluating Large Language Models Trained on Code". In: arXiv (2021), pp. 1-35.
Chen, Yiting, Tracy Xiao Liu, You Shan, and Songfa Zhong. "The emergence of economic rationality of GPT". In: Proceedings of the National Academy of Sciences 120.51 (2023), e2316205120.
Cheng, Patricia W and Keith J Holyoak. "Pragmatic reasoning schemas". In: Cognitive Psychology 17.4 (1985), pp. 391-416.
Chomsky, Noam. Aspects of the Theory of Syntax. MIT Press, 1965.
Christianson, Kiel, Andrew Hollingworth, John F. Halliwell, and Fernanda Ferreira. "Thematic Roles Assigned along the Garden Path Linger". In: Cognitive Psychology 42.4 (2001), pp. 368-407.
Cobbe, Karl et al. "Training Verifiers to Solve Math Word Problems". In: arXiv (2021), pp. 1-22.
Coda-Forno, Julian, Marcel Binz, Zeynep Akata, Matt Botvinick, Jane Wang, and Eric Schulz. "Meta-in-context learning in large language models". In: Advances in Neural Information Processing Systems 36 (2023), pp. 65189-65201.
Coda-Forno, Julian, Marcel Binz, Jane X Wang, and Eric Schulz. "CogBench: a large language model walks into a psychology lab". In: arXiv (2024), pp. 1-26.
Conmy, Arthur, Augustine N. Mavor-Parker, Aengus Lynch, Stefan Heimersheim, and Adrià Garriga-Alonso. "Towards Automated Circuit Discovery for Mechanistic Interpretability". In: Advances in Neural Information Processing Systems. Ed. by A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine. Vol. 36. Curran Associates, Inc., 2023, pp. 16318-16352.
Dasgupta, Ishita, Andrew K. Lampinen, Stephanie C. Y. Chan, Antonia Creswell, Dharshan Kumaran, James L. McClelland, and Felix Hill. "Language models show human-like content effects on reasoning". In: arXiv (2022).
Dempster, Frank N. "Spacing effects and their implications for theory and practice". In: Educational Psychology Review 1.4 (1989), pp. 309-330.
Dominguez-Olmedo, Ricardo, Moritz Hardt, and Celestine Mendler-Dünner. "Questioning the Survey Responses of Large Language Models". In: arXiv (2023), pp. 1-25.
Duijn, Max J. van, Bram van Dijk, Tom Kouwenhoven, Werner de Valk, Marco R. Spruit, and Peter van der Putten. "Theory of Mind in Large Language Models: Examining Performance of 11 State-of-the-Art models vs. Children Aged 7-10 on Advanced Tests". In: Proceedings of the 27th Conference on Computational Natural Language Learning (CoNLL). Ed. by Jing Jiang, David Reitter, and Shumin Deng. 2023, pp. 389-402.
Edwards, Allen L. Statistical Methods for the Behavioral Sciences. Rinehart, 1954.
Elkins, Katherine and Jon Chun. "Can GPT-3 Pass a Writer's Turing Test?" In: Journal of Cultural Analytics 5.2 (2020).
Elman, Jeffrey L. "Distributed representations, simple recurrent networks, and grammatical structure". In: Machine Learning 7 (1991), pp. 195-225.
Fei, Nanyi et al. "Towards artificial general intelligence via a multimodal foundation model". In: Nature Communications 13.1 (2022), pp. 1-13.
Festinger, Leon Ed and Daniel Ed Katz. Research methods in the behavioral sciences. Holt, Rinehart and Winston, 1953.
Firestone, Chaz. "Performance vs. competence in human-machine comparisons". In: Proceedings of the National Academy of Sciences 117.43 (2020), pp. 26562-26571.
Floridi, Luciano and Massimo Chiriatti. "GPT-3: Its Nature, Scope, Limits, and Consequences". In: Minds and Machines 30.4 (2020), pp. 681-694.
Frank, Michael C. "Bridging the data gap between children and large language models". In: Trends in Cognitive Sciences 27.11 (2023), pp. 990-992.
Frank, Michael C., Mika Braginsky, Julie Cachia, Nicholas Coles, Tom E. Hardwicke, Robert D. Hawkins, Maya B. Mathur, and Rondeline Williams. Experimentology: An Open Science Approach to Experimental Psychology Methods. MIT Press, 2024.
Gao, Leo, Tom Dupré la Tour, Henk Tillman, Gabriel Goh, Rajan Troll, Alec Radford, Ilya Sutskever, Jan Leike, and Jeffrey Wu. "Scaling and evaluating sparse autoencoders". In: arXiv (2024), pp. 1-34.
Geertz, Clifford. The Interpretation of Cultures: Selected Essays. Basic Books, 1973.
Geirhos, Robert, Jörn-Henrik Jacobsen, Claudio Michaelis, Richard Zemel, Wieland Brendel, Matthias Bethge, and Felix A. Wichmann. "Shortcut learning in deep neural networks". In: Nature Machine Intelligence 2 (2020), pp. 665-673.
Gemini Team et al. "Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context". In: arXiv (2024), pp. 1-90.
Gigerenzer, Gerd and Wolfgang Gaissmaier. "Heuristic decision making". In: Annual Review of Psychology 62 (2011).
Greco, Claudio, Barbara Plank, Raquel Fernández, and Raffaella Bernardi. "Psycholinguistics Meets Continual Learning: Measuring Catastrophic Forgetting in Visual Question Answering". In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Ed. by Anna Korhonen, David Traum, and Lluís Màrquez. Florence, Italy: Association for Computational Linguistics, 2019, pp. 3601-3605.
Grön, Georg, David Schul, Volker Bretschneider, AP Wunderlich, and Matthias W Riepe. "Alike performance during nonverbal episodic learning from diversely imprinted neural networks". In: European Journal of Neuroscience 18.11 (2003), pp. 3112-3120.
Guo, Taicheng, Xiuying Chen, Yaqi Wang, Ruidi Chang, Shichao Pei, Nitesh V. Chawla, Olaf Wiest, and Xiangliang Zhang. "Large Language Model based Multi-Agents: A Survey of Progress and Challenges". In: arXiv (2024).
Hagendorff, Thilo. "Deception abilities emerged in large language models". In: Proceedings of the National Academy of Sciences 121.24 (2024), pp. 1-8.
"Mapping the Ethics of Generative AI: A Comprehensive Scoping Review". In: arXiv (2024), pp. 1-25.
Hagendorff, Thilo, Sarah Fabi, and Michal Kosinski. "Human-like intuitive behavior and reasoning biases emerged in large language models but disappeared in ChatGPT". In: Nature Computational Science 3.10 (2023), pp. 833-838.
Haibe-Kains, Benjamin et al. "Transparency and reproducibility in artificial intelligence". In: Nature 586.7829 (2020), pp. 1-7.
Hayes, William M, Nicolas Yax, and Stefano Palminteri. "Relative Value Biases in Large Language Models". In: arXiv (2024), pp. 1-7.
Hendrycks, Dan, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Xiaodong Song, and Jacob Steinhardt. "Measuring Mathematical Problem Solving With the MATH Dataset". In: Thirty-fifth Conference on Neural Information Processing Systems. 2021, pp. 1-11.
Hermann, Katherine and Andrew Lampinen. "What shapes feature representations? Exploring datasets, architectures, and training". In: 34th Conference on Neural Information Processing Systems. 2020, pp. 1-12.
Holterman, Bart and Kees van Deemter. "Does ChatGPT have Theory of Mind?" In: arXiv (2023), pp. 1-15.
Holtzman, Ari, Peter West, Vered Shwartz, Yejin Choi, and Luke Zettlemoyer. "Surface Form Competition: Why the Highest Probability Answer Isn't Always Right". In: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. Ed. by Marie-Francine Moens, Xuanjing Huang, Lucia Specia, and Scott Wen-tau Yih. 2021, pp. 7038-7051.
Hosseini, Eghbal A and Evelina Fedorenko. "Large language models implicitly learn to straighten neural sentence trajectories to construct a predictive representation of natural language". In: Advances in Neural Information Processing Systems. Ed. by A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine. Vol. 36. 2023.
Hu, Jennifer, Sammy Floyd, Olessia Jouravlev, Evelina Fedorenko, and Edward Gibson. "A fine-grained comparison of pragmatic language understanding in humans and language models". In: The 61st Annual Meeting of the Association for Computational Linguistics. 2023, pp. 4194-4213.
Hu, Jennifer and Roger P Levy. "Prompting is not a substitute for probability measurements in large language models". In: The 2023 Conference on Empirical Methods in Natural Language Processing. 2023, pp. 5040-5060.
Huang, Jie and Kevin Chen-Chuan Chang. "Towards Reasoning in Large Language Models: A Survey". In: arXiv (2022), pp. 1-14.
Ivanova, Anna A. "Running cognitive evaluations on large language models: The do's and the don'ts". In: arXiv (2023).
Ivanova, Anna A et al. "Elements of World Knowledge (EWOK): A cognition-inspired framework for evaluating basic world knowledge in language models". In: arXiv (2024), pp. 1-21.
Jagadish, Akshay K, Julian Coda-Forno, Mirko Thalmann, Eric Schulz, and Marcel Binz. "Human-like Category Learning by Injecting Ecological Priors from Large Language Models into Neural Networks". In: arXiv (2024), pp. 1-27.
Ji, Jiaming et al. "AI Alignment: A Comprehensive Survey". In: arXiv (2023), pp. 1-102.
Jobe, Jared B. "Cognitive psychology and self-reports: models and methods". In: Quality of Life Research 12 (2003), pp. 219-227.
Jonas, Eric and Konrad Paul Kording. "Could a Neuroscientist Understand a Microprocessor?" In: PLOS Computational Biology 13.1 (2017), pp. 1-24.
Jones, Erik and Jacob Steinhardt. "Capturing failures of large language models via human cognitive biases". In: Advances in Neural Information Processing Systems 35 (2022), pp. 11785-11799.
Kadavath, Saurav et al. "Language Models (Mostly) Know What They Know". In: arXiv (2022), pp. 1-42.
Kaplan, Jared et al. "Scaling Laws for Neural Language Models". In: arXiv (2020), pp. 1-30.
Khan, Mohammad Abdullah Matin, M. Saiful Bari, Xuan Long Do, Weishi Wang, Md Rizwan Parvez, and Shafiq Joty. "xCodeEval: A Large Scale Multilingual Multitask Benchmark for Code Understanding, Generation, Translation and Retrieval". In: arXiv (2023), pp. 1-44.
Khandelwal, Aditi, Utkarsh Agarwal, Kumar Tanmay, and Monojit Choudhury. "Do Moral Judgment and Reasoning Capability of LLMs Change with Language? A Study using the Multilingual Defining Issues Test". In: Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics. Association for Computational Linguistics, 2024, pp. 2882-2894.
Kim, Geunwoo, Pierre Baldi, and Stephen McAleer. "Language Models can Solve Computer Tasks". In: arXiv (2023), pp. 1-26.
Kojima, Takeshi, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. "Large Language Models are Zero-Shot Reasoners". In: arXiv (2022), pp. 1-36.
Kosoy, Eliza, Emily Rose Reagan, Leslie Lai, Alison Gopnik, and Danielle Krettek Cobb. "Comparing Machines and Children: Using Developmental Psychology Experiments to Assess the Strengths and Weaknesses of LaMDA Responses". In: NeurIPS Workshop: AI Meets Moral Philosophy and Moral Psychology. 2023, pp. 1-11.
Kumar, Sreejan, Theodore R Sumers, Takateru Yamakoshi, Ariel Goldstein, Uri Hasson, Kenneth A Norman, Thomas L Griffiths, Robert D Hawkins, and Samuel A Nastase. "Reconstructing the cascade of language processing in the brain using the internal computations of a transformer-based language model". In: BioRxiv (2022), pp. 1-56.
Lake, Brenden M. and Marco Baroni. "Human-like systematic generalization through a meta-learning neural network". In: Nature 623 (2023), pp. 1-23.
Lampinen, Andrew Kyle. "Can language models handle recursively nested grammatical structures? A case study on comparing models and humans". In: arXiv (2022), pp. 1-22.
Lewis, Patrick et al. "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks". In: Advances in Neural Information Processing Systems. Ed. by H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin. Vol. 33. Curran Associates, Inc., 2020, pp. 9459-9474.
Li, Changmao and Jeffrey Flanigan. "Task Contamination: Language Models May Not Be Few-Shot Anymore". In: , pp. 1-20.
Li, Xingxuan, Yutong Li, Linlin Liu, Lidong Bing, and Shafiq Joty. "Is GPT-3 a Psychopath? Evaluating Large Language Models from a Psychological Perspective". In: arXiv (2022), pp. 1-13.
Linzen, Tal and Marco Baroni. "Syntactic Structure from Deep Learning". In: Annual Review of Linguistics 7.1 (2021), pp. 195-212.
Liu, Ryan, Theodore R Sumers, Ishita Dasgupta, and Thomas L Griffiths. "How do Large Language Models Navigate Conflicts between Honesty and Helpfulness?" In: arXiv (2024), pp. 1-21.
Lo, Kai-Ling, Rami Ariss, and Philipp Kurz. "GPoeT-2: A GPT-2 Based Poem Generator". In: arXiv (2022), pp. 1-10.
Ma, Yecheng Jason, William Liang, Hung-Ju Wang, Sam Wang, Yuke Zhu, Linxi Fan, Osbert Bastani, and Dinesh Jayaraman. "DrEureka: Language Model Guided Sim-To-Real Transfer". In: Robotics: Science and Systems (RSS). 2024, pp. 1-28.
Macmillan-Scott, Olivia and Mirco Musolesi. "(Ir)rationality and cognitive biases in large language models". In: Royal Society Open Science 11 (2024), pp. 1-14.
Mahowald, Kyle, Anna A Ivanova, Idan A Blank, Nancy Kanwisher, Joshua B Tenenbaum, and Evelina Fedorenko. "Dissociating language and thought in large language models". In: Trends in Cognitive Sciences 28.6 (2024), pp. 517-540.
Mazumder, Mark et al. "DataPerf: Benchmarks for Data-Centric AI Development". In: Advances in Neural Information Processing Systems. Ed. by A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine. Vol. 36. 2024, pp. 5320-5347.
McClelland, Jay L, Mark St. John, and Roman Taraban. "Sentence comprehension: A parallel distributed processing approach". In: Language and Cognitive Processes 4.3-4 (1989), SI287-SI335.
McCoy, R Thomas, Shunyu Yao, Dan Friedman, Matthew Hardy, and Thomas L Griffiths. "Embers of Autoregression: Understanding Large Language Models Through the Problem They are Trained to Solve". In: arXiv (2023), pp. 1-84.
McCoy, Tom, Ellie Pavlick, and Tal Linzen. "Right for the Wrong Reasons: Diagnosing Syntactic Heuristics in Natural Language Inference". In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 2019.
McKenzie, Ian R. et al. "Inverse Scaling: When Bigger Isn't Better". In: arXiv (2023), pp. 1-39.
Merrill, William, Zhaofeng Wu, Norihito Naka, Yoon Kim, and Tal Linzen. "Can You Learn Semantics Through Next-Word Prediction? The Case of Entailment". In: (2024), pp. 1-22.
Mialon, Grégoire, Roberto Dessì, Maria Lomeli, Christoforos Nalmpantis, Ram Pasunuru, Roberta Raileanu, et al. "Augmented Language Models: a Survey". In: arXiv (2023), pp. 1-33.
Miotto, Marilù, Nicola Rossberg, and Bennett Kleinberg. "Who is GPT-3? An Exploration of Personality, Values and Demographics". In: arXiv (2022), pp. 1-10.
Moghaddam, Shima Rahimi and Christopher J. Honey. "Boosting Theory-of-Mind Performance in Large Language Models via Prompting". In: arXiv (2023), pp. 1-27.
Munkhdalai, Tsendsuren, Manaal Faruqui, and Siddharth Gopal. "Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention". In: arXiv (2024), pp. 1-12.
Nair, Varun, Elliot Schumacher, Geoffrey Tso, and Anitha Kannan. "DERA: Enhancing Large Language Model Completions with Dialog-Enabled Resolving Agents". In: arXiv (2023), pp. 1-38.
Niv, Yael. Reinforcement learning in the brain. 2009.
Open Science Collaboration. "Estimating the reproducibility of psychological science". In: Science 349.6251 (2015).
OpenAI. ChatGPT: Optimizing Language Models for Dialogue. 2022. URL: https://openai.com/blog/chatgpt/ (visited on 02/13/2023).
Oswald, Johannes von, Eyvind Niklasson, Ettore Randazzo, João Sacramento, Alexander Mordvintsev, Andrey Zhmoginov, and Max Vladymyrov. "Transformers learn in-context by gradient descent". In: Proceedings of the 40th International Conference on Machine Learning. 1464. JMLR, 2023, pp. 35151-35174.
Ouyang, Long et al. "Training language models to follow instructions with human feedback". In: Advances in Neural Information Processing Systems. Ed. by S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh. Vol. 35. 2022, pp. 27730-27744.
Papachristou, Marios and Yuan Yuan. "Network Formation and Dynamics Among Multi-LLMs". In: arXiv (2024), pp. 1-27.
Park, Joon Sung, Lindsay Popowski, Carrie Cai, Meredith Ringel Morris, Percy Liang, and Michael S Bernstein. "Social Simulacra: Creating Populated Prototypes for Social Computing Systems". In: Proceedings of the 35th Annual ACM Symposium on User Interface Software and Technology. 2022, pp. 1-18.
Perner, Josef, Susan R. Leekam, and Heinz Wimmer. "Three-year-olds' difficulty with false belief: The case for a conceptual deficit". In: The British Journal of Developmental Psychology 5.2 (1987), pp. 125-137.
Peterson, Joshua C., David D. Bourgin, Mayank Agrawal, Daniel Reichman, and Thomas L. Griffiths. Using large-scale experiments and machine learning to discover theories of human decision-making. 2021.
Phelps, Steve and Yvan I. Russell. "The Machine Psychology of Cooperation: Can GPT models operationalise prompts for altruism, cooperation, competitiveness and selfishness in economic games?" In: arXiv (2024), pp. 1-38.
Prasad, Grusha, Marten Van Schijndel, and Tal Linzen. "Using Priming to Uncover the Organization of Syntactic Representations in Neural Language Models". In: 23rd Conference on Computational Natural Language Learning, CoNLL 2019. 2019, pp. 66-76.
Radford, Alec, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. "Robust speech recognition via large-scale weak supervision". In: International Conference on Machine Learning. 2023, pp. 28492-28518.
Rahwan, Iyad, Manuel Cebrian, Nick Obradovich, Josh Bongard, Jean-François Bonnefon, Cynthia Breazeal, et al. "Machine behaviour". In: Nature 568.7753 (2019), pp. 477-486.
Renze, Matthew and Erhan Guven. "The Effect of Sampling Temperature on Problem Solving in Large Language Models". In: arXiv (2024).
Röttger, Paul, Valentin Hofmann, Valentina Pyatkin, Musashi Hinck, Hannah Rose Kirk, Hinrich Schütze, and Dirk Hovy. "Political Compass or Spinning Arrow? Towards More Meaningful Evaluations for Values and Opinions in Large Language Models". In: arXiv (2024), pp. 1-17.
Ruis, Laura, Jacob Andreas, and Brenden M. Lake. "Improving Systematic Generalization Through Modularity and Augmentation". In: arXiv (2022), pp. 1-9.
Ruis, Laura Eline, Akbir Khan, Stella Biderman, Sara Hooker, Tim Rocktäschel, and Edward Grefenstette. "The Goldilocks of Pragmatic Understanding: Fine-Tuning Strategy Matters for Implicature Resolution by LLMs". In: Advances in Neural Information Processing Systems. Ed. by A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine. Vol. 36. 2023, pp. 20827-20905.
Russakovsky, Olga et al. "ImageNet Large Scale Visual Recognition Challenge". In: International Journal of Computer Vision 115 (2015), pp. 211-252.
Salewski, Leonard, Stephan Alaniz, Isabel Rio-Torto, Eric Schulz, and Zeynep Akata. "In-Context Impersonation Reveals Large Language Models' Strengths and Biases". In: arXiv (2023), pp. 1-27.
Sap, Maarten, Ronan Le Bras, Daniel Fried, and Yejin Choi. "Neural Theory-of-Mind? On the Limits of Social Intelligence in Large LMs". In: Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing. Ed. by Yoav Goldberg, Zornitsa Kozareva, and Yue Zhang. Association for Computational Linguistics, 2022, pp. 3762-3780.
Scarpina, Federica and Sofia Tagini. "The Stroop Color and Word Test". In: Frontiers in Psychology 8 (2017), pp. 1-8.
Schaeffer, Rylan, Brando Miranda, and Sanmi Koyejo. "Are Emergent Abilities of Large Language Models a Mirage?" In: Proceedings of the 37th International Conference on Neural Information Processing Systems. 2425. Curran Associates Inc., 2023, pp. 1-17.
Schick, Timo, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. "Toolformer: Language Models Can Teach Themselves to Use Tools". In: arXiv (2023).
Schramowski, Patrick, Cigdem Turan, Nico Andersen, Constantin A Rothkopf, and Kristian Kersting. "Large pre-trained language models contain human-like biases of what is right and wrong to do". In: Nature Machine Intelligence 4.3 (2022), pp. 258-268.
Schubert, Johannes A, Akshay K Jagadish, Marcel Binz, and Eric Schulz. "In-context learning agents are asymmetric belief updaters". In: arXiv (2024), pp. 1-16.
Schulze Buschoff, Luca M, Elif Akata, Matthias Bethge, and Eric Schulz. "Visual cognition in multimodal large language models". In: arXiv (2023), pp. 1-18.
Schwartz, Matthew D. "Should artificial intelligence be interpretable to humans?" In: Nature Reviews Physics 4.12 (2022), pp. 741-742.
Seals, S. M. and Valerie L. Shalin. "Long-form analogies generated by chatGPT lack human-like psycholinguistic properties". In: (2023), pp. 1-8.
Searle, John R. "Minds, brains, and programs". In: Behavioral and Brain Sciences 3.3 (1980), pp. 417-424.
Sellars, Wilfrid. Empiricism and the Philosophy of Mind. Harvard University Press, 1997.
Shanahan, Murray. "Talking About Large Language Models". In: arXiv (2022), pp. 1-11.
Shanahan, Murray, Kyle McDonell, and Laria Reynolds. "Role play with large language models". In: Nature 623.7987 (2023), pp. 493-498.
Shevlin, Henry and Marta Halina. "Apply rich psychological terms in AI with care". In: Nature Machine Intelligence 1 (2019), pp. 165-167.
Sinclair, Arabella, Jaap Jumelet, Willem Zuidema, and Raquel Fernández. "Structural Persistence in Language Models: Priming as a Window into Abstract Language Representations". In: Transactions of the Association for Computational Linguistics 10 (2022), pp. 1031-1050.
Singla, Sahil and Soheil Feizi. "Causal ImageNet: How to discover spurious features in Deep Learning?" In: arXiv (2021), pp. 1-76.
Srivastava, Aarohi et al. "Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models". In: arXiv (2022), pp. 1-100.
Stevenson, Claire, Iris Smal, Matthijs Baas, Raoul Grasman, and Han van der Maas. "Putting GPT-3's Creativity to the (Alternative Uses) Test". In: (2022), pp. 1-5.
Stolfo, Alessandro, Yonatan Belinkov, and Mrinmaya Sachan. "A Mechanistic Interpretation of Arithmetic Reasoning in Language Models using Causal Mediation Analysis". In: Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. 2023, pp. 7035-7052.
Strachan, James W. A. et al. "Testing theory of mind in large language models and humans". In: Nature Human Behaviour 8 (2024), pp. 1285-1295.
Street, Winnie et al. "LLMs achieve adult human performance on higher-order theory of mind tasks". In: arXiv (2024).
Todd, Peter M and Gerd Gigerenzer. Ecological Rationality: Intelligence in the World. Oxford University Press, 2012.
Tsvilodub, Polina, Hening Wang, Sharon Grosch, and Michael Franke. "Predictions from language models for multiple-choice tasks are not robust under variation of scoring methods". In: arXiv (2024), pp. 1-8.
Tversky, Amos and Daniel Kahneman. "Judgment under Uncertainty: Heuristics and Biases". In: Science 185.4157 (1974), pp. 1124-1131.
_ "The Framing of Decisions and the Psychology of Choice". In: Science 211.4481 (1981), pp. 453-458.
Ullman, Tomer. "Large Language Models Fail on Trivial Alterations to Theory-of-Mind Tasks". In: arXiv (2023).
Vrieze, Scott I. "Model selection and psychological theory: a discussion of the differences between the Akaike information criterion (AIC) and the Bayesian information criterion (BIC)". In: Psychological Methods 17.2 (2012), pp. 228-243.
Wang, Kevin, Alexandre Variengien, Arthur Conmy, Buck Shlegeris, and Jacob Steinhardt. "Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small". In: arXiv (2022), pp. 1-25.
Wang, Xuezhi, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. "Self-Consistency Improves Chain of Thought Reasoning in Language Models". In: arXiv (2022), pp. 1-24.
Warstadt, Alex et al. "Findings of the BabyLM Challenge: Sample-Efficient Pretraining on Developmentally Plausible Corpora". In: Proceedings of the BabyLM Challenge at the 27th Conference on Computational Natural Language Learning. Ed. by Alex Warstadt et al. Association for Computational Linguistics, 2023, pp. 1-34.
Webb, Taylor, Keith J Holyoak, and Hongjing Lu. "Emergent analogical reasoning in large language models". In: Nature Human Behaviour 7.9 (2023), pp. 1526-1541.
Wei, Jason, Yi Tay, et al. "Emergent Abilities of Large Language Models". In: Transactions on Machine Learning Research (2022), pp. 1-30.
Wei, Jason, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Le Quoc, and Denny Zhou. "Chain of Thought Prompting Elicits Reasoning in Large Language Models". In: arXiv (2022), pp. 1-41.
Weidinger, Laura, Jonathan Uesato, Maribeth Rauh, Conor Griffin, Po-Sen Huang, John Mellor, et al. "Taxonomy of Risks posed by Language Models". In: Proceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency. Association for Computing Machinery, 2022, pp. 214-229.
Wilcox, Ethan Gotlieb, Richard Futrell, and Roger Levy. "Using Computational Models to Test Syntactic Learnability". In: Linguistic Inquiry (2023), pp. 1-44.
Wimmer, H. and J Perner. "Beliefs about beliefs: representation and constraining function of wrong beliefs in young children's understanding of deception". In: Cognition 13.1 (1983), pp. 103-128.
Xie, Sang Michael, Aditi Raghunathan, Percy Liang, and Tengyu Ma. "An Explanation of In-context Learning as Implicit Bayesian Inference". In: International Conference on Learning Representations. 2022, pp. 1-25.
Yang, Yuhong. "Comparing learning methods for classification". In: Statistica Sinica 2 (2006), pp. 635-657.
Yarkoni, Tal. "The generalizability crisis". In: Behavioral and Brain Sciences 45 (2022), pp. 1-37.
Yax, Nicolas, Hernan Anlló, and Stefano Palminteri. "Studying and improving reasoning in humans and machines". In: Communications Psychology 2.1 (2024), pp. 1-16.
Yiu, Eunice, Eliza Kosoy, and Alison Gopnik. "Transmission Versus Truth, Imitation Versus Innovation: What Children Can Do That Large Language and Language-and-Vision Models Cannot (Yet)". In: Perspectives on Psychological Science 0.0 (2023), pp. 1-10.
Zellers, Rowan, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. "HellaSwag: Can a Machine Really Finish Your Sentence?" In: Annual Meeting of the Association for Computational Linguistics. 2019, pp. 1-10.
Zhang, Jingyi, Jiaxing Huang, Sheng Jin, and Shijian Lu. "Vision-Language Models for Vision Tasks: A Survey". In: , pp. 1-24.
Zhang, Tianhua, Jiaxin Ge, et al. "Natural Language Embedded Programs for Hybrid Language Symbolic Reasoning". In: Findings of the Association for Computational Linguistics: NAACL 2024. Ed. by Kevin Duh, Helena Gomez, and Steven Bethard. 2024, pp. 4131-4155.
Zhao, Haiyan, Hanjie Chen, F. Yang, Ninghao Liu, Huiqi Deng, Hengyi Cai, Shuaiqiang Wang, Dawei Yin, and Mengnan Du. "Explainability for Large Language Models: A Survey". In: ACM Transactions on Intelligent Systems and Technology 15 (2023), pp. 1-38.
Zhao, Tony Z., Eric Wallace, Shi Feng, Dan Klein, and Sameer Singh. "Calibrate Before Use: Improving Few-Shot Performance of Language Models". In: arXiv (2021), pp. 1-15.
Zheng, Xiaosen, Tianyu Pang, Chao Du, Qian Liu, Jing Jiang, and Min Lin. "Improved Few-Shot Jailbreaking Can Circumvent Aligned Language Models and Their Defenses". In: arXiv (2024), pp. 1-22.
Zhou, Denny, Nathanael Schärli, Le Hou, Jason Wei, Nathan Scales, Xuezhi Wang, et al. "Least-to-Most Prompting Enables Complex Reasoning in Large Language Models". In: arXiv (2022), pp. 1-63.
Zhuge, Mingchen et al. "Mindstorms in Natural Language-Based Societies of Mind". In: arXiv (2023), pp. 1-54.