
Expected Returns and Large Language Models *

Yifei Chen, Booth School of Business, University of Chicago

Bryan Kelly, Yale University, AQR Capital Management, and NBER

Dacheng Xiu, Booth School of Business, University of Chicago

Abstract

We leverage state-of-the-art large language models (LLMs) such as ChatGPT and LLaMA to extract contextualized representations of news text for predicting stock returns. Our results show that prices respond slowly to news reports, indicative of market inefficiencies and limits-to-arbitrage. Predictions from LLM embeddings significantly improve over leading technical signals (such as past returns) or simpler NLP methods by understanding news text in light of the broader article context. For example, the benefits of LLM-based predictions are especially pronounced in articles where negation or complex narratives are more prominent. We present comprehensive evidence of the predictive power of news on market movements in 16 global equity markets and news articles in 13 languages.

Keywords: natural language processing (NLP), large language models, BERT, GPT, LLaMA, ChatGPT, Bag-of-Words, Word2vec, machine learning, return prediction

1 Introduction

Economic text is constantly generated by human writers striving to understand and make predictions about economic phenomena. In recent decades, the finance literature has begun to extract information from certain text sources like the financial news press, regulatory filings, and social media. The research agenda of improving economic models through text mining remains in its earliest stages. Research has thus far examined only a limited portion of market-relevant textual data, often focusing on a single specialized data source at a time (e.g., the front page of The Wall Street Journal, or the "risk factor" section of 10-K filings). And for each data source, text information is often represented in rudimentary ways (e.g., as a dictionary-based sentiment score or as a "bag of words").
There are good reasons for the limited use of text data to date. Its lack of regular structure makes it far more difficult to work with than standard numeric data sets. Language is an extremely nuanced information encoding scheme. As a result, highly complex models are necessary to faithfully unearth information contained in text. But complex models are prohibitive for many researchers. Technological barriers to entry exclude researchers who lack the specialized skill sets necessary to operate such models. The high computational cost of complex models excludes other researchers who may possess requisite skills but face research funding constraints.
This means that recent textual analysis in finance and economics is the tip of the iceberg. Text is an underexploited data source for understanding asset markets. The challenges of textual analysis today portend an exciting research agenda tomorrow, in which economists gradually expand sourced text corpora and increasingly refine their ability to elicit information from that text.
In this paper we aim to take a step in this direction by constructing refined news text representations derived from large language models (LLMs) and then using these to improve models of expected stock returns. To better understand the role of LLMs, it is helpful to first grasp the current landscape in financial text mining. The most prevalent methods to date are supervised machine learning models that are customized to specific tasks such as forecasting returns (Ke et al. (2019); Jegadeesh and Wu (2013)), volatility (Manela and Moreira (2017)), or macroeconomic conditions (Kelly et al. (2018); Bybee et al. (2020)).
These analyses proceed in two general steps: a text representation step and an econometric modeling step. Step 1 decides on the numerical representation of the text data that will be passed to the Step 2 econometric model. The most common choice in the literature is "bag of words" (BoW), which collapses each document observation into a high dimensional vector of counts spanning all unique terms in the full corpus of documents. In some cases, the numerical representation stops here (e.g., Jegadeesh and Wu (2013); Kelly et al. (2018)). In other cases, the numerical representation is refined further. For example, Ke et al. (2019) reduce the BoW dimensionality from several tens of thousand to a few hundred terms with a correlation screening procedure to filter out irrelevant terms, and Bybee et al. (2020) reduce the dimensionality of counts with an unsupervised topic model.${ }^{1}$ The output of Step 1 is a numerical data matrix $X$ of dimension $D \times P$. Rows correspond to the $D$ documents in the text corpus, and each row contains the $P$-dimensional numerical vector representation of those documents (e.g., $P$ can be the number of terms in a BoW or the number of topics in a topic model). Step 2 uses $X$ as data in an econometric model to describe some economic phenomenon (e.g., return, volatility, and macroeconomic modeling in the references above).
The financial text representations referenced above have some limitations. First, all of these examples begin from a BoW representation, which is overly simplistic and only accesses the information in text that is conveyable by term usage frequency. It sacrifices nearly all information that is conveyed through word ordering or contextual relationships between terms. Second, the ultra-high dimensionality of BoW representations leads to statistical inefficiencies: Step 2 econometric models must include many parameters to process all these terms despite many of the terms conveying negligible information. Dimension reductions like LDA and correlation screening are beneficial because they mitigate the inefficiency of BoW. However, they are derived from BoW and thus do not avoid the information loss from relying on term counts in the first place. Third, and more subtly, the dimension-reduced representations are corpus specific. For example, when Bybee et al. (2020) build their topic model, the topics are estimated only from The Wall Street Journal, despite the fact that topics are general language structures and can be better inferred by using additional text outside of their sample.
Enter the concept of an LLM. LLMs are trained on large text data sets that span many sources and themes. The idea of an LLM is for a specialized research team to perform the Herculean feat of estimating a general purpose language model with astronomical parameterization on truly big text data. LLMs have billions of parameters and are trained on many millions of documents (including huge corpora of complete ebooks, all entries of Wikipedia, and more). But for each LLM, this feat is performed just once, then the estimated model is made available for distribution to be deployed by non-specialized researchers in downstream tasks.
In other words, the LLM delegates Step 1 of the procedure above to the handful of people in the world that can best execute it. A Step 2 econometric model can then be built around LLM output. Like LDA (or even BoW), the output of an LLM is a numerical vector representation (or "embedding") of a document. A non-specialized researcher obtains this output by feeding the document of interest through software (which is open-source in many cases). Therefore, an LLM model in Step 1 delivers a numerical matrix $X$ just like the examples above, making it seamless to integrate into Step 2 with little or no modification. The main benefit of an LLM in Step 1 is that it provides more sophisticated and well-trained text representations than used in the literature referenced above. This benefit comes from the expressivity of massive nonlinear model parameterizations and from training on extensive language examples across many domains and from throughout human history. The transferability of LLMs makes this unprecedented scale of knowledge available for finance research.

Our primary research contribution revolves around showcasing the advantages of LLM representations for effectively modeling stock returns. In addition, we compare the performance of LLMs with supervised machine learning models commonly used in the extant finance literature. To achieve this, we undertake two distinct econometric exercises that harness the power of text mining in understanding the financial market. The first exercise involves sentiment analysis, where we extract sentiment information from financial news text and examine how this information is incorporated into the dynamics of stock returns. In the second exercise, we directly leverage the predictive power of financial news text to model the short-term cross-section of expected stock returns.
We study three large-scale pre-trained LLMs: BERT (developed by Google), RoBERTa (by Meta), and LLaMA (LLaMA2) (by Meta). Additionally, we obtain embeddings from the OpenAI embedding model "text-embedding-3-large" via the API provided by OpenAI. We compare these with SESTM, a sentiment analyzer based on a BoW representation and trained on task-specific text data (developed by Ke et al. (2019)). We also study two other word-based models: Word2vec (a word-vector representation framework developed by Google) and the Loughran-McDonald Master Dictionary (LMMD). The inputs to our modeling framework are global news text data from Refinitiv in their Thomson Reuters Real-time News Feed (RTRS) and Third Party Archive (3PTY) databases from January 1996 to June 2019. We merge this with individual stock data from CRSP (for US stocks) and Datastream-EIKON (for international stocks).
We find the following main empirical results. First, econometric models that use pre-trained LLM embeddings outperform prevailing text-based machine learning return predictions. This is best summarized in terms of out-of-sample trading strategy performance. A quintile spread long-short strategy that buys stocks with high foundation-based return forecasts and sells those with low forecasts earns annualized Sharpe ratios of 3.60, 3.75, 3.89 (4.16), and 4.62 based on the BERT, RoBERTa, LLaMA (LLaMA2), and OpenAI-based (a.k.a. ChatGPT in the rest of the paper) models, respectively (gross of trading costs). All of these significantly outperform the corresponding strategies based on word-embedding forecasts, which earn annualized Sharpe ratios of 3.43, 3.06, and 2.29 for SESTM, Word2vec, and LMMD, respectively.
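For concreteness, the following is a minimal sketch of how such a quintile spread long-short portfolio and its annualized Sharpe ratio can be computed from model forecasts; the column names, equal weighting within quintiles, and daily rebalancing are illustrative assumptions rather than the paper's exact implementation.

```python
import numpy as np
import pandas as pd

def quintile_long_short(panel: pd.DataFrame) -> pd.Series:
    """Daily returns of a quintile-spread long-short portfolio.

    `panel` is assumed to have columns ['date', 'forecast', 'ret_fwd'],
    one row per stock-day, where `forecast` is the model's return forecast
    and `ret_fwd` is the realized next-period return.
    """
    def one_day(df: pd.DataFrame) -> float:
        q = pd.qcut(df["forecast"], 5, labels=False, duplicates="drop")
        long = df.loc[q == q.max(), "ret_fwd"].mean()   # top quintile, equal-weighted
        short = df.loc[q == q.min(), "ret_fwd"].mean()  # bottom quintile
        return long - short

    return panel.groupby("date").apply(one_day)

def annualized_sharpe(daily_ret: pd.Series, periods_per_year: int = 252) -> float:
    # Gross of trading costs, as in the headline numbers reported above.
    return np.sqrt(periods_per_year) * daily_ret.mean() / daily_ret.std()
```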
Furthermore, we delve into the analysis of the impact of news recency on the relative performance of different models. By focusing on articles labeled as “news alerts” by Refinitiv, we observe that returns remain predictable for significantly longer horizons compared to unflagged articles. Surprisingly, despite the brevity of news alerts, which often consist of only headlines, we find that the distinction between LLMs and word-based models becomes less pronounced. This suggests that the advantages of speed can overshadow differences in language model capacity when it comes to predicting returns based on recent news. In essence, a simple representation of recent news can yield comparable performance to more sophisticated representations of older news. However, as time elapses and the predictive information within the text gradually diminishes, the benefits of employing sophisticated models become comparatively more crucial.
We subsequently demonstrate the relationship between the complexity of LLMs and their performance in predicting returns. By employing a series of LLaMA models characterized by an escalating number of parameters, we illustrate that larger models typically surpass their smaller counterparts in terms of investment performance. The Sharpe ratios yielded by LLMs exhibit greater magnitudes, yet this improvement reaches a saturation point once the number of parameters exceeds 13 billion. This result suggests that while more complex LLMs possess enhanced capabilities in processing text, there is a limit to the benefits gained from increasing their complexity in terms of return prediction.
Our primary findings are derived from the US equity market, focusing on news articles written in English. Additionally, we analyze 16 international stock markets using news articles written in 12 other languages, including Chinese, Japanese, German, Italian, French, Swedish, Danish, Spanish, Finnish, Portuguese, Greek, and Dutch. As a preliminary contribution, we extend the analysis of Ke et al. (2019) to international text. We find similar SESTM performance globally to that documented by Ke et al. (2019) for the US sample. We also find that the general LLMs can, on average, outperform SESTM.
The rest of the paper is organized as follows. Section 2 introduces the LLMs and other approaches we compare. Section 3 presents an empirical analysis of stock-level news and return prediction in US and international markets using these methods. Section 4 concludes. The appendix provides additional tables and figures.

2 The Text Mining Framework

2.1 A Tale of Two Objectives

We employ a supervised approach to mine news text with two primary objectives. The first objective involves sentiment analysis, which entails assessing the tone of a news article. The second objective focuses on predicting the cross-section of returns within a short horizon.
While both sentiment analysis and return prediction illuminate the statistical correlation between news text and returns, they are but components of a broader narrative. We aim to develop trading strategies that can efficiently translate these statistical correlations into profitable investments. The economic impact of these gains serves as a formidable challenge to the efficient market hypothesis.
The efficient market hypothesis, our null hypothesis, posits that expected returns are primarily driven by unpredictable news which is rapidly, and in the most extreme cases instantaneously, assimilated into prices. On the other hand, our research presents an alternative hypothesis: the information contained within news text is not immediately and completely integrated into market prices. This delay might be attributed to factors such as limits-to-arbitrage and rational inattention, suggesting a predictive capacity of news text for future asset price trends.
While the adoption of this alternative hypothesis is largely uncontroversial, its profound significance cannot be overstated. Our predictive analysis provides novel insights into the velocity and magnitude of deviations from the efficient market hypothesis, furnishing fresh evidence to the pool of empirical studies examining this alternative hypothesis.
Sentiment analysis is commonly treated as a classification problem in machine learning. The primary aim is to delineate the relationship between specific text-based features, denoted as $x_{i,t}$, and their associated sentiment labels such as positive or negative, denoted by a binary variable $y_{i,t}$, based on a set of training articles.
The equation below posits the relationship between these labels and features:
$$\mathrm{E}\left(y_{i, t} \mid x_{i, t}\right)=\sigma\left(x_{i, t}^{\prime} \beta\right)$$
In this context, $\sigma(x)$ is a logistic link function, represented as $\sigma(x)=\exp(x)/(1+\exp(x))$. This function has been specifically designed to convert the features into a value in the range $[0,1]$, thereby standardizing the quantification of sentiment. This method enables us to derive a sentiment score for any article of the testing sample. The sentiment score quantifies the tone of an article: a score closer to one denotes a stronger positive sentiment.
To accomplish this, we require a sentiment label for each article in the training sample. Each of our news articles is tagged with a corresponding stock and includes a timestamp that records the timing of the news event. Drawing from the methodology presented in Ke et al. (2019), we employ the sign of the stock’s return, registered in close temporal relation to a news event, to assign a binary sentiment label (either positive or negative) to the relevant article. Although this label is inherently noisy, it is a simple and convenient alternative to manual labeling by expert readers. The empirical analysis conducted by Ke et al. (2019) demonstrates the effectiveness of this approach in measuring news sentiment and its robustness across different return definitions.
Recognizing that news articles often report on events from previous days, we create sentiment labels based on three-day returns, following Ke et al. (2019). This process involves analyzing returns from the day before the article’s publication up to the day after. This approach improves the signal-to-noise ratio in sentiment labeling, leading to greater accuracy in the sentiment score - a key goal in sentiment analysis which is to establish a meaningful connection between text and score. It is crucial to note that these three-day returns are utilized solely for in-sample training, thereby avoiding any look-ahead bias when generating sentiment scores for articles within the testing sample.
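As an illustration, the sketch below fits the sentiment classifier described above, assuming article-level embeddings and matched three-day returns are already available; the use of scikit-learn's logistic regression (with its default regularization) is an implementation assumption, standing in for any estimator of $\sigma(x_{i,t}'\beta)$.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical inputs: X_train holds article-level embeddings x_{i,t}
# (rows = training articles), and ret3d_train holds each article's
# three-day close-to-close return from the day before publication to
# the day after, used only in-sample to construct labels.
def fit_sentiment_model(X_train: np.ndarray, ret3d_train: np.ndarray):
    y_train = (ret3d_train > 0).astype(int)   # 1 = positive sentiment label
    # Logistic regression estimates beta in E(y|x) = sigma(x'beta).
    model = LogisticRegression(max_iter=1000)
    model.fit(X_train, y_train)
    return model

def sentiment_scores(model, X_test: np.ndarray) -> np.ndarray:
    # sigma(x'beta): probability of a positive label, in [0, 1];
    # values closer to one indicate a more positive tone.
    return model.predict_proba(X_test)[:, 1]
```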
Sentiment analysis does not directly provide a measurement of expected returns; instead, it merely reflects the tone present in news articles. Hence, we turn to a distinct approach for modeling expected returns, or stated differently, for predicting returns, to examine the extent to which information in news drives the short-term cross-sectional variation of expected returns. The simplest prediction model involves a standard panel regression. The regression equation translates article features, $x_{i,t}$, directly into the corresponding stock's expected return, $\mathrm{E}\left(r_{i, t+1}\right)$, for the next period:
$$\mathrm{E}\left(r_{i, t+1} \mid x_{i, t}\right)=x_{i, t}^{\prime} \theta$$
Inspired by the empirical analysis of Gu et al. (2020), we train this model by collectively considering the next-period returns across all stocks and time periods within our training sample. We can evaluate the effectiveness of our model by assessing its predictive performance in subsequent testing samples.
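A minimal sketch of this pooled panel regression is given below, assuming the article embeddings and next-period returns have been stacked across all stocks and dates in the training sample; the ridge penalty is an assumption that is natural for high-dimensional embeddings, not a choice dictated by the text.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Pooled panel regression E(r_{i,t+1} | x_{i,t}) = x_{i,t}' theta,
# estimated jointly across stocks and dates.
# X_train: stacked article-level embeddings; r_fwd_train: next-period returns.
def fit_return_model(X_train: np.ndarray, r_fwd_train: np.ndarray, alpha: float = 1.0):
    # alpha is a hypothetical shrinkage level; in practice it would be
    # tuned on a validation split.
    model = Ridge(alpha=alpha)
    model.fit(X_train, r_fwd_train)
    return model

def expected_returns(model, X_test: np.ndarray) -> np.ndarray:
    # x_{i,t}' theta_hat: the forecasts fed into the portfolio sorts above.
    return model.predict(X_test)
```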
This pooled panel regression model allows us to represent expected returns as a linear combination of article-level features. Similarly, our sentiment model represents the probability of a positive label through a sigmoid function of linear combinations of these features. It is important to note that linearity in model specification is not necessarily the optimal choice. Alternative approaches, such as incorporating neural networks or other nonlinear architectures on top of the $x_{i,t}$'s, are certainly viable. However, in order to emphasize the significance of text-based representations and highlight the role they play, we intentionally refrain from introducing such complex nonlinear models. This decision allows us to focus on the simplest model and emphasize the impact of text-based representations.
In the subsequent sections, we elucidate the procedure involved in deriving textual features $x_{i,t}$ from news text through various methodologies, initiating with cutting-edge LLMs in the domain of NLP. This stage is recognized as feature engineering within the parlance of machine learning. Following this, we will present several alternative methods that were proposed prior to the advent of the LLM era.

2.2 Large Language Models

LLMs represent an innovative approach within the Artificial Intelligence (AI) sphere, first gaining prominence within NLP. This methodology comprises a set of deep learning models, characterized by extensive parameterization and training on expansive datasets.
Distinguishing features of this paradigm pivot around a unique training process devoid of labeled data. Instead, it relies on self-supervised learning techniques. This involves randomly masking words within a text and predicting the masked terms, or through unsupervised language modeling, where the model maximizes the probability of predicting the subsequent sentence based on the current one. Once trained, these LLMs exhibit a remarkable capacity for transfer learning, a process by which the “knowledge” acquired from one task is applied to different tasks. This characteristic enhances their versatility and broadens their applicability across diverse domains.
State-of-the-art LLMs have been dominating performance benchmarks across various NLP tasks, primarily due to their expansive scale. They are often pre-trained on enterprise-level platforms by Google, OpenAI, Meta, etc, some of which have made their pre-trained models publicly available. Our work incorporates three distinct LLMs as benchmarks - Bidirectional Encoder Representations from Transformers (BERT) (Devlin et al. (2018)), Robustly Optimized BERT Pre-training Approach (RoBERTa) (Liu et al. (2019)), and Large Language Model Meta AI (LLaMA) (Touvron et al. (2023))
BERT holds a historical significance in the annals of LLMs and NLP as it marked a crucial shift in the creation and application of language models. Before BERT, models predominantly relied on unidirectional or superficially bidirectional understanding of text. BERT brought about a revolution with its deeply bidirectional model, allowing a contextual understanding from both preceding and succeeding words for prediction. This change sparked an influx of research and development in NLP, yielding more advanced models like GPT-2, GPT-3, and RoBERTa. As such, we adopt BERT as our initial benchmark model.
On the other hand, RoBERTa, an offshoot of BERT, was developed by Meta AI. The goal was to augment BERT’s performance by modifying its training regimen. Although it shares the foundational architecture with BERT, the variations in the pre-training phase lend RoBERTa its distinct identity. These changes led to substantial performance enhancements, allowing RoBERTa to outdo BERT in numerous NLP benchmark tests. Yet, the question remains whether a model’s proficiency in NLP tasks unequivocally translates to stronger performance in investment scenarios. This intriguing query is what our empirical analysis will shed light on.
LLaMA, developed by Meta AI, is another LLM that we consider in our analysis. It has been trained on a wide array of text data, including books, articles, and encyclopedias, and is designed to generate embeddings for various NLP tasks such as sentiment analysis and text classification. LLaMA is available in multiple versions with varying capacities, including LLaMA1 and LLaMA2, the latter being the more advanced iteration. These versions include models with 7 billion, 13 billion, and 33 billion parameters for LLaMA, and 7 billion, 13 billion, and 70 billion parameters for LLaMA2. We specifically utilize the LLaMA2 with 13 billion parameters and LLaMA with 13 billion parameters as benchmarks in our study.
One remarkable instance of an LLM is ChatGPT, a chatbot that swiftly garnered global recognition. ChatGPT was engineered based on the GPT-3.5 architecture. The initial breakthrough of the GPT model was made by Radford et al. (2018), who introduced a computational framework comprising 117 million parameters. This was subsequently enhanced by Radford et al. (2019) with the introduction of GPT-2, a more robust model featuring a staggering 1.5 billion parameters. Following this, GPT-3 was unveiled in Brown et al. (2020), which saw the model grow to more than tenfold the size of GPT-2. Although the most advanced GPT models have not been publicly released, OpenAI provides an API designed for generating embeddings using various models. We send our text strings to the embedding API endpoint and, in return, receive an embedding - a list of floating-point numbers that numerically represents the textual information. For our analysis, we used the "text-embedding-3-large" model from this API, referred to as ChatGPT in subsequent sections.
We now turn to details of our implementation from tokenization to feature construction for LLMs.

2.2.1 Tokenization

In any LLM framework, the starting point of a contextualized representation is tokenization. The smallest component of an article is known as a token. Tokens can manifest as characters, words, or subwords, each representing different forms of tokenization. Within LLMs, tokens typically take the form of subwords. Word-based tokenization, which partitions text into individual words based on specific delimiters, is most prevalent. LLMs implement similar tokenization algorithms, which effectively divide rare words into smaller, meaningful subwords. This method of subword tokenization helps to alleviate data sparsity, enabling token reuse and subsequently boosting their frequency of occurrence. Furthermore, it allows for the maintenance of a manageable vocabulary size. This is particularly beneficial given the vast array of different words, or surface forms, present in most languages, especially those that are morphologically rich.${ }^{2}$
Here's an example from a piece of news regarding Apple: "The Company also admitted that in addition to macroeconomics in the Chinese market, the price cuts to battery replacements a year earlier to fix the Company's prior surreptitious conduct had hurt iPhone sales." The BERT tokenizer, which utilizes WordPiece encoding, breaks down this sentence into a sequence of ordered tokens, totaling 43: 'the', 'company', 'also', 'admitted', 'that', 'in', 'addition', 'to', 'macro', '##economic', '##s', 'in', 'the', 'chinese', 'market', ',', 'the', 'price', 'cuts', 'to', 'battery', 'replacements', 'a', 'year', 'earlier', 'to', 'fix', 'the', 'company', "'", 's', 'prior', 'sur', '##re', '##pt', '##iti', '##ous', 'conduct', 'had', 'hurt', 'iphone', 'sales', '.'. In particular, the relatively rare word 'macroeconomics' is broken down into three tokens, and 'surreptitious' into five.
While RoBERTa employs the same architectural framework as BERT, it opts for byte-level Byte-Pair Encoding (BPE) for tokenization, akin to GPT-2 as presented by Radford et al. (2019). The use of byte-level tokenization enables RoBERTa to more effectively manage out-of-vocabulary words. Notably, LLaMA and LLaMA2 also adopt the same BPE tokenizer. Regarding the above example, the BPE tokenizer yields a total of 41 tokens, including punctuation:${ }^{3}$ 'The', 'ĠCompany', 'Ġalso', ..., with rare words again split into multiple tokens.
Both WordPiece and Byte-Pair encoding methods can handle words that are not in their initial vocabulary by breaking them down into smaller, known pieces. In practice, the differences in performance between these two methods tend to be relatively small.
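The sketch below reproduces the two tokenization schemes on the Apple sentence using the Hugging Face transformers library; the specific checkpoints (bert-base-uncased, roberta-base) are illustrative choices, not necessarily those used in our implementation.

```python
from transformers import AutoTokenizer

sentence = ("The Company also admitted that in addition to macroeconomics in the "
            "Chinese market, the price cuts to battery replacements a year earlier "
            "to fix the Company's prior surreptitious conduct had hurt iPhone sales.")

# WordPiece (BERT): rare words split into '##'-prefixed subwords,
# e.g. 'macro', '##economic', '##s'.
bert_tok = AutoTokenizer.from_pretrained("bert-base-uncased")
print(bert_tok.tokenize(sentence))

# Byte-level BPE (RoBERTa / GPT-2 style): a leading 'Ġ' marks a preceding space,
# e.g. 'ĠCompany', 'Ġalso'.
roberta_tok = AutoTokenizer.from_pretrained("roberta-base")
print(roberta_tok.tokenize(sentence))
```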

2.2.2 Transformer Architecture

The fundamental architecture of LLMs is rooted in a novel encoder design of deep neural networks, known as the transformer, which was introduced by Vaswani et al. (2017). The transformer encoder maps tokens into vector form, utilizing a series of attention layers, as conceptualized by Bahdanau et al. (2014) and Luong et al. (2015). This enables the modeling of token dependencies, irrespective of their respective positions in the input sequence. By implementing this technique, the traditional recurrent structure, which plays a crucial role in Recurrent Neural Networks (RNNs) and Long Short-Term Memory networks (LSTMs) for sequence processing, is effectively eliminated. Notably, the transformer approach bypasses limitations associated with parallelization and memory constraints, hence enhancing the model's scalability.
Although LLMs share core principles, their deep learning architectures differ, each tailored for specific NLP tasks. BERT and RoBERTa employ a bidirectional encoder, which generates contextual representations of tokens by considering both preceding and succeeding instances of their appearances. This architecture proves highly effective for tasks such as sentiment analysis and natural language understanding. On the other hand, LLaMA is a family of autoregressive large language models, which solely incorporates a decoder that translates vector embeddings into tokens, making it ideal for applications involving human-like conversation and natural language generation. Consequently, LLaMA’s contextualized embeddings are obtained from a unidirectional architecture.
In our specific context, the distinction between encoder and decoder networks has minimal significance. Both networks possess the capability to generate contextualized embeddings for each token in an input article. These embeddings consider not only the individual tokens themselves but also their respective positions within the article. These contextualized embeddings serve as the fundamental inputs to our procedure for modeling returns. Our empirical analysis reveals that, in terms of the architecture distinction, the impact is relatively minor compared to the complexity of the models, primarily determined by the number of parameters they possess.

2.2.3 Pre-training and Fine-tuning

The training step in this transfer learning context is often termed pre-training, serving as a means to an end. This step involves learning about a large number of model parameters (e.g., millions or even trillions) from an extremely large and diverse dataset (e.g., Wikipedia, Common Crawl, WebText). This process allows the model to understand the syntax and semantics of the language, including learning the meanings of words, how they are used in different contexts, and the general structure and grammar of the language.
BERT's pre-training process involves two parallel unsupervised tasks: the Masked Language Model (MLM) and Next Sentence Prediction (NSP). In the MLM task, 15% of tokens within the input sequence are randomly hidden, with the model then working to predict these obscured tokens. This MLM goal allows token representation to integrate context from both left and right directions. As implied by its name, NSP attempts to determine if two sentences are consecutive or not.
RoBERTa refines BERT's pre-training by omitting the NSP task and modifying the hyperparameters, including an extension in training duration, expansion of training data (from 16GB to 160GB of uncompressed texts), an increase in batch-training sizes (from 256 to 8k), and elongation of training sequences. Moreover, while BERT produces static masks just once during data preprocessing, RoBERTa improves upon this by generating masking patterns every time a sequence is input during training. This enhancement supports training across more steps and accommodates larger datasets.
The specific task used for pre-training LLaMA is called “next token prediction” or “autoregressive language modeling.” The model is trained to predict the next word in a sentence given all of the previous words. It does this by learning to understand the context provided by the previous words and using that context to predict the most likely next word. For example, given the input “The Fed raised interest”, the model might learn to predict the word “rate” as the next word. The model’s parameters are updated during pre-training to minimize the difference between the predicted words and the actual words in the training data.
The adaptation stage, also known as the fine-tuning stage, follows pre-training in the development of LLMs. After the LLM has been pre-trained on a massive corpus of data to understand language in a broad and general sense, it is then adapted or fine-tuned before its deployment to a specific task. This fine-tuning involves training the model on a smaller, task-specific dataset. The tasks can be diverse, such as text classification, sentiment analysis, question answering, summarization, and more. Unlike pre-training, which is unsupervised, fine-tuning is a supervised learning process, as it uses labeled data specific to the task at hand. During fine-tuning, the model’s parameters, which were learned during pre-training, are updated to optimize the model’s performance on the specific task. This process is usually faster and requires less data than pre-training because the model has already learned a lot of the necessary language understanding during pre-training. In this way, pre-training provides the model with general language understanding, while fine-tuning adapts that understanding to the specific requirements of a task. This two-stage training process has proven to be very effective for building LLMs that perform well across a wide range of NLP tasks.
Drawing inspiration from the work of Peters et al. (2018), we employ a feature extraction approach, also known as probing, by directly utilizing the pre-trained parameters to generate features associated with text data for downstream tasks. To be more specific, we input a new article into the pre-trained model, which results in each token within the article being represented as a vector. These vector representations effectively capture the contextual essence of the tokens. These representations are then utilized in subsequent downstream tasks for further exploration and application. While we can perceive the subsequent training stage as a form of fine-tuning, it is important to note that we do not update any parameters produced during the pre-training stage. This streamlined approach minimizes computational efforts, making it easier to replicate our empirical analysis. By adopting this feature extraction approach, we leverage the power of pre-trained models to extract meaningful features from text data without the need for extensive retraining. This efficient process allows us to focus on the downstream tasks at hand while benefiting from the comprehensive contextual understanding encapsulated in the pre-trained parameters.

2.2.4 Article-level Representations

Ultimately, we attempt to construct article-level representations, $x_{i,t}$, for subsequent classification and regression tasks. LLMs like BERT and RoBERTa can process input sequences of up to 512 tokens, translating these tokens into a 1024-dimensional vector representation. In contrast, LLaMA (LLaMA2) can manage sequences as long as 2,048 (4,096) tokens and embed each token into a 5,120-dimensional space.${ }^{4}$ In scenarios where an article exceeds the upper token limit, we focus solely on the initial segment up to that upper token boundary. Approximately 60% of US news articles comply with this length restriction. Subsequent empirical analyses suggest that the preliminary 512 tokens effectively encapsulate necessary information for return prediction. After acquiring vector representations for each token, we calculate the vector average across all tokens within an article. The resulting vector is then utilized to represent the entire article's information. Although we employ the mean of all token embeddings to derive an article-level embedding, alternate methods could be considered. For example, it's a common practice to use the embedding of the first token (often referred to as the CLS token) in BERT and RoBERTa, or the last token in LLaMA (LLaMA2), for downstream classification tasks. We have opted for the mean, as it constitutes a reasonable approach for other models like Word2vec, to which we draw comparisons.
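The sketch below illustrates this probing-plus-mean-pooling pipeline for a single article, assuming a frozen BERT-style encoder from the Hugging Face transformers library; the checkpoint name and the attention-mask-weighted average are illustrative implementation choices.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Illustrative checkpoint; the paper's exact model choice is not assumed here.
tokenizer = AutoTokenizer.from_pretrained("bert-large-uncased")
encoder = AutoModel.from_pretrained("bert-large-uncased")
encoder.eval()  # frozen: pre-trained parameters are never updated

@torch.no_grad()
def article_embedding(text: str) -> torch.Tensor:
    # Truncate to the model's 512-token limit, as described above.
    inputs = tokenizer(text, truncation=True, max_length=512, return_tensors="pt")
    hidden = encoder(**inputs).last_hidden_state        # (1, n_tokens, 1024)
    mask = inputs["attention_mask"].unsqueeze(-1)       # ignore padding positions
    # Average the token-level contextual embeddings to obtain x_{i,t}.
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)  # (1, 1024)
```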

2.2.5 Other Fine-Tuned BERT and Multi-Language BERT Models

Several open-source BERT models are available for various tasks. For instance, Araci (2019) fine-tuned a BERT model for a classification task based on the Financial PhraseBank dataset collected by Malo et al. (2014). This dataset includes roughly 5,000 labeled sentences, divided into three categories: positive, neutral, and negative. In a separate work, Yang et al. (2020) pre-trained a different BERT model based on financial communication text. This included Corporate Reports 10-K and 10-Q (comprising 2.5 billion tokens), Earnings Call Transcripts (1.3 billion tokens), and Analyst Reports (1.1 billion tokens), amounting to a total of 4.9 billion tokens in corpus size. This model was later fine-tuned using 10,000 manually annotated sentences (categorized as positive, negative, neutral) from analyst reports. Although this model was pre-trained with data highly relevant to the financial context, it does not leverage the expansive corpus the original BERT was trained on. Due to space constraints, we only provide comparison results based on Yang et al. (2020)'s FinBERT, as our empirical analysis suggests that it surpasses the performance of the model by Araci (2019) (detailed findings not reported).
Beyond English, BERT has been pre-trained with multilingual datasets, enabling its application in the analysis of other languages. Moreover, XLM-RoBERTa, as presented by Conneau et al. (2020), is a multilingual adaptation of RoBERTa. It was pre-trained on 2.5 TB of filtered CommonCrawl data encompassing 100 languages. We utilize XLM-RoBERTa large as an extension of RoBERTa in the analysis of non-English languages.

2.3 Word Embeddings

LLMs have evolved beyond an earlier text embedding paradigm that focused on learning morphological word representations as vectors. This progression is rooted in the principles of distributional semantics, as postulated by Harris (1954) and Firth (1957). According to their distributional hypothesis, a word is characterized by the context in which it appears. This idea has been utilized to represent word meanings as vectors, thereby encapsulating semantic similarity in terms of vector similarity. This approach allows for the creation of contextualized embeddings of words in a semantic vector space, capturing the nuanced meaning shifts induced by the contextual environment.
The concept of learning continuous representations of words has a deep-rooted history in NLP, tracing back to the work of Rumelhart et al. (1986). More recently, Mikolov et al. (2013) proposed a simplified approach called Word2Vec that generates high-dimensional vectors on very large corpora. In their work, Mikolov et al. (2013) presented two distinct neural network architectures: the Continuous Bag-Of-Words (CBOW) and the Skip-Gram models. The CBOW model predicts the current word based on its context, excluding the word itself. Conversely, the Skip-Gram model predicts the surrounding words given the current word. An illustrative example by Mikolov et al. (2013) that showcases the efficacy of this approach is the operation vector(“King”) - vector(“Man”) + vector(“Woman”). Interestingly, the resulting vector from this operation aligns most closely with the vector representation of the word “Queen.”
In contrast to LLMs that operate directly on input sequences of variable lengths (up to 512 tokens), Word2Vec employs a fixed-size context window for each word, typically encompassing 5 or 10 words around the current word. This approach limits its capacity to capture contextual information that extends beyond this window. Additionally, Word2Vec is built on a two-layer neural network architecture, a significantly less complex structure compared to the extensive deep neural network architectures employed by foundational models.
For our purposes, we downloaded pre-trained word vectors for English and other languages from fastText, an extension of Mikolov et al. (2013)'s Skip-gram model.${ }^{5}$ For English word vectors, we select the model wiki-news-300d-1M by Mikolov et al. (2018). This model contains 1 million word vectors trained on Wikipedia 2017, the UMBC webbase corpus, and the statmt.org news dataset, incorporating a total of 16 billion tokens. For non-English languages, the word vectors were trained on Common Crawl and Wikipedia data (Grave et al. (2018)). All these word vectors are 300-dimensional. As we do with LLMs, we calculate the average of all word vectors within a news article to derive the article-level embedding, which is subsequently fed into downstream regressions as features.
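A minimal sketch of this averaging step is given below, assuming the pre-trained fastText vectors have been downloaded locally in word2vec text format; the file name, whitespace tokenization, and lowercasing are simplifying assumptions.

```python
import numpy as np
from gensim.models import KeyedVectors

# Pre-trained fastText vectors in word2vec text format; the local file name
# is an assumption (the wiki-news-300d-1M release from fastText).
kv = KeyedVectors.load_word2vec_format("wiki-news-300d-1M.vec")

def article_embedding(text: str) -> np.ndarray:
    # Whitespace tokenization and lowercasing are simplifications; average
    # the 300-d vectors of in-vocabulary words to obtain the article-level embedding.
    vecs = [kv[w] for w in text.lower().split() if w in kv.key_to_index]
    return np.mean(vecs, axis=0) if vecs else np.zeros(kv.vector_size)
```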

2.4 Bag-of-Words

The Bag-of-Words (BOW) model, initially proposed by Harris (1954), represents an article as a vector of word frequencies. This representation takes into account the occurrence and frequency of words, but neglects grammar, word order, and the broader context.
We adhere to the SESTM approach proposed by Ke et al. (2019), which utilizes a structured sentiment model for BOW representations. This method is comprised of three steps. The first step identifies a list of terms (either unigrams or bigrams) most closely correlated with sentiment through a screening process. The second step assigns weights to these words by estimating a topic model. Finally, the third step aggregates these terms into an article-level sentiment score through penalized likelihood estimation. The simplicity, transparency, and theoretical soundness of this approach make it an appropriate BOW benchmark for our purposes.
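To fix ideas, the sketch below implements a simplified version of the first (screening) step on a bag-of-words count matrix; the symmetric threshold alpha and minimum count kappa are hypothetical tuning parameters, and the full SESTM estimator of Ke et al. (2019) additionally includes the topic-model and penalized-likelihood steps.

```python
import numpy as np
from scipy import sparse

def screen_sentiment_terms(counts: sparse.csr_matrix, pos_label: np.ndarray,
                           alpha: float = 0.1, kappa: int = 100) -> np.ndarray:
    """Correlation-screening step in the spirit of SESTM (Ke et al., 2019).

    counts: articles x vocabulary bag-of-words count matrix.
    pos_label: 1 if the article's three-day return is positive, else 0.
    Keeps terms whose frequency of co-occurrence with positive labels deviates
    from 1/2 by more than `alpha`, among terms appearing in at least `kappa`
    articles. Both thresholds are illustrative, not the paper's tuned values.
    """
    appears = (counts > 0).toarray().astype(float)       # article-term indicator
    doc_freq = appears.sum(axis=0)                       # number of articles containing term j
    pos_freq = appears[pos_label == 1].sum(axis=0)       # of those, how many have a positive label
    with np.errstate(divide="ignore", invalid="ignore"):
        f_j = np.where(doc_freq > 0, pos_freq / doc_freq, 0.5)
    keep = (np.abs(f_j - 0.5) > alpha) & (doc_freq >= kappa)
    return np.flatnonzero(keep)                          # indices of screened sentiment terms
```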
In addition to the SESTM approach, we incorporate the Loughran-McDonald Master Dictionary (LMMD) for financial sentiment analysis. LMMD, first proposed by Loughran and McDonald (2011) and specifically designed for financial contexts, is instrumental in accurately identifying and scoring financial terms, thus enhancing the precision of sentiment analysis in financial news.
Among the methods we consider, the SESTM stands out for its simplicity and transparency, but it falls short in accounting for contextual information. On the other hand, LLMs offer the capability to model intricate token connections in natural languages, albeit at the cost of being relatively opaque, often likened to “black boxes.” Word2Vec, in comparison, strikes a balance between complexity and capacity, providing context-sensitive embeddings with a simpler architecture. The comparative analysis of these methods offers insights into the degree of predictability derived from a broad spectrum of NLP techniques.

3 Empirical Analysis

3.1 Data and pre-processing

3.1.1 Stylized Facts

We have sourced our news text data from Refinitiv, a trusted global provider of financial market data. This dataset encompasses global news from both Thomson Reuters Real-time News Feed (RTRS) and the Third Party Archive (3PTY), spanning from January 1996 to June 2019. For US firms, the news falls into two distinct categories: articles and alerts. Regular news articles feature both a headline and a body of text, offering a comprehensive narrative of various firm events. In contrast, news alerts focus on delivering timely updates on emerging and unfolding news, and thus consist only of a headline. It is important to note that our US 3PTY database and our international news database do not include alerts. This news text data is then integrated with US equity data from the Center for Research in Security Prices (CRSP), and with international equity data obtained from EIKON (Datastream).
In preparing the news database for analysis, we have implemented several filters. First, we have retained only those news articles and alerts associated with a single stock for which three-day close-to-close returns are available. Furthermore, we have removed excessively short news articles with fewer than 100 characters, as well as extremely detailed reports exceeding 100,000 characters. Moreover, we have taken measures to remove redundant articles that essentially replicate the content of preceding stories. Redundancy has been assessed through the computation of a novelty score, derived from cosine similarity calculations based on the bag-of-words representations of any pair of articles. An article is classified as redundant if it attains a cosine similarity score of 0.8 or higher when compared with another article published within the preceding five business days. This process ensures the diversity of the dataset and also safeguards the novelty of the content, resulting in a substantial reduction of superfluous repetition. It is important to note that the removal of such repetition also enhances the signal-to-noise ratio, which is critical as we utilize firm returns as labels in our supervised learning tasks.
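To make the redundancy screen concrete, here is a schematic version of the cosine-similarity filter. It assumes articles arrive as an in-memory list with a `date` field already expressed in business days, a `stock` tag, and raw `text`; the whitespace tokenization and data layout are simplifying assumptions, not the paper's production pipeline.

```python
import math
from collections import Counter

def bow(text):
    """Toy bag-of-words representation via whitespace tokenization."""
    return Counter(text.lower().split())

def cosine(c1, c2):
    common = set(c1) & set(c2)
    num = sum(c1[w] * c2[w] for w in common)
    den = math.sqrt(sum(v * v for v in c1.values())) * math.sqrt(sum(v * v for v in c2.values()))
    return num / den if den else 0.0

def filter_redundant(articles, window_days=5, threshold=0.8):
    """Drop an article if its BOW cosine similarity with any retained article on the same
    stock from the preceding five business days reaches the 0.8 threshold."""
    kept, history = [], []  # history holds (date, stock, bow) of retained articles
    for a in sorted(articles, key=lambda x: x["date"]):
        rep = bow(a["text"])
        recent = [h for h in history if h[1] == a["stock"] and 0 <= a["date"] - h[0] <= window_days]
        if all(cosine(rep, h[2]) < threshold for h in recent):
            kept.append(a)
            history.append((a["date"], a["stock"], rep))
    return kept
```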
Table 1: Summary Statistics of US News Articles and Alerts

Note: In this table, we report the remaining sample size after each filter applied on the news articles and news alerts on the top and bottom panels, respectively. Columns under “Raw Articles/Alerts” present the numbers of available articles/alerts separately from Thomson Reuters Real-time News Feed (RTRS) and Archive (3PTY). Columns under “Articles/Alerts Tagged with Single Stock” present the number of articles/alerts tagged with a single stock. Columns “Articles/Alerts with Returns” present the number of remaining articles/alerts after matching returns data. Column “Filtering Short & Long Articles” reports the number of articles with at least 100 characters and at most 1,000,000 characters. Columns “First/Second In Take Sequences” report the number of alerts tagged as the first/second in a sequence of developing alerts. Columns “Filtering Redundancy” report the number of remaining articles/alerts after removing those similar to existing ones (cosine similarity score $> 0.8$) published in the preceding five business days.
Table 1 provides the statistical breakdown of news articles and alerts associated with the US market after applying various filters. The dataset comprises a substantial volume of over 3 million news articles. Notably, a significant proportion of these articles is sourced from third-party news providers. In terms of alerts, the dataset contains approximately 3 million alerts in total. Among these alerts, $55.3\%$ represent the initial news alerts within a sequence of unfolding alerts. $15.9\%$ of the alerts constitute the second in the sequence, while the remaining alerts fall into the category of third or subsequent alerts.
Figure 1 presents the analysis of news articles and alerts’ temporal distribution. Both categories share similar patterns, reflecting their intertwined nature. Annual data, displayed in the upper section, reveals a rising trend from 1996 to 2019, with a notable surge in 2008 during the global financial
Figure 1: US News Counts

Note: The top panel plots the annual time series of the total number of news articles/alerts, the middle panel plots the average number of news articles/alerts per half hour (24-hour local time), and the bottom panel plots the average number of news articles/alerts per calendar day. Since our sample ends in June 2019, the number of articles/alerts in 2019 in the first panel is estimated and thus highlighted in red.

crisis. Monthly patterns, in the middle section, show cyclical peaks in February, May, August, and November, an occurrence likely attributed to concentrated earnings events. It is noticeable that the phenomenon gets especially prominent in alerts. Daily trends, in the lower section, depict increased news frequency around market opening and closing times, mirroring trading activity ebbs and flows.
Beyond the US market, our analysis also incorporates international markets including China (HK), UK, Australia, Canada, Japan, Germany, Italy, France, Sweden, Denmark, Spain, Finland, Portugal, Greece, and the Netherlands. Figure 2 exhibits annual time series depicting the number of news articles for each international market. A summary of the market information and the processed dataset for each country is encapsulated in Table 2. Table IA10 in the appendix provides a summary

of the sample size after undergoing a similar step-by-step filtering process for international markets.

Notably, there exists significant variation in the volume of news articles among different countries, ranging from a minimum of 3,751 articles (Netherlands) to a maximum of 571,285 (UK). While data acquisition for most countries commenced in 1996, aligning with the initiation year of US data, certain countries’ datasets have later inception points due to gaps in news data provision. For instance, the Netherlands began data collection in September 2005. The monthly average of news-covered stocks varies from a minimum of 11 (Netherlands) to a maximum of 645 (Japan).
Table 2: Summary Statistics of International Markets
| | Language | Market Hours | # of Articles | Initial Day | Avg. # of Stocks | # of Days | Avg. # of News |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| US (Alert) | English | 09:30 - 16:00 | 2,935,852 | 1996-01-02 | 1,746 | 5,929 | 10,337 |
| US | English | 09:30 - 16:00 | 3,038,025 | 1996-01-02 | 2,593 | 5,933 | 10,697 |
| UK | English | 08:00 - 16:30 | 571,285 | 1996-01-02 | 454 | 6,087 | 2,011 |
| Australia | English | 10:00 - 16:00 | 249,190 | 1996-01-03 | 287 | 6,033 | 877 |
| Canada | English | 09:30 - 16:00 | 350,549 | 1996-01-03 | 406 | 6,032 | 1,234 |
| China (HK) | Chinese | 09:30 - 16:00 | 182,363 | 1996-01-03 | 247 | 5,768 | 642 |
| Japan | Japanese | 09:00 - 15:00 | 310,244 | 1996-01-05 | 645 | 5,875 | 1,092 |
| Germany | German | 09:00 - 17:30 | 178,039 | 1996-01-03 | 163 | 6,031 | 626 |
| Italy | Italian | 09:00 - 17:30 | 130,168 | 1996-01-05 | 97 | 5,778 | 458 |
| France | French | 09:00 - 17:30 | 153,779 | 1996-01-03 | 167 | 5,994 | 541 |
| Sweden | Swedish | 09:00 - 17:25 | 115,195 | 2001-06-07 | 170 | 4,629 | 526 |
| Denmark | Danish | 09:00 - 16:55 | 43,584 | 1996-01-22 | 37 | 4,559 | 156 |
| Spain | Spanish | 09:00 - 17:30 | 34,159 | 1996-01-05 | 37 | 5,520 | 120 |
| Finland | Finnish | 10:00 - 18:25 | 28,633 | 2003-01-03 | 50 | 4,025 | 143 |
| Portugal | Portuguese | 11:30 - 16:30 | 6,158 | 2005-05-13 | 11 | 2,616 | 36 |
| Greece | Greek | 10:15 - 05:20 | 7,710 | 2003-02-19 | 16 | 3,057 | 39 |
| Netherlands | Dutch | 09:00 - 17:30 | 3,751 | 2005-09-20 | 11 | 2,102 | 22 |
Note: This table summarizes market information and processed datasets for each country. The columns correspond to the language of news articles, local times corresponding to market hours, the overall count of news articles, the initial day of our sample period, the average number of available stocks per month, the total number of days with news articles, and the average monthly count of news articles.
Table 3: Summary Statistics of Characters/Tokens/Words in US News Articles and Alerts
| | Article | | | | | Alert | | | | |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| | 1% | 25% | 50% | 75% | 99% | 1% | 25% | 50% | 75% | 99% |
| # of Characters | 163 | 511 | 1566 | 3795 | 29887 | 32 | 63 | 78 | 103 | 160 |
| # of LLaMA Tokens | 59 | 175 | 451 | 978 | 10029 | 19 | 33 | 41 | 51 | 79 |
| # of RoBERTa Tokens | 43 | 129 | 348 | 783 | 7076 | 14 | 25 | 31 | 38 | 59 |
| # of BERT Tokens | 44 | 129 | 352 | 802 | 7234 | 9 | 19 | 23 | 28 | 42 |
| # of Words | 6 | 30 | 89 | 240 | 1896 | 0 | 3 | 5 | 7 | 14 |
Note: Row “# of Characters” reports the percentiles of the number of characters in the raw article. Rows “# of LLaMA Tokens”, “# of RoBERTa Tokens”, and “# of BERT Tokens” report the percentiles of the number of tokens converted from news text using each model’s specific tokenizer. Row “# of Words” reports the percentiles of the number of words extracted from an article (after removing pronouns/stop words) that are used by SESTM/Word2vec.
Figure 2: Total Number of News Articles/Alerts

Note: This figure plots the total number of news articles each year for all international markets. The blue dashed line plots the number of news alerts for the US equity market.

3.1.2 News Embeddings

We use both word-based models and Large Language Models (LLMs) to generate embeddings for news articles and alerts. LLMs operate at the level of tokens, whereas word-based models take individual words as inputs. Detailed statistics on token and word counts can be found in Table 3, and Table 4 presents the corresponding counts for international news articles.
Word-based models like Word2Vec and SESTM require meticulous data preprocessing to operate effectively at the word level. We follow the procedure outlined in Ke et al. (2019) to derive Bag-of-Words (BOW) representations for news articles. This comprehensive preprocessing includes converting text to lowercase, expanding contractions (e.g., “haven’t” to “have n’t”), lemmatization (reducing words to their base forms), tokenization, and removal of pronouns, proper nouns, punctuation, special symbols, numbers, non-English words (for English texts), and common stop words like “and,” “the,” and “is.” To carry out this preprocessing across all languages, we utilize the natural language processing package “spaCy”. Each article is then represented by its word-count vector, ensuring accurate word-based embeddings.
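The sketch below condenses this spaCy-based cleaning into a single function. The model name `en_core_web_sm` and the exact filter set are illustrative assumptions; the paper's full recipe also expands contractions and removes non-English words, and applies language-specific pipelines per country.

```python
import spacy
from collections import Counter

# Load a small English pipeline; NER and the dependency parser are not needed for this step.
nlp = spacy.load("en_core_web_sm", disable=["ner", "parser"])

def to_bow(text: str) -> Counter:
    """Lemmatize and drop pronouns, proper nouns, punctuation, numbers, and stop words."""
    doc = nlp(text)
    kept = [
        tok.lemma_.lower()
        for tok in doc
        if tok.is_alpha                        # drops punctuation, special symbols, and numbers
        and not tok.is_stop                    # drops common stop words ("and", "the", "is", ...)
        and tok.pos_ not in ("PRON", "PROPN")  # drops pronouns and proper nouns
    ]
    return Counter(kept)                       # the article's word-count (bag-of-words) vector

# Example: to_bow("Shares of the company have n't recovered since the recall.")
```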
In contrast, Large Language Models, such as BERT, RoBERTa and LLaMA, possess the advantage of accepting raw, unprocessed text as input. This capability significantly reduces the need for extensive data cleaning. We have selected specific pre-trained LLMs for each country, as detailed in Table IA11 in the Appendix.
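As one concrete (and simplified) way to obtain such embeddings, the sketch below mean-pools the last hidden layer of a pre-trained BERT loaded through Hugging Face `transformers`. The model name, the pooling rule, and the 512-token truncation are our assumptions for illustration, not necessarily the paper's exact extraction procedure.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-large-uncased")
model = AutoModel.from_pretrained("bert-large-uncased")
model.eval()

@torch.no_grad()
def embed(text: str) -> torch.Tensor:
    """Map raw news text to a fixed-length embedding by mean-pooling the last hidden layer."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    hidden = model(**inputs).last_hidden_state       # shape (1, seq_len, hidden_dim)
    mask = inputs["attention_mask"].unsqueeze(-1)    # ignore padding positions when averaging
    return (hidden * mask).sum(1) / mask.sum(1)      # article-level embedding, shape (1, hidden_dim)

# x = embed("Company X reported quarterly earnings well above analyst expectations.")
```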
Table 4: Summary Statistics of Tokens/Words in International News Articles
| | LLaMA | | | | | Word2vec/SESTM | | | | |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| | 1% | 25% | 50% | 75% | 99% | 1% | 25% | 50% | 75% | 99% |
| US | 59 | 175 | 451 | 978 | 10029 | 6 | 28 | 88 | 239 | 1882 |
| UK | 77 | 247 | 590 | 1184 | 18280 | 7 | 40 | 108 | 247 | 3106 |
| Australia | 62 | 69 | 219 | 937 | 24439 | 6 | 19 | 38 | 199 | 4578 |
| Canada | 76 | 356 | 823 | 1478 | 12071 | 7 | 59 | 193 | 382 | 2670 |
| China | 43 | 52 | 65 | 124 | 5265 | 14 | 16 | 21 | 39 | 1880 |
| Japan | 155 | 240 | 365 | 459 | 1636 | 71 | 99 | 134 | 159 | 756 |
| Germany | 70 | 314 | 508 | 1026 | 5492 | 8 | 52 | 90 | 171 | 975 |
| Italy | 47 | 277 | 533 | 1476 | 9998 | 11 | 64 | 126 | 256 | 2276 |
| France | 75 | 284 | 528 | 1093 | 10283 | 13 | 66 | 126 | 271 | 2151 |
| Sweden | 85 | 260 | 630 | 1034 | 5561 | 13 | 54 | 129 | 218 | 1214 |
| Denmark | 64 | 248 | 439 | 781 | 4124 | 8 | 40 | 77 | 144 | 651 |
| Spain | 40 | 101 | 260 | 468 | 12944 | 6 | 18 | 46 | 92 | 1807 |
| Finland | 222 | 561 | 857 | 1541 | 20035 | 24 | 83 | 139 | 265 | 2885 |
| Portugal | 61 | 161 | 313 | 537 | 1617 | 5 | 18 | 61 | 124 | 394 |
| Greece | 136 | 552 | 960 | 1778 | 4791 | 16 | 55 | 94 | 171 | 460 |
| Netherlands | 127 | 450 | 741 | 1208 | 6964 | 18 | 84 | 146 | 242 | 1282 |
Note: Columns under “LLaMA” report the percentiles of the number of tokens converted from text using specific tokenizers for each country. Columns under “Word2vec/SESTM” report the percentiles of the number of words extracted from an article (after removing pronouns/stop words).

3.2 Model Training

On the basis of Word2vec and LLMs, we commence by acquiring $P$-dimensional pre-trained features, denoted as $x_{i, t}$, for each news article $i$ at time $t$ within our sample dataset. In sentiment analysis, we train model (1) using a cross-entropy loss. With respect to predicting the cross-section of returns, we employ a penalized squared-error loss for training model (2), applying a ridge penalty as a means of regularization for the overall robustness of our models.
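Concretely, once the article-level embeddings are in hand, the downstream fits reduce to standard regressions. The snippet below is a stylized sketch with synthetic data and placeholder hyperparameters, standing in for the paper's models (1) and (2), which are defined earlier in the text.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 300))           # article embeddings (e.g., averaged word vectors)
r = rng.normal(scale=0.02, size=1000)      # realized three-day returns (synthetic here)
y = (r > 0).astype(int)                    # binary label for the sentiment-analysis task

clf = LogisticRegression(C=1.0, max_iter=1000).fit(X, y)   # trained with cross-entropy loss
prob_up = clf.predict_proba(X)[:, 1]                       # sigma(x'beta): P(positive return)

reg = Ridge(alpha=10.0).fit(X, r)                          # ridge-penalized squared-error loss
expected_return = reg.predict(X)
```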
Notably, SESTM introduces a distinctive structural assumption that sets it apart from conventional word-based models. Consequently, its training and prediction methodologies diverge from the norm. The SESTM framework imposes structural assumptions on the BOW representation of article $i$ at time $t$, $d_{i, t}$, and its associated sentiment score $p_{i, t}$:

$$
\mathrm{P}\left(\operatorname{sgn}\left(y_{i, t}\right)=1\right)=g\left(p_{i, t}\right), \quad d_{[S], i, t} \sim \operatorname{Multinomial}\left(s_{i, t},\ p_{i, t} O_{+}+\left(1-p_{i, t}\right) O_{-}\right)
$$

where $g(\cdot)$ is some increasing function, $s_{i, t}$ is the total number of sentiment-charged words in article $i$ at time $t$, $O_{+}$ and $O_{-}$ are $|S| \times 1$ vectors of parameters, the set $S$ is the part of the vocabulary consisting of an exclusive list of sentiment-charged words, and $d_{[S], i, t}$ is the subvector of $d_{i, t}$ whose rows correspond to words in $S$. Ke et al. (2019) proposes this model and suggests that SESTM's training involves the construction of in-sample estimates for various variables. To be more specific, we can construct $\widehat{S}$ by screening based on how frequently each word appears in a positive article and construct $\widehat{O}_{\pm}$ by running regressions of sentiment word frequencies of each article onto pilot estimates of in-sample

sentiment scores. Based on the estimated $\widehat{O}_{\pm}$ and $\widehat{S}$, we are able to compute the maximum likelihood estimate of the sentiment score for an article out of sample. Several tuning parameters are involved in the process, including three in the screening step and one shrinkage parameter in the (penalized) likelihood estimation step.
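For readers who prefer code to notation, the following is a heavily simplified sketch of the three SESTM steps on toy count matrices. The symmetric screening threshold `alpha`, the count floor `kappa`, the grid search, and the shrinkage weight `lam` are placeholder choices; the full estimator (with separate positive/negative screening thresholds and tuned parameters) follows Ke et al. (2019).

```python
import numpy as np

def screen_words(D, y, alpha=0.1, kappa=5):
    """Step 1 (screening): keep words that co-occur with positive returns unusually often or rarely.
    D: (n_articles, n_words) word-count matrix; y: realized returns used as labels."""
    appears = D > 0
    k = appears.sum(axis=0)                                    # articles containing each word
    pos = (appears & (y > 0)[:, None]).sum(axis=0)             # ... of which have a positive return
    f = np.divide(pos, k, out=np.full(k.shape, 0.5), where=k > 0)
    return np.where((k >= kappa) & (np.abs(f - 0.5) >= alpha))[0]

def estimate_O(D_S, y):
    """Step 2 (topic model): regress within-S word frequencies on pilot sentiment scores (return ranks)."""
    p_hat = y.argsort().argsort() / (len(y) - 1)               # pilot scores in [0, 1]
    h = D_S / np.maximum(D_S.sum(axis=1, keepdims=True), 1)    # sentiment-word frequencies per article
    W = np.column_stack([p_hat, 1 - p_hat])
    O = np.linalg.lstsq(W, h, rcond=None)[0]                   # two rows approximating O_+ and O_-
    return np.clip(O, 1e-8, None)

def score_article(d_S, O, lam=1.0, grid=np.linspace(0.01, 0.99, 99)):
    """Step 3 (scoring): penalized maximum-likelihood sentiment score for a new article."""
    ll = [d_S @ np.log(p * O[0] + (1 - p) * O[1]) + lam * np.log(p * (1 - p)) for p in grid]
    return grid[int(np.argmax(ll))]
```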
We train each model using annually updated rolling windows. Each rolling window consists of an eight-year in-sample interval, with the first six years used for training and the next two for validation. The subsequent one-year period is then set aside for out-of-sample testing. As a result, the out-of-sample data range from 2004 to 2019, totaling 16 years. The tuning parameters are selected in the validation sample.
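The window arithmetic is simple enough to state in a few lines. The helper below enumerates the annually refreshed splits implied by the text (six training years, two validation years, one test year), purely as a bookkeeping illustration.

```python
def rolling_windows(first_train_year=1996, last_test_year=2019):
    """Enumerate annually updated rolling windows: 6 training years, 2 validation years, 1 test year."""
    windows = []
    test_year = first_train_year + 8            # first out-of-sample year (1996 + 8 = 2004)
    while test_year <= last_test_year:
        windows.append({
            "train": list(range(test_year - 8, test_year - 2)),   # 6 years
            "valid": list(range(test_year - 2, test_year)),       # 2 years
            "test": [test_year],                                   # 1 year
        })
        test_year += 1
    return windows

# rolling_windows()[0] -> train 1996-2001, valid 2002-2003, test 2004; the last window tests 2019.
```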

3.3 Portfolio Performance

3.3.1 Sentiment Analysis

The sentiment analysis aims to predict a binary outcome: one indicating a positive return and zero signifying otherwise. The fitted value of the logistic regression, $\sigma\left(x^{\prime} \widehat{\beta}\right)$, is an estimate for the probability of a positive outcome for an article with feature $x$. A true positive (TP) or true negative (TN) occurs when a predicted “up” probability of greater than $50\%$ coincides with a positive realized return and a probability less than $50\%$ coincides with a negative return.${ }^{6}$ False positives and negatives (FP and FN) are the complementary outcomes. We calculate classification accuracy as follows:

$$
\text { Accuracy }=(T P+T N) /(T P+T N+F P+F N)
$$
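Equivalently, since every article falls into exactly one of the four cells, accuracy is the share of sign agreements between the thresholded probability and the realized return. A minimal implementation, assuming `prob_up` and `r` are NumPy arrays of predicted probabilities and realized returns, is below.

```python
import numpy as np

def classification_accuracy(prob_up, r, threshold=0.5):
    """Accuracy = (TP + TN) / (TP + TN + FP + FN) at the 50% probability cutoff."""
    pred_up = prob_up > threshold
    realized_up = r > 0
    tp = np.sum(pred_up & realized_up)
    tn = np.sum(~pred_up & ~realized_up)
    return (tp + tn) / len(r)   # denominator equals TP + TN + FP + FN
```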
Table 5 presents the yearly out-of-sample prediction accuracy, offering several noteworthy observations. First, the first six models consistently outperform a random guess (50%) in terms of average accuracy over these years. Remarkably, the Language Models (ChatGPT, LLaMA(LLaMA2), RoBERTa, and BERT) exhibit higher overall accuracy compared to the word-based models (Word2vec and SESTM), with ChatGPT achieving an average accuracy of $54.28\%$.
However, it’s important to note that even the best-performing model, ChatGPT, does not show a significant increase in accuracy compared to a random guess. These statistical artifacts are primarily due to market efficiency. In a well-functioning market, unpredictable news dominates equity returns, resulting in a small predictable component. This explains why all models achieve accuracy slightly above $50\%$. Nevertheless, this modest level of predictability can still lead to substantial gains in terms of investment performance.
To evaluate out-of-sample predictive performance in economic terms, we introduce a trading strategy that capitalizes on sentiment estimates for investment decisions. Our trading approach is straightforward: we construct a zero-net-investment portfolio by taking long positions in the top

  1. *We benefited tremendously from discussions with seminar and conference participants at NYU Courant Institute, PBC School of Finance at Tsinghua University, Zhejiang University, Central University of Finance and Economics, 2023 China Fintech Research Conference, 5th Annual NLP and Machine Learning in Investment Management Conference, 2023 China International Conference in Finance, WorldQuant 2023, Federal Reserve Bank of Philadelphia, Shandong University, AAAI 2023 Summer Symposium, Workshop on Frontiers of Statistics and Data Science, NLP SoDaS conference 2023, Wolfe Research, Risk Day at ETH 2023, GSU-RFS FinTech Conference, Macquarie University, Australian National University, SQA/CQA Joint Webinar, Brazilian Stock Exchange, UIC College of Business and Banco de Portugal. We gratefully acknowledge the computing support from the Research Computing Center and data support from the Fama-Miller Center at the University of Chicago. AQR Capital Management is a global investment management firm, which may or may not apply similar investment techniques or methods of analysis as described herein.
  2. ${ }^{1}$ Specifically, they employ latent Dirichlet allocation (LDA), which can be thought of as a multinomial principal components estimator. This collapses their roughly 20,000-dimensional term count representation for each document to a 180-dimensional topic attention representation.
  3. ${ }^{2}$ BERT has 30K tokens, LLaMA(LLaMA2) has 32K tokens, and RoBERTa uses about 50K. In contrast, some delimiter-based tokenization may result in a vocabulary size over 250K.

${ }^{3}$ A character ‘Ġ’ is automatically added to represent the space before a word in the original input sentence.
  4. ${ }^{4}$ Specifically, we chose BERT large, RoBERTa large, and LLaMA(LLaMA2) (13 billion) as our benchmark set of LLMs. Their total parameters are, respectively, 0.345B, 0.354B, and 13B for BERT, RoBERTa, and LLaMA(LLaMA2).
  5. ${ }^{5}$ FastText, developed by Bojanowski et al. (2017), was chosen due to its multilingual support. Although we adopted the fastText package, we consistently use the term “Word2Vec” for ease of understanding.
  6. ${ }^{6}$ The threshold of $50\%$ is a natural cutoff for positive sentiment score. Alternatively, we also consider the unconditional up probability as a threshold for the data. As we subsequently show, the empirical results remain consistent as we only trade stocks with extreme sentiment scores.