StockFormer: Learning Hybrid Trading Machines with Predictive Coding

Siyu Gao, Yunbo Wang* and Xiaokang Yang
MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University
{siyu.gao, yunbow, xkyang}@sjtu.edu.cn

Abstract

Typical RL-for-finance solutions directly optimize trading policies over the noisy market data, such as stock prices and trading volumes, without explicitly considering the future trends and correlations of different investment assets as we humans do. In this paper, we present StockFormer, a hybrid trading machine that integrates the forward modeling capabilities of predictive coding with the advantages of RL agents in policy flexibility. The predictive coding part consists of three Transformer branches with modified structures, which respectively extract effective latent states of long-/short-term future dynamics and asset relations. The RL agent adaptively fuses these states and then executes an actor-critic algorithm in the unified state space. The entire model is jointly trained by propagating the critic’s gradients back to the predictive coding module. StockFormer significantly outperforms existing approaches across three publicly available financial datasets in terms of portfolio returns and Sharpe ratios.

1 Introduction

Reinforcement learning (RL) has shown promising results in practical applications of financial decision-making problems, such as improving stock trading strategies by identifying promising buying and selling points [Liu et al., 2021; Zhong et al., 2020]. A common practice is to formulate the portfolio optimization problem as a Markov decision process (MDP) [Puterman, 2014] and directly perform model-free RL algorithms in the state space represented by the observed data (e.g., stock prices, trading volumes, and technical indicators). However, it makes an excessively strong assumption that the observed data is sufficiently informative and can well represent (i) the correlations between hundreds (or thousands) of stocks and (ii) the underlying (or even future) dynamics in the rapidly changing financial markets.
To address this issue, we consider the way in which humans make investment decisions. There are two factors that require special consideration, that is, the dynamic stock correlations and the expected returns of each stock in both long- and short-term horizons. We present StockFormer, a novel RL agent that learns to adaptively discover and capitalize on promising trading opportunities. It is a hybrid trading machine that integrates the forward modeling capabilities of predictive coding with the advantages of actor-critic methods for trading flexibility. Predictive coding [Elias, 1955; Spratling, 2017; Rao and Ballard, 1999] is one of the most successful self-supervised learning methods in natural language processing [Mikolov et al., 2013] and computer vision [Oord et al., 2018], whose core idea is to extract useful latent states from noisy market data that can maximally benefit the prediction of future or missing contextual information.
Specifically, we leverage three Transformer-like networks to respectively learn long-horizon, short-horizon, and relational latent representations from the observed market data. To ease the difficulty of concurrent time series modeling, StockFormer employs multi-head feed-forward networks in the attention block, which can maintain the diversity of temporal patterns learned from multiple concurrent market asset series (e.g., trading records of hundreds of stocks in the same time period). For policy optimization, the three types of latent states are adaptively and progressively combined through a series of multi-head attention structures. StockFormer exploits the actor-critic method but only propagates the critic’s analytic gradients back into the relational state encoder.
Notably, there exists another line of work that aims to improve future stock prediction accuracy with powerful time series modeling networks [Li et al., 2018; Feng et al., 2019; Wang et al., 2021; Duan et al., 2022]. They use fixed trading rules, such as “buy-and-hold”, which recycles capital from unsuccessful stock bets to average in on stocks that have promising future returns and hold them for a fixed period of time. As shown in Table 1, despite recent success, these models are not directly designed to maximize investment returns and cannot provide flexible trading decisions.
We empirically observe that StockFormer remarkably outperforms existing stock prediction and RL-for-finance approaches across three public datasets from NASDAQ and Chinese stock markets as well as the cryptocurrency market.
RL for finance. There have been many attempts to use RL methods to make trading decisions [Théate and Ernst, 2021; Weng et al., 2020; Liang et al., 2018; Benhamou et al., 2020].
| Model | Category | Trading policy | RL state space |
| :--- | :--- | :--- | :--- |
| FactorVAE [Duan et al., 2022] | Stock prediction | Fixed (e.g., buy-and-hold) | n/a |
| SARL [Ye et al., 2020] | RL-for-finance | Learned | Observed asset prices + Asset movement signal |
| StockFormer | Hybrid | Learned | Temporal + Relational predictive coding (via SSL) |
Table 1: StockFormer is a hybrid trading framework that is clearly different from the prior art of (a) stock prediction methods and (b) RL-for-finance methods. In SARL, a typical "asset movement signal" is the financial news embedding. "SSL": self-supervised learning.
Main differences of these models include: the definition of the input states [Zhong et al., 2020; Liu et al., 2021; Weng et al., 2020], the engineering of reward functions [Liang et al., 2018; Hu and Lin, 2019], and the RL algorithms [Benhamou et al., 2020; Suri et al., 2021; Huotari et al., 2020]. SARL [Ye et al., 2020] proposed to learn policy in the noisy observation space and expand the space with additional asset movement signals such as financial news embeddings. FinRL [Liu et al., 2021] integrates multiple off-the-shelf RL algorithms such as Soft Actor-Critic (SAC) [Haarnoja et al., 2018] and DDPG [Lillicrap et al., 2016]. It defines the states as a combination of the covariance matrix of the close prices of all stocks and the MACD indicators, whose dimension will sharply increase with the growth of the number of stocks. We use its SAC implementation as the baseline of StockFormer.
Stock prediction. Mainstream stock prediction approaches can be divided into three categories. CNN-based models [Hoseinzade and Haratizadeh, 2019; Wen et al., 2019] take the historical data of different stocks as a set of input feature maps of a convolutional neural network. RNN-based models [Li et al., 2018; Zhang et al., 2017; Qin et al., 2017] are better at capturing the underlying sequential trends in stocks. Other network architectures use attention mechanisms [Hu et al., 2018; Li et al., 2018; Wu et al., 2019], dilated convolutions [Wang et al., 2021; Cho et al., 2019], or graph neural networks [Feng et al., 2019; Wang et al., 2021; Patil et al., 2020] to jointly model the long-term dynamics and the relations of trading assets. Inspired by previous Transformer-based architectures [Vaswani et al., 2017; Kitaev et al., 2020; Zhou et al., 2021; Wu et al., 2021; Fedus et al., 2021], StockFormer integrates Transformers into the RL-for-finance framework. It borrows the idea of self-supervised predictive coding as a representation learning method to extract useful and compact latent states from noisy and high-dimensional market data.

3 Portfolio Optimization as POMDPs

We believe that one of the key challenges of RL-for-finance is to extract useful states that can reflect the essential dynamics of the market from noisy, high-dimensional raw transaction records. Therefore, unlike previous work that commonly treats portfolio optimization as an MDP with the 5-tuple $(\mathcal{S}, \mathcal{A}, \mathcal{T}, \mathcal{R}, \gamma)$, in this paper we treat it as a partially observable Markov decision process (POMDP) with the 7-tuple $(\mathcal{O}, \mathcal{S}, \mathcal{A}, \mathcal{T}, \mathcal{Z}, \mathcal{R}, \gamma)$, where $\mathcal{O}$ is the observation space of the noisy market data, $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $\mathcal{T}\left(s_{t+1} \mid s_{t}, a_{t}\right)$ denotes the state transition probability, and $\mathcal{Z}\left(O_{t} \mid s_{t}\right)$ is the prior distribution of the observed data $O_{t}$ given the latent state $s_{t}$. Assuming a fixed length of $P$ steps in each episode (e.g., we use 1,000 trading days for the stock trading experiments), the goal is to learn the optimal policy $\pi^{*}$ that maximizes the total payoff $G_{t}=\mathbb{E}_{\tau \sim \pi}\left[\sum_{t=1}^{P} \gamma^{t-1} R_{t}\right]$, where $R_{t}$ is the immediate reward at each time step drawn from the reward function $\mathcal{R}(s, a)$ and $\gamma \in(0,1)$ is the discount factor of future rewards. We give detailed definitions of $\mathcal{O}$, $\mathcal{S}$, $\mathcal{A}$, and $\mathcal{R}$ as follows.
Noisy observation space $(\mathcal{O})$. Let us take the stock trading problem as an example. The observed data includes:
  • Raw trading records $O_{t}^{\text{price}} \in \mathbb{R}^{T \times N \times 5}$ that involve daily opening, close, highest, and lowest prices, and trading volume during the past $T$ days, where $N$ is the number of stocks.
  • Technical indicators $O_{t}^{\text{stat}} \in \mathbb{R}^{N \times K}$, where $K$ is the number of indicators that capture the dynamics of the time series.
  • A covariance matrix $O_{t}^{\text{cov}} \in \mathbb{R}^{N \times N}$ between sequences of daily close prices of all stocks over a fixed period before $t$.

In the following sections, we use $O_{t}^{\text{relat}} \in \mathbb{R}^{N \times(K+N)}$ to denote the concatenation of $O_{t}^{\text{stat}}$ and $O_{t}^{\text{cov}}$.
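The shapes of the three observation components and their concatenation can be illustrated with a minimal NumPy sketch. The toy sizes, the random placeholder data, the feature order inside the last axis of the price tensor, and the use of the full look-back window for the covariance are all assumptions made for illustration; the text above only fixes the tensor shapes.

```python
import numpy as np

rng = np.random.default_rng(0)
T, N, K = 60, 4, 3  # look-back days, stocks, indicators (toy sizes, assumed)

# Raw trading records: (open, close, high, low, volume) per stock per day.
# The order of the five features is an assumption.
o_price = rng.uniform(10.0, 20.0, size=(T, N, 5))

# Technical indicators: random placeholders stand in for real indicators.
o_stat = rng.normal(size=(N, K))

# Covariance between the close-price series of all stocks.
close = o_price[:, :, 1]               # shape (T, N)
o_cov = np.cov(close, rowvar=False)    # shape (N, N), columns are stocks

# Relational observation: per-stock concatenation of indicators and covariances.
o_relat = np.concatenate([o_stat, o_cov], axis=1)  # shape (N, K + N)
```

Each row of `o_relat` describes one stock by its own indicators followed by its covariance with every stock in the pool.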

Compositional state space $(\mathcal{S})$. The state space of StockFormer is composed of three types of latent states and the explicit account state $s_{t}^{\text{amount}} \in \mathbb{R}^{N+1}$ that represents the total account balance and the holding amount of each trading asset.

Action space $(\mathcal{A})$. We have a continuous action space such that $a_{t} \in \mathbb{R}^{N}$, indicating the amount of buying, holding, or selling shares on each trading asset. To simulate real-world trading scenarios, we discretize $a_{t}$ into multiple intervals of daily trading signals when interacting with the environment.
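One way such a discretization could look is sketched below. The text does not specify the scheme, so the clipping range $[-1, 1]$, the `max_shares` cap, and the `hold_band` dead zone are hypothetical choices made purely for illustration.

```python
def discretize_action(a, max_shares=100, hold_band=0.1):
    """Map one continuous action component to a signed integer share count.

    Assumptions (not from the paper): actions are clipped to [-1, 1],
    small magnitudes mean "hold", and the magnitude scales linearly
    to at most `max_shares` shares per day.
    """
    a = max(-1.0, min(1.0, a))         # clip to the assumed action range
    if abs(a) < hold_band:
        return 0                       # hold
    return int(round(a * max_shares))  # positive: buy, negative: sell


# One action component per asset.
signals = [discretize_action(a) for a in (0.73, -0.05, -0.4)]
```

Here the three components translate to buying the first asset, holding the second, and selling the third.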

Reward function $(\mathcal{R})$. The reward $R_{t} \sim \mathcal{R}\left(s_{t}, a_{t}\right)$ is defined as the daily portfolio return ratio.
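As a small worked example of the reward and of the discounted payoff $\sum_{t=1}^{P} \gamma^{t-1} R_{t}$, the sketch below evaluates both on a single trajectory of account values. The exact form of the return ratio, $(V_{t}-V_{t-1}) / V_{t-1}$, and the sample values are assumptions; the expectation over trajectories is omitted.

```python
def daily_return_ratio(value_prev, value_now):
    """Relative change of the total portfolio value over one day (assumed form)."""
    return (value_now - value_prev) / value_prev


def discounted_return(rewards, gamma):
    """Sum of gamma^(t-1) * R_t for t = 1..P along one trajectory."""
    total = 0.0
    for t, r in enumerate(rewards, start=1):
        total += gamma ** (t - 1) * r
    return total


# Hypothetical account values on four consecutive trading days.
values = [100.0, 102.0, 101.0, 104.0]
rewards = [daily_return_ratio(p, q) for p, q in zip(values, values[1:])]
payoff = discounted_return(rewards, gamma=0.99)
```

Because the reward is a ratio rather than an absolute profit, its scale does not depend on the size of the account.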

4 StockFormer

As discussed above, the key challenge for RL-for-finance methods is to learn flexible policies that are aware of both the dynamic relations between a batch of trading targets in the financial market and the future trends of each individual target. To tackle this challenge, we introduce StockFormer, an RL agent that extracts disentangled latent states from noisy time series data through predictive coding, and then optimizes the trading decisions in the compositional state space. Accordingly, StockFormer consists of two training phases: predictive coding and policy learning.

4.1 Predictive Coding

In this phase, three Transformer-like network branches are trained in a self-supervised manner to respectively learn the relational, long-horizon, and short-horizon hidden representations. The key insight of so-called predictive coding [Elias, 1955; Spratling, 2017; Rao and Ballard, 1999] is to extract useful representations that are maximally beneficial for predicting future, missing or contextual information. These representations jointly form the compositional state space in the next training phase for learning the investment policy. Next, we first introduce an improved Transformer architecture and then discuss the predictive coding methods for the relational and predictive representations respectively.

Figure 1: Left: Diversified multi-head attention block (DMH-Attn), which improves the original attention block in Transformers with multi-head feed-forward networks. It captures the diverse patterns of concurrent time series in different sub-spaces. Right: The general architecture of Transformer branches learned through predictive coding, which extracts useful representations for the RL agent by maximizing the likelihood of predicting the missing context or future returns of the financial market.

Building block: Diversified multi-head attention. The diversity of temporal patterns among the concurrent sequences of multiple trading assets in financial markets (e.g., hundreds of stocks) greatly increases the difficulty of learning effective representations from raw data. To tackle this difficulty, as shown in Figure 1 (Left), we renovate the multi-head attention block in the original Transformer with a group of feed-forward networks (FFNs) rather than a single FFN, in which each FFN individually responds to a single head in the output of the multi-head attention layer. Without changing the overall number of parameters, such a mechanism strengthens multi-head attention's original feature decoupling ability, which facilitates modeling the diverse temporal patterns in different subspaces. We thus refer to the modified attention block as diversified multi-head attention (DMH-Attn). For a set of $d_{\text{model}}$-dimensional keys $(K)$, values $(V)$, and queries $(Q)$, the process in a diversified multi-head attention block can be represented as follows. We split the output features $Z$ of the multi-head attention layer into $h$ groups along the channel dimension, where $h$ is the number of parallel attention heads, and then apply a separate FFN to each group of separated features in $Z$:
$$
\begin{aligned}
& Z=\text{MH-Attn}(Q, K, V)+Q, \quad x_{i}=\operatorname{Split}(Z) \\
& f_{i}=\max \left(0, x_{i} W_{1, i}+b_{1, i}\right) W_{2, i}+b_{2, i} \\
& \text{DMH-Attn}(Q, K, V)=\operatorname{Concat}\left(f_{1}, \ldots, f_{h}\right)+Z
\end{aligned}
$$
where "MH-Attn" denotes multi-head attention and $f_{i}$ denotes the output features of each FFN head, which contains two linear transformations with the ReLU activation in between.
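The three equations above can be traced in a minimal NumPy sketch. This is an illustration under stated assumptions rather than the authors' implementation: the attention projections `wq`, `wk`, `wv`, `wo`, the toy sizes, the random weights, and the per-head hidden width `d_ff` are all hypothetical, and layer normalization and dropout are left out because the equations do not mention them.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def mh_attn(q, k, v, params, h):
    """Standard multi-head scaled dot-product attention without masking."""
    wq, wk, wv, wo = params
    n_q, d = q.shape
    d_h = d // h
    # Project, then split channels into h heads: shape (h, length, d_h).
    qh = (q @ wq).reshape(n_q, h, d_h).transpose(1, 0, 2)
    kh = (k @ wk).reshape(k.shape[0], h, d_h).transpose(1, 0, 2)
    vh = (v @ wv).reshape(v.shape[0], h, d_h).transpose(1, 0, 2)
    scores = softmax(qh @ kh.transpose(0, 2, 1) / np.sqrt(d_h))
    out = (scores @ vh).transpose(1, 0, 2).reshape(n_q, d)
    return out @ wo


def dmh_attn(q, k, v, attn_params, ffn_params, h):
    z = mh_attn(q, k, v, attn_params, h) + q   # Z = MH-Attn(Q, K, V) + Q
    xs = np.split(z, h, axis=-1)               # x_i = Split(Z)
    fs = [np.maximum(0.0, x_i @ w1 + b1) @ w2 + b2   # f_i: one FFN per head
          for x_i, (w1, b1, w2, b2) in zip(xs, ffn_params)]
    return np.concatenate(fs, axis=-1) + z     # Concat(f_1, ..., f_h) + Z


rng = np.random.default_rng(0)
d_model, h, d_ff, length = 8, 2, 16, 5   # toy sizes (assumed)
d_h = d_model // h
attn_params = [rng.normal(scale=0.1, size=(d_model, d_model)) for _ in range(4)]
ffn_params = [(rng.normal(scale=0.1, size=(d_h, d_ff)), np.zeros(d_ff),
               rng.normal(scale=0.1, size=(d_ff, d_h)), np.zeros(d_h))
              for _ in range(h)]
x = rng.normal(size=(length, d_model))
y = dmh_attn(x, x, x, attn_params, ffn_params, h)   # self-attention use
```

In this sketch each head's FFN maps $d_{\text{model}}/h$ channels to a hidden width of `d_ff` and back, so the $h$ FFNs together hold $2\, d_{\text{model}}\, d_{\text{ff}}$ weights, the same as a single FFN of that width. This is one sizing that is consistent with the statement that the parameter count is unchanged; the text does not state the authors' exact sizing.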

General predictive coding architecture. Each Transformer branch in StockFormer can be divided into encoder and decoder layers, as shown in Figure 1 (Right). Both parts are used in the predictive coding phase with different training objectives, but only the encoder layers are used during policy optimization. We have $L$ encoder layers and $M$ decoder layers. The representation $X_{\text{enc}}^{L}$ of the final encoder layer is used as one of the inputs of each decoder layer. The computation in the $l^{\text{th}}$ encoder layer and the $m^{\text{th}}$ decoder layer can be presented as follows:

Encoder Layer:
$$
X_{\mathrm{enc}}^{l}=\text{DMH-Attn}\left(X_{\mathrm{enc}}^{l-1}, X_{\mathrm{enc}}^{l-1}, X_{\mathrm{enc}}^{l-1}\right)
$$
Decoder Layer:
$$
\begin{aligned}
& F_{\mathrm{dec}}^{m-1}=\text{MH-Attn}\left(X_{\mathrm{dec}}^{m-1}, X_{\mathrm{dec}}^{m-1}, X_{\mathrm{dec}}^{m-1}\right)+X_{\mathrm{dec}}^{m-1} \\
& X_{\mathrm{dec}}^{m}=\text{DMH-Attn}\left(F_{\mathrm{dec}}^{m-1}, X_{\mathrm{enc}}^{L}, X_{\mathrm{enc}}^{L}\right)
\end{aligned}
$$
where $X_{\text{enc}}^{l}$ and $X_{\text{dec}}^{m}$ are the outputs of the encoder and decoder layers respectively. Specifically, $X_{\mathrm{enc}}^{0}$ and $X_{\mathrm{dec}}^{0}$ are the inputs of the first encoder and decoder layers, which are the positional embeddings [Vaswani et al., 2017] of the raw input data $O_{\mathrm{enc}}$ and $O_{\mathrm{dec}}$. The outputs $X_{\text{dec}}^{M}$ of the last decoder layer are then fed into a projection layer to generate the final output in each predictive coding task, that is, predicting future returns or reasoning about the missing (masked) information in the financial market at this moment.
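The data flow through the $L$ encoder layers and $M$ decoder layers can be sketched as follows. To keep the example short, it makes several simplifying assumptions that are not part of the text: the attention has no learned projections, the per-head FFNs have no biases, the inputs are taken to be already embedded, and the final projection layer is omitted. All names and sizes are hypothetical.

```python
import numpy as np


def attn(q, k, v, h):
    """Parameter-free multi-head attention: channels are split into h heads."""
    d_h = q.shape[-1] // h
    outs = []
    for qi, ki, vi in zip(np.split(q, h, -1), np.split(k, h, -1), np.split(v, h, -1)):
        s = qi @ ki.T / np.sqrt(d_h)
        s = np.exp(s - s.max(axis=-1, keepdims=True))
        outs.append((s / s.sum(axis=-1, keepdims=True)) @ vi)
    return np.concatenate(outs, axis=-1)


def dmh_attn(q, k, v, ffns, h):
    z = attn(q, k, v, h) + q
    fs = [np.maximum(0.0, x @ w1) @ w2
          for x, (w1, w2) in zip(np.split(z, h, -1), ffns)]
    return np.concatenate(fs, axis=-1) + z


def encoder_layer(x_enc, ffns, h):
    return dmh_attn(x_enc, x_enc, x_enc, ffns, h)


def decoder_layer(x_dec, x_enc_final, ffns, h):
    f_dec = attn(x_dec, x_dec, x_dec, h) + x_dec           # self-attention + residual
    return dmh_attn(f_dec, x_enc_final, x_enc_final, ffns, h)  # attend to X_enc^L


def run_branch(x_enc0, x_dec0, enc_ffns, dec_ffns, h):
    x_enc = x_enc0
    for ffns in enc_ffns:          # L encoder layers
        x_enc = encoder_layer(x_enc, ffns, h)
    x_dec = x_dec0
    for ffns in dec_ffns:          # M decoder layers share the final encoder output
        x_dec = decoder_layer(x_dec, x_enc, ffns, h)
    return x_enc, x_dec


def make_ffns(rng, h, d_h, d_ff):
    return [(rng.normal(scale=0.1, size=(d_h, d_ff)),
             rng.normal(scale=0.1, size=(d_ff, d_h))) for _ in range(h)]


rng = np.random.default_rng(0)
d_model, h, d_ff, n_enc, n_dec = 8, 2, 16, 2, 1   # toy sizes; L = 2, M = 1
enc_ffns = [make_ffns(rng, h, d_model // h, d_ff) for _ in range(n_enc)]
dec_ffns = [make_ffns(rng, h, d_model // h, d_ff) for _ in range(n_dec)]
x_enc_final, x_dec_final = run_branch(rng.normal(size=(6, d_model)),
                                      rng.normal(size=(4, d_model)),
                                      enc_ffns, dec_ffns, h)
```

Only the encoder path (`x_enc_final`) would be needed at policy-learning time, matching the statement that the decoder layers are used during predictive coding only.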

Relation inference module ($1^{\text{st}}$ Transformer branch). As shown in Figure 2, the relation inference module is used to capture the dynamic correlations among concurrent time series, e.g., different stocks. At time step $t$, we use the same input for the encoder and the decoder, O enc ,