
StockFormer: Learning Hybrid Trading Machines with Predictive Coding

Siyu Gao, Yunbo Wang* and Xiaokang Yang
MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University
{siyu.gao, yunbow, xkyang}@sjtu.edu.cn

Abstract

Typical RL-for-finance solutions directly optimize trading policies over the noisy market data, such as stock prices and trading volumes, without explicitly considering the future trends and correlations of different investment assets as we humans do. In this paper, we present StockFormer, a hybrid trading machine that integrates the forward modeling capabilities of predictive coding with the advantages of RL agents in policy flexibility. The predictive coding part consists of three Transformer branches with modified structures, which respectively extract effective latent states of long-/short-term future dynamics and asset relations. The RL agent adaptively fuses these states and then executes an actor-critic algorithm in the unified state space. The entire model is jointly trained by propagating the critic’s gradients back to the predictive coding module. StockFormer significantly outperforms existing approaches across three publicly available financial datasets in terms of portfolio returns and Sharpe ratios.

1 Introduction

Reinforcement learning (RL) has shown promising results in practical applications of financial decision-making problems, such as improving stock trading strategies by identifying promising buying and selling points [Liu et al., 2021; Zhong et al., 2020]. A common practice is to formulate the portfolio optimization problem as a Markov decision process (MDP) [Puterman, 2014] and directly perform model-free RL algorithms in the state space represented by the observed data (e.g., stock prices, trading volumes, and technical indicators). However, it makes an excessively strong assumption that the observed data is sufficiently informative and can well represent (i) the correlations between hundreds (or thousands) of stocks and (ii) the underlying (or even future) dynamics in the rapidly changing financial markets.
To address this issue, we consider the way in which humans make investment decisions. Two factors require special consideration: the dynamic stock correlations and the expected returns of each stock over both long- and short-term horizons. We present StockFormer, a novel RL agent that learns to adaptively discover and capitalize on promising trading opportunities. It is a hybrid trading machine that integrates the forward modeling capabilities of predictive coding with the advantages of actor-critic methods for trading flexibility. Predictive coding [Elias, 1955; Spratling, 2017; Rao and Ballard, 1999] is one of the most successful self-supervised learning methods in natural language processing [Mikolov et al., 2013] and computer vision [Oord et al., 2018]. In our setting, its core idea is to extract useful latent states from noisy market data that can maximally benefit the prediction of future or missing contextual information.
Specifically, we leverage three Transformer-like networks to respectively learn long-horizon, short-horizon, and relational latent representations from the observed market data. To ease the difficulty of concurrent time series modeling, StockFormer employs multi-head feed-forward networks in the attention block, which can maintain the diversity of temporal patterns learned from multiple concurrent market asset series (e.g., trading records of hundreds of stocks in the same time period). For policy optimization, the three types of latent states are adaptively and progressively combined through a series of multi-head attention structures. StockFormer exploits the actor-critic method but only propagates the critic’s analytic gradients back into the relational state encoder.
Notably, there exists another line of work that aims to improve future stock prediction accuracy with powerful time series modeling networks [Li et al., 2018; Feng et al., 2019; Wang et al., 2021; Duan et al., 2022]. They use fixed trading rules, such as “buy-and-hold”, which recycles capital from unsuccessful stock bets to average in on stocks that have promising future returns and hold them for a fixed period of time. As shown in Table 1, despite recent success, these models are not directly designed to maximize investment returns and cannot provide flexible trading decisions.
We empirically observe that StockFormer remarkably outperforms existing stock prediction and RL-for-finance approaches across three public datasets from NASDAQ and Chinese stock markets as well as the cryptocurrency market.
RL for finance. There have been many attempts to use RL methods to make trading decisions [Théate and Ernst, 2021; Weng et al., 2020; Liang et al., 2018; Benhamou et al., 2020].
| Model | Category | Trading policy | RL state space |
| :--- | :--- | :--- | :--- |
| FactorVAE [Duan et al., 2022] | Stock prediction | Fixed (e.g., buy-and-hold) | n/a |
| SARL [Ye et al., 2020] | RL-for-finance | Learned | Observed asset prices + Asset movement signal |
| StockFormer | Hybrid | Learned | Temporal + Relational predictive coding (via SSL) |
Table 1: StockFormer is a hybrid trading framework that is clearly different from prior (a) stock prediction methods and (b) RL-for-finance methods. In SARL, a typical “asset movement signal” is the financial news embedding. “SSL”: Self-supervised learning.
The main differences among these models lie in the definition of the input states [Zhong et al., 2020; Liu et al., 2021; Weng et al., 2020], the engineering of reward functions [Liang et al., 2018; Hu and Lin, 2019], and the RL algorithms [Benhamou et al., 2020; Suri et al., 2021; Huotari et al., 2020]. SARL [Ye et al., 2020] proposed to learn the policy in the noisy observation space and to expand that space with additional asset movement signals such as financial news embeddings. FinRL [Liu et al., 2021] integrates multiple off-the-shelf RL algorithms such as Soft Actor-Critic (SAC) [Haarnoja et al., 2018] and DDPG [Lillicrap et al., 2016]. It defines the states as a combination of the covariance matrix of the close prices of all stocks and the MACD indicators, so the state dimension grows sharply with the number of stocks. We use its SAC implementation as the baseline of StockFormer.
Stock prediction. Mainstream stock prediction approaches can be divided into three categories. CNN-based models [Hoseinzade and Haratizadeh, 2019; Wen et al., 2019] take the historical data of different stocks as a set of input feature maps of a convolutional neural network. RNN-based models [Li et al., 2018; Zhang et al., 2017; Qin et al., 2017] are better at capturing the underlying sequential trends in stocks. Other network architectures use attention mechanisms [Hu et al., 2018; Li et al., 2018; Wu et al., 2019], dilated convolutions [Wang et al., 2021; Cho et al., 2019], or graph neural networks [Feng et al., 2019; Wang et al., 2021; Patil et al., 2020] to jointly model the long-term dynamics and the relations of trading assets. Inspired by previous Transformer-based architectures [Vaswani et al., 2017; Kitaev et al., 2020; Zhou et al., 2021; Wu et al., 2021; Fedus et al., 2021], StockFormer integrates the Transformer into the RL-for-finance framework. It borrows the idea of self-supervised predictive coding as a representation learning method to extract useful and compact latent states from noisy and high-dimensional market data.

3 Portfolio Optimization as POMDPs

We believe that one of the key challenges of RL-for-finance is to extract useful states that reflect the essential dynamics of the market from noisy, high-dimensional raw transaction records. Therefore, unlike previous work that commonly treats portfolio optimization as an MDP with a 5-tuple $(\mathcal{S}, \mathcal{A}, \mathcal{T}, \mathcal{R}, \gamma)$, we formulate it as a partially observable Markov decision process (POMDP) with a 7-tuple $(\mathcal{O}, \mathcal{S}, \mathcal{A}, \mathcal{T}, \mathcal{Z}, \mathcal{R}, \gamma)$, where $\mathcal{O}$ is the observation space of the noisy market data, $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $\mathcal{T}\left(s_{t+1} \mid s_{t}, a_{t}\right)$ denotes the state transition probability, and $\mathcal{Z}\left(O_{t} \mid s_{t}\right)$ is the prior distribution of the observed data $O_{t}$ given the latent state $s_{t}$. Assuming a fixed length of $P$ steps in each episode (e.g., we use 1,000 trading days for the stock trading experiments), the goal is to learn the optimal policy $\pi^{*}$ that maximizes the total payoff $G_{t}=\mathbb{E}_{\tau \sim \pi}\left[\sum_{t=1}^{P} \gamma^{t-1} R_{t}\right]$, where $R_{t}$ is the immediate reward at each time step drawn from the reward function $\mathcal{R}(s, a)$ and $\gamma \in(0,1)$ is the discount factor of future rewards. We give detailed definitions of $\mathcal{O}$, $\mathcal{S}$, $\mathcal{A}$, and $\mathcal{R}$ as follows.
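As a small worked illustration of the payoff defined above, the discounted sum inside the expectation can be computed for a single sampled episode as follows. This is only a sketch: the reward values and the discount factor are made-up inputs, and the function name `discounted_return` is ours, not the paper's.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma^(t-1) * R_t for t = 1..P over one sampled episode."""
    total = 0.0
    for t, reward in enumerate(rewards, start=1):
        total += gamma ** (t - 1) * reward
    return total


# Hypothetical daily portfolio return ratios for a 3-step episode.
example_rewards = [0.01, -0.02, 0.03]
episode_payoff = discounted_return(example_rewards, gamma=0.5)
```

The full objective is the expectation of this quantity over trajectories sampled from the policy, which in practice would be estimated by averaging over many episodes.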
Noisy observation space ($\mathcal{O}$). Take the stock trading problem as an example. The observed data includes:
  • Raw trading records $O_{t}^{\text{price}} \in \mathbb{R}^{T \times N \times 5}$ that involve the daily opening, close, highest, and lowest prices and the trading volume during the past $T$ days, where $N$ is the number of stocks.
  • Technical indicators $O_{t}^{\text{stat}} \in \mathbb{R}^{N \times K}$, where $K$ is the number of indicators that capture the dynamics of the time series.
  • A covariance matrix $O_{t}^{\text{cov}} \in \mathbb{R}^{N \times N}$ between sequences of daily close prices of all stocks over a fixed period before $t$.

In the following sections, we use $O_{t}^{\text{relat}} \in \mathbb{R}^{N \times(K+N)}$ to denote the concatenation of $O_{t}^{\text{stat}}$ and $O_{t}^{\text{cov}}$.
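The shapes of the observation tensors listed above can be checked with a short NumPy sketch. The sizes, the random placeholder data, the channel order of the price tensor, and the choice of the same $T$-day window for the covariance are all assumptions made for illustration; the paper does not fix them here.

```python
import numpy as np

rng = np.random.default_rng(0)
T, N, K = 20, 4, 3  # look-back days, stocks, indicators (illustrative sizes)

# O^price: opening, close, highest, lowest price and volume per day and stock.
o_price = rng.uniform(1.0, 100.0, size=(T, N, 5))

# O^stat: K technical indicators per stock (random placeholders here).
o_stat = rng.normal(size=(N, K))

# O^cov: covariance between the daily close-price sequences of all stocks.
close = o_price[:, :, 1]             # shape (T, N); channel 1 = close (assumed layout)
o_cov = np.cov(close, rowvar=False)  # columns are stocks, so the result is (N, N)

# O^relat: concatenate indicators and covariance along the feature axis.
o_relat = np.concatenate([o_stat, o_cov], axis=1)  # shape (N, K + N)
```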

Compositional state space ($\mathcal{S}$). The state space of StockFormer is composed of three types of latent states and the explicit account state $s_{t}^{\text{amount}} \in \mathbb{R}^{N+1}$, which represents the total account balance and the holding amount of each trading asset.

Action space ($\mathcal{A}$). We use a continuous action space with $a_{t} \in \mathbb{R}^{N}$, indicating the number of shares to buy, hold, or sell for each trading asset. To simulate real-world trading scenarios, we discretize $a_{t}$ into multiple intervals of daily trading signals when interacting with the environment.
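The text does not spell out the discretization rule, so the following is one possible sketch rather than the authors' procedure. It assumes the policy output is clipped to $[-1, 1]$ and scaled by a hypothetical per-asset cap `max_shares`, with positive values meaning buy, negative values meaning sell, and zero meaning hold.

```python
import numpy as np


def discretize_actions(actions, max_shares=100):
    """Map continuous actions to integer share counts.

    Both the clipping range [-1, 1] and `max_shares` are assumptions made
    for illustration; neither value comes from the paper.
    """
    clipped = np.clip(actions, -1.0, 1.0)
    # Truncate toward zero so that tiny actions become "hold".
    return np.trunc(clipped * max_shares).astype(int)


a_t = np.array([0.256, -1.7, 0.004, 0.0])
signals = discretize_actions(a_t)
```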

Reward function ($\mathcal{R}$). The reward $R_{t} \sim \mathcal{R}\left(s_{t}, a_{t}\right)$ is defined as the daily portfolio return ratio.
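A daily portfolio return ratio can be sketched as the relative change in total account value between two consecutive days. The valuation rule below (cash plus holdings marked at close prices, no transaction costs) is our simplifying assumption, and the helper names are hypothetical.

```python
import numpy as np


def portfolio_value(cash, holdings, close_prices):
    """Cash plus the market value of all held shares (assumed valuation rule)."""
    return cash + float(np.dot(holdings, close_prices))


def daily_return_ratio(value_today, value_yesterday):
    """Relative change in total portfolio value from one day to the next."""
    return (value_today - value_yesterday) / value_yesterday


holdings = np.array([10, 5])
v_prev = portfolio_value(1000.0, holdings, np.array([20.0, 40.0]))
v_now = portfolio_value(1000.0, holdings, np.array([21.0, 42.0]))
reward = daily_return_ratio(v_now, v_prev)
```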

4 StockFormer

As discussed above, the central challenge for RL-for-finance methods is to learn flexible policies that are aware of both the dynamic relations among a batch of trading targets in the financial market and the future trends of each individual target. To tackle this challenge, we introduce StockFormer, an RL agent that extracts disentangled latent states from noisy time series data through predictive coding and then optimizes the trading decisions in the compositional state space. Accordingly, StockFormer has two training phases: predictive coding and policy learning.

4.1 Predictive Coding

In this phase, three Transformer-like network branches are trained in a self-supervised manner to respectively learn the relational, long-horizon, and short-horizon hidden representations. The key insight of predictive coding [Elias, 1955; Spratling, 2017; Rao and Ballard, 1999] is to extract useful representations that are maximally beneficial for predicting future, missing, or contextual information. These representations jointly form the compositional state space used in the next training phase for learning the investment policy. Next, we first introduce an improved Transformer architecture and then discuss the predictive coding methods for the relational and predictive representations respectively.

Figure 1: Left: Diversified multi-head attention block (DMH-Attn), which improves the original attention block in Transformers with multi-head feed-forward networks. It captures the diverse patterns of concurrent time series in different sub-spaces. Right: The general architecture of Transformer branches learned through predictive coding, which extracts useful representations for the RL agent by maximizing the likelihood of predicting the missing context or future returns of the financial market.

Building block: Diversified multi-head attention. The diversity of temporal patterns among the concurrent sequences of multiple trading assets in financial markets (e.g., hundreds of stocks) greatly increases the difficulty of learning effective representations from raw data. To tackle this difficulty, as shown in Figure 1 (Left), we renovate the multi-head attention block in the original Transformer with a group of feed-forward networks (FFNs) rather than a single FFN, in which each FFN individually responds to a single head in the output of the multi-head attention layer. Without changing the overall number of parameters, such a mechanism strengthens multi-head attention's original feature decoupling ability, which facilitates modeling the diverse temporal patterns in different subspaces. We thus refer to the modified attention block as diversified multi-head attention (DMH-Attn). For a set of $d_{\text{model}}$-dimensional keys ($K$), values ($V$), and queries ($Q$), the process in a diversified multi-head attention block can be represented as follows. We split the output features $Z$ of the multi-head attention layer into $h$ groups along the channel dimension, where $h$ is the number of parallel attention heads, and then apply a separate FFN to each group of separated features in $Z$:
$$
\begin{aligned}
& Z=\operatorname{MH-Attn}(Q, K, V)+Q, \quad x_{i}=\operatorname{Split}(Z) \\
& f_{i}=\max \left(0, x_{i} W_{1, i}+b_{1, i}\right) W_{2, i}+b_{2, i} \\
& \operatorname{DMH-Attn}(Q, K, V)=\operatorname{Concat}\left(f_{1}, \ldots, f_{h}\right)+Z
\end{aligned}
$$
where “MH-Attn” denotes multi-head attention and $f_{i}$ denotes the output features of each FFN head, which contains two linear transformations with the ReLU activation in between.
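The three equations of the DMH-Attn block can be traced in a compact NumPy sketch. To keep it short, the attention layer here omits the learned query, key, value, and output projections as well as any normalization, and all sizes and weights are arbitrary placeholders, so this illustrates the per-head FFN idea rather than reproducing the authors' implementation.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def multi_head_attention(q, k, v, h):
    """Scaled dot-product attention over h channel groups (no learned projections)."""
    d_head = q.shape[-1] // h
    heads = []
    for i in range(h):
        s = slice(i * d_head, (i + 1) * d_head)
        scores = q[:, s] @ k[:, s].T / np.sqrt(d_head)
        heads.append(softmax(scores) @ v[:, s])
    return np.concatenate(heads, axis=-1)


def dmh_attn(q, k, v, ffn_params):
    """Diversified multi-head attention: one small FFN per attention head."""
    h = len(ffn_params)
    z = multi_head_attention(q, k, v, h) + q   # Z = MH-Attn(Q, K, V) + Q
    groups = np.split(z, h, axis=-1)           # x_i = Split(Z)
    outs = [np.maximum(0.0, x @ w1 + b1) @ w2 + b2   # f_i: linear, ReLU, linear
            for x, (w1, b1, w2, b2) in zip(groups, ffn_params)]
    return np.concatenate(outs, axis=-1) + z   # Concat(f_1, ..., f_h) + Z


rng = np.random.default_rng(0)
n_tokens, d_model, h, d_hidden = 6, 8, 4, 16   # illustrative sizes
d_head = d_model // h
ffn_params = [(rng.normal(size=(d_head, d_hidden)), np.zeros(d_hidden),
               rng.normal(size=(d_hidden, d_head)), np.zeros(d_head))
              for _ in range(h)]
x = rng.normal(size=(n_tokens, d_model))
out = dmh_attn(x, x, x, ffn_params)            # self-attention: Q = K = V
```

Because each FFN only sees its own channel group, changing the weights of one FFN affects only that head's slice of the output, which is the decoupling property the text describes.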

General predictive coding architecture. Each Transformer branch in StockFormer can be divided into encoder and decoder layers, as shown in Figure 1 (Right). Both parts are used in the predictive coding phase with different training objectives, but only the encoder layers are used during policy optimization. We have $L$ encoder layers and $M$ decoder layers. The representation $X_{\text{enc}}^{L}$ of the final encoder layer is used as one of the inputs of each decoder layer. The computation in the $l^{\text{th}}$ encoder layer and the $m^{\text{th}}$ decoder layer can be presented as follows:

Encoder layer:
$$
X_{\mathrm{enc}}^{l}=\operatorname{DMH-Attn}\left(X_{\mathrm{enc}}^{l-1}, X_{\mathrm{enc}}^{l-1}, X_{\mathrm{enc}}^{l-1}\right)
$$
Decoder layer:
$$
\begin{aligned}
& F_{\mathrm{dec}}^{m-1}=\operatorname{MH-Attn}\left(X_{\mathrm{dec}}^{m-1}, X_{\mathrm{dec}}^{m-1}, X_{\mathrm{dec}}^{m-1}\right)+X_{\mathrm{dec}}^{m-1} \\
& X_{\mathrm{dec}}^{m}=\operatorname{DMH-Attn}\left(F_{\mathrm{dec}}^{m-1}, X_{\mathrm{enc}}^{L}, X_{\mathrm{enc}}^{L}\right)
\end{aligned}
$$
where $X_{\mathrm{enc}}^{l}$ and $X_{\mathrm{dec}}^{m}$ are the outputs of the encoder and decoder layers respectively. Specifically, $X_{\mathrm{enc}}^{0}$ and $X_{\mathrm{dec}}^{0}$ are the inputs of the first encoder and decoder layers, which are the positional embeddings [Vaswani et al., 2017] of the raw input data $O_{\mathrm{enc}}$ and $O_{\mathrm{dec}}$. The outputs $X_{\mathrm{dec}}^{M}$ of the last decoder layer are then fed into a projection layer to generate the final output in each predictive coding task, that is, predicting future returns or reasoning about the missing (masked) information in the financial market at this moment.

Figure 2: Unlike previous RL-for-finance methods, StockFormer builds the decision module upon learned representations provided by a relation inference module and two future prediction modules. The decision module contains a couple of multi-head attention layers that integrate the compositional representations, an actor network, and a critic network. In particular, the critic propagates its gradients of state values back into the relation inference module (Solid arrow: the data flow; Dashed arrow: the gradient flow of the critic loss).

Relation inference module ($1^{\text{st}}$ Transformer branch). As shown in Figure 2, the relation inference module is used to capture the dynamic correlations among concurrent time series, e.g., different stocks. At time step $t$, we use the same input for the encoder and the decoder, $O_{\mathrm{enc}, t}^{\text{relat}}=O_{\mathrm{dec}, t}^{\text{relat}} \in \mathbb{R}^{N \times(K+N)}$, where $N$ is the number of concurrent trading assets and $K$ is the number of statistics that capture the dynamics of the time series data. Particularly for the stock market datasets, we use common technical indicators as the statistics, such as MACD, RSI, and SMA. During the training phase, $O_{\mathrm{enc}, t}^{\text{relat}}$ can be divided into two parts:
  • The covariance matrix $O_{t}^{\text{cov}} \in \mathbb{R}^{N \times N}$ between sequences of daily close prices of all concurrent trading assets over a fixed period of time before $t$.
  • The masked statistics $O_{t}^{\text{stat}} \in \mathbb{R}^{N \times K}$, in which we randomly select half of the time series and mask their input statistics with zeros. It is worth noting that in the test phase, we use complete data without masks as input to the module.

The task of the relation inference module is to reconstruct the masked statistics (i.e., technical indicators) given $O_{t}^{\text{cov}}$ and the remaining visible statistics. The self-supervised loss function in this module can be represented as:
$$
\mathcal{J}_{t}^{\text{relat}}=\mathcal{J}\left(\widehat{O}_{t}^{\text{stat}}, \bar{O}_{t}^{\text{stat}}\right)=\frac{1}{N} \sum_{n}\left\|\widehat{O}_{t}^{\text{stat}}-\bar{O}_{t}^{\text{stat}}\right\|^{2}
$$
where $\widehat{O}_{t}^{\text{stat}}$ and $\bar{O}_{t}^{\text{stat}}$ are the reconstructed values of the masked statistics and their ground truth, and the sum runs over the assets $n$. Intuitively, such a predictive coding method drives the Transformer encoder to capture the dependencies among the concurrent time series (i.e., stocks). The relation inference module provides its final encoding features (i.e., $X_{\mathrm{enc}}^{L}$ in the above descriptions of the general architecture) for the subsequent decision module, as parts of the state space of the RL agent, denoted by $s_{t}^{\text{relat}}$.
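The masking scheme and reconstruction loss described above can be sketched in NumPy as follows. The Transformer itself is replaced by a stand-in that predicts zeros, and computing the error only on the masked rows while dividing by the total number of assets $N$ is our reading of the formula; the function names are hypothetical.

```python
import numpy as np


def mask_half_of_assets(o_stat, rng):
    """Zero the indicator rows of a randomly chosen half of the assets."""
    n_assets = o_stat.shape[0]
    masked_idx = rng.choice(n_assets, size=n_assets // 2, replace=False)
    mask = np.zeros(n_assets, dtype=bool)
    mask[masked_idx] = True
    masked = o_stat.copy()
    masked[mask] = 0.0
    return masked, mask


def relation_loss(reconstructed, target, mask):
    """Squared error on the masked rows, divided by the number of assets N."""
    diff = reconstructed[mask] - target[mask]
    return float(np.sum(diff ** 2) / target.shape[0])


rng = np.random.default_rng(0)
o_stat = rng.normal(size=(6, 3))          # N = 6 assets, K = 3 indicators
o_stat_masked, mask = mask_half_of_assets(o_stat, rng)

# Stand-in for the model output; the real prediction comes from the Transformer.
prediction = np.zeros_like(o_stat)
loss = relation_loss(prediction, o_stat, mask)
```

At test time the masking step would simply be skipped, so the module receives the complete indicator matrix.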

Future prediction modules ($2^{\text{nd}}$ & $3^{\text{rd}}$ branches). In the short-term prediction module, the task is to predict the return ratio of each stock$^{1}$ for the next day ($H=1$). In this module, we use different inputs for the encoder and the decoder. The inputs of the encoder are $O_{\text{enc},t}^{\text{price}} \in \mathbb{R}^{T \times 5}$, which contain the daily opening, closing, highest, and lowest prices and the trading volumes during the past $T$ days. We feed the transaction records at step $t$ to the decoder, i.e., $O_{\text{dec},t}^{\text{price}} \in \mathbb{R}^{1 \times 5}$. Similar to the short-term prediction module, the long-term prediction module predicts the return ratios over a longer future horizon (e.g., $H=5$ for daily stock trading). The long-term prediction task encourages the Transformer branch to capture future dynamics from a longer perspective. We adopt the regression loss and stock ranking loss from [Feng et al., 2019]:

\[
\begin{aligned}
\mathcal{J}_{t}^{\text{future}} = \mathcal{J}\left(\hat{\mathbf{r}}^{t+H}, \mathbf{r}^{t+H}\right) &= \left\|\hat{\mathbf{r}}^{t+H}-\mathbf{r}^{t+H}\right\|^{2} \\
&\quad + \alpha \sum_{i=1}^{N} \sum_{j=1}^{N} \max \left(0,\, -\left(\hat{r}_{i}^{t+H}-\hat{r}_{j}^{t+H}\right)\left(r_{i}^{t+H}-r_{j}^{t+H}\right)\right)
\end{aligned}
\]
where $\mathbf{r}^{t+H}=\left[r_{1}^{t+H}, \ldots, r_{N}^{t+H}\right]$ denotes the true future return ratios of all stocks, $r_{i}^{t+H}=\left(p_{i}^{t+H}-p_{i}^{t}\right) / p_{i}^{t}$, $p_{i}^{t}$ is the closing price of stock $i$ on day $t$, $\hat{\mathbf{r}}^{t+H}=\left[\hat{r}_{1}^{t+H}, \ldots, \hat{r}_{N}^{t+H}\right]$ denotes the predicted return ratios of all stocks, and $\alpha$ is a hyperparameter that balances the two loss terms. The second term encourages the predicted ranking of any pair of stocks to keep the same order as the ground truth: a pair contributes to the loss only when the predicted difference and the true difference have opposite signs. We use the output features of the decoders (i.e., $X_{\text{dec}}^{M}$ before the projection layer) in the future prediction modules as parts of the states of the RL agent (i.e., $s_{t}^{\text{long}}$ and $s_{t}^{\text{short}}$).
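The loss above can be written out directly as a short sketch. The function names `return_ratio` and `future_loss` and the default `alpha=1.0` are illustrative assumptions, not values taken from the paper; the nested loops mirror the double sum over stock pairs.

```python
def return_ratio(p_now, p_future):
    """Return ratio r = (p_{t+H} - p_t) / p_t for one stock."""
    return (p_future - p_now) / p_now


def future_loss(r_pred, r_true, alpha=1.0):
    """Regression loss plus pairwise ranking loss over N stocks.

    r_pred, r_true: sequences of predicted / true return ratios.
    alpha: weight of the ranking term (placeholder value, an assumption).
    """
    # Squared L2 distance between predicted and true return ratios.
    regression = sum((rp - rt) ** 2 for rp, rt in zip(r_pred, r_true))

    # A pair (i, j) is penalised only when the predicted ordering
    # disagrees with the true ordering (the product is negative).
    ranking = 0.0
    n = len(r_true)
    for i in range(n):
        for j in range(n):
            pred_diff = r_pred[i] - r_pred[j]
            true_diff = r_true[i] - r_true[j]
            ranking += max(0.0, -pred_diff * true_diff)

    return regression + alpha * ranking
```

With `alpha=0.0` the function reduces to the plain regression loss, and a prediction that preserves the true ordering of every pair incurs no ranking penalty.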

4.2 Policy Optimization

In the second training phase, StockFormer integrates the three types of latent representations into a unified state space $\mathcal{S} \subseteq \mathbb{R}^{N \times (D+1)}$ through a series of multi-head attention layers and then uses an actor-critic method to learn the trading policy. StockFormer learns flexible trading strategies by taking full advantage of the future trends of each time series over the long and short time horizons, as well as the dynamic relationships between dozens of concurrent trading assets.

Latent states integration. With predictive coding, we obtain informative latent representations from three Transformer branches, denoted by $s_{t}^{\text{relat}} \in \mathbb{R}^{N \times D}$, $s_{t}^{\text{long}} \in \mathbb{R}^{N \times D}$, and $s_{t}^{\text{short}} \in \mathbb{R}^{N \times D}$. Therefore, the first question in the decision-making module is how to integrate different types of predictive embeddings into a unified state space. We use a cascaded multi-head attention mechanism as follows so that the decision module can be informed with the predicted future information in the financial market, as well as the dynamic relations between different trading targets.
\[
\begin{aligned}
s_{t}^{\text{future}} &= \operatorname{MultiHead-Attn}\left(s_{t}^{\text{long}}, s_{t}^{\text{short}}, s_{t}^{\text{short}}\right)+s_{t}^{\text{long}} \\
s_{t}^{h} &= \operatorname{MultiHead-Attn}\left(s_{t}^{\text{future}}, s_{t}^{\text{relat}}, s_{t}^{\text{relat}}\right)+s_{t}^{\text{future}} \\
s_{t} &= \operatorname{Concat}\left(s_{t}^{h}, s_{t}^{\text{hold}}\right)
\end{aligned}
\]
where $s_{t}^{\text{future}}$ and $s_{t}^{h}$ are the outputs of the two multi-head attention layers. $s_{t}^{\text{long}}$ is used as the query in the first MH-Attn unit because it is less vulnerable to short-term noise than $s_{t}^{\text{short}}$, and $s_{t}^{\text{future}}$ is used as the query in the second unit to extract more effective relational features. We combine $s_{t}^{h}$ with the shares of all trading targets that we hold at a certain time step, $s_{t}^{\text{hold}} \in \mathbb{R}^{N \times 1}$, to form the final states $s_{t} \in \mathbb{R}^{N \times (D+1)}$, and then feed $s_{t}$ into the actor network and the critic network. In the experiments, we empirically show the effectiveness of the proposed architecture of the decision module: exploiting a series of multi-head attention layers better integrates different sources of latent representations into a unified state space, which can eventually benefit policy optimization.
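The cascaded fusion above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: attention is taken over the asset axis ($N$), the projection matrices `Wq`, `Wk`, `Wv`, `Wo` are random placeholders, and the helper names (`multi_head_attn`, `init_params`, `fuse_states`), the number of heads, and the toy sizes are all hypothetical. Layer normalisation, dropout, and biases are omitted.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def multi_head_attn(query, key, value, params, n_heads):
    """Scaled dot-product multi-head attention; inputs have shape (N, D)."""
    n, d = query.shape
    d_head = d // n_heads

    def split(x, w):
        # Project, then split features into heads: (n_heads, N, d_head).
        return (x @ w).reshape(n, n_heads, d_head).transpose(1, 0, 2)

    q = split(query, params["Wq"])
    k = split(key, params["Wk"])
    v = split(value, params["Wv"])
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)  # (n_heads, N, N)
    heads = softmax(scores) @ v                          # (n_heads, N, d_head)
    merged = heads.transpose(1, 0, 2).reshape(n, d)      # back to (N, D)
    return merged @ params["Wo"]


def init_params(d, rng):
    """Random projection matrices (placeholders for learned weights)."""
    return {name: rng.normal(scale=d ** -0.5, size=(d, d))
            for name in ("Wq", "Wk", "Wv", "Wo")}


def fuse_states(s_long, s_short, s_relat, s_hold, p1, p2, n_heads=4):
    # First unit: long-term states query the short-term states (residual on s_long).
    s_future = multi_head_attn(s_long, s_short, s_short, p1, n_heads) + s_long
    # Second unit: fused future states query the relational states.
    s_h = multi_head_attn(s_future, s_relat, s_relat, p2, n_heads) + s_future
    # Append the current holdings as one extra feature per asset.
    return np.concatenate([s_h, s_hold], axis=1)


rng = np.random.default_rng(0)
N, D = 6, 8  # toy sizes, chosen only for illustration
s_long, s_short, s_relat = (rng.normal(size=(N, D)) for _ in range(3))
s_hold = np.zeros((N, 1))
s_t = fuse_states(s_long, s_short, s_relat, s_hold,
                  init_params(D, rng), init_params(D, rng))
print(s_t.shape)  # (6, 9), i.e. N x (D + 1)
```

The residual connections use the query of each unit, matching the two `+ s_t^{long}` and `+ s_t^{future}` terms in the equations.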

Joint learning with RL objectives. We use the soft actor-critic (SAC) [Haarnoja et al., 2018] to learn the trading policy in the unified state space. The critic network learns the parametric soft Q-function by minimizing the following soft Bellman residual and propagates the analytic gradients of state values back into the relation inference module, a scheme adopted from previous actor-critic approaches with visual inputs [Lee et al., 2020; Yarats et al., 2021]. In other words, we allow joint training of predictive coding and policy learning in this stage, so that the critic's evaluation of the state values can further help to mine the correlations between the trading assets from noisy and high-dimensional observation data. The advantage of learning a hybrid trading machine is that the two training phases can be deeply bound, which integrates the forward modeling capabilities of predictive coding with the policy flexibility of RL-for-finance methods. Specifically, we optimize the critic loss $\mathcal{J}(Q)$ by
\[
\begin{gathered}
\min_{\phi, \psi} \mathbb{E}_{\left(s_{t}, a_{t}\right) \sim \mathcal{D}}\left[1 / 2\left(Q\left(s_{t}, a_{t}\right)-\widehat{Q}\left(s_{t}, a_{t}\right)\right)^{2}\right], \text{ where } \\
\widehat{Q}\left(s_{t}, a_{t}\right)=R_{t}+\gamma\left(Q_{\phi^{\prime}}\left(s_{t+1}, a_{t+1}\right)-\lambda \log \pi_{\theta}\left(a_{t+1} \mid s_{t+1}\right)\right)
\end{gathered}
\]
where $\theta$, $\phi$, and $\psi$ represent the parameters of the actor network, the critic network, and the relation inference module, respectively, $\mathcal{D}$ is the replay buffer, $Q_{\phi^{\prime}}$ is the target Q network, and $a_{t+1} \sim \pi_{\theta}\left(\cdot \mid s_{t+1}\right)$. For the actor loss $\mathcal{J}\left(\pi_{\theta}\right)$, we have
\[
\min_{\theta} \mathbb{E}_{s_{t} \sim \mathcal{D}}\left[D_{\mathrm{KL}}\left(\pi_{\theta}\left(\cdot \mid s_{t}\right) \,\Big\|\, \exp \left(Q_{\phi}\left(s_{t}, \cdot\right)\right) / Z_{\phi}\left(s_{t}\right)\right)\right]
\]
where $Z_{\phi}$ is a normalization factor.
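As a small worked example of the critic objective, the soft Bellman target and the squared residual can be computed on scalars as below. The function names and the default values of `gamma` and `lam` are assumptions for illustration; the paper's actual settings are not given in this section. In a real training loop the target would come from the target network $Q_{\phi^{\prime}}$ and would be treated as a constant when differentiating the loss.

```python
def soft_q_target(reward, q_next, log_prob_next, gamma=0.99, lam=0.2):
    """Soft Bellman target: R_t + gamma * (Q'(s', a') - lam * log pi(a' | s')).

    q_next: target-network value Q_{phi'}(s_{t+1}, a_{t+1}).
    log_prob_next: log pi_theta(a_{t+1} | s_{t+1}) for the sampled next action.
    gamma, lam: discount and entropy weight (placeholder values, assumptions).
    """
    return reward + gamma * (q_next - lam * log_prob_next)


def critic_loss(q_values, q_targets):
    """Mean of 1/2 * (Q - Q_hat)^2 over a batch sampled from the replay buffer."""
    terms = [0.5 * (q - q_hat) ** 2 for q, q_hat in zip(q_values, q_targets)]
    return sum(terms) / len(terms)
```

A less likely next action (more negative log-probability) raises the target, which is how the entropy term rewards exploratory policies.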

5 Experiments

5.1 Compared Methods

We compare StockFormer with the following methods:
  • The “market benchmarks” in terms of the CSI-300 Index and the NASDAQ Composite Index.
  • Min-variance portfolio allocation strategy (Min-Var) [Basak and Chabakauri, 2010] that balances returns and risks.
  • Recent advances in stock prediction [Wang et al., 2021; Feng et al., 2019; Duan et al., 2022] and a cutting-edge Transformer model for general time series prediction [Wu et al., 2021]. For these models we use the “buy-and-hold” strategy, in which each day we buy the stock with the highest estimated return over the next 5 days and sell it 5 days later.
  • RL methods, including SAC [Haarnoja et al., 2018], DDPG [Lillicrap et al., 2016], and SARL [Ye et al., 2020]. The first two models are implemented on the FinRL platform [Liu et al., 2021] and directly perform policy learning on the technical indicators and the covariance matrix between the trading targets. We reproduce SARL and feed the current asset prices (rather than the financial news) to its external state encoder for a fair comparison.
All models are evaluated with transaction costs, and each RL method is run three times with different seeds.

| Market | # Assets | # Train Days | # Test Days |
| :---: | :---: | :---: | :---: |
| CSI-300 | 88 | 1935 | 728 |
| NASDAQ-100 | 86 | 2000 | 756 |
| Cryptocurrency | 27 | 1108 | 258 |

Table 2: Statistical details of the finance investment datasets.
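The “buy-and-hold” rule used for the stock prediction baselines can be sketched as follows. This is an illustrative reading of the description above, not the authors' evaluation code: the function name and the data layout (`pred[t][i]` and `close[t][i]` as nested lists indexed by day and asset) are assumptions, and transaction costs are left out for brevity.

```python
def buy_and_hold_returns(pred, close, hold=5):
    """Return the realised return of each daily trade.

    pred[t][i]: predicted `hold`-day return of asset i, made on day t.
    close[t][i]: closing price of asset i on day t.
    """
    trade_returns = []
    for t in range(len(close) - hold):
        scores = pred[t]
        # Buy the asset with the highest predicted return on day t ...
        pick = max(range(len(scores)), key=lambda i: scores[i])
        buy = close[t][pick]
        # ... and sell it `hold` days later.
        sell = close[t + hold][pick]
        trade_returns.append((sell - buy) / buy)
    return trade_returns
```

Each entry of the result is the return ratio of one trade, so overlapping positions opened on consecutive days are reported separately.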

5.2 CSI-300 Stock Dataset

This dataset contains stock data from the CSI-300 Composite Index from 01/17/2011 to 12/30/2021. As shown in Table 2, the dataset is divided into training and test splits containing the basic stock price-volume information for 1,935 and 728 trading days, respectively. Furthermore, we follow [Feng et al., 2019] to retain the stocks that have been traded on more than $98\%$ of the training days since 01/17/2011. Our investment pool contains 88 stocks. If a stock in the training set is suspended from trading, we fill in the missing data using the daily rate of change of the CSI-300 Composite Index.
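One possible reading of this preprocessing is sketched below. The details are assumptions: a suspended day is marked with `None`, and its price is filled by applying the index's daily rate of change to the previous day's price. The function names and the dictionary layout are hypothetical.

```python
def traded_ratio(prices):
    """Fraction of days with a recorded price (None marks a suspended day)."""
    return sum(p is not None for p in prices) / len(prices)


def fill_with_index(prices, index):
    """Fill suspended days by scaling the previous price with the index's daily change."""
    filled = list(prices)
    for t in range(1, len(filled)):
        if filled[t] is None and filled[t - 1] is not None:
            filled[t] = filled[t - 1] * index[t] / index[t - 1]
    # A gap on the very first day has no previous price and is left as None.
    return filled


def build_pool(price_table, index, min_ratio=0.98):
    """Keep assets traded on more than `min_ratio` of the days, then fill their gaps."""
    return {
        name: fill_with_index(prices, index)
        for name, prices in price_table.items()
        if traded_ratio(prices) > min_ratio
    }
```

For instance, a stock priced at 10.0 that is suspended on a day when the index moves from 100.0 to 110.0 would be assigned 11.0 for that day.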
Figure 3(Left) shows qualitative results of the accumulated portfolio return on the CSI-300 test set. It shows that StockFormer outperforms other compared models by large margins, including both stock prediction models trading with the “buy-and-hold” strategy, as well as the RL-for-finance methods. Notably, the red curve presents the results of the proposed long-term prediction module without involving any other branches or the decision module in StockFormer, which is trained to predict future returns in 5 days and works with the “buy-and-hold” strategy as other stock prediction models. The remarkable advantage of this baseline model compared with existing approaches, including HATR [Wang et al., 2021], Relational Ranking [Feng et al., 2019], and AutoFormer [Wu et al., 2021], demonstrates the powerful dynamics modeling capability of the proposed Transformer architecture with diversified multi-head attention.
| Method | CSI-300 | | | | NASDAQ-100 | | | | Cryptocurrency | |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| | $\mathrm{PR}^{\uparrow}$ | $\mathrm{AR}^{\uparrow}$ | $\mathrm{SR}^{\uparrow}$ | $\mathrm{MDD}^{\downarrow}$ | $\mathrm{PR}^{\uparrow}$ | $\mathrm{AR}^{\uparrow}$ | $\mathrm{SR}^{\uparrow}$ | $\mathrm{MDD}^{\downarrow}$ | $\mathrm{PR}^{\uparrow}$ | $\mathrm{SR}^{\uparrow}$ |
| Market benchmark | 0.24 | 0.08 | 0.51 | 0.18 | 1.05 | 0.30 | 1.16 | 0.30 | - | - |
| Min-Var [Basak and Chabakauri, 2010] | 0.11 | 0.04 | 0.38 | 0.13 | 0.59 | 0.18 | 0.99 | 0.26 | -0.09 | 0.02 |
| HATR [Wang et al., 2021] | -0.02 | -0.01 | 0.14 | 0.45 | 0.08 | 0.28 | 0.25 | 0.41 | -0.65 | -0.66 |
| Relational Ranking [Feng et al., 2019] | -0.01 | -0.01 | 0.17 | 0.45 | 0.96 | 0.28 | 0.95 | 0.30 | - | - |
| FactorVAE [Duan et al., 2022] | 1.37 | 0.38 | 1.27 | 0.17 | 1.20 | 0.33 | 0.99 | 0.24 | - | - |
| AutoFormer [Wu et al., 2021] | 0.08 | 0.03 | 0.25 | 0.40 | -0.25 | -0.10 | -0.21 | 0.42 | -0.27 | -0.16 |
| Our future prediction module $(t+5)$ | 0.89 | 0.27 | 0.73 | 0.49 | 0.66 | 0.20 | 0.64 | 0.44 | -0.40 | -0.71 |
| SARL [Ye et al., 2020] | 1.59 | 0.39 | 1.38 | 0.31 | 1.22 | 0.30 | 0.99 | 0.34 | 0.06 | 0.43 |
| FinRL-SAC [Liu et al., 2021] | 1.76 | 0.42 | 1.41 | 0.34 | 1.38 | 0.34 | 1.24 | 0.33 | 0.10 | 0.55 |
| FinRL-DDPG [Liu et al., 2021] | 1.43 | 0.36 | 1.23 | 0.39 | 0.83 | 0.22 | 0.87 | 0.33 | 0.15 | 0.60 |
| StockFormer | 2.47 | 0.54 | 1.73 | 0.31 | 1.71 | 0.40 | 1.39 | 0.31 | 0.24 | 0.75 |

Table 3: Quantitative results in portfolio return (PR), annual return (AR), Sharpe ratio (SR), and maximum drawdown (MDD) on the test splits of the stock market datasets. Transaction costs are included in buying and selling actions. For stock prediction methods (Rows 3-7), we use the “buy-and-hold” strategy. In the last two columns, we show the results on the cryptocurrency dataset.

Figure 3: Accumulated portfolio returns on the CSI (Left) and NASDAQ (Right) test sets. We use the “buy-and-hold” strategy for the stock prediction models, including HATR, FactorVAE, and our baseline model (i.e., the long-term prediction Transformer branch in StockFormer).

Table 3 provides corresponding quantitative results. Besides the total portfolio return (PR), we follow the work of FactorVAE [Duan et al., 2022] to include annual return (AR), Sharpe ratio (SR), and maximum drawdown (MDD) as the evaluation metrics. It is worth noting that the final StockFormer further improves the future prediction baseline (Row 7) and significantly outperforms all of the stock prediction models by learning flexible RL strategies. StockFormer also outperforms the vanilla SAC by $40.3\%$ $(1.76 \rightarrow 2.47)$ in the portfolio return and by $22.7\%$ $(1.41 \rightarrow 1.73)$ in the Sharpe ratio. Such a profit boost comes from executing policy optimization over the extracted relational and predictive states.
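The four evaluation metrics and the relative improvements quoted above can be reproduced with a few lines of code. The definitions below are common conventions and are assumptions here (252 trading periods per year, geometric annualisation, a zero risk-free rate, sample standard deviation); this section does not spell out the exact formulas used in the paper.

```python
import math


def portfolio_return(values):
    """Total return of a portfolio value curve."""
    return values[-1] / values[0] - 1.0


def annual_return(values, periods_per_year=252):
    """Geometric annualisation of the total return (assumed convention)."""
    n_periods = len(values) - 1
    return (values[-1] / values[0]) ** (periods_per_year / n_periods) - 1.0


def sharpe_ratio(values, periods_per_year=252, risk_free=0.0):
    """Annualised mean excess return divided by return volatility."""
    rets = [b / a - 1.0 for a, b in zip(values, values[1:])]
    mean = sum(rets) / len(rets)
    var = sum((r - mean) ** 2 for r in rets) / (len(rets) - 1)
    excess = mean - risk_free / periods_per_year
    return excess / math.sqrt(var) * math.sqrt(periods_per_year)


def max_drawdown(values):
    """Largest peak-to-trough loss, as a fraction of the peak."""
    peak, worst = values[0], 0.0
    for v in values:
        peak = max(peak, v)
        worst = max(worst, (peak - v) / peak)
    return worst


def relative_improvement(baseline, ours):
    """E.g. (2.47 - 1.76) / 1.76 for the CSI-300 portfolio return."""
    return (ours - baseline) / baseline
```

Under these assumptions, a portfolio return of 2.47 over the 728 CSI-300 test days corresponds to an annual return of roughly 0.54, which is in line with the StockFormer row of Table 3.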

5.3 NASDAQ-100 Stock Dataset

For the NASDAQ stock dataset, we collect the daily price-volume records and related technical indicators between 01/17/2011 and 12/30/2021 from Yahoo Finance. As with CSI-300, we use the $98\%$ criterion to filter stocks, which yields an investment pool of 86 stocks, and then fill in the missing data based on the daily rate of change of the NASDAQ 100 Index. We construct training and test datasets that involve 2,000 and 756 trading days, respectively. As shown in Table 3 and Figure 3 (Right), on the NASDAQ-100 test set, due to the effectiveness of predictive coding, StockFormer improves the vanilla SAC by $23.9\%$ $(1.38 \rightarrow 1.71)$ in the portfolio return and by $12.1\%$ $(1.24 \rightarrow 1.39)$ in the Sharpe ratio.

5.4 Cryptocurrency Dataset

We further evaluate StockFormer by using it to make trading decisions in the cryptocurrency market. With the $98\%$ filtering criterion, we select 27 cryptocurrencies and collect their daily records between 11/01/2017 and 05/01/2022 from Yahoo Finance. We split the data into a training set of 1,108 trading days and a test set of 258 trading days.
An interesting result in Table 3 is that most stock prediction models yield negative portfolio returns on the test set, even worse than the Min-Var baseline. We suspect that this is due to the complexity and the non-stationarity of the cryptocurrency data. In other words, it is difficult to model the temporal dynamics accurately in such data scenarios, which further highlights the advantage of model-free RL methods. FinRL, SARL, and StockFormer do not explicitly model the transition function in the state space and can flexibly produce more conservative trading policies (as shown by the Sharpe ratios).

| Input latent states of RL agents | PR | SR |
| :--- | :---: | :---: |
| W/o relational states | 1.42 | 1.43 |
| W/o short-term predictive states | 1.53 | 1.43 |
| W/o long-term predictive states | 1.45 | 1.34 |
| StockFormer (Final) | $\mathbf{2.47}$ | $\mathbf{1.73}$ |

Table 4: Ablation study of the necessity of individual branches in StockFormer on CSI. We remove each branch separately while maintaining the other two branches for the decision module.

| Relation module | Future prediction modules | PR | SR |
| :---: | :---: | :---: | :---: |
| FFN | FFN | 1.29 | 1.27 |
| Multi-head FFN | FFN | 1.48 | 1.29 |
| FFN | Multi-head FFN | 1.52 | 1.31 |
| Multi-head FFN | Multi-head FFN (StockFormer) | $\mathbf{1.71}$ | $\mathbf{1.39}$ |

Table 5: Ablation study of the proposed multi-head FFN in attention blocks in StockFormer on NASDAQ, compared with the original attention block [Vaswani et al., 2017] with conventional FFNs.

Another interesting fact in Table 3 is that while the future prediction module in StockFormer results in a negative portfolio return when working alone, its predictive coding states can still benefit the hybrid trading machine when integrated into StockFormer. We can see that StockFormer achieves a positive return and the best performance, with a $140.0\%$ $(0.10 \rightarrow 0.24)$ improvement over the vanilla SAC method in the portfolio return and a $36.4\%$ $(0.55 \rightarrow 0.75)$ improvement in the Sharpe ratio, showing the ability to control the investment risks.

5.5 Ablation Study

Necessity of individual Transformer branches. To verify the performance of the relation inference module and future prediction modules, we compare three variants of StockFormer on the CSI dataset. From Table 4, we can see that removing any encoding branch leads to a remarkable performance drop, and each branch makes essential contributions to the final performance of our final approach. The results demonstrate that the relation module and prediction modules provide complementary representations to the decision module from different perspectives of concurrent time series modeling.

Performance gain of the modified attention block. Table 5 compares the results of three StockFormer variants, in which we may use the original Transformer architecture without the proposed multi-head FFNs in any one of the three predictive coding branches. These results show that multi-head FFNs can effectively improve the attention block in Transformer by mitigating the difficulty of concurrent time series modeling.

Design of the decision module. An important question is how to integrate different types of predictive embeddings into a unified state space. To answer this question, we study different network structures for fusing the relational and long-/short-term predictive states in the decision module and show the results in Table 6. By replacing the multi-head attention layers in StockFormer with two fully connected layers (denoted by “2-layer FFN”), we observe that the portfolio return drops from 1.71 to 1.62 and the Sharpe ratio drops from 1.39 to 1.31. The results validate the effectiveness of the proposed architecture of the decision module: exploiting a series of multi-head attention layers better integrates different sources of latent representations into a unified state space, which can eventually benefit policy optimization.

| State fusion in decision module | PR | SR |
| :--- | :---: | :---: |
| 2-layer FFN | 1.62 | 1.31 |
| 2-layer multi-head attention (StockFormer) | $\mathbf{1.71}$ | $\mathbf{1.39}$ |

Table 6: Evaluation on NASDAQ of different methods that integrate the relational states and the predictive states in the decision module.

| Actor gradient | Critic gradient | PR | SR |
| :---: | :---: | :---: | :---: |
| ✗ | ✗ | 0.03 | 0.35 |
| ✗ | ✗ | 0.02 | 0.35 |
| ✗ | ✓ (StockFormer) | $\mathbf{0.24}$ | $\mathbf{0.75}$ |

Table 7: Ablation study of back-propagating the actor-critic gradients to the relation inference module on the cryptocurrency dataset.

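To put the Table 6 numbers in proportion, the short calculation below expresses the drop from the attention-based fusion to the 2-layer FFN as a relative change. Only the four values come from Table 6; the helper name `relative_change` is introduced here for illustration.

```python
def relative_change(baseline, variant):
    """Signed relative change of `variant` with respect to `baseline`."""
    return (variant - baseline) / baseline


# Values from Table 6: multi-head attention fusion (StockFormer) vs. 2-layer FFN.
pr_change = relative_change(baseline=1.71, variant=1.62)
sr_change = relative_change(baseline=1.39, variant=1.31)

print(f"Portfolio return: {pr_change:.1%}")  # about -5.3%
print(f"Sharpe ratio:     {sr_change:.1%}")  # about -5.8%
```

Under this reading, swapping the attention layers for fully connected layers costs roughly 5 to 6 percent on both metrics on NASDAQ.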
Joint training vs. two separate training stages. The two training phases, predictive coding and policy learning, are closely related. In Table 7, we experiment with different gradient back-propagation solutions in the policy learning phase of the actor-critic method. We observe that StockFormer achieves the best results by passing the gradients of the critic loss back into the relation inference module. It significantly improves on the baseline model with two separate training phases (PR: $0.03 \rightarrow 0.24$), in which the RL agent does not propagate any gradients back to the predictive coding modules. We believe that the critic’s evaluation of the state values can further help to mine the correlations between the trading assets from noisy and high-dimensional observation data. These results show the superiority of the joint learning mechanism.
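The gradient-routing choices compared in Table 7 can be illustrated with a scalar toy model in which a switch decides whether the actor loss, the critic loss, both, or neither may update the encoder. This is only a sketch under simplifying assumptions: the one-parameter encoder, the linear actor and critic, the loss forms, and every name below are hypothetical and far simpler than the actual relation inference module and actor-critic losses.

```python
def encoder_gradient(w, x, v, u, target, actor_to_encoder, critic_to_encoder):
    """Gradient of the RL losses with respect to a scalar encoder weight w.

    Toy model (all names hypothetical):
      state        s = w * x
      critic value q = v * s, critic loss L_c = 0.5 * (q - target) ** 2
      actor output a = u * s, actor loss  L_a = -a
    """
    s = w * x
    grad = 0.0
    if critic_to_encoder:
        # Chain rule: dL_c/dw = (q - target) * dq/ds * ds/dw
        grad += (v * s - target) * v * x
    if actor_to_encoder:
        # Chain rule: dL_a/dw = -da/ds * ds/dw
        grad += -u * x
    return grad


params = dict(w=1.0, x=2.0, v=0.5, u=3.0, target=4.0)

# Two separate training stages: the encoder receives no RL gradient at all.
separate = encoder_gradient(**params, actor_to_encoder=False, critic_to_encoder=False)

# Critic-only routing, the setting labelled "StockFormer" in Table 7.
critic_only = encoder_gradient(**params, actor_to_encoder=False, critic_to_encoder=True)

# Routing both losses into the encoder, shown here only for comparison.
both = encoder_gradient(**params, actor_to_encoder=True, critic_to_encoder=True)
```

In an automatic-differentiation framework the same switch is usually realised by applying a stop-gradient (detach) operation to the state on the path that should not update the encoder.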

6 Conclusions

This paper presents StockFormer, an RL approach for financial market decision-making. It has two main contributions. First, it incorporates predictive coding in the actor-critic RL framework, which integrates the advantages of existing stock prediction models in learning future dynamics and the advantages of RL-for-finance methods in learning more flexible policies. Second, we propose to use three branches of modified Transformers to learn a mixture of long-/short-term predictive states and relational states. These three types of states are then combined adaptively and progressively through a series of attention structures in the decision-making module, such that the RL agent can optimize its decisions in a unified and meaningful state space. StockFormer shows competitive results on three finance datasets compared with a wide range of deep learning approaches, including both stock prediction and RL-for-finance models.

Acknowledgements

This work was supported by NSFC (62250062, U19B2035, 62106144), Shanghai Municipal Science and Technology Major Project (2021SHZDZX0102), the Fundamental Research Funds for the Central Universities, and Shanghai Sailing Program (21Z510202133) from the Science and Technology Commission of Shanghai Municipality.

References

[Basak and Chabakauri, 2010] Suleyman Basak and Georgy Chabakauri. Dynamic mean-variance asset allocation. The Review of Financial Studies, 23(8):2970-3016, 2010.

[Benhamou et al., 2020] Eric Benhamou, David Saltiel, Jean Jacques Ohana, Jamal Atif, and Rida Laraki. Deep reinforcement learning (DRL) for portfolio allocation. In ECMLPKDD, 2020.

[Cho et al., 2019] Chun-Hung Cho, Guan-Yi Lee, Yueh-Lin Tsai, and Kun-Chan Lan. Toward stock price prediction using deep learning. In UCC, 2019.

[Duan et al., 2022] Yitong Duan, Lei Wang, Qizhong Zhang, and Jian Li. FactorVAE: A probabilistic dynamic factor model based on variational autoencoder for predicting cross-sectional stock returns. In AAAI, 2022.

[Elias, 1955] Peter Elias. Predictive coding-I. IRE Transactions on Information Theory, 1(1):16-24, 1955.

[Fedus et al., 2021] William Fedus, Barret Zoph, and Noam Shazeer. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. arXiv preprint arXiv:2101.03961, 2021.

[Feng et al., 2019] Fuli Feng, Xiangnan He, Xiang Wang, Cheng Luo, Yiqun Liu, and Tat-Seng Chua. Temporal relational ranking for stock prediction. ACM Transactions on Information Systems, 37(2):1-30, 2019.

[Haarnoja et al., 2018] Tuomas Haarnoja, Aurick Zhou, Kristian Hartikainen, George Tucker, Sehoon Ha, Jie Tan, Vikash Kumar, Henry Zhu, Abhishek Gupta, Pieter Abbeel, et al. Soft actor-critic algorithms and applications. arXiv preprint arXiv:1812.05905, 2018.

[Hoseinzade and Haratizadeh, 2019] Ehsan Hoseinzade and Saman Haratizadeh. CNNpred: CNN-based stock market prediction using a diverse set of variables. Expert Systems with Applications, 129:273-285, 2019.

[Hu and Lin, 2019] Yuh-Jong Hu and Shang-Jen Lin. Deep reinforcement learning for optimizing finance portfolio management. In AICAI, 2019.

[Hu et al., 2018] Ziniu Hu, Weiqing Liu, Jiang Bian, Xuanzhe Liu, and Tie-Yan Liu. Listening to chaotic whispers: A deep learning framework for news-oriented stock trend prediction. In WSDM, 2018.

[Huotari et al., 2020] Tommi Huotari, Jyrki Savolainen, and Mikael Collan. Deep reinforcement learning agent for S&P 500 stock selection. Axioms, 9(4):130, 2020.

[Kitaev et al., 2020] Nikita Kitaev, Łukasz Kaiser, and Anselm Levskaya. Reformer: The efficient transformer. In ICLR, 2020.

[Lee et al., 2020] Alex X Lee, Anusha Nagabandi, Pieter Abbeel, and Sergey Levine. Stochastic latent actor-critic: Deep reinforcement learning with a latent variable model. In NeurIPS, 2020.

[Li et al., 2018] Hao Li, Yanyan Shen, and Yanmin Zhu. Stock price prediction using attention-based multi-input LSTM. In ACML, 2018.

[Liang et al., 2018] Zhipeng Liang, Hao Chen, Junhao Zhu, Kangkang Jiang, and Yanran Li. Adversarial deep reinforcement learning in portfolio management. arXiv preprint arXiv:1808.09940, 2018.

[Lillicrap et al., 2016] Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. In ICLR, 2016.

[Liu et al., 2021] Xiao-Yang Liu, Hongyang Yang, Jiechao Gao, and Christina Dan Wang. FinRL: Deep reinforcement learning framework to automate trading in quantitative finance. In ICAIF, 2021.

[Mikolov et al., 2013] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. In ICLR, 2013.

[Oord et al., 2018] Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018.

[Patil et al., 2020] Pratik Patil, Ching-Seh Mike Wu, Katerina Potika, and Marjan Orang. Stock market prediction using ensemble of graph theory, machine learning and deep learning models. In ICSIM, 2020.

[Puterman, 2014] Martin L Puterman. Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons, 2014.

[Qin et al., 2017] Yao Qin, Dongjin Song, Haifeng Chen, Wei Cheng, Guofei Jiang, and Garrison Cottrell. A dual-stage attention-based recurrent neural network for time series prediction. arXiv preprint arXiv:1704.02971, 2017.

[Rao and Ballard, 1999] Rajesh PN Rao and Dana H Ballard. Predictive coding in the visual cortex: A functional interpretation of some extra-classical receptive-field effects. Nature Neuroscience, 2(1):79-87, 1999.

[Spratling, 2017] Michael W Spratling. A review of predictive coding algorithms. Brain and Cognition, 112:92-97, 2017.

[Suri et al., 2021] Karush Suri, Xiao Qi Shi, Konstantinos Plataniotis, and Yuri Lawryshyn. TradeR: Practical deep hierarchical reinforcement learning for trade execution. arXiv preprint arXiv:2104.00620, 2021.

[Théate and Ernst, 2021] Thibaut Théate and Damien Ernst. An application of deep reinforcement learning to algorithmic trading. Expert Systems with Applications, 173:114632, 2021.

[Vaswani et al., 2017] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NeurIPS, 2017.

[Wang et al., 2021] Heyuan Wang, Shun Li, Tengjiao Wang, and Jiayi Zheng. Hierarchical adaptive temporal-relational modeling for stock trend prediction. In IJCAI, 2021.

[Wen et al., 2019] Min Wen, Ping Li, Lingfei Zhang, and Yan Chen. Stock market trend prediction using high-order information of time series. IEEE Access, 7:28299-28308, 2019.

[Weng et al., 2020] Liguo Weng, Xudong Sun, Min Xia, Jia Liu, and Yiqing Xu. Portfolio trading system of digital currencies: A deep reinforcement learning with multidimensional attention gating mechanism. Neurocomputing, 402:171-182, 2020.

[Wu et al., 2019] Jheng-Long Wu, Chi-Sheng Yang, Kai-Hsuan Liu, and Min-Tzu Huang. A deep learning model for dimensional valence-arousal intensity prediction in stock market. In iCAST, 2019.

[Wu et al., 2021] Haixu Wu, Jiehui Xu, Jianmin Wang, and Mingsheng Long. Autoformer: Decomposition transformers with auto-correlation for long-term series forecasting. In NeurIPS, 2021.

[Yarats et al., 2021] Denis Yarats, Amy Zhang, Ilya Kostrikov, Brandon Amos, Joelle Pineau, and Rob Fergus. Improving sample efficiency in model-free reinforcement learning from images. In AAAI, 2021.

[Ye et al., 2020] Yunan Ye, Hengzhi Pei, Boxin Wang, Pin-Yu Chen, Yada Zhu, Ju Xiao, and Bo Li. Reinforcement-learning based portfolio management with augmented asset movement prediction states. In AAAI, 2020.

[Zhang et al., 2017] Liheng Zhang, Charu Aggarwal, and Guo-Jun Qi. Stock price prediction via discovering multi-frequency trading patterns. In KDD, 2017.

[Zhong et al., 2020] Yueyang Zhong, YeeMan Bergstrom, and Amy R Ward. Data-driven market-making via model-free learning. In IJCAI, 2020.

[Zhou et al., 2021] Haoyi Zhou, Shanghang Zhang, Jieqi Peng, Shuai Zhang, Jianxin Li, Hui Xiong, and Wancai Zhang. Informer: Beyond efficient transformer for long sequence time-series forecasting. In AAAI, 2021.

$^{*}$ Corresponding author

$^{1}$ Here we take daily stock trading as an example.