A Deep Reinforcement Learning Framework for Stock Trading Strategies (Code + Documentation)

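Preface

Deep reinforcement learning (DRL) is widely regarded as an effective approach in quantitative investing, so hands-on experience with it is attractive to beginners. However, building a practical DRL trading agent, one that decides where to trade, at what price, and in what quantity, involves a great deal of work and challenging up-front development and testing.

This article introduces a DRL library called FinRL, which helps beginners develop their own DRL-based stock trading strategies.

We start with a single stock as the example.

For the complete code, see the end of the article.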

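Problem Definition

The problem is to design an automated trading solution for a single stock. We model the stock trading process as a Markov decision process (MDP) and then formulate the trading goal as a maximization problem.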
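
One common way to write such a maximization objective is the expected discounted return of a policy. The sketch below uses standard MDP notation; the discount factor γ, the horizon T, and the policy symbol π are assumptions introduced here for illustration, and r(s, a, s′) is the reward function described under "Reward function".

```latex
\max_{\pi} \; \mathbb{E}_{\pi}\!\left[ \sum_{t=0}^{T-1} \gamma^{t} \, r(s_t, a_t, s_{t+1}) \right],
\qquad 0 < \gamma \le 1
```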

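Components of the reinforcement learning environment:

Action

The action space describes the actions through which the agent can interact with the environment. In the simplest case, an action a ∈ A takes one of three values, a ∈ {−1, 0, 1}, where −1, 0, and 1 denote selling, holding, and buying. An action can also cover multiple shares: we use an action space {−k, …, −1, 0, 1, …, k}, where k is the number of shares. For example, "buy 10 shares of Apple" and "sell 10 shares of Apple" are +10 and −10, respectively.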
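
To make the discrete action space concrete, here is a minimal sketch of applying an integer action to an account. The function name apply_action and the clipping rules (no short selling, no buying beyond available cash) are assumptions for illustration, not FinRL's implementation, and transaction costs are ignored.

```python
def apply_action(action, shares_held, cash, price, k=10):
    """Apply an integer action in {-k, ..., k}; return (shares_held, cash).

    Negative = sell, 0 = hold, positive = buy.
    Assumed rules: no short selling and no buying on credit.
    """
    if not -k <= action <= k:
        raise ValueError("action must lie in {-k, ..., k}")
    if action < 0:
        qty = min(-action, shares_held)        # cannot sell more than we hold
        return shares_held - qty, cash + qty * price
    if action > 0:
        qty = min(action, int(cash // price))  # cannot spend more than our cash
        return shares_held + qty, cash - qty * price
    return shares_held, cash                   # hold: nothing changes


# "Buy 10 shares" at a price of 100 with 2000 in cash.
print(apply_action(+10, shares_held=0, cash=2000.0, price=100.0))
```

Clipping infeasible orders rather than rejecting them keeps every action in the set valid in every state, which is convenient for a learning agent.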

Reward function

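The reward r(s, a, s′) is the incentive that drives the agent to learn a better policy. It is the change in portfolio value when action a is taken in state s and the new state s′ is reached, i.e. r(s, a, s′) = v′ − v, where v′ and v are the portfolio values at states s′ and s, respectively.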
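
The reward definition can be sketched in a few lines. Here the portfolio value is assumed to be cash plus the market value of the shares held; the function and argument names are illustrative only.

```python
def portfolio_value(cash, shares_held, price):
    """Portfolio value v: cash plus the market value of the shares held."""
    return cash + shares_held * price


def reward(cash, shares, price, next_cash, next_shares, next_price):
    """r(s, a, s') = v' - v, the change in portfolio value across one step."""
    v = portfolio_value(cash, shares, price)
    v_next = portfolio_value(next_cash, next_shares, next_price)
    return v_next - v


# Holding 10 shares while the price moves from 100 to 103, cash unchanged.
print(reward(1000.0, 10, 100.0, 1000.0, 10, 103.0))
```

Under this definition a trade executed at an unchanged price (and without fees) yields zero reward; the reward comes from price moves on the position held.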

State

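The state space describes the observations the agent receives from the environment. Just as a human trader analyzes various pieces of information before executing a trade, our trading agent observes many different features so that it can learn better in the interactive environment.

This case studies a single stock only, with data from the Yahoo Finance API. The data contain open, high, low, and close prices plus volume.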

Loading the Required Libraries

    # Install the unstable development version in Jupyter notebook:
    !pip install git+https://github.com/AI4Finance-LLC/FinRL-Library.git

    import pkg_resources
    import pip

    # Install the required packages if any of them is missing
    installedPackages = {pkg.key for pkg in pkg_resources.working_set}
    required = {'yfinance', 'pandas', 'matplotlib', 'stockstats', 'stable-baselines', 'gym', 'tensorflow'}
    missing = required - installedPackages
    if missing:
        !pip install yfinance
        !pip install pandas
        !pip install matplotlib
        !pip install stockstats
        !pip install gym
        !pip install stable-baselines[mpi]
        !pip install tensorflow==1.15.4

    # import packages
    import pandas as pd
    import numpy as np
    import matplotlib
    import matplotlib.pyplot as plt
    matplotlib.use('Agg')
    import datetime

    from finrl.config import config
    from finrl.marketdata.yahoodownloader import YahooDownloader
    from finrl.preprocessing.preprocessors import FeatureEngineer
    from finrl.preprocessing.data import data_split
    from finrl.env.environment import EnvSetup
    from finrl.env.EnvMultipleStock_train import StockEnvTrain
    from finrl.env.EnvMultipleStock_trade import StockEnvTrade
    from finrl.model.models import DRLAgent
    from finrl.trade.backtest import BackTestStats, BaselineStats, BackTestPlot

Data Download

FinRL uses the YahooDownloader class to fetch the data.


    class YahooDownloader:
        """Provides methods for retrieving daily stock data from
        Yahoo Finance API

        Attributes
        ----------
            start_date : str
                start date of the data (modified from config.py)
            end_date : str
                end date of the data (modified from config.py)
            ticker_list : list
                a list of stock tickers (modified from config.py)

        Methods
        -------
        fetch_data
            Fetches data from yahoo API
        """

Download and store the data:

    # fetch_data is a method, so it must be called with parentheses
    data_df = YahooDownloader(start_date = '2009-01-01',
                              end_date = '2020-09-30',
                              ticker_list = ['AAPL']).fetch_data()


Data Preprocessing

To build technical indicators, FinRL uses a FeatureEngineer class to preprocess the data.


    class FeatureEngineer:
        """Provides methods for preprocessing the stock price data

        Attributes
        ----------
            df: DataFrame
                data downloaded from Yahoo API
            feature_number : int
                number of features we used
            use_technical_indicator : boolean
                use technical indicator or not
            use_turbulence : boolean
                use turbulence index or not

        Methods
        -------
        preprocess_data
            main method to do the feature engineering
        """
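
As an illustration of what a technical-indicator step computes, here is a small pure-Python sketch of a simple moving average and a MACD line. It is not the FeatureEngineer code (the import list suggests FinRL relies on the stockstats package for this); the function names, the seeding of the EMA with the first value, and the 12/26 spans are common conventions assumed here.

```python
def sma(values, window):
    """Simple moving average; None until `window` observations are available."""
    out, running = [], 0.0
    for i, x in enumerate(values):
        running += x
        if i >= window:
            running -= values[i - window]  # drop the value leaving the window
        out.append(running / window if i >= window - 1 else None)
    return out


def ema(values, span):
    """Exponential moving average, seeded with the first observation."""
    alpha = 2.0 / (span + 1)
    out = []
    for x in values:
        out.append(x if not out else alpha * x + (1 - alpha) * out[-1])
    return out


def macd(close, fast=12, slow=26):
    """MACD line: fast EMA minus slow EMA of the closing prices."""
    return [f - s for f, s in zip(ema(close, fast), ema(close, slow))]


# Synthetic, steadily rising closing prices (not real market data).
close = [float(p) for p in range(1, 51)]
print(sma(close, 5)[-1], macd(close)[-1])
```

On a steadily rising series the fast EMA stays above the slow one, so the MACD line is positive; features like this give the agent a compact summary of recent price action.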

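Feature engineering:

    # copy and preprocess_data are methods, so both must be called with parentheses;
    # tech_indicator_list is assumed to be defined beforehand
    data_df = FeatureEngineer(data_df.copy(),
                              use_technical_indicator = True,
                              tech_indicator_list = tech_indicator_list,
                              use_turbulence = False,
                              user_defined_feature = True).preprocess_data()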
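
The later steps use a training set and a trading set, but the split itself is not shown in the article. A minimal sketch of a date-based split follows; split_by_date is a hypothetical helper working on plain rows (the article imports FinRL's data_split, whose exact call is not shown here), and the rows carry placeholder values rather than real prices. The 2019-01-01 boundary matches the trading start date used later.

```python
from datetime import date


def split_by_date(rows, start, end):
    """Return the rows with start <= row['date'] < end (end is exclusive)."""
    return [r for r in rows if start <= r["date"] < end]


# Placeholder rows; the close values are made up for illustration.
rows = [
    {"date": date(2018, 12, 28), "tic": "AAPL", "close": 1.0},
    {"date": date(2018, 12, 31), "tic": "AAPL", "close": 2.0},
    {"date": date(2019, 1, 2), "tic": "AAPL", "close": 3.0},
]

train = split_by_date(rows, date(2009, 1, 1), date(2019, 1, 1))
trade = split_by_date(rows, date(2019, 1, 1), date(2020, 9, 30))
print(len(train), len(trade))
```

Splitting strictly by time, rather than at random, keeps future prices out of the training set, which matters for an honest backtest.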

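Environment Setup

We formulate the financial task as a Markov decision process (MDP). Training consists of observing stock price changes, taking actions, and computing rewards, so that the agent adjusts its policy accordingly. By interacting with the environment, the trading agent derives a trading strategy that maximizes reward over time.

The trading environment is based on the OpenAI Gym framework.

Environment design is one of the most important parts of DRL, because it varies across applications and markets. We cannot use a stock trading environment to trade Bitcoin, and vice versa.

The action space describes the actions through which the agent can interact with the environment. In the simplest case an action takes one of three values, {−1, 0, 1}, denoting selling, holding, and buying. An action can also cover multiple shares: we use an action space {−k, …, −1, 0, 1, …, k}, where a positive value is the number of shares to buy and a negative value is the number of shares to sell. A continuous action space needs to be normalized to [−1, 1], because the policy is defined on a Gaussian distribution, which needs to be normalized and symmetric.

In this article we set k = 200, so the full action space for AAPL has 200 × 2 + 1 = 401 actions.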
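
The relationship between the normalized range [−1, 1] and the 401 share counts can be sketched as follows. The helper scale_action and its clip-then-round rule are assumptions for illustration; the exact conversion inside FinRL may differ.

```python
def scale_action(a, hmax=200):
    """Map a normalized action a in [-1, 1] to a share count in [-hmax, hmax]."""
    a = max(-1.0, min(1.0, a))  # clip policy outputs that fall outside [-1, 1]
    return round(a * hmax)      # nearest whole number of shares


print(scale_action(0.5))   # buy 100 shares
print(scale_action(-1.0))  # sell 200 shares

# Every integer from -200 to 200 is reachable: 200 * 2 + 1 = 401 actions.
print(len({scale_action(i / 200) for i in range(-200, 201)}))
```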

FinRL uses the EnvSetup class to set up the environment:


    class EnvSetup:
        """Provides methods for retrieving daily stock data from
        Yahoo Finance API

        Attributes
        ----------
            stock_dim: int
                number of unique stocks
            hmax : int
                maximum number of shares to trade
            initial_amount: int
                start money
            transaction_cost_pct : float
                transaction cost percentage per trade
            reward_scaling: float
                scaling factor for reward, good for training
            tech_indicator_list: list
                a list of technical indicator names (modified from config.py)

        Methods
        -------
        fetch_data
            Fetches data from yahoo API
        """

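Initialize an environment class:

    # stock_dimension, state_space, tech_indicator_list and the training
    # set `train` are assumed to have been defined in earlier steps
    env_setup = EnvSetup(stock_dim = stock_dimension,
                         state_space = state_space,
                         hmax = 200,
                         initial_amount = 100000,
                         transaction_cost_pct = 0.001,
                         tech_indicator_list = tech_indicator_list)

    env_train = env_setup.create_env_training(data = train,
                                              env_class = SingleStockEnv)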

User-defined environment: a simulation environment class.

FinRL provides a class for the single-stock trading environment:


    class SingleStockEnv(gym.Env):
        """A single stock trading environment for OpenAI gym

        Attributes
        ----------
            df: DataFrame
                input data
            stock_dim : int
                number of unique stocks
            hmax : int
                maximum number of shares to trade
            initial_amount : int
                start money
            transaction_cost_pct: float
                transaction cost percentage per trade
            reward_scaling: float
                scaling factor for reward, good for training
            state_space: int
                the dimension of input features
            action_space: int
                equals stock dimension
            tech_indicator_list: list
                a list of technical indicator names
            turbulence_threshold: int
                a threshold to control risk aversion
            day: int
                an increment number to control date

        Methods
        -------
        _sell_stock
            perform sell action based on the sign of the action
        _buy_stock
            perform buy action based on the sign of the action
        step
            at each step the agent will return actions, then
            we will calculate the reward, and return the next
            observation.
        reset
            reset the environment
        render
            use render to return other functions
        save_asset_memory
            return account value at each time step
        save_action_memory
            return actions/positions at each time step
        """
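
To show how the pieces named in the docstring fit together, here is a heavily simplified, dependency-free sketch of a single-stock environment with a Gym-style reset()/step() interface. It is not FinRL's SingleStockEnv: the class name ToySingleStockEnv, the three-element state, the fee handling, and the synthetic prices are all assumptions made for illustration.

```python
import random


class ToySingleStockEnv:
    """Toy single-stock environment with a gym-style reset()/step() interface."""

    def __init__(self, prices, hmax=200, initial_amount=100000.0,
                 transaction_cost_pct=0.001):
        self.prices = list(prices)
        self.hmax = hmax
        self.initial_amount = initial_amount
        self.cost = transaction_cost_pct
        self.reset()

    def _state(self):
        # Observation: [cash balance, current price, shares held]
        return [self.cash, self.prices[self.day], self.shares]

    def _value(self):
        return self.cash + self.shares * self.prices[self.day]

    def reset(self):
        self.day = 0
        self.cash = self.initial_amount
        self.shares = 0
        self.asset_memory = [self.initial_amount]
        return self._state()

    def step(self, action):
        """action: float in [-1, 1], scaled by hmax into a share count."""
        price = self.prices[self.day]
        value_before = self._value()
        qty = round(max(-1.0, min(1.0, action)) * self.hmax)
        if qty < 0:    # sell, limited by current holdings
            sold = min(-qty, self.shares)
            self.cash += sold * price * (1 - self.cost)
            self.shares -= sold
        elif qty > 0:  # buy, limited by available cash including the fee
            bought = min(qty, int(self.cash // (price * (1 + self.cost))))
            self.cash -= bought * price * (1 + self.cost)
            self.shares += bought
        self.day += 1  # move to the next trading day
        done = self.day >= len(self.prices) - 1
        reward = self._value() - value_before  # change in portfolio value
        self.asset_memory.append(self._value())
        return self._state(), reward, done, {}


# Interaction loop with a random policy on synthetic prices.
random.seed(0)
env = ToySingleStockEnv(prices=[100.0, 101.0, 99.0, 102.0, 103.0])
obs, done = env.reset(), False
while not done:
    obs, reward, done, _ = env.step(random.uniform(-1, 1))
print(env.asset_memory)
```

The loop at the bottom mirrors the training process described earlier: observe the state, take an action, receive the reward, and repeat until the data run out. A DRL algorithm replaces the random policy with a learned one.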

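Implementing the DRL Algorithms

The DRL algorithms are implemented on top of OpenAI Baselines and Stable Baselines. Stable Baselines is a fork of OpenAI Baselines that includes a major structural refactoring and code cleanup.

The FinRL library includes fine-tuned standard DRL algorithms such as DQN, DDPG, Multi-Agent DDPG, PPO, SAC, A2C, and TD3. It also allows users to design their own DRL algorithms by adapting these.

FinRL uses the DRLAgent class to implement the algorithms: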

    class DRLAgent:
        """Provides implementations for DRL algorithms

        Attributes
        ----------
            env: gym environment class
                user-defined class

        Methods
        -------
        train_PPO
            the implementation for PPO algorithm
        train_A2C
            the implementation for A2C algorithm
        train_DDPG
            the implementation for DDPG algorithm
        train_TD3
            the implementation for TD3 algorithm
        DRL_prediction
            make a prediction in a test dataset and get results
        """

Model Training

    print("==============Model Training===========")

    # Timestamp used to name the model; now() must be called with parentheses
    now = datetime.datetime.now().strftime('%Y%m%d-%Hh%M')

    td3_params_tuning = {'batch_size': 128,
                         'buffer_size': 50000,
                         'learning_rate': 0.0001,
                         'verbose': 0,
                         'timesteps': 20000}

    agent = DRLAgent(env = env_train)

    model_td3 = agent.train_TD3(model_name = "TD3_{}".format(now),
                                model_params = td3_params_tuning)

In this article we used four DRL models: PPO, A2C, DDPG, and TD3.

TD3 is an improvement over DDPG. For background on TD3, see:

https://spinningup.openai.com/en/latest/algorithms/td3.html#background
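
As a sketch of how TD3 differs from DDPG, the snippet below computes a TD3 learning target for one transition with a scalar action. It shows two of TD3's changes: target policy smoothing (clipped noise added to the target action) and clipped double-Q learning (the minimum of two target critics). The function name, the callable critics, and the default values (0.99, 0.2, 0.5) are illustrative assumptions; this is not the Stable Baselines implementation, and TD3's third change, delayed policy updates, is not shown.

```python
import random


def td3_target(reward, done, next_action, q1_target, q2_target, next_state,
               gamma=0.99, noise_std=0.2, noise_clip=0.5, act_limit=1.0,
               rng=random):
    """TD3 target: r + gamma * (1 - done) * min(Q1', Q2')(s', smoothed a')."""
    # Target policy smoothing: perturb the target action with clipped noise.
    noise = max(-noise_clip, min(noise_clip, rng.gauss(0.0, noise_std)))
    a_next = max(-act_limit, min(act_limit, next_action + noise))
    # Clipped double-Q: take the smaller of the two target critic estimates.
    q_min = min(q1_target(next_state, a_next), q2_target(next_state, a_next))
    return reward + gamma * (1.0 - done) * q_min


# Two stand-in critics that disagree; the target uses the smaller estimate.
q1 = lambda state, action: 10.0
q2 = lambda state, action: 8.0
print(td3_target(1.0, 0.0, 0.3, q1, q2, next_state=None, gamma=0.5))
```

Taking the minimum of the two critics counteracts the overestimation of Q-values that DDPG is prone to, which is one reason TD3 tends to train more stably.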

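Tensorboard: Plotting the Reward and Loss Functions

We use the tensorboard integration for hyperparameter tuning and model selection; Tensorboard generates nice-looking charts.

Once the learn function has been called, you can monitor the RL agent during or after training with the following bash commands:

    # cd to the tensorboard_log folder, run the following command
    tensorboard --logdir ./A2C_20201127-19h01/

    # you can also add past logging folders; the name:path,name:path form
    # below is the comma-separated syntax of TensorBoard 1.x (a literal ";"
    # would be treated by bash as the end of the command)
    tensorboard --logdir a2c:./a2c_tensorboard/,ppo2:./ppo2_tensorboard/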

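Total reward for each algorithm:

total_timesteps (int): the total number of samples to train on. It is one of the most important hyperparameters; other important ones include the learning rate, batch size, and buffer size.

To compare these algorithms, I set total_timesteps = 100k. If total_timesteps is set too large, we run the risk of overfitting.

From the episode_reward chart we can see that, as the number of steps grows, these algorithms eventually converge toward an optimal policy. TD3 converges very quickly.

actor_loss for DDPG and policy_loss for TD3:

We ultimately chose the TD3 model because it converges very quickly and is a state-of-the-art improvement on DDPG. The episode_reward chart shows that TD3 converges well before reaching 100k total timesteps.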

Trading

Suppose we have $100,000 of initial capital on 2019/01/01 and use the TD3 model to trade AAPL.

    env_trade, obs_trade = env_setup.create_env_trading(data = trade,
                                                        env_class = SingleStockEnv)

    ## make a prediction and get the account value change
    df_account_value = DRLAgent.DRL_prediction(model = model_td3,
                                               test_data = trade,
                                               test_env = env_trade,
                                               test_obs = obs_trade)

Backtest Performance

We use Quantopian's pyfolio to backtest our trading strategy.

FinRL provides a set of functions that use pyfolio for backtesting.


    print("==============Get Backtest Results===========")
    perf_stats_all = BackTestStats(account_value = df_account_value)
    perf_stats_all = pd.DataFrame(perf_stats_all)
    perf_stats_all.to_csv("./" + config.RESULTS_DIR + "/perf_stats_all_" + now + '.csv')

    # BackTestPlot
    # pass the account value memory into the backtest functions
    # and select a baseline ticker
    print("=====Compare to AAPL itself buy-and-hold=======")
    %matplotlib inline
    BackTestPlot(account_value = df_account_value,
                 baseline_ticker = 'AAPL')
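
To give a sense of the kind of figures a backtest report contains, here is a pure-Python sketch that derives a few common statistics from a series of account values. It is not the BackTestStats or pyfolio code; the function name, the 252 trading days per year, and the zero risk-free rate are assumptions, and the input values are synthetic.

```python
import math


def backtest_stats(account_values, periods_per_year=252):
    """Cumulative/annual return, Sharpe ratio, and max drawdown of a value series."""
    returns = [b / a - 1.0 for a, b in zip(account_values, account_values[1:])]
    n = len(returns)  # needs at least two returns for the sample variance
    cumulative = account_values[-1] / account_values[0] - 1.0
    annual = (1.0 + cumulative) ** (periods_per_year / n) - 1.0
    mean = sum(returns) / n
    var = sum((r - mean) ** 2 for r in returns) / (n - 1)
    # Annualized Sharpe ratio, assuming a zero risk-free rate.
    sharpe = (mean / math.sqrt(var) * math.sqrt(periods_per_year)
              if var > 0 else float("nan"))
    # Max drawdown: the worst decline from a running peak.
    peak, max_drawdown = account_values[0], 0.0
    for v in account_values:
        peak = max(peak, v)
        max_drawdown = min(max_drawdown, v / peak - 1.0)
    return {"cumulative_return": cumulative, "annual_return": annual,
            "sharpe": sharpe, "max_drawdown": max_drawdown}


# Synthetic account values, not results from the article.
stats = backtest_stats([100.0, 110.0, 99.0, 120.0])
print(stats)
```

Comparing these figures against the same statistics for a buy-and-hold position in the baseline ticker indicates whether the agent added value beyond simply holding the stock.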

Multi-stock trading is not covered here; please refer to the complete code at the end of the article.

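We hope this article was helpful and that you learned something about DRL!

Author: Bruce Yang. Compiled by: QIML Editorial Team.

This article is reposted from the internet under the "CC BY-SA 4.0 CN" license, for learning and exchange only. Copyright belongs to the original author; for any issue concerning the work, copyright, or other matters, please leave "us" a message.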
