GUNREAL: GPU-accelerated UNsupervised REinforcement and Auxiliary Learning
Koichi Shirahata, Youri Coppens, Takuya Fukagai, Yasumoto Tomita, and Atsushi Ike
FUJITSU LABORATORIES LTD.
March 27, 2018
Deep Reinforcement Learning on GPUs
Deep Learning has led to tremendous advances in image, speech, and natural language recognition
Deep learning benefits from GPU acceleration by parallelizing the matrix computations of the deep neural network (DNN) units
Deep Reinforcement Learning has been applied to decision-making and control tasks such as robotics, games, and HVAC (heating, ventilation, and air conditioning)
Frequent interactions between the environment and the DNN make Deep RL slow and unstable
Deep Q-Network (DQN) [Mnih et al. 2015] stabilized DNN training on the GPU by introducing an experience replay memory buffer and mini-batches
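A minimal sketch of the experience-replay idea (hypothetical Python, not DQN's actual implementation): transitions are stored in a fixed-size buffer and random mini-batches are drawn for each DNN update, which decorrelates samples and gives the GPU larger batches to work on.

    import random
    from collections import deque

    class ReplayBuffer:
        """Fixed-size memory of (state, action, reward, next_state, done) tuples."""
        def __init__(self, capacity=100000):
            self.buffer = deque(maxlen=capacity)

        def add(self, state, action, reward, next_state, done):
            self.buffer.append((state, action, reward, next_state, done))

        def sample(self, batch_size=32):
            # Uniform random sampling decorrelates consecutive environment steps
            # and yields a mini-batch large enough to keep the GPU busy.
            return random.sample(self.buffer, batch_size)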
State-of-the-art Deep RL Algorithms
Asynchronous Advantage Actor-Critic (A3C) [Mnih et al. 2016] focused on concurrent simulations on single machines with CPUs only
Interactions become an overhead when using a GPU due to small batches
GA3C [Babaeizadeh et al. 2017] modified A3C to benefit more from the GPU
Runs concurrent simulations on the CPU and sends larger batches to the GPU
Learning capability is not sufficient for complex tasks using 3D environments
UNREAL [Jaderberg et al. 2017] extended A3C with auxiliary tasks
Training the auxiliary tasks guides a part of the main DNN model
[Figure: throughput vs. learning capability. GA3C offers faster training than A3C; UNREAL offers higher learning capability for complex tasks.]
Problems in Fast and Efficient Deep RL
Ideally, we would like an algorithm that runs fast (GPU efficiency) and also has high learning capability (the ability to learn complex environments)
How can the GPU be used efficiently for the auxiliary tasks in UNREAL?
Frequent interactions between the environment and the DNN for both the main task and the auxiliary tasks
Realizing both high training speed and high learning capability at the same time
[Figure: throughput vs. learning capability. The open question is how to combine GA3C's faster training with UNREAL's higher learning capability on complex tasks.]
Proposal: GUNREAL (GPU-accelerated UNREAL)
We extend the GA3C algorithm to learn auxiliary tasks
We improve the throughput on the GPU by adding a preprocessing phase and changing the data format
Results
GUNREAL learned 4.0x faster than UNREAL
GUNREAL learned a 3D environment 73% more efficiently than GA3C
GUNREAL: GPU-accelerated UNsupervised REinforcement and Auxiliary Learning
[Figure: throughput vs. learning capability. GUNREAL combines GA3C's faster training through GPU acceleration with UNREAL's higher learning capability on complex tasks.]
Deep Reinforcement Learning
Reinforcement Learning (RL): teaching agents control tasks through interactions with the environment
Deep RL = using DNNs (DL) to learn the agents' control tasks (RL)
Training data is generated through the interactions with the environment (see the sketch below)
Difficult to parallelize the DNN computations using mini-batches
[Figure: the agent's DNN receives states s_t and rewards r_t from the environment, outputs actions a_t from the policy π and value V, and the collected (s_t, r_t, a_t) sequences form the training data that produces the gradients dθ, dθ_v.]
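A minimal sketch of how the training data is generated (hypothetical Python; the env and agent interfaces are simplified assumptions): the agent rolls out up to t_max steps and the collected states, actions, and rewards form the next training batch.

    def rollout(env, agent, state, t_max=5):
        """Collect up to t_max transitions; together they form one training batch."""
        states, actions, rewards = [], [], []
        for _ in range(t_max):
            action = agent.act(state)              # forward pass: sample from pi(a|s_t)
            next_state, reward, done = env.step(action)
            states.append(state)
            actions.append(action)
            rewards.append(reward)
            state = next_state
            if done:
                state = env.reset()
                break
        return states, actions, rewards, state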
A3C
Asynchronous Advantage Actor-Critic by DeepMind [Mnih et al. 2016]
Multiple agents simulate in parallel to train one global DNN (the worker update is sketched below)
Targets single machines with CPUs
Interactions become an overhead when using a GPU due to small batches
[Figure: each agent keeps a local copy of the DNN with parameters θ', θ'_v, interacts with its own environment to collect (s_t, r_t, a_t), and asynchronously sends gradients dθ, dθ_v to the global model.]
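A minimal sketch of the quantity each A3C worker computes from its rollout (hypothetical Python/NumPy; the actual implementations use TensorFlow): n-step discounted returns, from which the advantage for the policy gradient and the critic's regression target are derived.

    import numpy as np

    def n_step_returns(rewards, bootstrap_value, gamma=0.99):
        """R_t = r_t + gamma * R_{t+1}, bootstrapped with V(s_{t+tmax}) from the critic."""
        R = bootstrap_value
        returns = np.zeros(len(rewards), dtype=np.float32)
        for t in reversed(range(len(rewards))):
            R = rewards[t] + gamma * R
            returns[t] = R
        return returns

    # Advantages for the actor: A_t = R_t - V(s_t)
    # Policy gradient term:  -log pi(a_t|s_t) * A_t  (plus an entropy bonus)
    # Critic loss term:      (R_t - V(s_t))^2
    # advantages = n_step_returns(rewards, v_bootstrap) - values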
GA3C
Developed by NVIDIA Research Labs [Babaeizadeh et al. 2017]
Only one global DNN, on the GPU
No local DNNs = no synchronization
Dynamic producer-consumer architecture to utilize the GPU better
Introduces queues for prediction and training (see the sketch below)
Learning capability is not enough for complex tasks (e.g. 3D environments)
[Figure: the GA3C server holds the DNN layers (π, V) on the GPU; agents push states to a prediction queue, predictor threads return actions a_t, and the experience (s_t, r_t, a_t) flows through a training queue to trainer threads that compute the gradients dθ, dθ_v.]
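A minimal sketch of the producer-consumer prediction path (hypothetical Python; model.predict, wait_queue, and the batch size are assumptions): agents enqueue states, a predictor thread drains the queue into one large batch, runs a single forward pass on the GPU, and routes each policy/value result back to the requesting agent.

    import queue

    prediction_queue = queue.Queue()

    def predictor_loop(model, agents, max_batch=128):
        """Drain queued (agent_id, state) requests and answer them with one
        batched forward pass on the GPU instead of many single-state calls."""
        while True:
            ids, states = [], []
            agent_id, state = prediction_queue.get()      # block until a request arrives
            ids.append(agent_id)
            states.append(state)
            while len(states) < max_batch:
                try:
                    agent_id, state = prediction_queue.get_nowait()
                except queue.Empty:
                    break
                ids.append(agent_id)
                states.append(state)
            policies, values = model.predict(states)      # one large GPU batch
            for i, p, v in zip(ids, policies, values):
                agents[i].wait_queue.put((p, v))          # each agent samples its action from p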
UNREAL
UNsupervised REinforcement and Auxiliary Learning by DeepMind [Jaderberg et al. 2017]
A3C + three auxiliary tasks
Pixel Control (PC), Value Replay (VR), Reward Prediction (RP) (the RP sampling is sketched below)
Higher learning capability to learn more complex tasks
More training time per step to process the auxiliary tasks
[Figure: UNREAL architecture. Agents feed experience (s_t, r_t, a_t) to the global model (θ', θ'_v, θ'_pc, θ'_vr, θ'_rp) and to a replay buffer; Pixel Control predicts visual changes via Q_pc, Value Replay reduces the value loss more quickly via V_vr, and Reward Prediction classifies reward transitions via C_rp, producing the additional gradients dθ_pc, dθ_vr, dθ_rp.]
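A minimal sketch of the Reward Prediction auxiliary task (hypothetical Python; sample_rp_sequence is an assumed replay-buffer helper): three consecutive frames are drawn from the replay buffer, with sampling skewed toward rewarding events, and the DNN classifies the sign of the following reward; the auxiliary losses are then added to the A3C loss with weighting coefficients.

    def sample_rp_batch(replay_buffer, batch_size=32):
        """Reward Prediction: classify the sign of the next reward from the
        three preceding frames. Sampling is skewed so that rewarding
        sequences are not drowned out by zero-reward ones."""
        batch = []
        for _ in range(batch_size):
            # sample_rp_sequence is an assumed helper returning three consecutive
            # frames s_{t-3..t-1} and the reward r_t that follows them.
            frames, reward = replay_buffer.sample_rp_sequence(skew=True)
            if reward > 0:
                label = 0        # class '+'
            elif reward == 0:
                label = 1        # class '0'
            else:
                label = 2        # class '-'
            batch.append((frames, label))
        return batch

    # The auxiliary losses are added to the A3C loss with weighting coefficients:
    # L = L_A3C + lambda_pc * L_PC + lambda_vr * L_VR + lambda_rp * L_RP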
GA3C + Auxiliary Tasks = GUNREAL
GUNREAL
GA3C's prediction architecture is kept for the main A3C task
Two extra prediction lines for VR and PC training batch generation (sketched below)
More training data is sent to the trainers
The DNN is extended
[Figure: GUNREAL server. Besides GA3C's prediction and training queues, a replay buffer feeds dedicated PC and VR prediction queues; PC and VR predictor threads return Q_pc and V_vr, trainer threads compute the combined gradients dθ, dθ_v, dθ_pc, dθ_vr, dθ_rp, and the DNN gains shared layers plus Q_pc and C_rp heads.]
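A minimal sketch of how one of the extra prediction lines could be used to build a Value Replay training batch (hypothetical Python; sample_sequence and predict_value are assumed interfaces, the latter standing in for a request through the VR prediction queue):

    def build_vr_batch(replay_memory, vr_predictor, gamma=0.99, seq_len=20):
        """Value Replay: re-estimate the value of a replayed sequence and use
        the discounted returns R_vr as regression targets for the value head."""
        # sample_sequence is an assumed helper on the replay memory.
        states, rewards, last_state = replay_memory.sample_sequence(seq_len)
        # The bootstrap value V(s_last) is requested through the VR prediction queue.
        R = vr_predictor.predict_value(last_state)
        returns = []
        for r in reversed(rewards):
            R = r + gamma * R
            returns.append(R)
        returns.reverse()
        return states, returns        # handed to the trainer threads as (s_vr, R_vr)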
GUNREAL’s DNN Structure
[Figure: DNN structure. Shared layers take the preprocessed observation s_t: Conv1 (8x8 kernel, 16 channels, stride 4), Conv2 (4x4 kernel, 32 channels, stride 2), FC1 (256 units). A3C / Value Replay heads: FC2 with a softmax over the actions gives the policy P(A=a_t|s_t), FC3 gives the scalar values V(s_t) and V(s_vr). Pixel Control head (input s_pc from the replay memory): FC4 (2592 units) followed by deconvolutions (4x4 kernel, stride 2; one 1-channel mean output and one output with a channel per action) combined into Q_pc. Reward Prediction head (frames s_rp-3, s_rp-2, s_rp-1 from the replay memory): FC5 (128 units) and FC6 with a softmax over the three reward classes +, 0, -.]
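A hedged sketch of the structure in the figure, written with tf.keras for readability (the original implementation uses plain TensorFlow; the input shape, activations, and the exact dueling combination in the PC head are assumptions):

    import tensorflow as tf
    from tensorflow.keras import layers

    def build_gunreal_net(num_actions, input_shape=(84, 84, 4)):
        obs = layers.Input(shape=input_shape)
        # Shared layers: Conv1 8x8/16 ch/stride 4, Conv2 4x4/32 ch/stride 2, FC1 256
        x = layers.Conv2D(16, 8, strides=4, activation='relu')(obs)   # -> 20x20x16
        x = layers.Conv2D(32, 4, strides=2, activation='relu')(x)     # -> 9x9x32
        flat = layers.Flatten()(x)
        fc1 = layers.Dense(256, activation='relu')(flat)
        # A3C / Value Replay heads
        policy = layers.Dense(num_actions, activation='softmax', name='pi')(fc1)  # FC2
        value = layers.Dense(1, name='V')(fc1)                                    # FC3
        # Pixel Control head: FC4 (2592 = 9*9*32 units) -> deconvolutions -> Q_pc
        pc = layers.Dense(2592, activation='relu')(fc1)                           # FC4
        pc = layers.Reshape((9, 9, 32))(pc)
        pc_mean = layers.Conv2DTranspose(1, 4, strides=2)(pc)            # 1-channel mean
        pc_adv = layers.Conv2DTranspose(num_actions, 4, strides=2)(pc)   # one channel per action
        q_pc = layers.Lambda(lambda t: t[0] + t[1], name='Q_pc')([pc_mean, pc_adv])
        # Reward Prediction head: in the real model it uses the concatenated conv
        # features of 3 replayed frames; shown on the shared trunk here for brevity.
        rp = layers.Dense(128, activation='relu')(flat)                           # FC5
        rp = layers.Dense(3, activation='softmax', name='C_rp')(rp)               # FC6: +, 0, -
        return tf.keras.Model(inputs=obs, outputs=[policy, value, q_pc, rp])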
Preprocessing the environment observations
(G)A3C has a preprocessing phase, collecting four sequential gray-scale frames (see the sketch below)
Instead, UNREAL uses single RGB frames and no preprocessing
We modified the Pixel Control task for GUNREAL, since the preprocessing affects the calculation process of the PC task
The preprocessed observations result in quicker learning for GUNREAL
[Figure: the Pixel Control task for GUNREAL]
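A minimal sketch of the (G)A3C-style preprocessing kept in GUNREAL (hypothetical Python/NumPy; resizing is omitted and the gray-scale conversion is a plain channel mean): each frame is converted to gray-scale and the last four frames are stacked into one observation.

    import numpy as np
    from collections import deque

    class FrameStacker:
        """Keep the last four gray-scale frames as one (H, W, 4) observation."""
        def __init__(self, num_frames=4):
            self.frames = deque(maxlen=num_frames)

        def reset(self):
            self.frames.clear()

        def process(self, rgb_frame):
            gray = rgb_frame.astype(np.float32).mean(axis=2) / 255.0   # naive gray-scale
            self.frames.append(gray)
            while len(self.frames) < self.frames.maxlen:               # pad at episode start
                self.frames.append(gray)
            return np.stack(self.frames, axis=-1)                      # (H, W, 4)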
Image data format
Two formats exist for processing image batches
NHWC: CPU optimal (the default in TensorFlow)
NCHW: the format cuDNN is optimized to process
• N: number of images, C: channel size, H: image height, W: image width
GUNREAL gets more throughput by using NCHW batches (see the sketch below)
5% better in predictions per second (PPS) and trainings per second (TPS)
[Figure: PPS (predictions per second) and TPS (trainings per second) for NHWC vs. NCHW batches]
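A minimal sketch of the data-format change (hypothetical Python/NumPy): batches assembled on the CPU in NHWC order are transposed to NCHW before they reach the cuDNN-backed convolutions.

    import numpy as np

    def nhwc_to_nchw(batch):
        """(N, H, W, C) -> (N, C, H, W); cuDNN convolutions favor NCHW."""
        return np.transpose(batch, (0, 3, 1, 2))

    # In TensorFlow the convolutions are then built with data_format='NCHW'
    # ('channels_first' in Keras) so no transposes are inserted inside the graph.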
Experiments
Implementations
GUNREAL: we implemented using TensorFlow
GA3C: the publicly available open-source implementation from the authors, using TensorFlow [1]
UNREAL: an open-source replication using TensorFlow [2]
• The UNREAL replication was enabled to run its DNN computations on the GPU
Environments
OpenAI Gym: Atari environment (a 2D game simulator)
DeepMind Lab: a 3D game simulator
[1] https://github.com/NVlabs/GA3C  [2] https://github.com/miyosuda/unreal
Results: OpenAI Gym (2D simulator)
Space Invaders (Elapsed time vs. Score)
GUNREAL learns faster than UNREAL in real time
[Figure: learning curves of elapsed time vs. score; GUNREAL is 4.0x faster than UNREAL]
Results: DeepMind Lab (3D simulator)
Seek Avoid Arena / Nav Maze (Elapsed time vs. Score)
GUNREAL clearly beats GA3C on more complex tasks
[Figure: learning curves of elapsed time vs. score; GUNREAL is 4.27x faster than GA3C (73% fewer steps) and 1.6x faster than UNREAL]
Demo: Seek Avoid Arena (DeepMind Lab)
Movie of the trained agent playing after training finished
Conclusions and Future Work
Conclusions
We developed GUNREAL, a fast Deep RL algorithm with high learning capability
• An architecture that benefits from GPU acceleration by extending GA3C with UNREAL’s auxiliary tasks
• We evaluated the necessity of a preprocessing phase and improved the throughput on the GPU by changing the data format
GUNREAL learned 4.0x faster than UNREAL by better GPU acceleration
GUNREAL learned a 3D environment 73% more efficiently than GA3C
Future work
Integration of an LSTM into GUNREAL’s model
Assess the applicability of GUNREAL in real-world applications
Scalability study on cluster computing environment
Related Work
GPU acceleration
DQN [Mnih et al. 2015] used a GPU to train a DNN with a single agent, taking 12 to 14 days on Atari games
• Prioritized experience replay [Schaul et al. 2016]
• Double Q-learning [Hasselt et al. 2016]
• Dueling network architecture [Wang et al. 2016]
GA3C [Babaeizadeh et al. 2017] improved multi-agent training, learning Atari games within a day
PAAC [Clemente et al. 2017] is a framework for synchronously parallelized deep RL
Cluster acceleration
DistBelief [Dean et al. 2012] has been used to train a DNN using tens of thousands of CPU cores
Gorila [Nair et al. 2015], a successor of DistBelief, distributed DQN across a cluster and outperformed standard DQN after training for 4 days