Actor-Critic Hung-yi Lee
Asynchronous Advantage Actor-Critic (A3C) Volodymyr Mnih, Adrià Puigdomènech Badia, Mehdi Mirza, Alex Graves, Timothy P. Lillicrap, Tim Harley, David Silver, Koray Kavukcuoglu, "Asynchronous Methods for Deep Reinforcement Learning", ICML, 2016
Review – Policy Gradient (with baseline): the sampled cumulative reward G is very unstable, because it comes from a single rollout; the same (s, a) pair can yield very different returns. With sufficient samples we approximate the expectation of G. Can we estimate the expected value of G directly, instead of sampling it? (The baseline formula is reconstructed below.)
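The formula on this slide did not survive extraction; what follows is the standard policy-gradient-with-baseline form it shows, with $G_t^n$ the discounted return from step $t$ of episode $n$ and $b$ the baseline:

```latex
\nabla \bar{R}_{\theta} \approx \frac{1}{N}\sum_{n=1}^{N}\sum_{t=1}^{T_n}
  \left( G_t^n - b \right)\, \nabla \log p_{\theta}\!\left(a_t^n \mid s_t^n\right),
\qquad
G_t^n = \sum_{t'=t}^{T_n} \gamma^{\,t'-t}\, r_{t'}^{n}
```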
Review – Q-Learning: the state value function V^π(s) maps a state s to a scalar, and the state-action value function Q^π(s, a) scores each action (the network-with-one-output-per-action form works for discrete actions only). Both can be estimated by TD or MC.
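For reference, the two critics reviewed here and the MC and TD estimates of V, written out (standard definitions, reconstructed):

```latex
V^{\pi}(s) = \mathbb{E}_{\pi}\!\left[\, G_t \mid s_t = s \,\right],
\qquad
Q^{\pi}(s, a) = \mathbb{E}_{\pi}\!\left[\, G_t \mid s_t = s,\; a_t = a \,\right]
```

```latex
\text{MC: } V^{\pi}(s_t) \leftarrow G_t,
\qquad
\text{TD: } V^{\pi}(s_t) \leftarrow r_t + \gamma\, V^{\pi}(s_{t+1})
```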
Actor-Critic: replace the sampled return G in the policy gradient with Q^π(s, a), and the baseline b with V^π(s), so the critic supplies both terms.
Advantage Actor-Critic: must we estimate two networks (Q and V)? That doubles the risk of estimation error. We can estimate only one: express Q through V, so only the state value V^π(s) needs to be learned, at the cost of a little extra variance (from the single sampled reward).
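The single-network trick on this slide, written out: expand Q one step and drop the expectation over the reward, so the advantage needs only V:

```latex
Q^{\pi}(s_t^n, a_t^n)
  = \mathbb{E}\!\left[\, r_t^n + \gamma\, V^{\pi}(s_{t+1}^n) \,\right]
  \approx r_t^n + \gamma\, V^{\pi}(s_{t+1}^n)
\;\Longrightarrow\;
A(s_t^n, a_t^n) \approx r_t^n + \gamma\, V^{\pi}(s_{t+1}^n) - V^{\pi}(s_t^n)
```

Dropping the expectation is exactly where the "little bit of variance" comes from: a single sampled $r_t^n$ replaces its expected value.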
Advantage Actor-Critic: the critic V^π(s) itself is trained by TD or MC, as in the value-based methods reviewed above.
Advantage Actor-Critic – tips: the actor π(s) (which outputs a distribution over actions such as left, right, fire) and the critic V^π(s) can share the early layers of the network, especially when the input is raw pixels; see the sketch below.
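A minimal PyTorch sketch of the shared-layer design; all layer sizes, names, and the single hidden layer are illustrative assumptions, not from the slides:

```python
import torch
import torch.nn as nn

class ActorCritic(nn.Module):
    """Actor and critic sharing the early layers, as suggested on the slide."""
    def __init__(self, obs_dim: int, n_actions: int, hidden: int = 128):
        super().__init__()
        # Shared trunk: both heads read the same state features.
        self.trunk = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        self.policy_head = nn.Linear(hidden, n_actions)  # actor: logits over left/right/fire
        self.value_head = nn.Linear(hidden, 1)           # critic: scalar V(s)

    def forward(self, s: torch.Tensor):
        h = self.trunk(s)
        return torch.softmax(self.policy_head(h), dim=-1), self.value_head(h)
```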
Asynchronous Advantage Actor-Critic (A3C): the idea is from 李思叡.
Asynchronous (source of image: https://medium.com/emergent-future/simple-reinforcement-learning-with-tensorflow-part-8-asynchronous-actor-critic-agents-a3c-c88f72a5e9f2#.68x6na7o9): each worker 1. copies the global parameters, 2. samples some data, 3. computes gradients, and 4. updates the global model (other workers also update the model asynchronously). A worker sketch follows.
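A sketch of one A3C worker loop following the four steps above, assuming a gym-like `env` with the old 4-tuple `step` API and a shared `global_model`/`optimizer`; every name here is an illustrative assumption:

```python
import copy
import torch

def a3c_worker(global_model, optimizer, env, t_max=5, gamma=0.99):
    """One worker; several of these run in parallel threads/processes."""
    local_model = copy.deepcopy(global_model)
    s = env.reset()
    while True:
        # 1. Copy global parameters.
        local_model.load_state_dict(global_model.state_dict())
        # 2. Sample some data with the local policy.
        log_probs, values, rewards = [], [], []
        for _ in range(t_max):
            probs, v = local_model(torch.as_tensor(s, dtype=torch.float32))
            a = torch.multinomial(probs, 1).item()
            s, r, done, _ = env.step(a)
            log_probs.append(torch.log(probs[a]))
            values.append(v)
            rewards.append(r)
            if done:
                s = env.reset()
                break
        # 3. Compute gradients of the actor-critic loss on the local model.
        R = 0.0 if done else local_model(torch.as_tensor(s, dtype=torch.float32))[1].item()
        loss = 0.0
        for log_p, v, r in reversed(list(zip(log_probs, values, rewards))):
            R = r + gamma * R
            advantage = R - v
            # Policy term uses a detached advantage; value term regresses V toward R.
            loss = loss - log_p * advantage.detach() + advantage.pow(2)
        loss.backward()
        # 4. Update the global model with the local gradients
        #    (other workers also update it asynchronously).
        for gp, lp in zip(global_model.parameters(), local_model.parameters()):
            gp.grad = lp.grad
        optimizer.step()
        optimizer.zero_grad()
        local_model.zero_grad()
```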
Pathwise Derivative Policy Gradient David Silver, Guy Lever, Nicolas Heess, Thomas Degris, Daan Wierstra, Martin Riedmiller, "Deterministic Policy Gradient Algorithms", ICML, 2014; Timothy P. Lillicrap, Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, Daan Wierstra, "Continuous Control with Deep Reinforcement Learning", ICLR, 2016
Another Way to Use the Critic. In the original actor-critic, the critic only tells the actor whether an action a is good or bad, so the probability of a is increased or decreased. In pathwise derivative policy gradient, from the Q function we know that taking a' at state s is better than a, so the actor is told which action to output; this is possible because we know the parameters of the Q function and can differentiate through it.
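In symbols (reconstructed from the surrounding slides): the actor is trained so its output maximizes the learned Q, by gradient ascent through Q:

```latex
\pi(s) = \arg\max_{a} Q^{\pi}(s, a),
\qquad
\theta_{\pi} \leftarrow \theta_{\pi}
  + \eta\, \nabla_{\theta_{\pi}} Q^{\pi}\!\big(s, \pi_{\theta_{\pi}}(s)\big)
```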
Pathwise derivative policy gradient vs. original actor-critic (comic: http://www.cartomad.com/comic/109000081104011.html): the actor serves as the solver of the optimization problem "find the a that maximizes Q(s, a)".
Pathwise Derivative Policy Gradient: a is the output of an actor π, so Q(s, π(s)) forms one large network with the actor in front and a fixed Q behind. Train the actor by gradient ascent through this large network, updating only the actor's parameters while Q stays fixed.
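A PyTorch sketch of this "actor plus frozen Q as one network" update; `q_net(states, actions)`, the optimizer, and all names are assumptions for illustration:

```python
import torch

def actor_update(actor, q_net, states, actor_opt):
    """Ascend Q(s, actor(s)) in the actor's parameters only; Q is fixed."""
    # Freeze Q: gradients flow *through* it to the actor but do not update it.
    for p in q_net.parameters():
        p.requires_grad_(False)
    actions = actor(states)                # a = pi(s)
    loss = -q_net(states, actions).mean()  # gradient ascent on Q == descent on -Q
    actor_opt.zero_grad()
    loss.backward()
    actor_opt.step()
    for p in q_net.parameters():
        p.requires_grad_(True)
```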
As in Q-learning, we need exploration (the actor's output is deterministic, so noise must be added) and a replay buffer; the critic Q is still estimated by TD or MC.
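One common way to realize these two ingredients (my assumption; the slide only names them) is Gaussian noise on the actor's output plus a deque as the replay buffer:

```python
import collections
import torch

buffer = collections.deque(maxlen=100_000)  # replay buffer of transitions

def act_with_noise(actor, s, sigma=0.1):
    """Deterministic actor output plus Gaussian exploration noise."""
    with torch.no_grad():
        a = actor(torch.as_tensor(s, dtype=torch.float32))
    return (a + sigma * torch.randn_like(a)).numpy()

# During interaction, transitions are stored for later sampling:
# buffer.append((s, a, r, s_next, done))
```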
Q-Learning Algorithm (review): initialize Q and the target network Q̂ = Q. In each step of each episode: act based on Q with exploration, store the transition (s_t, a_t, r_t, s_{t+1}) in the replay buffer, sample a batch from the buffer, set the target y = r_t + max_a Q̂(s_{t+1}, a), update Q to push Q(s_t, a_t) toward y, and every C steps reset Q̂ = Q.
Q-Learning Algorithm → Pathwise Derivative Policy Gradient: modify the algorithm above in four places. 1. Actions are taken by the actor π (plus exploration noise) instead of arg max over Q. 2. The target becomes y = r_t + Q̂(s_{t+1}, π̂(s_{t+1})), using a target actor π̂. 3. Additionally update the actor π to maximize Q(s, π(s)). 4. Maintain and periodically update the target actor π̂ as well as the target network Q̂. A sketch combining these changes follows.
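Putting the four changes together, a sketch of one training step. The target networks `q_target` and `pi_target`, the soft-update rate `tau`, the float `done` mask, and all other names are assumptions; change 1 (acting with π plus noise) happens at data-collection time, outside this function:

```python
import torch
import torch.nn.functional as F

def pathwise_update(q_net, q_target, actor, pi_target,
                    batch, q_opt, actor_opt, gamma=0.99, tau=0.005):
    s, a, r, s_next, done = batch  # tensors; done is a 0/1 float mask
    # Change 2: the target uses the target actor instead of max over actions.
    with torch.no_grad():
        y = r + gamma * (1 - done) * q_target(s_next, pi_target(s_next)).squeeze(-1)
    # Regress Q(s, a) toward y, as in Q-learning.
    q_loss = F.mse_loss(q_net(s, a).squeeze(-1), y)
    q_opt.zero_grad()
    q_loss.backward()
    q_opt.step()
    # Change 3: update the actor to maximize Q(s, pi(s)).
    # (Stale gradients left on q_net here are cleared by q_opt.zero_grad() next call.)
    actor_loss = -q_net(s, actor(s)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()
    # Change 4: softly update both target networks.
    for t_p, p in zip(q_target.parameters(), q_net.parameters()):
        t_p.data.mul_(1 - tau).add_(tau * p.data)
    for t_p, p in zip(pi_target.parameters(), actor.parameters()):
        t_p.data.mul_(1 - tau).add_(tau * p.data)
```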
Connection with GAN David Pfau, Oriol Vinyals, "Connecting Generative Adversarial Networks and Actor-Critic Methods", arXiv preprint, 2016