RL Successes and Challenges in High-Dimensional Games
Gerry Tesauro, IBM T. J. Watson Research Center

Outline
§ Overview/Definition of “Games”
§ Why Study Games?
§ Commonalities of RL successes
§ RL in Classic Board Games
  § TD-Gammon, KnightCap, TD-Chinook, RLGO
§ RL in Robotics Games
  § Attacker/Defender Robots
  § RoboCup Soccer
§ RL in Video/Online Games
  § AI Fighters
§ Open Discussion / Lessons Learned

What Do We Mean by “Games”?
§ Some Definitions of “Game”
  § A structured activity, usually undertaken for enjoyment (Wikipedia)
  § Activity among decision-makers seeking to achieve objectives in a limiting context (Clark Abt)
  § A form of play with goals and structure (Kevin Maroney)
§ Single-Player Game = “Puzzle”
§ “Competition” if players can’t interfere with other players’ performance
  § Olympic Hockey vs. Olympic Figure Skating
§ Common Ingredients: Players, Rules, Objective
§ But: Games with modifiable rules, no clear objective (MOOs)

Why Use Games for RL/AI?
§ Clean, Idealized Models of Reality
  § Rules are clear and known (Samuel: not true in economically important problems)
  § Can build very good simulators
§ Clear Metric to Measure Progress
  § Tournament results, Elo ratings, etc.
  § Danger: metric takes on a life of its own
§ Competition spurs progress
  § DARPA Grand Challenge, Netflix competition
§ Man vs. Machine Competition
  § “adds spice to the study” (Samuel)
  § “provides a convincing demonstration for those who do not believe that machines can learn” (Samuel)

How Games Extend “Classic RL”
[Diagram: games arranged along axes that extend classic RL: complex motivation (“Motivated” RL); multi-agent game strategy (Poker, Chicken); lifelike environments; and a fourth dimension, non-stationarity. Example domains include backgammon, chess, RoboCup Soccer, and AI Fighters.]

Ingredients for RL success
• Several commonalities:
  § Problems are more-or-less MDPs (near full observability, little history dependence)
  § |S| is enormous → can’t do DP
  § State-space representation critical: use of “features” based on domain knowledge
  § Train in a simulator! Need lots of experience, but still << |S|
  § Smooth function approximation (linear or NN) → very aggressive generalization/extrapolation
  § Only visit plausible states; only generalize to plausible states

RL + Gradient Parameter Training
• Recall incremental Bellman updates (TD(0)):
    V(s) ← V(s) + α [ r + γ V(s′) − V(s) ]
• If instead V(s) = V(s, w) is a parameterized function, adjust w to reduce the MSE (R − V(s, w))² by gradient descent:
    Δw = α (R − V(s, w)) ∇w V(s, w)
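
A minimal sketch of this gradient form of TD(0) for a linear value function V(s, w) = w · φ(s); the feature map φ, step size, and episode-termination handling are illustrative assumptions, not material from the slides:

import numpy as np

def td0_gradient_update(w, phi_s, phi_s_next, reward, done, alpha=0.01, gamma=1.0):
    """One gradient-descent TD(0) step for a linear value function V(s, w) = w . phi(s).

    The bootstrapped target R = r + gamma * V(s', w) (just r at the end of an
    episode) plays the role of R on the slide; the update reduces (R - V(s, w))^2
    treating R as a fixed target.
    """
    v_s = np.dot(w, phi_s)
    v_next = 0.0 if done else np.dot(w, phi_s_next)
    target = reward + gamma * v_next          # R
    td_error = target - v_s                   # R - V(s, w)
    # Gradient of 0.5*(R - V(s, w))^2 w.r.t. w is -(R - V(s, w)) * phi(s),
    # so gradient descent adds alpha * td_error * phi(s).
    return w + alpha * td_error * phi_s

# Usage (hypothetical names): w = td0_gradient_update(w, phi(s), phi(s_next), r, done)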

TD(λ) training of neural networks (episodic; γ = 1 and intermediate r = 0):
    wt+1 − wt = α (Vt+1 − Vt) Σk=1..t λ^(t−k) ∇w Vk
    (in the final update, the game outcome z replaces Vt+1)
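
A compact sketch of this episodic TD(λ) procedure for a tiny one-hidden-layer network, with eligibility traces accumulating the λ-weighted sum of ∇w Vk; the network size, α, and λ values are illustrative assumptions, and this is not TD-Gammon's code:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class TDLambdaNet:
    """Tiny one-hidden-layer value network trained with episodic TD(lambda),
    gamma = 1 and reward only at the end of the game (the final outcome z)."""

    def __init__(self, n_in, n_hidden, alpha=0.1, lam=0.7, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(scale=0.1, size=(n_hidden, n_in))
        self.W2 = rng.normal(scale=0.1, size=n_hidden)
        self.alpha, self.lam = alpha, lam
        self.reset_traces()

    def reset_traces(self):
        self.e1 = np.zeros_like(self.W1)
        self.e2 = np.zeros_like(self.W2)
        self.prev_value = None

    def value_and_grads(self, x):
        h = sigmoid(self.W1 @ x)
        v = sigmoid(self.W2 @ h)
        dv = v * (1 - v)                      # derivative of the output sigmoid
        gW2 = dv * h
        gW1 = np.outer(dv * self.W2 * h * (1 - h), x)
        return v, gW1, gW2

    def observe(self, x, outcome=None):
        """Call once per position x; pass outcome (0/1) with the final position."""
        v, gW1, gW2 = self.value_and_grads(x)
        if self.prev_value is not None:
            delta = v - self.prev_value       # Vt+1 - Vt
            self.W1 += self.alpha * delta * self.e1
            self.W2 += self.alpha * delta * self.e2
        # traces hold sum_k lambda^(t-k) grad_w Vk
        self.e1 = self.lam * self.e1 + gW1
        self.e2 = self.lam * self.e2 + gW2
        self.prev_value = v
        if outcome is not None:
            delta = outcome - v               # final TD error uses the game outcome z
            self.W1 += self.alpha * delta * self.e1
            self.W2 += self.alpha * delta * self.e2
            self.reset_traces()

# Usage (hypothetical sizes): net = TDLambdaNet(n_in=200, n_hidden=40);
# call net.observe(x) after each move, and net.observe(x_final, outcome=1.0) at game end.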

RL in Classic Board Games

[Figure: a backgammon board with locations numbered 25 down to 0, plus the Black bar (Bbar) and White bar (Wbar)]

Learning backgammon using TD(λ)
• Neural net observes a sequence of input patterns x1, x2, x3, …, xf: the sequence of board positions occurring during a game
• Representation: raw board description (# of White or Black checkers at each location) using a simple truncated unary encoding (“hand-crafted features” added in later versions)
  – 1-D geometry → 28 board locations → 200 “raw” input units → 300 input units incl. features
• Train neural net using the gradient version of TD(λ)
• Trained NN output Vt = V(xt, w) should estimate Prob(White wins | xt)
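
A sketch of the kind of truncated unary encoding described above, one board location at a time; the thresholds and unit counts here are an assumption modeled on common descriptions of TD-Gammon's raw input and should be treated as illustrative:

def encode_point(n_checkers, n_units=4):
    """Truncated unary encoding of the checker count at one board location.

    Units 1..3 switch on at counts >= 1, 2, 3; the last unit encodes the
    (scaled) excess beyond 3, so large stacks don't need many extra units.
    """
    units = [1.0 if n_checkers >= k else 0.0 for k in range(1, n_units)]
    units.append(max(0.0, (n_checkers - (n_units - 1)) / 2.0))
    return units

def encode_board(white_counts, black_counts):
    """Concatenate per-location encodings for both colors.

    white_counts / black_counts: checker counts per board location
    (e.g. 24 points plus the bar), giving a few hundred raw inputs.
    """
    x = []
    for counts in (white_counts, black_counts):
        for n in counts:
            x.extend(encode_point(n))
    return x

# Example: 25 locations per color -> 2 * 25 * 4 = 200 raw input units.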

• TD-Gammon can teach itself by playing games against itself and learning from the outcome
  – Works even starting from random initial play and zero initial expert knowledge (surprising!): achieves strong intermediate play
  – Add hand-crafted features: advanced level of play (1991)
  – 2-ply search: strong master play (1993)
  – 3-ply search: superhuman play (1998)

New TD-Gammon Results! (Tesauro, 1992)

Extending TD(λ) to TDLeaf
• Checkers and Chess: 2-D geometry, 64 board locations, dozens to thousands (Deep Blue) of features, linear function approximation
• Samuel had the basic idea: train the value of the current state to match the minimax backed-up value
• Proper mathematical formulation proposed by Beal & Smith; Baxter et al.
• Baxter’s chess program KnightCap showed rapid learning in play vs. humans: 1650 → 2150 Elo in only 300 games!
• Schaeffer et al. retrained the weights of the checkers program Chinook using TDLeaf + self-play; as strong as the manually tuned weights (a 5-year effort)
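
A minimal sketch of the TDLeaf(λ) idea with a linear evaluation function: temporal differences are taken between evaluations of successive principal-variation leaves, and earlier leaf gradients receive λ-discounted credit. The search interface and feature vectors are assumed; this is not KnightCap's or Chinook's code:

import numpy as np

def tdleaf_update(w, leaf_features, alpha=1e-3, lam=0.7):
    """One offline pass of TDLeaf(lambda) over a finished game, linear evaluation.

    leaf_features[t]: feature vector phi of the principal-variation leaf found
    by minimax/alpha-beta search from the position at move t. In practice the
    last leaf evaluation is often replaced by the true game outcome.
    """
    evals = [float(np.dot(w, phi)) for phi in leaf_features]
    grads = leaf_features                 # d(eval)/dw = phi for a linear evaluator
    w = w.copy()
    for t in range(len(evals) - 1):
        delta = evals[t + 1] - evals[t]   # temporal difference between leaf evals
        # lambda-weighted credit to all earlier leaf gradients
        for k in range(t + 1):
            w += alpha * (lam ** (t - k)) * delta * grads[k]
    return w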

RL in Computer Go
• Go: 2-D geometry, 361 board locations, hundreds to millions (RLGO) of features, linear or NN function approximation
• NeuroGo (M. Enzenberger, 1996; 2003)
  – Multiple reward signals: single-point eyes, connections, and live points
  – Rating ~1880 in 9x9 Go using 3-ply α-β search
• RLGO (D. Silver, 2008) uses only primitive local features and a linear value function; it can do live, on-the-fly training for each new position encountered in a Go game!
  – Rating ~2130 in 9x9 Go using α-β search (avg. depth ~6): strongest program not based on Monte-Carlo Tree Search
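
A rough sketch of the recipe attributed to RLGO above (a linear value over binary local-shape features, trained by TD); the feature indexing and update below are simplified assumptions, not Silver's implementation:

import numpy as np

class LocalShapeValue:
    """Linear value over binary local-shape features, V(s) = sigmoid(w . phi(s)).

    active_idx is assumed to be the list of indices of the local patterns
    (e.g. all 1x1..3x3 shapes) present in a position; binary features make
    the TD update a sparse add."""

    def __init__(self, n_features, alpha=0.01):
        self.w = np.zeros(n_features)
        self.alpha = alpha

    def value(self, active_idx):
        return 1.0 / (1.0 + np.exp(-self.w[active_idx].sum()))

    def td0_update(self, active_idx, next_active_idx=None, outcome=None):
        """TD(0) step: target is V(next position), or the game outcome at the end."""
        v = self.value(active_idx)
        target = outcome if outcome is not None else self.value(next_active_idx)
        # d sigmoid / d w_i = v*(1-v) for each active binary feature i
        self.w[active_idx] += self.alpha * (target - v) * v * (1 - v)
        return v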

RL in Robotics Games

Robot Air Hockey
• video at: http://www.cns.atr.jp/~dbent/mpeg/hockeyfullsmall.avi
• D. Bentivegna & C. Atkeson, ICRA 2001
• 2-D spatial problem
• 30 degree-of-freedom arm, 420 decisions/sec
• hand-built primitives, supervised learning + RL

WoLF in Adversarial Robot Learning
• GraWoLF (Bowling & Veloso): combines the WoLF (“Win or Learn Fast”) principle with policy-gradient RL (Sutton et al., 2000)
• Again 2-D spatial geometry, 7 input features, 16 CMAC tiles
  – video at: http://webdocs.ualberta.ca/~bowling/videos/AdversarialRobotLearning.mp4
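
A small sketch of the WoLF ("Win or Learn Fast") idea that GraWoLF builds on: keep a slowly-moving average policy and take larger policy-gradient steps when the current policy appears to be losing to it. The value estimates, step sizes, and parameter handling are assumptions for illustration, not the GraWoLF algorithm itself:

import numpy as np

def wolf_step_size(value_current, value_average, delta_win=0.01, delta_lose=0.04):
    """WoLF principle: learn cautiously when winning, fast when losing.

    value_current: estimated value of the current policy against the opponent
    value_average: estimated value of the long-run average policy
    """
    return delta_win if value_current > value_average else delta_lose

def wolf_policy_gradient_step(theta, theta_avg, grad, value_current, value_average,
                              avg_rate=0.001):
    """One policy-gradient ascent step with a WoLF-modulated learning rate,
    plus a slow update that tracks the average policy parameters."""
    step = wolf_step_size(value_current, value_average)
    theta = theta + step * grad
    theta_avg = theta_avg + avg_rate * (theta - theta_avg)
    return theta, theta_avg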

RL in RoboCup Soccer
• Once again, 2-D spatial geometry
• Much good work by Peter Stone et al.
  – TPOT-RL: learned advanced team strategies given limited observability; key to CMUnited victories in the late 90s
  – Fast gait for Sony Aibo dogs
  – Ball acquisition for Sony Aibo dogs
  – Keepaway in the RoboCup simulation league

RoboCup “Keepaway” Game (Stone et al.)
• Uses the RoboCup simulator, not real robots
• Task: one team (“keepers”) tries to maintain possession of the ball as long as possible; the other team (“takers”) tries to take it away
• Keepers are trained using a continuous-time, semi-Markov version of the Sarsa algorithm
• Represent Q(s, a) using CMAC (coarse tile coding) function approximation
• State representation: small # of distances and angles between teammates, opponents, and ball
• Reward = time of possession
• Results: learned policies do much better than either random or hand-coded policies, e.g. on a 25 x 25 field:
  – learned TOP 15.0 sec, hand-coded 8.0 sec, random 6.4 sec
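
A condensed sketch of linear Sarsa with CMAC tile coding in the spirit of the Keepaway learners; the crude tile coder, the state bounds, and the discrete-time simplification of the semi-Markov Sarsa mentioned above are all assumptions made for illustration:

import numpy as np

class SimpleCMAC:
    """Very crude tile coder: several overlapping uniform grids ("tilings")
    over a box of continuous state variables (e.g. distances and angles)."""

    def __init__(self, low, high, n_tilings=16, tiles_per_dim=8):
        self.low, self.high = np.asarray(low, float), np.asarray(high, float)
        self.n_tilings, self.tiles = n_tilings, tiles_per_dim
        self.n_features = n_tilings * tiles_per_dim ** len(low)

    def active_features(self, s):
        s = (np.asarray(s, float) - self.low) / (self.high - self.low)
        idx = []
        for t in range(self.n_tilings):
            offset = t / (self.n_tilings * self.tiles)   # shift each tiling slightly
            coords = np.clip(((s + offset) * self.tiles).astype(int), 0, self.tiles - 1)
            flat = int(np.ravel_multi_index(coords, [self.tiles] * len(s)))
            idx.append(t * self.tiles ** len(s) + flat)
        return idx

class SarsaAgent:
    """Linear Sarsa: Q(s, a) = sum of the weights of the active tiles for action a."""

    def __init__(self, cmac, n_actions, alpha=0.1, gamma=1.0, epsilon=0.05):
        self.cmac, self.n_actions = cmac, n_actions
        self.w = np.zeros((n_actions, cmac.n_features))
        self.alpha, self.gamma, self.epsilon = alpha / cmac.n_tilings, gamma, epsilon

    def q(self, feats, a):
        return self.w[a, feats].sum()

    def act(self, feats):
        if np.random.rand() < self.epsilon:
            return np.random.randint(self.n_actions)
        return int(np.argmax([self.q(feats, a) for a in range(self.n_actions)]))

    def update(self, feats, a, reward, next_feats=None, next_a=None):
        """Sarsa update; reward would be elapsed possession time in Keepaway."""
        target = reward if next_feats is None else reward + self.gamma * self.q(next_feats, next_a)
        td_error = target - self.q(feats, a)
        self.w[a, feats] += self.alpha * td_error   # gradient is 1 on active tiles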

RL in Video Games

AI Fighters
• Graepel, Herbrich & Gold, 2004
  – Used the commercial game platform Tao Feng (runs on Xbox): a real-time simulator (3-D!)
  – Basic feature set + SARSA + linear value function
  – Multiple challenges of the environment (real time, concurrency, …):
    • opponent state not known exactly
    • agent state and reward not known exactly
    • due to game animation, legal moves are not known

Links to AI Fighters videos:
before training: http://research.microsoft.com/en-us/projects/mlgames2008/taofengearlyaggressive.wmv
after training: http://research.microsoft.com/en-us/projects/mlgames2008/taofenglateaggressive.wmv

Discussion / Lessons Learned
§ Winning formula:
  § hand-designed features (fairly small number)
  § aggressive, smooth function approximation
§ Researchers should try raw-input comparisons and try nonlinear function approximation
§ Many/most state variables in real problems seem pretty irrelevant
  § Opportunity to try recent linear and/or nonlinear dimensionality reduction algorithms
  § Sparsity constraints (L1 regularization, etc.) also promising
§ Brain/retina architecture impressively suited for 2-D spatial problems
  § More studies using Convolutional Neural Nets, etc.