RL Successes and Challenges in High-Dimensional Games
Gerry Tesauro, IBM T. J. Watson Research Center

Outline
§ Overview/Definition of “Games”
§ Why Study Games?
§ Commonalities of RL successes
§ RL in Classic Board Games
  § TD-Gammon, KnightCap, TD-Chinook, RLGO
§ RL in Robotics Games
  § Attacker/Defender Robots
  § RoboCup Soccer
§ RL in Video/Online Games
  § AI Fighters
§ Open Discussion / Lessons Learned

What Do We Mean by “Games”?
§ Some Definitions of “Game”
  § A structured activity, usually undertaken for enjoyment (Wikipedia)
  § Activity among decision-makers seeking to achieve objectives in a limiting context (Clark Abt)
  § A form of play with goals and structure (Kevin Maroney)
§ Single-Player Game = “Puzzle”
§ “Competition” if players can’t interfere with other players’ performance
  § Olympic Hockey vs. Olympic Figure Skating
§ Common Ingredients: Players, Rules, Objective
§ But: Games with modifiable rules, no clear objective (MOOs)

Why Use Games for RL/AI?
§ Clean, Idealized Models of Reality
  § Rules are clear and known (Samuel: not true in economically important problems)
  § Can build very good simulators
§ Clear Metric to Measure Progress
  § Tournament results, Elo ratings, etc.
  § Danger: metric takes on a life of its own
§ Competition spurs progress
  § DARPA Grand Challenge, Netflix competition
§ Man vs. Machine Competition
  § “adds spice to the study” (Samuel)
  § “provides a convincing demonstration for those who do not believe that machines can learn” (Samuel)

How Games Extend “Classic RL”
[Diagram: games arranged along axes that extend classic RL: complex motivation (“Motivated” RL); multi-agent game strategy (Poker, Chicken); lifelike environments; and a fourth dimension, non-stationarity. Example domains include backgammon, chess, RoboCup Soccer, and AI Fighters.]

Ingredients for RL success
• Several commonalities:
  § Problems are more-or-less MDPs (near full observability, little history dependence)
  § |S| is enormous → can’t do DP
  § State-space representation critical: use of “features” based on domain knowledge
  § Train in a simulator! Need lots of experience, but still << |S|
  § Smooth function approximation (linear or NN) → very aggressive generalization/extrapolation
  § Only visit plausible states; only generalize to plausible states

RL + Gradient Parameter Training
• Recall incremental Bellman updates (TD(0)):
    V(s) ← V(s) + α [ r + γ V(s′) − V(s) ]
• If instead V(s) = V(s, w) is a parameterized function, adjust w to reduce the MSE (R − V(s, w))² by gradient descent:
    Δw = α (R − V(s, w)) ∇w V(s, w)
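
A minimal sketch of this gradient form of TD(0) for a linear value function V(s, w) = w · φ(s); the feature map φ, step size, and episode-termination handling are illustrative assumptions, not material from the slides:

import numpy as np

def td0_gradient_update(w, phi_s, phi_s_next, reward, done, alpha=0.01, gamma=1.0):
    """One gradient-descent TD(0) step for a linear value function V(s, w) = w . phi(s).

    The bootstrapped target R = r + gamma * V(s', w) (just r at the end of an
    episode) plays the role of R on the slide; the update reduces (R - V(s, w))^2
    treating R as a fixed target.
    """
    v_s = np.dot(w, phi_s)
    v_next = 0.0 if done else np.dot(w, phi_s_next)
    target = reward + gamma * v_next          # R
    td_error = target - v_s                   # R - V(s, w)
    # Gradient of 0.5*(R - V(s, w))^2 w.r.t. w is -(R - V(s, w)) * phi(s),
    # so gradient descent adds alpha * td_error * phi(s).
    return w + alpha * td_error * phi_s

# Usage (hypothetical names): w = td0_gradient_update(w, phi(s), phi(s_next), r, done)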

TD(λ) training of neural networks (episodic; γ = 1 and intermediate r = 0):
    wt+1 − wt = α (Vt+1 − Vt) Σk=1..t λ^(t−k) ∇w Vk
    (in the final update, the game outcome z replaces Vt+1)
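
A compact sketch of this episodic TD(λ) procedure for a tiny one-hidden-layer network, with eligibility traces accumulating the λ-weighted sum of ∇w Vk; the network size, α, and λ values are illustrative assumptions, and this is not TD-Gammon's code:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class TDLambdaNet:
    """Tiny one-hidden-layer value network trained with episodic TD(lambda),
    gamma = 1 and reward only at the end of the game (the final outcome z)."""

    def __init__(self, n_in, n_hidden, alpha=0.1, lam=0.7, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(scale=0.1, size=(n_hidden, n_in))
        self.W2 = rng.normal(scale=0.1, size=n_hidden)
        self.alpha, self.lam = alpha, lam
        self.reset_traces()

    def reset_traces(self):
        self.e1 = np.zeros_like(self.W1)
        self.e2 = np.zeros_like(self.W2)
        self.prev_value = None

    def value_and_grads(self, x):
        h = sigmoid(self.W1 @ x)
        v = sigmoid(self.W2 @ h)
        dv = v * (1 - v)                      # derivative of the output sigmoid
        gW2 = dv * h
        gW1 = np.outer(dv * self.W2 * h * (1 - h), x)
        return v, gW1, gW2

    def observe(self, x, outcome=None):
        """Call once per position x; pass outcome (0/1) with the final position."""
        v, gW1, gW2 = self.value_and_grads(x)
        if self.prev_value is not None:
            delta = v - self.prev_value       # Vt+1 - Vt
            self.W1 += self.alpha * delta * self.e1
            self.W2 += self.alpha * delta * self.e2
        # traces hold sum_k lambda^(t-k) grad_w Vk
        self.e1 = self.lam * self.e1 + gW1
        self.e2 = self.lam * self.e2 + gW2
        self.prev_value = v
        if outcome is not None:
            delta = outcome - v               # final TD error uses the game outcome z
            self.W1 += self.alpha * delta * self.e1
            self.W2 += self.alpha * delta * self.e2
            self.reset_traces()

# Usage (hypothetical sizes): net = TDLambdaNet(n_in=200, n_hidden=40);
# call net.observe(x) after each move, and net.observe(x_final, outcome=1.0) at game end.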

RL in Classic Board Games

[Figure: a backgammon board with locations numbered 25 down to 0, plus the Black bar (Bbar) and White bar (Wbar)]

Learning backgammon using TD(λ)
• Neural net observes a sequence of input patterns x1, x2, x3, …, xf: the sequence of board positions occurring during a game
• Representation: raw board description (# of White or Black checkers at each location) using a simple truncated unary encoding (“hand-crafted features” added in later versions)
  – 1-D geometry → 28 board locations → 200 “raw” input units → 300 input units incl. features
• Train neural net using the gradient version of TD(λ)
• Trained NN output Vt = V(xt, w) should estimate Prob(White wins | xt)
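
A sketch of the kind of truncated unary encoding described above, one board location at a time; the thresholds and unit counts here are an assumption modeled on common descriptions of TD-Gammon's raw input and should be treated as illustrative:

def encode_point(n_checkers, n_units=4):
    """Truncated unary encoding of the checker count at one board location.

    Units 1..3 switch on at counts >= 1, 2, 3; the last unit encodes the
    (scaled) excess beyond 3, so large stacks don't need many extra units.
    """
    units = [1.0 if n_checkers >= k else 0.0 for k in range(1, n_units)]
    units.append(max(0.0, (n_checkers - (n_units - 1)) / 2.0))
    return units

def encode_board(white_counts, black_counts):
    """Concatenate per-location encodings for both colors.

    white_counts / black_counts: checker counts per board location
    (e.g. 24 points plus the bar), giving a few hundred raw inputs.
    """
    x = []
    for counts in (white_counts, black_counts):
        for n in counts:
            x.extend(encode_point(n))
    return x

# Example: 25 locations per color -> 2 * 25 * 4 = 200 raw input units.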

• TD-Gammon can teach itself by playing games against itself and learning from the outcome
  – Works even starting from random initial play and zero initial expert knowledge (surprising!): achieves strong intermediate play
  – Add hand-crafted features: advanced level of play (1991)
  – 2-ply search: strong master play (1993)
  – 3-ply search: superhuman play (1998)

New TD-Gammon Results! (Tesauro, 1992)

Extending TD(λ) to TDLeaf
• Checkers and Chess: 2-D geometry, 64 board locations, dozens to thousands (Deep Blue) of features, linear function approximation
• Samuel had the basic idea: train the value of the current state to match the minimax backed-up value
• Proper mathematical formulation proposed by Beal & Smith; Baxter et al.
• Baxter’s chess program KnightCap showed rapid learning in play vs. humans: 1650 → 2150 Elo in only 300 games!
• Schaeffer et al. retrained the weights of the checkers program Chinook using TDLeaf + self-play; as strong as the manually tuned weights (a 5-year effort)
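
A minimal sketch of the TDLeaf(λ) idea with a linear evaluation function: temporal differences are taken between evaluations of successive principal-variation leaves, and earlier leaf gradients receive λ-discounted credit. The search interface and feature vectors are assumed; this is not KnightCap's or Chinook's code:

import numpy as np

def tdleaf_update(w, leaf_features, alpha=1e-3, lam=0.7):
    """One offline pass of TDLeaf(lambda) over a finished game, linear evaluation.

    leaf_features[t]: feature vector phi of the principal-variation leaf found
    by minimax/alpha-beta search from the position at move t. In practice the
    last leaf evaluation is often replaced by the true game outcome.
    """
    evals = [float(np.dot(w, phi)) for phi in leaf_features]
    grads = leaf_features                 # d(eval)/dw = phi for a linear evaluator
    w = w.copy()
    for t in range(len(evals) - 1):
        delta = evals[t + 1] - evals[t]   # temporal difference between leaf evals
        # lambda-weighted credit to all earlier leaf gradients
        for k in range(t + 1):
            w += alpha * (lam ** (t - k)) * delta * grads[k]
    return w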

RL in Computer Go
• Go: 2-D geometry, 361 board locations, hundreds to millions (RLGO) of features, linear or NN function approximation
• NeuroGo (M. Enzenberger, 1996; 2003)
  – Multiple reward signals: single-point eyes, connections, and live points
  – Rating ~1880 in 9x9 Go using 3-ply α-β search
• RLGO (D. Silver, 2008) uses only primitive local features and a linear value function; it can do live, on-the-fly training for each new position encountered in a Go game!
  – Rating ~2130 in 9x9 Go using α-β search (avg. depth ~6): strongest program not based on Monte-Carlo Tree Search
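
A rough sketch of the recipe attributed to RLGO above (a linear value over binary local-shape features, trained by TD); the feature indexing and update below are simplified assumptions, not Silver's implementation:

import numpy as np

class LocalShapeValue:
    """Linear value over binary local-shape features, V(s) = sigmoid(w . phi(s)).

    active_idx is assumed to be the list of indices of the local patterns
    (e.g. all 1x1..3x3 shapes) present in a position; binary features make
    the TD update a sparse add."""

    def __init__(self, n_features, alpha=0.01):
        self.w = np.zeros(n_features)
        self.alpha = alpha

    def value(self, active_idx):
        return 1.0 / (1.0 + np.exp(-self.w[active_idx].sum()))

    def td0_update(self, active_idx, next_active_idx=None, outcome=None):
        """TD(0) step: target is V(next position), or the game outcome at the end."""
        v = self.value(active_idx)
        target = outcome if outcome is not None else self.value(next_active_idx)
        # d sigmoid / d w_i = v*(1-v) for each active binary feature i
        self.w[active_idx] += self.alpha * (target - v) * v * (1 - v)
        return v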

RL in Robotics Games

Robot Air Hockey
• video at: http://www.cns.atr.jp/~dbent/mpeg/hockeyfullsmall.avi
• D. Bentivegna & C. Atkeson, ICRA 2001
• 2-D spatial problem
• 30 degree-of-freedom arm, 420 decisions/sec
• hand-built primitives, supervised learning + RL

WoLF in Adversarial Robot Learning
• GraWoLF (Bowling & Veloso): combines the WoLF (“Win or Learn Fast”) principle with policy-gradient RL (Sutton et al., 2000)
• Again 2-D spatial geometry, 7 input features, 16 CMAC tiles
  – video at: http://webdocs.ualberta.ca/~bowling/videos/AdversarialRobotLearning.mp4
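
A small sketch of the WoLF ("Win or Learn Fast") idea that GraWoLF builds on: keep a slowly-moving average policy and take larger policy-gradient steps when the current policy appears to be losing to it. The value estimates, step sizes, and parameter handling are assumptions for illustration, not the GraWoLF algorithm itself:

import numpy as np

def wolf_step_size(value_current, value_average, delta_win=0.01, delta_lose=0.04):
    """WoLF principle: learn cautiously when winning, fast when losing.

    value_current: estimated value of the current policy against the opponent
    value_average: estimated value of the long-run average policy
    """
    return delta_win if value_current > value_average else delta_lose

def wolf_policy_gradient_step(theta, theta_avg, grad, value_current, value_average,
                              avg_rate=0.001):
    """One policy-gradient ascent step with a WoLF-modulated learning rate,
    plus a slow update that tracks the average policy parameters."""
    step = wolf_step_size(value_current, value_average)
    theta = theta + step * grad
    theta_avg = theta_avg + avg_rate * (theta - theta_avg)
    return theta, theta_avg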

RL in RoboCup Soccer
• Once again, 2-D spatial geometry
• Much good work by Peter Stone et al.
  – TPOT-RL: learned advanced team strategies given limited observability; key to CMUnited victories in the late 90s
  – Fast gait for Sony Aibo dogs
  – Ball acquisition for Sony Aibo dogs
  – Keepaway in the RoboCup simulation league

RoboCup “Keepaway” Game (Stone et al.)
• Uses the RoboCup simulator, not real robots
• Task: one team (“keepers”) tries to maintain possession of the ball as long as possible; the other team (“takers”) tries to take it away
• Keepers are trained using a continuous-time, semi-Markov version of the Sarsa algorithm
• Represent Q(s, a) using CMAC (coarse tile coding) function approximation
• State representation: small # of distances and angles between teammates, opponents, and ball
• Reward = time of possession
• Results: learned policies do much better than either random or hand-coded policies, e.g. on a 25 x 25 field:
  – learned TOP 15.0 sec, hand-coded 8.0 sec, random 6.4 sec
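
A condensed sketch of linear Sarsa with CMAC tile coding in the spirit of the Keepaway learners; the crude tile coder, the state bounds, and the discrete-time simplification of the semi-Markov Sarsa mentioned above are all assumptions made for illustration:

import numpy as np

class SimpleCMAC:
    """Very crude tile coder: several overlapping uniform grids ("tilings")
    over a box of continuous state variables (e.g. distances and angles)."""

    def __init__(self, low, high, n_tilings=16, tiles_per_dim=8):
        self.low, self.high = np.asarray(low, float), np.asarray(high, float)
        self.n_tilings, self.tiles = n_tilings, tiles_per_dim
        self.n_features = n_tilings * tiles_per_dim ** len(low)

    def active_features(self, s):
        s = (np.asarray(s, float) - self.low) / (self.high - self.low)
        idx = []
        for t in range(self.n_tilings):
            offset = t / (self.n_tilings * self.tiles)   # shift each tiling slightly
            coords = np.clip(((s + offset) * self.tiles).astype(int), 0, self.tiles - 1)
            flat = int(np.ravel_multi_index(coords, [self.tiles] * len(s)))
            idx.append(t * self.tiles ** len(s) + flat)
        return idx

class SarsaAgent:
    """Linear Sarsa: Q(s, a) = sum of the weights of the active tiles for action a."""

    def __init__(self, cmac, n_actions, alpha=0.1, gamma=1.0, epsilon=0.05):
        self.cmac, self.n_actions = cmac, n_actions
        self.w = np.zeros((n_actions, cmac.n_features))
        self.alpha, self.gamma, self.epsilon = alpha / cmac.n_tilings, gamma, epsilon

    def q(self, feats, a):
        return self.w[a, feats].sum()

    def act(self, feats):
        if np.random.rand() < self.epsilon:
            return np.random.randint(self.n_actions)
        return int(np.argmax([self.q(feats, a) for a in range(self.n_actions)]))

    def update(self, feats, a, reward, next_feats=None, next_a=None):
        """Sarsa update; reward would be elapsed possession time in Keepaway."""
        target = reward if next_feats is None else reward + self.gamma * self.q(next_feats, next_a)
        td_error = target - self.q(feats, a)
        self.w[a, feats] += self.alpha * td_error   # gradient is 1 on active tiles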

RL in Video Games

AI Fighters
• Graepel, Herbrich & Gold, 2004
  – Used the commercial game platform Tao Feng (runs on Xbox): a real-time simulator (3-D!)
  – Basic feature set + SARSA + linear value function
  – Multiple challenges of the environment (real time, concurrency, …):
    • opponent state not known exactly
    • agent state and reward not known exactly
    • due to game animation, legal moves are not known

Links to AI Fighters videos:
before training: http://research.microsoft.com/en-us/projects/mlgames2008/taofengearlyaggressive.wmv
after training: http://research.microsoft.com/en-us/projects/mlgames2008/taofenglateaggressive.wmv

Discussion / Lessons Learned
§ Winning formula:
  § hand-designed features (fairly small number)
  § aggressive, smooth function approximation
§ Researchers should try raw-input comparisons and try nonlinear function approximation
§ Many/most state variables in real problems seem pretty irrelevant
  § Opportunity to try recent linear and/or nonlinear dimensionality reduction algorithms
  § Sparsity constraints (L1 regularization, etc.) also promising
§ Brain/retina architecture impressively suited for 2-D spatial problems
  § More studies using Convolutional Neural Nets, etc.