Tic-Tac-Toe: Training a Neural Network

Outline
• Learning Method: Reinforcement Learning
• Generating training data
• Training a multilayer perceptron

Reinforcement Learning
Goal: Find the probability P(si) of winning from any state si.
Idea: After each game, update P(si) for all states si met during play.
Algorithm:
(1) Initialize all states with P(si) = 0.5.
(2) Play a game (evaluate P for all possible next states and pick the best, or make a random exploratory move).
(3) Set P(final state) to the outcome: 1 for a win, 0 for a loss.
(4) Update the intermediate states: P(si) := P(si) + a[P(sk) - P(si)], where si is the state before sk and a is the learning rate.
(5) Go back to (2).
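
A minimal sketch of steps (1), (3) and (4) in Python. The state encoding (boards as hashable tuples) and the helper name update_after_game are assumptions for illustration, not from the slides:

```python
from collections import defaultdict

# (1) every state starts with an estimated win probability of 0.5
P = defaultdict(lambda: 0.5)

def update_after_game(states, final_value, a=0.1):
    """states: sequence s_0 .. s_n visited in one game (e.g. 9-tuples);
    final_value: 1 for a win, 0 for a loss."""
    P[states[-1]] = final_value          # (3) terminal state gets the true outcome
    for i in range(len(states) - 2, -1, -1):
        # (4) P(si) := P(si) + a * [P(sk) - P(si)], where sk follows si
        P[states[i]] += a * (P[states[i + 1]] - P[states[i]])
```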

Reinforcement Learning - Convergence
If a decreases with time, all P(si) converge.
[Figure: convergence curves of P(si) for sample states 1, 2 and 3]
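
The slides do not say how a is decreased; one common schedule consistent with the convergence claim is a ~ 1/t. A self-contained toy run, where the 70% win rate and the initial rate a0 = 0.5 are made up for illustration:

```python
import random

random.seed(0)
a0, p = 0.5, 0.5                     # initial learning rate and P estimate (assumed)
for t in range(100_000):
    outcome = 1.0 if random.random() < 0.7 else 0.0   # pretend 70% of games are won
    a = a0 / (1.0 + t)                                # learning rate decays over time
    p += a * (outcome - p)                            # same update as step (4)
print(round(p, 3))                   # settles near 0.7 instead of fluctuating
```

With a fixed a, the estimate would keep jumping by a constant fraction of each new outcome; the decaying rate is what lets P(si) settle to a fixed value.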

Optimizing lookup table / Training data
Prune the lookup table by exploiting symmetry (rotations and reflections of the board): 5890 entries shrink to 825 entries.
[Figure: example symmetric states with their stored win probabilities, e.g. 0.51, 0.54, 0.58, 0.59]
Generate the training data: each sample pairs the current state with the best move as the target, pruned to 537 entries.
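
A sketch of the symmetry pruning: tic-tac-toe has 8 board symmetries (4 rotations times an optional mirror), so equivalent states can share one table entry. The encoding (X = +1, O = -1, empty = 0, boards as row-major 9-tuples) follows the MLP slide; the helper names are assumptions:

```python
def rotate(b):
    """Rotate a 3x3 board (row-major 9-tuple) 90 degrees clockwise."""
    return tuple(b[6 - 3 * (i % 3) + i // 3] for i in range(9))

def mirror(b):
    """Flip the board horizontally."""
    return tuple(b[3 * (i // 3) + (2 - i % 3)] for i in range(9))

def canonical(b):
    """Lexicographically smallest of the 8 symmetric variants of b."""
    variants = []
    for board in (b, mirror(b)):
        for _ in range(4):
            variants.append(board)
            board = rotate(board)
    return min(variants)
```

Storing and looking up P only under canonical(b) collapses all symmetric states into a single entry, which is the idea behind shrinking the table from 5890 to 825 entries.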

Training an MLP
1. Input layer: current state (9 values: X = +1, O = -1)
2. Output layer: 9 neurons, target = best move (the maximum output)
3. 1 hidden layer (9 neurons): classification rate 80%
4. 1 hidden layer (27 neurons): classification rate 93%
5. 3 hidden layers (27 neurons each): classification rate 98.5%
Result: Combine the MLP and the lookup table:
• Less memory (only the weights and a strongly reduced lookup table)
• Faster
• Can achieve perfect play
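
A sketch of the largest variant from the list (item 5): 9 inputs, three hidden layers of 27 neurons, 9 outputs, one score per square. PyTorch, the ReLU activation, the cross-entropy loss and the Adam optimizer are assumptions; the slides specify only the layer sizes:

```python
import torch
import torch.nn as nn

# 9 inputs (X = +1, O = -1, empty = 0) -> three hidden layers of 27 -> 9 outputs
model = nn.Sequential(
    nn.Linear(9, 27), nn.ReLU(),
    nn.Linear(27, 27), nn.ReLU(),
    nn.Linear(27, 27), nn.ReLU(),
    nn.Linear(27, 9),                 # one score per square; best move = argmax
)

loss_fn = nn.CrossEntropyLoss()       # target = index (0..8) of the best move
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

def train_step(boards, best_moves):
    """boards: float tensor (N, 9); best_moves: long tensor (N,) with values 0..8."""
    optimizer.zero_grad()
    loss = loss_fn(model(boards), best_moves)
    loss.backward()
    optimizer.step()
    return loss.item()
```

Trained on the 537 (state, best move) pairs, the argmax of the 9 outputs selects the move; the strongly reduced lookup table presumably covers the states the net still misclassifies, which is how the combination can play perfectly.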