Design Review • Scooby Doo gang: Jonathan Hsieh, Annie Pettengill, Jim Hollifield, Jeff Barbieri, Matt Silverstein
Design Goals • Mystery Machine Requirements: – Correctness / Proficiency – Compliance with external interface / protocol – Support an interface for a human to play – Asynchronous Decide now button
Other Goals • Priorities – Speedup over the pure-software baseline (68HC11-based implementation). • Hardware “functional units” • Software optimization – Interesting architecture – Opening / closing book (download/upload in-play configurations) – Hardware HCI (software HCI optional)
Function call dependence [Call graph figure. Functions shown: Main, Init, Think, Search, Quiesce, Gen, Gen_caps, Sort_pv, In_check, Eval, MakeMove, Takeback, plus For loop markers.]
More call dependences [Call graph figure. Functions shown: In_check, Attack, Eval, Eval_light_pawn, Eval_light_king, Eval_dark_pawn, Eval_dark_king, Make_move, Can_castle, Gen_push, Gen_promote, Takeback, plus a For loop marker.]
Read / Write access analysis • Eval: – there are no writes to the board structure (but many reads). • In_check / attack: – many reads. Returns a boolean; could instead be an array of bits to look up whether a square is being attacked • Gen: – generates lists of values. – Variable run time
Quantify parameters • A program on Sun machines • Compiles code with special hooks • Graphically displays call info and run-time info for profiling programs. • The idea -- Amdahl’s law -- speeding up the slowest parts gives the most speedup • Slowest parts: move to hardware!
Quantify Results • After about 20 moves, these functions take the most time (not including print and scanf).
Summed run-time analysis • In_check -> attack – ~55% of program run time! – Straightforward for hardware • Eval -> eval_* – ~25% of program run time! – Straightforward for hardware. • Gen* – ~15% of program run time!
Conclusion • Optimize in_check, eval, and gen by placing them in hardware • This is most effective if the board is in FPGA registers. -- Try to figure out if it is possible to use the FPGA as memory for the processor. • Keep recursion on the processor.
Hardware/software partition • Serial interface – Can access anything in memory the CPU can (in simulation) • SW / CPU – Memory structure allows for recursion / dynamic structures – Compiler can handle that – Recursion cannot really happen in parallel (?) – Should be able to access RAM as well as FPGA registers – Move histories – Recursion stacks • FPGA – Many parallel executions happening – High-speed custom implementations – Good for static structures and constants – Simple for “read”-only functions if things are read from registers (always execute!)
Implementation plan
Implementation Plan • Design hierarchy – HW/SW split • HW subsystems / goals • SW goals – Physical design • HW/SW interface • Memory access architecture • FPGA/HC 11/mem interaction.
Implementation Plan • Handle all recursion on the HC11 -- compiler and assembler code are best for memory structures (trees, hash tables, etc.) • Software analysis shows that 3 function trees (in_check/attack, eval, and gen) take the majority of the algorithm’s time.
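Shared memory architecture [Block diagram: Clk feeds the FPGA, the FPGA supplies the pseudo clock (clk) to the HC11, and the FPGA and HC11 both connect to the shared Memory.]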
Architecture features • Observation: – Memory (12 ns => 83 MHz) is about as fast as the max FPGA speed (100 MHz). – About 10x faster than the 8 MHz HC11. • Zoinks! – Clock set at the high FPGA clock speed – HC11 clock: “pseudo clock,” a function in the FPGA that slows the FPGA clock to something in the HC11 range.
Clocking Diagram [Timing diagram: FPGA clk, FPGA clk x 2, and FPGA clk x 4 (HC11 pseudo clk), with the counter for clk stepping 000, 001, 010, 011, 100, 101, 110, 111.]
Clocking Diagram [Timing diagram: FPGA clk and FPGA clk x 8 (HC11 pseudo clk), with memory slots labeled FPGA mem read/write and HC11 mem read/write.]
Pseudo clock possibilities • Could allow the processor to access memory as if it were the only thing using it. • While the processor is waiting for the next clock tick and is done with memory, the FPGA can R/W memory. • The FPGA can run and calculate information concurrently with the HC11!
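The time-sharing idea above can be sketched in software. This is only an illustrative model, not the Verilog design: the divide-by-8 ratio follows the clocking diagram, while the function name `pseudo_clock_schedule`, the 50% duty cycle, and the policy of giving the HC11 exactly one memory slot per pseudo clock period are assumptions.

```python
def pseudo_clock_schedule(fpga_ticks, divide_by=8, hc11_slot=0):
    """Model the pseudo clock as a counter driven by the fast FPGA clock.

    Returns one (counter, pseudo_clk, memory_owner) tuple per FPGA tick.
    """
    schedule = []
    for tick in range(fpga_ticks):
        counter = tick % divide_by
        # Assumed 50% duty cycle: high for the first half of each period.
        pseudo_clk = 1 if counter < divide_by // 2 else 0
        # Assumed policy: the HC11 owns the memory bus in one slot per
        # period; the FPGA may read/write memory in all other slots.
        owner = "HC11" if counter == hc11_slot else "FPGA"
        schedule.append((counter, pseudo_clk, owner))
    return schedule


if __name__ == "__main__":
    for counter, clk, owner in pseudo_clock_schedule(8):
        print(f"{counter:03b}  pseudo_clk={clk}  memory={owner}")
```

Under these assumptions the FPGA gets seven of every eight memory slots, which is what lets it compute concurrently with the HC11.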
FPGA Hardware Units [Block diagram: Pseudo Clk, HC11, Memory Bus Controller, Mem, Chess Board Registers, Chess Piece Registers, Eval Unit, Attack/Check Unit, Gen Unit, HCI.]
FPGA/Memory organization • Specific addresses would contain specific information all the time. – Board representation address – Current eval score – In-check map – Next generated moves • Addresses can be proxied by the FPGA so that FPGA registers act like memory to the HC11!
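A minimal sketch of the address proxy idea: reads and writes inside a reserved window are served from FPGA registers, everything else goes to RAM. The base address `0x8000`, the one-byte register width, and the class and register names are hypothetical and chosen only for illustration.

```python
class SharedBus:
    """Toy model of a bus controller that proxies some addresses to registers."""

    REG_BASE = 0x8000  # hypothetical start of the register window
    REG_NAMES = ("board", "eval_score", "in_check_map", "next_moves")

    def __init__(self, ram_size=0x10000):
        self.ram = bytearray(ram_size)
        self.regs = {name: 0 for name in self.REG_NAMES}

    def _reg_for(self, addr):
        # Map an address to a register name, or None if it is plain RAM.
        index = addr - self.REG_BASE
        if 0 <= index < len(self.REG_NAMES):
            return self.REG_NAMES[index]
        return None

    def read(self, addr):
        name = self._reg_for(addr)
        if name is not None:
            return self.regs[name] & 0xFF  # served by the FPGA, not RAM
        return self.ram[addr]

    def write(self, addr, value):
        name = self._reg_for(addr)
        if name is not None:
            self.regs[name] = value & 0xFF
        else:
            self.ram[addr] = value & 0xFF


bus = SharedBus()
bus.regs["eval_score"] = 42      # the eval unit updates its register
print(bus.read(0x8001))          # the HC11 sees it as an ordinary memory read
```

From the processor's side nothing changes: it issues the same load and store instructions, and the bus controller decides where the data comes from.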
Performance prediction • Baseline = 1 • Best case based on profiling (assuming hyper-idealized HW): – 55% => 0% attack – 25% => 0% eval – 15% => 0% gen • HW accelerated => 0.05 of baseline! – 20x speedup.
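The prediction above is Amdahl's law with the accelerated parts taking zero time. The sketch below reproduces that arithmetic; the function name `amdahl_speedup` is hypothetical, and the second call uses an assumed 10x per-unit speedup purely to show how sensitive the result is to the idealization.

```python
import math


def amdahl_speedup(parts):
    """Overall speedup given (fraction_of_runtime, speedup) pairs.

    A speedup of math.inf models the idealized case where that part
    of the run time drops to zero.
    """
    accelerated = sum(fraction for fraction, _ in parts)
    remaining = 1.0 - accelerated              # time left in software
    new_time = remaining + sum(f / s for f, s in parts)
    return 1.0 / new_time


profile = [0.55, 0.25, 0.15]                   # attack, eval, gen
ideal = amdahl_speedup([(f, math.inf) for f in profile])
assumed_10x = amdahl_speedup([(f, 10.0) for f in profile])
print(round(ideal, 2), round(assumed_10x, 2))
```

With infinitely fast hardware the remaining 5% of software time caps the speedup at 20x; if each unit were only 10x faster (an assumption, not a measurement), the overall gain would be roughly 6.9x.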
Attack/in_check Annie Pettengill
In_check • Input is the color of the side to check if it is in check • Outputs true if in check, otherwise outputs false
in_check • In_check looks at each of the 64 squares for the king of the color passed in to the function • It then calls attack on that square and color • If we used a pieces implementation (versus board), this would change a for loop and if statement into a single call of the attack function
In_check [Diagram: the Board Implementation repeats the king test 64 times; the Piece Implementation does it once.]
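The two in_check variants can be compared side by side. This is a sketch under assumptions: the piece codes, the `king_square` lookup, and passing `attack` in as a callable are illustrative choices, not the project's actual interfaces.

```python
LIGHT, DARK = 0, 1
KING, EMPTY = 5, 6  # hypothetical piece codes


def in_check_board(piece, color, side, attack):
    """Board implementation: scan all 64 squares to find side's king."""
    for sq in range(64):
        if piece[sq] == KING and color[sq] == side:
            return attack(sq, side ^ 1)   # is the other side attacking it?
    return False                          # no king found


def in_check_pieces(king_square, side, attack):
    """Piece implementation: the king's square is known, so one call suffices."""
    return attack(king_square[side], side ^ 1)


# Tiny demo with a stand-in attack function (an assumption for illustration).
piece = [EMPTY] * 64
color = [EMPTY] * 64
piece[4], color[4] = KING, LIGHT
piece[60], color[60] = KING, DARK
attacked = {(4, DARK)}                    # dark attacks square 4


def demo_attack(sq, by):
    return (sq, by) in attacked


print(in_check_board(piece, color, LIGHT, demo_attack))
print(in_check_pieces({LIGHT: 4, DARK: 60}, DARK, demo_attack))
```

The piece-indexed version removes the loop and the comparison entirely, which is what makes it attractive for a hardware lookup.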
Attack • Inputs the square the piece is on and the color of the other side • Outputs true if the square is being attacked by the color s and false if it is not
Details about Attack • The pawn is looked at separately because the way it moves is different from the way it attacks • The moves as organized now are different for black and white pawns • The different pieces are evaluated for every direction, to see whether they can actually move there and whether they can slide
Bigger Picture : Attack Tables • Construct two chess boards: one for white pieces and one for black • Instead of a piece, each square would contain a true or false depending on whether the square was being attacked for that color piece
So… • Every time a player makes a move, the attack function on the FPGA rebuilds the table • Can tailor the attack function for specific pieces in specific squares using a combination of board and piece implementation
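One way to picture an attack table is as the set of squares a side attacks, rebuilt from scratch after each move. The sketch below is a sequential software model under assumptions: the board is a dict keyed by (file, rank), light pawns advance toward higher ranks, and all names are hypothetical. The hardware version would evaluate the squares in parallel.

```python
LIGHT, DARK = 0, 1
ROOK_DIRS = [(1, 0), (-1, 0), (0, 1), (0, -1)]
BISHOP_DIRS = [(1, 1), (1, -1), (-1, 1), (-1, -1)]
KNIGHT_STEPS = [(1, 2), (2, 1), (2, -1), (1, -2),
                (-1, -2), (-2, -1), (-2, 1), (-1, 2)]


def attack_steps(kind, side):
    """Return (steps, slides) for a piece kind in 'PNBRQK'."""
    if kind == "P":
        # Pawns attack diagonally, which differs from how they move.
        forward = 1 if side == LIGHT else -1
        return [(-1, forward), (1, forward)], False
    if kind == "N":
        return KNIGHT_STEPS, False
    if kind == "K":
        return ROOK_DIRS + BISHOP_DIRS, False
    if kind == "B":
        return BISHOP_DIRS, True
    if kind == "R":
        return ROOK_DIRS, True
    return ROOK_DIRS + BISHOP_DIRS, True      # queen


def build_attack_table(board, side):
    """board: dict (file, rank) -> (color, kind). Returns attacked squares."""
    attacked = set()
    for (f, r), (color, kind) in board.items():
        if color != side:
            continue
        steps, slides = attack_steps(kind, side)
        for df, dr in steps:
            nf, nr = f + df, r + dr
            while 0 <= nf < 8 and 0 <= nr < 8:
                attacked.add((nf, nr))
                # Non-sliders stop after one step; sliders stop at a blocker.
                if not slides or (nf, nr) in board:
                    break
                nf, nr = nf + df, nr + dr
    return attacked


board = {(0, 0): (LIGHT, "R"), (0, 3): (LIGHT, "P"), (4, 7): (DARK, "K")}
light_table = build_attack_table(board, LIGHT)
dark_table = build_attack_table(board, DARK)
print((1, 4) in light_table, (0, 4) in light_table)
```

Once the two tables exist, in_check reduces to a single lookup of the king's square in the opponent's table.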
Implementation of Attack Tables
Advantages • Avoid lots of useless searching; you know exactly where each piece is with the piece implementation • If running attack on one square, why not on 128 squares in parallel? – or perhaps use a piece implementation of the attack table and only run it on 32 squares… • Acts as a lookup table for other functions
Perimeter versus Internal • Use special tailoring for perimeter squares that only checks for pertinent cases – use a special numbering scheme for the location of pieces in the piece implementation • Internal squares stay the same
Eval Matthew Silverstein
Eval() Function Inputs • What side the current move is for – Light or Dark • Board Configuration – Algorithm uses two 64-space arrays to represent the board • Which pieces are where (piece[64] structure) • What color the pieces are (color[64] structure) – The hardware overhead can be cut by using an array that is indexed by piece, not by position
Eval() Function Outputs • Score – an integer value – based on the present configuration of the board – Calibrated for whether the current player is Light or Dark
Eval • The function call breaks down into three main subsections – Initialization • Takes the current board configuration and sets all of the internal registers to an appropriate value • Sets up the pawn_rank, pawn_mat, and pawn_count structures
– Bonus / Penalty assess • For each square, calculates either a bonus or a penalty, based upon the relative benefit of certain pieces being on that square • Sums the results for each square and provides a Light_score and a Dark_score – Calculate score • Combines the Light and Dark scores to provide a single return value for the function, based on whether it is presently Light’s or Dark’s turn.
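The three stages (initialize, assess per square, combine) can be sketched as follows. The piece codes, material values, and the optional `square_bonus` callback are assumptions for illustration; the real function also sets up the pawn structures, which this sketch omits.

```python
LIGHT, DARK, EMPTY = 0, 1, 6
PAWN, KNIGHT, BISHOP, ROOK, QUEEN, KING = range(6)
PIECE_VALUE = [100, 300, 300, 500, 900, 0]   # assumed material values


def evaluate(piece, color, side, square_bonus=None):
    """Return a score from the point of view of `side`."""
    # Initialization: one running total per color.
    score = [0, 0]
    # Bonus / penalty assess: visit every square and accumulate.
    for sq in range(64):
        if color[sq] == EMPTY:
            continue
        score[color[sq]] += PIECE_VALUE[piece[sq]]
        if square_bonus is not None:
            score[color[sq]] += square_bonus(piece[sq], color[sq], sq)
    # Calculate score: sign depends on whose turn it is.
    if side == LIGHT:
        return score[LIGHT] - score[DARK]
    return score[DARK] - score[LIGHT]


piece = [EMPTY] * 64
color = [EMPTY] * 64
piece[3], color[3] = QUEEN, LIGHT
piece[60], color[60] = ROOK, DARK
print(evaluate(piece, color, LIGHT), evaluate(piece, color, DARK))
```

Because the loop only reads the board, every square's contribution is independent, which is why the summation maps naturally onto parallel hardware feeding an adder.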
Eval structure [Block diagram with blocks: Board registers, Init, Bonus and penalty cases, Light or dark, Calculate.]
Bonus / Penalty Assess structure [Block diagram: one mux for each input square selects among Eval_pawn_sq, Eval_king_sq, Knight penalty, Bishop penalty, Rook penalty, and No penalty; an adder adds the values generated at each block to produce Score_light.]
Eval_pawn • Inputs: – Square to calculate penalty for – Pawn_rank structure – Pawn_count structure • Based on the inputs there is a possibility of assessing up to four different penalties
Eval_pawn Penalties, Bonuses • Penalty A: if there’s a friendly pawn behind this one (doubled) • Penalty B: if there are no friendly pawns adjacent to the current pawn (isolated) • Penalty C: if the pawn is not isolated, it may still be backward • Bonus D: if the pawn is passed
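A rough sketch of the four pawn cases follows. It reads Penalty A as a doubled pawn, B as isolated, C as backward, and D as a passed-pawn bonus, which is an interpretation of the slide. The set-based pawn representation, the penalty and bonus values, and the exact backward and passed tests are assumptions; the design itself uses the pawn_rank and pawn_count structures.

```python
def eval_pawn(file, rank, own, enemy, penalties=(10, 20, 8), passed_bonus=20):
    """Score one pawn.

    own / enemy are sets of (file, rank) pawn positions, with ranks
    normalized so that a higher rank is closer to promotion.
    """
    doubled, isolated, backward = penalties   # hypothetical values
    score = 0
    # Penalty A: a friendly pawn sits behind this one on the same file.
    if any(f == file and r < rank for f, r in own):
        score -= doubled
    neighbours = [r for f, r in own if abs(f - file) == 1]
    if not neighbours:
        # Penalty B: no friendly pawns on the adjacent files.
        score -= isolated
    elif all(r > rank for r in neighbours):
        # Penalty C: not isolated, but all neighbours are further advanced.
        score -= backward
    # Bonus D: no enemy pawn ahead on this file or the adjacent files.
    if not any(abs(f - file) <= 1 and r > rank for f, r in enemy):
        score += passed_bonus
    return score


print(eval_pawn(3, 3, {(3, 3), (3, 1)}, {(4, 5)}))   # doubled and isolated
print(eval_pawn(3, 2, {(2, 4), (3, 2), (4, 4)}, set()))  # backward but passed
```

Each test reads only a few values and none depends on another's result, so the "control logic plus muxes plus adder" layout in the structure diagram follows directly.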
Eval_pawn structure [Block diagram: “control logic” driven by Square, pawn_rank, and pawn_count steers muxes that each select 0 or one of Penalty A, Penalty B, Penalty C, Penalty D; an adder sums the selected values into the pawn penalty.]
Eval_king • Inputs (same as eval_pawn): – Square to calculate penalty for – Pawn_rank structure – Pawn_count structure • The function returns a penalty value that is adjusted depending on how well shielded the king is by its own pawns
Eval_king Penalties • The File A, B, C, F, G, and H Penalties – These penalties are assessed when there is no pawn in File, one row away from the king. – The magnitude of the penalty is dependent on the distance in the row the pawn is from the king • The pawn attack Penalty – This penalty is assessed if the enemy's pawns have advanced too far down the board towards the king
Eval_king structure [Block diagram: control logic driven by pawn_rank and Pawn_count steers a mux that selects among No penalty, the File A, B, C, F, G, and H penalties, and the Pawn Approach penalties; an adder sums the selected values.]
Bonus Penalty Assess Structure • Switching from a position to a piece representation of the board • No longer need to repeat the mux 64 times • Adder now has 16 inputs, one for each piece (vs. 64 inputs). • Knight, Bishop, and Rook are still straight table lookups • Pawn_eval is repeated 8 times – Still better than 64 times
Gen Jim Hollifield
gen() function • Searches through all 64 spaces • Skips empty spaces and opponent pieces • Creates all possible moves for each friendly piece • Pushes them (with helper function gen_push()) onto move_stack
Possible Moves • Basic Pawn Moves – Move forward 1 or 2 spaces – Take Left or Right • Non-pawn Piece (N, B, R, Q, K) moves – B, R, Q can slide (move more than one space), but stops when another piece is blocking path • Castle (King or Queen side) • En Passant
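For Pawns [Flowchart: a Light pawn can Take Left, Take Right, or Move Forward 1 Space. A Dark pawn is the same as Light, but reversed (Move Backward 1 Space).]
For Non-Pawn Pieces [Flowchart: Is the space empty? If yes, does the piece “slide”? If yes, move to the next square in the current direction; if no, move to the next direction. If the space is not empty, is the piece friendly? If no, take the piece; either way, move to the next direction. At the edge of the board, move to the next direction. When there are no more directions, move to the next piece.]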
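The non-pawn flowchart translates into a short nested loop. This is a sketch only: the dict-based board keyed by (file, rank), the `PIECE_RULES` table, and the (from, to, is_capture) move tuple are assumptions made for illustration.

```python
ROOK_DIRS = [(1, 0), (-1, 0), (0, 1), (0, -1)]
BISHOP_DIRS = [(1, 1), (1, -1), (-1, 1), (-1, -1)]
KNIGHT_STEPS = [(1, 2), (2, 1), (2, -1), (1, -2),
                (-1, -2), (-2, -1), (-2, 1), (-1, 2)]
# kind -> (directions, slides)
PIECE_RULES = {
    "N": (KNIGHT_STEPS, False),
    "B": (BISHOP_DIRS, True),
    "R": (ROOK_DIRS, True),
    "Q": (ROOK_DIRS + BISHOP_DIRS, True),
    "K": (ROOK_DIRS + BISHOP_DIRS, False),
}


def gen_piece_moves(board, frm, side):
    """board: dict (file, rank) -> (color, kind). Returns a list of moves."""
    _, kind = board[frm]
    directions, slides = PIECE_RULES[kind]
    moves = []
    for df, dr in directions:                  # "move to next direction"
        f, r = frm
        while True:
            f, r = f + df, r + dr
            if not (0 <= f < 8 and 0 <= r < 8):
                break                          # edge of board
            target = board.get((f, r))
            if target is None:                 # space is empty
                moves.append((frm, (f, r), False))
                if not slides:
                    break
                continue                       # next square, same direction
            if target[0] != side:              # not friendly: take piece
                moves.append((frm, (f, r), True))
            break                              # blocked either way
    return moves


board = {(0, 0): (0, "R"), (0, 2): (1, "P"), (3, 0): (0, "N")}
rook_moves = gen_piece_moves(board, (0, 0), 0)
knight_moves = gen_piece_moves(board, (3, 0), 0)
print(len(rook_moves), len(knight_moves))
```

The variable number of loop iterations per piece is what gives gen its variable run time, and it is also why the hardware version is organized as a state machine rather than pure combinational logic.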
Other Functions • gen_caps() – Same as gen(), except only checks for capture moves – called by quiesce() • gen_push() – Pushes moves from gen and gen_caps onto move_stack • gen_promote() – Pushes pawn promote move onto move_stack – One move for each possible piece (Q, B, R, or N)
move_stack [Diagram: the stack before Ply 0, after Ply 0 (before Ply 1), and after Ply 1 (before Ply 2). Each ply's moves (move 0, move 1, move 2, ... move n) are stacked on top of the previous ply's moves and are delimited by gen_begin and gen_end.]
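The stack layout in the diagram can be modeled with one shared list plus per-ply gen_begin and gen_end indices. The class below is a sketch; the class name, the `max_ply` limit, and the use of plain strings as moves are assumptions.

```python
class MoveStack:
    """One shared move list; each ply owns the slice gen_begin..gen_end."""

    def __init__(self, max_ply=32):
        self.moves = []
        self.gen_begin = [0] * max_ply
        self.gen_end = [0] * max_ply

    def generate(self, ply, new_moves):
        """Store the moves generated at `ply` on top of earlier plies."""
        start = self.gen_begin[ply]
        del self.moves[start:]            # drop stale moves from deeper plies
        for move in new_moves:
            self.moves.append(move)       # the role played by gen_push()
        self.gen_end[ply] = len(self.moves)
        if ply + 1 < len(self.gen_begin):
            # The next ply starts where this one ends.
            self.gen_begin[ply + 1] = self.gen_end[ply]

    def moves_at(self, ply):
        return self.moves[self.gen_begin[ply]:self.gen_end[ply]]


stack = MoveStack()
stack.generate(0, ["a", "b", "c"])        # after Ply 0
stack.generate(1, ["d", "e"])             # after Ply 1
print(stack.moves_at(0), stack.moves_at(1))
```

Because the moves of earlier plies stay in place while deeper plies are searched, takeback needs no copying: returning to ply 0 simply reuses its existing slice.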
HW/SW Breakdown • FPGA puts moves into stack structure – Currently done by gen(), gen_caps(), gen_push(), and gen_promote() functions • HC11 sorts stack structure – Currently done by sort() and sort_pv() functions
gen() Hardware [Block diagram: Pieces and Board feed the Move Generator (FSM), which passes Moves to the Pusher; the Pusher uses current_ply, gen_begin, and gen_end to write the Stack (in Shared Memory).]
Generator FSM [State diagram: from Reset the FSM steps through one group of states per piece type and ends at DONE. Pawn (x 8): P take Left, P take Right, P move 1, P move 2. Knight (x 2): N move 0 through N move 7. Bishop (x 2): B move Forward Left, Forward Right, Back Left, Back Right. Rook (x 2): R move Forward, Back, Left, Right. Queen: Q move Forward, Back, Left, Right, Forward Left, Forward Right, Back Left, Back Right. King: K move Forward, Left, Right, Forward Left, Forward Right, Back Left. Special: Castle K side, Castle Q side, En Passant Left, En Passant Right. Sliding-move states repeat x 7. Legend: * = moves skipped by gen_caps(); # = moves checked by gen_promote().]
HCI Jeff Barbieri
Possible Ideas • Interface Design #1 – 8 x 8 Bar LED board on left displaying the piece that is in each location (period is black/white) – DIP Switches to select the from/to for a move – “Clock” to make the move • Interface Design #2 – 8 x 8 Bar LED board on left displaying the piece that is in each location (period is black/white) – Button in each square next to the Bar LED to select the from and then to for a move – “Clock” to make the move
Interface Design Idea #1 [Diagram: From DIP and To DIP switches, Latches & Other Logic, Make Move button, Ribbon Cable Connection.]
Interface Design Idea #2 [Diagram: Make Move button, Ribbon Cable Connection.]
Other Ideas • Beep to signify illegal move • Touch-screen for the board • LEDs to signify which player’s move it is • Alternate board layouts (many of these)
Considerations • Selection of parts – Numeric, Alphanumeric, LCD, etc. LEDs – Push buttons – Latches • Costs for parts – Number of FPGA pins needed – Time to wire-wrap board – $$$ for parts
Considerations • Feasibility of design • Design of board in relation to design of chess game
Summary • Many possibilities • Two basic likely designs • Lots of thought and planning needs to go into design before acquiring parts and building
Software optimizations Jonathan Hsieh
Software optimizations • All that remains is: – sort 2.49% -> currently a O() alg, can change to be a heap (log n / constant) – makemove 1.15% -> these will be slower – takeback 0.88% -> these will be slower – quiesce 0.38% -> will probably go up – search 0.19%
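The heap idea for sort can be sketched with Python's standard `heapq` module: build the heap once, then pop the best remaining move in logarithmic time each time the search asks for one. The function name `best_first` and the (score, move) pair format are assumptions; the slide does not state the current algorithm's complexity, so no comparison figure is given here.

```python
import heapq


def best_first(moves):
    """Yield (score, move) pairs from highest to lowest score.

    moves: iterable of (score, move) pairs for one ply.
    """
    # heapq is a min-heap, so negate the score; the index breaks ties
    # in generation order and keeps moves themselves from being compared.
    heap = [(-score, index, move) for index, (score, move) in enumerate(moves)]
    heapq.heapify(heap)                    # linear-time build
    while heap:
        neg_score, _, move = heapq.heappop(heap)   # logarithmic per move
        yield -neg_score, move


ordered = list(best_first([(10, "a"), (50, "b"), (30, "c")]))
print(ordered)
```

A generator fits the search loop well: when an early move causes a cutoff, the remaining moves are never popped, so their ordering cost is never paid.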
Algorithm improvements • search – Killer heuristic (search attack branches first; already implemented) – Think on opponent’s time (multithreading/interrupts! Do it) – History heuristic (built in already?) – Tighter searches (could be implemented) – Refutation tables (not sure what they are) – Transposition tables (not sure what they are)
• Eval – Pawn formation hash table (not needed, in HW) – King safety hash table (not needed, in HW) • Jinkies! – Can probably squeeze out another 20% reduction: 0.05 => 0.04. • Idealized speedup target = 25x!
Integration plan [Flow diagram. Stages shown: Software/Profiling, Baseline Stats, HW/SW Partitioning, HW/SW Interface, SW modification / optimization, FPGA: Attack, FPGA: Eval, FPGA: Gen, FPGA: HCI, CoSim, Physical Design, Integration/Debugging, Optimizations.]
Division of labor If it weren’t for those meddling kids!
Jon • Group leader – prevent him from dropping the class • Software guru (algorithm, HC11, HiWare) – strong software background – hates wires • Overall system design – some HW/SW partitioning experience (research).
Jim • move generation hardware considerations – when wiring requirements dried up, we moved from individual projects to pairs -- volunteered to put thought into and implement • Wirewrap Whiz (Physical Interfacing)/ FPGA interface Whiz – doesn’t care what he does and didn’t volunteer for a verilog job at first
Jeff • Project management software – Experience with lots of MS software. • HCI Hardware – display, inputs, verilog, (beeper? ) and (wiring? )
Matt • Verilog Whiz – eval function design • wanted it and did a good job with it. • Memory Master – got FPGA demo 0 mem->fpga->mem proof working.
Annie • Soldering / Wire wrap Queen – wire wrapped everything a lot faster than Jim • HC11 hardware interface – got that portion of demo 0 to work. • Attack function design.
Demonstration Plan • Demo 1 (week of 10/4) – Stats on baseline chess algorithm. – HW/SW partitioning and interfacing method. – Details about HW sub systems • eval / attack / gen / hci – (Co)simulation of separate parts of HW/SW partitioning
• Demo 2 (Week of 11/1) – Frozen Physical Hardware – Co Simulation working with hw/sw – Chess that works and communicates. – Preliminary stats on new design – Optimizing / Debugging process
• Final Demo 11/29 – Optimizations and speedup statistics. – PC interface GUI (depending on interface)
Demo 1 Work Schedule • 9/17 F. Demo 0 completion • 9/20 M. Internal design review • 9/22 W. HW/SW partitioning details • 9/23 R. Design review • 9/29 W. HW/SW interfacing resolved • 10/4 M. Verilog simulations for FPGA stuff.
Demo 2 Work Schedule • 10/11 M. HW/SW integration/Cosimulation • 10/18 M Physical Hardware frozen • 10/25 Algorithm Optimizations • 11/1 Clock speed optimizations
Final Demo • Final review ready. Add more bells and whistles • Have one month for unpredicted delays.