Genetic Programming on General Purpose Graphics Processing Units


Genetic Programming on General Purpose Graphics Processing Units (GPGPGPU)
Muhammad Iqbal
Evolutionary Computation Research Group, School of Engineering and Computer Sciences


Overview
• Graphics Processing Units (GPUs) are no longer limited to graphics rendering:
  • High degree of programmability
  • Fast floating-point operations
  • GPUs are now GPGPUs
• Genetic programming is a computationally intensive methodology, making it a prime candidate for GPUs.


Outline
• Genetic Programming
• Resource Demands
• GPU Programming
• Genetic Programming on GPU
• Automatically Defined Functions


Genetic Programming (GP)
• An evolutionary-algorithm-based methodology
• Optimizes a population of computer programs
• Tree-based representation
• Example fitness cases:

  X:      0  1  2  3
  Output: 1  3  7  13

  (These pairs are consistent with the target function x² + x + 1.)
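As an illustrative sketch (this exact tree is not shown in the transcribed slides), x² + x + 1 would be represented in GP as an expression tree, written here in prefix form:

  (+ (* X X) (+ X 1))

The root + node has two subtrees: (* X X) computes x² and (+ X 1) computes x + 1.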


GP Resource Demands
• GP is notoriously resource-consuming:
  • CPU cycles
  • Memory
• Standard GP system, 1 μs per node:
  • Binary trees of depth 17: 131 ms per tree
  • Fitness cases: 1,000; population size: 1,000
  • Generations: 1,000; number of runs: 100
  • Runtime: 10 Gs ≈ 317 years
• Standard GP system, 1 ns per node:
  • Runtime: 116 days
• This limits what we can approach with GP. [Banzhaf and Harding, GECCO 2009]
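To spell out the arithmetic behind these figures:

  A full binary tree of depth 17 has 2^17 − 1 = 131,071 nodes, so at 1 μs per node
  one evaluation takes about 131 ms.
  131 ms × 1,000 fitness cases × 1,000 individuals × 1,000 generations × 100 runs
  ≈ 10^10 s (10 Gs), and 10^10 s ÷ (3.15 × 10^7 s/year) ≈ 317 years.
  At 1 ns per node the same workload is about 10^7 s ≈ 116 days.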


Sources of Speed-up
• Fast machines
• Vector processors
• Parallel machines (MIMD/SIMD)
• Clusters
• Loose networks
• Multi-core
• Graphics Processing Units (GPUs)


Why is the GPU Faster than the CPU?
The GPU devotes more transistors to data processing. [CUDA C Programming Guide, Version 3.2]


GPU Programming APIs
• A number of toolkits are available for programming GPUs:
  • CUDA
  • MS Accelerator
  • RapidMind
  • Shader programming
• So far, researchers in GP have not converged on one platform.

CUDA Programming
• Massive numbers (>10,000) of lightweight threads.



CUDA Memory Model
CUDA exposes all the different types of memory on the GPU:
• Registers and local memory: private to each thread
• Shared memory: one copy per thread block, visible to all of its threads
• Global, constant, and texture memory: accessible to every thread in the grid and to the host
[Figure: device grid of thread blocks and the memory hierarchy. CUDA C Programming Guide, Version 3.2]
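A minimal sketch of how these memory spaces appear in CUDA C (the kernel and variable names are illustrative, not from the slides; assumes blocks of 256 threads):

  __constant__ float coeffs[16];                     // constant memory, written by the host

  __global__ void memoryDemo(const float *in, float *out)  // in/out point to global memory
  {
      __shared__ float tile[256];                    // shared memory: one copy per block

      int i = blockIdx.x * blockDim.x + threadIdx.x; // i lives in a register

      tile[threadIdx.x] = in[i];                     // global -> shared
      __syncthreads();                               // wait until the whole block has loaded

      out[i] = tile[threadIdx.x] * coeffs[threadIdx.x % 16];
  }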


CUDA Programming Model
• The GPU is viewed as a computing device operating as a coprocessor to the main CPU (host).
• Data-parallel, computationally intensive functions should be off-loaded to the device.
• Functions that are executed many times, but independently on different data, are prime candidates, e.g. the body of a for-loop.
• A function compiled for the device is called a kernel.
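To illustrate that last point (a sketch with hypothetical names, not code from the slides): a sequential loop body on the host becomes a kernel in which each thread handles one iteration.

  // Host version: the loop body runs n times sequentially.
  void scaleOnHost(float *a, float s, int n)
  {
      for (int i = 0; i < n; ++i)
          a[i] = a[i] * s;
  }

  // Device version: each thread executes one "iteration" of the loop body.
  __global__ void scaleOnDevice(float *a, float s, int n)
  {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n)                 // guard: total thread count may exceed n
          a[i] = a[i] * s;
  }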



Stop Thinking About What to Do and Start Doing It!
• Memory transfer time is expensive.
• Computation is cheap.
• Don't calculate once and store in memory; just recalculate.
• Built-in variables:
  • threadIdx
  • blockIdx
  • gridDim
  • blockDim

Example: Increment Array Elements

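The original slide showed this example as an image. A self-contained reconstruction of the standard increment-array pattern (my sketch, not the slide's exact code):

  #include <cstdio>

  __global__ void incrementElements(float *a, float b, int n)
  {
      int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
      if (i < n)
          a[i] = a[i] + b;
  }

  int main()
  {
      const int n = 1024;
      float host[n];
      for (int i = 0; i < n; ++i) host[i] = (float)i;

      float *dev;
      cudaMalloc(&dev, n * sizeof(float));
      cudaMemcpy(dev, host, n * sizeof(float), cudaMemcpyHostToDevice);

      incrementElements<<<4, 256>>>(dev, 1.0f, n);    // 4 blocks x 256 threads = 1024

      cudaMemcpy(host, dev, n * sizeof(float), cudaMemcpyDeviceToHost);
      cudaFree(dev);

      printf("host[10] = %f\n", host[10]);            // expect 11.0
      return 0;
  }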

Example: Matrix Addition

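These slides also showed code as images. A plausible reconstruction of the usual 2D matrix-addition kernel (dimensions and names are my own):

  __global__ void matrixAdd(const float *a, const float *b, float *c,
                            int width, int height)
  {
      int col = blockIdx.x * blockDim.x + threadIdx.x;
      int row = blockIdx.y * blockDim.y + threadIdx.y;
      if (col < width && row < height) {
          int idx = row * width + col;   // row-major layout
          c[idx] = a[idx] + b[idx];
      }
  }

  // Launched with a 2D grid of 2D blocks, e.g. for a 1024 x 1024 matrix:
  //   dim3 block(16, 16);
  //   dim3 grid((1024 + 15) / 16, (1024 + 15) / 16);
  //   matrixAdd<<<grid, block>>>(d_a, d_b, d_c, 1024, 1024);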


Parallel Genetic Programming
While most GP work is conducted on sequential computers, the following computationally intensive features make it well suited to parallel hardware:
• Individuals are run on multiple independent training examples.
• The fitness of each individual could be calculated on independent hardware in parallel.
• Multiple independent runs of the GP are needed for statistical confidence, given the stochastic nature of the results.
[Langdon and Banzhaf, EuroGP-2008]


A Many-Threaded CUDA Interpreter for Genetic Programming
• Runs tree GP on the GPU 8692 times faster than a PC without a GPU.
• Solved the 20-bit multiplexor:
  • 2^20 = 1,048,576 fitness cases
  • Never solved by tree GP before
  • Previously estimated time: more than 4 years
  • The GPU has consistently done it in less than an hour
• Solved the 37-bit multiplexor:
  • 2^37 = 137,438,953,472 fitness cases
  • Never attempted before
  • The GPU solves it in under a day
[W. B. Langdon, EuroGP-2010]
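The following is only a schematic of how such an interpreter can be organised, assuming postfix-encoded programs and one thread per fitness case; it is my reconstruction, not Langdon's actual code:

  // Hypothetical opcodes for a Boolean GP interpreter.
  enum Op { OP_AND, OP_OR, OP_NAND, OP_NOR, OP_INPUT };

  // Every thread interprets the same program on a different fitness case.
  __global__ void evalProgram(const unsigned char *ops,   // postfix opcode stream
                              const unsigned char *args,  // input index for OP_INPUT
                              int progLen,
                              const unsigned char *cases, // nInputs bits per case
                              unsigned char *result,      // one output bit per case
                              int nInputs, int nCases)
  {
      int c = blockIdx.x * blockDim.x + threadIdx.x;
      if (c >= nCases) return;

      unsigned char stack[64];   // ample for trees of depth <= 15
      int sp = 0;

      for (int pc = 0; pc < progLen; ++pc) {
          switch (ops[pc]) {
          case OP_INPUT: stack[sp++] = cases[c * nInputs + args[pc]];   break;
          case OP_AND:  --sp; stack[sp-1] =  (stack[sp-1] & stack[sp]); break;
          case OP_OR:   --sp; stack[sp-1] =  (stack[sp-1] | stack[sp]); break;
          case OP_NAND: --sp; stack[sp-1] = !(stack[sp-1] & stack[sp]); break;
          case OP_NOR:  --sp; stack[sp-1] = !(stack[sp-1] | stack[sp]); break;
          }
      }
      result[c] = stack[0];      // final stack top is the program's output
  }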


Boolean Multiplexor
• a address bits select one of d = 2^a data bits, so a multiplexor has n = a + d inputs.
• Number of test cases = 2^n.
• 20-mux (a = 4, d = 16): 2^20 ≈ 1 million test cases
• 37-mux (a = 5, d = 32): 2^37 ≈ 137 billion test cases
[W. B. Langdon, EuroGP-2010]
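For concreteness, a sketch of the multiplexor target function itself (my own helper, not from the paper): the a address bits, read as an integer, select which data bit becomes the output.

  // Mux output for an n-bit input x: bits 0..a-1 are the address,
  // bits a..a+d-1 are the data (d = 2^a, n = a + d).
  __host__ __device__ int multiplexor(unsigned long long x, int a)
  {
      unsigned long long addr = x & ((1ULL << a) - 1);  // low a bits: the address
      return (int)((x >> (a + addr)) & 1ULL);           // the selected data bit
  }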


Genetic Programming Parameters for Solving the 20- and 37-Multiplexors

Terminals:          20 or 37 Boolean inputs, D0–D19 or D0–D36 respectively
Functions:          AND, OR, NAND, NOR
Fitness:            Pseudo-random sample of 2048 of 1,048,576 (20-mux) or
                    8192 of 137,438,953,472 (37-mux) fitness cases
Tournament:         4 members, run on the same random sample; new samples for
                    each tournament and each generation
Population:         262,144
Initial population: Ramped half-and-half 4:5 (20-mux) or 5:7 (37-mux)
Parameters:         50% subtree crossover, 5% subtree mutation, 45% point mutation;
                    max depth 15, max size 511 (20-mux) or 1023 (37-mux)
Termination:        5000 generations

Solutions were found in generations 423 (20-mux) and 2866 (37-mux).
[W. B. Langdon, EuroGP-2010]


AND, OR, NAND, NOR
Symbols used: AND: &, OR: |, NAND: d, NOR: r

  X Y   X&Y  X|Y  XdY  XrY
  0 0    0    0    1    1
  0 1    0    1    1    0
  1 0    0    1    1    0
  1 1    1    1    0    0

[Figure: evolution of the 20-mux and 37-mux runs. W. B. Langdon, EuroGP-2010]

[Figure: 6-mux tree I. W. B. Langdon, EuroGP-2010]

[Figure: 6-mux tree II. W. B. Langdon, EuroGP-2010]

[Figure: 6-mux tree III. W. B. Langdon, EuroGP-2010]

[Figure: ideal 6-mux tree]


Automatically Defined Functions (ADFs)
• Genetic programming trees often contain repeated patterns.
• Repeated subtrees can be treated as subroutines.
• ADFs are a methodology to automatically select and implement modularity in GP.
• This modularity can:
  • Reduce the size of the GP tree
  • Improve readability
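As an illustrative sketch (this particular tree is not from the slides), a repeated subtree can be factored into an ADF that the main tree calls twice:

  Before: (AND (OR D0 D1) (NAND (OR D0 D1) D2))

  After:  ADF0(x, y) = (OR x y)
          Main tree:   (AND ADF0(D0, D1) (NAND ADF0(D0, D1) D2))

The repeated subtree (OR D0 D1) is defined once and reused, shrinking the main tree.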


Langdon's CUDA Interpreter with ADFs
• ADFs slow the interpreter down:
  • 20-mux takes 9 hours instead of less than an hour
  • 37-mux takes more than 3 days instead of less than a day
• Improved ADF implementation:
  • Previously one thread per GP program; now one thread block per GP program
  • Increased level of parallelism
  • Reduced divergence
  • 20-mux now takes 8 to 15 minutes
  • 37-mux now takes 7 to 10 hours
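A schematic of the block-per-program mapping, reusing the opcode scheme from the earlier interpreter sketch (again my reconstruction, not Langdon's code): each block interprets one program while its threads stride over the fitness cases, so every thread in a warp follows the same opcode stream and the interpreter's switch no longer diverges.

  enum Op { OP_AND, OP_OR, OP_NAND, OP_NOR, OP_INPUT };

  __global__ void evalPopulation(const unsigned char *ops, const unsigned char *args,
                                 const int *start, const int *len,  // per-program offsets
                                 const unsigned char *cases,
                                 const unsigned char *target,       // expected outputs
                                 int *hits, int nInputs, int nCases)
  {
      int p = blockIdx.x;                        // one block = one program
      const unsigned char *o = ops  + start[p];
      const unsigned char *a = args + start[p];

      // Threads stride over the fitness cases of the SAME program.
      for (int c = threadIdx.x; c < nCases; c += blockDim.x) {
          const unsigned char *in = cases + c * nInputs;
          unsigned char s[64]; int sp = 0;
          for (int pc = 0; pc < len[p]; ++pc) {
              switch (o[pc]) {
              case OP_INPUT: s[sp++] = in[a[pc]];               break;
              case OP_AND:  --sp; s[sp-1] =  (s[sp-1] & s[sp]); break;
              case OP_OR:   --sp; s[sp-1] =  (s[sp-1] | s[sp]); break;
              case OP_NAND: --sp; s[sp-1] = !(s[sp-1] & s[sp]); break;
              case OP_NOR:  --sp; s[sp-1] = !(s[sp-1] | s[sp]); break;
              }
          }
          if (s[0] == target[c]) atomicAdd(&hits[p], 1);        // count this hit
      }
  }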

[Figures: 6-mux trees evolved with ADFs]


Conclusion 1: GP
• A powerful machine learning algorithm
• Capable of searching through trillions of states to find a solution
• Trees often have repeated patterns and can be compacted by ADFs
• But computationally expensive


Conclusion 2: GPU
• Computationally fast
• Relatively low cost
• Requires a new, but practical, programming paradigm
• Accelerates processing by up to 3000 times for computationally intensive problems
• But not well suited to memory-intensive problems


Acknowledgements
• Dr Will Browne and Dr Mengjie Zhang for supervision.
• Kevin Buckley for technical support.
• Eric for helping with CUDA compilation.
• Victoria University of Wellington for awarding the Victoria PhD Scholarship.
• All of you for coming.

Thank You. Questions?
