Genetic Programming on General Purpose Graphics Processing Units
Genetic Programming on General Purpose Graphics Processing Units (GPGPGPU)
Muhammad Iqbal, Evolutionary Computation Research Group, School of Engineering and Computer Sciences
Overview
• Graphics Processing Units (GPUs) are no longer limited to graphics:
  • High degree of programmability
  • Fast floating-point operations
• GPUs are now GPGPUs.
• Genetic programming is a computationally intensive methodology, so it is a prime candidate for GPUs.
Outline
• Genetic Programming
• GP Resource Demands
• GPU Programming
• Genetic Programming on GPU
• Automatically Defined Functions
Genetic Programming (GP)
• An evolutionary-algorithm-based methodology
• Optimizes a population of computer programs
• Tree-based representation
• Example fitness cases (these match the target x² + x + 1):

| X | Output |
|---|--------|
| 0 | 1 |
| 1 | 3 |
| 2 | 7 |
| 3 | 13 |
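As a minimal, hypothetical sketch (not from the slides) of the tree representation and its evaluation, using the x² + x + 1 example above:

```cuda
#include <stdio.h>

/* A GP individual as a tree of function and terminal nodes. */
typedef enum { ADD, MUL, VAR_X, CONST } Op;

typedef struct Node {
    Op op;
    double value;               /* used when op == CONST */
    struct Node *left, *right;  /* children of ADD / MUL */
} Node;

/* Recursive evaluation of a tree on one fitness case x. */
double eval(const Node *n, double x) {
    switch (n->op) {
    case ADD:   return eval(n->left, x) + eval(n->right, x);
    case MUL:   return eval(n->left, x) * eval(n->right, x);
    case VAR_X: return x;
    default:    return n->value;   /* CONST */
    }
}

int main(void) {
    /* (x * x) + (x + 1)  ==  x^2 + x + 1 */
    Node one  = {CONST, 1.0, 0, 0}, varx = {VAR_X, 0, 0, 0};
    Node xx   = {MUL, 0, &varx, &varx}, x1 = {ADD, 0, &varx, &one};
    Node root = {ADD, 0, &xx, &x1};
    for (int i = 0; i <= 3; i++)
        printf("%d -> %g\n", i, eval(&root, (double)i));  /* 1 3 7 13 */
    return 0;
}
```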
GP Resource Demands
• GP is notoriously resource-consuming:
  • CPU cycles
  • Memory
• Standard GP system, 1 μs per node:
  • Binary trees of depth 17: 131 ms per tree
  • Fitness cases: 1,000; population size: 1,000
  • Generations: 1,000; number of runs: 100
  • Runtime: 10 Gs ≈ 317 years (order-of-magnitude arithmetic sketched below)
• Standard GP system, 1 ns per node:
  • Runtime: 116 days
• This limits what we can tackle with GP.
[Banzhaf and Harding, GECCO-2009]
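As a rough check (not from the slides), the runtime follows from multiplying out the settings above:

$$
\underbrace{2^{17}}_{\text{nodes/tree}} \times \underbrace{10^{3}}_{\text{cases}} \times \underbrace{10^{3}}_{\text{pop}} \times \underbrace{10^{3}}_{\text{gens}} \times \underbrace{10^{2}}_{\text{runs}} \times 1\,\mu\text{s} \approx 10^{10}\,\text{s} = 10\,\text{Gs} \approx 317\,\text{years}.
$$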
Sources of Speed-up
• Fast machines
• Vector processors
• Parallel machines (MIMD/SIMD)
• Clusters
• Loose networks
• Multi-core
• Graphics Processing Units (GPUs)
Why is a GPU faster than a CPU?
The GPU devotes more transistors to data processing. [CUDA C Programming Guide, Version 3.2]
GPU Programming APIs
• A number of toolkits are available for programming GPUs:
  • CUDA
  • MS Accelerator
  • RapidMind
  • Shader programming
• So far, GP researchers have not converged on one platform.
CUDA Programming
A massive number (>10,000) of lightweight threads.
CUDA Memory Model
CUDA exposes all the different types of memory on the GPU:
• Registers and local memory (per thread)
• Shared memory (per block)
• Global, constant, and texture memory (per device, accessible from the host)
[Diagram of host, device grid, blocks, and threads omitted; CUDA C Programming Guide, Version 3.2]
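A minimal sketch (not from the slides) of how these memory spaces look in CUDA C; the kernel and names are hypothetical:

```cuda
__constant__ float scale;   // constant memory, set from the host with
                            // cudaMemcpyToSymbol(scale, &h_scale, sizeof(float))

// Launch with 256 threads per block.
__global__ void scaleArray(const float *in, float *out, int n) {
    __shared__ float tile[256];                     // shared: one copy per block
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // i lives in a register
    if (i < n)
        tile[threadIdx.x] = in[i];                  // global -> shared
    __syncthreads();                                // whole block waits for the tile
    if (i < n)
        out[i] = scale * tile[threadIdx.x];         // shared -> global
}
```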
CUDA Programming Model
• The GPU is viewed as a computing device operating as a coprocessor to the main CPU (the host).
• Data-parallel, computationally intensive functions should be off-loaded to the device.
• Functions that are executed many times, but independently on different data, are prime candidates, i.e. the bodies of for-loops.
• A function compiled for the device is called a kernel.
Stop Thinking About What to Do and Start Doing It!
• Memory transfer time is expensive; computation is cheap.
• So do not calculate once and store in memory; just recalculate.
• Built-in variables:
  • threadIdx
  • blockIdx
  • gridDim
  • blockDim
Example: Increment Array Elements
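The original slide's code is an image; below is a minimal sketch of the standard increment-array example it shows, using the built-in variables above (names are hypothetical):

```cuda
#include <cuda_runtime.h>

// Each thread increments one array element.
__global__ void increment(float *a, float b, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)        // guard: the grid may be larger than the array
        a[i] += b;
}

// Host-side launch: enough 256-thread blocks to cover n elements.
void incrementOnDevice(float *d_a, float b, int n) {
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    increment<<<blocks, threads>>>(d_a, b, n);
}
```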
Example: Matrix Addition I
Example: Matrix Addition II
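These two code slides are also images; a minimal sketch of a 2D matrix-addition kernel and its launch (names are hypothetical):

```cuda
#include <cuda_runtime.h>

// Each thread computes one element: C[row][col] = A[row][col] + B[row][col].
__global__ void matAdd(const float *A, const float *B, float *C,
                       int width, int height) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (col < width && row < height)
        C[row * width + col] = A[row * width + col] + B[row * width + col];
}

// Host-side launch with a 2D grid of 16x16-thread blocks.
void matAddOnDevice(const float *dA, const float *dB, float *dC,
                    int width, int height) {
    dim3 threads(16, 16);
    dim3 blocks((width + 15) / 16, (height + 15) / 16);
    matAdd<<<blocks, threads>>>(dA, dB, dC, width, height);
}
```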
Parallel Genetic Programming
While most GP work is conducted on sequential computers, the following computationally intensive features make it well suited to parallel hardware:
• Individuals are run on multiple independent training examples.
• The fitness of each individual could be calculated on independent hardware in parallel.
• Multiple independent runs of the GP are needed for statistical confidence, given the stochastic nature of the results.
[Langdon and Banzhaf, EuroGP-2008]
A Many-Threaded CUDA Interpreter for Genetic Programming
• Runs tree GP on the GPU: 8692 times faster than a PC without the GPU (a rough interpreter sketch follows below).
• Solved the 20-bit multiplexer:
  • 2^20 = 1,048,576 fitness cases
  • Never before solved by tree GP; previously estimated to take more than 4 years
  • The GPU has consistently done it in less than an hour
• Solved the 37-bit multiplexer:
  • 2^37 = 137,438,953,472 fitness cases
  • Never attempted before
  • The GPU solves it in under a day
[W. B. Langdon, EuroGP-2010]
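As a rough illustration of the interpreter idea (a sketch, not Langdon's actual code): each individual is stored as a linear postfix program and evaluated with a small stack by a device function, so many programs and fitness cases can be processed in parallel. The opcode names are hypothetical.

```cuda
// Opcodes for the Boolean mux function set; values >= OP_INPUT
// encode the terminals D0, D1, ... as OP_INPUT + i.
enum { OP_AND, OP_OR, OP_NAND, OP_NOR, OP_INPUT };

// Evaluate one postfix-encoded program on one fitness case.
// Bit i of 'inputs' holds the value of terminal Di.
__device__ int evalProgram(const unsigned char *prog, int len,
                           unsigned long long inputs) {
    int stack[64], sp = 0;   // per-thread stack (depth bounded by tree depth)
    for (int pc = 0; pc < len; pc++) {
        unsigned char op = prog[pc];
        if (op >= OP_INPUT) {                 // terminal: push an input bit
            stack[sp++] = (int)((inputs >> (op - OP_INPUT)) & 1ULL);
        } else {                              // function: pop two, push one
            int b = stack[--sp], a = stack[--sp];
            switch (op) {
            case OP_AND:  stack[sp++] =  (a & b); break;
            case OP_OR:   stack[sp++] =  (a | b); break;
            case OP_NAND: stack[sp++] = !(a & b); break;
            case OP_NOR:  stack[sp++] = !(a | b); break;
            }
        }
    }
    return stack[0];                          // program output
}
```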
Boolean Multiplexor
• a address lines select one of d = 2^a data lines, so the program has n = a + d inputs.
• Number of test cases = 2^n:
  • 20-mux (a = 4, d = 16): 2^20 ≈ 1 million test cases
  • 37-mux (a = 5, d = 32): 2^37 ≈ 137 billion test cases
[W. B. Langdon, EuroGP-2010]
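For concreteness, a sketch of the target function the evolved trees must match (a hypothetical helper, assuming bits 0 to a-1 of the test case are the address lines):

```cuda
// Reference mux: 'a' address bits (the low bits of x) select one of the
// 2^a data bits that follow them.
__host__ __device__ int mux(unsigned long long x, int a) {
    unsigned long long addr = x & ((1ULL << a) - 1);  // address bits
    return (int)((x >> (a + addr)) & 1ULL);           // selected data bit
}
// 6-mux: a = 2;  20-mux: a = 4;  37-mux: a = 5.
```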
Genetic Programming Parameters for Solving the 20- and 37-Multiplexors

| Parameter | Value |
|---|---|
| Terminals | 20 or 37 Boolean inputs, D0–D19 or D0–D36 respectively |
| Functions | AND, OR, NAND, NOR |
| Fitness | Pseudo-random sample of 2048 of 1,048,576 (20-Mux) or 8192 of 137,438,953,472 (37-Mux) fitness cases |
| Tournament | 4 members, run on the same random sample; new samples for each tournament and each generation |
| Population | 262,144 |
| Initial population | Ramped half-and-half, 4:5 (20-Mux) or 5:7 (37-Mux) |
| Genetic operators | 50% subtree crossover, 5% subtree mutation, 45% point mutation; max depth 15, max size 511 (20-Mux) or 1023 (37-Mux) |
| Termination | 5000 generations |

Solutions were found in generation 423 (20-Mux) and 2866 (37-Mux). [W. B. Langdon, EuroGP-2010]
AND, OR, NAND, NOR
The function set, with the symbols used in the evolved trees: AND = &, OR = |, NAND = d, NOR = r.

| X | Y | X&Y | X\|Y | XdY | XrY |
|---|---|-----|------|-----|-----|
| 0 | 0 | 0 | 0 | 1 | 1 |
| 0 | 1 | 0 | 1 | 1 | 0 |
| 1 | 0 | 0 | 1 | 1 | 0 |
| 1 | 1 | 1 | 1 | 0 | 0 |
Evolution of 20-Mux and 37-Mux [W. B. Langdon, EuroGP-2010]
6-Mux Tree I [W. B. Langdon, EuroGP-2010]
6-Mux Tree II [W. B. Langdon, EuroGP-2010]
6-Mux Tree III [W. B. Langdon, EuroGP-2010]
Ideal 6-Mux Tree
Automatically Defined Functions (ADFs)
• Genetic programming trees often contain repeated patterns.
• Repeated subtrees can be treated as subroutines.
• ADFs are a methodology to automatically select and implement modularity in GP (a toy example follows below).
• This modularity can:
  • Reduce the size of the GP tree
  • Improve readability
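As a toy illustration (not from the slides), a repeated subtree factored into an ADF; the expression and names are hypothetical:

```cuda
#include <stdio.h>

/* The subtree (D0 AND D1) appears twice in the evolved tree;
   ADF0 captures it once as a reusable subroutine. */
int adf0(int d0, int d1) { return d0 & d1; }

/* Main tree:  ADF0(D0,D1)  OR  ( ADF0(D0,D1)  NOR  D2 ) */
int mainTree(int d0, int d1, int d2) {
    return adf0(d0, d1) | !(adf0(d0, d1) | d2);
}

int main(void) {
    printf("%d\n", mainTree(1, 1, 0));  /* prints 1 */
    return 0;
}
```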
Langdon's CUDA Interpreter with ADFs
• ADFs slow it down:
  • 20-Mux takes 9 hours instead of less than an hour
  • 37-Mux takes more than 3 days instead of less than a day
• Improved ADF implementation (indexing sketched below):
  • Previously one thread per GP program; now one thread block per GP program
  • Increased level of parallelism
  • Reduced divergence
  • 20-Mux takes 8 to 15 minutes
  • 37-Mux takes 7 to 10 hours
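A sketch of the block-per-program indexing (an assumption-laden illustration, not the published code): blockIdx.x selects the program and the block's threads stride over its fitness cases, reusing evalProgram from the earlier sketch.

```cuda
#define MAX_LEN 1024   // hypothetical fixed stride between stored programs

// One thread block per GP program.
__global__ void fitnessKernel(const unsigned char *progs, const int *len,
                              const unsigned long long *cases,
                              const int *target, int nCases, int *hits) {
    int prog = blockIdx.x;                        // this block's program
    for (int c = threadIdx.x; c < nCases; c += blockDim.x) {
        int out = evalProgram(&progs[prog * MAX_LEN], len[prog], cases[c]);
        if (out == target[c])
            atomicAdd(&hits[prog], 1);            // accumulate this program's score
    }
}
```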
6-Mux with ADF I
6-Mux with ADF II
6-Mux with ADF III
Conclusion 1: GP
• A powerful machine learning algorithm
• Capable of searching through trillions of states to find a solution
• Trees often contain repeated patterns and can be compacted by ADFs
• But computationally expensive
Conclusion 2: GPU
• Computationally fast
• Relatively low cost
• Needs a new programming paradigm, but a practical one
• Accelerates processing by up to 3000 times on computationally intensive problems
• But not well suited to memory-intensive problems
Acknowledgements
• Dr Will Browne and Dr Mengjie Zhang for supervision
• Kevin Buckley for technical support
• Eric for help with CUDA compilation
• Victoria University of Wellington for awarding the Victoria PhD Scholarship
• All of you for coming
Thank You. Questions?