Evaluating the Productivity of a Multicore Architecture
Jeremy Kepner and Nadya Bliss, MIT Lincoln Laboratory, HPEC 2008

This work is sponsored by the Department of Defense under Air Force Contract FA8721-05-C-0002. Opinions, interpretations, conclusions, and recommendations are those of the authors and are not necessarily endorsed by the United States Government.
Outline
• Parallel Design: Architecture Buffet, Programming Buffet, Productivity Assessment
• Programming Models
• Architectures
• Productivity Results
• Summary
Signal Processor Devices

[Chart: GOPS/W and GOPS/cm² for full-custom VLSI, standard-cell ASIC, FPGA, and DSP/RISC core implementations in a 90 nm CMOS process (2005 AD), projected out to 2015. Full-custom parts deliver on the order of 1000 GOPS/W, FPGAs and standard-cell ASICs roughly 10-100, and DSP/RISC cores about 1.]

• Wide range of device technologies for signal processing systems
• Each has its own tradeoffs. How do we choose?
Multicore Processor Buffet

• Homogeneous (short and long vector): Intel Duo/Duo, Broadcom, AMD Opteron, Tilera, IBM PowerX, Sun Niagara, IBM Blue Gene, Cray XT, Cray XMT, Clearspeed
• Heterogeneous: IBM Cell, Intel Polaris, nVidia, ATI

• Wide range of programmable multicore processors
• Each has its own tradeoffs. How do we choose?
Multicore Programming Buffet

• Flat (word and object): pThreads, StreamIt, UPC, CAF, VSIPL++, GA++, pMatlab, StarP
• Hierarchical: Cilk, CUDA, ALPH, MCF, Sequoia, PVTOL, pMatlabXVM

• Wide range of multicore programming environments
• Each has its own tradeoffs. How do we choose?
Performance vs Effort

Style               | Example                | Granularity | Training | Effort | Performance per Watt
Graphical           | Spreadsheet            | Module      | Low      | 1/30   | 1/100
Domain Language     | Matlab, Maple, IDL     | Array       | Low      | 1/10   | 1/5
Object Oriented     | Java, C++              | Object      | Medium   | 1/3    | 1/1.1
Procedural Library  | VSIPL, BLAS            | Structure   | Medium   | 2/3    | 1/1.05
Procedural Language | C, Fortran (this talk) | Word        | Medium   | 1      | 1
Assembly            | x86, PowerPC           | Register    | High     | 3      | 2
Gate Array          | VHDL                   | Gate        | High     | 10     | 10
Standard Cell       | —                      | —           | High     | 30     | 100
Custom              | VLSI                   | Transistor  | High     | 100    | 1000

• Programmable multicore applications can be implemented with a variety of interfaces
• Clear tradeoff between effort (3000x) and performance (100,000x)
  – Translates into mission capability vs mission schedule
Assessment Approach

[Notional plot: relative performance/watt vs relative effort, with serial C as the reference point. Traditional parallel programming reaches high performance/watt only at high effort; Java, Matlab, Python, etc. "all too often" deliver low performance/watt at low effort. The goal is high performance/watt at low effort.]

• "Write" benchmarks in many programming environments on different multicore architectures
• Compare performance/watt and relative effort to serial C
Outline
• Parallel Design
• Programming Models: Environment features, Estimates, Performance Complexity
• Architectures
• Productivity Results
• Summary
Programming Environment Features
• Too many environments with too many features to assess individually
• Decompose into general classes:
  – Serial programming environment
  – Parallel programming model
• Assess only relevant serial environment and parallel model pairs
Dimensions of Programmability
• Performance
  – The performance of the code on the architecture
  – Measured in: flops/sec, Bytes/sec, GUPS, …
• Effort
  – Coding effort required to obtain a certain level of performance
  – Measured in: programmer-days, lines-of-code, function points, …
• Expertise
  – Skill level of programmer required to obtain a certain level of performance
  – Measured in: degree, years of experience, multi-disciplinary knowledge required, …
• Portability
  – Coding effort required to port code from one architecture to the next and achieve a certain level of performance
  – Measured in: programmer-days, lines-of-code, function points, …
• Baseline
  – All quantities are relative to some baseline environment
  – Serial C on a single core x86 workstation, cluster, multicore, …
Serial Programming Environments

Programming Language   | Assembly | SIMD (C+AltiVec) | Procedural (ANSI C) | Objects (C++, Java) | High Level Languages (Matlab)
Performance Efficiency | 0.8      | 0.5              | 0.2                 | 0.15                | 0.05
Relative Code Size     | 10       | 3                | 1                   | 1/3                 | 1/10
Effort/Line-of-Code    | 4 hour   | 2 hour           | 1 hour              | 20 min              | 10 min
Portability            | Zero     | Low              | Very High           | Very High           | Low
Granularity            | Word     | Multi-word       | Word                | Object              | Array

• OO High Level Languages are the current desktop state-of-the-practice :-)
• Assembly/SIMD are the current multicore state-of-the-practice :-(
• Single core programming environments span 10x performance and 100x relative code size
Parallel Programming Environments

Approach               | Direct Memory Access (DMA) | Message Passing (MPI) | Threads (OpenMP) | Recursive Threads (Cilk) | PGAS (UPC, VSIPL++) | Hierarchical PGAS (PVTOL, HPCS)
Performance Efficiency | 0.8                        | 0.5                   | 0.2              | 0.4                      | 0.5                 | —
Relative Code Size     | 10                         | 3                     | 1                | —                        | 1/3                 | 1/10
Effort/Line-of-Code    | Very High                  | Medium                | —                | —                        | —                   | High
Portability            | Zero                       | Very High             | Medium           | —                        | —                   | TBD
Granularity            | Word                       | Multi-word            | Word             | —                        | Array               | —

• Message passing and threads are the current desktop state-of-the-practice :-|
• DMA is the current multicore state-of-the-practice :-(
• Parallel programming environments span 4x performance and 100x relative code size
Canonical 100 CPU Cluster Estimates

[Notional plot: relative parallel speedup vs relative effort. Serial environments (Assembly, C, C++, Matlab) anchor the effort axis; their parallel variants (/DMA, /MPI, /threads, /Arrays) cluster around each serial environment.]

• Programming environments form regions around their serial environment
Relevant Serial Environments and Parallel Models

Partitioning Scheme     | Serial | Multi-Threaded | Distributed Arrays | Hierarchical Arrays | Assembly + DMA
Fraction of programmers | 1      | 0.95           | 0.50               | 0.10                | 0.05
Relative Code Size      | 1      | 1.1            | 1.5                | 2                   | 10
"Difficulty"            | 1      | 1.15           | 3                  | 20                  | 200

• Focus on a subset of relevant programming environments:
  – C/C++ + serial, threads, distributed arrays, hierarchical arrays
  – Assembly + DMA
• "Difficulty" = (relative code size) / (fraction of programmers)
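The "Difficulty" row follows directly from the other two rows of the table. A quick check (Python here as a stand-in for the slides' MATLAB-style notation, using the table's values):

```python
# "Difficulty" = (relative code size) / (fraction of programmers),
# applied to the table's five partitioning schemes.
schemes = ["Serial", "Multi-Threaded", "Distributed Arrays",
           "Hierarchical Arrays", "Assembly + DMA"]
fraction = [1.00, 0.95, 0.50, 0.10, 0.05]
code_size = [1.0, 1.1, 1.5, 2.0, 10.0]
difficulty = [s / f for s, f in zip(code_size, fraction)]
for name, d in zip(schemes, difficulty):
    print(f"{name}: {d:.2f}")
```

The 200x spread from Serial to Assembly + DMA is what the later productivity plots trade against raw performance.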
Performance Complexity

[Notional plot: performance vs granularity (word to array) for a good architecture and a bad architecture, across programming models: none (serial), 1D no-communication (trivial), 1D, ND block-cyclic, ND hierarchical.]

• Performance complexity (Strohmeier/LBNL) compares performance as a function of the programming model
• In the above graph, point "G" is ~100x easier to program than point "B"
Outline
• Parallel Design
• Programming Models
• Architectures: Kuck Diagram, Homogeneous UMA, Heterogeneous NUMA, Benchmarks
• Productivity Results
• Summary
Single Processor Kuck Diagram

M0
|
P0

• Processors are denoted by boxes; memories by ovals
• Lines connect associated processors and memories
• Subscript denotes level in the memory hierarchy
Parallel Kuck Diagram

M0    M0
|      |
P0    P0
  \   /
 net0.5

• Replicates serial processors
• net denotes the network connecting memories at a level in the hierarchy (incremented by 0.5)
Multicore Architecture 1: Homogeneous

• Off-chip: 1 (all cores have UMA access to off-chip memory)
• On-chip: 1 (all cores have UMA access to on-chip 3D memory)
• Core: Ncore (each core has its own cache)

[Kuck diagram: off-chip shared memory SM2 over SMnet2; on-chip shared memory SM1 over SMnet1; cores P0 (numbered 0 through 63), each with its own cache M0, connected by net0.5.]
Multicore Architecture 2: Heterogeneous

• Off-chip: 1 (all supercores have UMA access to off-chip memory)
• On-chip: N (sub-cores share a bank of on-chip 3D memory and 1 control processor)
• Core: Ncore (each core has its own local store)

[Kuck diagram: off-chip shared memory SM2 over SMnet2; on-chip shared memory SM1 over SMnet1; each supercore contains one control processor and N sub-cores P0, each with its own local store M0, connected by net0.5 and net1.5.]
HPC Challenge SAR Benchmark (2D FFT)

• 2D FFT (with a full all-to-all corner turn) is a common operation in SAR and other signal processing
• Operation is complex enough to highlight programmability issues

%MATLAB Code
A = complex(rand(N,M), rand(N,M));
%FFT along columns
B = fft(A, [], 1);
%FFT along rows
C = fft(B, [], 2);
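The MATLAB snippet above has a direct NumPy analogue. This sketch (with illustrative sizes for N and M) confirms the structure the benchmark relies on: a column-wise 1D FFT pass followed by a row-wise pass is exactly the full 2D FFT.

```python
import numpy as np

N, M = 64, 32  # illustrative sizes
rng = np.random.default_rng(0)
A = rng.random((N, M)) + 1j * rng.random((N, M))

B = np.fft.fft(A, axis=0)  # FFT along columns
C = np.fft.fft(B, axis=1)  # FFT along rows

# The column pass followed by the row pass equals the 2D FFT.
assert np.allclose(C, np.fft.fft2(A))
```

The corner turn between the two passes is where the all-to-all communication (and most of the programmability difficulty) lives in the parallel cases below.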
Projective Transform
• Canonical kernel in image processing applications
• Takes advantage of cache on single core processors
• Takes advantage of multiple cores
• Results in regular distributions of both source and destination images
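The slides give no code for this kernel, so the following is a minimal nearest-neighbor sketch of a projective (homography) warp; `projective_transform` and the 3x3 matrix `H` are illustrative names, not the authors' implementation. Each destination pixel is computed independently, which is why the kernel maps cleanly onto multiple cores and regular data distributions.

```python
import numpy as np

def projective_transform(src, H, out_shape):
    """Warp image src by the 3x3 homography H (nearest-neighbor sampling):
    each destination pixel is pulled from H^-1 applied to its coordinate."""
    Hinv = np.linalg.inv(H)
    out = np.zeros(out_shape, dtype=src.dtype)
    for r in range(out_shape[0]):
        for c in range(out_shape[1]):
            x, y, w = Hinv @ np.array([c, r, 1.0])  # inverse map, homogeneous coords
            sc, sr = int(round(x / w)), int(round(y / w))
            if 0 <= sr < src.shape[0] and 0 <= sc < src.shape[1]:
                out[r, c] = src[sr, sc]  # copy nearest source pixel
    return out
```

Because the destination loop nest has no cross-iteration dependences, rows (or tiles) of the output can be assigned to cores directly; tiling also gives the cache locality the slide mentions for single-core processors.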
Outline
• Parallel Design
• Programming Models
• Architectures
• Productivity Results: Implementations, Performance vs Effort, Productivity vs Model
• Summary
Case 1: Serial Implementation

CODE
A = complex(rand(N,M), rand(N,M));
//FFT along columns
for j=1:M
  A(:,j) = fft(A(:,j));
end
//FFT along rows
for i=1:N
  A(i,:) = fft(A(i,:));
end

NOTES
• Single threaded program
• Complexity: LOW
• Initial implementation to get the code running on a system
• No parallel programming expertise required
• Users capable of writing this program: 100%

Heterogeneous performance
• Execution: this program will run on a single control processor
• Memory: only off-chip memory will be used

Homogeneous performance
• Execution: this program will run on a single core
• Memory: off-chip memory, on-chip cache, and local cache will be used
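A runnable analogue of Case 1 (Python/NumPy standing in for the slide's pseudocode) makes the serial baseline concrete:

```python
import numpy as np

def fft2_serial(A):
    """Serial 2D FFT in the slide's loop form: one 1D FFT per column,
    then one 1D FFT per row, all on a single core."""
    A = np.array(A, dtype=complex)
    for j in range(A.shape[1]):   # FFT along columns
        A[:, j] = np.fft.fft(A[:, j])
    for i in range(A.shape[0]):   # FFT along rows
        A[i, :] = np.fft.fft(A[i, :])
    return A
```

This is the version the productivity comparison normalizes against: every later case buys performance by complicating exactly these two loops.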
Case 2: Multi-Threaded Implementation

CODE
A = complex(rand(N,M), rand(N,M));
#pragma omp parallel ...
//FFT along columns
for j=1:M
  A(:,j) = fft(A(:,j));
end
#pragma omp parallel ...
//FFT along rows
for i=1:N
  A(i,:) = fft(A(i,:));
end

NOTES
• Multi-threaded program: each thread operates on a single column (row) of the matrix
• Complexity: LOW
• Minimal parallel programming expertise required
• Users capable of writing this program: 99%

Heterogeneous performance
• Execution: this program will run on all control processors
• Memory: only off-chip memory will be used; poor locality will cause a memory bottleneck

Homogeneous performance
• Execution: this program will run on all cores
• Memory: off-chip memory, on-chip cache, and local cache will be used; poor locality will cause a memory bottleneck
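The OpenMP version can be sketched with a Python thread pool (an illustrative analogue, not the authors' code; in CPython the GIL may limit actual speedup for small transforms, but the decomposition mirrors the two `#pragma omp parallel` loops: one task per column, then one per row):

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def fft2_threaded(A, workers=4):
    """Thread-parallel 2D FFT mirroring the slide's OpenMP loops."""
    A = np.array(A, dtype=complex)
    B = np.empty_like(A)
    C = np.empty_like(A)
    def do_col(j): B[:, j] = np.fft.fft(A[:, j])  # each task owns one column
    def do_row(i): C[i, :] = np.fft.fft(B[i, :])  # each task owns one row
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(do_col, range(A.shape[1])))  # parallel column FFTs
        list(pool.map(do_row, range(A.shape[0])))  # parallel row FFTs
    return C
```

Tasks write disjoint columns (then rows), so no locking is needed; but since every thread streams over the whole shared matrix, the poor-locality memory bottleneck the slide describes remains.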
Case 3: Parallel 1D Block Implementation

CODE
mapA = map([1 36], {}, [0:35]); //column map
mapB = map([36 1], {}, [0:35]); //row map
A = complex(rand(N,M,mapA), rand(N,M,mapA));
B = complex(zeros(N,M,mapB), rand(N,M,mapB));
//Get local indices
myJ = get_local_ind(A);
myI = get_local_ind(B);
//FFT along columns
for j=1:length(myJ)
  A.local(:,j) = fft(A.local(:,j));
end
B(:,:) = A; //corner turn
//FFT along rows
for i=1:length(myI)
  B.local(i,:) = fft(B.local(i,:));
end

[Figure: the column map and row map distribute A and B as 1D blocks onto 4 processors P0-P3.]

NOTES
• Explicitly parallel program using 1D block distribution
• Complexity: MEDIUM
• Parallel programming expertise required, particularly for understanding data distribution
• Users capable of writing this program: 75%

Heterogeneous performance
• Execution: this program will run on all control processors
• Memory: only off-chip memory will be used

Homogeneous performance
• Execution: this program will run on all cores
• Memory: off-chip memory, on-chip cache, and local cache will be used; better locality will decrease the memory bottleneck
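The distributed-array version can be simulated in one process by giving each of P notional processors a block of columns, FFTing locally, and performing the corner turn as an explicit redistribution. This is a sketch: `P`, `fft2_1dblock`, and the block bookkeeping are illustrative, not pMatlab's API.

```python
import numpy as np

def fft2_1dblock(A, P=4):
    """Simulated 1D block-parallel 2D FFT: P 'processors' each own a
    block of columns (column map), FFT them locally, then a corner turn
    redistributes the data by rows (row map) for the second pass."""
    col_blocks = np.array_split(np.array(A, dtype=complex), P, axis=1)
    col_blocks = [np.fft.fft(b, axis=0) for b in col_blocks]  # local column FFTs
    B = np.concatenate(col_blocks, axis=1)  # corner turn: all-to-all exchange
    row_blocks = np.array_split(B, P, axis=0)
    row_blocks = [np.fft.fft(b, axis=1) for b in row_blocks]  # local row FFTs
    return np.concatenate(row_blocks, axis=0)
```

In a real distributed run the `concatenate` steps become the all-to-all communication; everything else touches only processor-local blocks, which is where the improved locality comes from.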
Case 4: Parallel 1D Block Hierarchical Implementation

CODE
mapHcol = map([1 8], {}, [0:7]); //col hierarchical map
mapHrow = map([8 1], {}, [0:7]); //row hierarchical map
mapH = map([0:7]); //base hierarchical map
mapA = map([1 36], {}, [0:35], mapH); //column map
mapB = map([36 1], {}, [0:35], mapH); //row map
A = complex(rand(N,M,mapA), rand(N,M,mapA));
B = complex(zeros(N,M,mapB), rand(N,M,mapB));
//Get local indices
myJ = get_local_ind(A);
myI = get_local_ind(B);
//FFT along columns
for j=1:length(myJ)
  temp = A.local(:,j); //get local col
  temp = reshape(temp); //reshape col into matrix
  alocal = zeros(size(temp), mapHcol);
  blocal = zeros(size(temp), mapHrow);
  alocal(:,:) = temp; //distribute col to fit into SPE/cache
  myHj = get_local_ind(alocal);
  for jj = 1:length(myHj)
    alocal(:,jj) = fft(alocal(:,jj));
  end
  blocal(:,:) = alocal; //corner turn that fits into SPE/cache
  myHi = get_local_ind(blocal);
  for ii = 1:length(myHi)
    blocal(ii,:) = fft(blocal(ii,:));
  end
  temp = reshape(blocal); //reshape matrix into column
  A.local(:,j) = temp; //store result
end
B(:,:) = A; //corner turn
//FFT along rows
...

[Figure: hierarchical distribution onto P0-P3; each local column is reshaped into a matrix whose 2D FFT fits in SPE/cache.]

NOTES
• Complexity: HIGH
• Users capable of writing this program: <20%

Heterogeneous performance
• Execution: this program will run on all cores
• Memory: off-chip, on-chip, and local store memory will be used; hierarchical arrays allow detailed management of memory bandwidth

Homogeneous performance
• Execution: this program will run on all cores
• Memory: off-chip memory, on-chip cache, and local cache will be used; caches prevent detailed management of memory bandwidth
Performance/Watt vs Effort

[Plots: performance/watt efficiency vs programming difficulty for the SAR 2D FFT and the projective transform, on the heterogeneous and homogeneous architectures, for programming models A-E.]

Programming models:
A: C single threaded
B: C multi-threaded
C: Parallel Arrays
D: Hierarchical Arrays
E: Hand Coded Assembly

• Tradeoffs exist between performance and programming difficulty
• Different architectures enable different performance and programming capabilities
• Forces system architects to understand device implications and consider programmability
• Programming Difficulty = (Code Size) / (Fraction of Programmers)
Defining Productivity

[Notional plot: performance speedup vs difficulty, with a hardware limit near 100x speedup. "Good" is roughly 10x speedup at ~0.1 difficulty; "acceptable" lies between; "bad" is roughly 0.1x speedup at ~10 difficulty.]

• Productivity is a ratio between utility and cost
• From the programmer perspective this is proportional to performance over difficulty
Productivity vs Programming Model

[Plots: relative productivity vs programming model (A-E) for the SAR 2D FFT and the projective transform, on the heterogeneous and homogeneous architectures.]

Programming models:
A: C single threaded
B: C multi-threaded
C: Parallel Arrays
D: Hierarchical Arrays
E: Hand Coded Assembly

• Productivity varies with architecture and application
• Homogeneous: threads or parallel arrays give the highest productivity
• Heterogeneous: hierarchical arrays give the highest productivity
Summary
• Many multicore processors are available
• Many multicore programming environments are available
• Assessing which approaches are best for which architectures is difficult
• Our approach:
  – "Write" benchmarks in many programming environments on different multicore architectures
  – Compare performance/watt and relative effort to serial C
• Conclusions:
  – For homogeneous architectures, C/C++ using threads or parallel arrays has the highest productivity
  – For heterogeneous architectures, C/C++ using hierarchical arrays has the highest productivity