Advanced Architectures
CSE 190
Reagan W. Moore
San Diego Supercomputer Center
moore@sdsc.edu
http://www.npaci.edu/DICE
National Partnership for Advanced Computational Infrastructure

Course Organization
• Professors / TA
  • Sid Karin - Director, San Diego Supercomputer Center <skarin@sdsc.edu>
  • Reagan Moore - Associate Director, SDSC <moore@sdsc.edu>
  • Holly Dail - UCSD TA <hdail@cs.ucsd.edu>
• Seminars
  • State of the art computer architectures
• Mid-term / SDSC tour
• Final exam

Seminars
• 4/3: Reagan Moore - Performance evaluation heuristics & modeling
• 4/10: Sid Karin - Historical perspective
• 4/17: Richard Kaufmann, Compaq - Teraflops systems
• 4/24: IBM or Sun
• 5/1: Mark Seager, LLNL - ASCI 10 Tflops computer
• 5/8: Midterm / SDSC Tour
• 5/15: John Feo, Tera - Multi-threaded architectures
• 5/22: Peter Beckman, LANL - Clusters
• 5/29: Holiday / no class
• 6/5: Thomas Sterling, Caltech - Petaflops computers
• 6/12: Final exam

Supercomputers for Simulation and Data Mining
[Diagram labels: Application, Information Discovery, Data Mining, Distributed Archives, Collection Building, Digital Library]

Heuristics for Characterizing Supercomputers
• Generators of data - numerically intensive computing
  • Usage models for the rate at which supercomputers move data between memory, disk, and archives
  • Usage models for capacity of the data caches (memory size, local disk, and archival storage)
• Analyzers of data - data intensive computing
  • Performance models for combining data analysis with data movement (between caches, disks, archives)

Heuristics
• Experience based models of computer usage
• Dependent on computer architecture
  • Presence of data caches, memory-mapped I/O
• Architectures used at SDSC
  • CRAY vector computers
    • X-MP, Y-MP, C-90, T-90
  • Parallel computers
    • MPPs - iPSC/860, Paragon, T3D, T3E
    • Clusters - SP

Supercomputer Data Flow Model
[Diagram labels, in order: CPU, Memory, Local Disk, Archive Tape]

Y-MP Heuristics
• Utilization measured on Cray Y-MP
  • Real memory architecture - entire job context is in memory, no paging of data
  • Exceptional memory bandwidth
  • I/O rate from CPU to memory was 28 Bytes per cycle
  • Maximum execution rate was 2 Flops per cycle
• Scaled memory on C-90 to test heuristics
  • Noted that increasing memory from 1 GB to 2 GB decreased idle time from 10% to 2%
  • Sustained execution rate was 1.8 GFlops

Data Generation Metrics
• CPU to Memory: 7 Bytes/Flop
• Memory: 1 Byte of storage per Flops
• Memory to Local Disk: 1 Byte/60 Flop; 1/7 of data persists for a day
• Local Disk: hold data for 1 day; 1/7 of data sent to archive
• Archive Disk: hold data for 1 week; all data sent to tape
• Archive Tape: hold data forever

Peak Teraflops System
• TeraFlops System Compute Engine: 0.5-1 TB memory, sustain ? GF
• Compute Engine to Local Disk: ? GB/sec
• Local Disk: ? TB, 1 day cache
• Local Disk to Archive: ? MB/sec
• Archive Disk: ? TB, 1 week cache
• Archive Tape: ? PB

Data Sizes on Disk
• How much scratch space is used by each job?
  • Disk space is 20 - 40 times the memory size
  • Data lasts for about one day
• Average execution time for long running jobs
  • 30 minutes to 1 hour
• For jobs using all of memory
  • Between 24 and 48 jobs per day
  • Each job uses (Disk space) / (Number of jobs)
  • Or (40/48) x memory size, about 80% of memory
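
The scratch-space estimate above can be reproduced with a few lines of arithmetic. This is a minimal sketch, not part of the original slides; the function name `scratch_per_job` and its parameters are hypothetical, and it assumes jobs run back to back for 24 hours.

```python
def scratch_per_job(disk_to_memory_ratio, job_minutes, hours_per_day=24.0):
    """Scratch space used per job, as a fraction of memory size.

    Assumes (as on the slide) that disk data lasts about one day and that
    jobs filling all of memory run back to back.
    """
    jobs_per_day = hours_per_day * 60.0 / job_minutes
    return disk_to_memory_ratio / jobs_per_day


# Slide case: disk is 40x memory, 30-minute jobs (48 jobs per day).
fraction = scratch_per_job(40, 30)
print(f"scratch per job = {fraction:.2f} x memory")
```

With one-hour jobs and a disk 20 times the memory size, the same function gives 20/24 of memory, so both ends of the quoted ranges land near the 80% figure.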

Peak Teraflops Data Flow Model
• TeraFlops System Compute Engine: 0.5-1 TB memory, sustain 150 GF
• Compute Engine to Local Disk: 1 GB/sec
• Local Disk: 10 TB, 1 day cache
• Local Disk to Archive: 40 MB/sec
• Archive Disk: 5 TB, 1 week cache
• Archive Tape: 0.5-1 PB

HPSS Archival Storage System
[Diagram components]
• 3494 Robot, Eight Tape Drives, Magstar 3590 Tape
• 3494 Robot, Seven Tape Drives, Magstar 3590 Tape
• SSA RAID 108 GB, SSA RAID 54 GB, SSA RAID 108 GB, SSA RAID 160 GB
• Silver Node: Tape / disk mover, DCE / FTP / HIS, Log Client
• Silver Node: Tape / disk mover, DCE / FTP / HIS, Log Client
• Silver Node: Storage / Purge, Bitfile / Migration, Nameservice / PVL, Log Daemon
• RS 6000: Tape Mover, PVR (9490)
• High Performance Gateway Node
• TrailBlazer3 Switch
• 9490 Robot, Four Drives, 3490 Tape
• HiPPI Switch
• High Node: Disk Mover, HiPPI driver
• Wide Node: Disk Mover, HiPPI driver
• MaxStrat RAID 830 GB

Equivalent of Ohm’s Law for Computer Science
• How does one relate application requirements to computation rates and I/O bandwidths?
• Use prototype data movement problem to derive physical parameters that characterize applications.

Data Distribution Comparison
Reduce size of data from S bytes to s bytes and analyze
[Diagram: Data, link B, Data Handling Platform, link b, Supercomputer]
• Execution rates: r on the data handling platform, R on the supercomputer, with r < R
• Bandwidths linking systems are B and b
• Operations per bit for analysis is C
• Operations per bit for data transfer is c
Should the data reduction be done before transmission?

Distributing Services
Compare times for analyzing data with size reduction from S to s

Reduce data on the data handling platform:
• Data Handling Platform: Read Data S/B, Reduce Data CS/r, Transmit Data cs/r
• Network: s/b
• Supercomputer: Receive Data cs/R

Reduce data on the supercomputer:
• Data Handling Platform: Read Data S/B, Transmit Data cS/r
• Network: S/b
• Supercomputer: Receive Data cS/R, Reduce Data CS/R

Comparison of Time
Processing at supercomputer (all S bytes are moved, then reduced at rate R)
  T(Super) = S/B + cS/r + S/b + cS/R + CS/R
Processing at archive (data is reduced at rate r, then s bytes are moved)
  T(Archive) = S/B + CS/r + cs/r + s/b + cs/R

Optimization Parameter Selection
Have algebraic equation with eight independent variables.
  T(Super) < T(Archive)
  S/B + cS/r + S/b + cS/R + CS/R < S/B + CS/r + cs/r + s/b + cs/R
Which variable provides the simplest optimization criterion?
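
The two end-to-end times can be compared numerically. This is a minimal sketch under the slide model; the function names `time_move_all` and `time_reduce_first` are hypothetical, and it assumes S, s, C, and c all use the same data unit (the slides mix bytes and bits).

```python
def time_move_all(S, B, b, C, c, r, R):
    """Move all S units to the supercomputer, then reduce there at rate R."""
    read = S / B            # read data at the data handling platform
    send = c * S / r        # transfer protocol work on the platform (rate r)
    network = S / b         # full data set crosses the network
    receive = c * S / R     # transfer protocol work on the supercomputer
    reduce_ = C * S / R     # analysis on the supercomputer
    return read + send + network + receive + reduce_


def time_reduce_first(S, s, B, b, C, c, r, R):
    """Reduce S to s on the data handling platform (rate r), then move s."""
    read = S / B
    reduce_ = C * S / r     # analysis on the slower platform
    send = c * s / r
    network = s / b         # only the reduced data crosses the network
    receive = c * s / R
    return read + reduce_ + send + network + receive


# Illustrative values only (not from the slides).
params = dict(S=100.0, B=10.0, b=5.0, C=4.0, c=1.0, r=2.0, R=4.0)
t_move = time_move_all(**params)
t_reduce = time_reduce_first(s=10.0, **params)
print(t_move, t_reduce, "move all" if t_move < t_reduce else "reduce first")
```

The read term S/B appears in both times, so it cancels in the comparison; only the other seven variables affect which side wins.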

Scaling Parameters
• Data size reduction ratio: s/S
• Execution slow down ratio: r/R
• Problem complexity: c/C
• Communication/Execution balance: r/(cb)
Note (r/c) is the number of bits/sec that can be processed.
When r/(cb) = 1, the data processing rate is the same as the data transmission rate.
Optimal designs have r/(cb) = 1

Bandwidth Optimization
Moving all of the data is faster, T(Super) < T(Archive)
Sufficiently fast network
  b > (r/C) (1 - s/S) / [1 - r/R - (c/C) (1 + r/R) (1 - s/S)]
Note the denominator changes sign when
  C < c (1 + r/R) (1 - s/S) / (1 - r/R)
Even with an infinitely fast network, it is better to do the processing at the archive if the complexity is too small.
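
The bandwidth criterion can be evaluated directly. This is a minimal sketch; the function name `min_bandwidth` is hypothetical, and returning `None` for the case where the denominator is not positive is a design choice of this sketch, not something stated on the slide.

```python
def min_bandwidth(s_over_S, C, c, r, R):
    """Network bandwidth b above which moving all of the data is faster.

    Returns None when the denominator is not positive, i.e. when the
    complexity C is too small for any network speed to favor moving the data.
    """
    x = 1.0 - s_over_S                                   # fraction of data removed
    denom = 1.0 - r / R - (c / C) * (1.0 + r / R) * x
    if denom <= 0.0:
        return None
    return (r / C) * x / denom


# Illustrative values only (not from the slides).
print(min_bandwidth(0.1, C=4.0, c=1.0, r=2.0, R=4.0))   # finite threshold
print(min_bandwidth(0.1, C=1.0, c=1.0, r=2.0, R=4.0))   # None: complexity too small
```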

Execution Rate Optimization
Moving all of the data is faster, T(Super) < T(Archive)
Sufficiently fast supercomputer
  R > r [1 + (c/C) (1 - s/S)] / [1 - (c/C) (1 - s/S) (1 + r/(cb))]
Note the denominator changes sign when
  C < c (1 - s/S) [1 + r/(cb)]
Even with an infinitely fast supercomputer, it is better to process at the archive if the complexity is too small.

Data Reduction Optimization
Moving all of the data is faster, T(Super) < T(Archive)
Data reduction is small enough
  s > S {1 - (C/c) (1 - r/R) / [1 + r/R + r/(cb)]}
Note the criterion changes sign when
  C > c [1 + r/R + r/(cb)] / (1 - r/R)
When the complexity is sufficiently large, it is faster to process on the supercomputer even when data can be reduced to one bit.
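
The data reduction criterion can be sketched the same way. The function name `min_reduced_size` is hypothetical; clamping the bound at zero is an assumption of this sketch that encodes the slide's remark that, for large enough complexity, any reduced size satisfies the criterion.

```python
def min_reduced_size(S, C, c, r, R, b):
    """Reduced size s above which moving all of the data is faster.

    A result of 0.0 means the bound is negative, so every possible
    reduced size s favors processing on the supercomputer.
    """
    bound = 1.0 - (C / c) * (1.0 - r / R) / (1.0 + r / R + r / (c * b))
    return max(0.0, S * bound)


# Illustrative values only (not from the slides).
print(min_reduced_size(100.0, C=2.0, c=1.0, r=2.0, R=4.0, b=5.0))
print(min_reduced_size(100.0, C=4.0, c=1.0, r=2.0, R=4.0, b=5.0))
```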

Complexity Analysis
Moving all of the data is faster, T(Super) < T(Archive)
Sufficiently complex analysis
  C > c (1 - s/S) [1 + r/R + r/(cb)] / (1 - r/R)
Note, as the execution ratio approaches 1, the required complexity becomes infinite.
Also, as the amount of data reduction goes to zero, the required complexity goes to zero.
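
The complexity threshold and its two limiting cases can be checked numerically. This is a minimal sketch; the function name `min_complexity` is hypothetical, and treating r >= R as an infinite threshold is this sketch's way of representing the limit noted on the slide.

```python
def min_complexity(s_over_S, c, r, R, b):
    """Analysis complexity C above which moving all of the data is faster."""
    if r >= R:
        # Execution ratio r/R reaches 1: no finite complexity is enough.
        return float("inf")
    return c * (1.0 - s_over_S) * (1.0 + r / R + r / (c * b)) / (1.0 - r / R)


# Illustrative values only (not from the slides).
print(min_complexity(0.1, c=1.0, r=2.0, R=4.0, b=5.0))   # finite threshold
print(min_complexity(1.0, c=1.0, r=2.0, R=4.0, b=5.0))   # no reduction: 0.0
print(min_complexity(0.1, c=1.0, r=4.0, R=4.0, b=5.0))   # r/R = 1: inf
```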

Characterization of Supercomputer Systems
• Sufficiently high complexity
  • Move data to processing engine
  • Digital Library execution of remote services
  • Traditional supercomputer processing of applications
• Sufficiently low complexity
  • Move process to the data source
  • Metacomputing execution of remote applications
  • Traditional digital library service

Computer Architectures
• Processor in memory
  • Do computations within memory
  • Complexity of supported operations
• Commodity processors
  • L2 caches
  • L3 caches
• Parallel computers
  • Memory bandwidth between nodes
  • MPP - shared memory
  • Cluster - distributed memory

Characterization Metric
• Describe systems in terms of their balance
  Optimal designs have r/(cb) = 1
  Equivalent of Ohm’s law: R = CB
• Characterize applications in terms of their complexity
  Operations per byte of data: C = R/B
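
A short worked example of the balance metric, as a sketch only. The function names are hypothetical, and pairing the 150 GF sustained rate with the 1 GB/sec disk bandwidth from the teraflops data flow slide is an illustrative assumption, not a calculation made in the slides.

```python
def balance(r, c, b):
    """Communication/execution balance r/(cb); 1.0 means a balanced design."""
    return r / (c * b)


def matching_complexity(R, B):
    """Application complexity (operations per byte) that balances R and B."""
    return R / B


# Assumed pairing: 150 GFlops sustained against a 1 GB/sec data path.
C = matching_complexity(150e9, 1e9)
print(C, "operations per byte")
print(balance(150e9, C, 1e9))   # balanced by construction
```

An application with fewer operations per byte than this would leave the processor waiting on data, while one with more would leave the data path idle.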

Second Example
• Inclusion of latency (time for process to start) and overhead (time to execute communication protocol)
• Illustrate with combined optimization of use of network and CPU

Optimizing Use of Resources
• Compare time needed to do calculations with time needed to access data over a network
• Time spent using a CPU = Execution time + protocol processing time
  = Cc * Sc / Rc + Cp * St / Rp
Where
  St = size of transmitted data (bytes)
  Sc = size of application data (bytes)
  Cc = number of operations per byte of transmitted data for the application
  Cp = number of operations per byte to process protocol
  Rc = execution rate of application
  Rp = execution rate of protocol

Characterizing Latency
• Time during which a network transmits data = Latency for initiating transfer + transmission time
  = L + St / B
Where
  L is the round trip latency at the speed of light (sec)
  B is the bandwidth (bytes/sec)
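
The CPU and network time expressions can be written as two small functions. This is a minimal sketch using the slide symbols as parameter names; the function names `cpu_time` and `network_time` and the sample values are hypothetical.

```python
def cpu_time(Sc, St, Cc, Cp, Rc, Rp):
    """Execution time plus protocol processing time, in seconds."""
    return Cc * Sc / Rc + Cp * St / Rp


def network_time(St, L, B):
    """Latency for initiating the transfer plus transmission time, in seconds."""
    return L + St / B


# Illustrative values only: 1 MB moved and processed, 50 ms latency, 100 MB/sec.
t_cpu = cpu_time(Sc=1e6, St=1e6, Cc=100.0, Cp=10.0, Rc=1e9, Rp=1e9)
t_net = network_time(St=1e6, L=0.05, B=1e8)
print(t_cpu, t_net)
```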

Solve for Balanced System
• CPU utilization time = Network utilization time
• Solve for transmission size as a function of Sc/St
  St = L B / [B * Cp / Rp + (B * Cc / Rc) * (Sc / St) - 1]
Solution exists when
  Sc/St > [Rc / (B * Cc)] [1 - B * Cp / Rp]
  and B * Cp / Rp < 1
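
The balanced transmission size can be computed and checked against the two time expressions. This is a minimal sketch; the function name `balanced_transfer_size` is hypothetical, `h` stands for Sc/St, and returning `None` when no positive solution exists is a choice made for this sketch.

```python
def balanced_transfer_size(h, L, B, Cc, Cp, Rc, Rp):
    """Transmission size St at which CPU time equals network time.

    h is the ratio Sc/St. Returns None when the denominator is not
    positive, i.e. when no positive St balances the two times.
    """
    denom = B * Cp / Rp + (B * Cc / Rc) * h - 1.0
    if denom <= 0.0:
        return None
    return L * B / denom


# Illustrative values only (not from the slides).
L, B, Cc, Cp, Rc, Rp, h = 0.01, 1e8, 20.0, 1.0, 1e9, 1e9, 1.0
St = balanced_transfer_size(h, L, B, Cc, Cp, Rc, Rp)
t_cpu = Cc * (h * St) / Rc + Cp * St / Rp    # CPU time with Sc = h * St
t_net = L + St / B                           # network time
print(St, t_cpu, t_net)
```

The balanced size scales linearly with the latency L, which matches the later remark that the amount of data transmitted is proportional to latency.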

Comparing Utilization of Resources
• Network utilization
  Un = Transmission time / (Transmission + latency)
     = 1 / [1 + (L * B / St)]
• CPU utilization
  Uc = Execution time / (Execution + Protocol processing)
     = 1 / [1 + (Cp * Rc) / (Cc * Rp) * (St / Sc)]
Define h = Sc / St
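
The two utilization formulas can be evaluated side by side. This is a minimal sketch with hypothetical function names and sample values; `h` is Sc/St as defined on the slide.

```python
def network_utilization(St, L, B):
    """Fraction of network time spent transmitting rather than waiting on latency."""
    return 1.0 / (1.0 + L * B / St)


def cpu_utilization(h, Cc, Cp, Rc, Rp):
    """Fraction of CPU time spent on the application rather than the protocol."""
    return 1.0 / (1.0 + (Cp * Rc) / (Cc * Rp) / h)


# Illustrative values only (not from the slides).
Un = network_utilization(St=1e6, L=0.01, B=1e8)
Uc = cpu_utilization(h=1.0, Cc=20.0, Cp=1.0, Rc=1e9, Rp=1e9)
print(Un, Uc)
```

Larger transfers raise the network utilization, while a larger h (more application data per transmitted byte) raises the CPU utilization.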

Comparing Efficiencies
[Plot: utilization (U-cpu and U-network) versus h = S-compute / S-transmit]

Crossover Point
• When utilization of bandwidth and execution resources is balanced:
  1 / [1 + (L * B / St)] = 1 / [1 + (Cp * Rc) / (Cc * Rp) / h]
For optimal St, solve for h = Sc/St, and find
  h = [Rc Cp / (2 Rp Cc)] [sqrt(1 + 4 Rp / (Cp B)) - 1]
For small B * Cp / Rp
  h ~ Rc / (Cc B) or St / B ~ Sc Cc / Rc
And transmission time ~ execution time

Application Summary
• Optimal application for a given architecture
  B * Cc / Rc ~ 1
  (Bytes/sec) (Operations/byte) / (Operations/sec)
  Cc ~ Rc / B
• Also need cost of network utilization to be small
  B * Cp / Rp < 1
  And amount of data transmitted proportional to latency
  St = L B / [B * Cp / Rp + (B * Cc / Rc) * (Sc / St) - 1]

Further Information
http://www.npaci.edu/DICE