- Slides: 41
Why systolic architectures?
Hsiang-Tsung Kung, Carnegie Mellon University
IEEE Computer, 1982
Sven Gregorio, Seminar on Computer Architecture
Background, Problem & Goal

Special-purpose systems and their cost
- Many high-performance special-purpose systems are produced
  - General-purpose systems aren't always able to meet performance constraints
- Their cost is composed of design and parts cost
- Design cost tends to dominate the parts cost
  - Special-purpose systems are usually produced in small quantities
- Special-purpose systems are often designed ad hoc
  - The designs solve one task only and aren't generalizable
- The same errors are often repeated
  - Most notably: I/O imbalance
Why special-purpose systems?
- There is an interest in speeding up compute-bound computations
  - Compute-bound: #operations > #inputs + #outputs
    - E.g. matrix multiplication
  - Non-compute-bound computations are I/O-bound
    - E.g. matrix addition
- These computations tend to be too taxing for CPUs
  - Von Neumann bottleneck: for each operation at least one operand has to be fetched
    - Compute-bound computations become I/O-bound
  - Memory bandwidth often isn't enough to keep the CPU pipeline filled
  - Memory accesses are costly in terms of energy
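The compute-bound criterion above can be checked with a rough count. The sketch below is only an illustration: it assumes square n x n matrices, counts one multiply-add as a single operation, and uses hypothetical helper names (`matmul_counts`, `matadd_counts`, `is_compute_bound`) that do not come from the slides.

```python
def matmul_counts(n):
    """Return (operations, I/O items) for multiplying two n x n matrices."""
    operations = n ** 3              # assumption: one multiply-add per (i, j, k) triple
    io_items = 2 * n * n + n * n     # two input matrices plus one output matrix
    return operations, io_items


def matadd_counts(n):
    """Return (operations, I/O items) for adding two n x n matrices."""
    operations = n * n               # one addition per output element
    io_items = 2 * n * n + n * n     # two input matrices plus one output matrix
    return operations, io_items


def is_compute_bound(operations, io_items):
    """Slide criterion: #operations > #inputs + #outputs."""
    return operations > io_items


if __name__ == "__main__":
    for n in (4, 64, 1024):
        print(n,
              "matmul compute-bound:", is_compute_bound(*matmul_counts(n)),
              "| matadd compute-bound:", is_compute_bound(*matadd_counts(n)))
```

Under this counting, matrix multiplication satisfies the criterion once n exceeds 3 and the gap grows with n, while matrix addition never does, because its operation count stays below its I/O count.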
Memory access energy cost
- A memory access consumes ~1000x the energy of a complex addition (Dally, HiPEAC 2015)
Adapted from Prof. Onur Mutlu's slides (Computer Architecture FS 2018)
The key architectural requirements
1. Simple and regular
   - Decrease the design cost
   - Modular
   - Adjustable to performance goal
2. High concurrency
   - The main way to build faster computer systems
3. Simple communication
   - Tends to get more complex as concurrency increases
4. Balance of computation with I/O
   - The system shouldn't spend its time waiting for I/O operations
The goal
1. Accumulate the ideas of the author's previous work
   - Kung had already published multiple papers on systolic architectures
2. Correct the ad hoc approach by providing a general guideline
   - How to map high-level computations to hardware
   - The designs should respect the given requirements
   - Easy-to-use guideline
Novelty

The conventional approach
- I/O bandwidth: 10 MB/s
- Each operation uses 2 bytes
- At most 5 million operations per second
The systolic approach
- Same conditions as before
  - Up to 6x improvement
- Systolic:
  - The memory "pumps" data to the processing elements
  - Like the heart pumps blood to the body cells
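The throughput figures on the two slides above follow from a short calculation, sketched here in Python. The helper name `max_ops_per_second` and its parameters are hypothetical. The factor of 6 is assumed to be the number of processing elements each fetched item passes through, which is one plausible reading of "up to 6x".

```python
def max_ops_per_second(io_bandwidth_bytes_per_s, bytes_per_operation, uses_per_fetch=1):
    """Upper bound on operations per second when memory I/O is the limit.

    uses_per_fetch is how many operations consume each fetched item:
    1 for the conventional approach, assumed equal to the number of
    processing elements for the systolic approach.
    """
    fetches_per_second = io_bandwidth_bytes_per_s / bytes_per_operation
    return fetches_per_second * uses_per_fetch


if __name__ == "__main__":
    conventional = max_ops_per_second(10_000_000, 2)                 # 10 MB/s, 2 bytes per operation
    systolic = max_ops_per_second(10_000_000, 2, uses_per_fetch=6)   # assumed: 6 processing elements
    print(f"conventional: {conventional / 1e6:.0f} million operations/s")
    print(f"systolic:     {systolic / 1e6:.0f} million operations/s")
```

The memory bandwidth is identical in both cases. The gain comes only from reusing each fetched item several times before it leaves the array.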
Both approaches visualized

Key Approach and Ideas

The structure of a systolic architecture
- A systolic architecture is composed of multiple processing elements (cells)
- Only cells at the boundary can be I/O ports of the system
- Partial results and inputs flow inside the system
- Cells are interconnected to form simple and regular structures:
  - Trees
  - Arrays
  - Grids
Image source: Sano K., Nakahara H. (2018) Hardware Algorithms. In: Amano H. (eds) Principles and Structures of FPGAs. Springer, Singapore
Mechanisms

Problems solvable by systolic architectures
- A sample of problems with known systolic solutions:
  - Signal and image processing:
    - Convolution
    - Discrete Fourier transform
    - Interpolation
  - Matrix arithmetic:
    - Matrix multiplication
    - QR decomposition of matrices
    - Linear systems of equations
  - Non-numeric applications:
    - Regular expressions
    - Dynamic programming
    - Encoders (polynomial division)
An exemplar compute-bound problem
- The convolution problem
- Given:
  - The sequence of weights $\{w_1, w_2, \dots, w_k\}$
  - The sequence of inputs $\{x_1, x_2, \dots, x_n\}$
- Compute:
  - The sequence $\{y_1, y_2, \dots, y_{n+1-k}\}$
  - Defined by $y_i = w_1 x_i + w_2 x_{i+1} + \dots + w_k x_{i+k-1}$
- This problem is regular and compute-bound
- There are many related problems, e.g. pattern matching
Example convolution problem instance
- Given:
  - The sequence of weights: {2, 1, 4}
  - The sequence of inputs: {5, 0, -7, 3, 1}
- The output sequence {y_1, y_2, y_3} is computed as follows:
  - y_1 = 2*5 + 1*0 + 4*(-7) = -18
  - y_2 = 2*0 + 1*(-7) + 4*3 = 5
  - y_3 = 2*(-7) + 1*3 + 4*1 = -7
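As a software reference for this instance, the definition of y_i can be written directly in Python. This is a plain sequential sketch, not a systolic design. The function name `convolve` is chosen here for illustration, and the indices are 0-based, so the code's y[0] corresponds to y_1 on the slide.

```python
def convolve(weights, inputs):
    """Compute y_i = w_1*x_i + w_2*x_(i+1) + ... + w_k*x_(i+k-1) for every valid i."""
    k, n = len(weights), len(inputs)
    return [
        sum(weights[j] * inputs[i + j] for j in range(k))
        for i in range(n - k + 1)      # n + 1 - k outputs, as on the slide
    ]


if __name__ == "__main__":
    print(convolve([2, 1, 4], [5, 0, -7, 3, 1]))   # [-18, 5, -7]
```

Each input is read up to k times here. The systolic designs aim to get the same reuse from a single fetch per input.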
The proposed designs
- Three different systolic systems will be presented:
  1. Broadcast: A semi-systolic solution where the input sequence is broadcast to the cells
  2. Low-latency: A pure systolic solution with low output latency
  3. High-throughput: A pure systolic solution where no cell is idle during usage
1. Broadcast
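The figure for this design is not part of the transcript, so the following is a cycle-by-cycle Python sketch of one way a broadcast design can work, not necessarily the exact design pictured. It assumes each cell holds one fixed weight, every input is broadcast to all cells in the same cycle, and partial results shift one cell to the right per cycle. The name `broadcast_convolution` is hypothetical.

```python
def broadcast_convolution(weights, inputs):
    """Simulate a linear array of cells with broadcast inputs.

    Assumed data movement: weights stay in their cells, each input x is
    seen by all cells in the same cycle, and partial results move from
    cell j to cell j + 1 once per cycle.
    """
    k = len(weights)
    y_regs = [0] * k          # y_regs[j]: partial result produced by cell j last cycle
    outputs = []
    for t, x in enumerate(inputs):
        # Every cell adds (its weight * broadcast x) to the partial result
        # arriving from its left neighbour; cell 0 starts a new result at 0.
        y_regs = [
            (y_regs[j - 1] if j > 0 else 0) + weights[j] * x
            for j in range(k)
        ]
        # Once the pipeline is full, the last cell emits one finished result per cycle.
        if t >= k - 1:
            outputs.append(y_regs[k - 1])
    return outputs


if __name__ == "__main__":
    print(broadcast_convolution([2, 1, 4], [5, 0, -7, 3, 1]))   # [-18, 5, -7]
```

In this model each cell needs only three connections (the broadcast input, a partial result in, a partial result out), but the broadcast line must reach every cell, which is why such a design is hard to scale.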
2. Low-latency
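The figure for this design is also missing from the transcript. The Python sketch below models one pure systolic arrangement with the low-latency property, under stated assumptions: weights stay in the cells, inputs travel right to left, partial results travel left to right, and a new input enters only every second cycle. The function name `low_latency_convolution` and the direction conventions are choices made here for illustration.

```python
def low_latency_convolution(weights, inputs):
    """Simulate a linear systolic array where inputs and results move in opposite directions."""
    k, n = len(weights), len(inputs)
    num_outputs = n - k + 1
    x_at = [None] * k         # input value currently held in each cell (None = bubble)
    y_at = [None] * k         # partial result currently held in each cell (None = bubble)
    outputs = []
    for t in range(2 * n - 1):
        # Inputs move right to left; a new input enters the last cell on even cycles.
        x_at = x_at[1:] + [inputs[t // 2] if t % 2 == 0 else None]
        # Partial results move left to right; result i enters cell 0 at cycle 2*i + k - 1.
        offset = t - (k - 1)
        starts_now = offset >= 0 and offset % 2 == 0 and offset // 2 < num_outputs
        y_at = [0 if starts_now else None] + y_at[:-1]
        # A cell does useful work only when an input and a partial result meet in it.
        for c in range(k):
            if x_at[c] is not None and y_at[c] is not None:
                y_at[c] += weights[c] * x_at[c]
        # A result is complete as soon as the last cell has updated it.
        if y_at[k - 1] is not None:
            outputs.append(y_at[k - 1])
    return outputs


if __name__ == "__main__":
    print(low_latency_convolution([2, 1, 4], [5, 0, -7, 3, 1]))   # [-18, 5, -7]
```

In this model a result leaves the array in the same cycle in which its last input enters, regardless of the number of weights. The price is that inputs and results are separated by bubbles, so only about half of the cells do useful work in any cycle.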
3. High-throughput
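Again without the original figure, the following Python sketch models one arrangement in which every cell works in every cycle once the array is full. The assumptions: each cell keeps its own partial result, inputs and weights both move left to right, weights pass through two registers per cell (so they move at half the speed of the inputs), the weight sequence is recirculated, and a dictionary stands in for the bus that collects finished results. The name `high_throughput_convolution` is hypothetical.

```python
def high_throughput_convolution(weights, inputs):
    """Simulate a linear systolic array in which results stay and inputs/weights move."""
    k, n = len(weights), len(inputs)
    x_pipe = [None] * k           # one input register per cell
    w_pipe = [None] * (2 * k)     # two weight registers per cell: weights move at half speed
    acc = [0] * k                 # each cell accumulates its own partial result
    collected = {}                # stands in for the bus collecting finished results
    for t in range(n + k - 1):
        # Tag items with their index so a finished result can be identified.
        new_x = (t, inputs[t]) if t < n else None
        new_w = (t % k, weights[t % k])        # the weight sequence is recirculated
        x_pipe = [new_x] + x_pipe[:-1]
        w_pipe = [new_w] + w_pipe[:-1]
        for c in range(k):
            x, w = x_pipe[c], w_pipe[2 * c]    # cell c uses the first of its two weight registers
            if x is None or w is None:
                continue                       # pipeline still filling or already draining
            (m, x_val), (j, w_val) = x, w
            acc[c] += w_val * x_val
            if j == k - 1:                     # last weight seen: result y_(m-j) is complete
                collected[m - j] = acc[c]
                acc[c] = 0
    return [collected[i] for i in range(n - k + 1)]


if __name__ == "__main__":
    print(high_throughput_convolution([2, 1, 4], [5, 0, -7, 3, 1]))   # [-18, 5, -7]
```

In this model results finish out of order across the cells, which is why some collection mechanism is needed, and a result is available only after all weights have passed its cell, so the response time grows with the number of weights.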
Comparison of the designs
1. Broadcast
   - Advantages: Simplest design; cells use only 3 I/O ports
   - Disadvantages: Does NOT scale well
2. Low-latency
   - Advantages: Simplest pure systolic design
   - Disadvantages: Requires more I/O; only half of the cells are used at any given time
3. High-throughput
   - Advantages: Works with an unbounded number of weights; the partial results stay in the cells*
   - Disadvantages: Requires a bus to collect results; more complex than 1 and 2; response time depends on the number of weights
*Partial results often carry more bits because of numerical accuracy
Key Results: Methodology and Evaluation

Key properties of systolic architectures
- Criteria of systolic designs and their effects:
  - They have simple and regular control flow
    → Simplicity, modularity, expandability, and high performance
  - They only use a few types of simple cells
    → Simplicity
  - They are highly concurrent by design
    → High performance
  - They use each input data item multiple times
    → High performance
- Highly scalable
  - Performance increases proportionally with the number of cells
Summary

Summary
- Special-purpose systems often
  - Have high design cost
  - Are designed ad hoc
  - Repeat known errors
- Systolic systems
  - Are simple and easy to design
  - Avoid the pitfalls of special-purpose system designs
    - Modular, expandable, and high performance
  - Are applicable to many (if not all) problems where it makes sense to build special-purpose systems
- Systolic systems geared to different applications can be obtained with little effort
Strengths

Strengths of the paper
- Intuitive idea
- Well-structured paper, with good flow
- Many different examples of systolic systems are presented
  - The tradeoffs between different designs are discussed
- General approach to common problems
  - Many compute-bound problems have a systolic solution
- Scalable and adaptive designs
  - Adaptable to different I/O bandwidth and problem size
- The paper is still relevant today! (36 years after)
  - More than 3000 citations, ~40 citations/year since 2000
  - Google's TPU is a systolic system at its heart
The heart of Google's first TPU
Source: Google product news (17.5.2017)
Performance of Google's first TPU
Source: Google product news (17.5.2017)
Performance of Google's first TPU
Source: Google product news (17.5.2017)
Weaknesses

Weaknesses of the paper
- No data to support the claim that systolic architectures are a viable alternative to ad hoc architectures
  - Design time
  - Energy efficiency
  - Performance
- Approach still limited by I/O bottleneck
  - In-memory accelerators don't share the same bottleneck
- It's difficult to design systolic systems for compute-bound problems which aren't inherently regular
  - Sparse matrix multiplication
- Difficult to debug
  - Partial results aren't exposed to the programmer
Thoughts and Ideas

Thoughts and ideas
- Can the design of systolic architectures be automated?
  - Still an open problem, but some can be designed automatically
- How can systolic architectures be specified and verified without building prototypes?
  - Are there simulation frameworks for systolic architectures?
- Systolic architectures map well to FPGAs
Simplified FPGA schematic
Source: David Norwood's master thesis
Thoughts and ideas
- Are there compute-bound problems with no systolic solution?
- Are there alternatives to systolic systems?
  - GPU, in-memory accelerators, …
- Can there be general-purpose systolic structures?
  - Yes, iWarp [1990, CMU & Intel]
A view of the iWarp
"The initial demonstration iWarp system in 1990 is an 8 x 8 torus"
Prof. Thomas Gross (ETHZ) participated in the design of the iWarp
Takeaways

Key takeaways
- General guideline to simple, efficient, and scalable designs
- Principled approach to the design of special-purpose systems
- Avoid designing ad hoc systems when possible
  - Avoid known pitfalls
- Less successful ideas may have an impact in the future
  - See Google's TPU
Open Discussion