- Slides: 41

Sven Gregorio
Seminar on Computer Architecture
Why systolic architectures?
Hsiang-Tsung Kung, Carnegie Mellon University
IEEE Computer, 1982

Background, Problem & Goal 2

Special-purpose systems and their cost 3
- Many high-performance special-purpose systems are produced
  - General-purpose systems aren't always able to meet performance constraints
- Their cost is composed of design cost and parts cost
- Design cost tends to dominate parts cost
  - Special-purpose systems are usually produced in small quantities
- Special-purpose systems are often designed ad hoc
  - The designs solve one task only and aren’t generalizable
- The same errors are often repeated
  - Most notably: I/O imbalance

Why special-purpose systems? 4
- There is an interest in speeding up compute-bound computations
  - Compute-bound: #operations > #inputs + #outputs
    - E.g. matrix multiplication
  - Non-compute-bound computations are I/O-bound
    - E.g. matrix addition
- These computations tend to be too taxing for CPUs
  - Von Neumann bottleneck: for each operation, at least one operand has to be fetched
    - Compute-bound computations become I/O-bound
  - Memory bandwidth often isn't enough to keep the CPU pipeline filled
  - Memory accesses are costly in terms of energy
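The compute-bound criterion on this slide can be checked with simple counts. The sketch below assumes dense n x n matrices, the schoolbook multiplication algorithm, and that each scalar multiplication or addition counts as one operation. The function names are illustrative and do not come from the paper.

```python
def is_compute_bound(operations, inputs, outputs):
    """Slide criterion: compute-bound if #operations > #inputs + #outputs."""
    return operations > inputs + outputs


def matmul_counts(n):
    """Counts for multiplying two dense n x n matrices (assumed schoolbook algorithm)."""
    operations = n**3 + n**2 * (n - 1)  # n^3 multiplications, n^2 * (n - 1) additions
    return operations, 2 * n**2, n**2   # operations, input elements, output elements


def matadd_counts(n):
    """Counts for adding two dense n x n matrices."""
    return n**2, 2 * n**2, n**2         # one addition per output element


n = 100
print(is_compute_bound(*matmul_counts(n)))  # True
print(is_compute_bound(*matadd_counts(n)))  # False
```

Under these assumptions the operation count of multiplication grows with n^3 while its I/O grows with n^2, whereas addition needs only one operation per three matrix elements moved.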

Memory access energy cost 5
Dally, HiPEAC 2015
A memory access consumes ~1000x the energy of a complex addition
Adapted from Prof. Onur Mutlu’s slides (Computer Architecture FS 2018)

The key architectural requirements 6
1. Simple and regular
   - Decrease the design cost
   - Modular
   - Adjustable to performance goal
2. High concurrency
   - The main way to build faster computer systems
3. Simple communication
   - Tends to get more complex as concurrency increases
4. Balance of computation with I/O
   - The system shouldn’t spend its time waiting for I/O operations

The goal 7
1. Accumulate the ideas of the author’s previous work
   - Kung had already published multiple papers on systolic architectures
2. Correct the ad hoc approach by providing a general guideline
   - How to map high-level computations to hardware
   - The designs should respect the given requirements
   - Easy to use guideline

Novelty 8

The conventional approach 9
- I/O bandwidth: 10 MB/s
- Each operation uses 2 bytes
- At most 5 million operations per second

The systolic approach 10
- Same conditions as before
  - Up to 6x improvement
- Systolic:
  - The memory “pumps” data to the processing elements
  - Like the heart pumps blood to the body cells
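The arithmetic behind the two approaches can be written out as a small sketch. It assumes that 1 MB is 10^6 bytes and that the 6x figure corresponds to an array of six cells, each performing one operation on every item fetched from memory. The helper name and its parameters are hypothetical.

```python
def max_ops_per_second(bandwidth_bytes_per_s, bytes_per_op, uses_per_fetched_item=1):
    """Upper bound on the operation rate when memory bandwidth is the only limit."""
    fetches_per_second = bandwidth_bytes_per_s // bytes_per_op
    return fetches_per_second * uses_per_fetched_item


BANDWIDTH = 10_000_000  # 10 MB/s, taking 1 MB = 10^6 bytes (assumption)
BYTES_PER_OP = 2

conventional = max_ops_per_second(BANDWIDTH, BYTES_PER_OP)
# Assumption: six cells, each performing one operation per fetched item.
systolic = max_ops_per_second(BANDWIDTH, BYTES_PER_OP, uses_per_fetched_item=6)

print(conventional)              # 5000000
print(systolic // conventional)  # 6
```

The memory bandwidth is identical in both cases. The gain comes only from reusing each fetched item several times before it leaves the array.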

Both approaches visualized 11

Key Approach and Ideas 12

The structure of a systolic architecture 13
- A systolic architecture is composed of multiple processing elements (cells)
- Only cells at the boundary can be I/O ports of the system
- Partial results and inputs flow inside the system
- Cells are interconnected to form simple and regular structures:
  - Trees
  - Arrays
  - Grids
Image source: Sano K., Nakahara H. (2018) Hardware Algorithms. In: Amano H. (eds) Principles and Structures of FPGAs. Springer, Singapore

Mechanisms 14

Problems solvable by systolic architectures 15
- A sample of problems with known systolic solutions:
  - Signal and image processing:
    - Convolution
    - Discrete Fourier transform
    - Interpolation
  - Matrix arithmetic:
    - Matrix multiplication
    - QR decomposition of matrices
    - Linear systems of equations
  - Non-numeric applications:
    - Regular expressions
    - Dynamic programming
    - Encoders (polynomial division)

An exemplar compute-bound problem 16
- The convolution problem
- Given:
  - The sequence of weights $\{w_1, w_2, \dots, w_k\}$
  - The sequence of inputs $\{x_1, x_2, \dots, x_n\}$
- Compute:
  - The sequence $\{y_1, y_2, \dots, y_{n+1-k}\}$
  - Defined by $y_i = w_1 x_i + w_2 x_{i+1} + \dots + w_k x_{i+k-1}$
- This problem is regular and compute-bound
- There are many related problems, e.g. pattern matching

Example convolution problem instance 17
- Given:
  - The sequence of weights: {2, 1, 4}
  - The sequence of inputs: {5, 0, -7, 3, 1}
- The output sequence $\{y_1, y_2, y_3\}$ is computed as follows:
  - $y_1 = 2 \cdot 5 + 1 \cdot 0 + 4 \cdot (-7) = -18$
  - $y_2 = 2 \cdot 0 + 1 \cdot (-7) + 4 \cdot 3 = 5$
  - $y_3 = 2 \cdot (-7) + 1 \cdot 3 + 4 \cdot 1 = -7$
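The example instance can be reproduced with a direct, non-systolic evaluation of the definition. This is only a reference sketch for checking results. The function name is illustrative, and the code uses 0-based indices where the slides use 1-based ones.

```python
def convolve(weights, inputs):
    """Direct evaluation of y_i = w_1*x_i + w_2*x_(i+1) + ... + w_k*x_(i+k-1)."""
    k, n = len(weights), len(inputs)
    return [
        sum(weights[j] * inputs[i + j] for j in range(k))
        for i in range(n + 1 - k)  # n + 1 - k outputs, as in the problem statement
    ]


print(convolve([2, 1, 4], [5, 0, -7, 3, 1]))  # [-18, 5, -7]
```

Each output needs k multiplications, so the total work is k * (n + 1 - k) multiplications while only n inputs and k weights are read.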

The proposed designs 18
- Three different systolic systems will be presented:
  1. Broadcast: A semi-systolic solution where the input sequence is broadcast to the cells
  2. Low-latency: A pure systolic solution with low output latency
  3. High-throughput: A pure systolic solution where no cell is idle during usage

1. Broadcast 19
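The figure for this design is not reproduced in the text. As a rough sketch of how a broadcast design can work, the simulation below assumes that each cell keeps one weight, that every input is broadcast to all cells in the same cycle, and that partial results move one cell to the right per cycle. The function name and the register layout are hypothetical and may differ from the design shown on the slide.

```python
def broadcast_convolution(weights, inputs):
    """Cycle-by-cycle sketch: weights stay, inputs are broadcast, partial results move right."""
    k = len(weights)
    y_regs = [0] * k  # partial result leaving each cell at the end of the previous cycle
    outputs = []
    for cycle, x in enumerate(inputs):
        # Each cell reads its left neighbour's previous result; the first cell starts from 0.
        y_in = [0] + y_regs[:-1]
        # All cells see the same broadcast input x and update simultaneously.
        y_regs = [y_in[c] + weights[c] * x for c in range(k)]
        # The last cell emits a complete result once k inputs have been broadcast.
        if cycle >= k - 1:
            outputs.append(y_regs[-1])
    return outputs


print(broadcast_convolution([2, 1, 4], [5, 0, -7, 3, 1]))  # [-18, 5, -7]
```

In this sketch each cell has three connections: the broadcast input, the incoming partial result, and the outgoing partial result. The broadcast line has to reach every cell, which is one plausible reason why such a design is harder to scale.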

2. Low-latency 20
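The figure for this design is not reproduced in the text either. The sketch below simulates one pure systolic arrangement with constant output latency. It assumes that weights stay in the cells, inputs move to the right, partial results move to the left, and a new input is fed only every other cycle. Whether this matches the slide's figure exactly is an assumption, and all names are hypothetical.

```python
def low_latency_convolution(weights, inputs):
    """Cycle-by-cycle sketch: weights stay, inputs move right, partial results move left."""
    k, n = len(weights), len(inputs)
    cell_weights = list(reversed(weights))  # leftmost cell holds w_k, rightmost holds w_1
    x_regs = [0] * k  # input value currently held by each cell (0 = empty slot)
    y_regs = [0] * k  # partial result currently held by each cell
    outputs = []
    for cycle in range(2 * n - 1):
        # A new input enters the leftmost cell only on every other cycle.
        x_feed = inputs[cycle // 2] if cycle % 2 == 0 else 0
        x_in = [x_feed] + x_regs[:-1]  # inputs shift one cell to the right
        y_in = y_regs[1:] + [0]        # partial results shift one cell to the left
        x_regs = x_in
        y_regs = [y_in[c] + cell_weights[c] * x_in[c] for c in range(k)]
        # A finished result leaves the leftmost cell in the same cycle
        # in which its last input enters.
        if cycle % 2 == 0 and cycle >= 2 * (k - 1):
            outputs.append(y_regs[0])
    return outputs


print(low_latency_convolution([2, 1, 4], [5, 0, -7, 3, 1]))  # [-18, 5, -7]
```

Because inputs are spaced two cycles apart in this sketch, every other slot travelling through the array is empty, so only about half of the cells do useful work in any given cycle. No broadcast line is needed, since every cell talks only to its neighbours.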

3. High-throughput 21

Comparison of the designs 22

| Nr | Design | Advantages | Disadvantages |
|----|--------|------------|---------------|
| 1 | Broadcast | Simplest design; cells use only 3 I/O ports | Does NOT scale well |
| 2 | Low-latency | Simplest pure systolic design | Requires more I/O; only half of the cells are used at any given time |
| 3 | High-throughput | Works with an unbounded number of weights; the partial results stay in the cells* | Requires a bus to collect results; more complex than 1 and 2; response time depends on the number of weights |

*Partial results often carry more bits because of numerical accuracy

Key Results: Methodology and Evaluation 23

Key properties of systolic architectures 24
- Criteria of systolic designs:
  - They have simple and regular control flow
  - They only use a few types of simple cells
  - They are highly concurrent by design
  - They use each input data item multiple times
- Their effects: simplicity, modularity, expandability, and high performance
- Highly scalable
  - Performance increases proportionally with the number of cells

Summary 25

Summary 26
- Special-purpose systems often
  - Have high design cost
  - Are designed ad hoc
  - Repeat known errors
- Systolic systems
  - Are simple and easy to design
  - Avoid the pitfalls of special-purpose system designs
  - Are modular, expandable, and high performance
  - Are applicable to many (if not all) problems where it makes sense to build special-purpose systems
- Systolic systems geared to different applications can be obtained with little effort

Strengths 27

Strengths of the paper 28
- Intuitive idea
- Well-structured paper, with good flow
- Many different examples of systolic systems are presented
  - The tradeoffs between different designs are discussed
- General approach to common problems
  - Many compute-bound problems have a systolic solution
- Scalable and adaptive designs
  - Adaptable to different I/O bandwidth and problem size
- The paper is still relevant today! (36 years later)
  - More than 3000 citations, ~40 citations/year since 2000
  - Google’s TPU is a systolic system at its heart

The heart of Google’s first TPU 29
Source: Google product news (17.5.2017)

Performance of Google’s first TPU 30
Source: Google product news (17.5.2017)

Performance of Google’s first TPU 31
Source: Google product news (17.5.2017)

Weaknesses 32

Weaknesses of the paper 33
- No data to support the claim that systolic architectures are a viable alternative to ad hoc architectures
  - Design time
  - Energy efficiency
  - Performance
- Approach still limited by I/O bottleneck
  - In-memory accelerators don’t share the same bottleneck
- It’s difficult to design systolic systems for compute-bound problems which aren’t inherently regular
  - Sparse matrix multiplication
- Difficult to debug
  - Partial results aren’t exposed to the programmer

Thoughts and Ideas 34

Thoughts and ideas 35
- Can the design of systolic architectures be automated?
  - Still an open problem, but some can be designed automatically
- How can systolic architectures be specified and verified without building prototypes?
  - Are there simulation frameworks for systolic architectures?
- Systolic architectures map well to FPGAs

Simplified FPGA schematic 36
Source: David Norwood’s master thesis

Thoughts and ideas 37
- Are there compute-bound problems with no systolic solution?
- Are there alternatives to systolic systems?
  - GPU, in-memory accelerators, …
- Can there be general-purpose systolic structures?
  - Yes, iWarp [1990, CMU & Intel]

A view of the iWarp 38
“The initial demonstration iWarp system in 1990 is an 8 x 8 torus”
Prof. Thomas Gross (ETHZ) participated in the design of the iWarp

Takeaways 39

Key takeaways 40
- General guideline to simple, efficient, and scalable designs
- Principled approach to the design of special-purpose systems
- Avoid designing ad hoc systems when possible
  - Avoid known pitfalls
- Less successful ideas may have an impact in the future
  - See Google’s TPU

Open Discussion 41