Sven Gregorio
Seminar on Computer Architecture
Why systolic architectures?
Hsiang-Tsung Kung
Carnegie Mellon University
IEEE Computer, 1982

Background, Problem & Goal

Special-purpose systems and their cost
• Many high-performance special-purpose systems are produced
  ◦ General-purpose systems aren't always able to meet performance constraints
• Their cost is composed of design and parts cost
• Design cost tends to dominate the parts cost
  ◦ Special-purpose systems are usually produced in small quantities
• Special-purpose systems are often designed ad hoc
  ◦ The designs solve one task only and aren’t generalizable
• The same errors are often repeated
  ◦ Most notably: I/O imbalance

Why special-purpose systems?
• There is an interest in speeding up compute-bound computations
  ◦ Compute-bound: #operations > #inputs + #outputs
    ▪ E.g. matrix multiplication
  ◦ Non-compute-bound computations are I/O bound
    ▪ E.g. matrix addition
• These computations tend to be too taxing for CPUs
  ◦ Von Neumann bottleneck: for each operation at least one operand has to be fetched
    ▪ Compute-bound computations become I/O bound
  ◦ Memory bandwidth often isn't enough to keep the CPU pipeline filled
  ◦ Memory accesses are costly in terms of energy
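
The compute-bound test on this slide can be checked with a few lines of Python. This is a minimal sketch, not taken from the paper: the function names are hypothetical, and the operation counts assume the schoolbook algorithms on square n x n matrices (n^3 multiplications plus n^2(n-1) additions for multiplication, n^2 additions for matrix addition).

```python
def is_compute_bound(operations, inputs, outputs):
    """Slide definition: compute-bound if #operations > #inputs + #outputs."""
    return operations > inputs + outputs


def matmul_counts(n):
    """(operations, inputs, outputs) for schoolbook n x n matrix multiplication."""
    operations = n**3 + n**2 * (n - 1)  # multiplications + additions
    return operations, 2 * n * n, n * n


def matadd_counts(n):
    """(operations, inputs, outputs) for n x n matrix addition."""
    return n * n, 2 * n * n, n * n


n = 100  # illustrative size, chosen arbitrarily
print("matrix multiplication compute-bound:", is_compute_bound(*matmul_counts(n)))
print("matrix addition compute-bound:", is_compute_bound(*matadd_counts(n)))
```

For n = 100 the multiplication performs 1,990,000 operations on 30,000 I/O items, while the addition performs 10,000 operations on the same 30,000 items, so only the multiplication passes the test.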

Memory access energy cost
Dally, HiPEAC 2015
A memory access consumes ~1000x the energy of a complex addition
Adapted from Prof. Onur Mutlu’s slides (Computer Architecture FS 2018)

The key architectural requirements
1. Simple and regular
   ◦ Decrease the design cost
   ◦ Modular
   ◦ Adjustable to performance goal
2. High concurrency
   ◦ The main way to build faster computer systems
3. Simple communication
   ◦ Tends to get more complex as concurrency increases
4. Balance of computation with I/O
   ◦ The system shouldn’t spend its time waiting for I/O operations

The goal
1. Accumulate the ideas of the author’s previous work
   ◦ Kung had already published multiple papers on systolic architectures
2. Correct the ad hoc approach by providing a general guideline
   ◦ How to map high-level computations to hardware
   ◦ The designs should respect the given requirements
   ◦ Easy-to-use guideline

Novelty

The conventional approach
• I/O bandwidth: 10 MB/s
• Each operation uses 2 bytes
• At most 5 million operations per second

The systolic approach
• Same conditions as before
  ◦ Up to 6x improvement
• Systolic:
  ◦ The memory “pumps” data to the processing elements
  ◦ Like the heart pumps blood to the body cells
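
The numbers on the conventional and systolic slides follow from a one-line bound, sketched here in Python. The function name is hypothetical, 1 MB is taken as 10^6 bytes, and the cell count of 6 is an assumption chosen to match the "up to 6x" figure: each item fetched from memory is assumed to be used once by every cell it passes through.

```python
def max_ops_per_second(io_bytes_per_second, bytes_per_operation, cells=1):
    """Upper bound on operations/s when each fetched item is used once per cell."""
    items_per_second = io_bytes_per_second // bytes_per_operation
    return items_per_second * cells


conventional = max_ops_per_second(10_000_000, 2)       # a single processing element
systolic = max_ops_per_second(10_000_000, 2, cells=6)  # six cells reuse every item
print(conventional, systolic, systolic // conventional)
```

The I/O bandwidth is identical in both calls; the gain comes only from reusing each fetched item in several cells before it leaves the array.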

Both approaches visualized

Key Approach and Ideas

The structure of a systolic architecture
• A systolic architecture is composed of multiple processing elements (cells)
• Only cells at the boundary can be I/O ports of the system
• Partial results and inputs flow inside the system
• Cells are interconnected to form simple and regular structures:
  ◦ Trees
  ◦ Arrays
  ◦ Grids
Image source: Sano K., Nakahara H. (2018) Hardware Algorithms. In: Amano H. (eds) Principles and Structures of FPGAs. Springer, Singapore

Mechanisms

Problems solvable by systolic architectures
• A sample of problems with known systolic solutions:
  ◦ Signal and image processing:
    ▪ Convolution
    ▪ Discrete Fourier transform
    ▪ Interpolation
  ◦ Matrix arithmetic:
    ▪ Matrix multiplication
    ▪ QR decomposition of matrices
    ▪ Linear systems of equations
  ◦ Non-numeric applications:
    ▪ Regular expressions
    ▪ Dynamic programming
    ▪ Encoders (polynomial division)

An exemplar compute-bound problem
• The convolution problem
• Given:
  ◦ The sequence of weights {w_1, w_2, …, w_k}
  ◦ The sequence of inputs {x_1, x_2, …, x_n}
• Compute:
  ◦ The sequence {y_1, y_2, …, y_{n+1-k}}
  ◦ Defined by y_i = w_1 x_i + w_2 x_{i+1} + … + w_k x_{i+k-1}
• This problem is regular and compute-bound
• There are many related problems, e.g. pattern matching

Example convolution problem instance
• Given:
  ◦ The sequence of weights: {2, 1, 4}
  ◦ The sequence of inputs: {5, 0, -7, 3, 1}
• The output sequence {y_1, y_2, y_3} is computed as follows
  ◦ y_1 = 2*5 + 1*0 + 4*(-7) = -18
  ◦ y_2 = 2*0 + 1*(-7) + 4*3 = 5
  ◦ y_3 = 2*(-7) + 1*3 + 4*1 = -7
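
As a software reference for this instance, the definition y_i = w_1 x_i + … + w_k x_{i+k-1} can be evaluated directly. This is a plain sequential sketch (the name `convolve` is hypothetical), not a systolic design; it is only meant as a baseline against which hardware-style simulations can be compared.

```python
def convolve(weights, inputs):
    """Direct evaluation of y_i = sum_j w_j * x_{i+j-1}, using 0-based indices."""
    k, n = len(weights), len(inputs)
    return [
        sum(weights[j] * inputs[i + j] for j in range(k))
        for i in range(n + 1 - k)  # there are n + 1 - k outputs
    ]


print(convolve([2, 1, 4], [5, 0, -7, 3, 1]))  # [-18, 5, -7]
```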

The proposed designs
• Three different systolic systems will be presented:
  1. Broadcast: A semi-systolic solution where the input sequence is broadcast to the cells
  2. Low-latency: A pure systolic solution with low output latency
  3. High-throughput: A pure systolic solution where no cell is idle during usage

1. Broadcast
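
The slide's figure is not available in this transcript, so the following cycle-by-cycle simulation is only a sketch of one possible broadcast design. It assumes that each cell permanently holds one weight, that the current input is broadcast to all cells in every cycle, and that partial sums shift one cell to the right per cycle; the function name and these details are assumptions, not the presenter's exact design.

```python
def broadcast_convolution(weights, inputs):
    """Simulate a semi-systolic array: inputs are broadcast, partial sums move right."""
    k = len(weights)
    partial = [0] * k  # partial sum arriving at each cell at the start of a cycle
    outputs = []
    for cycle, x in enumerate(inputs):
        # Every cell multiplies the broadcast input by its own weight and accumulates.
        updated = [partial[j] + weights[j] * x for j in range(k)]
        # A sum leaving the rightmost cell is complete once it has passed all k cells.
        if cycle >= k - 1:
            outputs.append(updated[-1])
        # Shift the partial sums one cell to the right; a fresh zero enters on the left.
        partial = [0] + updated[:-1]
    return outputs


print(broadcast_convolution([2, 1, 4], [5, 0, -7, 3, 1]))  # [-18, 5, -7]
```

After the first k - 1 cycles the array emits one result per cycle, and every input is fetched from memory only once although it takes part in up to k multiplications.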

2. Low-latency

3. High-throughput

Comparison of the designs

Nr | Design          | Advantages                                | Disadvantages
1  | Broadcast       | • Simplest design                         | • Does NOT scale well
   |                 | • Cells use only 3 I/O ports              |
2  | Low-latency     | • Simplest pure systolic design           |
3  | High-throughput | • Works with unbounded amount of weights  | • Requires a bus to collect results
   |                 | • The partial results stay in the cells*  | • More complex than 1 and 2

• Response time depends on the number of weights
• Requires more I/O
• Only half of the cells are used at any given time

*Partial results often carry more bits because of numerical accuracy

Key Results: Methodology and Evaluation

Key properties of systolic architectures
• Criteria of systolic designs:
  ◦ They have simple and regular control flow
  ◦ They only use a few types of simple cells
  ◦ They are highly concurrent by design
  ◦ They use each input data item multiple times
• Their effects:
  ➢ High performance
  ➢ Simplicity
  ➢ Simplicity, modularity, expandability, and high performance
• Highly scalable
  ◦ Performance increases proportionally with the number of cells

Summary

Summary
• Special-purpose systems often
  ◦ Have high design cost
  ◦ Are designed ad hoc
  ◦ Repeat known errors
• Systolic systems
  ◦ Are simple and easy to design
  ◦ Avoid the pitfalls of special-purpose system designs
  ◦ Are modular, expandable, and high performance
  ◦ Are applicable to many (if not all) problems where it makes sense to build special-purpose systems
• Systolic systems geared to different applications can be obtained with little effort

Strengths

Strengths of the paper
• Intuitive idea
• Well-structured paper, with good flow
• Many different examples of systolic systems are presented
  ◦ The tradeoffs between different designs are discussed
• General approach to common problems
  ◦ Many compute-bound problems have a systolic solution
• Scalable and adaptive designs
  ◦ Adaptable to different I/O bandwidth and problem size
• The paper is still relevant today! (36 years after)
  ◦ More than 3000 citations, ~40 citations/year since 2000
  ◦ Google’s TPU is a systolic system at its heart

The heart of Google’s first TPU
Source: Google product news (17.5.2017)

Performance of Google’s first TPU
Source: Google product news (17.5.2017)

Weaknesses

Weaknesses of the paper
• No data to support the claim that systolic architectures are a viable alternative to ad hoc architectures
  ◦ Design time
  ◦ Energy efficiency
  ◦ Performance
• Approach still limited by I/O bottleneck
  ◦ In-memory accelerators don’t share the same bottleneck
• It’s difficult to design systolic systems for compute-bound problems which aren’t inherently regular
  ◦ Sparse matrix multiplication
• Difficult to debug
  ◦ Partial results aren’t exposed to the programmer

Thoughts and Ideas

Thoughts and ideas
• Can the design of systolic architectures be automated?
  ◦ Still an open problem, but some can be designed automatically
• How can systolic architectures be specified and verified without building prototypes?
  ◦ Are there simulation frameworks for systolic architectures?
• Systolic architectures map well to FPGAs

Simplified FPGA schematic
Source: David Norwood’s master thesis

Thoughts and ideas
• Are there compute-bound problems with no systolic solution?
• Are there alternatives to systolic systems?
  ◦ GPU, in-memory accelerators, …
• Can there be general-purpose systolic structures?
  ◦ Yes, iWarp [1990, CMU & Intel]

A view of the iWarp
“The initial demonstration iWarp system in 1990 is an 8 x 8 torus”
Prof. Thomas Gross (ETHZ) participated in the design of the iWarp

Takeaways

Key takeaways
• General guideline to simple, efficient, and scalable designs
• Principled approach to the design of special-purpose systems
• Avoid designing ad hoc systems when possible
  ◦ Avoid known pitfalls
• Less successful ideas may have an impact in the future
  ◦ See Google’s TPU

Open Discussion