EE 126 Computer Engineering Fall 2017 Tufts University

Instructor: Prof. Mark Hempstead, mark@ece.tufts.edu

Lecture Outline
• Administrative details
• Why take EE 126? What will you learn?
• What is Computer Architecture?
• Moore’s Law and Future Challenges for Computer Architects
• Information sheet

Instructor
• Instructor: Mark Hempstead (mark@ece.tufts.edu), Halligan Hall 235A
• Office Hours:
  – Mondays 3:30 pm – 4:30 pm
  – Tuesdays 3:00 pm – 4:00 pm
• My Background
  – Tufts undergrad in Computer Engineering
  – Ph.D. at Harvard, June 2009
  – Research Intern at Intel
  – Recently at ARM R&D in Cambridge, UK
  – Assistant Professor, Drexel University, 2010-2015

Instructor: My Research
• Power Aware Computing and Low Power VLSI Design
• Accelerator-centric computing
  – Selecting accelerators using static characterization and ASTs
  – Security of thermal side-channels in many-accelerator workloads
• Characterizing communication in workloads
• Memory systems
  – Cache replacement policies and prefetching
  – Non-volatile memory technologies
• SynchroTrace for fast simulation and design exploration
• Energy-efficient structures for high-performance processors
• Power-agile computing systems
• Power modeling of mobile devices (Android phones)

Prof. Mark Hempstead
Associate Professor, Electrical and Computer Engineering
“Energy-Efficient Computing from Hardware to Software”
Tufts Computer Architecture Lab
• Power-Agile Computing for Android Smartphones
  – Improving the energy consumption of smartphones
  – Energy-Performance Tradeoff
  – Power consumption and computational needs change rapidly
  – Combines hardware and software systems to automatically stay under energy constraints
• Selecting Hardware Accelerators for Energy-Efficient Computing
  – Out-of-Core Accelerators
  – Accelerating common applications with hardware
  – Future of computing is threatened by increasing power density
  – Traditional microprocessors are not enough. New application-specific hardware is required
  – Using software compilers and high-level synthesis to …

Resources
• Text: "Computer Organization and Design" by Patterson & Hennessy (5th Ed., 2013)
  – Morgan Kaufmann
  – Print Book ISBN: 9780124077263
  – eBook ISBN: 9780124078864
• The material in the 4th revised Ed. of the textbook is the same as in our edition, but the homework problems are different.

Prerequisites
• ES 4 Digital Logic
  – Binary Addition
  – Logic Gates and Flip-Flops
  – Design of combinational logic
  – Design of state machines
  – Implementing and debugging digital systems in multiple ways (schematic, truth table, state diagram, RTL)
• Assembly programming and basic machine organization; EE 14 (Proc lab) or COMP 40
  – ISAs and instructions
  – Assembly programming
  – Interrupts and interrupt routines
  – Basic caches and interacting with memory (load-store)
• VHDL or Verilog and experience with large digital designs
  – ES 4 with EE 26 (Digital lab) recommended
• C Programming, UNIX
• Compilers, OS, Circuits/VLSI background is a plus but not required

Course Expectations
• Homework Assignments
  – Completed individually.
  – Submitted during class on paper.
• Quizzes (4 over the semester)
• Midterm + Final
  – Midterm is scheduled when the calendar says it is
  – Final will be comprehensive and held during the exam period.
• Labs (sucks up all your time)
  – VHDL implementation of a processor
  – New this year: pipeline tracker
  – Handouts will be provided this week

Why a pipeline tracker?
• It’s your lightweight intro to verification
• 2016 industry survey
  – 55% of engineers have the title “verification eng”
  – 35% are design, but spend ½ of their time in verification!
  – CAGR = 10% for verification eng, 4% for design eng.
• Turn VHDL-lab lemons into lemonade
  – Less work than before (if you use the tracker)
  – Add a useful skill to your resume
  – Probably do a little debug competition later

Grading
• Grade Formula
  – Quizzes – 10%
  – Midterm – 20%
  – Final – 30%
  – Labs + final project – 30%
  – Homework – 10%
• Late days for HW/Lab assignments
  – 5 late days per quarter per student
  – After all late days are used, the grade will be reduced by (10% multiplied by the number of days late).
• Lab makeup policy
  – Resubmit labs for up to ½ of the credit lost
  – Must be submitted before turning in the next lab
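The grade formula and late policy above are simple arithmetic, so a short sketch may help when estimating a grade. This is only an illustration: the function names are hypothetical, scores are assumed to be on a 0-100 scale, and the late penalty is assumed to be taken from the assignment's own score after the free late days run out (the slide does not spell out either point).

```python
# Weights taken from the grade formula on the slide (they sum to 100%).
WEIGHTS = {
    "quizzes": 0.10,
    "midterm": 0.20,
    "final": 0.30,
    "labs_project": 0.30,
    "homework": 0.10,
}

LATE_DAYS_ALLOWED = 5    # free late days per student
PENALTY_PER_DAY = 0.10   # 10% per late day once the free days are used up


def course_grade(scores):
    """Weighted average of component scores, each assumed to be 0-100."""
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


def late_penalized(score, days_late, late_days_left=LATE_DAYS_ALLOWED):
    """Apply the late policy to one assignment.

    Assumption: free late days are consumed first, and each remaining
    late day removes 10% of the assignment's score (never below zero).
    """
    unexcused = max(0, days_late - late_days_left)
    factor = max(0.0, 1.0 - PENALTY_PER_DAY * unexcused)
    return score * factor


example = {"quizzes": 80, "midterm": 70, "final": 90,
           "labs_project": 85, "homework": 100}
print(f"course grade: {course_grade(example):.1f}")
print(f"lab score, 2 days late with no late days left: "
      f"{late_penalized(90, 2, late_days_left=0):.1f}")
```

The weights are kept in one table so that a change to the formula is a one-line edit.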

Topics of Study & why we care
• Get through the basics of modern processor design
  – single-threaded 5-stage pipeline; 1980s technology
• Learn about pipelined systems
  – everything is pipelined
• Understand the interfaces between architecture and system software (compilers, OS)
  – Essential to understand OS/compilers/PL
  – For everyone else, it can help you write better code!
• Implement your own processor in VHDL
  – As previously discussed…

After this course…
• Computer architects strive to give maximum performance with programmer abstraction
  – Compilers, OS part of this abstraction
  – e.g. pipelining, superscalar, speculative execution, branch prediction, caching, virtual memory…
• Technology has brought us to an inflection point
  – Multiple processors on a single chip. Why?
    • Design complexity, ILP/pipelining limits, power dissipation, etc.
  – How to provide the abstraction?
  – Some burden will shift back to programmers

Estimated Schedule
(The class calendar is always the up-to-date schedule)
• Review of Assembly Programming and Machine Organization
  – Instructions and ISAs
  – The ALU and single cycle implementation
  – Introduce the MIPS ISA
  – 5-stage pipelining, hazards, branches
• Memory Hierarchy and Caches
  – Associative caches
  – Cache coherence
• Security holes; superscalar processors; multi-cores

Slide Credits
• Many of the slides and teaching materials have been adapted from the work of others:
  – Elsevier publishing company supporting material for the Patterson & Hennessy text
  – Mary Jane Irwin, PSU, CSE 431
  – David Brooks, Harvard

Review: Some Basic Definitions
• Kilobyte – 2^10 or 1,024 bytes (KB or KiB)
• Megabyte – 2^20 or 1,048,576 bytes (MB or MiB)
  – sometimes “rounded” to 10^6 or 1,000,000 bytes
• Gigabyte – 2^30 or 1,073,741,824 bytes
  – sometimes rounded to 10^9 or 1,000,000,000 bytes
• Terabyte – 2^40 or 1,099,511,627,776 bytes
  – sometimes rounded to 10^12 or 1,000,000,000,000 bytes
• Petabyte – 2^50 or 1,024 terabytes
  – sometimes rounded to 10^15 or 1,000,000,000,000,000 bytes
• Exabyte – 2^60 or 1,024 petabytes
  – sometimes rounded to 10^18 or 1,000,000,000,000,000,000 bytes
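The gap between the power-of-two definitions and their “rounded” decimal versions grows with each prefix. A minimal sketch that tabulates it (the helper names are illustrative, not part of the course material):

```python
PREFIXES = ["kilo", "mega", "giga", "tera", "peta", "exa"]


def binary_size(n):
    """Bytes in the n-th binary unit: 2^(10n), e.g. n=1 gives 1,024."""
    return 2 ** (10 * n)


def decimal_size(n):
    """Bytes in the n-th decimal unit: 10^(3n), e.g. n=1 gives 1,000."""
    return 10 ** (3 * n)


def rounding_gap(n):
    """Fraction by which the binary unit exceeds its decimal rounding."""
    return binary_size(n) / decimal_size(n) - 1


for n, prefix in enumerate(PREFIXES, start=1):
    print(f"{prefix}byte: 2^{10 * n} = {binary_size(n):,} bytes, "
          f"{100 * rounding_gap(n):.1f}% more than 10^{3 * n}")
```

The gap is about 2.4% for a kilobyte and roughly 10% by the time you reach a terabyte, which is why the two conventions are worth keeping apart.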

Quick quiz
• 10^15 shops = ? 1 pet shop
• One million aches = ? 1 MegaHertz
• 10^12 bulls = ? 1 terrible
• Reminders:
  – Kilobyte – 2^10 or 1,024 bytes (KB or KiB)
  – Megabyte – 2^20 or 10^6 bytes
  – Gigabyte – 2^30 or 10^9 bytes
  – Terabyte – 2^40 or 10^12 bytes
  – Petabyte – 2^50 or 10^15 bytes
  – Exabyte – 2^60 or 10^18 bytes

Where does this course fit into the world of computing? What is Computer Architecture?
[Figure: layered view of computing. Software: Applications (AI, DB, Graphics); Operating System; Prog. Lang, Compilers. Computer architecture: Instruction Set Architecture, Microarchitecture, System Architecture. Hardware: VLSI/Hardware Implementations. Inputs: Application Trends and Technology Trends.]

Below the Program
[Figure: layers labeled Applications software, Systems SW, and Hardware, with EE 126 marked]
• System software
  – Operating system – supervising program that interfaces the user’s program with the hardware (e.g., Linux, macOS, Windows)
    • Handles basic input and output operations
    • Allocates storage and memory
    • Provides for protected sharing among multiple applications
  – Compiler – translates programs written in a high-level language (e.g., C, Java) into instructions that the hardware can execute
• Which of these two software layers “should” care about computer architecture?

Below the Program, Con’t
• High-level language program (in C)

    void swap(int v[], int k)
    {
        int temp;
        temp = v[k];
        v[k] = v[k+1];
        v[k+1] = temp;
    }

• C compiler (one-to-many)
• Assembly language program (for MIPS)

    swap: sll $2, $5, 2
          add $2, $4, $2
          lw  $15, 0($2)
          lw  $16, 4($2)
          sw  $16, 0($2)
          sw  $15, 4($2)
          jr  $31

• Assembler (one-to-one)
• Machine (object, binary) code (for MIPS)

    000000 00000 00101 0001000010000000
    000000 00100 00010 0001000000100000
    ...
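The one-to-one assembler step can be sketched in a few lines. The example below packs the standard MIPS R-type fields (opcode 0, then rs, rt, rd, shamt, funct) for the three register-format instructions in swap; the helper names are hypothetical and the load/store (I-type) instructions are left out to keep it short.

```python
# funct codes for the R-type instructions used in swap
FUNCT_SLL = 0x00
FUNCT_JR = 0x08
FUNCT_ADD = 0x20


def encode_r_type(rs, rt, rd, shamt, funct):
    """Pack MIPS R-type fields into one 32-bit word (opcode is 0)."""
    for value, width in ((rs, 5), (rt, 5), (rd, 5), (shamt, 5), (funct, 6)):
        if not 0 <= value < (1 << width):
            raise ValueError(f"field value {value} does not fit in {width} bits")
    return (rs << 21) | (rt << 16) | (rd << 11) | (shamt << 6) | funct


def format_fields(word):
    """Show a 32-bit word as its 6/5/5/5/5/6-bit fields."""
    bits = format(word, "032b")
    parts, start = [], 0
    for width in (6, 5, 5, 5, 5, 6):
        parts.append(bits[start:start + width])
        start += width
    return " ".join(parts)


sll_word = encode_r_type(rs=0, rt=5, rd=2, shamt=2, funct=FUNCT_SLL)  # sll $2, $5, 2
add_word = encode_r_type(rs=4, rt=2, rd=2, shamt=0, funct=FUNCT_ADD)  # add $2, $4, $2
jr_word = encode_r_type(rs=31, rt=0, rd=0, shamt=0, funct=FUNCT_JR)   # jr $31

for name, word in (("sll", sll_word), ("add", add_word), ("jr", jr_word)):
    print(f"{name}: {format_fields(word)}  (0x{word:08X})")
```

Each assembly instruction maps to exactly one 32-bit word, which is what “one-to-one” means here; the compiler step is one-to-many because a single C statement can expand into several instructions.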

Advantages of Higher-Level Languages?
• What are some advantages?
  – Allow the programmer to think in a more natural language and for their intended use (Fortran for scientific computation, Cobol for business programming, Lisp for symbol manipulation, Java for web programming, …)
  – Improve programmer productivity: more understandable code that is easier to debug and validate
  – Improve program maintainability
  – Allow programs to be independent of the computer on which they are developed (compilers and assemblers can translate high-level language programs to the binary instructions of any machine)
  – Emergence of optimizing compilers that produce very efficient assembly code optimized for the target machine
• As a result, very little programming is done today at the assembler level

Instruction Set Architecture (ISA)
• ISA, or simply Architecture – the abstract interface between the hardware and the lowest level software that encompasses all the information necessary to write a machine language program, including instructions, registers, memory access, I/O, …
  – Enables implementations of varying cost and performance to run identical software
  – A great business idea, but how well do you think it works?
• The combination of the basic instruction set (the ISA) and the operating system interface is called the application binary interface (ABI)
  – ABI – the user portion of the instruction set plus the operating system interfaces used by application programmers. Defines a standard for binary portability across computers.

Under the Covers
• Five classic components of a computer – input, output, memory, datapath, and control
  – datapath + control = processor (CPU)

History of the processor world
• and why the future might be interesting…

Moore’s Law
• In 1965, Gordon Moore (who later co-founded Intel) predicted that the number of transistors that can be integrated on a single chip would double about every year; in 1975 he revised the rate to about every two years
• Moore’s Law is the tail wagging a very big dog!
[Figure: log-scale plot. Courtesy, Intel®]

Technology Scaling Road Map (ITRS)

Year                 2004  2006  2008  2010  2012  2014  2017
Feature size (nm)      90    65    45    32    22    14    10
Intg. capacity (BT)     2     4     8    16    33    83   162

• Fun facts about 45 nm transistors
  – 30 million can fit on the head of a pin
  – You could fit more than 2,000 across the width of a human hair
  – If car prices had fallen at the same rate as the price of a single transistor has since 1968, a new car today would cost about 1 cent
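To see how closely the roadmap follows a steady doubling, the sketch below compares the integration-capacity row with an idealized curve that doubles every two years starting from the 2004 value. It assumes “BT” means billions of transistors; the function name and the choice of 2004 as the baseline are illustrative.

```python
# Integration capacity from the roadmap table, assumed to be in billions
# of transistors ("BT").
ITRS_CAPACITY_BT = {2004: 2, 2006: 4, 2008: 8, 2010: 16,
                    2012: 33, 2014: 83, 2017: 162}


def projected_capacity(year, base_year=2004, base_capacity=2.0,
                       doubling_years=2.0):
    """Capacity under an idealized doubling every `doubling_years` years."""
    return base_capacity * 2 ** ((year - base_year) / doubling_years)


for year, roadmap in ITRS_CAPACITY_BT.items():
    ideal = projected_capacity(year)
    print(f"{year}: roadmap {roadmap:>4} BT, ideal doubling {ideal:7.1f} BT")
```

Through 2012 the table tracks the two-year doubling almost exactly; the later entries drift from the ideal curve in both directions.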

Another Example of Moore’s Law Impact
[Figure: DRAM capacity growth over 3 decades]

What would you do with endless transistors?
• Your ideas?

But What Happened to Clock Rates and Why?
• Clock rates hit a “power wall”
[Figure: clock rate (MHz, log scale) and power (watts) for Intel processors: 80286 (1982), 80386 (1985), 80486 (1989), Pentium (1993), Pentium Pro (1997), P4 Willamette (2001), P4 Prescott (2004), Core Duo (2007)]

Power Density creating the “Dark Silicon Problem”
[Figure from Taylor, DAC and DaSi 2012]

How have we used these transistors?
• More functionality on one chip
  – Early 1980s – 32-bit microprocessors
  – Late 1980s – On-chip Level 1 caches
  – Early/Mid 1990s – 64-bit microprocessors, superscalar (ILP)
  – Late 1990s – On-chip Level 2 caches
  – Early 2000s – Chip multiprocessors, on-chip Level 3 caches
  – Early 2010s – Many-core, SoC integration, specialized hardware
• What is next?
  – How much more cache can we put on a chip? (Itanium 2)
  – How many more cores can we put on a chip? (Niagara, etc.)
  – What else can we put on chips? (Accelerators)

Example: Intel Kaby Lake Quad Core (Core i7/i5 7400-7700)
• Introduced August 2016
• Quad core out-of-order (14-19 stages of pipeline)
  – Supports 8 threads
• 64-bit datapath
• 14 nm technology
• Three levels of caches (L1, L2, L3) on chip
• Integrated memory controller
• Integrated graphics
• 3.6 GHz clock, turbo boost up to 4.2 GHz
https://en.wikichip.org/wiki/intel/microarchitectures/kaby_lake

Example Processor: Apple A10 Fusion
• Introduced 2016 – iPhone 7
• 3.3 billion transistors
• 16 nm technology
• Integrated GPU
• 4 cores
  – 2 high-power 2.34 GHz ARMv8-A cores
  – 2 energy-efficient cores

Example Processor: Apple A11 Bionic

                       A10 Fusion              A11 Bionic
Phone                  iPhone 7                iPhone 8, 10
Technology             16 nm                   10 nm
Number of cores        4 (two slow, two fast)  6 (four slow, two fast)
Number of transistors  3.3 B                   4.3 B
Freq                   2.34 GHz                2.4 GHz
Has a TV ad            No                      Yes

• https://www.youtube.com/watch?v=QN1jHqIFEbQ
• Bionic: dedicated neural-net hardware accelerator, powers Face ID & other tasks

Crossroads: Conventional Wisdom in Comp. Arch
• Old Conventional Wisdom: Power is free, transistors are expensive
• New Conventional Wisdom: “Power wall”: power is expensive, transistors are free (can put more on a chip than we can afford to turn on)
• Old CW: Sufficiently increasing Instruction Level Parallelism via compilers, innovation (out-of-order, speculation, VLIW, …)
• New CW: “ILP wall”: law of diminishing returns on more HW for ILP
• Old CW: Multiplies are slow, memory access is fast
• New CW: “Memory wall”: memory is slow, multiplies are fast (200 clock cycles to DRAM memory, 4 clocks for multiply)
• Old CW: Uniprocessor performance 2X / 1.5 yrs
• New CW: Power Wall + ILP Wall + Memory Wall = Brick Wall
  – Uniprocessor performance now 2X / 5(?) yrs
• Sea change in chip design: multiple “cores” (2X processors per chip / ~2 years)
  – More, simpler processors are more power efficient
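The “2X / 1.5 yrs” versus “2X / 5(?) yrs” figures are easier to compare as yearly growth rates. A small sketch of that conversion (function names are illustrative; the 15-year horizon is an arbitrary choice for the example):

```python
def annual_growth(doubling_years):
    """Yearly growth rate implied by doubling every `doubling_years` years."""
    return 2 ** (1 / doubling_years) - 1


def speedup_after(years, doubling_years):
    """Total speedup accumulated over `years` at the given doubling time."""
    return 2 ** (years / doubling_years)


for label, doubling in (("old CW, 2X / 1.5 yrs", 1.5),
                        ("new CW, 2X / 5 yrs", 5.0)):
    print(f"{label}: {100 * annual_growth(doubling):.1f}% per year, "
          f"{speedup_after(15, doubling):.0f}x over 15 years")
```

Doubling every 1.5 years is roughly 59% per year, while doubling every 5 years is roughly 15% per year. Over 15 years that is the difference between about a 1,000x and an 8x improvement.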

“For the P6, success criteria included performance above a certain level and failure criteria included power dissipation above some threshold.”
Bob Colwell, Pentium Chronicles

Summary
• Welcome to EE 126
• Architecture is the “glue” between system software/applications and VLSI implementations
• Need to create abstractions to deal with complexity

Questions?

Information Sheet
• Please fill this out
• Designed to provide an understanding of your background and experience
• Be honest … this is not graded