High Performance Processor Architecture, André Seznec, IRISA/INRIA ALF

High Performance Processor Architecture André Seznec IRISA/INRIA ALF project-team 1

2 Moore’s « Law »
§ Number of transistors on a microprocessor chip doubles every 18 months:
è 1972: 2,000 transistors (Intel 4004)
è 1979: 30,000 transistors (Intel 8086)
è 1989: 1 M transistors (Intel 80486)
è 1999: 130 M transistors (HP PA-8500)
è 2005: 1.7 billion transistors (Intel Itanium Montecito)
§ Processor performance doubles every 18 months:
è 1989: Intel 80486, 16 MHz (< 1 inst/cycle)
è 1993: Intel Pentium, 66 MHz x 2 inst/cycle
è 1995: Intel Pentium Pro, 150 MHz x 3 inst/cycle
è 06/2000: Intel Pentium III, 1 GHz x 3 inst/cycle
è 09/2002: Intel Pentium 4, 2.8 GHz x 3 inst/cycle
è 09/2005: Intel Pentium 4, dual core, 3.2 GHz x 3 inst/cycle x 2 processors

3 Not just the IC technology
§ VLSI brings the transistors, the frequency, ...
§ Microarchitecture and code generation/optimization bring the effective performance

4 The hardware/software interface
software:
è High level language
è Compiler/code generation
Instruction Set Architecture (ISA)
hardware:
è Micro-architecture
è Transistor

5 Instruction Set Architecture (ISA)
§ Hardware/software interface:
è The compiler translates programs into instructions
è The hardware executes instructions
§ Examples:
è Intel x86 (1979): still your PC ISA
è MIPS, SPARC (mid 80’s)
è Alpha, PowerPC (90’s)
§ ISAs evolve by successive add-ons:
è 16 bits to 32 bits, new multimedia instructions, etc.
§ Introduction of a new ISA requires good reasons:
è New application domains, new constraints
è No legacy code

6 Microarchitecture
§ A macroscopic vision of the hardware organization:
è Neither at the transistor level nor at the gate level
è But understanding the processor organization at the functional unit level

7 What is microarchitecture about?
§ Memory access time is 100 ns
§ Program semantics are sequential
§ But modern processors can execute 4 instructions every 0.25 ns. How can we achieve that?

8 High performance processors everywhere
§ General purpose processors (i.e. no special target application domain):
è Servers, desktop, laptop, PDAs
§ Embedded processors:
è Set top boxes, cell phones, automotive, ...
è Special purpose processors, or derived from a general purpose processor

9 Performance needs
§ Performance: reduce the response time
è Scientific applications: treat larger problems
è Databases
è Signal processing
è Multimedia
è ...
§ Historically, over the last 50 years: improving performance for today’s applications has fostered new, even more demanding applications

10 How to improve performance?
è language: use a better algorithm
è compiler: optimise code
è instruction set (ISA): improve the ISA
è micro-architecture: a more efficient microarchitecture
è transistor: new technology

11 How transistors are used: evolution
§ In the 70’s: enriching the ISA
è Increasing functionality to decrease instruction count
§ In the 80’s: caches and registers
è Decreasing external accesses
è ISAs from 8 to 16 to 32 bits
§ In the 90’s: instruction parallelism
è More instructions, lots of control, lots of speculation
è More caches
§ In the 2000’s: more and more
è Thread parallelism, core parallelism

12 A few technological facts (2005)
§ Frequency: 1 - 3.8 GHz
§ An ALU operation: 1 cycle
§ A floating point operation: 3 cycles
§ Read/write of a register: 2-3 cycles
è Often a critical path ...
§ Read/write of the L1 cache: 1-3 cycles
è Depends on many implementation choices

13 A few technological parameters (2005)
§ Integration technology: 90 nm - 65 nm
§ 20-30 million transistors of logic
§ Caches/predictors: up to one billion transistors
§ 20-75 watts
è 75 watts: a limit for cooling at reasonable hardware cost
è 20 watts: a limit for reasonable laptop power consumption
§ 400-800 pins
è 939 pins on the dual-core Athlon

14 The architect’s challenge
§ 400 mm² of silicon
§ 2-3 technology generations ahead
§ What will you use for performance?
è Pipelining
è Instruction Level Parallelism
è Speculative execution
è Memory hierarchy
è Thread parallelism

15 Up to now, what was microarchitecture about?
§ Memory access time is 100 ns
§ Program semantics are sequential
§ Instruction life (fetch, decode, ..., execute, ..., memory access, ...) is 10-20 ns
§ How can we use the transistors to achieve the highest possible performance?
è So far, up to 4 instructions every 0.3 ns

16 The architect’s toolbox for uniprocessor performance
§ Pipelining
§ Instruction Level Parallelism
§ Speculative execution
§ Memory hierarchy

Pipelining 17

18 Pipelining
§ Just slice the instruction life into equal stages and launch executions concurrently:
[pipeline diagram: over time, instructions I0, I1, I2 advance through the stages IF, DC, EX, M, CT, one stage per cycle, overlapped]

19 Principle
§ The execution of an instruction is naturally decomposed into successive logical phases
§ Instructions can be issued sequentially, without waiting for the completion of the previous one

20 Some pipeline examples
§ MIPS R3000: [pipeline diagram]
§ MIPS R4000: [pipeline diagram]
§ Very deep pipelines to achieve high frequency:
è Pentium 4: 20 stages minimum
è Pentium 4 Extreme Edition: 31 stages minimum

21 Pipelining: the limits
§ Current:
è 1 cycle = 12-15 gate delays:
• Approximately a 64-bit addition delay
§ Coming soon?
è 6-8 gate delays
è On the Pentium 4:
• The ALU is sequenced at double frequency: a 16-bit add delay

22 Caution with long-latency operations
§ Integer:
è Multiplication: 5-10 cycles
è Division: 20-50 cycles
§ Floating point:
è Addition: 2-5 cycles
è Multiplication: 2-6 cycles
è Division: 10-50 cycles

23 Dealing with long instructions
§ Use a specific pipeline to execute floating point operations:
è E.g. a 3-stage execution pipeline
§ Stay longer in a single stage:
è Integer multiply and divide

24 The sequential semantics issue on a pipeline
There exist situations where sequencing an instruction every cycle would not allow correct execution:
§ Structural hazards: distinct instructions compete for a single hardware resource
§ Data hazards: J follows I, and instruction J accesses an operand that has not yet been accessed by instruction I
§ Control hazards: I is a branch, but its target and direction are not known until a few cycles later in the pipeline

25 Enforcing the sequential semantics
§ Hardware management:
è First detect the hazard, then avoid its effect by delaying the instruction until the hazard is resolved
[pipeline diagram: instructions I0, I1, I2 in stages IF, DC, EX, M, CT, with the delayed instruction stalled]

26 Read After Write
§ Memory load delay: [diagram]

27 And how code reordering may help
§ a = b + c; d = e + f
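A minimal sketch of the reordering idea (not from the slides; the stall commentary assumes a one-cycle load-use delay, and in practice the compiler’s scheduler performs this reordering):

```c
#include <stdio.h>

int main(void) {
    int b = 1, c = 2, e = 3, f = 4;

    /* Naive order: load b; load c; add; load e; load f; add.
       Each add consumes a value loaded just before it, so the
       pipeline stalls on the load-use (RAW) hazard twice. */
    int a = b + c;
    int d = e + f;

    /* Reordered: issue the four independent loads first, then the
       two adds; by the time the adds issue, their operands are ready. */
    int t1 = b, t2 = c, t3 = e, t4 = f;  /* the four loads, grouped */
    int a2 = t1 + t2;                    /* no stall */
    int d2 = t3 + t4;                    /* no stall */

    printf("%d %d %d %d\n", a, d, a2, d2);
    return 0;
}
```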

28 Control hazard
§ The current instruction is a branch (conditional or not):
è Which instruction is next?
§ The number of cycles lost on branches may be a major issue

29 Control hazards
§ 15-30% of instructions are branches
§ Targets and directions are known very late in the pipeline, not before:
è Cycle 7 on the DEC 21264
è Cycle 11 on the Intel Pentium III
è Cycle 18 on the Intel Pentium 4
§ X instructions are issued per cycle!
§ We just cannot afford to lose these cycles!

30 Branch prediction / next instruction prediction
[diagram: the PC feeds a predictor ahead of the IF, DE, EX, MEM, WB pipeline; the predictor supplies the address of the next instruction (block), with a correction fed back on misprediction]

31 Dynamic branch prediction: just repeat the past
§ Keep a history of what happened in the past, and guess that the same behavior will occur next time:
è Essentially assumes that the behavior of the application tends to be repetitive
§ Implementation: hardware storage tables, read at the same time as the instruction cache
§ What must be predicted:
è Is there a branch? What is its type?
è The target of a PC-relative branch
è The direction of a conditional branch
è The target of an indirect branch
è The target of a procedure return
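As a concrete illustration of “repeat the past”, here is a minimal sketch of a bimodal predictor: one classic hardware table of 2-bit saturating counters indexed by the low bits of the PC (this particular scheme and the table size are illustrative choices, not taken from the slides):

```c
#include <stdint.h>
#include <stdbool.h>

#define PRED_ENTRIES 4096              /* illustrative table size (power of 2) */

static uint8_t counters[PRED_ENTRIES]; /* 2-bit saturating counters, 0..3 */

/* Predict taken when the counter is in one of its two upper states. */
bool predict_taken(uint64_t pc) {
    return counters[pc % PRED_ENTRIES] >= 2;
}

/* Update with the actual outcome; saturating at 0 and 3 means one
   odd outcome does not immediately flip a strongly biased branch. */
void train(uint64_t pc, bool taken) {
    uint8_t *c = &counters[pc % PRED_ENTRIES];
    if (taken && *c < 3) (*c)++;
    if (!taken && *c > 0) (*c)--;
}
```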

32 Predicting the direction of a branch
§ It is more important to correctly predict the direction than the target of a conditional branch:
è The PC-relative target address is known/computed at decode time
è The effective direction is computed at execution time

33 Predicting the same direction as last time

for (i = 0; i < 1000; i++) {
  for (j = 0; j < N; j++) {
    loop body
  }
}

direction:  1 1 1 0 1 1 1 0 ...
prediction: . 1 1 1 0 1 1 1 ...
§ 2 mispredictions per inner loop, on the first and the last iterations
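A small simulation of this “same as last time” (1-bit) predictor on the inner-loop branch of the nest above (a sketch; N = 4 is an arbitrary choice) makes the two mispredictions per inner-loop instance visible:

```c
#include <stdio.h>
#include <stdbool.h>

int main(void) {
    const int OUTER = 1000, N = 4;  /* N is an arbitrary choice here */
    bool last = true;               /* 1-bit state: last observed outcome */
    int mispredicts = 0;

    for (int i = 0; i < OUTER; i++) {
        for (int j = 0; j < N; j++) {
            bool taken = (j < N - 1);   /* loop-back branch: taken until exit */
            if (taken != last)          /* prediction = last outcome */
                mispredicts++;
            last = taken;               /* train the predictor */
        }
    }
    /* Prints 1999: the exit of each inner loop and the first iteration
       of the next one both mispredict, i.e. ~2 per inner-loop instance. */
    printf("%d mispredictions / %d branches\n", mispredicts, OUTER * N);
    return 0;
}
```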

34 Exploiting more past: inter-correlations
B1: if (cond1 AND cond2) ...
B2: if (cond1) ...

cond1:           T T N N
cond2:           T N T N
cond1 AND cond2: T N N N

Using information on B1 to predict B2:
è If cond1 AND cond2 was true (p = 1/4), predict cond1 true: 100% correct
è If cond1 AND cond2 was false (p = 3/4), predict cond1 false: 66% correct

35 Exploiting the past: auto-correlation

for (i = 0; i < 100; i++)
  for (j = 0; j < 4; j++)
    loop body

direction: 1 1 1 0 1 1 1 0 ...
When the last 3 iterations were taken, predict not taken; otherwise predict taken: 100% correct

36 General principle of branch prediction
[diagram: information on the branch (PC, global history, local history) indexes prediction tables; the entries read are combined by a function F to produce the prediction]

37 Alpha EV8 predictor: 2Bc-gskew (derived from e-gskew)
§ 352 Kbits; cancelled 2001
§ Max history length > 21, ≈ 35
[predictor diagram]

38 Current state of the art: 256-Kbit TAGE, geometric history lengths (Dec 2006)
[diagram: a tagless base predictor plus tagged tables indexed with hash(pc, h[0:L1]), hash(pc, h[0:L2]), hash(pc, h[0:L3]); each tagged entry holds a counter (ctr), a tag, and a useful bit (u); tag matches (=?) select the final prediction]
§ 3.314 misp/KI

ILP: Instruction level parallelism 39

40 Executing instructions in parallel: superscalar and VLIW processors
§ Until 1991, achieving 1 instruction per cycle through pipelining was the goal
§ Pipelining reached its limits:
è Multiplying stages does not lead to higher performance
è Silicon area was available:
• Parallelism is the natural way
§ ILP: executing several instructions per cycle
è Different approaches depending on who is in charge of the control:
• The compiler/software: VLIW (Very Long Instruction Word)
• The hardware: superscalar

41 Instruction Level Parallelism: what is ILP?
A = B + C; D = E + F; → 8 instructions:
1. Ld @C, R1 (A)
2. Ld @B, R2 (A)
3. R3 ← R1 + R2 (B)
4. St @A, R3 (C)
5. Ld @E, R4 (A)
6. Ld @F, R5 (A)
7. R6 ← R4 + R5 (B)
8. St @D, R6 (C)
• (A), (B), (C): three groups of independent instructions
• Each group can be executed in parallel

42 VLIW: Very Long Instruction Word
§ Each instruction explicitly controls the whole processor
è The compiler/code scheduler is in charge of all functional units:
• It manages all hazards:
– resource: decides whether two candidates compete for a resource
– data: ensures that data dependencies will be respected
– control: ?!?

43 VLIW architecture
[diagram: a control unit driving a register bank, several functional units (FU), and a memory interface]
Example of a long instruction word: igtr r6 r5 -> r127, uimm 2 -> r126, iadd r0 r6 -> r36, ld32d (16) r4 -> r33, nop;

44 VLIW (Very Long Instruction Word)
§ The control unit issues a single long instruction word per cycle
§ Each long instruction launches simultaneously several independent operations:
è The compiler guarantees that:
• the sub-instructions are independent,
• the instruction is independent of all in-flight instructions
§ There is no hardware to enforce dependencies
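As a data structure, a long instruction word can be pictured as one slot per functional unit; this sketch is purely illustrative (the field names and slot mix are invented, not taken from the slides):

```c
#include <stdint.h>

/* One encoded operation per functional unit; unused slots carry an
   explicit NOP. The compiler fills every slot of every word. */
typedef struct {
    uint32_t alu0;    /* first integer ALU operation  */
    uint32_t alu1;    /* second integer ALU operation */
    uint32_t mem;     /* load/store operation         */
    uint32_t branch;  /* control transfer, or NOP     */
} vliw_word;

/* Each cycle the control unit fetches one vliw_word and dispatches
   every slot directly to its unit: no dependence checking in
   hardware, since the compiler guaranteed slot independence. */
```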

45 VLIW architecture: often used for embedded applications
§ Binary compatibility is a nightmare:
è Necessitates the use of the same pipeline structure
§ Very effective on regular codes with loops and very little control, but poor performance on general purpose applications (too many branches)
§ Cost-effective hardware implementation:
è No control logic:
• Less silicon area
• Reduced power consumption
• Reduced design and test delays

46 Superscalar processors
§ The hardware is in charge of the control:
è The semantics are sequential; the hardware enforces them
• The hardware enforces dependencies
§ Binary compatibility with previous generation processors
§ All general purpose processors since 1993 are superscalar

47 Superscalar: what are the problems?
§ Is there instruction parallelism?
è On general-purpose applications: 2-8 instructions per cycle
è On some applications: could be 1000’s
§ How to recognize parallelism?
è Enforcing data dependencies
§ Issuing in parallel:
è Fetching instructions in parallel
è Decoding in parallel
è Reading operands in parallel
è Predicting branches very far ahead

48 In-order execution
[diagram: instructions advancing through IF, DC, EX, M, CT in lockstep, in program order]

49 Out-of-order execution
To optimize resource usage: execute as soon as operands are valid
[diagram: instructions pass IF and DC, wait until their operands are ready before EX and M, and commit (CT) in order]

50 Out-of-order execution
§ Instructions are executed out of order:
è If instruction A is blocked by the absence of its operands but instruction B has its operands available, then B can be executed!!
è This generates a lot of hardware complexity!!
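The selection rule can be sketched in a few lines (a toy model with an invented representation, not the slides’ hardware): scan the instruction window each cycle and fire every instruction whose source operands are ready, regardless of age.

```c
#include <stdbool.h>
#include <stdio.h>

#define NREGS 8

typedef struct { int dst, src1, src2; bool done; } instr;

static bool reg_ready[NREGS];  /* false while an in-flight instr still writes it */

/* One "cycle": issue every waiting instruction whose operands are
   ready, even if an older instruction is still blocked ahead of it. */
static void issue_ready(instr *window, int n) {
    for (int i = 0; i < n; i++) {
        if (!window[i].done &&
            reg_ready[window[i].src1] && reg_ready[window[i].src2]) {
            printf("issuing instruction %d\n", i);
            window[i].done = true;
            reg_ready[window[i].dst] = true;  /* its result becomes available */
        }
    }
}

int main(void) {
    for (int r = 0; r < NREGS; r++) reg_ready[r] = true;
    reg_ready[1] = false;                     /* r1: pending load (cache miss) */
    instr window[2] = {
        { .dst = 2, .src1 = 1, .src2 = 1, .done = false },  /* A: blocked on r1 */
        { .dst = 3, .src1 = 4, .src2 = 5, .done = false },  /* B: ready, issues first */
    };
    issue_ready(window, 2);                   /* only B issues this cycle */
    return 0;
}
```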

51 Speculative execution on OOO processors
§ 10-15% of instructions are branches:
è On the Pentium 4: direction and target known at cycle 31!!
§ Predict and execute speculatively:
è Validate at execution time
è State-of-the-art predictors:
• ≈ 2-3 mispredictions per 1000 instructions
§ Also predict:
è Memory (in)dependence
è (limited) data values

52 Out-of-order execution: just be able to « undo »
§ Branch mispredictions
§ Memory dependency mispredictions
§ Interruptions, exceptions
§ Validate (commit) instructions in order
§ Do not do anything definitive out of order

The memory hierarchy 53

54 Memory components
§ Most transistors in a computer system are memory transistors:
è Main memory:
• Usually DRAM
• 1 GByte is standard in PCs (2005)
• Long access time:
– 150 ns = 500 cycles = 2000 instructions
è On-chip single-ported memory:
• Caches, predictors, ...
è On-chip multi-ported memory:
• Register files, L1 caches, ...

55 Memory hierarchy
§ Memory is:
è either huge, but slow
è or small, but fast; the smaller, the faster
§ Memory hierarchy goal:
è Provide the illusion that the whole memory is fast
§ Principle: exploit the temporal and spatial locality properties of most applications

56 Locality properties
§ On most applications, the following properties apply:
è Temporal locality: a data/instruction word that has just been accessed is likely to be re-accessed in the near future
è Spatial locality: the data/instruction words located close (in the address space) to a data/instruction word that has just been accessed are likely to be accessed in the near future

57 A few examples of locality
§ Temporal locality:
è Loop indices, loop invariants, ...
è Instructions: loops, ...
• 90%/10% rule of thumb: a program spends 90% of its execution time on 10% of the static code (often much more on much less)
§ Spatial locality:
è Arrays of data, data structures
è Instructions: the next instruction after a non-branch instruction is always executed
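A standard illustration (not from the slides): in C a matrix is laid out row by row, so a row-order traversal exploits spatial locality while a column-order traversal of the same data touches a new cache block at almost every access.

```c
#define N 1024

double a[N][N];

/* Row-major walk: consecutive accesses hit consecutive addresses,
   so every word of each fetched cache block is used. */
double sum_by_rows(void) {
    double s = 0.0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s += a[i][j];
    return s;
}

/* Column-major walk: consecutive accesses are N*8 bytes apart,
   so nearly every access touches a different cache block. */
double sum_by_cols(void) {
    double s = 0.0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            s += a[i][j];
    return s;
}
```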

58 Cache memory
§ A cache is a small memory whose content is an image of a subset of the main memory
§ A reference to memory is:
è 1) presented to the cache
è 2) on a miss, presented to the next level in the memory hierarchy (2nd level cache or main memory)

59 Cache
[diagram: main memory and a cache made of cache lines, each with a tag identifying the memory block; on "Load &A", if the address of the block sits in the tag array, then the block is present in the cache]
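A minimal sketch of that tag check, for a direct-mapped cache (the sizes and field names are illustrative choices, not taken from the slides):

```c
#include <stdint.h>
#include <stdbool.h>

#define LINE_BYTES 64   /* block size: the low 6 address bits are the offset */
#define NLINES     512  /* the next 9 bits select the line; the rest is the tag */

typedef struct {
    bool     valid;
    uint64_t tag;
    uint8_t  data[LINE_BYTES];
} cache_line;

static cache_line cache[NLINES];

/* Hit iff the line selected by the index bits holds a valid copy
   whose stored tag matches the high-order bits of the address. */
bool cache_hit(uint64_t addr) {
    uint64_t block = addr / LINE_BYTES;
    uint64_t index = block % NLINES;
    uint64_t tag   = block / NLINES;
    return cache[index].valid && cache[index].tag == tag;
}
```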

60 Memory hierarchy behavior may dictate performance
§ Example:
è 4 instructions/cycle
è 1 data memory access per cycle
è 10 cycle penalty for accessing the 2nd level cache
è 300 cycles round-trip to memory
è 2% misses on instructions, 4% misses on data, 1 reference out of 4 missing in L2
§ To execute 400 instructions: 1320 cycles!!
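The slide does not show its bookkeeping, but figures of this kind come out of the usual two-level miss accounting; as a sketch:

\[
\text{cycles} \;\approx\; \underbrace{\frac{400}{4}}_{\text{ideal}} \;+\; (\#\text{L1 misses}) \times 10 \;+\; (\#\text{L2 misses}) \times 300
\]

where the L1 misses come from both instruction fetches (2%) and data accesses (4%), and one L1 miss out of four also misses in L2. The 300-cycle round-trips dominate: a handful of L2 misses already costs several times the 100 cycles the 400 instructions would ideally take, which is how 400 instructions end up needing on the order of 1300 cycles.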

61 Block size
§ Long blocks:
è Exploit spatial locality
è Load useless words when spatial locality is poor
§ Short blocks:
è Misses on contiguous blocks
§ Experimentally:
è 16-64 bytes for small L1 caches (8-32 KBytes)
è 64-128 bytes for large caches (256 KBytes - 4 MBytes)

62 Cache hierarchy
§ A cache hierarchy has become the standard:
è L1: small (<= 64 KBytes), short access time (1-3 cycles)
• Separate instruction and data caches
è L2: longer access time (7-15 cycles), 512 KBytes - 2 MBytes
• Unified
è Coming, L3: 2-8 MBytes (20-30 cycles)
• Unified, shared on a multiprocessor

63 Cache misses do not (completely) stop a processor
§ On an L1 cache miss:
è The request is sent to the L2 cache, but sequencing and execution continue
• On an L2 hit, the latency is simply a few cycles
• On an L2 miss, the latency is hundreds of cycles:
– Execution stops after a while
§ Out-of-order execution allows several L2 cache misses (serviced in a pipelined mode) to be outstanding at the same time:
è Latency is partially hidden

64 Prefetching
§ To avoid misses, one can try to anticipate them and load the (future) missing blocks into the cache in advance:
è Many techniques:
• Sequential prefetching: prefetch the next sequential blocks
• Stride prefetching: recognize a stride pattern and prefetch the blocks in that pattern
• Hardware and software methods are available:
– Many complex issues: latency, pollution, ...
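On the software side, the compiler or programmer can insert explicit prefetch hints; a sketch using the GCC/Clang builtin __builtin_prefetch, where the lookahead distance of 16 elements is an arbitrary tuning choice:

```c
#define N 100000

double a[N], b[N];

/* Software prefetching on a unit-stride stream: request the block
   holding a[i+16] while working on a[i], so the memory latency
   overlaps with useful computation. */
void scale(double k) {
    for (int i = 0; i < N; i++) {
        __builtin_prefetch(&a[i + 16], /* rw = */ 0, /* locality = */ 3);
        b[i] = k * a[i];
    }
}
```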

65 Execution time of a short instruction sequence is a complex function!
[diagram: the execution core surrounded by a branch predictor (correct/mispredict), ITLB and DTLB (hit/miss), I-cache and D-cache (hit/miss), and the L2 cache]

66 Code generation issues
§ First, avoid data misses: 300-cycle miss penalties
• Data layout
• Loop reorganization
• Loop blocking
§ Instruction generation:
è Minimize the instruction count: e.g. common subexpression elimination
è Schedule instructions to expose ILP
è Avoid hard-to-predict branches
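Loop blocking, listed above, restructures a loop nest so that each tile of data is reused while it is still cache-resident; a sketch on matrix multiply, where the tile size B = 64 is an arbitrary choice to be tuned to the cache size:

```c
#define N 1024
#define B 64   /* tile size: three B x B tiles of doubles should fit in cache */

double A[N][N], Bm[N][N], C[N][N];

/* Blocked matrix multiply: within a (ii, jj, kk) tile, each element of
   A and Bm is reused B times from the cache instead of being refetched
   from memory on every use. */
void matmul_blocked(void) {
    for (int ii = 0; ii < N; ii += B)
        for (int jj = 0; jj < N; jj += B)
            for (int kk = 0; kk < N; kk += B)
                for (int i = ii; i < ii + B; i++)
                    for (int j = jj; j < jj + B; j++) {
                        double s = C[i][j];
                        for (int k = kk; k < kk + B; k++)
                            s += A[i][k] * Bm[k][j];
                        C[i][j] = s;
                    }
}
```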

On-chip thread level parallelism 67

68 One billion transistors now!!
§ An ultimate 16-32-way superscalar uniprocessor seems unreachable:
è Just not enough ILP
è More than quadratic complexity in a few key (power hungry) components (register file, bypass network, issue logic)
è To avoid temperature hot spots:
• Very long intra-CPU communications would be needed
§ On-chip thread parallelism appears as the only viable solution:
è Shared memory processor, i.e. chip multiprocessor
è Simultaneous multithreading
è Heterogeneous multiprocessing
è Vector processing

69 The Chip Multiprocessor
§ Put a shared memory multiprocessor on a single die:
è Duplicate the processor, its L1 cache, maybe its L2
è Keep the caches coherent
è Share the last level of the memory hierarchy (maybe)
è Share the external interface (to memory and system)

70 General purpose Chip MultiProcessor (CMP): why it did not (really) appear before 2003
§ Until 2003, there was a better (economic) usage for transistors:
è Single process performance is the most important
è More complex superscalar implementations
è More cache space:
• Bring the L2 cache on-chip
• Enlarge the L2 cache
• Include an L3 cache (now)
§ Diminishing returns!! Now: CMP is the only option!!

71 Simultaneous Multithreading (SMT): parallel processing on a single processor
§ Functional units are underused on superscalar processors
§ SMT:
è Share the functional units of a superscalar processor between several processes
§ Advantages:
è A single process can use all the resources
è Dynamic sharing of all structures on parallel/multiprocess workloads

72 Superscalar vs SMT
[diagram: issue slots over time; the superscalar leaves slots empty, while SMT fills them with instructions from other threads]

73 The programmer’s view of a CMP/SMT!

74 Why is CMP/SMT the new frontier?
§ (Most) applications were sequential:
è Hardware WILL be parallel
è Tens, hundreds of SMT cores in your PC or PDA 10 years from now (might be)
§ Option 1: applications will have to be adapted to parallelism
§ Option 2: parallel hardware will have to run sequential applications efficiently
§ Option 3: invent new tradeoffs
§ There is no current standard

75 Embedded processing and on-chip parallelism (1)
§ ILP has been exploited for many years:
è DSPs: a multiply-add, 1 or 2 loads, and loop control in a single (long) cycle
è Caches have been implemented on embedded processors for 10+ years
è VLIW was introduced on embedded processors in 1997-98
è In-order superscalar processors were developed for the embedded market 15 years ago (Intel i960)

76 Embedded processing and on-chip parallelism (2): thread parallelism
§ Heterogeneous multicores are the trend:
è IBM Cell processor (2005):
• One PowerPC (master)
• 8 special purpose processors (slaves)
è Philips Nexperia: a MIPS RISC microprocessor + a Trimedia VLIW processor
è ST Nomadik: an ARM + x VLIWs

77 What about a vector microprocessor for scientific computing?
è Vector parallelism is well understood!
è A niche segment; but not so small: 2 B$!
è Caches are not vector friendly. Never heard about L2 caches, prefetching, blocking?
è GPGPU!! GPU boards have become vector processing units

78 Structure of future multicores?
[diagram: several μP + private cache ($) pairs sharing an L3 cache]

79 Hierarchical organization?
[diagram: μP + private $ pairs grouped under L2 caches, which in turn share an L3 cache]

80 An example of sharing the L2 cache
[diagram: two clusters, each with an IL1 $ and instruction fetch front-end, two μPs sharing an FP unit and a DL1 $, plus a hardware accelerator; all share an L2 cache]

81 A possible basic brick
[diagram: four μPs in two pairs, each pair with its I$ and D$ and a shared FP unit, around a shared L2 cache]

82 [diagram: four such bricks (pairs of μPs with I$, D$ and shared FP around an L2 cache) connected through a network interface to an L3 cache, a memory interface, and a system interface]

83 Only limited available thread parallelism?
§ Focus on uniprocessor architecture:
è Find the correct tradeoff between complexity and performance
è Power and temperature issues
§ Vector extensions?
è Contiguous vectors (a la SSE)?
è Strided vectors in L2 caches (Tarantula-like)?

84 Another possible basic brick
[diagram: an "ultimate" out-of-order superscalar core alongside a pair of μPs with I$, D$ and a shared FP unit, all sharing an L2 cache]

85 [diagram: several such bricks (an ultimate out-of-order core with its L2 $, I$ and D$) connected by a network interface to an L3 cache, a memory interface, and a system interface]

86 Some undeveloped issues: the power consumption issue
§ Power consumption:
è Need to limit power consumption:
• Laptop: battery life
• Desktop: above 75 W, hard to extract at low cost
• Embedded: battery life, environment
è Revisit the old performance concept as "maximum performance in a fixed power budget"
è Power aware architecture design
è Need to limit frequency

87 Some undeveloped issues: the performance predictability issue
§ Modern microprocessors have unpredictable/unstable performance
§ The average user wants stable performance:
è Cannot tolerate performance varying by orders of magnitude when simple parameters vary
§ Real-time systems:
è Want to guarantee response times

88 Some undeveloped issues: the temperature issue
§ Temperature is not uniform on the chip (hotspots)
è Raising the temperature above a threshold has devastating effects:
• Defective behavior, transient or permanent
• Component aging
è Solutions: gating (stops!!), clock scaling, or task migration