A New Era in Processor Evolution
Dezső Sima
Fall 2007 (Ver. 2.2)

Foreword
Beginning with second generation superscalars, the continuous, approximately 10-fold-per-decade increase of processor efficiency leveled off, for reasons shown in Chapter I. Designers responded by aggressively raising clock frequencies, at rates of up to 100-fold per decade, in order to sustain an approximately 100-fold-per-decade performance increase. Such rapid progress, however, inevitably ran into limits set by declining processor efficiency, increasing dissipation and skew in parallel buses, as shown in this chapter. As a consequence, a decade-long era of processor evolution, characterized by aggressively rising clock frequencies, has ended in the last few years. The new era is heralded by multicore and multithreaded designs, as discussed in Chapters III and IV.

Contents
• 1. Processor performance
• 2. Efficiency of processors
• 3. Addressing the levelling off of processor efficiency
• 4. Aggressively raising clock frequency
• 5. The efficiency wall
• 6. The thermal wall
• 7. The skew wall
• 8. EPIC architectures/processors
• 9. The end of an era in processor evolution

1. Processor Performance

1.1. Introduction (1)
Absolute performance:
• Number of successfully executed instructions/sec
• Number of successfully executed operations/sec (SIMD)
with
fc: clock frequency
IPC: instructions/cycle
OPI: operations/instruction
Relative performance:
Relating the execution time of a benchmark program on the tested system to its execution time on a reference system, e.g. SPECint92, SPECint_base2000
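The two notions of performance on this slide can be illustrated with a minimal sketch. It assumes that absolute performance is the product of fc, IPC and OPI (with OPI read as operations per instruction) and that relative performance is the ratio of the reference run time to the tested run time, as in SPEC-style ratios. The function names and all numeric values are hypothetical.

```python
def absolute_performance(fc_hz, ipc, opi=1.0):
    """Operations per second: clock frequency x instructions/cycle x operations/instruction.

    With opi = 1.0 the result is simply instructions per second.
    """
    return fc_hz * ipc * opi


def relative_performance(t_reference_s, t_tested_s):
    """Speed of the tested system relative to the reference system (higher is faster)."""
    return t_reference_s / t_tested_s


if __name__ == "__main__":
    # Hypothetical values, chosen only for illustration.
    print(absolute_performance(3.0e9, 1.5))         # scalar code, OPI = 1
    print(absolute_performance(3.0e9, 1.5, opi=4))  # SIMD code, 4 operations/instruction
    print(relative_performance(1400.0, 700.0))      # tested system runs twice as fast
```

Keeping OPI as a separate factor lets the same function cover both the instructions/sec and the operations/sec (SIMD) reading.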

1.1. Introduction (2)
In general purpose applications:
where:
IPC: issued instructions per cycle
η: ratio of successfully executed to issued instructions (efficiency of the speculative execution)
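One way to write the relation these definitions point to is sketched below. The product form and the subscript on IPC are assumptions, since the slide's own formula is not reproduced in this transcript.

```latex
P_a = f_c \cdot IPC_{issued} \cdot \eta,
\qquad
\eta = \frac{\text{successfully executed instructions}}{\text{issued instructions}} \le 1
```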

1.1. Introduction (3)
In performance/efficiency studies:
• Theoretical interpretation: Pa
• Practical measurement: Pr
How do Pa and Pr relate?

1.1. Introduction (4)
If the following were true:
In that case:
I: number of instructions in the application considered

1.1. Introduction (5)
However:
Figure 1.1: Runtime ratios of the component programs of SPECint2000
Source: http://www.spec.org

1.1. Introduction (6)
When comparing the performance of two systems:
This estimate is usable in trend considerations.

1.1. Introduction (7)
Comparing the efficiency of two systems:
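A minimal sketch of such a comparison follows. It assumes that efficiency is taken as a benchmark score divided by clock frequency, which is the quantity plotted later as SPECint_base2000/fc. The function names and the scores are hypothetical.

```python
def efficiency(performance_score, fc_mhz):
    """Benchmark score per MHz of clock frequency."""
    return performance_score / fc_mhz


def efficiency_ratio(score_a, fc_a_mhz, score_b, fc_b_mhz):
    """How many times more efficient system A is than system B."""
    return efficiency(score_a, fc_a_mhz) / efficiency(score_b, fc_b_mhz)


if __name__ == "__main__":
    # Hypothetical scores: B has the higher clock frequency, A the higher efficiency.
    print(efficiency_ratio(1000.0, 2000.0, 1200.0, 3000.0))
```

Dividing out the clock frequency separates the contribution of the microarchitecture from that of the clock rate when two systems are compared.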

1.2. Evolution of processor performance (1)
Figure 1.2: Integer performance growth of Intel’s x86 processors (SPECint92 on a logarithmic scale versus year of introduction, 1979 to 2005, from the 8088/5 to the Pentium 4 Prescott; growth of about 100×/10 years, levelling off with the latest Pentium 4 models)
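The per-decade growth rates quoted for these trends can be converted into yearly terms with a short sketch. It assumes steady exponential growth over the decade, and the function names are hypothetical.

```python
import math


def annual_growth_factor(factor_per_decade):
    """Yearly multiplier that compounds to the given factor over 10 years."""
    return factor_per_decade ** (1.0 / 10.0)


def doubling_time_years(factor_per_decade):
    """Years needed to double at the given per-decade growth factor."""
    return 10.0 * math.log(2.0) / math.log(factor_per_decade)


if __name__ == "__main__":
    for factor in (10.0, 100.0):
        print(factor, annual_growth_factor(factor), doubling_time_years(factor))
```

Under this assumption, a 100-fold-per-decade trend corresponds to roughly 58% growth per year, or a doubling about every 18 months.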

1.2. Evolution of processor performance (2)
Figure 1.3: Integer performance growth (in general - 1)
Source: x86-64 Technology White Paper, AMD Inc., Sunnyvale, CA, 2000

1.2. Evolution of processor performance (3)
Figure 1.4: Integer performance growth (in general - 2)
Source: F. Labonte, www-vlsi.stanford.edu/group/chart/specInf2000.pdf

2. Efficiency of processors

2.1. Introduction

2.2. Growth of processor efficiency (1)
Figure 2.1: Efficiency of Intel processors (SPECint_base2000/fc on a logarithmic scale versus year, 1978 to 2002, from the 286 to the Pentium III; growth of about 10×/10 years, levelling off with the 2nd generation superscalars)

2.2. Growth of processor efficiency (2)
Figure 2.2: Growth of processor performance/efficiency (in general)
Source: J. Birnbaum, "Architecture at HP: Two Decades of Innovation", Microprocessor Forum, October 14, 1997.

2.3. Contribution of raising processor efficiency to the growth of processor performance (up to the 2nd generation of superscalars)
Up to the 2nd generation of superscalars, raising the clock frequency and raising the efficiency contributed in equal proportion to the growth of performance.
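As a worked check, assuming performance is the product of efficiency η and clock frequency fc, the approximately 10-fold-per-decade efficiency growth and the approximately 100-fold-per-decade performance growth quoted in the foreword imply an equal, roughly 10-fold-per-decade contribution from the clock frequency:

```latex
\frac{P(t+10)}{P(t)}
  = \frac{\eta(t+10)}{\eta(t)} \cdot \frac{f_c(t+10)}{f_c(t)}
  \approx 10 \cdot 10 = 100
```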

2.4. Sources of raising processor efficiency
• Increasing the word length: 8/16 → 32 bit (286 → 386DX)
• Introducing and increasing temporal parallelism: 1st and 2nd generation pipelined processors (386DX, 486DX)
• Introducing and increasing issue parallelism: 1st and 2nd generation superscalars (Pentium, Pentium Pro)

2.5. Limit of raising processor efficiency (1)
2nd generation superscalars (wide superscalars)
Processing width: 4 RISC instructions/cycle, ~3 CISC instructions/cycle
Figure 2.3: Processing width of 2nd generation (wide) superscalars vs. the extent of parallelism available in general purpose applications
Source: Wall: Limits of ILP, WRL TN-15, Dec. 1990

2.5. Limit of raising processor efficiency (2)
Figure 2.4: Growth of processor efficiency (in general)

2.5. Limit of raising processor efficiency (3)
In general purpose applications:
• The width of 2nd generation superscalars already approaches the extent of available instruction-level parallelism (ILP).
• Beginning with 2nd generation (wide) superscalars, the sources of substantially raising processor efficiency became exhausted.

3. Addressing the levelling off of processor efficiency

Two approaches:
• Aggressively raising clock frequency (Sections 4 to 7): the main road of evolution
• Essentially widening the core by introducing EPIC architectures (Section 8)

4. Aggressively raising clock frequency

4.1. Sources of raising clock frequencies (1)
Raising clock frequency:
• By scaling down the feature size in the manufacturing process
• By reducing the logic depth of pipeline stages

4.1. Sources of raising clock frequencies (2)
Figure 4.1: Evolution of Intel’s process technology
Source: D. Bhandarkar: "The Dawn of a New Era", 11. EMEA, May, 2006.

4.1. Sources of raising clock frequencies (3)
Figure 4.2: Number of pipeline stages in Intel’s and AMD’s processors (versus year, 1990 to 2005)
• Pentium (5)
• Pentium Pro (~12)
• K6 (6)
• Athlon (6)
• Pentium 4 (~20)
• P4 Prescott (~30)
• Athlon-64 (12)
• Core Duo Conroe (14)

4.1. Sources of raising clock frequencies (4)
Figure 4.3: Max. logic depth of pipeline stages in processors (in terms of FO4)
Source: F. Labonte, www-vlsi.stanford.edu/group/chart/CycleFO4.pdf
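The two sources of higher clock frequency can be related in a minimal sketch. It assumes that the cycle time equals the logic depth per pipeline stage (in FO4 delays) times the delay of one FO4 inverter, and that the FO4 delay shrinks with the feature size. The function name and the numeric values are hypothetical, not measured data.

```python
def clock_frequency_ghz(logic_depth_fo4, fo4_delay_ps):
    """Clock frequency implied by a cycle time of logic_depth_fo4 x fo4_delay_ps."""
    cycle_time_ps = logic_depth_fo4 * fo4_delay_ps
    return 1000.0 / cycle_time_ps  # 1000 ps per cycle corresponds to 1 GHz


if __name__ == "__main__":
    # Hypothetical FO4 delay of 25 ps for some process generation.
    print(clock_frequency_ghz(20, 25.0))  # 20 FO4 per stage
    print(clock_frequency_ghz(10, 25.0))  # deeper pipeline: half the logic per stage
    print(clock_frequency_ghz(20, 12.5))  # finer process: half the FO4 delay
```

In this model, halving the logic depth per stage and halving the FO4 delay each double the clock frequency, so the two sources multiply.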

4.2. Growth rate of clock frequencies (1)
Figure 4.4: Growth of clock frequencies in Intel’s x86 line of processors

4.2. Growth rate of clock frequencies (2)
Figure 4.5: Growth of clock frequencies (in general)

4.3. Implications of aggressively raising clock frequencies
4.3.1. Overview
• Ousting of major RISC families (4.3.2)
• Emerging limits of evolution (4.3.3)

4.3.2. Ousting of major RISC families (1)
Figure 4.6: The shift in performance leadership between RISC and x86 lines

4.3.2. Ousting of major RISC families (2)
• 1995-2000: CISCs took over the performance leadership (raising fc at the same rate is a more demanding task from a higher value than from a lower one)
• 1997: Intel and HP unveiled IA-64/Merced as the next generation architecture/processor line
• Cancellation of most major RISC lines, such as MIPS’s R lines, HP’s Alpha and PA lines, and the PowerPC Consortium’s PowerPC line

4.3.3. Emerging limits of evolution
• The efficiency wall (Section 5)
• The thermal wall (Section 6)
• The skew wall (Section 7)

5. The efficiency wall

5.1. Overview (1)
Basic reason: the speed gap between the processor and the memory (it widens at higher clock frequencies)

5.1. Overview (2)
Main appearances of the speed gap between the processor and the memory:
• DRAM latencies
• Memory transfer rates
• L2 cache latencies
• Transfer rates of processor buses

5.2. Speed gap between processor and memory (1a)
Figure 5.1a: DRAM types

5.2. Speed gap between processor and memory (1b)
Figure 5.1b: Latency of DRAM chips

5.2. Speed gap between processor and memory (1c)
Figure 5.1c: System-level memory latency in x86-based PCs

5.2. Speed gap between processor and memory (1d)
Figure 5.1d: Latency of DRAM chips (in clock cycles)
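Expressing a DRAM latency in processor clock cycles is a simple unit conversion, sketched below. The function name and the 50 ns latency are hypothetical, chosen only to show the effect of the clock frequency.

```python
def latency_in_cycles(latency_ns, fc_ghz):
    """Number of processor clock cycles that elapse during latency_ns."""
    return latency_ns * fc_ghz  # ns x cycles/ns


if __name__ == "__main__":
    # The same hypothetical 50 ns memory latency at two clock frequencies.
    print(latency_in_cycles(50.0, 1.0))
    print(latency_in_cycles(50.0, 3.0))
```

A latency that stays constant in nanoseconds therefore costs proportionally more cycles as the clock frequency rises, which is one way the speed gap widens.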

5.2. Speed gap between processor and memory (2)
Figure 5.2: Relative transfer rate of memories (D: dual channel)

5.2. Speed gap between processor and memory (3)
              fc max at intro. (GHz)   L2 size (Kbyte)   L2 latency (clock cycles)
Willamette    1.5                      128               7
Northwood     2.0                      512               16
Prescott      3.4                      1024              23
Figure 5.3: Latency of L2 caches
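The cycle counts in this table can be converted into absolute time, as sketched below. The sketch assumes that each latency applies at the listed maximum clock frequency at introduction; the values are copied from the table as given, and the names are hypothetical.

```python
# (core, fc max at intro. in GHz, L2 size in Kbyte, L2 latency in clock cycles)
L2_DATA = [
    ("Willamette", 1.5, 128, 7),
    ("Northwood", 2.0, 512, 16),
    ("Prescott", 3.4, 1024, 23),
]


def l2_latency_ns(fc_ghz, latency_cycles):
    """L2 latency in nanoseconds for a latency given in clock cycles."""
    return latency_cycles / fc_ghz


if __name__ == "__main__":
    for core, fc_ghz, size_kbyte, cycles in L2_DATA:
        print(f"{core}: {size_kbyte} Kbyte, {cycles} cycles, "
              f"{l2_latency_ns(fc_ghz, cycles):.2f} ns")
```

Looking at both units separates the effect of the larger cache (time in ns) from the effect of the higher clock frequency (cycles per ns).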

5.2. Speed gap between processor and memory (4)
Figure 5.4: Relative transfer rates of processor buses

5.3. Efficiency of 3rd generation superscalars (1)
Figure 5.5: Efficiency of Intel’s Pentium III and Pentium 4 processors in general purpose applications

5.3. Efficiency of 3rd generation superscalars (2)
Figure 5.6: Efficiency of AMD’s Athlon, Athlon XP and Athlon 64 processors in general purpose applications

5.3. Efficiency of 3rd generation superscalars (3)
Figure 5.7: Main aspects of the memory subsystem affecting core efficiency

5.3. Efficiency of 3rd generation superscalars (4)
Figure 5.8: Contrasting the efficiency of Intel’s and AMD’s processors

5.3. Efficiency of 3rd generation superscalars (5)
Figure 5.9: Contrasting Intel’s and AMD’s processor design philosophies

5.3. Efficiency of 3rd generation superscalars (6)
Implication of the emerging efficiency wall: diminishing returns from higher clock frequencies
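The diminishing returns can be illustrated with a deliberately simplified model, which is an assumption rather than the method behind the figures in this section. The time per instruction is split into a core part that scales with the clock frequency and a memory stall part that is fixed in nanoseconds. All names and parameter values are hypothetical.

```python
def time_per_instruction_ns(fc_ghz, core_cpi, misses_per_instruction, memory_latency_ns):
    """Average time per instruction: core cycles plus memory stalls."""
    core_time_ns = core_cpi / fc_ghz                             # shrinks as fc rises
    memory_time_ns = misses_per_instruction * memory_latency_ns  # does not shrink
    return core_time_ns + memory_time_ns


def speedup(fc_low_ghz, fc_high_ghz, **model):
    """Performance gain from raising the clock frequency under the model above."""
    return (time_per_instruction_ns(fc_low_ghz, **model)
            / time_per_instruction_ns(fc_high_ghz, **model))


if __name__ == "__main__":
    # Hypothetical workload: 1 core cycle/instruction, 1 miss per 100 instructions, 100 ns memory.
    model = dict(core_cpi=1.0, misses_per_instruction=0.01, memory_latency_ns=100.0)
    print(speedup(1.0, 2.0, **model))  # doubling fc gives less than 2x
    print(speedup(2.0, 4.0, **model))  # doubling again gives even less
```

With these values each doubling of the clock frequency yields a smaller gain than the previous one, because the memory term does not shrink with the cycle time.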

6. The thermal wall

6. The thermal wall (1)
Dissipation (D) has a dynamic and a static component:
Dynamic: $D_d = A \cdot C \cdot V^2 \cdot f_c$
Static: $D_s = V \cdot I_{leak}$
with
A: ratio of the active gates
C: effective capacitance of the gates
V: supply voltage
fc: clock frequency
Ileak: leakage current
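The two dissipation formulas can be evaluated directly, as in the minimal sketch below. The function names and all parameter values are hypothetical and do not describe any particular processor.

```python
def dynamic_dissipation_w(active_ratio, capacitance_f, supply_v, fc_hz):
    """Dynamic dissipation: Dd = A * C * V^2 * fc."""
    return active_ratio * capacitance_f * supply_v ** 2 * fc_hz


def static_dissipation_w(supply_v, leakage_a):
    """Static dissipation: Ds = V * Ileak."""
    return supply_v * leakage_a


if __name__ == "__main__":
    # Hypothetical chip: 10% active gates, 100 nF effective capacitance, 1.2 V, 3 GHz, 20 A leakage.
    d_dynamic = dynamic_dissipation_w(0.1, 1e-7, 1.2, 3.0e9)
    d_static = static_dissipation_w(1.2, 20.0)
    print(d_dynamic, d_static, d_dynamic + d_static)
```

Because the supply voltage enters the dynamic term squared, halving V at an unchanged clock frequency cuts the dynamic dissipation to a quarter, whereas the clock frequency enters only linearly.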

6. The thermal wall (2)
Figure 6.1: Chip dynamic and static power dissipation trends
Source: N. S. Kim et al., "Leakage Current: Moore’s Law Meets Static Power", Computer, Dec. 2003, pp. 68-75.

Figure 6.2: Dynamic and static power dissipation trends
Source: Solie D., "Technology Trends", Aug. 2006, http://www-03.ibm.com/procurement/proweb.nsf/objectdocswebview/file14+-+darryl+solie+-+ibm+power+symposium+presentation/$file/14+-+darryl+solie-ibm-power+symposium+presentation+v2.pdf

6. The thermal wall (3)
Figure 6.3: Relative dissipation of Intel’s x86 family of processors

6. The thermal wall (4)
Figure 6.4: Contrasting the evolution of Intel’s and AMD’s processor lines with the thermal wall

6. The thermal wall (5)
Figure 6.5: Intel’s P4 processor family (Netburst architecture)

6. The thermal wall (6)
Figure 6.6: The growth of relative dissipation of processors (in general)
Source: R. Hetherington, "The UltraSPARC T1 Processor", White Paper, Sun Inc., 2005

6. The thermal wall (7)
Implications of the thermal wall:
• The approach of increasing performance by aggressively raising clock frequency hit the thermal wall.
• Processor designs now focus more and more on power-aware techniques.

7. The skew wall

7. The skew wall (1)
Reason:
Figure 7.1: Skew between lines of parallel buses

7. The skew wall (2)
Figure 7.2: Equalizing skews among different bit lines of the processor bus on the MSI 915G Combo motherboard

7. The skew wall (3)
Implication of emerging skews between bit lines of parallel buses:
Introduction of sequential (serial) buses, also for slow peripheral buses because of the considerable cost savings
Figure 7.3: Signal transfer over a sequential bus
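A rough sketch of why skew becomes critical at high transfer rates follows. It assumes that the skew comes only from a length mismatch between two bit lines and that the signal delay per millimetre is constant. The 7 ps/mm figure, the 10 mm mismatch and the function names are assumptions for illustration, not values from the slides.

```python
def skew_ps(length_mismatch_mm, delay_ps_per_mm):
    """Arrival-time difference between two bit lines of unequal length."""
    return length_mismatch_mm * delay_ps_per_mm


def bit_time_ps(transfers_per_second):
    """Duration of one bit on a line at the given transfer rate."""
    return 1e12 / transfers_per_second


def skew_fraction(length_mismatch_mm, delay_ps_per_mm, transfers_per_second):
    """Skew expressed as a fraction of the bit time."""
    return (skew_ps(length_mismatch_mm, delay_ps_per_mm)
            / bit_time_ps(transfers_per_second))


if __name__ == "__main__":
    # Assumed 10 mm mismatch and 7 ps/mm delay at two transfer rates.
    for rate in (800e6, 3200e6):
        print(rate, skew_fraction(10.0, 7.0, rate))
```

The same physical mismatch consumes a growing share of the bit time as the transfer rate rises, which is why traces have to be length-matched, or the parallel bus replaced by a sequential one.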

Implication of the emerging limits of evolution
The approach of aggressively raising clock frequencies met the efficiency, thermal and skew walls and thus hit a dead end

8. EPIC architectures/processors

8. EPIC architectures/processors (1)
Two approaches:
• Aggressively raising clock frequency (Sections 4 to 7): the main road of evolution
• Essentially widening the core by introducing EPIC architectures (Section 8)

8. EPIC architectures/processors (2)
Figure 8.1: Contrasting the principles of operation of superscalar and VLIW processors
• Principle of superscalar processing: dependent instructions, dynamic dependency resolution in the processor
• Principle of VLIW processing: independent instructions (static dependency resolution)
VLIW: Very Long Instruction Word

8. EPIC architectures/processors (3)
VLIW → EPIC
EPIC: Explicitly Parallel Instruction Computing, an enhanced VLIW (integration of advanced superscalar features):
• branch prediction
• explicit cache control
Milestones:
• 1994: Intel, HP
• 1997: EPIC designation
• 2001: IA-64 → Itanium

8. EPIC architectures/processors (4)
Figure 8.2: Overview of Itanium cores

8. EPIC architectures/processors (5)
Figure 8.3: The efficiency of Itanium processors

8. EPIC architectures/processors (6)
Figure 8.4: Expected spreading of the IA-64 architecture (Itanium processors)
Source: L. Gwennap: Intel’s Itanium and IA-64: Technology and Market Forecast, MDR, 2000

8. EPIC architectures/processors (7)
Figure 8.5: Revenue expectations concerning Intel’s Itanium line

8. EPIC architectures/processors (8)
In general purpose applications: EPIC architectures/processors play a decreasing role

9. The end of an era in processor evolution

9. The end of an era in processor evolution (1)
In general purpose applications, processor efficiency leveled off beginning with the 2nd generation superscalars, and both approaches to addressing the leveling off of efficiency met limits of evolution and thus hit a dead end.
Single-core complex superscalars: at the end of an era

9. The end of an era in processor evolution (2)
Available hardware complexity continues to increase exponentially (Moore’s law): complexity doubles about every 24 months.
A new era in processor evolution: the dawn of multicore, multithreaded processors
The number of processors will also double about every 24 months.
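The doubling rule can be turned into a simple projection, sketched here under the slide's assumption of a doubling about every 24 months. The function name and the starting point of 2 cores are hypothetical.

```python
def projected_cores(initial_cores, months, doubling_period_months=24.0):
    """Number of cores after `months`, doubling every `doubling_period_months`."""
    return initial_cores * 2.0 ** (months / doubling_period_months)


if __name__ == "__main__":
    # Hypothetical starting point of 2 cores.
    for years in (0, 2, 4, 10):
        print(years, projected_cores(2, 12 * years))
```

Under this rule a dual-core design grows to 8 cores in four years and to 64 cores in ten years.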

9. The end of an era in processor evolution (3)
Figure 9.1: Rapid spreading of multicore processors revealed by Intel