Lecture on High Performance Processor Architecture (CS05162)
Introduction to Thread-Level Speculation
Speaker: Wang Yaobin (王耀彬), BA07011009
wyb1982@mail.ustc.edu.cn
December 2007
Department of Computer Science and Technology, University of Science and Technology of China (CS of USTC)

Discussion Outline
- Motivation
- Introduction
- Related Work
- Our Profiling Work
- Conclusion

2021/12/30 CS of USTC

Motivation
- We are in the multicore era
  - Verification and validation times, wire delays, power dissipation, and circuit unreliability
- The big question: how can we exploit all the parallel processing power of the new processor generation?
  - Multicore needs multithreading!
  - Many programs have been written using serial algorithms
- Creating parallelized versions of legacy code is difficult
  - Automated parallelization has proven to be a very difficult problem
  - Many applications may still turn out to have a large amount of parallelism, but can only be parallelized by hand
  - A new method is needed to exploit more thread-level parallelism efficiently

Introduction
- A promising method: TLS (Thread-Level Speculation)
  - A simple example (for a subroutine)

Introduction
- Another example:
  - Loop-carried data dependence

    for i = 1 to 5 {
        …
        … = x
        …
        x = …
        …
    }
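As a rough illustration of the loop above, the following Python sketch replays a read/write trace and reports which iterations consume a value of x produced by an earlier iteration. The trace format and the name `loop_carried_raw` are assumptions made for this note, not part of the lecture.

```python
# Sketch: find loop-carried read-after-write (RAW) dependences in a trace.
# The (iteration, op, variable) trace format is an assumption for illustration.

def loop_carried_raw(trace):
    """Return sorted (writer_iter, reader_iter, var) triples where a read in
    one iteration sees a value written by an earlier iteration."""
    last_writer = {}          # var -> iteration of the most recent write
    deps = set()
    for iteration, op, var in trace:
        if op == "read":
            writer = last_writer.get(var)
            if writer is not None and writer != iteration:
                deps.add((writer, iteration, var))
        elif op == "write":
            last_writer[var] = iteration
    return sorted(deps)

# The slide's loop: every iteration reads x (… = x) and later writes x (x = …).
trace = []
for i in range(1, 6):
    trace.append((i, "read", "x"))
    trace.append((i, "write", "x"))

deps = loop_carried_raw(trace)
print(deps)
```

Every iteration after the first depends on its predecessor, which is why this loop cannot simply be split into independent parallel iterations.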

Introduction
- Another example:
  - Inter-thread data dependence => rollback and restart

[Figure: iterations of the loop, each reading x (… = x) and then writing x (x = …), executed as speculative threads]
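The rollback-and-restart idea can be sketched in a few lines of Python. This is a simplified software model, not any of the hardware schemes discussed later: it assumes every thread first runs against the same memory snapshot, buffers its writes, and is validated by comparing the values it read at commit time (real TLS hardware typically tracks addresses instead). The names `run_speculative`, `read`, and `write` are hypothetical.

```python
# Sketch: speculative execution with in-order commit, violation detection,
# and restart. Value-based validation is an assumption made for simplicity.

def run_speculative(memory, bodies):
    """bodies[k](read, write) is the k-th thread (e.g. a loop iteration).
    Returns the number of threads that had to be squashed and re-executed."""

    def execute(body, mem):
        reads, writes = {}, {}

        def read(addr):
            if addr in writes:                  # a thread sees its own buffered writes
                return writes[addr]
            reads.setdefault(addr, mem[addr])   # remember the value first observed
            return mem[addr]

        def write(addr, value):
            writes[addr] = value                # buffered until commit

        body(read, write)
        return reads, writes

    snapshot = dict(memory)
    results = [execute(body, snapshot) for body in bodies]  # speculative runs
    restarts = 0
    for k, body in enumerate(bodies):           # commit in program order
        reads, writes = results[k]
        if any(memory[addr] != seen for addr, seen in reads.items()):
            restarts += 1                       # an earlier thread changed what we read
            reads, writes = execute(body, memory)
        memory.update(writes)
    return restarts

# Four threads that all read and then write x: each later one is restarted.
increment = lambda read, write: write("x", read("x") + 1)
shared = {"x": 0}
shared_restarts = run_speculative(shared, [increment] * 4)

# Two threads touching different addresses: no violation, no restart.
private = {0: 10, 1: 20}
private_restarts = run_speculative(private, [
    lambda read, write: write(0, read(0) + 1),
    lambda read, write: write(1, read(1) + 1),
])
```

In both cases the final memory matches sequential execution; only the number of restarts differs.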

Introduction
- TLS proposes mechanisms for optimistically executing non-analyzable serial code in parallel
  - Essence:
    - Relaxes the strict restrictions on parallelization
  - Advantages:
    - Many applications are amenable to parallelization with TLS
    - Programming simplicity is preserved

Related Work (How to Support TLS)

Project       University   URL
Multiscalar   Wisconsin    http://www.cs.wisc.edu/~mscalar/
Hydra         Stanford     http://www-hydra.stanford.edu/
STAMPede      CMU          http://www.cs.cmu.edu/~stampede/
Others

                         Multiscalar   Hydra   STAMPede
Programming Model        x             x       x
Compiling Technique      x             x       x
Processor Architecture   x  x   [only two marks survive in the source; their columns are not recoverable]

Multiscalar
- Processor Architecture
- Task Dividing
- Solving Dependences
- Other Problems

Multiscalar Processor Architecture

Task Dividing
- CFG (Control Flow Graph)
  - A static program can be represented as a control flow graph (CFG), where basic blocks are nodes and arcs represent flow of control from one basic block to another.
- Task
  - A task is a portion of the CFG whose execution corresponds to a contiguous region of the dynamic instruction sequence.

Solving Dependences
- Dependences
  - Control dependences
  - Data dependences
    - Register dependences
    - Memory dependences
- Synchronization & Speculation
  - Sequencer
  - Synchronization and forwarding for register dependences
  - Synchronization and speculation (using the ARB, Address Resolution Buffer) for memory dependences

Other Problems
- Other design choices
  - Share processing elements (SMT)
  - Tightly couple the ARB with the processing units
- Problems
  - Tightly coupled processing units
  - Too much synchronization (this may be a problem)

Multiscalar Summary
- First TLS technique
- Multi-core, tightly coupled
- Hardware support for thread-level speculation
- Too much synchronization

Hydra
- Hardware architecture support for speculation
  - L1 cache
  - L2 cache write buffer
  - Speculation coprocessor & system software
- Execution model
  - Subroutine
  - Loop
  - Speculative load
  - Speculative store

Hardware Architecture

Processor Architecture

L1 Cache

[Figure: L1 cache line state bits: LRU bits, Read-by-word, Pre-invalidate, Write-by-word, Valid, Modified]

L2 Cache Write Buffer

[Figure: L2 cache write buffer. Each entry between Head and Tail holds a valid bit (V), an L2 tag [CAM], the data (one L2 cache line), and a write mask (by byte). Labels:
- Address and write data arrive from the write bus
- Drain writes to the L2 cache after committing the CPU
- Priority encode by byte
- Inputs from the other write buffers and the L2 cache
- Mux the most recent version of each byte to the read bus
- Read data out to the read bus]
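The byte-level merge named in the figure labels ("priority encode by byte", "mux the most recent version of each byte") can be sketched as follows. This is a software model under stated assumptions: buffers are ordered from the least speculative CPU to the most speculative, and a reader sees the L2 line overlaid with the masked bytes of its own buffer and of all less speculative buffers. The function name `merge_line` and the data layout are hypothetical.

```python
# Sketch: assemble the cache line a speculative reader would see.
# Assumption: buffers[0] belongs to the head (least speculative) CPU and
# each entry is (data, mask), where mask[i] is True if byte i was written.

def merge_line(l2_line, buffers, reader):
    """Overlay the L2 line with buffered bytes from CPUs 0..reader.
    Later (more recent) versions of a byte override earlier ones."""
    result = bytearray(l2_line)
    for data, mask in buffers[: reader + 1]:
        for i, written in enumerate(mask):
            if written:
                result[i] = data[i]
    return bytes(result)

l2_line = bytes([0x00, 0x00, 0x00, 0x00])
buffers = [
    (bytes([0x11, 0, 0, 0]), [True, False, False, False]),   # CPU 0 wrote byte 0
    (bytes([0, 0x22, 0, 0]), [False, True, False, False]),   # CPU 1 wrote byte 1
    (bytes([0x33, 0, 0, 0]), [True, False, False, False]),   # CPU 2 rewrote byte 0
]

seen_by_cpu1 = merge_line(l2_line, buffers, reader=1)
seen_by_cpu2 = merge_line(l2_line, buffers, reader=2)
```

CPU 1 does not see the byte written by the more speculative CPU 2, while CPU 2 sees its own newer version of byte 0.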

Speculation Coprocessor & System Software
- Coprocessor functions
  - Initialize and commit threads
  - Track the status of threads
  - Interrupts
- System software
  - Compiler
  - Runtime system

Execution Model
- Subroutine

[Figure: timelines comparing conventional execution with Hydra execution of subroutines A and B on CP0 and CP1]

Execution Model
- Loop

[Figure: timeline of loop iterations i=0 through i=6 distributed across CP0, CP1, CP2, and CP3]

Speculative Load

[Figure: CPUs ordered from the nonspeculative head (CPU #i-3) through CPU #i-2, CPU #i-1, CPU #i ("me"), CPU #i+1, CPU #i+2, to CPU #i+3; data paths through the write buffers, DL1, and DL2]

Speculative Store

[Figure: CPUs ordered from the nonspeculative head (CPU #i-3) through CPU #i-2, CPU #i-1, CPU #i ("me"), CPU #i+1, CPU #i+2, to CPU #i+3; labels: RAW, Pre-Invalidate, write buffers, DL1, DL2]

Hydra Summary
- Loosely coupled CMP architecture
- Hardware and software support for thread-level speculation

STAMPede
- Hardware support
  - Cache
  - Coherence mechanism
- Compiler support
  - Base function
  - Synchronization
  - Instruction scheduling

Base Architecture

            Coherence Protocol           Speculative Buffer
STAMPede    Write-back & Invalidate      L1 Data Cache
Hydra       Write-through & Invalidate   L2 Cache Write Buffer

DL1 Line States and Messages

DL1 line states
State   Description
I       Invalid
E       Exclusive
S       Shared
D       Dirty
SpE     Speculative (SM and/or SL) and Exclusive
SpS     Speculative (SM and/or SL) and Shared

Coherence messages
- Read: read a cache line.
- ReadEx: read-exclusive; return a copy of the cache line with exclusive access.
- Upgrade-request: gain exclusive access to a cache line that is already present.
- Invalidation.
- Writeback: supply the cache line and relinquish ownership.
- Flush: supply the cache line but maintain ownership.
- NotifyShared: notify that the cache line is now shared.
- ReadExSp: read-exclusive-speculative; return the cache line, possibly with exclusive access.
- UpgradeSp: upgrade-request-speculative; request exclusive access to a cache line that is already present.
- InvSp: invalidation-speculative; only invalidate the cache line if the request is from a logically-earlier epoch.

Conditions
Condition   Description
=Shared     The request has returned shared access.
=Excl       The request has returned exclusive access.
=Later      The request is from a logically-later epoch.
=Earlier    The request is from a logically-earlier epoch.

Coherence Scheme

Compiler Support
- Base function
  - Instruction scheduling
  - Deciding where to speculate
  - Inserting TLS-specific instructions
  - Generating object code

Compiler Support
- Insert synchronization (wait/signal)
  - A wait must occur before any use of the scalar on any path
  - A signal must occur after the last definition of the scalar on any path
  - A signal must occur for each synchronized scalar on every possible path
  - Each wait should be placed as late as possible
  - Each signal should be placed as early as possible
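The placement rules above can be illustrated with ordinary Python threads. This sketch only models the wait/signal forwarding of one scalar between consecutive epochs; it does not model speculation, and it is not the STAMPede instruction set. The names `run_epochs`, `ready`, and `forwarded` are hypothetical.

```python
# Sketch: forwarding a scalar between epochs with wait/signal.
# threading.Event stands in for the wait/signal primitives (an assumption).
import threading

def run_epochs(n, initial):
    """Run n epochs as threads; epoch k receives the scalar from epoch k-1
    and forwards its updated value to epoch k+1."""
    ready = [threading.Event() for _ in range(n + 1)]
    forwarded = [None] * (n + 1)
    forwarded[0] = initial
    ready[0].set()                        # the first epoch has nothing to wait for
    squares = [0] * n

    def epoch(k):
        squares[k] = k * k                # independent work, done before the wait
        ready[k].wait()                   # wait: as late as possible, just before the first use
        total = forwarded[k] + squares[k] # last definition of the scalar in this epoch
        forwarded[k + 1] = total
        ready[k + 1].set()                # signal: as early as possible after that definition
        # any remaining independent work would go here, after the signal

    threads = [threading.Thread(target=epoch, args=(k,)) for k in range(n)]
    for t in reversed(threads):           # start order does not affect the result
        t.start()
    for t in threads:
        t.join()
    return forwarded[n]

result = run_epochs(5, 0)
```

Placing the wait late and the signal early keeps the serialized section of each epoch as short as possible.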

An Example of Instruction Schedule

STAMPede Summary
- Based on a common CMP architecture
- Thread-level speculation supported by a modified cache coherence protocol

Comparison among the three techniques

Similarities
- Multi-core architecture
- Buffer to store speculative data
- Violation detection and recovery
- Thread dividing by software

Differences
Technique detail               Multiscalar        Hydra               STAMPede
Thread name                    Task               Thread              Epoch
Buffer                         ARB, DL1           Write buffer, DL2   DL1
Communication                  Register forwarding   DL2 and read bus   DL2
Thread dividing                Compiler and hardware; Compiler   [two entries in the source; column assignment not recoverable]
Input and output of compiler   Binary to binary; Source to source   [two entries in the source; column assignment not recoverable]

What can we see?
- How do you find what to work on?
  - Trend
    - Both hardware and software => software
  - The most important issue
    - How to partition the program into threads
  - What shall we do?
    - Read more related papers, and find the gap!
  - Critical factor in TLS:
    - Where to speculate: it makes a large difference in the resulting performance
    - Why? Dependent threads cause performance degradation (more details in later slides)
    - => Thread partition

What should we do?
- What do we see?
  - Thread partition is very important in TLS!
  - It is a software methodology
- How to deal with thread partition?
  - Profiling: a particular run to get accurate runtime characteristics
  - Why?
    - Accuracy, overhead limits, …
  - Offline and online profiling
    - Offline => get the maximum potential performance
    - Online => get the actual performance

Our work
- Balance thread partition through offline profiling
  - How to identify and analyze the potential parallelism
  - Dynamic profiling tools
  - The profiling results
- An online profiling guided optimization approach for speculative parallel threading
  - Dynamic optimization framework
  - Key issues
  - Experimental results

Key factors to analyze in TLS
- Thread size
  - Small: significant dispatch overhead
  - Large: overflows the speculative storage
  - Unequal size: load imbalance
- Prediction rate for inter-thread control flow
  - Control dependence violations
- Memory dependence
  - Only true memory dependences (RAW)
  - Register dependences can be captured by the compiler
- Potential speedup
  - The final criterion

How to balance thread partition
- Develop appropriate dynamic profiling tools for TLS parallelism
  - Why?
    - Inherent inaccuracy of static compile-time partitioning
    - High cost of dynamic execution-time partitioning
- Identify the potential TLS sources
  - Procedure calls
    - Their boundaries often separate fairly independent computations
  - Loop iterations
    - Regular structures: naturally load-balanced
    - Significant coverage of execution time
    - Run-time sequence is predictable
- Analyze the various factors that affect TLS performance

Our profiling tools
- Motivation
  - An appropriate thread partition scheme demands cogent profiling analysis
  - Some important application areas still remain outside the scope of TLS research
  - Up to now, appropriate TLS profiling tools are still lacking
- Different roles of our tools
  - ProFun is used for procedure-level speculation
  - ProLoop for speculative loop parallelism
  - ProRV for value prediction

Profiling framework
- Find and effectively exploit speculative thread-level parallelism for various applications
  - Three criteria
    - Where: the "hot" procedures and loops => GNU prof tools
    - What: all the key factors in TLS, with emphasis on memory dependence => memory access type
    - How: define an STP (speculative thread-level parallelism) model, capture the "what" from the "where", evaluate them => analysis method, and balance them

Analysis method
- Definition 1, produce-distance: the number of instructions from the beginning of the thread to the last write to a specific memory address.
- Definition 2, consume-distance: the number of instructions from the beginning of the thread to the first read of a specific memory address.
- The ratio of consume-distance to produce-distance, α
  - α < 1: deadly dependence
  - 1 < α < 2: dangerous dependence
  - α > 2: safe dependence
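The classification by α can be written as a small Python function. The slide leaves the boundary cases α = 1 and α = 2 unspecified; this sketch assumes α = 1 counts as dangerous and α = 2 as safe. The function name `classify_dependence` is hypothetical.

```python
# Sketch: classify an inter-thread memory dependence by
# alpha = consume-distance / produce-distance.
# Boundary handling (alpha == 1, alpha == 2) is an assumption.

def classify_dependence(consume_distance, produce_distance):
    """Distances are instruction counts from the start of each thread."""
    if produce_distance <= 0:
        raise ValueError("produce_distance must be positive")
    alpha = consume_distance / produce_distance
    if alpha < 1:
        return "deadly"      # the consumer reads before the producer has written
    if alpha < 2:
        return "dangerous"   # the read comes only shortly after the write
    return "safe"            # the write is likely done well before the read

early_read = classify_dependence(50, 100)    # alpha = 0.5
close_read = classify_dependence(150, 100)   # alpha = 1.5
late_read = classify_dependence(300, 100)    # alpha = 3.0
```

A "deadly" pair is a natural candidate for synchronization, since speculating across it would most likely cause a violation.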

Speedup and Synchronization Strategy

[Figures: Speedup; Synchronization]

Profiling results
- Return value prediction rate for different types of return value
- The execution time distribution of subroutines with different types of return value

Profiling results
- The dynamic length of subroutines & the thread granularity for loops

Profiling results
- Memory dependence distribution for subroutine & loop speculation

Profiling results
- Speedup of subroutine-level speculation & loop speculation

Online Profiling Guided Optimization Approach
- Potential ≠ practical performance boost
- Profile: guides speculative optimization
  - Offline profile
    - Disadvantage: needs appropriate training inputs
  - Online profile
    - No training inputs needed
    - Can deal with phase-changing program behavior
    - Disadvantage: runtime overhead
- Objective
  - Design a more flexible profile-guided optimization mechanism
  - Verify the effect of dynamic optimization with an online profile
- Approach
  - A continuous two-phase profile-guided optimization
    - Two-phase: a profiling phase and an optimized execution phase
    - Continuous: the profiling-optimizing cycle can be triggered again
  - Generate possible optimized loop versions in advance; at runtime, select an appropriate version to execute

Thread execution model
- Threads: formed from loop iterations
- Speculative implementation: based on transactional memory
- Optimization aims: profitable loops and optimal transaction partition

Dynamic optimization framework with online profile

[Figure: static compiling stage; runtime optimization stage]

Dynamic optimization framework with online profile
- Static compiling stage
  - Identifies the loop candidates and their potential violation candidates
  - Generates the ahead-of-time optimized code versions
- Runtime optimization stage
  - Initial profile phase
  - Decision routine => choose the right speculative version
  - Optimized execution phase
  - Monitor => deal with phase-changing program behavior
  - Repeat the above process if necessary
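The runtime stage can be pictured as a small state machine. The Python sketch below is only a model of the profile / decide / execute / monitor cycle: the per-invocation violation rate is supplied as input, and the version names, `profile_len`, `threshold`, and `tolerance` are all hypothetical parameters, not values from the lecture.

```python
# Sketch: continuous two-phase (profile, then optimized execution) control loop.
# All thresholds and version names are illustrative assumptions.

def adaptive_execution(violation_rates, profile_len=3, threshold=0.2, tolerance=0.3):
    """violation_rates[k] is the violation rate observed in loop invocation k.
    Returns one (phase, version) decision per invocation."""
    log = []
    phase, samples, version, expected = "profile", [], None, None
    for rate in violation_rates:
        if phase == "profile":
            log.append(("profile", "instrumented"))
            samples.append(rate)
            if len(samples) == profile_len:
                # Decision routine: pick a pre-generated version.
                expected = sum(samples) / len(samples)
                version = "fully_parallel" if expected < threshold else "ordered_region"
                phase, samples = "optimized", []
        else:
            log.append(("optimized", version))
            # Monitor: behavior far from the profile triggers re-profiling.
            if abs(rate - expected) > tolerance:
                phase = "profile"
    return log

rates = [0.0, 0.1, 0.05, 0.0, 0.9, 0.8, 0.9, 0.85, 0.9]
decisions = adaptive_execution(rates)
```

With these inputs the loop starts in the fully parallel version, detects the phase change at the fifth invocation, re-profiles, and then switches to the ordered-region version.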

Case Study

    while i < N {
        foo1();
        if cond1
            j = i;
        else
            j = i - 1;
    S1: v[i] = foo2(v[j]);
        i++;
    }
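In this loop, statement S1 carries a cross-iteration dependence only when `cond1` is false, because it then reads `v[i-1]`, which the previous iteration wrote. The Python sketch below counts how often that happens for a given predicate. It assumes the loop starts at i = 1, and the function name `dependent_iterations` is hypothetical.

```python
# Sketch: which iterations of the case-study loop read a value written by
# the previous iteration? Assumes i starts at 1 (not stated on the slide).

def dependent_iterations(n, cond1):
    """Return the iterations i in [1, n) where S1 reads v[i-1]."""
    dependent = []
    for i in range(1, n):
        j = i if cond1(i) else i - 1
        if j != i:                 # S1 consumes v[i-1], produced by iteration i-1
            dependent.append(i)
    return dependent

# cond1 always true: the iterations are independent.
never = dependent_iterations(10, lambda i: True)

# cond1 false on every fourth iteration: occasional violations.
sometimes = dependent_iterations(10, lambda i: i % 4 != 0)
violation_rate = len(sometimes) / 9
```

How often `cond1` fails at run time is what decides whether speculating on this loop is profitable, which is the kind of information the profile has to supply.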

Case Study

Two key issues for implementation
- Is the accuracy of the initial profiling sufficient?
  - Control flow profile: Youfeng Wu, 2004
  - Cross-iteration dependence profile
    - Identify the frequent violations: mismatch rates for frequent violation candidates (FVCs)
- How to reduce the runtime overhead of dynamic optimization?
  - Generate specialized multi-versions for potential FVC sets in advance

Mapping from FVC sets to speculative versions
- A single FVC
  - R1: fully parallelize the loop, without an explicit ordered region, if the violation rarely occurs;
  - R2: move the candidate and its dependence descendants into an ordered region, if the violation is frequent and the size of the ordered region does not exceed the threshold;
  - R3: under other conditions, abandon speculative parallelization of this loop.
- A combination of FVCs
  - H1: if an FVC has a cost beyond the threshold, i.e. it has led to the sequential version, all combinations containing it lead to the original sequential version;
  - H2: if an FVC is the root of a dependence tree, combinations containing only it and its descendants lead to the same version as the FVC alone;
  - H3: if two FVCs are independent and their combined cost is below the threshold, they and their possible dependence descendants are put into the ordered region to generate a new version;
  - H4: other conditions lead to the original sequential version.
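Rules R1 to R3 for a single FVC amount to a two-threshold decision. The sketch below encodes them in Python; the threshold values, the choice to express the ordered region size as a fraction of the loop body, and the name `choose_version` are assumptions made for illustration.

```python
# Sketch: rules R1-R3 for a single frequent violation candidate (FVC).
# rate_threshold and size_threshold are hypothetical values.

def choose_version(violation_rate, ordered_region_ratio,
                   rate_threshold=0.05, size_threshold=0.3):
    """violation_rate: fraction of iterations in which the FVC is violated.
    ordered_region_ratio: size of the ordered region relative to the loop body."""
    if violation_rate <= rate_threshold:
        return "R1"   # rare violation: fully parallel, no ordered region
    if ordered_region_ratio <= size_threshold:
        return "R2"   # frequent violation, small ordered region
    return "R3"       # otherwise fall back to the sequential loop

rare = choose_version(0.01, 0.8)
frequent_small = choose_version(0.5, 0.1)
frequent_large = choose_version(0.5, 0.8)
```

Because the versions are generated ahead of time, the runtime decision reduces to a lookup of this kind rather than recompilation.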

Example

A   B   C   D   Ordered region version
*   1   *   *   {A, B, C, D, E}
1   0           {A, C}   [remaining cells missing in the source]
0   0   1   1   {C, D}
…   …           {A, B, C, D, E}

Extension
- Value profile and prediction

A   B             C   D   Ordered region version
0   1             0   0   profile B
1   0                     {A, C}   [remaining cells missing in the source]
0   0             1   1   {C, D}
0   predictable   0   0   predict B
…   …                     {A, B, C, D, E}

- Triple region partitioning
- Limitations
  - Unknown trip count
  - Leads to the sequential version
  - The sequential version is a termination point

Evaluation
- Simulator
  - fastTM
  - Sim-SPoTM
- Benchmark
  - SPEC CPU2000
- Compiling
  - Multiple speculative versions are currently generated manually

Results

[Figures: Runtime coverage of speculative parallel loops; Average ratios of ordered region size to loop body size]

Results

[Figures: Restart rates of speculative threads; Speedups of speculative parallel execution (2 cores)]

Conclusion
- Offline profiling
  - Inter-thread data dependences are ubiquitous
  - A synchronization mechanism is necessary
  - Return value prediction and loop chunking are important for improving performance
  - …
- The online profile is trustworthy enough to guide speculative optimization
- The dynamic optimization approach achieves effects comparable to the static methods
  - For applications lacking appropriate training inputs
  - For applications with phase-changing behavior

TLS evaluation
- Limitations
  - The algorithm may be inherently very serial, as discussed before
  - High overhead limits the performance improvement
  - For programs that can inherently be completely parallelized, hand parallelization is better than TLS
  - For the bus structure, 4 cores are enough!
  - …
- Applicability
  - Programs that exhibit complicated memory access patterns but can be parallelized well with the aid of profiling
  - It gives most programmers an opportunity to write TLS parallel programs easily
  - …

Thanks!