Characterization and Evaluation of Hardware Loop Unrolling
Marcos R. de Alba and David R. Kaeli
BARC 2003, Cambridge, MA, January 30, 2003

Motivation
• High temporal locality available in loops suggests applying more aggressive fetch techniques to provide a larger number of instructions for dispatch and issue
• Current aggressive fetch techniques (e.g., trace caches) are not tuned to exploit loop behavior
• We propose a mechanism specifically tailored to fetching loop bodies

Outline
• Introduction
• Loop characteristics
• Loop prediction hardware
• Loop caching and unrolling hardware
• Experimental approach
• Results
• Conclusions and current work

Introduction
• To exploit instruction-level parallelism, it is essential to have a large window of candidate instructions available to issue from
• The temporal locality present in loops provides a good opportunity for loop caching
• In general-purpose applications, 50% of the loops have variable-dependent trip counts and/or contain conditional branches in their bodies
• These characteristics suggest that a hardware-based loop caching approach should be investigated

Loop characteristics
• internal control flow
• number of loop visits
• number of iterations per loop visit
• dynamic loop body size
• patterns leading up to the loop visit

Loop prediction hardware
• A path-to-loop register to detect loops in advance
• A stack to maintain nested per-iteration loop information
• A table to maintain per-visit loop information and update loop prediction state
• A path-in-iteration table to maintain the history of branches visited within individual iterations
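The loop stack can be illustrated with a minimal Python sketch. It rests on assumptions the slides do not state: a loop is recognized when a backward branch is taken, the branch target is treated as the head and the branch address as the tail, and exits from inside the loop body are ignored. The names `LoopEntry`, `LoopStack`, and `on_branch` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class LoopEntry:
    head: int            # address of the first instruction of the loop body
    tail: int            # address of the backward branch that closes the loop
    iterations: int = 1  # iterations completed so far in the current visit


class LoopStack:
    """Hypothetical loop stack; the innermost active loop is on top."""

    def __init__(self):
        self.entries = []

    def on_branch(self, pc, target, taken):
        """Update the stack for one executed branch.

        Returns the finished LoopEntry when a loop visit ends, else None.
        """
        top = self.entries[-1] if self.entries else None
        if top is not None and pc == top.tail:
            top.iterations += 1            # one more iteration has completed
            if taken:
                return None                # the loop continues
            return self.entries.pop()      # tail fell through: visit is over
        if taken and target <= pc:
            # Assumption: a taken backward branch marks a new loop, and its
            # first iteration has just completed.
            self.entries.append(LoopEntry(head=target, tail=pc))
        return None
```

Feeding the sketch three taken executions of a branch at 0x2d68 targeting 0x2d24, followed by one not-taken execution, yields an entry with four completed iterations and leaves the stack empty. A nested loop would simply push a second entry on top of the outer one.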

Loop characteristics and hardware components
* Based on a study of SPECint2000, MiBench, and MediaBench

Loop stack and prediction tables (diagram)
• Loop stack entry: head address 0x2d24, tail address 0x2d68
• Loop prediction table entry: head address 0x2d24, tail address 0x2d68, path-to-loop 101, *pilt
• Path-in-loop table columns: path-in-loop, itns, next; example paths 001, 000, 1__
• Path-in-loop prediction table columns: predicted path-in-loop, pred itns, conf ctr, next; example paths 001, 000, 1__

Loop caching and unrolling hardware
• A loop cache to hold instructions that belong to loop bodies
• A loop cache control mechanism for indexing into the loop cache and for maintaining loop cache state (number of allocated loops, their indices and offsets)

Loop prediction lookup (flowchart)
• The path-to-loop register bits (bn-1, bn-2, bn-3, ..., b1, b0) are combined with the address, gshare-style, to index the loop prediction table
• Example loop prediction table entry: tag 50, head and tail addresses 2d24 and 2d68, preditns 4, *pilt
• Example path-in-loop table paths: 001, 000, 1__
• Tag match on the last branch? If no, there is no information for this loop; proceed with normal fetching
• preditns > 1? If yes, the information is used by the loop cache control to interrogate the loop cache for a hit, or to build dynamic traces in the case of a miss
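The lookup decision in this flowchart can be sketched in Python as follows. Several details are assumptions made for illustration, not taken from the slides: the 256-entry table size (`INDEX_BITS`), the XOR hash, the way the tag is derived from the branch address, and the choice to fall back to normal fetching when the predicted iteration count is not greater than 1. The function names are hypothetical.

```python
INDEX_BITS = 8  # hypothetical: 256-entry loop prediction table


def table_index(path_to_loop, branch_addr):
    """gshare-style hash: XOR the path history with the word address."""
    return (path_to_loop ^ (branch_addr >> 2)) & ((1 << INDEX_BITS) - 1)


def table_tag(branch_addr):
    """Hypothetical tag: the address bits not covered by the index."""
    return branch_addr >> (2 + INDEX_BITS)


def install(table, path_to_loop, branch_addr, head, tail, preditns):
    """Record a loop under the path and last-branch address that lead to it."""
    table[table_index(path_to_loop, branch_addr)] = {
        "tag": table_tag(branch_addr),
        "head": head,
        "tail": tail,
        "preditns": preditns,  # predicted iterations for the next visit
    }


def lookup(table, path_to_loop, branch_addr):
    """Return the loop entry to hand to the loop cache control, or None.

    None means: no usable information, proceed with normal fetching.
    """
    entry = table.get(table_index(path_to_loop, branch_addr))
    if entry is None or entry["tag"] != table_tag(branch_addr):
        return None  # tag mismatch on the last branch
    if entry["preditns"] <= 1:
        return None  # assumption: a single iteration is not worth unrolling
    return entry
```

The address is shifted right by two bits before hashing because Alpha instructions are 4 bytes wide, so the low two bits carry no information.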

Loop cache control mechanism (flowchart)
• The loop cache is probed with index 2d24 and tag 50 from the loop prediction table; the cached instructions run through 2d68
• Match? If no, store the loop pattern in the loop cache
• Match? If yes, issue instructions from the loop cache
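A minimal Python sketch of this hit/miss decision is shown below. It simplifies the indexing to a dictionary keyed by head address, and the capacity, the FIFO eviction policy, and the `fetch_body` callback (standing in for the normal I-cache fetch path) are all assumptions made for illustration.

```python
class LoopCache:
    """Hypothetical loop cache keyed by loop head address."""

    def __init__(self, capacity=4):
        self.capacity = capacity
        self.loops = {}  # head address -> (tail address, instruction list)

    def access(self, head, tail, fetch_body):
        """Return (source, instructions) for the loop [head, tail].

        fetch_body(head, tail) is called only on a miss and models fetching
        the loop body through the normal instruction cache.
        """
        cached = self.loops.get(head)
        if cached is not None and cached[0] == tail:
            return "loop-cache", cached[1]        # hit: issue from loop cache
        body = fetch_body(head, tail)             # miss: fetch normally
        if head not in self.loops and len(self.loops) >= self.capacity:
            oldest = next(iter(self.loops))       # assumption: FIFO eviction
            del self.loops[oldest]
        self.loops[head] = (tail, body)           # store the loop pattern
        return "i-cache", body
```

On the first visit to a loop the sketch reports the I-cache as the source and fills the loop cache; later visits with the same head and tail are served from the loop cache without calling `fetch_body` again.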

Loop cache control
Loop cache (2d24, 2d68, 4, 001, 000)
Loop body:
  2d24: ldl t1, 16(sp)
  2d28: lda t1, -31(t1)
  2d2c: bge t1, 2d6c
  2d30: ldl v0, 16(sp)
  2d34: lda v0, -15(v0)
  2d38: bge v0, 2d50
  2d3c: ldl t2, 0(sp)
  2d40: ldl t0, 32(sp)
  2d44: subl t0, t2, t0
  2d48: br zero, 2d5c
  2d4c: ldl v0, 32(sp)
  2d50: subl v0, 0x1, v0
  2d54: stl v0, 32(sp)
  2d58: ldl t0, 16(sp)
  2d5c: addl t0, 0x1, t0
  2d60: stl t0, 16(sp)
  2d68: br zero, 2d24
Unrolled loop according to information from the loop predictor (assumed 4 instructions/line)
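How a predicted path could drive unrolling can be sketched in Python. The sketch assumes that each path string holds one taken/not-taken bit per conditional branch met in an iteration, which is an interpretation of the 001/000 fields and is not confirmed by the slides. The five-instruction loop body used in the usage note is hypothetical, and `follow_path` and `unroll` are illustrative names.

```python
def follow_path(body, head, tail, outcomes):
    """Return the addresses fetched in one iteration of the loop.

    body maps address -> (kind, target); kind is "cond", "jump", or "op".
    outcomes is a string of "1"/"0", one bit per conditional branch met
    (missing bits default to not taken).
    """
    trace, pc, bits = [], head, iter(outcomes)
    while True:
        kind, target = body[pc]
        trace.append(pc)
        if pc == tail:
            return trace                 # tail closes the iteration
        if kind == "cond" and next(bits, "0") == "1":
            pc = target                  # predicted-taken conditional branch
        elif kind == "jump":
            pc = target                  # unconditional branch inside the body
        else:
            pc += 4                      # fall through to the next instruction


def unroll(body, head, tail, paths, line_size=4):
    """Concatenate one trace per predicted iteration and pack into lines."""
    flat = [pc for path in paths for pc in follow_path(body, head, tail, path)]
    return [flat[i:i + line_size] for i in range(0, len(flat), line_size)]
```

For a hypothetical body with a conditional branch at 0x104 that skips 0x108 and a tail at 0x110, unrolling two iterations with paths "1" and "0" gives nine addresses packed into three lines, the last one partially filled. Packing across iteration boundaries is a design choice of this sketch; a line-aligned layout per iteration would waste slots but simplify indexing.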

Experimental approach
• Modified the SimpleScalar 3.0c Alpha EV6 pipeline to model the following features:
  – loop detection
  – loop prediction
  – loop cache filling
  – loop cache/I-cache multiplexing
  – loop termination detection
  – loop stack operations
  – loop table operations

Modifications to SimpleScalar (pipeline diagram)
• Pipeline stages: Fetch, Dispatch, Register scheduler, Memory scheduler, Exec, Mem, Write back, Commit
• Added components: Loop predictor, Loop cache
• Memory hierarchy: I-Cache (IL1) and ITLB, D-Cache (DL1) and DTLB, I-Cache (IL2), D-Cache (DL2), Virtual memory

Frequency of loop iterations (figure)

Frequency of dynamic loop body size (figure)

Results (figure legend)
• path-to-loop: prediction rate of entering the loop using the path-to-loop
• iterations: prediction rate of the number of iterations per entered loop visit
• path-in-itn: prediction rate of paths-in-iteration per loop iteration
• speedup: relative CPI gain compared to no loop prediction
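The metrics in this legend can be written out as a short Python sketch. The slides do not give the exact formulas, so both definitions below are assumptions: a prediction rate is taken as correct predictions over total predictions, and the relative CPI gain as the CPI reduction divided by the baseline CPI. The numbers in the usage note are made up for illustration and are not results from the study.

```python
def prediction_rate(correct, total):
    """Fraction of predictions that were correct (0.0 if none were made)."""
    return correct / total if total else 0.0


def relative_cpi_gain(cpi_baseline, cpi_loop):
    """Relative CPI gain of loop prediction over the no-prediction baseline.

    Assumed definition: (CPI_baseline - CPI_loop) / CPI_baseline.
    """
    return (cpi_baseline - cpi_loop) / cpi_baseline
```

With illustrative inputs, 90 correct predictions out of 100 give a rate of 0.9, and a CPI drop from 2.0 to 1.5 gives a relative gain of 0.25.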

Conclusions and current work
• More than 50% of loops have properties that make them highly predictable and attractive for aggressive fetching
• Current work: compare the efficiency of the loop cache against a trace cache
• Current work: propose a hybrid fetch approach that uses the loop cache for loop bodies and the trace cache for all instructions outside loops