Facilitating Compiler Optimizations Through the Dynamic Mapping of Alternate Register Structures
CASES 2007
Florida State University
Chris Zimmer, Steve Hines, Prasad Kulkarni, Gary Tyson, David Whalley

Motivation
Embedded processors have fewer registers.
Compiler optimizations increase register pressure.
It is difficult to apply aggressive compiler optimizations on embedded systems.

Vector Multiply Example
Even before aggressive optimizations, 60% of available registers are already used.
Further optimizations like loop unrolling and software pipelining are inhibited.

    int A[1000], B[1000];
    void vmul() {
        int I;
        for (I=2; I < 1000; I++)
            B[I] = A[I] * B[I-2];
    }

    .L3:
        ldr r1, [r2, r3, lsl #2]
        ldr r12, [r4], #4
        mul r0, r12, r1
        str r0, [r5, r3, lsl #2]
        add r3, #1
        cmp r3, #1000
        blt .L3

Application Configurable Processors
Exploit common reference patterns found in code.
Small register files mimic these reference behaviors.
A map table provides register redirection.
The architecture is changed to add more registers with minimal impact on the ISA; in particular, the register operand size does not increase.

Architectural Modifications
[Figure: register file (R0, R1, R6, R15 labeled) with a map table whose R1 entry points to Q1. Customized register structures shown: Queue Q1, Queue Q2, Queue Q3, Stack Q4, Circular Buffer Q5.]

Software Pipelining
Software pipelining is not often found in embedded compilers.
Software pipelining reduces the overall cycle time of a loop.
Extracts iterations.
Consumes stalls.
Consumes registers!!

Software Pipelining Example

    int A[1000], B[1000], C[1000];
    void vmul() {
        int I;
        for (I=2; I < 1000; I++)
            B[I] = A[I] * C[I];
    }

Stalls present when the loop is run:

    .L3:
        ldr r1, [r2, r3, lsl #2]
        ldr r12, [r4], #4
        stall
        stall
        mul r0, r12, r1
        stall
        stall
        stall
        str r0, [r5, r3, lsl #2]
        add r3, #1
        cmp r3, #1000
        blt .L3

After software pipelining:

    .L3:
        ldr r1, [r2, r3, lsl #2]
        ldr r12, [r4], #4
        str r0, [r5, r3, lsl #2]
        add r3, #1
        mul r0, r12, r1
        cmp r3, #1000
        blt .L3

Instruction
Goal: Minimal modification to the existing instruction set. Single-cycle instruction latency.
Method: Add a single instruction to the ISA that is used to map and unmap a common register specifier into a customized register structure.

    qmap <Reg Specifier> <Custom reg map information> <Custom reg specifier>
    qmap r3, #4, q3
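The map and unmap behavior of qmap can be sketched in C. This is only an illustration: the MapTable layout, the UNMAPPED marker, and the rule that a second qmap to the same structure unmaps the register are assumptions made for the sketch, not details taken from the slides.

```c
#include <assert.h>
#include <stdbool.h>

#define NUM_REGS 16      /* ARM-style r0..r15 */
#define UNMAPPED (-1)    /* hypothetical marker: no mapping for this register */

/* Hypothetical map table: one entry per architected register specifier. */
typedef struct {
    int structure[NUM_REGS]; /* index of the custom structure, or UNMAPPED */
    int info[NUM_REGS];      /* the "#n" map information operand */
} MapTable;

void map_table_init(MapTable *t) {
    for (int r = 0; r < NUM_REGS; r++) {
        t->structure[r] = UNMAPPED;
        t->info[r] = 0;
    }
}

/* qmap rN, #info, qM
 * Assumed semantics: maps rN into qM, or unmaps rN if it is already
 * mapped into qM, so one instruction covers both directions. */
void qmap(MapTable *t, int reg, int info, int q) {
    assert(reg >= 0 && reg < NUM_REGS);
    if (t->structure[reg] == q) {
        t->structure[reg] = UNMAPPED;
    } else {
        t->structure[reg] = q;
        t->info[reg] = info;
    }
}

bool is_mapped(const MapTable *t, int reg) {
    return t->structure[reg] != UNMAPPED;
}
```

Under these assumptions, calling qmap(&t, 3, 4, 3) mirrors `qmap r3, #4, q3`, and issuing the same call again returns r3 to the ordinary register file.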

Architectural Modifications
An access to R0, which has no mapping in the table, would get the data from the register file.
R1 is mapped into Q1 and would retrieve its data from there.
[Figure: register file (R0, R1, R6, R15 labeled) with a map table whose R1 entry points to Q1. Customized register structures shown: Queue Q1, Queue Q2, Queue Q3, Destructive Queue Q4, Circular Buffer Q5.]

Software Pipelining Example

    int A[1000], B[1000], C[1000];
    void vmul() {
        int I;
        for (I=2; I < 1000; I++)
            B[I] = A[I] * C[I];
    }

    qmap r1, #2, q1
    qmap r12, #2, q2
    qmap r0, #3, q3

Prolog: 6 loads and 2 multiplies

Loop:

        ldr r1, [r2, r3, lsl #2]
        ldr r12, [r4], #4
        mul r0, r12, r1
        str r0, [r5, r3, lsl #2]
        add r3, #1
        cmp r3, #1000
        blt .L3

Epilog: 1 multiply and 3 stores

[Figure: example queue contents. Q1: 30 25 15 5; Q2: 34 2 1; Q3: 30 75 10 5.]
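The prolog, loop, and epilog structure can be illustrated in C with small arrays standing in for the register queues. This is a simplified sketch under stated assumptions: loads run a fixed two iterations ahead (DEPTH), and the names vmul_plain and vmul_pipelined are invented here. It does not reproduce the exact prolog and epilog counts shown on the slide.

```c
#include <assert.h>

#define N 1000
#define DEPTH 2   /* assumed: loads run two iterations ahead of the multiply */

/* Reference loop from the slide: B[I] = A[I] * C[I] for I = 2..N-1. */
void vmul_plain(const int *A, int *B, const int *C) {
    for (int i = 2; i < N; i++)
        B[i] = A[i] * C[i];
}

/* Same computation with the loads hoisted DEPTH iterations ahead.
 * qa and qc stand in for the queues that r1 and r12 would be mapped into. */
void vmul_pipelined(const int *A, int *B, const int *C) {
    int qa[DEPTH], qc[DEPTH];
    int head = 0;   /* slot holding the oldest in-flight operands */
    int i;

    /* Prolog: issue the loads for the first DEPTH iterations. */
    for (i = 2; i < 2 + DEPTH; i++) {
        qa[(i - 2) % DEPTH] = A[i];
        qc[(i - 2) % DEPTH] = C[i];
    }

    /* Kernel: consume the oldest operands, then reuse that slot for the
     * loads belonging to the iteration DEPTH steps ahead. */
    for (i = 2; i + DEPTH < N; i++) {
        B[i] = qa[head] * qc[head];
        qa[head] = A[i + DEPTH];
        qc[head] = C[i + DEPTH];
        head = (head + 1) % DEPTH;
    }

    /* Epilog: drain the operands that are still in flight. */
    for (; i < N; i++) {
        B[i] = qa[head] * qc[head];
        head = (head + 1) % DEPTH;
    }
}
```

The queue arrays play the role that extra renamed registers would otherwise play: each in-flight iteration keeps its own operand slot while the code names only one logical operand per array.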

Results – Multiplies
Varying multiply latency, load latency set at four.
[Chart not preserved.]

Results – Loads
Varying load latency, multiply latency set at four.
[Chart not preserved.]

Conclusions
Customized register structures reduce register pressure.
Software pipelining is viable in resource-constrained environments.
Performance can be improved with minor impact to the ISA.

Extras

Reference Behaviors
Stack Reference Behavior

    ldr r1, [r6, r4, lsl #4]
    ldr r12, [r6, r4, lsl #8]
    ldr r8, [r6, r4, lsl #12]
    str r8, [r3, r4, lsl #16]
    str r12, [r3, r4, lsl #20]
    str r1, [r3, r4, lsl #24]

Application Configurable Architecture
Application configurable processors are designed using a mapping table similar to the register rename table found in many out-of-order implementations. The map table is read during every access to the architected register file. This serves as a method of determining whether a register specifier refers to the original architected register file or to a customized register structure.
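The lookup described above can be sketched as a register read and write that consult the map table first. The Cpu and Queue types, the capacities, and the choice that a read from a mapped register removes the oldest queue entry are assumptions for illustration only.

```c
#include <assert.h>

#define NUM_REGS   16
#define NUM_QUEUES 4     /* hypothetical number of custom structures */
#define QCAP       8     /* hypothetical queue capacity */
#define UNMAPPED   (-1)

typedef struct {
    int data[QCAP];
    int head;    /* index of the oldest entry */
    int count;   /* number of valid entries */
} Queue;

typedef struct {
    int regs[NUM_REGS];         /* architected register file */
    int map[NUM_REGS];          /* UNMAPPED, or an index into queues[] */
    Queue queues[NUM_QUEUES];   /* customized register structures */
} Cpu;

void cpu_init(Cpu *c) {
    for (int r = 0; r < NUM_REGS; r++) {
        c->regs[r] = 0;
        c->map[r] = UNMAPPED;
    }
    for (int q = 0; q < NUM_QUEUES; q++) {
        c->queues[q].head = 0;
        c->queues[q].count = 0;
    }
}

/* Every write consults the map table before touching any storage. */
void write_reg(Cpu *c, int r, int value) {
    int q = c->map[r];
    if (q == UNMAPPED) {
        c->regs[r] = value;
        return;
    }
    Queue *qu = &c->queues[q];
    assert(qu->count < QCAP);
    qu->data[(qu->head + qu->count) % QCAP] = value;  /* enqueue at tail */
    qu->count++;
}

/* Every read consults the map table too. Assumed here: reading a mapped
 * register dequeues the oldest value. */
int read_reg(Cpu *c, int r) {
    int q = c->map[r];
    if (q == UNMAPPED)
        return c->regs[r];
    Queue *qu = &c->queues[q];
    assert(qu->count > 0);
    int value = qu->data[qu->head];
    qu->head = (qu->head + 1) % QCAP;
    qu->count--;
    return value;
}
```

Because the redirection happens inside read_reg and write_reg, the instruction encoding still carries only an ordinary register specifier, which is the property the slides emphasize.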

Application Configurable Architecture
The customized register files are small in size, but they efficiently manage the values that would otherwise require many architected registers. The customized register files can mimic queues, stacks, and circular buffers. These structures are accessed using the same register specifier that is used to access the architected register file.

Remove Reference Behaviors
Stack Reference Behavior

Before (uses R1, R12, and R8):

    ldr r1, [r6, r4, lsl #4]
    ldr r12, [r6, r4, lsl #8]
    ldr r8, [r6, r4, lsl #12]
    str r8, [r3, r4, lsl #16]
    str r12, [r3, r4, lsl #20]
    str r1, [r3, r4, lsl #24]

After (uses R1 only):

    ldr r1, [r6, r4, lsl #4]
    ldr r1, [r6, r4, lsl #8]
    ldr r1, [r6, r4, lsl #12]
    str r1, [r3, r4, lsl #16]
    str r1, [r3, r4, lsl #20]
    str r1, [r3, r4, lsl #24]

Free up r8 and r12 for use.
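The stack reference behavior can be modeled in C: three writes to one specifier followed by three reads return the values in last-in, first-out order, matching the load and store order of the original three-register code. The RegStack type, its capacity, and the helper names are hypothetical.

```c
#include <assert.h>

#define STACK_CAP 8   /* hypothetical capacity of the stack structure */

typedef struct {
    int data[STACK_CAP];
    int top;   /* number of values currently held */
} RegStack;

/* A write to the stack-mapped register pushes the value. */
void stack_write(RegStack *s, int value) {
    assert(s->top < STACK_CAP);
    s->data[s->top++] = value;
}

/* A read from the stack-mapped register pops the most recent value. */
int stack_read(RegStack *s) {
    assert(s->top > 0);
    return s->data[--s->top];
}

/* Three loads followed by three stores through one register specifier.
 * The last value loaded is the first one stored. */
void copy3_through_stack(const int *src, int *dst) {
    RegStack r1 = { {0}, 0 };
    for (int i = 0; i < 3; i++)
        stack_write(&r1, src[i]);    /* ldr r1, ... (three times) */
    for (int i = 0; i < 3; i++)
        dst[i] = stack_read(&r1);    /* str r1, ... (three times) */
}
```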

Remove Qmap Instruction

    qmap r1, #0, q0
    ldr r1, [r6, r4, lsl #4]
    ldr r1, [r6, r4, lsl #8]
    ldr r1, [r6, r4, lsl #12]
    str r1, [r3, r4, lsl #16]
    str r1, [r3, r4, lsl #20]
    str r1, [r3, r4, lsl #24]
    qmap r1, #0, q0

[Figure: R1 mapped into q0; R8 and R12 shown unused.]
Free up r8 and r12 for use.

Modulo Scheduling
For our work we used modulo scheduling. This requires using the dependences and latencies of the loop instructions to generate a modulo-scheduled loop. The prolog and epilog are then built from this schedule. The prolog and epilog require register renaming of loop-carried dependences to ensure a correct loop. Renaming in embedded processors is often not possible.

Register Renaming Due to Software Pipelining
Renaming doesn’t work… not enough registers.
Rotating registers would require a significant rewrite of the embedded ISA.
The loop-carried values can simply be mapped into a register queue to hold the value across several iterations.
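A C sketch of holding a loop-carried value in a queue, using the earlier B[I] = A[I] * B[I-2] loop. The two-entry array q stands in for a register queue; the function names and the depth of two (taken from the dependence distance I-2) are choices made for this illustration.

```c
#include <assert.h>

#define N 1000

/* Reference loop with a loop-carried dependence of distance two. */
void vmul_carried_plain(const int *A, int *B) {
    for (int i = 2; i < N; i++)
        B[i] = A[i] * B[i - 2];
}

/* Same loop, but B[i-2] is kept in a two-entry queue instead of being
 * reloaded from memory or spread across renamed registers. */
void vmul_carried_queued(const int *A, int *B) {
    int q[2] = { B[0], B[1] };   /* oldest value sits at q[head] */
    int head = 0;
    for (int i = 2; i < N; i++) {
        int v = A[i] * q[head];  /* read the oldest entry: B[i-2] */
        B[i] = v;
        q[head] = v;             /* the freed slot becomes the new tail */
        head ^= 1;               /* advance to the next-oldest entry */
    }
}
```

Only one logical name (q) appears in the loop body, which is the point of mapping the value into a queue rather than renaming it.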

Results: Register Savings
As latency grows for the instructions, more iterations of the loop are extracted to spread out the latency.
The extra registers that would be required to perform renaming have measured from 25% to 200% of the available registers in the ARM.
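Why longer latencies demand more registers can be seen from a common rule of thumb for modulo-scheduled loops: a value that stays live for `lifetime` cycles, in a loop that starts a new iteration every `ii` cycles, has about ceil(lifetime / ii) instances live at once. The sketch below encodes that rule. It is a general estimate, not the measurement method behind the figures above, and the function names are invented here.

```c
#include <assert.h>

/* Number of simultaneously live instances of one value when a new loop
 * iteration starts every `ii` cycles and the value is live for
 * `lifetime` cycles: ceil(lifetime / ii). */
int copies_needed(int lifetime, int ii) {
    assert(lifetime > 0 && ii > 0);
    return (lifetime + ii - 1) / ii;
}

/* Registers beyond the one the value already occupies, if the overlap
 * were handled by renaming. */
int extra_registers(int lifetime, int ii) {
    return copies_needed(lifetime, ii) - 1;
}
```

For instance, under this estimate a value that is live for 8 cycles in a loop issuing an iteration every 2 cycles needs 4 copies, so renaming would cost 3 additional registers for that one value.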