
A Framework for Studying Effects of VLIW Instruction Encoding and Decoding Schemes
Anup Gangwar
Embedded Systems Group, IIT Delhi
November 28, 2001


Overview
• The VLIW code size expansion problem
• What such a framework needs to support
• Trimaran compiler infrastructure
• The HPL-PD architecture
• Extensions to the various modules of Trimaran
• Results
• Future work
• Acknowledgements


Choices for exploiting ILP
• The architectural choices for utilizing ILP
– Superscalar processors
  • Try to extract ILP at run time
  • Complex hardware
  • Limited clock speeds and high power dissipation
  • Not suited for embedded types of applications
– VLIW processors
  • Compiler has a lot of knowledge about the hardware
  • Compiler extracts ILP statically
  • Simplified hardware
  • Possible to attain higher clock speeds


Problems with VLIW processors
• Complex compiler required to extract ILP from the application program
• Requires adequate support in hardware for compiler-controlled execution
• Code size expansion due to explicit NOPs if,
– The application does not contain enough parallelism
– The compiler is not able to extract parallelism from the application
• Need for good instruction encoding and NOP compression schemes


What should such a framework support?
• Quick retargetability
• Studying the effect of a particular instruction encoding and decoding scheme on processor performance
• Studying the code size minimization due to a particular instruction encoding scheme
• Studying the memory bandwidth requirements imposed by a particular instruction decoding scheme


Trimaran Compiler Infrastructure
C Program → IMPACT → Bridge Code → ELCOR → ELCOR IR → SIMULATOR → STATISTICS (parameterized by the HMDES Machine Description)
• IMPACT: ANSI C parsing, code profiling, classical machine-independent optimizations, basic block formation
• ELCOR: machine-dependent code optimizations, code scheduling, register allocation, ELCOR IR to low-level C files
• SIMULATOR: HPL-PD virtual machine, cache simulation, performance statistics
• STATISTICS: compute and stall cycles, cache stats, spill code info


Various modules of Trimaran - 1
• IMPACT
– Developed by UIUC's IMPACT group
– Trimaran uses only the IMPACT front-end
– Classical machine-independent optimizations
– Outputs a low-level IR, the Trimaran bridge code
• ELCOR
– Developed by HPL's CAR group
– It is the compiler backend
– Performs register allocation and code scheduling
– Parameterized by the HMDES machine description
– Outputs ELCOR IR with annotated HPL-PD assembly


Various modules of Trimaran - 2
• HMDES
– Developed by UIUC's IMPACT group
– Specifies resource usage and latency information for an architecture
– Input is translated to a low-level representation
– Has efficient mechanisms for querying the database
– Does not specify instruction format information
• HPL-PD Simulator
– Developed by NYU's REACT-ILP group
– Converts ELCOR's annotated IR to a low-level C representation
– Processor performance and cache simulation
– Generates statistics and an execution trace


Various modules of Trimaran - 3
Example ELCOR operation in IR:

  Op 7 ( ADD_W
    [ br<11 : I gpr 14> ]
    [ br<27 : I gpr 14> I<1> ]
    p<t>
    s_time( 3 )
    s_opcode( ADD_W.0 )
    attr( lc ^52 )
    flags( sched )
  )


Various modules of Trimaran - 4
• HMDES Sections
– Field_Type, e.g. REG, Lit etc.
– Resource, e.g. Slot0, Slot1 etc.
– Resource_Usage, e.g. RU_slot0 time( 0 )
– Reservation_Table, e.g. RT_slot0 use( Slot0 )
– Operation_Latency, e.g. lat1 ( time( 1 ) )
– Scheduling_Alternative, e.g. ( format( std1 ) resv( RT1 ) latency( lat1 ) )
– Operation, e.g. ADD_W.0 ( Alt_1 Alt_2 )
– Elcor_Operation, e.g. ADD_W( op( "ADD_W.0" "ADD_W.1" ) )


Various modules of Trimaran - 5
HPL-PD simulator in detail:
REBEL → Code Processor (parameterized by HMDES) → Low-level C files → Native Compiler (linked against the C libraries and the Emulation Library) → Executable for the host platform


Various modules of Trimaran - 7
HPL-PD simulator in detail:
• HPL-PD virtual machine: fetch next instruction, fetch data, execute instruction
• Instruction accesses and data accesses go to the Dinero IV cache simulator
• Cache hierarchy: Level I instruction cache, Level I data cache, Level II unified cache


The HPL-PD architecture
• Parameterized ILP architecture from HP Labs
• Possible to vary,
– Number and types of FUs
– Number and types of registers
– Width of instruction words
– Instruction latencies
• Predicated instruction execution
• Compiler-visible cache hierarchy
• Result multicast is supported for predicate registers
• Run-time memory disambiguation instructions


The HPL-PD memory hierarchy
Registers ↔ L1 Cache / Data Prefetch Cache ↔ L2 Cache ↔ Main Memory
Data Prefetch Cache:
• Independent of the L1 cache
• Used to store large amounts of cache-polluting data
• Doesn't require a sophisticated cache replacement mechanism


The Framework
• The decoder model and HMDES parameterize TRIMARAN, which produces performance stats and cache stats
• The ASSEMBLER (using NJMC) produces the object file, from which the code size is measured
• The DISASSEMBLER (using NJMC) services instruction-address / next-instruction requests and reports the bytes fetched


Studying impact on performance
• The HMDES modeling of the decompressor,
– Add a new resource with the latency of the decoder
– Add a new resource usage section for this decoder
– Add this resource usage to all the HPL-PD operations
• In the results there are two decompressor units with latency = 1
• The latency of the decompressor should be estimated or generated using actual simulation


Studying code size minimization - 1
A simple template-based instruction encoding scheme:
• Issue slots: IALU.0, IALU.1, FALU.0, MU.0, BU.0
• Format (ADD_W and L_W_C1): MUL_OP field followed by OPCODE & OPERANDS, e.g. 00010 | IOP ; Sgpr1, Slit1, Dgpr2 | MemOP ; Sgpr1, Dgpr1 ...
• Multi-ops are decided after profiling the generated assembly code
• The multi-op field encodes:
– Size and position of each uni-op
– Number, size and position of the operands of each uni-op


Studying code size minimization - 2
• Instrumenting ELCOR to generate assembly code:
1. Arrange all the ops in the IR in forward control order
2. Choose the next basic block and initialize cycle to 0
3. Walk the ops of this BB and dump those with s_time = cycle
4. If BBs are left, go to step 2
5. Dump the global data
• The actual instruction encoding is done using procedures created by NJMC


Studying code size minimization - 3
The New Jersey Machine Code Toolkit:
• Deals with bits at a symbolic level
• Can be used to write assemblers, disassemblers etc.
• Supports concatenation to emit large binary data
• Representation is specified in SLED
• Has been used to write assemblers for SPARC, i486 etc.
• VLIW instructions need to be broken up into 32-bit (max) tokens
• Emitted binary data must end on an 8-bit boundary


Studying code size minimization - 4
Machine specifications in SLED (bit 0 is least significant):

  fields of TOK32 (32)
    Dgpr_1        0:3
    Slit_1_part1  4:31
  fields of TOK8 (16)
    Slit_1_part2  0:3
    Sgpr_1        4:7
    IOP           8:11
    tmpl          12:14

  patterns
    IOP_pats is any of [ ADD MUL SUB ],
      which is tmpl = 1 & IOP = { 0 to 2 }

  constructors
    IOP_pats Sgpr_1, Slit_1, Dgpr_1 is
      IOP_pats & Sgpr_1 & Slit_1_part2 = Slit_1@[28:31];
      Slit_1_part1 = Slit_1@[0:27] & Dgpr_1


Studying code size minimization - 5
Toolkit encoder output:

  ADD( unsigned Sgpr_1, unsigned Slit_1, unsigned Dgpr_1 );
  MUL( unsigned Sgpr_1, unsigned Slit_1, unsigned Dgpr_1 );
  SUB( unsigned Sgpr_1, unsigned Slit_1, unsigned Dgpr_1 );

Specifying a matcher for the disassembler:

  match
  | ADD( Sgpr_1, Slit_1, Dgpr_1 ) => // Do something
  | MUL( Sgpr_1, Slit_1, Dgpr_1 ) => // Do something
  | SUB( Sgpr_1, Slit_1, Dgpr_1 ) => // Do something
  endmatch


Studying code size minimization - 6
• The matcher application needs functions for fetching data
• Bit ordering is different on little- and big-endian machines
• The matcher fails when a large number of complex templates are given
• Breaking large multi-ops across 32-bit tokens makes the representation messy and error prone
• Specifying addresses of forward branches requires two passes


Studying impact on memory bandwidth - 1
The typical VLIW pipeline:
Instruction Fetch → Align → Decompress → Instruction Decode → DF/AG → Execute → Store Results


Studying impact on memory bandwidth - 2
• The cache simulation requires the generation of,
– The instruction address
– The number of bytes to fetch
• The instruction address can be generated by disassembling the instructions at run time and keeping track of jumps
• The matcher application returns the number of bytes required to disassemble an instruction
• The disassembled instruction can be compared with the issued instruction to check correctness


Studying impact on memory bandwidth - 3
• Run-time verification of disassembled instructions can be turned off for faster simulation
• Due to the restricted size of the matcher, results could not be obtained for larger programs
• Memory access addresses and bytes to fetch have been generated by hand for the SumToN application

Results - Impact on code size (Strcpy)
[chart omitted]

Results - Impact on code size (SumToN)
[chart omitted]

Results - Size of SLED specification for various archs.
[chart omitted]

Results - Cache performance comparison (SumToN)
[chart omitted]


Future work
• Need for automation in most parts of the framework
• Better representation for VLIW instructions than SLED
– Unlimited token size
– Facility to bind one field with multiple patterns
• Methodology for predicting the latency of the decompressor
• Framework for finding optimal instruction formats


Acknowledgements
• Prof. M. Balakrishnan and Prof. Anshul Kumar
• Rodric M. Rabbah, Georgia Institute of Technology
• Shail Aditya, HP Labs
• All the friends at Philips Lab for stimulating discussions