Compiler Supports and Optimizations for PAC VLIW DSP

Compiler Supports and Optimizations for PAC VLIW DSP Processors
Y.-C. Lin, M.-Y. Hung, S.-Y. Chen, C.-L. Tang, Y.-P. You, C.-J. Wu, Y.-C. Moo, and J.-K. Lee
National Tsing-Hua University, Taiwan
LCPC 2005

Outline
• PAC VLIW DSP Architectures
• Optimization Issues
• Preliminary Compiler Supports
• Experimental Results
• Conclusion
12/18/2021 LCPC 2005

Introduction
• Parallel Architecture Core (PAC) is designed by the SoC Technology Center, ITRI, Taiwan.
  – 32-bit, fixed-point, 5-way issue VLIW DSP
    • scalable architecture
    • optimized instruction set for audio/video/image
    • innovative register file structure
  – two generations developed
    • TSMC's 0.13 μm technology (taped out in Aug. 2005)
• High-performance, low-power

Key Issues
• Deploy a general-purpose, high-performance open-source compiler for DSP processors
  – ORC → PAC DSP
• Address the issues raised by the fragmentary register banks of DSP processors
• Methods for irregular register constraints and instruction scheduling

PAC DSP Overview
• Five-Way Issue:
  – 1 Scalar/Control Unit (B)
  – 2 Arithmetic Units (I)
  – 2 Load/Store Units (M)
• Cluster Design:
  – Scalability
  – Explicit Inter-Cluster Data Transfer Instructions
• Distributed Register Files:
  – 5 Local Register Files (A, AC, R)
  – 2 Global Register Files (D)
• Other Features:
  – 8-bit/16-bit SIMD operations
  – Variable instruction word/bundle length
  – Dynamic Power Management
  – Standard AMBA interface
[Figure: block diagram showing the B-Unit, M-Units, and I-Units with the A, AC, R, and D registers, extensible to more clusters]

Ping-pong Register File Structure
• Used by Global Register File (D)
• Concept:
  – Overlap the processing of different data streams in a cluster
• Benefit:
  – Fewer register file ports, for lower power and smaller size
• Hence the name "ping-pong": the M-Unit and I-Unit operate on different data streams at the same time!
[Figure: timeline of overlapped Load, Compute, and Store phases on the M-Unit and I-Unit]

Ping-pong Register Access
• Each 'D' register file contains 2 banks.
• Rules:
  – Within a cycle, one unit's accesses to the 2 banks are mutually exclusive (a unit can use only one bank).
  – Within a cycle, the M-Unit and the I-Unit can only access different banks.
• Instructional switcher: only 1 state for each cycle!
[Figure: the two switcher states: M-Unit on Bank 1 with I-Unit on Bank 2, or M-Unit on Bank 2 with I-Unit on Bank 1]

Issues for Ping-pong Registers (1)
• Example for ping-pong usage:
  – Able to form a bundle:
      Lw D8, A0
      Add D1, D0, AC0
  – Unable to form a bundle:
      Lw D2, A0
      Add D1, D0, AC0
    These must be scheduled into 2 bundles since they use the same bank!
• For compiler optimizations:
  – Better register (file/bank) allocation
  – Better schedules in fewer bundles
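The bundling rule in the example above can be sketched as a small checker. This is a minimal Python sketch, not the compiler's actual logic: the function names are hypothetical, and the bank split (D0 to D7 in one bank, D8 and above in the other) is an assumption inferred from the slide's examples.

```python
import re

def d_bank(reg):
    """Return the assumed ping-pong bank (0 or 1) of a D register, else None.

    Assumption: D0-D7 sit in one bank and D8 and above in the other.
    A, AC, and R registers are not ping-pong registers.
    """
    m = re.fullmatch(r"D(\d+)", reg)
    if m is None:
        return None
    return 0 if int(m.group(1)) < 8 else 1

def banks_used(operands):
    """Set of D banks touched by one instruction's operands."""
    return {b for b in map(d_bank, operands) if b is not None}

def can_bundle(m_operands, i_operands):
    """Check whether an M-Unit and an I-Unit instruction may share a bundle."""
    m_banks = banks_used(m_operands)
    i_banks = banks_used(i_operands)
    # Rule 1: a single unit may touch only one bank in a cycle.
    if len(m_banks) > 1 or len(i_banks) > 1:
        return False
    # Rule 2: the M-Unit and I-Unit must use different banks.
    return not (m_banks & i_banks)

# The two cases from the slide:
print(can_bundle(["D8", "A0"], ["D1", "D0", "AC0"]))  # Lw D8, A0 with Add D1, D0, AC0
print(can_bundle(["D2", "A0"], ["D1", "D0", "AC0"]))  # Lw D2, A0 with Add D1, D0, AC0
```

A register allocator that knows this rule can steer the load's destination into the bank the I-Unit is not using, which is what lets both instructions fit in one bundle.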

Issues for Ping-pong Registers (2)
• Data transfer between ping-pong banks needs cross ping-pong communication!
  – Original sequence; the final Sub is an invalid operation because D8 and D1 lie in different banks:
      Lw D8, A0
      Add D1, D0, AC0
      Sw D1, A0
      Sub D9, D8, D1
  – An additional copy operation is needed:
      Mov AC1, D1
      Sw D1, A0
      Sub D9, D8, AC1
• For compiler optimizations:
  1. Handle data communication between ping-pong banks correctly under any code manipulation
  2. Generate as few additional copy operations as possible
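The copy insertion shown above can be sketched as a rewrite of a single instruction. This is a hedged Python illustration rather than the PAC compiler's method: `insert_copies` and `d_bank` are hypothetical names, the bank split (D0 to D7 versus D8 and above) is inferred from the slide's examples, and the scratch AC registers are simply assumed to be free.

```python
import re

def d_bank(reg):
    """Assumed ping-pong bank of a D register (D0-D7 -> 0, D8+ -> 1), else None."""
    m = re.fullmatch(r"D(\d+)", reg)
    if m is None:
        return None
    return 0 if int(m.group(1)) < 8 else 1

def insert_copies(op, dst, srcs, scratch_regs=("AC1", "AC2")):
    """Rewrite one instruction so that it touches a single D bank.

    The bank of the first D operand (destination first) is kept; every
    source in the other bank is first copied into a scratch register.
    Returns a list of (opcode, destination, sources) tuples.
    """
    banks = [b for b in map(d_bank, (dst, *srcs)) if b is not None]
    if not banks:
        return [(op, dst, tuple(srcs))]
    home = banks[0]
    free = iter(scratch_regs)
    moved = {}            # offending source -> scratch register holding its copy
    copies, new_srcs = [], []
    for s in srcs:
        b = d_bank(s)
        if b is not None and b != home:
            if s not in moved:
                scratch = next(free, None)
                if scratch is None:
                    raise ValueError("no scratch register left for copy")
                moved[s] = scratch
                copies.append(("Mov", scratch, (s,)))
            new_srcs.append(moved[s])
        else:
            new_srcs.append(s)
    return copies + [(op, dst, tuple(new_srcs))]

# The invalid Sub from the slide becomes a Mov followed by a valid Sub.
for instr in insert_copies("Sub", "D9", ["D8", "D1"]):
    print(instr)
```

Each inserted Mov costs an issue slot, which is why the slide asks for as few copies as possible; a real allocator would try to avoid the cross-bank operand in the first place.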

Issues for Inter-cluster Communication
• To exploit cluster parallelism:
  – PAC needs explicit instructions to be issued for inter-cluster communication!
• Optimize code partitioning:
  1. Less communication
  2. Better scheduling
[Figure: two partitionings of operations A–G across Cluster 1, Cluster 2, and the B-Unit; one of them requires an additional cross-cluster copy]
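One way to compare candidate partitionings is to count the explicit transfers each would need. The Python sketch below is only an illustration: the function name is hypothetical, and the dependence edges among A–G are invented for the example, since the slide's actual graph is not recoverable.

```python
def count_cross_cluster_copies(edges, cluster_of):
    """Count explicit inter-cluster transfers needed by a partitioning.

    edges      : iterable of (producer, consumer) dependences
    cluster_of : dict mapping each operation to its cluster
    A value sent to a given remote cluster is counted once, even if
    several operations there consume it.
    """
    transfers = {
        (src, cluster_of[dst])
        for src, dst in edges
        if cluster_of[src] != cluster_of[dst]
    }
    return len(transfers)

# Hypothetical dependence edges (NOT taken from the slide).
edges = [("A", "C"), ("B", "C"), ("C", "E"), ("D", "F"), ("E", "G"), ("F", "G")]

tight = {"A": 1, "B": 1, "C": 1, "E": 1, "D": 2, "F": 2, "G": 2}
scattered = {"A": 1, "B": 2, "C": 1, "E": 2, "D": 1, "F": 2, "G": 1}

print(count_cross_cluster_copies(edges, tight))
print(count_cross_cluster_copies(edges, scattered))
```

The copy count covers only the "less communication" goal; a partitioner would weigh it against the schedule length each partitioning allows.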

More Considerations
• Two optimized codes of the same performance:
[Figure: two code listings; surviving labels: "Upper", "Smaller code size", "Lower power consumption"]

Compiler Supports for PAC DSP
• Essential supports (IA-64 ORC → PAC)
  – New Target_Info
    • PAC architecture and ISA descriptions
    • Complicated hazard descriptions
  – PAC application binary interface (ABI)
    • data type mapping
    • memory usage layout
    • register usage conventions
    • calling conventions
  – PAC code generation
    • 32-bit WHIRL code generation
    • PAC WHIRL-to-CGIR procedures
  – PAC assembly code emission

Simulated-Annealing (SA) Based Register Allocation Approach
• Motivation:
  – Complex interference among:
    • Register allocation
    • Instruction scheduling
    • Code insertion for distributed register communication
  – We use a machine-learning method to obtain near-optimal results.
  – The results serve as a baseline reference for developing heuristic methods!

To Determine: Virtual Register File (Bank)
• Input: un-scheduled instructions
• Output: a schedule of the instructions and a register file assignment (RFA) map
  – RFA map = {(v1, f1), (v2, f2), ...}
    • where vi: a virtual register, fi: a register file (bank)
• PAC_Scheduler:
  – Graph-coloring based register allocation according to the RFA map
  – Instruction scheduling and code insertion for register file communication
• Set up SA:
  1. An initial random RFA map
  2. schedule_len = PAC_Scheduler(initial RFA map)
  3. SA control variables:
    • threshold
    • p_test: a probability test value (0 < p_test < 1)
    • energy: initial value > threshold

To Optimize: Scheduling Result
[Flowchart, listed as steps]
1. SA stop test: energy > threshold
   – yes: continue with step 2
   – no: output the final RFA map & schedule
2. Randomly change a mapping (vi, fi) to obtain a new RFA map
3. Re-run: new_schedule_len = PAC_Scheduler(new RFA map)
4. Better result test: new_schedule_len < schedule_len
   – yes: energy--; schedule_len = new_schedule_len; keep the new RFA map
   – no: go to the random test
5. Random test: a random number > p_test; selects between the new RFA map and the old RFA map (energy++)
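The SA loop can be sketched in Python as follows. This is a sketch under stated assumptions, not the authors' implementation: `toy_scheduler` is a hypothetical stand-in for PAC_Scheduler, the worse-result branch is assumed to accept the new map with energy++ when the random number exceeds p_test (the slide does not make this attachment clear), and the iteration cap and best-so-far tracking are additions to guarantee termination and a usable result.

```python
import random

def toy_scheduler(rfa_map, uses):
    """Hypothetical stand-in for PAC_Scheduler: returns a schedule length.

    Each instruction costs one bundle, plus one extra bundle (a copy)
    for every additional register file its virtual registers span.
    """
    length = 0
    for regs in uses:
        files = {rfa_map[v] for v in regs}
        length += 1 + (len(files) - 1)
    return length

def anneal(vregs, files, scheduler, threshold=0, energy=5, p_test=0.8,
           max_iters=1000, seed=0):
    """Search for an RFA map (virtual register -> register file) with a short schedule."""
    rng = random.Random(seed)
    rfa_map = {v: rng.choice(files) for v in vregs}    # initial random RFA map
    schedule_len = scheduler(rfa_map)
    best_map, best_len = dict(rfa_map), schedule_len   # addition: remember the best map
    iters = 0
    while energy > threshold and iters < max_iters:    # SA stop test (cap is an addition)
        iters += 1
        new_map = dict(rfa_map)
        new_map[rng.choice(vregs)] = rng.choice(files)  # randomly change one mapping
        new_len = scheduler(new_map)
        if new_len < schedule_len:                      # better result test
            rfa_map, schedule_len = new_map, new_len
            energy -= 1
        elif rng.random() > p_test:                     # random test (assumed branch)
            rfa_map, schedule_len = new_map, new_len    # accept a non-improving map
            energy += 1
        # otherwise: keep the old RFA map unchanged
        if schedule_len < best_len:
            best_map, best_len = dict(rfa_map), schedule_len
    return best_map, best_len

# Usage with four virtual registers and two hypothetical register files.
uses = [("v1", "v2"), ("v2", "v3"), ("v3", "v4")]
vregs = ["v1", "v2", "v3", "v4"]
best_map, best_len = anneal(vregs, ["bank0", "bank1"],
                            lambda m: toy_scheduler(m, uses))
print(best_map, best_len)
```

Because every candidate map is scored by running the whole scheduler, the search sees register allocation, scheduling, and copy insertion together, which is the interference the approach is meant to capture.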

Preliminary Experimental Results (DSPStone benchmarks)
[Figure: results on the DSPStone benchmarks]

Related Works
• Register Allocation
  – R. Leupers: Instruction scheduling for clustered VLIW DSPs. In Proc. Int'l Conference on Parallel Architecture and Compilation Techniques, pp. 291–300, Oct. 2000
• Register File Organizations
  – S. Rixner, W. J. Dally, B. Khailany, P. Mattson, U. J. Kapasi, and J. D. Owens: Register organization for media processing. International Symposium on High Performance Computer Architecture (HPCA), pp. 375–386, 2000
  – Tay-Jyi Lin, Chin-Chi Chang, Chen-Chia Lee, and Chein-Wei Jen: An Efficient VLIW DSP Architecture for Baseband Processing. Proceedings of the 21st International Conference on Computer Design, 2003

Conclusion
• We developed a compiler prototype for a new VLIW DSP architecture called PAC.
  – Based on ORC
  – New optimization issues raised by the irregular hardware design
    • Highly distributed register files
    • Port-access-restricted ping-pong structures
  – An SA approach is employed to obtain preliminary results on register allocation for PAC
• We will extend our work to the upcoming next version of the PAC DSP.