System-level Exploration for Pareto-optimal Configurations in Parameterized Systems-on-a-chip

System-level Exploration for Pareto-optimal Configurations in Parameterized Systems-on-a-chip Architectures
Tony Givargis (Frank Vahid, Jörg Henkel)
Center for Embedded Computer Systems
University of California
Irvine, CA 92697
givargis@ics.uci.edu

Overview
  • Given:
    – Parameterized SOC architecture
    – Fixed application
  • Automatically explore the design space
  • Find optimal points with respect to power and performance
[Figure: an application (void main(){ while(1){ Receive(); Decode(); Display(); } }) and an SOC with CPU, I$-D$, Memory, BRIDGE, JPEG CODEC, UART, and Math/FPU feed an "Explore" step; example cache parameters: Size = {1 K, 4 K, 8 K}, Line = {4, 8, 16}, Assoc = {1, 2, 4}]
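
To make the notion of a configuration space concrete, the cache parameters shown on the overview slide can be enumerated directly. This is only a sketch: the dictionary keys and the function name are illustrative choices, not names from the original work.

```python
from itertools import product

# Parameter domains copied from the overview slide (size in K, as written there).
# The key names are hypothetical labels chosen for this sketch.
CACHE_PARAMS = {
    "size_k": [1, 4, 8],
    "line": [4, 8, 16],
    "assoc": [1, 2, 4],
}

def enumerate_configs(params):
    """Return every assignment of one value to each parameter."""
    names = list(params)
    domains = (params[name] for name in names)
    return [dict(zip(names, values)) for values in product(*domains)]

configs = enumerate_configs(CACHE_PARAMS)
print(len(configs))  # 3 * 3 * 3 combinations
```

Even this single cache yields 27 points; each additional parameter multiplies the count, which is why exhaustive evaluation of the whole SOC quickly becomes impractical.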

Motivation
  • Design trends:
    – Growing demand for portable devices
    – Growing demand for low-power design
    – Increased application complexity
    – Shrinking time-to-market windows
  • Technology trends:
    – Increased chip capacity
    – Increased I/O pins
    – Improved on-chip integration techniques (storage, digital, analog, …)
    – SOC era
Need for greater designer productivity!

Motivation
  • One approach: reuse of existing IP
    – IP selection?
    – IP integration?
    – SOC verification?
    – Multi-source IP licensing
    – More…
[Figure: available IP cores (JPEG CODEC 1, JPEG CODEC 2, MIPS, ARM, USB, UART, AMBA BRIDGE, ISA BRIDGE, SRAM, DRAM, MMX, Math/FPU) and an SOC with open choices (?) for CPU, BRIDGE, JPEG CODEC, Memory, Math/FPU, and UART]

Motivation
  • Alternate approach: reuse of SOC
    – Designed, integrated, tested
    – Domain specific
    – Parameterized
  • Designed by firms specializing in SOC
  • User: map application, then "configure-and-execute"
  • Parameterized SOC (successors to microcontrollers!)
[Figure: parameterized SOC with CPU, Memory, BRIDGE, MMX, JPEG CODEC, Math/FPU, UART]

Motivation
  • Composed of 100s of cores
  • Cores are "configurable"
  • Configurations impact power/performance
  • Large number of total configurations!
Architecture is otherwise fixed!
[Figure: parameterized SOC with CPU, Memory, BRIDGE, MMX, JPEG CODEC, Math/FPU, UART]

Motivation
  • ATI Technologies – XILLEON™ 220 SOC for the digital set-top box market
  • Tensilica – Xtensa™ 1040 configurable processor cores
  • Philips Semiconductors – Velocity RSP 9™ SOC platforms
  • Adelante Technologies – offers complete SOC customizable platforms for DSP domains
  • More…

Outline
  • Previous work
  • Target architecture
  • Power/performance estimation
  • Parameter space exploration
  • Experiments
  • Conclusion

Previous Work
  • Parameterized SOC design
    – [Malik 00], [Veidenbaum 99], [Vahid 99], [Stan 95]
  • Power/performance evaluation
    – [Barndolese 00], [Simunic 99], [Li 98], [Tiwari 94]
  • Design space exploration (manual)
    – [Givargis 99], [Lieverse 99]
  • Design space exploration (automatic)
    – Focus of this work…

Previous Work
[Figure: Y-chart [Lieverse 99], with blocks labeled Application, Application, Architecture, Mapping, Analysis, and Numbers; annotation: Auto]

Target Architecture
[Figure: block diagram with MIPS, I-Cache, D-Cache, Memory, Bridge, Peripheral Bus, UART, DMA, and DCT CODEC]

Target Architecture
Configurable parameters:
  • Voltage scale
  • Size, line, associativity
  • Bus width, encoding (gray, invert)
  • UART tx/rx buffer size
  • DCT resolution
[Figure: block diagram with MIPS, I-Cache, D-Cache, Memory, Bridge, Peripheral Bus, UART, DMA, and DCT CODEC]

Target Architecture
  • 26 parameters
  • 10^14 configurations
  • What are the optimal configurations (given a fixed application)?
[Figure: block diagram with MIPS, I-Cache, D-Cache, Memory, Bridge, Peripheral Bus, UART, DMA, and DCT CODEC]

Problem Summary
  • What are the possible power/performance tradeoffs? (100 trillion configurations)
    → Need to efficiently evaluate power/performance (1/sec: 150,000 years)
    → Need to explore the configuration space
[Figure: parameterized SOC with CPU, Memory, BRIDGE, MMX, JPEG CODEC, Math/FPU, UART]

Power Evaluation
  • Exploration works with:
    – Chip instrumentation (real-time)
    – System-level simulation
    – RTL simulation
    – Gate-level simulation
    – Circuit-level simulation
  • Relative accuracy required!
[Figure: chart with values 1, 440, 5400, 28800, and 180000; caption: digital camera application mapped on our SOC, capturing 1 image]

Power Evaluation – Processor
  • Instruction-level, per [Tiwari 94/00]
  • Measure watts per instruction
  • Account for stalls + dependency
  • Apply traces
[Figure: target architecture block diagram (I-Cache, MIPS, D-Cache, Memory, Bridge, Peripheral Bus, UART, DMA, DCT CODEC)]
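
The instruction-level idea can be illustrated with a short sketch: a per-instruction base cost is summed over a trace, with extra terms for adjacent-instruction effects and stall cycles. The cost tables, the stall cost, and the function name below are made-up placeholders for illustration, not measured values or the exact model of [Tiwari 94/00].

```python
# Hypothetical per-instruction base costs (arbitrary energy units).
BASE_COST = {"add": 1.0, "load": 2.5, "mul": 3.0}

# Hypothetical extra cost when one instruction directly follows another.
PAIR_OVERHEAD = {("add", "mul"): 0.5, ("mul", "add"): 0.5}

# Hypothetical cost of one stall cycle.
STALL_COST = 0.75

def estimate_energy(trace, stall_cycles):
    """Sum base costs, adjacent-pair overheads, and stall costs over a trace."""
    base = sum(BASE_COST[op] for op in trace)
    overhead = sum(PAIR_OVERHEAD.get(pair, 0.0) for pair in zip(trace, trace[1:]))
    return base + overhead + STALL_COST * stall_cycles

example_trace = ["add", "mul", "add", "load"]
example_energy = estimate_energy(example_trace, stall_cycles=2)
```

Because the tables are measured once per processor, evaluating a new trace only needs a single pass over the instructions, which is what keeps this style of estimation fast.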

Power Evaluation – Cache/Mem.
  • [Evans 95]
  • Capacitance model of subcomponents
  • Switching obtained via simulation (parameter dependent)
[Figure: target architecture block diagram (I-Cache, MIPS, D-Cache, Memory, Bridge, Peripheral Bus, UART, DMA, DCT CODEC)]
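
A capacitance-plus-switching model is commonly written as the first-order CMOS relation P = activity × C × V² × f. The sketch below applies that general relation to a few cache subcomponents; it is not claimed to be the specific model of [Evans 95], and the subcomponent names, capacitances, and activity factors are invented for illustration.

```python
def dynamic_power(capacitance_f, voltage_v, frequency_hz, activity):
    """First-order switching power: activity * C * V^2 * f (watts)."""
    return activity * capacitance_f * voltage_v ** 2 * frequency_hz

def cache_power(subcomponents, voltage_v, frequency_hz):
    """Sum switching power over (capacitance, activity) pairs per subcomponent."""
    return sum(
        dynamic_power(cap, voltage_v, frequency_hz, act)
        for cap, act in subcomponents.values()
    )

# Hypothetical subcomponents: name -> (capacitance in farads, activity factor).
# In practice the activity factors would come from simulation of the application.
EXAMPLE_CACHE = {
    "decoder": (2e-12, 0.25),
    "tag_array": (4e-12, 0.5),
    "data_array": (8e-12, 0.5),
}

example_cache_power = cache_power(EXAMPLE_CACHE, voltage_v=2.0, frequency_hz=1e6)
```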

Power Evaluation – Buses
  • [Chern 92]
  • Model bus capacitance
  • Switching derived from I/O traffic (parameter dependent)
[Figure: target architecture block diagram (I-Cache, MIPS, D-Cache, Memory, Bridge, Peripheral Bus, UART, DMA, DCT CODEC)]
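
Bus switching depends on how values are encoded on the wires, which is why encoding appears as a bus parameter. As a small illustration, the sketch below counts bit toggles for a run of sequential addresses under plain binary and under Gray encoding; the helper names and the example traffic are assumptions of this sketch.

```python
def toggles(words):
    """Count bit flips between consecutive words driven on a bus."""
    return sum(bin(a ^ b).count("1") for a, b in zip(words, words[1:]))

def to_gray(n):
    """Standard binary-reflected Gray code of a non-negative integer."""
    return n ^ (n >> 1)

# Hypothetical traffic: eight sequential addresses.
addresses = list(range(8))

binary_toggles = toggles(addresses)
gray_toggles = toggles([to_gray(a) for a in addresses])
print(binary_toggles, gray_toggles)
```

For sequential traffic, Gray code flips exactly one wire per transfer, so its toggle count is lower than plain binary; for random traffic the advantage largely disappears, which is why the best encoding is application dependent.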

Power Evaluation – Peripherals
  • Observation: cores execute instructions!
    → Apply a technique similar to that used for processors!
[Figure: target architecture block diagram (I-Cache, MIPS, D-Cache, Memory, Bridge, Peripheral Bus, UART, DMA, DCT CODEC)]

Power Evaluation – Summary
[Figure: target architecture annotated with per-core percentages: I-Cache (8%), MIPS (10%), D-Cache (8%), Memory (8%), Bridge (5%), Peripheral Bus, UART (5%), DMA (5%), DCT CODEC (5%)]
~50-100 K instructions/second! (Platune)

Exploration
Problem formulation
  • Parameters: P1, P2, …, Pn
  • A configuration (point) is an assignment of values to all parameters
  • How to efficiently generate all Pareto-optimal configurations?
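
For reference, Pareto optimality over two minimized metrics can be stated in a few lines. The sketch assumes each evaluated configuration is summarized as an (execution time, power) pair; the function names are illustrative only.

```python
def dominates(a, b):
    """True if point a is no worse than b in both metrics and differs from b.

    Points are (execution_time, power) tuples; lower is better for both.
    """
    return a[0] <= b[0] and a[1] <= b[1] and a != b

def pareto_optimal(points):
    """Keep the points that no other point dominates (quadratic brute force)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]

# Hypothetical evaluation results.
sample = [(1, 5), (2, 3), (3, 4), (4, 1)]
front = pareto_optimal(sample)
print(front)
```

Here (3, 4) is dropped because (2, 3) is both faster and lower power; the remaining points are genuine tradeoffs.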

Exploration
Algorithm Idea
  • A and B interdependent: A (10) * B (32) = 320 points
  • A and C are independent: A (10) + C (32) = 42 points
  • C and B are independent: C (32) + B (32) = 64 points
  • Exhaustive: A (10) * B (32) * C (32) = 10240 points
  • With knowledge about dependency we prune to 138 points (98.6% pruned)
[Figure: directed graph over A, B, and C]
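
The counts on this slide follow from one observation: interdependent parameters must be explored jointly (sizes multiply), while independent parameters can be explored separately (sizes add). The sketch below reproduces that arithmetic; the function names are illustrative, and the 138-point figure is taken from the slide rather than derived here.

```python
from math import prod

def joint_points(*sizes):
    """Points to evaluate when the parameters must be explored together."""
    return prod(sizes)

def separate_points(*sizes):
    """Points to evaluate when each parameter can be explored on its own."""
    return sum(sizes)

A, B, C = 10, 32, 32  # number of values per parameter, from the slide

exhaustive = joint_points(A, B, C)
# Fraction of the exhaustive space that is never evaluated if only the
# 138 points quoted on the slide are visited.
pruned_fraction = 1 - 138 / exhaustive
print(exhaustive, joint_points(A, B), separate_points(A, C), separate_points(C, B))
```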

Exploration
  • A → B: the Pareto-optimal configurations of B are calculated after the Pareto-optimal configurations of the nodes along the path A → B
  • A → B → A (cycle): the Pareto-optimal configurations of all the parameters on the cycle are calculated simultaneously
  • A (isolated node): its Pareto-optimal configurations are calculated in isolation

Exploration
Dependency Graph
[Figure: dependency graph over nodes A through Z]

Node  Core            Parameter
A     MIPS            Voltage scale
B     I$              Total size
C     I$              Line size
D     I$              Associativity
E     D$              Total size
F     D$              Line size
G     D$              Associativity
H     CPU I$ bus      Data bus width
I     CPU I$ bus      Data bus code
J     CPU I$ bus      Addr bus width
K     CPU I$ bus      Addr bus code
L     CPU D$ bus      Data bus width
M     CPU D$ bus      Data bus code
N     CPU D$ bus      Addr bus width
O     CPU D$ bus      Addr bus code
P     I/D$ Mem bus    Data bus width
Q     I/D$ Mem bus    Data bus code
R     I/D$ Mem bus    Addr bus width
S     I/D$ Mem bus    Addr bus code
T     Peripheral bus  Data bus width
U     Peripheral bus  Data bus code
V     Peripheral bus  Addr bus width
W     Peripheral bus  Addr bus code
X     UART            Tx buffer size
Y     UART            Rx buffer size
Z     DCT CODEC       Pixel resolution

Exploration
Dependency graph
  • Based on designer knowledge
  • Computed by simulating all pairs of nodes (approximately quadratic time complexity)
  • One-time effort
[Figure: dependency graph over nodes A through Z]

Exploration – Algorithm
Step 1: Clustering followed by simulation
[Figure: dependency graph with nodes A through Z grouped into clusters]
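
One plausible reading of the clustering step, given that parameters on a cycle are explored simultaneously, is to group nodes that can reach each other in the dependency graph (strongly connected components). The sketch below does this with simple reachability, which is adequate for a 26-node graph. Both this interpretation and the tiny example graph are assumptions, not the edges of the actual dependency graph.

```python
def reachable(graph, start):
    """Return the set of nodes reachable from start (including start)."""
    seen = {start}
    stack = [start]
    while stack:
        node = stack.pop()
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

def clusters(graph):
    """Group nodes that are mutually reachable, i.e. that lie on a common cycle."""
    nodes = set(graph) | {n for targets in graph.values() for n in targets}
    reach = {n: reachable(graph, n) for n in nodes}
    result, assigned = [], set()
    for n in sorted(nodes):
        if n in assigned:
            continue
        group = sorted(m for m in reach[n] if n in reach[m])
        result.append(group)
        assigned.update(group)
    return result

# Hypothetical graph: A and B depend on each other, C depends on B, D is isolated.
example_graph = {"A": ["B"], "B": ["A", "C"], "C": [], "D": []}
example_clusters = clusters(example_graph)
print(example_clusters)
```

Each resulting cluster is then small enough to explore exhaustively by simulation before the clusters are merged.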

Exploration – Algorithm
Step 2: Pair-wise merge followed by simulation
[Figure: merge tree over the clusters]
  • Initial clusters: {A, H, I}, {B, C, D, E, F, G}, {J, K, T, U}, {Z}, {L, M, P, Q}, {N, O, V, W}, {X, Y, R, S}
  • Intermediate merges: {A, H, I, B, C, D, E, F, G}; {J, K, T, U, Z}; {A, H, I, B, C, D, E, F, G, J, K, T, U, Z}; {L, M, P, Q, N, O, V, W, X, Y, R, S}
  • Final merge: {A, H, I, B, C, D, E, F, G, J, K, T, U, Z, L, M, P, Q, N, O, V, W, X, Y, R, S}
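
A merge step can be sketched as follows: only the Pareto-optimal configurations of the two clusters are combined, each combination is re-simulated, and the result is filtered again. The `simulate` callback, the toy cost formulas, and the parameter names `x` and `y` are hypothetical stand-ins for the real system-level simulator and parameters.

```python
from itertools import product

def pareto_front(points):
    """Keep non-dominated (config, time, power) entries; lower is better."""
    ordered = sorted(points, key=lambda p: (p[1], p[2]))
    front, best_power = [], float("inf")
    for config, time, power in ordered:
        if power < best_power:
            front.append((config, time, power))
            best_power = power
    return front

def merge_clusters(front_a, front_b, simulate):
    """Combine two clusters' Pareto sets, re-evaluate, and re-filter."""
    candidates = []
    for (cfg_a, _, _), (cfg_b, _, _) in product(front_a, front_b):
        merged = {**cfg_a, **cfg_b}
        time, power = simulate(merged)
        candidates.append((merged, time, power))
    return pareto_front(candidates)

def toy_simulate(cfg):
    """Hypothetical cost model standing in for a real simulation run."""
    return 12 / (cfg["x"] + cfg["y"]), cfg["x"] ** 2 + cfg["y"]

# Hypothetical per-cluster Pareto sets (the stored metrics are not reused).
front_x = [({"x": 1}, 0.0, 0.0), ({"x": 2}, 0.0, 0.0)]
front_y = [({"y": 1}, 0.0, 0.0), ({"y": 3}, 0.0, 0.0)]
merged_front = merge_clusters(front_x, front_y, toy_simulate)
```

The saving comes from the cross product being taken over Pareto sets, which are typically far smaller than the clusters' full configuration spaces.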

Exploration
Exhaustive solution
  • Evaluate all points
  • Sort by decreasing execution time
  • Walk through the space, eliminate points with power > minimum seen so far!
  • Substitute heuristics (exhaustive search only works for 1-4 parameters!)
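
The sort-and-walk filter can be sketched in a few lines. Note that this sketch sorts by increasing execution time, which is equivalent to walking a decreasing-time list from its fast end: under that order, any point whose power is not below the minimum seen so far is dominated by an earlier, faster point. The direction of the walk is an assumption of this sketch.

```python
def sweep_pareto(points):
    """Return Pareto-optimal (execution_time, power) points via sort and sweep.

    Points are visited from fastest to slowest; a point survives only if it
    uses less power than every faster (or equally fast) point seen before it.
    """
    ordered = sorted(points)  # by time, ties broken by power
    front = []
    min_power = float("inf")
    for time, power in ordered:
        if power < min_power:
            front.append((time, power))
            min_power = power
    return front

# Hypothetical evaluation results.
evaluated = [(3, 4), (4, 1), (1, 5), (2, 3), (2, 6)]
swept = sweep_pareto(evaluated)
print(swept)
```

After the initial evaluation of all points, the filter itself costs one sort plus a linear pass, so the simulations dominate the running time.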

Exploration
  • Complexity: O((K + log(K)) * 2^(N/K))
  • K is the number of clusters
  • N is the number of parameters
  • 2^(N/K) bounds the exhaustive computation
  • (K + log(K)) bounds the number of iterations
  • Worst case K = 1, best case K = N
  • 2^(N/K) decreases rapidly as K increases (e.g., 2^(26/2) + 2^(26/2) is much smaller than 2^26!)
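
The example on this slide is easy to check numerically. The sketch assumes two values per parameter and that N is divisible by K, matching the 2^(N/K) term; the function name is illustrative.

```python
def clustered_points(n_params, n_clusters):
    """Points evaluated when K equal clusters are each explored exhaustively.

    Assumes two values per parameter and that n_params is divisible by
    n_clusters, as in the 2^(N/K) term on the slide.
    """
    if n_params % n_clusters != 0:
        raise ValueError("n_params must be divisible by n_clusters")
    return n_clusters * 2 ** (n_params // n_clusters)

single_cluster = clustered_points(26, 1)  # 2^26
two_clusters = clustered_points(26, 2)    # 2^13 + 2^13
print(single_cluster, two_clusters)
```

Splitting 26 parameters into just two clusters already reduces the exhaustive work from about 67 million points to about 16 thousand.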

Exploration – Results
JPEG
  • Exploration time: 29.1 min
  • Configurations visited: 12352 (141)
  • 5.10x execution time
  • 7.51x power
  • 2.73x energy
  • Pruning ratio > 0.99997

Exploration – Results
CKEY
  • Exploration time: 108 min
  • Configurations visited: 15890 (223)
  • 8.31x execution time
  • 6.08x power
  • 2.57x energy
  • Pruning ratio > 0.99993

Exploration – Results
IMAGE
  • Exploration time: 50.2 min
  • Configurations visited: 10135 (80)
  • 8.29x execution time
  • 8.57x power
  • 1.81x energy
  • Pruning ratio > 0.99998

Exploration – Results
MATRIX
  • Exploration time: 73.6 min
  • Configurations visited: 12623 (84)
  • 10.7x execution time
  • 8.16x power
  • 3.18x energy
  • Pruning ratio > 0.99997

Exploration – Results
[Figure: JPEG results]

Conclusion
  • Gave a system-level algorithm for exploring the solution space of an application mapped to a parameterized SOC architecture
    – Given a dependency graph, we extensively prune the solution space
    – Pruning ratio > 0.99997 in experiments
  • Future work:
    – Automatically compute the dependency model
    – Replace the exhaustive sub-algorithm with a heuristic (e.g., gradient search, GA)