
Parameterized Embedded Systems Platforms
Frank Vahid
Students: Tony Givargis, Roman Lysecky, Susan Cotterell
Dept. of Computer Science and Engineering, University of California, Riverside
Member, Center for Embedded Computer Systems, UC Irvine
Supported by: NSF, NEC
The Dalton Project

Outline
• Introduction
• Parameterized SOC platforms
• Exploring parameter configurations
• Future direction: self-optimizing platforms
• Conclusions

Introduction
• Advent of system-on-a-chip
[Figure: a board with separate microprocessor, memory, peripheral, and FPGA ICs; a single IC with a microprocessor core (aka “IP”) and a peripheral core]

System-on-a-chip (SOC)
The Productivity Gap [ITRS 99]

Programmable Platforms
[Figure, labeled (ITRS 99): platform with microprocessor, cache, memory, DMA, bridge, FPGA, and programmable peripheral on a system bus and a peripheral bus]
• Pre-fabricated IC, synthesizable HDL, or both
– “reference designs” (VLSI), “silicon platforms” (Philips), “fig chips” (Vahid/Givargis 99)
Targeted to Embedded Systems
• May drive future architecture design [Patterson 98]
• Varied power/performance/size constraints
– Programmable platforms must adapt

Adapting platforms to constraints
• One solution: Architectural Parameters
[Figure: Application 1 and Application 2, each a main() containing a while (…) { … } loop, targeting the same platform (microprocessor, cache, memory, DMA, bridge, FPGA, programmable peripheral, system bus, peripheral bus)]
Related work
• Microcontrollers
• VLSI’s Velocity
• Pleiades project [Rabaey 97]
• Microprocessor + FPGA
• Philips’ Y-Chart approach
[Figure: Y-chart with Architecture, Applications, Mapping, Analysis, and Numbers; one part is marked “Our focus”]

Outline
• Introduction
• Parameterized SOC platforms
• Exploring parameter configurations
• Future direction: self-optimizing platforms
• Conclusions

Basic parameters -- cache
[Figure: platform block diagram (microprocessor, cache, memory, DMA, bridge, FPGA, programmable peripheral, system bus, peripheral bus)]

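Basic parameters -- cache
• Line Size
• Associativity
• Cache Size
[Figure: set-associative cache; the address is split into Tag, Index, and Offset; each way holds valid (V), tag (T), and data (D) fields; tag comparators (==) drive a mux that selects Data]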
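How the three cache parameters determine the Tag/Index/Offset split can be sketched as follows. This is an illustrative sketch only: the function names, the 32-bit address width, and the assumption that sizes are in bytes are not taken from the slides.

```python
def cache_address_fields(cache_size, line_size, associativity, address_bits=32):
    """Return (tag_bits, index_bits, offset_bits) for a set-associative cache.

    Sizes are assumed to be in bytes. address_bits=32 is an assumption;
    the slides do not state the address width.
    """
    for value in (cache_size, line_size, associativity):
        if value <= 0 or value & (value - 1):
            raise ValueError("all parameters must be positive powers of two")
    num_sets = cache_size // (line_size * associativity)
    if num_sets < 1:
        raise ValueError("cache too small for this line size and associativity")
    offset_bits = line_size.bit_length() - 1   # log2(line_size)
    index_bits = num_sets.bit_length() - 1     # log2(num_sets)
    tag_bits = address_bits - index_bits - offset_bits
    return tag_bits, index_bits, offset_bits


def split_address(addr, index_bits, offset_bits):
    """Split an address into its (tag, index, offset) fields."""
    offset = addr & ((1 << offset_bits) - 1)
    index = (addr >> offset_bits) & ((1 << index_bits) - 1)
    tag = addr >> (offset_bits + index_bits)
    return tag, index, offset
```

For example, an 8k cache with 16-byte lines and associativity 4 has 128 sets, so 7 index bits and 4 offset bits. Doubling the associativity at fixed size halves the number of sets and moves one bit from the index to the tag.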

Basic parameters -- bus
[Figure: platform block diagram (microprocessor, cache, memory, DMA, bridge, FPGA, programmable peripheral, system bus, peripheral bus)]
Basic parameters -- Bus
• Change Bus Width [Givargis 98]
[Figure: a bus labeled C1 and a bus with mux and demux labeled C2; C1 > C2]

Basic parameters -- Bus
• Encode data to reduce switching (Bus Invert) [Stan 95]
[Figure: encoder and decoder on either side of the bus with an invert_ctrl line; example bit sequence showing Hamming distance 6 under binary encoding and Hamming distance 3 under bus-invert encoding]
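A minimal sketch of bus-invert encoding, assuming the usual rule of sending the complemented word when more than half of the data lines would otherwise toggle. The function names and the all-zero initial bus state are assumptions, and the decision here looks at the data lines only, which simplifies the published scheme.

```python
def hamming(a, b):
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")


def bus_invert_encode(words, width):
    """Return a list of (bus_value, invert_ctrl) pairs for the given words."""
    mask = (1 << width) - 1
    bus = 0  # assumed initial state of the data lines
    encoded = []
    for word in words:
        word &= mask
        if hamming(bus, word) > width / 2:
            bus, invert = word ^ mask, 1   # send the complement, raise invert_ctrl
        else:
            bus, invert = word, 0
        encoded.append((bus, invert))
    return encoded


def bus_invert_decode(encoded, width):
    """Recover the original words from (bus_value, invert_ctrl) pairs."""
    mask = (1 << width) - 1
    return [value ^ mask if invert else value for value, invert in encoded]


def count_transitions(encoded):
    """Toggles on the data lines plus the invert line, starting from all zeros."""
    total = 0
    prev_value, prev_invert = 0, 0
    for value, invert in encoded:
        total += hamming(prev_value, value) + (prev_invert ^ invert)
        prev_value, prev_invert = value, invert
    return total
```

Sending the 8-bit word 0b00111111 on an idle all-zero bus toggles 6 lines under plain binary encoding. With bus-invert the complement 0b11000000 is sent instead, toggling 2 data lines plus invert_ctrl, for 3 in total, which is consistent with the Hamming distances 6 and 3 in the figure.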

Parameter definitions
• Parameter
– An architectural feature that can be varied, with a small set of possible values, without changing the application’s essential functionality.
• Configuration
– A selection of a particular value for every architecture parameter
• Static vs. dynamic parameter
– Static: Value is set before fabricating the IC.
– Dynamic: Value is set after fabricating the IC.
Potential tradeoffs experiment [ICCAD 99]
[Figure: platform with microprocessor, I-cache, D-cache, memory, DMA, bridge, peripheral, FPGA, system bus, peripheral bus]

Component   Parameter         Possible values
I-cache     Size              32k, 16k, 8k, 4k, 2k, 1k, 512, 256, 128
            Line              8, 16, 32
            Associativity     2, 4, 8
D-cache     Size              32k, 16k, 8k, 4k, 2k, 1k, 512, 256, 128
            Line              8, 16, 32
            Associativity     2, 4, 8
Mp-c bus    Data bus width    4, 8, 16, 32
            Data bus invert   on or off
Sys. bus    Data bus width    4, 8, 16, 32
            Data bus invert   on or off
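A configuration picks one value for every parameter, so the candidate space is the cross product of the value lists above. The sketch below enumerates that unconstrained cross product; the names are illustrative, and the slides do not say how the configurations actually evaluated in the experiment were selected from it.

```python
from itertools import product

# Value lists copied from the parameter table; all names are illustrative.
CACHE_SIZES = [128, 256, 512, 1024, 2048, 4096, 8192, 16384, 32768]
LINE_SIZES = [8, 16, 32]
ASSOCIATIVITIES = [2, 4, 8]
BUS_WIDTHS = [4, 8, 16, 32]
BUS_INVERT = [False, True]


def configurations():
    """Yield every combination of I-cache, D-cache, Mp-c bus and Sys. bus values."""
    cache_options = list(product(CACHE_SIZES, LINE_SIZES, ASSOCIATIVITIES))
    bus_options = list(product(BUS_WIDTHS, BUS_INVERT))
    for icache, dcache, mpc_bus, sys_bus in product(
            cache_options, cache_options, bus_options, bus_options):
        yield {
            "icache": icache,    # (size, line, associativity)
            "dcache": dcache,    # (size, line, associativity)
            "mpc_bus": mpc_bus,  # (width, invert)
            "sys_bus": sys_bus,  # (width, invert)
        }


def count_configurations():
    """Size of the unconstrained cross product."""
    caches = len(CACHE_SIZES) * len(LINE_SIZES) * len(ASSOCIATIVITIES)
    buses = len(BUS_WIDTHS) * len(BUS_INVERT)
    return caches * caches * buses * buses
```

Each cache contributes 9 x 3 x 3 = 81 options and each bus 4 x 2 = 8, so even these few basic parameters produce a space far too large to evaluate by slow simulation.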
Potential tradeoffs experiment [ICCAD 99]
• Cache: Dinero [Edler, Hill]
• ISS: [Tiwari 96]
[Figure: a C program feeds the microprocessor instruction-set simulator, cache simulator, memory simulator, and bus simulator; each reports power, and the results are summed into total power]

Potential tradeoffs experiment
• Computed power for all 45,568 configurations
– For each of four C applications
– Used microprocessor, cache, and bus simulators (1 wk CPU)
[Plot: tradeoff between performance and power]
• X-axis: execution time (sec)
• Y-axis: power (watt)

Potential tradeoffs experiment
[Plot annotations]
• Bus: 32-1/32-0; I: 16k, 4, 4; D: 16k, 4, 4; .086 sec, 43.6 W, 20 kG
• Bus: 16-1/32-1; I: 16k, 8, 16; D: 32k, 8, 8; .389 sec, 11.4 W, 21 kG
• Bus: 8-1/32-1; I: 32k, 8, 8; D: 16k, 8, 16; .995 sec, 3.4 W, 30 K
• Narrower bus required a larger cache size

Potential tradeoffs experiment
• Performance varied by 11x
• Power varied by 13x
• Area varied by 1x
• Energy consumption varied by 2x

Potential tradeoffs experiment
[Plot annotations]
• Bus: 32-1/32-1; I: 1k, 4, 4; D: 512, 4, 8; 2 ms, .19 W, 15 kG
• Bus: 16-1/32-1; I: 1k, 4, 4; D: 512, 4, 8; 3 ms, .07 W, 17 kG
• Bus: 8-1/4-0; I: 1k, 2, 4; D: 512, 2, 4; 5 ms, .02 W, 18 kG

Potential tradeoffs experiment
• Performance varied by 2.5x
• Power varied by 9.5x
• Area varied by 1x
• Energy consumption varied by 4x
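As a worked check, the spreads can be recomputed from the three labeled plot points of the preceding slide (2 ms at .19 W, 3 ms at .07 W, 5 ms at .02 W), taking energy as power times execution time. This is a sketch under the assumption that those three points span the extremes; the reported figures may come from the full configuration space.

```python
# (execution time in seconds, power in watts) for the three labeled points
points = [(2e-3, 0.19), (3e-3, 0.07), (5e-3, 0.02)]


def spread(values):
    """Ratio between the largest and smallest value."""
    return max(values) / min(values)


times = [t for t, _ in points]
powers = [p for _, p in points]
energies = [t * p for t, p in points]   # joules: energy = power * time

time_spread = spread(times)       # performance variation
power_spread = spread(powers)     # power variation
energy_spread = spread(energies)  # energy variation
```

The time and power spreads come out at 2.5 and 9.5. The energies are 0.38 mJ, 0.21 mJ, and 0.10 mJ, a spread of 3.8, close to the reported 4x. Energy varies less than power because the low-power configurations also run longer.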

Potential tradeoffs experiment
• How much variation in total system power and performance can we obtain just by varying the cache and bus parameters?
– 9 to 14x improvement in power/performance
• How interdependent are these two types of parameters?
– Fixing cache parameter values, then selecting bus parameter values, results in non-optimal solutions

Many more parameters possible
• Some examples include:
– Code compression (Henkel/Wolf)
– Address bus encoding
– Multiple levels of memory hierarchy
– CPU parameters (e.g., voltage scale, DP width)
– Peripheral core parameters (our current focus)
• Fertile research area
• Can yield even larger tradeoffs if we:
– Create parameter-aware compiler
– Adapt OS?

Outline
• Introduction
• Parameterized SOC platforms
• Exploring parameter configurations
• Future direction: self-optimizing platforms
• Conclusions

Exploring parameter configurations
• Low-level simulation
– Gate-level simulation
  • Far too slow, days per configuration
– RT-level simulation
  • Still slow, hours per configuration
• Our approach
– System-level simulation
  • Minutes per configuration
– System-level trace simulation
  • Seconds per configuration
– System-level trace analysis
  • Milliseconds per configuration

Evaluation by gate-level simulation
• Capture each core in HDL, synthesize, simulate
• Hours (often tens) per configuration
[Figure: platform block diagram with a loop of reconfigure, HDL synthesis, HDL simulation, and total power]

Evaluation by system-level simulation
• Minutes-per-configuration
• Contrast with hours-per-config.
[Figure: a C program drives OO models (microprocessor instruction-set simulator, cache simulator, memory simulator, bridge simulator, bus simulator, DMA simulator, peripheral simulator) and a trace generator; each model reports power, summed into total power; a reconfigure loop closes the flow]

Evaluation by trace-simulation
• Seconds-per-configuration
– Get traces from small # of system simulations
• Note that the cache simulator is non-functional
• Same approach for others
[Figure: a trace generator produces instr., address, and bus traces from the C program; OO non-functional models (instr., cache, memory, DMA, bridge, bus, and peripheral trace simulators) each report power, summed into total power; a reconfigure loop closes the flow]
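A non-functional cache trace simulator only tracks which lines are resident and never stores data, which is why it is fast. The sketch below counts misses for one configuration from an address trace; LRU replacement, byte addresses, and the function name are assumptions, not details from the slides.

```python
from collections import OrderedDict


def count_misses(trace, cache_size, line_size, associativity):
    """Count misses of a set-associative cache over a list of byte addresses.

    Non-functional model: only tags are tracked, no data is stored.
    LRU replacement is an assumption made for this sketch.
    """
    num_sets = cache_size // (line_size * associativity)
    if num_sets < 1:
        raise ValueError("cache too small for this line size and associativity")
    # One ordered dict per set; key order runs from least to most recently used.
    sets = [OrderedDict() for _ in range(num_sets)]
    misses = 0
    for addr in trace:
        block = addr // line_size
        tags = sets[block % num_sets]
        tag = block // num_sets
        if tag in tags:
            tags.move_to_end(tag)          # hit: mark as most recently used
        else:
            misses += 1
            if len(tags) >= associativity:
                tags.popitem(last=False)   # evict the least recently used tag
            tags[tag] = True
    return misses
```

Because the trace is captured once, the same list can be replayed against every cache configuration, for instance `count_misses(trace, 8192, 16, 4)` and then `count_misses(trace, 4096, 16, 4)`, without re-running the application.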

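System simulation vs. trace simulation
[Figure: in system simulation, each parameter evaluation executes the system-level model (uP, DMA, UART) to obtain power; in trace simulation, the system-level model (uP, DMA, UART) is executed to produce traces, and each parameter evaluation runs trace simulators on those traces to obtain power]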

Evaluation by trace-analysis
• Further speedup
– Statistically characterize traces
– Still only small # of system simulations
• Milliseconds-per-configuration
[Figure: a trace generator produces instr., address, and bus statistics from the C program; equation-based trace analyzers (instr., cache, memory, DMA, bridge, bus, and peripheral) each report power, summed into total power; a reconfigure loop closes the flow]

Trace-analysis approach for cache
• Given a trace of memory refs
• Cache parameters
– Size (S)
– Line/block-size (L)
– Associativity (A)
• Compute # of misses (N)
[Plot: x-axis Size (S)]

Trace-analysis approach for cache

Trace-analysis approach for cache
• Capture improvements obtainable by:
– changing line-size at small/large values of cache-size
– changing associativity at small/large values of cache-size

Trace-analysis approach for bus
[Figure labels: capacitance, num transfers per item, random data, bus width, items/second]

Trace-analysis approach for bus
• Bus equation:
– m items/second (denotes the traffic N on the bus)
– n bits/item
– k bit wide bus
– bus-invert encoding
– random data assumption
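The slide's bus equation itself is not reproduced in this transcript. As a hedged illustration of how m, n, and k could enter such an estimate, the sketch below uses a standard CMOS switching-power model under the random-data assumption, for plain binary encoding only. It is not the authors' equation, and the capacitance and voltage arguments are hypothetical inputs.

```python
import math


def transfers_per_second(m, n, k):
    """Bus transfers per second for m items/s of n bits each on a k-bit bus."""
    return m * math.ceil(n / k)


def toggles_per_second_binary(m, n, k):
    """Expected line toggles per second for binary encoding of random data.

    With random data each of the k lines changes with probability 1/2
    on every transfer.
    """
    return transfers_per_second(m, n, k) * k / 2


def bus_power_binary(m, n, k, line_capacitance, voltage):
    """Dynamic power in watts, charging 0.5 * C * V^2 of energy per toggle.

    line_capacitance (farads per line) and voltage (volts) are hypothetical
    inputs; the slides give no values for them.
    """
    energy_per_toggle = 0.5 * line_capacitance * voltage ** 2
    return energy_per_toggle * toggles_per_second_binary(m, n, k)
```

An estimate of this closed-form kind is what makes per-configuration evaluation nearly free: changing k or the traffic m only means re-evaluating a formula. Bus-invert encoding would replace the k/2 term with a smaller expected toggle count, which this sketch does not model.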

Trace-analysis experiments
• Cache parameters
– size: 128, 256, 512, 1k, 2k, 4k, 8k, 16k, 32k
– assoc: 2, 4, 8
– line: 8, 16, 32
• Bus Parameters
– width: 4, 8, 16, 32
– code: binary/bus-invert
• Analyzed 45K sets exhaustively for each of 4 examples.
[Figure labels: CPU, Bus A, I-Cache, D-Cache, Memory, Bridge, Bus B, Peripheral Bus, Peripheral 1, Peripheral 2, ..., Peripheral n]

Experiment Results
• Diesel application’s performance
• Blue (light-gray) is system-simulation-based
• Red (dark-gray) is trace-analysis-based
• 4% error, 320x faster

Experiment Results
• Diesel application’s energy consumption
• Blue (light-gray) is obtained using full simulation
• Red (dark-gray) is obtained using our equations
• 2% error, 420x faster

Experiment Results
• CKey application’s performance
• Blue (light-gray) is obtained using full simulation
• Red (dark-gray) is obtained using our equations
• 8% error, 125x faster

Experiment Results
• CKey application’s energy consumption
• Blue (light-gray) is obtained using full simulation
• Red (dark-gray) is obtained using our equations
• 3% error, 125x faster

Experiment Results
• 125-400x speedup
• 1-18% absolute error (power & performance)
• 2% average power error
[Plot axes: Time (hours); Power Error (%)]

Techniques for general cores
• Earlier experiments were for uP/cache/bus
• System simulation for other cores (ISSS’00)
– Isolate “instructions” in system-level model
– Gate-level simulation per instruction
– Back-annotate system-level model’s instructions
– Similar to technique for microprocessors, but:
  • Must consider “power modes”

Trace approach for general cores
[Figure: the system-level model (uP, DMA, UART) is executed to produce traces; each parameter evaluation runs trace simulators on the traces to obtain power]
• Full trace: Reset --; Quantize P1, P2, …, P64; IDCT P1, P2, …, P64
• Reduced trace with instructions only: Reset --; Quantize --; IDCT --
• Reduced trace with characterized data: Reset --; Quantize .80; IDCT .72; Quantize .93; IDCT .63
• Reduced trace with instruction frequencies: Reset *1; Quantize *2; IDCT *2
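The trace reductions above can be sketched as successive summarization steps, ending with an energy estimate from per-instruction costs. The function names are illustrative and the per-instruction energies in the usage note are hypothetical placeholders standing in for back-annotated gate-level results.

```python
from collections import Counter


def instructions_only(full_trace):
    """Drop the data operands, keeping just the instruction names."""
    return [instr for instr, _data in full_trace]


def instruction_frequencies(instr_trace):
    """Collapse an instruction trace into per-instruction counts."""
    return Counter(instr_trace)


def estimate_energy(frequencies, energy_per_instruction):
    """Sum count * per-instruction energy over all instructions.

    energy_per_instruction maps instruction name to energy per execution.
    """
    return sum(count * energy_per_instruction[instr]
               for instr, count in frequencies.items())


# Full trace shaped like the slide's example: Reset, then two Quantize/IDCT
# pairs, each carrying 64 data values (dummy zeros here).
full_trace = [
    ("Reset", None),
    ("Quantize", [0] * 64),
    ("IDCT", [0] * 64),
    ("Quantize", [0] * 64),
    ("IDCT", [0] * 64),
]
frequencies = instruction_frequencies(instructions_only(full_trace))
```

Here `frequencies` holds Reset once and Quantize and IDCT twice each, matching the slide's fully reduced trace. With hypothetical costs such as `{"Reset": 1.0, "Quantize": 5.0, "IDCT": 8.0}`, `estimate_energy` returns 27.0 in the same unit. Discarding the data operands is what shrinks the trace, at the price of ignoring data-dependent power.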

Experiments with general cores: JPEG

Trace file size (Kb), for pixel sizes of 10 and 12 bits
  ftrc      32, 39
  rtrc_cd   3.6 (one value lost in extraction)
  rtrc_i    0.5, 0.5

CPU time for power evaluation (sec), for pixel sizes of 10 and 12 bits
  Method    CPU time (sec)                       Average speedup
  gate      290000, 330000
  sys       48, 49                               6K
  ftrc      26, 27                               12K
  rtrc_cd   4.9, 5.1                             62K
  rtrc_i    4.6 (one value lost in extraction)   67K

Energy (mJ), with error relative to gate
  Pixel size (bits)   gate   ftrc        rtrc_cd     rtrc_i
  10                  420    443 (5%)    451 (7%)    491 (17%)
  12                  531    569 (7%)    576 (8%)    632 (19%)
  Average error              6%          7.5%        18%

Experiments with general cores: UART

Outline
• Introduction
• Parameterized SOC platforms
• Exploring parameter configurations
• Future direction: self-optimizing platforms
• Conclusions

Future directions
• Earlier work
– Used software on workstation to explore parameter configurations
[Figure: exploration software on a workstation sends a configuration to the platform]
• “Self-optimizing” platform
– Can we build the exploration ability into the platform itself?
– Transparent to the user
  • Ease of use, more accurate metrics, wider acceptance
– “Embedded CAD”
[Figure: a workstation sends a regular binary to a platform that contains the exploration ability]

Conclusions
• Parameters can improve usefulness of programmable platforms
– by adapting platform to particular application and to power/performance constraints
• Good tradeoff range even for basic parameters
• Fast and accurate evaluation seems possible
• Much work remains
– More parameters
– Better exploration
– Self-optimizing platforms