
Parameterized Embedded Systems Platforms

Frank Vahid
Students: Tony Givargis, Roman Lysecky, Susan Cotterell
Dept. of Computer Science and Engineering, University of California, Riverside
Member, Center for Embedded Computer Systems, UC Irvine
Supported by: NSF, NEC
The Dalton Project

Outline
• Introduction
• Parameterized SOC platforms
• Exploring parameter configurations
• Future direction: self-optimizing platforms
• Conclusions

Introduction
• Advent of system-on-a-chip
[Figure: a Board of separate ICs (Microproc. IC, Memory IC, Peripher. IC, FPGA IC) alongside a single IC containing a Microprocessor core (aka “IP”) and a Peripheral core]

System-on-a-chip (SOC)
[Figure]

The Productivity Gap [ITRS 99]
[Figure]

Programmable Platforms
[Figure (ITRS 99): Platform block diagram with Microprocessor, Cache, Memory, DMA, Bridge, FPGA, Programmable Peripheral, System bus, Peripheral bus]
• Pre-fabricated IC, synthesizable HDL, or both
  – “reference designs” (VLSI), “silicon platforms” (Philips), “fig chips” (Vahid/Givargis 99)

Targeted to Embedded Systems
• May drive future architecture design [Patterson 98]
• Varied power/performance/size constraints
  – Programmable platforms must adapt

Adapting platforms to constraints
• One solution: Architectural Parameters
[Figure: Application 1 and Application 2, each a program of the form main() { … while (…) { … } }, targeting the Platform (Microprocessor, Cache, Memory, DMA, Bridge, FPGA, Programmable Peripheral, System bus, Peripheral bus)]

Related work
• Microcontrollers
• VLSI’s Velocity
• Pleiades project [Rabaey 97]
• Microprocessor + FPGA
• Philips’ Y-Chart approach
[Figure: Y-chart (Architecture, Applications, Mapping, Analysis, Numbers), annotated “Our focus”]

Outline
• Introduction
• Parameterized SOC platforms
• Exploring parameter configurations
• Future direction: self-optimizing platforms
• Conclusions

Basic parameters -- cache
[Figure: Platform block diagram with Microprocessor, Cache, Memory, DMA, Bridge, FPGA, Programmable Peripheral, System bus, Peripheral bus]

Basic parameters -- cache
• Line Size
• Associativity
• Cache Size
[Figure: cache organization; the address is split into Tag, Index, and Offset; each way holds valid (V), tag (T), and data (D) entries; comparators (==) drive a Mux that selects the Data]
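The way the three cache parameters fix the Tag/Index/Offset split can be sketched in a few lines of Python. This is only an illustration: it assumes byte addressing, sizes given in bytes, power-of-two parameter values, and a 32-bit address. The helper name `address_fields` and the 32-bit default are assumptions, not taken from the slides.

```python
def address_fields(cache_size: int, line_size: int, associativity: int,
                   address_bits: int = 32) -> dict:
    """Return the widths, in bits, of the tag, index, and offset fields.

    Hypothetical helper: assumes byte addressing and power-of-two values.
    """
    num_lines = cache_size // line_size
    num_sets = num_lines // associativity
    if num_sets < 1:
        raise ValueError("cache too small for this line size and associativity")
    offset_bits = line_size.bit_length() - 1   # log2(line_size)
    index_bits = num_sets.bit_length() - 1     # log2(num_sets)
    tag_bits = address_bits - index_bits - offset_bits
    return {"tag": tag_bits, "index": index_bits, "offset": offset_bits}


# A 16 KB cache with 16-byte lines and 4 ways has 256 sets.
print(address_fields(16 * 1024, 16, 4))
```

Growing the line size widens the offset, growing the associativity narrows the index, and growing the cache size widens the index, which is why the three parameters interact.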

Basic parameters -- bus
[Figure: Platform block diagram with Microprocessor, Cache, Memory, DMA, Bridge, FPGA, Programmable Peripheral, System bus, Peripheral bus]

Basic parameters -- Bus
• Change Bus Width [Givargis 98]
[Figure: cores C1 and C2 connected by a Bus through a Mux and a Demux; C1 > C2]

Basic parameters -- Bus
• Encode data to reduce switching (Bus Invert) [Stan 95]
[Figure: Encoder and Decoder at either end of the bus, with an invert_ctrl line; in the example, successive bus values have Hamming distance 6 under binary encoding and 3 under bus-invert encoding]
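A minimal Python sketch of bus-invert encoding follows. It assumes a bus of `width` data lines plus one invert line, and an encoder that sends whichever of the plain or inverted word toggles fewer lines (ties keep the plain word). The function names and the tie rule are assumptions made for illustration.

```python
def hamming(a: int, b: int) -> int:
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")


def bus_invert_encode(prev_lines: int, prev_inv: int, data: int, width: int):
    """Return (lines, invert) to drive onto the bus for the next data word."""
    mask = (1 << width) - 1
    inverted = ~data & mask
    # Toggle counts include the invert line itself.
    plain_cost = hamming(prev_lines, data) + int(prev_inv != 0)
    inverted_cost = hamming(prev_lines, inverted) + int(prev_inv != 1)
    if inverted_cost < plain_cost:
        return inverted, 1
    return data, 0


def bus_invert_decode(lines: int, invert: int, width: int) -> int:
    """Recover the original data word at the receiving end."""
    mask = (1 << width) - 1
    return (~lines & mask) if invert else lines


lines, inv = bus_invert_encode(0b00000000, 0, 0b11111110, 8)
print(bin(lines), inv, bin(bus_invert_decode(lines, inv, 8)))
```

In the usage line, sending the word unencoded would toggle seven lines, while the inverted word toggles one data line plus the invert line.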

Parameter definitions
• Parameter
  – An architectural feature that can be varied, with a small set of possible values, without changing the application’s essential functionality.
• Configuration
  – A selection of a particular value for every architecture parameter
• Static vs. dynamic parameter
  – Static: Value is set before fabricating the IC.
  – Dynamic: Value is set after fabricating the IC.

Potential tradeoffs experiment [ICCAD 99]
[Figure: platform with Microprocessor, I-cache, D-cache, Memory, DMA, Bridge, FPGA, Peripheral, System bus, Peripheral bus]

            Parameters        Possible values
I-cache     Size              32k, 16k, 8k, 4k, 2k, 1k, 512, 256, 128
            Line              8, 16, 32
            Associativity     2, 4, 8
D-cache     Size              32k, 16k, 8k, 4k, 2k, 1k, 512, 256, 128
            Line              8, 16, 32
            Associativity     2, 4, 8
Mp-c bus    Data bus width    4, 8, 16, 32
            Data bus invert   on or off
Sys. bus    Data bus width    4, 8, 16, 32
            Data bus invert   on or off
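The parameter table can be turned into an explicit configuration space with a short Python sketch. The names below are illustrative assumptions. The raw cross product of the listed values is larger than the number of configurations the experiment reports evaluating; the slides do not say which combinations were excluded, so this sketch makes no attempt to reproduce that count.

```python
from itertools import product

# Values as listed in the parameter table.
CACHE_SIZES = [128, 256, 512, 1024, 2048, 4096, 8192, 16384, 32768]
LINE_SIZES = [8, 16, 32]
ASSOCIATIVITIES = [2, 4, 8]
BUS_WIDTHS = [4, 8, 16, 32]
BUS_INVERT = [False, True]


def cache_options():
    """All (size, line, associativity) triples for one cache."""
    return list(product(CACHE_SIZES, LINE_SIZES, ASSOCIATIVITIES))


def bus_options():
    """All (width, invert) pairs for one bus."""
    return list(product(BUS_WIDTHS, BUS_INVERT))


def all_configurations():
    """Raw cross product: (I-cache, D-cache, Mp-c bus, Sys. bus), unpruned."""
    return product(cache_options(), cache_options(), bus_options(), bus_options())


print(len(cache_options()), "cache options,", len(bus_options()), "bus options")
```

A configuration, in the sense defined earlier in the deck, is one element of this product: a value chosen for every parameter.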

Potential tradeoffs experiment [ICCAD 99]
• Cache: Dinero [Edler, Hill]
• ISS: [Tiwari 96]
[Figure: a C Program feeds a microprocessor instruction-set simulator, a cache simulator, a memory simulator, and a bus simulator; their Power outputs combine into Total power]

Potential tradeoffs experiment
• Computed power for all 45,568 configurations
  – For each of four C applications
  – Used microprocessor, cache, and bus simulators (1 wk CPU)
Tradeoff between performance and power
• X-axis: execution time (sec)
• Y-axis: power (watt)
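On a time-versus-power plot, the configurations of interest are those that no other configuration beats on both axes. A minimal Python sketch of that filter is shown here; the sample points are hypothetical and chosen only to exercise the function, and `pareto_front` is an assumed name.

```python
def pareto_front(points):
    """Return the points not dominated in (time, power); lower is better."""
    front = []
    for p in points:
        dominated = any(
            q[0] <= p[0] and q[1] <= p[1] and q != p
            for q in points
        )
        if not dominated:
            front.append(p)
    return sorted(front)


# Hypothetical (execution time in sec, power in watt) pairs, illustration only.
samples = [(0.1, 40.0), (0.4, 11.0), (0.5, 15.0), (1.0, 3.5), (1.2, 3.5)]
print(pareto_front(samples))
```

The quadratic scan is adequate for a sketch; for tens of thousands of configurations, sorting by time and sweeping once would be the usual refinement.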

Potential tradeoffs experiment
[Figure: tradeoff plot with three annotated configurations]
• Bus: 32-1/32-0; I: 16k, 4, 4; D: 16k, 4, 4; .086 sec, 43.6 W, 20 kG
• Bus: 16-1/32-1; I: 16k, 8, 16; D: 32k, 8, 8; .389 sec, 11.4 W, 21 kG
• Bus: 8-1/32-1; I: 32k, 8, 8; D: 16k, 8, 16; .995 sec, 3.4 W, 30 K
Narrower bus required a larger cache size

Potential tradeoffs experiment
• Performance varied by 11x
• Power varied by 13x
• Area varied by 1x
• Energy consumption varied by 2x
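The performance and power ratios can be checked against the three configurations annotated on the preceding plot, reading their execution times as .086, .389, and .995 sec. The Python sketch below computes each spread as max/min and derives energy as power times time. The three points are only a sample of the full space, so their energy spread need not match the figure reported over all configurations.

```python
# (execution time in seconds, power in watts) for the three annotated points
configs = {
    "bus 32-1/32-0": (0.086, 43.6),
    "bus 16-1/32-1": (0.389, 11.4),
    "bus 8-1/32-1": (0.995, 3.4),
}


def spread(values):
    """Ratio between the largest and smallest value."""
    return max(values) / min(values)


times = [t for t, _ in configs.values()]
powers = [p for _, p in configs.values()]
energies = [t * p for t, p in configs.values()]  # joules = watts * seconds

print(f"time spread:   {spread(times):.1f}x")
print(f"power spread:  {spread(powers):.1f}x")
print(f"energy spread: {spread(energies):.1f}x")
```

The time and power spreads of these three points come out close to the 11x and 13x quoted above, while energy varies much less because the slow configurations are also the low-power ones.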

Potential tradeoffs experiment
[Figure: tradeoff plot with three annotated configurations]
• Bus: 32-1/32-1; I: 1k, 4, 4; D: 512, 4, 8; 2 ms, .19 W, 15 kG
• Bus: 16-1/32-1; I: 1k, 4, 4; D: 512, 4, 8; 3 ms, .07 W, 17 kG
• Bus: 8-1/4-0; I: 1k, 2, 4; D: 512, 2, 4; 5 ms, .02 W, 18 kG

Potential tradeoffs experiment
• Performance varied by 2.5x
• Power varied by 9.5x
• Area varied by 1x
• Energy consumption varied by 4x

Potential tradeoffs experiment
• How much variation in total system power and performance can we obtain just by varying the cache and bus parameters?
  – 9 to 14x improvement in power/performance
• How interdependent are these two types of parameters?
  – Fixing cache parameter values first, then selecting bus parameter values, results in non-optimal solutions

Many more parameters possible
• Some examples include:
  – Code compression (Henkel/Wolf)
  – Address bus encoding
  – Multiple levels of memory hierarchy
  – CPU parameters (e.g., voltage scale, DP width)
  – Peripheral core parameters (our current focus)
• Fertile research area
• Can yield even larger tradeoffs if we:
  – Create parameter-aware compiler
  – Adapt OS?

Outline
• Introduction
• Parameterized SOC platforms
• Exploring parameter configurations
• Future direction: self-optimizing platforms
• Conclusions

Exploring parameter configurations
• Low-level simulation
  – Gate-level simulation
    • Far too slow, days per configuration
  – RT-level simulation
    • Still slow, hours per configuration
• Our approach
  – System-level simulation
    • Minutes per configuration
  – System-level trace simulation
    • Seconds per configuration
  – System-level trace analysis
    • Milliseconds per configuration
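To see what these orders of magnitude mean for an exhaustive search, the Python sketch below multiplies an assumed per-configuration cost by the 45,568 configurations evaluated in the earlier experiment. The per-configuration costs are assumptions: one unit of the order of magnitude named for each method (one day, one hour, and so on), not measured values.

```python
CONFIGS = 45_568  # configurations evaluated per application in the experiment

# Assumed cost per configuration in seconds: one unit of the order of
# magnitude named for each method. These are not measurements.
COST_PER_CONFIG = {
    "gate-level simulation": 24 * 3600,
    "RT-level simulation": 3600,
    "system-level simulation": 60,
    "system-level trace simulation": 1,
    "system-level trace analysis": 0.001,
}


def total_seconds(method: str, configs: int = CONFIGS) -> float:
    """Total time to evaluate every configuration with the given method."""
    return COST_PER_CONFIG[method] * configs


for method in COST_PER_CONFIG:
    days = total_seconds(method) / 86_400
    print(f"{method:32s} {days:14.4f} days")
```

Under these assumptions, exhaustive search is out of reach for the low-level methods and only becomes routine once each evaluation takes seconds or less.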

Evaluation by gate-level simulation
• Capture each core in HDL, synthesize, simulate
• Hours (often tens) per configuration
[Figure: the Platform (Microprocessor, Cache, Memory, DMA, Bridge, FPGA, Programmable Peripheral, System bus, Peripheral bus) passes through HDL synthesis and HDL simulation to give Total power, with a Reconfigure step feeding back to the platform]

Evaluation by system-level simulation
• Minutes-per-configuration
• Contrast with hours-per-config.
[Figure: a C Program runs on OO models of the cores (microprocessor instruction-set simulator, cache simulator, memory simulator, bridge simulator, DMA simulator, peripheral simulator, bus simulator, peripheral bus); each model reports Power, combined into Total power, with a Reconfigure step feeding back; also labeled Trace Generator]

Evaluation by trace-simulation
• Seconds-per-configuration
  – Get traces from a small number of system simulations
• Note that the cache simulator is non-functional
• Same approach for others
[Figure: a C Program and Trace Generator produce an instruction trace, address trace, and bus trace; OO non-functional models (instr. trace simulator, cache trace simulator, memory trace simulator, DMA trace simulator, bridge trace simulator, peripheral trace simulator, bus trace simulator) each report Power, combined into Total power, with a Reconfigure step feeding back]

System simulation vs. trace simulation
[Figure: two flows. System simulation: parameter evaluation executes the system-level model (uP, DMA, UART) to obtain Power. Trace simulation: the system-level model (uP, DMA, UART) produces Traces, and parameter evaluation executes trace simulators to obtain Power]

Evaluation by trace-analysis
• Milliseconds-per-configuration
• Further speedup
  – Statistically characterize traces
  – Still only a small number of system simulations
[Figure: a C Program and Trace Generator produce instruction, address, and bus statistics; trace analyzers based on Equations (instr., cache, memory, DMA, bridge, peripheral) and a bus trace simulator each report Power, combined into Total power, with a Reconfigure step feeding back]

Trace-analysis approach for cache
• Given a trace of memory refs
• Cache parameters
  – Size (S)
  – Line/block-size (L)
  – Associativity (A)
• Compute # of misses (N)
[Figure: plot with Size (S) on one axis]
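The quantity being estimated, the number of misses N for a given trace and a given (S, L, A), can be pinned down with a direct trace-driven count. The Python sketch below is plain cache simulation, not the authors' analytical trace-analysis method, which aims to avoid re-running such a simulation for every configuration. LRU replacement and the function name are assumptions; the slides do not name a replacement policy.

```python
from collections import OrderedDict


def count_misses(trace, size, line, assoc):
    """Count misses of a set-associative cache over a trace of byte addresses.

    size and line are in bytes; assoc is the number of ways.
    LRU replacement is an assumption made for this sketch.
    """
    num_sets = size // (line * assoc)
    if num_sets < 1:
        raise ValueError("size must be at least line * assoc")
    sets = [OrderedDict() for _ in range(num_sets)]
    misses = 0
    for addr in trace:
        block = addr // line
        index = block % num_sets
        tag = block // num_sets
        ways = sets[index]
        if tag in ways:
            ways.move_to_end(tag)          # mark as most recently used
        else:
            misses += 1
            if len(ways) >= assoc:
                ways.popitem(last=False)   # evict the least recently used tag
            ways[tag] = True
    return misses


trace = [0, 4, 8, 64, 0, 128, 192, 256, 0]
print(count_misses(trace, size=128, line=8, assoc=2))
```

Sweeping S, L, and A over this function gives the reference values that an equation-based estimate of N would be compared against.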

Trace-analysis approach for cache
[Figure]

Trace-analysis approach for cache
• Capture improvements obtainable by:
  – changing line-size at small/large values of cache-size
  – changing associativity at small/large values of cache-size

Trace-analysis approach for bus
[Figure labels: capacitance, Num transfers per item, Random data, Bus width, Items/second]

Trace-analysis approach for bus
• Bus equation:
  – m items/second (denotes the traffic N on the bus)
  – n bits/item
  – k-bit-wide bus
  – bus-invert encoding
  – random data assumption
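The bus equation itself is not reproduced in this transcript, and the following is not a reconstruction of it. It is a hedged Python sketch of how an estimate of switching activity could be formed from the quantities listed above, assuming that each item needs ceil(n/k) transfers, that successive transfers carry independent uniformly random words, and that a bus-invert encoder sends whichever of the plain or inverted word toggles fewer of the k + 1 lines. All function names are assumptions.

```python
from math import ceil, comb


def transfers_per_second(m: float, n: int, k: int) -> float:
    """m items/s of n bits each over a k-bit bus: ceil(n/k) transfers per item."""
    return m * ceil(n / k)


def expected_toggles_binary(k: int) -> float:
    """Random data: each of the k lines toggles with probability 1/2."""
    return k / 2


def expected_toggles_bus_invert(k: int) -> float:
    """Expected toggles per transfer on k data lines plus one invert line.

    If h data lines would toggle unencoded (h ~ Binomial(k, 1/2)), the
    encoder pays min(h, k + 1 - h), taking the invert line as currently 0;
    the other case is symmetric.
    """
    return sum(comb(k, h) * min(h, k + 1 - h) for h in range(k + 1)) / 2 ** k


def toggles_per_second(m: float, n: int, k: int, bus_invert: bool) -> float:
    """Expected line toggles per second under the stated assumptions."""
    per_transfer = (expected_toggles_bus_invert(k) if bus_invert
                    else expected_toggles_binary(k))
    return transfers_per_second(m, n, k) * per_transfer


print(expected_toggles_binary(8), expected_toggles_bus_invert(8))
```

Narrowing the bus raises the number of transfers per item while lowering the toggles per transfer, which is the kind of tradeoff a closed-form bus estimate has to capture.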

Trace-analysis experiments
• Cache parameters
  – size: 128, 256, 512, 1k, 2k, 4k, 8k, 16k, 32k
  – assoc: 2, 4, 8
  – line: 8, 16, 32
• Bus parameters
  – width: 4, 8, 16, 32
  – code: binary/bus-invert
• Analyzed 45K sets exhaustively for each of 4 examples.
[Figure labels: CPU, I-Cache, D-Cache, Memory, Bus A, Bridge, Bus B, Peripheral Bus, Peripheral 1, Peripheral 2, Peripheral n]

Experiment Results
• Diesel application’s performance
• Blue (light-gray) is system-simulation-based
• Red (dark-gray) is trace-analysis-based
• 4% error, 320x faster

Experiment Results
• Diesel application’s energy consumption
• Blue (light-gray) is obtained using full simulation
• Red (dark-gray) is obtained using our equations
• 2% error, 420x faster

Experiment Results
• CKey application’s performance
• Blue (light-gray) is obtained using full simulation
• Red (dark-gray) is obtained using our equations
• 8% error, 125x faster

Experiment Results
• CKey application’s energy consumption
• Blue (light-gray) is obtained using full simulation
• Red (dark-gray) is obtained using our equations
• 3% error, 125x faster

Experiment Results
• 125-400x speedup
• 1-18% absolute error (power & performance)
• 2% average power error
[Figure: plots of Time (hours) and Power Error (%)]

Techniques for general cores
• Earlier experiments were for uP/cache/bus
• System simulation for other cores (ISSS’00)
  – Isolate “instructions” in system-level model
  – Gate-level simulation per instruction
  – Back-annotate system-level model’s instructions
  – Similar to technique for microprocessors, but:
    • Must consider “power modes”

Trace approach for general cores
[Figure: the system-level model (uP, DMA, UART) is executed to produce Traces; parameter evaluation executes trace simulators to obtain Power]
• Full trace
    Reset --
    Quantize P1, P2, …, P64
    IDCT P1, P2, …, P64
• Reduced trace with instructions only
    Reset --
    Quantize --
    IDCT --
• Reduced trace with characterized data
    Reset --
    Quantize .80
    IDCT .72
    Quantize .93
    IDCT .63
• Reduced trace with instruction frequencies
    Reset *1
    Quantize *2
    IDCT *2
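The progression from a full trace to instruction frequencies can be sketched in Python as below. The trace layout (a list of instruction-name and operand pairs), the function names, and the per-instruction energy numbers are all hypothetical; in the approach described earlier in the deck, per-instruction figures would come from gate-level simulation and back-annotation.

```python
from collections import Counter

# Full trace: (instruction, operands). The 64 operands stand in for P1..P64.
full_trace = [
    ("Reset", None),
    ("Quantize", list(range(64))),
    ("IDCT", list(range(64))),
    ("Quantize", list(range(64))),
    ("IDCT", list(range(64))),
]


def instructions_only(trace):
    """Reduced trace: keep the instruction names, drop the operand data."""
    return [name for name, _ in trace]


def instruction_frequencies(trace):
    """Further reduced trace: how often each instruction occurs."""
    return Counter(instructions_only(trace))


# Hypothetical per-instruction energies in arbitrary units.
ENERGY = {"Reset": 1.0, "Quantize": 5.0, "IDCT": 8.0}


def estimate_energy(freqs, energy_table):
    """Sum of (occurrences x per-instruction energy) over all instructions."""
    return sum(count * energy_table[name] for name, count in freqs.items())


freqs = instruction_frequencies(full_trace)
print(dict(freqs), estimate_energy(freqs, ENERGY))
```

Each reduction step shrinks the trace and speeds up evaluation at the cost of discarding data-dependent effects, which is the accuracy-versus-speed tradeoff the trace variants explore.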

Experiments with general cores: JPEG

Trace file size (Kb)
pixel size (bits)   ftrc   rtrc_cd   rtrc_i
10                  32     ?         0.5
12                  39     ?         0.5
(the rtrc_cd entries appear in the source only as “3. 6”)

CPU time for power evaluation (sec)
pixel size (bits)   gate     sys   ftrc   rtrc_cd   rtrc_i
10                  290000   48    26     4.9       4.6
12                  330000   49    27     5.1       ?
average speedup:             6K    12K    62K       67K

Energy
pixel size (bits)   gate mJ   ftrc mJ   error   rtrc_cd mJ   error   rtrc_i mJ   error
10                  420       443       5%      451          7%      491         17%
12                  531       569       7%      576          8%      632         19%
average error:                          6%                   7.5%                18%
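The error percentages in the JPEG energy results can be re-derived from the reported mJ values, taking the gate-level figure as the reference. The Python sketch below does that arithmetic; the dictionary layout and function names are assumptions, while the numbers are the ones reported on the slide. The slide's per-row percentages appear to be rounded before averaging, so the averages computed here can differ from the slide's by a fraction of a percent.

```python
# Energy in mJ by pixel size (bits), as reported on the slide.
gate_mj = {10: 420, 12: 531}
estimates_mj = {
    "ftrc": {10: 443, 12: 569},
    "rtrc_cd": {10: 451, 12: 576},
    "rtrc_i": {10: 491, 12: 632},
}


def percent_error(estimate: float, reference: float) -> float:
    """Absolute error of an estimate relative to the reference, in percent."""
    return 100.0 * abs(estimate - reference) / reference


def average_error(method: str) -> float:
    """Mean percent error of one trace method over both pixel sizes."""
    errors = [percent_error(estimates_mj[method][bits], gate_mj[bits])
              for bits in gate_mj]
    return sum(errors) / len(errors)


for method in estimates_mj:
    print(f"{method}: {average_error(method):.1f}% average error")
```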

Experiments with general cores: UART

Outline
• Introduction
• Parameterized SOC platforms
• Exploring parameter configurations
• Future direction: self-optimizing platforms
• Conclusions

Future directions
• Earlier work
  – Used software on a workstation to explore parameter configurations
[Figure: Exploration sw on the Workstation sends a Configuration to the Platform]
• “Self-optimizing” platform
  – Can we build the exploration ability into the platform itself?
  – Transparent to the user
    • Ease of use, more accurate metrics, wider acceptance
  – “Embedded CAD”
[Figure: the Workstation sends a Regular binary to the Platform, which contains the Exploration ability]

Conclusions
• Parameters can improve usefulness of programmable platforms
  – by adapting platform to particular application and to power/performance constraints
• Good tradeoff range even for basic parameters
• Fast and accurate evaluation seems possible
• Much work remains
  – More parameters
  – Better exploration
  – Self-optimizing platforms