Hardware Performance Counters (Parapet Research Group, Princeton University)
Hardware Performance Counters for Detailed Runtime Power and Thermal Estimations: Experiences & Proposals
Canturk Isci, Gilberto Contreras, Margaret Martonosi
Parapet Research Group, Princeton University EE
Workshop on Hardware Performance Monitor Design and Functionality, HPCA-11, Feb 13, 2005
Hardware Performance Counters (HPCs) Go beyond Performance
  - Several explored research avenues
    - Runtime power/thermal estimations
    - Dynamic management
    - Workload phases and application behavior prediction
  - HPCs provide value beyond simulations
    - Long timescales
    - Real-system behavior
Hardware Performance Counters (HPCs) Go beyond Performance
  - Runtime power
    - Isci & Martonosi [MICRO 2003]
    - Contreras & Martonosi [Submitted 2005]
  - Runtime thermal
    - Lee & Skadron [HP-PAC in IPDPS 2005]
  - Dynamic power management
    - Choi et al. [ISLPED 2004]
    - Weißel & Bellosa [CASES 2002]
  - Dynamic thermal management
    - Bellosa et al. [COLP 2003]
  - Workload phases and application behavior prediction
    - Isci & Martonosi [WWC 2003]
    - Duesterwald et al. [PACT 2003]
High-Performance Corner: P4 Power Estimation
  - Idea:
    Power[i] = MaxPower[i] × ArchScaling[i] × AccessRate[i] + NonGatedPower[i], for each component i
  - Motivation:
    - Fast (real-time)
    - Estimated view of on-chip detail (per physical component)
  - Design:
    - Developed heuristics using 24 events to approximate access rates for 22 chip components
    - Used 15 counters with 4 rotations to collect all event data
  - Validation:
    - Real-time estimates against real-time measured power
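The per-component model on this slide can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the component names, the wattages, the assumption that access rates are normalized to [0, 1], and the choice to sum component powers into a chip total are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Component:
    name: str
    max_power_w: float        # power at full utilization (made-up value)
    arch_scaling: float       # architectural scaling factor (made-up value)
    non_gated_power_w: float  # power drawn even when the unit is idle (made-up value)


def component_power(c: Component, access_rate: float) -> float:
    """Power[i] = MaxPower[i] * ArchScaling[i] * AccessRate[i] + NonGatedPower[i]."""
    if not 0.0 <= access_rate <= 1.0:
        # Assumption of this sketch: access rates are normalized to [0, 1].
        raise ValueError("access rate must be in [0, 1]")
    return c.max_power_w * c.arch_scaling * access_rate + c.non_gated_power_w


def total_power(components, access_rates) -> float:
    """Sum the per-component estimates (assumed aggregation for illustration)."""
    return sum(component_power(c, access_rates[c.name]) for c in components)


# Hypothetical components and counter-derived access rates.
components = [
    Component("trace_cache", max_power_w=4.0, arch_scaling=1.0, non_gated_power_w=0.5),
    Component("int_alu", max_power_w=2.0, arch_scaling=0.5, non_gated_power_w=0.25),
]
rates = {"trace_cache": 0.5, "int_alu": 0.25}
estimate = total_power(components, rates)
```

In a real estimator the access rates would be derived from event counts per sampling interval; here they are supplied directly to keep the sketch self-contained.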
P4 Power Estimator Results
  - [Figure: measured vs. modeled power for gcc, gzip, vpr, vortex, gap, crafty]
  - Average difference: ~5% among all benchmarks
  - SPEC CPU 2000 & other applications
Embedded Corner: PXA255 Power Estimation
  - Idea:
    CPU Power (n×1) = PerformanceEvents (n×5) × LinearParameters (5×1) + IdlePower
    Mem Power (n×1) = PerformanceEvents (n×2) × LinearParameters (2×1) + IdlePower
  - Motivation:
    - Runtime power optimizations under DVFS
  - Design:
    - Parameter estimation (OLS) using dominant counter readings and live power measurements
    - Power estimation at various CPU configurations
  - Validation:
    - Comparison between estimates and real-time measured power
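The OLS parameter estimation named on this slide can be illustrated with a small pure-Python sketch. It fits a linear model of the form power = events × parameters + idle power by solving the normal equations. The two-event setup and all sample values are invented for illustration (the slide uses 5 events for the CPU model and 2 for the memory model), and the solver is a generic textbook routine, not the authors' tooling.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system: event columns are linearly dependent")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_ols(event_rows, measured_power):
    """Return (linear_parameters, idle_power) minimizing the squared error."""
    x = [list(row) + [1.0] for row in event_rows]  # constant column models idle power
    k = len(x[0])
    xtx = [[sum(r[i] * r[j] for r in x) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * p for r, p in zip(x, measured_power)) for i in range(k)]
    beta = solve_linear_system(xtx, xty)
    return beta[:-1], beta[-1]


def estimate_power(params, idle_power, events):
    """Apply the fitted model to one sample of event rates."""
    return idle_power + sum(w * e for w, e in zip(params, events))


# Made-up training samples: two event rates per interval and measured power in watts.
event_rows = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (2.0, 1.0), (1.0, 3.0)]
measured = [0.15, 0.25, 0.35, 0.45, 0.75]
params, idle = fit_ols(event_rows, measured)
prediction = estimate_power(params, idle, (2.0, 2.0))
```

Because the sample data were generated from a known linear relationship, the fit recovers it closely; with live power measurements the residual error would reflect how well the chosen events explain power.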
PXA255 Results
  - [Figure: DB, CDC Java]
  - 5% average error across 3 domains
    - Java CDC
    - Java CLDC
    - SPEC 2000
Proposals from Experiences
  - 1. Track each physical unit individually for power & thermal:
    - Ex: μCode ROM, μop Queue, Instr-n Queue 1, Allocate, Rename, Instr-n Queue 2, Schedulers, Trace Cache, Dispatch Ports, MEM, EXE: all tracked with in-flight μops written to the μop queue
  - Need individual utilization counts for each physical unit available on die for power and hotspot analyses
Proposals from Experiences
  - 2. Need bitline activity counts
  - Utilization is not complete information; power in part depends on switching factor
  - [Figure: 30 mW (10%) swing; PXA255 processor at 400 MHz, 1.3 V]
  - Not necessarily fully detailed counts:
    - Accumulate bitwise XOR of current and previous input/output ports
    - Sample register file ports/bit populations
Proposals from Experiences
  - 2. Need bitline activity counts
  - Utilization is not complete information; power in part depends on switching factor
  - [Figure: additions on operands A and B with different bit patterns (000…01, 111…11, 000…00, ...); 20 mW swing; PXA255 processor at 400 MHz, 1.3 V]
  - Not necessarily fully detailed counts:
    - Accumulate bitwise XOR of current and previous input/output ports
    - Sample register file ports/bit populations
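The proposal to accumulate the bitwise XOR of current and previous port values can be sketched in software as below. The class name, the 32-bit port width, and the operand sequences are assumptions for illustration; the point is that two streams with identical utilization (same number of accesses) can have very different switching activity.

```python
class ToggleCounter:
    """Accumulates bit flips between consecutive values seen on a port."""

    def __init__(self, width: int = 32) -> None:
        self.mask = (1 << width) - 1
        self.prev = 0       # port is assumed to start at all zeros
        self.toggles = 0    # running count of bits that changed
        self.accesses = 0   # plain utilization count, for comparison

    def observe(self, value: int) -> None:
        value &= self.mask
        # XOR marks the bits that differ; counting the ones gives the Hamming distance.
        self.toggles += bin(self.prev ^ value).count("1")
        self.prev = value
        self.accesses += 1


# Stream A: the same small operand repeated (low switching).
a = ToggleCounter()
for v in [0x00000001] * 4:
    a.observe(v)

# Stream B: operands alternating between all ones and all zeros (high switching).
b = ToggleCounter()
for v in [0xFFFFFFFF, 0x00000000, 0xFFFFFFFF, 0x00000000]:
    b.observe(v)
```

Both streams record four accesses, yet stream A toggles 1 bit in total while stream B toggles 128, which is the information a utilization-only counter cannot provide.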
Proposals from Experiences
  - 3. More detailed off-chip/memory access support in the embedded domain
  - Mem power ~40% of system power
  - Tracking memory hierarchy transactions may help render better memory power estimates
    - Main memory reads/writes (core + DMA)
    - Transaction length in bytes
  - [Figure: REX memory power consumption (one 16 b bank)]
  - Activity factors can be shared with the register file
Proposals from Experiences
  - 4. Metrics related to queue occupancy
  - Modern processor ≡ several queues
  - Depending on implementation, power ∝ queue occupancy
  - Buyuktosunoglu et al. [ISLPED'02], "Tradeoffs in Power-Efficient Issue Queue Design"
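As a rough illustration of what an occupancy counter could expose, the sketch below accumulates the number of valid queue entries each cycle and derives an average occupancy. The counter interface and the linear power model with its coefficients are hypothetical; the slide only states that power may be proportional to occupancy depending on the implementation.

```python
class OccupancyCounter:
    """Accumulates valid queue entries per cycle (a hypothetical counter)."""

    def __init__(self) -> None:
        self.cycles = 0
        self.entry_cycles = 0  # sum of occupied entries over all cycles

    def tick(self, valid_entries: int) -> None:
        self.cycles += 1
        self.entry_cycles += valid_entries

    def average_occupancy(self) -> float:
        return self.entry_cycles / self.cycles if self.cycles else 0.0


def queue_power(avg_occupancy: float, base_w: float, per_entry_w: float) -> float:
    """Assumed model: a fixed base cost plus a cost proportional to occupancy."""
    return base_w + per_entry_w * avg_occupancy


counter = OccupancyCounter()
for occupied in [0, 4, 8, 4]:  # made-up per-cycle occupancy samples
    counter.tick(occupied)

avg = counter.average_occupancy()
power = queue_power(avg, base_w=0.5, per_entry_w=0.25)
```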
Proposals from Experiences
  - 5. General/aggregate metrics in addition to specialized cases/breakdowns simplify runtime sampling for unit accesses
  - P4 ex. 1, MOB: only event is MOB_load_replays
    - Counts replays for unknown store address/data, partial/unaligned address match
    - No info for MOB entries/accesses/updates
  - P4 ex. 2, FPU: has 8 separate events (with 2 dedicated ESCRs)
    - Need at least 4 rotations to collect
  - P4 ex. 3, INT ALU: no dedicated event
Additional Comments for HPC Design
  - General/aggregate metrics in addition to specialized cases/breakdowns simplify runtime sampling for unit accesses
  - Metrics related to register file accesses vs. forwarding
  - Semi-distributed implementations will always induce dependencies among simultaneously countable events
    - Higher parallelism among (power-oriented) metrics for minimal counter rotations at runtime
    - Implementations that allow counter rotations without need for intermediate logging: partitioned / dual-mode / buffered counters
  - Different events for different types of accesses to the same units with different-magnitude power implications
    - i.e. branch scan < BHT update < BTA update
  - Different API/SW demands:
    - Lightweight implementations for runtime analyses
    - Per-thread for application profiling vs. global for real-time measurement comparisons and hotspots
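The cost of counter rotations discussed in these slides can be made concrete with a small sketch. It greedily packs events into rotations so that no rotation exceeds the number of available counters or contains two events that cannot be counted together, then extrapolates a count observed during one rotation to the full interval. The event names, the conflict pair, and the counter limit are hypothetical, and the greedy packing is one simple strategy rather than the method used by the authors.

```python
def plan_rotations(events, conflicts, max_counters):
    """Greedily pack events into rotations with no conflicting pair in any rotation."""
    rotations = []
    for ev in events:
        for group in rotations:
            fits = len(group) < max_counters
            clash = any(frozenset((ev, other)) in conflicts for other in group)
            if fits and not clash:
                group.append(ev)
                break
        else:
            # No existing rotation can take this event: open a new one.
            rotations.append([ev])
    return rotations


def extrapolate(raw_count, observed_cycles, total_cycles):
    """Scale a count seen during one rotation up to the whole sampling interval."""
    return raw_count * total_cycles / observed_cycles


# Hypothetical events; fp_add and fp_mul are assumed to share selection hardware.
events = ["uops_queued", "fp_add", "fp_mul", "load_replays"]
conflicts = {frozenset(("fp_add", "fp_mul"))}
rotations = plan_rotations(events, conflicts, max_counters=3)

# An event counted 1200 times while active for a quarter of the interval.
scaled = extrapolate(1200, observed_cycles=250_000, total_cycles=1_000_000)
```

Every additional rotation shortens the time each event is actually observed, so the extrapolated counts become less representative of short-lived behavior; this is why higher parallelism among power-oriented events is desirable.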
Wishlist for Power/Thermal
  - 1) For each physical unit on die, separate events to track utilization rates
    - Sub-events for different types of accesses with different power costs
  - 2) Bitline activity counters for switching units
  - 3) Occupancy counters for related queues
  - 4) Counter support for off-core memory accesses
  - 5) High parallelism among power events for minimal counter rotations
Conclusions
  - New opportunities remain to be explored in future PMC designs for power and thermal studies
    - Direct correspondence to physical units
    - Bitline and occupancy counters
  - We believe in the feasibility of these additions with the continuing emphasis given to counter design, as long as power is also considered a primary design target.