Variation-Tolerant OpenMP Tasking on Tightly-Coupled Processor Clusters

Variation-Tolerant OpenMP Tasking on Tightly-Coupled Processor Clusters
A. Rahimi, A. Marongiu, P. Burgio, R. K. Gupta, L. Benini
UC San Diego and Università di Bologna

Outline
• Device variability
  – Process, voltage, and temperature variations
• Why OpenMP and why tasking?
• Task-Level Vulnerability (TLV)
• Variation-Tolerant Architecture
• Inter- and Intra-corner TLV
• Variation-Tolerant OpenMP Tasking
  – Variation-Aware Reactive Scheduling Algorithm
• Experimental Results
03-Jun-21, Andrea Marongiu / Università di Bologna

Ever-increasing Process, Voltage, and Temperature Variations
• Variability in transistor characteristics is a major challenge in nanoscale CMOS:
  – Static process variation, e.g., 40% in V_TH
  – Dynamic variations, e.g., temperature fluctuations of 160°C and supply voltage droops of 10%
• To handle variations, designers use conservative guardbands, at the cost of operational efficiency.

Approaches to Variability-Tolerance
1. Design-time conservative guardbanding
2. Post-silicon binning
3. Runtime tolerance by various forms of adaptiveness, e.g., replaying errant instructions
• The runtime approach:
  I. relies on online measurements of errors
  II. creates runtime overhead [Bowman’11] in both:
      – Latency (up to 28 extra recovery cycles per error)
      – Energy (26 nJ)
      This overhead should be minimized.

Why a Variation-Aware OpenMP?
• Variations are exacerbated in many-core systems:
  – Multiple voltage-temperature islands
  – Cores in different islands display different error rates
• The programming model and runtime environment of MIMD systems should be aware of variations.
[Figure: Frequency variation of a 16-core cluster due to within-die (WID) and die-to-die (D2D) process variation. Per-core frequencies (MHz): 847, 893, 847, 901, 847, 909, 877, 870, 909, 855, 826, 917, 901, 820, 826, 862.]
• Core 1 at 0.81 V faces 428 K errant instructions.
• Core 0 at 1.1 V faces 7.3 K errant instructions.

Why OpenMP Tasking?
• Task-Level Vulnerability (TLV) as metadata to characterize variations.
• TLV is a vertical abstraction: it reflects the manifestation of circuit-level variability in a specific parallel software context.
• The right granularity:
  – For the OMP scheduler to observe and react
  – A convenient abstraction for programmers to express irregular and unstructured parallelism
The steps to build variability abstractions up to the SW layer:
1. Instruction-level Vulnerability (ILV)
2. Sequence-level Vulnerability (SLV)
3. Procedure-level Vulnerability (PLV)
4. Task-level Vulnerability (TLV)
[ILV] A. Rahimi, L. Benini, R. K. Gupta, “Analysis of Instruction-level Vulnerability to Dynamic Voltage and Temperature Variations,” DATE, 2012.
[SLV] A. Rahimi, L. Benini, R. K. Gupta, “Application-Adaptive Guardbanding to Mitigate Static and Dynamic Variability,” IEEE Trans. on Computers, 2013 (to appear).
[PLV] A. Rahimi, L. Benini, R. K. Gupta, “Procedure Hopping: a Low Overhead Solution to Mitigate Variability in Shared-L1 Processor Clusters,” ISLPED, 2012.

Instruction-Level Vulnerability (ILV)*
• The ILV for each instruction_i at every operating condition is quantified as:
  \[ \mathrm{ILV}_i = \frac{1}{N_i} \sum_{j=1}^{N_i} \mathrm{Violation}_j \]
  – where N_i is the total number of clock cycles in the Monte Carlo simulation of instruction_i with random operands.
  – Violation_j indicates whether there is a violated stage at clock cycle_j (1) or not (0).
• ILV_i is defined as the total number of violated cycles over the total simulated cycles for instruction_i.
• Therefore, the lower the ILV, the better.
*A. Rahimi, L. Benini, R. K. Gupta, “Analysis of Instruction-level Vulnerability to Dynamic Voltage and Temperature Variations,” DATE, 2012.
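The ILV definition on this slide (violated cycles over simulated cycles) can be illustrated with a few lines of C. This is only a sketch: the function name `ilv_from_violations` and the byte-array encoding of the per-cycle violation flags are assumptions made for the example, not part of the authors' characterization flow.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not from the slides): computes
 * ILV = (number of violated cycles) / (number of simulated cycles).
 * violation[j] is nonzero when some pipeline stage violated timing
 * at simulated clock cycle j. */
double ilv_from_violations(const unsigned char *violation, size_t n_cycles)
{
    assert(n_cycles > 0);                 /* N_i must be positive */

    size_t violated = 0;
    for (size_t j = 0; j < n_cycles; j++)
        violated += (violation[j] != 0);  /* count each violated cycle once */

    return (double)violated / (double)n_cycles;
}
```

With eight simulated cycles of which two are flagged as violated, the function returns 2/8 = 0.25.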

Task-Level Vulnerability (TLV)
• ILV is a useful variability metric that raises the level of abstraction from the circuit (critical paths) to the ISA level.
• ILV is extended to a more coarse-grained task-level metric, TLV, towards building an integrated, vertical approach to controlling variability.
• TLV is a per-core and per-task-type metric:
  \[ \mathrm{TLV}_{(\mathrm{core}_i,\,\mathrm{task}_j)} = \frac{\sum \mathrm{EI}_{(\mathrm{core}_i,\,\mathrm{task}_j)}}{\mathrm{Length}_{\mathrm{task}_j}} \]
  – ∑EI is the number of errant instructions during task_j on core_i
  – Length is the total number of executed instructions
• The lower the TLV, the better.
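As a worked illustration of the TLV ratio (errant instructions over executed instructions, per core and per task type), here is a minimal C sketch. The function name `tlv_ratio` and the instruction counts in the note below are hypothetical; the slides do not give task lengths.

```c
#include <assert.h>

/* Hypothetical helper (not from the slides): TLV of one task execution
 * on one core, as errant instructions divided by executed instructions. */
double tlv_ratio(unsigned long errant_instructions, unsigned long task_length)
{
    assert(task_length > 0);  /* a task executes at least one instruction */
    return (double)errant_instructions / (double)task_length;
}
```

For example, an assumed task of 10,000 executed instructions with 211 errant ones would have a TLV of about 0.0211.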

Variation-Tolerant MP Cluster (1/2)
• Inspired by STM STHORM
• 16 x 32-bit RISC cores
• L1 SW-managed Tightly Coupled Data Memory (TCDM)
  – Multi-banked/multi-ported
  – Fast concurrent read access
• Fast logarithmic interconnect
• One clock domain
• Bridge towards NoC
[Figure: cluster block diagram. Cores 0 to M, each with an I$, replay logic, a variation sensor, and VDD-hopping, connect through master ports to a low-latency logarithmic interconnect. Slave ports lead to the shared L1 TCDM banks 0 to N, the test-and-set semaphores, and the L2/L3 bridge.]

Variation-Tolerant Architecture (2/2)
• Every core is equipped with:
  – Error sensing (EDS [Bowman’09])
    • to detect any timing error due to dynamic delay variation
  – Error recovery (multiple-issue replay mechanism [Bowman’11])
    • to recover the errant instruction without changing the clock frequency
  – VDD hopping (semi-static) [Miermont’07]
    • to compensate for the impact of static process variation [Rahimi’12]
• Thus, the cluster enables per-core characterization of TLV metadata: online variability measurement feeds TLV metadata characterization.
• Fast access to the TLV metadata for each type of task is guaranteed by carefully placing these key data structures in the L1 TCDM.
[Figure: core 0 with I$, replay, variation sensor, VDD-hopping, and master port.]

OpenMP Tasking

    #pragma omp parallel
    {
        #pragma omp single
        {
            /* One thread pushes two tasks per iteration */
            for (int i = 1; i <= N; i++) {
                #pragma omp task
                FUNC_1(i);

                #pragma omp task
                FUNC_2(i);
            }
        } /* implicit barrier */
    }

[Figure: the single thread pushes task descriptors into a task queue in TCDM; the other cores fetch and execute them in FIFO order. The example has two task types.]
• Task directives identify given portions of code (tasks).
• Task descriptors are created upon encountering a task directive.
• A task is fetched by any core encountering a barrier.
• A task type is defined for every occurrence of the task directive in the program.

Intra- and Inter-Corner TLV
• TLV across various types of tasks: the TLV of each type of task is different (up to 9×), even within a fixed operating condition on a core_i.
• Inter-corner TLV (across various operating conditions for 45 nm):
  – The average TLV of the six types of tasks is an increasing function of temperature.
  – In contrast, decreasing the voltage from the nominal point of 1.1 V increases TLV.
[Figures: intra-corner TLV at a fixed corner (25°C, 1.1 V); inter-corner TLV.]

Variation-tolerant OpenMP Tasking
• Online TLV characterization
  – TLV table: LUT containing the TLV for every core and task type
  – Resides in TCDM, allowing parallel inspection from multiple cores
• Each core collects TLV information in parallel
  – Distributed scheduler
  – LUT updated at every task execution

TLV table in TCDM (rows: task types, columns: cores):

          C0        C1      C2
    T0    0.0211    0.11    -
    T1    0.891     -       0.000005

    void handle_tasks(void)
    {
        while (HAVE_TASKS) {   /* Task scheduling loop */
            task_desc_t *t = EXTRACT_TASK();
            if (t) {
                /* Previous TLV value seen by this core */
                float Otlv = tlv_read_task_metadata(core_id);
                /* Reset counter for this core */
                tlv_reset_task_metadata(core_id);
                /* EXEC! */
                t->task_fn(t->task_data);
                /* We executed. Fetch TLV... */
                float tlv = tlv_read_task_metadata(core_id);
                /* Update TLV. Average new and old value */
                tlv_table_write(t->task_type_id, core_id, (tlv + Otlv) / 2);
            }
        }
    }
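One way to picture the TLV lookup table and its update-by-averaging step is the C sketch below. It is an illustration under stated assumptions, not the authors' runtime: the names `sketch_tlv_init`, `sketch_tlv_read`, and `sketch_tlv_update`, the negative sentinel for a "-" (not yet measured) entry, and the choice of two task types are all hypothetical. Only the 16-core cluster size comes from the slides.

```c
#include <assert.h>

#define N_CORES      16       /* cluster size given in the slides */
#define N_TASK_TYPES 2        /* assumption: two task types, as in the example */
#define TLV_UNKNOWN  (-1.0f)  /* assumption: marks a "-" (unmeasured) entry */

/* One entry per (task type, core); a real runtime would place this in TCDM. */
static float tlv_table[N_TASK_TYPES][N_CORES];

/* Mark every entry as not yet characterized. */
void sketch_tlv_init(void)
{
    for (int t = 0; t < N_TASK_TYPES; t++)
        for (int c = 0; c < N_CORES; c++)
            tlv_table[t][c] = TLV_UNKNOWN;
}

float sketch_tlv_read(int task_type, int core)
{
    assert(task_type >= 0 && task_type < N_TASK_TYPES);
    assert(core >= 0 && core < N_CORES);
    return tlv_table[task_type][core];
}

/* Fold a new measurement into the table: the first measurement is stored
 * as is, later ones are averaged with the stored value. */
void sketch_tlv_update(int task_type, int core, float measured)
{
    float old = sketch_tlv_read(task_type, core);
    tlv_table[task_type][core] =
        (old < 0.0f) ? measured : (old + measured) / 2.0f;
}
```

After `sketch_tlv_init()`, updating entry (T0, C0) with 0.5 and then 0.25 leaves 0.375 in the table. Averaging old and new values gives recent executions half the weight, so the entry tracks changing operating conditions while smoothing single-task noise.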

TLV-aware Extensions

    #pragma omp parallel
    {
        #pragma omp single
        {
            for (int i = 1; i <= N; i++) {
                #pragma omp task
                FUNC_1(i);

                #pragma omp task
                FUNC_2(i);
            }
        } /* implicit barrier */
    }

[Figure: task descriptors sit in the task queue in TCDM; the baseline "fetch and execute (FIFO)" step is replaced by a TLV-aware fetch.]
• Variation-tolerant OpenMP scheduler
  – Reactive scheduling: idle processors trying to fetch a task check whether their TLV for that task is under a certain threshold, to minimize the number of errant instructions (and costly replay cycles).
  – The number of rejects for a given task is limited, to avoid starvation.

Variation-aware Scheduling Algorithm

TLV table in TCDM:

          C0        C1      C2
    T0    0.0211    0.11    -
    T1    0.891     -       0.000005

core_escape_cnt:

    C0    C1    C2
    1     5     0

Scheduling step executed by core_i on the task queue:

    task_j = PEEK_QUEUE();
    TLV(i,j) = tlv_table_read(core_i, task_j);
    if (TLV(i,j) > TLV_THR && core_i_escape_cnt < ESCAPE_THR) {
        /* Too vulnerable on this core: skip the task, but count the skip */
        core_i_escape_cnt++;
        escape(task_j);
    } else {
        /* Acceptable TLV, or escape budget exhausted: run the task */
        assign_to_core_i(task_j);
        core_i_escape_cnt = 0;
    }
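The accept-or-escape decision in the pseudocode above can be sketched as a small C function. The threshold values `TLV_THR = 0.1` and `ESCAPE_THR = 4` and the function name `tlv_aware_accept` are assumptions chosen for illustration; the slides do not state the thresholds used in the experiments.

```c
#include <assert.h>
#include <stdbool.h>

#define TLV_THR    0.1f  /* assumption: vulnerability threshold */
#define ESCAPE_THR 4     /* assumption: max consecutive escapes per core */

/* Hypothetical helper (not from the slides): decides whether a core should
 * run the task at the head of the queue. Returns true to execute the task
 * and false to escape it. *escape_cnt is the core's consecutive-escape
 * counter; it is incremented on an escape and reset on an accept. */
bool tlv_aware_accept(float tlv, int *escape_cnt)
{
    if (tlv > TLV_THR && *escape_cnt < ESCAPE_THR) {
        (*escape_cnt)++;   /* too vulnerable here, leave it for another core */
        return false;
    }
    *escape_cnt = 0;       /* acceptable TLV, or escape budget exhausted */
    return true;
}
```

With these assumed thresholds, a core seeing a TLV of 0.891 escapes four times in a row and then accepts on the fifth attempt, which is the bounded-reject behaviour that avoids starvation. A TLV of 0.0211 is accepted immediately.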

Experimental Setup: Arch. + Benchmarks
• Architecture: SystemC-based virtual platform* modeling the tightly-coupled cluster

    ARM v6 cores    16                TCDM banks      16
    I$ size         16 KB per core    TCDM latency    2 cycles
    I$ line         4 words           TCDM size       256 KB
    Latency hit     1 cycle           L3 latency      ≥ 60 cycles
    Latency miss    ≥ 59 cycles       L3 size         256 MB

• Benchmarks: seven widely used computational kernels from the image processing domain, parallelized using OpenMP tasking, with 375 dynamic tasks on average.
• The TLV lookup table occupies only 104 to 448 bytes, depending on the number of task types.
*D. Bortolotti et al., “Exploring instruction caching strategies for tightly-coupled shared memory clusters,” Proc. Intern. Symposium on System on Chip (SoC), pp. 34-41, 2011.

Experimental Setup: Variability Modeling
• To emulate variations, we have integrated variation models at the level of individual instructions using the ILV characterization methodology.
• ILV models of a 16-core LEON-3 for TSMC 45-nm, general-purpose process with normal V_TH cells.
• Vdd-hopping is applied to compensate for the injected process variation.
• Each core is optimized during P&R with a target frequency of 850 MHz.
• At sign-off, die-to-die and within-die process variations are injected using PrimeTime VX and variation-aware 45 nm TSMC libs (derived from PCA).
• Six cores (C0, C2, C4, C10, C13, C14) cannot meet the design-time target frequency of 850 MHz.
• After Vdd-hopping, all cores can work with the design-time target frequency of 850 MHz, but with multiple voltage OpPs: VDD = {1.1 V, 0.97 V, 0.81 V}.

Core frequencies after process variation (MHz):

    C0  847    C4  847    C8   909    C12  901
    C1  893    C5  909    C9   855    C13  820
    C2  847    C6  877    C10  826    C14  826
    C3  901    C7  870    C11  917    C15  862

Core frequencies after Vdd-hopping (MHz):

    C0  >850   C4  >850   C8   909    C12  901
    C1  893    C5  909    C9   855    C13  >850
    C2  >850   C6  877    C10  >850   C14  >850
    C3  901    C7  870    C11  917    C15  862

Overhead of Variation-tolerant Scheduler
• Normalized IPC = IPC of the variation-aware scheduler / IPC of the OMP baseline scheduler
• On a variation-immune cluster, the normalized IPC of the cluster decreases slightly, to 0.998× on average, due to:
  – reading the TLV lookup table
  – checking the conditions

IPC of Variability-affected Cluster
• M = number of times that the scheduler postpones the execution of the task at the head of the queue. On average, each task is escaped 2.1 times.
• Our scheduler decreases the number of cycles per cluster for each type of task, because cores incur fewer errant instructions and spend fewer cycles on recovery.
• The normalized IPC increases by 1.17× (on average) for all benchmarks executing at 10°C. At a temperature of 100°C (ΔT = 90°C), the IPC increases by 1.15×.

Conclusion
• Vertical abstraction of circuit-level variations into high-level parallel software execution (OpenMP 3.0 tasking).
• The vulnerability of tasks is characterized by TLV metadata during introspective execution.
• The reactive variation-tolerant runtime scheduler uses TLV to match cores with tasks.
• The normalized IPC of the 16-core variability-affected cluster increases by up to 1.51× (on average, 1.15×).
• Future work: multiple clusters at multiple dynamic OpPs in Vdd and f.

Thank you for your attention!
Acknowledgments: ERC MultiTherman, NSF Variability Expedition

Classification of Instructions Based on ILV
[Table: ILV at 0.88 V while varying temperature, for 65 nm. Rows: operating corners (0.88 V, -40°C), (0.88 V, 0°C), and (0.88 V, 125°C), each swept over cycle times between 1 ns and 1.18 ns. Columns: Logical & Arithmetic (add, and, or, sll, sra, srl, sub, xnor, xor), Mem (load, store), and Mul & Div (mul, div).]
• Instructions are partitioned into three main classes:
  – 1st class: logical & arithmetic instructions
  – 2nd class: memory instructions
  – 3rd class: hardware multiply & divide instructions
• For every operating condition:
  – ILV (3rd class) ≥ ILV (2nd class) ≥ ILV (1st class)