UHPC Program Execution Model and Its Impacts

- Slides: 26

UHPC Program Execution Model and Its Impacts
Guang R. Gao, Rishi Khan
12/20/2021 UHPC-09-14-2010-Gao

Outline
• Background: The Problem of “Threads”
• A Codelet Based Execution Model
• A Codelet-level Coordination Programming Interface
• Codelet Program Examples
• Impact on Architecture and Memory Systems
• Summary

Difficulty of Reasoning About Concurrency
“Humans are quickly overwhelmed by concurrency and find it much more difficult to reason about concurrent than sequential code. Even careful people miss possible interleavings among even simple collections of partially ordered operations.”
- Sutter and Larus

On “Difficulty of Concurrent Programming”
“Yet humans are actually quite adept at reasoning about concurrent systems. The physical world is highly concurrent, and our very survival depends on our ability to reason about concurrent physical dynamics.”
- Ed Lee, “Are new languages necessary for multicore?”

Why Is Productive Concurrent Programming Hard?
The difficulty of concurrent programming is a consequence of the abstraction of “threads” and their execution models!

The Problem of “Threads”
“They discard the most essential and appealing properties of sequential computation …”
– understandability
– predictability
– determinacy
– composability


What Is A Codelet?
• Intuitively: a unit of computation which interacts with the global state only at its entrance and exit points.
• Terminology: I do not like to use the term “functional” here, which usually means “stateless”!

What Is A Codelet?
- A codelet is a code unit that, when scheduled for execution (on a codelet execution unit), will not interfere with the “outside world” during its execution.
- In other words, a codelet interacts with the “outside world” only through its inputs and outputs, which happen at the beginning and the end of its execution.
- Consequently, its scheduling is intrinsically non-preemptive.

Codelet Graph
• A Codelet Graph G = (V, E) is a graph where
– V is a set of nodes (each corresponding to a codelet)
– E is a set of directed edges, each connecting a pair of codelet nodes

Operational Semantics of Codelets: Enabling/Firing Rules
Consider a codelet graph G with an assignment of events on some of its edges:
• A codelet is enabled if
– an event is present on each of its input edges;
– no event is present on any of its output edges.
• An enabled codelet can be scheduled for execution (i.e. fired). The firing of a codelet removes one event from each input edge and produces one event on each output edge.
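The enabling and firing rules above can be sketched in a few lines of Python. This is only an illustrative model, not the authors' runtime: the class `CodeletGraph`, the helper `run`, the edge names, and the choice to fire enabled codelets in alphabetical order are all assumptions made for the example. Each edge holds at most one event, which follows from the rule that output edges must be empty before a codelet may fire.

```python
class CodeletGraph:
    """Hypothetical model: codelets joined by named edges, one event per edge."""

    def __init__(self):
        self.inputs = {}     # codelet name -> set of input edge names
        self.outputs = {}    # codelet name -> set of output edge names
        self.events = set()  # edges that currently hold an event

    def add_codelet(self, name, inputs, outputs):
        self.inputs[name] = set(inputs)
        self.outputs[name] = set(outputs)

    def enabled(self, name):
        # Enabled: an event on every input edge and no event on any output edge.
        has_all_inputs = self.inputs[name] <= self.events
        outputs_empty = not (self.outputs[name] & self.events)
        return has_all_inputs and outputs_empty

    def fire(self, name):
        if not self.enabled(name):
            raise RuntimeError(f"codelet {name!r} is not enabled")
        self.events -= self.inputs[name]   # remove one event from each input
        self.events |= self.outputs[name]  # produce one event on each output


def run(graph):
    """Fire enabled codelets (alphabetical tie-break) until none is enabled."""
    order = []
    while True:
        ready = sorted(n for n in graph.inputs if graph.enabled(n))
        if not ready:
            return order
        graph.fire(ready[0])
        order.append(ready[0])


# A diamond-shaped graph: A feeds B and C, which both feed D.
g = CodeletGraph()
g.add_codelet("A", inputs=["start"], outputs=["ab", "ac"])
g.add_codelet("B", inputs=["ab"], outputs=["bd"])
g.add_codelet("C", inputs=["ac"], outputs=["cd"])
g.add_codelet("D", inputs=["bd", "cd"], outputs=["done"])
g.events.add("start")
firing_order = run(g)
```

In this run D cannot fire until both B and C have fired, and the only event left at the end sits on the `done` edge.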

Comments on Codelet Generation: the Effect of Non-preemption
• Codelet execution is intrinsically non-preemptive.
• So we should ensure that no two statements in a codelet are separated by a dependence that takes a long latency to resolve (or satisfy).
• Historical remark: Ph.D. theses under Gao and Hendren’s supervision during the late 1990s at McGill and UD.

On Codelet Partition
• If two statements have a dependence whose latency a low-level codelet compiler cannot profitably schedule around (to mask the latency impact), then it is advisable not to include them in the same codelet.
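One way to picture this partitioning advice is a split at the long-latency dependence: the first codelet issues the request and exits, and the second is enabled only when the value arrives. The sketch below is a toy illustration under stated assumptions. `REMOTE_MEMORY`, `issue_load`, `use_value`, and the two-queue `scheduler` are hypothetical names, and the "latency" is modelled simply as the request sitting in an `in_flight` queue.

```python
from collections import deque

REMOTE_MEMORY = {"x": 21}  # hypothetical stand-in for a long-latency memory

ready = deque()      # enabled codelets waiting for an execution unit
in_flight = deque()  # outstanding long-latency requests
results = {}


def issue_load(addr):
    """Codelet 1: start the long-latency load, then exit without waiting."""
    in_flight.append((addr, use_value))


def use_value(value):
    """Codelet 2: enabled only once the loaded value has arrived."""
    results["y"] = 2 * value


def scheduler():
    """Run each codelet to completion; complete a pending load when idle."""
    while ready or in_flight:
        if ready:
            codelet, arg = ready.popleft()
            codelet(arg)  # non-preemptive: runs until it returns
        else:
            addr, continuation = in_flight.popleft()
            ready.append((continuation, REMOTE_MEMORY[addr]))  # load completed


ready.append((issue_load, "x"))
scheduler()
```

Neither codelet ever stalls in the middle of its body, so the execution unit is free to run other enabled codelets while the load is outstanding.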

Comments on Codelets: Role of a Sound Shared Memory Model
• Question: how about shared memory access operations?
• Should they be allowed from within a normal codelet?
• If so, what memory model(s) should a codelet model assume/use?

Memory Model of Codelets
• The shared memory model is based on LC (Location Consistency, [Gao and Sarkar 2000]) and its variants/extensions.
• There is no global coherence requirement!
• The semantics of accesses to shared memory (under a desirable memory model) is defined through the codelet coordination language, to be discussed later.


The Role of A Codelet Level Coordination Language
• Our coordination language is low-level.
• It provides basic C-like programming tools to allow users to define codelets and codelet sets.
• It provides coordination mechanisms to allow users to connect codelets and build codelet graphs.
• Other important features?
• Let us first see some examples!

Matrix Multiply Single Node
3.2 GHz Quad core Xeon (45 GF)
Double Precision general matrix multiply (DGEMM): C += A*B
4224 x 4224 matrices
1 thread performance: 434 Mflops
154 thread performance: 62600 Mflops
1 node power: 66 Watts

Matrix Multiply Multi Node
Double Precision general matrix multiply (DGEMM): C += A*B
67584 x 67584 matrices
1 node performance: 62.6 Gflops
256 node performance: 13400 Gflops (~40 K threads)
Other Systems @ 256 Nodes:
Nebulae (China) 6.3 TF
Roadrunner 3.2 TF
Jaguar Cray XT5 2.6 TF
BlueGene P 0.87 TF
BlueGene L 0.71 TF
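The scaling implied by the DGEMM figures on the two matrix multiply slides can be worked out directly. The short calculation below uses only the numbers quoted there; the helper name `speedup_and_efficiency` is made up for the example, and the Gflops-per-watt figure assumes that the 66 W node power and the 62.6 Gflops node rate describe the same run, which the slides do not state explicitly.

```python
def speedup_and_efficiency(base_rate, scaled_rate, units):
    """Return (speedup over the base rate, speedup divided by unit count)."""
    speedup = scaled_rate / base_rate
    return speedup, speedup / units


# Single node: 434 Mflops on 1 thread, 62600 Mflops on 154 threads.
single = speedup_and_efficiency(434.0, 62600.0, 154)

# Multi node: 62.6 Gflops on 1 node, 13400 Gflops on 256 nodes.
multi = speedup_and_efficiency(62.6, 13400.0, 256)

# Assumption: 66 W and 62.6 Gflops refer to the same single-node run.
gflops_per_watt = 62.6 / 66.0

print(f"154 threads: {single[0]:.1f}x speedup, {single[1]:.1%} efficiency")
print(f"256 nodes:   {multi[0]:.1f}x speedup, {multi[1]:.1%} efficiency")
print(f"single node: {gflops_per_watt:.2f} Gflops/W")
```

Under these assumptions the single-node run is roughly 144 times faster than one thread (about 94% parallel efficiency), and the 256-node run is roughly 214 times faster than one node (about 84% efficiency).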


Impact on Architectures and Memory Systems
• We will show a wish list that we have today.
• This list is not targeted to SunShine, although we hope it has a positive impact on the thinking of our team as a whole.
• This list is to open a forum for the next 4 years. We will put these items on the table for discussion, investigation, etc., so that we do not reject ideas too early and then later regret it.

A List of Features That Need Solid Support from Arch/Memory Systems
• Support of event-driven codelets and runtime scheduling:
– Support of multiple readers/writers, lock-free, wait-free, concurrent queues
– Support of eager, lenient and lazy evaluation models
• Support of async data transfer to/from memory

Architectural Features (By Kelly Livingston)
2 features that make atomics CRITICAL:
• Queueing
– With only locks we show XXX queue ops/sec on Cyclops
– With Fetch and Add we show XXX queue ops/sec on Cyclops
– Still measuring the granularity of task that makes our queue more “energy saving”
• Memory Allocation/Data Structure manipulation
– There are countless lock-free and wait-free algorithms that use Compare and Swap
– We have yet to really discuss what low-level memory allocators will look like, but they are already going to be hard... and a nightmare without CAS
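To illustrate why fetch-and-add helps queueing, the sketch below lets each producer claim a unique slot with a single atomic increment instead of holding a queue-wide lock across the whole enqueue. It is not the Cyclops queue and says nothing about the unmeasured ops/sec figures. `AtomicCounter` and `TicketQueue` are hypothetical names, and because Python has no hardware atomics, a small lock inside `fetch_and_add` stands in for the atomic unit.

```python
import threading


class AtomicCounter:
    """Simulated fetch-and-add; the lock stands in for the atomic hardware."""

    def __init__(self, value=0):
        self._value = value
        self._lock = threading.Lock()

    def fetch_and_add(self, delta):
        with self._lock:
            old = self._value
            self._value += delta
            return old


class TicketQueue:
    """Bounded multi-producer buffer: each enqueue claims a unique slot."""

    def __init__(self, capacity):
        self.slots = [None] * capacity
        self.tail = AtomicCounter()

    def enqueue(self, item):
        ticket = self.tail.fetch_and_add(1)  # one atomic op claims the slot
        if ticket >= len(self.slots):
            raise OverflowError("queue full")
        self.slots[ticket] = item  # no other producer holds this ticket
        return ticket


def producer(queue, worker_id, count):
    for i in range(count):
        queue.enqueue((worker_id, i))


queue = TicketQueue(capacity=400)
threads = [threading.Thread(target=producer, args=(queue, w, 100)) for w in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Because tickets are unique, producers never write to the same slot, so the only shared update is the counter itself.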

Hardware Support (first draft, from Rishi)
atomic ops (probably should be done by the memory controller):
* atomic store of 8, 16, 32, 64 bits
* atomic load of 64 bits (other sizes appreciated but not necessary)
* compare with value 1 and swap with value 2
* compare to value 1 and add value 2
* basic op (add, or, xor, not, and) [just store answer]
* fetch and basic op (add, or, xor, not, and)
* needs to be accessible by all XEs/CEs, but not necessarily every memory needs these ops [except atomic load/store]
* hopefully they can be local/remote such that we can exploit locality if possible but use globally for apps that require it
* full/empty bit for queue entries [again, not necessarily all memories, but some in every block accessible by the whole chip]
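The intended semantics of the operations in this list can be pinned down with a small single-threaded model of one 64-bit word. This is a reading of the wish list, not a specification: the class `AtomicWord`, its method names, the choice to return the old value from the compare operations, and the full/empty behaviour (a write succeeds only on an empty word, a read only on a full one) are all assumptions.

```python
import operator

BASIC_OPS = {
    "add": operator.add,
    "or": operator.or_,
    "xor": operator.xor,
    "and": operator.and_,
    "not": lambda old, _operand: ~old,  # unary: the operand is ignored
}


class AtomicWord:
    """Single-threaded model of one 64-bit word behind the memory controller."""

    MASK = (1 << 64) - 1

    def __init__(self, value=0):
        self.value = value & self.MASK
        self.full = False  # full/empty bit (assumed to start empty)

    def load(self):
        return self.value

    def store(self, value):
        self.value = value & self.MASK

    def compare_and_swap(self, expected, new):
        """Store `new` only if the word equals `expected`; return the old value."""
        old = self.value
        if old == expected:
            self.value = new & self.MASK
        return old

    def compare_and_add(self, expected, delta):
        """Add `delta` only if the word equals `expected`; return the old value."""
        old = self.value
        if old == expected:
            self.value = (old + delta) & self.MASK
        return old

    def apply_op(self, name, operand=0):
        """Basic op: update the word in place and return nothing."""
        self.value = BASIC_OPS[name](self.value, operand) & self.MASK

    def fetch_and_op(self, name, operand=0):
        """Fetch-and-op: update the word and return its previous value."""
        old = self.value
        self.apply_op(name, operand)
        return old

    def write_when_empty(self, value):
        """Assumed full/empty semantics: write succeeds only if the word is empty."""
        if self.full:
            return False
        self.store(value)
        self.full = True
        return True

    def read_when_full(self):
        """Assumed full/empty semantics: read succeeds only if the word is full."""
        if not self.full:
            return None
        self.full = False
        return self.value
```

In real hardware each method would be one indivisible operation at the memory controller; the model only fixes what value each operation returns and leaves behind.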

Future Work
• Historical perspective: the dataflow model of computation and related work on well-behavedness
• Can codelet programs become well-behaved?
• Should we also focus on well-behaved codelet programs?
• How to incorporate runtime resource constraints and achieve self-awareness under user-specified goals?

Acknowledgements
• Our Sponsors
• Intel UHPC Team
• CAPSL UHPC Team and friends
• ETI UHPC Team (Rishi et al.)
• Other Collaborators
• My Host