
Special Course on Computer Architecture #7: Simulation of Multi-Processors
Hiroki Matsutani and Hideharu Amano
June 3rd, 2011

Outline: Simulation of Multi-Processors
• Background
  – Recent multi-core and many-core processors
  – Network-on-Chip
• Shared-memory chip multi-processors
  – Architecture
  – Coherence protocols
• Simulation environment: GEMS/Simics
• Exercises [50 min]
  – Performance evaluation of parallel applications
  – Performance evaluation of coherence protocols

Multi- and many-core architectures
[Figure: number of PEs per chip (caches are not included; vertical axis from 2 to 256) versus year (2004 to 2011). Chips shown: picoChip PC102, picoChip PC205, ClearSpeed CSX600, ClearSpeed CSX700, Intel 80-core, TILERA TILE64, Intel SCC, MIT RAW, UT TRIPS (OPN), STI Cell BE, Sun T1, Sun T2, Fujitsu SPARC64, Intel Core, IBM Power7, AMD Opteron]

Network-on-Chip (NoC)
• Interconnection network to connect many cores
[Figure: 16-core tile architecture; each tile contains a core and a router]

On-chip router architecture
1) Selecting an output channel
2) Arbitration for the selected output channel
3) Sending the packet to the output channel
Routing, arbitration, and switch traversal are performed in a pipelined manner.
[Figure: router with input ports (X+, X-, Y+, Y-, CORE), a FIFO per input port, an arbiter issuing GRANT signals, a 5x5 crossbar, and output ports (X+, X-, Y+, Y-, CORE)]
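Step 1 (selecting an output channel) can be sketched as below. This is only an illustration that assumes dimension-order (XY) routing on a 2D mesh, which the slide does not specify; the port names mirror the figure labels, while `route_xy` and the coordinate convention are hypothetical.

```c
#include <assert.h>

/* Output ports of a 5-port mesh router, named after the figure labels. */
enum port { PORT_XPLUS, PORT_XMINUS, PORT_YPLUS, PORT_YMINUS, PORT_CORE };

/*
 * Dimension-order (XY) routing, assumed here for illustration:
 * correct the X coordinate first, then the Y coordinate, and finally
 * eject the packet to the local core when both coordinates match.
 */
enum port route_xy(int cur_x, int cur_y, int dst_x, int dst_y)
{
    assert(cur_x >= 0 && cur_y >= 0 && dst_x >= 0 && dst_y >= 0);
    if (dst_x > cur_x) return PORT_XPLUS;
    if (dst_x < cur_x) return PORT_XMINUS;
    if (dst_y > cur_y) return PORT_YPLUS;
    if (dst_y < cur_y) return PORT_YMINUS;
    return PORT_CORE;
}
```

Because the X dimension is always resolved before Y, every packet between the same pair of tiles takes the same path; arbitration (step 2) then decides which input port wins a contested output port.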

Today's target architecture
• Chip multi-processors (CMPs)
  – Multiple processors (each has a private L1 cache)
  – Shared L2 cache divided into multiple banks (SNUCA)
  – Processors and L2 cache banks are connected via a NoC
[Figure: processor tiles (UltraSPARC, L1 cache (I & D)) and cache tiles (L2 cache bank), each attached to an on-chip router]
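In a statically mapped banked L2 (SNUCA), the bank that holds a block is fixed by the block address. A minimal sketch of such a mapping follows; the 64-byte block size, the block-interleaved layout, and the name `l2_bank_of` are assumptions for illustration, not parameters taken from the simulated system.

```c
#include <assert.h>
#include <stdint.h>

#define BLOCK_BYTES 64u   /* assumed cache block size */

/*
 * Static mapping: consecutive cache blocks are interleaved across the
 * L2 banks, so the bank index depends only on the block address.
 */
unsigned l2_bank_of(uint64_t addr, unsigned num_banks)
{
    assert(num_banks > 0);
    return (unsigned)((addr / BLOCK_BYTES) % num_banks);
}
```

Since the mapping never changes, a processor tile can compute the destination cache tile of an L1 miss locally and send the request straight to it over the NoC.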

Cache coherence is maintained
• Write-back policy
  – A cache write updates the main memory only when the block is evicted
• Write-invalidate policy
  – A cache write invalidates all copies held by the other sharers
[Figure: processor tiles, cache tiles, and main memory]

Cache coherence is maintained
• A CPU wants to read a block cached at another tile (the current owner)
  – The CPU sends a read request to the memory controller
  – The controller forwards the request to the current owner
  – The owner sends the block to the requestor
[Figure: processor tiles, cache tiles, and main memory]
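The forwarding decision made by the controller can be sketched as follows. The directory entry layout, the `NO_OWNER` and `MEMORY_NODE` constants, and both function names are hypothetical; the sketch only captures who supplies the data and how many messages lie on the critical path.

```c
#include <assert.h>

#define NO_OWNER    (-1)   /* no cache owns the block; memory supplies it */
#define MEMORY_NODE (-2)   /* hypothetical id for the memory controller */

/* Per-block directory state kept at the memory controller (simplified). */
struct dir_entry {
    int owner;             /* id of the owning cache, or NO_OWNER */
};

/* Which node sends the block to the requestor. */
int read_supplier(const struct dir_entry *e)
{
    assert(e != 0);
    return e->owner == NO_OWNER ? MEMORY_NODE : e->owner;
}

/*
 * Messages on the critical path of a read:
 * request + data (2) when memory supplies the block,
 * request + forward + data (3) when an owner cache supplies it.
 */
int read_messages(const struct dir_entry *e)
{
    return read_supplier(e) == MEMORY_NODE ? 2 : 3;
}
```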

Cache coherence: MOESI protocol class
The status of each cache block is represented with M/O/E/S/I.
• Modified (M)
  – Modified (i.e., dirty)
  – Valid in one cache
• Owned (O)
  – May or may not be clean
  – Exists in multiple caches
  – Owned by one cache
• Exclusive (E)
  – Clean
  – Exists in one cache
• Shared (S)
  – Shared by multiple CPUs
• Invalid (I)
• Owner
  – Responsible for responding to any requests
• MOESI protocols
  – MSI, MOSI, MESI, MOESI, …
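The properties listed for each state can be written down as simple predicates. The enum and function names below are hypothetical; the rule that a write needs no coherence request when the cache holds the only copy (M or E) is the usual textbook behaviour and is stated here as an assumption.

```c
#include <assert.h>
#include <stdbool.h>

enum moesi { ST_M, ST_O, ST_E, ST_S, ST_I };

/* Every state except Invalid holds readable data. */
bool can_read(enum moesi s) { return s != ST_I; }

/* M and E exist in exactly one cache; O and S may coexist with copies. */
bool is_sole_copy(enum moesi s) { return s == ST_M || s == ST_E; }

/* M is dirty; O may or may not be clean; E is clean. */
bool may_be_dirty(enum moesi s) { return s == ST_M || s == ST_O; }

/* Assumed rule: a write needs no coherence request if no other copy exists. */
bool can_write_without_request(enum moesi s) { return is_sole_copy(s); }
```

The MSI, MOSI, and MESI variants simply omit some of these states, so the same predicates apply to the states that remain.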

Cache coherence protocols
• MSI/MOSI directory protocol
  – E state is not implemented
  – S-to-M transition always updates the main memory
• MESI directory protocol
  – O state is not implemented; dirty sharing is not allowed
  – M-to-S transition always updates the main memory
• MOESI directory protocol
• MOESI token protocol [Martin ISCA 03]
  – Each block has as many tokens as there are CPUs
  – A CPU that holds one or more tokens can read the block
  – A CPU that holds all tokens can modify (write) the block
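The token-counting rule of the token protocol can be sketched as follows. The struct and function names are hypothetical, and the sketch models only the permission check and token conservation, not the messages of the actual protocol.

```c
#include <assert.h>
#include <stdbool.h>

/* Tokens one cache holds for one block; total equals the number of CPUs. */
struct block_tokens {
    int held;
    int total;
};

/* At least one token grants read permission. */
bool token_can_read(struct block_tokens b) { return b.held >= 1; }

/* Holding every token grants write permission. */
bool token_can_write(struct block_tokens b) { return b.held == b.total; }

/* Move n tokens between caches; tokens are never created or destroyed. */
bool token_transfer(struct block_tokens *from, struct block_tokens *to, int n)
{
    assert(from != 0 && to != 0 && from->total == to->total);
    if (n <= 0 || n > from->held)
        return false;
    from->held -= n;
    to->held += n;
    return true;
}
```

Because a writer must hold all tokens, no other cache can hold a token (and therefore read the block) at the same time, which gives the single-writer, multiple-reader guarantee.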

MSI Protocol: State transition
[Figure: MSI state transition diagrams over states M, S, and I, one for processor-side events (CpuRd, CpuWr) and one for bus-side events (BusRd, BusWr), with Flush actions]
S-to-M transitions flush (update) the main memory.
Source: Y. Solihin, "Fundamentals of Parallel Computer Architecture" (2009).
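A state machine of this kind can be sketched as a transition function. The code below follows the common textbook formulation of MSI and reuses the event names that appear in the diagram; it is an assumption-laden illustration rather than a transcription of the figure, and `msi_next` is a hypothetical name. In this sketch the flush is issued by the cache that holds the block in M when it observes a remote request.

```c
#include <assert.h>
#include <stdbool.h>

enum msi { MSI_M, MSI_S, MSI_I };
enum event { CPU_RD, CPU_WR, BUS_RD, BUS_WR };

/*
 * Returns the next state of one cache block.
 * *flush is set when this cache must supply its dirty copy.
 */
enum msi msi_next(enum msi s, enum event e, bool *flush)
{
    assert(flush != 0);
    *flush = false;
    switch (e) {
    case CPU_RD:                       /* local read: a miss fetches a shared copy */
        return s == MSI_I ? MSI_S : s;
    case CPU_WR:                       /* local write: always ends in M */
        return MSI_M;
    case BUS_RD:                       /* remote read: a dirty copy is flushed and kept as S */
        if (s == MSI_M) {
            *flush = true;
            return MSI_S;
        }
        return s;
    case BUS_WR:                       /* remote write: the local copy is invalidated */
        if (s == MSI_M)
            *flush = true;
        return MSI_I;
    }
    return s;
}
```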

MESI Protocol: State transition
[Figure: MESI state transition diagrams over states M, E, S, and I, one for processor-side events (CpuRd, CpuWr) with the bus requests they generate (BusRd(C), BusRd(!C), BusWr, BusUpgr), and one for bus-side events (BusRd, BusWr, BusUpgr) with Flush and FlushOpt actions]
M-to-S transitions flush (update) the main memory.
Source: Y. Solihin, "Fundamentals of Parallel Computer Architecture" (2009).

MOESI Protocol: State transition (1/2)
[Figure: MOESI state transition diagram over states M, O, E, S, and I for processor-side events (CpuRd, CpuWr) with the bus requests they generate (BusRd(C), BusRd(!C), BusWr, BusUpgr)]
Source: Y. Solihin, "Fundamentals of Parallel Computer Architecture" (2009).

MOESI Protocol: State transition (2/2)
[Figure: MOESI state transition diagram over states M, O, E, S, and I for bus-side events (BusRd, BusWr, BusUpgr) with Flush and FlushOpt actions]
Source: Y. Solihin, "Fundamentals of Parallel Computer Architecture" (2009).

Full-system simulation: GEMS/Simics
• Wind River's Simics
  – Commercial detailed processor simulator
• Univ. of Wisconsin's GEMS
  – Cache, memory, and network module for Simics
[Figure: processor tiles (UltraSPARC, L1 cache (I & D)) and cache tiles (L2 cache bank) with on-chip routers, plus main memory]

Full-system simulation: GEMS/Simics
• Today's simulation target
  – Solaris 9 OS on eight UltraSPARC processors
  – Parallel application examples: Pi and Integer Sort
  – Various coherence protocols are supported
[Figure: processor tiles (UltraSPARC, L1 cache (I & D)) and cache tiles (L2 cache bank) with on-chip routers, plus main memory]

Full-system simulation: GEMS/Simics
• Simulation target
  – Solaris 9 OS on eight UltraSPARC processors
  – Parallel application example: Integer Sort (IS)
[Figure: Solaris 9 running on the 8-core UltraSPARC target; a parallel program is compiled and then executed on the 8 cores]

Parallel application example: OpenMP (hello from all threads)

    #include <stdio.h>
    #include <omp.h>

    int main() {
        /* Every thread in the team executes the printf once. */
        #pragma omp parallel
        printf("hello world from %d of %d\n",
               omp_get_thread_num(), omp_get_num_threads());
        return 0;
    }

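Parallel application example: OpenMP

    #include <stdio.h>
    #include <omp.h>

    #define N 1000000        /* array length; assumed, not given on the slide */
    static double A[N];      /* shared array; declaration assumed */

    int main() {
        int i;
        int num = 8;         /* thread count; assumed, not given on the slide */
        double start_time, end_time;

        start_time = omp_get_wtime();
        omp_set_num_threads(num);
        #pragma omp parallel shared(A) private(i)
        {
            /* Loop iterations are divided among the threads. */
            #pragma omp for
            for (i = 0; i < N; i++)
                A[i] = A[i] * A[i] - 3.0;
        }
        end_time = omp_get_wtime();
        printf("Elapsed time: %f sec\n", end_time - start_time);
        return 0;
    }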

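Parallel application example: OpenMP

    #include <stdio.h>
    #include <omp.h>

    #define N 100000000      /* number of loop iterations; assumed, not given on the slide */

    int main() {
        int i;
        double s = 0.0;
        double start_time, end_time;

        start_time = omp_get_wtime();
        #pragma omp parallel private(i) reduction(+: s)
        {
            /* Each thread sums its share of the terms; the partial sums are then added. */
            #pragma omp for
            for (i = 0; i < N; i++)
                s += (4.0 / (4.0 * i + 1) - 4.0 / (4.0 * i + 3));
        }
        printf("pi = %lf\n", s);
        end_time = omp_get_wtime();
        printf("Elapsed time: %f sec\n", end_time - start_time);
        return 0;
    }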
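The loop above sums the Leibniz series, pi = sum over i >= 0 of (4/(4i+1) - 4/(4i+3)), and the reduction clause lets each thread accumulate a private partial sum before the partial sums are added together. A serial sketch of that splitting follows; the function names are hypothetical, and the contiguous chunking is an assumption about how iterations might be divided, not a statement about any particular OpenMP runtime.

```c
#include <assert.h>

/* Partial sum of the series over iterations [lo, hi). */
double pi_chunk(long lo, long hi)
{
    double s = 0.0;
    long i;
    assert(0 <= lo && lo <= hi);
    for (i = lo; i < hi; i++)
        s += 4.0 / (4.0 * i + 1) - 4.0 / (4.0 * i + 3);
    return s;
}

/* Split n iterations into nthreads contiguous chunks and add the partial sums. */
double pi_by_chunks(long n, int nthreads)
{
    double s = 0.0;
    int t;
    assert(n >= 0 && nthreads > 0);
    for (t = 0; t < nthreads; t++) {
        long lo = n * t / nthreads;        /* first iteration of chunk t */
        long hi = n * (t + 1) / nthreads;  /* one past its last iteration */
        s += pi_chunk(lo, hi);
    }
    return s;
}
```

Calling `pi_by_chunks(n, 1)` and `pi_by_chunks(n, 8)` should agree up to floating-point rounding, since only the order of the additions changes; this gives a quick sanity check for the value printed by the parallel program.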

The first step: How to use the simulator
• Please pick up your account information
• Log in to one of the ICS cluster machines (id = 01…15)
    ssh -X <username>@cluster<id>.ics.keio.ac.jp
• Copy the sample scripts and configuration files
    cp -r ~matutani/comparch2011/files work
    cd work

The first step: How to use the simulator
• Start Simics
    ./start_ideal_memory.sh
• You can use the gray window as a console of the target system (i.e., Solaris 9 on 8-core UltraSPARCs).

The first step: How to use the simulator
• In the target machine, for example, you can check the number of processors as follows.
    bash-2.05# /usr/sbin/psrinfo -v
  You will see that there are eight processors.

Parallel application: "pi" calculation
• You can execute a "pi" calculation program using eight, four, and one threads.
    bash-2.05# export OMP_NUM_THREADS=8
    bash-2.05# ./pi
    bash-2.05# export OMP_NUM_THREADS=4
    bash-2.05# ./pi
    bash-2.05# export OMP_NUM_THREADS=1
    bash-2.05# ./pi

Parallel application: Integer Sort (IS)
• You can execute an Integer Sort (IS) program using eight, four, and one threads.
    bash-2.05# export OMP_NUM_THREADS=8
    bash-2.05# ./IS
    bash-2.05# export OMP_NUM_THREADS=4
    bash-2.05# ./IS
    bash-2.05# export OMP_NUM_THREADS=1
    bash-2.05# ./IS

Exercise 1
• Report the execution time of "pi" using 1, 4, 8, and 16 threads. Does the execution time decrease linearly as the number of threads increases? Discuss the results.
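One way to judge whether the execution time decreases linearly is to convert the measured times into speedup and parallel efficiency. The helpers below are a sketch with hypothetical names; the times passed to them must come from your own measurements.

```c
#include <assert.h>

/* Speedup: single-thread time divided by the time with p threads. */
double speedup(double t1, double tp)
{
    assert(t1 > 0.0 && tp > 0.0);
    return t1 / tp;
}

/* Parallel efficiency: speedup divided by the thread count (1.0 means linear scaling). */
double efficiency(double t1, double tp, int p)
{
    assert(p > 0);
    return speedup(t1, tp) / p;
}
```

An efficiency close to 1.0 indicates near-linear scaling. Note that the simulated target has eight processors, so the 16-thread run places more than one thread on each processor.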

Coherence protocols: Integer Sort (IS)
• The following scripts automatically run the IS program with different cache coherence protocols.
    ./start_moesi_directory.sh
    ./start_msi_mosi_directory.sh
    ./start_moesi_token.sh
• Each simulation takes five to ten minutes. Do not run more than one script at the same time!

Exercise 2
• Report the execution time of MSI/MOSI directory, MESI directory, MOESI directory, and MOESI token. Discuss the results. For more detail about the protocols, see the coherence protocol slides earlier in this deck.