Lecture 17: Embedded Multiprocessor Memory
Embedded Computing Systems
Mikko Lipasti, adapted from M. Schulte
Based on slides and textbook from Wayne Wolf, High Performance Embedded Computing, © 2007 Elsevier
Topics
- Parallel memory systems
- Models for memory
- Heterogeneous memory systems
- Coherent parallel memory systems
- ARM MPCore multiprocessor
© 2006 Elsevier
Parallel memory systems
- n memory banks can be accessed independently.
- Peak access rate given by n parallel accesses.
- If p is the probability of a non-sequential access:
  - Probability of a run of k sequential accesses is p(1-p)^(k-1)
  - Mean length of a sequential run is 1/p
[Figure: Bank 0 through Bank 3 connected to shared address and data buses]
Memory system design
- Design parallel memory systems using previous memory component models (Ch. 2)
- Parameters:
  - Area – size of the component
  - Performance – access time of the component; may differ for reads vs. writes, page mode, etc.
  - Energy per access – may also differ
- Delay is a nonlinear function of memory size.
  - Bit line delays can dominate access time.
- Delay is a nonlinear function of the number of ports.
Heterogeneous memory systems
- Heterogeneous memory improves real-time performance:
  - Accesses to the same bank interfere, even if not to the same location.
  - Segregating real-time locations improves predictability and reduces access time variance.
- Heterogeneous memory improves power:
  - Smaller blocks with fewer ports consume less energy.
- What are the disadvantages of heterogeneous memory systems?
Memory system design methodology [Dut 98] © 1998 IEEE
Motion Estimation Architecture [Dut 98] © 1998 IEEE
Memory Partitioning and Delay [Dut 98] © 1998 IEEE
Critical Sections and Locks [Akg 02]
- Critical section
  - Code section where shared data is accessed
  - Lock helps guarantee the consistency of shared data (e.g., global variables)
- Lock delay
  - Time between release and acquisition of a lock
- Latency
  - Time to acquire a lock when there is no contention
- Approach
  - Provide an SoC lock cache
SoC Lock Cache Mechanism
- Locks for shared critical sections are stored in a dedicated lock cache
- Locks appear in the processors' address space
  - Access using load/store instructions
SoC Lock Cache Features
- Simple hardware mechanism: SoCLC
- No modifications/extensions to the processor core or to caches
- No special instructions or atomic primitives
- Can be integrated as an intellectual property (IP) block into the SoC
- Hardware interrupt-triggered notification
SW Only vs. HW/SW Locks
SoC Lock Cache Hardware
Short vs. Long Critical Sections
- A short critical section has a relatively short time between lock acquisition and release
  - For example, less than 1,000 cycles
  - Don't switch to another task while waiting for the lock
  - Locks are associated with PEs
- A long critical section has a relatively long time between lock acquisition and release
  - For example, more than 1,000 cycles
  - Locks are associated with tasks on PEs
  - More hardware is required to track tasks
SoC Lock Cache Interrupts
SoC Lock Cache Results
- Area is less than 0.1% of the full SoC design
Coherent parallel memory systems
- Caches need to be coherent.
  - Cache snooping is a common approach
  - When data is accessed from memory, look for it in other caches.
Application-Aware Snoop Filtering [Zho 08]
- In embedded systems, the designer may know which memory is shared between tasks
- Snooping is enabled only for accesses referring to known shared regions.
  - Reduces power consumption due to snooping
- Identify shared memory regions for each task
  - Provide this info to the operating system and cache snoop controller for runtime use
- Focus on write-back caches with a write-invalidate protocol
Snoop Filtering Architecture
- Snoop filter determines whether the D-cache should actually be snooped.
Shared Memory Identification
- With no virtual memory:
  - Utilize the Shared Address Segments (SAS) mechanism
  - Programmer identifies shared structures
  - Compiler controls placement of data
    - Aligns data on a 2^m address boundary
    - Identify segment using a SegID (MSBs of the address)
- What if arrays are not of size 2^m?
SAS Snoop Filtering Hardware
- For each shared segment to be supported:
  - SegDim indicates the size of the segment (bit mask)
  - SegID indicates the start of the segment
  - Compare SegID with the address MSBs
- What if you have more shared segments than hardware for identifying them?
Snoop Filtering Results
- Snoop activities are reported for direct-mapped and 4-way caches for write-invalidate and write-update mechanisms
- Snoop activity is reduced by 51% to 98%
Virtual memory and snoop filtering
- Recent embedded processors provide virtual memory support through MMUs
  - Translate virtual address (VPN + offset) to physical address (PPN + offset)
  - Provides transparent memory allocation, isolation, and protection for tasks
  - Requires a page table (PT) and translation lookaside buffer (TLB) to translate the VPN to the PPN
  - Programmer and compiler no longer know the physical address
- A different technique is needed for snoop filtering
Shared Memory Identification
- With virtual memory:
  - Utilize the Shared Page Set (SPS) mechanism
  - Programmer identifies shared structures
    - Provides each array's starting address and size
    - Identifies which threads use which structures
  - Operating system assigns a RegID
    - Stores this information in the page table (PT) and translation lookaside buffer (TLB)
SPS Snoop Filtering Hardware
- The PT and TLB are augmented with the RegID for each page
- The information on shared regions for each task is loaded by the operating system
  - Implemented using a bit mask register with one bit for each shared region
  - For example, 01010100 indicates a task uses shared regions 2, 4, and 6
- On a cache miss, the RegID is transmitted along the data bus
- Filtering hardware at each node checks whether the current task has shared data in the RegID region
SPS Snoop Filtering Hardware
Snoop Filtering Energy Results
- Snoop energies are reported for direct-mapped and 4-way caches for write-invalidate (WI) and write-update (WU) mechanisms
- WI requires much less energy than WU
- Snoop energy is reduced by 47% to 93%
- SPS is only used with WI
ARM11 MPCore
ARM11 MPCore Features
- Up to 4 CPUs implementing ARMv6
- Snoop Control Unit for cache coherency
- Distributed Interrupt Controller
- Private timer and private watchdog for each CPU
- AXI high-speed Advanced Microcontroller Bus Architecture (AMBA) L2 memory interfaces
- Flexible configuration during synthesis
ARM11 MPCore Pipeline Stages
- Stage 1: 1st Fetch Stage (Fe1)
- Stage 2: 2nd Fetch Stage (Fe2)
- Stage 3: Instruction Decode (De)
- Stage 4: Register read and issue (Iss)
- Stages 5-7, ALU path: Shifter Stage (Sh), ALU Operation (ALU), Saturation Stage (Sat)
- Stages 5-7, multiply path: 1st, 2nd, and 3rd Multiply-Accumulate Stages (MAC1, MAC2, MAC3)
- Stages 5-7, load/store path: Address Generation (ADD), Data Cache 1 (DC1), Data Cache 2 (DC2)
- Stage 8: Write back from Mul/ALU (WBex); Write back from LSU (WBls)
ARM11 MPCore Caches
- Instruction and data caches, including a non-blocking data cache with Hit-Under-Miss (HUM)
- Data cache is physically indexed, physically tagged, write-back, write-allocate only
- Instruction cache is virtually indexed, physically tagged
- 32-bit interface to the instruction cache and 64-bit interface to the data cache
- Hardware support for data cache coherency
- The instruction and data caches can be independently configured during synthesis to sizes between 16 KB and 64 KB
ARM11 MPCore Caches
- Both caches are 4-way set-associative
- Cache line replacement policy is round-robin
- The cache line length is eight 32-bit words
- Both data cache read misses and write misses are non-blocking
  - Up to three outstanding data cache read misses and up to four outstanding data cache write misses are supported
- Support is provided for streaming of sequential data with LDM operations, and for sequential instruction fetches
- On a cache miss, critical-word-first filling of the cache is performed
Coherency protocol – MESI
- MESI is a write-invalidate protocol
  - Writing to a shared location invalidates corresponding lines in other L1 caches
  - Cache lines can be in one of four states
- Modified: The cache line is present only in the current cache, and it is dirty. It has been modified from the value in main memory.
- Exclusive: The cache line is present only in the current cache, and is clean. It matches the main memory value.
- Shared: The cache line is present in more than one CPU's cache and is clean. It matches the main memory value.
- Invalid: This coherent cache line is not present in the cache.
L1 Data Memory

L1 Instruction Memory
Level 2 Memory - AXI
- MPCore Level 2
  - The ARM11 MPCore processor Level 2 interface consists, by default, of two 64-bit wide AXI bus masters
- Supported AXI transfers
  - Coherent and non-coherent write-back write-allocate
  - Coherent non-cacheable
- AXI transaction IDs
  - Arbitration for transaction ordering on the AXI masters is round-robin among the requesting MP11 CPUs