Cache Coherence Support for Non-Shared Bus Architecture on Heterogeneous MPSoCs
Taeweon Suh §, Daehyun Kim †, and Hsien-Hsin S. Lee §
June 15, 2005
§ Georgia Institute of Technology, † Intel Corporation

MPSoCs
- Time-to-market
- Flexibility
- Low cost
  - Share the memory interface to reduce pin count
  - However, a shared-bus architecture hinders the versatility provided by each processor
  - Non-shared bus architecture
- Real-time property
  - Communication between processors
[Figure: example MPSoC with IP blocks, memory, an ADC, two uPs, a DSP, a wireless IP, and a memory controller connected to SDRAM]

Introduction
- Cache coherence
  - Well-known technique for keeping data consistent in multiprocessor systems
- Protocol states (MOESI): Modified, Exclusive, Owned, Shared, Invalid
- Example operation sequence
  - P0: read
  - P1: write (abcd)
  - P0: read
[Figure: data caches (D$, MOESI) of P0 and P1 above a memory holding 1234; the animation steps through the state changes, annotated "shared", "invalidate", and "cache-to-cache", as P1 writes abcd and P0 reads it back]
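The example sequence on this slide can be replayed with a small model. The following is a minimal sketch, assuming a single cache line, two caches, and a textbook MOESI policy; the class and function names (`Cache`, `Bus`, `run_slide_sequence`) are hypothetical and not taken from the talk.

```python
from enum import Enum


class State(Enum):
    M = "Modified"
    O = "Owned"
    E = "Exclusive"
    S = "Shared"
    I = "Invalid"


class Cache:
    def __init__(self, name):
        self.name = name
        self.state = State.I
        self.data = None


class Bus:
    """Snooping caches plus one memory word (a single cache line)."""

    def __init__(self, memory_value):
        self.memory = memory_value
        self.caches = {}

    def attach(self, cache):
        self.caches[cache.name] = cache

    def _others(self, cache):
        return [c for c in self.caches.values() if c is not cache]

    def read(self, name):
        cache = self.caches[name]
        if cache.state is not State.I:
            return cache.data  # read hit: no bus transaction
        holders = [c for c in self._others(cache) if c.state is not State.I]
        owner = next((c for c in holders if c.state in (State.M, State.O)), None)
        if owner is not None:
            # Dirty line is supplied cache-to-cache; memory is NOT updated.
            owner.state = State.O
            cache.data = owner.data
        else:
            cache.data = self.memory
        for c in holders:
            if c.state is State.E:
                c.state = State.S  # a clean exclusive copy becomes shared
        cache.state = State.S if holders else State.E
        return cache.data

    def write(self, name, value):
        cache = self.caches[name]
        if cache.state not in (State.M, State.E):
            # Invalidate every other copy before taking ownership.
            for c in self._others(cache):
                c.state, c.data = State.I, None
        # Simplification: the write covers the whole line, so no fill is modeled.
        cache.state, cache.data = State.M, value


def run_slide_sequence():
    """Replay P0: read, P1: write (abcd), P0: read; return states per step."""
    bus = Bus(1234)
    p0, p1 = Cache("P0"), Cache("P1")
    bus.attach(p0)
    bus.attach(p1)
    trace = []
    bus.read("P0")
    trace.append((p0.state.name, p1.state.name))
    bus.write("P1", "abcd")
    trace.append((p0.state.name, p1.state.name))
    value = bus.read("P0")
    trace.append((p0.state.name, p1.state.name))
    return trace, value, bus.memory
```

In this sketch the final read is served cache-to-cache, so P1 ends in the Owned state while memory still holds the old value; that is the property the O state adds over MESI.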

Previous Work
- Integration techniques for shared-bus based platforms [1][2][3]
  - Read-to-write conversion
  - Shared-signal assertion
[Figure: shared-bus platforms in which processors with different protocols (MEI, MESI, MSI) attach to the bus through Wrapper 0 and Wrapper 1; panels show a read converted to a write, a combined MEI/MESI/MSI platform, and a single-cache-line snoop-hit buffer in the memory controller that holds a write-back on its way to memory]
[1] Taeweon Suh, Douglas M. Blough, and Hsien-Hsin S. Lee, "Supporting cache coherence in heterogeneous multiprocessor systems," in DATE '04, Feb. 2004.
[2] Taeweon Suh, Hsien-Hsin S. Lee, and Douglas M. Blough, "Integrating cache coherence protocols for heterogeneous multiprocessor systems, Part 1," IEEE Micro, July/August 2004.
[3] Taeweon Suh, Hsien-Hsin S. Lee, and Douglas M. Blough, "Integrating cache coherence protocols for heterogeneous multiprocessor systems, Part 2," IEEE Micro, September/October 2004.

Proposal
- Cache Coherence-enforced Memory Controller (ccMC) for non-shared bus based MPSoCs
  - Bypass approach
  - Bookkeeping approach
- Integration of invalidation-based protocols such as MEI, MSI, MESI, and MOESI
[Figure: MPSoC in which Proc 0 (MESI) on Bus 0 and Proc 1 (MEI) on Bus 1 both connect to memory through the ccMC]

Bypass Approach
- Blindly pass a bus transaction if its address is in the shared range
- Very inexpensive in terms of silicon area
[Figure: ccMC internals with a snoop-hit buffer, a bus-request mux (inputs 0 and 1), and a comparator that checks the address from Bus 0 or Bus 1 against Start_addr_reg and Range_reg; Proc 0 (MESI) sits on Bus 0 and Proc 1 (MEI) on Bus 1, with memory behind the ccMC]
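The comparator logic can be sketched in a few lines. This is an illustration only: it assumes that Range_reg holds the size of the shared region in bytes and that "passing" a transaction means presenting it on the other processor's bus; `SharedRange` and `bypass_forward` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SharedRange:
    start_addr: int  # models Start_addr_reg
    length: int      # models Range_reg (assumed: region size in bytes)

    def contains(self, addr: int) -> bool:
        return self.start_addr <= addr < self.start_addr + self.length


def bypass_forward(shared: SharedRange, source_bus: int, addr: int):
    """Return the bus (0 or 1) that must see the transaction, or None.

    Bypass keeps no per-line state: every transaction whose address
    falls inside the shared range is passed to the other bus.
    """
    if shared.contains(addr):
        return 1 - source_bus
    return None
```

For example, with `SharedRange(0x8000_0000, 0x1000)`, a request from Bus 0 to `0x8000_0010` is passed to Bus 1, while a request to `0x8000_1000` is not passed at all.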

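Bookkeeping Approach
- Selectively pass a bus transaction if its address is in the shared range
- Expensive compared to the bypass approach
[Figure: ccMC internals with a snoop-hit buffer, Start_addr_reg, Range_reg, and a state table that records the state of each cache line for P0 and P1 (for example I/I, S/S, M/I); a bus request is passed to the other bus only if the address is inside the shared range and the recorded state requires it (labeled "if M"); Proc 0 (MESI) sits on Bus 0 and Proc 1 (MEI) on Bus 1]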
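One way to picture the selective decision is a per-line state table consulted on every request. The sketch below is an assumption-laden illustration: the slide only shows the "inside shared range" and "if M" conditions, so the rule used here for writes, the 32-byte line size, and the names `Bookkeeper` and `must_forward` are all hypothetical.

```python
from dataclasses import dataclass, field

LINE_SIZE = 32  # bytes per cache line (assumed value)


@dataclass
class Bookkeeper:
    start_addr: int  # models Start_addr_reg
    length: int      # models Range_reg (assumed: region size in bytes)
    # line index -> {processor id: state letter}; a missing entry means "I"
    table: dict = field(default_factory=dict)

    def _line(self, addr):
        return (addr - self.start_addr) // LINE_SIZE

    def in_shared_range(self, addr):
        return self.start_addr <= addr < self.start_addr + self.length

    def state(self, proc, addr):
        return self.table.get(self._line(addr), {}).get(proc, "I")

    def record(self, proc, addr, state):
        self.table.setdefault(self._line(addr), {})[proc] = state

    def must_forward(self, requester, addr, is_write):
        """Decide whether the request must be shown to the other bus."""
        if not self.in_shared_range(addr):
            return False
        other = self.state(1 - requester, addr)
        if is_write:
            return other != "I"  # assumed rule: a remote copy must be invalidated
        return other == "M"      # reads need a snoop only for a dirty remote copy
```

Compared with the bypass approach, the table lets the controller keep traffic off the other bus when the other processor holds no copy, at the cost of the storage for the table itself.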

Example
- Bookkeeping approach
- Example operation sequence
  - P1: read
  - P1: write (abcd)
  - P0: read
[Figure: Proc 0 (MSI) on Bus 0 and Proc 1 (MESI) on Bus 1 with the ccMC state table for P0 and P1; memory initially holds 1234 and ends with abcd; the animation is annotated with "Breq", "invalidate", and "shared" as the recorded states change]

Integration with no-coherence support processor
- Processors without coherence support behave as if they used MEI without snooping, so the integrated protocol is MEI-like
- An interrupt is used to inform the processor of possible snoop hits
[Figure: MPSoC with Proc 0 (MESI) on Bus 0 and Proc 1 (no hardware support) on Bus 1; the ccMC has an IRQ line to Proc 1 and sits in front of memory]

Simulation Model
- Atalanta [4] RTOS
  - Home-grown RTOS at Georgia Tech
  - Designed for heterogeneous multiprocessor SoCs
- Atalanta kernel simulation
  - Task insertion/deletion
  - Tasks are managed in TCBs (Task Control Blocks)
  - TCBs are connected through a doubly-linked list
  - Each processor's TCBs are accessible by the other processor
  - When a system object such as a semaphore becomes ready, the highest-priority TCB waiting for it is updated
[4] Di-Shi Sun, Douglas M. Blough, and Vincent J. Mooney, "A New Multiprocessor RTOS Kernel for System-on-a-Chip Applications," Technical Report GIT-CC-02-09, CERCS.
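The kind of shared data structure this workload exercises can be sketched as follows. This is not Atalanta's code or API: the names `TCB`, `TCBList`, and `release`, the field layout, and the convention that a lower number means higher priority are all assumptions made for illustration.

```python
class TCB:
    def __init__(self, task_id, priority):
        self.task_id = task_id
        self.priority = priority  # assumed: lower value = higher priority
        self.waiting_on = None    # name of a system object, or None
        self.prev = None
        self.next = None


class TCBList:
    """Doubly-linked list of TCBs kept sorted by priority."""

    def __init__(self):
        self.head = None

    def insert(self, tcb):
        # Walk past every TCB of equal or higher priority (stable for ties).
        prev, cur = None, self.head
        while cur is not None and cur.priority <= tcb.priority:
            prev, cur = cur, cur.next
        tcb.prev, tcb.next = prev, cur
        if prev is None:
            self.head = tcb
        else:
            prev.next = tcb
        if cur is not None:
            cur.prev = tcb

    def delete(self, tcb):
        # Unlink a TCB that is currently in the list.
        if tcb.prev is None:
            self.head = tcb.next
        else:
            tcb.prev.next = tcb.next
        if tcb.next is not None:
            tcb.next.prev = tcb.prev
        tcb.prev = tcb.next = None

    def release(self, obj):
        """Update the highest-priority TCB waiting on obj; return it or None."""
        cur = self.head
        while cur is not None:
            if cur.waiting_on == obj:
                cur.waiting_on = None
                return cur
            cur = cur.next
        return None

    def ids(self):
        out, cur = [], self.head
        while cur is not None:
            out.append(cur.task_id)
            cur = cur.next
        return out
```

Every insert, delete, and release touches the `prev`/`next` pointers of neighboring TCBs, so when both processors operate on the same list these cache lines move back and forth between the caches, which is what makes the kernel a useful coherence workload.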

Simulation Environment
- Processors
  - Platform 1: PPC755 (MEI) + ARM9 with MESI
  - Platform 2: ARM9 with MSI + ARM9 with MESI
- Simulators: Seamless CVE + ModelSim
[Figure: Proc 0 and DMA 0 on Bus 0, Proc 1 and DMA 1 on Bus 1, both buses connected to memory through the ccMC; the system also includes a 320x240 LCD controller and 100 Mbps Ethernet]

Simulation Results
- Bypass approach: 2 tasks on each processor

Simulation Results
- Bypass approach: 32 tasks on each processor

Simulation Results
- Bookkeeping approach
  - Platform 2, miss penalty of 14 cycles
  - Microbenchmark simulation

Conclusions
- Proposed integration techniques for cache coherence on non-shared bus based MPSoCs
  - Bypass approach, bookkeeping approach
- Bypass approach
  - Blindly passes shared-memory operations
  - Very cheap in terms of silicon area
- Bookkeeping approach
  - Selectively passes shared-memory operations
  - Expensive compared to the bypass approach
- Effective solutions for communication as more and more heterogeneous processors are integrated on a single chip

Questions, Comments? Thanks for your attention!

Backup Slides

Motivation
- Embedded systems increasingly require heterogeneous processors on a chip, according to application needs
- Efficient communication is imperative to meet the real-time requirements of embedded applications
- Shared-bus architectures using AMBA or CoreConnect compromise the versatility provided by each processor
- Pin count restricts the use of a dedicated memory interface for each processor on SoCs
  - Commercial MPSoCs such as TI's OMAP and Philips' Nexperia employ a non-shared bus architecture that shares the memory interface (check Nexperia)

Bookkeeping Approach (cont'd)
- Problem with E-state
- Example operation sequence
  - P1: read
  - P1: write
  - P0: read
[Figure: Proc 0 (MSI) on Bus 0 and Proc 1 (MESI) on Bus 1 with the ccMC state table for P0 and P1; the table records E for the line while Proc 1 moves on to M with abcd, and memory still holds 1234]

Bookkeeping Approach (cont'd)
- Solution: prohibit E-state (shared-signal assertion)
- Example operation sequence
  - P1: read
  - P1: write
  - P0: read
[Figure: Proc 0 (MSI) on Bus 0 and Proc 1 (MESI) on Bus 1 with the ccMC state table for P0 and P1; with the line loaded as S, the write appears on the bus as an invalidate, the table records M, and the later read by P0 triggers a bus request (Breq) so that memory ends with abcd]
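The E-state problem and its fix can be replayed with a toy model of one cache line. This is a sketch under stated assumptions: the ccMC is assumed to snoop the other bus on a read only when its table records M, an E to M upgrade is assumed to generate no bus transaction, and the function name `run` and its data layout are hypothetical.

```python
def run(prohibit_e):
    """Replay P1: read, P1: write, P0: read on one shared cache line.

    Returns (value read by P0, ccMC table, memory contents).
    """
    memory = 1234
    cache = {0: ["I", None], 1: ["I", None]}  # actual [state, data] per processor
    table = {0: "I", 1: "I"}                  # what the ccMC believes

    # P1: read (miss). A MESI cache with no other sharer loads the line as E,
    # unless the ccMC asserts the shared signal and forces S.
    fill = "S" if prohibit_e else "E"
    cache[1] = [fill, memory]
    table[1] = fill

    # P1: write (abcd).
    if cache[1][0] == "S":
        # S -> M issues an invalidate on Bus 1, so the ccMC observes it.
        table[1] = "M"
    # E -> M is silent: no bus transaction, so the table is left stale.
    cache[1] = ["M", "abcd"]

    # P0: read (miss on Bus 0). The ccMC snoops Bus 1 only if it believes M.
    if table[1] == "M":
        memory = cache[1][1]  # P1 writes the dirty line back
        cache[1][0] = "S"
        table[1] = "S"
    cache[0] = ["S", memory]
    table[0] = "S"
    return cache[0][1], table, memory
```

Calling `run(False)` leaves P0 with the stale value 1234 because the table still records E for P1, whereas `run(True)` returns abcd with both table entries in S, which mirrors the two slides above.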

Previous Work (cont'd)
- Snoop-hit buffer [2][3]
- Region-Based Cache Coherence (RBCC) [2][3]
[Figure: (snoop-hit buffer) Proc 0 (MEI) and Proc 1 (MESI) behind Wrapper 0 and Wrapper 1 on a shared bus, with a single-cache-line snoop-hit buffer in the memory controller that captures a write-back on its way to memory and serves the following read; (RBCC) processors behind Wrapper 0, Wrapper 1, and Wrapper 2 running MESI, MESI, and MEI, with memory regions labeled MESI and MEI]