FlexRAM: Toward an Advanced Intelligent Memory System

  • Slides: 34
FlexRAM: Toward an Advanced Intelligent Memory System
Josep Torrellas, University of Illinois
http://iacoma.cs.uiuc.edu
torrellas@cs.uiuc.edu

People Involved
  • Students: Michael Huang, Joe Renau, Seung Yoo, Jaejin Lee
  • Other faculty: David Padua, H. V. Jagadish, Daniel Reed

Technological Landscape
Merged Logic and DRAM (MLD):
  • IBM, Mitsubishi, Samsung, Toshiba and others
  • Powerful: e.g. IBM SA-27E ASIC (Feb 99)
  • 0.18 um (chips for 1 Gbit DRAM)
  • Logic frequency: 400 MHz
  • IBM PowerPC 603 proc + 16 KB I, D caches = 3%
  • Further advances on the horizon
Opportunity: How to exploit MLD best?

Terminology
Processor In Memory (PIM) = Intelligent Memory or Intelligent RAM (IRAM)

Key Applications Benefit from HW
  • Data Mining (decision trees and neural networks)
  • Computational Biology (protein sequence matching)
  • Financial Modeling (stock options, derivatives)
  • Molecular Dynamics (short-range forces)
  • Multimedia (MPEG-2)
  • Decision Support Systems (TPC-D)
  • Speech Recognition
All these are Data Intensive Applications

Example App: DNA Matching
  • Problem: Find areas of database DNA chains that match (modulo some mutations) the sample DNA chains

How the Algorithm Works
  • Pick 4 consecutive amino acids from sample
  • Generate 50+ most-likely mutations

Example App: DNA Matching
  • Compare them to every position in the database DNAs
  • If a match is found: try to extend it
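The three steps above (pick a 4-letter seed, generate likely mutations, scan every database position and extend hits) can be sketched as follows. This is a minimal illustration, not the authors' code: the function names are hypothetical, "most-likely mutations" is approximated by all single-letter substitutions of the seed, and extension only goes rightward while letters agree exactly.

```python
# Sketch of seed-and-extend matching. All names here are illustrative
# assumptions; the slides do not specify the mutation model or the
# extension rule.

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acid codes


def seed_variants(seed):
    """Return the seed plus every single-substitution variant.

    Assumption: single substitutions stand in for the slide's
    "50+ most-likely mutations".
    """
    variants = {seed}
    for i in range(len(seed)):
        for aa in AMINO_ACIDS:
            variants.add(seed[:i] + aa + seed[i + 1:])
    return variants


def extend(sample, s_pos, db, d_pos, k):
    """Extend a seed hit to the right while letters agree; return its length."""
    length = k
    while (s_pos + length < len(sample)
           and d_pos + length < len(db)
           and sample[s_pos + length] == db[d_pos + length]):
        length += 1
    return length


def find_matches(sample, db, k=4):
    """Compare each seed's variants to every database position, then extend."""
    hits = []
    for s_pos in range(len(sample) - k + 1):
        variants = seed_variants(sample[s_pos:s_pos + k])
        for d_pos in range(len(db) - k + 1):  # every position in the database
            if db[d_pos:d_pos + k] in variants:
                hits.append((s_pos, d_pos, extend(sample, s_pos, db, d_pos, k)))
    return hits
```

The inner loop over database positions touches every element once with little reuse, which is the kind of data-intensive scan the talk targets.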

How to Use MLD
1. Main compute engine of the machine
  • Add proc to DRAM chip
  • Include a vector processor or multiple processors
Incremental gains. Hard to program.
UC Berkeley: IRAM; Notre Dame: Execube, Petaflops; MIT: Raw; Stanford: Smart Memories

How to Use MLD (II)
2. Co-processor, special-purpose processor
  • ATM switch controller
  • Process data beside the disk
  • Graphics accelerator
Stanford: Imagine; UC Berkeley: ISTORE

How to Use MLD (III)
3. Our approach: take the place of memory chips in a workstation or server
  • PIM chip processes the memory-intensive parts of the program
Illinois: FlexRAM; UC Davis: Active Pages; USC-ISI: DIVA

Our Solution: Principles
  • Extract high bandwidth from DRAM:
      • Many simple processing units
  • Run legacy codes with high performance:
      • Do not replace off-the-shelf µP in workstation
      • Take place of memory chip. Same interface as DRAM
      • Intelligent memory defaults to plain DRAM
  • Small increase in cost over DRAM:
      • Simple processing units, still dense
  • General purpose:
      • Do not hardwire any algorithm. No special purpose

Architecture Proposed

The FlexRAM Memory System
Can exploit multiple levels of parallelism
For a high-end workstation:
  • 1 P.Host processor (e.g. Merced, IBM GP)
  • 100's of P.Mems in memory (e.g. IBM PowerPC 603)
  • 100,000's of very simple P.Arrays in memory

Chip Organization

Memory in one FlexRAM Chip
  • 64 Mbytes of DRAM organized as 16M x 32 bits
  • Organized in 64 1-Mbyte banks
  • Each bank:
      • Associated to 1 P.Array
      • 1 single port
      • 2 2-Kbyte row buffers (no P.Array cache)
      • P.Array access to memory: 10 ns (row hit) or 20 ns (miss)
  • On-chip memory bandwidth: 102 Gbytes/second
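The capacity figures on this slide can be checked with a few lines of arithmetic. The bandwidth line is only a plausibility check: the slides do not say how 102 Gbytes/second was derived, so the sketch assumes each of the 64 P.Arrays consumes one 32-bit word per logic cycle at the 400 MHz frequency quoted earlier in the talk.

```python
# Sanity check of the per-chip memory numbers. Variable names are
# illustrative; the bandwidth model is an assumption, not from the slides.

MBYTE = 2**20

# 16M words of 32 bits each
capacity_bytes = (16 * 2**20) * 32 // 8

# 64 banks of 1 Mbyte each must give the same capacity
banks = 64
bank_bytes = 1 * MBYTE
capacity_from_banks = banks * bank_bytes

# Assumed bandwidth model: one 4-byte word per P.Array per 2.5 ns cycle
logic_hz = 400_000_000
bytes_per_access = 4
bandwidth_gb_s = banks * bytes_per_access * logic_hz / 1e9

print(capacity_bytes // MBYTE, "Mbytes")
print(bandwidth_gb_s, "Gbytes/second under the assumed model")
```

Under that assumption the result is 102.4 Gbytes/second, consistent with the quoted figure.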

Memory in one FlexRAM Chip
Group of 4 P.Arrays share one 8-Kbyte, 4-ported SRAM instruction memory
  • Holds the P.Array code
  • Small because short code
  • Aggressive access time: 1 cycle = 2.5 ns

P.Array
  • 64 P.Arrays per chip. Not SIMD but SPMD
  • 32-bit integer arithmetic; 16 registers
  • No caches, no floating point
  • 4 P.Arrays share one multiplier
  • 28 different 16-bit instructions
  • Can access own 1 Mbyte of DRAM plus DRAM of left and right neighbors. Connection forms a ring
  • Broadcast and notify primitives: Barrier
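The neighbor-access rule can be illustrated with a small sketch. It assumes P.Arrays and banks are numbered 0 to 63 around the ring, which the slides do not state; the function names are hypothetical.

```python
# Which DRAM banks a given P.Array may touch, assuming ring numbering 0..63.

NUM_P_ARRAYS = 64  # P.Arrays (and banks) per chip, from the slide


def accessible_banks(p_array_id, n=NUM_P_ARRAYS):
    """Return the set of banks reachable by one P.Array: own, left, right."""
    left = (p_array_id - 1) % n   # wraps from 0 to n - 1: the ring closes
    right = (p_array_id + 1) % n  # wraps from n - 1 to 0
    return {left, p_array_id, right}


def can_access(p_array_id, bank_id, n=NUM_P_ARRAYS):
    """True if the P.Array may read or write the given bank directly."""
    return bank_id in accessible_banks(p_array_id, n)
```

For example, `accessible_banks(0)` is `{63, 0, 1}`: P.Array 0 reaches the last bank because the connection wraps around.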

P.Mem
  • 2-issue static superscalar like IBM PowerPC 603
  • 16-Kbyte I, D caches
  • Executes serial sections
  • Communication with P.Arrays:
      • Broadcast/notify or plain write/read to memory
  • Communication with other P.Mems:
      • Memory in all chips is visible
      • Access via the inter-chip network
      • Must flush caches to ensure data coherence

Issues
Communication P.Mem-P.Host:
  • P.Mem cannot be the master of bus
  • P.Host starts P.Mems by writing register in Rambus interface
  • P.Host polls a register in Rambus interface of master P.Mem
  • If P.Mem not finished: memory controller retries. Retries are invisible to P.Host
Virtual memory:
  • P.Mems and P.Arrays use virtual memory
  • They share a range of virtual addresses with P.Host
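The start-and-poll handshake between P.Host and P.Mem described above can be modeled in a few lines. This is a toy simulation under stated assumptions: the class names, the cycle-counting notion of "work", and the register layout are all hypothetical, since the slides only say that the host writes one register, polls another, and never sees the controller's retries.

```python
# Toy model of the P.Host / P.Mem handshake. Everything here is an
# illustrative assumption, not the actual FlexRAM interface.


class RambusInterface:
    """Hypothetical start and status registers in a chip's Rambus interface."""

    def __init__(self, work_cycles):
        self.started = False
        self.remaining = work_cycles  # cycles of P.Mem work still to do

    def write_start(self):
        """P.Host writes the start register; P.Mem begins working."""
        self.started = True

    def tick(self):
        """Advance the P.Mem by one cycle of work."""
        if self.started and self.remaining > 0:
            self.remaining -= 1

    def finished(self):
        return self.started and self.remaining == 0


class MemoryController:
    """Serves the host's poll and hides retries from it."""

    def __init__(self, interface):
        self.interface = interface
        self.retries = 0

    def read_status(self):
        """Return only once the P.Mem is done; retry internally until then."""
        if not self.interface.started:
            raise RuntimeError("poll issued before the P.Mem was started")
        while not self.interface.finished():
            self.retries += 1
            self.interface.tick()  # time passes while the controller retries
        return True


iface = RambusInterface(work_cycles=3)
ctrl = MemoryController(iface)
iface.write_start()         # P.Host starts the P.Mem
done = ctrl.read_status()   # P.Host sees a single read that completes
```

From the host's point of view the poll is one ordinary memory read; the retry count lives only inside the controller.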

Chip Architecture

Basic Block

Area Estimation (mm²): VERY CONSERVATIVE
  PowerPC 603 + caches:                         12
  64 Mbytes of DRAM:                            330
  SRAM instruction memory:                      34
  P.Arrays:                                     96
  Multipliers:                                  10
  Rambus interface:                             3.4
  Pads + network interface + refresh logic:     20
  Total:                                        505
Of which 28% logic, 65% DRAM, 7% SRAM
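The total and the percentage split can be reproduced from the individual entries. The only assumption in this sketch is the grouping: everything that is not DRAM or SRAM instruction memory (including pads, network interface and refresh logic) is counted as logic, which is what makes the quoted 28% come out.

```python
# Check of the area budget, values in mm^2 taken from the slide.
# The dictionary keys are illustrative labels.

areas = {
    "PowerPC 603 + caches": 12,
    "64 Mbytes of DRAM": 330,
    "SRAM instruction memory": 34,
    "P.Arrays": 96,
    "Multipliers": 10,
    "Rambus interface": 3.4,
    "Pads + network interface + refresh logic": 20,
}

total = sum(areas.values())
dram = areas["64 Mbytes of DRAM"]
sram = areas["SRAM instruction memory"]
logic = total - dram - sram  # assumed grouping: everything else is logic

for name, area in [("logic", logic), ("DRAM", dram), ("SRAM", sram)]:
    print(f"{name}: {100 * area / total:.1f}%")
```

The entries sum to 505.4 mm², which the slide rounds to 505.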

Evaluation

Utilization
  • High P.Array utilization
  • Low P.Mem utilization

Utilization
  • Low P.Host utilization

Speedups
  • Constant problem size
  • Scaled problem size

Speedups
  • Varying logic frequency

Programming FlexRAM
  • FlexRAM programmed in C + extensions: C-Flex
  • Library of Intelligent Memory Operations (IMOs)
      • C subroutines that can be called from the main program
      • Executed by P.Arrays or P.Mem
      • Operate on large data sets with poor locality
  • Library also contains plain subroutines
  • Link program with IMOs or plain subroutines

C-Flex Programming Extensions
  • On processor_range: where the following code is executed
  • Waitfor processor_range: processors waiting for others
  • Map object to processor_range: mapping of pages
  • Release object
  • Flush(object), Flush&Inval(object): flush from cache
  • Broadcast(address), Poll(), Receive(address), Notify()
  • FlexRAM_malloc(), P_mem_malloc(), P_array_malloc()

Performance Evaluation
  • Hardware performance monitoring embedded in the chip
  • Software tools to extract and interpret performance info

Current Status
  • Identified and wrote all applications
  • Designed architecture based on apps & feasible technology
  • Conceived ideas behind language/compiler
  • Need to do:
      • Chip layout and fabrication
      • Development of the compiler
  • Funds needed for:
      • Processor core (P.Mem)
      • Chip fabrication
      • Hardware and software engineers

Overall Goal
  • Fabricate chips
  • Build a workstation with an intelligent memory system
  • Build a compiler for the intelligent memory system
  • Demonstrate significant speedups on real applications

Conclusion
  • We have a handle on:
      • A promising technology (MLD)
      • Key applications of industrial interest
  • Real chance to transform the computing landscape