FlexRAM: Toward an Advanced Intelligent Memory System

Josep Torrellas, University of Illinois
http://iacoma.cs.uiuc.edu
torrellas@cs.uiuc.edu

People Involved
• Students: Michael Huang, Joe Renau, Seung Yoo, Jaejin Lee
• Other faculty: David Padua, H. V. Jagadish, Daniel Reed

Technological Landscape
Merged Logic and DRAM (MLD):
• IBM, Mitsubishi, Samsung, Toshiba and others
• Powerful: e.g. IBM SA-27E ASIC (Feb 99)
• 0.18 µm (chips for 1 Gbit DRAM)
• Logic frequency: 400 MHz
• IBM PowerPC 603 proc + 16 KB I, D caches = 3%
• Further advances on the horizon
Opportunity: How to exploit MLD best?

Terminology Processor In Memory (PIM) = Intelligent Memory or Intelligent RAM (IRAM)

Key Applications Benefit from HW
• Data Mining (decision trees and neural networks)
• Computational Biology (protein sequence matching)
• Financial Modeling (stock options, derivatives)
• Molecular Dynamics (short-range forces)
• Multimedia (MPEG-2)
• Decision Support Systems (TPC-D)
• Speech Recognition
All these are Data Intensive Applications

Example App: DNA Matching
• Problem: Find areas of database DNA chains that match (modulo some mutations) the sample DNA chains

How the Algorithm Works
• Pick 4 consecutive amino acids from the sample
• Generate 50+ most-likely mutations

Example App: DNA Matching
• Compare them to every position in the database DNAs
• If a match is found: try to extend it
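The seed-and-extend steps on these slides can be sketched in plain C. This is a minimal illustration, not the authors' code: the slides do not say how the "most-likely mutations" are chosen, so the hypothetical `make_variants` simply uses the seed plus every single-letter substitution over a caller-supplied alphabet, and `extend_hit` extends only to the right with exact matching. All function names here are assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SEED_LEN 4
#define MAX_VARIANTS 64

/* Hypothetical stand-in for the "most-likely mutations" step: the seed
   itself plus every single-letter substitution over `alphabet`.
   Writes at most `max` variants into `out`; returns how many were written. */
static size_t make_variants(const char *seed, const char *alphabet,
                            char out[][SEED_LEN + 1], size_t max)
{
    size_t n = 0;
    assert(max >= 1);
    memcpy(out[n], seed, SEED_LEN);
    out[n++][SEED_LEN] = '\0';
    for (size_t pos = 0; pos < SEED_LEN; pos++) {
        for (const char *a = alphabet; *a != '\0'; a++) {
            if (*a == seed[pos] || n == max)
                continue;
            memcpy(out[n], seed, SEED_LEN);
            out[n][pos] = *a;
            out[n++][SEED_LEN] = '\0';
        }
    }
    return n;
}

/* Extends a seed hit to the right while sample and database agree.
   Returns the matched length, counting the seed itself. */
static size_t extend_hit(const char *sample, size_t s_pos,
                         const char *db, size_t d_pos)
{
    size_t len = SEED_LEN;
    while (sample[s_pos + len] != '\0' && db[d_pos + len] != '\0' &&
           sample[s_pos + len] == db[d_pos + len])
        len++;
    return len;
}

/* Compares the seed variants to every database position and returns the
   longest extended match, or 0 if no variant occurs in the database. */
size_t best_match(const char *sample, size_t s_pos,
                  const char *db, const char *alphabet)
{
    char variants[MAX_VARIANTS][SEED_LEN + 1];
    size_t db_len = strlen(db), best = 0;
    assert(strlen(sample) >= s_pos + SEED_LEN);
    size_t n = make_variants(sample + s_pos, alphabet, variants, MAX_VARIANTS);
    for (size_t d = 0; d + SEED_LEN <= db_len; d++) {
        for (size_t v = 0; v < n; v++) {
            if (memcmp(db + d, variants[v], SEED_LEN) == 0) {
                size_t len = extend_hit(sample, s_pos, db, d);
                if (len > best)
                    best = len;
                break;
            }
        }
    }
    return best;
}
```

The outer loop over database positions is the part that touches every byte of the database with little reuse, which is why the slides treat this as a data-intensive application.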

How to Use MLD
1. Main compute engine of the machine
• Add proc to DRAM chip
• Include a vector processor or multiple processors
Incremental gains
Hard to program
UC Berkeley: IRAM
Notre Dame: Execube, Petaflops
MIT: Raw
Stanford: Smart Memories

How to Use MLD (II)
2. Co-processor, special-purpose processor
• ATM switch controller
• Process data beside the disk
• Graphics accelerator
Stanford: Imagine
UC Berkeley: ISTORE

How to Use MLD (III)
3. Our approach: take the place of memory chips in a workstation or server
• PIM chip processes the memory-intensive parts of the program
Illinois: FlexRAM
UC Davis: Active Pages
USC-ISI: DIVA

Our Solution: Principles
• Extract high bandwidth from DRAM:
  - Many simple processing units
• Run legacy codes with high performance:
  - Do not replace off-the-shelf µP in workstation
  - Take place of memory chip. Same interface as DRAM
  - Intelligent memory defaults to plain DRAM
• Small increase in cost over DRAM:
  - Simple processing units, still dense
• General purpose:
  - Do not hardwire any algorithm. No special purpose

Architecture Proposed

The FlexRAM Memory System
Can exploit multiple levels of parallelism
For a high-end workstation:
• 1 P.Host processor (e.g. Merced, IBM GP)
• 100's of P.Mems in memory (e.g. IBM PowerPC 603)
• 100,000's of very simple P.Arrays in memory

Chip Organization

Memory in one FlexRAM Chip
• 64 Mbytes of DRAM organized as 16M x 32 bits
• Organized in 64 1-Mbyte banks
• Each bank:
  - Associated to 1 P.Array
  - 1 single port
  - 2 2-Kbyte row buffers (no P.Array cache)
  - P.Array access to memory: 10 ns (row hit) or 20 ns (miss)
• On-chip memory bandwidth: 102 Gbytes/second
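The slide does not show how the 102 Gbytes/second figure is derived. One reading that is consistent with the other numbers in the deck, and only an assumption here, is that each of the 64 banks delivers one 32-bit word to its P.Array per 2.5 ns cycle (the 400 MHz logic frequency). The hypothetical helper below does that arithmetic.

```c
#include <assert.h>

/* Aggregate on-chip bandwidth in Gbytes/s, assuming every bank delivers
   `bytes_per_access` bytes to its P.Array once per `cycle_ns` nanoseconds.
   The one-word-per-cycle access pattern is an assumption, not slide data. */
double onchip_bandwidth_gbytes(int banks, int bytes_per_access, double cycle_ns)
{
    assert(banks > 0 && bytes_per_access > 0 && cycle_ns > 0.0);
    /* Bytes per nanosecond is numerically equal to Gbytes (10^9 bytes) per second. */
    return banks * bytes_per_access / cycle_ns;
}
```

Under those assumptions, `onchip_bandwidth_gbytes(64, 4, 2.5)` gives 102.4, which rounds to the quoted figure.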

Memory in one FlexRAM Chip
Group of 4 P.Arrays share one 8-Kbyte, 4-ported SRAM instruction memory
• Holds the P.Array code
• Small because short code
• Aggressive access time: 1 cycle = 2.5 ns

P.Array
• 64 P.Arrays per chip. Not SIMD but SPMD
• 32-bit integer arithmetic; 16 registers
• No caches, no floating point
• 4 P.Arrays share one multiplier
• 28 different 16-bit instructions
• Can access own 1 Mbyte of DRAM plus DRAM of left and right neighbors. Connection forms a ring
• Broadcast and notify primitives: Barrier
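The ring connectivity described above can be modelled in a few lines of C. The sketch assumes P.Arrays and their banks are numbered 0 to 63 in ring order with wraparound; the slide states only that the connection forms a ring, so the numbering and the function names are illustrative.

```c
#include <assert.h>
#include <stdbool.h>

#define NUM_PARRAYS 64

/* Left neighbor on the ring; P.Array 0 wraps around to P.Array 63. */
int left_neighbor(int id)
{
    assert(id >= 0 && id < NUM_PARRAYS);
    return (id + NUM_PARRAYS - 1) % NUM_PARRAYS;
}

/* Right neighbor on the ring; P.Array 63 wraps around to P.Array 0. */
int right_neighbor(int id)
{
    assert(id >= 0 && id < NUM_PARRAYS);
    return (id + 1) % NUM_PARRAYS;
}

/* A P.Array can reach its own 1-Mbyte bank and the banks of its two
   ring neighbors, and nothing else. */
bool can_access(int parray, int bank)
{
    assert(bank >= 0 && bank < NUM_PARRAYS);
    return bank == parray ||
           bank == left_neighbor(parray) ||
           bank == right_neighbor(parray);
}
```

Data that lives further away than one hop has to be reached some other way, for example through the P.Mem, which sees the memory of all chips.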

P.Mem
• 2-issue static superscalar like IBM PowerPC 603
• 16-Kbyte I, D caches
• Executes serial sections
• Communication with P.Arrays:
  - Broadcast/notify or plain write/read to memory
• Communication with other P.Mems:
  - Memory in all chips is visible
  - Access via the inter-chip network
  - Must flush caches to ensure data coherence

Issues
Communication P.Mem-P.Host:
• P.Mem cannot be the master of bus
• P.Host starts P.Mems by writing register in Rambus interface
• P.Host polls a register in Rambus interface of master P.Mem
• If P.Mem not finished: memory controller retries. Retries are invisible to P.Host
Virtual memory:
• P.Mems and P.Arrays use virtual memory
• They share a range of virtual addresses with P.Host
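The start/poll handshake can be pictured with a small software model. Everything below is hypothetical: the slide names a register in the Rambus interface but gives no layout, so the `pmem_model` struct, the abstract "work units", and the retry counter exist only to make the behavior visible. The point it illustrates is that the P.Host issues a single poll while the memory controller absorbs the retries.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of one P.Mem as seen through its Rambus interface. */
struct pmem_model {
    bool started;   /* set when P.Host writes the start register */
    int  work_left; /* abstract units of work the P.Mem still has to do */
};

/* P.Host starts the P.Mem by writing its start register. */
void phost_start(struct pmem_model *m, int work)
{
    assert(work >= 0);
    m->started = true;
    m->work_left = work;
}

/* One unit of P.Mem progress between two controller retries. */
static void pmem_step(struct pmem_model *m)
{
    if (m->started && m->work_left > 0)
        m->work_left--;
}

/* One P.Host poll. The memory controller retries the read until the P.Mem
   has finished, so the P.Host sees one read that returns "done".
   `retries` is exposed only so the hidden retries can be observed here. */
bool phost_poll(struct pmem_model *m, int *retries)
{
    assert(m->started);
    *retries = 0;
    while (m->work_left > 0) {
        pmem_step(m);
        (*retries)++;
    }
    return true;
}
```

Because the P.Mem cannot master the bus, completion has to be pulled by the P.Host in this way rather than pushed by the memory chip.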

Chip Architecture

Basic Block

Area Estimation (mm^2)
VERY CONSERVATIVE

Component                                   Area (mm^2)
PowerPC 603 + caches                        12
64 Mbytes of DRAM                           330
SRAM instruction memory                     34
P.Arrays                                    96
Multipliers                                 10
Rambus interface                            3.4
Pads + network interface + refresh logic    20
Total                                       505

Of which 28% logic, 65% DRAM, 7% SRAM
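The percentages on this slide can be checked against the per-component areas. The grouping below is an assumption: the slide does not say which components count as logic, so everything other than the DRAM array and the SRAM instruction memory is labelled logic here, which reproduces the quoted 28% / 65% / 7% split. The type and function names are illustrative.

```c
#include <assert.h>
#include <stddef.h>

enum kind { KIND_LOGIC, KIND_DRAM, KIND_SRAM };

struct block {
    const char *name;
    double mm2;
    enum kind kind;
};

/* Areas are from the slide; the logic/DRAM/SRAM labels are assumptions. */
static const struct block blocks[] = {
    { "PowerPC 603 + caches",                 12.0, KIND_LOGIC },
    { "64 Mbytes of DRAM",                   330.0, KIND_DRAM  },
    { "SRAM instruction memory",              34.0, KIND_SRAM  },
    { "P.Arrays",                             96.0, KIND_LOGIC },
    { "Multipliers",                          10.0, KIND_LOGIC },
    { "Rambus interface",                      3.4, KIND_LOGIC },
    { "Pads + network interface + refresh",   20.0, KIND_LOGIC },
};

/* Sum of all component areas in mm^2. */
double total_area(void)
{
    double sum = 0.0;
    for (size_t i = 0; i < sizeof blocks / sizeof blocks[0]; i++)
        sum += blocks[i].mm2;
    return sum;
}

/* Percentage of the total area taken by components of kind `k`. */
double share_percent(enum kind k)
{
    double total = total_area(), sum = 0.0;
    assert(total > 0.0);
    for (size_t i = 0; i < sizeof blocks / sizeof blocks[0]; i++)
        if (blocks[i].kind == k)
            sum += blocks[i].mm2;
    return 100.0 * sum / total;
}
```

The components add up to 505.4 mm^2, which the slide rounds to 505.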

Evaluation

Utilization
• High P.Array utilization
• Low P.Mem utilization

Utilization
• Low P.Host utilization

Speedups
• Constant problem size
• Scaled problem size

Speedups
• Varying logic frequency

Programming FlexRAM
• FlexRAM programmed in C + extensions: C-Flex
• Library of Intelligent Memory Operations (IMOs)
  - C subroutines that can be called from the main program
  - Executed by P.Arrays or P.Mem
  - Operate on large data sets with poor locality
• Library also contains plain subroutines
• Link program with IMOs or plain subroutines

C-Flex Programming Extensions
• On processor_range: where the following code is executed
• Waitfor processor_range: processors waiting for others
• Map object to processor_range: mapping of pages
• Release object
• Flush(object), Flush&Inval(object): flush from cache
• Broadcast(address), Poll(), Receive(address), Notify()
• FlexRAM_malloc(), P_mem_malloc(), P_array_malloc()
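The slides list the C-Flex extensions but do not show their syntax, so the sketch below is plain C that only emulates the shape of an IMO, not C-Flex itself. It assumes a hypothetical counting operation: the data is split into 64 slices, one per P.Array bank, each P.Array runs the same code on its own slice (SPMD, emulated here by a sequential loop), and a serial reduction stands in for the section a P.Mem would execute after the barrier.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_PARRAYS 64

/* Work of one P.Array: count occurrences of `key` in its own slice. */
static size_t parray_count(const int *slice, size_t len, int key)
{
    size_t hits = 0;
    for (size_t i = 0; i < len; i++)
        if (slice[i] == key)
            hits++;
    return hits;
}

/* Hypothetical IMO: emulates the SPMD phase sequentially, then performs
   the serial reduction that a P.Mem would run once all P.Arrays notify. */
size_t imo_count(const int *data, size_t n, int key)
{
    size_t partial[NUM_PARRAYS];
    size_t chunk = (n + NUM_PARRAYS - 1) / NUM_PARRAYS;
    assert(data != NULL);

    /* SPMD phase: same code, different slice per P.Array. */
    for (size_t p = 0; p < NUM_PARRAYS; p++) {
        size_t begin = p * chunk < n ? p * chunk : n;
        size_t end = begin + chunk < n ? begin + chunk : n;
        partial[p] = parray_count(data + begin, end - begin, key);
    }

    /* Serial phase: combine the per-P.Array results. */
    size_t total = 0;
    for (size_t p = 0; p < NUM_PARRAYS; p++)
        total += partial[p];
    return total;
}
```

In a real C-Flex program the slice assignment would presumably be expressed with the Map and On constructs, and the hand-off between phases with Notify and Waitfor; that mapping is a guess from the names above.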

Performance Evaluation
• Hardware performance monitoring embedded in the chip
• Software tools to extract and interpret performance info

Current Status
• Identified and wrote all applications
• Designed architecture based on apps & feasible technology
• Conceived ideas behind language/compiler
• Need to do:
  - Chip layout and fabrication
  - Development of the compiler
• Funds needed for:
  - Processor core (P.Mem)
  - Chip fabrication
  - Hardware and software engineers

Overall Goal
• Fabricate chips
• Build a workstation with an intelligent memory system
• Build a compiler for the intelligent memory system
• Demonstrate significant speedups on real applications

Conclusion
• We have a handle on:
  - A promising technology (MLD)
  - Key applications of industrial interest
• Real chance to transform the computing landscape