FlexRAM: Toward an Advanced Intelligent Memory System
- Slides: 34
FlexRAM: Toward an Advanced Intelligent Memory System
Josep Torrellas, University of Illinois
http://iacoma.cs.uiuc.edu
torrellas@cs.uiuc.edu
People Involved
• Students: Michael Huang, Joe Renau, Seung Yoo, Jaejin Lee
• Other faculty: David Padua, H. V. Jagadish, Daniel Reed
Technological Landscape
Merged Logic and DRAM (MLD):
• IBM, Mitsubishi, Samsung, Toshiba and others
• Powerful: e.g. IBM SA-27E ASIC (Feb 99)
• 0.18 um (chips for 1 Gbit DRAM)
• Logic frequency: 400 MHz
• IBM PowerPC 603 proc + 16 KB I, D caches = 3%
• Further advances on the horizon
Opportunity: how to exploit MLD best?
Terminology Processor In Memory (PIM) = Intelligent Memory or Intelligent RAM (IRAM)
Key Applications Benefit from HW
• Data Mining (decision trees and neural networks)
• Computational Biology (protein sequence matching)
• Financial Modeling (stock options, derivatives)
• Molecular Dynamics (short-range forces)
• Multimedia (MPEG-2)
• Decision Support Systems (TPC-D)
• Speech Recognition
All these are Data Intensive Applications
Example App: DNA Matching
• Problem: find areas of the database DNA chains that match (modulo some mutations) the sample DNA chains
How the Algorithm Works
• Pick 4 consecutive amino acids from the sample
• Generate the 50+ most-likely mutations
Example App: DNA Matching
• Compare them to every position in the database DNAs
• If a match is found: try to extend it
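The seed-and-extend steps on these slides can be sketched as follows. This is a minimal illustration, not the code evaluated in the talk: the function names are hypothetical, and the "50+ most-likely mutations" are approximated by all single-substitution variants of the 4-letter seed (77 strings), since the slides do not say how likely mutations are ranked.

```python
# Illustrative seed-and-extend matcher. All names are hypothetical.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids
SEED_LEN = 4                          # "pick 4 consecutive amino acids"


def seed_variants(seed):
    """Return the seed plus every single-substitution variant.

    Assumption: this stands in for the slide's "50+ most-likely
    mutations"; a real tool would rank mutations by likelihood.
    """
    variants = {seed}
    for i in range(len(seed)):
        for aa in AMINO_ACIDS:
            variants.add(seed[:i] + aa + seed[i + 1:])
    return variants


def extend(sample, s_pos, database, d_pos):
    """Grow a seed hit left and right while the characters agree.

    Returns (start offset in database, length of the matched region).
    """
    left = 0
    while (s_pos - left - 1 >= 0 and d_pos - left - 1 >= 0
           and sample[s_pos - left - 1] == database[d_pos - left - 1]):
        left += 1
    right = SEED_LEN
    while (s_pos + right < len(sample) and d_pos + right < len(database)
           and sample[s_pos + right] == database[d_pos + right]):
        right += 1
    return d_pos - left, left + right


def find_matches(sample, database):
    """Compare every seed of the sample to every database position."""
    hits = []
    for s_pos in range(len(sample) - SEED_LEN + 1):
        variants = seed_variants(sample[s_pos:s_pos + SEED_LEN])
        for d_pos in range(len(database) - SEED_LEN + 1):
            if database[d_pos:d_pos + SEED_LEN] in variants:
                hits.append((s_pos,) + extend(sample, s_pos, database, d_pos))
    return hits
```

The inner loop touches every database position for every seed with almost no reuse, which is the kind of memory-bound scan the talk classes as data intensive.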
How to Use MLD
1. Main compute engine of the machine
• Add proc to DRAM chip
• Include a vector processor or multiple processors
Incremental gains; hard to program
UC Berkeley: IRAM; Notre Dame: Execube, Petaflops; MIT: Raw; Stanford: Smart Memories
How to Use MLD (II)
2. Co-processor, special-purpose processor
• ATM switch controller
• Process data beside the disk
• Graphics accelerator
Stanford: Imagine; UC Berkeley: ISTORE
How to Use MLD (III)
3. Our approach: take the place of memory chips in a workstation or server
• PIM chip processes the memory-intensive parts of the program
Illinois: FlexRAM; UC Davis: Active Pages; USC-ISI: DIVA
Our Solution: Principles
• Extract high bandwidth from DRAM:
  • Many simple processing units
• Run legacy codes with high performance:
  • Do not replace the off-the-shelf µP in the workstation
  • Take the place of a memory chip. Same interface as DRAM
  • Intelligent memory defaults to plain DRAM
• Small increase in cost over DRAM:
  • Simple processing units, still dense
• General purpose:
  • Do not hardwire any algorithm. Not special purpose
Architecture Proposed
The FlexRAM Memory System
Can exploit multiple levels of parallelism
For a high-end workstation:
• 1 P.Host processor (e.g. Merced, IBM GP)
• 100's of P.Mems in memory (e.g. IBM PowerPC 603)
• 100,000's of very simple P.Arrays in memory
Chip Organization
Memory in one FlexRAM Chip
• 64 Mbytes of DRAM organized as 16M x 32 bits
• Organized in 64 1-Mbyte banks
• Each bank:
  • Associated to 1 P.Array
  • 1 single port
  • Two 2-Kbyte row buffers (no P.Array cache)
• P.Array access to memory: 10 ns (row hit) or 20 ns (miss)
• On-chip memory bandwidth: 102 Gbytes/second
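The slide does not show how the 102 Gbytes/second figure was obtained. One reading that reproduces it, sketched below, assumes every one of the 64 banks delivers one 32-bit word per 2.5 ns logic cycle (the 400 MHz logic frequency quoted earlier in the talk). The variable names are illustrative, and the per-cycle transfer is an assumption rather than a stated design parameter.

```python
# Back-of-the-envelope check of the on-chip figures (assumptions flagged).
BANKS = 64        # from the slide: 64 1-Mbyte banks
WORD_BYTES = 4    # 32-bit words, from "16M x 32 bits"
CYCLE_NS = 2.5    # assumed: one transfer per 400 MHz logic cycle

# 16M words x 32 bits = 64 Mbytes of DRAM
capacity_mbytes = 16 * 32 // 8

# Bytes per nanosecond is numerically equal to Gbytes per second.
peak_gbytes_per_s = BANKS * WORD_BYTES / CYCLE_NS

print(capacity_mbytes, "Mbytes,", peak_gbytes_per_s, "Gbytes/s")
```

Under these assumptions the peak is 102.4 Gbytes/second, in line with the rounded number on the slide.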
Memory in one FlexRAM Chip
Each group of 4 P.Arrays shares one 8-Kbyte, 4-ported SRAM instruction memory
• Holds the P.Array code
• Small because the code is short
• Aggressive access time: 1 cycle = 2.5 ns
P.Array
• 64 P.Arrays per chip. Not SIMD but SPMD
• 32-bit integer arithmetic; 16 registers
• No caches, no floating point
• 4 P.Arrays share one multiplier
• 28 different 16-bit instructions
• Can access own 1 Mbyte of DRAM plus DRAM of left and right neighbors. Connection forms a ring
• Broadcast and notify primitives: Barrier
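The ring connectivity can be illustrated with a small sketch. The numbering of P.Arrays from 0 to 63, the choice of which neighbor counts as "left", and the function names are assumptions made for illustration. The slide only states that each P.Array reaches its own bank plus those of its two neighbors, and that the connection forms a ring.

```python
NUM_PARRAYS = 64  # from the slide: 64 P.Arrays per chip, one bank each


def accessible_banks(p_array):
    """Banks reachable by a P.Array: left neighbor, own bank, right neighbor.

    Modular arithmetic closes the ring, so P.Array 0 and P.Array 63
    are neighbors (hypothetical numbering).
    """
    left = (p_array - 1) % NUM_PARRAYS
    right = (p_array + 1) % NUM_PARRAYS
    return [left, p_array, right]


def can_access(p_array, bank):
    """True if the given bank is directly reachable from this P.Array."""
    return bank in accessible_banks(p_array)
```

For example, `accessible_banks(0)` gives banks 63, 0 and 1, so data partitioned with a one-bank halo on each side stays reachable without going through P.Mem.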
P.Mem
• 2-issue static superscalar like IBM PowerPC 603
• 16-Kbyte I, D caches
• Executes serial sections
• Communication with P.Arrays:
  • Broadcast/notify or plain write/read to memory
• Communication with other P.Mems:
  • Memory in all chips is visible
  • Access via the inter-chip network
  • Must flush caches to ensure data coherence
Issues
Communication P.Mem-P.Host:
• P.Mem cannot be the master of the bus
• P.Host starts P.Mems by writing a register in the Rambus interface
• P.Host polls a register in the Rambus interface of the master P.Mem
• If P.Mem is not finished: memory controller retries. Retries are invisible to P.Host
Virtual memory:
• P.Mems and P.Arrays use virtual memory
• They share a range of virtual addresses with P.Host
Chip Architecture
Basic Block
Area Estimation (mm²), VERY CONSERVATIVE
• PowerPC 603 + caches: 12
• 64 Mbytes of DRAM: 330
• SRAM instruction memory: 34
• P.Arrays: 96
• Multipliers: 10
• Rambus interface: 3.4
• Pads + network interface + refresh logic: 20
Total = 505, of which 28% logic, 65% DRAM, 7% SRAM
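The breakdown can be cross-checked with a few lines of arithmetic. The grouping below is an assumption: everything other than the DRAM and the SRAM instruction memory, including pads, network interface and refresh logic, is counted as logic. That grouping reproduces the percentages on the slide.

```python
# Area figures in mm^2, copied from the slide.
areas = {
    "PowerPC 603 + caches": 12,
    "DRAM (64 Mbytes)": 330,
    "SRAM instruction memory": 34,
    "P.Arrays": 96,
    "Multipliers": 10,
    "Rambus interface": 3.4,
    "Pads + network interface + refresh logic": 20,
}

total = sum(areas.values())
dram = areas["DRAM (64 Mbytes)"]
sram = areas["SRAM instruction memory"]
logic = total - dram - sram  # assumed: all remaining area counts as logic

shares = {
    "logic": 100 * logic / total,
    "DRAM": 100 * dram / total,
    "SRAM": 100 * sram / total,
}

for name, pct in shares.items():
    print(f"{name}: {pct:.1f}%")
```

The components sum to about 505.4 mm², which the slide rounds to 505.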
Evaluation
Utilization
• High P.Array utilization
• Low P.Mem utilization
Utilization
• Low P.Host utilization
Speedups
• Constant problem size
• Scaled problem size
Speedups
• Varying logic frequency
Programming FlexRAM
• FlexRAM programmed in C + extensions: C-Flex
• Library of Intelligent Memory Operations (IMOs)
  • C subroutines that can be called from the main program
  • Executed by P.Arrays or P.Mem
  • Operate on large data sets with poor locality
• Library also contains plain subroutines
• Link program with IMOs or plain subroutines
C-Flex Programming Extensions
• On processor_range: where the following code is executed
• Waitfor processor_range: processors waiting for others
• Map object to processor_range: mapping of pages
• Release object
• Flush(object), Flush&Inval(object): flush from cache
• Broadcast(address), Poll(), Receive(address), Notify()
• FlexRAM_malloc(), P_mem_malloc(), P_array_malloc()
Performance Evaluation
• Hardware performance monitoring embedded in the chip
• Software tools to extract and interpret performance info
Current Status
• Identified and wrote all applications
• Designed architecture based on apps & feasible technology
• Conceived ideas behind language/compiler
• Need to do: chip layout and fabrication; development of the compiler
• Funds needed for:
  • processor core (P.Mem)
  • chip fabrication
  • hardware and software engineers
Overall Goal
• Fabricate chips
• Build a workstation with an intelligent memory system
• Build a compiler for the intelligent memory system
• Demonstrate significant speedups on real applications
Conclusion
• We have a handle on:
  • A promising technology (MLD)
  • Key applications of industrial interest
• Real chance to transform the computing landscape