SCRATCHPAD MEMORIES: A DESIGN ALTERNATIVE FOR CACHE ON-CHIP MEMORY IN EMBEDDED SYSTEMS
























- Slides: 24
SCRATCHPAD MEMORIES: A DESIGN ALTERNATIVE FOR CACHE ON-CHIP MEMORY IN EMBEDDED SYSTEMS
Nalini Kumar, Gaurav Chitroda, Komal Kasat
04/09/2010
Spring 2010, EEL 6935, Embedded Systems
OUTLINE
- Introduction
- Scratch pad memory
- Cache memory
- Proposed methodology
- Results
- Conclusions
INTRODUCTION
Scratch pad memory: a high-speed internal memory used for temporary storage of calculations, data, and other work in progress.
- It is the next closest memory to the ALU after the internal registers.
- Scratch pad based systems have NUMA (Non-Uniform Memory Access) latencies and use explicit instructions to move data.
- DMA-based data transfer is often used.
- On-chip caches using SRAM consume 25% to 45% of the total chip power.
- Current embedded processors for multimedia applications have on-chip scratch pad memories.
INTRODUCTION
Scratchpad vs. cache:
- A scratchpad doesn’t contain a copy of data that is stored in the main memory.
- Scratchpad memory is directly manipulated by applications.
- In cache memory systems, the mapping of program elements is done at run time; in scratch pad memory systems, it is done either by the user or by the compiler using a suitable algorithm.
- Prior studies on scratch pad memories do not address the impact on area.
CONTRIBUTIONS
- The paper proposes scratchpad memory as an alternative to cache memory as on-chip memory for computationally intensive applications.
- The CACTI tool is used to compute area and energy for the AT91M40400 target architecture.
- The results establish scratchpad memory as a low-power alternative in most situations, with an average energy reduction of 40%.
SCRATCH PAD MEMORY
Figure: Scratch pad memory array (memory cell: 6-transistor static RAM; memory array with the decoding and the column circuitry logic)
- Memory objects are mapped to the scratch pad in the last stage of the compiler.
- The scratch pad occupies one distinct part of the memory address space.
- There is no need to check for data/instruction availability in the scratch pad.
- This reduces the comparator and the miss/hit signal acknowledging circuitry.
SCRATCH PAD MEMORY
- Area of the scratchpad, $A_s$:
  $A_s = A_{sde} + A_{sda} + A_{sco} + A_{spr} + A_{sse} + A_{sou}$
- Components: data decoder, data array area, column multiplexers, precharge circuit, data sense amplifiers, output driver circuitry.
- Energy consumption is estimated from the energy consumption of the components:
  $E_{scratchpad} = E_{decoder} + E_{memcol}$
- The memory array is the major consumer of energy.
- The CACTI tool first computes the capacitances for each unit, then estimates the energy.
ESTIMATING THE ENERGY CONSUMPTION
- For the memory array:
  $E_{memcol} = C_{memcol} \cdot V_{dd}^2 \cdot P_{0 \to 1}$
- $C_{memcol}$ is the capacitance of the memory array unit, calculated as
  $C_{memcol} = n_{cols} \cdot (C_{pre} + C_{readwrite})$
- $P_{0 \to 1}$ is the probability of a bit toggle, 0.5.
- Only two word lines are switched, regardless of the change in the address bits.
- Total energy spent in the scratch pad memory:
  $E_{sptotal} = SP_{access} \cdot E_{scratchpad}$
- For the scratch pad, the only case that holds good is a read or write access.
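The energy formulas on this slide can be sketched in a few lines of Python. This is only an illustration of the arithmetic: the function names are hypothetical, and the capacitance, voltage, and decoder-energy values in the usage lines are placeholders, not figures from the paper or from CACTI.

```python
def memcol_energy(ncols, c_pre, c_readwrite, vdd, p_toggle=0.5):
    """Ememcol = Cmemcol * Vdd^2 * P(0->1), with Cmemcol = ncols * (Cpre + Creadwrite)."""
    c_memcol = ncols * (c_pre + c_readwrite)
    return c_memcol * vdd ** 2 * p_toggle


def scratchpad_total_energy(sp_accesses, e_decoder, e_memcol):
    """Esptotal = SPaccess * Escratchpad, with Escratchpad = Edecoder + Ememcol."""
    return sp_accesses * (e_decoder + e_memcol)


# Placeholder inputs (assumed values, chosen only to make the arithmetic easy to follow).
e_memcol = memcol_energy(ncols=2, c_pre=1.0, c_readwrite=1.0, vdd=2.0)
e_total = scratchpad_total_energy(sp_accesses=10, e_decoder=1.0, e_memcol=e_memcol)
print(e_memcol, e_total)
```

Because the scratch pad has no hit/miss distinction, the total is a single product of the access count and the per-access energy.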
CACHE MEMORY
Figure: Cache memory organization (tag array, data array)
- The area model is based on the transistor count in the circuitry.
- Area of the cache:
  $A_c = A_{tag} + A_{data}$
  where
  $A_{tag} = A_{dt} + A_{ta} + A_{co} + A_{pr} + A_{se} + A_{com} + A_{mu}$
  and
  $A_{data} = A_{de} + A_{da} + A_{col} + A_{pre} + A_{sen} + A_{out}$
PROPOSED METHODOLOGY
EXPERIMENTAL SETUP
- A cache is compared with a scratchpad memory of the same size (the delay of the cache is higher than that of the scratchpad for the same technology).
- Identification and assignment of critical data structures to the scratch pad is based on a packing algorithm.
- The total number of clock cycles determines the performance.
- The larger the number of clock cycles, the lower the performance, because the on-chip memory configuration doesn’t change the clock period.
SCRATCH PAD MEMORY ACCESS
Performance is estimated from the trace file. An appropriate latency is added to the overall program delay for each access:
- one cycle for a scratch pad read/write access,
- one cycle plus one wait state for a 16-bit main memory access,
- one cycle plus three wait states for a 32-bit main memory access.

Access               Number of cycles
Cache                Using cache calculations
Scratch pad          1 cycle
Main memory, 16 bit  1 cycle + 1 wait state
Main memory, 32 bit  1 cycle + 3 wait states
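The cycle accounting on this slide can be sketched as a simple pass over a trace. This is a minimal illustration, assuming that each wait state costs one extra cycle and using the three-wait-state figure from the bullet list for 32-bit main memory accesses. The trace format and the names `CYCLES` and `count_cycles` are hypothetical; they are not the ARMulator's output format.

```python
# Cycles per access type; assumes one wait state costs one extra cycle.
CYCLES = {
    "scratchpad": 1,  # 1 cycle
    "main16": 1 + 1,  # 1 cycle + 1 wait state
    "main32": 1 + 3,  # 1 cycle + 3 wait states
}


def count_cycles(trace):
    """Sum the latency of every access in a trace of access-type labels."""
    return sum(CYCLES[kind] for kind in trace)


example_trace = ["scratchpad", "main16", "main32", "scratchpad"]
print(count_cycles(example_trace))  # 1 + 2 + 4 + 1 = 8
```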
CACHE MEMORY ACCESS
The authors assume a write-through cache.

Access type  Caread  Cawrite  Mmread  Mmwrite
Read hit     1       0        0       0
Read miss    1       L        L       0
Write hit    0       1        0       1
Write miss   1       0        0       1

- Read hit: the tag array is accessed. No write to the cache and no access to main memory.
- Read miss: one cache read operation, then L (line size) words are written to the cache. One main memory read event of size L and no main memory write.
- Write hit: a cache write followed by a main memory write.
- Write miss: one cache tag read and one main memory write. No cache update.
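The access table above can be turned into a small event counter. The sketch below is illustrative only: the outcome labels and function names are hypothetical, and the write-hit row is completed from the slide's description (a cache write followed by a memory write, so no main memory read).

```python
def access_events(outcome, line_words):
    """Return (cache reads, cache writes, main memory reads, main memory writes) for one access."""
    table = {
        "read_hit": (1, 0, 0, 0),
        "read_miss": (1, line_words, line_words, 0),
        "write_hit": (0, 1, 0, 1),   # write-through: cache write plus memory write
        "write_miss": (1, 0, 0, 1),  # tag read, memory write, no cache update
    }
    return table[outcome]


def total_events(outcomes, line_words):
    """Accumulate the four event counts over a sequence of access outcomes."""
    totals = [0, 0, 0, 0]
    for outcome in outcomes:
        for i, count in enumerate(access_events(outcome, line_words)):
            totals[i] += count
    return tuple(totals)


print(total_events(["read_hit", "read_miss", "write_hit", "write_miss"], line_words=4))
```

Multiplying each of the four totals by a per-event energy would give a memory-system energy estimate; that combination step is an assumption here and is not spelled out on the slide.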
FLOW DIAGRAM
Figure: Experimental flow. Blocks: C benchmark; energy-aware compiler (compiler support, mapping algorithm); ARMulator trace analysis; cache number of cycles; scratchpad number of cycles; CACTI energy estimates; analytical model area estimates; cache/scratch pad size.
EXPERIMENTAL SETUP
- Target architecture: AT91M40400, based on the ARM7TDMI embedded processor.
  - High-performance RISC processor with very low power consumption.
  - On-chip scratch pad memory of 4 KB, 32-bit data path, and two instruction sets.
- encc, an energy-aware compiler, uses a special packing algorithm (a knapsack algorithm) to assign code and data blocks to the scratch pad memory.
- The binary output of the compiler is simulated on the ARMulator to produce a trace file.
- The ARMulator accepts the cache size as a parameter for the on-chip cache configuration and reports performance as a number of cycles.
- The area and performance estimates are made for 0.5 µm technology.
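The slide says only that encc uses a knapsack algorithm to pack code and data blocks into the scratch pad. As a rough illustration of the idea, here is a textbook 0/1 knapsack in Python. It is not encc's actual implementation: the object names, sizes, and the "benefit" metric (for example, estimated energy saved) are hypothetical.

```python
def knapsack_assign(objects, capacity):
    """Pick objects to place in the scratch pad.

    objects: list of (name, size_bytes, benefit) tuples.
    capacity: scratch pad size in bytes.
    Returns (total_benefit, chosen_names) for the best selection that fits.
    """
    # best[c] holds the best (benefit, names) achievable within c bytes.
    best = [(0, [])] * (capacity + 1)
    for name, size, benefit in objects:
        # Iterate capacities downward so each object is used at most once.
        for c in range(capacity, size - 1, -1):
            candidate = best[c - size][0] + benefit
            if candidate > best[c][0]:
                best[c] = (candidate, best[c - size][1] + [name])
    return best[capacity]


# Hypothetical memory objects: (name, size in bytes, benefit).
objects = [("a", 3, 4), ("b", 4, 5), ("c", 2, 3)]
print(knapsack_assign(objects, capacity=5))  # (7, ['a', 'c'])
```

With a 5-byte budget, objects "a" and "c" together fill the scratch pad and beat placing "b" alone.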
RESULTS
Table: Energy per access of various devices
Device                             Energy per access
Cache, per access (2 kB)           4.57 nJ
Scratch pad, per access (2 kB)     1.53 nJ
Main memory read access, 2 bytes   24.00 nJ
Main memory read access, 4 bytes   49.30 nJ
Main memory write access, 4 bytes  41.10 nJ

Table: Area/performance ratios for bubble-sort
Size (bytes)  Area, cache  Area, scratchpad  CPU cycles, cache  CPU cycles, scratchpad  Area reduction  Time reduction  Area-time product
64            6744         4032              481.9              347.5                   0.40            0.28            0.44
128           11238        7104              302.4              239.9                   0.37            0.21            0.51
256           21586        14306             264.0              237.9                   0.34            0.10            0.55
512           38630        26722             242.6              237.9                   0.31            0.10            0.61
1024          74680        53444             241.7              192.0                   0.28            0.21            0.55
2048          142224       102852            241.5              192.0                   0.28            0.20            0.57
Average                                                                                 0.33            0.18            0.54

The average area, time, and AT product reductions are 34%, 18%, and 46%.
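The reduction columns can be read as relative savings of the scratchpad over the cache. The sketch below assumes that a reduction is computed as 1 minus the scratchpad-to-cache ratio; the slide does not state the formula, so this is an interpretation, and the function name is hypothetical. It uses the 64-byte row of the bubble-sort table.

```python
def reduction(cache_value, scratchpad_value):
    """Relative saving of the scratchpad over the cache (assumed definition)."""
    return 1.0 - scratchpad_value / cache_value


# 64-byte row of the bubble-sort table.
area_red = reduction(6744, 4032)
time_red = reduction(481.9, 347.5)
print(f"{area_red:.2f} {time_red:.2f}")  # 0.40 0.28
```

Under this definition, the recomputed values for the 64-byte row match the area and time reductions printed in the table.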
RESULTS
Figure: Comparison of cache and scratch pad memory area
Figure: Energy consumed by the memory system
CONCLUSION
- The paper presents an approach for selecting on-chip memory configurations.
- Results show that scratch pad based compile-time memory outperforms cache-based run-time memory on almost all counts.
- Average energy reduction of 40% for the application considered.
- The authors propose a study of DRAM-based memory comparisons, since memory bandwidth and on-chip memory capacity are limiting factors for many applications.
- The energy models for both cache and scratchpad also need to be validated by real measurements.
QUESTIONS