Memory Controller V2 (MCV2), Carlos González


MCV2 - Overview
• Per-client interface, to/from clients: (read request or write data) and (state + read data)
• Direct link with clients (two signals per client)
• Global Request Buffer (RB): allocates entries and splits/distributes requests
  – By default shared by all clients (special mode: pseudo “split” per ROP, simulated using counters)
  – Splitting/distribution is done when allocating a request in the RB
• Splitter/Distributor feeding channels Ch 0 … Ch 7, each with bank queues B0 … B7
  – Structure to allow out-of-order processing among banks (not required by the baseline scheduler)
  – Select (oldest first)
• Channel Schedulers protocol: Channel Request, Channel Reply, Channel State (per bank)
• Channel Scheduler 0 … Channel Scheduler 7, with these variants:
  – 1R+1W Fifo
  – nRW Fifos
  – nR+nW Fifos
  – nRW Lookahead buffers
• GDDR3/4 protocol: Chip Request, Chip Reply, to/from GDDR3 Chip 0 … GDDR3 Chip 7

Simulate a split RB with counters
• Only possible for ROP clients and when ROPs are assigned to specific channels.
• Texture Units and other clients keep using the RB as shared.
  – A split for these clients requires a more complex solution than just counters, very likely implementing a real RB split.

ROPs distributed among channels
• ropCounter[nROPS]
  – Each ropCounter[i] keeps how many transactions are allocated by ROP i
  – If a counter ropCounter[i] reaches RBsize/nROPS, the MC sends an MT_NONE token to ROP i (no more transactions accepted)
• Units other than ROPs only take RBsize into account to decide whether more transactions are accepted
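The counter rule on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the simulator's code: the class name `RequestBufferAdmission`, its methods, and the extra check that a ROP is also refused when the whole RB is full are assumptions made for the example; only `ropCounter`, `RBsize/nROPS` and `MT_NONE` come from the slide.

```python
class RequestBufferAdmission:
    """Hypothetical sketch of a shared RB with a counter-simulated split per ROP."""

    def __init__(self, rb_size, n_rops):
        self.rb_size = rb_size
        self.rop_quota = rb_size // n_rops   # RBsize / nROPS from the slide
        self.rop_counter = [0] * n_rops      # ropCounter[nROPS]
        self.allocated = 0                   # entries in use by all clients

    def can_accept(self, rop=None):
        """True if one more transaction may be allocated.

        rop=None models a non-ROP client, which only looks at RBsize.
        For a ROP, False corresponds to the MC sending MT_NONE to that ROP.
        """
        if self.allocated >= self.rb_size:   # assumption: a full RB blocks everyone
            return False
        if rop is not None and self.rop_counter[rop] >= self.rop_quota:
            return False
        return True

    def allocate(self, rop=None):
        """Reserve one RB entry; returns False if the request is refused."""
        if not self.can_accept(rop):
            return False
        self.allocated += 1
        if rop is not None:
            self.rop_counter[rop] += 1
        return True

    def release(self, rop=None):
        """Free one RB entry once its transaction has completed."""
        self.allocated -= 1
        if rop is not None:
            self.rop_counter[rop] -= 1
```

With `rb_size=8` and `n_rops=4`, each ROP may hold at most two entries, while a texture unit or any other client is limited only by the total of eight.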

GPU Architecture overview
[Diagram: pipeline blocks VFetch, Clipping, PA, Triangle Setup, Rasterization, HZ, Scheduler (frags/vtxs), Distributor, uShader + TU units, Z-ROP 0 … Z-ROP 3 and C-ROP 0 … C-ROP 3, connected through the MC XBAR to MC 0 … MC 3. Each MC i drives two chips: MC 0 with GDDR 0 and GDDR 1, MC 1 with GDDR 2 and GDDR 3, MC 2 with GDDR 4 and GDDR 5, MC 3 with GDDR 6 and GDDR 7.]

GPU Architecture overview
[Diagram: same organization with additional uShader + TU units. Blocks: VFetch, Clipping, PA, TSetup, Rast, HZ, Scheduler (frags/vtxs), Distributor, uShader + TU units, Z-ROP 0 … Z-ROP 3, C-ROP 0 … C-ROP 3, MC XBAR, MC 0 … MC 3, GDDR 0 … GDDR 7 (two GDDR chips per MC).]

Basic RB connection to channels
[Diagram: Request Buffer i receives traffic from/to ROP i (C-ROP 0, Z-ROP 0) and from/to crossbar clients. Requests are reserved and enqueued into one FIFO per channel (CH 0, CH 1). Each FIFO hands the next CT to its channel scheduler (CS 0, CS 1; Channel Scheduler i0, i1), which drives its chip (GDDR 0, GDDR 1; GDDR3 Chip i0, i1). Data Buffers sit alongside the FIFOs.]

RB connection to channels for schedulers with independent queues per bank
[Diagram: Request Buffer i receives traffic from/to ROP i (C-ROP 0, Z-ROP 0) and from/to crossbar clients. Requests are reserved and enqueued into per-bank FIFOs B0 … B7 for each channel (CH 0, CH 1). Each bank FIFO is served oldest first, and a Select (oldest first) stage picks the next CT for the channel scheduler (CS 0, CS 1; Channel Scheduler i0, i1), which drives GDDR 0 / GDDR 1 (GDDR3 Chip i0, i1). Data Buffers hold the in-flight MTs payload.]
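The "Select (oldest first)" stage over per-bank FIFOs can be illustrated with a small Python sketch. Everything named here (`BankQueues`, `enqueue`, `select_oldest`, the `bank_ready` callback) is hypothetical; the slide only states that there is one queue per bank and that selection is oldest first. The sketch assumes age is tracked with a global arrival stamp and that a bank may be skipped when it is not ready.

```python
from collections import deque
from itertools import count


class BankQueues:
    """Hypothetical per-bank FIFOs with an oldest-first selector."""

    def __init__(self, n_banks=8):
        self.queues = [deque() for _ in range(n_banks)]  # B0 ... B7
        self._arrival = count()                          # global age stamp

    def enqueue(self, bank, transaction):
        """Append a channel transaction (CT) to its bank FIFO."""
        self.queues[bank].append((next(self._arrival), transaction))

    def select_oldest(self, bank_ready=lambda bank: True):
        """Pop the oldest FIFO head among banks accepted by bank_ready.

        Returns (bank, transaction), or None when nothing can be selected.
        """
        heads = [(q[0][0], bank)
                 for bank, q in enumerate(self.queues)
                 if q and bank_ready(bank)]
        if not heads:
            return None
        _, bank = min(heads)                 # smallest stamp = oldest head
        _, transaction = self.queues[bank].popleft()
        return bank, transaction
```

Because only the head of each bank FIFO is a candidate, order is preserved within a bank while a younger transaction for another bank can overtake a blocked one, which is the out-of-order processing among banks that the overview slide mentions.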

MC Architecture
[Diagram, top to bottom:
 clients: TXT and Others (low traffic) through the Crossbar; ZStencil 0 … ZStencil 3; Color 0 … Color 3
 memory controllers: MC 0 (RB 0), MC 1 (RB 1), MC 2 (RB 2), MC 3 (RB 3)
 channel schedulers: CS 0 … CS 7
 memory: DRAM 0 … DRAM 7]

1R+1W_Fifo Scheduler
[Diagram: inputs "From Data Buffer" and "Next CT from Data Buffer"; output "To Data Buffer". One queue of read transactions and one queue of write transactions, with a CAM between them. Example entries: (ready?, dep?, b=5, r=9, c=0, sz=96) and (ready?, dep?, b=3, r=4, c=8, sz=96). A Select stage (based on operating mode logic & GDDR state) feeds the GDDR CMD generator (to GDDR command pins) and consults the GDDR state. Data Write Buffer goes to the GDDR data pins; Data Read Buffer is filled from the GDDR data pins.]

Generic per bank scheduler
[Diagram: inputs "From Data Buffer" and "Next CT from Data Buffers"; output "To Data Buffer". Per-bank blocks BS 0 … BS 7 feed a Channel Transaction Selector + PAM logic (configurable), which drives the GDDR CMD generator (to GDDR command pins) and consults the GDDR state. Data Write Buffer goes to the GDDR data pins; Data Read Buffer is filled from the GDDR data pins.]

nRWBank
[Diagram: inputs "From write buffers" and "Channel Transaction"; output "To read buffers". One queue per bank (BQ 1 … BQ 7 shown) holding both reads and writes. Example entries: (W, r?, r=9, c=0, sz=96) and (R, r?, r=9, c=0, sz=96). The queues feed a Channel Transaction Selector + PAM logic (configurable), which drives the GDDR CMD generator (to GDDR command pins) and consults the GDDR state. Data Write Buffer goes to the GDDR data pins; Data Read Buffer is filled from the GDDR data pins.]

nR+nWBank
[Diagram: inputs "From write buffers" and "Channel Transaction"; output "To read buffers". Each bank queue (BQ 1 … BQ 7 shown) is split into a read queue (RQ) and a write queue (WQ) with CAM logic. The queues feed a Channel Transaction Selector + PAM logic (configurable), which drives the GDDR CMD generator (to GDDR command pins) and consults the GDDR state. Data Write Buffer goes to the GDDR data pins; Data Read Buffer is filled from the GDDR data pins.]

Time Diagram comparing FIFO (FIFO_4) vs. Baseline (1R+1W_4)
[Timing diagram, two traces. Signals: CK / CK#, COMMAND, ADDRESS, RDQS, WRQS, DQ. Commands: WR and RD to (bank, col a) … (bank, col e); one trace issues WR col a, WR col c, RD col b, RD col d, RD col e. Annotated timings: WL = 3, CL = 8, tWRT = 6. Data labels D0 a … D0 e; the last two items (D0 d, D0 e) appear at T39 and T41 in one trace and at T21 and T23 in the other.]

Time Diagram comparing FIFO (FIFO_4) vs. Baseline (1R+1W_4)
[Timing diagram, two traces, same bank X. Signals: CK / CK#, COMMAND, ADDRESS, RDQS, WRQS, DQ. Commands: WR and RD to (bank X, col a) … (bank X, col e); one trace issues WR col a, WR col c, RD col b, RD col d, RD col e. Annotated timings: WL = 3, CL = 8, tWRT = 6, tRtW = 2. Data labels DI a, DI c, DO b, D0 d, D0 e; the last two items (D0 d, D0 e) appear at T39 and T41 in one trace and at T23 and T25 in the other.]

ACT - CAS - PRE
[Timing diagram, two traces. Signals: CK / CK#, COMMAND, ADDRESS, RDQS, WRQS, DQ. Commands: PRE, ACT and RD to bank A and bank B (rows M, N, O; columns a, b, c). Annotated timings: RP = 9, tRCD = 13, tRRD = 9, CL = 8. Time labels: T0, T9, T22 to T24, T30 to T33, T37, T45, T46, T54. Data labels: DO a, DO b, DO c.]
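The annotated timings explain the first time labels in the diagram: a read that needs a new row pays RP, then tRCD, then CL. The short Python sketch below reproduces that arithmetic. The function name and the dictionary layout are made up for illustration, and it covers only a single PRE, ACT, RD sequence on one bank (tRRD, which matters between activations, is left out).

```python
# Timing parameters as annotated on the slide, in clock cycles.
T_RP = 9     # RP: PRE to ACT on the same bank
T_RCD = 13   # tRCD: ACT to RD
CL = 8       # CL: RD command to first read data


def row_miss_schedule(t_pre=0):
    """Earliest cycle of each step for a read that must change the open row."""
    t_act = t_pre + T_RP       # row can be activated once precharge completes
    t_rd = t_act + T_RCD       # column read can issue once the row is open
    t_data = t_rd + CL         # data appears CL cycles after the RD command
    return {"PRE": t_pre, "ACT": t_act, "RD": t_rd, "DATA": t_data}


print(row_miss_schedule(0))
```

Starting from a PRE at T0 this gives ACT at T9, RD at T22 and data at T30, which is consistent with the T9, T22 and T30 labels that survive from the diagram.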

GDDR3 DRAM architecture
[Diagram: GDDR3 simplified design. Inputs: CK, CK#, CKE, CS#, RAS#, CAS#, WE# and the address (@) bits. Blocks: Control Logic with CMD Decode, Address Register, Bank 0 … Bank 7 (each with a Memory Array, a Row Address Latch & Decoder, and Sense & Row buffer), Column Decoder, Data (DDR).]

Shared vs. Distributed
[Diagram: all units reach memory through the MC XBAR. Blocks: VFetch, Clipping, PA, TSetup, Rast, HZ, Scheduler (frags/vtxs), Distributor, uShader + TU units, Z-ROP 0 … Z-ROP 3, C-ROP 0 … C-ROP 3, MC XBAR, MC 0 … MC 3, GDDR 0 … GDDR 7 (two GDDR chips per MC).]

Shared vs. Distributed
[Diagram: the units are connected through an interconnection network (RING, MESH…). Blocks: VFetch, Clipping, PA, TSetup, Rast, HZ, Scheduler (frags/vtxs), Distributor, uShader + TU units, Z-ROP 0 … Z-ROP 3, C-ROP 0 … C-ROP 3, MC 0 … MC 3, GDDR 0 … GDDR 7 (two GDDR chips per MC).]