Flexible VLIW (groups and bundles)
[Figure: a group made of bundles. Labels: group, Bundle (128 bit), Template (5 bit), Cycle Break]
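The bundle layout on this slide can be sketched in a few lines of Python. The slide gives only the 128-bit bundle and the 5-bit template; the three 41-bit instruction slots (5 + 3 x 41 = 128) and the bit positions are assumptions taken from the IA-64 encoding, and the function names are hypothetical. The template itself, which selects execution units and marks cycle breaks, is not decoded here.

```python
TEMPLATE_BITS = 5        # from the slide
SLOT_BITS = 41           # assumption: IA-64 slot width
SLOTS_PER_BUNDLE = 3     # assumption: 5 + 3 * 41 = 128 bits


def pack_bundle(template, slots):
    """Pack a template and three slots into one 128-bit integer."""
    if not 0 <= template < (1 << TEMPLATE_BITS):
        raise ValueError("template must fit in 5 bits")
    if len(slots) != SLOTS_PER_BUNDLE:
        raise ValueError("a bundle holds exactly three slots")
    bundle = template  # template occupies the low 5 bits
    for i, slot in enumerate(slots):
        if not 0 <= slot < (1 << SLOT_BITS):
            raise ValueError("slot must fit in 41 bits")
        bundle |= slot << (TEMPLATE_BITS + i * SLOT_BITS)
    return bundle


def unpack_bundle(bundle):
    """Split a 128-bit integer back into (template, [slot0, slot1, slot2])."""
    template = bundle & ((1 << TEMPLATE_BITS) - 1)
    slot_mask = (1 << SLOT_BITS) - 1
    slots = [
        (bundle >> (TEMPLATE_BITS + i * SLOT_BITS)) & slot_mask
        for i in range(SLOTS_PER_BUNDLE)
    ]
    return template, slots
```

Packing and then unpacking returns the original fields, and the largest possible bundle still fits in 128 bits.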

Pipeline organization
10-stage pipeline, 6 instructions issued simultaneously
Stage groups: Front end, Instruction Delivery, Operand Delivery, Execution
Stages: IPG FET ROT EXP REN WLD REG EXE DET WRB
  IPG: Instruction Pointer generation
  FET: Fetch
  ROT: Rotate
  EXP: Expand
  REN: Rename
  WLD: Word-line decode
  REG: Register Read
  EXE: Execute
  DET: Exception detect
  WRB: Write Back

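Reducing branch instructions
Predicate register: a predicated instruction takes effect only when its predicate register is 1.

With branches:
        cmp   eax, ebx
        jne   L30
        mov   ebx, CONST1
        jmp   L31
    L30: mov  ebx, CONST2
    L31:

With predication:
        cmp.eq p7, p8 = r14, r15 ;;
        (p7) movi r15 = CONST1
        (p8) movi r16 = CONST2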
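To make the effect of predication concrete, here is a minimal Python sketch that models the two code sequences on this slide. It is not a simulator: `predicated_move` is a hypothetical helper standing in for "write the destination only if the predicate is 1", and the example assumes both predicated moves target the same destination.

```python
def select_with_branch(a, b, const1, const2):
    """Branching version: only one of the two moves is ever executed."""
    if a == b:
        return const1
    return const2


def predicated_move(predicate, old_value, new_value):
    """Model of '(p) mov dst = new': the write commits only when p is 1."""
    return new_value if predicate else old_value


def select_with_predication(a, b, const1, const2):
    """Predicated version: both moves are issued, no jump is taken."""
    p7 = a == b        # cmp.eq sets p7 to the result of the comparison ...
    p8 = not p7        # ... and p8 to its complement
    dst = None
    dst = predicated_move(p7, dst, const1)
    dst = predicated_move(p8, dst, const2)
    return dst
```

Both functions return the same value for every input; the predicated form trades the two control transfers for two always-issued instructions.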

Advanced Load

Original code:
        add   r3 = 4, r0 ;;
        st4   [r32] = r3
        ld4   r2 = [r33] ;;
        add   r5 = r2, r3

With advanced load:
        ld4.a r2 = [r33]
        add   r3 = 4, r0 ;;
        st4   [r32] = r3
        ld4.c r2 = [r33] ;;      (Advanced Load Check)
        add   r5 = r2, r3

The data dependence between the hoisted load and the st instruction is resolved by the ALAT (Advanced Load Address Table).
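The role of the ALAT can be sketched with a small Python model of the sequence above. This is a simplification under stated assumptions: addresses are compared exactly (real hardware checks overlapping byte ranges), the table is unbounded, and the class and method names are hypothetical.

```python
class Alat:
    """Toy Advanced Load Address Table: target register -> load address."""

    def __init__(self):
        self.entries = {}

    def advanced_load(self, memory, reg, addr):
        """ld.a: load early and remember which address the register came from."""
        self.entries[reg] = addr
        return memory[addr]

    def store(self, memory, addr, value):
        """st: write memory and drop every entry that matches the address."""
        memory[addr] = value
        self.entries = {r: a for r, a in self.entries.items() if a != addr}

    def check_load(self, memory, reg, addr, early_value):
        """ld.c: keep the early value if the entry survived, else reload."""
        if self.entries.get(reg) == addr:
            return early_value
        return memory[addr]


def run_slide_sequence(memory, r32, r33):
    """Execute the advanced-load version of the slide's code; returns r5."""
    alat = Alat()
    r2 = alat.advanced_load(memory, "r2", r33)    # ld4.a r2=[r33]
    r3 = 4 + 0                                    # add r3=4,r0 (r0 reads as 0)
    alat.store(memory, r32, r3)                   # st4 [r32]=r3
    r2 = alat.check_load(memory, "r2", r33, r2)   # ld4.c r2=[r33]
    return r2 + r3                                # add r5=r2,r3
```

When `r32` and `r33` differ, the early value is kept; when they alias, the store invalidates the entry and the check load picks up the stored value instead.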

Speculative Load

Original code:
    add5:   cmp.eq p6, p5 = r32, r0 ;;
            (p6) add r8 = -1, r0
            (p6) br.ret
            (p5) ld8 r1 = [r32]
            add  r8 = 5, r1
            br.ret ;;

With speculative load:
    add5:   ld8.s  r1 = [r32]
            cmp.eq p6, p5 = r32, r0 ;;
            (p6) add r8 = -1, r0
            (p6) br.ret
            (p5) chk.s r1, return_error
            add  r8 = 5, r1
            br.ret

If a page fault occurs, the load is deferred.
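A minimal Python sketch of the speculative version follows. It assumes the usual IA-64 behaviour in which a faulting `ld.s` marks its target with a deferred-exception token instead of trapping, and `chk.s` branches to recovery code when it sees that token. The `NAT` object, the dictionary used as memory, and the `LookupError` raised in place of the `return_error` handler are all illustrative choices.

```python
NAT = object()  # stands in for the deferred-exception mark on a register


def speculative_load(memory, addr):
    """ld8.s: return the value, or NAT instead of faulting on a bad address."""
    return memory.get(addr, NAT)


def add5(memory, ptr):
    """Return *ptr + 5, or -1 for a null pointer, as in the slide's add5."""
    r1 = speculative_load(memory, ptr)   # hoisted above the null check
    if ptr == 0:                         # (p6): null pointer
        return -1                        # the deferred fault is never observed
    if r1 is NAT:                        # (p5) chk.s r1, return_error
        raise LookupError("return_error: load at %#x faulted" % ptr)
    return r1 + 5
```

The null-pointer case shows why the load can be hoisted safely: the fault on address 0 is deferred and then simply discarded because the checked path is never taken.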

SMT operation
[Figure: issue-slot usage over time for three designs: superscalar, fine-grained multithreaded superscalar, and SMT. Axes: Issue Slots (horizontal), Clock Cycles (vertical)]

Comparison with superscalar
Comparison by Instructions Per Cycle (IPC)

    Workload   OS             superscalar   SMT
    SPECInt    without        3.0           5.9
    SPECInt    with           2.6           5.6
    Apache     without/with   1.1           4.6

SPECInt: not OS intensive application
Apache: OS intensive application

Flynn's classification
- Number of instruction streams: M (Multiple) / S (Single)
- Number of data streams: M/S
  - SISD: uniprocessors (superscalar and VLIW also belong here)
  - MISD: does not exist (Analog Computer)
  - SIMD
  - MIMD

CM-2 processor
[Figure: 1 bit serial ALU with inputs A, B, F and operation OP, outputs s and c. Other labels: Flags, 256 bit memory, Context (C)]
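A 1-bit serial ALU processes multi-bit operands one bit per step, keeping the carry in a flag. The sketch below illustrates the idea in Python. It assumes that OP selects one 8-entry truth table per output, indexed by the three input bits (A, B, F); the table encoding and all names are illustrative and are not taken from CM-2 documentation.

```python
SUM_TABLE = 0x96     # bit i is set when the 3-bit index i has odd parity
CARRY_TABLE = 0xE8   # bit i is set when at least two bits of i are 1


def alu(a, b, f, op_s, op_c):
    """One ALU step: look up both outputs in the selected truth tables."""
    index = (a << 2) | (b << 1) | f
    return (op_s >> index) & 1, (op_c >> index) & 1


def bit_serial_add(x, y, width):
    """Add two width-bit numbers LSB first; returns (sum, carry_out)."""
    flag = 0      # carry kept in a flag between steps
    result = 0
    for bit in range(width):
        a = (x >> bit) & 1
        b = (y >> bit) & 1
        s, flag = alu(a, b, flag, SUM_TABLE, CARRY_TABLE)
        result |= s << bit
    return result, flag
```

An addition of `width` bits therefore takes `width` ALU steps, which is the price paid for fitting many such processing elements on one chip.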

CM-2 processor chip
4096 chips provide 64K PEs
[Figure: single-chip organization. Labels: instruction, Router, 4 x 4 Processor Array (P), 12 links, 4096 Hypercube connection, 256 bit x 16 PE RAM]
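The numbers on this slide fit together: 12 links per chip give a 12-dimensional hypercube of 2^12 = 4096 chips, and 16 PEs per chip give 64K PEs. The Python sketch below shows how neighbours are found in such a hypercube. The dimension-order routing function is a common textbook scheme added for illustration only; it is not claimed to be the algorithm used by the CM-2 router.

```python
DIMENSIONS = 12        # one link per dimension, from the slide
PES_PER_CHIP = 16      # 4 x 4 processor array, from the slide


def neighbors(node, dimensions=DIMENSIONS):
    """Chips directly linked to `node`: flip one address bit per link."""
    return [node ^ (1 << d) for d in range(dimensions)]


def route(src, dst, dimensions=DIMENSIONS):
    """Illustrative dimension-order route: fix differing bits from bit 0 up."""
    path = [src]
    node = src
    for d in range(dimensions):
        bit = 1 << d
        if (node ^ dst) & bit:
            node ^= bit
            path.append(node)
    return path
```

The number of hops equals the number of differing address bits, so no two chips are more than 12 hops apart.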

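An example of UMA: the bus-connected type
[Figure: PUs, each with a Snoop Cache, connected by a shared bus to Main Memory]
Standardized as an off-the-shelf component under the name SMP (Symmetric MultiProcessor)
Can be integrated on a chip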

Stanford's Hydra
Considerations in the design of Hydra, CSL-TR-98-749
[Figure: block diagram. Labels: CPU (each with L1 D Cache, L1 I Cache, Mem. Cont.), Write Through Bus (64 b), Read/Replace Bus (256 b), On-chip L2 Cache, Off-chip L3 Cache Int., Rambus Memory interface, I/O Bus Interface, Cache SRAM Array, DRAM Main Memory, I/O]

Daytona (Lucent)
- MESI Protocol
- RISC+DSP
- Pipelined operation of bus and memory controller
- 128 bit STBus
- 0.25 μm CMOS, 4.5 mm × 6 mm (small chip)

Daytona (Lucent)
[Figure: PE 0 to PE 3, each with an L1 cache, connected by the STBus to the Memory and I/O Controller. Other labels: semaphores, arbiter]

Power4 (IBM)
- 0.18 μm copper process, 400 mm², 170 M Tr.
- Inter-chip interface for MCM (Multi-Chip Module)
- TLP (Thread Level Parallelism)
- Design considering memory bandwidth
- Shared cache + links

Power4 (IBM)
[Figure: block diagram. Components: CPU 1, CPU 2, L2 Shared Cache, L3 Tags, Chip-to-Chip Interconnect, L3 Cache, Main Memory, Expansion Buses. Bandwidth and clock labels: >100 GByte/s; >500 MHz, >35 GByte/s; >333 MHz, >10 GByte/s; >500 MHz, Wave-Pipelined, >10 GByte/s]

MAJC
- Hierarchical structure
- Variable length VLIW processing element
- Shared cache
- I/O for inter-processor communication
- I/O for PCI, DRAM
- MAJC 5200: 0.22 μm CMOS, 220 mm²

Earth Simulator (2002, NEC)
Peak performance: 40 TFLOPS
[Figure: Node 0 to Node 639 connected by the Interconnection Network (16 GB/s x 2). Each node: Vector Processor 0 to Vector Processor 7 and 16 GB of Shared Memory]

SGI Origin
Bristled Hypercube
[Figure labels: Main Memory, Hub Chip, Network]
Main Memory is attached by a direct link from the Hub Chip
2 PEs form 1 cluster

DDM (Data Diffusion Machine)

Band matrix-vector product y = Ax

\[
A = \begin{pmatrix}
a_{11} & a_{12} & 0 & 0 \\
a_{21} & a_{22} & a_{23} & 0 \\
0 & a_{32} & a_{33} & a_{34} \\
0 & 0 & a_{43} & a_{44}
\end{pmatrix}
\]

[Animation in 5 frames: the elements of A and of x stream through multiply-add (×+) cells. Partial results shown in each frame:]
Frame 1: \(x_1\) enters the array; no product has been formed yet
Frame 2: \(y_1 = a_{11} x_1\); \(x_2\) enters
Frame 3: \(y_1 = a_{11} x_1 + a_{12} x_2\), \(y_2 = a_{21} x_1\); \(x_3\) enters
Frame 4: \(y_2 = a_{21} x_1 + a_{22} x_2\)
Frame 5: \(y_2 = a_{21} x_1 + a_{22} x_2 + a_{23} x_3\), \(y_3 = a_{32} x_2\)
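The computation these frames build up can be checked with a short Python sketch. It is a plain sequential loop, not a model of the array's timing: it only touches the three diagonals of the band and accumulates each y_i in the same order as the frames (sub-diagonal term, diagonal term, super-diagonal term). The three-list storage format and the function names are assumptions made for the example.

```python
def band_matvec(lower, diag, upper, x):
    """y = A x for a tridiagonal A stored as three diagonals.

    lower[i] = a(i+2, i+1), diag[i] = a(i+1, i+1), upper[i] = a(i+1, i+2)
    in the 1-based notation of the slides.
    """
    n = len(diag)
    y = [0] * n
    for i in range(n):
        if i > 0:
            y[i] += lower[i - 1] * x[i - 1]   # e.g. a21 * x1
        y[i] += diag[i] * x[i]                # e.g. a22 * x2
        if i < n - 1:
            y[i] += upper[i] * x[i + 1]       # e.g. a23 * x3
    return y


def dense_matvec(a, x):
    """Reference: ordinary dense matrix-vector product."""
    return [sum(aij * xj for aij, xj in zip(row, x)) for row in a]
```

For an n x n band of width 3 this performs at most 3n multiply-adds instead of n², which is what makes a fixed, small number of multiply-add cells sufficient.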

SRAM-based FPGA (Field Programmable Gate Array)
[Figure labels: I/O, Logic Block (5-input table and 2 F.F.), Switch, Configuration Memory (switch settings, Look Up Table)]
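A 5-input look-up table is simply 2^5 = 32 configuration bits addressed by the five inputs, so it can realize any Boolean function of five variables. The Python sketch below illustrates this; the bit ordering (input 0 as the least significant address bit) and the function names are assumptions for the example, not a description of a particular device.

```python
LUT_INPUTS = 5


def program_lut(func, inputs=LUT_INPUTS):
    """Build the configuration word for `func`, one bit per input pattern."""
    config = 0
    for index in range(1 << inputs):
        bits = [(index >> i) & 1 for i in range(inputs)]
        if func(*bits):
            config |= 1 << index
    return config


def lut_read(config, bits):
    """Evaluate the LUT: the inputs form an address into the config word."""
    index = sum(bit << i for i, bit in enumerate(bits))
    return (config >> index) & 1


# Example: "reconfigure" the same logic block as parity, then as a 5-input AND.
PARITY = program_lut(lambda a, b, c, d, e: a ^ b ^ c ^ d ^ e)
AND5 = program_lut(lambda a, b, c, d, e: a & b & c & d & e)
```

Changing the function only requires rewriting the 32 bits in configuration memory; the lookup hardware itself stays the same.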

SRAM-based CPLD (Complex Programmable Logic Device)
[Figure labels: I/O, Logic Block, Switch, SRAM (Configuration Memory)]

Development of Reconfigurable Systems
Categories: Stand Alone, Co-processor, New Device
Timeline:
  1990: 1st FPL
  1992: 1st Japanese FPGA/PLD Conf.
  1993: 1st FCCM
  1995
  2000
Systems shown: SPLASH, SPLASH-2, RM-I, RM-III, YARDS, RM-IV, RM-V, PRISM-II, DISC-II, MPLD, WASMII, Cache Logic, Mult. Context FPGA, HOSMII, ATTRACTOR, FIPSOC, Cont. Switch. FPGA, RASH, PipeRench, PCA, DRL, CHIMERA, Chameleon

RASH (Mitsubishi Electric)
[Figure: RASH unit. Labels: CompactPCI bus, CPU board, EXE boards, display, Ethernet, LAN, CD, disk]
1 Unit: up to 6 EXE boards and a CPU board (Pentium)
Multiple Units can be connected
This slide is supported by Dr. Nakajima of Mitsubishi.

EXE board organization
[Figure labels: PCI-bus, PCI-bus I/F, PCI Local-bus, EXE-board controller, FPGA, SRAM (2 MB), Clocks/Cont. signals, Local-bus]
- Link connection and bus connection
- 2 clock systems
- PCI bus I/F and SRAM on board
- A DRAM add-on board can be mounted
- FPGA: Altera FLEX 10K100A (62K-158K gates)

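ATTRACTOR (NTT)
[Figure labels: high-speed serial link (1 Gbps), ATM I/O, RISC, FPGA, RAM (LUT), ATM SW, Buffer, Ethernet, MPU, Mem., CompactPCI]
- A system specialized for ATM communication processing
- Many kinds of boards can be connected
- Reconfigurable at the board level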

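Garp (Hauser et al. 97)
- A UCB project
- A MIPS core and a Reconfigurable Array are tightly coupled and share the memory hierarchy
- Static analysis by the compiler extracts loops and maps them to hardware
- 43 times the performance of an UltraSPARC in image processing and similar tasks
[Figure labels: Memory queue, MIPS, Cache, Q, Crossbar, 32 bit buses x 5, Reconfigurable Array]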

Chameleon (Chameleon Inc.)
- Field Programmable System Level Integrated Circuits (FPSLICs)
  - A coarse-grained Reconfigurable Processing Fabric, RISC Core, PCI Controller, Memory Controller, DMA Controller, and SRAM integrated on one chip
  - For signal processing and communication protocol processing; 5-10 times the performance of a high-speed DSP

Chameleon CS2112
[Figure: block diagram. Labels: 32-bit PCI Bus, 64-bit Memory Bus, RISC Core, Memory Controller, PCI Cont., 128-bit RoadRunner Bus, Configuration Subsystem, DMA Subsystem, Reconfigurable Processing Fabric, 160-pin Programmable I/O]

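Structure of the Reconfigurable Processing Fabric
[Figure: Slice 0 to Slice 3, each made of Tiles (Tile 0, ...); each Tile contains CTL, LM, and DPUs]
- Up to 8 instructions held in the CTL can be executed in a DPU
- The CTL can determine the next instruction within the same cycle
- The configuration can be changed by loading a new bit stream
- 108 DPUs (Data Path Units) form 4 Slices (3 Tiles each)
- 1 Tile: 9 DPUs = 32 bit ALU x 7, 16 bit + 16 bit multiplier x 2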

DPU organization
- OP: supports C and Verilog operators
- SIMD and pipelining at DPU granularity
[Figure labels: Routing MUX, Instruction, Register & Mask (two instances), Barrel Shifter, OP, Register]