Scalable Vector Processors for Embedded Systems Kozyrakis Patterson

Scalable Vector Processors for Embedded Systems
Kozyrakis, Patterson
Presentation by: Mohamed Abuobaida Mohamed
For COE 502: Parallel Processing Architectures

Outline
• Introduction
• Instruction Set
• Compiler
• The Design
• Evaluation
• Clustered Processor
• Conclusion

Introduction
• Embedded processors require low power and low complexity
• Performance and scalability are primary concerns
• Superscalar and VLIW designs exploit instruction-level parallelism (ILP)
• Superscalar requires complex hardware to detect dependences
• VLIW requires a very thorough compiler
• Scaling either approach is difficult

Introduction
• Multimedia and telecommunications workloads have data-level parallelism (DLP)
• Revisit the vector architectures used in supercomputers
• Introduce Vector IRAM (VIRAM)

Instruction Set
• Coprocessor extension to MIPS
• Vector Register File (VRF)
  ◦ 32 registers
  ◦ Integer and floating point
  ◦ Flag register
• Vector operations
  ◦ Arithmetic: integer and floating point
  ◦ Logical operations
  ◦ Other functions, e.g. population count

Instruction Set
• Supports three common memory access patterns and virtual addressing
• Elements can be 64, 32, or 16 bits wide
• The 64-bit datapath can execute multiple narrow elements at once
• Element permutation is limited to the patterns needed for dot products and fast Fourier transforms
• Supports speculative execution using the flag register
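The slide does not name the three access patterns. The sketch below assumes the usual vector-architecture trio of unit-stride, strided, and indexed (gather/scatter) accesses, and only illustrates which byte addresses each pattern touches. The function names are hypothetical and are not VIRAM instruction names.

```python
def unit_stride(base, vl, elem_bytes):
    """Addresses of vl consecutive elements starting at base."""
    return [base + i * elem_bytes for i in range(vl)]


def strided(base, vl, stride_bytes):
    """Addresses of vl elements separated by a constant byte stride."""
    return [base + i * stride_bytes for i in range(vl)]


def indexed(base, offsets):
    """Addresses given by a vector of byte offsets (gather/scatter)."""
    return [base + off for off in offsets]


# Four 16-bit elements read back to back.
print([hex(a) for a in unit_stride(0x1000, 4, 2)])
# One column of a row-major matrix whose rows are 16 bytes long.
print(strided(0, 3, 16))
# Irregular accesses driven by an index vector.
print(indexed(100, [8, 0, 24]))
```

Unit-stride is the special case of a strided access whose stride equals the element size.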

The Compiler
• Based on the PDGCS compilation system for Cray supercomputers
• Extensive vectorization techniques:
  ◦ Outer-loop vectorization
  ◦ Handling partially vectorizable constructs
• Does not require special functions or custom libraries
• Requires pragmas for irregular scatter/gather patterns

The Compiler
• Selects operation and element width
• Recognizes reductions
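As an illustration of what recognizing a reduction buys, the sketch below contrasts a scalar dot-product loop with one plausible vectorized form: strip-mined element-wise accumulation followed by a log-step fold. This is a generic technique, not necessarily the code the VIRAM compiler emits. The names and the maximum vector length `mvl` (assumed to be a power of two) are hypothetical.

```python
def scalar_dot(a, b):
    """The loop-carried reduction as a programmer would write it."""
    acc = 0
    for x, y in zip(a, b):
        acc += x * y
    return acc


def vector_dot(a, b, mvl=8):
    """Same reduction, restructured for a vector unit (mvl is a power of two)."""
    partial = [0] * mvl  # models one vector register of partial sums
    # Strip-mine: each iteration models one vector multiply-add of up to mvl elements.
    for start in range(0, len(a), mvl):
        chunk_a = a[start:start + mvl]
        chunk_b = b[start:start + mvl]
        for i in range(len(chunk_a)):
            partial[i] += chunk_a[i] * chunk_b[i]
    # Fold the partial sums in log2(mvl) steps, adding the upper half to the lower half.
    width = mvl
    while width > 1:
        half = width // 2
        for i in range(half):
            partial[i] += partial[i + half]
        width = half
    return partial[0]


a = list(range(1, 20))
b = list(range(2, 21))
print(scalar_dot(a, b) == vector_dot(a, b))
```

The fold step is where element permutation is needed, which matches the instruction set's restriction of permutations to dot-product and FFT patterns.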

The Design
• Coprocessor to a 64-bit MIPS core
• VRF capacity is 8 KB
  ◦ Each register holds 32 64-bit, 64 32-bit, or 128 16-bit elements
• Each lane has two 64-bit ALUs and a vector load/store unit
• On-chip 13 MB DRAM organized as 8 banks
• The scalar core is a single-issue, in-order MIPS
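The element counts per register follow from the VRF capacity and the 32-register count given in the instruction set slides. A short arithmetic check, with hypothetical constant and function names:

```python
VRF_BYTES = 8 * 1024   # VRF capacity from the slide
NUM_REGISTERS = 32     # vector registers in the instruction set


def elements_per_register(elem_bits):
    """Number of elements of the given width that fit in one vector register."""
    bits_per_register = VRF_BYTES * 8 // NUM_REGISTERS  # 2048 bits
    return bits_per_register // elem_bits


for width in (64, 32, 16):
    print(width, "bit elements per register:", elements_per_register(width))
```

Halving the element width doubles the number of elements per register, which is why narrow multimedia data types get a longer maximum vector length.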

The Design
• Operates at 200 MHz with 2 W power consumption

Evaluation

Clustered Processor
• VIRAM has a complex VRF
  ◦ Approx. 3 ports per FU
• Proposed: replace the centralized VRF with a clustered VRF
• A cluster has a datapath for one FU and a few vector registers
• Each cluster has access to the intercluster network
• Area, power, and latency per cluster are constant

Clustered Processor
• Renaming is used to take advantage of the clustered configuration
• It is done using a renaming table that identifies the source and destination
• It can be used to implement more than 32 registers
• Clustering improves scaling
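A minimal sketch of how such a renaming table might work, assuming it maps each architectural vector register to a (cluster, physical register) pair. The class, its methods, and the allocation policy are hypothetical and simplified; in particular, a real design would release the old physical register only after its last reader has finished.

```python
class RenameTable:
    """Maps architectural vector registers to (cluster, physical register) pairs."""

    def __init__(self, num_clusters, regs_per_cluster):
        # Free physical registers, tracked separately for each cluster.
        self.free = {c: list(range(regs_per_cluster)) for c in range(num_clusters)}
        self.mapping = {}

    def rename(self, dest, sources, cluster):
        """Rename one instruction that will execute in the given cluster."""
        # Sources are read through the current mapping; they may live in other clusters.
        phys_sources = [self.mapping[s] for s in sources]
        if not self.free[cluster]:
            raise RuntimeError("no free vector register in cluster %d" % cluster)
        new_loc = (cluster, self.free[cluster].pop(0))
        old_loc = self.mapping.get(dest)
        if old_loc is not None:
            # Simplification: release the previous copy immediately.
            self.free[old_loc[0]].append(old_loc[1])
        self.mapping[dest] = new_loc
        return phys_sources, new_loc


table = RenameTable(num_clusters=2, regs_per_cluster=2)
table.rename("v1", [], cluster=0)
table.rename("v2", [], cluster=1)
sources, dest = table.rename("v3", ["v1", "v2"], cluster=0)
print(sources, dest)
```

Because the table decouples architectural names from physical storage, the total number of physical registers across clusters can exceed the 32 named by the instruction set. A source that maps to a different cluster than the executing one implies a transfer over the intercluster network.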

Clustered Processor: Evaluation

Conclusion
• Designed for embedded systems
  ◦ Area, power, and performance
• Exploits DLP
• Instruction set, VRF
• Vectorizing compiler
• Evaluation
• Clustered configuration