Simulation and Evaluation Framework for Manycore Architectures
Andreas Savva, UCY
Final Project Report
Republic of Cyprus / European Union

OUTLINE
• Introduction to many-core architectures.
• Main technical objectives of the project.
• Project breakdown.
• Work packages.
• Using the developed framework: case studies.
• Simulation and results.
• Project outcomes / deliverables.

Manycore Architectures
• Emerging dominant trend in general-purpose CPUs
• Expected to be interconnected using on-chip networks
• Tens to hundreds of cores
• Simple cores, large parallelism
• Several design parameters:
  • I/O system
  • Processor architecture
  • Interconnection network architecture
• This project aims to:
  • Develop a simulation and evaluation framework so that researchers can explore the design parameters listed above

Main Technical Objectives: Achieved
1. Developed a simulation and evaluation framework for many-core architectures in the Java programming language.
2. Developed benchmarks to evaluate many-core architectures.
3. Developed an on-chip network simulator that supports different architectures, routing algorithms, and traffic patterns.
4. Developed a cross-compiler in C/C++ that translates programs into instructions executable by the architectures under evaluation.
5. Developed new architectures in order to evaluate the framework.

Project Breakdown
• Work Packages:
  • Progress and results dissemination (WP 1, WP 2).
  • Develop a simulator for interconnecting cores (WP 3).
  • Develop models for the execution units and the cores (WP 4).
  • Develop the cross-compiler (WP 5).
  • Create benchmarks to measure performance (WP 6).
  • Develop new architectures to evaluate the framework (WP 7).

Implementation Strategy
[Diagram: WP 1 + WP 2 (progress and results dissemination) overlap with the other work packages: WP 3 (develop many-core simulator), WP 4 (execution units), WP 5 (cross-compiler), WP 6 (benchmarks), WP 7 (evaluate framework).]

Project Management (WP 1)
• Kick-off meeting, December 2008
• Targeted application models developed
• Application design trade-offs
• Roles
• Six-month progress reports
• 18-month (interim) progress report
• Financial issues
• Final progress report
• Final financial issues

Dissemination of Results (WP 2)
• Project website:
  • http://www.ece.ucy.ac.cy/labs/easoc/Research/SEFMA/home.html
• Publications in selected journals and conferences.

WP 3: Simulator for Interconnecting Cores
• Determine specifications for the many-core network simulator.
• Evaluate existing simulation frameworks:
  • POPNET simulator (C++).
  • GPNOC simulator (Java).
• Adapt the simulation framework to simulate our many-core systems.
• Develop traffic models based on many-core applications for future evaluation:
  • Random traffic pattern.
  • Tornado traffic pattern.
  • Transpose traffic pattern.
  • Neighbor traffic pattern.
COMPLETED!
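The slides name the four traffic patterns but do not define them. As a rough illustration only, the sketch below uses the common textbook definitions on a k x k network addressed by (x, y) coordinates; the function names and the coordinate convention are assumptions, not taken from the project's simulator.

```python
import math
import random

# Assumed textbook-style definitions; the project's own traffic models may differ.

def tornado(src, k):
    """Send roughly halfway around the row: x -> (x + ceil(k/2) - 1) mod k."""
    x, y = src
    return ((x + math.ceil(k / 2) - 1) % k, y)

def transpose(src, k):
    """Swap the coordinates: (x, y) -> (y, x)."""
    x, y = src
    return (y, x)

def neighbor(src, k):
    """Send to the next node in the row: x -> (x + 1) mod k."""
    x, y = src
    return ((x + 1) % k, y)

def uniform_random(src, k, rng=random):
    """Pick any node other than the source with equal probability."""
    while True:
        dst = (rng.randrange(k), rng.randrange(k))
        if dst != src:
            return dst
```

On an 8 x 8 network, `tornado((0, 0), 8)` gives `(3, 0)` and `neighbor((7, 3), 8)` wraps around to `(0, 3)`.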

WP 4: Core and Execution Unit Models
• Develop the communication protocol between units and network
• Design and develop unit models:
  • Cores.
  • Memory.
  • Input/output data models.
• Framework to develop models based on the specifications.
COMPLETED!

WP 5: Cross-Compiler
• Create the instruction set architecture.
• Study existing compilers for RISC processors.
• Adapt an existing compiler to translate programs into machine instructions.
• Integrate the compiler into the framework.
COMPLETED!

WP 6: Benchmarks
• Define and evaluate all possible functions of the system based on:
  • Performance
  • Power consumption
  • Reliability
• Develop algorithms to measure performance, power consumption, and reliability.
• Develop benchmarks for many-core processors in assembly language.
COMPLETED!

WP 7: Framework Evaluation
• WP goals:
  • Develop and evaluate novel many-core architectures.
  • Develop and evaluate algorithms for work distribution in many-core processors.
  • Cross-evaluation of the developed framework based on the new many-core architectures.
COMPLETED!

USING/EVALUATING THE FRAMEWORK
Case Studies

Reducing power consumption
• Power consumption: a major limitation in NoCs.
• Links and NoC routers: the most power-hungry components.
• Intel’s Teraflop NoC prototype suggests that link power consumption could be as high as 17%, with the rest consumed by the routers.
• Reduce both static and dynamic power consumption.
• Previously proposed work focuses on simple static threshold mechanisms.
A new intelligent dynamic power management policy for NoCs is needed.

Reducing power consumption
Threshold-based algorithm for turning links off/on:
• Run the simulation and check link utilization.
• Choose a threshold.
• Run the simulation again.
• If the new link utilization is smaller than the threshold, turn the link off for a period of time.
• After x cycles, turn the link back on.
NEXT: A new intelligent dynamic on/off link management for NoCs based on ANNs.
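The static threshold steps above can be sketched as a small bookkeeping routine. This is a minimal illustration under stated assumptions: the function name, the dictionary-based link state, and the parameter names (`threshold`, `off_period`) are hypothetical and do not come from the project's simulator.

```python
def update_link_states(utilization, threshold, off_period, cycle, wake_at):
    """Apply a static on/off threshold to every link.

    utilization: dict mapping link id -> utilization measured over the last interval
    wake_at:     dict mapping link id -> cycle at which an off link turns back on
                 (updated in place)
    Returns the set of links that are on after this update.
    """
    on_links = set()
    for link, util in utilization.items():
        if link in wake_at:
            if cycle < wake_at[link]:
                continue                  # still inside its off period
            del wake_at[link]             # off period elapsed: turn back on
            on_links.add(link)            # not re-checked now; it carried no traffic while off
        elif util < threshold:
            wake_at[link] = cycle + off_period   # under-used link: turn off
        else:
            on_links.add(link)
    return on_links
```

For example, with a threshold of 0.1 and an off period of 100 cycles, a link measured at 0.05 utilization at cycle 0 is switched off and comes back at cycle 100, while a link at 0.6 stays on throughout.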

Reducing power consumption
Artificial Neural Networks
• Information processing paradigm inspired by the way biological neurons process information.
• Composed of a large number of highly interconnected processing elements (neurons) working in unison to solve specific problems.
• Used as prediction and forecasting mechanisms in several application areas.
• Able to determine hidden and strongly non-linear dependencies.

Reducing power consumption
Intelligent ANN algorithm:
• Pre-training.
• Choose links with minimum link utilization
• Size of network more manageable
• Prediction scheme based on ANN
• Divide network into smaller nets
• Pass chosen links as inputs to the ANNs
• Output: links to turn off
ANNs can be used for prediction since they can discover hidden dependencies.
[Figure: Power savings for 8×8 mesh and torus networks]

Reducing power consumption
[Figure: ANN predictor with NoCs; an 8×8 network partitioned into four 4×4 networks, each with its own ANN.]
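The partitioning of the 8×8 network into four 4×4 regions can be illustrated with a simple coordinate mapping. This is only a sketch: the row-major region numbering and the function name are assumptions, since the slide does not say how regions are indexed.

```python
def region_of(x, y, region_size=4, mesh_size=8):
    """Map router coordinates to the index of the square region that contains them.

    Regions are numbered row by row (an assumed convention), so an 8x8 network
    split into 4x4 regions yields indices 0..3.
    """
    regions_per_row = mesh_size // region_size
    return (y // region_size) * regions_per_row + (x // region_size)
```

Each of the four regions then holds 16 routers, and each region's link utilizations would feed that region's own ANN.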

Reducing power consumption
• Experiments with several NoC region sizes
• Compare hardware overheads and the corresponding power savings.
• A 4×4 NoC region offers satisfactory power savings and lower ANN overhead than a 5×5 NoC region.
• A 3×3 NoC region does not provide enough information for the ANN to make accurate predictions.
• We designed the ANN based on the 4×4 NoC region.
[Figure: Power savings and hardware overheads for 3×3, 4×4, and 5×5 NoC regions]

Reducing power consumption
Prediction scheme based on ANN
• The ANN mechanism receives the average link utilizations from all the links of the 4×4 NoC partition.
• The ANN uses the utilization values to find the optimal threshold.
• The threshold determines whether a link is turned off or on for the next n-cycle interval.
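The prediction scheme above can be pictured as one forward pass through a small feed-forward network followed by a per-link comparison. The sketch below is illustrative only: the sigmoid activation, the single output neuron, and all names are assumptions, and the weights would come from the pre-training step rather than from this code.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_threshold(inputs, w_hidden, b_hidden, w_out, b_out):
    """One forward pass: utilization inputs -> hidden layer -> threshold in (0, 1)."""
    hidden = [
        sigmoid(sum(w * x for w, x in zip(weights, inputs)) + bias)
        for weights, bias in zip(w_hidden, b_hidden)
    ]
    return sigmoid(sum(w * h for w, h in zip(w_out, hidden)) + b_out)

def links_to_turn_off(link_utilization, threshold):
    """Links whose average utilization is below the predicted threshold."""
    return sorted(link for link, util in link_utilization.items() if util < threshold)
```

In use, the threshold returned for one region would be applied to that region's links for the next n-cycle interval, after which fresh utilization averages are fed in again.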

Reducing power consumption
ANN hardware optimization
• A 4×4 ANN monitors 16 routers => at least 8 input neurons.
• Eight neurons at the input layer of the ANN => the hidden layer should have five neurons.
• This follows the rule of thumb that a satisfactory number of hidden-layer neurons equals half the number of input neurons plus one.
Try to minimize the size of the hidden layer…
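The rule of thumb above, and the reason a smaller hidden layer is attractive in hardware, can be checked with a few lines of arithmetic. The multiply-accumulate count assumes a fully connected network with a single output neuron, which the slides do not state explicitly; both function names are hypothetical.

```python
def rule_of_thumb_hidden(n_inputs):
    """Half the number of input neurons plus one."""
    return n_inputs // 2 + 1

def multiply_accumulates(n_inputs, n_hidden, n_outputs=1):
    """Multiply-accumulate operations per prediction in a fully connected network."""
    return n_inputs * n_hidden + n_hidden * n_outputs
```

With eight inputs the rule gives five hidden neurons (45 multiply-accumulates per prediction under these assumptions); dropping to four or three hidden neurons reduces that to 36 or 27.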

Reducing power consumption
• Choose an appropriate size for the hidden layer of the ANN.
• Three different ANNs were developed, with five, four, and three neurons in the hidden layer.
• Using four neurons (instead of five) in the hidden layer gives the best power savings for all the traffic patterns.
[Figure: Power savings for different numbers of neurons in the hidden layer]

Reducing power consumption
• How does the bit representation of the training weights affect the threshold computation?
• 24-, 16-, 8-, 6- and 4-bit representations were used.
• 24, 16, 8 and 6 bits show similar power savings, but the savings drop significantly when 4 bits are used, due to reduced training accuracy.
• => 6 bits were chosen, which made the multiplier-accumulation hardware very small.
[Figure: Power savings for different training weight bit representations]
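To see why very short weight representations lose accuracy, consider a simple uniform quantizer. The slides do not describe the fixed-point format actually used, so the symmetric signed grid over [-w_max, w_max] below is purely an assumption for illustration.

```python
def quantize(weight, bits, w_max=1.0):
    """Round a weight to the nearest level of a signed `bits`-bit uniform grid."""
    levels = 2 ** (bits - 1) - 1                  # largest positive integer code
    clipped = max(-w_max, min(w_max, weight))     # saturate out-of-range weights
    code = round(clipped / w_max * levels)
    return code / levels * w_max

def max_quantization_error(bits, w_max=1.0):
    """Worst-case rounding error for in-range weights: half of one grid step."""
    levels = 2 ** (bits - 1) - 1
    return w_max / (2 * levels)
```

Under this assumed format the worst-case rounding error roughly quadruples when going from 6 bits (1/62 of the range) to 4 bits (1/14), which is consistent in spirit with the reported drop in savings at 4 bits.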

Simulation and Results...
• Power savings of the ANN-based mechanism are better than the savings in the other cases.
• The ANN-based mechanism can identify a significant amount of future behavior in the observed traffic patterns.
• It can intelligently select the threshold necessary for the next timing interval.
[Figure: Power savings for 8×8 mesh and torus networks]

Simulation and Results...
• Measure throughput for each mechanism.
• Having no on/off mechanism yields a higher throughput; however, the ANN-based technique shows better throughput than the statically determined threshold techniques.
[Figure: Throughput for 8×8 mesh and torus networks]

Simulation and Results...
• Measure energy for each mechanism.
• The energy consumed using the ANN mechanism is lower than in the other cases.
• The ANN reduces overall energy because of its balanced performance-to-power savings ratio, compared both to having no on/off links and to static threshold computation.
[Figure: Normalized energy for 8×8 torus networks]

Simulation and Results...
• Measure packet latency for each mechanism.
• The ANN-based mechanism incurs more delay, but we believe that the delay penalty is acceptable given the associated power savings.
[Figure: Average packet latency]

Reducing power consumption
New intelligent ANN algorithm:
• Pre-training.
• Choose router ports with minimum port utilization
• Size of network more manageable
• Prediction scheme based on ANN
• Divide network into smaller nets
• Pass chosen ports as inputs to the ANNs
• Output: ports to turn off

Reducing power consumption
• When router ports become unavailable, temporarily or permanently, X-Y routing cannot guarantee a deadlock-free system.
• Since router ports are turned off in our work, a new routing algorithm must be used to make sure that there are no deadlocks.
• Fully adaptive routing algorithms perform better in the presence of faults, but they are very difficult to implement due to their higher overhead in silicon area and energy consumption.
• For this reason, a partially adaptive routing algorithm was chosen in order to achieve a certain degree of fault tolerance in our system.

Reducing power consumption
• The Fault Tolerant Negative First algorithm is based on the turn model.
• It forbids certain turns so that deadlock can be avoided.
• A packet is first routed in the negative direction in each dimension and then in the positive direction: the message first moves west or south until the negative offset is zero, and after that it moves north or east.
[Figure: Negative First routing algorithm in an 8×8 mesh network]
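The negative-first rule described above can be sketched as a next-hop function on a mesh. This is a minimal, deterministic illustration of plain negative-first routing only: it does not model the fault-tolerant extension or switched-off ports, and the coordinate convention (x grows eastward, y grows northward) and function names are assumptions.

```python
def negative_first_next_hop(cur, dst):
    """Return the next router on a minimal negative-first path, or None on arrival."""
    (cx, cy), (tx, ty) = cur, dst
    off_x, off_y = tx - cx, ty - cy
    # Negative phase: finish all west/south moves before any positive move.
    if off_x < 0:
        return (cx - 1, cy)   # west
    if off_y < 0:
        return (cx, cy - 1)   # south
    # Positive phase: only east/north moves remain.
    if off_x > 0:
        return (cx + 1, cy)   # east
    if off_y > 0:
        return (cx, cy + 1)   # north
    return None

def route(src, dst):
    """Full path from src to dst, including both endpoints."""
    path = [src]
    while path[-1] != dst:
        path.append(negative_first_next_hop(path[-1], dst))
    return path
```

For instance, `route((1, 4), (3, 1))` first goes south three hops and only then east two hops. A partially adaptive implementation could choose freely among the allowed moves within each phase; the fixed order here is just for simplicity.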

Simulation Results
• The power savings of the ANN-based mechanism are better than those of the statically determined case and of the case without any on/off ports, for all the traffic models.
[Figure: Power savings for 8×8 mesh and torus networks]

Simulation Results...
• Having no on/off mechanism yields a higher throughput; however, the ANN-based technique yields better throughput than the statically determined threshold.
[Figure: Normalized throughput for 8×8 mesh and torus networks]

Results from the framework use
• The framework can be used by researchers to evaluate many-core architectures.
• It helps to compare how the number of cores affects the total power consumption of the network.
• Intel showed that the number of cores may be affected by power consumption because of the increased number of routers, interconnects, and data travelling through the network.
• Researchers can do parameter exploration related to many-core architectures.
• This new Network on Chip framework helps researchers to solve different NoC tasks through simulation.

Project Outcomes
• Smooth flow of work
• Some simulator problems have been overcome
• Help from Dr. Soteriou and Drs. Michael and Hadjicostis
• Results dissemination on target with project goals.
• Publications in conferences/journals:
  • Participation in the ISVLSI Conference, July 2011, Chennai, India.
  • Publication in the Journal of Electrical and Computer Engineering, Hindawi Publishing Corporation, 2012.
  • Submission to ISVLSI 2012: paper on turning router ports on/off (under review).

Publications
ARTICLES:
• A. Savva, T. Theocharides, V. Soteriou, "Intelligent On/Off Link Management for On-Chip Networks", in Proc. IEEE Annual Symposium on VLSI, pp. 343-344, July 2011.
• Under review: A. Savva, T. Theocharides, V. Soteriou, "Intelligent On/Off Router Ports Management for Networks on Chip", ISVLSI Conference 2012.
JOURNALS:
• Andreas G. Savva, T. Theocharides, V. Soteriou, "Intelligent On/Off Dynamic Link Management for On-Chip Networks", Journal of Electrical and Computer Engineering, vol. 2012, Article ID 107821, 2012.
POSTER:
• Poster at the HiPEAC Ph.D. Student Poster Presentation, Paphos, Cyprus, January 2009.
WORKSHOP:
• Results of this work were presented in a workshop at KIOS Research Centre, 30 Nov. 2011.

Project Deliverables:
• D 1: Six-month, interim, and final reports; financial reports
• D 2: Project website, publications
• D 3: Network communication simulator in Java; four traffic models for simulation and evaluation of the network (source code available)
• D 4: RISC processor models, memory models, core models, input/output models (VHDL/C++ code)
• D 5: Cross-compiler
• D 6: Benchmarks; algorithms for power consumption and performance measurements
• D 7: Many-core architectures; evaluation of the developed framework

Acknowledgements to:
• Dr. Maria K. Michael, for the feedback on the verification and automation algorithms.
• Dr. Christoforos Hadjicostis, for the reliability aspects and the discrete event algorithms employed in building the simulator.
• Dr. Vassos Soteriou, for the feedback on the interconnect.
• Dr. Theocharis Theocharides, for the coordination of this project and all the help.

This work falls under the Cyprus Research Promotion Foundation’s Framework Programme for Research, Technological Development and Innovation 2008 (DESMI 2008), co-funded by the Republic of Cyprus and the European Regional Development Fund, and specifically under Grant PENEK/ENISX/0308.
Republic of Cyprus / European Union

THANK YOU!
Project Host Organization: University of Cyprus
Andreas Savva, Theocharis Theocharides, Maria K. Michael, Christoforos Hadjicostis
Collaborating Partners: Cyprus University of Technology
Vassos Soteriou