Communication Architecture Synthesis
Alessandro Pinto, University of California at Berkeley
apinto@eecs.berkeley.edu

Introduction: what
  • Communication synthesis: what is the “best way” of transferring information between entities?
  • Problem formulation
      • Who wants to communicate with whom
      • How each entity expects the communication mechanism to behave
      • What it is possible to build today
  • How to build a communication architecture that is cheap and satisfies the entities' constraints while being feasible

Introduction: why
  • On-chip applications
      • Lots of components connected together
      • Distance is becoming important
  • There isn’t a formal approach to the problem
  • We build tools which are the glue in the platform-based design approach
  • More or less everything has been done in the design of “functions”

Outline
  • Overview
      • Internet vs. On-chipnet
      • Standard communication structures
      • VSIA + standard comm. platform
  • Constraint-Driven Communication Synthesis
      • A theoretic approach (A. Pinto, L. P. Carloni, ASV)
      • Current research
  • Conclusion

Wire Scaling
  • “As designs scale to newer technologies, they get smaller and their wires get shorter, and the relative change in the speed of wires to the speed of gates is modest.” (M. A. Horowitz et al., “The Future of Wires”)
  • “The real wire problem arises with increasing chip complexity and global communication costs.” (M. A. Horowitz et al., “The Future of Wires”)
  [Figure: local wire vs. global wire under scaling]

What is going to happen
  • From a few closely connected components to a lot of distributed IPs
  • A global net is most likely to be traveled in more than one clock cycle
  • High bandwidth and low power requirements
  • Fault tolerance requirement
  • There is a need for a correct-by-construction methodology

Internet Network
  • Highly distributed
  • Fault tolerant
  • Every node must be reachable

Networks vs. On-Chip Networks
  • Internet
      • The major concern was to guarantee connectivity in case of attack
      • Cost is mostly static
      • Buffer memory is not a problem (512 MB is still cheap)
      • The number and complexity of protocol layers matter only for performance
  • On-chip networks
      • Full connectivity is not the major concern
      • Cost is now mostly dynamic (power)
      • Registers are expensive
      • The protocol must be very light
      • In the future, error correction might become a reality

The beginning of the Internet
  • Back in 1969: the ARPAnet
  [Figure: ARPAnet nodes SRI, UTAH, UCSB, UCLA]

The beginning of On-Chipnet
  • A lot of effort on on-chip communication
      • One of the GSRC main themes: communication-based design
      • Special sections in conferences
  • Different approaches
      • Not always really networks
      • Buses are still widely used

Effect of standardization
  • 1983: TCP/IP was introduced
  • 1987: Commercialization of the Internet
  • TCP/IP technology is transferred from inventors to vendors

Standardization
  • VSIA: Virtual Socket Interface Alliance (http://www.vsi.org)
  • Specifies the Virtual Component Interface (VCI)
      • Peripheral VCI
      • Basic VCI
      • Advanced VCI
  [Figure labels: A, B, VCI Initiator, VCI Target, Bus Master, Bus Slave, Any Bus]

Standard “Bus” Architecture
  • ARM (ARM Ltd.)*
  • CoreFrame (Palmchip)
  • CoreConnect (IBM)
  • WishBone (Silicore)
  • SiliconBackplane (Sonics)*

AMBA Bus Hierarchy
  [Figure labels: HP ARM Core, On-Chip RAM, Off-Chip RAM, DMA, Bridge, UART, Keypad, AHB, APB]
  • AHB: Advanced High-Performance Bus
  • APB: Advanced Peripheral Bus

SiliconBackplane
  • It’s possible to specify constraints for each point-to-point communication
  • It’s possible to specify different OPC
  • Synthesis of a network that guarantees performance

Today’s Platform Example
  • Based on the Intel XScale processor
  • USB, MMC, IrDA, Bluetooth
  • PCMCIA, SDRAM
  [Figure labels: CPU, SDRAM, IrDA, PCMCIA, Bluetooth, Bridge, MMC, USB]

Where are we headed?
  • Communication is specified in terms of point-to-point constraints
  • A target implementation technology is abstracted into a library
  • A synthesis procedure generates a communication architecture
      • That satisfies the constraints
      • That uses only the library elements

A Platform Integrator Environment
  [Figure labels: Arch. Plt. Builder, RTOS, μC 1, μC 2, DSP, HW, MEM, To Silicon]

Communication Topology Design
  • Platform-based design in communication

The Problem
  [Figure, shown in four animation steps: modules M1, M2, M3, M4, M5]

Previous Work
  • S. Dey et al., “System-Level Performance Analysis for Designing On-Chip Communication Architectures”, TCAD, Vol. 20, No. 6, June 2001
  • S. Dey et al., “Efficient Exploration of the SOC Communication Architecture Design Space”, Int. Conf. on CAD, August 1999
  • J. M. Daveau et al., “Protocol Selection and Interface Generation for HW-SW Codesign”, TVLSI, Vol. 5, No. 1, March 1997
  • J. M. Daveau et al., “Synthesis of System-Level Communication by an Allocation-Based Approach”, Int. Symposium on System Synthesis, 1995

Our Approach: Abstract Model
  • System modules communicate by means of point-to-point unidirectional channels
  • Each channel is characterized by a set of communication constraints (distance, minimum bandwidth)
  • The specific functionality of each module is “abstracted away”

The Goal – Overview of the Method
  [Figure labels: Constraint Propagation, Performance Abstraction]

Abstraction (Intuition)
  • This is the basic library of communication components
  [Figure labels: cost vs. length curves Cs(b) for bandwidths b1 and b2; relay station (clk) with cost Cr(b); switches]

Communication-Constraint Graph
  • G = (V, A)
  • ∀ v ∈ V: p(v) is the vertex position in our coordinate system
  • ∀ a ∈ A:
      • The arc distance d(a)
      • The arc bandwidth b(a)
  [Figure labels: M1, M5, v2, a]
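The constraint graph on this slide could be held in a small data structure like the one below. This is only a sketch: the names `Arc` and `ConstraintGraph`, the Manhattan norm, and the choice to derive d(a) from the endpoint positions are assumptions for illustration, since the slide only says that each arc carries a distance d(a) and a bandwidth b(a).

```python
from dataclasses import dataclass, field

Point = tuple[float, float]


def manhattan(p: Point, q: Point) -> float:
    """Assumed norm: |dx| + |dy| (a common choice for on-chip wiring)."""
    return abs(p[0] - q[0]) + abs(p[1] - q[1])


@dataclass(frozen=True)
class Arc:
    src: str          # name of the source vertex u
    dst: str          # name of the destination vertex v
    bandwidth: float  # b(a): minimum bandwidth required on the channel


@dataclass
class ConstraintGraph:
    positions: dict[str, Point] = field(default_factory=dict)  # p(v) for v in V
    arcs: list[Arc] = field(default_factory=list)               # the arc set A

    def distance(self, a: Arc) -> float:
        """d(a), assumed here to be the distance between the arc endpoints."""
        return manhattan(self.positions[a.src], self.positions[a.dst])


# Hypothetical two-module example (positions are made up).
g = ConstraintGraph(positions={"M1": (0.0, 0.0), "M5": (3.0, 4.0)})
g.arcs.append(Arc("M1", "M5", bandwidth=4.0))
print(g.distance(g.arcs[0]))  # 7.0
```

The module functionality does not appear anywhere in the structure, which matches the abstract model: only positions and per-channel constraints are kept.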

Library
  • Library = L ∪ N, communication links and communication nodes
  • ∀ l ∈ L:
      • d(l) is the link length
      • b(l) is the link bandwidth
      • c(l) is the link cost
  • ∀ n ∈ N: c(n) is the cost of the communication node
  [Figure labels: links l1, l2, l3; nodes n, r]

Communication Implementation
  [Figure labels: G = (V, A); arc with (b(a), d(a)) = (4, 5); links (1, 2), (2, 2), (1, 3); nodes n, r]

Communication Implementation: Arc Matching
  [Figure labels: arc (4, 5); links (1, 2), (2, 2), (1, 3); nodes n, r]

Communication Implementation: Arc Segmentation
  [Figure labels: arc (4, 5); links (1, 2), (2, 2); nodes n, r]

Communication Implementation: Arc Duplication + Arc Segmentation
  [Figure labels: links (1, 2), (2, 2); nodes n, r]

Communication Implementation: Arc Merging
  [Figure labels: links (1, 2), (2, 2), (1, 3); nodes n, r]
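The segmentation and duplication steps can be illustrated with a short cost calculation. The sketch below assumes that the pairs in the figures are (bandwidth, length) in the same order as (b(a), d(a)), that a repeater sits between consecutive segments, and that parallel chains need no extra node; the link costs and the repeater cost are made-up values, and `point_to_point_cost` and `cheapest_link` are hypothetical helper names.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Link:
    bandwidth: float  # b(l)
    length: float     # d(l)
    cost: float       # c(l), hypothetical value for this example


def point_to_point_cost(b_a, d_a, link, repeater_cost=0.0):
    """Cost of implementing one arc with copies of a single link type.

    Segmentation: chain ceil(d(a)/d(l)) links, one repeater between
    consecutive segments. Duplication: ceil(b(a)/b(l)) chains in parallel.
    """
    segments = math.ceil(d_a / link.length)
    chains = math.ceil(b_a / link.bandwidth)
    per_chain = segments * link.cost + (segments - 1) * repeater_cost
    return chains * per_chain


def cheapest_link(b_a, d_a, library, repeater_cost=0.0):
    """Pick the link type with the lowest point-to-point cost for the arc."""
    return min(library, key=lambda l: point_to_point_cost(b_a, d_a, l, repeater_cost))


# Links (1, 2), (2, 2), (1, 3) from the slides, with assumed costs.
library = [Link(1, 2, 1.0), Link(2, 2, 1.5), Link(1, 3, 1.2)]
best = cheapest_link(4, 5, library, repeater_cost=0.5)
print(best, point_to_point_cost(4, 5, best, repeater_cost=0.5))
```

With these assumed costs, the arc (4, 5) is cheapest as two parallel chains of three (2, 2) links. Arc merging is not covered here because it couples several arcs together.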

Implementation Graph
  • G’(G, L ∪ N) = (V’ ∪ N’, A’)
  • ∀ v’ ∈ V’: v’ corresponds to a vertex of V
  • ∀ n’ ∈ N’: n’ is an instance of an element of N
  • ∀ a’ ∈ A’: a’ is an instance of an element of L
  • ∀ a = (u, v) ∈ A, ∃ P(a), a set of paths s.t.
      • P(a) connects u’ with v’ without passing through any other computational vertex
      • P(a) satisfies the bandwidth constraint of a
  • The cost of an implementation graph is the sum of the costs of its link and node instances: C(G’) = Σ_{a’ ∈ A’} c(a’) + Σ_{n’ ∈ N’} c(n’)

The Optimization Problem

Assumptions
  • Assumption on the library: d(a1) > d(a2), b(a1) > b(a2) ⇒ c(a1) > c(a2)
  [Figure: k-way merging structure q*]

Candidate Implementations
  • Generate only k-way mergings that could be part of the final implementation
  • Derive mergeability conditions based only on positions and bandwidth of the constraint graph and the library

2-Way Mergeability
  [Figure, shown in three animation steps: two arcs with sources u1, u2 and destinations v1, v2]

2-Way Mergeability
  • d(a1) + d(a2) ≤ ||p(u1) – p(u2)|| + ||p(v1) – p(v2)||  ⇒  {a1, a2} not mergeable
  [Figure: arcs a1 = (u1, v1) and a2 = (u2, v2), with the separations ||p(u1) – p(u2)|| and ||p(v1) – p(v2)|| marked]

k-Way Mergeability (distance)
  • Σ_{i=1..k-1} d(ai) + (k-1) d(ak) ≤ Σ_{i=1..k-1} ( ||p(ui) – p(uk)|| + ||p(vi) – p(vk)|| )  ⇒  {a1 … ak} not mergeable
  [Figure: arcs a1 = (u1, v1) … ak = (uk, vk), with the separations ||p(u1) – p(uk)|| and ||p(v1) – p(vk)|| marked]
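The distance condition can be checked mechanically. The sketch below assumes it reads "the arc distances, each paired with the pivot's distance, sum to no more than the total separation of sources and destinations from the pivot", which reduces to the 2-way condition for two arcs. The Manhattan norm, the convention d(a) = ||p(u) – p(v)||, and the function name `distance_rules_out_merging` are assumptions for illustration.

```python
def manhattan(p, q):
    """Assumed norm ||p - q||: Manhattan distance."""
    return abs(p[0] - q[0]) + abs(p[1] - q[1])


def distance_rules_out_merging(arcs, pivot):
    """Return True if the distance condition excludes merging all `arcs`.

    arcs  : list of (p_u, p_v) endpoint positions, one pair per arc
    pivot : index of the arc playing the role of a_k in the condition
    """
    u_k, v_k = arcs[pivot]
    others = [a for i, a in enumerate(arcs) if i != pivot]
    # Left side: sum of d(a_i) + d(a_k) over the non-pivot arcs.
    lhs = sum(manhattan(u, v) + manhattan(u_k, v_k) for u, v in others)
    # Right side: separation of sources and of destinations from the pivot.
    rhs = sum(manhattan(u, u_k) + manhattan(v, v_k) for u, v in others)
    return lhs <= rhs


# Two long parallel arcs that run close together: merging stays possible.
close = [((0, 0), (10, 0)), ((0, 1), (10, 1))]
# Two short arcs that are far from each other: merging is ruled out.
far = [((0, 0), (1, 0)), ((0, 10), (1, 10))]
print(distance_rules_out_merging(close, pivot=1))  # False
print(distance_rules_out_merging(far, pivot=1))    # True
```

The intuition is that a merged structure must at least bridge the gap between the sources and the gap between the destinations, so when those gaps already cost as much wire as the separate arcs, sharing cannot pay off.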

k-Way Mergeability (bandwidth)
  • Σ_{i=1..k} b(ai) ≥ max_{l ∈ L} b(l) + min_{j=1..k} b(aj)  ⇒  {a1 … ak} not mergeable
  [Figure: arcs from u1 … uk to v1 … vk]
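A sketch of the bandwidth condition, assuming it reads "the total bandwidth of the arcs is at least the largest link bandwidth in the library plus the smallest arc bandwidth". The function name `bandwidth_rules_out_merging` and the sample values are hypothetical.

```python
def bandwidth_rules_out_merging(arc_bandwidths, link_bandwidths):
    """Return True if the bandwidth condition excludes merging the arcs.

    arc_bandwidths  : b(a_i) for every arc in the candidate set
    link_bandwidths : b(l) for every link type in the library
    """
    total = sum(arc_bandwidths)
    return total >= max(link_bandwidths) + min(arc_bandwidths)


# Library with a single link type of bandwidth 2.
print(bandwidth_rules_out_merging([1, 1], [2]))     # False: 2 < 2 + 1
print(bandwidth_rules_out_merging([1, 1, 1], [2]))  # True:  3 >= 2 + 1
```

For two arcs the test reduces to asking whether the larger arc alone already needs the fastest link, in which case no single link can carry both.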

Expansion Rule
  • If a ∈ A is not mergeable with any set of k-1 arcs in A, then a is not mergeable with any set of k arcs in A
  [Figure labels: A, a, Akmax]

Summary of Pruning Conditions
  1. k-way mergeability condition over distance; k-way mergeability condition over bandwidth
  2. k → k+1 mergeability condition
  3. All the pruning conditions are based on positions of ports and properties of arcs in the constraint graph, and on the maximum bandwidth in the library

Solving the Optimization Problem

Constrained Distance Sum Matrix
  • Dij = d(ai) + d(aj)
  • Example for column aj and rows ak, am:
    Dkj + Dmj = d(ak) + d(aj) + d(am) + d(aj) = 2 d(aj) + d(ak) + d(am) = (k-1) d(aj) + Σ_i d(ai)

Merging Distance Sum Matrix
  • Γij = ||p(ui) – p(uj)|| + ||p(vi) – p(vj)||
  • Example for column aj and rows ak, am:
    Γkj + Γmj = ||p(uk) – p(uj)|| + ||p(vk) – p(vj)|| + ||p(um) – p(uj)|| + ||p(vm) – p(vj)|| = Σ_i ( ||p(ui) – p(uj)|| + ||p(vi) – p(vj)|| )
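Both matrices can be built directly from the arc endpoints, and a column sum over a set of rows gives the two sides of the distance test. This is a sketch under assumptions: Manhattan norm, d(a) taken as the distance between the endpoints of a, and `G` standing in for the merging distance sum matrix; the helper names are hypothetical.

```python
def manhattan(p, q):
    """Assumed norm ||p - q||: Manhattan distance."""
    return abs(p[0] - q[0]) + abs(p[1] - q[1])


def distance_matrices(arcs):
    """Build D (constrained distance sums) and G (merging distance sums).

    arcs: list of (p_u, p_v) endpoint positions, one pair per arc.
    """
    n = len(arcs)
    d = [manhattan(u, v) for u, v in arcs]
    D = [[d[i] + d[j] for j in range(n)] for i in range(n)]
    G = [[manhattan(arcs[i][0], arcs[j][0]) + manhattan(arcs[i][1], arcs[j][1])
          for j in range(n)] for i in range(n)]
    return D, G


def column_sum(M, rows, j):
    """Sum of column j of M restricted to the given rows."""
    return sum(M[i][j] for i in rows)


# Three parallel arcs of length 4, one unit apart (made-up positions).
arcs = [((0, 0), (4, 0)), ((0, 1), (4, 1)), ((0, 2), (4, 2))]
D, G = distance_matrices(arcs)
print(column_sum(D, [0, 2], 1))  # 16 = 2*d(a1) + d(a0) + d(a2)
print(column_sum(G, [0, 2], 1))  # 4  = separation of a0 and a2 from a1
```

Precomputing the matrices once means every candidate subset only costs a few additions, which matters because the number of subsets grows quickly with k.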

Algorithm
    k ← 1; foundCandidateMapping ← TRUE;
    while (foundCandidateMapping) {
        for all columns j ∈ col(Γ) {
            kWayMergingNotFound ← TRUE;
            for all subsets of rows R = {i1 … ik} ⊆ row(Γ)
    (1)         if sum(Γ, R) < sum(D, R)
                    if ∃ l ∈ L s.t. sum(B, R) + B[j] < b(l) + bmin(R, j) {
                        S ← S ∪ kMergingImplementation(R, j);
                        kWayMergingNotFound ← FALSE
                    }
    (2)     if (kWayMergingNotFound) col(Γ) ← col(Γ) \ j;
        }
    (3) if col(Γ) = ∅ foundCandidateMapping ← FALSE;
        else k ← k+1;
    }
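The loop on this slide can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' implementation: the comparison directions follow one reading of the pruning conditions (a set is kept only when neither the distance condition nor the bandwidth condition rules it out), the norm is Manhattan, d(a) is the distance between the endpoints, and instead of building a merging implementation the code only records the index set of each candidate.

```python
from itertools import combinations


def manhattan(p, q):
    """Assumed norm ||p - q||: Manhattan distance."""
    return abs(p[0] - q[0]) + abs(p[1] - q[1])


def candidate_mergings(arcs, link_bandwidths):
    """Enumerate candidate mergings as frozensets of arc indices.

    arcs            : list of (p_u, p_v, bandwidth)
    link_bandwidths : b(l) for every link type in the library
    """
    n = len(arcs)
    d = [manhattan(u, v) for u, v, _ in arcs]
    b = [bw for _, _, bw in arcs]
    D = [[d[i] + d[j] for j in range(n)] for i in range(n)]
    G = [[manhattan(arcs[i][0], arcs[j][0]) + manhattan(arcs[i][1], arcs[j][1])
          for j in range(n)] for i in range(n)]
    b_max = max(link_bandwidths)

    cols = set(range(n))  # pivot arcs that may still take part in a merging
    candidates = set()
    k = 1                 # number of arcs merged with the pivot
    while cols:
        for j in sorted(cols):
            found = False
            rows = [i for i in range(n) if i != j]
            for R in combinations(rows, k):
                # (1) distance test, then bandwidth test
                dist_ok = sum(G[i][j] for i in R) < sum(D[i][j] for i in R)
                total_b = sum(b[i] for i in R) + b[j]
                b_min = min(min(b[i] for i in R), b[j])
                if dist_ok and total_b < b_max + b_min:
                    candidates.add(frozenset(R) | {j})
                    found = True
            if not found:
                cols.discard(j)  # (2) expansion rule: drop this pivot
        k += 1                   # (3) stops once no column is left
    return candidates


# Two nearby parallel arcs and one far away (made-up data), link bandwidth 2.
arcs = [((0, 0), (10, 0), 1), ((0, 1), (10, 1), 1), ((0, 50), (10, 50), 1)]
result = candidate_mergings(arcs, [2])
print(result)
```

On this data only the two nearby arcs survive as a 2-way candidate. The loop always terminates, because once k reaches the number of arcs no subset of rows exists and every remaining column is dropped.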

A Simple Example
  • Arcs a1, a2, a3 with endpoints at (0, 1), (0.4, 1), (0, 0), (0.4, 0), (1.4, 1), (1.4, 0)
  • Library L: one link with b(l) = 2, d(l) = 1, c(l) = 1
  [Figure, shown in three animation steps: placement of a1, a2, a3 and the matrices B, D and Γ indexed by a1, a2, a3]

Examples
  • Lib 1: three 3-way mergings; one 9-way merging

Examples
  • Lib 2: all point-to-point; one 9-way merging

Limitations
  • Alternation [Figure labels: u1, u2, v1, u3, v2, v3]
  • Bi-directionality [Figure labels: v1, u2, u1, u3, v2, v3]

Library Characterization
  • Main concerns (costs)
      • Area
      • Power
  • Costs are affected by
      • Geometric and physical properties of wires (our basic communication primitives)
      • Communication speed

Wires: Geometry and Physics
  [Figure labels: buffer, sopt, lcrit]
  • Depends only on the technology, not on the metal layer

An Example
  • Based on Intel 0.13 µm
  [Figure: energy consumption for a series of lcrit segments and buffers of size sopt]
  [Figure: optimum buffer size in terms of minimum size]

Wires Trade-off
  • τcrit is constant
  • For the same length, different metal layers carry different bandwidth
  • A wire in M(n) is going to cost more in terms of area and power than a wire in M(n-1)
      • Buffers are bigger
  • The library is convex

Conclusion
  • A chip will look like a network of components
      • Highly distributed
      • Multi-clock-cycle nets
  • CAD tools for On-Chipnet design must
      • Allow specification of communication constraints
      • Find the minimum-cost network. Costs are:
          • Power
          • Area (stateful/stateless repeaters)
          • Blocking time (queues)
      • Ensure network reliability and correctness
  • Constraint-Driven Communication Synthesis is our approach to the problem

Future Work (Available Projects)
  • Ad hoc flow control protocol for low-power, minimum-buffering On-Chipnet
  • Implementation and performance/cost abstraction of fast low-power on-chip routers
  • Interface with Metropolis
  • On-Chipnet VHDL netlist generator