Convey Computer Status
Steve Wallach
swallach”at”conveycomputer.com
swallach - April 2009 - HPC Users Forum

Company Background
• Started in June 2007
  – 28 people
• Raised $15.1 million, Series A
  – Intel, Xilinx, Centerpoint, Interwest, Rho
• Located in Richardson, Texas
• Announced at SC'08
  – Markoff article in the New York Times
• Convey Convex++
  – No plans for Convez

The Convey Hybrid-Core Computer
• Extends x86 ISA with performance of a hardware-based architecture
• Adapts to application workloads
• Programmed in ANSI standard C/C++ and Fortran
• Leverages x86 ecosystem

Product
• Reconfigurable coprocessor to Intel x86-64
• Shared 64-bit virtual and physical memory (cache coherent)
• Coprocessor executes instructions that are viewed as extensions to the x86 ISA
• Convey-developed compilers (C/C++ and Fortran, based on Open64)
  – Automatic vectorization/parallelization
    • SIMD multi-threading
  – Generates both x86 and coprocessor instructions

Convey - ISA
[Diagram labels: X86 ISA; Systolic; Bio-Informatics; Vector (32-bit float); Signal/Imaging; Vector (64-bit float); Finite Element; Finance (float); Data Mining; Sorting/Tree Traversal; Bit/Logical]

Inside the Coprocessor
• Application Engines (AEs)
  – Personalities dynamically loaded into AEs implement application-specific instructions
• Scalar Processing, Instruction Fetch/Decode, Host Interface
  – System interface and memory management implemented by coprocessor infrastructure
• Direct I/O interface
• Crossbar
  – Non-blocking
  – Virtual output queuing
  – Round-robin arbitration
• Memory controllers
  – 16 DDR2 memory channels
  – Standard or Scatter-Gather DIMMs
  – 80 GB/sec throughput

Convey Scatter-Gather DIMMs
• Standard DIMMs are optimized for cache line transfers
  – performance drops dramatically when access pattern is strided or random
• Convey Scatter-Gather DIMMs are optimized for 8-byte transfers
  – deliver high bandwidth for random or strided 64-bit accesses
  – prime number (31) interleave maintains performance for power-of-two strides
  – Supports both SIMD and parallel multi-threading compute models
  – Out-of-order loads and stores
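Why a prime interleave helps with power-of-two strides can be seen in a small sketch. It assumes, purely for illustration, that 8-byte word index w maps to bank w mod banks; the slide does not describe the actual address mapping, and the function name `distinct_banks` is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_BANKS 64

/* Count how many distinct banks a strided stream of 8-byte words touches.
 * Assumption (not from the slides): word index w lives in bank w % banks. */
size_t distinct_banks(size_t banks, size_t stride, size_t accesses)
{
    assert(banks > 0 && banks <= MAX_BANKS);

    bool seen[MAX_BANKS] = { false };
    size_t count = 0;

    for (size_t i = 0; i < accesses; i++) {
        size_t bank = (i * stride) % banks;  /* bank hit by the i-th access */
        if (!seen[bank]) {
            seen[bank] = true;
            count++;
        }
    }
    return count;
}
```

Under this model, a stride of 32 words over 32 banks lands on a single bank every time, while the same stride over 31 banks cycles through all 31, because 31 shares no factor with any power of two.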

Personalities
• A personality implements a set of extended instructions
  – multiple personalities may be installed on the system
  – one is active on coprocessor at any one time
  – reloaded dynamically by the operating system as needed
• Vector personalities
  – implement a load/store vector accumulator architecture with multiple function pipes
  – Convey vectorizing compilers automatically identify loops that can be executed with vector instructions
  – can operate on floating point, integer, or bit data
• “Procedural” personalities
  – implement an entire routine or algorithm in logic
  – invoked by one or more instructions
  – called as procedures or functions
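As an illustration of the kind of loop a vectorizing compiler typically looks for, here is a minimal sketch in standard C. Nothing in it is Convey-specific: no pragmas or compiler flags are shown because the slides do not give any, and the function names are chosen only for this example.

```c
#include <assert.h>
#include <stddef.h>

/* y[i] = a * x[i] + y[i]
 * Every iteration is independent of the others, and `restrict` promises
 * that x and y do not overlap, so the loop is a natural candidate for
 * vector instructions. */
void saxpy(size_t n, float a, const float *restrict x, float *restrict y)
{
    assert(x != NULL && y != NULL);
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}

/* Running sum in place.
 * Iteration i reads the value written by iteration i - 1 (a loop-carried
 * dependence), so it cannot be turned into element-wise vector operations
 * without restructuring the algorithm. */
void prefix_sum(size_t n, float *y)
{
    assert(y != NULL);
    for (size_t i = 1; i < n; i++)
        y[i] += y[i - 1];
}
```

The first loop is the element-wise pattern that maps onto function pipes; the second shows why not every loop qualifies.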

SPvector Personality
• 32 function pipes
  – vector elements distributed across function pipes
  – same instructions sent to all function pipes
• A load-store vector architecture with modern latency-hiding features
• Optimized for Signal Processing (i.e., Oil & Gas) applications
• Each function pipe supports:
  – multiple functional units
  – out-of-order execution
  – register renaming
[Diagram labels: crossbar; fma; fma; misc; integer; logical; rcp, divide; add; store; load; vector register file; to crossbar]
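The idea of one instruction applied by all 32 function pipes, each to its own share of the vector, can be modeled in software. The sketch below assumes a cyclic mapping (element i goes to pipe i mod 32); the slides say only that elements are distributed across pipes, so the mapping, `pipe_of`, and `vector_fma` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_PIPES 32  /* from the slide: 32 function pipes */

/* Hypothetical mapping: element i is handled by pipe i % NUM_PIPES. */
size_t pipe_of(size_t element)
{
    return element % NUM_PIPES;
}

/* Software model of a single multiply-add instruction sent to all pipes.
 * Each pipe applies the same operation to the elements it owns;
 * per_pipe[p] records how many elements pipe p processed. */
void vector_fma(size_t n, const float *a, const float *b, const float *c,
                float *out, size_t per_pipe[NUM_PIPES])
{
    assert(a && b && c && out && per_pipe);

    for (size_t p = 0; p < NUM_PIPES; p++) {
        per_pipe[p] = 0;
        /* Walk only the elements owned by pipe p. */
        for (size_t i = p; i < n; i += NUM_PIPES) {
            out[i] = a[i] * b[i] + c[i];
            per_pipe[p]++;
        }
    }
}
```

In hardware the pipes would run concurrently; the outer loop here serializes them only to keep the model simple.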

Financial Vector Personality
• 32 function pipes
  – vector elements distributed across function pipes
• Same overall structure and datapaths as the SPvector personality
• Pairs of single precision functional units replaced by double precision units
• Add functional units for common functions such as log, exp, random number generation
• Supported by the compiler as vector intrinsics
[Diagram labels: crossbar; fma; misc; integer; logical; rcp; add; Parallel RNG; exp, log, CND; store; load; vector register file; to crossbar]

Inspect Proteomics Procedural Personality
• Entire numerical routine implemented as function pipe
• Scalar unit (in HC-1) performs setup
• Multiple function pipes for data parallelism
• Operates on main memory using virtual addresses
[Diagram labels: pipe 0, pipe 1, pipe 2, … pipe 31; length; mbuf; Protein.Len; Protein Fetch; Substring Fetch; Peptide Mass Memory; Protein Database; Score; PRM Scores Memory; Match; Save Match; Temp Matches; Match Score; Score To Beat; Update Score To Beat; Temp Match Memory; Store Matches]

Development Tools
• Program in ANSI standard C/C++ and Fortran
• Unified compiler generates x86 & coprocessor instructions
• Seamless debugging environment for Intel & coprocessor code
• Executable can run on x86_64 nodes or on Convey Hybrid-Core nodes
[Diagram labels: C/C++; Fortran 95; Common Optimizer; Intel® 64 Optimizer & Code Generator; Convey Vectorizer & Code Generator; Procedural Personality Interface; other objects; Linker; executable; Intel® 64 code; Coprocessor code]

Where we are
• Shipping Beta
  – Bioinformatics, seismic, speech processing, architectural simulation, etc.
• 35 people
• Production Summer 2009
• Expanding sales, service, manufacturing
