Compiling for VIRAM
Dave Judd, Kathy Yelick
Computer Science Division, UC Berkeley
6/15/2021

VIRAM Compiler
• Based on Cray Inc. production compiler
  – Used on the T90, C90, as well as the T3D and T3E
  – Being ported by Cray Inc. to the SV2 architecture
• C, C++, and Fortran 95 front-ends
  – Fortran not supported in VIRAM
• Extensive vectorization, restructuring capability
• VIRAM code generator based on new SV2 code generator
  – SV2 code gen being developed in parallel w/ VIRAM
  – SV2 vector architecture similar to VIRAM

VIRAM Compiler Status
[Diagram]
  Frontends: C, C++, Fortran 95
  Optimizer: PDGCS
  Code Generators: T3D/T3E, C90/T90/SV1, SV2/VIRAM
• VIRAM vector & MIPS scalar support
  – Compiles & executes C & C++ commercial test suites
  – Compiles and executes several Cray vector C test suites

Progress Since Winter Retreat
• “-n32” ABI implemented, replacing “-64”
• C++ support, Modena test suite
• Code scheduler
• Code cleanup for vl, mvl, vbase, vinc registers
• Sync support partially implemented
• Addl. “vector 4” test suite executes correctly
• Eliminate dependency on Cray include files
• Vectorize loops w/ 8, 16 bit data
• A “few” bugs fixed

Compiler Testing
• C regression test suite (commercial test suite)
  – Scalar emphasis, C conformance
  – All tests pass except:
    • Small numerical differences due to lack of 128-bit floating-point support
• C++ test suite
  – 1167 of 1183 tests execute correctly
  – 12 failures in compilation: “undefined variables”
  – 4 failures in execution: bad answers

Compiler Testing
• Vector regression test suites (CRAY)
  – Specifically tests for vectorization
  – Compares vector and scalar results
  – Easy to isolate problems
  – “vector” status:
    • 59 of 62 tests pass
    • Some minor numerical differences
    • 1 bad answer, 2 integer overflow
  – “vector 4” status:
    • 163 of 165 tests execute correctly
    • 1 bad answer, 1 illegal use of vector inst.

Kernel Performance: mvm
matrix-vector multiplication, 64 x 64, 32 bit floating pt.

Hand optimized assembly code                          579 mflops
vcc w/ restrict keywords added                        352 mflops
+ 1 element padding to avoid bank conflicts           401 mflops
+ shortloop directive
  (loops interchanged & outer loop vectorized by vcc) 592 mflops

Mods to mvm code

/* Original code mvm.c */
void mvm (float * A, float * X, float * Y, int n, int acol)
{
    int i, j;
    float x_elem;
    if ( n <= 64 ) {
        for (i = 0; i < n; i++) {
            x_elem = X[i];  /* not shown on the slide; inferred from the modified code */
            for (j = 0; j < n; j++) {
                Y[j] += A[j*acol+i] * x_elem;
            }
        }
    }
    /* n > 64 case not shown on the slide */
}

/* Modified code */
void mvm (float * restrict A, float * restrict X, float * restrict Y, int n, int acol)
{
    int i, j;
    if ( n <= 64 ) {
        for (i = 0; i < n; i++) {
#pragma shortloop
            for (j = 0; j < n; j++) {
                Y[j] += A[j*acol+i] * X[i];
            }
        }
    }
    /* n > 64 case not shown on the slide */
}
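
The "1 element padding" step from the mvm results is not shown in the slides. The following is a minimal sketch of one way it could look, assuming the padding is applied to the row stride (acol) so that the strided accesses A[j*acol+i] no longer use a power-of-two stride. The names mvm_padded_kernel and alloc_padded are hypothetical, and the vcc-specific pragma is left out so the code builds with any C99 compiler.

```c
#include <stdlib.h>

/* Same loop nest as the slide's modified code, minus the vcc pragma. */
void mvm_padded_kernel(const float *restrict A, const float *restrict X,
                       float *restrict Y, int n, int acol)
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            Y[j] += A[j * acol + i] * X[i];
}

/* Hypothetical helper: allocate a zeroed n x n matrix whose rows are
   n + 1 elements apart, i.e. one unused padding element per row.
   The row stride is returned through acol_out. */
float *alloc_padded(int n, int *acol_out)
{
    *acol_out = n + 1;
    return calloc((size_t)n * (size_t)(n + 1), sizeof(float));
}
```

For the 64 x 64 case this gives acol = 65 instead of 64. The caller passes the returned stride to the kernel unchanged, so only the allocation differs from the unpadded version.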

Kernel performance: mm_mul
matrix-matrix multiplication, 64 x 64, 32 bit float, 1.6 gigaflop theoretical peak

Hand coded assembly (mm-mul-small.s)             1.58 gigaflops
vcc w/ restrict and shortloop keywords           0.852 gigaflops
+ inner two loops in separate function,
  allows outer loop vectorization                1.51 gigaflops

Kernel performance: saxpy
• 32 bit floating point ops

                           N=64    256   1024   4096
Hand coded assembly         379    593    691    720
vcc w/ restrict keywords    385    596    692    721
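
The slides do not show the saxpy source that was compiled. As a point of reference, a conventional C version with restrict-qualified pointers might look like the sketch below; the function name and signature are assumptions, not the benchmark's actual code.

```c
#include <stddef.h>

/* y[i] = a * x[i] + y[i] for i in [0, n).
   The restrict qualifiers promise the compiler that x and y do not
   overlap, which removes the aliasing concern that would otherwise
   stand in the way of vectorizing the loop. */
void saxpy(size_t n, float a, const float *restrict x, float *restrict y)
{
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```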

Kernel performance: motion_estimate
32 bit integer ops, finding the minimum sum of absolute differences for a reference block and a region in an image.

Hand optimized assembly           1.181 gigaops
vcc w/ restrict keywords          170 mops
+ shortloop directives            253 mops
+ outer loop unroll directive     257 mops*

*No improvement because of spilling.
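
The kernel source is not included in the slides. The sketch below only illustrates the computation described above (minimum sum of absolute differences between a reference block and every position in a search region), written as plain C. The names sad and min_sad, the row-major layouts, and the exhaustive search order are all assumptions.

```c
#include <limits.h>
#include <stdlib.h>

/* Sum of absolute differences between a bw x bh block and the
   same-sized window of the image whose top-left corner is (x0, y0).
   iw is the image row length. */
static int sad(const int *restrict block, int bw, int bh,
               const int *restrict image, int iw, int x0, int y0)
{
    int sum = 0;
    for (int y = 0; y < bh; y++)
        for (int x = 0; x < bw; x++)
            sum += abs(block[y * bw + x] - image[(y0 + y) * iw + (x0 + x)]);
    return sum;
}

/* Minimum SAD over every position where the block fits inside the
   iw x ih image. Returns INT_MAX if the block does not fit anywhere. */
int min_sad(const int *restrict block, int bw, int bh,
            const int *restrict image, int iw, int ih)
{
    int best = INT_MAX;
    for (int y0 = 0; y0 + bh <= ih; y0++)
        for (int x0 = 0; x0 + bw <= iw; x0++) {
            int s = sad(block, bw, bh, image, iw, x0, y0);
            if (s < best)
                best = s;
        }
    return best;
}
```

The innermost loops are short (bounded by the block size), which is the kind of loop the shortloop directive mentioned above is aimed at.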

Dongarra loops
• 100 loops to test compiler vectorization capability
• Rewritten in C by Cray (?)
• vcc vectorizes 74 loops
• vcc partially vectorizes 3 loops
• vcc conditionally vectorizes 3 loops
• 1 loop not vectorized because vector sin/cos not currently available on VIRAM
• 19 other loops not vectorized
• Data provided by Sam Williams

Features Remaining:
• Support version 3 ISA and version 4 ISA:
  – ISA changes required by MIPS Inc. scalar core
  – Performance simulator only supports “old” ISA
• Finish sync support
  – Take advantage of Cray implementation
• VIRAM machine “target”
  – Allow easier maintenance of frontend and optimizer mods for VIRAM
• User documentation
  – Summary of differences w/ Cray compiler
  – Useful options, hints for vector code

Performance Features Remaining
• Additional tuning: instruction scheduler
• Support new SV2 inliner for C/C++
• Shortloop enhancements
• Reduce spilling
  – Scheduler concern with registers
  – Ordering of blocks for register assignment within “priority groups”
  – Special vector registers carried across calls
• Loop unrolling for vector loops
• Tune for key benchmarks

Other Future Compiler Features?
• Support for speculative execution
• Compiler extensions for fixed point hardware
• Support for vector functions; vector mlib

Summary
• vcc is a reasonably robust compiler for VIRAM
• Performance on kernels is good w/ appropriate directives, some effort for optimum vectorization
• Need to prioritize remaining work

Backup slides follow

Vector Architectural State
[Figure: virtual processors VP0, VP1, ..., VP(vl-1), each vpw bits wide, across data registers vr0, vr1, ..., vr31]
• Number of VPs given by the Vector Length register vl
• Width of each VP given by the register vpw
  – vpw is one of {8b, 16b, 32b, 64b}
• Maximum vector length is given by a read-only register mvl
  – mvl depends on implementation and vpw: {128, 64, 32} in VIRAM-1
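
The slide lists the VIRAM-1 mvl values without saying which vpw each belongs to. The sketch below assumes a fixed 2048-bit vector register, which reproduces {128, 64, 32} for vpw = 16, 32 and 64 bits. That register size, the pairing, and the function name assumed_mvl are assumptions for illustration, not values read from the hardware.

```c
/* Assumed register size in bits; chosen because 2048 / vpw gives the
   slide's {128, 64, 32} for vpw = 16, 32, 64. */
enum { ASSUMED_VREG_BITS = 2048 };

/* Maximum vector length under the assumption above.
   Returns 0 for any width the slide gives no mvl value for. */
int assumed_mvl(int vpw_bits)
{
    switch (vpw_bits) {
    case 16:
    case 32:
    case 64:
        return ASSUMED_VREG_BITS / vpw_bits;
    default:
        return 0;
    }
}
```

On real hardware the value would come from reading the mvl register rather than from a table like this.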

Codegen/optimizer issues for VIRAM
• Variable virtual processor width (VPW)
• Variable maximum vector register length (MVL)
• Vector flag registers treated as 1 bit wide vector register
• Multiple base, incr, stride regs. + autoincrement
• Fixed point arithmetic (saturating add, etc.)
• Memory consistency
• New vector instructions not available on SV2

Generating Code for Variable VPW
• Strategy: vectorizer determines minimum correct vpw for each loop nest
  – Vectorizer assumes vpw=64 initially
  – At end of vectorization, discard vectorized copy of loop if greatest width encountered is less than 64 and start vectorization over with new vpw
  – Code gen checks vpw for each loop nest
• Limitation: a single loop nest will run at the speed of the widest type
  – Reason: simplicity & performance of the common case
  – No attempt to split/combine loops based on vpw
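
The rule above (one vpw per loop nest, set by the widest type encountered) can be sketched as a small function. This is only an illustration of the selection rule, not the vectorizer's actual two-pass implementation; the function name and the representation of a loop nest as an array of operand widths are assumptions.

```c
#include <stddef.h>

/* Given the bit widths of all data types used in one loop nest,
   return the narrowest vpw from {8, 16, 32, 64} that holds the widest
   of them. Returns -1 if some type is wider than 64 bits. */
int min_vpw_for_nest(const int *type_bits, size_t count)
{
    static const int supported[] = {8, 16, 32, 64};
    int widest = 0;

    /* Corresponds to "greatest width encountered" on the slide. */
    for (size_t i = 0; i < count; i++)
        if (type_bits[i] > widest)
            widest = type_bits[i];

    for (size_t k = 0; k < sizeof supported / sizeof supported[0]; k++)
        if (supported[k] >= widest)
            return supported[k];

    return -1;
}
```

A nest that mixes 8-bit and 32-bit data therefore gets vpw = 32, which is the limitation the slide describes: the 8-bit operations run at the 32-bit rate.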

Generating Code for Variable MVL
• Maximum vector length is not specified in IRAM ISA
• However, compiler assumes mvl at compile time
  – mvl based on vpw
  – mvl assumption dependent on VIRAM-1 hardware implementation
  – Recompiling required for future hardware versions if mvl changes
• MVL knowledge useful for code gen and vectorizer:
  – Register spilling
  – Short loop vectorization
  – Length-dependent vectorization (and may eliminate safe vector length computation at run time)

      for (i = 0; i < n; i++)
          a[i] = a[i+32];
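
To make the role of a compile-time mvl concrete, here is a scalar C sketch of the slide's loop processed in strips of a fixed length. ASSUMED_MVL = 64 and the function name are assumptions, and the two memcpy calls merely stand in for a vector load followed by a vector store; this is not the code vcc emits.

```c
#include <stddef.h>
#include <string.h>

/* Strip length fixed at compile time (assumed value). */
enum { ASSUMED_MVL = 64 };

/* Computes a[i] = a[i + 32] for i in [0, n), one strip at a time.
   The array must hold at least n + 32 elements. */
void shift_down_32(float *a, size_t n)
{
    float strip[ASSUMED_MVL];

    for (size_t i = 0; i < n; i += ASSUMED_MVL) {
        /* Last strip may be shorter than ASSUMED_MVL. */
        size_t vl = (n - i < ASSUMED_MVL) ? n - i : ASSUMED_MVL;

        memcpy(strip, &a[i + 32], vl * sizeof strip[0]); /* "vector load"  */
        memcpy(&a[i], strip, vl * sizeof strip[0]);      /* "vector store" */
    }
}
```

Because the strip length is a constant, the temporary buffer size and the loop step are both known at compile time, and no run-time query of the maximum vector length is needed.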

Memory consistency
• Sync instructions: SaV, RaW, WaR, WaW, VaS, VaV, vp

VIRAM Tools
• vas: assembler
• vdis: disassembler
• vsim-isa: simulator
• vsim-db: debugger
• vsim-p: performance simulator
• vsim-sync: memory consistency simulator

vsim-sync
• Intended for debugging and optimizing syncs
• Tells you when there is a data hazard (sync needed)
• Tells you when a sync executed that didn’t prevent a hazard:
  – sync may not be needed, according to dynamic execution
  – sync may be needed on some other execution path