Slides: 11
nek5000: preliminary discussion for petaflops apps project
General facts about nek5000
- Research variant of a commercial code
  - developed by Fischer, Ho, and Ronquist in the late 80's
  - subsequently modified by Fischer and Tufo
- Solves the incompressible Navier-Stokes equations using the spectral element method
- Used by several external research groups (Duke, Brown)
nek5000 language issues
- About 80,000 lines of mostly pure old f77, with a little C
- C called from Fortran with decent portability strategies
- Shell scripts provided to simplify job management; these are mostly Jazz-specific
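The slides don't show the actual mixed-language strategy; a common f77-to-C portability approach (a sketch only -- the macro and routine names here are hypothetical, not taken from the nek5000 source) adapts C symbol names to the Fortran compiler's mangling convention:

```c
/* Sketch of one common Fortran->C portability strategy: a macro that
   selects the symbol-name convention (uppercase, trailing underscore,
   or plain lowercase) expected by the local f77 compiler.  FORTRAN_NAME,
   UPCASE, UNDERSCORE, and vecsum are hypothetical names for illustration. */
#if defined(UPCASE)
#define FORTRAN_NAME(lower, upper) upper
#elif defined(UNDERSCORE)
#define FORTRAN_NAME(lower, upper) lower##_
#else
#define FORTRAN_NAME(lower, upper) lower
#endif

/* Callable from f77 as "call vecsum(x, n, s)"; f77 passes all
   arguments by reference, so everything arrives as a pointer. */
void FORTRAN_NAME(vecsum, VECSUM)(const double *x, const int *n, double *s)
{
    double acc = 0.0;
    for (int i = 0; i < *n; i++)
        acc += x[i];
    *s = acc;
}
```

Building with `-DUNDERSCORE` (or `-DUPCASE`) then matches whatever the target's Fortran compiler emits, without touching the C source.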
nek5000 portability issues
- Has been run on a wide range of architectures: Power 3, Pentiums, Alpha, SGI, etc.
- Focus is on the PGF compiler, but portability looks pretty good: I ran on SGI and Fujitsu i386 fairly easily, as well as on Jazz with PGF
Portability, cont.
- Relies on hand-editing a somewhat generalized makefile
  - No configure script and no pre-existing machine-specific makefiles
- Compiler must be able to promote real to double precision
- Some non-standard f77 (e.g., common blocks resized)
Software process
- No repository
  - Thus no versioning, no release schedule, no bug tracking, etc.
- Test problems exist, but no automated verification test suite
- Good quick how-to guide, but very light on documentation
- Not directly downloadable from, e.g., a web server
Performance
- Exhaustively studied/optimized
  - Gordon Bell Prize winner
- Serial part:
  - Dominated by matrix-matrix products with smallish vector lengths
  - A homemade routine makes much better use of cache and does much better than BLAS: very high flop rates on non-vector machines
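The slides don't reproduce nek5000's actual routine; the idea they describe -- beating a generic BLAS dgemm at the small, fixed sizes a spectral element code generates by dispatching to fully unrolled kernels -- can be sketched like this (function names and the k == 4 specialization are illustrative assumptions, not nek5000 code):

```c
/* Illustrative sketch of a small-matrix multiply C = A*B with
   A (m x k), B (k x n), all column-major to match Fortran storage.
   A spectral element code knows its polynomial order, so a dispatcher
   can select an unrolled kernel per inner dimension k; the unrolled
   body keeps the k-loop operands in registers instead of re-streaming
   them through cache. */
static void mxm4(const double *a, int m, const double *b, double *c, int n)
{
    /* fully unrolled for k == 4 */
    for (int j = 0; j < n; j++)
        for (int i = 0; i < m; i++)
            c[i + j * m] = a[i]         * b[4 * j]
                         + a[i + m]     * b[4 * j + 1]
                         + a[i + 2 * m] * b[4 * j + 2]
                         + a[i + 3 * m] * b[4 * j + 3];
}

void mxm(const double *a, int m, const double *b, int k, double *c, int n)
{
    if (k == 4) { mxm4(a, m, b, c, n); return; }
    for (int j = 0; j < n; j++)          /* generic fallback */
        for (int i = 0; i < m; i++) {
            double s = 0.0;
            for (int l = 0; l < k; l++)
                s += a[i + l * m] * b[l + j * k];
            c[i + j * m] = s;
        }
}
```

For large matrices a tuned BLAS wins easily; the payoff here is specifically at the small sizes where dgemm's blocking and dispatch overhead dominate.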
Performance, cont.
- Communication patterns:
  - Nearest neighbor (~10%)
  - Vector reduction (~10%)
  - Coarse-grid solve (small)
- Not communication bound (yet)
- Has scaled nicely to 1000s of processors on ASCI Red and Seaborg (SP3)
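The vector reduction above is a global allreduce-style operation. A common implementation pattern is recursive doubling, which finishes in log2(P) exchange steps; the sketch below simulates the pattern serially over P "ranks" in one process (the real code would use MPI-style message exchange -- this only illustrates the communication structure, and the function name is hypothetical):

```c
/* Recursive-doubling sum reduction, simulated serially: val[r] is the
   value held by "rank" r, and nprocs must be a power of two.  At step
   with distance d, rank r exchanges with partner r ^ d; after log2(P)
   steps every rank holds the global sum. */
void allreduce_sum(double *val, int nprocs)
{
    for (int dist = 1; dist < nprocs; dist <<= 1) {
        for (int r = 0; r < nprocs; r++) {
            int partner = r ^ dist;
            if (r < partner) {           /* handle each pair once */
                double s = val[r] + val[partner];
                val[r] = s;
                val[partner] = s;
            }
        }
    }
}
```

The log2(P) depth is why an improved reduction implementation matters at very large processor counts: the per-step latency is paid on every iteration of the solver.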
Performance issues
- Outstanding performance questions:
  - Serial:
    - Efficient use of cache across different parameter regimes (different vector sizes)
    - How will it perform on new vector hardware?
    - No "spike" in the performance histogram; hard to optimize further
  - Parallel:
    - Nearest-neighbor communication could become a bottleneck for a slow-converging Helmholtz solve
    - Scaling at 100,000 processors may depend on an improved vector-reduction implementation
What am I doing now?
- Software process:
  - Creating a CVS repository
  - Establishing a license agreement
  - Creating a simple web page with info/releases
  - Creating some self-testing scripts
  - Convincing Paul to add some documentation
  - Posting a page of benchmarks
  - Creating a release script
What am I doing, cont.
- Performance:
  - Collecting some of my own numbers
    - PAPI installed locally, but needed on Jazz!
    - PGF tools to access hardware counters?
    - Adding some better instrumentation to the code to make this easier in the future
- Petaflops apps meetings:
  - Posting minutes/notes from each meeting on the local web site