Enabling Grids for E-sciencE
gLite testing experience
3rd JRA1 All-Hands Meeting
JRA1 test team
www.eu-egee.org
INFSO-RI-508833

Outline
• JRA1 test team organization and infrastructure
• What we do
• Test reports
• Summary and conclusions
INFSO-RI-508833, 3rd JRA1 All-Hands Meeting, Brno

All-hands@Padova: Testing issues
• OS installation automated at all sites
• QMTest framework now in use
• Clean installs using kickstart or quattor (site dependent)
• Moving away from private communication to using Savannah
• Testing status
  – gLite I/O: OK
  – WMS doesn't work yet
    § Initial test cases defined; many legacy tests can be reused
  – R-GMA progressing
• Plans for
  – Catalogs
  – FPS/FTS
  – Security
• No plans yet for
  – AliEn components
  – DS
  – Package manager

Organization
• Distributed testbed, 4 sites
• Each site runs a binary-compatible version of Red Hat Enterprise Linux
  – CERN: SLC3, NIKHEF: CentOS 3.2, RAL: Scientific Linux, IC: RHE3
• CERN:
  – 3 members plus team leader: David, Diana, Mario and Maite
  – Running all available services except the VOMS server
• Imperial College:
  – Janusz and Luke
• NIKHEF:
  – Davide (sysadmin)
• RAL:
  – S. Traylen (sysadmin), S. Burke (R-GMA testing)

What we do (I)
• TESTING: the process of analyzing a software item to detect the differences between existing and required conditions (bugs) and to evaluate the features of the software item.
  – Verification: compare what it does with what it should do (requirements, specification…). Does the system do what it is supposed to do?
  – Validation: evaluate the software; subjective judgment by the tester ("I think…"). Is what the system is doing correct?

What we do (II)
• Deploy and manage the (fragile) distributed testbed: 4 sites, ~60 machines
  – Releases
  – Integration releases (full or only individual components)
  – Frequency: on average 1 new deployment every 2-3 weeks (takes 1-2 days each time)
  – Done at all 4 sites
  – Deployment testing
• Write functional test suites
  – CERN test team members plus external collaborations (e.g. R-GMA team, VOMS team, etc.)
  – This requires understanding of the functionality; it cannot be done until functionality/specifications/documentation are known
  – Also includes regression testing

What we do (III)
• Run the test suites plus manual, ad-hoc testing to verify/validate the release
  – Release candidate from integrators → one week of testing (run test suites) → RELEASE
  – Continuous manual, ad-hoc testing:
    § Basic test of new features/components
    § Debugging problems
    § Deployment testing
  – Interpret the results to report bugs and write test reports
• Bug verification
  – Go through all bugs in "Ready to test" status and try to reproduce them to verify that they are indeed fixed
• Help/support other testing-related activities
  – NA4 test team
  – SA1 certification/pre-production (trying to reproduce the problems they see, etc.)

Functional testing
• Existing test suites
  – Fireman catalog (since gLite 1.0)
  – I/O (since gLite 1.0)
  – WMS (since gLite 1.0)
  – R-GMA (since gLite 1.1)
  – FTS (coming with gLite 1.2)
  – VOMS (coming with gLite 1.2)
• New ones being discussed
  – LB test suite
  – WMS (API) test suite
  – DAG test suite
  – MPI test suite
  – JDL test suite

Where to find test results and reports

Test results

Fireman test results

Test reports

Fireman test report

Improvements – plans for the future
• What would we like to do, ideally?
  – Stable testbed infrastructure managed 100% automatically or externally
    § At CERN, quattor manages the OS; gLite has to be installed by us (APT or installers)
  – Automated environment (who said framework?) to run the test suites (daily, weekly…) in a distributed testbed
    § We are far from that; test suites run automatically BUT require prior configuration (where the CEs, catalogs, etc. are)
    § First steps: deployment modules to configure the test suites and cron jobs to run them automatically

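The "first steps" above (configure the suites, then run them unattended) could look roughly like the following. This is only a minimal sketch: the variable names TEST_CE_HOST and TEST_CATALOG_HOST, the example.org hosts and the stand-in suite commands are all hypothetical and not taken from the actual gLite test suites.

```python
import os
import subprocess
import sys

# Hypothetical site settings: the kind of values that currently have to be
# supplied by hand before a suite can run (where the CE is, where the catalog is).
SITE_CONFIG = {
    "TEST_CE_HOST": "ce.example.org",
    "TEST_CATALOG_HOST": "catalog.example.org",
}


def run_suite(command, config):
    """Run one suite command with the site settings exported as environment variables."""
    env = dict(os.environ)
    env.update(config)
    result = subprocess.run(command, env=env, capture_output=True, text=True)
    return result.returncode, result.stdout


def run_all(suites, config):
    """Run every suite and return a {suite name: passed} report."""
    report = {}
    for name, command in suites.items():
        code, _ = run_suite(command, config)
        report[name] = (code == 0)
    return report


if __name__ == "__main__":
    # Stand-in "suite": succeeds only if the CE host was passed through the environment.
    check = "import os, sys; sys.exit(0 if os.environ.get('TEST_CE_HOST') else 1)"
    print(run_all({"stand-in": [sys.executable, "-c", check]}, SITE_CONFIG))
```

A cron job that calls a script of this kind daily or weekly would then only need the site configuration to be generated once by the deployment modules.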
IC site report (I)
Imperial College (2 test sites planned, currently running as a single site).
• Site 1 - HEP
  – Tester: Janusz
  – Machines: 1 WMS, 1 CE, 2 WNs, 1 R-GMA Server, 1 IO Server, 1 UI
  – Install: Manual
  – Config: Site (mostly)
  – Version: R1.1
• Site 2 - LeSC
  – Tester: Luke
  – Machines: 1 WMS, 2 CEs (+1), 2 WNs (+1)
  – Install: apt
  – Config: Site
  – Version: R1.1 (+ QF 7&8)

IC site report (II)
• Past issues:
  – Blocking UI (now OK).
  – User name conflicts in CE, WMS and LB on the WMS machine (configured manually to have the same username).
  – CE config does not allow more than 2 WMSs (still valid).
• Immediate site-specific issues:
  – A recent quick fix has caused some problems with the LeSC CEs.
  – WMS at LeSC has developed an unidentified error and needs fixing.
  – SE planned at the HEP site, but there are problems with the IO Server handling dCache.
• Next steps:
  – Full WMS testing (needs functioning SE(?)).
  – Second R-GMA Server to split sites.
  – VOMS Server planned.
• Group interaction:
  – Have received a great deal of help since joining the team, thanks!
  – Clear documentation, impressive bug tracking.
  – Would appreciate a more detailed architectural description and administration guide.

Catalogs Testing
• What we test:
  – MySQL and Oracle Fireman
  – All file permissions
  – Most of the CLIs
  – Bulk and single creation of entries
  – Regression tests (6687, 6910, 8452)
• Untested areas:
  – Stress testing (covered by ARDA)
  – Permissions using several VOs with different roles
• Summary of issues in the last weeks
  – MySQL Fireman: cannot write in '/' (Bug #8437)

FTS Functional testing 1/2
• What we test:
  – We packaged and adapted the Java API tests written by data management so that they can be used on the UI
  – We are writing tests for the command line interface, which is in the process of being released (version 1.1.x), checking for correct behaviour and error messages on valid and invalid input
• Untested areas:
  – We do not provide tests that bypass the application server and exercise the agents directly
• Summary of issues in the last months (including solved ones)
  – A lot of deployment problems, most of which are being ironed out for the next intermediate release

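Checking a command line tool for correct behaviour and error messages on valid and invalid input can be sketched as below. The slides do not name the FTS commands or their messages, so a small stand-in script plays the role of the CLI; its messages ("submitted", "missing source URL") and its exit code are invented for illustration.

```python
import subprocess
import sys

# Stand-in for the real command line tool (hypothetical behaviour):
# it fails with a message on stderr when called without an argument.
STAND_IN_CLI = """
import sys
if len(sys.argv) < 2:
    print("error: missing source URL", file=sys.stderr)
    sys.exit(2)
print("submitted")
"""


def check_cli(command, expect_success, expected_text=""):
    """Run a command; check both its exit status and a fragment of its output."""
    result = subprocess.run(command, capture_output=True, text=True)
    succeeded = (result.returncode == 0)
    output = result.stdout + result.stderr
    return succeeded == expect_success and expected_text in output


if __name__ == "__main__":
    base = [sys.executable, "-c", STAND_IN_CLI]
    print("valid input:", check_cli(base + ["some-source"], True, "submitted"))
    print("invalid input:", check_cli(base, False, "missing source URL"))
```

Checking the message text as well as the exit status is what lets a test flag cryptic or misleading error output, not just outright failures.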
FTS Functional testing 2/2
• Main issues (not solved yet) regarding the component
  – Backend DB has to be configured manually if there is no dedicated index tablespace in the DB.
  – Error messages can still be quite cryptic (bug 8713)
• The current limitation I see is that the FTS works only in a very controlled environment with a lot of help from the data management cluster.

I/O Testing
• What we test:
  – I/O Client API
  – Regression tests (5102, 4873, 4415, 5329, 4414)
  – Functional testing (cycles of uploading/downloading files)
• Untested areas:
  – Stress testing (covered by ARDA)
• Summary of issues in the last months
  – Bug 6043 "errors while simultaneous downloading/retrieving several files via IOserver" still open.
  – Basic functionality covered. Focus is on other components, but we are open to suggestions.

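An upload/download cycle test of the kind listed above can be sketched as a checksum comparison. This is an illustration only: the gLite I/O client API is not shown in the slides, so a local directory stands in for the storage service, and local_store, round_trip and the file names are hypothetical.

```python
import hashlib
import os
import shutil
import tempfile


def sha1_of(path):
    """Return the SHA-1 hex digest of a file, read in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def local_store(root):
    """Stand-in for a storage service: 'remote' files live in a local directory."""
    def upload(local_path, remote_name):
        shutil.copyfile(local_path, os.path.join(root, remote_name))

    def download(remote_name, local_path):
        shutil.copyfile(os.path.join(root, remote_name), local_path)

    return upload, download


def round_trip(upload, download, local_path, remote_name, workdir, cycles=3):
    """Upload and download one file `cycles` times; True if every copy matches."""
    expected = sha1_of(local_path)
    for i in range(cycles):
        upload(local_path, remote_name)
        fetched = os.path.join(workdir, "fetched_%d" % i)
        download(remote_name, fetched)
        if sha1_of(fetched) != expected:
            return False
    return True


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as store, tempfile.TemporaryDirectory() as work:
        source = os.path.join(work, "source.dat")
        with open(source, "wb") as handle:
            handle.write(os.urandom(4096))
        upload, download = local_store(store)
        print("round trip ok:", round_trip(upload, download, source, "test.dat", work))
```

Passing the upload and download functions in as parameters keeps the check itself independent of whichever client is actually under test.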
IO-dCache integration
It had never been tested but almost worked the first time, which is fairly impressive.
• 3 problems were found.
  – A small bug in gLite I/O; I don't have any details, but Paolo fixed it.
  – The dCap protocol turned out not to be a suitable option as it is a read-only protocol (anonymous access). GSIdCap is now the planned protocol.
  – A bug in the dCap client libraries was also found, for which a fix has been supplied by DESY. The fix is available in an updated library, but there is no release of it from DESY yet.
• Current status is that Peter has a working gLite I/O talking GSIdCap to dCache at CERN. I am waiting for a released gLite I/O version that contains at least the fixes for problems 1 and 2, and then it will be redeployed.
• Installed an FTS server, fts0443.gridpp.rl.ac.uk. Testing this against dCache is certainly also required.

R-GMA Functional tests
• Test plan defined
• 12 tests implemented by John Walk
• Coverage:
  – Continuous, latest and history producers and consumers
  – Publication and consumption on both UI and WNs
  – Limited scalability tests (default 10 jobs, 1000 tuples)
  – Some detailed functionality (predicates, LRP)
  – Limited tests of the content of standard tables
• Not covered:
  – Service discovery, but largely covered by other tests
  – Browser + command line not tested explicitly, but used regularly
  – Error handling: known to be bad!
• Tests are simple bash scripts, can be run on any UI
• Some configuration options (scale, API language)

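The pass/fail decision of a publish-and-consume scalability test (default 10 jobs, 1000 tuples) reduces to comparing what was published with what a consumer received. The sketch below shows only that comparison, in Python rather than the bash used by the real tests; the (job, sequence) tuple identifiers are an assumption made for illustration, and no R-GMA API is called.

```python
def expected_ids(jobs=10, tuples_per_job=1000):
    """Identifiers of every tuple that the publishing jobs should have produced."""
    return {(job, seq) for job in range(jobs) for seq in range(tuples_per_job)}


def compare(expected, consumed):
    """Compare published identifiers with the list a consumer actually received."""
    consumed_set = set(consumed)
    return {
        "missing": sorted(expected - consumed_set),
        "unexpected": sorted(consumed_set - expected),
        "duplicates": len(consumed) - len(consumed_set),
    }


if __name__ == "__main__":
    published = expected_ids(jobs=2, tuples_per_job=3)
    # Simulate a consumer that lost one tuple and saw another one twice.
    received = [t for t in sorted(published) if t != (1, 2)] + [(0, 0)]
    print(compare(published, received))
```

Reporting the missing identifiers, not just a count, is one way to provide the extra diagnostic output that makes intermittent tuple loss easier to track down.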
R-GMA test results
• Some problems seen with slow response; tests hit timeouts
• Secondary producers sometimes fail to catch published tuples
• Some minor improvements in tests requested:
  – More diagnostic output
  – Apparent bug in one test (10)
  – More tests for table content
• But generally they give good coverage
• Difficult to test stability, scalability or stress response

R-GMA: major problems
• Outstanding:
  – Archivers can be inconsistent
  – Occasional server crashes
  – Recovery after crashes
  – Slow response
  – Error handling
  – Case sensitivity
• Fixed:
  – Crashes due to a bug in the JVM
  – Wrong proxy handling on WNs
  – Performance with very large result sets
  – Crash with a large number of declarations
  – Settings for tomcat memory and file handles
  – Problems when a remote site has a slow response

R-GMA: summary
• No showstoppers
• Some intermittent problems
  – May be fixed in 1.2, need to test
• Error handling must be improved
• Schema replication needed
• Scalability + stability will only be tested in production!
• Service publisher configuration still not ideal
• Work still needed on defining some items in the Service table (e.g. Type)
• Need to migrate to the new Glue schema (under way)

WMS Testing
• What we test:
  – Job submission test suite (using CLI; LCG certification test suite ported to gLite)
  – Regression testing (not included in the test suite)
  – Extensive debugging (configuration problems, distributed WMS in 3 sites)
• Summary of issues in the last months
  – The first jobs failing due to "cannot locate condor schedd", related to race conditions in Condor and the fact that some time is required before the condor schedd is spawned on the CE
  – Doubly spawned identical schedds on the CE, bringing the system into a faulty state where 2 identical daemons are reporting to the same collector (VO hashing)

WMS Testing
• Summary of issues in the last months (cont)
  – Faulty authorization on the CE: no authZ check done on the CE, ever
  – Lasting schedds on the CE: once you're AuthZed on the CE, you'll be for EVER (as of today)
• Others:
  – Performance under stress:
    § Jobs failing when the system is loaded with some streams of concurrent job submission, leading to an error like: "Got a job held event, reason: Attempts to submit failed (Job got an error while in the CondorG queue)"
    § Hanging pbs_status processes observed from time to time, when the system is minimally under stress
  – Sites appearing and disappearing from the ismdump.fl file on the WMS node, and therefore in the list-match of available resources
  – Jobs being resubmitted even if they had retryCount = 0. This was observed on the prototype testbed, caused by the submission of a DAG.

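Sites appearing and disappearing between successive list-match results can be detected by comparing snapshots of the site list. The sketch below is a hedged illustration: the format of ismdump.fl is not described in the slides, so it assumes the site names have already been extracted into one set per snapshot, and the names used are made up.

```python
def flapping_sites(snapshots):
    """Count, per site, how often its presence changed between consecutive snapshots."""
    changes = {}
    for before, after in zip(snapshots, snapshots[1:]):
        # Symmetric difference: sites present in exactly one of the two snapshots.
        for site in before ^ after:
            changes[site] = changes.get(site, 0) + 1
    return changes


if __name__ == "__main__":
    # Hypothetical snapshots taken at regular intervals.
    snapshots = [
        {"site-a", "site-b"},
        {"site-a"},
        {"site-a", "site-b"},
    ]
    print(flapping_sites(snapshots))
```

A site that legitimately joins or leaves the testbed once shows a count of 1, so higher counts are the ones worth investigating.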