Enabling Grids for E-sciencE: EGEE Operations (EGEE/LCG)
Enabling Grids for E-sciencE
EGEE Operations: EGEE/LCG II Operation Workshop – 26th May 2005
Operation WG Wrap-up
C. Vistoli
www.eu-egee.org
INFSO-RI-508833
Operations issues covered
• Operation:
 – GOC DB and site registration
 – Site Management Workflow
 – Interaction with OSG
• VO management:
 – Freedom of choice
 – GRAT
 – CIC web portal
• Resource allocation
• Deployment procedure
• Metrics
• Accounting
• Monitoring (out of time)
GOCDB
• GOCDB 2: presentation by P. Strange
• A lot of discussion and decisions about the meaning and type of information to add to the DB
New features of GOCDB 2
• Main feature is a new role-based authentication system
 – Roles are granted to contacts to grant permissions
 – Roles system is expandable to contain new roles and permissions
• Extra functions now exposed for ROC staff and COD/CIC
 – Creating new sites
 – Managing status of existing sites (production status, monitoring status, etc.)
• Other improvements
 – Many bugfixes!
 – SQL transaction support
 – UI improvements
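The role system described on this slide can be pictured with a minimal sketch. It is not GOCDB code: the role names, permission names and helper functions below are all hypothetical, chosen only to show roles granted to contacts, permissions derived from roles, and a role table that can be extended.

```python
from dataclasses import dataclass, field

# Hypothetical role and permission names, for illustration only.
ROLE_PERMISSIONS = {
    "site_admin": {"edit_own_site"},
    "roc_staff": {"edit_own_site", "create_site", "set_production_status"},
    "cod": {"set_monitoring_status"},
}


@dataclass
class Contact:
    name: str
    roles: set = field(default_factory=set)


def permissions_of(contact, role_permissions=ROLE_PERMISSIONS):
    """Union of the permissions of every role granted to the contact."""
    allowed = set()
    for role in contact.roles:
        allowed |= role_permissions.get(role, set())
    return allowed


def can(contact, permission, role_permissions=ROLE_PERMISSIONS):
    """True if any role held by the contact grants the permission."""
    return permission in permissions_of(contact, role_permissions)


def add_role(role_permissions, role, permissions):
    """Return a copy of the role table extended with one more role."""
    extended = {r: set(p) for r, p in role_permissions.items()}
    extended[role] = set(permissions)
    return extended
```

Keeping the role table as plain data is one way to get the expandability the slide mentions: a new role is a new entry, not new code.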
System architecture
• MySQL 4.x database with transaction support
 – GOCDB 2 now fully supports transactions for enhanced data integrity
 – Contact GOC team if you want your tool to link directly to the MySQL database
• PHP 4 front end interface
• Apache 2.x web server
• GridSite security layer
 – Grants HTTPS access only to EGEE recognised certificate holders
 – Does not map users to roles; this is a function provided by GOCDB
GOC How-to
GOC requirements: grid-wide needs and features from a centralized source of information.
« How-to » to be settled and followed up: GOC/COD to set up an identified process, validated by the ROCs.
Variable status: Organisation / Group / Production / Type status (« LCG 2, PPS »). A « closer to reality » definition of the variables is needed, so that ROCs and sites fill in the information coherently and no inconsistencies appear, e.g. in the monitoring tools. To be followed up with the ROC managers and COD, based on the following.
*****
Needs:
– How to state PPS? Laurence: how to set their IS? Look at another site.
– MS: refer to the Glue-schema names.
– MT: LCG 2 nodes « group » monitored.
– « Type » = accepted with regard to a given grid and with regard to a given ROC. Cf. the JDL attribute / Glue schema CE status, which is not at the site level: confusing as it is.
– « Type gluevalueperce » = variable dynamically enforced onto the IS, similar to the Glue CE site status – COD and CIC – open/closed/ BDII; notification to the site admin on daily operations.
– « Status » = ROC decision + COD decision: candidate, certified, uncertified, suspended, gLite pre-production; at the site level.
– « Group Gridscope » = many-to-many; site level.
– « Organisation » (project organisation or funding partners) = hosting organization; legal body.
Operations issues
• Open issue about GOCDB and downtimes
• ROC: how they collect and certify sites
• Escalation procedure
• Core services management
• SA1 requirements escalation
Scheduled Downtimes and GOC DB
- No roll-out list mailing
- EGEE Broadcast to the affected VO managers and grid services users – coordination with user support?
- GOC DB filling
****
GOC DB requirement:
→ Need to separate CE and SE. It was silently assumed that scheduled downtimes meant « CE downtimes ».
→ Detailed scheduled downtimes for all nodes at the service level.
Production, inventory and follow-up of CIS from the GIIS tool. If this proves not to be sufficient, then requirements on the IS developers.
Adding a site + site monitoring
- ROC management negotiation – status « candidate »
- Site is established – « uncertified »
- GOC DB « Site form » complete
- SFT, local instance by ROCs or the EGEE one, OK for 1 week
*****
- ROCs are to exchange experience with their own local certification processes and tests, to come up with suggestions to put in common.
- Several SFT test instances need to run concurrently. To be taken care of by the SFT developers. Documentation for installing a local instance to be made available to willing ROCs.
****
Input: certified means « not harmful for the structure »
SEE: local instance of SFT: good practice. To become a suggestion.
TWN: same.
PN: site has to be registered in a local BDII – ROC responsibility. Then registered into the general BDII automatically; no rollout published.
CE: same for 3 days.
UK: same as CE.
It: certification BDII for It sites. Specific set of tests and SFT.
NE:
D+CH: regular project SFT. Sites for the time being in « pre-production ». Suggests to set a regional SFT set of tests. Needs documentation for PN for this installation.
Cern: a.k.a. PN
France: TBD
Russia:
SWE:
******
OSG: will provide the GIP and collect info the same way as EGEE. E.g. OSG could enforce the EGEE dteam registration (OSG EGEE: could run native tests, and OSG would appear as a given entity in the monitoring system, something similar to a ROC) (EGEE OSG: need to pack libraries).
How-to: follow-up of tickets between Footprint/Remedy // GGUS can be done manually in the first phase.
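One possible reading of the status progression on this slide, as a hedged sketch: a candidate site becomes uncertified once its GOC DB site form is complete, and becomes certified once SFT has passed every day for one week. The function names, the data layout and the exact trigger for each transition are assumptions made for illustration; the slide does not define them precisely.

```python
from datetime import date, timedelta


def sft_ok_for_period(sft_ok_by_day, today, required_days=7):
    """True if SFT passed on each of the last `required_days` days, today included.

    sft_ok_by_day maps a date to True/False; a missing day counts as a failure.
    """
    for offset in range(required_days):
        day = today - timedelta(days=offset)
        if not sft_ok_by_day.get(day, False):
            return False
    return True


def next_status(status, site_form_complete, sft_ok_by_day, today):
    """Assumed transitions: candidate -> uncertified -> certified."""
    if status == "candidate" and site_form_complete:
        return "uncertified"
    if status == "uncertified" and sft_ok_for_period(sft_ok_by_day, today):
        return "certified"
    return status
```

The `required_days` parameter is there because the regional notes differ (for example « same for 3 days » for CE), so a ROC-specific value could be passed in.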
Site Quarantine and Escalation
Deadline to be dependent on site size. Prioritization of COD work to be dependent on site size. Need for a proposal from the COD, to be handed to the ROC managers and back to the COD: « >100 CPU site: deadline 1 day » to be changed to « top 10 sites: deadline 1 day ».
----
How to take care of non-European sites (no existing EGEE ROC)? GD Team to give feedback to the ROC managers.
----
Escalation:
1st mail to site and ROC
2nd mail to ROC
Phone call to ROC
----
OSG: contact the registration manager to put pressure on the ROC manager / ROC to close the ticket. Need to update deadline for ROC manually through COD – OK.
COD to put a specific site under observation – for 3 days, into quarantine, when there are recurrent problems.
OSG: to publicize the « bad reputation sites » on a web page.
ROCs' agreement on metrics to be reached, URGENT.
----
Large sites' organisation to deal with these constraints and ensure 24/7-like behaviour.
----
Deadline before a CA upgrade becomes « a critical test » is 1 week, and needs COD action.
R-GMA: no deadline. Needed for accounting: important. However, registry failures cannot be blamed on sites: refinement of the SFT tests needed from the SFT developers. Contact with the R-GMA team needed. R-GMA tutorial link to be sent to the ROC managers.
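The escalation ladder and the size-dependent deadline noted on this slide can be sketched as follows. This is an illustration only: the function names are hypothetical, and the deadline for sites outside the top 10 is not given in the notes, so it is left as a required parameter rather than guessed.

```python
# Escalation steps as listed on the slide: (action, recipients).
ESCALATION_STEPS = [
    ("first mail", ("site", "ROC")),
    ("second mail", ("ROC",)),
    ("phone call", ("ROC",)),
]


def escalate(missed_deadlines):
    """Return the step to take after `missed_deadlines` missed deadlines (0 = ticket just opened).

    Once the last step is reached it is repeated; what happens beyond the
    phone call is not specified in the notes.
    """
    index = min(missed_deadlines, len(ESCALATION_STEPS) - 1)
    return ESCALATION_STEPS[index]


def deadline_days(site_rank, other_deadline_days, top_n=10, top_deadline_days=1):
    """Deadline in days: 1 day for the top 10 sites, caller-supplied otherwise.

    site_rank is 1 for the largest site; how sites are ranked is an assumption.
    """
    return top_deadline_days if site_rank <= top_n else other_deadline_days
```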
Collecting requirements through SA1
• CL – almost no official requirements existing from SA1
• SZ – involving ROC managers to collect – how to “legitimate” requirements
• JT – getting experts to meet
------------------------
STARTING NOW: issues that have arisen in this workshop
1/ M/W security policy – traceability and data management
2/ VO fair share and site implementation
3/ Requirement on environment variables on the batch systems
------
Input: the process of JRA1 enforcement is “frozen” or ineffective, and for EGEE 2 it is unclear. Send them to SZ to eradicate duplicates. To get feedback from a mailing list and take them to the PTF. SZ to rewrite them and to do the follow-up.
1/ depends on JSG – 2/ irrelevant – site level relevancy
VO management
1/ CIC web
2/ Freedom of Choice
 – Atlas is used in production
 – The sites should be aware that they are blacklisted – to be implemented – medium sized sites
 – VO should be able to define their customized set of tests
3/ GridAT
 – Could have been run as dteam VO, and allow better development effort (e.g. adding history features) of monitoring and alarm tool
CIC portal: Support a new VO
• Workflow
[Workflow diagram. Actors: CIC Portal, Site, ROC, OAG, VO Manager. Elements: Request, mail, Validated? (yes/no), Validation, Broadcast, Cc mail, Contact & dialog initialization]
CIC portal: Publish DC
• Workflow
[Workflow diagram. Actors: VO Manager, CIC Portal, Other VOs, OAG, Sites. Elements: Infos on DC, Authentication, Publication form, Validation, Broadcast, Publication, mail, Calendar, News]
Freedom of choice - VO Page - 1/3
Freedom of choice - Final List - 2/3
Freedom of choice - CIC Page - 3/3
GridAT (Grid Application Test) definitions:
• GridAT aims to simplify the addition of new tests for new or existing applications.
• GridAT can be used for validating a grid site, from the VO software viewpoint, by submitting a test job and evaluating whether its output matches the expected results.
• GridAT is designed to certify, on demand, installed grid applications.
GridAT Web Interface
Web portal gives an overview of the Italian Grid from the VO viewpoint.
Summary table contains the results of the last tests for each site, grouped by Virtual Organisation.
More details can be obtained by clicking on the test date.
Resource Allocation Process
• Resource allocation policy
 – Overview of status and requirements from VO at the OAG: scratch WN space + MPI + licensed software + secure data access
**** OAG contact point to set up!!
How to check actual resource allocation
 – Inventory of actual services from GOC and GIIS
 – Workflow backbone
Resource negotiation: Problems
• Only general percentages by region available
• Interpretation of numbers not always the same
• No indication on availability of specific resources (MPI, licensed software)
• Allocation has to be done site by site
• Gap between OAG and sites, to be filled by ROCs
• ROCs don’t decide on scientific priorities
• Exact workflow description missing
Resource negotiation: Implementation
• Implementation via CIC portal, “OAG view” (and VO/RC)
 – Readable to everybody
 – Role-specific actions reserved to authorized people
  § Sites: support yes/no, free cycles only / more detailed description
  § OAG/ROCs: (re-)trigger requests to sites/ROCs (specific broadcast)
  § VOs: contact point for discussions with specific sites
 – Shows site status by region: Solicited, Answered (+ answer details)
• Automate statistics, summaries, steps of the procedure
Deployment and Process
• LCG-2_4_0: first release using the new process (5 days late)
 – Release was picked up at a slow pace (~2.5 sites/day)
 – Differences between regions
  § Repacking and adaptation takes time and is needed
 – Release was not sufficiently tested
  § 2 test sites for a deployment test are not sufficient
• Release preparation:
• More, early involvement of sites required
 – Have to see the list of potential components very early
 – Regular progress reports to the ROC managers telephone conference
• Very early announcement of new releases needed
 – 3 weeks: complete list of components and changes
  § Problematic, because this means certification has to be finished
 – 2 weeks before a new release the release has to go to:
  § ROC-IT, ROC-SE, ROC-UK for:
   • Test deployment (1 week)
   • Testing on ROCs testbeds
   • Fixing bugs will take 1 week
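The lead times on this slide can be turned into calendar dates with a small helper. This is a sketch under one assumption: the two weeks before a release split into one week of test deployment followed by one week of bug fixing. The function and key names are hypothetical.

```python
from datetime import date, timedelta


def release_milestones(release_day):
    """Work backwards from the release day using the lead times on the slide."""
    return {
        # 3 weeks before: complete list of components and changes announced
        "component_list_announced": release_day - timedelta(weeks=3),
        # 2 weeks before: release handed to ROC-IT, ROC-SE, ROC-UK
        "handover_to_test_rocs": release_day - timedelta(weeks=2),
        # test deployment takes 1 week, then 1 week of bug fixing (assumed order)
        "test_deployment_ends": release_day - timedelta(weeks=1),
        "release": release_day,
    }


# Example with an arbitrary release date.
for name, day in release_milestones(date(2005, 7, 1)).items():
    print(f"{name}: {day.isoformat()}")
```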
Deployment and Process
• Deployment of new releases:
• The ROCs will drive the deployment
 – Announcement of releases through the ROC managers
 – Sites that are (much) too late will be excluded by their ROCs
• Next releases:
• By mid June an extra release is needed for the SC3
 – FTS, LFC service, VOMS (RFC compliant proxy extensions), bug fixes
 – Tier-1s and Tier-2s participating in SC3 have to upgrade quickly
• Regular release 1st July
 – Like mid June + updates
 – All sites
 – (may contain gLite WLM for voluntary parallel deployment)
• Transition to gLite will require changes to the process
 – More frequent releases
 – Step by step introduction of new components
Metric
• Two sets needed
• Complex, detailed set
 – Used for pinpointing problems
 – Used by ROCs, CICs and site admins (experts)
• Coarse summary
 – Measure overall performance
 – Small, easy to understand set
 – Hierarchical (Grid, ROCs, CICs, RCs)
 – Targeted at users to show progress (or lack of)
Metric
• General agreement on the concept
 – Detailed discussions on:
  § time windows
   • Sliding windows (week, month, 3 months)
  § quantities to watch for (RCs, ROCs, CICs, ...)
   • ROCs based on RCs
   • CICs based on services
• Release quality has to be measured
• To make progress: workgroup to define quantities
 – Organized by: Ognjen Prnjat (oprnjat@admin.grnet.gr)
 – Small (~5): Ognjen, Markus, Helene, Jeff T. and ???
 – Ognjen will collect input
 – ROCs, CICs and OMC have to agree on ONE set of quantities
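A sliding-window metric of the kind discussed here could look like the sketch below. The quantities were still to be defined by the workgroup, so everything concrete is an assumption: the metric (fraction of days on which a site's tests passed), the window lengths in days, and the ROC value taken as the plain mean over its RCs.

```python
from datetime import date, timedelta

# Assumed day counts for the windows named on the slide.
WINDOWS = {"week": 7, "month": 30, "3 months": 90}


def pass_rate(results, end_day, window_days):
    """Fraction of passing days in the window ending at end_day.

    results maps a date to True/False. Returns None when the window holds no data.
    """
    start = end_day - timedelta(days=window_days - 1)
    in_window = [ok for day, ok in results.items() if start <= day <= end_day]
    if not in_window:
        return None
    return sum(in_window) / len(in_window)


def roc_rate(rc_results, end_day, window_days):
    """ROC metric based on its RCs: mean of the RC pass rates (RCs without data skipped)."""
    rates = [pass_rate(r, end_day, window_days) for r in rc_results.values()]
    rates = [r for r in rates if r is not None]
    return sum(rates) / len(rates) if rates else None
```

An unweighted mean treats a small RC like a large one; weighting by site size would be a different, equally defensible choice.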
Metering: Gianduia
DGAS deployment
[Diagram: VOs (VO 1, VO 2, VO 3), sites with their CEs, HLR servers (HLR 1 to HLR 5), aggregate site accounting, and APEL]
How APEL Works?
• PBS/LSF log processed daily on site CE to extract required data; filter acts as R-GMA DBProducer -> PbsRecords table
• Gatekeeper log processed daily on site CE to extract required data; filter acts as R-GMA DBProducer -> GkRecords table
• Message log processed daily on site CE to extract required data; filter acts as R-GMA DBProducer -> MessageRecords table
• Site GIIS interrogated daily on site CE to obtain SpecInt and SpecFloat values for CE; acts as DBProducer -> SpecRecords table, one dated record per day
• These three tables joined daily on MON to produce LcgRecords table. As each record is produced, the program acts as StreamProducer to send the entries to the LcgRecords table on the GOC site.
• Site now has table containing its own accounting data; GOC has aggregated table over whole of LCG.
• Interactive and regular reports produced by site or at GOC site as required.
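The daily join described on this slide can be illustrated with plain Python records. This is not APEL code. The slide does not give the join keys or column names, so the ones below are assumptions: the batch log is keyed by a local job ID, the message log maps local job IDs to grid job IDs, and the gatekeeper log maps grid job IDs to user DNs.

```python
def build_lcg_records(pbs_records, message_records, gk_records):
    """Inner-join the three record sets into combined accounting records.

    All field names are hypothetical. Jobs missing from any table are skipped.
    """
    grid_id_by_local = {m["local_job_id"]: m["grid_job_id"] for m in message_records}
    dn_by_grid_id = {g["grid_job_id"]: g["user_dn"] for g in gk_records}

    lcg_records = []
    for job in pbs_records:
        grid_id = grid_id_by_local.get(job["local_job_id"])
        user_dn = dn_by_grid_id.get(grid_id)
        if grid_id is None or user_dn is None:
            continue  # no match in one of the logs: cannot build a full record
        lcg_records.append({
            "local_job_id": job["local_job_id"],
            "grid_job_id": grid_id,
            "user_dn": user_dn,
            "cpu_seconds": job["cpu_seconds"],
        })
    return lcg_records


# Made-up sample data: one grid job and one purely local job.
pbs = [
    {"local_job_id": "101.ce", "cpu_seconds": 3600},
    {"local_job_id": "102.ce", "cpu_seconds": 120},
]
messages = [{"local_job_id": "101.ce", "grid_job_id": "job-abc"}]
gatekeeper = [{"grid_job_id": "job-abc", "user_dn": "/C=XX/O=Example/CN=Example User"}]

records = build_lcg_records(pbs, messages, gatekeeper)
```

In this sketch the local-only job `102.ce` drops out of the result because it has no entry in the message log.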
APEL and gLite
– Is APEL integrated in gLite?
 § Work currently in progress.
 § We have ported the APEL code into the gLite CVS repository but need to understand functional differences, e.g. WMS and use of Condor
 § 3 components: Core + PBS plugin + LSF plugin
 § Sent our requirements to Erwin Laure... waiting for information.
– What about its deployment plan?
 § As soon as possible
 § ...but would also like to add some new features
  • Global Job ID to link with L&B
  • DN to VO mapping
GridICE Architecture: GridICE on LCG 2
[Architecture diagram. Roles: consumer, publisher. Logical components: browser, GridICE server (application, consumer), site publisher (MDS GRIS), event collector (Lemon srv), event provider (sensor: Lemon agt, scripts). Data delivery model: HTTP: HTML/XML; XML: pull, aperiodic, unicast; NS: push, aperiodic, unicast; LDAP client: pull, periodic, unicast; push, periodic, unicast. Scopes: WAN, LAN, resource]
GridICE and DGAS: Common Metering for Grid jobs
• DGAS is an accounting system, therefore it is interested in knowing the usage-related parameters of a job after its execution
• GridICE is a monitoring system, therefore it is interested in knowing the job-related information from the moment the job is created in the queue
 – The information should be updated frequently and provided to users respecting the security concerns
[Diagram: job states queued, running, executed, deleted, aborted, with the ranges covered by GridICE and DGAS]
GridICE on gLite
[Architecture diagram. Roles: consumer, publisher. Logical components: browser, GridICE server (application, consumer), site publisher (CEMon, R-GMA, MDS 2), event collector (Lemon srv), event provider (sensor: Lemon agent, scripts). Data delivery model: HTTP: HTML/XML; XML: pull, aperiodic, unicast; NS: push, aperiodic, unicast; pull, periodic, unicast; push, periodic, unicast. Scopes: WAN, LAN, resource]
Summary
                              GridICE in LCG 2.x             GridICE in gLite
Schema                        GLUE 1.1++/GLUE 1.2++
Local Area Distribution       Lemon UDP/TCP
Site Publisher                MDS GRIS (LDAP), no security   CEMon/R-GMA, http+gsi+proxy+voms ext.
Discovery                     BDII (LDAP)                    Service Discovery API
Wide Area Data Distribution   LDAP/pull                      SOAP/pull (push)
Notification                  fixed number of events         content-based subscription
NPM Architecture
• JRA4/NPM provides uniform access to network performance information from a heterogeneous set of monitoring frameworks
We need your help
• We have some idea of requirements from networking experts within JRA4
• Draft requirements document available here:
 – https://edms.cern.ch/document/593620/1
• Draft use case document available here:
 – https://edms.cern.ch/document/591777/1
• We’re looking for more input from NOCs and GOCs
• If you have requirements, use cases or opinions on interfaces or needed metrics, please send them to us
• Even if you don’t have ideas at the moment, but would like to be involved in the process, please get in contact
• Contact details are at the end of the talk
Operations Summary
• CIC On Duty is now well established
 – COD is just 6 months old!
 – Tools have evolved at a dramatic pace
  § Portal, SFT, ...
   • Many rapid iterations
  § Truly distributed effort
  § Integration of new COD partner (Russia) went smoothly
 – Tuning of procedures is an ongoing process
  § No dramatic changes (take resource size more into account)
Operations Summary
• Accounting
 – Last November still an area of concern
 – APEL now well established
  § Support for batch systems is improving
  § Several privacy related problems have been understood and solved
 – gLite Accounting: DGAS
  § Some concerns about amount of information published
   • Can be handled by proper authorization?
  § Collaboration with APEL on batch sensors (BBQS, Condor, ...)
   • DGAS agreed to provide them
  § Will be introduced initially on a voluntary basis
   • Sites will give feedback (including privacy issues)
Operations Summary
• Tools!
 – GOC-DB, monitoring, testing, ...
 – Many impressive tools
  § Lots of overlap, we should focus and fuse some of them
  § R-GMA based “monitoring bus” emerging
• Releases, Deployment
 – ROCs will drive the deployment
 – ROCs will contribute to the release preparation
  § Testing
  § Reviewing the proposed contents
• Performance Metric
 – Measure service quality (RC, ROCs, CICs, ...)
 – Ognjen organizes small workgroup to define details
Operations Summary
• OSG
 – Similar problems, interesting tools
 – Linking of operations between LCG (EGEE) and Grid3 (OSG)
  § Concept: OSG treated like a ROC, LCG like a SC
  § Details will be worked out during interoperation tests
• Worries
 – Resources
  § Activities: Service Challenges, LCG production, gLite pre-production, gLite transition, scaling up for LHC scale, ...
 – Duplication of effort
  § Especially pronounced in the area of tools
Conclusions
• GOCDB 2 FEATURES AND LINK WITH OPERATIONS
• OPERATIONS PROCEDURE – CODs monitoring tools useful: current development effort to be provided to the COD management.
• DEPLOYMENT
• PERFORMANCE MEASUREMENT
• ACCOUNTING: development coordinated with It and UK for accounting purposes. DGAS deployed as an optional component for the time being. Current operations: R-GMA and APEL specificities not covered. The next items on the agenda could not be covered.
• OSG interoperability: underlying the topics covered
 – Possibility to integrate site verification of OSG and site certification in EGEE – SFT developers
 – Interfacing needed somehow between the respective operations support tools – Footprint and Remedy
- Slides: 42