EGI-InSPIRE: The evolution of the information system

The evolution of the information system in EGI/WLCG
Stephen Burke
GridKa School, August 30th 2013
EGI-InSPIRE RI-261323, www.egi.eu

Overview
• What is a Grid information system?
• Historical development
  – GLUE information model
  – BDII technology
• Current situation
  – Validation
• Future developments

What is an information system?
• A Grid consists of a large number of services and resources, distributed over many sites around the world
• Users, VO managers and Grid operations managers need to understand what services exist, and relevant details about their properties and current state
• An information system collects information from across the Grid and allows it to be queried and displayed in a uniform way

Information system components
• Information model
  – The heart of the system: a uniform way to represent the properties of a diverse set of services and resources
• Information providers
  – Software components which collect and publish information according to the information model
• Transport
  – A technology which allows information to be aggregated from all the services in the Grid, and queried in a standard way

The GLUE information model

What is the information model used for?
• Users, applications and middleware need to know what resources are available and what their properties are
  – What workload managers are available to the biomed VO?
  – Find a computing system running SL6 with > 3 GB memory
  – Find a storage system with 20 TB of free space which I can write to
• Grid management and operations staff need an overview of the state of the Grid
  – How many jobs are running in the UK?
  – How much disk space has biomed used?
  – What is the total installed CPU power available to EGI?
  – Which sites are running EMI 2 services?
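A user query such as "find a computing system running SL6 with > 3 GB memory" amounts to a filter over uniformly described resources. The sketch below is only an illustration in Python: the `ComputingResource` record, its field names and the site names are hypothetical and far simpler than real GLUE objects.

```python
from dataclasses import dataclass


@dataclass
class ComputingResource:
    # Hypothetical, simplified record; real GLUE objects carry many more attributes.
    site: str
    os_name: str
    os_release: str
    memory_mb: int


def find_resources(resources, os_name, os_major, min_memory_mb):
    """Return resources running the given OS major release with more than min_memory_mb of RAM."""
    return [
        r for r in resources
        if r.os_name == os_name
        and r.os_release.split(".")[0] == os_major
        and r.memory_mb > min_memory_mb
    ]


# Invented sample data, for illustration only.
catalogue = [
    ComputingResource("SITE-A", "ScientificSL", "6.4", 4096),
    ComputingResource("SITE-B", "ScientificSL", "5.9", 8192),
    ComputingResource("SITE-C", "ScientificSL", "6.3", 2048),
]

matches = find_resources(catalogue, "ScientificSL", "6", 3 * 1024)
print([r.site for r in matches])
```

The point is that the same filter works for every site only because all sites describe their resources with the same attributes.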

GLUE history
• The European DataGrid project (EDG, the predecessor of EGEE and EGI) initially had its own information model (2001)
• The GLUE (Grid Laboratory for a Uniform Environment) project was a collaboration between EDG, EU DataTAG, iVDGL (predecessor of OSG) and Globus to promote interoperability
  – GLUE 1.0 was defined in September 2002
  – Version 1.1 was released with some minor improvements in April 2003, and deployed by EDG and then LCG and EGEE in 2003/4
  – Version 1.2 was agreed in February 2005, implemented in May 2005 and deployed (fairly gradually) by WLCG/EGEE in 2006
  – Version 1.3 was agreed in October 2006, implemented in December 2006 and deployed from 2007

GLUE 1 concepts
• Independent structures for computing and storage
• Computing part split into Cluster (hardware and software on batch workers) and CE (Grid interface and batch system)
• Storage part pre-dated the SRM protocol
  – Storage area: a chunk of space accessible by one or more VOs
  – Access protocol: GridFTP, http, file, …
  – Some SRM support added in the 1.3 revision
• Generic Service object added in the 1.2 revision
  – Basically a network endpoint
• Limited scope for adding new information without a schema revision

Problems with GLUE 1
• GLUE 1 has worked, but we had many accumulated issues
• The schema definitions were based on limited experience
  – Initially only for CE and SE, with completely different structures
  – Service object different again
  – Embedded assumptions which turned out to be too restrictive/specific
    • Only two CPU benchmarks
• Definitions not always clear, documentation somewhat limited
  – Many things effectively defined by WLCG/EGEE practice
• Evolution was fairly slow: two upgrades in four years
  – The GLUE project finished, leaving just an ad-hoc group of interested people
  – Slow to deploy upgraded information providers and clients
• Backward compatibility maintained, which was a significant constraint
  – Changes often “shoe-horned” into the available structure
  – Many legacy objects/attributes

GLUE 2
• We always intended to defer conceptual and structural changes to a major revision called GLUE 2
  – Complete redesign, no backward compatibility
  – Incorporating lessons from many years of experience
• Decision made to define GLUE 2 within the Open Grid Forum (OGF)
  – Many (~14) Grid projects participating
  – Kickoff in January 2007; specification approved in March 2009
• Positive outcomes of OGF process
  – GLUE 2 is a Grid standard, widely accepted within OGF
    • Interacts with other OGF standards (BES, SAGA, JSDL, …)
  – Increased participation from, and hence acceptance by, other projects, which helps interoperability
  – Raised visibility/commitment within WLCG/EGEE/EGI/EMI
    • EMI adopted GLUE 2 as a unified standard

Key concepts
[Diagram: the main GLUE 2 entities and the relations between them. Entities: User Domain, Admin Domain, Service, Endpoint, Share, Manager, Resource, Activity, Access Policy, Mapping Policy. Relation labels: Negotiates Share with, Provides, Contacts, Has, Maps User to, Defined on, Runs.]

Major improvements
• Universal concept of a Service as a coherent grouping of Endpoints, Managers and Resources
  – ComputingService and StorageService are specialisations, sharing a common structure as far as possible
    • ComputingService has a better separation of Grid endpoint, LRMS and queue/fairshare
    • StorageService designed for SRM v2
  – Generic concepts for Manager (software) and Resource (hardware)
  – Clients can have a standard structure
• All objects are extensible
  – We always find new cases we didn’t anticipate!
  – So far no need for a GLUE 2.1
• Some concepts made more generic/flexible by making them separate objects rather than attributes
  – Location, Contact, Policy, Benchmark, Capacity
• More complete/rigorous definitions
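The idea of a Service as a grouping of Endpoints, Managers and Resources, with computing and storage as specialisations, can be pictured with a few Python classes. This is a loose sketch, not the GLUE 2 schema: every field name here (including the `other_info` dictionary standing in for extensibility) is an assumption made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Endpoint:
    id: str
    url: str
    # Free-form extension slot (hypothetical name) so new cases need no schema change.
    other_info: Dict[str, str] = field(default_factory=dict)


@dataclass
class Manager:
    id: str
    product_name: str  # the software managing the resources


@dataclass
class Resource:
    id: str
    description: str  # the underlying hardware


@dataclass
class Service:
    id: str
    service_type: str
    endpoints: List[Endpoint] = field(default_factory=list)
    managers: List[Manager] = field(default_factory=list)
    resources: List[Resource] = field(default_factory=list)


@dataclass
class ComputingService(Service):
    """Specialisation for computing; shares the generic structure."""


@dataclass
class StorageService(Service):
    """Specialisation for storage; shares the generic structure."""


def endpoint_urls(service: Service) -> List[str]:
    """Generic client code: works for any kind of Service."""
    return [e.url for e in service.endpoints]


ce = ComputingService("ce-1", "example.computing",
                      endpoints=[Endpoint("ep-1", "https://ce.example.org:8443")])
se = StorageService("se-1", "example.storage",
                    endpoints=[Endpoint("ep-2", "httpg://se.example.org:8446",
                                        other_info={"note": "illustrative"})])
print(endpoint_urls(ce) + endpoint_urls(se))
```

Because both specialisations share the common structure, `endpoint_urls` needs no knowledge of whether it is handed a computing or a storage service, which is what "clients can have a standard structure" means in practice.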

The BDII

History
• A suite of Grid middleware was developed as part of the EU-funded EDG (2001-4) and EGEE (2004-10) projects
• EDG started with the tools available in version 2 of the Globus toolkit; the information system component was the Monitoring and Discovery System (MDS), based on the Lightweight Directory Access Protocol (LDAP) technology
• The intention was to develop a new technology to support the information system, but in practice the LDAP-based system was evolved incrementally into the current Berkeley Database Information Index (BDII)

LDAP
• LDAP dates from 1995 (RFC 1777). It provides a network-accessible interface to a simple (non-relational) tree-structured database. It is supported as standard in Linux distributions; common applications are user databases and address books.
  – Standard query tool: ldapsearch
  – APIs for several languages
  – Standard text format for data: LDAP Data Interchange Format (LDIF)
• Various backends are supported to store the underlying information, one of which is an embedded key-value database called Berkeley DB (the Berkeley Database).
• LDAP supports SASL authentication, but this is not used by the BDII.

The BDII
• The BDII uses LDAP in a somewhat unusual way. The information is a snapshot of the current state, updated every few minutes. There is therefore no need for long-term persistency or backup, but there is a requirement to update the underlying database at a fairly high rate.
  – Useful to hold the database in memory
• The database is updated either by running local information providers (scripts which output LDIF-formatted information) or by querying another BDII
• Information in the database can be cached to allow for the possibility of information being temporarily unavailable
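An information provider, as described above, is essentially a script that prints LDIF. A minimal sketch in Python follows; it is not one of the real providers. The host name, the version value and the exact DN are invented, and values that would need base64 encoding or line folding in LDIF are not handled.

```python
def to_ldif(dn, attributes):
    """Render one entry as LDIF text: a dn line, then one 'name: value' line per value."""
    lines = [f"dn: {dn}"]
    for name, values in attributes.items():
        if not isinstance(values, (list, tuple)):
            values = [values]
        for value in values:
            lines.append(f"{name}: {value}")
    # A blank line terminates the entry.
    return "\n".join(lines) + "\n\n"


def service_provider():
    """Hypothetical provider describing a single service (all values are made up)."""
    dn = "GlueServiceUniqueID=bdii.example.org_bdii_site,mds-vo-name=resource,o=grid"
    attributes = {
        "objectClass": ["GlueTop", "GlueService"],
        "GlueServiceUniqueID": "bdii.example.org_bdii_site",
        "GlueServiceType": "bdii_site",
        "GlueServiceVersion": "1.2.0",
    }
    return to_ldif(dn, attributes)


ldif_text = service_provider()
print(ldif_text, end="")
```

A real provider would obtain the values from configuration files, system commands or queries to the service rather than hard-coding them.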

Rendering GLUE in LDAP
• The GLUE information model is an abstract structure, formed from objects, attributes and relations
  – It needs to be mapped into the concepts of a specific technology
• For LDAP this is mostly straightforward
  – GLUE was defined with LDAP in mind
  – Limited data types, basically just string and integer
    • Type checking needs to be external
  – No built-in relations
    • Use attributes containing the ID of a related object
• LDAP structures objects into a tree, but that has no direct mapping to GLUE
  – GLUE 1 and GLUE 2 in separate trees with different roots
  – Subtree per Grid site
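Two of the points above, relations expressed as attributes holding the ID of a related object and type checking done outside LDAP, can be illustrated with plain dictionaries. The attribute names (`ServiceForeignKey`, `Port`) and the IDs are simplified placeholders, not the real LDAP rendering.

```python
def related(objects, target_id, foreign_key):
    """Return IDs of objects whose foreign-key attribute points at target_id."""
    return sorted(
        obj_id for obj_id, attrs in objects.items()
        if attrs.get(foreign_key) == target_id
    )


def is_integer(text):
    """External type check: the directory itself only stores strings."""
    try:
        int(text)
    except ValueError:
        return False
    return True


def type_errors(attrs, integer_attributes):
    """Return the names of attributes that should be integers but are not."""
    return [
        name for name in integer_attributes
        if name in attrs and not is_integer(attrs[name])
    ]


# Invented sample objects; every value is a string, as it would be in LDAP.
objects = {
    "svc-1": {"Type": "example.storage"},
    "ep-1": {"ServiceForeignKey": "svc-1", "Port": "8443"},
    "ep-2": {"ServiceForeignKey": "svc-1", "Port": "eight"},
    "ep-3": {"ServiceForeignKey": "svc-2", "Port": "2170"},
}

print(related(objects, "svc-1", "ServiceForeignKey"))
print(type_errors(objects["ep-2"], ["Port"]))
```

Following a relation is therefore a second lookup by ID rather than a join, and a malformed value such as `"eight"` is only caught if some client or validator checks it.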

Information system structure
• The information system uses a 3-level hierarchy of BDIIs
• A resource BDII runs on an individual service node, and publishes information about it by running local information providers
• A site BDII pulls information from all resource BDIIs at a site
  – And also adds some information about the site as a whole
• A top-level BDII pulls information from all site BDIIs
  – Bootstrapped from a central database (GOC DB)
  – There are many top BDIIs in the Grid
• Site and top BDIIs are Grid services and hence have their own resource BDIIs!
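The three levels can be pictured as successive merges of snapshots. The following Python sketch models each snapshot as a dictionary keyed by a made-up entry name; the function names and keys are assumptions for illustration, and the real BDII works with LDAP queries and LDIF rather than in-process calls.

```python
def merge(snapshots):
    """Combine several {entry_name: attributes} snapshots into one."""
    combined = {}
    for snapshot in snapshots:
        combined.update(snapshot)
    return combined


def resource_bdii(providers):
    """Resource level: run the local information providers and merge their output."""
    return merge(provider() for provider in providers)


def site_bdii(site_name, resource_snapshots):
    """Site level: pull from every resource BDII and add an entry for the site itself."""
    snapshot = merge(resource_snapshots)
    snapshot[f"site={site_name}"] = {"Name": site_name}
    return snapshot


def top_bdii(site_snapshots):
    """Top level: pull from every site BDII."""
    return merge(site_snapshots)


# Invented providers standing in for real scripts.
site_a = site_bdii("SITE-A", [
    resource_bdii([lambda: {"service=ce1": {"Type": "CE"}}]),
    resource_bdii([lambda: {"service=se1": {"Type": "SE"}}]),
])
site_b = site_bdii("SITE-B", [
    resource_bdii([lambda: {"service=ce2": {"Type": "CE"}}]),
])
top = top_bdii([site_a, site_b])
print(sorted(top))
```

Every top-level snapshot ends up holding the entries of every site, which is the "full information for the whole Grid" property discussed later under longer-term issues.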

Distributed LDAP Hierarchy
[Diagram: a top-level BDII queries the site BDIIs of Site A, Site B and Site C over LDAP; each site BDII in turn queries the resource BDIIs at its site over LDAP.]

LDAP queries
• LDAP is suited to efficient query performance
  – Common queries can be optimised by indexing important attributes in the underlying DB
  – Example: extract the version of every site BDII (and count them):
      time ldapsearch -x -h lcg-bdii.cern.ch -p 2170 -b o=grid '(&(objectclass=GlueService)(GlueServiceType=bdii_site))' GlueServiceVersion | grep Version: | wc -l
      328
      real 0m0.196s
• LDAP is unfamiliar to most users, and queries have a rather idiosyncratic syntax
  – Need to package queries inside other tools for common use
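As one way of packaging such a query inside another tool, the text output of ldapsearch can be post-processed in a few lines of Python. The sketch below tallies versions instead of only counting lines; the sample output is invented, and base64-encoded values (written with a double colon in LDIF) are simply skipped.

```python
from collections import Counter


def count_versions(ldif_text, attribute="GlueServiceVersion"):
    """Tally the values of one attribute in ldapsearch (LDIF) output."""
    prefix = attribute + ":"
    counts = Counter()
    for line in ldif_text.splitlines():
        # Skip 'attr:: ...' lines, which carry base64-encoded values.
        if line.startswith(prefix) and not line.startswith(prefix + ":"):
            counts[line[len(prefix):].strip()] += 1
    return counts


# Invented sample resembling ldapsearch output for three site BDIIs.
sample_output = """\
dn: GlueServiceUniqueID=bdii.site-a.example,Mds-Vo-name=SITE-A,Mds-Vo-name=local,o=grid
GlueServiceVersion: 1.1.0

dn: GlueServiceUniqueID=bdii.site-b.example,Mds-Vo-name=SITE-B,Mds-Vo-name=local,o=grid
GlueServiceVersion: 1.0.5

dn: GlueServiceUniqueID=bdii.site-c.example,Mds-Vo-name=SITE-C,Mds-Vo-name=local,o=grid
GlueServiceVersion: 1.1.0
"""

versions = count_versions(sample_output)
print(sum(versions.values()), dict(versions))
```

In practice the text would come from running the ldapsearch command shown on the slide, for example through `subprocess.run`, rather than from a literal string.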

Information providers
• GLUE 1 publication has been in place for many years
  – Many services use a standard provider
• GLUE 2 required new information providers
• Progressively rolled out from late 2010
  – Simple services came first
  – Computing and storage more complex
• The EMI project made GLUE 2 publication a major objective
  – ARC, UNICORE, gLite and dCache middleware all included
• Globus services now also published in GLUE 2
• GLUE 2 infrastructure is now largely complete

Some statistics
• 364 sites
• 3788 service endpoints
• GLUE 1: 68k objects, 1.2 million attributes, 90 MB
• GLUE 2: 273k objects, 3.2 million attributes, 343 MB
• Just a snapshot, your mileage may vary

Ensuring quality

Resilience
• Important to make the system resilient to failure and overload
• Site and top BDIIs can have multiple load-balanced nodes behind a DNS alias
  – BDIIs are stateless
• Top BDIIs cache information for up to 24 hours
  – Service status attributes set to “unknown” to indicate that dynamic state information is stale
• Standard top BDIIs for each region are defined and monitored by EGI
  – Target is 99% availability
• Client tools can use a list of top BDIIs for failover
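Two of the mechanisms above, client-side failover over a list of top BDIIs and serving cached entries with their status marked as unknown, can be sketched as follows. This is an illustration under stated assumptions, not the BDII or client implementation: the host names, the `Status` key and the `fake_query` function are all hypothetical.

```python
CACHE_LIFETIME = 24 * 3600  # seconds; the slide gives "up to 24 hours"


def query_with_failover(hosts, run_query):
    """Try each top BDII in turn and return (host, result) from the first that answers."""
    last_error = None
    for host in hosts:
        try:
            return host, run_query(host)
        except OSError as error:  # covers timeouts and refused connections
            last_error = error
    raise ConnectionError("no top BDII answered") from last_error


def cached_view(entries, fetched_at, now, max_age=CACHE_LIFETIME):
    """Return cached entries with status 'unknown', or nothing once the cache has expired."""
    if now - fetched_at > max_age:
        return []
    return [dict(entry, Status="unknown") for entry in entries]


def fake_query(host):
    """Stand-in for a real LDAP query; the first host is pretended to be down."""
    if host == "top-bdii-1.example.org":
        raise TimeoutError("no response")
    return ["entry from " + host]


used_host, result = query_with_failover(
    ["top-bdii-1.example.org", "top-bdii-2.example.org"], fake_query)
print(used_host, result)

stale = cached_view([{"ID": "svc-1", "Status": "ok"}], fetched_at=0, now=3600)
print(stale)
```

Failover in the client and caching in the top BDII address different failures: the first covers an unreachable top BDII, the second a site whose BDII has temporarily stopped answering.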

Validation
• The information in the information system comes from many different sources
  – Hardwired values
  – Configured by the sysadmin
  – Standard unix commands
  – Dynamic queries to services
• Many ways to go wrong: information is not always reliable
• Traditionally the validation of the information was rather ad-hoc
  – Diagnose and fix after a problem is observed
• For GLUE 2 we wanted to do better
  – Proactive detection of errors

EGI GLUE 2 profile
• GLUE 2 is intentionally very flexible
• Many ways to use it, not necessarily interoperable
• EGI needed a profile to specify how it should be used in our context
  – Detailed semantics of each attribute
  – What should and should not be published
  – Importance and expected use
  – Rules to define what values are allowed
  – Heuristics to detect anomalies
• Profile document written in 2012
  – With wide consultation of interested parties

glue-validator
• A command-line tool (python) which systematically checks published information against the GLUE specification and (optionally) the EGI profile
  – Many options to restrict/filter/format the output
  – Primarily for GLUE 2, but many problems will be common with GLUE 1
• Bugs/issues in middleware fed back to developers
  – Now part of the EGI middleware acceptance criteria
  – Known issues can be excluded from the validation output
• Can be run as a Nagios probe
  – Integrated into the standard EGI monitoring system
  – Automated checks for Grid sites, with tickets raised by shifters
  – Still experimental, but will be moved to production soon
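The kind of check described here, rules for allowed values plus heuristics for anomalies, can be sketched in a few lines of Python. This is not the glue-validator code or its rule format: the rule structure, the attribute names, the allowed values and the thresholds are all assumptions chosen for illustration.

```python
def validate(entry, rules):
    """Check one published object against per-attribute rules.

    Returns a list of (severity, attribute, message) tuples.
    """
    problems = []
    for attribute, rule in rules.items():
        value = entry.get(attribute)
        if value is None:
            if rule.get("required"):
                problems.append(("ERROR", attribute, "missing"))
            continue
        allowed = rule.get("allowed")
        if allowed is not None and value not in allowed:
            problems.append(("ERROR", attribute, f"value {value!r} not allowed"))
        heuristic = rule.get("heuristic")
        if heuristic is not None and not heuristic(value):
            problems.append(("WARNING", attribute, f"suspicious value {value!r}"))
    return problems


# Hypothetical rules: names, allowed values and thresholds are illustrative only.
rules = {
    "ID": {"required": True},
    "HealthState": {"required": True, "allowed": {"ok", "warning", "critical", "unknown"}},
    "RunningJobs": {"heuristic": lambda v: v.isdigit() and int(v) < 1000000},
}

entry = {"ID": "endpoint-1", "HealthState": "fine", "RunningJobs": "999999999"}
for problem in validate(entry, rules):
    print(problem)
```

Splitting hard errors (value not allowed) from warnings (value merely implausible) mirrors the distinction the profile draws between rules and heuristics.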

The future

GLUE 2 migration
• GLUE 1 and GLUE 2 published in parallel
• Most use is still GLUE 1, but:
  – Nearly all services now publish in GLUE 2
  – Validation means that the quality of the information is good and constantly monitored
  – Workload and data management clients support GLUE 2
  – Future developments will be in GLUE 2
• Hope to progressively move to using GLUE 2 over the next ~2 years
  – GLUE 1 will remain as a backup for the foreseeable future

Related systems
• GOC DB: a static central database of sites and services
  – Used as a bootstrap for monitoring and to record downtimes
  – Now moving to GLUE 2 as an information model
    • XML data format
  – Some overlap with the information system: we need a better definition of the relationship
• EMIR: a new service which stores a subset of slowly-changing GLUE 2 information which can be used for service discovery
  – XML data format
  – May be more efficient than the BDII?

Reacting to new developments
• Clouds are somewhat similar to Grids, but have significant differences
  – Services may have different properties
  – Services and sites may appear and vanish dynamically
• EGI is developing a GLUE 2 extension for cloud services
  – Still early days, very experimental
• The information model may need to be extended as computing and storage systems evolve
  – Virtual machines, GPUs and many-core systems
  – Federated storage

Longer-term issues
• The BDII technology is stable and reliable, but:
  – All attributes updated at the same frequency (minutes)
    • The intrinsic rate of change varies from seconds to months
  – One size fits all
    • Top-level BDIIs all contain the full information for the whole Grid
    • Different applications may have different requirements
  – The BDII structure is configured by hand
    • May be less good for dynamic, cloud-like systems?
  – LDAP is non-standard in the Grid world
  – No authentication or authorisation to read
• Plenty of room for new ideas!

References
• OGF GLUE working group home page
  – http://forge.ogf.org/sf/projects/glue-wg
• GLUE 2.0 specification
  – http://www.ogf.org/documents/GFD.147.pdf
• EGI GLUE 2 Profile document
  – http://documents.egi.eu/public/ShowDocument?docid=1324
• Information system home page
  – http://gridinfo.web.cern.ch/
• Information about LDAP
  – http://www.gridpp.ac.uk/deployment/users/Advancedldapsearch.html
• GOC DB documentation
  – https://wiki.egi.eu/wiki/GOCDB_Documentation_Index
• EMIR manual
  – http://eu-emi.github.io/emiregistry/documentation/registry-1.1.1/emirmanual.pdf