Ali. En @GRID Predrag Buncic P. Saiz, J-E. Revsbech R. Piskac, V. Sego, L. Aphecetche Predrag. Buncic@cern. ch
CERN - LHC
9/15/2020

Construction

ALICE Experiment

ALICE Computing
Simulation, Data Challenges & Reconstruction
• Centrally managed production of background events
• Distributed processing and event storage
Event mixing
• Not necessarily centrally managed
• Once background events exist, subsequent requests for event mixing must be routed to the location that holds the required input
Analysis
• Using the AliEn API, PROOF will locate the optimal site(s) for macro execution, try to execute the macro in parallel, collect the output and return it to the user (or register it in the catalogue)
ALICE specific requirements
• Event size 2 GB (simulated events), possibly split into several physical files; 20000 background events required
• Event size 40 MB (Pb+Pb), 1 MB (p+p); 10^9 files/year (x n, n>2)

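For a rough sense of scale, the storage needed for the background sample can be estimated from the figures on this slide. This is a back-of-the-envelope sketch only: it assumes that each of the 20000 background events is a full 2 GB simulated event and that 1 TB = 1000 GB, neither of which is stated on the slide.

```python
# Figures quoted on the slide.
EVENT_SIZE_GB = 2            # size of one simulated event
BACKGROUND_EVENTS = 20_000   # background events required

# Assumption: every background event has the full 2 GB size.
total_gb = EVENT_SIZE_GB * BACKGROUND_EVENTS
# Assumption: decimal units, 1 TB = 1000 GB.
total_tb = total_gb / 1000

print(f"{total_gb} GB = {total_tb:.0f} TB of background events")
```
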
Solution?
Today, if you have a computing problem, GRID is the answer to just about any question…

AliEn@GRID
• First implementation of the ALICE World Wide Computing Model
• A lightweight, simplified but functionally equivalent alternative to a full-blown GRID
• A partial solution that is applicable to our boundary conditions and current requirements for simulation and reconstruction
• Gaining recognition within the Collaboration and in the wider community

Architecture
In brief:
• File catalogue built on top of an RDBMS, with a user interface that mimics a file system
• Authentication module that supports various authentication methods
• Task queue that holds commands to be executed in the system (commands, inputs and outputs are all registered in the catalogue)
• Metadata catalogue
• Services that support the above components
• C/C++/perl API
• 100% perl 5 (95% reusable open source modules)

Components
AliEn components (RPMs):
• AliEn-Base
  • Contains external modules (more than 100, including Globus)
• AliEn-Client
  • Basic client functionality, needed to access LFNs
• AliEn-Server
  • AliEn server, one per Virtual Organization
• AliEn-SE
  • Storage Element, must be installed on sites that provide MSS functionality
• AliEn-CE
  • Computing Element, must be installed if a site wants to participate in production

Components (…)
Optional components:
• AliEn-GUI
  • Graphical user interface, optional component
• AliEn-Monitor
  • Monitor is required by the Server to enable advanced RB features
• AliEn-Portal
  • This is the AliEn Web site
• AliEn-Alice
  • Description of the ALICE Virtual Organization
    • Packages
    • Commands
The RPMs can be found at http://alicedb.cern.ch/GRID/current

Architecture
• API
• Services (file transport, sync)
• Secure authentication service independent of the underlying database
• File catalogue: global file system on top of a relational database
• Central task queue

SQL Backend
• D0: path char(255), dir integer(11), hostIndex <fk>, entryId <pk>
• T2526, T2527 (directory tables, same layout): type char(4), dir integer(8), name char(64), owner char(8), ctime char(16), comment char(80), content char(255), method char(20), methodArg char(255), gowner char(8), size integer(11)
• MIRRORS: mirrorId integer(11), path char(255), methodName char(20), methodArg char(255) (keys marked <fk>, <pk>)
• DBKEYS: Name char(20), DBKey char(128), LastChanges datetime
• FILES: fileId <pk>, path, localCopy, localHost (types: integer(11), char(255), char(20))
• DELETED: path char(255), entryId integer(11)
• TOKENS: ID integer(11) <pk>, Username char(16), Expires datetime, Token char(32), password char(16)
• HOSTS: hostIndex <pk>, address, db, driver, lastUpdate, lastDelete (types: integer(11), char(50), char(40), char(10), integer(11))
• METHODS: methodName <pk>, methodClass, methodArg (types: char(20), char(155))

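A minimal sketch of how a file-system view could sit on top of such tables, using SQLite in place of the real RDBMS. The column names follow the schema above, but the lookup rule (the `dir` value in D0 gives the number of the table `T<dir>` that lists the directory's entries) and the sample rows are assumptions made for illustration, not the confirmed AliEn logic.

```python
import sqlite3


def build_catalogue():
    """Create an in-memory catalogue with one directory and two files."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE D0 (path TEXT, dir INTEGER, hostIndex INTEGER, "
        "entryId INTEGER PRIMARY KEY)"
    )
    db.execute(
        "CREATE TABLE T2526 (type TEXT, dir INTEGER, name TEXT, owner TEXT, "
        "ctime TEXT, comment TEXT, content TEXT, method TEXT, methodArg TEXT, "
        "gowner TEXT, size INTEGER)"
    )
    # Assumption: D0.dir holds the number of the table that lists this path.
    db.execute(
        "INSERT INTO D0 (path, dir, hostIndex) VALUES (?, ?, ?)",
        ("/alice/simulation/2001-01/V3.05/", 2526, 0),
    )
    # Hypothetical entries; owners and sizes are made up.
    db.executemany(
        "INSERT INTO T2526 (type, dir, name, owner, size) VALUES (?, ?, ?, ?, ?)",
        [("f", 2526, "grun.C", "aliprod", 2048),
         ("f", 2526, "Config.C", "aliprod", 1024)],
    )
    return db


def ls(db, path):
    """List the entries of a catalogue directory, like `ls` on a file system."""
    row = db.execute("SELECT dir FROM D0 WHERE path = ?", (path,)).fetchone()
    if row is None:
        raise FileNotFoundError(path)
    table = f"T{int(row[0])}"  # int() keeps the table name safe to interpolate
    return [name for (name,) in db.execute(f"SELECT name FROM {table} ORDER BY name")]


if __name__ == "__main__":
    print(ls(build_catalogue(), "/alice/simulation/2001-01/V3.05/"))
```
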
File catalogue
[Figure: catalogue directory tree (Tier 1), with regions labelled ALICE LOCAL, ALICE USERS and ALICE SIM. Entries shown: cern.ch/user/ with per-user directories a/admin/, a/aliprod/, b/barbera/, f/fca/, p/psaiz/; as/, dos/, local/; simulation/2001-01/V3.05/ with Config.C and grun.C; job directories 36/, 37/, 38/, each holding stderr, stdin and stdout.]
Files, commands (job specifications), job input and output, and tags are all stored in the catalogue.

File organization
[tbed0007d.cern.ch] /alice/simulation/2001-02/V3.06/00001/ > tree
|--./
|  |--00001/
|  |  |--galice.root
|  |--00002/
|  |  |--galice.root
|  …
[tbed0007d.cern.ch] /proc/33608/ > tree
|--./
|  |--Config.C
|  |--stderr
|  |--grun.C
|  |--stdin
|  |--stdout
Forgotten wisdom: by organizing files into a directory structure one can already tell a lot about file content, define cleanup and access policies, and optimize access performance.

Tags
|--./
|  |--r3418_01-01.ds
|  |--r3418_02-02.ds
|  |--r3418_03-03.ds
|  |--r3418_04-04.ds
|  |--r3418_05-05.ds
|  |--r3418_06-06.ds
|  |--r3418_07-07.ds
|  |--r3418_08-08.ds
|  |--r3418_09-09.ds
|  |--r3418_10-10.ds
|  |--r3418_11-11.ds
|  |--r3418_12-12.ds
|  |--r3418_13-13.ds
|  |--r3418_14-14.ds
|  |--r3418_15-15.ds
[Figure: table D0 (path, dir, hostIndex, entryId) and directory tables T2526 and T2527 (type, dir, name, owner, ctime, comment, content, method, methodArg, gowner, size).]
• The file catalogue on its own knows nothing about file content
• Additional information can be added to describe file properties (metadata)
• In the AliEn environment this is achieved by attaching an arbitrary number of TAG tables to the corresponding directory table
lfn://alien.cern.ch/alice/simulation/2001%/V3.05/%/galice.root?npart>1000#mytag
The search first selects all tables on the basis of the file name selection, then locates all tables that correspond to the “mytag” definition, applies the selection, and finally returns only the list of files for which the attribute search succeeded.

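The query URL above can be read as four parts: host, file name pattern (with `%` as wildcard), attribute condition, and tag name. The sketch below shows one way such a search could work, using SQLite. It is an illustration under simplifying assumptions: a single hypothetical `files` table stands in for the per-directory tables, the tag table is keyed by LFN, only conditions of the form `name<op>integer` are handled, and the sample `npart` values are invented.

```python
import re
import sqlite3


def parse_lfn_query(url):
    """Split lfn://host/path?condition#tag into its four parts."""
    if not url.startswith("lfn://"):
        raise ValueError("not an lfn:// URL")
    rest = url[len("lfn://"):]
    rest, _, tag = rest.partition("#")
    rest, _, condition = rest.partition("?")
    host, _, path = rest.partition("/")
    return host, "/" + path, condition, tag


def search(db, url):
    """Return the LFNs that match the name pattern and the tag condition."""
    _host, pattern, condition, tag = parse_lfn_query(url)
    match = re.fullmatch(r"(\w+)(>=|<=|>|<|=)(\d+)", condition)
    if match is None or re.fullmatch(r"\w+", tag) is None:
        raise ValueError("unsupported query")
    attribute, operator, value = match.groups()
    # Table and column names were validated above, so they can go into the
    # SQL text; the pattern and the value are passed as bound parameters.
    sql = (
        f"SELECT f.lfn FROM files AS f JOIN {tag} AS t ON t.lfn = f.lfn "
        f"WHERE f.lfn LIKE ? AND t.{attribute} {operator} ? ORDER BY f.lfn"
    )
    return [lfn for (lfn,) in db.execute(sql, (pattern, int(value)))]


def demo_catalogue():
    """Hypothetical catalogue with three files and one tag table, `mytag`."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE files (lfn TEXT PRIMARY KEY)")
    db.execute("CREATE TABLE mytag (lfn TEXT, npart INTEGER)")
    rows = [
        ("/alice/simulation/2001-01/V3.05/00001/galice.root", 1500),
        ("/alice/simulation/2001-01/V3.05/00002/galice.root", 800),
        ("/alice/simulation/2001-02/V3.06/00001/galice.root", 5000),
    ]
    db.executemany("INSERT INTO files VALUES (?)", [(lfn,) for lfn, _ in rows])
    db.executemany("INSERT INTO mytag VALUES (?, ?)", rows)
    return db
```

With this sample data, the query from the slide selects only the first file: the second fails the `npart>1000` condition and the third does not match the `V3.05` part of the name pattern.
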
GUI: AliEn Xfiles

Command Interface

Organisation Cfg.
An Organisation defines:
• Sites
• People
• Service environment
• Packages

Site Cfg.
Site defines:
• Log directories
• Packages to install
• CE
• SE
• Host configuration
CE defines:
• Type of local queue
• Command arguments to submit jobs, check their status and kill them (if different from the default)
SE defines:
• Type of MSS
• Base directory

Authentication - SASL
SASL is the Simple Authentication and Security Layer, a method for adding authentication support to connection-based protocols.
• To use SASL, a protocol includes a command for identifying and authenticating a user to a server and for optionally negotiating protection of subsequent protocol interactions
• If its use is negotiated, a security layer is inserted between the protocol and the connection
• It can be used on the client or server side to provide authentication (see RFC 2222 for more information)
• OpenLDAP v2.x uses SASL

SASL mechanisms
The following mechanisms are included in the Cyrus SASL distribution:
• ANONYMOUS
• CRAM-MD5
• KERBEROS_V4
• PLAIN
• SCRAM-MD5 (deprecated)
• GSSAPI (MIT Kerberos 5 or Heimdal Kerberos 5)
• DIGEST-MD5
• LOGIN (unsupported)
• SRP (unsupported, may not work)
Globus MDS 2.1 uses the Globus/GSI implementation of GSSAPI.

AliEn SASL implementation
AliEn now has a perl module that implements GSSAPI. This allows us to use:
• all SASL authentication schemes
• old AliEn authentication (token, AFS password, SSH)
• X509 certificates
• Globus/GSI (credential delegation)
The AliEn distribution now includes the necessary Globus/MDS/GSI software. This allows us to develop secure peer-to-peer file transfers based on machine/protocol/user certificates, and LDAP-based configuration management.

Authentication
[Figure: authentication flow between Client, Proxy Server, LDAP and Database. Message labels: Request methods; List of methods; SASL Authentication; Checking if user exists; Data. Methods shown: X509 (AliEn/Globus), PKI/RSA (ssh), Token (AliEn), AFS password.]

Token or PKI Key
To obtain either a private key or a token, run:
alien CreateKeys
alien UpdateToken
This will prompt you for a password. The password is sent to the AliEn authentication server using SOAP over SSL. The authentication server checks that you exist in our LDAP server and that your password is correct (checked with PAM). For the ALICE VO, the password is the CERN AFS password.

Certificates
Currently AliEn trusts certificates signed by the CERN CA or the AliEn CA. This means that normal certificates issued by CERN are valid as credentials.
A certificate is requested by typing (on a DataGrid machine):
grid-cert-request
To register your certificate for use with AliEn, run:
alien RegisterCert

Services
[Figure: AliEn services at three levels, labelled Organization, Site and User. Services shown: IS, Proxy, Logger, Authen, CPUServer, Cluster Monitor, CE, SE, FTD, Process Monitor, Client.]

Services
Communication between services is done via SOAP, using certificate-based SSL authentication.
[Figure: CPUServer, FTD, ClusterMonitor and Client, each holding a certificate.]

Getting a file (from local SE)
1. Client: Get LFN
2. Proxy / Authen: LFN? Answer: PFN & SE
3. SE at the site of the client: PFN? Answer: File

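The three steps above amount to two lookups: logical file name (LFN) to physical file name (PFN) plus storage element, then PFN to file. A minimal sketch, with plain Python objects standing in for the services; the class, the function, the PFN and the file contents are all hypothetical, and the real services talk over SOAP rather than direct calls.

```python
class StorageElement:
    """Stand-in for an SE: maps physical file names (PFNs) to file contents."""

    def __init__(self, name, files):
        self.name = name
        self._files = dict(files)

    def fetch(self, pfn):
        return self._files[pfn]


def get_file(lfn, catalogue, local_se):
    """Resolve an LFN through the catalogue and read it from the local SE."""
    # Step 1: the client asks for a logical file name (LFN).
    # Step 2: the catalogue answers with the SE that holds it and the PFN.
    se_name, pfn = catalogue[lfn]
    if se_name != local_se.name:
        raise LookupError(f"{lfn} is held by remote SE {se_name}")
    # Step 3: the SE at the client's site turns the PFN into the file itself.
    return local_se.fetch(pfn)


# Hypothetical sample data.
CATALOGUE = {"/alice/simulation/2001-01/V3.05/Config.C": ("CERN", "/data/se/0001")}
CERN_SE = StorageElement("CERN", {"/data/se/0001": b"// Config.C contents"})
```

A file held by another SE raises `LookupError` here; in the real system that case is handled by the remote transfer described on the next slide.
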
Getting a file (from remote SE)
[Figure: Client (Get lfn), IS, Proxy, Authen, two SEs and two FTDs, with arrows numbered 4 to 9.]
4. Transfer file
5. Get remote host
6. Request transfer
7. Bring file from MSS
8. File ready
9. Start transfer

Portal
http://alien.cern.ch
• Generic Web portal
• User can interact with AliEn
  • submit jobs
  • check job status
• Administrator can
  • configure system
  • monitor status
  • check syslog
  • update distribution

Sending a job…

Task Queue

Task Queue
“Pull” rather than “push” architecture

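In a pull architecture the central queue never assigns work to a site; each computing element asks for a job when it has a free slot. The sketch below illustrates the idea with hypothetical class and method names and a plain FIFO order (the matching of jobs to resources is left out here).

```python
from collections import deque


class TaskQueue:
    """Central queue: jobs wait here until a computing element asks for one."""

    def __init__(self):
        self._jobs = deque()

    def submit(self, job):
        self._jobs.append(job)

    def pull(self):
        """Hand out the oldest waiting job, or None if the queue is empty."""
        return self._jobs.popleft() if self._jobs else None


class ComputingElement:
    """A site that fetches work only while it has free slots."""

    def __init__(self, name, free_slots):
        self.name = name
        self.free_slots = free_slots
        self.running = []

    def ask_for_work(self, queue):
        while self.free_slots > 0:
            job = queue.pull()
            if job is None:
                break  # nothing waiting; ask again later
            self.running.append(job)
            self.free_slots -= 1


task_queue = TaskQueue()
for job_name in ("job-1", "job-2", "job-3"):
    task_queue.submit(job_name)

site_a = ComputingElement("site-a", free_slots=2)
site_b = ComputingElement("site-b", free_slots=2)
site_a.ask_for_work(task_queue)
site_b.ask_for_work(task_queue)
```

Because the sites initiate every request, the central service does not need to know how busy each site is at any moment, and a site that is down simply stops asking.
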
Submitting a job
[Figure: job submission in four numbered steps involving Client, Cluster Monitor, CPUServer, IS, Proxy and Authen. Labels: submit; Registering stdin.]

Executing a job
[Figure: job execution in three numbered steps involving IS, Proxy, CPUServer, Cluster Monitor, CE and Process Monitor.]
Possible local queues:
• LSF
• PBS
• BQS
• Globus
• CONDOR
• DQS

Resource Broker Optimizer

Resource broking
• Get job description (Condor ClassAd)
• Get resource description (Condor ClassAd)
• Matching
Perl ClassAd object/module: wraps the C++ Condor ClassAd library using an « automated » tool, SWIG (Simplified Wrapper and Interface Generator, www.swig.org)

Resource Broker
[Figure: the Broker between AliEn tasks and CEs. Match? No: Next. Yes: Select.]
alien job-submit job.jdl
The CE contacts the CPUServer and presents its own ClassAd. The Resource Broker matches it against the job ClassAds and selects the most appropriate job to run on that CE.

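The matching loop can be sketched as follows. This is not the Condor ClassAd library: Python functions stand in for ClassAd expressions, matching is checked from the job side only, and "most appropriate" is taken to mean "highest rank", which is an assumption. The attribute names (`memory_mb`, `packages`, `queue`) and the sample jobs are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass
class JobAd:
    """A job description: what it requires of a CE and how much it likes one."""
    job_id: int
    requirements: Callable[[dict], bool]
    rank: Callable[[dict], float]


def select_job(ce_ad: dict, queue: Sequence[JobAd]) -> Optional[JobAd]:
    """Return the best waiting job for the CE that presented `ce_ad`."""
    # Match? No: next job. Yes: keep it as a candidate.
    matching = [job for job in queue if job.requirements(ce_ad)]
    if not matching:
        return None
    # Assumption: the most appropriate job is the one with the highest rank.
    return max(matching, key=lambda job: job.rank(ce_ad))


# Hypothetical waiting jobs.
waiting_jobs = [
    JobAd(1, lambda ce: ce["memory_mb"] >= 2048, lambda ce: 1.0),
    JobAd(2, lambda ce: "AliRoot" in ce["packages"], lambda ce: 5.0),
    JobAd(3, lambda ce: ce["queue"] == "PBS", lambda ce: 9.0),
]

small_ce = {"memory_mb": 1024, "packages": ["AliRoot"], "queue": "LSF"}
big_ce = {"memory_mb": 4096, "packages": [], "queue": "PBS"}
bare_ce = {"memory_mb": 512, "packages": [], "queue": "LSF"}
```

Here `small_ce` matches only job 2, `big_ce` matches jobs 1 and 3 and gets job 3 because of its higher rank, and `bare_ce` matches nothing, so it is told there is no work.
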
Monitoring
• In order to develop and deploy a sensible Resource Broker we need a monitoring framework
• Frequent data updates, large data volume for a large number of computers
[Figure: hierarchy of GRID CENTER, Local Center and Computer.]
The idea is to implement a hierarchy of clients and servers in which each client (child) maintains the history of its measurements and reports summary information to the upper layer (parent) using the SOAP protocol.

Subscribe
[Figure: a new child sends subscribe to the Parent Monitor and joins Child 1, Child 2, ..., Child n. The parent Monitor and the child Monitors each keep a local CSV database.]

Data collection
• Parent Monitor: infinite loop, requesting data at regular time intervals (SOAP)
• Child 1, Child 2, ..., Child n: each child responds to the request, sending its current data (SOAP)
• The parent Monitor and the child Monitors each keep a local CSV database

Disconnect and failure
[Figure: Parent Monitor with Child 1, Child 2, ..., Child n, an ex-child and a broken child; Monitors keep a local CSV database.]
• Ex-child sends unsubscribe: an old child has unsubscribed
• Broken child: no response! An old child has unsubscribed

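The subscribe, data collection and failure handling described on the monitoring slides can be sketched together as one small class. Direct method calls stand in for the SOAP requests, an in-memory list stands in for the local CSV database, the summary is taken to be the mean of the children's values, and a `ConnectionError` models a child that does not respond; all of these are assumptions made for illustration.

```python
class Monitor:
    """One node of the hierarchy; a leaf reads a sensor, a parent polls children."""

    def __init__(self, name, sensor=None):
        self.name = name
        self.sensor = sensor   # callable returning a measurement (leaves only)
        self.children = {}
        self.history = []      # stand-in for the local CSV database

    def subscribe(self, child):
        self.children[child.name] = child

    def unsubscribe(self, name):
        self.children.pop(name, None)

    def poll(self):
        """Return this node's current value and record it in the history."""
        if self.sensor is not None:
            value = self.sensor()
        else:
            values = []
            for name, child in list(self.children.items()):
                try:
                    values.append(child.poll())
                except ConnectionError:
                    # No response: treat the child as unsubscribed.
                    self.unsubscribe(name)
            # Assumption: the summary passed upwards is the mean value.
            value = sum(values) / len(values) if values else 0.0
        self.history.append(value)
        return value


def broken_sensor():
    raise ConnectionError("no response")


local_center = Monitor("local-center")
local_center.subscribe(Monitor("computer-1", sensor=lambda: 1.0))
local_center.subscribe(Monitor("computer-2", sensor=lambda: 3.0))
local_center.subscribe(Monitor("computer-3", sensor=broken_sensor))

grid_center = Monitor("grid-center")
grid_center.subscribe(local_center)
summary = grid_center.poll()
```

In a real deployment `poll` would run in the parent's loop at regular intervals; here a single call is enough to show the summary travelling up two levels and the broken child being dropped.
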
Data representation
Preview of the whole GRID

Production Summary
• 10^5 CPU hours
• 13 clusters, 9 sites
• 5682 events validated, 118 failed (2%)
• Up to 300 concurrently running jobs worldwide (5 weeks)
• 5 TB of data generated and stored at the sites with mass storage capability (CERN 73%, CCIN2P3 14%, LBL 14%, OSC 1%)
• GSI, Karlsruhe, Dubna, Nantes, Budapest, Bari, Zagreb, Birmingham(?), Calcutta in addition ready by now

The Spiral Model
1. Planning: determination of objectives, alternatives and constraints
2. Risk analysis: analysis of alternatives and identification/resolution of risks
3. Engineering: development of the "next level" product
4. Customer evaluation: assessment of the results of engineering

Plans
✓ C/C++ API (January)
✓ GSI/Globus certificate authentication (March)
Adding new services
✓ Monitoring and exception handling (February)
✓ Use Condor ClassAds for resource description and matching (February)
• Queue optimization (March)
• Support for interactive jobs (April)
• Virtual datasets (May)
• Disk pool/cache manager (June)
• Modular “kernel” using POE (July)
• Implementation of Web services (August)
  ✓ SOAP (Simple Object Access Protocol)
  • WSDL (Web Services Description Language)
  • UDDI (Universal Description Discovery & Integration)

How to proceed?
• Follow up ongoing in-house developments (DataGRID)
• Maintain compatibility and use standard solutions
• Keep ALICE users happy
• Look towards the future…

History

Future?

AliEn as a meta-GRID
[Figure: the AliEn User Interface on top of three stacks: iVDGL stack, AliEn stack, EDG stack.]

Conclusions
• The AliEn framework is a lightweight, simplified but functionally equivalent alternative to a full-blown GRID
• It is a partial solution that meets the boundary conditions and current requirements for simulation and reconstruction for next-generation HEP experiment(s)
• AliEn has passed its first field tests and is gearing up for the next production
• We picked the right direction (SOAP, Web services, standard components) one year ago, and that gives us a competitive edge
• Following mainstream developments in computing pays off, but one must always preserve critical judgment and not rush to implement the latest and greatest buzzwords
• In the mid to long term, ALICE remains committed to integrating AliEn with DataGRID solutions once they become available
• The goal is not to deliver the most exciting and advanced computing exercise but to deliver data to ALICE users

- Slides: 52