Fabric Management
The European DataGrid Project Team
http://www.eu-datagrid.org

Contents
- Definition
- Fabric Management architecture overview
- User job management overview
- Present status (R 1.2)
  - LCFG
  - LCAS + edg_gatekeeper
Fabric Management Tutorial

What is Fabric Management?
- Definitions:
  - Cluster (or Farm): "A collection of computers on a network that can function as a single computing resource through the use of tools which hide the underlying physical structure".
  - Fabric: "A complete set of computing resources (processors, memory, storage) operated in a coordinated fashion with a uniform set of management tools".
- Functionality:
  - Enterprise system administration, scalable to ~10K nodes
  - Provision for running grid jobs
  - Provision for running local jobs

Architecture logical overview
[Diagram. Grid-side elements: Grid User, Resource Broker, Data Mgmt, Grid Info Services, other services. Fabric mgt subsystems: Fabric Gridification, Resource Management, Monitoring & Fault Tolerance, Configuration Management, Installation & Node Mgmt. Also shown: Local User, Farm A (LSF), Farm B (PBS), Grid Data Storage (mass storage, disk pools).]

Architecture logical overview (Fabric Gridification)
- Interface between Grid-wide services and the local fabric.
- Provides local authentication, authorization and mapping of grid credentials.

Architecture logical overview (Resource Management)
- Provides transparent access (both job and admin) to different cluster batch systems.
- Enhanced capabilities (extended scheduling policies, advanced reservation, local accounting).
[The diagram on this slide also carries a "(WP 2)" label.]

Architecture logical overview (Installation & Node Mgmt)
- Provides the tools to install and manage all software running on the fabric nodes.
- Agent to install, upgrade, remove and configure software packages on the nodes.
- Bootstrap services and software repositories.
[The diagram on this slide also carries a "(WP 5)" label.]

Architecture logical overview (Configuration Management)
- Provides central storage and management of all fabric configuration information.
- Central DB and set of protocols and APIs to store and retrieve information.

Architecture logical overview (Monitoring & Fault Tolerance)
- Provides the tools for gathering monitoring information on fabric nodes.
- Central measurement repository stores all monitoring information.
- Fault tolerance correlation engines detect failures and trigger recovery actions.
[Diagram labels on this slide include "WP 4 subsystems" and "Other WPs".]

User job management (Grid and local)
[Diagram. Grid-side elements: Grid User, Resource Broker, Data Mgmt, Grid Info Services, other services. Fabric mgt subsystems: Fabric Gridification, Resource Management, Monitoring. Also shown: Local User, Farm A (LSF), Farm B (PBS), Grid Data Storage (mass storage, disk pools).]

User job management (Grid and local)
- Submit job

User job management (Grid and local)
- Publish resource and accounting information

User job management (Grid and local)
- Optimized selection of site

User job management (Grid and local)
- Authorize
- Map grid credentials to local credentials

User job management (Grid and local)
- Select an optimal batch queue and submit
- Return job status and output
[The diagram on this slide also carries a "(WP 2)" label.]

Fabric Management @ Release 1.2
- Installation and configuration: LCFG (Local ConFiGuration system)
- Gridification: LCAS + edg_gatekeeper

LCFG (Local ConFiGuration system)
- LCFG was originally developed by the Computer Science Department of Edinburgh University.
- It handles automated installation, configuration and management of machines.
- Basic features:
  - automatic installation of the O.S.
  - installation/upgrade/removal of all (rpm-based) software packages
  - centralized configuration and management of machines
  - extendible to configure and manage EDG middleware and custom application software

LCFG system architecture: config files
- Server: abstract configuration parameters for all nodes are stored in a central repository (the LCFG config files). "Make XML Profile" turns them into XML profiles, which a web server publishes over HTTP.
- Client nodes: rdxprof reads the profile and keeps it in a local cache; ldxprof loads the profile for the LCFG objects, which build on a generic component.
- A collection of agents read configuration parameters and either generate traditional config files or directly manipulate various services.
- Example LCFG config file:
    +inet.services         telnet login ftp
    +inet.allow            telnet login ftp sshd
    +inet.allow_telnet     ALLOWED_NETWORKS
    +inet.allow_login      ALLOWED_NETWORKS
    +inet.allow_ftp        ALLOWED_NETWORKS
    +inet.allow_sshd       ALL
    +inet.daemon_sshd      yes
    ...
    +auth.users            mickey
    +auth.userhome_mickey  /home/mickey
    +auth.usershell_mickey /bin/tcsh

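The sample config lines on this slide pair a "+component.resource" key with a value. As a rough illustration only, the lookup of one resource could be sketched in shell as below. This is not the LCFG compiler: the function name get_resource and the assumption that each resource sits on a single whitespace-separated line are hypothetical.

```shell
# Hypothetical helper, not part of LCFG: print the value stored under
# "+<component>.<resource>" in a config file with one resource per line.
get_resource() {
    local file=$1 component=$2 resource=$3
    awk -v key="+${component}.${resource}" '
        $1 == key {
            # Strip the key and the blanks after it; the rest is the value.
            sub(/^[ \t]*[^ \t]+[ \t]+/, "")
            print
            exit
        }' "$file"
}
```

For example, get_resource site.conf inet services would print the list of services if site.conf (a made-up file name) contained the lines shown on the slide.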
LCFG system architecture: XML profiles
- Server: abstract configuration parameters for all nodes are stored in a central repository (the LCFG config files). "Make XML Profile" turns them into XML profiles, which a web server publishes over HTTP.
- Client nodes: rdxprof reads the profile and keeps it in a local cache; ldxprof loads the profile for the LCFG objects, which build on a generic component.
- A collection of agents read configuration parameters and either generate traditional config files or directly manipulate various services.
- Example XML profile (excerpt):
    <inet>
      <allow cfg:template="allow_$ tag_$ daemon_$">
        <allow_RECORD cfg:name="telnet">
          <allow>192.168., 192.135.30.</allow>
        </allow_RECORD>
        ...
    ...
      <user_RECORD cfg:name="mickey">
        <userhome>/home/MickeyMouseHome</userhome>
        <usershell>/bin/tcsh</usershell>
      </user_RECORD>
      ...
    </auth>

LCFG system architecture: LCFG objects
[Diagram: LCFG config files → Make XML Profile → web server (XML profile) → HTTP → rdxprof (read profile) → local cache → ldxprof (load profile) → generic component → LCFG objects (inet, auth).]
- Server: abstract configuration parameters for all nodes are stored in a central repository.
- Client nodes: a collection of agents read configuration parameters and either generate traditional config files or directly manipulate various services.

LCFG system architecture: generated config files
- Server: abstract configuration parameters for all nodes are stored in a central repository.
- Client nodes: a collection of agents read configuration parameters and either generate traditional config files or directly manipulate various services.
- On the client, the LCFG objects (inet, auth) turn the loaded profile into traditional config files such as /etc/shadow, /etc/passwd and /etc/group, for example:
    mickey:x:999:20::/home/Mickey:/bin/tcsh

LCFG: what's a component?
- Component == object
- It's a simple shell script (… but moving to Perl)
- Each component provides a number of "methods" (start, stop, configure, ...) which are invoked at appropriate times
- A simple and typical component behaviour:
  - Started when notified of a configuration change
  - Loads its configuration (locally cached)
  - Configures the appropriate services, by translating config parameters into a traditional config file and reloading a service if necessary (e.g. restarting an init.d service)
- LCFG provides components to manage the configuration of services of a machine: inet, auth, nfs, cron, ...
- Admins and middleware developers can build new custom components to configure and manage their own applications

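The "translate parameters into a config file and reload a service if necessary" step can be sketched as follows. This is a minimal illustration under stated assumptions, not code from LCFG: the helper name write_if_changed is hypothetical, and it assumes the generated file content arrives on standard input.

```shell
# Hypothetical helper: install generated content into a config file only
# when it differs from what is already there.
# Returns 0 if the file was (re)written, 1 if it was already up to date,
# 2 if no temporary file could be created.
write_if_changed() {
    local target=$1 tmp
    tmp=$(mktemp) || return 2
    cat > "$tmp"                      # generated config arrives on stdin
    if [ -f "$target" ] && cmp -s "$tmp" "$target"; then
        rm -f "$tmp"
        return 1                      # unchanged: nothing to reload
    fi
    # Overwrite in place so the target keeps its existing permissions.
    cat "$tmp" > "$target" && rm -f "$tmp"
}
```

A component's configure step could call this helper and restart the affected service only when it returns 0, which avoids needless restarts when a profile change does not touch that service.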
LCFG component skeleton

    #!/bin/bash2

    class=mycomp          # set the name of the component (mycomp)
    . /etc/obj/generic    # include std methods and definitions

    Start() {
      # Start the component
      :                   # no-op so that the empty function body is valid
    }

    Configure() {
      # Do the configuration
      :                   # no-op so that the empty function body is valid
    }

    # 'main' program
    case "$1" in
      configure) Configure;;
      *)         DoMethod "$@";;
    esac

LCFG component methods
- Start() method:

    Start() {
      # 'Start' the component
      Generic_Start;    # standard setup steps
      Configure;        # reconfigure the component
      if [ $? = 0 ]; then
        OK "Component mycomp started"
      else
        Fail "Starting component mycomp"
      fi
    }

- Configure() method:

    Configure() {
      # Do the configuration
      LoadResources myresource1 myresource2 …
      CheckResources myresource1 myresource2 …
      # your code
      do_whatever …
      return $status;   # status set by your code
    }

Only the parts marked in red on the slide are specific to a component!

LCFG component example (I)
ldconf component: configures /etc/ld.so.conf (Cal Loomis).
- .def file:
    class ldconf
    methods start configure
    conffile
    paths
- Resources defined on server:
    conffile /etc/ld.so.conf
    paths /usr/local/lib /opt/globus/lib

LCFG component example (II)
Component: Configure() method

    Configure() {
      LoadResources conffile paths
      CheckResources conffile paths

      # Update ld.so.conf file.
      for i in $paths; do
        line=`grep "$i" "$conffile"`
        if [ -z "$line" ]; then
          echo "$i" >> "$conffile"
        fi
      done
      /sbin/ldconfig;
      return $?;
    }

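The loop in this Configure() method tests for a path with a plain grep, which matches substrings: /usr/local/lib counts as present if the file only lists /usr/local/lib64. A possible variant with exact whole-line matching is sketched below. The function name add_paths is hypothetical, and the sketch leaves out the /sbin/ldconfig call so that it can be tried on a scratch file without root privileges.

```shell
# Hypothetical variant of the ldconf loop: append each path to the config
# file unless a line with exactly that text is already present.
add_paths() {
    local conffile=$1 p
    shift
    touch "$conffile" || return 1
    for p in "$@"; do
        # -F: fixed string, -x: match the whole line, -q: no output
        if ! grep -Fxq -e "$p" "$conffile"; then
            printf '%s\n' "$p" >> "$conffile"
        fi
    done
}
```

Calling add_paths twice with the same arguments leaves the file unchanged the second time, so the step stays safe to repeat on every reconfiguration.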
Software distribution with LCFG
- It is done with an LCFG tool called updaterpms.
- The standard system packaging format is used: rpm for Linux.
- It is managed via an LCFG component.
- Functionality:
  1. Compares the packages currently installed on the local node with the packages listed in the configuration
  2. Computes the necessary install/deinstall/upgrade operations
  3. Orders the transaction operations taking into account rpm dependency information
  4. Invokes the packager (RPM) with the right operation transaction set
- Packager functionality:
  1. Reads operations (transactions)
  2. Downloads new packages from the repository
  3. Executes the operations (installs/removes/upgrades)

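Steps 1 and 2 of the functionality list (compare installed against configured packages, then compute the operations) can be sketched as below. This is not updaterpms itself: the function name plan_operations and the input format (two text files with one "name version" pair per line) are assumptions made for illustration.

```shell
# Hypothetical sketch: print the operations needed to move a node from the
# installed package list to the wanted one. "upgrade" here simply means the
# version strings differ.
plan_operations() {
    local installed=$1 wanted=$2
    awk '
        # First file: remember the installed version of each package.
        FILENAME == ARGV[1] { have[$1] = $2; next }
        # Second file: compare each wanted package with what is installed.
        {
            want[$1] = $2
            if (!($1 in have))
                print "install", $1, $2
            else if ((have[$1] "") != ($2 ""))   # force string comparison
                print "upgrade", $1, $2
        }
        END {
            for (name in have)
                if (!(name in want))
                    print "remove", name
        }' "$installed" "$wanted" | sort
}
```

Dependency ordering (step 3) and the rpm transaction (step 4) are deliberately left out; a real tool would hand the computed set to the packager rather than print it.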
Gridification Architecture
[Diagram of the Gridification components, split into "external to fabric" and "internal to fabric", with a legend marking what is implemented in R 1.2. Labels: Grid Scheduler, IMS (WP 3), FabNAT, Information Providers, GriFIS, ComputingElement, EDG Gatekeeper, Globus Gatekeeper, Job repository, Job Manager, Fabric Monitoring, SE, RMS, StorageElement, farms, LCMAPS, LCAS plug-ins, uid/gid, static list, other tokens, wall-clock time, quota check, ACL evaluation, resource use, Credential Rep., FLIDS, Policy (Configuration Mgmt), Configuration Mgmt, Installation Mgmt. Communications take place within an established security context.]

LCAS 1.0 (EDG release 1.2)
- The Local Centre Authorization Service (LCAS) handles authorization requests to the local computing fabric.
- In this release the LCAS is a shared library, which is loaded dynamically by the Globus gatekeeper. The gatekeeper has been slightly modified for this purpose and will from now on be referred to as edg-gatekeeper.
- The authorization decision of the LCAS is based upon the user's certificate and the job specification in RSL (JDL) format. The certificate and RSL are passed to (plug-in) authorization modules, which grant or deny access to the fabric.
- Three standard authorization modules are provided by default:
  - lcas_userallow.mod checks if the user is allowed on the fabric (currently the gridmap file is checked).
  - lcas_userban.mod checks if the user should be banned from the fabric.
  - lcas_timeslots.mod checks if the fabric is open at this time of the day for datagrid jobs.

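The three default modules can be pictured as a chain of checks. The sketch below is illustrative only and is not the LCAS plug-in API: it assumes that access is granted only when every check grants it, that the allow and ban lists are plain files with one certificate subject per line, and that opening hours are a single daily interval. All function and variable names are hypothetical.

```shell
# Illustrative stand-ins for the three default modules. The caller must set
# ALLOW_FILE, BAN_FILE, OPEN_FROM and OPEN_UNTIL (hours, 0-24) beforehand.
user_allowed() {            # cf. lcas_userallow.mod
    grep -Fxq -e "$1" "$ALLOW_FILE"
}
user_not_banned() {         # cf. lcas_userban.mod
    ! grep -Fxq -e "$1" "$BAN_FILE"
}
fabric_open() {             # cf. lcas_timeslots.mod
    [ "$1" -ge "$OPEN_FROM" ] && [ "$1" -lt "$OPEN_UNTIL" ]
}

# Grant access (status 0) only if every check grants it.
authorize() {
    local subject=$1 hour=$2
    user_allowed "$subject" && user_not_banned "$subject" && fabric_open "$hour"
}
```

Because the checks are joined with &&, the first one that denies access ends the evaluation, so a banned user is rejected regardless of the time slot.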
Authentication control flow
[Diagram comparing the gatekeepers. Labels: EDG gatekeeper; GLOBUS Gatekeeper (TLS auth); GLOBUS + LCAS Gatekeeper (TLS auth, LCAS (so)); assist_gridmap; Jobmanager-*. Footnote: "* And store in job repository".]

Summary
- Two main fabric management components deployed in release 1.2:
  - Installation and configuration management functionality: LCFG (Local ConFiGuration system) handles automated installation, configuration and management of machines.
  - Gridification functionality: LCAS (Local Centre Authorization Service) + edg_gatekeeper handle authorization requests to the local computing fabric.
- More components are coming in the next releases in the areas of fabric monitoring, resource management, and configuration management.
- Experience and conclusions from this release: automatic installation and configuration of the DataGrid middleware was very useful for the testbed due to the complexity of the software.
