Achieving High Availability in a SAS Grid Environment
Daniel Wong (dwong@platform.com), Platform Computing Inc.
Copyright © 2008 SAS Institute Inc. All rights reserved. SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration.
Agenda
• Background
• Introduction to SAS Grid and HA
• Techniques for achieving HA
• SAS Metadata Server example
• Future work
Target Audience
• IT managers and administrators who want to improve overall availability in their SAS Grid environment
• Consultants who help customers implement SAS Grid
• Anyone evaluating Grid who wants more general knowledge about HA in the SAS Grid environment
• …
Platform Computing
• 4,000 managed CPUs
• 2,000 customers worldwide
• 500 employees in 15 offices
• 16 years of profitable growth
• 8 years of partnership with SAS
• #1 leader in HPC
SAS-Platform OEM Partnership
• Started with Job Scheduler for SAS Warehouse Administrator v8 in 2001
• Current OEM product: Platform Suite for SAS
• Platform provides the Grid middleware technologies to SAS for:
  • Scheduling SAS workflows
  • Effective management of resources and SAS workload
  • Scaling out parallelized SAS applications (DI, EM, Risk Dim…)
• SAS-Platform R&D teams continually drive new development initiatives to add value for SAS customers
Status of this work
Goal of joint development initiative
• Provide failover capabilities for critical SAS system services on all operating systems supported by SAS Grid Manager
Current status
• Working prototype that can fail over SAS Metadata Server in both Windows and Linux SAS Grid environments
Key advantages of this solution
• Eliminates operating system dependencies
• Eliminates the requirement for a hot standby for failover
• Eliminates the expense of purchasing a third-party tool to provide high availability
SAS Grid 101
Key value propositions / use cases:
• Enterprise Scheduling
• Workload Balancing
• Parallelized Workload Balancing
[Diagram: SAS Enterprise Scheduling]
[Diagram: SAS Workload Balancing]
[Diagram: Parallelized Workload Balancing]
[Diagram: SAS Grid Architecture Topology]
4 key steps to implement HA
1. Monitor: discover service failures
   Capabilities required:
   • Detect failure of individual services
   • Monitor services on all hosts in the Grid
2. Recover: restart the service when a failure is detected
   Capabilities required:
   • Restart services on any host, or on designated hosts, in the Grid
3. Synchronize: ensure client applications and Grid components are aware of the new configuration
   Capabilities required:
   • Support for services running on a host with a different IP address
   • Client apps and Grid components must know about the new arrangement
4. Reconnect: enable client applications to access the service again
   Capabilities required:
   • Client apps must be able to access the service after failover without having to reconfigure
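The four steps can be sketched as a small in-memory model. This is only an illustration under stated assumptions: the `Grid` class, the function names, and the host and service names are hypothetical, and nothing here reflects how EGO implements these steps internally.

```python
from dataclasses import dataclass, field


@dataclass
class Grid:
    """Toy model: host health, service placement, and the names clients see."""
    hosts_up: dict
    placement: dict = field(default_factory=dict)   # service -> host it runs on
    name_table: dict = field(default_factory=dict)  # service -> host clients resolve


def monitor(grid):
    """Step 1: list services whose current host is down."""
    return [s for s, h in grid.placement.items() if not grid.hosts_up.get(h, False)]


def recover(grid, service, candidates):
    """Step 2: restart the service on the first healthy candidate host."""
    for host in candidates:
        if grid.hosts_up.get(host, False):
            grid.placement[service] = host
            return host
    raise RuntimeError(f"no healthy host for {service}")


def synchronize(grid, service):
    """Step 3: publish the new location so clients and grid components see it."""
    grid.name_table[service] = grid.placement[service]


def reconnect(grid, service):
    """Step 4: a client looks the service up by name, with no reconfiguration."""
    return grid.name_table[service]


def failover_pass(grid, candidates_by_service):
    """Run one monitor/recover/synchronize cycle over all failed services."""
    for service in monitor(grid):
        recover(grid, service, candidates_by_service[service])
        synchronize(grid, service)


grid = Grid(hosts_up={"hostA": True, "hostB": True},
            placement={"SASMeta": "hostA"},
            name_table={"SASMeta": "hostA"})
grid.hosts_up["hostA"] = False                      # hostA crashes
failover_pass(grid, {"SASMeta": ["hostA", "hostB"]})
print(reconnect(grid, "SASMeta"))                   # prints hostB
```

Keeping the name table separate from the placement table mirrors the split between Recover and Synchronize: a service can be running again before clients are able to find it.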
Platform Enterprise Grid Orchestrator (EGO)
EGO is included in Platform Suite for SAS v4.1 / SAS Grid Manager 9.2
Available on all Unix, Linux, and Windows platforms supported by Grid Manager
EGO provides a full suite of services that:
• Allow users to treat distributed software and hardware resources as components of a virtual computer
• Enable all applications, services, and workloads to access a shared infrastructure
• Allocate resources according to policy, simplifying management and improving availability of the entire environment
We will focus on two EGO functions:
• EGO Service Controller
• EGO Service Director
Monitoring and Restarting Services with EGOSC
EGO Service Controller
• EGOSC starts on the Grid Control Machine
• Responsible for starting services on remote grid nodes
• Ensures services are running by detecting failures and restarting service instances based on configuration
Registering your service with EGO
• Create an EGO Service Definition File
  • ServiceName: the name of the service, a handle for EGO
  • EGOCommand: exact command to start your service
  • HostFailoverInterval: threshold for EGO to trigger failover
  • ResourceRequirement: host candidates for the service
  • ExecutionUser: user ID that runs the service
• EGO will start, monitor, and restart your service according to these definitions
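The role of the definition fields can be illustrated with a small model. This is a hedged sketch only: it is not the EGO Service Definition File format, the class and field names are hypothetical Python stand-ins for the settings listed on the slide, the values are placeholders, and treating the failover interval as "seconds the host has been unreachable" is an assumption.

```python
from dataclasses import dataclass


@dataclass
class ServiceDefinition:
    """Hypothetical stand-in for the settings named on the slide."""
    service_name: str             # handle for the service
    command: str                  # exact command that starts the service
    host_failover_interval: float # assumed: seconds unreachable before failover
    host_candidates: list         # hosts allowed to run the service
    execution_user: str           # user ID that runs the service


def needs_failover(definition, seconds_unreachable):
    """Trigger failover once the host has been unreachable past the threshold."""
    return seconds_unreachable >= definition.host_failover_interval


sas_meta = ServiceDefinition(
    service_name="SASMeta",
    command="/path/to/start_metadata_server",  # placeholder, not a real path
    host_failover_interval=60.0,               # placeholder value
    host_candidates=["hostA", "hostB"],
    execution_user="sasuser",                  # placeholder account
)
print(needs_failover(sas_meta, 10.0))   # prints False
print(needs_failover(sas_meta, 90.0))   # prints True
```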
Location independence for failover
Consider the following scenario:
1. Metadata Server (SASMeta) is configured to be started by EGO
2. Client App can look up Metadata Server by name:
     $nslookup $server SASMeta.ego.sas.com
     Server: 172.25.241.134 (Corporate DNS server for client)
     Address: 172.25.241.134#53
     Name: SASMeta.ego.sas.com
     Address: 172.25.235.50
3. Metadata Server goes down unexpectedly as hostA crashes
4. EGO detects that Metadata Server is gone and restarts it on a failover host
5. Metadata Server is successfully restarted:
     $nslookup $server SASMeta.ego.sas.com
     Server: 172.25.241.134 (Corporate DNS server for client)
     Address: 172.25.241.134#53
     Name: SASMeta.ego.sas.com
     Address: 172.25.235.55
6. Recovery is complete. Client App can still access Metadata Server by connecting to SASMeta.ego.sas.com
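On the client side, the scenario works only if the client resolves the service name again when it reconnects instead of holding on to the old IP address. A minimal sketch of that idea follows, using only the Python standard library; the function name, retry counts, and the port number in the usage comment are assumptions, not part of the SAS or EGO tooling.

```python
import socket
import time


def connect_by_name(name, port, resolve=socket.gethostbyname,
                    connect=socket.create_connection,
                    attempts=5, delay=2.0, sleep=time.sleep):
    """Connect to a service by DNS name, re-resolving the name on every attempt.

    After a failover the name maps to a new address, so a fresh lookup lets
    the client follow the service without any reconfiguration.
    """
    last_error = None
    for _ in range(attempts):
        try:
            address = resolve(name)                 # fresh lookup each time
            return connect((address, port), timeout=5)
        except OSError as exc:                      # lookup or connect failed
            last_error = exc
            sleep(delay)
    raise ConnectionError(f"could not reach {name}:{port}") from last_error


# Hypothetical usage; the port is a placeholder for the server's real port:
# sock = connect_by_name("SASMeta.ego.sas.com", 8561)
```

The `resolve`, `connect`, and `sleep` parameters exist only so the function can be exercised without a network. Note that an operating system or application DNS cache can still return the old address, which is why the configuration steps later in the deck disable DNS caching on the client.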
EGO Service Director
• Provides a location mechanism for other system services
• Contains a stand-alone Domain Name Server for the EGO DNS sub-domain
• Provides server name to IP address resolution within the sub-domain
• Relies on Service Controller to provide location info for service instances
• Service Director updates the Corporate DNS when the location of the Service Director DNS changes
Normal Case
[Diagram: Client, Corporate DNS Server, Service Director (SD) DNS Server, and the Grid Control Machine running Service Director, Service Controller (launch/control/monitor), and EGO Kernel. Service A runs on Host B; the grid nodes are Host B, Host C, …, Host n. Arrows are labeled "Update location of service instances" and "Update location of SD DNS Server". Corporate DNS callout: "Don't know anything about the EGO subdomain, let's pass it on."]
1. Client makes a DNS query to determine the location of Service A
2. Corporate DNS forwards the query to SD DNS
3. SD DNS responds with the latest location of the service
4. Corporate DNS passes the response back to Client
5. Client uses the response to locate the service
Failover Case
[Diagram: same components as the normal case. Service A fails on Host B and is restarted on Host C. Arrows are labeled "Update location of service instances" and "Update location of SD DNS Server".]
1. EGO discovers that Service A has failed
2. Service Controller restarts Service A on failover Host C
3. Service Controller updates the Service Director DNS Server
4. Client obtains the new name-to-IP address mapping to access Service A on Host C
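The normal and failover cases can be modeled together in a few lines. This is a toy simulation, assuming a simple suffix match for forwarding; the class and method names are hypothetical and it does not speak the DNS protocol. The domain name and addresses are the ones used in the nslookup example.

```python
class ServiceDirectorDNS:
    """Stand-alone name server for the EGO sub-domain (toy model)."""

    def __init__(self, subdomain):
        self.subdomain = subdomain.lower()
        self.records = {}

    def update(self, service, address):
        """Called on behalf of Service Controller when a service (re)starts."""
        self.records[f"{service}.{self.subdomain}".lower()] = address

    def resolve(self, fqdn):
        return self.records[fqdn.lower()]


class CorporateDNS:
    """Knows nothing about the EGO sub-domain; forwards those queries."""

    def __init__(self):
        self.own_records = {}
        self.forwarders = {}

    def add_forwarder(self, sd_dns):
        self.forwarders[sd_dns.subdomain] = sd_dns

    def resolve(self, fqdn):
        name = fqdn.lower()
        for subdomain, sd_dns in self.forwarders.items():
            if name.endswith("." + subdomain):
                return sd_dns.resolve(name)   # pass the query on to SD DNS
        return self.own_records[name]


sd_dns = ServiceDirectorDNS("ego.sas.com")
corporate = CorporateDNS()
corporate.add_forwarder(sd_dns)

sd_dns.update("SASMeta", "172.25.235.50")           # normal case
print(corporate.resolve("SASMeta.ego.sas.com"))     # prints 172.25.235.50

sd_dns.update("SASMeta", "172.25.235.55")           # after failover
print(corporate.resolve("SASMeta.ego.sas.com"))     # prints 172.25.235.55
```

The client always asks the same corporate server for the same name; only the record held by the Service Director DNS changes.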
What's going to fail over EGO Service Controller?
Answer:
• Load Information Manager (LIM), the first process to start, which is responsible for starting all Grid services
What's going to fail over LIM?
• LIM runs on each node under a master-slave arrangement
• Master and slave LIMs exchange host load index information regularly
• If the master LIM cannot be reached, the slave LIMs elect a new master LIM among themselves
• The new master can restart EGO Service Controller
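The slide does not say how the slave LIMs agree on a new master. One simple rule, used here purely as an assumption for illustration, is an ordered list of master candidates where the first reachable host wins. Because every slave applies the same rule to the same list, they arrive at the same answer without further coordination. All names below are hypothetical.

```python
def elect_master(candidates, reachable):
    """Return the first candidate host that is still reachable, or None."""
    for host in candidates:
        if host in reachable:
            return host
    return None


def service_controller_host(candidates, reachable):
    """The elected master is where EGO Service Controller gets restarted."""
    master = elect_master(candidates, reachable)
    if master is None:
        raise RuntimeError("no master candidate is reachable")
    return master


candidates = ["hostA", "hostB", "hostC"]
print(elect_master(candidates, {"hostA", "hostB", "hostC"}))  # prints hostA
print(elect_master(candidates, {"hostB", "hostC"}))           # prints hostB
```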
Example: HA for SAS Metadata Server
Steps: Install LSF, Install SAS Metadata Server, Configure EGO Services, Failover Tests, Eliminating Hot-Standby
Assumptions:
• hostA = default host for Metadata Server
• hostB = failover host for Metadata Server
• \\hostF\LSFShare = shared config dir accessible by all hosts in the Grid
• \\hostF\SAS = shared config dir for SAS Metadata Server, accessible by hostA and hostB
Install LSF
1. Install Platform Suite for SAS 4.1, provided with SAS Grid Manager 9.2
2. hostA should be the Grid Control Machine
3. hostB should be the failover host for the Grid Control Machine (the same host as for Metadata Server)
Example: HA for SAS Metadata Server
Install SAS Metadata Server
1. The user account running Metadata Server must have full access to \\hostF\SAS
2. Install Metadata Server on hostA
3. Set the configuration directory to \\hostF\SAS
4. Repeat #3 for hostB
Note: SAS 9.1.3 and 9.2 install procedures are different
Example: HA for SAS Metadata Server
Configure EGO Services
1. Create an EGO service definition file for Metadata Server
2. Specify hostA and hostB as host candidates
3. Enable Service Director to start automatically on hostA and fail over to hostB
4. Configure EGO Name Service (DNS): set domain names and the encryption key to the Corporate DNS
5. Configure the Corporate DNS to forward EGO sub-domain queries to the EGO DNS
6. Restart all EGO services
7. Disable the DNS cache on the client machine
8. Test with nslookup
Example: HA for SAS Metadata Server
Failover Tests
1. Connect with SAS Management Console to check that Metadata Server is functioning on hostA
2. Shut down hostA
3. Check with egosh service list that Metadata Server restarts on hostB
4. Restart SAS Management Console; it should connect to Metadata Server again, now on hostB
Example: HA for SAS Metadata Server
Eliminating Hot-Standby
This step is optional
• While Metadata Server is running on hostA, hostB can be a Grid node running SAS jobs
• During failover, LSF can be reconfigured so that it does not send jobs to hostB
• Use a wrapper script to close the host to new jobs before starting Metadata Server
• Jobs already running on hostB continue until they finish, but no more jobs are dispatched to hostB
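A wrapper along these lines could be what EGO launches in place of the Metadata Server start command. This is a minimal sketch, assuming the LSF command `badmin hclose <host>` is on the path and that the execution user is allowed to run it; the function name and the start command are placeholders, not the deck's actual script.

```python
import subprocess


def close_host_then_start(host, start_command, run=subprocess.run):
    """Close the host to new LSF jobs, then start the Metadata Server.

    `badmin hclose` stops LSF from dispatching new jobs to the host; jobs
    already running there are left to finish. `start_command` is the
    site-specific command list that starts the server.
    """
    run(["badmin", "hclose", host], check=True)
    return run(start_command, check=True)


# Hypothetical usage on the failover host:
# close_host_then_start("hostB", ["/path/to/start_metadata_server"])
```

The `run` parameter is there only so the call sequence can be checked without LSF installed. Once Metadata Server has moved back to hostA, the host can be reopened for jobs with `badmin hopen hostB`.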
Failover for Grid Nodes?
[Diagram: LSF dispatches SAS JobA to Host 1 among grid nodes Host 1, Host 2, …, Host n. Host 1 crashes; JobA is requeued if it is rerunnable and then runs on Host 2.]
• LSF can requeue a job when it detects that the host running the job has failed and the job is defined as rerunnable
• When the job is requeued, LSF tries to dispatch it to another suitable host
• A rerun job starts from the beginning, so all previous computation and data results must be redone
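The requeue behaviour can be sketched as a small state machine. This is an illustrative model rather than LSF code: the `Job` class and both functions are hypothetical, while the state names follow LSF's PEND, RUN, and EXIT job states. In LSF itself a job is marked rerunnable at submission, for example with `bsub -r`.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Job:
    name: str
    rerunnable: bool
    host: Optional[str] = None
    state: str = "PEND"


def handle_host_failure(failed_host, jobs):
    """Requeue rerunnable jobs that were running on the failed host."""
    for job in jobs:
        if job.host == failed_host and job.state == "RUN":
            job.host = None
            # Rerunnable jobs go back to pending; others are marked as exited.
            job.state = "PEND" if job.rerunnable else "EXIT"


def dispatch(jobs, healthy_hosts):
    """Send each pending job to a healthy host; it restarts from the beginning."""
    for job in jobs:
        if job.state == "PEND" and healthy_hosts:
            job.host = healthy_hosts[0]
            job.state = "RUN"


job_a = Job("JobA", rerunnable=True, host="host1", state="RUN")
job_b = Job("JobB", rerunnable=False, host="host1", state="RUN")
handle_host_failure("host1", [job_a, job_b])
dispatch([job_a, job_b], ["host2"])
print(job_a.state, job_a.host)   # prints RUN host2
print(job_b.state, job_b.host)   # prints EXIT None
```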
SAS 9.2 Checkpoint with Requeuing
[Diagram: LSF dispatches SAS JobA to Host 1 among grid nodes Host 1, Host 2, …, Host n.]
• While JobA runs on Host 1, checkpoint images are written to a shared directory: JobA chkpnt#1, JobA chkpnt#2, …, JobA chkpnt#17, …
• Host 1 crashes; LSF requeues JobA if it is rerunnable
• SAS JobA on Host 2 resumes from chkpnt#17
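The general idea of resuming from the last checkpoint in a shared directory can be sketched as follows. This is a generic illustration and not the SAS 9.2 checkpoint mechanism: the function name, the JSON checkpoint file, and the `fail_after` switch used to simulate a crash are all assumptions.

```python
import json
from pathlib import Path


def run_with_checkpoints(steps, checkpoint_file, fail_after=None):
    """Run steps in order, recording how many have finished.

    The checkpoint file is meant to live in a directory shared by all hosts,
    so a rerun on another host can skip the steps that already completed.
    Returns the index of the step the run started from.
    """
    path = Path(checkpoint_file)
    done = json.loads(path.read_text())["done"] if path.exists() else 0
    for index in range(done, len(steps)):
        if fail_after is not None and index == fail_after:
            raise RuntimeError("simulated host crash")
        steps[index]()
        path.write_text(json.dumps({"done": index + 1}))  # checkpoint
    return done
```

A first run that crashes before step 3 leaves a checkpoint recording three finished steps; calling the function again with the same file, as the requeued job would on Host 2, starts at step 3 instead of step 0.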
Conclusion
• Working prototype that can fail over SAS Metadata Server in both Windows and Linux SAS Grid environments
Key advantages of this solution
• Eliminates operating system dependencies
• Eliminates the requirement for a hot standby for failover
• Eliminates the expense of purchasing a third-party tool to provide high availability
Future Work
• Provide HA configuration templates for the rest of the SAS services and platforms
• Simplify configuration
• Provide support for SAS 9.2 checkpoint and restart mode