Operations and Support for Grid Environments
Leigh Grundhoefer
Indiana University
leighg@indiana.edu

Agenda
  • Introduction to OSG Operations model and implementation
  • Monitoring Instruments
  • Support Workflows
  • Conclusions
2021/2/27 leighg@indiana.edu

Defining Grid Support
  • What kind of infrastructure?
      ◦ Definition of “instrumentation” software
      ◦ Deployment policies and procedures
      ◦ Error handling methods
  • What is the structure for the support?
      ◦ Try to reduce duplication of effort
      ◦ Integration of grid support into a variable set of existing resource provider support mechanisms
      ◦ Interfacing support staff and grid experts

Integrating grid support
[Diagram. Labels: Facility, Machine Operators, Support, NOC, Security Czar, System Admins, Network Admin, Grid ops, Resources]

OSG landscape
[Diagram. Labels in extraction order: VOs & apps, TG Mon&Info, TG Policy, Arch, MIS, TG Storage, Support Centers, Technical Policy, OSG deployment, Group oversees, Storage, Operations Activity (Ops), Security, TG Security, Ops, Integration, TG, Activity, Support Centers, site admins, Chairs]

Operations Scope
  • Runs the grid-wide services, including provisioning and installation of middleware and operational support for those services, resource providers, and VOs running on OSG.
  • Coordinates with other grids and between support organizations.
  • Applies user and service agreements.
  • Provides a repository for collected registrations and agreements of participating organizations.


Engineering
  • Maintain grid-controlled software packages and cache
  • Provide common grid software support through VDT
  • Verify software compatibilities
  • Provision releases of the OSG middleware and services
  • Troubleshoot service failures
  • Provide deployment guidance and assistance
  • Liaison to other service support centers
  • Monitor status of grid resources
  • Publish monitoring information for grid resources

Infrastructure
  • Trouble ticketing system and interface
  • Monitoring tools development and maintenance
  • Accounting services
  • Discovery services
  • Identity services
  • Grid information index
  • Grid Catalog
  • VO-level services for monitoring services
  • Knowledge base
  • Mailing lists
  • Formal and collaborative web information repositories

Provisioning Tasks
  • Set up the pre-release candidates for production installation tests
  • Add version control to production release
  • Deploy and validate auxiliary services
  • Adjust middleware configuration setup
  • Pre-release testing of production installation
  • Pre-release test of services
  • Full documentation preparation
      ◦ Installation manuals
      ◦ Release notes, change logs, patches, upgrades
      ◦ Description of the services provided for the release and access information

Support Services
  • Coordinates and tracks:
      ◦ Problems for service providers
      ◦ Security incidents
      ◦ Requests for assistance
  • Schedules grid service and middleware changes
  • Monitors policy compliance
  • Detailed later in this talk

Agenda
  • Introduction to OSG Operations model and implementation
  • Monitoring Instruments
  • Support Workflows
  • Conclusions

OSG Service Integration
  • Grid Catalog (GridCat)
  • MonALISA
  • MIS-CI
      ◦ New “core” monitoring services

Integrated Monitoring Framework
  • Globus Meta Directory System (LDAP directory)
  • MonALISA, Monitoring Agents in Large Integrated Service Architecture (Pub/Sub)
  • MonALISA repository (WS/WAP)
  • Ganglia performance monitoring (Multicast/Hierarchical)
  • Job Monitoring System at the Advanced Center for Distributed Computing (non-invasive archive)
  • The Grid Site Status Cataloging System at iGOC (human/automatic managed DB)

Grid Telemetry
  • Information
      ◦ Site list
      ◦ Test result
      ◦ Load
      ◦ Jobs running
      ◦ Jobs queued
  • Heterogeneous
  • Redundant
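
Because the telemetry listed above is both heterogeneous and redundant, several sources may report on the same site. The following is a minimal sketch of one way to reconcile such reports. The record layout `SiteReport`, the function `merge_reports`, the site names, and the "newest value wins per field" rule are all assumptions made for illustration; they are not taken from any OSG tool.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional


@dataclass(frozen=True)
class SiteReport:
    """One telemetry sample for a site (hypothetical record layout)."""
    site: str
    source: str                      # e.g. "GridCat" or "MonALISA"
    timestamp: float                 # seconds since the epoch
    test_passed: Optional[bool] = None
    load: Optional[float] = None
    jobs_running: Optional[int] = None
    jobs_queued: Optional[int] = None


FIELDS = ("test_passed", "load", "jobs_running", "jobs_queued")


def merge_reports(reports: Iterable[SiteReport]) -> Dict[str, dict]:
    """For each site and field, keep the newest value that is not missing."""
    merged: Dict[str, dict] = {}
    # Process oldest first so that newer reports overwrite older values.
    for report in sorted(reports, key=lambda r: r.timestamp):
        site_view = merged.setdefault(report.site, {})
        for name in FIELDS:
            value = getattr(report, name)
            if value is not None:
                site_view[name] = value
    return merged


# Hypothetical samples: two sources report different fields for one site.
reports = [
    SiteReport("site_a", "GridCat", 100.0, test_passed=True),
    SiteReport("site_a", "MonALISA", 160.0, load=0.75, jobs_running=12, jobs_queued=3),
    SiteReport("site_a", "GridCat", 200.0, test_passed=False),
]
site_status = merge_reports(reports)
```

Merging field by field lets a source that only knows test results coexist with one that only knows load and job counts, which is the practical benefit of redundant monitoring.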

What is GridCat?
  • A Grid Site Status Cataloging System, delivered as a web app
  • High-level simple status map
  • Computing resource information collector/presenter
      ◦ Static and dynamic information about all sites
      ◦ Simple grid status presentation on the web
  • Identifies site readiness
  • A web application that is easy to develop and deploy
  • Displays disk space and CPU slots
  • Back end (BE): parallel information collecting, storing, and archiving among sites
  • Front end (FE): web pages in templated HTML + PHP + JS

iGOC GridCat View

iGOC MonALISA View

iGOC MonALISA View (partial)

OSG MIS advancement
  • Monitoring and Information Services Core Infrastructure (MIS-CI)
      ◦ MIS Compute Element and Storage Element
  • Discovery Service
  • Consumer Interface
  • Resource Repository
  • Information Gathering

MIS-CI Architecture
[Diagram. Components: User, Resource, Gatekeeper, Discovery Service, Schema Eval, MIS-CI Resource Throttling, MIS-CI Consumer Throttling, SelfMonitor, Cron, SQLite Database, SQLite Database Backup, SQLite Local Historical Repository, SQLite Remote VO Historical Repository, Custom Remote Repository, Grid Scheduler, OSG Auditing/Accounting]
jobmanager-mis options: -profile (default) -jobs -gridftp -diskspace -accounting -policy -environment -software -VO -statistics -security
MIS-CI Profiles: jobs, gridftp, diskspace, accounting, policy, environment, software, VO, statistics, security
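
The architecture above stores per-profile information in a local database that also serves as a historical repository. As a rough sketch of that idea only, the code below records samples per profile with Python's standard `sqlite3` module and reads back the newest one. The table layout, the function names, and the payload strings are assumptions for illustration and do not describe the actual MIS-CI schema.

```python
import sqlite3
from typing import Optional

# Profile names are taken from the slide; everything else is assumed.
PROFILES = ("jobs", "gridftp", "diskspace", "accounting", "policy",
            "environment", "software", "VO", "statistics", "security")


def open_repository(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) a small repository of profile samples."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS samples ("
        " id INTEGER PRIMARY KEY AUTOINCREMENT,"
        " profile TEXT NOT NULL,"
        " collected_at REAL NOT NULL,"
        " payload TEXT NOT NULL)"
    )
    return conn


def record_sample(conn: sqlite3.Connection, profile: str,
                  collected_at: float, payload: str) -> None:
    """Append one sample; older samples are kept as history."""
    if profile not in PROFILES:
        raise ValueError(f"unknown profile: {profile}")
    conn.execute(
        "INSERT INTO samples (profile, collected_at, payload) VALUES (?, ?, ?)",
        (profile, collected_at, payload),
    )
    conn.commit()


def latest_sample(conn: sqlite3.Connection, profile: str) -> Optional[str]:
    """Return the newest payload for a profile, or None if there is none."""
    row = conn.execute(
        "SELECT payload FROM samples WHERE profile = ?"
        " ORDER BY collected_at DESC, id DESC LIMIT 1",
        (profile,),
    ).fetchone()
    return row[0] if row else None


repo = open_repository()
record_sample(repo, "diskspace", 1000.0, "free_gb=120")  # hypothetical payloads
record_sample(repo, "diskspace", 2000.0, "free_gb=95")
newest = latest_sample(repo, "diskspace")
```

Keeping every sample rather than overwriting the previous one is what makes the same table usable both for current status and for historical queries.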

Agenda
  • Introduction to OSG Operations model and implementation
  • Monitoring Instruments
  • Support Workflows
  • Conclusions

Grid Operations Center
Operations: Indiana iGOC
  • Provisioning
  • Ops procedures
  • Coordination
Support centers shown: VO Support Centers, Service Support Centers, Resource Provider Support Centers

Leveraging the NOC
  • Global NOC at Indiana University
      ◦ The Global NOC provides 24x7 network engineering and operations services for research and education networks and international interconnections, including Internet2 Abilene, National LambdaRail, TransPAC and AMPATH networks, the STAR TAP and MANLAN layer 3 international exchange points, and the STAR LIGHT optical exchange. In addition, the Global NOC supports activities of the iVDGL Grid Operations Center and the REN-ISAC cybersecurity Watch Desk. By virtue of the R&E network, grid, and cybersecurity activities, the Global NOC possesses a unique and embracing view of R&E cyberinfrastructure.

Monitoring the GOC services
[Diagram. Labels: Grid Systems and Services (run every 15 min), GOC, NOC Mon, Nagios, Contact DB, NOC, Trouble Tickets (example: Ticket 894)]
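
The diagram above names Nagios and a 15-minute check cycle. Nagios plugins conventionally report status through return codes 0 (OK), 1 (WARNING), 2 (CRITICAL), and 3 (UNKNOWN). The sketch below classifies a service by how long ago it last answered a probe. The function name `check_service_freshness` and the thresholds (one missed cycle warns, more than two is critical) are assumptions for illustration, not the GOC's actual checks.

```python
from typing import Optional, Tuple

# Conventional Nagios plugin return codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

CHECK_INTERVAL_S = 15 * 60  # the 15-minute cycle named on the slide


def check_service_freshness(age_s: Optional[float]) -> Tuple[int, str]:
    """Classify a service by the age of its last successful probe.

    Thresholds are assumed: within one cycle is OK, within two cycles
    is WARNING, anything older is CRITICAL.
    """
    if age_s is None or age_s < 0:
        return UNKNOWN, "UNKNOWN - no usable probe result"
    if age_s <= CHECK_INTERVAL_S:
        return OK, f"OK - last success {age_s:.0f}s ago"
    if age_s <= 2 * CHECK_INTERVAL_S:
        return WARNING, f"WARNING - last success {age_s:.0f}s ago"
    return CRITICAL, f"CRITICAL - last success {age_s:.0f}s ago"


# A real plugin would print the message and exit with the code;
# here the result is only computed so it can be inspected.
code, message = check_service_freshness(1200.0)
```

A non-OK result is the natural trigger for opening a trouble ticket and looking up the responsible party in the contact database.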

http://www.ivdgl.org/grid3

Problem to Trouble Ticket
  • Scope
      ◦ A single resource / multiple resources
      ◦ Application wide
      ◦ VO wide
      ◦ Grid wide
  • Operations Resource / Operations Service
  • Severity
      ◦ Critical, High, Elevated, Normal
  • Problem Owner
  • Problem Contact
  • Problem Description
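
The ticket fields above map naturally onto a small record type. The following sketch models them in Python and orders a queue by severity. The class and function names, the numeric severity values, the ticket numbers, and the example contacts are hypothetical; only the scope and severity labels come from the slide.

```python
from dataclasses import dataclass
from enum import Enum, IntEnum
from typing import List


class Scope(Enum):
    SINGLE_RESOURCE = "single resource"
    MULTIPLE_RESOURCES = "multiple resources"
    APPLICATION_WIDE = "application wide"
    VO_WIDE = "VO wide"
    GRID_WIDE = "grid wide"


class Severity(IntEnum):
    # Higher value means more urgent; the numeric values are an assumption.
    NORMAL = 1
    ELEVATED = 2
    HIGH = 3
    CRITICAL = 4


@dataclass
class TroubleTicket:
    number: int
    scope: Scope
    severity: Severity
    owner: str        # who is responsible for resolving the problem
    contact: str      # who reported it or should be kept informed
    description: str


def triage_order(tickets: List[TroubleTicket]) -> List[TroubleTicket]:
    """Most severe first; ties go to the lowest ticket number."""
    return sorted(tickets, key=lambda t: (-t.severity, t.number))


# Hypothetical tickets used only to show the ordering.
queue = triage_order([
    TroubleTicket(1, Scope.SINGLE_RESOURCE, Severity.NORMAL,
                  "goc", "admin@site-a.example", "Installation help"),
    TroubleTicket(2, Scope.GRID_WIDE, Severity.CRITICAL,
                  "goc", "ops@grid.example", "Information index unreachable"),
    TroubleTicket(3, Scope.VO_WIDE, Severity.HIGH,
                  "vo-support", "vo@grid.example", "Authorization failures"),
])
```

Making scope and severity enumerations rather than free text keeps tickets comparable when they are passed between support centers.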

Monitoring Event
[Diagram. Trigger: Site Fails Grid Catalog Test (run every 5 hours). Labels: GOC, NOC Monitors, Grid Catalog Map, Trouble Tickets (example: Ticket 854), Grid Experts, GOC Mon, GridCat, MonALISA, Contact DB, Security/Incident Handling, VO support or Facility, Resource (three shown)]
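
The flow above turns a failed Grid Catalog test into a trouble ticket. A minimal sketch of that step follows, with one added assumption: a site that already has an open ticket does not get a second one. The function name, the site names, and the ticket numbers are hypothetical.

```python
from typing import Dict, Tuple


def open_tickets_for_failures(
    test_results: Dict[str, bool],
    open_tickets: Dict[str, int],
    next_ticket: int,
) -> Tuple[Dict[str, int], int]:
    """Assign a new ticket number to each failing site without an open ticket.

    test_results maps site name -> True if the catalog test passed.
    open_tickets maps site name -> number of its already-open ticket.
    Returns the newly opened tickets and the next unused ticket number.
    """
    new_tickets: Dict[str, int] = {}
    for site in sorted(test_results):      # sorted for a stable numbering
        if test_results[site]:
            continue                       # test passed, nothing to do
        if site in open_tickets:
            continue                       # already tracked, avoid duplicates
        new_tickets[site] = next_ticket
        next_ticket += 1
    return new_tickets, next_ticket


# Hypothetical test cycle: site_b and site_c fail, site_c is already tracked.
results = {"site_a": True, "site_b": False, "site_c": False}
already_open = {"site_c": 100}
opened, next_free = open_tickets_for_failures(results, already_open, 101)
```

Suppressing duplicates matters with a periodic test, since a site that stays down would otherwise generate a fresh ticket on every cycle.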

Reactive Support workflow
[Diagram. Intake at the GOC: igoc@ivdgl.org, web form, and telephone. Labels: User/Admin, Trouble Tickets (examples: Ticket 803, Ticket 823, Ticket 833, Ticket 843), Grid Experts, Web Docs, Developers, Contact DB, Other Support Centers, Security/Incident Handling]
Request types: Application failure, Planned outages, Security problems, Installation help, Configuration assistance, Identity management, Authorization problems
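
The workflow above takes in requests of several types and passes them to different groups. The slide does not say which request type goes to which group, so the routing table in this sketch is purely an assumption for illustration; only the category and group names come from the slide.

```python
from typing import Dict

# Categories and group names appear on the slide. The pairing of each
# category with a group is assumed for this sketch.
ROUTES: Dict[str, str] = {
    "application failure": "Developers",
    "planned outage": "Other Support Centers",
    "security problem": "Security/Incident Handling",
    "installation help": "Grid Experts",
    "configuration assistance": "Grid Experts",
    "identity management": "Grid Experts",
    "authorization problem": "Grid Experts",
}

DEFAULT_ROUTE = "Grid Experts"  # assumed fallback for unrecognized requests


def route_request(category: str) -> str:
    """Return the group that should receive a request of this category."""
    return ROUTES.get(category.strip().lower(), DEFAULT_ROUTE)


destination = route_request(" Security problem ")
```

A table like this keeps the routing policy in one place, so it can be changed without touching the intake code.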

Agenda
  • Introduction to OSG Operations model and implementation
  • Monitoring Instruments
  • Support Workflows
  • Conclusions

Operations Enables Applications
  • Provide operational services that give applications the “instruments” to:
      ◦ Publish site policies and environment
      ◦ Know the status of grid middleware on sites
      ◦ Know the job queue for compute resources
      ◦ Know the status and load of grid resources
      ◦ Access historical monitoring information
      ◦ Manage grid services
      ◦ Keep apprised of security incidents in the collaborative

Lessons Learned
  • Configuration management efforts in the development and deployment areas are rewarded many times over during production.
  • A monitoring infrastructure, especially a redundant one, gives a significant problem-solving advantage.
  • Establishing clear communications between resource providers, users, and Virtual Organizations is hard.

More Lessons Learned
  • Human interactions in grid building are costly
  • Keeping resource provider requirements light led to heavy loads on gatekeeper hosts (monitoring framework)
  • The diverse set of resource configurations made exchanging job requirements difficult
  • Troubleshooting: efficiency for submitted jobs was not as high as we’d like

Upcoming Challenges
  • Shared problem handling with application-centric and VO-centric support structures
  • Ticket passing to and from other grid environments
  • Establishing a working monitoring framework for distributed storage resources and virtual data cataloging infrastructure
