Functional description
Detailed view of the system
Status and features
Giuseppe Lo Presti, Olof Bärring
CERN / IT
Castor Readiness Review – June 2006

Outline
• Detailed view of the architecture
  • Lifecycle of a GET and a PUT request
• Description and status of the components
  • Main daemons
  • Diskserver related
  • Central services
  • Tape related
• Tape migration and recall
  • Workflow details
Giuseppe Lo Presti (IT/FIO/FD)

Castor 2 Architecture
From the “simple” view …
[Diagram: Client, Stager logic, Disk cache subsystem, Central services, Tape archive subsystem]

Castor 2 Architecture
… to a more detailed one
[Diagram components: Client, RH, RR, Stager, DB Svc, Job Svc, Qry Svc, Error Svc, Scheduler, DB, StagerJob, Mover, GC, MigHunter, RTCPClientD, RTCPD, TapeDaemon, NameServer, VMGR, VDQM]
[Diagram groupings: Disk cache subsystem (Disk Servers), Tape archive subsystem (Tape Servers), Central Services]

Lifecycle of a GET + recall
1. Client connects to the RH
2. RH stores the request into the DB
3. Stager polls the DB and checks for file availability
4. If the file is not available, the recall process is activated
5. Once the file is available, the stager asks the scheduler to schedule the access to the file
6. Client gets a callback and can initiate the transfer
• The command-line client is stager_get
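
The six steps above can be sketched as a toy, in-memory simulation. This is only an illustration under stated assumptions: the names `Catalogue`, `request_handler`, `stager_poll` and `recall_done`, as well as the data layout, are hypothetical and are not CASTOR's actual API.

```python
from dataclasses import dataclass, field


@dataclass
class Catalogue:
    """Hypothetical stand-in for the stager DB."""
    requests: list = field(default_factory=list)   # pending GET requests
    on_disk: set = field(default_factory=set)      # files with a disk copy
    recalls: set = field(default_factory=set)      # files waiting for tape recall


def request_handler(db, client, filename):
    """Steps 1-2: the RH only stores the request in the DB."""
    db.requests.append({"client": client, "file": filename})


def stager_poll(db, log):
    """Steps 3-6: poll the DB, trigger recalls, schedule available files."""
    still_waiting = []
    for req in db.requests:
        if req["file"] in db.on_disk:
            # Steps 5-6: schedule the access, then call the client back.
            log.append(f"callback {req['client']} {req['file']}")
        else:
            # Step 4: file not on disk, activate the recall and retry later.
            db.recalls.add(req["file"])
            still_waiting.append(req)
    db.requests = still_waiting


def recall_done(db, filename):
    """The tape subsystem has copied the file to disk."""
    db.recalls.discard(filename)
    db.on_disk.add(filename)


db, log = Catalogue(), []
request_handler(db, "client1", "/castor/fileX")
stager_poll(db, log)            # file missing: recall is activated
recall_done(db, "/castor/fileX")
stager_poll(db, log)            # file available: client gets its callback
print(log)
```

The request stays in the DB between the two polling passes, which mirrors the point that the request handler and the stager only communicate through the database.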

stager_get (1)
• Client opens a temporary port for receiving the response
• Client sends its request to the RH
• RH stores the request into the DB

stager_get (2)
• Stager polls the DB to get the request
• It checks for file availability
• The file is not available, so it creates a DiskCopy in WAITTAPERECALL

stager_get (3)
• rtcpclientd polls the DB to get DiskCopies in WAITTAPERECALL
• It organizes the recall of the data, but the target filesystem is not yet selected

stager_get (4)
• The recaller sends a request to the stager in order to know where to put the file
• The request goes through the usual way: Request Handler, DB, stager (job service), Request Replier

stager_get (5)
• rtcpd transfers the data from the tape to the selected filesystem
• The DB is updated with the new file size and position
• The original subrequest is set to RESTART status

stager_get (6)
• Stager polls the DB to get the request
• It checks for file availability
• The file is available, so it calls the scheduler through the RMMaster to schedule the I/O
• The scheduler launches a StagerJob

stager_get (7)
• The StagerJob launches the right mover corresponding to the client request (note that the scheduler takes available movers into account)
• It answers the client, giving it the machine and port on which to contact the mover
• Data is transferred
• DB is updated

Lifecycle of a PUT + migration
1. Client connects to the RH
2. RH stores the request into the DB
3. Stager polls the DB and looks for a candidate filesystem for the transfer
4. Client gets a callback and can initiate the transfer
5. After the transfer is completed, migration to tape is performed
• The command-line client is stager_put

stager_put (1)
• Client opens a temporary port for receiving the response
• Client sends its request to the RH
• RH stores the request into the DB

stager_put (2)
• Stager polls the DB to get the request
• It calls the scheduler through the RMMaster to schedule the I/O
• The scheduler launches a StagerJob

stager_put (3)
• The StagerJob launches the right mover corresponding to the client request (note that the scheduler takes available movers into account)
• It answers the client, giving it the machine and port on which to contact the mover
• Data is transferred
• The DB is updated with the file size, the DiskCopy is set to CANBEMIGR, and one or more TapeCopies are created
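
The bookkeeping done at the end of a PUT can be sketched as follows. Only the CANBEMIGR status comes from the slide; the `CastorFile` class, the `put_done` function, the `nb_copies` parameter and the other status strings are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class CastorFile:
    """Hypothetical catalogue entry for one file being written."""
    name: str
    size: int = 0
    diskcopy_status: str = "STAGEOUT"      # assumed name for "being written"
    tapecopies: list = field(default_factory=list)


def put_done(entry, bytes_written, nb_copies=1):
    """Update the catalogue once the mover has finished the transfer."""
    entry.size = bytes_written
    entry.diskcopy_status = "CANBEMIGR"    # status named on the slide
    # One TapeCopy per tape copy wanted; they become migration candidates.
    entry.tapecopies = [
        {"file": entry.name, "copy_nb": n, "status": "CREATED"}  # assumed status
        for n in range(1, nb_copies + 1)
    ]
    return entry


entry = put_done(CastorFile("/castor/fileY"), bytes_written=2048, nb_copies=2)
print(entry.diskcopy_status, len(entry.tapecopies))
```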

stager_put (4)
• Thanks to a MigHunter, the new TapeCopy is attached to the streams it can belong to (depending on tapepools, svcclasses, ...)

stager_put (5)
• rtcpclientd will launch a migrator
• The migrator asks the DB for the next migration candidate
• The DB takes the best candidate in the stream (based on filesystem availability)
• The file is written to tape and the DB is updated

Detailed picture of CASTOR
[Figure: detailed picture of CASTOR]

Status of all system components
• Request Handler
• Stager
• RMMaster & RMNode
• Distributed Logging Facility
• External plugins (LSF, expertd)
• gcDaemon
• Central services (NameServer, VDQM, VMGR, CUPV)
• Tape part (MigHunter, migrator/recaller, rtcopy, rtcpclientd)

Request Handler
• Scope
  • Stores incoming requests into the DB
• Features
  • Very lightweight
  • Allows for request throttling
• Maturity
  • Production, no changes since mid 2005
• Implementation
  • Fully C++
  • Usage of the internal DB API

Stager
• Scope
  • Main daemon for request processing
• Features
  • Stateless
  • Multi-service implementation based on thread pools
    • Allows for independent execution of services, even on different nodes
    • Enhanced scalability
• Maturity
  • Fairly stable, some bug fixes in recent months
  • Development is ongoing to implement missing features
  • Bugs and RFEs are open on it (e.g. memory leak)
• Implementation
  • Main core in C, internal services in C or C++
  • Usage of the internal DB API

RMMaster & RMNode
• Scope
  • Gather monitoring information from nodes
  • Submit jobs to the scheduler
• Features
  • Not fully stateless
  • RMMaster gathers data from RMNode
  • RMNode runs on the diskservers and polls /proc data
• Maturity
  • Fairly stable, not many bug fixes in recent months
  • But quite a number of major issues are open on it
• Implementation
  • Fully C
  • No usage of the DB, internal protocol to communicate with the stager

Distributed Logging Facility
• Scope
  • Central DB-based logging system
• Features
  • A daemon accepts and stores any log entry from any Castor subsystem
  • A PHP-based GUI allows for querying the log
• Maturity
  • Fairly stable, development is ongoing to improve performance
• Implementation
  • Fully C, “legacy” DB API

DLF GUI
[Figure: DLF GUI]

LSF plugin
• Scope
  • Select the best candidate resource (file system) among the set proposed by LSF
• Features
  • Not entirely stateless due to lack of information flow in the LSF API
• Maturity
  • Stable, few changes in recent months
  • Lack of optimization due to lack of functionality in the LSF API
    • Need for further development in conjunction with LSF people
• Implementation
  • C
  • Usage of the internal DB API

gcDaemon
• Scope
  • Deletes files marked for garbage collection
• Features
  • Stateless daemon implemented as a stager client
• Maturity
  • Production, no changes since Dec 2004
• Implementation
  • C++
  • Usage of the client API and the internal API
    • Proxy “remotized” implementation of the stager

NameServer
• Scope
  • Archive the filesystem-like information for the HSM files
  • Associate tape-related information
• Features
  • Stateless daemon, DB backend
• Maturity
  • Production; the last change was a merge with DPM’s NameServer in Jan 2006, otherwise no changes since 2004
• Implementation
  • Fully C

Expert daemon (expertd)
• Scope
  • Externalize decisions based on policies
• Features
  • Framework for executing policy scripts
  • Receives policy requests from other components (stager, MigHunter, TapeErrorHandler)
  • Supported policy request types are:
    • Filesystem weight
    • Replication
    • Migrator
    • Recaller

Filesystem policy
• Scope
  • Provide an evaluation of each resource (filesystem) from gathered monitoring information
• Features
  • Single formula implementation
  • Currently only global, to be converted soon to one policy per service class
• Maturity
  • Under development; the current implementation works in production but has proved not to be stable enough under very heavy load (Tier 0 Data Challenge)
• Implementation
  • Rule in the CLIPS logic engine, going to be converted to Perl

Volume and Drive Queue Mgr (VDQM)
• Scope
  • Manage the tape queue and device status
• Features
  • Supports drive dedication (regexp)
  • Supports request prioritization
  • Allows for re-use of mounted tapes (useful for CASTOR 1)
• Maturity
  • In production since 2000
  • Scheduling algorithm melts down beyond ~4000 queued requests
  • New implementation (VDQM 2) ready to be rolled out
• Implementation
  • C++ and DB API for the new VDQM 2

Volume Manager (VMGR)
• Scope
  • Logical Volume Repository: inventory of all tapes and their status
• Features
  • Tape pools
    • Grouping of tapes for given activities
    • Counters for total and free space (calculated using compression rates)
• Maturity
  • In production since 2000
• Implementation
  • C
  • Oracle Pro-C
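
As an illustration of per-pool counters "calculated using compression rates", here is a minimal sketch. The formula (native space multiplied by an observed compression ratio) and all names (`Tape`, `pool_counters`, the field names, the sample volumes) are assumptions, not the VMGR schema.

```python
from dataclasses import dataclass


@dataclass
class Tape:
    vid: str                  # volume identifier
    pool: str                 # tape pool the volume belongs to
    native_capacity: int      # uncompressed capacity in bytes
    native_used: int          # bytes occupied on the medium
    compression: float = 1.0  # assumed observed compression ratio (>= 1)


def pool_counters(tapes, pool):
    """Return (total, free) space of a pool, scaled by compression."""
    total = free = 0.0
    for tape in tapes:
        if tape.pool != pool:
            continue
        total += tape.native_capacity * tape.compression
        free += (tape.native_capacity - tape.native_used) * tape.compression
    return total, free


tapes = [
    Tape("I00234", "smallfiles", 200, 50, compression=1.5),
    Tape("I00235", "smallfiles", 200, 200),
    Tape("I00300", "largeFiles", 400, 0, compression=2.0),
]
print(pool_counters(tapes, "smallfiles"))
```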

Castor User Privileges (Cupv)
• Scope
  • Manages administrative authorization rights on other CASTOR modules (nameserver, VMGR)
• Features
  • Flat repository of privileges
  • Supports regular expressions
• Maturity
  • In production since 2000
• Implementation
  • C
  • Oracle Pro-C

Tape mover (rtcpd)
• Scope
  • Copy files between tape and disk
• Features
  • Highly multithreaded
    • Overlaid network and tape I/O
    • Large memory buffers allow for copying multiple files in parallel
  • Supports a large number of legacy tape formats…
• Maturity
  • In production since 2000
• Implementation
  • C

MigHunter
• Scope
  • Attach migration candidates to streams
• Features
  • Stateless
  • Callout to the expert system for executing migrator policies for fine-grained control
  • Can trigger on frequency or volume of data to be migrated
• Maturity
  • In production since 2006
  • Some known problems with files that have been deleted from the name server but not cleared in the catalogue
• Implementation
  • C
  • Usage of the internal DB API

Migrator/recaller
• Scope
  • Controls the tape migration/recall
• Features
  • Stateless, multithreaded
• Maturity
  • In production since 2006
  • Some known problems with files that have been deleted from the name server but not cleared in the catalogue
  • Known aging problem resulting in inconsistency in one auxiliary Oracle table that is updated through triggers
    • Workaround for the Oracle problem
    • Operational procedure exists for repairing the streams
• Implementation
  • C
  • Usage of the internal DB API

rtcpclientd
• Scope
  • Master daemon controlling tape migration/recall
• Features
  • Not fully stateless due to VDQM
  • Single threaded
• Maturity
  • In production since 2006
• Implementation
  • C
  • Usage of the internal DB API

Tape Migration and Recall
• “rtcpclientd” is the main component dealing with all interaction with the CASTOR tape archive
  • For each running tape recall it forks a ‘recaller’ child process per tape
  • For each running tape migration it forks a ‘migrator’ child process per tape
• Migration streams are created and populated by the “MigHunter” component
• A TapeErrorHandler process is forked by the rtcpclientd daemon whenever a recaller or migrator child process exits with error status
• A detailed description of the functioning and operation of tape migration and recall in CASTOR 2 can be found at:
  • http://cern.ch/castor/docs/guides/admin/tapeMigrationAndRecall.pdf

Tape migration/recall components
[Diagram: vdqmserv, vmgrdaemon, nsdaemon, rtcpclientd, Catalogue, MigHunter, TapeErrorHandler, recaller with rtcpd, migrator with rtcpd; arrow label: fork; legend: Castor 1 component, Castor 2 component]

Tape recall (1)
• Tape recalls are triggered when the stager receives a request for a CASTOR file for which there is no available disk-resident copy
  • Stager calls the CASTOR name server to retrieve the tape segment information (VID, fseq, blockid)
  • Stager inserts the corresponding rows in the Tape and Segment tables in the catalogue
• rtcpclientd regularly (every 30 s) checks the catalogue for tapes to be recalled
  • Submits the tape request to VDQM (tape queue)
  • When the mover (rtcpd) starts, it connects back to rtcpclientd, which then forks a recaller process for servicing the tape recall
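
The catalogue hand-off between the stager and rtcpclientd can be sketched as below. The PENDING and UNPROCESSED statuses appear in the recall flow diagrams of this deck; the WAITDRIVE status, the dictionary layout, the function names and the blockid value are illustrative assumptions.

```python
POLL_INTERVAL_S = 30  # polling period quoted on the slide


def stager_trigger_recall(catalogue, segment):
    """Stager side: record the tape and segment that must be recalled."""
    vid = segment["vid"]
    catalogue["tapes"].setdefault(vid, "PENDING")
    catalogue["segments"].append({**segment, "status": "UNPROCESSED"})


def rtcpclientd_poll(catalogue, vdqm_queue):
    """rtcpclientd side: one polling pass, run every POLL_INTERVAL_S seconds."""
    for vid, status in catalogue["tapes"].items():
        if status == "PENDING":
            vdqm_queue.append(vid)                  # submit the tape request
            catalogue["tapes"][vid] = "WAITDRIVE"   # assumed intermediate status


catalogue = {"tapes": {}, "segments": []}
vdqm_queue = []
# Segment information as returned by the name server: VID, fseq, blockid.
stager_trigger_recall(catalogue, {"vid": "I00234", "fseq": 45, "blockid": 0})
stager_trigger_recall(catalogue, {"vid": "I00234", "fseq": 46, "blockid": 0})
rtcpclientd_poll(catalogue, vdqm_queue)
rtcpclientd_poll(catalogue, vdqm_queue)   # second pass: nothing new to submit
print(vdqm_queue)
```

Two segments on the same tape produce a single tape request, since the request submitted to the queue is per tape rather than per file.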

Tape recall (2)
• The recaller attempts to optimize the use of tape and disk resources
  • Tape files are sorted
    • Currently in fseq order. Work is ongoing to find a better sorting that takes into account the serpentine track layout on the media
  • Requests for new files on the same tape are dynamically added to the running request
  • The target file system is decided from the current load picture at the time the tape file is positioned
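
A minimal sketch of fseq-ordered processing with dynamic additions, using a priority queue. The `RecallQueue` class is hypothetical and ignores tape positioning: it simply serves the lowest pending fseq next, which is an assumption about how "sorted in fseq order" combines with late arrivals.

```python
import heapq


class RecallQueue:
    """Hypothetical queue of tape files for one mounted tape, in fseq order."""

    def __init__(self, fseqs=()):
        self._heap = list(fseqs)
        heapq.heapify(self._heap)

    def add(self, fseq):
        """A new request for the same tape arrives while the recall runs."""
        heapq.heappush(self._heap, fseq)

    def next_file(self):
        """Return the lowest pending fseq, or None when the queue is empty."""
        return heapq.heappop(self._heap) if self._heap else None


queue = RecallQueue([45, 12, 78])
order = [queue.next_file()]      # lowest fseq is served first
queue.add(30)                    # added dynamically to the running request
while (fseq := queue.next_file()) is not None:
    order.append(fseq)
print(order)
```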

Tape recall flow
[Diagram sequence in four steps; components: stager, nsdaemon, catalogue, rtcpclientd, vdqmserv, rtcpd, recaller]
1. stager calls getsegattr(fileX) on nsdaemon and gets tape I00234, fseq 45; the catalogue holds tape I00234 as PENDING and segment 45 as UNPROCESSED; rtcpclientd asks tapesToDo?, gets I00234 and queues it in vdqmserv
2. Tape I00234 is MOUNTED, segment 45 is still UNPROCESSED; rtcpd has started and rtcpclientd forks a recaller
3. The recaller asks segmentsForTape? and gets fseq=45; segment 45 becomes SELECTED
4. The recaller asks bestFileSystemForSegment? and gets lxfsrk1234:/srv/castor/01, which is passed on to rtcpd

Tape migration
• Similar to tape recall, but
  • Triggered by policy rather than by waiting requests
• Migration candidates are attached to ‘streams’
  • A migration ‘Stream’ is a container of migration candidates
  • Each Stream is associated with 0 or 1 tapes:
    • 0 tapes: stream not active (e.g. not yet picked up by rtcpclientd, or VMGR tape pool is full)
    • 1 tape: stream is running (tape write request is running) or waiting for tape mount
  • A Stream can survive many tapes (but only one at a time)
  • A TapeCopy can be linked to many Streams
    • When a TapeCopy is selected by one of the Streams, its status is atomically updated, preventing it from being selected by another Stream
• The MigHunter process is responsible for attaching the migration candidates to the streams
  • Migrator policies can be used for fine-grained control over this process
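
The "linked to many Streams, selected by only one" rule can be sketched as follows. In CASTOR the atomic status update happens in the database; here an in-process lock stands in for it, and the class names, method names and the WAITINSTREAMS/SELECTED status strings are assumptions.

```python
import threading


class TapeCopy:
    """Hypothetical migration candidate that several streams may point to."""

    def __init__(self, name):
        self.name = name
        self.status = "WAITINSTREAMS"   # assumed status name
        self._lock = threading.Lock()

    def try_select(self):
        """Atomically claim the candidate; only the first caller succeeds."""
        with self._lock:
            if self.status != "WAITINSTREAMS":
                return False
            self.status = "SELECTED"
            return True


class Stream:
    """Container of migration candidates, bound to at most one tape."""

    def __init__(self, tape=None):
        self.tape = tape            # None: stream not active
        self.candidates = []

    def next_candidate(self):
        """Return the first candidate this stream manages to claim, if any."""
        for tc in self.candidates:
            if tc.try_select():
                return tc
        return None


tc = TapeCopy("/castor/fileY")
s1, s2 = Stream(tape="I00234"), Stream(tape="I00235")
s1.candidates.append(tc)
s2.candidates.append(tc)            # the same TapeCopy is linked to two streams
print(s1.next_candidate() is tc, s2.next_candidate())
```

The check and the update sit inside the same critical section, so two streams can never both see the candidate as still available.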

Example policy

    #!/usr/bin/perl -w
    #
    # Migration policy for distinguishing between small and large files
    # - if fileSize <  100 MB   smallfiles
    # - if fileSize >= 100 MB   largeFiles
    #
    use strict;
    use diagnostics;
    use POSIX;

    # The decision (0 or 1) is printed when the script exits
    my $doMigrate = 0;
    END { print "$doMigrate\n"; }

    my ($tapePool, $castorFile, $copynb, $fileid, $fileSize, $mode, $uid, $gid,
        $atime, $mtime, $ctime, $fileClassId) = @ARGV;

    # NOTE: 100*1024 matches the 100 MB of the header only if $fileSize is in KB
    if ( (($tapePool =~ "smallfiles") && ($fileSize <  100*1024)) ||
         (($tapePool =~ "largeFiles") && ($fileSize >= 100*1024)) ) {
        $doMigrate = 1;
    }
    exit(EXIT_SUCCESS);
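
For comparison, the same decision can be written as a pure function with the unit made explicit. This is a sketch, not part of CASTOR: it assumes file sizes are given in bytes and takes the 100 MB threshold from the comment header of the policy. The function name `do_migrate` is hypothetical.

```python
import re

THRESHOLD_BYTES = 100 * 1024 * 1024  # 100 MB, assuming sizes are in bytes


def do_migrate(tape_pool, file_size):
    """Return 1 if the file may be migrated to this tape pool, else 0."""
    # re.search mirrors the unanchored pattern match of the original policy.
    small_ok = re.search("smallfiles", tape_pool) and file_size < THRESHOLD_BYTES
    large_ok = re.search("largeFiles", tape_pool) and file_size >= THRESHOLD_BYTES
    return 1 if (small_ok or large_ok) else 0


print(do_migrate("smallfiles", 10 * 1024 * 1024))    # small file, small pool
print(do_migrate("largeFiles", 10 * 1024 * 1024))    # small file, large pool
print(do_migrate("largeFiles", 500 * 1024 * 1024))   # large file, large pool
```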

Comments, questions?