Condor: Overview and User Guide to the Condor Biostatistics Environment

Authorship
• Author
  – Patrícia Kayser Vargas
  – September 2002
  – Talk given at Biostat, Wisconsin, USA
• Revisions
  – V1
    • C. Geyer
    • PDP/2005-2, PPGC, UFRGS
    • December 2005

Topics
• Introduction
  – What is Condor?
  – Why and when to use Condor?
  – What are Condor Universes?
• Running Jobs on Condor
  – C programs
    • YAP
  – Java programs
• Final Remarks

Introduction

What is Condor?
• Condor
  – is a distributed batch scheduling system
• “The goal of Condor is to provide the highest feasible throughput by executing the most jobs over extended periods of time.” [1]
• What is a job?
  – Several possibilities

What is Condor?
• Condor
  – is composed of a collection of different daemons that provide various services, such as
    • a job queueing mechanism,
    • scheduling policies,
    • a priority scheme,
    • monitoring,
    • resource management,
    • job management,
    • matchmaking, ...

What is Condor? Architecture
[Figure: Condor architecture, from [1]]

What is Condor? Architecture
• Machine types
  – Central Manager
    • Manager of a Condor network (grid)
    • One per pool
    • Central point of failure
  – Submit Machines
    • Users' machines
    • The user submits, monitors, and controls the execution of a job
  – Execution Machine (worker)
    • Runs jobs
  – A single machine can play several roles

What is Condor? Architecture
• Machine types (cont.)
  – Checkpoint Server
    • Optional
    • Stores checkpoint files

What is Condor? Architecture
• Condor has four daemons
• On Central Manager and on Submit Machines
  – startd:
    • monitors the conditions of the resource where it runs,
    • publishes resource-offer ClassAds, and
    • is responsible for enforcing the resource owner’s policy for starting, suspending, and evicting jobs.
  – schedd:
    • maintains a persistent job queue,
    • publishes resource-request ClassAds, and
    • negotiates for available resources

What is Condor? Architecture
• Only on Central Manager:
  – collector:
    • is the central repository of information
    • startd and schedd send periodic updates to the collector
  – negotiator:
    • periodically performs a negotiation cycle
      – a process of matchmaking
      – the negotiator tries to find matches between the various ClassAds of resource offers and requests, and
      – once a match is made, both parties are notified and are responsible for acting on that match

What is Condor? Architecture
[Figure: Condor architecture, from [1]]

What is Condor? Architecture
[Figure: submitter and executing machines, from [1]]

What is Condor? Architecture
• Resource and job ClassAds are published by sending them to the collector
  – startd sends the resource ClassAds
  – schedd sends the job ClassAds
• The collector forwards everything to the negotiator, which performs the matchmaking

What is Condor? Architecture
• Matchmaking algorithm
  – the negotiator finds resources on which a job can run
  – it tells the schedd daemon on the submitting machine which machine to contact in order to export the job
  – it tells the startd daemon on the machine chosen for execution (an idle resource that meets the requirements) that it is about to receive a task
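The matchmaking step described above can be illustrated with a toy sketch. This is not Condor's implementation: plain Python dicts stand in for ClassAds, and every name here (`matches`, `negotiation_cycle`, the `state`, `opsys`, `memory_mb`, and `owner` attributes) is a hypothetical choice made only for illustration. The one idea it tries to show is that a match is two-sided: the job's requirements must accept the machine, and the machine's policy must accept the job.

```python
# Toy matchmaking sketch; plain dicts stand in for ClassAds.
# All attribute and function names are illustrative assumptions.

def matches(job_ad, machine_ad):
    """A match is two-sided: each ad's requirements must accept the other ad."""
    return job_ad["requirements"](machine_ad) and machine_ad["requirements"](job_ad)


def negotiation_cycle(job_ads, machine_ads):
    """Pair each job with the first idle machine that matches, one job per machine."""
    idle = [m for m in machine_ads if m["state"] == "idle"]
    pairs = []
    for job in job_ads:
        for machine in idle:
            if matches(job, machine):
                pairs.append((job["name"], machine["name"]))
                idle.remove(machine)  # the machine is now claimed
                break
    return pairs


machines = [
    {"name": "m1", "state": "idle", "opsys": "LINUX", "memory_mb": 256,
     "requirements": lambda job: True},
    {"name": "m2", "state": "busy", "opsys": "LINUX", "memory_mb": 1024,
     "requirements": lambda job: True},
    {"name": "m3", "state": "idle", "opsys": "SOLARIS", "memory_mb": 512,
     "requirements": lambda job: job["owner"] != "guest"},
]
jobs = [
    {"name": "job0", "owner": "alice",
     "requirements": lambda m: m["opsys"] == "LINUX"},
    {"name": "job1", "owner": "guest",
     "requirements": lambda m: m["memory_mb"] >= 512},
]

pairs = negotiation_cycle(jobs, machines)
print(pairs)
```

In this sketch job1 stays unmatched: the only idle machine with enough memory refuses jobs owned by "guest", which mirrors the owner-policy role the slides assign to startd.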

What is Condor? Architecture
• From this point on the central manager is no longer involved; the two machines carry out the job
  – the submit machine creates a shadow process
    • to send the task and receive the results
  – the execution machine
    • creates a starter process, which receives the task, and
    • a “user job”, which in turn runs the task
    • at the end, the results are sent back to the submit machine

Why and when to use Condor?
• Condor is useful when
  – there are several jobs to be submitted
  – there is one executable and several different input data sets

Why and when to use Condor?
• Condor is useful because it
  – can use different available machines
    • opportunistic scheduling
  – controls file transfers
    • the job must be able to access the data files from any machine on which it can potentially run
  – sends email notifying you when a job has completed
    • except for jobs submitted from a Linux machine

What are Condor Universes?
• Types of universes
  – standard
  – vanilla
  – java
  – parallel
• The Universe attribute is specified in the submit description file
  – the default is standard

What are Condor Universes?
• standard
  – provides
    • checkpointing and
    • remote system calls
  – makes the job more reliable and gives it uniform access to resources from anywhere in the pool
  – to prepare a program as a standard universe job, it must be relinked with condor_compile

What are Condor Universes?
• standard
  – there are a few restrictions
  – complete list in the manual: http://www.cs.wisc.edu/condor/manual/v6.4/2_4Road_map_running.html
  – examples
    • no multi-process jobs (no fork(), exec(), or system())
    • no inter-process communication (includes pipes, semaphores, and shared memory)
    • no sending or receiving of SIGUSR2 or SIGTSTP
    • all files must be opened read-only or write-only

What are Condor Universes?
• vanilla
  – used for programs that cannot be successfully relinked
  – useful for shell scripts
  – cannot checkpoint or use remote system calls
  – sometimes a job must restart from the beginning on another machine in the pool
    • because there is no checkpoint

What are Condor Universes?
• java
  – can execute on any machine in the pool that runs a Java Virtual Machine
  – at the moment it does not work at Biostat
    • a department at Wisconsin
  – compiled Java programs can be submitted
  – creating a jar file is recommended for programs with several classes

What are Condor Universes?
• parallel
  – MPI and PVM
    • used for parallel programs using message passing
  – Globus
    • must have Condor-G installed
  – I did not check whether they work at Biostat

Running Jobs on Condor

Running Jobs on Condor
• You can submit your jobs from any Biostat machine, since they all run schedd and startd
• You must
  – set the PATH environment variable
  – prepare a submission file
  – compile your job with condor_compile if using the standard universe
  – submit your job(s) with the condor_submit command

Running Jobs on Condor
• Submission file
  – the submit description file is the file that specifies
    • which executable to run
    • the directory where the output files will be placed
    • how many jobs will be instantiated, etc.

Running Jobs on Condor
• Submission file
  – this file is turned into one ClassAd for each job that needs to be instantiated
    • e.g., if the file contains the command 'queue 50', 50 jobs of that program will have to be run
    • so 50 ClassAds will be published to the central manager
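As a rough illustration of the point above (one ad per queued instance), the following sketch copies a description once per job and tags each copy with its sequence number. It only models the counting; the function name `expand_queue` and the dict representation are assumptions for illustration, not how condor_submit is implemented.

```python
def expand_queue(description, count):
    """Return one job-ad dict per queued instance, numbered 0..count-1."""
    return [dict(description, Process=i) for i in range(count)]


# 'queue 50' in a submit file corresponds to 50 separate job ads.
ads = expand_queue({"Executable": "fact", "Universe": "standard"}, 50)
print(len(ads), ads[0]["Process"], ads[-1]["Process"])
```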

Running Jobs on Condor
Setting the PATH environment variable
• Change PATH so that the Condor commands can be found (depending on your shell)
  bash:
    source /s/pkg/condor.sh
    PATH=$PATH:/s/pkg/`/s/share/ostoken`/condor/bin; export PATH
  csh:
    source /s/pkg/condor.csh
    set path = ( $path /s/pkg/`/s/share/ostoken`/condor/bin )
    rehash

Running Jobs on Condor
Preparing a submission file
• ClassAds (Classified Advertisements)
  – attribute/value pairs
  – syntax similar to C/Java
• The commands are case insensitive, i.e., these two lines are equivalent:
    executable = fact
    Executable = fact

Running Jobs on Condor
Preparing a submission file
• At a minimum, the file must have the “executable” attribute: your program/binary
    Executable = fact
• Another useful attribute: the input file
  – your data
    input = test.data

Running Jobs on Condor
Compiling your job with condor_compile
• If using the standard universe:
  – use condor_compile
    • the program must be relinked with the Condor library
    condor_compile gcc fact.c -o fact

Running Jobs on Condor
Submitting your job(s) with condor_submit
• In any Condor universe
  – jobs are submitted using the condor_submit command with the submission file as its parameter
    condor_submit condor1.sub
  – the -v option shows information about the submission (the full ClassAd generated)
    • it only prints a listing and exits (non-interactive)
    condor_submit -v condor1.sub

Example of C Program

Running Jobs on Condor
C programs
    bash-2.03$ condor_compile gcc fact.c -o fact
• options:
  – gcc (the GNU C compiler)
  – cc (the system C compiler)
  – acc (ANSI C compiler, on Sun systems)
  – CC (the system C++ compiler)
  – … (http://www.cs.wisc.edu/condor/manual/v6.4/condor_compile.html)

Running Jobs on Condor
C programs
– example of a “submission file”
    ##########
    # C Example: demonstrate use of multiple directories
    # "Arguments = 5" to pass integer 5 as parameter
    #
    ##########
    Executable = fact
    Universe   = standard
    output     = loop.out
    error      = loop.error
    Log        = loop.log
    Arguments  = 5
    Initialdir = run_1
    Queue
    Initialdir = run_2
    Queue

Running Jobs on Condor
C programs
• Log
  – contains important information for evaluating the application's execution and performance
  – may not be that relevant for an ordinary user
  – describes every event that happens to the job, with date/time/machine information
    • when it was submitted, started running, was suspended, was migrated, finished (with an error or successfully)

Running Jobs on Condor
C programs
• Arguments
  – parameters for the executable
  – in the example:
    • arguments = 5
    • is equivalent to running 'fact 5' in a terminal
• Initialdir
  – where the output/error/log files will be stored
  – initialdir = run_1
    • the “run_1” directory

Running Jobs on Condor
C programs
• Queue
  – runs a single instance of the job, using run_1 as initialdir
  – the directory must be created before running condor_submit, otherwise an error occurs
• “Initialdir = run_2” and “Queue”
  – one more instance of the job, this time in another directory
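Since the run directories have to exist before submission, a small helper script can create them up front. The sketch below uses only the Python standard library; the helper name `prepare_run_dirs` is hypothetical, and a temporary directory stands in for the real submit directory so the example can be run anywhere.

```python
import os
import tempfile


def prepare_run_dirs(base, names):
    """Create each run directory under base if it is missing; return the paths."""
    paths = []
    for name in names:
        path = os.path.join(base, name)
        os.makedirs(path, exist_ok=True)  # no error if it already exists
        paths.append(path)
    return paths


base = tempfile.mkdtemp()  # stand-in for the directory you submit from
created = prepare_run_dirs(base, ["run_1", "run_2"])
print(created)
```

Running it a second time is harmless, so it can sit at the top of whatever script later calls condor_submit.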

Running Jobs on Condor
C programs
– another example of a “submission file”
    ##########
    # C Example:
    # each job runs with a different argument and
    # store results in different files
    ##########
    Executable  = fact
    notify_user = kayser@cos.ufrj.br
    Input       = in.$(Process)
    Output      = out.$(Process)
    Error       = err.$(Process)
    Log         = fact.log
    Queue 2

Running Jobs on Condor
C programs
• notify_user = kayser@cos.ufrj.br
  – tells Condor to send a message when the job finishes
• Input = in.$(Process)
  – $(Process): the Condor variable Process
    • which is instantiated with a sequential integer for each job created
    • so the names become in.0, in.1, in.2, ...
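The $(Process) substitution described above can be mimicked with a few lines of Python. This is a sketch of the idea only: the function `expand_macros` is a hypothetical helper, it treats names case-sensitively, and it leaves unknown names untouched, none of which is claimed to match Condor's own macro handling.

```python
import re


def expand_macros(value, variables):
    """Replace each $(Name) in value with variables[Name]; unknown names stay as-is."""
    def substitute(match):
        name = match.group(1)
        return str(variables[name]) if name in variables else match.group(0)
    return re.sub(r"\$\((\w+)\)", substitute, value)


# With 'Queue 2', Process takes the values 0 and 1.
inputs = [expand_macros("in.$(Process)", {"Process": i}) for i in range(2)]
outputs = [expand_macros("out.$(Process)", {"Process": i}) for i in range(2)]
print(inputs, outputs)
```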

Running Jobs on Condor
C programs
• Log = fact.log
  – a single log file even though there are several jobs
  – events are tagged with the job number
• Queue 2
  – creates two jobs
  – any integer can be used
  – Queue 100
    • creates 100 tasks

Running Jobs on Condor
C programs – YAP
• To configure YAP with Condor:
    configure --enable-depth-limit --enable-condor
    make

Running Jobs on Condor
C programs – YAP
• condor.sub
    Universe     = standard
    Executable   = /u/dutra/Yap-4.3.20/condor/yap.$$(Arch).$$(OpSys)
    Initialdir   = /u/dutra/App/f1/train_best
    Log          = /u/dutra/App/f1/train_best/log
    Requirements = ((Arch == "INTEL" && OpSys == "LINUX") && (Mips >= 500) || (IsDedicated && UidDomain == "cs.wisc.edu"))
    Arguments    = -b /u/dutra/Yap-4.3.20/condor/../pl/boot.yap
    Input        = condor.in.$(Process)
    Output       = /dev/null
    Error        = [value missing in source]
    Queue 300
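To see which machines the Requirements expression above would accept, it can help to transcribe it into an ordinary boolean function and try it on a few sample machine descriptions. The sketch below does that in Python; the three sample machines (including the "SUN4u"/"SOLARIS28" values and the "example.org" domain) are made up for illustration. As in C, `&&` binds tighter than `||`, and Python's `and`/`or` have the same relative precedence, so the grouping is preserved.

```python
def requirements(machine):
    """Transcription of the Requirements expression; 'and' binds tighter than 'or'."""
    return bool(
        (machine["Arch"] == "INTEL" and machine["OpSys"] == "LINUX")
        and machine["Mips"] >= 500
        or (machine["IsDedicated"] and machine["UidDomain"] == "cs.wisc.edu")
    )


# Hypothetical machine ads, written as plain dicts.
fast_linux = {"Arch": "INTEL", "OpSys": "LINUX", "Mips": 800,
              "IsDedicated": False, "UidDomain": "example.org"}
slow_linux = {"Arch": "INTEL", "OpSys": "LINUX", "Mips": 300,
              "IsDedicated": False, "UidDomain": "example.org"}
dedicated_other = {"Arch": "SUN4u", "OpSys": "SOLARIS28", "Mips": 200,
                   "IsDedicated": True, "UidDomain": "cs.wisc.edu"}

results = [requirements(m) for m in (fast_linux, slow_linux, dedicated_other)]
print(results)
```

Read this way, the expression accepts either a sufficiently fast Intel/Linux machine or any dedicated machine in the cs.wisc.edu domain, whatever its architecture or speed.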

Running Jobs on Condor
C programs – YAP
• condor.in.0
    ['~/Yap-4.3.20/condor/../pl/init.yap'].
    module(user).
    ['~/Aleph/aleph.pl'].
    read_all('~/App/f1/train_best/train').
    set(i, 5).
    set(minacc, 0.7).
    set(clauselength, 5).
    set(recordfile, '~/App/f1/train_best/trace-0.7-5.0').
    set(test_pos, '~/App/f1/train_best/test.f').
    set(test_neg, '~/App/f1/train_best/test.n').
    set(evalfn, coverage).
    induce.
    write_rules('~/App/f1/train_best/theory-0.7-5.0').
    halt.

Example of Java Program

Running Jobs on Condor
Java programs
• Use the Java universe
• No need to compile with Condor
• Use a jar file for programs with several classes: http://java.sun.com/docs/books/tutorial/jar/
• If using the Computer Science environment, you must grant access to the files to be used on AFS: http://www.cs.wisc.edu/condor/uwcs/

Running Jobs on Condor
Java programs
    ##########
    # Example in Java Universe
    # executable must have the .class file and
    # arguments must have the main class as first argument
    ##########
    universe    = java
    executable  = Fact.class
    arguments   = Fact
    notify_user = kayser@cos.ufrj.br
    output      = loop.out
    error       = loop.error
    log         = loop.log
    Queue

Running Jobs on Condor
Java programs
    ##########
    # Example in Java Universe using jar file
    ##########
    universe       = java
    executable     = jgf.Section2.jar
    arguments      = JGFAllSizeA 4
    jar_files      = jgf.Section2.jar
    transfer_files = ALWAYS
    output         = log.All.Section2f.out
    error          = log.All.Section2f.error
    log            = log.All.Section2f.log
    Queue

Running Jobs on Condor
Java programs
• executable = jgf.Section2.jar
  – it is a jar
  – not a .class file as in the previous example
• arguments = JGFAllSizeA 4
  – two arguments
  – example generated from Java Grande
• jar_files = jgf.Section2.jar
  – looks redundant
  – but without this attribute the file is not transferred

Running Jobs on Condor
Java programs
• transfer_files = ALWAYS
  – likewise: needed so that the .jar is transferred
  – possibly a bug that has since been fixed

Running Jobs on Condor
Inspecting Condor Jobs
• Some useful commands:
  – condor_q
    • shows the queue of locally submitted jobs
  – condor_q -analyze
    • gives more information
    • helps you tell whether a job is not running because of a problem with its requirements or because no resource is available

Running Jobs on Condor
Inspecting Condor Jobs
• condor_q -run
  – shows only the jobs that are currently running
• condor_q -submitter <user>
  – filters the output to show only the jobs submitted by “user”

Running Jobs on Condor
Inspecting Condor Jobs
• condor_status
  – lists each machine in the Condor pool
  – showing information that is
    • static (e.g., which operating system it runs)
    • dynamic (e.g., whether it is idle or busy)

Running Jobs on Condor
Inspecting Condor Jobs
• condor_rm
  – removes a job or a set of jobs from the queue
  – similar to kill
  – you need to give the job number
• condor_q -global
  – shows information for all queues
  – on all machines from which jobs were submitted

Final Remarks

Final Remarks
• So, Condor...
  – controls the execution of several jobs
  – can really improve your runtime
    • Yap+Aleph: over three months, 53,000 CPU hours (peak of 400 machines)
• But, Condor...
  – does not automatically parallelize your job

Final Remarks
• Running Jobs on Condor - Observations:
  – the input data file and the directory used for output/log/error must be created beforehand,
    • otherwise an error will be reported and no job will be executed
  – for each execution,
    • the outputs are appended to log files
    • the results overwrite the out files
  – error, log, and out files must have different names
    • to avoid race conditions
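The observations above lend themselves to a quick pre-flight check run before submission. The following is a minimal sketch under stated assumptions: `preflight` and its parameters are hypothetical names, it looks for the input file inside the initial directory, and it checks only the three conditions listed in the slide, not everything condor_submit itself validates.

```python
import os
import tempfile


def preflight(initialdir, input_name, output_name, error_name, log_name):
    """Return a list of problems found; an empty list means the checks passed."""
    problems = []
    if not os.path.isdir(initialdir):
        problems.append("missing directory: " + initialdir)
    elif not os.path.isfile(os.path.join(initialdir, input_name)):
        problems.append("missing input file: " + input_name)
    if len({output_name, error_name, log_name}) < 3:
        problems.append("output, error, and log must have different names")
    return problems


base = tempfile.mkdtemp()  # stand-in for an existing run directory
with open(os.path.join(base, "in.0"), "w") as handle:
    handle.write("5\n")

ok = preflight(base, "in.0", "out.0", "err.0", "fact.log")
bad = preflight(os.path.join(base, "run_9"), "in.0", "loop.out", "loop.out", "loop.log")
print(ok, bad)
```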

Final Remarks
• Work on data management
  – though I do not know to what extent it is integrated with Condor
  – Stork (Data Placement Scheduler): http://www.cs.wisc.edu/condor/stork
  – Kangaroo (this one seems to have been abandoned): http://www.cs.wisc.edu/condor/kangaroo
  – NeST: Network Storage: http://www.cs.wisc.edu/condor/nest/

Final Remarks
• Work on monitoring
  – Hawkeye System Monitoring Tool: http://www.cs.wisc.edu/condor/hawkeye/

Final Remarks
• More information about Condor: http://www.cs.wisc.edu/condor/
• Tutorials
  – http://www.cs.wisc.edu/condor/CondorWeek2006/
  – http://www.cs.wisc.edu/condor/CondorWeek2005/presentations.html
• More information about running Condor: http://www.cs.wisc.edu/condor/manual/v6.4/

Final Remarks
• References:
  – [1] WRIGHT, Derek. Cheap cycles from the desktop to the dedicated cluster: combining opportunistic and dedicated scheduling with Condor. In: Conference on Linux Clusters: The HPC Revolution, June 2001, Champaign-Urbana, IL, USA. http://www.cs.wisc.edu/condor/doc/cheap-cycles.pdf

NMR-Star file to ClassAd
Patrícia Kayser Vargas Mangan
kayser@cos.ufrj.br
September 2002

NMR-Star to ClassAd
• BioMagResBank (http://www.bmrb.wisc.edu)
  – an international repository for biological NMR (nuclear magnetic resonance) data
  – uses the NMR Self-defining Text Archival and Retrieval (NMR-STAR) format to store its data
• NMR-STAR is characterized by a set of information organized as a hierarchical tree
  – stored as plain text files
  – some may have inconsistencies, which are verified manually

NMR-Star to ClassAd
• ClassAds
  – a simple representation language, first used in the Condor context
• Steps:
  – convert NMR-STAR data to the ClassAds format using starlibj (a Java package)
  – use the result to detect inconsistencies in NMR-STAR files

NMR-Star to ClassAd
• Future work:
  – Matchmaking as a consistency checker
  – try to “learn” similarities among NMR data
• Working with R. Kent Wenger from the Condor team at UW-Madison

TALK 1: Condor: Managing Resources in the Biostatistics Department Environment
TALK 2: Using ClassAds to Represent NMR Data

What is Condor? Architecture
• After the schedd receives a match for a given job, it enters into a claiming protocol directly with the startd
• Through this protocol, the schedd presents the job ClassAd to the startd and requests temporary control over the resource