D0 Run II Farms
Heidi Schellman
Farms and SAM
10/7/2020
(Stolen by Iain Bertram)

D0 Farm needs
• 250 KB raw event size
• 50 Hz trigger rate
  – peak rate of 12.5 MB/sec to farms
• Reconstruction takes 5-10 CPU seconds/event
  – need 250-500 CPUs to handle peak rate
  – DC is 40% of peak
  – time constant for a 1 GB file is 5-10 hours
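The figures on this slide follow from simple arithmetic, which the short sketch below reproduces. It assumes decimal units (1 KB = 1000 bytes, 1 GB = 10^9 bytes); the function and constant names are illustrative and do not come from the talk.

```python
# Back-of-envelope check of the farm sizing figures quoted on the slide.
# Assumption: decimal units (1 KB = 1e3 bytes, 1 MB = 1e6, 1 GB = 1e9).
EVENT_SIZE_BYTES = 250e3   # raw event size
TRIGGER_RATE_HZ = 50.0     # peak trigger rate
FILE_SIZE_BYTES = 1e9      # one 1 GB raw data file


def peak_rate_mb_per_s():
    """Peak data rate into the farm, in MB/sec."""
    return EVENT_SIZE_BYTES * TRIGGER_RATE_HZ / 1e6


def cpus_for_peak(reco_seconds_per_event):
    """CPUs needed to keep up with the peak trigger rate."""
    return TRIGGER_RATE_HZ * reco_seconds_per_event


def hours_per_file(reco_seconds_per_event):
    """Time for a single CPU to reconstruct one file, in hours."""
    events_per_file = FILE_SIZE_BYTES / EVENT_SIZE_BYTES
    return events_per_file * reco_seconds_per_event / 3600.0


if __name__ == "__main__":
    print("peak rate [MB/s]:", peak_rate_mb_per_s())
    for seconds in (5.0, 10.0):
        print(seconds, "s/event:", cpus_for_peak(seconds), "CPUs,",
              round(hours_per_file(seconds), 1), "hours per file")
```

With these inputs a 1 GB file holds 4000 events, so one CPU needs roughly 5.6 to 11.1 hours per file, which matches the 5-10 hour range quoted on the slide.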

Current design for Stage 1
[Diagram labels: Database server, I/O, Tape Robot, 18 GB, 300 GB, 50 dual workers]

Data Access is SAM/enstore
• Good news:
  – no directly mounted tape drives on farm
  – data handling and much of the book-keeping done by SAM
  – farm control is easier to write
• Bad news:
  – increased load on network
  – need to integrate two (3?) systems

I/O node
• Purpose
  – merge of farm output
  – careful layout of farm output on tape
  – peak I/O rates of 40-60 MB/sec
• For first 100 CPU farm:
  – 4-8 CPU SMP
  – 1-2 Gb Ethernet cards
  – 300 GB disk
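As an illustration of the merging role of the I/O node, the sketch below groups worker output files, in arrival order, into batches no larger than a target size before they would be written to tape. The greedy strategy, the function name `plan_merges`, and the size threshold are assumptions for illustration; the talk does not describe the actual merging logic.

```python
def plan_merges(files, target_bytes):
    """Group (name, size_in_bytes) pairs, in order, into merge batches.

    Each batch holds at most target_bytes, except that a single file
    larger than target_bytes is placed in a batch of its own.
    Hypothetical helper, not part of SAM or FBS.
    """
    batches = []
    current = []
    current_size = 0
    for name, size in files:
        # Close the current batch if adding this file would overflow it.
        if current and current_size + size > target_bytes:
            batches.append(current)
            current = []
            current_size = 0
        current.append(name)
        current_size += size
    if current:
        batches.append(current)
    return batches


if __name__ == "__main__":
    outputs = [("run1_a", 400), ("run1_b", 400), ("run1_c", 400), ("run1_d", 1200)]
    print(plan_merges(outputs, target_bytes=1000))
```

Keeping files in arrival order is a deliberate simplification: it preserves whatever ordering the farm produced, which is one plausible way to keep related output adjacent on tape.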

Worker Nodes
• Dual Pentium III?
• 256 MB/CPU
• 2 data disks (9 or 18 GB) + 6 GB system
• Generic FNAL farm nodes
(Slide annotation: Farms are popular! D0-UK are building a similar system using our local experience.)

Worker Configuration
• Workers act as remote nodes
  – avoid dependence on NFS
  – products download on change
  – D0 Farm executables download at job start
  – databases either download or access through Dbserver
  – data access through SAM/encp/rcp

Present Status
• Two parallel paths
  – D0 specific code/farm control
      • covered on Monday (1 slide)
  – SAM integration with farm

What we learned from MCC99 production
• Processed 75,000 events several times
• Takes less than a week at very low efficiency
  – (errors, SAM tests)
• Lessons
  – FBS system robust, works reliably
  – FBS control scripts easy to write
  – D0 code and parameter files can be downloaded to remote machines, no need for NFS
  – Linux systems not very forgiving of executable growth
  – rcp transfers generally reliable, but should not do too many at once
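The last lesson, that too many simultaneous rcp transfers cause trouble, can be handled by capping concurrency. The sketch below is one minimal way to do that with a thread pool; `copy_all`, `rcp_copy`, and the default limit of 2 are illustrative assumptions, not the farm's actual control scripts.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def rcp_copy(source, destination):
    """Copy one file with rcp and report success.

    Assumes an rcp binary is on PATH; source and destination are
    caller-supplied strings such as "host:/path/file".
    """
    return subprocess.run(["rcp", source, destination]).returncode == 0


def copy_all(pairs, copy=rcp_copy, max_parallel=2):
    """Run copy(source, destination) for every pair, at most
    max_parallel at a time, and return the pairs that failed."""
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        results = list(pool.map(lambda pair: copy(*pair), pairs))
    return [pair for pair, ok in zip(pairs, results) if not ok]
```

Passing the copy function as a parameter keeps the throttling logic independent of rcp, so the same wrapper could drive encp or another transfer tool, and failed pairs can be retried by calling `copy_all` again on the returned list.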

Path II - SAM/FBS integration
• SAM/enstore now in ups/upd distribution
  – All I have to do is:
      setup farms
      setup sam
      setup encp
  – and can use the system
• Have successfully (5-27-99) read files from SAM (disk) to worker nodes and reconstructed them.
• (6-4-99) Successfully read files from SAM (encp tape) to worker nodes

SAM from robot
[Figure: Farm CPU monitor with 10 second samples. Farm is reading 700 MB top files from robot. Annotations: Job start (control node); Job start (worker); 1st transfer begins; 1st transfer ends and processing begins; 2nd transfer begins]
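The monitor plot shows the second transfer starting once the first file has arrived and processing has begun, so file delivery overlaps with reconstruction. A minimal sketch of that overlap, assuming caller-supplied `fetch` and `process` functions (both hypothetical stand-ins for the SAM/encp transfer and the reconstruction job), is:

```python
import queue
import threading


def run_pipeline(file_names, fetch, process, prefetch=1):
    """Process files in order while a background thread fetches ahead.

    At most `prefetch` fetched files wait in the queue, so the next
    transfer runs while the current file is being processed.
    Error handling is deliberately minimal in this sketch.
    """
    staged = queue.Queue(maxsize=prefetch)
    done = object()  # sentinel marking the end of the file list

    def fetcher():
        try:
            for name in file_names:
                staged.put(fetch(name))
        finally:
            # Always signal completion so the consumer cannot hang.
            staged.put(done)

    threading.Thread(target=fetcher, daemon=True).start()

    results = []
    while True:
        item = staged.get()
        if item is done:
            break
        results.append(process(item))
    return results
```

The bounded queue is the design choice that matters here: it lets one transfer run ahead of processing without filling the worker's data disks with files that are not yet needed.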

What we learned
• SAM, written and tested on another platform, required very few fixes to integrate with farm.
• Small conflict between FBS and encp assumptions about job cleanup identified and fixed.
• Error messages not idiot proof yet.
• SAM is already easier to use than standalone scripts

Short term goals
• Put the output data directly into SAM from worker
  – (currently can use SAM import mechanism from I/O node)
• Get whole D0 MC sample into SAM and reprocess it
• Improve control scripts (sh -> python)

Long Term (Fall 99)
• Eliminate need for NFS on workers
• Get a 100 CPU system going when it arrives
• Get scripts for general farm control
  – project generation
  – monitoring/accounting
  – error recovery
  – file merging
• Generate/process MC99_2