Computing challenges and architecture
Xiaomei Zhang, IHEP
CEPC Workshop, Oxford, UK, April 2019



Time line
• Higgs factory in 2030: ~4 years after HL-LHC
• Z factory in ~2037: ~11 years after HL-LHC


Data volume
• Read-out in DAQ, from the CDR:
  - Maximum event rate: ~100 kHz at the Z peak
  - Data rate to the trigger: ~2 TB/s
  - Trigger rate not clear yet
• Event size from simulation:
  - Signal event size: ~500 KB/event for Z, ~1 MB/event for Higgs
  - Adding the background, this likely increases to 5~10 MB/event for Z and 10~20 MB/event for Higgs
• Estimated data rate output to disk (a rough consistency check is sketched below):
  - Higgs/W factory (8 years), with 10^8 events: 1.5~3 PB/year
  - Z factory (2 years), with 10^11~10^12 events: 0.5~5 EB/year
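As a rough, back-of-the-envelope check (my own arithmetic, not from the slides), the Z-factory figure is consistent with the quoted event counts when combined with the lower end of the per-event size estimate (~5 MB/event):

    \[
    (10^{11} \text{ to } 10^{12})\ \text{events} \times 5\ \text{MB/event} \;\approx\; 0.5 \text{ to } 5\ \text{EB}
    \]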


CEPC computing challenges
• Two stages:
  - Higgs/W factory: 1.5~3 PB/year (>10 years from now)
  - Z factory: 0.5~5 EB/year (>17 years from now)
• Data volume at LHC and HL-LHC, for comparison:
  - LHC: 50 PB/year in 2016; HL-LHC: ~600 PB/year in 2027
  - Raw data: 50 PB (2016) vs. 600 PB (2027); derived data (1 copy): 80 PB (2016) vs. 900 PB (2027)
• No major computing problem is expected for the Higgs/W factory
  - Benefit from WLCG experience
  - Try to do it at lower cost
• The challenging part will be the Z factory
  - EB scale, the same data volume level as HL-LHC
  - But 11 years later

Computing status in R&D (NOW)



Computing requirements
• CEPC simulation for detector design needs ~2K CPU cores and ~2 PB of storage each year
  - Currently there is not enough funding to meet these requirements
• Distributed computing is the main way to collect and organize resources for R&D:
  - Dedicated resources from funding
  - Contributions from collaborators
  - IHEP resources shared with other experiments through HTCondor and IHEPCloud
  - Commercial clouds, supercomputing centers, ...


Distributed computing
• The CEPC distributed computing system was built on DIRAC in 2015
• DIRAC provides a framework and solution for experiments to set up their own distributed computing systems
  - Originally from LHCb, now widely used by other communities such as Belle II, ILC, CTA, EGI, etc.
• Good cooperation with the DIRAC community
  - Joined the DIRAC consortium in 2016
  - Joining the efforts on common needs
• The system takes into account the current CEPC computing requirements, resource situation and manpower
  - Use existing grid solutions from WLCG as much as possible
  - Keep the system as simple as possible for users and sites


Computing model
• IHEP as the central site
  - Event generation (EG) and analysis
  - Holds the central storage for all data
  - Holds the central database for detector geometry
• Remote sites
  - MC production, including Mokka simulation + Marlin reconstruction
  - No storage requirements
• Data flow
  - IHEP -> sites: stdhep files from EG are distributed to the sites
  - Sites -> IHEP: output MC data are transferred back to IHEP directly from the jobs (see the sketch below)
• Simple but extensible
  - No problem adding more sites and resources
  - Storage can be extended to a multi-tier infrastructure with more distributed SEs
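To illustrate the "sites -> IHEP" data flow, here is a minimal sketch of how a job wrapper could upload its output directly to the central storage at IHEP using the standard DIRAC data management client. The LFN, local file name and the SE name "IHEP-STORM" are placeholders for illustration only, not taken from the slides, and this is not necessarily what the actual CEPC job wrapper does.

    from DIRAC.Core.Base import Script
    Script.parseCommandLine(ignoreErrors=True)   # initialise the DIRAC environment (needs a valid proxy)

    from DIRAC.DataManagementSystem.Client.DataManager import DataManager

    # Placeholder names: logical file name, local output file and central SE
    lfn = "/cepc/mc/higgs/reco_sample_0001.slcio"
    local_file = "reco_sample_0001.slcio"
    central_se = "IHEP-STORM"

    # Upload the file to the central SE at IHEP and register it in the file catalogue
    result = DataManager().putAndRegister(lfn, local_file, central_se)
    if not result["OK"]:
        raise RuntimeError("Upload to %s failed: %s" % (central_se, result["Message"]))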


Network
• The IHEP international network provides a good basis for distributed computing
  - 20 Gbps outbound, 10 Gbps to Europe/USA
• In 2018 IHEP joined LHCONE, together with CSTNet and CERNET
  - LHCONE is a virtual network dedicated to LHC traffic
  - CEPC international cooperation can also benefit from LHCONE


Resources
• 6 active sites
  - England, Taiwan, and Chinese universities (4)
  - QMUL from England and IPAS from Taiwan play a great role
• Resources: ~2500 CPU cores, shared with other experiments
  - Resource types include Cluster, Grid and Cloud
  - This year ~500 CPU cores of dedicated resources will be added from IHEP

  Site Name                  CPU Cores
  CLOUD.IHEPOPENNEBULA.cn    24
  CLUSTER.IHEP-Condor.cn     48
  CLOUD.IHEPCLOUD.cn         200
  GRID.QMUL.uk               1600
  CLUSTER.IPAS.tw            500
  CLUSTER.SJTU.cn            100
  Total (Active)             2472

  (QMUL: Queen Mary University of London; IPAS: Institute of Physics, Academia Sinica)


Software distribution with CVMFS
• CVMFS is a global, HTTP-based file system to distribute software in a fast and reliable way
• CEPC uses CVMFS to distribute its software (see the sketch below)
• The IHEP CVMFS service eventually joined the CVMFS federation
  - In 2014, the IHEP CVMFS Stratum 0 (S0) was created
  - In 2017, the IHEP CVMFS Stratum 1 (S1) was created, replicating both the IHEP S0 and the CERN S0
  - In 2018, the RAL S1 started replicating the IHEP S0 to speed up access to CEPC software among European collaborators
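For illustration, a minimal sketch of how a job on a worker node might check that the CEPC CVMFS repository is mounted before setting up the software. The repository path, setup script and payload command are hypothetical placeholders; the actual repository name served by the IHEP Stratum 0 may differ.

    import os
    import subprocess

    # Hypothetical CEPC repository path and setup script
    CEPC_REPO = "/cvmfs/cepc.ihep.ac.cn"
    SETUP_SCRIPT = os.path.join(CEPC_REPO, "software", "setup.sh")

    def cvmfs_available(repo=CEPC_REPO):
        """Return True if the CVMFS repository is mounted and readable.

        Listing the directory triggers autofs to mount the repository on a
        worker node with a standard CVMFS client configuration.
        """
        try:
            os.listdir(repo)
            return True
        except OSError:
            return False

    if cvmfs_available():
        # Run a placeholder payload in the environment prepared by the setup script
        subprocess.check_call("source %s && Marlin --help" % SETUP_SCRIPT,
                              shell=True, executable="/bin/bash")
    else:
        raise RuntimeError("CVMFS repository %s is not available on this node" % CEPC_REPO)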


Workload management
• DIRAC WMS
  - Middle layer between the jobs and the various resources
• JSUB
  - Self-developed, general-purpose tool for massive job submission and management (see the sketch below)
• Production system
  - Planned next, to highly automate MC simulation submission
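For illustration, a minimal sketch of the kind of bulk submission that tools such as JSUB and the production system automate, written against the standard DIRAC job API. The executable name, argument convention and job count are placeholders; this is not the actual JSUB implementation.

    from DIRAC.Core.Base import Script
    Script.parseCommandLine(ignoreErrors=True)

    from DIRAC.Interfaces.API.Dirac import Dirac
    from DIRAC.Interfaces.API.Job import Job

    dirac = Dirac()
    job_ids = []

    # Submit a small batch of simulation jobs, one per random seed
    for seed in range(100):
        job = Job()
        job.setName("cepc_sim_%04d" % seed)
        job.setExecutable("run_sim.sh", arguments=str(seed))  # placeholder wrapper script
        job.setInputSandbox(["run_sim.sh"])
        job.setOutputSandbox(["std.out", "std.err"])
        job.setCPUTime(86400)  # one day of CPU time
        result = dirac.submitJob(job)
        if result["OK"]:
            job_ids.append(result["Value"])

    print("Submitted %d jobs" % len(job_ids))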


Data management
• Central Storage Element based on StoRM to share experiment data among the sites
  - Lustre /cefs as its backend
  - The frontend provides SRM, HTTP and xrootd access
• File catalogue, metadata catalogue and dataset catalogue will provide a global view of the CEPC datasets
  - DFC (DIRAC File Catalogue) could be one of the solutions
• Data movement system among sites
  - DIRAC solution: Transformation System + Request Management System + FTS
  - The prototype is ready (see the sketch below)
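To illustrate the Transformation-based data movement mentioned above, here is a minimal sketch assuming the standard DIRAC Transformation System client. The transformation name and the SE names "IHEP-STORM" and "QMUL-SE" are placeholders; the actual prototype configuration is not described on the slides.

    from DIRAC.Core.Base import Script
    Script.parseCommandLine(ignoreErrors=True)

    from DIRAC.TransformationSystem.Client.Transformation import Transformation

    t = Transformation()
    t.setTransformationName("CEPC_replication_example")   # must be unique
    t.setType("Replication")
    t.setDescription("Replicate CEPC MC output from IHEP to a remote SE")
    t.setLongDescription("Data movement prototype: Transformation + RMS + FTS")
    t.setPlugin("Broadcast")            # copy every attached file to all target SEs
    t.setSourceSE(["IHEP-STORM"])       # placeholder source SE
    t.setTargetSE(["QMUL-SE"])          # placeholder target SE
    t.addTransformation()               # register the transformation
    t.setStatus("Active")
    t.setAgentType("Automatic")         # let the agents create the requests / FTS jobs
    # Input files would then be attached to the transformation,
    # e.g. explicitly or via a metadata query on the file catalogue.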


Monitoring
• To ensure the stability of the sites, site monitoring has been developed and is in production
  - Provides a global view of site status
  - Sends regular SAM tests to the sites (see the sketch below)
  - Collects site status information
  - Takes action when sites fail
• DIRAC system and job monitoring use ElasticSearch (ES) and Kibana
• More monitoring can be considered on top of the ES and Grafana infrastructure
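A minimal sketch of the kind of lightweight test job that SAM-style monitoring can send to each site. The site names are taken from the resources table above; the trivial payload stands in for the real test suite, and this is not the actual monitoring implementation.

    from DIRAC.Core.Base import Script
    Script.parseCommandLine(ignoreErrors=True)

    from DIRAC.Interfaces.API.Dirac import Dirac
    from DIRAC.Interfaces.API.Job import Job

    # Active sites from the resources table
    SITES = ["CLOUD.IHEPCLOUD.cn", "CLUSTER.IHEP-Condor.cn",
             "GRID.QMUL.uk", "CLUSTER.IPAS.tw", "CLUSTER.SJTU.cn"]

    dirac = Dirac()
    for site in SITES:
        job = Job()
        job.setName("site_test_%s" % site)
        # A trivial payload standing in for a real SAM test (CVMFS check, software check, ...)
        job.setExecutable("/bin/hostname")
        job.setDestination(site)     # pin the test job to one site
        job.setCPUTime(300)
        result = dirac.submitJob(job)
        print(site, result.get("Value", result.get("Message")))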


Central services at IHEP
• The basic services are in place and running stably
  - DIRAC (WMS, DMS)
  - Central storage server (1.1 PB)
  - CVMFS and Squid servers
  - Grid CA and VOMS services for user authentication on the grid
  - Monitoring


Current production status
• Distributed computing has been carrying the full load of CEPC massive simulation for the last four years
• 2015~2018: about 3 million jobs and about 2 PB of data exchanged

On-going research (NEAR FUTURE)



Multi-core support
• Parallelism is being considered in the future CEPC software
  - Exploit multi-core CPU architectures and improve performance
  - Decrease the memory usage per core
• The study of multi-core scheduling in the distributed computing system started in 2017
• The prototype is successful
  - Multi-core scheduling with different pilot modes has been developed (see the sketch below)
  - Scheduling efficiency has been studied and improved
  - Real CEPC use cases can be tuned when they are ready
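A minimal sketch, assuming the standard DIRAC job API, of how a multi-core payload can request several processors so that multi-core-aware pilots can match it. The processor count and executable are placeholders.

    from DIRAC.Core.Base import Script
    Script.parseCommandLine(ignoreErrors=True)

    from DIRAC.Interfaces.API.Dirac import Dirac
    from DIRAC.Interfaces.API.Job import Job

    job = Job()
    job.setName("cepc_multicore_test")
    job.setExecutable("run_parallel_sim.sh")   # placeholder multi-threaded payload
    job.setNumberOfProcessors(8)               # request 8 cores on one worker node
    print(Dirac().submitJob(job))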


Light-weight virtualization: Singularity
• Singularity, based on container technology, provides
  - Good OS portability and isolation in a very lightweight way
  - Enough flexibility for sites to choose their own OS
• Singularity has been embedded in the current system transparently
  - Singularity can be started by the pilots to provide the OS the payload wants (see the sketch below)
  - Sites are still allowed to add another container layer outside
• Tests have been done successfully on the IHEP local HTCondor site
• Ready for more usage to come
  [Slide diagram: worker node > (site container) > pilot job > Singularity > payload]
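For illustration, a minimal sketch of how a pilot-style wrapper can start the payload inside a Singularity container so that the site OS no longer matters. The image path, bind mounts and payload command are placeholders; the real integration is done inside the DIRAC pilot, not by a standalone script like this.

    import subprocess

    # Placeholder container image and payload; a real setup would typically use an
    # image distributed via CVMFS so that no large downloads happen on the worker node.
    IMAGE = "/cvmfs/unpacked.example.org/cepc/slc6-worker"   # hypothetical image path
    PAYLOAD = ["bash", "run_sim.sh"]

    cmd = [
        "singularity", "exec",
        "--bind", "/cvmfs",      # make CVMFS visible inside the container
        "--bind", "/scratch",    # placeholder scratch area for job data
        IMAGE,
    ] + PAYLOAD

    subprocess.check_call(cmd)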


Commercial cloud integration
• Commercial clouds, with their practically unlimited resources, would be a good potential resource for urgent and CPU-intensive CEPC tasks
• Clouds can be well integrated into the current distributed computing system and used in an elastic way
  - Cloud resources can be occupied and released in real time according to the actual CEPC job requirements (see the sketch below)
• With the support of the Amazon AWS China region, trials have been done successfully with CEPC simulation jobs
  - Well connected, ran well, and the results were returned to IHEP
  - The cost pattern needs further study, depending on future usage
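A minimal sketch of the elastic occupy-and-release pattern, written against the public boto3 EC2 API. This is my own illustration, not the actual integration used in the trials; the AMI ID, instance type, region and the function reporting the number of waiting CEPC jobs are all placeholders.

    import boto3

    ec2 = boto3.client("ec2", region_name="cn-north-1")   # AWS China region (placeholder)

    AMI_ID = "ami-0123456789abcdef0"   # placeholder worker-node image with a pilot pre-installed
    INSTANCE_TYPE = "c5.xlarge"

    def waiting_cepc_jobs():
        """Placeholder: query the DIRAC task queue for the number of waiting jobs."""
        return 0

    def scale_out(n):
        """Start n cloud worker nodes; each one boots a pilot that pulls CEPC jobs."""
        if n > 0:
            ec2.run_instances(ImageId=AMI_ID, InstanceType=INSTANCE_TYPE,
                              MinCount=1, MaxCount=n)

    def scale_in(instance_ids):
        """Release idle cloud worker nodes as soon as they are no longer needed."""
        if instance_ids:
            ec2.terminate_instances(InstanceIds=instance_ids)

    # Very simple policy: one instance per batch of 100 waiting jobs
    scale_out(waiting_cepc_jobs() // 100)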


HPC federation
• HPC is becoming more and more important for data processing
  - HTC (High Throughput Computing) is the main type of resource for HEP today
  - Many HPC computing centers are being built up, e.g. at IHEP and JINR
• An HPC federation is planned, to build a "grid" of HPC centers
  - Integrate HTC and HPC resources as a whole
• A preliminary study has been done with GPUs
  - With "tags" in DIRAC, GPU and CPU jobs can easily find their proper resources (see the sketch below)
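A minimal sketch, assuming the DIRAC job tag mechanism, of how a GPU job can be routed to GPU resources. The executable is a placeholder, and sites or queues advertising a matching "GPU" tag are assumed to be configured.

    from DIRAC.Core.Base import Script
    Script.parseCommandLine(ignoreErrors=True)

    from DIRAC.Interfaces.API.Dirac import Dirac
    from DIRAC.Interfaces.API.Job import Job

    job = Job()
    job.setName("cepc_gpu_test")
    job.setExecutable("run_gpu_training.sh")   # placeholder GPU payload
    job.setTag(["GPU"])   # only resources advertising the "GPU" tag will match this job
    print(Dirac().submitJob(job))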

Technology evolution and cooperation (FUTURE)



Scaling up
• Even at the scale of CERN today, meeting the needs of HL-LHC would still require a 50~100x increase in resources with current hardware
• From past experience, scaling up is not just an issue of increasing capacity, but more about
  - Manpower and costs
  - Performance and efficiency of data access and processing
  - Complexity of resource provisioning and management
• Will future evolutions help?
  - Technology
  - Infrastructure


Technology evolution
• Processing huge data volumes is also a challenge faced by industry, especially internet companies
  - Google search: 98 PB; the Google internet archive: 18 EB
• The convergence of big data and deep learning with HPC is the technology trend in industry
  - Creating a promising future for the exabyte era
• HEP is trying to catch up
  - Both in software and in computing
  - These trends are visible at the CHEP, WLCG, ACAT and HSF workshops


Infrastructure evolution
• Distributed computing can remain the main way to organize resources
  - For HL-LHC and CEPC
• It will no longer be a simple grid system, but
  - A mixture of all possible resources, highly heterogeneous
  - Constantly changing with technology evolution in networking, storage, software, analysis techniques, ...
• Can supercomputing centers become a dominant resource in the future? CMS has started looking into it
  - "A single exascale system could process the whole HL-LHC with no R&D or model changes"


Unexpected evolutions
• Quantum computing is coming closer to us
  - Universal hardware in products is expected in ~10 years (Intel)
• CERN openlab held its first workshop on quantum computing for high-energy physics on November 5-6, 2018
  - "A breakthrough in the number of qubits could emerge at any time, although there are challenges on different levels"


International cooperation with HSF
• To face these challenges, international cooperation is needed more than ever
• HSF (the HEP Software Foundation) provides a platform for expertise across the HEP communities to work together on future software and computing
• IHEP cooperates with the international communities and benefits from the common efforts through HSF
  - IHEP was invited to give two plenary talks at HSF workshops


Summary
• The CEPC distributed computing system is working well for the current CEPC R&D tasks
• New technology studies have been carried out to meet the requirements of CEPC in the near future
• Looking further ahead, we will cooperate closely with the HEP community and follow the technology trends

Thank you!
