The Large Scale Data Management and Analysis Project


The Large Scale Data Management and Analysis Project (LSDMA)
Dr. Andreas Heiss, SCC, KIT
Steinbuch Centre for Computing (SCC)
KIT – University of the State of Baden-Wuerttemberg and National Research Center of the Helmholtz Association
www.kit.edu

Overview
- Introducing KIT and SCC
- Big Data
- Infrastructures at KIT: GridKa and the Large Scale Data Facility (LSDF)
- Large Scale Data Management and Analysis (LSDMA)
- Summary and Outlook
September 12, 2013, Dr. Andreas Heiss

Introducing KIT
KIT is both
- a state university with research and teaching, and
- a research center of the Helmholtz Association with program-oriented provident research
Objectives: research, teaching, innovation
Numbers
- 24,000 students
- 9,400 employees
- 3,200 PhD researchers
- 370 professors
- 790 million EUR annual budget in 2012

Introducing Steinbuch Centre for Computing
- Provisioning and development of IT services for KIT and beyond
- R&D: High Performance Computing, Grids and Clouds, Big Data
- ~200 employees in total
  - 50% scientists
  - 50% technicians, administrative personnel and student assistants
- Named after Karl Steinbuch, professor at Karlsruhe University and creator of the term "Informatik" (German term for computer science)

Big Data: comparing Google Trends
[Figure: Google Trends curves for the search terms "Cloud computing", "Big Data" and "Grid Computing", 2010 to 2013]

Big Data 2000 years ago
"In those days Caesar Augustus issued a decree that a census should be taken of the entire Roman world." (Luke 2:1)
- clearly defined purpose for collecting data: tax lists of all tax payers
- data collection: distributed, analog, time-consuming
- distributed storage of data
- tedious data aggregation

Big Data today
One buzzword ... various challenges!
Industry
- Data mining
- Business intelligence
- Get additional information from (often) already existing data
- Data aggregation
- Typically O(10) or O(100) TB
- New field to make money: products, services
- Market shared between some 'big players' and many startups / spin-offs
Science
- Handling huge amounts of data (petabytes)
- Distributed data sources and/or storage
- (Global) data management
- High throughput
- Data preservation

Definition of Data Science
[Figure: Venn diagram by Drew Conway (IA Ventures)]

Big Data in science: LHC at CERN
Goals
- search for the origin of mass
- understanding the early state of the universe
LHC
- went live in 2008
- four detectors
- main discovery until now: a Higgs boson
- O(1000) physicists distributed worldwide
Trigger chain (from the diagram)
- 40 MHz (1,000 TB/s equivalent)
- Level 1 (hardware): 100 kHz (100 GB/s digitized)
- Level 2 (online farm): 5 kHz (5 GB/s)
- Level 3 (online farm): 300 Hz (250 MB/s), delivered to the worldwide LHC community
2012: 25 PB of data taken
Goal for 2015: 500 Hz at Level 3
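
The data reduction implied by the trigger chain can be worked through in a few lines of Python. This is only a sketch: it assumes the rates as read off the slide's diagram (40 MHz, 100 kHz, 5 kHz, 300 Hz, and 250 MB/s after Level 3), and the names `STAGE_RATES_HZ`, `reduction_factors`, `overall_reduction` and `implied_event_size_mb` are illustrative, not part of any LHC software.

```python
# Event rates at each point of the trigger chain, in Hz, as read off the slide.
# Labels and names are illustrative assumptions, not LHC software identifiers.
STAGE_RATES_HZ = [
    ("collisions", 40e6),
    ("after Level 1 (hardware)", 100e3),
    ("after Level 2 (online farm)", 5e3),
    ("after Level 3 (online farm)", 300.0),
]

# Data rate after Level 3, in MB/s, as read off the slide.
OUTPUT_BANDWIDTH_MB_S = 250.0


def reduction_factors(stages):
    """Return (label, factor) pairs giving the rate reduction of each stage."""
    factors = []
    for (_, rate_in), (label, rate_out) in zip(stages, stages[1:]):
        factors.append((label, rate_in / rate_out))
    return factors


def overall_reduction(stages):
    """Ratio of the input rate to the rate after the last stage."""
    return stages[0][1] / stages[-1][1]


def implied_event_size_mb(bandwidth_mb_s, rate_hz):
    """Average size per recorded event implied by a bandwidth and an event rate."""
    return bandwidth_mb_s / rate_hz


if __name__ == "__main__":
    for label, factor in reduction_factors(STAGE_RATES_HZ):
        print(f"{label}: rate reduced by a factor of {factor:.1f}")
    print(f"overall reduction: {overall_reduction(STAGE_RATES_HZ):.0f}")
    size = implied_event_size_mb(OUTPUT_BANDWIDTH_MB_S, STAGE_RATES_HZ[-1][1])
    print(f"implied event size after Level 3: {size:.2f} MB")
```

Under these assumptions, only about one collision in 130,000 is kept, and the recorded events average a little under 1 MB each.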

Worldwide LHC Computing Grid – Hierarchical Tier Structure
Hierarchy of services, response times and availability:
- 1 Tier-0 center at CERN
  - copy of all raw data (tape)
  - first pass reconstruction
- 11 Tier-1 centers worldwide
  - 2 to 3 distributed copies of raw data
  - large-scale data reprocessing
  - storage of simulated data from Tier-2 centers
  - tape storage
- ~150 Tier-2 centers worldwide
  - user analysis
  - simulations
Hierarchical model relaxed
[Figure: hierarchy vs. mesh topology, courtesy of Ian Bird, CERN]

Big Data in science: DNA sequencing
[Figure: only the labels "GB" and "MB" are recoverable]

Big Data in science: synchrotron light sources
[Figures: synchrotron light source (source: Wikipedia); ANKA @ KIT]

Big Data in science: synchrotron light sources
Dectris Pilatus 6M
- 2463 x 2527 pixels
- 7 MB images
- 25 frames/s
- 175 MB/s
- several TB/day
Data no longer fits on a USB drive
Users are usually not affiliated with the synchrotron lab
Users from physics, biology, chemistry, material sciences, …
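
The detector figures on this slide follow from simple arithmetic, sketched below in Python. The image size and frame rate are taken from the slide; the `duty_cycle` parameter and the function names are assumptions added for illustration, since the slide does not say what fraction of a day the detector actually records. Decimal units (1 TB = 1,000,000 MB) are assumed.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def sustained_rate_mb_s(image_size_mb, frames_per_s):
    """Data rate in MB/s while the detector is recording."""
    return image_size_mb * frames_per_s


def daily_volume_tb(rate_mb_s, duty_cycle=1.0):
    """Data volume per day in TB (decimal units).

    duty_cycle is the fraction of the day spent recording; it is a
    hypothetical parameter, not a number given on the slide.
    """
    if not 0.0 <= duty_cycle <= 1.0:
        raise ValueError("duty_cycle must be between 0 and 1")
    return rate_mb_s * SECONDS_PER_DAY * duty_cycle / 1e6


if __name__ == "__main__":
    rate = sustained_rate_mb_s(image_size_mb=7, frames_per_s=25)
    print(f"sustained rate: {rate} MB/s")
    print(f"continuous recording: {daily_volume_tb(rate):.2f} TB/day")
    print(f"recording 25% of the time: {daily_volume_tb(rate, 0.25):.2f} TB/day")
```

Continuous recording at 175 MB/s would give roughly 15 TB/day, so the slide's "several TB/day" is consistent with a detector that records for only part of the day.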

Big Data in science: high throughput imaging
Imaging machines / microscope
- 1 to 100 frames/s
- up to 800 MByte/s
- O(10) TBytes/day
Example: reconstruction of zebrafish early embryonic development

Big Data in science
Data growth is very fast in many research areas: biology, chemistry, earth sciences, …
- Data sets have become too big to take home
- Data rates require dedicated IT infrastructures to record and store the data
- Data analysis requires farms and clusters; single PCs are not sufficient
- Collaborations require distributed infrastructures and networks
- Data management becomes a challenge
- Fewer people with IT experience and IT interest than, e.g., in physics

Definition of Data Science
[Figure: Venn diagram by Drew Conway (IA Ventures), annotated with "Physicist" and "Biologist, chemist, …"]

KIT infrastructures: GridKa
German WLCG Tier-1 center
- supports all LHC experiments + Belle II + several small communities and older experiments
- >10,000 cores
- disk space: 12 PB, tape space: 17 PB
- 6 x 10 Gbit/s network connectivity
- ~15% of LHC data permanently stored at GridKa
- services: file transfer, workload management, file catalog, …
Global Grid User Support (GGUS)
- service development and operation of the trouble ticket system for the worldwide LHC Grid
Annual international GridKa School
- 2013: ~140 participants from 19 countries
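
To put the network and storage figures in relation, the following Python sketch estimates how long moving a given data volume over the 6 x 10 Gbit/s connectivity would take. It is a best-case estimate only: the `efficiency` parameter and the function name are illustrative assumptions, and real transfers share the links with other traffic and do not reach the nominal bandwidth. Decimal units (1 PB = 10^15 bytes) are assumed.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def transfer_time_days(volume_pb, links, gbit_per_s_per_link, efficiency=1.0):
    """Days needed to move volume_pb petabytes over the given links.

    efficiency is the usable fraction of the nominal bandwidth; it is a
    hypothetical parameter, not a figure from the slide.
    """
    if not 0.0 < efficiency <= 1.0:
        raise ValueError("efficiency must be in (0, 1]")
    bits = volume_pb * 1e15 * 8
    bits_per_s = links * gbit_per_s_per_link * 1e9 * efficiency
    return bits / bits_per_s / SECONDS_PER_DAY


if __name__ == "__main__":
    print(f"1 PB, full bandwidth: {transfer_time_days(1, 6, 10):.2f} days")
    print(f"12 PB, full bandwidth: {transfer_time_days(12, 6, 10):.1f} days")
    print(f"12 PB, 50% efficiency: {transfer_time_days(12, 6, 10, 0.5):.1f} days")
```

Even under ideal conditions, one petabyte takes about a day and a half to move, which is one reason data placement matters at this scale.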

GridKa experiences
- evolving demands and usage patterns
- no common workflows
- hardware is commodity, software is not
- hierarchical storage with tape is challenging
- data access and I/O is the central issue
- different users / user communities have different data access methods and access patterns
- on-site experiment representation is highly useful

KIT infrastructure: Large Scale Data Facility
Main goals
- provision of storage for multiple research groups at KIT and U-Heidelberg
- support of research groups in data analysis
Resources and access
- 6 PB of on-line storage
- 6 PB of archival storage
- 100 GbE connection between LSDF@KIT and U-Heidelberg
- analysis cluster of 58 x 8 cores
- variety of storage protocols
Jointly funded by the Helmholtz Association and the state of Baden-Württemberg

LSDF experiences
- high demand for storage, analysis and archival
- research groups vary in
  - research topics (from genetic sequencing to geophysics)
  - size
  - IT expertise
  - need for services and protocols
- important needs common to many user groups
  - sharing data with other groups
  - data security and preservation
  - 'consulting'
- many small groups depend on LSDF

The Large Scale Data Management and Analysis (LSDMA) project: facts and figures
- Helmholtz portfolio extension
- initial project duration: 2012-2016
- partners: (not preserved in this transcript)
- project coordinator: Achim Streit (KIT)
- sustainability: inclusion of activities into the respective Helmholtz program-oriented funding in 2015
- next annual international symposium: September 24th at KIT

Scientific Data Life Cycle

LSDMA: Dual Approach
Data Life Cycle Labs (DLCLs): joint R&D with scientific user communities
- optimization of the data life cycle
- community-specific data analysis tools and services
Data Services Integration Team (DSIT): generic methods R&D
- data analysis tools and services common to several DLCLs
- interface between federated data infrastructures and DLCLs/communities

Selected LSDMA activities (I)
DLCL Energy (KIT, U-Ulm)
- analyzing stereoscopic satellite images with Hadoop to estimate the efficiency of solar energy
- privacy policies for personal energy data
DLCL Key Technologies (KIT, U-Heidelberg, U-Dresden)
- optimization of tomographic reconstruction using data-intensive computing
- visualization for high throughput microscopy
DLCL Health (FZJ)
- workflow support for data-intensive parameter studies
- efficient metadata administration and indexing

Selected LSDMA activities (II)
DLCL Earth & Environment (KIT, DKRZ)
- MongoDB for data and metadata of meteorological satellite data
- data replication within the European EUDAT project using iRODS
DLCL Structure of Matter (DESY, GSI, HTW)
- development of a portal for PETRA-III data
- determining the computing requirements for FAIR data analysis
DSIT (all partners)
- federated identity management
- archive
- federated storage (e.g. dCache)
- …

LSDMA Challenges
Communities differ in
- previous knowledge
- level of specification of the data life cycle
- tools and services used
Within communities
- focus on data analysis
- high fluctuation of computing experts
- running tools and services
Needs driven by
- increasing amount of data
- cooperation between groups
- policies
Lessons learned
- interoperable AAI crucial
- data privacy very challenging, both legally and technically
- communities need evolution, not revolution
- needs can be very specific
- open access/data
- long-term preservation

Summary and Outlook
- data facilities and R&D are very important for KIT
- extensive experience at GridKa and LSDF
- wide variety of user communities, often with very specific needs
- interoperable AAI and privacy are crucial topics
- today, data is important to basically all research topics
- more projects on state, national and international levels to come
- LSDMA: research on generic data methods, workflows and services, plus community-specific support and R&D