National Research Center “Kurchatov Institute”
The Laboratory of Big Data Technologies for mega-science projects
Data management in heterogeneous metadata storage and access infrastructures
MARINA GOLOSOVA
Outline
• ETL subsystem
• Development framework for ETL subsystems
29/09/2017 Marina Golosova, NEC 2017 2/8
ETL subsystem
• ETL = (E)xtract, (T)ransform, (L)oad
[Diagram: one Data source; three Extract arrows, each leading to its own Transformation and on to Data sink, Data sink 1 and Data sink 2; the Load step is labeled on the first chain]
ETL subsystem
• ETL = (E)xtract, (T)ransform, (L)oad
[Diagram: one Data source; two Extract arrows; three Transformation steps leading to Data sink, Data sink 1 and Data sink 2; the Load step is labeled on the first chain]
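The extract, transform, load flow with several sinks can be illustrated with a minimal sketch. It assumes records are plain dictionaries and sinks are in-memory lists; the names `extract`, `run_etl`, `sink_1` and `sink_2` are hypothetical and do not come from the slides.

```python
from typing import Callable, Dict, Iterable, Iterator, List, Tuple

Record = Dict[str, object]
# A pipeline here is a (transform function, sink list) pair; both are illustrative.
Pipeline = Tuple[Callable[[Record], Record], List[Record]]


def extract(source: Iterable[Record]) -> Iterator[Record]:
    """Extract: read raw records from a data source (here, any iterable)."""
    for record in source:
        yield record


def run_etl(source: Iterable[Record], pipelines: List[Pipeline]) -> None:
    """Extract each record once, then transform and load it for every sink."""
    for record in extract(source):
        for transform, sink in pipelines:
            sink.append(transform(record))  # Load: append to the in-memory sink


source = [{"id": 1, "name": "alpha"}, {"id": 2, "name": "beta"}]
sink_1: List[Record] = []
sink_2: List[Record] = []

run_etl(source, [
    (lambda r: {"id": r["id"]}, sink_1),               # keep only the identifier
    (lambda r: {"label": str(r["name"]).upper()}, sink_2),  # derive a label
])
```

Here a single extraction pass feeds two transformations, so the data source is read once no matter how many sinks are attached.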
Subject area (non)specific components
[Diagram: Data source → Extractor → Transformation (×2) → Loader (×2) → Data sink 1, Data sink 2; coordinating components: ETL supervisor, Parallelization coordinator]
Diagram annotations:
• Failure handling:
    • restart process
    • reprocess data
• Ensure data delivery
• Run every X min
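The failure-handling behaviour named on this slide (restart the process, reprocess the data) can be sketched as a small retry loop. This is only an illustration under stated assumptions: the stage is modelled as an ordinary Python callable rather than a separate process, and `supervise`, `max_restarts` and `flaky_transform` are hypothetical names.

```python
from typing import Callable, List, Tuple


def supervise(stage: Callable[[List[int]], List[int]],
              batch: List[int],
              max_restarts: int = 3) -> Tuple[List[int], int]:
    """Run `stage` on `batch`; on failure, restart it and reprocess the same batch.

    Returns the stage output and the number of restarts that were needed.
    Re-raises the last error once `max_restarts` is exceeded.
    """
    restarts = 0
    while True:
        try:
            return stage(batch), restarts
        except Exception:
            restarts += 1
            if restarts > max_restarts:
                raise


# A simulated stage that crashes on its first two invocations.
calls = {"n": 0}


def flaky_transform(batch: List[int]) -> List[int]:
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("simulated crash")
    return [x * 2 for x in batch]


result, restarts = supervise(flaky_transform, [1, 2, 3])
```

Because the whole batch is handed to the stage again after a crash, the stage should be safe to rerun on the same input.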
ETL subsystem development framework
• Tasks:
    • process supervision (run, stop, restart, …)
    • data delivery between processes (exactly once, …)
    • parallelization management
• Project:
    • a framework based on Apache Kafka:
        • data delivery via Kafka topics and producer/consumer API
        • process management with Kafka Streams library
        • parallelization management via topic partitioning and Kafka Streams application configuration
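Parallelization through topic partitioning rests on one idea: every message key maps deterministically to a partition, so independent workers can each own a partition. The sketch below models that idea in plain Python without a Kafka client. The CRC32 hash is a stand-in chosen for illustration and is not Kafka's own partitioner; `partition_for` and `split_by_partition` are hypothetical names.

```python
import zlib
from collections import defaultdict
from typing import Dict, List, Tuple

Message = Tuple[str, int]  # (key, value); the shape is an assumption


def partition_for(key: str, num_partitions: int) -> int:
    """Map a key to a partition index with a stable hash (stand-in for Kafka's)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def split_by_partition(messages: List[Message],
                       num_partitions: int) -> Dict[int, List[Message]]:
    """Group messages by partition, preserving arrival order inside each one."""
    partitions: Dict[int, List[Message]] = defaultdict(list)
    for key, value in messages:
        partitions[partition_for(key, num_partitions)].append((key, value))
    return dict(partitions)


messages = [("dataset-A", 1), ("dataset-B", 2), ("dataset-A", 3), ("dataset-C", 4)]
partitions = split_by_partition(messages, num_partitions=4)
```

All messages with the same key end up in the same partition, so a worker that owns that partition sees them in order, while other keys can be processed in parallel elsewhere.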
DKB ETL subsystem
• Kafka extensions:
    • topology constructor (config files instead of Java code)
    • external process adapters for Processing/Source/Sink (allow running any executable as a topology node)
    • primitive data transfer protocol (to be improved)
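The idea of an external process adapter, running any executable as a topology node, can be sketched as follows. The slides do not specify the data transfer protocol, so this example assumes the simplest possible one: one record per line on the child's standard input, one result per line on its standard output. The function `run_external_node` and the `UPPERCASE_NODE` command are hypothetical and are not part of DKB or Kafka.

```python
import subprocess
import sys
from typing import List, Sequence


def run_external_node(command: Sequence[str], records: List[str]) -> List[str]:
    """Feed records to an external executable and collect its output lines.

    Assumed protocol (not from the slides): newline-delimited records on stdin,
    newline-delimited results on stdout. A non-zero exit code raises an error.
    """
    payload = "".join(record + "\n" for record in records)
    completed = subprocess.run(
        list(command),
        input=payload,
        capture_output=True,
        text=True,
        check=True,
    )
    return completed.stdout.splitlines()


# Any executable could stand here; a tiny Python one-off keeps the sketch portable.
UPPERCASE_NODE = [
    sys.executable,
    "-c",
    "import sys\nfor line in sys.stdin:\n    sys.stdout.write(line.upper())",
]

output = run_external_node(UPPERCASE_NODE, ["task-1", "task-2"])
```

The adapter only cares about the line-based contract, so the processing step can be written in any language, as long as it reads from stdin and writes to stdout.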
Acknowledgements
Many thanks to the wonderful people who helped me through the work:
• Maria Grigorieva
• Alexei Klimentov
• Torre Wenaus
• Eugene Ryabinkin
• ATLAS collaboration
The work was supported by the Russian Ministry of Science and Education under contract № 14.Z50.31.0024.
- Slides: 8