The SCAPE Platform Overview
Rainer Schmidt
SCAPE Training Event, September 16th–17th, 2013, The British Library

Goal of the SCAPE Platform
• A hardware and software platform to support scalable preservation in terms of computation and storage.
• Employs a scale-out architecture to support preservation activities against large amounts of data.
• Integrates existing tools, workflows, and data sources and sinks.
• A data center service providing a scalable execution and storage backend for different object management systems.
• Based on a minimal set of defined services for processing tools and/or queries close to the data.

Underlying Technologies
• The SCAPE Platform is built on top of existing data-intensive computing technologies.
• The reference implementation leverages the Hadoop software stack (HDFS, MapReduce, Hive, …).
• Virtualization and packaging model for dynamic deployment of tools and environments:
  • Debian packages and IaaS support.
• Repository integration and services:
  • Data/Storage Connector API (Fedora and Lily);
  • Object Exchange Format (METS/PREMIS representation).
• Workflow modeling, translation, and provisioning:
  • Taverna Workbench and Component Catalogue;
  • Workflow Compiler and Job Submission Service.

Architectural Overview (Core)
[Architecture diagram, the focus of this talk: a Component Catalogue exposing a Component Lookup API and a Component Registration API, connected to the Workflow Modeling Environment.]

Hadoop Overview

The Framework
• Open-source software framework for large-scale, data-intensive computations running on large clusters of commodity hardware.
• Derived from Google's File System (GFS) and MapReduce publications.
• Hadoop = MapReduce + HDFS
• MapReduce: programming model (Map, Shuffle/Sort, Reduce) and execution environment (illustrated below).
• HDFS: virtual distributed file system overlay on top of the local file systems.
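To make the Map, Shuffle/Sort, and Reduce phases concrete, here is a minimal MapReduce sketch in the spirit of the classic word count: it tallies MIME types from a text file. The class names and the tab-separated "path, mimetype" input format are illustrative assumptions, not SCAPE code.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

/** Counts MIME types from lines of the form "path<TAB>mimetype" (assumed format). */
public class MimeTypeCount {

  // Map phase: emit (mimetype, 1) for every input record.
  public static class MimeMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);

    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
        throws IOException, InterruptedException {
      String[] fields = line.toString().split("\t");
      if (fields.length == 2) {
        ctx.write(new Text(fields[1]), ONE);
      }
    }
  }

  // Reduce phase: the framework has already grouped (shuffled/sorted)
  // the values by key, so we only need to sum them.
  public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text mimeType, Iterable<IntWritable> counts, Context ctx)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable c : counts) {
        sum += c.get();
      }
      ctx.write(mimeType, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "mime-type count");
    job.setJarByClass(MimeTypeCount.class);
    job.setMapperClass(MimeMapper.class);
    job.setCombinerClass(SumReducer.class); // local pre-aggregation before the shuffle
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Note how the user code contains no parallelization logic: the framework splits the input, schedules mappers near the data, and handles the shuffle.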

Programming Model
• Designed for a write-once, read-many-times access model.
• Data I/O is handled via HDFS (see the sketch after this list).
• Data is divided into blocks (typically 64 MB) that are distributed and replicated over the data nodes.
• Parallelization logic is strictly separated from the user program.
• Automated data decomposition and communication between processing steps.
• Applications benefit from built-in support for data locality and fail-safety.
• Applications scale out on big clusters, processing very large data volumes.
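A minimal sketch of reading a file through the HDFS client API; the namenode URI and the path are assumptions. The application sees a single logical file, while the framework transparently fetches the blocks from whichever data nodes hold replicas.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Connect to a (hypothetical) namenode, which holds the file system
    // metadata; block data is streamed directly from the data nodes.
    FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);
    try (FSDataInputStream in = fs.open(new Path("/data/sample.txt"));
         BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
      String line;
      while ((line = reader.readLine()) != null) {
        System.out.println(line);
      }
    }
  }
}
```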

Cluster Set-up

Platform Deployment
• There is no prescribed deployment model:
  • private, institutionally shared, or external data center.
• Possible to deploy on “bare metal” or using virtualization and cloud middleware.
• Platform environment packaged as a VM image:
  • automated and scalable deployment;
  • presently supporting Eucalyptus (and AWS) clouds.
• SCAPE provides two shared Platform instances:
  • a stable, non-virtualized data-center cluster;
  • a private-cloud based development cluster;
  • partitioning and dynamic reconfiguration.

Deploying Environments
• IaaS enables packaging and dynamic deployment of (complex) software environments:
  • but requires a complex virtualization infrastructure.
• Data-intensive technology is able to deal with a constantly varying number of cluster nodes:
  • node failures are expected and handled automatically;
  • the system can grow/shrink on demand.
• A network-attached storage solution can be used as a data source:
  • but it does not meet the scalability and performance needs for computation.
• SCAPE Hadoop clusters:
  • Linux + preservation tools + SCAPE Hadoop libraries;
  • optionally, higher-level services (repository, workflow, …).

Using the Cluster

• Wrapping sequential tools:
  • using a wrapper script (Hadoop Streaming API);
  • PT's generic Java wrapper allows one to use pre-defined patterns (based on the toolspec language);
  • works well for processing a moderate number of files, e.g. applying migration tools or FITS (a sketch of this pattern follows after this list).
• Writing a custom MapReduce application:
  • much more powerful and usually performs better;
  • suitable for more complex problems and file formats, such as web archives.
• Using a high-level language like Hive or Pig:
  • very useful for performing analysis of (semi-)structured data, e.g. characterization output.
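As a sketch of the wrapping pattern (my illustration, not PT's actual wrapper code): a mapper receives one HDFS path per input line, stages the file to local disk, and shells out to a sequential tool. The choice of tool (`file`) and the output convention are assumptions.

```java
import java.io.File;
import java.io.IOException;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

/** Runs a pre-installed command-line tool on each input file path. */
public class ToolWrapperMapper extends Mapper<LongWritable, Text, Text, Text> {

  @Override
  protected void map(LongWritable offset, Text pathLine, Context ctx)
      throws IOException, InterruptedException {
    String hdfsPath = pathLine.toString().trim();

    // Stage the payload to node-local disk: most preservation tools
    // cannot read from HDFS directly.
    FileSystem fs = FileSystem.get(ctx.getConfiguration());
    File local = File.createTempFile("payload-", ".bin");
    local.delete(); // let copyToLocalFile create the file
    fs.copyToLocalFile(new Path(hdfsPath), new Path(local.getAbsolutePath()));

    // Invoke the sequential tool; its stdout becomes the map output value.
    Process p = new ProcessBuilder("file", "--brief", local.getAbsolutePath())
        .redirectErrorStream(true)
        .start();
    String output = new String(p.getInputStream().readAllBytes()).trim();
    p.waitFor();
    local.delete();

    ctx.write(new Text(hdfsPath), new Text(output));
  }
}
```

In the real toolwrapper the command line comes from a toolspec document rather than being hard-coded, and input/output streaming is supported as described later in this deck.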

Available Tools
• Preservation tools and libraries are pre-packaged so they can be deployed automatically on the cluster nodes:
  • SCAPE Debian packages;
  • supporting the SCAPE Tool Specification Language.
• MapReduce libraries for processing large container files:
  • for example, METS and (W)ARC RecordReaders (see the input format sketch after this list).
• Application scripts:
  • based on Apache Hive, Pig, Mahout.
• Software components to assemble complex data-parallel workflows:
  • Taverna and Oozie workflows.
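The container-file libraries work by supplying Hadoop with custom input formats and RecordReaders. As a generic, self-contained illustration of that contract (not the SCAPE code itself), the following input format hands each file to the mapper as a single record, a common pattern for binary payloads:

```java
import java.io.IOException;

import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

/** Delivers each input file as a single (null, file-bytes) record. */
public class WholeFileInputFormat extends FileInputFormat<NullWritable, BytesWritable> {

  @Override
  protected boolean isSplitable(JobContext context, Path file) {
    return false; // never split one file across mappers
  }

  @Override
  public RecordReader<NullWritable, BytesWritable> createRecordReader(
      InputSplit split, TaskAttemptContext context) {
    return new WholeFileRecordReader();
  }

  static class WholeFileRecordReader extends RecordReader<NullWritable, BytesWritable> {
    private FileSplit split;
    private TaskAttemptContext context;
    private final BytesWritable value = new BytesWritable();
    private boolean processed = false;

    @Override
    public void initialize(InputSplit split, TaskAttemptContext context) {
      this.split = (FileSplit) split;
      this.context = context;
    }

    @Override
    public boolean nextKeyValue() throws IOException {
      if (processed) {
        return false;
      }
      // Read the whole file into the value buffer exactly once.
      byte[] contents = new byte[(int) split.getLength()];
      Path file = split.getPath();
      FileSystem fs = file.getFileSystem(context.getConfiguration());
      try (FSDataInputStream in = fs.open(file)) {
        IOUtils.readFully(in, contents, 0, contents.length);
      }
      value.set(contents, 0, contents.length);
      processed = true;
      return true;
    }

    @Override public NullWritable getCurrentKey() { return NullWritable.get(); }
    @Override public BytesWritable getCurrentValue() { return value; }
    @Override public float getProgress() { return processed ? 1.0f : 0.0f; }
    @Override public void close() {}
  }
}
```

A (W)ARC RecordReader follows the same contract but iterates over the records inside one container file instead of returning it whole.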

Sequential Workflows
• In order to run a workflow (or activity) on the cluster, it has to be parallelized first!
• A number of different parallelization strategies exist:
  • the approach is typically determined on a case-by-case basis;
  • it may lead to changes to the activities, the workflow structure, or the entire application.
• Automated parallelization will only work to a certain degree:
  • trivial workflows can be deployed/executed without requiring individual parallelization (the wrapper approach).
• A SCAPE driver program for parallelizing Taverna workflows is provided.
• SCAPE template workflows for different institutional scenarios have been developed.

Parallel Workflows
• Are typically derived from sequential (conceptual) workflows created for a desktop environment (but may differ substantially!).
• Rely on MapReduce as the parallel programming model and Apache Hadoop as the execution environment:
  • data decomposition is handled by the Hadoop framework based on input format handlers (e.g. text, warc, mets-xml, etc.).
• Can make use of a workflow engine (like Taverna or Oozie) for orchestrating complex (composite) processes.
• May include interactions with data management systems (repositories) and sequential (concurrently executed) tools.
• Tool invocations are based on an API or command-line interface and performed as part of a MapReduce application.

MapRed Tool Wrapper

Tool Specification Language
• The SCAPE Tool Specification Language (toolspec) provides a schema to formalize command-line tool invocations (see the sketch after this list).
• Can be used to automate a complex tool invocation (many arguments) based on a keyword (e.g. ps2pdfs).
• Provides a simple and flexible mechanism to define tool dependencies, for example those of a workflow:
  • these can be resolved by the execution system using Linux packages.
• The toolspec is minimalistic and can easily be created for individual tools and scripts.
• Tools provided as SCAPE Debian packages come with a toolspec document by default.
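To give a feel for the format, here is a hedged sketch of a toolspec-style document. Only the `<command>` element and the `${input}`/`${output}` placeholders are taken from the example later in this deck; the surrounding element names and attributes are my assumptions, not the official schema.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical toolspec sketch; element names other than <command>
     are assumptions, not the official SCAPE schema. -->
<tool name="ghostscript">
  <operations>
    <!-- The keyword "ps2pdfs" would stand for the full invocation below. -->
    <operation name="ps2pdfs">
      <!-- ${input}/${output} are substituted by the execution system. -->
      <command>ps2pdf ${input} ${output}</command>
    </operation>
  </operations>
  <dependencies>
    <!-- Resolvable as Linux (Debian) packages. -->
    <package>ghostscript</package>
  </dependencies>
</tool>
```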

MapRed Toolwrapper
• Hadoop provides scalability, reliability, and robustness, supporting the processing of data that does not fit on a single machine.
• Applications must, however, be made compliant with the execution environment.
• Our intention was to provide a wrapper that allows one to execute a command-line tool on the cluster in a way similar to a desktop environment:
  • the user simply specifies the toolspec file, command name, and payload data.
• Supports HDFS references and (optionally) standard I/O streams.
• Supports the SCAPE toolspec to execute pre-installed tools or other applications available via the OS command-line interface.

Hadoop Streaming API
• The Hadoop Streaming API supports the execution of scripts (e.g. bash or python), which are automatically translated into and executed as MapReduce applications.
• Can be used to process data with common UNIX filters using commands like echo, awk, or tr.
• Hadoop is designed to process its input based on key/value pairs, meaning the input data is interpreted and split by the framework:
  • perfect for processing text, but difficult for binary data.
• The streaming API uses streams to read/write from/to HDFS:
  • preservation tools typically do not support HDFS file pointers and/or I/O streaming through stdin/stdout;
  • hence, DP tools are difficult to use with the streaming API.

Suitable Use-Cases
• Use the MapRed Toolwrapper when dealing with (a large number of) single files.
• Be aware that this may not be an ideal strategy; there are more efficient ways to deal with many files on Hadoop (sequence files, HBase, etc.).
• It is, however, practical and sufficient in many cases, as no additional application development is required.
• A typical example is file format migration on a moderate number of files (e.g. 100,000s), which can be included in a workflow with additional QA components.
• Very helpful when the payload is simply too big to be computed on a single machine.

Example – Exploring an Uncompressed WARC
• Unpacked a 1 GB WARC.GZ on a local computer:
  • 2.2 GB unpacked => 343,288 files;
  • `ls` took ~40 s;
  • counting *.html files with `file` took ~4 hrs => 60,000 HTML files.
• Provided the corresponding bash command as a toolspec:
  • <command>if [ "$(file ${input} | awk '{print $2}')" == HTML ]; then echo "HTML"; fi</command>
• Moved the data to HDFS and executed pt-mapred with the toolspec:
  • 236 min on the local file system;
  • 160 min with 1 mapper on HDFS (this was a surprise!);
  • 85 min (2 mappers), 52 min (4), 27 min (8);
  • 26 min with 8 mappers and I/O streaming (also a surprise).

Ongoing Work
• Source project and README on GitHub, presently under openplanets/scape/pt-mapred*.
• Will be migrated to its own repository soon.
• Presently it is required to generate an input file that specifies the input file paths (along with optional output file names).
• TODO: read binary input directly from an input directory path, allowing Hadoop to take advantage of data locality.
• Input/output streaming and piping between toolspec commands have already been implemented.
• TODO: add support for Hadoop sequence files.
• Look into a possible integration with the Hadoop Streaming API.

* https://github.com/openplanets/scape/tree/master/pt-mapred
