zettaTuktu: Big Data Science Tools Meet Learning Object Repositories • David Massart, Ph.D. • European Schoolnet / D.E. Solution / ZettaDataNet • EdReNe Meeting, Athens – April 24, 2017

Big Data • Social network data • Server logs • Satellite imagery • Broadcast audio streams • Banking transactions • MP3s of rock music • Content of web pages • Scans of government documents • GPS trails • Telemetry from automobiles • Financial market data • etc.

Despite Its Diversity, This Data Can Be Characterized By • Variety: Big data is not necessarily structured or its structure can vary • Volume: Big data is big • Velocity: Big data flows at an increased rate

How Variable, How Big, How Fast? • Too variable, big, or fast to be processed by a single machine (at least in an affordable way) • Classical technologies (e.g., relational databases) can only scale up to a certain limit • Going beyond this limit requires tools able to scale out • a.k.a. Distributed Systems • a.k.a. Big Data technologies

Distributing Data and Processes Is Complex • Controlling costs (Hardware: partitioning + replication; Software: open source) • Controlling complexity (Data models: NoSQL, e.g., key/value, JSON; Concurrency: map/reduce, functional programming) • Trade-off: consistency, availability, partition tolerance
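
The "Map/reduce, functional programming" bullet can be illustrated with a word count. This is a minimal single-machine sketch, not a distributed implementation: each inner list stands in for the data held by one node, and the names `MapReduceSketch`, `mapPhase`, `reducePhase`, and `wordCount` are made up for the illustration.

```scala
// Illustrative only: imitates the map/reduce pattern on one machine.
object MapReduceSketch {
  // Map phase: turn one input line into (word, 1) pairs.
  def mapPhase(line: String): List[(String, Int)] =
    line.toLowerCase.split("\\s+").filter(_.nonEmpty).map(word => (word, 1)).toList

  // Shuffle + reduce phase: group the pairs by word and sum the counts.
  def reducePhase(pairs: List[(String, Int)]): Map[String, Int] =
    pairs.groupBy(_._1).map { case (word, group) => word -> group.map(_._2).sum }

  // Each partition stands in for the lines stored on one node (an assumption
  // of this sketch); a real system would run mapPhase on the nodes in parallel.
  def wordCount(partitions: List[List[String]]): Map[String, Int] =
    reducePhase(partitions.flatMap(partition => partition.flatMap(mapPhase)))
}
```

Because `mapPhase` and `reducePhase` are pure functions with no shared state, the map step could run on every partition independently, which is what makes the pattern attractive for scaling out.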

Today The recent advent of Big Data Science has led to the development of a range of powerful open source tools for efficiently extracting, transforming, loading, and analyzing varieties of data (and automating these processes as a whole)

• One-stop data platform • Easy to use • Easy to integrate with existing (big) data tools • Easy to extend • Scalable • Open Source

Tuktu? “Most big data tools have weird animal names. So, to mock that, I did some searching to find the most random animal name I could come up with and I found a list of Inuit words and there it was: Tuktu (Inuit for Caribou)” -- Erik Tromp

Easy To Use • Intuitive graphical interface • Dramatically improves productivity • Focuses on reusability • Comes with a rich library of data generators and processors (e.g., nlp, ml, nosql, social) • Scalable: as easily deployable on a laptop as on multiple servers

Technology • Built around the Play! framework • Runs as a basic HTTP server but can also be invoked by means other than HTTP traffic • Makes heavy use of the Play! Iteratee library and hence also Akka • Written in Scala

Two Basic Types of Actors • Generators: Gather data packets from the external environment (e.g., filesystem, remote location, data creation). As soon as a generator has a complete data packet, it streams it into a series of processors • Processors: Manipulate data packets in various ways. Can be chained together, executed in parallel with a merge step, or can copy data into multiple subsequent processors. This way, Tuktu creates a tree of processors that operate on a single data packet injected by a generator
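
The generator/processor split described above can be sketched with plain Scala functions. This is an assumption-laden simplification: Tuktu's real actors are built on Akka and Play! iteratees, whereas everything here (`PipelineSketch`, `lineGenerator`, `addField`, `keepIf`, `chain`, `fanOut`, `run`) is a hypothetical name invented for the illustration.

```scala
// Hypothetical, simplified model of generators and processors.
object PipelineSketch {
  type Datum = Map[String, Any]
  final case class DataPacket(data: List[Datum])

  // A processor turns one data packet into another.
  type Processor = DataPacket => DataPacket

  // A generator emits packets lazily; this one wraps each input line in a packet.
  def lineGenerator(lines: List[String]): Iterator[DataPacket] =
    lines.iterator.map(line => DataPacket(List(Map("line" -> line))))

  // Processor that adds a field computed from each datum.
  def addField(key: String)(f: Datum => Any): Processor =
    packet => DataPacket(packet.data.map(datum => datum.updated(key, f(datum))))

  // Processor that keeps only the datums satisfying the predicate.
  def keepIf(predicate: Datum => Boolean): Processor =
    packet => DataPacket(packet.data.filter(predicate))

  // Chaining: feed the output of each processor into the next one.
  def chain(processors: Processor*): Processor =
    packet => processors.foldLeft(packet)((current, processor) => processor(current))

  // Fan-out: copy one packet into several independent branches.
  def fanOut(branches: Processor*): DataPacket => List[DataPacket] =
    packet => branches.toList.map(branch => branch(packet))

  // Stream every packet emitted by the generator through the processor.
  def run(generator: Iterator[DataPacket], processor: Processor): List[DataPacket] =
    generator.map(processor).toList
}
```

For example, `chain(addField("length")(d => d("line").toString.length), keepIf(d => d("length") != 0))` builds a two-step branch, and `fanOut` applied to several such branches gives the tree shape mentioned on the slide.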

Data Packet: List[Map[String, Any]]

case class DataPacket( data: List[Map[String, Any]] )

[ { "key1": "value1", "key2": "value2" }, { "key3": "value3", "key4": "value4" } ]
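
Because the values of a data packet are typed `Any`, code that reads a field has to check its type. The sketch below shows one way to do that with pattern matching; the field names (`title`, `views`) and the helpers `stringField` and `allStrings` are invented for the example and are not part of Tuktu.

```scala
object DataPacketSketch {
  // Same shape as the case class on the slide.
  final case class DataPacket(data: List[Map[String, Any]])

  // A packet holding two datums with mixed value types (made-up content).
  val packet: DataPacket = DataPacket(List(
    Map("title" -> "Caribou", "views" -> 3),
    Map("title" -> "Reindeer")
  ))

  // Read a field only if it is present and actually a String.
  def stringField(datum: Map[String, Any], key: String): Option[String] =
    datum.get(key).collect { case s: String => s }

  // Collect a String field from every datum that has it.
  def allStrings(p: DataPacket, key: String): List[String] =
    p.data.flatMap(datum => stringField(datum, key))
}
```

Returning `Option` keeps a missing or wrongly typed field from throwing, at the cost of pushing the decision about what to do to the caller.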

Generators • Crawler (Wikipedia) • CSV • NoSQL (Cassandra, Kafka, MongoDB, SQL) • Social (Facebook, LinkedIn, Twitter) • Tuktu Distributed File System • Web (REST)

Processors • Arithmetic • Aggregation (Count, Max, Min, Sum) • Statistics (Mean, Median, Correlation, Covariance, Standard deviation) • CSV • File • JSON • Machine Learning (Association, Clustering, Decision Trees, Regression, Time Series) • Natural Language Processing • NoSQL (SQL, Elasticsearch, HDFS, MongoDB, Kafka) • Social (Facebook, Twitter) • Time • Visualization • Web Analytics
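
To give a feel for what an aggregation or statistics processor computes, here is a minimal sketch over the `List[Map[String, Any]]` shape of a data packet. It is not Tuktu's implementation; `AggregationSketch`, `numbers`, `mean`, and `stdDev` are hypothetical names, and the standard deviation shown is the population form.

```scala
object AggregationSketch {
  type Datum = Map[String, Any]

  // Pull the numeric values of one field, skipping missing or non-numeric ones.
  def numbers(data: List[Datum], field: String): List[Double] =
    data.flatMap(datum => datum.get(field)).collect {
      case i: Int    => i.toDouble
      case l: Long   => l.toDouble
      case d: Double => d
    }

  // Arithmetic mean; None when there is nothing to average.
  def mean(xs: List[Double]): Option[Double] =
    if (xs.isEmpty) None else Some(xs.sum / xs.size)

  // Population standard deviation: square root of the mean squared deviation.
  def stdDev(xs: List[Double]): Option[Double] =
    mean(xs).map(m => math.sqrt(xs.map(x => (x - m) * (x - m)).sum / xs.size))
}
```

Count, max, min, and sum follow directly from the extracted list (`xs.size`, `xs.max`, `xs.min`, `xs.sum`), with the caveat that `max` and `min` throw on an empty list.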

zettaTuktu • A fork of Tuktu maintained by ZettaDataNet • Emphasizes digital libraries over data science • ‘robustness’ versus ‘Let it crash’ • ‘metadata’ versus ‘data’ • importance of ‘meta-metadata’ • Tries to stay as close as possible to Tuktu (regular pull requests)

Digital Library Generators • Europeana • Learning Registry • Learning Resource Exchange • YouTube • OAI-PMH (List Identifiers, List Metadata Formats, List Records, List Sets)
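
The OAI-PMH generators correspond to verbs of the OAI-PMH protocol, each of which is an HTTP request with a `verb` parameter plus verb-specific arguments such as `metadataPrefix` or `set`. The sketch below only builds such request URLs; it says nothing about how zettaTuktu issues them. The helper `requestUrl` and the base URL in the usage note are hypothetical.

```scala
import java.net.URLEncoder
import java.nio.charset.StandardCharsets

object OaiPmhSketch {
  // Build an OAI-PMH request URL: base URL, then verb, then the other arguments.
  def requestUrl(baseUrl: String, verb: String, params: Seq[(String, String)] = Seq.empty): String = {
    val query = (("verb" -> verb) +: params)
      .map { case (key, value) =>
        // Percent-encode values so that characters such as ':' or '/' are safe.
        s"$key=${URLEncoder.encode(value, StandardCharsets.UTF_8.name())}"
      }
      .mkString("&")
    s"$baseUrl?$query"
  }
}
```

With a made-up repository at `https://example.org/oai`, `requestUrl("https://example.org/oai", "ListRecords", Seq("metadataPrefix" -> "oai_dc"))` yields a harvesting request, while `ListSets` and `ListMetadataFormats` need no further arguments.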

Digital Library Processors • Europeana Query • LRE • Query • Retrieval • Thumbnails • Map (i.e., metadata) Merger • Metadata Serialization • Template • XSLT • YouTube • OAI-PMH (Get Record, Harvester, Identify, List Metadata Formats, List Sets) • Vocabularies (Term adder, Vocabulary loader, Vocabulary lookup, Vocabulary remover)

Other Generators & Processors of Interest • Generators: Generic Crawler Generator, Wikipedia Content Generator, REST Generator • Processors: REST Processor, URL Checker Processor, Geo IP Processor

Demo: Metadata Generation

Open Source • Apache License, Version 2.0 • https://github.com/ZettaDataNet/Tuktu

zettaTuktu Developments: Towards Covering the Entire Metadata Lifecycle • Metadata acquisition: Add support for new protocols • Metadata generation: Add support for new templating mechanisms • Metadata enrichment: Further explore natural language processing (NLP) and machine learning (ML) • Metadata curation: Go beyond link checking • Metadata indexing

Conclusion • The advent of big data has led to the development of a range of powerful open source tools • These tools can be used to simplify the aggregation and curation of (learning resource) metadata • These tools mostly rely on the JSON data format • Standards relying on XML-like data models are increasingly at a disadvantage