MapReduce in Practice: Hadoop, HBase, MongoDB, Accumulo

Map/Reduce in Practice • Hadoop, HBase, MongoDB, Accumulo, and related Map/Reduce-enabled data stores

How we got here • Google: Map/Reduce uses GFS to provide BigTable • Hadoop: Map/Reduce uses HDFS to provide HBase • Related stuff… Cassandra, MongoDB, Accumulo

In the beginning was the Google • Larry and Sergey had a lot of data – Needed fast, distributed storage for large files – Needed location awareness – GFS (the Google File System) was born

Processing that data • Needed some way to process it all efficiently – Move processing to the data – Distributed processing – Only transfer minimal results – Map/Reduce
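
The map, group, and reduce steps can be sketched in a few lines of Python. This is a minimal single-process illustration, not Google's or Hadoop's implementation; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the word-count task are assumptions chosen for the example.

```python
from collections import defaultdict

def map_phase(document):
    """Emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle(pairs):
    """Group values by key, as a framework would between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Collapse all values for one key into a single small result."""
    return key, sum(values)

def word_count(documents):
    # On a real cluster each map call would run on the node holding that
    # split; here everything runs in one process for illustration.
    mapped = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())

counts = word_count(["the quick brown fox", "the lazy dog", "The end"])
print(counts["the"])
```

Only the per-key totals leave the reduce step, which is the "transfer minimal results" idea from the slide.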

Files are good, structure is better • Map/Reduce naturally produces and functions on structured data (key => value pairs) – Needed a way to efficiently store and access data – BigTable • Compressed, sparse, distributed, multidimensional

Open, sort of • Google told the world about this great stuff: – Dean, Jeffrey and Ghemawat, Sanjay. “MapReduce: Simplified Data Processing on Large Clusters,” OSDI'04: Sixth Symposium on Operating System Design and Implementation, San Francisco, CA, December, 2004. – Chang, Fay et al. “Bigtable: A Distributed Storage System for Structured Data,” OSDI'06: Seventh Symposium on Operating System Design and Implementation, Seattle, WA, November, 2006. • But they weren’t sharing the implementations

Hadoop: Map/Reduce for the masses • Open source Apache project – Derived from Google papers – Consists of Hadoop Kernel, MapReduce, and HDFS – Also related projects Hive, HBase, ZooKeeper, etc.

Hadoop Architecture

MapReduce Layer • Takes Jobs, which are split into Tasks – Tasks are executed on worker nodes that, ideally, store the data the task needs to process – If that’s not possible, the task attempts to execute on a worker node in the same rack as the data – Tasks might be map tasks or reduce tasks, depending on what the JobTracker needs at the time
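
The placement preference described above (same node first, then same rack) can be sketched as follows. This is a toy illustration, not Hadoop's scheduler: `pick_worker`, the node names, and the final "any free worker" fallback are assumptions made for the example.

```python
def pick_worker(block_replicas, free_workers, rack_of):
    """Return (worker, locality), preferring node-local, then rack-local."""
    # 1. A free worker that already stores a copy of the block.
    for worker in free_workers:
        if worker in block_replicas:
            return worker, "node-local"
    # 2. A free worker in the same rack as some copy of the block.
    replica_racks = {rack_of[node] for node in block_replicas}
    for worker in free_workers:
        if rack_of[worker] in replica_racks:
            return worker, "rack-local"
    # 3. Assumed fallback: any free worker, paying for a cross-rack transfer.
    if free_workers:
        return free_workers[0], "off-rack"
    return None, "no free worker"

# Hypothetical four-node, two-rack cluster.
rack_of = {"n1": "r1", "n2": "r1", "n3": "r2", "n4": "r2"}
replicas = {"n1"}

print(pick_worker(replicas, ["n4", "n1"], rack_of))
print(pick_worker(replicas, ["n4", "n2"], rack_of))
print(pick_worker(replicas, ["n4"], rack_of))
```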

HDFS Layer • Consists of a namenode, a secondary namenode (which checkpoints the namenode’s metadata rather than acting as a live replica), and datanodes – Datanodes contain redundant copies of data, generally 2 copies on one rack, and a third copy on a different rack – Exposes data location information to the JobTracker so tasks can be distributed to workers close to the data – Not a POSIX file system, and can’t be mounted directly

Other Storage • Hadoop is flexible about what storage system is used – Alternatives are Amazon S3, CloudStore, FTP Filesystem, and read-only HTTP(S) file systems – Only HDFS and CloudStore are rack-aware, though – Multiple data store implementations • Also, HDFS isn’t restricted to Hadoop. HBase and other projects use it as storage

HBase • Basically open-source BigTable – Non-relational, distributed, sparse, multidimensional, compressed data – Tables can be input/output for MapReduce jobs run in Hadoop – Supports Bloom filters • Another thing borrowed from BigTable • Can tell you if something isn’t in the column, but not necessarily if it is there

Data Model • Data is stored as rows with a single key, timestamp, and multiple column families • Data is sorted based on the key, but otherwise there aren’t any indexes • Supports 4 operations: Get, Put, Scan, Delete • Deletes don’t actually delete, they just mark a row as dead, for later compactions to clean up
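
A toy in-memory table makes the four operations and the delete-then-compact behaviour concrete. This is a sketch under simplifying assumptions, not the HBase client API: `ToyTable`, its method names, and the choice to keep only the latest value per column are all invented for illustration.

```python
import bisect

class ToyTable:
    def __init__(self):
        self._keys = []     # row keys kept sorted: the only "index"
        self._rows = {}     # row key -> {"family:qualifier": (timestamp, value)}
        self._dead = set()  # row keys marked as deleted

    def put(self, key, column, value, timestamp):
        if key in self._dead:
            # Writing to a deleted row starts it afresh (a simplification).
            self._dead.discard(key)
            self._rows[key] = {}
        elif key not in self._rows:
            bisect.insort(self._keys, key)
            self._rows[key] = {}
        self._rows[key][column] = (timestamp, value)

    def get(self, key):
        if key in self._dead or key not in self._rows:
            return None
        return dict(self._rows[key])

    def scan(self, start, stop):
        """Yield live rows with start <= key < stop, in key order."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, stop)
        for key in self._keys[lo:hi]:
            if key not in self._dead:
                yield key, dict(self._rows[key])

    def delete(self, key):
        # Only mark the row as dead; the data stays until compaction.
        if key in self._rows:
            self._dead.add(key)

    def compact(self):
        # Physically drop everything that was marked as dead.
        for key in self._dead:
            del self._rows[key]
        self._keys = [k for k in self._keys if k not in self._dead]
        self._dead.clear()

table = ToyTable()
table.put("row2", "cf:a", "x", timestamp=1)
table.put("row1", "cf:a", "y", timestamp=2)
table.put("row3", "cf:b", "z", timestamp=3)
scanned = [key for key, _ in table.scan("row1", "row3")]
table.delete("row2")
still_stored = "row2" in table._rows   # marked dead, not yet removed
table.compact()
```

Because the keys are sorted, a range scan is just two binary searches plus a slice; looking up anything by a non-key value would mean walking every row.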

Digression: Bloom Filters • Maintains a bit array, like a hash table – Each item, when inserted into the column, is hashed with k different algorithms, and the bit at each resulting index is set to 1. – To determine if a value is in the table, hash it with the k algorithms and see if all the indexes are set to 1. If one or more is not set, the value isn’t in there – But there is a non-zero probability that all will be 1 even though the value isn’t there (a false positive). – Insert-only: items can’t be removed, since you never know which other entries set the same bit
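
The insert and lookup steps above can be sketched as a small class. This is an illustrative toy, not HBase's implementation: the class name, the default sizes, and the trick of simulating k hash algorithms by salting SHA-256 are assumptions.

```python
import hashlib

class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for readability

    def _indexes(self, item):
        # Simulate k hash functions by salting one hash with the function number.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for index in self._indexes(item):
            self.bits[index] = 1

    def might_contain(self, item):
        # False means "definitely absent"; True only means "possibly present".
        return all(self.bits[index] for index in self._indexes(item))

bloom = BloomFilter()
bloom.add("row-42")
print(bloom.might_contain("row-42"))
```

There is deliberately no `remove` method: clearing a bit could also erase evidence of some other item that hashed to the same position.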

So, why bother? • Column scans are expensive, and that’s about the only way to find stuff in a column that’s not the key

Accumulo • HBase for the NSA – Provides basically the same functionality as HBase, but with security – Adds a new element to the key, Column Visibility • Stores a logical combination of security labels that must be satisfied at query time for the key/value to be returned • Hence a single table can store data with various security levels, and users only see what they’re allowed to see
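
To illustrate how a logical combination of labels can be checked at query time, here is a toy evaluator for expressions built from labels, `&`, `|`, and parentheses. It is a sketch, not Accumulo's parser: `is_visible`, the label names, and the precedence rule (`&` binds tighter than `|`) are assumptions, as is treating an empty expression as visible to everyone.

```python
import re

TOKEN = re.compile(r"\s*([A-Za-z0-9_]+|[&|()])")

def tokenize(expression):
    tokens, pos = [], 0
    expression = expression.strip()
    while pos < len(expression):
        match = TOKEN.match(expression, pos)
        if not match:
            raise ValueError(f"bad character at position {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens

def is_visible(expression, authorizations):
    """Return True if the user's labels satisfy the visibility expression."""
    tokens = tokenize(expression)
    if not tokens:
        return True  # assumption: an unlabeled entry is visible to everyone
    pos = 0

    def parse_or():
        nonlocal pos
        result = parse_and()
        while pos < len(tokens) and tokens[pos] == "|":
            pos += 1
            rhs = parse_and()  # always parse, even if result is already True
            result = result or rhs
        return result

    def parse_and():
        nonlocal pos
        result = parse_term()
        while pos < len(tokens) and tokens[pos] == "&":
            pos += 1
            rhs = parse_term()
            result = result and rhs
        return result

    def parse_term():
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("unexpected end of expression")
        token = tokens[pos]
        pos += 1
        if token == "(":
            result = parse_or()
            if pos >= len(tokens) or tokens[pos] != ")":
                raise ValueError("missing ')'")
            pos += 1
            return result
        if token in ("&", "|", ")"):
            raise ValueError(f"unexpected token {token!r}")
        return token in authorizations

    result = parse_or()
    if pos != len(tokens):
        raise ValueError("trailing tokens in expression")
    return result

print(is_visible("secret&(ops|intel)", {"secret", "intel"}))
print(is_visible("secret&(ops|intel)", {"secret"}))
```

A scan would apply a check like this to every key/value pair, which is how two users reading the same table can see different subsets of it.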

Cassandra • A lot like HBase, with BigTable inspiration, but also inspired by Amazon Dynamo (cloud key/value store) • Also has column families (and even supercolumns), but allows secondary indexes • Distribution and replication are tunable • Writes are faster than reads, so good for logging, etc.

Cassandra vs. HBase • Basically comes down to the CAP theorem: – You have to pick two of Consistency, Availability, and Partition tolerance. You can’t have all 3. • Cassandra chooses AP, though you can get consistency if you can tolerate greater latency. – By default provides weak consistency • HBase values CP, but availability may suffer. In the event of a partition (node failure), the data won’t be available if it can’t be guaranteed to be consistent with committed operations.
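
The trade of latency for consistency in a Dynamo-style store can be shown with simple arithmetic: if a write waits for W replicas and a read consults R replicas out of N, every read is guaranteed to touch at least one replica holding the latest write whenever R + W > N. The sketch below only checks that inequality; the function names are hypothetical and it does not talk to Cassandra.

```python
def quorum(n_replicas):
    """Smallest majority of n_replicas."""
    return n_replicas // 2 + 1

def reads_see_latest_write(n_replicas, write_acks, read_acks):
    """True when every read set must overlap every write set (R + W > N)."""
    return read_acks + write_acks > n_replicas

N = 3
fast_but_weak = reads_see_latest_write(N, write_acks=1, read_acks=1)
quorum_both = reads_see_latest_write(N, quorum(N), quorum(N))
write_all_read_one = reads_see_latest_write(N, write_acks=N, read_acks=1)
print(fast_but_weak, quorum_both, write_all_read_one)
```

Waiting for more acknowledgements is where the extra latency comes from, and it also means an operation cannot complete while too many replicas are unreachable.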

MongoDB • Document-Oriented Storage – Full index support – Replication and high availability – Auto-sharding to scale horizontally – JavaScript-based querying – Map/Reduce – GridFS storage

Conclusion • There are a lot of options out there, and more all the time • RDBMS offers the most functionality, but stumbles at the scalability problem • Key/value stores scale, but require a different processing model • Best option will be determined by a combination of data and task