Big Data Technologies
Prof. Smita Wangikar
Information Technology Department
International Institute of Information Technology, I²IT
www.isquareit.edu.in
Big Data Technologies
- Characteristics of Big Data
  - Volume
  - Variety
  - Velocity
  - Veracity
- Internal and External Data
  - Internal: data that is owned by an organization
  - External: data that belongs to an entity other than the organization that wishes to acquire and use it
- Structured and Unstructured Data
International Institute of Information Technology, I²IT, P-14, Rajiv Gandhi Infotech Park, Hinjawadi Phase 1, Pune - 411 057 | Phone - +91 20 22933441/2/3 | Website - www.isquareit.edu.in | Email - info@isquareit.edu.in
Google File System
- Design considerations
- Interface
- Architecture
- Chunk size
- Metadata
- Client operations: write
- Client operations: with server
- Decoupling and atomic record appends
- Master operations: logging, where to put a chunk, re-replication and rebalancing
- Garbage collection
- Fault tolerance
- Summary (benefits, limitations)
GFS Design Considerations
- Built from cheap commodity hardware
- Expect large files: 100 MB to many GB
- Support large streaming reads and small random reads
- Support large, sequential file appends
- Support producer-consumer queues for many-way merging and file atomicity
- Sustain high bandwidth by writing data in bulk
GFS … Architecture
GFS … Chunk Size
- Chunk size: 64 MB
  - Much larger than typical file system block sizes
- Advantages of a large chunk size
  - Reduces interaction between client and master
  - Client can perform many operations on a given chunk
    - Reduces network overhead by keeping a persistent TCP connection
  - Reduces the size of the metadata stored on the master
    - The metadata can reside in memory
- Metadata: the master stores three major types
  - File and chunk namespaces
  - Mapping from files to chunks
  - Locations of each chunk's replicas
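
The effect of the 64 MB chunk size on master metadata can be illustrated with a little arithmetic. The sketch below is not GFS code: the function names are hypothetical, and the 4 KB comparison value is only an assumed stand-in for a "typical" file system block size. It shows how a byte offset maps to a chunk index and how many entries the master would have to track for a 1 TiB file at each size.

```python
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB, the chunk size stated on the slide


def chunk_index(byte_offset, chunk_size=CHUNK_SIZE):
    """Index of the chunk that holds the given byte offset of a file."""
    return byte_offset // chunk_size


def chunk_count(file_size, chunk_size=CHUNK_SIZE):
    """Number of chunks needed for file_size bytes (the last one may be partial)."""
    return -(-file_size // chunk_size)  # ceiling division


if __name__ == "__main__":
    one_tib = 1024 ** 4
    print(chunk_index(200 * 1024 * 1024))   # offset 200 MB lies in chunk 3
    print(chunk_count(one_tib))             # 16384 chunks at 64 MB
    print(chunk_count(one_tib, 4 * 1024))   # 268435456 blocks at an assumed 4 KB
```

With roughly sixteen thousand entries instead of hundreds of millions, keeping the file-to-chunk mapping in the master's memory becomes plausible.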
GFS … Client Operation … Write
- One chunkserver acts as the primary for each chunk
  - Master grants a lease to the primary (typically for 60 sec.)
  - Leases are renewed using periodic heartbeat messages between master and chunkservers
- Client asks master for the primary and secondary replicas of each chunk
- Client sends data to the replicas in a daisy chain
  - Pipelined: each replica forwards as it receives
  - Takes advantage of full-duplex Ethernet links
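
The lease mechanism described above can be sketched as a small timer object. This is a simplified illustration, not the actual GFS implementation: the class name, method names, and chunk/server identifiers are all hypothetical, and time is passed in explicitly so the behaviour is easy to follow. Only the 60-second duration and the renewal on heartbeat come from the slide.

```python
LEASE_SECONDS = 60.0  # typical lease duration stated on the slide


class ChunkLease:
    """Hypothetical record of which chunkserver is primary for one chunk."""

    def __init__(self, chunk_handle, primary, granted_at):
        self.chunk_handle = chunk_handle
        self.primary = primary
        self.expires_at = granted_at + LEASE_SECONDS

    def is_valid(self, now):
        """The primary may order mutations only while the lease is unexpired."""
        return now < self.expires_at

    def renew_on_heartbeat(self, now):
        """Extend the lease when a heartbeat arrives before expiry."""
        if not self.is_valid(now):
            return False  # expired: the master would have to grant a new lease
        self.expires_at = now + LEASE_SECONDS
        return True


if __name__ == "__main__":
    lease = ChunkLease("chunk-0001", "chunkserver-a", granted_at=0.0)
    print(lease.is_valid(30.0))             # True
    print(lease.renew_on_heartbeat(45.0))   # True, lease now runs until t=105
    print(lease.is_valid(100.0))            # True
    print(lease.is_valid(105.0))            # False
```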
GFS … Client Operation … Write
GFS … Client Operations with Server
- Issues control (metadata) requests to the master server
- Issues data requests directly to chunkservers
- Caches metadata
- Does no caching of data
  - No consistency difficulties among clients
  - Streaming reads (read once) and append writes (write once) don’t benefit much from caching at the client
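
The split between control requests and data requests can be sketched with three toy classes. Everything here is hypothetical and in-memory: the class names, the lookup table layout, and the tiny 8-byte chunk size used in the demo are assumptions for illustration, and reads that cross a chunk boundary are not handled. The point is only that the client caches metadata from the master and fetches data directly from a chunkserver without caching it.

```python
class Master:
    """Toy master: answers metadata lookups and counts them."""

    def __init__(self, table):
        self.table = table  # (filename, chunk index) -> (handle, [server names])
        self.lookups = 0

    def lookup(self, filename, index):
        self.lookups += 1
        return self.table[(filename, index)]


class Chunkserver:
    """Toy chunkserver: holds chunk contents keyed by chunk handle."""

    def __init__(self, chunks):
        self.chunks = chunks  # handle -> bytes

    def read(self, handle, offset, length):
        return self.chunks[handle][offset:offset + length]


class Client:
    def __init__(self, master, chunkservers, chunk_size):
        self.master = master
        self.chunkservers = chunkservers  # server name -> Chunkserver
        self.chunk_size = chunk_size
        self.metadata_cache = {}          # metadata is cached, data is not

    def read(self, filename, offset, length):
        key = (filename, offset // self.chunk_size)
        if key not in self.metadata_cache:           # control request to master
            self.metadata_cache[key] = self.master.lookup(*key)
        handle, locations = self.metadata_cache[key]
        server = self.chunkservers[locations[0]]     # data request to chunkserver
        return server.read(handle, offset % self.chunk_size, length)


if __name__ == "__main__":
    master = Master({("log", 0): ("h0", ["cs1"]), ("log", 1): ("h1", ["cs1"])})
    servers = {"cs1": Chunkserver({"h0": b"abcdefgh", "h1": b"ijklmnop"})}
    client = Client(master, servers, chunk_size=8)  # 8 bytes only for the demo
    print(client.read("log", 2, 3))   # b'cde'
    print(client.read("log", 5, 2))   # b'fg', served without a second lookup
    print(master.lookups)             # 1
```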
HDFS (Hadoop Distributed File System)
- A distributed file system that provides high-throughput access to application data
- HDFS uses a master/slave architecture in which one device (the master), termed the NameNode, controls one or more other devices (the slaves), termed DataNodes.
- It splits files into blocks (128 MB each by default), stores them on DataNodes, and replicates each block on other nodes for fault tolerance.
- The NameNode keeps track of the blocks written to the DataNodes
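
Block splitting and replication can be sketched in a few lines. This is an illustration under stated assumptions, not HDFS code: the function names and DataNode names are hypothetical, the replication factor of 3 is the usual HDFS default rather than something stated on the slide, and the round-robin placement is a deliberate simplification that ignores HDFS's rack-aware placement policy.

```python
BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB block size from the slide


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Sizes of the blocks a file of file_size bytes is split into."""
    full, rest = divmod(file_size, block_size)
    return [block_size] * full + ([rest] if rest else [])


def place_replicas(num_blocks, datanodes, replication=3):
    """Assign each block to `replication` distinct DataNodes, round-robin.

    Simplified placement for illustration only; real HDFS is rack-aware.
    """
    if replication > len(datanodes):
        raise ValueError("not enough DataNodes for the replication factor")
    return {
        block: [datanodes[(block + r) % len(datanodes)] for r in range(replication)]
        for block in range(num_blocks)
    }


if __name__ == "__main__":
    mb = 1024 * 1024
    blocks = split_into_blocks(300 * mb)
    print([size // mb for size in blocks])   # [128, 128, 44]
    # This mapping is the kind of bookkeeping the NameNode keeps track of.
    print(place_replicas(len(blocks), ["dn1", "dn2", "dn3", "dn4"]))
```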
HDFS Architecture
Hadoop’s MapReduce Engine
How MapReduce Works
- A method for distributing computation across multiple nodes
- Each node processes the data that is stored at that node
- Consists of two main phases
  - Map
  - Reduce
- The Mapper
  - Reads data as key/value pairs
    - The key is often discarded
  - Outputs zero or more key/value pairs
- The Shuffle and Sort
  - Output from the mapper is sorted by key
  - All values with the same key are guaranteed to go to the same machine
- The Reducer
  - Called once for each unique key
  - Gets a list of all values associated with a key as input
  - Outputs zero or more final key/value pairs
    - Usually just one output per input key
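
The three stages above can be imitated on a single machine in plain Python. The sketch below uses word count as an assumed illustration; it does not use the Hadoop API, and the function names are hypothetical. The mapper ignores its input key (a line offset), the shuffle-and-sort step groups values by key in sorted order, and the reducer is called once per unique key.

```python
from collections import defaultdict


def mapper(key, value):
    """Input: (line offset, line text). The key is discarded."""
    for word in value.lower().split():
        yield word, 1


def shuffle_and_sort(pairs):
    """Group all values by key and return the groups sorted by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reducer(key, values):
    """Called once per unique key with the list of all its values."""
    yield key, sum(values)


def run_job(records):
    mapped = [pair for key, value in records for pair in mapper(key, value)]
    return [out for key, values in shuffle_and_sort(mapped)
            for out in reducer(key, values)]


if __name__ == "__main__":
    lines = [(0, "the quick brown fox"), (20, "the lazy dog"), (33, "the fox")]
    print(run_job(lines))
    # [('brown', 1), ('dog', 1), ('fox', 2), ('lazy', 1), ('quick', 1), ('the', 3)]
```

In a real cluster the mapper calls run on the nodes holding the input data and the grouped keys are partitioned across reducer machines; here everything runs in one process for clarity.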
MapReduce Example
Thank You
E-mail: smitaw@isquareit.edu.in