COIT 20253: Business Intelligence Using Big Data
Week 2: Big Data Technology
Term 1, 2015
Course Coordinator: Dr. Meena Jha

Big Data Technology
What Technology Do We Have for Big Data?

BIG DATA PROCESSING TECHNOLOGIES
• Technology is radically changing the way data is produced.
• Big Data solutions can be seen from many perspectives, such as:
  - Storage: the requirement for lots of storage.
  - Distribution: data stored in lots of places globally.
  - Database design: lots of rows, lots of columns, lots of tables.
  - Algorithmic or mathematical: lots of variables, lots of combinations, and lots of permutations, summed up as finding one optimal answer among a large number of possibilities.

Overview of Technologies for Big Data (Technology: Definition)
• Hadoop: Open-source software for processing Big Data across multiple parallel servers.
• MapReduce: The architectural framework on which Hadoop is based.
• Scripting Language: Programming languages that work well with big data (e.g. Python, Pig, Hive).
• Machine Learning: Software for rapidly finding the model that best fits the dataset.
• Visual Analytics: Display of analytical results in visual or graphic formats.
• Natural Language Processing: Software for analysing text frequencies, meanings, etc.
• In-memory Analytics: Processing big data in computer memory for greater speed.

Apache Hadoop
• Apache Hadoop is an open-source distributed software platform for storing and processing data. Written in Java, it runs on a cluster of industry-standard servers configured with direct-attached storage. Using Hadoop, petabytes of data can be stored reliably on tens of thousands of servers, and performance scales cost-effectively by adding inexpensive nodes to the cluster.

Hadoop Distributed File System (HDFS)
• HDFS splits files into large blocks (usually 64 MB or 128 MB) and distributes these blocks among the nodes in the cluster. To process the data, Hadoop MapReduce ships code (specifically JAR files) to the nodes that hold the required data, and the nodes then process the data in parallel. This approach takes advantage of data locality, in contrast to conventional HPC architecture, which usually relies on a parallel file system (compute and data separated, but connected by high-speed networking).
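
As a rough illustration of block splitting and data locality, the Python sketch below divides a file into fixed-size blocks, assigns the blocks to nodes, and then schedules each processing task on the node that already stores its block. The 128 MB block size comes from the slide. The node names and the round-robin placement policy are assumptions made for illustration and are not how HDFS actually places blocks.

```python
import math

BLOCK_SIZE_MB = 128  # one of the typical block sizes named on the slide


def split_into_blocks(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Return the size in MB of each block for a file of the given size."""
    n_blocks = math.ceil(file_size_mb / block_size_mb)
    sizes = [block_size_mb] * n_blocks
    remainder = file_size_mb % block_size_mb
    if remainder:
        sizes[-1] = remainder  # the last block holds whatever is left over
    return sizes


def place_blocks(block_sizes, nodes):
    """Assign each block index to a node round-robin (assumed policy)."""
    return {i: nodes[i % len(nodes)] for i in range(len(block_sizes))}


def schedule_tasks(placement):
    """Data locality: run each block's task on the node that stores it."""
    tasks = {}
    for block, node in placement.items():
        tasks.setdefault(node, []).append(block)
    return tasks


if __name__ == "__main__":
    blocks = split_into_blocks(300)  # a hypothetical 300 MB file
    placement = place_blocks(blocks, ["node-a", "node-b"])
    print(blocks)
    print(schedule_tasks(placement))
```

The point of the sketch is the last function: the code moves to where the blocks are, rather than the blocks moving to where the code runs.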

Big Data Technologies…
• Hortonworks, Cloudera, MapR, IBM, Microsoft, Intel, and EMC Greenplum all provide distributions of the open-source Hadoop platform.
• Hadoop is a powerful platform for big data storage and processing. However, its extensibility and novelty renew questions around data integration, data quality, governance, security, and a host of other issues that enterprises with mature Business Intelligence (BI) processes have long taken for granted.
• Despite the many challenges of integrating Hadoop into a traditional BI environment, ETL has proven to be a frequent use case for Hadoop in enterprises.

EMC Greenplum
• The Greenplum Unified Analytics Platform (UAP) enables agile Big Data analytics by letting data science teams analyse structured and unstructured data in a single platform. It comprises three components: the Greenplum MPP database, for structured data; a Hadoop distribution, Greenplum HD; and Chorus, a productivity and groupware layer for data science teams.

IBM’s InfoSphere BigInsights
• IBM's big data platform includes software for processing streaming data and persistent data. BigInsights supports persistent data, while InfoSphere Streams supports streaming data. The two can be deployed together to support real-time and batch analytics of various forms of raw data, or they can be deployed individually to meet specific application objectives.

Microsoft’s Big Data Solutions
• Microsoft’s Big Data solution brings Hadoop to the Windows Server platform and, in elastic form, to the Windows Azure cloud platform. The Hadoop Hive data warehouse is part of the solution, which includes connectors from Hive to ODBC and Excel.

Oracle Appliance
• Oracle’s approach caters to the high-end enterprise market, and leans particularly towards the rapid-deployment, high-performance end of the spectrum. It is the only vendor to include the popular R analytical language integrated with Hadoop, and to ship a NoSQL database of its own design as opposed to Hadoop HBase.

Big Data Technologies

The Big Data Stack

Storage: Low cost
• Storing large and diverse amounts of data on disk is becoming more cost-effective as disk technologies become more commoditized and efficient.
• Storage in Hadoop environments is typically on multiple disks attached to commodity servers.
• Companies like EMC sell storage solutions that allow disks to be added quickly and cheaply, thereby scaling storage in lockstep with growing data volumes.

Platform Infrastructure
• The big data platform is a collection of functions that together provide high-performance processing of big data.
• The platform includes capabilities to integrate, manage, and apply sophisticated computational processing to data.
• Big Data platforms include a Hadoop (or similar open-source project) foundation; you can think of it as the big data execution engine.

Data
• The expanse of big data is as broad and complex as its applications. Big data can mean human genome sequences, oil well sensors, cancer cell behaviours, locations of products on pallets, social media interactions, or patients' vital signs, to name a few examples.
• The data layer in the stack implies that data is a separate asset, warranting discrete management and governance.

Application Code
• Big Data varies with the business application.
• The code used to manipulate and process the data also varies.
• Hadoop uses a processing framework called MapReduce not only to distribute data across the disks but also to apply complex computational instructions to that data.

Application Code
• In keeping with the high-performance capabilities of the platform, MapReduce instructions are processed in parallel across various nodes on the big data platform, and then quickly assembled to provide a new data structure or answer set.
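
The map, parallel-processing, and assembly steps described above can be sketched in a few lines of Python. This is an in-process simulation of the MapReduce pattern using a word count, not Hadoop's Java API. The function names and the use of a thread pool to stand in for cluster nodes are assumptions made for illustration.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(mapped_chunks):
    """Shuffle: group all emitted values by key across the chunks."""
    groups = defaultdict(list)
    for pairs in mapped_chunks:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: collapse each key's values into a single total."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(chunks, workers=2):
    # Each chunk stands in for a data block held by a different node;
    # the thread pool stands in for those nodes working in parallel.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        mapped = list(pool.map(map_phase, chunks))
    # The partial results are then assembled into one answer set.
    return reduce_phase(shuffle(mapped))


if __name__ == "__main__":
    print(word_count(["big data big", "data platform"]))
```

On a real cluster the shuffle step moves data between machines, which is usually the expensive part; here it is only a dictionary update.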

Apache Pig and Hive
• Apache Pig and Hive are two scripting languages for carrying out MapReduce functionality in application code.
• Pig provides operations for reading, filtering, transforming, joining, and writing data. It is a higher-level language than Java.
• Hive performs similar functions but is more batch-oriented, and it can transform data into a relational format suitable for SQL queries.
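
To make the five kinds of operation concrete, here is a minimal Python sketch of a read, transform, filter, join, and write pipeline. It is not Pig Latin and does not run on Hadoop; it only mirrors the shape of such a data flow. The sample orders, the customer lookup table, and the `run_pipeline` name are hypothetical.

```python
import csv
import io

# Hypothetical input data, inlined so the example is self-contained.
ORDERS_CSV = "order_id,customer_id,amount\n1,c1,40\n2,c2,250\n3,c1,120\n"
CUSTOMERS = {"c1": "Alice", "c2": "Bob"}


def run_pipeline(orders_csv, customers, min_amount=100):
    # Read: parse raw text into records.
    rows = list(csv.DictReader(io.StringIO(orders_csv)))
    # Transform: convert the amount field from text to a number.
    for row in rows:
        row["amount"] = int(row["amount"])
    # Filter: keep only orders at or above the threshold.
    rows = [row for row in rows if row["amount"] >= min_amount]
    # Join: attach the customer name from the second data set.
    joined = [
        {
            "order_id": row["order_id"],
            "customer": customers[row["customer_id"]],
            "amount": row["amount"],
        }
        for row in rows
    ]
    # Write: serialise the result back to CSV text.
    out = io.StringIO()
    writer = csv.DictWriter(
        out, fieldnames=["order_id", "customer", "amount"], lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(joined)
    return out.getvalue()


if __name__ == "__main__":
    print(run_pipeline(ORDERS_CSV, CUSTOMERS))
```

A scripting language such as Pig lets each of these steps be written as a single statement, with the framework deciding how to spread the work across the cluster.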

Business View
• The Business View layer of the stack makes big data ready for further analysis. Depending on the big data application, additional processing via MapReduce or custom code might be used to construct an intermediate data structure, such as a statistical model, a flat file, a relational table, or a data cube.
• The business view makes big data more consumable by the tools and knowledge workers that already exist in an organization.
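
One way to picture an intermediate data structure is to load the output of an upstream job into a relational table and summarise it along two dimensions. The sketch below does this with Python's built-in `sqlite3` module. The sample records, the `sales` table, and its column names are hypothetical, and a production business view would normally live in a data warehouse or BI tool rather than an in-memory database.

```python
import sqlite3

# Hypothetical output of an upstream big data job: (region, product, units).
PROCESSED = [
    ("north", "phone", 10),
    ("north", "tablet", 4),
    ("south", "phone", 7),
    ("north", "phone", 5),
]


def build_business_view(records):
    """Load processed records into an in-memory relational table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (region TEXT, product TEXT, units INTEGER)")
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", records)
    return conn


def units_by_region_and_product(conn):
    """Aggregate along two dimensions, as a small cube-style summary."""
    query = (
        "SELECT region, product, SUM(units) FROM sales "
        "GROUP BY region, product ORDER BY region, product"
    )
    return conn.execute(query).fetchall()


if __name__ == "__main__":
    view = build_business_view(PROCESSED)
    for row in units_by_region_and_product(view):
        print(row)
```

Once the data is in this form, existing SQL-based reporting tools can query it without knowing anything about how the upstream processing was done.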

Applications
• One of the more profound developments in the world of big data is the adoption of so-called data visualization. Unlike the specialized business intelligence technologies and unwieldy spreadsheets of yesterday, data visualization tools allow the average business person to view information in an intuitive, graphical way.

Applications
• Data Visualization at a Wireless Carrier

Integrating Big Data Technologies
• A Big Data Technology Ecosystem

What Most Large Companies Do Today
• A Typical Data Warehouse Environment

Putting the pieces together
• Big companies with large investments in their data warehouses have neither the resources nor the will to simply replace an environment that works well doing what it was designed to do. At the majority of big companies, a coexistence strategy that combines the best of legacy data warehouse and analytics environments with the new power of big data solutions offers the best of both worlds.

Putting the pieces together
• Many companies continue to rely on incumbent data warehouses for standard BI and analytics reporting, including regional sales reports, customer dashboards, or credit risk history. In this new environment, the data warehouse can continue with its standard workload, using data from legacy operational systems and storing historical data to provision traditional business intelligence and analytics results.

Putting the Pieces Together
• Big Data and Data Warehouse Coexistence

Big Data Technology

Integrating Big Data Technologies
When determining the components of the big data environment, executives need to know the answers to these key questions:
1. What’s the initial problem set we think new big data technologies can help us with?
2. What existing technologies will play a role?
3. Do we have the right skills in place to develop or customize big data solutions to fit our needs?
4. Do these new solutions need to “talk to” our incumbent platforms? Will we need to enable that? Are there open source projects that can give us a head start?
5. It’s not practical for us to acquire all the big data-enabling technologies we need in one fell swoop. Assuming we can establish acquisition tiers for key big data solutions, what are the corresponding budget tiers?

Integrating Big Data Technologies
• By circumscribing a specific set of business problems, companies considering big data can be more specific about the corresponding functional capabilities and the big data projects or service providers that can help address them. This approach can inform both the acquisition of new big data technologies and the re-architecting of existing ones to fit into the brave new world of big data.

Week 2
• End of the Lecture