Unit IV: Big Data Analytics
• Data analytics life cycle
• Data cleaning
• Data transformation
• Comparing reporting and analysis
• Types of analysis
• Analytical approaches
• Data analytics using R
• Exploring basic features of R
• Exploring R GUI
• Reading data sets
• Manipulating and processing data in R
• Functions and packages in R
• Performing graphical analysis in R
• Integrating R and Hadoop
• Hive
Data Analytics
“Big Data Analytics can be considered as a process of transforming huge volumes of unstructured raw data, received from various sources, into a data product that is useful for organizations.”
Data Analytics Life Cycle
Big data analytics differs from traditional data analysis because of the volume, velocity and variety of the data.
1) Business Case Evaluation: The life cycle starts with a business case that sets out the justification, motivation and goals for carrying out the analysis.
2) Data Identification: Identify the datasets needed for the analysis project and their sources.
3) Data Acquisition and Filtering: Collect the data from the data sources and filter it.
4) Data Extraction: Extract disparate data and convert it into a format that can be used for data analysis.
5) Data Validation and Cleaning: Invalid data may lead to false analysis results. Validation sets up rules, often complex ones, for recognizing and eliminating invalid data.
6) Data Aggregation and Representation: Aggregate multiple datasets to arrive at a combined view.
7) Data Analysis: Perform the actual analysis, which may involve several types of analytics.
8) Data Visualization: Present the results effectively in order to communicate them to users.
9) Utilization of Analysis Results: Determine how and where the processed analysis results can be used further.
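Steps 5 and 6 of the life cycle can be sketched in R as follows. This is only a minimal illustration: the data frame `sales`, its columns and the validation rule are hypothetical and not taken from any real project.

```r
# Hypothetical raw data; columns and values are made up for illustration.
sales <- data.frame(
  region = c("North", "North", "South", "South", "South"),
  amount = c(120, -5, 300, NA, 80)
)

# Step 5 (validation and cleaning): keep only rows that pass a simple rule,
# here "amount must be present and non-negative".
is_valid <- !is.na(sales$amount) & sales$amount >= 0
clean <- sales[is_valid, ]

# Step 6 (aggregation): combine the valid rows into one total per region.
by_region <- aggregate(amount ~ region, data = clean, FUN = sum)
print(by_region)
```

Real validation rules are usually more complex, but they follow the same pattern: build a logical condition, then keep or reject rows based on it.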
Data Cleaning
“Data Cleaning is a process of finding incorrect or corrupted data and removing it.”
Issues in Data Cleaning
1. Lack of validations
2. Data from different sources
3. Personal names
4. Locations
5. Dates
6. Numbers
7. Currencies
8. Language
9. Other issues
Cleaning Methods
1. Histogram (used to find out which values are used less frequently)
2. Conversion tables (used when data is sorted)
3. Tools (OpenRefine, plyr, reshape2)
4. Algorithm
5. Manually
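The first two cleaning methods can be sketched in base R. The `city` vector and the `conversion` mapping below are assumptions made up for the example. The frequency count plays the role of the histogram, since rarely used values are often spelling errors.

```r
# Hypothetical city column with inconsistent spellings.
city <- c("Pune", "pune", "Mumbai", "Pune", "Bombay", "Mumbai", "Pune")

# Histogram-style check: count each distinct value and sort the counts
# so that rarely used values appear first.
freq <- sort(table(city))
print(freq)

# Conversion table: a named vector mapping known variants to one
# standard spelling (the mapping itself is an assumption).
conversion <- c("pune" = "Pune", "Bombay" = "Mumbai")
needs_fix <- city %in% names(conversion)
city_clean <- city
city_clean[needs_fix] <- conversion[city[needs_fix]]
print(table(city_clean))
```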
Data Transformation
“Unstructured data is transformed into structured data, called Smart Data.”
Points to consider
1. Searching for the right data
2. Filtering information
3. Right skills
4. Right tools
5. Customer behaviour
6. Analyzing information
7. Right strategy
Comparing Reporting and Analysis
Reporting and analysis differ in their purpose, tasks, outputs, delivery and value. The shared goal of both is to raise sales and decrease costs.
1. Purpose
• Reporting : The process of arranging data into summaries in order to monitor how different areas of the business are performing. It transforms raw data into information and shows what is happening.
• Analysis : The process of exploring data and reports to extract insights about problems, leading to greater understanding and better performance. It transforms data and information into insights and shows why something is happening and what you can do about it.
2. Task
Because the gap between reporting and analysis is small, it is an important task of the analytics team to identify whether the organization is performing reporting or analysis.
• Reporting : building, configuring, combining, arranging, formatting and summarizing
• Analysis : questioning, exploring, interpreting, evaluating
3. Output
Reporting uses a push approach: reports are pushed to the user. A number of charts, graphs and tables are used.
Approaches for reporting and Analysis
Types of reporting
1. Canned reports (contain fixed metrics and dimensions)
2. Dashboards (provide a complete, high-level view of performance)
3. Alerts (raised when data goes outside the expected range)
Types of analysis
1. Ad hoc responses (requests to answer a range of business questions)
2. Analysis presentations (some questions are difficult and take a long time to analyse)
4. Delivery
People can access reports through an analytical tool or an Excel spreadsheet, or schedule them for delivery to a mailbox, mobile device, etc.
5. Value
Types of Big data Analytics
o Descriptive Analytics
• The simplest class of analytics, used to compress big data into smaller, more useful chunks of information. 90% of organizations use it.
• It describes “What has happened?” and examines both real-time and historical data.
• The main goal is to find the reasons for past success or failure. It helps a business learn from past behaviour and understand how that behaviour affects future outcomes.
• E.g. Google Analytics tools
o Diagnostic Analytics
• Used to find out why something happened.
• Takes a deeper look at the data to understand events and behaviour.
• E.g. customer health score analysis
o Predictive Analytics
• Used to predict what will happen in the future.
• Uses statistical, modelling, data mining and machine learning techniques.
• Offers the best recommendation.
• E.g. Amazon and other retailers
o Prescriptive Analytics
• Adds extra features for manipulating data and suggests the possible outcome of every decision.
• It asks “What should a business do?”
• Requires two components: a) actionable data b) a feedback system that keeps track of outcomes
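The difference between descriptive and predictive analytics can be illustrated with a short R sketch. It uses `cars`, a small data set that ships with R. The choice of a linear model and of the new speed value 20 are assumptions made for the example, not a recommended method.

```r
# cars ships with R: speed of cars and their stopping distance.

# Descriptive analytics: summarise what has happened.
desc <- summary(cars$dist)
print(desc)

# Predictive analytics: fit a simple statistical model and use it to
# estimate an outcome for a speed value chosen for illustration.
fit <- lm(dist ~ speed, data = cars)
pred <- predict(fit, newdata = data.frame(speed = 20))
print(pred)
```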
Analytical Approaches
Advanced applications for Big Data analysis:
1. Custom application for Big Data analysis
2. Semi-custom application for Big Data analysis

1. Custom Application for Big Data Analysis
The objective is to minimize the time to a decision or action.
o Google Prediction API: downloaded from the Google Developers website, well documented, available for various languages, simple to use.
o R tool

2. Semi-Custom Application for Big Data Analysis
It is not necessary to create a completely new application. Existing libraries can be used:
1. TA-Lib (Technical Analysis Library): used to analyse financial market data
2. JUNG (Java Universal Network/Graph): used to analyse and visualize data that can be represented as a graph
3. GeoTools: used to manipulate GIS (Geographic Information Systems) data in various forms and to generate graphs
Data Analytics using R
The R language originated in the early 1990s as an academic demonstration language.
Features:
1. R is a very powerful tool for all types of data processing and manipulation
2. R has a community of programmers, end users, academics and practitioners
3. R creates all types of high-quality graphics and data visualizations
4. R has freely distributed add-on packages
5. R is a toolbox with remarkable versatility (the ability to adapt to many functions)
History of R
• R was developed as an implementation of the S programming language.
• R was developed by Ross Ihaka and Robert Gentleman at the University of Auckland (New Zealand).
• The name is taken from the first initial of its two authors.
• Work started in 1992, the first version appeared in 1995, and the first stable release (version 1.0.0) came out in 2000.
Advantages of R
• Free and open source
• Runs on a wide range of software and hardware platforms
• Provides a wide range of features for data manipulation, statistical modelling and graphics
• Supports extensions, i.e. new functionality can be added easily
Exploring Basic Features of R
1. Performing multiple calculations with vectors
R is a vector-based language. A vector can be thought of as a row or column of numbers or text, for example the list of numbers {10, 20, 30, 40, 50}.
Assign the values 10, 20, ..., 50 to a vector and add 5 to every element:
> v <- seq(10, 50, by = 10)
> v
[1] 10 20 30 40 50
> v + 5
[1] 15 25 35 45 55
2. Processing more than just statistics
• The original objective was to make statistical processing easier. R started with statistics only and now covers general programming.
• R is suitable for data processing, graphical visualization and all types of analysis.
• Nowadays R is used in finance, biology, genetics and market research.
3. Running code without a compiler
• No compiler is needed because R is an interpreted language.
• Code can be run as soon as it is written, which speeds up development.
Exploring R GUI
The basic R installation gives you the following:
o Windows : a basic R editor called RGui
o Mac OS X : a basic R editor called R.app
o Linux : no specific editor is provided, but Vim or Emacs can be used to edit R code
RStudio is a code editor that provides a development environment for R:
• Code highlighting, which uses separate colours for elements, variables and keywords
• Automatic bracket matching
• Code completion (no need to type the full command)
To run RStudio: Start, then RStudio (folder), then RStudio (icon)
RStudio screenshot
Performing Graphical Analysis in R
Various types of plot can be drawn in R:
1. Plots with a single variable
2. Plots with two variables
3. Plots with multiple variables
4. Special plots

1. Plots with a Single Variable
Sometimes only a single variable needs to be plotted, for example a specific product's daily sales value over a period of time.
R provides the following functions:
• hist(y) : histogram, used to display a frequency distribution
• plot(y) : index plot, used to display the values of y in sequence
• plot.ts(y) : time series plot
• pie(x) : compositional plots such as pie diagrams
Types of plot with a single variable: 1. Histogram 2. Index plot 3. Time series plot 4. Pie chart
1. Histogram
• Shows the mode, spread and symmetry of the data.
• R provides the hist() function to plot a histogram.
• A histogram uses bins: the entire range of values is divided into a series of intervals, and the number of values falling into each interval is counted.
• Square and round brackets denote bin boundaries in R: [a, b) means ‘greater than or equal to a but less than b’, and (a, b] means ‘greater than a but less than or equal to b’.
• The bins must cover both the minimum and the maximum value of the data.
2. Index Plot
• Used to plot a single sample; the plot function needs only a single argument.
• Especially useful for error checking.
3. Time Series Plot
• Joins the dots of an ordered set of values observed over a period of time.
• Missing values (e.g. sales values missing for three months) leave gaps in the line and make the plot harder to interpret.
• R provides two functions, ts.plot and plot.ts, for plotting time series data.
4. Pie Chart
• Used to illustrate the proportional makeup of a sample.
• The circle is divided into segments, and labels indicate each segment.
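The single-variable plots can be tried with the sketch below. The sales values in `y` and the pie shares are made up for illustration. Plots are sent to a null PDF device so that the code also runs without a display, and `hist()` is called twice to show how its `right` argument switches between the two bracket conventions for bins.

```r
# Hypothetical daily sales values for one product.
y <- c(12, 15, 11, 20, 18, 25, 22, 19, 30, 27)

pdf(NULL)  # null device: nothing is written to disk

# Histogram with explicit bin boundaries at 10, 20 and 30.
# right = TRUE (the default) gives bins of the form (a, b];
# right = FALSE gives bins of the form [a, b).
h_right <- hist(y, breaks = c(10, 20, 30), right = TRUE)
h_left  <- hist(y, breaks = c(10, 20, 30), right = FALSE)
print(h_right$counts)  # the value 20 is counted in the first bin
print(h_left$counts)   # the value 20 is counted in the second bin

plot(y)                          # index plot
plot.ts(y)                       # time series plot
pie(c(A = 40, B = 35, C = 25))   # pie chart of made-up shares

invisible(dev.off())
```

Removing the `pdf(NULL)` and `dev.off()` lines shows the plots on screen in an interactive session.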
2. Plots with Two Variables
Two types of variable are used:
i. Response variable, represented on the y-axis
ii. Explanatory variable, represented on the x-axis
R provides the following functions:
• plot(x, y) : scatter plot of y against x
• plot(factor, y) : box-and-whisker plot of y for each level of factor
• barplot(y) : bar heights taken from a vector of y values
Types of plot with two variables: 1. Scatter plots 2. Stepped lines 3. Box plots 4. Bar plots
1. Scatter plots: used when the explanatory variable is a continuous variable
2. Stepped lines: used to plot data distinctly and provide a clear view
3. Box plots: used to display the location and spread of the data
4. Bar plots: used to display the mean values from various treatments
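The four two-variable plot types can be sketched as follows. The vectors `x`, `y` and `treatment` are hypothetical data invented for the example, and the plots go to a null PDF device so the code runs without a display.

```r
# Hypothetical data: x is a continuous explanatory variable, y is the
# response, and treatment is a categorical explanatory variable.
x <- c(1, 2, 3, 4, 5, 6)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1, 12.2)
treatment <- factor(c("A", "A", "B", "B", "C", "C"))

pdf(NULL)  # null device: nothing is written to disk

plot(x, y)               # scatter plot
plot(x, y, type = "s")   # stepped line
plot(treatment, y)       # box plot of y for each treatment level

# Bar plot of the mean response per treatment.
means <- tapply(y, treatment, mean)
barplot(means)

invisible(dev.off())
print(means)
```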
3. Plots with Multiple Variables
Initial data inspection using plots is even more important when there are several variables.
Types of plot with multiple variables: 1. The pairs function 2. The coplot function
1. The pairs Function
• With multiple continuous variables, it is important to check for dependencies between them.
• For a data frame, each variable is plotted on the y-axis against each of the other variables on the x-axis.
2. The coplot Function
• Used when the relationship between two variables is obscured by the effect of other variables.
• Panels are usually ordered from lower left to upper right.
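Both functions can be tried on `quakes`, a data set that ships with R. The choice of columns is arbitrary, and the sketch draws to a null PDF device so that it runs without a display.

```r
# quakes ships with R: location, depth and magnitude of earthquakes.
vars <- quakes[, c("lat", "long", "depth", "mag")]

pdf(NULL)  # null device: nothing is written to disk

# pairs(): scatter plot matrix, every variable against every other.
pairs(vars)

# coplot(): lat against long, conditioned on depth. Each panel shows
# the relationship for one range of depth values.
coplot(lat ~ long | depth, data = quakes)

invisible(dev.off())
```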
4. Special Plots
R provides extensive plotting facilities, both high-level and low-level, depending on the need.
1. Design plots : drawn using plot.design, used to visualize effect sizes in designed experiments
2. Bubble plots : used to illustrate a third variable across various locations in the x-y plane
Integrating R and Hadoop
Data analysts or data scientists working on Hadoop sometimes need R for data processing. In such cases they would otherwise have to rewrite their R scripts in Java to implement them as Hadoop MapReduce jobs.
Solutions for integrating R and Hadoop:
1. RHADOOP: install R on workstations and connect to data in Hadoop
2. RHIPE: execute R inside Hadoop MapReduce
3. R and Hadoop Streaming
4. RHIVE: install R on workstations and connect to data in Hive
5. ORCH: Oracle Connector for Hadoop

1. RHADOOP: install R on workstations and connect to data in Hadoop
• The most common solution for integration
• Allows the user to ingest data from HBase and HDFS
• Simple and cost effective
• RHadoop is a set of five packages: i. rhbase ii. rhdfs iii. plyrmr iv. ravro v. rmr2
2. RHIPE: execute R inside Hadoop MapReduce
• RHIPE stands for R and Hadoop Integrated Programming Environment.
• The R programmer only writes R map and R reduce functions, which are passed to the RHIPE library. The corresponding Hadoop Map and Hadoop Reduce tasks then call them.
3. R and Hadoop Streaming
• The Hadoop Streaming API runs Hadoop MapReduce jobs using any executable script that can read data from standard input and write data to standard output, acting as the Mapper or the Reducer.
• No client-side integration is needed because everything is done through the Hadoop command line.
4. RHIVE: install R on workstations and connect to data in Hive
• RHIVE is used to launch Hive queries from the R interface.
• It provides functions for retrieving metadata such as databases, tables and columns from Apache Hive.
5. ORCH: Oracle Connector for Hadoop
• With ORCH, the R programmer does not need to learn another language to work with the Hadoop environment.
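The streaming idea can be sketched in R with a word count. The function names `map_lines` and `reduce_lines` are hypothetical, and the sample input stands in for data that Hadoop would supply on standard input. In a real job the mapper and the reducer would be two separate executable R scripts, and Hadoop would sort the mapper output by key before it reaches the reducer.

```r
# Mapper: emit one "word<TAB>1" line per word in the input lines.
map_lines <- function(lines) {
  words <- unlist(strsplit(tolower(lines), "[^a-z]+"))
  words <- words[nzchar(words)]
  paste(words, 1, sep = "\t")
}

# Reducer: sum the counts for each key in "key<TAB>count" lines.
reduce_lines <- function(lines) {
  parts  <- strsplit(lines, "\t", fixed = TRUE)
  keys   <- vapply(parts, `[`, character(1), 1)
  counts <- as.integer(vapply(parts, `[`, character(1), 2))
  totals <- tapply(counts, keys, sum)
  paste(names(totals), totals, sep = "\t")
}

# Inside a streaming script the input would come from standard input:
#   lines <- readLines(file("stdin"))
#   writeLines(map_lines(lines))
# Here a small made-up sample is used instead.
sample_input <- c("big data analytics", "big data with R")
mapped  <- map_lines(sample_input)
reduced <- reduce_lines(sort(mapped))  # sort() mimics the shuffle phase
writeLines(reduced)
```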
Hive
Apache Hive is data warehouse software built on Apache Hadoop for providing data summarization, queries and analysis. Hive gives an SQL-like interface for querying data stored in databases and file systems.
Features
• Supports analysis of large datasets stored in Hadoop HDFS
• Provides HiveQL, an SQL-like query language
• Indexing accelerates queries
Architecture of Hive
1. Metastore
• Stores metadata for each table, such as its schema and location.
• Includes partition metadata, which helps the driver keep track of the data.
• A backup server replicates the data, which is useful in case of data loss.
2. Driver
• Acts like a controller that receives HiveQL statements.
• Starts the execution of a statement by creating a session and monitors its life cycle.
• Also acts as a collection point for data.
3. Compiler
• Compiles the HiveQL query and converts it into an execution plan.
• Converts the query to an Abstract Syntax Tree (AST). After checking for compatibility and compile-time errors, it converts the AST into a Directed Acyclic Graph (DAG).
4. Optimizer
• Performs various transformations to get an optimized DAG.
• Can split tasks when applying transformations.
• Provides better performance and scalability.
5. Executor
• After compilation and optimization, the executor executes the tasks according to the DAG.
• Interacts with the job tracker in Hadoop to schedule the tasks.
6. CLI, UI, and Thrift Server
• The Command Line Interface (CLI) and User Interface (UI) allow external users to interact with Hive.
• The Thrift server allows external clients to interact with Hive.
Reading Data Sets
1. Reading and Writing Data
R provides several functions for reading data:
o read.table, read.csv : used to read tabular data
o readLines : used to read the lines of a text file
o source, dget : used to read R code files
o unserialize : used to read an R object in binary form
R provides several functions for writing data:
o write.table : used to write tabular data to a text file
o writeLines : used to write character data line by line to a file
o dump : used to dump a textual representation of R objects
o save : used to store an arbitrary number of R objects in binary form
o serialize : used to convert an R object into binary form
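A round trip through several of these functions might look like the sketch below. The data frame `df` and the file contents are made up, and everything is written to temporary files. `dput()` does not appear in the list above. It is used here because it is the writing counterpart of `dget()`.

```r
# Hypothetical data frame used for all round trips.
df <- data.frame(id = 1:3, name = c("a", "b", "c"))

# Tabular data: write.csv() / read.csv().
csv_path <- tempfile(fileext = ".csv")
write.csv(df, csv_path, row.names = FALSE)
df_back <- read.csv(csv_path)

# Plain text lines: writeLines() / readLines().
txt_path <- tempfile(fileext = ".txt")
writeLines(c("first line", "second line"), txt_path)
lines_back <- readLines(txt_path)

# Textual representation of an R object: dput() / dget().
obj_path <- tempfile(fileext = ".R")
dput(df, file = obj_path)
df_dget <- dget(obj_path)

# Binary representation held in memory: serialize() / unserialize().
raw_bytes <- serialize(df, connection = NULL)
df_unser  <- unserialize(raw_bytes)

print(df_back)
print(lines_back)
```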
Manipulating and Processing Data in R
Before analysing your data, it is important to decide how to represent it in R.
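As a rough illustration of the choice of representation, the sketch below stores made-up values in the most common R structures and then performs two simple manipulations on a data frame. All names and values are hypothetical.

```r
# Common ways to represent data in R (all values are made up).
ages   <- c(23, 35, 41)                             # numeric vector
gender <- factor(c("F", "M", "F"))                  # factor: categorical data
people <- data.frame(age = ages, gender = gender)   # data frame: a table
record <- list(name = "Asha", scores = c(80, 92))   # list: mixed types

str(people)

# Manipulating a data frame: filter rows, then add a derived column.
older <- people[people$age > 30, ]
people$age_months <- people$age * 12
print(people)
```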
Functions in R
R contains many built-in functions and also lets users create their own functions.
A. Function Definition
o A function is defined with the function keyword
o Syntax : function_name <- function(arg_1, arg_2, arg_3, …) { function body }
B. Function Components
1. Function name 2. Arguments 3. Function body 4. Return value
C. Built-in Functions
o seq(), mean(), max(), sum()
o Syntax :
  print(seq(10, 20))    - to find a sequence (o/p - ?)
  print(mean(42:78))    - to find the mean (o/p - ?)
  print(sum(1:10))      - to find the sum (o/p - ?)
D. User-Defined Functions
Functions created by the user are called user-defined functions.
  Square_function <- function(x) {
    for (i in 1:x) {
      y <- i^2
      print(y)
    }
  }
Calling the function : Square_function(8) (o/p - ?)
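The fourth function component, the return value, is not visible in Square_function because it only prints. A variant that returns its result could look like the sketch below. The name `square_upto` is hypothetical, and the use of `seq_len()` instead of `1:x` is a design choice made for the example.

```r
# Variant of Square_function that returns the squares instead of
# printing them. seq_len(x) also behaves sensibly when x is 0.
square_upto <- function(x) {
  seq_len(x)^2   # the last evaluated expression is the return value
}

result <- square_upto(8)
print(result)
print(sum(result))   # the returned vector can be reused in other calls
```

Returning a value rather than printing it lets the caller pass the result on to other functions such as sum() or mean().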