ANALYTICS IN THE BIG DATA ERA
ANALYTICS TECHNOLOGY AND ARCHITECTURE TO MANAGE VELOCITY AND VARIETY, DISCOVER RELATIONSHIPS, AND CLASSIFY HUGE AMOUNTS OF DATA
MAURIZIO SALUSTI, SAS
Copyright © 2012, SAS Institute Inc. All rights reserved.
AGENDA
• From DBMS to BIG DATA
• Architectural Considerations
• Big Data Analytics Methods
• Data Discovery: Visual Analytics
WHAT IS BIG DATA?
DATA are everywhere:
• IT organizations often collect large amounts of data in an EDW, but they need to integrate it with many other sources.
The ability to generate, communicate, share, and access information has been revolutionized by the increasing number of people, devices, and sensors that are now connected by digital networks.
• People leave information in networks
• Devices provide information in many ways
• Data are a continuous stream of information
• Data are not only measures but also text, images, and sounds
CURRENT COMPANY DATA ORGANIZATION
DATA ARE DEPLOYED AS INFORMATION SNAPSHOTS:
• DATA WAREHOUSE
• ANALYTICAL DATAMARTS
The same information is replicated in several data structures, which makes updates and data refresh slow.
• The spread of information requires a drastic change in the paradigm of how companies collect their data and how they use it:
• Customer data are not only in the company's customer DB. That data gives only a partial view of customers: e.g., telco operators collect customers' voice and SMS traffic, while many of their customers communicate through social media and apps.
• Customers can give many signals about market preferences, acting like sensors on the market, but current data storage structures and their analytics tools are not able to deal with this data.
TRENDS IN COMPANY DATA ORGANIZATION
NEEDS:
• TO AVOID DATA PROLIFERATION
• TO PROVIDE SEVERAL SCENARIOS OF THE SAME DATA
• ENRICHMENT WITH SEVERAL SOURCES
• QUICK DATA RENEWAL
• TO PROVIDE PATTERNS OF CHANGING SCENARIOS
“Big data” refers to datasets whose size is beyond the ability of typical database software tools to capture, store, manage, and analyze. The ability to store, aggregate, and combine data and then use the results to perform analysis in motion has become ever more accessible.
NEW QUESTIONS
• Data are not always in a structured data model
• Often we need to join data that do not share the same keys
• Often data arrive as a periodic flow in near real time
• Often we need to recognize patterns in frequently changing data
• New ways are needed to manage data that are distributed and not structured in the classical way: we need a different paradigm to organize data and, above all, to query them.
Collecting several sources and managing them opens several new problems:
• Relational data (GRAPH DATA) can be useful to understand how an event spreads in a population.
• Data in motion coming from several tools in the field (sensor devices, smartphones) provide dynamic patterns, often without a history of their form.
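One way to picture "data in motion without a history" is a summary that is updated as each reading arrives, so the raw stream never has to be stored. The sketch below uses Welford's online algorithm for a running mean and variance; the class name `RunningStats` and the sample readings are illustrative assumptions, not part of the original slides.

```python
class RunningStats:
    """Running mean and sample variance, updated one reading at a time (Welford's algorithm)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # sum of squared deviations from the current mean

    def update(self, x):
        """Fold one new reading into the summary; no past readings are kept."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        """Sample variance of the readings seen so far (0.0 with fewer than two)."""
        return self._m2 / (self.n - 1) if self.n > 1 else 0.0


# Hypothetical sensor readings arriving one by one.
stats = RunningStats()
for reading in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    stats.update(reading)
```

After the loop, `stats.mean` and `stats.variance` describe the whole stream seen so far, and a sudden drift in either value is one simple signal that the pattern has changed.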
ANALYSIS
• You cannot always apply sampling to extract data
• You cannot always join data to define an analytical base table (ABT)
• Often you need to know how the environment can influence an event, such as a purchase, a choice, or a change.
• Often we need to merge information collected for different purposes.
SQL queries are often useless to reach these data:
• Information is not organized into DB structures
• Data provide information in very different ways: e.g., text is not easy to query using traditional query languages.
• Merging is driven by fuzzy keys, where you assign information to groups according to statistical relationships.
• An event can be driven by its relationships with other data rather than by a specific behavior.
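The idea of merging on fuzzy keys can be sketched as follows. This is a minimal illustration that scores key similarity with Python's standard `difflib.SequenceMatcher`; the function names, the 0.8 threshold, and the sample records are assumptions made for the example, and a production match would normally use a statistical or domain-specific similarity measure.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Return a 0..1 similarity score between two keys, ignoring case and outer spaces."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def fuzzy_merge(left, right, key, threshold=0.8):
    """Pair each left record with its most similar right record when the score reaches the threshold."""
    merged = []
    for lrec in left:
        best, best_score = None, 0.0
        for rrec in right:
            score = similarity(lrec[key], rrec[key])
            if score > best_score:
                best, best_score = rrec, score
        if best is not None and best_score >= threshold:
            # Fields from the left record win when both sides define the same field.
            merged.append({**best, **lrec, "match_score": best_score})
    return merged


# Hypothetical records: a company customer table and a social media source.
crm = [{"name": "Mario Rossi", "plan": "voice+sms"},
       {"name": "Anna Bianchi", "plan": "data"}]
social = [{"name": "mario rosi", "handle": "@mrossi"},
          {"name": "Luca Verdi", "handle": "@lverdi"}]

matches = fuzzy_merge(crm, social, key="name")
```

Here only "Mario Rossi" finds a partner, because the misspelled "mario rosi" is still close enough to clear the threshold, while "Anna Bianchi" has no sufficiently similar key and stays unmatched.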
BIG DATA: WHAT TYPES?
… as Facebook, Twitter, LinkedIn and blogs
Machine-to-machine data
• includes readings from sensors, meters, and other devices as part of the so-called “Internet of things.”
Big transaction data
• includes healthcare claims, telecommunications call detail records (CDRs), and utility billing records that are increasingly available in semi-structured and unstructured formats.
Biometric data
• includes fingerprints, genetics, handwriting, retinal scans, and similar types of data.
Human-generated data
• includes vast quantities of unstructured and semi-structured data such as call center agents’ notes, voice recordings, email, paper …
DBMSs and datamarts help analyze data coming from one central data point. You only need to know where the data are and what they mean. Queries are managed directly by the DBMS.
Data are stored in different places, and you have to know the relationship MAPPING among the different sources. Here, before it extracts data, your query has to know where in the network the data reside.
MULTI POINT DATA HUB: BUILDING BLOCKS OF A BIG DATA ANALYTICS PROCESS
[Diagram; recoverable label: ANALYTICS]
REFERENCE ARCHITECTURE: EXAMPLE SAS-RACK IMPLEMENTATION
[Diagram labels: CLIENT, GREENPLUM, HADOOP, TERADATA, ORACLE]
[Diagram labels: Input, Hadoop, Output, Visual Analytics, Metadata, High Performance Analytics]
[Diagram labels: In memory, GRID COMPUTING, In Database, Input, Output, Visual Analytics, Metadata, Analytical Tool, High Performance Analytics]
SAS® HIGH-PERFORMANCE ANALYTICS
Worrying about software performance is not a new concept at SAS
• What is New?
  § Dedicated high-performance software
  § Accelerated development
• Why Now?
  » Customer needs
  » Blade systems have proven viable platforms for high-performance computing
  » New computing paradigms
  » Partnerships with MPP database vendors
SAS PROCEDURES THEN AND NOW
Then:
    proc logistic data=TD.mydata;
      class A B C;
      model y(event='1') = A B B*C;
    run;
• Single-threaded
• Not aware of distributed computing environment
• Runs on client
Now:
    proc hplogistic data=TD.mydata;
      class A B C;
      model y(event='1') = A B B*C;
    run;
• Multi-threaded
• Aware of distributed computing environment
• Runs on client or DBMS appliance
HP PROCS IN SINGLE SERVER
    libname disk BASE "/filesys";
    proc hpreg data=disk.source;
      /* analytic stuff… */
    run;
SAS Process Steps:
(1) SAS Process starts on HW & O/S
(2) SAS sets up access library to disk
(3) SAS starts HPREG PROC
(4) HPREG reads data through ACCESS during computation*
(5) Multiple threads are launched to process the incoming data
(6) As execution continues, temporary data is written out to utility files on disk
*SMP HP PROCS do not load the entire source dataset into RAM; the SAS Process uses the MEMSIZE option as a boundary. This is no different from MVA or “regular” procs, the DATA step, etc.
[Diagram: OPERATING SYSTEM; SAS Process; Disks – “/filesys”; Temp/Utility files to support SAS Datasets]
HPPROCS IN DISTRIBUTED ARCHITECTURE: HADOOP HDAT – SHARED-RACK EXAMPLE
    libname a sashdat;
    option set=gridhost="NAMENODE";
    proc hpreg data=a.source;
      /* analytic stuff… */
      performance nodes=all;
    run;
SAS Process Steps:
(1) SAS Process starts on HW & O/S
(2) SAS sets up access library to disk
(3) SAS starts HPREG PROC
(4) Due to GRIDHOST and proper access engine setting, multi-threaded processes are started on grid nodes (via TKGrid)
(5) As TKGrid processes start up, ALL data is lifted into RAM from HDFS.
(6) Processing occurs in parallel against in-memory data
(7) Results return to the initiating process on the SAS Server
[Diagram: HADOOP NAMENODE with OPERATING SYSTEM and SAS Process; NODE 1, NODE 2, … NODE N, each holding Data]
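The pattern in steps (4) to (7), where each node works on its own partition in memory and only small results travel back to the initiating process, can be illustrated with a toy regression. The sketch below is not how HPREG or TKGrid are implemented; it assumes a simple linear regression, uses Python threads as stand-ins for grid nodes, and the function names and data are made up for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def partial_sums(rows):
    """Compute the sufficient statistics for simple linear regression on one partition."""
    n = len(rows)
    sx = sum(x for x, _ in rows)
    sy = sum(y for _, y in rows)
    sxx = sum(x * x for x, _ in rows)
    sxy = sum(x * y for x, y in rows)
    return n, sx, sy, sxx, sxy


def fit_distributed(partitions):
    """Fit y = intercept + slope * x by combining per-partition sums."""
    # Each worker (a stand-in for a grid node) summarizes only its own partition.
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        parts = list(pool.map(partial_sums, partitions))
    # The initiating process adds up the small per-partition results.
    n, sx, sy, sxx, sxy = (sum(col) for col in zip(*parts))
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return intercept, slope


# Hypothetical data following y = 2 + 3x, split across three "nodes".
data = [(float(x), 2.0 + 3.0 * x) for x in range(12)]
partitions = [data[0:4], data[4:8], data[8:12]]
intercept, slope = fit_distributed(partitions)
```

The point of the design is that five numbers per partition are enough to recover the same fit as a single pass over all the rows, so the raw data never has to leave the node that holds it.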
Big data analysis can be done using several analytic strategies.
• SAS offers many different methods, many of them coming from traditional statistical inference analysis using the SEMMA paradigm.
• Others come from stochastic process analysis, for both continuous and discrete events.
• Others come from linear and nonlinear mixed models.
• Graph analysis
ANALYTICAL CATEGORIES AND TARGET USAGE
Statistics
• Binary target & continuous number predictions
• Linear, Non-Linear, & Mixed Linear modeling
Data Mining
• Complex relationships
• Tree-based Classification
• Variable Selection
Text Mining
• Parsing large-scale text collections
• Extract entities
• Auto Stemming & synonym detection
Forecasting
• Large-scale, multiple hierarchy problems
Econometrics
• Probability of events
• Severity of random events
Optimization
• Local search optimization
• Large-scale linear & mixed integer problems
• Graph theory
• Data coming from different sources can be tied together using methods such as canonical decomposition.
• For data in motion, such as data coming from devices, pattern variability can be sampled, or the pattern distribution can be simulated, using Markov chain Monte Carlo (MCMC) methods.
• Sparse vector data with missing values can be simulated using MCMC or other regression methods.
• Discrete choice among different events can be modeled using multinomial discrete choice models.
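To make the MCMC reference concrete, here is a minimal random-walk Metropolis sampler. It is a sketch under stated assumptions: the target is a normal distribution with mean 3 and standard deviation 1 chosen only for illustration, and the function names, step size, and burn-in length are arbitrary choices rather than anything taken from the slides or from SAS procedures.

```python
import math
import random


def metropolis(log_density, start, steps, step_size, seed=0):
    """Draw samples with a random-walk Metropolis sampler from an unnormalized log density."""
    rng = random.Random(seed)
    x = start
    logp = log_density(x)
    samples = []
    for _ in range(steps):
        proposal = x + rng.gauss(0.0, step_size)
        logp_new = log_density(proposal)
        # Accept with probability min(1, p_new / p_old); otherwise stay at x.
        if logp_new >= logp or rng.random() < math.exp(logp_new - logp):
            x, logp = proposal, logp_new
        samples.append(x)
    return samples


def log_normal(x, mean=3.0, sd=1.0):
    """Log density of a normal distribution, up to an additive constant."""
    return -0.5 * ((x - mean) / sd) ** 2


samples = metropolis(log_normal, start=0.0, steps=20000, step_size=1.0)
kept = samples[2000:]  # drop the early draws, which still depend on the start value
estimate = sum(kept) / len(kept)
```

The sampler only needs the density up to a constant, which is why the same loop can simulate a pattern distribution that is known in shape but hard to normalize.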
GRAPH ANALYSIS
The objectives of Network Analysis are:
• Identifying the subnets (communities) with high potential for information exchange.
• Measuring changes over time.
• Producing initiatives that increase the enterprise's presence in the individual communities, knowing the spreading strength of each community.
[Diagram labels: Network, Community]
GRAPH ANALYSIS
• A network is a collection of relationships among nodes, expressed as links.
• A node is an individual characterized by qualities that can be transmitted through the links (impulses).
• A link is the relationship that connects 2 nodes. It can be outgoing, incoming, or undirected.
[Diagram: example network with nodes numbered 0 to 16; labels: Link, Node]
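The definitions above map directly onto a small data structure. The sketch below stores directed links as (source, target) pairs, counts outgoing and incoming links per node, and groups nodes into connected subnets. Treating connected components as "communities" is a deliberate simplification for illustration; real community detection uses denser-than-average connectivity, and the function names and sample links here are assumptions.

```python
from collections import defaultdict, deque


def degrees(links):
    """Count outgoing and incoming links per node for directed (source, target) pairs."""
    out_deg, in_deg = defaultdict(int), defaultdict(int)
    for src, dst in links:
        out_deg[src] += 1
        in_deg[dst] += 1
    return dict(out_deg), dict(in_deg)


def connected_groups(links):
    """Group nodes that can reach each other when link direction is ignored."""
    neighbors = defaultdict(set)
    for src, dst in links:
        neighbors[src].add(dst)
        neighbors[dst].add(src)
    seen, groups = set(), []
    for start in sorted(neighbors):
        if start in seen:
            continue
        seen.add(start)
        queue, group = deque([start]), []
        # Breadth-first walk collects every node reachable from the start node.
        while queue:
            node = queue.popleft()
            group.append(node)
            for nxt in sorted(neighbors[node]):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        groups.append(sorted(group))
    return groups


# Hypothetical network: a triangle of nodes 0, 1, 2 and a separate pair 3, 4.
links = [(0, 1), (1, 2), (2, 0), (3, 4)]
out_deg, in_deg = degrees(links)
groups = connected_groups(links)
```

Running the same functions on snapshots of the links taken at different times gives a simple way to measure how the subnets change.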
SAS® VISUAL ANALYTICS
A single solution for statistical visualization and reporting
… provide very easy-to-use, yet sophisticated, statistical graphic tools to all of your users?
… use ad hoc exploration and visualizations to analyze multivariate results?
… quickly produce mobile dashboards and reports that convey more foresight than hindsight?
SAS® VISUAL ANALYTICS
BUSINESS VISUALIZATION DRIVEN BY ANALYTICS
• EXPLORATION AND VISUALIZATION
• POWER OF ANALYTICS
• RAPID DELIVERY OF MOBILE INSIGHTS
BUSINESS VISUALIZATION: THE DIFFERENCE BETWEEN RAPID INSIGHT AND FAST INFORMATION
[Diagram labels: DATA VISUALIZATION, ANALYTIC VISUALIZATION, EXPLORATION, DISCOVERY]
BENEFITS: INCREASE THE USE OF ANALYTICS AND BI
• Self-service
• Easy to use Analytics
• Work with more data
• Reporting and Dashboards
• Mobile BI
• Collaboration
SAS® VISUAL ANALYTICS: MEETING YOUR BUSINESS NEEDS THROUGH FLEXIBILITY
• Traditional “on premise” Deployments
• Public, Private, Hybrid
• SAS Cloud & SAS Solutions on Demand