BIG DATA: CHALLENGES AND SOLUTIONS
Jo Prichard, Architect, LexisNexis
7/31/2012

LexisNexis Risk Solutions is a leading global provider of content-enabled workflow solutions to help clients across multiple industries predict, assess and manage risk

§ Total Revenue: $1.4 B (2011)
§ Industries Served: Insurance, Background Screening, Financial Services, Receivables Management, Health Care, Legal and Government
§ Headquarters: Alpharetta, Georgia
§ Number of offices: 27+
§ Employees: 3,900

2011 LN Risk Solutions Revenue = $1,389 m
[Chart segments: Gov't, Insurance, Business Services, Financial Services]
Note: Chart excludes c. $100 m law firm revenues included in Legal & Professional

Copyright © 2012 LexisNexis. All rights reserved

HPCC Systems: Big Data Analytics and Processing Platform

[Diagram labels: Public Records, Proprietary Data, News Article, Unstructured Records, Structured Records; Entity Resolution, Link Analysis, Clustering Analysis, Complex Analysis]

HPCC Systems Components

• The HPCC Systems platform includes:
  • Thor: batch oriented data manipulation, linking and analytics engine
  • Roxie: real-time data delivery and analytics engine
  • A high level declarative data oriented language: ECL
    • Implicitly parallel
    • No side effects
    • Code/data encapsulation
    • Extensible
    • Highly optimized
    • Builds graphical execution plans
    • Compiles into C++ and native machine code
    • Common to Thor and Roxie
  • An extensive library of ECL modules, including data profiling, linking and Machine Learning

The Benefits of HPCC Systems

Speed
§ Scales to extreme workloads quickly and easily
§ Increased speed of development leads to faster production/delivery
§ Improved developer productivity

Capacity
§ Enables massive joins, merges, sorts and data transformations
§ Increases business responsiveness
§ Accelerates creation of new services via rapid prototyping capabilities
§ Offers a platform for collaboration and innovation leading to better results

Cost Savings
§ Commodity hardware and fewer people can do much more in less time
§ Uses IT resources efficiently via sharing and higher system utilization

LexID℠ Overview

The fastest linking technology platform available with results that help you make intelligent information connections.

LexID℠ is the ingredient behind our products that turns disparate information into meaningful insights. This technology enables customers using our products to identify, link and organize information quickly with a high degree of accuracy.

LexID is the linking technology behind our products that helps customers:

• Get a More Complete Picture. Make intelligent information connections beyond the obvious by drawing insights from both traditional and new sources of data.
• Better Results, Faster. Use the fastest technology for processing large amounts of data to help you solve cases more quickly and confidently.
• Protect private information. Keep customer SSNs and FEINs secure and enjoy peace of mind knowing you are taking steps to observe the highest levels of privacy and compliance.

Public Data Social Graph

• Social Graph Overview
  • What is a Social Graph.
  • Examples of Social Graph Analytics seen every day.
• LexisNexis Public Data Social Graph (PDSG)
  • Relationships inferred from 50 TB of Public Data.
  • People connected to people, assets, businesses and more.
  • High Value relationships for Mapping trusted networks.
• Large Scale Data Fabrication and Analytics.
  • Thousands of data sources to ingest, clean, aggregate and link.
  • Approximately 270 million Active Identities and 4 billion people relationships.
  • 140 billion intermediate data points when running collusion analysis on property transfers.
• Innovative Examples leveraging the LexisNexis PDSG
  • Healthcare.
    • Medicaid/Medicare Fraud.
    • Drug Seeking Behavior.
  • Financial Services.
    • Mortgage Fraud.
    • Anti Money Laundering.
    • "Bust out" Fraud.
  • Insurance.
    • Staged Accident Fraud.

LexisNexis Confidential and Proprietary

LexisNexis Public Data Social Graph Background

• Pre-2001: Building Social Graph for Data Products
• 2001: Built Tools to Explore the Social Graph: Relavint
• Tool in use across a wide array of products from Financial Services to Healthcare and Government, e.g. Real Time Crime Centre in NYC
• Learned that the real value lies in analytics that measure the Social Graph at a social level, at scale.

Social Context Matters

Complex variables for complex scenarios
Between what? Close to what? Central to what?
§ Not enough to know 'influence'; influence over what?
§ Large Scale Graph algorithms create variables for specific signals in the Social Graph.

Simple Examples
§ Property professionals in your trusted social network increase the risk of your transactions having some fraudulent element.
§ Patients with a medical professional in their trusted social network could have lower hospital re-admittance rates.

Crowdsourcing Fraud
§ Industry expertise in a system raises the risk of the system being gamed (Insiders).
§ When someone learns how to game the system, they escalate the scheme by involving trusted family and associates to avoid detection.
§ Traditionally extremely difficult to detect: relationships are undisclosed and activities are spread and hidden.

Big Data Sweet Spot
§ Social Media companies infer strategic social influence, skills etc.
§ LexisNexis infers a different set of skills and identifies very specific types of influencers.
§ HPCC Systems platform, Public Data Social Graph and Analytics have made detecting organized syndicates our Big Data Sweet Spot.
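As a rough illustration of what a "variable for a specific signal" might look like, the sketch below counts, for each person, how many direct contacts carry a given flag (for example, property professional). The edge list, the names and the one-hop definition of a "trusted network" are all assumptions made for the example; the slides do not describe how the actual variables are computed.

```python
from collections import defaultdict


def count_flagged_contacts(edges, flagged):
    """For each person, count direct contacts that appear in `flagged`.

    `edges` is an iterable of undirected (person, person) pairs and
    `flagged` is a set of people carrying the signal of interest.
    """
    neighbors = defaultdict(set)
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)
    return {person: len(contacts & flagged) for person, contacts in neighbors.items()}


# Hypothetical toy network; names and flags are invented for illustration.
edges = [("ann", "bob"), ("ann", "cat"), ("bob", "dan"), ("cat", "dan")]
property_professionals = {"bob", "cat"}
variable = count_flagged_contacts(edges, property_professionals)
```

A variable like this could then be fed into a risk score alongside other attributes; at the scale described in the slides it would be computed on a distributed platform rather than in memory.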

Public Data Social Graph: Staged Accident Network Detection

Public Data Social Graph: Staged Accident Network Detection 2

Mapping social flow of claims.
[Figure legend: Red = Suspect Claims; Circle = SIU Flag; Blue = benign]

MORTGAGE FRAUD ATTRIBUTES: Improve visibility into hidden relationship risk

LexisNexis Collusion Attributes help customers quickly identify undisclosed relationships between entities, which assists in segmenting and detecting potentially fraudulent loans within their portfolios.

• Leverages the use of supercomputing power and analytics to target organized syndicates and rings of collusion.
• Allows customers to generate internal scores to segment out the riskiest loans involving collusion.
• Identifies people and businesses at the front end of a fraud investigation that may expose your company to risk.
• Exposes key perpetrators to improve remediation and recourse opportunities.
• Identifies hard to detect schemes like illegal flipping schemes, equity stripping, builder bailout, artificial property price inflation, straw buyers and more.

Organized rings use relationships to obscure suspicious activities. LexisNexis uses relationships to expose suspicious syndicates.

HEALTHCARE: MEDICAID CASE STUDY

Scenario
Proof of concept for the Office of the Medicaid Inspector General (OMIG) of a large Northeastern state. Social groups game the Medicaid system, which results in fraud and improper payments.

Task
Given a large list of names and addresses, identify social clusters of Medicaid recipients living in expensive houses and driving expensive cars.

Result
Interesting recipients were identified using asset variables, revealing hundreds of high-end automobiles and properties. Leveraging the Public Data Social Graph, large social groups of interesting recipients were identified along with links to provider networks. The analysis identified key individuals not in the data supplied, along with connections to suspicious volumes of "property flipping" potentially indicative of mortgage fraud and money laundering.
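One simple way to picture the task is a two-step process: filter recipients on an asset variable, then group the remaining people into connected components of the relationship graph. The sketch below does this with a small union-find structure. The recipient IDs, asset values, links and the 500,000 threshold are invented for illustration, and connected components are only a stand-in for whatever clustering the proof of concept actually used.

```python
def social_clusters(people, links):
    """Group `people` into connected components using union-find.

    Links touching anyone outside `people` are ignored.
    """
    parent = {p: p for p in people}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for a, b in links:
        if a in parent and b in parent:
            parent[find(a)] = find(b)

    clusters = {}
    for p in people:
        clusters.setdefault(find(p), set()).add(p)
    # Largest clusters first, ties broken alphabetically for stable output.
    return sorted((sorted(c) for c in clusters.values()), key=lambda c: (-len(c), c))


# Hypothetical data: asset values per recipient and known relationships.
asset_value = {"r1": 900_000, "r2": 40_000, "r3": 750_000, "r4": 1_200_000, "r5": 820_000}
links = [("r1", "r3"), ("r3", "r4"), ("r2", "r5"), ("r1", "r2")]
THRESHOLD = 500_000  # assumed cut-off for an "interesting" recipient

interesting = [r for r, value in asset_value.items() if value >= THRESHOLD]
clusters = social_clusters(interesting, links)
```

Here `r2` falls below the threshold, so its links do not join `r5` to the larger group.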

Big Data Visualizations: Division of Workload

Distributed compute platform (THOR)
§ Extract, Transform, Load.
§ Heavy duty lifting to create valuable data points.
§ Data intensive compute algorithms, e.g. Ranking Algorithms.

Distributed Query Platform (ROXIE)
§ High speed access to massive distributed indexes.
§ Allows for complex high speed big data queries.

Single high level language across THOR and ROXIE (ECL)
§ Allows for on the fly joins, sorts and aggregation.

REST, SOAP, JSON
§ Queries can return results in various forms.
§ Easy for visualizations to consume.
§ Allows for both lightweight web visualization and heavy duty desktop visualization.

Example 1: Wiki Pageview Interest

Scenario
21 billion rows of Wikipedia hourly pageview statistics for a year.

Task
Generate meaningful statistics to understand aggregated global interest in all Wikipedia pages across the 24 hours of the day, built from all English Wikipedia pageview logs for 12 months.

Result
Produces page statistics that can be queried in seconds to visualize at which times of day each Wikipedia page is most actively viewed. Helps gain insight into both regional and time of day key interest periods in certain topics. This result can be leveraged with Machine Learning to cluster pages with similar 24 hr fingerprints.

1 Steve_Jobs
2 Whitney_Huston
3 Wikipedia:SOPA_initiative/Learn_more
4 List_of_HTTP_status_codes%231xx_Informational
5 Adele_(singer)
6 Bruno_Mars
7 Kim_Jong-il
8 Jeremy_Lin
9 Heavy_D
10 Christopher_Columbus
11 Murder_of_Meredith_Kercher
12 Eli_Manning
13 Jorge_Luis_Borges
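A 24 hr fingerprint can be thought of as the share of a page's views falling in each hour of the day. The sketch below computes that from `(page, hour, views)` rows; the row layout and the view counts are assumptions for illustration, not the format of the actual pageview logs, and the real job ran on Thor rather than in a single Python process.

```python
from collections import defaultdict


def hourly_fingerprints(rows):
    """Return, per page, a 24-element list of view shares by hour of day.

    `rows` is an iterable of (page, hour, views) with hour in 0..23.
    """
    totals = defaultdict(lambda: [0] * 24)
    for page, hour, views in rows:
        totals[page][hour] += views

    fingerprints = {}
    for page, counts in totals.items():
        total = sum(counts)
        fingerprints[page] = [c / total if total else 0.0 for c in counts]
    return fingerprints


# Hypothetical rows; page names come from the slide, counts are made up.
rows = [
    ("Steve_Jobs", 0, 10),
    ("Steve_Jobs", 13, 30),
    ("Steve_Jobs", 13, 60),
    ("Adele_(singer)", 20, 5),
]
fingerprints = hourly_fingerprints(rows)
```

Because each fingerprint sums to 1, pages with very different traffic volumes become comparable, which is what a clustering step over fingerprints would need.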

Example 2: WikiGraph

Scenario
Calculate Google Page Rank to be used to rank search results and drive visualisations.

Task
Load the 75 GB English Wikipedia XML snapshot. Strip page links from all pages and run 20 iterations of the Google Page Rank algorithm. Generate indexes and a Roxie query to support visualization.

Result
Produces approximately 300 million links between 15 million pages. Page Rank allows for ranking results in searching and driving more intuitive visualizations. Advanced Interactive Visualization uses calculated page rank to size and color nodes. Lays a foundation for advanced graph algorithms that combine ranking, NLP and ML at scale.
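For readers unfamiliar with the algorithm, the following is a minimal in-memory sketch of iterative PageRank run for a fixed 20 iterations, as in the task above. The damping factor of 0.85, the handling of pages without outgoing links and the four-page toy graph are assumptions; the slides do not state which variant was implemented in ECL.

```python
def pagerank(links, iterations=20, damping=0.85):
    """Iterative PageRank over `links`, a dict of page -> list of linked pages.

    Rank held by pages with no outgoing links is spread evenly over all
    pages each iteration, so the ranks always sum to 1.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}

    for _ in range(iterations):
        dangling = sum(rank[p] for p in pages if not links.get(p))
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {p: base for p in pages}
        for source, targets in links.items():
            if targets:
                share = damping * rank[source] / len(targets)
                for target in targets:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical toy link graph standing in for the Wikipedia page links.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(links)
```

The resulting scores could then be used as described in the result above, for example to order search hits or to size nodes in a graph visualization.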

Example 3: Enron Email Flow

Scenario
Map email originations to recipients.

Task
Load the 210 GB XML Enron email dataset. Calculate aggregates for from, to. Visualize in a d3.js Chord Chart.

Result
Visual mapping of key recipients within Enron.

Next Steps
Implement a time slider for historical context across time.
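The "aggregates for from, to" step amounts to counting messages per sender and recipient pair. A chord layout such as the one in d3.js consumes a square matrix of flows, so the sketch below also arranges the counts that way. The input shape `(sender, [recipients])` and the addresses are assumptions for illustration; parsing the XML dataset is not shown.

```python
from collections import Counter


def email_flow_matrix(emails):
    """Count emails per (sender, recipient) pair and return (names, matrix).

    matrix[i][j] is the number of emails sent from names[i] to names[j].
    """
    counts = Counter()
    for sender, recipients in emails:
        for recipient in recipients:
            counts[(sender, recipient)] += 1

    names = sorted({name for pair in counts for name in pair})
    index = {name: i for i, name in enumerate(names)}
    matrix = [[0] * len(names) for _ in names]
    for (sender, recipient), n in counts.items():
        matrix[index[sender]][index[recipient]] = n
    return names, matrix


# Hypothetical messages; addresses are placeholders, not from the dataset.
emails = [
    ("alice@example.com", ["bob@example.com", "carol@example.com"]),
    ("bob@example.com", ["alice@example.com"]),
    ("alice@example.com", ["bob@example.com"]),
]
names, matrix = email_flow_matrix(emails)
```

Serializing `names` and `matrix` as JSON would give a front end everything it needs to draw the chart; adding a timestamp to the grouping key would be one way to support the time slider mentioned under Next Steps.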