Enron Emails as Graph Data Corpus for Large-scale Graph Querying Experimentation
Michal Laclavík, Martin Šeleng, Marek Ciglan, Ladislav Hluchý

Motivation and Approach
Motivation
• To exploit information and knowledge included in email communication
Approach
• Social network extraction
• Extraction of entities such as People, Organizations, Locations, Contact data
• Forming semantic trees and graphs
• User interaction with graph data
Bratislava, 26 October 2011 GCCP 2011

Email Social Networks
• Email social networks are less explored
  – Several scientific publications: Apache mailing list, Enron, …
  – Commercial: Xobni (contacts and attachments)
• Benefit
  – Web social network sites: owned by third parties
  – Email SN: owned by organization, individual or community
  – Additional level of interaction and context is present in emails
• Information and Knowledge
  – People, locations, contacts, products, services, attachments or links
  – Interactions
  – Time
• Discovering relations can bring significant benefits
• Spread of activation: a simple way to discover relations
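The slides name spread of activation as the way relations are discovered but do not show the algorithm. The block below is a minimal sketch of one common variant, not the authors' implementation: activation starts at seed nodes and is split evenly among neighbours at each step, damped by a decay factor. The function name, the parameters `decay`, `threshold` and `max_steps`, and the toy graph are all assumptions made for illustration.

```python
from collections import defaultdict


def spread_activation(graph, seeds, decay=0.5, threshold=0.01, max_steps=3):
    """Propagate activation from seed nodes over an adjacency-list graph.

    Hypothetical sketch: each active node passes `decay` times its incoming
    activation to its neighbours in equal shares; shares below `threshold`
    are dropped.
    """
    activation = defaultdict(float)
    frontier = {node: 1.0 for node in seeds}
    for node, value in frontier.items():
        activation[node] += value

    for _ in range(max_steps):
        next_frontier = defaultdict(float)
        for node, value in frontier.items():
            neighbours = graph.get(node, ())
            if not neighbours:
                continue
            share = value * decay / len(neighbours)
            if share < threshold:
                continue
            for neighbour in neighbours:
                next_frontier[neighbour] += share
        if not next_frontier:
            break
        for node, value in next_frontier.items():
            activation[node] += value
        frontier = next_frontier
    return dict(activation)


# Toy graph (invented for illustration): people and a company linked via messages.
toy_graph = {
    "alice": ["m1", "m2"],
    "bob": ["m1"],
    "carol": ["m2"],
    "acme": ["m1"],
    "m1": ["alice", "bob", "acme"],
    "m2": ["alice", "carol"],
}

scores = spread_activation(toy_graph, ["alice"])
# Rank everything except the seed by accumulated activation.
related = sorted((n for n in scores if n != "alice"), key=scores.get, reverse=True)
print(related)
```

In this toy graph carol ends up ranked above bob, because the message she shares with alice has fewer participants and so passes on a larger share of activation.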

Ontea: Information Extraction Tool
http://ontea.sf.net
• Regex patterns
• Gazetteers
• Results
  – Key-value pairs
  – Structured into trees
  – Graphs
• Transformers, Configuration
• Automatic loading of extractors
• Visual Annotation Tool
• Integration with external tools
  – GATE, Stemmers, Hadoop …
• Multilingual tests
  – English, Slovak, Spanish, Italian
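To make the combination of regex patterns and gazetteers concrete, here is a small sketch that emits key-value pairs from a piece of text. It does not use Ontea's API; the `PATTERNS` and `GAZETTEER` tables, the `extract` function and the sample sentence are assumptions for illustration, with key names borrowed from the email example in this deck.

```python
import re

# Hypothetical pattern table: key name -> compiled regular expression.
PATTERNS = {
    "Email": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "TelephoneNumber": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
}

# Hypothetical gazetteer: key name -> list of known surface forms.
GAZETTEER = {
    "Company": ["ENRON", "UBS Warburg Energy, LLC"],
}


def extract(text):
    """Return (key, value) pairs found by the patterns and the gazetteer."""
    pairs = []
    for key, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            pairs.append((key, match.group(0)))
    for key, names in GAZETTEER.items():
        for name in names:
            if name in text:  # case-sensitive exact lookup
                pairs.append((key, name))
    return pairs


sample = "Contact mike.grigsby@enron.com at UBS Warburg Energy, LLC, phone 713-853-7031."
pairs = extract(sample)
for key, value in pairs:
    print(f"{key}=>{value}")
```

A real extractor would also need normalization and disambiguation (for example, lowercase "enron" inside the email address is deliberately not matched here as a company).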

Business objects in Emails
• A study of 6 organizations shows:
  – Objects can be identified by patterns and gazetteers
  – It is possible to define a set of common objects
• Objects identified:
  – Organization: org:Name, org:RegNo, org:TaxNo
  – Person: person:Name, person:Function
  – Contact: contact:Phone, contact:Email, contact:Webpage
  – Address: address:ZIP, address:Street, address:Settlement
  – Product: product:Name, product:Module, product:Component, product:BOID
  – Document: doc:Invoice, doc:Order, doc:Contract, doc:ChangeRequest
  – Inventory: inventory:ResID, inventory:ResType
  – Other business object: ID:BOID

Email Social Graph/Network

Email Search Prototype
• Use of social network from email
• Includes extracted objects
• Full text of extracted objects
• Related objects discovered and ordered by spread of activation on the social network graph
• Faceted search, navigation
gSemSearch: Graph based Semantic Search
• http://ikt.ui.sav.sk/esns/

Email Example
1 Vertex: Doc=>/home/misos/enron/test/6.eml
1 Vertex: Quote=>/6.eml 0:1:0
2 Edge: (Doc=>/home/misos/enron/test/6.eml)=>(Quote=>/6.eml 0:1:0)
1 Vertex: Paragraph=>/6.eml 0:1:0
2 Edge: (Quote=>/6.eml 0:1:0)=>(Paragraph=>/6.eml 0:1:0)
1 Vertex: Sentence=>/6.eml 0:1:0
2 Edge: (Paragraph=>/6.eml 0:1:0)=>(Sentence=>/6.eml 0:1:0)
1 Vertex: DateTime=>Fri, 8 Mar 2002 06:46:07 -0800 (PST)
2 Edge: (Sentence=>/6.eml 0:1:0)=>(DateTime=>Fri, 8 Mar 2002 06:46:07 -0800 (PST))
1 Vertex: Email=>mike.grigsby@enron.com
2 Edge: (Sentence=>/6.eml 0:1:0)=>(Email=>mike.grigsby@enron.com)
1 Vertex: Email=>robert.badeer@enron.com
2 Edge: (Sentence=>/6.eml 0:1:0)=>(Email=>robert.badeer@enron.com)
1 Vertex: Person:Name=>Grigsby, Mike
2 Edge: (Sentence=>/6.eml 0:1:0)=>(Person:Name=>Grigsby, Mike)
1 Vertex: Company=>ENRON
2 Edge: (Sentence=>/6.eml 0:1:0)=>(Company=>ENRON)
1 Vertex: Person:Name=>Badeer, Robert
2 Edge: (Sentence=>/6.eml 0:1:0)=>(Person:Name=>Badeer, Robert)
1 Vertex: Company=>ENRON
2 Edge: (Sentence=>/6.eml 0:1:0)=>(Company=>ENRON)
1 Vertex: Person:GivenName=>Robert
2 Edge: (Sentence=>/6.eml 0:1:0)=>(Person:GivenName=>Robert)
1 Vertex: Person:Name=>Badeer, Robert
2 Edge: (Sentence=>/6.eml 0:1:0)=>(Person:Name=>Badeer, Robert)
1 Vertex: Paragraph=>/6.eml 659:19:0
2 Edge: (Quote=>/6.eml 0:1:0)=>(Paragraph=>/6.eml 659:19:0)
1 Vertex: Sentence=>/6.eml 659:19:0
2 Edge: (Paragraph=>/6.eml 659:19:0)=>(Sentence=>/6.eml 659:19:0)
1 Vertex: Person:Name=>Michael D. Grigsby
2 Edge: (Sentence=>/6.eml 659:19:0)=>(Person:Name=>Michael D. Grigsby)
1 Vertex: Company=>UBS Warburg Energy, LLC
2 Edge: (Sentence=>/6.eml 659:19:0)=>(Company=>UBS Warburg Energy, LLC)
1 Vertex: TelephoneNumber=>713-853-7031
2 Edge: (Sentence=>/6.eml 659:19:0)=>(TelephoneNumber=>713-853-7031)
1 Vertex: TelephoneNumber=>713-408-6256
2 Edge: (Sentence=>/6.eml 659:19:0)=>(TelephoneNumber=>713-408-6256)
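A dump like the one in this example can be loaded into an in-memory graph with a few lines of parsing. The sketch below assumes one record per line, a leading `1` or `2` marker that can be ignored, vertices written as `Type=>Value` and edges as `(source)=>(target)`. This format is inferred from the slide alone, and `parse_dump` is a hypothetical helper, not part of any released tool.

```python
def parse_dump(lines):
    """Parse 'Vertex' and 'Edge' records into a vertex set and an edge list.

    Each vertex is a (type, value) tuple; each edge is a (source, target)
    pair of such tuples. Repeated vertex records collapse into one vertex.
    """
    vertices = set()
    edges = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        _, _, record = line.partition(" ")      # drop the leading "1"/"2" marker
        kind, _, body = record.partition(": ")
        if kind == "Vertex":
            vtype, _, value = body.partition("=>")
            vertices.add((vtype, value))
        elif kind == "Edge":
            src, _, dst = body.partition(")=>(")
            src = src[1:]                       # strip the opening "("
            dst = dst[:-1]                      # strip the closing ")"
            edges.append((tuple(src.split("=>", 1)), tuple(dst.split("=>", 1))))
    return vertices, edges


dump = """\
1 Vertex: Doc=>/home/misos/enron/test/6.eml
1 Vertex: Quote=>/6.eml 0:1:0
2 Edge: (Doc=>/home/misos/enron/test/6.eml)=>(Quote=>/6.eml 0:1:0)
1 Vertex: Company=>ENRON
1 Vertex: Company=>ENRON
"""

vertices, edges = parse_dump(dump.splitlines())
print(len(vertices), "vertices,", len(edges), "edges")
```

Keeping vertices in a set means the repeated `Company=>ENRON` record maps to a single node, which is what lets separate sentences and emails connect through shared entities.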

Enron Graph Corpus Statistics

Conclusions and Future Directions

Future Direction: Relations Discovery in Large Graph Data
• Motivation
  – Graph/network data are everywhere: social networks, web, LinkedData, transactions, communication (email, phone)
  – Text can also be converted to a graph
  – Interconnecting graph data and searching for relations is crucial
• Approach
  – Forming semantic trees and graphs from text, web, communication, databases and LinkedData
  – User interaction with graph data in order to achieve integration and data cleansing
  – Users will do it if their effort has an immediate impact on search results

SGDB: Simple Graph Database
• Storage for graphs
• Optimized for graph traversing and spread of activation
• Faster than Neo4j for graph traversing operations
• Supports Blueprints API
  https://simplegdb.svn.sourceforge.net/svnroot/simplegdb/Sgdb3
• Graph Database Benchmarks
  – Graph Traversal Benchmark for Graph Databases
  – http://ups.savba.sk/~marek/gbench.html
  – Blueprints API: possibility to test compliant graph databases
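The kind of operation a traversal benchmark exercises can be sketched as a breadth-first walk limited to a fixed number of hops and timed from the outside. This is not the benchmark referenced above and does not use the Blueprints API; `k_hop_traversal`, `make_ring` and the ring-shaped test graph are assumptions chosen to keep the example self-contained.

```python
import time
from collections import deque


def k_hop_traversal(adjacency, start, max_hops):
    """Return every vertex reachable from `start` within `max_hops` edges."""
    visited = {start}
    queue = deque([(start, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbour in adjacency.get(node, ()):
            if neighbour not in visited:
                visited.add(neighbour)
                queue.append((neighbour, depth + 1))
    return visited


def make_ring(n):
    """Build a hypothetical test graph: n vertices, each linked to both neighbours."""
    return {i: [(i - 1) % n, (i + 1) % n] for i in range(n)}


ring = make_ring(1000)
started = time.perf_counter()
reached = k_hop_traversal(ring, 0, 3)
elapsed = time.perf_counter() - started
print(f"visited {len(reached)} vertices in {elapsed:.6f} s")
```

A meaningful comparison between stores would run the same traversal against each backend on the same data, with repeated runs and warm-up; a single in-memory timing like this only illustrates the shape of the measurement.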

Conclusion
• Email Archives
  – Valuable source of knowledge
  – Hidden social networks owned by enterprise or individual
  – Information extraction and social network analysis can help
• Challenges
  – Graph based querying
  – New data and approach for information search
  – Relation search
• Applications
  – Recommendation and search in emails
  – Population of databases (cold start problem)
  – Possibility to extend the social network graph with transaction data, processed document repositories and other business data
  – Business intelligence and knowledge management