ANALYZING THE ENRON EMAILS USING TOPICAL ANALYSIS AND GRAPH THEORY
CASEY KALINOWSKI
FACULTY MENTORS: DR. ZAKARIA KURDI & DR. KIM MCCABE

STATEMENT OF THE PROBLEM
• The Enron Scandal
  • 2001 bankruptcy of the Enron Corporation
  • Investors lost millions of dollars
  • Investigations by the FBI, IRS, and the Securities and Exchange Commission lasting for 5 years
• The Enron Emails
  • 520,914 emails from over 150 employees of the Enron Corporation
  • Obtained during the investigation and released to the public as the largest dataset of its kind
Is it possible to analyze the emails with modern techniques in a way that is more efficient and effective than the original investigation?

APPLICATION OF THEORY
• Topical Analysis
  • Artificial Intelligence
  • Natural Language Processing
  • Create topics and assign a topic to each piece of data
• Graph Theory
  • Application of graphs to research
  • Graph – nodes connected by edges
  • Node degree – number of edges coming from a node
  • Edge weight – numerical value attached to the edge between two connected nodes
http://cs.slu.edu/~esposito/teaching/1080/webscience/graphs.html

APPLICATION OF THEORY
TOPICAL ANALYSIS MODELS
• Gensim
  • Latent Dirichlet Allocation Model
• Natural Language Toolkit (NLTK)
  • WordNet
• Keyword Search
CREATING THE GRAPHS
• Nodes – unique email address
• Node degree – number of edges connecting to other email addresses
• Edges – email between two nodes
• Edge weight – number of emails sent between two nodes

METHODOLOGY
• Research Design
  • Retrospective - relies on previously collected data
  • Exploratory - little or no previous research on this topic
• Source of Data
  • 520,914 emails from 156 employees of the Enron Corporation
  • Collected during the Enron investigation and released to the public by the Federal Energy Regulatory Commission in 2004
  • Retrieved from Carnegie Mellon University
• Test Sample
  • 5,890 emails from the inbox/outbox of Kenneth Lay

ANALYSIS OF THE DATA
• Processing the data
• Creating the topics
• Topical analysis of the data
• Creating the graphs
• Analyzing the graphs

PROCESSING THE DATA
CONVERTING THE EMAILS FROM .CSV FORMAT TO .JSON FORMAT WITH PYTHON

.CSV
"dasovich-j/notes_inbox/526.", "Message-ID: 10768729.1075843109928.JavaMail.evans@thyme
Date: Wed, 20 Sep 2000 10:51:00 -0700 (PDT)
From: steven.kean@enron.com
To: david.parquet@enron.com
Subject:
Cc: jeff.dasovich@enron.com, sandra.mccubbin@enron.com
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Bcc: jeff.dasovich@enron.com, sandra.mccubbin@enron.com
X-From: Steven J Kean
X-To: David Parquet
X-cc: Jeff Dasovich, Sandra McCubbin
X-bcc:
X-Folder: Jeff_Dasovich_Dec2000Notes FoldersNotes inbox
X-Origin: DASOVICH-J
X-FileName: jdasovic.nsf

I talked to Hettie today. It's unlikely that we are going to find time for Jeff and the Governor to talk (because of the Governor's schedule). We’ll try to set something up later. In the meantime, the Governor should just sign the bill. Of course, Hettie had already communicated this; the Gov’s office acknowledged that the message was received but did not make a specific commitment."

.JSON
{"owner": "dasovich-j",
 "from": "steven.kean@enron.com",
 "to": "david.parquet@enron.com",
 "date": "2000-09-20 10:51:00",
 "subject": "",
 "message": "I talked to Hettie today. It's unlikely that we are going to find time for Jeff and the Governor to talk (because of the Governor's schedule). We’ll try to set something up later. In the meantime, the Governor should just sign the bill. Of course, Hettie had already communicated this; the Gov’s office acknowledged that the message was received but did not make a specific commitment."}
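The conversion above could be sketched with Python's standard library alone. This is a minimal sketch, not the script used in the project: the CSV column names `file` and `message` and the helper names `email_to_record` and `csv_to_json` are assumptions made for illustration.

```python
import csv
import io
import json
from email import message_from_string
from email.utils import parsedate_to_datetime


def email_to_record(file_path, raw_message):
    """Turn one CSV row (file path, raw email text) into a flat dict."""
    msg = message_from_string(raw_message)
    sent = parsedate_to_datetime(msg["Date"])
    return {
        # Assumption: the mailbox owner is the first path component.
        "owner": file_path.split("/")[0],
        "from": msg["From"],
        "to": msg["To"],
        "date": sent.strftime("%Y-%m-%d %H:%M:%S"),
        "subject": (msg["Subject"] or "").strip(),
        "message": msg.get_payload().strip(),
    }


def csv_to_json(csv_text):
    """Convert CSV text with 'file' and 'message' columns to a JSON string."""
    reader = csv.DictReader(io.StringIO(csv_text))
    records = [email_to_record(row["file"], row["message"]) for row in reader]
    return json.dumps(records, indent=2)
```

Only the fields shown on the slide are kept; the remaining headers (Cc, Bcc, X-From, and so on) are dropped in this sketch.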

CREATING THE TOPICS
NLTK WordNet and Keyword Search
• Picked by hand
• Accounting, Bankruptcy, Fraud, Leisure, Management, Meeting, Stock, and None

CREATING THE TOPICS
Gensim - Latent Dirichlet Allocation (LDA)
• Create a training set from only the nouns and plural nouns in the emails
• Feed the training set to Gensim to create the LDA model
• Process the emails with the LDA model to create lists of topics
• Take the top six topics and use them with the WordNet technique
  • Business, Employees, Information, Market, People, and Stock
Example:
• Email: “I talked to Hettie today. It's unlikely that we are going to find time for Jeff and the Governor to talk (because of the Governor's schedule). We’ll try to set something up later. In the meantime, the Governor should just sign the bill. Of course, Hettie had already communicated this; the Gov’s office acknowledged that the message was received but did not make a specific commitment.”
• Nouns only: “today time governor talk schedule try set something meantime governor sign bill course office message commitment”
• LDA model output:
  • Topic 0: talk, bill, office, commitment
  • Topic 1: governor, course, office, today
  • Topic 2: talk, schedule, message, set

TOPICAL ANALYSIS OF THE DATA
Keyword Search
• Very simple and easy to implement
• Keyword with the highest number of occurrences in an email becomes that email’s topic
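The keyword search rule can be illustrated in a few lines. This is a sketch under assumptions: the keywords are taken to be the hand-picked topic names themselves, an email with no keyword hits falls back to "None", and ties go to the keyword listed first. The names `TOPIC_KEYWORDS` and `keyword_topic` are hypothetical.

```python
import re
from collections import Counter

# Assumption: each hand-picked topic name doubles as its own keyword.
TOPIC_KEYWORDS = (
    "accounting", "bankruptcy", "fraud", "leisure",
    "management", "meeting", "stock",
)


def keyword_topic(message, keywords=TOPIC_KEYWORDS):
    """Return the topic whose keyword occurs most often in the message."""
    words = Counter(re.findall(r"[a-z]+", message.lower()))
    counts = {kw: words[kw] for kw in keywords}
    best = max(counts, key=counts.get)  # first keyword wins a tie
    return best.capitalize() if counts[best] > 0 else "None"
```

Exact word matching like this misses inflected forms such as "stocks" or "meetings", which is one plausible reason a keyword approach trails the WordNet-based methods in accuracy.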

TOPICAL ANALYSIS OF THE DATA
NLTK WordNet
• Use WordNet to find the hyponyms of every topic
• Compare every hyponym of every topic to every sense of every word in each of the emails
  • Use WordNet to find the similarity between the hyponym and the sense of the word
  • Similarity of 0.25 or higher gets scored 1 point
• The topic with the highest weighted average score becomes the topic of the email
  • Weighted Average = (Total score / Total number of word senses in the email)
https://sourcedexter.com/find-synonyms-and-hyponyms-using-python-nltk-and-wordnet%e2%80%8b/
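The scoring rule above can be written down independently of WordNet. In this sketch the similarity measure is passed in as a function, because the slide does not say which WordNet measure was used; with NLTK it might be something like `Synset.path_similarity`, but that is an assumption. The names `topic_score` and `pick_topic` are hypothetical.

```python
def topic_score(topic_hyponyms, word_senses, similarity, threshold=0.25):
    """Weighted average = total points / number of word senses in the email."""
    if not word_senses:
        return 0.0
    points = 0
    for hyponym in topic_hyponyms:
        for sense in word_senses:
            sim = similarity(hyponym, sense)
            # Some similarity measures return None when no path exists.
            if sim is not None and sim >= threshold:
                points += 1
    return points / len(word_senses)


def pick_topic(hyponyms_by_topic, word_senses, similarity):
    """Return (best topic, all scores) for one email."""
    scores = {
        topic: topic_score(hyponyms, word_senses, similarity)
        for topic, hyponyms in hyponyms_by_topic.items()
    }
    return max(scores, key=scores.get), scores
```

Every hyponym is compared with every word sense, so the cost per email grows with (number of hyponyms) × (number of senses), which fits the hours-long runtimes reported for the WordNet-based methods.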

TOPICAL ANALYSIS OF THE DATA
• Tested on 50 random emails
• Manually assign topics to each email
• Run topical analysis on the emails
• Compare the generated topics to the manually assigned topics
[Chart: Runtime in Hours, for Gensim + WordNet, WordNet, and Keyword Search. Values shown: 3.2, 1.2, and 0.0633 (Keyword Search).]
[Chart: Accuracy (0, 1). Gensim + WordNet 0.76, WordNet 0.78, Keyword Search 0.46.]
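The accuracy figure is the fraction of sampled emails whose generated topic matches the manually assigned one. A minimal sketch, with the function name `accuracy` chosen for illustration:

```python
def accuracy(generated, manual):
    """Fraction of emails whose generated topic equals the manual topic."""
    if len(generated) != len(manual):
        raise ValueError("both lists must cover the same emails")
    if not manual:
        raise ValueError("need at least one labelled email")
    matches = sum(g == m for g, m in zip(generated, manual))
    return matches / len(manual)
```

With a 50-email sample, each email moves the result by 0.02, so small differences between two methods are within one email of each other.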

CREATING THE GRAPHS
• NetworkX – Python tool
• Load emails from the .JSON file
• Create a node for each unique email address
• Create an edge between the sending node and the receiving node and add 1 to the edge weight
• Export as a .graphml file
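The bookkeeping behind these steps can be shown in plain Python. This sketch deliberately swaps NetworkX for a `Counter` so it runs without extra packages; it assumes an undirected graph, a comma-separated `to` field, and that self-addressed emails are skipped. The names `build_email_graph` and `node_degrees` are hypothetical.

```python
from collections import Counter, defaultdict


def build_email_graph(records):
    """Map each undirected edge (a, b) to the number of emails between a and b."""
    weights = Counter()
    for rec in records:
        sender = rec["from"].strip()
        for recipient in rec["to"].split(","):
            recipient = recipient.strip()
            if recipient and recipient != sender:
                # Sort the pair so a->b and b->a count toward the same edge.
                weights[tuple(sorted((sender, recipient)))] += 1
    return weights


def node_degrees(weights):
    """Node degree = number of distinct addresses a node shares an edge with."""
    neighbours = defaultdict(set)
    for a, b in weights:
        neighbours[a].add(b)
        neighbours[b].add(a)
    return {node: len(others) for node, others in neighbours.items()}
```

In NetworkX the same counts would typically be stored as a `weight` attribute on each edge before writing the graph out with `networkx.write_graphml` for use in Gephi.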

ANALYZING THE GRAPHS
• Gephi – open source graph analysis and visualization tool
• View graph data (node degree, edge weight, centrality, etc.)
• Visualize the graphs
• Most importantly, refine the graphs

ANALYZING THE GRAPHS
• All topics
• 31 nodes, 38 edges
• Nodes sized by degree
• Edges sized by weight

ANALYZING THE GRAPHS
• Gensim + WordNet topic: Market
• 26 nodes
• 32 edges

ANALYZING THE GRAPHS
• WordNet topic: Accounting
• 18 nodes
• 20 edges
• Note the edge between jeff.skilling@enron.com and leonardo.pacheo@enron.com

LIMITATIONS TO RESEARCH
• Reliability of the data
  • Did not collect the Enron emails personally
  • No record of what was done to the emails before being released to the public

CONCLUSION AND FUTURE PROPOSALS
• Work in progress
  • Improvements to effectiveness and efficiency
• Promising results
  • WordNet and Gensim + WordNet achieved higher accuracy than Keyword Search
• Future application
  • Once refined, the technique can be applied to other datasets

REFERENCES
10 Enron Players: Where They Landed After the Fall. (2006, January 29). The New York Times. Retrieved January 25, 2018, from http://www.nytimes.com/2006/01/29/businessspecial3/10-enron-players-where-they-landed-after-the-fall.html
Bastian, M., Heymann, S., & Jacomy, M. (2009). Gephi: an open source software for exploring and manipulating networks. International AAAI Conference on Weblogs and Social Media.
Cohen, W. W. (2015, May 8). Enron Email Dataset. Retrieved December 12, 2017, from https://www.cs.cmu.edu/~enron/
Famous Cases and Criminals: Enron. (2016, July 20). Retrieved January 20, 2018, from https://www.fbi.gov/history/famous-cases/enron
Federal Energy Regulatory Commission. (n.d.). The Western Energy Crisis, the Enron Bankruptcy, and FERC’s Response. Retrieved January 25, 2018, from https://www.ferc.gov/industries/electric/indus-act/wec/chronology.pdf
Klein, B. (n.d.). Python Advanced: Graphs in Python. Retrieved January 29, 2018, from https://www.python-course.eu/graphs_python.php
Moon, B., McCluskey, J. D., & McCluskey, C. P. (2010). A general theory of crime and computer crime: An empirical test. Journal of Criminal Justice, 38(4), 767-772. doi:10.1016/j.jcrimjus.2010.05.003
Zhao, P., Darrah, M., Nolan, J., & Zhang, C.-Q. (2014). Analyses of Crime Patterns in NIBRS Data Based on a Novel Graph Theory Clustering Method: Virginia as a Case Study. The Scientific World Journal, 2014. doi:10.1155/2014/492461
Python Software Foundation. Python Language Reference, version 3.6.4. Available at http://www.python.org
Ruohonen, K. (2013). Graph Theory (J. Tamminen, K. Lee, & R. Piché, Trans.). Retrieved January 27, 2018, from http://math.tut.fi/~ruohonen/GT_English.pdf
Salkind, N. J. (2010). Encyclopedia of Research Design. Thousand Oaks, Calif: SAGE Publications, Inc.