CSIS Stylometry System Use Cases and Feasibility Study

CSIS Stylometry System – Use Cases and Feasibility Study
Gregory Shalhoub, Robin Simon, Jayendra Tailor, Ramesh Iyer, Dr. Sandra Westcott
Student-Faculty Research Day, May 7, 2010
Seidenberg School of Computer Science and Information Systems

Stylometry
• Discipline that determines authorship of literary works through the use of statistical analysis and machine learning
• Is about pattern recognition

Stylometry
• Feature sets used for literary works
  – Lexical
    • Word or character based
    • How terms or characters are used within a community
  – Syntax
    • Patterns used to form sentences
  – Structural
    • Layout of the text
  – Content-specific
    • Words that are important within a specific domain
• Has been used to determine authorship since the mid-1400s
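The lexical and structural feature sets can be illustrated with a few simple measurements. The following is a minimal sketch, not the feature set of any tool discussed in this study; the function name `lexical_features` and the particular measurements chosen are assumptions made for illustration.

```python
import re

def lexical_features(text):
    """Return a few simple lexical and structural measurements for one text.

    Hypothetical helper for illustration only.
    """
    words = re.findall(r"[A-Za-z']+", text.lower())
    # Treat runs of ., ! or ? as sentence boundaries (a crude approximation).
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    lines = text.splitlines()
    n_words = len(words)
    return {
        # Lexical: word-based measurements
        "word_count": n_words,
        "avg_word_length": sum(len(w) for w in words) / n_words if n_words else 0.0,
        "vocabulary_richness": len(set(words)) / n_words if n_words else 0.0,
        # Syntax-adjacent: how long the author's sentences run
        "avg_sentence_length": n_words / len(sentences) if sentences else 0.0,
        # Structural: layout of the text
        "blank_line_count": sum(1 for ln in lines if not ln.strip()),
    }

sample = "Hi team.\n\nPlease send the report today. Thanks!"
features = lexical_features(sample)
print(features)
```

Each text becomes a small dictionary of numbers, and it is these numeric profiles, rather than the raw text, that a statistical or machine learning method would compare across authors.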

The Project
• Part I – Search to determine interesting and unique applications of stylometry for research
• Part II – Feasibility study on existing tools/applications for email authorship (emails of 250 words or fewer)

Existing / Potential Uses of Stylometry
• Music Lyrics
• Music Melody
• Paintings
• Literary Works
• Plagiarism
• Social Networking
• Electronic Mail
• Instant Messaging
• Forensic Linguistics
- Social networking, electronic mail, and instant messaging are still in early stages of study

Use Cases
- Twitter
  - Used to verify existing Twitter accounts and help mitigate impersonations
- Electronic mail
  - Implemented in a corporate setting, helping identify anonymous emails meant to do harm
- Chat
  - Assist in determining authorship of instant messages

Use Cases
- Terrorism
  - Help identify an author of terrorist content or identify terrorist content by using contextual analysis
  - Applied to blogs, forums, wikis, email, chat and other forms of digital content

Tools Tested
- JGAAP (Java Graphical Authorship Attribution Program)
  - Java-based tool
  - Developed by Dr. Juola at Duquesne University
  - Runs on Windows and Linux
  - Identification tool
    - 1 of n decision – Many known email authors, trying to determine the author of one unknown email
    - One unknown email compared against the 99 other known emails
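A 1 of n decision can be sketched as a nearest-neighbor search: the unknown email is assigned to the author of the closest known sample. The sketch below uses a Levenshtein (edit) distance over word sequences. It is not JGAAP's implementation, and how JGAAP applies the distance to its event sets is not described in these slides; the function names and the sample texts are assumptions for illustration.

```python
def levenshtein(a, b):
    """Edit distance between two sequences (insert, delete, substitute each cost 1)."""
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        current = [i]
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

def identify_author(unknown_text, known_samples):
    """Return the author of the known sample closest to unknown_text.

    known_samples is a list of (author, text) pairs; hypothetical interface.
    """
    unknown_words = unknown_text.lower().split()
    best_author, best_distance = None, None
    for author, text in known_samples:
        distance = levenshtein(unknown_words, text.lower().split())
        if best_distance is None or distance < best_distance:
            best_author, best_distance = author, distance
    return best_author

# Made-up samples, not data from the study.
known = [
    ("A", "please send the report by friday"),
    ("B", "lol see you at the game tonight"),
]
print(identify_author("please send the invoice by friday", known))
```

Because the decision always returns one of the known authors, this kind of tool cannot answer "none of the above"; that is the job of an authentication (match / no match) tool.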

Tools Tested
- C# Tool
  - Written in the C# programming language
  - Developed by prior Pace CS graduate students
  - Identification tool
    - 1 of n decision – Many known email authors, trying to determine the author of one unknown email
    - One unknown email compared against the 99 other known emails

Tools Tested
- Signature Tool
  - Written in C programming language
  - Created by Peter Millican from Hertford College
  - Authentication tool
    - Either match / no match
    - Match testing – 9 known and 1 unknown sample (same author)
    - No-match testing – 10 known and 1 unknown sample (two different authors)
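A match / no match decision can be sketched as a threshold on the distance between two style profiles. The example below compares word-length distributions, one of the measurements reported for the Signature tool, but it is not the Signature tool's algorithm; the function names, the sum-of-absolute-differences distance, and the threshold of 0.5 are all assumptions for illustration.

```python
from collections import Counter

def word_length_profile(texts, max_len=10):
    """Relative frequency of word lengths 1..max_len (longer words share the last bin)."""
    counts = Counter()
    for text in texts:
        for word in text.split():
            letters = [c for c in word if c.isalpha()]
            if letters:
                counts[min(len(letters), max_len)] += 1
    total = sum(counts.values())
    return [counts[n] / total if total else 0.0 for n in range(1, max_len + 1)]

def is_match(known_texts, unknown_text, threshold=0.5):
    """Return True if the unknown text's profile is close to the known profile.

    The threshold is a hypothetical value; in practice it would be tuned on test data.
    """
    known = word_length_profile(known_texts)
    unknown = word_length_profile([unknown_text])
    distance = sum(abs(k - u) for k, u in zip(known, unknown))
    return distance <= threshold

print(is_match(["the cat sat on the mat"], "the dog sat on the rug"))
print(is_match(["the cat sat on the mat"], "extraordinarily complicated terminology"))
```

Raising the threshold makes the tool accept more samples as matches, which lowers false rejections at the cost of more false acceptances.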

Testing methodology
- Each team member submitted emails from different authors
- Total of 100 emails collected from 10 different authors
- Emails exported from their native email program and saved as text files
- Average size of email: 195.7 words
- Three (3) identification and authentication tools tested
- 100 tests run on each software tool

Testing Results

JGAAP (Levenshtein Distance algorithm)
                     Canonizers
                     On     Off
Words                50%    30%
Word Length          50%    30%
Characters           60%    40%
Syllables per Word   40%    30%
Word Bigrams         70%    60%

C# Tool Match Test
Accuracy: 57%

Signature Tool Match Test
Events        Match Accuracy   FRR
Word Length   53.33%           46.67%
Letters       46.67%           53.33%

Signature Tool No-Match Test
Events        No-Match Accuracy   FAR
Word Length   53.33%              46.67%
Letters       82.22%              17.78%

Categorizing the result based on the country of the author
Tool        India    USA      (unlabeled)   (unlabeled)
JGAAP       50%      100%     NA            NA
Signature   61.11%   75.00%   81.48%        83.33%
C# Tool     42%      80.00%   NA            NA
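In the Signature tool results, each accuracy figure and its error rate sum to 100% (for example, 82.22% no-match accuracy and 17.78% FAR). This is consistent with the usual definitions: the false rejection rate (FRR) is the share of same-author tests reported as no match, and the false acceptance rate (FAR) is the share of different-author tests reported as match. A minimal sketch of that calculation follows; the function name and the outcome lists are made up for illustration and are not the study's raw data.

```python
def error_rates(match_results, no_match_results):
    """Compute (FRR, FAR) from lists of tool decisions.

    match_results:    decisions on same-author tests (True = tool said "match")
    no_match_results: decisions on different-author tests (True = tool said "match")
    """
    frr = match_results.count(False) / len(match_results)      # wrongly rejected
    far = no_match_results.count(True) / len(no_match_results)  # wrongly accepted
    return frr, far

# Hypothetical outcomes, not data from the study.
same_author_tests = [True, True, True, False]
different_author_tests = [False, False, True, False, False]

frr, far = error_rates(same_author_tests, different_author_tests)
print(f"Match accuracy: {1 - frr:.2%}, FRR: {frr:.2%}")
print(f"No-match accuracy: {1 - far:.2%}, FAR: {far:.2%}")
```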

Conclusion
- Overall, the moderate accuracy of the test results suggests that none of the tools evaluated is capable of accurate stylometric identification of email authors
- Categorizing email samples by country of origin seems to yield better accuracy for all three tools tested

Recommendations
- Further testing and research using email from authors of different countries
- Continue to refine and add to the stylistic feature set created by prior Pace graduate students
  - Emoticons
  - Font color
  - Font size
  - Embedded images
  - Hyperlinks
  - Internet 'slang' (e.g., LOL, TTYL)
- Further research on individuals who disguise their identity
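Some of the proposed feature additions can be counted directly from plain-text email. The sketch below covers emoticons, hyperlinks, and Internet slang; the regular expressions, the two-word slang list, and the function name are assumptions for illustration, not part of the existing Pace feature set. Font color, font size, and embedded images would require the email's original HTML or MIME source rather than a plain-text export, so they are not covered here.

```python
import re

# Crude patterns for illustration; a real feature set would need broader coverage.
EMOTICON = re.compile(r"[:;=][-']?[)(DPp]")
HYPERLINK = re.compile(r"https?://\S+")
SLANG = {"lol", "ttyl"}  # the two examples named in the recommendations

def email_style_counts(text):
    """Count emoticons, hyperlinks, and slang terms in one plain-text email."""
    # Remove links first so URL characters are not mistaken for emoticons or words.
    text_without_links = HYPERLINK.sub(" ", text)
    words = re.findall(r"[a-z]+", text_without_links.lower())
    return {
        "emoticons": len(EMOTICON.findall(text_without_links)),
        "hyperlinks": len(HYPERLINK.findall(text)),
        "slang": sum(1 for w in words if w in SLANG),
    }

counts = email_style_counts("LOL see http://example.com :) ttyl ;-)")
print(counts)
```

Counts like these would typically be normalized by message length before being added to a feature vector, since the emails in this study averaged fewer than 200 words.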