Authorship Attribution Using Word Network Features Shibamouli Lahiri October 16, 2012
Text Authors
Two Lenses • Rank the documents (books) • Classify the documents
Features for Classification • Standard features (state-of-the-art): ✓ most frequent words ✓ most frequent character n-grams ✓ stop words • Our approach (word networks)
Word Networks • Directed edges • Stop words included • Weighted edges (bigram counts)
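The three properties above (directed edges, stop words kept, bigram counts as weights) can be illustrated with a minimal sketch. The tokenizer and the function name `build_word_network` are assumptions made for illustration, not the pipeline used in the talk; in particular, this version lets bigrams cross sentence boundaries.

```python
import re
from collections import Counter

def build_word_network(text):
    """Return a Counter mapping (source_word, target_word) -> bigram count."""
    # Hypothetical tokenizer: lowercase, keep runs of letters and apostrophes.
    # Stop words are deliberately NOT filtered out.
    tokens = re.findall(r"[a-z']+", text.lower())
    # Each adjacent word pair is one directed edge; repeats add to its weight.
    return Counter(zip(tokens, tokens[1:]))

edges = build_word_network("The cat saw the dog. The dog saw the cat.")
for (src, dst), weight in sorted(edges.items()):
    print(f"{src} -> {dst}: {weight}")
```

A plain edge-to-weight mapping keeps the sketch dependency-free; the same edges could be loaded into any graph library for further analysis.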
Word Networks – Intuition • Natural language shows complex network structure (small world, low degree of separation, scale-free, shrinking diameter, …) • Does it have anything to do with authorship? • Other cues (TextRank)
Dataset • Project Gutenberg • 3036 books by 142 authors • Cleaned to remove metadata • Word networks ready • Features extracted
Network Features • #vertices, #edges, #scc (strongly connected components) • Degree distribution properties (min, max, avg, …) • Clustering coefficient • Neighborhood size distribution • Coreness distribution • Reciprocity • Many others … (127 in total)
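A few of the simpler global features in this list can be computed directly from an edge map. The sketch below is illustrative only: the function name `basic_network_features`, the toy edge map, and the choice to ignore edge weights when computing degree are all assumptions, and it covers only a handful of the 127 features.

```python
from collections import defaultdict

def basic_network_features(edges):
    """Compute a few global features from a {(src, dst): weight} edge map."""
    if not edges:
        raise ValueError("edge map must be non-empty")
    out_nbrs, in_nbrs = defaultdict(set), defaultdict(set)
    for src, dst in edges:
        out_nbrs[src].add(dst)
        in_nbrs[dst].add(src)
    vertices = set(out_nbrs) | set(in_nbrs)
    # Total degree = in-degree + out-degree (edge weights ignored here).
    degrees = [len(out_nbrs[v]) + len(in_nbrs[v]) for v in vertices]
    # Reciprocity: fraction of directed edges whose reverse edge also exists.
    # Self-loops, if any, count as reciprocated in this sketch.
    mutual = sum(1 for (s, d) in edges if (d, s) in edges)
    return {
        "num_vertices": len(vertices),
        "num_edges": len(edges),
        "min_degree": min(degrees),
        "max_degree": max(degrees),
        "avg_degree": sum(degrees) / len(degrees),
        "reciprocity": mutual / len(edges),
    }

# Toy word network (hypothetical data): (source, target) -> bigram count.
toy_edges = {
    ("the", "cat"): 2, ("cat", "saw"): 1, ("saw", "the"): 2,
    ("the", "dog"): 2, ("dog", "the"): 1, ("dog", "saw"): 1,
}
features = basic_network_features(toy_edges)
print(features)
```

Each document yields one such feature dictionary, which can then be flattened into a fixed-length vector for a classifier.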
Results So Far (3036 documents, 142 authors, 10-fold CV) • Word network features: 34.82% (multi-class classification), 33.53% (one-vs-all classification) • Naïve random baseline: 0.7% (± 0.15%)
Competition Datasets (AAAC, PAN 12) • AAAC (55 classes, 180 training samples, 80 test samples): test set accuracy 26.25%, random baseline 1.94% (± 1.59%) • PAN 12 (25 classes, 50 training samples, 28 test samples): test set accuracy 32.14%, random baseline 4.17% (± 3.85%)
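As a sanity check on the random baselines, the expected accuracy of guessing uniformly among k classes is 1/k. The slides do not say how the reported baselines were computed (the ± values suggest repeated random trials), so the sketch below is only an assumed point of comparison; the function name `chance_accuracy` is hypothetical.

```python
def chance_accuracy(num_classes):
    """Expected accuracy of guessing uniformly among num_classes labels."""
    return 1.0 / num_classes

# Class counts taken from the slides; uniform guessing is an assumption.
for name, k in [("Gutenberg", 142), ("AAAC", 55), ("PAN 12", 25)]:
    print(f"{name}: {100 * chance_accuracy(k):.2f}%")
```

This gives roughly 0.70%, 1.82%, and 4.00%, each within the reported error band of the corresponding baseline (0.7%, 1.94%, 4.17%).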
Future Work • Compare with state-of-the-art features. Are they better than graph features? Worse? Can we combine the two? • Feature selection and feature analysis • Try local graph features (e.g., motifs) rather than global features (esp. important for competition datasets).
Future Work • Remove small classes (≤ 10 documents) and see the results • Experiment on a smaller subset of authors • Build separate train and test sets for Gutenberg data (cross-validate on train set, report results on test set)
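The first item, removing small classes, could look like the following minimal sketch. The data layout (a list of `(author, doc_id)` pairs) and the function name `drop_small_classes` are assumptions made for illustration.

```python
from collections import Counter

def drop_small_classes(docs, max_small=10):
    """Keep (author, doc_id) pairs whose author has more than max_small documents."""
    counts = Counter(author for author, _ in docs)
    return [(a, d) for a, d in docs if counts[a] > max_small]

# Hypothetical corpus: author "A" has 12 books, author "B" has 3.
docs = [("A", i) for i in range(12)] + [("B", i) for i in range(3)]
kept = drop_small_classes(docs)
print(len(kept), sorted({author for author, _ in kept}))
```

With the default threshold, authors with 10 or fewer documents are dropped, matching the "≤ 10 documents" criterion on the slide.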
Related Work • Authorship attribution is a huge field, with a very rich history of discoveries and state-of-the-art methods. • Authorship attribution long predates computers; it dates back to antiquity. • For a detailed account, please see the survey articles by Juola (2008), Stamatatos (2009), and Koppel et al. (2009). • Complex networks for authorship attribution are relatively recent. • Some Issues on Complex Networks for Author Characterization, Antiqueira et al., TIL 2006. • The Complex Networks Approach for Authorship Attribution of Books, Mehri et al., Physica A 2012. • Comparing Intermittency and Network Measurements of Words and Their Dependency on Authorship, Amancio et al., 2011. • Most of this work was done by physicists working on complex networks. • No study so far has clearly pinpointed the specific graph features that are affected most by authorship.
Thank you!