Authorship Attribution Using Probabilistic Context-Free Grammars
Sindhu Raghavan, Adriana Kovashka, Raymond Mooney
The University of Texas at Austin
Authorship Attribution
• Task of identifying the author of a document
• Applications
– Forensics (Luyckx and Daelemans, 2008)
– Cyber crime investigation (Zheng et al., 2009)
– Automatic plagiarism detection (Stamatatos, 2009)
– The Federalist Papers study (Mosteller and Wallace, 1984)
– The Federalist Papers are a set of essays written in support of the ratification of the US Constitution
– Authorship of these papers was unknown at the time of publication
– Statistical analysis was used to identify the authors of these documents
Existing Approaches
• Style markers (function words) as features for classification (Mosteller and Wallace, 1984; Burrows, 1987; Holmes and Forsyth, 1995; Joachims, 1998; Binongo and Smith, 1999; Stamatatos et al., 1999; Diederich et al., 2000; Luyckx and Daelemans, 2008)
• Character-level n-grams (Peng et al., 2003)
• Syntactic features from parse trees (Baayen et al., 1996)
• Limitations
– Capture mostly lexical information
– Do not necessarily capture the author's syntactic style
Our Approach
• Use a probabilistic context-free grammar (PCFG) to capture the syntactic style of the author
• Construct a PCFG from the documents written by each author and use it as a language model for classification
– Requires annotated parse trees of the documents
• How do we obtain these annotated parse trees?
Algorithm – Step 1
[Figure: training documents for each author – Alice, Bob, Mary, John]
• Treebank each document using a statistical parser trained on a generic corpus
– Stanford parser (Klein and Manning, 2003)
– WSJ or Brown corpus from Penn Treebank (http://www.cis.upenn.edu/~treebank)
Algorithm – Step 2
[Figure: one PCFG per author, e.g. Alice: S → NP VP (.8), S → VP (.2), NP → Det A N (.4), NP → PP (.35), NP → PropN (.25); different rule probabilities for Bob, Mary, John]
• Train a PCFG for each author using the treebanked documents from Step 1
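Step 2 amounts to relative-frequency (MLE) estimation over the productions in each author's treebanked parses. A minimal sketch, assuming trees are represented as nested tuples `(label, child, ...)` rather than any particular parser's output format:

```python
from collections import Counter

def productions(tree):
    """Yield (lhs, rhs) rules from a tree given as nested tuples
    (label, child1, child2, ...), where string children are terminals."""
    label, *children = tree
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    yield (label, rhs)
    for c in children:
        if not isinstance(c, str):
            yield from productions(c)

def train_pcfg(treebank):
    """MLE estimate: P(A -> beta) = count(A -> beta) / count(A)."""
    rule_counts, lhs_counts = Counter(), Counter()
    for tree in treebank:
        for lhs, rhs in productions(tree):
            rule_counts[(lhs, rhs)] += 1
            lhs_counts[lhs] += 1
    return {rule: n / lhs_counts[rule[0]] for rule, n in rule_counts.items()}
```

Running this over one author's treebanked documents gives that author's rule probabilities, as in the figure.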
Algorithm – Step 3
[Figure: the test document is scored under each author's PCFG (Alice, Bob, Mary, John)]
• Multiply the probability of the top parse for each sentence in the test document
• The author whose PCFG assigns the highest likelihood gives the label for the test document
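Step 3 in code: multiplying per-sentence top-parse probabilities is done in log space to avoid underflow. A sketch with toy numbers; the parser that produces the top-parse probabilities is not shown:

```python
import math

def doc_log_likelihood(top_parse_probs):
    """Sum of log probabilities of each sentence's top parse --
    equivalent to multiplying the probabilities, without underflow."""
    return sum(math.log(p) for p in top_parse_probs)

def attribute(probs_by_author):
    """probs_by_author maps author -> list of top-parse probabilities
    for the test document's sentences under that author's PCFG.
    Returns the author with the highest document likelihood."""
    return max(probs_by_author,
               key=lambda a: doc_log_likelihood(probs_by_author[a]))
```

For example, `attribute({"Alice": [0.6, 0.5], "Bob": [0.33, 0.2]})` labels the document "Alice", since 0.6 × 0.5 > 0.33 × 0.2.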
Experimental Evaluation
Data Sets

| Data set | # Authors | Approx. # words/author | Approx. # sentences/author |
|----------|-----------|------------------------|----------------------------|
| Football | 3         | 14374                  | 786                        |
| Business | 6         | 11215                  | 543                        |
| Travel   | 4         | 23765                  | 1086                       |
| Cricket  | 4         | 23357                  | 1189                       |
| Poetry   | 6         | 7261                   | 329                        |

Blue – news articles; Red – literary works
Data sets available at www.cs.utexas.edu/users/sindhu/acl2010
Methodology
• Bag-of-words model (baseline)
– Naïve Bayes, MaxEnt
• N-gram models (baseline)
– N = 1, 2, 3
• Basic PCFG model
• PCFG-I (interpolation)
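The bag-of-words baseline can be illustrated with a tiny multinomial naive Bayes classifier (add-one smoothing, uniform author priors). This is a sketch of the baseline idea, not the exact experimental setup:

```python
import math
from collections import Counter

def train_nb(docs_by_author):
    """Multinomial naive Bayes over bags of words, add-one smoothing."""
    vocab = {w for docs in docs_by_author.values()
               for d in docs for w in d.split()}
    model = {}
    for author, docs in docs_by_author.items():
        counts = Counter(w for d in docs for w in d.split())
        total = sum(counts.values())
        model[author] = {w: (counts[w] + 1) / (total + len(vocab))
                         for w in vocab}
    return model

def predict_nb(model, doc):
    """Argmax over authors of the log word likelihood (uniform priors);
    words outside the training vocabulary get a tiny floor probability."""
    def score(author):
        return sum(math.log(model[author].get(w, 1e-10))
                   for w in doc.split())
    return max(model, key=score)
```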
Basic PCFG
• Train a PCFG based only on the documents written by the author
• Poor performance when few documents are available for training
– Fix: increase the number of documents in the training set
– Forensics: we do not always have access to many documents written by the same author
– Need for alternative techniques when few documents are available for training
PCFG-I
• Uses interpolation for smoothing
• Augment the training data by adding sections of the WSJ/Brown corpus
• Up-sample the data for the author
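One way to realize the augment-and-up-sample idea: replicate the author's (scarce) production counts so they are not swamped, then add counts from the generic WSJ/Brown treebank as smoothing before normalizing per left-hand side. The `upsample` factor here is an assumed knob, not a value from the paper:

```python
from collections import Counter

def smoothed_counts(author_counts, generic_counts, upsample=2):
    """Mix rule counts: the author's counts replicated `upsample` times
    plus the generic-corpus counts; normalize per LHS afterwards,
    exactly as in plain MLE estimation."""
    mixed = Counter()
    for rule, n in author_counts.items():
        mixed[rule] += n * upsample
    for rule, n in generic_counts.items():
        mixed[rule] += n
    return mixed
```

With rules keyed as `(lhs, rhs)` tuples, the output plugs directly into the per-LHS normalization used for the basic PCFG.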
Results
Performance of Baseline Models
[Chart: accuracy (%) of the baseline models on each data set]
• Inconsistent performance across baseline models – the same model does not necessarily perform poorly on all data sets
Performance of PCFG and PCFG-I
[Chart: accuracy (%) of PCFG and PCFG-I on each data set]
• PCFG-I performs better than the basic PCFG model on most data sets
PCFG Models vs. Baseline Models
[Chart: accuracy (%) of the PCFG models and the baselines on each data set]
• The best PCFG model outperforms the worst baseline on all data sets, but does not outperform the best baseline on all data sets
PCFG-E
• PCFG models do not always outperform N-gram models
• Lexical features from N-gram models are useful for distinguishing between authors
• PCFG-E (Ensemble) combines:
– PCFG-I (best PCFG model)
– Bigram model (best N-gram model)
– MaxEnt-based bag-of-words (discriminative classifier)
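One simple way to combine the three component classifiers is a majority vote, with PCFG-I breaking ties. The exact combination rule used for PCFG-E is not spelled out on this slide, so treat this as an illustrative scheme rather than the paper's method:

```python
from collections import Counter

def ensemble_vote(predictions):
    """Majority vote over component predictions, e.g.
    {"PCFG-I": "Alice", "bigram": "Alice", "maxent": "Bob"} -> "Alice".
    Ties fall back to the PCFG-I prediction (an assumed tie-break)."""
    votes = Counter(predictions.values())
    best, n = votes.most_common(1)[0]
    if list(votes.values()).count(n) > 1:  # no clear majority
        return predictions["PCFG-I"]
    return best
```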
Performance of PCFG-E
[Chart: accuracy (%) of PCFG-E and the best baseline on each data set]
• PCFG-E outperforms or matches the best baseline on all data sets
Significance of PCFG
[Chart: accuracy (%) of PCFG-E vs. PCFG-E without the PCFG-I component on each data set]
• Performance drops on most data sets when PCFG-I is removed from PCFG-E
Conclusions
• Novel approach for authorship attribution using PCFGs
• PCFGs are useful for capturing the author's syntactic style
• Both syntactic and lexical information are necessary to capture an author's writing style
Thank You