Stylometry and authorship
D. Holmes, “Authorship attribution”, Computers and the Humanities 28 (1994), 87-106.
D. Holmes, “The Evolution of Stylometry in Humanities Scholarship”, Literary and Linguistic Computing 13 (1998), 111-117. http://llc.oxfordjournals.org/cgi/reprint/13/3/111.pdf
T. McEnery & M. Oates, “Authorship identification and computational stylometry”, in Dale et al. (eds), Handbook of Natural Language Processing, New York (2000): Dekker, chapter 23.
Stylometry
• Measurement of (aspects of) style
• Especially using computational tools
• Purposes:
  – Genre classification
  – Historical study of language change (diachronic linguistics)
  – Literary analysis
  – Authorship attribution
  – Forensic linguistics
Authorship attribution
• Has been a topic of research since at least the mid-19th century (predates computers)
• Interest in
  – resolving issues of disputed authorship
  – identifying the authorship of anonymous texts
  – detecting plagiarism, and identifying the authors of computer viruses
  – forensic settings, eg to test whether confessions are genuine
Authorship attribution
• Five main approaches:
• Physical evidence
  – eg carbon dating and handwriting analysis, as in the case of the Hitler Diaries. Not relevant to linguistics/stylistics
• Historical evidence
  – eg did Marlowe or Shakespeare write Edward III? It was published in 1596, 3 years after Marlowe’s death, but contains references to the defeat of the Armada (1588)
  – “knowledge intensive”, not feasible for computers
Authorship attribution
• Cipher-based decryption
  – idea that authors deliberately encode their names in the text
  – especially widespread in Bible studies, but also in the Shakespeare-Bacon debate
  – Penn (1987) used computer analysis to show that Bacon had written a lot of Shakespeare’s plays
  – easily debunked (see http://shakespeareauthorship.com/#5b): Ross showed that the same techniques “proved” that Bacon also wrote Spenser’s Faerie Queene, the Bible, Caesar’s Gallic Wars, Hiawatha, Moby Dick and The Federalist Papers (see later)
Authorship attribution
• Manual analysis
  – Much used in forensic linguistics
  – Detailed analysis of unlimited linguistic traits
  – Not suitable for computational analysis, but we’ll look at some examples later
• Computational stylometry
  – Involves counting things
  – So can only look at what is easily countable
Stylometry
• Assumes that the essence of an author’s individual style can be captured with reference to a number of quantitative criteria, called discriminators
• Obviously, some (many) aspects of style are conscious and deliberate
  – as such they can be easily imitated, and indeed often are
  – many famous pastiches, either humorous or as a sort of homage
• Computational stylometry focuses on subconscious elements of style, which are less easy to imitate or falsify
Stylometry is not foolproof
• We should be aware of shortcomings
  – Discriminators are mostly lexical, though some recent work has also looked at syntactic discriminators
  – Authors’ styles change, either over time or deliberately, eg when writing in different literary genres
  – Many techniques rely on large quantities of data
• Most of the following techniques are better at dealing with closed questions
  – Who wrote this, A or B?
  – If A wrote these, did they also write this?
  – How likely is it that A wrote this?
  but not the open question: Who wrote this?
Some classical examples
• Did Homer write both the Iliad and the Odyssey?
  – both generally attributed to a single individual named “Homer”, but both are derived from a long oral tradition
• Did Paul write all the NT Letters of St Paul?
  – In particular, the authorship of Hebrews has long been debated on theological grounds
• Plato developed his philosophy in the form of dialogues, putting his own doctrines into the mouth of his teacher Socrates
  – Ascertaining the correct chronological order of these dialogues would help us understand how Plato developed his philosophy
• Did Shakespeare write all of his plays?
  – Various authors including Bacon and Marlowe are said to have written parts or all of several plays
  – “Shakespeare” may even be a nom de plume for a group of writers
  – two more plays, Edward III and Two Noble Kinsmen, may have been written partly by Shakespeare
Some modern examples
• The Federalist Papers
  – a series of articles published in 1787-88 with the aim of promoting the ratification of the new US constitution
  – written by three authors, Jay, Hamilton and Madison, under the pseudonym “Publius”
  – Some are of known (and in some cases joint) authorship, but the authorship of others is disputed
  – Pioneering stylometric methods were famously used by Mosteller and Wallace in the early 1960s to attempt to settle who wrote the disputed papers
  – The question is now considered settled
  – The Federalist Papers present a difficult but solvable test case, and are seen as a benchmark for testing new ideas
Some modern examples
• Similarities with private letters helped to identify the author of the Unabomber’s manifesto
  – The Unabomber, Theodore Kaczynski, perpetrated a number of bomb attacks on universities and airlines between 1978 and 1995
  – He promised to stop if his 35,000-word anti-industrialist “manifesto” was published in major newspapers
  – His distinctive writing style and turns of phrase enabled him to be identified
• Authorship of Primary Colors, a work of fiction about preparations for the Democratic primaries which showed the Bill Clinton character in a bad light
Some modern examples
• Derek Bentley and his disputed murder ‘confession’ (1953)
  – Bentley (an illiterate man of low IQ) and another man were involved in an armed robbery in which a policeman was shot
  – Bentley was found guilty and hanged in January 1953
  – In 1971 the author Yallop looked closely at the case
  – As well as conflicting ballistic evidence and some procedural errors in the trial, Bentley’s statement was found to have been doctored by the police:
  – The contested statement used then every 58 words on average and repeatedly used I then
  – The Bank of English corpus (BoE) uses then every 500 words, and then I ten times more often than I then. Importantly, witness statement frequencies overall are similar to the BoE
  – The police statement ‘genre’ of the time used then every 78 words, and typically used the I then form
  – Bentley’s conviction was quashed posthumously in 1998, in an appeal assisted by a linguistics professor
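The two measurements in the Bentley analysis (how many words per occurrence of then, and how often I then occurs compared with then I) are simple counts. A minimal sketch, assuming a crude regular-expression tokeniser; the function names and the sample sentence are illustrative only, not taken from the case:

```python
import re

def tokens(text):
    # Crude tokeniser (an assumption): lower-case runs of letters and apostrophes.
    return re.findall(r"[a-z']+", text.lower())

def words_per_occurrence(text, word):
    """Average number of words per occurrence of `word` (inf if it never occurs)."""
    toks = tokens(text)
    hits = toks.count(word)
    return len(toks) / hits if hits else float("inf")

def bigram_count(text, first, second):
    """How often `first` is immediately followed by `second`.
    Sentence boundaries are ignored in this sketch."""
    toks = tokens(text)
    return sum(1 for a, b in zip(toks, toks[1:]) if (a, b) == (first, second))

sample = "I then ran. Then I stopped and I then left."  # invented sample text
print(words_per_occurrence(sample, "then"))
print(bigram_count(sample, "i", "then"), bigram_count(sample, "then", "i"))
```

Figures computed this way for a disputed text are only meaningful when set against a reference corpus of the appropriate genre, as in the comparison above.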
Basic methodologies
• Word or sentence length: too obvious and too easy to manipulate
• Frequencies of letter pairs: strangely successful, though limited
• Distribution of words of a given length (in syllables), especially relative frequencies, ie the length of the gaps between words of the same syllable length
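As a rough illustration of how cheaply these basic discriminators can be computed, here is a minimal sketch. It makes two simplifying assumptions: letter pairs are counted within words only, and word length is measured in letters rather than syllables, since syllable counting needs a pronunciation resource. The function names are illustrative.

```python
import re
from collections import Counter

def letter_pair_frequencies(text):
    """Relative frequency of adjacent letter pairs, counted within words."""
    words = re.findall(r"[a-z]+", text.lower())
    pairs = Counter(w[i:i + 2] for w in words for i in range(len(w) - 1))
    total = sum(pairs.values())
    return {pair: n / total for pair, n in pairs.items()} if total else {}

def word_length_distribution(text):
    """Relative frequency of each word length (in letters, not syllables)."""
    words = re.findall(r"[a-z]+", text.lower())
    lengths = Counter(len(w) for w in words)
    return {k: n / len(words) for k, n in sorted(lengths.items())} if words else {}

print(letter_pair_frequencies("the then"))
print(word_length_distribution("a bb cc"))
```

Two texts can then be compared by looking at how far their frequency tables differ.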
Vocabulary richness
• Based on the idea that the richness of an author’s vocabulary is more or less constant
• Various measures
  – Type-token ratio
  – Simpson’s index (the chance that two words arbitrarily chosen from the text will be the same)
  – Yule’s K (assumes that the occurrence of a given word is a chance event that can be modelled with a Poisson distribution)
  – Entropy (a measure of uniformity)
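The four measures can all be computed from a table of word counts. The sketch below uses the commonly quoted textbook formulas, which is an assumption, since the slides do not give them: type-token ratio V/N, Simpson’s D = Σ n(n-1) / (N(N-1)), Yule’s K = 10^4 (Σ n² - N) / N², and Shannon entropy in bits, where n is the count of each distinct word, N the number of tokens and V the number of types.

```python
import math
import re
from collections import Counter

def vocabulary_richness(text):
    """Return type-token ratio, Simpson's D, Yule's K and entropy for a text."""
    toks = re.findall(r"[a-z']+", text.lower())
    n = len(toks)
    if n < 2:
        raise ValueError("need at least two tokens")
    counts = Counter(toks).values()
    return {
        "type_token_ratio": len(counts) / n,
        # Chance that two tokens drawn without replacement are the same word.
        "simpson_d": sum(c * (c - 1) for c in counts) / (n * (n - 1)),
        "yule_k": 1e4 * (sum(c * c for c in counts) - n) / (n * n),
        # Shannon entropy of the word distribution, in bits.
        "entropy": -sum((c / n) * math.log2(c / n) for c in counts),
    }

print(vocabulary_richness("a a b b"))
```

Note that the type-token ratio in particular depends on text length, so samples of similar size should be compared.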
The Federalist Papers
• 85 papers arguing for the adoption of the US constitution
• Written by three authors (Jay, Hamilton, Madison)
  – 5 authored by Jay
  – 51 authored by Hamilton
  – 14 authored by Madison
  – 3 jointly by Hamilton and Madison
  – authorship of 12 disputed (Hamilton or Madison?)
• Mosteller and Wallace (1964) employed function words such as prepositions, conjunctions and articles as discriminators
  – eg the word upon averaged 3.24 appearances per 1,000 words in the known writings of Hamilton but only 0.23 in the writings of Madison
  – 30 “marker words” identified as discriminating between the two contested authors, including: upon, whilst, there, on, while, vigor, by, consequently, would, voice
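Rates per 1,000 words of the kind quoted for upon are straightforward to compute. A minimal sketch, assuming a simple tokeniser; MARKER_WORDS holds only the ten marker words listed above (the full set of 30 is not given here), and the function names are illustrative:

```python
import re

# The ten marker words listed above; the remaining twenty are not given.
MARKER_WORDS = ["upon", "whilst", "there", "on", "while",
                "vigor", "by", "consequently", "would", "voice"]

def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())

def rate_per_thousand(toks, word):
    """Occurrences of `word` per 1,000 tokens."""
    return 1000 * toks.count(word) / len(toks) if toks else 0.0

def marker_profile(text, words=MARKER_WORDS):
    """Map each marker word to its rate per 1,000 words in `text`."""
    toks = tokenize(text)
    return {w: rate_per_thousand(toks, w) for w in words}

profile = marker_profile("upon " + "x " * 9)  # toy text: 1 'upon' in 10 tokens
print(profile["upon"])
```

Profiles computed from each author’s undisputed writings can then be compared with the profile of a disputed paper.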
Bayesian probability
• Bayes’ theorem reconciles prior hypotheses (in this case based on historical observation) with conditional probabilities based on measurements
• If the prior hypothesis (eg that there is a 1:3 chance that Madison wrote the paper) is confirmed by the measurements (eg of features associated with Madison’s style), the result will change little
• If the prior hypothesis is contradicted by the measurements, the result will be much more striking
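A worked sketch of this updating for a single marker word. It assumes that the count of upon in a paper follows a Poisson distribution whose rate is the author’s rate per 1,000 words (3.24 for Hamilton, 0.23 for Madison, as quoted for the Federalist Papers), and that “1:3” is read as a prior probability of 0.25 for Madison. The Poisson model and the function names are simplifying assumptions made here, not a description of the original study.

```python
import math

def poisson_pmf(k, lam):
    """Probability of observing k events when lam are expected."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

def posterior_madison(count, n_words, prior_madison=0.25,
                      rate_madison=0.23, rate_hamilton=3.24):
    """Posterior probability that Madison wrote a paper of n_words
    containing `count` occurrences of the marker word."""
    like_m = poisson_pmf(count, rate_madison * n_words / 1000)
    like_h = poisson_pmf(count, rate_hamilton * n_words / 1000)
    evidence = prior_madison * like_m + (1 - prior_madison) * like_h
    return prior_madison * like_m / evidence

# A 2,000-word paper with no 'upon' at all, then one with seven.
print(posterior_madison(0, 2000))
print(posterior_madison(7, 2000))
```

With no occurrences the posterior moves strongly towards Madison despite the prior against him; with seven occurrences it moves strongly towards Hamilton. In practice many marker words would be combined rather than relying on one.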
Cumulative sum charts
• Method
  – Assume authorial “fingerprints” such as the percentage of short words, or of words beginning with a vowel
  – Put two texts together and, sentence by sentence, plot the cumulative sum of the deviations of the number of such items per sentence from the average
  – If the graph shows a sharp divergence at the point where the texts are joined, this suggests that the authors differ
• Highly controversial
  – Interpretation of the graphs is very subjective
  – But much used in courts!
• Weighted cusum
  – Slightly sounder footing statistically
  – Eliminates the need for subjective judgment
  – Still not very accurate compared to other measures
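The quantity plotted in a cusum chart can be sketched as follows. This assumes the usual formulation, in which the chart shows the running sum of each sentence’s deviation from the mean, and takes “short words” to mean words of three letters or fewer; both are assumptions, and the sentence splitter is deliberately naive.

```python
import re

def sentences(text):
    # Naive splitter (an assumption): break on runs of . ! or ?
    return [s for s in re.split(r"[.!?]+", text) if s.strip()]

def habit_counts(text, max_len=3):
    """Number of short words (max_len letters or fewer) in each sentence."""
    counts = []
    for s in sentences(text):
        words = re.findall(r"[A-Za-z']+", s)
        counts.append(sum(1 for w in words if len(w) <= max_len))
    return counts

def cusum(values):
    """Running sum of deviations from the mean; one point per sentence."""
    if not values:
        return []
    mean = sum(values) / len(values)
    out, running = [], 0.0
    for v in values:
        running += v - mean
        out.append(running)
    return out

print(habit_counts("I am here. You are there now."))
print(cusum([1, 2, 3]))
```

In the charting method, the cusum of the habit counts is overlaid on the cusum of the sentence lengths and the two lines are compared by eye, which is where the subjectivity comes in.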
Multivariate analysis
• Thanks to computers it is now possible to collect large numbers of different measurements, of a variety of features
• Variants of multivariate analysis
  – Cluster analysis
  – Correspondence analysis
  – Principal components analysis
Cluster analysis
• Group objects according to their similarity with respect to a given feature
• Produces a tree diagram or “dendrogram”
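A minimal sketch of how such a tree can be built bottom-up. It assumes single-linkage agglomerative clustering with Euclidean distance, which is only one of several possible choices; the text names and feature vectors are invented for illustration.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def agglomerate(profiles):
    """profiles: dict mapping text name -> feature vector.
    Repeatedly merges the two closest clusters; the returned list of
    (cluster_a, cluster_b, distance) merges is the dendrogram, bottom-up."""
    clusters = {(name,): [vec] for name, vec in profiles.items()}
    merges = []
    while len(clusters) > 1:
        keys = list(clusters)
        best = None
        for i in range(len(keys)):
            for j in range(i + 1, len(keys)):
                # Single linkage: distance between the closest members.
                d = min(euclidean(a, b)
                        for a in clusters[keys[i]] for b in clusters[keys[j]])
                if best is None or d < best[0]:
                    best = (d, keys[i], keys[j])
        d, a, b = best
        clusters[a + b] = clusters.pop(a) + clusters.pop(b)
        merges.append((a, b, d))
    return merges

# Invented feature vectors for four texts by two hypothetical authors.
texts = {"A1": [0, 0], "A2": [0, 1], "B1": [10, 10], "B2": [10, 11]}
for merge in agglomerate(texts):
    print(merge)
```

Texts that merge early, at small distances, are the most similar; a large jump in merge distance marks the boundary between groups.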
Correspondence analysis
• Example of superlatives in Dickens’ and Smollett’s works
  – Tabata 2007: http://www.digitalhumanities.org/dh2007/abstracts/xhtml.xq?id=259
• Count frequency of 242 superlatives in 30 texts
• CA allows classification of associations between variables in a 2D matrix, rows x columns
• D1 distinguishes Dickens from Smollett
• D2
Principal components analysis
• Like cluster analysis, but can work with a much larger range of variables
• PCA is a statistical method for arranging large arrays of data into interpretable patterns
• “Principal components” are computed by calculating the correlations between all the variables, then grouping them into sets that show the most correspondence
• Each “set” is a “component”, or “dimension”
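A minimal sketch of extracting the first principal component without external libraries. It assumes the covariance matrix of the centred data is used (standardising the variables first would give the correlation-based version) and finds the dominant direction by power iteration; the function name and the toy data are illustrative.

```python
import math

def first_principal_component(rows, iterations=200):
    """rows: list of equal-length feature vectors, one per text.
    Returns (direction, scores): the unit vector of greatest variance
    and each row's coordinate along it."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centred = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [[sum(c[i] * c[j] for c in centred) / (n - 1) for j in range(d)]
           for i in range(d)]
    # Power iteration from a slightly asymmetric starting vector.
    v = [1.0 / (j + 1) for j in range(d)]
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:  # no variance, or an unlucky starting vector
            break
        v = [x / norm for x in w]
    scores = [sum(c[j] * v[j] for j in range(d)) for c in centred]
    return v, scores

direction, scores = first_principal_component([[1, 2], [2, 4], [3, 6]])
print(direction, scores)
```

Plotting the texts by their scores on the first two components is the usual way such an analysis is presented; further components could be found by removing the first from the data and repeating.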
Final word
• Many of these techniques are also used to identify different genres rather than different authors
  – especially PCA, where the dimensions can be characterised
• (In fact, the cluster analysis and PCA illustrations were taken from such a study!)
• An interesting question: how well do they work on pastiches?
  – If interested, see H. Somers & F. Tweedie, “Authorship attribution and pastiche”, Computers and the Humanities 37 (2003), 407-429.