1 Armadillo Data Extraction Across Multiple Text Datasets for Arts and Humanities Research Mark Greengrass University of Sheffield 15 July 2007 (c) M. Greengrass

2 Response to the RePAH questionnaire (2005-6), aggregate of all Arts and Humanities respondents (RePAH: A User Requirements Analysis Report (2006), p. 102)

3 RePAH, A user requirements analysis… (2006), p. 109

4 Some Distinctive Features of Historians’ Approach to their Evidence
• Promiscuous range of sources consulted
• Firm distinction between primary and secondary sources
• Complex dialogue between existing historiography and constitutive source materials
• Reiterative process of open interrogation of source materials
• A ‘coherent’ narrative is one composed (generally) from more than one source

5 Historians’ Database Challenge
• Growing number of (mainly text-based) historical datasets in electronic media, furnished by a wide variety of providers
• These datasets utilise a variety of different historical sources
• They contain varying amounts of encoded information (dependent on the historical question being asked by the PI, and on the constraints of the particular source being used)
• The information is encoded in different ways
• The delivery formats used also vary widely


7 Sources
• The Marine Society Registers
• The Westminster Historical Database
• Prerogative Court of Canterbury Wills
• The Proceedings of the Old Bailey
• St. Martin’s Settlement Exams Index
• WESTCAT
• Collage image database
• Guildhall Library
• Metropolitan London in the 1690s
• IHR
• Selected Criminal Records
• TNA
• AHDS Deposits
• Harben’s Dictionary of London
• John Strype’s “Survey…”
• House of Lords Journals
• BOPCRIS
• Eighteenth Century Fire Insurance Policies
• http://www.motco.com

8 The Old Bailey Proceedings: XML
<trial>
<person> <defend gender="m"><given>William</given><surname>Mawn</surname></defend> </person>
was Tryed for <off> <theft type="animals">stealing a Bay Gelding price 20 l.</theft> </off>
from one <victim gender="m"><given>Thomas</given><surname>Lane</surname></victim>
out of Berkshire on the <cd>25th of April</cd>. The Witness swore that the Horse was found in the Prisoner's custody in Smithfield, which the Prosecutor owned to be his. The Prisoner could not produce any Evidence to prove that he came honestly by the Horse only produc'd a Felonious person, that was no stranger to Newgate, who went under the Notion of his Man, he declared that the Prisoner bought the Horse upon the Road beyond Uxbridge. The Prisoners being found in several faultering stories, he was found
<verdict> <guilty>Guilty</guilty> </verdict>. </p>
<punish><death><note type="editorial">[Death. See summary.]</note></death></punish> </p>
</trial>
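The tagging on this slide is what makes the trial text machine-readable. As a minimal sketch (not the project's own tooling), the fields can be pulled out with Python's standard `xml.etree.ElementTree`. The fragment below is a trimmed copy of the slide's markup with the narrative shortened and the unmatched `</p>` tags left out so that it parses; the function names `full_name` and `extract_trial` and the output field names are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Trimmed, well-formed copy of the slide's sample (narrative shortened,
# stray </p> tags omitted so that the fragment parses).
TRIAL_XML = """<trial>
<person><defend gender="m"><given>William</given><surname>Mawn</surname></defend></person>
was Tryed for <off><theft type="animals">stealing a Bay Gelding price 20 l.</theft></off>
from one <victim gender="m"><given>Thomas</given><surname>Lane</surname></victim>
out of Berkshire on the <cd>25th of April</cd>. He was found
<verdict><guilty>Guilty</guilty></verdict>.
<punish><death><note type="editorial">[Death. See summary.]</note></death></punish>
</trial>"""


def full_name(element):
    """Join the <given> and <surname> children of a person element."""
    if element is None:
        return None
    given = element.findtext("given", default="")
    surname = element.findtext("surname", default="")
    return f"{given} {surname}".strip()


def extract_trial(xml_text):
    """Return a flat record of the encoded facts in one <trial> element."""
    trial = ET.fromstring(xml_text)
    # In this markup the category is carried by the tag name of the first
    # child of <off>, <verdict> and <punish> (e.g. <theft>, <guilty>, <death>).
    offence = trial.find(".//off/*")
    verdict = trial.find(".//verdict/*")
    punishment = trial.find(".//punish/*")
    return {
        "defendant": full_name(trial.find(".//defend")),
        "victim": full_name(trial.find(".//victim")),
        "offence_category": offence.tag if offence is not None else None,
        "offence_type": offence.get("type") if offence is not None else None,
        "crime_date": trial.findtext(".//cd"),
        "verdict": verdict.tag if verdict is not None else None,
        "punishment": punishment.tag if punishment is not None else None,
    }


record = extract_trial(TRIAL_XML)
print(record)
```

A flat record of this kind is the sort of firm, tagged instance that can later be compared with sources that carry no markup at all.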

9 Canterbury Wills: Delimited Text
2530553 2530553 2530553 2530553 2530553
W W W W W
Agnes Kervill or Kervytt
Andrew Bridham London
Andrew Pykeman London
Austin Hawkyns
Cecilia Foster
Christian Chepman
Christian Cust
David Syadine Bristol,
Edmund Bybbesworth
Edward Wellys Hadley,
Ellen Lacy Widow Saint Pe
Gerard Heshull
Guy Shuldham
Helmingus Leget
Henry Porter
Henry Warlegh Keynesha
Henry Wellis
Hugh Caundyssh
Hugh Geynesburgh Rector
Isabelle Woodhill

10 The Issues
Can the technologies developed for the ‘semantic web’ help us:
• To structure the (different) encoded information across varying sources in a way that the user community will find fruitful for research?
• To understand the way in which these different sources relate to one another, such that they can be used in an intelligent fashion?
• To ‘bootstrap’ relevant historical/semantic information from one source, by using another?

11 Data ‘Sharing’ and Data ‘Re-use’
• Reuse means building new applications by assembling components already built
• Sharing is when different applications use the same resources
(c) Oscar Corcho (with acknowledgement)

12 Ontologies and Problem Solving Methods
• Ontologies: describe domain knowledge in a generic way and provide agreed understanding of a domain
• Problem Solving Methods: describe the reasoning process of a dataset (‘Knowledge-Based System’) in a domain-independent manner
• Interaction Problem: representing knowledge for the purpose of solving some problem is strongly affected by the nature of the problem and the inference strategy to be applied to the problem
Bylander, T. and Chandrasekaran, B. Generic tasks in knowledge-based reasoning: the right level of abstraction for knowledge acquisition. In B. R. Gaines and J. H. Boose, eds., Knowledge Acquisition for Knowledge-Based Systems, 65-77. London: Academic Press, 1988.
(c) O. Corcho (with acknowledgement)

13 Definitions of an Ontology
1. “An ontology defines the basic terms and relations comprising the vocabulary of a topic area, as well as the rules for combining terms and relations to define extensions to the vocabulary”
   Neches R, Fikes RE, Finin T, Gruber TR, Senator T, Swartout WR (1991) Enabling technology for knowledge sharing. AI Magazine 12(3): 36–56
2. “An ontology is an explicit specification of a conceptualization”
   Gruber TR (1993a) A translation approach to portable ontology specifications. Knowledge Acquisition 5(2): 199–220
3. “An ontology is a formal, explicit specification of a shared conceptualization”
   Studer R, Benjamins VR, Fensel D (1998) Knowledge Engineering: Principles and Methods. Data & Knowledge Engineering 25(1-2): 161–197
4. “A logical theory which gives an explicit, partial account of a conceptualization”
   Guarino N, Giaretta P (1995) Ontologies and Knowledge Bases: Towards a Terminological Clarification. In: Mars N (ed) Towards Very Large Knowledge Bases: Knowledge Building and Knowledge Sharing (KBKS’95). University of Twente, Enschede, The Netherlands. IOS Press, Amsterdam, The Netherlands, pp 25–32
5. “A set of logical axioms designed to account for the intended meaning of a vocabulary”
   Guarino N (1998) Formal Ontology in Information Systems. In: Guarino N (ed) 1st International Conference on Formal Ontology in Information Systems (FOIS’98). Trento, Italy. IOS Press, Amsterdam, pp 3–15
(c) O. Corcho (with acknowledgement)

14 Key Components of an Ontology
• Concepts: organized in taxonomies
• Relations: R: C1 x C2 x ... x Cn-1 x Cn
    Subclass-of: Concept1 x Concept2
    Connected to: Component1 x Component2
• Functions: F: C1 x C2 x ... x Cn-1 --> Cn
    Mother-of: Person --> Women
    Price of a used car: Model x Year x Kilometers --> Price
• Instances: elements
• Axioms: sentences which are always true
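The five components listed on this slide can be made concrete with a small data structure. The following is a minimal sketch in Python, not a representation of any particular ontology language or of the project's software: the class name `Ontology`, its method names, and the placeholder instances `person_a` and `person_b` are all assumptions introduced for illustration.

```python
class Ontology:
    """Toy container for concepts, relations, functions, instances, axioms."""

    def __init__(self):
        self.parent = {}       # concept -> parent concept (the taxonomy)
        self.instance_of = {}  # instance -> concept
        self.relations = {}    # relation name -> set of member tuples
        self.functions = {}    # function name -> {argument tuple: value}

    def add_concept(self, name, parent=None):
        self.parent[name] = parent

    def is_subclass_of(self, concept, ancestor):
        """Walk up the taxonomy from `concept` looking for `ancestor`."""
        while concept is not None:
            if concept == ancestor:
                return True
            concept = self.parent.get(concept)
        return False

    def add_instance(self, name, concept):
        self.instance_of[name] = concept

    def relate(self, relation, *members):
        """A relation is a set of tuples over C1 x C2 x ... x Cn."""
        self.relations.setdefault(relation, set()).add(members)

    def define(self, function, args, value):
        """A function is a relation whose last element is unique per argument tuple."""
        table = self.functions.setdefault(function, {})
        if args in table and table[args] != value:
            raise ValueError(f"{function}{args} already maps to {table[args]!r}")
        table[args] = value

    def mothers_are_women(self):
        """Example axiom: every value of mother-of is an instance of Woman."""
        mothers = self.functions.get("mother-of", {}).values()
        return all(
            self.is_subclass_of(self.instance_of.get(m), "Woman") for m in mothers
        )


onto = Ontology()
onto.add_concept("Person")
onto.add_concept("Woman", parent="Person")
onto.relate("subclass-of", "Woman", "Person")

# Placeholder instances, not drawn from any of the datasets.
onto.add_instance("person_a", "Person")
onto.add_instance("person_b", "Woman")
onto.define("mother-of", ("person_a",), "person_b")

print(onto.is_subclass_of("Woman", "Person"), onto.mothers_are_women())
```

The `define` method enforces the one property that separates a function from a general relation: a given argument tuple cannot map to two different values.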

15 Semantic Continuum and Formality
• Implicit: shared human consensus (e.g. language)
• Informal [explicit]: text descriptions (e.g. dictionaries)
• Formal (for humans): semantics hardwired; used at runtime (e.g. library catalogues)
• Formal [for machines]: semantics processed and used at runtime (e.g. see below)
(c) M. Greengrass, after Corcho

18 http://www.vicodi.org

19 Ontologies and their sources
• ‘top-down ontologies’ (generated from discipline-accepted taxonomies)
• ‘middle-out ontologies’ (generated by intelligent iteration)
• ‘bottom-up ontologies’ (generated from a representative sample of canonical data)
• Web-based ‘secondary’ historical writing
• Primary sources (historical documents; images; artefacts) in electronic media

21 John Wilkins, An Essay towards a Real Character and a Philosophical Language (1668)

30 Armadillo – a Semantic Agent
• Retrieves information according to pre-agreed ontologies
• Takes account of deviations in spelling, typographic formatting and contextual information
• Makes use of delimited fields and tagged data as ‘oracles’ to provide firm instantiations of elements in an ontology to apply to electronic materials which have no such structure
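The ‘oracle’ idea can be illustrated with a small sketch. It is not Armadillo's algorithm, which these slides do not describe. It only shows the general pattern of taking firm values from a tagged source and looking for spelling variants of them in untagged text. Python's standard `difflib` stands in for whatever matching Armadillo actually uses; the surnames come from the tagged trial sample above, while the untagged sentence, the function name `find_candidates` and the 0.8 similarity cutoff are assumptions invented for illustration.

```python
import difflib
import re

# Firm instances taken from the tagged trial sample (the 'oracle').
ORACLE_SURNAMES = ["Mawn", "Lane"]

# Invented untagged sentence with variant spellings, for illustration only.
UNTAGGED_TEXT = "Willm. Mawne was seen with Tho. Layne near Smithfield."


def find_candidates(text, oracle, cutoff=0.8):
    """Return (token, oracle entry) pairs for capitalised tokens that
    resemble an oracle entry closely enough to be a spelling variant."""
    hits = []
    for token in re.findall(r"[A-Za-z']+", text):
        if not token[0].isupper():
            continue  # crude filter: only capitalised words can be names
        match = difflib.get_close_matches(token, oracle, n=1, cutoff=cutoff)
        if match:
            hits.append((token, match[0]))
    return hits


candidates = find_candidates(UNTAGGED_TEXT, ORACLE_SURNAMES)
print(candidates)
```

String similarity alone will over-match on common surnames, so a fuller treatment would also compare context such as dates, places and roles, in line with the slide's reference to contextual information.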

41 Automated Text-Mining, used for tagging purposes in Central Criminal Court records
<p>CENTRAL CRIMINAL COURT, </p>
<p>Held on Monday, December 17th, 1866, and following days, </p>
<p><sc>BEFORE THE RIGHT HON. </sc>
<lc><name role="judiciary" given="THOMAS" surname="GABRIEL" sex="m" age="na">THOMAS GABRIEL</name>, LORD MAYOR</lc> of the City of London;
Sir <sc><name role="judiciary" given="JOHN" surname="MELLOR" sex="m" age="na">JOHN MELLOR</name></sc>, Knt., one of the Justices of Her Majesty's Court of Queen's Bench;
<sc><name role="judiciary" given="WILLIAM TAYLOR" surname="COPELAND" sex="m" age="na">WILLIAM TAYLOR COPELAND</name></sc>, Esq.,
<sc><name role="judiciary" given="THOMAS" surname="CHALLIS" sex="m" age="na">THOMAS CHALLIS</name></sc>, Esq.,
<sc>THOMAS QUESTED FINNIS</sc>, Esq.,
Sir <sc><name role="judiciary" given="ROBERT WALTER" surname="CARDEN" sex="m" age="na">ROBERT WALTER CARDEN</name></sc>, Knt., and
<sc><name role="judiciary" given="WILLIAM" surname="LAWRENCE" sex="m" age="na">WILLIAM LAWRENCE</name></sc>, Esq., Aldermen of the said City;

42 Automated Text-Mining, used for tagging purposes in Central Criminal Court records – with less success!
<p>CENTRAL CRIMINAL COURT, </p>
<p>Held on Monday, July 22nd, 1912, and following days. </p>
<p>Before the Right Hon. Sir <lc>THOMAS BOOR CROSBY, M.D., LORD MAYOR</lc> of the said City of London;
the Right Hon. Lord <sc>COLERIDGE</sc>, one of the Justices of His Majesty's High Court;
Sir <sc><name role="judiciary" given="HENRY" surname="KNIGHT" sex="m" age="na">HENRY KNIGHT</name></sc>, Knight;
Sir <sc><name role="judiciary" given="HORATIO" surname="DAVIES" sex="m" age="na">HORATIO DAVIES</name></sc>, K.C.M.G.;
Sir <sc><name role="judiciary" given="JOHN" surname="POUND" sex="m" age="na">JOHN POUND</name></sc>, Bart.;
Sir <sc>GEORGE W. TRUSCOTT</sc>, Bart.;
Sir <sc><name role="judiciary" given="CHARLES" surname="JOHNSTON" sex="m" age="na">CHARLES JOHNSTON</name></sc>, Knight; and
Sir <sc>HORACE B. MARSHALL</sc>, Knight, LL.D., Aldermen of the said City;
Sir <sc>FORREST FULTON</sc>, Knight, K.C., Recorder of the said City;
Sir <sc>FK. ALBERT BOSANQUET</sc>, K.C., Common Serjeant of the said City;
[Slide annotations: Not identified; Not identified]
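The names the tagger missed can be listed mechanically, since they are the small-capital spans that contain no `<name>` element. The following is a minimal sketch of such a check, written as an illustration and not taken from the project. It uses a regular expression and not an XML parser because the slide fragments are not complete documents; the function name `untagged_names` is an assumption, and the sample string is an excerpt of the 1912 markup above.

```python
import re

# Small-capital spans; non-greedy so each <sc>...</sc> pair is matched separately.
SC_SPAN = re.compile(r"<sc>(.*?)</sc>", re.DOTALL)


def untagged_names(markup):
    """Return the text of every <sc> span that has no <name> element inside."""
    return [
        span.strip()
        for span in SC_SPAN.findall(markup)
        if "<name" not in span
    ]


# Excerpt from the 1912 session header shown on the slide.
SAMPLE = (
    'Sir <sc><name role="judiciary" given="HENRY" surname="KNIGHT" '
    'sex="m" age="na">HENRY KNIGHT</name></sc>, Knight; '
    "Sir <sc>GEORGE W. TRUSCOTT</sc>, Bart.; "
    "Sir <sc>HORACE B. MARSHALL</sc>, Knight"
)

missed = untagged_names(SAMPLE)
print(missed)
```

The check also flags small-capital spans that are not names at all, such as `BEFORE THE RIGHT HON.` in the 1866 header, and it does not look inside `<lc>` spans, so its output is a list for manual review and not a measure of tagging accuracy.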