A Flexible Workbench for Document Analysis and Text Mining
NLDB 2004, Salford, June 23-25, 2004
Jon Atle Gulla, Terje Brasethvik and Harald Kaada
Norwegian University of Science and Technology, Norway
Outline:
1. Why a linguistic workbench?
2. How does it work?
3. How to use it?
4. How did we use it?
Building Search Engines
[Figure: Docs, Index, Retrieve, Modified Query, Result page; source: FAST search engine (www.alltheweb.com)]
• Need to handle syntactic and morphological variation in documents:
  – language identification, text categorization, stemming/lemmatization, stopwords
• Want to modify the query to improve the search result:
  – stemming/lemmatization, spell-checking, query reformulation with ontologies/dictionaries, grammatical analysis, phrasing, anti-phrasing
Extracting Information From Text
[Figure: Ontology, Text, Minimal recursion semantics representations, Database; source: Deep Thought EU project]
• Structuring knowledge from text:
  – tagging, compounds, grammatical analysis, ontological interpretation, regular expressions, pattern recognition
Constructing Ontologies
[Figure: Manual labor, Ontology, Domain document collection, Statistical & linguistic analyses; source: Brasethvik & Gulla, DKE, 38/1, 2001]
• Want to extract prominent concepts/relations from text:
  – tagging, compounds, NP recognition, term frequencies, stopwords, language identification
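Two of the techniques listed on this slide, stopword removal and term frequencies, can be illustrated with a minimal sketch. The function name, the sample documents and the stopword list are assumptions made for illustration; the slides do not describe how the authors' components are implemented.

```python
import re
from collections import Counter

def term_frequencies(documents, stopwords):
    """Count how often each non-stopword term occurs across all documents."""
    counts = Counter()
    for text in documents:
        # Lowercase and split on word characters; a real component would
        # use a proper tokenizer for the document language.
        for word in re.findall(r"\w+", text.lower()):
            if word not in stopwords:
                counts[word] += 1
    return counts

# Illustrative input only, not taken from the KITH collection.
docs = [
    "The patient record describes the patient.",
    "A record of the treatment.",
]
stop = {"the", "a", "of"}
freqs = term_frequencies(docs, stop)
print(freqs.most_common(3))
```

Terms with high counts are candidates for prominent concepts; the remaining techniques on the slide (tagging, NP recognition) would narrow these candidates further.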
Common Challenges
• How to combine linguistic/statistical techniques for document analysis?
  – Many combinations feasible
  – Not clear what to use under which circumstances
• How to support the experimental use of techniques?
  – Make use of existing techniques
  – Add new ones
  – Parameterize techniques
  – Run techniques in different orders
Proposal: a simple, expandable workbench for planning and running sequences of linguistic/statistical text analysis techniques
Workbench Concept
• Each technique is a component:
  [Figure: a component takes input text and parameters, transforms or adds to the text, and produces output text]
  – parameters to govern behavior
  – dependencies with other components
• Workbench
  – manages components as building blocks
  – users can define an analysis as a chain of building blocks
  – no programming involved as long as appropriate components are available on the network
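The idea of parameterized components chained into an analysis can be sketched as follows. This is only an illustration under stated assumptions: `Component`, `run_chain` and the two toy techniques are hypothetical names, and the real workbench runs its components on the network rather than in one process.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

@dataclass
class Component:
    """Hypothetical building block: a named text-to-text function plus parameters."""
    name: str
    run: Callable[[str, Dict[str, object]], str]
    params: Dict[str, object] = field(default_factory=dict)

def lowercase(text, params):
    # Transforms the input text; takes no parameters.
    return text.lower()

def remove_stopwords(text, params):
    # Behavior is governed by the "stopwords" parameter.
    stop = set(params.get("stopwords", []))
    return " ".join(w for w in text.split() if w not in stop)

def run_chain(text: str, chain: List[Component]) -> str:
    """Feed the output of each component into the next one."""
    for component in chain:
        text = component.run(text, component.params)
    return text

chain = [
    Component("lowercase", lowercase),
    Component("stopwords", remove_stopwords, {"stopwords": ["the", "of"]}),
]
result = run_chain("The Analysis of Documents", chain)
print(result)
```

Reordering the list or changing a parameter dictionary changes the analysis without touching the component code, which is the kind of experimentation the slide describes.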
Workbench Concept
• Job = input text collection + sequence of parameterized online components
• Library of components = components available on the network
• Result = XML representation of documents, all (temporary) results
Workbench Architecture
• Components:
  – Each component is a web service
  – Programmed in any language (Java, Perl, Python, C)
  – Add to or transform input text document(s)
• Execution of jobs:
  – Workbench keeps track of techniques that are available and coordinates their execution
  – All communication with XML-RPC
  – All temporary files stored in DOXML format for later inspection
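A component exposed as an XML-RPC web service could look roughly like the sketch below, written with Python's standard `xmlrpc` modules. The method name `run`, its `(text, params)` signature and the toy lowercasing technique are assumptions; the slides do not specify the interface the workbench expects, and DOXML storage is not shown.

```python
import threading
from xmlrpc.client import ServerProxy
from xmlrpc.server import SimpleXMLRPCServer

def lowercase(text, params):
    """Toy component: returns the input text lowercased (params unused)."""
    return text.lower()

# Component side: publish the technique as an XML-RPC method.
# Port 0 lets the operating system pick a free port for this demo.
server = SimpleXMLRPCServer(("127.0.0.1", 0), logRequests=False)
server.register_function(lowercase, "run")
host, port = server.server_address
threading.Thread(target=server.serve_forever, daemon=True).start()

# Workbench side: call the remote component over XML-RPC.
component = ServerProxy(f"http://{host}:{port}/")
output = component.run("Clinical Examinations", {})
print(output)

# Stop the demo server.
server.shutdown()
server.server_close()
```

Because the call travels over XML-RPC, the component behind the URL could equally be written in Java, Perl or C, as the slide notes.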
The Principle of Adding Information
[Figure: the example phrase "kliniske undersøkelser" (Norwegian for "clinical examinations") is enriched step by step: Tagging, Lemmatization, Phrase detection]
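The principle that each technique adds information to the document rather than replacing it can be sketched with the slide's example phrase. The XML layout, the attribute names and the two-entry lexicon are assumptions for illustration; this is not the DOXML format, whose structure the slides do not show.

```python
import xml.etree.ElementTree as ET

# Toy lexicon for the example phrase: word -> (part of speech, lemma).
LEXICON = {
    "kliniske": ("ADJ", "klinisk"),
    "undersøkelser": ("NOUN", "undersøkelse"),
}

def tokenize(text):
    """Create the initial document with one <token> element per word."""
    doc = ET.Element("doc")
    for word in text.split():
        ET.SubElement(doc, "token").text = word
    return doc

def tag(doc):
    """Tagging: add a part-of-speech attribute to every token."""
    for tok in doc.iter("token"):
        tok.set("pos", LEXICON[tok.text][0])
    return doc

def lemmatize(doc):
    """Lemmatization: add a lemma attribute, keeping the earlier tags."""
    for tok in doc.iter("token"):
        tok.set("lemma", LEXICON[tok.text][1])
    return doc

def detect_phrases(doc):
    """Phrase detection: add a <phrase> element for each adjective-noun pair."""
    tokens = list(doc.iter("token"))
    for first, second in zip(tokens, tokens[1:]):
        if first.get("pos") == "ADJ" and second.get("pos") == "NOUN":
            phrase = ET.SubElement(doc, "phrase", type="NP")
            phrase.text = f"{first.text} {second.text}"
    return doc

doc = detect_phrases(lemmatize(tag(tokenize("kliniske undersøkelser"))))
print(ET.tostring(doc, encoding="unicode"))
```

Each step only adds attributes or elements, so later components such as phrase detection can rely on what earlier ones produced.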
How to Use Workbench?
• Set up techniques as web services with XML-RPC interface on some networked computers
• Tell the workbench where to find them
• Define job:
  – Specify document(s) to run job on
  – Select components and set parameters
  – Decide order of components
  – Run job
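The steps above can be pictured as plain data: a registry that tells the workbench where components live, and a job that lists documents and ordered, parameterized steps. Everything named here (`Job`, `add_step`, `missing_components`, the component names and the example.org URLs) is hypothetical; in the actual workbench these steps are carried out without programming.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class Job:
    """Hypothetical job definition: documents plus an ordered list of steps."""
    documents: List[str]
    steps: List[Tuple[str, Dict[str, object]]] = field(default_factory=list)

    def add_step(self, component: str, **params) -> "Job":
        # Steps run in the order in which they are added.
        self.steps.append((component, params))
        return self

def missing_components(job: Job, registry: Dict[str, str]) -> List[str]:
    """Return the components used by the job that the registry does not know."""
    return [name for name, _ in job.steps if name not in registry]

# "Tell the workbench where to find them": component name -> service URL.
registry = {
    "tagger": "http://host-a.example.org:8000/",
    "lemmatizer": "http://host-b.example.org:8000/",
}

# "Define job": documents, components with parameters, and their order.
job = Job(documents=["doc1.txt"])
job.add_step("tagger", language="no").add_step("lemmatizer").add_step("np-recognizer")

missing = missing_components(job, registry)
print(missing)
```

Checking the job against the registry before running it reflects the slide's condition that an analysis works only when the needed components are available on the network.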
Selecting a Component
[Figure]
Defining a Job
[Figure: Iver's document analysis job consists of 5 techniques]
How did we use it?
• KITH: Norwegian Center of Medical Informatics
  – Editorial responsibility for creating and publishing ontologies for medical domains
  – Traditional approach:
    • Workshops with experts
    • Manual process
  – New approach:
    • Generate concept/relation candidates for health school ontology based on KITH's document collection on the topic
    • 2.79 MB collection of documents
The KITH Ontology Construction Job
[Figure]
Extracted Prominent Concepts
[Figure]
Extracted Concept Relationships
[Figure]
KITH Evaluation
• KITH case
  – 10 components used to extract concept candidates from document collection
  – 99 of 111 concepts in KITH's existing ontology found
  – New concepts detected
  – Considerably faster than traditional manual approach
  – Workbench results included in KITH's experimental ontology-driven IR system: www.volven.no
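The "99 of 111" figure can be turned into a coverage rate with simple arithmetic. The two counts come from the slide; calling the ratio "coverage" is this sketch's own wording, not the authors'.

```python
# Figures reported on the slide.
found = 99    # existing ontology concepts that the workbench job found
total = 111   # concepts in KITH's existing ontology

coverage = found / total   # share of existing concepts recovered
missed = total - found     # existing concepts the job did not find

print(f"coverage: {coverage:.1%}, missed: {missed}")
```

This works out to about 89.2% of the existing concepts, with 12 not found. The slide gives no count for the newly detected concepts, so nothing can be computed about how many of the candidates were correct.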
Conclusions
• Presented a lightweight and expandable workbench for document analysis and text mining
  – Easy to set up, easy to use
  – Limited functionality
• Future work:
  – Add more components to library
  – Allow more advanced job structures (choices, iterations, etc.)
Thank you!