

















































- Slides: 54
Lomonosov Moscow State University Research Computing Center for Information Research Problems of Ontology Development for a Broad Domain Loukachevitch Natalia louk_nat@mail.ru Leading Researcher of Lomonosov Moscow State University
Technologies • Ontologies for Natural Language Processing and Information Retrieval Applications • Applications – Conceptual indexing – Query expansion – Text categorization – Document clustering – Question answering – Automatic summarization • Linguistic Ontologies – RuThes thesaurus (52 thousand concepts, 150 thousand words and expressions) – Ontology on Natural Sciences and Technologies (60 thousand concepts) – Banking Thesaurus for Information Retrieval applications, et al.
Projects of Our Research Group-1 • State Bodies – Central Bank of the Russian Federation (2006–…) • Development of banking thesaurus, conceptual indexing, text categorization – Central Election Committee of the RF (1999–…) • Information retrieval system, conceptual indexing, text categorization – State Duma of RF (1999–…) • Information retrieval system on Duma records – Accounting Chamber of RF (2003) • Creation of a terminology dictionary – Other state bodies • Text categorization, clustering, development of domain-specific ontologies
Projects of Our Research Group-2 • Commercial organizations – Rambler Media company (2007–…) • Automatic clustering, categorization, summarization of news flow • Personalization of news and advertisements • Spam detection • Information extraction – Garant Legal Information Company (2002–…) • Text categorization of legal documents • Summarization of court decisions • Learning to rank in information retrieval – etc.
Plan of Tutorial • Ontologies: general remarks – Main paradigms and their problems – Level of formalization • Broad vs. simple domains – Boundaries of a domain – Main source of knowledge - texts • Domain-specific texts – Concepts and terms, term extraction – Synonyms and near-synonyms – Ambiguity of terms – Establishing relations • Example: Ontology-based text categorization
Domains and Tasks • Ontology vs. machine learning? • Description of domains is difficult • Data may need generalization • Some knowledge may already be described in ontology-based resources • Therefore, many tasks need ontology + machine learning
Ontologies: general remarks • Ontology: a formal representation of knowledge as a set of concepts within a domain and the relationships between those concepts • Main components: – Concepts (classes) – Instances (individuals) – Relations – Attributes – Axioms (rules)
[Figure: taxonomy of classes (object, organism, animal, mammal, cat, siamese, frog) with instances]
Knowledge management domain
Ontology development paradigms • Formal, logically sound ontologies – Logical inference – Some domains are difficult to formalize – Inconsistency is a huge problem • Semantic Web – Many specific ontologies – RDF triples, Same_as links – A lot of “messy” data • Ontologies for Natural Language Processing – Less formal – Relation to language semantics – Formalization is limited by the current state of natural language processing
Ontology-1: Ontology Spectrum (Obrst, 2006) [Figure: spectrum from weak to strong semantics, from less to more expressive. Levels and their relations: Taxonomy (Is Sub-Classification of), Thesaurus (Has Narrower Meaning Than), Conceptual Model (Is Subclass of), Logical Theory (Is Disjoint Subclass of with transitivity property). Formalisms along the axis: Relational Model, XML, DB Schemas, XML Schema, ER, Extended ER, XTM, RDF/S, UML, DAML+OIL, OWL, Description Logic, First Order Logic, Modal Logic. Interoperability levels: syntactic, structural, semantic.]
Expressivity vs. community-size (Hepp, 2007)
Ontology-2, Semantic Web. Linking Data Project http://www.w3.org/wiki/SweoIG/TaskForces/CommunityProjects/LinkingOpenData
Approach 3. Ontologies for Natural Language Processing • Relations between concepts and lexical meanings are quite complex • How to represent synonyms and near-synonyms • How detailed the representation of lexical senses of ambiguous words should be • Large volume vs. complexity of description • WordNet as a symbol of this approach • (!) Different tasks need different types of ontologies
Complicated vs. simple domains • Simple domains (wine ontology) – Explicit boundaries – Boundaries are determined by “physical processes”, e.g. production, services – Clear roles of entities – Small number of classes (may have many instances) or many uniform classes • Complicated domains (terrorism, financial control) – Vague boundaries – The same entities are used in different roles and functions – Knowledge is stored in text documents
Wine ontology http://www.w3.org/TR/owl-guide/wine.rdf [Figure: classes Wine, WhiteWine, RedWine, SweetWine, TableWine, WhiteLoire, WhiteBurgundy, WhiteBordeaux, Region, Grape, Meal course]
Complicated domains: vague boundaries • Interdisciplinarity – State financial control (economy + law + finances) – Counter-terrorism (criminal law + international law + constitutional law + state bodies + buildings + vehicles + weapons…) • Two main parts – Center of the domain – Additional concepts from neighbouring domains
Boundaries of domain: Terrorism • Center of domain – Terrorist acts, groups, terrorists – Anti-terrorist activity • Additional spheres – Geographic places – Weapons and explosives – Transport – Financial payment – Ideology – Religion etc. • Re-use of ontologies?
Problem: Distortion of Reality • General concepts necessary for domain description are treated as subordinates of domain concepts • The name of a concept is general but its intended sense is domain-specific – Law (= antiterrorist law) – Intelligence (= antiterrorist intelligence) • Problems in ontology mapping, ontology reuse • Thesaurus on Radiological terrorism • http://www.jasonmorrison.net/content/2004/a-thesaurus-for-radiological-terrorism-research/
Example: distortion of reality
Ontology Development and Domain-Specific Texts • Knowledge is stored in texts • Domain-specific text collection – As many texts as possible – Necessary to find exact boundaries • Automatic extraction of terms from texts (term acquisition) – Terms are expressions corresponding to concepts of a specific domain • Top-level modeling • Use of existing ontologies
Automatic Term Acquisition from Texts • Linguistic criteria (noun groups) • Lexical restrictions (e.g. evaluative words such as good and bad are rarely parts of terms) • Statistical criteria (frequency, mutual information, and many others) • (!) Use of machine learning approaches to improve term extraction • Formation of an ordered list of term candidates
The most frequent phrases in documents of financial control domain • Translation from Russian – Federal budget – Russian Federation – Accounting Chamber – Federal law – Overall sum (-) – Resources of federal budget (?) – Oblast budget – Financial means – Use of financial means (?) – Wages – Ministry of finance – Budget resources – Tax body
Analysis of Term-Candidate List • At the beginning of the list there are many evident terms • Further down there are many unclear expressions – whether they are terms (domain experts can have different opinions) – whether they are related to the domain – where the boundary of the domain is • A lot of synonymous variants • Ambiguity of terms
Boundaries of the domain • Bottom-up + top-down • Term extraction from texts is the bottom-up stage • Extracted expressions are needed to understand what types of entities the domain requires; in effect, this is the design of a top-level taxonomy • Top-down analysis • Combined approach to concept selection (frequency in the collection + top-level taxonomy restrictions)
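The statistical step can be illustrated with a minimal sketch. It assumes whitespace-tokenized text and treats adjacent word pairs as the only term candidates; the toy corpus, the stop list `EVALUATIVE` and the function name are illustrative assumptions, not part of the tutorial. The linguistic noun-group filter is omitted, so a pair such as "the federal" survives here although a real pipeline would drop it.

```python
import math
from collections import Counter

# Assumed lexical restriction: evaluative words rarely belong to terms.
EVALUATIVE = {"good", "bad"}

def rank_bigram_candidates(sentences, min_freq=2):
    """Return (bigram, frequency, PMI) tuples, best candidates first."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    ranked = []
    for (w1, w2), freq in bigrams.items():
        # Frequency threshold plus the lexical restriction.
        if freq < min_freq or w1 in EVALUATIVE or w2 in EVALUATIVE:
            continue
        # Pointwise mutual information of the pair.
        p_pair = freq / n_bi
        p_w1 = unigrams[w1] / n_uni
        p_w2 = unigrams[w2] / n_uni
        pmi = math.log2(p_pair / (p_w1 * p_w2))
        ranked.append(((w1, w2), freq, pmi))
    ranked.sort(key=lambda item: (item[2], item[1]), reverse=True)
    return ranked

# Toy corpus (invented for illustration).
corpus = [
    "federal budget resources were spent",
    "the federal budget was approved",
    "good budget planning needs the federal budget",
    "good budget planning is rare",
]
candidates = rank_bigram_candidates(corpus)
```

The resulting ordered list plays the role of the term-candidate list that experts then review.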
Synonyms and variants of “money laundering” • CRIMINAL LAUNDERING • ILLEGAL LAUNDERING • LAUNDERING ACTIVITIES • LAUNDERING OF MONEY • LAUNDERING OPERATIONS • MONEY LAUNDERING ACTIVITIES • MONEY LEGALIZATION • MONEY WASHING • PROFIT LAUNDERING • PROFIT WASHING
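One way to use such a variant list is to map every variant to a single concept during indexing. The sketch below is an assumption about how that lookup could work (plain word-boundary matching on lowercased text); the concept name `MONEY LAUNDERING` and the function name are illustrative.

```python
import re

# Variants taken from the slide; all point to one assumed concept name.
VARIANTS = [
    "money laundering", "criminal laundering", "illegal laundering",
    "laundering activities", "laundering of money", "laundering operations",
    "money laundering activities", "money legalization", "money washing",
    "profit laundering", "profit washing",
]
VARIANT_TO_CONCEPT = {variant: "MONEY LAUNDERING" for variant in VARIANTS}

def index_concepts(text):
    """Return the set of concept names whose variants occur in the text."""
    lowered = text.lower()
    found = set()
    for variant, concept in VARIANT_TO_CONCEPT.items():
        if re.search(r"\b" + re.escape(variant) + r"\b", lowered):
            found.add(concept)
    return found
```

With this mapping, texts that use different wordings are indexed under the same concept.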
Lexical ambiguity • Homonyms are words that share the same spelling but have different meanings (unrelated in origin) – bank (financial institution vs. land (river bank)) – rarely met in the same domain, except a broad one – easily recognized by non-linguists – different concepts, different sets of relations • Polysemes are words with the same spelling and distinct but related meanings – bank (financial institution vs. building) – very often met in any domain – regular polysemes (institutions and their buildings) – difficult for non-linguists to recognize – tendency to use the same ontology concept for related senses
Lexical ambiguity (homonyms): bow
Lexical ambiguity (polysemes) • Transport – They have succeeded in stopping the transport of live animals (= moving) – Mechanism of contactless payment in public transport (= vehicles) • Regular polysemy – Tree vs. wood (material): birch • Non-linguists often fail to recognize the different senses but notice strange deviations in relations
Lexical ambiguity (polysemes) • How to help yourself: use unambiguous synonymous phrases – Transport 1 = transportation process – Transport 2 = transport vehicle – Birch 1 = birch tree – Birch 2 = birch wood • This makes it possible to see different entities behind closely related senses
Relations of an ontology • The set of relations of an ontology may not be evident • Main relations – Class-subclass – Instance relation – Role relations • Different properties: transitivity etc. • Old AI books and manuals: the same relation in all cases, “is_a” • The diagnostic expression “X is a Y” fits all of these relations, so it cannot tell them apart
Class-subclass relation • Relation between two sets of entities (classes) (many-to-many): birch, tree • Properties: transitivity, inheritance • Rules: – If class A is a subclass of class B, then each instance of class A is also an instance of B – Top-level classes (categories) should coincide for A and B • Real example of a mistake: river → water object → water → substance, hence Moscow river is a substance??
Instance relation • One-to-many relation – Moscow river: instance of river – Teacher: instance of profession • Not transitive – Rex, poodle, dog breed, dog: which relations hold? – Rex is an instance of poodle – Poodle is an instance of dog breed – Poodle is a subclass of dog – Rex is not a dog breed – Rex is a dog
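The two rules and the mistake can be sketched in code. This is a minimal illustration assuming single inheritance stored in a dictionary; the top-level category labels in `TOP_LEVEL` are assumptions made for the example, not part of the tutorial.

```python
def ancestors(cls, subclass_of):
    """Follow subclass links transitively; nearest superclass first."""
    chain = []
    while cls in subclass_of and subclass_of[cls] not in chain:
        cls = subclass_of[cls]
        chain.append(cls)
    return chain

def classes_of(instance, instance_of, subclass_of):
    """Rule 1: an instance of A is also an instance of every superclass of A."""
    direct = instance_of[instance]
    return [direct] + ancestors(direct, subclass_of)

def suspicious_links(subclass_of, top_level):
    """Rule 2: flag subclass links whose ends have different top-level categories."""
    return [(a, b) for a, b in subclass_of.items() if top_level[a] != top_level[b]]

# The faulty chain from the slide.
FAULTY_SUBCLASS_OF = {"river": "water object", "water object": "water", "water": "substance"}
INSTANCE_OF = {"Moscow river": "river"}
# Assumed top-level categories for each class.
TOP_LEVEL = {"river": "object", "water object": "object",
             "water": "substance", "substance": "substance"}

inferred = classes_of("Moscow river", INSTANCE_OF, FAULTY_SUBCLASS_OF)
flagged = suspicious_links(FAULTY_SUBCLASS_OF, TOP_LEVEL)
```

Here `inferred` ends with "substance", which reproduces the wrong conclusion, and `flagged` points at the link from "water object" to "water" as the place where the top-level categories stop coinciding.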
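The Rex example can be checked with a small sketch that keeps instance links and subclass links apart. It assumes that instance membership is inherited along subclass links only and never chained through a second instance link; the dictionaries and function names are illustrative.

```python
# Relations from the slide, stored separately.
INSTANCE_OF = {"Rex": {"poodle"}, "poodle": {"dog breed"}}
SUBCLASS_OF = {"poodle": {"dog"}}

def superclasses(cls):
    """All classes reachable from cls through subclass links."""
    seen, stack = set(), [cls]
    while stack:
        current = stack.pop()
        for parent in SUBCLASS_OF.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

def is_instance(entity, cls):
    """True if entity is a direct instance of cls or of one of its subclasses."""
    for direct in INSTANCE_OF.get(entity, ()):
        if direct == cls or cls in superclasses(direct):
            return True
    return False
```

Under these assumptions Rex is an instance of poodle and of dog, poodle is an instance of dog breed, and Rex is not an instance of dog breed, because the two instance links are not composed.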
Roles and types • Roles: student, employer, terrorist, player • Types: person, animal, building, car • A role is a type under certain conditions • A student is a person in the role of learning • Properties of roles: – Roles are created dynamically – Roles can play other roles – A type can play many different roles
Confusion of type-role relations with class-subclass relations • Frequent mistake of almost every beginner: Employer ✗ Person, Employer ✗ Organization (class-subclass links crossed out) • Not every person is an employer, and an organization is not an employer in all situations • Problems with inference
Text-motivated confusion of types and roles • Natural substances such as salt, sugar, vinegar, alcohol, … are also used as traditional preservatives. (Wikipedia) • Often salt and other preservatives are added to canned foods. (http://www.family-health-and-nutrition.com/this-vs-that.html) • What is the relation between salt and preservative? – Class-subclass? – Class-instance? – … • In practice, beginners usually try to establish a class-subclass relation; however, this is a type-role relation: preservative is a role of substances.
Automatic extraction of relations from texts • A lot of scientific publications: extraction of synonyms, taxonomies, part-whole relations etc. • But in a complex domain it is impossible to rely fully on automatic tools • In many cases the extracted relations are the evident ones • Causes – Multiword expressions – Ambiguity of language expressions – Contextual dependence – Necessity of processing a very large domain text collection
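The difference between a type and a role can be made concrete with a minimal sketch. It assumes that each entity has one fixed type and a changeable set of roles; the class `Entity`, its fields and the example entities are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class Entity:
    name: str
    type_name: str                             # fixed: what the entity is
    roles: set = field(default_factory=set)    # dynamic: what it currently does

    def is_a(self, type_name):
        return self.type_name == type_name

    def plays(self, role):
        return role in self.roles

# Salt is a substance by type; preservative is a role it may or may not play.
canned_food_salt = Entity("salt in canned food", "salt", {"preservative"})
road_salt = Entity("road salt", "salt")        # invented counterexample

# Roles are created and dropped dynamically; the type stays.
ann = Entity("Ann", "person")
ann.roles.add("student")
was_student = ann.plays("student")
ann.roles.discard("student")
```

Modelling preservative as a subclass of salt's superclass would instead force every portion of salt to be a preservative, which is the inference problem the slide describes.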
Automatic text categorization • Main approaches – Knowledge-based methods (based on rules) – Machine learning methods, very popular at scientific conferences • Text categorization in real practice (operational text categorization) – A training collection should exist – Experts should categorize documents consistently – Every category needs a sufficient number of training examples • In practice, knowledge-based systems are widely used • Reuters (provider of the well-known Reuters-21578 training collection) uses a knowledge-based system to categorize its own documents
Subjectivity of experts • Experts’ agreement in manual text categorization is around 60%
Our text categorization projects • Use of both approaches, depending on the task and data • The knowledge-based approach uses knowledge from our large resource, the RuThes thesaurus • Projects – Classifier for Central Election Committee (450 categories, 4 levels) – Classifier of Russian legislation (1169 categories, 3000 categories) – Classifier of English economic research papers (700 categories) – Classifier of public opinion polls (350 categories) – Classifier of banking documents and news (200 categories) – General news classifiers – and others
Thesaurus on sociopolitical life • Sociopolitical domain: social life of contemporary society • Includes: thematic vocabulary and terminology from such domains as economy, finance, defense, law, sport, arts, military conflicts etc. • Domain for such documents as government documents, legal acts, international treaties, newspaper articles, news reports • 36 thousand concepts, 100 thousand terms, 140 thousand direct relations • Applications: conceptual indexing; automatic text categorization, document clustering, automatic text summarization, question-answering.
Levels of Hierarchy [Figure: Socio-Political Domain; Taxation, Law, Accounting, Banking]
Thesaurus-based text categorization • Use of knowledge described in the Thesaurus • Manual description of Boolean expressions for categories based on small number of thesaurus concepts • Automatic thesaurus-based expansion of Boolean expressions • Thesaurus-based thematic representation of the text content independent of the genre and the length of a text (lexical chain technique)
Describing a category with supporting concepts • Categorization of legal acts • 200.020. Heads of states summits • { ( HEADS OF STATES SUMMIT Y ) OR { ( NEGOTIATIONS N ) ( INTERNATIONAL NEGOTIATIONS Y ) ( INTERNATIONAL CONTACTS N ) ( MEETING N ) } AND ( HEAD OF STATE L ) }
Expanded representation of the category • { ( HEADS OF STATES SUMMIT Y ) ( summit, summit meeting, top-level meeting, head of states meeting ) • OR { ( NEGOTIATIONS N ) ( negotiations, talks ) ( INTERNATIONAL NEGOTIATIONS Y ) ( international talks, interstate talks, diplomatic negotiations, multinational talks, intergovernmental talks, contracting nations, negotiating states … ) ( INTERNATIONAL CONTACTS N ) ( international intercourse, transnational contacts … ) ( MEETING N ) } AND ( HEAD OF STATE L ) ( leader of country, president of country, federal president, RF president, US president, monarch, …, emir of Kuwait … ) }
ROMIP: Russian Seminar on Information Retrieval • Russian TREC • Text categorization task • Categories: DMOZ, 247 categories of 2nd level Top/World/Russian/*/* • Training collection: «DMOZ» (presented by Rambler), 300 000 documents, 2100 sites • Testing collection: Belorussian Internet «BY.web» (granted by Yandex company), 1 500 000 documents, 19 000 sites • Our task: – Thesaurus-based text categorization – Measuring the time needed to create a categorization system – Evaluation
Knowledge-based approach (8 man-hours) • Category 135 «Martial arts» (F1-measure [OR] = 97%, R = 98%, P = 96%) • Boolean expression for the category: MARTIAL ARTS (E), where «E» means full expansion using the thesaurus tree • The expanded description includes: AIKIDO, JIUJUTSU, JUDO, KARATE, JUDOIST, KARATEKA …
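The expanded Boolean description can be evaluated against a text roughly as follows. This is a simplified sketch, not the system described in the tutorial: it uses plain word-boundary matching instead of the thesaurus-based thematic representation, ignores the expansion flags, abridges the text entries, and assumes the entry "meeting" for the MEETING concept, for which the slide lists none.

```python
import re

# Text entries per concept, abridged from the expanded representation.
THESAURUS = {
    "HEADS OF STATES SUMMIT": ["summit", "summit meeting", "top-level meeting",
                               "head of states meeting"],
    "NEGOTIATIONS": ["negotiations", "talks"],
    "INTERNATIONAL NEGOTIATIONS": ["international talks", "interstate talks",
                                   "diplomatic negotiations"],
    "INTERNATIONAL CONTACTS": ["international intercourse", "transnational contacts"],
    "MEETING": ["meeting"],  # assumed entry, not listed on the slide
    "HEAD OF STATE": ["leader of country", "president of country",
                      "federal president", "monarch"],
}

def mentions(text, concept):
    """True if any text entry of the concept occurs in the text."""
    lowered = text.lower()
    return any(re.search(r"\b" + re.escape(entry) + r"\b", lowered)
               for entry in THESAURUS[concept])

def is_heads_of_states_summit(text):
    """SUMMIT OR ((NEGOTIATIONS | INT. NEGOTIATIONS | INT. CONTACTS | MEETING) AND HEAD OF STATE)."""
    contact = any(mentions(text, concept) for concept in (
        "NEGOTIATIONS", "INTERNATIONAL NEGOTIATIONS",
        "INTERNATIONAL CONTACTS", "MEETING"))
    return mentions(text, "HEADS OF STATES SUMMIT") or (
        contact and mentions(text, "HEAD OF STATE"))
```

The expert writes only the short expression over a few concepts; the long word lists come from the thesaurus, which is what keeps the manual effort small.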
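As a consistency check on the reported figures, and assuming the usual definition of the F1-measure as the harmonic mean of precision P and recall R, the stated values agree with each other:

$$
F_1 = \frac{2PR}{P + R} = \frac{2 \cdot 0.96 \cdot 0.98}{0.96 + 0.98} = \frac{1.8816}{1.94} \approx 0.97
$$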
ROMIP: web-page categorization [or]
Benefits from Large-Scale Linguistic Ontologies Use in Information Retrieval Tasks • Task: benefit • Web Search: 0+ % • Corporate Search / Legal Search: 10 % • Long Queries / Verbose Queries: 15 % • Text Categorization: 15-50 % • News Clustering: 15 % • Summarization, Visualization, Multi-Document Summarization: ++ (SUMMAC)
Conclusion • Complex domains – Broad domains including a lot of heterogeneous entities – Vague boundaries – Knowledge stored in texts • Special efforts are needed to find boundaries • Knowledge acquisition from texts – Partial automation – Necessity to overcome the ambiguity and vagueness of natural-language texts, even for non-linguists