
Lomonosov Moscow State University Research Computing Center for Information Research • Problems of Ontology Development for a Broad Domain • Natalia Loukachevitch, louk_nat@mail.ru • Leading Researcher, Lomonosov Moscow State University

Technologies • Ontologies for Natural Language Processing and Information Retrieval Applications • Applications – Conceptual indexing – Query expansion – Text categorization – Document clustering – Question answering – Automatic summarization • Linguistic Ontologies – RuThes thesaurus (52 thousand concepts, 150 thousand words and expressions) – Ontology on Natural Sciences and Technologies (60 thousand concepts) – Banking Thesaurus for information retrieval applications, et al.

Projects of Our Research Group-1 • State Bodies – Central Bank of the Russian Federation (2006–…) • Development of a banking thesaurus, conceptual indexing, text categorization – Central Election Committee of the RF (1999–…) • Information retrieval system, conceptual indexing, text categorization – State Duma of the RF (1999–…) • Information retrieval system on Duma records – Accounting Chamber of the RF (2003) • Creation of a terminology dictionary – Other state bodies • Text categorization, clustering, development of domain-specific ontologies

Projects of Our Research Group-2 • Commercial organizations – Rambler Media company (2007–…) • Automatic clustering, categorization, and summarization of the news flow • Personalization of news and advertisements • Spam detection • Information extraction – Garant Legal Information Company (2002–…) • Text categorization of legal documents • Summarization of court decisions • Learning to rank in information retrieval – etc.

Plan of Tutorial • Ontologies: general remarks – Main paradigms and their problems – Level of formalization • Broad vs. simple domains – Boundaries of a domain – Main source of knowledge - texts • Domain-specific texts – Concepts and terms, term extraction – Synonyms and near-synonyms – Ambiguity of terms – Establishing relations • Example: Ontology-based text categorization

Domains and Tasks • Ontology vs. Machine Learning? • Description of domains is difficult • Data may need generalization • Some knowledge may already be described in ontology-based resources • Therefore, many tasks need ontology + machine learning

Ontologies: general remarks • Ontology: a formal representation of knowledge as a set of concepts within a domain and the relationships between those concepts • Main components: – Concepts (classes) – Instances (individuals) – Relations – Attributes – Axioms (rules)

Taxonomy [Figure: class labels object, organism, animal, mammal, cat, siamese, frog; a separate layer labeled instances]

Knowledge management domain

Ontology development paradigms • Formal, logically sound ontologies – Logical inference – Some domains are difficult to formalize – Inconsistency is a huge problem • Semantic Web – Many specific ontologies – RDF triples, Same_as links – A lot of “messy” data • Ontologies for Natural Language Processing – Less formal – Related to language semantics – Formalization is limited by the current state of natural language processing

Ontology-1: Ontology Spectrum (Obrst, 2006) [Figure: a spectrum running from weak semantics to strong semantics, "from less to more expressive". Levels named in the figure: Modal Logic; First Order Logical Theory; Description Logic; DAML+OIL, OWL; UML; Conceptual Model; RDF/S; XTM; Extended ER; Thesaurus; ER; Relational Model, XML; DB Schemas, XML Schema; Taxonomy. Relation labels: Is Disjoint Subclass of with transitivity property; Is Subclass of; Has Narrower Meaning Than; Is Sub-Classification of. Interoperability labels: Semantic, Structural, Syntactic.]

Expressivity vs. community-size (Hepp, 2007)

Ontology-2, Semantic Web. Linking Data Project http://www.w3.org/wiki/SweoIG/TaskForces/CommunityProjects/LinkingOpenData

Approach 3. Ontologies for Natural Language Processing • Relations between concepts and lexical meanings are quite complex • How to represent synonyms and near-synonyms • How detailed the representation of lexical senses of ambiguous words should be • Large volume vs. complexity of description • WordNet as a symbol of this approach • (!) Different tasks need different types of ontologies


Complicated vs. simple domains • Simple domains (wine ontology) – Explicit boundaries – Boundaries are determined by “physical processes”, e.g. production, services – Clear roles of entities – Small number of classes (which may have many instances) or many uniform classes • Complicated domains (terrorism, financial control) – Vague boundaries – The same entities are used in different roles and functions – Knowledge is stored in text documents

Wine ontology http://www.w3.org/TR/owl-guide/wine.rdf [Figure: classes Wine, WhiteWine, RedWine, SweetWine, TableWine, WhiteLoire, WhiteBurgundy, WhiteBordeaux, Region, Grape, Meal course]

Complicated domains: vague boundaries • Interdisciplinarity – State financial control (economy + law + finances) – Counter-terrorism (criminal law + international law + constitutional law + state bodies + buildings + vehicles + weapons …) • Two main parts – Center of the domain – Additional concepts from neighbouring domains


Boundaries of domain: Terrorism • Center of domain – Terrorist acts, groups, terrorists – Anti-terrorist activity • Additional spheres – Geographic places – Weapons and explosives – Transport – Financial payment – Ideology – Religion etc. • Re-use of ontologies?

Problem: Distortion of Reality • General concepts necessary for domain description are treated as subordinates of domain concepts • The name of a concept is general, but its intended sense is domain-specific – Law (= antiterrorist law) – Intelligence (= antiterrorist intelligence) • Problems in ontology mapping and ontology reuse • Thesaurus on Radiological Terrorism • http://www.jasonmorrison.net/content/2004/a-thesaurus-for-radiological-terrorism-research/

Example: distortion of reality


Ontology Development and Domain-Specific Texts • Knowledge is stored in texts • Domain-specific text collection – As many texts as possible – Necessary for finding the exact boundaries • Automatic extraction of terms from texts (term acquisition) – Terms are expressions corresponding to concepts of a specific domain • Top-level modeling • Use of existing ontologies

Automatic Term Acquisition from Texts • Linguistic criteria (noun groups) • Lexical restrictions (e.g. evaluative words such as good and bad are rarely parts of terms) • Statistical criteria (frequency, mutual information, and many others) • Use of machine learning approaches to improve term extraction • Formation of an ordered list of term candidates
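The criteria on this slide can be combined in a small pipeline. The following is a minimal sketch, not the author's tool: it treats adjacent word pairs as candidates (a crude stand-in for noun-group detection), drops pairs that contain stop words or evaluative words, and ranks the rest by frequency and pointwise mutual information. The word lists, the function name `rank_term_candidates` and the sample sentences are illustrative assumptions.

```python
import math
import re
from collections import Counter

# Assumed, illustrative word lists; a real system would use a POS tagger
# and much larger lexicons.
STOP_WORDS = {"the", "of", "a", "is", "was", "by", "in", "and", "to"}
EVALUATIVE = {"good", "bad"}


def rank_term_candidates(texts, min_freq=2):
    """Return (bigram, frequency, PMI) tuples sorted by frequency, then PMI."""
    unigrams, bigrams = Counter(), Counter()
    for text in texts:
        tokens = re.findall(r"[a-z]+", text.lower())
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))

    total_uni = sum(unigrams.values())
    total_bi = sum(bigrams.values())
    ranked = []
    for (w1, w2), freq in bigrams.items():
        if freq < min_freq:
            continue  # statistical criterion: too rare in the collection
        if {w1, w2} & (STOP_WORDS | EVALUATIVE):
            continue  # lexical restriction
        p_pair = freq / total_bi
        p1, p2 = unigrams[w1] / total_uni, unigrams[w2] / total_uni
        pmi = math.log2(p_pair / (p1 * p2))
        ranked.append((f"{w1} {w2}", freq, pmi))
    ranked.sort(key=lambda item: (-item[1], -item[2]))
    return ranked


# Hypothetical sample sentences, written for this sketch only.
sample = [
    "The federal budget was approved by the Accounting Chamber.",
    "Resources of the federal budget is a good example.",
    "The federal budget and a good example of the oblast budget.",
]
candidates = rank_term_candidates(sample)
```

On this toy input only "federal budget" survives: "good example" is frequent enough but is rejected by the lexical restriction. The ordered list would then go to a human expert for review.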

The most frequent phrases in documents of the financial control domain • Translation from Russian – Federal budget – Russian Federation – Accounting Chamber – Federal law – Overall sum (-) – Resources of federal budget (?) – Oblast budget – Financial means – Use of financial means (?) – Wages – Ministry of finance – Budget resources – Tax body

Analysis of Term-Candidate List • At the beginning of the list there are many evident terms • Further down there are many unclear expressions – whether they are terms (domain experts can have different opinions) – whether they are related to the domain – where the boundary of the domain is • Many synonymous variants • Ambiguity of terms

Boundaries of the domain • Bottom-up+top-down • Term extraction from texts – a bottom-up stage • Extracted expressions are necessary to understand what types of entities are needed in the domain – in fact design of top-level taxonomy • Top-down analysis • Combined approach to concept selection (frequency from the collection+top-level taxonomy restrictions)

Synonyms and variants of “money laundering” • CRIMINAL LAUNDERING • ILLEGAL LAUNDERING • LAUNDERING ACTIVITIES • LAUNDERING OF MONEY • LAUNDERING OPERATIONS • MONEY LAUNDERING ACTIVITIES • MONEY LEGALIZATION • MONEY WASHING • PROFIT LAUNDERING • PROFIT WASHING
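In a thesaurus, variants like these are typically attached to one concept as text entries, so that all of them index to the same identifier. A minimal sketch, assuming a hypothetical concept label `MONEY LAUNDERING` and a naive longest-match lookup over lower-cased text (no morphology, no word-boundary checks):

```python
import re

# Variants from the slide, attached to one (hypothetical) concept label.
# The head term "money laundering" itself is added here as an entry.
CONCEPT = "MONEY LAUNDERING"
VARIANTS = [
    "criminal laundering", "illegal laundering", "laundering activities",
    "laundering of money", "laundering operations",
    "money laundering activities", "money legalization", "money washing",
    "profit laundering", "profit washing", "money laundering",
]

# Longer variants first, so that "money laundering activities" wins over
# the shorter "money laundering" and "laundering activities".
_PATTERN = re.compile(
    "|".join(re.escape(v) for v in sorted(VARIANTS, key=len, reverse=True))
)


def index_concepts(text):
    """Return one concept label per variant occurrence found in the text."""
    return [CONCEPT for _ in _PATTERN.finditer(text.lower())]


hits = index_concepts(
    "The bank reported Money Washing and illegal laundering; "
    "money laundering activities were investigated."
)
```

Three different surface forms are mapped to the same concept, which is what makes conceptual indexing robust to synonymic variation.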

Lexical ambiguity • Homonyms are words that share the same spelling but have different meanings (unrelated in origin) – bank (financial institution vs. river bank) – rarely met in the same domain, except a broad one – easily recognized by non-linguists – different concepts, different sets of relations • Polysemes are words with the same spelling and distinct but related meanings – bank (financial institution vs. building) – very often met in any domain – regular polysemes (institutions and their buildings) – difficult for non-linguists to recognize – tendency to use the same ontology concept for related senses

Lexical ambiguity (homonyms): bow

Lexical ambiguity (polysemes) • Transport – They have succeeded in stopping the transport of live animals (= moving) – mechanism of contactless payment in public transport (= vehicles) • Regular polysemy – Tree – wood (material): birch • Non-linguists cannot recognize the different senses, but they notice strange deviations in relations

Lexical ambiguity (polysemes) • How to help yourself: use unambiguous synonymous phrases – Transport 1 = transportation process – Transport 2 = transport vehicle – Birch 1 = birch tree – Birch 2 = birch wood • This makes it possible to see the different entities behind closely related senses

Relations of an ontology • The set of relations of an ontology can be non-evident • Main relations – Class-subclass – Instance relation – Role relations • Different properties: transitivity etc. • Old AI books and manuals use the same relation, “is_a”, in all cases • The diagnostic expression “X is a Y” can sound appropriate in all of these cases

Class-subclass relation • Relation between two sets of entities (classes) (many-to-many): birch - tree • Properties: transitivity, inheritance • Rules: – If class A is a subclass of class B, then each instance of class A is also an instance of B – Top-level classes (categories) should coincide for A and B • Real example of a mistake: – river – water object – water – substance -> – Moscow river is a Substance??
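The mistake on this slide follows mechanically from transitivity, which a few lines of code make visible. This is a minimal sketch with hypothetical dictionaries `SUBCLASS_OF` and `INSTANCE_OF` that encode the chain exactly as listed on the slide, including the faulty link:

```python
# Direct class-subclass links; the chain river -> water object -> water
# -> substance is copied from the slide and contains the modeling mistake.
SUBCLASS_OF = {
    "birch": {"tree"},
    "river": {"water object"},
    "water object": {"water"},
    "water": {"substance"},
}
INSTANCE_OF = {"Moscow river": "river"}


def superclasses(cls):
    """All classes reachable from cls via class-subclass links (transitive)."""
    seen, stack = set(), [cls]
    while stack:
        for parent in SUBCLASS_OF.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def classes_of(instance):
    """Direct class of an instance plus everything inherited by transitivity."""
    direct = INSTANCE_OF[instance]
    return {direct} | superclasses(direct)


inferred = classes_of("Moscow river")
```

Here `inferred` contains "substance", so the Moscow river is inferred to be a substance. The rule that top-level categories of A and B should coincide is exactly what rules out the offending link between an object and a substance.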

Instance relation • One-to-many relation – Moscow river – instance of river – Teacher – instance of profession • Not transitive – Rex, Poodle, dog breed, dog: what are the relations? – Rex is an instance of poodle – Poodle is an instance of dog breed – Poodle is a subclass of dog – Rex is not a dog breed – Rex is a dog [Figure labels: Dog breed, Poodle, Rex; links Subclass, Instance, Instance; one link crossed out (X)]
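The Rex example can be checked with a small sketch in which instance links are inherited along class-subclass links but are never chained with other instance links. The dictionaries and function names below are hypothetical:

```python
# Facts from the slide.
SUBCLASS_OF = {"poodle": "dog"}
INSTANCE_OF = {"Rex": "poodle", "poodle": "dog breed"}


def is_subclass(cls, ancestor):
    """True if ancestor is reachable from cls via class-subclass links."""
    while cls in SUBCLASS_OF:
        cls = SUBCLASS_OF[cls]
        if cls == ancestor:
            return True
    return False


def is_instance(entity, cls):
    """Instance links combine with subclass links, not with each other."""
    direct = INSTANCE_OF.get(entity)
    if direct is None:
        return False
    return direct == cls or is_subclass(direct, cls)


rex_is_dog = is_instance("Rex", "dog")
rex_is_breed = is_instance("Rex", "dog breed")
poodle_is_breed = is_instance("poodle", "dog breed")
```

`rex_is_dog` and `poodle_is_breed` are true, while `rex_is_breed` is false: the instance link from poodle to dog breed is not passed down to Rex.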

Roles and types • Roles: student, employer, terrorist, player • Types: person, animal, building, car • A role is a type under certain conditions • A student is a person in the role of learning • Properties of roles: – Roles are created dynamically – Roles can play other roles – A type can play many different roles

Confusion of type-role relations with class-subclass relations • A frequent mistake of almost every beginner [Figure: Employer connected to Person and to Organization by class-subclass links, both crossed out (X)] • Not every person is an employer, and an organization is not an employer in all situations • Problems with inference

Text-motivated confusion of types and roles • Natural substances such as salt, sugar, vinegar, alcohol, … are also used as traditional preservatives. (Wikipedia) • Often salt and other preservatives are added to canned foods. (http://www.family-health-and-nutrition.com/this-vs-that.html) • What is the relation between salt and preservative? – Class-subclass? – Class-instance? – … • In practice, beginners usually try to establish a class-subclass relation; however, this is a type-role relation: preservative is a role of substances.
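One way to keep types and roles apart in a data model is to store the type as a fixed attribute and the roles as a set that can change over time. The following is a minimal sketch under that assumption; the class `Entity`, its field names, the extra role "seasoning" and the organization "Acme Ltd" are hypothetical illustrations, not part of the tutorial.

```python
from dataclasses import dataclass, field


@dataclass
class Entity:
    name: str
    entity_type: str                          # rigid: holds in all situations
    roles: set = field(default_factory=set)   # dynamic: may come and go

    def play(self, role):
        self.roles.add(role)

    def stop_playing(self, role):
        self.roles.discard(role)


salt = Entity("salt", "substance")
salt.play("preservative")   # role from the slide
salt.play("seasoning")      # hypothetical second role

acme = Entity("Acme Ltd", "organization")  # hypothetical organization
acme.play("employer")
acme.stop_playing("employer")              # still an organization afterwards
```

Because "preservative" and "employer" are never placed in the class hierarchy, no inference step can conclude that every substance is a preservative or that every organization is an employer.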

Automatic extraction of relations from texts • Many scientific publications: extraction of synonyms, taxonomies, part-whole relations etc. • But in a complex domain it is impossible to rely fully on automatic tools • In many cases the relations extracted are the evident ones • Causes – Multiword expressions – Ambiguity of language expressions – Contextual dependence – Necessity of processing a very large domain text collection
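To illustrate why pattern-based extraction tends to return evident relations, here is a minimal sketch of one Hearst-style lexico-syntactic pattern ("X such as A, B and C"). The technique is a common one from the literature and is swapped in here as an example; the tutorial does not say which tools were used. The regular expression, the function name `extract_is_a` and the simplified input sentence are assumptions.

```python
import re

# Pattern "<hypernym> such as <hyponym list>"; the hypernym is limited to
# one or two words, which is a deliberate simplification.
SUCH_AS = re.compile(
    r"(?P<hyper>[A-Za-z]+(?: [A-Za-z]+)?) such as "
    r"(?P<hypos>[a-z]+(?:, [a-z]+)*(?:,? and [a-z]+)?)"
)


def extract_is_a(sentence):
    """Return (hyponym, hypernym) pairs suggested by the pattern."""
    match = SUCH_AS.search(sentence)
    if match is None:
        return []
    hypernym = match.group("hyper").lower()
    hyponyms = re.split(r",? and |, ", match.group("hypos"))
    return [(hyponym, hypernym) for hyponym in hyponyms]


# Simplified version of the Wikipedia sentence about preservatives.
pairs = extract_is_a(
    "Natural substances such as salt, sugar, vinegar and alcohol "
    "are also used as traditional preservatives."
)
```

The pattern finds that salt, sugar, vinegar and alcohol are natural substances, which is evident. It says nothing reliable about how salt relates to preservative, and a second naive pattern over "salt and other preservatives" would suggest a class-subclass link where the relation is in fact type-role.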


Automatic text categorization • Main approaches – Knowledge-based methods (based on rules) – Machine learning methods, very popular at scientific conferences • Text categorization in real practice (operational text categorization) – A training collection should exist – Experts should categorize documents in a consistent way – Every category needs a sufficient number of training examples • In practice, knowledge-based systems are widely used • Reuters (provider of the well-known Reuters-21578 training collection) uses a knowledge-based system to categorize its own documents

Subjectivity of experts • Experts’ agreement in manual text categorization is around 60%

Our text categorization projects • Use of both approaches, depending on the task and data • The knowledge-based approach uses knowledge from our large resource, the RuThes thesaurus • Projects – Classifier for Central Election Committee (450 categories, 4 levels) – Classifier of Russian legislation (1169 categories, 3000 categories) – Classifier of English economic research papers (700 categories) – Classifier of public opinion polls (350 categories) – Classifier of banking documents and news (200 categories) – General news classifiers – and others

Thesaurus on sociopolitical life • Sociopolitical domain: social life of contemporary society • Includes: thematic vocabulary and terminology from such domains as economy, finance, defense, law, sport, arts, military conflicts etc. • Domain for such documents as government documents, legal acts, international treaties, newspaper articles, news reports • 36 thousand concepts, 100 thousand terms, 140 thousand direct relations • Applications: conceptual indexing; automatic text categorization, document clustering, automatic text summarization, question-answering.

Levels of Hierarchy [Figure labels: Socio-Political Domain, Taxation, Law, Accounting, Banking]

Thesaurus-based text categorization • Use of knowledge described in the Thesaurus • Manual description of Boolean expressions for categories based on small number of thesaurus concepts • Automatic thesaurus-based expansion of Boolean expressions • Thesaurus-based thematic representation of the text content independent of the genre and the length of a text (lexical chain technique)

Describing a category with supporting concepts • Categorization of legal acts • 200.020. Heads of states summits • { ( HEADS OF STATES SUMMIT [Y] ) OR { ( NEGOTIATIONS [N] ) ( INTERNATIONAL NEGOTIATIONS [Y] ) ( INTERNATIONAL CONTACTS [N] ) ( MEETING [N] ) } AND ( HEAD OF STATE [L] ) }

Expanded representation of the category • { ( HEADS OF STATES SUMMIT [Y] ) ( summit, summit meeting, top-level meeting, head of states meeting ) OR { ( NEGOTIATIONS [N] ) ( negotiations, talks ) ( INTERNATIONAL NEGOTIATIONS [Y] ) ( international talks, interstate talks, diplomatic negotiations, multinational talks, intergovernmental talks, contracting nations, negotiating states … ) ( INTERNATIONAL CONTACTS [N] ) ( international intercourse, transnational contacts … ) ( MEETING [N] ) } AND ( HEAD OF STATE [L] ) ( leader of country, president of country, federal president, RF president, US president, monarch, …, emir of Kuwait … ) }
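The Boolean description of the category can be read as a small program: a document belongs to "Heads of states summits" if it mentions the summit concept directly, or if it mentions any negotiation-type concept together with a head of state. The sketch below is an illustration only. The dictionary `THESAURUS` is a hypothetical cut-down stand-in holding a few of the text entries listed on the slide, the single entry for MEETING is assumed, and matching is plain substring search rather than the lexical chain technique used by the authors.

```python
# Hypothetical mini-thesaurus: concept -> some text entries from the slide.
THESAURUS = {
    "HEADS OF STATES SUMMIT": ["summit", "top-level meeting",
                               "head of states meeting"],
    "NEGOTIATIONS": ["negotiations", "talks"],
    "INTERNATIONAL NEGOTIATIONS": ["international talks", "interstate talks",
                                   "diplomatic negotiations"],
    "INTERNATIONAL CONTACTS": ["international intercourse",
                               "transnational contacts"],
    "MEETING": ["meeting"],  # assumed entry, not listed on the slide
    "HEAD OF STATE": ["leader of country", "president of country",
                      "rf president", "us president", "monarch"],
}

CONTACT_CONCEPTS = ("NEGOTIATIONS", "INTERNATIONAL NEGOTIATIONS",
                    "INTERNATIONAL CONTACTS", "MEETING")


def mentions(text, concept):
    """True if any text entry of the concept occurs in the text."""
    lowered = text.lower()
    return any(entry in lowered for entry in THESAURUS[concept])


def is_heads_of_states_summit(text):
    """SUMMIT OR ((any contact concept) AND HEAD OF STATE)."""
    direct = mentions(text, "HEADS OF STATES SUMMIT")
    contact = any(mentions(text, c) for c in CONTACT_CONCEPTS)
    return direct or (contact and mentions(text, "HEAD OF STATE"))


doc_a = is_heads_of_states_summit("The RF president held talks in Geneva.")
doc_b = is_heads_of_states_summit("Trade talks continued on Monday.")
doc_c = is_heads_of_states_summit("A summit was announced for next year.")
```

`doc_a` and `doc_c` are assigned to the category, while `doc_b` is not: talks without a head of state do not satisfy the AND branch. The manual part of the work is the short Boolean expression; the long lists of entries come from the thesaurus.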

ROMIP: Russian Seminar on Information Retrieval • Russian TREC • Text categorization task • Categories: DMOZ, 247 categories of the 2nd level Top/World/Russian/*/* • Training collection: «DMOZ» (provided by Rambler) – 300 000 documents, 2100 sites • Testing collection: Belarusian Internet «BY.web» (granted by Yandex) – 1 500 000 documents, 19 000 sites • Our task: – Thesaurus-based text categorization – Measuring the time needed to create a categorization system – Evaluation
![Knowledge-based approach (8 man-hours): Category 135 «Martial arts»](https://slidetodoc.com/presentation_image/28134528ee8dea804aceb84961bcfbaa/image-51.jpg)
Knowledge-based approach (8 man-hours) • Category 135 «Martial arts» (F1-measure [OR] = 97%, R = 98%, P = 96%) • Boolean expression for the category: MARTIAL ARTS (E), where «E» means full expansion using the thesaurus tree • The expanded description includes: AIKIDO, JIUJUTSU, JUDO, KARATE, JUDOIST, KARATEKA …
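The reported figures are mutually consistent, which can be checked in a few lines, assuming the F1-measure here is the usual harmonic mean of precision and recall (the slide does not spell out the formula). The function name `f1_measure` is illustrative.

```python
def f1_measure(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values from the slide: P = 96%, R = 98%.
f1 = f1_measure(0.96, 0.98)
print(f"{f1:.0%}")  # prints "97%"
```

The exact value is 1.8816 / 1.94, about 0.9699, which rounds to the 97% given on the slide.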
![ROMIP: web-page categorization [or]](https://slidetodoc.com/presentation_image/28134528ee8dea804aceb84961bcfbaa/image-52.jpg)
ROMIP: web-page categorization [or]

Benefits from Large-Scale Linguistic Ontologies Use in Information Retrieval • Tasks and benefits: – Web Search: 0+ % – Corporate Search / Legal Search: 10 % – Long Queries / Verbose Queries: 15 % – Text Categorization: 15-50 % – News Clustering: 15 % – Summarization, Visualization, Multi-Document Summarization: ++ (SUMMAC)

Conclusion • Complex domains – Broad domains including many heterogeneous entities – Vague boundaries – Knowledge stored in texts • Special efforts are needed to find the boundaries • Knowledge acquisition from texts – Partial automation – Necessity to overcome the ambiguity and vagueness of natural-language texts, even for non-linguists