Tools and Interfaces for Wordnet construction, linking and maintenance
Abhishek G. Nanda (03005031)
Under the guidance of: Prof. Pushpak Bhattacharyya

Wordnet
• Language - Means of communication using encoded information
• Words - Units used for communicating information
• Semantics - Meanings of words and word forms

Wordnet
• Dictionary - List of alphabetically arranged words with meanings
• Thesaurus - List of alphabetically arranged concepts with word forms
• What is Wordnet?

Wordnet
• Lexical database of words
  • Arranged based on concepts
  • Grouped based on synonymy
• Synonymy - Property of different words sharing the same meaning in a context. Eg. buy and purchase
• Polysemy - Property of a word having different meanings in different contexts. Eg. bank as financial institution and as river bank

Wordnet - Lexical Matrix
Columns are word forms (F), rows are word meanings (M).

Word Meanings | F1            | F2          | F3          | …                 | Fn
M1            | E1,1 (depend) | E1,2 (bank) | E1,3 (rely) |                   |
M2            |               | E2,2 (bank) |             | E2,… (embankment) |
M3            |               | E3,2 (bank) | E3,3        |                   |
…             |               |             |             | …                 |
Mm            |               |             |             |                   | Em,n
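
The lexical matrix can be read as a mapping from meanings to the word forms that express them: forms sharing a row are synonyms, and a form that appears in several rows is polysemous. A minimal Java sketch of that reading follows. The class name `LexicalMatrix` and the meaning IDs are hypothetical and are not part of the presented tools.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration of the lexical matrix; not part of the presented tools.
class LexicalMatrix {
    // meaning ID -> word forms that express it (one row of the matrix)
    private final Map<String, Set<String>> rows = new LinkedHashMap<>();

    // Records an entry E(meaning, form) in the matrix.
    void add(String meaning, String form) {
        rows.computeIfAbsent(meaning, k -> new LinkedHashSet<>()).add(form);
    }

    // Synonyms: all other forms that share at least one meaning with the given form.
    Set<String> synonymsOf(String form) {
        Set<String> result = new LinkedHashSet<>();
        for (Set<String> forms : rows.values()) {
            if (forms.contains(form)) {
                result.addAll(forms);
            }
        }
        result.remove(form);
        return result;
    }

    // Polysemy: a form is polysemous when it appears in more than one row.
    boolean isPolysemous(String form) {
        int count = 0;
        for (Set<String> forms : rows.values()) {
            if (forms.contains(form)) {
                count++;
            }
        }
        return count > 1;
    }
}
```

With the entries of the matrix above, "bank" would be polysemous (rows M1, M2 and M3), while "depend" and "rely" would be synonyms through row M1.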

Wordnet - Relations
• Semantic Relations
  • Hypernymy and Hyponymy
  • Meronymy and Holonymy
  • Entailment
  • Troponymy
  • Coordinate terms
• Lexical Relations
  • Antonymy
  • Gradation

Wordnet - Relations
• Hypernymy and Hyponymy
  • "is a kind of" relation
  • leaf is the hypernym of neem leaf
  • neem leaf is the hyponym of leaf
• Meronymy and Holonymy
  • part-whole relation
  • root is the meronym of tree
  • tree is the holonym of root

Wordnet - Relations
• Entailment
  • implication
  • snore entails sleep
• Troponymy
  • manner elaboration
  • roar is the troponym of speak
• Coordinate terms
  • Common hypernym
  • wolf and dog are coordinate terms

Wordnet - Relations
• Antonymy
  • opposites
  • fat is the antonym of thin
• Gradation
  • Intermediate concepts in antonymy
  • morning -> noon -> evening

Wordnet - Wordnets
• PWN - Princeton WordNet for English language
• EuroWordNet - Wordnet for European languages
• HWN - Hindi Wordnet for Hindi language

Hindi Wordnet
• Relations borrowed - synonymy, hypernymy, holonymy, troponymy, entailment, etc.
• Defines 8 part-whole relationships
• Defines 3 types of antonymy relations
  • Gradable antonym (गरम-ठंडा, hot-cold)
  • Complementary antonym (जीवित-मृत, alive-dead)
  • Converse antonym (लेन-देन, taking-giving)

Hindi Wordnet
• Gradation
  • Intermediate terms
    • Pre-Intermediate terms
    • Post-Intermediate terms
  • Eg. between सूखा (dry) and गीला (wet): शुष्क - नम - तर
  • 10 domains of interpretation. Eg. State, Size, Gender, etc.

Hindi Wordnet - Verbs
• Simple Verb - One root. Eg. खाना (to eat)
• Compound Verb - Made up of another POS. Eg. मीठा लगना (to taste sweet)
• Combination Verb - Made of two related verbs. Eg. पढ़ना-लिखना (to read and write)
• Onomatopoeic Verb - Eg. खटखटाना from खटखट
• Conjunct Verb - Hidden sense of action. Eg. ले जाना (to take away)

Hindi Wordnet - Verbs
• Causative verbs
  • First causative verb - Eg. सुलाना (to make somebody sleep)
  • Second causative verb - Eg. सुलवाना (to make somebody sleep through the effort of a third person)

Hindi Wordnet - Creation
Principles for Wordnet creation
• Minimality - Minimal set. Eg. {घर, कमरा, कक्ष}
• Coverage - Coverage of words. Eg. {घर, कमरा, कक्ष}
• Replaceability - Mutual replaceability in a context. Eg. अमेरिका में दो साल बिताने के बाद श्याम स्वदेश/घर लौटा (After spending two years in America, Shyam returned to his homeland/home)

Sanskrit Wordnet
Concept-based Multilingual dictionary
• Need
  • Loss of synonymy when moving across languages. Eg. dark and evil are synonymous in English but their counterparts अंधेरा and दुष्ट are not.
  • Number of lexicographers required: O(n^2)

Sanskrit Wordnet - Concept-based Multilingual dictionary
Each row pairs a concept (Concept ID: Concept description) with its words in L1 (English), L2 (Hindi) and L3 (Sanskrit).

Concept ID: Concept description
  L1 (English): (W1, W2, W3, ..)
  L2 (Hindi): (W4, W5, W6, ..)
  L3 (Sanskrit): (W7, W8, W9, ..)

4066: any of various long-tailed primates (excluding the prosimians)
  L1 (English): (monkey)
  L2 (Hindi): (बंदर, बन्दर, बानर, वानर, कीश, कपि, मर्कट, ..)
  L3 (Sanskrit): (वानर, कपि, प्लवङ्ग, प्लवग, शाखामृग, वलीमुख, मर्कट, ..)

2186: a typical star that is the source of light and heat for the planets in the solar system
  L1 (English): (sun)
  L2 (Hindi): (सूर्य, सूरज, भानु, दिवाकर, भास्कर, प्रभाकर, दिनकर, रवि, ..)
  L3 (Sanskrit): (सूर्य, सविता, आदित्य, मित्र, अरुण, भानु, पूषा, ..)
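
A concept-based dictionary can be pictured as a table keyed by concept ID, where every language contributes its own word list for the concept. The Java sketch below illustrates that idea under stated assumptions: the class `ConceptDictionary`, its methods and the romanised words are hypothetical and do not come from the presented tools. The helper `pairwiseDictionaries` only counts how many language pairs exist among n languages, which grows as O(n^2), whereas a concept pivot needs one word list per language.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of a concept-based multilingual dictionary.
class ConceptDictionary {
    // concept ID -> (language -> synonymous words for that concept)
    private final Map<Integer, Map<String, List<String>>> concepts = new LinkedHashMap<>();

    // Registers the words of one language for a concept.
    void add(int conceptId, String language, String... words) {
        concepts.computeIfAbsent(conceptId, k -> new LinkedHashMap<>())
                .put(language, Arrays.asList(words));
    }

    // Looks up a word through the shared concept instead of a bilingual word list.
    List<String> translate(String word, String from, String to) {
        List<String> result = new ArrayList<>();
        for (Map<String, List<String>> byLanguage : concepts.values()) {
            List<String> source = byLanguage.get(from);
            if (source != null && source.contains(word)) {
                result.addAll(byLanguage.getOrDefault(to, new ArrayList<>()));
            }
        }
        return result;
    }

    // Number of distinct language pairs among n languages: n(n-1)/2.
    static int pairwiseDictionaries(int n) {
        return n * (n - 1) / 2;
    }
}
```

Because every language links only to the concept, adding a new language means adding one word list per concept rather than one bilingual mapping per existing language.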

Sanskrit Wordnet
Challenges observed during construction of Marathi Wordnet:
• Single word to synthetic expression. Eg. bankrupt -> दिवाला निकालना
• Culture specific concepts. Eg. girlfriend. Requires transliteration such as महिला मित्र
• Splitting of concepts. Eg. फ़ीका (tasteless) in Hindi -> अगोड (less sweet), अळणी (less salty), मिळमिळीत (less spicy) in Marathi

Sanskrit Wordnet
Challenges observed during Indo Wordnet workshop at Coimbatore, June 2009:
• Varied usage across regions and people. Eg. In Kashmiri, the Muslim community uses separate words for drinking water and water, but the Hindu community uses one word.
• Single-word and multi-word expressions in the same language. Eg. In Nepali, मोह and मोह-माया both mean infatuation.

Sanskrit Wordnet - Sanskrit
• Indo-Aryan language
  • Hinduism
  • Buddhism
• Classical Sanskrit - Panini
• Vedic Sanskrit - pre-Classical

Sanskrit Wordnet - Sanskrit Etymology
• Etymology of Verbs
  • गण - Ten classes based on how stem is generated
  • इट् - Three groups based on position of tense marker
  • उपसर्ग - 22 prepositional particles that modify a root

Synset Marking
Grouping of synsets based on frequency of occurrence and usage in language
• Universal concepts
  • who and what
  • honesty

SynsetMarker - Interface

SynsetMarker - Features
• Display of synset fields
• Browsing
• Search
  • Word
  • ID
• Marking - Universal, Common in Hindi and Uncommon
• Save/Exit
• Shortcuts

SynsetMarker - API
• records
  • DefineRecord
  • SynsetRecord
• operations
  • SynsetOperator
  • RecordReader
  • RecordWriter
• gui
  • Interface

SynsetMarker - Process
• First round divided among 6 people
  • 31000 synsets marked
  • Universal and Common clubbed - 15234 synsets
  • Common in Hindi - 6771 synsets
  • Uncommon - 10987 synsets
• Second round voting scheme
  • Common - 13205 synsets

Core Synset Selection
• Bharatiya Vyavahara Kosh
  • English and 15 Indian languages
  • 2000 concepts with domains
  • खेल (game), प्राणी (animal), फल (fruit)
• Link synsets to words in Kosh
  • Polysemy
    • अनन्नास as pineapple fruit and as pineapple plant

DomainClassifier - Interface

DomainClassifier - Features
• Display of synset fields
• Browsing through records
• Marking right synset for a word and a domain
• Save/Export

DomainClassifier - API
• records
  • DefineRecord
  • SynsetRecord
• operations
  • SynsetOperator
  • RecordReader
  • RecordWriter
• gui
  • Interface

DomainClassifier - Process
• Groupings
  • Single IDs
  • Multiple IDs
  • No IDs
• Rounds of marking
  • Common synsets
  • Common in Hindi synsets
  • Uncommon synsets

DomainClassifier - Process
• End of process
  • Core - 1969 synsets
  • Common - 11658 synsets

Online SynsetMarker - Interface

Online SynsetMarker - API
Written in PHP
• login.php - Interface to login as a user or as an admin or to register as a new user
• process.php - To process login/register data and accordingly direct a user
• logout.php - To logout a user
• mainprocess.php - Processing of data to display unmarked synset
• main.php - Display of synset with buttons to mark as Common or Uncommon
• admin.php - Admin page with statistical data of number of marked synsets per user and number of users based on synset marks
• adminpassword.php - Password interface to login as admin
• userprofile.php - Profile data of a particular user

Online SynsetMarker - Process
• Threshold for dropping synset as Uncommon
  • Had to be set to 1
• Common - 10312 synsets
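
The slides do not say how the Uncommon threshold is applied, so the following Java sketch is only one plausible reading: a synset is dropped once the number of users who marked it Uncommon reaches the threshold, and kept as Common otherwise. The class `UncommonVoteFilter` and its methods are hypothetical and do not describe the actual PHP implementation.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of threshold-based filtering; the counting rule is an assumption.
class UncommonVoteFilter {
    private final int threshold;
    // synset ID -> number of users who marked the synset as Uncommon
    private final Map<Integer, Integer> uncommonVotes = new LinkedHashMap<>();

    UncommonVoteFilter(int threshold) {
        this.threshold = threshold;
    }

    // Records one user's mark for a synset.
    void recordVote(int synsetId, boolean markedUncommon) {
        uncommonVotes.merge(synsetId, markedUncommon ? 1 : 0, Integer::sum);
    }

    // Synsets whose Uncommon count stays below the threshold are kept as Common.
    List<Integer> commonSynsets() {
        List<Integer> kept = new ArrayList<>();
        for (Map.Entry<Integer, Integer> entry : uncommonVotes.entrySet()) {
            if (entry.getValue() < threshold) {
                kept.add(entry.getKey());
            }
        }
        return kept;
    }
}
```

Under this reading, a threshold of 1 means a single Uncommon mark from any user is enough to drop a synset.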

Sanskrit Wordnet
• Interface for creation of Sanskrit Wordnet
• Based on idea of Concept-based Multilingual dictionary

User Interface - Configure

User Interface - Main

User Interface - Panels
• Help Panel: Buttons for Commenting, Synchronizing and References tool.
• Search Panel: Search word or ID or perform advanced search. Font increase/decrease.
• Synset Panels: Synset data fields and completion status.
• Tool Panel: English synset, Link tool, Etymology tool.
• Browse Panel: Browsing through records, saving and exiting.

User Interface - Features
• Reference tool
• Synchronize tool
• Advanced Search
• English synsets tool
• Link tool
• Etymology tool

User Interface - Features
Keyboard Shortcuts
• Undo feature - Monitor keyboard actions and undo on Ctrl-Z
• Saving feature - Monitor change in field values and save on Ctrl-S
• Search - Ctrl-F for quick search access

Interface API
Problems and Requirements
• Huge volumes of data (eg. 30,000 synsets)
• Links between different data
• Efficient and user-friendly GUI
• Sufficient querying
  • Grouping
  • Review separation

Graphical User Interface
Components are created lazily through getter methods:

    private JButton saveButton = null;

    // Creates the save button on first use and returns the same instance afterwards.
    public JButton getSaveButton() {
        if (saveButton == null) {
            saveButton = new JButton();
        }
        return saveButton;
    }

Graphical User Interface
• Panels
  • Hierarchical structure
• Components (within Panels)
  • Classes JButton, JTextField, JCheckBox, etc.
• Listeners
  • ActionListener - actions performed by user
  • KeyListener - key strokes (undo, search) and shortcuts

Synset
• Synset ID: a unique number identifying a synset
• Category: POS category of the words
• Concept: The part of the gloss that gives a brief summary of what the synset represents
• Example: One or more examples of the words in the synset being used in sentences
• Synset: The set of synonymous words comprised in the synset

Data structure - SynsetRecord
Class SynsetRecord
• Strings to hold field values
• Functions:
  • equals(otherObject)
  • isBetterThan(otherObject)
  • isComplete()
  • …
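
A record class along these lines could look like the Java sketch below. Only the method names `equals`, `isBetterThan` and `isComplete` and the five synset fields come from the slides. The field types, the constructor and the rule that a record with more filled fields is "better" are assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Objects;

// Hypothetical sketch; field types and comparison rules are assumptions.
class SynsetRecord {
    final String id;        // Synset ID
    final String category;  // POS category
    final String concept;   // gloss summary
    final String example;   // example sentence(s)
    final List<String> words;

    SynsetRecord(String id, String category, String concept, String example, String... words) {
        this.id = id;
        this.category = category;
        this.concept = concept;
        this.example = example;
        this.words = Arrays.asList(words);
    }

    private static boolean filled(String value) {
        return value != null && !value.trim().isEmpty();
    }

    // Counts how many of the five fields carry data.
    private int filledFieldCount() {
        int count = 0;
        for (String value : new String[] {id, category, concept, example}) {
            if (filled(value)) {
                count++;
            }
        }
        return words.isEmpty() ? count : count + 1;
    }

    // Assumption: a record is complete when every field is filled.
    boolean isComplete() {
        return filledFieldCount() == 5;
    }

    // Assumption: the record with more filled fields is the better one.
    boolean isBetterThan(SynsetRecord other) {
        return filledFieldCount() > other.filledFieldCount();
    }

    @Override
    public boolean equals(Object otherObject) {
        if (this == otherObject) {
            return true;
        }
        if (!(otherObject instanceof SynsetRecord)) {
            return false;
        }
        SynsetRecord o = (SynsetRecord) otherObject;
        return Objects.equals(id, o.id) && Objects.equals(category, o.category)
                && Objects.equals(concept, o.concept) && Objects.equals(example, o.example)
                && words.equals(o.words);
    }

    @Override
    public int hashCode() {
        return Objects.hash(id, category, concept, example, words);
    }
}
```

A comparison such as `isBetterThan` is useful when two copies of the same synset have been edited separately and one must be preferred.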

Data structure - DefineRecord

"define-end" language
Example (description of a book about cricket):

    define book sixer
        length :: 700
        topic :: cricket
        define chapter 1
            length :: 300
            topic :: batting
        end
        define chapter 2
            length :: 400
            topic :: bowling :: scientific
        end

Data structure - DefineRecord
Example (etymology data for synset ID 1476):

    define etymology 1476
        करमवततव :: अकरमक
        finished :: true
        define word कष
            इट :: सट
            पद :: परसमपद
            सवर :: कत
            रप :: कषय
            उपसरग :: अप
            स ध त ::
        end

Data structure - DefineRecord
• Data structure to hold parametric and nested data
• Functions:
  • addField(objectToAdd) - Function to add a parameter or a nested instance of DefineRecord
  • toString() - Function to export a record in the define-end language
  • getParameterField(parameterName) - Function to return a specific parameter field
  • …
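
The three functions listed for DefineRecord can be sketched in Java as follows. The method names come from the slides, but everything else is an assumption: the helper class `ParameterField`, the constructor, the two-space indentation, and in particular the choice to close every block (including the outermost one) with `end`, which the slide examples do not show.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: one "name :: value :: value" line.
class ParameterField {
    final String name;
    final List<String> values;

    ParameterField(String name, String... values) {
        this.name = name;
        this.values = Arrays.asList(values);
    }

    @Override
    public String toString() {
        return name + " :: " + String.join(" :: ", values);
    }
}

// Hypothetical sketch of a record holding parameters and nested records.
class DefineRecord {
    private final String type;
    private final String name;
    private final List<Object> fields = new ArrayList<>();

    DefineRecord(String type, String name) {
        this.type = type;
        this.name = name;
    }

    // Accepts either a parameter or a nested DefineRecord.
    void addField(Object objectToAdd) {
        if (!(objectToAdd instanceof ParameterField) && !(objectToAdd instanceof DefineRecord)) {
            throw new IllegalArgumentException("Expected a ParameterField or a DefineRecord");
        }
        fields.add(objectToAdd);
    }

    // Returns the first parameter with the given name, or null if there is none.
    ParameterField getParameterField(String parameterName) {
        for (Object field : fields) {
            if (field instanceof ParameterField && ((ParameterField) field).name.equals(parameterName)) {
                return (ParameterField) field;
            }
        }
        return null;
    }

    // Exports the record in the define-end language.
    @Override
    public String toString() {
        return toString("");
    }

    private String toString(String indent) {
        StringBuilder sb = new StringBuilder();
        sb.append(indent).append("define ").append(type).append(' ').append(name).append('\n');
        for (Object field : fields) {
            if (field instanceof DefineRecord) {
                sb.append(((DefineRecord) field).toString(indent + "  "));
            } else {
                sb.append(indent).append("  ").append(field).append('\n');
            }
        }
        sb.append(indent).append("end\n");
        return sb.toString();
    }
}
```

Keeping parameters and nested records in one ordered list preserves the order in which fields were added, so an exported record reads the same way it was built.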

Data Operations - File I/O
• Unicode text data manipulation - UTF-8 format
• Classes for file parsing/writing:
  • RecordWriter
  • RecordReader

Data Operations - File I/O
• RecordReader
  • SynsetRecord parser
  • DefineRecord parser
  • String converters
• RecordWriter
  • SynsetRecord writer
  • DefineRecord writer
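
Reading and writing Devanagari records reliably depends on naming the UTF-8 charset explicitly instead of relying on the platform default. The Java sketch below shows this with the standard `java.nio.file.Files` API. The class `Utf8RecordFile` and its methods are hypothetical stand-ins and do not reproduce the actual RecordReader and RecordWriter classes.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of UTF-8 line-based file access.
class Utf8RecordFile {
    // Writes each line to the file, encoded as UTF-8.
    static void writeLines(Path file, List<String> lines) throws IOException {
        try (BufferedWriter out = Files.newBufferedWriter(file, StandardCharsets.UTF_8)) {
            for (String line : lines) {
                out.write(line);
                out.newLine();
            }
        }
    }

    // Reads all lines from the file, decoding them as UTF-8.
    static List<String> readLines(Path file) throws IOException {
        List<String> lines = new ArrayList<>();
        try (BufferedReader in = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            String line;
            while ((line = in.readLine()) != null) {
                lines.add(line);
            }
        }
        return lines;
    }
}
```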

Data Operations - RecordModel Interface
• Model to create mechanism for working with a new data structure
• Handles parsing, writing, querying and ID retrieval
• Models written as Classes:
  • SynsetRecordModel
    • EnglishSynsetRecordModel
  • AbstractDefineRecordModel

Data Operations - RecordModel Interface
• int getRecordId(E record): Function to return the record ID of a record
• boolean isBetterThan(E a, E b): Function to return whether record a weighs better than record b
• boolean isFinished(E a): Function to return whether a record can be set as completed
• E mergeRecords(E a, E b): Function to merge the data in two separate records into one
• boolean searchWord(String word, E a): Function to perform a query (defined in String word) on a record
• E parseRecord(RecordReader fileHandle): Function to parse a record from a file
• void writeRecord(RecordWriter fileHandle, E a): Function to write a record into a file
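
The generic interface can be sketched in Java as below, together with one toy implementation. The method names and signatures follow the slide, except that `parseRecord` and `writeRecord` are left out because RecordReader and RecordWriter are not shown. The record type `WordRecord`, the model `WordRecordModel` and all of its rules (more words is better, non-empty is finished, merging is a union) are assumptions for illustration only.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Subset of the RecordModel interface; the file I/O methods are omitted here.
interface RecordModel<E> {
    int getRecordId(E record);
    boolean isBetterThan(E a, E b);
    boolean isFinished(E a);
    E mergeRecords(E a, E b);
    boolean searchWord(String word, E a);
}

// Hypothetical record type: an ID plus a list of words.
class WordRecord {
    final int id;
    final List<String> words;

    WordRecord(int id, List<String> words) {
        this.id = id;
        this.words = new ArrayList<>(words);
    }
}

// Hypothetical model that tells generic code how to handle WordRecord.
class WordRecordModel implements RecordModel<WordRecord> {
    public int getRecordId(WordRecord record) {
        return record.id;
    }

    // Assumption: the record with more words weighs better.
    public boolean isBetterThan(WordRecord a, WordRecord b) {
        return a.words.size() > b.words.size();
    }

    // Assumption: a record with at least one word counts as finished.
    public boolean isFinished(WordRecord a) {
        return !a.words.isEmpty();
    }

    // Merges by taking the union of both word lists, keeping first-seen order.
    public WordRecord mergeRecords(WordRecord a, WordRecord b) {
        Set<String> merged = new LinkedHashSet<>(a.words);
        merged.addAll(b.words);
        return new WordRecord(a.id, new ArrayList<>(merged));
    }

    public boolean searchWord(String word, WordRecord a) {
        return a.words.contains(word);
    }
}
```

Code that browses, searches or synchronizes records can then be written once against `RecordModel<E>` and reused for each data structure that supplies a model.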

Data Operations - RecordOperator Class
• Operator to provide functionality to work with records of data
• Load, Browse, Update, Search, Synchronize and Write
• Two kinds at the GUI level:
  • Parent Operator
  • Linker Operator

Data Operations - RecordOperator Class
Functions for each data type (depending on the corresponding RecordModel):
• Constructors for ParentOperator and LinkerOperator
• getRecord() - Function to obtain the current record
• setCurrentId() and getCurrentId() - Functions to set and obtain ID to work with
• getFirstId(), getPreviousId(), getNextId() and getLastId() - Functions to browse through records
• isFinished() and isAllFinished() - Functions to obtain completion status of records
• searchRecords() and advancedSearch() - Functions to perform search operations on the records
• …
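
The browsing functions map naturally onto a sorted map keyed by record ID. The Java sketch below covers only that browsing part and is not the actual RecordOperator: the class `RecordBrowser`, the `put` method and the choice to stay on the current ID at either end of the list are assumptions.

```java
import java.util.TreeMap;

// Hypothetical sketch of ID-based browsing over records kept in ID order.
class RecordBrowser<E> {
    private final TreeMap<Integer, E> records = new TreeMap<>();
    private int currentId;

    // Adds a record; the first record added becomes the current one.
    void put(int id, E record) {
        if (records.isEmpty()) {
            currentId = id;
        }
        records.put(id, record);
    }

    void setCurrentId(int id) {
        if (!records.containsKey(id)) {
            throw new IllegalArgumentException("Unknown record ID: " + id);
        }
        currentId = id;
    }

    int getCurrentId() {
        return currentId;
    }

    E getRecord() {
        return records.get(currentId);
    }

    int getFirstId() {
        return records.firstKey();
    }

    int getLastId() {
        return records.lastKey();
    }

    // Next higher ID, or the current ID when already at the last record.
    int getNextId() {
        Integer next = records.higherKey(currentId);
        return next != null ? next : currentId;
    }

    // Next lower ID, or the current ID when already at the first record.
    int getPreviousId() {
        Integer previous = records.lowerKey(currentId);
        return previous != null ? previous : currentId;
    }
}
```

A GUI "Next" button would then call `setCurrentId(getNextId())` followed by `getRecord()` to display the following synset.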

API Overview
• GUI defines one ParentOperator (eg. source synsets)
• GUI defines many LinkerOperators (eg. target synsets, link data, etc.)
• Models attached to the operators
• Data repositories are defined
• GUI browses, retrieves and manipulates data using operators.

Version history

Future work
• Tool to generate etymology format
• GUI functionality to display synsets from multiple languages
• Advanced commenting based on reviews and completion

Thank you
