Tools and Interfaces for Wordnet construction, linking and maintenance
Abhishek G. Nanda (03005031)
Under the guidance of: Prof. Pushpak Bhattacharyya

Wordnet
- Language - Means of communication using encoded information
- Words - Units used for communicating information
- Semantics - Meanings of words and word forms

Wordnet
- Dictionary - List of alphabetically arranged words with meanings
- Thesaurus - List of alphabetically arranged concepts with word forms
- What is Wordnet?

Wordnet
- Lexical database of words
  - Arranged based on concepts
  - Grouped based on synonymy
- Synonymy - Property of different words sharing the same meaning in a context. Eg. buy and purchase
- Polysemy - Property of a word having different meanings in different contexts. Eg. bank as a financial institution and as a river bank

Wordnet - Lexical Matrix
Word Meanings | Word Forms
              | F1            | F2          | F3          | … | Fn
M1            | (depend) E1,1 | (bank) E1,2 | (rely) E1,3 |   |
M2            |               | (bank) E2,2 |             |   | (embankment) E2,…
M3            |               | (bank) E3,2 | E3,3        |   |
…             |               |             |             | … |
Mm            |               |             |             |   | Em,n

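The matrix can be read as a mapping from meanings (rows) to the word forms (columns) that express them. The Java sketch below is only an illustration of that reading; the class `LexicalMatrix` and its methods `addEntry`, `synonyms` and `meaningsOf` are hypothetical names, not part of any Wordnet API.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of a lexical matrix: meanings are rows, word forms are columns.
class LexicalMatrix {
    // meaning ID -> word forms that express it (one row of the matrix)
    private final Map<String, Set<String>> rows = new LinkedHashMap<>();

    // Fill the cell E(meaning, wordForm).
    void addEntry(String meaning, String wordForm) {
        rows.computeIfAbsent(meaning, k -> new LinkedHashSet<>()).add(wordForm);
    }

    // Synonymy: all word forms found in the same row.
    Set<String> synonyms(String meaning) {
        return rows.getOrDefault(meaning, new LinkedHashSet<>());
    }

    // Polysemy: all rows in which the same word form appears.
    List<String> meaningsOf(String wordForm) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Set<String>> row : rows.entrySet()) {
            if (row.getValue().contains(wordForm)) {
                result.add(row.getKey());
            }
        }
        return result;
    }
}
```

With the entries from the slide, `synonyms("M1")` would hold depend, bank and rely, while `meaningsOf("bank")` would list M1, M2 and M3.
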
Wordnet - Relations
- Semantic Relations
  - Hypernymy and Hyponymy
  - Meronymy and Holonymy
  - Entailment
  - Troponymy
  - Coordinate terms
- Lexical Relations
  - Antonymy
  - Gradation

Wordnet - Relations
- Hypernymy and Hyponymy
  - is a kind of
  - leaf is the hypernym of neem leaf
  - neem leaf is the hyponym of leaf
- Meronymy and Holonymy
  - part-whole
  - root is the meronym of tree
  - tree is the holonym of root

Wordnet - Relations
- Entailment
  - implication
  - snore entails sleep
- Troponymy
  - manner elaboration
  - roar is the troponym of speak
- Coordinate terms
  - Common hypernym
  - wolf and dog are coordinate terms

Wordnet - Relations
- Antonymy
  - opposites
  - fat is the antonym of thin
- Gradation
  - Intermediate concepts in antonymy
  - morning -> noon -> evening

Wordnet - Wordnets
- PWN - Princeton WordNet for English language
- EuroWordNet - Wordnet for European languages
- HWN - Hindi Wordnet for Hindi language

Hindi Wordnet
- Relations borrowed - synonymy, hypernymy, holonymy, troponymy, entailment, etc.
- Defines 8 part-whole relationships
- Defines 3 types of antonymy relations
  - Gradable antonym (गरम-ठंडा)
  - Complementary antonym (जीवित-मृत)
  - Converse antonym (लेन-देन)

Hindi Wordnet
- Gradation
  - Intermediate terms
    - Pre-Intermediate terms
    - Post-Intermediate terms
  - Eg. सख ग ल - शषक - नम - तर -
  - 10 domains of interpretation. Eg. State, Size, Gender, etc.

Hindi Wordnet - Verbs
- Simple Verb - One root. Eg. खाना
- Compound Verb - Made up of another POS. Eg. मीठा लगना
- Combination Verb - Made of two related verbs. Eg. पढ़ना-लिखना
- Onomatopoeic Verb - Eg. खटखटाना from खटखट
- Conjunct Verb - Hidden sense of action. Eg. ले जाना

Hindi Wordnet - Verbs
- Causative verbs
  - First causative verb - Eg. सुलाना (to make somebody sleep)
  - Second causative verb - Eg. सुलवाना (to make somebody sleep through the effort of a third person)

Hindi Wordnet - Creation
- Principles for Wordnet creation
  - Minimality - Minimal set. Eg. {घर, कमरा, कक्ष}
  - Coverage - Coverage of words. Eg. {घर, कमरा, कक्ष}
  - Replaceability - Mutual replaceability in a context. Eg. अमरीका में दो साल बिताने के बाद श्याम स्वदेश/घर लौटा

Sanskrit Wordnet
- Concept-based Multilingual dictionary
- Need
  - Loss of synonymy when moving across languages. Eg. dark and evil are synonymous in English but their counterparts अँधेरा and दुष्ट are not.
  - Number of lexicographers required is O(n^2) for n languages

Sanskrit Wordnet - Concept based Multilingual dictionary
Concepts | L1 (English) | L2 (Hindi) | L3 (Sanskrit)
Concept ID: Concept description | (W1, W2, W3, ..) | (W4, W5, W6, ..) | (W7, W8, W9, ..)
4066: any of various long-tailed primates (excluding the prosimians) | (monkey) | (बंदर, बन्दर, बानर, वानर, कीश, कपि, मर्कट, ..) | (वानर, कपि, प्लवङ्ग, प्लवग, शाखामृग, वलीमुख, मर्कट, ..)
2186: a typical star that is the source of light and heat for the planets in the solar system | (sun) | (सूर्य, सूरज, भानु, दिवाकर, भास्कर, प्रभाकर, दिनकर, रवि, ..) | (सूर्य, सवितृ, आदित्य, मित्र, अरुण, भानु, पूषा, ..)

Sanskrit Wordnet - Challenges
Observed during construction of Marathi Wordnet:
- Single word to synthetic expression. Eg. bankrupt -> दिवाला निकालना
- Culture specific concepts. Eg. girlfriend. Requires transliteration such as महिला मित्र
- Splitting of concepts. Eg. फ़ीका (tasteless) in Hindi -> अगोड (less sweet), अळणी (less salty), मिळमिळीत (less spicy) in Marathi

Sanskrit Wordnet - Challenges
Observed during Indo Wordnet workshop at Coimbatore, June 2009:
- Varied usage across regions and people. Eg. In Kashmiri, the Muslim community uses separate words for drinking water and water, but the Hindu community uses one word.
- Single-word and multi-word expressions in the same language. Eg. In Nepali, मोह and मोह-माया both mean infatuation.

Sanskrit Wordnet - Sanskrit
- Indo-Aryan language
  - Hinduism
  - Buddhism
- Classical Sanskrit - Panini
- Vedic Sanskrit - pre-Classical

Sanskrit Wordnet - Sanskrit Etymology
- Etymology of Verbs
  - गण - Ten classes based on how stem is generated
  - इट् - Three groups based on position of tense marker
  - उपसर्ग - 22 prepositional particles that modify a root

Synset Marking
- Grouping of synsets based on frequency of occurrence and usage in language
- Universal concepts
  - who and what
  - honesty

SynsetMarker - Interface

SynsetMarker - Features
- Display of synset fields
- Browsing
- Search
  - Word
  - ID
- Marking - Universal, Common in Hindi and Uncommon
- Save/Exit
- Shortcuts

SynsetMarker - API
- records
  - DefineRecord
  - SynsetRecord
- operations
  - SynsetOperator
  - RecordReader
  - RecordWriter
- gui
  - Interface

SynsetMarker - Process
- First round divided among 6 people
  - 31000 synsets marked
  - Universal and Common clubbed - 15234 synsets
  - Common in Hindi - 6771 synsets
  - Uncommon - 10987 synsets
- Second round voting schema
  - Common - 13205 synsets

Core Synset Selection
- Bharatiya Vyavahara Kosh
  - English and 15 Indian languages
  - 2000 concepts with domains
  - खेल (game), प्राणी (animal), फल (fruit)
- Link synsets to words in Kosh
  - Polysemy
    - अनन्नास as pineapple fruit and as pineapple plant

DomainClassifier - Interface

DomainClassifier - Features
- Display of synset fields
- Browsing through records
- Marking right synset for a word and a domain
- Save/Export

DomainClassifier - API
- records
  - DefineRecord
  - SynsetRecord
- operations
  - SynsetOperator
  - RecordReader
  - RecordWriter
- gui
  - Interface

DomainClassifier - Process
- Groupings
  - Single IDs
  - Multiple IDs
  - No IDs
- Rounds of marking
  - Common synsets
  - Common in Hindi synsets
  - Uncommon synsets

DomainClassifier - Process
- End of process
  - Core - 1969 synsets
  - Common - 11658 synsets

Online SynsetMarker - Interface

Online SynsetMarker - API
- Written in PHP
  - login.php - Interface to log in as a user or as an admin, or to register as a new user
  - process.php - Processes login/register data and directs the user accordingly
  - logout.php - Logs out a user
  - mainprocess.php - Processes data to display an unmarked synset
  - main.php - Displays a synset with buttons to mark it as Common or Uncommon
  - admin.php - Admin page with statistical data on the number of marked synsets per user and the number of users based on synset marks
  - adminpassword.php - Password interface to log in as admin
  - userprofile.php - Profile data of a particular user

Online SynsetMarker - Process
- Threshold for dropping synset as Uncommon
  - Had to be set to 1
- Common - 10312 synsets

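One possible reading of the threshold is that a synset is dropped as Uncommon once the number of Uncommon marks from users reaches the threshold, so a threshold of 1 means a single Uncommon mark is enough. The Java sketch below illustrates that reading only; the slides do not describe the actual rule, and the class `VoteThreshold` and method `isDropped` are hypothetical (the online tool itself is written in PHP).

```java
import java.util.List;

// Hypothetical vote aggregation for a single synset.
class VoteThreshold {
    // Each list element is one user's mark: true = Common, false = Uncommon.
    // Assumption: the synset is dropped when the Uncommon marks reach the threshold.
    static boolean isDropped(List<Boolean> commonVotes, int uncommonThreshold) {
        int uncommon = 0;
        for (boolean votedCommon : commonVotes) {
            if (!votedCommon) {
                uncommon++;
            }
        }
        return uncommon >= uncommonThreshold;
    }
}
```

Under this assumption, raising the threshold keeps more synsets in the Common set, since more users have to agree before a synset is dropped.
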
Sanskrit Wordnet
- Interface for creation of Sanskrit Wordnet
- Based on idea of Concept-based Multilingual dictionary

User Interface - Configure
User Interface - Main

User Interface - Panels
- Help Panel: Buttons for Commenting, Synchronizing and References tool.
- Search Panel: Search word or ID or perform advanced search. Font increase/decrease.
- Synset Panels: Synset data fields and completion status.
- Tool Panel: English synset, Link tool, Etymology tool.
- Browse Panel: Browsing through records, saving and exiting.

User Interface - Features Reference tool
User Interface - Features Synchronize tool
User Interface - Features Advanced Search
User Interface - Features English synsets tool
User Interface - Features Link tool
User Interface - Features Etymology tool

User Interface - Features Keyboard Shortcuts
- Undo feature - Monitor keyboard actions and undo on Ctrl-Z
- Saving feature - Monitor change in field values and save on Ctrl-S
- Search - Ctrl-F for quick search access

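The undo and save shortcuts both depend on tracking how a field's value changes over time. The Java sketch below shows one way such tracking could work for a single text field, independent of Swing; the class `FieldHistory` and its methods are hypothetical and are not taken from the tool's source.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical change tracker behind Ctrl-Z and Ctrl-S for one text field.
class FieldHistory {
    private final Deque<String> undoStack = new ArrayDeque<>();
    private String current;
    private String saved;

    FieldHistory(String initial) {
        this.current = initial;
        this.saved = initial;
    }

    // Called whenever the field content changes.
    void edit(String newValue) {
        undoStack.push(current);
        current = newValue;
    }

    // Ctrl-Z: restore the previous value; returns false when there is nothing to undo.
    boolean undo() {
        if (undoStack.isEmpty()) {
            return false;
        }
        current = undoStack.pop();
        return true;
    }

    // True when the value differs from the last saved value.
    boolean isDirty() {
        return !current.equals(saved);
    }

    // Ctrl-S: remember the current value as saved.
    void save() {
        saved = current;
    }

    String value() {
        return current;
    }
}
```

In a Swing GUI, a `KeyListener` or key binding would call `undo()` and `save()` when the corresponding key strokes are detected.
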
Interface API
- Problems and Requirements
  - Huge volumes of data (eg. 30,000 synsets)
  - Links between different data
  - Efficient and user-friendly GUI
  - Sufficient querying
    - Grouping
    - Review separation

Interface API

Graphical User Interface
    import javax.swing.JButton;

    JButton saveButton = null;

    // Create the button lazily on first access and reuse it afterwards.
    public JButton getSaveButton() {
        if (saveButton == null) {
            saveButton = new JButton();
        }
        return saveButton;
    }

Graphical User Interface
Graphical User Interface - Panels

Graphical User Interface
- Panels
  - Hierarchical structure
- Components (within Panels)
  - Classes JButton, JTextField, JCheckBox, etc.
- Listeners
  - ActionListener - actions performed by user
  - KeyListener - key strokes (undo, search) and shortcuts

Synset
- Synset ID: a unique number identifying a synset
- Category: POS category of the words
- Concept: The part of the gloss that gives a brief summary of what the synset represents
- Example: One or more examples of the words in the synset being used in sentences
- Synset: The set of synonymous words comprised in the synset

Data structure - SynsetRecord
- Class SynsetRecord
  - Strings to hold field values
  - Functions:
    - equals(otherObject)
    - isBetterThan(otherObject)
    - isComplete()
    - …

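A record class along these lines could look like the Java sketch below. The five fields follow the synset fields listed earlier, but the semantics of `isComplete` (every field filled) and `isBetterThan` (more fields filled) are assumptions made for illustration, and the class name `SynsetRecordSketch` is hypothetical.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of a synset record holding its fields as strings.
class SynsetRecordSketch {
    final String id;
    final String category;
    final String concept;
    final String example;
    final String synset;

    SynsetRecordSketch(String id, String category, String concept,
                       String example, String synset) {
        this.id = id;
        this.category = category;
        this.concept = concept;
        this.example = example;
        this.synset = synset;
    }

    private List<String> fields() {
        return Arrays.asList(id, category, concept, example, synset);
    }

    // Number of fields that are neither null nor blank.
    private int filledCount() {
        int count = 0;
        for (String field : fields()) {
            if (field != null && !field.trim().isEmpty()) {
                count++;
            }
        }
        return count;
    }

    // Assumption: a record is complete when every field is filled.
    boolean isComplete() {
        return filledCount() == fields().size();
    }

    // Assumption: a record is better when it has more filled fields.
    boolean isBetterThan(SynsetRecordSketch other) {
        return filledCount() > other.filledCount();
    }

    @Override
    public boolean equals(Object other) {
        return other instanceof SynsetRecordSketch
                && fields().equals(((SynsetRecordSketch) other).fields());
    }

    @Override
    public int hashCode() {
        return fields().hashCode();
    }
}
```

A comparison such as `isBetterThan` is useful when two copies of the same synset exist, for example after synchronizing work done by different people.
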
Data structure - DefineRecord

“define-end” language
Example (description of a book about cricket):
    define book sixer
      length :: 700
      topic :: cricket
      define chapter 1
        length :: 300
        topic :: batting
      end
      define chapter 2
        length :: 400
        topic :: bowling :: scientific
      end

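A small line-based parser is enough to read this kind of nested data. The Java sketch below is an illustration under several assumptions: each statement sits on its own line, parameters use `::` as separator, and any block still open at the end of input is closed implicitly. The names `DefineNode` and `DefineEndParser` are hypothetical and do not come from the tool's API.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical tree node for one "define ... end" block.
class DefineNode {
    final String type;
    final String name;
    final Map<String, List<String>> params = new LinkedHashMap<>();
    final List<DefineNode> children = new ArrayList<>();

    DefineNode(String type, String name) {
        this.type = type;
        this.name = name;
    }
}

class DefineEndParser {
    // Parses the first top-level block found in the given lines.
    static DefineNode parse(List<String> lines) {
        Deque<DefineNode> open = new ArrayDeque<>(); // blocks not yet closed
        DefineNode root = null;
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty()) {
                continue;
            }
            if (line.startsWith("define ")) {
                String[] parts = line.split("\\s+", 3);
                DefineNode node = new DefineNode(parts[1], parts.length > 2 ? parts[2] : "");
                if (open.isEmpty()) {
                    if (root == null) {
                        root = node;
                    }
                } else {
                    open.peek().children.add(node);
                }
                open.push(node);
            } else if (line.equals("end")) {
                if (!open.isEmpty()) {
                    open.pop();
                }
            } else if (!open.isEmpty()) {
                // Parameter line: name :: value [:: value ...]
                String[] parts = line.split("\\s*::\\s*");
                if (parts.length == 0) {
                    continue;
                }
                List<String> values =
                        new ArrayList<>(Arrays.asList(parts).subList(1, parts.length));
                open.peek().params.put(parts[0], values);
            }
        }
        return root;
    }
}
```

Parsing the book example would give a `book` node named `sixer` with two `chapter` children, where the second chapter's `topic` parameter carries the two values `bowling` and `scientific`.
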
Data structure - DefineRecord
Example (etymology data for synset ID 1476):
    define etymology 1476
      करमवततव :: अकरमक
      finished :: true
      define word कष
        इट :: सट
        पद :: परसमपद
        सवर :: कत
        रप :: कषय
        उपसरग :: अप
        स ध त ::
      end

Data structure - DefineRecord
- Data structure to hold parametric and nested data
- Functions:
  - addField(objectToAdd) - Function to add a parameter or a nested instance of DefineRecord
  - toString() - Function to export a record in the define-end language
  - getParameterField(parameterName) - Function to return a specific parameter field
  - …

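The three functions listed above can be sketched in Java as follows. This is an illustration only: the helper class `ParameterField`, the class name `DefineRecordSketch`, the two-space indentation and the decision to write an `end` line for every block (including the outermost one) are all assumptions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical parameter field: a name with zero or more values.
class ParameterField {
    final String name;
    final List<String> values;

    ParameterField(String name, String... values) {
        this.name = name;
        this.values = Arrays.asList(values);
    }

    @Override
    public String toString() {
        return name + " :: " + String.join(" :: ", values);
    }
}

// Hypothetical record holding parameters and nested records in insertion order.
class DefineRecordSketch {
    private final String type;
    private final String name;
    private final List<Object> fields = new ArrayList<>();

    DefineRecordSketch(String type, String name) {
        this.type = type;
        this.name = name;
    }

    // Accepts either a parameter or a nested record.
    void addField(Object field) {
        if (!(field instanceof ParameterField) && !(field instanceof DefineRecordSketch)) {
            throw new IllegalArgumentException("unsupported field type");
        }
        fields.add(field);
    }

    // Returns the first parameter with the given name, or null if absent.
    ParameterField getParameterField(String parameterName) {
        for (Object field : fields) {
            if (field instanceof ParameterField
                    && ((ParameterField) field).name.equals(parameterName)) {
                return (ParameterField) field;
            }
        }
        return null;
    }

    // Exports the record as define-end text.
    @Override
    public String toString() {
        return render("");
    }

    private String render(String indent) {
        StringBuilder sb = new StringBuilder();
        sb.append(indent).append("define ").append(type).append(' ').append(name).append('\n');
        for (Object field : fields) {
            if (field instanceof DefineRecordSketch) {
                sb.append(((DefineRecordSketch) field).render(indent + "  "));
            } else {
                sb.append(indent).append("  ").append(field).append('\n');
            }
        }
        sb.append(indent).append("end\n");
        return sb.toString();
    }
}
```

Keeping parameters and nested records in one ordered list means the exported text preserves the order in which fields were added.
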
Data Operations

Data Operations - File I/O
- Unicode text data manipulation - UTF-8 format
- Classes for file parsing/writing:
  - RecordWriter
  - RecordReader

Data Operations - File I/O
- RecordReader
  - SynsetRecord parser
  - DefineRecord parser
  - String converters
- RecordWriter
  - SynsetRecord parser
  - DefineRecord parser

Data Operations - RecordModel Interface
- Model to create mechanism for working with a new data structure
- Handles parsing, writing, querying and ID retrieval
- Models written as Classes:
  - SynsetRecordModel
    - EnglishSynsetRecordModel
  - AbstractDefineRecordModel

Data Operations - RecordModel Interface
- int getRecordId(E record): Function to return the record ID of a record
- boolean isBetterThan(E a, E b): Function to return whether one record weighs better than the other
- boolean isFinished(E a): Function to return whether a record can be set as completed
- E mergeRecords(E a, E b): Function to merge the data in two separate records into one
- boolean searchWord(String word, E a): Function to perform a query (defined in String word) on a record
- E parseRecord(RecordReader fileHandle): Function to parse a record from a file
- void writeRecord(RecordReader fileHandle, E a): Function to write a record into a file

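The in-memory part of this interface can be sketched as a generic Java interface with one toy implementation. The sketch leaves out `parseRecord` and `writeRecord`, since the reader and writer classes are not shown in the slides. The names `RecordModelSketch`, `Gloss` and `GlossModel`, and the rules used for comparison and merging, are hypothetical.

```java
// Hypothetical generic model: everything an operator needs to know about a record type E.
interface RecordModelSketch<E> {
    int getRecordId(E record);
    boolean isBetterThan(E a, E b);
    boolean isFinished(E a);
    E mergeRecords(E a, E b);
    boolean searchWord(String word, E a);
}

// Toy record type used only for this illustration.
class Gloss {
    final int id;
    final String text;

    Gloss(int id, String text) {
        this.id = id;
        this.text = text;
    }
}

// Toy model: longer text counts as better, non-empty text counts as finished.
class GlossModel implements RecordModelSketch<Gloss> {
    @Override
    public int getRecordId(Gloss record) {
        return record.id;
    }

    @Override
    public boolean isBetterThan(Gloss a, Gloss b) {
        return a.text.length() > b.text.length();
    }

    @Override
    public boolean isFinished(Gloss a) {
        return !a.text.isEmpty();
    }

    // Merging keeps whichever record the model considers better.
    @Override
    public Gloss mergeRecords(Gloss a, Gloss b) {
        return isBetterThan(a, b) ? a : b;
    }

    @Override
    public boolean searchWord(String word, Gloss a) {
        return a.text.contains(word);
    }
}
```

Because the record type is a type parameter, the same browsing and search code can work with synsets, link data or etymology data once a model exists for each.
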
Data Operations - RecordOperator Class
- Operator to provide functionality to work with records of data
- Load, Browse, Update, Search, Synchronize and Write
- Two kinds at the GUI level:
  - Parent Operator
  - Linker Operator

Data Operations - RecordOperator Class
Functions for each data type (depending on the corresponding RecordModel):
- Constructors for ParentOperator and LinkerOperator
- getRecord() - Function to obtain the current record
- setCurrentId() and getCurrentId() - Functions to set and obtain the ID to work with
- getFirstId(), getPreviousId(), getNextId() and getLastId() - Functions to browse through records
- isFinished() and isAllFinished() - Functions to obtain completion status of records
- searchRecords() and advancedSearch() - Functions to perform search operations on the records
- …

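The browsing functions can be illustrated with records kept in a sorted map keyed by ID. The Java sketch below is a minimal version under that assumption; the class `RecordOperatorSketch` and its `load` method are hypothetical, and the real operators also handle search, synchronization and writing.

```java
import java.util.TreeMap;

// Hypothetical operator that browses records ordered by their IDs.
class RecordOperatorSketch<E> {
    private final TreeMap<Integer, E> records = new TreeMap<>();
    private Integer currentId; // null until the first record is loaded

    // Adds a record; the first loaded record becomes the current one.
    void load(int id, E record) {
        records.put(id, record);
        if (currentId == null) {
            currentId = id;
        }
    }

    E getRecord() {
        return currentId == null ? null : records.get(currentId);
    }

    void setCurrentId(int id) {
        if (!records.containsKey(id)) {
            throw new IllegalArgumentException("unknown record ID: " + id);
        }
        currentId = id;
    }

    Integer getCurrentId() {
        return currentId;
    }

    Integer getFirstId() {
        return records.isEmpty() ? null : records.firstKey();
    }

    Integer getLastId() {
        return records.isEmpty() ? null : records.lastKey();
    }

    // Largest ID below the current one, or null at the start.
    Integer getPreviousId() {
        return currentId == null ? null : records.lowerKey(currentId);
    }

    // Smallest ID above the current one, or null at the end.
    Integer getNextId() {
        return currentId == null ? null : records.higherKey(currentId);
    }
}
```

A browse panel could then call `setCurrentId(getNextId())` on a "next" button press after checking that `getNextId()` is not null.
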
API Overview
- GUI defines one ParentOperator (eg. source synsets)
- GUI defines many LinkerOperators (eg. target synsets, link data, etc.)
- Models attached to the operators
- Data repositories are defined
- GUI browses, retrieves and manipulates data using operators.

Version history

Future work
- Tool to generate etymology format
- GUI functionality to display synsets from multiple languages
- Advanced commenting based on reviews and completion

Thank you