Work at TACOLA Lab

Team Members: T. V. Geetha, Ranjani Parthasarathi, Madhan Karky, E. Umamaheswari, J. Balaji, Subalalitha, Elanchezhiyan K., Karthika, Thenmalar, Radhakrishnan, Kandasamy, Padmavathi, Aruna, Vijayavani

Tamil Language Processing
• Tamil Language Oriented Tools
  - Morphological analyser: Normal Words, Compound Words, Colloquial Words
  - Dictionary
  - Text Compaction
• UNL Based Work
  - Parser: Simple, Complex and Compound Sentences
  - Semantic analysis based on UNL
  - UNL for semantic representation
  - Nested UNL
  - Concept based Search
  - Bi-lingual Search
• Language Technology
  - Event Processing
  - Discourse Analysis
  - Summarization
  - Question answering
  - Thirukural Search
  - Blog Mining
  - Ontology Based Information Extraction
  - Personalized Search
  - Parallelization for NLP
• Lyric Oriented Processing
  - Lyric Mining
  - Emotion detection from text
  - Lyrics for Tunes
• Carnatic Music Processing
  - Pleasantness
  - Raga Modelling
  - Singer, Genre Identification
  - Music Emotion Recognition
Dr. T. V. Geetha, Anna University

Papers for TIC 2011
Tamil Language Oriented Tools
• Agaraadhi: A Novel Online Dictionary Framework
• An Efficient Tamil Text Compaction System (Surukkupai)
• Kuralagam, A Concept Relation Based Search Framework for Thirukural
• Popularity Based Scoring Model for Tamil Word Games
Tamil Language Processing
• Template based Multilingual Summary Generation
• On Emotion detection from Tamil Text
• Tamil Summary Generation for Cricket Match
Lyric Oriented Processing
• Lyric Mining: Word, Rhyme & Concept Co-occurrence Analysis
• Special Indices for Laa Lyric Analysis & Generation Framework

AGARAADHI: A NOVEL ONLINE DICTIONARY FRAMEWORK
Elanchezhiyan K., Karthikeyan S., T. V. Geetha, Ranjani Parthasarathi, Madhan Karky

OBJECTIVES
• Agaraadhi is a dictionary framework for indexing and retrieving Tamil words, their meanings, analysis and related information.
• The framework incorporates various unique features designed to give the user additional information about the word they query.

INTRODUCTION
The Agaraadhi dictionary has more than 3 lakh (300,000) words in various domains such as
• General
• Literature
• Medical
• Engineering
• Computer Science
• Bird Names
and more.
Agaraadhi is a Tamil-English bilingual dictionary.

INTRODUCTION CONT…
Agaraadhi is a Tamil-English bilingual dictionary with 20 features, such as
• morphological analysis
• morphological generation
• word usage statistics
• word pleasantness analysis
• spell checking
• similar word finder
• word usage in literature
• picture dictionary
• number to text conversion
• phonetic transliteration
• live usage analysis from micro blogs
and more.

AGARAADHI FRAMEWORK CONT…

AGARAADHI FEATURES
• Morphological Analyser
  - Gives the morphological features of the query word, such as root word, part of speech, gender, tense and count.
  - Example: for the query word padithaan, the Morphological Analyser returns padi as the root and reports that the word is masculine, in the past tense, and so on.
• Morphological Generator
  - The Tamil morphological generator handles different syntactic categories such as nouns, verbs, postpositions, adjectives and adverbs.
  - It generates the possible morphological variations of the query word.
• Spell Checker
  - Checks the spelling of Tamil words and provides alternative suggestions for wrongly spelt words.
  - If the root word is not in the dictionary, it generates all possible suggestions with minimum variation from the given word.
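The "minimum variation" suggestion step of the spell checker can be illustrated with a small sketch. This is not the Agaraadhi implementation: it assumes a single-edit (insert, delete, replace, transpose) candidate generator, a hypothetical four-word lexicon, and Latin transliterations in place of Tamil script. The real system also works on root words obtained from the morphological analyser, which is omitted here.

```python
def edits1(word, alphabet):
    """All strings one edit (delete, transpose, replace, insert) away from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = {a + b[1:] for a, b in splits if b}
    transposes = {a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1}
    replaces = {a + c + b[1:] for a, b in splits if b for c in alphabet}
    inserts = {a + c + b for a, b in splits for c in alphabet}
    return (deletes | transposes | replaces | inserts) - {word}


def suggest(word, lexicon, alphabet="abcdefghijklmnopqrstuvwxyz"):
    """Return the word itself if it is known, else known words one edit away."""
    if word in lexicon:
        return [word]
    return sorted(edits1(word, alphabet) & lexicon)


# Hypothetical lexicon of transliterated words (assumption, not from the slides).
LEXICON = {"padi", "madi", "pookkal", "mazhai"}

print(suggest("badi", LEXICON))   # candidates one edit away from the misspelling
print(suggest("mazhi", LEXICON))
```

A production spell checker would rank the candidates (for example by word popularity) rather than return them alphabetically.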

AGARAADHI FEATURES
• Word Suggestions
  - Gives the list of equivalent or related words for the given query word.
• Word Pleasantness
  - The pleasantness score generator indicates how easy the word is to pronounce.
• Word Popularity Score
  - Shows the usage of the word on the web, based on the frequency distribution of the word across popular blogs, news articles, social networks, etc.
• Word Usage Statistics
  - Shows the usage of the word in social networks over the past week.
• Word Usage in Literature
  - Finds the usage of the word in popular literature such as Thirukural, Bharathiyar Padalgal and Avvai songs, and also in the lyrics of Tamil movie songs.

AGARAADHI FEATURES
• Word of the Day
  - A rare word is chosen at random and displayed on the opening page so that users can learn a new word every day.
• Number to Text Converter
  - Converts a number to its Tamil word equivalent as well as to English text. For example, in Tamil we represent 100 million as oru Arpputham and 10 billion as Kumbam, going up to Anniyan for one zillion.
• Picture Dictionary
  - Pictures, photos or line drawings depicting popular words are included in the dictionary so that children can learn efficiently with this tool.

RESULTS
• Query word: pookkal (பூக்கள்)
  - http://www.agaraadhi.com/dict/OD.jsp?w=%E0%AE%AA%E0%AF%82%E0%AE%95%E0%AF%8D%E0%AE%95%E0%AE%B3%E0%AF%8D+&ln=ta&Submit.x=8&Submit.y=7
• Query word: mazhai (மழை)
  - http://www.agaraadhi.com/dict/OD.jsp?w=%E0%AE%AE%E0%AE%B4%E0%AF%88+&ln=ta&Submit.x=21&Submit.y=4
• Query word: fruit
  - http://www.agaraadhi.com/dict/OD.jsp?w=fruit&ln=en

FUTURE WORK
• Providing APIs for programmers and developing mobile apps for the Agaraadhi framework will open a good platform for researchers and developers working in the area of Tamil computing.

An Efficient Tamil Text Compaction System
N. M. Revathi, G. P. Shanthi, Elanchezhiyan K., T. V. Geetha, Ranjani Parthasarathi, Madhan Karky

OBJECTIVES
• Why compaction?
  - Message length is limited on blog sites, and mobile phones have tiny user interfaces.
  - Compaction saves online storage space and hence reduces cost.
• The paper proposes
  - a text compaction system for Tamil, the first of its kind.
• Idea of compaction
  - There is no specific rule for obtaining the shortest form of a word; the main aim is that it remains understandable.
  - A compact form can be obtained by omitting letters and by replacing prefixes and suffixes with suitable symbols and numbers.

FRAMEWORK ARCHITECTURE

FRAMEWORK CONT…
• Input Processing
  - The morphological analyzer removes the suffix (if present) added to the word and delivers the root word (RW).

FRAMEWORK CONT…
• Identification of the category and extraction of the compact word
  - Words fall into three categories: common Tamil words, abbreviations/acronyms, and numbers.
  - Abbreviations/acronyms are identified by comparing the word with the keys of a hashmap.
  - With the help of the hash key and a mapping algorithm, the compact word is retrieved.
  - Otherwise the word is either a common Tamil word or a number.
  - Numbers are handled by a numerical analyser for text-to-number conversion.
• Output Processing
  - The Tamil Morphological Generator adds the suitable suffix so that the output follows the rules of the language.
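The three stages above (input processing, category identification with compact-word extraction, output processing) can be sketched as follows. Everything concrete here is an assumption made for illustration: the lookup tables are hypothetical, English words stand in for Tamil, `strip_suffix` is a toy stand-in for the morphological analyser, re-attaching the suffix stands in for the morphological generator, and the vowel-dropping rule is a placeholder for the system's actual letter-omission rules, which the slides do not specify.

```python
# Hypothetical lookup tables (assumptions; the real tables are not in the slides).
ABBREVIATIONS = {"doctor": "Dr", "university": "univ"}
NUMBER_WORDS = {"one": "1", "two": "2", "ten": "10"}


def strip_suffix(word):
    """Toy stand-in for the morphological analyser: returns (root, suffix)."""
    if word.endswith("s") and len(word) > 3:
        return word[:-1], "s"
    return word, ""


def compact_root(root):
    """Pick the category of the root word and return its compact form."""
    if root in ABBREVIATIONS:      # abbreviation/acronym: hashmap lookup
        return ABBREVIATIONS[root]
    if root in NUMBER_WORDS:       # number: text-to-number conversion
        return NUMBER_WORDS[root]
    # Common word: placeholder rule that drops every non-initial vowel.
    return root[0] + "".join(ch for ch in root[1:] if ch not in "aeiou")


def compact(text):
    """Compact each word: strip suffix, compact the root, re-attach the suffix."""
    out = []
    for word in text.lower().split():
        root, suffix = strip_suffix(word)
        out.append(compact_root(root) + suffix)
    return " ".join(out)


print(compact("two doctors"))
print(compact("message"))
```

Checking the hashmap first keeps the abbreviation case to a constant-time lookup, which matches the order of checks described on the slide.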

RESULT AND ANALYSIS
• Tested with over 10,000 words.
• The final result is reduced to 40% of the original text.

Kuralagam: Concept Relation based Search Engine for Thirukkural
Elanchezhiyan K., T. V. Geetha, Ranjani Parthasarathi, Madhan Karky

Objectives
• Kuralagam is a conceptual search framework for Thirukkural, based on the UNL framework.
  - Keyword search in kurals and interpretations
  - Concept based search using CoReX, a conceptual indexing technique based on UNL
  - Bilingual search in English and Tamil
  - Display of relationships between the concepts

Kuralagam Framework

Offline Processing
• Web Crawler
  - A Thirukkural statistics crawler crawls news and blog documents to find the usage of each individual Thirukkural.
  - The usage is recorded to compute a popularity score for each Thirukkural.
• Enconversion: based on UNL
• Indexing: based on the CoReX framework

UNL & Enconversion
• UNL is an intermediate language that processes knowledge across language barriers.
• It captures semantics by converting the natural language terms present in a document to concepts, which are connected to other concepts through UNL relations.
• There are 46 UNL relations, such as plf (place from), plt (place to), tmf (time from) and tmt (time to).
• The process of converting natural language text to a UNL graph is known as Enconversion; the reverse process is known as Deconversion.

An Example speaks more…
Ex: John was playing in the garden
• agt(play(icl>action), john(iof>person))
• plc(play(icl>action), garden(icl>place))
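One way to hold such a graph in memory is as a list of labelled edges between concepts. The sketch below is only an illustration of the example sentence: the `Relation` type and the helper names are hypothetical, and a real enconverter also records attributes and scopes that are left out here.

```python
from typing import NamedTuple


class Relation(NamedTuple):
    """A labelled edge of a UNL graph: label(source, target)."""
    label: str
    source: str
    target: str


def parse_uw(uw):
    """Split a concept such as 'garden(icl>place)' into headword and constraints."""
    head, _, rest = uw.partition("(")
    constraints = [c for c in rest.rstrip(")").split(",") if c]
    return head, constraints


def concepts(graph):
    """Distinct concepts of a graph, in order of first appearance."""
    seen = []
    for rel in graph:
        for uw in (rel.source, rel.target):
            if uw not in seen:
                seen.append(uw)
    return seen


# "John was playing in the garden"
GRAPH = [
    Relation("agt", "play(icl>action)", "john(iof>person)"),
    Relation("plc", "play(icl>action)", "garden(icl>place)"),
]

print(concepts(GRAPH))
print(parse_uw("john(iof>person)"))
```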

Indexer
• The Kuralagam Indexer is designed based on CoReX techniques.
• The Indexer stores and manages the UNL graphs in two different indices:
  - Concept only index (C index)
  - Concept-Relation-Concept index (CRC index)
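A minimal sketch of the two indices, assuming each document's UNL graph is available as (source, relation, target) triples. The function name, the in-memory dictionaries and the document identifiers are assumptions for illustration; the slides do not describe CoReX's actual storage layout.

```python
from collections import defaultdict


def build_indices(docs):
    """Build a C index and a CRC index.

    docs maps a document id to a list of (source, relation, target) triples.
    The C index maps each concept to the documents that contain it.
    The CRC index maps each full triple to the documents that contain it.
    """
    c_index = defaultdict(set)
    crc_index = defaultdict(set)
    for doc_id, triples in docs.items():
        for source, relation, target in triples:
            c_index[source].add(doc_id)
            c_index[target].add(doc_id)
            crc_index[(source, relation, target)].add(doc_id)
    return c_index, crc_index


# Two toy documents with arbitrary ids.
DOCS = {
    1: [("play", "agt", "john")],
    2: [("play", "plc", "garden")],
}
C_INDEX, CRC_INDEX = build_indices(DOCS)

print(sorted(C_INDEX["play"]))
print(sorted(CRC_INDEX[("play", "agt", "john")]))
```

A CRC match is stricter than a C match because both concepts and the relation between them must agree, which is consistent with CRC hits being given higher priority at ranking time.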

Online Processing
• Query Translation and Expansion
  - Converts the user query to a UNL graph.
  - Uses the CRC (Concept-Relation-Concept) CoReX indices to fetch the similarity thesaurus and co-occurrence list that populate the multi-list data structure.
• Search and Ranking
  - Fetches the Thirukkural number and its details. Thirukkurals for a given query are fetched using the two types of concept relation indices, namely CRC and C.
  - The query concept is expanded using related CRC indices pointing to the query concept. This helps in retrieving many Thirukkurals that are conceptually related to the query, which is not possible with keyword-based Thirukkural search engines.
  - The ranking is based on:
    - priority of the indices in the order CRC > C
    - usage score
    - frequency of occurrence of the query concept
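The slides list the three ranking criteria but not how they are combined. The sketch below assumes a simple lexicographic ordering (index type first, then usage score, then concept frequency); that combination, the field names and the sample hits with arbitrary kural numbers are all assumptions for illustration.

```python
# Lower value means higher priority: CRC matches rank above C matches.
INDEX_PRIORITY = {"CRC": 0, "C": 1}


def rank(hits):
    """Order hits by index priority, then usage score, then concept frequency."""
    return sorted(
        hits,
        key=lambda h: (INDEX_PRIORITY[h["index"]], -h["usage"], -h["freq"]),
    )


# Hypothetical hits; the kural numbers are arbitrary identifiers.
HITS = [
    {"kural": 12, "index": "C", "usage": 0.9, "freq": 3},
    {"kural": 391, "index": "CRC", "usage": 0.4, "freq": 1},
    {"kural": 45, "index": "CRC", "usage": 0.7, "freq": 2},
]

print([h["kural"] for h in rank(HITS)])
```

With this ordering a C-index hit never outranks a CRC-index hit, however popular it is; a weighted sum of the three signals would be an alternative if softer trade-offs were wanted.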

Tab Layout

Performance Evaluation
• The accuracy of the Thirukkural search engine was measured using average precision and mean average precision.
• Concept based search and keyword based search were compared using the Average Precision methodology.

Average Precision
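For reference, average precision and mean average precision can be computed as below. This uses the standard information-retrieval definitions; the slides do not give the exact formula used in the evaluation, and the sample rankings are made up for illustration.

```python
def average_precision(ranked, relevant):
    """Mean of precision@k over the ranks k at which a relevant item appears."""
    hits = 0
    total = 0.0
    for k, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / k
    return total / len(relevant) if relevant else 0.0


def mean_average_precision(runs):
    """Average of average_precision over (ranked, relevant) pairs, one per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


# Made-up results for two queries.
RUNS = [
    (["a", "b", "c", "d"], {"a", "c"}),  # relevant items at ranks 1 and 3
    (["x", "a"], {"a"}),                 # relevant item at rank 2
]

print(round(average_precision(*RUNS[0]), 4))
print(round(mean_average_precision(RUNS), 4))
```

Relevant items that are never retrieved still count in the denominator, so a system is penalised for missing them.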

Template Based Multilingual Summary Generation
Subalalitha C. N., E. Umamaheswari, T. V. Geetha, Ranjani Parthasarathi, Madhan Karky

Aim
To generate a multilingual summary based on the Universal Networking Language (UNL) framework.

The Architecture

Multilingual Summary Generation using UNL
Template based Information Extraction
• Seven tourism-specific templates have been designed and used.
• Templates are filled using the semantic information inherent in the UNL input graphs.
• Template information is language independent and can be used with any desired language.

Example Templates for Tourism Domain
Template             Semantics inherited from UNL
God                  iof>god, iof>goddess, icl>god
Food                 icl>food, icl>fruit
Flora and Fauna      icl>animal, icl>reptile, icl>mammal, icl>plant
Boarding facility    icl>facility
Transport facility   icl>transport
Place                icl>place, iof>city, iof>country
Distance             icl>unit, icl>number
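Template filling can be pictured as matching each concept's semantic constraints against the table above. The sketch below copies the seven templates from the slide; the function name, the input format (concepts written as `headword(constraint)`) and the sample concepts are assumptions for illustration.

```python
# Templates and their UNL semantic constraints, as listed on the slide.
TEMPLATES = {
    "God": {"iof>god", "iof>goddess", "icl>god"},
    "Food": {"icl>food", "icl>fruit"},
    "Flora and Fauna": {"icl>animal", "icl>reptile", "icl>mammal", "icl>plant"},
    "Boarding facility": {"icl>facility"},
    "Transport facility": {"icl>transport"},
    "Place": {"icl>place", "iof>city", "iof>country"},
    "Distance": {"icl>unit", "icl>number"},
}


def fill_templates(uws):
    """Assign each concept's headword to every template whose semantics it carries."""
    filled = {name: [] for name in TEMPLATES}
    for uw in uws:
        head, _, rest = uw.partition("(")
        constraints = set(rest.rstrip(")").split(","))
        for name, semantics in TEMPLATES.items():
            if constraints & semantics:
                filled[name].append(head)
    # Keep only templates that received at least one value.
    return {name: values for name, values in filled.items() if values}


# Hypothetical concepts taken from a tourism document.
print(fill_templates(["madurai(iof>city)", "mango(icl>fruit)", "bus(icl>transport)"]))
```

Because the filled templates hold only headwords and template names, they carry no language-specific surface forms, which is what makes them reusable for any target language.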

Summary Generation
• The template information is converted to the target language using the respective UNL-target language dictionary, which contains root words.
• The natural language term is obtained from the root word using target language information such as case suffixes and language technology tools such as the morphological generator.
• When the converted template information is fitted into target-language-specific dynamic sentence patterns, a summary is generated.
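The generation step can be sketched as a dictionary lookup followed by pattern filling. The sentence patterns, the lexicon entries and the function names below are invented for illustration, and only English is populated. A Tamil target would additionally pass each root word through the morphological generator to attach case suffixes, which this sketch leaves out.

```python
# Hypothetical UNL-to-target-language dictionary of root words (assumption).
LEXICON = {"en": {"madurai": "Madurai", "bus": "bus", "mango": "mango"}}

# Hypothetical sentence patterns per language and template (assumption).
PATTERNS = {
    "en": {
        "Place": "{value} is a place to visit.",
        "Food": "{value} is available there.",
        "Transport facility": "It can be reached by {value}.",
    }
}


def translate(root, lang):
    """Look up the target-language root word; fall back to the headword itself."""
    return LEXICON[lang].get(root, root)


def generate_summary(filled, lang="en"):
    """Fit filled template values into the sentence patterns of one language."""
    sentences = []
    for template, values in filled.items():
        pattern = PATTERNS[lang].get(template)
        if pattern is None:
            continue  # no sentence pattern defined for this template
        words = [translate(v, lang) for v in values]
        sentences.append(pattern.format(value=" and ".join(words)))
    return " ".join(sentences)


print(generate_summary({"Place": ["madurai"], "Transport facility": ["bus"]}))
```

Supporting another language in this sketch means adding one more entry to `LEXICON` and `PATTERNS`; the filled templates themselves do not change.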

Performance Evaluation
• Tested with 33,000 Tamil and English text documents enconverted to UNL graphs.
• The proposed methodology was evaluated using human judgement.
• The generated summaries achieved 90% accuracy.
Further Enhancements
• Query specific summary
• Comparing the performance with human generated summaries
