Indian Languages - 2001

Adi, Afghani / Kabuli / Pashto, Anal, Angami, Ao, Arabic / Arbi, Balti, Bhili / Bhilodi, Bhotia, Bhumij, Bishnupuriya, Chakhesang, Chakru / Chokri, Chang, Coorgi / Kodagu, Deori, Dimasa, English, Gadaba, Gangte, Garo, Gondi, Halabi, Halam, Hmar, Ho, Jatapu, Juang, Kabui, Karbi / Mikir, Khandeshi, Kharia, Khasi, Khezha, Khiemnungan, Khond / Kondh, Kinnauri, Kisan, Koch, Koda / Kora, Kolami, Kom, Konda, Konyak, Korku, Korwa, Koya, Kui, Kuki, Kurukh / Oraon, Ladakhi, Lahauli, Lahnda, Lakher, Lalung, Lepcha, Liangmei, Limbu, Lotha, Lushai / Mizo, Malto, Maram, Maring, Miri / Mishing, Mishmi, Mogh, Monpa, Munda, Mundari, Nicobarese, Nissi / Dafla, Nocte, Paite, Parji, Pawi, Persian, Phom, Pochury, Rabha, Rai, Rengma, Sangtam, Savara, Sema, Sherpa, Shina, Simte, Tamang, Tangkhul, Tangsa, Thado, Tibetan, Tripuri, Tulu, Vaiphei, Wancho, Yimchungre, Zeliang, Zemi, Zou

Mission Statement
Annotated, quality language data (both text and speech) and tools in Indian languages to Individuals, Institutions, Industry etc., for Research and Development. Created in house, through outsourcing and acquisition.

Objectives
• A repository of linguistic resources in all Indian languages in the form of text, speech and lexical corpora.
• Facilitating creation of such databases by different organizations.
• Setting standards for data collection and storage of corpora for different research and development activities.
• Supporting development and sharing of tools for data collection and management.
• Facilitating training through workshops, seminars etc. in technical as well as process-related issues.
• Creating and maintaining the LDC-IL website that would be the primary gateway for accessing LDC-IL resources.
• Designing or providing help in creation of appropriate language technology for mass use.
• Providing the necessary linkages between academic institutions, individual researchers and the masses.

Participating Institutions in India
All academic institutes, research organizations and Corporate R&D groups from India and abroad working on Indian languages will be encouraged to participate in LDC-IL. The following have already shown interest:
• IISc Bangalore;
• All Indian Institutes of Technology;
• IIITs at Hyderabad and elsewhere;
• ISI Calcutta/Hyderabad/Bangalore;
• C-DAC, Pune;
• TIFR Mumbai;
• Universities like HCU, DU, JNU, NEHU;
• HP Labs India;
• IBM;
• Language institutions like KHS, NCPUL & RSKS; and,

Funding & Management
• The core funding comes from the Government of India.
• All activities will be in project mode.
• Expertise already available will be leveraged to cut avoidable cost and delay.
• All staff will be on contract.
• All receipts and payments, whether through internet gateways or through conventional means, will go to the Consolidated Fund.
• However, the Government will release grants to the Consortium as required. If need be, the support will be extended beyond the initial six-year period.
• As the nodal agency, CIIL will further distribute the relevant funding for specific sub-components of the scheme to other academic institutions.
• An annual progress report will be submitted to the government.

Arrangements
1. LDC-IL will be open to all institutions, Research Organizations, and the Corporate sector from all over the world.
2. Members will be encouraged to contribute databases and share revenues from sale of the data they contribute.
3. The databases will be available for R&D purposes to all members and non-members on payment of the appropriate fee, with a license for use only.
4. The organization will be asked to sign a License Agreement that the databases will not be distributed by it to others either free or for a fee.
5. The IP and the copyright of any product developed as a result of such an R&D activity shall lie with the organization that has created the product.

Tasks
• Establishing standards
• Creating language resources
• Annotating language data
• Building systems/helping system building
• Creating human resources
• Co-ordinating language resource development activities

Major Areas
• Linguistic Resource Development
• Creation of different kinds of Corpora including Pathological speech, Historical/Inscriptional databases
• Natural Language Processing
• Speech Recognition and Synthesis
• Character Recognition
• By-products like Word finders, lexicons of different kinds, thesauri, Usage compilations etc.

Text Corpora - Monolingual / Parallel Corpora (SL)

Sl. No.  Language    1st to 5th Year   Total
1        Assamese    2  2  2           10
2        Bengali     2  2  2           10
3        Bodo        0.6  0.6          3
4        Dogri       0.6  0.6          3
5        Gujarati    2  2  2           10
6        Hindi       2  2  2           10
7        Kannada     2  2  2           10
8        Kashmiri    1  1  1           5
9        Konkani     1  1  1           5
10       Maithili    1  1  1           5
11       Malayalam   2  2  2           10
12       Manipuri    1  1  1           5
13       Marathi     2  2  2           10
14       Nepali      2  2  2           10
15       Oriya       2  2  2           10
16       Punjabi     2  2  2           10
17       Sanskrit    0.4  0.4          2
18       Santali     0.6  0.6          3
19       Sindhi      0.6  0.6          3
20       Tamil       2  2  2           10
21       Telugu      2  2  2           10
22       Urdu        2  2  2           10

Tools for Corpora Management & Analysis
• Frequency analyzers for character, word, sentence.
• KWIC and KWOC retrievers.
• Tool for automatic transliteration from Indian language scripts to Roman and vice versa: Kannada, Tamil, Telugu, Assamese, Bengali, Manipuri, Malayalam, Punjabi, Oriya, Gujarati.
• Parallel corpora tools for text alignment, including a sentence alignment tool and a chunk alignment tool as well as an interface for aligning corpora.
• Tools for
  • Morphological analysis
  • POS tagging
  • Semantic tagging
  • Syntactic tree bank
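
To make the KWIC (keyword in context) retriever concrete, here is a minimal Python sketch. It is not the LDC-IL tool: the function name `kwic`, the `width` parameter and the whitespace tokenisation are assumptions made for illustration, and real Indian-language text would need script-aware tokenisation.

```python
def kwic(text, keyword, width=3):
    """Return (left context, keyword, right context) for each hit.

    Hypothetical helper for illustration; tokenisation is a plain
    whitespace split, which is an assumption of this sketch.
    """
    tokens = text.split()
    hits = []
    for i, tok in enumerate(tokens):
        # Compare case-insensitively, ignoring trailing punctuation.
        if tok.strip(".,;:!?").lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append((left, tok, right))
    return hits


sample = "the corpus tool reads the corpus and indexes every corpus entry"
for left, key, right in kwic(sample, "corpus", width=2):
    # Right-align the left context so the keywords line up in one column.
    print(f"{left:>15} | {key} | {right}")
```

A KWOC (keyword out of context) listing could reuse the same hits and print the keyword in a margin column followed by the full line it came from.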

Computational Grammars for Indian Languages
Task 1: Hierarchical POS Tag set

Linguistic Research
• Lexical studies
• Semantics
• Pragmatics & Discourse analysis
• Sociolinguistics
• Dialectology & Variation studies
• Stylistics
• Language teaching
• Historical linguistics
• Psycholinguistics
• Social psychology
• Cultural studies

Speech Corpora
§ Develop tools that facilitate collection of high quality speech data.
§ Collect data that can be used for building speech recognition and speech synthesis systems and for providing speech-to-speech translation from one language to another language spoken in India (including Indian English).
§ Apart from these applications, as in the area of text corpora, the main efforts in speech corpora are also on the engineering side. So, efforts shall also be made to collect child language corpora, pathological speech/language data and speech error data.

Applications
• Speech Recognition and Speech Synthesis
• Speech to Speech translation for a pair of Indian languages
• Command control applications
• Multimodal interfaces to the computer in Indian languages
• E-mail readers over the telephone
• Readers for the visually disadvantaged
• Speech enabled Office Suite etc.

Speech Dataset
1. Phonetically Balanced Vocabulary
2. Phonetically Balanced Sentences
3. Connected Text created using phonetically balanced vocabulary
4. Date Format
5. Command Control Words
6. Proper Nouns: 500 place and 500 person names
7. Most Frequent Words: 1000
8. Form and Function Words
9. News domain: news, editorial, essay - each text not less than 500 words

Number of Speakers
• Data will be collected from a minimum of 300 speakers (150 male and 150 female) of each language. In addition, natural discourse data from various domains shall be collected for Indian languages for research into spoken language.
• Data for speech synthesis shall be collected from a limited number of speakers, 3 male and 3 female, in a studio environment. They shall invariably have very good voice quality and be professional voice givers/media announcers.

Annotation of data:
1. Data to be used for speech recognition shall be annotated at phoneme, syllable, word and sentence levels.
2. Data to be used for speech synthesis shall be annotated at phone, phoneme, syllable, word, and phrase levels.
Annotation tools: Tools will be developed for semi-automatic annotation of speech data. These tools will also be useful for annotating speech synthesis databases.

Coverage of languages
I Year: 1. Bengali 2. Hindi 3. Tamil 4. Telugu 5. Assamese 6. Nepali
7. Manipuri 8. Malayalam 9. Punjabi 10. Urdu 11. Kannada 12. Gujarati
III Year: 13. Maithili 14. Dogri 15. Bodo 16. Konkani 17. Santali 18. Kashmiri
19. Sindhi 20. Oriya 21. Marathi 22. Khasi 23. Tulu 24. Kodava

Indian Sign Language corpora
Northern India: Delhi, 1st year
Southern India: Mysore, 2nd year
North-eastern India: Shillong, 3rd year
Western India: Ichalkaranji, 4th year
Eastern India: Kolkata, 5th year

Lexical items: 15000
Sentences: 2500
Production data: 50

Character Recognition
• Development of standards, tools and linguistic resources (datasets) for the fields of Online HWR, Offline HWR and OCR.
• Promotion of development of these technologies.
• Promotion of development of important and challenging applications of these technologies in the context of Indic languages and scripts.

By-products like lexicon, thesauri, WordNet etc.
• Creation of frequency dictionaries - five per year
  First year: Bengali, Hindi, Kannada, Manipuri, Urdu.
  Second year: Bodo, Dogri, Maithili, Nepali, Konkani.
  Third year: Assamese, Gujarati, Oriya, Punjabi, Tamil.
  Fourth year: Kashmiri, Malayalam, Marathi, Sanskrit, Santali.
  Fifth year: other languages
• Multilingual multi-directional dictionary - an ongoing process
• Aiding wordnet creation and collaborating with others for the same - an ongoing process

Licensing Policy
Licensing is an important issue for LDC-IL. The draft licensing policy shall be evolved through discussions within one year. It shall be finalized within a further year, by which time the annotated data will be available for delivery.

Evaluation
The data that LDC-IL creates and obtains has to be evaluated. For each kind of data, tool etc., metrics have to be evolved. Bench marking, good standards etc. have to be developed. Within a one-year time frame, this shall be accomplished for the first set of tools. In the following year(s), the same shall be developed for other data and tools.

Beyond Roadmap
Above all, and in addition to what is projected in the roadmap, LDC-IL will respond positively to the specific language data needs of individuals, institutions and industry by taking up their requests on a priority basis for licensing purposes. In the beginning, derivatives of the databases shall be licensed; after all the licensing issues are resolved, the databases themselves shall also be licensed.

Monolingual Text Corpora

Sl. No.  Language   Word Count
1.       Bengali    50,42,724
2.       Bodo       6,37,801
3.       Dogri      8,24,443
4.       English    21,15,461
5.       Hindi      3,45,882
6.       Kannada    71,84,702
7.       Kodava     1,83,322
8.       Konkani    I 5,69,906
9.       Maithili   83,92,505
10.      Manipuri   16,37,104
11.      Nepali     21,58,324
12.      Tamil      4,67,096
13.      Urdu       22,80,782
14.      Yarava     13,904
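
The word counts above use Indian digit grouping (the last three digits, then groups of two), so 50,42,724 is 5,042,724 in Western grouping. The following Python sketch shows one way to read and write such figures; the function names are hypothetical and not part of any LDC-IL tool.

```python
def parse_indian_number(text):
    """Parse a count written with Indian digit grouping, e.g. '50,42,724'."""
    digits = text.replace(",", "").replace(" ", "")
    if not digits.isdigit():
        raise ValueError(f"not a plain grouped number: {text!r}")
    return int(digits)


def format_indian_number(value):
    """Format a non-negative integer as last-three-digits, then pairs."""
    s = str(value)
    if len(s) <= 3:
        return s
    head, tail = s[:-3], s[-3:]
    groups = []
    # Peel two-digit groups off the right of the remaining head.
    while len(head) > 2:
        groups.insert(0, head[-2:])
        head = head[:-2]
    if head:
        groups.insert(0, head)
    return ",".join(groups + [tail])


bengali = parse_indian_number("50, 42, 724")
print(bengali, format_indian_number(bengali))
```

Parsing to plain integers first makes the counts easy to sum or compare across languages before formatting them back for display.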

Parallel Text Corpora

Sl. No.  Language   Texts  Word Count
1        English    05     1,26,828
         Bengali           93,952
2        English    04     88,025
         Dogri             93,293
3        English    73     17,57,736
         Hindi             17,53,235
4        English    32     7,79,258
         Kannada           4,76,855
5        English    07     1,59,419
         Maithili          1,36,421
6        English    11     2,63,256
         Nepali            2,02,157

Speech Data Set Details
Languages (column order): Assamese, Bengali, Gujarati, Hindi, Kannada

Phon. Bal. Vocabulary      439  561  689  800  390
Phon. Bal. Sentences       200  200  500  150
                           6  6  6
Command & Control Words    250  238  296  250  82
Proper Nouns               841  823  902  824  1018
Most Frequent Words        1000
Form & Function Words      265  178  232  200  432
News Domain texts          150  150  150
Connected Texts

Speech Data Set Details
Languages (column order): Maithili, Manipuri, Nepali, Tamil, Urdu

Phon. Bal. Vocabulary      509  374  421  565  775
Phon. Bal. Sentences       208  200  228  195
                           6  6  6
Command & Control Words    187  243  74  369  141
Proper Nouns               824  825  834  908  500
                           1000  1000
Form & Function Words      243  189  190  598  380
News Domain texts          150  150  150
Connected Texts
Most Frequent Words

Other languages to be completed before March 31, 2009: Malayalam, Punjabi

Speech Corpora

           Informants       Duration
Language   Male   Female    Minutes   Hours
Assamese   81     81        1985      0533.
Bengali    238    234       14850     247.50
Gujarati   77     83        3769      62.49
Hindi      314    316       11483     191.23
Kannada    82     82        2940      49.00
Maithili   82     82        3340      55.40
Manipuri   82     82        2602      43.22
Nepali     48     60        3307      55.07
Tamil      78     71        5127      85.45

Other languages to be completed before March 31, 2009: Malayalam, Punjabi, Urdu
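
The duration figures pair a minutes value with an hours value. As a hedged illustration of the arithmetic, the Python sketch below converts minutes both to hours-and-minutes and to decimal hours. Most entries, such as 3769 minutes against 62.49, match an hours.minutes reading (62 h 49 min), while 14850 against 247.50 matches decimal hours. Which convention the slide intended is not stated, and the function names are hypothetical.

```python
def minutes_to_hours_minutes(total_minutes):
    """Split a duration in minutes into (whole hours, remaining minutes)."""
    return divmod(total_minutes, 60)


def minutes_to_decimal_hours(total_minutes):
    """Express a duration in minutes as decimal hours, rounded to 2 places."""
    return round(total_minutes / 60, 2)


for minutes in (3769, 11483, 14850):
    hours, rest = minutes_to_hours_minutes(minutes)
    print(minutes, f"{hours} h {rest:02d} min", minutes_to_decimal_hours(minutes))
```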

Frequency Dictionaries: Most frequent 5000 words

Published
1. Bengali
2. Hindi
3. Kannada
4. Manipuri

To be published by March 31, 2009
1. Nepali
2. Urdu

Development of Tools
Corpora management packages developed:
1. Word Frequency Analyser
2. N-Gram (Bi-Gram, Tri-Gram) for word and character
3. Speech Annotation Manual prepared and published

The following packages will be developed:
1. KWIC and KWOC Retriever
2. Tool for Semi-Automatic Annotation of Speech Data.
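
As an illustration of what a word and character N-gram counter computes, here is a minimal Python sketch. It is not the LDC-IL package: the function names are hypothetical, words are split on whitespace, and characters are taken as Unicode code points, which is an assumption that splits Indic-script aksharas and conjuncts into their component code points.

```python
from collections import Counter


def ngrams(items, n):
    """Return every run of n consecutive items as a tuple."""
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]


def word_ngram_counts(text, n):
    """Count word n-grams using a plain whitespace split (an assumption)."""
    return Counter(ngrams(text.split(), n))


def char_ngram_counts(text, n):
    """Count character n-grams over Unicode code points."""
    return Counter("".join(gram) for gram in ngrams(list(text), n))


bigrams = word_ngram_counts("a b a b c", 2)
print(bigrams.most_common(1))
print(char_ngram_counts("banana", 2).most_common(2))
```

With n set to 1 the same functions give plain word and character frequencies, which is the job of a frequency analyser.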

Interns LDC-