Project Tukaram
Sagar Tamhane
Centre for Indian Language Technology Solutions, IIT Bombay
12 June 2002

The Goal
• To make Saint Tukaram’s Abhangas available over the web for browsing and searching
• Locate the right Abhangas that you need
• Present the pages to the user in order of importance

The Source
• The Abhangas are typed from a book called “श्री तुकारामबावांच्या अभंगांची गाथा” (Shri Tukarambavanchya Abhanganchi Gatha), published on 6th November 1973 by the Govt. of Maharashtra
• Previous editions: 1950 and 1955
• Number of Abhangas: 4644

Creation of Web Content
• Software used for typing: MS Word with Akruti_Priya_Expanded font and Akruti keyboard driver
• Problems faced:
  – Non-displayable characters, e.g. a character (shown as an image on the slide) was typed as mna
• Automated page splitting

Converters Used
• Akruti_Priya_Expanded to ISCII converter: required for indexing the text
• ISCII to monolingual ISFOC converter: required for displaying the text through DV-TTYogesh
• XDVNG to ISCII converter: for converting query strings to ISCII
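
A converter between a font encoding and ISCII is essentially a table lookup that always prefers the longest matching input sequence, which is also the rule Lex applies when choosing between patterns. The original converters were written in Lex and C; the Python sketch below only illustrates that idea. The function name `convert` and the entries of `TOY_TABLE` are invented placeholders for illustration, not real Akruti, XDVNG, or ISCII code values.

```python
# Toy mapping table: source sequences -> target symbols.
# These entries are hypothetical placeholders, not real font or ISCII codes.
TOY_TABLE = {
    "k": "K",
    "ka": "KA",
    "kh": "KH",
    "m": "M",
}


def convert(text, table):
    """Convert text by taking, at each position, the longest key in
    `table` that matches the input (longest-match rule)."""
    max_len = max(len(key) for key in table)
    out = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shorter ones.
        for size in range(min(max_len, len(text) - i), 0, -1):
            chunk = text[i:i + size]
            if chunk in table:
                out.append(table[chunk])
                i += size
                break
        else:
            # No key matched: pass the character through unchanged
            # so unmapped input stays visible in the output.
            out.append(text[i])
            i += 1
    return out
```

For example, `convert("kakhm", TOY_TABLE)` consumes `ka`, then `kh`, then `m`, rather than stopping at the shorter key `k`. Passing unmapped characters through unchanged is one design choice among several; a real converter might instead log or reject them.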

Technologies used for the Tukaram Search Engine
• Input technology:
  – Jtrans: XDVNG font
• Keyboard mapping:
  – Phonetic English
• Result display at client:
  – ISFOC
• Encoding for indexing (storage):
  – ISCII

Architecture [figure]

Input Technology [figure]

Components of the Search Engine
• Index
  – Case-sensitive ISCII
  – Database structure
• Searcher
  – In-memory search
  – Algorithm: hybrid of hashing and binary search
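
The slide does not say how hashing and binary search are combined. One plausible reading, sketched below in Python (the original engine is in Java 2), is to hash each word to a bucket and keep every bucket sorted so that it can be binary-searched in memory. The class name `HybridIndex`, the bucket count, and the byte-sum hash are all assumptions made for illustration, not the project's actual design.

```python
from bisect import bisect_left


class HybridIndex:
    """In-memory word index: hash to a bucket, binary search inside it."""

    def __init__(self, postings, n_buckets=64):
        # postings: dict mapping a word to the abhang numbers containing it.
        self.n_buckets = n_buckets
        buckets = [[] for _ in range(n_buckets)]
        for word, ids in postings.items():
            buckets[self._bucket(word)].append((word, sorted(ids)))
        # Sort each bucket by word so it can be binary-searched.
        self.buckets = [sorted(b) for b in buckets]
        self.keys = [[word for word, _ in b] for b in self.buckets]

    def _bucket(self, word):
        # Simple deterministic hash (hypothetical): sum of encoded bytes.
        return sum(word.encode("utf-8")) % self.n_buckets

    def lookup(self, word):
        """Return the sorted abhang numbers for `word`, or [] if absent.
        Comparison is exact, so the lookup is case sensitive."""
        b = self._bucket(word)
        keys = self.keys[b]
        pos = bisect_left(keys, word)
        if pos < len(keys) and keys[pos] == word:
            return self.buckets[b][pos][1]
        return []
```

A short usage example with made-up postings: `HybridIndex({"vitthal": [3, 1], "pandurang": [2]}).lookup("vitthal")` returns `[1, 3]`, while `lookup("Vitthal")` returns `[]` because the match is case sensitive.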

Database Structure [figure]

Snapshot of result [figure]

Relevancy Criteria
• Number of query words in the abhang
• Position
• Adjacency
• Total number of words in the abhang
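
The slide lists the four criteria but gives no formula for combining them. The Python sketch below shows one way they could be folded into a single score. The function name `relevance`, the weights, and the exact definition of each term are assumptions for illustration only, not the ranking actually used by the project.

```python
def relevance(query_words, abhang_words):
    """Score one abhang against a query using the four listed criteria.
    All weights and term definitions here are hypothetical."""
    # First position at which each query word occurs in the abhang.
    first_pos = {}
    for i, word in enumerate(abhang_words):
        if word in query_words and word not in first_pos:
            first_pos[word] = i
    if not first_pos:
        return 0.0

    n = len(abhang_words)

    # 1. Number of query words in the abhang (as a fraction of the query).
    matched = len(first_pos) / len(set(query_words))

    # 2. Position: an earlier first match scores higher.
    position = 1.0 - min(first_pos.values()) / n

    # 3. Adjacency: fraction of consecutive query-word pairs that also
    #    appear side by side, in the same order, in the abhang.
    query_pairs = list(zip(query_words, query_words[1:]))
    text_pairs = set(zip(abhang_words, abhang_words[1:]))
    if query_pairs:
        adjacency = sum(p in text_pairs for p in query_pairs) / len(query_pairs)
    else:
        adjacency = 0.0

    # 4. Total number of words: shorter abhangas get a small boost.
    brevity = 1.0 / n

    return 3.0 * matched + 1.0 * position + 2.0 * adjacency + 1.0 * brevity
```

With these assumed weights, an abhang that contains the query words next to each other near its start ranks above one of the same length where the same words are scattered, and an abhang with no query words scores 0.0.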

General information
• Number of abhangas: 4,644
• Total number of words: 2,09,702
• Number of distinct words: 34,773
• Languages used for converters: Lex & C
• Language used for search engine: Java 2
• Scripting on client side: JavaScript

Thank You