Content Level Access to Digital Library of India Pages
Praveen Krishnan, Ravi Shekhar, C. V. Jawahar
CVIT, IIIT Hyderabad
Digital Library of India (DLI)
http://www.dli.iiit.ac.in/
Vision: To enhance access to information and knowledge for the masses.
• Partner in the Million Book Universal Digital Library Programme.
• Information for people
• Dataset for researchers
Vamshi Ambati, N. Balakrishnan, Raj Reddy, Lakshmi Pratha, C V Jawahar: The Digital Library of India Project: Process, Policies and Architecture, ICDL, 2007.
Digital Library of India (DLI): Content and Language Statistics
• #Books: 4 lakh
• #Pages: 134 million
• #Words: 26 billion
• 41 different languages, including Hindi, Telugu, Marathi, English, French and Greek
Source: http://www.new1.dli.ernet.in/
Digital Library of India (DLI): Metadata Search
• Supports metadata-based search.
• No content-level access.
[Figure: search box with the query "Indian freedom struggle and independence"]
Digital Library of India (DLI)
• Need content-level access
• Content + metadata
• Open question: a reliable text representation
[Figure: search box with the query "Indian freedom struggle and independence"]
Goal: Digital Library of India Search
• Build a search engine with support for Indian languages.
• Word spotting
Goal: Indian Language Document Search Engine
• Text query support
• Multi-keyword support
• Ranking based on number of occurrences
• Semantically related words
• Seamless scaling to billions of word images
• Sub-second retrieval
Text from OCR
[Figure: sample Hindi page and sample Telugu page]
– Hindi: Title - Praachin Bhaartiy Vichaar Aur Vibhutiyaan, Published in 1624
– Telugu: Title - Andhra Vagmayaramba Dasha, Published in 1960
Highlighted on the pages:
• Cuts
• Merges
• Variations in script, font and typesetting
Text from OCR: Accuracy
[Figure: character-level %, word-level % and search % charts for Hindi and Telugu [1]]
[1] D. Arya, T. Patnaik, S. Chaudhury, C. V. Jawahar, B. B. Chaudhury, A. G. Ramakrishnan, G. S. Lehal, and C. Bhagavati, "Experiences of Integration and Performance Testing of Multilingual OCR for Printed Indian Scripts," in ICDAR MOCR Workshop, 2011.
BoVW for Image Retrieval
• Fixed-length representation
• Invariant to popular deformations
[Figure: text retrieval analogy; query image, recognition, ranked retrieved results]
Josef Sivic, Andrew Zisserman: Video Google: A Text Retrieval Approach to Object Matching in Videos. ICCV 2003.
BoVW for Document Image Retrieval
[Figure: word images (clean, with cuts, with merges) and their histograms of visual words]
R. Shekhar and C. V. Jawahar. Word Image Retrieval Using Bag of Visual Words. In DAS, 2012.
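The histogram of visual words can be illustrated with a minimal sketch. It assumes local descriptors have already been extracted and that a codebook of visual words is available; the function names, the 2-D toy descriptors and the nearest-centre quantization are illustrative assumptions, not the authors' implementation.

```python
from collections import Counter

def nearest_word(descriptor, codebook):
    """Index of the codebook entry closest (squared Euclidean) to the descriptor."""
    def sq_dist(center):
        return sum((d - c) ** 2 for d, c in zip(descriptor, center))
    return min(range(len(codebook)), key=lambda i: sq_dist(codebook[i]))

def bovw_histogram(descriptors, codebook):
    """Normalized count of how often each visual word occurs in one word image."""
    counts = Counter(nearest_word(d, codebook) for d in descriptors)
    total = len(descriptors)
    return [counts[i] / total if total else 0.0 for i in range(len(codebook))]

# Hypothetical toy data: three visual words and four descriptors in 2-D.
codebook = [(0.0, 0.0), (10.0, 10.0), (0.0, 10.0)]
descriptors = [(0.1, -0.2), (9.5, 10.2), (9.9, 9.7), (0.3, 9.8)]
hist = bovw_histogram(descriptors, codebook)
```

Because the histogram only counts visual words, a word image that loses a few descriptors to a cut or gains a few from a merge still yields a similar vector, which is what makes the representation fixed-length and tolerant to degradation.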
BoVW for Document Image Retrieval
• Robust against degradation
• Geometry is lost
• Use spatial verification
  – SIFT based.
  – Longest subsequence alignment.
[Figure: visual words V1, V2, V6, V4, V8, V9 plotted by x-y position for clean, cut and merged word images]
R. Shekhar and C. V. Jawahar. Word Image Retrieval Using Bag of Visual Words. In DAS, 2012.
I. Z. Yalniz and R. Manmatha. An Efficient Framework for Searching Text in Noisy Document Images. In DAS, 2012.
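Spatial verification by longest subsequence alignment can be sketched as follows. The sketch assumes each keypoint is reduced to an (x, visual word id) pair and that the score is the longest common subsequence length divided by the query length; both choices, and the toy coordinates, are assumptions for illustration rather than the cited methods' exact formulation.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two sequences."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def word_sequence(keypoints):
    """Visual word ids ordered left to right; keypoints are (x, word_id) pairs."""
    return [word for _, word in sorted(keypoints)]

def spatial_score(query_kps, candidate_kps):
    """Fraction of the query's visual words matched in the same left-to-right order."""
    q = word_sequence(query_kps)
    c = word_sequence(candidate_kps)
    return lcs_length(q, c) / len(q) if q else 0.0

# Hypothetical keypoints for a clean query and two candidates.
query = [(0.5, 1), (1.0, 2), (1.5, 6), (2.0, 4), (2.5, 8), (3.0, 9)]
cut = [(0.5, 1), (1.0, 2), (2.0, 4), (2.5, 8), (3.0, 9)]  # one word lost to a cut
scrambled = [(0.5, 9), (1.0, 8), (1.5, 4), (2.0, 6), (2.5, 2), (3.0, 1)]

cut_score = spatial_score(query, cut)
scrambled_score = spatial_score(query, scrambled)
```

The scrambled candidate has exactly the same histogram as the query, so a plain BoVW comparison cannot separate them; the ordering check gives it a much lower score than the candidate with a cut.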
Query Expansion
[Figure: the query image histogram is used to query the database; the ranked results (Rank 1 to Rank 6) produce a refined histogram, which is used as the new query histogram to obtain better results]
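A minimal sketch of query expansion over BoVW histograms follows. It assumes cosine similarity for ranking and a plain average of the query histogram with its top-k results as the refined histogram; the slides do not state the exact refinement rule, so `top_k`, the averaging and the toy database are assumptions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two histograms (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def rank(query_hist, database):
    """Database ids sorted by decreasing similarity to the query histogram."""
    return sorted(database, key=lambda k: cosine(query_hist, database[k]), reverse=True)

def expand_query(query_hist, database, top_k=2):
    """Average the query histogram with the histograms of its top_k results."""
    top = rank(query_hist, database)[:top_k]
    hists = [query_hist] + [database[k] for k in top]
    return [sum(col) / len(hists) for col in zip(*hists)]

# Hypothetical database of word-image histograms over four visual words.
database = {
    "w1": [1, 1, 0, 0],
    "w2": [1, 1, 1, 0],
    "w3": [0, 1, 1, 0],
    "noise": [0, 0, 0, 1],
}
query = [1, 0, 0, 0]  # a degraded query that retains a single visual word
expanded = expand_query(query, database)
reranked = rank(expanded, database)
```

In this toy case the original query has zero similarity to "w3", while the refined histogram picks up the visual words of the top results and gives "w3" a non-zero score.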
Text Query Support
• Originally formulated in a "query by example" setting (input: query image, converted to a histogram).
• Need text queries (input: text query, converted to a text query histogram).
Observations
• Are the results of OCR and BoVW complementary?
[Figure: overlap between OCR and BoVW results]
• mAP vs. word length
[Figure: mAP against number of characters]
• "OCR system has a high precision while BoVW approach has a high recall."
• Example: #GT = 5
  – OCR output list: Precision = 1; Recall = 0.4
  – BoVW output list: Precision = 0.8; Recall = 1
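The precision and recall example can be reproduced with a short sketch. The list contents are assumptions: an OCR list of 2 correct items gives the stated precision 1 and recall 0.4, and a BoVW list of 6 items containing all 5 ground-truth items gives recall 1 with precision 5/6, which is one list size consistent with the stated 0.8 after rounding.

```python
def precision_recall(retrieved, ground_truth):
    """Set-based precision and recall of a retrieved list against the ground truth."""
    retrieved, ground_truth = set(retrieved), set(ground_truth)
    tp = len(retrieved & ground_truth)  # true positives
    precision = tp / len(retrieved) if retrieved else 0.0
    recall = tp / len(ground_truth) if ground_truth else 0.0
    return precision, recall

# Hypothetical ids: g1..g5 are the five ground-truth occurrences, n1 is a false positive.
ground_truth = ["g1", "g2", "g3", "g4", "g5"]
ocr_out = ["g1", "g2"]
bovw_out = ["g1", "g2", "g3", "g4", "g5", "n1"]

ocr_p, ocr_r = precision_recall(ocr_out, ground_truth)
bovw_p, bovw_r = precision_recall(bovw_out, ground_truth)
```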
Fusion: Naïve Fusion
• Concatenating OCR results with BoVW results.
[Figure: mAP chart; OCR, BoVW]
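Naïve fusion by concatenation can be sketched in a few lines. The sketch assumes OCR results are placed first (in line with the observation that OCR has high precision) and that items already returned by OCR are not repeated; the de-duplication step and the ids are assumptions.

```python
def naive_fusion(ocr_results, bovw_results):
    """OCR results first, followed by BoVW results that are not already listed."""
    seen = set()
    fused = []
    for item in list(ocr_results) + list(bovw_results):
        if item not in seen:
            seen.add(item)
            fused.append(item)
    return fused

# Hypothetical ranked lists of word-image ids.
ocr_ranked = ["w3", "w7"]
bovw_ranked = ["w7", "w1", "w3", "w9"]
fused = naive_fusion(ocr_ranked, bovw_ranked)
```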
Fusion: Edit Distance Based Fusion
• Reordering BoVW results using:
  – BoVW score
  – Modified edit distance cost
[Figure: mAP chart; OCR, BoVW]
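Reordering BoVW results with an edit distance term might look like the sketch below. The slides mention a modified edit distance cost without defining it, so this uses the standard Levenshtein distance as a stand-in; the linear combination, the weight `alpha` and the transliterated example strings are all assumptions.

```python
def edit_distance(a, b):
    """Standard Levenshtein distance (unit cost insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

def rerank(query_text, candidates, alpha=0.5):
    """Reorder BoVW candidates given as (image_id, bovw_score, ocr_text) tuples."""
    def fused_score(candidate):
        _, bovw_score, ocr_text = candidate
        longest = max(len(query_text), len(ocr_text), 1)
        text_similarity = 1 - edit_distance(query_text, ocr_text) / longest
        return alpha * bovw_score + (1 - alpha) * text_similarity
    return [c[0] for c in sorted(candidates, key=fused_score, reverse=True)]

# Hypothetical BoVW output, already sorted by BoVW score.
candidates = [
    ("img1", 0.90, "barter"),
    ("img2", 0.85, "bharat"),
    ("img3", 0.30, "bharit"),
]
reordered = rerank("bharat", candidates)
```

Here the candidate whose OCR text matches the query exactly moves ahead of a visually similar candidate with a higher BoVW score.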
Fusion: Hybrid Fusion
• Re-querying BoVW using OCR retrieved results.
• Using rank aggregation techniques.
[Figure: mAP chart; OCR, BoVW]
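The slides do not name the rank aggregation technique used in hybrid fusion. As one common option, a Borda count over the OCR and BoVW ranked lists is sketched below; the choice of Borda count, the tie-breaking by id and the example lists are assumptions for illustration only.

```python
def borda_aggregate(rankings):
    """Merge ranked lists: each list gives n points to its first item, n - 1 to the next,
    and so on, where n is the number of distinct items; absent items get no points."""
    items = {item for ranking in rankings for item in ranking}
    n = len(items)
    points = {item: 0 for item in items}
    for ranking in rankings:
        for position, item in enumerate(ranking):
            points[item] += n - position
    # Highest total first; ties broken by id so the output is deterministic.
    return sorted(points, key=lambda item: (-points[item], item))

# Hypothetical ranked lists of word-image ids.
ocr_list = ["w3", "w7"]
bovw_list = ["w7", "w1", "w3", "w9"]
aggregated = borda_aggregate([ocr_list, bovw_list])
```

Items that both systems retrieve accumulate points from both lists and rise to the top, while items found by only one system are kept but ranked lower.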
Experimental Results
Experimental Details
• OCR [1]
• Feature detector – Harris interest point detection [2]
• Feature descriptor – SIFT [2]
• Indexing – Lucene [3]
[1] D. Arya, T. Patnaik, S. Chaudhury, C. V. Jawahar, B. B. Chaudhury, A. G. Ramakrishnan, G. S. Lehal, and C. Bhagavati, "Experiences of Integration and Performance Testing of Multilingual OCR for Printed Indian Scripts," in ICDAR MOCR Workshop, 2011.
[2] http://www.vlfeat.org
[3] http://lucene.apache.org/
Test Bed
[Figure: sample word images]
DLI Corpus
Language       #Books   #Pages   #Words      Annotation
Hindi (HS1)    11       1000     362,593     Yes
Hindi (HS2)    52       10,196   4,290,864   No
Telugu (TS1)   11       1000     161,276     Yes
Telugu (TS2)   69       13,871   2,531,069   No
• In addition, we used the fully annotated HP1 & TP1 datasets.
Evaluation Measures
• Precision = TP / (TP + FP)
• Recall = TP / (TP + FN)
  TP = True Positive, FP = False Positive, FN = False Negative
• mAP (Mean Average Precision): mean of the area under the precision-recall curve over all queries.
• Precision @ 10: shows how accurate the top 10 retrieved results are.
[Figure: precision-recall curve]
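These ranking measures can be computed as in the sketch below. It uses the usual discrete form of average precision (the mean of the precision values at each rank where a relevant item appears), which approximates the area under the precision-recall curve; the function names and toy rankings are assumptions.

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top k retrieved items that are relevant."""
    return sum(1 for item in ranked[:k] if item in relevant) / k

def average_precision(ranked, relevant):
    """Mean of precision values at the ranks where relevant items are retrieved."""
    hits = 0
    total = 0.0
    for position, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / position
    return total / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    """runs: list of (ranked_list, relevant_set) pairs, one per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

# Hypothetical results for two queries.
runs = [
    (["a", "x", "b", "y", "c"], {"a", "b", "c"}),
    (["x", "a"], {"a"}),
]
ap_first = average_precision(*runs[0])
map_score = mean_average_precision(runs)
p_at_2 = precision_at_k(runs[0][0], runs[0][1], 2)
```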
BoVW Search
Language       #Query   BoVW              BoVW + Query Expansion
                        mAP     Prec@10   mAP     Prec@10
Hindi (HP1)    100      62.54   81.30     66.09   83.86
Telugu (TP1)   100      71.13   78        73.08   79.89
Comparison of naïve BoVW with BoVW + Query Expansion
BoVW Search
Language       #Query   BoVW              BoVW using Text Queries
                        mAP     Prec@10   mAP     Prec@10
Hindi (HP1)    100      62.54   81.30     56.32   73.89
Telugu (TP1)   100      71.13   78        69.06   78.83
Comparison of naïve BoVW with BoVW + Text Query Support
Language       #Query   Naïve             Edit Distance     Hybrid
                        mAP     Prec@10   mAP     Prec@10   mAP     Prec@10
Hindi (HP1)    100      75.66   90.7      79.58   90.8      80.37   91.4
Telugu (TP1)   100      76.02   81.2      78.01   81.4      80.23   83.7
Comparative performance of different fusion techniques on HP1 & TP1
Language       #Query   OCR               BoVW              Fusion
                        mAP     Prec@10   mAP     Prec@10   mAP     Prec@10
Hindi (HS1)    100      14.95   62.60     60.55   95.5      68.81   95.6
Telugu (TS1)   100      27.03   62.10     74.38   90.6      78.41   91.9
Performance statistics on DLI Annotated Corpus
Language       #Query   Precision @ N   OCR     BoVW    Fusion
Hindi (HS2)    50       Prec@10         82.03   96.94   97.11
                        Prec@20         75.16   94.83   95.42
                        Prec@30         71.12   92.82   93.16
Telugu (TS2)   50       Prec@10         90.85   99.14 *
                        Prec@20         85.42   98.00   98.85
                        Prec@30         80.76   96.38   96.57
* Only two values survive for this row; whether 99.14 belongs to BoVW or Fusion is not recoverable.
Performance statistics on DLI Un-Annotated Corpus
Retrieved Results
[Figures: sample retrieved results]
Failure Cases
• The word images shown in the figure fail in both OCR and BoVW.
• Reasons:
  – (a) A short word image containing a character that is no longer in use.
  – (b) A highly degraded word image.
Implementation Details
• Search engine development
  – An elegant web-based search and retrieval interface.
[Figure: Lucene scalability; time in milliseconds against number of images and number of visual words]
[Figure: sample retrieved page]
Search Architecture (Ongoing)
[Diagram components: Search Query, Delegator, OCR Index Web Service, OCR Ranking, BoVW Index Web Service, Query Expansion, Partial Scores, Fusion, Ranked Results]
Ongoing Work
• Learn to improve from annotated datasets
  – Use of a visual confusion matrix to improve BoVW results.
• Necessity of costly features for re-ranking
  – The images shown in the failure cases would require costly features to be retrieved.
  – Use of machine learning algorithms.
• Exploration of features better than SIFT.
Thank You