SMS Based FAQ Retrieval Task at FIRE 2011

Danish Contractor, IBM Research India
Ankush Mittal, College of Engineering Roorkee
Deepak P, IBM Research India
L Venkata Subramaniam, IBM Research India
Dec 2, 2011
FIRE 2011 Proposal

Agenda
• Motivation and Overview
• Dataset
• Participants
• Evaluation and Final Scores

India’s Education Pyramid and Information Access Patterns
• Internet users: 70 million
• Mobile phone users: 800 million
The mobile phone is the preferred information device for Indians.

SMS Based FAQ Retrieval Task
SMS question: wht r d policis avlbl 4 cancar pasaints
FAQ database:
• Which insurance policies are available for cancer patients
  – LIC has some insurance policies for cancer patients
• What are the rates for roaming within India
  – Average roaming rates on prepaid connections are 60 paise per minute
SMS answer: LIC has some insurance policies for cancer patients
The goal is to find the question Q* in the FAQ database that best matches the SMS S.
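The matching goal on this slide can be illustrated with a minimal sketch. It is not the method of any participating team: it assumes a simple token-level fuzzy match built on Python's standard difflib, and the names tokens, score and best_match are hypothetical helpers introduced only for this example. The FAQ entries and the SMS are the ones shown on the slide.

```python
import re
from difflib import SequenceMatcher

def tokens(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def score(sms, question):
    """Average, over SMS tokens, of the best character-level similarity
    to any token of the FAQ question (an assumed scoring scheme)."""
    sms_tokens = tokens(sms)
    q_tokens = tokens(question)
    if not sms_tokens or not q_tokens:
        return 0.0
    total = sum(
        max(SequenceMatcher(None, s, q).ratio() for q in q_tokens)
        for s in sms_tokens
    )
    return total / len(sms_tokens)

def best_match(sms, faqs):
    """Return the (question, answer) pair whose question scores highest."""
    return max(faqs, key=lambda qa: score(sms, qa[0]))

# FAQ entries and SMS taken from the slide.
FAQS = [
    ("Which insurance policies are available for cancer patients",
     "LIC has some insurance policies for cancer patients"),
    ("What are the rates for roaming within India",
     "Average roaming rates on prepaid connections are 60 paise per minute"),
]
SMS = "wht r d policis avlbl 4 cancar pasaints"

question, answer = best_match(SMS, FAQS)
print(answer)
```

A scheme like this scans every FAQ for every SMS, so it is only meant to show the idea of matching noisy SMS tokens against clean FAQ questions.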

FAQ Retrieval Task
Retrieve the best FAQ for a given SMS query
• Task 1: Same Language Retrieval
  – SMS and FAQ in the same language (English, Hindi, Malayalam)
• Task 2: Cross Language Retrieval
  – English SMS, Hindi FAQ
• Task 3: Multi Lingual Retrieval
  – English/Hindi/Malayalam SMS, FAQs in all three languages

Details of Dataset
• FAQs
  – Collected from online resources, both govt. and private sector
  – Three languages: English, Hindi, Malayalam
  – FAQ categories: Health, Telecom, Insurance, Railway booking, …
• SMSes
  – Collected from mobile savvy college students, online sources and by manually perturbing questions to include common forms of noise-induced variations
  – Three languages: English, Hindi, Malayalam
  – Both in domain and out of domain
• SMS could match a FAQ in the same language, in another language or not at all

Dataset
No. of FAQs: English 7251, Hindi 1994, Malayalam 681
No. of SMS queries (in-domain, out-of-domain):
• English
  – Train: monolingual (701, 370), crosslingual (291, 181), multilingual (290, 170)
  – Test: monolingual (728, 2677), crosslingual (37, 3368), multilingual (724, 2681)
• Hindi
  – Train: (181, 49), (183, 47)
  – Test: (200, 124)
• Malayalam
  – Train: (120, 20), (60, 20)
  – Test: (50, 0)
Timeline:
• Training dataset release (May 2011 and July 2011)
  – FAQ dataset: FAQs in three languages
  – Training SMS: SMSes in three languages
• Test data release (August 2011)
  – Test SMS: SMSes in three languages
• Submissions by teams (Sept 2011)
  – Top 5 FAQs for each SMS

Participating Teams
1. Univ of Iowa (Sanmitra Bhattacharya, Hung Tran and Padmini Srinivasan)
2. BUAP Mexico (Darnes Vilariño Ayala, David Pinto, Saúl León Silverio, Esteban Castillo and Mireya Tovar Vidal)
3. DCE Delhi (Arpit Gupta)
4. IIIT Hyderabad (Aditya Mogadala, Bhupal Reddy and Vasudeva Varma)
5. DAIICT Gandhinagar (Khushboo Singhal, Smita Kumari and Gaurav Arora)
6. DTU Delhi (Anwar Shaikh, Rajiv Ratn Shah, Mukul Jain, Mukul Rawat and Manoj Kumar)
7. Jadhavpur Univ and IPN Mexico (Partha Pakray, Soujanya Poria, Sivaji Bandyopadhyay and Alexander Gelbukh)
8. DCU Dublin (Deirdre Hogan, Paul Ferguson, Hongyi Wang, Johannes Leveling and Cathal Gurrin)
9. MSRIT Bangalore (Vinayaka Dj)
10. TCS Mumbai (Arijit De)
11. SASTRA Thanjavur (Ashish Raste, Venkata Narasimhan A and Santhosh Bargav)
12. RVCE Bangalore (Nishit Shivhre)
13. IIIT Delhi (Tanushree Mishra)

Evaluation
• Participants to submit the top 5 FAQs for each SMS
• Accuracy and MRR based evaluation
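The slide does not spell out how the two measures are computed. As a rough sketch, assuming accuracy means the fraction of SMS queries whose top-ranked FAQ is correct and MRR means the mean of 1/rank of the correct FAQ within the submitted top 5 (0 when it is absent), the evaluation could look like the following. The function name evaluate and the FAQ and SMS identifiers are hypothetical, and the handling of out-of-domain queries is not modeled here.

```python
def evaluate(ranked, gold, k=5):
    """Return (accuracy, MRR) for a run.

    ranked: dict mapping an SMS id to its ranked list of FAQ ids.
    gold:   dict mapping an SMS id to the single correct FAQ id.
    Only the first k results per SMS are considered (k=5 as in the task).
    """
    if not gold:
        raise ValueError("gold must contain at least one SMS query")
    hits = 0
    reciprocal_rank_sum = 0.0
    for sms_id, correct in gold.items():
        top = ranked.get(sms_id, [])[:k]
        if top and top[0] == correct:
            hits += 1  # counted for accuracy only when ranked first
        if correct in top:
            reciprocal_rank_sum += 1.0 / (top.index(correct) + 1)
    n = len(gold)
    return hits / n, reciprocal_rank_sum / n

# Hypothetical toy run: three SMS queries with made-up FAQ ids.
run = {
    "sms1": ["faq1", "faq2"],          # correct FAQ at rank 1
    "sms2": ["faq9", "faq3", "faq2"],  # correct FAQ at rank 3
    "sms3": ["faq7"],                  # correct FAQ not retrieved
}
gold = {"sms1": "faq1", "sms2": "faq2", "sms3": "faq4"}

accuracy, mrr = evaluate(run, gold)
print(f"accuracy={accuracy:.3f} mrr={mrr:.3f}")
```

In the toy run, one of three queries is answered at rank 1 and one at rank 3, so accuracy is 1/3 and MRR is (1 + 1/3 + 0) / 3.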

Team-Task Matrix
13 teams, 72 runs, 9 sub-tasks (✔ = score above median)
Sub-task columns: English Mono, Hindi Mono, Mal Mono, Cross, English Multi, Hindi Multi, Mal Multi
Team (# of runs submitted):
• Iowa (19) ✔ ✔ ✔ ✔
• BUAP (11) ✔ ✔ ✔ ✔
• DCE (5) ✔ ✔ ✔
• IIITH (12) ✔ ✔ ✔
• DAIICT (6) ✔ ✔ ✔
• Jadhavpur-IPN (4) ✔ ✔
• DTU (4) ✔ ✔
• DCU (3) ✔
• MSRIT (3) ✔
• TCS (2) ✔
• SASTRA (1) ✔
• RVCE (1) ✔
• IIITD (1) ✔ ✔ ✔

Monolingual Task: English SMS – English FAQ
SMS: 728 in-domain, 2677 out-of-domain; FAQs: 7251
High score: 0.83; Median: 0.14
[Bar chart of scores per team, axis 0 to 0.9. Teams: Jadhavpur, MSRIT, IIITD, RVCE, DTU, IIITH, Iowa, BUAP, DAIICT, TCS, SASTRA, DCU. Bar labels: (508, 2307), (396, 1940), (432, 1512), (553, 871), (473, 118), (506, 19), (391, 75), (415, 0), (0, 225), (12, 58), (0, 29), (0, 0)]

Monolingual Task: Hindi SMS – Hindi FAQ
SMS: 200 in-domain, 124 out-of-domain; FAQs: 1994
High score: 0.62; Median: 0.53
[Bar chart of scores per team, axis 0 to 0.7. Teams: DCE, DAIICT, IIITH, Iowa, BUAP, Jadhavpur, DTU. Bar labels: (198, 3), (111, 80), (186, 0), (171, 2), (165, 0), (153, 0), (0, 119)]

Monolingual Task: Malayalam SMS – Malayalam FAQ
SMS: 50 in-domain, 0 out-of-domain; FAQs: 681
High score: 0.94; Median: 0.90
[Bar chart of scores per team, axis 0 to 1. Teams: IIITH, Iowa, BUAP, DAIICT. Bar labels: (47, 0), (46, 0), (44, 0), (39, 2)]

Crosslingual Task: English SMS – Hindi FAQ
SMS: 37 in-domain, 3368 out-of-domain; FAQs: 1994
High score: 0.65; Median: 0.0499
[Bar chart of scores per team, axis 0 to 0.7. Teams: Iowa, BUAP, IIITH, Jadhavpur, DCE. Bar labels: (2, 2206), (5, 182), (0, 170), (4, 159), (2, 40)]

Multilingual: English SMS – English/Hindi/Malayalam FAQ
SMS: 724 in-domain, 2681 out-of-domain; FAQs: 9926
High score: 0.52; Median: 0.15
[Bar chart of scores per team, axis 0 to 0.6. Teams: Iowa, BUAP, DCE. Bar labels: (424, 1336), (504, 17), (356, 25)]

Multilingual: Hindi SMS – English/Hindi/Malayalam FAQ
SMS: 200 in-domain, 124 out-of-domain; FAQs: 9926
High score: 0.57; Median: 0.51
[Bar chart of scores per team, axis 0 to 0.7. Teams: Iowa, BUAP, DCE. Bar labels: (103, 83), (165, 0), (113, 0)]

Multilingual: Malayalam SMS – English/Hindi/Malayalam FAQ
SMS: 50 in-domain, 0 out-of-domain; FAQs: 9926
High score: 0.88
[Bar chart of scores per team, axis 0 to 1. Teams: BUAP, Iowa. Bar labels: (44, 0), (32, 0)]

Concluding Remarks
• The mobile phone is the preferred information device for Indians
  – SMS is the preferred mode
• The FAQ Retrieval task encourages research on systems that give access to information in FAQ databases through SMS queries
  – The results are encouraging

Thank You!