The Domain-Specific Track at CLEF 2007 (Vivien Petras)

The Domain-Specific Track at CLEF 2007
Vivien Petras, Stefan Baerisch & Max Stempfhuber
GESIS Social Science Information Centre, Bonn, Germany
Budapest, September 19, 2007

Outline
• The Domain-Specific Task
• Collections & Controlled Vocabularies
• Topics
• Participants, Runs & Relevance Assessments
• Themes
• Summary & Outlook

The Domain-Specific Task
CLIR on structured scientific document collections:
• social science domain
• bibliographic metadata
• controlled vocabularies for subject description
Leverage bibliographic metadata & controlled vocabularies for:
• search
• translation

The Domain-Specific Tasks:
• Monolingual against German, English or Russian
• Bilingual against German, English or Russian
• Multilingual against combined collection

Collections

Language  Name     Coverage     Docs     Abstracts  Description
German    GIRT-DE  1990-2000    151,319  96%        German social science literature & projects
English   GIRT-EN  1990-2000    151,319  17%        GIRT-DE translated
English   CSA-SA   1994-1996    20,000   94%        Sociological Abstracts
Russian   ISISS    (not given)  145,802  27%        Institute of Scientific Information for Social Sciences of the Russian Academy of Sciences

Controlled Vocabularies
5 different subject-describing terminologies:
• Thesaurus for the Social Sciences (GIRT-DE, -EN)
• Thesaurus of Sociological Indexing Terms (CSA-SA)
• INION Thesaurus (ISISS)
• Social Sciences Classification (GIRT-DE, -EN)
• Sociological Abstracts Classification (CSA-SA)

                    GIRT  CSA-SA  ISISS
Descriptors / doc   10    6.4     3.9
Class. codes / doc  2     1.3     n/a

Controlled Vocabularies – Mapping Tools
Translation:
• GIRT German – GIRT English
Intellectual term mappings (cross-walks):
• equivalent terms in vocabularies
• GIRT German – CSA-SA English
• GIRT English – CSA-SA English
Example:
  original-term: agricultural area
  mapped-term: Rural areas
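
A cross-walk of this kind can be applied to a query as a simple term lookup. The following is a minimal sketch, not the track's actual tooling: only the pair "agricultural area" -> "Rural areas" comes from the slide, while the table name `CROSSWALK_GIRT_EN_TO_CSA`, the function `map_terms`, and the second query term are hypothetical and used for illustration only.

```python
# Hypothetical cross-walk table. Only the "agricultural area" -> "Rural areas"
# pair is taken from the slide; the data structure itself is an assumption.
CROSSWALK_GIRT_EN_TO_CSA = {
    "agricultural area": ["Rural areas"],
}


def map_terms(terms, crosswalk):
    """Replace each source-vocabulary term with its mapped terms.

    Terms without a mapping are passed through unchanged, so the query
    never loses a term just because the cross-walk has no entry for it.
    """
    mapped = []
    for term in terms:
        key = term.strip().lower()  # cross-walk keys are stored in lower case
        mapped.extend(crosswalk.get(key, [term]))
    return mapped


query_terms = ["Agricultural area", "migration"]
print(map_terms(query_terms, CROSSWALK_GIRT_EN_TO_CSA))
# → ['Rural areas', 'migration']
```

Passing unmapped terms through unchanged is one possible design choice; a system could equally drop them or fall back to a translation service.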

Topics
25 topics in standard TREC format (title, desc, narr):
• 15 volunteers (social scientists)
• 2-5 suggestions from 28 subject specialties
• checked for:
  • coverage in collections
  • variance from previous years
• translated into English, Russian

Participants
5 groups (group: institution, country):
• Chemnitz: Media Informatics, Chemnitz University of Technology, Germany
• Cheshire: School of Information, UC Berkeley, USA
• Moscow State University, Russia
• Unine: Computer Science Department, University of Neuchatel, Switzerland
• Xerox: Data Mining Group, Xerox Research Centre Europe, France

Runs

Task                 Runs 2007  Runs 2006  Runs 2005
Monolingual
  - against German   13         13         17
  - against English  15         8          15
  - against Russian  11         1          8
Bilingual
  - against German   14         6          15
  - against English  15         3          13
  - against Russian  9          3          5
Multilingual         9          2          3
Total                86         36         76

Relevance Assessments
All assessments done with Univ. of Padova's DIRECT System.

                 German  English  Russian
Pool size        16,288  17,867   14,473
Rel. docs 2007   22%     25%      10%*
Rel. docs 2006   39%     26%      n/a
Rel. docs 2005   20%     21%      9% (RSSC)

* In Russian collection: 3 topics without relevant documents

Relevance Assessments – Best MAP

Task                 MAP 2007      MAP 2006      MAP 2005
Monolingual
  - against German   0.5051        0.5454        0.4936
  - against English  0.3534        0.4576        0.5065
  - against Russian  0.1971        0.2542        0.3038
Bilingual
  - against German   0.4568 (90%)  0.2448 (45%)  0.4201 (85%)
  - against English  0.3341 (95%)  0.3301 (72%)  0.4743 (94%)
  - against Russian  0.1348 (68%)  0.1648 (62%)  0.2331 (77%)
Multilingual         0.0884        0.0753        0.0532
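
For readers unfamiliar with the measure, mean average precision (MAP) can be sketched as below. This is a generic textbook formulation, not the evaluation software used by the track. The function names and the toy run are hypothetical, and skipping topics without relevant documents is an assumption made here. The percentages in parentheses appear to express the best bilingual MAP as a share of the best monolingual MAP for the same target language (for example 0.4568 / 0.5051, about 90%); that reading is inferred from the numbers and is not stated on the slide.

```python
def average_precision(ranked_doc_ids, relevant_doc_ids):
    """Average of the precision values at each rank where a relevant doc appears."""
    relevant = set(relevant_doc_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_doc_ids, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank
    # Relevant documents that were never retrieved contribute zero precision.
    return precision_sum / len(relevant)


def mean_average_precision(run, qrels):
    """Mean of average precision over topics that have relevant documents.

    Assumption: topics with no relevant documents are skipped.
    """
    scores = [
        average_precision(run.get(topic, []), relevant)
        for topic, relevant in qrels.items()
        if relevant
    ]
    return sum(scores) / len(scores) if scores else 0.0


def percent_of_monolingual(bilingual_map, monolingual_map):
    """Bilingual MAP as a rounded percentage of the monolingual MAP."""
    return round(100 * bilingual_map / monolingual_map)


# Toy data (hypothetical document ids and judgments).
run = {"t1": ["d1", "d2", "d3", "d4"], "t2": ["d5", "d6"]}
qrels = {"t1": {"d1", "d3"}, "t2": {"d6"}, "t3": set()}

print(round(mean_average_precision(run, qrels), 4))  # → 0.6667
print(percent_of_monolingual(0.4568, 0.5051))        # → 90
```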

Themes – Retrieval Models
• Lucene
• Language Modelling
• Logistic Regression
• Comparison: Vector Space, LM, Probabilistic (Okapi, DFR)
• Data fusion
• Russian
  • word-based vs. N-gram retrieval
  • new light-weight stemmer
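
The difference between word-based and N-gram indexing can be illustrated with two tokenizers. This is only a sketch of the general idea: the function names, the whitespace tokenization, and the n-gram length of 4 are assumptions, and the slide does not say which settings the participants used.

```python
def word_tokens(text):
    """Word-based indexing units: lower-cased, whitespace-separated words."""
    return text.lower().split()


def char_ngrams(text, n=4):
    """Character n-gram indexing units, generated within word boundaries.

    Words no longer than n characters are kept whole. The default n=4 is an
    illustrative assumption, not a value taken from the slides.
    """
    grams = []
    for word in text.lower().split():
        if len(word) <= n:
            grams.append(word)
        else:
            grams.extend(word[i:i + n] for i in range(len(word) - n + 1))
    return grams


print(word_tokens("Social Science"))
# → ['social', 'science']
print(char_ngrams("Social Science"))
# → ['soci', 'ocia', 'cial', 'scie', 'cien', 'ienc', 'ence']
```

Overlapping n-grams let inflected forms of a word share index terms without a stemmer, which is one reason the approach is often tried for morphologically rich languages such as Russian.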

Themes – Query Expansion
Entry Vocabulary Modules
• query terms associated with thesaurus terms from documents
Thesaurus Lookup
• combined thesaurus from all CVs
• GIRT Thesaurus Index
Lexical Entailment
• find document terms in relation to query terms
Blind Feedback
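
Of the techniques listed, blind feedback is the easiest to sketch: the top-ranked documents of an initial search are assumed to be relevant, and their most frequent terms are added to the query. The version below is deliberately simplified and uses raw term counts; the function name, parameters, and toy documents are hypothetical, and the participants' actual term-selection formulas are not given on the slide.

```python
from collections import Counter


def blind_feedback_expand(query_terms, ranked_docs, top_docs=2, top_terms=2):
    """Expand a query with frequent terms from the top-ranked documents.

    ranked_docs is a list of token lists, best-ranked first. Terms already
    in the query are not added again. Ties are broken alphabetically so the
    result is deterministic.
    """
    query = [term.lower() for term in query_terms]
    counts = Counter()
    for doc_tokens in ranked_docs[:top_docs]:
        counts.update(t.lower() for t in doc_tokens if t.lower() not in query)
    candidates = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return query + [term for term, _ in candidates[:top_terms]]


# Toy initial ranking (hypothetical documents, already tokenized).
initial_ranking = [
    ["rural", "areas", "farming"],
    ["farming", "rural", "policy"],
    ["urban", "housing"],
]
print(blind_feedback_expand(["rural"], initial_ranking))
# → ['rural', 'farming', 'areas']
```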

Themes – Translation
Lucene plug-in
• Babelfish, Google, PROMT, Reverso
Bilingual thesaurus mapping
Dictionary adaptation
• disambiguate term translation given language context of feedback documents
Statistical machine translation
• MATRAX
Commercial Software

Summary & Outlook
Extension of Russian materials
• Translation table DE-EN-RU for GIRT Thesaurus
• Translation table RU-EN for INION Thesaurus
• Mapping between GIRT – INION Thesaurus
More tools for terminology mapping
• different relationships (0 T, SYN, BT, NT, RT)
• GESIS-IZ project: > 40 mappings
  • 25 controlled vocabularies / 11 disciplines
  • ~125,000 terms & phrases
  • ~400,000 relations

Domain-Specific Track: http://www.gesis.org/en/research/information_technology/clef_ds_2007.htm
Vocabulary Mappings: http://www.gesis.org/en/research/information_technology/komohe.htm
Email: vivien.petras@gesis.org