The Domain-Specific Track at CLEF 2007, Vivien Petras
- Slides: 17
The Domain-Specific Track at CLEF 2007
Vivien Petras, Stefan Baerisch & Max Stempfhuber
GESIS Social Science Information Centre, Bonn, Germany
Budapest, September 19, 2007
Outline
• The Domain-Specific Task
• Collections & Controlled Vocabularies
• Topics
• Participants, Runs & Relevance Assessments
• Themes
• Summary & Outlook
The Domain-Specific Task
Cross-language information retrieval (CLIR) on structured scientific document collections:
• social science domain
• bibliographic metadata
• controlled vocabularies for subject description
Leverage bibliographic metadata & controlled vocabularies for:
• search
• translation
The Domain-Specific Tasks:
• Monolingual against German, English or Russian
• Bilingual against German, English or Russian
• Multilingual against the combined collection
Collections
Name     Language  Docs     Abstracts  Description                                                        Coverage
GIRT-DE  German    151,319  96%        German social science literature & projects                        1990-2000
GIRT-EN  English   151,319  17%        GIRT-DE translated                                                 1990-2000
CSA-SA   English   20,000   94%        Sociolog. Abstracts                                                1994-1996
ISISS    Russian   145,802  27%        Inst. of Scientific Inf. for Soc. Sc. of the Ru. Acad. of Science
Controlled Vocabularies
5 different subject-describing terminologies:
• Thesaurus for the Social Sciences (GIRT-DE, -EN)
• Thesaurus of Sociological Indexing Terms (CSA-SA)
• INION Thesaurus (ISISS)
• Social Sciences Classification (GIRT-DE, -EN)
• Sociological Abstracts Classification (CSA-SA)
                    GIRT  CSA-SA  ISISS
Descriptors / doc   10    6.4     3.9
Class. codes / doc  2     1.3     n/a
Controlled Vocabularies – Mapping Tools
Translation:
• GIRT German – GIRT English
Intellectual term mappings (cross-walks):
• equivalent terms in vocabularies
• GIRT German – CSA-SA English
• GIRT English – CSA-SA English
Example:
original-term: agricultural area
mapped-term: Rural areas
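The cross-walk idea on this slide can be pictured as a lookup table from a term in one vocabulary to its equivalent in another. The sketch below is only an illustration, not the track's actual tooling: the `CROSSWALK` dictionary, the `map_terms` helper and the lower-casing step are assumptions, and only the "agricultural area" to "Rural areas" pair comes from the slide.

```python
# Hypothetical cross-walk from GIRT English descriptors to CSA-SA descriptors.
# Only this one entry is taken from the slide; the data structure is an assumption.
CROSSWALK = {
    "agricultural area": "Rural areas",
}


def map_terms(terms, crosswalk):
    """Replace each term by its mapped equivalent; keep unmapped terms unchanged."""
    mapped = []
    for term in terms:
        key = term.strip().lower()  # assumed normalization: trim and lower-case
        mapped.append(crosswalk.get(key, term))
    return mapped


# "migration" is a made-up query term with no mapping, so it passes through.
print(map_terms(["Agricultural area", "migration"], CROSSWALK))
```

Passing unmapped terms through unchanged is one possible design choice; a system could equally drop them or fall back to machine translation.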
Topics
25 topics in standard TREC format (title, desc, narr):
• 15 volunteers (social scientists)
• 2-5 suggestions from 28 subject specialties
• checked for:
  • coverage in collections
  • variance from previous years
• translated into English, Russian
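To make the three topic fields concrete, here is a minimal sketch of a topic record and a query built from selected fields. The `Topic` class, the `build_query` helper and the example topic text are all hypothetical; the field meanings follow common TREC usage, and the slides do not say which fields the participants actually used.

```python
from dataclasses import dataclass


@dataclass
class Topic:
    """One topic with the three TREC-style fields named on the slide."""
    title: str  # short, keyword-like statement of the topic
    desc: str   # one-sentence description of the information need
    narr: str   # narrative explaining what counts as relevant


def build_query(topic, fields=("title", "desc")):
    """Concatenate the chosen topic fields into one query string."""
    return " ".join(getattr(topic, name) for name in fields)


# Made-up example topic, not one of the 25 track topics.
example = Topic(
    title="Rural development",
    desc="Find documents on development programmes in rural areas.",
    narr="Relevant documents discuss policies or projects for rural regions.",
)
print(build_query(example))
```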
Participants
5 groups
Group     Institution                                           Country
Chemnitz  Media Informatics, Chemnitz University of Technology  Germany
Cheshire  School of Information, UC Berkeley                    USA
Moscow    Moscow State University                               Russia
Unine     Computer Science Department, University of Neuchatel  Switzerland
Xerox     Data Mining Group, Xerox Research Centre Europe       France
Runs
Task                Runs 2007  Runs 2006  Runs 2005
Monolingual
- against German    13         13         17
- against English   15         8          15
- against Russian   11         1          8
Bilingual
- against German    14         6          15
- against English   15         3          13
- against Russian   9          3          5
Multilingual        9          2          3
Total               86         36         76
Relevance Assessments
All assessments done with Univ. of Padova's DIRECT System.
                German  English  Russian
Pool size       16,288  17,867   14,473
Rel. docs 2007  22%     25%      10%*
Rel. docs 2006  39%     26%      n/a
Rel. docs 2005  20%     21%      9% (RSSC)
* In Russian collection: 3 topics without relevant documents
Relevance Assessments – Best MAP
Task                MAP 2007      MAP 2006      MAP 2005
Monolingual
- against German    0.5051        0.5454        0.4936
- against English   0.3534        0.4576        0.5065
- against Russian   0.1971        0.2542        0.3038
Bilingual
- against German    0.4568 (90%)  0.2448 (45%)  0.4201 (85%)
- against English   0.3341 (95%)  0.3301 (72%)  0.4743 (94%)
- against Russian   0.1348 (68%)  0.1648 (62%)  0.2331 (77%)
Multilingual        0.0884        0.0753        0.0532
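As a reminder of what the MAP figures measure, the sketch below computes average precision per topic and its mean over topics on toy data. It then divides the 2007 bilingual values by the monolingual ones, which appears to be how the percentages in parentheses are derived; that reading is an assumption, although it reproduces the 2007 figures. The function names and document ids are invented for illustration.

```python
def average_precision(ranking, relevant):
    """Average precision of one ranked list against a set of relevant doc ids."""
    if not relevant:
        # Assumption: a topic without relevant documents scores 0.0 here;
        # evaluation tools often leave such topics out instead.
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant)


def mean_average_precision(runs):
    """Mean of the per-topic average precision values."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


# Toy data with made-up document ids (two topics).
toy_runs = [
    (["d1", "d2", "d3", "d4"], {"d1", "d3"}),
    (["d5", "d6"], {"d6"}),
]
toy_map = mean_average_precision(toy_runs)

# Best 2007 MAP values from the table: (monolingual, bilingual) per target language.
best_2007 = {
    "German": (0.5051, 0.4568),
    "English": (0.3534, 0.3341),
    "Russian": (0.1971, 0.1348),
}
ratios = {lang: round(100 * bi / mono) for lang, (mono, bi) in best_2007.items()}
print(toy_map, ratios)
```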
Themes – Retrieval models
• Lucene
• Language Modelling
• Logistic Regression
• Comparison: Vector Space, LM, Probabilistic (Okapi, DFR)
• Data fusion
• Russian
  • word-based vs. N-gram retrieval
  • new light-weight stemmer
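The contrast between word-based and N-gram retrieval comes down to what is indexed: whole words, or overlapping character sequences that need no stemmer. The following sketch shows one common way to produce character n-grams within word boundaries. The function name, the choice of n = 4 and the handling of short words are assumptions, since the slides do not describe the participants' exact settings.

```python
def char_ngrams(text, n=4):
    """Split text into words and return overlapping character n-grams per word.

    Words of length n or shorter are kept whole so that they stay searchable.
    """
    grams = []
    for word in text.lower().split():
        if len(word) <= n:
            grams.append(word)
        else:
            grams.extend(word[i:i + n] for i in range(len(word) - n + 1))
    return grams


print(char_ngrams("retrieval models"))
print(char_ngrams("социология"))
```

Because inflected forms of a word share most of their n-grams, this kind of indexing can stand in for stemming in morphologically rich languages such as Russian.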
Themes – Query Expansion
Entry Vocabulary Modules
• query terms associated with thesaurus terms from documents
Thesaurus Lookup
• combined thesaurus from all CVs
• GIRT Thesaurus Index
Lexical Entailment
• find document terms in relation to query terms
Blind Feedback
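Blind feedback, the last technique listed, treats the top-ranked documents of a first retrieval pass as relevant and adds their most prominent terms to the query. The sketch below uses raw term counts for selection, which is a simplification chosen for illustration; the participants' actual term weighting is not given in the slides, and the function name, parameters and sample documents are hypothetical.

```python
from collections import Counter


def blind_feedback(query_terms, ranked_docs, top_docs=2, new_terms=2):
    """Expand a query with frequent terms from the top-ranked documents.

    ranked_docs is a list of token lists, best document first. The top
    documents are assumed relevant without any human judgement.
    """
    query_set = set(query_terms)
    counts = Counter()
    for tokens in ranked_docs[:top_docs]:
        counts.update(t for t in tokens if t not in query_set)
    # Sort by descending count, then alphabetically, for a deterministic result.
    candidates = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return list(query_terms) + [term for term, _ in candidates[:new_terms]]


# Made-up tokenized documents, best-ranked first.
docs = [
    ["rural", "areas", "agriculture", "policy"],
    ["rural", "agriculture", "subsidies"],
    ["urban", "housing"],
]
print(blind_feedback(["rural"], docs))
```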
Themes – Translation
Lucene plug-in
• Babelfish, Google, PROMT, Reverso
Bilingual thesaurus mapping
Dictionary adaptation
• disambiguate term translation given language context of feedback documents
Statistical machine translation
• MATRAX
Commercial Software
Summary & Outlook
Extension of Russian materials
• Translation table DE-EN-RU for GIRT Thesaurus
• Translation table RU-EN for INION Thesaurus
• Mapping between GIRT – INION Thesaurus
More tools for Terminology mapping
• different relationships (0 T, SYN, BT, NT, RT)
• GESIS-IZ project: > 40 mappings
  • 25 controlled vocabularies / 11 disciplines
  • ~125,000 terms & phrases
  • ~400,000 relations
Domain-Specific Track: http://www.gesis.org/en/research/information_technology/clef_ds_2007.htm
Vocabulary Mappings: http://www.gesis.org/en/research/information_technology/komohe.htm
Email: vivien.petras@gesis.org