Configurable Indexing and Ranking for XML Information Retrieval
Shaorong Liu, Qinghua Zou and Wesley W. Chu
UCLA Computer Science Department
{sliu, zou, wwc}@cs.ucla.edu
July 26, 2004
XML Basics
• A format for defining the syntax and semantics of structured documents
• An XML document is commonly modeled as an ordered labeled tree

<article author="J. Webb">
  <year>2003</year>
  <body>
    <sec>XML retrieval…</sec>
  </body>
  <ref>XML retrieval…</ref>
</article>

[Figure: tree representation of the document above, with element nodes (article, author, year, body, sec, ref) and content nodes (J. Webb, 2003, XML retrieval…)]
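The ordered labeled tree model can be made concrete with a minimal sketch in Python. It uses the standard-library `xml.etree.ElementTree` parser; the `walk` helper and the choice to list attributes as `@name` child nodes are assumptions made for illustration, not part of the slides.

```python
import xml.etree.ElementTree as ET

DOC = """<article author="J. Webb">
  <year>2003</year>
  <body><sec>XML retrieval</sec></body>
  <ref>XML retrieval</ref>
</article>"""

def walk(elem, path=""):
    """Yield (label path, text) pairs in document order.

    Attributes are listed as child nodes named @attr (an illustrative choice).
    """
    here = f"{path}/{elem.tag}"
    yield here, (elem.text or "").strip()
    for name, value in elem.attrib.items():
        yield f"{here}/@{name}", value
    for child in elem:
        yield from walk(child, here)

nodes = list(walk(ET.fromstring(DOC)))
for label_path, text in nodes:
    print(label_path, repr(text))
```

Each printed line is one tree node: the label path identifies the element, and the text is its own content.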
XML Queries: content and structure (CAS)
• Structure: XPath expressions
• Content: about(path, string) functions
  – Specify a certain context, path, to be about a specific content, string
  – Basis for result ranking
• Example
  – /article/body/sec[about(., XML retrieval)]

[Figure: the example document tree (article, author, year, body, sec, ref) against which the query is evaluated]
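As a rough illustration of how a CAS query splits into a structure part and a content part, the sketch below selects candidates with a path expression and then orders them by a naive content score. The `about_score` function (a plain count of query-term occurrences) is a hypothetical stand-in, not the ranking function of the system described in these slides.

```python
import xml.etree.ElementTree as ET

def about_score(elem, query):
    """Count occurrences of the query terms in the element's text, descendants included."""
    tokens = "".join(elem.itertext()).lower().split()
    return sum(tokens.count(term) for term in query.lower().split())

root = ET.fromstring(
    '<article author="J. Webb"><year>2003</year>'
    "<body><sec>XML retrieval basics</sec><sec>Databases</sec></body>"
    "<ref>XML retrieval</ref></article>"
)

# Structure condition /article/body/sec, written relative to the root <article>.
candidates = root.findall("./body/sec")

# Content condition about(., XML retrieval): rank candidates by the toy score.
ranked = sorted(candidates, key=lambda e: about_score(e, "XML retrieval"), reverse=True)
scores = [about_score(e, "XML retrieval") for e in ranked]
```

The structure condition decides which elements are eligible; the `about` condition only affects their order.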
Motivation
[Diagram: CAS queries and XML documents are fed into an XML retrieval system, which returns ranked results]
Text Retrieval
[Diagram: text documents are indexed into indices; a query is run against the indices by the retrieval component, which returns ranked documents]
What’s new in XML retrieval?
Structure!
XML Retrieval Challenges
• Indexing
  – Text retrieval: content information
  – XML retrieval: content + structure information
• Ranking
  – Text retrieval: static document concept [1]
  – XML retrieval: dynamic document concept [1]

[1] Querying and Ranking XML Documents (T. Schlieder and H. Meuss, 2000)
Related Work
• XML query languages
  – XIRQL [Fuhr et al., 2001]
  – XXL [Theobald and Weikum, 2001]
  – Searching XML documents via XML fragments [Carmel et al., 2003]
  – Narrowed Extended XPath I (NEXI) [Trotman et al., 2004]
• XML search engines
  – HyREX [Fuhr et al., 2001]
  – The XXL Search Engine [Theobald and Weikum, 2002]
  – JuruXML [Mass et al., 2003]
  – XSEarch [Cohen et al., 2003]
  – Many others in INEX 02 & INEX 03
Goal: Fully utilize XML structure information to improve retrieval performance!!!
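Our Approach: configurable XML retrieval
[Diagram: XML documents pass through configurable indexing, which produces content indices and structure indices (Ctree); CAS queries are evaluated against these indices by XML retrieval (configurable ranking), which returns ranked elements]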
Roadmap
• Background, Challenges & Related Work
• Our approach: configurable XML retrieval system
• Configurable XML Indexing
• XML Ranking
• Experiments
• Conclusion & Future Work
Why Configurable Indexing?
• Utilize structure information to customize indexing (filtering operations, index types) for different elements.
• Example: applying the same filters (remove stop words, stem) to every element suits running text but damages names

  Before: <sec> … text retrieval… </sec>   <author> J. Webb </author>
  After:  <sec> … text retriev… </sec>     <author> web </author>
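A minimal sketch of element-specific filtering, assuming a per-tag configuration table. The `CONFIG` layout, the stop-word list, and `toy_stem` (a crude suffix stripper standing in for a real stemmer) are all hypothetical and chosen only to make the example self-contained.

```python
STOP_WORDS = {"the", "a", "of", "and", "in"}  # illustrative list, not from the slides

def toy_stem(token):
    """Very rough suffix stripping; a stand-in for a real stemming algorithm."""
    for suffix in ("al", "ing", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

# Hypothetical per-element configuration: which filters apply to which tag.
CONFIG = {
    "sec": {"remove_stop_words": True, "stem": True},
    "author": {"remove_stop_words": False, "stem": False},
}
DEFAULT = {"remove_stop_words": True, "stem": True}

def index_tokens(tag, text):
    """Tokenize text and apply only the filters configured for this element tag."""
    cfg = CONFIG.get(tag, DEFAULT)
    tokens = text.lower().replace(".", " ").split()
    if cfg["remove_stop_words"]:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    if cfg["stem"]:
        tokens = [toy_stem(t) for t in tokens]
    return tokens

sec_tokens = index_tokens("sec", "the text retrieval")
author_tokens = index_tokens("author", "J. Webb")
```

With this configuration the section text is filtered and stemmed, while the author name is indexed as written apart from lowercasing.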
Configurable XML Indexing
• Filtering operation selection
• Index type selection

[Diagram: XML documents are scanned; the index builder applies the index configurations and produces content indices and structure indices]
Building Index
[Figure, three panels: (1) the tree representation of the XML document collection (articles, article, fm, body, year, kwd, sec, with content such as 2003, 2000, XML retrieval…, XML…, Database…); (2) the structure index, a Ctree with groups g1 articles, g2 article, g3 fm, g4 year, g5 kwd, g6 bdy, g7 sec; (3) a content index example: an inverted index]
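The split between a structure index and a content index can be sketched as follows. This is a simplified path-summary structure (elements grouped by label path) plus a plain inverted index; it is an assumption-laden illustration and not the actual Ctree data structure. The sample collection and the `build_indices` helper are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

DOC = """<articles>
  <article><fm><year>2003</year><kwd>XML retrieval</kwd></fm>
    <body><sec>XML indexing</sec><sec>Database systems</sec></body></article>
  <article><fm><year>2000</year></fm><body><sec>XML ranking</sec></body></article>
</articles>"""

def build_indices(root):
    groups = defaultdict(list)   # structure index: label path -> element ids
    postings = defaultdict(set)  # content index: term -> element ids
    next_id = 0

    def visit(elem, path):
        nonlocal next_id
        eid = next_id            # ids are assigned in document (preorder) order
        next_id += 1
        here = f"{path}/{elem.tag}"
        groups[here].append(eid)
        for term in (elem.text or "").lower().split():
            postings[term].add(eid)
        for child in elem:
            visit(child, here)

    visit(root, "")
    return groups, postings

groups, postings = build_indices(ET.fromstring(DOC))

# Elements matching the path /articles/article/body/sec that directly contain "xml".
sec_ids = set(groups["/articles/article/body/sec"])
hits = sorted(sec_ids & postings["xml"])
```

A structure condition is answered from `groups`, a content condition from `postings`, and a CAS query intersects the two.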
Roadmap
• Background, Challenges & Related Work
• Our approach: configurable XML retrieval system
• Configurable XML Indexing
• XML Ranking
  – Weighted term frequency
  – Inverse element frequency
• Experiments
• Conclusion & Future Work
XML Ranking: why weighted term frequency?
• Hierarchical XML: the content of an element e is also considered part of the content of e’s ancestor elements.
• How to estimate an element e’s relevance to a term t?
  – Example: //article[about(., ‘XML’)]

[Figure: an article tree with children fm, body, bm and descendants kwd, sec, ref, title, para; the term "XML" occurs under several different paths]
XML Ranking: weighted term frequency
• Basic idea
  – Terms under different paths of an element e are of different importance.
• Notations
  – A path l = x1/x2/…/xn
  – w(l): weight for a path l
• Formula [equation not preserved], with symbols:
  – e: an element
  – t: a term
  – li: a path under element e and containing t
  – m: # of different paths under element e and containing t
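The symbols e, t, li and m suggest a sum over the m paths under e that contain t. The sketch below assumes the form tf_w(e, t) = Σ_{i=1..m} w(li) · tf(e, li, t), where tf(e, li, t) is the number of occurrences of t under path li. That form is an assumption consistent with the notation, not a formula taken from the slides, and the weights and counts are made up.

```python
def weighted_tf(path_counts, path_weight):
    """Sum of w(l_i) * tf(e, l_i, t) over the paths l_i under e that contain t.

    path_counts maps each path under e to the number of occurrences of t there.
    """
    return sum(path_weight(path) * count for path, count in path_counts.items())

# Hypothetical path weights for paths under an <article> element.
WEIGHTS = {"fm/kwd": 2.0, "body/sec/title": 1.5, "body/sec/para": 1.0, "bm/ref": 0.0}

# Hypothetical occurrences of the term "XML" under each path.
counts = {"fm/kwd": 1, "body/sec/title": 1, "body/sec/para": 3, "bm/ref": 2}

score = weighted_tf(counts, lambda path: WEIGHTS.get(path, 1.0))
plain_tf = sum(counts.values())
```

Under these made-up numbers the keyword and title occurrences count for more than a paragraph occurrence, and occurrences in the references are ignored, which an unweighted count cannot express.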
XML Ranking: how to assign path weight?
• A straightforward method
  – Assign weights to all possible paths
  – Problem: too many combinations!
• Our approach
  – l = x1/x2/…/xn
  – w(x): user-configurable weight for a node x
  – Properties of w(l) = f(w(x1), w(x2), …, w(xn))
    • f(w(x1), …, w(xn)) is monotonically increasing w.r.t. any w(xi) (1 ≤ i ≤ n)
    • f(w(x1), …, w(xn)) = 0 if any w(xi) = 0 (1 ≤ i ≤ n)
  – Example function [equation not preserved]

[Figure: the article tree (fm, body, bm, kwd, sec, ref, title, para) with "XML" occurring under several paths]
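One function with both stated properties, for non-negative node weights, is the product of the node weights along the path: it is zero as soon as any node weight is zero, and it grows with each node weight while the others are positive. The sketch below uses the product as an assumed example; the slides do not preserve the authors' example function, and the node weights here are hypothetical.

```python
from math import prod

# Hypothetical node weights; nodes not listed default to 1.
NODE_WEIGHTS = {"fm": 3.0, "kwd": 2.0, "bm": 0.0}

def path_weight(path, node_weights=NODE_WEIGHTS, default=1.0):
    """w(l) for l = x1/x2/.../xn, computed as the product of the node weights w(xi)."""
    return prod(node_weights.get(node, default) for node in path.split("/"))

w_kwd = path_weight("fm/kwd")          # 3.0 * 2.0
w_para = path_weight("body/sec/para")  # all defaults
w_ref = path_weight("bm/ref")          # zeroed out by bm
```

Configuring one weight per node rather than per path keeps the number of parameters equal to the number of distinct element names.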
XML Ranking: inverse element frequency
• Vector space model
  – Term frequency → weighted term frequency
  – Inverse document frequency → inverse element frequency
• Inverse element frequency (IEF) [equation not preserved], with symbols:
  – q: a content and structure query
  – N1: # of elements satisfying the structure condition in q
  – N2: # of elements that satisfy the structure condition in q and contain term t
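By analogy with inverse document frequency, a natural reading of N1 and N2 is ief(q, t) = log(N1 / N2). The sketch below assumes that form; it is not confirmed by the slides, and the handling of N2 = 0 is a design choice made only for this example.

```python
import math

def ief(n1, n2):
    """Assumed inverse element frequency: log(N1 / N2).

    n1: number of elements satisfying the structure condition of the query
    n2: number of those elements that also contain the term
    """
    if n2 == 0:
        return 0.0  # term absent from all qualifying elements (sketch-only convention)
    return math.log(n1 / n2)

rare = ief(1000, 10)    # term found in 1% of the qualifying elements
common = ief(1000, 1000)  # term found in every qualifying element
```

Because N1 and N2 are counted only over elements that satisfy the structure condition, the same term can receive a different IEF for different queries.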
Roadmap
• Background, Challenges & Related Work
• Our approach: configurable XML retrieval system
• Configurable XML Indexing
• XML Ranking
• Experiments
• Conclusion & Future Work
Experiments: dataset
• INEX (Initiative for the Evaluation of XML Retrieval)
  – Similar to TREC for text retrieval
• Document collection
  – Scientific articles from the IEEE Computer Society, 1995 – 2002
  – About 500 MB
  – Each article consists of about 1,500 XML nodes on average
• Query set: all 30 CAS queries in INEX 03
• Evaluation metric: (exhaustiveness, specificity)
Experimental Setup: index configuration
• Element content statistics
  – # of digit tokens, e.g., ‘1990’
  – # of word tokens, e.g., ‘retrieval’
  – # of mixed tokens, e.g., ‘A1004s’
  – Maximal content length
  – Minimal content length
• Token selection & index type selection

  element   digit#    word#     mixed#   token selection   content index type
  yr        155029    177       26       digit             Number
  atl       5974      1060869   4056     word              Invert
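The statistics above can be gathered with a simple token classifier. The sketch below is a hypothetical illustration: the rule in `suggest_config` (pick the dominant token kind, use a numeric index for digits and an inverted index otherwise) is an assumption that happens to reproduce the two rows of the table, not a documented procedure.

```python
from collections import Counter

def token_kind(token):
    """Classify a token as 'digit', 'word' or 'mixed'."""
    if token.isdigit():
        return "digit"
    if token.isalpha():
        return "word"
    return "mixed"

def content_stats(texts):
    """Count token kinds over all content strings of one element type."""
    return Counter(token_kind(tok) for text in texts for tok in text.split())

def suggest_config(stats):
    """Pick the dominant token kind and a matching index type (assumed rule)."""
    kind = max(("digit", "word", "mixed"), key=lambda k: stats.get(k, 0))
    return kind, ("number" if kind == "digit" else "inverted")

sample = content_stats(["1990", "XML retrieval", "A1004s 2003"])
yr_choice = suggest_config({"digit": 155029, "word": 177, "mixed": 26})
atl_choice = suggest_config({"digit": 5974, "word": 1060869, "mixed": 4056})
```

Fed with the counts from the table, the rule selects digit tokens with a numeric index for yr and word tokens with an inverted index for atl.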
Experimental Setup: index configuration (continued)
• Element content statistics: minimal and maximal content length
• Stop word removal

  element   min content length   max content length   remove stop words?
  p         1                    22767                Yes
  fnm       1                    123                  No
Experimental Setup: weight configuration
[Figure: tree representation of the schema for the dataset: article with children fm, bdy, bm; lower levels include tig, atl, abs, kwd, sec, st, ss1, …; nodes are annotated with weights]

Weight configurations A & B (nodes indexed but not listed below have the default weight 1):

       fm   bdy   bm   atl   abs   kwd   st
  A    3    1     1    3     1     2     3
  B    5    1     0    5     1     3     5
Experimental Setup: weight configuration
• Example 1: //article[about(., XML retrieval)]
• Example 2: //vita[about(., XML retrieval)]

[Figure: an article tree with bdy/sec/p and bm/vita/p, where "XML retrieval..." occurs in a p element]
Experimental Results: run 1
• CAS Topic 65: //article[./fm//yr > '1998' AND about(./, '"image retrieval"')]
• Strict quantization
• High precision at low recall regions
• Adjusting weights properly improves retrieval performance

[Plot: results for Topic 65 under strict quantization; the values 0.3 and 0.55 are marked]
Experimental Results: run 2
• All 30 CAS topics with weight configuration B
• Avg. precision: 0.3309 (strict quantization)

[Plot: precision (0 to 1) versus recall (0 to 1), strict quantization]
Roadmap
• Background, Challenges & Related Work
• Our approach: configurable XML retrieval system
• Configurable XML Indexing
• XML Ranking
• Experiments
• Conclusion & Future Work
Conclusion
• A configurable XML retrieval system that fully utilizes XML structures to improve retrieval performance
  – Element-specific index configurations
  – Configurable XML ranking
    • Weighted term frequency
    • Inverse element frequency
• Experimental results
  – High precision at low recall regions
  – Adjusting weights properly improves retrieval performance
• Future work
  – Automate index configurations
  – Optimize weight configurations
Acknowledgement
• The Initiative for the Evaluation of XML Retrieval (INEX)
  – http://qmir.dcs.qmw.ac.uk/inex/

Thank You! Questions?