Towards Unifying Database Systems and Information Retrieval Systems

Jayavel Shanmugasundaram, Cornell University

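10,000-Foot View of Data Management
[Diagram. Queries axis: ranked keyword search to complex and structured. Data axis: unstructured to structured.]
• Information retrieval systems: ranked keyword search over unstructured data
• Database systems: complex and structured queries over structured data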

10,000-Foot View of Data Management
[Same diagram, with two added labels between information retrieval systems and database systems.]
• Text search in databases
• Ranking based on structured values

Case Study: Internet Archive

Internet Archive Database
Movies
Mid | Name | Description
10 | Amateur Film | … they stand on the golden gate bridge and …
20 | American Thrift | … golden gate bridge with statue of liberty …
… | … | …

SELECT * FROM Movies M
ORDER BY score(M.description, “golden gate”)
FETCH TOP 10 RESULTS ONLY

Main Issue
• Traditional IR ranking methods would rank the two movies about the same
• Example: TF-IDF
  – “Golden Gate” appears exactly once in both descriptions
  – The text fields are about the same length
  – Hence: same normalized TF-IDF score
• Larger issue: traditional IR scoring methods were developed for stand-alone document collections

Internet Archive Database
Movies
Mid | Name | Description
10 | Amateur Film | … they stand on the golden gate bridge and …
20 | American Thrift | … golden gate bridge with statue of liberty …
… | … | …

Reviews
Rid | Mid | Name | Rating
901 | 10 | bleblanc | 2
902 | 10 | harry | 1
903 | 20 | cooker | 4
904 | 20 | alice | 5
… | … | … | …

Statistics
Sid | Mid | Visits | Downloads
81 | 10 | 285 | 90
82 | 20 | 927 | 247
… | … | … | …

Structured Value Ranking (SVR)

Structured Value Ranking (Guo et al., 2005)
• Use structured data values associated with text columns to score results
• Main technical challenge
  – Structured data values (and hence scores) change frequently and possibly dramatically!
    • Number of visits, downloads, award announcements
    • “Slashdot effect”
    • Bursts and rapidly changing popularity [Kleinberg]
  – Users still want to see results ordered by latest score values
• Current focus: design efficient inverted lists

System Architecture
[Architecture diagram.]
• Components: Relational Query Engine; Text Management Component; Text Query Engine; novel indices for using SVR scores; relational tables and indices; materialized views; B+-trees; RDBMS
• Labeled flows: SQL/MM; Create Text Index; SQL specification of SVR scores; keyword query; relational sub-query; results & scores; results

Index Operations
• Document score updates
  – Handle frequent updates to scores
• Top-k keyword queries
  – Conjunctive and disjunctive keyword queries
  – Include IR-style (TF-IDF) scores
  – Top-k query results
• Content updates, insertions and deletions
  – Update to document content
  – Document insertions and deletions

Naïve Approach 1: ID Method
Inverted List (ordered by Id)
golden: 10, 12, 18, 21, 34, …
gate: 11, 13, 18, 34, 39, …
Score Table
Id | Score
1 | 70.85
2 | 91.86
3 | 12.34
… | …
• Score updates: efficient (just update score table)
• Top-k queries: inefficient (scan all of inverted list)
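
A minimal Python sketch of the ID Method tradeoff, for illustration only. The inverted-list ids come from the slide; the score values, the function names `update_score` and `top_k`, and the restriction to conjunctive queries are assumptions of this sketch, not part of the original system.

```python
import heapq

# Inverted lists ordered by document id (ids taken from the slide).
inverted = {
    "golden": [10, 12, 18, 21, 34],
    "gate": [11, 13, 18, 34, 39],
}

# Score table keyed by id. These scores are made-up toy values.
score = {10: 70.85, 11: 15.0, 12: 91.86, 13: 40.0,
         18: 12.34, 21: 55.0, 34: 60.0, 39: 33.0}


def update_score(doc_id, new_score):
    """Score update: a single write to the score table, no list is touched."""
    score[doc_id] = new_score


def top_k(terms, k):
    """Conjunctive top-k query: every posting of every term is read,
    because id order says nothing about which documents score highest."""
    candidates = set(inverted[terms[0]])
    for term in terms[1:]:
        candidates &= set(inverted[term])
    return heapq.nlargest(k, candidates, key=lambda d: score[d])
```

The update is a constant-time dictionary write, while the query cost grows with the full length of the inverted lists, which is the imbalance the slide points out.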

Naïve Approach 2: Score Method
Inverted List (ordered by Score), each entry Id (Score)
golden: 156 (98.32), 12 (90.19), 89 (79.52), 54 (77.79), …
gate: 176 (97.19), 12 (90.19), 64 (89.55), 4 (84.63), …
• Top-k queries: efficient (top part of inverted list)
• Score updates: inefficient (reorganize many lists)
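
A minimal Python sketch of the Score Method, using the list contents shown on the slide. The class name `ScoreOrderedIndex` and its methods are hypothetical, and only single-keyword queries are modeled; the point is only that one score update has to reposition the document in every list that contains it.

```python
import bisect


class ScoreOrderedIndex:
    """Toy index whose inverted lists are kept in descending score order."""

    def __init__(self, postings, scores):
        self.score = dict(scores)
        # Store (-score, doc) so that ascending sort order puts the best first.
        self.lists = {
            term: sorted((-scores[d], d) for d in docs)
            for term, docs in postings.items()
        }

    def top_k(self, term, k):
        """Top-k query: read only the first k postings."""
        return [doc for _, doc in self.lists[term][:k]]

    def update(self, doc, new_score):
        """Score update: reposition doc in every list that contains it.
        Returns the number of inverted lists that were modified."""
        old = (-self.score[doc], doc)
        touched = 0
        for plist in self.lists.values():
            i = bisect.bisect_left(plist, old)
            if i < len(plist) and plist[i] == old:
                del plist[i]
                bisect.insort(plist, (-new_score, doc))
                touched += 1
        self.score[doc] = new_score
        return touched
```

In this toy version a document appears in two lists at most; in a real collection a document contains many terms, so the number of lists touched per update is correspondingly larger.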

Dilemma
• Want inverted lists ordered by score
  – For top-k query performance
  – Like in Score Method
• But do not want to touch inverted lists for every score update
  – For score update performance
  – Like in ID Method
• How can we address this apparent dilemma?

Score-Threshold Method
• Extends Score Method in two key aspects
1) Allow inverted list scores to be out-of-date by up to a threshold
  – Avoids having to frequently update inverted list
    • Better score update performance
  – Need to scan more of inverted list (by up to a threshold) to correct for out-of-date score
    • Slightly reduced query performance
2) Use “short” inverted list for scores that exceed threshold
  – More efficient than updating large inverted list

Score-Threshold Method
Example with Threshold = 10. Long inverted lists (ordered by Score), each entry Id (Score):
golden: 156 (98.32), 12 (90.19), 89 (79.52), …
gate: 176 (97.19), 12 (90.19), 64 (89.55), …
Step 1. Initial state, before Doc 12 receives new score 95
  – Score Table (Id | Score): 1 | 70.85; …; 12 | 90.19; …
  – ListScore Table (Id | Score | InShortList): empty
  – Short lists: empty
Step 2. Doc 12 new score: 95 (does not exceed list score 90.19 + threshold 10)
  – Score Table: 12 | 95.00
  – ListScore Table: 12 | 90.19 | false
  – Inverted lists unchanged
Step 3. Doc 12 new score: 105 (exceeds list score 90.19 + threshold 10)
  – Score Table: 12 | 105.0
  – Short lists: golden gets 12 (105.0); gate gets 12 (105.0)
  – ListScore Table: 12 | 90.19 | false, then updated to 12 | 105.0 | true
  – Long inverted lists unchanged
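
The update sequence above can be replayed with a small Python sketch. This is an illustration under stated assumptions, not the authors' implementation: the class and method names are hypothetical, only single-keyword queries are modeled, only score increases beyond the threshold move a document to the short list, and the query stopping rule (stop once the next list score plus the threshold cannot beat the current k-th result) is one reading of "scan more of inverted list (by up to a threshold)".

```python
import heapq


class ScoreThresholdIndex:
    """Toy sketch of the Score-Threshold idea (all names are illustrative)."""

    def __init__(self, postings, scores, threshold):
        self.threshold = threshold
        self.score = dict(scores)      # Score table: always current
        self.list_score = {}           # ListScore table: id -> (score, in_short_list)
        self.terms_of = {}
        for term, docs in postings.items():
            for d in docs:
                self.terms_of.setdefault(d, set()).add(term)
        # Long lists are ordered by score at build time and never modified.
        self.long = {t: sorted(((scores[d], d) for d in docs), reverse=True)
                     for t, docs in postings.items()}
        self.short = {t: {} for t in postings}   # term -> {doc: list score}

    def update(self, doc, new_score):
        """Apply a score update; return True if a short list was touched."""
        listed, _ = self.list_score.setdefault(doc, (self.score[doc], False))
        self.score[doc] = new_score
        if new_score <= listed + self.threshold:
            return False               # within threshold: Score table only
        for term in self.terms_of.get(doc, ()):
            self.short[term][doc] = new_score
        self.list_score[doc] = (new_score, True)
        return True

    def top_k(self, term, k):
        """Return the k best (doc, current score) pairs for one keyword."""
        if k <= 0:
            return []
        short = sorted(((s, d) for d, s in self.short[term].items()), reverse=True)
        best, seen = [], set()         # best is a min-heap of (score, doc)
        for listed, doc in heapq.merge(short, self.long[term], reverse=True):
            # No unseen document can exceed listed + threshold.
            if len(best) == k and listed + self.threshold <= best[0][0]:
                break
            if doc in seen:            # stale long-list posting of a short-list doc
                continue
            seen.add(doc)
            heapq.heappush(best, (self.score[doc], doc))
            if len(best) > k:
                heapq.heappop(best)
        return [(d, s) for s, d in sorted(best, reverse=True)]
```

Built from the slide's lists with `threshold=10`, `update(12, 95.0)` changes only the Score table, while `update(12, 105.0)` adds document 12 to the short lists for both keywords and leaves the long lists as they were.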

Query-Update Tradeoff
• Choice of threshold function
• If threshold(score) = 0
  – Every update results in update to inverted list
  – Similar to Score Method
• If threshold(score) = infinity
  – No inverted list update, but scan all of list
  – Similar to ID Method
• Can control query-update tradeoff using threshold function
  – threshold(score) = r * score, r >= 0
  – r: threshold ratio
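
A small Python sketch of how the threshold ratio r controls the number of inverted-list updates. The function names are hypothetical, scores are assumed to be positive, the threshold is assumed to be computed from the score currently stored in the list, and only score increases are considered.

```python
import math


def needs_list_update(list_score, new_score, r):
    """True if new_score exceeds the list score by more than r * list_score."""
    return new_score > list_score + r * list_score


def count_list_updates(initial_score, new_scores, r):
    """Count how many updates in a sequence touch an inverted list."""
    list_score = initial_score
    touched = 0
    for s in new_scores:
        if needs_list_update(list_score, s, r):
            list_score = s          # the list now reflects the new score
            touched += 1
    return touched


# r = 0 behaves like the Score Method, r = infinity like the ID Method.
for r in (0.0, 0.1, math.inf):
    print(r, count_list_updates(90.19, [95.0, 105.0, 120.0], r))
```

The cost that does not appear in this count is on the query side: a larger r means a query has to read further down the list before it can stop.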

Experimental Setup
• Two primary performance metrics
  – Time for a score update
  – Time for a top-k query
• Data sets
  – Real (Internet Archive): 60 MB
    • Thanks to Brewster Kahle and Jon Aizen
  – Synthetic: 805 MB
• Compared alternatives
  – Implemented in C++ on top of Berkeley DB
  – 2.7 GHz processor, 1 GB memory

Varying # Updates
[Chart: times in milliseconds]


XML Keyword Search
• Example applications
  – Accident reports, Shakespeare’s plays
• XRank: Keyword search over semi-structured XML documents
  – Extends keyword search to work over both structured and unstructured data
  – SIGMOD 2003 [Guo, Shao, Botev, Shanmugasundaram]


Towards Unifying DB and IR
• Example applications
  – Content management, web querying
• TeXQuery: Query language for structured and unstructured data, structured and keyword queries
  – Precursor to W3C XQuery Full-Text
  – WWW 2004 [Amer-Yahia, Botev, Shanmugasundaram]

Related Work
• Integrating DB and IR systems
  – For the most part, treat individual systems as “black boxes”
  – Our goal is to unify DB and IR systems
• Search over semi-structured data
  – Specialized techniques for searching semi-structured data
  – Our goal is to generalize DB and IR techniques
• Keyword search and ranking in databases
  – BANKS, DBXplorer, DISCOVER

Summary
• Many emerging applications require a unification of DB and IR techniques
  – E-commerce, content management, …
• Argues for a new generation of systems and techniques that seamlessly provide this capability
  – SVR, XRank, TeXQuery, …
• Educational benefit: present unified view of data management
  – Currently at graduate level
  – Eventually introduce concepts at undergraduate level