Basic IR: Queries
- A query is a statement of the user’s information need.
- The index is designed to map queries to documents that are likely to be relevant.
- Query type, content, and representation dictate what the index must do.
- Queries vary from single keywords through specialized query languages to exemplar documents.
Documents as Queries
- “Find other documents like this one.”
- The query is itself a document; it can go through the same sort of pre-processing (e.g., stop word removal, stemming).
- Characteristics of queries mimic those of documents.
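A minimal sketch of running a query and a document through the same pre-processing. The stop list and the suffix-stripping rule are hypothetical stand-ins chosen for illustration; a real system would use a fuller stop list and a proper stemmer.

```python
# Hypothetical stop list and suffix rules, for illustration only.
STOP_WORDS = {"a", "an", "the", "of", "to", "and", "in", "is"}
SUFFIXES = ("ing", "ed", "s")

def crude_stem(word):
    """Strip one common suffix; a stand-in for a real stemming algorithm."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    """Lowercase, split on whitespace, drop stop words, then stem."""
    tokens = [t.strip(".,;:!?\"'()") for t in text.lower().split()]
    return [crude_stem(t) for t in tokens if t and t not in STOP_WORDS]

# The query and the document pass through the identical pipeline.
query_terms = preprocess("Indexing the documents")
doc_terms = preprocess("The indexed document")
```

Because both sides are normalized the same way, surface differences such as "indexing" versus "indexed" no longer prevent a match.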
Query Term Distribution in SavvySearch
Keyword Queries
- The query is composed of a set of keywords.
- Retrieve the documents that best match the keywords.
- Advantage: easy to use, supports fast indexing.
- Disadvantage: coarse, easy to lead astray (e.g., words with multiple meanings), difficult to express a complex information need.
Boolean Queries
- Combine queries (keywords) with Boolean operators:
  - OR: “children OR kids”
  - AND: “windows AND software”
  - BUT: “unix BUT solaris” (set difference: unix but not solaris); there is no stand-alone NOT!
- Advantage: more precise queries.
- Disadvantage: does not support ranking, less intuitive.
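The three operators map directly onto set operations over an inverted index. The sketch below assumes a toy in-memory index and treats BUT as set difference; the document collection and the helper names are made up for illustration.

```python
from collections import defaultdict

def build_index(docs):
    """Map each lowercase term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def postings(index, term):
    """Document ids for a term; empty set if the term is unseen."""
    return index.get(term, set())

# Hypothetical toy collection.
docs = {
    1: "windows software for children",
    2: "unix software",
    3: "solaris unix kids",
    4: "kids games",
}
index = build_index(docs)

children_or_kids = postings(index, "children") | postings(index, "kids")
windows_and_software = postings(index, "windows") & postings(index, "software")
# BUT is assumed to mean "first operand, minus documents matching the second".
unix_but_solaris = postings(index, "unix") - postings(index, "solaris")
```

Each result is an unordered set of document ids, which illustrates the stated disadvantage: a pure Boolean match says which documents qualify but gives no ranking among them.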
Phrase Search
- Supplement single terms with phrases: an exact sequence of terms.
- Requires an index that tracks the proximity of terms or stores both singletons and phrases.
- An extension is context, where the proximity between terms is stated (e.g., “adele w/2 howe” to CiteSeer).
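One way to track proximity is a positional index that records where each term occurs in each document. The sketch below is an assumption-laden illustration: it reads "w/k" as "the second term appears at most k positions after the first", which may not match the exact semantics CiteSeer used, and the function names are hypothetical.

```python
from collections import defaultdict

def build_positional_index(docs):
    """Map term -> doc id -> list of token positions."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term][doc_id].append(pos)
    return index

def within(index, first, second, k):
    """Doc ids where `second` occurs 1..k positions after `first` (assumed semantics)."""
    first_postings = index.get(first, {})
    second_postings = index.get(second, {})
    hits = set()
    for doc_id in first_postings.keys() & second_postings.keys():
        second_positions = set(second_postings[doc_id])
        if any(p + d in second_positions
               for p in first_postings[doc_id]
               for d in range(1, k + 1)):
            hits.add(doc_id)
    return hits

def phrase(index, first, second):
    """A two-term exact phrase is the special case k = 1."""
    return within(index, first, second, 1)

# Hypothetical toy collection.
docs = {
    1: "adele howe",
    2: "adele e howe",
    3: "howe adele",
    4: "adele and then later howe",
}
pos_index = build_positional_index(docs)
near_hits = within(pos_index, "adele", "howe", 2)
phrase_hits = phrase(pos_index, "adele", "howe")
```

Storing positions costs more space than a plain term-to-document index, which is the trade-off against the alternative of indexing phrases as their own entries.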
Query Language: Web Search Engines I
- AltaVista Advanced Search:
  - Form:
    - all of these words
    - this exact phrase
    - any of these words
    - none of these words
  - Boolean Expression:
    - AND
    - OR
    - AND NOT
    - NEAR
  - Date, File Type, Location
Query Language: Web Search Engines II
- Google Advanced Search:
  - Find Results:
    - with all of the words
    - with the exact phrase
    - with at least one of the words
    - without the words
  - Language, File Format, Date, Occurrences, Domain, SafeSearch
  - Page-Specific Search:
    - Find pages similar to the page
    - Find pages that link to the page
  - Topic Specific Searches
Typical Query Behavior on WWW
- Query term distribution obeys Zipf’s Law (quite skewed, although the skew does drift).
- Length is ? terms.
- Few users exploit the full power of query languages; most enter terms without operators and do not use advanced search interfaces.
- Change in behavior?
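For reference, Zipf’s Law in its usual form relates the frequency of a term to its rank in the frequency-ordered list. The constants C and s below are generic symbols introduced here, not values from these slides; s is typically close to 1 for natural-language text.

```latex
f(r) \approx \frac{C}{r^{s}}, \qquad s \approx 1
\quad\Longrightarrow\quad
\frac{f(1)}{f(10)} \approx 10^{s} \approx 10
```

Under this approximation, the most frequent term occurs roughly ten times as often as the tenth most frequent one, which is the sense in which the distribution is "quite skewed".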
Natural Language
- Augmented Boolean approach:
  - Treat the query as a document.
  - Rank documents by how well they match the constraints of the query and return those above a certain threshold.
- NLP approach:
  - Interpret semantics in a limited way to constrain the query (e.g., “who” indicates a person).
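The augmented Boolean idea can be sketched as scoring each document by how many query constraints it satisfies, then cutting off at a threshold. The scoring function here (fraction of query terms present) is an assumption picked for simplicity; the slides do not specify one.

```python
def rank(query, docs, threshold):
    """Return (doc_id, score) pairs at or above threshold, best first."""
    query_terms = set(query.lower().split())
    if not query_terms:
        return []
    scored = []
    for doc_id, text in docs.items():
        doc_terms = set(text.lower().split())
        # Assumed score: fraction of query terms the document contains.
        score = len(query_terms & doc_terms) / len(query_terms)
        if score >= threshold:
            scored.append((doc_id, score))
    # Sort by descending score, breaking ties by document id.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored

# Hypothetical toy collection.
docs = {
    "a": "cheap flights to denver",
    "b": "denver hotels",
    "c": "cooking recipes",
}
loose = rank("cheap denver flights", docs, 0.3)
strict = rank("cheap denver flights", docs, 0.5)
```

Unlike a strict AND, a document matching only some terms can still be returned, just lower in the ranking.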
NL Example: AskJeeves
Advanced Querying
- Pattern Matching:
  - combinations of syntactic features, e.g., regular expressions, wild-card queries
- Structural Queries:
  - forms, hypertext and hierarchies
  - typically support iterative querying as in guided browsing (e.g., WebGlimpse or Letizia)
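As a small illustration of pattern matching, wild-card and regular-expression queries can be expanded against the index vocabulary before lookup. The vocabulary and helper names below are hypothetical; the matching itself uses Python's standard `fnmatch` and `re` modules.

```python
import fnmatch
import re

# Hypothetical index vocabulary.
vocabulary = ["index", "indexes", "indexing", "query", "queries", "retrieval"]

def wildcard(pattern, terms):
    """Terms matching a shell-style wild-card pattern such as 'index*'."""
    return [t for t in terms if fnmatch.fnmatchcase(t, pattern)]

def regex(pattern, terms):
    """Terms that match the regular expression in full."""
    compiled = re.compile(pattern)
    return [t for t in terms if compiled.fullmatch(t)]

index_forms = wildcard("index*", vocabulary)
query_forms = regex(r"quer(y|ies)", vocabulary)
```

The expanded term list can then be treated as an OR over ordinary keyword lookups.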
Advanced Querying: Letizia
- Recommends new pages based on the user’s browsing preferences.
- Infers interests by observing user behavior: save bookmark, follow link, spend time on page…
- Models documents as lists of keywords.
Figure from http://lieber.www.media.mit.edu/people/lieber/Liebrary/Letizia.html
Caching
- An astonishing number of people submit the same queries (e.g., “Harry Potter”).
- Just as single word usage is skewed (Zipf’s Law), so is query submission on the WWW.
- Can exploit this by caching results for oft-repeated queries.
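A minimal sketch of result caching, assuming a least-recently-used policy via the standard library's `functools.lru_cache`. The `run_search` backend is a hypothetical stand-in for an expensive index lookup, and the cache size is arbitrary.

```python
from functools import lru_cache

# Counts how often the (hypothetical) expensive backend is actually invoked.
CALLS = {"backend": 0}

def run_search(query):
    """Hypothetical stand-in for an expensive index lookup."""
    CALLS["backend"] += 1
    return (f"results for {query}",)

@lru_cache(maxsize=1000)  # arbitrary size, chosen for illustration
def cached_search(normalized_query):
    return run_search(normalized_query)

def search(query):
    # Normalize case and whitespace so trivial variants share one cache entry.
    return cached_search(" ".join(query.lower().split()))

for q in ["Harry Potter", "harry potter", "zipf law", "Harry  Potter"]:
    search(q)
```

With a skewed query distribution, a small cache of the most popular queries can absorb a large share of the traffic; here four submitted queries reach the backend only twice.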
Single vs. On-Going Queries: Filtering
- Find new documents from an information stream that satisfy a static information need.
- A user profile represents interests; a threshold represents how closely documents must match.
- The user may provide the query, or it may be learned through relevance feedback.
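Filtering inverts the usual setup: the profiles are fixed and the documents arrive one at a time. The sketch below assumes each profile is a set of keywords and that the match score is the fraction of profile terms found in the document; both choices are illustrative, and learning a profile through relevance feedback is not shown.

```python
def filter_stream(stream, profiles, threshold):
    """Yield (user, doc) for each incoming doc that matches a user's profile."""
    for doc in stream:
        doc_terms = set(doc.lower().split())
        for user, profile_terms in profiles.items():
            if not profile_terms:
                continue  # an empty profile matches nothing
            # Assumed score: fraction of profile terms present in the document.
            score = len(profile_terms & doc_terms) / len(profile_terms)
            if score >= threshold:
                yield user, doc

# Hypothetical profiles and document stream.
profiles = {
    "user1": {"unix", "kernel"},
    "user2": {"cooking", "recipes", "pasta"},
}
stream = [
    "new unix kernel released",
    "pasta recipes for spring",
    "stock market news",
]
delivered = list(filter_stream(stream, profiles, 0.6))
```

Raising the threshold makes delivery stricter: at 1.0 only documents containing every profile term get through.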
Filtering Process
[Figure, from the MIR text: a documents stream is matched against the User 1 profile and the User 2 profile, producing docs filtered for User 1 and docs filtered for User 2.]