Query Models
• Use
• Types
• What do search engines do

Slides: 52

What we have covered
• What is IR
• Web crawling
• Evaluation
• Tokenization and properties of text
This time: Query models

A Typical Web Search Engine
[Figure: architecture diagram with components Users, Interface, Query Engine, Index, Indexer, Crawler, Web]

Online vs offline processing
[Figure: search engine architecture (Users, Interface, Query Engine, Index, Indexer, Crawler, Web) divided into on-line and off-line parts]

A Typical Web Search Engine
[Figure: architecture diagram with components Users, Interface, Query Engine, Index, Indexer, Crawler, Web, with the label Queries added]

Why the interest in Queries?
• Queries are ways we interact with IR systems
  – Expression of an information need
• Nonquery methods?
• Types of queries?

Issues with Query Structures
Matching and ranking criteria
• Given a query, what documents are retrieved?
• In what order (rank)?

Types of Query Structures
Query Models (languages) – most common
• Boolean Queries
• Extended-Boolean Queries
  – Vector space Boolean
• Vector queries
• Natural Language Queries
• Others?

Databases vs. IR
• What we’re retrieving
  – Databases: Structured data. Clear semantics based on a formal model.
  – IR: Mostly unstructured. Free text with some metadata.
• Queries we’re posing
  – Databases: Formally (mathematically) defined queries. Unambiguous.
  – IR: Vague, imprecise information needs (often expressed in natural language).
• Results we get
  – Databases: Exact. Always correct in a formal sense.
  – IR: Sometimes relevant, often not.
• Interaction with system
  – Databases: One-shot queries.
  – IR: Interaction is important.
• Other issues
  – Databases: Concurrency, recovery, atomicity are all critical.
  – IR: Issues downplayed.

Simple query language: Boolean
– Earliest query model
– Terms + Connectors (or operators)
– terms
  • words
  • normalized (stemmed) words
  • phrases
  • thesaurus terms
– connectors
  • AND
  • OR
  • NOT
– Ex: Beethoven AND sonata

Truth Tables – Boolean Logic
Presence of P, P = 1
Absence of P, P = 0
True = 1
False = 0
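The 0/1 convention above can be tabulated directly. This is a minimal sketch; the helper name `truth_table` is made up for illustration and is not from the slides.

```python
from itertools import product


def truth_table(op):
    """Return rows (p, q, result) for every 0/1 combination of p and q."""
    return [(p, q, int(op(bool(p), bool(q)))) for p, q in product((0, 1), repeat=2)]


AND_TABLE = truth_table(lambda p, q: p and q)
OR_TABLE = truth_table(lambda p, q: p or q)
NOT_TABLE = [(p, int(not p)) for p in (0, 1)]

# Print the two binary tables in the slide's 1 = present/true, 0 = absent/false form.
for name, table in (("AND", AND_TABLE), ("OR", OR_TABLE)):
    print(f"P Q | P {name} Q")
    for p, q, result in table:
        print(f"{p} {q} | {result}")
```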

Problems with Boolean Queries
• Ranking?
• Incorrect interpretation of Boolean connectives AND and OR
• Example - Seeking Saturday entertainment
Queries:
• Dinner AND sports AND symphony
• Dinner OR sports OR symphony
• Dinner AND sports OR symphony

Order of precedence of operators
Example of query. Is
• A AND B
the same as
• B AND A
• Why?

Sample Boolean Queries
• Cat OR Dog
• Cat AND Dog
• (Cat AND Dog) OR Collar
• (Cat AND Dog) OR (Collar AND Leash)
• (Cat OR Dog) AND (Collar OR Leash)

Satisfaction of Boolean Query
• (Cat OR Dog) AND (Collar OR Leash)
  – Each of the following column combinations works:
[Table: columns Cat, Dog, Collar, Leash, with x marking the terms present in each satisfying combination]
• Others?

Satisfaction of Boolean Query
• (Cat OR Dog) AND (Collar OR Leash)
  – None of the following column combinations work:
[Table: columns Cat, Dog, Collar, Leash, with x marking the terms present in each failing combination]
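Which term combinations satisfy (Cat OR Dog) AND (Collar OR Leash) can be checked by brute force over all 16 presence/absence patterns. A small sketch; the names `matches`, `satisfying` and `failing` are illustrative only.

```python
from itertools import product

TERMS = ("cat", "dog", "collar", "leash")


def matches(present):
    """Evaluate (Cat OR Dog) AND (Collar OR Leash) for a set of present terms."""
    return ("cat" in present or "dog" in present) and (
        "collar" in present or "leash" in present
    )


# Every possible subset of the four terms (2**4 = 16 combinations).
all_combinations = [
    {term for term, flag in zip(TERMS, flags) if flag}
    for flags in product((False, True), repeat=len(TERMS))
]

satisfying = [combo for combo in all_combinations if matches(combo)]
failing = [combo for combo in all_combinations if not matches(combo)]

for combo in satisfying:
    print("works:", sorted(combo))
for combo in failing:
    print("fails:", sorted(combo))
```

A combination works only if it contains at least one animal term and at least one accessory term, so {cat, dog} alone fails while {dog, leash} works.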

Boolean Logic
[Figure: diagram of sets A and B]

Order of Precedence
– Define order of precedence
  • EX: a OR b AND c
– Infix notation
  • Parentheses are evaluated first, with left-to-right precedence of operators
  • Next, NOTs are applied
  • Then ANDs
  • Then ORs
– a OR b AND c becomes
– a OR (b AND c)
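The precedence rules above (parentheses, then NOT, then AND, then OR) can be sketched as a small recursive-descent parser. This is an illustration under assumptions: the tokenizer, the nested-tuple output format, and the function names are invented here, and malformed queries are not handled.

```python
import re


def tokenize(query):
    """Split a query into parentheses and words (operators are plain words)."""
    return re.findall(r"\(|\)|\w+", query)


def parse(query):
    """Parse an infix Boolean query into nested tuples, binding NOT > AND > OR."""
    tokens = tokenize(query)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        token = tokens[pos]
        pos += 1
        return token

    def parse_or():  # lowest precedence
        node = parse_and()
        while peek() == "OR":
            advance()
            node = ("OR", node, parse_and())
        return node

    def parse_and():
        node = parse_not()
        while peek() == "AND":
            advance()
            node = ("AND", node, parse_not())
        return node

    def parse_not():  # unary prefix operator
        if peek() == "NOT":
            advance()
            return ("NOT", parse_not())
        return parse_atom()

    def parse_atom():  # a term or a parenthesized sub-query
        token = advance()
        if token == "(":
            node = parse_or()
            advance()  # consume the closing ")"
            return node
        return token

    return parse_or()


print(parse("a OR b AND c"))
print(parse("(a OR b) AND c"))
```

With these rules, a OR b AND c groups as a OR (b AND c); the explicit parentheses in the second query are needed to get the other reading.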

Infix Notation
– Usually expressed as INFIX operators in IR
  • ((a AND b) OR (c AND b))
– NOT is UNARY PREFIX operator
  • ((a AND b) OR (c AND (NOT b)))
– AND and OR can be n-ary operators
  • (a AND b AND c AND d)
– Some rules - (De Morgan revisited)
  • NOT(a) AND NOT(b) = NOT(a OR b)
  • NOT(a) OR NOT(b) = NOT(a AND b)
  • NOT(NOT(a)) = a
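The three rewrite rules can be verified exhaustively, both on truth values and on result sets, where NOT is the complement within the collection. A minimal sketch; the document IDs are made up.

```python
from itertools import product


def de_morgan_rules_hold():
    """Check the three rewrite rules for every truth assignment of a and b."""
    for a, b in product((False, True), repeat=2):
        rule_1 = ((not a) and (not b)) == (not (a or b))
        rule_2 = ((not a) or (not b)) == (not (a and b))
        rule_3 = (not (not a)) == a
        if not (rule_1 and rule_2 and rule_3):
            return False
    return True


# The same rules on result sets (hypothetical document IDs).
collection = {1, 2, 3, 4, 5, 6}
docs_a = {1, 2, 3}
docs_b = {3, 4}
not_a = collection - docs_a
not_b = collection - docs_b

print(de_morgan_rules_hold())
print(not_a & not_b == collection - (docs_a | docs_b))
print(not_a | not_b == collection - (docs_a & docs_b))
```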

DNFs and CNFs
All queries can be rewritten as
– Disjunctive Normal Forms (DNFs)
– Conjunctive Normal Forms (CNFs)
• DNF Constituents:
  – Terms (words or phrases)
  – Conjuncts (terms joined by ANDs)
  – Disjuncts (conjuncts joined by ORs)
  – Ex: (A AND B) OR (A AND NOT C)
• CNF Constituents:
  – Terms (words or phrases)
  – Disjuncts (terms joined by ORs)
  – Conjuncts (disjuncts joined by ANDs)
  – Ex: (A OR B) AND (A OR NOT C)
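One way to see that any query has a DNF is to read it off the truth table: each satisfying row becomes one conjunct. The sketch below does this for the CNF example above. It produces the canonical (unsimplified) DNF, not a minimal one, and the function name `to_dnf` is illustrative.

```python
from itertools import product


def to_dnf(variables, predicate):
    """Build the canonical DNF: one AND-conjunct per satisfying truth assignment."""
    conjuncts = []
    for values in product((True, False), repeat=len(variables)):
        if predicate(*values):
            literals = [
                name if value else f"NOT {name}"
                for name, value in zip(variables, values)
            ]
            conjuncts.append("(" + " AND ".join(literals) + ")")
    return " OR ".join(conjuncts)


# The CNF example from the slide: (A OR B) AND (A OR NOT C)
dnf = to_dnf(("A", "B", "C"), lambda a, b, c: (a or b) and (a or not c))
print(dnf)
```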

Effect of CNFs
• All complex Boolean queries can be simplified
• Why do reference librarians like CNFs?
• AND’s reduce the size of the set returned and are easily expandable
  – So do minus’s

Boolean Logic
[Figure: Venn diagram of three terms t1, t2, t3 containing documents D1 to D11, partitioned into eight regions m1 to m8; each region is a conjunction of t1, t2, t3 or their negations]

Boolean Searching
“Measurement of the width of cracks in prestressed concrete beams”
[Figure labels: Cracks, Width measurement, Beams, Prestressed concrete]
Formal Query: cracks AND beams AND Width_measurement AND Prestressed_concrete
Relaxed Query: (C AND B AND P) OR (C AND B AND W) OR (C AND W AND P) OR (B AND W AND P)
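The relaxed query is "any three of the four concepts". It can be generated mechanically, and it is equivalent to counting matched concepts. A sketch with made-up helper names:

```python
from itertools import combinations

CONCEPTS = ("C", "B", "W", "P")  # Cracks, Beams, Width measurement, Prestressed concrete


def relaxed_query(concepts, k):
    """OR together every AND-group of k concepts."""
    groups = ["(" + " AND ".join(group) + ")" for group in combinations(concepts, k)]
    return " OR ".join(groups)


def matches_relaxed(doc_concepts, concepts, k):
    """Equivalent test: the document contains at least k of the concepts."""
    return len(set(concepts) & set(doc_concepts)) >= k


print(relaxed_query(CONCEPTS, 3))
print(matches_relaxed({"C", "B", "P"}, CONCEPTS, 3))
print(matches_relaxed({"C", "B"}, CONCEPTS, 3))
```

The generated groups are the same four as on the slide, listed in a different order.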

Pseudo-Boolean Queries
• A new notation, from web search
  – +cat dog +collar leash
  – + means this term must appear in the document
• Does not mean the same thing!
• Need a way to group combinations.
• Phrases:
  – “stray cat” AND “frayed collar”
  – +“stray cat” +“frayed collar”
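A rough sketch of the + notation. The matching rule is an assumption made for illustration: every + term is required, unmarked terms are optional, and a query with no + terms needs at least one optional term to match. Real engines differ, and quoted phrases are not handled here.

```python
def parse_pseudo_boolean(query):
    """Split a query into required (+term) and optional (plain) terms."""
    required, optional = set(), set()
    for token in query.lower().split():
        if token.startswith("+") and len(token) > 1:
            required.add(token[1:])
        else:
            optional.add(token)
    return required, optional


def matches(query, document_terms):
    """Assumed semantics: all required terms must appear in the document."""
    required, optional = parse_pseudo_boolean(query)
    terms = set(document_terms)
    if not required <= terms:
        return False
    # With no required terms, fall back to "any optional term matches".
    return bool(required) or bool(optional & terms)


query = "+cat dog +collar leash"
print(parse_pseudo_boolean(query))
print(matches(query, {"cat", "collar"}))
print(matches(query, {"cat", "dog", "leash"}))
```

Under these semantics the optional terms dog and leash cannot change whether a document matches; they could only influence ranking, which is one reason the notation is not equivalent to a Boolean expression.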

[Figure: retrieval process diagram with stages Information need, text input, Parse, Query, Collections, Pre-process, Index, Rank]

Result Sets
• Run a query, get a result set
• Two choices
  – Reformulate query, run on entire collection
  – Reformulate query, run on result set
• Example: Dialog query
  – (Redford AND Newman) -> S1 1450 documents
  – (S1 AND Sundance) -> S2 898 documents
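Running a reformulated query on a saved result set amounts to intersecting with the earlier set. A toy sketch: the postings below are invented document IDs, not the Dialog counts from the slide.

```python
# Hypothetical inverted index: term -> set of document IDs containing it.
index = {
    "redford": {1, 2, 3, 5, 8},
    "newman": {2, 3, 5, 7, 8},
    "sundance": {3, 4, 8, 9},
}

# (Redford AND Newman) -> S1
s1 = index["redford"] & index["newman"]

# (S1 AND Sundance) -> S2, computed on the saved result set only
s2 = s1 & index["sundance"]

print(sorted(s1))
print(sorted(s2))
```

Because S2 is built from S1, it can never be larger than S1, which matches the shrinking counts in the Dialog example.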

[Figure: retrieval process diagram with stages Information need, text input, Parse, Query, Collections, Pre-process, Index, Rank, extended with Reformulated Query and Re-Rank]

Ordering (ranking) of Retrieved Documents
• Pure Boolean has no ordering
• Term is there or it’s not
• In practice:
  – order chronologically
  – order by total number of “hits” on query terms
• What if one term has more hits than others?
• Is it better to have one of each term or many of one term?
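Ordering by total hits can be sketched as follows. The documents are toy data and the function names are illustrative; the query is treated as a Boolean AND of its terms.

```python
from collections import Counter


def hit_counts(query_terms, document_tokens):
    """Count occurrences of each query term in one document."""
    counts = Counter(document_tokens)
    return {term: counts[term] for term in query_terms}


def rank_by_total_hits(query_terms, documents):
    """Keep documents containing every term (AND), order by total hits, descending."""
    scored = []
    for doc_id, tokens in documents.items():
        per_term = hit_counts(query_terms, tokens)
        if all(per_term.values()):
            scored.append((sum(per_term.values()), doc_id))
    # Ties are broken by document ID to keep the order deterministic.
    return [doc_id for _, doc_id in sorted(scored, key=lambda item: (-item[0], item[1]))]


documents = {
    "d1": "cat cat cat cat dog".split(),
    "d2": "cat dog cat dog".split(),
    "d3": "cat cat".split(),
}
print(rank_by_total_hits(["cat", "dog"], documents))
```

Here d1 outranks d2 even though its hits are dominated by a single term, which is exactly the concern raised in the last two bullets.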

Boolean Query - Summary
• Advantages
  – simple queries are easy to understand
  – relatively easy to implement
• Disadvantages
  – difficult to specify what is wanted
  – too much returned, or too little
  – ordering not well determined
• Dominant language in commercial systems until the WWW

Vector Space Model
• Queries treated as small documents
• Documents and queries are represented as vectors in term space
  – Terms are usually stems
  – Documents represented by binary vectors of terms
• Query and Document weights are based on length and direction of their vector
• A vector distance measure between the query and documents is used to rank retrieved documents
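The slide does not name a specific distance measure. The sketch below uses cosine similarity, a common choice, as an assumption; the three binary document vectors are toy data over a three-term vocabulary.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank(query, documents):
    """Order document IDs by decreasing similarity to the query vector."""
    return sorted(documents, key=lambda doc_id: cosine(query, documents[doc_id]), reverse=True)


# Term space: (dog, house, white); binary document vectors.
documents = {
    "d1": (1, 1, 0),
    "d2": (0, 0, 1),
    "d3": (1, 0, 0),
}
query = (1, 1, 0)  # the query is treated as a small document
print(rank(query, documents))
```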

Document Vectors
• Documents are represented as “bags of words”
  – Words are terms with no order
• Represented as vectors when used computationally
  – A vector is like an array of floating point values
  – Has direction and magnitude
  – Each vector holds a place for every term in the collection
  – Therefore, most vectors are sparse
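A bag of words maps naturally onto a sparse structure that stores only the terms present, and it can be expanded into a dense array when needed. A minimal sketch with whitespace tokenization (no stemming); the helper names are made up.

```python
from collections import Counter


def bag_of_words(text):
    """Sparse representation: term -> count, with word order discarded."""
    return Counter(text.lower().split())


def to_dense(bag, vocabulary):
    """Dense vector with one floating-point slot per vocabulary term."""
    return [float(bag[term]) for term in vocabulary]


bag = bag_of_words("the white dog and the white house")
vocabulary = ("dog", "house", "white", "cat")

print(bag)
print(to_dense(bag, vocabulary))
```

With a realistic vocabulary the dense form would be almost entirely zeros, which is why the sparse form is usually the one stored.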

Queries
Vocabulary (dog, house, white)
Queries:
• dog (1, 0, 0)
• house (0, 1, 0)
• white (0, 0, 1)
• house and dog (1, 1, 0)
• dog and house (1, 1, 0)
• Show 3-D space plot
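The query encodings above can be reproduced with a few lines. A sketch, assuming words outside the vocabulary (such as "and") are simply ignored; `query_vector` is an illustrative name.

```python
VOCABULARY = ("dog", "house", "white")


def query_vector(query, vocabulary=VOCABULARY):
    """Binary vector with a 1 for each vocabulary term that occurs in the query."""
    words = set(query.lower().split())
    return tuple(1 if term in words else 0 for term in vocabulary)


for query in ("dog", "house", "white", "house and dog", "dog and house"):
    print(query, query_vector(query))
```

Word order is lost in the encoding, so "house and dog" and "dog and house" map to the same vector.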

Documents (queries) in Vector Space
[Figure: documents D1 to D11 plotted in a three-dimensional term space with axes t1, t2, t3]

Documents in 3D Space
Assumption: Documents that are “close together” in space are similar in meaning.

Vector Query Problems
• Significance of queries
  – Can different values be placed on the different terms?
  – e.g. 2 dog 1 house
• Scaling
  – size of vectors
• Number of words in the dictionary?
• 100,000

Proximity Searches
• Proximity: terms occur within K positions of one another
  – pen w/5 paper
• A “Near” function can be more vague
  – near(pen, paper)
• Sometimes order can be specified
• Also, Phrases and Collocations
  – “United Nations” “Bill Clinton”
• Phrase Variants
  – “retrieval of information” “information retrieval”
Proximity - wikipedia
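A proximity test such as pen w/5 paper can be sketched over token positions. The assumption here is that "within K" means the positions differ by at most K; the optional `ordered` flag requires the first term to come before the second.

```python
def positions(tokens, term):
    """All positions at which a term occurs in a tokenized document."""
    return [i for i, token in enumerate(tokens) if token == term]


def within(tokens, first, second, k, ordered=False):
    """True if some occurrence of `first` and `second` are at most k positions apart."""
    for i in positions(tokens, first):
        for j in positions(tokens, second):
            distance = j - i
            if ordered and distance <= 0:
                continue  # `second` must follow `first`
            if 0 < abs(distance) <= k:
                return True
    return False


tokens = "the pen is on the paper".split()
print(within(tokens, "pen", "paper", 5))                # pen w/5 paper
print(within(tokens, "pen", "paper", 3))                # too far apart for K = 3
print(within(tokens, "paper", "pen", 5, ordered=True))  # wrong order
```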

Filters/field limiters
• Filters: Reduce set of candidate docs
• Often specified simultaneously with query
• Usually restrictions on metadata
  – restrict by:
    • date range
    • internet domain (.edu, .com, .berkeley.edu)
    • author
    • size
    • limit number of documents returned
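Metadata filters can be sketched as a pass over candidate documents. The `Doc` record, its field names, and the sample data are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Doc:
    doc_id: str
    domain: str
    author: str
    published: date
    size_bytes: int


def apply_filters(docs, domain_suffix=None, author=None, start=None, end=None,
                  max_size=None, limit=None):
    """Keep documents passing every given restriction; `limit` caps the count."""
    kept = []
    for doc in docs:
        if domain_suffix and not doc.domain.endswith(domain_suffix):
            continue
        if author and doc.author != author:
            continue
        if start and doc.published < start:
            continue
        if end and doc.published > end:
            continue
        if max_size is not None and doc.size_bytes > max_size:
            continue
        kept.append(doc)
    return kept[:limit]  # limit=None keeps everything


docs = [
    Doc("d1", "lib.berkeley.edu", "smith", date(2009, 3, 1), 12_000),
    Doc("d2", "example.com", "jones", date(2009, 6, 15), 4_000),
    Doc("d3", "cs.example.edu", "smith", date(2011, 1, 20), 900_000),
]
edu_2009 = apply_filters(docs, domain_suffix=".edu",
                         start=date(2009, 1, 1), end=date(2009, 12, 31))
print([doc.doc_id for doc in edu_2009])
```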

Natural Language Queries
• The “Holy Grail” of information retrieval
• Issues in Natural Language Processing
  – syntax
  – semantics
  – pragmatics
  – speech understanding
  – speech generation

What do search engines do?
• Tags
  – Title
  – Meta
• Term frequency and location
• Popularity
• Others

What do search engines do?
• Collection of various methods, sometimes called pseudo-Boolean
  – quotes, minus, plus
  – pseudo AND
    • truth in vs in truth
  – stop words?

What does Google do?
• Basic search
• Search operators

UC Berkeley Search Engine Guide
http://www.lib.berkeley.edu/TeachingLib/Guides/Internet/SearchEngines.html

Old: Search Engine Query Differences

Older: Search engine query models

Search query string
The portion of a dynamic URL that contains the search parameters when a dynamic Web site is searched. Query strings do not exist until a user plugs the variables into a database search, at which point the search engine will create the dynamic URL with the query string based on the results. Query strings typically contain ? and % characters.
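The query string of a URL can be taken apart and rebuilt with Python's standard `urllib.parse` module. The URL and parameter names below are invented for illustration.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Hypothetical dynamic URL produced by a site search.
url = "http://example.com/search?q=stray+cat&field=title&page=2"

parts = urlsplit(url)
params = parse_qs(parts.query)  # the query string is everything after the "?"
print(parts.query)
print(params)

# Building a query string: special characters are escaped with % codes.
encoded = urlencode({"q": "cat & dog", "lang": "en"})
print(encoded)
```

The "?" separates the path from the query string, and characters such as "&" inside a value are percent-encoded so they are not mistaken for parameter separators.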

Search query string
http://www.lib.berkeley.edu/level-up/resources/guide#search-engines

Lucene Basics
• Searches are supported through a wide range of Query options
  – Keyword
  – Terms
  – Phrases
  – Wildcards
  – Many, many more

QueryParser syntax examples
Query expression: Document matches if…
• java
  – Contains the term java in the default field
• java junit, or java OR junit
  – Contains the term java or junit or both in the default field (the default operator can be changed to AND)
• +java +junit, or java AND junit
  – Contains both java and junit in the default field
• title:ant
  – Contains the term ant in the title field
• title:extreme -subject:sports
  – Contains extreme in the title and not sports in subject
• (agile OR extreme) AND java
  – Boolean expression matches
• title:"junit in action"
  – Phrase matches in title
• title:"junit action"~5
  – Proximity matches (within 5) in title
• java*
  – Wildcard matches
• java~
  – Fuzzy matches
• lastmodified:[1/1/09 TO 12/31/09]
  – Range matches

Types of Query Structures
Query Models (languages) – most common
• Boolean Queries
  – Old model
• Vector queries
  – Very common - in all search engines to some extent
• Web queries
  – Search engines
• Probabilistic models
  – Mostly research (Indri)
• Holy grail of search
  – Natural Language Queries