Knowledge Retrieval
Dr. Franz J. Kurfess
Computer Science Department, Cal Poly

Acknowledgements

Usage

Use and Distribution of these Slides
❖ These slides are primarily intended for the students in classes I teach. In some cases, I only make PDF versions publicly available. If you would like to get a copy of the originals (Apple Keynote or Microsoft PowerPoint), please contact me via email at fkurfess@calpoly.edu. I hereby grant permission to use them in educational settings. If you do so, it would be nice to send me an email about it. If you’re considering using them in a commercial environment, please contact me first.
© Franz J. Kurfess 2003-2011

Usage of the Slides
❖ These slides are intended for the students of my CPE/CSC 481 “Knowledge-Based Systems” class at Cal Poly SLO
  • if you want to use them outside of my class, please let me know (fkurfess@calpoly.edu)
❖ I usually put together a subset for each quarter as a “Custom Show”
  • to view these, go to “Slide Show => Custom Shows”, select the respective quarter, and click on “Show”
  • in Apple Keynote, I use the “Hide” feature to achieve similar results
❖ To print them, I suggest using the “Handout” option
  • 4, 6, or 9 per page works fine
  • Black & White should be fine; there are few diagrams where color is important

Overview: Knowledge Retrieval
❖ Finding Out About
  • Keywords and Queries; Documents; Indexing
❖ Data Retrieval
  • Access via Address, Field, Name
❖ Information Retrieval
  • Access via Content (Values); Parsing; Matching Against Indices; Retrieval Assessment
❖ Knowledge Retrieval
  • Access via Structure; Meaning; Context; Usage
❖ Knowledge Discovery
  • Data Mining; Rule Extraction

Finding Out About [Belew 2000]

Finding Out About
❖ Keywords
❖ Queries
❖ Documents
❖ Indexing
[Belew 2000]

Keywords
❖ linguistic atoms used to characterize the subject or content of a document
  • words
  • pieces of words (stems)
  • phrases
❖ provide the basis for a match between
  • the user’s characterization of information need
  • the contents of the document
❖ problems
  • ambiguity
  • choice of keywords
[Belew 2000]

Queries
❖ formulated in a query language
  • natural language
    • interaction with human information providers
  • artificial language
    • interaction with computers
    • especially search engines
❖ vocabulary
  • controlled
    • limited set of keywords may be used
  • uncontrolled
    • any keywords may be used
❖ syntax
  • often Boolean operators (AND, OR)
  • sometimes regular expressions
[Belew 2000]

Documents
❖ general interpretation
  • any document that can be represented digitally
  • text, image, music, video, program, etc.
❖ practical interpretation
  • passage of text
  • strings of characters in an alphabet
  • written natural language
  • length may vary
  • longer documents may be composed of shorter ones

Aboutness of Documents
❖ describes the suitability of a document as an answer to a query
❖ assumptions
  • all documents have equal aboutness
    • the probability of any document in a corpus to be considered relevant is equal for all documents
    • simplistic; not valid in reality
  • a paragraph is the smallest unit of text with appreciable aboutness
[Belew 2000]

Structural Aspects of Documents
❖ documents may be composed of other smaller pieces, or other documents
  • paragraphs, subsections, chapters, parts
  • footnotes, references
❖ documents may contain meta-data
  • information about the document
  • not part of the content of the document itself
  • may be used for organization and retrieval purposes
  • can be abused by creators
    • usually to increase the perceived relevance

Document Proxies
❖ surrogates for the real document
  • abridged representations
    • catalog, abstract
  • pointers
    • bibliographical citation, URL
  • different media
    • microfiches
    • digital representations

Indexing
❖ a vocabulary of keywords is assigned to all documents of a corpus
❖ an index maps each document doc_i to the set of keywords {kw_j} it is about
  Index: doc_i →(about) {kw_j}
  Index⁻¹: {kw_j} →(describes) doc_i
❖ indexing of a document / corpus
  • manual: humans select appropriate keywords
  • automatic: a computer program selects the keywords
❖ building the index relation between documents and sets of keywords is critical for information retrieval
[Belew 2000]
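The two directions of the index relation can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides: the corpus, the keyword assignments, and the helper names `invert` and `boolean_and` are all assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical toy corpus: Index maps each document to the keywords it is about.
index = {
    "doc1": {"retrieval", "index", "keyword"},
    "doc2": {"ontology", "retrieval"},
    "doc3": {"keyword", "query"},
}

def invert(index):
    """Build the inverse index: keyword -> set of documents it describes."""
    inverted = defaultdict(set)
    for doc, keywords in index.items():
        for kw in keywords:
            inverted[kw].add(doc)
    return dict(inverted)

def boolean_and(inverted, *keywords):
    """Return the documents described by all of the given keywords."""
    sets = [inverted.get(kw, set()) for kw in keywords]
    return set.intersection(*sets) if sets else set()

inverted = invert(index)
print(sorted(inverted["retrieval"]))                        # ['doc1', 'doc2']
print(sorted(boolean_and(inverted, "retrieval", "index")))  # ['doc1']
```

The inverse index is the structure a Boolean AND query is answered from: each keyword contributes a document set, and the sets are intersected.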

FOA Conversation Loop [Belew 2000]

Data Retrieval
❖ access to specific data items
❖ access via address, field, name
❖ typically used in databases
❖ user asks for items with specific features
  • absence or presence of features
  • values
❖ system returns data items
  • no irrelevant items
❖ deterministic retrieval method

Information Retrieval (IR)
❖ access to documents
  • also referred to as document retrieval
❖ access via keywords
❖ IR aspects
  • parsing
  • matching against indices
  • retrieval assessment

Diagram Search Engine [Belew 2000]

Parsing
❖ extraction of lexical features from documents
  • mostly words
❖ may require some manipulation of the extracted features
  • e.g. stemming of words
❖ used as the basis for automatic compilation of indices
[Belew 2000]
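As a rough illustration of parsing followed by stemming, the sketch below extracts lowercase words and strips a few suffixes. The suffix list and the minimum stem length are assumptions chosen for brevity; production stemmers apply far more careful rules than this.

```python
import re

# Assumed, deliberately crude suffix list (checked in this order).
SUFFIXES = ("ing", "ed", "es", "s")

def tokenize(text):
    """Extract lexical features: here, lowercase alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())

def stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def parse(text):
    """Tokenize a text and reduce every token to its (approximate) stem."""
    return [stem(token) for token in tokenize(text)]

print(parse("Indexing retrieved documents"))  # ['index', 'retriev', 'document']
```

The stems do not have to be dictionary words ("retriev"); they only need to be the same for all variants of a word so that the variants share one index entry.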

Matching Against Indices
❖ identification of documents that are relevant for a particular query
❖ keywords of the query are compared against the keywords that appear in the document
  • either in the data or meta-data of the document
❖ in addition to queries, other features of documents may be used
  • descriptive features provided by the author or cataloger
    • usually meta-data
  • derived features computed from the contents of the document
[Belew 2000]

Vector Space
❖ interpretation of the index matrix
  • relates documents and keywords
❖ can grow extremely large
  • binary matrix of 100,000 words * 1,000 documents
  • sparsely populated: most entries will be 0
❖ can be used to determine similarity of documents
  • overlap in keywords
  • proximity in the (virtual) vector space
❖ associative memories can be used as hardware implementation
  • extremely fast, but expensive to build
[Belew 2000]
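Because the binary index matrix is sparse, each document row can be stored as the set of keywords whose entry is 1. The sketch below uses that representation to compute cosine similarity, one common proximity measure in a vector space; the slides do not name a specific measure, so the choice of cosine and the sample documents are assumptions.

```python
import math

def cosine(doc_a, doc_b):
    """Cosine similarity of two binary keyword vectors stored as sets."""
    if not doc_a or not doc_b:
        return 0.0
    overlap = len(doc_a & doc_b)  # dot product of two 0/1 vectors
    return overlap / math.sqrt(len(doc_a) * len(doc_b))

# Hypothetical documents, each reduced to its set of keywords.
d1 = {"retrieval", "index", "keyword", "query"}
d2 = {"retrieval", "index", "ontology", "context"}
d3 = {"music", "fingerprint"}

print(cosine(d1, d2))  # 0.5
print(cosine(d1, d3))  # 0.0
```

For binary vectors the dot product is the keyword overlap, so the two similarity notions on the slide (overlap in keywords, proximity in the vector space) coincide up to normalization.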

Vector Space Diagram [Belew 2000]

Measuring Retrieval
❖ ideally, all relevant documents should be retrieved
  • relative to the query posed by the user
  • relative to the set of documents available (corpus)
  • relevance can be subjective
❖ precision and recall
  • relevant documents vs. retrieved documents

Document Retrieval [Belew 2000]

Precision and Recall
  recall ≡ |retrieved ∩ relevant| / |relevant|
  precision ≡ |retrieved ∩ relevant| / |retrieved|
[Belew 2000]
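Both measures follow directly from set operations. The sketch below is a small worked example with made-up document identifiers; returning 0.0 for an empty set is a convention assumed here to avoid division by zero.

```python
def precision_recall(retrieved, relevant):
    """Compute (precision, recall) for sets of document identifiers."""
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical result: 4 documents retrieved, 3 relevant overall, 2 in common.
retrieved = {"doc1", "doc2", "doc3", "doc4"}
relevant = {"doc2", "doc4", "doc5"}

p, r = precision_recall(retrieved, relevant)
print(p)  # 0.5
print(round(r, 3))  # 0.667
```

Retrieving more documents can only keep or raise recall, but it usually lowers precision, which is why the two are reported together.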

Specificity vs. Exhaustivity [Belew 2000]

Retrieval Assessment
❖ subjective assessment
  • how well do the retrieved documents satisfy the request of the user
❖ objective assessment
  • idealized omniscient expert determines the quality of the response
[Belew 2000]

Retrieval Assessment Diagram [Belew 2000]

Relevance Feedback
❖ subjective assessment of retrieval results
❖ often used to iteratively improve retrieval results
❖ may be collected by the retrieval system for statistical evaluation
❖ can be viewed as a variant of object recognition
  • the object to be recognized is the prototypical document the user is looking for
  • this document may or may not exist
  • the difference between the retrieved document(s) and the idealized prototype indicates the quality of the retrieval results
[Belew 2000]

Relevance Feedback in Vector Space
❖ relevance feedback is used to move the query towards the cluster of positive documents
  • moving away from bad documents does not necessarily improve the results
❖ it can also be used as a filter for a constant stream of documents
  • as in news channels or similar situations
[Belew 2000]
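One way to move a query towards the positive cluster is to add a fraction of the centroid of the positively rated documents to the query vector, in the spirit of Rocchio-style feedback. The sketch below uses only positive feedback, in line with the remark about bad documents; the weights `alpha` and `beta`, the function name `move_query`, and the sample vectors are assumptions, not values from the slides.

```python
def move_query(query, positives, alpha=1.0, beta=0.5):
    """Shift a sparse query vector toward the centroid of positive documents.

    Vectors are dicts mapping a term to its weight; missing terms weigh 0.
    """
    if not positives:
        return dict(query)
    terms = set(query)
    for doc in positives:
        terms |= set(doc)
    updated = {}
    for term in terms:
        centroid = sum(doc.get(term, 0.0) for doc in positives) / len(positives)
        updated[term] = alpha * query.get(term, 0.0) + beta * centroid
    return updated

# Hypothetical query and two documents the user marked as relevant.
query = {"retrieval": 1.0}
positives = [
    {"retrieval": 1.0, "ontology": 1.0},
    {"retrieval": 1.0, "ontology": 1.0, "context": 1.0},
]

new_query = move_query(query, positives)
print(new_query["ontology"])  # 0.5
print(new_query["context"])   # 0.25
```

Terms shared by all positive documents ("ontology") gain more weight than terms found in only some of them ("context"), so the revised query drifts towards what the positive documents have in common.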

Query Session Example [Belew 2000]

Consensual Relevance
❖ relevance feedback from multiple users
  • identifies documents that many users found useful or interesting
  • used by some Web sites
  • related to collaborative filtering
  • can also be used as an evaluation method for search engines
    • performance criteria must be carefully considered
    • precision and recall, plus many others
[Belew 2000]

IR Diagram (figure, shown in five build steps; labels: Query, Keywords, Term 1 to Term 4, Index, Documents, Corpus, Doc. 1 to Doc. 5)

Knowledge Retrieval
❖ Context
❖ Usage
  • exploratory search
  • faceted search

Context in Knowledge Retrieval
❖ in addition to keywords, relationships between keywords and documents are exploited
  • explicit links
    • hypertext
  • related concepts
    • thesaurus, ontology
  • proximity
    • spatial: place, directory
    • temporal: creation date/time
  • intermediate relations
    • author/creator
    • organization
    • project

Inference beyond the Index
❖ determines relationships between documents
❖ citations are explicit references to relevant documents
  • bibliographic references
  • legal citations
  • hypertext
❖ examples
  • NEC CiteSeer <http://citeseer.nj.nec.com>
  • Google Scholar <http://scholar.google.com>

Additional Information Sources [Belew 2000, after Kochen 1975]

Hypertext
❖ inter-document links provide explicit relationships between documents
  • can be used to determine the relevance of a document for a query
  • example: Google <http://www.google.com>
❖ intra-document links may offer additional context information for some terms
  • footnotes, glossaries, related terms

Adaptive Retrieval Techniques
❖ fine-tuning the matching between queries and retrieved documents
  • learning of relationships between terms
    • training with term pairs (thesaurus)
    • pattern detection in past queries
  • automatic clustering of similar documents
  • grouping of documents according to common features
    • pre-defined categories
    • metadata
    • overlap in keywords
    • consensual relevance
    • source

Document Classification

Query Model
❖ query types (templates)
  • frequently used types of queries
  • e.g. problem/solution, symptoms/diagnosis, problem/further checks, ...
❖ category types
  • abstractions of query types
  • used to determine categories or topics for the grouping of search results
❖ context information
  • current working document/directory
  • previous queries
[Pratt, Hearst, Fagan 2000]

Terminology Model
❖ individual terms are connected to related terms
  • thesaurus/ontology
  • synonyms, super-/sub-classes, related terms
❖ identifies labels for the category types
[Pratt, Hearst, Fagan 2000]
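A terminology model of this kind can be pictured as a lookup structure that connects each term to its related terms, which a retrieval system may use to expand a query. The sketch below is purely illustrative: the thesaurus entries, the relation names, and the `expand` helper are assumptions and do not reproduce the model of the cited work.

```python
# Hypothetical miniature thesaurus: term -> related terms, by relation type.
THESAURUS = {
    "car": {"synonyms": {"automobile"}, "subclasses": {"sedan"}},
    "doctor": {"synonyms": {"physician"}, "superclasses": {"professional"}},
}

def expand(query_terms, relations=("synonyms",)):
    """Add terms related to the query terms via the selected relations."""
    expanded = set(query_terms)
    for term in query_terms:
        entry = THESAURUS.get(term, {})
        for relation in relations:
            expanded |= entry.get(relation, set())
    return expanded

print(sorted(expand({"car"})))  # ['automobile', 'car']
print(sorted(expand({"car", "bike"}, relations=("synonyms", "subclasses"))))
# ['automobile', 'bike', 'car', 'sedan']
```

Terms without a thesaurus entry ("bike") pass through unchanged, so expansion never removes anything the user asked for.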

Matching
❖ categorizer
  • determines the categories to be selected for the grouping of results
  • assigns retrieved documents to the categories
❖ organizer
  • arranges categories into a hierarchy
  • should be balanced and easy to browse by the user
  • depends on the distribution of the search results
[Pratt, Hearst, Fagan 2000]
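The categorizer step can be approximated by assigning each retrieved document to every category whose cue keywords it contains. This is a simplified sketch under stated assumptions, not the method of the cited system: the category labels, cue keywords, the "other" fallback, and the function name `categorize` are all hypothetical.

```python
from collections import defaultdict

# Hypothetical categories: label -> keywords that signal the category.
CATEGORIES = {
    "diagnosis": {"symptom", "test"},
    "treatment": {"drug", "surgery"},
}

def categorize(results):
    """Group retrieved documents (id -> keyword set) by matching category."""
    groups = defaultdict(list)
    for doc_id, keywords in sorted(results.items()):
        matched = [label for label, cues in CATEGORIES.items() if cues & keywords]
        # Documents matching no category are collected under "other".
        for label in matched or ["other"]:
            groups[label].append(doc_id)
    return dict(groups)

# Hypothetical search results.
results = {
    "doc1": {"symptom", "drug"},
    "doc2": {"surgery"},
    "doc3": {"history"},
}

groups = categorize(results)
print(groups["treatment"])  # ['doc1', 'doc2']
```

A document may land in several categories ("doc1"), and the resulting group sizes depend entirely on the distribution of the search results, which is what the organizer then has to balance.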

Results
❖ retrieved documents are grouped into hierarchically arranged categories meaningful for the user
  • the categories are related to the query
  • the categories are related to each other
  • all categories have similar size
    • not always achievable due to the distribution of documents
❖ reduced search times
❖ higher user satisfaction
[Pratt, Hearst, Fagan 2000]

DynaCat Results [DynaCat, 2000]

DynaCat Query Types [DynaCat, 2000]

DynaCat Search [DynaCat, 2000]

Information vs. Knowledge Retrieval
IR | KR
keywords as main components of the query | keywords plus context information for the query
index as match-making facility | index plus ontology for matching query and documents
statistical basis for selection of relevant documents | relationships between keywords and documents influence the selection of relevant documents
(ordered) list of results | results are grouped into meaningful categories

KR Diagram (figure; labels: Query, Keywords, Term 1 to Term 4, keyword input, synonym expansion, relation expansion, Index, Ontology with Term A to Term M, Documents, Corpus, Doc. 1 to Doc. 5)

Exploratory Search
❖ finding knowledge through association
❖ hypothesis: human-made associations between knowledge items are valuable for others
  • especially if the associations are made by experts or experienced users

Vannevar Bush: Memex
❖ better knowledge management for scientific document collections
  • build, maintain, and share paths through the document space containing knowledge (“knowledge trails”)
  • see Vannevar Bush, “As We May Think”, Atlantic Monthly, July 1945; www.theatlantic.com/194507/bush

Faceted Search
❖ exploration of a domain via attributes
  • select a relevant attribute, and display the elements of the domain ordered according to the attribute
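A minimal sketch of this idea: pick an attribute (facet), group the items by its values, and optionally narrow the collection before choosing the next facet. The collection below uses placeholder titles and artists, and the helper names `facet` and `refine` are assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical collection with placeholder values.
tracks = [
    {"title": "Track A", "artist": "Artist X", "genre": "Jazz"},
    {"title": "Track B", "artist": "Artist Y", "genre": "Rock"},
    {"title": "Track C", "artist": "Artist X", "genre": "Rock"},
]

def facet(items, attribute):
    """Group item titles by the values of one attribute, in sorted order."""
    groups = defaultdict(list)
    for item in items:
        groups[item[attribute]].append(item["title"])
    return {value: sorted(titles) for value, titles in sorted(groups.items())}

def refine(items, **selected):
    """Keep only the items whose attributes match all selected values."""
    return [
        item for item in items
        if all(item.get(key) == value for key, value in selected.items())
    ]

by_genre = facet(tracks, "genre")
rock_by_artist = facet(refine(tracks, genre="Rock"), "artist")
print(by_genre)        # {'Jazz': ['Track A'], 'Rock': ['Track B', 'Track C']}
print(rock_by_artist)  # {'Artist X': ['Track C'], 'Artist Y': ['Track B']}
```

Alternating `refine` and `facet` is what lets a user explore a domain without formulating a keyword query first.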

Faceted Search in iTunes

Variations on Faceted Search
❖ displaying lists of items ordered according to an attribute can get quite boring
❖ attributes often lend themselves to alternative presentation methods
  • visual
    • static: color, size, shape
    • dynamic: movement, changes over time
  • auditory
    • often for supplementary information

Knowledge Discovery
❖ combination of
  • Data Mining
  • Knowledge Extraction
  • Knowledge Fusion

Data Mining
❖ identification of interesting “nuggets” in huge quantities of data
  • often relations between subsets
  • automatic or semi-automatic
❖ techniques
  • classification, correlation (e.g. temporal, spatial)

Knowledge Extraction
❖ conversion of internal representations of knowledge into human-understandable format
  • extraction of rules from neural networks is one example

Knowledge Fusion
❖ multiple pieces of information are combined into one
  • redundancy
    • do several pieces contain the same type of information?
  • compatibility
    • do the individual pieces have similar formats and interpretations?
    • are there mappings to convert values into the same format?
  • consistency
    • are the values of the individual pieces close?
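The compatibility and consistency checks can be illustrated with redundant measurements of one quantity reported in different units. The sketch below is an assumption-laden toy: the choice of length as the quantity, the unit table, the 5% tolerance, and the function name `fuse` are illustrative and not taken from the slides.

```python
# Assumed mappings that convert each supported unit into metres.
TO_METRES = {"m": 1.0, "cm": 0.01, "ft": 0.3048}

def fuse(readings, tolerance=0.05):
    """Fuse redundant (value, unit) readings of one length into a single value.

    Returns the mean in metres, or None if the readings are inconsistent,
    i.e. any converted value deviates from the mean by more than `tolerance`
    (as a fraction of the mean). Raises ValueError for incompatible units.
    """
    if not readings:
        raise ValueError("nothing to fuse")
    if any(unit not in TO_METRES for _, unit in readings):
        raise ValueError("incompatible format: no mapping to metres")
    values = [value * TO_METRES[unit] for value, unit in readings]
    mean = sum(values) / len(values)
    consistent = all(abs(v - mean) <= tolerance * abs(mean) for v in values)
    return mean if consistent else None

agreeing = fuse([(1.80, "m"), (180, "cm")])
conflicting = fuse([(1.80, "m"), (250, "cm")])
print(conflicting)  # None
```

Compatibility is checked first (is there a mapping into a common format?), and only then consistency (are the converted values close enough to be merged?).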

Image Search
❖ contextual
  • meta-data
  • text in the same or close-by documents
    • e.g. on the same Web page, or in the same directory
❖ content-based
  • analysis and comparison of images
  • feature extraction

Contextual Image Search
❖ images are indexed via keywords
  • captions of images
  • metadata (tags)
  • surrounding text
❖ relies on the assumption that the indexed text is correlated to the image and a good description of its content
❖ basis for most current image search engines

Content-Based Image Search
❖ images are compared against query images
  • no text elements as proxies
    • or at least not in a “pure” content-based image search
  • relies on feature extraction or object recognition
    • direct comparison of pictures on a pixel-by-pixel basis is impractical
    • only yields identical pictures, not similar ones
  • computationally very challenging
    • especially if the query image is a subset of a target image
  • allows the use of a picture as a “template” to find related pictures

Example: UC Santa Cruz Image Search
❖ feature extraction and object recognition from images and video
Using a single image as a template, computer software can find similar images in a large database of photos, as shown in these examples. Images courtesy of P. Milanfar.
http://www.physorg.com/newman/gfx/news/hires/newsearchtec.jpg
http://www.physorg.com/news177095786.html

Music Search: Shazam
❖ identifies musical pieces through “fingerprints”
  • emphasis on popular music

Summary: Knowledge Retrieval
❖ identification, selection, and presentation of documents relevant to a user query
❖ utilization of structural information, context, meta-data in addition to keyword search
❖ organized presentation of results
  • categories, visual arrangement
❖ internal representations may be converted to human-understandable ones
