Knowledge Retrieval
Dr. Franz J. Kurfess
Computer Science Department, Cal Poly
Acknowledgements
Usage
Use and Distribution of these Slides
❖ These slides are primarily intended for the students in classes I teach. In some cases, I only make PDF versions publicly available. If you would like to get a copy of the originals (Apple Keynote or Microsoft PowerPoint), please contact me via email at fkurfess@calpoly.edu. I hereby grant permission to use them in educational settings. If you do so, it would be nice to send me an email about it. If you’re considering using them in a commercial environment, please contact me first.
© Franz J. Kurfess 2003-2011
Usage of the Slides
❖ these slides are intended for the students of my CPE/CSC 481 “Knowledge-Based Systems” class at Cal Poly SLO
  v if you want to use them outside of my class, please let me know (fkurfess@calpoly.edu)
❖ I usually put together a subset for each quarter as a “Custom Show”
  v to view these, go to “Slide Show => Custom Shows”, select the respective quarter, and click on “Show”
  v in Apple Keynote, I use the “Hide” feature to achieve similar results
❖ To print them, I suggest using the “Handout” option
  v 4, 6, or 9 per page works fine
  v Black & White should be fine; there are few diagrams where color is important
Overview Knowledge Retrieval
❖ Finding Out About
  v Keywords and Queries; Documents; Indexing
❖ Data Retrieval
  v Access via Address, Field, Name
❖ Information Retrieval
  v Access via Content (Values); Parsing; Matching Against Indices; Retrieval Assessment
❖ Knowledge Retrieval
  v Access via Structure; Meaning; Context; Usage
❖ Knowledge Discovery
  v Data Mining; Rule Extraction
Finding Out About [Belew 2000]
Finding Out About
❖ Keywords
❖ Queries
❖ Documents
❖ Indexing
[Belew 2000]
Keywords
❖ linguistic atoms used to characterize the subject or content of a document
  v words
  v pieces of words (stems)
  v phrases
❖ provide the basis for a match between
  v the user’s characterization of information need
  v the contents of the document
❖ problems
  v ambiguity
  v choice of keywords
[Belew 2000]
Queries
❖ formulated in a query language
  v natural language
    v interaction with human information providers
  v artificial language
    v interaction with computers
    v especially search engines
❖ vocabulary
  v controlled
    v limited set of keywords may be used
  v uncontrolled
    v any keywords may be used
❖ syntax
  v often Boolean operators (AND, OR)
  v sometimes regular expressions
[Belew 2000]
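The Boolean query syntax mentioned above can be illustrated with a minimal sketch. The corpus `DOCS`, the tuple-based query representation, and the function names `matches` and `search` are assumptions made for this example; they are not taken from the slides or from [Belew 2000].

```python
# Hypothetical toy corpus: document id -> set of keywords assigned to it.
DOCS = {
    "doc1": {"knowledge", "retrieval", "index"},
    "doc2": {"data", "retrieval", "database"},
    "doc3": {"knowledge", "ontology"},
}


def matches(keywords, query):
    """Evaluate a nested Boolean query against one document's keywords.

    A query is either a keyword string or a tuple of the form
    ("AND" | "OR", subquery, subquery, ...).
    """
    if isinstance(query, str):
        return query in keywords
    op, *args = query
    results = (matches(keywords, arg) for arg in args)
    if op == "AND":
        return all(results)
    if op == "OR":
        return any(results)
    raise ValueError(f"unknown operator: {op}")


def search(docs, query):
    """Return the sorted ids of all documents that satisfy the query."""
    return sorted(doc_id for doc_id, kws in docs.items() if matches(kws, query))


# retrieval AND (knowledge OR data)
hits = search(DOCS, ("AND", "retrieval", ("OR", "knowledge", "data")))
```

With a controlled vocabulary, the keyword strings in a query would additionally be checked against the permitted set before evaluation.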
Documents
❖ general interpretation
  v any document that can be represented digitally
  v text, image, music, video, program, etc.
❖ practical interpretation
  v passage of text
  v strings of characters in an alphabet
  v written natural language
  v length may vary
  v longer documents may be composed of shorter ones
Aboutness of Documents
❖ describes the suitability of a document as an answer to a query
❖ assumptions
  v all documents have equal aboutness
    v the probability of being considered relevant is the same for every document in a corpus
    v simplistic; not valid in reality
  v a paragraph is the smallest unit of text with appreciable aboutness
[Belew 2000]
Structural Aspects of Documents
❖ documents may be composed of other smaller pieces, or other documents
  v paragraphs, subsections, chapters, parts
  v footnotes, references
❖ documents may contain meta-data
  v information about the document
  v not part of the content of the document itself
  v may be used for organization and retrieval purposes
  v can be abused by creators
    v usually to increase the perceived relevance
Document Proxies
❖ surrogates for the real document
  v abridged representations
    v catalog, abstract
  v pointers
    v bibliographical citation, URL
  v different media
    v microfiches
    v digital representations
Indexing
❖ a vocabulary of keywords is assigned to all documents of a corpus
❖ an index maps each document doc_i to the set of keywords {kw_j} it is about
  Index: doc_i →(about) {kw_j}
  Index^{-1}: {kw_j} →(describes) doc_i
❖ indexing of a document / corpus
  v manual: humans select appropriate keywords
  v automatic: a computer program selects the keywords
❖ building the index relation between documents and sets of keywords is critical for information retrieval
[Belew 2000]
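The two mappings Index and Index^{-1} can be sketched as a pair of dictionaries. This is an illustrative sketch only: the corpus, the vocabulary, and the names `build_index` and `invert` are assumptions, and keyword selection is reduced to whitespace splitting plus a vocabulary lookup.

```python
from collections import defaultdict


def build_index(corpus, vocabulary):
    """Index: map each document id to the vocabulary keywords found in its text."""
    index = {}
    for doc_id, text in corpus.items():
        words = set(text.lower().split())
        index[doc_id] = words & vocabulary
    return index


def invert(index):
    """Inverse index: map each keyword to the set of documents it describes."""
    inverse = defaultdict(set)
    for doc_id, keywords in index.items():
        for keyword in keywords:
            inverse[keyword].add(doc_id)
    return dict(inverse)


# Hypothetical two-document corpus and controlled vocabulary.
CORPUS = {
    "doc1": "knowledge retrieval uses an index",
    "doc2": "data retrieval uses an address",
}
VOCABULARY = {"knowledge", "data", "retrieval", "index"}

index = build_index(CORPUS, VOCABULARY)
inverse = invert(index)
```

Retrieval systems usually query the inverse mapping, since a query supplies keywords and asks for documents.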
FOA Conversation Loop [Belew 2000]
Data Retrieval
❖ access to specific data items
❖ access via address, field, name
❖ typically used in databases
❖ user asks for items with specific features
  v absence or presence of features
  v values
❖ system returns data items
  v no irrelevant items
❖ deterministic retrieval method
Information Retrieval (IR)
❖ access to documents
  v also referred to as document retrieval
❖ access via keywords
❖ IR aspects
  v parsing
  v matching against indices
  v retrieval assessment
Diagram Search Engine [Belew 2000]
Parsing
❖ extraction of lexical features from documents
  v mostly words
❖ may require some manipulation of the extracted features
  v e.g. stemming of words
❖ used as the basis for automatic compilation of indices
[Belew 2000]
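A rough sketch of this parsing step follows. The suffix list is an assumption and deliberately crude; it only stands in for a real stemming algorithm and is not the method described in [Belew 2000]. The function names are hypothetical as well.

```python
import re

# Assumed suffix list; a real stemmer handles far more cases and exceptions.
SUFFIXES = ("ing", "ed", "es", "s")


def tokenize(text):
    """Extract lower-case alphabetic words as the lexical features."""
    return re.findall(r"[a-z]+", text.lower())


def stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def parse(text):
    """Tokenize a document and reduce every token to its crude stem."""
    return [stem(token) for token in tokenize(text)]


features = parse("Indexing indexed documents")
```

The resulting stems, rather than the surface words, would then be the candidates for keywords in an automatically compiled index.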
Matching Against Indices
❖ identification of documents that are relevant for a particular query
❖ keywords of the query are compared against the keywords that appear in the document
  v either in the data or meta-data of the document
❖ in addition to queries, other features of documents may be used
  v descriptive features provided by the author or cataloger
    v usually meta-data
  v derived features computed from the contents of the document
[Belew 2000]
Vector Space
❖ interpretation of the index
  v matrix relates documents and keywords
❖ can grow extremely large
  v binary matrix of 100,000 words * 1,000 documents
  v sparsely populated: most entries will be 0
❖ can be used to determine similarity of documents
  v overlap in keywords
  v proximity in the (virtual) vector space
❖ associative memories can be used as hardware implementation
  v extremely fast, but expensive to build
[Belew 2000]
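Proximity in the vector space can be made concrete with a small sketch. Cosine similarity is used here as one common proximity measure; the slides do not say which measure is meant, so this choice, the four-term vocabulary, and the function names are assumptions.

```python
import math


def to_vector(keywords, vocabulary):
    """Binary row of the index matrix: 1 if the document has the term, else 0."""
    return [1 if term in keywords else 0 for term in vocabulary]


def cosine(u, v):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Hypothetical vocabulary and documents.
VOCABULARY = ["data", "index", "knowledge", "retrieval"]
doc_a = to_vector({"knowledge", "retrieval"}, VOCABULARY)
doc_b = to_vector({"data", "retrieval"}, VOCABULARY)
doc_c = to_vector({"index"}, VOCABULARY)

sim_ab = cosine(doc_a, doc_b)  # one shared keyword out of two each
sim_ac = cosine(doc_a, doc_c)  # no shared keywords
```

Because the matrix is sparsely populated, a practical system would store only the non-zero entries instead of full rows as done here.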
Vector Space Diagram [Belew 2000]
Measuring Retrieval
❖ ideally, all relevant documents should be retrieved
  v relative to the query posed by the user
  v relative to the set of documents available (corpus)
  v relevance can be subjective
❖ precision and recall
  v relevant documents vs. retrieved documents
Document Retrieval [Belew 2000]
Precision and Recall
recall ≡ |retrieved ∩ relevant| / |relevant|
precision ≡ |retrieved ∩ relevant| / |retrieved|
[Belew 2000]
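The two ratios translate directly into set operations. The sketch below is illustrative; the function name and the document ids are assumptions, and returning 0.0 for an empty set is a convention chosen here, not something the slides specify.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a retrieved set and a relevant set."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    # Assumed convention: an empty denominator yields 0.0.
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical result: 4 documents retrieved, 3 relevant, 2 in common.
precision, recall = precision_recall(
    {"doc1", "doc2", "doc3", "doc4"},
    {"doc2", "doc4", "doc5"},
)
```

In this example, half of the retrieved documents are relevant (precision 2/4), and two of the three relevant documents were found (recall 2/3).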
Specificity vs. Exhaustivity [Belew 2000]
Retrieval Assessment
❖ subjective assessment
  v how well do the retrieved documents satisfy the request of the user
❖ objective assessment
  v idealized omniscient expert determines the quality of the response
[Belew 2000]
Retrieval Assessment Diagram [Belew 2000]
Relevance Feedback
❖ subjective assessment of retrieval results
❖ often used to iteratively improve retrieval results
❖ may be collected by the retrieval system for statistical evaluation
❖ can be viewed as a variant of object recognition
  v the object to be recognized is the prototypical document the user is looking for
  v this document may or may not exist
  v the difference between the retrieved document(s) and the idealized prototype indicates the quality of the retrieval results
[Belew 2000]
Relevance Feedback in Vector Space
❖ relevance feedback is used to move the query towards the cluster of positive documents
  v moving away from bad documents does not necessarily improve the results
❖ it can also be used as a filter for a constant stream of documents
  v as in news channels or similar situations
[Belew 2000]
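One common way to move a query vector towards the positive documents is a Rocchio-style update; the slides do not name a specific method, so the technique, the weights `alpha`, `beta`, `gamma`, and the vectors below are assumptions for illustration. The negative weight defaults to 0, in line with the remark that moving away from bad documents does not necessarily help.

```python
def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def rocchio(query, positive, negative=(), alpha=1.0, beta=0.75, gamma=0.0):
    """Shift the query towards the positive centroid, optionally away from the negative one."""
    dims = len(query)
    pos = centroid(positive) if positive else [0.0] * dims
    neg = centroid(negative) if negative else [0.0] * dims
    return [alpha * q + beta * p - gamma * n for q, p, n in zip(query, pos, neg)]


# Hypothetical three-term space: the user marked two documents as relevant.
query = [1.0, 0.0, 0.0]
positive = [[1.0, 1.0, 0.0], [1.0, 0.0, 1.0]]
updated = rocchio(query, positive)
```

The updated query gains weight on terms that occur in the positively rated documents, so a second retrieval pass ranks similar documents higher.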
Query Session Example [Belew 2000]
Consensual Relevance
❖ relevance feedback from multiple users
  v identifies documents that many users found useful or interesting
  v used by some Web sites
  v related to collaborative filtering
❖ can also be used as an evaluation method for search engines
  v performance criteria must be carefully considered
  v precision and recall, plus many others
[Belew 2000]
IR Diagram [figure; labels: Query, Keywords, Index (Term 1 to Term 4), Documents, Corpus (Doc. 1 to Doc. 5)]
Knowledge Retrieval
❖ Context
❖ Usage
  v exploratory search
  v faceted search
Context in Knowledge Retrieval
❖ in addition to keywords, relationships between keywords and documents are exploited
  v explicit links
    v hypertext
  v related concepts
    v thesaurus, ontology
  v proximity
    v spatial: place, directory
    v temporal: creation date/time
  v intermediate relations
    v author/creator
    v organization
    v project
Inference beyond the Index
❖ determines relationships between documents
❖ citations are explicit references to relevant documents
  v bibliographic references
  v legal citations
  v hypertext
❖ examples
  v NEC CiteSeer <http://citeseer.nj.nec.com>
  v Google Scholar <http://scholar.google.com>
Additional Information Sources [Belew 2000, after Kochen 1975]
Hypertext
❖ inter-document links provide explicit relationships between documents
  v can be used to determine the relevance of a document for a query
  v example: Google <http://www.google.com>
❖ intra-document links may offer additional context information for some terms
  v footnotes, glossaries, related terms
Adaptive Retrieval Techniques
❖ fine-tuning the matching between queries and retrieved documents
  v learning of relationships between terms
    v training with term pairs (thesaurus)
    v pattern detection in past queries
  v automatic clustering of similar documents
  v grouping of documents according to common features
    v pre-defined categories
    v metadata
    v overlap in keywords
    v consensual relevance
    v source
Document Classification
Query Model
❖ query types (templates)
  v frequently used types of queries
  v e.g. problem/solution, symptoms/diagnosis, problem/further checks, ...
❖ category types
  v abstractions of query types
  v used to determine categories or topics for the grouping of search results
❖ context information
  v current working document/directory
  v previous queries
[Pratt, Hearst, Fagan 2000]
Terminology Model
❖ individual terms are connected to related terms
  v thesaurus/ontology
  v synonyms, super-/sub-classes, related terms
❖ identifies labels for the category types
[Pratt, Hearst, Fagan 2000]
Matching
❖ categorizer
  v determines the categories to be selected for the grouping of results
  v assigns retrieved documents to the categories
❖ organizer
  v arranges categories into a hierarchy
  v should be balanced and easy to browse by the user
  v depends on the distribution of the search results
[Pratt, Hearst, Fagan 2000]
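The division of labor between categorizer and organizer can be sketched roughly as follows. This is not the algorithm of [Pratt, Hearst, Fagan 2000]; the category terms, the documents, and the function names are hypothetical, and the organizer only orders a flat list by size instead of building a full hierarchy.

```python
# Hypothetical categories with the terms that label them.
CATEGORY_TERMS = {
    "treatment": {"therapy", "drug", "surgery"},
    "diagnosis": {"symptom", "test", "screening"},
}


def categorize(documents, category_terms):
    """Assign each document to every category whose terms overlap its keywords."""
    groups = {name: [] for name in category_terms}
    for doc_id, keywords in documents.items():
        for name, terms in category_terms.items():
            if keywords & terms:
                groups[name].append(doc_id)
    return groups


def organize(groups):
    """Drop empty categories and order the rest by size, largest first."""
    non_empty = {name: sorted(ids) for name, ids in groups.items() if ids}
    return sorted(non_empty.items(), key=lambda item: (-len(item[1]), item[0]))


# Hypothetical search results: document id -> keywords.
RESULTS = {
    "doc1": {"drug", "symptom"},
    "doc2": {"surgery"},
    "doc3": {"weather"},
}

organized = organize(categorize(RESULTS, CATEGORY_TERMS))
```

A document may land in several categories (doc1 here), and documents that match no category (doc3) are left out; how to present such leftovers is a design decision this sketch does not address.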
Results
❖ retrieved documents are grouped into hierarchically arranged categories meaningful for the user
  v the categories are related to the query
  v the categories are related to each other
  v all categories have similar size
    v not always achievable due to the distribution of documents
❖ reduced search times
❖ higher user satisfaction
[Pratt, Hearst, Fagan 2000]
DynaCat Results [DynaCat, 2000]
DynaCat Query Types [DynaCat, 2000]
DynaCat Search [DynaCat, 2000]
Information vs. Knowledge Retrieval
IR | KR
keywords as main components of the query | keywords plus context information for the query
index as match-making facility | index plus ontology for matching query and documents
statistical basis for selection of relevant documents | relationships between keywords and documents influence the selection of relevant documents
(ordered) list of results | results are grouped into meaningful categories
KR Diagram [figure; labels: Query, Keywords, Index, Ontology, Documents, Corpus; Term 1 to Term 4, Term A to Term M, Doc. 1 to Doc. 5; steps: keyword input, synonym expansion, relation expansion]
Exploratory Search
❖ finding knowledge through association
❖ hypothesis: human-made associations between knowledge items are valuable for others
  v especially if the associations are made by experts or experienced users
Vannevar Bush: Memex
❖ better knowledge management for scientific document collections
  v build, maintain, and share paths through the document space containing knowledge (“knowledge trails”)
  v see Vannevar Bush, “As We May Think”, Atlantic Monthly, July 1945; www.theatlantic.com/194507/bush
Faceted Search
❖ exploration of a domain via attributes
  v select a relevant attribute, and display the elements of the domain ordered according to the attribute
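A minimal sketch of this idea: pick an attribute and list the items of the domain grouped and ordered by its values. The music collection `ITEMS` and the function name `facet` are assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical music collection; each item carries several attributes (facets).
ITEMS = [
    {"title": "Song A", "genre": "jazz", "year": 1959},
    {"title": "Song B", "genre": "rock", "year": 1971},
    {"title": "Song C", "genre": "jazz", "year": 1971},
]


def facet(items, attribute):
    """Group item titles by the chosen attribute, ordered by attribute value."""
    groups = defaultdict(list)
    for item in items:
        groups[item[attribute]].append(item["title"])
    return dict(sorted(groups.items()))


by_genre = facet(ITEMS, "genre")
by_year = facet(ITEMS, "year")
```

Switching the attribute regroups the same items without a new query, which is what makes facets useful for exploring a domain.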
Faceted Search in iTunes
Variations on Faceted Search
❖ displaying lists of items ordered according to an attribute can get quite boring
❖ attributes often lend themselves to alternative presentation methods
  v visual
    v static
      v color, size, shape
    v dynamic
      v movement, changes over time
  v auditory
    v often for supplementary information
Knowledge Discovery
❖ combination of
  v Data Mining
  v Knowledge Extraction
  v Knowledge Fusion
Data Mining
❖ identification of interesting “nuggets” in huge quantities of data
  v often relations between subsets
  v automatic or semi-automatic
❖ techniques
  v classification, correlation (e.g. temporal, spatial)
Knowledge Extraction
❖ conversion of internal representations of knowledge into human-understandable format
  v extraction of rules from neural networks is one example
Knowledge Fusion
❖ multiple pieces of information are combined into one
  v redundancy
    v do several pieces contain the same type of information?
  v compatibility
    v do the individual pieces have similar formats and interpretations?
    v are there mappings to convert values into the same format?
  v consistency
    v are the values of the individual pieces close?
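The compatibility and consistency checks can be illustrated with a small sketch that fuses redundant temperature readings. The scenario, the unit mapping `to_celsius`, the `tolerance` threshold, and averaging as the combination rule are all assumptions chosen for this example.

```python
def to_celsius(value, unit):
    """Compatibility: map a reading into a common format (degrees Celsius)."""
    if unit == "C":
        return value
    if unit == "F":
        return (value - 32.0) * 5.0 / 9.0
    raise ValueError(f"no mapping for unit: {unit}")


def fuse(readings, tolerance=1.0):
    """Combine redundant readings into one value if they are consistent.

    readings is a list of (value, unit) pairs describing the same quantity.
    Consistency: the converted values must lie within `tolerance` of each other.
    """
    values = [to_celsius(value, unit) for value, unit in readings]
    if max(values) - min(values) > tolerance:
        raise ValueError("readings are inconsistent")
    return sum(values) / len(values)


# Three hypothetical sensors reporting the same room temperature.
fused = fuse([(20.0, "C"), (68.0, "F"), (20.5, "C")])
```

Readings that differ by more than the tolerance are rejected rather than averaged, since combining inconsistent pieces would hide the conflict.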
Image Search
❖ contextual
  v meta-data
  v text in the same or close-by documents
    v e.g. on the same Web page, or in the same directory
❖ content-based
  v analysis and comparison of images
  v feature extraction
Contextual Image Search
❖ images are indexed via keywords
  v captions of images
  v metadata (tags)
  v surrounding text
❖ relies on the assumption that the indexed text is correlated to the image and a good description of its content
❖ basis for most current image search engines
Content-Based Image Search
❖ images are compared against query images
  v no text elements as proxies
    v or at least not in a “pure” content-based image search
❖ relies on feature extraction or object recognition
  v direct comparison of pictures on a pixel-by-pixel basis is impractical
    v only yields identical pictures, not similar ones
  v computationally very challenging
    v especially if the query image is a subset of a target image
❖ allows the use of a picture as a “template” to find related pictures
Example: UC Santa Cruz Image Search
❖ feature extraction and object recognition from images and video
Using a single image as a template, computer software can find similar images in a large database of photos, as shown in these examples. Images courtesy of P. Milanfar.
http://www.physorg.com/newman/gfx/news/hires/newsearchtec.jpg
http://www.physorg.com/news177095786.html
Music Search: Shazam
❖ identifies musical pieces through “fingerprints”
  v emphasis on popular music
Summary Knowledge Retrieval
❖ identification, selection, and presentation of documents relevant to a user query
❖ utilization of structural information, context, meta-data in addition to keyword search
❖ organized presentation of results
  v categories, visual arrangement
❖ internal representations may be converted to human-understandable ones