Intelligent Information Systems 2 Search Gio Wiederhold EPFL

Intelligent Information Systems
2. Search
Gio Wiederhold
EPFL, April-June 2000, at 14:15 - 15:15, room INJ 211
EPFL-2 S

Schedule
Presentations in English, but I'll try to manage discussions in French and/or German.
• I plan to cover the material in an integrating fashion, drawing from concepts in databases, artificial intelligence, software engineering, and business principles.
  1. 13/4 Historical background, enabling technology: ARPA, Internet, DB, OO, AI, IR.
  2. 27/4 Search engines and methods (recall, precision, overload, semantic problems).
  3. 4/5 Digital libraries, information resources. Value of services, copyright.
  4. 11/5 E-commerce. Client-servers. Portals. Payment mechanisms, dynamic pricing.
  5. 19/5 Mediated systems. Functions, interfaces, and standards. Intelligence in processing. Role of humans and automation, maintenance.
  6. 26/5 Software composition. Distribution of functions. Parallelism. [ww D. Beringer]
  7. 31/5 Application to Bioinformatics.
  8. 15/6 Educational challenges. Expected changes in teaching and learning.
  9. 22/6 Privacy protection and security. Security mediation.
  10. 29/6 Summary and projection for the future.
• Feedback and comments are appreciated.
10/6/2020 EPFL-2 S 2

Function of Search
Reduce the space of material
• Select likely data sources
• Extract relevant material
  Iterate for convergence if needed
• Integrate from diverse sources
  – Omit redundancy
• Summarize for presentation
⇒ INFORMATION

Link from resources to consumer
[Figure labels: Decision Makers, Intermediaries, Collections, Observations, Real World]

Information overload / Data starvation
Problem because of
• More databases
  – public & corporate
• Faster communication
  – digital
  – packeting: TCP-IP, ATM
• World-wide connectivity
  – internet
  – world-wide web
• Disintermediation
  – ubiquitous publishing

Change in Supply vs Demand
What information consumes is rather obvious: it consumes the attention of its recipients. Hence a wealth of information creates a poverty of attention, and a need to allocate that attention efficiently among the overabundance of information sources that might consume it. [Herbert Simon]

Data to Information
• Application Layer: users at workstations
• Mediation Layer: value-added services
• Foundation Layer: data resources

Support for Decision-Making
• Report current status
  – own status: under control of decision-maker
  – state of the world: not under control
• Trends from past history
  – temporal databases
• Projection into the future
  – effect of decisions
  – effect of external events
• Provide a limited number of interesting choices
  – avoid overload
  – leave choices to account for human insight

Relationships among search parameters
[Figure: recall and precision plotted over the space of methods. Vertical axis: percentage actually relevant (0%, 50%, 100%). Curve and region labels: perfect recall, perfect precision, recall, precision, volume retrieved, volume available.]
r = v.relevant / v.available
p = v.relevant / v.retrieved
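The two ratios on this slide can be computed directly from sets of document identifiers. The sketch below is illustrative only: it assumes that "v.relevant" means the relevant items that were actually retrieved, and that "v.available" means all relevant items available, which is the usual reading of recall and precision. The function and variable names are hypothetical.

```python
def evaluate(retrieved: set, relevant: set) -> tuple:
    """Return (recall, precision) for one query.

    Assumption: recall = relevant retrieved / relevant available,
    precision = relevant retrieved / total retrieved.
    """
    hits = len(retrieved & relevant)  # relevant items that were retrieved
    recall = hits / len(relevant) if relevant else 0.0
    precision = hits / len(retrieved) if retrieved else 0.0
    return recall, precision


# Hypothetical example: 4 relevant documents exist, 3 were retrieved, 2 of them relevant.
r, p = evaluate(retrieved={"a", "b", "x"}, relevant={"a", "b", "c", "d"})
print(f"recall={r:.2f} precision={p:.2f}")
```

Retrieving more material tends to raise recall and lower precision, which is the trade-off the figure depicts.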

Search methods add value
• catalog: humans catalog and place into a hierarchy many volunteered or recommended useful web sites.
• worms: automatically surf the web, extract key terms that are given explicitly or extracted from text.
• index: creation for fast referencing of extracted terms; provides base metrics of term usage/value.
• thesauri: augment indexes with broader/superior terms.
• wrappers: convert sources to common interface. Where?
• personalize: track queries to learn about user preferences; provide customer control over their profiles.
• cookies: track users' activities between sessions in client.
• classify: categorize customers into groups / subgroups?
• find patterns: summarize webpage usage or referencing.
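As a rough illustration of the "index" and "thesauri" entries above, the following sketch builds a small inverted index and then augments it with broader terms. It is a toy under stated assumptions: whitespace tokenization, a one-level thesaurus mapping each narrow term to a single broader term, and made-up page names.

```python
from collections import defaultdict


def build_index(pages: dict) -> dict:
    """Map each extracted term to the set of pages that contain it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for term in text.lower().split():
            index[term].add(url)
    return index


def add_broader_terms(index: dict, thesaurus: dict) -> dict:
    """Copy the index and also list each page under the broader term of its terms."""
    augmented = {term: set(urls) for term, urls in index.items()}
    for narrow, broad in thesaurus.items():
        if narrow in index:
            augmented.setdefault(broad, set()).update(index[narrow])
    return augmented


# Hypothetical pages and thesaurus entry.
pages = {
    "a.example": "poodle grooming tips",
    "b.example": "dog food reviews",
    "c.example": "train schedule",
}
index = add_broader_terms(build_index(pages), {"poodle": "dog"})
print(sorted(index["dog"]))
```

A query for the broader term now also reaches the page that only mentions the narrower one.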

Search engines use methods
• Yahoo: humans catalog and organize useful web sites into hierarchies with crosslinks. About 300 topic experts in 1999. Scalable?
• AltaVista: worms (surfs) and indexes all of the web. Frequency/volatility?
• Excite: also tracks queries and classifies customers [Amazon, ..]. How?
• Firefly: provides customer control over their profiles. Ease/effective?
• Junglee: integrates diverse wrapped sources, compares [Am.]. Semantics?
• Alexa: collects webpages and their usage, keeps old pages; suggests pages with similar access reference pattern. Value?
• Google: PageRank gives the reference importance of web pages, using weighted votes, iterated closure. Find new stuff?
• IBM research: web structure graph [Sridhar Rajagopalan @ Almaden]
and most of the e-commerce sites.
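The "weighted votes, iterated closure" idea can be sketched as a simple power iteration over a link graph. This is a minimal illustration, not Google's implementation: the damping factor of 0.85, the fixed iteration count, the even redistribution of rank from pages without outgoing links, and the four-page graph are all assumptions made for the example.

```python
def pagerank(links: dict, damping: float = 0.85, iterations: int = 50) -> dict:
    """Iteratively distribute each page's rank as weighted votes to its link targets."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Page without outgoing links: spread its vote evenly (an assumption).
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank


# Hypothetical link graph: page -> pages it references.
ranks = pagerank({"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]})
print(max(ranks, key=ranks.get))
```

Page "d" illustrates the slide's question "Find new stuff?": nothing references it yet, so it receives only the baseline share.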

Combinations and value
Yahoo / INKTOMI / Google
  – humans catalog and organize some web sites
  – provides indexing, automated searches to all
  – also keeps old pages; since 4 Apr. 00 has a directory ...
public service classification: www.demoz.org, 3000 editors
supersearch engines: invoke (in parallel) other engines
• Will they all become similar? Patents? Cost? Subcontracts!
Search is viewed by public as a commodity service
Portal value: assumption that people search before they buy
(IWon tries to attract people by offering daily/annual prizes)
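A supersearch engine that invokes other engines in parallel can be sketched with a thread pool. The two engine functions below are hypothetical stand-ins that return fixed result lists; a real metasearch service would call remote engines and would need ranking, timeouts, and error handling that this sketch omits.

```python
from concurrent.futures import ThreadPoolExecutor


def engine_a(query: str) -> list:
    # Hypothetical engine: returns canned results instead of contacting a server.
    return [f"a.example/{query}", f"shared.example/{query}"]


def engine_b(query: str) -> list:
    # Second hypothetical engine with one overlapping result.
    return [f"shared.example/{query}", f"b.example/{query}"]


def supersearch(query: str, engines: list) -> list:
    """Query all engines concurrently, then merge results and omit duplicates."""
    with ThreadPoolExecutor(max_workers=len(engines)) as pool:
        result_lists = list(pool.map(lambda engine: engine(query), engines))
    merged, seen = [], set()
    for results in result_lists:
        for url in results:
            if url not in seen:
                seen.add(url)
                merged.append(url)
    return merged


hits = supersearch("xml", [engine_a, engine_b])
print(hits)
```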

Up-to-dateness
[Figure: percentage up-to-date (0%, 50%, 100%) plotted against frequency of change (never, 1/year, 1/month, 1/week, 1/day, 1/hour, 1/minute, 1/second) and frequency of visits (0, 1, ?, as often as possible).]
∫ = effort, methods
Frequency of visits: F(user need), F(capability), given 2.2 M public sites with 288 M pages (Feb. 2000)
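One way to put numbers on this figure is a simple freshness model. The sketch below is an assumption, not something stated on the slide: it supposes that a page changes at random (Poisson) moments at a given average rate and is revisited at fixed intervals, in which case the expected fraction of time the stored copy is current is (1 - e^(-x)) / x, where x is changes per visit. The second function only divides the page count quoted on the slide by the seconds in a day.

```python
import math


def expected_freshness(changes_per_day: float, visits_per_day: float) -> float:
    """Expected fraction of time a copy is up to date (Poisson-change assumption)."""
    if changes_per_day == 0:
        return 1.0  # a page that never changes stays current
    if visits_per_day <= 0:
        return 0.0  # never revisited while the page keeps changing
    x = changes_per_day / visits_per_day  # average changes between two visits
    return -math.expm1(-x) / x


def fetch_rate_needed(pages: int, visits_per_day: float) -> float:
    """Pages per second a crawler must fetch to visit every page that often."""
    return pages * visits_per_day / 86_400


weekly_page = expected_freshness(changes_per_day=1 / 7, visits_per_day=1)
hourly_page = expected_freshness(changes_per_day=24, visits_per_day=1)
rate = fetch_rate_needed(288_000_000, visits_per_day=1)
print(f"{weekly_page:.2f} {hourly_page:.2f} {rate:.0f} pages/s")
```

Under these assumptions, daily visits keep a weekly-changing page mostly current but barely help for a page that changes hourly, and visiting all 288 M pages daily already requires thousands of fetches per second.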

Qualitative problems for search engines
• Unsuitable source representations (Progress)
  – part classification: HTML --- XML ..
  – print formats: postscript, adobe PDF ... (Being improved.)
  – non-text: images, video (more →), sound [AudioMining (Dragon)], ..
  – hidden in databases behind CGI scripts ... (Rate?)
• Inconsistent semantics
  – context distinct / scope / view ...
• Naïve modeling of customers
  – roles & growth ...
Search engines cannot solve all problems

HTML
• Major Web Source today: HyperText Markup Language
  – sharing of physics preprints [Tim Berners-Lee @CERN]
  – markup = embedded format commands for layout
• Multi-part, multi-representation (text, figs) documents
• Markups per SGML + (hyper = external) links
  – SGML = IBM-initiated Standard Generalized Markup Language for documents
  – basic commands are size, font, color independent, to be interpreted by the publisher for report, book, manual, ...
Alternative to (a.o.) (also UNIX runoff, …)
• XEROX-initiated Postscript (PS), Adobe PDF
  – exact bit-wise layout via executable script
• TeX markup: detail (pretty math) [Knuth], LaTeX macros [Lamport]
  – generates device independent format (DVI), then PS
Problem: Markups not directly relevant for search, although invisible keywords can be added (to fool the search engines)

XML
• Machine Processable! With defined, context-specific markups
  – Digital Library: people-to-machines
    • documents marked up with content type indication: abstract, chapters, ...
  – E-commerce (C2B): people-to-machines
    • catalogs marked up with item type, price, quality?, weight, delivery, ...
• Requires interpretation
  – for printing:
    • XSL converts to HTML by matching markups and inserting HTML codes
  – for searching and querying
    • XQL, …, using field labels like database schema attribute names
  – for processing: automated comparison, ordering, ...
• Mediated: people-to-services-to-machines
  – for Business (B2B): machine-to-machine(s) (Expected)
    • standard ontologies - DTDs
  – Ubiquitous: gadget-to-gadget; new proposals: Bluetooth (Future)
    • embedded interpreters of small ontologies
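The point about field labels acting like database attribute names can be illustrated with a small catalog query. The sketch uses Python's standard xml.etree.ElementTree module rather than XQL, and the tag names (catalog, item, vendor, type, price) and the data are invented for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical catalog marked up with item type, vendor, and price.
CATALOG = """
<catalog>
  <item><vendor>shop-a</vendor><type>modem</type><price>120.00</price></item>
  <item><vendor>shop-b</vendor><type>modem</type><price>95.50</price></item>
  <item><vendor>shop-a</vendor><type>router</type><price>80.00</price></item>
</catalog>
"""


def cheapest(catalog_xml: str, item_type: str):
    """Select items by their <type> label and compare them on <price>."""
    root = ET.fromstring(catalog_xml)
    matches = [item for item in root.findall("item")
               if item.findtext("type") == item_type]
    if not matches:
        return None
    best = min(matches, key=lambda item: float(item.findtext("price")))
    return best.findtext("vendor"), float(best.findtext("price"))


offer = cheapest(CATALOG, "modem")
print(offer)
```

The comparison works only because both vendors use the same labels with the same meaning, which is where the standard ontologies and DTDs mentioned above come in.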

Video stream indexing
Area with rapid progress 1998-2000, now real-time
• Important to the television industry: funding. CNN: all 12 streams
• Aggregated techniques, complementary
  – high performance engines with much storage, parallel computation
  – segmentation of video streams by
    • complete image change (not all)
    • sound change, music intervals
  – indexing of segments
    • textual subtitles: service for hard-of-hearing
    • speech recognition for words: not perfect, but significant
    • face segmentation in individual images, not recognition
    • speech identification of popular persons: politicians, announcers
• now better than newspaper morgues
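Segmentation by "complete image change" can be approximated by comparing consecutive frames and marking a boundary where the difference is large. The sketch below is a deliberately simple stand-in: frames are short lists of grey values, the difference measure is the mean absolute pixel difference, and the threshold of 50 is an arbitrary assumption. Production systems combine this with the sound and subtitle cues listed above.

```python
def mean_abs_diff(frame_a: list, frame_b: list) -> float:
    """Average absolute difference between corresponding pixel values."""
    return sum(abs(a - b) for a, b in zip(frame_a, frame_b)) / len(frame_a)


def segment_boundaries(frames: list, threshold: float = 50.0) -> list:
    """Return indexes of frames that start a new segment (large image change)."""
    return [i for i in range(1, len(frames))
            if mean_abs_diff(frames[i - 1], frames[i]) > threshold]


# Hypothetical 4-pixel frames: two dark frames followed by two bright ones.
frames = [
    [10, 10, 10, 10],
    [12, 11, 10, 9],
    [200, 210, 190, 205],
    [198, 209, 192, 204],
]
cuts = segment_boundaries(frames)
print(cuts)
```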

Domain Specific Catalogs
Objective: more depth than a general catalog can provide
Accessed directly or by higher-level search engines
• Knowledge Acquisition (20% effort) &
• Knowledge Maintenance (80% effort *)
to be performed by
  • Domain specialists
  • Professional organizations
  • Field teams of modest size
autonomously maintainable
Empowerment
* based on experience with software

Summary
Search requires precision
• Customer models
  – to control and simplify the process
• Value models
  – to increase relevance
• Semantic consistency for the customer
  – semantic translations from contexts
Many interesting research tasks
Technology transfer: how to integrate good ideas operationally?