Web searching & the invisible web
Finding things that are hard to find
© Tefko Saracevic, Principles of Searching
Dictionary definitions
- World Wide Web: Internet-connected files; the very large set of linked documents and other files located on computers connected through the Internet and used to access, manipulate, and download data and programs
- Invisible: not easily noticed; not noticed or detected readily
- Invisible web: not yet in the dictionary
What is the "invisible web"?
- Materials that general search engines cannot or WILL not include in their collections of web pages (indexes)
  - you cannot find them through general search engines
- Contains a vast amount of information resources
  - much of it authoritative & of higher quality than the visible web
    - quality becomes a main issue
  - much of it specialized
  - a lot of it also fluid, streaming, or real time
    - "You can't step in the same river twice"
  - much of it free
- Many times larger than the visible web
in other words…
There is much more to the web than [image not recovered] or [image not recovered]
Distribution of use: [figure not recovered]
Why do search engines not cover everything?
- Size: the web is huge; engines cannot cover all of it
- Economics: associated costs are high
  - engines support themselves mostly by ads
  - a number of engines also offer pay-for-rank & pay-for-crawl-update, providing paid listings first & mostly
- Technical: still a challenge & limited capabilities
  - some file formats are also hard to cover
- Spam: eliminating the bad also loses the good
- Restrictions: some sites do not let crawlers in
- Deep structure: some sites are complex
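The "Restrictions" point is commonly implemented through a robots.txt file, in which a site asks crawlers to stay out of certain paths. The following is a minimal sketch using Python's standard urllib.robotparser module. The robots.txt content, the crawler name ExampleBot, and the example.org URLs are assumptions made up for illustration; they do not come from the lecture.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an imaginary site (illustration only).
ROBOTS_TXT = """\
User-agent: *
Disallow: /members/
Disallow: /search
"""


def allowed(url: str, user_agent: str = "ExampleBot") -> bool:
    """Return True if the sample robots.txt lets this crawler fetch the URL."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in (
        "http://example.org/index.html",
        "http://example.org/members/list.html",
        "http://example.org/search?q=invisible+web",
    ):
        print(url, "->", "crawl" if allowed(url) else "skip")
```

Pages behind a Disallow rule never reach the engine's index, so they stay invisible to its users even though they are publicly reachable in a browser.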
How do search engines work?
Main parts:
- Crawlers, spiders: go out to find content
  - looking for new & changed sites
  - periodic, not for each query
    - no search engine works in real time
- Organizing content: labeling, arranging
  - indexing for searching or classifying as a directory
- Databases, caches: storing content
- Retrieval engine: searching on the basis of a query
- Interface: handles the query, displays results
All based on various, mostly proprietary algorithms
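The parts listed above can be sketched as a toy pipeline: a crawler takes a snapshot, an organizer builds an inverted index, and a retrieval function answers queries from that index only. This is an illustrative sketch, not how any real engine is built; the in-memory WEB dictionary, the example.org URLs, and all function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical stand-in for the live web: URL -> page text.
WEB = {
    "http://example.org/a": "invisible web databases",
    "http://example.org/b": "search engines index web pages",
}


def crawl(web):
    """Crawler: copy the pages as they exist at crawl time."""
    return dict(web)


def build_index(snapshot):
    """Organizer: build an inverted index, word -> set of URLs."""
    index = defaultdict(set)
    for url, text in snapshot.items():
        for word in text.lower().split():
            index[word].add(url)
    return index


def search(index, query):
    """Retrieval engine: URLs that contain every query word."""
    words = query.lower().split()
    if not words:
        return []
    hits = set.intersection(*(index.get(word, set()) for word in words))
    return sorted(hits)


if __name__ == "__main__":
    live = dict(WEB)
    index = build_index(crawl(live))
    # A page published after the crawl stays invisible until the next crawl.
    live["http://example.org/c"] = "new invisible page"
    print(search(index, "invisible"))
    print(search(build_index(crawl(live)), "invisible"))
```

Because queries run against the stored index rather than the live pages, the first search misses the new page. That is the sense in which no search engine works in real time.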
Search engine coverage
- No engine covers more than a fraction of the WWW
  - estimates: none more than 16%
- Hard (impossible) to discern & compare coverage
- Many national search engines
  - own coverage, orientation, governance
- Many topical or domain search engines
  - own coverage geared to the subject of interest
- Many comprehensive sources independent of search engines
  - some are compilations of evaluated web sources
Search engines differ
- Substantial differences among search engines on each of these parts
  - need to know how they work & differ
- Information about search engines:
  - Search Engine Watch
    - ratings, news, statistics, charts, explanations, tutorials
  - Search Engine Showdown
    - "The users' guide to web searching"; run by a librarian; news links, ratings
Invisible web searching: Basic approach
- The first step in determining the best approach for searching the invisible web is to have a clear idea of what you're seeking
  - extensive user modeling
- Limit your search to appropriate resources & tools for the particular type of information you're looking for
  - know your sources
  - know how to find appropriate sources
    - shades of "Knowledge is of two kinds…"
Specialized sources, particularly for the invisible web
The rest of the lecture covers:
1. Meta search engines
2. Specialized engines & catalogs
3. Domain (subject) engines & catalogs
4. Reference sources
5. Libraries as web sources
6. Virtual libraries
7. Subject databases
8. Societies, organizations
9. Good old books
Meta search engines
- Meta search engines search multiple engines
  - getting combined results from a variety of engines
- Finding a search engine or meta engine:
  - SearchEngines.com
    - search for engines by topic, geography, reference
  - Search Engine Guide
    - engines categorized by topic; other engine information
  - Search Engine Colossus
    - international directory of search engines by country & topic, from 198 countries and 61 territories; engines in a choice of languages
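The idea of combining results from several engines can be sketched as follows. This is only an illustration under stated assumptions: the engine names, the result lists, and the merge rule (more engines first, then best rank) are all hypothetical, and real meta engines use their own, mostly undisclosed, methods.

```python
def merge_results(results_by_engine):
    """Merge ranked URL lists from several engines into one list.

    Returns (url, engines) pairs, ordered by how many engines returned
    the URL, then by its best rank, then alphabetically.
    """
    merged = {}
    for engine, urls in results_by_engine.items():
        for rank, url in enumerate(urls, start=1):
            entry = merged.setdefault(url, {"engines": [], "best_rank": rank})
            if engine not in entry["engines"]:
                entry["engines"].append(engine)
            entry["best_rank"] = min(entry["best_rank"], rank)
    ordered = sorted(
        merged.items(),
        key=lambda item: (-len(item[1]["engines"]), item[1]["best_rank"], item[0]),
    )
    return [(url, info["engines"]) for url, info in ordered]


if __name__ == "__main__":
    # Hypothetical result lists standing in for two engines' answers.
    sample = {
        "engine_a": ["http://example.org/x", "http://example.org/y", "http://example.org/z"],
        "engine_b": ["http://example.org/y", "http://example.org/w"],
    }
    for url, engines in merge_results(sample):
        print(url, "found by", ", ".join(engines))
```

Keeping the list of engines next to each URL is what lets a searcher compare overlap between engines, as some meta engines do by showing the source of each result.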
Sample of meta engines
- Some meta engines provide organized results:
  - Dogpile
    - results from a number of leading search engines; gives the source, so overlap can be compared (also has a (bad) joke of the day)
  - Surfwax
    - gives statistics and text sources & links to sources; for some terms gives related terms to focus the search
  - Teoma
    - results with suggestions for narrowing; links to derived resources; originated at Rutgers
  - Turbo 10
    - provides results in clusters; engines searched can be edited
meta search engines (cont.)
- Large directory
  - Complete Planet
    - directory of over 70,000 databases & specialty engines
- Results with graphical displays
  - Vivisimo
    - clusters results; innovative
  - Webbrain
    - results in tree structure; fun to use
  - Kartoo
    - results in display by topics of query
Domain engines & catalogs
- Cover general & specific subjects
  - Open Directory Project
    - large edited catalog of the web; global, run by volunteers
  - BUBL LINK
    - selected Internet resources covering all academic subject areas; organized by the Dewey Decimal System; from the UK
  - Profusion
    - search in categories for resources & search engines
  - Resource Discovery Network (UK)
    - "UK's free national gateway to Internet resources for the learning, teaching and research community"
domain engines …
- Available in a variety of domains & subjects: rich!
  - Think Quest (Oracle Education Foundation)
    - education resources, programs; web sites created by students
  - All Music Guide
    - resource about musicians, albums, and songs
  - Internet Movie Database
    - treasure trove of American and British movies
  - Genealogy links and surname search engines
    - well... that is getting really specialized (and popular)
domain engines …
- Scholarship, science
  - Psychcrawler (American Psychological Association)
    - web index for psychology
  - Entrez PubMed (National Library of Medicine)
    - biomedical literature from MEDLINE & health journals
  - CiteSeer (NEC Research Center)
    - scientific literature, citation index; strong in computer science
  - Scholar Google
    - searches for scholarly articles & resources
  - Infomine
    - scholarly Internet research collections
Reference services
- Reference services: several models
  - Ask Jeeves!
    - most popular, commercial
  - Information Please
    - almanac-type questions
  - RefDesk
    - access to a number of reference tools
  - Wikipedia
    - web encyclopedia in many languages
  - Martindale's The Reference Desk
    - probably the most amazing & versatile reference collection on the web; numerous sections, great to explore
reference …
- Digital reference: new service area for libraries
  - QuestionPoint (Library of Congress & OCLC)
    - project for a global reference network
  - Virtual Reference Desk (Library of Congress)
    - large compilation of web reference sites
  - LiveRef (maintained at Iowa State U)
    - a registry of real-time digital reference services
Libraries as web sources
- Academic, national libraries providing open collections & services; models vary
  - Rutgers libraries
    - big long-term effort
  - University of California, Berkeley
    - a most elaborate effort, together with Sun Corporation
  - LibWeb (U California, Berkeley)
    - "lists currently over 7200 pages from libraries in over 125 countries"
  - Bibliothèque Nationale de France
    - includes virtual exhibitions, among others
Virtual libraries on the Web
- Libraries emerging only on the Web
  - Virtual Library
    - Switzerland, US, UK & other countries; 'oldest virtual library on the Web'
  - Internet Public Library (U of Michigan)
    - also a long-term effort
  - Librarians Index to the Internet
    - very popular and comprehensive
  - Digital Librarian
    - "a librarian's choice of the best of the Web"; compiled & annotated by a librarian
virtual libraries …
  - Academic Info Digital Library
    - many links to digital collections & resources in various subjects
  - Gabriel
    - Gateway to European National Libraries
  - Museum of online museums
    - a delight
  - Stanford Encyclopedia of Philosophy
    - a comprehensive encyclopedia and library
  - The historical New York Times Project
    - universal library; ongoing digitization
Subject resources
- Many subject-specific sites
  - rich & often unique coverage & services
  - different approaches & requirements
- Examples in health-related domains:
  - WebMD Health
    - news, medical information
  - RxList
    - The Internet Drug Index
  - Mayo Clinic HealthOasis
    - health advice
  - Kidshealth
    - sites for parents, kids, teens
Subject resources …
- Scholarship, humanities, government
  - KIRKE (Katalog der Internetressourcen für die Klassische Philologie aus Erlangen)
    - German; a variety of resources for classics
  - Perseus Digital Library (Tufts University)
    - covers antiquity to the Renaissance; one of the best subject sites on the web; affected the whole field
  - School of Slavonic & East European Studies, University College London
    - includes country resources, e.g. Croatia
  - U Mich Documents Center
    - official documents from all over the world
Subject resources …
- Growing number of resources in arts, museums
  - MuseumStuff.com
    - "We have 1000's of museums, zoos, historical societies and related organizations in our database"
  - The State Hermitage Museum
    - one of the greatest museums in the world, and one of the best museum sites; developed with IBM help
  - National Museum of Science and Technology Leonardo da Vinci
    - Guess where those pictures came from. A delight!
subject resources …
  - Diotima
    - materials for the study of women and gender in the ancient world
  - Moving Image Collections
    - "MIC documents moving image collections around the world." Part of it is particularly oriented toward science educators. Now at the Library of Congress, but developed at Rutgers.
And, of course …
  - Snoopy
    - The Official Peanuts Website
Societies, organizations
- Many societies, agencies developed their sites
  - great many rich sources for searching & resources
  - differences in requirements, depth, richness
  - Assoc. for Computing Machinery
    - Digital Library; subscription or registration, or through RUL
  - US State Department
    - about the U.S. & other countries
  - FirstGov
    - the US government official web portal
  - Ocean Planet
    - NASA presentation of earth & its vast oceans
  - arXiv (Cornell U, National Science Foundation)
    - e-print service in the fields of physics, mathematics, nonlinear science, computer science, and quantitative biology
Archiving, books on the web
- Internet Archive: a large undertaking
  - includes web archive & lots more
  - publicly available & free
  - 10 billion web pages archived from 1996 to a few months ago
- Wayback Machine: search to look at old versions of web pages
- Books on the web
  - Million Book Project
    - digitizing books and providing free access
  - International Children's Digital Library
    - online children's books
  - Digital Book Index
    - "links to more than 105,000 title records from more than 1800 commercial and non-commercial publishers, universities, and various private sites"
Language barriers on the Web
- English still the major language
  - but declining, now slightly over 50%
- Multilingual retrieval search engines
  - Euroseek
    - searches in a number of languages
  - All the Web
    - results in 45 languages
Web news; keeping up
- What is going on on the Web? Some major sources of news and evaluations:
  - Free Pint
    - newsletter, articles, links; nice & sometimes quirky
  - Internet Resources Newsletter
    - UK based; monthly newsletter for "academics, students, engineers, scientists and social scientists"
  - ResearchBuzz
    - daily updates; many aspects; "Collection of items on search engines, online databases, and other information resources"
  - About.com Web Search
    - tools, Web Search Forum
keeping up …
  - Information Today
    - trade & professional monthly newspaper & web site; industry news; searcher columns; general analyses of trends
- Keeping up through the blogosphere:
  - Resource Shelf
    - blogger about the Internet (and some other stuff), with archive; it has really good and really bad exchanges & threads
  - New York Times blogrunner (The annotated NYT)
    - blog tracking of NYT articles, topics, authors; threads into discussions on many other weblogs; includes net & web topics
Finding links & listings: back to good old books with a new twist
- A number of books on web searching also have sites with the links from the book, updates, news
  - Extreme Searcher (Randolph Hock)
    - update of a popular book; links by chapter topics
  - The Web Library (Nicholas G. Tomaiuolo)
    - spotlights free resources; links by chapter and new topics; done by a librarian
  - The Invisible Web (Chris Sherman & Gary Price)
    - original book on the topic; links organized by subject
p.s. most, but not all, of the sites in this lecture can be found on those sites, and much, much more
Evaluations, ratings
- Evaluating web sites: a prime responsibility of searchers & all information professionals
- Many sources evaluate web sites:
  - The Scout Report
    - librarians' BIBLE! Annotations. Comprehensive.
  - Medical Library Association
    - ten most useful sites for consumer health
  - MLA user guide
    - for finding & evaluating health information on the web
  - Web 100
    - commercial; user ranking & evaluation of web sites
  - Evaluating web pages (UC Berkeley)
    - tutorial and guide
Needed for Web searching
- Knowledge & competencies on
  - variety of web sources & their organization
  - search engines
  - web search strategies
  - search dynamics, feedback
- Keeping up & up
  - Why? Many reasons, such as:
    - constant updates, changes, innovations
    - many domain/subject specific
    - fluidity very high
Needed for web searching by professionals
- Knowledge of SOURCES in area of interest
  - search engines not enough
    - not too helpful in finding these other sources; structure hard to discern
  - find & use specialized sources
- Evaluation of sources
  - a key professional skill!
  - application of standard criteria & web criteria: authority; accuracy; currency (timeliness); objectivity; coverage; persistence; usability
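One way to apply the criteria above consistently is to record a rating for each and see which ones were skipped. The following is a minimal sketch only: the 0 to 2 rating scale, the function name, and the averaging are assumptions made for illustration and are not part of the lecture.

```python
# The seven criteria named on the slide.
CRITERIA = (
    "authority",
    "accuracy",
    "currency",
    "objectivity",
    "coverage",
    "persistence",
    "usability",
)


def summarize(ratings):
    """Summarize ratings (hypothetical scale: 0 = poor, 1 = fair, 2 = good).

    Returns the average of the rated criteria and the criteria not yet rated.
    """
    unknown = sorted(set(ratings) - set(CRITERIA))
    if unknown:
        raise ValueError(f"unknown criteria: {unknown}")
    rated = [ratings[name] for name in CRITERIA if name in ratings]
    missing = [name for name in CRITERIA if name not in ratings]
    average = sum(rated) / len(rated) if rated else None
    return {"average": average, "missing": missing}


if __name__ == "__main__":
    print(summarize({"authority": 2, "accuracy": 1, "currency": 0}))
```

The list of missing criteria matters as much as the average: a site rated only on usability has not really been evaluated.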
Needed competencies …
- Knowledge of users & use
- Knowledge of searching
- Use of technology
- Adaptability, flexibility
- Integration with other resources
- Teaching others
- Constant learning & update
  - again: keeping up, keeping up
  - and again: keeping up, keeping up
But now really: How to do it?
[figure labels: information, WWW; image not recovered]
Images from the invisible web
[image slides; images not recovered]
and of course… [image not recovered]
P.S. a nice site
Poem by Emily Dickinson (1830-1886): In a Library
Who will write a poem: In a digital library???
P.S. a few weird or fun sites…
- SelectSmart.com
  - all kinds of quizzes for you
- James Dean official web site
- Deaducated
  - Dead Librarians' Society
- Livejournal
  - blogs & authoring tools; and many pathetic entries
- Airline meals
  - "the world's first and leading site about nothing but airline food" … some 12,000 pictures from 447 airlines
  - it is not weird, but for real and great fun
Sources
- About.com Web Search http://websearch.about.com
- Academic Info Digital Library http://www.academicinfo.net/digital.html
- Airline meals http://www.airlinemeals.net/
- All the Web http://www.alltheweb.com/
- Ask Jeeves! http://www.ask.com/
- Assoc. for Computing Machinery http://www.acm.org/
- Bibliothèque Nationale de France http://www.bnf.fr/
- BUBL LINK http://bubl.ac.uk/link/
- CNET Search.com http://www.search.com/
- CiteSeer http://citeseer.nj.nec.com/
- CompletePlanet http://completeplanet.com
- Deaducated http://www.geocities.com/deadlibrarians/
- Digital Book Index http://www.digitalbookindex.org/about.htm
- Digital Librarian http://www.digital-librarian.com/
- Diotima http://www.stoa.org/diotima/
- Dogpile http://www.dogpile.com/
- Entrez PubMed http://www.ncbi.nlm.nih.gov/PubMed/
- Extreme Searcher http://www.extremesearcher.com/
- Free Pint http://www.freepint.com/
- Gabriel http://www.kb.nl/gabriel/
- Genealogy http://darcisplace.com/darci/search.htm
sources …
- Hermitage http://www.hermitagemuseum.org/html_En/index.html
- Information Please http://www.infoplease.com/
- International Children's Digital Library http://www.icdlbooks.org/
- Internet Archive http://www.archive.org/
- Internet Public Library, Michigan http://www.ipl.org/
- Internet Resources Newsletter http://www.hw.ac.uk/libwww/irn/
- James Dean http://www.jamesdean.com/
- Kartoo http://www.kartoo.com/
- KIRKE http://www.phil.uni-erlangen.de/~p2latein/ressourc.html
- Leonardo da Vinci Museum http://www.museoscienza.org/english/
- Librarians Index to the Internet http://lii.org/
- Live Journal http://www.livejournal.com/
- LiveRef http://www.public.iastate.edu/~CYBERSTACKS/LiveRef.htm
- Martindale's The Reference Desk http://www.martindalecenter.com/
- Mayo Clinic http://www.mayohealth.org/
- Medical Library Assoc. ten top sites http://www.mlanet.org/resources/medspeak/topten.html
- Medical Library Assoc. user guide for health inf. http://www.mlanet.org/resources/userguide.html
- Medscape http://www.medscape.com/
sources …
- Million Book Project http://www.archive.org/texts/collection.php?collection=millionbooks
- Museum of online museums http://www.coudal.com/moom.php
- MuseumStuff http://www.museumstuff.com/
- NYT blogrunner http://nytimes.blogrunner.com/
- NYT historical project http://www.nyt.ulib.org/
- OCLC Web Characterization Project http://wcp.oclc.org/
- Open Directory Project http://dmoz.org
- Perseus Digital Library http://www.perseus.tufts.edu/
- Profusion http://www.profusion.com/
- Psychcrawler http://www.psychcrawler.com/
- QuestionPoint http://www.questionpoint.org/
- ResearchBuzz http://www.researchbuzz.com/index.shtml
- Resource Shelf http://resourceshelf.blogspot.com/
- Rutgers Libraries http://www.libraries.rutgers.edu/
- RxList http://www.rxlist.com/
- Sch of East Eur & Slavonic Studies http://www.ssees.ac.uk/dirctory.htm
- Search Engine Colossus http://www.searchenginecolossus.com/
- Search Engine Guide http://www.searchengineguide.com/
- Search Engine Showdown http://searchengineshowdown.com/
sources …
- Search Engine Watch http://searchenginewatch.com/
- SelectSmart.com http://www.selectsmart.com/home.html
- Snoopy http://www.snoopy.com/
- Stanford Encyclopedia of Philosophy http://plato.stanford.edu/
- Surfwax http://www.surfwax.com/
- Teoma http://teoma.com/
- The Invisible Web http://www.invisible-web.net/
- The Scout Report http://scout.cs.wisc.edu/
- The Web Library http://www.ccsu.edu/library/tomaiuolon/theweblibrary.htm
- Think Quest http://www.thinkquest.org/
- Turbo 10 http://turbo10.com/
- U California Berkeley http://sunsite.berkeley.edu/
- U Mich Documents Center http://www.lib.umich.edu/govdocs/
- US State Department http://www.state.gov/
- Virtual Library http://vlib.org
- Virtual Reference Desk http://www.loc.gov/rr/askalib/virtualref.html
- Vivisimo http://vivisimo.com
- Web 100 http://www.web100.com
- Webbrain http://www.webbrain.com/html/default_win.html
- WebMD http://my.webmd.com/webmd_today/home/default
- Wikipedia http://www.wikipedia.org/