HUMANS do it better! dmoz: The Open Directory Project
What is dmoz? • dmoz stands for Directory MOZilla • Also known as the Open Directory Project (ODP) • Searchable directory, similar to Yahoo! • Administered by Netscape as a noncommercial entity
Who maintains dmoz? • Data maintained by “expert” volunteers – Anyone can become an editor – 47,083 editors • ODP categorizes “quality” information – 378,028 categories
Interface features • Simple • No ads • Browseable directory • Regular and advanced search • http://www.dmoz.org/
Web coverage • dmoz - 3,260,681 documents • Google - 2,073,418,204 documents
dmoz directory structure
Top
  Arts
  Health
    Conditions & Diseases
      Sleep Disorders
        Narcolepsy
    Fitness
  World
RDF Format
<RDF xmlns:r="http://www.w3.org/TR/RDF/"
     xmlns:d="http://purl.org/dc/elements/1.0/"
     xmlns="http://directory.mozilla.org/rdf">
  <Topic r:id="Top">
    <tag catid="1"/>
    <d:Title>Top</d:Title>
  </Topic>
  <Topic r:id="Top/Arts">
    <tag catid="2"/>
    <d:Title>Arts</d:Title>
    <link r:resource="http://www3.bc.sympatico.ca/PHILLIPSHOTGLASS/GlassPage.html"/>
  </Topic>
  <ExternalPage about="http://www3.bc.sympatico.ca/PHILLIPSHOTGLASS/GlassPage.html">
    <d:Title>John phillips Blown glass</d:Title>
    <d:Description>A small display of glass by John Phillips</d:Description>
  </ExternalPage>
  <Topic r:id="Top/Computers">
    <tag catid="4"/>
    <d:Title>Computers</d:Title>
    <link r:resource="http://www.cs.tcd.ie/FME/"/>
    <link r:resource="http://pages.whowhere.com/computers/pnyhlen/Timeline.html"/>
  </Topic>
  <ExternalPage about="http://www.cs.tcd.ie/FME/">
    <d:Title>FME HUB</d:Title>
    <d:Description>Formal Methods Europe (FME) is a European organization supported by the Commission of the European Union (via ESSI of the ESPRIT programme), with the mission of promoting and supporting the industrial use of formal methods for computer systems development.</d:Description>
  </ExternalPage>
  <ExternalPage about="http://pages.whowhere.com/computers/pnyhlen/Timeline.html">
    <d:Title>Computer Timeline</d:Title>
    <d:Description>A brief description of the eras in computing.</d:Description>
  </ExternalPage>
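As a rough illustration of how a dump in this format could be read, the sketch below parses a small sample with Python's standard xml.etree.ElementTree module. The function name parse_dump and the returned dictionary layout are assumptions made for this example, not part of any dmoz tool; the sample reuses the "Top/Arts" records from the slide and adds a closing RDF tag so that it parses on its own.

```python
import xml.etree.ElementTree as ET

# Small self-contained sample in the slide's format (closing tag added here).
SAMPLE = """<RDF xmlns:r="http://www.w3.org/TR/RDF/"
     xmlns:d="http://purl.org/dc/elements/1.0/"
     xmlns="http://directory.mozilla.org/rdf">
  <Topic r:id="Top/Arts">
    <tag catid="2"/>
    <d:Title>Arts</d:Title>
    <link r:resource="http://www3.bc.sympatico.ca/PHILLIPSHOTGLASS/GlassPage.html"/>
  </Topic>
  <ExternalPage about="http://www3.bc.sympatico.ca/PHILLIPSHOTGLASS/GlassPage.html">
    <d:Title>John phillips Blown glass</d:Title>
    <d:Description>A small display of glass by John Phillips</d:Description>
  </ExternalPage>
</RDF>"""

# ElementTree expands prefixes into "{namespace-uri}name" form.
R = "{http://www.w3.org/TR/RDF/}"
D = "{http://purl.org/dc/elements/1.0/}"
DMOZ = "{http://directory.mozilla.org/rdf}"


def parse_dump(xml_text):
    """Return (topics, pages): topic id -> linked URLs, URL -> title/description."""
    root = ET.fromstring(xml_text)
    topics = {}
    for topic in root.iter(DMOZ + "Topic"):
        links = [link.get(R + "resource") for link in topic.iter(DMOZ + "link")]
        topics[topic.get(R + "id")] = links
    pages = {}
    for page in root.iter(DMOZ + "ExternalPage"):
        # The "about" attribute carries no prefix, so it has no namespace.
        pages[page.get("about")] = {
            "title": page.findtext(D + "Title"),
            "description": page.findtext(D + "Description"),
        }
    return topics, pages


topics, pages = parse_dump(SAMPLE)
for topic_id, urls in topics.items():
    for url in urls:
        print(topic_id, "->", pages[url]["title"])
```

A full dump is far too large to load with fromstring in one go; a streaming parser such as ET.iterparse would likely be a better fit there.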
Using dmoz data • Data is freely available for download • http://dmoz.org/rdf.html • http://dmoz.org/license.html • Must provide attribution and back-link • No Warranty
dmoz data • Many sites use dmoz data – AOL Search – Google – Lycos – HotBot – over 200 others • Some sites add enhancements and extensions – Google adds page rank – Lycos adds targeted ads
Searching dmoz • Boolean – implicitly AND – AND, OR, ANDNOT – allows shorthand (+, |, -) • Wildcard search (pup*) • Phrasal search • Mixed searches • Field-based queries
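The mapping from shorthand to explicit operators can be sketched as follows. This is a minimal illustration that assumes "+" means AND, "|" means OR and "-" means ANDNOT, with bare terms defaulting to AND; the normalize function is hypothetical, and how dmoz itself resolves operator precedence is not covered here.

```python
import re

SHORTHAND = {"+": "AND", "|": "OR", "-": "ANDNOT"}
KEYWORDS = set(SHORTHAND.values())
# A token is an optionally prefixed quoted phrase, or any run of non-space characters.
TOKEN = re.compile(r'[+|-]?"[^"]*"|\S+')


def normalize(query):
    """Return a list of (operator, term) pairs with every operator made explicit."""
    pairs = []
    pending = None  # operator keyword seen just before the next term
    for tok in TOKEN.findall(query):
        if tok in KEYWORDS:
            pending = tok
            continue
        op = pending or "AND"  # terms are implicitly ANDed
        pending = None
        if tok[0] in SHORTHAND and len(tok) > 1:
            op, tok = SHORTHAND[tok[0]], tok[1:]
        pairs.append((op, tok))
    return pairs


print(normalize('+"Chinese calendar" +"year of the ram"'))
print(normalize("pup* -cat |dog"))
```

Wildcards such as pup* pass through unchanged as terms; expanding them against an index would be a separate step.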
Search relevance • Queries performed against fields in the RDF database – For documents: title, description, URL – For categories: title, terms/keywords • Keywords are chosen manually; potentially more relevant • Results clustered by category and ranked according to the number of matches within a given category – Ranking is somewhat inconsistent, and the algorithm does not seem to be publicly documented – Some documents are flagged with a star and appear at the top of a directory listing (these do not seem to get special promotion in search results)
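Clustering results by category and ordering the clusters by match count could look roughly like the sketch below. The function cluster_by_category, the category names and the example.org URLs are all hypothetical; since the actual ranking is not publicly documented, this only illustrates the idea described on the slide.

```python
from collections import defaultdict


def cluster_by_category(hits):
    """Group (category, url) hits by category, largest cluster first."""
    clusters = defaultdict(list)
    for category, url in hits:
        clusters[category].append(url)
    # sorted() is stable, so categories with equal counts keep first-seen order.
    return sorted(clusters.items(), key=lambda item: len(item[1]), reverse=True)


# Hypothetical hits for a query such as "Chinese calendar".
hits = [
    ("Top/Society/Holidays", "http://example.org/a"),
    ("Top/Science/Calendars", "http://example.org/b"),
    ("Top/Science/Calendars", "http://example.org/c"),
]
ranked = cluster_by_category(hits)
for category, urls in ranked:
    print(category, len(urls))
```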
Relevance feedback • Not directly supported • Web forms for reporting feedback • http://dmoz.org/cgi-bin/feedback.cgi
Engine • Uses I-Search • http://www.etymon.com/Isearch/ • Open source • Modules may be added to enable searching of different document types • dmoz extensions to I-Search – RDF parsing module – Special search module, to return sub-records
More about I-Search • Supports many different kinds of queries – Vector search (or at least some sort of weighted keyword search) – Soundex (looks for "similar" words, English and similar only) – Boolean search – Geographic search (hits within a given x1, y1, x2, y2 box) – field searches (for structured documents, like RDF) • Thesaurus expansion and stopword lists supported • Queries translated into RPN and pushed onto a stack • Operations/operands are handled in a generic fashion • Has a number of options for searching (for exact terms): – dictionary (hash table) – binary search of sorted index
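The RPN-and-stack approach can be illustrated with a short sketch. This is not I-Search's actual code; evaluate_rpn and the tiny inverted index are hypothetical, and the operator names are borrowed from the dmoz search syntax.

```python
def evaluate_rpn(tokens, index):
    """Evaluate an RPN boolean query against an inverted index (term -> set of doc ids)."""
    operators = {
        "AND": lambda a, b: a & b,
        "OR": lambda a, b: a | b,
        "ANDNOT": lambda a, b: a - b,
    }
    stack = []
    for tok in tokens:
        if tok in operators:
            # Operators consume the two most recent results on the stack.
            right = stack.pop()
            left = stack.pop()
            stack.append(operators[tok](left, right))
        else:
            # Operands push the set of documents containing the term.
            stack.append(index.get(tok, set()))
    if len(stack) != 1:
        raise ValueError("malformed RPN query")
    return stack[0]


# Hypothetical index: document ids containing each term.
index = {"chinese": {1, 2, 3}, "calendar": {2, 3, 4}, "ram": {3}}

# Infix "(chinese AND calendar) ANDNOT ram" written in RPN.
result = evaluate_rpn(["chinese", "calendar", "AND", "ram", "ANDNOT"], index)
print(result)
```

Because operators and operands are handled in one loop, adding a new operator only means adding an entry to the operators table.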
dmoz vs. UNCA Library Catalog • UNCA Library Catalog has a fixed vocabulary • Library catalog created by trained professionals; dmoz uses “expert” volunteers • Both use field-based queries • dmoz always searches the same fields
dmoz vs. Google • Google uses dmoz’s data • Google is a search engine (good for finding specific information) • dmoz is a directory (good for finding general information) • Google adds page ranking to dmoz documents
Query 1: When is the next year of the Ram on the Chinese calendar?
• +"Chinese calendar" +"year of the ram" • Documents returned – Google: 10 – dmoz: 0 – Library: 0 • No dead links • No overlap • Relevance – Google: 70% – dmoz: N/A – Library: N/A
• +"Chinese calendar" • Documents returned – Google: 15,200 – dmoz: 10; 7 categories – Library: 2 • No dead links • Overlap – 4 pages (Google/dmoz) • Relevance – Google: 30% – dmoz: 30% – Library: 50%
Query 2: According to Douglas Adams, author of "Hitchhiker's Guide to the Galaxy," what is the answer to the question: "What is the meaning of life?"
• "douglas adams" hitchhiker guide galaxy "meaning of life" • Documents returned – Google: ~364 – dmoz: 0 – Library: 0 • No dead links • No overlap • Relevance – Google: 60% – dmoz: N/A – Library: N/A
• "meaning of life" answer • Documents returned – Google: 49,700 – dmoz: 1 – Library: 0 • No dead links • No overlap • Relevance – Google: 0% – dmoz: 0% – Library: 0%
Query 3: Find Morgan horse breeders in North Carolina • morgan horse breeders north carolina • Documents returned – Google: 1,140 – dmoz: 0 – Library: 0 • No dead links • No overlap • Relevance – Google: 40% – dmoz: N/A – Library: N/A
Questions?