LIS 618 lecture 10
Thomas Krichel
2003-04-23

Structure
• some repeats from last week
• other special syntaxes
• usenet news in google
• open directory project in google

query language II
• * is a wildcard for any word
• +stopword requires the presence of the stop word stopword. But the list of stop words has not been published.
• In fact, it varies from query to query.
• There is a limit of 10 words, but a * does not count towards the limit.
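The word limit can be illustrated with a small sketch. It assumes that words are simply separated by whitespace, which is a guess about how the search engine tokenizes a query; the function names are made up for this illustration.

```python
WORD_LIMIT = 10  # the limit stated on the slide


def counted_words(query: str) -> int:
    """Count the words that count towards the limit; a bare * is free."""
    # Assumption: words are separated by whitespace.
    return sum(1 for word in query.split() if word != "*")


def within_limit(query: str) -> bool:
    """True if the query stays within the stated word limit."""
    return counted_words(query) <= WORD_LIMIT


query = "three * mice see how they run"
print(counted_words(query), within_limit(query))  # 6 True
```

Under these assumptions a query can be padded with wildcards without using up any of the ten words.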

special syntax I
• intitle: find in title only, "intitle:google"
• intext: find in text only. This will exclude occurrences of the search term in anchor or title data. "intext:html"
• inanchor: This option requests pages for which there is another page that links to them with the anchor text in the query. Example: inanchor:"a list of my courses" finds my courses page because it has a link with that text.

special syntax
• cache: pages that are in the google cache, useful if a query result has nothing to do with the query terms
• cache:openlib.org/home/krichel will show the cached version of the page.
• If you add further terms, they will be highlighted.

daterange: special syntax
• limits the search to pages indexed within a range of dates. When the crawler visits a page, changed pages are reindexed; unchanged pages are not.
• dates are expressed as Julian day numbers, i.e. the number of days since noon UTC on 1 January 4713 BC in the Julian calendar. Today is 2452739.
• example: daterange:2452640-2452739
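Converting calendar dates to Julian day numbers by hand is tedious, so a short sketch may help. It relies on Python's `date.toordinal()` and on the fixed offset 1721425 between that ordinal and the Julian day number (1 January 2000 has ordinal 730120 and Julian day number 2451545). The function names and the search term are made up for this illustration.

```python
from datetime import date

# Offset between Python's proleptic Gregorian ordinal and the Julian day number.
JDN_OFFSET = 1721425


def julian_day_number(day: date) -> int:
    """Julian day number of a calendar date."""
    return day.toordinal() + JDN_OFFSET


def from_julian_day_number(jdn: int) -> date:
    """Calendar date for a Julian day number."""
    return date.fromordinal(jdn - JDN_OFFSET)


def daterange_query(terms: str, start: date, end: date) -> str:
    """Append a daterange: restriction to the search terms."""
    return f"{terms} daterange:{julian_day_number(start)}-{julian_day_number(end)}"


print(from_julian_day_number(2452739))       # 2003-04-09
print(julian_day_number(date(2003, 4, 23)))  # 2452753
print(daterange_query("krichel", date(2003, 1, 1), date(2003, 4, 9)))
# krichel daterange:2452641-2452739
```

By this conversion the number 2452739 quoted above corresponds to 2003-04-09, and the lecture date 2003-04-23 corresponds to 2452753.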

mixing special syntax expressions
• The link: syntax does not mix with others.
• Other bad ideas:
  – "site:openlib.org -inurl:openlib"
  – "site:edu site:com"
• Things that work well:
  – intitle:search
  – intitle:biology inurl:help
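Two of these rules are mechanical enough to check in code. The sketch below is only an illustration: the operator list is taken from the slides, the whitespace tokenization is an assumption, and the function names are made up. It flags link: mixed with other operators and repeated site: restrictions; it does not try to detect contradictory combinations such as the site:/-inurl: example.

```python
# Operators mentioned in these slides; the real list may be longer.
KNOWN_OPERATORS = {
    "intitle", "intext", "inanchor", "inurl",
    "site", "link", "cache", "daterange",
}


def operators(query: str) -> list[str]:
    """Return the special-syntax operators used in a query, in order."""
    found = []
    for token in query.split():
        # Drop a leading exclusion sign or parenthesis, then split at the first colon.
        name, colon, _ = token.lstrip("-(").partition(":")
        if colon and name.lower() in KNOWN_OPERATORS:
            found.append(name.lower())
    return found


def mixing_warnings(query: str) -> list[str]:
    """Warn about the combinations the slide calls out as not working."""
    ops = operators(query)
    warnings = []
    if "link" in ops and len(ops) > 1:
        warnings.append("link: does not mix with other special syntaxes")
    if ops.count("site") > 1:
        warnings.append("more than one site: restriction")
    return warnings


print(mixing_warnings("intitle:biology inurl:help"))  # []
print(mixing_warnings("site:edu site:com"))
```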

Examples
• George Bush site:nytimes.com
• "Copyright * The New York Times" "George Bush"
• intitle:"directory * * trees"
• Botany intitle:"directory of" site:edu
• "powered by blogger" or site:blogspot.com
• "classical music" (inurl:mailman | inurl:listserv)

phonebook: special syntax
• also rphonebook: for residential and bphonebook: for businesses
• A location seems to be required, e.g. phone:long island university ny
• no
  – wildcards
  – exclusions
  – or

stocks on google
• stocks:ticker will look up the ticker symbol ticker at http://finance.yahoo.com
• you can find ticker symbols there
• ticker symbols are useful to find financial information about publicly traded companies.

google images
• it has the following special syntaxes
  – intitle: searches for images on a page with a given title, "intitle:long island university"
  – inurl: searches for images in pages that have a certain url, inurl:liu.edu
  – site: restricts the search to a certain site, should be combined with a search term like "site:liu.edu koenig"

Google interfaces to 3rd party data
• Google groups are an interface to usenet news.
• Google directory is an interface to the Open Directory Project.
• In both cases Google is dependent on the quality of the underlying data source.

usenet news
• Usenet is a collection of user-submitted notes on various subjects that are posted to servers on a worldwide network. Each subject collection of posted notes is known as a newsgroup.
• A newsgroup is a discussion about a particular subject consisting of notes written to a networked site and distributed through Usenet.
• Newsgroups are hierarchical. Hierarchical levels are separated by dots, example: comp.text.tex
• alt is jokingly said to stand for "anarchists, lunatics and terrorists".
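The dot-separated naming can be unpacked with a few lines of code. This is a minimal sketch; the function name is made up for this illustration.

```python
def hierarchy_levels(newsgroup: str) -> list[str]:
    """List every level of the hierarchy that a newsgroup name belongs to."""
    parts = newsgroup.split(".")
    # Join the first 1, 2, ... components back together with dots.
    return [".".join(parts[: i + 1]) for i in range(len(parts))]


print(hierarchy_levels("comp.text.tex"))
# ['comp', 'comp.text', 'comp.text.tex']
```

The first element is the top-level hierarchy (here comp), and each later element narrows the subject further.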

usenet history
• The idea of network news was born in 1979 when two graduate students, Tom Truscott and Jim Ellis, thought of using UUCP to connect machines for the purpose of information exchange among users. They set up a small network of three machines in North Carolina.
• UUCP is "UNIX to UNIX copy", a protocol that is used to copy files between machines running some flavor of UNIX, without the need for the IP protocol. Usenet is older than the Internet.

decline of usenet
• essentially open to all (peer-to-peer system)
• used by spammers for
  – posting
  – gathering addresses
• steady decline of quality of contributions
• steady decline of quantity of contributions

usenet worth checking out
• independent reviews of products, often written by experts.
• Example: interpretation of Beethoven sonatas by Wilhelm Kempff.
• Sorting by date reveals that the newsgroup rec.music.classical.recordings is still active. On a good day, you will find no finer guide to records.

special syntax for usenet
• group: limits the search to postings in a certain group
• title: limits to titles of postings
• author: searches for author name or email address
• Mixing syntaxes works well

the open directory project
• "The Open Directory Project is the largest, most comprehensive human-edited directory of the Web. It is constructed and maintained by a vast, global community of volunteer editors."
• Claims that there is a historical precedent in the Oxford English Dictionary.
• Formerly known as "GnuHoo", then "NewHoo", then acquired by Netscape, and called "dmoz".

dmoz.org
• dmoz is maintained by volunteer "net-citizens". No special qualifications are required, but they are claimed to be experts.
• There are about 30,000 volunteers (they claim).
• Powers the core directory services for the Web's largest and most popular search engines and portals
  – Netscape Search
  – Google
  – HotBot
  – AOL Search
  – Lycos
  – DirectHit
• Headquarters run by Netscape

http://openlib.org/home/krichel
Thank you for your attention!