Google and Google Scholar
Roger Mills and Judy Reading
May 2007

Welcome to the Web
The world’s biggest haystack

What can you do in a haystack?
• Romp
• Get hay fever
• Have unexpected encounters
• Sleep
• Not do research
• So what do you fancy?

Finding needles
• Google helps you find needles in haystacks
• But:
• Google is an index of web pages
• A journal article is not a web page
• So Google is not good at finding journal articles
• However:
• An image of a journal article may be placed on a web page
• So Google may find it
• If it’s free and not behind a firewall
• How do you know?

Google is fast
• Very fast
• Proudly fast
• Tells you how fast
• Found OUCS home page in 0.09 secs
• Also found 350,000 other ‘relevant’ pages
• But put home page first
• Brilliant - how does it do it?
• Not telling…

Did I need 350,000 references?
• Nobody looks at all the references Google retrieves
• So why display them?
• Algorithm takes into account links made by other pages
• And click-throughs
• So the top result for a given search is determined over time by the people who make that search
• Is that the same as the ‘best’ result?
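
The slide says only that the algorithm takes into account links made by other pages, and Google does not disclose the details. As a rough illustration of link-based ranking in general (in the spirit of published link-analysis methods such as PageRank, and not a description of what Google actually runs), the sketch below passes each page's score along its outgoing links until the scores settle. The page names, the damping value and the iteration count are all assumptions made for the example.

```python
def link_scores(links, damping=0.85, iterations=50):
    """Score pages by the links pointing at them (simplified sketch).

    links maps each page name to the list of pages it links to.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)
    scores = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page starts each round with a small baseline score.
        new_scores = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Share this page's score equally among the pages it links to.
                share = damping * scores[page] / len(targets)
                for target in targets:
                    new_scores[target] += share
            else:
                # A page with no outgoing links spreads its score evenly.
                share = damping * scores[page] / n
                for other in pages:
                    new_scores[other] += share
        scores = new_scores
    return scores


# Hypothetical mini-web: three pages link to "home", which links to "news".
example_links = {
    "home": ["news"],
    "news": ["home"],
    "blog": ["home"],
    "shop": ["home"],
}
scores = link_scores(example_links)
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

The page with the most incoming links comes out on top, which is popularity rather than quality: exactly the question the slide ends on.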

OK, how would you do it?
• To index a document, I’d read it first.
• Google can’t read
• We don’t read the web – we view it
• We remember references visually – that red book on the third shelf down…
• If Google can list all the red books on all the third shelves down in all the world I’m bound to find it, right?
• Actually I remember I saw it in Oxford, so I just need to list all the red books in Oxford – doddle
• That’s not really how Google works – is it?

So you read the article, and then…?
• Give it some index terms
• Not ones I’ve just made up, but ones from a standard list.
• That way, everyone will know what the article’s about, and every article on the same topic can be found.
• Provided everyone agrees what the article’s about.
• Then I’d list the authors in a standard form: so everything by Roger Mills, Roger Anthony Mills, Roger A Mills, R Anthony Mills, R A Mills can be found in one go.
• That’s a controlled vocabulary.
• Works for journal titles too.
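
A minimal sketch of the author example above, assuming a hand-built authority file that maps every variant form of a name to one standard heading. The heading "Mills, Roger Anthony", the record titles and the function names are invented for illustration; real databases maintain such files editorially.

```python
# Hypothetical authority file: each variant form points to one standard heading.
AUTHOR_AUTHORITY = {
    "roger mills": "Mills, Roger Anthony",
    "roger anthony mills": "Mills, Roger Anthony",
    "roger a mills": "Mills, Roger Anthony",
    "r anthony mills": "Mills, Roger Anthony",
    "r a mills": "Mills, Roger Anthony",
}


def normalise(name):
    """Lower-case the name, drop full stops and collapse extra spaces."""
    return " ".join(name.replace(".", " ").lower().split())


def standard_form(name):
    """Return the standard heading for a name, or None if it is not listed."""
    return AUTHOR_AUTHORITY.get(normalise(name))


def find_by_author(records, name):
    """Return titles of all records whose author shares the name's heading."""
    heading = standard_form(name)
    if heading is None:
        return []
    return [r["title"] for r in records if standard_form(r["author"]) == heading]


# Hypothetical records, each entered under a different form of the name.
records = [
    {"author": "Roger Mills", "title": "Paper one"},
    {"author": "R. A. Mills", "title": "Paper two"},
    {"author": "R Anthony Mills", "title": "Paper three"},
    {"author": "Judy Reading", "title": "Paper four"},
]
print(find_by_author(records, "Roger A Mills"))
```

Searching on any one form retrieves everything filed under the shared heading, which is the "one go" the slide describes. A plain keyword search would match only records containing the exact form typed.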

Google doesn’t do that
• No controlled terms
• So you must think of synonyms, different forms of name, title abbreviations etc
• You must define the context – that matters…

Knitting according to Google

OK, we get it. So let’s invent…
• Google Scholar
• Let’s team up with publishers so they let us search behind their firewalls
• Let’s modify our algorithm so it excludes non-scholarly material (how do we define that?)
• Let’s look at citations so when one article we index cites another one we index, we can move it higher up the relevance ranking
• Let’s link together different versions of the same article
• Let’s include library locations for full-text access
• Let’s see how it goes

But let’s not allow:
• creation of sets
• Or controlled vocabularies
• Or combining of searches
• Or hit rate figures for individual search terms
• Or proximity searching
• Or saving and e-mailing results
• Or creation of alerts
• Or standardisation of journal names/abbreviations
• Or info on what is included and what is not
• Or info on how the system decides what is scholarly
• Or an indication of update frequency – seems slower than normal Google

Which of these statements is true?
• Google is comprehensive
• Google is all I need
• Google is up-to-date
• Google is not evil
• Google is commercial
• Google is independent
• Google is secretive
• Google wants to rule the world
• Google wants to beat Microsoft
• Google loves me
• I love Google

Google is a family
• A range of products under a common brand
• Some add value to the basic search engine; others are nothing to do with searching
• Google Scholar is a variant of the standard search engine
• It uses a different algorithm, but we don’t know how it differs

What’s in Google Scholar?
• “Google Scholar provides a simple way to broadly search for scholarly literature. From one place, you can search across many disciplines and sources: peer-reviewed papers, theses, books, abstracts and articles, from academic publishers, professional societies, preprint repositories, universities and other scholarly organizations. Google Scholar helps you identify the most relevant research across the world of scholarly research.”

NB: only in Beta
• Features may change
• Developing in tandem with Google Books, which will include digitised texts from Oxford collections and others
• In competition with WoK, ScienceDirect, SCOPUS, Scirus etc

Content
• Algorithm to identify scholarly materials crawled by Google from the open web
• Access to materials locked behind subscription barriers
• Must include abstract
• Full-text access requires institutional subscriptions or individual payment
• Includes peer-reviewed papers, theses, books, preprints, abstracts, full text, citations, etc.

Library links
• Includes OpenURL links to local library holdings
• In Oxford displays as ‘Oxford Full Text’ beside title

Includes citation data
• Uses ‘citation extraction’ to build connections between papers
• ‘Cited by’ link lists items (known to Google Scholar) that cite the original paper
• Cited items not available online are listed with prefix [citation]
• ‘Citation analysis’ puts the most-cited papers at the top of the results list
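
The idea behind a ‘Cited by’ link can be sketched as inverting each paper's reference list and then sorting by how often each paper is cited. This is only an illustration of the general technique: how Google Scholar extracts and weights citations is not published, and the paper identifiers and function names below are invented.

```python
from collections import defaultdict


def build_cited_by(references):
    """Invert reference lists: map each cited paper to the papers citing it.

    references maps a paper to the list of papers found in its bibliography.
    """
    cited_by = defaultdict(list)
    for paper, cited_papers in references.items():
        for cited in cited_papers:
            cited_by[cited].append(paper)
    return dict(cited_by)


def rank_by_citations(papers, cited_by):
    """Order papers so that the most-cited come first."""
    return sorted(papers, key=lambda p: len(cited_by.get(p, [])), reverse=True)


# Hypothetical bibliographies for four indexed papers.
references = {
    "paper_a": ["paper_c"],
    "paper_b": ["paper_a", "paper_c"],
    "paper_c": [],
    "paper_d": ["paper_c", "paper_a"],
}
cited_by = build_cited_by(references)
print(cited_by["paper_c"])
print(rank_by_citations(list(references), cited_by))
```

Only citations from papers the index already knows about can be counted, which is one reason different services can report different ‘cited by’ figures for the same article.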

Searching
• AND implied between words as in normal Google
• + to include common words, letters or numbers that Google’s search technology generally ignores
• “quote marks” to search for a phrase
• minus sign (-) to exclude a term from a search
• OR for either search term
• author: for author search
• intitle: to search document title
• restrict by date and publication
• advanced search screen available
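
The operators listed above can be combined in one query string. The helper below is a sketch that assembles such a string and, as an assumption about the address format, wraps it in a Google Scholar search URL with the query in a `q` parameter. The function names and the example terms other than "French national identity" are invented.

```python
from urllib.parse import urlencode


def build_query(all_words=(), phrase=None, any_of=(), exclude=(),
                author=None, in_title=None):
    """Assemble a query string from the operators on the slide."""
    parts = list(all_words)              # AND is implied between plain words
    if phrase:
        parts.append(f'"{phrase}"')      # quote marks search for a phrase
    if any_of:
        parts.append(" OR ".join(any_of))
    parts.extend(f"-{word}" for word in exclude)   # minus sign excludes
    if author:
        parts.append(f"author:{author}")
    if in_title:
        parts.append(f"intitle:{in_title}")
    return " ".join(parts)


def scholar_url(query):
    """Build a search URL (assumed format: base address plus a q parameter)."""
    return "https://scholar.google.com/scholar?" + urlencode({"q": query})


query = build_query(phrase="French national identity",
                    exclude=["football"], author="smith")
print(query)
print(scholar_url(query))
```

Running the same helper with and without `phrase` gives the two forms of search compared in the exercise that follows.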

Exercise
• Try searching for: French national identity
• In Google and Google Scholar
• With and without quotation marks
• Now try searching in Web of Science (or other relevant database)
• Is it clear why results differ?
• What approach provides the most useful results:
  • For writing a paper for publication
  • For quoting in a thesis
  • For preparing a speech
  • For preparing for a pub quiz
  • Or any other purpose…

Help screens

Earlier version

Alternatives to Google
• Google it!
• See Charles Knight’s up-to-date ‘Top 100’ list in Read/Write Web: http://www.readwriteweb.com/archives/top_100_alternative_search_engines_mar07.php
• Use Intute www.intute.ac.uk for reputable human-selected sites, chosen for a UK academic audience
• Check OxLIP www.ouls.ox.ac.uk/oxlip for complete listing and subject guide to university-subscribed databases. Most list the sources they cover and use controlled vocabularies for indexing

An example of Google’s strengths and weaknesses in finding a specific article: a search done in 2005 and repeated in Nov 2006

Biology search: glutathione in green Arabidopsis

WoS

Exact article in one step

Scholar phrase search 2005: 15 results, this one at 7

Scholar phrase search 2006: 16 results, this one first

Scholar keyword search 2005: 2420 results, this one at 10

Scholar keyword search 2006: 4800 results, this one first

Google keyword search 2005: 17600 results, this one first

Google keyword search 2006: 169000 articles, this one first

Google phrase search 2005: 59 results, this first

Google phrase search 2006: 86 results, this first

Scholar 2005: ‘all 7 versions’

Scholar 2005: cited by 2

Scholar 2006: cited by 14

WoS 2005: cited by 3

WoS 2006: cited by 15

Comparing citations data: 2005 X GS X SC X GS

Comparing citations data: 2006 X GS

Citations arranged by most cited

SCIRUS phrase search: 2 journals, this first; 8 other web sources (inc previous versions of this talk!)

SCIRUS keyword search: 735 journals, this first; 6996 others

Biological Abstracts phrase search: exact match in 1; note controlled keywords

SCIRUS
• Very similar to Scholar but can also:
• Mark records
• Save records
• E-mail records
• Export set in RIS format (for EndNote)
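
For readers who have not seen an RIS export, the sketch below writes one journal-article record in that tagged format: each line is a two-letter tag, two spaces, a hyphen, a space and the value, opening with TY and closing with ER. The record contents and the `to_ris` function name are invented for illustration, and only a handful of common tags are covered.

```python
def to_ris(record):
    """Format one journal-article record as RIS text (minimal tag set)."""
    lines = ["TY  - JOUR"]                      # record type: journal article
    for author in record.get("authors", []):
        lines.append(f"AU  - {author}")         # one AU line per author
    for tag, key in (("TI", "title"), ("JO", "journal"),
                     ("PY", "year"), ("VL", "volume")):
        if key in record:
            lines.append(f"{tag}  - {record[key]}")
    lines.append("ER  - ")                      # end-of-record marker
    return "\n".join(lines) + "\n"


# Hypothetical record; a real export would come from the database itself.
example = {
    "authors": ["Mills, R. A.", "Reading, J."],
    "title": "An example article",
    "journal": "Journal of Examples",
    "year": 2007,
}
print(to_ris(example))
```

Because the format is plain tagged text, a set exported this way can be imported into a reference manager, which is what Google Scholar did not offer at the time of this talk.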

Search on controlled terms in Biological Abstracts

Omitting ‘green’, 14 results

Not including this one, first on Scholar

Need wildcard – arabidopsis-*

Conclusion
• Maintain a balanced diet!
• Five a day…
• WoK, Scopus, Intute, subject-specific database, Google Scholar…