Greenstone: Open source system for creating and delivering digital library collections
• Ian H. Witten
• New Zealand Digital Library Project, Computer Science Department, Waikato University, New Zealand
• http://greenstone.org
Browsing around a digital library

Agenda
• Context
• Documents and interfaces
  – Different document types
  – … and interface languages
• Searching and browsing
  – Different search indexes
  – … and browsing functionality
• Collection configuration
• (Using the Collector)
• The power of open source

What we wanted
Greenstone turns a ragtag menagerie of documents in various formats into an easy-to-use collection that can run on a standalone laptop in a Ugandan village’s information center
ALA 2002

What we wanted
• “Collections” of digital material
• Individualized, depending on metadata etc
• Up to several GB of text …
• … + associated images, movies, whatever
• Fully searchable
• Served on WWW, or published on CD-ROM
• Multi-platform (Unix + all Windows)
• Multi-format documents
• Multi-lingual: documents and interfaces
• Multimedia
• Metadata: standard and non-standard

Collections: on the Web
nzdl.org (demo, not service)

Greenstone collections: on CD-ROM
UN and NGOs, e.g.
• UNESCO
• Global Help Project
• United Nations University
• World Health Organization
• Pan American Health Organization

Kataayi Multipurpose Cooperative
Rural Uganda (20 km from Masaka)

Example
Humanity Development Library for sustainable development and basic human needs
• 160,000 pages
• 30,000 images
• 1230 books
• 340 kg
• US$20,000
• CD-ROM US$6
• Win 3.1x(!)/95/98/NT
• Stand-alone and intranet server
• Web browser user interface
Global Help Project, Antwerp (+ UN agencies)

Collection of pictures (pictures of text)
Alexander Turnbull Library, NZ

Voice (and pictures)
Hamilton Public Library

Music

Chinese documents (pictures of text) + Chinese interface
Peking University Library

Chinese (Chinese & English interfaces)
Classic Chinese literature

Arabic (Arabic & English interfaces)
Famous mosques

French
UNESCO, Paris

Spanish
PAHO, WHO

Turkish

Russian collection from Mari El Republic
http://gov.mari.ru/gsdl

Hierarchical document model
• Metadata specified at any level
Title metadata
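
The idea of a hierarchical document with metadata attached at any level can be sketched in a few lines of Python. This is only an illustration, not Greenstone's internal representation: the `Section` class, its method names, and the sample titles are all made up for the example. The `ancestor_values` helper joins a metadata field across the enclosing sections, which is roughly what a format item such as `[parent(All': '):Title]` in the configuration slide appears to do.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Section:
    """One node in a document tree: a book, a chapter, or a subsection.

    Hypothetical structure for illustration; not Greenstone's own format.
    """
    metadata: Dict[str, str] = field(default_factory=dict)
    text: str = ""
    children: List["Section"] = field(default_factory=list)
    parent: Optional["Section"] = field(default=None, repr=False, compare=False)

    def add(self, child: "Section") -> "Section":
        """Attach a subsection and return it, so trees can be built top-down."""
        child.parent = self
        self.children.append(child)
        return child

    def ancestor_values(self, name: str, sep: str = ": ") -> str:
        """Join the values of `name` found on the ancestors, outermost first."""
        values = []
        node = self.parent
        while node is not None:
            if name in node.metadata:
                values.append(node.metadata[name])
            node = node.parent
        return sep.join(reversed(values))


# Made-up sample data: metadata sits on the book, the chapter, and the section.
book = Section({"Title": "Example book", "Subject": "Agriculture"})
chapter = book.add(Section({"Title": "Chapter 1"}))
section = chapter.add(Section({"Title": "Section 1.1"}, text="Body text"))

print(section.ancestor_values("Title"))
```

Because every level carries its own metadata dictionary, a search result for a single section can still show which chapter and book it belongs to.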

Searching and browsing
• Searching
• Metadata-based browsing
  – Dublin Core: Subject, Title, Publisher
  – ad hoc: “HowTo”

Multiple search indexes
• text
• metadata

Collection-dependent metadata

Multilingual searching

Browsing using classifiers
AZList classifier (Title metadata)
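
As a rough sketch of what an alphabetic classifier over Title metadata produces, the Python below groups documents into A to Z buckets by the first letter of a chosen metadata field. It is not the AZList implementation (Greenstone's classifiers are part of its own code base); the function name `az_list`, the "other" bucket, and the sample titles are assumptions made for the example.

```python
from collections import defaultdict


def az_list(documents, metadata="Title"):
    """Group metadata values into buckets keyed by their first letter.

    Documents without the field are skipped; values that do not start
    with a letter go into a bucket called "other" (an arbitrary choice).
    """
    buckets = defaultdict(list)
    for doc in documents:
        value = (doc.get(metadata) or "").strip()
        if not value:
            continue
        first = value[0].upper()
        key = first if first.isalpha() else "other"
        buckets[key].append(value)
    # Sort the buckets, and the entries inside each bucket, case-insensitively.
    return {key: sorted(values, key=str.lower)
            for key, values in sorted(buckets.items())}


# Made-up sample documents.
docs = [
    {"Title": "Butterfly farming"},
    {"Title": "aquaculture basics"},
    {"Title": "Bee keeping"},
    {"Creator": "No title here"},
]
shelves = az_list(docs)
for letter, titles in shelves.items():
    print(letter, titles)
```

Passing `metadata="Creator"` instead would give an alphabetic author browser over the same documents.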

DateList classifier (Date metadata)

Hierarchy classifier (Subject metadata)

Metadata extraction plugins
Acronym extraction plugin
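
To give a feel for acronym extraction as a metadata-generating step, here is a deliberately simple Python heuristic. It is an assumption-laden sketch and not the algorithm used by the Greenstone plugin: it only recognises the pattern "Expanded Phrase (ACRONYM)" where each letter of the acronym matches the initial of one of the words immediately before the parenthesis.

```python
import re

# An all-capitals token of 2 to 10 letters inside parentheses, e.g. "(WHO)".
PAREN_ACRONYM = re.compile(r"\(([A-Z]{2,10})\)")
WORD = re.compile(r"[A-Za-z][A-Za-z-]*")


def extract_acronyms(text):
    """Return {acronym: expansion} for 'World Health Organization (WHO)' patterns."""
    found = {}
    for match in PAREN_ACRONYM.finditer(text):
        acronym = match.group(1)
        # Take as many preceding words as the acronym has letters.
        words = WORD.findall(text[:match.start()])
        candidate = words[-len(acronym):]
        initials = "".join(word[0].upper() for word in candidate)
        if len(candidate) == len(acronym) and initials == acronym:
            found[acronym] = " ".join(candidate)
    return found


sample = ("The World Health Organization (WHO) and the Pan American "
          "Health Organization (PAHO) publish reports (SEE) below.")
acronyms = extract_acronyms(sample)
print(acronyms)
```

The heuristic misses expansions that contain small connecting words (for example an "and" that contributes no letter), so a practical extractor would need more tolerant matching.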

Language identification plugin
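
One well-known way to identify a document's language is to compare ranked character n-gram profiles, scoring each candidate language by how far the n-grams are "out of place". The Python below is a toy sketch of that technique under stated assumptions: the two one-sentence training samples are made up and far too small for real use, and nothing here is taken from the code of the tool Greenstone uses.

```python
from collections import Counter


def profile(text, n_max=3, size=300):
    """Rank the most frequent character n-grams (lengths 1..n_max) of `text`."""
    padded = " " + " ".join(text.lower().split()) + " "
    counts = Counter(
        padded[i:i + n]
        for n in range(1, n_max + 1)
        for i in range(len(padded) - n + 1)
    )
    # Most frequent first; ties broken alphabetically so ranks are deterministic.
    ranked = sorted(counts, key=lambda gram: (-counts[gram], gram))[:size]
    return {gram: rank for rank, gram in enumerate(ranked)}


def out_of_place(doc_profile, lang_profile, penalty):
    """Sum of rank differences; n-grams missing from the language cost `penalty`."""
    return sum(
        abs(rank - lang_profile[gram]) if gram in lang_profile else penalty
        for gram, rank in doc_profile.items()
    )


def identify(text, lang_profiles):
    """Return the language whose profile is closest to the text's profile."""
    doc = profile(text)
    penalty = max(len(p) for p in lang_profiles.values())
    return min(lang_profiles,
               key=lambda lang: out_of_place(doc, lang_profiles[lang], penalty))


# Made-up, tiny training samples (real profiles are built from much more text).
SAMPLES = {
    "en": "the library holds a collection of books and documents "
          "for the people of the village",
    "fr": "la bibliothèque contient une collection de livres et de "
          "documents pour les habitants du village",
}
PROFILES = {lang: profile(text) for lang, text in SAMPLES.items()}

print(identify("books and documents for the people", PROFILES))
```

The same profiles could be stored once per language and reused for every incoming document, so that identification costs only one profile computation per document.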

Email plugin

Phrase hierarchy extraction + thesaurus browsing

Collection configuration file
• name, icon, etc
• description
• email of creator
• search indexes
• plugins
• classifiers
• how to format
  – documents
  – query results
  – classifiers

    creator       sjboddie@cs.waikato.ac.nz
    maintainer    sjboddie@cs.waikato.ac.nz
    public        true
    beta          true

    indexes       section:text section:Title document:text
    defaultindex  section:text

    plugin        GAPlug
    plugin        ArcPlug
    plugin        RecPlug

    classify      Hierarchy hfile=sub.txt metadata=Subject sort=Title
    classify      HDLList metadata=Title
    classify      Hierarchy hfile=org.txt metadata=Organization sort=Title
    classify      List metadata=Howto

    format SearchVList "<td valign=top>[link][icon][/link]</td> <td>{If}{[parent(All': '):Title],[parent(All': '):Title]:} [link][Title][/link]</td>"
    format CL4VList " [link][Howto][/link]"
    format DocumentImages true
    format DocumentText "<h3>[Title]</h3>\n\n<p>[Text]"

    collectionmeta collectionname "greenstone demo"
    collectionmeta collectionextra "This is a demonstration collection for the Greenstone digital library software.\nIt contains a small subset (11 books) of the Humanity Development Library"
    collectionmeta iconcollectionsmall "/gsdl/collect/demo/images/demosm.gif"
    collectionmeta iconcollection "/gsdl/collect/demo/images/demo.gif"
    collectionmeta .section:Title "section titles"
    collectionmeta .document:text "entire books"
    collectionmeta .section:text "chapters"
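
The configuration file is line-oriented: each line starts with a directive (`indexes`, `plugin`, `classify`, `format`, `collectionmeta`, …) followed by arguments, some of them quoted. As a minimal sketch of how such a file could be read, the Python below groups lines by directive. It is not Greenstone's own parser; `parse_config`, the treatment of `#` as a comment marker, and the use of `shlex` for quoting are assumptions made for the example, and the sample is an abbreviated copy of the lines on the slide.

```python
import shlex

SAMPLE = '''\
creator sjboddie@cs.waikato.ac.nz
public true
indexes section:text section:Title document:text
defaultindex section:text
plugin GAPlug
plugin ArcPlug
plugin RecPlug
classify Hierarchy hfile=sub.txt metadata=Subject sort=Title
classify List metadata=Howto
collectionmeta collectionname "greenstone demo"
'''


def parse_config(text):
    """Group configuration lines by their first word (the directive).

    Returns {directive: [argument list, ...]}, one argument list per line.
    """
    config = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # assumed comment syntax
            continue
        directive, *args = shlex.split(line)
        config.setdefault(directive, []).append(args)
    return config


config = parse_config(SAMPLE)
indexes = config["indexes"][0]
plugins = [args[0] for args in config["plugin"]]
# Each classifier: (name, {option: value}) from its key=value arguments.
classifiers = [
    (args[0], dict(option.split("=", 1) for option in args[1:]))
    for args in config["classify"]
]

print(indexes)
print(plugins)
print(classifiers)
```

Note that `shlex` interprets backslashes inside double quotes, so format strings containing `\n` would need a more careful tokenizer than this sketch provides.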

Alter configuration
• Add full-text index of titles (indexes line)
    indexes document:Title
• ... or authors (additional indexes; … need author metadata)
    document:Creator
• Add alphabetic author browser (add classifier line)
    classify AZList -metadata Creator
• Include Word documents (add plugin line)
    plugin WordPlug
• Include PDF documents (same)
    plugin PDFPlug
• Separate index for each language (add languages line)
    languages en fr es
• Extract acronyms and add list (plugin option)
    plugin PDFPlug -extract_acronyms
• Import OAI metadata (add plugin line)
    plugin OAIPlug
• Extract phrase hierarchy and add browser (add classifier line)
    classify phind
• Alter the format of any of the above… (add format string)
    format
• Restrict collection’s interface langs (add format string)
    format PreferenceLangs en|fr|es
• Change default interface language (edit site config file)
    cgiarg shortname=l argdefault=fr

The pen is mightier than the sword!
Collector = software “wizard” for building new collections
Building and distributing collections carries responsibilities
  … legal
  … social
  … ethical
Be aware of the power of information and use it wisely

Status updated every 5 secs

The power of open source: Greenstone uses …
• Ghostscript: Interpreter for Adobe PostScript documents (PostScript plugin)
• Kea: Keyphrase extraction program (to generate metadata)
• pdftohtml: Converter for PDF documents (PDF plugin)
• rtftohtml: Converter for RTF documents (RTF plugin)
• TextCat: Detects languages and document encodings
• wvWare: Converter for Word documents (Word plugin)
• Xlhtml: Converter for Excel/PowerPoint documents (plugins)
• XML::Parser: Parses XML documents, used to read and write Greenstone’s internal XML document format

and …
• MG: Creates compressed full-text indexes and performs searches
• GDBM: Database used for metadata etc
• wget: Downloading pages from the Web when creating collections
• YAZ: Client and server implementation of Z39.50
• Stemmer: English language stemmer
• GCC: C/C++ compiler
• CVS: Version control system
• Perl: Used for plugins etc
• Apache: Web server used by many Greenstone installations

Greenstone DL software
Access
• Accessible via any Web browser
• Server runs on Windows and Unix
• Collections can be published on CD-ROM
Searching/browsing
• Full-text and fielded search
• Flexible browsing facilities
• Metadata-based (Dublin Core)
• Collection-specific
• Hierarchical phrase browsing supported
• Creates all access structures automatically
Extensible
• Plugins: new document, metadata formats
• Classifiers: new metadata browsers
Multilingual
• Documents and interfaces
• Chinese, Arabic, Maori, Russian etc (+ European)
• Multimedia: video, audio collections exist
Distributed
• CORBA protocol allows remote access
• Z39.50 server/client for backwards compatibility
What you see, you can get!
• Open-source software: free, extensible

UNESCO: Distributing Greenstone DL software
Sustainable development
“Give a man a fish, feed him for a day
Teach a man to fish, feed him for life”
Greenstone software on CD-ROM
• GNU licensed
• Fully documented
• Trilingual (English/French/Spanish)
• Unix/Windows (3.1/3.11, 95/98/ME, NT/2000/XP)
• Trivial to install
• End-user interface for collection building
• Serve collections on Web or write them to CD-ROM
• Documents on disk and/or Web
• Formats: HTML, Word, PDF, PostScript, plain text, e-mail, …
Download from http://greenstone.org

How to build a digital library
Witten and Bainbridge
Morgan Kaufmann, 2003

Kia papapounamu te moana
kia hora te marino,
kia tere te karohi,
kia papapounamu te moana

may peace and calmness surround you,
may you reside in the warmth of a summer’s haze,
may the ocean of your travels be as smooth as the polished greenstone.