Math-Demonstrator: Using Search Engine Technology for Academic Online Content

Using Search Engine Technology for Academic Online Content
The "Math-Demonstrator" or "From Theory to Practice"
Sabine Rahmsdorf, Bernd Fehling, Bielefeld UL

Presentation Overview
  • Part 1: General introduction to the Math-Demonstrator
      - objectives, content, potential
  • Part 2: Technical report about the Math-Demonstrator
      - backend: harvesting and preprocessing, processing and indexing
      - frontend: search surface, search and result presentation

Academic Online Information: the Reality
  [Diagram labels: web pages, subject databases, publishers' ejournals, library catalogues, institutional document servers, search engine, digital libraries, commercial providers, portals, search]

Academic Online Information: the Vision
  [Diagram labels: web pages, subject databases, publishers' ejournals, library catalogues, institutional document servers, search engine for academic online information]

From Theory to Practice (1)
  • Pilot project with FAST Data Search: search engine for academic online information for mathematicians
  • Subject based but not subject bound!

From Theory to Practice (2)
Objectives of the Math-Demonstrator:
  • collecting and making accessible in a single index a representative and heterogeneous set of academic online content
      - different document types
      - different data formats
      - fulltext and/or metadata
      - content from the "visible" and "invisible" web

From Theory to Practice (3)
Objectives of the Math-Demonstrator:
  • testing technical suitability of FAST Data Search for indexing and processing academic online content
  • working with interoperability standards (OAI)
  • developing prototype of intelligent and flexible user interface

From Theory to Practice (4)
Some general information:
  • work on Math-Demonstrator in progress at Bielefeld UL since summer 2003
  • team of 2 software developers
  • software: FAST Data Search 3.2
  • pilot project for DFG proposal "Using Search Engine Technology in Digital Libraries and Scientific Information Portals" by Bielefeld UL and HBZ (part of VDS in vascoda)

The Content (1)
About 466,000 documents indexed up to now in 10 collections:
a) Metadata
  • Zentralblatt MATH (137,678 records)
  • Project Euclid (6,516 records)
      - harvested using OAI protocol
  • OPAC Bielefeld UL (75,017 records)

The Content (2)
b) Fulltext without metadata / web content
  • Documenta Mathematica
  • preprint servers at Bielefeld University (together 18,301 documents)
  • TIB/UB Hannover: project reports of BMBF (64 documents)

The Content (3)
c) Fulltext with metadata
  • Springer journals (224,387 records)
      - metadata indexed, fulltext in preparation
  • Bochum UL: electronic dissertations of Ruhr-University Bochum (1908 documents)
      - harvested using OAI protocol

The Content (4)
c) Fulltext with metadata (cont.)
  • University of Michigan Historical Math Collection (772 documents)
  • Cornell University Library Historical Math Monographs (630 documents)
  • SUB Göttingen/GDZ: Mathematica (427 documents)
      - all harvested using OAI protocol, up to now only metadata indexed

The Potential
  • making accessible different kinds of content sources in one index: web content and databases/catalogues
  • indexing metadata and fulltext with or without metadata
  • enhancing fulltext data by metadata extraction
  • flexible and customizable frontend
  • transferring performance and scalability of search engine technology to digital library world

Using Search Engine Technology for Academic Online Content
The "Math-Demonstrator" or "From Theory to Practice"
Part 2
Sabine Rahmsdorf, Bernd Fehling, Bielefeld UL

System Components
  • separate frontend and backend server
  • currently one frontend server
      - can be easily enhanced with more servers
  • currently one backend server (single node)
      - can be enhanced to multi node system

Dispatching
Frontend:
  • search surface (basic, advanced)
  • result processing and result presentation
Backend:
  • harvesting (Perl OAI harvester)
  • preprocessing and conversion from BRS, OAI-DC and other DB formats with Perl
  • filetraverser, crawler
  • document processing and indexing of data
  • query and result processing

The Frontend
  • Siemens Primergy, 2 x 800 MHz CPU, 1.28 GB RAM
  • RAID 1, Adaptec SCSI, 36 GB
  • SuSE Linux 9.0, Kernel 2.4.21-smp
  • Apache web server with PHP 4
  • Bielefeld University Library web server

Search Surface (1)
  • Basic Search
  [Screenshot callouts: single search field, advanced search, content source, language, help]

Search Surface (2)
  • Advanced Search
  [Screenshot callouts: search field selection, year limit, source selection]

Result Presentation (1)
  [Screenshot callouts: query support, drill down, result change, simple search history]

Result Presentation (2)
  [Screenshot callouts: fulltext, meta data]

The Backend
Live-System:
  • SUN Enterprise 450, 4 x 250 MHz CPU, 2 GB RAM
  • RAID 5 + Hotspare, SCSI, 768 GB
  • SUN Solaris 8
  • System report: 344.8 GB total, 24.8 GB used, 320 GB free
      - 11 collections, 466,496 documents
      - FAST Search 3.2 (PHP 4, Python 2.2)
Test-System (provided by FAST Search & Transfer):
  • Dell PowerEdge, 1 x PIII 730 MHz CPU, 1.2 GB RAM
  • RAID 5, Adaptec SCSI, 66 GB
  • RedHat Linux 7.3, Kernel 2.4.22-pre5

Harvesting
harvested OAI-DC data
assumed:
  <date>1991</date>
reality:
  <date>-set=math&until=2003-11-03&metadataPrefix=oai_dc</date>
  <date>1903; 1903-09-02</date>
  <date>[c 1911]</date>
  <date>1906-1928 [v. 1, &apos;28]</date>
  <date>[192-?]</date>
  <date>C. Gerolds sohn,</date>
  <date>28 cm.</date>
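The first "reality" value looks like leaked harvest request parameters rather than a date. As a rough illustration of where `set`, `until` and `metadataPrefix` normally belong, here is a minimal Python sketch that builds an OAI-PMH ListRecords request URL. The base URL and the function name are hypothetical; the project's own harvester was written in Perl and is not shown in the slides.

```python
from urllib.parse import urlencode


def list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None, until=None):
    """Build an OAI-PMH ListRecords request URL (illustrative helper, not the project's code)."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec      # selective harvesting by set
    if until:
        params["until"] = until       # upper datestamp bound
    return base_url + "?" + urlencode(params)


# Hypothetical repository endpoint, used only for the example.
url = list_records_url("https://repository.example.org/oai",
                       set_spec="math", until="2003-11-03")
print(url)
```

These arguments describe the request, so they should never end up inside a `<date>` element of the harvested record.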

Preprocessing (1)
converting to FAST-XML
in:
  <date>[c 1915]</date>
out:
  <element name="dcdate"><value>[c 1915]</value></element>
  <element name="dcyear"><value>1915</value></element>
in:
  <language>ENG</language>
out:
  <element name="dclanguage"><value>eng</value></element>
  <element name="language"><value>en</value></element>
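The date conversion above can be sketched in Python as follows. This is only one possible filtering rule, chosen for illustration: take the first four-digit year between 1000 and 2099, and reject values containing `=` on the assumption that they are leaked request parameters. The function names are hypothetical, and the project's actual Perl preprocessing may work differently.

```python
import re
from xml.sax.saxutils import escape

# First standalone four-digit number that looks like a year (assumed range 1000-2099).
YEAR = re.compile(r"(?<!\d)(1\d{3}|20\d{2})(?!\d)")


def extract_year(raw):
    """Return the first plausible year in a dc:date value as an int, or None."""
    if "=" in raw:  # assumption: '=' means request parameters, not a date
        return None
    match = YEAR.search(raw)
    return int(match.group(1)) if match else None


def element(name, value):
    """Format one element in the name/value layout shown on the slide."""
    return '<element name="%s"><value>%s</value></element>' % (name, escape(str(value)))


def convert_date(raw):
    """Keep the raw date as dcdate and add dcyear only when a year was found."""
    lines = [element("dcdate", raw)]
    year = extract_year(raw)
    if year is not None:
        lines.append(element("dcyear", year))
    return lines


for line in convert_date("[c 1915]"):
    print(line)
```

With this rule, values such as `[192-?]`, `C. Gerolds sohn,` and `28 cm.` keep their raw `dcdate` but get no `dcyear`.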

Preprocessing (2)
in (binary data from CDROM database):
  ... Japanese. Esperanto summary. ...
out:
  <element name="dclanguage"><value>Japanese. Esperanto summary</value></element>
  <element name="language"><value>jp</value></element>
  <element name="secondarylanguage"><value>eo</value></element>

Preprocessing (3)
Summary
  • language code (text and ISO 639-2 to ISO 639-1)
      - ISO 639-1 (de, fr)
      - ISO 639-2/B (ger, fre), ISO 639-2/T (deu, fra)
  • date filtering
  • XML encoding and conversion (<, >, &, ", ', CDATA)
  • generating unique document id (doi, document number, ...)
  • general filtering and error correction
  • building of body content (author, title, description, ...)
  • fulltext link extraction
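The language code step can be illustrated with a small lookup table. The Python sketch below covers only a handful of languages; the table, the function names and the rule for spotting a "summary" language are assumptions made for the example. Note that the registered ISO 639-1 code for Japanese is `ja`, which the sketch uses, whereas the slide output shows `jp`.

```python
# Tiny illustrative mapping; a real table would cover the full ISO 639 code lists.
ISO_639_1 = {
    # ISO 639-2/B and ISO 639-2/T codes
    "ger": "de", "deu": "de", "fre": "fr", "fra": "fr",
    "eng": "en", "jpn": "ja", "epo": "eo",
    # English language names as found in free-text fields
    "german": "de", "french": "fr", "english": "en",
    "japanese": "ja", "esperanto": "eo",
}


def to_iso_639_1(token):
    """Map a language name or ISO 639-2 code to ISO 639-1, or return None if unknown."""
    token = token.strip().lower()
    if token in ISO_639_1.values():  # already a known two-letter code
        return token
    return ISO_639_1.get(token)


def parse_language_note(note):
    """Split a note like 'Japanese. Esperanto summary.' into (primary, secondary)."""
    primary = secondary = None
    for part in note.split("."):
        words = part.split()
        if not words:
            continue
        code = to_iso_639_1(words[0])
        if code is None:
            continue
        if len(words) > 1 and words[1].lower() == "summary":
            secondary = secondary or code  # language of the summary only
        elif primary is None:
            primary = code
    return primary, secondary


print(to_iso_639_1("ENG"))
print(parse_language_note("Japanese. Esperanto summary."))
```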

Processing (1)
Filetraverser sources
  • loading of preprocessed content with filetraverser
  • processing with self created pipelines
      - language detection from title and description
      - setting of mime type
      - generate teaser based on description, meta data or body
      - tokenize selected fields
      - lemmatize (run, runs, running, ran) and synonyms (security, safety), dictionary based
      - vectorizer (for analyzing similarities between docs)
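To make the dictionary-based lemmatization and synonym step concrete, here is a toy Python sketch using the slide's own examples. It does not reproduce FAST Data Search's linguistics; the dictionaries and function names are hypothetical and exist only for illustration.

```python
import re

# Toy dictionaries built from the examples on the slide.
LEMMAS = {"runs": "run", "running": "run", "ran": "run"}
SYNONYMS = {"security": {"safety"}, "safety": {"security"}}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def index_terms(text):
    """Reduce tokens to their lemma and add dictionary synonyms."""
    terms = set()
    for token in tokenize(text):
        lemma = LEMMAS.get(token, token)
        terms.add(lemma)
        terms |= SYNONYMS.get(lemma, set())
    return terms


print(sorted(index_terms("Running security")))
```

Because all inflected forms map to one lemma, a query for "ran" and a document containing "runs" end up with the same index term.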

Processing (2)
Crawler sources
  • crawling of selected web sites according to rules
  • crawling of fulltext link lists
  • system processing with stages (partly self developed)
      - deleting format (mime type)
      - format detection
      - uncompressing (zip, gzip, tar, ...) and setting of new format
      - set content type (metadata, fulltext, mixed, unknown)
      - Postscript conversion (Ghostscript)
      - PDF conversion (XPDF)
      - SearchMLConverter (FAST)
      - language and encoding detection (FAST)
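The first three stages (deleting the reported format, detecting the format, uncompressing and setting the new format) might look roughly like the Python sketch below. It assumes detection by leading magic bytes and handles only gzip; the stage function, the document dictionary layout and the short signature table are hypothetical and not taken from the FAST pipeline.

```python
import gzip

# Leading byte signatures for a few formats (deliberately incomplete).
MAGIC = [
    (b"%PDF-", "application/pdf"),
    (b"%!PS", "application/postscript"),
    (b"\x1f\x8b", "application/gzip"),
    (b"PK\x03\x04", "application/zip"),
]


def detect_format(data):
    """Guess a mime type from the first bytes of the content."""
    for prefix, mime in MAGIC:
        if data.startswith(prefix):
            return mime
    return "application/octet-stream"


def run_stages(doc):
    """Apply the first three stages to a copy of the document dict."""
    doc = dict(doc)
    doc.pop("mime", None)                      # stage: delete format reported by the source
    doc["mime"] = detect_format(doc["data"])   # stage: format detection
    if doc["mime"] == "application/gzip":      # stage: uncompress and set new format
        doc["data"] = gzip.decompress(doc["data"])
        doc["mime"] = detect_format(doc["data"])
    return doc


# A gzip-compressed PDF that a web server mislabelled as HTML.
crawled = {"mime": "text/html", "data": gzip.compress(b"%PDF-1.4 sample")}
print(run_stages(crawled)["mime"])
```

Discarding the server-reported mime type first reflects the order on the slide: the pipeline trusts the content rather than the label it arrived with.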

Indexing
  • Index structure enhanced by 15 DC fields
  • additional 5 index fields
      - dcisbn (ISBN, ISSN)
      - dcdoi (DOI or similar identifier)
      - dcyear (filtered year as integer)
      - dcstype (metadata, fulltext, ...)
      - rights (name of source)

Further Development
Frontend
  • templating
  • search interface (search API)
  • combining metadata record and corresponding fulltext in result display
Backend
  • automation of harvesting and content preprocessing
  • search result improvement (ranking, boosting, doclink, linguistics)
  • performance optimisation

Math-Demonstrator Thank you!