Math-Demonstrator: Using Search Engine Technology for Academic Online Content

Using Search Engine Technology for Academic Online Content
The "Math-Demonstrator" or "From Theory to Practice"
Sabine Rahmsdorf, Bernd Fehling, Bielefeld UL

Presentation Overview
- Part 1: General introduction to the Math-Demonstrator
  - objectives, content, potential
- Part 2: Technical report about the Math-Demonstrator
  - backend: harvesting and preprocessing, processing and indexing
  - frontend: search surface, search and result presentation

Academic Online Information: the Reality
[Diagram labels: web pages, subject databases, publishers' ejournals, library catalogues, institutional document servers; search engine, digital libraries, commercial providers, portals; search]

Academic Online Information: the Vision
[Diagram labels: web pages, subject databases, publishers' ejournals, library catalogues, institutional document servers; search engine for academic online information]

From Theory to Practice (1)
- Pilot project with FAST Data Search: a search engine for academic online information for mathematicians
- Subject based but not subject bound!

From Theory to Practice (2)
Objectives of the Math-Demonstrator:
- collecting and making accessible in a single index a representative and heterogeneous set of academic online content
  - different document types
  - different data formats
  - fulltext and/or metadata
  - content from the “visible” and “invisible” web

From Theory to Practice (3)
Objectives of the Math-Demonstrator:
- testing the technical suitability of FAST Data Search for indexing and processing academic online content
- working with interoperability standards (OAI)
- developing a prototype of an intelligent and flexible user interface

From Theory to Practice (4)
Some general information:
- work on the Math-Demonstrator in progress at Bielefeld UL since summer 2003
- team of 2 software developers
- software: FAST Data Search 3.2
- pilot project for the DFG proposal “Using Search Engine Technology in Digital Libraries and Scientific Information Portals” by Bielefeld UL and HBZ (part of VDS in vascoda)

The Content (1)
About 466,000 documents indexed up to now in 10 collections:
a) Metadata
- Zentralblatt MATH (137,678 records)
- Project Euclid (6,516 records)
  - harvested using the OAI protocol
- OPAC Bielefeld UL (75,017 records)

The Content (2)
b) Fulltext without metadata / web content
- Documenta Mathematica
- preprint servers at Bielefeld University (together 18,301 documents)
- TIB/UB Hannover: project reports of BMBF (64 documents)

The Content (3)
c) Fulltext with metadata
- Springer journals (224,387 records)
  - metadata indexed, fulltext in preparation
- Bochum UL: electronic dissertations of Ruhr-University Bochum (1,908 documents)
  - harvested using the OAI protocol

The Content (4)
c) Fulltext with metadata (cont.)
- University of Michigan Historical Math Collection (772 documents)
- Cornell University Library Historical Math Monographs (630 documents)
- SUB Göttingen/GDZ: Mathematica (427 documents)
  - all harvested using the OAI protocol, up to now only metadata indexed

The Potential
- making accessible different kinds of content sources in one index: web content and databases/catalogues
- indexing metadata and fulltext with or without metadata
- enhancing fulltext data by metadata extraction
- flexible and customizable frontend
- transferring the performance and scalability of search engine technology to the digital library world

Using Search Engine Technology for Academic Online Content
The "Math-Demonstrator" or "From Theory to Practice", Part 2
Sabine Rahmsdorf, Bernd Fehling, Bielefeld UL

System Components
- separate frontend and backend servers
- currently one frontend server; can easily be extended with more servers
- currently one backend server (single node); can be extended to a multi-node system

Dispatching
Frontend:
- search surface (basic, advanced)
- result processing and result presentation
Backend:
- harvesting (Perl OAI harvester)
- preprocessing and conversion from BRS, OAI-DC and other DB formats with Perl
- filetraverser, crawler
- document processing and indexing of data
- query and result processing

The Frontend
- Siemens Primergy, 2 x 800 MHz CPU, 1.28 GB RAM
- RAID 1, Adaptec SCSI, 36 GB
- SuSE Linux 9.0, Kernel 2.4.21-smp
- Apache web server with PHP 4
- Bielefeld University Library web server

Search Surface (1)
- Basic Search
[Screenshot callouts: single search field, advanced search, content source, language, help]

Search Surface (2)
- Advanced Search
[Screenshot callouts: search field selection, year limit, source selection]

Result Presentation (1)
[Screenshot callouts: query support, drill down, result change, simple search history]

Result Presentation (2)
[Screenshot callouts: fulltext, metadata]

The Backend
Live system:
- SUN Enterprise 450, 4 x 250 MHz CPU, 2 GB RAM
- RAID 5 + hot spare, SCSI, 768 GB
- SUN Solaris 8
- System report: 344.8 GB total, 24.8 GB used, 320 GB free
  - 11 collections, 466,496 documents
  - FAST Search 3.2 (PHP 4, Python 2.2)
Test system (provided by FAST Search & Transfer):
- Dell PowerEdge, 1 x PIII 730 MHz CPU, 1.2 GB RAM
- RAID 5, Adaptec SCSI, 66 GB
- RedHat Linux 7.3, Kernel 2.4.22-pre5

Harvesting
Harvested OAI-DC data
- assumed:
  <date>1991</date>
- reality:
  <date>-set=math&until=2003-11-03&metadataPrefix=oai_dc</date>
  <date>1903; 1903-09-02</date>
  <date>[c1911]</date>
  <date>1906-1928 [v. 1, '28]</date>
  <date>[192-?]</date>
  <date>C. Gerolds sohn,</date>
  <date>28 cm.</date>

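The gap between assumed and real date values can be made visible with a small check. The following is a minimal Python sketch, not the project's Perl harvester: the function names, the `example.org` base URL in the usage note and the wrapper element in `SAMPLE` are illustrative assumptions. Only the OAI-PMH request parameters (`verb`, `metadataPrefix`, `set`, `until`) and the Dublin Core namespace come from the respective standards.

```python
import re
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Namespace of unqualified Dublin Core elements, as used in oai_dc records.
DC_NS = "{http://purl.org/dc/elements/1.1/}"

# Treat only YYYY, YYYY-MM or YYYY-MM-DD as a "clean" dc:date value.
CLEAN_DATE = re.compile(r"^\d{4}(-\d{2}(-\d{2})?)?$")


def list_records_url(base_url, set_spec=None, until=None):
    """Build an OAI-PMH ListRecords request URL for oai_dc metadata."""
    params = {"verb": "ListRecords", "metadataPrefix": "oai_dc"}
    if set_spec:
        params["set"] = set_spec
    if until:
        params["until"] = until
    return base_url + "?" + urlencode(params)


def dc_dates(xml_text):
    """Return (value, is_clean) for every dc:date element in the XML text."""
    root = ET.fromstring(xml_text)
    result = []
    for elem in root.iter(DC_NS + "date"):
        value = (elem.text or "").strip()
        result.append((value, bool(CLEAN_DATE.match(value))))
    return result


# Hypothetical, simplified sample; real OAI-PMH responses nest these
# elements inside <record><metadata><oai_dc:dc> containers.
SAMPLE = """<records xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:date>1991</dc:date>
  <dc:date>[c1911]</dc:date>
  <dc:date>1903; 1903-09-02</dc:date>
  <dc:date>28 cm.</dc:date>
</records>"""
```

Calling `dc_dates(SAMPLE)` flags only the first value as clean; the other three would need the filtering described on the preprocessing slides. A request URL for a hypothetical repository could be built with `list_records_url("http://example.org/oai", set_spec="math", until="2003-11-03")`.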
Preprocessing (1)
Converting to FAST-XML
in:
  <date>[c1915]</date>
out:
  <element name="dcdate"><value>[c1915]</value></element>
  <element name="dcyear"><value>1915</value></element>
in:
  <language>ENG</language>
out:
  <element name="dclanguage"><value>eng</value></element>
  <element name="language"><value>en</value></element>

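The date conversion shown on this slide can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the project's Perl code: the helper names and the year range (1500 to 2099) are hypothetical choices, and only the `<element name="..."><value>...</value></element>` shape is taken from the slide.

```python
import re
from xml.sax.saxutils import escape

# First standalone four-digit year between 1500 and 2099 (assumed range).
YEAR = re.compile(r"(?<!\d)(1[5-9]\d{2}|20\d{2})(?!\d)")


def extract_year(raw):
    """Return the first plausible year in a raw date string, or None."""
    match = YEAR.search(raw)
    return int(match.group(1)) if match else None


def element(name, value):
    """Render one field in the element/value shape shown on the slide."""
    return '<element name="%s"><value>%s</value></element>' % (
        name, escape(str(value)))


def date_elements(raw):
    """Keep the raw date as dcdate and add dcyear when a year is found."""
    out = [element("dcdate", raw)]
    year = extract_year(raw)
    if year is not None:
        out.append(element("dcyear", year))
    return out
```

For `"[c1915]"` this yields a `dcdate` element with the raw value and a `dcyear` element with `1915`. Values such as `"[192-?]"` or `"28 cm."` produce no `dcyear`. The heuristic is deliberately naive: a leaked request string containing `until=2003-11-03` would still yield 2003, so source-specific rules would be needed in practice.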
Preprocessing (2)
in (binary data from CDROM database):
  ... Japanese. Esperanto summary ...
out:
  <element name="dclanguage"><value>Japanese. Esperanto summary</value></element>
  <element name="language"><value>jp</value></element>
  <element name="secondarylanguage"><value>eo</value></element>

Preprocessing (3)
Summary
- language code (text and ISO 639-2 to ISO 639-1)
  - ISO 639-1 (de, fr)
  - ISO 639-2/B (ger, fre), ISO 639-2/T (deu, fra)
- date filtering
- XML encoding and conversion (<, >, &, ", ', CDATA)
- generating unique document id (doi, document number, ...)
- general filtering and error correction
- building of body content (author, title, description, ...)
- fulltext link extraction

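The language code step in this summary amounts to a lookup that folds several notations into ISO 639-1. A minimal Python sketch follows; the table covers only the codes named on the slides plus three language names, and the function and table names are hypothetical.

```python
# Lookup from ISO 639-2/B, ISO 639-2/T and plain language names to
# ISO 639-1. Deliberately tiny: a real table would cover all languages
# that occur in the harvested sources.
TO_639_1 = {
    "ger": "de", "deu": "de",   # German: 639-2/B and 639-2/T
    "fre": "fr", "fra": "fr",   # French: 639-2/B and 639-2/T
    "eng": "en",                # English: B and T codes are identical
    "german": "de", "french": "fr", "english": "en",
}
VALID_639_1 = set(TO_639_1.values())


def normalize_language(raw):
    """Map a raw language value to ISO 639-1, or None if unknown."""
    key = raw.strip().lower()
    if key in VALID_639_1:
        return key              # already a known two-letter code
    return TO_639_1.get(key)    # None signals "needs manual handling"
```

Returning `None` for unknown values, rather than guessing, keeps unmapped languages visible so they can be handled by the general filtering and error correction step.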
Processing (1)
Filetraverser sources
- loading of preprocessed content with the filetraverser
- processing with self-created pipelines
  - language detection from title and description
  - setting of mime type
  - generate teaser based on description, metadata or body
  - tokenize selected fields
  - lemmatize (run, runs, running, ran) and synonyms (security, safety), dictionary based
  - vectorizer (for analyzing similarities between docs)

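The idea of a processing pipeline, a fixed sequence of stages that each enrich a document, can be illustrated in Python. This sketch does not use or imitate the FAST Data Search pipeline API; the stage names, the document fields and the toy lemma dictionary are all assumptions made for illustration.

```python
import re

# Toy dictionary for dictionary-based lemmatization (slide example).
LEMMAS = {"runs": "run", "running": "run", "ran": "run"}


def make_teaser(doc, max_len=60):
    """Build a teaser from the description, falling back to the body."""
    text = doc.get("description") or doc.get("body") or ""
    doc["teaser"] = text[:max_len].rstrip()
    return doc


def tokenize(doc, fields=("title", "description")):
    """Lowercase and split the selected fields into word tokens."""
    tokens = []
    for field in fields:
        tokens += re.findall(r"\w+", doc.get(field, "").lower())
    doc["tokens"] = tokens
    return doc


def lemmatize(doc):
    """Replace each token by its dictionary lemma, if one is known."""
    doc["lemmas"] = [LEMMAS.get(token, token) for token in doc["tokens"]]
    return doc


def run_pipeline(doc, stages):
    """Apply the stages in order; each stage returns the updated doc."""
    for stage in stages:
        doc = stage(doc)
    return doc


PIPELINE = [make_teaser, tokenize, lemmatize]
```

Running `run_pipeline({"title": "Running Times", "description": "She ran and runs."}, PIPELINE)` adds `teaser`, `tokens` and `lemmas` fields to the document. Keeping stages as plain functions makes it easy to reorder them or to insert further steps such as language detection.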
Processing (2)
Crawler sources
- crawling of selected web sites according to rules
- crawling of fulltext link lists
- system processing with stages (partly self-developed)
  - deleting format (mime type)
  - format detection
  - uncompressing (zip, gzip, tar, ...) and setting of new format
  - set content type (metadata, fulltext, mixed, unknown)
  - PostScript conversion (Ghostscript)
  - PDF conversion (XPDF)
  - SearchMLConverter (FAST)
  - language and encoding detection (FAST)

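The format detection and uncompressing stages can be sketched as a check of leading "magic" bytes followed by decompression and a second detection pass. The Python below is an illustration only: it handles gzip but not zip or tar, the function names are hypothetical, and it is not how the FAST stages are implemented.

```python
import gzip

# Leading byte signatures of the formats named on the slide.
MAGIC = [
    (b"%PDF-", "application/pdf"),
    (b"%!PS", "application/postscript"),
    (b"\x1f\x8b", "application/gzip"),
    (b"PK\x03\x04", "application/zip"),
]


def detect_format(data):
    """Guess a mime type from the first bytes of the content."""
    for prefix, mime in MAGIC:
        if data.startswith(prefix):
            return mime
    return "application/octet-stream"


def unwrap(data, max_depth=3):
    """Strip gzip layers and return (payload, mime type of payload)."""
    mime = detect_format(data)
    depth = 0
    while mime == "application/gzip" and depth < max_depth:
        data = gzip.decompress(data)
        mime = detect_format(data)   # set the new format after unpacking
        depth += 1
    return data, mime
```

Detecting the format from the content rather than trusting the mime type reported by the web server matches the first two stages on the slide: the delivered format is discarded and then determined again. The `max_depth` limit is an added safeguard against deeply nested archives.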
Indexing
- index structure enhanced by 15 DC fields
- 5 additional index fields
  - dcisbn (ISBN, ISSN)
  - dcdoi (DOI or similar identifier)
  - dcyear (filtered year as integer)
  - dcstype (metadata, fulltext, ...)
  - rights (name of source)

Further Development
Frontend
- templating
- search interface (search API)
- combining metadata record and corresponding fulltext in result display
Backend
- automation of harvesting and content preprocessing
- search result improvement (ranking, boosting, doclink, linguistics)
- performance optimisation

Thank you!