Svein Arne Brygfjeld National Library of Norway Nordic Web Archive
The message of today • First: A summary • Second: Legal deposit in Norway • Third: Our digital library principles • Fourth: Harvesting, archiving and giving access to the web • Fifth: The prototype, a demonstration
Part one: Summary • Norwegian legislation on legal deposit: Includes digital information! • The National Library of Norway has a relatively advanced digital library activity • Nordic cooperation on methods and technology for legal deposit of the web • Nordic project on access to web archives
Part Two: Legal deposit in Norway • Legislation revised in 1989 • Includes all information carriers in the ”traditional domain”, like books, newspapers & more • Also including music and broadcast programs • And: Including the information living in the digital domain
The National Library of Norway Bendik Rugaas Administration IT & Innovation National Librarian 100 employees Administration IT Public Collections Bibliographic Norwegian Music Oslo Division Rana Division (Svein Arne)2 200 employees Administration IT Technical Repository Legal Deposit Media Lab Sound & Image
The challenge: • Preserving the cultural heritage represented by the world-wide web – Including harvesting and archiving • Giving access to historical web archives – …Nordic Web Archive access project
But first: Part three • Our digital library principles…
One strategy for most digital objects • One large long-term digital repository • All storage, long-term preservation and access based on this infrastructure
Our Digital Library reference model • Digital Library application layer – Search engines – Personalization – Specialized applications – Collecting applications • Repository functionality & organization – Identification (URN) – Migration – Quality and formats – IPR/Copyrights/Access control • Digital objects – Text, audio, still images, moving images, web pages & more – Metadata (DC) • General storage facility – Unix servers – Fault-tolerant disk systems – Tape libraries – HSM
Examples of current use • Digital Radio Archive – Digitization & archiving of 50,000 hrs • Galleri NOR – Still images in high quality • Historical newspapers – Images of pages as well as OCR-based text
And now… • …the preservation of the web!
Preserving the web: some focus areas • Harvesting & collecting it all • Archiving – Identification, versions, metadata, long-term preservation • Access to archive
Harvesting • Can it be possible? – Have a look at the search engines • Available software – Public domain/Open Source: NEDLIB – Commercial: several
Harvesting: Resolution in time • Snapshots vs continuous • Continuous: – Wanted for services considered interesting and with rapid updates – Dependent on use of software agents placed at the publisher
Everything or bits & pieces • Questions to be answered: – What is (technically) possible? – What do we want? – What level of metadata do we need?
Archiving • Different models in the five countries (probably) • The Norwegian model is based on use of the library’s general storage facilities • Close integration with other digital objects • Online or near-line
Long-term preservation • Migration – So far our choice • Emulation – Technically complicated • Museum – Hard to do over time
And now… • …access to web archives
Nordic Web Archive • A context for cooperation to find common technology and methods to harvest, archive and give access to the web • Current focus on access to archives – Small, focused project
NWA: Members • Denmark (Royal Library) • Finland (National Library) • Iceland (National Library) • Norway (National Library), project mgmt • Sweden (Royal Library) • Nordunet 2
NWA: Current scope • Focus on access to web archives • NOT harvesting • NOT archiving
NWA: Main choices • General and well-specified interface to archive • Search (and navigation) through the use of a commercial search engine • Access based on search and navigation/browsing • Support for navigation in time and space
NWA: Architecture [Diagram. Components: ARCHIVE, ARCHIVE ACCESS, DOCUMENT INDEXER, INDEXER, INDEXES, SEARCH ENGINE, WEB INTERFACE. Labels: COMMON FORMAT, XML, URN, DOCUMENT. Archive access calls: FIND_ID(URL, TIME), FIND_DOCUMENT(URN)]
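The FIND_ID(URL, TIME) and FIND_DOCUMENT(URN) calls named on the architecture slide could be sketched roughly as below. This is an illustration only, not the NWA implementation: the in-memory store, the `urn:x-example:` identifier scheme, the `store` helper, and the rule that FIND_ID returns the latest version harvested at or before the given time are all assumptions made for the example.

```python
from bisect import bisect_right
from datetime import datetime


class ArchiveAccess:
    """Toy in-memory archive (hypothetical; not the NWA code)."""

    def __init__(self):
        self._versions = {}   # url -> sorted list of (harvest_time, urn)
        self._documents = {}  # urn -> document bytes

    def store(self, url, harvested, content):
        # Assumed URN scheme, invented for this example only.
        urn = f"urn:x-example:{url}:{harvested:%Y%m%d%H%M%S}"
        versions = self._versions.setdefault(url, [])
        versions.append((harvested, urn))
        versions.sort()
        self._documents[urn] = content
        return urn

    def find_id(self, url, time):
        """URN of the latest version harvested at or before `time`, or None."""
        versions = self._versions.get(url, [])
        times = [t for t, _ in versions]
        i = bisect_right(times, time)
        return versions[i - 1][1] if i else None

    def find_document(self, urn):
        """Return the stored document for a URN (KeyError if unknown)."""
        return self._documents[urn]


archive = ArchiveAccess()
archive.store("http://www.nb.no/", datetime(2000, 1, 10), b"<html>v1</html>")
archive.store("http://www.nb.no/", datetime(2000, 6, 1), b"<html>v2</html>")
urn = archive.find_id("http://www.nb.no/", datetime(2000, 3, 1))
print(archive.find_document(urn))
```

Splitting the lookup in two steps means the search engine and the web interface only need to exchange URNs, while the archive itself can be organized differently in each country.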
NWA: The technology • Based on commercial search engine from Fast Search & Transfer • In-house development on Linux platform – XML, PHP, Perl and Java – Probably Open Source – General web user interface (no additional plugins needed)
NWA: Search engine motivations • Motivation – Support for search functionality on text documents – Speed – Reduced complexity in implementation
NWA: Search engine benefits • (in addition to fulfilling the motivations) – Extreme scalability – Support for distributed searching – Easy integration with other indexes – Integrated language technologies (limited)
NWA: Access methods • Main principles: – The web seen in the archive should look like it did on the net – It should be available through the use of an ordinary web browser • Three main methods – Search, navigation and browsing
NWA: Search • Search based on search engine • Indexes based on exports from archives – In general, search on the original content is possible, but – Some additional information is available • Protocol metadata, timestamps and more • Time limitations, phrase search and other functionality
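Phrase search combined with a time limitation can be illustrated with a naive linear scan. The prototype relies on a commercial search engine, so nothing below reflects how that engine works; the `IndexedDocument` record and the `search` function are hypothetical names, and plain substring matching stands in for real phrase indexing.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class IndexedDocument:
    url: str
    harvested: datetime  # timestamp exported from the archive
    text: str            # original textual content


def search(docs, phrase, start=None, end=None):
    """Documents containing `phrase` (case-insensitive) harvested within [start, end]."""
    needle = " ".join(phrase.lower().split())
    hits = []
    for doc in docs:
        if start is not None and doc.harvested < start:
            continue
        if end is not None and doc.harvested > end:
            continue
        # Naive substring match after whitespace normalization.
        haystack = " ".join(doc.text.lower().split())
        if needle in haystack:
            hits.append(doc)
    return hits


docs = [
    IndexedDocument("http://a.example/", datetime(1999, 5, 1), "Legal deposit in Norway"),
    IndexedDocument("http://a.example/", datetime(2000, 5, 1), "Legal   deposit of the web"),
    IndexedDocument("http://b.example/", datetime(2000, 6, 1), "Deposit, legal or not"),
]
hits = search(docs, "legal deposit", start=datetime(2000, 1, 1))
print([d.harvested.year for d in hits])
```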
NWA: Search cont.
NWA: Time navigation • Given a location or service – The user should easily be able to go to the next/previous version • Using a Java-based time-line as time navigation tool
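Stepping to the next or previous version of a location amounts to a lookup in the sorted list of harvest times for that location. A minimal sketch, assuming those timestamps are already available as a sorted list (the function names are invented for the example, and this is not the Java time-line itself):

```python
from bisect import bisect_left, bisect_right
from datetime import datetime


def previous_version(timestamps, current):
    """Latest harvest time strictly before `current`, or None."""
    i = bisect_left(timestamps, current)
    return timestamps[i - 1] if i > 0 else None


def next_version(timestamps, current):
    """Earliest harvest time strictly after `current`, or None."""
    i = bisect_right(timestamps, current)
    return timestamps[i] if i < len(timestamps) else None


# Harvest times for one URL, sorted ascending (made-up data).
versions = [datetime(2000, 1, 10), datetime(2000, 3, 5), datetime(2000, 6, 1)]
print(previous_version(versions, datetime(2000, 3, 5)))
print(next_version(versions, datetime(2000, 3, 5)))
```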
NWA: Time navigation cont.
NWA: Space navigation • Given a point in time – The user should be able to go to some other service based on the URL • In the NWA prototype, the user can use original URLs as references to services within the archive
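One way to let original URLs act as references into the archive is to resolve each link on an archived page against the page's original address and wrap it in a request to the archive for the same point in time. The sketch below assumes a viewer endpoint at `http://archive.example/view` with `url` and `time` parameters; both are hypothetical and not taken from the prototype.

```python
from datetime import datetime
from urllib.parse import urlencode, urljoin

ARCHIVE_VIEWER = "http://archive.example/view"  # hypothetical endpoint


def archive_link(base_url, href, time):
    """Turn a link found on an archived page into a link into the archive."""
    # Resolve relative links against the page's original URL.
    original = urljoin(base_url, href)
    query = urlencode({"url": original, "time": time.strftime("%Y%m%d%H%M%S")})
    return f"{ARCHIVE_VIEWER}?{query}"


link = archive_link("http://www.nb.no/index.html", "om/kontakt.html",
                    datetime(2000, 3, 1))
print(link)
```

Keeping the original URL inside the rewritten link means the user stays at the chosen point in time while moving between services.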
NWA: Space navigation
NWA: Metadata • Few web resources contain user-produced metadata • HTTP contains some metadata, such as time of modification • Tagging of documents (like <TITLE>) can be viewed as metadata, and is passed on to the indexer
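The two metadata sources mentioned here, HTTP headers and document tagging such as <TITLE>, can be combined into one record for the indexer. A minimal sketch using only the Python standard library; the record layout and the `extract_metadata` name are assumptions for the example, not the format used by NWA.

```python
from email.utils import parsedate_to_datetime
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text inside the first-level <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so <TITLE> arrives as "title".
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_metadata(headers, html):
    """Build a metadata record from HTTP headers and HTML tagging."""
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    modified = headers.get("Last-Modified")
    return {
        "title": parser.title.strip() or None,
        "last_modified": parsedate_to_datetime(modified) if modified else None,
        "content_type": headers.get("Content-Type"),
    }


meta = extract_metadata(
    {"Last-Modified": "Tue, 15 Feb 2000 12:00:00 GMT", "Content-Type": "text/html"},
    "<HTML><HEAD><TITLE> Nasjonalbiblioteket </TITLE></HEAD><BODY>Hei</BODY></HTML>",
)
print(meta["title"], meta["last_modified"])
```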
NWA: Open Source? • Many good reasons pro, few contra • Dependent on third-party software! – Radical re-implementation to be independent
NWA: Scalability • Search engine extremely scalable
Further challenges • ”The deep web” • Dynamic and user dependent services • Continuity • Description/metadata • Access rights to archive! – This is the main obstacle
See also…. • http://www.openarchives.org • http://Sult.nb.no • http://Nwa.nb.no • http://www.dublincore.org • http://www.fast.no
That’s it! • Thank you for listening (if you were ;-) ) • Please contact me if there’s anything – But on email only! • svein.brygfjeld@nb.no