Svein Arne Brygfjeld National Library of Norway Nordic Web Archive

The message of today • First: A summary • Second: Legal deposit in Norway • Third: Our digital library principles • Fourth: Harvesting, archiving and giving access to the web • Fifth: The prototype, a demonstration

Part one: Summary • Norwegian legislation on legal deposit: Includes digital information! • The National Library of Norway has a relatively advanced digital library activity • Nordic cooperation on methods and technology for legal deposit of the web • Nordic project on access to web archives

Part Two: Legal deposit in Norway • Legislation revised in 1989 • Includes all information carriers in the ”traditional domain”, like books, newspapers & more • Also including music and broadcast programs • And: Including the information living in the digital domain

The National Library of Norway (organization chart) • National Librarian: Bendik Rugaas • Administration, IT & Innovation • Oslo Division (100 employees): Administration, IT, Public Collections, Bibliographic, Norwegian Music • Rana Division (Svein Arne) (200 employees): Administration, IT, Technical, Repository, Legal Deposit, Media Lab, Sound & Image

The challenge: • Preserving the cultural heritage represented by the world-wide web – Including harvesting and archiving • Giving access to historical web archives – …Nordic Web Archive access project

But first: Part three • Our digital library principles…

One strategy for most digital objects • One large long-term digital repository • All storage, long-term preservation and access based on this infrastructure

Our Digital Library reference model • Digital Library application layer – Search engines – Personalization – Specialized applications – Collecting applications • Digital objects – Text, audio, still images, moving images, web pages & more – Metadata (DC) • Repository functionality & organization – Identification (URN) – Migration – Quality and formats – IPR/Copyrights/Access control • General storage facility – Unix servers – Fault-tolerant disk systems – Tape libraries – HSM

Examples of current use • Digital Radio Archive – Digitization & archiving of 50,000 hrs • Galleri NOR – Still images in high quality • Historical newspapers – Images of pages as well as OCR-based text

And now… • …the preservation of the web!

Preserving the web: some focus areas • Harvesting & collecting it all • Archiving – Identification, versions, metadata, long-term preservation • Access to the archive

Harvesting • Is it possible? – Have a look at the search engines • Available software – Public domain/Open Source • NEDLIB – Commercial • several

Harvesting: Resolution in time • Snapshots vs continuous • Continuous: – Wanted for services considered interesting and with rapid updates – Dependent on use of software agents placed at the publisher

Everything or bits & pieces • Questions to be answered: – What is (technically) possible? – What do we want? – What level of metadata do we need?

Archiving • Different models in the five countries (probably) • The Norwegian model is based on use of the library’s general storage facilities • Close integration with other digital objects • Online or near-line

Long-term preservation • Migration – So far our choice • Emulation – Technically complicated • Museum – Hard to do over time

And now… • …access to web archives

Nordic Web Archive • A context for cooperation to find common technology and methods to harvest, archive and give access to the web • Current focus on access to archives – Small, focused project

NWA: Members • Denmark (Royal Library) • Finland (National Library) • Iceland (National Library) • Norway (National Library), project mgmt • Sweden (Royal Library) • Nordunet 2

NWA: Current scope • Focus on access to web archives • NOT harvesting • NOT archiving

NWA: Main choices • General and well-specified interface to the archive • Search (and navigation) through the use of a commercial search engine • Access based on search and navigation/browsing • Support for navigation in time and space

NWA: Architecture • Diagram labels: COMMON FORMAT, INDEXES, INDEXER, SEARCH ENGINE, WEB INTERFACE, XML, URN, DOCUMENT INDEXER, ARCHIVE ACCESS • Calls: FIND_DOCUMENT(URN), FIND_ID(URL, TIME)
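
The architecture slide names two calls, FIND_DOCUMENT(URN) and FIND_ID(URL, TIME). A minimal sketch of how such an archive interface might behave is given here, not the prototype's actual code. The class name, the in-memory storage, the URN scheme and the rule that FIND_ID returns the latest capture at or before the given time are all assumptions made for illustration.

```python
from bisect import bisect_right
from datetime import datetime


class InMemoryArchive:
    """Toy archive exposing the two calls named on the architecture slide."""

    def __init__(self):
        self._captures = {}   # url -> sorted list of (timestamp, urn)
        self._documents = {}  # urn -> archived content

    def store(self, url, timestamp, content):
        # Hypothetical URN scheme, for illustration only.
        urn = f"urn:example:nwa:{len(self._documents) + 1}"
        self._documents[urn] = content
        self._captures.setdefault(url, []).append((timestamp, urn))
        self._captures[url].sort()
        return urn

    def find_id(self, url, time):
        """Return the URN of the latest capture of `url` at or before `time`, or None."""
        captures = self._captures.get(url, [])
        timestamps = [ts for ts, _ in captures]
        i = bisect_right(timestamps, time)
        return captures[i - 1][1] if i else None

    def find_document(self, urn):
        """Return the archived content for `urn`, or None if unknown."""
        return self._documents.get(urn)


# Made-up captures of a placeholder URL.
archive = InMemoryArchive()
archive.store("http://example.org/", datetime(2000, 1, 10), "<html>v1</html>")
archive.store("http://example.org/", datetime(2000, 6, 1), "<html>v2</html>")
urn = archive.find_id("http://example.org/", datetime(2000, 3, 1))
print(archive.find_document(urn))
```

Keeping the interface this narrow means the indexer and the web interface only need a URN, or a URL plus a time, to reach a document, whatever storage model sits behind it.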

NWA: The technology • Based on commercial search engine from Fast Search & Transfer • In-house development on Linux platform – XML, PHP, Perl and Java – Probably Open Source – General web user interface (no additional plugins needed)

NWA: Search engine motivations • Motivation – Support for search functionality on text documents – Speed – Reduced complexity in implementation

NWA: Search engine benefits • (in addition to fulfilling the motivations) – Extreme scalability – Support for distributed searching – Easy integration with other indexes – Integrated language technologies (limited)

NWA: Access methods • Main principles: – The web seen in the archive should look like it did on the net – It should be available through the use of an ordinary web browser • Three main methods – Search, navigation and browsing

NWA: Search • Search based on search engine • Indexes based on exports from archives – In general, search on the original content is possible, but – Some additional information is available • Protocol metadata, timestamps and more • Time restrictions, phrase search and other functionality

NWA: Search cont.

NWA: Time navigation • Given a location or service – The user should easily be able to go to the next/previous version • Using a Java-based time-line as time navigation tool
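
Moving to the next or previous version of a location can be sketched as a lookup in the sorted list of capture times for that location. This is only an illustration of the idea, not the time-line tool itself. The function names and the sample dates are hypothetical.

```python
from bisect import bisect_left, bisect_right


def previous_version(timestamps, current):
    """Latest capture time strictly before `current`, or None."""
    i = bisect_left(timestamps, current)
    return timestamps[i - 1] if i > 0 else None


def next_version(timestamps, current):
    """Earliest capture time strictly after `current`, or None."""
    i = bisect_right(timestamps, current)
    return timestamps[i] if i < len(timestamps) else None


# Made-up capture dates for one URL; ISO dates sort correctly as strings.
captures = ["2000-01-10", "2000-06-01", "2001-02-15"]
print(previous_version(captures, "2000-06-01"), next_version(captures, "2000-06-01"))
```

Both functions return None at the ends of the list, which a time-line could use to disable the corresponding button.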

NWA: Time navigation cont.

NWA: Space navigation • Given a point in time – The user should be able to go to some other service based on the URL • In the NWA prototype, the user can use original URLs as references to services within the archive
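
One way to let original URLs serve as references into the archive is to turn each link on an archived page into an archive request that carries the original URL and the current point in time. The sketch below assumes this approach. The /archive path and the url and time parameter names are hypothetical and not taken from the prototype.

```python
from urllib.parse import urlencode, urljoin

ARCHIVE_ENDPOINT = "/archive"  # hypothetical path, not the prototype's real one


def archive_link(page_url, href, time):
    """Map a link found on an archived page to an archive request for the same point in time."""
    original = urljoin(page_url, href)  # resolve relative links against the original URL
    return ARCHIVE_ENDPOINT + "?" + urlencode({"url": original, "time": time})


# Placeholder page and link, for illustration only.
print(archive_link("http://example.org/index.html", "about/staff.html", "2000-06-01"))
```

Because the link keeps the time, following it stays in the same period of the archive instead of jumping to the live web.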

NWA: Space navigation

NWA: Metadata • Few web resources contain user-produced metadata • HTTP contains some metadata, like time of modification and more • Tagging of documents (like <TITLE>) can be viewed as metadata, and is passed on to the indexer
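
As an illustration of combining protocol metadata with document tagging, the sketch below builds a small record from an HTTP Last-Modified header and the <TITLE> element. The record layout and the function and class names are assumptions, not the format actually passed to the indexer.

```python
from email.utils import parsedate_to_datetime
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collect the text inside <title> elements (tag names arrive lowercased)."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def index_record(url, http_headers, html):
    """Build a record for an indexer from protocol metadata and document markup."""
    parser = TitleParser()
    parser.feed(html)
    last_modified = http_headers.get("Last-Modified")
    return {
        "url": url,
        "title": parser.title.strip(),
        "modified": parsedate_to_datetime(last_modified).isoformat() if last_modified else None,
    }


# Made-up response for a placeholder URL.
record = index_record(
    "http://example.org/",
    {"Last-Modified": "Mon, 05 Jun 2000 10:00:00 GMT"},
    "<HTML><HEAD><TITLE> Example page </TITLE></HEAD><BODY>Hello</BODY></HTML>",
)
print(record)
```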

NWA: Open Source? • Many good reasons pro, few contra • Dependent on third-party software! – Radical re-implementation to be independent

NWA: Scalability • Search engine extremely scalable

Further challenges • ”The deep web” • Dynamic and user dependent services • Continuity • Description/metadata • Access rights to archive! – This is the main obstacle

See also… • http://www.openarchives.org • http://Sult.nb.no • http://Nwa.nb.no • http://www.dublincore.org • http://www.fast.no

That’s it! • Thank you for listening (if you were ;-) ) • Please contact me if there’s anything – But on email only! • svein.brygfjeld@nb.no