Workshop on Web Archiving
MODULE 2: EXISTING WEB ARCHIVES
Janne Nielsen, Asger Harlung, Ulrich Karstoft Have
Workshop SDU 19.09.2016, netlab.dk
Module 2: Existing Web Collections
• Introduction to web archives
• The Danish Netarkivet
• Internet Archive
• Library of Congress
• Other (US) web archives
• Ideas for NetLab workspace
Introduction to Web Archives
Focus on:
• The collection, including strategies
• Access
• Search
• Documentation
Netarkivet
• The collection, including strategies
• Access
• Search
• Documentation
• Netarkivet is run by the State and University Library (Aarhus) and the Royal Library (National Library of Denmark, Copenhagen).
• The Danish part of the Internet is defined as cultural heritage in the Legal Deposit Act (Act no. 1439 of 22.12.2004), effective from June 1st, 2005.
• The "Danish part of the Internet" = all Internet content in Danish or meant for Danes: the top-level domain .dk and danica (e.g. sites in Danish or addressing Danes on other domains such as .com, .eu, .nu, etc.)
• .dk domain names: 607,000 in July 2005, 960,000 in January 2013
• Dead .dk domains from July 2005 to January 2013: 741,838
• 2011: Roughly 222 TB; 6 m objects; most common file types are html, jpeg, gif and png
• 2013: Most common file types are html, jpeg, pdf and mp4 (video)
• 2014: On July 27 the data in Netarkivet amounted to 501 TB
• 2015: On November 15 the data comprised 654 TB
Netarkivet: Coverage
Strategies:
• Broad/bulk
• Selective
• Event
• Special
[Figure: timeline from 2005 showing broad, selective and event harvests. From http://netarkivet.dk/om-netarkivet]
Netarkivet: Access
Access is restricted to:
• researchers (online)
• thesis students (on-site)
No one else can get access.
Netarkivet: Search options
• Single URL search using the Wayback interface
Netarkivet: Search options
• Single URL search using the Wayback interface
• Free text search
NetLab is working on:
• multiple URL search
• file type search
Netarkivet: Documentation
Manual documentation:
• At collection level (netarkivet.dk, Word document)
• Curators (wiki)
Automated documentation:
• Harvesting data (metadata)
• Crawl logs, but not accessible yet
Internet Archive
• The collection, including strategies
• Access
• Search
• Documentation
The Internet Archive:
• American non-profit
• from 1996
• not based on national legislation
• in general based on cumulative archiving, following hyperlinks from what was already archived
• the world's largest collection of archived web
• more than 491 billion web pages; collects approx. 1 billion pages per week
• quality is erratic: often only the top level(s) are archived
• heterogeneous collection, no overall strategy, including donations…
Internet Archive: Access
• Free online access for everyone
Internet Archive: Search
• Search for individual URLs, displayed via the Open Wayback interface
Internet Archive: Documentation
• No accessible documentation for the URL except harvest time
• General documentation about how the Internet Archive harvests (FAQ)
Exercise in Web Archives
• Open the Internet Archive at https://archive.org
• Find one or more websites in the Wayback Machine.
• Move around on the website by clicking hyperlinks.
• Are elements missing, or do you notice anything else?
If you have access to Netarkivet, you can choose to do the exercise in Netarkivet: https://netarkiv-wayback.kb.dk
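For the exercise, it can help to know how Wayback Machine addresses are put together. The sketch below is not part of the workshop material. It assumes the public address pattern web.archive.org/web/<timestamp>/<URL>, where a "*" in place of the timestamp opens the overview of all captures. The function name wayback_url and the example address are our own choices.

```python
# Minimal sketch: build Wayback Machine addresses by hand.
# Assumption: the public pattern https://web.archive.org/web/<timestamp>/<url>,
# where <timestamp> is a prefix of YYYYMMDDhhmmss and "*" lists all captures.
from typing import Optional

WAYBACK_PREFIX = "https://web.archive.org/web"


def wayback_url(target: str, timestamp: Optional[str] = None) -> str:
    """Return a Wayback Machine address for the page `target`.

    Without a timestamp the address points to the capture overview.
    With a timestamp (4 to 14 digits, e.g. "2016" or "20160919") the
    Wayback Machine is expected to show the capture closest to that time.
    """
    if timestamp is None:
        return f"{WAYBACK_PREFIX}/*/{target}"
    if not (timestamp.isdigit() and 4 <= len(timestamp) <= 14):
        raise ValueError("timestamp must be 4 to 14 digits (YYYYMMDDhhmmss)")
    return f"{WAYBACK_PREFIX}/{timestamp}/{target}"


if __name__ == "__main__":
    # Overview of all captures of a page (example address only).
    print(wayback_url("http://netlab.dk/"))
    # Capture closest to the day of the workshop.
    print(wayback_url("http://netlab.dk/", "20160919"))
```

Paste the printed addresses into a browser. Which capture you actually get depends on what the Internet Archive has harvested.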
Funny observations?
Internet Archive: Archive-It
Archive-It is the Internet Archive's subscription web archiving service.
• A number of collections from their partners, including event collections
• Full-text searchable
• Archive-It Research Services (ARS) provides access to data sets extracted from collections (metadata, link graphs, named entities, other data).
• https://archive-it.org/
Library of Congress
• The collection, including strategies
• Access
• Search
• Documentation
Library of Congress web archive:
• from 2000
• curated, topic-based and selective collections
• harvested by the Internet Archive (not Archive-It)
• 763 TB
Library of Congress: Access
• Free online access for everyone, via LoC Wayback
• https://www.loc.gov/websites/collections/
• In many cases only a ‘flat’ image
Library of Congress: Search
• Search for individual URLs, displayed via the Open Wayback interface
• Full-text search in metadata
Library of Congress: Documentation
• Very well documented and curated
• Documentation about each collection, and about each website
Other (US) Web Archives
Other Web Archives
• IIPC Member Archives, http://netpreserve.org/resources/member-archives
• List of Web archiving initiatives, https://en.wikipedia.org/wiki/List_of_Web_archiving_initiatives
• Truman, G. (2016). Web Archiving Environmental Scan. Harvard Library Report. http://nrs.harvard.edu/urn-3:HUL.InstRepos:25658314
Ideas for NetLab workspace
The Four Phases in Research
[Diagram: the four phases are corpus creation, analysis, dissemination and storage. Labelled steps: Search, Duplicates, Identify, Evaluate, Select, Isolate, Select/remove/combine]
Ideas for NetLab Workspace
Challenges:
• Large amounts of data
• How to distinguish between the many versions?
• No visual representation
Needs:
• Different ways of filtering content
• Choosing and ‘bookmarking’ pages
• Isolation/extraction of corpus
• Flexible interface to present different metadata
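Two of the points above, filtering content and distinguishing between the many versions, can be illustrated with a small sketch. It is not the NetLab workspace and does not use any archive's real data format. The Capture record, its fields, the function names and the sample data are all invented for illustration. The assumption is that each archived version comes with a URL, a harvest time, a file type and a checksum of its content.

```python
# Minimal sketch: filter captures by file type and collapse identical versions.
# Assumption: the Capture record and the sample data are hypothetical.
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Capture:
    url: str
    timestamp: str  # harvest time as YYYYMMDDhhmmss
    mime: str       # file type, e.g. "text/html"
    digest: str     # checksum of the harvested content


def filter_by_mime(captures: List[Capture], mime_prefix: str) -> List[Capture]:
    """Keep only captures whose file type starts with `mime_prefix`."""
    return [c for c in captures if c.mime.startswith(mime_prefix)]


def distinct_versions(captures: List[Capture]) -> List[Capture]:
    """Keep the earliest capture for each (url, digest) pair, in time order.

    Captures of the same URL with the same checksum count as one version.
    """
    seen = set()
    kept = []
    for capture in sorted(captures, key=lambda c: c.timestamp):
        key = (capture.url, capture.digest)
        if key not in seen:
            seen.add(key)
            kept.append(capture)
    return kept


# Invented sample data: three harvests of one page, two of them identical.
SAMPLE = [
    Capture("http://example.dk/", "20160101120000", "text/html", "aaa"),
    Capture("http://example.dk/", "20160201120000", "text/html", "aaa"),
    Capture("http://example.dk/", "20160301120000", "text/html", "bbb"),
    Capture("http://example.dk/logo.png", "20160101120005", "image/png", "ccc"),
]

if __name__ == "__main__":
    for capture in distinct_versions(filter_by_mime(SAMPLE, "text/html")):
        print(capture.timestamp, capture.url)
```

Comparing checksums only finds exact duplicates. Pages that differ in a date stamp or an advertisement would still count as separate versions, so a real workspace would probably need looser comparisons as well.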
Inspiration: LARM.fm
Inspiration: Trello
Inspiration: Papers 2
Ideas for NetLab Workspace