Australian Newspapers Digitisation Program: Development of the Newspapers Content Management System
Rose Holley, ANDP Manager
ANPlan/ANDP Workshop, 28 November 2008

Requirements
  • Manage, store and organise millions of digital newspaper pages behind the scenes.
  • Manage the entire digitisation workflow from scanning to public delivery.

How?
  • Current NLA Digital Content Management System cannot cope with the volume of digital newspapers or the complex structure of newspapers
  • No ‘off the shelf’ product available that meets requirements
  • Need the system now (March 2007)

Solution
  • NLA team to develop a software solution
  • Ensure the system uses open source software
  • System to be standalone and not bolted into other systems
  • Possibility of sharing system in future/providing as open source to other libraries

Software Development
  • Agile method of development used
  • Modules designed in stages as required
  • Stage 1: Receipt and checking of scanned images
  • Stage 2: Quality Assurance Modules
  • Stage 3: Sending/receiving items from OCR
  • Stage 4: System Administration and Statistics
  • Stage 5: Interface Design and Usability of System

Progress
  • Software development March 2007 to June 2008
  • First module in use May 2007
  • CMS in use for 18 months
  • CMS in final stages of completion (Jan to June 2009)
  • Further development required to enable acceptance of contributors' content
  • Simple user interface yet to be designed

Australian Newspapers CMS
  • Screenshots of the system follow, with an explanation of the workflows.

Workflow Summary
  • Preparing for Digitisation
  • Creation of digital images
  • Adding metadata and Quality Assurance
  • Optical Character Recognition
  • Quality Assurance
  • Statistics and Admin

Preparing for Digitisation
  • Identify title to be digitised
  • Source master microfilm from owner
  • Send master microfilm to scanning contractors
  • Add title to Content Management System

CMS - Add Title

Microfilm converted to digital images

Image Reception
  • Images received from scanning contractor on LTO 2 tape
  • Tapes added to tape robot and extracted
  • Reels automatically added to Content Management System
  • Reel details are checked
  • Images ingested into Content Management System

CMS - Check Reel Details

CMS - Ingest Reels

CMS - Tasks 1 and 2
  • Task 1: Add metadata (dates and page numbers)
  • Supervisor reviews marked pages
  • Task 2: Define batches
  • Task 2: Resolve duplicates
  • Task 2: Create missing page targets

Identify title to be worked on

Identify reel

CMS - Adding Metadata
  • Date and page sequence number added

Supervisor Review
  • Supervisor reviews pages marked for attention

CMS - Define Batches
  • Batches defined by date
  • Each batch contains 2-3000 images
  • Batches are automatically assigned a number
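
The batching rules above can be sketched roughly as follows. This is an illustration only, not the CMS's actual logic: the function name `define_batches`, the `max_images` parameter, and the rule that an issue date is never split across two batches are all assumptions.

```python
from collections import defaultdict
from datetime import date


def define_batches(pages, max_images=3000, first_batch_number=1):
    """Group (issue_date, image_name) pairs into numbered batches.

    Assumed rules (not taken from the CMS): pages are processed in date
    order, all pages sharing an issue date stay together, and a new batch
    starts when adding the next date would exceed max_images.
    """
    by_date = defaultdict(list)
    for issue_date, image_name in pages:
        by_date[issue_date].append(image_name)

    batches = {}
    number = first_batch_number
    current = []
    for issue_date in sorted(by_date):
        images = by_date[issue_date]
        if current and len(current) + len(images) > max_images:
            batches[number] = current  # close the batch and assign its number
            number += 1
            current = []
        current.extend(images)
    if current:
        batches[number] = current
    return batches


# Hypothetical page list with a tiny limit so the split is visible.
pages = [
    (date(1900, 1, 2), "p3.tif"),
    (date(1900, 1, 1), "p1.tif"),
    (date(1900, 1, 1), "p2.tif"),
    (date(1900, 1, 3), "p4.tif"),
]
batches = define_batches(pages, max_images=3)
```

Keeping each issue date whole means a batch can be described by a simple date range, which is one plausible reading of "batches defined by date".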

CMS - Resolve Duplicates
  • Duplicate pages are compared and the best copy is selected

Missing Pages
  • Missing page targets are generated
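
One way missing page targets could be derived is by looking for gaps in the page sequence numbers recorded for an issue. The sketch below assumes pages are numbered from 1; the function name and the record fields are hypothetical and do not describe the CMS's real data model.

```python
def missing_page_targets(issue_date, pages_present, expected_last_page=None):
    """Return placeholder records for page numbers absent from an issue.

    Assumption: pages are numbered from 1. If expected_last_page is not
    given, the highest page number seen is used as the end of the issue.
    """
    if expected_last_page is None:
        expected_last_page = max(pages_present, default=0)
    present = set(pages_present)
    return [
        {"date": issue_date, "page": n, "type": "missing-page-target"}
        for n in range(1, expected_last_page + 1)
        if n not in present
    ]


# Hypothetical issue in which pages 3 and 5 were never filmed.
targets = missing_page_targets("1900-01-01", [1, 2, 4, 6])
```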

Optical Character Recognition (OCR)
  • Complete batches are added to a tape
  • Tapes are generated and written
  • Tapes sent to OCR contractor
  • Contractor completes OCR processes
  • OCR data (not images) is returned via FTP

CMS - Tapes Created
  • Completed batches added to a tape

Optical Character Recognition (OCR) of pages and article zoning

OCR Data Reception (automated process)
  • OCR contractor advises NLA server that a batch has been completed
  • NLA server downloads the batch
  • Batch is ingested into Content Management System
  • Checks are performed on data validity
  • QA derivatives are generated
  • Articles may now be searched, but are not yet publicly accessible
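
The reception steps above form a fixed sequence, which could be modelled as a simple state progression. The state names and the `advance` function below are hypothetical; the only rule taken from the slides is that a batch becomes searchable inside the CMS before it is public, and goes public only once QA accepts it.

```python
from enum import Enum


class BatchState(Enum):
    # Hypothetical state names mirroring the steps listed above.
    NOTIFIED = 1           # contractor has advised that the batch is complete
    DOWNLOADED = 2         # NLA server has fetched the batch
    INGESTED = 3           # batch loaded into the CMS
    VALIDATED = 4          # data validity checks passed
    DERIVATIVES_READY = 5  # QA derivatives generated
    SEARCHABLE_IN_CMS = 6  # searchable internally, not yet public
    PUBLIC = 7             # released after QA acceptance


def advance(state, qa_accepted=False):
    """Move a batch to its next state; the public step waits for QA."""
    if state is BatchState.PUBLIC:
        return state
    if state is BatchState.SEARCHABLE_IN_CMS and not qa_accepted:
        return state
    return BatchState(state.value + 1)


# Run the automated steps; the batch stops short of public release.
state = BatchState.NOTIFIED
for _ in range(10):
    state = advance(state)
```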

CMS - Batch information

Quality Assurance (QA)
  • A random sample of issues and articles is checked
  • Volume and issue number are checked for accuracy
  • Sample articles are checked against agreed Quality Acceptance Criteria (QAC)
  • Error rates are calculated against QAC on the fly
  • Supervisor checks final results
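
The sampling and error-rate steps above might look something like the sketch below. The slides do not state the sample size or the QAC thresholds, so `sample_size` and `max_error_rate` are placeholder parameters, and all function names are hypothetical.

```python
import random


def sample_articles(article_ids, sample_size, seed=None):
    """Draw a random sample of articles for checking, without replacement."""
    rng = random.Random(seed)
    return rng.sample(article_ids, min(sample_size, len(article_ids)))


def error_rate(check_results):
    """check_results holds (items_checked, errors_found) per article."""
    checked = sum(c for c, _ in check_results)
    errors = sum(e for _, e in check_results)
    return errors / checked if checked else 0.0


def batch_decision(check_results, max_error_rate):
    """Suggest accept/reject; max_error_rate is an assumed threshold."""
    rate = error_rate(check_results)
    return ("accept" if rate <= max_error_rate else "reject"), rate


sample = sample_articles(list(range(100)), sample_size=10, seed=1)
# Two articles checked: 4 errors across 200 checked items.
decision, rate = batch_decision([(100, 1), (100, 3)], max_error_rate=0.05)
```

Because `error_rate` works on whatever results have been entered so far, it can be re-run after each article is checked, which is one way to read "on the fly". The supervisor's final check would still sit on top of any automatic suggestion.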

CMS - Selecting the batch

Volume & Issue Number Check

Article checked against QAC

Re-keyed fields checked for accuracy

Supervisor checks results (auto or manual accept/reject)

QA Results
  • Automated email sent to supplier advising the result
  • Emails for rejected batches include a summary of errors
  • Summary of errors saved for all batches
  • Accepted batches are immediately accessible in public search system
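
A result email of the kind described above could be assembled with Python's standard `email` package. This is a sketch only: the function name, subject wording, error categories and addresses are invented for illustration, and sending the message (for example over SMTP) is left out.

```python
from email.message import EmailMessage


def build_qa_email(batch_number, accepted, errors, supplier_address, sender_address):
    """Build the result email; only rejected batches get an error summary."""
    result = "ACCEPTED" if accepted else "REJECTED"
    msg = EmailMessage()
    msg["From"] = sender_address
    msg["To"] = supplier_address
    msg["Subject"] = f"Batch {batch_number}: {result}"

    lines = [f"Batch {batch_number} has been {result.lower()}."]
    if not accepted:
        lines.append("")
        lines.append("Summary of errors:")
        lines.extend(f"- {name}: {count}" for name, count in sorted(errors.items()))
    msg.set_content("\n".join(lines))
    return msg


# Hypothetical error categories and placeholder addresses.
rejected = build_qa_email(
    42, False, {"date": 2, "title": 1}, "supplier@example.org", "qa@example.org"
)
accepted = build_qa_email(43, True, {}, "supplier@example.org", "qa@example.org")
```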

Batch history and details retained

Search or browse articles within CMS

Statistics
  • Statistics for content received, QA’d and delivered to the public are generated by the Content Management System
  • (Statistics for usage of the public search system are collected using Google Analytics)

CMS - Content Statistics

CMS - Work Statistics

Access
  • Public access to digital newspapers is provided through the Australian Newspapers Search and Delivery System
  • Users can search or browse newspapers
  • Search results can be refined using filters
  • Users can browse by newspaper title or date

http://ndpbeta.nla.gov.au/ndp/del/home