Current Challenges in Digitization
Ivo IOSSIGER
President Director General

Digitization Brings Documents Online
paper content → digital images → digital files → digital content
• Books, magazines and newspapers are precious collections of knowledge and reference.
• Allow users to search and access valuable content through the most recent online technologies.
The new trend: "What is not online does not exist!"

The Problem
• Content and knowledge are imprisoned between the pages of books.
• Millions of documents are written in various languages.
• How to digitize while preserving the book or the unique copy?

Challenges of Books
• Handle the large variety of:
  – page formats
  – paper types
  – binding types
  – soft or rigid covers
• Turn all pages of a book one by one and present them to a camera.
• Accurately reproduce the appearance of every page:
  – page layout
  – calligraphy
  – images

Challenges of Pages
• Enhancing images of pages:
  – remove typical artifacts (borders, split pages)
  – remove show-through of print from the opposite page (unbleed)
  – correct rotation of page (deskew)
  – correct page curvature of bound documents
  – enhance contrast and clear page background
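To make the last enhancement step concrete, here is a minimal sketch of contrast stretching on a grayscale page. It is not the method used by any tool named in the slides: the function name `stretch_contrast` and the `black_point` and `white_point` parameters are assumptions chosen for illustration, and the page is modeled as a plain list of rows of 0-255 gray levels.

```python
def stretch_contrast(page, black_point, white_point):
    """Linearly map gray levels so black_point -> 0 and white_point -> 255.

    Pixels darker than black_point become pure black and pixels brighter
    than white_point become pure white, which clears a grayish background.
    """
    if white_point <= black_point:
        raise ValueError("white_point must be greater than black_point")
    span = white_point - black_point
    result = []
    for row in page:
        new_row = []
        for value in row:
            # Clip to the chosen range before rescaling.
            clipped = min(max(value, black_point), white_point)
            new_row.append(round((clipped - black_point) * 255 / span))
        result.append(new_row)
    return result


# Hypothetical 2x2 page: dark ink at 50, paper background at 230.
page = [[50, 100], [150, 230]]
cleaned = stretch_contrast(page, black_point=50, white_point=200)
print(cleaned)  # [[0, 85], [170, 255]]
```

Choosing the two points per page (for example from the image histogram) is left out here; a production pipeline would also need the deskew and curvature corrections listed above.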

Challenges of Text
• Recognize text and make it accessible for full-text search, copy and paste.
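Once OCR has produced text for each page, full-text search can be illustrated with a small inverted index. This is a sketch under stated assumptions, not the search engine behind any product in the slides: the names `build_index` and `search` and the sample page texts are hypothetical.

```python
import re
from collections import defaultdict


def build_index(pages):
    """Map each lowercase word to the set of page numbers containing it."""
    index = defaultdict(set)
    for page_number, text in pages.items():
        for word in re.findall(r"\w+", text.lower()):
            index[word].add(page_number)
    return index


def search(index, query):
    """Return the sorted page numbers that contain every word of the query."""
    words = re.findall(r"\w+", query.lower())
    if not words:
        return []
    hits = set(index.get(words[0], set()))
    for word in words[1:]:
        hits &= index.get(word, set())
    return sorted(hits)


# Hypothetical OCR output, keyed by page number.
ocr_pages = {
    1: "Books are precious collections of knowledge.",
    2: "Newspapers and magazines are collections of reference.",
    3: "What is not online does not exist.",
}
index = build_index(ocr_pages)
print(search(index, "collections of"))  # [1, 2]
print(search(index, "online"))          # [3]
```

Exact word matching like this is sensitive to OCR mistakes: a misrecognized word is simply never found, which is one reason recognition accuracy matters for search.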

Challenges of Costs
• Equipment and production tools
  – are increasingly powerful and reliable
  – incur fixed costs
• Human labor
  – is expensive and slow
  – incurs variable costs
• Logistics and workflow
  – are sophisticated and diverse
  – incur variable costs
How to reduce costs per page?
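The split into fixed and variable costs can be turned into a simple cost-per-page model. The sketch below is illustrative only: the function names and every figure in it are hypothetical and do not come from the presentation.

```python
def cost_per_page(fixed_costs, variable_cost_per_page, pages):
    """Fixed costs are spread over the volume; variable costs apply to each page."""
    if pages <= 0:
        raise ValueError("pages must be positive")
    return fixed_costs / pages + variable_cost_per_page


def break_even_pages(fixed_a, variable_a, fixed_b, variable_b):
    """Volume at which option B (higher fixed, lower variable cost) matches option A."""
    if variable_a <= variable_b:
        raise ValueError("option B must have the lower variable cost")
    return (fixed_b - fixed_a) / (variable_a - variable_b)


# Hypothetical figures: manual scanning vs. an automated line.
manual = {"fixed": 10_000, "variable": 0.20}
automated = {"fixed": 200_000, "variable": 0.03}

for volume in (100_000, 5_000_000):
    m = cost_per_page(manual["fixed"], manual["variable"], volume)
    a = cost_per_page(automated["fixed"], automated["variable"], volume)
    print(f"{volume:>9} pages: manual {m:.3f}, automated {a:.3f} per page")

threshold = break_even_pages(manual["fixed"], manual["variable"],
                             automated["fixed"], automated["variable"])
print(f"break-even at about {threshold:,.0f} pages")
```

Under these assumed numbers, equipment with high fixed costs only pays off at large volumes, where the fixed share per page becomes small and the lower variable cost dominates.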

The Solution Everybody is Looking for
How to digitize a large number of books as quickly and as cheaply as possible while preserving superior quality?
• automatic solution without operator: unattended
• superior productivity beyond humans: faster
• ensure digitization of all pages: reliable
• produce high quality images: reliable
Increase VOLUME, Keep QUALITY, Decrease PRICE

The Solution of the Past
An operator turns pages all day long... an endless task
– forbidding and tedious task
– performance limited by concentration and fatigue
– irregular quality depending on the individual
– contradiction between performance and motivation

The Solution of the Future

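Technology and Production Challenges
[Diagram: production workflow from paper content to digital content]
• Input and output: Paper Content, Digital Content
• Process steps: Image Scanning, Image Treatment, OCR, Indexing, Structuring, Providing Online
• Tools and vendors shown: 4DigitalBooks, ABBYY, Digitizing Line, Page Improver, Recognition Server, Digital Library, Text Search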

Challenges of File Formats
[Timeline; offsets are years relative to "you are here" (0); categories shown: Text, Image & Text]
  -17   1992   JPEG, TIFF
  -16   1993   PDF (Acrobat 1)
  -11   1998   XML
   -9   2000   JPEG 2000
   -8   2001   PDF hidden text (Acrobat 5)
   -4   2005   PDF/A
   -1   2008   PDF is an open standard (Acrobat 9)
    0          you are here
Which formats will survive? The most popular and widespread!

Challenges of Storage
Bitmap image, A4 at 300 dpi: GS 8.5 MB, RGB 25.5 MB, BW 1.0 MB

Format                     Ratio      GS %      RGB %     BW %
Lossless quality (recommended for multiple edits)
• TIFF (.TIF)              1:1        100       300       12
• TIFF LZW (.TIF)          2:1        ~50       ~150      ~6
• TIFF CCITT G4 (.TIF)     100:1                          ~1
Lossy quality (not recommended for multiple edits)
• JPEG (.JPG)              5-10:1     ~20-10    ~25-12
• JPEG 2000 (.JP2)         6-12:1     ~16-8     ~20-10

Archive or reference files are NOT intended for multiple edits. Therefore all these formats are good for long-term preservation.
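The uncompressed sizes quoted on this slide can be checked with a short calculation. The sketch below assumes an A4 page of 210 × 297 mm, 8 bits per pixel for grayscale, 24 for RGB and 1 for black and white; the function names are hypothetical. The results land close to the slide's figures but not exactly on them, since the rounding convention used on the slide is not stated.

```python
import math

MM_PER_INCH = 25.4


def bitmap_bytes(width_mm, height_mm, dpi, bits_per_pixel):
    """Size in bytes of an uncompressed bitmap with byte-aligned rows."""
    width_px = round(width_mm / MM_PER_INCH * dpi)
    height_px = round(height_mm / MM_PER_INCH * dpi)
    bytes_per_row = math.ceil(width_px * bits_per_pixel / 8)
    return bytes_per_row * height_px


def compressed_bytes(size_bytes, ratio):
    """Estimated size after compression at ratio:1 (for example 2 for LZW at 2:1)."""
    return size_bytes / ratio


A4 = (210, 297)  # millimetres
gray = bitmap_bytes(*A4, dpi=300, bits_per_pixel=8)
rgb = bitmap_bytes(*A4, dpi=300, bits_per_pixel=24)
bw = bitmap_bytes(*A4, dpi=300, bits_per_pixel=1)

for label, size in (("GS", gray), ("RGB", rgb), ("BW", bw)):
    print(f"{label:>3}: {size:>10,} bytes = {size / 1e6:.1f} MB")

# Example: grayscale page stored as TIFF LZW at the slide's 2:1 ratio.
print(f"GS with LZW 2:1: about {compressed_bytes(gray, 2) / 1e6:.1f} MB")
```

Multiplying the per-page figure by the page count of a collection gives a first estimate of storage needs for each format.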

Challenges Joining Digital and Paper Born Content
[Diagram comparing output formats for paper born and digital born content]
• Paper born: PDF Image (large size), accurate to original; via OCR: PDF Text (small size), subject to OCR mistakes; PDF Image over Text (PDF hidden text)
• Digital born: Word Text (small size), PDF Text (small size)
• Capabilities shown: read & print; search; copy & paste

Challenges Selecting Source Material - Microfilm or Digital
[Diagram comparing microfilm and digital capture paths]
• Computer output microfilm (COM), microfilm printer: analog media, lifetime 500 y (STORAGE)
• Digital media: lifetime 20 y (NEEDS MIGRATION); possible image enhancement and restoration
• Microfilm camera, microfilm scanner: 5-10% quality loss
• Book scanner: no quality loss
• Microfilm: analog media, lifetime 500 y (REFERENCE)

Challenges to Bring Files Online
