Census Data Capture Challenge Intelligent Document Capture Solution

  • Slides: 28

Amir Angel, Director of Government Projects
UNSD Workshop, Minsk, Dec 2008

The evolution of data capture in census projects
Five steps: from OCR to IDR
Solution: eFLOW

The evolution of data capture in census projects
Manual data entry (key from paper)
– Slow process
– High error rate in the data entry process
– Recruitment, training and management of personnel
Key from image:
– Archive
– Approx. 20% faster than key from paper

The evolution of data capture in census projects
OMR (hardware readers for checkboxes)
– Requires special scanners and specially printed forms
– Cannot handle handwritten/printed data
– Forms are not user-friendly
– OMR requires more answers => more space => increased paper expenditures => more handling and printing costs
– Not flexible, difficult to adjust to other applications once the census is over
– No possibility to add business rules: imputation, validations, coding

The evolution of data capture in census projects
Automated Data Capture (eFLOW)
– Requires less human intervention and makes it possible to complete the census data capture much faster (less space, lower salary costs, less hardware)
– Full flexibility in the type of data gathered (checkbox, OMR, handwritten, alpha and numeric, barcode…)
– Ensures data integrity: enables both automatic and manual online validations, exception handling and coding
– The most advanced and proven technology for censuses, recommended by the UN and used by all modern countries for census projects
– Creates a correlation between the image and the actual form
– Remote capabilities enable all forms to be scanned locally and then sent to a central site for processing

The evolution of data capture in census projects
Intelligent data capture platform (IDR) (eFLOW)
Intelligent data capture using OCR/ICR/OMR/PDA/Web/email:
– Automated data capture, plus
– Automatic classification of documents
  § Understands and differentiates between various types of documents and languages
  § Based on state-of-the-art machine learning algorithms
  § Artificial intelligence algorithms provide enough information for the system to find the location of the fields on its own

Traditional Data Capture
Mail Room → Document prep → Sorting → Scanning → Data Entry (manual key from image) → Back Office → End Users

Intelligent Document Capture
Mail Room → Document prep → Scanning → Data Entry → Back Office → End Users
– No sorting
– Reduce manual data entry by 40-70%
– Increase accuracy and consistency

India 2001, Turkey 1997, Brazil 2000, South Africa 2001, Ireland 2002, Italy 2002, Cyprus 2002, Turkey 2000, Kenya 2000, Slovak Republic 2001, Hong Kong 2001, Thailand 2008 (Community), Slovenia 2006, Hong Kong 2006, South Africa Survey 2007, Ireland 2006

Manual vs. Automated Data Capture = time saving
Saving of 25%
Saving of 50%
(Source: CSO – Central Statistics Office, Ireland)

The technology is there
§ No need to reinvent the wheel
§ Reducing risks by using ‘off-the-shelf’ technologies

Data Types: OCR, ICR, OMR

Automatic Recognition (ICR)
* = unrecognized character
A*C*EF
12345*7

Improve Recognition – Voting mechanism

Voting Single Engine vs. Virtual Engines

Figure Of Merit Example
A system recognizes 90% of the characters contained in a batch, but misclassifies 4%:
90 - (10*4) = 50
The Figure Of Merit in this example is 50.
A system recognizes 80% of the characters contained in a batch, but misclassifies 1%:
80 - (10*1) = 70
The Figure Of Merit in this example is 70.
The second system is more efficient.
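The arithmetic in the Figure Of Merit example can be written as a small helper. This is only a sketch: the function name and the `penalty` parameter are illustrative assumptions, and the weight of 10 per misclassified percent is taken from the two calculations on the slide rather than from any stated standard.

```python
def figure_of_merit(recognized_pct, misclassified_pct, penalty=10):
    """Recognition rate minus a weighted penalty for misclassified characters.

    The default penalty of 10 mirrors the slide's arithmetic; treat it as an
    assumption, not a fixed rule.
    """
    return recognized_pct - penalty * misclassified_pct


# The two systems from the slide: (recognized %, misclassified %)
systems = {"first": (90, 4), "second": (80, 1)}

for name, (recognized, misclassified) in systems.items():
    print(name, figure_of_merit(recognized, misclassified))

# Pick the system with the highest figure of merit.
best = max(systems, key=lambda name: figure_of_merit(*systems[name]))
print("more efficient:", best)
```

The heavy penalty reflects the point of the slide: a misclassified character passes through unnoticed, while an unrecognized one is at least flagged for manual completion.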

Benefits of Multiple ICRs 28956374316785

Unique Tiling station – Checking for false positives
§ Identify false positives
§ Alpha & numeric fields
§ Highlight for verification
§ Quality control for ICR

Voting Methods Example
§ Assume we have a virtual engine (V. engine) that includes 4 engines
§ We want to identify the following number: 253478
§ The results of each engine are:
  Engine 1: 25***8
  Engine 2: 2*5378
  Engine 3: 253478
  Engine 4: 2*34*8
§ The final results of the V. engine will be:
  Safe: 2****8
  Normal: 25**78
  Majority: 253478
  Order: 255378
  Equalizer: ???
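The voting example above can be reproduced with a short sketch. The slide does not define the methods, so the rules below are assumptions inferred from its results: Safe keeps a character only when every engine recognized it and all agree; Normal keeps it when the engines that recognized it all agree; Majority takes the most frequent recognized character (a tie is left unrecognized); Order takes the first engine, in priority order, that recognized anything. Engine 3's result is read as 253478, which is the reading consistent with all four results on the slide. Equalizer is omitted because the slide gives no result for it. All function names are illustrative.

```python
from collections import Counter

UNKNOWN = "*"  # the slide's marker for an unrecognized character


def vote_position(chars, method):
    """Combine one character position across engines (rules are assumptions)."""
    recognized = [c for c in chars if c != UNKNOWN]
    if not recognized:
        return UNKNOWN
    if method == "safe":
        # Every engine must recognize the character and all must agree.
        all_agree = len(recognized) == len(chars) and len(set(chars)) == 1
        return chars[0] if all_agree else UNKNOWN
    if method == "normal":
        # Engines that recognized something must not contradict each other.
        return recognized[0] if len(set(recognized)) == 1 else UNKNOWN
    if method == "majority":
        ranked = Counter(recognized).most_common()
        if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
            return UNKNOWN  # tie: no clear winner
        return ranked[0][0]
    if method == "order":
        # First engine in priority order that recognized a character.
        return recognized[0]
    raise ValueError(f"unknown voting method: {method}")


def vote(results, method):
    """Apply a voting method position by position to equal-length results."""
    return "".join(vote_position(chars, method) for chars in zip(*results))


engines = ["25***8", "2*5378", "253478", "2*34*8"]  # engines 1 to 4

for method in ("safe", "normal", "majority", "order"):
    print(f"{method:>8}: {vote(engines, method)}")
```

Under these assumed rules the sketch prints 2****8, 25**78, 253478 and 255378, matching the slide. The trade-off is visible in the output: Safe leaves the most characters for manual completion but never accepts a disputed one, while Order fills every position and accepts a wrong digit.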

Processing Example
ICR 1: 3
ICR 2: 3
ICR 3: 8
ICR 4: 3
Majority = 3
Safe = *

Automatic Recognition Time + Completion Time + Correction Time = THROUGHPUT

Fuzzy/Approximate Search Recognition Image Completion

Image Recognition Completion

Other Approaches
§ Auto Coding
– Coding tasks and data validations performed on the data capture platform: a ‘cost-effective’ solution
– Use artificial intelligence and statistical software to “understand” sentences
  § Q: “What do you do for a living?”
  § A: “I am guiding children” → “Teacher” → 2030
– Use approximate search tools to improve results via a database (Exorbyte)
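The approximate-search step can be illustrated with Python's standard `difflib` module. This is a minimal sketch and not the Exorbyte product or its API: the lookup table is hypothetical, only the code 2030 for "teacher" comes from the slide, and the other codes, the function name and the `cutoff` value are made up for illustration.

```python
import difflib

# Hypothetical coding table; only "teacher" -> "2030" comes from the slide.
OCCUPATION_CODES = {
    "teacher": "2030",
    "nurse": "0001",   # made-up code
    "farmer": "0002",  # made-up code
}


def code_answer(raw_text, cutoff=0.6):
    """Map a possibly misrecognized answer to the closest table entry.

    Returns (matched_entry, code), or None when nothing is similar enough,
    in which case the answer would go to a manual coding station.
    """
    candidates = difflib.get_close_matches(
        raw_text.strip().lower(), list(OCCUPATION_CODES), n=1, cutoff=cutoff
    )
    if not candidates:
        return None
    entry = candidates[0]
    return entry, OCCUPATION_CODES[entry]


print(code_answer("Teacber"))  # ICR misread of one character
print(code_answer("xyzzy"))    # no plausible match
```

This covers only spelling-level repair against a known list. Mapping a free-text sentence such as “I am guiding children” to “Teacher” needs the semantic step the slide attributes to artificial intelligence and statistical software, which the sketch does not attempt.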

Process integrity, questionnaire integrity: a workflow according to the client's needs
Scanning → OCR → Validation → Export

Flexibility

Flexibility

Census Data Capture Platform Thank You