Content Classification – Where’s My Stuff?
IBM Confidential

Agenda
§ Why Classify?
§ How Does Classification Work?
§ Content Classification Features
§ Taxonomy Basics
§ Starting a Classification Project
§ Best Practices for Classification
§ Real World Example
§ Closing Thoughts

Why Classify?
§ Content that is not properly classified is not accessible
  – 1 in 2 business leaders don’t have access to the information they need to do their jobs
§ Quality of decision-making suffers when content is not accurate
  – 1 in 3 business leaders frequently make business decisions based on information they lack or don’t trust
§ Companies face difficulty in deriving full visibility and insight into the breadth and depth of unstructured content
  – 77% of CEOs don’t have immediate information to make key business decisions
Sources: IBM 2010 CEO & CFO Studies, IBM 2010 Break Away With Business Analytics and Optimization Study

Why Classify?
§ What if you walked into the Library of Congress and there was no classification system?
§ What about the hardware store, the grocery store, the clothing store?
§ Do you park your car in the living room and place your sofa in the garage?
Everything in our life is categorized and classified in some way.
You have:
§ Millions of pieces of content
§ Hundreds of repositories
§ Thousands of workers
You need to:
§ Find relevant content, quickly
§ Accurately, consistently categorize content
§ Gather meaning and understanding from the content

Why Classify?
You have been storing content for many years, but…
§ Can you find it when you need it?
§ Can you produce it for audits and litigation?
§ Can you gain insight from it?
How does your organization go from this… to this?

Why Classify?
Accessibility, Usability, Compliance, Analytics
§ Can you find relevant content, quickly?
  – “Search, Refine, Repeat” is no longer acceptable
  – Image Capture, Content Collection, Enterprise Search
§ Is the right content available at the right time?
  – Business processes require timely access to content
  – Business Process Management, Case Management
§ Are you complying with Legal and Business mandates?
  – Content has a compliance lifecycle that must be enforced
  – Content Collection, Enterprise Records, eDiscovery
§ Are you uncovering business insight from your content?
  – Organized content produces better insight
  – Content Analytics

Why Classify?
§ Automated Classification makes information accessible, leaving your workers to focus on important business tasks rather than searching, over and over, for relevant content
§ Classification provides enhanced content usability by automating routing decisions based on the meaning of the text in your content
§ Advanced Classification, combined with collection and records, enables your company to comply with business and legal mandates
§ Classification augments Content Analytics by providing extended facet navigation and content clustering, delivering added analysis and insight

How does Classification work?
CLASSIFICATION AS A FACTORY WORKER
§ Think of a worker at the end of an assembly line
§ Task is to sort items coming down the line into correct containers
§ Four possible item types on the line:
  – Can
  – Box
  – Bottle
  – Jar
How do you tell the factory worker which is which?
Start with the item to the right as a ‘can’ reference model
  – 6.5” high
  – Red with blue & white lettering
  – 3.5” diameter
  – Opened with a tab
  – Contains liquid

How does Classification work?
Based on initial assumptions, which of these are “cans”?
§ What are our identification parameters?
  ─ Shape?
  ─ Capacity/size?
  ─ Contents (liquid vs. solid)?
  ─ Method of opening?
  ─ Construction material?
§ Based on the original reference model, which of these is a can?
  ─ 6.5” high
  ─ Red with blue & white lettering
  ─ 3.5” diameter
  ─ Opened with a tab
  ─ Contains liquid

How does Classification work?
§ Analogy is very relevant to category definition & corpus selection
§ Document classification involves the same problems
  – What is an “Accounting and Finance” document?
    • How can we differentiate it from a “Legal” document?
    • How about “Regulatory”?
  – How do humans tell which is which?
    • Keywords
    • Phrases
    • Intent
§ Some distinctions are clear…
  – Legal vs. Engineering
  – Personnel vs. Operations
  – Manufacturing vs. Advertising
§ Others are not…
  – Legal vs. Regulatory
Classification effort depends on your environment

How does Classification work?
Context-Based Classification
Figure: business information sorted into Category ‘A’ Marketing, Category ‘B’ Engineering, and Category ‘C’ Strategy. Sample statements in the figure include “Intellectual Property is essential”, “Engineering drafts require approval”, “Engineering requires skilled software staff”, “Engineering requires clear requirements”, “Strategy should look out over 36 months”, and “Strategy is important to the marketing team”. The statement “The core market for this new product has been defined as such by IBM” appears twice, once marked “?” and once marked “A”.

How does Classification work?
§ Content Classification combines multiple categorization methods to deliver automatic classification
  – Uses natural language processing and semantic analysis
  – Uses rules based on metadata or confidence score
  – The methods can be used in tandem or separately, depending on requirements
Example email:
  To: Bob Smith <bsmith@hotmail.com>
  From: Bill Roker <broker@financialadv.com>
  Subject: Contract?
  Bob,
  Hope you’re doing well. A quick note to see if the payment came through, as prescribed by the contract? It would be terrible to have the firm sued over such a simple financial matter. No one wants this project to be derailed.
  Regards,
  Bill Roker
  212-555-1234
  Financial Advisors, Inc.
Example rules applied to the email:
  – Does the email contain the phrase “contract”?
  – Does the sender belong to the broker email group?
  – Does the email have anything that matches the pattern “XXX-YY-ZZZZ”?
Natural Language Processing + Semantic Analysis + Targeted Rules = Comprehensive Content Classification
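The three targeted-rule questions on this slide can be sketched in a few lines of Python. This is only an illustration, not the product's rule engine: the function name `check_rules`, the `BROKER_GROUP` set, and the reading of “XXX-YY-ZZZZ” as a 3-2-4 digit pattern are all assumptions.

```python
import re

# Hypothetical group membership list (an assumption for this sketch).
BROKER_GROUP = {"broker@financialadv.com"}

# Assumes "XXX-YY-ZZZZ" means three digits, two digits, four digits.
ID_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def check_rules(sender: str, body: str) -> dict:
    """Evaluate the three example rules against one email."""
    return {
        "mentions_contract": "contract" in body.lower(),
        "sender_in_broker_group": sender.lower() in BROKER_GROUP,
        "has_id_pattern": ID_PATTERN.search(body) is not None,
    }


email_body = (
    "A quick note to see if the payment came through, "
    "as prescribed by the contract? Regards, Bill Roker 212-555-1234"
)
result = check_rules("broker@financialadv.com", email_body)
```

Note that the phone number in the signature has a 3-3-4 digit shape, so it does not satisfy the assumed 3-2-4 pattern; only the phrase rule and the sender rule fire for this email.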

Content Classification Features
1. Automatic Categorization of documents and emails
  – Analyzes the content of documents and emails in order to categorize them
  – Uses natural language processing and semantic analysis
  – Handles imperfect language (misspellings, abbreviations, poor grammar)
  – Assigns confidence score to each category suggestion (0 – 100)
  – Learns from examples or keywords
    • Creates a profile for each category by analyzing sample texts
    • Categories can also be defined by keywords
2. Combines classification methods using text analysis and rules processing
  – Rules based on metadata can be defined in combination with classification based on confidence score
  – Language identification capability can be used in tandem with rules

Content Classification Features
3. Learns in real-time
  – Can adapt based on feedback from end users or administrators
  – Feedback is incorporated into analysis on-the-fly for immediate adaptation
4. Classification Workbench configuration tool
  – Supports creation and maintenance of Knowledge Bases and Decision Plans
  – Facilitates classification tune-up and reporting
5. Integrated with IBM ECM offerings
  – Application for bulk classification of content upon ingestion to repository, and for bulk classification and reclassification of content already under management
  – Integrated with Datacap, Content Collector, Enterprise Records, Analytics, etc.
6. Taxonomy Creation Assistance
  – Suggests new taxonomies for organizations that do not have them
  – Suggests new elements for existing taxonomies

Content Classification Features – Knowledge Base
§ A knowledge base contains learned information that Classification needs to perform matching, training, and online learning
§ It is filled with relevant statistical and semantic information derived from sample texts
§ Statistical entities consist of words, number of occurrences, hints about the text, and distance between words
§ A knowledge base is created & maintained through the Workbench application
  1. Collect and organize sample content
  2. Create, analyze, and learn
  3. Assess performance, review reports
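To make the idea of a learned category profile concrete, here is a toy sketch. It is not the product's algorithm, which the slides do not describe in detail; it assumes a plain bag-of-words profile per category and uses cosine similarity scaled to 0 – 100 as a stand-in for the confidence score. The class name `ToyKnowledgeBase` and its methods are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ToyKnowledgeBase:
    """Hypothetical stand-in: one word-occurrence profile per category."""

    def __init__(self) -> None:
        self.profiles: dict[str, Counter] = {}

    def learn(self, category: str, sample_text: str) -> None:
        # Accumulate word occurrences from a sample text into the profile.
        self.profiles.setdefault(category, Counter()).update(tokenize(sample_text))

    def match(self, text: str) -> dict[str, int]:
        # Score the text against every profile on a 0-100 scale.
        words = Counter(tokenize(text))
        return {
            category: round(100 * cosine(words, profile))
            for category, profile in self.profiles.items()
        }


kb = ToyKnowledgeBase()
kb.learn("Legal", "contract approval requires legal review of the contract")
kb.learn("Engineering", "engineering drafts require clear software requirements")
scores = kb.match("please review the contract")
```

Calling `learn` again with corrected examples is the simplest way to picture the feedback step: the profile changes, and so do later scores.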

Content Classification Features – Decision Plan
§ A Decision Plan is a collection of rules that you configure to determine how content is classified
§ A Decision Plan is developed by configuring one or more rules based on content or metadata
§ Each rule consists of one trigger and one or more actions
  – Example: Trigger: “If Title contains ‘Contract’”, then Action: “Assign to Contracts Category” & “Move to Contracts folder”
§ Rules can use strings, word distance, regular expressions, pattern extraction, Boolean expressions
§ Actions include set properties, invoke analysis, move to folder, declare record, custom actions, and more
§ Decision Plans can be used with or without a Knowledge Base
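The “one trigger, one or more actions” structure can be sketched as follows. This is an illustrative model only: the names `Rule`, `run_decision_plan`, `assign_category`, and `move_to_folder` are hypothetical and do not reflect how the Workbench stores Decision Plans.

```python
import re
from dataclasses import dataclass, field
from typing import Callable

# Hypothetical document model: a dict of metadata plus classification results.
Document = dict


@dataclass
class Rule:
    name: str
    trigger: Callable[[Document], bool]                  # exactly one trigger
    actions: list[Callable[[Document], None]] = field(default_factory=list)


def assign_category(category: str) -> Callable[[Document], None]:
    def action(doc: Document) -> None:
        doc["category"] = category
    return action


def move_to_folder(folder: str) -> Callable[[Document], None]:
    def action(doc: Document) -> None:
        doc["folder"] = folder
    return action


def run_decision_plan(rules: list[Rule], doc: Document) -> Document:
    """Run every rule in order; fire all actions of each rule whose trigger matches."""
    for rule in rules:
        if rule.trigger(doc):
            for action in rule.actions:
                action(doc)
    return doc


# The slide's example: "If Title contains 'Contract'".
contracts_rule = Rule(
    name="Contracts by title",
    trigger=lambda doc: re.search("contract", doc.get("title", ""), re.IGNORECASE) is not None,
    actions=[assign_category("Contracts"), move_to_folder("Contracts")],
)

classified = run_decision_plan([contracts_rule], {"title": "Service Contract 2010"})
untouched = run_decision_plan([contracts_rule], {"title": "Quarterly Budget"})
```

Keeping triggers and actions as separate callables mirrors the slide's point that a plan can run on metadata alone, with or without a Knowledge Base behind it.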

Content Classification – Taxonomy Basics
Taxonomy
1. The science or technique of classification.
2. A classification into ordered categories.
3. The science dealing with the description, identification, naming, and classification of organisms.
Business Taxonomy
1. Usually follows a line of business hierarchy
2. Logical grouping of content for business, repository, or compliance purposes
3. Generally “flattened” for better control and management
Figure labels: 7 levels; 3-4 levels

Content Classification – Taxonomy Basics
The Goldilocks Zone: “Too Many Categories”
1000 categories is probably too many
Figure: a deep taxonomy tree under Company, with node labels including Claims, Policies, Finance, Health, Home, Vehicle, Boat, Motorcycle, Auto, RV, Mobile, Single, Yacht, Dinghy, Cruise, Make, Model, Brick, Wood, < 20 Ft., < 32 Ft., < 46 Ft.

Content Classification – Taxonomy Basics
The Goldilocks Zone: “Too Few Categories”
10 categories is probably too few
Figure: a shallow taxonomy tree under Company, with node labels Claims, Policies, HR, Legal, Finance

Content Classification – Taxonomy Basics
The Goldilocks Zone: “Just Right”
Somewhere around 100 categories is probably just right
Figure: a moderately deep taxonomy tree under Company, with node labels including Claims, Auto, Home, Policies, Employee, HR, Purchasing, Legal, Contracts, Finance, Reporting, Budget

Content Classification – Taxonomy Basics
§ Taxonomies are important, but…
§ They do not have to be complex or unwieldy
§ They need to be acceptable to different areas of the organization
  ─ Finance, Legal, HR, IT
§ Your organization may have a formal, internal taxonomy
  ─ If so, start there, but it may have to be flattened
§ Your organization may have a de facto taxonomy
  ─ ECM document classes, folders, file system structures, and departmental structures may be enough to start
§ Publicly available or 3rd-party taxonomies may be used
  ─ Again, they may have to be flattened
§ How are humans classifying today?
  ─ Are workers filing paper in folders, drawers, cabinets?
  ─ Are workers putting content in ECM, file systems, folders?
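“Flattening” an existing taxonomy can be pictured as cutting the tree at a maximum depth and folding everything below the cut into its ancestor. The sketch below assumes the taxonomy is held as nested Python dicts; the `flatten` function and the sample hierarchy (built from labels in the Goldilocks slides, with parent-child relationships guessed for illustration) are assumptions.

```python
def flatten(taxonomy: dict, max_depth: int, path: tuple = ()) -> list[str]:
    """Return category paths, folding anything deeper than max_depth into its ancestor."""
    categories = []
    for name, children in taxonomy.items():
        current = path + (name,)
        if not children or len(current) >= max_depth:
            # Leaf node, or the depth limit is reached: keep this node as a category.
            categories.append("/".join(current))
        else:
            categories.extend(flatten(children, max_depth, current))
    return categories


# Hypothetical hierarchy; the nesting is assumed for illustration only.
deep_taxonomy = {
    "Company": {
        "Claims": {},
        "Policies": {
            "Home": {},
            "Vehicle": {"Boat": {"Yacht": {}, "Dinghy": {}}, "Auto": {}},
        },
        "Finance": {},
    }
}

flat = flatten(deep_taxonomy, max_depth=3)
```

With `max_depth=3`, the Boat, Yacht, Dinghy, and Auto nodes all collapse into a single `Company/Policies/Vehicle` category, which is the kind of trade-off the 3-4 level guideline implies.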

Starting a Classification Project
§ Approaches
  – Taxonomy Proposal through Content Clustering
  – Taxonomy Creation through “Seeded” Keywords
  – Taxonomy Creation through Manual Content Gathering
  – Knowledge Base Creation through Content Extraction

Starting a Classification Project
§ Taxonomy Proposal through Content Clustering
  ─ We don’t know what we don’t know
  ─ Starting from a blank sheet
Figure labels: gather, crawl, cluster, create, categorize, evaluate, Trained Knowledge Base (categories A, B, C, D)

Starting a Classification Project
§ Taxonomy Creation through “Seeded” Keywords
  ─ We know what we don’t know
  ─ Starting from a blank sheet
Figure labels: keyword, Keyword Seeded taxonomy, gather, crawl, Keyword-based content set, Workbench review, Knowledge Base creation, evaluate & tune, Trained Knowledge Base (categories A, B, C, D)

Starting a Classification Project
§ Taxonomy Creation through Manual Content Gathering
  ─ We know what we don’t know
  ─ Starting with known content
Figure labels: Manual content gathering, Manually gathered content set, Strawman Taxonomy, Knowledge Base creation, evaluate & tune, Trained Knowledge Base (categories A, B, C, D)

Starting a Classification Project
§ Knowledge Base Creation through Content Extraction
  ─ We know what we know
  ─ Starting with known content and taxonomy
Figure labels: Established ECM Repository, Content extraction, Extracted content set, Knowledge Base creation, evaluate & tune, Trained Knowledge Base (categories A, B, C, D)

Best Practices for Classification
(or All I Really Need to Know about Classification, I Learned in Kindergarten)
§ Look
§ Listen
§ Learn

Best Practices for Classification
(or All I Really Need to Know about Classification, I Learned in Kindergarten)
§ Look
  ─ In order to classify properly, you need to know your content
    • Understand how your content is created and by whom
    • Understand how content is used in your business
    • Understand the meaning and purpose of content
  ─ Set realistic expectations
    • 100% automation with 100% accuracy is rare
    • Balance automation expectations with accuracy requirements
Example:
  ─ This is a resume
  ─ It is used by Human Resources, Hiring Managers
  ─ It is a text document
  ─ The purpose is to aid the hiring process
  ─ The document may have compliance value

Best Practices for Classification
(or All I Really Need to Know about Classification, I Learned in Kindergarten)
§ Listen
  ─ All content owners and users have a stake in proper classification
  ─ Gather input and consider all aspects of content, users and organizations
  ─ Define categories based on business use
    • Categories should represent organizational content, not organizational structure
    • Taxonomies are less hierarchical and flatter than “standard” taxonomies
Figure: a Hierarchical taxonomy compared with a Flat taxonomy, with node labels including Marketing, Store Operations, Human Resources, Advertising, Store Management, Employee Management, Contracts, Corporate Reporting, Public Relations, Sales, Benefits, Audit, Pricing, Catalog, Training, Records & Retention, AP/AR, Legal, Finance

Best Practices for Classification
(or All I Really Need to Know about Classification, I Learned in Kindergarten)
§ Learn
  ─ Training is iterative; the system improves and learns over time
  ─ Training sets must contain “high value” examples
  ─ Number of training documents per category varies by organization (~20 to ~50, rule of thumb)
  ─ Hundreds of documents are less useful than 20 well-selected documents
  ─ More is not better, it’s just more
  ─ Addition of new categories affects existing categories
  ─ Some categories may perform well immediately, others may require additional effort
  ─ Categories may “drift” over time (content intent, phrases, business changes, etc.)
  ─ Learning requires the active use of feedback capabilities
Classification systems have to learn…
Remember what Grover taught us… “Three of these things belong together...”

Best Practices for Classification – Summary
§ Categories
  ─ Should be content driven and represent organizational content, not the organization chart
§ Taxonomies
  ─ Less hierarchical, generally flatter and less formal than “standard” taxonomies
§ Training Sets
  ─ Training sets should be consistent with actual content and represent “high-value” content
  ─ Clear delineation of content between the various categories
§ Ongoing monitoring and training
  ─ Training is iterative; similar to business process optimization, it improves over time
§ Set realistic expectations with business users
  ─ Balance automation expectations with accuracy requirements
§ Engage competent and experienced service providers to assist with the initial classification project

Real World Example
Image Capture and Classification
Integration between Datacap Taskmaster and Content Classification brings the power of image capture and automated classification together.
Content Classification uses text analytics and statistical probability to add another recognition approach to Taskmaster’s vast array of methods.

Real World Example
Image Capture and Classification Challenges
§ What type of document is this?
  – to vary processing by type
§ What pages contain the data I need?
  – to extract or key in the proper fields
§ Do the documents contain the correct pages?
  – to ensure that the documents are “in good order” and not missing information
§ What is the business meaning of this document?
  – to get the document to the right person or process with the right priority

Real World Example
Image Capture and Classification
The Separation Challenge
§ Where does one document end and the next begin? (Figure callouts on the page stack: “Here?”, “Here?”, “Here?”)
§ Traditional Methods
  – Patch & Barcoded Separator Sheets
  – Barcode Labels and Documents
  – Manual Identification
  – Paper Sorting
§ Shortcomings
  – Labor-intensive
  – Relies on worker knowledge to correctly identify and sort the documents
  – Externally generated documents cannot be barcoded

Real World Example
Image Capture and Classification
Datacap Taskmaster & Classification for Separation & Page Identification
§ Taskmaster examines each page using multiple methods
  – The fastest methods are done first: barcode, pattern match, & fingerprint
  – The slower methods that require OCR follow: text analytics and keywords
  – Rules examine the context to determine if any remaining pages can be identified based on the surrounding pages
  – Taskmaster calls Content Classification to help identify pages
  – Taskmaster separates and assembles the pages into documents
§ Content Classification analyzes the text content
  – Statistical analysis of the text on a page compared to a knowledge base to find the closest match
  – Assigns confidence score to each category suggestion (0 – 100)
  – Returns the classification results to Taskmaster
  – Classification feedback loop improves future results by providing feedback to the classification engine
§ Exceptions and low-confidence results are reviewed and classified by users
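The “fast methods first, slower methods next, low confidence goes to a person” flow can be sketched as a simple cascade. Everything named here is hypothetical: `identify_page`, the two stub methods, the page dict layout, and the threshold of 70 are assumptions chosen for illustration, not Taskmaster or Content Classification APIs.

```python
from typing import Callable, Optional

Result = tuple[str, int]                       # (category, confidence 0-100)
Method = Callable[[dict], Optional[Result]]    # returns None when it cannot decide


def by_barcode(page: dict) -> Optional[Result]:
    """Fast method: a barcode, if present, is treated as a certain match."""
    code = page.get("barcode")
    return (code, 100) if code else None


def by_text(page: dict) -> Optional[Result]:
    """Slow method: stand-in for a text-classification call on OCR output."""
    text = page.get("ocr_text", "").lower()
    if "promissory note" in text:
        return ("Note", 85)
    if "loan" in text:
        return ("Loan Document", 40)
    return None


def identify_page(page: dict, methods: list[Method], threshold: int = 70) -> tuple[str, str]:
    """Try methods in order; accept the first confident answer, else send to review."""
    best: Optional[Result] = None
    for method in methods:
        result = method(page)
        if result is None:
            continue
        if result[1] >= threshold:
            return result[0], "auto"
        if best is None or result[1] > best[1]:
            best = result
    # Nothing was confident enough: keep the best guess as a suggestion for the reviewer.
    return (best[0] if best else "unknown"), "review"


methods = [by_barcode, by_text]  # ordered fastest to slowest
outcomes = [
    identify_page({"barcode": "Appraisal"}, methods),
    identify_page({"ocr_text": "PROMISSORY NOTE dated 2010"}, methods),
    identify_page({"ocr_text": "loan summary"}, methods),
    identify_page({}, methods),
]
```

Ordering the list from cheapest to most expensive means OCR-dependent work is skipped whenever a barcode already settles the question.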

Real World Example: Bank specializing in mortgage loan servicing
Slashing costs with IBM Production Imaging Edition and IBM Content Classification
The need
• Reduce paper document scanning and processing costs
• Reduce loan servicing customer service costs
• Processing volumes can exceed 100 million scanned pages per month
The solution
The company contracted with IBM partner Imagine Solutions to implement IBM Production Imaging Edition (PIE) and IBM Classification Module software
• PIE - Datacap Taskmaster scans and imports paper documents
• PIE - Datacap Taskmaster rules classify documents to the page level using barcodes, image fingerprint pattern matching, regular expressions, and text analytic classification
• IBM Classification Module classifies pages using text analytics
• Taskmaster extracts text and data fields using optical character recognition (OCR)
• Data collection, statistical reporting, and feedback loops improve accuracy and configuration tuning
• PIE - FileNet Content Manager securely stores the documents
• Acquisition and servicing processes are automated through web-based document access and PIE business process capabilities
Projected benefits
• Save millions of dollars of staff time by automating document classification, reducing data entry, and providing direct access to the loan documents with improved speed, accuracy, and granularity
• Save millions of dollars in per-page licensing fees associated with the competitively replaced Kofax KTM system
• Provide a platform that can be rapidly ramped up to handle high loads associated with portfolio acquisitions
The solution is targeted to reduce costs by automating the classifying, keying and filing of millions of pages of loan documentation per day.

Closing Thoughts
How can classification help my business?
§ Improve teaching programs and student learning
  ─ Classify educational content through analysis of lesson plan text
§ Automatically code medical bills
  ─ Interpret doctors’ notes and apply industry standard codes (ICD-9, ICD-10)
§ Reduce manual, human intervention
  ─ Automatically evaluate email service requests and establish responses
§ Shorten process cycle time
  ─ Distinguish mortgage, auto, personal, credit card loan applications
  ─ Route content to appropriate worker or process step
§ Automatically identify Personally Identifiable Information (PII) and Personal Health Information (PHI) in unstructured content
  ─ Take actions such as file, record, route, redact

Closing Thoughts
§ Classification is a powerful solution to automate the categorization of text-based content
§ Properly categorized content provides better accessibility, usability, compliance and analytics
§ Many factors lead to high-quality classification – consider and understand all of them
§ The keys to success are planning, preparation and persistence
  ─ Is there any project that does not require these?
§ Automated classification allows you to cut costs associated with content capture, collection, archiving, retention, analysis and more
“Anything worth doing, is worth doing right.” – Hunter S. Thompson
