Content Detection and Analysis CSCI 572: Information Retrieval and Search Engines

Outline
• The Information Landscape
• Importance of Content Detection
• Challenges
• Approaches
May-20 -10 CS 572 -Summer 2010 CAM-2

The Information Landscape

Proliferation of content types available
• By some accounts, 16K to 51K content types*
• What to do with content types?
  – Parse them
    • How?
    • Extract their text and structure
  – Index their metadata
    • In an indexing technology like Lucene, Solr, or Compass, or in the Google Search Appliance
  – Identify what language they belong to
    • N-grams
*http://filext.com/

Importance of content types

Importance of content type detection

Search Engine Architecture

Goals
• Identify and classify file types
  – MIME detection
    • Glob pattern
      – *.txt
      – *.pdf
    • URL
      – http://…pdf
      – ftp://myfile.txt
    • Magic bytes
    • Combination of the above means
• Classification means reaction can be targeted
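The detection means listed above can be combined in a few lines. What follows is only a minimal sketch using the JDK alone, not Tika's implementation: the class name SimpleMimeDetector, its two small lookup tables, and the fallback type are assumptions made for illustration. It checks the leading "magic" bytes first and falls back to the file name or URL suffix.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

/** Illustrative detector: magic bytes first, then a name-suffix (glob) fallback. */
final class SimpleMimeDetector {
    // Hypothetical, tiny lookup tables; a real MIME database is far larger.
    private static final Map<String, String> GLOBS = new LinkedHashMap<>();
    private static final Map<String, byte[]> MAGIC = new LinkedHashMap<>();

    static {
        GLOBS.put(".txt", "text/plain");
        GLOBS.put(".pdf", "application/pdf");
        MAGIC.put("application/pdf", "%PDF-".getBytes(StandardCharsets.US_ASCII));
        MAGIC.put("image/png", new byte[] {(byte) 0x89, 'P', 'N', 'G'});
    }

    /** Returns a MIME type for the given name (or URL) and leading bytes; either may be null. */
    static String detect(String name, byte[] head) {
        if (head != null) {
            for (Map.Entry<String, byte[]> e : MAGIC.entrySet()) {
                byte[] magic = e.getValue();
                // Compare only the first magic.length bytes of the content.
                if (head.length >= magic.length
                        && Arrays.equals(Arrays.copyOf(head, magic.length), magic)) {
                    return e.getKey();
                }
            }
        }
        if (name != null) {
            String lower = name.toLowerCase(Locale.ROOT);
            for (Map.Entry<String, String> e : GLOBS.entrySet()) {
                if (lower.endsWith(e.getKey())) {
                    return e.getValue();
                }
            }
        }
        return "application/octet-stream"; // nothing matched
    }

    public static void main(String[] args) {
        byte[] pdfHead = "%PDF-1.4".getBytes(StandardCharsets.US_ASCII);
        System.out.println(detect("report.txt", pdfHead)); // content wins over the name
        System.out.println(detect("ftp://myfile.txt", null));
    }
}
```

Letting the magic bytes take precedence is a design choice made here because a file name can be wrong or missing, while the leading bytes come from the content itself.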

Goals
• Parsing
  – Based on MIME type in an automated fashion
  – Extraction of text and metadata
• Text content can be
  – Fed into a search engine
  – Fed into machine learning/statistical analysis
  – Used to subset data from a formatted document
• Metadata can be used for fielded/faceted search

Many custom applications and tools
• You need this: [figure] to read this: [figure]

Third-party parsing libraries
• Most of the custom applications come with software libraries and tools to read/write these files
  – Rather than reinvent the wheel, figure out a way to take advantage of them
• Parsing text and structure is a difficult problem
  – Not all libraries parse text in equivalent ways
  – Some are faster than others
  – Some are more reliable than others

Extraction of Metadata
• Important to follow common metadata models
  – Dublin Core
  – Word metadata
  – XMP
  – EXIF
• Lots of standards and models out there
  – The use and extraction of common models allows for content intercomparison
  – Also standardizes mechanisms for searching
  – You always know for X file type that field Y is there and of type String or Int or Date
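One way to picture the intercomparison point is a small sketch that maps format-specific keys onto Dublin Core element names. Everything here is an assumption for illustration: the class name CommonModelMapper, the "model:key" prefix scheme, and the four mappings chosen. It is not how any particular library does it.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

/** Illustrative mapper from format-specific metadata keys to a common model. */
final class CommonModelMapper {
    // Hypothetical mapping table keyed as "model:field"; values are Dublin Core element names.
    private static final Map<String, String> TO_DUBLIN_CORE = new HashMap<>();

    static {
        TO_DUBLIN_CORE.put("word:Author", "creator");
        TO_DUBLIN_CORE.put("exif:Artist", "creator");
        TO_DUBLIN_CORE.put("word:Title", "title");
        TO_DUBLIN_CORE.put("exif:DateTimeOriginal", "date");
    }

    /** Keeps only the fields that have a common-model name and renames them. */
    static Map<String, String> normalize(String model, Map<String, String> raw) {
        Map<String, String> common = new TreeMap<>();
        for (Map.Entry<String, String> e : raw.entrySet()) {
            String name = TO_DUBLIN_CORE.get(model + ":" + e.getKey());
            if (name != null) {
                common.put(name, e.getValue());
            }
        }
        return common;
    }

    public static void main(String[] args) {
        Map<String, String> wordDoc = new HashMap<>();
        wordDoc.put("Author", "A. Writer");
        wordDoc.put("Title", "Notes");
        // After normalization, a search on "creator" works for any source format.
        System.out.println(normalize("word", wordDoc));
    }
}
```

Once both a Word file and a photo expose "creator", a single fielded query can cover both, which is the standardization benefit the slide describes.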

Cancer Research Example

Cancer Research Example (Attributes, Relationships)

Language Identification
• Hard to parse out text and metadata from different languages
  – French document: J’aime la classe de CS 572!
    • Metadata:
      – Publisher: L’Université de Californie du Sud
  – English document: I love the CS 572 class!
    • Metadata:
      – Publisher: University of Southern California
• How to compare these 2 extracted texts and sets of metadata when they are in different languages?

Methods for language identification
• N-grams
  – Model which character, or set of characters, is likely to come next in a sequence
  – Useful in determining whether small snippets of text come from a particular language or character set
• Non-computational approaches
  – Tagging
  – Looking for common words or characters
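The N-gram idea can be sketched with character trigrams: build a trigram count profile per language from sample text, then pick the profile most similar to the input. This is a minimal sketch under stated assumptions, not the Nutch or Tika algorithm; the class name NgramLanguageGuesser, the cosine-similarity scoring, and the tiny training sentences are all chosen here for illustration.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

/** Illustrative language guesser based on character-trigram profiles. */
final class NgramLanguageGuesser {
    private final Map<String, Map<String, Integer>> profiles = new LinkedHashMap<>();

    /** Counts character trigrams after lowercasing and collapsing non-letters to spaces. */
    static Map<String, Integer> trigrams(String text) {
        String cleaned = text.toLowerCase(Locale.ROOT).replaceAll("[^\\p{L}]+", " ").trim();
        String padded = " " + cleaned + " "; // padding captures word boundaries
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + 3 <= padded.length(); i++) {
            counts.merge(padded.substring(i, i + 3), 1, Integer::sum);
        }
        return counts;
    }

    /** Cosine similarity between two trigram count vectors. */
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            int count = e.getValue();
            normA += (double) count * count;
            dot += (double) count * b.getOrDefault(e.getKey(), 0);
        }
        for (int count : b.values()) {
            normB += (double) count * count;
        }
        return (normA == 0 || normB == 0) ? 0.0 : dot / Math.sqrt(normA * normB);
    }

    /** Stores a profile for the language built from the sample text. */
    void train(String language, String sample) {
        profiles.put(language, trigrams(sample));
    }

    /** Returns the best-matching language, or "unknown" if nothing overlaps. */
    String guess(String text) {
        Map<String, Integer> probe = trigrams(text);
        String best = "unknown";
        double bestScore = 0.0;
        for (Map.Entry<String, Map<String, Integer>> e : profiles.entrySet()) {
            double score = cosine(probe, e.getValue());
            if (score > bestScore) {
                bestScore = score;
                best = e.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        NgramLanguageGuesser guesser = new NgramLanguageGuesser();
        // Toy training text; real profiles are built from much larger corpora.
        guesser.train("en", "the quick brown fox jumps over the lazy dog and the students love this class");
        guesser.train("fr", "le renard brun saute par dessus le chien paresseux et les professeurs aiment la classe");
        System.out.println(guesser.guess("I love the CS 572 class!"));
    }
}
```

With profiles this small the result is fragile; the approach becomes more dependable as the training text grows, and short inputs remain the hard case.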

Challenges
• Ability to uniformly extract and present metadata
• Scale
  – Extract on the fly, or extract during indexing?
  – Utility of content detection and analysis is important both prior to indexing and after
• Integrating third-party parsing libraries is difficult
  – Many intrinsic dependencies
  – Non-uniform extraction interfaces
    • Some don’t provide the same content
  – Slowdown

Challenges
• Language and charset detection is hard!

Challenges
• Maintenance of the MIME type database, as new MIME types are constantly being identified
• Ensuring portability, since content type detection and identification is increasingly needed even outside of the search engine
  – Firefox, Safari, HTTPD, etc., all must know about MIME types

Wrapup
• Content detection and analysis
  – MIME detection
  – Parsing and integration of parsing libraries
  – Language identification
  – Charset identification
  – Common metadata models and formats
• Use in a number of areas within the domain of search engines

Introduction to Apache Tika CSCI 572: Information Retrieval and Search Engines

Outline
• What is Tika?
• Where did it come from?
• What are the current versions of Tika?
• What can it do?

Apache Tika is…
• A content analysis and detection toolkit
• A set of Java APIs providing MIME type detection, language identification, and integration of various parsing libraries
• A rich Metadata API for representing different metadata models
• A command line interface to the underlying Java code
• A GUI for the Java code

Tika’s (Brief) History
• Original idea for Tika came from Chris Mattmann and Jerome Charron in 2006
• Proposed as a Lucene sub-project
  – Others were interested, but it didn’t gain much traction
• Went the Incubator route in 2007, when Jukka Zitting found that there was a need for Tika capabilities in Apache Jackrabbit
  – A content management system
• Graduated from the Incubator to a Lucene sub-project in 2008
• Graduated to an Apache top-level project (TLP) in 2010

Getting started rapidly
• Download Tika from:
  – http://tika.apache.org/download.html
• Grab tika-app-0.7.jar
• alias tika "java -jar tika-app-0.7.jar"
• tika < somefile.doc > extracted-text.xhtml
• tika -m < somefile.doc > extracted.met

Detecting MIME types from Java
• String type = new Tika().detect(…)
  – java.io.InputStream
  – java.io.File
  – java.net.URL
  – java.lang.String

Adding new MIME types
• Got XML?

Parsing
• String content = new Tika().parseToString(…)
  – InputStream
  – File
  – URL

Streaming Parsing
• Reader reader = new Tika().parse(…)
  – InputStream
  – File
  – URL

Language Detection
• LanguageIdentifier lang = new LanguageIdentifier(new LanguageProfile(FileUtils.readFileToString(new File(filename))));
• System.out.println(lang.getLanguage());
• Uses N-gram analysis included with Tika
  – Originating from Nutch
  – Can be improved

Metadata
• Metadata met = new Metadata();
  // Dublin Core
  met.set(Metadata.FORMAT, "text/html");
  // multi-valued: add() appends, whereas a second set() would replace the value
  met.add(Metadata.FORMAT, "text/plain");
  System.out.println(java.util.Arrays.toString(met.getValues(Metadata.FORMAT)));
• Other metadata models supported (HTTP Headers, Word, Creative Commons, Climate Forecast, etc.)

Running Tika in GUI form
• tika --gui
<html xmlns:html="…">
<body>
…
</body>
</html>

Integrating Tika into your App
• Maven
• Ant
• Eclipse
• It’s just a set of jars
  – tika-core
  – tika-parsers
  – tika-app
  – tika-bundle

Wrapup
• Lots more information at
  – http://tika.apache.org
• Possible projects
  – Adding more parsers for content types
    • OmniGraffle?
  – Expanding ability to handle random access file parsing
    • Scientific data file formats
  – Improving language and charset detection

Acknowledgements
• Material inspired by Jukka Zitting’s talks
  – http://www.slideshare.net/jukka/text-and-metadata-extraction-with-apache-tika-4427630