Content Detection and Analysis CSCI 572: Information Retrieval and Search Engines
Outline
• The Information Landscape
• Importance of Content Detection
• Challenges
• Approaches
May-20-10 CS 572 - Summer 2010 CAM-2
The Information Landscape
Proliferation of content types available
• By some accounts, 16K to 51K content types*
• What to do with content types?
  – Parse them
    • How?
    • Extract their text and structure
  – Index their metadata
    • In an indexing technology like Lucene, Solr, or Compass, or in Google Appliance
  – Identify what language they belong to
    • N-grams
* http://filext.com/
Importance of content types
Importance of content type detection
Search Engine Architecture
Goals
• Identify and classify file types
  – MIME detection
    • Glob pattern
      – *.txt
      – *.pdf
    • URL
      – http://…pdf
      – ftp://myfile.txt
    • Magic bytes
    • Combination of the above means
• Classification means reaction can be targeted
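The combination of magic bytes and glob patterns listed above can be sketched in a few lines of plain Java. This is a minimal illustration, not how any particular library implements detection: the class name `MimeSniffer`, its two lookup tables, and the rule "magic bytes first, then name" are assumptions made for the example. Only the PDF and PNG signatures are real file-format facts.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

/** Illustrative only: combines magic-byte and glob (file extension) detection. */
final class MimeSniffer {
    // Hypothetical, deliberately tiny tables; a real MIME database has many more entries.
    private static final Map<String, String> MAGIC = new LinkedHashMap<>();
    private static final Map<String, String> GLOBS = new LinkedHashMap<>();
    static {
        MAGIC.put("%PDF-", "application/pdf");   // PDF files start with "%PDF-"
        MAGIC.put("\u0089PNG", "image/png");     // PNG files start with 0x89 'P' 'N' 'G'
        GLOBS.put(".pdf", "application/pdf");    // stands in for the glob *.pdf
        GLOBS.put(".txt", "text/plain");         // stands in for the glob *.txt
    }

    /** head: the first bytes of the content; nameOrUrl: a file name or URL, may be null. */
    static String detect(byte[] head, String nameOrUrl) {
        // ISO-8859-1 maps every byte to exactly one char, so byte prefixes compare as strings.
        String prefix = new String(head, StandardCharsets.ISO_8859_1);
        for (Map.Entry<String, String> e : MAGIC.entrySet()) {
            if (prefix.startsWith(e.getKey())) {
                return e.getValue();             // content evidence beats the name
            }
        }
        if (nameOrUrl != null) {
            String lower = nameOrUrl.toLowerCase(Locale.ROOT);
            for (Map.Entry<String, String> e : GLOBS.entrySet()) {
                if (lower.endsWith(e.getKey())) {
                    return e.getValue();         // fall back to the glob match
                }
            }
        }
        return "application/octet-stream";       // generic type when nothing matches
    }

    public static void main(String[] args) {
        byte[] pdfHead = "%PDF-1.4".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(detect(pdfHead, "report.txt"));            // magic bytes win
        System.out.println(detect(new byte[0], "ftp://myfile.txt"));  // name is the only evidence
    }
}
```

Checking the content before the name is a design choice for this sketch: a file extension is easy to get wrong, while the leading bytes come from the file itself.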
Goals
• Parsing
  – Based on MIME type in an automated fashion
  – Extraction of Text and Metadata
• Text content can be fed into
  – Search engine
  – Machine learning/Statistical analysis
  – Used to subset data from a formatted document
• Metadata can be used for field/faceted search
Many custom applications and tools
• You need this: [image] to read this: [image]
Third-party parsing libraries
• Most of the custom applications come with software libraries and tools to read/write these files
  – Rather than re-invent the wheel, figure out a way to take advantage of them
• Parsing text and structure is a difficult problem
  – Not all libraries parse text in equivalent manners
  – Some are faster than others
  – Some are more reliable than others
Extraction of Metadata
• Important to follow common Metadata models
  – Dublin Core
  – Word Metadata
  – XMP
  – EXIF
• Lots of standards and models out there
  – The use and extraction of common models allows for content intercomparison
  – Also standardizes mechanisms for searching
  – You always know for X file type that field Y is there and of type String or Int or Date
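As a rough illustration of why a common model helps intercomparison, the sketch below maps format-specific field names onto Dublin Core element names, so that a Word document and an EXIF-tagged image can be compared on the same field. The class `CommonMetadata`, the `model:field` key convention, and the particular mapping table are assumptions made for this example; they are not taken from any library.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative only: normalizes format-specific metadata to Dublin Core element names. */
final class CommonMetadata {
    // Hypothetical mapping table, keyed as "<model>:<field>".
    private static final Map<String, String> TO_DUBLIN_CORE = new HashMap<>();
    static {
        TO_DUBLIN_CORE.put("word:Author", "creator");
        TO_DUBLIN_CORE.put("word:Title", "title");
        TO_DUBLIN_CORE.put("exif:Artist", "creator");
        TO_DUBLIN_CORE.put("exif:ImageDescription", "description");
    }

    /** Returns only the fields that have a known Dublin Core equivalent. */
    static Map<String, String> normalize(String model, Map<String, String> raw) {
        Map<String, String> common = new HashMap<>();
        for (Map.Entry<String, String> e : raw.entrySet()) {
            String element = TO_DUBLIN_CORE.get(model + ":" + e.getKey());
            if (element != null) {
                common.put(element, e.getValue());
            }
        }
        return common;
    }

    public static void main(String[] args) {
        Map<String, String> doc = new HashMap<>();
        doc.put("Author", "Ada");
        Map<String, String> photo = new HashMap<>();
        photo.put("Artist", "Ada");
        // Both records now expose the same "creator" field and can be compared or searched together.
        System.out.println(normalize("word", doc).equals(normalize("exif", photo)));
    }
}
```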
Cancer Research Example
Cancer Research Example [figure; labels: Attributes, Relationships]
Language Identification
• Hard to parse out text and metadata from different languages
  – French document: J’aime la classe de CS 572!
    • Metadata:
      – Publisher: L’Universitaire de Californie en Etats-Unis de Sud
  – English document: I love the CS 572 class!
    • Metadata:
      – Publisher: University of Southern California
• How to compare these 2 extracted texts and sets of metadata when they are in different languages?
Methods for language identification
• N-grams
  – Method of detecting the next character or set of characters in a sequence
  – Useful in determining whether small snippets of text come from a particular language, or character set
• Non-computational approaches
  – Tagging
  – Looking for common words or characters
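One common way to apply n-grams to language identification is to compare the character trigram counts of a snippet against a stored profile per language. The sketch below does this with cosine similarity. It is a simplified illustration under stated assumptions: the class `TrigramLanguageGuesser`, the two tiny accent-free training sentences, and the choice of cosine similarity are all made up for the example, and real profiles are built from far larger corpora.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

/** Illustrative only: guesses a language by comparing character trigram profiles. */
final class TrigramLanguageGuesser {
    // Hypothetical training text, far too small for real use.
    private static final Map<String, Map<String, Integer>> PROFILES = new LinkedHashMap<>();
    static {
        PROFILES.put("en", profile("the quick brown fox jumps over the lazy dog "
                + "and the cat loves this class of the university"));
        PROFILES.put("fr", profile("le renard brun rapide saute par dessus le chien "
                + "paresseux et j'aime la classe de l'universite"));
    }

    /** Counts every overlapping 3-character window of the lower-cased text. */
    static Map<String, Integer> profile(String text) {
        String t = text.toLowerCase(Locale.ROOT);
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + 3 <= t.length(); i++) {
            counts.merge(t.substring(i, i + 3), 1, Integer::sum);
        }
        return counts;
    }

    /** Cosine similarity between two trigram count vectors; 0 if either is empty. */
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            normA += (double) e.getValue() * e.getValue();
            Integer other = b.get(e.getKey());
            if (other != null) {
                dot += (double) e.getValue() * other;
            }
        }
        for (int v : b.values()) {
            normB += (double) v * v;
        }
        return (normA == 0 || normB == 0) ? 0 : dot / Math.sqrt(normA * normB);
    }

    /** Returns the best-scoring language code, or "unknown" if nothing overlaps. */
    static String identify(String text) {
        Map<String, Integer> p = profile(text);
        String best = "unknown";
        double bestScore = 0;
        for (Map.Entry<String, Map<String, Integer>> e : PROFILES.entrySet()) {
            double score = cosine(p, e.getValue());
            if (score > bestScore) {
                bestScore = score;
                best = e.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(identify("I love the CS 572 class!"));
        System.out.println(identify("J'aime la classe de CS 572!"));
    }
}
```

With such short training text the scores are fragile; the point is only that shared trigrams such as "the" or "j'a" pull a snippet toward one profile or the other.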
Challenges
• Ability to uniformly extract and present metadata
• Scale
  – Extract on the fly, or extract during indexing?
  – Utility of content detection and analysis important both prior to indexing and after
• Integrating third-party parsing libraries is difficult
  – Many intrinsic dependencies
  – Non-uniform extraction interfaces
    • Some don’t provide the same content
  – Slowdown
Challenges
• Language and charset detection is hard!
Challenges
• Maintenance of MIME type database as new MIMEs are constantly being identified
• Ensuring portability, since content type detection and identification is increasingly needed even outside of the search engine
  – Firefox, Safari, HTTPD, etc., all must know about MIME types
Wrapup
• Content detection and analysis
  – MIME detection
  – Parsing and integration of parsing libraries
  – Language identification
  – Charset identification
  – Common Metadata models and formats
• Use in a number of areas within the domain of search engines
Introduction to Apache Tika CSCI 572: Information Retrieval and Search Engines
Outline
• What is Tika?
• Where did it come from?
• What are the current versions of Tika?
• What can it do?
Apache Tika is…
• A content analysis and detection toolkit
• A set of Java APIs providing MIME type detection, language identification, integration of various parsing libraries
• A rich Metadata API for representing different Metadata models
• A command line interface to the underlying Java code
• A GUI interface to the Java code
Tika’s (Brief) History
• Original idea for Tika came from Chris Mattmann and Jerome Charron in 2006
• Proposed as Lucene sub-project
  – Others interested, didn’t gain much traction
• Went the Incubator route in 2007 when Jukka Zitting found that there was a need for Tika capabilities in Apache Jackrabbit
  – A Content Management System
• Graduated from the Incubator to Lucene sub-project in 2008
• Graduated to Apache TLP in 2010
Getting started rapidly
• Download Tika from:
  – http://tika.apache.org/download.html
• Grab tika-app-0.7.jar
• alias tika "java -jar tika-app-0.7.jar"
• tika < somefile.doc > extracted-text.xhtml
• tika -m < somefile.doc > extracted.met
Detecting MIME types from Java
• String type = new Tika().detect(…)
  – java.io.InputStream
  – java.io.File
  – java.net.URL
  – java.lang.String
Adding new MIME types
• Got XML?
Parsing
• String content = new Tika().parseToString(…)
  – InputStream
  – File
  – URL
Streaming Parsing
• Reader reader = new Tika().parse(…)
  – InputStream
  – File
  – URL
Language Detection
• LanguageIdentifier lang = new LanguageIdentifier(new LanguageProfile(FileUtils.readFileToString(new File(filename))));
• System.out.println(lang.getLanguage());
• Uses N-gram analysis included with Tika
  – Originating from Nutch
  – Can be improved
Metadata
• Metadata met = new Metadata();
  // Dublin Core
  met.set(Metadata.FORMAT, "text/html");
  // multi-valued: add() appends a value, whereas set() would overwrite it
  met.add(Metadata.FORMAT, "text/plain");
  System.out.println(Arrays.toString(met.getValues(Metadata.FORMAT)));
• Other met models supported (HTTP Headers, Word, Creative Commons, Climate Forecast, etc.)
Running Tika in GUI form
• tika --gui
<html xmlns:html="…"> <body> … </body> </html>
Integrating Tika into your App
• Maven, Ant, Eclipse
• It’s just a set of jars
  – tika-core
  – tika-parsers
  – tika-app
  – tika-bundle
Wrapup
• Lots more information at
  – http://tika.apache.org
• Possible projects
  – Adding more parsers for content types
    • OmniGraffle?
  – Expanding ability to handle random access file parsing
    • Scientific data file formats
  – Improving language and charset detection
Acknowledgements
• Material inspired by Jukka Zitting’s talks
  – http://www.slideshare.net/jukka/text-and-metadata-extraction-with-apache-tika-4427630