A Presentation on "Extracting Patterns and Relations from the World Wide Web" by Sergey Brin
Qian Liu, Computer and Information Sciences Department

Problem
  • The World Wide Web as an information resource:
      • Huge
      • Widely distributed
      • Complex, with various styles and formats
      • Scattered information
  • So, if we could integrate the chunks of information...

Motivation
  • Discover information sources
  • Extract information of a particular data type automatically, or with minimal human intervention
  • Integrate it into a structured form
  • The largest and most diverse source of information

Applications
  • To extract structured data from the entire World Wide Web
  • Data types: books, movies, music, restaurants, etc.

Methods
Problem: To extract a relation of books --- (author, title) pairs --- from the Web.

Methods
Intuition:
  • Start from a small seed set of books (author, title pairs)
  • Find occurrences of them on the Web
  • Generate patterns
  • Search for books matching the patterns
  • Obtain a large list of books

Methods
Formal Definition of the Problem:
  • World Wide Web
  • Relation --- (author, title) pairs that occur on the Web
  • Occurrences
      • Every tuple of the relation occurs >= 1 times on the Web
      • An occurrence consists of all fields of the tuple
      • Fields --- in close proximity to one another

Methods
Formal Definition of the Problem (Continued):
  • Patterns
      • Each pattern matches one particular format of occurrences of tuples of the relation
      • A pattern is a tuple: (order, urlprefix, prefix, middle, suffix)
      • Represented by a class of regular expressions
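A minimal sketch of how such a pattern could be represented and turned into a regular expression. The `Pattern` class, the `prefix` field (implied by the specificity formula later in the slides), the field expressions `AUTHOR_RE` and `TITLE_RE`, and the helper names are all assumptions made for illustration, not the presentation's own code.

```python
import re
from typing import NamedTuple


class Pattern(NamedTuple):
    order: bool      # True if the author appears before the title
    urlprefix: str   # only pages whose URL starts with this are considered
    prefix: str      # text immediately before the first field
    middle: str      # text between the two fields
    suffix: str      # text immediately after the second field


# Assumed field expressions: short runs of plain text without markup.
AUTHOR_RE = r"[A-Z][A-Za-z .,&]{0,29}[A-Za-z.]"
TITLE_RE = r"[A-Z0-9][A-Za-z0-9 ?!:,.'-]{0,60}[A-Za-z0-9?!]"


def to_regex(p: Pattern) -> "re.Pattern[str]":
    """Build prefix + field + middle + field + suffix as one regex."""
    first, second = ("author", AUTHOR_RE), ("title", TITLE_RE)
    if not p.order:
        first, second = second, first
    return re.compile(
        re.escape(p.prefix)
        + f"(?P<{first[0]}>{first[1]})"
        + re.escape(p.middle)
        + f"(?P<{second[0]}>{second[1]})"
        + re.escape(p.suffix)
    )


def match_page(p: Pattern, url: str, text: str):
    """Return the (author, title) pairs the pattern extracts from one page."""
    if not url.startswith(p.urlprefix):
        return []
    return [(m.group("author"), m.group("title"))
            for m in to_regex(p).finditer(text)]
```

Escaping the literal parts with `re.escape` keeps HTML punctuation in the context from being read as regex syntax; only the two fields are left as wildcards.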

Methods
R': approximation of the relation R
  • Coverage (recall) = $\frac{|R' \cap R|}{|R|}$
  • Error rate = $\frac{|R' - R|}{|R'|}$
  • Precision = $\frac{|R' \cap R|}{|R'|}$
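The three measures can be computed directly with set operations. This is a small sketch; the function name `evaluate` and the convention of returning 0.0 for an empty set are assumptions.

```python
def evaluate(approx, truth):
    """Return (coverage, error_rate, precision) of approx (R') against truth (R)."""
    approx, truth = set(approx), set(truth)
    correct = approx & truth                       # R' intersected with R
    coverage = len(correct) / len(truth) if truth else 0.0
    error_rate = len(approx - truth) / len(approx) if approx else 0.0
    precision = len(correct) / len(approx) if approx else 0.0
    return coverage, error_rate, precision
```

For a non-empty R', precision and error rate sum to 1, since every extracted tuple is either in R or not.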

Methods
Method: Dual Iterative Pattern Relation Expansion (DIPRE)
Basis:
  • Find tuples from patterns.
  • Find patterns from tuples.

Methods
  • Set of patterns with high coverage and low error rate
      -> Find all matches to patterns -> Set of tuples
  • Set of tuples
      -> Find all occurrences of the tuples; discover similarities in occurrences -> Set of patterns

Methods
1. Start with a small sample, e.g., five books.
2. Find all occurrences of the sample books on the Web. Keep the context of every occurrence (URL and surrounding text).
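Step 2 can be sketched over an in-memory list of pages. The `Occurrence` record, the function name, and the `context` and `max_gap` limits are assumptions chosen for illustration; the slides do not specify how much surrounding text is kept.

```python
import re
from typing import NamedTuple


class Occurrence(NamedTuple):
    author: str
    title: str
    order: bool   # True if the author appears before the title
    url: str
    prefix: str   # up to `context` characters before the first field
    middle: str   # text between the two fields
    suffix: str   # up to `context` characters after the second field


def find_occurrences(pages, seeds, context=10, max_gap=30):
    """pages: iterable of (url, text); seeds: iterable of (author, title)."""
    found = []
    for url, text in pages:
        for author, title in seeds:
            # Try both field orders: author first, then title first.
            for order, (a, b) in ((True, (author, title)), (False, (title, author))):
                regex = re.compile(
                    re.escape(a) + r"(.{1,%d}?)" % max_gap + re.escape(b), re.S
                )
                for m in regex.finditer(text):
                    found.append(Occurrence(
                        author, title, order, url,
                        text[max(0, m.start() - context):m.start()],
                        m.group(1),
                        text[m.end():m.end() + context],
                    ))
    return found
```

Bounding the gap between the two fields is one way to encode the requirement that the fields appear in close proximity to one another.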

Methods
3. Generate patterns based on the occurrences.
Requirements:
  • Generate patterns for sets of occurrences with similar context
  • Low error rate
  • High coverage

Methods
Procedure:
  • Group the occurrences by order and middle.
  • For each group: set urlprefix, prefix, and suffix to the text the occurrences share.
Specificity of Pattern:
  • Too specific?
  • Too general?
  • $\text{Specificity}(p) = |p.\text{middle}| \cdot |p.\text{urlprefix}| \cdot |p.\text{prefix}| \cdot |p.\text{suffix}|$
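The procedure above can be sketched as follows. Occurrences and patterns are plain dictionaries here, and the acceptance test (specificity times group size must exceed `threshold`) and the rule that a group needs at least two occurrences are assumptions for illustration, since the slides only give the specificity formula.

```python
import os
from collections import defaultdict


def common_prefix(strings):
    """Longest string that every input starts with."""
    return os.path.commonprefix(list(strings))


def common_suffix(strings):
    """Longest string that every input ends with."""
    return common_prefix([s[::-1] for s in strings])[::-1]


def specificity(p):
    """Product of the lengths of the pattern's literal parts."""
    return (len(p["middle"]) * len(p["urlprefix"])
            * len(p["prefix"]) * len(p["suffix"]))


def generate_patterns(occurrences, threshold=0):
    # Group occurrences that agree on field order and on the middle text.
    groups = defaultdict(list)
    for occ in occurrences:
        groups[(occ["order"], occ["middle"])].append(occ)

    patterns = []
    for (order, middle), group in groups.items():
        if len(group) < 2:
            continue  # one occurrence gives no evidence of a shared format
        p = {
            "order": order,
            "middle": middle,
            "urlprefix": common_prefix(o["url"] for o in group),
            # The shared text before the first field is a common *suffix*
            # of the recorded prefixes, because it ends where the field starts.
            "prefix": common_suffix(o["prefix"] for o in group),
            "suffix": common_prefix(o["suffix"] for o in group),
        }
        # An empty part makes the specificity 0, so overly general
        # patterns are rejected here.
        if specificity(p) * len(group) > threshold:
            patterns.append(p)
    return patterns
```

Because the specificity is a product, a pattern with any empty literal part is treated as too general, while longer shared context makes a pattern more specific.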

Methods
4. Search the Web for tuples matching the patterns.
5. Is the result large enough? If yes, return. If no, go to step 2.
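Steps 1 to 5 form a loop, which can be sketched as below. The three step functions are passed in as parameters so the control flow stands on its own; their names, the `target_size` stopping rule, and the `max_rounds` safety limit are assumptions for illustration.

```python
def dipre(seeds, pages, find_occurrences, generate_patterns, match_patterns,
          target_size, max_rounds=10):
    """Grow a set of tuples by alternating between tuples and patterns."""
    tuples = set(seeds)                                   # step 1: small sample
    for _ in range(max_rounds):
        occurrences = find_occurrences(pages, tuples)     # step 2
        patterns = generate_patterns(occurrences)         # step 3
        new_tuples = set(match_patterns(pages, patterns)) # step 4
        if new_tuples <= tuples:
            break          # nothing new was found, so another round cannot help
        tuples |= new_tuples
        if len(tuples) >= target_size:                    # step 5
            break
    return tuples
```

The early exit on a round that adds nothing is an extra guard not stated on the slide; without it the loop could repeat step 2 indefinitely on a fixed corpus.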

Experiments

Limitations of Study
1. Scalability problem: limited experiments due to time constraints.
2. Problem with data: duplicate books.
3. Measure of safety in matching tuples with patterns: a tuple only needs to match a single pattern.

Suggestions for Future Studies
1. Scan for larger numbers of patterns and tuples over a huge repository.
2. Include methods to disregard differences such as capitalization, spacing, how the author is listed in the book, and so on.
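One possible way to approach the second suggestion is to compare books on a normalized key. The rules below (case folding, whitespace collapsing, dropping periods in names, and rewriting "Last, First" as "First Last") are assumptions for illustration and would miss many real variations.

```python
import re
import unicodedata


def normalize_author(name):
    """Reduce an author name to a comparison key."""
    name = unicodedata.normalize("NFKC", name).strip()
    if "," in name:
        # Treat "Last, First" as "First Last".
        last, first = name.split(",", 1)
        name = f"{first} {last}"
    name = re.sub(r"[.\s]+", " ", name)   # drop periods, collapse spaces
    return name.strip().casefold()


def normalize_title(title):
    """Reduce a title to a comparison key."""
    title = unicodedata.normalize("NFKC", title)
    return re.sub(r"\s+", " ", title).strip().casefold()


def dedupe(books):
    """Keep the first spelling seen for each normalized (author, title)."""
    seen = {}
    for author, title in books:
        key = (normalize_author(author), normalize_title(title))
        seen.setdefault(key, (author, title))
    return list(seen.values())
```

Keeping the first spelling seen, rather than the normalized key, preserves the original capitalization in the final list.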

Conclusions
  • DIPRE --- a remarkable tool to extract structured data from the Web
  • Minimal human intervention
  • Application in domains other than books
  • Finding books not listed in major online sources --- change in information flow
