Web Mining for Extracting Relations
Negin Nejati
Relation Extraction
Example (author, title) tuples:
- (James Gleick, Chaos: Making a New Science)
- (Charles Dickens, Great Expectations)
- (William Shakespeare, The Comedy of Errors)
- (Isaac Asimov, The Robots of Dawn)
- (David Brin, Startide Rising)
DIPRE Algorithm
S = SampleTuples
While size(S) < T
    O = FindOccurrences(S)
    P = GenPatterns(O)
    S = MatchingTuples(P)
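
The loop above can be sketched in Python as follows. This is only a toy illustration under strong assumptions, not the original implementation: the "web" is a hypothetical in-memory list of text lines, an occurrence is reduced to the text between a title at the start of a line and an author at the end of it, and the helper names, the `max_rounds` guard, and the sample lines are all invented for the example.

```python
def find_occurrences(tuples, lines):
    """Collect the separator text on lines that start with a known title
    and end with its known author (a simplifying assumption)."""
    middles = []
    for author, title in tuples:
        for line in lines:
            if line.startswith(title) and line.endswith(author):
                middles.append(line[len(title):len(line) - len(author)])
    return middles


def gen_patterns(middles):
    """A pattern here is just a distinct, non-blank separator string."""
    return {m for m in middles if m.strip()}


def matching_tuples(patterns, lines):
    """Apply every pattern to every line and read off (author, title)."""
    found = set()
    for line in lines:
        for middle in patterns:
            title, sep, author = line.partition(middle)
            if sep and title and author:
                found.add((author, title))
    return found


def dipre(seeds, lines, target, max_rounds=10):
    """Iterate until at least `target` tuples are found or nothing changes."""
    tuples = set(seeds)
    for _ in range(max_rounds):          # guard added for the sketch
        if len(tuples) >= target:
            break
        occurrences = find_occurrences(tuples, lines)
        patterns = gen_patterns(occurrences)
        new_tuples = matching_tuples(patterns, lines)
        if not new_tuples or new_tuples == tuples:
            break                        # no progress, stop early
        tuples = new_tuples              # S = MatchingTuples(P)
    return tuples


# Hypothetical corpus; the pairs themselves come from the slides.
LINES = [
    "Great Expectations, by Charles Dickens",
    "Startide Rising, by David Brin",
    "Foundation, by Isaac Asimov",
    "Foundation - Isaac Asimov",
    "The Robots of Dawn - Isaac Asimov",
]
SEEDS = {("Charles Dickens", "Great Expectations")}

found = dipre(SEEDS, LINES, target=4)
print(sorted(found))
```

In this toy run the first round learns the separator ", by " from the seed, and the second round learns " - " from a pair discovered in the first, which is the bootstrapping effect the loop relies on.
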
Pattern Generation
- Existing methods assume that the components of a tuple appear close together (e.g. "Foundation, by Isaac Asimov").
- This is a very strong assumption (e.g. it misses all the titles listed on the author's webpage).
- Less popular relations with limited source data suffer more (for some relations, such as (service, price), this is not the typical appearance).
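
To make the limitation concrete, here is a minimal sketch of a proximity-style pattern. The regular expression and the sample page are assumptions made up for illustration; they are not the patterns used by any existing system.

```python
import re

# Hypothetical proximity pattern: "<title>, by <author>" on a single line.
PROXIMITY = re.compile(r"^(?P<title>.+?), by (?P<author>.+)$")


def extract_close_pairs(lines):
    """Return (author, title) pairs only where both appear side by side."""
    pairs = []
    for line in lines:
        match = PROXIMITY.match(line.strip())
        if match:
            pairs.append((match.group("author"), match.group("title")))
    return pairs


page = [
    "Foundation, by Isaac Asimov",  # author and title next to each other
    "Isaac Asimov",                 # heading of an author page
    "Bibliography",
    "The Robots of Dawn",           # title listed far from the author's name
]
print(extract_close_pairs(page))
```

Only the first line yields a pair; the title listed under the bibliography heading is never connected to its author, which is the gap the slides point out.
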
Using Heuristics
- We are looking for (author, title) pairs.
- The works of an author are very likely to be presented as lists or tables.
- Such lists and tables usually have helpful titles such as: bibliography, selected work, novels, stories, etc.
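
One way these heuristics might be applied is sketched below with Python's standard `html.parser`. It assumes that the "helpful title" is an HTML heading (h1 to h4) preceding a list, and that the author of the page is already known; the class name, keyword tuple, and sample page are hypothetical, and tables are left out for brevity.

```python
from html.parser import HTMLParser

KEYWORDS = ("bibliography", "selected work", "novels", "stories")
HEADINGS = {"h1", "h2", "h3", "h4"}


class WorkListParser(HTMLParser):
    """Collect list items that follow a heading containing a keyword."""

    def __init__(self):
        super().__init__()
        self.in_heading = False
        self.in_item = False
        self.relevant = False  # True while the latest heading looked helpful
        self.buffer = []
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag in HEADINGS:
            self.in_heading, self.buffer = True, []
        elif tag == "li":
            self.in_item, self.buffer = True, []

    def handle_data(self, data):
        if self.in_heading or self.in_item:
            self.buffer.append(data)

    def handle_endtag(self, tag):
        text = "".join(self.buffer).strip()
        if tag in HEADINGS and self.in_heading:
            self.in_heading = False
            self.relevant = any(k in text.lower() for k in KEYWORDS)
        elif tag == "li" and self.in_item:
            self.in_item = False
            if self.relevant and text:
                self.items.append(text)


def titles_for(author, html):
    """Pair a known author with every item found in a relevant list."""
    parser = WorkListParser()
    parser.feed(html)
    parser.close()
    return [(author, item) for item in parser.items]


# Hypothetical author page.
PAGE = """
<h1>Isaac Asimov</h1>
<h2>Novels</h2>
<ul><li><i>The Robots of Dawn</i></li><li>Foundation</li></ul>
<h2>External links</h2>
<ul><li>Fan site</li></ul>
"""
print(titles_for("Isaac Asimov", PAGE))
```

The list under "External links" is ignored because its heading contains none of the keywords.
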
New Algorithm
[Diagram: occurrences of the seed pair Charles Dickens / Great Expectations]
New Algorithm
Group occurrences using edit distance and generate patterns:
<LI><I><A HREF="/WIKI/CHAOS:_MAKING_A_NEW_SCIENCE" TITLE="title">title</A></I> (VIKING PENGUIN, 1987)</LI>
&
<LI><I><A HREF="/WIKI/GREAT_EXPECTATIONS" TITLE="title">title</A></I> (1860 – 1861)</LI>
[<LI><I><A HREF="/WIKI/, " TITLE="title">title</A></I> (, )</LI>]
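
The grouping and generalization step might look roughly like the sketch below. Several choices here are assumptions rather than details from the slides: occurrences are split into tags and text runs, edit distance is computed over those tokens, the 0.5 grouping threshold is arbitrary, and differing tokens are generalized by keeping their common prefix and suffix around a `*` wildcard. The third occurrence (a table row) is invented to show a second group.

```python
import re


def tokenize(html):
    """Split an occurrence into tags and text runs."""
    return re.findall(r"<[^>]+>|[^<]+", html)


def edit_distance(a, b):
    """Levenshtein distance between two sequences (here: token lists)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]


def group_occurrences(occurrences, max_ratio=0.5):
    """Greedily put each occurrence into the first group it is close to."""
    groups = []
    for occ in occurrences:
        tokens = tokenize(occ)
        for group in groups:
            rep = tokenize(group[0])
            ratio = edit_distance(tokens, rep) / max(len(tokens), len(rep), 1)
            if ratio <= max_ratio:
                group.append(occ)
                break
        else:
            groups.append([occ])
    return groups


def merge_token(x, y, wildcard="*"):
    """Keep the common prefix and suffix of two tokens, wildcard the rest."""
    if x == y:
        return x
    limit = min(len(x), len(y))
    p = 0
    while p < limit and x[p] == y[p]:
        p += 1
    s = 0
    while s < limit - p and x[-1 - s] == y[-1 - s]:
        s += 1
    return x[:p] + wildcard + x[len(x) - s:]


def generate_pattern(group):
    """Generalize a group; the sketch only handles equal token counts."""
    token_lists = [tokenize(occ) for occ in group]
    if len({len(tokens) for tokens in token_lists}) != 1:
        return None
    pattern = token_lists[0]
    for tokens in token_lists[1:]:
        pattern = [merge_token(x, y) for x, y in zip(pattern, tokens)]
    return "".join(pattern)


OCCURRENCES = [
    '<LI><I><A HREF="/WIKI/CHAOS:_MAKING_A_NEW_SCIENCE" TITLE="title">title</A></I> (VIKING PENGUIN, 1987)</LI>',
    '<LI><I><A HREF="/WIKI/GREAT_EXPECTATIONS" TITLE="title">title</A></I> (1860-1861)</LI>',
    '<TR><TD>title</TD><TD>1987</TD></TR>',  # invented table-row occurrence
]
groups = group_occurrences(OCCURRENCES)
patterns = [generate_pattern(g) for g in groups]
print(patterns)
```

With these inputs the two list-item occurrences fall into one group and generalize to a pattern of the same shape as the bracketed one above, with `*` marking the variable parts, while the table row forms a group of its own.
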
Pattern Generation (An Alternative)
1. [Charles Dickens, James Gleick, William Shakespeare, …]
2. "List of authors"
Run patterns on result pages to obtain new titles and new authors.
Results
Method                  Seeds  Patterns  Pairs
DIPRE                   5      3         4047
The proposed algorithm  5      2         2596
Further Investigations
- Study the effects of including the titles of the lists and tables in the patterns.
- Study the qualitative differences between these two methods.