The Debye Environment for Web Data Management
Julia Erdman
SE 521
The Web
- Huge reservoir of data, including digital libraries and online stores
  - Held in "data containers"
  - Identifiable through visual clues
  - Structural variations and irregularities
What is Debye?
- Extracts Web data from its original sources
- Logically represents it in a format that allows further manipulation
- Uses extended (nested) tables to support solutions to several Web data management problems
Debye: what it does
- Example: query amazon.com for "Paul McCartney"
- How do users keep track of updates?
  - Revisit
  - Requery
  - Record data
  - Compare with old data
Debye: what it does
- Debye
  - Extracts target data from pages
  - Stores data in nested tables
Debye: what it does
- Once the tables are built
  - They can be queried
  - Their data can be stored in a relational database
Debye: how it works
- A GUI allows users to cut and paste values from a target page
- The GUI generates an object-extraction pattern (OE pattern)
- The OE pattern is fed to a generic extractor
- The extractor outputs the extracted objects in XML to the Debye textual object repository (DTOR)
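The last step, writing extracted objects out as XML, can be illustrated with a small Python sketch. The slides do not show the actual DTOR format, so the element names (`object`, `item`) and the sample values are assumptions made only for illustration.

```python
import xml.etree.ElementTree as ET

def to_xml(name, value):
    """Serialize an atom, a list, or a tuple (dict) as a nested XML element."""
    element = ET.Element(name)
    if isinstance(value, dict):
        # A tuple of named columns becomes one child element per column.
        for column, column_value in value.items():
            element.append(to_xml(column, column_value))
    elif isinstance(value, list):
        # A list becomes repeated <item> children (a hypothetical tag name).
        for entry in value:
            element.append(to_xml("item", entry))
    else:
        # An atom becomes the element's text.
        element.text = str(value)
    return element

# Hypothetical extracted object.
extracted = {"Title": "Example Album", "Artist": "Paul McCartney"}
xml_text = ET.tostring(to_xml("object", extracted), encoding="unicode")
```

A textual repository of this kind keeps the nesting of the extracted objects intact, which is what later querying and storage steps rely on.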
Debye: how it works
- There is also a user-independent example generator
  - Compares objects in the data repository with objects on a Web page
Debye: how it works
- Nested tables
  - Can be manipulated with well-known query operations
  - Make it easy to store the data in relational databases
Debye data model
ProductList = (Store: atom,
               Info: [(Title: atom, Artist: atom, AudioType: atom);
                      (Title: atom, Authors: {atom}, BookType: atom);
                      (Item: atom, Bid: atom, Time: atom)])
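One way to read the scheme above is as a nested record: a tuple of named columns whose values are either atoms or lists of further tuples, with the `Info` column allowing rows of different shapes. The Python sketch below illustrates that reading only. The sample values are invented, and modelling atoms as strings and `{atom}` as a list of strings is an assumption.

```python
# Illustrative only: one ProductList object written as nested Python data.
# Atoms are plain strings; Info holds a list of tuples (dicts) whose shape
# varies from row to row, matching the alternatives separated by ';'.
product_list = {
    "Store": "example-store",  # hypothetical value
    "Info": [
        {"Title": "Example Album", "Artist": "Paul McCartney", "AudioType": "CD"},
        {"Title": "Example Book", "Authors": ["A. Author", "B. Author"], "BookType": "Paperback"},
        {"Item": "Example Item", "Bid": "10.00", "Time": "2 days"},
    ],
}

def is_atom(value):
    """An atom is modelled here as a plain string."""
    return isinstance(value, str)

def conforms(value):
    """True if value is an atom, a list of conforming values, or a dict of them."""
    if is_atom(value):
        return True
    if isinstance(value, list):
        return all(conforms(entry) for entry in value)
    if isinstance(value, dict):
        return all(isinstance(k, str) and conforms(v) for k, v in value.items())
    return False
```

The point of the sketch is that a single table can hold CDs, books, and auction items side by side, which a flat relational row could not do directly.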
Debye: data extraction
- GUI
  - Helps users specify objects
  - The user marks pieces of data in the source and copies them to columns of a table
  - The user can insert, remove, rename, group, and split columns using the GUI
Debye: OE pattern generation
- Patterns are generated from nested tables assembled by the user
  - The user implicitly informs Debye of the object's syntactic context and structure through examples
- OE pattern generation
  - Target object's structure, in the form of a table scheme
  - Textual surroundings: markup, symbols, keywords
  - Precisely, it is a pair
Debye: extraction strategy
- The extractor reads and parses the rules produced by OE pattern generation
- Given an OE pattern and a set of pages as input, the extractor extracts data in a bottom-up procedure
  - Atomic components are extracted first
  - Then a complete object is assembled
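The bottom-up idea can be sketched in a few lines of Python. This is not Debye's extractor: the regular expressions, the HTML-like markup, and the rule "a new object starts at each Title" are all assumptions chosen to make the two steps visible.

```python
import re

# Hypothetical atom-level patterns: each column is located by the text
# that surrounds it (invented markup, not taken from a real page).
ATOM_PATTERNS = {
    "Title": re.compile(r"<b>(.*?)</b>"),
    "Artist": re.compile(r"by <i>(.*?)</i>"),
}

def extract_atoms(page):
    """Step 1: find every atomic value with its column and page position."""
    atoms = []
    for column, pattern in ATOM_PATTERNS.items():
        for match in pattern.finditer(page):
            atoms.append((match.start(1), column, match.group(1)))
    return sorted(atoms)  # page order

def assemble_objects(atoms, first_column="Title"):
    """Step 2: group atoms, in page order, into complete objects.

    A new object starts each time the first column of the scheme appears.
    """
    objects = []
    for _, column, value in atoms:
        if column == first_column or not objects:
            objects.append({})
        objects[-1][column] = value
    return objects

page = "<b>Album One</b> by <i>Artist A</i><br><b>Album Two</b> by <i>Artist B</i>"
objects = assemble_objects(extract_atoms(page))
```

Here `objects` ends up holding two records, one per album. Finding atoms first and assembling afterwards means a missing or reordered column affects only the object it belongs to.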
Debye: query interface
- Selection
- Projection
- Nest
- Unnest
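The four operations can be sketched on nested tables represented as lists of Python dicts. The function names and signatures below are assumptions for illustration, not Debye's query interface.

```python
def select(rows, predicate):
    """Selection: keep the rows that satisfy the predicate."""
    return [row for row in rows if predicate(row)]

def project(rows, columns):
    """Projection: keep only the named columns."""
    return [{c: row[c] for c in columns} for row in rows]

def unnest(rows, column):
    """Unnest: flatten a list-valued column, one output row per inner row."""
    flat = []
    for row in rows:
        outer = {k: v for k, v in row.items() if k != column}
        for inner in row[column]:
            flat.append({**outer, **inner})
    return flat

def nest(rows, group_columns, new_column):
    """Nest: group rows agreeing on group_columns; collect the rest under new_column."""
    groups = {}
    for row in rows:
        key = tuple(row[c] for c in group_columns)
        inner = {k: v for k, v in row.items() if k not in group_columns}
        groups.setdefault(key, []).append(inner)
    return [
        {**dict(zip(group_columns, key)), new_column: inner_rows}
        for key, inner_rows in groups.items()
    ]

# Hypothetical sample data.
nested = [{"Store": "S1", "Info": [{"Title": "T1", "Artist": "A1"},
                                   {"Title": "T2", "Artist": "A2"}]}]
flat = unnest(nested, "Info")
```

In this example, `nest(flat, ["Store"], "Info")` rebuilds the original nested table, so the two operations undo each other on this data.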
Debye: data storage manager
- Stores Web data in relational databases
- Mappings
  - Map-Table
    - Creates a relation for every distinct table scheme it finds in the target data's repository
  - Map-Column
    - Creates relations for columns with atom lists
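A rough sketch of the two mappings, using Python's built-in `sqlite3` module: one relation for a table scheme, and a separate relation for a column that holds a list of atoms. The table names, column names, and key layout are assumptions; the slides do not specify the relational layout that Debye produces.

```python
import sqlite3

# Hypothetical extracted objects with scheme (Title, Authors: {atom}, BookType).
books = [
    {"Title": "Example Book", "Authors": ["A. Author", "B. Author"], "BookType": "Paperback"},
]

conn = sqlite3.connect(":memory:")
# In the spirit of Map-Table: one relation for this table scheme.
conn.execute("CREATE TABLE book (id INTEGER PRIMARY KEY, title TEXT, book_type TEXT)")
# In the spirit of Map-Column: a separate relation for the atom-list column.
conn.execute(
    "CREATE TABLE book_authors ("
    "book_id INTEGER REFERENCES book(id), position INTEGER, author TEXT)"
)

for book in books:
    cursor = conn.execute(
        "INSERT INTO book (title, book_type) VALUES (?, ?)",
        (book["Title"], book["BookType"]),
    )
    book_id = cursor.lastrowid
    conn.executemany(
        "INSERT INTO book_authors (book_id, position, author) VALUES (?, ?, ?)",
        [(book_id, i, name) for i, name in enumerate(book["Authors"])],
    )
conn.commit()

# Joining the two relations recovers the nested data in flat form.
rows = conn.execute(
    "SELECT b.title, a.author FROM book b "
    "JOIN book_authors a ON a.book_id = b.id ORDER BY a.position"
).fetchall()
```

Keeping the list column in its own relation avoids repeating the book's other columns once per author.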
Questions? Comments?
Laender, A. H. F., et al. "The Debye environment for Web data management." IEEE Internet Computing, vol. 6, no. 4, July-Aug. 2002, pp. 60-69.