
Building Structured Web Databases: A Midterm Report from the Cimple Project
AnHai Doan
University of Wisconsin-Madison

Structured Web Databases

The Cimple Project (2005 – Date)
• Develops a generic solution to build Web databases
  – using extraction + integration + user feedback
• Example: DBLife
[Figure: DBLife pipeline. Sources (researcher homepages, conference pages, group pages, DBworld mailing list, DBLP) → Web pages → IE/II program → entity-relationship data (e.g., Jim Gray give-talk SIGMOD-04) → services: keyword search, browse, SQL querying, question answering, mining, alert/monitor, news summary.]

Data Model for IE/II
• Many choices
  – relational, XML, RDF triples, nested, JSON, etc.
• Desiderata
  – conceptually simple, programmers can visualize
  – naïve users can visualize (for providing feedback)
  – easy to write queries
  – robust industrial support
• Decided on relational + ER
• Want to understand benefits / limitations
[Figure: researcher homepages, conference pages, and group pages → Web pages → IE/II program → ER data (e.g., Jim Gray give-talk SIGMOD-04).]

Programming Model for IE/II
• Must combine IE/II blackboxes into workflows
• Many possible choices
  – e.g., UIMA, pub/sub
• Desiderata
  – easy to write, understand, debug, maintain
  – expressive (e.g., can do loops), highly extensible
  – solid theoretical foundation
  – can optimize to death (critical!)

Proposed Solution: Xlog, Datalog with Embedded Procedural Predicates

Input documents (docs: d1, d2), e.g., a page titled “Talks”:
  “Feedback in IR”
  Relevance feedback is important...
  “Personalized Search”
  Customizing rankings with relevance feedback...

Output table:
  title                   | abstract
  “Feedback in IR”        | “Relevance feedback is important...”
  “Personalized Search”   | “Customizing rankings with relevance feedback...”

Xlog program (annotations name the language of the embedded procedural predicates):
  titles(d, t) :- docs(d), extractTitle(d, t).          (perl module)
  abstracts(d, a) :- docs(d), extractAbstract(d, a).    (C++ module)
  talks(d, t, a) :- titles(d, t), abstracts(d, a), immBefore(t, a), contains(a, “relevance feedback”).    (perl module)
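The three rules above can be read as a small program over extractor blackboxes. The following is a minimal Python sketch of that reading, not the Cimple implementation: the document layout ("Title:" / "Abstract:" lines), the `Span` type, and all function names are assumptions made for illustration, and the regex extractors stand in for the perl and C++ modules mentioned on the slide.

```python
import re
from typing import Iterator, List, NamedTuple, Tuple


class Span(NamedTuple):
    start: int  # character offset where the span begins in its document
    end: int    # character offset just past the span
    text: str


# Hypothetical input: each document lists talks as "Title:" / "Abstract:" lines.
DOCS = {
    "d1": (
        "Title: Feedback in IR\n"
        "Abstract: Relevance feedback is important.\n"
        "Title: Personalized Search\n"
        "Abstract: Customizing rankings with relevance feedback.\n"
    ),
    "d2": "Title: Query Optimization\nAbstract: Cost models for joins.\n",
}


def _extract(label: str, doc: str) -> Iterator[Span]:
    # Stand-in for an external extractor: yields the text after "<label>: ".
    for m in re.finditer(rf"^{label}: (.*)$", doc, flags=re.MULTILINE):
        yield Span(m.start(1), m.end(1), m.group(1))


def extract_title(doc: str) -> Iterator[Span]:
    return _extract("Title", doc)


def extract_abstract(doc: str) -> Iterator[Span]:
    return _extract("Abstract", doc)


def imm_before(doc: str, t: Span, a: Span) -> bool:
    # True when only the "Abstract:" label and whitespace separate t from a.
    return t.end <= a.start and doc[t.end:a.start].strip() == "Abstract:"


def contains(a: Span, phrase: str) -> bool:
    return phrase.lower() in a.text.lower()


def talks(docs: dict, phrase: str) -> List[Tuple[str, str, str]]:
    # talks(d, t, a) :- titles(d, t), abstracts(d, a), immBefore(t, a), contains(a, phrase).
    out = []
    for d, body in sorted(docs.items()):
        titles = list(extract_title(body))        # titles(d, t)
        abstracts = list(extract_abstract(body))  # abstracts(d, a)
        for t in titles:
            for a in abstracts:                   # join on d
                if imm_before(body, t, a) and contains(a, phrase):
                    out.append((d, t.text, a.text))
    return out


RESULT = talks(DOCS, "relevance feedback")
```

With the toy documents above, `RESULT` holds the two talks from `d1`; `d2` is dropped because its abstract does not mention the phrase.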

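Xlog = Workflow of Relational Operators + Blackboxes
[Figure: execution plan for the talks rule, read bottom-up. docs(d) feeds extractTitle(d, t) and extractAbstract(d, a); their outputs are joined on d; σimmBefore(t, a) and then σcontains(a, “relevance feedback”) filter the joined tuples. Sample data: docs = {d1, d2}; titles = {(d1, t1), (d1, t2)}; abstracts = {(d1, a1), (d1, a2)}; after σimmBefore: (d1, t1, a1), (d1, t2, a2); after σcontains: (d1, t1, a1).]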

Sample Optimization: Pushing Down Text Properties

Text-property rules:
  contains(a, w) ∧ comes-from(a, d) ⇒ contains(d, w)
  italics(s) ∧ overlaps(s, t) ⇒ containsItalics(t)
  (lengthWord(s) = 3) ∧ comes-from(s, t) ⇒ lengthWord(t) > 3

[Figure: two execution plans. Original: docs(d) → extractTitle(d, t), extractAbstract(d, a) → σimmBefore(t, a) → σcontains(a, “relevance feedback”). Rewritten: the same plan with σcontains(d, “relevance feedback”) inserted directly above docs(d), so that documents without the phrase never reach the extractors.]
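The first rule above is what justifies the rewritten plan: an abstract that contains a phrase and comes from a document implies that the document contains the phrase, so documents without it can be skipped before the expensive extractor runs. A minimal Python sketch of the two plans follows; the toy documents, the `CountingExtractor` class, and the plan function names are hypothetical and only illustrate the idea.

```python
import re

# Hypothetical documents; only d1 mentions the target phrase.
DOCS = {
    "d1": "Title: Feedback in IR\nAbstract: Relevance feedback is important.\n",
    "d2": "Title: Query Optimization\nAbstract: Cost models for joins.\n",
    "d3": "Title: Storage Engines\nAbstract: Log-structured merge trees.\n",
}


class CountingExtractor:
    """Stand-in for an expensive extraction blackbox that counts its calls."""

    def __init__(self):
        self.calls = 0

    def __call__(self, doc):
        self.calls += 1
        return re.findall(r"^Abstract: (.*)$", doc, flags=re.MULTILINE)


def contains(text, w):
    return w.lower() in text.lower()


def plan_original(docs, w, extract):
    # Extract from every document, then filter the abstracts.
    return [(d, a)
            for d, body in sorted(docs.items())
            for a in extract(body)
            if contains(a, w)]


def plan_pushed_down(docs, w, extract):
    # Filter documents with the cheap test first, then extract.
    # The selection on the abstract is kept, since the document-level
    # test is only a necessary condition.
    return [(d, a)
            for d, body in sorted(docs.items())
            if contains(body, w)
            for a in extract(body)
            if contains(a, w)]


original_extractor = CountingExtractor()
pushed_extractor = CountingExtractor()
original_rows = plan_original(DOCS, "relevance feedback", original_extractor)
pushed_rows = plan_pushed_down(DOCS, "relevance feedback", pushed_extractor)
```

Both plans return the same rows, but in this toy setting the pushed-down plan invokes the extractor once instead of three times.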

Benefits of Xlog
• Can model complex workflows
  – e.g., recursion, negation
• Has well-defined semantics
• Can naturally combine IE/II blackboxes w/ relational ops
• Can immediately exploit many optimization methods
  – already developed for Datalog & RDBMS
• Can naturally incorporate text-centric optimizations
  – estimate cost, select good exec plan, in RDBMS fashion

Implementing Xlog: Take 1
• Key challenge: how to store & access data on disk
[Figure: the Xlog plan (docs(d) → extractTitle(d, t), extractAbstract(d, a) → σimmBefore(t, a) → σcontains(a, “relevance feedback”)) runs over HTML pages from the Web, with data spread across an RDBMS, OS files, and a version store (e.g., Rdiff).]

Problems (Observed when Running DBLife)
• Multiple concurrent processes
  – machines, humans
• Random data access
• Lots of RDBMS-like operations
• Huge amount of disk-resident data
• Unlike ETL, Mapreduce processes
[Figure: the Xlog plan over HTML pages from the Web, with data spread across an RDBMS, OS files, and a version store (e.g., Rdiff).]

Implementing Xlog: Take 2
• Extend RDBMS to handle IE/II over text
  – also hot direction today at RDBMS companies
• Want to understand benefits / limitations
[Figure: the Xlog plan (docs(d) → extractTitle(d, t), extractAbstract(d, a) → σimmBefore(t, a) → σcontains(a, “relevance feedback”)) runs inside the RDBMS over HTML pages from the Web.]
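One way to picture "extraction inside the RDBMS" is to register an extractor as a user-defined SQL function, so that selections and extraction appear in the same query. The sketch below uses Python's built-in `sqlite3` module purely as a stand-in; the slides do not say which RDBMS or extension mechanism the project used, and the table layout, document text, and `extract_title` function are assumptions.

```python
import re
import sqlite3


def extract_title(body):
    # Hypothetical extractor blackbox: returns the first "Title:" line, or None.
    m = re.search(r"^Title: (.*)$", body, flags=re.MULTILINE)
    return m.group(1) if m else None


conn = sqlite3.connect(":memory:")
# Expose the Python extractor to SQL as a one-argument scalar function.
conn.create_function("extract_title", 1, extract_title)

conn.execute("CREATE TABLE docs (id TEXT PRIMARY KEY, body TEXT)")
conn.executemany(
    "INSERT INTO docs VALUES (?, ?)",
    [
        ("d1", "Title: Feedback in IR\nAbstract: Relevance feedback is important."),
        ("d2", "Title: Personalized Search\n"
               "Abstract: Customizing rankings with relevance feedback."),
        ("d3", "Title: Query Optimization\nAbstract: Cost models for joins."),
    ],
)

# The selection on the document text and the extraction run in one query.
# SQLite's LIKE is case-insensitive for ASCII characters by default.
rows = conn.execute(
    "SELECT id, extract_title(body) FROM docs "
    "WHERE body LIKE '%relevance feedback%' ORDER BY id"
).fetchall()
conn.close()
```

Once the data and the extractors are both visible to the database engine, locking, indexing, and plan selection become the engine's job, which is the kind of benefit the slide alludes to.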

Implementing Learning-Based Operators by Pushing Them into RDBMS
• E.g., Markov Logic network
  – lots of RDBMS-like operations
  – Alchemy uses a fixed exec plan
  – RDBMS automatically selects a good plan
  – drastic speedup in our experiments
[Figure: the Xlog plan (docs(d) → extractTitle(d, t), extractAbstract(d, a) → σimmBefore(t, a) → σcontains(a, “relevance feedback”)) runs inside the RDBMS over HTML pages from the Web.]

Lessons Learned / Open Questions
• Relational + ER seem okay so far
• To combine blackboxes, Datalog variants are promising
• Right implementation strategy: still unclear
  – ETL / Mapreduce seems best for one-shot IE/II
  – building / maintaining many Web DBs is not one-shot
    – especially if involving humans
    – concurrent processes, data often revisited, random access
  – optimization is critical
  – RDBMS especially promising
    – locking, indexing, optimization, handling disk-resident data
  – most likely need a combination of RDBMS & Mapreduce


User Feedback
• Critical
  – IE/II inevitably make mistakes, which can cascade quickly
  – when the database evolves, mistakes happen
  – a lot of data is in users' heads, not yet on the Web
• Highly beneficial
  – scenario 1: 10-15 developers
    – their feedback can already make a big difference
    – no good solution today, “designated victim” in DBLife
  – scenario 2: hire people using Mechanical Turk
  – scenario 3: lots of ordinary users volunteering feedback

Types of User Feedback
• Flagging an error
• Fixing an error
  – editing data: input, intermediate results, output
  – editing code

Editing the Output
• To maximize the amount of feedback, users should be able to edit anything
  – records, lists, sets, tables, natural text, …
  – using whatever UI they like: form, Excel, wiki, GUI, …
  – virtually the whole page should be editable

Example: Editing a Record
• How to interpret edits?
• How to push down edits?
• How to manage concurrent edits?
• How to propagate edits?

HTML view:
  Name: Joe Hellerstein
  Organization: UC-Berkeley
  Contact: joe@berkeley.edu
  Research Interest: Data stream, Declarative networking, Sensor networks
Sample edit: remove “Contact: joe@berkeley.edu”
Data:
  Entity #123: name: Joe Hellerstein; org: UC-Berkeley; email: joe@berkeley.edu
  Entity #123: name: Joe Hellerstein; salary: 150K; org: UC-Berkeley; email: joe@berkeley.edu
  Research interests with scores: Data stream, 0.9; Declarative networking, 0.6; Sensor networks, 0.4

Example: Editing a Record
• How to edit page format?
• How to display new data?

HTML view (before):
  Name: Joe Hellerstein
  Organization: UC-Berkeley
  Contact: joe@berkeley.edu
HTML view (after a format edit):
  Name: Joe Hellerstein
  Contact: joe@berkeley.edu (try calling first)
  Organization: UC-Berkeley
Resulting page template:
  Name:
  Contact: (try calling first)
  Organization:
Data:
  Entity #123: name: Joe Hellerstein; org: UC-Berkeley; email: joe@berkeley.edu
  Entity #123: name: Joe Hellerstein; salary: 150K; org: UC-Berkeley; email: joe@berkeley.edu, joe@acm.org

Example: Editing a Record
• How to undo? recover from crash?
  – roll back to 3 pm yesterday
  – undo a bad user edit: what if other users have built on that edit?
• How to reconcile human / machine edits?
• How to split superhomepages?
[Figure: the record “Name: Joe Hellerstein; Organization: UC-Berkeley; Contact: joe@berkeley.edu” receives both machine and human edits. A superhomepage “Name: Joe Hellerstein; Organization: UC-Berkeley; Contact: joe@berkeley.edu, joe@mit.edu, joe@swivel.com” produced by the machine is split by a human into “Joe Berkeley” and “Joe MIT”.]
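The slides leave undo and human/machine reconciliation as open questions. Purely as an illustration of one possible design, and not the project's answer, the sketch below keeps an append-only edit log, rebuilds a record by replaying it, supports rollback to a point in time, and applies a hypothetical policy in which a machine edit never overwrites a value set by a human. The `Edit` type, the integer timestamps, and the policy itself are assumptions.

```python
from typing import Dict, List, NamedTuple, Optional


class Edit(NamedTuple):
    ts: int      # logical timestamp (hypothetical; a real system might use wall-clock time)
    source: str  # "human" or "machine"
    attr: str
    value: str


def materialize(log: List[Edit], as_of: Optional[int] = None) -> Dict[str, str]:
    """Replay the log in timestamp order; as_of rolls the record back in time."""
    state: Dict[str, str] = {}
    owner: Dict[str, str] = {}  # who last set each attribute
    for e in sorted(log, key=lambda e: e.ts):
        if as_of is not None and e.ts > as_of:
            break
        # Assumed policy: machine edits never overwrite human edits.
        if e.source == "machine" and owner.get(e.attr) == "human":
            continue
        state[e.attr] = e.value
        owner[e.attr] = e.source
    return state


LOG = [
    Edit(1, "machine", "email", "joe@berkeley.edu"),
    Edit(2, "human", "email", "joe@berkeley.edu (try calling first)"),
    Edit(3, "machine", "email", "joe@mit.edu"),
]

current = materialize(LOG)               # the human edit survives the later machine edit
rolled_back = materialize(LOG, as_of=1)  # state before the human edit
```

This policy does not address the harder case raised on the slide, where other users have built on an edit that is later undone.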


• Text mixed with structured data (from the database)
• Can edit both


Editing Input/Intermediate Results
• Extracting conference services
[Figure: workflow, read bottom-up. dataSources (url, date; e.g., http://.../cidr 09/01/2008) → crawl → extractNames, extractConf → findRoles → roles (name, role, page) → output (name, conf, role; e.g., Joe Hellerstein, CIDR 2009, PC Chair). Input and intermediate tables are exposed for editing through a wiki, a spreadsheet, and a form.]

Editing Code
• Currently: naïve users edit control flow of code
[Figure: filtering pubs by author. For 1 Joe Hellerstein: use just author name. For 5 Chen Li's: use author name, co-authors, conf proximity.]

Lessons Learned / Open Questions
• User feedback is critical
  – correct data obtained from Web
  – help improve IE/II algorithms
  – help solicit data in users’ head
  – help build “community Wikipedia”, using machine-human
• Numerous interesting challenges

Cimple: Current Status
• Started in 2005
  – involved UIUC, Yahoo, IBM, Microsoft
• Major project @ Wisconsin
  – affiliated profs: Jeff Naughton, Chris Re, Jude Shavlik, Raghu Ramakrishnan
  – 20+ students: Pedro DeRose, Warren Shen, Robert McCann, Xiaoyong Chai, Ba-Quy Vuong, Fei Chen, Chaitanya Gokhale, Feng Niu, Ting Chen, Byron Gao, Erick Chu, Akanksha Baid, Jiansheng Huang, and more
  – prototypes: Cimple 1.0, Cimple 2.0, applied to DB, Lake Mendota, Wikipedia
  – 19 SIGMOD/VLDB/ICDE papers + invited papers, special issues, tutorial
  – funded by NSF, Yahoo, IBM, Google, Microsoft, DARPA
  – technology transfer to Microsoft (Ad Lab, SQL Server group)