
The Cimple Project on Community Information Management
AnHai Doan
University of Wisconsin-Madison

The CIM Problem
• Numerous online communities
  – database researchers, movie fans, legal professionals, bioinformatics, enterprise intranets, tech support groups
• Each community = many data sources + many members
• Database community
  – home pages, project pages, DBworld, DBLP, conference pages, ...
• Movie fan community
  – review sites, movie home pages, theatre listings, ...
• Legal profession community
  – law firm home pages

The CIM Problem
• Members often want to discover, query, and monitor information in the community
• Database community
  – what is new in the past week in the database community?
  – any interesting connection between researchers X and Y?
  – find all citations of this paper in the past one week on the Web
  – what are current hot topics?
  – who has moved where?
• Legal profession community
  – which lawyers have moved where?
  – which law firms have taken on which cases?

The CIM Problem
• To address such needs, build data portals
• Starting out topic-based, now structured data portals
  – DBLP, Citeseer, IMDB, GlobalSpec, etc.
• Limitations of current solutions
  – mostly by hand, labor intensive, error prone
  – hard-to-port solutions
  – few services other than browsing and keyword search

Cimple Project @ Wisconsin / Yahoo! Research
Develop generic solutions to create structured data portals via extraction + integration + mass collaboration
[Figure: data sources (researcher homepages, conference pages, group pages, DBworld mailing list, DBLP; Web pages and text documents) feed an entity-relationship graph (e.g., Jim Gray give-talk SIGMOD-04), which supports services: keyword search, SQL querying, question answering, browse, mining, alert/monitor, news summary. Users personalize the system and provide feedback.]

The Research Team
• Faculty / Vice President
  – AnHai Doan
  – Raghu Ramakrishnan
• Current students
  – Pedro DeRose
  – Warren Shen
  – Fei Chen
  – Yoonkyong Lee
  – Doug Burdick
  – Mayssam Sayyadian
  – Xiaoyong Chai
  – Ting Chen

Prototype System: DBLife
Integrate data of the DB research community
• 1164 data sources
• Crawled daily, 11000+ pages = 160+ MB / day

Data Extraction

Data Integration
Raghu Ramakrishnan: co-authors = A. Doan, Divesh Srivastava, ...

Resulting ER Graph
[Figure: ER graph connecting the paper “Proactive Re-optimization”, Shivnath Babu, Pedro Bizarro, Jennifer Widom, David DeWitt, and SIGMOD 2005 through write, advise, coauthor, PC-member, and PC-Chair edges]

Querying the ER Graph
Query: “David DeWitt Jennifer Widom”
1. David DeWitt and Jennifer Widom, linked directly by a coauthor edge
2. David DeWitt and Jennifer Widom, linked through SIGMOD 2005 (PC-member and PC-Chair edges)
3. David DeWitt and Jennifer Widom, linked through Shivnath Babu (advise and coauthor edges)
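The ranked answers above can be read as short paths between the two query entities in the ER graph. The following is a minimal Python sketch of that idea, not DBLife's actual query engine: the edge list, the function names `build_graph` and `connections`, the hop limit, and the "shortest path first" ranking are all assumptions made for illustration, and which person holds which role on each edge is guessed from the flattened figure.

```python
from collections import defaultdict

# Illustrative edges only; the exact endpoints of each edge are assumptions.
EDGES = [
    ("David DeWitt", "coauthor", "Jennifer Widom"),
    ("David DeWitt", "PC-Chair", "SIGMOD 2005"),
    ("Jennifer Widom", "PC-member", "SIGMOD 2005"),
    ("Jennifer Widom", "advise", "Shivnath Babu"),
    ("Shivnath Babu", "coauthor", "David DeWitt"),
]


def build_graph(edges):
    """Build an undirected adjacency list: node -> [(edge label, neighbor)]."""
    graph = defaultdict(list)
    for src, label, dst in edges:
        graph[src].append((label, dst))
        graph[dst].append((label, src))
    return graph


def connections(graph, start, goal, max_hops=2):
    """Return all simple paths from start to goal with at most max_hops edges.

    A path is a list alternating nodes and edge labels:
    [node, label, node, label, node, ...]. Shorter paths are ranked first.
    """
    results = []
    stack = [(start, [start])]
    while stack:
        node, path = stack.pop()
        hops = len(path) // 2
        if node == goal and hops > 0:
            results.append(path)
            continue
        if hops == max_hops:
            continue
        for label, nxt in graph[node]:
            if nxt not in path[::2]:  # path[::2] holds the nodes visited so far
                stack.append((nxt, path + [label, nxt]))
    return sorted(results, key=len)


graph = build_graph(EDGES)
answers = connections(graph, "David DeWitt", "Jennifer Widom")
for rank, path in enumerate(answers, start=1):
    print(rank, " -- ".join(path))
```

A real system would also need to map the keywords in the query to entity nodes and would likely use a richer ranking function than path length.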

Provide Services
• DBLife system

Mass Collaboration: Example 1
Picture is removed if enough users vote “no”.

Mass Collaboration Meets Jeff Naughton
Jeffrey F. Naughton swears that this is David J. DeWitt

Mass Collaboration: Example 2
Community Wikipedia backed up by a structured underlying database

What We Have Done
• Define the CIM problem / understand it a little bit
  – start to talk about it in the DB community [SIGMOD-06 tutorial, IEEE DEB-06, CIDR-07]
• Build DBLife / helps clarify research issues
  – live at dblife.cs.wisc.edu
  – latest stuff at dblife-labs.cs.wisc.edu
• Start some preliminary research
  – ICDE-07a, ICDE-07b

What We Would Like to Do Next
• Release DBLife
  – as a research / education tool
  – possible service to the DB community
  – demo of CIM systems
  – benchmark / challenge for data integration / extraction
• Develop and release a generic Cimple platform
  – anyone can use it to build structured data portals
• Build CimBase: a hosting service
  – anyone can specify a structured portal on CimBase
  – we will build and host it
• Continue research / expand team / build alliance

Research Challenges (1)
[Figure: Cimple pipeline diagram, repeated from the “Cimple Project @ Wisconsin / Yahoo! Research” slide]
• Information extraction
• Data integration
• Mass collaboration

Research Challenges (2)
[Figure: Cimple pipeline diagram, repeated from the “Cimple Project @ Wisconsin / Yahoo! Research” slide]
• Exploiting extracted data
• Handling uncertainty / provenance / explanation
• Dealing with evolving data, versioning, temporal data

Research Challenges (3)
[Figure: Cimple pipeline diagram, repeated from the “Cimple Project @ Wisconsin / Yahoo! Research” slide]
• What is the right architecture?
• What is the right data model / storage?
• How to build continuously running systems?
• How to build massively scalable hosting services?
• How to build a generic CIM platform?

Rest of the Talk
• The CIM problem
• The Cimple solution approach
• What we have done / plan to do
• Research challenges
  – information extraction
  – data integration (focus on entity matching)
  – mass collaboration
• Broader perspectives

Declarative IE
• Current IE research
  – develops learning- & rule-based solutions [SIGMOD-06 tutorial]
  – focuses largely on improving accuracy
• Real-world IE applications
  – glue multiple such solutions together, using Perl
• Serious problems
  – hard to develop, understand, debug, and optimize
[Figure: sample document with the lines “DECLARATIVE IE”, “Dr. R. Ramakrishnan”, “This is a fun topic...”]

Example in DBLife
Find conference name in raw text

    #######################################
    # Regular expressions to construct the pattern to extract conference names
    #######################################
    # These are subordinate patterns.
    # Backslashes are doubled because the patterns are built in double-quoted strings.
    my $wordOrdinals="(?:first|second|third|fourth|fifth|sixth|seventh|eighth|ninth|tenth|eleventh|twelfth|thirteenth|fourteenth|fifteenth)";
    my $numberOrdinals="(?:\\d?(?:1st|2nd|3rd|1th|2th|3th|4th|5th|6th|7th|8th|9th|0th))";
    my $ordinals="(?:$wordOrdinals|$numberOrdinals)";
    my $confTypes="(?:Conference|Workshop|Symposium)";
    my $words="(?:[A-Z]\\w+\\s*)"; # A word starting with a capital letter and ending with 0 or more spaces
    my $confDescriptors="(?:international\\s+|[A-Z]+\\s+)"; # e.g. "International Conference ..." or the conference name for workshops (e.g. "VLDB Workshop ...")
    my $connectors="(?:on|of)";
    my $abbreviations="(?:\\([A-Z]\\w\\w+[\\W\\s]*?(?:\\d\\d+)?\\))"; # Conference abbreviations like "(SIGMOD'06)"

    # The actual pattern we search for. A typical conference name this pattern will find is
    # "3rd International Conference on Blah (ICBBB-05)"
    my $fullNamePattern="((?:$ordinals\\s+$words*|$confDescriptors)?$confTypes(?:\\s+$connectors\\s+.*?|\\s+)?$abbreviations?)(?:\\n|\\r|\\.|<)";

    ################################
    # Given a <dbworldMessage>, look for the conference pattern
    ################################
    lookForPattern($dbworldMessage, $fullNamePattern);

    #############################
    # In a given <file>, look for occurrences of <pattern>
    # <pattern> is a regular expression
    #############################
    sub lookForPattern {
        my ($file, $pattern) = @_;

Example in DBLife (cont.)

        # Only look for conference names in the top 20 lines of the file
        my $maxLines=20;
        my $topOfFile=getTopOfFile($file, $maxLines);

        # Look for the match in the top 20 lines - case insensitive, allow matches spanning multiple lines
        if($topOfFile=~/(.*?)$pattern/is) {
            my ($prefix, $name)=($1, $2);

            # If it matches, do a sanity check and clean up the match
            # Verify that the first letter is a capital letter or number
            if(!($name=~/^\W*?[A-Z0-9]/)) { return (); }

            # If there is an abbreviation, cut off whatever comes after that
            if($name=~/^(.*?$abbreviations)/s) { $name=$1; }

            # If the name is too long, it probably isn't a conference.
            # Count non-space characters in list context; a scalar-context match only returns true/false.
            my $nonSpaceCount = () = $name=~/[^\s]/g;
            if($nonSpaceCount > 100) { return (); }

            # Get the first letter of the last word (need to do this after chopping off parts of it due to abbreviation)
            my ($letter, $nonLetter)=("[A-Za-z]", "[^A-Za-z]");
            # Need a space before $name to handle the first $nonLetter in the pattern if there is only one word in name
            " $name"=~/$nonLetter($letter)$letter*$nonLetter*$/;
            my $lastLetter=$1;

            # Verify that the first letter of the last word is a capital letter
            if(!($lastLetter=~/[A-Z]/)) { return (); }

            # Passed test, return a new crutch
            return newCrutch(length($prefix), length($prefix)+length($name), $name, "Matched pattern in top $maxLines lines", "conference name", getYear($name));
        }
        return ();
    }

Solution: Declarative, Compositional IE
• Treat each solution as a “black box”
• Glue black boxes using a Datalog-like language
  – author(y, d) :- docs(d), name(y, d), title(x, d), distance-line(x, y)<3
  – name(y, d) :- docs(d), seeds(s), namepatterns(s, p), match(p, d, y)
  – title(x, d) :- docs(d), lines(x, n, d), allcaps(x), (n<5)
Sample document d: “DECLARATIVE IE”, “Dr. R. Ramakrishnan”, “This is a fun topic...”
seeds(s): (Raghu, Ramakrishnan), (Divesh, Srivastava), ...
Patterns p for the seed (Raghu, Ramakrishnan): Raghu Ramakrishnan, R. Ramakrishnan, Dr. Ramakrishnan, etc.
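To make the three rules concrete, here is a minimal Python sketch that evaluates them directly over the sample document. It is an illustration under stated assumptions, not the Cimple implementation: each predicate is written as a plain function, `match` is approximated by a substring test, `distance-line` by the absolute difference of line numbers, and the pattern generator in `namepatterns` produces only the three variants listed on the slide.

```python
# Hypothetical inputs mirroring the slide: one document and two seed names.
docs = {"d1": "DECLARATIVE IE\nDr. R. Ramakrishnan\nThis is a fun topic..."}
seeds = [("Raghu", "Ramakrishnan"), ("Divesh", "Srivastava")]


def lines(text):
    """lines(x, n, d): every line x of a document with its line number n."""
    return list(enumerate(text.splitlines()))


def allcaps(x):
    """allcaps(x): true if the line has letters and all of them are uppercase."""
    return x.isupper()


def namepatterns(seed):
    """namepatterns(s, p): name variants for a seed (assumed set of variants)."""
    first, last = seed
    return [f"{first} {last}", f"{first[0]}. {last}", f"Dr. {last}"]


def title(docs):
    """title(x, d) :- docs(d), lines(x, n, d), allcaps(x), (n<5)."""
    return {(x, n, d)
            for d, text in docs.items()
            for n, x in lines(text)
            if allcaps(x) and n < 5}


def name(docs, seeds):
    """name(y, d) :- docs(d), seeds(s), namepatterns(s, p), match(p, d, y)."""
    found = set()
    for d, text in docs.items():
        for s in seeds:
            for p in namepatterns(s):
                for n, x in lines(text):
                    if p in x:  # match() approximated by substring search
                        found.add((p, n, d))
    return found


def author(docs, seeds):
    """author(y, d) :- docs(d), name(y, d), title(x, d), distance-line(x, y)<3."""
    return {(y, d)
            for (y, ny, d) in name(docs, seeds)
            for (x, nx, d2) in title(docs)
            if d == d2 and abs(nx - ny) < 3}


print(author(docs, seeds))
```

Because each predicate is an independent function, any of them could be swapped for a learned or rule-based extractor without touching the others, which is the point of treating extractors as black boxes.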

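IE Execution Plan
[Figure: execution plan for the author rule, with operators PROJECT_[y, d]; join on distance-line(x, y)<3; SELECT_[allcaps(x) and (n<5)] over lines(x, n, d) and docs(d); match(y, p, d) over namepatterns(p, s), seeds(s), and docs(d). Sample input document: “DECLARATIVE IE”, “Dr. R. Ramakrishnan”, “This is a fun topic...”]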

Sample Optimization: Push Down Selections
[Figure: the execution plan from the “IE Execution Plan” slide: PROJECT_[y, d]; join on distance-line(x, y)<3; SELECT_[allcaps(x) and (n<5)]; lines(x, n, d); match(y, p, d); namepatterns(p, s); docs(d); seeds(s)]
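As a rough illustration of what pushing a selection down can buy, the sketch below evaluates the title rule twice: once by materializing every line and filtering afterwards, and once with the (n<5) condition pushed into the scan so that reading stops after five lines. The function names and the "lines examined" counter are hypothetical and exist only to make the saved work visible; the slide does not say which selections the real optimizer pushes or how.

```python
import io
from itertools import islice


def is_title_line(x):
    """allcaps(x) from the title rule."""
    return x.isupper()


def titles_naive(doc_text):
    """Materialize lines(x, n, d) for the whole document, then apply the selection."""
    examined = 0
    rows = []
    for n, raw in enumerate(io.StringIO(doc_text)):
        examined += 1
        rows.append((raw.rstrip("\n"), n))
    selected = [(x, n) for x, n in rows if is_title_line(x) and n < 5]
    return selected, examined


def titles_pushed_down(doc_text, max_line=5):
    """Push (n < max_line) into the scan: stop reading after max_line lines."""
    examined = 0
    selected = []
    for n, raw in enumerate(islice(io.StringIO(doc_text), max_line)):
        examined += 1
        x = raw.rstrip("\n")
        if is_title_line(x):
            selected.append((x, n))
    return selected, examined


# A 100-line document whose first three lines follow the slide's sample text.
doc = "DECLARATIVE IE\nDr. R. Ramakrishnan\n" + "This is a fun topic...\n" * 98

print(titles_naive(doc))
print(titles_pushed_down(doc))
```

Both plans return the same rows; only the amount of text touched differs, which matters when the plan runs over thousands of crawled pages per day.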

Sample Optimization: Order Operations
[Figure: the execution plan from the “IE Execution Plan” slide: PROJECT_[y, d]; join on distance-line(x, y)<3; SELECT_[allcaps(x) and (n<5)]; lines(x, n, d); match(y, p, d); namepatterns(p, s); docs(d); seeds(s)]

Sample Optimization: Efficient Large-Scale Pattern Matching
[Figure: the execution plan from the “IE Execution Plan” slide: PROJECT_[y, d]; join on distance-line(x, y)<3; SELECT_[allcaps(x) and (n<5)]; lines(x, n, d); match(y, p, d); namepatterns(p, s); docs(d); seeds(s)]

Related Project: Avatar @ IBM Almaden
Example text: “Person can be reached at PhoneNumber”
Pattern: Person followed by ContactPattern followed by PhoneNumber
Declarative Query Language
• ContactPattern: RegularExpression(Email.body, “can be reached at”)
• PersonPhone: Precedes(Precedes(Person, ContactPattern, D), Phone, D)

Information Extraction: Another Example
• time 0: “DECLARATIVE IE”, “Dr. R. Ramakrishnan”, “This is a fun topic...”
• time 1: “DECLARATIVE IE”, “Dr. R. Ramakrishnan”, “This is a great topic...”
• time 2: “DECLARATIVE IE”, “Dr. R. Ramakrishnan”, “More will follow soon...”
How to efficiently extract information over text streams?

Data Integration Research: Setting the Context
• Past and current work
  – build the foundation: TSIMMIS, Information Manifold, UPenn, P2P, etc.
  – develop solutions for specific integration tasks: wrapping, schema matching, entity matching, adaptive QP, etc.
  – branching into many app. domains: bioinformatics, PIM (e.g., semex, iMemex), etc.
  – top-k, TopX query processing
• Our work in Cimple
  – compositional solutions for schema matching, entity matching, etc. [VLDB-05a, VLDBJ-06, ICDE-07a, Tech Report-07a]
  – best-effort data integration: e.g., keyword search + automatic schema matching + automatic entity matching over relational databases [ICDE-07b]
  – data integration for masses [Tech Report-07b]

Sample Data Integration Challenge in Cimple: Matching Mentions of Entities
[Figure: Cimple pipeline diagram, repeated from the “Cimple Project @ Wisconsin / Yahoo! Research” slide]

Extremely Important Problem!
• Appears in numerous real-world contexts
• Plagues many applications that we have seen
  – Citeseer, Rexa, DBLP, InfoZoom, etc.
Why so important?
• Many services rely on correct mention matching
• Incorrect matching propagates errors

An Example
Discover related organizations using occurrence analysis:
“J. Han ... Centrum voor Wiskunde en Informatica”
DBLife incorrectly matches this mention “J. Han” with “Jiawei Han”, but it actually refers to “Jianchao Han”.

Classical Mention Matching
• Applies just a single “matcher”
• Focuses mainly on improving matcher accuracy
• Our key observation: a single matcher often has limited utility

Illustrating Example
d1: Luis Gravano’s Homepage
  L. Gravano, K. Ross. Text Databases. SIGMOD 03
  L. Gravano, J. Sanz. Packet Routing. SPAA 91
  L. Gravano, J. Zhou. Text Retrieval. VLDB 04
d2: Columbia DB Group Page
  Members: L. Gravano, K. Ross, J. Zhou
d3: DBLP
  Luis Gravano, Kenneth Ross. Digital Libraries. SIGMOD 04
  Luis Gravano, Jingren Zhou. Fuzzy Matching. VLDB 01
  Luis Gravano, Jorge Sanz. Packet Routing. SPAA 91
  Chen Li, Anthony Tung. Entity Matching. KDD 03
  Chen Li, Chris Brown. Interfaces. HCI 99
d4: Chen Li’s Homepage
  C. Li. Machine Learning. AAAI 04
  C. Li, A. Tung. Entity Matching. KDD 03
Only one Luis Gravano; two Chen Li-s
What is the best way to match mentions here?

A liberal matcher: good for matching Luis Gravano, bad for matching Chen Li
s0 matcher: two mentions match if they share the same name.
[Figure: documents d1-d4, as on the “Illustrating Example” slide]

A conservative matcher: good for matching Chen Li, bad for matching Luis Gravano
s1 matcher: two mentions match if they share the same name and at least one co-author name.
[Figure: documents d1-d4, as on the “Illustrating Example” slide]

Better solution: apply both matchers in a workflow
s0 matcher: two mentions match if they share the same name.
s1 matcher: two mentions match if they share the same name and at least one co-author name.
[Figure: documents d1-d4, as on the “Illustrating Example” slide, with a workflow tree that composes s0, s1, and union operators over d1, d2, d3, and d4]
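The following Python sketch shows one way the two matchers could be composed. Several things here are assumptions rather than the method of [ICDE-07a]: the workflow shape s1(union(s0(union(d1, d2)), d3, s0(d4))) is one reading of the flattened diagram; two names are treated as "the same" when their first initial and last name agree; merged clusters pool their co-authors; the d2 member-list mention carries no co-authors; and the assignment of publication lines to d3 and d4 is also read off the flattened slide.

```python
from dataclasses import dataclass
from itertools import combinations


def key(name):
    """Assumed name normalization: 'L. Gravano' and 'Luis Gravano' -> ('l', 'gravano')."""
    parts = name.replace(".", "").split()
    return (parts[0][0].lower(), parts[-1].lower())


@dataclass(frozen=True)
class Cluster:
    mentions: frozenset   # ids of the mentions merged so far
    names: frozenset      # normalized names of those mentions
    coauthors: frozenset  # pooled normalized co-author names


def mention(mid, name, coauthors=()):
    return Cluster(frozenset([mid]), frozenset([key(name)]),
                   frozenset(key(c) for c in coauthors))


def s0(a, b):
    """Liberal matcher: same name."""
    return bool(a.names & b.names)


def s1(a, b):
    """Conservative matcher: same name and at least one shared co-author."""
    return bool(a.names & b.names) and bool(a.coauthors & b.coauthors)


def apply_matcher(matcher, clusters):
    """Merge matching clusters repeatedly until no pair matches."""
    clusters = list(clusters)
    merged = True
    while merged:
        merged = False
        for i, j in combinations(range(len(clusters)), 2):
            a, b = clusters[i], clusters[j]
            if matcher(a, b):
                clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
                clusters.append(Cluster(a.mentions | b.mentions,
                                        a.names | b.names,
                                        a.coauthors | b.coauthors))
                merged = True
                break
    return clusters


d1 = [mention("d1-1", "L. Gravano", ["K. Ross"]),
      mention("d1-2", "L. Gravano", ["J. Sanz"]),
      mention("d1-3", "L. Gravano", ["J. Zhou"])]
d2 = [mention("d2-1", "L. Gravano")]
d3 = [mention("d3-1", "Luis Gravano", ["Kenneth Ross"]),
      mention("d3-2", "Luis Gravano", ["Jingren Zhou"]),
      mention("d3-3", "Luis Gravano", ["Jorge Sanz"]),
      mention("d3-4", "Chen Li", ["Anthony Tung"]),
      mention("d3-5", "Chen Li", ["Chris Brown"])]
d4 = [mention("d4-1", "C. Li"),
      mention("d4-2", "C. Li", ["A. Tung"])]

# Liberal matcher alone: merges both Chen Li-s into one cluster.
liberal_clusters = apply_matcher(s0, d1 + d2 + d3 + d4)

# Assumed workflow: s0 inside the single-person sources, then s1 across everything.
workflow_clusters = apply_matcher(
    s1, apply_matcher(s0, d1 + d2) + d3 + apply_matcher(s0, d4))

for c in workflow_clusters:
    print(sorted(c.mentions))
```

Under these assumptions the workflow yields one Luis Gravano cluster and two Chen Li clusters, whereas s0 alone collapses the two Chen Li-s. The cross-source step works because the cluster built from d1 and d2 carries all of Gravano's co-authors into the conservative comparison with DBLP.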

Key Challenges
• How to compose matchers, to form a space of workflows?
• How to estimate the accuracy of each workflow?
• How to efficiently find one with high accuracy?
[Figure: workflow tree composing s0, s1, and union operators over d1, d2, d3, and d4]
[See ICDE-07a]

Mass Collaboration: The General Idea
• Many applications have multiple developers / users
  – how to exploit feedback from all of them?
• Variants of this are known as
  – collective development of system, mass collaboration, collective curation, Web 2.0 applications, social software, etc.
• Has been applied to many applications
  – open-source software, bug detection, tech support group, Yahoo! Answers, Google Co-op, and many more
• Studied in some academic contexts, e.g., ESP Game
• Little has been done in extraction / integration contexts
  – except in industry, e.g., epinions.com

Sample Mass Collaboration in DBLife

Sample Mass Collaboration in DBLife
[Figure: diagram with the labels IE, Raw data, W1, W2, ..., Wn]

Key Challenges
• What types of extraction / integration tasks are most amenable to mass collaboration?
  – e.g., see MOBS project at Illinois [WebDB-03, ICDE-05]
• How to entice people to contribute?
• What can they contribute?
• What is the underlying data model?
• How to handle the Naughton effect?
• How to propagate user contributions?
• How to undo?
• How to reconcile multiple conflicting editions?
  – e.g., see ORCHESTRA project at Penn [Taylor & Ives, SIGMOD-06]

Sample Research: Summary
• Information extraction
  – how to do it in a declarative / compositional fashion?
  – how to apply database-like optimization techniques?
• Data integration
  – how to do it incrementally (best effort, pay-as-you-go)? an example of a Data Space?
  – how to do it in a compositional fashion?
• Human computation / mass collaboration
  – new! (Though industry has been doing it for years.)
  – how to do it for data management tasks?

Conclusions
• Community Information Management
  – increasingly crucial problem
• The Cimple project
  – sample challenges: information extraction, data integration, human computation
  – extends the footprints of DB technologies to Web data
  – develops new DB technologies
• DBLife prototype
  – research/education tool, community service, benchmark
Search “cimple wisc” for project homepage

Broader Perspectives [speculation mode]
• Current Web: keyword search over text
• Future Web
  – should have increasingly more structure
  – should have more ways to exploit structure
  – should be more “social”
• This future Web should be great for our community
  – we are the “Structure King”
  – if the Web remains text-centric, it will not be as good for us
• How to accelerate the coming of this future Web?
  – Cimple and many current projects can contribute
  – but as a community we need more efforts in this direction!