Research Internships Advanced Research and Modeling Research Group

ADREM – What?
• Research group that deals with computational aspects of data
  – databases
  – data mining
  – information retrieval

ADREM – Who?
DB/DM/IR
• Floris Geerts
• Bart Goethals
• Martin Theobald
Bioinf
• Kris Laukens
• Tim Van den Bulcke
+ PhD students and postdoctoral researchers
http://adrem.ua.ac.be/adrem

Internships – What?
• 2 research internships (15 credits each)
• MSc thesis (30 credits)
Goal: an internship is an initiation to research, carried out in collaboration with researchers in ADReM.
• 15 credits is a lot: an internship is time-consuming!
• 1 credit = 15 hours of work…
• Balance your course load and internship well.
• Internships are not necessarily related to your MSc thesis (but they can be).
• In an MSc thesis, your ability to do research independently plays an important role.

Internships – Who? • Everyone who follows the research option in the database MSc program

Research
In an internship you need to:
1. Understand a specific problem
2. Implement an (existing) method for solving the problem
3. Test and evaluate
4. Write a report
(MSc thesis: you also have to solve the problem yourself by designing new methods…)

Internships in a company
• You may do an internship in a company, but you have to ask permission.
• You also have to find the company yourself and convince us that there is research involved.
• You can’t receive any money from the company during your internship.

Databases, data mining, information retrieval
• These are not separate research domains.
• The internship topics that each of us will present next are usually at the intersection of these areas.
• Let’s see some example topics…

Bart Goethals

Recommender Systems
• Implement state-of-the-art recommenders
• Pattern mining for better recommendations
• Interactive Recommendation
• Explaining recommendations
• Test recommenders on real data

Visual Instant Interactive Pattern Mining • Study Visualizations enabling Interactive Pattern Mining • Implement and Experiment with novel instant mining methods

Pattern-based Clustering • Implement and evaluate different techniques for clustering-based pattern mining and pattern-based clustering

Data Mining for Cleaning • Study and experiment with data mining methods for data cleaning.

Martin Theobald

Information Extraction (I): Wikipedia Infoboxes

Information Extraction (I): Infoboxes
YAGO/DBpedia et al.
bornOn(Jeff, 09/22/42)
gradFrom(Jeff, Columbia)
hasAdvisor(Jeff, Arthur)
hasAdvisor(Surajit, Jeff)
knownFor(Jeff, Theory)
>120 M facts for YAGO2 (mostly from Wikipedia infoboxes)

Information Extraction (II): Wikipedia Categories

Information Extraction (II): Wikipedia Categories ?

RDF Knowledge Bases
3 million entities, 120 million facts, 100 relations, 200k classes, accuracy 95%
http://www.mpi-inf.mpg.de/yago-naga/
[Figure: excerpt of the knowledge graph. Classes: Entity, Organization, Person, Scientist, Biologist, Physicist, Politician, Location, Country, State, City. Entities and values: Max_Planck, Erwin_Planck, Max_Planck Society, Angela Merkel, Nobel Prize, Kiel, Schleswig-Holstein, Germany, Apr 23, 1858, Oct 4, 1947, Oct 23, 1944. Edge labels: subclass, instanceOf, means, bornOn, bornIn, diedOn, hasWon, fatherOf, locatedIn, citizenOf. Name strings attached via "means": “Max Planck”, “Max Karl Ernst Ludwig Planck”, “Angela Merkel”, “Angela Dorothea Merkel”.]

Linked Open Data
As of Sept. 2011:
> 200 sources
> 30 billion RDF triples
> 400 million links
http://linkeddata.org/

As of Sept. 2011: > 5 million owl:sameAs links between DBpedia/YAGO/Freebase

IBM Watson: Deep Question Answering
• William Wilkinson's "An Account of the Principalities of Wallachia and Moldavia" inspired this author's most famous novel
• This town is known as "Sin City" & its downtown is "Glitter Gulch"
• As of 2010, this is the only former Yugoslav republic in the EU
• 99 cents got me a 4-pack of Ytterlig coasters from this Swedish chain
[Figure labels: question classification & decomposition; knowledge back-ends; YAGO]
D. Ferrucci et al.: Building Watson: An Overview of the DeepQA Project. AI Magazine, Fall 2010.
www.ibm.com/innovation/us/watson/index.htm

Jeopardy! A big US city with two airports, one named after a World War II hero, and one named after a World War II battle field?

Structured Knowledge Queries
A big US city with two airports, one named after a World War II hero, and one named after a World War II battle field?
Select Distinct ?c Where {
  ?c type City . ?c locatedIn USA .
  ?a1 type Airport . ?a2 type Airport .
  ?a1 locatedIn ?c . ?a2 locatedIn ?c .
  ?a1 namedAfter ?p . ?p type WarHero .
  ?a2 namedAfter ?b . ?b type BattleField . }
• Use manually created templates for mapping sentence patterns to structured queries.
• Works for factoid and list questions.
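To make the query semantics concrete, here is a minimal sketch of how such a conjunctive triple-pattern query can be evaluated by naive nested matching. The fact set is a tiny toy example written for this illustration (it is not taken from YAGO), and the helper names `match` and `solve` are hypothetical; a real RDF engine would use indexes and join ordering instead.

```python
# Toy triple store (illustrative facts only, not an extract of a real knowledge base).
FACTS = {
    ("Chicago", "type", "City"), ("Chicago", "locatedIn", "USA"),
    ("OHare", "type", "Airport"), ("Midway", "type", "Airport"),
    ("OHare", "locatedIn", "Chicago"), ("Midway", "locatedIn", "Chicago"),
    ("OHare", "namedAfter", "Edward_OHare"), ("Edward_OHare", "type", "WarHero"),
    ("Midway", "namedAfter", "Battle_of_Midway"),
    ("Battle_of_Midway", "type", "BattleField"),
    ("Boston", "type", "City"), ("Boston", "locatedIn", "USA"),
    ("Logan", "type", "Airport"), ("Logan", "locatedIn", "Boston"),
}

# The triple patterns of the query; terms starting with "?" are variables.
PATTERNS = [
    ("?c", "type", "City"), ("?c", "locatedIn", "USA"),
    ("?a1", "type", "Airport"), ("?a2", "type", "Airport"),
    ("?a1", "locatedIn", "?c"), ("?a2", "locatedIn", "?c"),
    ("?a1", "namedAfter", "?p"), ("?p", "type", "WarHero"),
    ("?a2", "namedAfter", "?b"), ("?b", "type", "BattleField"),
]

def match(pattern, fact, binding):
    """Return binding extended so that pattern equals fact, or None on conflict."""
    extended = dict(binding)
    for term, value in zip(pattern, fact):
        if term.startswith("?"):
            if extended.setdefault(term, value) != value:
                return None
        elif term != value:
            return None
    return extended

def solve(patterns, facts):
    """Join the patterns one by one, keeping every consistent variable binding."""
    bindings = [{}]
    for pattern in patterns:
        bindings = [extended
                    for binding in bindings
                    for fact in facts
                    if (extended := match(pattern, fact, binding)) is not None]
    return bindings

# "Select Distinct ?c": project onto ?c and remove duplicates.
cities = sorted({binding["?c"] for binding in solve(PATTERNS, FACTS)})
print(cities)
```

On this toy data the only binding that survives all ten patterns has ?c = Chicago; Boston drops out because its single airport has no namedAfter fact.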

Mining Rules from RDF Knowledge Bases
Goal: Inductively learn (soft) rules: livesIn(x, y) :- bornIn(x, y)
[Figure: sets G (ground truth for livesIn, only partially known), KB (knowledge base for livesIn, known positive examples) and R (facts produced by the rule, only partially correct)]
• A-priori-style pre-filtering of low-support join patterns
• Dynamic programming ILP algorithm
• Learning with constants and type constraints
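As a rough illustration of how a candidate rule such as livesIn(x, y) :- bornIn(x, y) can be scored against a knowledge base, the sketch below computes its support and two confidence variants. The facts are invented toy data, and the function name `evaluate_rule` and the "restricted" confidence are assumptions made for this example; the slide does not say which measures the actual algorithm uses.

```python
# Toy knowledge base (invented for illustration).
born_in = {("Ann", "Paris"), ("Bob", "Rome"), ("Cid", "Oslo"), ("Dee", "Kiel")}
lives_in = {("Ann", "Paris"), ("Bob", "Milan"), ("Cid", "Oslo")}

def evaluate_rule(body_facts, head_facts):
    """Score the rule head(x, y) :- body(x, y) against the known head facts."""
    predicted = set(body_facts)            # R: every body fact yields one prediction
    support = len(predicted & head_facts)  # predictions confirmed by the KB
    # Standard confidence treats every unconfirmed prediction as wrong.
    confidence = support / len(predicted) if predicted else 0.0
    # Because the ground truth is only partially known, a softer variant judges
    # only predictions about subjects for which the KB knows some head fact.
    known_subjects = {x for x, _ in head_facts}
    judged = {(x, y) for x, y in predicted if x in known_subjects}
    restricted = support / len(judged) if judged else 0.0
    return support, confidence, restricted

support, confidence, restricted = evaluate_rule(born_in, lives_in)
print(support, confidence, round(restricted, 3))
```

Here the prediction livesIn(Dee, Kiel) is neither confirmed nor contradicted by the KB, which is why the two confidence values differ.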

Rule-based Reasoning
(Soft) Deduction Rules vs. (Hard) Consistency Constraints
• People may live in more than one place
  livesIn(x, y) :- marriedTo(x, z) ∧ livesIn(z, y) [0.8]
  livesIn(x, y) :- hasChild(x, z) ∧ livesIn(z, y) [0.5]
• People are not born in different places/on different dates
  bornIn(x, y) ∧ bornIn(x, z) → y = z
• People are not married to more than one person (at the same time, in most countries?)
  marriedTo(x, y, t1) ∧ marriedTo(x, z, t2) ∧ y ≠ z → disjoint(t1, t2)
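The two hard constraints can be checked mechanically by looking at pairs of facts. The following is a minimal sketch under stated assumptions: the facts are toy data (one bornIn fact is deliberately wrong), time intervals are half-open pairs of years, and the function names are hypothetical.

```python
from itertools import combinations

# Toy facts; the second bornIn fact is deliberately contradictory.
born_in = [("Max_Planck", "Kiel"), ("Max_Planck", "Berlin"), ("Angela_Merkel", "Hamburg")]
# marriedTo(x, y, t) with t = (start_year, end_year), read as [start, end).
married_to = [("Ann", "Bob", (1990, 2000)),
              ("Ann", "Cid", (1998, 2005)),
              ("Ann", "Dan", (2006, 2010))]

def born_in_violations(facts):
    """Pairs violating: bornIn(x, y) and bornIn(x, z) imply y = z."""
    return [(a, b) for a, b in combinations(facts, 2)
            if a[0] == b[0] and a[1] != b[1]]

def disjoint(t1, t2):
    """True if the half-open intervals t1 and t2 do not overlap."""
    return t1[1] <= t2[0] or t2[1] <= t1[0]

def married_to_violations(facts):
    """Pairs violating: marriages of x to different people must be disjoint in time."""
    return [(a, b) for a, b in combinations(facts, 2)
            if a[0] == b[0] and a[1] != b[1] and not disjoint(a[2], b[2])]

print(born_in_violations(born_in))
print(married_to_violations(married_to))
```

A pairwise scan like this is quadratic in the number of facts; grouping facts by subject first would be the obvious refinement for larger data.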

Probabilistic RDF Database
Query: graduatedFrom(Surajit, y)
Answers:
  Q1: graduatedFrom(Surajit, Princeton), lineage A ∧ ¬(B ∨ (C ∧ D)): 0.7 × (1 − 0.888) = 0.078
  Q2: graduatedFrom(Surajit, Stanford), lineage ¬A ∧ (B ∨ (C ∧ D)): (1 − 0.7) × 0.888 = 0.266
Intermediate results:
  C ∧ D: 0.8 × 0.9 = 0.72
  B ∨ (C ∧ D): 1 − (1 − 0.72) × (1 − 0.6) = 0.888
Lineage variables:
  A: graduatedFrom(Surajit, Princeton) [0.7]
  B: graduatedFrom(Surajit, Stanford) [0.6]
  C: hasAdvisor(Surajit, Jeff) [0.8]
  D: worksAt(Jeff, Stanford) [0.9]
Rules:
  hasAdvisor(x, y) ∧ worksAt(y, z) → graduatedFrom(x, z) [0.4]
  graduatedFrom(x, y) ∧ graduatedFrom(x, z) → y = z
Base Facts:
  graduatedFrom(Surajit, Princeton) [0.7]
  graduatedFrom(Surajit, Stanford) [0.6]
  graduatedFrom(David, Princeton) [0.9]
  hasAdvisor(Surajit, Jeff) [0.8]
  hasAdvisor(David, Jeff) [0.7]
  worksAt(Jeff, Stanford) [0.9]
  type(Princeton, University) [1.0]
  type(Stanford, University) [1.0]
  type(Jeff, Computer_Scientist) [1.0]
  type(Surajit, Computer_Scientist) [1.0]
  type(David, Computer_Scientist) [1.0]
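The arithmetic on this slide can be replayed in a few lines. The sketch below assumes that the four base facts A, B, C, D are independent and that the lineage of the two answers is A ∧ ¬(B ∨ (C ∧ D)) and ¬A ∧ (B ∨ (C ∧ D)), which is the reading that reproduces the slide's numbers. The function name `probability` is hypothetical; it cross-checks the closed form by enumerating all truth assignments.

```python
from itertools import product

# Probabilities of the four base facts, as given on the slide.
P = {"A": 0.7, "B": 0.6, "C": 0.8, "D": 0.9}

# Closed-form evaluation, assuming A, B, C and D are independent.
p_cd = P["C"] * P["D"]                      # C and D
p_stanford = 1 - (1 - p_cd) * (1 - P["B"])  # B or (C and D)
q1 = P["A"] * (1 - p_stanford)              # answer Princeton
q2 = (1 - P["A"]) * p_stanford              # answer Stanford

def probability(formula):
    """Sum the probabilities of all truth assignments that satisfy formula."""
    total = 0.0
    for values in product([True, False], repeat=4):
        world = dict(zip("ABCD", values))
        weight = 1.0
        for name, is_true in world.items():
            weight *= P[name] if is_true else 1 - P[name]
        if formula(**world):
            total += weight
    return total

q1_check = probability(lambda A, B, C, D: A and not (B or (C and D)))
q2_check = probability(lambda A, B, C, D: (not A) and (B or (C and D)))
print(round(p_cd, 3), round(p_stanford, 3), round(q1, 3), round(q2, 3))
```

Both routes agree with the slide's values of 0.72, 0.888, 0.078 and 0.266 after rounding. Enumeration is exponential in the number of base facts, so it only serves as a sanity check here.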

Temporal Knowledge

Probabilistic-Temporal Consistency Reasoning
Derived Facts: teamMates(Beckham, Ronaldo, T3)
  playsFor(Beckham, Real, T1) ∧ playsFor(Ronaldo, Real, T2) ∧ overlaps(T1, T2)
Base Facts: playsFor(Beckham, Real, T1), playsFor(Ronaldo, Real, T2)
[Figure: probability histograms over time ("State Relation"). teamMates(Beckham, Ronaldo, T3): 0.08, 0.16, 0.12 between '03 and '07. playsFor(Beckham, Real, T1): 0.4, 0.6 between '03 and '07. playsFor(Ronaldo, Real, T2): 0.1, 0.2, 0.4, 0.2 between '00 and '07.]
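One way to read the histograms on this slide is that each base fact carries a probability per time interval, and the derived teamMates fact gets the product of the two probabilities on every overlapping sub-interval. The sketch below works under that assumption. The interval boundaries are a guess that matches the slide's numbers (0.08, 0.16, 0.12), not something the slide states explicitly, and `overlap_join` is a hypothetical name.

```python
# Assumed reading of the histograms: half-open year intervals with probabilities.
beckham = [((2003, 2005), 0.4), ((2005, 2007), 0.6)]
ronaldo = [((2000, 2002), 0.1), ((2002, 2004), 0.2),
           ((2004, 2005), 0.4), ((2005, 2007), 0.2)]

def overlap_join(hist1, hist2):
    """Multiply probabilities on every non-empty intersection of two intervals."""
    result = []
    for (start1, end1), p1 in hist1:
        for (start2, end2), p2 in hist2:
            start, end = max(start1, start2), min(end1, end2)
            if start < end:  # the intervals really overlap
                result.append(((start, end), round(p1 * p2, 10)))
    return sorted(result)

team_mates = overlap_join(beckham, ronaldo)
for (start, end), p in team_mates:
    print(start, end, p)
```

The product is only valid if the two playsFor facts are independent, which is the same simplifying assumption as in the non-temporal case.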

Topics for Internships & Master Theses
Research Internships
• Preparation & Integration of Linked Data Sources for Scientific Experiments (SQL/Java/Python)
• Mining Association Rules from Linked Data (Java/C++)
• Visualization Frontend for Linked Data (ActionScript & Adobe Flash)
Master Theses
• Implementation of a distributed rule-based query engine for RDF data (C++ & Message Passing Interface)
• Implementation of a distributed factor graph model for correlated RDF facts (C++ & Message Passing Interface)
• Faceted Search and Interactive Browsing for Linked Data

Floris Geerts

RDBMS-based recommendation systems
• Find top-3 flights from EDI to NYC with at most one stop
  – Items: flights
  – Selection criteria: relational queries
  – Utility function: in terms of price and duration (for ranking)
• Top-k item selection
  [Figure: items + selection criteria + utility function → top-k items]
• Books, music, news, Web sites, research papers, …
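The three ingredients named here (items, selection criteria, utility function) map directly onto a filter followed by a ranked cut-off. Below is a minimal in-memory sketch; the flight data, the `Flight` fields and the weights in `utility` are all invented for illustration. Inside an RDBMS the same idea would typically be a WHERE clause plus ORDER BY and a row limit.

```python
import heapq
from dataclasses import dataclass

@dataclass(frozen=True)
class Flight:
    number: str
    origin: str
    destination: str
    stops: int
    price: float     # in some currency unit
    duration: float  # in hours

# Invented example data.
FLIGHTS = [
    Flight("F1", "EDI", "NYC", 0, 900, 8),
    Flight("F2", "EDI", "NYC", 1, 600, 11),
    Flight("F3", "EDI", "NYC", 1, 500, 15),
    Flight("F4", "EDI", "NYC", 2, 350, 20),
    Flight("F5", "EDI", "NYC", 1, 700, 10),
    Flight("F6", "EDI", "BOS", 0, 400, 7),
]

def selection(flight):
    """Selection criteria: EDI to NYC with at most one stop."""
    return flight.origin == "EDI" and flight.destination == "NYC" and flight.stops <= 1

def utility(flight, hour_cost=50.0):
    """Higher is better: negative of price plus an assumed cost per travel hour."""
    return -(flight.price + hour_cost * flight.duration)

def top_k(items, k, criteria, score):
    """Keep the items passing the criteria and return the k best by score."""
    return heapq.nlargest(k, filter(criteria, items), key=score)

best = top_k(FLIGHTS, 3, selection, utility)
print([flight.number for flight in best])
```

With these weights F2, F5 and F3 come out on top; F4 is cheaper but fails the selection criteria, which shows why filtering and ranking are kept separate.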

Query relaxation
Query for 5-day holiday:
Q(f#, name, type, ticket, time) = ∃DT, AD, xTo ( flight(f#, EDI, xTo, DT, 5/19/2012, AT, AD, Pr) ∧ POI(name, xTo, type, ticket, time) ∧ xTo = NYC )
There is no direct flight from EDI to NYC.
Relaxation: cities within 15 miles of EDI or NYC are acceptable
E = { EDI, NYC, 4/1/2012 }, X = { xTo }
Q1(f#, name, type, ticket, time) = ∃DT, AD, uTo, wEdi, wNYC, wDD ( flight(f#, wEdi, xTo, DT, wDD, AT, AD, Pr) ∧ xTo = wNYC ∧ POI(name, uTo, type, ticket, time) ∧ wDD = 5/19/2012 ∧ dist(wNYC, NYC) ≤ 15 ∧ dist(wEdi, EDI) ≤ 15 ∧ xTo = uTo )
Further relaxation: departure dates within 3 days of 5/19/2012 are acceptable
dist(wDD, 5/10/2012) ≤ 3 (valid query relaxation)
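The idea of relaxation is to replace a constant in the query by a variable that only has to be close to the constant. A minimal sketch of that idea for the flight part of the query follows; the flights, the distance table and the function names are invented for illustration, the distances are not real measurements, and the sketch takes 5/19/2012 as the intended departure date.

```python
from datetime import date

# Illustrative distances in miles (made-up values, symmetric lookup).
MILES = {("EWR", "NYC"): 10.0, ("PHL", "NYC"): 95.0, ("GLA", "EDI"): 47.0}

def dist(a, b):
    """Distance between two places; unknown pairs count as infinitely far apart."""
    if a == b:
        return 0.0
    return MILES.get((a, b), MILES.get((b, a), float("inf")))

# Invented flights: (flight number, origin, destination, departure date).
FLIGHTS = [
    ("F10", "EDI", "EWR", date(2012, 5, 19)),
    ("F11", "GLA", "NYC", date(2012, 5, 19)),
    ("F12", "EDI", "NYC", date(2012, 5, 21)),
    ("F13", "EDI", "PHL", date(2012, 5, 19)),
]

def query(flights, max_miles=0.0, max_days=0):
    """Exact query when both bounds are 0; larger bounds relax the constants."""
    target = date(2012, 5, 19)
    return [number for number, origin, destination, day in flights
            if dist(origin, "EDI") <= max_miles
            and dist(destination, "NYC") <= max_miles
            and abs((day - target).days) <= max_days]

print(query(FLIGHTS))                            # original query
print(query(FLIGHTS, max_miles=15))              # relax the cities
print(query(FLIGHTS, max_miles=15, max_days=3))  # also relax the date
```

Each relaxation step can only add answers, so the original answers stay valid while the empty result of the exact query is replaced by nearby alternatives.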

Topics
• Top-k query answering algorithm on top of RDBMS
• Query relaxation approaches and query completion

Data quality • Detecting and correcting inconsistencies • Finding duplicates • Finding most up-to-date information

Semantic errors
Yahoo! Finance: Day’s Range: 93.80-95.71, 52wk Range: 25.38-95.71
Nasdaq: Day’s Range: 93.80-95.71, 52 Wk: 25.38-93.72
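A discrepancy like this can be caught by a simple consistency rule. The sketch below assumes one plausible rule, namely that the day's range must lie inside the 52-week range, and assumes that the first pair of values belongs to Yahoo! Finance and the second to Nasdaq. The function names are hypothetical.

```python
def parse_range(text):
    """Turn a string such as '93.80-95.71' into a (low, high) pair of floats."""
    low, high = (float(part) for part in text.split("-"))
    return low, high

def contained(inner, outer):
    """True if the interval inner lies completely inside the interval outer."""
    return outer[0] <= inner[0] and inner[1] <= outer[1]

# Values from the slide; the assignment to sources is an assumption.
SOURCES = {
    "Yahoo! Finance": {"day": "93.80-95.71", "52wk": "25.38-95.71"},
    "Nasdaq": {"day": "93.80-95.71", "52wk": "25.38-93.72"},
}

consistent = {
    name: contained(parse_range(values["day"]), parse_range(values["52wk"]))
    for name, values in SOURCES.items()
}
print(consistent)
```

Under this rule the second source is flagged, because its 52-week high lies below the day's high. Whether that is an error or a different definition of the 52-week range (for example one that excludes the current day) is exactly the semantic question.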

Instance ambiguity

Out-of-Date Data 4:05 pm 3:57 pm

Unit errors 76.82B 76,821,000
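The two values on this slide differ by almost exactly a factor of 1000, which is typical for a unit mix-up (for example thousands reported as units). A minimal sketch of a detector follows; the suffix table, the 1% tolerance and the function names are assumptions made for this example.

```python
import math

SUFFIXES = {"K": 1e3, "M": 1e6, "B": 1e9}

def parse_amount(text):
    """Parse strings such as '76.82B' or '76,821,000' into a float."""
    text = text.replace(",", "").replace(" ", "")
    if text[-1].upper() in SUFFIXES:
        return float(text[:-1]) * SUFFIXES[text[-1].upper()]
    return float(text)

def unit_mismatch(a, b, tolerance=0.01):
    """Return k != 0 if a is within tolerance of b * 1000**k, otherwise None.

    Both values are assumed to be positive.
    """
    ratio = a / b
    k = round(math.log(ratio, 1000))
    if k != 0 and abs(ratio / 1000 ** k - 1) <= tolerance:
        return k
    return None

first = parse_amount("76.82B")
second = parse_amount("76,821,000")
print(unit_mismatch(first, second))
```

A non-zero result says the two sources probably report the same quantity in units that differ by that power of 1000; values that merely disagree by a few percent return None and would be handled by other data quality rules.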

Topics
• Fast inconsistency detection
• Duplicate elimination algorithms
• Automated repairing algorithms
• Mining of “data quality rules”