Information Integration: Top-N Queries
Felix Naumann, lecture (VL) Informationsintegration, WS 05/06, 13.12.2005

Overview
- Queries for the FIRST results (First-N)
  - "Enough already!" in SQL [CK 97]
  - Syntax and semantics
  - Optimization
- Queries for the BEST results (Top-N)
  - Motivation
  - Fagin's algorithm

Motivation for First-N
1. Querying
   - Semantics: correctness and completeness, i.e. all results are wanted
   - Example: aggregation
   - Applications: DBMS, data warehouses
2. Browsing
   - Semantics: as correct as possible, as complete as desired, i.e. a few exemplary results
   - Example: life sciences
   - Applications: GUI
3. Searching
   - Semantics: as correct as possible, only the best results, i.e. one or a few matching results
   - Example: documents
   - Applications: digital library systems, content management systems, Google

Information Integration and Browsing
- Why are First-N techniques interesting for information integration?
- For users
  - The kind of data is unknown
  - The volume of data is unknown
  - Browsing
  - The usefulness and quality of the data are doubtful anyway
  - Queries are only formulated fuzzily
  - The query is refined in further steps (query refinement)
- For the system
  - Data acquisition is often slow and expensive
  - Hence: large optimization potential
    - Local optimization
    - Global optimization: network costs

Query Processing in a DBMS
1. Formulate the SQL query
2. The system accepts the SQL query
   1. Parse
   2. Optimize
3. The system executes the query
   1. Build the tuple pipeline
   2. Write the result tuples into a temporary table
4. Return a cursor on the first result tuple
5. The application successively calls next() on the cursor
   1. GUI (e.g. Aqua Data Studio)
   2. Program (e.g. via JDBC)

Query Processing in a DBMS
- Problem
  - The DBMS computes the complete result
  - The application may fetch only a few tuples
    - e.g. one window full
    - e.g. the Top-N results according to a sort order
  - The application saves effort, but the DBMS does not!


SQL
    SELECT ...      projection
    FROM ...        relations
    WHERE ...       selection and join conditions
    GROUP BY ...    grouping
    HAVING ...      selection after grouping
    ORDER BY ...    sorting

Expensive SQL: Example
    SELECT h.name, h.adresse, h.tel
    FROM hotels h, flughäfen f
    WHERE f.name = 'TXL'
    ORDER BY distance(h.ort, f.ort)
- Cardinalities: 20,000 hotels, 1,000 airports (flughäfen), 1 airport named 'TXL'
- Result: 20,000 hotels in ascending order of distance to TXL
- In addition: distance() is executed 20,000 times
Examples after [CK 97].

STOP AFTER: Syntax
    SELECT ...      projection
    FROM ...        relations
    WHERE ...       selection and join conditions
    GROUP BY ...    grouping
    HAVING ...      selection after grouping
    ORDER BY ...    sorting
    STOP AFTER ...  new: limits the result cardinality
STOP AFTER as proposed in [CK 97]. Important: it is not part of the SQL standard!

STOP AFTER: Semantics
1. Execute all the standard SQL in the query.
2. Restrict the result to the first N tuples.
- Without sorting
  - The exact result set is not specified
- With sorting
  - The exact result set is specified, except when the sort attributes contain duplicates: then the exact result set is not specified
- Fewer than N tuples in the result: STOP AFTER has no effect

STOP AFTER: Example
    SELECT h.name, h.adresse, h.tel
    FROM hotels h, flughäfen f
    WHERE f.name = 'TXL'
    ORDER BY distance(h.ort, f.ort)
    STOP AFTER 5
- Result: 5 hotels in ascending order of distance to TXL
- Are there savings on distance()?

STOP AFTER: Example
    SELECT p.name, v.umsatz
    FROM Produkte p, Verkäufe v
    WHERE p.typ = 'software'
      AND p.id = v.prod_id
    ORDER BY v.umsatz DESC
    STOP AFTER (
        SELECT count(*)/10
        FROM Produkte p
        WHERE p.typ = 'software')
List the name and revenue (umsatz) of the top 10% of software products by revenue.

STOP AFTER: Updates
    UPDATE Spieler
    SET Gehalt = 0.5 * Gehalt
    WHERE id IN (
        SELECT s.id
        FROM Spieler s
        ORDER BY s.tore
        STOP AFTER 3 )
Hertha BSC: cut the salaries (Gehalt) of the 3 worst players, ranked by goals scored (tore), by 50%.

STOP AFTER: Implementation
- Implementation in the application
  - No change to the DBMS
  - Optimization potential is not exploited
- Implementation in the DBMS as an outer layer
  - Savings in data transfer
  - Optimization potential is not fully exploited
- Implementation in the DBMS kernel
  - Full optimization potential
  - More difficult

Review: Query Optimization
- Translation of SQL into an internal representation
  - Internal operators: Scan, Sort, Select, Project, ...
  - Interpretation as a tree
- Transformation steps on the tree
- The next step is chosen according to a cost model

Review: Query Processing
    SELECT m.name
    FROM mitarbeiter m, abteilung a
    WHERE m.abt_id = a.id
      AND a.name = 'Verkauf'
    ORDER BY m.gehalt
In words?
As an algebra expression:
$\pi_{m.name}(\mathrm{sort}_{m.gehalt}(\sigma_{m.abt\_id = a.id \,\wedge\, a.name = \text{'Verkauf'}}(m \times a)))$

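Review: Query Processing
$\pi_{m.name}(\mathrm{sort}_{m.gehalt}(\sigma_{m.abt\_id = a.id \,\wedge\, a.name = \text{'Verkauf'}}(m \times a)))$
[Figure: two equivalent plan trees with intermediate cardinalities in brackets]
Plan 1 (cross product followed by selections):
    π(m.name) [20]
      Sort(m.gehalt) [20]
        σ(a.name = 'Verkauf') [20]
          σ(m.abt_id = a.id) [990]
            × [100,000]
              Mitarbeiter m [1,000]
              Abteilung a [100]
Plan 2 (selection and cross product merged into a join):
    π(m.name) [20]
      Sort(m.gehalt) [20]
        σ(a.name = 'Verkauf') [20]
          ⋈(m.abt_id = a.id) [990]
            Mitarbeiter m [1,000]
            Abteilung a [100]
Saving: the intermediate result of 100,000 tuples is avoided.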

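Review: Query Processing
[Figure: two equivalent plan trees with intermediate cardinalities in brackets; one cardinality is left open as a question ("?") on the slide]
Plan 1 (selection after the join):
    π(m.name) [20]
      Sort(m.gehalt) [20]
        σ(a.name = 'Verkauf') [20]
          ⋈(m.abt_id = a.id) [990]
            Mitarbeiter m [1,000]
            Abteilung a [100]
Plan 2 (selection pushed below the join):
    π(m.name) [20]
      Sort(m.gehalt) [?]
        ⋈(m.abt_id = a.id) [20]
          Mitarbeiter m [1,000]
          σ(a.name = 'Verkauf') [5]
            Abteilung a [100]
Saving: the join no longer produces 990 tuples.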

New Operator
- Logical operator
  - Stop(N, sort direction, sort expression)
    - N: maximum result size
    - Sort direction: asc, desc, none
    - Sort expression: usually the same as in ORDER BY
- Physical operators
  - Implementation variants of the logical operator
    1. Scan-Stop (covered next)
    2. Sort-Stop

Scan-Stop
- Used if the sort direction is 'none'
- Closes the input stream after N tuples
- Cost model
  - p = the plan below the Stop operator (in the tree)
  - s = the plan including the Stop operator
  - Cost(1) = cost of the first tuple (latency)
  - Cost_p(ALL) = cost of all tuples of p
- Pipelines are preferred! Why?
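A minimal Python sketch of the Scan-Stop idea, assuming the plan below the Stop operator can be modeled as a lazy iterator. The names `scan` and `scan_stop` are hypothetical and not taken from [CK 97]. It hints at why pipelined plans are preferred: the input only has to produce as many tuples as the Stop operator pulls.

```python
from itertools import islice

def scan(table, produced):
    """Hypothetical pipelined scan: yields one tuple at a time and
    records every tuple it had to produce in `produced`."""
    for row in table:
        produced.append(row)
        yield row

def scan_stop(input_stream, n):
    """Scan-Stop sketch: pass on at most n tuples, then stop pulling."""
    return islice(input_stream, n)

produced = []
result = list(scan_stop(scan(range(1000), produced), 5))
print(result)         # the first 5 tuples of the input
print(len(produced))  # how many tuples the scan actually produced
```

With a blocking operator (for example a full sort) below the Stop, the input would have to be produced completely before the first tuple could be returned.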

Sort-Stop
- Used if the sort direction is 'asc' or 'desc'
- If the input is already sorted accordingly: close the input stream after N tuples
  - Cost as for Scan-Stop
- Otherwise sort:
  - Put the first N tuples into a priority heap
  - Test each subsequent tuple against the heap
  - The input stream is produced completely
- Terms of the cost formula (shown as a figure on the slide): the cost of one comparison, the tests against the heap bound, and i insertions into the heap, with N ≤ i ≤ ALL
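The unsorted case can be sketched in Python with a bounded heap. This is an illustration under assumptions, not the implementation from [CK 97]: the function name `sort_stop`, its parameters, and the tie-breaking rule (earlier tuples win) are hypothetical.

```python
import heapq

def sort_stop(input_stream, n, key, descending=False):
    """Sort-Stop sketch for unsorted input: consume the whole stream once
    and keep only the n best tuples in a heap of size n.
    Returns the n best tuples in order and the number of heap insertions."""
    if n <= 0:
        return [], 0
    sign = -1 if descending else 1
    heap = []        # root of the heap = worst tuple currently kept
    insertions = 0
    for seq, row in enumerate(input_stream):
        k = sign * key(row)
        # Negate so that heapq's min-heap keeps the worst kept tuple at the
        # root; -seq makes later tuples lose ties and keeps rows uncompared.
        entry = (-k, -seq, row)
        if len(heap) < n:
            heapq.heappush(heap, entry)      # the first n tuples always enter
            insertions += 1
        elif entry > heap[0]:                # test against the heap bound
            heapq.heapreplace(heap, entry)   # evict the worst kept tuple
            insertions += 1
    ordered = [row for _, _, row in sorted(heap, reverse=True)]
    return ordered, insertions

salaries = [3000, 5200, 4100, 2500, 6100, 4800]
top, insertions = sort_stop(salaries, 3, key=lambda s: s, descending=True)
print(top, insertions)
```

Every input tuple costs one comparison against the heap bound, but only some of them cause a heap insertion, which is where the saving over a full sort comes from.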

Optimization with the Stop Operator
- Placement of the Stop operator in the query plan
- Fundamental problem: early placement is advantageous but risky
  - Advantage: small intermediate results, hence low cost
  - Risk: the final result is not large enough, hence re-execution
- Two strategies
  - "Conservative" and "aggressive"

Optimization with the Stop Operator
- Conservative strategy
  - Cost-minimal: place Stop as early as possible in the plan.
  - Correct: never place Stop such that tuples are removed that might be needed later.
  - That is: apply Stop only to input streams in which every input tuple produces at least one output tuple.
    - Operators that filter tuples must therefore be executed before the Stop.

Optimization with the Stop Operator
    SELECT *
    FROM mitarbeiter m, abteilung a
    WHERE m.abt_id = a.id
    ORDER BY m.gehalt DESC
    STOP AFTER 10
[Figure: two candidate plans]
Plan 1 (Stop after the join):
    Stop(10) [sortStop]
      ⋈(m.abt_id = a.id)
        Mitarbeiter m
        Abteilung a
Plan 2 (Stop before the join):
    ⋈(m.abt_id = a.id)
      Stop(10) [sortStop]
        Mitarbeiter m
      Abteilung a
Under which conditions is Plan 2 allowed? m.abt_id is NOT NULL, and m.abt_id is a foreign key.

Optimization with the Stop Operator
    SELECT *
    FROM mitarbeiter m, abteilung a
    WHERE m.abt_id = a.id
      AND a.name = 'Verkauf'
    ORDER BY m.gehalt DESC
    STOP AFTER 10
[Figure: two candidate plans]
Plan 1 (Stop after the join):
    Stop(10) [sortStop]
      ⋈(m.abt_id = a.id)
        Mitarbeiter m
        σ(a.name = 'Verkauf')
          Abteilung a
Plan 2 (Stop before the join):
    ⋈(m.abt_id = a.id)
      Stop(10) [sortStop]
        Mitarbeiter m
      σ(a.name = 'Verkauf')
        Abteilung a
Is Plan 2 allowed? No! The selection on Abteilung filters tuples, so employees that pass the early Stop may find no join partner.
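The difference between the two cases can be checked on toy data. The following Python sketch uses made-up rows and hypothetical helper names (`join`, `stop`); it compares placing Stop before and after the join, once without and once with the filter on the department name.

```python
import heapq

# Toy data (made up): (name, gehalt, abt_id); abt_id is a non-null foreign key.
mitarbeiter = [
    ("Anna", 5000, 1), ("Ben", 4800, 2), ("Cem", 4700, 2),
    ("Dana", 4000, 1), ("Eva", 3900, 1),
]
abteilung = {1: "Verkauf", 2: "Einkauf"}  # id -> name

def join(emps, only_dept=None):
    """Join employees with their department, optionally keeping one department."""
    out = []
    for name, gehalt, abt_id in emps:
        dept = abteilung.get(abt_id)
        if dept is not None and (only_dept is None or dept == only_dept):
            out.append((name, gehalt, dept))
    return out

def stop(rows, n):
    """Stop(n) with ORDER BY gehalt DESC."""
    return heapq.nlargest(n, rows, key=lambda r: r[1])

n = 2
# Without a filter every employee finds exactly one partner: both plans agree.
late = stop(join(mitarbeiter), n)
early = join(stop(mitarbeiter, n))
# With the filter, the early Stop removes tuples that are needed later.
late_f = stop(join(mitarbeiter, "Verkauf"), n)
early_f = join(stop(mitarbeiter, n), "Verkauf")
print(early == late, late_f, early_f)
```

In the filtered case the early Stop keeps Anna and Ben, but Ben is not in 'Verkauf', so only one tuple survives although two qualifying tuples exist.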

Optimization with the Stop Operator
- Aggressive strategy
  - Place Stop as early as possible in the plan.
  - Choose a (hopefully) sufficiently large N: add a "reserve" (e.g. 20%).
  - Place a further, final Stop(N) later in the plan.
  - Place suitable "Restart" operators.

Optimization with the Stop Operator
    SELECT *
    FROM mitarbeiter m, abteilung a, reisen r
    WHERE m.abt_id = a.id
      AND r.konto = m.reisekonto
    ORDER BY m.gehalt DESC
    STOP AFTER 10
[Figure: plan for the aggressive strategy]
    Stop(10)
      ⋈(m.abt_id = a.id)
        Restart
          ⋈(m.rkonto = r.konto)
            Stop(20) [sortStop]
              Mitarbeiter m
            Reise r
        Abteilung a
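A rough Python sketch of the aggressive strategy, under stated assumptions: the function `aggressive_top_n`, the predicate `keep` (standing in for a filtering join), and the policy of doubling the cutoff on restart are all hypothetical choices for illustration. The slides only say that a reserve is added and that Restart operators are placed.

```python
import heapq
import math

def aggressive_top_n(rows, keep, n, reserve=0.2, key=lambda r: r):
    """Apply an early Stop with a reserve before the filtering operator
    `keep`; restart with a larger cutoff if fewer than n tuples survive."""
    k = math.ceil(n * (1 + reserve))
    restarts = 0
    while True:
        early = heapq.nlargest(k, rows, key=key)    # early Stop(k)
        survivors = [r for r in early if keep(r)]   # filtering operator
        if len(survivors) >= n or k >= len(rows):
            return survivors[:n], restarts          # final Stop(n)
        k = min(len(rows), 2 * k)                   # assumed restart policy
        restarts += 1

# Made-up input: salaries 1..100; only even values find a join partner.
rows = list(range(1, 101))
top, restarts = aggressive_top_n(rows, keep=lambda r: r % 2 == 0, n=10)
print(top, restarts)
```

Here the 20% reserve is too small because half of the tuples are filtered out, so one restart is needed. A better selectivity estimate would have avoided it.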

Implementations of First-N
- SQL:
    select name, salary
    from employee A
    where 50 > (select count(*) from employee B where B.salary > A.salary)
- MySQL:
    SELECT ... FROM ... LIMIT 10
- DB2:
    FETCH FIRST N ROWS ONLY
    OPTIMIZE FOR N ROWS
- MS SQL Server:
    SELECT TOP N ... FROM ...
- Oracle:
    ... WHERE rownum < N
    OPTIMIZER_MODE = FIRST_ROWS_N
- How each system optimizes these queries is unclear!
- Further optimization ("reducing the braking distance"): see e.g. [CK 98]
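The plain-SQL formulation and a LIMIT clause can be compared directly. The sketch below uses SQLite through Python's `sqlite3` module, which is an assumption made for convenience (SQLite is not among the systems listed); the table contents are made up, and 3 replaces the 50 from the listing.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (name TEXT, salary INTEGER)")
conn.executemany(
    "INSERT INTO employee VALUES (?, ?)",
    [("a", 100), ("b", 400), ("c", 300), ("d", 200), ("e", 500)],
)

# Plain SQL: keep rows for which fewer than 3 rows have a higher salary.
portable = conn.execute(
    "SELECT name, salary FROM employee A "
    "WHERE 3 > (SELECT count(*) FROM employee B WHERE B.salary > A.salary) "
    "ORDER BY salary DESC"
).fetchall()

# Vendor-specific shortcut (LIMIT, as in MySQL and SQLite).
limited = conn.execute(
    "SELECT name, salary FROM employee ORDER BY salary DESC LIMIT 3"
).fetchall()

print(portable)
print(limited)
conn.close()
```

The two queries agree here because all salaries are distinct. With ties at the cutoff, the correlated subquery may return more than 3 rows, whereas LIMIT cuts at exactly 3.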

Queries for the Top-N Results: Motivation
- First-N restricts the size of the result set but not (necessarily) the properties of the result
  - Exception: sorting on one attribute
- Top-N restricts both the result set and its properties
  - Sorting by an (arbitrary) measure
  - Measures are often fuzzy.
  - Measures often take several attributes as input.

Queries for the Top-N Results: Examples
- Search engines
  - Measure: occurrence of the search term and "authority"
- Information retrieval
  - Measure: relevance
- In DBMS
  - 4-room apartments under $30,000
  - Not supported so far
- In multimedia DBMS
  - Images with red and round objects

Top-N in Multimedia DBMS
- Color similarity
  - E.g. query: Color = 'red'
  - Computing "redness" is often complex (many color dimensions, many pixels)
  - The MMDBMS returns the top-N reddest objects
    - Multidimensional indices
- Shape similarity
  - E.g. query: Shape = 'round'
  - Computing "roundness" is often complex
  - The MMDBMS returns the top-N roundest objects
- This corresponds to First-N semantics (a measure on one attribute)
  - But how can the two be combined?

Top-N in Multimedia DBMS
- Example: the Beatles' "Red Album"
- Query: Color = 'red' ∧ Name = 'Beatles'
  - Color = 'red' is a fuzzy predicate: the answer is a sorted list
  - Name = 'Beatles' is a non-fuzzy predicate: the answer is an (unsorted) set
- What should the combined answer be?
  - A set?
  - A sorted list?

Top-N in Multimedia DBMS
- Query: Color = 'red' ∧ Name = 'Beatles'
  - Answer: a set? A sorted list?
- Query: Color = 'red' ∧ Shape = 'round'
  - Answer: a set? A sorted list?
- Idea [Fa 96]: the answer is a "graded set"

Top-N: Graded Sets
- Graded set:
  - A set of pairs $(x, g)$
  - $x$ is an object
  - $g \in [0, 1]$ is a grade
- Query: Name = 'Beatles'
  - Answer: a graded set with $g \in \{0, 1\}$
- Query: Color = 'red'
  - Answer: a graded set with $g \in [0, 1]$

Top-N: Graded Sets
- Query: Name = 'Beatles' ∧ Color = 'red'
- Problem:
  - Measure: how to grade the objects in the answer
  - Let $g_A(x)$ be the grade of object $x$ under query $A$.
- Desired properties
  - If $g \in \{0, 1\}$, standard logic should hold.
  - Preservation of logical equivalence
    - $g_{A \wedge A}(x) = g_A(x)$
    - $g_{A \wedge (B \vee C)}(x) = g_{(A \wedge B) \vee (A \wedge C)}(x)$
  - Monotonicity: $g_A(x) \le g_A(y),\ g_B(x) \le g_B(y) \Rightarrow g_{A \wedge B}(x) \le g_{A \wedge B}(y)$

Top-N: Graded Sets
- Proposal
  - Conjunction rule [Za 65]: $g_{A \wedge B}(x) = \min\{g_A(x), g_B(x)\}$
  - Disjunction rule [Za 65]: $g_{A \vee B}(x) = \max\{g_A(x), g_B(x)\}$

Top-N: Graded Sets
- $g_{A \wedge B}(x) = \min\{g_A(x), g_B(x)\}$, $g_{A \vee B}(x) = \max\{g_A(x), g_B(x)\}$
- Standard logic ($g \in \{0, 1\}$)
  - $0 \wedge 1 = \min\{0, 1\} = 0$
  - $0 \vee 1 = \max\{0, 1\} = 1$
- Equivalence
  - $g_{A \wedge A}(x) = \min\{g_A(x), g_A(x)\} = g_A(x)$
  - $g_{A \wedge (B \vee C)}(x) = \min\{g_A(x), \max\{g_B(x), g_C(x)\}\} = \max\{\min\{g_A(x), g_B(x)\}, \min\{g_A(x), g_C(x)\}\} = g_{(A \wedge B) \vee (A \wedge C)}(x)$
- Monotonicity
  - Required: $g_A(x) \le g_A(y),\ g_B(x) \le g_B(y) \Rightarrow g_{A \wedge B}(x) \le g_{A \wedge B}(y)$
  - Holds: $g_A(x) \le g_A(y),\ g_B(x) \le g_B(y) \Rightarrow \min\{g_A(x), g_B(x)\} \le \min\{g_A(y), g_B(y)\}$

Other Measures?
- AVG
  - $g_{A \wedge B}(x) = \mathrm{avg}\{g_A(x), g_B(x)\}$, $g_{A \vee B}(x) = \max\{g_A(x), g_B(x)\}$
  - Standard logic
    - $0 \wedge 1 = \mathrm{avg}\{0, 1\} = 0.5$
    - $0 \vee 1 = \max\{0, 1\} = 1$
  - Equivalence
    - $g_{A \wedge A}(x) = \mathrm{avg}\{g_A(x), g_A(x)\} = g_A(x)$
    - $g_{A \wedge (B \vee C)}(x) = \mathrm{avg}\{g_A(x), \max\{g_B(x), g_C(x)\}\} = \max\{\mathrm{avg}\{g_A(x), g_B(x)\}, \mathrm{avg}\{g_A(x), g_C(x)\}\} = g_{(A \wedge B) \vee (A \wedge C)}(x)$
  - Monotonicity
    - $g_A(x) \le g_A(y),\ g_B(x) \le g_B(y) \Rightarrow \mathrm{avg}\{g_A(x), g_B(x)\} \le \mathrm{avg}\{g_A(y), g_B(y)\}$
- That is, standard logic is not preserved.
  - Query: Name = 'Beatles' ∧ Color = 'red'
  - The album (Santana, Supernatural) has a score > 0
  - Almost every other album also has a score > 0
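The desired properties can be checked numerically for both candidate conjunction rules. The sketch below is a brute-force test on a small grid of grades, not a proof; the function `check` and the grid are hypothetical choices, and the grades are picked so that floating-point arithmetic is exact.

```python
from itertools import product

GRADES = [0.0, 0.25, 0.5, 0.75, 1.0]  # exactly representable as floats

def avg(a, b):
    return (a + b) / 2

def check(conj, disj):
    """Test the desired properties of a conjunction/disjunction rule pair
    on all grade combinations from GRADES."""
    standard_logic = all(
        conj(a, b) == float(bool(a) and bool(b))
        and disj(a, b) == float(bool(a) or bool(b))
        for a, b in product([0.0, 1.0], repeat=2)
    )
    idempotent = all(conj(a, a) == a for a in GRADES)
    distributive = all(
        conj(a, disj(b, c)) == disj(conj(a, b), conj(a, c))
        for a, b, c in product(GRADES, repeat=3)
    )
    monotone = all(
        conj(a1, b1) <= conj(a2, b2)
        for a1, b1, a2, b2 in product(GRADES, repeat=4)
        if a1 <= a2 and b1 <= b2
    )
    return {
        "standard_logic": standard_logic,
        "idempotent": idempotent,
        "distributive": distributive,
        "monotone": monotone,
    }

min_rule = check(min, max)
avg_rule = check(avg, max)
print(min_rule)
print(avg_rule)
```

On this grid the min rule satisfies all four properties, while the avg rule only fails standard logic, because avg{0, 1} = 0.5 instead of 0.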

Top-N: Fagin's Algorithm
- Given: a conjunctive query with partially fuzzy predicates.
- Wanted: a graded set containing at least the Top-N objects
- Access model for the MMDBMS
  - Sorted access: a cursor on the sorted list
  - Random access: the grade of a specific object
- Cost model:
  - Every requested object costs 1.
- Optimization:
  - Minimize the cost

Top-N: Example
- Query: Name = 'Beatles' ∧ Color = 'red'
- $G = \min\{1, g_{\text{Color}=\text{'red'}}(x)\}$
[Figure: query plan]
    ⋈(s.id = p.id)
      σ(Name = 'Beatles')
        DBMS: Schallplatten (records)
      random access
        MMDBMS: Plattencover (record covers)
- Cost?

Top-N: Naive Algorithm
- Query: Shape = 'round' ∧ Color = 'red'
1. Sorted access to all objects (with their grade for Shape = 'round')
2. Sorted access to all objects (with their grade for Color = 'red')
3. Join over all objects x
4. For each object, compute the minimum grade: $\min\{g_{\text{round}}(x), g_{\text{red}}(x)\}$
5. Sort to obtain the Top-N
- Cost
  - $2n$ (2 × sorted access)
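The five steps can be sketched in a few lines of Python. The function name `naive_top_n` and the representation of each source as a dict from object ID to grade are assumptions made for illustration; the grades are borrowed from the worked example later in the deck.

```python
def naive_top_n(source_a, source_b, n):
    """Naive Top-N: read both graded lists completely, join on the object
    ID, grade each object with min, and sort."""
    accesses = len(source_a) + len(source_b)          # steps 1 and 2: 2n accesses
    joined = {
        x: min(source_a[x], source_b[x])              # steps 3 and 4
        for x in source_a if x in source_b
    }
    ranked = sorted(joined.items(), key=lambda kv: kv[1], reverse=True)  # step 5
    return ranked[:n], accesses

# Grades per object ID (roundness and redness).
g_round = {1: 0.8, 2: 0.6, 3: 0.15, 4: 0.1, 5: 0.0}
g_red = {3: 1.0, 2: 0.5, 1: 0.3, 4: 0.01, 5: 0.0}

top, accesses = naive_top_n(g_round, g_red, 2)
print(top, accesses)
```

Even for N = 2 the naive algorithm touches every object in both sources, which is what Fagin's algorithm avoids.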

Top-N: Example
- Query: Shape = 'round' ∧ Color = 'red'
- $G = \min\{g_{\text{Shape}=\text{'round'}}(x), g_{\text{Color}=\text{'red'}}(x)\}$
[Figure: query plan]
    ⋈(s.id = p.id)
      sorted access
        MMDBMS_1: Plattencover (record covers), shapes
      sorted access / random access
        MMDBMS_2: Plattencover (record covers), colors

Top-N: Fagin's Algorithm
- More general problem:
  - Instead of $A \wedge B$, the query is now $A_1 \wedge A_2 \wedge \dots \wedge A_m$
  - One source per predicate
    - or, more precisely, one way of accessing it by sorted and random access
- Phase 1: sorted access
- Phase 2: random access
- Phase 3: computation and sorting

Top-N: Fagin's Algorithm
- Query: $A_1 \wedge A_2 \wedge \dots \wedge A_m$
- Phase 1: sorted access
  - For each $i$: send $A_i$ to source $i$
  - Advance step by step until the join over all partial results has size N.
[Figure: join ⋈ on id over the sorted lists of MMDBMS_1, MMDBMS_2, ..., MMDBMS_m]

Top-N: Fagin's Algorithm
[Figure: the sorted lists of MMDBMS_1, MMDBMS_2, ..., MMDBMS_m after phase 1, with these regions labeled:]
- Objects from MMDBMS_1 with $g_{A_1}(x)$
- Objects from MMDBMS_2 with $g_{A_2}(x)$
- Objects from MMDBMS_2 and MMDBMS_m with $g_{A_1}(x)$ and $g_{A_m}(x)$ (label as on the slide)
- N objects from all MMDBMS with all $g_{A_i}(x)$, and therefore also with an overall grade

Top-N: Fagin's Algorithm
[Figure: the N objects found in all lists of MMDBMS_1, MMDBMS_2, ..., MMDBMS_m]
- Important: these are not necessarily the Top-N objects!
- The key insight: the Top-N objects are among the objects seen so far. The proof follows later.

Top-N: Fagin's Algorithm
- Phase 2: random access
  - Fetch all unknown $g_{A_i}(x)$.
  - Result: we now know all grades of all seen objects.

Top-N: Fagin's Algorithm
- Phase 3: computation and sorting
  - For each object, compute $g_{A_1 \wedge A_2 \wedge \dots \wedge A_m}(x) = \min\{g_{A_1}(x), g_{A_2}(x), \dots, g_{A_m}(x)\}$
  - Sort all objects by $g_{A_1 \wedge A_2 \wedge \dots \wedge A_m}(x)$
  - Select the N objects with the highest grades.
  - Output these Top-N objects.

Fagin's Algorithm: Example
- Query: Shape = 'round' ∧ Color = 'red' ∧ Style = 'modern'
- N = 2

    MMDBMS_1                    MMDBMS_2                    MMDBMS_3
    ID  Shape     Roundness     ID  Style     Modernity     ID  Color   Redness
    1   oval      0.8           3   modern    1             3   red     1
    2   octagon   0.6           2   rock      0.7           2   orange  0.5
    3   square    0.15          4   baroque   0.2           1   yellow  0.3
    4   triangle  0.1           1   celtic    0.1           4   blue    0.01
    5   line      0             5   ancient   0.01          5   green   0

Fagin's Algorithm: Example
Grades known per object after sorted access, written as (roundness; modernity; redness), where ?? marks a grade that is not yet known:
- 1: (0.8; ??; 0.3)
- 2: (0.6; 0.7; 0.5)
- 3: (0.15; 1; 1)
- 4: (??; 0.2; ??)

    MMDBMS_1                    MMDBMS_2                    MMDBMS_3
    ID  Shape     Roundness     ID  Style     Modernity     ID  Color   Redness
    1   oval      0.8           3   modern    1             3   red     1
    2   octagon   0.6           2   rock      0.7           2   orange  0.5
    3   square    0.15          4   baroque   0.2           1   yellow  0.3
    4   triangle  0.1           1   celtic    0.1           4   blue    0.01
    5   line      0             5   ancient   0.01          5   green   0
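The three phases can be sketched in Python and run on the example data. This is a minimal illustration under assumptions, not Fagin's original formulation: the function name `fagin_top_n`, the dict-per-source representation, and the assumption that every object occurs in every source are hypothetical. The grades follow the example table, using 0.2 for baroque and 0.1 for celtic.

```python
def fagin_top_n(sources, n):
    """sources: one dict {object id: grade} per predicate.
    Returns (top-n list, sorted-access depth, number of random accesses)."""
    # Sorted access: each source delivers its objects by descending grade.
    lists = [sorted(s.items(), key=lambda kv: kv[1], reverse=True) for s in sources]
    seen = [set() for _ in sources]     # objects seen per source
    known = {}                          # object id -> {source index: grade}
    depth = 0
    max_depth = max(len(lst) for lst in lists)

    # Phase 1: advance all cursors in parallel until n objects were seen in every list.
    while depth < max_depth:
        for i, lst in enumerate(lists):
            if depth < len(lst):
                obj, grade = lst[depth]
                seen[i].add(obj)
                known.setdefault(obj, {})[i] = grade
        depth += 1
        if len(set.intersection(*seen)) >= n:
            break

    # Phase 2: random access for every grade still missing for a seen object.
    random_accesses = 0
    for obj, grades in known.items():
        for i, src in enumerate(sources):
            if i not in grades:
                grades[i] = src[obj]    # assumes obj exists in every source
                random_accesses += 1

    # Phase 3: overall grade via min, then sort and cut.
    overall = {obj: min(g.values()) for obj, g in known.items()}
    ranked = sorted(overall.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:n], depth, random_accesses

roundness = {1: 0.8, 2: 0.6, 3: 0.15, 4: 0.1, 5: 0.0}
modernity = {3: 1.0, 2: 0.7, 4: 0.2, 1: 0.1, 5: 0.01}
redness = {3: 1.0, 2: 0.5, 1: 0.3, 4: 0.01, 5: 0.0}

top, depth, random_accesses = fagin_top_n([roundness, modernity, redness], 2)
print(top, depth, random_accesses)
```

On this data phase 1 stops at depth 3, when objects 2 and 3 have been seen in all three lists. Objects 1 and 4 have also been seen and need three random accesses in total, and object 5 is never touched.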

Top-N: Fagin's Algorithm
- Correctness:
  - Fagin's algorithm finds the Top-N objects according to $g_A(x)$.
- Proof:
  - Idea: we show for every unseen object $y$ that it cannot be among the Top-N.
  - Notation
    - $x$: seen objects
    - $y$: unseen objects
  - For every $x$ in the join set after phase 1 and every predicate $A_i$: $g_{A_i}(y) \le g_{A_i}(x)$.
  - By the monotonicity of $\min\{\}$: $g_{A_1 \wedge A_2 \wedge \dots \wedge A_m}(y) \le g_{A_1 \wedge A_2 \wedge \dots \wedge A_m}(x)$.
    - Important: we cannot show this for the other seen objects.
  - There are at least N such objects $x$ (termination criterion of phase 1).
  - Conclusion: there is no $y$ that is better than the best N objects $x$.

Top-N: Fagin's Algorithm
- Cost: $O(n^{(m-1)/m} N^{1/m})$ (proof: see [Fa 96])
  - $n$ = size of the DB; $m$ = number of DBs
  - Example: 10,000 objects, 3 predicates, Top 10: $10{,}000^{2/3} \times 10^{1/3} = 1{,}000$
  - Holds if the $A_i$ are independent.
  - Holds with arbitrarily high probability.
    - That is: for every $\varepsilon > 0$ there exists a $c$ such that the probability that the cost exceeds the stated bound is less than $\varepsilon$.
- For comparison: the naive algorithm is in $O(nm)$
  - In the example: 10,000 × 3 = 30,000
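The arithmetic of the example can be reproduced in a few lines of Python. The helper names `fagin_cost` and `naive_cost` are hypothetical, and the functions evaluate only the expressions inside the O-notation, ignoring constant factors.

```python
def fagin_cost(n, m, top_n):
    """Expression inside the bound: n^((m-1)/m) * N^(1/m)."""
    return n ** ((m - 1) / m) * top_n ** (1 / m)

def naive_cost(n, m):
    """Naive algorithm: every object in every source is read once."""
    return n * m

print(round(fagin_cost(10_000, 3, 10)))  # 10,000 objects, 3 predicates, Top 10
print(naive_cost(10_000, 3))
```

Since $10{,}000^{2/3} = 10^{8/3}$ and $10^{8/3} \cdot 10^{1/3} = 10^3$, the first value is 1,000, compared with 30,000 for the naive algorithm.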

Fagin's Algorithm in Practice
- Problems from [WHTB 99]
  - Monotonicity
    - Prescribe a fixed set of (monotone) measures
    - or allow user-defined implementations?
  - WHERE clause or ORDER BY clause?
    - Depends on the character of the algorithm
  - Join over several sources
    - Object identification
  - A cost model is difficult to define

Top-N Queries: Challenges
- Arbitrary measures
  - Depending on the user or application
- Efficient execution in existing DBMS
  - Exploiting existing data structures and metadata
  - Correctness and completeness
  - Idea: transform Top-N queries into conventional queries [CG 99].

Review
- First-N
  - Syntax and semantics
  - Optimization
- Top-N
  - Motivation
  - Fagin's algorithm


References
- First-N
  - [CK 97] Michael J. Carey, Donald Kossmann: On Saying "Enough Already!" in SQL. SIGMOD Conference 1997: 219-230
  - [CK 98] Michael J. Carey, Donald Kossmann: Reducing the Braking Distance of an SQL Query Engine. VLDB 1998: 158-169
- Top-N
  - [Fa 98] Ronald Fagin: Fuzzy Queries in Multimedia Database Systems. PODS 1998: 1-10
- Further reading
  - [CG 99] Surajit Chaudhuri, Luis Gravano: Evaluating Top-k Selection Queries. VLDB 1999: 397-410
  - [DR 99] Donko Donjerkovic, Raghu Ramakrishnan: Probabilistic Optimization of Top N Queries. VLDB 1999: 411-422
  - [Za 65] Lotfi A. Zadeh: Fuzzy Sets. Information and Control 8(3): 338-353 (1965)
  - [DP 84] D. Dubois, H. Prade: Criteria Aggregation and Ranking of Alternatives in the Framework of Fuzzy Set Theory. In: Fuzzy Sets and Decision Analysis, TIMS Studies in Management Sciences 20 (1984), pp. 209-240
  - [WHTB 99] Edward L. Wimmers, Laura M. Haas, Mary Tork Roth, Christoph Braendli: Using Fagin's Algorithm for Merging Ranked Results in Multimedia Middleware. CoopIS 1999: 267-278