Information Integration: Distributed Query Processing
1 Dec 2005, Felix Naumann, lecture (VL) Informationsintegration, winter semester (WS) 05/06

10,000 feet
  1. Introduction to information integration
  2. Scenarios of information integration
  3. Distribution and autonomy
  4. Heterogeneity
  5. Materialized and virtual integration
  6. Classification of integrated information systems and 5-layer architecture
  7. Mediator/wrapper architecture
  8. Global-as-View and Local-as-View modeling
  9. Global-as-View query processing
  10. SchemaSQL
  11. Distributed query processing
  12. Dynamic programming in distributed databases
  13. Top-N queries
  Topic groups (margin labels on the slide): problem statement, architectures, modeling, optimization

10,000 feet (continued)
  1. Information quality
  2. Duplicate detection
  3. ETL & data lineage
  4. Data fusion: Union & Co.
  5. Containment & Local-as-View query processing
  6. Bucket algorithm
  7. Peer data management systems (PDMS)
  8. Schema mapping
  9. Schema matching
  10. Hidden Web
  11. Semantic Web
  12. Research projects: TSIMMIS, Garlic, Revere, etc.
  13. Data streams
  Topic groups (margin labels on the slide): conflicts, queries, mapping, systems

Overview
  - Query processing at a glance
  - Techniques of distributed query processing
      - Row blocking
      - Multicasts
      - Multithreading
      - Partitioning
  - Join processing
      - Semi-join
      - Reduction
      - Semi-join with filter

Parallelism vs. distribution
  - Parallel DBMS
      - Shared memory
      - Shared disk
      - Shared nothing
      - Focus on transactions and query processing
  - Distributed DBMS
      - Shared nothing
      - Focus on heterogeneity and query processing
      - Our focus!

Architecture of centralized query processing
  Pipeline: query -> parser -> query rewrite -> query optimization -> code generation -> query execution (engine) -> query result
  - Parser: checks syntax and some semantics; produces the query graph.
  - Query rewrite: logical optimization, independent of system and configuration: unnesting, redundant predicates, ...
  - Query optimization: optimization for the system and configuration: indexes, join order, selection of the data source.
  - Code generation: turns the plan (tree) into an executable plan (code).
  - Further components: catalog/metadata (schema, statistics, partitioning, location of the data, ...) and the data itself.

Query processing in distributed systems
  - Queries are declarative.
  - Queries must be transformed into an executable (procedural) form.
  - Goals
      - QEP: a procedural query execution plan
      - Efficiency
          - Fast
          - Low resource consumption (CPU, I/O, RAM, bandwidth)

Step 1: Query transformation
  1. Parsing the query (syntax)
  2. Name resolution
  3. Checking the elements (semantics)
  4. Normalization
      - Conjunctive normal form in the WHERE clause
  5. Algebraic simplification
      - Elimination of redundant subexpressions
  6. Transformation into an operator tree

Step 2: Data localization
  - Global relations are mapped to local relations
      - GaV / LaV
  - Algebraic simplifications

Step 3: Global optimization
  - Determine an execution plan with minimal global cost
      - Determine the execution sites
      - Fix the execution order (sequential, parallel)
      - Evaluate alternative strategies for computing joins (e.g. with semi-joins)
  - But: separating global from local optimization can lead to the selection of suboptimal plans
  - Cost model
      - |instructions| + |disk accesses| + |messages| + |bytes transferred|
      - Cost = (T_CPU * #insts) + (T_I/O * #I/Os) + (T_MSG * #msgs) + (T_TR * #bytes)
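The cost formula above can be sketched in a few lines of Python. The parameter values and the two example plans are hypothetical and only illustrate how the per-message term and the per-byte term trade off against each other; they are not measurements from the lecture.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CostParams:
    t_cpu: float  # time per instruction (T_CPU)
    t_io: float   # time per disk access (T_I/O)
    t_msg: float  # fixed time per message (T_MSG)
    t_tr: float   # time per transferred byte (T_TR)


def plan_cost(p: CostParams, insts: int, ios: int, msgs: int, nbytes: int) -> float:
    """Cost = T_CPU*#insts + T_I/O*#I/Os + T_MSG*#msgs + T_TR*#bytes."""
    return p.t_cpu * insts + p.t_io * ios + p.t_msg * msgs + p.t_tr * nbytes


# Hypothetical parameters (seconds), chosen only for illustration.
params = CostParams(t_cpu=1e-8, t_io=1e-3, t_msg=1e-2, t_tr=1e-6)

# Plan A: few large messages. Plan B: many small messages.
few_large = plan_cost(params, insts=1_000_000, ios=100, msgs=10, nbytes=5_000_000)
many_small = plan_cost(params, insts=1_000_000, ios=100, msgs=2_000, nbytes=200_000)
```

With these assumed parameters the per-message overhead dominates, so the plan with few large messages is cheaper even though it moves more bytes; different parameters can reverse the outcome.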

New goals of query processing
  - In centralized (multi-user) DBMS
      - Throughput = number of tuples processed
      - Minimize resource consumption
  - In distributed DBMS
      - Increase total resource consumption in order to get faster answers
      - Minimize response time
  [Figure: a centralized plan joining R, S, T, and U at Site 0, next to a distributed plan in which Site 1 and Site 2 evaluate joins and send their results to Site 0.]

Dimensions of query processing
  - Choice of the execution site
  - Choice of the evaluation strategy
      - Ship whole
          - Complete relations
          - Few messages
          - Many bytes
      - Fetch rows as needed
          - Bindings
          - Many messages
          - Only relevant bytes
          - Semi-join
      - Fetch columns as needed
          - Semi-join
  [Figure: a plan over R, S, T, and U with the final join at Site 0, which receives the join results sent by Site 1 and Site 2.]

Step 4: Local optimization
  - Depends on the respective system
      - Local catalog data
      - Local parameters
          - T_CPU and T_I/O
          - Etc.

Row Blocking
  - Query processing in distributed DBMS uses send and receive operators
  - Naive: one send and one receive operation per tuple
  - Better: row blocking
      - Tuples are collected and shipped as a block.
      - The block size depends on the datagram size of the network (e.g. 64 kB).
      - Queues buffer irregular pipelines
          - Important in networks; stalls are avoided.
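A minimal sketch of row blocking, assuming a block is measured in rows rather than in bytes (a real system would size blocks to the datagram). The function name `row_blocks` and the sample rows are hypothetical.

```python
from typing import Iterable, Iterator, List, Tuple

Row = Tuple[int, str]


def row_blocks(rows: Iterable[Row], block_size: int) -> Iterator[List[Row]]:
    """Collect tuples and yield them block by block instead of one at a time."""
    block: List[Row] = []
    for row in rows:
        block.append(row)
        if len(block) == block_size:
            yield block
            block = []
    if block:  # ship the last, partially filled block
        yield block


rows = [(i, f"name{i}") for i in range(10)]

naive_messages = len(rows)                      # one send per tuple
blocks = list(row_blocks(rows, block_size=4))   # one send per block
blocked_messages = len(blocks)
```

Here ten tuples need ten messages in the naive scheme but only three with blocks of four rows.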

Row Blocking
  [Figure: Site 0 receives, through a buffer, the blocks sent by Site 1 and Site 2, which evaluate joins over R, S, T, and U. The buffers absorb network irregularities and allow a (more) constant execution of the rest of the plan.]

Multicasts
  [Figure: an unoptimized and an optimized plan for a union of two joins over R, S, and T, distributed over Sites 0, 1, and 2, with R sent from Site 1.]

Multicasts
  - Savings
      - Network traffic
      - CPU (packing and unpacking)
  - Optimization
      - Many alternatives
          - Physical: Which data do I send where? Which operator do I execute where?
          - Logical: outer joins instead of inner joins, late/early projection, etc.
      - Cost model
      - Dynamic decisions
          - Even during execution!
  [Figure: the optimized multicast plan over R, S, and T on Sites 0, 1, and 2.]

Multi-Threading
  - Single-threaded UNION
      - Takes one datagram at a time from one site
      - Round-robin scheme
  - Multi-threaded UNION
      - Parallel processing of the send and receive operations
  - Advantages
      - Network traffic in parallel
      - Send operations in parallel
  - Disadvantages
      - Additional cost for synchronization (shared memory)
      - Additional cost for resource contention
      - The optimizer must decide which operators are executed in how many threads.
  [Figure: a UNION at Site 0 with one receive operator each for R sent from Site 1, Site 2, and Site 3.]
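A multi-threaded UNION can be sketched as one receiver thread per site feeding a shared queue. This is an illustrative sketch only: the sites are simulated as in-memory lists of blocks, the union keeps duplicates (UNION ALL), and all names are hypothetical.

```python
import queue
import threading
from typing import List

Block = List[int]


def receive(site_blocks: List[Block], out: "queue.Queue") -> None:
    """Simulated receive operator: forwards every block of one site."""
    for block in site_blocks:
        out.put(block)
    out.put(None)  # end-of-stream marker for this site


def threaded_union(sites: List[List[Block]]) -> List[int]:
    """Consume blocks from all sites in whatever order they arrive."""
    out: "queue.Queue" = queue.Queue()  # synchronized shared buffer
    threads = [threading.Thread(target=receive, args=(s, out)) for s in sites]
    for t in threads:
        t.start()
    result: List[int] = []
    finished = 0
    while finished < len(sites):
        block = out.get()
        if block is None:
            finished += 1
        else:
            result.extend(block)
    for t in threads:
        t.join()
    return result


site1 = [[1, 2], [3]]
site2 = [[4]]
site3 = [[5, 6]]
union_rows = threaded_union([site1, site2, site3])
```

The queue is where the synchronization cost named on the slide shows up; the arrival order of the rows is not deterministic.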

Multi-Threading
  - Multi-threading is not always advantageous
      - E.g. memory: two sort operators running concurrently compete for main memory
  [Figure: a sort-merge join at Site 0 whose two sort operators each receive their input, R from Site 1 and S from Site 3.]

Partitioning
  - Covered in DBMS fundamentals
  - Vertical
      - Different projections of a relation
      - Every attribute in at least one partition
      - Preserve a key!
      - Also: normalization
  - Horizontal
      - Different (disjoint) selections of a relation
      - Example
          - Partition A: SELECT * FROM R WHERE area = 'Nord'
          - Partition B: SELECT * FROM R WHERE area = 'Süd' (or area <> 'Nord')
      - Every tuple in at least one partition
  - Optimization
      - How to partition?
      - How to distribute the partitions?
      - Depends on the applications and the query workload

Joins over horizontal partitions
  - Let R be horizontally partitioned
      - R = R1 ∪ R2
  - Alternatives
      - R ⋈ S = (R1 ∪ R2) ⋈ S
      - R ⋈ S = (R1 ⋈ S) ∪ (R2 ⋈ S)
  - Complications
      - R partitioned even further
      - S partitioned as well
      - Different costs
      - Who executes what where?
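The two alternatives are equivalent because the join distributes over union. The following sketch checks this on small, made-up relations; `join` is a hypothetical helper that performs an equi-join on the first column.

```python
def join(r, s):
    """Equi-join on the first column of both relations."""
    return {(rid, rval, sval) for (rid, rval) in r for (sid, sval) in s if rid == sid}


# Hypothetical horizontal partitions of R and a relation S.
R1 = {(1, "a"), (2, "b")}
R2 = {(3, "c")}
S = {(1, "x"), (3, "y"), (4, "z")}

union_then_join = join(R1 | R2, S)           # (R1 ∪ R2) ⋈ S
join_then_union = join(R1, S) | join(R2, S)  # (R1 ⋈ S) ∪ (R2 ⋈ S)
```

Both expressions give the same result; which one is cheaper depends on where the partitions live and how much has to be shipped.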

Selection over horizontal partitioning
  Example (figure) from Mitschang, lecture „Verteilte DBMS“

Join over horizontal partitioning
  Example (figure) from Mitschang, lecture „Verteilte DBMS“

Joins over horizontal partitions
  Question: How much data is transferred in each case?
  [Figure: two plans over the partitions R1, R2, R3 and the relation S on Sites 0, 1, and 2, before and after optimization. Edge labels in the figure: |R|, |R3|, |S|, |R3 ⋈ S|, |(R1 ∪ R2) ⋈ S|.]

Joins over horizontal partitions
  - Optimization
      - Location of the partitions
      - Cost comparisons
  - Empty partial results (Ri ⋈ Sj = ∅)
      - Predict and avoid them
  - Frequent case: two relations are partitioned by the same predicate
      - "Departments" partitioned by location (Nord, Süd)
      - "Employees" partitioned by department (and therefore also by location)
      - Then: R ⋈ S = (R1 ⋈ S1) ∪ (R1 ⋈ S2) ∪ (R2 ⋈ S1) ∪ (R2 ⋈ S2) = (R1 ⋈ S1) ∪ (R2 ⋈ S2)
      - I.e. execute a join (Ri ⋈ Sj) only if i = j
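A small sketch of the co-partitioned case, with made-up departments and employees: when both relations are partitioned by location, all cross-partition joins are empty and only the matching pairs need to be evaluated. The data and helper names are hypothetical.

```python
# Partitions keyed by location; both relations use the same partitioning predicate.
dept = {"Nord": {(1, "Sales")}, "Süd": {(2, "Research")}}  # (ID, name)
emp = {"Nord": {("Ann", 1)}, "Süd": {("Bob", 2)}}          # (name, dept ID)


def join(depts, emps):
    """Equi-join of employees with their department."""
    return {(ename, dname) for (did, dname) in depts for (ename, d) in emps if d == did}


# All four partition pairs versus only the pairs with i = j.
all_pairs = set().union(*(join(dept[i], emp[j]) for i in dept for j in emp))
matching_pairs = set().union(*(join(dept[i], emp[i]) for i in dept))

joins_all = len(dept) * len(emp)
joins_matching = len(dept)
```

The result is identical, but only two instead of four partition joins are executed.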

Semi-Join
  Result (without attributes of S) required at Site 1
  [Figure: R is stored at Site 1, S at Site 0. Shipping S as a whole transfers |S| x tuple-length(S); shipping only the ID column of S transfers |S| x ID-length(S).]

Semi-Join
  Result (with attributes of S) required at Site 1
  [Figure: shipping S as a whole transfers |S| x tuple-length(S). With a semi-join, Site 1 sends the ID column of R (|R| x ID-length(R)), Site 0 returns the matching S tuples (|R ⋈ S| x tuple-length(S)), and Site 1 completes the result with a merge join.]
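The transfer volumes labelled in the figure can be compared numerically. The sketch below simply evaluates the two expressions; the cardinalities and lengths are invented for illustration, and the function names are hypothetical.

```python
def ship_whole_bytes(card_s: int, tuple_len_s: int) -> int:
    """Ship S completely: |S| x tuple-length(S)."""
    return card_s * tuple_len_s


def semi_join_bytes(card_r: int, id_len_r: int, card_rs: int, tuple_len_s: int) -> int:
    """Send the ID column of R, get matching S tuples back:
    |R| x ID-length(R) + |R ⋈ S| x tuple-length(S)."""
    return card_r * id_len_r + card_rs * tuple_len_s


# Hypothetical statistics: small R, large S, selective join.
whole = ship_whole_bytes(card_s=100_000, tuple_len_s=100)
semi = semi_join_bytes(card_r=1_000, id_len_r=4, card_rs=1_000, tuple_len_s=100)
```

With these assumed numbers the semi-join strategy transfers far less; if the join is not selective or R is large, shipping S as a whole can win.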

Semi-Join
  - Formally
      - R(A), S(B), where A and B are the attribute sets of R and S
      - R ⋉_F S := π_A(R ⋈_F S) = π_A(R) ⋈_F π_{A∩B}(S) = R ⋈_F π_{A∩B}(S)
      - Usually = R ⋈_F π_F(S)
  - Not symmetric!
  - Literature
      - [BC 81]
      - Any database textbook

Semi-Join
  Result (attributes from S and R) required at Site 1
  [Figure: two semi-join strategies for R and S stored at the other two sites (Site 0 and Site 2). Edge labels in the figure: |R| x tl(R), |S| x tl(S), |S| x ID-length(S), |R ⋈ S| x ID-length(R), |R ⋈ S| x tl(R), |R ⋈ S| x tl(S), where tl denotes the tuple length.]
  Question: Which 2 other strategies are possible?

Optimization with semi-joins
  - Transformation rules for joins: R ⋈_F S =
      - (R ⋉_F S) ⋈_F S
          - Reduce R, then join with S
      - R ⋈_F (S ⋉_F R)
          - Reduce S, then join with R
      - (R ⋉_F S) ⋈_F (S ⋉_F R)
          - Reduce R and S, then join
  - Question: Which two variants were shown on the previous slide?
  - Problem: When should which variant be used?
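The three rewrites can be checked on a toy instance. In this sketch R has the schema (x, a) and S the schema (a, y), joined on a; the relations and helper names are hypothetical.

```python
def join(r, s):
    """R ⋈ S on R.a = S.a."""
    return {(x, a, y) for (x, a) in r for (b, y) in s if a == b}


def reduce_r(r, s):
    """R ⋉ S: tuples of R that have a join partner in S."""
    keys = {b for (b, _) in s}
    return {(x, a) for (x, a) in r if a in keys}


def reduce_s(s, r):
    """S ⋉ R: tuples of S that have a join partner in R."""
    keys = {a for (_, a) in r}
    return {(b, y) for (b, y) in s if b in keys}


R = {(1, 10), (2, 20), (3, 30)}
S = {(20, "p"), (30, "q"), (40, "r")}

plain = join(R, S)
variant_1 = join(reduce_r(R, S), S)               # (R ⋉ S) ⋈ S
variant_2 = join(R, reduce_s(S, R))               # R ⋈ (S ⋉ R)
variant_3 = join(reduce_r(R, S), reduce_s(S, R))  # (R ⋉ S) ⋈ (S ⋉ R)
```

All variants return the same join result; they differ only in what has to be shipped between the sites. The example also shows that the semi-join is not symmetric: R ⋉ S and S ⋉ R are different relations.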

An exercise from last year: the data
  - Two databases in DB2
      - FilmDB1
          - Schema: Movie1
          - Table: Movie1.Filme1(Titel Character(100), Jahr Integer)
      - FilmDB2
          - Schema: Movie2
          - Table: Movie2.Filme2(Titel Character(100), Regie Character(50))
  - Access permissions
      - Read access to the tables
      - CREATE TABLE on the databases
      - INSERT on new tables

An exercise from last year: the task
  - Title and director of all films released after 1980.
      SELECT F1.Titel, F2.Regie
      FROM Movie1.Filme1 F1, Movie2.Filme2 F2
      WHERE F1.Titel = F2.Titel AND F1.Jahr > 1980
  - As a check: the result cardinality is 1121.
  - Problem: join across different databases
      - Passing bindings
      - Semi-joins
      - Others?

An exercise from last year: evaluation
  - Naive solution (given): cost = 8,296,344 bytes
  - Even more naive: large table as the outer relation
  [Chart: bytes transferred per group.]

An exercise from last year: the collected tricks
  - Prepared statement
  - Rtrim()
  - Filme2 is considerably smaller
  - Targeted projections
      - Jahr never needs to be transferred
      - Each title at most once
  - Hash functions
      - As a UDF
      - In an SPJ query
      - Tested for collisions
      - Optimized for length
  - Rollback instead of DROP TABLE

Chipmunks (student group)
  - CREATE TABLE firma06 (h INTEGER NOT NULL PRIMARY KEY)
  - Fetch all "hash" values for titel from movie2.filme2:
      SELECT DISTINCT
          ASCII(titel)*762534
        + LENGTH(RTRIM(titel))*464857
        + (ASCII(SUBSTR(RTRIM(titel), LENGTH(RTRIM(titel)))))*659386
        + (ASCII(SUBSTR(RTRIM(titel), (LENGTH(RTRIM(titel))/2)+1, 1)))*121555
        + (ASCII(SUBSTR(RTRIM(titel), INTEGER(ROUND(LENGTH(RTRIM(titel))*0.5573769496370024-0.5, 0))+1, 1)))*481517
        + (ASCII(SUBSTR(RTRIM(titel), INTEGER(ROUND(LENGTH(RTRIM(titel))*0.5455918517007877-0.5, 0))+1, 1)))*418045
        + (ASCII(SUBSTR(RTRIM(titel), INTEGER(ROUND(LENGTH(RTRIM(titel))*0.9390929989171091-0.5, 0))+1, 1)))*479165
        + (ASCII(SUBSTR(RTRIM(titel), INTEGER(ROUND(LENGTH(RTRIM(titel))*0.3093312770038853-0.5, 0))+1, 1)))*151240
        + (ASCII(SUBSTR(RTRIM(titel), INTEGER(ROUND(LENGTH(RTRIM(titel))*0.13964794618711818-0.5, 0))+1, 1)))*668683
        + (ASCII(SUBSTR(RTRIM(titel), INTEGER(ROUND(LENGTH(RTRIM(titel))*0.7438338851875961-0.5, 0))+1, 1)))*208042
        + (ASCII(SUBSTR(RTRIM(titel), INTEGER(ROUND(LENGTH(RTRIM(titel))*0.24890649793926223-0.5, 0))+1, 1)))*386682
        + (ASCII(SUBSTR(RTRIM(titel), INTEGER(ROUND(LENGTH(RTRIM(titel))*0.7727779326492414-0.5, 0))+1, 1)))*49285
      FROM movie2.filme2

An exercise from last year: evaluation
  [Chart: results per group. Group labels on the axis: Projekt Chaos ii Die Drei 0815 Hedge Knights gruppe 1 camarilla blub 10099 bib IFA Ghost Dogs fmr idefix Chipmunks.]

Optimization with semi-joins
  - A more complex example: R ⋈_F S ⋈_G T
      = (R ⋉_F S) ⋈_F (S ⋉_G T) ⋈_G T
      = (R ⋉_F (S ⋉_G T)) ⋈_F (S ⋉_G T) ⋈_G T
  - In general: for each relation, find the best semi-join program, the full reducer.
      - A fully reduced relation contains no tuples that are not needed to answer the query.
  - For each relation in a query there are exponentially many semi-join programs.
      - Cyclic queries: in general no full reducer exists.
      - Tree queries: finding the full reducer is NP-hard.
      - Chain queries: finding the full reducer is polynomial.

Optimization with semi-joins
  - Chain queries
      SELECT EMP.name, DEPT.name
      FROM EMP, DEPT, PROJ
      WHERE EMP.d_ID = DEPT.ID AND EMP.p_ID = PROJ.ID
  - Employees who work on a project, together with their departments.
  - Query graph: PROJECTS - EMPLOYEES - DEPARTMENTS
  - Determine the full reducer for each relation in 2 phases
      1. Forward
      2. Backward

Fully Reduce
  - Determine the full reducer for each relation in 2 phases: 1. forward, 2. backward
  - In general, for chain queries: R1 ⋈_A R2 ⋈_B ... ⋈_Y R(n-1) ⋈_Z Rn
  - Forward
      - R2' = R2 ⋉ R1
      - R3' = R3 ⋉ R2' = R3 ⋉ (R2 ⋉ R1)
      - ...
      - Rn' = Rn ⋉ R(n-1)' = ...            (full reducer for Rn)
  - Backward
      - R(n-1)'' = R(n-1)' ⋉ Rn'           (full reducer for R(n-1))
      - R(n-2)'' = R(n-2)' ⋉ R(n-1)''
      - ...
      - R1'' = R1 ⋉ R2''                   (full reducer for R1)

Fully Reduce
  - Example for chain queries: R1 ⋈_A R2 ⋈_B R3 ⋈_C R4
  - Forward
      - R2' = R2 ⋉ R1
      - R3' = R3 ⋉ R2' [= R3 ⋉ (R2 ⋉ R1)]
      - R4' = R4 ⋉ R3' [= R4 ⋉ (R3 ⋉ (R2 ⋉ R1))]
  - Backward
      - R3'' = R3' ⋉ R4'
      - R2'' = R2' ⋉ R3''
      - R1'' = R1 ⋉ R2''
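The forward and backward phases can be sketched for binary relations in a chain, where each relation joins its successor on (right column = left column). The function name and the sample tuples are hypothetical and are not the data used in the lecture's example.

```python
def full_reduce_chain(relations):
    """Apply the forward and backward semi-join phases to a chain R1, ..., Rn.

    Each relation is a set of (left, right) pairs; Ri joins Ri+1 on
    Ri.right == Ri+1.left.
    """
    rels = [set(r) for r in relations]
    # Forward: Ri' = Ri ⋉ R(i-1)'
    for i in range(1, len(rels)):
        keys = {right for (_, right) in rels[i - 1]}
        rels[i] = {t for t in rels[i] if t[0] in keys}
    # Backward: Ri'' = Ri' ⋉ R(i+1)''
    for i in range(len(rels) - 2, -1, -1):
        keys = {left for (left, _) in rels[i + 1]}
        rels[i] = {t for t in rels[i] if t[1] in keys}
    return rels


R1 = {(1, 5), (2, 6), (3, 9)}   # (X, A)
R2 = {(5, 3), (6, 7), (0, 7)}   # (A, B)
R3 = {(3, 1), (7, 8), (4, 2)}   # (B, C)
R4 = {(1, 0), (2, 2)}           # (C, Z)

reduced = full_reduce_chain([R1, R2, R3, R4])
```

After both phases every remaining tuple takes part in the final join result; for this data exactly one tuple per relation survives.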

Fully Reduce – Example
  [Figure: R1, R2, R3, and R4 are stored at Sites 0, 1, 2, and 3. Semi-joins on A, B, and C run forward from site to site and then backward, which yields the full reducer.]
  - Forward: R2' = R2 ⋉ R1, R3' = R3 ⋉ R2', R4' = R4 ⋉ R3'
  - Backward: R3'' = R3' ⋉ R4', R2'' = R2' ⋉ R3'', R1'' = R1 ⋉ R2''

Fully Reduce – Example (animated over several slides)
  Query: R1 ⋈_A R2 ⋈_B R3 ⋈_C R4, with R1(X, A) at Site 0, R2(A, B) at Site 1, R3(B, C) at Site 2, and R4(C, Z) at Site 3.
  [Figure: tuple tables of R1 to R4; eliminated tuples are marked step by step.]
  Value lists exchanged between the sites, in the order in which the slides introduce them:
  - Forward: (1, 5, 6, 7), then (3, 5, 7), then (1, 2, 3)
  - Backward: (2, 3), then (5, 7), then (5, 7)

Fully Reduce
  - Given a tree query Q
      - (R4 ⋈_C (R3 ⋈_D R5)) ⋈_B (R1 ⋈_A R2)
  - The root can be chosen arbitrarily
  - The tree representation is purely graphical!
  [Figure: tree drawings of the query over R1 to R5 with different roots.]

Fully Reduce
  - Phase 1
      - Bottom-up
      - Goal: full reducer for the root
      - Introduce semi-joins from nodes to their parents
  - Phase 2
      - Top-down
      - Goal: full reducer for all other nodes
      - Introduce semi-joins from nodes to their children
  [Figure: query tree over R1 to R5.]

Fully Reduce – Phase 1
  [Figure: the query tree rooted at R1, in which the joins are replaced by semi-joins from the bottom up.]
  - After reducing R3 by R4 and R5, the satisfied conditions are:
      - R3.C = R4.C
      - R3.D = R5.D
  - After reducing R1 by R2 and the reduced R3, the satisfied conditions are:
      - R3.C = R4.C
      - R3.D = R5.D
      - R1.A = R2.A
      - R1.B = R3.B
  - The result in R1 satisfies all conditions and is therefore fully reduced.
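Both phases can be sketched for the tree query above, rooted at R1 with children R2 (on A) and R3 (on B), where R3 has the children R4 (on C) and R5 (on D). The tuples are made up for illustration, and the helper names are hypothetical.

```python
def semijoin(r, s, attr):
    """r ⋉ s on a single shared attribute; tuples are dicts."""
    keys = {t[attr] for t in s}
    return [t for t in r if t[attr] in keys]


# Query tree: parent -> list of (child, join attribute).
CHILDREN = {"R1": [("R2", "A"), ("R3", "B")], "R3": [("R4", "C"), ("R5", "D")]}


def bottom_up(db, node):
    """Phase 1: reduce every node by its (already reduced) children."""
    for child, attr in CHILDREN.get(node, []):
        bottom_up(db, child)
        db[node] = semijoin(db[node], db[child], attr)


def top_down(db, node):
    """Phase 2: reduce every child by its (fully reduced) parent."""
    for child, attr in CHILDREN.get(node, []):
        db[child] = semijoin(db[child], db[node], attr)
        top_down(db, child)


db = {
    "R1": [{"A": 1, "B": 1}, {"A": 2, "B": 2}],
    "R2": [{"A": 1}, {"A": 2}, {"A": 3}],
    "R3": [{"B": 1, "C": 1, "D": 1}, {"B": 2, "C": 9, "D": 2}, {"B": 3, "C": 1, "D": 1}],
    "R4": [{"C": 1}],
    "R5": [{"D": 1}, {"D": 2}],
}

bottom_up(db, "R1")
root_after_phase_1 = list(db["R1"])
top_down(db, "R1")
```

After phase 1 only the root is guaranteed to be fully reduced; phase 2 pushes that reduction back down to all other nodes.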

Optimization with semi-joins
  - Cyclic queries
      SELECT EMP.name, DEPT.name
      FROM EMP, DEPT, PROJ
      WHERE EMP.d_ID = DEPT.ID AND EMP.p_ID = PROJ.ID AND PROJ.d_ID = DEPT.ID
  - Employees who work on projects of their own department
  - Query graph: PROJECTS, EMPLOYEES, and DEPARTMENTS form a cycle
  - Example data
      DEPT: DName  ID        PROJ: d_ID  ID  PName        EMP: EName  d_ID  p_ID
            a      1               1     6   Clio              x      1     7
            b      2               2     7   HumMer            y      2     6
  - Problem: a semi-join always considers only one edge of the query graph and does not look ahead. No semi-join ever produces the empty set, even though the query result is empty.
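The effect can be reproduced with the example data of this slide. The sketch evaluates the full query and all six pairwise semi-joins; the tuple layouts follow the column order of the tables, and the helper name `semi` is hypothetical.

```python
DEPT = {("a", 1), ("b", 2)}                 # (DName, ID)
PROJ = {(1, 6, "Clio"), (2, 7, "HumMer")}   # (d_ID, ID, PName)
EMP = {("x", 1, 7), ("y", 2, 6)}            # (EName, d_ID, p_ID)

# Full cyclic query: all three join conditions at once.
result = {
    (ename, dname)
    for (ename, e_dept, e_proj) in EMP
    for (dname, dept_id) in DEPT
    for (p_dept, proj_id, _pname) in PROJ
    if e_dept == dept_id and e_proj == proj_id and p_dept == dept_id
}


def semi(r, r_col, s, s_col):
    """r ⋉ s on r[r_col] = s[s_col]."""
    keys = {t[s_col] for t in s}
    return {t for t in r if t[r_col] in keys}


# Each semi-join looks at a single edge of the query graph.
reductions = [
    (EMP, semi(EMP, 1, DEPT, 1)),   # EMP ⋉ DEPT
    (EMP, semi(EMP, 2, PROJ, 1)),   # EMP ⋉ PROJ
    (PROJ, semi(PROJ, 0, DEPT, 1)), # PROJ ⋉ DEPT
    (PROJ, semi(PROJ, 1, EMP, 2)),  # PROJ ⋉ EMP
    (DEPT, semi(DEPT, 1, EMP, 1)),  # DEPT ⋉ EMP
    (DEPT, semi(DEPT, 1, PROJ, 0)), # DEPT ⋉ PROJ
]
nothing_removed = all(before == after for before, after in reductions)
```

The query result is empty, yet no semi-join removes a single tuple, so semi-joins alone cannot fully reduce this cyclic query.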

Optimization with semi-joins
  - Improvement through hash filters
      - Idea: instead of π_A(R), send only a "signature" (Bloom filter).
      - Less network traffic.

Join with hash filter (Bloom filter)
  [Figure: a join on C between one input with 15 values and one with 12 values. A 6-bit filter is shipped instead of the join values; 4 tuples come back, including 2 false drops.]
  Source: lecture slides, Alfons Kemper, Uni Passau
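A minimal sketch of a semi-join with a Bloom filter. The filter size, the number of hash functions, the use of SHA-256, and the sample data are all assumptions made for illustration; the figure above uses its own, much smaller filter.

```python
import hashlib
from typing import Iterable, List


def bit_positions(value, m: int, k: int) -> List[int]:
    """Derive k bit positions in [0, m) from one SHA-256 digest (k <= 8)."""
    digest = hashlib.sha256(str(value).encode()).digest()
    return [int.from_bytes(digest[4 * i:4 * i + 4], "big") % m for i in range(k)]


def build_filter(values: Iterable, m: int, k: int) -> List[bool]:
    """Signature of a value set: the bit vector that is shipped."""
    bits = [False] * m
    for v in values:
        for p in bit_positions(v, m, k):
            bits[p] = True
    return bits


def might_contain(bits: List[bool], value, k: int) -> bool:
    """False means definitely absent; True may be a false drop."""
    return all(bits[p] for p in bit_positions(value, len(bits), k))


r_keys = {3, 8, 15}                              # join values of R
S = {(3, "a"), (5, "b"), (8, "c"), (21, "d")}    # stored at the other site

bits = build_filter(r_keys, m=64, k=2)           # shipped instead of the keys
candidates = {t for t in S if might_contain(bits, t[0], k=2)}  # shipped back
exact = {t for t in candidates if t[0] in r_keys}              # false drops removed
```

The filter never loses a matching tuple; any false drops among the candidates are eliminated by the exact join at the receiving site, at the price of some extra bytes on the way back.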

Review
  - Techniques of distributed query processing
      - Row blocking
      - Multicasts
      - Multithreading
      - Partitioning
  - Semi-join
      - Basics
      - Reduction
      - With filter
  [Figure: semi-join plan between Site 0 (S) and Site 1 (R) in which the ID column is sent.]

Literature
  - Overview
      - Almost any German DBMS textbook
      - English: [GMUW 00] Hector Garcia-Molina, Jeffrey D. Ullman, Jennifer Widom: Database System Implementation. Prentice-Hall 2000
  - Special topics
      - [Ko 00] Donald Kossmann: The State of the Art in Distributed Query Processing. ACM Computing Surveys 32(4): 422-469 (link on the web)
      - [Graefe 93] Goetz Graefe: Query Evaluation Techniques for Large Databases. ACM Computing Surveys 25(2): 73-170 (1993)
  - Semi-join
      - [BC 81] Philip A. Bernstein, Dah-Ming W. Chiu: Using Semi-Joins to Solve Relational Queries. Journal of the ACM 28(1): 25-40 (1981)