Parallel C3M
Aylin Tokuç, Erkan Okuyan, Özlem Gür

Outline
• Basics of Parallel Computing
• Sequential C3M
• Parallel C3M

Parallel Computation
Decomposition: The process of dividing a computation into smaller parts.
Task: A programmer-defined unit of computation into which the main computation is subdivided by means of decomposition.

Parallel Computation: Primary Considerations
• Load Balancing
• Minimizing Communication
• Task Dependency Optimization

Parallel Computation: Load Balancing

Parallel Computation: Minimizing Communication

Parallel Computation: Task Dependency Optimization

C3M Algorithm
1 - Determine the cluster seeds of the database.
2 - If d_i is not a cluster seed, find the cluster seed (if any) that maximally covers d_i.
3 - If there remain unclustered documents, group them into a ragbag cluster.

C3M Formulas
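
The formulas on this slide did not survive extraction. As a rough guide, the cover coefficient definitions usually given for C3M (Can and Ozkarahan), for a document-by-term matrix D with entries d_ik, are: α_i = 1 / Σ_k d_ik, β_k = 1 / Σ_i d_ik, c_ij = α_i Σ_k d_ik β_k d_jk, δ_i = c_ii, and nc = Σ_i δ_i. The Python sketch below computes these for a small made-up binary matrix. The function name `cover_coefficients` and the matrix are illustrative assumptions, not taken from the slides, and the sketch assumes no empty rows or columns.

```python
from fractions import Fraction

def cover_coefficients(D):
    """Return (alpha, beta, C) for a document-by-term matrix D (list of rows).

    Assumes every row and every column has at least one non-zero entry.
    """
    m, n = len(D), len(D[0])
    # alpha_i: reciprocal of the i-th row sum
    alpha = [Fraction(1, sum(row)) for row in D]
    # beta_k: reciprocal of the k-th column sum
    beta = [Fraction(1, sum(D[i][k] for i in range(m))) for k in range(n)]
    # c_ij: extent to which document i is covered by document j
    C = [[alpha[i] * sum(D[i][k] * beta[k] * D[j][k] for k in range(n))
          for j in range(m)]
         for i in range(m)]
    return alpha, beta, C

# Made-up 4-document, 3-term binary matrix (not the sample matrices of the slides).
D = [[1, 1, 0],
     [1, 0, 0],
     [0, 1, 1],
     [0, 0, 1]]

alpha, beta, C = cover_coefficients(D)
delta = [C[i][i] for i in range(len(D))]  # decoupling coefficients (diagonal of C)
n_c = sum(delta)                          # estimated number of clusters
```

Each row of C sums to 1 by construction, which is a handy sanity check when implementing the formulas.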

C3M – Sample Matrices

Parallel C3M - Distribution
Distribute rows among processors
• Load balancing by cyclic block distribution
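
A minimal sketch of how rows might be assigned under this scheme, assuming "cyclic block distribution" means the usual block-cyclic layout: rows are cut into fixed-size blocks and the blocks are dealt to processors in round-robin order. The function name `block_cyclic_rows` and the parameter `block_size` are hypothetical; the slides do not state the block size used.

```python
def block_cyclic_rows(num_rows, num_procs, block_size):
    """Return, for each processor, the list of row indices it owns."""
    owned = [[] for _ in range(num_procs)]
    for row in range(num_rows):
        # Block number of this row, dealt out round-robin over the processors.
        owner = (row // block_size) % num_procs
        owned[owner].append(row)
    return owned

# 10 document rows, 3 processors, blocks of 2 rows each.
parts = block_cyclic_rows(num_rows=10, num_procs=3, block_size=2)
```

Here processor 0 receives rows 0, 1, 6, 7, processor 1 receives rows 2, 3, 8, 9, and processor 2 receives rows 4, 5. Smaller blocks spread long and short documents more evenly, at the cost of less contiguous data per processor.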

Local Calculations
All processors calculate α, partial β and Pi
Current method for the weighted matrix: too costly
It needs column vectors, but the matrix is partitioned row-wise

Seed Powers Pi
• Seed power Pi should be small for a document whose terms appear in too many documents or too few documents.
• Seed power Pi should be bigger for a document whose terms appear in a moderate number of documents.
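
The behaviour described above can be illustrated with the seed power commonly used for a binary matrix, P_i = δ_i (1 − δ_i) Σ_k d_ik, where δ_i = α_i Σ_k d_ik² β_k. This is an assumption about the formula meant here, since the deck's own formulas are not preserved; the function name `seed_powers` and the matrix are made up for the example.

```python
from fractions import Fraction

def seed_powers(D):
    """Seed power of each document row of D, assuming the binary-matrix form
    P_i = delta_i * (1 - delta_i) * (row sum of document i)."""
    m, n = len(D), len(D[0])
    col_sums = [sum(D[i][k] for i in range(m)) for k in range(n)]
    powers = []
    for row in D:
        alpha = Fraction(1, sum(row))
        # delta_i = alpha_i * sum_k d_ik^2 * beta_k, with beta_k = 1 / col_sums[k]
        delta = alpha * sum(Fraction(d * d, col_sums[k]) for k, d in enumerate(row))
        psi = 1 - delta
        powers.append(delta * psi * sum(row))
    return powers

# Made-up matrix: document 3 uses a term that appears in no other document.
D = [[1, 1, 0, 0],
     [1, 1, 0, 0],
     [1, 0, 1, 0],
     [0, 0, 0, 1]]
P = seed_powers(D)
```

Document 3 has δ = 1, so its seed power is 0: a document whose terms appear nowhere else cannot attract other documents. The product δ_i (1 − δ_i) peaks at δ_i = 1/2, which is why a moderate term spread gives the largest seed power.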

Minimize Communication: Proposed Heuristic
All processors calculate α, partial β and β’
# of non-zeros

Effectiveness of Heuristic
• A MATLAB script was written to measure the effectiveness of the proposed heuristic.
• Correlation coefficient = 0.95

Communication between Processors
• Partial β and β’ vectors are exchanged between processors to calculate the final β and β’ vectors.
• Then, all processors calculate cii = δi
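
The exchange can be sketched as follows, assuming a "partial β" is the vector of column sums over a processor's own rows, so that summing the partial vectors and taking reciprocals gives the final β. The processors are simulated with plain lists; `all_reduce_sum` is a hypothetical stand-in for whatever collective operation a real implementation would use. β’ is left out because its definition is not given in the slides.

```python
from fractions import Fraction

def partial_column_sums(local_rows, num_terms):
    """Column sums over the rows held by one processor."""
    sums = [0] * num_terms
    for row in local_rows:
        for k, value in enumerate(row):
            sums[k] += value
    return sums

def all_reduce_sum(partials):
    """Stand-in for an all-reduce: the element-wise sum every processor would receive."""
    return [sum(column) for column in zip(*partials)]

def local_deltas(local_rows, beta):
    """delta_i = c_ii = alpha_i * sum_k d_ik^2 * beta_k for each local row."""
    deltas = []
    for row in local_rows:
        alpha = Fraction(1, sum(row))
        deltas.append(alpha * sum(d * d * b for d, b in zip(row, beta)))
    return deltas

# Two simulated processors, each holding two rows of a made-up 4 x 3 binary matrix.
local = [
    [[1, 1, 0], [1, 0, 0]],
    [[0, 1, 1], [0, 0, 1]],
]
partials = [partial_column_sums(rows, 3) for rows in local]
beta = [Fraction(1, s) for s in all_reduce_sum(partials)]
delta = [local_deltas(rows, beta) for rows in local]
```

Only vectors whose length is the number of terms cross processor boundaries; once β is known everywhere, each δ_i depends on row i alone and needs no further communication.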

# of Clusters
• Processors exchange local δ
• All processors calculate nc

Cluster-head Selection
• Calculate the seed powers of local documents.
• Exchange the nc largest local seed powers.
• Determine the nc largest seed powers among all Pi; the corresponding documents are the cluster heads.
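
The selection steps above can be sketched as a two-level top-nc selection: each processor keeps only its nc best candidates, and the merged candidate lists are reduced to the global nc best. The names `local_candidates` and `select_seeds`, and the seed power values, are made up for illustration; ties between equal seed powers are not treated specially here.

```python
import heapq

def local_candidates(seed_powers, n_c):
    """seed_powers: list of (doc_id, power) pairs for one processor's documents.
    Returns the n_c pairs with the largest power."""
    return heapq.nlargest(n_c, seed_powers, key=lambda item: item[1])

def select_seeds(all_candidates, n_c):
    """Merge the candidate lists of all processors and return the ids of the
    n_c documents with the largest seed power, in ascending id order."""
    merged = [item for candidates in all_candidates for item in candidates]
    best = heapq.nlargest(n_c, merged, key=lambda item: item[1])
    return sorted(doc_id for doc_id, _ in best)

# Made-up seed powers on two simulated processors, with nc = 2.
proc0 = [(0, 0.40), (2, 0.10), (4, 0.35)]
proc1 = [(1, 0.45), (3, 0.05), (5, 0.20)]
n_c = 2
exchanged = [local_candidates(proc0, n_c), local_candidates(proc1, n_c)]
seeds = select_seeds(exchanged, n_c)
```

Sending only nc candidates per processor is sufficient, because a document outside its own processor's top nc cannot be in the global top nc.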

Clustering Non-seed Docs
• Exchange seed documents
• Cluster non-seed documents (as in sequential C3M) in each processor.
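
A minimal sketch of the local assignment step, assuming each processor already holds the final β vector and the term rows of all seed documents. A non-seed document d_i goes to the seed d_j with the largest cover coefficient c_ij = α_i Σ_k d_ik β_k d_jk; a document covered by no seed is marked for the ragbag cluster. The function name `assign_to_seeds` and the data are hypothetical, and tie-breaking is simplified to "lowest seed id wins".

```python
from fractions import Fraction

def assign_to_seeds(local_docs, seeds, beta):
    """local_docs, seeds: dicts mapping doc_id -> term row.
    Returns a dict mapping doc_id -> id of the best covering seed, or None (ragbag)."""
    assignment = {}
    for doc_id, row in local_docs.items():
        alpha = Fraction(1, sum(row))
        best_seed, best_cover = None, 0
        for seed_id, seed_row in sorted(seeds.items()):
            # c_ij: how much of this document is covered by the seed
            cover = alpha * sum(d * b * s for d, b, s in zip(row, beta, seed_row))
            if cover > best_cover:
                best_seed, best_cover = seed_id, cover
        assignment[doc_id] = best_seed
    return assignment

# Made-up data: documents 0 and 1 are seeds, documents 2 and 3 are local non-seeds.
beta = [Fraction(1, 2), Fraction(1, 2), Fraction(1, 2), Fraction(1)]
seeds = {0: [1, 1, 0, 0], 1: [0, 0, 1, 0]}
local_docs = {2: [1, 1, 1, 0], 3: [0, 0, 0, 1]}
assignment = assign_to_seeds(local_docs, seeds, beta)
```

Document 2 shares two terms with seed 0 and one with seed 1, so it joins seed 0's cluster; document 3 shares no term with any seed and falls into the ragbag cluster.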

Future Work
• Term Based Clustering
• Overlapping Clusters

C3M Summary
• Load balancing with cyclic block distribution
• Communication minimization by a new heuristic
• Task dependency minimized with block distribution and the heuristic

References
• Concepts and the effectiveness of the cover coefficient-based clustering methodology, F. Can, E. A. Ozkarahan
• Parallelizing the Buckshot Algorithm for Efficient Document Clustering, Eric C. Jensen, Steven M. Beitzel, Angelo J. Pilotto, Nazli Goharian, Ophir Frieder
• Clustering and Classification of Large Document Bases in a Parallel Environment, Anthony S. Ruocco, Ophir Frieder
• Efficient Clustering of Very Large Document Collections, I. S. Dhillon, J. Fan, Y. Guan

Questions?

The End
Thank you for your patience