Special Topics in Computer Science
Advanced Topics in Information Retrieval
Lecture 7 (book chapter 9): Parallel and Distributed IR
Alexander Gelbukh
www.Gelbukh.com

Previous Chapter: Conclusions
- How to accelerate search? Same results as sequential
- Ideas:
  - Quick-and-dirty rejection of bad objects, 100% recall
  - Fast data structure for search (based on clustering)
  - Careful check of all found candidates
- Solution: mapping into fewer-D feature space
- Condition: lower-bounding of the distance
- Assumption: skewed spectrum distribution
  - Few coefficients concentrate energy, rest are less important

Previous Chapter: Research topics
- Object detection (pattern and image recognition)
- Automatic feature selection
- Spatial indexing data structures (more than 1D)
- New types of data
  - What features to select? How to determine them?
- Mixed-type data (e.g., webpages, or images with sound and description)
- What clustering/IR methods are better suited for what features? (What features for what methods?)
- Similar methods in data mining, ...

The problem
- Very large document collections
  - Google: 4,000,000 pages
  - Slow response?
- Solution: parallel computing
  - Google: 10,000 computers

Parallel architectures

                      Data stream
Instruction stream    Single              Multiple
Single                SISD (classical)    SIMD (simple)
Multiple              MISD (rare)         MIMD (many SISD)

MIMD architecture
- The most common
- Can be
  - tightly coupled
  - loosely coupled
- Distributed
  - Many computers interacting via network
  - PC clusters
  - Similar to MIMD computers, but greater cost of communication
  - Very loosely coupled
  - More coarse-grained programs

Performance improvement
Time: speedup S
- Ideally, N times (N is the number of processors)
- In practice impossible
  - The problem does not decompose into N equal parts
  - Communication and control overhead
  - S < 1 / f, where f is the fraction of the problem that cannot be parallelized (Amdahl's law)
Cost
- Per processor: S / N

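The 1 / f bound on this slide is commonly known as Amdahl's law. A minimal sketch of it follows, assuming f denotes the fraction of the work that must run sequentially and ignoring communication overhead; the function names `amdahl_speedup` and `efficiency` are illustrative, not taken from the lecture.

```python
def amdahl_speedup(f: float, n: int) -> float:
    """Upper bound on speedup S with n processors when a fraction f is sequential."""
    if not 0.0 <= f <= 1.0 or n < 1:
        raise ValueError("need 0 <= f <= 1 and n >= 1")
    # The sequential part takes f; the rest is divided among n processors.
    return 1.0 / (f + (1.0 - f) / n)


def efficiency(speedup: float, n: int) -> float:
    """Speedup per processor, S / N."""
    return speedup / n


if __name__ == "__main__":
    f = 0.05  # assumed sequential fraction, for illustration only
    for n in (1, 10, 100, 1000):
        s = amdahl_speedup(f, n)
        print(f"N={n:5d}  S={s:6.2f}  S/N={efficiency(s, n):.3f}")
```

However many processors are added, the speedup in this model stays below 1 / f (here 20), while the per-processor figure S / N keeps falling.
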
Two approaches to parallelism
- Build new algorithms
  - E.g., neural nets
  - Naturally parallel
  - Problem: to define the retrieval task
- Adapt the existing techniques to parallelism
  - Allows relying on well-studied approaches
  - We will consider this option

Ways to use parallelism
- Multitasking
  - N search engines
  - Good for processing many queries
  - Problems:
    - A single query is not sped up
    - Bottleneck: disk access (index)
    - Possible solution: replicating (part of) data. RAIDs
- Parallel algorithms
  - IR = data. Main question: how to partition the data
  - Document / index term matrix (terms can be LSI dimensions, signature bits, etc.)

Possible partitionings
- Horizontal: document partitioning. Union of results
- Vertical: term partitioning. Basically, intersect results

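The two partitionings can be illustrated on a toy document/term matrix. This is only a sketch: the four documents, the round-robin assignment, and the helper names (`partition_documents`, `partition_terms`, `match`) are assumptions made for the example.

```python
from typing import Dict, List, Set

Matrix = Dict[int, Set[str]]  # sparse document/term matrix: doc id -> terms

DOCS: Matrix = {
    0: {"parallel", "index"},
    1: {"index", "query"},
    2: {"parallel", "query"},
    3: {"signature"},
}


def partition_documents(docs: Matrix, n: int) -> List[Matrix]:
    """Horizontal split: each processor holds whole rows (documents)."""
    ids = sorted(docs)
    return [{d: docs[d] for d in ids[p::n]} for p in range(n)]


def partition_terms(docs: Matrix, n: int) -> List[Matrix]:
    """Vertical split: each processor holds whole columns (terms) for all documents."""
    vocab = sorted(set().union(*docs.values()))
    owned = [set(vocab[p::n]) for p in range(n)]
    return [{d: terms & own for d, terms in docs.items()} for own in owned]


def match(part: Matrix, term: str) -> Set[int]:
    """Documents in one partition that contain the term."""
    return {d for d, terms in part.items() if term in terms}


horizontal = partition_documents(DOCS, 2)
vertical = partition_terms(DOCS, 2)

# Document partitioning: every processor answers the query; results are united.
hits_h = set().union(*(match(p, "parallel") for p in horizontal))

# Term partitioning: each term is answered by the processor that owns it,
# and the per-term results are intersected for an AND query.
per_term = [set().union(*(match(p, t) for p in vertical)) for t in ("parallel", "query")]
hits_v = set.intersection(*per_term)

print(sorted(hits_h), sorted(hits_v))
```
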
Inverted files: Logical partitioning
- Logical vs. physical document partitioning
- Logical: for each term, keep per-processor pointers into the inverted file data, indicating each processor's portion

Inverted files: Logical partitioning
Construction and updating
- Also parallel
Construction
- Assign docs to processors
- Order docs such that each processor has an interval
- Process in parallel
- Merge. Each piece is ordered already

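The construction steps above can be sketched as follows. This is an illustration under assumptions, not the book's algorithm: documents are `(doc_id, text)` pairs already sorted by id, tokenization is a plain whitespace split, and a thread pool stands in for the processors (in CPython it shows the structure rather than a real speedup).

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def build_partial(interval):
    """Index one contiguous interval of (doc_id, text) pairs."""
    index = defaultdict(list)
    for doc_id, text in interval:
        for term in sorted(set(text.split())):
            index[term].append(doc_id)  # doc ids arrive in increasing order
    return index


def build_index(docs, n_workers):
    """Split docs into intervals, index them in parallel, then merge."""
    if not docs:
        return {}
    size = -(-len(docs) // n_workers)  # ceiling division
    intervals = [docs[i:i + size] for i in range(0, len(docs), size)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(build_partial, intervals))  # keeps interval order
    merged = defaultdict(list)
    for part in partials:
        for term, postings in part.items():
            # Intervals are disjoint and ordered, so concatenation keeps
            # every posting list sorted by doc id.
            merged[term].extend(postings)
    return dict(merged)


DOCS = [(0, "parallel index"), (1, "index query"), (2, "parallel query"), (3, "query")]
INDEX = build_index(DOCS, 2)
print(INDEX)
```

Because each processor owns an interval of document ids, the merge step needs no sorting: the pieces are simply appended in interval order.
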
Inverted files: Physical document partitioning
- Several separate collections, one per processor
- Separate indices
- Then the lists are merged (they are already ordered)
- Priority queue is used
  - The result is not sorted; insertion is quick
  - The maximal element can be found quickly
  - First k elements can be found rather quickly
  - Details in the book
- Consistent scores are needed
  - Global statistics are needed. Can be computed at index time

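One way to sketch the merge step is with Python's `heapq`, which is a priority queue. The example assumes each processor returns its hits as `(score, doc_id)` pairs sorted by descending score, and that the scores are already comparable across processors (i.e., computed with global statistics); the sample data are invented.

```python
import heapq
from itertools import islice


def top_k(per_processor_hits, k):
    """Merge already-sorted per-processor result lists and keep the best k."""
    merged = heapq.merge(
        *per_processor_hits,
        key=lambda hit: hit[0],  # compare by score
        reverse=True,            # inputs are sorted from best to worst
    )
    # heapq.merge is lazy, so only about k elements are actually pulled.
    return list(islice(merged, k))


RESULTS = [
    [(0.9, "d1"), (0.4, "d7")],  # processor 0
    [(0.8, "d3"), (0.5, "d2")],  # processor 1
    [(0.7, "d9")],               # processor 2
]

BEST = top_k(RESULTS, 3)
print(BEST)
```
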
Logical or physical partitioning?
- Logical requires less communication
  - Faster
- Physical is more flexible. Simpler implementation
  - Simpler conversion of existing systems

Inverted files: Term partitioning
- Each processor processes a part of the inverted file
- The results are intersected (for AND)
  - (or as appropriate for Boolean operations, OR and NOT)
- When term distribution in user queries is skewed, then document partitioning is better
  - When uniform, term partitioning is better.
  - Twice for long queries, 5 to 10 times for short (Web-like)

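A small sketch of Boolean evaluation over a term-partitioned inverted file follows. The two "processors", their posting lists, and the helper names (`fetch`, `q_and`, `q_or`, `q_not`) are assumptions for illustration; a real system would ship the posting lists over the interconnect.

```python
# Each processor owns the posting lists of a disjoint set of terms.
POSTINGS_BY_PROCESSOR = [
    {"distributed": {1, 2, 5}, "parallel": {2, 3, 5}},  # processor 0
    {"retrieval": {2, 4, 5}, "signature": {5, 6}},      # processor 1
]
ALL_DOCS = set(range(1, 7))


def fetch(term):
    """Ask the processor that owns the term for its posting list."""
    for postings in POSTINGS_BY_PROCESSOR:
        if term in postings:
            return postings[term]
    return set()  # unknown term: no documents


def q_and(*terms):
    """AND: intersect the per-term results (expects at least one term)."""
    return set.intersection(*(set(fetch(t)) for t in terms))


def q_or(*terms):
    """OR: unite the per-term results."""
    return set().union(*(fetch(t) for t in terms))


def q_not(term):
    """NOT: complement with respect to the whole collection."""
    return ALL_DOCS - fetch(term)


print(sorted(q_and("parallel", "retrieval")))
print(sorted(q_or("distributed", "signature")))
print(sorted(q_and("parallel", "retrieval") & q_not("signature")))
```

Only the processors owning a query term do any work here, which is one reason the distribution of terms in user queries matters for load balance.
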
Suffix arrays
- Array construction can be parallelized
  - Merges are parallel
- Document partitioning is applied straightforwardly
  - Each processor maintains its own suffix array
- Term partitioning can be applied
  - Each processor owns a branch of the tree (lexicographic interval)
  - Bottleneck: all processors need access to the entire text

Signature files
- Document partitioning: straightforward
  - Create query signature, distribute to each processor
  - Merge results (using Boolean operations if needed)
- Term partitioning: shorter signatures
  - Merging and eliminating false drops is slow
  - This method is not recommended

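The document-partitioned case can be sketched with superimposed coding. Everything concrete here is an assumption for illustration: 64-bit signatures, three bits per term, SHA-256 as the hash, and the tiny two-partition collection.

```python
import hashlib

SIG_BITS = 64      # assumed signature width
BITS_PER_TERM = 3  # assumed number of bits set per term


def term_signature(term):
    """Set BITS_PER_TERM pseudo-random bit positions for one term."""
    sig = 0
    for i in range(BITS_PER_TERM):
        digest = hashlib.sha256(f"{i}:{term}".encode()).digest()
        sig |= 1 << (int.from_bytes(digest[:4], "big") % SIG_BITS)
    return sig


def signature(terms):
    """Superimpose (OR together) the signatures of all terms."""
    sig = 0
    for term in terms:
        sig |= term_signature(term)
    return sig


def index_partition(docs):
    """Per-processor index: doc id -> (signature, terms)."""
    return {d: (signature(ts), ts) for d, ts in docs.items()}


def search_partition(partition, query_terms, q_sig):
    """One processor: signature test first, then eliminate false drops."""
    hits = []
    for doc_id, (doc_sig, terms) in partition.items():
        if (doc_sig & q_sig) == q_sig and query_terms <= terms:
            hits.append(doc_id)
    return hits


def search(partitions, query_terms):
    q_sig = signature(query_terms)  # created once, sent to every processor
    hits = []
    for part in partitions:         # each iteration stands for one processor
        hits.extend(search_partition(part, query_terms, q_sig))
    return sorted(hits)             # merge of the partial results


PARTITIONS = [
    index_partition({1: {"parallel", "index", "merge"}, 2: {"signature", "file"}}),
    index_partition({3: {"parallel", "index"}, 4: {"parallel", "query"}}),
]
print(search(PARTITIONS, {"parallel", "index"}))
```
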
SIMD computers
- Single Instruction, Multiple Data
- Uncommon
- Good for simple operations
  - Bit operations in signature files
  - Details in the book
- Ranking is supported in hardware in some computers
- If signature file does not fit into memory, can be processed in batches
  - I/O overhead
  - Use multiple queries with the same batch
  - This improves throughput, but not response time

… SIMD computers
- Inverted files are difficult to adapt to SIMD
- The inverted file is restructured
- Details in the book

Distributed IR
- MIMD with
  - Slow communication
  - Not all nodes are used for a given query
  - Encryption issues
- Document partitioning is usually used
  - Term partitioning imposes greater communication overhead
- Document clustering can be useful (to distribute docs among processors)
  - Index clusters and then search only the best ones
  - Another approach: use training queries, then the similarity of the user query to these

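The "search only the best clusters" idea can be sketched as a node-selection step. The sketch assumes each node is summarized by a centroid of term weights and that cosine similarity is used to rank nodes; the node names, weights, and the function `select_nodes` are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two sparse term-weight vectors (dicts)."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_nodes(query, centroids, m):
    """Return the m nodes whose cluster centroids are most similar to the query."""
    ranked = sorted(centroids, key=lambda node: cosine(query, centroids[node]), reverse=True)
    return ranked[:m]


CENTROIDS = {
    "node_a": {"parallel": 1.0, "index": 1.0},
    "node_b": {"image": 1.0, "sound": 1.0},
    "node_c": {"index": 1.0, "query": 1.0},
}
QUERY = {"parallel": 1.0, "index": 1.0}

# Only the selected nodes would receive the query.
print(select_nodes(QUERY, CENTROIDS, 2))
```

Contacting fewer nodes reduces communication, at the risk of missing relevant documents stored on the nodes that were skipped.
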
Research topics
- How to evaluate the speedup
- New algorithms
- Adaptation of existing algorithms
- Merging the results is a bottleneck
  - Meta search engines
- Creating large collections with judgements
  - Is recall important?

Conclusions
- Parallel computing can improve
  - response time for each query and/or
  - throughput: number of queries processed with same speed
- Document partitioning is simple
  - good for distributed computing
- Term partitioning is good for some data structures
- Distributed computing is MIMD computing with slow communication
- SIMD machines are good for signature files
  - Both are out of favor now

Thank you! Till May 17? 18?, 6 pm