Accelerating Pathology Image Data Cross-Comparison on CPU-GPU Hybrid Systems

Accelerating Pathology Image Data Cross-Comparison on CPU-GPU Hybrid Systems
Kaibo Wang (1), Yin Huai (1), Rubao Lee (1), Fusheng Wang (2, 3), Xiaodong Zhang (1), Joel H. Saltz (2, 3)
(1) Department of Computer Science and Engineering, The Ohio State University
(2) Center for Comprehensive Informatics, Emory University
(3) Department of Biomedical Informatics, Emory University

Background: Digital Pathology
• Digital pathology imaging has become an increasingly important field in the past decade
• Examination of high-resolution tissue images enables more effective prediction, diagnosis, and therapy of diseases
[Figure: workflow from glass slides through scanning to whole-slide images and image analysis]

Background: Image Algorithm Evaluation
• High-quality image analysis algorithms are essential to support biomedical research and diagnosis
  – Validate algorithms with human annotations
  – Compare and consolidate multiple algorithm results
  – Sensitivity study of algorithm parameters
[Figure: overlaid segmentation results. Green: algorithm one; red: algorithm two]

Problem: Spatial Cross-Comparison
• Spatial cross-comparison: identify and compare derived spatial objects belonging to different observations or analyses
• Jaccard similarity: the overlap ratio of intersecting polygons from two result sets, i.e., the area of their intersection divided by the area of their union
[Figure: two intersecting polygons p and q]
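As a minimal sketch of the metric, assume each polygon has been rasterized to the set of pixels it covers. The helper `rect_pixels` and the two sample squares are illustrative assumptions, not part of the talk. Jaccard similarity is then the size of the set intersection divided by the size of the set union.

```python
def jaccard(p_pixels, q_pixels):
    """Jaccard similarity of two polygons given as sets of covered pixels."""
    union = len(p_pixels | q_pixels)
    return len(p_pixels & q_pixels) / union if union else 0.0


def rect_pixels(x0, y0, x1, y1):
    """Hypothetical helper: pixels covered by the rectangle [x0, x1) x [y0, y1)."""
    return {(x, y) for x in range(x0, x1) for y in range(y0, y1)}


p = rect_pixels(0, 0, 4, 4)  # 16 pixels
q = rect_pixels(2, 2, 6, 6)  # 16 pixels, overlapping p in a 2 x 2 corner
similarity = jaccard(p, q)   # 4 shared pixels out of 28 covered pixels
```

Explicit pixel sets are only practical for tiny shapes; the rest of the talk is about computing the same two areas efficiently for large polygon sets.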

Both Data- and Compute-Intensive
• Increasingly large data sets
  – 10^5 x 10^5 pixels per image
  – 1 million objects per image
  – Hundreds to thousands of images per study
  – Big data demanding high throughput
• High computation intensity
  – Computing Jaccard similarity requires heavy-duty geometric operations
  – Demanding high performance
• Parallel computing techniques must be utilized to handle such intensive workloads

Existing Approach: Spatial DBMS
• Extension of RDBMS with spatial data types and operators (PostGIS, DB2, etc.)
• A typical cross-comparison takes many hours to finish on a single machine
  – 90+% of computing time is spent on computing the areas of polygon intersections and unions
  – Algorithms used by SDBMS are highly branch-intensive and difficult to parallelize
• Task-based (MIMD) parallel computing can be applied on large-scale clusters
  – Expensive: requires a facility with many high-end nodes
• A high-performance, high-throughput, and cost-effective solution is highly desirable

Graphics Processing Units (GPU)
• Low-cost and powerful data-parallel devices
• SIMD data-parallel architecture
  – All cores on a streaming multiprocessor (SM) execute the same instruction on different data
  – E.g., NVIDIA GTX 580 has 512 cores (16 SMs, 32 cores each)
• Applications must exploit SIMD data parallelism in order to best utilize the power of GPUs
[Figure: peak performance (GFLOPS) of GPUs and CPUs, 2006 to 2010. GPUs: 7900 GTX, 8800 GTX, 9800 GTX, GTX 280, GTX 285, GTX 480, GTX 580. CPUs: E4300, E6850, Q9650, X7460, 980 XE]
USENIX ATC'11

Our Solution: SCCG
• Spatial Cross-comparison on CPUs and GPUs
  – Utilize GPUs with CPUs for both high throughput and high performance in a cost-effective way
• Critical challenges
  – SIMD data-parallel algorithms on GPU
  – CPU-GPU hybrid computing framework
  – Load balancing between CPU and GPU

Outline
• Introduction
• SCCG
  – PixelBox GPU algorithm
  – Cross-comparing framework
  – Load balancing
• Experiments
• Conclusions

The PixelBox GPU Algorithm
• Given an array of polygon pairs, compute the area of intersection and area of union for each polygon pair
• Algorithm principles
  – Exploit SIMD data parallelism: compute areas of polygon intersections and unions in a SIMD data-parallel fashion
  – Maximize data parallelism and minimize unnecessary compute intensity: reduce compute intensity while maintaining high data parallelism

Exploit SIMD Data Parallelism
• Monte-Carlo approach (a basic method)
  – Consider polygons lying on a pixel map
  – Compute areas of intersection/union by counting the number of pixels lying within each region
  – Perfect data parallelism, but high compute intensity when polygons are large
[Figure: polygons p and q on a pixel map, with the intersection and union regions marked]
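The pixel-counting idea can be sketched sequentially as follows. This is an illustration under stated assumptions, not the authors' GPU kernel: polygons are lists of integer `(x, y)` vertices, every pixel in the joint bounding box is tested at its center with a standard ray-casting test, and the function names are hypothetical. Each pixel test is independent of the others, which is what makes the method map well to SIMD hardware.

```python
def point_in_polygon(x, y, poly):
    """Ray-casting test: count edge crossings of a ray going in +x from (x, y)."""
    inside = False
    for i in range(len(poly)):
        x1, y1 = poly[i]
        x2, y2 = poly[(i + 1) % len(poly)]
        # Only edges that straddle the horizontal line through y can be crossed.
        if (y1 > y) != (y2 > y) and x < x1 + (y - y1) * (x2 - x1) / (y2 - y1):
            inside = not inside
    return inside


def pixel_count_areas(p, q):
    """Return (intersection area, union area) by testing every pixel center."""
    xs = [v[0] for v in p + q]
    ys = [v[1] for v in p + q]
    inter = union = 0
    for px in range(min(xs), max(xs)):
        for py in range(min(ys), max(ys)):
            in_p = point_in_polygon(px + 0.5, py + 0.5, p)
            in_q = point_in_polygon(px + 0.5, py + 0.5, q)
            inter += in_p and in_q   # pixel lies in both polygons
            union += in_p or in_q    # pixel lies in at least one polygon
    return inter, union


square_p = [(0, 0), (4, 0), (4, 4), (0, 4)]
square_q = [(2, 2), (6, 2), (6, 6), (2, 6)]
areas = pixel_count_areas(square_p, square_q)
```

The cost grows with the number of pixels in the bounding box times the number of polygon edges, which is the high compute intensity the slide refers to.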

Reduce Unnecessary Compute Intensity
• Use sampling boxes
  – Compute areas box by box, thus avoiding lots of costly per-pixel testing
  – Partition the bounding rectangle of a polygon pair by a grid into boxes (like cells)
  – Recursively explore unsettled boxes by partitioning them into smaller sub-boxes
• According to our testing, 50+% of areas can be determined with only one level of box partitioning
[Figure: grid over the bounding rectangle of p and q. Box labels: completely within the intersection of p and q; completely within the union of p and q; don't belong to either intersection or union; unsettled, thus need further exploration (further partitioned)]

The PixelBox Algorithm
• PixelBox works on both pixels and boxes
  – First, box by box to quickly finish the testing of large regions
  – Then, pixel by pixel within small sub-boxes that need further testing
• In this way, PixelBox preserves the benefits of both high data parallelism and low compute intensity
[Figure: polygons p and q]
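A sequential sketch of the box-then-pixel idea follows. It is a simplified illustration, not the published algorithm. It assumes rectilinear polygons (all edges axis-aligned) with integer vertices, treats a box as settled only when neither polygon's boundary passes through it, and uses hypothetical names and parameters (`box_status`, `grid`, `small`).

```python
def point_in_polygon(x, y, poly):
    """Ray-casting test; poly is a list of (x, y) vertices."""
    inside = False
    for i in range(len(poly)):
        x1, y1 = poly[i]
        x2, y2 = poly[(i + 1) % len(poly)]
        if (y1 > y) != (y2 > y) and x < x1 + (y - y1) * (x2 - x1) / (y2 - y1):
            inside = not inside
    return inside


def box_status(poly, x0, y0, x1, y1):
    """Classify a box against a rectilinear polygon: 'in', 'out', or 'cross'."""
    for i in range(len(poly)):
        ax, ay = poly[i]
        bx, by = poly[(i + 1) % len(poly)]
        if ax == bx:  # vertical edge
            hit = x0 < ax < x1 and max(min(ay, by), y0) < min(max(ay, by), y1)
        else:         # horizontal edge (rectilinear assumption)
            hit = y0 < ay < y1 and max(min(ax, bx), x0) < min(max(ax, bx), x1)
        if hit:
            return "cross"  # the boundary passes through the box interior
    inside = point_in_polygon((x0 + x1) / 2, (y0 + y1) / 2, poly)
    return "in" if inside else "out"


def pixelbox_areas(p, q, grid=2, small=16):
    """Return (intersection area, union area) of rectilinear polygons p and q."""
    xs = [v[0] for v in p + q]
    ys = [v[1] for v in p + q]
    inter = union = 0

    def explore(x0, y0, x1, y1):
        nonlocal inter, union
        sp, sq = box_status(p, x0, y0, x1, y1), box_status(q, x0, y0, x1, y1)
        if "cross" not in (sp, sq):            # settled: add the whole box area
            area = (x1 - x0) * (y1 - y0)
            inter += area if sp == "in" and sq == "in" else 0
            union += area if "in" in (sp, sq) else 0
        elif (x1 - x0) * (y1 - y0) <= small:   # small unsettled box: test pixels
            for px in range(x0, x1):
                for py in range(y0, y1):
                    in_p = point_in_polygon(px + 0.5, py + 0.5, p)
                    in_q = point_in_polygon(px + 0.5, py + 0.5, q)
                    inter += in_p and in_q
                    union += in_p or in_q
        else:                                  # large unsettled box: partition
            cx = [x0 + (x1 - x0) * i // grid for i in range(grid + 1)]
            cy = [y0 + (y1 - y0) * j // grid for j in range(grid + 1)]
            for i in range(grid):
                for j in range(grid):
                    if cx[i] < cx[i + 1] and cy[j] < cy[j + 1]:
                        explore(cx[i], cy[j], cx[i + 1], cy[j + 1])

    explore(min(xs), min(ys), max(xs), max(ys))
    return inter, union


l_shape = [(0, 0), (8, 0), (8, 4), (4, 4), (4, 8), (0, 8)]  # area 48
rect = [(2, 2), (10, 2), (10, 6), (2, 6)]                   # area 32
areas = pixelbox_areas(l_shape, rect)
```

Large boxes that lie wholly inside or outside both polygons are resolved with two cheap classifications, and per-pixel tests are spent only near polygon boundaries.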

Implementation for GPU
• Use a shared stack to store boxes to be tested
  – The bounding rectangle of two polygons is pushed on the stack as the first box
  – All threads keep popping boxes from the stack
  – Computation finishes when the stack becomes empty
• Processing the stack top
  – The stack top is fetched by all threads, and partitioned again if it needs further testing
  – It is partitioned into sub-boxes, which are then tested by different threads in parallel
  – The contribution of each sub-box is computed, along with whether further testing is needed
  – New sub-boxes are pushed to the top of the stack for processing
• Marks
  – A box with mark 1 needs to be further partitioned, or Monte Carlo is applied if it is already small enough
  – A box with mark 0 needs no further testing, and is thus ignored by all threads
• Using a stack improves data parallelism: both popping and pushing can be done in SIMD fashion
[Figure: a sampling box partitioned into sub-boxes; Thread 1, Thread 2, Thread 3, ... each test one sub-box and write its mark to the stack (1: need further testing; 0: no further testing)]

The Cross-Comparing Framework
• A pipelined framework that executes the whole cross-comparing workflow
  – CPU: load data from disk
  – CPU: find intersecting polygon pairs from the data
  – GPU: compute Jaccard similarity (PixelBox)
• Pipelined execution reduces resource contention over GPUs
  – GPU is an exclusive, non-preemptive device
  – Multiple threads trying to access a GPU simultaneously are serialized
  – A single initiator (Aggregator) reduces blocking over GPUs
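One way to picture the single-initiator design is a producer thread feeding a bounded queue, with one aggregator thread as the only caller of the GPU routine. The sketch below is an assumption-laden illustration: `run_pipeline`, `find_pairs`, `gpu_jaccard`, and the batching policy are hypothetical, and the GPU is replaced by an ordinary Python function.

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of the pair stream


def run_pipeline(tiles, find_pairs, gpu_jaccard, batch_size=3):
    """Run load/filter on a producer thread and GPU calls on one aggregator."""
    pairs = queue.Queue(maxsize=64)  # bounded buffer between the two stages
    results = []

    def producer():
        # CPU stage: take each loaded tile and emit its intersecting pairs.
        for tile in tiles:
            for pair in find_pairs(tile):
                pairs.put(pair)
        pairs.put(_DONE)

    def aggregator():
        # The only thread that ever calls the GPU routine.
        batch = []
        while True:
            item = pairs.get()
            if item is not _DONE:
                batch.append(item)
            if batch and (item is _DONE or len(batch) == batch_size):
                results.extend(gpu_jaccard(batch))
                batch = []
            if item is _DONE:
                return

    threads = [threading.Thread(target=producer),
               threading.Thread(target=aggregator)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


# Stand-in data: each tile is a list of (p, q) pixel-set pairs.
tiles = [
    [({1, 2, 3}, {2, 3, 4}), ({1}, {1})],
    [({1, 2}, {3, 4}), ({1, 2}, {2})],
]
batch_sizes = []


def fake_gpu(batch):
    """Stand-in for the GPU kernel: Jaccard similarity of each pair."""
    batch_sizes.append(len(batch))
    return [len(p & q) / len(p | q) for p, q in batch]


scores = run_pipeline(tiles, find_pairs=lambda tile: tile, gpu_jaccard=fake_gpu)
```

With a single aggregator, the device receives one batched request at a time instead of many small competing ones, which matches the slide's point about reducing blocking over an exclusive, non-preemptive GPU.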

CPU-GPU Load Balancing
• Both CPUs and GPUs have to be fully utilized in order to maximize throughput
  – Produce polygon pairs on CPUs: load data from disk, find intersecting polygon pairs from the data
  – Consume polygon pairs on GPUs: compute Jaccard similarity
  – CPUs idle if production speed > consumption speed
  – GPUs idle if production speed < consumption speed
• Tasks have to be dynamically migrated among CPUs and GPUs to achieve load balancing (please read paper for details)

Experiments
• Platforms
  – Dell T1500 workstation
    • One Intel Core i7 860 CPU (4 cores)
    • One NVIDIA GTX 580 GPU
  – Amazon EC2 instance
    • Two Intel Xeon X5570 CPUs (8 cores, 16 threads)
• Data sets
  – 18 pairs of polygon sets extracted from 18 real-world brain tumor pathology images
  – 12 GiB in raw text format
• Methodology
  – Spatial DBMS: PostgreSQL 9.1.3 + PostGIS 1.5.3
  – Disk loading time not considered
    • The "first-load-then-query" scheme of databases is known to be inefficient for processing one-time data
    • Storage devices are improving (e.g., SSD)
  – Ignore data format conversion and data partitioning time in SDBMS
  – Both simplifications favor the performance of PostGIS

Effectiveness of PixelBox on GPU
• Algorithm performance
  – On Dell T1500, compute the Jaccard similarity of 619609 pairs of polygons in a representative data set
  – GEOS on a single core: over 430 s
  – PixelBox-CPU (CPU version of PixelBox on a single core): over 290 s, 1.5x speedup over GEOS
  – PixelBox on GPU: only 3.6 s, 120x speedup over GEOS
• This experiment shows the effectiveness of PixelBox and its best utilization of SIMD data parallelism of GPUs
[Figure: log-scale bar charts of execution time (s) and speedup over GEOS for GEOS, PixelBox-CPU, and PixelBox]

Overall Performance
• Cross-comparing performance
  – PostGIS-M: parallelized PostGIS on EC2 (two Intel Xeon X5570: $2000, 190 W)
  – SCCG: our solution on Dell T1500 (Core i7 860 + GTX 580: $800, 339 W)
  – 18x speedup on average
• Our solution is 2.4x lower in hardware cost, and 10x higher in performance per watt
[Figure: execution time (s) of PostGIS-M and SCCG, and speedup over PostGIS-M, by data set index (1 to 18, plus GMean)]

Conclusions
• Spatial cross-comparison is a data- and compute-intensive operation
• The existing SDBMS approach is neither high-performance nor low-cost
• We provide a software solution based on GPUs and CPUs to significantly accelerate the work at low cost

Thank You
• Q&A