BugBench: A Benchmark for Evaluating Bug Detection Tools

BugBench: A Benchmark for Evaluating Bug Detection Tools
Shan Lu, Zhenmin Li, Feng Qin, Lin Tan, Pin Zhou and Yuanyuan Zhou
University of Illinois, Urbana-Champaign

Content of This Talk
- Share our experience
- Bug/application characteristics analysis
- BugBench has been used by
  - Our previous work [Micro’04, ISCA’04, HPCA’05]
  - Other research groups: UCSD, Purdue, NCSU, etc.

Current Benchmark Suite

Name   Program           Source      LOC
NCOM   ncompress-4.2.4   Red Hat     1.9K
POLY   polymorph-0.4.0   GNU         0.7K
GZIP   gzip-1.2.4        GNU         8.2K
COMP   129.compress      SPEC95      2.0K
GO     099.go            SPEC95      29.6K
MAN    man-1.5h1         Red Hat     4.7K
BC     bc-1.06           GNU         17.0K
SQUD   squid-2.3         Squid       93.5K
CALB   cachelib          UIUC        6.6K
CVS    cvs-1.11.4        GNU         114.5K
YPSV   ypserv-2.2        Linux NIS   11.4K
PFTP   proftpd-1.2.9     ProFTPD     68.9K
SQUD2  squid-2.4         Squid       104.6K
HTPD   httpd-2.0.49      Apache      224K
MSQL1  msql-4.1.1        MySQL       1028K
MSQL2  msql-3.23.56      MySQL       514K
MSQL3  msql-4.1.1        MySQL       1028K
PSQL   postgresql-7.4.2  PostgreSQL  559K
HTPD2  httpd-2.0.49      Apache      224K

Bug types: stack smash & global buffer overflow, global buffer overflow, uninitialized read, double free, memory leak, data race, atomicity, semantic
Bug categories: memory related, multi-thread related, semantic
Crash latency values shown: N/A, 9040K Inst, 15K Inst, N/A, 29.5M Inst, 189K Inst, 0, N/A, N/A, N/A
Other types of bugs: in searching …
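Because the suite is essentially a table of programs with a source and a size, it can be convenient to hold it as records and filter on those characteristics. The following is a minimal sketch under that assumption: the `Benchmark` class and the `pick` helper are hypothetical and not part of BugBench, and only a few rows are included, with names, sources, and LOC values copied from the slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Benchmark:
    name: str      # short name used on the slides
    program: str   # program and version
    source: str    # where the program comes from
    kloc: float    # lines of code, in thousands


# A few rows copied from the slide; the full suite has more entries.
SUITE = [
    Benchmark("NCOM", "ncompress-4.2.4", "Red Hat", 1.9),
    Benchmark("GZIP", "gzip-1.2.4", "GNU", 8.2),
    Benchmark("BC", "bc-1.06", "GNU", 17.0),
    Benchmark("CVS", "cvs-1.11.4", "GNU", 114.5),
    Benchmark("PSQL", "postgresql-7.4.2", "PostgreSQL", 559.0),
]


def pick(suite, max_kloc):
    """Return the names of programs no larger than max_kloc thousand lines."""
    return [b.name for b in suite if b.kloc <= max_kloc]


if __name__ == "__main__":
    print(pick(SUITE, max_kloc=20.0))  # ['NCOM', 'GZIP', 'BC']
```

A tool that cannot yet handle large server applications could start with the small programs selected this way and move to the larger ones later.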

Functionality

Catch bug? (Valgrind, Purify, CCured; related memory object type)
NCOM: No, No, Yes; stack
POLY: Vary, Yes; stack & global buffer
GZIP: Yes, Yes
COMP: No, No, Yes
GO: No, Yes
MAN: Yes, Yes
BC: Yes, Yes
SQUD: Yes, N/A
Memory object types also listed: global buffer, heap buffer

- Valgrind: misses stack buffer overflow; misses moderate global-buffer overflow
- Purify: misses stack buffer overflow; misses 1-byte global-buffer overflow
- CCured: failed to apply

Overhead
- Valgrind: 6.4X (NCOM) to 119X (BC)
- Purify: 28% (POLY) to 76X (BC)
- CCured: 4% (POLY) to 3.7X (GZIP)

[Charts: Memory Alloc Freq. (# per MInst); Heap Usage Ratio [Heap/(Heap+Stack)]; Mem. Access Freq. (# per Instruction); programs shown include NCOM and BC]
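The heap usage ratio on this slide is defined as Heap/(Heap+Stack), and overheads are quoted both as percentages (28%) and as multiples (76X). A minimal Python sketch of both calculations follows. The function names and sample measurements are hypothetical, and the sketch assumes that an overhead of "NX" means the extra run time is N times the native run time, which the slide does not state.

```python
def heap_usage_ratio(heap_bytes: int, stack_bytes: int) -> float:
    """Heap / (Heap + Stack), as defined on the slide."""
    total = heap_bytes + stack_bytes
    if total == 0:
        raise ValueError("no heap or stack usage recorded")
    return heap_bytes / total


def overhead(native_seconds: float, instrumented_seconds: float) -> float:
    """Extra run time relative to the native run (assumed definition)."""
    if native_seconds <= 0:
        raise ValueError("native run time must be positive")
    return (instrumented_seconds - native_seconds) / native_seconds


def format_overhead(value: float) -> str:
    """Render overheads below 1 as percentages and the rest as multiples."""
    if value < 1.0:
        return f"{value * 100:.0f}%"
    return f"{value:.1f}X"


if __name__ == "__main__":
    # Hypothetical measurements, not taken from the talk.
    print(heap_usage_ratio(heap_bytes=3_000, stack_bytes=1_000))  # 0.75
    print(format_overhead(overhead(10.0, 12.8)))                  # 28%
    print(format_overhead(overhead(10.0, 74.0)))                  # 6.4X
```

Under this assumed definition, a percentage and a multiple are the same quantity at different scales, so 28% and 76X can be compared directly.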

Experience Summary
- Building a benchmark is time-consuming, long-term work
  - This motivates automatic tools to extract bugs
- Bug/application characteristics are important for selecting applications
- Cooperation from the entire community is needed