LLVM: Built-in Scalable Code Clone Detection Based on Semantic Analysis
Institute for System Programming of the Russian Academy of Sciences
Clone Types
1. Type 1: code fragments that are identical except for variations in whitespace, layout, and comments.
2. Type 2: structurally/syntactically identical code fragments except for variations in identifiers, literals, types, layout, and comments.
3. Type 3: copied fragments of code with further modifications. Statements can be changed, added, or removed, in addition to variations in identifiers, literals, types, layout, and comments.
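To make the three categories concrete, here is a small illustrative set of C++ functions. All function names and bodies are hypothetical and do not come from the slides; they only show what each kind of variation might look like.

```cpp
// Original fragment.
constexpr int sumTo(int n) {
    int sum = 0;
    for (int i = 1; i <= n; i++) {
        sum = sum + i;
    }
    return sum;
}

// Type 1: same tokens; only layout and comments differ.
constexpr int sumToType1(int n) {
    int sum = 0;  // running total
    for (int i = 1; i <= n; i++) { sum = sum + i; }
    return sum;
}

// Type 2: same structure; identifiers and types are renamed.
constexpr long totalUpTo(long limit) {
    long total = 0;
    for (long k = 1; k <= limit; k++) {
        total = total + k;
    }
    return total;
}

// Type 3: copied, then statements were added.
constexpr int sumToType3(int n) {
    int sum = 0;
    int steps = 0;  // added declaration
    for (int i = 1; i <= n; i++) {
        sum = sum + i;
        steps++;    // added statement
    }
    return sum;
}
```

All four functions compute the same sum, so a text-based comparison sees increasingly different code while the underlying data flow stays largely the same.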
Formulation of the Problem
Design a code clone detection tool capable of analyzing real projects.
Requirements:
• Semantics-based (built on the Program Dependence Graph)
• High accuracy
• Scalable (able to analyze up to a million lines of source code)
Architecture: First Part
Generation of Program Dependence Graphs (PDG)
[Diagram labels: clang, LLVM pass, PDG pass, PDG for one module, executable]
1. Construction of PDG
2. Optimizations of PDG
3. Serialization of PDG
Example of Program Dependence Graph
Source:
    void foo() {
        int b = 5;
        int a = b*b;
    }
LLVM IR:
    define void @foo() #0 {
      %b = alloca i32
      %a = alloca i32
      store i32 5, i32* %b
      %1 = load i32* %b
      %2 = load i32* %b
      %3 = mul nsw i32 %1, %2
      store i32 %3, i32* %a
    }
[Figure: PDG whose nodes are the seven instructions above]
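The data-dependence part of such a graph can be sketched in a few lines. The block below is a minimal illustration, not the LLVM pass described in these slides: it handles straight-line code only, ignores control dependences, and models memory with a made-up naming scheme ("*%b" for the cell that %b points to). The names Instr, buildDataDependences, and fooInstructions are assumptions introduced for this example.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>
#include <vector>

// One instruction with the names it reads and the names it writes.
struct Instr {
    std::string text;
    std::vector<std::string> uses;
    std::vector<std::string> defs;
};

using Edge = std::pair<int, int>;  // (producer index, consumer index)

// Adds an edge from the most recent writer of each name to every later reader.
std::vector<Edge> buildDataDependences(const std::vector<Instr>& block) {
    std::map<std::string, int> lastDef;
    std::vector<Edge> edges;
    for (int i = 0; i < static_cast<int>(block.size()); ++i) {
        for (const auto& name : block[i].uses) {
            auto it = lastDef.find(name);
            if (it != lastDef.end()) edges.emplace_back(it->second, i);
        }
        for (const auto& name : block[i].defs) lastDef[name] = i;
    }
    return edges;
}

// The seven instructions of foo() from the slide, annotated by hand.
std::vector<Instr> fooInstructions() {
    return {
        {"%b = alloca i32",         {},            {"%b"}},
        {"%a = alloca i32",         {},            {"%a"}},
        {"store i32 5, i32* %b",    {"%b"},        {"*%b"}},
        {"%1 = load i32* %b",       {"%b", "*%b"}, {"%1"}},
        {"%2 = load i32* %b",       {"%b", "*%b"}, {"%2"}},
        {"%3 = mul nsw i32 %1, %2", {"%1", "%2"},  {"%3"}},
        {"store i32 %3, i32* %a",   {"%3", "%a"},  {"*%a"}},
    };
}
```

Calling buildDataDependences(fooInstructions()) yields edges such as store-to-load through the memory cell of %b and load-to-mul through %1 and %2. A real pass would obtain the use and definition information from the compiler's IR rather than from hand-written annotations.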
Architecture: Second Part
Code Clone Detection Tool (input: the PDG for each module)
1. Load dumped PDGs
2. Split PDGs into subgraphs
3. Fast checks (quickly rule out pairs of graphs that cannot be clones)
4. Maximal isomorphic subgraph detection (approximate)
5. Filtering
6. Printing
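The slides do not say which fast checks are used. One plausible check, sketched below purely as an assumption, compares opcode histograms: if two graphs share too few nodes with the same opcode, no matching can reach the required similarity, so the expensive subgraph search can be skipped. The similarity measure (matched nodes divided by the size of the larger graph) and all names here are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

using OpcodeCounts = std::map<std::string, std::size_t>;

// Counts how often each opcode occurs among the nodes of a graph.
OpcodeCounts countOpcodes(const std::vector<std::string>& nodes) {
    OpcodeCounts counts;
    for (const auto& op : nodes) ++counts[op];
    return counts;
}

// Upper bound on the number of nodes any opcode-preserving matching can pair.
std::size_t maxMatchableNodes(const OpcodeCounts& a, const OpcodeCounts& b) {
    std::size_t total = 0;
    for (const auto& kv : a) {
        auto it = b.find(kv.first);
        if (it != b.end()) total += std::min(kv.second, it->second);
    }
    return total;
}

// Returns false only when the pair provably cannot reach minSimilarity.
bool mayBeClones(const std::vector<std::string>& g1,
                 const std::vector<std::string>& g2,
                 double minSimilarity) {
    if (g1.empty() || g2.empty()) return false;
    const std::size_t bound = maxMatchableNodes(countOpcodes(g1), countOpcodes(g2));
    const std::size_t larger = std::max(g1.size(), g2.size());
    return static_cast<double>(bound) >= minSimilarity * static_cast<double>(larger);
}
```

A true result only means the pair survives the filter; the approximate subgraph detection in step 4 still has to decide whether it is actually a clone.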
Automatic Testing System
[Diagram: the list of PDGs for the project (PDG 1, PDG 2, ..., PDG n) and a modified list of PDGs (PDG' 1, PDG' 2, ..., PDG' n/2); pairs such as (PDG i, PDG' j) and (PDG i, PDG k) are checked for clones.]
Advantages
1. PDGs are generated at compile time.
2. No extra analysis is needed for dependencies between compilation modules.
3. High accuracy.
4. Scales to projects with millions of lines of source code (C/C++).
Results
copy00.cpp was modified to obtain the 3 types of clones.
copy00.cpp:
    void foo(float sum, float prod) {
        float result = sum + prod;
    }
    void sumProd(int n) {
        float sum = 0.0; //C1
        float prod = 1.0;
        for (int i = 1; i<=n; i++) {
            sum = sum + i;
            prod = prod * i;
            foo(sum, prod);
        }
    }
Test names: copy00.cpp, copy01.cpp, copy02.cpp, copy03.cpp, copy04.cpp, copy05.cpp, copy06.cpp, copy07.cpp, copy08.cpp, copy09.cpp, copy10.cpp, copy11.cpp, copy12.cpp, copy13.cpp, copy14.cpp, copy15.cpp
Results per tool (as listed on the slide):
    **CCFinder(X): yes yes yes no no yes
    MOSS:          yes yes yes no no yes yes
    CloneDR:       yes yes no yes yes yes
    CCD:           yes yes yes no yes yes
**CCFinder(X): Chanchal K. Roy, "Comparison and evaluation of code clone detection techniques and tools: A qualitative approach"
Results
Test machine: Intel Core i3, 8 GB RAM.

                                                  Mozilla Firefox   LLVM + Clang   OpenSSL
Source code lines (millions)                      11                1.9            0.46
Compile time without PDG generation (seconds)     5231              1965           81
Compile time with PDG generation (seconds)        5525              2520           130
Size of PDGs (megabytes)                          221               150            8.7
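The cost of PDG generation can be read off the two compile-time rows. A minimal sketch of that arithmetic follows; the function name is illustrative and not from the slides.

```cpp
// Relative compile-time overhead of PDG generation, in percent.
constexpr double overheadPercent(double baselineSeconds, double withPdgSeconds) {
    return (withPdgSeconds - baselineSeconds) / baselineSeconds * 100.0;
}

// Figures taken from the compile-time rows above.
constexpr double firefoxOverhead = overheadPercent(5231.0, 5525.0);
constexpr double llvmClangOverhead = overheadPercent(1965.0, 2520.0);
constexpr double opensslOverhead = overheadPercent(81.0, 130.0);
```

For the reported figures this gives roughly 5.6% for Firefox, 28.2% for LLVM + Clang, and 60.5% for OpenSSL, so the relative overhead is smallest on the largest project.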
Results
Similarity level above 90%.

                               Mozilla Firefox   LLVM + Clang   OpenSSL
Detection time (seconds)       28219             20491          83
Detected clones                91                23             101
False positives                7                 3              4
Size of PDGs (megabytes)       221               150            8.7
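The detected-clone and false-positive counts can be combined into the share of reported clones that were confirmed. The sketch below shows the calculation; the function name and the idea of reporting this share are additions for illustration, not something stated on the slide.

```cpp
// Share of reported clones that are not false positives, in percent.
constexpr double confirmedSharePercent(int detected, int falsePositives) {
    return 100.0 * (detected - falsePositives) / detected;
}

// Counts taken from the detection results above.
constexpr double firefoxConfirmed = confirmedSharePercent(91, 7);
constexpr double llvmClangConfirmed = confirmedSharePercent(23, 3);
constexpr double opensslConfirmed = confirmedSharePercent(101, 4);
```

With the reported counts this comes to about 92.3% for Firefox, 87.0% for LLVM + Clang, and 96.0% for OpenSSL.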
Results
openssl-1.0.1g/crypto/cast/c_enc.c, lines 141-154:
    for (l-=8; l>=0; l-=8)
    {
        n2l(in,tin0);
        n2l(in,tin1);
        tin0^=tout0;
        tin1^=tout1;
        tin[0]=tin0;
        tin[1]=tin1;
        CAST_encrypt(tin,ks);
        tout0=tin[0];
        tout1=tin[1];
        l2n(tout0,out);
        l2n(tout1,out);
    }
openssl-1.0.1g/crypto/bf/bf_enc.c, lines 237-250:
    for (l-=8; l>=0; l-=8)
    {
        n2l(in,tin0);
        n2l(in,tin1);
        tin0^=tout0;
        tin1^=tout1;
        tin[0]=tin0;
        tin[1]=tin1;
        BF_encrypt(tin,schedule);
        tout0=tin[0];
        tout1=tin[1];
        l2n(tout0,out);
        l2n(tout1,out);
    }
Results
openssl-1.0.1g/crypto/bf/bf_ofb64.c, lines 79-98:
    n2l(iv,v0);
    n2l(iv,v1);
    ti[0]=v0;
    ti[1]=v1;
    dp=(char *)d;
    l2n(v0,dp);
    l2n(v1,dp);
    while (l--)
    {
        if (n == 0)
        {
            BF_encrypt((BF_LONG *)ti,schedule);
            dp=(char *)d;
            t=ti[0]; l2n(t,dp);
            t=ti[1]; l2n(t,dp);
            save++;
        }
        *(out++)= *(in++)^d[n];
        n=(n+1)&0x07;
    }
openssl-1.0.1g/crypto/des/ofb64ede.c, lines 81-105:
    c2l(iv,v0);
    c2l(iv,v1);
    ti[0]=v0;
    ti[1]=v1;
    dp=(char *)d;
    l2c(v0,dp);
    l2c(v1,dp);
    while (l--)
    {
        if (n == 0)
        {
            /* ti[0]=v0; */
            /* ti[1]=v1; */
            DES_encrypt3(ti,k1,k2,k3);
            v0=ti[0];
            v1=ti[1];

            dp=(char *)d;
            l2c(v0,dp);
            l2c(v1,dp);
            save++;
        }
        *(out++)= *(in++)^d[n];
        n=(n+1)&0x07;
    }
Results
openssl-1.0.1g/crypto/bn/asm/x86_64-gcc.c, lines 468-491:
    mul_add_c(a[0],b[1],c2,c3,c1);
    mul_add_c(a[1],b[0],c2,c3,c1);
    r[1]=c2;
    c2=0;
    mul_add_c(a[2],b[0],c3,c1,c2);
    mul_add_c(a[1],b[1],c3,c1,c2);
    mul_add_c(a[0],b[2],c3,c1,c2);
    r[2]=c3;
    c3=0;
    mul_add_c(a[0],b[3],c1,c2,c3);
    mul_add_c(a[1],b[2],c1,c2,c3);
    mul_add_c(a[2],b[1],c1,c2,c3);
    mul_add_c(a[3],b[0],c1,c2,c3);
    r[3]=c1;
    c1=0;
    mul_add_c(a[3],b[1],c2,c3,c1);
    mul_add_c(a[2],b[2],c2,c3,c1);
    mul_add_c(a[1],b[3],c2,c3,c1);
    r[4]=c2;
    c2=0;
    mul_add_c(a[2],b[3],c3,c1,c2);
    mul_add_c(a[3],b[2],c3,c1,c2);
    r[5]=c3;
    c3=0;
openssl-1.0.1g/crypto/bn/asm/x86_64-gcc.c, lines 535-560:
    sqr_add_c2(a,7,0,c2,c3,c1);
    sqr_add_c2(a,6,1,c2,c3,c1);
    sqr_add_c2(a,5,2,c3,c1);
    sqr_add_c2(a,4,3,c2,c3,c1);
    r[7]=c2;
    c2=0;
    sqr_add_c(a,4,c3,c1,c2);
    sqr_add_c2(a,5,3,c1,c2);
    sqr_add_c2(a,6,2,c3,c1,c2);
    sqr_add_c2(a,7,1,c3,c1,c2);
    r[8]=c3;
    c3=0;
    sqr_add_c2(a,7,2,c1,c2,c3);
    sqr_add_c2(a,6,3,c1,c2,c3);
    sqr_add_c2(a,5,4,c1,c2,c3);
    r[9]=c1;
    c1=0;
    sqr_add_c(a,5,c2,c3,c1);
    sqr_add_c2(a,6,4,c2,c3,c1);
    sqr_add_c2(a,7,3,c2,c3,c1);
    r[10]=c2;
    c2=0;
    sqr_add_c2(a,7,4,c3,c1,c2);
    sqr_add_c2(a,6,5,c3,c1,c2);
    r[11]=c3;
    c3=0;
Thank You.