FSE '14: Semantics-Based Obfuscation-Resilient Binary Code Similarity Comparison

FSE '14. Semantics-Based Obfuscation-Resilient Binary Code Similarity Comparison with Application to Software Plagiarism Detection. Lannan Luo, Jiang Ming, Dinghao Wu, Peng Liu, Sencun Zhu. The Pennsylvania State University, University Park.

2 Research Problem • Program A vs. Program B: similar? • Component: a set of functions, or a whole program.

3 Challenges • Code obfuscation techniques • Different compilers and compiler optimization levels • No source code available

4 Examples • if (condition) { /* then branch */ } compiled with GCC -O0 vs. GCC -O2: high dissimilarity (around 60 optimizations)

5 Examples • Code obfuscation: switch/case to if/else (CIL) • Code obfuscation dramatically transforms code in various ways

6 Requirements (basic and other) • Binary code-based • Partial code reuse detection • Obfuscation resilient

7 Existing methods vs. requirements
Requirement | Clone detection (MOSS, JPlag, etc.) | Binary similarity detection (Bdiff, DarunGrim2, etc.) | Software plagiarism detection (birthmark based, core value based, etc.)
Binary code-based | ✗ | ✔ | ✔
Obfuscation resilient | ✗ | ✗ | ✔/✗
Partial code detection | ✔ | ✔ | ✗

8 CoP: a binary-oriented obfuscation-resilient method • Based on a new concept: Longest Common Subsequence (LCS) of Semantically Equivalent Basic Blocks (SEBB) • SEBB: obfuscation resiliency • LCS: scalability, obfuscation resiliency

9 Architecture Three levels: • Basic block level • Path level • Whole component level

10 Basic block similarity computation • Symbolic execution • Obtain a set of symbolic formulas that represent the input-output relations of basic blocks. • Theorem proving • Check the pairwise equivalence of the symbolic formulas of two basic blocks. • Calculate a basic block similarity score; if the score exceeds a threshold, the two basic blocks are treated as semantically equivalent.
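
The scoring step above can be sketched in Python. This is only an illustration under stated assumptions: randomized testing over 32-bit inputs stands in for the theorem prover (it can suggest equivalence but not prove it), symbolic formulas are modeled as plain Python callables, and the score is taken to be the fraction of plaintiff output formulas matched by some suspicious formula under some permutation of inputs. The names `equivalent` and `block_similarity` are hypothetical and not taken from CoP.

```python
import itertools
import random

MASK = 0xFFFFFFFF  # assumption: model 32-bit machine arithmetic


def equivalent(f, g, arity, trials=200, seed=0):
    """Randomized stand-in for a theorem prover (NOT a proof).

    Returns True if f and g agree on every sampled input under at least
    one permutation of g's arguments.
    """
    rng = random.Random(seed)
    samples = [[rng.getrandbits(32) for _ in range(arity)] for _ in range(trials)]
    for perm in itertools.permutations(range(arity)):
        if all((f(*s) & MASK) == (g(*[s[i] for i in perm]) & MASK) for s in samples):
            return True
    return False


def block_similarity(plaintiff, suspicious, arity):
    """Fraction of plaintiff formulas matched by some suspicious formula.

    Both arguments map output-variable names to formulas; plaintiff must
    be non-empty.
    """
    matched = sum(
        any(equivalent(f, g, arity) for g in suspicious.values())
        for f in plaintiff.values()
    )
    return matched / len(plaintiff)


# Hypothetical formulas for two small basic blocks.
plaintiff = {"p": lambda a, b: a + b, "q": lambda a, b: a - b}
suspicious = {"s": lambda x, y: x + y, "t": lambda x, y: x - y}
score = block_similarity(plaintiff, suspicious, arity=2)
print(score)  # 1.0
```

A real implementation would hand the formulas to a solver so that a match is a proof rather than a statistical guess; the threshold applied to the score is a tuning parameter that this sketch leaves to the caller.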

11 An Example • Plaintiff block: p = a + b; q = a - b; • Suspicious block: s = x + y; t = x - y; • Obfuscated block: u = x + 10; v = y - 10; s = u + v; t = u - v; r = x + 1; • Symbolic execution (plaintiff): p = f1(a, b) = a + b; q = f2(a, b) = a - b; • Symbolic execution (suspicious): s = f3(x, y) = x + y; t = f4(x, y) = x - y; • Symbolic execution (obfuscated): u = f3(x) = x + 10; v = f4(y) = y - 10; s = f5(u, v) = u + v; t = f6(u, v) = u - v; r = f7(x) = x + 1; • Equivalence checks: a = x ∧ b = y ⟹ p = s; a = x ∧ b = y ⟹ q = t • Semantically equivalent basic blocks (100%)
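
To make the symbolic-execution step of this example concrete, here is a minimal Python sketch. It assumes straight-line blocks that use only `+` and `-`, and represents each variable as a linear form over the block inputs; real binary code needs a far richer symbolic engine. The helper names (`symbolic_execute`, `rename`) and the reserved key `"const"` are hypothetical.

```python
def _linear(operand, env):
    """Linear form of an operand: an int constant or a known variable."""
    if isinstance(operand, int):
        return {"const": operand}
    return env[operand]


def _combine(a, b, sign):
    """Return a + sign * b, dropping terms whose coefficient is zero."""
    out = dict(a)
    for key, coeff in b.items():
        out[key] = out.get(key, 0) + sign * coeff
    return {k: c for k, c in out.items() if c != 0}


def symbolic_execute(block, inputs):
    """Run (dest, left, op, right) assignments symbolically.

    Returns a map from each assigned variable to its linear form over
    the inputs. Inputs must not be named "const" (reserved here).
    """
    env = {name: {name: 1} for name in inputs}
    for dest, left, op, right in block:
        sign = 1 if op == "+" else -1
        env[dest] = _combine(_linear(left, env), _linear(right, env), sign)
    return {v: form for v, form in env.items() if v not in inputs}


def rename(form, mapping):
    """Apply an input correspondence such as a = x, b = y."""
    return {mapping.get(k, k): c for k, c in form.items()}


plaintiff = symbolic_execute([("p", "a", "+", "b")], inputs=["a", "b"])
obfuscated = symbolic_execute(
    [("u", "x", "+", 10), ("v", "y", "-", 10), ("s", "u", "+", "v")],
    inputs=["x", "y"],
)
# Under a = x and b = y, p and s have the same formula: x + y.
print(rename(plaintiff["p"], {"a": "x", "b": "y"}) == obfuscated["s"])  # True
```

The constants 10 and -10 cancel during symbolic execution, which is why the inserted instructions do not hide the relation between `s` and the block inputs.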

12 Examples of semantically equivalent basic blocks with very different instructions

13 Architecture Three levels: • Basic block level • Path level • Whole component level

14 Path similarity comparison (Figure: Suspicious and Plaintiff graphs, with blocks labeled S1, S2, S3, and Sp.) • Step 1: Starting blocks. • Step 2: Linearly independent paths. • Step 3: Computation of the longest common subsequence (LCS) of semantically equivalent basic blocks (SEBB).

15 Computing LCS of SEBB • Breadth-first search • LCS dynamic programming
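
The dynamic-programming half of this computation can be sketched as follows. The sketch assumes both paths are already linearized into block sequences, so it leaves out the breadth-first exploration of the suspicious program, and it takes block equivalence as a caller-supplied predicate. The path score (LCS length divided by plaintiff path length) is an assumption made for illustration; all names are hypothetical.

```python
def lcs_of_sebb(plaintiff_path, suspicious_path, is_equivalent):
    """Length of the longest common subsequence of two block sequences,
    where two blocks "match" when is_equivalent(p, s) is true."""
    n, m = len(plaintiff_path), len(suspicious_path)
    # table[i][j] = LCS length of the first i plaintiff blocks
    # and the first j suspicious blocks.
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if is_equivalent(plaintiff_path[i - 1], suspicious_path[j - 1]):
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[n][m]


def path_similarity(plaintiff_path, suspicious_path, is_equivalent):
    """Assumed score: matched blocks relative to plaintiff path length."""
    if not plaintiff_path:
        return 0.0
    return lcs_of_sebb(plaintiff_path, suspicious_path, is_equivalent) / len(plaintiff_path)


# Blocks are modeled as labels; equal labels stand for "semantically equivalent".
plaintiff_path = ["A", "B", "C", "D"]
suspicious_path = ["A", "X", "B", "D", "Y"]  # X and Y are inserted junk blocks
same = lambda p, s: p == s
print(lcs_of_sebb(plaintiff_path, suspicious_path, same))      # 3
print(path_similarity(plaintiff_path, suspicious_path, same))  # 0.75
```

Because a subsequence need not be contiguous, the inserted blocks `X` and `Y` do not break the match, which is the property that makes LCS tolerant of injected code.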

16 LCS Refinement • Merge unmatched basic blocks • Basic block splitting and merging • Basic block reordering • Conditional obfuscation

17 Evaluation • Obfuscation resiliency • Experiments: thttpd, openssl, and gzip • Different compiler optimization levels • Different compilers • Different code obfuscation techniques • Compared with MOSS, JPlag, Bdiff, and DarunGrim2 • Scalability • Gecko vs. Firefox

18 thttpd vs. sthttpd: Different compiler optimization levels • T0: thttpd -O0 • T2: thttpd -O2 • S0: sthttpd -O0 • S2: sthttpd -O2

19 thttpd vs. sthttpd: Different compilers • TG: thttpd GCC • TI: thttpd ICC • SG: sthttpd GCC • SI: sthttpd ICC

20 Code obfuscation resiliency testing • Source code obfuscation tools • Semantic Designs Inc.'s C obfuscator • Stunnix's CXX-obfuscator • Binary code obfuscation tools • Diablo • Loco • CIL: provides many useful source code transformation techniques.


22 thttpd vs. independent programs • To measure false positives, we tested our tool against four independently developed programs. • thttpd-2.25b • atphttpd-0.4b • boa-0.94.13 • lighttpd-1.4.30 • Very low similarity scores (below 2%) were reported.

23 Gecko vs. Firefox: Scalability • Gecko vs. Opera • Gecko vs. Chrome • CoP reported scores below 3% for all cases.

24 Summary • We propose a binary-oriented, obfuscation-resilient code similarity comparison approach, named CoP. • CoP is based on a new concept, Longest Common Subsequence (LCS) of Semantically Equivalent Basic Blocks (SEBB). • Our experimental results show that CoP is effective and practical when applied to real-world software.
