CS 590 Z Matching Program Versions Xiangyu Zhang

Problem Statement
  • Suppose a program P’ is created by modifying P. Determine the difference between P and P’. For an artifact c’ in P’, decide whether c’ belongs to the difference; if not, find the correspondence of c’ in P.
      - Static mapping
      - Non-trivial
          * Name comparison? What if …
          * Clone analysis, comparison checking

Motivations
  • Validate compiler transformations
  • Facilitate regression testing
  • Reverse obfuscation
  • Information propagation
  • Debugging
  • Code plagiarism detection
  • Information assurance

Approaches
  • Static approaches
      - Entity name based
      - String based (MOSS)
      - AST based (DECKARD)
      - CFG based (JDIFF)
      - PDG based (PDIFF)
      - Binary based (BMAT)
      - Log based (editor plugin, comparison checking)
  • Dynamic approaches (not today)

Static Approaches
  • Entity name matching
      - Model a function/field as tuples
      - Coarse-grained matching
  • String matching
      - Diff (CVS, Subversion)
      - Longest common subsequence (LCS)
          * Available operations are addition and deletion
          * Matched pairs cannot cross one another
          * Programs are far more complicated than strings: copy, paste, move
      - CP-Miner (scales to clone detection in the Linux kernel)
          * Frequent subsequence mining
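The LCS restriction that matched pairs cannot cross can be seen in a small sketch. This is a textbook dynamic-programming LCS over lines, not the actual diff implementation; the function name `lcs_pairs` and the sample lines are made up for illustration.

```python
def lcs_pairs(a, b):
    """Return index pairs (i, j) of one longest common subsequence of a and b."""
    n, m = len(a), len(b)
    # table[i][j] = LCS length of the suffixes a[i:] and b[j:]
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if a[i] == b[j]:
                table[i][j] = table[i + 1][j + 1] + 1
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])
    # Walk the table to recover the matched pairs in increasing order.
    pairs, i, j = [], 0, 0
    while i < n and j < m:
        if a[i] == b[j]:
            pairs.append((i, j))
            i += 1
            j += 1
        elif table[i + 1][j] >= table[i][j + 1]:
            i += 1
        else:
            j += 1
    return pairs


# Hypothetical versions: the first two lines were swapped in the new version.
old = ["int a = 1;", "int b = 2;", "foo(a);", "bar(b);"]
new = ["int b = 2;", "int a = 1;", "foo(a);", "bar(b);"]
matched = lcs_pairs(old, new)
print(matched)
```

Because pairs must be increasing in both versions, only one of the two swapped lines can be matched; the other shows up as a deletion plus an addition even though it was merely moved.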

MOSS
  • Code plagiarism detection
      - It also handles other digital content
  • Challenges
      - White space (variable names)
      - Noise (“the”, “int i”)
      - Order scrambling (paragraph reorders)
  • Problem statement
      - Given a set of documents, identify substring matches that satisfy two properties:
          * If there is a substring match at least as long as the guarantee threshold t, then this match is detected;
          * Do not detect any matches shorter than the noise threshold k.

MOSS
  • k-gram
      - A contiguous substring of length k
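As a minimal sketch of the definition, the following lists the k-grams of a text after a simple normalization step. The `normalize` filter (drop non-alphanumeric characters, lowercase the rest) is an assumption made for illustration, not necessarily the filter MOSS applies.

```python
def normalize(text):
    """Assumed, simplistic filter: keep letters and digits, lowercased."""
    return "".join(ch.lower() for ch in text if ch.isalnum())


def kgrams(text, k):
    """All contiguous substrings of length k, in order of position."""
    return [text[i:i + k] for i in range(len(text) - k + 1)]


grams = kgrams(normalize("A do run run run, a do run run"), 5)
print(len(grams), grams[:3])
```

A text of length n yields n - k + 1 k-grams, and none at all when n < k.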

MOSS
  • Incremental hashing
      - Hashing strings of length k is expensive for large k.
      - “Rolling” hash function
          * The (i+1)th k-gram hash = F(the ith k-gram hash, …)
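One common choice for F is a Karp-Rabin style polynomial hash, sketched below. The constants `BASE` and `MOD` are arbitrary assumptions for this example, and the slide does not say which hash MOSS uses.

```python
BASE = 257
MOD = (1 << 61) - 1  # a large prime modulus (assumed choice)


def direct_hash(s):
    """Polynomial hash of s computed from scratch in O(len(s))."""
    h = 0
    for ch in s:
        h = (h * BASE + ord(ch)) % MOD
    return h


def rolling_hashes(text, k):
    """Hashes of all k-grams of text, each derived from the previous one in O(1)."""
    if k <= 0 or len(text) < k:
        return []
    high = pow(BASE, k - 1, MOD)  # weight of the leading character
    h = direct_hash(text[:k])
    hashes = [h]
    for i in range(k, len(text)):
        h = (h - ord(text[i - k]) * high) % MOD  # drop the leading character
        h = (h * BASE + ord(text[i])) % MOD      # append the new character
        hashes.append(h)
    return hashes


text = "adorunrunrunadorunrun"
hs = rolling_hashes(text, 5)
```

Each update removes the contribution of the outgoing character and shifts in the incoming one, so the cost per k-gram no longer depends on k.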

MOSS
  • Fingerprint selection
      - A subset of hash values
      - Our goals: find all matching substrings of length at least t; ignore matches shorter than k
      - One of every t hash values
      - Hash values that are 0 mod p
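Reading the last two bullets as candidate selection rules, a small sketch shows how they behave when one hash is inserted at the front of a document. The hash values are made up, and the function names `every_tth` and `zero_mod_p` are hypothetical.

```python
def every_tth(hashes, t):
    """Keep the hashes at positions 0, t, 2t, ... (position dependent)."""
    return [h for i, h in enumerate(hashes) if i % t == 0]


def zero_mod_p(hashes, p):
    """Keep the hashes divisible by p (depends only on the hash value)."""
    return [h for h in hashes if h % p == 0]


# Made-up hash sequence for illustration.
hashes = [77, 74, 42, 17, 98, 50, 17, 98, 8, 88, 67, 39, 77, 74, 42, 17, 98]
shifted = [5] + hashes  # the same content after a one-element insertion

print(every_tth(hashes, 4), every_tth(shifted, 4))
print(zero_mod_p(hashes, 4), zero_mod_p(shifted, 4))
```

Position-based selection picks almost entirely different values after the shift, while value-based selection is unaffected. The 0 mod p rule, however, gives no bound on the gap between selected hashes (here nothing is selected before position 8), so a long match can be missed.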

MOSS
  • Winnowing
      - Observation: given a sequence of hashes h1, …, hn, if n > t-k, then at least one of the hi must be chosen
      - Use a sliding window of size w = t-k+1
      - In each window select the minimum hash value; break ties by selecting the rightmost occurrence.
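The window rule above can be sketched directly. The hash values are made up for illustration, and the quadratic rescan of each window keeps the code short rather than efficient.

```python
def winnow(hashes, w):
    """Return (position, hash) fingerprints chosen by sliding a window of size w."""
    fingerprints = []
    last_pos = -1
    for start in range(len(hashes) - w + 1):
        window = hashes[start:start + w]
        smallest = min(window)
        # Rightmost occurrence of the minimum inside this window.
        pos = start + (w - 1 - window[::-1].index(smallest))
        if pos != last_pos:  # record each selected position only once
            fingerprints.append((pos, smallest))
            last_pos = pos
    return fingerprints


# Assumed thresholds: noise threshold k = 5, guarantee threshold t = 8.
k, t = 5, 8
w = t - k + 1
hashes = [77, 74, 42, 17, 98, 50, 17, 98, 8, 88, 67, 39, 77, 74, 42, 17, 98]
selected = winnow(hashes, w)
print(selected)
```

Every window contributes a fingerprint, so any run of w consecutive hashes (a match of length at least t) contains at least one selected hash, while adjacent windows usually agree on the same minimum and keep the fingerprint set small. This sketch selects nothing when there are fewer than w hashes.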

MOSS
  • Algorithm
      - Build an index mapping fingerprints to locations for all documents.
      - Each document is fingerprinted a second time and the selected fingerprints are looked up in the index; this gives the list of all matching fingerprints for each document.
      - Sort (d, d1, fx), (d, d2, fy) by the first two elements.
      - Matches between documents are rank-ordered by size (number of fingerprints).
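A minimal sketch of these steps, assuming the fingerprints of each document have already been computed as (position, hash) pairs. The document names, fingerprint values, and the function name `rank_matches` are hypothetical.

```python
from collections import defaultdict


def rank_matches(fingerprints_by_doc):
    """Rank document pairs by the number of distinct fingerprints they share."""
    # Step 1: index every fingerprint hash to its (document, position) locations.
    index = defaultdict(list)
    for doc, fps in fingerprints_by_doc.items():
        for pos, h in fps:
            index[h].append((doc, pos))
    # Step 2: look each document's fingerprints up in the index.
    triples = set()
    for doc, fps in fingerprints_by_doc.items():
        for _pos, h in fps:
            for other, _other_pos in index[h]:
                if other != doc:
                    triples.add((doc, other, h))
    # Step 3: sort the triples by document pair and count them per pair.
    shared = defaultdict(int)
    for doc, other, _h in sorted(triples):
        if doc < other:  # count each unordered pair once
            shared[(doc, other)] += 1
    # Step 4: rank pairs by size, largest first.
    return sorted(shared.items(), key=lambda item: (-item[1], item[0]))


docs = {
    "a.c": [(3, 17), (6, 17), (8, 8), (11, 39)],
    "b.c": [(0, 8), (2, 39), (5, 17)],
    "c.c": [(1, 39), (4, 55)],
}
ranking = rank_matches(docs)
print(ranking)
```

The positions are stored in the index but unused here; a fuller version would use them to report where in each document the matching regions lie.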

MOSS
  • Advantages
      - Guaranteed to detect any substring match of length at least t
  • Limitations
      - Minor edits defeat MOSS
          * x = a*b + c vs. z = c + a*b
      - Insertion, deletion
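The limitation example can be checked by comparing k-gram sets of the two statements. Whitespace removal is the only normalization assumed here, and the values of k are arbitrary choices for this sketch.

```python
def kgram_set(text, k):
    """Set of k-grams of text after removing all whitespace."""
    s = "".join(text.split())
    return {s[i:i + k] for i in range(len(s) - k + 1)}


a = "x= a*b + c"
b = "z= c + a*b"
shared3 = kgram_set(a, 3) & kgram_set(b, 3)
shared5 = kgram_set(a, 5) & kgram_set(b, 5)
print(shared3, shared5)
```

With k = 3 the two statements share only the k-gram for the product, and with k = 5 they share nothing, so no fingerprint can link them even though they compute the same value.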

AST based matching
  • [YANG, 1991, Software Practice and Experience]
      - Given two functions, build the ASTs
      - Match the roots
      - If so, apply LCS to align subtrees
      - Continue recursively
  • Fragile
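The recursive scheme can be sketched on toy trees written as (label, children) tuples. This is a simplified illustration of root matching followed by an order-preserving alignment of subtrees, not the published algorithm's exact formulation; the tree encoding and the name `match_count` are assumptions.

```python
def match_count(t1, t2):
    """Number of node pairs matched when aligning trees t1 and t2 top-down."""
    label1, kids1 = t1
    label2, kids2 = t2
    if label1 != label2:  # the roots must match first
        return 0
    n, m = len(kids1), len(kids2)
    # LCS-style alignment of the child lists, weighted by recursive matches.
    best = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best[i][j] = max(
                best[i - 1][j],
                best[i][j - 1],
                best[i - 1][j - 1] + match_count(kids1[i - 1], kids2[j - 1]),
            )
    return 1 + best[n][m]


def leaf(name):
    return (name, [])


product = ("*", [leaf("a"), leaf("b")])
t1 = ("=", [leaf("x"), ("+", [product, leaf("c")])])  # x = a*b + c
t2 = ("=", [leaf("z"), ("+", [leaf("c"), product])])  # z = c + a*b

print(match_count(t1, t1), match_count(t1, t2))
```

Because the alignment of children preserves order, swapping the operands of `+` lets only one of them match, which is one way to read the "Fragile" remark.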

DECKARD (ICSE 2007)

DECKARD
  • Advantages
      - Scalability
      - Insensitive to minor structural changes such as reordering, insertion, deletion
  • Limitations
      - Structural similarity only
      - Insertion that incurs structure change

CFG matching
  • Hammock graph (JDIFF, ASE 2004)
      - Match classes by names
      - Match fields by types
      - Match methods by signatures
      - Match instructions in methods by hammock graphs
          * A hammock is a single-entry, single-exit subgraph of a CFG.

CFG matching
  • Pros
      - Orthogonal
          * Can be combined with other matching techniques
      - Simple
  • Cons
      - Coarse-grained matching only
          * Not good at clone detection
      - In case of code transformation

Semantic Based Matching
  • Using PDG (SAS ’01)

Semantic Based
  • Pros
      - Non-contiguous, intertwined, reordered
      - Insensitive to code transformations
  • Cons
      - Scalability
          * Points-to analysis
      - Starting from a matching pair seems to be a problem

Wrap Up
  • For clone detection
      - Maybe structural / text similarity is a good idea
  • For whole program matching / method matching with code transformations
      - Semantic based is more appropriate
  • Scalability
      - PDG < CFG | AST < STRING < NAME