On Detection of Gapped Code Clones using Gap Locations







































- Slides: 40
 
On Detection of Gapped Code Clones using Gap Locations
Yasushi Ueda†, Toshihiro Kamiya‡, Shinji Kusumoto†, and Katsuro Inoue†
†Graduate School of Information Science and Technology, Osaka University, Japan
{y-ueda, y-higo, kusumoto, inoue}@ist.osaka-u.ac.jp
‡PRESTO, Japan Science and Technology Corporation, Japan
kamiya@ist.osaka-u.ac.jp
APSEC 2002
 
Contents
- Background
- Research goals
- Gapped code clone detection
- Case study
- Conclusions and future works
 
Background (1/2)
- A code clone is a pair or set of code portions in source files that are identical or similar to each other.
 
Background (2/2)
- Code clones are one of the factors that make software maintenance more difficult.
  - If a fault is found in a code portion, it must also be corrected in all of its clones.
- We have developed a code clone detection tool, CCFinder [1], and its analysis tool, Gemini [2].
  - CCFinder
    - Token-based clone detector
    - The input is a set of source files, and the output (text-based) is the locations of clone pairs.
  - Gemini
    - GUI-based clone analysis environment
    - Uses CCFinder as a clone detector.

[1] T. Kamiya, S. Kusumoto, and K. Inoue, "CCFinder: A multi-linguistic token-based code clone detection system for large scale source code", IEEE Transactions on Software Engineering, 28(7):654-670, 2002.
[2] Y. Ueda, T. Kamiya, S. Kusumoto, and K. Inoue, "Gemini: Maintenance Support Environment Based on Code Clone Analysis", Proc. of the 8th IEEE International Symposium on Software Metrics, 67-76, 2002.
 
CCFinder/Gemini (1/4)
- Example of clone detection process

    static void foo() throws RESyntaxException {
        String a[] = new String [] { "123,400", "abc", "orange 100" };
        org.apache.regexp.RE pat = new org.apache.regexp.RE("[0-9,]+");
        int sum = 0;
        for (int i = 0; i < a.length; ++i)
            if (pat.match(a[i]))
                sum += Sample.parseNumber(pat.getParen(0));
        System.out.println("sum = " + sum);
    }
    static void goo(String [] a) throws RESyntaxException {
        RE exp = new RE("[0-9,]+");
        int sum = 0;
        for (int i = 0; i < a.length; ++i)
            if (exp.match(a[i]))
                sum += parseNumber(exp.getParen(0));
        System.out.println("sum = " + sum);
    }

- Process: Source files → Lexical analysis → Token sequence → Transformation → Transformed token sequence → Match detection → Clones on transformed sequence → Formatting → Clone pairs
 
CCFinder/Gemini (2/4)
- Gemini overview
  - A GUI-based code clone analysis tool
  - Uses CCFinder as a code clone detector.
  - Has several views for interactive analysis.
    - Scatter plot view: select clones by mouse dragging
    - Metric graph view: select clones by the value of clone metrics
    - Source code view
[Figure: scatter plot of two token sequences (a, b, c, ... : tokens) with marks at matched positions]
 
CCFinder/Gemini (3/4)
- Classification of code clones (code reused by 'copy-and-paste')
  - Exact clone
  - Renamed clone
  - Gapped clone
- Example, starting from the fragment: if (a > b) { b++; a=1; }
  - Non-gapped clones
    - Exact clone: if (a > b) { b++; a=1; }
    - Renamed clone: if (i > j) { j++; i=0; }
  - Gapped clones (renamed, with gaps)
    - Inserted: if (i > j) { i = i / 2; j++; i=0; }
    - Modified: if (i > j) { j = j + 1; i=0; }
    - Deleted: if (i > j) { i=0; }
 
CCFinder/Gemini (4/4)
- Needs of gapped clone detection
  - CCFinder can detect non-gapped clones.
  - A gapped clone is detected separately as several short non-gapped clones.
    - If a matched portion is too short, CCFinder does not identify it as a clone, because the minimum length of clones to be detected must be set in CCFinder beforehand.
    - Generally, if the minimum length is set to a small value, too many clones are detected.
- Example: the matched portions between foo() and goo() are 14, 27, and 13 tokens long. Set the min. length to 20 tokens…

    static void foo() throws RESyntaxException
    {
        String a[] = new String [] {"123,400", "abc"};
        org.apache.regexp.RE pat =
            new org.apache.regexp.RE("[0-9,]+");
        int sum = 0;
        for (int i = 0; i < a.length; ++i)
            if (pat.match(a[i])){
                sum += Sample.parseNumber(pat.getParen(0)); }
        System.out.println("sum = " + sum);
    }

    static void goo(String [] a) throws RESyntaxException
    {
        RE exp = new RE("[0-9,]+");
        int sum = 0;
        int i = 0;
        while (i < a.length)
        {
            if (exp.match(a[i]))
                sum += parseNumber(exp.getParen(0));
            i++;
        }
        System.out.println("sum " + sum);
    }

- Clones longer than 30 tokens: the number of clone pairs is 1208.
- Clones longer than 10 tokens: the number of clone pairs is 26984.
 
Research goals
- Propose a method to efficiently detect gapped clones.
- Conduct a case study to evaluate the method.
 
Gapped code clone detection - Overview (1/2)
- Major premise
  - Treat the problem of detecting gapped clones as a problem of combining non-gapped clones.
- Combination explosion of non-gapped clones
  - If there are many overlapping or overcrowded non-gapped clones, identification of gapped clones causes a combination explosion, because one non-gapped clone may have many other non-gapped clones to be combined into a gapped clone.
  - Takes a long time for computation.
[Figure: example of the number of combinations of non-gapped clones (figure values: 105, 3, 15)]
 
Gapped code clone detection - Overview (2/2)
- Approach
  - Man-machine collaboration
    - Extract concatenated subsets (entanglements) from all of the non-gapped clones.
    - Visualize the entanglements on a scatter plot.
    - Users can see the locations where gapped clones possibly exist and interactively pick one of them to find gapped clones in it.
- Detecting process
  - Step 1: Non-gapped clone detection
  - Step 2: Gap identification
  - Step 3: Visualization
  - Step 4: Source code investigation
- Process: Source files → Non-gapped clone detection → Non-gapped clones → Gap identification → Gaps, Correspondences → Visualization → Gap-and-clone scatter plot → Source code investigation
 
Gapped code clone detection - Detecting process
- Sample input
  - Code sequence of source file X: "ABCDCDEFBCDG"
  - Code sequence of source file Y: "ABCEFBCDEBCD"
    - "A", "B", "C", … are code portions in a certain unit.
[Figure: empty scatter plot with file X on the vertical axis and file Y on the horizontal axis, next to the detecting process diagram]
 
Gapped code clone detection - Detecting process
[Figure: detecting process diagram; the inputs are the source files and the upper limit of gap length, which is used in gap identification]
 
Gapped code clone detection - Detecting process
[Figure: scatter plot of file X (A B C D C D E F B C D G) against file Y (A B C E F B C D E B C D), next to the detecting process diagram]
 
 
Gapped code clone detection - Implementation
- CCFinder is used as the non-gapped clone detection tool.
- Gemini, a GUI maintenance support tool, is extended.
  - On the gap-and-clone scatter plot view implemented in Gemini, the user can select non-gapped clones by mouse dragging and refer to the actual source code.
[Figure: gap-and-clone scatter plot view with an entanglement marked]
 
Case study overview
- Application target
  - Programs developed in a programming exercise at Osaka University
    - A compiler written in C
    - Consists of three steps (sub-exercises):
      - Step 1 (Ex. 1): Making a syntax checker
      - Step 2 (Ex. 2): Making a semantic checker
      - Step 3 (Ex. 3): Making a compiler
    - In Ex. 2 and Ex. 3, the programs also had to be developed by reusing the code of the previous programs.
    - Programs of 69 students
    - Total size is 360,000 lines of code
- Issues of analysis
  - Types of gapped clones found in the gap-and-clone scatter plot
  - Usefulness of the gap-and-clone scatter plot
 
Analysis - Type of gapped clone found in gap-and-clone scatter plot
- Compare three versions of a function "sentence()" in Ex. 1, Ex. 2 and Ex. 3 of a certain student.
[Figure: the three versions of sentence() (Ex. 1, Ex. 2, Ex. 3) shown side by side, with matched regions labeled A and B (B1 to B4) and annotated with their sizes in tokens, together with the threshold settings: the minimum size of non-gapped clones, the maximum size of gaps, and the minimum size of entanglements]
 
Conclusions and future works
- A method to show gapped clones based on the information of gap locations was proposed and implemented.
- A case study was conducted.
  - As a result, we successfully found gapped clones that are composed of several short clones, each of which is too short to be detected individually.
- Since we only show gapped clones and have no mechanism to evaluate the characteristics of each gapped clone quantitatively, we plan to examine a method to extract each of them efficiently as future work.
 
 
The web page of CCFinder/Gemini is available at http://sel.ist.osaka-u.ac.jp/cdtools/index.html.en
 
Application of CCFinder/Gemini
- Free software
  - JDK libraries (Java, 570 KLOC)
  - Linux, FreeBSD (C, 1.6 + 1.3 MLOC)
  - FreeBSD, OpenBSD, NetBSD (C)
  - Qt (C++, 240 KLOC)
- Commercial software
  - NTT Data Corp., Hitachi Ltd., Hitachi GP Ltd., NEC soft Ltd., ASTEC Inc., SRA Inc., NASDA, etc.
- Student exercises at Osaka University
- Filed in court as evidence in a software copyright suit.
 
Differences between our method and homology analysis in genome informatics
- Alignment analysis
  - Dynamic programming: O(mn) (m, n: lengths of the sequences)
  - The optimal alignment is not our interest.
- Homology search
  - BLAST, FASTA
  - We have no query sequence for the search and want to detect all gapped clones.
Related work
- Baxter et al. [3]
  - Extract clone pairs of statements, declarations, or sequences of them from C source files.
  - Parse source code to build an abstract syntax tree (AST) and compare its subtrees by characterization metrics (hash functions).
  - Its computational complexity is O(n), where n is the number of subtrees of the source files.
  - The hash function enables parameterized matching, detection of gapped clones, and identification of clones of code portions in which some statements are reordered.

[3] I. D. Baxter, A. Yahin, L. Moura, M. Sant'Anna, and L. Bier, "Clone Detection Using Abstract Syntax Trees," Proc. of ICSM '98, pp. 368-377, Bethesda, Maryland, 1998.
 
Computation cost of our method
- Non-gapped clone detection (in CCFinder): O(n + m)
  - n: length of source code
  - m: number of non-gapped clones
- Gap identification: O(m)
  - Identification of the gaps combined with each non-gapped clone: O(1)
- Total: O(n + m)
 
The difference between 'diff' and clone detection tools
- diff finds the longest common subsequence.
  - Given a code portion, diff does not report two or more identical code portions (clones).
- A clone detection tool finds all identical or similar code portions.
 
Snapshots of clone class metric graph
[Figure: metric graph with axes RAD, LEN, POP, and DFL; filtering mode: ON]
 
Clone class metrics
- LEN(C): Length of the token sequence of each element in clone class C
- POP(C): Number of elements in clone class C
- DFL(C): Estimate of how many tokens would be removed from the source files if all code fragments of clone class C were replaced with caller statements of a new identical routine
- RAD(C): Distribution in the file system of the elements in clone class C
[Figure: code fragments replaced by caller statements of a new subroutine]
 
Definitions of DFL and RAD
- DFL(C)
  - DFL(C) = LEN(C) × POP(C) - (5 × POP(C) + LEN(C))
    - LEN(C) × POP(C): the target code size for restructuring
    - 5 × POP(C): the code size of the new caller statements
    - LEN(C): the code size of the new identical routine
- RAD(C)
  - Distribution in the file system of the elements in clone class C
    - RAD(C) = 0: C is enclosed within a single file.
    - RAD(C) = 1: C is enclosed within a single directory.
    - RAD(C) = n: C is enclosed within a directory tree of n layers.
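
The DFL definition can be checked with a few lines of Python. This is only a sketch: it reads the last two terms (caller statements and the new routine) as the size of the restructured code, which is subtracted from the target code size. The function name `dfl` and the parameter `call_cost` are hypothetical; the value 5 is the caller-statement size used on the slide.

```python
def dfl(length, pop, call_cost=5):
    """Estimated number of tokens removed by merging a clone class.

    length    -- LEN(C), token length of each element of the clone class
    pop       -- POP(C), number of elements in the clone class
    call_cost -- assumed token size of one caller statement (5 on the slide)
    """
    old_size = length * pop                 # target code size for restructuring
    new_size = call_cost * pop + length     # caller statements + new routine
    return old_size - new_size


# A clone class of three 35-token fragments:
print(dfl(35, 3))   # 105 - (15 + 35) = 55
# Two short 10-token fragments give no saving:
print(dfl(10, 2))   # 20 - (10 + 10) = 0
```

A clone class is only worth merging under this estimate when the result is clearly positive, which is why high DFL is used as a refactoring indicator.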
 
Analysis using clone class metrics
- Example of analysis issues
  - Finding clones that are appropriate for refactoring
    - Clones having high DFL
    - Clones having high POP and low RAD
      - It may be easy and meaningful to merge clones into one routine because of their density.
  - Finding portions that are not reliable
    - Clones having high LEN
      - Modules having larger code clones are less maintainable than modules having smaller code clones [4].

[4] Akito Monden, Daikai Nakae, Toshihiro Kamiya, Shin-ichi Sato, and Ken-ichi Matsumoto, "Software Quality Analysis by Code Clones in Industrial Legacy Software", Proc. of the 8th IEEE International Symposium on Software Metrics, 87-96, 2002.
 
Suffix tree
- A suffix tree is a tree that satisfies the following conditions.
  1. A leaf node represents the starting position of a substring.
  2. A path from the root node to a leaf node represents a substring.
  3. The first characters of the labels of all the edges from one node are different from each other.
- → A common path means a clone.
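
The idea that a common path corresponds to a clone can be illustrated without building a full suffix tree. The sketch below swaps in a simpler technique: it sorts the suffixes and compares neighbours, which finds the same longest shared prefix that the deepest common path of a suffix tree would represent. It is quadratic in the worst case and is not how CCFinder is implemented; the function name `longest_repeat` is hypothetical.

```python
def longest_repeat(tokens):
    """Longest substring that occurs at least twice in `tokens`.

    Returns (length, first_start, second_start) with 0-based start positions.
    Sorted suffixes stand in for the suffix tree: two suffixes that share a
    path from the root are neighbours after sorting.
    """
    suffixes = sorted(range(len(tokens)), key=lambda i: tokens[i:])
    best = (0, 0, 0)
    for a, b in zip(suffixes, suffixes[1:]):
        # Length of the common prefix of two neighbouring suffixes.
        k = 0
        while (a + k < len(tokens) and b + k < len(tokens)
               and tokens[a + k] == tokens[b + k]):
            k += 1
        if k > best[0]:
            best = (k, min(a, b), max(a, b))
    return best


length, first, second = longest_repeat("ABCDCDEFBCDG")
print(length, first, second, "ABCDCDEFBCDG"[first:first + length])
```

For the sequence "ABCDCDEFBCDG" the longest repeated run is "BCD". The two occurrences found this way may overlap in general (for example in "AAAA").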
 
Example of transformation rules in Java
- All identifiers defined by the user are transformed to the same token.
- A unique identifier is inserted at each end of the top-level definitions and declarations.
  - Prevents detecting clones that begin in the middle of one class definition and end in the middle of another.
- "java.lang.Math.PI" is transformed to "Math.PI".
  - With an import statement, a class can be referred to by either its full package name or a shorter name.
- "new int[] {1, 2, 3}" is transformed to "new int[] {$}".
  - Eliminates table initialization code.
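
The first rule is what makes renamed clones match. A rough sketch of that single rule, assuming a toy regular-expression tokenizer and a small hand-picked keyword set (neither is CCFinder's actual lexer), and additionally assuming that numeric literals are replaced by the same token:

```python
import re

# Hypothetical, deliberately small keyword list for the illustration.
KEYWORDS = {"if", "else", "for", "while", "int", "return", "new", "static", "void"}
TOKEN = re.compile(r"[A-Za-z_]\w*|\d+|\+\+|--|[<>=!]=|\S")


def normalize(code):
    """Tokenize `code` and replace user identifiers and numbers with '$'."""
    out = []
    for tok in TOKEN.findall(code):
        if tok in KEYWORDS:
            out.append(tok)          # language keywords are kept
        elif tok[0].isalpha() or tok[0] == "_":
            out.append("$")          # user-defined identifier -> same token
        elif tok[0].isdigit():
            out.append("$")          # assumption: literals are parameterized too
        else:
            out.append(tok)          # operators and punctuation are kept
    return out


original = "if (a > b) { b++; a=1; }"
renamed = "if (i > j) { j++; i=0; }"
inserted = "if (i > j) { i = i / 2; j++; i=0; }"

print(normalize(original) == normalize(renamed))    # renamed clone matches
print(normalize(original) == normalize(inserted))   # gapped clone does not
```

After normalization the renamed fragment is token-for-token identical to the original, while the fragment with an inserted statement is not, which is the case the gapped clone detection addresses.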
 
The output of CCFinder
- Output of CCFinder

    #version: ccfinder 3.1
    #langspec: JAVA
    #option: -b 30,1
    #option: -k +
    #option: -r abcdfikmnprsv
    #option: -c wfg
    #begin{file description}
    0.0 52 C:Gemini.java
    0.1 94 C:GeneralManager.java
    :
    #end{file description}
    #begin{clone}
    0.1 53,9 63,13 1.10 542,9 553,13 35
    0.1 53,9 63,13 1.10 624,9 633,13 35
    0.2 124,9 152,31 0.2 154,9 216,51 42
    :
    #end{clone}

  - File description: the first field is the object file ID (0.0 is file 0 in group 0).
  - Clone section: each line is the location of a clone pair (in the first line, lines 53 - 63 in file 0.1 and lines 542 - 553 in file 1.10 are identical or similar to each other).
- It is difficult to analyze source code using only this text-based information about the locations of clone pairs.
 
Gapped code clone detection - Algorithm (1/5)
- Step 1: Non-gapped clone detection
  - Detect non-gapped clones from the input source files.
    - Set the minimum length of a clone (threshold 1).
  - Sort the list of detected non-gapped clones for efficient identification of gap locations in Step 2.
    - A clone pair that appears earlier in the file also appears earlier in the sorted list.
    - When the detection result covers three or more files, the set of non-gapped clones can be divided into subsets, one for each combination of two files.

| Non-gapped clone ID | Pos. in file X (ABCDCDEFBCDG) | Pos. in file Y (ABCEFBCDEBCD) | Matched subsequence |
|---|---|---|---|
| c1 | 1-3 | 1-3 | "ABC" |
| c2 | 2-4 | 6-8 | "BCD" |
| c3 | 2-4 | 10-12 | "BCD" |
| c4 | 5-5 | 3-3 | "C" |
| c5 | 5-6 | 11-12 | "CD" |
| c6 | 5-7 | 7-9 | "CDE" |
| c7 | 7-11 | 4-8 | "EFBCD" |
| c8 | 9-10 | 2-3 | "BC" |
| c9 | 9-11 | 10-12 | "BCD" |
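
For the two sample sequences, Step 1 can be sketched as below. This is a brute-force comparison of every position pair, used here only because the sample is tiny; it is not CCFinder's detection algorithm. The function name `non_gapped_clones` and the sort key (start in X, then end in X, then start in Y) are assumptions chosen for the illustration.

```python
def non_gapped_clones(x, y, min_len=1):
    """Maximal common runs of x and y.

    Each clone is ((x_start, x_end), (y_start, y_end), text) with 1-based,
    inclusive positions. `min_len` plays the role of threshold 1.
    """
    clones = []
    for i in range(len(x)):
        for j in range(len(y)):
            if x[i] != y[j]:
                continue
            # Skip positions that only continue a run started earlier.
            if i > 0 and j > 0 and x[i - 1] == y[j - 1]:
                continue
            k = 0
            while i + k < len(x) and j + k < len(y) and x[i + k] == y[j + k]:
                k += 1
            if k >= min_len:
                clones.append(((i + 1, i + k), (j + 1, j + k), x[i:i + k]))
    # Assumed ordering: earlier in file X first.
    clones.sort(key=lambda c: (c[0][0], c[0][1], c[1][0]))
    return clones


X = "ABCDCDEFBCDG"
Y = "ABCEFBCDEBCD"
for n, (px, py, text) in enumerate(non_gapped_clones(X, Y), start=1):
    print(f"c{n}: X {px[0]}-{px[1]}, Y {py[0]}-{py[1]}, {text!r}")
```

Raising `min_len` to 2 drops the single-token match "C", which shows in miniature how threshold 1 trades completeness against the number of reported clones.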
 
Gapped code clone detection - Algorithm (2/5)
- Step 2: Gap identification
  - Generate gap locations from the sorted list of non-gapped clones.
    - A gap location is a kind of combination of two non-gapped clones.
      - (c1, c6) = ((1-3, 1-3), (5-7, 7-9)) gives g1 = (4-4, 4-6)
    - The length of each gap is the length of the longer unmatched subsequence.
    - Set the upper limit of the length of each gap (threshold 2).
    - Use the following facts for optimization:
      - Non-gapped clones are stored in sorted order.
      - The number of gaps connected from each non-gapped clone can be bounded by a certain constant.
  - The overall time complexity of Step 2 is O(n) (n: number of non-gapped clones).

| Gap ID | Pos. in file X (ABCDCDEFBCDG) | Pos. in file Y (ABCEFBCDEBCD) | Length (longer side) |
|---|---|---|---|
| g1 | 4-4 | 4-6 | 3 |
| g2 | 4-4 | 4-10 | 7 |
| g3 | 4-6 | (none) | 3 |
| g4 | 4-8 | 4-9 | 6 |
| g5 | (none) | 9-10 | 2 |
| g6 | 5-8 | 9-9 | 4 |
| g7 | 8-8 | (none) | 1 |
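
A gap location can be computed directly from two non-gapped clones. The sketch below assumes clones are given as 1-based inclusive position pairs and takes "CDE" at X 5-7 and Y 7-9. It checks every ordered pair, so it does not implement the sorted-list optimization that gives the O(n) bound; the names `gap_between` and `find_gaps` are hypothetical.

```python
def gap_between(a, b):
    """Gap between clone a and a later clone b.

    Clones are ((x_start, x_end), (y_start, y_end)). Returns
    (gap_in_x, gap_in_y, length), where an empty side is None and the
    length is that of the longer unmatched side, or None if b does not
    start after a ends in both files.
    """
    (ax, ay), (bx, by) = a, b
    if bx[0] <= ax[1] or by[0] <= ay[1]:
        return None

    def side(prev_end, next_start):
        if next_start - prev_end > 1:
            return (prev_end + 1, next_start - 1)
        return None

    len_x = bx[0] - ax[1] - 1
    len_y = by[0] - ay[1] - 1
    return side(ax[1], bx[0]), side(ay[1], by[0]), max(len_x, len_y)


def find_gaps(clones, max_gap):
    """All gaps of length 1..max_gap (threshold 2) between sorted clones."""
    gaps = []
    for i, a in enumerate(clones):
        for j in range(i + 1, len(clones)):
            g = gap_between(a, clones[j])
            if g is not None and 0 < g[2] <= max_gap:
                gaps.append((i, j, g))
    return gaps


abc = ((1, 3), (1, 3))        # "ABC"
cde = ((5, 7), (7, 9))        # "CDE"
efbcd = ((7, 11), (4, 8))     # "EFBCD"
bcd = ((9, 11), (10, 12))     # "BCD"

print(gap_between(abc, cde))
for i, j, g in find_gaps([abc, cde, efbcd, bcd], max_gap=3):
    print(i, j, g)
```

With an upper limit of 3, the pair "ABC" and "BCD" is rejected because its gap has length 6, which is how threshold 2 keeps the number of gaps per clone small.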
 
Gapped code clone detection - Algorithm (3/5)
- Step 3-1: Visualization - gap-and-clone scatter plot
  - Draw gaps on the scatter plot of non-gapped clones to visualize gapped clones in a pseudo way.
[Figure: gap-and-clone scatter plot of file X (vertical axis, positions 1-12: A B C D C D E F B C D G) against file Y (horizontal axis, positions 1-12: A B C E F B C D E B C D), showing non-gapped clones c1 to c9 and gaps g1, g3, g5, and g7]

| Gapped clone ID | Path | Subsequence in file X (ABCDCDEFBCDG) | Subsequence in file Y (ABCEFBCDEBCD) |
|---|---|---|---|
| gc1 | c1 g1 c6 g7 c9 | "ABC-CDE-BCD" | "ABC---CDEBCD" |
| gc2 | c1 g3 c7 | "ABC---EFBCD" | "ABCEFBCD" |
| gc3 | c2 g5 c5 | "BCDCD" | "BCD--CD" |
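
The dashed subsequences in the table can be produced mechanically from the positions of the chained clones. A minimal sketch, assuming 1-based inclusive positions and one "-" per skipped position; the helper name `render` is hypothetical:

```python
def render(seq, ranges):
    """Clone parts of `seq`, with '-' for every position skipped between them.

    `ranges` lists the (start, end) positions of the chained non-gapped
    clones in one file, in order.
    """
    out = []
    prev_end = None
    for start, end in ranges:
        if prev_end is not None:
            out.append("-" * (start - prev_end - 1))   # the gap in this file
        out.append(seq[start - 1:end])                 # the matched part
        prev_end = end
    return "".join(out)


X = "ABCDCDEFBCDG"
Y = "ABCEFBCDEBCD"

# Chain "ABC", "CDE", "BCD":
print(render(X, [(1, 3), (5, 7), (9, 11)]))
print(render(Y, [(1, 3), (7, 9), (10, 12)]))
# Chain "ABC", "EFBCD":
print(render(X, [(1, 3), (7, 11)]))
print(render(Y, [(1, 3), (4, 8)]))
```

A gap that is empty in one file simply contributes no dashes on that side, so the two renderings of the same gapped clone can differ in length.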
 
Gapped code clone detection - Algorithm (4/5)
- Step 3-2: Visualization - filtering
  - Remove non-gapped clones and gaps that do not contribute to a long gapped clone.
    - Introduce the length of each entanglement ("eSize") of non-gapped clones and gaps.
      - eSize = max(eSizeX, eSizeY)
      - eSizeX = eEndX - eStartX
      - eSizeY = eEndY - eStartY
    - "eSize" means the maximum length of a gapped clone included in the entanglement.
    - Set the minimum "eSize" for display (threshold 3).
[Figure: gap-and-clone scatter plot of file X against file Y with clones c1 to c9 and gaps g1, g3, g5, and g7]
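
The filtering rule can be sketched as follows. The sketch assumes that an entanglement is given as the list of its non-gapped clones, and that eStart and eEnd are the smallest start and the largest end position among them in each file; the slide does not spell this out, so treat it as one possible reading. The names `e_size` and `filter_entanglements` are hypothetical.

```python
def e_size(entanglement):
    """eSize = max(eEndX - eStartX, eEndY - eStartY) for one entanglement.

    `entanglement` is a list of clones ((x_start, x_end), (y_start, y_end)).
    """
    e_start_x = min(c[0][0] for c in entanglement)
    e_end_x = max(c[0][1] for c in entanglement)
    e_start_y = min(c[1][0] for c in entanglement)
    e_end_y = max(c[1][1] for c in entanglement)
    return max(e_end_x - e_start_x, e_end_y - e_start_y)


def filter_entanglements(entanglements, min_e_size):
    """Keep only entanglements whose eSize reaches threshold 3."""
    return [e for e in entanglements if e_size(e) >= min_e_size]


chain = [((1, 3), (1, 3)), ((5, 7), (7, 9)), ((9, 11), (10, 12))]  # ABC, CDE, BCD
isolated = [((9, 10), (2, 3))]                                       # BC alone

print(e_size(chain), e_size(isolated))
print(len(filter_entanglements([chain, isolated], min_e_size=5)))
```

The isolated short clone is filtered out while the chain survives, which is the effect the filtering step is meant to have on the scatter plot.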
 
Gapped code clone detection - Algorithm (5/5)
- Step 4: Source code investigation
  - Investigate source files with the gap-and-clone scatter plot.
  - Change parameters.
    - Threshold 1: Minimum size of non-gapped clones in non-gapped clone detection
    - Threshold 2: Maximum size of gaps in identification of gap locations
    - Threshold 3: Minimum size of an entanglement of non-gapped clones and gaps in the gap-and-clone scatter plot
    - Threshold 1 and threshold 2 greatly affect computation time.
      - A small threshold 1 causes O(m^2) non-gapped clone pairs to be detected from source code of size m.
      - A large threshold 2 causes O(n^2) gaps to be detected from n clone pairs.
 
Analysis - Usefulness of gap-and-clone scatter plot
- Compared the scatter plots of non-gapped clones to the gap-and-clone scatter plot.
  - Three programs (Ex. 1: 2267 tokens, Ex. 2: 4394 tokens, and Ex. 3: 5738 tokens) of a student S are arranged on both the vertical and horizontal axes.
  - The grid lines represent the boundaries between sub-exercises.
[Figure: three scatter plots over Ex. 1, Ex. 2, and Ex. 3: non-gapped clones with threshold 1 = 10; non-gapped clones with threshold 1 = 30; gap-and-clone scatter plot with threshold 1 = 10, threshold 2 = 10, threshold 3 = 30. Histograms show the frequency of non-gapped clones by length in tokens; annotation: "Shown up as long gapped clones"]
 
The analysis of comparison among students (non-gapped clones only)
- The corresponding code
  - A (2 students)
    - Similar code fragments came from the source code of a sample compiler described in a textbook.
  - B (4 students)
    - Many code fragments were similar even with respect to variable names and comments.
[Figure: scatter plot with regions A and B marked]
