Gemini: Maintenance Support Environment Based on Code Clone Analysis

Yasushi Ueda*, Toshihiro Kamiya**, Shinji Kusumoto*** and Katsuro Inoue***
*Graduate School of Engineering Science, Osaka Univ.
**PRESTO, Japan Science and Technology Corp.
***Graduate School of Information Science and Technology, Osaka Univ.
y-ueda@ics.es.osaka-u.ac.jp
{kamiya, kusumoto, inoue}@ist.osaka-u.ac.jp

Contents
- Background
- Maintenance support environment, Gemini
- Overview
- System structure
- Scatter plot
- Case study
- Conclusions

Background (1/2)
- A code clone is a pair or set of code portions in source files that are identical or similar to each other.
[Figure: code fragments labeled as clone pairs and as a clone class]

Background (2/2)
- Code clones are one of the factors that make software maintenance more difficult.
- If a fault is found in a code fragment, the same fault must be corrected in all of its clones.
- We have developed a code clone detection tool, CCFinder [1].
- Token-based clone detector
- Its input is a set of source files, and its output is the locations of clone pairs.

[1] T. Kamiya, S. Kusumoto, and K. Inoue, "CCFinder: A multi-linguistic token-based code clone detection system for large scale source code", IEEE Transactions on Software Engineering, (to appear).

CCFinder (1/4)
- The clone detection process consists of four steps.
    Source files
    Step 1: Lexical analysis -> Token sequence
    Step 2: Transformation -> Transformed token sequence
    Step 3: Match detection -> Clones on transformed sequence
    Step 4: Formatting -> Clone pairs
- Target program
  - C / C++
  - Java
  - FORTRAN
  - COBOL
  - LISP

CCFinder (2/4)
- Example of clone detection process

    static void foo() throws RESyntaxException {
        String a[] = new String [] { "123,400", "abc", "orange 100" };
        org.apache.regexp.RE pat = new org.apache.regexp.RE("[0-9,]+");
        int sum = 0;
        for (int i = 0; i < a.length; ++i)
            if (pat.match(a[i]))
                sum += Sample.parseNumber(pat.getParen(0));
        System.out.println("sum = " + sum);
    }
    static void goo(String [] a) throws RESyntaxException {
        RE exp = new RE("[0-9,]+");
        int sum = 0;
        for (int i = 0; i < a.length; ++i)
            if (exp.match(a[i]))
                sum += parseNumber(exp.getParen(0));
        System.out.println("sum = " + sum);
    }

[Figure: detection pipeline: Source files -> Lexical analysis -> Token sequence -> Transformation -> Transformed token sequence -> Match detection -> Clones on transformed sequence -> Formatting -> Clone pairs]

Example of transformation rules in Java
- All user-defined identifiers are transformed to the same token.
- A unique identifier is inserted at the end of each top-level definition and declaration.
  - This prevents detecting clones that begin in the middle of one class definition and end in the middle of another.
- "java.lang.Math.PI" is transformed to "Math.PI".
  - With an import statement, a class can be referred to by either its full package name or a shorter name.
- "new int[] {1, 2, 3}" is transformed to "new int[] {$}".
  - This eliminates table initialization code.
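
The first rule above can be illustrated with a minimal sketch. This is not CCFinder's implementation: the class name `IdentifierNormalizer`, the small keyword set, and the tokenizing regular expression are all assumptions made for illustration, and the sketch replaces every non-keyword identifier (not only user-defined ones) and does not handle string literals or multi-character operators.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of CCFinder.
final class IdentifierNormalizer {
    // Deliberately tiny keyword set; a real tool would use the full language grammar.
    private static final Set<String> KEYWORDS = new HashSet<>(Arrays.asList(
            "int", "for", "if", "new", "static", "void", "return"));

    // An identifier, a run of digits, or any single non-space character.
    private static final Pattern TOKEN =
            Pattern.compile("[A-Za-z_][A-Za-z_0-9]*|\\d+|\\S");

    // Splits code into tokens and replaces each non-keyword identifier with "$".
    static List<String> normalize(String code) {
        List<String> out = new ArrayList<>();
        Matcher m = TOKEN.matcher(code);
        while (m.find()) {
            String t = m.group();
            boolean isIdentifier =
                    Character.isJavaIdentifierStart(t.charAt(0)) && !KEYWORDS.contains(t);
            out.add(isIdentifier ? "$" : t);
        }
        return out;
    }

    public static void main(String[] args) {
        // The two statements differ only in identifier names,
        // so their normalized token sequences are equal.
        System.out.println(normalize("if (pat.match(a[i])) sum += 1;"));
        System.out.println(normalize("if (exp.match(a[i])) sum += 1;"));
    }
}
```

After this normalization, code fragments that differ only in naming map to the same token sequence, which is what allows a token-based detector to report them as clones.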

CCFinder (3/4)
- Application of CCFinder
- Free software
  - JDK libraries (Java, 570 KLOC)
  - Linux, FreeBSD (C, 1.6 + 1.3 MLOC)
  - FreeBSD, OpenBSD, NetBSD (C)
  - Qt (C++, 240 KLOC)
- Commercial software
  - NTT data Corp., Hitachi Ltd., NEC soft Ltd., ASTEC Inc., SRA Inc.
  - NASDA (Control program for rocket)

CCFinder (4/4)
- Output of CCFinder

    #version: ccfinder 3.1
    #langspec: JAVA
    #option: -b 30,1
    #option: -k +
    #option: -r abcdfikmnprsv
    #option: -c wfg
    #begin{file description}
    0.0 52 C:Gemini.java
    0.1 94 C:GeneralManager.java
    :
    :
    #end{file description}
    #begin{clone}
    0.1 53,9 63,13 1.10 542,9 553,13 35
    0.1 53,9 63,13 1.10 624,9 633,13 35
    0.2 124,9 152,31 0.2 154,9 216,51 42
    :
    :
    #end{clone}

  - Object file ID: "0.0" means file 0 in group 0.
  - Location of a clone pair: the first clone line means that lines 53 - 63 in file 0.1 and lines 542 - 553 in file 1.10 are identical or similar to each other.
- It is difficult to analyze source code using only this text-based information about the locations of clone pairs.
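
To make the clone section of this output concrete, here is a minimal sketch of reading one clone line. It relies only on the sample shown on this slide: the file ID and the start and end line numbers follow the slide's annotation, while the number after each comma and the trailing number are not explained on the slide and are ignored. The class `ClonePairRecord` and its field names are hypothetical, not part of CCFinder or Gemini.

```java
// Hypothetical reader for one line of the "#begin{clone}" section.
final class ClonePairRecord {
    final String fileA;
    final int startLineA;
    final int endLineA;
    final String fileB;
    final int startLineB;
    final int endLineB;

    private ClonePairRecord(String fileA, int startLineA, int endLineA,
                            String fileB, int startLineB, int endLineB) {
        this.fileA = fileA;
        this.startLineA = startLineA;
        this.endLineA = endLineA;
        this.fileB = fileB;
        this.startLineB = startLineB;
        this.endLineB = endLineB;
    }

    // Assumed layout, taken from the sample: fileA start end fileB start end number
    static ClonePairRecord parse(String line) {
        String[] f = line.trim().split("\\s+");
        if (f.length != 7) {
            throw new IllegalArgumentException("unexpected field count: " + line);
        }
        return new ClonePairRecord(f[0], lineOf(f[1]), lineOf(f[2]),
                                   f[3], lineOf(f[4]), lineOf(f[5]));
    }

    // Keeps the part before the comma, which the slide annotates as the line number.
    private static int lineOf(String position) {
        int comma = position.indexOf(',');
        return Integer.parseInt(comma < 0 ? position : position.substring(0, comma));
    }

    public static void main(String[] args) {
        ClonePairRecord r = parse("0.1 53,9 63,13 1.10 542,9 553,13 35");
        System.out.println(r.fileA + " lines " + r.startLineA + "-" + r.endLineA
                + " <-> " + r.fileB + " lines " + r.startLineB + "-" + r.endLineB);
    }
}
```

Even with such a reader, the result is still a flat list of locations, which is the motivation for the interactive views described next.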

Goals of this study
- Proposal of an interactive code clone analysis environment
  - Gemini
- Case study to evaluate the proposed environment
  - Apply Gemini to a programming exercise in our university and analyze the results.

Gemini overview
- A GUI-based code clone analysis environment
- Uses CCFinder as its code clone detector.
- Provides several views for interactive analysis.
  - Scatter plot view
    - Select by mouse dragging
    - Sorting function
    - Zoom in/out
  - Metric graph view
    - Select by metric values
  - Source code view
- Implemented in Java
  - About 10,000 lines of code

System structure of Gemini
[Figure: system structure diagram. Labels: User; User interfaces (scatter plot view, metrics graph, clone pair list view, clone class list view, source code view); Clone pair manager (CPM); Clone class manager (CCM); Source code manager (SCM); Code clone detector (CCD); CCFinder; Code clone database (CDB); Source files; Clone selection information]

Scatter plot
- Both the vertical and horizontal axes represent the token sequence of the source code.
- A dot means that the two corresponding tokens on the two axes are the same.
- The main diagonal line is always drawn, since each dot on it refers to the identical position on the two axes.
- A clone pair is shown as a diagonal line segment.
- The distribution is symmetrical about the main diagonal line.
[Figure: scatter plot of the token sequence "a b c a d e c"; a, b, c, ...: tokens; dot: matched position]
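
The properties listed above can be checked with a small sketch that builds the dot matrix for a token sequence compared against itself. The class `ScatterPlotSketch` and the sample token sequence are assumptions for illustration and are not taken from Gemini.

```java
// Hypothetical illustration of the scatter plot's dot matrix.
final class ScatterPlotSketch {
    // d[r][c] is true when token r on the vertical axis equals token c on the horizontal axis.
    static boolean[][] dots(String[] horizontal, String[] vertical) {
        boolean[][] d = new boolean[vertical.length][horizontal.length];
        for (int r = 0; r < vertical.length; r++) {
            for (int c = 0; c < horizontal.length; c++) {
                d[r][c] = vertical[r].equals(horizontal[c]);
            }
        }
        return d;
    }

    // Renders the matrix as text: '*' for a dot, '.' for no match.
    static String render(boolean[][] d) {
        StringBuilder sb = new StringBuilder();
        for (boolean[] row : d) {
            for (boolean cell : row) {
                sb.append(cell ? '*' : '.');
            }
            sb.append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // "a b c" occurs twice, so it forms one clone pair.
        String[] tokens = {"a", "b", "c", "x", "a", "b", "c"};
        System.out.print(render(dots(tokens, tokens)));
    }
}
```

In the rendered matrix, the repeated run "a b c" appears as a diagonal segment away from the main diagonal, mirrored on the other side of it.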

Sorting function
- When multiple files are compared in a scatter plot, the boundaries between files are shown on the axes.
- Depending on the order of the files, the dots may be spread widely across the plot.
- The sorting function places similar files as close to each other as possible.
[Figure: scatter plots of files f1 to f6 before and after reordering]

Snapshots of scatter plot
[Figure: snapshots of the scatter plot view]

Clone class metrics
- LEN(C): Length of the token sequence of each element in clone class C
- POP(C): Number of elements in clone class C
- DFL(C): Estimate of how many tokens would be removed from the source files if all code fragments of clone class C were replaced with caller statements of a new identical routine
  [Figure: code fragments replaced by caller statements of a new subroutine]
- RAD(C): Distribution in the file system of the elements in clone class C

Aims of clone class metrics
- We are interested in:
- Clone classes whose elements are spread widely.
  - A high value of POP means that there are many similar code fragments.
  - A high value of RAD means that the clones are spread over many subsystems. Such clones are difficult to find all together during maintenance.
- Clone classes that are appropriate for refactoring.
  - A high value of DFL (high POP and high LEN) means that the clone class is worth evaluating to see whether its elements can be merged into one routine.

Snapshots of clone class metric graph
[Figure: metric graph with axes RAD, LEN, POP and DFL; filtering mode: ON]

Case study overview
- Application target
- Programs developed in a programming exercise at Osaka Univ.
  - Compilers written in C
  - Programs of 69 students
  - Total size is 360,000 lines of code
- Issue of analysis
- Similarity among all programs
  - In the programming exercise, plagiarism sometimes occurs.

Analysis (1/2)
- The compilers of the 69 students are arranged on the two axes.
- The distribution of dots is spread widely.
- Rearrangement of the scatter plot using the sorting function
- The grid represents boundary lines between individual students.

Analysis (2/2)
- The corresponding code
- A (2 students)
  - The similar code fragments came from the source code of a sample compiler described in a textbook.
- B (4 students)
  - Many code fragments were similar, even in the names of variables and in comments.
[Figure: scatter plot with regions A and B marked]

Conclusions
- We presented Gemini, a maintenance support environment based on code clone analysis.
- We also applied it to a programming exercise to evaluate its usefulness.
- As future work, we are going to evaluate the applicability of Gemini to large-scale software in actual software maintenance.

Suffix-tree
- A suffix tree is a tree that satisfies the following conditions.
  1. A leaf node represents the starting position of a sub-string.
  2. A path from the root node to a leaf node represents a sub-string.
  3. The first characters of the labels of all the edges leaving one node are different from each other.
  -> A common path means a clone.
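
The idea that a shared path corresponds to a clone can be sketched without building the tree itself. The example below swaps in a simpler technique, sorting all suffixes of the token sequence and comparing neighbors, which finds the same longest repeated run that a shared path from the root would represent. It is a naive illustration under that substitution, not the algorithm used by CCFinder, and the class `RepeatedRun` is hypothetical.

```java
import java.util.Arrays;

// Hypothetical stand-in for a suffix tree: sorted suffixes of a token sequence.
final class RepeatedRun {
    // Returns the longest token run that occurs at two different start positions.
    // The two occurrences are allowed to overlap in this simple version.
    static String[] longestRepeated(String[] tokens) {
        int n = tokens.length;
        Integer[] starts = new Integer[n];
        for (int i = 0; i < n; i++) {
            starts[i] = i;
        }
        // Sort suffix start positions by the lexicographic order of the suffixes.
        Arrays.sort(starts, (a, b) -> compareSuffix(tokens, a, b));
        int bestLen = 0;
        int bestStart = 0;
        // Suffixes sharing the longest common prefix are adjacent after sorting.
        for (int k = 1; k < n; k++) {
            int len = commonPrefix(tokens, starts[k - 1], starts[k]);
            if (len > bestLen) {
                bestLen = len;
                bestStart = starts[k];
            }
        }
        return Arrays.copyOfRange(tokens, bestStart, bestStart + bestLen);
    }

    private static int compareSuffix(String[] t, int a, int b) {
        int n = t.length;
        for (int i = 0; a + i < n && b + i < n; i++) {
            int c = t[a + i].compareTo(t[b + i]);
            if (c != 0) {
                return c;
            }
        }
        // One suffix is a prefix of the other: the shorter one sorts first.
        return Integer.compare(n - a, n - b);
    }

    private static int commonPrefix(String[] t, int a, int b) {
        int len = 0;
        while (a + len < t.length && b + len < t.length && t[a + len].equals(t[b + len])) {
            len++;
        }
        return len;
    }

    public static void main(String[] args) {
        String[] tokens = {"x", "a", "b", "c", "y", "a", "b", "c", "z"};
        System.out.println(String.join(" ", longestRepeated(tokens)));
    }
}
```

Sorting suffixes this way is far slower than a suffix tree on large inputs; it is used here only because it fits in a few lines.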

Definition of DFL and RAD
- DFL(C)
  - DFL(C) = (LEN(C) × POP(C)) - (LEN(C) + 5 × POP(C))
    - LEN(C) × POP(C): the target code size for restructuring
    - 5 × POP(C): the code size of the new caller statements
    - LEN(C): the code size of the new identical routine
  [Figure: code fragments replaced by caller statements of a new subroutine]
- RAD(C)
  - Distribution in the file system of the elements in clone class C
    - RAD(C) = 0: C is enclosed within a single file.
    - RAD(C) = 1: C is enclosed within a single directory.
    - RAD(C) = n: C is enclosed within a directory tree of n layers.
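
Both definitions can be worked through in a short sketch. The DFL formula follows the slide directly, including the cost of 5 tokens per caller statement. The RAD computation is only one possible reading of "a directory tree of n layers": it assumes RAD is the largest number of path components between the deepest directory shared by all elements and any one element's file. The class `CloneClassMetrics` and the sample paths are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper illustrating the DFL and RAD definitions.
final class CloneClassMetrics {
    // Token cost of one caller statement, as given on the slide.
    static final int CALLER_TOKENS = 5;

    // DFL(C) = (LEN(C) * POP(C)) - (LEN(C) + 5 * POP(C))
    static int dfl(int len, int pop) {
        return len * pop - (len + CALLER_TOKENS * pop);
    }

    // RAD under the assumption stated above; paths use '/' as the separator.
    static int rad(List<String> filePaths) {
        String[] first = filePaths.get(0).split("/");
        int common = first.length; // leading components shared by all paths so far
        for (String p : filePaths) {
            String[] parts = p.split("/");
            int k = 0;
            while (k < common && k < parts.length && parts[k].equals(first[k])) {
                k++;
            }
            common = k;
        }
        int layers = 0;
        for (String p : filePaths) {
            layers = Math.max(layers, p.split("/").length - common);
        }
        return layers;
    }

    public static void main(String[] args) {
        // Four fragments of 100 tokens each: 400 - (100 + 20) tokens saved.
        System.out.println(dfl(100, 4));
        System.out.println(rad(Arrays.asList("src/a.c", "src/a.c")));
        System.out.println(rad(Arrays.asList("src/a.c", "src/b.c")));
        System.out.println(rad(Arrays.asList("src/x/a.c", "src/y/b.c")));
    }
}
```

For example, with LEN(C) = 100 and POP(C) = 4, DFL(C) = 400 - (100 + 20) = 280, so merging the four fragments into one routine would remove an estimated 280 tokens.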

Sorting function
- RSA(i): Ratio of the code range in file i covered by clones between file i and the other files
- RST(i, j): Ratio of the code range in file i covered by clones between file i and file j
- Step 1: Select a head file by the value of RSA (make F the head file).
- Step 2: From among the remaining files, select the file most similar to F by the value of RST and put it next to F.
- Step 3: Repeat step 2 while any file remains, treating the most similar file from the previous step 2 as the new F.
[Figure: file order f1 f2 f3 f4 f5 f6 rearranged step by step into f1 f6 f3 f4 f2 f5]
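
The three steps above amount to a greedy ordering, sketched below. The slide does not say which RSA value selects the head file, so choosing the highest RSA is an assumption, as is breaking ties by the lowest file index. The class `FileSorter` and the sample values are hypothetical and not taken from Gemini.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical greedy ordering following steps 1 to 3.
final class FileSorter {
    // rsa[i] is RSA(i); rst[i][j] is RST(i, j). Returns file indices in display order.
    static List<Integer> order(double[] rsa, double[][] rst) {
        int n = rsa.length;
        List<Integer> result = new ArrayList<>();
        if (n == 0) {
            return result;
        }
        boolean[] placed = new boolean[n];

        // Step 1: pick the head file (assumed here: the one with the highest RSA).
        int f = 0;
        for (int i = 1; i < n; i++) {
            if (rsa[i] > rsa[f]) {
                f = i;
            }
        }
        placed[f] = true;
        result.add(f);

        // Steps 2 and 3: repeatedly append the remaining file most similar to F.
        while (result.size() < n) {
            int best = -1;
            for (int j = 0; j < n; j++) {
                if (!placed[j] && (best < 0 || rst[f][j] > rst[f][best])) {
                    best = j;
                }
            }
            placed[best] = true;
            result.add(best);
            f = best; // the file just placed becomes the new F
        }
        return result;
    }

    public static void main(String[] args) {
        double[] rsa = {0.2, 0.9, 0.5};
        double[][] rst = {
            {0.0, 0.1, 0.8},
            {0.1, 0.0, 0.3},
            {0.8, 0.3, 0.0},
        };
        System.out.println(order(rsa, rst));
    }
}
```

Because each step looks only at the most recently placed file, the result depends on the head file and is not guaranteed to be the best possible arrangement overall.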

Analysis reuse of programs (1/3)
- RST(Parser, Checker) and RST(Checker, SPC) of each student were used as the ratio of reused code.

    RST    Parser, Checker   Checker, SPC   ave
    S1     0.117             0.086          0.102
    S2     0.553             0.563          0.549
    S3     0.674             0.729          0.701
    :
    :
    S69    0.112             0.598          0.390
    ave    0.185             0.461          0.320
    max    0.674             0.747          0.701
    min    0.037             0.086          0.102

Analysis reuse of programs (2/3)
- The average of RST of S1 is the lowest.
  - C: between Parser and Checker
  - D: between Checker and Parser
- The minimum length of clones to be detected was changed to 15 tokens.
[Figure: scatter plots of S1 with regions C and D marked; axes: Parser, Checker, SPC]

Analysis reuse of programs (3/3)
- The highest average values of RST
  - S2: 0.549, S3: 0.701
- Different appearances in the scatter plot
[Figure: scatter plots of S2 and S3; axes: Parser, Checker, SPC]

Analysis - Usefulness of metric graph
- Verified the value of DFL from the metrics graph
  - DFL(C) = (LEN(C) × POP(C)) - (LEN(C) + 5 × POP(C))
- S9: The value of DFL(Parser) was very high
- S10: The value of DFL(SPC) was very high

The highest values of DFL in each program

           Parser   Checker   SPC
    S1     0        99        113
    :
    :
    S9     3538     163       189
    S10    100      211       3439
    :
    :
    S69    223      211       258
    ave.   196      183       311

[Figure: metric graphs (DFL) and scatter plots for S9 and S10 with regions C, D and E marked; axes: Parser, Checker, SPC]