Code-Clone Detection Tool CCFinder
Software Engineering Laboratory, Department of Computer Science, Graduate School of Information Science and Technology, Osaka University, Japan

Contents
- Code Clone Detection Tool: CCFinder
- Code Clone Analysis Tool: Gemini
- Applications
- Summaries and Future Works

Code Clone
- In our studies, a code clone (or software clone) is a code fragment in source files that is identical or similar to another fragment.
[Figure: clone pair and clone class]

Problems caused by code clones
- Code clones are generally regarded as one of the factors that make software maintenance difficult.
  - If a fault is found in a code portion, all of its clones should be modified as well.
  - “Programs that have duplicate logic are hard to modify.” [Fowler]
- It is unrealistic to find code clones by hand in a million lines of source code.

Initial motivation of this project
- A huge software system used in a division of government
  - One million lines of code in two thousand modules
  - Written mainly in COBOL
- The system was developed more than 20 years ago and has been maintained continually by a large number of engineers.
- It was believed that the system contained many code clones, but the documentation did not provide enough information about them.

Code Clone Analysis Tools: CCFinder & Gemini
- We have been developing code clone analysis tools:
  - the code clone detection tool CCFinder [1]
  - the GUI-based clone analysis environment Gemini [2]
- We have delivered these tools to software companies and evaluated their usefulness through several case studies.

[1] T. Kamiya, S. Kusumoto, and K. Inoue, “CCFinder: A multi-linguistic token-based code clone detection system for large scale source code”, IEEE Transactions on Software Engineering, 28(7): 654-670, 2002.
[2] Y. Ueda, T. Kamiya, S. Kusumoto, and K. Inoue, “Gemini: Maintenance Support Environment Based on Code Clone Analysis”, Proc. of the 8th IEEE International Symposium on Software Metrics, 67-76, 2002.

Outline of CCFinder
- CCFinder compares source code directly, token by token, and detects code clones.
  - Normalization of name space
  - Replacement of user-defined names
  - Removal of table initialization
  - Consideration of module delimiters
- CCFinder can analyze systems on the scale of millions of lines in practical time.
- Target languages: C/C++, Java, COBOL, FORTRAN, LISP

CCFinder: Example of clone detection process

Source files:

    1. static void foo() throws RESyntaxException {
    2.   String a[] = new String [] { "123,400", "abc", "orange 100" };
    3.   org.apache.regexp.RE pat = new org.apache.regexp.RE("[0-9,]+");
    4.   int sum = 0;
    5.   for (int i = 0; i < a.length; ++i)
    6.     if (pat.match(a[i]))
    7.       sum += Sample.parseNumber(pat.getParen(0));
    8.   System.out.println("sum = " + sum);
    9. }
    10. static void goo(String [] a) throws RESyntaxException {
    11.   RE exp = new RE("[0-9,]+");
    12.   int sum = 0;
    13.   for (int i = 0; i < a.length; ++i)
    14.     if (exp.match(a[i]))
    15.       sum += parseNumber(exp.getParen(0));
    16.   System.out.println("sum = " + sum);
    17. }

Process: Source files → Lexical analysis → Token sequence → Transformation → Transformed token sequence → Match detection → Clones on transformed sequence → Formatting → Clone pairs

Resulting clone pair, as line,column positions: 3,1 to 9,1 and 11,1 to 17,1 (file 0.1)
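The pipeline above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not CCFinder's implementation: the keyword set is a tiny hypothetical subset, the qualifier collapsing is a crude stand-in for name-space normalization, and match detection uses quadratic dynamic programming rather than a suffix tree. All function names are invented for this sketch.

```python
import re

# Hypothetical, deliberately small keyword set; a real Java lexer knows many more.
KEYWORDS = {"static", "void", "int", "for", "if", "new", "throws", "return"}
IDENT_RE = re.compile(r"[A-Za-z_]\w*")
TOKEN_RE = re.compile(r'"[^"\n]*"|[A-Za-z_]\w*|\d+|\+\+|\+=|[^\sA-Za-z_0-9]')

def tokenize(source):
    """Lexical analysis: split source text into a flat token list."""
    return TOKEN_RE.findall(source)

def transform(tokens):
    """Transformation: map user identifiers to '$' and collapse '$ . $' chains.

    Collapsing qualified names is a crude stand-in (an assumption of this
    sketch) for CCFinder's name-space normalization.
    """
    out = []
    for tok in tokens:
        if IDENT_RE.fullmatch(tok) and tok not in KEYWORDS:
            tok = "$"
            if out[-2:] == ["$", "."]:
                del out[-2:]  # drop the qualifier "$ ."
        out.append(tok)
    return out

def longest_common_run(a, b):
    """Match detection: longest run of equal tokens shared by a and b.

    Returns (start_in_a, start_in_b, length), computed with O(len(a) * len(b))
    dynamic programming to keep the sketch short.
    """
    best = (0, 0, 0)
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                cur[j] = prev[j - 1] + 1
                if cur[j] > best[2]:
                    best = (i - cur[j], j - cur[j], cur[j])
        prev = cur
    return best

foo_body = '''
int sum = 0;
for (int i = 0; i < a.length; ++i)
  if (pat.match(a[i]))
    sum += Sample.parseNumber(pat.getParen(0));
System.out.println("sum = " + sum);
'''
goo_body = '''
int sum = 0;
for (int i = 0; i < a.length; ++i)
  if (exp.match(a[i]))
    sum += parseNumber(exp.getParen(0));
System.out.println("sum = " + sum);
'''

raw_a, raw_b = tokenize(foo_body), tokenize(goo_body)
norm_a, norm_b = transform(raw_a), transform(raw_b)
raw_match = longest_common_run(raw_a, raw_b)     # stops at the first renamed variable
norm_match = longest_common_run(norm_a, norm_b)  # covers both bodies entirely
```

On the raw token sequences the match breaks at `pat` versus `exp`. After transformation the two bodies become the same token sequence, which is why the slide reports them as a clone pair.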

Outline of Gemini
- Gemini is a GUI-based clone analysis environment.
  - Gemini uses CCFinder as its clone detection unit.
- Gemini has three main interfaces:
  - Scatter plot
    - The user can select clones by dragging the mouse.
    - The scatter plot has a sort function, a zoom function, and so on.
  - Metric graph
    - The metric graph shows several metrics of each clone class.
    - The user can select clones by specifying a range for each metric value.
  - Source code view
    - The user can browse the source code of clones selected in the other views.
- Gemini is implemented in Java.

Gemini: Architecture
[Figure: architecture of Gemini. Components shown: source files; CCFinder (code clone detector); code clone database; inside Gemini, the clone pair manager, source code manager, and metrics manager; user interfaces (scatter plot view, source code view, metric graph views); clone selection information exchanged with the user. Legend: a, b, c, ... are tokens; marks show matched positions in the example token sequence "a b c a d e c".]

Application of CCFinder & Gemini
- Open source software
  - JDK libraries (Java, 570 KLOC)
  - Linux, FreeBSD (C, 1.6 + 1.3 MLOC)
  - FreeBSD, OpenBSD, NetBSD (C)
  - Qt (C++, 240 KLOC)
- Commercial software (about 30 companies)
  - NTT Data Corp., Hitachi Ltd., Hitachi GP, NEC soft Ltd., ASTEC Inc., SRA Inc., NASDA, Daiwa Computer, etc.
- Student exercises at Osaka University
- Submitted to a court as evidence in a software copyright suit

Application: JDK library
- JDK (Java Development Kit) 1.2.2
  - Number of files: 1,700
  - LOC: 500,000
- Analysis time: 3 minutes on a Pentium III 650 MHz with 1 GB RAM

Scatter plot
- Unit of clone: 20 LOC
- A: Many code clones are detected.
- B: The longest clone
[Figure: scatter plot with regions A and B marked]

A: Many code clones
- 29 files in src/javax/swing/plaf/multi/*.java
- This code was generated by an automatic code generation tool.

    public class MultiButtonUI extends ButtonUI {
        ...
        public static ComponentUI createUI(JComponent a) {
            ComponentUI mui = new MultiButtonUI();
            return MultiLookAndFeel.createUIs(mui,
                                              ((MultiButtonUI) mui).uis,
                                              a);
        }
        ...

B: The longest clone
- 349 LOC
- Eighteen “sort” methods in src/java/util/Arrays.java
- Differences: the types and number of arguments

Application: FreeBSD, Linux, NetBSD
- Three types of UNIX
  - FreeBSD 4.0 (C, 2200 KLOC)
  - Linux 2.4.0 (C, 2400 KLOC)
  - NetBSD 1.5 (C, 2600 KLOC)
- FreeBSD and NetBSD were derived from the same code.
- Unit of code clone: more than 30 tokens
- Analysis time: 108 minutes

Scatter Plot
[Figure: scatter plot]

Clones of FreeBSD and Linux
- Device driver

Summary and Future Works
- Code Clone Detection Tool: CCFinder
- Code Clone Analysis Tool: Gemini
- Practical use of code clone information
  - Refactoring
  - Reusable components

The difference between ‘diff’ and clone detection tools
- diff finds the longest common subsequence of lines between two files.
  - Given a code portion, diff does not report the case where the same portion occurs two or more times (clones).
- A clone detection tool finds all identical or similar code portions.

Suffix tree
A suffix tree is a tree that satisfies the following conditions.
1. A leaf node represents the starting position of a sub-string.
2. A path from the root node to a leaf node represents a sub-string.
3. The first characters of the labels of all edges leaving one node differ from each other.
→ A common path means a clone.
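To make the "common path means a clone" idea concrete, here is a minimal sketch that uses a naive, uncompressed suffix trie over a token list. It is not a compressed suffix tree and needs quadratic space, so it only illustrates the principle; the node layout and function names are assumptions of this sketch.

```python
def build_suffix_trie(tokens):
    """Insert every suffix of tokens into a trie (naive, O(n^2) nodes).

    Each node is a dict holding its children and the start positions of
    all suffixes whose path passes through it.
    """
    root = {"children": {}, "starts": []}
    for start in range(len(tokens)):
        node = root
        for tok in tokens[start:]:
            node = node["children"].setdefault(tok, {"children": {}, "starts": []})
            node["starts"].append(start)
    return root

def longest_repeat(tokens):
    """Return (length, start positions) of the longest path shared by >= 2 suffixes."""
    best = (0, [])
    stack = [(build_suffix_trie(tokens), 0)]
    while stack:
        node, depth = stack.pop()
        if depth > best[0] and len(node["starts"]) >= 2:
            best = (depth, sorted(node["starts"]))
        for child in node["children"].values():
            if len(child["starts"]) >= 2:  # only follow paths that are still shared
                stack.append((child, depth + 1))
    return best

tokens = "a b c d a b c e".split()
length, starts = longest_repeat(tokens)  # the run "a b c" occurs at positions 0 and 4
```

Every node reached by two or more suffixes corresponds to a token run that occurs at least twice, which is exactly a clone candidate. A production tool would use a linear-size suffix tree or suffix array instead.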

Example of transformation rules in Java
- All user-defined identifiers are transformed to the same token.
- A unique identifier is inserted at each end of the top-level definitions and declarations.
  - This prevents detecting clones that begin in the middle of one class definition and end in the middle of another.
- “java.lang.Math.PI” is transformed to “Math.PI”.
  - With an import statement, a class can be referred to by either its full package name or a shorter name.
- “new int[] {1, 2, 3}” is transformed to “new int[] {$}”.
  - This eliminates table initialization code.
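Two of these rules can be approximated with regular expressions, as sketched below. CCFinder applies its rules to token sequences, so this text-level version is only an approximation. The package-prefix pattern relies on the Java naming convention (lowercase package segments followed by a capitalized class name), which is a heuristic assumed for this sketch, and both function names are hypothetical.

```python
import re

# Heuristic: one or more lowercase segments ending in '.', followed by a capital letter.
PACKAGE_PREFIX = re.compile(r"\b(?:[a-z_][a-z0-9_]*\.)+(?=[A-Z])")
# 'new T[] {...}' with a flat (non-nested) initializer list.
ARRAY_INIT = re.compile(r"(new\s+\w+(?:\s*\[\s*\])+\s*)\{[^{}]*\}")

def shorten_qualified_names(line):
    """Drop leading package segments, e.g. java.lang.Math.PI becomes Math.PI."""
    return PACKAGE_PREFIX.sub("", line)

def collapse_table_init(line):
    """Replace the contents of an array initializer with a single '$'."""
    return ARRAY_INIT.sub(r"\1{$}", line)

shortened = shorten_qualified_names("double r = java.lang.Math.PI;")
collapsed = collapse_table_init("int[] t = new int[] {1, 2, 3};")
```

Applied to the earlier example, `org.apache.regexp.RE` and `RE` become the same text, so the two declarations no longer differ only because of an import.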

The output of CCFinder

    #version: ccfinder 3.1
    #langspec: JAVA
    #option: -b 30,1
    #option: -k +
    #option: -r abcdfikmnprsv
    #option: -c wfg
    #begin{file description}
    0.0 52 C:Gemini.java
    0.1 94 C:GeneralManager.java
    :
    :
    #end{file description}
    #begin{clone}
    0.1 53,9 63,13 1.10 542,9 553,13 35
    0.1 53,9 63,13 1.10 624,9 633,13 35
    0.2 124,9 152,31 0.2 154,9 216,51 42
    :
    :
    #end{clone}

- Object file ID: for example, 0.0 is file 0 in group 0.
- Location of a clone pair: for example, lines 53 - 63 in file 0.1 and lines 542 - 553 in file 1.10 are identical or similar to each other.
- It is difficult to analyze source code using only this text-based information about the location of clone pairs.
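As a rough illustration of how a tool such as Gemini might consume this text, the sketch below parses the clone section into structured records. The field layout is inferred only from the sample on the slide (two fragments, each given as file ID, start and end in line,column form, followed by one trailing number whose meaning the slide does not state), and all names are hypothetical.

```python
from typing import NamedTuple

class Fragment(NamedTuple):
    file_id: str   # e.g. "0.1"
    start: tuple   # (line, column)
    end: tuple     # (line, column)

def parse_position(text):
    """Turn '53,9' into (53, 9)."""
    line, column = text.split(",")
    return int(line), int(column)

def parse_clone_pairs(report):
    """Collect (left, right, last_field) for each line inside #begin{clone} ... #end{clone}."""
    pairs, inside = [], False
    for raw in report.splitlines():
        line = raw.strip()
        if line == "#begin{clone}":
            inside = True
        elif line == "#end{clone}":
            inside = False
        elif inside and line and line != ":":
            f1, s1, e1, f2, s2, e2, last = line.split()
            pairs.append((Fragment(f1, parse_position(s1), parse_position(e1)),
                          Fragment(f2, parse_position(s2), parse_position(e2)),
                          int(last)))  # trailing field kept as-is; its meaning is assumed
    return pairs

sample = """\
#begin{clone}
0.1 53,9 63,13 1.10 542,9 553,13 35
0.1 53,9 63,13 1.10 624,9 633,13 35
0.2 124,9 152,31 0.2 154,9 216,51 42
:
#end{clone}
"""
pairs = parse_clone_pairs(sample)
```

Once the pairs are in this form, grouping them by fragment or plotting them is straightforward, which is the gap the GUI-based views are meant to fill.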

Analysis of the comparison among students (non-gapped clones only)
- A (2 students): Similar code fragments came from the source code of the sample compiler described in the textbook.
- B (4 students): Many code fragments were similar even in variable names and comments.
[Figure: regions A and B and the corresponding code]

Clone class metrics
- LEN(C): Length of the token sequence of each element in clone class C
- LNR(C): Length of the non-repetitive token sequence of LEN(C)
- POP(C): Number of elements in clone class C
- DFL(C): Estimate of how many tokens would be removed from the source files if all code fragments of clone class C were replaced with caller statements of a new identical routine
  [Figure: code fragments replaced by caller statements to a new subroutine]
- RAD(C): Distribution of the elements of clone class C in the file system
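A small worked sketch of POP and DFL follows. The slide gives no formula for DFL, so the arithmetic here is one plausible reading of its definition: tokens before refactoring are LEN × POP, and tokens afterwards are one shared routine of LEN tokens plus one caller statement per element. The caller size of 5 tokens and all names are assumptions made for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CloneClass:
    token_length: int  # LEN(C): tokens in each element
    fragments: tuple   # one (file_id, start_line) entry per element

def pop(clone_class):
    """POP(C): number of elements in the clone class."""
    return len(clone_class.fragments)

def dfl(clone_class, caller_tokens=5):
    """DFL(C) estimate: tokens removed by extracting one shared routine.

    caller_tokens is the assumed size of each caller statement; the slide
    does not specify it.
    """
    before = clone_class.token_length * pop(clone_class)
    after = clone_class.token_length + caller_tokens * pop(clone_class)
    return before - after

# Hypothetical clone class with three elements of 35 tokens each.
example = CloneClass(token_length=35,
                     fragments=(("0.1", 53), ("1.10", 542), ("1.10", 624)))
saved = dfl(example)  # 35 * 3 - (35 + 5 * 3) = 55
```

Under this reading, DFL grows with both LEN and POP, so long clones with many elements stand out as refactoring candidates.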

Comparison with AST approach
- Features of the AST approach
  - Extracts identical sub-trees of the AST as clones
  - The result is precise because of strict syntax analysis.
  - High space and time complexity
- Features of our approach
  - Hybrid approach combining CCFinder’s quick but inaccurate clone detection with CCShaper’s filtering, which considers syntax structure

The other approaches
- AST (abstract syntax tree) approach
  - Clone = identical sub-trees in an AST
  - Deep dependence on the programming language
- PDG (program dependency graph) approach
  - Clone = identical sub-graphs in a PDG
  - Graph comparison is difficult.
- Code metrics
  - Clone = routines that have the same metric values
  - Severe restriction in granularity
- CCFinder & CCShaper
  - Clone = code fragments that have the same syntax structure
  - Limited precision

Why I chose “a”
- I selected the clones by the following criteria:
  - All clone code fragments appear in the same class.
  - The metric LEN is high.
  - The code fragment includes a whole method body.
