Reusing Open Source Software - Issues on Code Search and License Identification -

Katsuro Inoue
Software Engineering Laboratory, Department of Computer Science, Graduate School of Information Science and Technology, Osaka University

Motivation and Overview

Open Source Software
• Key driver of software engineering research and practice these days
• Essential resource for software system development
  – Academia
  – Industry
  – Development environments provided as OSS

Code Clones in FreeBSD Ports Collection
• 10.8 GB, 403 MLOC in C
• 80 PC workstations, 2 days

Issues
• How do we find appropriate software components in a huge collection of OSS?
• How can we identify the license of each component in a large OSS collection?

Software Space

SourceForge
• Huge Open Source Software (OSS) development support site
• Project repository, software search, ...
# Projects > 240,000
# Users > 2,600,000
This is only a small part of the OSS space

Software Space
• The number of different software programs is countably infinite
• However, we feel that very similar programs are repeatedly made by ourselves or others
  – I remember that similar code had been made before...
  – I found a similar but better program on the net...

Managing Software Collection
• Organizational assets in repository
• Open source software projects
We can collect them relatively easily by simply keeping everything
Management
• Categorize and register software components
• Keep track of software evolution
Clustering/labeling/indexing a software collection needs extensive elaboration

Exploring Software Space
• Browsing
  – Target must be well-organized
    • Feature
    • Name
    • Time
    • Kind
  – Not practical if the collection becomes huge
• Search
  Find software from the mass collection
  – High-quality answers require good ranking

Search Methods
• Keyword
  – Function/Class names
  – Parameters
  – Variable/Identifiers
  – Comments
• Program Snippets
  – Incomplete structures
  – Complete structures

    ...
    @author Ceki Gülcü */
    public class SortAlgo {

      final static String className = SortAlgo.class.getName();
      final static Logger LOG = Logger.getLogger(className);
      final static Logger OUTER = Logger.getLogger(className + ".OUT…
      final static Logger INNER = Logger.getLogger(className + ".INNE…
      final static Logger DUMP = Logger.getLogger(className + ".DUMP…
      final static Logger SWAP = Logger.getLogger(className + ".SWAP…

      int[] intArray;

      SortAlgo(int[] intArray) {
        this.intArray = intArray;
      }
    ...

Search keys can be automatically created through user activities

Related Works (1) Software Search Engines
• Google, Google Code Search (Google)
• Koders (Black Duck)
  – 3 GB OSS, C/C++/C#/... 30 languages
• Krugle (Krugle Enterprise)
  – OSS project support, search
• SourceForge (Geeknet Inc.)
• SPARS/J
  – Osaka-U, earlier than Google Code Search
• CodeBroker, Sourcerer, Merobase, Exemplar, Strathcona, Assieme, XSnippet, ...

Related Works (2) Software Component Recommendation
• Historical Approach
  Collect user activity and repository logs
  – Provide the raw data
  – Provide the data after processing, such as collaborative filtering
• Social Approach
  Construct a network of developers and users
  – Ask experts for the best solution
  – Developers' values

Related Works (3) Program Analysis Approach
Software Component Ranking
• Incoming references
  – Components with many incoming references have higher values
• Component rank, page rank
  – Components with incoming references from high-value components have higher values

Ranking Software Components for Code Search

Automated Component Library
• Collect software components eagerly without preserving their inherent structures
• Analyze relations among components by using various analysis techniques
• Rank the components based on their significance (Component Rank Model)
• Answer user's queries according to the rank

Component Graph
[Figure: component graph of System X and System Y; nodes are components (A to I), edges are use relations]

Weight of Nodes
[Figure: the component graph of System X and System Y with a weight on each node (values from 0.05 to 0.2)]
• Sum of all node weights = 1 ... (1)
• The weight of a node represents the significance of the node

Weights of Edges
[Figure: nodes A and B with weighted edges; the outgoing edges of A are labeled with d = 1/4]
d: distribution ratio
w(A) = sum of all outgoing edge weights ... (2)
sum of all incoming edge weights = w(B) ... (3)

Weight Equation
• Under constraints (1)~(3), we have a simultaneous equation
  W = D^t W
  W: node weight vector
  D^t: transposed matrix of distribution ratios

Propagating Weights
[Figure sequence over four slides: weights on a three-node graph (A, B, C) are propagated repeatedly along the edges until they stop changing]
Stable weight assignment (eigenvector computation)
Component Rank: order of nodes sorted by weight
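
The propagation shown on these slides can be sketched as a simple power iteration. This is a minimal sketch under stated assumptions, not the SPARS-J implementation: the three-node graph (A uses B and C, B uses C, C uses A) is an assumption that is consistent with the weights visible on the slides, the distribution ratio is assumed to be equal across a node's outgoing edges, and `component_rank` is a hypothetical name. It also assumes every component has at least one outgoing use relation.

```python
def component_rank(uses, iterations=200):
    """Propagate node weights along use relations until they stabilize.

    uses maps each component to the list of components it uses.
    Assumes every component has at least one outgoing use relation.
    """
    nodes = sorted(uses)
    # Constraint (1): the node weights always sum to 1.
    weights = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(iterations):
        propagated = {node: 0.0 for node in nodes}
        for source, targets in uses.items():
            # Constraint (2): a node's weight is split over its outgoing edges
            # (assumed equal distribution ratio d = 1 / number of outgoing edges).
            share = weights[source] / len(targets)
            for target in targets:
                # Constraint (3): a node's weight is the sum of its incoming edges.
                propagated[target] += share
        weights = propagated
    return weights


# Hypothetical use relations: A uses B and C, B uses C, C uses A.
uses = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
weights = component_rank(uses)
# Component Rank: nodes ordered by decreasing weight (ties broken by name).
ranking = sorted(weights, key=lambda node: (-weights[node], node))
print({node: round(weight, 3) for node, weight in weights.items()})
print(ranking)
```

A fixed iteration count keeps the sketch short; a production version would more likely stop once the change between two rounds falls below a tolerance.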

Markov Model
[Figure: component graph annotated with probabilities (values from 0.001 to 0.1)]
• The component rank model can be considered as a Markov chain of the user's focus
• The user's focus moves from one component to another along a use relation at a fixed time period
• Node weight represents the existence probability of the user's focus

Clustering Components
[Figure: a component graph with nodes A, B, C, D, E, F, G, and the clustered component graph with nodes C, G, BF, AD, E]

Experiment 1
JDK 1.3.0: 575,000 lines, 1877 components
7 minutes on PC (Pentium IV, 2 GHz, 2 GB)

rank   class name                       weight
1      java.lang.Object                 0.16126
2      java.lang.Class                  0.08712
3      java.lang.Throwable              0.05510
4      java.lang.Exception              0.03103
5      java.io.IOException              0.01343
6      java.lang.StringBuffer           0.01214
7      java.lang.SecurityManager        0.01169
8      java.io.InputStream              0.01027
9      java.lang.reflect.Field          0.00948
10     java.lang.reflect.Constructor    0.00936
...
1256   sunw.util.EventListener          0.00011
...
1256   ...                              ...

Experiment 2: Application to Industry
• Daiwa Computer: a mid-size software company in Osaka
• A shared Java application framework for web-based data management
• 5 applications + framework
  – 1538 components, 339 clustered nodes
• Classes in the framework and definitions of data structures are highly ranked

Related Works
• Markov models of documentation traversal
  – Influence Weight: impact of publications through references
  – Page Rank: weight of HTML pages on the Internet through web links
  Explicit use relation
  No clustering (important for software products)
• Reusability measurement
  – Various characteristic metrics of components or interfaces
  Indirect inference of reusability (our approach directly reflects usage of components)

SPARS-J
Software Product Archiving, Analyzing and Retrieving System for Java
[Figure: SPARS-J architecture; labels include Internet / Corporate, Component Collection, Classification/Analysis, Component Archive, Query Creation, and Component Searcher]

SPARS-J Portal
[Screenshot]

A Search Result
[Screenshot]

Displaying Source
[Screenshot]

Similar Component Group
[Screenshot]

Callers
[Screenshot]

Callees
[Screenshot]

Metrics
[Screenshot]

Use Cases of SPARS-J
• Source code management in a project
  – See code developed by others
  – See older versions
• Source code management through related projects
  – Component dependency can be seen
  – Reusability and newly-developed code are identified
• Source code management of overall organization
  – Components actually used and not used are identified
  – Overall asset in the organization can be seen

Applications
• Asset management of Lab
• OSS management (300,000 classes)
• Java framework management of a software developer
• The organizational Java class asset management of a major Japanese food company

Software License Identification

Software License
• Permissions of use, and requirements and conditions to get such permission
• Software license identification
  – Finding corresponding license statements from a database of known licenses
• Needed for reusing a component (class, method and so on)
  – If the license is not compatible with the license of the application, we cannot reuse it.

Challenges
• Finding license statement
  – (F1) License statements are usually mixed with other text
  – (F2) Files might reference another file where the license is located
  – (F3) Files might contain multiple licenses
• Language related
  – (L1) License statements contain spelling errors
  – (L2) A license can be represented in different ways
  – (L3) Licensors change the spelling/grammar of the license statement
• License customization
  – (C1) Several licenses must be customized when used
  – (C2) Licensors modify, add or remove conditions to well-known licenses
  – (C3) Licensors modify licenses for various intents

Finding License Statement
The first comments of a file contain text that is not part of the license statement

    /*
     * This file includes utility functions                       (description of file)
     * Copyright (C) 2010 foo                                     (license statement)
     * This program is free software: you can redistribute...
     * change log: v2.1 Bug fix                                   (change log)
     */

Language Related Issues
The licensors might change the spelling/grammar of the license statement
Example
  – "license" → "licence"
  – "it would be useful" → "it will be useful"

License Customization
Licensors modify, add or remove conditions to well-known licenses to create a new license
• Example: MIT/X11 license
  "Permission to use, copy, modify, distribute and sell this software..."
  → "Permission to use, copy, modify and distribute this software..."

Multi-Knowledgebase Approach for License Identification
Input: source file. Output: license name.
1. License statement extraction
2. Text segmentation
3. Text normalization (knowledge base: equivalent phrases (12))
4. Sentence filtering (knowledge base: filtering keywords (82))
5. Sentence-token matching (knowledge base: sentence-token expressions (427))
6. License rule matching (knowledge base: rules (126 for 112 licenses))

1 License statement extraction
The comments at the header → license statement

    /*
     * Copyright (c) 2001 foo foo@bar.org All rights reserved.
     *
     * Redistribution and use in source and binary forms, with or without
     * modification, are permitted provided that the following conditions
     * are met:
     * 1. Redistributions of source code must retain the above copyright ...
     * 2. Redistributions in binary form must reproduce the above copyright ...
     *
     * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS...
     * IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS...
     */
    #include <sys/cdefs.h>
    #include <sys/types.h>
    …
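
As a rough illustration of this extraction step, the sketch below pulls the leading `/* ... */` comment out of a C source file. It is a minimal sketch, not Ninka's extractor: the function name `extract_header_comment` is hypothetical, and it assumes the license statement sits in a single block comment at the very top of the file.

```python
import re


def extract_header_comment(source: str) -> str:
    """Return the text of the /* ... */ comment that opens the file, or ''."""
    match = re.match(r"\s*/\*(.*?)\*/", source, flags=re.DOTALL)
    if match is None:
        return ""
    lines = []
    for line in match.group(1).splitlines():
        # Drop the decorative leading '*' that many C headers use.
        lines.append(re.sub(r"^\s*\*+\s?", "", line).rstrip())
    return "\n".join(lines).strip()


source = """/*
 * Copyright (c) 2001 foo foo@bar.org All rights reserved.
 *
 * Redistribution and use in source and binary forms, with or without
 * modification, are permitted provided that the following conditions
 * are met:
 */
#include <sys/types.h>
"""
header = extract_header_comment(source)
print(header)
```

Files that use `//` line comments or that reference a separate license file (challenge F2) would need additional handling.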

2 Text segmentation
Split with an implementation based on [3] with some heuristics

    [Copyright (c) 2001 foo@bar.org All rights reserved.]
    [Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:]
    [1.]
    [Redistributions of source code must retain the above copyright notice...]
    [2.]
    [Redistributions in binary form must reproduce the above copyright...]
    [THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS "AS IS"...]
    [IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS...]

[3] P. Clough. A Perl program for sentence splitting using rules. http://ir.shef.ac.uk/cloughie/software.html, April 2001.

3 Text normalization
Convert phrases which should be regarded as having the same meaning (for example, convert " to <quotes>)

    [Copyright (c) 2001 foo@bar.org All rights reserved.]
    [Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met<colon>]
    [1.]
    [Redistributions of source code must retain the above copyright notice...]
    [2.]
    [Redistributions in binary form must reproduce the above copyright...]
    [THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS <quotes>AS IS<quotes>...]
    [IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS...]

Equivalent phrases
    ' , " , ` → <quotes>
    : → <colon>
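
The two equivalent-phrase rules named on this slide can be sketched as a small substitution table. This is an illustrative sketch only: `EQUIVALENTS` holds just the two rules shown here rather than the full knowledge base of 12, and the names `EQUIVALENTS` and `normalize` are hypothetical.

```python
import re

# Only the two equivalent-phrase rules shown on the slide.
EQUIVALENTS = [
    ("['\"`]", "<quotes>"),  # single quote, double quote, backquote
    (":", "<colon>"),
]


def normalize(sentence: str) -> str:
    """Replace equivalent characters with their canonical tokens."""
    for pattern, token in EQUIVALENTS:
        sentence = re.sub(pattern, token, sentence)
    # Collapse runs of whitespace so later matching is not layout-sensitive.
    return re.sub(r"\s+", " ", sentence).strip()


print(normalize('PROVIDED BY THE AUTHOR AND CONTRIBUTORS "AS IS"'))
print(normalize("provided that the following conditions are met:"))
```

The whitespace collapsing is an extra assumption of this sketch; the slide itself only lists the quote and colon replacements.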

4 Sentence Filtering
Removing sentences not related to licenses
If no sentences left → "NONE"

    [Copyright (c) 2001 foo@bar.org All rights reserved.]
    [Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met<colon>]
    [1.]
    [Redistributions of source code must retain the above copyright notice...]
    [2.]
    [Redistributions in binary form must reproduce the above copyright...]
    [THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS <quotes>AS IS<quotes>...]
    [IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS... DAMAGES...]

License-related keywords (filtering keywords): all rights, conditions, distributions, reproduce, damages, as is
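
A keyword filter of this kind can be sketched in a few lines. The sketch assumes case-insensitive substring matching, which the slide does not specify, and uses only the six keywords listed here rather than the full set of 82; `FILTER_KEYWORDS` and `filter_sentences` are hypothetical names.

```python
# The six filtering keywords shown on the slide (the full knowledge base has 82).
FILTER_KEYWORDS = ("all rights", "conditions", "distributions",
                   "reproduce", "damages", "as is")


def filter_sentences(sentences):
    """Keep only sentences that contain at least one license-related keyword."""
    return [s for s in sentences
            if any(keyword in s.lower() for keyword in FILTER_KEYWORDS)]


sentences = [
    "Copyright (c) 2001 foo@bar.org All rights reserved.",
    "1.",
    "Redistributions of source code must retain the above copyright notice...",
    "THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS <quotes>AS IS<quotes>...",
]
kept = filter_sentences(sentences)
# Following the slide: a file with no remaining sentences is reported as "NONE".
result = "NONE" if not kept else None
print(kept)
```

Plain substring matching is deliberately naive; word-boundary matching would avoid accidental hits inside longer words.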

5 Sentence Token Matching
Matching each sentence against the sentence-token expressions

    [Copyright (c) 2001 foo@bar.org All rights reserved.] → [AllRights]
    [Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met<colon>] → [BSDPre]
    [Redistributions of source code must retain the above copyright notice...] → [BSDcondSource]
    [Redistributions in binary form must reproduce the above copyright...] → [BSDcondBinary]
    [THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS <quotes>AS IS<quotes>...] → [BSDasIs]
    [IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS...] → [BSDWarr]

Sentence-token expressions
    BSDcondSource: Redistributions? of source code must retain the (above )?copyright notice, this list of conditions(,)? and the following disclaimer(, without modification)?: …
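
Sentence-token matching can be sketched as a lookup over named regular expressions. The patterns below are assumptions written for illustration: `BSDcondSource` is adapted from the expression on this slide (whose tail is elided there), `AllRights` is a guess at what such an expression might look like, and `match_token` is a hypothetical helper that returns `None` when nothing matches.

```python
import re

# Illustrative expressions only; the real knowledge base has 427 entries.
SENTENCE_TOKENS = {
    "BSDcondSource": re.compile(
        r"^Redistributions? of source code must retain the (above )?copyright notice, "
        r"this list of conditions(,)? and the following disclaimer"
        r"(, without modification)?\.?$",
        re.IGNORECASE,
    ),
    "AllRights": re.compile(r"^Copyright .*All rights reserved\.?$", re.IGNORECASE),
}


def match_token(sentence):
    """Return the name of the first expression matching the sentence, else None."""
    for name, pattern in SENTENCE_TOKENS.items():
        if pattern.match(sentence):
            return name
    return None


print(match_token(
    "Redistributions of source code must retain the above copyright notice, "
    "this list of conditions and the following disclaimer."
))
print(match_token("Copyright (c) 2001 foo@bar.org All rights reserved."))
```

The optional groups are what let one expression absorb the small wording variations described under the language-related challenges.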

6 License Rule Matching
[AllRights][BSDPre][BSDcondSource][BSDcondBinary][BSDasIs][BSDWarr]
→ BSD2 (BSD 2-clause license)
If no rule matches → "UNKNOWN"

Matching rule
    BSD2: BSDPre, BSDcondSource, BSDcondBinary, BSDasIs, BSDWarr

Rules represent the relations between a license name and a sequence of sentence tokens
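
The rule lookup can be sketched as matching a token sequence against a table of rules. This is a sketch under an explicit assumption: a rule is taken to match when its tokens occur as a contiguous run in the file's token list, which is one way to explain why the leading AllRights token does not prevent the BSD2 match on this slide. The names `RULES`, `contains_run` and `match_license` are hypothetical, and only the one rule shown here is included.

```python
# The single rule shown on the slide (the full knowledge base has 126 rules).
RULES = {
    "BSD2": ("BSDPre", "BSDcondSource", "BSDcondBinary", "BSDasIs", "BSDWarr"),
}


def contains_run(tokens, sequence):
    """True if sequence occurs as a contiguous run inside tokens."""
    n = len(sequence)
    return any(tuple(tokens[i:i + n]) == sequence
               for i in range(len(tokens) - n + 1))


def match_license(tokens):
    """Return the first license whose rule matches, else "UNKNOWN"."""
    for name, sequence in RULES.items():
        if contains_run(tokens, sequence):
            return name
    return "UNKNOWN"


tokens = ["AllRights", "BSDPre", "BSDcondSource",
          "BSDcondBinary", "BSDasIs", "BSDWarr"]
print(match_license(tokens))
print(match_license(["AllRights"]))
```

With more rules in the table, the order of `RULES` would matter, since longer and more specific rules should be tried before shorter ones.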

Ninka
Automatic license identification tool
– Reporting license name (112 licenses)
  • BSD4 (BSD 4-clause license)
  • BSD3 (BSD 3-clause license)
  • BSD2 (BSD 2-clause license)
  • GPLv2+ (GNU General Public License version 2 or later)
  • LibraryGPLv2+
  ...

Analysis Result of Debian

                     Ninka    FOSSology   ohcount   OSLC
Recall [%]           82.8     99.6        100       (not recovered)
Precision [%]        96.6     55.0        33.2      29.5
F-measure            0.891    0.709       0.498     0.371
Execution Time [s]   22       923         27        372

Ninka has the highest precision and the fastest execution time
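
The F-measure row can be cross-checked against the precision and recall rows, since the F-measure is the harmonic mean of the two. The sketch below assumes the three recall values that survive in this transcript belong to Ninka, FOSSology and ohcount in that order; `f_measure` is a hypothetical helper name.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (both given as fractions)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision %, recall %) as read from the slide, under the ordering assumption above.
results = {
    "Ninka": (96.6, 82.8),
    "FOSSology": (55.0, 99.6),
    "ohcount": (33.2, 100.0),
}
scores = {tool: f_measure(p / 100, r / 100) for tool, (p, r) in results.items()}
for tool, score in scores.items():
    print(tool, round(score, 4))
```

Under that assumption the computed values agree with the slide's F-measures to within rounding, which supports the column assignment.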

What Licenses Are Used in Debian?
• NONE is the most popular
• GPLv2+ is the second most used license

License           Files     Percent
NONE              210147    31.5%
GPLv2+            147535    22.1%
LesserGPLv2.1+    42692     6.4%
CDDLv1orGPLv2     37623     5.6%
SeeFile           31685     4.7%
...

Do Different Programming Languages Use Different Licenses?
Examine the number of files written in each programming language (Java, C, C++, Perl) under each license

Java
License           Files    Percent   Apps
CDDLv1orGPLv2     37562    25.43%    2
NONE              25371    17.17%    344
LesserGPLv2.1+    22834    15.46%    61
...
Few applications but many files

Perl
License           Files    Percent   Apps
NONE              18227    31.63%    999
GPLv2             3979     24.40%    1171
SameAsPerl        2651     8.10%     15
SameAsPerl: indirect license

When Present, What Types of Errors Do License Statements Have?
• We observed several potential problems in the licensing of various applications that we analyzed
• Found problems
  – Files without a license
  – Cutting & pasting the wrong license statement
  – Inconsistent license clauses
  – Incorrect name of the license

Evolution of Licenses
• Software licenses are adapted to their environment
• Software licenses evolve because of
  – author's requirement
  – user's demand
  – external pressure
• No detail of the evolution characteristics was analyzed

FreeBSD (all)
[Figure: number of files (#files, 0 to 16000) under each license by release version; series: BSD4, BSD3, BSD2, GPLv2+, LibraryGPLv2+, NONE, UNKNOWN, Others]
• Decreased: BSD4
• Increased: BSD2 and BSD3

FreeBSD (all)
[Figure: #files (-800 to 800) per license by release version; series: BSD4, BSD3, BSD2, GPLv2+, LibraryGPLv2+]
• v5.2.1 - v5.3: 531 files under BSD4 were moved to another license, BSD2 or BSD3.

FreeBSD (kernel)
[Figure: number of files (#files, 0 to 4000) under each license by release version; series include BSD4, BSD3, CDDLic, InterACPILic, NONE, UNKNOWN, Others]
• Decreased: BSD4
• Increased: BSD2 and BSD3

OpenBSD (all)
[Figure: number of files (#files, 0 to 25000) under each license by release version (2.1 to 4.7); series: BSD4, BSD3, BSD2, MITold1, GPLv2+, NONE, UNKNOWN, Others]
• Decreased: BSD4
• Increased: BSD2 and BSD3

Eclipse
[Figure: number of files (#files, 0 to 45000) under each license by release version; series: EPLv1, CPLv0.5, CPLv1, MPLv1_1, BSD3, Others, NONE, UNKNOWN]
• CPLv0.5 → CPLv1.0
• CPLv1.0 → EPLv1.0

ArgoUML
[Figure: number of files (#files, 0 to 2500) under each license by release version (0.8.1 to 0.31.1); series: EPLv1, LesserGPLv2.1, LesserGPLv2.1+, Apachev2, BSD2, Others, NONE, UNKNOWN]
• UNKNOWN (BSD-like license) → EPLv1.0

Findings
• There are large shifts of license in FreeBSD (all) and OpenBSD (all)
• ArgoUML and Eclipse also have similar large shifts
– Sometimes their licenses change more drastically than in FreeBSD (all) and OpenBSD (all)
– A few licenses cover almost all files in those systems
• The kernels of FreeBSD and OpenBSD also have large shifts

Conclusions

Summary
• Searching software components
– Component-Rank model
– SPARS-J
• Software license identification
– Multi-knowledgebase approach
– Ninka

Computation Intensive Software Engineering (CISE)
• Methods and technologies which efficiently produce quality software using
– High performance computation environment
– Huge amount of empirical data collection
– Comprehensive network with various data

Classifying Software Engineering
[Figure: classification of software engineering, with regions labeled "Main target of CISE" and "Ordinary SE tools and environment".]

Idea behind CISE
• Ordinary SE does not fully utilize cutting-edge computational power, network performance, data collection, ...
• Success of computation intensive approaches in other fields, e.g., Web mining, bioinformatics, ...
• Leading examples of CISE
– Search-based software engineering
– Mining software repositories
– Large code-clone analysis
– Internet-scale code search
– License usage evolution

Conclusions
• Use of Open Source Systems is still growing
• Total management of OSS assets based on the CISE concept is strongly expected
• Our approaches, SPARS-J and Ninka, would be initial steps toward such total management

Resources
• Papers
– Katsuro Inoue, Reishi Yokomori, Tetsuo Yamamoto, Makoto Matsushita, Shinji Kusumoto: "Ranking Significance of Software Components Based on Use Relations", IEEE Transactions on Software Engineering, Vol. 31, No. 3, pp. 213-225, 2005.
– T. Kamiya, S. Kusumoto, and K. Inoue: "CCFinder: A multi-linguistic token-based code clone detection system for large scale source code", IEEE Transactions on Software Engineering, Vol. 28, No. 7, pp. 654-670, Jul. 2002.
– Yuki Manabe, Yasuhiro Hayase, Katsuro Inoue: "Evolutional Analysis of Licenses in FOSS", Proceedings of the Joint ERCIM Workshop on Software Evolution (EVOL) and International Workshop on Principles of Software Evolution (IWPSE), 2010-9.
– Daniel M. German, Yuki Manabe, Katsuro Inoue: "A Sentence-Matching Method for Automatic License Identification of Source Code Files", Proceedings of the IEEE/ACM International Conference on Automated Software Engineering, 2010-9.
• WEB
– SPARS http://www.spars.info/
– CCFinderX http://www.ccfinder.net/ccfinderxos-j.html
– CCFinder http://sel.ics.es.osaka-u.ac.jp/cdtools/index.html

Thank You!

Pseudo Use Relation
[Figure: use-relation graph with nodes A, B, C and added pseudo edges.]
• Add pseudo edges so that the weight computation always converges
• Connect each node to every node it is not already connected to

Prototyping Component Rank System
[Figure: processing pipeline from input to output.]
• Input: .java files (one .java file = one component)
• Use relation extraction: inheritance, method call, attribute access, abstract class implementation
• Clustering: similarity measured by SMMT; similarity criterion t: sharing 80% of statements
• Clustered graph construction
• Node weight computation: weight ratio p between real and pseudo edges: 0.85; equal distribution ratio d to outgoing edges
• De-clustering to the original graph
• Output: component/rank pairs
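The node weight computation can be illustrated with a small sketch. It is not the prototype's code: the function name `component_rank`, the `uses` mapping, the power-iteration solver and the handling of nodes without real outgoing edges are all assumptions made for illustration. Only the ratio p = 0.85, the equal split over outgoing edges, and the pseudo edges to non-connected nodes come from the slides.

```python
def component_rank(uses, p=0.85, tol=1e-12, max_iter=1000):
    """Return {component: weight}; the weights sum to 1.

    uses maps each component to the components it uses (real edges).
    A share p of a node's weight is split equally over its real edges;
    the rest is split equally over pseudo edges to the nodes it does
    not use. This reading of the slides is an assumption.
    """
    nodes = sorted(set(uses) | {t for ts in uses.values() for t in ts})
    weight = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(max_iter):
        new = dict.fromkeys(nodes, 0.0)
        for n in nodes:
            real = sorted(set(uses.get(n, ())) - {n})  # self-use is ignored
            pseudo = [m for m in nodes if m != n and m not in real]
            if not real and not pseudo:
                new[n] += weight[n]  # a lone node keeps its own weight
                continue
            # If one kind of edge is missing, the other carries everything.
            if not pseudo:
                real_share = 1.0
            elif not real:
                real_share = 0.0
            else:
                real_share = p
            for m in real:
                new[m] += weight[n] * real_share / len(real)
            for m in pseudo:
                new[m] += weight[n] * (1.0 - real_share) / len(pseudo)
        delta = sum(abs(new[n] - weight[n]) for n in nodes)
        weight = new
        if delta < tol:
            break
    return weight


if __name__ == "__main__":
    # Hypothetical graph: A and C both use B.
    for name, w in sorted(component_rank({"A": ["B"], "C": ["B"], "B": []}).items()):
        print(name, round(w, 4))
```

In the hypothetical graph, B is used by both other components and therefore ends up with the largest weight, while A and C receive weight only through pseudo edges.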

Experiment 2: Collection of SE Tools and Libraries
– CK metrics measurement tools, component rank system
– ANTLR, JAMA, Caffe Cappuccino
– 582 components

rank  class name                                                   weight
1     antlr.Token                                                  0.10727
2     antlr.debug.Event                                            0.06189
2     antlr.debug.NewLineEvent                                     0.06189
4     antlr.collections.impl.Vector                                0.05434
5     jp.gr.java_conf.keisuken.text.html.HtmlParameter             0.05246
6     jp.gr.java_conf.keisuken.net.server.ServerProperties         0.03699
7     Jama.Matrix                                                  0.01564
8     jp.gr.java_conf.keisuken.util.IntegerArray                   0.01390
8     jp.gr.java_conf.keisuken.util.LongArray                      0.01390
10    jp.ac.osaka_u.es.ics.iip_lab.metrics.parser.IdentifierInfo   0.01365
...
418   cktool_new.examples.Main                                     0.00050

Experiment 4: Document Processing Tools and Libraries
• JEDIT, jext, Enhydra, saxon, phex, JDK, etc.
• Search getNodetype by grep (7171 components)

rank       component                               weight
1(67)      enhydra3.1...dom.Node                   0.029110
2(169)     saxon7_0...saxon.om.NodeInfo            0.000969
3(275)     saxon7_0...saxon.pattern.NodeTest       0.000437
4(316)     enhydra3.1...dom.DocumentImpl           0.000368
5(355)     saxon7_0...saxon.pattern.Pattern        0.000324
6(382)     saxon7_0...saxon.Controller             0.000296
7(437)     enhydra3.1...xslt.XSLTEngineImpl        0.000241
8(446)     enhydra3.1...dom.ElementImpl            0.000235
9(500)     saxon7_0...saxon.style.StyleElement     0.000202
10(506)    saxon7_0...saxon.tree.NodeImpl          0.000198
...
125(4441)  enhydra3.1...FuncID                     0.000029
...

Discussion 1: Weight Computation
[Figure: the same use-relation graph (nodes A, B, C, D, E) weighted under the Reference Count Model and under the Component Rank Model. Node weights shown: 0.2, 0.31, 0.6, 0.33, 0, 0, 0.2, 0.03, 0.30 (assignment to nodes not recoverable).]

Discussion 2: Clustering Policy (1)
• Simply duplicated components are eliminated
[Figure: original components A and B, their copies, and other components X and Y, before and after clustering. Weights shown after clustering: 0.25, 0.25.]

Discussion 2: Clustering Policy (2)
• Reused components in different environments are counted
[Figure: original component A, modified components B and C, and other components X and Y, before and after clustering. Weights shown: 0.3, 0.2, 0.15, 0.2.]

Discussion 3: Similarity Criterion and Pseudo Use Relation
• Resulting ranks are fairly insensitive to the similarity criterion t
– Some distinct components fall into the same cluster if t is less than 0.8
• Various pseudo use relation ratios p have been investigated
– Resulting ranks are stable for p between 0.75 and 0.95
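The role of the similarity criterion t can be sketched as follows. The slides measure similarity with SMMT; the stand-in metric here (shared statements divided by the size of the larger component), the union-find grouping, and the names `similarity` and `cluster` are assumptions for illustration only.

```python
from itertools import combinations


def similarity(a, b):
    """Share of statements two components have in common.

    Assumed stand-in for the SMMT measure: shared statements divided
    by the statement count of the larger component.
    """
    if not a or not b:
        return 0.0
    return len(a & b) / max(len(a), len(b))


def cluster(components, t=0.8):
    """Group components whose pairwise similarity reaches t.

    components maps a name to its set of (normalized) statements.
    Returns a list of sets of component names.
    """
    parent = {name: name for name in components}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in combinations(components, 2):
        if similarity(components[a], components[b]) >= t:
            parent[find(a)] = find(b)

    groups = {}
    for name in components:
        groups.setdefault(find(name), set()).add(name)
    return list(groups.values())


if __name__ == "__main__":
    # Hypothetical components: A_copy duplicates A, B shares 3 of 5 statements.
    comps = {
        "A": {"s1", "s2", "s3", "s4", "s5"},
        "A_copy": {"s1", "s2", "s3", "s4", "s5"},
        "B": {"s1", "s2", "s3", "x", "y"},
    }
    print(cluster(comps, t=0.8))
    print(cluster(comps, t=0.5))
```

With t = 0.8 only the exact copy is merged with its original; lowering t to 0.5 also pulls the distinct component B into the same cluster, which is the effect the slide warns about for values below 0.8.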

Features of SPARS-J (Registration)
• One Java class file (*.java) = one component
• Dependence relations: inheritance, interface, caller, refer, ...
• Extraction of keywords in the class file
• Indexing using Berkeley DB

Features of SPARS-J (Search)
• Keyword search / package-tree browsing
• Displaying source, caller/callee list (class/method), various metrics
• Search with constraints
• Display the top-ranked components by the component rank and TF-IDF rank
• Clustering by source-code similarity
• English/Japanese

Related Works
• ASLA[21], OSLC, ohcount
– Matching regular expression patterns corresponding to a license against license statements
• FOSSology[11]
– Matching simple strings corresponding to a license against license statements with the bSAM algorithm
• Not precise enough
• Does not report whether any license exists or not when the tool can't identify the license
• False positives
• Slow execution time
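The regular-expression style of matching can be sketched roughly as below. The patterns, the keyword heuristic and the function name `identify_license` are hypothetical and written for this illustration; they are not taken from ASLA, OSLC, ohcount, FOSSology or Ninka. The NONE / UNKNOWN split mirrors the categories used in the license charts.

```python
import re

# Hypothetical patterns for this sketch only (not from any real tool).
LICENSE_PATTERNS = {
    "GPLv2+": re.compile(
        r"GNU General Public License.*?version 2"
        r".*?or \(at your option\) any later version",
        re.IGNORECASE,
    ),
    "MIT-like": re.compile(
        r"Permission is hereby granted, free of charge", re.IGNORECASE
    ),
}

# Assumed heuristic: these words suggest that some license statement exists.
LICENSE_HINT = re.compile(r"\b(?:licen[sc]e|copyright)\b", re.IGNORECASE)


def identify_license(header_text):
    """Return matched license names, or 'UNKNOWN' / 'NONE'.

    'UNKNOWN' means a statement seems present but no pattern matched;
    'NONE' means no license-like wording was found at all.
    """
    # Drop comment markers and collapse whitespace so that line breaks
    # inside a comment block do not defeat the patterns.
    text = re.sub(r"[\s*#/]+", " ", header_text)
    matches = [name for name, pat in LICENSE_PATTERNS.items() if pat.search(text)]
    if matches:
        return ",".join(matches)
    return "UNKNOWN" if LICENSE_HINT.search(text) else "NONE"
```

Each pattern must anticipate the exact wording of a statement, so small edits to the text make the match fail, which is one way such tools end up not precise enough.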

Evaluation of License Identification
• Goal: to show whether our approach is better than other methods
• Tools
– Ninka (implementation of the proposed approach), FOSSology 1.0.0, ohcount version 3.90 rc, OSLC 3.0
• Target systems
– Source files: 250 files in Debian 5.0.2
• Randomly select 250 packages in Debian 5.0.2
• For each selected package, randomly select 1 file

Method
• Compare the results from each tool to the results obtained by manual inspection
• Result category
– C: Correct license name and version
– I: Incorrect
– U: Unknown
• Measured values
– Recall
– Precision
– F-measure
– Execution Time
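The measured values can be derived from the C / I / U counts. The slide does not spell out the formulas, so the definitions below are assumptions: recall is taken as C over all files, precision as C over the files for which the tool gave an answer (C + I), and F-measure as their harmonic mean. The function name `evaluate` and the counts in the example are made up, not results from the evaluation.

```python
def evaluate(correct, incorrect, unknown):
    """Recall, precision and F-measure from C / I / U counts.

    Assumed definitions (not stated on the slide):
      recall    = C / (C + I + U)
      precision = C / (C + I)
      F-measure = harmonic mean of precision and recall
    """
    total = correct + incorrect + unknown
    answered = correct + incorrect
    recall = correct / total if total else 0.0
    precision = correct / answered if answered else 0.0
    if precision + recall == 0:
        f_measure = 0.0
    else:
        f_measure = 2 * precision * recall / (precision + recall)
    return {"recall": recall, "precision": precision, "f_measure": f_measure}


if __name__ == "__main__":
    # Hypothetical counts for 250 files.
    print(evaluate(correct=200, incorrect=10, unknown=40))
```

Under these definitions a tool that answers "unknown" instead of guessing keeps its precision high at the cost of recall.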

Research Questions
Goal: To demonstrate usefulness of Ninka
RQ1: What are the licenses used in FOSS?
RQ2: When present, what types of error do license statements have?

RQ1: What Are the Licenses Used in FOSS?
• Sub questions
– What licenses are used in Debian?
– Do different programming languages use different licenses?
– Does size matter?
• Target: source code of Debian 5.0.2
– 794622 files from 11101 applications analyzed
– Ninka could not identify the license of 15.9% of source files (reported "UNKNOWN")

Does Size Matter?
• Examine the median file size for files with a license and files without a license
• Are smaller files more likely not to have a license? → Is there a difference in file size between files with and without a license?
• A Mann-Whitney test confirms that these differences are significant (p<0.0001)
Median (bytes): overall, with license, without license statement: 4633 5488 2137 1005
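The comparison can be sketched with a hand-written Mann-Whitney U test. This is a minimal illustration using only the standard library: it uses the normal approximation without tie or continuity correction, the function name `mann_whitney_u` is an assumption, and the file sizes in the example are hypothetical, not the Debian data.

```python
import math
from statistics import median


def mann_whitney_u(xs, ys):
    """Two-sided Mann-Whitney U test (normal approximation).

    Simplified sketch: no tie correction and no continuity correction.
    Returns (U, p_value), where U counts pairs with x > y and ties
    count one half.
    """
    n1, n2 = len(xs), len(ys)
    u = sum(1.0 if x > y else 0.5 if x == y else 0.0 for x in xs for y in ys)
    mean = n1 * n2 / 2
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mean) / sd
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail probability
    return u, p_value


if __name__ == "__main__":
    # Hypothetical file sizes in bytes.
    with_license = [5200, 4800, 6100, 5500, 7000, 4500, 5900, 6300]
    without_license = [900, 1200, 800, 1500, 1100, 950, 2000, 700]
    print("medians:", median(with_license), median(without_license))
    print("U, p:", mann_whitney_u(with_license, without_license))
```

The pairwise loop is quadratic in the sample sizes, so a rank-based implementation would be needed for a corpus as large as the one analyzed in the slides.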

Experiments
1. Examine the number of files under each license at each release version in FreeBSD (all), OpenBSD (all), Eclipse and ArgoUML
2. Analyze the difference in the number of licensed files across different versions in FreeBSD (all) and OpenBSD (all)
3. Examine the difference in evolution patterns between the whole OS (all) and the OS kernel
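Experiments 1 and 2 amount to counting files per license for each release and comparing the counts between releases. A minimal sketch, assuming each release is available as a mapping from file path to an identified license name; the function names and the sample data are hypothetical, and the way the identification tool's output is parsed is not shown.

```python
from collections import Counter


def license_counts(file_licenses):
    """Number of files under each license in one release."""
    return Counter(file_licenses.values())


def count_changes(old_release, new_release):
    """Per-license change in file count between two releases.

    Licenses whose count did not change are omitted.
    """
    old_c = license_counts(old_release)
    new_c = license_counts(new_release)
    return {
        lic: new_c[lic] - old_c[lic]
        for lic in sorted(set(old_c) | set(new_c))
        if new_c[lic] != old_c[lic]
    }


if __name__ == "__main__":
    # Hypothetical releases: file path -> identified license.
    release_a = {"a.c": "BSD4", "b.c": "BSD4", "c.c": "BSD3"}
    release_b = {"a.c": "BSD3", "b.c": "BSD4", "c.c": "BSD3", "d.c": "BSD2"}
    print(license_counts(release_b))
    print(count_changes(release_a, release_b))
```

Plotting the per-release counters over all releases gives charts of the kind shown for FreeBSD, OpenBSD, Eclipse and ArgoUML.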

Analysis Targets
• FreeBSD (all): OS kernel, applications; release versions 2.2 to 8.0; release dates 1994/11 to 2009/11
• FreeBSD (kernel): OS kernel; release versions 2.2 to 8.0; release dates 1994/11 to 2009/11
• OpenBSD (all): OS kernel, applications; release versions 2.0 to 4.7; release dates 1996/10 to 2010/5
• OpenBSD (kernel): OS kernel; release versions 2.0 to 4.7; release dates 1996/10 to 2010/5
• Eclipse: SDE platform; release versions 2.0 to 3.5.2; release dates 2002/6 to 2009/9
• ArgoUML: UML Design Tools; release versions 0.8.1 to 0.31.1; release dates 2000/10 to 2010/6
Remaining rows (values in extracted order; the column assignment is not preserved):
# release: 28, 45, 25, 79
#Files (oldest to latest): 11419 to 35880, 686 to 2208, 4412 to 21266, 627 to 3490, 6245 to 14181, 987 to 3314
Version Control System: CVS, Subversion, CVS, CVS