Study on Reuse Issues in Open Source Software


























Shi QIU (仇 実)
Advisor: Prof. Katsuro Inoue
Department of Computer Science, Graduate School of Information Science & Technology, Osaka University

Table of Contents
1. Introduction
2. Study on Copyright Inconsistency in the Linux Kernel
3. A Machine Learning Method for Automatic Copyright Notice Identification of Source Files
4. Study on Dependency-related License Violation in the JavaScript Package Ecosystem
5. Study on Popularity Growth of Packages in the JavaScript Package Ecosystem
6. Conclusion and Future Work

Introduction
- Software plays an important role in today's society.
- Software reuse is popular, especially the reuse of open source software (OSS).
- Diagram: OSS Projects -> Reuse -> Your Projects, with two aspects listed:
  1. Software quality
  2. Software productivity

Introduction
- Reuse issues in OSS need special care when OSS projects are reused in your projects.
- Copyright issue: copyright authorship, copyright inconsistency, copyright identification, …
- License issue: license inconsistency, license violation, license identification, …
- Maintenance issue: outdated dependency, software popularity, OSS ecosystem, …
- Quality issue: security vulnerability, testing, …
- Consequences shown on the slide: lawsuits and potential monetary penalties; side effects on software development.

Copyright Inconsistency
- Definition: the inconsistency between the holders named in the copyright notices and the contributors (committers) of the source code.
- Example copyright notice:
    *
    * Copyright (C) 2003 Alice, All Rights Reserved.
    * Copyright (C) 2003 Bob, All Rights Reserved.
    * Copyright (C) 2003 George, All Rights Reserved.
    *
    * This program is free software; you can redistribute it and/or modify
    * it under the terms of the GNU General Public License version 2 as
    * published by the Free Software Foundation.
- Example committers and their contributions:
    Alice: 23 lines      Daniel: 124 lines
    Bob: 12 lines        Ella: 876 lines
    Charlie: 8 lines     Frank: 1054 lines
- Two types: committer-not-holder inconsistency and holder-not-committer inconsistency.

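The two inconsistency types can be read as two set differences between holders and committers. The following is a minimal sketch using the names and line counts from the example above; the function name `copyright_inconsistencies` and the data layout are assumptions made for illustration, not the tooling used in the study.

```python
# Illustrative data copied from the example slide. A real analysis would
# extract holders from copyright notices and committers from the
# version-control history instead of hard-coding them.
holders = {"Alice", "Bob", "George"}
committed_lines = {
    "Alice": 23, "Daniel": 124, "Bob": 12,
    "Ella": 876, "Charlie": 8, "Frank": 1054,
}


def copyright_inconsistencies(holders, committed_lines):
    """Return (committer_not_holder, holder_not_committer) as sorted lists."""
    committers = set(committed_lines)
    # Contributors who wrote code but are not named in any notice.
    committer_not_holder = sorted(committers - holders)
    # Named holders who contributed no code to the file.
    holder_not_committer = sorted(holders - committers)
    return committer_not_holder, holder_not_committer


cnh, hnc = copyright_inconsistencies(holders, committed_lines)
print("committer-not-holder:", cnh)
print("holder-not-committer:", hnc)
```

In this toy example the largest contributors (Frank and Ella) are not named as holders, which is the kind of case the committer-not-holder category is meant to capture.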
Analysis Method
- Research Questions
  - RQ 1: How prevalent are copyright inconsistencies?
  - RQ 2: What caused the copyright inconsistencies?
- Dataset Construction
  - Version 4.14 (Nov 13, 2017): 45,477 files
  - 414 files: source files whose life cycle can be entirely traced

Analysis Method: RQ 1
- Source files carry copyright notices such as:
    Copyright (C) 2003 Alice, All Rights Reserved.
    Copyright (C) 2005 Oracle, All Rights Reserved.
    Copyright © 2003-2014 TI, All Rights Reserved.
- Source files -> Fossology -> the holder dataset (name)
- Cregit, GitHub -> the committer dataset (name, organization, #tokens and the proportion)
- The holder dataset and the committer dataset -> copyright inconsistency

Analysis Method: RQ 2
- Manually check the commit logs and the comments in the source code of all 134 source files detected as having copyright inconsistency.

Results: RQ 1
- Holder-not-committer inconsistency: 134 of 414 files (32.4%)
- Committer-not-holder inconsistency: 229 of 414 files (55.3%)
- Copyright inconsistency: 262 of 414 files (63.3%)
- Answer 1: Copyright inconsistency is prevalent in the Linux kernel.

Results: RQ 2
- [Figure: causes of holder-not-committer inconsistency and of committer-not-holder inconsistency]
- Answer 2: Code reuse, affiliation change, refactoring, support function, and others' contributions are the main reasons.

A Machine Learning Method for Automatic Copyright Notice Identification of Source Files (Chapter 3)
- Overview: Source file -> copyright-related sentence -> vector (e.g. <1, 1, 0, 2, …>) -> supervised classifier -> copyright notice
- Supervised classifiers: Naive Bayes, Decision Tree, Random Forest, SVM

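To make the sentence -> vector -> classifier pipeline concrete, here is a minimal sketch that uses bag-of-words counts and a small multinomial Naive Bayes written with the standard library only. The training sentences, the labels `notice` and `other`, and the feature choice are assumptions for illustration; the features and training data used in the thesis are not given in these slides.

```python
import math
import re
from collections import Counter


def tokenize(sentence):
    """Lowercase a sentence and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", sentence.lower())


def vectorize(sentence, vocabulary):
    """Turn a sentence into a bag-of-words count vector over `vocabulary`."""
    counts = Counter(tokenize(sentence))
    return [counts[word] for word in vocabulary]


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, vectors, labels):
        self.classes = sorted(set(labels))
        n_features = len(vectors[0])
        self.log_prior = {}
        self.log_likelihood = {}
        for c in self.classes:
            rows = [v for v, y in zip(vectors, labels) if y == c]
            self.log_prior[c] = math.log(len(rows) / len(vectors))
            totals = [sum(column) for column in zip(*rows)]
            denom = sum(totals) + n_features
            self.log_likelihood[c] = [math.log((t + 1) / denom) for t in totals]
        return self

    def predict(self, vector):
        def score(c):
            weights = self.log_likelihood[c]
            return self.log_prior[c] + sum(x * w for x, w in zip(vector, weights))
        return max(self.classes, key=score)


# Hypothetical toy training set: (sentence, label).
train = [
    ("Copyright (C) 2003 Alice, All Rights Reserved.", "notice"),
    ("Copyright (C) 2005 Oracle, All Rights Reserved.", "notice"),
    ("This function frees the copyright buffer.", "other"),
    ("See the license file for redistribution terms.", "other"),
]
vocabulary = sorted({word for sentence, _ in train for word in tokenize(sentence)})
model = NaiveBayes().fit(
    [vectorize(sentence, vocabulary) for sentence, _ in train],
    [label for _, label in train],
)

prediction = model.predict(
    vectorize("Copyright (C) 2010 Bob, All Rights Reserved.", vocabulary)
)
print(prediction)
```

The second `other` example is there to show why a keyword match on "copyright" alone is not enough: a sentence can mention the word without being a notice, which is one motivation for learning from the whole sentence vector.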
Introduction
- OSS license
  - Describes the terms and conditions under which OSS is used, modified, and shared.
  - Licenses may be incompatible with each other.
- OSS ecosystem
  - Consists of software projects that are developed and evolve together in a shared environment.
  - Software projects may depend on each other.

Dependency-related License Violation
- Definition: the situation in which the license of an OSS package is not compatible with the license of its dependency.
- Example: Package A (MIT License) reuses Package B (GPL-3.0+ License) as a dependency. This forces Package A to be licensed under GPL-3.0 or a later version as well.
- Two types: license violation caused by a direct dependency and license violation caused by an indirect dependency.

Research Questions
- RQ 1: How prevalent are dependency-related license violations?
- RQ 2: What are the developers' attitudes towards dependency-related license violations?

RQ 1: The prevalence
- Dataset Construction
- Method
  1. Constructing the license dictionary
     GPL-2, GPLv2, GPL 2, GNU GPL-2.0, GPL version 2, … -> GPL-2.0

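A license dictionary of this kind can be sketched as a lookup table from spelling variants to one canonical identifier. The table below contains only the variants listed on the slide; the name `LICENSE_ALIASES`, the function `normalize_license`, and the whitespace and case handling are assumptions for illustration.

```python
import re

# Variants listed on the slide, keyed in lowercase. A full dictionary would
# cover many more licenses and spellings.
LICENSE_ALIASES = {
    "gpl-2": "GPL-2.0",
    "gplv2": "GPL-2.0",
    "gpl 2": "GPL-2.0",
    "gnu gpl-2.0": "GPL-2.0",
    "gpl version 2": "GPL-2.0",
    "gpl-2.0": "GPL-2.0",
}


def normalize_license(raw):
    """Map a free-form license string to a canonical name, or None if unknown."""
    # Collapse runs of whitespace and ignore case before the lookup.
    key = re.sub(r"\s+", " ", raw.strip().lower())
    return LICENSE_ALIASES.get(key)


print(normalize_license("GPLv2"))
print(normalize_license("  GPL   version 2 "))
print(normalize_license("Some Custom License"))
```

Returning `None` for unknown strings keeps unrecognized licenses visible so they can be reviewed rather than silently mapped to a wrong identifier.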
RQ 1: The prevalence
- Method
  2. Constructing the license compatibility network
     Built from 18 popular licenses: MIT, GPL 2.0, Apache 2.0, …
     [Figure: example package pairs]
       Name: Package 1, Version: 1.0.1, License: MIT      and  Name: Package 1, Version: 1.0.1, License: GPL-2.0
       Name: Package 1, Version: 1.0.1, License: MIT      and  Name: Package 2, Version: 1.0.1, License: GPL-2.0

RQ 1: The prevalence
- Method
  3. Constructing the historical meta-dataset
     Name: package 7
       Version   License   Dependency (version)
       1.0.1     MIT       package 8 (1.0.1), package 9 (2.3.1)
       1.0.2     MIT       package 8 (1.0.2)
       1.1.0     GPL-2.0   package 9 (2.4.0), package 10 (1.0.1)
       …
  4. Constructing the dependency network
       Name: Package 1, Version: 1.0.1, License: MIT
       Name: Package 2, Version: 1.2.1, License: MIT
       Name: Package 3, Version: 2.0.1, License: GPL-2.0
       Name: Package 4, Version: 1.2.3, License: GPL-3.0

RQ 1: The prevalence
- Method
  5. Dependency-related license violation detection
     Inputs: the historical meta-dataset and the license compatibility network
       Name: Package 1, Version: 1.0.1, License: MIT
       Name: Package 2, Version: 1.2.1, License: MIT
       Name: Package 3, Version: 2.0.1, License: GPL-2.0
       Name: Package 4, Version: 1.2.3, License: GPL-3.0
     Report:
       Name: Package 1   License: MIT
       -------------------
       Direct risks: Package 4 (GPL-3.0)
       Indirect risks: Package 3 (GPL-2.0)

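The detection step can be sketched as a breadth-first walk over the dependency network that checks each reachable dependency against a compatibility table. The slides give the four packages and the resulting report but not the edges, so the edges below (Package 1 depends on Package 2 and Package 4, Package 2 depends on Package 3) are an assumption chosen to reproduce that report. The `INCOMPATIBLE` set is a simplified stand-in for the license compatibility network, and the names are illustrative.

```python
from collections import deque

licenses = {
    "Package 1": "MIT",
    "Package 2": "MIT",
    "Package 3": "GPL-2.0",
    "Package 4": "GPL-3.0",
}

# Assumed edges: package -> packages it depends on directly.
dependencies = {
    "Package 1": ["Package 2", "Package 4"],
    "Package 2": ["Package 3"],
    "Package 3": [],
    "Package 4": [],
}

# Simplified assumption: (license of dependent, license of dependency) pairs
# treated as violations, following the MIT-on-GPL example in the slides.
INCOMPATIBLE = {("MIT", "GPL-2.0"), ("MIT", "GPL-3.0")}


def find_license_risks(package):
    """Return (direct, indirect) lists of dependencies with incompatible licenses."""
    own_license = licenses[package]
    direct, indirect = [], []
    seen = {package}
    queue = deque((dep, 1) for dep in dependencies.get(package, []))
    while queue:
        dep, depth = queue.popleft()
        if dep in seen:
            continue  # already visited; also guards against dependency cycles
        seen.add(dep)
        if (own_license, licenses[dep]) in INCOMPATIBLE:
            (direct if depth == 1 else indirect).append(dep)
        queue.extend((d, depth + 1) for d in dependencies.get(dep, []))
    return direct, indirect


direct, indirect = find_license_risks("Package 1")
print("Direct risks:", [f"{p} ({licenses[p]})" for p in direct])
print("Indirect risks:", [f"{p} ({licenses[p]})" for p in indirect])
```

Tracking the depth at which a dependency is first reached is what separates the two violation types: depth 1 is a direct dependency, anything deeper is indirect.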
RQ 1: The prevalence
- Observations of RQ 1
  - Only a few packages in npm (2,704 packages, 0.644%) have dependency-related license violations.
  - Including packages licensed under copyleft licenses as dependencies is strongly associated with the occurrence of dependency-related license violations.

RQ 2: Preliminary Questionnaire
- Questionnaire Design
  - Q 1: Do you think this is a kind of risk? If so, were you aware of this kind of risk when you were developing your packages?
  - Q 2: In this question, you can share with us anything you want to say about this kind of risk.
- Data collection: 2,704 packages -> 100 email invitations -> 20 responses

RQ 2: Preliminary Questionnaire
- Observations of Q 1: Developers overlook and misunderstand dependency-related license violations for various reasons.
- Observations of Q 2: Managing dependency-related license violations is difficult, and developers are asking for help.

Conclusion and Future Work
- Conclusion
  1. Copyright inconsistency (the Linux kernel)
  2. Automatic copyright notice identification (the Linux kernel)
  3. Dependency-related license violation (npm)
  4. The popularity growth (npm)
- Future Work
  1. The provenance of source code
  2. License maintenance and management
  3. The success of OSS

Study on Analysis of Open Source Software for Proper Reuse
Thank you for your attention!