Study on Reuse Issues in Open Source Software

Shi QIU (仇 実)
Advisor: Prof. Katsuro Inoue
Department of Computer Science, Graduate School of Information Science & Technology, Osaka University

Table of Contents
1. Introduction
2. Study on Copyright Inconsistency in the Linux Kernel
3. A Machine Learning Method for Automatic Copyright Notice Identification of Source Files
4. Study on Dependency-related License Violation in the JavaScript Package Ecosystem
5. Study on Popularity Growth of Packages in the JavaScript Package Ecosystem
6. Conclusion and Future Work

Introduction
• Software plays an important role in society today.
• Software reuse is popular, especially the reuse of open source software (OSS).
• Diagram: OSS projects are reused in your projects, with two associated benefits:
  1. Software quality
  2. Software productivity

Introduction
• Reuse issues in OSS need special care.
• Diagram: OSS projects are reused in your projects, which raises four groups of issues:
  - Copyright issues: copyright authorship, copyright inconsistency, copyright identification, …
  - License issues: license inconsistency, license violation, license identification, …
  - Maintenance issues: outdated dependency, software popularity, OSS ecosystem, …
  - Quality issues: security vulnerability, testing, …
• Consequences noted in the diagram: lawsuits and potential monetary penalties; side effects on software development.

Copyright Inconsistency
• Definition: the inconsistency between the holders named in the copyright notices and the contributors (committers) of the source code.
• Example file header:
     *
     * Copyright (C) 2003 Alice, All Rights Reserved.
     * Copyright (C) 2003 Bob, All Rights Reserved.
     * Copyright (C) 2003 Geoger, All Rights Reserved.
     *
     * This program is free software; you can redistribute it and/or modify
     * it under the terms of the GNU General Public License version 2 as
     * published by the Free Software Foundation.
• Committers of the same file: Alice: 23 lines, Daniel: 124 lines, Bob: 12 lines, Ella: 876 lines, Charlie: 8 lines, Frank: 1054 lines
• Two types: committer-not-holder inconsistency and holder-not-committer inconsistency.
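
The two inconsistency types can be read as two set differences between the holder names and the committer names of a file. The sketch below is a minimal illustration using the names from the example slide. The function name `copyright_inconsistency` and the exact-string name matching are assumptions made here, not the study's actual matching procedure.

```python
def copyright_inconsistency(holders, committers):
    """Compare the copyright holders of one file with its committers.

    Names are matched by exact string equality, which is a simplifying
    assumption for illustration only.
    """
    holders = set(holders)
    committers = set(committers)
    return {
        # Named in a copyright notice but never committed to the file.
        "holder_not_committer": sorted(holders - committers),
        # Committed to the file but not named in any copyright notice.
        "committer_not_holder": sorted(committers - holders),
    }


# Data taken from the example slide.
holders = ["Alice", "Bob", "Geoger"]
committed_lines = {
    "Alice": 23, "Daniel": 124, "Bob": 12,
    "Ella": 876, "Charlie": 8, "Frank": 1054,
}

result = copyright_inconsistency(holders, committed_lines)
print(result)
```

In this example, Geoger is a holder who did not commit, while Daniel, Ella, Charlie, and Frank are committers who are not holders.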

Analysis Method
• Research questions
  - RQ1: How prevalent are copyright inconsistencies?
  - RQ2: What caused the copyright inconsistencies?
• Dataset construction
  - Linux kernel version 4.14 (Nov 13, 2017): 45,477 files
  - Selected: 414 source files whose life cycle can be entirely traced

Analysis Method: RQ1
• Example copyright notices:
     Copyright (C) 2003 Alice, All Rights Reserved.
     Copyright (C) 2005 Oracle, All Rights Reserved.
     Copyright © 2003-2014 TI, All Rights Reserved.
• Source files -> FOSSology -> the holder dataset (name)
• Source files -> cregit and GitHub -> the committer dataset (name, organization, #tokens, and the proportion)
• Comparing the holder dataset with the committer dataset yields the copyright inconsistencies.
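
To make the "holder dataset" step concrete, the sketch below pulls holder names out of notices shaped like the three examples on the slide. It is a plain regular expression written for this illustration and is not how FOSSology works. The pattern `NOTICE` and the function `extract_holders` are hypothetical names, and real notices vary far more than this pattern allows.

```python
import re

# Matches "Copyright (C) <year or year range> <holder>," as in the slide examples.
NOTICE = re.compile(
    r"Copyright\s+(?:\(C\)|©)\s+"             # "(C)" or the copyright sign
    r"(?P<years>\d{4}(?:\s*-\s*\d{4})?)\s+"   # single year or year range
    r"(?P<holder>[^,\n]+)",                   # holder name up to the comma
    re.IGNORECASE,
)


def extract_holders(text):
    """Return the holder names found in copyright notices, in order."""
    return [m.group("holder").strip() for m in NOTICE.finditer(text)]


sample = """\
Copyright (C) 2003 Alice, All Rights Reserved.
Copyright (C) 2005 Oracle, All Rights Reserved.
Copyright © 2003-2014 TI, All Rights Reserved.
"""

print(extract_holders(sample))
```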

Analysis Method: RQ2
• Manually check the commit logs and the source code comments of all 134 source files detected as having copyright inconsistency.

Results: RQ1
• Figure (chart over the 414 source files): the categories are holder-not-committer inconsistency, committer-not-holder inconsistency, and copyright inconsistency; the values shown are 134 (32.4%), 229 (55.3%), and 262 (63.3%).
• Answer 1: Copyright inconsistency is prevalent in the Linux kernel.

Results: RQ2
• Categories examined: holder-not-committer inconsistency and committer-not-holder inconsistency.
• Answer 2: Code reuse, affiliation change, refactoring, support function, and others' contributions are the main reasons.

A Machine Learning Method for Automatic Copyright Notice Identification of Source Files (Chapter 3)
• Overview: source file -> copyright-related sentences -> vectors (e.g., <1, 1, 0, 2, …>) -> supervised classifier -> copyright notice
• Supervised classifiers: Naive Bayes, Decision Tree, Random Forest, SVM
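
The "sentence -> vector" step in the overview can be pictured as a bag-of-words count vector, which is one common way to obtain vectors such as <1, 1, 0, 2, …>. The slides do not state which features the method actually uses. The vocabulary `VOCAB` and the functions below are assumptions for illustration only.

```python
import re
from collections import Counter

# Hypothetical vocabulary; the features used in the study are not given in the slides.
VOCAB = ["copyright", "c", "rights", "reserved", "license"]


def tokenize(sentence):
    """Lower-case the sentence and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", sentence.lower())


def vectorize(sentence, vocab=VOCAB):
    """Count how often each vocabulary word occurs in the sentence."""
    counts = Counter(tokenize(sentence))
    return [counts[word] for word in vocab]


notice_vec = vectorize("Copyright (C) 2003 Alice, All Rights Reserved.")
other_vec = vectorize("This program is free software.")
print(notice_vec, other_vec)
```

Vectors of this kind, paired with labels, are what a supervised classifier such as Naive Bayes or a decision tree would be trained on.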

Introduction
• OSS license
  - Describes the terms and conditions under which OSS is used, modified, and shared.
  - Licenses may be incompatible with each other.
• OSS ecosystem
  - Consists of software projects that are developed and evolve together in a shared environment.
  - Software projects may depend on each other.

Dependency-related License Violation
• Definition: the situation in which the license of an OSS package is not compatible with the license of its dependency.
• Example: Package A (MIT License) reuses Package B (GPL-3.0+ License) as a dependency, which forces Package A to be licensed under GPL-3.0 or a later version as well.
• Two types: license violation caused by a direct dependency and license violation caused by an indirect dependency.

Research Questions
• RQ1: How prevalent are dependency-related license violations?
• RQ2: What are the developers' attitudes towards dependency-related license violations?

RQ1: The Prevalence
• Dataset construction
• Method
  1. Constructing the license dictionary
     - GPL-2, GPLv2, GPL2, GNU GPL-2.0, GPL version 2, … -> GPL-2.0
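
A license dictionary of this kind maps the many spellings found in package metadata to one canonical identifier. The sketch below uses only the GPL-2.0 variants listed on the slide. The names `LICENSE_DICTIONARY`, `LOOKUP`, and `normalize_license`, and the choice to ignore case, spaces, and punctuation when comparing spellings, are assumptions made for this example.

```python
import re

# Canonical identifier -> spellings seen in metadata (variants from the slide).
LICENSE_DICTIONARY = {
    "GPL-2.0": ["GPL-2", "GPLv2", "GPL2", "GNU GPL-2.0", "GPL version 2"],
}


def _key(name):
    """Reduce a spelling to lower-case letters and digits for comparison."""
    return re.sub(r"[^a-z0-9]", "", name.lower())


# Reverse index: normalized spelling -> canonical identifier.
LOOKUP = {}
for canonical, variants in LICENSE_DICTIONARY.items():
    for spelling in [canonical, *variants]:
        LOOKUP[_key(spelling)] = canonical


def normalize_license(raw):
    """Return the canonical identifier, or None if the spelling is unknown."""
    return LOOKUP.get(_key(raw))


print(normalize_license("GPLv2"), normalize_license("gpl version 2"))
```

Unknown spellings return `None` rather than a guess, so they can be reviewed by hand and added to the dictionary.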

RQ1: The Prevalence
• Method
  2. Constructing the license compatibility network
     - Built over 18 popular licenses (MIT, GPL 2.0, Apache 2.0, …)
     - Diagram, first pair of nodes: Package 1 (version 1.0.1, license MIT) and Package 1 (version 1.0.1, license GPL-2.0)
     - Diagram, second pair of nodes: Package 1 (version 1.0.1, license MIT) and Package 2 (version 1.0.1, license GPL-2.0)
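
A license compatibility network can be stored as a directed graph in which an edge from license X to license Y means that code under X may be included in a work distributed under Y. The sketch below covers an illustrative subset of four licenses, not the 18-license network from the study. The edge set reflects commonly cited compatibility rules and is an assumption of this example, as are the names `COMPATIBLE_INTO` and `is_compatible`.

```python
# Edge X -> Y: code licensed under X may be included in a work licensed under Y.
# Illustrative subset only; the study's network covers 18 popular licenses.
COMPATIBLE_INTO = {
    "MIT": {"MIT", "Apache-2.0", "GPL-2.0", "GPL-3.0"},
    "Apache-2.0": {"Apache-2.0", "GPL-3.0"},
    "GPL-2.0": {"GPL-2.0"},
    "GPL-3.0": {"GPL-3.0"},
}


def is_compatible(dependency_license, package_license):
    """True if a package under package_license may depend on dependency_license."""
    return package_license in COMPATIBLE_INTO.get(dependency_license, set())


# A GPL-2.0 package may depend on an MIT package, but not the other way round.
print(is_compatible("MIT", "GPL-2.0"), is_compatible("GPL-2.0", "MIT"))
```

The direction matters: permissive licenses flow into copyleft ones, while a copyleft dependency constrains the license of the package that uses it.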

RQ1: The Prevalence
• Method
  3. Constructing the historical meta-dataset
     Name: package 7
     Version  License  Dependency (version)
     1.0.1    MIT      package 8 (1.0.1), package 9 (2.3.1)
     1.0.2    MIT      package 8 (1.0.2)
     1.1.0    GPL-2.0  package 9 (2.4.0), package 10 (1.0.1)
     …
  4. Constructing the dependency network
     - Package 1 (version 1.0.1, license MIT)
     - Package 2 (version 1.2.1, license MIT)
     - Package 3 (version 2.0.1, license GPL-2.0)
     - Package 4 (version 1.2.3, license GPL-3.0)
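
One way to go from the historical meta-dataset to a dependency network is to key each record by (name, version) and walk the dependency lists breadth-first, separating direct from indirect dependencies. The sketch below uses the three package 7 rows from the slide. The extra record for package 9 and its dependency package 11 are hypothetical, added only so that an indirect dependency exists. Pinned dependency versions are also a simplification, since npm normally declares version ranges.

```python
from collections import deque

# (name, version) -> license and dependencies. The package 7 rows come from the slide.
META = {
    ("package 7", "1.0.1"): {"license": "MIT",
                             "deps": [("package 8", "1.0.1"), ("package 9", "2.3.1")]},
    ("package 7", "1.0.2"): {"license": "MIT",
                             "deps": [("package 8", "1.0.2")]},
    ("package 7", "1.1.0"): {"license": "GPL-2.0",
                             "deps": [("package 9", "2.4.0"), ("package 10", "1.0.1")]},
    # Hypothetical record, not from the slides, to illustrate an indirect dependency.
    ("package 9", "2.4.0"): {"license": "MIT",
                             "deps": [("package 11", "0.1.0")]},
}


def split_dependencies(meta, root):
    """Return (direct, indirect) dependencies of root via breadth-first search."""
    direct = list(meta.get(root, {}).get("deps", []))
    seen = set(direct) | {root}
    queue = deque(direct)
    indirect = []
    while queue:
        node = queue.popleft()
        for dep in meta.get(node, {}).get("deps", []):
            if dep not in seen:
                seen.add(dep)
                indirect.append(dep)
                queue.append(dep)
    return direct, indirect


direct, indirect = split_dependencies(META, ("package 7", "1.1.0"))
print(direct, indirect)
```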

RQ1: The Prevalence
• Method
  5. Dependency-related license violation detection
     - Inputs: the historical meta-dataset, the license compatibility network, and the dependency network
     - Dependency network: Package 1 (version 1.0.1, license MIT), Package 2 (version 1.2.1, license MIT), Package 3 (version 2.0.1, license GPL-2.0), Package 4 (version 1.2.3, license GPL-3.0)
     - Report:
          Name: Package 1
          License: MIT
          -------------------
          Direct risks: Package 4 (GPL-3.0)
          Indirect risks: Package 3 (GPL-2.0)
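
The detection step can be sketched as a breadth-first walk over the dependency network that checks each dependency's license against the package's own license. The licenses below come from the slide. The dependency edges in `DEPENDS_ON` are assumed here (Package 1 depends on Package 2 and Package 4, and Package 2 depends on Package 3), chosen so that the result agrees with the report on the slide. The compatibility subset and all function names are likewise illustrative.

```python
from collections import deque

LICENSES = {"Package 1": "MIT", "Package 2": "MIT",
            "Package 3": "GPL-2.0", "Package 4": "GPL-3.0"}

# Assumed edges; the slide text does not preserve the arrows of the diagram.
DEPENDS_ON = {"Package 1": ["Package 2", "Package 4"],
              "Package 2": ["Package 3"]}

# Edge X -> Y: code under X may be included in a work under Y (illustrative subset).
COMPATIBLE_INTO = {"MIT": {"MIT", "GPL-2.0", "GPL-3.0"},
                   "GPL-2.0": {"GPL-2.0"},
                   "GPL-3.0": {"GPL-3.0"}}


def detect_risks(package):
    """Return (direct, indirect) dependencies whose license conflicts with package."""
    own = LICENSES[package]
    direct, indirect = [], []
    seen = {package}
    queue = deque((dep, 1) for dep in DEPENDS_ON.get(package, []))
    while queue:
        dep, depth = queue.popleft()
        if dep in seen:
            continue
        seen.add(dep)
        if own not in COMPATIBLE_INTO.get(LICENSES[dep], set()):
            (direct if depth == 1 else indirect).append(dep)
        queue.extend((d, depth + 1) for d in DEPENDS_ON.get(dep, []))
    return direct, indirect


def report(package):
    """Format the risks in the same layout as the report on the slide."""
    direct, indirect = detect_risks(package)
    fmt = lambda deps: ", ".join(f"{d} ({LICENSES[d]})" for d in deps) or "none"
    return (f"Name: {package}\nLicense: {LICENSES[package]}\n"
            f"Direct risks: {fmt(direct)}\nIndirect risks: {fmt(indirect)}")


print(report("Package 1"))
```

Tracking the depth at which each dependency is first reached is what separates direct risks (depth 1) from indirect ones.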

RQ1: The Prevalence
• Observations of RQ1
  - Only a few packages in npm (2,704 packages, 0.644%) have dependency-related license violations.
  - Including packages licensed under copyleft licenses as dependencies is strongly associated with the occurrence of dependency-related license violations.

RQ2: Preliminary Questionnaire
• Questionnaire design
  - Q1: Do you think this is a kind of risk? If so, were you aware of this kind of risk when you were developing your packages?
  - Q2: In this question, you can share with us anything you want to say about this kind of risk.
• Data collection: 2,704 packages -> 100 email invitations -> 20 responses

RQ2: Preliminary Questionnaire
• Observations of Q1: Developers overlook and misunderstand dependency-related license violations for various reasons.
• Observations of Q2: Managing dependency-related license violations is difficult, and developers are asking for help.

Conclusion and Future Work
• Conclusion
  1. Copyright inconsistency (the Linux kernel)
  2. Automatic copyright notice identification (the Linux kernel)
  3. Dependency-related license violation (npm)
  4. The popularity growth (npm)
• Future work
  1. The provenance of source code
  2. License maintenance and management
  3. The success of OSS

Study on Analysis of Open Source Software for Proper Reuse
Thank you for your attention!