Follow the Software: Spillovers and Knowledge Transfer*

Neil Gandal, Tel Aviv University, gandal@tau.ac.il
Peter Naftaliev, Tel Aviv University, peternaftaliev@gmail.com
Uriel Stettner, Tel Aviv University, urielste@tau.ac.il

Tel Aviv University, March 2017

* We acknowledge the support of Israel Science Foundation grants 1287/12 and 1069/15.
© Neil Gandal & Uriel Stettner. All rights reserved.

Open Source Software is Everywhere

Brief Introduction to Open Source Software (OSS)
§ Major products like Linux and Apache, and many less well-known products
§ Open source code also appears in most proprietary electronic devices.
§ Open source remedies a defect of IP protection (which does not require disclosure of source code).
§ User innovation goes back to Victorian times.
§ European folklore – incremental improvements

Innovation in OSS: Community-based Virtual Organizations (VO)
§ Community-based Virtual Organization: Semi-structured group of geographically dispersed and skilled individuals working on interdependent tasks using an informal, non-hierarchical, and decentralized structure.
§ Disadvantages of Virtual Organizations
  § Lack of face-to-face communication negatively impacts the team’s ability to form personal relations
  § Lack of strong connections negatively impacts commitment
  § Lack of social support limits trust & willingness to share knowledge
§ Advantages of Virtual Organizations
  § No moral hazard: contribution of individuals is measurable
  § Intellectual property barriers are less likely to adversely affect innovation in OSS

Knowledge Spillovers in Open Source Projects
§ 1. Reuse of software code: Programmers take code from one project and employ it in another project that they are working on. (No previous work.)
§ 2. Programmers who work on multiple projects may transfer knowledge (other than code) among the projects they work on.
§ It is important to examine both channels.
§ Previous work on channel #2: Programmers take knowledge from one open source project they work on to another open source project they work on (Fershtman and Gandal 2011; Gandal and Stettner 2016a).
§ Until recently, it was not possible to follow code from 9,000,000 files…

Tracing the Code from Project to Project
§ Using custom software and large-scale text/data mining applications, we trace the progression of code among projects.
§ We know whether a “unit of software code” initially developed in project one was also used in project two.
§ This enables us to determine whether the spillovers among projects were due to contributors porting code from one project to another.
§ Are there spillovers from code reuse? If so, this will confirm a key story in the open source world that has never been verified.

Measuring Spillovers from Software Reuse
§ “Following the Code” was a very difficult and time-consuming process; it involved comparing a huge number of file pairs for similarity.
§ In this paper, we chose to focus on one programming language, Java.
§ We have approximately 9,000,000 Java files and hence had to compare file similarity across approximately 81 trillion file pairs.
§ 9,000 Java projects + data for four years. Thus, on average, open source projects in our data have approximately 250 files each.
§ Given the size of the data set, all actions were executed by a set of programs we created specifically for these tasks.
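The figures on this slide can be checked with a little arithmetic. The sketch below assumes that the 250-file average applies to each project in each of the four years, which is an interpretation of the slide rather than something it states outright. Under that reading the data contain about 9 million files, and squaring that count gives the 81 trillion pairs quoted above. Counting each unordered pair only once would give roughly half that number.

```python
# Back-of-the-envelope check of the data-set size (interpretation, not the authors' code).
projects = 9_000          # Java projects, from the slide
years = 4                 # 2005 through 2008
files_per_project = 250   # approximate average, assumed to hold per project and year

total_files = projects * years * files_per_project
ordered_pairs = total_files ** 2                         # every file against every file
unordered_pairs = total_files * (total_files - 1) // 2   # each distinct pair counted once

print(f"files:           {total_files:,}")
print(f"ordered pairs:   {ordered_pairs:,}")
print(f"unordered pairs: {unordered_pairs:,}")
```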

Measuring Spillovers from Software Reuse
§ For example, the Java code file “A.java” will be similar to “B.java” if they share similar function names, variable names, code fragments, and comments within the code.
§ To make this comparison, we used Apache Solr, an open source distributed text search engine built on Lucene.
§ The process is able to index very large text files and provides searching capabilities over the text.
§ Method: Every word in each document is assigned two numbers:
  § (1) Term Frequency (TF): the number of times the word appears in the document relative to all other words in the document
  § (2) Inverse Document Frequency (IDF): based on the number of documents in the entire text universe (i.e., all files) in which the word appears; words that appear in fewer documents receive more weight
§ For a pair of files, this creates two TF-IDF vectors. Their dot product divided by the product of their norms is the cosine of the angle between the two vectors, and this cosine is the “similarity score” for the pair.
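The TF-IDF and cosine steps can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the Solr/Lucene scoring the authors ran: the tokenizer, the smoothed IDF formula, and the three toy snippets are all hypothetical choices made for the example.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Split source text into lowercase identifier-like tokens (assumed tokenizer)."""
    return re.findall(r"[A-Za-z_][A-Za-z0-9_]*", text.lower())

def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (a dict term -> weight) per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(docs)
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1
        vectors.append({
            # TF is the term's share of the document; IDF uses a smoothed log (assumption).
            term: (count / total) * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in counts.items()
        })
    return vectors

def cosine_similarity(u, v):
    """Dot product of two sparse vectors divided by the product of their norms."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical file contents: a_java and b_java are near copies, c_java is unrelated.
a_java = "public int sumValues(int[] values) { int total = 0; for (int v : values) total += v; return total; }"
b_java = "public int sumValues(int[] values) { int total = 0; for (int v : values) { total += v; } return total; } // sum of values"
c_java = 'public String greet(String name) { return "Hello " + name; }'

vec_a, vec_b, vec_c = tfidf_vectors([a_java, b_java, c_java])
score_ab = cosine_similarity(vec_a, vec_b)
score_ac = cosine_similarity(vec_a, vec_c)
print(f"A vs B: {score_ab:.3f}  A vs C: {score_ac:.3f}")
```

Because all weights are non-negative, the score lies between 0 and 1, with near copies scoring close to 1.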

Defining Software Reuse
§ Next, two trained software developers examined 30 randomly selected file pairs (across projects) with (essentially) identical similarity scores. (We made an initial guess for the cut-off.)
§ If one of the pairs was not reused software, another group of 30 randomly selected file pairs with a higher common similarity score was examined, in order to identify the threshold above which pairs of files are defined as involving reused software.
§ This process was done by experts and involved cross consultation and careful deliberation.
§ Determining a lower bound for the cut-off was essential for automating the pairwise comparison across the approximately 81 trillion pairs of files.
§ It was important to us that the cut-off was determined by experts who examined the file pairs one by one.
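The calibration loop described above can be sketched as follows. This is only an illustration of the procedure's logic: in the study the judgement was made by human experts, so the `is_reuse` callback, the `pairs_by_score` structure, and the toy data are hypothetical stand-ins.

```python
import random

def calibrate_cutoff(pairs_by_score, is_reuse, start, sample_size=30, seed=0):
    """Return the lowest score at or above `start` whose sampled pairs are all reuse.

    pairs_by_score: dict mapping a similarity score to the file pairs that have
                    (essentially) that score.
    is_reuse:       callable standing in for the experts' one-by-one judgement.
    """
    rng = random.Random(seed)
    for score in sorted(s for s in pairs_by_score if s >= start):
        candidates = pairs_by_score[score]
        sample = rng.sample(candidates, min(sample_size, len(candidates)))
        if all(is_reuse(pair) for pair in sample):
            return score   # every examined pair was reused software
        # Otherwise move on to a group with a higher common score.
    return None

# Toy data: each pair is (file_1, file_2, expert_verdict).
toy_pairs = {
    0.70: [("a1", "b1", False), ("a2", "b2", True)],
    0.80: [("a3", "b3", True), ("a4", "b4", True)],
    0.90: [("a5", "b5", True)],
}
cutoff = calibrate_cutoff(toy_pairs, is_reuse=lambda pair: pair[2], start=0.70)
print(cutoff)
```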

Example of File-Pair with Similarity Score above Cutoff

Example: Construction of reuse variables
§ Note: Since project B copied two software files (W and X) from project A, the variable “reuse_in_2” takes on the value 1 for project B in 2008.
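One way to picture the construction is as a pass over detected reuse events. The sketch below is an assumption-laden illustration: the event tuple layout is hypothetical, and it builds plain `reuse_in` and `reuse_out` indicators per project and year, treating `reuse_out` as the mirror image of `reuse_in`. It does not attempt to reproduce the exact definition behind the “reuse_in_2” name on the slide.

```python
from collections import defaultdict

# Hypothetical detected reuse events: (year, source_project, destination_project, file_name)
events = [
    (2008, "A", "B", "W"),
    (2008, "A", "B", "X"),
]

def reuse_indicators(events):
    """Build 0/1 indicators keyed by (project, year).

    reuse_in  = 1 if the project took code from another project that year.
    reuse_out = 1 if another project took code from it that year (assumed mirror).
    """
    reuse_in = defaultdict(int)
    reuse_out = defaultdict(int)
    for year, source, destination, _file_name in events:
        reuse_in[(destination, year)] = 1
        reuse_out[(source, year)] = 1
    return dict(reuse_in), dict(reuse_out)

reuse_in, reuse_out = reuse_indicators(events)
print(reuse_in, reuse_out)
```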

Research Setting, Sample and Data Sources
§ Panel data from: www.sourceforge.net
§ List and meta data on contributors (e.g., join date, function, and location)
§ Descriptive data on approximately 30,000 projects between 2005 and 2008 (IP license, # of developers, stage, # downloads, …)
§ Panel data focusing on differences over time used to construct two distinct two-mode networks:

Project Network
Contributors per project    Percent of total projects
1                           69.9
2                           14.4
3-4                          9.2
5-9                          4.8
10 or more                   1.7

Contributor Network
Projects per contributor    Percent of contributors
1                           77.2
2                           14.1
3-4                          6.5
5-9                          1.9
10 or more                   0.2

Table 1: Distribution of components in project networks (December 2008)

Descriptive Statistics: Giant vs. Non-Giant
§ Most empirical networks have one giant component and many very small ones.
§ Projects in the giant component have on average many more downloads than projects outside of the giant component (151,928 vs. 10,092).
§ Projects in the giant component receive on average 1,396 modifications, compared to 353 for projects outside of the giant component.
§ Projects in the giant component receive on average 799 additions, compared to 225 for projects outside of the giant component.
§ Further, projects in the giant component have on average
  § (i) more contributors (4.84 vs. 1.89),
  § (ii) a larger degree (7.06 vs. 1.35),
  § (iii) a greater number of contributors who work on 5 or more projects (0.52 vs. 0.09).

Descriptive Statistics: Software Reuse
§ In the giant component, in 2008, 17% of the projects reused code from other projects. Outside of the giant component, 7% of the projects reused code from other projects. The percentages are similar for “reuse_out”.
§ For projects in the giant component that did not reuse software, the median number of downloads was 1,772, while for projects that reused software, the median number of downloads was 8,423.
§ For projects outside of the giant component, the median number of downloads for projects that did not reuse software was 714, while for projects that reused software, the median number of downloads was 2,553.
§ reuse_in, reuse_out, degree, “many_projects”, additions, and modifications are positively correlated with success.

Network Variables
§ We employ two project network centrality measures:
  § (i) degree (i.e., # of projects with common developers)
  § (ii) closeness (i.e., the inverse of the sum of all distances between a focal project and all other projects, multiplied by the number of other projects)
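Both measures can be computed from a contributor-to-project membership list. The sketch below is illustrative only: the membership data are made up, two projects are linked when they share at least one contributor, and closeness is computed over the projects reachable from the focal project (an assumption, since distances to unreachable projects are undefined).

```python
from collections import defaultdict, deque
from itertools import combinations

def project_network(memberships):
    """Link two projects whenever at least one contributor works on both.

    memberships: dict mapping a contributor to the set of projects they work on.
    """
    neighbors = defaultdict(set)
    for projects in memberships.values():
        for p in projects:
            neighbors[p]  # touch the key so isolated projects still appear
        for p, q in combinations(sorted(projects), 2):
            neighbors[p].add(q)
            neighbors[q].add(p)
    return neighbors

def degree(neighbors, project):
    """Number of projects that share a contributor with `project`."""
    return len(neighbors[project])

def closeness(neighbors, project):
    """(Number of other reachable projects) / (sum of shortest-path distances to them)."""
    dist = {project: 0}
    queue = deque([project])
    while queue:  # breadth-first search gives shortest path lengths
        current = queue.popleft()
        for nxt in neighbors[current]:
            if nxt not in dist:
                dist[nxt] = dist[current] + 1
                queue.append(nxt)
    total = sum(dist.values())
    return (len(dist) - 1) / total if total else 0.0

# Hypothetical memberships: P1 - P2 - P3 form a chain, P4 is isolated.
memberships = {"ann": {"P1", "P2"}, "bob": {"P2", "P3"}, "cat": {"P4"}}
net = project_network(memberships)
for name in sorted(net):
    print(name, degree(net, name), round(closeness(net, name), 3))
```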

Graph of largest thickly connected network

Fershtman & Gandal (2011): Theoretical Foundation for Spillovers
§ Project receives spillovers from connected projects: Si = α + β*Di
§ Project receives spillovers from all projects (decay):

Theoretical Foundation
§ Direct and indirect spillovers have different impacts:
  § When γ>0: direct & indirect spillovers
  § β>0, γ>0: 'hyperbolic' (strong) spillovers
  § β>0, γ=0: direct spillovers only
  § β=0, γ=0: no spillovers

Analysis
Estimation equation:
ldownloads = α + β0 + β1 cpp + β2 degree + β3 closeness + β4 Many_Projects + β5 Stage + β6 lyears_since + β7 num_mods + β8 num_adds + β9 reuse_in + β10 reuse_out + β11 single + β12 DYEAR + ε
We estimate for two cases (2005-2008):
§ Case I: Projects outside of the giant component
§ Case II: Projects in the giant component
§ We use a fixed effects model
§ We employ a novel test for endogeneity from reverse causality (GS 2016a)
§ Note: degree and closeness are for the full network
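The idea behind a fixed effects model can be shown with a single regressor. The sketch below is a toy version of the within estimator, not the authors' specification: it subtracts each project's own mean from x and y, which removes any time-invariant project effect, and then fits a slope on the demeaned data. The data and the single-regressor setup are hypothetical; the equation above has many regressors plus year dummies.

```python
from collections import defaultdict

def within_estimator(rows):
    """Slope from regressing project-demeaned y on project-demeaned x.

    rows: iterable of (project_id, x, y) observations, several per project.
    """
    groups = defaultdict(list)
    for project_id, x, y in rows:
        groups[project_id].append((x, y))
    sxy = sxx = 0.0
    for obs in groups.values():
        mean_x = sum(x for x, _ in obs) / len(obs)
        mean_y = sum(y for _, y in obs) / len(obs)
        for x, y in obs:
            # Demeaning removes the project-specific intercept.
            sxy += (x - mean_x) * (y - mean_y)
            sxx += (x - mean_x) ** 2
    if sxx == 0:
        raise ValueError("no within-project variation in x")
    return sxy / sxx

# Toy panel: y = project effect + 0.5 * x, with project effects 10 (P1) and 3 (P2).
panel = [
    ("P1", 0, 10.0), ("P1", 1, 10.5), ("P1", 2, 11.0),
    ("P2", 1, 3.5), ("P2", 3, 4.5),
]
slope = within_estimator(panel)
print(slope)
```

The very different project intercepts do not affect the estimate, which is the reason for using fixed effects when unobserved project quality may be correlated with the regressors.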

Table 2: Results Explaining Success of OSS Projects

Summary of Results (Table 2)
§ The greater the reuse of code, the more successful the project is. The result is highly significant in all regressions.
§ This provides the first econometric evidence of the prevalence of and the benefits from software reuse.
§ Degree centrality is positively associated with the number of downloads, and this association is statistically significant even after taking into account spillovers from software reuse.
§ The positive coefficients on “reuse_in” and degree suggest that both spillover channels discussed in the introduction provide benefits.
§ reuse_out is positively correlated with success in the raw data. Reassuringly, once we control for other factors that lead to success, reuse_out is not significant in explaining success.

Reuse by Non-Neighbors – Regression #3 Table 2
§ We split the “reuse_in” variable into two categories: (I) software reuse from connected projects, i.e., reuse from a project with which the project has a contributor in common, and (II) software reuse from unconnected projects.
§ 16% (6%) of the projects in (out of) the giant component reused code from unconnected projects. 3% (1%) of the projects in (out of) the giant component reused code from connected projects.
§ Regression #2 in Table 2: reuse_in from unconnected projects only
§ Reuse from unconnected projects is statistically significant in explaining project success (coefficient = 0.041, t = 2.31).

Correlations: Giant component, 2008 (N = 3,276)

          | downloads    degree     close       cpp    stars5     stage      adds      mods  reuse_in reuse_out
----------+----------------------------------------------------------------------------------------------------
downloads |    1.0000
degree    |    0.0575    1.0000
closeness |    0.0453    0.4234    1.0000
cpp       |    0.0748    0.6018    0.2642    1.0000
stars5    |    0.0224    0.4891    0.2891    0.1480    1.0000
stage     |    0.0260    0.1235    0.1021    0.0889    0.0793    1.0000
adds      |    0.0470    0.2359    0.1027    0.3857    0.0645    0.0915    1.0000
mods      |    0.0916    0.2563    0.1310    0.3575    0.0695    0.0933    0.5312    1.0000
reuse_in  |    0.1544    0.1701    0.0834    0.2437    0.0720    0.0719    0.3875    0.2320    1.0000
reuse_out |    0.1706    0.1341    0.1124    0.1657    0.0281    0.0602    0.1370    0.1551    0.3578

Descriptive Statistics

FE Regressions for the Giant and Non-Giant Separately

Largest Thickly Connected Component several projects

Table 3: Explaining Reuse Out

Reuse Out – Intermediate measure of success
§ From Table 3: Dependent variable is Reuse Out
§ Older projects are more likely to have their code reused, both for projects in the giant component and for projects outside of the giant component.
§ Position in the network (degree and/or closeness) is significantly associated with code reuse by others. The effect obtains for closeness in the giant component and for degree for projects outside of the giant component.
§ The greater the number of contributors to a project, the more its software is used by other projects (result holds for giant component