Ayelet Israeli and Dror G. Feitelson, “The Linux Kernel as a Case Study in Software Evolution”. Journal of Systems and Software 83(3), pp. 485-501, Mar 2010. Presented by Dror Feitelson.

Synopsis
• A study of 810 versions of the Linux kernel released over 14 years, comparing the evolution of the system to Lehman’s Laws of software evolution.
• Conclusion: several laws are supported by the data.
• Observation: average complexity is decreasing with time.

Linux Background
• First announced August 1991
• First release March 1994
• Dual release scheme till 2003
  – Odd versions are development (1.1, 1.3, 2.1, 2.3, 2.5)
  – Even versions are production (1.0, 1.2, 2.0, 2.2, 2.4)
• New release scheme in 2.6
  – New version every 2-3 months
  – Development is distributed, no official releases
• Full source code of all versions available online

Linux Kernel Versions
• Paper used all 810 versions from March 1994 to August 2008 (all .h and .c files)
  – 144 production
  – 429 development
  – 237 of 2.6
• Unprecedented scale of investigation
• Other researchers used only production versions, or only .c files, or a sample of versions
• Some versions (test kernels and release candidates) were missed

Kernel version locations on www.kernel.org
Versions: 1.0, 1.1, 1.2, 1.3, 2.0, 2.1, 2.2, 2.3, 2.4, 2.5, 2.6
• v1.0/linux-1.0
• v1.1/v1.1.0
• v1.1/linux-1.1.*
• v1.2/linux-1.2.*
• v1.3/linux-1.3.*
• v1.3/linux-pre2.0.*
• v2.0/linux-2.0.*
• v2.1/linux-2.1.*
• v2.1/linux-2.2.0-pre*
• v2.2/linux-2.2.*
• v2.3/linux-2.3.99-pre*
• v2.4/old-test-kernels/linux-2.4.0-*
• v2.4/linux-2.4.*
• v2.5/linux-2.5.*
• v2.6/pre-releases/linux-2.6.0-test*
• v2.6/linux-2.6.*
• v2.6/testing/v2.6.*/linux-2.6.*-rc*
• v2.6/longterm/v2.6.*/linux-2.6.*

CPP Problems
• Kernel is littered with preprocessor directives
• Removed them in order to analyze all the code
  – This is what the developers see
• Sometimes this leads to incorrect syntax
  – Files where this happened were ignored
  – About 1.5% of the code
• Alternative is to perform preprocessing (used by others)
  – Induces code bloat (macros and #include)
  – Only one configuration of the system
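
The directive-removal step can be illustrated with a minimal sketch. The function name `strip_cpp_directives` and the sample snippet are hypothetical; the study's actual script is not shown in the slides. The sketch drops every line that starts with `#`, plus backslash-continued lines, and keeps the code of all conditional branches. That is why the result can be syntactically invalid (for example, both arms of an #ifdef/#else may open a brace).

```python
import re


def strip_cpp_directives(source: str) -> str:
    """Drop C preprocessor directive lines, including backslash continuations.

    Simplified on purpose: a '#' at the start of a line inside a multi-line
    comment or string would also be dropped.
    """
    kept = []
    in_directive = False
    for line in source.splitlines():
        if in_directive or re.match(r"\s*#", line):
            # A trailing backslash continues the directive onto the next line.
            in_directive = line.rstrip().endswith("\\")
            continue
        kept.append(line)
    return "\n".join(kept)


# Hypothetical input: code from both branches of the conditional survives.
example = """\
#ifdef CONFIG_SMP
    lock();
#else
    cli();
#endif
#define MAX(a, b) \\
    ((a) > (b) ? (a) : (b))
int x;
"""

print(strip_cpp_directives(example))
```

Running a real preprocessor instead would keep only one branch per configuration, which is the trade-off the slide describes.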

Evolution Background
• Textbooks: Software developed in well-defined phases:
  – Elicit requirements
  – Create specifications
  – Design the system
  – Implement
  – Test and correct
  – Install and maintain
• Reality: Software evolves:
  – Start with a small useful project
  – Users will introduce new requirements
  – Adapt the system to do what is needed
  – Needs cannot be anticipated in advance

Three Types of Programs
• S-type: derived from well-defined formal specifications
• P-type: can’t derive a formal solution, so use an iterative process to find and refine a solution
• E-type: a program that becomes embedded in its environment and changes with it; mechanizing an activity changes it and induces new requirements
• Continuous evolution instead of development and then maintenance

Early Data
• Data on OS/370
  – Size in modules
  – As function of release serial no.
• Main results
  – Steady growth
  – Ripple effect
  – Instability in late releases

Lehman's Laws
1) Continuing change (adaptation)
2) Increasing complexity (unless refactored)
3) Self regulation (of rate of change)
4) Invariant work rate (inertia)
5) Conservation of familiarity (of users and developers)
6) Continuing growth (more features)
7) Declining quality (unless maintained)
8) Feedback system (at multiple levels)

The Idea
• Lehman used little data from closed-source systems
• A lot of data is now available
• Use Linux data to see if it supports Lehman’s Laws
• In particular try to use software metrics to quantify the laws

Law VI: Continuing Growth
The functional capability of E-type systems must be continually enhanced to maintain user satisfaction over system lifetime
• Law requires new functionality to be added
• Can also be interpreted as requiring growth in size
• Are the two interpretations equivalent?

Lehman’s Data
• Size in modules as function of release number for OS/360 and other systems
• Grows, but growth rate often seen to decline
  – Though not for OS/360
  – Turski suggested inverse square law: s_{i+1} = s_i + E / s_i^2
  – Idea: effort E is spent on all possible interactions among s_i modules
  – Leads to a model where size grows as the cube root of the release number
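
To make the inverse square law concrete, here is a small simulation sketch. It assumes the usual statement of Turski's model, s_{i+1} = s_i + E / s_i^2, where s_i is the size at release i and E is a constant effort term; the starting size and effort values below are arbitrary illustration values, not data from Lehman or from the paper. Treating the recurrence as ds/di = E / s^2 gives s^3 = s_0^3 + 3Ei, so size should track the cube root of the release number.

```python
def turski_sizes(s0: float, effort: float, releases: int) -> list[float]:
    """Iterate s_{i+1} = s_i + effort / s_i**2 for the given number of releases."""
    sizes = [s0]
    for _ in range(releases):
        s = sizes[-1]
        sizes.append(s + effort / (s * s))
    return sizes


def cube_root_approx(s0: float, effort: float, release: int) -> float:
    """Continuous approximation: s(i) = (s0**3 + 3 * effort * i) ** (1/3)."""
    return (s0 ** 3 + 3.0 * effort * release) ** (1.0 / 3.0)


# Arbitrary illustration values (assumptions, not measured data).
sizes = turski_sizes(s0=10.0, effort=1000.0, releases=200)
increments = [b - a for a, b in zip(sizes, sizes[1:])]

print(f"simulated size after 200 releases: {sizes[-1]:.2f}")
print(f"cube-root approximation:           {cube_root_approx(10.0, 1000.0, 200):.2f}")
print(f"first increment {increments[0]:.2f}, last increment {increments[-1]:.4f}")
```

The increments shrink from release to release, which is the declining growth rate that this model predicts and that the Linux data later contradict.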

Law VI and Linux
• The dominant effect (so we deal with it first)
• The easiest to measure and analyze
  – If interpreted as size
• Growth is super-linear (quadratic?)
• Explained by positive feedback with growth of developer base
• Functional growth is harder to quantify

Godfrey & Tu Linux Data
• LOC or tarball size as function of date, 1994-2000
• Focus on development versions
• Growth rate seen to increase
• Fits quadratic model
• Largely verified by others
• Also for other (but not all) open-source systems

Release Number vs. Time
• Doesn’t matter if releases are regular
  – In Linux before 2.6 they are not
• Changes growth shape if irregular
• Question of interleaving multiple versions
  – Assume version 2.3 was released after 3.0
  – If sorted by number their order is reversed
  – Justified because related to 2.2, not to 3.0

Linux Growth Data

Super-linear growth
• Contradicts Lehman and Turski, who claimed growth should slow down due to increasing complexity

Functionality
• Previous results hold for all common size metrics
• Different results if we try to measure functional growth
• System calls are leveling out
  – Possibly reflects maturity, as predicted by Torvalds
• Config options are growing faster
  – Indicates growth is in internal mechanisms rather than user-visible services

System Calls

Config Options

Law I: Continuing Change
An E-type system must be continually adapted, else it becomes progressively less satisfactory in use
• This means that software must evolve
• “Adapt” implies keeping up with a changing environment

Law I and Linux
• Change is obviously true
  – In 2.6 a new version is released every 2-3 months
• Change is achieved through growth
• Adaptation to changing hardware environment
• Hard to distinguish adaptation from growth
  – Is adding support for sound cards a new feature or adaptation to a changing environment?

Adaptation to New Hardware
• Special case of operating system environment
• Confined to two subdirectories
  – arch (supported architectures)
  – drivers (supported peripherals)
• Together about 60% of the code
• Grow together with the rest of the system at about the same rate

arch + drivers vs. Whole Kernel

Law II: Increasing Complexity
As an E-type system is changed its complexity increases and it becomes more difficult to evolve unless work is done to maintain it and reduce the complexity
• Functionality costs in complexity
• Two-sided law: supported either way

Law II and Linux
• Complexity not necessarily increasing
• System is largely modular (e.g. no coupling between file systems, scheduler, and drivers)
• New functions being added are short and simple
• Growing number but reduced fraction of high-MCC functions
• Active work to reduce complexity

McCabe Cyclomatic Complexity (MCC)
• Introduced by McCabe in 1976
• Essentially counts the minimal number of paths through the code
• Suggestion: functions with MCC > 10 may require refactoring
• Easily calculated by counting predicates
  – All while, for, if, and case statements
• Widely used in tools and research
• Has been criticized, but no better alternatives
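
The predicate-counting rule can be sketched as follows. This is an illustrative approximation, not the authors' script: the function name `approx_mcc` and the sample C function are hypothetical, MCC is taken as one plus the number of `if`, `while`, `for`, and `case` keywords, and comments and string literals are stripped first so keywords inside them are not counted. Some tools also count `&&` and `||`; this sketch does not.

```python
import re

# Keywords that introduce a decision point, per the rule on the slide.
PREDICATE = re.compile(r"\b(?:if|while|for|case)\b")


def approx_mcc(function_body: str) -> int:
    """Approximate McCabe complexity of one C function body as 1 + predicates."""
    body = re.sub(r"/\*.*?\*/", " ", function_body, flags=re.S)  # block comments
    body = re.sub(r"//[^\n]*", " ", body)                        # line comments
    body = re.sub(r'"(?:\\.|[^"\\])*"', '""', body)              # string literals
    return 1 + len(PREDICATE.findall(body))


# Hypothetical function: one if, one while, two case labels.
example_function = """
int f(int x) {
    /* if in a comment is ignored */
    if (x > 0) {
        while (x--) g();
    }
    switch (x) {
    case 1: return 1;
    case 2: return 2;
    default: return 0;
    }
}
"""

print(approx_mcc(example_function))
```

A text-based count like this works on code with preprocessor directives removed, whereas a compiler-based tool is tied to one configuration.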

Measuring MCC
• Use commercial static analysis tool (klocwork)
  – Requires compilation of the code
  – Therefore limited to specific configuration
  – Some bug and usage problems
• Use free tool (pmccabe)
  – But not in this paper
• Write your own script
  – Simple and what we need
  – Danger of bugs and not being standard

Results
• Total MCC grows with code
• But average MCC per function is decreasing

Distribution of MCC

Possible Explanations
• Many new functions being added, and they tend to be simpler than the old ones
  – Indeed, new functions tend to have lower MCC
• Code is being actively improved with time

High-MCC Functions
• Distribution of MCC values is heavy-tailed
• Highest values are in the hundreds
  – 369 functions with MCC ≥ 100 over the years
• Some of these functions evolve
  – Massive reduction in MCC as in sys32_ioctl
  – Gradual growth of MCC
  – Occasional large growth in production version
• Very long, but actually not very complex

Tail of MCC Distribution

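An Aside on Heavy Tails
• Definition: tail decays as a power law
• CDF: F(x) = Pr(X ≤ x)
• CCDF: 1 − F(x) = Pr(X > x)
• Heavy tail: Pr(X > x) ∝ x^(−a)
• LLCD: log-log complementary distribution plot, log Pr(X > x) vs. log x; a power-law tail appears as a straight line with slope −a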
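
As a rough illustration of how an LLCD is built, the sketch below computes the empirical Pr(X > x) for each distinct value and fits a least-squares line to the log-log points. The helper names and the synthetic sample are assumptions made for this example; the sample is generated from a Pareto distribution with a known exponent of 1.5 and is not MCC data from the study.

```python
import math
from bisect import bisect_right


def llcd_points(values):
    """Return (log10 x, log10 Pr(X > x)) for each distinct positive value."""
    xs = sorted(values)
    n = len(xs)
    points = []
    for x in sorted(set(xs)):
        survival = (n - bisect_right(xs, x)) / n  # fraction strictly above x
        if x > 0 and survival > 0:
            points.append((math.log10(x), math.log10(survival)))
    return points


def fitted_slope(points):
    """Least-squares slope of y on x; for a power-law tail this estimates -a."""
    n = len(points)
    mean_x = sum(p[0] for p in points) / n
    mean_y = sum(p[1] for p in points) / n
    sxy = sum((p[0] - mean_x) * (p[1] - mean_y) for p in points)
    sxx = sum((p[0] - mean_x) ** 2 for p in points)
    return sxy / sxx


# Synthetic Pareto sample via inverse-CDF on an even grid (assumed a = 1.5).
a = 1.5
n = 1000
sample = [(1.0 - (k - 0.5) / n) ** (-1.0 / a) for k in range(1, n + 1)]

slope = fitted_slope(llcd_points(sample))
print(f"fitted LLCD slope: {slope:.2f} (expected close to {-a})")
```

On real data the fit is normally restricted to the upper tail, since the body of the distribution need not follow the power law.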

Law VII: Declining Quality
Unless rigorously adapted and evolved to take into account changes in the operational environment, the quality of an E-type system will appear to be declining
• Again can be supported either way
• What is “quality”?

Law VII and Linux
• Question of how to quantify quality
• Quality is most probably not decreasing
• It may even be improving

Perceived Quality
• If quality declines system will fall out of use
• Linux usage is strong and growing
• Ergo Linux quality is not declining

Measured Quality
Oman’s Maintainability Index (MI)
• HV = Halstead’s volume (N ln n)
  – Bits required to write the function
• MCC = McCabe Cyclomatic Complexity
• LoC = Lines of Code
• pCM = percent Comment lines
  – Interpreted as fraction (0-1) rather than percent
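
The MI formula itself did not survive in this transcript. The sketch below uses the commonly cited four-metric form of the index, MI = 171 − 5.2 ln(HV) − 0.23 MCC − 16.2 ln(LoC) + 50 sin(√(2.46 · pCM)), on the assumption that it matches the slide; the inputs are per-function averages and pCM is a fraction between 0 and 1, as the slide specifies. The function name and the sample values are hypothetical.

```python
import math


def maintainability_index(avg_hv: float, avg_mcc: float,
                          avg_loc: float, comment_fraction: float) -> float:
    """Four-metric Maintainability Index (assumed form, see lead-in).

    avg_hv           average Halstead volume per function (> 0)
    avg_mcc          average McCabe cyclomatic complexity per function
    avg_loc          average lines of code per function (> 0)
    comment_fraction fraction of comment lines, between 0 and 1
    """
    return (171.0
            - 5.2 * math.log(avg_hv)
            - 0.23 * avg_mcc
            - 16.2 * math.log(avg_loc)
            + 50.0 * math.sin(math.sqrt(2.46 * comment_fraction)))


# Hypothetical averages, chosen only to show the direction of each effect.
base = maintainability_index(avg_hv=800.0, avg_mcc=6.0, avg_loc=30.0,
                             comment_fraction=0.15)
simpler = maintainability_index(avg_hv=800.0, avg_mcc=4.0, avg_loc=30.0,
                                comment_fraction=0.15)
print(f"MI at MCC 6: {base:.1f}, MI at MCC 4: {simpler:.1f}")
```

Higher MI means better maintainability, so under this formula a falling average MCC pushes the index up.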

Changes in MI

Law IV: Invariant Work Rate
The work rate of an organization evolving an E-type software system tends to be constant over the operational lifetime of that system or phases of that lifetime
• Large organizations have inertia
• What about open source communities?

Law IV and Linux
• Work on Linux is growing superlinearly
• Fraction of files handled is near constant
• Release rate is near constant
  – 5-10 days per minor release till 2.5
  – 2-3 months for new version in 2.6

Interpretation 1: Work Hours
• Data not available
• Ill-defined: developers typically have other daytime jobs
• Nevertheless, work rate is most probably not constant
  – Growth in developer base
  – Increased growth rate of code

Interpretation 2: Elements Handled
• Suggested by Lehman
• Use development versions (+ 1st year of 2.4)
• Includes number added (reflects growth)
• Absolute number grows with time
• Fraction of existing files relatively constant

Interpretation 3: Release Rate
• Release rate of development versions 1996-2003 around 3-6/month
  – Lower in 2.4

Releases per Month

Interpretation 3: Release Rate
• Release rate of development versions 1996-2003 around 3-6/month
• Production versions have high minor release rate until next development version is forked

Rate of Minor Releases
• Linear slope = steady release rate

Interpretation 3: Release Rate
• Release rate of development versions 1996-2003 around 3-6/month
• Production versions have high minor release rate until next development version is forked
• Since 2003 (version 2.6) new version every 2-3 months
• Conclusion: seems to support constant rate

Law V: Conservation of Familiarity
In general, the incremental growth (growth rate trend) of E-type systems is constrained by the need to maintain familiarity
• Capacity of humans to change constrains the rate of change

Law V and Linux
• Rapid development releases imply small change between versions
• Production versions branch off from development versions, again with small change
• Large difference between production versions
  – So user familiarity is not conserved
• Users may continue to use a production version for a long time
  – Evidence for need for conservation of familiarity

Law III: Self Regulation
Global E-type system evolution is feedback regulated
• Reflects a balance between forces that demand change, and constraints on what can actually be done

Lehman’s Ripple
• Ripple indicates negative feedback control
• Or maybe alternation of major/minor releases?

Increments of Growth
• Large increment reflects desire to add more new functionality
• Small increment reflects need to stabilize
• Alternations reflect self regulation
• Also seen to some degree in Linux

Alternating Increments

Law VIII: Feedback System
E-type evolution processes are multi-level, multi-loop, multi-agent feedback systems
• Extension of law III?

Law VIII and Linux
• Archetypal open-source system
• Continued development based on feedback from users
  – Defect reports
  – Bug fixes
  – Contribution of code
• Change of release scheme in 2.6 reflects need for more rapid dissemination
• Hard to quantify

Lehman’s Laws and Linux: Summary
• Some laws are two-sided
  – II (complexity), VII (quality)
• Some laws are qualitative
  – I (adaptation), III (self regulation), V (familiarity), VII (quality), VIII (feedback)
• Laws need to be interpreted and quantified
  – II (complexity), IV (work rate), VII (quality)

Lehman’s Laws and Linux: Summary
Law   Topic            Finding for Linux
I     change           Adaptation to new hardware
II    complexity       Not increasing
III   self regulation  Maybe
IV    work rate        Constant release rate, superlinear growth
V     familiarity      Within production versions
VI    growth           Superlinear
VII   quality          Not decreasing
VIII  feedback         Inherent in open source paradigm