A Comparative Analysis of the Efficiency of Change Metrics and Static Code Attributes for Defect Prediction

Raimund Moser, Witold Pedrycz, Giancarlo Succi
Free University of Bolzano-Bozen · University of Alberta

Defect Prediction
• quality standards
• customer satisfaction
Questions:
1. Which metrics are good defect predictors?
2. Which models should be used?
3. How accurate are those models?
4. How much does it cost? What are the benefits?

Approaches for Defect Prediction
1. Product-centric: measures extracted from the
   • static/dynamic structure of source code
   • design documents
   • design requirements
2. Process-centric:
   • change history of source files (number or size of modifications, age of a file)
   • changes in the team structure
   • testing effort
   • technology
   • other human factors related to software defects
3. Combination of both

Previous Work
Two aspects of defect prediction:
• the relationship between software defects and code metrics
• the impact of the software process on the defectiveness of software
Open issues: no agreed answer; no cost-sensitive prediction.

Questions to Answer by This Work
1. Which metrics are good defect predictors? → Are change metrics more useful?
2. Which models should be used? → Which change metrics are good?
3. How accurate are those models? → How can cost-sensitive analysis be used?
4. How much does it cost? What are the benefits?
Focus: not how many defects are present in a subsystem, but whether a source file is defective.

Outline
• Experimental Set-Up
• Assessing Classification Accuracy
• Classification Results
• Cost-Sensitive Classification
• Cost-Sensitive Defect Prediction
• Experiment Using a Cost Factor of 5

Data & Experimental Set-Up
• Public data set from the Eclipse CVS repository (releases 2.0, 2.1, 3.0) by Zimmermann et al.
• 18 change metrics concerning the change history of files
• 31 static code attributes: the metrics that Zimmermann has used at a file level (correlation analysis, logistic regression, and ranking analysis)

One Possible Proposal of Change Metrics
Examples:
• REFACTORINGS: renaming or moving of software elements
• CHANGESET: the number of files that have been committed together with file x
• AGE: the age of a file in weeks, counted from the release date back to its first appearance
(a computation sketch follows below)
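
As an illustration (not from the paper), here is a minimal sketch of how such change metrics might be computed from a version-control log. The Commit record and all of its fields are assumptions, not the paper's schema:

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class Commit:
    """One version-control commit; this schema is hypothetical."""
    files: list[str]        # files committed together
    date: datetime          # commit timestamp
    is_refactoring: bool    # renaming/moving of software elements detected

def change_metrics(file_x: str, commits: list[Commit], release: datetime) -> dict:
    """Compute a few of the slide's change metrics for file_x up to a release date."""
    touching = [c for c in commits if file_x in c.files and c.date <= release]
    sizes = [len(c.files) - 1 for c in touching]          # files committed with file_x
    first = min((c.date for c in touching), default=release)
    return {
        "REVISIONS": len(touching),
        "REFACTORINGS": sum(c.is_refactoring for c in touching),
        "MAX_CHANGESET": max(sizes, default=0),
        "AGE": (release - first).days / 7,                # weeks back to first appearance
    }
```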

Experiments Build Three Models for Predicting the Presence or Absence of Defects in Files
1. Change Model: uses the proposed change metrics
2. Code Model: uses static code metrics
3. Combined Model: uses both types of metrics
(a training sketch follows below)
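
A minimal sketch of the three-model setup, assuming a per-file CSV with a binary defect label. scikit-learn's DecisionTreeClassifier stands in for Weka's J48 (C4.5); the file name and column names are assumptions:

```python
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Hypothetical per-file data: a binary label plus change and code metrics.
data = pd.read_csv("eclipse_2_0_files.csv")          # assumed file name
change_cols = ["REVISIONS", "REFACTORINGS", "BUGFIXES", "MAX_CHANGESET"]  # illustrative subset of the 18
code_cols = ["LOC", "COMPLEXITY"]                    # illustrative subset of the 31
y = data["defective"]

for name, cols in [("change", change_cols),
                   ("code", code_cols),
                   ("combined", change_cols + code_cols)]:
    tree = DecisionTreeClassifier()                  # stand-in for J48
    acc = cross_val_score(tree, data[cols], y, cv=10).mean()  # 10-fold cross-validation
    print(f"{name} model: mean accuracy {acc:.2%}")
```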

Outline
• Experimental Set-Up
• Assessing Classification Accuracy
• Classification Results
• Cost-Sensitive Classification
• Cost-Sensitive Defect Prediction
• Experiment Using a Cost Factor of 5

Results: Assessing Classification Accuracy
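
For reference, these are the standard definitions behind the accuracy figures quoted later in the deck (correctly classified files, recall, FP rate), written as a small helper; the notation is mine, not the slide's:

```python
def classification_measures(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Confusion-matrix measures, with 'defective' as the positive class."""
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),  # fraction correctly classified
        "recall":   tp / (tp + fn),                   # defective files actually found
        "fp_rate":  fp / (fp + tn),                   # defect-free files falsely flagged
    }
```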

Classification Results
By analyzing the decision trees:
Defect free:
• large MAX_CHANGESET or low REVISIONS
• smaller MAX_CHANGESET and low REVISIONS and REFACTORINGS
Defect prone:
• high number of BUGFIXES
(the rules are paraphrased as code below)
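
Read as rules, the tree paths above amount to something like the following sketch. The threshold values are invented placeholders, not the paper's actual split points:

```python
def is_defect_prone(m: dict) -> bool:
    """Loose paraphrase of the slide's decision-tree rules; thresholds are placeholders."""
    HIGH_BUGFIXES, LOW_REVISIONS, LARGE_CHANGESET = 5, 3, 20   # invented values
    if m["BUGFIXES"] >= HIGH_BUGFIXES:
        return True    # defect prone: high number of bug fixes
    if m["MAX_CHANGESET"] >= LARGE_CHANGESET or m["REVISIONS"] <= LOW_REVISIONS:
        return False   # defect free: large change sets or few revisions
    if m["REFACTORINGS"] > 0:
        return False   # defect free: smaller change sets but refactored
    return True
```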

Outline
• Experimental Set-Up
• Assessing Classification Accuracy
• Classification Results
• Cost-Sensitive Classification
• Cost-Sensitive Defect Prediction
• Experiment Using a Cost Factor of 5

Cost-Sensitive Classification
Cost-sensitive classification associates different costs with the different errors made by a model.
• With a cost factor > 1, false negatives (FN) imply higher costs than false positives (FP).
• It is more costly to fix an undetected defect in the post-release cycle than to inspect a defect-free file.
• The model is chosen to minimize the overall misclassification cost (see the helper below).
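
In symbols (my notation): with a cost factor c > 1, the quantity to minimize is cost = FP + c · FN. A one-line helper makes the trade-off concrete:

```python
def misclassification_cost(fp: int, fn: int, cost_factor: float = 5.0) -> float:
    """Total cost with each FN weighted cost_factor times an FP; the deck later uses 5."""
    return fp + cost_factor * fn
```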

Cost-Sensitive Defect Prediction
Results for the J48 learner, release 2.0:
• use some heuristics to stop increasing the recall: keep the FP rate below 30%
• cost factor = 5
(a sketch of such a heuristic follows below)
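
A minimal sketch of such a stopping heuristic, assuming a classifier that outputs defect probabilities whose decision threshold can be lowered to trade FP rate for recall; only the 30% bound comes from the slide, everything else is an assumption:

```python
import numpy as np

def pick_threshold(probs: np.ndarray, y: np.ndarray, max_fp_rate: float = 0.30) -> float:
    """Lower the threshold (raising recall) until the FP rate would exceed the bound."""
    chosen = 1.0
    for t in np.linspace(1.0, 0.0, 101):
        pred = probs >= t
        fp = int(np.sum(pred & (y == 0)))
        tn = int(np.sum(~pred & (y == 0)))
        if fp / max(fp + tn, 1) > max_fp_rate:
            break                     # stop increasing recall here
        chosen = t
    return chosen
```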

Experiment Using a Cost Factor of 5
• Defect predictors based on change data outperform those based on static code attributes.
• The null hypothesis (both kinds of predictors perform equally well) is rejected.

Limitations
• dependence on a specific environment
• conclusions rest on only three data miners
• the choice of code and change metrics
• reliability of the data:
  - the mapping between defects and locations in source code
  - the extraction of code or change metrics from repositories

Conclusions
The 18 change metrics with the J48 learner and a cost factor of 5 give accurate results for 3 releases of the Eclipse project:
• >75% of files correctly classified
• >80% recall
• <30% FP rate
Hence, the change metrics contain more discriminatory and meaningful information about the defect distribution than the source code itself.
Important change metrics:
• defect-prone files show high revision numbers and large bug-fixing activity
• defect-free files show large CVS commits and have been refactored several times

Future Research
• Which information in change data is relevant for defect prediction?
• How to extract this data automatically?