Empirical Software Engineering, Version 2.0
Tim Menzies, WVU, USA
Forrest Shull, Fraunhofer CESE, USA
ICSE, Hawaii, May 2011

Science 2.0

Our theme
• Empirical Software Engineering 1.0 has had successes
  • Important and rigorous studies
  • Increasing penetration in CS research
• But it has been slow
  • Practices become adopted or obsolete without us
  • Recent trends are exacerbating the problem
  • Are we losing the chance to impact practice?
• Empirical software engineering version 2.0 needs to shorten the feedback loop:
  • Good results
  • When they can be most useful

Roadmap
• Empirical SE v2.0: What is it?
• EMSE 1.0 marked by:
  1. Case studies
  2. Experiments
  3. Literature studies
• EMSE 2.0 is incorporating:
  4. Better pattern detection in big and rich datasets
  5. A recognition of value propositions
  6. Aggregating studies for more robust results
  7. Infrastructure and repositories that facilitate collaboration
• A vision for where we are headed
• References & resources
• Download these slides: http://tinyurl.com/version-two

About us
• Shull:
  • EIC, IEEE Software
  • Ed board, Journal of Empirical SE
  • Program Co-chair, ESEM'11
  • Editor, Guide to Advanced Empirical Software Engineering
  • Division Director, Fraunhofer CESE
• Menzies:
  • Ed board, Journal of Empirical SE and Automated Software Engineering
  • Co-founder, PROMISE series on repeatable SE
  • Program Co-chair, PROMISE'11, ASE'12
  • Former research chair @ NASA
• Both: curators of large repositories of SE data
  • Shull: NSF-funded CeBASE, 2001-2005
  • Menzies: PROMISE, 2006-2011
    • If you publish, offer the data used in that pub
    • http://promisedata.org/data

EMSE v2.0: What is it?

EMSE: What is it?
• A field of research that:
  • Focuses on using studies of all kinds to accumulate knowledge about software engineering
  • Uses methods such as experiments, case studies, surveys, and statistical analysis
  • In short, gathers observable, empirical, and measurable evidence
  • Is based on applying the scientific method to software engineering
• The legacy of EMSE has been to make SE more of a truly "engineering" field
  • But we're not there yet.

EMSE: Why Should I Care?
• [Zelkowitz 08]: a strong trend over time that an increasing percentage of software engineering papers have an empirical component

EMSE version 2.0: what's new?
• The way we get data
  • More data (thank you, WWW)
  • Different kinds of data
    • Text mining, program comprehension
    • Structures inferred from unstructured data
• Who collects the data
• Who analyzes the data

EMSE version 2.0: what's new?
• Traditional SE version 1.0: I study my data
• I let you study my data (e.g., NASA's MDP program publishing NASA project data)
• I study data from other sites (e.g., vendors studying data collected from many client sites)
• SE version 2.0: they collect it; others study it (e.g., groups contributing data to a shared repo that others use, such as PROMISE and SIR)

EMSE version 2.0: what's new?
• Traditional SE version 1.0: conclusions from one project ("I study my data"; "I let you study my data"; "I study data from other sites")
• SE version 2.0: crowd-sourcing ("they collect it; others study it"); data aggregated across multiple projects

Roadmap
• Empirical SE v2.0: What is it?
• EMSE 1.0 marked by:
  1. Case studies
  2. Experiments
  3. Literature studies
• EMSE 2.0 is incorporating:
  4. Better pattern detection in big and rich datasets
  5. A recognition of value propositions
  6. Aggregating studies for more robust results
  7. Infrastructure and repositories that facilitate collaboration
• A vision for where we are headed
• References & resources
• Download these slides: http://tinyurl.com/version-two
(Next: Case studies)

Case studies [Easterbrook 07]
• Exploratory case studies: used as initial investigations of some phenomena, to derive new hypotheses and build theories
• Confirmatory case studies: used to test existing theories
  • Important for refuting theories: a detailed case study of a real situation in which a theory fails may be more convincing than 'failed' experiments in the lab
  • The detailed insights obtained from confirmatory case studies can also be useful for choosing between rival theories

Case studies (assessment)
• Used when the researcher does not have control over variables and the focus is on contemporary events
• The generality of conclusions from case studies needs to be carefully assessed
• Can be slow to conduct (given the fast pace of change in modern SE)
• Also, in EMSE 2.0, there is no control over data collection
  • Must adjust questions to the data

Adjusting questions to data
• (Venn diagram of three overlapping sets:)
  • The questions you want to ask
  • The answers anyone else cares about
  • The questions the data can support (which, BTW, you won't know till you look)
• Are you here, in the intersection?

That old joke
Version 1.0:
• One night, I meet a drunk searching the street.
• "Can't find my keys", he said.
• "Are you sure you lost your keys here?" I asked.
• "No," the drunk replies, "I lost them in the alley, but there's no light there."
• Moral (v1.0): pick your goals, then pick your data.
Version 2.0:
• As before, then...
• "A ha!" shouted the drunk.
• "Found the keys?" I asked.
• "Better! Tire tracks to the bus stop!"
• So he did not drive home drunk.
• Moral (v2.0): study your data, then revise your goals.

Roadmap
• Empirical SE v2.0: What is it?
• EMSE 1.0 marked by:
  1. Case studies
  2. Experiments
  3. Literature studies
• EMSE 2.0 is incorporating:
  4. Better pattern detection in big and rich datasets
  5. A recognition of value propositions
  6. Aggregating studies for more robust results
  7. Infrastructure and repositories that facilitate collaboration
• A vision for where we are headed
• References & resources
• Download these slides: http://tinyurl.com/version-two
(Next: Experiments)

Example Experiment Design
From [Basili 96]; time flows down the rows:

  Group A                          Group B
  -------------------------------  -------------------------------
  Training                         Training
  Review (usual): NASA DOC A       Review (usual): NASA DOC B
  Review (usual): GENERIC DOC A    Review (usual): GENERIC DOC B
  Training (PBR)                   Training (PBR)
  Review (PBR): GENERIC DOC B      Review (PBR): GENERIC DOC A
  Review (PBR): NASA DOC B         Review (PBR): NASA DOC A

Example Experiment Results
From [Shull 98]: (charts showing the average effectiveness over all subjects, and individual effectiveness results)

Experiments (assessment)
Pro:
• Hypothesis-driven
• Results that control for the factor of interest
• Replicable
Con:
• Expensive!
• Degree of control is in tension with representative results

Roadmap
• Empirical SE v2.0: What is it?
• EMSE 1.0 marked by:
  1. Case studies
  2. Experiments
  3. Literature studies
• EMSE 2.0 is incorporating:
  4. Better pattern detection in big and rich datasets
  5. A recognition of value propositions
  6. Aggregating studies for more robust results
  7. Infrastructure and repositories that facilitate collaboration
• A vision for where we are headed
• References & resources
• Download these slides: http://tinyurl.com/version-two
(Next: Literature studies)

A Family of Studies
• (Diagram: a map of related studies across requirements, design, code, and other artifacts, each labeled by citation: BS87, SBB87, PVB95, BGL+96, CDL+97, B97, S98, PVB98, TSFB99, ZBS99, LEH00b, RRT00, LAS+00, MST01, SBC+02, S03, CMA+03, S04, LMC+04, MDM+05)
• Studies were run both in industry and in the classroom
• Empirical data helps build maturity and move techniques into industry, e.g.:
  • Tried/evolved in the classroom;
  • Shown to be feasible with industry;
  • Shown to improve industry.

Literature Reviews (assessment)
Pro:
• Builds confidence and robustness of conclusions
• There is so much research that it is foolish to ignore it
  • A fool learns from their own mistakes; the wise learn from someone else's
• EBSE = Evidence-Based SE: justify current practice w.r.t. current publications
Con:
• What has been learned from all that work?
• [Budgen 09] ask: "Is Evidence-Based Software Engineering mature enough for Practice & Policy?" Their answer: no!

Roadmap
• Empirical SE v2.0: What is it?
• EMSE 1.0 marked by:
  1. Case studies
  2. Experiments
  3. Literature studies
• EMSE 2.0 is incorporating:
  4. Better pattern detection in big and rich datasets
  5. A recognition of value propositions
  6. Aggregating studies for more robust results
  7. Infrastructure and repositories that facilitate collaboration
• A vision for where we are headed
• References & resources
• Download these slides: http://tinyurl.com/version-two
(Next: Better pattern detection in big and rich datasets)

Data mining
What:
• "Diamonds in the dust": finding patterns in (lots of) data
• Synonyms: machine learning, business intelligence, predictive analytics
How, why:
• Combines statistics, AI, visualization, ...
• Used for... anything
  • The review of current beliefs w.r.t. new data is the hallmark of human rationality.
  • It is irrational NOT to data mine.
• The art of approximate, scalable analysis
  • Bigger is better

Some Sample Data Mining Technologies [Wu 08]
• Learning trees:
  • C4.5: for discrete classes
  • CART: for continuous classes
• Association rule learning (Apriori):
  • People who buy this also buy that
• Clustering:
  • K-means (small), EM (statistical), canopy (large)
• Naive Bayes:
  • Simple, scalable
• k-th nearest neighbors:
  • Reasoning by analogy
  • Very useful when data is sparse
• For complex data: SVM = support vector machines
• AdaBoost: how to make a learner better
• And many, many others as well (two of these are demonstrated in the sketch below)
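To make two of the named learners concrete, here is a minimal sketch (mine, not from the original deck), assuming scikit-learn and NumPy are installed. The decision tree's entropy criterion approximates C4.5-style splits; the "static code metric" columns and the defect label are fabricated for illustration.

```python
# Minimal sketch of two [Wu 08] learners on a tiny synthetic table of
# static code metrics. Assumes scikit-learn + NumPy; all data is made up.
import numpy as np
from sklearn.tree import DecisionTreeClassifier   # entropy mode ~ C4.5-style splits
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
loc = rng.integers(10, 500, size=200)             # lines of code (synthetic)
vg = rng.integers(1, 30, size=200)                # cyclomatic complexity (synthetic)
X = np.column_stack([loc, vg])
y = (vg + rng.normal(0, 3, size=200) > 12).astype(int)  # fabricated "defective" label

tree = DecisionTreeClassifier(criterion="entropy", max_depth=3).fit(X, y)
print("training accuracy:", tree.score(X, y))

clusters = KMeans(n_clusters=3, n_init=10, random_state=1).fit_predict(X)
print("cluster sizes:", np.bincount(clusters))
```

The same few lines port to any of the other learners on the slide (Naive Bayes, kNN, SVM, AdaBoost), which is part of why the toolkits listed next lowered the cost of entry so much.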

Many, many tools
• "R"
• RapidMiner
• Orange
• Matlab (not free)
• Weka

Sample applications of Data Mining in SE
• Process data
  • Input: developer skills, platform stability
  • Output: effort estimation, e.g. [Boehm 09, Shepperd 97, Koc 11]
• Social data
  • Input: e.g., which tester do you most respect?
  • Output: predictions of which bugs get fixed first, e.g. [Guo 10]
• Product data
  • Input: static code descriptions
  • Output: defect predictors, e.g. [Bell 07, Me 07]
• Trace data
  • Input: what calls what?
  • Output: call sequences that lead to a core dump, e.g. [Thumm 09]
• Usage data
  • Input: what is everyone using?
  • Output: recommendations on where to browse next, e.g. [Ado 05]
• Any textual form
  • Input: text of any artifact
  • Output: e.g., connections between concepts, e.g. [Ma 03] (see the text-mining sketch below)
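For the last row, here is a hedged sketch, in the spirit of [Ma 03] but not their actual implementation, of linking a requirement to code comments via latent semantic analysis. It assumes scikit-learn; the three documents are invented.

```python
# Sketch: connect a requirement to source comments via latent semantic analysis.
# Assumes scikit-learn; the documents are fabricated for illustration.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "the login screen shall lock the account after three failed attempts",  # requirement
    "increment failed login counter and lock account when limit reached",   # code comment A
    "render the dashboard charts from cached telemetry data",               # code comment B
]
X = TfidfVectorizer().fit_transform(docs)          # term-by-document weights
Z = TruncatedSVD(n_components=2, random_state=0).fit_transform(X)  # latent concepts
sims = cosine_similarity(Z[:1], Z[1:])             # requirement vs. each comment
print("similarity to comments A, B:", sims.round(2))
```

Comment A, which shares the requirement's concepts, scores far higher than the unrelated comment B; scaled up over thousands of artifacts, this is what recovers traceability links.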

Data Mining (assessment)
Pro:
• Ready availability of many toolkits
• Broader range of applications than traditional statistics
• Enables analyses to scale up to sizes and speeds that could never be handled manually
• Analysis is now automatable, repeatable, auditable
• More and more university graduates trained in data mining technologies
Con:
• Requires data, which is not always available at all sites
• Analysis unguided by domain knowledge can yield results that are uninteresting, un-actionable, or just plain weird ("alien messages")
• Conclusion instability: patterns have a poor track record regarding utility outside their home context

Roadmap
• Empirical SE v2.0: What is it?
• EMSE 1.0 marked by:
  1. Case studies
  2. Experiments
  3. Literature studies
• EMSE 2.0 is incorporating:
  4. Better pattern detection in big and rich datasets
  5. A recognition of value propositions
  6. Aggregating studies for more robust results
  7. Infrastructure and repositories that facilitate collaboration
• A vision for where we are headed
• References & resources
• Download these slides: http://tinyurl.com/version-two
(Next: A recognition of value propositions)

Question: Is bias a problem?

Answer: perhaps not
• [Biffl 05]: Value-based software engineering
  • Make value considerations explicit, so that software engineering decisions meet or reconcile stakeholder objectives
• [Basili 94]: GQM (Goal-Question-Metric)
  • Bias data collection according to the overall goal of the research

"Values" bias the search for models
Sources of bias:
• Search bias
  • Nearly infinite space of possible models: all subsets of (all attributes * all ranges)
  • Use bias to guide the search towards the most effective options
• Evaluation bias
  • Generated models are assessed by business criteria
  • Bad models do not comment on the business case
Bias = good:
• No bias? Then no way to prune possible models
• No pruning? No summarization? Then no model that can predict the future
• So bias makes us blind to certain factors, but lets us see (predict) the future

[Green 09]: Effects of bias
• Data miners searched for the best treatments to 4 projects:
  • NASA ground & flight systems
  • Guidance systems: OSP & OSP2
• 2 biases:
  • BFC = better, faster, cheaper: fewer bugs, and less cost, and less time
  • XPOS = reduce exposure to competition: get it to market quickly, with not too many bugs
• 20 runs: how many times did different ranges get selected by the two biases?
• Results: different biases = different treatments
• Implications: can't talk about the "best" way to improve a project unless you also define "best" (see the toy sketch below)
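A toy illustration of that implication (mine, not [Green 09]'s actual tool): the same candidate treatments rank differently under two hand-made value functions standing in for BFC and XPOS. All treatments and numbers are invented.

```python
# Toy sketch: different evaluation biases pick different "best" treatments.
# The treatments and their (bugs, cost, months) outcomes are invented.
treatments = {
    "more code reviews":  {"bugs": 10, "cost": 90, "months": 9},
    "hire contractors":   {"bugs": 25, "cost": 70, "months": 5},
    "reuse old platform": {"bugs": 18, "cost": 40, "months": 7},
}

def bfc(o):   # better-faster-cheaper: weight bugs, cost, and time equally
    return -(o["bugs"] + o["cost"] + o["months"])

def xpos(o):  # reduce exposure: time-to-market dominates, bugs matter a little
    return -(10 * o["months"] + o["bugs"])

for name, bias in [("BFC", bfc), ("XPOS", xpos)]:
    best = max(treatments, key=lambda t: bias(treatments[t]))
    print(f"{name} bias picks: {best}")
```

Here BFC picks "reuse old platform" while XPOS picks "hire contractors": nothing about the data changed, only the definition of "best".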

VBSE (assessment)
Pro:
• Recognition that technical decisions can't be divorced from business context
• Recognition that there are few if any one-size-fits-all answers
Con:
• Makes analysis less general
• Increases analysis work: must be repeated whenever the business bias (a.k.a. the value proposition) changes
Despite the cons:
• There is no way to avoid understanding the local business case, unless you want to build models no one cares about

BTW, this is hard!
• Methods such as GQM+Strategies(TM) are effective for aligning business and technical goals [Basili 10].
• But they make explicit many:
  • Conflicting goals
  • Disconnected goals
  • Orphaned goals
  • ...and other anomalies...

Roadmap
• Empirical SE v2.0: What is it?
• EMSE 1.0 marked by:
  1. Case studies
  2. Experiments
  3. Literature studies
• EMSE 2.0 is incorporating:
  4. Better pattern detection in big and rich datasets
  5. A recognition of value propositions
  6. Aggregating studies for more robust results
  7. Infrastructure and repositories that facilitate collaboration
• A vision for where we are headed
• References & resources
• Download these slides: http://tinyurl.com/version-two
(Next: Aggregating studies for more robust results)

Aggregating data from multiple studies
V1.0:
• Hard, slow
• Plagued by "conclusion instability": what works here does not work there
  • E.g. [Kitch 07] (effort estimation)
  • E.g. [Zimmermann 09] (defect prediction): 600+ defect models built from project 1 and applied to project 2; in only 4% of those cases did project 1's predictors work for project 2
V2.0:
• Better selection of training data
• Less "conclusion instability"
• More data
• Easier to aggregate via "meta-studies"

Better selection of training data
• For items in the test set, build a special training set just from the projects closest to each test-set item
• Results: cross-company data works as well as local data
  • [Turhan 09]: defect prediction
  • [Koc 10, Koc 11a]: effort estimation
• Implications:
  • Can generate estimates faster, without local data collection
  • Shared repositories are useful (a minimal sketch of this relevancy filtering follows)
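Below is a minimal sketch of the idea (my paraphrase, not the exact algorithm of [Turhan 09]): for each test item, train only on its k nearest neighbors drawn from cross-company data. It assumes NumPy and scikit-learn; the data is random stand-in data.

```python
# Sketch of relevancy filtering: per test item, train on its k nearest
# cross-company rows only. Assumes NumPy + scikit-learn; data is synthetic.
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
cross_X = rng.normal(size=(300, 4))                 # cross-company metrics (synthetic)
cross_y = (cross_X[:, 0] + cross_X[:, 1] > 0).astype(int)
test_X = rng.normal(size=(10, 4))                   # local items to predict

k = 25
preds = []
for x in test_X:
    dist = np.linalg.norm(cross_X - x, axis=1)      # Euclidean distance to each row
    nearest = np.argsort(dist)[:k]                  # indices of the k closest rows
    model = GaussianNB().fit(cross_X[nearest], cross_y[nearest])
    preds.append(model.predict(x.reshape(1, -1))[0])
print("predictions:", preds)
```

The point of the loop is that no single global model is ever built from the imported data; each prediction comes from the small slice of cross-company experience most like the item at hand.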

Building Knowledge Systematically
• Define topic
• Identify search parameters: forms of evidence, sources to include, key variables, key measures
• Find evidence: literature survey, interviews, polls, expert opinion?
• Analyze evidence: filtering, content extraction
• Integrate evidence: standardize & document, handle discrepancies, abstract conclusions
• Problem 1: there are insufficient studies available on many topics to base conclusions on only the most rigorous sources.
• Problem 2: users expecting relevant results cannot always wait for the end of this pipeline on all topics.

Approaches to Aggregation
• Define the topic of interest, then conduct a systematic review
• Branch on the evidence found:
  • Critical mass of quantitative data: meta-analysis
  • Some quantitative & qualitative data: broaden scope; reason about results using vote counting
  • Minimal data: find an exemplary study; advocate for more data

Systematic Review
• An explicit protocol documents:
  • Inclusion / exclusion criteria
  • Data sources
  • Data extraction
  • List of pubs found
• These protocols aim to address:
  • Repeatability; expandability
  • Completeness; expense
• Appendices describing the systematic-review details for all topics discussed here are online at http://www.computer.org/portal/web/computingnow/software/supplements

Meta-analysis Example: Pair Programming [Dyba 07]
• Systematic review of the literature: 15 studies over 10 years
• Enough data to conduct a true meta-analysis
• Effect sizes normalized as (mean value for pairs - mean value for individuals), divided by the standard deviation
  • E.g., 0.5 means the mean of the pairs is half a standard deviation larger than the mean of the individuals
• Each study assigned a weight inversely proportional to its variance
(The formulas are spelled out below.)
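Spelling out that normalization in standard meta-analysis notation (my rendering; it matches the slide's verbal description, but the symbols are mine):

```latex
% Standardized effect size for study i (pairs vs. individuals):
d_i = \frac{\bar{x}^{\text{pairs}}_i - \bar{x}^{\text{solo}}_i}{s_i}
% Inverse-variance weights, and the pooled effect across k studies:
w_i = \frac{1}{v_i}, \qquad
\bar{d} = \frac{\sum_{i=1}^{k} w_i\, d_i}{\sum_{i=1}^{k} w_i}
```

So a study with a tight variance $v_i$ pulls the pooled estimate $\bar{d}$ harder than a noisy one, which is exactly the "relative weight" column in the forest plot on the next slide.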

Meta-analysis Example (results)
• (Forest plot: relative weight, effect size, and 95% CI per study, on an axis from -2.00 "favors solo programming" to +2.00 "favors pair programming")
• Quality: agreement among studies that PP leads to a medium-sized increase in quality.
• Duration: overall, PP gives a medium-sized reduction in the time to deliver the finished product, although studies are mixed.
• Effort: there is a medium-sized negative effect, shown by all but one study.

Vote Counting Example 1: Reqts Elicitation [Dieste 08]
Background:
• Systematic review; filter out less trustable studies
• Found: 30 studies over 20 years of research; 43 different techniques, 50 variables
• Grouped results into 7 types of techniques and 8 types of outcome variables
Example findings:
• Regardless of experience, interviews with some structure are more effective than completely unstructured ones.
• Sorting techniques are less effective than interviews.
• Introspective techniques (e.g., protocol analysis) are less effective than other types.
• Interviews were not always the most efficient techniques.

Vote Counting Example 2: Developer Motivation [Hall 08]
Background:
• Systematic review to identify relevant studies: 519 candidate papers, 92 relevant studies
Approach and findings:
• Qualitative analysis: coding of results to identify similar factors across different studies
• Construct a model of the "most influential" factors and their interactions

Dealing with a Lack of Data: Study of MBT, Travassos et al.
• Systematic review of the literature to understand the depth of evidence for one specific practice, Model-Based Testing
• Results:
  • 202 realistic candidates => 85 relevant papers
  • 71 distinct approaches (!)
  • Only a small minority of papers relied on experimentation or experience to demonstrate why the approaches were beneficial; an interesting insight into the state of the practice

Dealing with a Lack of Data: Study of MBT, Travassos et al.
Number of papers by type of evidence:

  Type of evidence     UML-based   Non-UML   Total   Percent
  Speculation                 17         6      23        27
  Example                     22        16      38        45
  Proof of concept             5         8      13        15
  Experience report            0         4       4         5
  Experimentation              3         4       7         8
  Total                       47        38      85       100

Aggregating across Studies (assessment)
Pro:
• Faster predictions at a local site
  • Don't have to wait for local data collection
  • Can make predictions using imported data
• Can yield more robust, more "interesting" results not tethered to a specific context
Con:
• Imported data has to be pruned or weighted
• Too slow: even harder to get empirical results when they can be most relevant to developers
• Unclear: after a lot of work, the answer is likely to be "it depends"

Roadmap
• Empirical SE v2.0: What is it?
• EMSE 1.0 marked by:
  1. Case studies
  2. Experiments
  3. Literature studies
• EMSE 2.0 is incorporating:
  4. Better pattern detection in big and rich datasets
  5. A recognition of value propositions
  6. Aggregating studies for more robust results
  7. Infrastructure and repositories that facilitate collaboration
• A vision for where we are headed
• References & resources
• Download these slides: http://tinyurl.com/version-two
(Next: Infrastructure and repositories that facilitate collaboration)

Data / Experiment Repositories
• Rationale for why we would do this:
  • Lower the cost of replication, enabling more data to be collected on an issue
  • Provide more detail about the experiment that was done, assisting better replications
• Issues related to replications:
  • Keep too much the same and you run the risk of repeating biases and errors
  • Vary too much and it's not clear the same issue is really being tested

In theory: repositories won't work
Before:
• Why would organizations put effort into sharing data? What is the ROI?
• Why would a researcher give away the core resource used in their publications (the data)?
• Corporate confidentiality will prevent data collection.
• Lionel Briand, 2006, on the PROMISE repo: "You will never get any data."
Now:
• Given cross-company aggregation, we gain more together than apart
• Reputations can be built in the open-source-data world
  • E.g., everyone quotes the first paper that donates, or first analyzes, the data
• PROMISE repo, 2011: 140 data sets
• Finishing Ph.D. students donate their project data

Data / Experiment Repositories
• Examples of ESE repositories:
  • CeBASE (Fraunhofer, UMD, USC): data on inspections, COTS, ...
  • SIR (UNL): http://sir.unl.edu/portal/; software artifacts (many testing-related)
  • PROMISE (WVU): http://promisedata.org/data
• (Chart: PROMISE dataset counts by category, on a 0-to-100 scale: defect, effort estimation, general, model-based, text-mining)

Repositories: lessons learned
• From PROMISE:
  • Need a reward structure for academics: annual conference, journal special issue
• From CeBASE:
  • Can't predefine context; can't pre-limit hypotheses
  • Not being hypothesis-driven is a blessing and a curse: it lowers the barrier to entry and increases the diversity of the data
• From PROMISE & CeBASE:
  • Server architecture is not trivial!
  • Survivability of the data is important: the CeBASE data is now lost

Data / Experiment Repositories
Pro:
• Enable reuse and cheaper experiments => more data
• Allow studies to build upon each other's results
• Make more and larger datasets publicly available for re-analysis
Con:
• Not free: somebody has to maintain them, and maintenance != research
• Need more than just a database: need social institutions around the data

Roadmap
• Empirical SE v2.0: What is it?
• EMSE 1.0 marked by:
  1. Case studies
  2. Experiments
  3. Literature studies
• EMSE 2.0 is incorporating:
  4. Better pattern detection in big and rich datasets
  5. A recognition of value propositions
  6. Aggregating studies for more robust results
  7. Infrastructure and repositories that facilitate collaboration
• A vision for where we are headed
• References & resources
• Download these slides: http://tinyurl.com/version-two
(Next: A vision for where we are headed)

How will this change SE?
• Continual monitoring of all artifacts associated with a project (emails, diagrams, video chats, etc.)
  • E.g., the University of Auckland's visual wikis

Agents for adaptive business intelligence
• (Chart: quality, e.g. PRED(30), vs. data collected over time; the system alternates between "learn" and "break down" phases across modes #1, #2, #3)
• How to learn faster?
  • Technology: active learning; reflect on the examples seen to date to ask the most informative next question (see the sketch below)
• How to recognize breakdown?
  • Technology: anomaly detection
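As one hedged illustration of the active-learning idea (mine, not the deck's): uncertainty sampling, where the agent asks an oracle to label the example its current model is least sure about. It assumes scikit-learn; the data is synthetic.

```python
# Sketch of uncertainty sampling: query the unlabeled example whose predicted
# class probability is closest to 0.5. Assumes scikit-learn; data is synthetic.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 2))
y = (X[:, 0] > 0).astype(int)                 # hidden "true" labels

# Seed the labeled set so both classes are guaranteed to be present.
labeled = [int(np.argmin(X[:, 0])), int(np.argmax(X[:, 0]))] + list(range(8))
pool = [i for i in range(100) if i not in labeled]   # still-unlabeled examples

model = LogisticRegression().fit(X[labeled], y[labeled])
probs = model.predict_proba(X[pool])[:, 1]    # P(class 1) for each pool item
query = pool[int(np.argmin(np.abs(probs - 0.5)))]   # most uncertain example
print("ask an oracle to label example", query)
```

Repeating this loop of fit, query, and relabel is what lets an agent learn faster than one that passively waits for whatever data arrives.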

Faster publication methods
• Not 20-page articles, but 4-page "letters"
  • For background, they can refer to other papers without including details
• Review cycles in days, not months
• "Papers" will include executable code and data packages
  • So reviewers can repeat the experiments as part of the review process
• Where do experience reports fit?

Coming Soon: PI in a Box?
• (Diagram: machine learners run behind each organization's firewall, within Organizations A, B, C, ...; results are published anonymously to a public site)
• Within Organization A, behind my firewall:
  • Code complexity DOES NOT predict maintenance effort
  • "God class" anti-pattern predicts change-proneness of code
  • Requirements defects predict project success
• Public site, updated list of hypotheses:
  • Code complexity predicts maintenance effort
  • "God class" anti-pattern predicts change-proneness of code
  • Requirements defects predict project success

No more "one size fits all"
• Olde software engineering:
  • If V(g) > 10 then defective
• Next-generation software engineering:
  • For your kind of project, this is the best action
  • If your projects change to "this", then "that" will become your best action
  • (A small sketch of the contrast follows.)
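A small hedged sketch of the contrast (the V(g) > 10 rule is from the slide; the locally tuned alternative and all data are invented):

```python
# Sketch: a fixed global rule vs. a threshold tuned on local project data.
import numpy as np

rng = np.random.default_rng(3)
vg = rng.integers(1, 40, size=200)            # cyclomatic complexity (synthetic)
defective = (vg > 22).astype(int)             # pretend local truth differs from 10

def olde_rule(vg):                            # the one-size-fits-all rule
    return vg > 10

# Next generation: pick the threshold that best fits *this* project's data.
thresholds = list(range(1, 40))
accuracy = [np.mean((vg > t).astype(int) == defective) for t in thresholds]
best_t = thresholds[int(np.argmax(accuracy))]

print("global rule accuracy here:", np.mean(olde_rule(vg).astype(int) == defective))
print(f"locally tuned threshold: V(g) > {best_t}")
```

On this project the global rule misclassifies every module with V(g) between 11 and 22, while the locally learned threshold recovers the project's own boundary; that gap is the whole argument against one-size-fits-all rules.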

Conclusions
• Empirical Software Engineering 1.0 has had successes
  • Important and rigorous studies
  • Increasing penetration in CS research
• But it has been slow
  • Practices become adopted or obsolete without us
  • Recent trends are exacerbating the problem
  • Are we losing the chance to impact practice?
• Empirical software engineering version 2.0 needs to shorten the feedback loop:
  • Good results
  • When they can be most useful

Questions? Comments?

Roadmap
• Empirical SE v2.0: What is it?
• EMSE 1.0 marked by:
  1. Case studies
  2. Experiments
  3. Literature studies
• EMSE 2.0 is incorporating:
  4. Better pattern detection in big and rich datasets
  5. A recognition of value propositions
  6. Aggregating studies for more robust results
  7. Infrastructure and repositories that facilitate collaboration
• A vision for where we are headed
• References & resources
• Download these slides: http://tinyurl.com/version-two
(Next: References & resources)

Resources for EMSE
• Wohlin et al., Experimentation in Software Engineering: introductory textbook
• Shull, Singer, and Sjøberg, Guide to Advanced Empirical Software Engineering: handbook of advanced methods
• ISERN: http://isern.iese.de/ (support community; bibliography & other resources)
• EMSE Journal: http://www.springer.com/computer/swe/journal/10664
• ESEM Conference: http://esem.cpsc.ucalgary.ca/esem2011/index.html (2011 in Banff, Canada)

References
[Ado 05] G. Adomavicius and A. Tuzhilin, "Toward the Next Generation of Recommender Systems: A Survey of the State-of-the-Art and Possible Extensions," IEEE Trans. on Knowledge and Data Engineering, 17(6):734-749, June 2005.
[Basili 94] V. Basili, G. Caldiera, and H. D. Rombach, "Goal Question Metric Approach," Encyclopedia of Software Engineering, pp. 528-532, John Wiley & Sons, Inc., 1994.
[Basili 96] V. R. Basili, S. Green, O. Laitenberger, F. Lanubile, F. Shull, S. Sorumgaard, and M. V. Zelkowitz, "The empirical investigation of perspective-based reading," Empirical Software Engineering: An International Journal, 1(2):133-164, 1996.
[Basili 10] V. Basili, M. Lindvall, M. Regardie, C. Seaman, J. Heidrich, J. Muench, D. Rombach, and A. Trendowicz, "Linking Software Development and Business Strategy through Measurement," IEEE Computer, April 2010.
[Bell 06] R. M. Bell, T. J. Ostrand, and E. J. Weyuker, "Looking for bugs in all the right places," ISSTA 2006.
[Biffl 05] S. Biffl, A. Aurum, B. Boehm, H. Erdogmus, and P. Grünbacher (eds.), Value-Based Software Engineering, Springer Verlag, 2005, ISBN 3-540-25993-7.
[Boehm 00] B. W. Boehm, C. Abts, A. W. Brown, S. Chulani, B. K. Clark, E. Horowitz, R. Madachy, D. J. Reifer, and B. Steece, Software Cost Estimation with COCOMO II, Prentice Hall, 2000.
[Boehm 03] B. Boehm, "Value-based software engineering," SIGSOFT Software Engineering Notes, 28(2), March 2003.
[Budgen 09] B. Kitchenham, D. Budgen, and P. Brereton, "Is evidence based software engineering mature enough for practice & policy?" 33rd Annual IEEE Software Engineering Workshop (SEW-33), Skövde, Sweden, 2009.

References (more)
[Dieste 08] O. Dieste, N. Juristo, and F. Shull, "Understanding the Customer: What Do We Know about Requirements Elicitation?" IEEE Software, 25(2):11-13, March/April 2008.
[Dyba 07] T. Dybå, E. Arisholm, D. I. K. Sjøberg, J. Hannay, and F. Shull, "Are Two Heads Better than One? On the Effectiveness of Pair Programming," IEEE Software, 24(6):12-15, November/December 2007.
[Easterbrook 07] S. Easterbrook, J. Singer, M.-A. Storey, and D. Damian, "Selecting Empirical Methods for Software Engineering Research," in [Shull 07].
[Green 09] P. Green II, T. Menzies, S. Williams, and O. El-Rawas, "Understanding the Value of Software Engineering Technologies."
[Guo 10] P. J. Guo, T. Zimmermann, N. Nagappan, and B. Murphy, "Characterizing and predicting which bugs get fixed: An empirical study of Microsoft Windows," in Proceedings of the 32nd International Conference on Software Engineering, pp. 495-504, May 2010.
[Hall 08] T. Hall, H. Sharp, S. Beecham, N. Baddoo, and H. Robinson, "What Do We Know about Developer Motivation?" edited by F. Shull, IEEE Software, 25(4):92-94, July/August 2008.
[Kitch 07] B. Kitchenham, E. Mendes, and G. H. Travassos, "Cross versus within-company cost estimation studies: A systematic review," IEEE Trans. Softw. Eng., 33(5):316-329, 2007.
[Koc 10] E. Kocaguneli, G. Gay, T. Menzies, Y. Yang, and J. W. Keung, "When to Use Data from Other Projects for Effort Estimation," IEEE ASE 2010. Available from http://menzies.us/pdf/10other.pdf
[Koc 11] E. Kocaguneli, T. Menzies, A. Bener, and J. W. Keung, "Exploiting the Essential Assumptions of Analogy-Based Effort Estimation," IEEE Transactions on Software Engineering, 2011, to appear.

References (more)
[Koc 11b] E. Kocaguneli and T. Menzies, "How to Find Relevant Data for Effort Estimation?" ESEM 2011.
[Ma 03] A. Marcus and J. I. Maletic, "Recovering documentation-to-source-code traceability links using latent semantic indexing," in Proceedings of the 25th International Conference on Software Engineering (ICSE '03), 2003.
[Me 07] T. Menzies, J. Greenwald, and A. Frank, "Data Mining Static Code Attributes to Learn Defect Predictors," IEEE Transactions on Software Engineering, January 2007.
[Perry 04] D. E. Perry, S. E. Sim, and S. M. Easterbrook, "Case Studies for Software Engineers," ICSE'04 tutorial. http://goo.gl/3Iqlv
[Shepperd 07] M. Shepperd, "Software project economics: a roadmap," 2007 Future of Software Engineering (FOSE '07).
[Shneiderman 08] B. Shneiderman, "Science 2.0," Science, 319(7):1349-1350, March 2008.
[Shull 98] F. Shull, "Developing Techniques for Using Software Documents: A Series of Empirical Studies," Ph.D. Dissertation, University of Maryland, 1998.
[Shull 07] F. Shull, J. Singer, and D. I. K. Sjøberg (eds.), Guide to Advanced Empirical Software Engineering, Springer-Verlag, 2007.
[Thumm 09] S. Thummalapenta and T. Xie, "Mining exception-handling rules as sequence association rules," in ICSE '09: Proceedings of the 31st International Conference on Software Engineering, pp. 496-506, Washington, DC, USA, 2009. IEEE Computer Society.

References (more)
[Wu 08] X. Wu, V. Kumar, J. R. Quinlan, J. Ghosh, Q. Yang, H. Motoda, G. J. McLachlan, A. Ng, B. Liu, P. S. Yu, Z.-H. Zhou, M. Steinbach, D. J. Hand, and D. Steinberg, "Top 10 Algorithms in Data Mining," Knowledge and Information Systems, 14(1):1-37, 2008.
[Zelkowitz 08] M. Zelkowitz, "An update to experimental models for validating computer technology," J. Syst. Software (2008), doi:10.1016/j.jss.2008.06.040.
[Zimmermann 09] T. Zimmermann, N. Nagappan, H. Gall, E. Giger, and B. Murphy, "Cross-project defect prediction," in ESEC/FSE'09, August 2009.