CodeKernel: A Graph Kernel Based Approach to the Selection of API Usage Examples
Xiaodong Gu, Hongyu Zhang, Sunghun Kim

Programming by Examples
void read(String fname) { new FileReader(fname).read(); }

Obtaining API Examples
API documentation: manually written! Too simple or incomplete.

Mining for Sample Code
A general recipe: cluster similar code snippets and select the most representative snippet of each cluster.
① Gathering: collect snippets that use the target API (e.g., FileReader.read) from the Internet or a codebase.
② Clustering: group similar snippets.
③ Selection/Synthesis: output one example per cluster, e.g., void read(String fname) { new FileReader(fname).read(); }
The key problem: how to cluster similar code?
[Figure: gathered snippets grouped into clusters of similar code]

Code Clustering: Related Work
Category                                            Related Works
Clone-detection based approach                      MUSE [Moreno et al. ICSE'15]
Extracting feature vectors                          ExoaDocs [Kim et al. AAAI'10]
Similarity heuristics, e.g., method call sequences  MAPO [Xie and Pei, MSR'06], UP-Miner [Wang et al. MSR'13]

Category 1: Code Clustering by Clone Detection
MUSE [Moreno et al. ICSE'15]
① Gathering: collect client projects that call the target API (e.g., FileReader.read).
② Slicing: slice the client code around the API call.
③ Clone detection: group the sliced snippets by clone detection.
④ Selection by popularity, e.g., void read(String fname) { new FileReader(fname).read(); }
Shortcoming: sensitive to local context, yielding redundant examples.
Moreno, et al. "How can I use this method?" ICSE 2015.

Category 2: Clustering Call Sequences
MAPO [Zhong et al. ECOOP'09], UP-Miner [Wang et al. MSR'13]
Example call sequence: SqlConnection.CreateCommand → SqlConnection.Open → SqlCommand.ExecuteReader → SqlDataReader.Read
Shortcomings
• Incomplete representation with a partial order
• Hard to recover raw code
• Relies on feature extraction (n-gram, etc.)
Zhong, Hao, et al. "MAPO: Mining and recommending API usage patterns." ECOOP 2009.
Wang et al. "Mining succinct and high-coverage API usage patterns from source code." MSR 2013.

Category 3: Extracting Feature Vectors from Code (AST)
ExoaDocs [Kim et al. AAAI'10]
int sum(int x, int n) { int s = 0; for (int i = x; i < n; i++) s = s + i; return s; }
int power(int x, int n) { int p = 1; for (int i = x; i < n; i++) p = p * x; return p; }
AST feature vector after vectorization (the same counts for both methods):
• # type declarations: 4
• # for statements: 1
• # if statements: 0
• # identifiers: 4
A similarity measure on these vectors is oversimplified: the two methods compute different things but look identical.
Kim, Jinhan, Sanghoon Lee, Seung-won Hwang, and Sunghun Kim. "Towards an intelligent code search engine." AAAI 2010.
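
To make the "oversimplified" point concrete, here is a minimal sketch that compares the two count vectors from this slide. The counts are read off the slide; the use of cosine similarity and the name cosine_similarity are assumptions for illustration, not necessarily what ExoaDocs does.

```python
import math

# Feature order: [#type declarations, #for statements, #if statements, #identifiers]
# Counts as listed on the slide; sum() and power() yield the same vector.
sum_vec = [4, 1, 0, 4]
power_vec = [4, 1, 0, 4]


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


similarity = cosine_similarity(sum_vec, power_vec)
print(similarity)  # numerically 1: the vectors cannot tell the two methods apart
```

Any similarity defined only on these counts would behave the same way, since the two inputs are identical vectors.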

CodeKernel: A Kernel Method to Cluster Source Code Directly
• Code clustering with kernel methods
• Offline processing: API list + code repository → code snippets for each API (e.g., FileReader.read) → CodeKernel → code example repository
• Example output: void read(String fname) { new FileReader(fname).read(); }

Kernel Method
We have data that is hard to compute with directly, e.g., when measuring similarities:
• non-linear for classification
• non-vectorial (requires feature engineering)
Solution: map the data into a vector space where linear relations exist among the data, then apply a linear algorithm in this space.
"Kernel trick": inner products in the new space can be computed via a kernel function in the original space! This is often computationally cheaper than the explicit computation.
[Figure: points X and O in the original space mapped by Φ into the feature space F]
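
The kernel trick can be checked on a small case. The sketch below uses the homogeneous degree-2 polynomial kernel on 2-D points, chosen here only as a textbook illustration (it is not the kernel used by CodeKernel); the names phi and poly2_kernel are made up for this example.

```python
import math


def phi(x):
    """Explicit feature map of the degree-2 polynomial kernel for 2-D input."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)


def dot(u, v):
    """Plain inner product of two equal-length tuples."""
    return sum(a * b for a, b in zip(u, v))


def poly2_kernel(x, y):
    """k(x, y) = (x . y)^2, evaluated in the original 2-D space."""
    return dot(x, y) ** 2


x, y = (1.0, 2.0), (3.0, 0.5)
explicit = dot(phi(x), phi(y))  # inner product in the 3-D feature space
implicit = poly2_kernel(x, y)   # same quantity, without ever building phi
print(explicit, implicit)
```

Both values agree up to floating-point rounding, which is the point of the trick: the algorithm only ever needs k(x, y), never Φ itself.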

Graph Kernel: Kernel Method for Graphs
• Exact graph comparison is computationally expensive (e.g., subgraph isomorphism is NP-complete).
• A graph kernel implicitly maps each graph from the original space into a continuous space via Φ(·); graphs are then clustered in the embedding space using pairwise inner products.
Widely used graph kernels
• random walk kernel
• shortest path kernel: considers only shortest paths
• ……
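
As a rough illustration of a shortest path kernel, the sketch below counts (source label, target label, distance) triples over all reachable node pairs and takes the inner product of the two count tables. This is a simplified variant with exact-match comparison of labels and distances; the graph encoding (a label dict plus a directed edge list) and all names are assumptions for this example, not CodeKernel's implementation.

```python
from collections import Counter, deque


def shortest_path_features(labels, edges):
    """Count (source label, target label, distance) triples in a directed graph.

    labels: dict mapping node -> label; edges: iterable of (u, v) pairs.
    """
    adj = {node: [] for node in labels}
    for u, v in edges:
        adj[u].append(v)
    features = Counter()
    for src in labels:
        dist = {src: 0}
        queue = deque([src])
        while queue:  # breadth-first search yields shortest hop counts
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        for dst, d in dist.items():
            if dst != src:
                features[(labels[src], labels[dst], d)] += 1
    return features


def shortest_path_kernel(g1, g2):
    """Inner product of the two graphs' shortest-path feature counts."""
    f1 = shortest_path_features(*g1)
    f2 = shortest_path_features(*g2)
    return sum(count * f2[key] for key, count in f1.items())


# Three toy graphs (hypothetical API call chains).
g_a = ({0: "FileReader.new", 1: "FileReader.read"}, [(0, 1)])
g_b = ({0: "FileReader.new", 1: "FileReader.read", 2: "FileReader.close"},
       [(0, 1), (1, 2)])
g_c = ({0: "HashMap.new", 1: "HashMap.put"}, [(0, 1)])

print(shortest_path_kernel(g_a, g_b), shortest_path_kernel(g_a, g_c))
```

The two FileReader graphs share one path (new → read at distance 1), so their kernel value is positive, while the HashMap graph shares none with them.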

Overall Workflow of CodeKernel
Represent code snippets as graphs and then cluster the graphs with a graph kernel:
01 Graph construction: graph representation for source code (code snippets → graphs)
02 Graph embedding: embedding graphs into a new continuous space (graphs → inner-product matrix)
03 Clustering: graph clustering in the new space (inner-product matrix → graph clusters)
04 Example selection (graph clusters → code examples)

Step 1: Graph Representation for Source Code
• Represent code as an object usage graph (Groum) [Nguyen et al. FSE'09]
• Groum = CFG + DFG
    StringBuffer sb = new StringBuffer();
    BufferedReader reader = new BufferedReader(new FileReader(""));
    String line = "";
    while ((line = reader.readLine()) != null)
        sb.append(line + "\n");
    reader.close();
[Figure: Groum of the snippet, with nodes StringBuffer.new, FileReader.new, BufferedReader.new, BufferedReader.readLine, while, StringBuffer.append, BufferedReader.close]
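
One simple way to hold such a graph in memory is a node-label table plus a directed edge list. The sketch below encodes the snippet from this slide by hand; the node labels follow the slide, but the edges are a hand-drawn approximation of control order and data dependencies, not the output of the actual Groum construction rules, and build_graph is a hypothetical helper.

```python
def build_graph(labels, edges):
    """Return a successor map for a directed, node-labeled graph."""
    successors = {node: [] for node in labels}
    for src, dst in edges:
        successors[src].append(dst)
    return successors


# Action nodes are API calls; `while` is a control node.
labels = {
    0: "StringBuffer.new",
    1: "FileReader.new",
    2: "BufferedReader.new",
    3: "BufferedReader.readLine",
    4: "while",
    5: "StringBuffer.append",
    6: "BufferedReader.close",
}

# Assumed control order: the loop condition (readLine) runs before the loop node.
control_edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (4, 6)]
# Assumed data dependencies: which call produces an object another call uses.
data_edges = [(1, 2), (2, 3), (2, 6), (0, 5), (3, 5)]

edges = sorted(set(control_edges + data_edges))
successors = build_graph(labels, edges)

for node, targets in successors.items():
    print(labels[node], "->", [labels[t] for t in targets])
```

A label table and an edge list are all a label-based graph kernel needs as input, which is why this flat encoding is convenient.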

Step 2: Graph Embedding with Graph Kernel
graphs → inner-product (similarity) matrix
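
The inner-product matrix is the table of kernel values over all pairs of graphs. The sketch below builds it for any kernel function; label_kernel is a deliberately crude stand-in (an inner product of node-label counts, with each graph reduced to a list of labels), and the cosine normalization step is an optional assumption rather than something stated in the slides.

```python
import math
from collections import Counter


def label_kernel(g1, g2):
    """Toy kernel: inner product of node-label histograms (stand-in for a graph kernel)."""
    c1, c2 = Counter(g1), Counter(g2)
    return sum(n * c2[label] for label, n in c1.items())


def gram_matrix(graphs, kernel):
    """K[i][j] = kernel(graphs[i], graphs[j]); filled symmetrically."""
    n = len(graphs)
    K = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            K[i][j] = K[j][i] = float(kernel(graphs[i], graphs[j]))
    return K


def normalize(K):
    """Cosine-normalize so that every graph has self-similarity 1."""
    n = len(K)
    return [[K[i][j] / math.sqrt(K[i][i] * K[j][j]) for j in range(n)]
            for i in range(n)]


# Hypothetical graphs, each given only by its node labels.
graphs = [
    ["FileReader.new", "FileReader.read"],
    ["FileReader.new", "FileReader.read", "FileReader.close"],
    ["HashMap.new", "HashMap.put"],
]
K = normalize(gram_matrix(graphs, label_kernel))
for row in K:
    print([round(value, 3) for value in row])
```

Swapping label_kernel for a real graph kernel leaves gram_matrix unchanged, since the clustering step only consumes the matrix.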

Step 3: Graph Clustering on the Kernel Matrix
• Difficulty
  • Inputs are non-vectorial!
  • K-means, GMM, EM, Bayesian methods: not directly applicable (×)
• Spectral clustering
  • Takes pairwise similarities (inner products) as input!
  • Faster and more accurate in many tasks
kernel matrix → graph partition
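
To show how a partition can be read directly off a similarity matrix, here is a minimal two-cluster spectral sketch. It assumes a symmetric matrix with non-negative entries and positive row sums, and it finds the second eigenvector of the normalized affinity by power iteration; a full spectral clustering would use several eigenvectors followed by k-means, so this is a simplification and spectral_bipartition is a made-up name.

```python
import math


def spectral_bipartition(W, iters=500):
    """Split items into two clusters from a symmetric, non-negative matrix W.

    Uses the second eigenvector of D^-1/2 W D^-1/2, found by power iteration
    with the trivial top eigenvector projected out at every step.
    """
    n = len(W)
    deg = [sum(row) for row in W]
    inv_sqrt = [1.0 / math.sqrt(d) for d in deg]
    # Add the identity and halve so that all eigenvalues lie in [0, 1].
    M = [[0.5 * (W[i][j] * inv_sqrt[i] * inv_sqrt[j] + (1.0 if i == j else 0.0))
          for j in range(n)] for i in range(n)]
    # The top eigenvector is proportional to sqrt(degree).
    top = [math.sqrt(d) for d in deg]
    top_norm = math.sqrt(sum(t * t for t in top))
    top = [t / top_norm for t in top]

    v = [float(i + 1) for i in range(n)]  # deterministic start vector
    for _ in range(iters):
        proj = sum(a * b for a, b in zip(v, top))
        v = [a - proj * b for a, b in zip(v, top)]  # remove the top component
        v = [sum(M[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(a * a for a in v))
        v = [a / norm for a in v]
    # The sign of each rescaled entry decides the cluster.
    return [0 if v[i] * inv_sqrt[i] >= 0 else 1 for i in range(n)]


# Hypothetical similarity matrix: items 0-2 are alike, items 3-4 are alike.
W = [
    [1.00, 0.90, 0.80, 0.10, 0.00],
    [0.90, 1.00, 0.85, 0.05, 0.10],
    [0.80, 0.85, 1.00, 0.00, 0.10],
    [0.10, 0.05, 0.00, 1.00, 0.90],
    [0.00, 0.10, 0.10, 0.90, 1.00],
]
clusters = spectral_bipartition(W)
print(clusters)
```

Which group receives label 0 is arbitrary; only the grouping itself is meaningful.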

Step 4: Code Example Selection (Ranking)
Candidate snippets:
void read(String fname) { new AudioReader(fname).read(); }
void read(String fname) { new TextReader(fname).readLine(); }
void read(String fname) { new FileReader(fname).read(); }
void read(String fname) { new BuffReader(fname).read(); }
• Rank Metric

Evaluation
Research Questions
RQ 1: How accurate are the code examples selected by CodeKernel?
RQ 2: How useful is CodeKernel for selecting API usage examples?
RQ 3: Does the graph kernel help improve the graph clustering performance?

Accuracy of Code Clustering (RQ 1)
• Data
  • 473 code snippets that are relevant to 14 randomly selected Java APIs
• Metrics
  • Precision: the ratio of pairwise decisions (placing two graphs in the same cluster or in different clusters) that are correct
  • Recall: the ratio of graph pairs that should be placed in the same or in different clusters for which the correct decision is made
• Baselines
  • MUSE: clustering code snippets by program slicing and clone detection
  • ExoaDocs: clustering code snippets with similarity heuristics between AST element vectors
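
Pair-based clustering metrics are often computed by counting pairs of items. The sketch below shows one common pair-counting formulation, where a true positive is a pair placed together in both the predicted and the reference clustering; whether the study uses exactly this definition is an assumption, and the function name and toy labels are hypothetical.

```python
from itertools import combinations


def pairwise_precision_recall(predicted, truth):
    """Pair-counting precision, recall, and F1 for two flat clusterings.

    predicted, truth: lists of cluster ids, one per snippet, in the same order.
    """
    tp = fp = fn = 0
    for i, j in combinations(range(len(predicted)), 2):
        same_pred = predicted[i] == predicted[j]
        same_true = truth[i] == truth[j]
        if same_pred and same_true:
            tp += 1  # correctly placed together
        elif same_pred:
            fp += 1  # placed together, but should be apart
        elif same_true:
            fn += 1  # placed apart, but should be together
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


truth = [0, 0, 0, 1, 1]      # reference clusters for five snippets
predicted = [0, 0, 1, 1, 1]  # one snippet assigned to the wrong cluster
precision, recall, f1 = pairwise_precision_recall(predicted, truth)
print(precision, recall, f1)
```

In this toy case one misplaced snippet breaks two correct pairs and creates two wrong ones, so all three scores come out at 0.5.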

Accuracy of Code Clustering (RQ 1)
A Comparison between CodeKernel and Baselines
Approach     Precision   Recall   F1
MUSE         ≤ 0.70      ≤ 0.36   ≤ 0.46
ExoaDocs     0.31        0.67     0.39
CodeKernel   0.86, 0.79 (only two of the three values are preserved)
☞ Our approach outperforms the two baselines in terms of F1-score.

An Example
Example 1 [from 6 instances]   Centrality↓: 0.8085   Specificity↑: 0.4202
    Repo.add(final String name, final String content) {
        final File dir = new File(this.path);
        final File file = new File(dir, name);
        FileUtils.writeStringToFile(file, content);
        this.git.exec(dir, "add", name);
    }
Instance 1   Centrality↓: 0.8085   Specificity↑: 0.4202
    (same code as Example 1 above)
Instance 2   Centrality↓: 0.7777   Specificity↑: 0.5648
    GSISSHAbstractCluster.submitBatchJob(JobDescriptor jobDescriptor) {
        int number = new SecureRandom().nextInt();
        number = (number < 0 ? -number : number);
        tempPBSFile = new File(Integer.toString(number) + jobManagerConfiguration.getScriptExtension());
        FileUtils.writeStringToFile(tempPBSFile, scriptContent);
    }
Instance 3   Centrality↓: 0.7739   Specificity↑: 0.6023
    ConfigGenerator.generateConfig(FileInfo template, FileInfo filter, String outputBasePath,
            StrSubstitutor strSub, Map<String, Set<String>> missPropertiesByFilename,
            boolean missingPropertyFound) {
        String rawTempl = FileUtils.readFileToString(template.getFile());
        Properties properties = readFilterIntoProperties(filter);
        String processedTemplate = StrSub.replace(rawTempl, properties);
        FileUtils.writeStringToFile(new File(outputFilename), processedTemplate);
    }
……

Usefulness (RQ 2)
A user study involving 25 developers in a multinational company, each with more than 2 years of programming experience.
Questionnaire
• Overall, are the selected examples useful for understanding API usages?
  Very Useful 38%, Useful 57%, Not Useful 5%
• Tool comparison: which tool produces better examples?
[Figure: votes per API for CodeKernel, eXoaDoc, and "Similar"]

The Effects of Graph Kernel (RQ 3)
• Replace the graph kernel component in CodeKernel with a traditional similarity measure.
[Figure: comparison of clustering results]
☞ Graph kernel is effective in code clustering.

Conclusion
CodeKernel: a graph kernel based approach for clustering and selecting code examples
• graph representation of code snippets
• graph kernel for code graph embedding and clustering
Future Work
• Other tasks that require a code similarity measure, e.g., code search, code clone detection, etc.
• Deep learning to further improve the completeness and readability of the code examples.

Thank you! Q&A