L10: Classic RF to uncover biological interactions
Kirill Bessonov
GBIO 0002
Nov 24th 2015
Talk Plan
• Trees
– Basic concepts
– Examples
• Tree-based algorithms
– Regression trees
– Random Forest
• Practical on RF
– RF variable selection
• Networks
– Network vocabulary
– Biological networks
Data Structures
• An arrangement of data in a computer's memory
• Allow convenient access by algorithms
• Main types
– Arrays
– Lists
– Stacks
– Queues
– Binary trees
– Graphs
(Illustrations: tree, stack, queue, graph)
Trees
• A tree is a data structure with hierarchical relationships
• Basic elements
– Nodes (N)
• Variables/features
• e.g. files, genes, cities
– Edges (E)
• Directed links between nodes
• Run from lower to higher depth (parent to child)
Nodes (N)
• Usually defined by one variable
• Selected from the {x1 … xp} variables
– Different selection criteria
• e.g. strength of association to the response (Y ~ X)
• e.g. the "best" split
• Others
• A node variable can take many forms
– A question
– A feature
– A data point
Binary splits
• A node can be split in
– Two ways (binary split)
• Two child nodes
– Multiple ways (multi-split)
• Several child nodes
• Travelling from the top to the bottom of a tree
– Like finding your way through a cave maze
Edges (E)
• Edges connect parent and child nodes (parent → child)
• Directional
• Do not have weights
• Represent node splits
Tree types
• Decision
– Moving to the next node requires taking a decision
• Classification
– Predict a class
• Use input to predict an output class label
• i.e. classify the input
• Regression
– Predict an output value
• i.e. use input to predict a continuous output
Predicting response(s)
• Input: data on a sample; nodes: variables
• Travel from the root down to the child nodes
Decision Tree example
Classification tree
• Will a banking customer accept a personal loan? Classes = {yes, no}
• Leaf nodes predict the class of the input (i.e. the customer)
• Output class label: a yes or no answer
Classification example

Name  | Cough | Fever | Weight | Pain | Class
Marie | yes   |       | skinny | none | flu
Jean  | no    | no    | normal | none | none
Marc  | yes   | no    |        |      | cold
Regression trees
• Predict a continuous outcome
– e.g. the price of a car
Trees
• Purpose 1: recursively partition data
– Cut the data space into perpendicular hyperplanes
• Purpose 2: classify data
– Class label at the leaf node
– e.g. will a potential customer respond to a direct mailing?
• Predicted binary class: YES or NO
Source: "Decision Trees" by Lior Rokach
Tree growth and splitting
• In the top-down approach
– Assign all data to the root node
– Select attribute(s)/feature(s) to split the node (e.g. X < x vs. X > x, Y < y vs. Y > y)
– Stop tree growth when
• Max depth is reached
• The splitting criterion is not met (leaves/terminal nodes)
Recursive splitting (1)
(Figure: successive splits partition the data)
Recursive splitting (2)
• Each node corresponds to a quadrant of the data space (axes: Variable 1, Variable 2)
Recursive splitting (3)
Stopping rules
• When is splitting enough?
– Max depth reached
• No new levels wanted
– Node contains too few samples
• Prone to unreliable results
– Further splits do not improve
• The purity of the child nodes
• The association Y ~ X falls below a threshold
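These rules map directly onto tuning parameters in tree software. A minimal sketch using the rpart library (rpart is not part of this practical; the dataset and parameter values are purely illustrative):

library(rpart)
# grow a classification tree, stopping on depth, node size and improvement
fit = rpart(Species ~ ., data = iris, method = "class",
            control = rpart.control(maxdepth = 3,    # rule 1: max depth reached
                                    minsplit = 20,   # rule 2: too few samples to split
                                    cp = 0.01))      # rule 3: a split must improve fit enough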
Other tree uses
• Trees can also be used for
– Clustering
– Hierarchy determination
• e.g. phylogenetic trees
• Convenient visualization
– Effective visual condensation of clustering results
• Gene Ontology
– A directed acyclic graph (DAG)
– An example of a functional hierarchy
GO tree example
Last common ancestor
Alignment trees
Tree Ensembles: Random Forest (RF)
Random Forest
• Introduced by Breiman (1999 technical report; published in 2001)
– An ensemble of tree predictors
– Each tree "votes" for the most popular class
• Significant performance improvement
– Over previous classification algorithms such as CART and C4.5
• Relies on randomness
• Variable selection based on the purity of a split
– GINI index
Random Forests
Randomly build an ensemble of trees:
1. Bootstrap a sample of the data and start building a tree
2. Create a node by
   a. Randomly selecting m variables out of the M available
   b. Keeping m constant (except for terminal nodes)
3. Split the node based on the m variables
4. Grow the tree until no more splits are possible
5. Repeat steps 1-4 n times to generate an ensemble of trees
6. Calculate variable importance for each predictor variable X
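A conceptual sketch of steps 1-5 in R (illustrative only: grow_tree() is a hypothetical helper standing in for the tree-growing routine, which the randomForest library used later implements internally; X and y are assumed to hold the predictors and response):

n_trees = 500
m = floor(sqrt(ncol(X)))                     # number of variables tried per node
forest = vector("list", n_trees)
for (t in 1:n_trees) {
  boot = sample(nrow(X), replace = TRUE)     # step 1: bootstrap sample of the data
  # steps 2-4: grow one tree; at every node a fresh random subset of
  # m out of the M variables is considered for the split
  forest[[t]] = grow_tree(X[boot, ], y[boot], mtry = m)  # hypothetical helper
}
# step 6: variable importance is then aggregated over all trees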
Random Forest animation
(Animated illustration: data rows A1-D7 are bootstrapped into different samples, and each tree splits them on random thresholds such as <X2 and >X3)
Building a forest
• Forest: a collection of several trees
• Random Forest: an aggregation of several decision trees
• Logic
– A single tree has highly variable performance
– A forest of trees gives good and stable performance
• Predictions are averaged over several trees
Splits
• Based on a split-purity criterion for a node
– The Gini impurity criterion (GINI index)
– Measures the impurity of the outputs after a split
• A node is split on the variable m with the lowest GINI Index split (see "GINI Index of a split" below)

Gini = 1 - Σ_j p_j^2

where j is the class and p_j is the probability of class j (the proportion of samples with class j)
Goodness of split
• Which two splits would give
– the highest purity?
– the lowest Gini Index?
GINI Index calculation
• At a given tree node, p(normal) = 0.4 and p(asthma) = 0.6. Calculate the node's GINI Index:
Gini Index = 1 - (0.4^2 + 0.6^2) = 0.48
• What is the GINI index of a "pure" node?
– Gini Index = 1 - 1^2 = 0
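The same calculation in R, as a two-line helper (ours, not a library function):

gini = function(p) 1 - sum(p^2)   # p = vector of class probabilities
gini(c(0.4, 0.6))                 # 0.48
gini(c(1, 0))                     # 0: a "pure" node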
GINI Index of a split
• The GINI Index of a split is the weighted average of the child nodes' GINI indices:

Gini_split = Σ_{k=1..K} (n_k / n) * Gini(k)

where K is the number of child nodes, n_k the number of samples in child k, and n the total number of samples in the parent node
GINI Index split
• Given tax evasion data (3 "Yes", 7 "No") sorted by the income variable, choose the best split value based on the GINI Index split:

Split at (Income <=)  10K   20K   30K    40K    50K    60K   70K    80K    90K
Yes (<= | >)          0|3   0|3   0|3    0|3    1|2    2|1   3|0    3|0    3|0
No  (<= | >)          0|7   1|6   2|5    3|4    3|4    3|4   4|3    5|2    7|0
GINI Index split      0.42  0.40  0.375  0.343  0.417  0.40  0.343  0.375  0.42

• The best splits are those with the lowest GINI Index split (0.343)
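The table's best value can be reproduced with a small helper (ours, for illustration); each side of the split is given as its (Yes, No) counts:

gini = function(counts) 1 - sum((counts / sum(counts))^2)
gini_split = function(left, right) {
  n = sum(left) + sum(right)
  sum(left)/n * gini(left) + sum(right)/n * gini(right)   # weighted child impurity
}
gini_split(left = c(0, 3), right = c(3, 4))   # 0.343, the minimum in the table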
mtry parameter
• To build each node, m variables are selected
• Specified by the mtry parameter
• Allows different trees to be built
– Randomized selection of variables at each split
– Gives heterogeneity to the forest
• Given X = {A, B, C, D, E, F, G} and mtry = 2
– Node N1 = {A, B}
– Node N2 = {C, G}
– Node N3 = {A, D}
– …
• Default mtry = sqrt(p)
– p = number of predictor variables
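In the randomForest library (used in the practical below), mtry is simply an argument of the main call; the formula and data here are placeholders:

library(randomForest)
rf = randomForest(y ~ ., data = my_data, mtry = 2, ntree = 500)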
RF stopping criteria
• RF is an ensemble of non-pruned decision trees
• A tree is grown until
– A node reaches maximum purity (GINI index = 0)
• All samples belong to the same class
– No more samples are available for the next split
• 1 sample in a node
• Greedy splitting
• randomForest library minimum samples per node
– 1 in classification
– 5 in regression
Performance comparison
• RF handles
– Missing values
– Continuous and categorical predictors (i.e. X)
– High-dimensional datasets where p >> N
• Improves over single-tree performance (* lower is better)
Source: Breiman, Leo. "Statistical modeling: The two cultures." Quality Control and Applied Statistics 48.1 (2003): 81-82.
RF variable importance (1)
• Need to estimate the "importance" of each predictor {x1 … xp} in predicting a response y
• Produces a ranking of predictors
• Variable importance measure (VIM)
– Classification: misclassification rate (MR)
– Regression: mean square error (MSE)
• VIM: the mean increase in error (MR or MSE) over the forest when the values of a predictor are randomly permuted in the OOB samples
Out-of-bag (OOB) samples
• The dataset is divided into
– Bootstrap (i.e. training) samples
– OOB (i.e. testing) samples
• Trees are built on the bootstrap samples
• Predictions are made on the OOB samples
• Benefits
– Avoids over-fitting
– Avoids false results
RF variable importance (2)
1. Predict the classes of the OOB samples using each tree of the RF
2. Calculate the misclassification rate = out-of-bag error rate (OOBerror_obs) for each tree
3. For each variable in the tree, permute the variable's values
4. Using the tree, compute the permutation-based out-of-bag error (OOBerror_perm)
5. Aggregate OOBerror_perm over all trees
6. Compute the final VIM
RF variable importance (3)
• The final VIM of a variable x_j is the mean increase in OOB error over the forest after permutation:

VIM(x_j) = (1 / ntree) * Σ_{t=1..ntree} (OOBerror_perm,t - OOBerror_obs,t)
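A hand-rolled sketch of the permutation idea for one variable (randomForest's importance() computes this internally, tree by tree on the true OOB sets; this simplified version uses a single held-out set, and all names are ours):

perm_importance = function(rf, X_test, y_test, var) {
  base_err = mean(predict(rf, X_test) != y_test)   # OOBerror_obs
  X_perm = X_test
  X_perm[[var]] = sample(X_perm[[var]])            # permute one predictor's values
  perm_err = mean(predict(rf, X_perm) != y_test)   # OOBerror_perm
  perm_err - base_err                              # increase in error = VIM
}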
Aims of variable selection
• Find variables related to the response
– e.g. predictive of the class with the highest probability
• Simplify the problem
– Summarize the dataset with fewer variables
– Decrease dimensionality
RFs in R: the randomForest library (Titanic example)
randomForest
• A well-implemented library with good performance
• Main functions
– randomForest()
– importance()
• Install the library
– install.packages("randomForest", repos="http://cran.freestatistics.org")
• Load titanic_example.Rdata
randomForest
Which factor was the most important in the survival of passengers?

library(randomForest)
titanic_data = read.table(file="http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/titanic.txt",
                          header=T, sep=",")
train_idx = sample(1:1313, 0.75*1313)
test_idx  = which(!(1:1313 %in% train_idx))
titanic_data.train = titanic_data[train_idx, ]
titanic_data.test  = titanic_data[test_idx, ]
titanic.survival.train.rf = randomForest(as.factor(survived) ~ pclass + sex + age,
                                         data=titanic_data.train, ntree=1000,
                                         importance=TRUE, na.action=na.omit)
c_matrix = titanic.survival.train.rf$confusion[1:2, 1:2]
print("accuracy in training data")
sum(diag(c_matrix)) / sum(c_matrix)
imp = importance(titanic.survival.train.rf)
print(imp)
varImpPlot(titanic.survival.train.rf)

Output of importance():

              0        1  MeanDecreaseAccuracy  MeanDecreaseGini
pclass 25.08341 27.00470              30.59483          16.38874
sex    77.77933 82.17791              84.19724          74.51846
age    22.82038 22.48106              30.55145          18.07370
Variable importance
• Which variable was the most important in the survival of passengers?
RF to uncover networks of interactions
One to many
• So far we have seen one response Y and many predictors X
• RF can be used to predict many Ys sequentially
– To create an interaction network
RF to interaction networks (1)
• Can we build networks from tree ensembles?
– Need to consider all possible interactions
• Y1 ~ X, then Y2 ~ X … Yp ~ X
– Need to "shift" Y to a new variable
• Assign Y to a new variable (previously in X)
– Complete the matrix of interactions (i.e. the network)
RF to interaction networks (2)
• For example, given continuous data with variables A, B, C, build an interaction network
– Consider the interaction scenarios
• A ~ {B, C}
• B ~ {A, C}
• C ~ {A, B}
– Requires 3 RF runs giving 3 sets of VIMs
– Fill out the interaction network matrix (a p x p matrix with zeros on the diagonal):

    A  B  C
A   0
B      0
C         0
RF to interaction networks (3)
• Read in an input data matrix D with 3 variables (continuous scale)
• Aim: variable ranking and response prediction
• Load RFtoNetworks.Rdata

rf = randomForest(A ~ B + C, data=D, importance=T); importance(rf, 1)
   %IncMSE
B 8.706760
C 8.961513
rf = randomForest(B ~ A + C, data=D, importance=T); importance(rf, 1)
   %IncMSE
A 9.603829
C 3.271325
rf = randomForest(C ~ A + B, data=D, importance=T); importance(rf, 1)
   %IncMSE
A 8.830840
B 1.519951

D =
        A      B     C
1    6.27   4.41  7.95
2    3.35  11.18  5.76
3    1.11   1.32  7.74
4    3.64   6.82  3.83
5   10.42  10.51  5.05
6    1.83   2.67  7.78
7    6.07   6.24  5.30
8    3.50   6.85  1.56
9    8.44  10.50  4.89
10   9.73   9.15  8.05
RF to interaction networks (4)
• Fill out the interaction matrix:

      A         B         C
A     0         8.706760  8.961513
B     9.603829  0         3.271325
C     8.830840  1.519951  0

(Figure: the resulting 3-node network, with edges weighted by the VIMs above)
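The three runs above can be wrapped in a loop that fills the whole p x p matrix. A sketch assuming D is the data frame from the previous slide (reformulate() is base R):

library(randomForest)
vars = colnames(D)
net = matrix(0, length(vars), length(vars), dimnames = list(vars, vars))
for (v in vars) {
  # regress each variable on all the others and record the VIMs
  rf = randomForest(reformulate(setdiff(vars, v), response = v),
                    data = D, importance = TRUE)
  imp = importance(rf, type = 1)    # %IncMSE per predictor
  net[v, rownames(imp)] = imp[, 1]  # row = response, columns = predictors
}
net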
Networks of biological interactions
Networks
• What comes to your mind? Related terms?
• Where can we find networks?
• Why should we care to study them?
We are surrounded by networks
Transportation Networks
Computer Networks
Social networks
Internet submarine cable map
Social interaction patterns
PPI (Protein-Protein Interaction) Networks
• Nodes: protein names
• Links: physical binding events
Network Definitions
Network components
• Networks are also called graphs
– A graph (G) contains
• Nodes (N): genes, SNPs, cities, PCs, etc.
• Edges (E): links connecting two nodes
Characteristics
• Networks are
– Complex
– Dynamic (e.g. changing from time t0 to time t)
• Networks can be used to reduce data dimensionality
Topology
• Refers to the connection pattern of a network
– The pattern of links
Modules
• Sub-networks with
– A specific topology
– A function
• Biological context
– Protein complexes
– Common function
• e.g. energy production
(Example: a clique)
Network types
• Directed
– Edges have directionality
– Some links are unidirectional
– Direction matters
• Going A → B is not the same as B → A
– Analogous to chemical reactions
• The forward rate might not equal the reverse rate
– e.g. directed gene regulatory networks (TF → gene)
• Undirected
– Edges have no directionality
– Simpler to describe and work with
– e.g. co-expression networks
Edge types
• A graph G has N nodes and E edges
• Edges can be directed or undirected
Neighbours of node(s)
• Neighbours(node, order) = {node1 … nodep}
• Neighbours(3, 1) = {2, 4}
• Neighbours(2, 2) = {1, 3, 5, 4}
Node degree (k)
• The number of edges connected to the node
• k(6) = 1
• k(4) = 3
Connectivity matrix (also known as adjacency matrix)
• Size: N x N
• Entries: binary (unweighted) or weighted
• A = the matrix of connections between all node pairs
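The same structure in R: a small binary adjacency matrix for a made-up 3-node path graph, with node degree read off the row sums (igraph shown for comparison):

A = matrix(c(0, 1, 0,
             1, 0, 1,
             0, 1, 0),
           nrow = 3, byrow = TRUE,
           dimnames = list(1:3, 1:3))
rowSums(A)    # node degree k of each node: 1, 2, 1

library(igraph)
g = graph_from_adjacency_matrix(A, mode = "undirected")
degree(g)     # same degrees via igraph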
Degree distribution (P(k))
• Determines the statistical properties of uncorrelated networks
Source: http://www.network-science.org/powerlaw_scalefree_node_degree_distribution.html
Topology: random
• Node degrees are statistically independent (edges are placed at random)
Topology: scale-free
• Biological processes are characterized by this topology
– Few hubs (highly connected nodes)
– Predominance of poorly connected nodes
– New vertices attach preferentially to highly connected ones
Source: Barabási, Albert-László, and Réka Albert. "Emergence of scaling in random networks." Science 286.5439 (1999): 509-512.
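Such networks can be simulated by preferential attachment, e.g. with igraph's Barabási-Albert generator (the network size and power here are illustrative):

library(igraph)
g = sample_pa(1000, power = 1, directed = FALSE)   # preferential attachment
head(sort(degree(g), decreasing = TRUE))           # a handful of hubs dominate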
Topologies: scale-free
• Most real networks have a degree distribution that follows a power law, e.g.:
– the sizes of earthquakes
– craters on the moon
– solar flares
– the sizes of activity patterns of neuronal populations
– the frequencies of words in most languages
– frequencies of family names
– sizes of power outages
– criminal charges per convict
– and many more
Shortest path (p)
• Indicates the distance between nodes i and j in terms of geodesics (unweighted)
• Paths from 1 to 3:
– {1-5-4-3}
– {1-5-2-3}
– {1-2-5-4-3}
– {1-2-3} ← the shortest: p(1, 3) = {1-2-3}
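The slide's 5-node example, rebuilt in igraph (the edge list is inferred from the paths listed above):

library(igraph)
g = graph_from_edgelist(rbind(c(1,2), c(2,3), c(2,5), c(3,4), c(4,5), c(1,5)),
                        directed = FALSE)
distances(g, v = 1, to = 3)                # 2: the geodesic distance
shortest_paths(g, from = 1, to = 3)$vpath  # the shortest path {1-2-3}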
Cliques
• A clique of a graph G is a complete subgraph of G
– i.e. a maximally interconnected subgraph
• The highlighted clique is the maximal clique, of size 4 (nodes)
"The richest people in the world look for and build networks. Everyone else looks for work." – Robert Kiyosaki
Biological context
Biological Networks
Biological examples
• Co-expression networks
– For genes that have similar expression profiles
• Directed gene regulatory networks (GRNs)
– Show directionality between gene interactions
• Transcription factor → target gene expression
– Shows the direction of information flow
– e.g. a transcription factor activating a target gene
• Protein-Protein Interaction (PPI) networks
– Show physical interactions between proteins
– Concentrate on binding events
• Others
– Metabolic, differential, Bayesian, etc.
Biological networks
• Three main classes:

Class: molecular interactions
– PPI: nodes = proteins, edges = physical bonds, resource = BioGRID
– DTI: nodes = drugs/targets, edges = physical bonds, resource = PubChem
– GI: nodes = genes, edges = genetic interactions, resource = BioGRID
Class: functional associations
– ON: nodes = Gene Ontology terms, edges = functional relations, resource = GO
– GDA: nodes = genes/diseases, edges = associations, resource = OMIM
– Co-Ex: nodes = genes, edges = expression profile similarity, resources = GEO, ArrayExpress
Class: structural similarities
– PStrS: nodes = proteins, edges = functional/structural similarities, resource = PDB

Source: Gligorijević, Vladimir, and Nataša Pržulj. "Methods for biological data integration: perspectives and challenges." Journal of The Royal Society Interface 12.112 (2015): 20150571.
Summary
• Trees are powerful techniques for response prediction
– e.g. classification
• Random Forest is a powerful tree-ensemble technique for variable selection
• RF can be used to build networks
– By assessing pair-wise variable associations
• Networks are well suited for representing interactions
– Biological networks are scale-free
References
1) https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm
2) Breiman, Leo. "Statistical modeling: The two cultures." Quality Control and Applied Statistics 48.1 (2003): 81-82.
3) Liaw, Andy, and Matthew Wiener. "Classification and regression by randomForest." R News 2.3 (2002): 18-22.
4) Loh, Wei-Yin, and Yu-Shan Shih. "Split selection methods for classification trees." Statistica Sinica 7.4 (1997): 815-840.