# THE JOHNS HOPKINS UNIVERSITY 550 635 TOPICS IN

- Slides: 85

THE JOHNS HOPKINS UNIVERSITY 550. 635 TOPICS IN BIOINFORMATICS ARACNE: An Algorithm for the Reconstruction of Gene Regulatory Networks in a Mammalian Cellular Context Adam A Margolin, Ilya Nemenman, Katia Basso, Chris Wiggins, Gustavo Stolovitzky, Riccardo Dalla Favera and Andrea Califano. Reverse engineering of regulatory networks in human B cells Katia Basso, Adam Margolin, Gustavo Stolovitzky, Ulf Klein, Riccardo Dalla-Favera, and Andrea Califano http: //www. cis. jhu. edu/~sanchez/ARACNE_635_sanchez. ppt Instructor: Student: Dr. Donald Geman Francisco Sánchez Vega Baltimore, 28 November 2006

Outline • A very short introduction to MRF • The ARACNE approach • Theoretical framework • Technical implementation • Experiments and results • Synthetic data • Human B cells (paper by Basso et. al) • Discussion

A very short introduction to MRFs What is a Markov Random Field ? • Intuitive idea: concept of Markov chain Xt 0 Xt 1 … X(t-1) Xt P(Xt|Xt 0, Xt 1, …, X(t-1))=P(Xt|X(t-1)) " The world is a huge Markov chain " (Ronald A. Howard) X(t+1) …

A very short introduction to MRFs What is a Markov Random Field ? • A Markov random field is a model of the joint probability distribution over a set X of random variables. • A generalization of Markov Chains to a process indexed by the sites of a graph and satisfying a Markov property to the neighborhood system.

A very short introduction to MRFs MRF: a constructive definition • Let X={X 1, …, XN} be the set of variables (the stochastic process) whose JPD we want to model. • Start by considering a set of sites or vertices V={V 1, …, VN}. • Define a neighborhood system: ={ v, v V} where (1) v V (2) v v (3) v u u v

A very short introduction to MRFs MRF: a constructive definition • Example: “N times N” square lattice V={(i, j): 1 i, j N} c=1 (i, j) ={ (k, l) V: 1 (i-k)2+(l-j)2 c} c=2 c=8

A very short introduction to MRFs MRF: a constructive definition • At this point, we can build an undirected graph G=(V, E) • Each vertex v V is associated to one of the random variables in X • The set of edges E is given by the chosen neighborhood system. c=1 c=2

A very short introduction to MRFs MRF: a constructive definition • Clique: induced complete subgraph c=2 c=1

A very short introduction to MRFs MRF: a constructive definition • Imagine that each variable Xv in X={X 1, …, XN} can take one of a finite number Lv of values: • Define the configuration space S as • We say that X is a MRF on (V, ) if (a) P(X=x)>0, for all x S (b) P(Xv=xv|Xu=xu, u v)=P(Xv=xv|Xu=xu, u v) for v V

A very short introduction to MRFs MRF: a constructive definition c=1 c=2

A very short introduction to MRFs • Let us now associate a specific function, that we will call potential, to each clique c C, so that: • We say that a probability distribution p PS is a Gibbs distribution wrt (V, ) if it is of the form: where Z is the partition function:

A very short introduction to MRFs Correspondence Theorem X is a MRF with respect to (V, ) if and only if p(x)=P(X=x) is a Gibbs distribution with respect to (V, ) (i. e. every given Markov Random Field has an associated Gibbs distribution and every given Gibbs distribution has an associated Markov Random Field).

Outline • A very short introduction to MRF • The ARACNE approach • Theoretical framework • Technical implementation • Experiments and results • Synthetic data • Human B cells (paper by Basso et. al) • Discussion

The name of the game Theoretical framework Microarray data Graphical model

Theoretical framework • Need of a strategy to deal with uncertainty under small samples: Maximum entropy models • Philosophical basis: • Model all that is known and assume nothing about that which is unknown. • Given a certain dataset, choose a model which is consistent with it, but otherwise as “uniform” as possible

Theoretical framework Max. Ent: toy example (adapted from A. Berger) • Paul, Mary, Jane, Brian and Emily are five grad students that work in the same research lab. • Let us try to model the probability of the discrete random variable X “first person to arrive at the lab in the morning”. p(X=Paul)+p(X=Mary)+p(X=Jane)+p(X=Bryan)+p(X=Emily)=1 • If we do not know anything about them: p(X=Paul) p(X=Mary) p(X=Jane) p(X=Bryan) p(X=Emily) = 1/5 = 1/5 The most “uniform” model is the one that maximizes the entropy H(A)

Theoretical framework Max. Ent: toy example (adapted from A. Berger) • Imagine that we know that Mary or Jane are the first ones to arrive 30% of the time. • In that case, we know: p(X=Paul)+p(X=Mary)+p(X=Jane)+p(X=Bryan)+p(X=Emily)=1 p(X=Mary)+p(X=Jane)=3/10 • Again, we may want to choose the most “uniform” model, although this time we need to respect the new constraint: p(X=Paul) p(X=Mary) p(X=Jane) p(X=Bryan) p(X=Emily) = (7/10)(1/3) = 7/30 = (3/10)(1/2) = 3/20 = (7/10)(1/3) = 7/30 Maximize H(P(X)) under the given constraints

Theoretical framework Max. Ent extensions ~ constrained optimization problem • S countable set • PS set of prob. measures on S, PS ={p(x): x S}. • • set of K constraints for the problem subset of PS that satisfies all the constraints in • Let { 1, …, K} be a set of K functions S R [feature functions] • We choose these functions so that each given constraint can be expressed as: • If x are samples from a r. v. X, then E[ i(X)]= i

Theoretical framework Max. Ent extensions ~ constrained optimization problem • Ex. Let x be a sample from X ”person to arrive first in the lab” In this case, S={Paul, Mary, Jane, Bryan, Emily} • ={ 1}, 1 p(X=Mary)+p(X=Jane)=3/10 • We can model 1 it as follows: Define 1(x) 1 if x=Mary or x=Jane 0 otherwise 1=3/10 So that

Theoretical framework Max. Ent extensions ~ constrained optimization problem Find p(x)=arg max p H(p(x)) subset of PS that satisfies all the constraints in : Of course, since p(x) is a probability measure:

Theoretical framework • Use Lagrange multipliers: H(p) i

Theoretical framework • This leads to a solution of the form: But this is a Gibbs distribution, and therefore, by theorem, we can think in terms of the underlying Markov Random Field !! - We can profit from previous knowledge of MRF - We have found the graphical model paradigm that we were looking for!

The name of the game Theoretical framework Technical implementation Microarray data Graphical model

Approximation of the interaction structure • Consider the discrete, binary case. First order constraints: X 1 X 5 X 2 N X 3 X 4 …

Approximation of the interaction structure • Consider the discrete, binary case. Second order constraints: constraints

Approximation of the interaction structure • Consider the discrete, binary case. For jth order: constraints • The higher the order, the more accurate our approximation will be… … provided we observe enough data !!

Approximation of the interaction structure (from Dr. Munosby lectures, MVA) (from Elements of Statistical learning, Hastie Master et Tibshirani) “…for M → ∞ (where M is sample set size) the complete form of the JPD is restored. In fact, M > 100 is generally sufficient to estimate 2 -way marginals in genomics problems, while P(gi, gj, gk) requires about an order of magnitude more samples…”

Approximation of the interaction structure • Therefore, the model they adopt is of the form: • All genes for which ij = 0 are declared mutually non-interacting: • Some of these genes are statistically independent i. e. P(gi, gj) ≈ P(gi)P(gj)) • Some are genes that do not interact directly but are statistically dependent due to their interaction via other genes. i. e. P(gi, gj) ≠ P(gi)P(gj), but ij = 0 How can we extract this information ? ?

Approximation of the interaction structure • Therefore, the model they adopt is of the form: Pairs of genes for which the MI is a rightful indicator of dependency ij=0 Pairs of genes who interact through a third gene P(gi, gj) = P(gi)P(gj) Pairs of genes whose direct interaction is balanced out by a third gene

Mutual Information • The mutual information between two random variables is a measure of the amount of information that one random variable contains about another. • It is defined as the relative entropy (or Kullback. Leibler distance) between the joint distribution and the product of the marginals: • Alternative formulations:

Mutual Information estimation Another toy example X Y 1 0 1 1 0 0 1 0 1 P(X) P(Y) P(X=0)=6/10; P(X=1)=4/10; P(Y=0)=4/10; P(Y=1)=6/10; P(X=0, Y=0)=2/10 P(X=0, Y=1)=4/10 P(X=1, Y=0)=2/10 P(X=1, Y=1)=2/10 P(X, Y)

Mutual Information estimation Another toy example The Gaussian Kernel estimator

Mutual Information estimation Another toy example f(x, y) P(X, Y) f(x) f(y)

Mutual Information estimation Another toy example h’ = 4 h h’’ = h/2 Reference (h=1)

Mutual Information estimation • Every pair of variables is copula-transformed: (X 1, X 2) [FX 1(X 1), FX 2(X 2)], where FX(x)=P(X x) • By the Probability Integral Transform Theorem, we know that FX(X)~U[0, 1]. • Thus, the resulting variables have range between 0 and 1, and marginals are uniform. • Transformation is one-to-one H and MI unaffected • Reduction of the influence of arbitrary transformations from microarray data preprocessing. • No need for position dependent kernel widths (h).

Mutual Information estimation • The choice of h is critical for the accuracy of the MI estimate, but not so important for estimation of MI ranks.

Technical implementation • At this point, we have a procedure to construct an undirected graph from our data. • In an “ideal world” (infinite samples, assumptions hold), we would be done. • Unfortunately, things are not so simple !

Technical implementation • Finite random samples • Possible higher-order interactions Use hypothesis testing For each pair: • H 0: the two genes are mutually independent (no edge) • HA: the two genes interact with each other (edge) • We reject the null hypothesis H 0 (i. e. we “draw an edge”) when the MI between two genes is big enough. Need to choose a statistical threshold I 0

Choice of the statistical threshold • Chosen approach: Random permutation analysis • Randomly shuffle gene expression values and labels Sample 1 2 2 … Sample M Gene 1 g 12 g 13 … g 1 M Gene 2 g 21 g 22 g 23 … g 2 M … … … Gene N gn 1 gn 2 gn 3 … gn. M

Choice of the statistical threshold • Chosen approach: Random permutation analysis • Randomly shuffle gene expression values and labels • The resulting variables are supposed to be mutually independent, but their MI will need not be equal to zero. • Each threshold I 0 can then be assigned a p-value (which measures the probability of getting a MI value higher or equal to I 0 just “by chance”). • Thus, by fixing a p-value, we obtain the desired I 0.

Choice of the statistical threshold Number of pairs • Chosen approach: Random permutation analysis I 0 MI values

Choice of the statistical threshold • Chosen approach: Random permutation analysis Parameter fitting: use known fact from large deviation theory 1 2 3

Technical implementation • If we have a real distribution of the type: =0 =0 =0 ARACNe will get it wrong, because of our assumption The extension of the algorithm to higher order interactions is a possible object of future research. Real network ARACNe’s output

Technical implementation • If we have a real distribution of the type: 0 Iij=0 ARACNE will not identify the edge between gi and gj However, this situation is considered “biologically unrealistic” Real network ARACNe’s output

Technical implementation • If we have a real distribution of the type: =0 Iij 0 As it is, ARACNE will put an edge between gi and gj The algorithm can be improved using the Data Processing Inequality Real network ARACNe’s output

The Data Processing Inequality • The DPI is a well known theorem within the Information Theory community. Let X, Y, Z be 3 random variables that form a Markov Chain X-Y-Z, then I(X; Y) I(X, Z) Proof. I(X; Y, Z) = I(X, Z) + I(X; Y|Z) = I(X, Y) + I(X; Z|Y) X and Z are conditionally independent given Y I(X; Z|Y)=0 Thus, I(X, Y) = I(X, Z) + I(X; Y|Z) I(X, Z), since I(X; Y|Z) 0 Similarly, we can prove I(Y; Z) I(X, Z) , and therefore: I(X, Z) min[I(X; Y) , I(Y; Z)]

The Data Processing Inequality • A consequence of the DPI is that no transformation of Y, as clever as it can be, can increase the information that Y contains about X. (consider the MC of the form X-Y-[Z=g(Y)] ) • From an intuitive point of view: Mike Francisco Jean-Paul

The Data Processing Inequality • Going back to our case of study 0. 1 0. 3 gi 0. 1 0. 3 gj 0. 2 0. 1 0. 2 0. 3 gk So I assume that ij=0, even though P(gi, gj) P(gi)P(gj) 0. 2 0. 1 0. 3 0. 2

The Data Processing Inequality • But, what if the underlying network is truly a three-gene loop? gi gj …Then ARACNE will break the loop at the weakest edge! (known flaw of the algorithm) gk Philosophy: “An interaction is retained iff there exist no alternate paths which are a better explanation for the information exchange between two genes” Claim: In practice, looking at the TP vs. FP tradeoff, it pays to simplify

The Data Processing Inequality • Theorem 1. If MIs can be estimated with no errors, then ARACNE reconstructs the underlying interaction network exactly, provided this network is a tree and has only pairwise interactions. • Theorem 2. The Chow-Liu (CL) maximum mutual information tree is a subnetwork of the network reconstructed by ARACNE. • Theorem 3. Let πik be the set of nodes forming the shortest path in the network between nodes i and k. Then, if MIs can be estimated without errors, ARACNE reconstructs an interaction network without false positive edges, provided: (a) the network consists only of pairwise interactions, (b) for each j πik, Iij ≥ Iik. Further, ARACNE does not produce any false negatives, and the network reconstruction is exact iff (c) for each directly connected pair (ij) and for any other node k, we have Iij ≥ min(Ijk, Iik).

The Data Processing Inequality • Proof of theorem 1. (a) MIs can be estimated with no errors (b) Network is a tree (c) Only pairwise interactions ARACNE reconstructs true network without errors -(c) no problem with higher order interactions -(a) blue area boundary is ok -(b) red area is contained in yellow area. -(a), (b), DPI yellow area is ok (every edge with =0 is removed and only edges with =0 are removed)

The Data Processing Inequality The Chow-Liu Maximum Entropy tree (1968) • Method for approximating the JPD of a set of discrete variables using products of distributions involving no more than pair of variables. • The Chow-Liu method approximates a distribution P(x) by T(x), a tree structured MRF. • They proved that minimizing the Kullback-Leibler distance between P and T amounts to maximizing the total entropy of the edges of T.

The Data Processing Inequality • In practice, the variance of the MI estimator may lead to the use of a tolerance so that the DPI relations become of the form: Iij Iik(1 - ) (Erdös-Rényi) (Scale-free)

Technical implementation: summary The final algorithm 1. Choice of a threshold I 0 2. Compute all pairwise MIs 3. Draw an edge when MI I 0 4. Look at all the three-gene loops and prune edge with the lowest MI.

Technical implementation: summary The final algorithm 1. Choice of a threshold I 0 2. Compute all pairwise MIs 3. Draw an edge when MI I 0 4. Look at all the three-gene loops and prune edge with the lowest MI.

Technical implementation: summary The final algorithm 1. Choice of a threshold I 0 2. Compute all pairwise MIs 3. Draw an edge when MI I 0 4. Look at all the three-gene loops and prune edge with the lowest MI.

Technical implementation: summary The final algorithm 1. Choice of a threshold I 0 2. Compute all pairwise MIs pairs x M samples 3. Draw an edge when MI I 0 4. Look at all the three-gene loops and prune edge with the lowest MI. O(N 2 M 2+N 3) maximum triplets

Outline • A very short introduction to MRF • The ARACNE approach • Theoretical framework • Technical implementation • Experiments and results • Synthetic data • Human B cells (paper by Basso et. al) • Discussion

Experiments and results • Two different sets of experiments are presented: (a) Reconstruction of synthetic networks (b) Reconstruction of a human B lymphocyte genetic network from microarray data. • ARACNE’s performance is compared to that of two concurrent methodologies: • Bayesian networks • Relevance networks

Synthetic data: AGNs • Model proposed by Mendes et al. as a platform for comparison of reverse engineering algorithms. • Simplification of real biological networks… • …but reasonably complex to model some aspects of transcriptional regulation. “An algorithm that does not perform well on this model is unlikely to perform well in a more complex case”

Synthetic data: AGNs • Transcriptional interactions are approximated as: where xi Ni NA Ij Al … level of expression of the i-th gene number of upstream inhibitors number of activators concentration of upstream inhibitors concentration of activators

Synthetic data: AGNs • Transcriptional interactions are approximated as: • In order to generate M samples, the parameters for gene i in imaginary microarray k are: are some original constant values of the parameters are uniform random variables in [0, 2] This simulates the sampling of a population of distinct phenotypes at random time points and where the efficiency of biochemical reactions may be also distinct.

Synthetic data: AGNs • Two possible network topologies are considered: Each vertex of the graph is equally likely to be connected to any other vertex. i. e. The presence of an edge between each possible pair of genes is modeled as a Bernoulli random variable with parameter p. Erdös-Rényi

Synthetic data: AGNs • Two possible network topologies are considered: The distribution of the number of connections k associated with each vertex follows a power law: p(k)~k- , >0 Scale-free This motivates the appearance of large interaction hubs.

Synthetic data: AGNs • Two possible network topologies are considered: Erdös-Rényi Scale-free

Synthetic data: AGNs • Two performance indicators: • Recall: NTP/(NTP+NFN) (fraction of the true edges that are identified) • Precision: NTP/(NTP+NFP) (fraction of the identified edges that are true)

Synthetic data: AGNs (Erdös-Rényi) (Scale-free) • ARACNE outperforms its competitors for sufficiently small choices of p-values • When the p-value is not small enough, non-statistically significant MI values are accepted and the algorithm crashes.

Synthetic data: AGNs • ARACNE’s performance depends on MI being high for directly interacting genes and decreasing rapidly with the interaction distance.

Synthetic data: AGNs • Performance is stable in terms of kernel width as long as it is not too narrow.

Synthetic data: AGNs • Better performance for Erdös-Rényi (less loops) • High precision and substantial recall, even for small samples

Real data: Human B cells • Results for the case of human B cells are described in detail in the second paper: “Reverse engineering of regulatory networks in human B cells” Basso et al. , Nature Genetics, 37, 382 -390, 2005;

Real data: Human B cells • Large gene expression datasets such as those derived from systematic perturbations to simple organisms (yeast) are not that easily obtained for mammalian cells. • The more complex the organism becomes, the more difficult it is to distinguish among physical and functional interaction.

Real data: Human B cells • Authors assume that an equivalent dynamic richness can be obtained for a given type of human cell. • They choose to work with 340 B lymphocytes derived from normal, tumor-related and experimentally manipulated populations. • They feed the data to ARACNE, which generates a regulatory network containing aprox. 129, 000 interactions.

Real data: Human B cells • Interesting observation: • The results show a power-law tail in the relationship between the number of genes n and their number of interactions. • This is suggestive of a scale-free underlying network structure. • Important because the evidence of scale-free topology in higher-order eukaryotic cells is still scarce.

Real data: Human B cells • They use Gene Ontology to analyze the biological processes affected by the most relevant (top 5%) hubs. • Focus on the c-MYC proto-oncogene: • Because it is well characterized as a transcription factor, which helps validate the results. • Subnetwork with 2, 063 genes (56 directly connected to MYC). • The network is handicapped by certain limitations: • The edges are undirected • Some direct connections due to certain intermediates which are not even represented on the microarray • Some direct relations incorrectly removed by the DPI

Real data: Human B cells

Real data: Human B cells

Real data: Human B cells • Some results of interest can be summarized as follows: • 29/56 (51. 8%) of the first neighbors presumed “correct” (either reported in literature or Ch. IP validated in lab) • This is statistically significant wrt the expected 11% of background c-MYC targets among randomly selected genes. • Furthermore, c-MYC target genes are significantly more enriched among first neighbors than among second neighbors (51. 8% vs 19. 4%).

Discussion • ARACNE provides a provably exact network reconstruction under a controlled set of approximations. • Some limitations: • ARACNE opens all three-gene loops along weakest edge (i. e. false negatives for triplets of truly interacting genes) • Only statistical dependencies expressed as pairwise interaction potentials can be inferred. (solution: extend the model to higher orders) • Edges are undirected (inevitable with non-temporal data? ) • Some ambiguities concerning the interpretation of the inferred irreducible statistical dependencies. • Beware of loopy underlying topologies (go for tree-like)

Discussion • ARACNE provides a provably exact network reconstruction under a controlled set of approximations. • Some virtues: • Low computational complexity • No need to discretize expression levels • Does not rely on unrealistic network models or a priori assumptions (excuse-me ? ? : -|) • Avoidance of heuristic/stochastic search procedures • High precision and recall on synthetic data • Ability to infer genetic interactions on a genome-wide scale from gene-expression profiles of mammalian cells.

My point of view • There is no “free lunch”: in order to effectively deal with the small sample regime some simplifying assumptions need to be made. • Yet, it seems that this is the way to go: “divide and conquer paradigm” The big problem of inferring gene interaction networks should be broken down into smaller subproblems that can be addressed by relatively simple classes of models. Each model will then rely on strong assumptions, but they should be able to deal with complex scenarios when carefully chosen and properly combined. • No magic recipes, a long and winding road ahead… (remember Minsky? )

My point of view “…A well-known anecdote relates how, sometime in 1966, the legendary Artificial Intelligence pioneer Marvin Minsky directed an undergraduate student to solve "the problem of computer vision" as a summer project. This anecdote is often resuscitated to illustrate how egregiously the difficulty of computational vision has been underestimated. Indeed, nearly forty years later, the discipline continues to confront numerous unsolved (and perhaps unsolvable) challenges, particularly with respect to high-level "image understanding" issues such as pattern recognition and feature recognition. Nevertheless, the intervening decades of research have yielded a great wealth of well-understood, low-level techniques that are able, under controlled circumstances, to extract meaningful information from a camera scene. These techniques are indeed elementary enough to be implemented by novice programmers at the undergraduate or even high-school level…”

Thanks for your attention

- Johns Hopkins University Business Plan Competition Johns Hopkins
- Johns Hopkins High School Redesign Cohort Johns Hopkins
- Jet Grooming Brock Tweedie Johns Hopkins University 23
- Jesper W Gjerloev Johns Hopkins University Applied Physics
- Scholarship of Teaching Johns Hopkins University Anne Belcher
- Johns Hopkins University Baltimore MD The Office of
- Disk Dynamics Julian Krolik Johns Hopkins University Central
- JOHNS HOPKINS UNIVERSITY Center for Talented Youth Presentation