Monotony and Surprise: Algorithmic and Combinatorial Foundations of Pattern Discovery
Alberto Apostolico, University of Padova and Georgia Inst. of Tech.
Alberto Apostolico - Erice 05
http://www.cc.gatech.edu/~axa/papers
A) Specialized Material
• A. Apostolico and G. Bejerano, ``Optimal Amnesic Probabilistic Automata, or How to Learn and Classify Proteins in Linear Time and Space'', RECOMB 2000 and Journal of Computational Biology, 7(3/4): 381--393, 2000.
• A. Apostolico, M. E. Bock, S. Lonardi and X. Xu, ``Efficient Detection of Unusual Words'', Proceedings of RECOMB 2002 and Journal of Computational Biology, 7(1/2): 71--94, 2000.
• A. Apostolico, F. Gong and S. Lonardi, ``Verbumculus and the Detection of Unusual Words'', Journal of Computer Science and Technology, 19:1 (Special Issue on Bioinformatics), 22--41 (2004).
• A. Apostolico and L. Parida, ``Incremental Paradigms of Motif Discovery'', Journal of Computational Biology, 11:1, 15--25 (2004).
• A. Apostolico, M. E. Bock and S. Lonardi, ``Monotony of Surprise and Large Scale Quest for Unusual Words'', Journal of Computational Biology, 10, 3-4, 283--311 (2003).
• A. Apostolico and C. Pizzi, ``Monotone Scoring of Patterns with Mismatches'', Proceedings of the 4th Workshop on Algorithms in Bioinformatics, Bergen, Norway, Springer Verlag LNCS 3240, 87--98 (2004).
http://www.cc.gatech.edu/~axa/papers
B) Introductory Material
• A. Apostolico and M. Crochemore, ``String Pattern Matching for a Deluge Survival Kit'', Handbook of Massive Data Sets, J. Abello et al., Eds., Kluwer Acad. Publishers, to appear.
• A. Apostolico, ``General Pattern Matching'', Handbook of Algorithms and Theory of Computation, M. J. Atallah, ed., CRC Press, Ch. 13, pp. 1--22 (1999).
• A. Apostolico, ``Of Maps Bigger than the Empire'', Keynote, SPIRE 2001, IEEE Press (2001).
• A. Apostolico, ``Pattern Discovery and the Algorithmics of Surprise'', Artificial Intelligence and Heuristic Methods for Bioinformatics (P. Frasconi and R. Shamir, eds.), IOS Press, 111--127 (2003).
• A. Apostolico, ``Pattern Discovery in the Crib of Procrustes'', Imagination and Rigor, Essays on Eduardo R. Caianiello's Scientific Heritage Ten Years after his Death (S. Termini, ed.), Springer-Verlag, to appear 2005.
Acknowledgements
• Gill Bejerano, Dept. of Computer Science, The Hebrew University
• Mary Ellen Bock, Dept. of Statistics, Purdue University
• Matteo Comin, Univ. of Padova
• Jianhua Dong, Dept. of Industrial Technology, Purdue University
• S. Lonardi, Dept. of Comp. Science and Eng., UC Riverside
• Fu Lu, Celera
• FangCheng Gong, Celera
• Laxmi Parida, IBM
• Cinzia Pizzi, Univ. of Padova
• Xuyan Xu, Capital One
Form = Function
A hemoglobin molecule consists of four polypeptide chains: two α-globin chains (shown in green and blue) and two β-globin chains (shown in yellow and orange). Each globin chain contains a heme (shown in red). Hemoglobin is the protein that carries oxygen from the lungs to the tissues and carries carbon dioxide from the tissues back to the lungs. To function most efficiently, hemoglobin needs to bind oxygen tightly in the oxygen-rich atmosphere of the lungs and to release oxygen rapidly in the relatively oxygen-poor environment of the tissues.
Bioinformatics: the Road Ahead
``... more than any other single factor, the sheer volume of data poses the most serious challenge: many problems that are ordinarily quite manageable become seemingly insurmountable when scaled up to these extents. For these reasons, it is evident that imaginative new applications of technologies designed for dealing with problems of scale will be required. For example, it may be imagined that data mining techniques will have to supplant manual search, intelligent data base integration will be needed in place of hyperlink browsing, scientific visualization will replace conventional interface to the data, and knowledge-based systems will have to supervise high-throughput annotation of the [sequence] data''
[D. B. Searls, Grand Challenges in Computational Biology, in Salzberg, Searls, Kasif (eds.), Elsevier, 1998]
• At a joint EU-US panel meeting on large scientific databases held in Annapolis in 1999, I was invited, together with the physicists and the earth observation scientists, to represent the needs of computational biology. In honesty to my duty, I said time and again that the data available to biology was a tiny fraction of what is produced in earth observation and high-energy physics. Just as the others were dismissing me, saying that we did not need money, I said: don't worry, we will make up for it with the data we will generate.
• Biology is a natural science: it dissects and multiplies. Formal sciences synthesize and cluster. There is no telling where these two will go together.
Which Information Anyway
• Greek ``eidos'' is form, appearance, or, in Latin, species
• Information is the modern, quantified version of what the Greeks called ``eidos''
• It is a measure of the amount of structure
• The three dimensions of information:
  - syntactic (formal medium without meaning)
  - semantic (dualism of subject and object invented by modern philosophy)
  - pragmatic (attempt to describe the understanding of meaning as a natural process)
King Phillip Came Over For Green Soup
(Kingdom, Phylum, Class, Order, Family, Genus, Species)
Biologists group organisms by morphology to represent similarities and propose relationships.
Linnaeus' Taxonomy (partial)
The ``Chinese'' Taxonomy
Attributed by a Dr. Franz Kuhn to the Chinese Encyclopedia entitled Celestial Emporium of Benevolent Knowledge.
``Animals are divided into (a) those that belong to the Emperor, (b) embalmed ones, (c) those that are trained, (d) suckling pigs, (e) mermaids, (f) fabulous ones, (g) stray dogs, (h) those that are included in this classification, (i) those that tremble as if they were mad, (j) innumerable ones, (k) those drawn with a very fine camel's hair brush, (l) others, (m) those that have just broken a flower vase, (n) those that resemble flies from afar''
J. L. Borges, ``The Analytical Language of John Wilkins'', from Otras Inquisiciones (Other Inquisitions 1937-1952, London: Souvenir Press, 1973)
Summary
• Form and Information
• To Classify and Generate
• Of Free Lunches, Ugly Ducklings, and Little Green Men
• Privileging Syntactic Information
• Avoidable and Unavoidable Regularities
• Periods, Palindromes, Squares, etc.
• Theories Bigger than Life
• Motifs, Profiles and Weight Matrices
• The Emperor's New Map
Defining ``Class''
From Watanabe's ``Pattern Recognition as Information Compression'', in Frontiers in Pattern Recognition
Class can be defined by
1. intension (= list of properties or predicates), or
2. extension (= list of names of individual members)
Class can also be defined by
3. paradigm (= show a few members and, optionally, a few non-members). This is what the brain does well (and what pattern recognition does poorly)
Finally, Class can be defined by
4. clustering (= we are not even given paradigms, but rather sets of objects, and are asked to isolate subsets with strong coherence)
Class by Intension
From Watanabe's ``Pattern Recognition as Information Compression'', in Frontiers in Pattern Recognition
Types of class intension:
• Vectorial approach (statistical pattern recognition), which divides into two:
  - In the conventional zone, a class is characterized by a predicate of the type: belongs to such and such volume of the n-dimensional representation space
  - In the subspace method, a class is characterized by a predicate of the type: belongs to such and such subspace of the n-dimensional representation space
• Structural or grammatical approach: a class is characterized by a predicate of the type: consists of such and such elementary components, which are arranged together in such and such ways
Note: structural and vector descriptions are not uncorrelated; on the contrary. For example, multiple sequence alignment can be considered as a search for the dimensions along which paradigms of noisy vectors exhibit the same value.
Statistical Classification
• A Class is formed by Objects with many Predicates in common
• Theorem of the Ugly Duckling (S. Watanabe): as long as all of the predicates characterizing the objects to be classified are given the same importance or ``weight'', a swan will be found to be just as similar to a duck as to another swan.
• Classification as experienced on an empirical basis is only possible to the extent that the various predicates characterizing objects are given non-uniform weights.
Statistical Classification
• Cannot measure similarity by # of shared features: a member with only the left eye is more similar to one with no eye than to one with only the right eye
• Must measure similarity by # of shared predicates
• But this number is the same for every pair, irrespective of which objects are compared:
  Total # of predicates: $\sum_{r=0}^{n} \binom{n}{r} = (1+1)^n = 2^n$
  Total # of predicates shared by ANY two patterns: $\sum_{r=2}^{d} \binom{d-2}{r-2} = (1+1)^{d-2} = 2^{d-2}$
• Theorem of the Ugly Duckling (S. Watanabe): as long as all of the predicates characterizing the objects to be classified are given the same importance or ``weight'', a swan will be found to be just as similar to a duck as to another swan.
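The counting argument can be checked by brute force on a small universe. The sketch below is only an illustration under one stated assumption: a predicate is identified with the subset of the d objects that satisfy it (the function name and this encoding are not from the slides).

```python
from itertools import combinations

def shared_predicate_counts(d):
    """For every pair of d distinct objects, count the predicates true of both.

    Assumption of this sketch: a predicate is the subset of objects that
    satisfy it, so there are 2**d predicates in all.
    """
    objects = range(d)
    # Enumerate every predicate as a subset of the object set.
    predicates = [frozenset(c) for r in range(d + 1)
                  for c in combinations(objects, r)]
    counts = {}
    for a, b in combinations(objects, 2):
        counts[(a, b)] = sum(1 for p in predicates if a in p and b in p)
    return len(predicates), counts

total, counts = shared_predicate_counts(5)
print(total)                  # number of predicates on 5 objects
print(set(counts.values()))   # one single value: every pair shares equally many
```

With unweighted predicates every pair gets the same score, which is the content of the theorem; any useful similarity has to come from weighting some predicates more than others.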
Inferring Grammars
Grammatical inference problem:
Input: a finite set of symbol strings from some language L and, possibly, a finite set of strings from the complement of L
Output: a grammar for the language
``Precisely the same problem arises in trying to choose a model or theory to explain a collection of sample data. This is one of the most important information processing problems known and it is surprising that there has been so little work on its formalization.'' (Biermann-Feldman, 1972)
Regular, Anomalous, Entropy, Negentropy
• Shannon: information is entropy
• Brillouin: information is negentropy, entropy is chaos
• Key to the paradox: actual versus potential information
• How can we express gain in information? (Difference between two distributions?) This measure is global and can be either positive or negative
• A better measure (Alfred Rényi): always positive
Random, Regular, Compressible
• Measuring structure in finite objects presupposes the ability to measure randomness in such objects.
• Defining randomness has been an elusive goal for statisticians since the turn of the last century.
• Kolmogorov's definition of information (note the resemblance to molecular evolution): information (alternatively, conditional information) is the length of the recorded sequence of zeroes and ones that constitutes a shortest program by which a universal machine produces one string from scratch (alt., from another string).
Random, Regular, Compressible
• Kolmogorov's definition of information (note the resemblance to molecular evolution): information (alternatively, conditional information) is the length of the recorded sequence of zeroes and ones that constitutes a shortest program by which a universal machine produces one string from scratch (alt., from another string).
• The programs of length less than k are at most: λ, 0, 1, 00, 01, 10, 11, ..., 11...1 (k-1 ones)
• The number of strings with a program of length less than k is therefore at most $1 + 2 + 4 + \dots + 2^{k-1} = 2^k - 1 < 2^k$
• Bad news: there is hardly such a notion as that of a finite random sequence, and yet most very long strings are complex. Any given short sequence seems to exhibit some kind of regularity; in the limit, however, a great many sequences of sufficiently large length are seen to be incompressible and hence to appear as random.
• It thus appears that we attribute and measure structure in finite objects only to the extent that we privilege (i.e., assign a high weight to) certain regularities and neglect others (is this the structural classification pendant to the Theorem of the Ugly Duckling?)
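The counting step can be made concrete. The sketch below enumerates the candidate programs and bounds the share of n-bit strings that could have a short description; it assumes, as in the usual argument, that each program outputs at most one string (the function names are hypothetical, introduced only for this illustration).

```python
from fractions import Fraction
from itertools import product

def programs_shorter_than(k):
    """Every binary string of length < k, the empty string included."""
    return ["".join(bits) for length in range(k)
            for bits in product("01", repeat=length)]

def max_fraction_with_short_program(n, k):
    """Upper bound on the fraction of n-bit strings having a program of
    length < k, assuming each program outputs at most one string."""
    return Fraction(2 ** k - 1, 2 ** n)

print(len(programs_shorter_than(4)))              # 1 + 2 + 4 + 8
print(max_fraction_with_short_program(20, 10))    # below 2**-10
```

So fewer than one 20-bit string in a thousand can be compressed by more than ten bits, whatever the universal machine.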
Summary
• Form and Information
• To Classify and Generate
• Of Free Lunches, Ugly Ducklings, and Little Green Men
• Privileging Syntactic Information
• Avoidable and Unavoidable Regularities
• Periods, Palindromes, Squares, etc.
• Theories Bigger than Life
• Motifs, Profiles and Weight Matrices
Privileging Syntactic Regularities in Strings
Syntactic regularities in strings are pervasive notions in Computer Science and its applications. In Molecular Biology, regularities are variously implicated in diverse facets of biological function and structure.
Typical string regularities:
- cadences
- periods
- squares or tandem repeats
- repetitions
- palindromes
- episodes
- motifs
- other exact variants and approximate versions thereof
• There are avoidable and unavoidable regularities!
Unavoidable Regularities
If N is partitioned into k classes, one of the classes contains arbitrarily long arithmetic progressions (Baudet-Artin-van der Waerden, 1926-27)
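A finite counterpart of this statement can be verified exhaustively for tiny parameters. The sketch below, with hypothetical helper names, checks that every partition of {1, ..., 9} into two classes contains a three-term arithmetic progression inside one class, while {1, ..., 8} can still be split so as to avoid one.

```python
from itertools import product

def has_mono_ap(coloring, length):
    """True if the integers 1..n (coloring[i-1] is the class of i) contain an
    arithmetic progression with `length` terms lying in a single class."""
    n = len(coloring)
    for start in range(1, n + 1):
        for step in range(1, n):
            last = start + (length - 1) * step
            if last > n:
                break  # larger steps only overshoot further
            terms = {coloring[start + j * step - 1] for j in range(length)}
            if len(terms) == 1:
                return True
    return False

def every_partition_has_ap(n, k, length):
    """Exhaustive check over all k**n partitions of 1..n into k classes."""
    return all(has_mono_ap(c, length) for c in product(range(k), repeat=n))

print(every_partition_has_ap(8, 2, 3))  # False: 8 integers can avoid it
print(every_partition_has_ap(9, 2, 3))  # True: 9 integers cannot
```

The search space grows as k**n, so this is only feasible for toy sizes; it is meant to make "unavoidable" tangible, not to compute anything new.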
Avoidable Regularities
Periods, Borders
Periods and periodicities are pervasive notions of string algorithmics, e.g., KMP string searching
abaabaababaabaab
A string can have many periods (slide examples: abacabac, abacab)
The smallest one is THE period of the string
Periods cannot coexist too long
A string can have many periods (slide examples: abacaba, abacabac, abacab)
• Periodicity Lemma (Lyndon-Schützenberger, 62): if w has two periods p and q and |w| is at least p+q, then w has period gcd(p, q)
• Proof: write w = x[1..n] and assume wlog p > q. Take any position i with i+p-q not larger than n. Either 1) i-q is not smaller than 1, or 2) i+p is not larger than n (if i is at most q, then i+p is at most p+q, hence at most n)
• case 1: x[i] = x[i-q] = x[i-q+p]; case 2: x[i] = x[i+p] = x[i+p-q]
• so p-q is a period; now repeat on q and p-q
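As an illustration, all periods of a string can be read off the border (failure) array used in KMP string searching, and the lemma can then be tested on examples. This is a minimal sketch with function names chosen here, not taken from the slides.

```python
from math import gcd

def border_array(w):
    """KMP failure function: b[i] is the length of the longest proper
    border (prefix that is also a suffix) of w[:i+1]."""
    b = [0] * len(w)
    k = 0
    for i in range(1, len(w)):
        while k > 0 and w[i] != w[k]:
            k = b[k - 1]
        if w[i] == w[k]:
            k += 1
        b[i] = k
    return b

def periods(w):
    """All periods of w in increasing order: p is a period exactly when
    w has a border of length len(w) - p."""
    n = len(w)
    if n == 0:
        return []
    b = border_array(w)
    result = []
    k = b[-1]
    while k > 0:              # walk the chain of borders, longest first
        result.append(n - k)
        k = b[k - 1]
    result.append(n)          # the empty border gives the trivial period
    return result

def check_periodicity_lemma(w):
    """For every pair of periods p, q with p + q <= |w|, gcd(p, q) is a period."""
    ps = set(periods(w))
    return all(gcd(p, q) in ps for p in ps for q in ps if p + q <= len(w))

print(periods("abacabac"))                 # [4, 8]
print(periods("abacaba"))                  # [4, 6, 7]
print(check_periodicity_lemma("abababab")) # True
```

The first entry of `periods(w)` is THE period of the string; the whole list is obtained in time linear in |w|.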
Avoidable Regularities
Periods and periodicities are pervasive notions of string algorithmics, e.g., KMP string searching
abaabaababaabaab
A string can have many periods (slide examples: abacabac, abacab)
The smallest one is THE period of the string
Palindromes: w = w^R
Once we know how to compute optimally ALL periods of a string, we can also compute all initial palindromes
• Proof: run the algorithm on w*w^R, e.g., abab...*...baba
(In fact, all palindromes of a string can be computed in serial linear time: Manacher, 76)
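The reduction from initial palindromes to borders can be sketched as follows. The code assumes that `*` on the slide stands for a separator symbol that does not occur in w (here a NUL character by default); function names are chosen for this example only.

```python
def border_array(s):
    """KMP failure function: b[i] is the length of the longest proper
    border of s[:i+1]."""
    b = [0] * len(s)
    k = 0
    for i in range(1, len(s)):
        while k > 0 and s[i] != s[k]:
            k = b[k - 1]
        if s[i] == s[k]:
            k += 1
        b[i] = k
    return b

def prefix_palindrome_lengths(w, sep="\0"):
    """Lengths of all non-empty palindromic prefixes of w, longest first.

    A border of w + sep + reversed(w) is a prefix u of w whose reversal is
    again u. Assumption: sep does not occur in w.
    """
    s = w + sep + w[::-1]
    b = border_array(s)
    lengths = []
    k = b[-1]
    while k > 0:              # walk the chain of borders, longest first
        lengths.append(k)
        k = b[k - 1]
    return lengths

print(prefix_palindrome_lengths("abacabad"))  # [7, 3, 1]
```

The same border array that yields all periods thus yields all initial palindromes in linear time.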
Squares or Tandem Repeats
or why does the genetic code need more than 2 characters
• Square: a string in the form ww with w a primitive string
• Primitive string: a string that cannot be rewritten in the form v^k with k > 1
• Square-free string: a string that contains no square
Longest square-free string on two symbols: 010?
Thue (1906): on an alphabet of at least 3 symbols we can write indefinitely long square-free strings
Istrail's morphism (square-free on ``a''):
  rew(a) -> abc
  rew(b) -> ac
  rew(c) -> b
Square-free morphism:
  rew(a) -> abcab
  rew(b) -> acabcb
  rew(c) -> acbcacb
There are about n^2 ways of choosing indices i and j; thus n^2 squares?
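Both claims on this slide lend themselves to a small experiment. The sketch below reads the rules a -> abc, b -> ac, c -> b as Istrail's morphism (an assumption about how the two rule columns of the slide pair up), iterates it starting from ``a'', and tests the result with a brute-force square detector; it also finds the longest square-free binary string by exhaustion.

```python
from itertools import product

def has_square(s):
    """Brute-force test for a factor of the form xx with x non-empty."""
    n = len(s)
    for half in range(1, n // 2 + 1):
        for i in range(n - 2 * half + 1):
            if s[i:i + half] == s[i + half:i + 2 * half]:
                return True
    return False

# Assumed reading of the slide: Istrail's rewriting rules.
ISTRAIL = {"a": "abc", "b": "ac", "c": "b"}

def iterate(morphism, seed, rounds):
    """Apply the letter-to-word rewriting `rounds` times, starting from seed."""
    word = seed
    for _ in range(rounds):
        word = "".join(morphism[ch] for ch in word)
    return word

def longest_square_free(alphabet, max_len):
    """Largest n <= max_len such that some square-free string of length n exists."""
    best = 0
    for n in range(1, max_len + 1):
        if any(not has_square("".join(t)) for t in product(alphabet, repeat=n)):
            best = n
    return best

print(iterate(ISTRAIL, "a", 3))               # abcacbabcbac
print(has_square(iterate(ISTRAIL, "a", 8)))   # False
print(longest_square_free("01", 6))           # 3
```

The detector is cubic in the worst case and is only meant for checking short words; the linear-time and n log n algorithms cited in the talk are a different matter.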
Detecting Squares
How many squares?
• There can be c n log n squares in a string (Crochemore, 81)
• Example: Fibonacci words
  F_0 = a, F_1 = b, F_i = F_{i-1} F_{i-2}
  a, b, ba, bab, babba, babbabab, babbababbabba, ...
• Optimal n log n algorithms since the early 80's (Main-Lorentz, AA-Preparata, Rabin, Crochemore)
• Recent (Kosaraju, Gusfield)
• Parallel (AA, Crochemore-Rytter, AA-Breslauer)
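To get a feeling for how the number of squares grows, one can generate Fibonacci words with the recurrence above and count occurrences of squares ww with w primitive by brute force. The function names below are illustrative, and the counting is deliberately naive.

```python
def fibonacci_word(i):
    """F0 = 'a', F1 = 'b', Fi = F(i-1) + F(i-2), as in the recurrence above."""
    prev, cur = "a", "b"
    if i == 0:
        return prev
    for _ in range(i - 1):
        prev, cur = cur, cur + prev
    return cur

def is_primitive(w):
    """A non-empty w is primitive iff it occurs in ww only at offsets 0 and len(w)."""
    return (w + w).find(w, 1) == len(w)

def count_primitive_squares(s):
    """Number of occurrences (start, length) of factors ww with w primitive."""
    n = len(s)
    count = 0
    for half in range(1, n // 2 + 1):
        for i in range(n - 2 * half + 1):
            w = s[i:i + half]
            if w == s[i + half:i + 2 * half] and is_primitive(w):
                count += 1
    return count

print(fibonacci_word(6))  # babbababbabba
for i in range(2, 13):
    f = fibonacci_word(i)
    print(i, len(f), count_primitive_squares(f))
```

Printing the counts next to the lengths shows the growth that the c n log n bound refers to; the brute-force counter itself is far from the optimal algorithms listed on the slide.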
Tandem Repeats, Repeated Episodes
(Myers '87, Kannan-Myers '92, Landau-Schmidt '93, Benson '98, Ap.-Federico '98, Myers-Sagot '99, Ap.-Atallah '99)
Input: textstring
Output: repeated episode (within constraints)
(worst-case quadratic, or nk with max k errors)
[Figure: repeated episodes in a textstring; labels: Max 12 pos, Max 30 pos]
Pattern Discovery in WAKA
Alluded to: Kokin-shu #315 (Minamoto-no-Muneyuki)
  ya-ma-sa-to-ha fu-yu-so-sa-hi-sa ma-sa-ri-ke-ru hi-to-me-mo-ku-sa-mo ka-re-nu-to-o-mo-he-ha
  A hamlet in mountain is the drearier in winter. I feel that there is no one to see and no green around
Allusive variation: Shugyoku-shu #3528 (Jien)
  ya-to-sa-hi-te hi-to-me-mo-ku-sa-mo ka-re-nu-re-ha so-te-ni-so-no-ko-ru a-ki-no-shi-ra-tsu-yu
  My home has been deserted. Now in autumn, there is no one to see and no green around. There is a pearl dew left in my sleeve
Discovering instances of poetic allusion from anthologies of classical Japanese poems
Theoretical Computer Science, Volume 292, Issue 2
Masayuki Takeda, Tomoko Fukuda, Ichiro Nanri, Mayumi Yamasaki, Koichi Tamari
ABSTRACT: Waka is a form of traditional Japanese poetry with a 1300-year history. In this paper, we attempt to semi-automatically discover instances of poetic allusion, or more generally, to find similar poems in anthologies of Waka poems. One reasonable approach would be to arrange all possible pairs of poems in two anthologies in decreasing order of similarity values, and to scrutinize high-ranked pairs by human effort. The means of defining similarity between Waka poems plays a key role in this approach. In this paper, we generalize existing (dis)similarity measures into a uniform framework, called string resemblance systems, and using this framework, we develop new similarity measures suitable for finding similar poems. Using the measures, we report successful results in finding instances of poetic allusion between two anthologies, Kokin-Shu and Shin-Kokin-Shu. Most interestingly, we have found an instance of poetic allusion that has never before been pointed out in the long history of Waka research.
Cheating by Schoolteachers
(the longest substring common to k of n strings)
[Figure: answer strings of the students of a class, one string per student; the same long block of answers (e.g., acd24a3a12dadbcb4a) recurs across many of the strings]
From: S. D. Levitt and S. J. Dubner, Freakonomics, William Morrow, 2005
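The problem named in the slide title can be stated in a few lines of code. The sketch below is a brute-force illustration (the function name and the toy strings are made up for this example): for decreasing lengths it counts in how many strings each substring occurs and stops at the first length with a substring supported by at least k strings.

```python
def longest_common_to_k(strings, k):
    """A longest substring occurring in at least k of the given strings
    (the lexicographically smallest one if there are ties); '' if none."""
    if k < 1 or k > len(strings):
        return ""
    max_len = max(len(s) for s in strings)
    for length in range(max_len, 0, -1):
        support = {}
        for s in strings:
            # Each string contributes at most once per distinct substring.
            seen = {s[i:i + length] for i in range(len(s) - length + 1)}
            for sub in seen:
                support[sub] = support.get(sub, 0) + 1
        winners = sorted(sub for sub, c in support.items() if c >= k)
        if winners:
            return winners[0]
    return ""

answers = ["xxabcdyy", "zabcdq", "abcz", "qqq"]   # toy data, not from the book
print(longest_common_to_k(answers, 2))  # abcd
print(longest_common_to_k(answers, 3))  # abc
```

This enumeration is quadratic or worse in the input size; it is only meant to pin down what is being computed.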
Summary
• Form and Information
• To Classify and Generate
• Of Free Lunches, Ugly Ducklings, and Little Green Men
• Privileging Syntactic Information
• Avoidable and Unavoidable Regularities
• Periods, Palindromes, Squares, etc.
• Theories Bigger than Life
• Motifs, Profiles and Weight Matrices
General Form of Pattern Discovery
• Find and exploit a priori unknown patterns, or associations thereof, in a Data Base
  - With some prior domain-specific knowledge
  - Without any domain-specific prior knowledge
• Tenet: a pattern or association (rule) that occurs more frequently than one would expect is potentially informative and thus interesting
frequent = interesting
Data Compression by Textual Substitution
1. Detect Repeated Patterns
2. Set up Dictionary
3. Use Pointers to Dictionary to Encode Replicas
• Redundancy (repetitiveness) is sought in order to remove it
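One concrete instance of the three steps is an LZ78-style parse, used here purely as an illustration; the slide does not name a specific scheme, and the function names are chosen for this sketch. Repeated phrases are detected on the fly, entered in a dictionary, and each new phrase is encoded as a pointer to an earlier dictionary entry plus one extra character.

```python
def lz78_encode(text):
    """LZ78-style parse: a list of (dictionary index, next character) pairs."""
    dictionary = {"": 0}          # phrase -> index; index 0 is the empty phrase
    out = []
    phrase = ""
    for ch in text:
        if phrase + ch in dictionary:
            phrase += ch          # keep extending a phrase already seen
        else:
            out.append((dictionary[phrase], ch))
            dictionary[phrase + ch] = len(dictionary)
            phrase = ""
    if phrase:
        # Leftover phrase is already in the dictionary: pointer only.
        out.append((dictionary[phrase], ""))
    return out

def lz78_decode(pairs):
    """Rebuild the text by replaying the dictionary construction."""
    phrases = [""]
    out = []
    for index, ch in pairs:
        phrase = phrases[index] + ch
        phrases.append(phrase)
        out.append(phrase)
    return "".join(out)

pairs = lz78_encode("abababababa")
print(pairs)
print(lz78_decode(pairs) == "abababababa")  # True
```

The more repetitive the text, the longer the phrases and the fewer the pairs, which is the sense in which redundancy is sought in order to remove it.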
Consumer Prediction (Data Mining), Intrusion Detection (Security), Protein Classification (Bio-Informatics)
• Infer consistent behavior from protocol of past record
• Use to predict future behavior or detect malicious practices
1) Collect a set of behavioral sequences (normal profile) into a repository or dictionary
2) Define measure(s) of sequence similarity
3) Compare any new sequence to the dictionary, using similarity to past behavior as a basis for classification as normal or anomalous
• Anomaly is sought as a carrier of information
• Similarity or predictability equals fitness to the model
• Learning from positive & negative samples
Of Exactitude in Science
... In that Empire, the craft of Cartography attained such Perfection that the Map of a Single province covered the space of an entire City, and the Map of the Empire itself an entire Province. In the course of Time, these Extensive maps were found somehow wanting, and so the College of Cartographers evolved a Map of the Empire that was of the same Scale as the Empire and that coincided with it point for point. Less attentive to the Study of Cartography, succeeding Generations came to judge a map of such Magnitude cumbersome, and, not without Irreverence, they abandoned it to the Rigours of Sun and Rain. In the western Deserts, tattered Fragments of the Map are still to be found, Sheltering an occasional Beast or beggar; in the whole Nation, no other relic is left of the Discipline of Geography.
From Travels of Praiseworthy Men (1658) by J. A. Suarez Miranda.
The piece was written by Jorge Luis Borges and Adolfo Bioy Casares. English translation quoted from J. L. Borges, A Universal History of Infamy, Penguin Books, London, 1975.
Detection and Analysis of Gene Regulatory Regions
(Jacques van Helden, http://copan.cifn.unam.mx/Computational_Biology/yeast-tools)
``Starting from the simple knowledge that a set of genes share some regulatory behavior, one can suppose that some elements are shared by their upstream region, and one would like to detect such elements. We implemented a simple and fast method to extract such elements, based on a detection of over-represented oligonucleotides. J. Mol. Biol. (1998) 281, 827-842.''
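The idea of flagging over-represented oligonucleotides can be sketched as follows. This is not the method of the cited paper: it is a toy version that compares observed counts with the counts expected under an assumed background in which letters are independent, and it applies no significance statistics. The function name, the toy sequences and the rounded background frequencies are all assumptions of this example.

```python
from collections import Counter
from math import prod

def over_represented(seqs, k, background, min_ratio=2.0):
    """Words of length k whose observed count is at least min_ratio times the
    count expected if letters were drawn independently from `background`.
    Returns (word, observed, expected) triples, most over-represented first."""
    observed = Counter()
    positions = 0
    for s in seqs:
        for i in range(len(s) - k + 1):
            observed[s[i:i + k]] += 1
            positions += 1
    hits = []
    for word, occ in observed.items():
        expected = positions * prod(background[ch] for ch in word)
        if occ >= min_ratio * expected:
            hits.append((word, occ, expected))
    return sorted(hits, key=lambda h: h[1] / h[2], reverse=True)

# Hypothetical upstream regions sharing the element acgcgt.
upstream = ["ttacgcgtaa", "acgcgtcc", "gggacgcgta"]
# Assumed background letter frequencies (rounded, for illustration only).
bg = {"a": 0.2879, "c": 0.2121, "g": 0.2121, "t": 0.2879}
hits = over_represented(upstream, 6, bg)
print(hits[0][0], hits[0][1])   # acgcgt 3
```

On such tiny inputs every observed word looks over-represented, which is exactly why a real method needs a proper background model and a measure of statistical surprise.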
http://www.ucmb.ulb.ac.be/bioinformatics/rsa-tools/
Index of /bioinformatics/rsa-tools/data/Escherichia_coli_K12/oligo-frequencies
A table of mono-mers only contains 4 lines
; seq  observed_freq    occ
a      0.2879006655447  301075
c      0.2120993344553  221805
g      0.2120993344553  221805
t      0.2879006655447  301075
http://www.ucmb.ulb.ac.be/bioinformatics/rsa-tools/
Index of /bioinformatics/rsa-tools/data/Escherichia_coli_K12/oligo-frequencies
A table of 2-mers contains 16 lines
; seq  observed_freq    occ
aa     0.0996514874362  103508
ac     0.0516799845961  53680
ag     0.0522951766631  54319
at     0.0840396649658  87292
ca     0.0630865504958  65528
cc     0.0474795417349  49317
cg     0.0490959853663  50996
ct     0.0522951766631  54319
ga     0.0559112351978  58075
gc     0.0573659381920  59586
gg     0.0474795417349  49317
gt     0.0516799845961  53680
ta     0.0692904592279  71972
tc     0.0559112351978  58075
tg     0.0630865504958  65528
tt     0.0996514874362  103508
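A table in this layout (seq, observed_freq, occ) can be produced with a few lines of code. The sketch below is a generic single-strand count over overlapping windows, with a function name chosen here; it is not the RSA-tools implementation, and the equal counts of reverse-complementary words in the tables above suggest that those were computed over both strands, which this sketch does not do.

```python
from collections import Counter
from itertools import product

def kmer_table(seq, k, alphabet="acgt"):
    """Rows (word, observed_freq, occ) for all len(alphabet)**k words,
    in lexicographic order, counting overlapping occurrences in seq."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())   # number of windows, including any
                                   # that contain letters outside alphabet
    rows = []
    for letters in product(alphabet, repeat=k):
        word = "".join(letters)
        occ = counts.get(word, 0)
        rows.append((word, occ / total if total else 0.0, occ))
    return rows

rows = kmer_table("acgtacgt", 2)
print(len(rows))                       # 16 lines for 2-mers
for word, freq, occ in rows:
    if occ:
        print(word, round(freq, 4), occ)
```

The number of rows is fixed at 4**k regardless of the sequence, which is what makes the full table unwieldy as k grows.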
http://www.ucmb.ulb.ac.be/bioinformatics/rsa-tools/
A table of 3-mers contains 64 lines
; seq  observed_freq    occ
aaa    0.0374140033303  38601
aac    0.0180639045638  18637
aag    0.0163987337723  16919
aat    0.0276555984825  28533
aca    0.0162746698251  16791
acc    0.0114206678905  11783
acg    0.0118180602214  12193
act    0.0121282200894  12513
aga    0.0141936909606  14644
agc    0.0127214008370  13125
agg    0.0133359050756  13759
agt    0.0121282200894  12513
ata    0.0221735228152  22877
atc    0.0172361654160  17783
atg    0.0170946549762  17637
att    0.0276555984825  28533
caa    0.0181356290333  18711
cac    0.0117036887701  12075
cag    0.0161176513919  16629
cat    0.0170946549762  17637
cca    0.0113256814309  11685
ccc    0.0103942325773  10724
ccg    0.0122910540202  12681
cct    0.0133359050756  13759
cga    0.0102721071292  10598
cgc    0.0147384092288  15206
cgg    0.0122910540202  12681
cgt    0.0118180602214  12193
cta    0.0088434332371  9124
ctc    0.0108817651198  11227
ctg    0.0161176513919  16629
ctt    0.0163987337723  16919
gaa    0.0180416118233  18614
gac    0.0096450026461  9951
gag    0.0108817651198  11227
gat    0.0172361654160  17783
gca    0.0166342614221  17162
gcc    0.0133436590723  13767
gcg    0.0147384092288  15206
gct    0.0127214008370  13125
gga    0.0123763479839  12769
ggc    0.0133436590723  13767
ggg    0.0103942325773  10724
ggt    0.0114206678905  11783
gta    0.0123288547541  12720
gtc    0.0096450026461  9951
gtg    0.0117036887701  12075
gtt    0.0180639045638  18637
taa    0.0259671657010  26791
tac    0.0123288547541  12720
tag    0.0088434332371  9124
tat    0.0221735228152  22877
tca    0.0190302464026  19634
tcc    0.0123763479839  12769
tcg    0.0102721071292  10598
tct    0.0141936909606  14644
tga    0.0190302464026  19634
tgc    0.0166342614221  17162
tgg    0.0113256814309  11685
tgt    0.0162746698251  16791
tta    0.0259671657010  26791
ttc    0.0180416118233  18614
ttg    0.0181356290333  18711
ttt    0.0374140033303  38601
With increasing k, a table of k-mers grows rapidly out of proportion.
How many k-mers in total, for all k?
http://www.ucmb.ulb.ac.be/bioinformatics/rsa-tools/
[Table of 4-mers (columns: seq, observed_freq, occ), shown on the slide as several overlapping columns; the individual rows are not recoverable from the extraction]
A table of k-mers grows rapidly out of proportion, or out of sight.
How many k-mers in total, for all k?
How many distinct substrings in a string of n symbols?
1 ≤ i ≤ j ≤ n
A: no more than (n × n)/2 (n ways to choose the beginning i, then n-i ways to choose the end j)
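The bound can be checked by brute force on small strings. The sketch below uses hypothetical helper names (`distinct_substrings`, `pair_bound`) and compares the number of distinct substrings with n(n+1)/2, the exact number of (begin, end) choices that the (n × n)/2 estimate rounds off. It is a quadratic illustration only, not an efficient method.

```python
def distinct_substrings(x):
    """Collect every nonempty substring x[i:j] by trying all begin/end pairs."""
    n = len(x)
    return {x[i:j] for i in range(n) for j in range(i + 1, n + 1)}


def pair_bound(n):
    """Number of (begin, end) pairs with begin <= end: n + (n-1) + ... + 1."""
    return n * (n + 1) // 2


for x in ["aaaa", "abab", "abcd"]:
    # A string with all symbols distinct meets the bound; repetitions fall below it.
    print(x, len(distinct_substrings(x)), "<=", pair_bound(len(x)))
```

Only a string whose symbols are all different attains the bound; any repetition makes some (i, j) pairs spell the same word.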
How many surprising substrings in a string of n symbols?
• Agree on a model for the source: e.g., the source emits symbols independently with identical distribution
• Agree on some measure of surprise: e.g., departure from expected number of occurrences exceeds a certain threshold
• For a given observed string of n symbols, how many substrings may turn out to be surprising?
A: possibly, all (n × n)/2 of them!
Source Modeling by Probabilistic Finite State Automata
[Figure: three models of the same binary source: an order-2 Markov chain, a probabilistic suffix automaton, and a probabilistic suffix tree. Suffix tree node labels and next-symbol distributions: 00 (0.75, 0.25); 10 (0.25, 0.75); 0 (0.5, 0.5); 1 (0.5, 0.5); one unlabeled node (0.5, 0.5)]
Approximate Patterns: Finding surprising substrings with mismatches
• Input: a sequence or set of sequences, integers m and k
• Output: all substrings of length m that occur unusually often, up to k mismatches, as replicas of the same pattern
• NOTE: the pattern might never occur exactly in the input
How many patterns should one try?
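The NOTE above is easy to see on a toy input. The sketch below counts the windows of a text that lie within k mismatches (Hamming distance) of a candidate pattern; the function names and the sample text are illustrative assumptions, and the scan is the naive O(nm) one.

```python
def hamming(u, v):
    """Number of positions where two equal-length strings differ."""
    return sum(a != b for a, b in zip(u, v))


def count_with_mismatches(text, pattern, k):
    """Count length-m windows of text within k mismatches of pattern."""
    m = len(pattern)
    return sum(
        hamming(text[i:i + m], pattern) <= k
        for i in range(len(text) - m + 1)
    )


text = "acgtaggt"       # made-up input
pattern = "atgt"        # never occurs exactly in text
print(count_with_mismatches(text, pattern, 0))  # exact occurrences
print(count_with_mismatches(text, pattern, 1))  # occurrences with <= 1 mismatch
```

Here `atgt` has no exact occurrence, yet both `acgt` and `aggt` are replicas of it with one mismatch. Since the pattern need not appear in the input, the candidates are in principle all |Σ|^m words of length m (256 for m = 4 over DNA).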
From the Special Issue for the 50th Shannon Anniversary of IEEE Trans. IT
``Perhaps as a consequence of the fact that approximate matches abound whereas exact matches are unique, it is inherently much faster to look for an exact match than it is to search from a plethora of approximate matches looking for the best, or even nearly the best, among them. The right way to trade off search effort in a poorly understood environment against the degree to which the product of the search possesses desired criteria has long been a human enigma.''
T. Berger and J. D. Gibson, ``Lossy Source Coding,'' IEEE Trans. on Inform. Theory, vol. 44, no. 6, pp. 2693-2723, 1998.
Syntactic Motif: a recurring pattern with some solid characters and some characters that are a subset of the alphabet, or a ``don't care'' or ``gap''
[Figure: an example motif over a DNA textstring, with its solid characters and ``don't care'' characters labeled]
PROBLEM
Input: textstring
Output: repeated motifs
Motifs may be rigid or extensible (sometimes also called flexible)
From Syntax to Stat: Extracting a Profile Matrix & Consensus (from Hertz-Stormo 99)
Alignment Matrix (four aligned sequences of length 6)
     1  2  3  4  5  6
A    4  1  0  1  0  1
C    0  0  0  1  1  1
G    0  3  3  0  2  1
T    0  0  1  2  1  1
Consensus (by majority rule): A G G T G ?
n_{i,j} = number of times letter i is observed at the jth position of the alignment
N = number of sequences = 4
NOTE: while each sequence is a ``realization'' of the consensus, the consensus itself might not be any of the sequences
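A minimal sketch of the two steps on this slide: counting letters per column and taking the majority consensus, with "?" on ties. The four sequences are a hypothetical alignment chosen only because it gives counts A = 4 1 0 1 0 1, C = 0 0 0 1 1 1, G = 0 3 3 0 2 1, T = 0 0 1 2 1 1; they are not claimed to be the sequences of the Hertz-Stormo figure.

```python
from collections import Counter

ALPHABET = "ACGT"

# Hypothetical alignment (assumption), picked to reproduce the counts above.
SEQS = ["AGGTGA", "AGGTGC", "AGGACG", "AATCTT"]


def alignment_matrix(seqs):
    """counts[letter][j] = number of sequences showing letter at column j."""
    columns = [Counter(s[j] for s in seqs) for j in range(len(seqs[0]))]
    return {a: [col[a] for col in columns] for a in ALPHABET}


def consensus(counts):
    """Majority letter per column; '?' when the maximum is not unique."""
    width = len(counts[ALPHABET[0]])
    out = []
    for j in range(width):
        best = max(counts[a][j] for a in ALPHABET)
        winners = [a for a in ALPHABET if counts[a][j] == best]
        out.append(winners[0] if len(winners) == 1 else "?")
    return "".join(out)


counts = alignment_matrix(SEQS)
for letter in ALPHABET:
    print(letter, counts[letter])
print(consensus(counts))
```

The last column is a four-way tie, which is why the consensus ends in "?".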
From Syntax to Stat, continued: Computing Weight Matrix
[Alignment matrix and consensus A G G T G ? (by majority rule), as on the previous slide]
Compute ln[ ((n_{i,j} + p_i) / (N + 1)) / p_i ] ≈ ln(f_{i,j} / p_i)
• n_{i,j} = number of times letter i is observed at the jth position of the alignment
• N = number of sequences = 4
• p_i = a priori probability of letter i (0.25 in the example)
• f_{i,j} = frequency of letter i at position j
This is like taking the ratio of the empirical frequencies, compensated by p_i to avoid infinity or zero, to the hypothetical probabilities or flat distribution (a popular measure among statisticians: how much the observed distribution deviates from chance)
From Syntax to Stat, continued: Weighing a Test Sequence
Weight Matrix ln(f_{i,j} / p_i), computed as ln[ ((n_{i,j} + p_i) / (N + 1)) / p_i ]
     1     2     3     4     5     6
A   1.2   0    -1.6   0    -1.6   0
C  -1.6  -1.6  -1.6   0     0     0
G  -1.6   .96   .96  -1.6   .59   0
T  -1.6  -1.6   0     .59   0     0
Test sequence: A G G T G C
From Syntax to Stat, continued: Weighing a Test Sequence
[Alignment matrix and weight matrix ln(f_{i,j} / p_i), as on the previous slide]
Test sequence A G G T G C: score = 4.3
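The weights and the score of the test sequence can be recomputed directly from the formula ln[((n_{i,j} + p_i)/(N + 1))/p_i]. In the sketch below the count matrix is one reading of the slide's alignment matrix (an assumption, since the extracted figure is incomplete), chosen because it reproduces the printed weights 1.2, .96, .59, 0, -1.6 and the score 4.3.

```python
from math import log

N = 4          # number of aligned sequences
PRIOR = 0.25   # a priori probability p_i of every letter

# Assumed reading of the alignment matrix (rows A, C, G, T; columns 1..6).
COUNTS = {
    "A": [4, 1, 0, 1, 0, 1],
    "C": [0, 0, 0, 1, 1, 1],
    "G": [0, 3, 3, 0, 2, 1],
    "T": [0, 0, 1, 2, 1, 1],
}


def weight(n_ij, n_seqs=N, p=PRIOR):
    """ln of the pseudocount-corrected frequency over the prior."""
    return log(((n_ij + p) / (n_seqs + 1)) / p)


def score(seq, counts=COUNTS):
    """Sum of the weights selected by the letters of seq, column by column."""
    return sum(weight(counts[letter][j]) for j, letter in enumerate(seq))


for n_ij in range(N + 1):
    print(n_ij, round(weight(n_ij), 2))
print(round(score("AGGTGC"), 1))
```

A count of 1 out of 4 gives weight exactly 0 because the corrected frequency then equals the prior, so the final C of the test sequence contributes nothing to the score.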
From Stat to Syntax: extracting a ``full consensus'' from sample (daf-19 binding sites in C. elegans, Peter Swoboda)
Genes: daf-19, che-2, osm-1, osm-6, F02D8.3 (axis: -150 to -1)
Site              Site with don't cares
GTTGTCATGGTGAC    GTT__CATGGT_AC
GTTTCCATGGAAAC    GTT_CCATGG_AAC
GCTACCATGGCAAC    G_T_CCATGG_AAC
GTTACCATAGTAAC    GTT_CCAT_GTAAC
GTTTCCATGGTAAC    GTT_CCATGGTAAC
Consensus at all costs generates monsters
Model: G_T__CAT_G__AC
Now the model also describes GATCCCATCGGAAC, which did not belong to the data
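The ``monster'' can be reproduced in a few lines. The sketch assumes that the model is the column-wise consensus of the five sites, with "_" wherever they disagree; the function names are illustrative.

```python
SITES = [
    "GTTGTCATGGTGAC",
    "GTTTCCATGGAAAC",
    "GCTACCATGGCAAC",
    "GTTACCATAGTAAC",
    "GTTTCCATGGTAAC",
]


def full_consensus(seqs):
    """Keep a column's letter only if every sequence agrees; else '_'."""
    return "".join(col[0] if len(set(col)) == 1 else "_" for col in zip(*seqs))


def matches(model, seq):
    """True if seq agrees with model on every solid (non '_') position."""
    return len(model) == len(seq) and all(
        m == "_" or m == c for m, c in zip(model, seq)
    )


model = full_consensus(SITES)
print(model)
print(matches(model, "GATCCCATCGGAAC"), "GATCCCATCGGAAC" in SITES)
print(4 ** model.count("_"), "strings described,", len(SITES), "observed")
```

With six don't-care positions the model describes 4^6 = 4096 strings, of which only five were in the sample.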
Episodes and extensible motifs (Mannila et al., 95; Das et al., 97)
Input: textstring and pattern string
Output: episode realization (quadratic worst-case)
[Figure label: Max 10 pos]
Extensible Motifs
Definition: extensible motifs are patterns which allow variable-length don't cares, e.g., Prosite F.....G-(2,4)G.H
• Note that the length of these patterns is variable
• High expressive power
• Huge pattern space
An Example from Prosite
Entry name: HIPIP
Accession number: PS00596
Description: High potential iron-sulfur proteins signature.
Pattern: C-(6,9)[LIVM]...G[YW]C..[FYW]
[Figures: PDB 1PIJ, PDB 1HLQ]
Extensible Motifs (Implications of Variable Gaps)
s = axbcaxxxbc, m = a-[1-3]bc at pos 1, 5 and 10
Main Issues
1) A location list corresponds to multiple patterns
   E.g., axbcpdaycbqd (at positions 1 and 7): m1 = a-[1-2]b-[1-2]d, m2 = a-[1-2]c-[1-2]d
2) Multiple occurrences at a location
   E.g., axbbxc (at position 1): m = a-[1-2]b-[1-2]c
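Both issues can be observed with a small enumerator. In the sketch below a motif is written as a list alternating solid blocks and (min, max) gap ranges; this representation and the function names are assumptions made for illustration. Every combination of gap lengths is tried at every start position.

```python
from itertools import product


def realizations(s, start, motif):
    """Yield each tuple of gap lengths for which motif occurs at 0-based start."""
    solids, gaps = motif[0::2], motif[1::2]
    for lengths in product(*(range(lo, hi + 1) for lo, hi in gaps)):
        pos, ok = start, True
        for idx, solid in enumerate(solids):
            if not s.startswith(solid, pos):
                ok = False
                break
            pos += len(solid)
            if idx < len(lengths):
                pos += lengths[idx]      # skip the don't-care run
        if ok:
            yield lengths


def occurrences(s, motif):
    """Map each 1-based start position to the list of its gap realizations."""
    found = {}
    for start in range(len(s)):
        hits = list(realizations(s, start, motif))
        if hits:
            found[start + 1] = hits
    return found


print(occurrences("axbcaxxxbc", ["a", (1, 3), "bc"]))
m1 = ["a", (1, 2), "b", (1, 2), "d"]
m2 = ["a", (1, 2), "c", (1, 2), "d"]
print(sorted(occurrences("axbcpdaycbqd", m1)), sorted(occurrences("axbcpdaycbqd", m2)))
print(occurrences("axbbxc", ["a", (1, 2), "b", (1, 2), "c"]))
```

On the ten-symbol string as extracted, the enumerator finds occurrences of a-[1-3]bc starting at positions 1 and 5. The second call shows two different motifs sharing the location list {1, 7}, and the third shows two gap assignments at the single location 1.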
Summary
• Form and Information
• To Classify and Generate
• Of Free Lunches, Ugly Ducklings, and Little Green Men
• Privileging Syntactic Information
• Avoidable and Unavoidable Regularities: Periods, Palindromes, Squares, etc.
• Theories Bigger than Life
• Motifs, Profiles and Weight Matrices
• The Emperor's New Map
Detection and Analysis of Gene Regulatory Regions (Jacques van Helden, http://copan.cifn.unam.mx/Computational_Biology/yeast-tools)
``Starting from the simple knowledge that a set of genes share some regulatory behavior, one can suppose that some elements are shared by their upstream region, and one would like to detect such elements. We implemented a simple and fast method to extract such elements, based on a detection of over-represented oligonucleotides.'' J. Mol. Biol. (1998) 281, 827-842.
Over-represented sequences in the 800 bps upstream segments of two families of co-regulated genes in yeast: superposition of circled words yields known motifs (TCACGTG, TCCGCGGA, AAAACTGTGG)
Question: how many of the 8-mers in a sequence 10^6 bases long could be surprisingly over-represented? How many k-mers in total, for all k?
http://www.ucmb.ulb.ac.be/bioinformatics/rsa-tools/
Index of /bioinformatics/rsa-tools/data/Escherichia_coli_K12/oligo-frequencies
Name                       Last modified        Size
1nt_non-coding_Esche..>    24-Dec-2001 06:56    1k
2nt_non-coding_Esche..>    24-Dec-2001 06:56    1k
3nt_non-coding_Esche..>    24-Dec-2001 06:56    2k
4nt_non-coding_Esche..>    24-Dec-2001 06:56    7k
5nt_non-coding_Esche..>    24-Dec-2001 06:56    26k
6nt_non-coding_Esche..>    24-Dec-2001 06:56    108k
7nt_non-coding_Esche..>    24-Dec-2001 06:56    434k
8nt_non-coding_Esche..>    24-Dec-2001 06:57    1.7M
dyads_3nt_sp0-20_non..>    24-Dec-2001 07:11    2.9M
Theories bigger than Life: Assume we wanted to build a statistical table counting occurrences of all surprising substrings in a genome
Q: How many distinct substrings in a string of n symbols? (1 ≤ i ≤ j ≤ n)
A: no more than (n × n)/2 (n ways to choose the beginning i, then n-i ways to choose the end j)
Theories bigger than Life: How many surprising substrings in a string of n symbols?
• Agree on a model for the source: e.g., the source emits symbols independently with identical distribution
• Agree on some measure of surprise: e.g., departure from expected number of occurrences exceeds a certain threshold
• For a given observed string of n symbols, how many substrings may turn out to be surprising?
A: possibly, all (n × n)/2 of them!
Z-scores as measures of surprise
Three easy conditions on surprise
• 1) always:
• 2) for absent words: (note asymmetry of surprise)
• 3) for over-represented words: (longer word = bigger surprise)
From 1-3 together
Monotony of Surprise
A score such that : will be called monotone
Main point
For many monotone scores where ``surprising''
DAWGs
The set of words reaching a node is a burst of consecutive suffixes of a same word
[Figure: a DAWG with arcs labeled A and T]
• Each state corresponds to a set of strings: the set of all strings that have occurrences ending at precisely the same positions in x
• The sequence of labels on each distinct path from source to sink spells a suffix of x
• |x| < Q ≤ 2|x| - 1 and |x| - 1 < E < 3|x| - 3, where Q is the number of states and E the number of edges
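The linear size of the DAWG can be observed with the standard online suffix-automaton construction, sketched below. This is a textbook construction used here as a stand-in; it is not claimed to be the construction used for the slides. Each state stores its outgoing arcs, a suffix link, and the length of the longest word reaching it.

```python
def build_dawg(x):
    """Return (arcs, link, length) of the suffix automaton of x; state 0 is the source."""
    arcs, link, length = [{}], [-1], [0]
    last = 0
    for ch in x:
        cur = len(arcs)
        arcs.append({})
        link.append(-1)
        length.append(length[last] + 1)
        p = last
        while p != -1 and ch not in arcs[p]:
            arcs[p][ch] = cur
            p = link[p]
        if p == -1:
            link[cur] = 0
        else:
            q = arcs[p][ch]
            if length[p] + 1 == length[q]:
                link[cur] = q
            else:
                # Split q: the clone keeps the shorter words of q's class.
                clone = len(arcs)
                arcs.append(dict(arcs[q]))
                link.append(link[q])
                length.append(length[p] + 1)
                while p != -1 and arcs[p].get(ch) == q:
                    arcs[p][ch] = clone
                    p = link[p]
                link[q] = clone
                link[cur] = clone
        last = cur
    return arcs, link, length


def count_distinct_substrings(x):
    """Each state v holds length[v] - length[link[v]] words (consecutive suffixes)."""
    arcs, link, length = build_dawg(x)
    return sum(length[v] - length[link[v]] for v in range(1, len(arcs)))


x = "abaababaababa"
arcs, link, length = build_dawg(x)
print(len(x), len(arcs), sum(len(a) for a in arcs), count_distinct_substrings(x))
```

The states partition all O(n^2) substrings into O(n) classes, and each class is a run of consecutive suffixes of its longest word.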
DAWGs
With monotone scores, it suffices to publish scores only at the longest word in each one of the O(n) equivalence classes
(Often, however, we still need to compute all O(n^2) scores)
The Size of Tables for Substring Statistics
Substring Statistics with Suffix Trees
A partial view (all suffixes starting with ``a'') of the weighted suffix tree for the string x = abaababaababa: the weight of each internal node reports the number of (possibly overlapping) occurrences in x of the substring having locus at that node.
• 1) Counts do not change along an arc
• 2) If aw ends at a node, so does w (suffix links)
A. Apostolico, ``The Myriad Virtues of Suffix Trees'', in Combinatorial Algorithms on Words, A. Apostolico and Z. Galil, eds., Springer, 1985
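The node weights can be cross-checked without a suffix tree. The sketch below counts the (possibly overlapping) occurrences of every substring of x = abaababaababa by brute force; it is a quadratic stand-in for the weighted tree, with an illustrative function name.

```python
from collections import Counter


def substring_counts(x):
    """Number of (possibly overlapping) occurrences of every substring of x."""
    counts = Counter()
    for i in range(len(x)):
        for j in range(i + 1, len(x) + 1):
            counts[x[i:j]] += 1
    return counts


counts = substring_counts("abaababaababa")
for w in ["a", "ab", "aba", "abaab"]:
    print(w, counts[w])
```

In this string every occurrence of `ab` is followed by `a`, so `ab` and `aba` have the same count: the count does not change while moving along the arc that spells them.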
Detecting Squares with Suffix Trees
There is a square iff there is a node with two consecutive leaves in its subtree too close for comfort.
14 - 12 = 2 < 3 = |aba|
(A. Apostolico & F. P. Preparata, 83)
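The ``too close for comfort'' criterion says that two occurrences of a word w at distance d ≤ |w| force a square of period d. The sketch below checks this on plain strings, with a brute-force square test for comparison; it does not build a suffix tree, and all names are illustrative.

```python
def occurrences(x, w):
    """0-based start positions of w in x, overlaps included."""
    return [i for i in range(len(x) - len(w) + 1) if x.startswith(w, i)]


def square_from_close_occurrences(x, w):
    """Return a square implied by two consecutive occurrences of w at distance <= |w|."""
    starts = occurrences(x, w)
    for i, j in zip(starts, starts[1:]):
        d = j - i
        if d <= len(w):
            # x[i:i+d] and x[i+d:i+2d] are both the length-d prefix of w.
            return x[i:i + 2 * d]
    return None


def has_square(x):
    """Brute-force test: is there a substring of the form uu with u nonempty?"""
    n = len(x)
    return any(
        x[i:i + d] == x[i + d:i + 2 * d]
        for i in range(n)
        for d in range(1, (n - i) // 2 + 1)
    )


print(square_from_close_occurrences("xababay", "aba"))
print(has_square("xababay"), has_square("abaxaba"))
```

In `xababay` the word `aba` occurs at distance 2 < 3 = |aba|, which yields the square `abab`; in `abaxaba` the two occurrences are 4 apart and no square exists.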
Combining Saturation and Monotony of Scores over ST Arcs yields Surprising Solid Words in Linear Time and Space
Verbumculus (AA, Bock, Gong, Lonardi, Xu, JCB 2000, JCB 2003, Recomb 2003, ..)
• Based on suffix tree and i.i.d. model
• Partitions the O(n^2) substrings into O(n) ``equivalence classes of monotone score'', then computes expected frequencies, variances and scores for the most surprising word in each class in time O(n) overall.
• For any word v without a score, there is a scored extension vy which is at least equally surprising.
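One simple instance of the last point: take as score the difference between the observed count and the i.i.d. expectation. Extending a word without losing occurrences keeps the count and can only lower the expectation, so the extension scores at least as high. The sketch assumes this difference score and uniform symbol probabilities; Verbumculus itself supports other scores, and the names below are illustrative.

```python
def count(x, w):
    """Observed number of (possibly overlapping) occurrences of w in x."""
    return sum(x.startswith(w, i) for i in range(len(x) - len(w) + 1))


def expected(n, w, p):
    """Expected number of occurrences of w in an i.i.d. string of length n."""
    e = n - len(w) + 1
    for c in w:
        e *= p[c]
    return e


def delta(x, w, p):
    """Difference score: observed count minus expected count."""
    return count(x, w) - expected(len(x), w, p)


x = "abaababaababa"
p = {"a": 0.5, "b": 0.5}     # assumed symbol probabilities
for w in ["ab", "aba"]:
    print(w, count(x, w), expected(len(x), w, p), delta(x, w, p))
```

Here `ab` and `aba` both occur 5 times, so only the longer word needs a published score: it is the more surprising of the two.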
Z-scores and measures of surprise
Main point
For any measure of surprise where and conditions 1-3 are satisfied: ``surprising''
Exercise: i.i.d. variables
We are interested in the expected number of occurrences of y in X, and the corresponding variance.
Over- and Under-represented words: Z-Scores !@#&!!$!!
Under the Hood: Periods and Variance
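The slide's formulas are not reproduced here, so the following is a sketch of the standard second-moment computation under the i.i.d. assumption rather than the author's own expressions. With Z the number of occurrences of y (|y| = m) in a random string of length n, E[Z] = (n - m + 1) p̂ where p̂ is the product of the symbol probabilities of y. Two occurrences at distance d < m can coexist only when d is a period of y, which is where the periods enter the variance. All function names are illustrative.

```python
from math import prod, sqrt


def periods(y):
    """All d in 1..len(y)-1 such that y agrees with itself shifted by d."""
    m = len(y)
    return [d for d in range(1, m) if y[d:] == y[:m - d]]


def expected_count(y, n, p):
    """E[Z] for the number Z of occurrences of y in an i.i.d. string of length n."""
    return (n - len(y) + 1) * prod(p[c] for c in y)


def variance_count(y, n, p):
    """Var[Z] = sum of the indicator variances plus twice the covariances."""
    m = len(y)
    positions = n - m + 1
    p_hat = prod(p[c] for c in y)
    var = positions * p_hat * (1.0 - p_hat)
    own_periods = set(periods(y))
    for d in range(1, min(m, positions)):
        # Overlapping occurrences at distance d need d to be a period of y;
        # the union then spells y followed by its last d symbols.
        joint = p_hat * prod(p[c] for c in y[m - d:]) if d in own_periods else 0.0
        var += 2 * (positions - d) * (joint - p_hat * p_hat)
    return var


def z_score(observed, y, n, p):
    """Standardized departure of the observed count from its expectation."""
    return (observed - expected_count(y, n, p)) / sqrt(variance_count(y, n, p))


p = {"a": 0.5, "b": 0.5}     # assumed symbol probabilities
print(periods("abaab"), periods("aaaa"))
print(expected_count("aba", 13, p), variance_count("aba", 13, p))
print(z_score(5, "aba", 13, p))
```

Words with many periods (such as runs of one symbol) have larger variance than aperiodic words of the same probability, so the same excess count is less surprising for them.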
Verbumculus + Dot on first 512 bps of Yeast Mitochondrial DNA
Counting occurrences of gagga in HSV1
Alternate Counting
Counting occurrences of ccgct in HSV1
Alberto Apostolico – Cinzia Pizzi – Giorgio Satta
Dyads Detection in Biology
[Figure: the dyad ACCG ... TAAG]
Dyads are the composition of two solid components separated by a variable gap
Part of Speech Tagging in NLP
• Automatic tagging: from a set of correctly classified examples, infer rules, then classify new texts (TEXT + CORRECT CLASSIFICATION = Rules)
• Example text: ``Although preliminary findings were reported more than a year ago, the latest results appear…''
• Example classification: IN JJ NNS VBD VBN RBR IN DT NN IN , DT JJS NNS VBP ...
• Drawback: ambiguity. A limited-size context centered on a word can fail to give a unique tag assignment (NN or JJ?)
• Possible solution: barriers (B1 NN, B2 JJ) for tagging disambiguation
Goal: efficient counting of subword co-occurrences within distance d, with no interleaving occurrences of one or the other
Notation
• X is a string of n symbols over the alphabet
• d is a fixed non-negative integer
• y and z are subwords of X
• Tandem Index I(y, z): the number of times that z has a closest occurrence within distance d from a corresponding closest occurrence of y to its left
• Relaxed Tandem Index Î(y, z): all the occurrences of z within distance d are counted
Key Observation
In principle there are O(n^2) substrings in X, and thus O(n^4) distinct pairs of substrings; however, it suffices to consider a family containing only O(n^2) pairs. For any neglected pair (y', z') there is a pair (y, z) in the family such that: (i) y' and z' are prefixes of y and z, respectively, and (ii) the tandem index of (y', z') equals the tandem index of (y, z).
Result: an O(n^2) algorithm for building a tandem index table (previous results: O(n^3) [Arimura et al., Wang et al.], for the case of two words in a generalized version of the problem)
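A naive reading of the two indices can be written down directly. The sketch below rests on two loudly stated assumptions that the slide does not settle: distance is measured between starting positions, and a ``closest'' pair is an occurrence of y immediately followed (among all occurrences of y and z, in position order) by an occurrence of z. It is a quadratic baseline for one pair of words, not the O(n^2) table-building algorithm.

```python
def starts(x, w):
    """0-based start positions of w in x, overlaps included."""
    return [i for i in range(len(x) - len(w) + 1) if x.startswith(w, i)]


def relaxed_tandem_index(x, y, z, d):
    """Count every pair (occurrence of y, later occurrence of z) at distance <= d."""
    return sum(1 for i in starts(x, y) for j in starts(x, z) if 0 < j - i <= d)


def tandem_index(x, y, z, d):
    """Count y-occurrences directly followed by a z-occurrence within distance d.

    Assumption: 'no interleaving' means no other start of y or z lies between.
    """
    events = sorted(
        [(i, "y") for i in starts(x, y)] + [(j, "z") for j in starts(x, z)]
    )
    total = 0
    for (i, a), (j, b) in zip(events, events[1:]):
        if a == "y" and b == "z" and 0 < j - i <= d:
            total += 1
    return total


x = "aabab"      # made-up input
print(tandem_index(x, "a", "b", 2), relaxed_tandem_index(x, "a", "b", 2))
```

On this input the relaxed index also counts the pair formed by the first `a` and the first `b`, which the stricter index rejects because another `a` lies in between.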
Towards a theory of saturated motifs: here a motif is a recurring pattern with some solid and some ``don't care'' characters, together with its set of occurrences
[Figure: an example motif over a DNA textstring, with its solid characters and ``don't care'' characters labeled]
PROBLEM
Input: textstring
Output: repeated motifs
Is motif discovery still beset by the circumstance that typically there are exponentially many candidate motifs in a sequence?
Controlling Motif Growth: Irredundant Motifs (L. Parida)
A motif is
• maximal in composition if specifying more solid characters implies an alteration to its occurrence list
• maximal in length if making the motif longer implies an alteration to the cardinality or displacement of its occurrence list
A maximal motif such that the motif and its list can be inferred from studying other motifs is redundant
Maximal, Redundant, Irredundant Motifs (examples, cont.)
Let s = aaXtaYgZZZaaVtaWcXXXXaaYtgXc
m_1 = aa.t with L_1 = {1, 11, 22}
m_2 = aa.ta with L_2 = {1, 11}
m_3 = aa.t..c with L_3 = {11, 22}
m_1 = aa.t is redundant, since 1) m_1 is a sub-motif of m_2 and of m_3, and 2) L_1 is the union of L_2 and L_3.
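The location lists in this example can be recomputed mechanically. The sketch below matches rigid motifs, with "." as the don't-care symbol, against s = aaXtaYgZZZaaVtaWcXXXXaaYtgXc. It takes m_3 as aa.t..c, with two don't cares before the final c, since that is the spelling that yields the list {11, 22} on this string. Function names are illustrative.

```python
S = "aaXtaYgZZZaaVtaWcXXXXaaYtgXc"


def occurs_at(s, motif, i):
    """True if motif matches s at 0-based position i ('.' matches any symbol)."""
    if i + len(motif) > len(s):
        return False
    return all(m == "." or m == c for m, c in zip(motif, s[i:]))


def location_list(s, motif):
    """Set of 1-based positions where motif occurs in s."""
    return {i + 1 for i in range(len(s)) if occurs_at(s, motif, i)}


L1 = location_list(S, "aa.t")
L2 = location_list(S, "aa.ta")
L3 = location_list(S, "aa.t..c")
print(sorted(L1), sorted(L2), sorted(L3))
print(L1 == L2 | L3)
```

Since aa.t is a sub-motif of both longer motifs and its list is exactly the union of theirs, publishing m_2 and m_3 with their lists makes m_1 recoverable.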
Controlling Motif Growth: HOW MANY Irredundant Motifs
Recall that a motif is
• maximal in composition if specifying more solid characters implies an alteration to its occurrence list
• maximal in length if making it longer implies an alteration to the cardinality of its occurrence list
A maximal motif such that the motif and its list can be inferred from studying other motifs is redundant
A motif that occurs at least k times in the textstring is a k-motif
Theorem: In any textstring x the number of irredundant 2-motifs is O(|x|)
(PROBLEM: How to find irredundant motifs as fast as possible)
Suffix Consensus, Suffix Meet
[Figure: the suffixes suf_1 = s and suf_4 of a string, overlaid position by position]
The consensus of suf_1 and suf_4 is not a motif
The meet of suf_1 and suf_4 is a maximal motif
Theorem: Every irredundant 2-motif of x is the meet of two suffixes of x
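A sketch of the two operations, under the usual reading that the consensus of two suffixes keeps the symbols on which they agree and writes a don't care elsewhere, and that the meet is the consensus with leading and trailing don't cares removed. This reading and the function names are assumptions; the slide's own example string is not recoverable, so the string of the previous example is reused instead.

```python
S = "aaXtaYgZZZaaVtaWcXXXXaaYtgXc"


def suffix(s, i):
    """The suffix of s starting at 1-based position i."""
    return s[i - 1:]


def consensus(u, v):
    """Overlay u and v: keep agreeing symbols, write '.' on disagreements."""
    return "".join(a if a == b else "." for a, b in zip(u, v))


def meet(u, v):
    """Consensus stripped of its leading and trailing don't cares."""
    return consensus(u, v).strip(".")


print(consensus(suffix(S, 10), suffix(S, 21)))
print(meet(suffix(S, 10), suffix(S, 21)))
print(meet(suffix(S, 11), suffix(S, 22)))
```

The overlay of suf_10 and suf_21 starts with a don't care, so it is not itself a motif; trimming it gives the same motif as the meet of suf_11 and suf_22.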
Approximate Patterns: Finding surprising substrings with mismatches
• Input: a sequence or set of sequences, integers m and k
• Output: all substrings of length m that occur unusually often as replicas of the same pattern with up to k mismatches
• NOTE: the pattern might never occur exactly in the input
How many patterns should one try?
Problem Statement
• Given a source text X and an error threshold k, extract substrings of X that occur unusually often in X within k substitutions or mismatches.
• Measure of surprise: compare counts with expectations
Subproblem: Compute Expected Frequencies under I.I.D. Distribution
Two results for expected frequencies
• O(nk) preprocessing of the text, then report the expected frequency for any substring in O(k^2)
• Report the expected frequency of all substrings of a given length in O(nk)
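For a single word the expectation has a direct dynamic program, sketched below as a baseline. It takes O(mk) time per word and is not the O(nk)-preprocessing scheme quoted on the slide. Under the i.i.d. model, position j of a random window matches y with probability p[y_j], so the number of mismatches is a sum of independent indicators whose distribution can be built symbol by symbol. Names and probabilities are illustrative.

```python
def prob_within_k(y, k, p):
    """P(a random i.i.d. window of length |y| is within k mismatches of y)."""
    dist = [1.0] + [0.0] * k          # dist[j] = P(exactly j mismatches so far)
    for c in y:
        new = [0.0] * (k + 1)
        for j, mass in enumerate(dist):
            new[j] += mass * p[c]                 # this position matches
            if j + 1 <= k:
                new[j + 1] += mass * (1.0 - p[c])  # this position mismatches
        dist = new
    return sum(dist)


def expected_with_mismatches(y, n, k, p):
    """Expected number of windows of a length-n text within k mismatches of y."""
    return (n - len(y) + 1) * prob_within_k(y, k, p)


p = {"a": 0.25, "c": 0.25, "g": 0.25, "t": 0.25}   # assumed probabilities
for k in range(3):
    print(k, expected_with_mismatches("acgt", 1000, k, p))
```

With k = 0 the result reduces to the exact-match expectation (n - m + 1) p̂, and with k = m every window counts.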
JACM 50, 1, January 2003, pp. 25-26. Special Issue: Problems for the Next 50 Years (page 1, paper 1, problem 1)
``Shannon and Weaver performed an inestimable service by giving us a definition of information and a metric for information as communicated from place to place. We have no theory however that gives us a metric for the information embodied in structure... this is the most fundamental gap in theoretical underpinning of information and computer science. A young information theory scholar willing to spend years on a deeply fundamental problem need look no further.''
Frederick P. Brooks, Jr., ``Three Great Challenges for Half-Century-Old Computer Science''