Finding repeat pattern in human genome by TEIRESIAS

Finding repeat pattern in human genome by TEIRESIAS algorithm Xiaojun Hu

Human Genome

TEIRESIAS Algorithm

Define Pattern Σ(Σ{‘. ’} * )Σ Σ: the alphabet of residues, like ACTG in DNA, the set of all amino acids ‘. ’ : wild card character Pattern: Any string that begins and ends with a residue, and contains an arbitrary combination of residues and ‘. ’ characters

Define Pattern • The pattern is defined by three parameters • L: the minimal number of literals • W: the pattern length • K: the minimal repeat times • Example: • Pattern: A. CH. . E L = 4, W= 7 <4, 7> pattern

Define Pattern • P: pattern ‘A. CH. . E’ • S: a set of sequences S={LFAADCHFFEDTR, LKLALCHESESDR, AFAGCHADELFT} • Ls(P) = {(i, j) | sequence si matches P at offset j} Ls(‘A. CH. . E’) = {(1, 4), (2, 4), (3, 3)}

Problem Definition • Given a set S={s 1, s 2, …, sn} of input sequences and parameters L, W, K, find all maximal <L, W> patterns that have support at least K repeat times.

TEIRESIAS Algorithm • The algorithm contains two steps: Scanning and convolution • 1. Scanning the sequences in the input set S and locating all elementary patterns with support at least K. • 2. Combining together the elementary pattern to recover the maximal pattern

Scanning Process

Convolution Process

Convolution Process • Prefix(P) is the prefix sub pattern of P that has exactly (L-1) residues Example: Prefix(‘F. ASTS’) =‘F. A’ if L = 3 • Suffix(P) is the suffix subpattern of P with exactly (L-1) residues Example: Prefix(‘F. A…S’) =‘A. . S’ if L = 3

Convolution Process • P, Q are the arbitrary patterns with at least L • • • residues each R denotes a new pattern Q’ denotes what remains of Q after the prefix(Q) is thrown away Φ denotes the empty string

Convolution Process • Forward extension • If Suffix(P) = Prefix(Q), P is extended to right with Q’(Pattern Q leftover except prefix) Example: ‘F. AS’ convolute ‘AST’ to form ‘F. AST’ Backward extension If Prefix(P) = Suffix(Q), P is extended to the left with Q’(Pattern Q leftover except suffix) Example: ‘F. AS’ convolute ‘TF. A’ to form ‘TF. AS’

Application 1: Finding repeat pattern in Chr. M • Define the elementary pattern: L=W=5 • Scan the Chr. M sequence Option 1. Generate all possible elementary patterns (4^5 = 1024), implement in R Option 2. Use Hash function to count elementary patterns (1017), implement in Perl

Application 1: Finding repeat pattern in Chr. M

Application 1: Finding repeat pattern in Chr. M • Scan the sequence Save the pattern Offset (index) Chr. M: G 1 A 2 T 3 CACAGGTCT … Pattern: GATCA Offset: 1 Pattern: ATCAC Offset: 2 Pattern: TCACA Offset: 3

Application 1: Finding repeat pattern in Chr. M • Convolution Simple repeat pattern offset example: Repeat Pattern 1: G 1 ATCACAG Offset: 1 Repeat Pattern 2: G 10 ATCACAG Offset: 10 Repeat Pattern 3: G 20 ATCACAG Offset: 20 Elementary pattern and their offset: GATCA 1 10 20 ATCAC 2 11 21 TCACA 3 12 22 CACAG 4 13 23 From the offset, if the pattern forward extend 1 base, the offset increase 1

Application 1: Finding repeat pattern in Chr. M Repeat Pattern: GATC 4 ACAG Elementary pattern and their offset: CACAG 4 13 23 TCACA 3 12 22 ATCAC 2 11 21 GATCA 1 10 20 From the offset, if the pattern backward extend 1 base, the offset decrease 1

Application 1: Finding the repeat pattern in Chr. M • Convolution Pattern: CACAACAA, length 8, repeat 5 times, offset: 6399 7724 12487 12930 15824 L=W=5, K=5 Elementary patterns and their offset: ACAAC 492 1712 3303 3400 3445 3806 4040 4073 5057 5396 6065 6197 6400 7185 7725 8643 8709 8768 9198 9668 10040 10134 10300 10839 10897 10900 10944 11263 11335 11997 12031 12178 12250 12488 12545 12554 12741 12858 12931 13380 13746 13775 14331 14478 14691 15364 15680 15767 15825 16079 CAACA 1110 1122 2214 2399 2430 2969 3304 3762 3807 3940 4041 4074 4698 5581 6198 6401 6549 6598 7186 7726 8347 8641 8663 8710 9199 9354 9560 9791 10041 10086 10617 10716 10780 10835 10898 11264 11333 11954 11998 12032 12179 12251 12489 12829 12932 13747 13776 13861 13906 14188 14651 14822 15121 15681 15826 16077 16281 Forward extension: Offset difference = pattern length (5) – pattern suffix length (4) = 1 Repeat position (repeat times) =20 >5 , accept New pattern: ACAACA (elementary pattern ACAAC forward extends 1 base A) New Offset: 3303 3806 4040 4073 6400 7185 7725 8709 9198 10040 10897 12031 12178 12250 12488 12931 13746 13775 15680 15825

Application 1: Finding the repeat pattern in Chr. M • Convolution patterns and their offset: ACAACA 3303 3806 4040 4073 6400 7185 7725 8709 9198 10040 10897 12031 12178 12250 12488 12931 13746 13775 15680 15825 AACAA 240 283 362 646 1123 2400 2766 2920 2970 3302 3721 4039 4699 5173 5287 5395 6196 6402 7727 8508 8642 8664 8691 9667 10299 10420 10781 10896 10899 10943 11334 11573 12030 12249 12416 12490 12553 12740 12933 13388 13395 13745 13774 13777 13862 14029 14189 14193 14573 14658 15363 15603 15679 15682 15827 15867 16078 16282 Forward extension: Offset difference = new pattern length (6) – pattern suffix length (4) = 2 Repeat position (repeat times) =8 >5 , accept New pattern: ACAACAA (pattern ACAAC forward extends 1 base A) New Offset: 6400 7725 10897 12488 12931 13775 15680 15825

Application 1: Finding the repeat pattern in Chr. M • Convolution patterns and their offset: ACAACAA 6400 7725 10897 12488 12931 13775 15680 15825 CACAA 1011 1052 1619 3156 3805 3810 3991 4072 5280 5680 5938 6399 7184 7724 8053 8456 8563 8708 8767 8916 9106 10133 10838 10985 11262 11357 11996 12001 12217 12487 12544 12930 13218 13379 13546 13942 14330 14624 15419 15581 15824 16431 Backward extension: Offset difference = pattern length (5) – pattern suffix length (4) = 1 Repeat position (repeat times) =5 >=5 , accept New pattern: CACAACAA (pattern ACAACAA backward extends 1 base C) New Offset: 6399 7724 12487 12930 15824 • Forward and backward extension finished, repeat times >=K , report it

Application 1: Finding the repeat pattern in Chr. M Pattern report is saved in the Chr. M_pattern. txt file. Here is a part of result(L=5, K=10). Pattern Length Repeat times Positions CACAACA 7 11 3805 4072 6399 7184 7724 8708 11262 11996 12487 12930 15824 ACCCCAT 7 10 535 638 4391 5437 6584 7491 13692 14340 14629 14814 CCCACTA 7 14 15 110 2282 3739 4195 5705 7113 7403 9058 11216 11267 12361 13624 13681 CACCCTA 7 10 1086 2219 11816 11871 12970 13147 13882 14391 15701 16268 CCCCCAT 7 12 4880 7956 8029 8407 11428 12085 12238 12387 12433 14426 14589 16191 CCCTCTA 7 10 726 1192 2068 3513 8275 8284 8608 9086 11980 13785 CCTAGCA 7 11 4198 6357 9061 11744 12988 13132 13306 13468 13925 15320 15647 CCTCCTA 7 12 3322 3610 4168 9279 9729 10092 10952 12095 12985 13990 15335 15491 AGCCCTA 7 10 1098 6924 10290 10353 10695 11969 13069 13558 13837 15317

Application 1: Finding the repeat pattern in Chr. M

Application 2: Finding the repeat pattern in Chr 1 • Define the elementary pattern: • • L=W=17 Scan the Chr 1 sequence Problem: 1. Ch 1 file size: 252, 194, 721 bytes, very large, can’t been read by R 2. All possible elementary patterns (4^17 = 17, 179, 869, 184), over the R memory limitation

Application 2: Finding the repeat pattern in Chr 1 • Solution: Use another computer language, like C or perl The C program gcat and the scanning result were got from Dr. Steven. Parallel program were used for running the program. The elementary patterns over 2000 repeat times are used for convolution. I also wrote perl script: scan. pl Usage: perl scan. pl <chr file> <output file> <L value>

Application 2: Finding the repeat pattern in Chr 1 • Convolution Problem: The index file is too large, it can’t been read by R My Solution: perl convolute. pl <index file> <output file> <pattern length(L)> <Repeat times(K)> It’s working, but still very slow. The PVM (parralle virtual machine) techniques should be used for increasing speed.

Application 2: Finding the repeat pattern in Chr 1 Some of Pattern Found in the Chr 1 Pattern length 45 Repeat times ATCCTCCCACCTCAGCCTCCCAAAGTGCTGGGATTACAGG 40 42 CAATCTCAGCTCACTGCAACCTCCCAGGTTCAAG 40 49 Pattern ATCTCAGCTCACTGCAACCTCTGCCTCCCGGGTTCAAGCGATTCT GCTCACTGCAACCTCCGCCTCCCAGGTTCAAGCAATTCTC 40 40 108 AAAAAAAAAAAAAAAAAAC 36 45 CCCACCTCAGCCTCCCAAAGTGCTGGGATTATAGG 35 66 AAAAAAAAAAAAAAAATTA 34 42 AAAAAAAAAAAAAAAAGC 34 42 CCCACCTCAGCCTCCCAAAGTGCTGAGATTACAG 34 54 CAATCTCAGCTCACTGCAACCTCCTGGG 34 62

• There are two more patterns, namely, AGCAATTCTCCTGCCTCA: 2858 CTCCTGCCTCAGCCTCC: 14168 can convolute with gctcactgcaacctccgcctcccaggttcaagcaattctc : 108 to form the maximal pattern GCTCACTGCAACCTCCGCCTCCCAGGTTCAAGCAA TTCTCCTGCCTCAGCCTCC : 46

Application 2: Finding the repeat pattern in Chr 1 • Discussion 1. For the human genome pattern discovery, the high performance computer and parallel program should be used. R is a not good tools for solving this problem. 2. The algorithm should be carefully implemented for the better performance, like reducing the loop time.

Reference • Isodore Rigoutsos, Aris Floratos Combinatorial • pattern discovery in biological sequences: the TEIRESIAS algorithm, Bioinformatics, 1998, Vol(14) no. 1: 55 -67 Isodore Rigoutsos et al. Short blocks from the noncoding parts of the human genome have instances within nearly all known genes and relate to biological process, PNAS, 2006, Vol(103) no. 17: 6605 -6610