RNA Secondary Structure Prediction Introduction • RNA is a single-stranded chain of the nucleotides A, C, G, and U. The string of nucleotides specifies the linear structure of the RNA strand. • When RNA folds, complementary nucleotides form base pairs (C-G and A-U). • The tertiary (three-dimensional) structure is too complicated for us to calculate. • We calculate only secondary structures, that is, lists of base pairs. • Knowing the base pairs tells us a lot about the three-dimensional structure.

Chemical Structure of RNA • Four base types. • Distinguishable ends.

Partial Tertiary Structure • One illustration

Yet Another Tertiary Structure • Found via google

Our Final Tertiary Picture • Very complex

A Partial RNA Secondary Structure

Pure Secondary Structure

Our Basic Model • RNA linear structure: R = r_1 r_2 ... r_n, with each r_i from {A, C, G, U}. • RNA secondary structure: a set of pairs (r_i, r_j) such that 0 < i < j < n+1. • Goal: secondary structures with minimum free energy.

Implementing Model Restrictions • No knots: no two pairs (r_i, r_j) and (r_k, r_l) such that i < k < j < l. (Real RNA does contain knots.) Enforced by the program loop structure. • No “close” base pairs: j - i > t for some t > 0. Enforced by assigning a high free energy. • Complementary base pairs only: A-U, C-G. Enforced by assigning a high free energy.

Our Two Algorithms • Independent base pairs – quite easy, but inaccurate. • Calculate loops’ free energy – best we can do for today’s class.

Independent Base Pair Algorithm • Assumption: Independent base pairs. • Advantage 1: Simpler calculations. • Advantage 2: Illustrates ideas for a much more accurate algorithm. • Disadvantage: Unrealistic answers.

Independent Base Pairs What Makes It “Easy”? • Assumption: The energy of each base pair is independent of all of the other pairs and the loop structure. • Consequence: Total free energy is the sum of all of the base pair free energies.

Independent Base Pairs Basic Approach • Use solutions for smaller strings to determine solutions for larger strings. • This is precisely the kind of decoupling required for dynamic programming algorithms to work.

Independent Base Pairs Notation • a(r_i, r_j) – the free energy of a base pair joining r_i and r_j. • S_{i,j} – the secondary structure of the RNA strand from base r_i to base r_j, i.e., the set of base pairs between r_i and r_j inclusive. • E(S_{i,j}) – the free energy associated with the secondary structure S_{i,j}. • We define a(r_i, r_j) to be large when the model constraints are violated.
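
The pair-energy function can be sketched as follows. This is only an illustration: the numeric energies, the threshold t = 3, and the name pair_energy are assumptions, since the slides give no values. Positions are 0-based here, whereas the slides count from 1.

```python
# Hypothetical values: the slides do not give numeric energies.
LARGE = float("inf")   # "large" energy for pairs that violate a constraint
MIN_SEPARATION = 3     # assumed threshold t in the rule j - i > t
PAIR_ENERGY = {
    frozenset("AU"): -2.0,  # assumed energy of an A-U pair
    frozenset("CG"): -3.0,  # assumed energy of a C-G pair
}


def pair_energy(seq, i, j, t=MIN_SEPARATION):
    """Return a(r_i, r_j) for 0-based positions i < j of seq."""
    if j - i <= t:
        return LARGE  # bases too close together
    # Non-complementary bases are not in the table, so they also get LARGE.
    return PAIR_ENERGY.get(frozenset((seq[i], seq[j])), LARGE)
```

Returning an infinite energy for forbidden pairs means a minimization never selects them, which is one way to realize "large when constraints are violated".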

Independent Base Pairs: Calculating Free Energy • Consider the RNA strand from position i to j. • Consider whether r_j is paired. • If r_j is paired, E(S_{i,j}) = E(S_{i,k-1}) + a(r_k, r_j) + E(S_{k+1,j-1}) for some k with i-1 < k < j. • If r_j isn’t paired, then E(S_{i,j}) = E(S_{i,j-1}).

Independent Base Pairs - Algorithm • We search for intervals with minimum free energy. • For each interval, the free energy is given by this formula: E(S_{i,j}) = min( E(S_{i+1,j-1}) + a(r_i, r_j), E(S_{i,k-1}) + a(r_k, r_j) + E(S_{k+1,j-1}) for i-1 < k < j+1 ) • The free energy of the RNA strand is E(S_{1,n}).
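
A minimal sketch of the table fill, under stated assumptions: every allowed pair contributes -1, the separation threshold is t = 3, and positions are 0-based. The names a and fill_table are hypothetical. The unpaired case is written explicitly as E(S_{i,j-1}), and the k = i term covers r_i paired with r_j.

```python
INF = float("inf")
ALLOWED = {("A", "U"), ("U", "A"), ("C", "G"), ("G", "C")}


def a(seq, k, j, t):
    """Assumed pair energy: -1 for an allowed pair, infinite otherwise."""
    return -1.0 if j - k > t and (seq[k], seq[j]) in ALLOWED else INF


def fill_table(seq, t=3):
    """Return E, where E[i][j] is the minimum free energy of seq[i..j]."""
    n = len(seq)
    E = [[0.0] * n for _ in range(n)]

    def e(i, j):
        # Empty intervals (j < i) contribute zero energy.
        return E[i][j] if i <= j else 0.0

    # Advance from shorter to longer intervals.
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length - 1
            best = e(i, j - 1)  # r_j is not paired
            for k in range(i, j):  # r_j is paired with r_k
                best = min(best, e(i, k - 1) + a(seq, k, j, t) + e(k + 1, j - 1))
            E[i][j] = best
    return E
```

For a strand such as "GGGAAAACCC", the entry E[0][len(seq) - 1] holds the energy of the whole strand, corresponding to E(S_{1,n}) in the slides' 1-based notation.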

Independent Base Pairs: Question 1 • How does this formula deal with the case where r_j isn’t paired with any base? • It is a special case of E(S_{i,k-1}) + a(r_k, r_j) + E(S_{k+1,j-1}), i-1 < k < j+1. • The special case with k = j, which reduces to E(S_{i,j-1}) when the k = j pair term and the empty interval S_{j+1,j-1} are both taken to contribute zero.

Independent Base Pairs: Question 2 • What is the high level algorithm flow? 1. Advance from smaller to larger intervals, calculating free energy costs. 2. Trace back the path that corresponds to the minimum free energy cost.

Independent Base Pairs: Question 3 • In what orders can the intervals’ free energy costs be evaluated? 1. Major = lower, minor = upper bound. 2. Major = upper, minor = lower bound. 3. Diagonally. 4. Any order (e.g., random) that respects the partial order induced by inclusion.
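
The orders above can be sketched as generators of (i, j) intervals, together with a check that every sub-interval is visited before the interval containing it. Function names are hypothetical and positions are 0-based. The loop directions (descending lower bound, ascending upper bound) are an assumption: the slides name the loop nesting but not the direction.

```python
def lower_major(n):
    """Major loop on the lower bound i (descending), minor on the upper bound j."""
    return [(i, j) for i in range(n - 1, -1, -1) for j in range(i, n)]


def upper_major(n):
    """Major loop on the upper bound j (ascending), minor on the lower bound i."""
    return [(i, j) for j in range(n) for i in range(j, -1, -1)]


def diagonal(n):
    """Visit intervals by increasing length d + 1."""
    return [(i, i + d) for d in range(n) for i in range(n - d)]


def respects_inclusion(order):
    """True if every proper sub-interval appears before its enclosing interval."""
    seen = set()
    for i, j in order:
        for lo in range(i, j + 1):
            for hi in range(lo, j + 1):
                if (lo, hi) != (i, j) and (lo, hi) not in seen:
                    return False
        seen.add((i, j))
    return True
```

An ascending major loop on the lower bound fails this check, because interval (0, 1) would be visited before its sub-interval (1, 1).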

Independent Base Pairs: Question 4 • What are the time and storage requirements of this algorithm? • Express your answer in terms of the number of bases in the RNA strand. • Since the number of intervals is quadratic, the storage requirements are quadratic. • Since the time requirement for each interval is linear, total time is cubic.
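
As a worked count, assuming the work for interval (i, j) is proportional to its length j - i + 1, the quadratic and cubic bounds follow from:

```latex
\[
\#\{(i,j) : 1 \le i \le j \le n\} = \frac{n(n+1)}{2} = O(n^2),
\]
\[
\sum_{1 \le i \le j \le n} (j - i + 1)
  = \sum_{d=1}^{n} d\,(n - d + 1)
  = \frac{n(n+1)(n+2)}{6} = O(n^3).
\]
```

Here d is the interval length, and there are n - d + 1 intervals of each length.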

Independent Base Pairs: Question 5 • Why not simply calculate free energies as they are needed? Why store them at all? • Because the recursive calls would turn our polynomial algorithm into an exponential algorithm.
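
The effect of storing results can be illustrated by counting how often the body of the recursion runs with and without a cache. This sketch keeps only the call structure of the recurrence: each pair is given an assumed constant energy of -1 and no sequence is involved. The name count_calls is hypothetical.

```python
from functools import lru_cache


def count_calls(n, memoize):
    """Count executions of the recursive body for an interval of n bases."""
    calls = 0

    def energy(i, j):
        nonlocal calls
        calls += 1
        if j <= i:
            return 0.0  # empty or single-base interval
        best = energy(i, j - 1)  # last base unpaired
        for k in range(i, j):    # last base paired with position k
            best = min(best, energy(i, k - 1) - 1.0 + energy(k + 1, j - 1))
        return best

    if memoize:
        # Rebinding the name makes the recursive calls go through the cache too.
        energy = lru_cache(maxsize=None)(energy)
    energy(0, n - 1)
    return calls
```

With memoization the count is bounded by the number of distinct (i, j) arguments, which is quadratic in n. Without it, the same intervals are recomputed many times and the count grows much faster.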

Independent Base Pairs: Question 6 • How does traceback work for this algorithm? Two options: 1. Recalculate which subinterval yields the minimum free energy. 2. Save traceback paths while filling the table.
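
A sketch of the first option, traceback by recalculation. It assumes the same illustrative energies as elsewhere in these notes (-1 per allowed pair, threshold t = 3, 0-based positions). The names a, fill_table, and traceback are hypothetical.

```python
INF = float("inf")
ALLOWED = {("A", "U"), ("U", "A"), ("C", "G"), ("G", "C")}


def a(seq, k, j, t):
    """Assumed pair energy: -1 for an allowed pair, infinite otherwise."""
    return -1.0 if j - k > t and (seq[k], seq[j]) in ALLOWED else INF


def fill_table(seq, t=3):
    """Fill E[i][j] from shorter to longer intervals."""
    n = len(seq)
    E = [[0.0] * n for _ in range(n)]
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length - 1
            get = lambda p, q: E[p][q] if p <= q else 0.0
            E[i][j] = min(
                [get(i, j - 1)]
                + [get(i, k - 1) + a(seq, k, j, t) + get(k + 1, j - 1)
                   for k in range(i, j)]
            )
    return E


def traceback(seq, E, t=3):
    """Recover one optimal set of base pairs by re-deriving each choice."""
    get = lambda p, q: E[p][q] if p <= q else 0.0
    pairs, stack = [], [(0, len(seq) - 1)]
    while stack:
        i, j = stack.pop()
        if j <= i:
            continue
        if E[i][j] == get(i, j - 1):  # r_j unpaired
            stack.append((i, j - 1))
            continue
        for k in range(i, j):  # find the k that produced the minimum
            # Exact comparison is safe here because all energies are integers.
            if E[i][j] == get(i, k - 1) + a(seq, k, j, t) + get(k + 1, j - 1):
                pairs.append((k, j))
                stack.extend([(i, k - 1), (k + 1, j - 1)])
                break
    return sorted(pairs)
```

Recalculation costs linear time per traced interval and needs no extra storage. Saving a pointer per interval trades quadratic extra memory for a faster traceback.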

Loop Free Energy Algorithm • A base pair’s free energy is not independent of all other base pairs. • An RNA molecule’s free energy actually depends on its loop structure. • What do we mean by loops?

Types of Loops • Each base pair (ri, rj) encloses a loop: 1. Hairpin loop 2. Bulge on i or j 3. Interior loop 4. Helical region

Hairpin Loop • There are no base pairs (rk, rl) for i<k<l<j.

Bulge on i and j • Bulge on i: (r_i, r_j) and (r_k, r_{j-1}) are base pairs with k > i+1, and r_{i+1} is not paired. • The bulge on j is symmetric.

Interior loop • (r_i, r_j) and (r_k, r_l) are base pairs with i+1 < k < l < j-1. • r_{i+1} and r_{j-1} are not in base pairs.

Helical region • (ri, rj) and (ri+1, rj-1) are base pairs.
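
The four loop definitions can be sketched as a classifier over a known set of base pairs. The function name and the set-of-tuples representation are assumptions, and positions are 0-based. Loops that directly enclose more than one pair are not among the four types in these slides, so they are reported separately.

```python
def classify_loop(pairs, i, j):
    """Classify the loop enclosed by base pair (i, j).

    pairs is a set of (k, l) tuples with k < l, assumed to be knot-free
    and to contain (i, j).
    """
    inner = [(k, l) for (k, l) in pairs if i < k < l < j]
    if not inner:
        return "hairpin loop"
    # Keep only pairs not nested inside another inner pair.
    direct = [
        (k, l) for (k, l) in inner
        if not any(k2 < k and l < l2 for (k2, l2) in inner)
    ]
    if len(direct) > 1:
        return "multiple enclosed pairs (outside the four types)"
    k, l = direct[0]
    if k == i + 1 and l == j - 1:
        return "helical region"
    if l == j - 1:
        return "bulge on i"  # k > i + 1, so r_{i+1} is unpaired
    if k == i + 1:
        return "bulge on j"  # l < j - 1, so r_{j-1} is unpaired
    return "interior loop"
```

For example, in the structure {(0, 9), (1, 8), (2, 7)} the pair (2, 7) closes a hairpin loop, while (0, 9) and (1, 8) each close a helical region.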

Free energy analysis • E(S_{i,j}) = E(S_{i+1,j}) when r_i isn’t paired. • E(S_{i,j}) = E(S_{i,j-1}) when r_j isn’t paired. • E(S_{i,j}) = min( E(S_{i,k}) + E(S_{k+1,j}) ) for i < k < j, with k between the partners of r_i and r_j, when r_i and r_j are both paired but not to each other. • E(S_{i,j}) = E(L_{i,j}), where E(L_{i,j}) is the loop energy, when r_i and r_j are paired to each other.

Free Energy Functions • a(ri, rj) – Free energy of base pair (ri, rj) • H(k) – Destabilizing free energy of a hairpin loop with size k. • R – Stabilizing free energy of adjacent base pairs (helical region). • B(k) – Destabilizing free energy of a bulge of size k. • I(k) – Destabilizing free energy of an interior loop of size k.

Loop Energy Formulas • H(j-i-1) – for a hairpin loop • R + E(S_{i+1,j-1}) – for a helical region • B(k) + E(S_{i+k+1,j-1}) – for a bulge on i • B(k) + E(S_{i+1,j-k-1}) – for a bulge on j • I(k_1+k_2) + E(S_{i+k_1+1,j-k_2-1}) – for an interior loop

Free Energy Calculation for interval (i, j) • Minimize over: 1. Case where (r_i, r_j) is not a pair. 2. Case where (r_i, r_j) is a pair. • Add a(r_i, r_j) to the formulas. • Minimize over k, k_1, and k_2.

What is the Apparent Complexity? • The interior loop calculations are given by I(k_1+k_2) + E(S_{i+k_1+1,j-k_2-1}). • The number of interior loop possibilities is quadratic in the interval size. • The number of intervals is quadratic in the size of the problem. • The complexity appears to grow as n^4.
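
The apparent growth can be checked by brute-force counting of the interior-loop candidates (k_1, k_2) over all intervals. This sketch assumes k_1 >= 1 and k_2 >= 1 and requires the inner pair to span at least two positions. Both function names are hypothetical, and positions are 0-based.

```python
def interior_candidates(i, j):
    """Count (k1, k2) choices for an interior loop closed by pair (i, j)."""
    count = 0
    for k1 in range(1, j - i):
        for k2 in range(1, j - i):
            # Inner pair is (i + k1 + 1, j - k2 - 1); it must be a valid interval.
            if i + k1 + 1 < j - k2 - 1:
                count += 1
    return count


def total_candidates(n):
    """Sum the candidate counts over every interval of a strand of n bases."""
    return sum(
        interior_candidates(i, j)
        for i in range(n)
        for j in range(i + 1, n)
    )
```

Doubling n should multiply total_candidates(n) by a factor approaching 16 for large n, which is the signature of fourth-power growth.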

What is the Actual Complexity? • Overall reduction from n^4 to n^3 is possible. • Per-interval reduction from n^2 to linear. • Store V_{i,j,k}, the minimum free energy where the interval (i, j) contains an interior loop of size k.

Multiple Solutions • Care must be taken to define the issues. • Multiple solutions can be obtained by adding flexibility to the traceback logic. • The number of solutions can grow exponentially.

References • M. Zuker, “The use of dynamic programming in RNA secondary structure prediction”. In M. S. Waterman, editor, Mathematical Methods for DNA Sequences. Boca Raton, FL: CRC Press, 1989. • J. Setubal and J. Meidanis, Ch. 8.1, Introduction to Computational Molecular Biology. Pacific Grove, CA: Brooks/Cole Publishing Co., 1997.