
Basics of Systems Bioinformatics
Tsu-Wang (David) Shen (沈祖望), Ph.D.
Medical Informatics Department, Tzu Chi University, Taiwan
E-mail: tshen@mail.tcu.edu.tw
Web: http://www.biom3.tcu.edu.tw
BioM3

Agenda
1. About me & Data mining
2. Introduction of Biological signal processing
3. Basics of Biology
4. Basics of Signals & Systems
5. Applications
6. Future Works
2021/6/7

Introducing myself
• Education
  - Ph.D., Biomedical Engineering, University of Wisconsin - Madison, WI, USA
  - M.S., Electrical and Computer Engineering, University of Wisconsin - Madison, WI, USA
  - M.S., Electrical and Computer Engineering, Illinois Institute of Technology, Chicago, IL, USA
• Specialty
  - Biomedical engineering, biomedical signal processing, biometric security and identification, neural networks, biomedical statistics, artificial intelligence, biomedical electronics

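Introducing myself
• Experience
  - Head of the Curriculum Section, Tzu Chi University (2005.8.1~2006.7.31)
  - Adjunct Assistant Professor, Graduate Institute of Global Logistics Management, National Dong Hwa University (2007)
  - Jointly Appointed Assistant Professor, Institute of Medical Sciences, Tzu Chi University (2006~)
  - Ministry of Education seed instructor for high-tech patent acquisition and defense (2006)
  - IEEE Member since 1996
  - Member of the Taiwan Epilepsy Society since 2007
  - Biomedical Engineer License, Republic of China (2007 I 1090)
  - Computer Project Assistant, Educational Psychology Department, University of Wisconsin-Madison, WI, USA (1999~2005)
  - Full-time Lecturer, Ta Hwa Institute of Technology (1995~1996)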

Data mining
• Why mining?
• We need information from a big data warehouse, but there is too much of it!
• What is data mining?
• Data mining is the process of sorting through large amounts of data and picking out relevant information.

Traditional data mining skills


Today: New view of data mining
• Systems Bioinformatics: An Engineering Case-Based Approach, by Gil Alterovitz and Marco F. Ramoni (Computational Biology)
• This trail-blazing work introduces a quantitative systems approach to bioinformatics research using powerful computational tools drawn from signal processing, circuit analysis, control systems, and communications. It presents the functionality of biological processes in an engineering context to facilitate the application of technical skills in solving the field's challenges, from the lab bench to data analysis and modeling, and to enable reverse engineering from biology in the development of synthetic biological devices.

Systems Bioinformatics
• Today, we will focus on using signal processing and systems skills as the “mining” tools for solving bioinformatics and biology problems.

What are signals?
♦ Signals are everywhere! For example:
• Signals for detecting the presence of an object.
• Signals for making a medical diagnosis, e.g. electrocardiogram (ECG) and electroencephalogram (EEG).
• The trading volume in the stock market.
♦ Signals can be categorized in various ways.

What are signals?
• A signal is an abstract element of information or, more commonly, a flow of information (in one or more dimensions).
• One variable or several -> one-dimensional vs. multidimensional signals.
• Signal vs. noise -> depends on whether the component is wanted.

Continuous (Analog), Discrete, and Digital signals


What is a system?
• A system is defined as an entity that manipulates one or more signals to accomplish a function, thereby yielding new signals.
• Block diagram: x[n] -> h[n] -> y[n]

Specific systems
• Communication systems
• Control systems
• Microelectromechanical systems (MEMS)
• Remote sensing
• Biomedical signal systems
• Auditory system
• Bioinformatics systems
• Others …

Biological Signal Processing
• The discipline is aimed at:
  - Understanding and modeling the biological algorithms implemented by living systems using signal processing theory (biology as an endpoint).
  - Using biology as a metaphor to formulate novel signal processing algorithms for engineering systems (signal processing as an endpoint).

History
• Claude Bernard, 1800s: the concept of homeostasis.
• Ludwig von Bertalanffy, 1968: General System Theory: there appear to exist general system laws which apply to any system of a particular type, irrespective of the particular properties of the systems and the elements involved …
• Norbert Wiener, 1900s: system feedback.
• Reiner, 1960s: systems work vs. DNA findings.
• The focus then shifted to a reductionist, component-level view.

Electrical vs. Biological Signal Processing
Integrating signal processing and biology

BSP at the cell level: today's coverage
• Signal detection and estimation
  - DNA sequencing
  - Gene identification
  - Protein hotspots identification
• System identification and analysis
  - Gene regulation systems
  - Protein signal systems

Basics of Biology
DNA (gene) -> RNA -> amino acid -> protein -> cells

Introduction to the Basics of Biology
• In April 2003, sequencing of all three billion nucleotides in the human genome was declared complete.
• 1631 human genetic diseases are now associated with known DNA sequences.
• The human genome was not the only genome targeted (as of June 2004: 1557 viruses, 165 microbes, and 26 eukaryotes).

DNA and Gene Expression
• DNA is found in the nucleus and mitochondria of eukaryotic cells, including animal, plant, and fungal cells.
• DNA in the nucleus contains information from both parents.
• DNA is the chemical that directs protein synthesis and serves as the genetic blueprint; it is found in the nucleus of the cell.



Base pairs, DNA

DNA and Gene
The human genome has been estimated to contain about 30,000 genes.

DNA -> Gene
Each gene has a particular location in a specific chromosome and contains the “code” for producing one of three forms of RNA: ribosomal RNA (rRNA), messenger RNA (mRNA), and transfer RNA (tRNA).


The Central Dogma of biology
• DNA is copied into RNA (transcription); the RNA is used to make proteins (translation); and the proteins perform functions such as copying the DNA (replication). (www.biologyforengineers.org)

Replication
• Replication of DNA by DNA polymerase. After the two strands of DNA separate (top), DNA polymerase uses nucleotides to synthesize a new strand complementary to the existing one (bottom). Images from the online tutorial “Biological Information Handling: Essentials for Engineers” (www.biologyforengineers.org).

Replication
• During replication, some enzymes check for accuracy.
• Error rate: approximately one error per billion nucleotides copied.

Transcription
• Transcription of DNA by RNA polymerase. Note that RNA contains U's instead of T's. Image from the online tutorial “Biological Information Handling: Essentials for Engineers” (www.biologyforengineers.org).



Transcription vs. replication
• Only a certain stretch of DNA acts as the template, not the whole strand.
• Different enzymes are used.
• Only a single strand is produced.

Translation
• In the process of translation, each mRNA codon attracts a tRNA molecule containing a complementary anticodon. (www.biologyforengineers.org)


Transcription & Translation

DNA to RNA to Protein
• The genetic information is stored in DNA, copied to RNA, and then interpreted from the RNA copy to form a functional protein.
• mRNA <-> rRNA <-> tRNA
• Proteins are chains of amino acids. Cells use 20 kinds of amino acids, which can be strung together into proteins of different lengths, shapes, and functions.


DNA to RNA to Protein


64 codons and the amino acid for each
Sources: http://en.wikipedia.org/, http://www.medigenomix.de/
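The transcription and translation steps described on the preceding slides can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `transcribe` and `translate` and the example DNA string are made up for this sketch, and the table is the standard genetic code written out in UCAG order.

```python
from itertools import product

# Standard genetic code; codons are enumerated in UCAG order
# (first base, then second, then third). '*' marks a stop codon.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(coding_strand: str) -> str:
    """Return the mRNA matching the DNA coding strand (T replaced by U)."""
    return coding_strand.upper().replace("T", "U")


def translate(mrna: str) -> str:
    """Translate codon by codon from position 0 until a stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        aa = CODON_TABLE[mrna[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


dna = "ATGGCCTTTTAA"  # hypothetical example sequence
mrna = transcribe(dna)
protein = translate(mrna)
print(mrna, protein)  # AUGGCCUUUUAA MAF
```

The dictionary has exactly 64 entries, one per codon, which is the table the slide refers to.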

• Complexity at the protein level exceeds complexity at the gene and transcript levels. Individual genes in the genome are transcribed into RNA. In eukaryotes, the RNA may be further processed to remove intervening sequences (RNA splicing) and result in a mature transcript that encodes a protein. Different proteins may be encoded by differently spliced transcripts (alternative splicing products). Moreover, once proteins are produced, they can be processed (e.g. cleaved by a protein-cutting protease) or modified (e.g. by addition of a sugar or lipid molecule). Proteins may also have non-covalent interactions with other proteins (and/or with other biomolecules such as lipids or nucleotides). Each of these can have tissue-, stage- and cell-type specific effects on the abundance, function and/or stability of proteins produced from a single gene.


Gel electrophoresis
• A method for separating DNA, RNA, or protein molecules by applying an electric field that drives them through a three-dimensional matrix. The larger the DNA fragment, the slower it moves through the matrix. DNA isolated on a gel can be recovered and purified away from the matrix.

Polymerase Chain Reaction (PCR)
• A method for amplification of a specific DNA fragment in which paired DNA strands are separated (by high temperature) and each is then used as a template for production of a complementary strand by an enzyme (a DNA polymerase).
• The development of PCR can be traced back to the discovery of DNA polymerase. That enzyme, however, is easily destroyed by heat, so it could not meet the needs of a series of high-temperature chain reactions.
• The enzyme used today (Taq polymerase) was isolated in 1976 from a hot-spring bacterium (Thermus aquaticus). It tolerates high temperatures, which makes it an ideal enzyme.

Polymerase Chain Reaction (PCR): Flash demo

Gene clone
• Flow-chart of the major steps involved in gene clone set production and use.

• Bioinformatic approaches to select target genes for a cloning project. For bacterial genomes, target selection primarily draws from genome sequence, where introns are not a consideration and genome-scale projects are feasible. For eukaryotes, researchers commonly use one or more informatics-based methods to identify sub-groups of target genes that share a common feature, such as function, localization, expression or disease association. As noted, these information sources draw significantly on one another (as experimental data is used in genome annotation, etc.).

Basics of Signals and Systems

Signal processing basics
• Time-domain representation: signals that vary with time.
• Frequency-domain representation: Fourier discovered that any signal can be expressed as a combination of sine and cosine functions (surprising!).

Time Operations
• Time scaling: y(t) = x(at)
• If a > 1, y(t) is compressed.
• If 0 < a < 1, y(t) is expanded.

Time scaling

Reflection
• y(t) = x(-t)
• y(t) represents a reflected version of x(t) about t = 0.

Time shifting
• y(t) = x(t - t_0)
• If t_0 > 0, the signal shifts toward the right.
• If t_0 < 0, the signal shifts toward the left.

Example: Precedence Rule
• y(t) = x(at - b)
  - First shift: v(t) = x(t - b)
  - Then scale: y(t) = v(at) = x(at - b)
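The precedence rule can be checked numerically. The sketch below uses a made-up triangular test signal `x` and hypothetical values a = 2, b = 3; it shows that shifting first and then scaling reproduces x(at - b), while the reverse order gives x(at - ab) instead.

```python
import numpy as np


def x(t):
    """Hypothetical test signal: a unit triangle centred at t = 0."""
    return np.maximum(0.0, 1.0 - np.abs(t))


a, b = 2.0, 3.0                      # assumed example values
t = np.linspace(-5.0, 5.0, 1001)

# Correct order: shift first, then scale.
v = lambda tau: x(tau - b)           # v(t) = x(t - b)
y = v(a * t)                         # y(t) = v(at) = x(at - b)

# Reverse order: scale first, then shift the scaled signal by b.
w = lambda tau: x(a * tau)           # w(t) = x(at)
y_wrong = w(t - b)                   # w(t - b) = x(at - ab)

peak_correct = t[np.argmax(y)]       # peak where at - b = 0, i.e. t = b / a
peak_wrong = t[np.argmax(y_wrong)]   # peak where a(t - b) = 0, i.e. t = b
print(peak_correct, peak_wrong)
```

With these values the correctly ordered signal peaks at t = b/a = 1.5, while the reversed order peaks at t = b = 3.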


Relationship between Time Properties of a Signal and the Appropriate Fourier Representation

Time property  | Periodic                            | Non-periodic
Continuous (t) | Fourier Series (FS)                 | Fourier Transform (FT)
Discrete [n]   | Discrete-time Fourier Series (DTFS) | Discrete-time Fourier Transform (DTFT)

Relations Among Fourier Methods

Types of transforms: Time, Fourier, Z, S (Laplace)

Filters

Advantages of digital filters over analog filters
• Highly immune to noise because of the way they are implemented (software/digital circuits).
• Accuracy depends only on round-off error, which is directly determined by the number of bits.
• Easy and inexpensive to change a filter’s operating characteristics (e.g., cutoff frequency).
• Performance is not a function of component aging, temperature variation, or power supply voltage.

Signal conversion
Continuous signal -> Band-limiter and sampler -> Sampled signal -> Digital filter (processor) -> Reconstruction filter -> Continuous signal

Sampling as multiplication by a train of impulses

Sampling theorem
• The sampling rate must be at least twice the highest frequency present in the signal (including noise).
• If a signal contains no frequencies higher than f_c, the original signal can be completely recovered from samples taken at a rate of at least 2 f_c samples/s.
• In other words, the sampling frequency f_s must satisfy f_s >= 2 f_c; this minimum rate 2 f_c is the Nyquist rate.
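A short numerical sketch makes the consequence of undersampling visible. The tone frequency and the two sampling rates below are assumed example values, and `dominant_frequency` is a helper written for this illustration: a 7 Hz sine sampled at 10 Hz (below 2 x 7 Hz) shows up as a 3 Hz component, while sampling at 50 Hz recovers 7 Hz.

```python
import numpy as np

f_signal = 7.0    # Hz, assumed example tone
fs_low = 10.0     # Hz, below the Nyquist rate of 14 Hz -> aliasing
fs_high = 50.0    # Hz, above the Nyquist rate


def dominant_frequency(f, fs, duration=2.0):
    """Sample a sine of frequency f at rate fs and return the strongest DFT bin in Hz."""
    n = np.arange(int(duration * fs))
    samples = np.sin(2 * np.pi * f * n / fs)
    spectrum = np.abs(np.fft.rfft(samples))
    freqs = np.fft.rfftfreq(len(samples), d=1.0 / fs)
    return freqs[np.argmax(spectrum)]


aliased = dominant_frequency(f_signal, fs_low)      # appears at |7 - 10| = 3 Hz
recovered = dominant_frequency(f_signal, fs_high)   # appears at 7 Hz
print(aliased, recovered)
```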

LTI system
• What is LTI (Linear Time-Invariant)? Two conditions: linearity and time invariance.

System operation
Convolution (time domain) <-> multiplication (frequency domain)
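The relationship between convolution and multiplication for an LTI system can be verified numerically. In this sketch the input `x` and impulse response `h` are arbitrary made-up values; the point is only that convolving in the time domain matches multiplying the (zero-padded) DFTs.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(32)          # hypothetical input x[n]
h = np.array([0.25, 0.5, 0.25])      # hypothetical impulse response h[n]

# Time domain: y[n] = sum_k h[k] x[n - k]
y_time = np.convolve(x, h)

# Frequency domain: Y = X * H, zero-padded to length N so that the
# circular convolution implied by the DFT equals the linear one.
N = len(x) + len(h) - 1
y_freq = np.fft.ifft(np.fft.fft(x, N) * np.fft.fft(h, N)).real

print(np.allclose(y_time, y_freq))
```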

Autocorrelation and PSD
The power spectral density (PSD) of a signal is defined as the Fourier transform of its autocorrelation function.
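For a finite sampled signal, this definition can be checked with a discrete analogue. The sketch below assumes a circular (periodic) autocorrelation estimate and a random test signal; under that assumption, the DFT of the autocorrelation equals the periodogram |X[k]|^2 / N.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal(64)          # hypothetical real-valued test signal
N = len(x)

# Circular autocorrelation estimate: r[k] = (1/N) * sum_n x[n] * x[(n + k) mod N]
r = np.array([np.dot(x, np.roll(x, -k)) for k in range(N)]) / N

psd_from_autocorr = np.fft.fft(r).real        # Fourier transform of the autocorrelation
periodogram = np.abs(np.fft.fft(x)) ** 2 / N  # direct spectral estimate

print(np.allclose(psd_from_autocorr, periodogram))
```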

Example: digital or analog?

Signal Detection and Estimation

Estimation and detection

Signal detection and estimation
• Estimation
• Detection: hypothesis testing
• Five steps for analyzing genomic and proteomic data
  - Describe and identify the measurement system S.
  - Define the signal of interest x[n]. Map the biological space into a numerical space.
  - Formulate the problem (estimation, detection, or analysis).
  - Solve the problem and compute the output signals.
  - Interpret the results in a biological context.

DNA sequencing
• The DNA sequencing process
  - DNA sample preparation
  - Electrophoresis
  - Processing
• Processing the electropherogram data to identify the DNA sequence
  - Conditioning the signal and increasing the S/N ratio
  - Identifying the underlying DNA sequence

DNA sequencing: electropherogram
Source: http://www.genome.uab.edu/

Model of DNA sequencing
Blurring: for example, diffusion effects and instrument noise

Wiener filter
http://en.wikipedia.org/wiki/Wiener_filter

DNA sequence estimation using LTI filtering

Homomorphic Blind Deconvolution
• The Wiener filter can lead to significant errors when diffusion effects are present.
• A high-pass filter is used to reduce blurring (a low-frequency effect).

Results
An error rate of 1.06% is better than the rates reported for fluorescence-based sequencing instruments!

Model-based estimation techniques
There are four main distortions introduced by sequencing:
1. Loading artifacts
2. Diffusion effects
3. Fluorescence interference
4. Additive instrument noise
The sequence is recovered by working backward through a model of these distortions.

Gene identification
• Once the DNA sequence has been identified, it needs to be analyzed to identify genes and coding sequences.

DNA signal properties
Bacterium Aquifex aeolicus
• Consider the autocorrelation function of x_a[n].
• The non-flat shape of the spectrum reveals correlations at low frequencies, indicating that base pairs that are far apart seem to be correlated.

DNA signal properties
• At the thin peak at 2π/3, the increased correlation corresponds to the tendency of nucleotides to be repeated along the DNA sequence with period 3 and is indicative of coding regions.
  - The triplet nature of the codon
  - Potentially codon bias (unequal usage of codons)
  - The biased usage of nucleotide triples in genomic DNA (triplet bias)
• Yin and Yau showed that the period-3 property is not affected by codon bias.
• The period-3 property of coding regions seems to be generated by unbalanced nucleotide distributions in the three codon positions.

DNA signal processing for gene identification
• Fickett: the problem of interpreting nucleotide sequences by computer, in order to provide tentative annotation on the location, structure, and functional class of protein-coding genes, i.e. predicting the amino acid sequence of a protein to provide insight into its function.
• The premise of all methods is to exploit the period-3 property of coding regions by processing the DNA signal to identify regions with strong period-3 correlation.

DNA signal processing for gene identification
• DNA spectrum
• Signal-to-noise ratio
• Tiwari et al. observed that for most coding sequences in a variety of organisms, Px is large, but not in noncoding regions.
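The period-3 idea can be illustrated with a small sketch. Everything below is an assumption made for illustration: the four binary indicator sequences (one per base), the ratio used as a simple signal-to-noise measure, and the synthetic sequences with made-up codon-position biases. It is not the exact measure used by the cited authors.

```python
import numpy as np

BASES = "ACGT"


def period3_ratio(seq: str) -> float:
    """Total DFT power at k = N/3 divided by the mean power over k = 1..N/2."""
    n = len(seq) - len(seq) % 3              # trim so that N is a multiple of 3
    seq = seq[:n]
    total = np.zeros(n // 2 + 1)
    for base in BASES:
        # Indicator sequence: 1 where this base occurs, 0 elsewhere.
        indicator = np.array([1.0 if s == base else 0.0 for s in seq])
        total += np.abs(np.fft.rfft(indicator)) ** 2
    return total[n // 3] / total[1:].mean()  # the DC bin k = 0 is excluded


rng = np.random.default_rng(0)
# Hypothetical nucleotide probabilities (A, C, G, T) for each codon position.
codon_position_probs = [
    [0.1, 0.1, 0.7, 0.1],   # position 1: G-rich
    [0.7, 0.1, 0.1, 0.1],   # position 2: A-rich
    [0.1, 0.1, 0.1, 0.7],   # position 3: T-rich
]
coding_like = "".join(
    rng.choice(list(BASES), p=codon_position_probs[i % 3]) for i in range(900)
)
random_like = "".join(rng.choice(list(BASES)) for _ in range(900))

print(period3_ratio(coding_like), period3_ratio(random_like))
```

In this synthetic setting the position-biased sequence gives a much larger ratio than the uniformly random one, which mirrors the unbalanced nucleotide distribution across codon positions described above.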

Fourier spectra: coding stretch of DNA vs. noncoding stretch of DNA

Filtering methods
Predicting the five exons of gene F56F11.4 in C. elegans chromosome III.
Panels: applied window; IIR antinotch; multistage filters

Protein hotspots identification
• Once coding regions have been identified, the corresponding protein sequence can be determined by mapping the coding region to the amino acid sequence using the genetic code.

Protein Signal Definition
• The new physicomathematical approach presented here is called the Resonant Recognition Model (RRM). The RRM is based on the representation of the protein primary structure as a numerical series by assigning to each amino acid a physical parameter value relevant to the protein’s biological activity.

Protein Signal Definition
• The RRM is a physical and mathematical model which interprets protein sequence linear information using signal analysis methods.
• It comprises two stages:
  - The first involves the transformation of the amino acid sequence into a numerical sequence. Each amino acid is represented by the value of the electron-ion interaction potential (EIIP), which describes the average energy states of all valence electrons in particular amino acids.
  - Numerical series obtained this way are then analyzed by digital signal analysis methods in order to extract information pertinent to the biological function.
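The two stages can be sketched as follows. This is a rough illustration under several assumptions: the EIIP values are those commonly tabulated in the RRM literature and should be checked against the table on the EIIP slide, the helper names are made up, the two peptide strings are hypothetical, and multiplying the amplitude spectra is only one simple way of looking for a frequency shared by several sequences.

```python
import numpy as np

# EIIP values (in Rydbergs) as commonly tabulated; an assumption to verify
# against the EIIP table before any real use.
EIIP = {
    "L": 0.0000, "I": 0.0000, "N": 0.0036, "G": 0.0050, "V": 0.0057,
    "E": 0.0058, "P": 0.0198, "H": 0.0242, "K": 0.0371, "A": 0.0373,
    "Y": 0.0516, "W": 0.0548, "Q": 0.0761, "M": 0.0823, "S": 0.0829,
    "C": 0.0829, "T": 0.0941, "F": 0.0946, "R": 0.0959, "D": 0.1263,
}


def eiip_signal(protein: str) -> np.ndarray:
    """Stage 1: map each amino acid letter to its EIIP value."""
    return np.array([EIIP[aa] for aa in protein])


def rrm_spectrum(protein: str, n_fft: int = 128) -> np.ndarray:
    """Stage 2: amplitude spectrum of the mean-removed, zero-padded numerical series."""
    x = eiip_signal(protein)
    return np.abs(np.fft.rfft(x - x.mean(), n_fft))


def cross_spectrum(proteins, n_fft: int = 128) -> np.ndarray:
    """Pointwise product of spectra; peaks mark frequencies common to all sequences."""
    return np.prod([rrm_spectrum(p, n_fft) for p in proteins], axis=0)


# Hypothetical peptide fragments, used only to exercise the functions.
peptides = ["MKTAYIAKQRQISFVKSHFSRQ", "MKLAYDAKQRNISFVRSHWSRE"]
common = cross_spectrum(peptides)
print(common.argmax() / 128)   # normalized frequency of the strongest shared peak
```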

EIIP

RESONANT RECOGNITION MODEL (RRM)

Prediction of protein hotspots: cytochrome C proteins

System Identification and Analysis

Signal processing view of the cell

Signal coordination at cell level: system view

Gene expression
• A DNA microarray (also commonly known as a gene chip, DNA chip, or biochip) is a collection of microscopic DNA spots attached to a solid surface, such as a glass, plastic or silicon chip, forming an array.
• Types of DNA microarrays include cDNA microarrays and oligonucleotide microarrays.
• In genetics, complementary DNA (cDNA) is DNA synthesized from a mature mRNA template. cDNA is often used to clone eukaryotic genes in prokaryotes.

Gene expression
One sample from a tumor; one sample from a normal tissue.
• White: highly expressed under treatment A
• Gray: no difference
• Dark: highly expressed under treatment B

Time changes: measure drugs in 6 hours.

cDNA microarrays
The procedure begins by attaching the DNA sequences of thousands of genes onto a microscope slide in a pattern of spots, with each spot containing only the DNA sequences of a single gene.

Oligonucleotide microarrays
• In oligonucleotide microarrays (or single-channel microarrays), the probes are designed to match parts of the sequence of known or predicted mRNAs.
• Instead of attaching full-length DNAs, oligonucleotide microarrays make use of short oligonucleotides chosen to be specific to individual genes.

Oligonucleotide microarrays
These microarrays give estimates of the absolute value of gene expression, and therefore the comparison of two conditions requires the use of two separate microarrays.

DNA microarray

Principal component analysis
• Principal component analysis transforms the original set of variables into a smaller set of linear combinations that account for most of the variance of the original set. Its purpose is to determine factors (i.e., principal components) that explain as much of the total variation in the data as possible with as few factors as possible. The principal components are those uncorrelated linear combinations PC(1), PC(2), …, PC(m) whose variances are as large as possible.
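The description above translates into a short computation: centre the data, take the eigenvectors of the covariance matrix, and project. The sketch below is a minimal illustration on made-up data (200 samples of 3 correlated variables); the variable names are chosen for this example only.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: 200 samples, 3 variables driven mostly by one latent factor.
latent = rng.standard_normal((200, 1))
X = latent @ np.array([[2.0, 1.0, 0.5]]) + 0.1 * rng.standard_normal((200, 3))

Xc = X - X.mean(axis=0)                    # centre each variable
cov = np.cov(Xc, rowvar=False)             # 3 x 3 covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)     # symmetric matrix -> eigh
order = np.argsort(eigvals)[::-1]          # sort by decreasing variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

scores = Xc @ eigvecs                      # PC(1), PC(2), PC(3) for each sample
explained = eigvals / eigvals.sum()        # fraction of total variance per component
print(explained)
```

The columns of `scores` are uncorrelated, and their variances are the sorted eigenvalues, so the first few columns carry most of the total variation.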

Eigenvalues and Eigenvectors
Definition: Let $A$ be an $n \times n$ matrix. A scalar $\lambda$ is called an eigenvalue of $A$ if there exists a nonzero vector $x$ in $\mathbb{R}^n$ such that $Ax = \lambda x$. The vector $x$ is called an eigenvector corresponding to $\lambda$.

Computation of Eigenvalues and Eigenvectors
Let $A$ be an $n \times n$ matrix with eigenvalue $\lambda$ and corresponding eigenvector $x$. Thus $Ax = \lambda x$. This equation may be rewritten $Ax - \lambda x = 0$, giving $(A - \lambda I_n)x = 0$. Solving the equation $|A - \lambda I_n| = 0$ for $\lambda$ leads to all the eigenvalues of $A$. On expanding the determinant $|A - \lambda I_n|$, we get a polynomial in $\lambda$. This polynomial is called the characteristic polynomial of $A$. The equation $|A - \lambda I_n| = 0$ is called the characteristic equation of $A$.

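Example 1
Find the eigenvalues and eigenvectors of the matrix
$$A = \begin{bmatrix} -4 & -6 \\ 3 & 5 \end{bmatrix}$$
Solution: Let us first derive the characteristic polynomial of $A$. We get
$$|A - \lambda I_2| = \begin{vmatrix} -4-\lambda & -6 \\ 3 & 5-\lambda \end{vmatrix} = (-4-\lambda)(5-\lambda) + 18 = \lambda^2 - \lambda - 2.$$
We now solve the characteristic equation of $A$:
$$\lambda^2 - \lambda - 2 = 0 \;\Rightarrow\; (\lambda - 2)(\lambda + 1) = 0.$$
The eigenvalues of $A$ are 2 and –1.

$\lambda = 2$: the equation $(A - 2I_2)x = 0$ leads to the system of equations
$$-6x_1 - 6x_2 = 0, \qquad 3x_1 + 3x_2 = 0,$$
giving $x_1 = -x_2$. The solutions to this system of equations are $x_1 = -r$, $x_2 = r$, where $r$ is a scalar. Thus the eigenvectors of $A$ corresponding to $\lambda = 2$ are nonzero vectors of the form $r\,[-1 \;\; 1]^T$.

$\lambda = -1$: the equation $(A + I_2)x = 0$ leads to
$$-3x_1 - 6x_2 = 0, \qquad 3x_1 + 6x_2 = 0.$$
Thus $x_1 = -2x_2$. The eigenvectors of $A$ corresponding to $\lambda = -1$ are nonzero vectors of the form $s\,[-2 \;\; 1]^T$.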
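The worked example can be cross-checked numerically. One caveat: the matrix itself did not survive in the slide text, so the matrix used below is an assumption, reconstructed as the 2 x 2 matrix determined by the stated eigenvalues (2 and –1) and eigenvectors ([-1, 1] and [-2, 1]).

```python
import numpy as np

# Assumed matrix: reconstructed from the eigenpairs stated in the example.
A = np.array([[-4.0, -6.0],
              [3.0, 5.0]])

eigenvalues, eigenvectors = np.linalg.eig(A)
print(np.sort(eigenvalues))                # approximately [-1, 2]

# Check the hand-derived eigenvectors directly: A v = lambda v.
v2 = np.array([-1.0, 1.0])                 # claimed eigenvector for lambda = 2
v_minus1 = np.array([-2.0, 1.0])           # claimed eigenvector for lambda = -1
print(A @ v2, 2 * v2)
print(A @ v_minus1, -1 * v_minus1)
```

`np.linalg.eig` returns unit-length eigenvectors, so its columns are scalar multiples of the hand-derived vectors rather than identical to them.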

Singular value decomposition (SVD)
• Suppose M is an m-by-n matrix whose entries come from the field K, which is either the field of real numbers or the field of complex numbers. Then there exists a factorization of the form $M = U \Sigma V^*$, where U is an m-by-m unitary matrix, Σ is an m-by-n diagonal matrix with non-negative real entries, and V* is the conjugate transpose of an n-by-n unitary matrix V.
• The matrix V thus contains a set of orthonormal "input" or "analysing" basis vector directions for M.
• The matrix U contains a set of orthonormal "output" basis vector directions for M.
• The matrix Σ contains the singular values, which can be thought of as scalar "gain controls" by which each corresponding input is multiplied to give a corresponding output.

SVD example
• A non-negative real number σ is a singular value for M if and only if there exist unit-length vectors u in $K^m$ and v in $K^n$ such that $Mv = \sigma u$ and $M^* u = \sigma v$.
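The factorization and the singular-value condition can be checked with NumPy. The 2 x 2 matrix below is a made-up example; `np.linalg.svd` returns U, the singular values, and V* (called `Vh` here), so the rows of `Vh` are the right singular vectors for a real matrix.

```python
import numpy as np

M = np.array([[3.0, 0.0],
              [4.0, 5.0]])                 # hypothetical example matrix

U, s, Vh = np.linalg.svd(M)

# Factorization: M = U @ diag(s) @ Vh
reconstructed = U @ np.diag(s) @ Vh

# Singular-value condition for the largest singular value:
# M v = sigma u  and  M^T u = sigma v
u1, v1, sigma1 = U[:, 0], Vh[0], s[0]
print(np.allclose(reconstructed, M))
print(np.allclose(M @ v1, sigma1 * u1), np.allclose(M.T @ u1, sigma1 * v1))
print(s)                                   # sqrt(45) and sqrt(5) for this matrix
```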

Eigengenes from applying SVD

Project CLB2 and CLN3

Apoptosis System Identification


PCA results

Summary
• We tried to link signals, systems, and biology.
• Filtering is necessary for removing artifacts.
• We learned “Signal detection and estimation”
  1. DNA sequencing
  2. Gene identification
  3. Protein hotspots identification
• We learned “System identification and analysis”
  1. Gene regulation systems
  2. Protein signal systems
• We presented a new way of thinking about data mining.

Thanks
- Slides: 118