BioProgramming
Jong Bhak, TGI, UNIST, Ulsan, Korea
jongbhak@gmail.com
20030321

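Acknowledgements
• Researchers who are honest and passionate in doing science
• People who support scientific research by paying tax
• MRC, Harvard, KAIST, KOBIC, TBI & Genome Research Foundation
• Theragen: CEO Ko Jin-eop
• NCC: Dr. Lee Yeon-su and Dr. Lee Jin-su
• National Center for Standard Reference Data (Dr. Chae Gyun-sik and Dr. Kim Chang-geun)
• KIOST: Dr. Lee Jung-hyun and colleagues
• Hanyang University (Prof. Ryu Seong-eon, Prof. Kim Deok-su, Prof. Ko In-song)
• UNIST (President Cho Moo-je), BME professors, local supporters
• UNIST students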

Two Aspects of BioProgramming
1) Bioprogramming as the natural process of information propagation in the universe
2) Bioprogramming as a programming technique in bioinformatics
2021-02-21

Bioprogramming the Universe
• Programming as a key mechanism of the universe's advancement

Universe is programmable
jongbhak@genomics.org, CopyLeft under BioLicense

IFE: Infinitely Fractal Encapsulation

Semiconductor of Life
• Nano-scale chemicals and molecules are the semiconductors of life for information processing.
• Proteins are the key molecules for information processing.

Protein modules for bioprogramming
• Proteins: the central processing molecules of life (15% of the mass of the average person)
• Minimum of 20 different kinds of amino acids:

Name           3-letter  1-letter  Formula
Alanine        ala       a         CH3-CH(NH2)-COOH
Arginine       arg       r         HN=C(NH2)-NH-(CH2)3-CH(NH2)-COOH
Asparagine     asn       n         H2N-CO-CH2-CH(NH2)-COOH
Aspartic acid  asp       d         HOOC-CH2-CH(NH2)-COOH
Cysteine       cys       c         HS-CH2-CH(NH2)-COOH
Glutamine      gln       q         H2N-CO-(CH2)2-CH(NH2)-COOH
Glutamic acid  glu       e         HOOC-(CH2)2-CH(NH2)-COOH
Glycine        gly       g         NH2-CH2-COOH
Histidine      his       h         NH-CH=N-CH=C-CH2-CH(NH2)-COOH
Isoleucine     ile       i         CH3-CH2-CH(CH3)-CH(NH2)-COOH
Leucine        leu       l         (CH3)2-CH-CH2-CH(NH2)-COOH
Lysine         lys       k         H2N-(CH2)4-CH(NH2)-COOH
Methionine     met       m         CH3-S-(CH2)2-CH(NH2)-COOH
Phenylalanine  phe       f         Ph-CH2-CH(NH2)-COOH
Proline        pro       p         NH-(CH2)3-CH-COOH
Serine         ser       s         HO-CH2-CH(NH2)-COOH
Threonine      thr       t         CH3-CH(OH)-CH(NH2)-COOH
Tryptophan     trp       w         Ph-NH-CH=C-CH2-CH(NH2)-COOH
Tyrosine       tyr       y         HO-p-Ph-CH2-CH(NH2)-COOH
Valine         val       v         (CH3)2-CH-CH(NH2)-COOH

http://www.nyu.edu/pages/mathmol/library/life1.html
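
The three-letter and one-letter codes listed above are what most bioinformatics scripts work with. A minimal sketch of converting between them follows; the names THREE_TO_ONE and to_one_letter are illustrative, not from the slides, and the one-letter codes are written in upper case, as is usual in sequence files.

```python
# Three-letter to one-letter codes for the 20 standard amino acids,
# following the list on the slide (upper case is an assumption of this sketch).
THREE_TO_ONE = {
    "ala": "A", "arg": "R", "asn": "N", "asp": "D", "cys": "C",
    "gln": "Q", "glu": "E", "gly": "G", "his": "H", "ile": "I",
    "leu": "L", "lys": "K", "met": "M", "phe": "F", "pro": "P",
    "ser": "S", "thr": "T", "trp": "W", "tyr": "Y", "val": "V",
}


def to_one_letter(three_letter_codes):
    """Convert e.g. ['Met', 'Ala'] to 'MA'. Raises KeyError on unknown codes."""
    return "".join(THREE_TO_ONE[code.lower()] for code in three_letter_codes)


peptide = to_one_letter(["Met", "Ala", "Gly", "Trp"])
print(peptide)
```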

Amino Acids (L-form)

Types of Amino Acids
Amino acids can be grouped into 4-5 different groups for bioinformatic analysis.
Most important distinctions:
• Hydrophobic and hydrophilic groups
• Big side-chain groups and small side-chain groups
• Cysteine for disulphide bonding (well conserved)
• Proline is structurally important
• Histidine is important for switching
Chemical groups:
• Aliphatic - alanine, glycine, isoleucine, leucine, proline, valine
• Aromatic - phenylalanine, tryptophan, tyrosine
• Acidic - aspartic acid, glutamic acid
• Basic - arginine, histidine, lysine
• Hydroxylic - serine, threonine
• Amidic (containing amide group) - asparagine, glutamine
http://chemistry.gsu.edu/glactone/PDB/Amino_Acids/aa.html
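
A grouping like this can be applied directly to a sequence. The sketch below counts residues per chemical group. It makes two assumptions beyond the slide: leucine is counted as aliphatic, and residues with no listed group (cysteine, methionine) are reported as "other". The example sequence is hypothetical.

```python
from collections import Counter

# One-letter codes per chemical group, following the slide's grouping.
# Assumption: leucine (L) is treated as aliphatic.
GROUPS = {
    "aliphatic": "AGILPV",
    "aromatic": "FWY",
    "acidic": "DE",
    "basic": "RHK",
    "hydroxylic": "ST",
    "amidic": "NQ",
}
RESIDUE_TO_GROUP = {aa: name for name, letters in GROUPS.items() for aa in letters}


def group_composition(sequence):
    """Count the residues of a protein sequence by chemical group."""
    # Residues without a listed group (e.g. C, M) are counted as "other".
    return Counter(RESIDUE_TO_GROUP.get(aa, "other") for aa in sequence.upper())


counts = group_composition("MKTAYIAKQR")  # hypothetical example sequence
for name, n in sorted(counts.items()):
    print(name, n)
```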

Amino Acids
• Cα-H, COO, R, NH3 (CORN law: CO, R, N read clockwise)
Zwitterions remain when the α-amino acid is dissolved in water at pH 7. Addition of an acid, supplying more protons, produces ions with a surplus positive charge.

Peptide Bond

Planes of peptide bonds

Amino Acid Protein

Secondary Structures from Amino Acids
• 3 main secondary structure elements are often used.
• In reality, there are many more types!
  – Different types of alpha, beta, coils, …

Alpha and Beta

Supersecondary Structure

Basic knowledge for Bioinformatics (focusing on proteins)
• Some basic points to be understood by biologists and non-biologists.

Protein
• Life is a huge chunk of protein with various ligands attached.
• The protein level is very efficient for us to work with because proteins are naturally distinct units, so it is favoured in bioinformatic computing.

Only 1,000 Protein shapes? (Year 2001)
• Fewer than 1,000 distinct types of protein structure shapes (called folds) are known so far.

Only 1,000 Protein superfamilies? (Year 2001)
• There are around 1,000 evolutionarily distinct protein shapes (called superfamilies).

e.g. 11 very common superfamilies

Only around 10,000 proteins?
• Perhaps not much more than 10,000 different types of (representative) protein sequences exist in nature.
• Then where do the complexity and diversity of life come from? From the network of interactions among them.
• Then where do all the bio-functions come from? (next slide)

Functional diversity: organizational and regulatory differences
• Chimps and humans are nearly the same in terms of genetic components, yet they are different species.
• The English and Koreans: the same genome but different populations.
  – Organization and regulation of information are different. Also developmental diversity.
  – Somehow cells 'self-organize' data very efficiently and effectively.

Organising Structures and Sequences bioinformatically
• A technical challenge we have is how to organize the structures and sequences of proteins.
• There are many different ways to organize protein sequences and structures.
• PDB 1976, SCOP, CATH, FSSP, …
• Swiss-Prot, PIR, GenBank, EST, SNP, TrEMBL, Ensembl, … (over 500 major biological DBs)
Now we have very large-scale data.

Interaction (directionless)
• Interactions do not have directions
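
One way to express "no direction" in code is to store each interaction as an unordered pair. The following is a minimal sketch under that assumption; the function names and the identifiers "domainA" and "domainB" are hypothetical.

```python
def add_interaction(network, a, b):
    """Record an interaction; the pair is stored without any order."""
    network.add(frozenset((a, b)))


def interacts(network, a, b):
    """True if a and b interact, regardless of the order they are given in."""
    return frozenset((a, b)) in network


network = set()
add_interaction(network, "domainA", "domainB")
add_interaction(network, "domainB", "domainA")  # same interaction, not stored twice

print(len(network))
print(interacts(network, "domainB", "domainA"))
```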

Resulting PSIMAP

Practical Steps of Complete Human Interactome
Structure DB; Predicted Human Interactome
http://hpid.org/

Global view of protein family interaction networks for 146 genomes

Signal Transduction (Pathways)
• Pathways have directions
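
Because pathways have directions, they are naturally stored as a directed graph. The sketch below keeps, for each component, the list of components it signals to, and walks downstream from a starting point. The pathway itself is a made-up three-step example, not real biological data.

```python
from collections import deque

# Hypothetical pathway: each key signals to the components in its list.
pathway = {
    "receptor": ["kinase"],
    "kinase": ["transcription_factor"],
    "transcription_factor": [],
}


def downstream(pathway, start):
    """Return every component reachable from start, in breadth-first order."""
    visited = {start}
    reached = []
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for target in pathway.get(node, []):
            if target not in visited:
                visited.add(target)
                reached.append(target)
                queue.append(target)
    return reached


print(downstream(pathway, "receptor"))
print(downstream(pathway, "transcription_factor"))
```

The signal flows from the receptor to the transcription factor but not back, which an undirected interaction set cannot express.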

Bioinformatics Programming
• BioPerl: 1994
• BioJava: 1995
• BioPHP
• BioRuby
• BioC++
• Bio[X]

Programming for Bioinformatics
• Automation is the key
• Use fast prototyping
• Solve problems
• Reuse scripts
• Share the code with lab members
• Use public resources
• Use open, free resources
  – GitHub

Requirements
• Guru-level coding ability
• Understanding computer hardware
• Parsing ability
  – Text manipulation
• Database
  – Flat file
  – Relational (MySQL)
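
The parsing and database requirements can be illustrated together: read a flat file by text manipulation, then load it into a relational table. This is a minimal sketch only. It assumes FASTA as the flat-file format and uses Python's built-in sqlite3 as a stand-in for MySQL; the sequences, table name and column names are made up.

```python
import sqlite3

# Hypothetical flat-file content in FASTA format.
FASTA_TEXT = """>seq1 hypothetical protein
MKTAYIAKQR
QISFVKSHFS
>seq2 another hypothetical protein
GAVLIP
"""


def parse_fasta(text):
    """Yield (identifier, sequence) pairs from FASTA-formatted text."""
    identifier, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if identifier is not None:
                yield identifier, "".join(chunks)
            header = line[1:].split()
            identifier, chunks = (header[0] if header else ""), []
        else:
            chunks.append(line)
    if identifier is not None:
        yield identifier, "".join(chunks)


records = list(parse_fasta(FASTA_TEXT))

# In-memory SQLite database used here in place of a MySQL server.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE protein (id TEXT PRIMARY KEY, sequence TEXT)")
conn.executemany("INSERT INTO protein VALUES (?, ?)", records)
rows = conn.execute(
    "SELECT id, LENGTH(sequence) FROM protein ORDER BY id"
).fetchall()
print(rows)
```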

What is a grammar?

What is a compiler?
• A compiler is a computer program (or set of programs) that transforms source code written in a programming language (the source language) into another computer language (the target language, often having a binary form known as object code). [1] The most common reason for converting source code is to create an executable program.
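
The idea of translating a source language into a target language can be shown at toy scale. The sketch below compiles arithmetic expressions into instructions for a small stack machine and then runs them. The instruction names (PUSH, ADD, SUB, MUL, DIV) and the stack machine are invented for illustration; Python's standard ast module does the parsing. It is not how a production compiler is built.

```python
import ast
import operator

# Source operator node -> target instruction name (names are illustrative).
OPS = {ast.Add: "ADD", ast.Sub: "SUB", ast.Mult: "MUL", ast.Div: "DIV"}
APPLY = {
    "ADD": operator.add,
    "SUB": operator.sub,
    "MUL": operator.mul,
    "DIV": operator.truediv,
}


def compile_expr(source):
    """Compile an arithmetic expression into stack-machine instructions."""
    tree = ast.parse(source, mode="eval")
    code = []

    def emit(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            code.append(("PUSH", node.value))
        elif isinstance(node, ast.BinOp) and type(node.op) in OPS:
            emit(node.left)    # operands first ...
            emit(node.right)
            code.append((OPS[type(node.op)], None))  # ... then the operator
        else:
            raise SyntaxError("unsupported construct in this sketch")

    emit(tree.body)
    return code


def run(code):
    """Execute the instructions on a simple stack and return the result."""
    stack = []
    for op, arg in code:
        if op == "PUSH":
            stack.append(arg)
        else:
            right, left = stack.pop(), stack.pop()
            stack.append(APPLY[op](left, right))
    return stack.pop()


program = compile_expr("2 + 3 * 4")
print(program)
print(run(program))
```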

Programming modules in Python
