National Taiwan University, Department of Computer Science and Information Engineering
Haplotype Inference
Yao-Ting Huang, Kun-Mao Chao

Genetic Variations
- Genetic variations in DNA sequences (e.g., insertions, deletions, and mutations) have a major impact on genetic diseases and phenotypic differences.
  - All humans share about 99% of the same DNA sequence.
  - Genetic variations in a coding region may change the codon of an amino acid and alter the amino acid sequence.

Single Nucleotide Polymorphism
- A single nucleotide polymorphism (SNP), pronounced "snip," is a genetic variation in which a single nucleotide (i.e., A, T, C, or G) is altered and kept through heredity.
  - SNP: a single DNA base variation found in more than 1% of the population.
  - Mutation: a single DNA base variation found in less than 1% of the population.
- Example: if 94% of sequences read CTTAGCTT and 6% read CTTAGTTT, the variation is a SNP. If 99.9% read CTTAGCTT and 0.1% read CTTAGTTT, it is a mutation.

Mutations and SNPs
[Figure: timeline from a common ancestor to the present. Labels: Mutations, Observed genetic variations.]

Single Nucleotide Polymorphism
- SNPs are the most frequent form among various genetic variations.
  - 90% of human genetic variations come from SNPs.
  - SNPs occur about every 300 to 600 base pairs.
  - Millions of SNPs have been identified (e.g., HapMap and Perlegen).
- SNPs have become the preferred markers for association studies because of their high abundance and high-throughput SNP genotyping technologies.

Single Nucleotide Polymorphism
- A SNP is usually assumed to be a binary variable.
  - The probability of a repeat mutation at the same SNP locus is quite small.
  - Tri-allele cases are usually considered to be the effect of genotyping errors.
- The nucleotide at a SNP locus is called
  - a major allele (if allele frequency > 50%), or
  - a minor allele (if allele frequency < 50%).
- Example: 94% ACTTAGCTT (T: major allele), 6% ACTTAGCTC (C: minor allele).
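
The major/minor distinction above can be sketched in a few lines of Python. This is only an illustration: the function name `classify_alleles` and the list-of-observations input are assumptions, not part of the slides, and the locus is assumed to be bi-allelic with no exact 50/50 tie.

```python
from collections import Counter

def classify_alleles(observed):
    """Return (major, minor) for one bi-allelic SNP locus.

    `observed` is a list of the alleles seen at that locus, one per
    sampled chromosome (hypothetical input format).
    """
    counts = Counter(observed)
    if len(counts) != 2:
        raise ValueError("expected exactly two alleles at a bi-allelic SNP")
    # most_common sorts by count, so the first entry is the major allele.
    (major, _), (minor, _) = counts.most_common(2)
    return major, minor

# The slide's example: 94% of sequences end in T, 6% end in C.
sample = ["T"] * 94 + ["C"] * 6
print(classify_alleles(sample))  # ('T', 'C')
```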

Haplotypes
- A haplotype stands for an ordered list of SNPs on the same chromosome.
  - A haplotype can be simply considered as a binary string since each SNP is binary.
- Example (SNP 1, SNP 2, and SNP 3 are the positions where the sequences differ):
  - Haplotype 1: -ACTTAGCTT- (C, A, T)
  - Haplotype 2: -AATTTGCTC- (A, T, C)
  - Haplotype 3: -ACTTTGCTC- (C, T, C)
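
As a rough sketch of the "binary string" view, the snippet below locates the SNP positions in the three example sequences and encodes each haplotype. The convention of writing the more common allele in the sample as 0 and the other as 1 is an assumption made for illustration; the slides do not fix an encoding, and `encode_haplotypes` is a hypothetical helper name.

```python
from collections import Counter

def encode_haplotypes(sequences):
    """Find variable positions and encode each sequence as a binary string."""
    length = len(sequences[0])
    # A position is treated as a SNP if the sequences disagree there.
    snp_positions = [i for i in range(length)
                     if len({s[i] for s in sequences}) > 1]
    encoded = ["" for _ in sequences]
    for pos in snp_positions:
        column = [s[pos] for s in sequences]
        # Assumed convention: most common allele in the sample -> "0".
        major = Counter(column).most_common(1)[0][0]
        for k, allele in enumerate(column):
            encoded[k] += "0" if allele == major else "1"
    return snp_positions, encoded

positions, codes = encode_haplotypes(["ACTTAGCTT", "AATTTGCTC", "ACTTTGCTC"])
print(positions)  # [1, 4, 8]  (0-based positions of SNP 1, SNP 2, SNP 3)
print(codes)      # ['011', '100', '000']
```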

Genotypes
- The use of haplotype information has been limited because the human genome is diploid.
  - In large sequencing projects, genotypes instead of haplotypes are collected due to cost considerations.
[Figure: genotype data versus haplotype data for the same individual at SNP 1 and SNP 2.]

Problems of Genotypes
- Genotypes only tell us the alleles at each SNP locus.
  - But we don't know the connection of alleles at different SNP loci.
  - There could be several possible haplotypes for the same genotype.
- Example: a genotype with alleles A/C at SNP 1 and G/T at SNP 2 may come from the haplotype pair (AT, CG) or from the pair (AG, CT).
- We don't know which haplotype pair is real.
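
The ambiguity can be made concrete by enumerating every haplotype pair compatible with a genotype. The sketch below assumes a genotype is given as a list of unordered allele pairs, one per SNP; that representation and the name `haplotype_pairs` are illustrative choices, not something the slides define.

```python
from itertools import product

def haplotype_pairs(genotype):
    """List all unordered haplotype pairs that could produce `genotype`.

    `genotype` is a list of (allele, allele) tuples, one per SNP locus.
    """
    pairs = set()
    # At each locus, either allele may sit on the first chromosome.
    for choice in product(*[((a, b), (b, a)) for a, b in genotype]):
        first = "".join(c[0] for c in choice)
        second = "".join(c[1] for c in choice)
        # Sorting removes the (h, h') versus (h', h) duplicate.
        pairs.add(tuple(sorted((first, second))))
    return sorted(pairs)

print(haplotype_pairs([("A", "C"), ("G", "T")]))
# [('AG', 'CT'), ('AT', 'CG')]
print(haplotype_pairs([("A", "A"), ("T", "T")]))
# [('AT', 'AT')]
```

A genotype with k heterozygous loci has 2^(k-1) compatible pairs for k >= 1, which is why the genotype alone cannot identify the real pair.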

Research Directions of SNPs and Haplotypes in Recent Years
- SNP Database
- Haplotype Inference
  - Maximum Parsimony
  - Perfect Phylogeny
  - Statistical Methods
- Tag SNP Selection
  - Haplotype block
  - LD bin
  - Prediction Accuracy
- …

Haplotype Inference
- The problem of inferring the haplotypes from a set of genotypes is called haplotype inference.
  - This problem is known to be not only NP-hard but also APX-hard.
- Most combinatorial methods use the maximum parsimony model to solve this problem.
  - This model assumes that the distinct haplotypes actually present in a natural population are few.
  - The solution of this problem is a minimum set of haplotypes that can explain the given genotypes.

Maximum Parsimony
- Find a minimum set of haplotypes to explain the given genotypes.
- Example:
  - G1 (SNP 1: A/C, SNP 2: G/T) is explained by h1 = AT and h2 = CG, or by h3 = AG and h4 = CT.
  - G2 (SNP 1: A/A, SNP 2: T/T) is explained by h1 = AT paired with itself.
  - Choosing h1 and h2 for G1 explains both genotypes with two haplotypes, whereas choosing h3 and h4 would require three.
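
For tiny inputs, the maximum parsimony objective can be checked by exhaustive search. The sketch below is a brute-force illustration only, with exponential running time; it is not the SDP-based algorithm of the talk, and the function names and genotype representation are assumptions.

```python
from itertools import combinations, product

def haplotype_pairs(genotype):
    """All unordered haplotype pairs compatible with one genotype."""
    pairs = set()
    for choice in product(*[((a, b), (b, a)) for a, b in genotype]):
        first = "".join(c[0] for c in choice)
        second = "".join(c[1] for c in choice)
        pairs.add(tuple(sorted((first, second))))
    return sorted(pairs)

def min_haplotype_set(genotypes):
    """Smallest haplotype set in which every genotype has a resolving pair."""
    resolutions = [haplotype_pairs(g) for g in genotypes]
    candidates = sorted({h for pairs in resolutions for pair in pairs for h in pair})
    # Try subsets in order of increasing size; the first hit is minimum.
    for size in range(1, len(candidates) + 1):
        for subset in combinations(candidates, size):
            chosen = set(subset)
            if all(any(a in chosen and b in chosen for a, b in pairs)
                   for pairs in resolutions):
                return subset
    return ()

g1 = [("A", "C"), ("G", "T")]
g2 = [("A", "A"), ("T", "T")]
print(min_haplotype_set([g1, g2]))  # ('AT', 'CG')
```

The search returns h1 = AT and h2 = CG, matching the two-haplotype explanation in the example.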

Our Results
- We formulated this problem as an integer quadratic programming (IQP) problem.
- We proposed an iterative semidefinite programming (SDP) relaxation algorithm to solve the IQP problem.
  - This algorithm finds a solution with an O(log n) approximation ratio.
- We implemented this algorithm in MATLAB and compared it with existing methods.
  - Huang, Y.-T., Chao, K.-M., and Chen, T., 2005, "An Approximation Algorithm for Haplotype Inference by Maximum Parsimony," Journal of Computational Biology, 12: 1261-1274.

Problem Formulation
- Input:
  - A set of n genotypes and m possible haplotypes.
- Output:
  - A minimum set of haplotypes that can explain the given genotypes.
- Example: G1 (SNP 1: A/C, SNP 2: G/T) and G2 (SNP 1: A/A, SNP 2: T/T) are explained by the haplotype set {h1 = AT, h2 = CG}.

Integer Quadratic Programming (IQP)
- Define x_i as an integer variable with values 1 or -1.
  - x_i = 1 if the i-th haplotype is selected.
  - x_i = -1 if the i-th haplotype is not selected.
- Minimizing the number of selected haplotypes amounts to minimizing an integer quadratic function of the x_i.
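
One quadratic function that matches these definitions is sketched here. It is an assumption made for illustration and not necessarily the formula shown on the original slide: each term equals 1 when x_i = 1 and 0 when x_i = -1, so the sum counts the selected haplotypes among the m candidates.

```latex
\min_{x \in \{-1,\,1\}^{m}} \; \sum_{i=1}^{m} \left( \frac{1 + x_i}{2} \right)^{2}
```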

Integer Quadratic Programming (IQP)
- Each genotype must be resolved by at least one pair of haplotypes.
  - For genotype G1, an integer quadratic constraint on the variables of its candidate haplotypes must be satisfied.
- Example: G1 (SNP 1: A/C, SNP 2: G/T) can be resolved by h1 = AT and h2 = CG, or by h3 = AG and h4 = CT. Suppose h1 and h2 are selected; then the constraint for G1 holds.
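
A constraint of this kind can be sketched as follows. The exact form used in the talk is not recoverable from the slide, so the version below is an assumption: each candidate pair contributes the product of the indicators (1 + x_a)/2 and (1 + x_b)/2, and the sum over a genotype's candidate pairs must be at least 1. The names `selected` and `resolves` are hypothetical.

```python
def selected(x_i):
    """Map x_i in {-1, 1} to the indicator (1 + x_i) / 2 in {0, 1}."""
    return (1 + x_i) // 2

def resolves(x, candidate_pairs):
    """Assumed constraint: at least one candidate pair is fully selected."""
    total = sum(selected(x[a]) * selected(x[b]) for a, b in candidate_pairs)
    return total >= 1

# Suppose h1 and h2 are selected, h3 and h4 are not.
x = {"h1": 1, "h2": 1, "h3": -1, "h4": -1}
g1_pairs = [("h1", "h2"), ("h3", "h4")]   # G1: (AT, CG) or (AG, CT)
g2_pairs = [("h1", "h1")]                 # G2: AT paired with itself

print(resolves(x, g1_pairs))  # True
print(resolves(x, g2_pairs))  # True
print(resolves({"h1": 1, "h2": -1, "h3": 1, "h4": -1}, g1_pairs))  # False
```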

Integer Quadratic Programming (IQP)
[Equations: the objective function and the constraint functions of the IQP.]
- Maximum parsimony: find a minimum set of haplotypes to resolve all genotypes.
- We use the SDP-relaxation technique to solve this IQP problem.

The Flow of the Iterative SDP Relaxation Algorithm
- Integer quadratic programming (NP-hard).
- Relax the integer constraint to obtain a vector formulation.
- Reformulate the vector formulation as semidefinite programming (in P).
- Run an existing SDP solver to obtain an SDP solution.
- Apply incomplete Cholesky decomposition to obtain a vector solution.
- Apply randomized rounding to obtain an integral solution.
- All genotypes resolved? Yes: done. No: repeat this algorithm.
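
The last two steps of the flow can be illustrated with a small sketch of random-hyperplane rounding. Several things here are assumptions rather than details from the slides: the rounding rule (x_i = 1 when a vector falls on the same side of a random hyperplane as a reference vector), the hand-picked input vectors standing in for the output of the SDP solver and the Cholesky step, and the retry loop that only repeats the rounding rather than the whole algorithm.

```python
import random

def round_vectors(vectors, reference, rng):
    """Round unit vectors to +1/-1 with one random hyperplane (assumed rule)."""
    r = [rng.gauss(0.0, 1.0) for _ in reference]

    def dot(u):
        return sum(a * b for a, b in zip(u, r))

    reference_side = dot(reference) >= 0
    # x_i = +1 when v_i lies on the same side as the reference vector.
    return {name: (1 if (dot(v) >= 0) == reference_side else -1)
            for name, v in vectors.items()}

def all_resolved(x, genotype_pairs):
    """Check that every genotype has a candidate pair with both x values +1."""
    return all(any(x[a] == 1 and x[b] == 1 for a, b in pairs)
               for pairs in genotype_pairs)

def iterative_rounding(vectors, reference, genotype_pairs, max_rounds=100, seed=0):
    """Repeat the rounding until all genotypes are resolved, or give up."""
    rng = random.Random(seed)
    for _ in range(max_rounds):
        x = round_vectors(vectors, reference, rng)
        if all_resolved(x, genotype_pairs):
            return x
    return None

# Hypothetical vector solution: h1, h2 aligned with the reference, h3, h4 opposite.
vectors = {"h1": (1.0, 0.0), "h2": (1.0, 0.0), "h3": (-1.0, 0.0), "h4": (-1.0, 0.0)}
pairs = [[("h1", "h2"), ("h3", "h4")], [("h1", "h1")]]
print(iterative_rounding(vectors, (1.0, 0.0), pairs))
```

Vectors close to the reference are likely to be rounded to 1, so the geometry of the vector solution steers which haplotypes end up selected.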