Haplotype Inference
Yao-Ting Huang, Kun-Mao Chao
National Taiwan University Department of Computer Science and Information Engineering

Genetic Variations
• Genetic variations in DNA sequences (e.g., insertions, deletions, and mutations) have a major impact on genetic diseases and phenotypic differences.
  • All humans share about 99% of their DNA sequence.
  • A genetic variation in a coding region may change a codon and thereby alter the amino acid sequence.

Single Nucleotide Polymorphism
• A Single Nucleotide Polymorphism (SNP), pronounced "snip," is a genetic variation in which a single nucleotide (i.e., A, T, C, or G) is altered and kept through heredity.
  • SNP: a single DNA base variation found in more than 1% of the population.
  • Mutation: a single DNA base variation found in less than 1% of the population.

  Example (allele frequencies in a population):
    SNP:       94% CTTAGCTT     6% CTTAGTTT
    Mutation:  99.9% CTTAGCTT   0.1% CTTAGTTT
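The 1% threshold above can be written as a one-line rule. This is a minimal sketch: the function name `classify_variant` is hypothetical, and treating a frequency of exactly 1% as a SNP is an assumption, since the slide leaves that boundary open.

```python
def classify_variant(minor_allele_frequency: float) -> str:
    """Label a single-base variation using the 1% threshold from the slide."""
    if not 0.0 <= minor_allele_frequency <= 0.5:
        raise ValueError("minor allele frequency must lie in [0, 0.5]")
    # Assumption: the boundary case of exactly 1% is counted as a SNP.
    return "SNP" if minor_allele_frequency >= 0.01 else "mutation"


print(classify_variant(0.06))   # the 6% CTTAGTTT variant
print(classify_variant(0.001))  # the 0.1% CTTAGTTT variant
```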

Mutations and SNPs
[Figure: a timeline from a common ancestor to the present, labeled "Mutations" and "Observed genetic variations".]

Single Nucleotide Polymorphism
• SNPs are the most frequent form of genetic variation.
  • 90% of human genetic variations come from SNPs.
  • SNPs occur about once every 300 to 600 base pairs.
  • Millions of SNPs have been identified (e.g., by HapMap and Perlegen).
• SNPs have become the preferred markers for association studies because of their high abundance and the availability of high-throughput SNP genotyping technologies.

Single Nucleotide Polymorphism
• A SNP is usually assumed to be a binary variable.
  • The probability of a repeat mutation at the same SNP locus is quite small.
  • Tri-allelic cases are usually considered to be the effect of genotyping errors.
• The nucleotide at a SNP locus is called
  • a major allele (if its allele frequency is > 50%), or
  • a minor allele (if its allele frequency is < 50%).

  Example:
    94% ACTTAGCTT   T: major allele
     6% ACTTAGCTC   C: minor allele
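As an illustration of the binary view, the sketch below picks the major and minor allele from observed alleles at one locus and encodes them as 0 and 1. The helper name `major_minor` and the 0 = major, 1 = minor convention are assumptions made for this example, not something the slides prescribe.

```python
from collections import Counter


def major_minor(alleles):
    """Return (major, minor) for the alleles observed at one biallelic SNP locus."""
    counts = Counter(alleles)
    if len(counts) != 2:
        raise ValueError("expected exactly two distinct alleles")
    # most_common orders by count; a 50/50 tie is broken by first appearance.
    (major, _), (minor, _) = counts.most_common(2)
    return major, minor


# The slide's example: the last base is T in 94% of sequences and C in 6%.
sample = ["T"] * 94 + ["C"] * 6
major, minor = major_minor(sample)
# Assumed encoding: 0 for the major allele, 1 for the minor allele.
encoded = [0 if allele == major else 1 for allele in sample]
print(major, minor, sum(encoded))
```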

Haplotypes
• A haplotype is an ordered list of SNPs on the same chromosome.
  • A haplotype can be treated simply as a binary string, since each SNP is binary.

  Example (SNP 1, SNP 2, and SNP 3 are the positions where the sequences differ):
    -A C T T A G C T T-   Haplotype 1: C A T
    -A A T T T G C T C-   Haplotype 2: A T C
    -A C T T T G C T C-   Haplotype 3: C T C
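The example can be reproduced mechanically: find the columns where the aligned sequences differ, read off the alleles, and encode them as bits. This is a sketch under stated assumptions: the function names are hypothetical, every varying site is assumed biallelic, and using the first sequence as the "0" reference is an arbitrary convention.

```python
def snp_positions(sequences):
    """Indices of the columns where the aligned sequences differ."""
    return [i for i, column in enumerate(zip(*sequences)) if len(set(column)) > 1]


def haplotypes(sequences):
    """Alleles at the varying positions, one string per sequence."""
    positions = snp_positions(sequences)
    return ["".join(seq[i] for i in positions) for seq in sequences]


def to_binary(haps):
    """Encode haplotypes as 0/1 strings; 0 means 'same allele as the first haplotype'."""
    reference = haps[0]
    return ["".join("0" if a == r else "1" for a, r in zip(h, reference)) for h in haps]


seqs = ["ACTTAGCTT", "AATTTGCTC", "ACTTTGCTC"]
print(snp_positions(seqs))          # zero-based columns of SNP 1, SNP 2, SNP 3
print(haplotypes(seqs))
print(to_binary(haplotypes(seqs)))
```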

Genotypes
• The use of haplotype information has been limited because the human genome is diploid.
  • In large sequencing projects, genotypes rather than haplotypes are collected because of cost.

  [Figure: genotype data (the pair of alleles observed at SNP 1 and at SNP 2) contrasted with haplotype data (the alleles at SNP 1 and SNP 2 grouped by chromosome).]

Problems of Genotypes
• Genotypes only tell us the alleles at each SNP locus.
  • We don't know how the alleles at different SNP loci are connected.
  • There can be several possible haplotype pairs for the same genotype.

  Example:
    Genotype data:  SNP 1 = A/C, SNP 2 = G/T
    Haplotype pair: (A T, C G) or (A G, C T)
  We don't know which haplotype pair is real.
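The ambiguity can be made concrete by enumerating every haplotype pair compatible with a genotype. A minimal sketch, assuming a genotype is represented as one unordered allele pair per SNP; the function name `haplotype_pairs` is hypothetical.

```python
from itertools import product


def haplotype_pairs(genotype):
    """All unordered haplotype pairs compatible with an unphased genotype.

    `genotype` is a list of 2-tuples, one unordered allele pair per SNP.
    """
    pairs = set()
    # At each SNP, choose which of the two alleles sits on the first chromosome.
    for choice in product((0, 1), repeat=len(genotype)):
        h1 = "".join(site[c] for site, c in zip(genotype, choice))
        h2 = "".join(site[1 - c] for site, c in zip(genotype, choice))
        pairs.add(tuple(sorted((h1, h2))))
    return sorted(pairs)


print(haplotype_pairs([("A", "C"), ("G", "T")]))
print(haplotype_pairs([("A", "A"), ("T", "T")]))
```

A genotype with k heterozygous sites has 2^(k-1) candidate pairs when k is at least 1, so the number of candidates doubles with every additional heterozygous SNP.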

Research Directions of SNPs and Haplotypes in Recent Years
• SNP Database
• Haplotype Inference
  • Maximum Parsimony
  • Perfect Phylogeny
  • Statistical Methods
• Tag SNP Selection
  • Haplotype block
  • LD bin
  • Prediction Accuracy
• …

Haplotype Inference
• The problem of inferring the haplotypes from a set of genotypes is called haplotype inference.
  • This problem is known to be not only NP-hard but also APX-hard.
• Most combinatorial methods use the maximum parsimony model to solve this problem.
  • This model assumes that the number of distinct haplotypes in a natural population is small.
  • The solution is a minimum set of haplotypes that can explain the given genotypes.

Maximum Parsimony
  Genotypes:
    G1: SNP 1 = A/C, SNP 2 = G/T
    G2: SNP 1 = A/A, SNP 2 = T/T
  Possible resolutions:
    G1: h1 = AT and h2 = CG, or h3 = AG and h4 = CT
    G2: h1 = AT and h1 = AT
• Find a minimum set of haplotypes to explain the given genotypes.
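For an instance this small, the minimum set can be found by exhaustive search. The sketch below is not the authors' algorithm; it is a brute-force illustration of the maximum parsimony objective with hypothetical function names, and its running time grows exponentially with the number of candidate haplotypes.

```python
from itertools import combinations, product


def resolutions(genotype):
    """Unordered haplotype pairs that explain one genotype (a list of allele pairs)."""
    found = set()
    for choice in product((0, 1), repeat=len(genotype)):
        h1 = "".join(site[c] for site, c in zip(genotype, choice))
        h2 = "".join(site[1 - c] for site, c in zip(genotype, choice))
        found.add(frozenset((h1, h2)))
    return found


def max_parsimony(genotypes):
    """Smallest haplotype set that resolves every genotype, by exhaustive search."""
    options = [resolutions(g) for g in genotypes]
    candidates = sorted(set().union(*(pair for opts in options for pair in opts)))
    for size in range(1, len(candidates) + 1):
        for subset in combinations(candidates, size):
            chosen = set(subset)
            # Every genotype needs at least one pair fully contained in the chosen set.
            if all(any(pair <= chosen for pair in opts) for opts in options):
                return sorted(chosen)
    return []


g1 = [("A", "C"), ("G", "T")]
g2 = [("A", "A"), ("T", "T")]
print(max_parsimony([g1, g2]))
```

On the two genotypes above, the search settles on AT and CG: AT is forced by G2, and pairing it with CG also resolves G1, whereas the alternative resolution of G1 (AG and CT) would need three haplotypes in total.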

Our Results
• We formulated this problem as an integer quadratic programming (IQP) problem.
• We proposed an iterative semidefinite programming (SDP) relaxation algorithm to solve the IQP problem.
  • This algorithm finds an O(log n)-approximate solution.
• We implemented this algorithm in MATLAB and compared it with existing methods.
  • Huang, Y.-T., Chao, K.-M., and Chen, T., 2005, "An Approximation Algorithm for Haplotype Inference by Maximum Parsimony," Journal of Computational Biology, 12: 1261-1274.

Problem Formulation
• Input:
  • A set of n genotypes and m possible haplotypes.
• Output:
  • A minimum set of haplotypes that can explain the given genotypes.

  Example:
    G1: SNP 1 = A/C, SNP 2 = G/T   resolved by h1 = AT and h2 = CG
    G2: SNP 1 = A/A, SNP 2 = T/T   resolved by h1 = AT and h1 = AT

Integer Quadratic Programming (IQP)
• Define x_i as an integer variable with value 1 or -1.
  • x_i = 1 if the i-th haplotype is selected.
  • x_i = -1 if the i-th haplotype is not selected.
• Minimizing the number of selected haplotypes amounts to minimizing the following integer quadratic function:
  [Equation not captured in this transcript.]
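The slide's formula did not survive in this transcript. Purely as an illustration that follows from the definition of x_i, and not necessarily the form the authors use (the published formulation may, for instance, arrange for every term to be quadratic), the number of selected haplotypes among m candidates can be counted as:

```latex
% Assumption: reconstructed from the definition of x_i, not copied from the slide.
% (1 + x_i)/2 equals 1 when x_i = 1 (selected) and 0 when x_i = -1 (not selected).
\min \sum_{i=1}^{m} \frac{1 + x_i}{2},
\qquad x_i \in \{1, -1\}, \quad i = 1, \dots, m
```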

Integer Quadratic Programming (IQP)
• Each genotype must be resolved by at least one pair of haplotypes.
  • For genotype G1, the following integer quadratic constraint must be satisfied.
    [Equation not captured in this transcript; the slide annotates it with the case "Suppose h1 and h2 are selected".]

  G1: SNP 1 = A/C, SNP 2 = G/T
  Resolved by h1 = AT and h2 = CG, or by h3 = AG and h4 = CT
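The constraint itself is missing from the transcript. One form consistent with the x_i definition and with the two resolutions of G1 listed on this slide is sketched below; it is an assumption, not a quotation of the authors' formula.

```latex
% Assumption: reconstructed from the definitions, not copied from the slide.
% Each product is 1 only when both haplotypes of that pair are selected, else 0.
\frac{1 + x_1}{2} \cdot \frac{1 + x_2}{2}
  + \frac{1 + x_3}{2} \cdot \frac{1 + x_4}{2} \;\ge\; 1
```

If h1 and h2 are selected (x_1 = x_2 = 1), the first product is 1 · 1 = 1 and the constraint holds regardless of x_3 and x_4.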

Integer Quadratic Programming (IQP)
[Equations not captured in this transcript: the objective function and the constraint functions.]
• Maximum parsimony: find a minimum set of haplotypes to resolve all genotypes.
• We use the SDP-relaxation technique to solve this IQP problem.

The Flow of the Iterative SDP Relaxation Algorithm
[Flowchart, reconstructed from its labels]
• Integer Quadratic Programming (NP-hard)
• Relax the integer constraint: Vector Formulation
• Reformulation: Semidefinite Programming (P)
• Existing SDP solver: SDP Solution
• Incomplete Cholesky decomposition: Vector Solution
• Randomized rounding: Integral Solution
• All genotypes resolved? Yes, done. No, repeat this algorithm.
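The flowchart names randomized rounding as the step that turns a vector solution into an integral one, but does not say which rounding rule is used. The sketch below uses random-hyperplane rounding, a common choice for SDP relaxations, purely as an assumed stand-in; the function name `hyperplane_round` and the toy vectors are hypothetical.

```python
import random


def hyperplane_round(vectors, rng=None):
    """Round vectors to +1/-1 according to the side of a random hyperplane they fall on."""
    rng = rng or random.Random()
    dim = len(vectors[0])
    # A standard Gaussian vector serves as a uniformly random hyperplane normal.
    normal = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    return [1 if sum(a * b for a, b in zip(v, normal)) >= 0 else -1 for v in vectors]


# Toy vector solution: the first two vectors coincide, the third points the opposite way.
vectors = [(1.0, 0.0), (1.0, 0.0), (-1.0, 0.0), (0.0, 1.0)]
x = hyperplane_round(vectors, random.Random(0))
print(x)
```

Vectors that point in similar directions tend to receive the same value, which is how the geometry of the vector solution carries over to the integral one. In the iterative scheme, a rounded solution that leaves some genotype unresolved would trigger another pass.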