DNA Forensics
• DNA Forensics deals with the use of recombinant DNA technology on one or more biological specimens for forensic investigation
• Common uses of DNA Forensics include: Human Identification, Kinship Analysis for Missing Person Identification, Parentage Testing, etc.
• Probability and Statistics play important roles in assessing the strength of DNA evidence in all such applications
• Events in DNA forensics are generally low-probability events, and statistical assessment of DNA forensic data requires estimation based on sparse multi-dimensional data

Brief Introduction of the DNA Forensics Session of the Symposium
• Four talks will address some of the major Statistical/Probabilistic issues of DNA Forensics
• The current paradigm of the topic will be the focus of the first talk (R. Chakraborty)
• B. Budowle will address challenges to this paradigm when DNA quantity is low, and for identification of the source of microbial agents in forensic samples
• T. Wang will introduce the need for pedigree-based probabilistic calculations for missing person identification
• A. Eisenberg will discuss possible statistical formulations applicable to newer technologies that are being (or are about to be) implemented in the field
• All four speakers are major players in DNA Forensics in the country, have contributed significantly to the development of DNA Forensics, and together have over 75 years of experience working in the subject

Statistical and Probabilistic Issues in DNA Forensics: Current Paradigms
Ranajit Chakraborty, Ph.D.
Robert A. Kehoe Professor and Director
Center for Genome Information
Department of Environmental Health
University of Cincinnati College of Medicine
Cincinnati, OH 45267, USA
Tel. (513) 558-4925/3757; Fax (513) 558-4505
(Presentation at the University of Cincinnati Symposium on Probability Theory and Applications on March 21, 2009)

Overview of the Talk
• Brief History of DNA Forensics
• Currently used DNA Markers in Forensics
• Three Generic Forensic Scenarios
• Examples of DNA Evidence Data
• Frequency, Likelihood, and Bayesian Logic of DNA Statistics
• Population Substructure and Its Effect on DNA Statistics
• Lineage Markers (mtDNA and Y-STR haplotypes)
• Match and Partial Match in Databases

Brief History of DNA Forensics
• 1980 – Ray White described the first hypervariable RFLP marker
• 1985 – Alec Jeffreys discovered multilocus VNTR probes (the term “DNA Fingerprinting” coined)
• 1985 – First paper on PCR published
• 1988 – In the US, FBI started DNA forensic casework
• 1991 – First STR paper published
• 1992 – NRC-I Report issued
• 1994 – CODIS STR loci characterized
• 1995 – FSS started UK DNA Database
• 1996 – NRC-II Report issued; mtDNA introduced in forensics
• 1997 – 13 CODIS STR loci validated for forensic use; Y-STRs described for forensic investigation purposes
• 1998 – FBI launched CODIS Database
• 2000 – RFLP technology replaced by multiplex STR technology
• 2002 – FBI mtDNA population database published; Y-STR 20-plex published
• 2002 – SNPs proposed as supplementary markers
• 2004 – Large sizes of “offenders’ databases” opened issues of coincidental full/partial matches
• 2007 – Familial search through partial match occurrences in databases

Advantages of Use of STR Loci in DNA Forensics
• PCR based
• Low quantity DNA
• Degraded DNA
• Amenable to automation
• Non-isotopic
• Rapid typing
• Discrete alleles
• Abundant in genome
• Highly informative (satisfied by the CODIS STRs)
Loci: CSF1PO, D7S820, TPOX, D8S1179, TH01, D13S317, FGA, D16S539, vWA, D18S51, D3S1358, D21S11, D5S818, Penta D, Penta E

15 CODIS STR Loci with Chromosomal Positions
[Figure: chromosomal positions of TPOX, D3S1358, D8S1179, D5S818, FGA, TH01, vWA, D7S820, CSF1PO, Penta E, D13S317, D16S539, D18S51, D21S11, Penta D, and AMEL (marked twice)]

Three Types of DNA Forensic Issues
Ø Transfer Evidence: the DNA profile of the evidence sample indicates that it is of single-source origin
Ø Mixture of DNA: the evidence sample’s DNA profile suggests that it is a mixture of DNA from multiple (more than one) individuals
Ø Kinship Determination: the evidence sample’s DNA, compared with that of one or more reference profiles, is to be used to determine the validity of a stated biological relatedness among individuals

Transfer Evidence – An Example

DNA Mixture Analysis (amelogenin, D8S1179, D21S11, D18S51)

Inclusion

mtDNA Lineage Marker

Y-Chromosomal Genes Lahn, Pearson & Jegalian 2001

Y STR Loci

Three Types of Conclusions
Ø Exclusion
Ø Match, or Inclusion
Ø Inconclusive

Statistical Assessment of DNA Evidence
Ø Needed most frequently in inclusionary events
Ø (Apparent) exclusionary cases may sometimes also be subjected to statistical assessment, particularly for kinship determination, because of genetic events such as mutation, recombination, etc.
Ø Loci providing inconclusive results are often excluded from statistical considerations
Ø Even if one or more loci show inconclusive results, inclusionary observations of the other

Approaches for Statistical Assessment of DNA Evidence
Frequentist Approach: indicates the coincidental chance of the event observed
Likelihood Approach: indicates the relative support of the event observed under two contrasting (mutually exclusive) stipulations regarding the source of the evidence sample
Bayesian Approach: provides a posterior probability regarding the source, when the data in hand is considered together with a prior probability reflecting knowledge of the source (the latter is not generally provided by the DNA profiles being considered for statistical assessment)

Frequentist Approach of Statistical Assessment for Transfer Evidence
When the evidence sample DNA profile matches that of the reference sample, one or more of the following questions are answered:
Ø How often would a random person provide such a DNA match? Equivalently, what is the expected frequency of the profile observed in the evidence sample? This is also called the Random Match Probability, the complement of which is the Exclusion Probability
Ø What is the expected frequency of the profile seen in the evidence sample, given that it is observed in another person (namely in the reference sample)? This is also called the Conditional Match Probability
Ø What would be the expected frequency of the profile seen in the evidence sample in a relative (of specified kinship) of

Frequentist Approach of Statistical Assessment for DNA Mixture
When the evidence mixture DNA profile fails to exclude a reference sample as a part contributor, or more commonly when a set of reference samples together explains the alleles seen in the mixture, one or more of the following questions are answered:
Ø How often would a random person be excluded as a part contributor of the mixture sample? This is also called the Exclusion Probability, the complement of which is the Inclusion Probability, giving the expected chance of Coincidental Inclusion (Note: this answer is based on the data on the evidence sample alone, without any consideration of the profiles of the reference samples)
Ø With a stipulation on the number of contributors, how often would a random person’s DNA, mixed with that of one or more of the reference persons, provide a mixture profile as seen in the evidence sample, given that the reference persons are also part contributors of the DNA mixture?
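
The exclusion probability in the first question can be sketched numerically. The following is a minimal illustration, assuming Hardy-Weinberg proportions and independence across loci (assumptions not stated on the slide); the function names and allele frequencies are hypothetical. At each locus, a random person is included only if both of his or her alleles are among those seen in the mixture.

```python
from math import prod

def inclusion_probability(mixture_allele_freqs):
    """Chance that both alleles of a random person fall among the alleles
    seen in the mixture at one locus (Hardy-Weinberg proportions assumed)."""
    s = sum(mixture_allele_freqs)
    return s * s

def combined_exclusion_probability(loci):
    """1 minus the product of per-locus inclusion probabilities
    (independence across loci assumed)."""
    return 1.0 - prod(inclusion_probability(freqs) for freqs in loci)

# Hypothetical frequencies of the alleles observed in the mixture at two loci
loci = [[0.10, 0.20, 0.15], [0.25, 0.05]]
pe = combined_exclusion_probability(loci)
print(f"Combined exclusion probability: {pe:.6f}")
```

Only the alleles observed in the evidence sample enter the calculation, which matches the note that the reference profiles play no part in this answer.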

Kinship Assessment – Frequentist Approach
When comparisons of evidence and reference samples fail to exclude a stated relationship of the evidence sample with the reference individual(s), the frequency-based question is of the form:
Ø What is the chance of excluding the stated relationship? Called the Exclusion Probability (PE), this is generally answered conditional on the profiles of the reference samples and the stated relationship
Note: Average exclusion probability can also be computed disregarding the profiles examined, which rationalizes the choice of loci to be typed for validating the stated relationship

Concept of Likelihood
A Likelihood represents the support for a given hypothesis (or value of a parameter) provided by the observations in the data, written as
Likelihood = Prob(Data | Hypothesis).
Technically, likelihood is mathematically identical to the probability of the data given the hypothesis, but it is interpreted as a function of the hypothesis (or of the parameter values specified by the hypothesis) for the observations in the data.

Likelihood Ratio
With two (mutually exclusive) hypotheses, say H1 and H2, the likelihood ratio (LR) is the ratio of probabilities of observing the same data under H1 and H2, giving
LR = Prob(Data | H1) / Prob(Data | H2).
Meaning of LR:
LR < 1: Data less well supported by H1, compared with H2
LR = 1: Data equally well supported by H1 and H2
LR > 1: Data better supported by H1, compared with H2

LR in Transfer Evidence
Background Data: DNA profile of evidence sample (E) matches that of the suspect (S); i.e., E = S
Contrasting Scenarios of Source (Hypotheses):
Hp: DNA in the evidence sample came from the suspect
Hd: DNA in the evidence came from someone other than the suspect, but it coincidentally matches the DNA profile of the suspect

LR in Transfer Evidence
Computation
LR = Pr(Data | Hp) / Pr(Data | Hd)
   = Pr(E = S | Hp) / Pr(E = S | Hd)
   = 1 / Pr(coincidental match)
Thus, LR in this case is simply the inverse (reciprocal) of the relative frequency of the DNA profile of the evidence sample in the population, given that it is the same as that of the suspect
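
As a rough numerical illustration of LR = 1 / Pr(coincidental match), the sketch below computes a profile frequency with the product rule and inverts it. It assumes Hardy-Weinberg proportions within loci, independence across loci, and no substructure adjustment; the locus names, allele frequencies, and function names are hypothetical.

```python
from math import prod

def genotype_frequency(genotype, freqs):
    """Hardy-Weinberg genotype frequency: p^2 for a homozygote, 2pq for a heterozygote."""
    a, b = genotype
    return freqs[a] ** 2 if a == b else 2 * freqs[a] * freqs[b]

def random_match_probability(profile, allele_freqs):
    """Product of per-locus genotype frequencies (independence across loci assumed)."""
    return prod(genotype_frequency(g, allele_freqs[locus]) for locus, g in profile.items())

def likelihood_ratio(profile, allele_freqs):
    """LR for Hp (suspect is the source) versus Hd (an unrelated person is the source)."""
    return 1.0 / random_match_probability(profile, allele_freqs)

# Hypothetical two-locus example
allele_freqs = {"LocusA": {"12": 0.35, "13": 0.20}, "LocusB": {"14": 0.10, "16": 0.20}}
profile = {"LocusA": ("12", "12"), "LocusB": ("14", "16")}
rmp = random_match_probability(profile, allele_freqs)
print(f"RMP = {rmp:.4g}, LR = {likelihood_ratio(profile, allele_freqs):.1f}")
```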

LR in Transfer Evidence
Variation
Since LR can be defined for any two mutually exclusive hypotheses, one may also consider the alternative hypothesis as:
Hr: A relative of the suspect is the source of the evidence DNA
In this case, the likelihood ratio, LR(r), will be
LR(r) = Pr(E = S | Hp) / Pr(E = S | Hr) = 1 / Pr(DNA match in the relative),
which equals the reciprocal of the probability of finding the DNA profile of the evidence sample in the relative of the suspect, given that the suspect has the same DNA profile

LR in DNA Mixture
Background Data: The DNA evidence profile, E (a DNA mixture), has alleles which are all explained by alleles present in the suspect’s DNA profile (S) and in a victim’s DNA profile (V)
Contrasting Hypotheses:
Hp: DNA in the evidence sample is a mixture of DNA of the suspect and that of the victim (i.e., Hp: E = V + S)
Hd1: Evidence DNA is a mixture of DNA from the victim and that of an unknown person (i.e., Hd1: E = V + UN)
Hd2: Evidence DNA is a mixture of DNA from two unknown persons (i.e., Hd2: E = UN + UN)

LR in DNA Mixture
Computation
Pr(Data | Hp: E = V + S) = 1, since the alleles in the mixture are all explained by alleles present in V and S, and no extra alleles are present in V and/or S. Hence, under Hp: E = V + S, the data observed is the only possible outcome, but
Pr(Data | Hd1: E = V + UN) = relative frequency of a random person whose DNA, mixed with the DNA of the victim, would yield a mixture that matches the evidence sample,
Pr(Data | Hd2: E = UN + UN) = relative frequency of a pair of random persons whose DNA mixture would match the profile seen in the evidence sample

LR in DNA Mixture
Interpretation
LR for Hp vs. Hd1 = 1 / Pr(Data | Hd1: E = V + UN), which becomes the reciprocal of the relative frequency of a random person whose DNA, mixed with the DNA of the victim, would yield a mixture that matches the evidence sample
Likewise, LR for Hp vs. Hd2 = 1 / Pr(Data | Hd2: E = UN + UN), which is the inverse of the relative frequency of a pair of random persons whose DNA mixture would match the profile seen in the evidence sample
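
The denominator for Hp vs. Hd1 can be sketched for a single locus by enumerating the genotypes an unknown contributor could have. This is only an illustration under simplifying assumptions that are not on the slide: exactly two contributors, Hardy-Weinberg proportions, alleles treated as present or absent (no peak heights, drop-out, or drop-in), and hypothetical allele frequencies and function names.

```python
from itertools import combinations_with_replacement

def genotype_frequency(genotype, freqs):
    """Hardy-Weinberg genotype frequency."""
    a, b = genotype
    return freqs[a] ** 2 if a == b else 2 * freqs[a] * freqs[b]

def pr_mixture_given_victim_and_unknown(mixture, victim, freqs):
    """Pr(Data | E = V + UN) at one locus: total frequency of genotypes that,
    combined with the victim's alleles, give exactly the mixture's allele set."""
    mixture, victim_alleles = set(mixture), set(victim)
    if not victim_alleles <= mixture:
        return 0.0  # the victim carries an allele not seen in the mixture
    total = 0.0
    for unknown in combinations_with_replacement(sorted(freqs), 2):
        if victim_alleles | set(unknown) == mixture:
            total += genotype_frequency(unknown, freqs)
    return total

# Hypothetical locus: mixture shows alleles a, b, c; the victim is (a, b)
freqs = {"a": 0.1, "b": 0.2, "c": 0.3, "d": 0.4}
denominator = pr_mixture_given_victim_and_unknown({"a", "b", "c"}, ("a", "b"), freqs)
lr = 1.0 / denominator  # numerator Pr(Data | E = V + S) is 1 when V and S explain the mixture
print(f"Pr(Data | V + UN) = {denominator:.2f}, LR = {lr:.2f}")
```

Here the unknown contributor must carry allele c and nothing outside {a, b, c}, so the qualifying genotypes are (a, c), (b, c), and (c, c).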

Other Considerations of Computing LR in DNA Mixture
Computations of the numerator and denominator of LR in mixture interpretation depend on:
Ø Precise knowledge of the number of contributors in the DNA mixture
Ø Assumptions regarding the biological relatedness of the unknown contributors (between themselves, or with the reference individuals)
Ø Population origin of the contributors

Likelihood Ratio in Kinship Assessment
The logic is similar to that of the preceding cases; the principles of LR formulation in kinship analysis can be simply illustrated with:
Ø Standard paternity analysis (with DNA of mother, child, and alleged father typed for several loci), and
Ø Kinship assessment for a pair of individuals (with genotype data from one or more loci)

Interpretation of LR in Paternity Testing
Ø LR in paternity testing, also called PI (paternity index), is the ratio of two conditional probabilities
Ø It contrasts the chance of observing the specific trio of genotypes (GC, GM, and GAF) given that AF = BF, as opposed to AF ≠ BF
Ø PI (or LR) can be computed even when M and AF, or AF and BF, are biologically related
Ø PI can be computed for apparent exclusion events as well, invoking mutation and/or recombination (generally leading to drastically reduced PI or LR for the loci

LR in Standard Paternity Testing
Data: The mother’s DNA profile (GM) and that of the child (GC) suggest that all obligatory alleles (i.e., the alleles that the child must have received from its biological father, BF) are present in the DNA profile of AF (GAF)
Hypotheses contrasted:
Ø Hp: Alleged father (AF) is the biological father (BF) of the child (M is assumed to be the true mother); i.e., Hp: AF = BF
Ø Hd: Alleged father is not the biological father, but he is not excluded from paternity (i.e., Hd: AF ≠ BF)
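
A single-locus paternity index for a mother-child-alleged father trio can be sketched by enumerating allele transmissions. The sketch assumes Mendelian transmission without mutation, an unrelated random man under Hd, and allele frequencies supplied by the caller; the function name and the example values are hypothetical.

```python
def paternity_index(child, mother, alleged_father, freqs):
    """PI = Pr(GC | GM, GAF, AF = BF) / Pr(GC | GM, AF != BF) at one locus,
    assuming no mutation and a random unrelated man under Hd."""
    target = sorted(child)
    numerator = 0.0
    denominator = 0.0
    for m in mother:                      # each maternal allele is transmitted with prob. 1/2
        for f in alleged_father:          # each AF allele is transmitted with prob. 1/2
            if sorted((m, f)) == target:
                numerator += 0.25
        for allele, p in freqs.items():   # a random man transmits `allele` with prob. p
            if sorted((m, allele)) == target:
                denominator += 0.5 * p
    if denominator == 0.0:
        raise ValueError("child's genotype is incompatible with the mother's")
    return numerator / denominator

# Hypothetical example: mother A1A1, child A1A2, alleged father A2A3
freqs = {"A1": 0.5, "A2": 0.1, "A3": 0.4}
pi = paternity_index(("A1", "A2"), ("A1", "A1"), ("A2", "A3"), freqs)
print(f"PI = {pi:.2f}")
```

In this example the obligatory paternal allele is A2, which the alleged father transmits with probability 1/2 and a random man with probability 0.1, so the index is 5. A numerator of zero corresponds to an apparent exclusion under the no-mutation assumption.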

SAMPLING THEORY OF ALLELE FREQUENCIES
Under mutation-drift balance, the probability of a sample in which $n_i$ copies of the allele $A_i$ are observed, for any set of $n_i$ (with $n = \sum_i n_i$), is given by
$$\Pr(\{n_i\}) = \frac{\Gamma(\gamma)}{\Gamma(\gamma + n)} \prod_i \frac{\Gamma(\gamma_i + n_i)}{\Gamma(\gamma_i)}$$
where $\gamma_i = (1 - \theta) p_i / \theta$, $\gamma = \sum_i \gamma_i = (1 - \theta)/\theta$, $p_i$ = freq. of allele $A_i$ in the population, and $\Gamma(\cdot)$ is the Gamma function, in which $\theta$ is the coefficient of coancestry (equivalent to Fst or Gst, the coefficient of gene differentiation between subpopulations within the population)

Match Probability - Formulae under HWE with substructure adjustment
Homozygote ($A_i A_i$)
  unconditional: $p_i^2 + \theta p_i (1 - p_i)$
  conditional: $\dfrac{[p_i(1-\theta) + 2\theta]\,[p_i(1-\theta) + 3\theta]}{(1+\theta)(1+2\theta)}$
Heterozygote ($A_i A_j$)
  unconditional: $2 p_i p_j (1 - \theta)$
  conditional: $\dfrac{2\,[p_i(1-\theta) + \theta]\,[p_j(1-\theta) + \theta]}{(1+\theta)(1+2\theta)}$

CONDITIONAL MATCH PROBABILITY
Where pi, pj are frequencies of alleles Ai and Aj, and θ = coefficient of co-ancestry (Fst/Gst), representing the extent of the population substructure effect (Balding and Nichols, 1994)
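
The unconditional and conditional match probability formulae can be evaluated with a short function. This is a minimal sketch; the function name is hypothetical, and the homozygote allele frequency used in the example is an assumption, chosen so that its square equals 0.1252, the HWE value listed for D5S818 (12, 12) in the examples slide.

```python
def match_probabilities(pi, pj=None, theta=0.01):
    """Return (unconditional, conditional) match probabilities with substructure
    adjustment theta. Pass only pi for a homozygote AiAi, or pi and pj for a
    heterozygote AiAj."""
    scale = (1 + theta) * (1 + 2 * theta)
    if pj is None:  # homozygote
        unconditional = pi ** 2 + theta * pi * (1 - pi)
        conditional = (pi * (1 - theta) + 2 * theta) * (pi * (1 - theta) + 3 * theta) / scale
    else:           # heterozygote
        unconditional = 2 * pi * pj * (1 - theta)
        conditional = 2 * (pi * (1 - theta) + theta) * (pj * (1 - theta) + theta) / scale
    return unconditional, conditional

# Assumed allele frequency with p^2 = 0.1252
p = 0.1252 ** 0.5
unconditional, conditional = match_probabilities(p, theta=0.01)
print(f"unconditional = {unconditional:.4f}, conditional = {conditional:.4f}")
```

With θ = 0 both homozygote expressions reduce to the plain Hardy-Weinberg value p².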

Match Probability - examples under HWE with substructure adjustment (θ = 0.01)

Locus (genotype)          HWE            Unconditional   Conditional
D3S1358 (14, 18)          0.0457                         0.0495
vWA (14, 16)              0.0411                         0.0451
FGA (23, 25)              0.0218                         0.0253
D8S1179 (12, 14)          0.0586                         0.0626
D21S11 (29, 30)           0.0840                         0.0881
D18S51 (13, 17)           0.0381                         0.0418
D5S818 (12, 12)           0.1252         0.1275          0.1367
D13S317 (9, 11)           0.0488                         0.0542
D7S820 (10, 10)           0.0844         0.0865          0.0949
Cumulative                3.96 × 10^-12  4.13 × 10^-12   9.15 × 10^-12
Upper bound of 95% C.I.   1.02 × 10^-11  1.05 × 10^-11   2.17 × 10^-11

Paternity Testing – Frequentist Approach
Example
In a standard paternity testing case, with the mother’s genotype being A1A1 and the child’s A1A2, an alleged father whose genotype does not contain the A2 allele would be excluded, giving
$$PE = \Pr(\text{AF} = \bar{A}_2 \bar{A}_2) = (1 - p_2)^2$$
where $\bar{A}_2$ is any allele other than the allele A2 and $p_2$ is the population frequency of A2. This computation assumes that no mutation occurred during the transmission of alleles across generations.
Note: Average exclusion probability can also be computed disregarding the profiles examined, which rationalizes the choice of loci to be typed for validating the stated relationship

LR for Kinship of a Pair of Individuals
Data: The DNA profile (GX) of one individual X, compared with that (GY) of another individual Y, is considered to assess the accuracy of a stated biological relationship between X and Y
Hypotheses contrasted:
Ø Hp: X and Y are biologically related (i.e., the stated relationship is correct)
Ø Hd: X and Y are not biologically related
Note: Two stated relationships may also be compared with each other

IBD Probabilities – ITO Method
Two individuals of genotypes GX and GY can share:
Ø Both alleles IBD (called scenario I),
Ø Only one allele from each IBD (scenario T),
Ø None of their alleles IBD (scenario O).
Their probabilities are denoted by Φ2, Φ1, and Φ0, respectively, and for any biological relatedness

Kinship Analysis of a Pair of Individuals: IBD Coefficients in Relatives

Relationship Type     Symbol   Φ0    Φ1    Φ2
Monozygotic twins     MZ       0     0     1
Parent-Offspring      PO       0     1     0
Full Sib              S        1/4   1/2   1/4
First Cousin          1C       3/4   1/4   0
Unrelated             U        1     0     0

Conditional Probability of Gy given Gx for specific kinship of x and y
• Stipulated kinship between x and y specifies the IBD probabilities Φ0, Φ1, Φ2 for x and y
• For observed Gx and Gy:
  Pr(Gy | Gx for the specified relationship) = Φ0 × Pr(Gy | Gx under O) + Φ1 × Pr(Gy | Gx under T) + Φ2 × Pr(Gy | Gx under I)
Rule: The conditional probability of Gy given Gx for a stated kinship is the weighted average of the conditional probabilities of the same event under the IBD scenarios, with weights given by the kinship
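
The weighted-average rule can be sketched for a single locus as follows. The IBD coefficients are those of the relatives table; the three scenario probabilities are computed assuming Hardy-Weinberg proportions and no mutation, and the function names and allele frequencies are hypothetical.

```python
# (Phi0, Phi1, Phi2) for each relationship symbol
IBD = {"MZ": (0, 0, 1), "PO": (0, 1, 0), "S": (0.25, 0.5, 0.25),
       "1C": (0.75, 0.25, 0), "U": (1, 0, 0)}

def pr_under_o(gy, freqs):
    """Scenario O (no alleles IBD): Gy is a random Hardy-Weinberg genotype."""
    c, d = gy
    return freqs[c] ** 2 if c == d else 2 * freqs[c] * freqs[d]

def pr_under_t(gx, gy, freqs):
    """Scenario T (one allele IBD): Y carries one allele copied from X
    (each of X's alleles with prob. 1/2) plus one random allele."""
    c, d = gy
    total = 0.0
    for a in gx:
        if c == d:
            total += 0.5 * (freqs[c] if a == c else 0.0)
        else:
            total += 0.5 * ((freqs[d] if a == c else 0.0) + (freqs[c] if a == d else 0.0))
    return total

def pr_under_i(gx, gy):
    """Scenario I (both alleles IBD): Y must have X's genotype."""
    return 1.0 if sorted(gx) == sorted(gy) else 0.0

def pr_gy_given_gx(gx, gy, relationship, freqs):
    """Weighted average over the O, T, and I scenarios."""
    phi0, phi1, phi2 = IBD[relationship]
    return (phi0 * pr_under_o(gy, freqs) + phi1 * pr_under_t(gx, gy, freqs)
            + phi2 * pr_under_i(gx, gy))

def kinship_lr(gx, gy, relationship, freqs):
    """LR of the stated relationship against unrelatedness."""
    return pr_gy_given_gx(gx, gy, relationship, freqs) / pr_under_o(gy, freqs)

freqs = {"A": 0.1, "B": 0.2, "C": 0.7}   # hypothetical allele frequencies
lr_sibs = kinship_lr(("A", "B"), ("A", "B"), "S", freqs)
print(f"LR (full sibs vs. unrelated) = {lr_sibs:.3f}")
```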

GENOTYPE PROBABILITIES FOR A PAIR OF INDIVIDUALS CONDITIONED BY IBD PROBABILITIES OF ALLELES

Bayes Formula (Odds form)
$$\frac{P(H_1 \mid E)}{P(H_2 \mid E)} = \frac{P(E \mid H_1)}{P(E \mid H_2)} \times \frac{P(H_1)}{P(H_2)}$$
posterior odds = likelihood ratio × prior odds
E = DNA evidence
H1 = alleged father is biological father
H2 = alleged father is not biological father
Note: While the first factor of the RHS is computed from DNA evidence, the second factor, P(H1)/P(H2), is not necessarily a DNA-
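
The odds form translates directly into a few lines of code. The sketch below converts a likelihood ratio and a prior probability for H1 into a posterior probability; the function name and the numerical values are hypothetical, and the prior is an input that must come from outside the DNA data.

```python
def posterior_probability(likelihood_ratio, prior_probability):
    """Posterior P(H1 | E) from LR = P(E | H1) / P(E | H2) and prior P(H1),
    with H1 and H2 taken as the only two possibilities."""
    prior_odds = prior_probability / (1.0 - prior_probability)
    posterior_odds = likelihood_ratio * prior_odds
    return posterior_odds / (1.0 + posterior_odds)

# The same LR gives very different posteriors under different priors
for prior in (0.5, 0.01):
    print(f"prior = {prior}: posterior = {posterior_probability(99, prior):.4f}")
```

The loop illustrates why the LR alone should not be read as a posterior probability.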

Synthesis of Three Approaches of Statistical Assessment
Ø The Frequency Approach provides the probability of the observed DNA evidence (unconditional as well as conditional) under a given stipulated hypothesis
Ø The Likelihood Ratio (LR) contrasts such probabilities for two mutually exclusive hypotheses
Ø In the Bayesian approach, with the use of a prior probability, LR is transformed to obtain the relative odds of one hypothesis against another given the DNA data of the evidence

Synthesis of Three Approaches (Contd.)
Ø The three approaches are built on one another, and hence it is inaccurate to say that one is wrong and the others are correct
Ø LR, without the transformation using the prior probability, may be incorrectly interpreted as the answer of the Bayesian computation, but the numerator and denominator of LR can be stated with a frequentist interpretation to avoid the error of reverse conditioning
Ø The prior probability of the Bayesian approach generally comes from non-DNA

Important Fact with An Example
LR, by itself, is not a Bayesian Approach, and the prosecutor’s fallacy can be avoided by explaining the two conditional probabilities separately
Example: Consider a mixture case where the victim’s profile (V) together with the defendant’s profile (S) explains the alleles in the mixture profile (E). Under Hp: E = V + S, the conditional probability of E given Hp is 1.0, but under Hd: E = V + UN, say the conditional probability of E given that the other contributor is unknown (UN) is 1 in 100,000.
Instead of stating LR = 100,000, it is less confusing to say that if we were to assume that the mixture DNA came from the victim and this defendant, this is the only observation possible (certain), but if the other contributor is unknown, we would expect to sample about 100,000 unrelated persons before finding one whose DNA, mixed with that of the victim, would produce a profile matching the profile seen in the mixture DNA evidence sample.

Is the Extent of Population Substructure Uncertain for the Forensic Loci?

Inbreeding Coefficient (FST)
Locus: Caucasian, African American, Hispanic, Asian, Native American
CSF1PO: -0.0007, -0.0009, -0.0003, -0.0012, 0.0244
D13S317: -0.0008, 0.0029, 0.0047, 0.0071, 0.0157
D18S51: 0.0001, 0.0012, 0.0011, 0.0046, 0.0268
D21S11: 0.0008, 0.0005, 0.0013, 0.0056, 0.0371
D3S1358: -0.0009, 0.0010, 0.0035, 0.0764
D5S818: -0.0001, 0.0010, 0.0028, 0.0656
D7S820: -0.0005, 0.0000, 0.0010, 0.0039, 0.0201

Inbreeding Coefficient (FST)
Locus: Caucasian, African American, Hispanic, Asian, Native American
D8S1179: 0.0000, -0.0001, 0.0005, 0.0025, 0.0125
FGA: -0.0004, 0.0008, 0.0029, 0.0168
TH01: -0.0012, 0.0015, 0.0041, 0.0058, 0.0356
TPOX: -0.0015, 0.0021, 0.0024, 0.0100, 0.0164
vWA: -0.0011, 0.0029, 0.0027, 0.0172
Average: -0.0005, 0.0006, 0.0021, 0.0039, 0.0282

The NRC-II recommendation of θ = 0.01 for large cosmopolitan populations and θ = 0.03 for small isolated populations is well supported on empirical as well as theoretical grounds

Are the DNA Forensic Population Databases Random and are their Sample Sizes Sufficient?

Features of Genetic Databases
• Population Genetics has historically employed ‘convenient’ sampling, instead of strict random sampling
• ‘Convenient sampling’, defined as sampling of individuals without any prior knowledge of their DNA type, is operationally random, in particular when variations at DNA loci do not affect fertility, viability, or cognitive or life achievement abilities
• Allele frequency estimates from convenient samples have been shown to approximate well those from structured strict random sampling
• Strict random samples collected at one point in time from a natural population may not remain random at another time point because of birth, death, immigration, and emigration events

Features of Genetic Databases - 2
• Allele frequencies from subjects of convenient samples described by ‘self-identified’ ethnicity have been shown to represent genetic affinities comparable with similar inferences drawn from anthropologically well-defined populations
• Occasional presence of biological relatives in convenient samples does not affect allele frequency estimates, but may produce excess allele/genotype sharing at some loci

Phylogenetic Tree (UPGMA) for some World Populations with allele frequency data of the CODIS STR Loci
[Figure: UPGMA tree; populations shown: SW Hispanic (TX), SW Hispanic (CA), US Caucasian, Swiss, Italian, SE Hispanic (FL), Chinese, Japanese, African American (TX), African American (CA), Apache, Navajo, Athabaskan, Inupiat, Yupik]

Sample Size Limitation Issue
• Strictly speaking, no sample size is universally sufficient unless all individuals are continually genotyped over time
• Sample sizes such as 100 to 150 individuals per population have been shown to produce stable estimates of allele frequencies above a prescribed minimum threshold allele frequency
• Current forensic DNA statistics employ the concepts of a minimum threshold allele frequency and the upper bound of the 95% confidence interval to account for sampling variation

Concerns Related to Databases Used for Lineage Markers (e.g., mtDNA and Y-STRs)

Inheritance of Lineage Markers
(NOTE: Colors denote mtDNA type; letters (X, A, B) indicate Y-linked information, where X denotes no Y chromosome, and A and B are Y-linked alleles or haplotypes)
[Figure: diagram of individuals labeled X, A, or B]

Introductory Comments on Lineage Markers
• mtDNA is maternally inherited, and Y-STRs are transmitted only from fathers to sons
• Barring mutations, all maternally related persons (males as well as females) will have the same mtDNA profile, and all paternally related males will have the same Y-STR profile
• Different markers on mtDNA are genetically linked (with virtually no recombination), and so are the Y-STRs (residing on the non-recombining region of the Y chromosome)

Comments on Lineage Markers (Contd.)
• Consequently, mtDNA sequence data has to be treated as a haploid haplotype, the frequency of which is NOT multiplicative across markers, and the same holds for a Y-STR based profile
• The counting method is the one that captures the genetic information
• Stated ethnicity of individuals does not necessarily reflect patrilineal or matrilineal ancestry (e.g., mtDNA of Hispanics may be almost entirely of Native American descent, while for the autosomal STRs only 30-50% of their genes are of Native American descent)
• Thus, the grouping of populations used for autosomal nuclear STR loci does not necessarily provide accurate frequency estimates of a Y-linked STR haplotype, nor of a specific mtDNA sequence

Fundamental Difference between the Frequency of a CODIS STR DNA Profile and that of a Profile Based on mtDNA and Y-STRs
• For CODIS STR loci, profile frequency provides information regarding the rarity of the profile in the population, or the conditional probability given that the profile is found in someone else
• For mtDNA, it is the frequency among individuals who are NOT maternally related
• For Y-STRs, likewise, it is the frequency among individuals who are NOT paternally related

Computation of Frequency of Lineage-based Marker Profile
Using the general theory, the unconditional frequency of a haplotype (say Ai), which is count divided by sample size, can be modified to get the conditional probability
$$\Pr(A_i \mid A_i) = \frac{p_i^2 + \theta p_i (1 - p_i)}{p_i} = p_i + \theta (1 - p_i) = \theta + p_i (1 - \theta)$$
Hence, the conditional probability always exceeds θ, the adjustment factor for possible population substructure in the database used

Computation of Frequency of Lineage-based Marker Profile (Contd.)
Some advocates suggest that the quantity pi in Pr(Ai | Ai) = θ + pi(1 - θ) can be substituted by (Count of Ai + 2)/(N + 3), where N is the sample size. When N is large, this has little effect, but it can be of help when the count of Ai in the database is zero (i.e., the profile in evidence is not seen in the database)
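
The counting estimate, the suggested (count + 2)/(N + 3) substitution, and the conditional probability θ + p(1 - θ) can be combined in a short sketch. The function names, database size, and θ value below are hypothetical.

```python
def haplotype_frequency(count, n, adjusted=True):
    """Haplotype frequency from a database of size n: the plain counting
    estimate, or the (count + 2) / (n + 3) substitution."""
    return (count + 2) / (n + 3) if adjusted else count / n

def conditional_haplotype_probability(p, theta):
    """Pr(Ai | Ai) = theta + p * (1 - theta) for a haploid lineage marker."""
    return theta + p * (1 - theta)

# Hypothetical case: haplotype not seen in a database of 997 haplotypes
p = haplotype_frequency(0, 997)
print(f"p = {p:.4f}, Pr(Ai | Ai) = {conditional_haplotype_probability(p, 0.01):.5f}")
```

With a zero count the plain counting estimate would be 0, whereas the substituted estimate stays positive, and the conditional probability can never fall below θ.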

mtDNA and Y-STR θ-Value
Since, in terms of match versus non-match, how different the haplotypes are is not an issue, the θ values for mtDNA and Y-linked haplotypes are to be computed not with mismatch-based approaches (such as AMOVA), but by treating all haplotypes as different alleles, generally leading to a much smaller θ value

Issues Related to DNA-Match Statistics when Suspects are Identified by Database Search

Three Approaches – Three Types of Questions!
• The NRC-I recommendation to use only the additional loci, not used in the database search, is counter-productive
• The chance of coincidentally finding a profile in a database depends on the expected rarity of the profile and the database size
• NRC-II’s Np rule answers the question of the expected number of profiles matching a target profile (of rarity p) in a database (random with respect to the crime) of size N
• The Bayesian approach makes additional assumptions regarding the prior odds of each individual in the database being the contributor of the DNA of the target profile
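
The Np rule and the related chance of at least one coincidental match can be sketched as below. The second function assumes that the N database profiles are independent and unrelated to the source, which is an added assumption; the function names and the values of N and p are hypothetical.

```python
from math import expm1, log1p

def expected_matches(n, p):
    """NRC-II Np rule: expected number of database profiles matching a
    target profile of rarity p in a database of size n."""
    return n * p

def prob_at_least_one_match(n, p):
    """1 - (1 - p)^n, computed stably for very small p
    (independent, unrelated database profiles assumed)."""
    return -expm1(n * log1p(-p))

n, p = 1_000_000, 1e-9   # hypothetical database size and profile rarity
print(f"Np = {expected_matches(n, p):.4g}")
print(f"P(at least one match) = {prob_at_least_one_match(n, p):.4g}")
```

When Np is small the two quantities are nearly equal, which is why the expected number is often read as an approximate probability.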

DOES SOMEONE HAVE YOUR BIRTHDAY?
Prob. that in a sample of $n$ persons, all birthdays are different is given by
$$\prod_{i=0}^{n-1} \frac{365 - i}{365}$$
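
The birthday analogy separates two questions that are easy to confuse: whether any two persons in a sample share a birthday, and whether someone shares one particular birthday. A minimal sketch follows, assuming 365 equally likely birthdays; the function names are hypothetical.

```python
def prob_all_birthdays_differ(n, days=365):
    """Probability that n persons all have different birthdays."""
    prob = 1.0
    for i in range(n):
        prob *= (days - i) / days
    return prob

def prob_someone_has_your_birthday(n, days=365):
    """Probability that at least one of n other persons has one specific birthday."""
    return 1.0 - (1.0 - 1.0 / days) ** n

print(f"Any shared pair among 23 persons: {1 - prob_all_birthdays_differ(23):.3f}")
print(f"Your birthday among 23 persons:   {prob_someone_has_your_birthday(23):.3f}")
```

The first question corresponds to pairwise comparison of all profiles within a database, the second to searching a database for one target profile.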

SAMPLE SIZE NEEDED FOR AT LEAST ONE DUPLICATE FOR GIVEN VALUES OF EVENT PROBABILITY AND DEGREE OF CONFIDENCE

OBSERVED AND EXPECTED MATCH PROBABILITY

EXPECTED NUMBER OF MATCHES IN DATABASE SEARCH (CARIBBEAN)

OBSERVED AND EXPECTED NUMBER OF MATCHES IN PAIRWISE COMPARISON OF PROFILE IN DATABASE (CARIBBEAN)

EFFECT OF PRESENCE OF RELATIVES (Caucasian data on CODIS loci, θ = 0, N = 1000)

Conclusions
• With the larger amount of data collected since 1996, and with experience of statistical results from casework, the NRC-II recommendations remain appropriate for statistical evaluation of forensic DNA evidence
• Statistical answers to different questions are necessarily different; this does not constitute a lack of general acceptance
• mtDNA and Y-STR database groupings are necessarily different from those of autosomal STRs because of the uniparental ancestry of lineage markers
• The convenient sampling effect and sample size limitations are accounted for in current protocols of DNA statistics
• A suspect identified through a database search raises multiple types of questions, the answers to which are different

Acknowledgements
• Dr. Bruce Budowle, FBI Academy
• Hee S. Lee, Xiaohua Sheng, Jianye Ge, graduate students at CGI, Univ. Cincinnati
• SWGDAM members, for providing databases
• US granting agencies NIH and NIJ, for partial support of the research

Thank You!
