Statistical Genomics Lecture 11 Power type I error

  • Slides: 26
Download presentation
Statistical Genomics Lecture 11: Power, type I error and FDR Zhiwu Zhang Washington State

Statistical Genomics Lecture 11: Power, type I error and FDR Zhiwu Zhang Washington State University

Administration Homework 2, due Feb 17, Wednesday, 3: 10 P Homework 3 posted, due

Administration Homework 2, due Feb 17, Wednesday, 3: 10 P Homework 3 posted, due Mar 2, Wednesday, 3: 10 PM Midterm exam: February 26, Friday, 50 minutes (3: 354: 25 PM), 25 questions.

Outline Simulation of phenotype from genotype GWAS by correlation Power FDR Cutoff Null distribution

Outline Simulation of phenotype from genotype GWAS by correlation Power FDR Cutoff Null distribution of p values Resolution QTN bins and non-QTN bins

GWAS by correlation my. GD=read. table(file="http: //zzlab. net/GAPIT/data/mdp_numeric. txt", head=T) my. GM=read. table(file="http: //zzlab.

GWAS by correlation my. GD=read. table(file="http: //zzlab. net/GAPIT/data/mdp_numeric. txt", head=T) my. GM=read. table(file="http: //zzlab. net/GAPIT/data/mdp_SNP_information. txt", head=T) setwd("~/Dropbox/Current/ZZLab/WSUCourse/CROPS 545/Demo") source("G 2 P. R") source("GWASby. Cor. R") X=my. GD[, -1] index 1 to 5=my. GM[, 2]<6 X 1 to 5 = X[, index 1 to 5] set. seed(99164) my. Sim=G 2 P(X= X 1 to 5, h 2=. 75, alpha=1, NQTN=10, distribution="norm") p= GWASby. Cor(X=X, y=my. Sim$y)

The top five associations index=order(p) top 5=index[1: 5] detected=intersect(top 5, my. Sim$QTN. position) false.

The top five associations index=order(p) top 5=index[1: 5] detected=intersect(top 5, my. Sim$QTN. position) false. Positive=setdiff(top 5, my. Sim$QTN. position) top 5 my. Sim$QTN. position detected length(detected) false. Positive Power=3/10 False Discovery Rate (FDR) =2/5

The top five associations color. vector <- rep(c("deepskyblue", "orange", "forestgreen", "indianred 3"), 10) m=nrow(my.

The top five associations color. vector <- rep(c("deepskyblue", "orange", "forestgreen", "indianred 3"), 10) m=nrow(my. GM) plot(t(-log 10(p))~seq(1: m), col=color. vector[my. GM[, 2]]) abline(v=my. Sim$QTN. position, lty = 2, lwd=2, col = "black") abline(v= false. Positive, lty = 2, lwd=2, col = "red") ü Bonferroni Cutoff: 1%/3093=3. 2 E-6 ü Resolution

Cutoff from null distribution of P values: CHR 6 -10 N 1 2 3

Cutoff from null distribution of P values: CHR 6 -10 N 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 Observed 9. 80 E-08 7. 76 E-07 3. 07 E-06 5. 20 E-06 7. 26 E-06 8. 64 E-06 9. 72 E-06 1. 67 E-05 1. 91 E-05 2. 13 E-05 3. 28 E-05 3. 45 E-05 3. 98 E-05 4. 46 E-05 5. 39 E-05 Expected 0. 000926784 0. 001853568 0. 002780352 0. 003707136 0. 00463392 0. 005560704 0. 006487488 0. 007414273 0. 008341057 0. 009267841 0. 010194625 0. 011121409 0. 012048193 0. 012974977 0. 013901761 1074 1075 1076 1077 1078 1079 0. 9918845 0. 9954024 0. 9960631 0. 9970031 0. 9992356 0. 9999589 0. 9953661 0. 9962929 0. 9972196 0. 9981464 0. 9990732 1 1% of observed p values are below 0. 0000328 index. null=!index 1 to 5 & !is. na(p) p. null=p[index. null] m. null=length(p. null) index. sort=order(p. null) p. null. sort=p. null[index. sort] head(p. null. sort) tail(p. null. sort) seq=seq(1: m. null) table=cbind(seq, p. null. sort, seq/m. null) head(table, 15) tail(table) P value of 3. 28 E-5 is equivalent to 1% type 1 error

What about QTNs every where? set. seed(99164) my. Sim=G 2 P(X= my. GD[, -1],

What about QTNs every where? set. seed(99164) my. Sim=G 2 P(X= my. GD[, -1], h 2=. 75, alpha=1, NQTN=10, distribution="norm") p= GWASby. Cor(X=X, y=my. Sim$y) plot(t(-log 10(p))~seq(1: m), col=color. vector[my. GM[, 2]]) abline(v=my. Sim$QTN. position, lty = 2, lwd=2, col = "black")

Resolution and bin approach 10 Kb is really good, 100 Kb is OK Bins

Resolution and bin approach 10 Kb is really good, 100 Kb is OK Bins with QTNs for power Bins without QTNs for type I error

Bins (e. g. 100 Kb) big. Num=1 e 9 resolution=100000 bin=round((my. GM[, 2]*big. Num+my.

Bins (e. g. 100 Kb) big. Num=1 e 9 resolution=100000 bin=round((my. GM[, 2]*big. Num+my. GM[, 3])/resolution) result=cbind(my. GM, t(p), bin) head(result) Minimum p value within bin

Bins of QTNs QTN. bin=result[my. Sim$QTN. position, ] QTN. bin

Bins of QTNs QTN. bin=result[my. Sim$QTN. position, ] QTN. bin

Sorted bins of QTNs index. qtn. p=order(QTN. bin[, 4]) QTN. bin[index. qtn. p, ]

Sorted bins of QTNs index. qtn. p=order(QTN. bin[, 4]) QTN. bin[index. qtn. p, ]

FDR and type I error Total number of bins: 3054 (size of 100 kb)

FDR and type I error Total number of bins: 3054 (size of 100 kb) N bin t(p) Power #False bins FDR Type. I Error 1 2 3 4 5 6 7 8 9 10 50120 12235 60985 12918 31482 101348 31573 42222 10502 22331 4. 44 E-16 1. 00 E-10 1. 38 E-10 7. 02 E-08 2. 05 E-05 9. 58 E-02 1. 88 E-01 2. 94 E-01 4. 98 E-01 9. 91 E-01 0. 2 0. 3 0. 4 0. 5 0. 6 0. 7 0. 8 0. 9 1 0 0 2 416 608 782 1001 1335 0 0 0. 285714286 0. 985781991 0. 988617886 0. 989873418 0. 991089109 0. 992565056 0 0 0. 000654879 0. 1362148 0. 19908317 0. 256057629 0. 327766863 0. 437131631 0. 285714286=2/(2+5) 0. 000654879=2/3054

 Receiver Operating Characteristic "The curve is created by plotting the true positive rate

Receiver Operating Characteristic "The curve is created by plotting the true positive rate against the false positive rate at various threshold settings. " -Wikipedia Power ROC curve FDR Liu et. al. PLo. S Genetics, 2016

GAPIT. FDR. Type. I Function library(compiler) #required for cmpfun source("http: //www. zzlab. net/GAPIT/gapit_functions. txt")

GAPIT. FDR. Type. I Function library(compiler) #required for cmpfun source("http: //www. zzlab. net/GAPIT/gapit_functions. txt") my. Stat=GAPIT. FDR. Type. I( WS=c(1 e 0, 1 e 3, 1 e 4, 1 e 5), GM=my. GM, seq. QTN=my. Sim$QTN. position, GWAS=result) str(my. Stat)

Return

Return

Area Under Curve (AUC) par(mfrow=c(1, 2), mar = c(5, 2, 5, 2)) plot(my. Stat$FDR[,

Area Under Curve (AUC) par(mfrow=c(1, 2), mar = c(5, 2, 5, 2)) plot(my. Stat$FDR[, 1], my. Stat$Power, type="b") plot(my. Stat$Type. I[, 1], my. Stat$Power, type="b")

Replicates nrep=100 set. seed(99164) stat. Rep=replicate(nrep, { my. Sim=G 2 P(X=my. GD[, -1], h

Replicates nrep=100 set. seed(99164) stat. Rep=replicate(nrep, { my. Sim=G 2 P(X=my. GD[, -1], h 2=. 5, alpha=1, NQTN=10, distribution="norm") p=p= GWASby. Cor(X=my. GD[, -1], y=my. Sim$y) seq. QTN=my. Sim$QTN. position my. GWAS=cbind(my. GM, t(p), NA) my. Stat=GAPIT. FDR. Type. I(WS=c(1 e 0, 1 e 3, 1 e 4, 1 e 5), GM=my. GM, seq. QTN=my. Sim$QTN. position, GWAS=my. GWAS, max. Out=100, Max. BP= 1 e 10) })

str(stat. Rep)

str(stat. Rep)

Means over replicates power=stat. Rep[[2]] #FDR s. fdr=seq(3, length(stat. Rep), 7) fdr=stat. Rep[s. fdr]

Means over replicates power=stat. Rep[[2]] #FDR s. fdr=seq(3, length(stat. Rep), 7) fdr=stat. Rep[s. fdr] fdr. mean=Reduce ("+", fdr) / length(fdr) #AUC: power vs. FDR s. auc. fdr=seq(6, length(stat. Rep), 7) auc. fdr=stat. Rep[s. auc. fdr] auc. fdr. mean=Reduce ("+", auc. fdr) / length(auc. fdr)

Plots of power vs. FDR the. Color=rainbow(4) plot(fdr. mean[, 1], power , type="b", col=the.

Plots of power vs. FDR the. Color=rainbow(4) plot(fdr. mean[, 1], power , type="b", col=the. Color [1], xlim=c(0, 1)) for(i in 2: ncol(fdr. mean)){ lines(fdr. mean[, i], power , type="b", col= the. Color [i]) }

Plots of AUC barplot(auc. fdr. mean, names. arg=c("1 bp", "1 K", "100 K"), xlab="Resolution",

Plots of AUC barplot(auc. fdr. mean, names. arg=c("1 bp", "1 K", "100 K"), xlab="Resolution", ylab="AUC")

ROC with different heritability h 2= 25% vs. 75% 10 QTNs Normal distributed QTN

ROC with different heritability h 2= 25% vs. 75% 10 QTNs Normal distributed QTN effect 100 kb resolution Power against Type I error

Simulation and GWAS nrep=100 set. seed(99164) #h 2=25% stat. Rep 25=replicate(nrep, { my. Sim=G

Simulation and GWAS nrep=100 set. seed(99164) #h 2=25% stat. Rep 25=replicate(nrep, { my. Sim=G 2 P(X=my. GD[, -1], h 2=. 25, alpha=1, NQTN=10, distribution="norm") p=p= GWASby. Cor(X=my. GD[, -1], y=my. Sim$y) seq. QTN=my. Sim$QTN. position my. GWAS=cbind(my. GM, t(p), NA) my. Stat=GAPIT. FDR. Type. I(WS=c(1 e 0, 1 e 3, 1 e 4, 1 e 5), GM=my. GM, seq. QTN=my. Sim$QTN. position, GWAS=my. GWAS, max. Out=100, Max. BP=1 e 10)}) #h 2=75% stat. Rep 75=replicate(nrep, { my. Sim=G 2 P(X=my. GD[, -1], h 2=. 75, alpha=1, NQTN=10, distribution="norm") p=p= GWASby. Cor(X=my. GD[, -1], y=my. Sim$y) seq. QTN=my. Sim$QTN. position my. GWAS=cbind(my. GM, t(p), NA) my. Stat=GAPIT. FDR. Type. I(WS=c(1 e 0, 1 e 3, 1 e 4, 1 e 5), GM=my. GM, seq. QTN=my. Sim$QTN. position, GWAS=my. GWAS, max. Out=100, Max. BP=1 e 10)})

Means and plot power 25=stat. Rep 25[[2]] s. t 1=seq(4, length(stat. Rep 25), 7)

Means and plot power 25=stat. Rep 25[[2]] s. t 1=seq(4, length(stat. Rep 25), 7) t 1=stat. Rep 25[s. t 1] t 1. mean. 25=Reduce ("+", t 1) / length(t 1) power 75=stat. Rep 75[[2]] s. t 1=seq(4, length(stat. Rep 75), 7) t 1=stat. Rep 75[s. t 1] t 1. mean. 75=Reduce ("+", t 1) / length(t 1) plot(t 1. mean. 25[, 4], power 25, type="b", col="blue", xlim=c(0, 1)) lines(t 1. mean. 75[, 4], power 75, type="b", col= "red")

Highlight Simulation of phenotype from genotype GWAS by correlation Power FDR Cutoff Null distribution

Highlight Simulation of phenotype from genotype GWAS by correlation Power FDR Cutoff Null distribution of p values Resolution QTN bins and non-QTN bins