Object Oriented Data Analysis
J. S. Marron
Dept. of Statistics and Operations Research, University of North Carolina (UNC, Stat & OR)
12 March 2021

Interdisciplinary Relationship
How does Statistics relate to Mathematics (probability, optimization, geometry, …)?

Statistics - Mathematics Relationship
Mathematical Statistics:
• Validation of existing methods
• Asymptotics (n → ∞) & Taylor expansion
• Comparison of existing methods
(requires hard math, but really “accounting”???)

Statistics - Mathematics Relationship
Suggested New Relationship: put Mathematics to work to generate new statistical ideas/approaches
(publishable in the Ann. Stat.???)

Personal Opinions on Mathematical Statistics
What is Mathematical Statistics?
• Validation of existing methods
• Asymptotics (n → ∞) & Taylor expansion
• Comparison of existing methods
(requires hard math, but really “accounting”???)

Personal Opinions on Mathematical Statistics
What could Mathematical Statistics be?
• Basis for invention of new methods
• Complicated data → mathematical ideas
• Do we value creativity?
• Since we don’t do this, others do… (where are the $$$s???)

Personal Opinions on Mathematical Statistics
Since we don’t do this, others do…
• Pattern Recognition
• Artificial Intelligence
• Neural Nets
• Data Mining
• Machine Learning
• ???

Personal Opinions on Mathematical Statistics
Possible Litmus Test: Creative Statistics
• Clinical Trials Viewpoint: Worst Imaginable Idea
• Mathematical Statistics Viewpoint: ???

Object Oriented Data Analysis, I
What is the “atom” of a statistical analysis?
• 1st Course: Numbers
• Multivariate Analysis Course: Vectors
• Functional Data Analysis: Curves
• More generally: Data Objects

Functional Data Analysis, I
Curves as Data Objects
Important Duality: Curve Space ↔ Point Cloud Space
Illustrate with Travis Gaydos Graphics
• 2 dim’al curves (easy to visualize)

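
The curve space / point cloud space duality can be sketched numerically. The following is a minimal illustration, assuming hypothetical toy curves (a random vertical shift plus a random multiple of a sine bump) sampled on a common grid; the data, sizes, and variable names are assumptions for illustration and are not the graphics shown in the talk.

```python
import numpy as np

# Hypothetical toy data: n curves observed on a common grid of d points.
rng = np.random.default_rng(0)
n, d = 20, 50
grid = np.linspace(0.0, 1.0, d)

# Each curve = random vertical shift + random multiple of a sine bump + small noise.
shifts = rng.normal(0.0, 1.0, size=(n, 1))
amps = rng.normal(0.0, 0.5, size=(n, 1))
curves = shifts + amps * np.sin(2 * np.pi * grid) + rng.normal(0.0, 0.05, size=(n, d))

# Curve space -> point cloud space: row i of `curves` is one point in R^d.
mean_curve = curves.mean(axis=0)
centered = curves - mean_curve

# PCA via the SVD of the centered data matrix.
U, s, Vt = np.linalg.svd(centered, full_matrices=False)
scores = U * s                 # coordinates of each curve along the PC directions
directions = Vt                # each row is a direction in R^d, i.e. again a curve
explained = s**2 / np.sum(s**2)

# Point cloud space -> curve space: a point on the PC1 axis is itself a curve.
pc1_curve = mean_curve + 2.0 * s[0] / np.sqrt(n) * directions[0]
```

Each row of `directions` can be plotted against `grid` as a curve, while each row of `scores` is a point in the cloud, which is the two-way reading the slide describes.
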
[Figures: Functional Data Analysis, Toy EG I to X]
[Figures: Functional Data Analysis, 10-d Toy EG 1 and 10-d Toy EG 2 (two slides each)]

Object Oriented Data Analysis, II
Examples:
• Medical Image Analysis
  - Images as Data Objects?
  - Shape Representations as Objects
• Micro-arrays for Gene Expression
  - Just multivariate analysis?

Object Oriented Data Analysis, III
Typical Goals:
• Understanding population variation
  - Visualization
  - Principal Component Analysis +
• Discrimination (a.k.a. Classification)
• Time Series of Data Objects

Object Oriented Data Analysis, IV
Major Statistical Challenge, I: High Dimension Low Sample Size (HDLSS)
• Dimension d >> sample size n
• “Multivariate Analysis” nearly useless
  - Can’t “normalize the data”
• Land of Opportunity for Statisticians
  - Need for “creative statisticians”

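
One reading of “can’t normalize the data” is that the sample covariance matrix cannot be inverted when d > n, so the usual sphering step of classical multivariate analysis is unavailable. The sketch below checks this on simulated data; the sizes n = 10 and d = 100 are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 10, 100                      # hypothetical HDLSS sizes: d >> n
X = rng.normal(size=(n, d))         # rows are cases, columns are features

S = np.cov(X, rowvar=False)         # d x d sample covariance matrix
rank = np.linalg.matrix_rank(S)     # at most n - 1 after mean centering

# S is singular, so S^{-1/2} (used to sphere the data) does not exist.
is_singular = rank < d
```
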
Object Oriented Data Analysis, V
Major Statistical Challenge, II:
• Data may live in non-Euclidean space
  - Lie Group / Symmet’c Spaces (manifold data)
  - Trees/Graphs as data objects
• Interesting Issues:
  - What is “the mean” (pop’n center)?
  - How do we quantify “pop’n variation”?

Statistics in Image Analysis, I
First Generation Problems:
• Denoising
• Segmentation
• Registration
(all about single images)

Statistics in Image Analysis, II
Second Generation Problems:
• Populations of Images
  - Understanding Population Variation
  - Discrimination (a.k.a. Classification)
• Complex Data Structures (& Spaces)
• HDLSS Statistics

HDLSS Statistics in Imaging
Why HDLSS (High Dim, Low Sample Size)?
• Complex 3-d Objects Hard to Represent
  - Often need d = 100’s of parameters
• Complex 3-d Objects Costly to Segment
  - Often have n = 10’s cases

Medical Imaging – A Challenging Example
• Male Pelvis
  - Bladder – Prostate – Rectum
  - How do they move over time (days)?
  - Critical to Radiation Treatment (cancer)
• Work with 3-d CT
• Very Challenging to Segment
  - Find boundary of each object?
  - Represent each Object?

Male Pelvis – Raw Data
[Figure: one CT slice (in 3d image), labeled Coccyx (Tail Bone), Rectum, Bladder]

Male Pelvis – Raw Data
[Figure: Bladder, manual segmentation slice by slice, then reassembled]

Male Pelvis – Raw Data
[Figure: Bladder slices reassembled in 3d]
How to represent?
Thanks: Ja-Yeon Jeong

Object Representation
• Landmarks (hard to find)
• Boundary Rep’ns (no correspondence)
• Medial representations
  - Find “skeleton”
  - Discretize as “atoms” called M-reps

3-d m-reps
Bladder – Prostate – Rectum (multiple objects, J. Y. Jeong)
• Medial Atoms provide “skeleton”
• Implied Boundary from “spokes” → “surface”

3-d m-reps
M-rep model fitting
• Easy, when starting from binary (blue)
• But very expensive (30 – 40 minutes technician’s time)
• Want automatic approach
• Challenging, because of poor contrast, noise, …
• Need to borrow information across training sample
• Use Bayes approach: prior & likelihood → posterior
• ~Conjugate Gaussians, but there are issues:
  - Major HDLSS challenges
  - Manifold aspect of data

Illuminating Viewpoint
• Object Space: focus here on collection of data objects
• Feature Space: here conceptualize population structure via “point clouds”

Personal HDLSS Viewpoint: Data
• Images (cases) are “Points”
• In Feature Space
• Features are Axes
• Data set is “Point Clouds”
• Use Proj’ns to visualize

Personal HDLSS Viewpoint: PCA
• Rotated Axes
• Often Insightful
• One set of Dir’ns
• Others Useful, too

Cornea Data, I
Images as data
• ~42 Cornea Images
• Outer surface of eye
• Heat map of curvature (in radial direction)
• Hard to understand “population structure”

Cornea Data, II
PC1
• Starts at Pop’n Mean
• Overall Curvature
• Vertical Astigmatism
• Correlated!
• Gaussian Projections
• Visualization: Can’t Overlay (so use movie)

Cornea Data, III
PC2
• Horrible Outlier! (present in data)
• But look only in center: Steep at top -- bottom
• Want Robust PCA
• For HDLSS data???

Cornea Data, IV
Robust PC2
• No outlier impact
• See top – bottom variation
• Projections now Gaussian

PCA for m-reps, I
Major issue: m-reps live in […] (locations, radius and angles)
E.g. “average” of: […] = ???
Natural Data Structure is: Lie Groups ~ Symmetric spaces (smooth, curved manifolds)

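
Why averaging angles is a problem can be seen in a small numerical sketch. The angles below are hypothetical (the ones on the original slide are not recoverable), and the circular mean used here, the direction of the averaged unit vectors, is one standard alternative, not necessarily the one used for m-reps.

```python
import numpy as np

def circular_mean_deg(angles_deg):
    """Direction (in degrees, in [0, 360)) of the average unit vector."""
    a = np.deg2rad(np.asarray(angles_deg, dtype=float))
    return np.rad2deg(np.arctan2(np.sin(a).mean(), np.cos(a).mean())) % 360.0

angles = [359.0, 1.0, 3.0, 357.0]     # hypothetical angles, all close to 0 degrees

naive = float(np.mean(angles))        # arithmetic mean: points the opposite way
circ = circular_mean_deg(angles)      # close to 0 degrees (equivalently 360)

print(f"arithmetic mean = {naive:.1f} deg, circular mean = {circ:.6f} deg")
```

The arithmetic mean treats the angles as ordinary numbers and ignores that 359° and 1° are neighbors on the circle, which is the kind of failure that motivates working with the manifold structure directly.
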
PCA for m-reps, II
PCA on non-Euclidean spaces? (i.e. on Lie Groups / Symmetric Spaces)
T. Fletcher: Principal Geodesic Analysis
Idea: replace “linear summary of data” with “geodesic summary of data”…

PGA for m-reps, Bladder-Prostate-Rectum
Bladder – Prostate – Rectum, 1 person, 17 days
[Figures: PG 1, PG 2, PG 3 (analysis by Ja Yeon Jeong)]

HDLSS Classification (i.e. Discrimination)
Background: Two Class (Binary) version:
Using “training data” from Class +1 and from Class -1, develop a “rule” for assigning new data to a Class
Canonical Example: Disease Diagnosis
• New Patients are “Healthy” or “Ill”
• Determined based on measurements

HDLSS Classification (Cont.)
• Ineffective Methods:
  - Fisher Linear Discrimination
  - Gaussian Likelihood Ratio
• Less Useful Methods:
  - Nearest Neighbors
  - Neural Nets (“black boxes”, no “directions” or intuition)

HDLSS Classification (Cont.)
• Currently Fashionable Methods:
  - Support Vector Machines
  - Trees Based Approaches
• New High Tech Method:
  - Distance Weighted Discrimination (DWD)
  - Specially designed for HDLSS data
  - Avoids “data piling” problem of SVM
  - Solves more suitable optimization problem

HDLSS Classification (Cont.)
• Currently Fashionable Methods:
  - Trees Based Approaches
  - Support Vector Machines:

Kernel Embedding Idea
Aizerman, Braverman, Rozonoer (1964)
Make data linearly separable by embedding in higher dimensional space
[Figures: linearly separable by embedding in higher dimensions]
Distributional Assumptions in Embedded Space?
⇓
Support Vector Machine

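
The embedding idea can be illustrated with a minimal sketch. The data, the feature map x → (x, x²), and the separating line are all assumptions chosen for illustration; they are not the example pictured in the talk.

```python
import numpy as np

# Hypothetical 1-d toy data: class +1 in the middle, class -1 on both sides.
x = np.array([-3.0, -2.5, -2.0, -0.5, 0.0, 0.5, 2.0, 2.5, 3.0])
y = np.array([-1, -1, -1, +1, +1, +1, -1, -1, -1])

def separable_by_threshold(values, labels):
    """True if a single cut point on the line splits the two classes."""
    pos, neg = values[labels == 1], values[labels == -1]
    return bool(pos.max() < neg.min() or neg.max() < pos.min())

# In the original 1-d space no single threshold separates the classes.
separable_1d = separable_by_threshold(x, y)

# Embed in 2-d with the polynomial feature map x -> (x, x^2).
embedded = np.column_stack([x, x**2])

# In the embedded space the horizontal line x2 = 2 separates the classes:
# the rule is sign(w . z + b) with normal vector w and offset b.
w, b = np.array([0.0, -1.0]), 2.0
predictions = np.sign(embedded @ w + b)
separable_2d = bool(np.all(predictions == y))
```
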
HDLSS Classification (Cont.)
Comparison of Linear Methods (toy data):
• Optimal Direction
  - Excellent, but need dir’n in dim = 50
• Maximal Data Piling (J. Y. Ahn, D. Peña)
  - Great separation, but generalizability???
• Support Vector Machine
  - More separation, gen’ity, but some data piling?
• Distance Weighted Discrimination
  - Avoids data piling, good gen’ity, Gaussians?

Distance Weighted Discrimination
[Figure: Maximal Data Piling]

Distance Weighted Discrimination
Based on Optimization Problem: […]
More precisely: work in appropriate penalty for violations
Optimization Method (Michael Todd): Second Order Cone Programming
• Still convex gen’tion of quadratic prog’ing
• Fast greedy solution
• Can use existing software

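
The optimization problem itself did not survive on this slide. For orientation, the formulation commonly given for DWD in the literature is sketched below; the notation (training pairs $(x_i, y_i)$ with $y_i \in \{-1, +1\}$, direction $w$, offset $\beta$, slack variables $\xi_i$, penalty constant $C$) is an assumption here and may differ from what the slide showed.

```latex
\begin{aligned}
\min_{w,\,\beta,\,\xi}\quad & \sum_{i=1}^{n} \frac{1}{r_i} \;+\; C \sum_{i=1}^{n} \xi_i \\
\text{subject to}\quad & r_i = y_i \left( x_i^{\top} w + \beta \right) + \xi_i, \qquad r_i \ge 0, \\
& \xi_i \ge 0, \qquad \lVert w \rVert \le 1 .
\end{aligned}
```

Here $r_i$ is the (slack-adjusted) signed distance of case $i$ to the separating hyperplane, so the sum of inverse distances lets every data point influence the direction, with closer points weighted more heavily. The term $C \sum_i \xi_i$ is the “penalty for violations”, and the norm constraint is what makes the problem a second order cone program.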
Simulation Comparison
E.g. Above Gaussians:
• Wide array of dim’s
• SVM subst’ly worse
• MD – Bayes Optimal
• DWD close to MD

Simulation Comparison
E.g. Outlier Mixture:
• Disaster for MD
• SVM & DWD much more solid
• Dir’ns are “robust”
• SVM & DWD similar

Simulation Comparison
E.g. Wobble Mixture:
• Disaster for MD
• SVM less good
• DWD slightly better
Note: All methods come together for larger d???

DWD Bias Adjustment for Microarrays
Microarray data:
• Simult. Measur’ts of “gene expression”
• Intrinsically HDLSS
  - Dimension d ~ 1,000s – 10,000s
  - Sample Sizes n ~ 10s – 100s
• My view: Each array is “point in cloud”

DWD Batch and Source Adjustment
• For Perou’s Stanford Breast Cancer Data
• Analysis in Benito, et al (2004) Bioinformatics
  https://genome.unc.edu/pubsup/dwd/
• Adjust for Source Effects
  - Different sources of mRNA
• Adjust for Batch Effects
  - Arrays fabricated at different times

[Figure sequence, DWD Adj:]
• Raw Breast Cancer data
• Source Colors
• Batch Colors
• Biological Class Colors
• Biological Class Colors & Symbols
• Biological Class Symbols
• Source Colors
• PC 1-2 & DWD direction
• DWD Source Adjustment
• Source Adj’d, PCA view
• Source Adj’d, Class Colored
• Source Adj’d, Batch Colored
• Source Adj’d, 5 PCs
• S. Adj’d, Batch 1, 2 vs. 3 DWD
• S. & B 1, 2 vs. 3 Adjusted
• S. & B 1, 2 vs. 3 Adj’d, 5 PCs
• S. & B Adj’d, B 1 vs. 2 DWD
• S. & B Adj’d, B 1 vs. 2 Adj’d
• S. & B Adj’d, 5 PC view
• S. & B Adj’d, 4 PC view
• S. & B Adj’d, Class Colors
• S. & B Adj’d, Adj’d PCA

DWD Bias Adjustment for Microarrays
• Effective for Batch and Source Adj.
• Also works for cross-platform Adj.
  - E.g. cDNA & Affy
  - Despite literature claiming contrary: “Gene by Gene” vs. “Multivariate” views
• Funded as part of caBIG “Cancer BioInformatics Grid”
  - “Data Combination Effort” of NCI

Interesting Benchmark Data Set
• NCI 60 Cell Lines
  - Interesting benchmark, since same cells
  - Data Web available: http://discover.nci.nih.gov/datasetsNature2000.jsp
  - Both cDNA and Affymetrix Platforms
• 8 Major cancer subtypes
• Use DWD now for visualization

NCI 60: Fully Adjusted Data, Leukemia Cluster
[Figure labels: LEUK.CCRFCEM, LEUK.K562, LEUK.MOLT4, LEUK.HL60, LEUK.RPMI8266, LEUK.SR]

NCI 60: Views using DWD Dir’ns (focus on biology)
[Figure]

Why not adjust by means?
• DWD is complicated: value added?
• Xuxin Liu example…
• Key is sizes of biological subtypes
• Differing ratio trips up mean
• But DWD more robust (although still not perfect)

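
How a differing subtype ratio “trips up” mean adjustment can be seen in a one-gene sketch. The numbers below are hypothetical and chosen so that there is no true batch effect at all; this is not Xuxin Liu’s example, only a minimal illustration of the same mechanism.

```python
import numpy as np

# Hypothetical one-gene example: subtype A has expression -1, subtype B has +1,
# and there is NO true batch effect.
batch1 = np.array([-1.0] * 8 + [1.0] * 2)   # 80% subtype A, 20% subtype B
batch2 = np.array([-1.0] * 2 + [1.0] * 8)   # 20% subtype A, 80% subtype B

# Mean adjustment subtracts each batch's own mean.
adj1 = batch1 - batch1.mean()
adj2 = batch2 - batch2.mean()

# After adjustment, the same subtype sits at different values in the two batches:
# the adjustment has created a spurious batch difference.
subtype_a_gap = adj2[0] - adj1[0]
print(f"subtype A, batch 2 minus batch 1 after mean adjustment: {subtype_a_gap:+.1f}")
```

The batch means differ only because the subtype mix differs, so subtracting them shifts identical biology apart.
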
Twiddle ratios of subtypes
[Figure]

DWD in Face Recognition, I
• Face Images as Data (with M. Benito & D. Peña)
• Registered using landmarks
• Male – Female Difference?
• Discrimination Rule?

DWD in Face Recognition, II
• DWD Direction
• Good separation
• Images “make sense”
• Garbage at ends? (extrapolation effects?)

DWD in Face Recognition, III
• Interesting summary:
• Jump between means (in DWD direction)
• Clear separation of Maleness vs. Femaleness

DWD in Face Recognition, IV
• Fun Comparison:
• Jump between means (in SVM direction)
• Also distinguishes Maleness vs. Femaleness
• But not as well as DWD

DWD in Face Recognition, V
Analysis of difference: Project onto normals
• SVM has “small gap” (feels noise artifacts?)
• DWD “more informative” (feels real structure?)

DWD in Face Recognition, VI
• Current Work:
• Focus on “drivers”: (regions of interest)
• Relation to Discr’n?
• Which is “best”?
• Lessons for human perception?

Time Series of Curves
• Chemical Spectra, evolving over time (with J. Wendelberger & E. Kober)
• Mortality curves changing in time (with Andres Alonzo)

Discrimination for m-reps
• Classification for Lie Groups – Symm. Spaces
• S. K. Sen, S. Joshi & M. Foskey
• What is “separating plane” (for SVM-DWD)?

Blood vessel tree data
Marron’s brain:
• Segmented from MRA
• Reconstruct trees in 3d
• Rotate to view
[Figures: sequence of rotated views]

Blood vessel tree data
[Figures: trees from several people]
Now look over many people (data objects)
Structure of population (understand variation?)
PCA in strongly non-Euclidean Space???

Blood vessel tree data
Possible focus of analysis:
• Connectivity structure only (topology)
• Location, size, orientation of segments
• Structure within each vessel segment

Blood vessel tree data
Present Focus: Topology only
• Already challenging
• Later address others
• Then add attributes to tree nodes
• And extend analysis

Blood vessel tree data
The tree team: Very Interdisciplinary
• Neurosurgery: Bullitt, Ladha
• Statistics: Wang, Marron
• Optimization: Aydin, Pataki

Blood vessel tree data
Recall from above: Marron’s brain:
• Focus on back
• Connectivity (topology) only

Blood vessel tree data
Present Focus:
• Topology only
• Raw data as trees
• Marron’s reduced tree
• Back tree only

Blood vessel tree data
Topology only, e.g. Back Trees
Full Population Study as movie
Understand variation?

Strongly Non-Euclidean Spaces
Statistics on Population of Tree-Structured Data Objects?
• Mean???
• Analog of PCA???
Strongly non-Euclidean, since:
• Space of trees not a linear space
• Not even approximately linear (no tangent plane)

Mildly Non-Euclidean Spaces
Useful View of Manifold Data: Tangent Space
Center: Fréchet Mean
Reason for terminology “mildly non Euclidean”

Strongly Non-Euclidean Spaces
Mean of Population of Tree-Structured Data Objects?
Natural approach: Fréchet mean
Requires a metric (distance) on tree space

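
A Fréchet mean only needs a distance, which the following sketch makes concrete. Two things here are assumptions for illustration: trees are stored as sets of node labels, with distance equal to the number of nodes present in exactly one of the two trees, and the minimizer is searched only among the sample trees, which restricts the true Fréchet mean to a finite candidate set.

```python
from typing import Callable, Hashable, Iterable, Sequence

def frechet_mean(candidates: Iterable[Hashable],
                 data: Sequence[Hashable],
                 dist: Callable) -> Hashable:
    """Candidate minimizing the sum of squared distances to the data."""
    return min(candidates, key=lambda c: sum(dist(c, x) ** 2 for x in data))

def tree_distance(s: frozenset, t: frozenset) -> int:
    """Number of nodes present in exactly one of the two trees."""
    return len(s ^ t)

# Hypothetical binary trees, stored as sets of node labels
# (1 = root; the children of node k are 2k and 2k + 1).
trees = [
    frozenset({1, 2, 3}),
    frozenset({1, 2, 3, 4}),
    frozenset({1, 2}),
    frozenset({1, 3}),
]

center = frechet_mean(trees, trees, tree_distance)
print(sorted(center))
```

The same `frechet_mean` function works unchanged for any data type once a suitable `dist` is supplied, which is the point of the slide: the metric is the only required ingredient.
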
Strongly Non-Euclidean Spaces
PCA on Tree Space?
• Recall Conventional PCA:
  - Directions that explain structure in data
  - Data are points in point cloud
  - 1-d and 2-d projections allow insights about population structure

[Figure: Illust’n of PCA View: PC1 Projections]
[Figure: Illust’n of PCA View: Projections on PC1, 2 plane]
[Figure: Source Batch Adj: PC 1-3 & DWD direction]
[Figure: Source Batch Adj: DWD Source Adjustment]

Strongly Non-Euclidean Spaces
PCA on Tree Space?
Key Idea (Jim Ramsay):
• Replace 1-d subspace that best approximates data
• By 1-d representation that best approximates data
Wang and Marron (2007) define notion of Treeline (in structure space)

Strongly Non-Euclidean Spaces
PCA on Tree Space: Treeline
• Best 1-d representation of data
Basic idea:
• From some starting tree
• Grow only in 1 “direction”

Strongly Non-Euclidean Spaces
PCA on Tree Space: Treeline
• Best 1-d representation of data
Problem: Hard to compute, in particular to solve optimization problem
Wang and Marron (2007):
• Maximum 4 vessel trees
• Hard to tackle serious trees (e.g. blood vessel trees)

Strongly Non-Euclidean Spaces
PCA on Tree Space: Treeline
Problem: Hard to compute
Solution: Burcu Aydin & Gabor Pataki (linear time algorithm)
(based on clever “reformulation” of problem)

PCA for blood vessel tree data
PCA on Tree Space: Treelines
Interesting to compare:
• Population of Left Trees
• Population of Right Trees
• Population of Back Trees
And to study 1st, 2nd, 3rd & 4th treelines

PCA for blood vessel tree data
Study “Directions” 1, 2, 3, 4
For subpopulations B, L, R (interpret later)

PCA for blood vessel tree data
Notes on Treeline Directions:
• PC1 always to left
• BACK has most variation to right (PC2)
• LEFT has more varia’n to 2nd level (PC2)
• RIGHT has more var’n to 1st level (PC2)
See these in the data?

PCA for blood vessel tree data
Notes:
• PC1: all left
• PC2: BACK - right; LEFT - 2nd lev; RIGHT - 1st lev
See these??

Strongly Non-Euclidean Spaces
PCA on Tree Space: Treeline
Next represent data as projections
• Define as closest point in tree line (same as Euclidean PCA)
• Have corresponding score (length of projection along line)
• And analog of residual (distance from data point to projection)

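
The three quantities above (projection, score, residual) can be sketched in a few lines. The representation is an assumption for illustration: trees are sets of node labels, the distance counts nodes present in exactly one tree, and a treeline is a nested sequence that adds one node at a time to a starting tree. This is a simplified stand-in, not the algorithm of Wang and Marron or of Aydin and Pataki.

```python
def tree_distance(s: frozenset, t: frozenset) -> int:
    """Number of nodes present in exactly one of the two trees."""
    return len(s ^ t)

def project_onto_treeline(tree: frozenset, treeline: list):
    """Return (score, projection, residual) for one tree.

    treeline[k] is the starting tree with the first k path nodes added, so the
    score k is the position of the closest treeline member along the line.
    """
    score = min(range(len(treeline)), key=lambda k: tree_distance(tree, treeline[k]))
    projection = treeline[score]
    residual = tree_distance(tree, projection)
    return score, projection, residual

# Hypothetical treeline: start at the root and grow one node at a time, always
# to the left (1 = root; the children of node k are 2k and 2k + 1).
start = frozenset({1})
path = [2, 4, 8]
treeline = [start]
for node in path:
    treeline.append(treeline[-1] | {node})

# Project a hypothetical data tree onto the treeline.
score, projection, residual = project_onto_treeline(frozenset({1, 2, 3, 4}), treeline)
print(score, sorted(projection), residual)
```
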
PCA for blood vessel tree data
[Figure: Individual (each PC separately) Scores Plot]

PCA for blood vessel tree data
Data Analytic Goals: Age, Gender
See these? No…

PCA for blood vessel tree data
Directly study age: PC scores, PC1 + PC2
• Thickness Not Sig’t
• Descendants Left Sig’t

Upcoming New Approach
Replace Tree-Lines by Tree-Curves:
[Figure]

Upcoming New Approach
Projections on Tree-Curves:
[Figure]

Preliminary Tree-Curve Results
First Correlation of Structure to Age! (Back Trees)

Preliminary Tree-Curve Results
But does not appear everywhere (Left Trees)
Finding locality!

HDLSS Asymptotics
Why study asymptotics?
• An interesting (naïve) quote: “I don’t look at asymptotics, because I don’t have an infinite sample size”
• Suggested perspective: asymptotics are a tool for finding simple structure underlying complex entities

HDLSS Asymptotics
Which asymptotics?
• n → ∞ (classical, very widely done)
• d → ∞ ???
  - Sensible?
  - Follow typical “sampling process”?
  - Say anything, as noise level increases???

HDLSS Asymptotics
Which asymptotics?
• n → ∞ & d → ∞
  - n >> d: a few results around (still have classical info in data)
  - n ~ d: random matrices (Iain J., et al) (nothing classically estimable)
• HDLSS asymptotics: n fixed, d → ∞

HDLSS Asymptotics
HDLSS asymptotics: n fixed, d → ∞
• Follow typical “sampling process”?
  - Microarrays: # genes bounded
  - Proteomics, SNPs, …
  - A moot point, from perspective: asymptotics are a tool for finding simple structure underlying complex entities
• Say anything, as noise level increases???
  - Yes, there exists simple, perhaps surprising, underlying structure

HDLSS Asymptotics: Simple Paradoxes, I
For d dim’al “Standard Normal” dist’n: Z ~ N_d(0, I_d)
Euclidean Distance to Origin (as d → ∞): ‖Z‖ = √d + O_p(1)
- Data lie roughly on surface of sphere of radius √d
- Yet origin is point of “highest density”???
- Paradox resolved by: “density w.r.t. Lebesgue Measure”

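
The concentration of standard normal data near a sphere of radius √d is easy to check by simulation. The sketch below is illustrative only; the function name, dimensions, number of draws, and seed are assumptions.

```python
import numpy as np

def norm_summary(d, n_draws=50, seed=2):
    """Mean of ||Z|| / sqrt(d) and std of ||Z|| over standard normal draws in R^d."""
    rng = np.random.default_rng(seed)
    Z = rng.normal(size=(n_draws, d))
    norms = np.linalg.norm(Z, axis=1)
    return norms.mean() / np.sqrt(d), norms.std()

for d in (10, 1000, 10000):
    ratio, spread = norm_summary(d)
    print(f"d={d:6d}  mean ||Z||/sqrt(d) = {ratio:.4f}  std of ||Z|| = {spread:.3f}")
```

The ratio approaches 1 while the spread of the norms stays bounded as d grows, so relative to the radius √d the data look like they sit on the surface of a sphere.
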
HDLSS Asymptotics: Simple Paradoxes, II
For d dim’al “Standard Normal” dist’n: Z1 indep. of Z2, both N_d(0, I_d)
Euclidean Dist. between Z1 and Z2 (as d → ∞): ‖Z1 − Z2‖ = √(2d) + O_p(1)
- Distance tends to non-random constant (after scaling by √d)
- Can extend to Z1, …, Zn
- Where do they all go??? (we can only perceive 3 dim’ns)

HDLSS Asymptotics: Simple Paradoxes, III
For d dim’al “Standard Normal” dist’n: Z1 indep. of Z2, both N_d(0, I_d)
High dim’al Angles (as d → ∞): Angle(Z1, Z2) = 90° + O_p(d^(−1/2))
- “Everything is orthogonal”???
- Where do they all go??? (again our perceptual limitations)
- Again 1st order structure is non-random

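
Both paradoxes, near-constant pairwise distance and near-orthogonality, can be checked with two independent standard normal vectors. This is a minimal simulation sketch; the function name, dimension, and seed are assumptions.

```python
import numpy as np

def distance_and_angle(d, seed=3):
    """Scaled distance and angle (degrees) between two independent N_d(0, I) draws."""
    rng = np.random.default_rng(seed)
    z1, z2 = rng.normal(size=(2, d))
    dist_ratio = np.linalg.norm(z1 - z2) / np.sqrt(2 * d)
    cos = z1 @ z2 / (np.linalg.norm(z1) * np.linalg.norm(z2))
    angle_deg = np.degrees(np.arccos(cos))
    return dist_ratio, angle_deg

for d in (10, 1000, 100000):
    ratio, angle = distance_and_angle(d)
    print(f"d={d:6d}  ||Z1-Z2||/sqrt(2d) = {ratio:.4f}  angle = {angle:.2f} deg")
```
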
HDLSS Asy’s: Geometrical Representation, I
Assume Z1, …, Zn ~ N_d(0, I_d), let d → ∞
Study Subspace Generated by Data
a. Hyperplane through 0, of dimension n
b. Points are “nearly equidistant to 0”, & dist ~ √d
c. Within plane, can “rotate towards √d × Unit Simplex”
d. All Gaussian data sets are “near Unit Simplex Vertices”!!!
“Randomness” appears only in rotation of simplex
With P. Hall & A. Neeman

HDLSS Asy’s: Geometrical Representation, II
Assume Z1, …, Zn ~ N_d(0, I_d), let d → ∞
Study Hyperplane Generated by Data
a. n − 1 dimensional hyperplane
b. Points are pairwise equidistant, dist ~ √(2d)
c. Points lie at vertices of “regular n-hedron”
d. Again “randomness in data” is only in rotation
e. Surprisingly rigid structure in data?

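
The rigid simplex structure can be read off the scaled Gram matrix, since inner products determine a point configuration up to rotation. The sketch below is a simulation under assumed sizes (n = 5, d = 200000) and is not the simulation shown in the talk.

```python
import numpy as np

rng = np.random.default_rng(4)
n, d = 5, 200000                    # hypothetical sizes: n fixed, d large
X = rng.normal(size=(n, d))         # rows are the data vectors Z1, ..., Zn

# Scaled Gram matrix of inner products <Zi, Zj> / d.
G = X @ X.T / d

# Pairwise distances of the scaled data Zi / sqrt(d), computed from G.
sq = np.diag(G)[:, None] + np.diag(G)[None, :] - 2.0 * G
pairwise = np.sqrt(np.maximum(sq, 0.0))

print(np.round(G, 3))
print(np.round(pairwise, 3))
```

A Gram matrix close to the identity means the scaled points are nearly orthonormal, that is, nearly the vertices of a unit simplex after a rotation. The off-diagonal distances are all close to √2, which is the pairwise equidistance on the slide.
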
HDLSS Asy’s: Geometrical Representation, III
Simulation View: shows “rigidity after rotation”
[Figure]

HDLSS Asy’s: Geometrical Representation, III
Straightforward Generalizations:
• non-Gaussian data: only need moments
• non-independent: use “mixing conditions” (with P. Hall & A. Neeman)
• Mild Eigenvalue condition on Theoretical Cov. (with J. Ahn, K. Muller & Y. Chi)
• Mixing Condition on Stand’d & Permuted Var’s (with S. Jung)
All based on simple “Laws of Large Numbers”

2nd Paper on HDLSS Asymptotics
Ahn, Marron, Muller & Chi (2007) Biometrika
For […]:
• Assume 2nd Moments (and Gaussian)
• Assume no eigenvalues too large in sense: for […], assume […], i.e. […] (min possible)
(much weaker than previous mixing conditions…)

HDLSS Asy’s: Geometrical Representation, IV
Explanation of Observed (Simulation) Behavior: “everything similar for very high d”
• 2 popn’s are 2 simplices (i.e. regular n-hedrons)
• All are same distance from the other class
• i.e. everything is a support vector
• i.e. all sensible directions show “data piling”
• so “sensible methods are all nearly the same”
• Including 1-NN

HDLSS Asy’s: Geometrical Representation, V
Further Consequences of Geometric Representation
1. Inefficiency of DWD for uneven sample size (motivates “weighted version”, work in progress)
2. DWD more “stable” than SVM (based on “deeper limiting distributions”) (reflects intuitive idea “feeling sampling variation”) (something like “mean vs. median”)
3. 1-NN rule inefficiency is quantified.

HDLSS Math. Stat. of PCA, I
Consistency & Strong Inconsistency:
Spike Covariance Model (Johnstone & Paul)
For Eigenvalues: […]
1st Eigenvector: […]
How good are empirical versions, as estimates?

HDLSS Math. Stat. of PCA, II
Consistency (big enough spike): For […], […]
Strong Inconsistency (spike not big enough): For […], […]

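
The contrast between a big and a small spike can be explored by simulation. The parametrization below is an assumption, since the formulas on these slides are not recoverable: the first population eigenvalue is taken to be d^α with all others equal to 1, the true first eigenvector is the first coordinate axis, and the angle between it and the first sample eigenvector is reported for one large and one small α.

```python
import numpy as np

def pc1_angle_deg(d, alpha, n=10, seed=5):
    """Angle (degrees) between the first sample PC direction and the true one.

    Assumed model: Gaussian data with covariance diag(d**alpha, 1, ..., 1),
    so the true first eigenvector is e1 = (1, 0, ..., 0).
    """
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, d))
    X[:, 0] *= np.sqrt(d ** alpha)            # inflate the spike coordinate
    Xc = X - X.mean(axis=0)                   # center, as in ordinary PCA
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    cos = abs(Vt[0, 0])                       # |<u1_hat, e1>|
    return float(np.degrees(np.arccos(min(cos, 1.0))))

big_spike = pc1_angle_deg(d=2000, alpha=1.5)     # small angle expected
small_spike = pc1_angle_deg(d=2000, alpha=0.2)   # angle near 90 degrees expected
print(f"alpha=1.5: {big_spike:.1f} deg, alpha=0.2: {small_spike:.1f} deg")
```

With a fixed sample size of n = 10, the estimated direction is either nearly exact or nearly orthogonal to the truth depending on the spike size, which matches the consistency versus strong inconsistency dichotomy named on the slide.
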
HDLSS Math. Stat. of PCA, III
Consistency of eigenvalues?
• Eigenvalues Inconsistent
• But known distribution
• Unless […] as well

HDLSS Work in Progress, I
Batch Adjustment: Xuxin Liu
Recall Intuition from above:
• Key is sizes of biological subtypes
• Differing ratio trips up mean
• But DWD more robust
Mathematics behind this?

Liu: Twiddle ratios of subtypes
[Figure]

HDLSS Data Combo Mathematics
Xuxin Liu Dissertation Results:
• Simple Unbalanced Cluster Model
• Growing at rate […]
• Answers depend on […] as […]
Visualization of setting….
[Figures]

HDLSS Data Combo Mathematics
Asymptotic Results (as […]):
• For […], DWD Consistent: Angle(DWD, Truth) […]
• For […], DWD Strongly Inconsistent: Angle(DWD, Truth) […]

HDLSS Data Combo Mathematics
Asymptotic Results (as […]):
• For […], PAM Inconsistent: Angle(PAM, Truth) […]
• For […], PAM Strongly Inconsistent: Angle(PAM, Truth) […]

HDLSS Data Combo Mathematics
Value of […], for sample size ratio […]:
• […], only when […]
• Otherwise for […], PAM Inconsistent
• Verifies intuitive idea in strong way

HDLSS Work in Progress, II
Canonical Correlations: Myung Hee Lee
• Results similar to those for PCA
• Singular values inconsistent
• But directions converge under a much milder spike assumption.

HDLSS Work in Progress, III
Conditions for Geo. Rep’n & PCA Consist.:
John Kent example: […]
Can only say: […] not deterministic
Conclude: need some flavor of mixing

HDLSS Work in Progress, III
Conditions for Geo. Rep’n & PCA Consist.:
Conclude: need some flavor of mixing
Challenge: Classical mixing conditions require notion of time ordering
Not always clear, e.g. microarrays

HDLSS Work in Progress, III
Conditions for Geo. Rep’n & PCA Consist.:
Sungkyu Jung Condition: […] where […]
Define: […]
Assume: ∃ a permutation, so that […] is ρ-mixing

HDLSS Deep Open Problem
In PCA Consistency:
• Strong Inconsistency: spike […]
• Consistency: spike […]
What happens at boundary ([…])???

The Future of HDLSS Asymptotics?
1. Address your favorite statistical problem…
2. HDLSS versions of classical optimality results?
3. Contiguity Approach (~Random Matrices)
4. Rates of convergence?
5. Improved Discrimination Methods?
It is early days…

The Future of Geometrical Representation?
• HDLSS version of “optimality” results?
• “Contiguity” approach?
• Rates of Convergence?
• Improvements of DWD? (e.g. other functions of distance than inverse)
• Params depend on d?
It is still early days…

Some Carry Away Lessons
• Atoms of the Analysis: Object Oriented
• Viewpoint: Object Space ↔ Feature Space
• DWD is attractive for HDLSS classification
• “Randomness” in HDLSS data is only in rotations (modulo rotation, have constant simplex shape)
• How to put HDLSS asymptotics to work?

Object Oriented Data Analysis
Potential Future Opportunity: OODA SAMSI Program 2010-2011
Interested in joining? Let’s talk