L 15: Microarray analysis (Classification) March 5, 2021 Bafna

Final exam syllabus
• Take home
• The entire course, but emphasis will be given to post-midterm lectures
• HMMs
• Gene-finding
• Mass spectrometry
• Micro-array analysis

Biol. Data analysis: Review
[Figure: Assembly; Sequence Analysis / DNA signals; Protein Sequence Analysis; Gene Finding]

Other static analysis is possible
[Figure: Genomic Analysis / Pop. Genetics; Assembly; Protein Sequence Analysis; Gene Finding; ncRNA]

A static picture of the cell is insufficient
• Each cell is continuously active
– Genes are being transcribed into RNA
– RNA is translated into proteins
– Proteins are post-translationally modified and transported
– Proteins perform various cellular functions
• Can we probe the cell dynamically?
– Which transcripts are active?
– Which proteins interact?
[Figure: Gene Regulation; Transcript profiling; Proteomic profiling]

Protein quantification via LC-MS maps
• A peptide/feature can be labeled with the triple (M, T, I): monoisotopic m/z, centroid retention time, and intensity
• An LC-MS map is a collection of features
[Figure: elution peaks of Peptide 1 and Peptide 2 (intensity I over time), and an LC-MS map with features plotted as m/z against time]

LC-Map Comparison for Quantification
[Figure: Map 1 (normal) compared with Map 2 (diseased)]

Quantitation: transcript versus Protein Expression

            Sample 1   Sample 2
Protein 1   35         4
Protein 2
Protein 3

Our goal is to construct a matrix as shown for proteins and RNA, and use it to identify differentially expressed transcripts/proteins.

Transcript quantification
• Instead of counting protein molecules, we count active transcripts in the cell.

Transcript quantification with RNAseq

Quantification via microarrays

Quantitation: transcript versus Protein Expression

            Sample 1   Sample 2
mRNA 1      100        20
Protein 1   35         4
Protein 2
Protein 3

Our goal is to construct a matrix as shown for proteins and RNA, and use it to identify differentially expressed transcripts/proteins.
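One simple way to use such a matrix is a log fold change between the two samples. The sketch below is illustrative only: it assumes the table layout read from the slide (mRNA 1: 100 vs 20, Protein 1: 35 vs 4), and the function name, the pseudocount, and the 2-fold cutoff are hypothetical choices that are not part of the lecture.

```python
import math

# Expression matrix as read from the slide: rows are molecules,
# values are (Sample 1, Sample 2). The layout is an assumption.
expression = {
    "mRNA 1": (100.0, 20.0),
    "Protein 1": (35.0, 4.0),
}


def log2_fold_change(sample1, sample2, pseudocount=1.0):
    """log2 ratio of two expression values; the pseudocount avoids log(0)."""
    return math.log2((sample1 + pseudocount) / (sample2 + pseudocount))


# Hypothetical cutoff: flag a row if it changes by more than 2-fold
# (|log2 ratio| > 1) in either direction.
THRESHOLD = 1.0

differential = {
    name: log2_fold_change(s1, s2)
    for name, (s1, s2) in expression.items()
    if abs(log2_fold_change(s1, s2)) > THRESHOLD
}
```

With real data, a statistical test over replicates would normally replace a fixed cutoff.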

Gene Expression Data
• Gene expression data:
– Each row corresponds to a gene
– Each column corresponds to a sample (experiment); each entry is an expression value
• Can we separate the experiments into two or more classes?
• Given a training set of two classes, can we build a classifier that places a new experiment in one of the two classes?
[Figure: expression matrix with genes g as rows and samples s1, s2, ... as columns]

Formalizing Classification
• Classification problem: Find a surface (hyperplane) that will separate the classes
• Given a new sample point, its class is then determined by which side of the surface it lies on.
• How do we find the hyperplane? How do we find the side that a point lies on?
[Figure: table of expression values of genes g1 and g2 over samples 1-6, with the samples plotted as points]

Basic geometry
• What is $\|x\|^2$?
• What is $x/\|x\|$?
• Dot product?
[Figure: vectors $x = (x_1, x_2)$ and $y$]

Dot Product
• Let $\beta$ be a unit vector.
– $\|\beta\| = 1$
• Recall that
– $\beta^T x = \|x\| \cos\theta$, where $\theta$ is the angle between $\beta$ and $x$
• What is $\beta^T x$ if $x$ is orthogonal (perpendicular) to $\beta$?
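The three geometry questions can be checked numerically. This is a minimal sketch in plain Python; the helper names `dot`, `norm`, and `unit`, the variable `beta` for the unit vector, and the sample vectors are chosen here for illustration.

```python
import math


def dot(u, v):
    """Dot product u^T v."""
    return sum(a * b for a, b in zip(u, v))


def norm(x):
    """Euclidean length ||x||."""
    return math.sqrt(dot(x, x))


def unit(x):
    """x / ||x||: the vector of length 1 pointing in the direction of x."""
    n = norm(x)
    return [a / n for a in x]


x = [3.0, 4.0]
beta = unit([1.0, 0.0])              # a unit vector along the first axis

squared_norm = dot(x, x)             # ||x||^2 = x^T x
projection_length = dot(beta, x)     # ||x|| cos(theta): length of x along beta
orthogonal = dot(beta, [0.0, 2.0])   # 0, because cos(90 degrees) = 0
```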

Hyperplane
• How can we define a hyperplane L?
• Find the unit vector $\beta$ that is perpendicular (normal) to the hyperplane

Points on the hyperplane
• Consider a hyperplane L defined by unit vector $\beta$, and distance $\beta_0$
• Notes:
– For all $x \in L$, $x^T\beta$ must be the same, $x^T\beta = \beta_0$
– For any two points $x_1, x_2 \in L$, $(x_1 - x_2)^T\beta = 0$

Hyperplane properties
• Given an arbitrary point $x$, what is the distance from $x$ to the plane L?
– $D(x, L) = (\beta^T x - \beta_0)$
• When are points $x_1$ and $x_2$ on different sides of the hyperplane?
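A short sketch of the signed distance and the side test. It assumes the hyperplane is given by a unit normal (called `beta` here) and an offset (`beta0`), so that the distance is the dot product minus the offset. Two points lie on different sides exactly when their signed distances have opposite signs. The function names and numbers are illustrative.

```python
def signed_distance(x, beta, beta0):
    """D(x, L) = beta^T x - beta0, for a unit normal vector beta."""
    return sum(b * xi for b, xi in zip(beta, x)) - beta0


def opposite_sides(x1, x2, beta, beta0):
    """True when x1 and x2 are strictly on different sides of the hyperplane."""
    d1 = signed_distance(x1, beta, beta0)
    d2 = signed_distance(x2, beta, beta0)
    return d1 * d2 < 0


beta, beta0 = [0.6, 0.8], 1.0   # 0.6^2 + 0.8^2 = 1, so beta is a unit vector
d = signed_distance([3.0, 4.0], beta, beta0)
```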

Separating by a hyperplane
• Input: A training set of +ve and -ve examples
• Goal: Find a hyperplane that separates the two classes.
• Classification: A new point $x$ is +ve if it lies on the +ve side of the hyperplane, -ve otherwise.
• The hyperplane is represented by the line $\{x : -\beta_0 + \beta_1 x_1 + \beta_2 x_2 = 0\}$

Error in classification
• An arbitrarily chosen hyperplane might not separate the training set. We need to minimize a misclassification error.
• Error: sum of distances of the misclassified points.
• Let $y_i = -1$ for misclassified +ve example $i$,
– $y_i = 1$ otherwise.
• Other definitions are also possible.
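The error formula itself is missing from this transcript. A minimal sketch, assuming the error is the sum over misclassified points of y_i times the signed distance (which makes every term positive under the sign convention above). The +1/-1 class labels, the function names, and the data are assumptions made for illustration.

```python
def decision_value(x, beta, beta0):
    """Signed distance beta^T x - beta0 (beta assumed to be a unit vector)."""
    return sum(b * xi for b, xi in zip(beta, x)) - beta0


def misclassification_error(points, labels, beta, beta0):
    """Sum of distances of the misclassified points to the hyperplane.

    labels holds the true class of each point: +1 (+ve) or -1 (-ve).
    """
    error = 0.0
    for x, label in zip(points, labels):
        d = decision_value(x, beta, beta0)
        if label * d < 0:        # the point is on the wrong side
            y = -label           # y_i = -1 for a misclassified +ve example
            error += y * d       # equals |d|
    return error


points = [[2.0, 0.0], [0.0, 0.0], [-1.0, 0.0], [3.0, 0.0]]
labels = [+1, +1, -1, -1]
error = misclassification_error(points, labels, beta=[1.0, 0.0], beta0=1.0)
```

Here the second point (+ve, distance -1) and the fourth point (-ve, distance 2) are misclassified, so the error is 3.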

Gradient Descent
• The function $D(\beta)$ defines the error.
• We follow an iterative refinement. In each step, refine $\beta$ so the error is reduced.
• Gradient descent is an approach to such iterative refinement.
[Figure: the error curve $D(\beta)$ and its derivative $D'(\beta)$]
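As a worked sketch of one refinement step: the formula for D is not in this transcript, so the form below is an assumption chosen to match "sum of distances of the misclassified points" with the y_i convention from the previous slide. M denotes the set of misclassified points and ρ is a step size; both symbols are introduced here.

```latex
% Assumed error: sum of distances of misclassified points
D(\beta, \beta_0) = \sum_{i \in M} y_i \left( \beta^T x_i - \beta_0 \right)

% Gradient with respect to the parameters
\frac{\partial D}{\partial \beta} = \sum_{i \in M} y_i \, x_i ,
\qquad
\frac{\partial D}{\partial \beta_0} = - \sum_{i \in M} y_i

% One step against the gradient for a single misclassified point i
\beta \leftarrow \beta - \rho \, y_i \, x_i ,
\qquad
\beta_0 \leftarrow \beta_0 + \rho \, y_i
```

For a misclassified +ve point (y_i = -1) this moves β toward x_i and lowers β_0, so β^T x_i − β_0 increases, as desired. After an update β is in general no longer a unit vector, so the value is a true distance only after rescaling.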

Rosenblatt’s perceptron learning algorithm

Classification based on perceptron learning
• Use Rosenblatt’s algorithm to compute the hyperplane $L = (\beta, \beta_0)$.
• Assign $x$ to class 1 if $f(x) \ge 0$, and to class 2 otherwise, where $f(x) = \beta^T x - \beta_0$.
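The algorithm slide is an image and its text is not in this transcript. The following is a sketch of the standard perceptron update under the conventions used so far (decision value = dot product minus offset, class labels +1 / -1), not a reproduction of the slide. The function names, learning rate, epoch limit, and toy data are all assumptions.

```python
import random


def perceptron(points, labels, rate=1.0, max_epochs=1000, seed=0):
    """Perceptron updates for the hyperplane beta^T x - beta0 = 0.

    labels are +1 / -1. Returns (beta, beta0, converged).
    """
    rng = random.Random(seed)
    beta = [0.0] * len(points[0])
    beta0 = 0.0
    order = list(range(len(points)))
    for _ in range(max_epochs):
        rng.shuffle(order)               # visit the points in random order
        mistakes = 0
        for i in order:
            x, label = points[i], labels[i]
            f = sum(b * xi for b, xi in zip(beta, x)) - beta0
            if label * f <= 0:           # misclassified (or on the boundary)
                # Step against the gradient of the error for this point.
                beta = [b + rate * label * xi for b, xi in zip(beta, x)]
                beta0 -= rate * label
                mistakes += 1
        if mistakes == 0:                # a full pass with no errors
            return beta, beta0, True
    return beta, beta0, False


def classify(x, beta, beta0):
    """Class 1 if f(x) >= 0, class 2 otherwise."""
    f = sum(b * xi for b, xi in zip(beta, x)) - beta0
    return 1 if f >= 0 else 2


# Toy, linearly separable training set (made up for illustration).
points = [[2.0, 2.0], [3.0, 3.0], [2.5, 3.0], [0.0, 0.0], [-1.0, 0.0], [0.0, -1.0]]
labels = [+1, +1, +1, -1, -1, -1]
beta, beta0, converged = perceptron(points, labels)
```

The `max_epochs` limit is there because the loop would otherwise never end on data that is not linearly separable.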

Perceptron learning
• If many solutions are possible, it does not choose between solutions
• If the data is not linearly separable, the algorithm does not converge; the updates cycle.
• Time of convergence is not well understood

Supervised classification
• We have learnt a generic scheme for understanding biological data
– Represent each data point (gene expression samples) as a vector in high dimensional space.
– In supervised classification, we have two classes of points.
– We separate the classes using a linear surface (hyperplane)
– The perceptron algorithm is one of the simplest to implement

END OF LECTURE

Linear Discriminant analysis
• Provides an alternative approach to classification with a linear function.
• Project all points, including the means, onto vector $\beta$.
• We want to choose $\beta$ such that
– Difference of projected means is large.
– Variance within group is small

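Choosing the right $\beta$
• $\beta_1$ is a better choice than $\beta_2$ as the variance within a group is small, and difference of means is large.
• How do we compute the best $\beta$?
[Figure: +ve and -ve points in the ($x_1$, $x_2$) plane with two candidate directions $\beta_1$ and $\beta_2$]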

Linear Discriminant analysis
• Fisher Criterion

LDA cont’d
• What is the projection of a point $x$ onto $\beta$?
– Ans: $\beta^T x$
• What is the distance between projected means?

LDA cont’d
• Fisher Criterion

LDA
• Therefore, a simple computation (matrix inverse) is sufficient to compute the ‘best’ separating hyperplane
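The Fisher criterion formula is on image slides that did not survive in this transcript. The sketch below uses the standard textbook form, in which the best direction is proportional to the inverse of the within-class scatter matrix applied to the difference of class means. It is restricted to 2 dimensions so the matrix inverse can be written out by hand. The function names and toy points are assumptions.

```python
def mean(points):
    """Componentwise mean of a list of points."""
    n = len(points)
    return [sum(p[j] for p in points) / n for j in range(len(points[0]))]


def scatter(points, m):
    """2x2 scatter matrix: sum of (x - m)(x - m)^T over the points."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for p in points:
        d = [p[0] - m[0], p[1] - m[1]]
        for a in range(2):
            for b in range(2):
                s[a][b] += d[a] * d[b]
    return s


def fisher_direction(pos, neg):
    """Unit vector proportional to S_W^{-1} (m1 - m2), for 2-D points."""
    m1, m2 = mean(pos), mean(neg)
    s1, s2 = scatter(pos, m1), scatter(neg, m2)
    sw = [[s1[a][b] + s2[a][b] for b in range(2)] for a in range(2)]
    det = sw[0][0] * sw[1][1] - sw[0][1] * sw[1][0]
    if det == 0:
        raise ValueError("within-class scatter matrix is singular")
    # Closed-form inverse of a 2x2 matrix.
    inv = [[sw[1][1] / det, -sw[0][1] / det],
           [-sw[1][0] / det, sw[0][0] / det]]
    diff = [m1[0] - m2[0], m1[1] - m2[1]]
    w = [inv[0][0] * diff[0] + inv[0][1] * diff[1],
         inv[1][0] * diff[0] + inv[1][1] * diff[1]]
    length = (w[0] ** 2 + w[1] ** 2) ** 0.5
    return [w[0] / length, w[1] / length]


# Toy data, made up for illustration.
pos = [[4.0, 1.0], [5.0, 3.0], [6.0, 2.0]]
neg = [[0.0, 1.0], [1.0, 3.0], [2.0, 2.0]]
beta = fisher_direction(pos, neg)
pos_proj = [beta[0] * p[0] + beta[1] * p[1] for p in pos]
neg_proj = [beta[0] * p[0] + beta[1] * p[1] for p in neg]
```

For this data the direction is proportional to (2, -1), and the projected +ve points all lie above the projected -ve points.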

Supervised classification summary
• Most techniques for supervised classification are based on the notion of a separating hyperplane.
• The ‘optimal’ separation can be computed using various combinatorial (perceptron), algebraic (LDA), or statistical (ML) analyses.

The dynamic nature of the cell
• Proteomic and transcriptomic analyses provide a snapshot of the active proteins (transcripts) in a cell in a particular state (columns of the matrix)
• This snapshot allows us to classify the state of the cell.
• Other queries are also possible (not in the syllabus).
– Which genes behave in a similar fashion? Each row describes the behaviour of genes under different conditions. Solved by ‘clustering’ rows.
– Find a subset of genes that is predictive of the state of the cell. Using PCA, and other dimensionality reduction techniques.

Silly Quiz
• Who are these people, and what is the occasion?

Genome Sequencing and Assembly

DNA Sequencing
• DNA is double-stranded
• The strands are separated, and a polymerase is used to copy the second strand.
• Special bases terminate this process early.

PCA: motivating example
• Consider the expression values of 2 genes over 6 samples.
• Clearly, the expression of g1 is not informative, and it suffices to look at g2 values.
• Dimensionality can be reduced by discarding the gene g1
[Figure: the 6 samples plotted with axes g1 and g2]

Principal Components Analysis
• Consider the expression values of 2 genes over 6 samples.
• Clearly, the expression of the two genes is highly correlated.
• Projecting all the samples on a single line could explain most of the data.
• This is a generalization of “discarding the gene”.

Projecting
• Consider the mean of all points $m$, and a vector $\beta$ emanating from the mean
• Algebraically, this projection on $\beta$ means that all samples $x$ can be represented by a single value $\beta^T(x-m)$
[Figure: point $x$, mean $m$, the vector $x-m$, and its projection $\beta^T(x-m)$ onto $\beta$]

Higher dimensions
• Consider a set of 2 (k) orthonormal vectors $\beta_1, \beta_2, \ldots$
• Once projected, each sample $x$ can be represented by a 2 (k) dimensional vector
– $(\beta_1^T(x-m),\ \beta_2^T(x-m))$
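A small sketch of this projection: each sample is reduced to k coordinates, one dot product with the centered point per orthonormal vector. It can then be mapped back into the original space by adding the scaled basis vectors to the mean. The function names and the 3-dimensional example are illustrative.

```python
def project(x, m, basis):
    """Coordinates of x relative to the mean m in an orthonormal basis.

    Returns [beta_1^T (x - m), ..., beta_k^T (x - m)].
    """
    centered = [xi - mi for xi, mi in zip(x, m)]
    return [sum(b * c for b, c in zip(beta, centered)) for beta in basis]


def reconstruct(coords, m, basis):
    """Map the k coordinates back: m + sum_j a_j * beta_j."""
    point = list(m)
    for a, beta in zip(coords, basis):
        point = [p + a * b for p, b in zip(point, beta)]
    return point


m = [1.0, 1.0, 1.0]
basis = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]   # k = 2 orthonormal vectors
x = [3.0, 4.0, 5.0]

coords = project(x, m, basis)            # 2 numbers instead of 3
approx = reconstruct(coords, m, basis)   # the third coordinate falls back to the mean
```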

How to project
• The generic scheme allows us to project an m dimensional surface into a k dimensional one.
• How do we select the k ‘best’ dimensions?
• The strategy used by PCA is one that maximizes the variance of the projected points around the mean

PCA
• Suppose all of the data were to be reduced by projecting to a single line $\beta$ from the mean.
• How do we select the line $\beta$?

PCA cont’d
• Let each point $x_k$ map to $x'_k = m + a_k\beta$. We want to minimize the error, the sum of squared distances $\sum_k \|x_k - x'_k\|^2$
• Observation 1: Each point $x_k$ maps to $x'_k = m + \beta\beta^T(x_k - m)$
– ($a_k = \beta^T(x_k - m)$)

Proof of Observation 1
• Differentiating w.r.t. $a_k$
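The derivation on this slide is an image and is missing from the transcript. A sketch of the step, assuming the error is the sum of squared distances between each point and its image on the line, and that β is a unit vector:

```latex
E = \sum_k \left\| x_k - m - a_k \beta \right\|^2

\frac{\partial E}{\partial a_k}
  = -2\, \beta^T \left( x_k - m - a_k \beta \right)
  = -2\, \beta^T (x_k - m) + 2\, a_k\, \beta^T \beta
  = 0

% Using \beta^T \beta = 1:
a_k = \beta^T (x_k - m)
```

Only the k-th term of the sum depends on a_k, so each coefficient can be optimized independently of the others.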

Minimizing PCA Error
• To minimize error, we must maximize $\beta^T S \beta$, where $S = \sum_k (x_k - m)(x_k - m)^T$
• Maximizing subject to $\|\beta\| = 1$ requires $S\beta = \lambda\beta$, so $\lambda = \beta^T S \beta$ is an eigenvalue of $S$, and $\beta$ the corresponding eigenvector.
• Therefore, we must choose the eigenvector corresponding to the largest eigenvalue.

PCA steps
• X = starting matrix with n columns and m rows ($x_j$ denotes the j-th column)
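The remaining steps are not in this transcript. The sketch below follows the earlier slides: center the samples on their mean, form the scatter matrix, and take the eigenvector with the largest eigenvalue. Power iteration is used here only to keep the example dependency-free; it is a substitute for a library eigen-solver, not something the lecture specifies. The function name, iteration count, and toy data are assumptions.

```python
def pca_first_component(samples, iterations=500):
    """First principal direction of a list of m-dimensional samples.

    samples are the columns x_j of X. Returns (mean, beta), where beta
    approximates the top eigenvector of S = sum (x - mean)(x - mean)^T.
    """
    n, dim = len(samples), len(samples[0])
    mean = [sum(x[j] for x in samples) / n for j in range(dim)]
    centered = [[x[j] - mean[j] for j in range(dim)] for x in samples]
    S = [[sum(c[a] * c[b] for c in centered) for b in range(dim)]
         for a in range(dim)]

    # Power iteration: repeatedly apply S and renormalize. The starting
    # guess must not be orthogonal to the top eigenvector.
    beta = [1.0] * dim
    for _ in range(iterations):
        nxt = [sum(S[a][b] * beta[b] for b in range(dim)) for a in range(dim)]
        length = sum(v * v for v in nxt) ** 0.5
        if length == 0:          # degenerate case: no variance along beta
            break
        beta = [v / length for v in nxt]
    return mean, beta


# Toy data: two perfectly correlated "genes" over 4 samples (made up).
samples = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
mean, beta = pca_first_component(samples)

# Each sample is now represented by the single value beta^T (x - mean).
reduced = [sum(b * (xi - mi) for b, xi, mi in zip(beta, x, mean)) for x in samples]
```

For this data the direction comes out proportional to (1, 2), the line on which all four samples lie.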
