Fuzzy K means


Fuzzy K means
• A gene can be assigned to several clusters.
• Each gene is assigned to each cluster with a membership value between 0 and 1.
• The membership values of a gene add up to one.
• Genes with low membership values are not well represented by the cluster centroid.
• The expression of genes with high membership values is close to the cluster centroid.

Centroid
• During the centroid refinement in each clustering cycle, new centroids were calculated as the weighted mean of all the gene-expression patterns in the data set, each pattern weighted by its membership in the cluster.
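The weighted-mean update can be sketched as follows. This is a minimal illustration, not the exact equation from the slide: it assumes the usual fuzzy k-means form in which each gene's weight is its squared membership (fuzzifier 2), optionally multiplied by a per-gene weight. The names `update_centroid` and `gene_weights` are hypothetical.

```python
def update_centroid(genes, memberships, gene_weights=None):
    """Weighted mean of all expression patterns for one cluster."""
    if gene_weights is None:
        gene_weights = [1.0] * len(genes)
    # Assumption: weight = membership squared (fuzzifier 2) times the gene weight.
    w = [m * m * g for m, g in zip(memberships, gene_weights)]
    total = sum(w)
    n_conditions = len(genes[0])
    return [sum(wi * gene[j] for wi, gene in zip(w, genes)) / total
            for j in range(n_conditions)]

# Three genes measured under two conditions; the third gene has zero
# membership in this cluster, so it does not move the centroid.
genes = [[0.0, 2.0], [2.0, 4.0], [10.0, 10.0]]
centroid = update_centroid(genes, memberships=[1.0, 1.0, 0.0])
print(centroid)
```

Because every gene contributes, even distant genes pull on the centroid slightly; squaring the membership keeps that pull small.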

Membership Function
• Each gene's membership m in a cluster is a continuous variable from 0 to 1, allocated among the clusters according to the gene's correlation with each centroid.
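A common way to compute such memberships is the standard fuzzy k-means rule with fuzzifier 2, where membership is proportional to the inverse squared distance to each centroid. The sketch below assumes that rule and assumes the distance is 1 minus the Pearson correlation; the slide's own formula may differ. The helper names `pearson` and `memberships` are illustrative.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def memberships(gene, centroids, eps=1e-12):
    """Memberships of one gene in every cluster; they sum to 1."""
    # Assumed distance: 1 - Pearson correlation, floored to avoid dividing by zero.
    inv_sq = [1.0 / max(1.0 - pearson(gene, c), eps) ** 2 for c in centroids]
    total = sum(inv_sq)
    return [v / total for v in inv_sq]

gene = [1.0, 2.0, 3.0, 4.0]
centroids = [[1.0, 2.0, 3.0, 5.0], [4.0, 3.0, 2.0, 1.0]]
m = memberships(gene, centroids)
print(m)  # almost all membership goes to the first, well-correlated centroid
```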

Fuzzy K means
• The gene weight (used only in the second and third rounds) is empirically defined from the Pearson correlation between Xi and Xn and a correlation cutoff.

Fuzzy K means
• In each clustering cycle, the centroids were iteratively refined until the average change was < 0.001.
• Around 85% of the centroids stabilized within approximately 15 iterations; some centroids required more, about 40-60 iterations, before stabilizing.
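The stopping rule can be illustrated on toy data. The sketch below is a simplified one-dimensional fuzzy k-means that uses Euclidean distance and fuzzifier 2 (both assumptions made for brevity, not the correlation-based setup described in these slides) and stops when the average centroid change falls below 0.001.

```python
def fuzzy_kmeans_1d(points, centroids, tol=1e-3, max_iter=100):
    """Refine centroids until their average change drops below tol."""
    for iteration in range(1, max_iter + 1):
        # Membership of each point in each cluster (inverse squared distance).
        u = []
        for p in points:
            inv = [1.0 / max((p - c) ** 2, 1e-12) for c in centroids]
            s = sum(inv)
            u.append([v / s for v in inv])
        # New centroid = mean of all points weighted by squared membership.
        new_centroids = []
        for j in range(len(centroids)):
            w = [row[j] ** 2 for row in u]
            new_centroids.append(sum(wi * p for wi, p in zip(w, points)) / sum(w))
        change = sum(abs(a - b) for a, b in zip(new_centroids, centroids)) / len(centroids)
        centroids = new_centroids
        if change < tol:
            break
    return centroids, iteration

points = [0.0, 0.2, 0.4, 9.6, 9.8, 10.0]
final, n_iter = fuzzy_kmeans_1d(points, centroids=[1.0, 8.0])
print(final, n_iter)
```

The `max_iter` cap is a safety net for centroids that are slow to stabilize.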

Fuzzy K means
• After each clustering cycle, each centroid was compared to all other centroids in the set, and centroid pairs with a correlation > 0.9 were replaced by their average.

Visualization Tools


Cells respond to environment
• Various external messages (e.g. heat, food supply)
• The cell responds to environmental conditions

Genome is fixed – Cells are dynamic
• A genome is static
  – Every cell in our body has a copy of the same genome
• A cell is dynamic
  – Responds to external conditions
  – Saccharomyces cerevisiae cells follow a cell cycle of division and also budding
• Cells differentiate during development

Gene regulation
• Gene regulation is responsible for the dynamic behavior of cells
• Gene expression varies according to:
  – Cell type
  – External conditions

Transcription Factors Binding to DNA
• Transcription regulation:
  – Certain transcription factors bind DNA
• Binding recognizes DNA substrings:
  – Regulatory motifs

Regulation of Genes
[Figure sequence; labels: Transcription Factor (protein), RNA polymerase (protein), DNA, Regulatory Element, Gene, New protein]

The Challenges of Gene Expression Data
• Many genes have expression patterns that are similar to multiple, distinct gene groups.

Results of Clustering Gene Expression
• CLUSTER is simple and easy to use
• De facto standard for microarray analysis
• Limitations:
  – Hierarchical clustering, like clustering methods in general, is not robust
  – Genes may belong to more than one cluster

• A gene can be co-expressed with different gene groups in response to different conditions.

Saccharomyces cerevisiae
• The yeast Saccharomyces cerevisiae possesses sophisticated mechanisms to choreograph the expression of its 6200 genes in order to thrive, or at least to survive, in a wide range of environmental conditions.

• Gene expression of 40 Yap1p targets: these genes were coordinately induced in response to the subset of conditions shown here (labeled in red).

What is a microarray


What is a microarray (2)
• A 2D array of DNA sequences from thousands of genes
• Each spot has many copies of the same gene
• mRNAs from a sample are allowed to hybridize
• The number of hybridizations per spot is measured

Goal of Microarray Experiments
• Measure the level of gene expression across many different conditions:
  – Expression matrix M: {genes} × {conditions}, where Mij = |gene i| (the expression level of gene i) in condition j
• Deduce gene function
  – Genes with similar function are expressed under similar conditions

Fuzzy K-Means clustering
• Each gene can belong to many clusters
• Soft (fuzzy) assignment of genes to clusters
  – Each gene has 1.0 membership units, allocated among the clusters based on correlation with the cluster means
• Cluster means are calculated by taking the weighted average of all the genes in the cluster

Fuzzy K-Means clustering
Algorithm:
• Use PCA to initialize cluster means
• 3 rounds (clustering cycles) of fuzzy k-means clustering, finding k/3 clusters per round
  – In each round, start with brand new clusters and initializations
• And a few more heuristic tricks

Initialization
• Use PCA to find a few eigenvectors for initialization
• These eigenvectors capture the directions of maximum variance
• They must be orthonormal
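One way to obtain such orthonormal directions without external libraries is power iteration with deflation on the covariance matrix of the conditions. The sketch below is only an illustration under that assumption; the slides do not say how PCA was computed, and the function name `top_eigenvectors` and the toy data are hypothetical.

```python
from math import sqrt

def top_eigenvectors(rows, k, n_iter=500):
    """First k principal directions of `rows` (genes x conditions)."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Covariance matrix of the conditions (d x d).
    cov = [[sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    vectors = []
    for _ in range(k):
        # Deterministic, non-uniform start vector.
        v = [1.0 + j for j in range(d)]
        for _ in range(n_iter):
            w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
            norm = sqrt(sum(x * x for x in w))
            if norm == 0.0:
                break
            v = [x / norm for x in w]
        cov_v = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        eigenvalue = sum(v[a] * cov_v[a] for a in range(d))
        vectors.append(v)
        # Deflate: remove the direction just found so the next one is orthogonal.
        cov = [[cov[a][b] - eigenvalue * v[a] * v[b] for b in range(d)]
               for a in range(d)]
    return vectors

# Toy data: most variance lies along (1, 1), less along (1, -1).
rows = [[-2.0, -2.0], [2.0, 2.0], [-1.0, 1.0], [1.0, -1.0]]
v1, v2 = top_eigenvectors(rows, k=2)
print(v1, v2)
```

Each returned vector has the same length as an expression pattern, so it can serve directly as an initial centroid.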

Example Initialization
• k/3 centroids defined from the first k/3 eigenvectors

Example
• First iteration of clustering

Iteration of the approach
• Remove genes that have a Pearson correlation greater than 0.7 with any particular cluster
  – Intuition: the strong signal from these genes has already been accounted for
• Repeat
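The removal step can be sketched as a filter over the gene set. This is a minimal illustration that assumes each cluster is represented by its centroid; the gene names, profiles, and the function name `remaining_genes` are made up for the example.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def remaining_genes(genes, centroids, cutoff=0.7):
    """Keep only genes not strongly correlated with any centroid."""
    return {name: profile for name, profile in genes.items()
            if all(pearson(profile, c) <= cutoff for c in centroids)}

# Hypothetical expression profiles and a single centroid from the previous round.
genes = {
    "g1": [1.0, 2.0, 3.0, 4.0],   # tracks the centroid closely, so it is removed
    "g2": [4.0, 3.0, 2.0, 1.0],   # anti-correlated, so it is kept
    "g3": [2.0, 1.0, 4.0, 3.0],   # weakly correlated, so it is kept
}
centroids = [[1.0, 2.0, 3.0, 4.5]]
left = remaining_genes(genes, centroids)
print(sorted(left))
```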

Removing Duplicate Centroids
• Centroids with Pearson correlation > 0.9 will be averaged.
• This allows selecting a large initial number of clusters, since duplicates will be removed.
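A simple way to realize this merging is a greedy single pass, sketched below. The pass order and the choice to average each new duplicate with the already-merged centroid are assumptions of this sketch; the slides only state the 0.9 cutoff and that duplicates are averaged.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def merge_duplicates(centroids, cutoff=0.9):
    """Greedily average centroids whose correlation exceeds the cutoff."""
    merged = []
    for c in centroids:
        for i, kept in enumerate(merged):
            if pearson(c, kept) > cutoff:
                # Replace the pair by its element-wise average.
                merged[i] = [(a + b) / 2 for a, b in zip(kept, c)]
                break
        else:
            merged.append(list(c))
    return merged

centroids = [[1.0, 2.0, 3.0], [1.1, 2.1, 3.1], [3.0, 2.0, 1.0]]
merged = merge_duplicates(centroids)
print(merged)  # the first two centroids collapse into one
```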

• Repeat 3 times
• Output:
  1) Cluster means
  2) Gene assignments to clusters


• Regulatory systems that govern the expression of overlapping sets of genes in yeast.

Fuzzy K means ADVANTAGES
• The method can present overlapping clusters, revealing distinct features of each gene's function and regulation.
• The resulting implications can be used to assign refined hypothetical functions to uncharacterized gene products and additional cellular roles to well-known, studied proteins.

Fuzzy K means ADVANTAGES
• It presents more comprehensive groups of conditionally co-regulated genes.
• It elucidates the environmental conditions that trigger changes in gene expression.
• It requires no a priori information about the dataset.

Fuzzy K means DISADVANTAGES
• Assigning genes to a cluster requires a user-defined membership cutoff, and selecting a meaningful cutoff is a challenge.
• Fuzzy K means failed to identify a small number of groups that were identified by hierarchical clustering.
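The sensitivity to the cutoff is easy to see in a small sketch. The membership values, gene names, and the function name `assign` below are invented for illustration and do not come from the data set discussed in these slides.

```python
def assign(memberships, cutoff):
    """List, for each gene, the clusters whose membership meets the cutoff."""
    return {gene: [j for j, m in enumerate(ms) if m >= cutoff]
            for gene, ms in memberships.items()}

# Hypothetical memberships of two genes in three clusters (each row sums to 1).
memberships = {
    "geneA": [0.70, 0.25, 0.05],
    "geneB": [0.40, 0.35, 0.25],
}
loose = assign(memberships, cutoff=0.3)
strict = assign(memberships, cutoff=0.5)
print(loose)
print(strict)
```

With the looser cutoff geneB lands in two clusters; with the stricter one it lands in none, which is why the choice of cutoff matters.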

My opinion
• The unique advantages of fuzzy K means clustering make the technique a valuable tool for gene expression analysis. Its flexibility can be used to reveal more complex correlations between gene expression patterns, promoting refined hypotheses about the role and regulation of gene expression changes.

• To overcome these limitations, combining hierarchical clustering with fuzzy K means can be useful.

Thank you !
