geWorkbench Hands-On Training
Session Date:
Session Length:
Target Audience:
Trainer:
Developer Subject Matter Expert:
geWorkbench is being developed at the Joint Centers for Systems Biology, Columbia University. This work is supported by the NCI caBIG and the NIH NCBC programs.
Session Details
► This training is designed for a user who is new to geWorkbench.
► Target audience: researchers and students interested in the analysis of microarray gene expression experiments.
► Attendees are expected to have basic computer and biological knowledge.
► Note: this is not a complete introduction to all geWorkbench components. The primary goal is to describe the features developed for caBIG during Year 1 of the project, and the context in which they are used.
Session Details: Overview of the Training Environment
These slides are suitable for use in:
► Classroom training
► Centra (online classroom)
► Web-based delivery
Session Details: Hardware and Software
► geWorkbench requires the Sun Java JRE 1.5 environment to be installed on your local machine.
► geWorkbench requires significant memory. At least 1 GB is recommended, especially if large datasets will be read in or hierarchical clustering will be performed.
► Windows, Linux and Mac/PowerPC versions of geWorkbench are available.
► See www.geworkbench.org for full details.
Session Details: Session Goals
By the end of the training session participants should:
► Have a basic understanding of the purpose and aims of geWorkbench.
► Be able to set program preferences and load microarray data from local and remote sources.
► Understand how data files are organized into Projects, and how subsets of data can be formed and used.
► Use filtering and normalization components to prepare data.
► Analyze and view data using a number of new components.
Session Details: Outline of lessons
► Introduction
► Tutorial Data
► Part 1: Data management
♦ Lesson 1: Basics of the graphical interface
♦ Lesson 2: Setting Preferences
♦ Lesson 3: Projects and Data Files
♦ Lesson 4: Working with Data Subsets
♦ Lesson 5: Working with Remote Sources
► Part 2: Data manipulation
♦ Lesson 6: Normalization
♦ Lesson 7: Filtering
♦ Lesson 8: Experiment Annotations
► Part 3: Analysis and display
♦ Lesson 9: The Scatter Plot component
♦ Lesson 10: Expression Value Distribution
♦ Lesson 11: Reverse Engineering
♦ Lesson 12: Gene Annotation and Pathway Viewing
♦ Lesson 13: Hierarchical Clustering Analysis
♦ Lesson 14: ANOVA
► Part 4: Workflow execution
♦ Lesson 15: caSCRIPT Editor
Introduction
Introduction: Overview
► This section describes, in general terms, the capabilities of geWorkbench in the following areas:
♦ Microarray analysis.
♦ Sequence analysis.
♦ Access to remote data and services.
► A complete description of geWorkbench and online tutorials are available at www.geworkbench.org.
Introduction: Overview
geWorkbench: a platform for tool and data integration
► geWorkbench is an open-source bioinformatics platform that provides an extensive collection of tools for the management, analysis, visualization and annotation of biomedical data.
► geWorkbench is built on a plug-in framework. As new techniques are developed and implemented, they can be added to geWorkbench.
► geWorkbench aims to let different tools work together easily. For example, microarray analysis can produce a list of interesting genes, whose coding or upstream sequences can then be retrieved and used in BLAST, pattern discovery, or transcription factor binding motif searches.
Introduction: Microarray data
geWorkbench supports many kinds of operations on microarray data:
► Obtaining data from local or remote data sources
► Filtering and normalization
► Basic statistical analysis
► Clustering (Hierarchical, SOM)
► Gene Ontology analysis
► Reverse Engineering
► Visualization using many common tools
♦ Scatter Plot
♦ Volcano Plot
♦ Expression Profiles
♦ Expression Value Distribution
♦ Color Mosaic
♦ Dendrogram
Introduction: Sequence data
geWorkbench also provides capabilities for working with sequence data:
► BLAST
► Pattern Discovery
► Transcription Factor Mapping
► Syntenic Region Analysis
Introduction: External data services
► There are many biomedical data sources and computational services available through the internet. geWorkbench strives to make remote data and services directly available on the desktop, integrated with its own local tools.
► External sources provide expression data, sequences and annotation:
♦ Microarray gene expression repositories (caArray)
♦ Gene annotation web pages (via CGAP)
♦ DNA sequence retrieval (UC Santa Cruz)
♦ Pathway diagrams (BioCarta via the caBIO database at NCI)
Introduction: External computational services
geWorkbench also provides a gateway to several computational services, including some hosted on Columbia servers and clusters.
► BLAST: search for sequences similar to a query sequence.
♦ Access is provided both to a Columbia server and to the NCBI BLAST service.
► Pattern Discovery: find repeated patterns in a group of sequences.
► Synteny: compare regions of one chromosome against another.
► Through the caGRID project, additional remote services are being added:
♦ Hierarchical clustering: tree-like grouping by expression similarity.
♦ SOM (Self-Organizing Maps): divide expression profiles into a limited number of bins.
♦ ARACNE: regulatory network reverse engineering.
Tutorial Data
Tutorial Data: Overview
► This section describes the downloadable tutorial data files and is primarily a reference. Other files are included in the data directory of the program itself.
► The data can be downloaded from http://wiki.c2b2.columbia.edu/workbench/index.php/Download
► There are several file types:
♦ Microarray
  ♦ Affymetrix MAS5/GCOS format files: a single file per array, as produced by Affymetrix software.
  ♦ The geWorkbench data matrix format, which merges all expression data from a set of experiments into a single file. By default it uses the ending ".exp".
  ♦ Genepix two-color array experiments (in the base download).
♦ Sequence
  ♦ DNA and protein sequence files in FASTA format.
Tutorial Data: Data files
All data sets used in the tutorials are available from the download area of the geWorkbench website (http://wiki.c2b2.columbia.edu/workbench/index.php/Download). The file "tutorial_data.zip" contains the following files:
► cardiogenomics.med.harvard.edu/ : contains 10 individual MAS5/GCOS format data files.
► webmatrix2_quantile_log2_dev1.2_mv0.exp : a geWorkbench "exp" format matrix file containing filtered, normalized data. This data originally derives from the file "webmatrix2.exp".
► NM_024426-Wilms.fasta : a GenBank nucleotide sequence file.
► NP_077744-Wilms.fasta : a GenBank protein sequence file.
► H1H5_HistoneDB_NHGRI.fasta : contains H1 and H5 histone sequences from the NHGRI.
► cluster_tree_total_pearsons_84_markers.csv : contains a list of genes derived from hierarchical clustering.
► 64of84ClusterPearsonsSeqs.fasta : contains upstream DNA sequences derived from a subset of the above genes.
Tutorial Data: About the Cardiogenomics Microarray Dataset
The example MAS5 format data files were obtained from the following site at Harvard University:
http://cardiogenomics.med.harvard.edu/project-detail?project_id=229
A number of MAS5 format data files are available there. The specific project is the "Belgium Dataset of Aortic Stenosis, Congestive Cardiomyopathy and Normal LV Function", and the data is downloadable from:
http://cardiogenomics.med.harvard.edu/groups/proj1/pages/download_Hs-belgium.html
An abstract describing the study is also available at:
http://cardiogenomics.med.harvard.edu/groups/proj2/pages/Hs-belgium_home.html
Tutorial Data: Generation of example microarray dataset
Generation of the "webmatrix2_quantile_log2_dev1.2_mv0.exp" dataset.
The file "webmatrix2.exp", available in the Download area, contains results from 100 Affymetrix HG-U95Av2 chips hybridized with B-cell samples from numerous different disease states. 12,600 probes are represented. For use in these tutorials we normalized and filtered the data. The steps on the next page are only an example of how filtering and normalization can be used. Each dataset should be handled according to the type of analysis being undertaken and its goals.
Tutorial Data: Generation of example microarray dataset
The dataset was created through the following steps:
1. Normalization: Quantile normalization.
2. Normalization: Log2 transformation.
3. Filtering: Deviation filter with a Deviation bound of 1.2.
4. Filtering: Missing values filter with a maximum number of missing arrays of 0.
The result of performing these steps is available as the file "webmatrix2_quantile_log2_dev1.2_mv0.exp", found in the tutorial data file "tutorial_data.zip".
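Steps 2 to 4 above can be illustrated outside geWorkbench with a short Python sketch. This is not the geWorkbench implementation: the function names are hypothetical, the quantile step is left out for brevity, and it is an assumption here that the deviation filter drops markers whose standard deviation across arrays falls below the bound.

```python
import math
import statistics

def log2_transform(matrix):
    """Apply log2 to every positive value; missing values (None) stay None."""
    return [[math.log2(v) if v is not None and v > 0 else None for v in row]
            for row in matrix]

def deviation_filter(matrix, bound):
    """Keep markers (rows) whose standard deviation across arrays is >= bound.

    Assumption: this is how a 'deviation bound' is interpreted.
    """
    kept = []
    for row in matrix:
        present = [v for v in row if v is not None]
        if len(present) >= 2 and statistics.pstdev(present) >= bound:
            kept.append(row)
    return kept

def missing_values_filter(matrix, max_missing):
    """Keep markers that are missing on at most max_missing arrays."""
    return [row for row in matrix
            if sum(v is None for v in row) <= max_missing]

# Toy data: rows are markers, columns are arrays (values are made up).
expression = [
    [16.0, 64.0, 256.0, 1024.0],   # strongly varying marker
    [100.0, 101.0, 99.0, 100.0],   # nearly constant marker
    [8.0, None, 512.0, 2048.0],    # varying, but missing on one array
]

logged = log2_transform(expression)
varying = deviation_filter(logged, bound=1.2)
result = missing_values_filter(varying, max_missing=0)
print(result)
```

In this toy run the nearly constant marker is dropped by the deviation filter and the marker with a missing value is dropped by the missing values filter, leaving only the first marker.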
Part 1: Data Management
Part 1: Data Management
Objectives
The objective of Part 1 is to learn the basic operation of geWorkbench. This includes understanding how the graphical interface is laid out in four main functional regions, and how to set user preferences. The loading of local and remote data files will be demonstrated. Most important is understanding how geWorkbench allows data to be divided into subsets, both for setting up analyses and for using their results.
After completing Part 1, you should be able to:
1. Load microarray data into geWorkbench from local and remote sources, and set display preferences.
2. Understand how the data can be organized into projects and manipulated using sets.
Part 1: Data Management
Lesson outline
Lesson 1: Basics of the graphical interface
Lesson 2: Setting Preferences
Lesson 3: Projects and Data Files
Lesson 4: Working with Data Subsets
Lesson 5: Working with Remote Sources
Lesson 1: Basics of the graphical interface
Lesson 1: Basics of the graphical interface
The four areas of the GUI
The graphical user interface for geWorkbench is divided into four major sections:
1. Data management: Workspace and Projects (upper left).
2. Marker and Array/Phenotype set selection and management (lower left).
3. Visualization tools (upper right).
4. Analytical tools (lower right).
Areas 2, 3 and 4 are defined for convenience. The actual placement of a given component into any of these three areas is controlled by a configuration file and can be customized as desired.
Lesson 1: Basics of the graphical interface
Menu bar and data management area
Menu bar
► The GUI provides a menu bar at the top with a standard choice of commands.
► Many commands that are available in the menu bar are also available by right-clicking on data objects.
Data management area (area 1)
► Working with geWorkbench involves creating a project within the top-level Workspace.
► Opened data files and the results of analysis are stored within a Project.
► Multiple projects can be used within a workspace to organize data.
► A workspace and all the projects and data within it can be saved and later reloaded.
Lesson 1: Basics of the graphical interface
Set selection area
Set selection and management (area 2)
► geWorkbench allows sets of markers (gene probes) and of arrays/phenotypes to be defined and used. This allows the application to:
♦ Analyze only a desired subset of the data.
♦ Return lists of genes from one module which can then be used in another module, e.g. a list of genes returned by a t-test of differential expression can then be further investigated through sequence retrieval and analysis.
Lesson 1: Basics of the graphical interface
Visualization and analysis areas
Visualization and Analysis tools (areas 3 and 4)
► To simplify the display area, only the visualization and analysis components relevant to the type of dataset currently selected in the Project Folders area (area 1) are displayed.
► Thus choosing a microarray dataset will result in a different set of tabs being displayed than when a nucleotide sequence file is selected.
► When a new data file is loaded, or an analysis produces a new data set, it is added to the Project area (area 1) and an appropriate viewer in the Visualization area (area 3) is automatically selected.
► A selection of visualization and analysis tools will be demonstrated in the following sections.
Lesson 2: Setting Preferences
Lesson 2: Setting Preferences
Modifying settings
Preferences
► The Preferences selection in the Tools menu allows users to specify how certain aspects of the system will behave.
► Once the preferences are set, they persist between application sessions.
Modifying Settings
► From the main menu, click on Tools > Preferences.
Lesson 2: Setting Preferences
Modifying settings
► Text Editor: The editor selected will be used to open and inspect data sets loaded in a project. Notepad is the default setting.
► Visualization: The color scheme to be applied to color mosaic images.
♦ Absolute (default): Values are scaled against the largest absolute value found in the dataset, with positive values red and negative values green.
♦ Relative: Each marker is mean-variance normalized across all arrays. A red-blue color scheme is used, with red showing positive and blue showing negative values.
► Genepix Value Computation: Specifies how to compute the value displayed for a Genepix array. The default setting is Option 1: (Mean F635 - Mean B635) / (Mean F532 - Mean B532).
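The default Genepix value (Option 1) is a ratio of background-subtracted channel means. A minimal Python sketch of that formula follows. The function name is hypothetical, and returning None for a zero denominator is an assumption, since the slides do not say how geWorkbench treats that case.

```python
def genepix_ratio(mean_f635, mean_b635, mean_f532, mean_b532):
    """Option 1: (Mean F635 - Mean B635) / (Mean F532 - Mean B532).

    F = foreground (feature) mean, B = background mean, at 635 nm and 532 nm.
    """
    numerator = mean_f635 - mean_b635
    denominator = mean_f532 - mean_b532
    if denominator == 0:
        # Assumption: treat an undefined ratio as a missing value.
        return None
    return numerator / denominator

# Made-up spot intensities, for illustration only.
print(genepix_ratio(1200.0, 200.0, 700.0, 200.0))
```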
Lesson 2: Setting Preferences
Notes
► The relative display performs its own transformation on the data purely for visualization. The underlying data is not changed.
► The relative selection for the Microarray Viewer preference will give odd-looking results if only a small number of arrays are loaded (e.g. 2). With only two values per marker, each point falls at a color extreme, either blue or red.
► Changing the Microarray Viewer relative/absolute preference will not take effect until the next time a data set is loaded.
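The two-array effect noted above follows directly from mean-variance normalization. The sketch below is illustrative only: the function name is hypothetical, and using the population standard deviation is an assumption about a detail the slides do not specify.

```python
import statistics

def relative_scale(values):
    """Mean-variance normalize one marker's values across arrays."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)  # assumption: population standard deviation
    if sd == 0:
        return [0.0] * len(values)
    return [(v - mean) / sd for v in values]

# With only two arrays, any two distinct values map to the same two extremes,
# no matter how large or small the real difference is.
print(relative_scale([5.0, 9.0]))
print(relative_scale([100.0, 100.5]))

# With more arrays, intermediate values (and colors) become possible.
print(relative_scale([1.0, 2.0, 6.0]))
```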
Lesson 3: Projects and Data Files
Lesson 3: Projects and Data Files
File types
geWorkbench supports a number of data file formats, including:
For Microarrays:
► Affymetrix MAS5/GCOS text files.
► Affymetrix File Matrix: the native file type created by geWorkbench. It contains a data matrix from any number of experiments merged together.
► RMA Express File: RMA Express is a sophisticated tool for combining data from multiple Affymetrix chips. It is not a part of geWorkbench.
► Genepix Files: created by a popular analysis program for two-color arrays.
For Sequence:
► FASTA Files: DNA or protein sequence files in FASTA format.
► Pattern Files: created by the Pattern Discovery component.
Lesson 3: Projects and Data Files
Opening a file
In this example, we will load 10 individual Affymetrix MAS5 format files, merging them into a single dataset.
1. Create a Project. All data must belong to a project. Right-click on the Workspace entry in the Project Folders window at upper left to create a new project.
2. Next, right-click on the new Project entry and select Open Files.
Lesson 3: Projects and Data Files
Loading and merging data
3. Select file type Affymetrix MAS5/GCOS as shown.
4. Make sure to check the Merge files checkbox.
5. Select 10 MAS5 format text files from the tutorial data directory.
6. Click Open.
The chip type HG_U95Av2 is recognized...
Lesson 3: Projects and Data Files
Viewing data
The merged dataset is listed in the Project folder. The data is displayed, in single array format, in the Microarray Viewer. Note that the intensity slider has been increased to maximum here.
Lesson 3: Projects and Data Files
Renaming and saving a merged dataset
► The merged dataset can be given a shorter name.
♦ Right-click on the merged dataset and select Rename.
♦ Enter a new dataset name, e.g. merged_cardio.
► The dataset can also be saved to disk for later reuse.
♦ Right-click on the merged dataset and select Save.
♦ Enter a filename.
Lesson 4: Working with Subsets of Data
Lesson 4: Working with Subsets of Data
Background
► geWorkbench makes extensive use of sets of markers (genes) or arrays.
► Sets can be defined by the user, or may be created as a result of an analysis.
► Sets of arrays can be used to distinguish between different experimental states, for example as part of a statistical analysis.
♦ The t-test requires that two states be defined for comparison.
► Sets of markers are returned from various analysis routines. For example, the t-test returns a list of markers showing significant differential expression, and after hierarchical clustering, the markers in a subtree of the resulting dendrogram can be saved.
► geWorkbench supports groupings of sets. Each such group can contain different sets of markers or arrays.
Lesson 4: Working with Subsets of Data
Overview
In this tutorial you will learn:
► How to create a set of arrays.
► How to mark a set of arrays as "Active".
► How to classify a set of arrays, e.g. as "case" vs. "control".
► How arrays can be grouped in different ways with descriptive tags.
Lesson 4: Working with Subsets of Data
Preparation
The first example here will use the same data files read in and merged in the previous lesson (Projects and Data Files). The second example will use the tutorial file webmatrix2_quantile_log2_dev1.2_mv0.exp.
Lesson 4: Working with Subsets of Data
Assigning arrays to sets
We will leave the arrays in the default group. However, you can create a new group by pushing the New button in the Array/Phenotype Sets component at the lower left of the application (arrow labeled New).
First, we will select and label the arrays which contain samples from the congestive cardiomyopathy disease state:
1. In the Arrays/Phenotypes component, select the six arrays beginning with JB-ccmp, which represent the samples from the congestive cardiomyopathy disease state.
2. Right-click and select Add to Set.
Lesson 4: Working with Subsets of Data
Assigning arrays to sets
3. Enter "CCMP" in the input box and click OK.
4. Next, similarly label the arrays beginning with JB-n as "Normal".
The Array/Phenotype Sets component will now show the two sets added.
Lesson 4: Working with Subsets of Data
Activating sets
The boxes next to the set name can be checked to indicate that a set of arrays is "Active". Various analysis and visualization components can be set to use or display only activated arrays or markers.
Note: if no Marker sets are explicitly activated, then all Markers are implicitly active. The same applies to Arrays.
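The activation rule in the note can be expressed as a small Python sketch. The function, the array names and the set contents are hypothetical, and combining several active sets by union is an assumption rather than documented geWorkbench behavior.

```python
def active_items(all_items, sets, active_set_names):
    """Return the items a component would use under the activation rule.

    If no set is activated, every item is implicitly active.
    Assumption: when several sets are active, their union is used.
    """
    if not active_set_names:
        return list(all_items)
    selected = set()
    for name in active_set_names:
        selected.update(sets[name])
    # Preserve the original ordering of the items.
    return [item for item in all_items if item in selected]

# Hypothetical array names modeled on the tutorial's JB-ccmp / JB-n prefixes.
arrays = ["JB-ccmp_1", "JB-ccmp_2", "JB-n_1", "JB-n_2"]
array_sets = {
    "CCMP": ["JB-ccmp_1", "JB-ccmp_2"],
    "Normal": ["JB-n_1", "JB-n_2"],
}

print(active_items(arrays, array_sets, []))        # nothing activated: all arrays
print(active_items(arrays, array_sets, ["CCMP"]))  # only the CCMP arrays
```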
Lesson 4: Working with Subsets of Data
Classifying a set
For statistical tests such as the t-test, Case and Control groups can be specified.
1. Left-click on the thumbtack icon in front of the phenotype name.
2. Select Case to specify the disease arrays as the "Case". The remaining "Normal" arrays are by default considered Control.
Lesson 4: Working with Subsets of Data
Classifying a set
3. A red thumbtack indicates an array set has been marked as "Case".
Lesson 4: Working with Subsets of Data
Using multiple array groups
► Different groups of sets can be made, both for Markers and for Arrays. They may differ in membership or in how members are named (e.g. amount of detail).
► Several different groupings are defined in the example data file "webmatrix2_quantile_log2_dev1.2_mv0.exp".
► After loading this file into geWorkbench as type "Affymetrix File Matrix", four groups can be seen in the Arrays/Phenotypes group pulldown menu at right.
Lesson 4: Working with Subsets of Data
Using multiple array groups
If we choose the group called "Class", the sets of arrays at right are displayed.
Lesson 4: Working with Subsets of Data
Using multiple array groups
If instead we choose the group "Cell Line", a different grouping of the same arrays is seen.
Lesson 5: Working with Remote Data Sources
Lesson 5: Working with Remote Data Sources
The remote Open File dialog
geWorkbench can retrieve microarray data from certain remote data sources, primarily from instances of the NCI's caArray database. The Open File dialog allows remote sources to be added to the list of those available, either manually or through discovery using grid services.
► Right-click on a Project and select Open Files to bring up the Open File dialog.
► Click the Remote radio button. The Open File dialog window will expand to include remote sources.
► Entries (locations, parameters) for non-grid services can be edited.
Lesson 5: Working with Remote Data Sources
The remote Open File dialog
After clicking Remote, the following additional controls appear:
1. Remote source selector: choose from the available Remote Resources.
2. Go button: accesses the Remote Source that you selected.
3. Query button: specifies search criteria for retrieving only a subset of the available experiments.
4. Add A New Resource button: opens the Data Source Definition Page used to add Remote Data.
5. Edit button: edits Remote Source Parameters.
Lesson 5: Working with Remote Data Sources
Loading data from a remote instance of caArray
Click on the Go button next to the caArray data source at the bottom of the dialog. All available caArray experiments will be displayed.
Lesson 5: Working with Remote Data Sources
Selecting an experiment
Select an experiment that has bioassays.
1. Here we depict the experiment ending in *36540. The number of derived bioassays, 4, is displayed, along with the experiment information.
2. To start retrieving the bioassays themselves, right-click on the experiment and press Get bioassays. This will download the list of available bioassays into geWorkbench.
Lesson 5: Working with Remote Data Sources
Retrieving bioassay data
To Retrieve Bioassay Data
Select the desired arrays and push the Open button. (You might want to select just one at first, as each can take several minutes to download.)
Lesson 5: Working with Remote Data Sources
Searching for specific types of bioassay data
To Retrieve Bioassay Data Based on Search Criteria
1. Click on the Query button.
2. Select "Experiments" from the available search categories.
3. Select one or more fields (such as "Tissue Type" or "Chip Platform") and enter a desired search value. Some fields (such as "Tissue Type") take values from a pick list while others accept free text.
Lesson 5: Working with Remote Data Sources
Searching for specific types of bioassay data (cont.)
To Retrieve Bioassay Data Based on Search Criteria
4. Click on the Search button.
5. The list of available experiments is updated to include only those that meet all the search criteria specified by the user.
Lesson 5: Working with Remote Data Sources
Searching for specific types of bioassay data (cont.)
To Remove the Search Filter and Retrieve All Bioassay Data Again
1. Bring up the Query screen again.
2. Click on the Clear All button to clear all search fields.
3. Click on the Search button.
4. The full list of experiments is displayed again.
Lesson 5: Working with Remote Data Sources
Adding or modifying a remote source
To modify a remote source
1. Click on the Edit button.
2. Make the changes that you need.
3. Click on the OK button.
To add a remote source
1. Click on the Add A New Resource button.
2. Fill in the Data Source definition page. URL and Short Name are required fields.
3. Click on the OK button.
The configuration is set up to automatically reflect your additional Data Source.
Part 1: Data Management
Review
Part 1 covered the basics of the layout of geWorkbench, loading data, setting preferences, and the use of sets of arrays and markers to organize data.
After completing Part 1, you should be able to:
1. Locate the different working areas of the application GUI.
2. Load microarray data from local and remote sources, and create a merged dataset for further analysis.
3. Use search criteria to filter the set of experiments retrieved from a remote data source.
4. Set the display preferences.
5. Use sets of arrays and markers to organize data for analysis and to convey results from one tool to another.
Part 2: Data Manipulation
Part 2: Data Manipulation
Objectives
The objective of Part 2 is to learn a few of the basic techniques available in geWorkbench for microarray data normalization and filtering. This section will also cover the manual and automatic annotation of datasets.
After completing Part 2, you should be able to:
1. Normalize a microarray dataset.
2. Filter out unwanted data points, such as low-quality or missing data.
3. Use and create new dataset annotations.
Part 2: Data Manipulation
Lesson outline
Lesson 6: Normalization
Lesson 7: Filtering
Lesson 8: Experiment Annotations
Lesson 6: Normalization
Lesson 6: Normalization
Overview
Normalization is used to reduce the effects of systematic variations between arrays, such as variations in hybridization, scanning, and sample concentration. The aim is to make the data from different chips more comparable. geWorkbench supports a number of basic types of normalization. In this section, two will be described: Housekeeping Gene normalization and Quantile normalization.
Lesson 6: Normalization
Housekeeping gene normalization
► Housekeeping genes are those thought to be expressed at a relatively constant level.
► They can be used to provide a reference point against which to normalize.
► Using multiple housekeeping genes reduces the impact if one or more of them actually varies with the experimental conditions. geWorkbench uses the average expression of all selected housekeeping genes as the normalization factor.
► To perform a housekeeping gene normalization:
♦ Load or select a dataset, such as the merged_cardio set created earlier.
♦ For this example, first perform a log2 normalization on the dataset. This will reduce the dominance of the more highly expressed genes, and the result will be similar to performing a geometric mean normalization.
♦ In the Housekeeping Gene normalization component, the Load button allows a predefined list of genes to be loaded. The supplied file "housekeeping_marker_list.csv" is a list of 26 such genes applicable to the Affymetrix HG_U95Av2 chip type.
Lesson 6: Normalization
Housekeeping gene normalization
► Performing the normalization
♦ Loaded genes can be moved to and from the active list using the arrow buttons. Here all 26 have been chosen, but you would likely select just a few, perhaps based on experiments using other techniques (1).
♦ Press the Normalize button. The current dataset will be normalized.
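The idea of using the average of the selected housekeeping genes as a per-array normalization factor can be sketched in Python as follows. This is an illustration under stated assumptions, not the geWorkbench code: the function name and marker names are hypothetical, and it is assumed that each array's values are divided by that array's housekeeping average.

```python
def housekeeping_normalize(data, housekeeping):
    """Scale each array by the mean of the housekeeping markers on that array.

    data: dict mapping marker name -> list of values, one per array.
    housekeeping: list of marker names used as the reference.
    Assumption: the factor is applied by division.
    """
    n_arrays = len(next(iter(data.values())))
    factors = []
    for j in range(n_arrays):
        reference = [data[marker][j] for marker in housekeeping]
        factors.append(sum(reference) / len(reference))
    return {marker: [v / f for v, f in zip(values, factors)]
            for marker, values in data.items()}

# Made-up values: the second array is uniformly twice as bright for the
# housekeeping markers, which the normalization compensates for.
data = {
    "hk1": [2.0, 4.0],
    "hk2": [4.0, 8.0],
    "geneA": [6.0, 6.0],
}
normalized = housekeeping_normalize(data, ["hk1", "hk2"])
print(normalized["geneA"])
```

After normalization the housekeeping markers have equal values on both arrays, while geneA, whose raw value did not rise with the overall brightness, appears relatively lower on the second array.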
Lesson 6: Normalization
Quantile normalization
► Quantile normalization makes the distribution of expression values the same on every array. What then varies from array to array is the relative position of each gene in a list ordered by expression value.
► The assumption is that the real expression profile on each array is quite similar.
► Quantile normalization at the Affymetrix probe level is a feature of the advanced analysis technique called RMA. Quantile normalization in geWorkbench is applied at the gene (probeset) level.
► To perform a Quantile normalization:
♦ Load or select a dataset, such as the merged_cardio set created earlier.
♦ Go to the Normalization component in the Analysis area.
Lesson 6: Normalization
Quantile normalization
► To perform a Quantile normalization (cont.):
♦ Choose an Averaging method for handling missing values:
  ♦ Mean profile marker: average for the marker across all arrays.
  ♦ Mean microarray value: average for the array across all markers.
♦ Push the Normalize button. The current dataset will be normalized.
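For readers who want to see what quantile normalization does to the numbers, here is a minimal pure-Python sketch of the standard technique. It is not the geWorkbench implementation. The function names are hypothetical, missing values are filled with the marker average (corresponding to the "Mean profile marker" option), each marker is assumed to have at least one value present, and ties are broken by row order.

```python
def fill_missing_with_marker_mean(matrix):
    """Replace None with the average of that marker (row) across arrays."""
    filled = []
    for row in matrix:
        present = [v for v in row if v is not None]
        mean = sum(present) / len(present)
        filled.append([mean if v is None else v for v in row])
    return filled

def quantile_normalize(matrix):
    """Give every array (column) the same distribution of values.

    Rows are markers, columns are arrays. The value at each rank is replaced
    by the mean, across arrays, of the values holding that rank.
    """
    n_markers, n_arrays = len(matrix), len(matrix[0])
    columns = [[matrix[i][j] for i in range(n_markers)] for j in range(n_arrays)]
    sorted_columns = [sorted(col) for col in columns]
    rank_means = [sum(col[r] for col in sorted_columns) / n_arrays
                  for r in range(n_markers)]
    result = [[0.0] * n_arrays for _ in range(n_markers)]
    for j, col in enumerate(columns):
        order = sorted(range(n_markers), key=lambda i: col[i])
        for rank, i in enumerate(order):
            result[i][j] = rank_means[rank]
    return result

# Made-up data: 4 markers on 3 arrays, with one missing value.
raw = [
    [5.0, 4.0, 3.0],
    [2.0, 1.0, 4.0],
    [3.0, None, 6.0],
    [4.0, 2.0, 8.0],
]
normalized = quantile_normalize(fill_missing_with_marker_mean(raw))
for row in normalized:
    print(row)
```

After normalization every column contains exactly the same set of values. Only the order in which markers receive them differs between arrays.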
Lesson 6: Normalization
References
► (1) Vandesompele et al. Accurate normalization of real-time quantitative RT-PCR data by geometric averaging of multiple internal control genes. Genome Biology 2002, 3(7).
Lesson 7: Filtering
Lesson 7: Filtering
Overview
Filtering is used to remove data from datasets. Data may be removed because it is of low quality, of low interest (unvarying), or has been flagged by another program as absent or unreliable.
Lesson 7: Filtering
GenePix Flags filtering
► GenePix is a software platform used for analyzing spotted two-color arrays. It produces its own file format with the suffix .gpr.
► The file can include flags on individual data points, indicating e.g. bad or missing data.
► geWorkbench can filter out these flagged data points.
► To perform GenePix flags filtering:
♦ Load a GenePix format file, such as 21161neu10.gpr. This is included in the geWorkbench data directory.
♦ In the Analysis area, go to the Filtering component and select GenePix Flags Filter.
♦ The list of available flags is presented. Choose a flag to filter on, such as "bad", by checking its box.
♦ Push Filter.
Lesson 7: Filtering
GenePix Flags filtering
► Filtered-out values are colored yellow in the Tabular Microarray Viewer, indicating they are now classified as Missing Values in geWorkbench.
► Such values can be removed entirely from the dataset through use of the Missing Values Filter (not shown).
Lesson 8: Experiment Annotations
Lesson 8: Experiment Annotations
Three annotation components
► Three components provide for automatic and manual annotation of datasets.
♦ Dataset Annotation: allows the user to type in comments on a dataset.
♦ Dataset History: automatically records data transformation steps.
♦ Experiment Info: information about the makeup of the dataset, e.g. the files that were merged to create it.
► Shown on the next slide are annotations for the dataset used in the Housekeeping Gene normalization example.
► A text file can also be read into the Dataset Annotation component using the Load Custom Data Annotations button.
Lesson 8: Experiment Annotations
Three annotation components
The three annotation components:
♦ Dataset Annotation (text entered by hand)
♦ Dataset History
♦ Experiment Info
Part 2: Data Manipulation
Review
In Part 2 we covered microarray data normalization and filtering. We also saw how geWorkbench keeps a record of each data transformation, and how annotations can be added to an experimental dataset by hand or from a file.
After completing Part 2, you should be able to:
1. Normalize a microarray dataset using tools such as Housekeeping Genes Normalization and Quantile normalization.
2. Filter out unwanted data points, for example flagged points from a GenePix data file.
3. View dataset annotations created automatically by geWorkbench when a dataset is transformed.
4. Create new dataset annotations by hand.
Part 3: Analysis and Display
Objectives
The objective of Part 3 is to introduce some of the major tools for microarray data analysis and display found in geWorkbench. The Scatter Plot and Expression Value Distribution (EVD) components are used to inspect microarray data, for example to evaluate data quality and the effectiveness of normalization and filtering. The Reverse Engineering component can be used to examine relationships between the expression pattern of a chosen gene and others in the dataset. Lists of genes which result from analysis steps can be evaluated through annotations and Pathway diagrams retrieved using the Marker Annotations component.
After completing Part 3, you should be able to:
1. Use the Scatter Plot and Expression Value Distribution components to examine microarray datasets.
2. Run Reverse Engineering on a microarray dataset to find interactions with a chosen hub gene.
3. Retrieve gene annotations and pathway diagrams using the Marker Annotations component and view them.
Part 3: Analysis and Display
Part 3: Analysis and Display
Lesson outline
Lesson 9: The Scatter Plot component
Lesson 10: Expression Value Distribution
Lesson 11: Reverse Engineering
Lesson 12: Gene Annotation and Pathway Viewing
Lesson 13: Hierarchical Clustering
Lesson 14: Analysis of Variance
Lesson 9: The Scatter Plot component 83
Lesson 9: The Scatter Plot component Overview ► The Scatter Plot examines the relationship between two datasets. Two types of comparisons can be made: one gene probe against a second on every chip (Marker option), or every gene probe against itself on two chips (Array option). Up to 6 graphs can be shown. Two marker plots are shown here. The marker AFFX-BioC-5_at is on the x-axis while the markers AFFX-BioB-5_at and AFFX-BioC-3_at are on the y-axes. 84
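The two comparison types can be sketched in Python. This is an illustration only, not geWorkbench code: the matrix layout (one row per marker, one column per array), the function names, the array names and the expression values are all assumptions made for the example.

```python
def marker_points(matrix, marker_x, marker_y):
    """Marker option: one point per array, pairing the values of two markers."""
    return list(zip(matrix[marker_x], matrix[marker_y]))


def array_points(matrix, arrays, array_x, array_y):
    """Array option: one point per marker, pairing its values on two arrays."""
    ix, iy = arrays.index(array_x), arrays.index(array_y)
    return [(row[ix], row[iy]) for row in matrix.values()]


# Hypothetical data: rows are markers, columns are arrays.
arrays = ["chip1", "chip2", "chip3"]
matrix = {
    "AFFX-BioC-5_at": [7.1, 7.4, 6.9],
    "AFFX-BioB-5_at": [6.8, 7.0, 6.5],
}

print(marker_points(matrix, "AFFX-BioC-5_at", "AFFX-BioB-5_at"))
print(array_points(matrix, arrays, "chip1", "chip3"))
```

A Marker plot therefore has as many points as there are arrays, while an Array plot has as many points as there are markers.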
Lesson 9: The Scatter Plot component Using the Scatter Plot 1. You can use the dataset loaded in the previous example, or open the tutorial data file webmatrix_quantile_log2_dev1.2_mv0.exp. 2. In the Scatter Plot component, select the Marker or Array tab to choose the type of comparison. The above picture used Marker. 3. Highlight a reference marker or array. The second and any following items selected will each result in a graph being drawn, up to a limit of six. 85
Lesson 9: The Scatter Plot component Basic usage The steps of basic usage are indicated with numbers in the screenshot. 1. This tab switches between Marker/Marker and Array/Array plots. 2. Markers or arrays available for selection. 3. The first marker or array selected is placed on the x-axis and is highlighted in black. A different marker or array can be placed on the x-axis by right-clicking the marker/array name and choosing Put on X-Axis. 4. Each subsequent selection of a marker or array, once a marker/array is on the x-axis, results in the creation of a chart. Plotted markers/arrays are highlighted in grey. Clicking again on one of these markers/arrays removes its plot. 86
Lesson 9: The Scatter Plot component Basic usage (continued) 5. Clicking the Rank Statistics Plot checkbox transforms the data for analysis. The x and y values are sorted and plotted according to their rank. 6. By default, a black reference line with slope 1 is displayed in each chart. This may be turned off with the Reference Line checkbox. The slope of the line may also be adjusted in the Slope textbox. 7. The Clear Charts button removes all charts and clears the x-axis selection. The Print button prints the charts after allowing the user to adjust the page setup and choose a printer. The Image Snapshot button captures the charts as an image and places it into the project underneath the current data set. 87
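The rank transformation in step 5 can be illustrated with a short Python sketch. It assumes that x and y are ranked independently and that tied values share the average of their ranks; how geWorkbench itself handles ties is not stated in these slides, so treat both choices, and the function names, as assumptions.

```python
def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1  # mean of the 1-based positions in the run
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        pos = end + 1
    return ranks


def rank_points(x, y):
    """Replace each (x, y) point by its pair of ranks (assumed behavior)."""
    return list(zip(average_ranks(x), average_ranks(y)))


print(rank_points([10, 30, 20], [0.5, 0.9, 0.7]))
```

On a rank plot, two markers whose values rise and fall together fall along the slope-1 reference line regardless of the scale of the original measurements.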
Lesson 9: The Scatter Plot component Options and sets Each chart can be manipulated by right-clicking anywhere in the plot area. This brings up a menu that allows the chart to be individually saved as an image, printed, or zoomed, and its visual properties to be adjusted. Set Selections Markers or microarrays that are members of active sets will be plotted with unique visual properties. These selections are managed for arrays and markers in the Phenotype and the Marker components, respectively. Consider an example where two sets are activated in the Phenotype component: 88
Lesson 9: The Scatter Plot component Example plot, all arrays Here we compare the expression of two genes across all arrays. The two selected sets of arrays are colored blue and green. Because the “All Arrays” box is also checked, the remaining arrays are also displayed, in red: 89
Lesson 9: The Scatter Plot component Set options The visual properties of a set of markers or arrays may be adjusted. From within the Array or Marker component, right-click a set and choose Change Visual Properties. A dialog opens that allows the shape and color to be changed for that set. These properties are honored in the Scatter Plot as well as in other geWorkbench components. 90
Lesson 10: Expression Value Distribution 91
Lesson 10: Expression Value Distribution Features ► The Expression Value Distribution component plots a histogram of binned expression values for all or selected genes on one or more arrays. ► A slider (at bottom) can be used to step through the arrays in the current dataset. ► A subset of markers within a given expression range can be selected using movable sliders (Select values from and Select values to) and added to a Marker Set using the Add to Set button. ► A t-test can be used to detect markers with significantly different expression. A Case set of arrays must be activated in the Arrays component (remaining arrays are by default Control). ► Image Snapshot saves an image of the graph to the Project Folders component. ► Mouse-over annotations can be activated by pressing the lightbulb icon. ► An array from the Housekeeping Genes Normalization example is displayed in the following picture: 92
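The binning and range selection described above can be sketched as follows. This is a minimal illustration, assuming equal-width bins and an inclusive value range; the bin count, function names and marker values are hypothetical and are not taken from geWorkbench.

```python
def histogram(values, n_bins):
    """Count values in n_bins equal-width bins spanning min(values)..max(values)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0  # avoid zero width when all values are equal
    counts = [0] * n_bins
    for v in values:
        # The maximum value would land one past the end, so clamp it into the last bin.
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    edges = [lo + i * width for i in range(n_bins + 1)]
    return edges, counts


def select_in_range(values_by_marker, low, high):
    """Return the markers whose value lies in [low, high], like the two sliders."""
    return [m for m, v in values_by_marker.items() if low <= v <= high]


# Hypothetical log2 expression values for one array.
one_array = {"marker_a": 1.0, "marker_b": 5.0, "marker_c": 9.0}
print(histogram(list(one_array.values()), 2))
print(select_in_range(one_array, 4, 9))
```

The list returned by `select_in_range` corresponds to what the Add to Set button would place in a new Marker Set.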
Lesson 10: Expression Value Distribution Example graph Normalized, log2-transformed data (Housekeeping Genes Normalization example) 93
Lesson 10: Expression Value Distribution Display options for the EVD diagram. ► Right-click on the EVD diagram to obtain the following list of display and manipulation options. 94
Lesson 10: Expression Value Distribution Working with activated datasets ► The Arrays/Phenotypes component allows the dataset to be divided into sets of arrays, which can be named and classified (e.g. as Case/Control). ► Select a group (e.g. CCMP arrays), right-click, and select Add to Set. 95
Lesson 10: Expression Value Distribution Displaying an activated set ► The set CCMP is active. The “One color per array” checkbox is checked, so each array is shown in a different color. ► The base array, shown in red, is selected using the array slider. 96
Lesson 10: Expression Value Distribution t-test Results of a t-test on CCMP vs Normal arrays ► Now both the CCMP and Normal array sets are active. CCMP has been marked Case. ► The t-test button is active, showing the t-statistic distribution. 97
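For readers who want to see what a per-marker t-statistic looks like, here is a minimal Python sketch. It assumes Welch's unequal-variance form of the statistic; the slides do not say which variant geWorkbench uses, and the array names, marker names and values below are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(case, control):
    """Welch's t-statistic (assumed variant); needs >= 2 values per group."""
    standard_error = sqrt(variance(case) / len(case) + variance(control) / len(control))
    return (mean(case) - mean(control)) / standard_error


def t_by_marker(matrix, arrays, case_arrays):
    """Compute one t-statistic per marker; arrays not marked Case are Control."""
    case_idx = [i for i, name in enumerate(arrays) if name in case_arrays]
    control_idx = [i for i, name in enumerate(arrays) if name not in case_arrays]
    return {
        marker: welch_t([row[i] for i in case_idx], [row[i] for i in control_idx])
        for marker, row in matrix.items()
    }


# Hypothetical data: three Case arrays followed by three Control arrays.
arrays = ["ccmp1", "ccmp2", "ccmp3", "normal1", "normal2", "normal3"]
matrix = {
    "marker_a": [5, 6, 7, 1, 2, 3],  # clearly higher in the Case arrays
    "marker_b": [2, 3, 4, 2, 4, 3],  # same mean in both groups
}
print(t_by_marker(matrix, arrays, {"ccmp1", "ccmp2", "ccmp3"}))
```

Plotting a histogram of the values returned by `t_by_marker` gives a t-statistic distribution of the kind shown in the EVD display.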
Lesson 11: Reverse Engineering 98
Lesson 11: Reverse Engineering Overview ► The primary use of the Reverse Engineering component is to infer regulatory interactions between genes and gene products. ► The Reverse Engineering component uses the information theory concept of mutual information to find these interactions. ♦ Mutual information here means the information that the expression pattern of one gene carries about the expression of another gene - it is a pairwise calculation. ♦ Mutual Information is in principle more sensitive and flexible than a simple correlation calculation. ♦ It is also invariant under certain data transformations, such as log transformations. 99
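To make the idea of mutual information concrete, here is a small Python sketch. It uses a simple equal-frequency (rank-based) binning estimator, which is an assumption chosen for brevity and is not necessarily the estimator used by geWorkbench; the bin count and function names are likewise illustrative.

```python
from collections import Counter
from math import log


def rank_bins(values, n_bins):
    """Assign each value to one of n_bins equal-frequency bins by its rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * n_bins // len(values)
    return bins


def mutual_information(x, y, n_bins=3):
    """Estimate I(X;Y) in nats from the joint histogram of binned x and y."""
    bx, by = rank_bins(x, n_bins), rank_bins(y, n_bins)
    n = len(x)
    px, py, pxy = Counter(bx), Counter(by), Counter(zip(bx, by))
    # Sum of p(a,b) * log(p(a,b) / (p(a) * p(b))) over the occupied cells.
    return sum((c / n) * log(c * n / (px[a] * py[b])) for (a, b), c in pxy.items())


x = [1, 2, 3, 4, 5, 6]
y = [2, 4, 6, 8, 10, 12]
print(mutual_information(x, y))
print(mutual_information([log(v) for v in x], y))  # same value after a log transform
```

Because the bins depend only on the rank order of the values, this particular estimator gives exactly the same result before and after any strictly increasing transformation, which illustrates the invariance mentioned above.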
Lesson 11: Reverse Engineering Overview Larger datasets, containing more arrays per marker, will yield greater sensitivity and better statistical support. Full-scale runs of reverse engineering algorithms, which compare all markers against each other on datasets containing several hundred microarrays, are typically performed on large cluster computers and are not feasible on a desktop machine. 100
Lesson 11: Reverse Engineering in the context of geWorkbench ► As typically used in geWorkbench, the Reverse Engineering component calculates the Mutual Information score between a single hub gene and all other N markers in the dataset. ► In a second step, a subset containing the best M markers is chosen (with a current limit of 100), and a complete pair-wise M×M/2 mutual information calculation is performed between them. ► The network resulting from this calculation can be displayed as a branched tree of interactions within the Cytoscape component. 101
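The two-step procedure can be outlined in Python. This is a sketch under stated assumptions, not the geWorkbench implementation: the scoring function is passed in as a parameter, the absolute Pearson correlation is used as a stand-in score for the demonstration, and the function names and data are hypothetical. The cutoff of 0.2 and the limit of 100 markers are the defaults mentioned in this lesson.

```python
from itertools import combinations
from math import sqrt


def abs_pearson(x, y):
    """Stand-in interaction score: absolute Pearson correlation (0.0 if constant)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return abs(sxy / sqrt(sxx * syy))


def hub_then_pairwise(data, hub, score, cutoff=0.2, max_markers=100):
    """Step 1: score the hub against every marker. Step 2: score the best M pairwise."""
    hub_scores = {m: score(data[hub], v) for m, v in data.items() if m != hub}
    passing = [m for m, s in hub_scores.items() if s > cutoff]
    best = sorted(passing, key=hub_scores.get, reverse=True)[:max_markers]
    pairwise = {(a, b): score(data[a], data[b]) for a, b in combinations(best, 2)}
    return hub_scores, best, pairwise


# Hypothetical expression profiles across four arrays.
data = {
    "hub": [1, 2, 3, 4],
    "a": [2, 4, 6, 8],    # rises with the hub
    "b": [4, 3, 2, 1],    # falls as the hub rises
    "c": [1, -1, -1, 1],  # unrelated to the hub
}
hub_scores, best, pairwise = hub_then_pairwise(data, "hub", abs_pearson)
print(best, pairwise)
```

For M selected markers the second step evaluates M(M-1)/2 distinct pairs, which is roughly M×M/2; with M = 100 that is 4950 score calculations, far fewer than comparing all N markers against each other.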
Lesson 11: Reverse Engineering Prerequisites ► A dataset containing multiple arrays (the more the better) should be loaded into geWorkbench. If data is loaded from separate files, it should be merged into a single microarray dataset. See the section Projects and Data Files. ► For this example we will load the tutorial dataset "webmatrix2_quantile_log2_dev1.2_mv0.exp". ♦ This contains a set of 100 experiments on Affymetrix HG_U95Av2 chips. This filtered dataset has been reduced to 2226 markers. 102
Lesson 11: Reverse Engineering Profiler - selecting a hub gene 1. In the upper right section of geWorkbench find the Reverse Engineering component. It should by default be displaying the Profiler tab. 2. In the Markers component search box, on the left side of the geWorkbench interface, enter 1973 and hit Enter. This will find the marker 1973_s_at, which is the c-Myc gene, a well-known transcription factor with many interactions. 3. Click on this marker in the list. This will enter the marker into the Hub Gene Label field of the Profiler. 103
Lesson 11: Reverse Engineering Profiler – Analyze 2D 4. The default setting in the Profiler is Mutual Information (fast). With this selected, hit Analyze(2D). This will return a list of all markers having an MI score greater than the cutoff value (the default is 0.2). 104
Lesson 11: Reverse Engineering Profiler - Options ► Pearson - Uses a Pearson correlation function to calculate the interaction scores. 105
Lesson 11: Reverse Engineering Profiler – data output 5. After the Mutual Information algorithm has been run, an adjacency matrix will be placed in the Project Folders component: 106
Lesson 11: Reverse Engineering Profiler – adding returned markers to a set ► If a smaller network is desired, a set of markers can be highlighted in the list originally returned. Only this selected subset, up to 100 markers, will then be used if "Create Network" is pressed. ► By right-clicking and selecting "Add to Set", the selected group can also be added to the Markers component as a new set which can be used in other components (sequence retrieval, annotation retriever, etc.). 107
Lesson 11: Reverse Engineering Profiler – Create Network 6. Hit the Create Network button. a) A network will be displayed based on the top 100 markers interacting with c-Myc. As described above, the MI algorithm is run again on these M=100 markers, in order to measure interactions between each pair. b) Each marker is then connected via an edge with the marker it most strongly interacts with, with the chosen hub gene at the center. 108
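Step 6b can be sketched as follows: given pairwise scores, each marker contributes one edge to its highest-scoring partner. This is an illustrative reading of the slide rather than the actual geWorkbench code; the marker names and scores are hypothetical, and ties are broken by list order.

```python
def strongest_partner_edges(markers, pair_scores):
    """Connect each marker to the partner with which it has the highest score."""

    def score(a, b):
        # Scores are stored once per unordered pair, so look up both orders.
        return pair_scores.get((a, b), pair_scores.get((b, a), 0.0))

    edges = set()
    for m in markers:
        partner = max((o for o in markers if o != m), key=lambda o: score(m, o))
        edges.add(frozenset((m, partner)))  # frozenset makes the edge undirected
    return edges


# Hypothetical pairwise MI scores among a hub gene and two other markers.
markers = ["MYC", "A", "B"]
pair_scores = {("MYC", "A"): 0.9, ("MYC", "B"): 0.5, ("A", "B"): 0.7}
print(strongest_partner_edges(markers, pair_scores))
```

In this example B attaches to A rather than directly to the hub, because its score with A is higher. This is how a branched tree can arise instead of a simple star around the hub gene.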
Lesson 11: Reverse Engineering Cytoscape viewer 7. The resulting network is displayed in the Cytoscape viewer. 109
Lesson 11: Reverse Engineering Cytoscape viewer layout 8. The visualization in Cytoscape can be improved by going to the Layout menu, and choosing yFiles->organic: 110
Lesson 11: Reverse Engineering Cytoscape viewer layout 9. Within the network created in Cytoscape, one can select the central gene, and then on the Cytoscape menu choose Select->Nodes->First Neighbors of selected nodes. 111
Lesson 11: Reverse Engineering Cytoscape viewer – choosing genes 10. The first neighbors will be highlighted in the graph. 11. They are also added as a new set in the Markers component. 112
Lesson 11: Reverse Engineering Motif Location Histogram 12. Return to the main Reverse Engineering component by clicking on the original dataset in the Project Folders component. 13. Select the first (highest MI score) marker on the list and the graph shown below is drawn in the Motif Location Histogram display. This shows a plot of the expression values on each array for the selected hub marker vs any other marker selected in the list. 113
Lesson 12: Gene Annotation and Pathway Viewing 114
Lesson 12: Gene Annotation and Pathway Viewing The Marker Annotations component ► The Marker Annotations component retrieves information for selected markers (genes) using caBIO. ♦ Links to CGAP annotation pages are listed under Gene. ♦ Links to BioCarta pathway diagrams are listed under Pathway. ♦ Clicking on the pathway links will display the SVG pathway diagrams in the caBIO Pathways viewer. 115
Lesson 12: Gene Annotation and Pathway Viewing Marker Annotations - retrieving 1. Load a set of markers from the tutorial data into the Markers component: a) Press Load set. b) Locate the file “cluster_tree_total_pearsons_84_markers.csv” and load it. c) Activate the set by checking the box in front of its entry. Here it has been renamed to “Cluster tree”. 2. In the Marker Annotations component, press Retrieve annotations. 3. Click on a Gene or Pathway link to view the annotations. 116
Lesson 12: Gene Annotation and Pathway Viewing Marker Annotations - display ► The list of markers can be sorted by Gene or by Pathway name by clicking on the column headings. 117
Lesson 12: Gene Annotation and Pathway Viewing CGAP annotations A CGAP annotation page displayed in a web browser window. 118
Lesson 12: Gene Annotation and Pathway Viewing BioCarta Pathway display A BioCarta pathway displayed in the caBIO Pathways component. 119
Lesson 13: Hierarchical Clustering 120
Lesson 13: Hierarchical Clustering Overview ► Hierarchical clustering can be used to identify trends in the data by grouping together genes and/or microarrays that share common expression patterns. ► Like many of the analyses available in geWorkbench, hierarchical clustering is carried out in 2 steps: ♦ Analysis setup and execution: the Analysis Panel is used to specify the parameter settings to be used when invoking the hierarchical clustering algorithm. ♦ Visualization of analysis results: the Dendrogram module is used to visualize the clusters generated by the analysis. ► The hierarchical clustering algorithm can be executed in 2 modes: ♦ Local: the computation takes place on the user’s machine (the same computer on which geWorkbench is running). ♦ Remote: using the caGrid infrastructure, the computation can be outsourced to any computer running a grid-enabled version of the hierarchical clustering code. 121
Lesson 13: Hierarchical Clustering Set up the analysis parameters • The Analysis Panel is located in the lower portion of the application’s user interface; locate and select the tab titled “Analysis”. • From the list of available analyses, select the one titled Hierarchical Clustering. • Within the parameters portion of the interface you can specify the values for the 3 parameters applicable to the hierarchical clustering analysis: “Clustering Method”, “Clustering Dimension” and “Clustering Metric”. The values for these parameters determine, respectively, (1) how clusters get agglomerated, (2) whether the analysis should cluster markers, arrays or both, and (3) what distance metric to use for assessing similarity between clusters. • Set the values of the parameters as shown above. 122
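The roles of the three parameters can be seen in a compact Python sketch of agglomerative clustering. It is a teaching example under stated assumptions, not the code geWorkbench runs: the linkage names (single, complete, average), the default Euclidean metric, the function name and the data are illustrative choices, and the simple search used here is too slow for large datasets.

```python
from math import dist
from statistics import mean

# "Clustering Method": how the distance between two clusters is summarized.
LINKAGE = {"single": min, "complete": max, "average": mean}


def agglomerate(rows, method="average", metric=dist):
    """Repeatedly merge the two closest clusters; return the merge history."""
    clusters = {i: [i] for i in range(len(rows))}
    link = LINKAGE[method]
    merges = []
    next_id = len(rows)
    while len(clusters) > 1:
        ids = sorted(clusters)
        best = None
        for i, a in enumerate(ids):
            for b in ids[i + 1:]:
                # "Clustering Metric": distance between individual profiles.
                d = link([metric(rows[p], rows[q]) for p in clusters[a] for q in clusters[b]])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        clusters[next_id] = clusters.pop(a) + clusters.pop(b)
        merges.append((a, b, d))  # the new cluster receives the id next_id
        next_id += 1
    return merges


# Hypothetical matrix: one row per marker, one column per array.
rows = [[0, 0], [0, 1], [5, 5]]
print(agglomerate(rows, method="single"))
# "Clustering Dimension": transpose the matrix to cluster arrays instead of markers.
print(agglomerate(list(zip(*rows)), method="single"))
```

The merge history is the information a dendrogram draws: each entry joins two branches at a height equal to the recorded distance.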
Lesson 13: Hierarchical Clustering Select Local or Remote execution • To select a compute server (the piece of software which will carry out the actual computations), select the Services tab. • To specify local or remote execution, click on the Local or Grid radio button, respectively. • If the Local option is selected, a locally running version of the hierarchical clustering code will be executed. To select among a list of available grid-enabled hierarchical clustering servers, select the radio button next to the Grid option (as shown above). • geWorkbench will query a caGrid Index Service in order to find out which grid-enabled servers are available. The application comes pre-configured with a default Index Service address. To change this default, click on the “Change Index Service” link and enter the host URL and port for the new Index Service to use. 123
Lesson 13: Hierarchical Clustering Select Local or Remote execution (cont.) • To retrieve the available hierarchical clustering services, click on the button titled “Grid Services”. • The list of discovered services is displayed here. • Select the radio button next to the service that you would like to use. Details about the selected service (including its URL, the host institution, etc.) appear at the bottom portion of the interface. Your selection will be remembered next time you use geWorkbench. • Return to the Parameters tab and click the Analyze button to initiate the clustering (using the compute server you designated). 124
Lesson 13: Hierarchical Clustering Dendrogram • Upon completion of the analysis, a tree node representing the analysis results is created in the Project Folders pane. It appears as a child node under the microarray set that was clustered. • By clicking on the results node, the resulting hierarchical clustering can be visualized as a dendrogram in the upper right part of the user interface. In this view, horizontal mosaic blocks correspond to markers and vertical blocks correspond to arrays. 125
Lesson 13: Hierarchical Clustering Dendrogram (cont.) • The view can be adjusted to focus on a particular marker or array cluster. • To select a marker cluster, first check the “Enable Selection” box. • Mouse over a marker cluster of interest to highlight it, then right-click. • The dendrogram view is updated to include only the selected markers. Further, by right-clicking on the view and selecting “Add to set”, the selected markers can be grouped into a marker set. 126
Lesson 14: ANOVA (Analysis of Variance) 127
Lesson 14: ANOVA Overview ► The ANOVA (ANalysis Of VAriance) algorithm is used to determine whether any significant difference exists in the means of independent groups of data. ANOVA is an extension of the t-test to more than two experimental conditions. In geWorkbench each group comprises gene expression microarray measurements from various samples, and one is interested in identifying genes whose mean expression is significantly different across the various groups. ► Currently, only one-way ANOVA is implemented. The user is initially required to enter the number of groups, following which a sample grouping panel similar to the t-test panel, with the appropriate number of groups, is created. Samples can be assigned to any group or excluded from the analysis. F-statistics are calculated for each gene, and a gene is considered significant if the p-value associated with its F-statistic is smaller than the user-specified alpha, or critical p-value. ► The compute code for the ANOVA analysis used in geWorkbench has been adapted from the ANOVA component in the MeV software from TIGR (http://www.tm4.org/mev.html). 128
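The F-statistic for a single gene can be computed with the standard one-way ANOVA formula, sketched here in Python. The function name and the expression values are hypothetical, and the code is an illustration rather than the MeV-derived implementation used by geWorkbench.

```python
from statistics import mean


def f_statistic(groups):
    """One-way ANOVA F: between-group mean square over within-group mean square."""
    k = len(groups)                   # number of groups
    n = sum(len(g) for g in groups)   # total number of samples
    grand_mean = mean(v for g in groups for v in g)
    group_means = [mean(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means))
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, group_means) for v in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))


# Hypothetical expression values of one gene in three groups of arrays.
groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
print(f_statistic(groups))
```

The p-value is then the upper-tail probability of this F value under an F distribution with k-1 and n-k degrees of freedom. The Python standard library has no F distribution, so that step is left out of the sketch.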
Lesson 14: ANOVA Analysis Setup • Load a microarray set in the Project Folders component. • Use the Arrays/Phenotypes component to define 3 or more groups of arrays upon which ANOVA will operate (the groups need to be activated). • In the Analysis component, select “Anova Analysis” among the available analyses. • Enter the desired run parameters in the parameters panel. • Click on the Analyze button to initiate the analysis. 129
Lesson 14: ANOVA Results Display Tabular Viewer: This Visual Area Component displays a read-only spreadsheet view of the significant genes sorted by p-value in ascending order (from most significant to least significant). In this view, the columns displayed can be altered in the preference window (click on the “Display Preferences” button), and the display can be sorted by the values in any column. The table is exportable in .csv format. The following columns can be displayed: Marker Name: The name of the marker that is deemed significant according to the analysis. F-Statistic: The value of the statistic calculated by the ANOVA test. P-Value: The probability of observing an F-statistic at least this large under the null hypothesis. For each group: • Mean is the mean expression value of the marker in that group. • Std is the standard deviation of the marker expression measurements in that group. 130
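As an illustration of the table layout, the following Python sketch sorts a list of results by p-value and writes them as CSV text. The column names follow the slide; the marker names and numbers are made up for the example, and the exact layout of the file geWorkbench exports may differ.

```python
import csv
import io


def results_to_csv(results):
    """Sort results by ascending p-value and return them as CSV text."""
    rows = sorted(results, key=lambda r: r["P-Value"])
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=["Marker Name", "F-Statistic", "P-Value"],
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical significant markers, in no particular order.
results = [
    {"Marker Name": "marker_2", "F-Statistic": 4.1, "P-Value": 0.03},
    {"Marker Name": "marker_1", "F-Statistic": 13.0, "P-Value": 0.001},
]
print(results_to_csv(results))
```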
Lesson 14: ANOVA Results Display (cont.) Color Mosaic: In this view, a color spectrum is used to indicate the relative magnitudes of the measurements. The arrays (columns) are grouped by input group membership, i.e. set 1, set 2, etc. Each row corresponds to a marker, and marker display is ordered by p-value in ascending order (from most significant to least significant). (Screenshot labels: P-values, Marker names.) 131
Part 3: Analysis and Display Review Part 3 described several tools for the analysis and display of microarray data. Having completed Part 3, you should be able to: 1. Use the Scatter Plot and Expression Value Distribution components to examine microarray datasets. 2. Run Reverse Engineering on a microarray dataset to find interactions with a chosen hub gene. 3. Retrieve gene annotations and pathway diagrams. 4. Run Hierarchical Clustering analysis (either remotely or locally) to discover trends in the data, and visualize and interact with the resulting dendrogram. 5. Run Analysis of Variance to discover genes differentially expressed across 3 or more experimental conditions. 132
Part 4: Workflow Execution Objectives • The objective of Part 4 is to demonstrate how to define and execute (in batch mode) analysis workflows involving the sequential invocation of multiple geWorkbench modules and caGrid analytical services. geWorkbench uses a specially developed script language, caScript, for coding workflows. The language itself will not be described in this presentation; if you are interested in caScript, the syntax and semantics of the language are explained in the Software Requirements and Specification document: http://cabigcvs.nci.nih.gov/viewcvs.cgi/caworkbenchcabig/Requirements/geWorkbench_cagrid_SRS_final.pdf After completing Part 4, you should be able to: 1. Load, edit and save caScript workflow files. 2. Execute caScript workflow files. 133
Part 4: Workflow Execution 134
Part 4: Workflow Execution Lesson outline Lesson 15: caScript Editor 135
Lesson 15: The caScript Editor 136
Lesson 15: caScript Editor Overview ► In many settings it is desirable to be able to codify a sequence of data processing steps so that they can be re-executed at a future time. ► geWorkbench uses a scripting language, caScript, to allow users to express such analysis workflows. caScript provides direct access to geWorkbench module functionality that has been explicitly exposed for scripting. It also allows programmatic invocation of caGrid-enabled analytical services. ► The caScript Editor facilitates the authoring of workflows by: ♦ Providing an editor environment in which to compose scripts. ♦ Supporting loading and editing of script files. ♦ Listing all geWorkbench modules and their corresponding methods that have been exposed to caScript. ♦ Listing all available caGrid analytical services and their methods that can be invoked by caScript. 137
Lesson 15: caScript Editor The Editor Environment • The caScript Editor component is located in the bottom right portion of the interface. It can be accessed by selecting the tab titled caSCRIPT. The Editor interface is divided into three main areas, numbered in the screenshot: 1. The editor window where scripts are edited. A script can be authored de novo or can be loaded from a file on disk by clicking on the Open File button. The contents of the editor can be saved to disk as a script file by clicking on the Save to File button. 2. The list of components that are available for invocation by caSCRIPT. This list contains all loaded geWorkbench modules (they appear under the tab “Local”) as well as available caGrid services (under the tab “Grid”). 3. The list of methods that each component/service exposes to caScript. For each method, its name and its input and output parameter types are displayed, to facilitate script authoring. 138
Lesson 15: caScript Editor The Editor Environment – Grid Services • The grid services accessible to caScript are discovered by querying a caGrid Index Service. The Grid tab of the component provides a space for entering the URL of this service. • Clicking on the Discover button will retrieve all services registered with the specified Index Service (in the screenshot below only one such service is found, a Hierarchical Clustering service). 139
Lesson 15: caScript Editor Script Execution • After editing is complete, the script can be executed by clicking on the “Execute” button. • At any point during execution, a script can be stopped by clicking on the “Stop Execution” button. • A script may involve the execution of methods which engage parts of the geWorkbench graphical user interface (e.g., when opening an Affymetrix gene expression file, you will be prompted to specify the location of the associated annotations file). In such cases script execution does not take place in a purely batch mode but rather requires interaction with the user. • At present the caScript Editor component is at a prototype stage; although it is fully functional in its ability to execute scripts, it has limitations that reduce its usability, such as: ♦ Many geWorkbench modules have not yet exposed methods to caScript. ♦ There is no visual feedback to the user indicating the progress of an executing script. 140
Part 4: Workflow Execution Review • Part 4 described how geWorkbench supports the authoring of workflows to facilitate the reproducible execution of data processing pipelines. caScript is the scripting language used for expressing such workflows. Having completed Part 4, you should be able to: 1. Locate appropriate documentation that contains further details about the syntax and semantics of the caScript language. 2. Create scripts, save them to disk files, and load them from disk files. 3. Identify geWorkbench modules and caGrid services that are accessible to caScript, along with the precise methods that are available for invocation. 4. Start and cancel script execution. 141
For further information… For further information about geWorkbench, including complete online tutorials, please see: www.geworkbench.org 142