Simple Programming to Identify Symbols of Selection

Lynette Strickland, INFO 500

Objective

Chelymorpha alternans is a Chrysomelid species distributed widely throughout Central and northern South America. On the Isthmus of Panama, the species has five different morphotypes that differ strongly in elytral and pronotal pattern and coloration. The ultimate aim of my dissertation work is to identify genomic areas that are highly divergent between morphotypes and that may be candidates for genomic regions contributing to differences in coloration. To lay the foundation for this, I performed a Restriction-site Associated DNA Sequencing (RAD-Seq) study on 32 individual beetles from a single geographic population along Santa Ridge Road in San Lorenzo, Panama. The 32 individuals represented three morphotypes: 11 Metallic, 11 Rufipennis, and 10 Darian.

Figure 1. Four morphotypes clockwise from left to right: Veraguensis, Metallic, Darian f. militaris-b, and Rufipennis

Problem

I used the STACKS pipeline, developed by the Cresko lab, to filter, assemble, and analyze polymorphic SNPs, with each of the three morphotypes representing one "population", i.e. Population 1 = Metallic phenotype, Population 2 = Rufipennis phenotype, Population 3 = Darian phenotype. This was done to find homozygous SNPs that have an FST estimate indicating high divergence and that are always associated with a particular phenotype, in this case the metallic phenotype, because it is the only morphotype known to always be homozygous.

The STACKS pipeline outputs many different files in many different formats, which allows for versatility when analyzing data. However, the FST output comes as a separate file for each pairwise population comparison. Each of these files lists every single nucleotide polymorphism (SNP) identified during sequencing, along with an FST estimate between the two populations.

To identify whether any SNPs were specific to the metallic phenotype, I copy-pasted columns F, G, and I from each pairwise population comparison into a new Excel sheet, identified all SNPs with an FST value higher than 0.3 using conditional formatting, and then copied the remaining data into a new Excel sheet. At the end I had two Excel sheets: one with columns F, G, and I of all SNPs with an FST value greater than 0.3 for the comparison between populations 1 and 2, and the other with the same information for populations 1 and 3. These are the data I found:

- Identified 27 SNPs with an FST higher than 0.3 in the metallic-rufipennis pairwise comparison
- Identified 79 SNPs with an FST higher than 0.3 in the metallic-darian pairwise comparison

From here, I copy-pasted the two separate columns into one Excel sheet and used the find-duplicates function under conditional formatting to generate a final list. This also flagged duplicates within a column as well as between the two columns, which made things more difficult.

Problem (Continued)

From here, I would have one Excel sheet with two columns of numbers, one column for each "population". I would then manually identify the duplicates. To find the corresponding values, I would have to go back through the original Excel sheet, with its thousands of numbers, to locate these particular base pairs and their surrounding information. This would not be a problem if I only had to do it once. However, I had to find this information for every batch parameter set that I ran through STACKS, which meant doing this more than 13 times!

Solution!

Writing a short program to do this was clearly much easier, and saved me so much time! Using this code, I was able to quickly run these files for multiple STACKS batch runs. I found that two SNPs (76 and 90) were repeated outliers throughout my analyses, and that these two are strongly associated with the metallic phenotype.
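The filter-and-compare step described above can be sketched in Python. This is a minimal illustration, not the code used for the poster. It assumes each pairwise FST file is tab-delimited with a header row containing columns named `Locus ID`, `BP`, `Column`, and `Fst`; these names, the function names, and the file paths are assumptions to adjust to the actual STACKS output. The 0.3 threshold comes from the text.

```python
import csv

FST_THRESHOLD = 0.3  # cutoff used in the text

def high_fst_snps(lines, threshold=FST_THRESHOLD,
                  key_cols=("Locus ID", "BP", "Column"), fst_col="Fst"):
    """Return {snp_key: row} for rows whose FST exceeds the threshold.

    `lines` is any iterable of tab-delimited text lines with a header row.
    The column names are assumptions; change them to match the real file.
    """
    hits = {}
    for row in csv.DictReader(lines, delimiter="\t"):
        try:
            fst = float(row[fst_col])
        except (TypeError, ValueError):
            continue  # skip rows with a missing or non-numeric FST
        if fst > threshold:
            hits[tuple(row[c] for c in key_cols)] = row
    return hits

def shared_outliers(hits_a, hits_b):
    """SNP keys that pass the threshold in both pairwise comparisons."""
    return sorted(set(hits_a) & set(hits_b))

def shared_outliers_from_files(path_a, path_b, threshold=FST_THRESHOLD):
    """Convenience wrapper: read two FST files and intersect their hits."""
    with open(path_a, newline="") as fa, open(path_b, newline="") as fb:
        hits_a = high_fst_snps(fa, threshold)
        hits_b = high_fst_snps(fb, threshold)
    return shared_outliers(hits_a, hits_b)
```

Calling `shared_outliers_from_files` once per batch run, with the metallic-rufipennis and metallic-darian files for that run, would replace the manual copy-paste and duplicate search. Because the intersection is taken between two sets of keys, duplicates within a single file do not produce false matches.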
SNPs 76 and 90 provide candidate regions from which to start looking for areas of the genome that may be controlling color and pattern!

Acknowledgments

This work was part of a Focal Point grant funded by the Graduate College at the University of Illinois at Urbana-Champaign. I would also like to acknowledge the Department of Animal Biology (my home department), and to thank Halie Rando, Diana Byrne, Heidi Imker, Ayla Stein, Jennifer Jones, Ian Toller-Clark, Christina Rodriguez, Tolu Stowe, Kelsey Witt, and everyone else who took, taught, interacted with, and helped out with INFO 500!

For further information

Please contact lrstric2@illinois.edu. More information on this and related projects can be obtained at http://publish.illinois.edu/data-science-across-disciplines/.
