High Performance Computing for genomic applications: Using genomic software on Euler
- Slides: 26

High Performance Computing for genomic applications
Using genomic software on Euler
Scientific IT Services
Michal Okoniewski
ID | SIS Michal Okoniewski, Scientific IT ETH | 12/23/2021 | 1

Bioinformatic modules on EULER
§ module load gdc
§ module avail
§ module purge

Bioinformatic software jungle - categories
§ Trimming tools: trimmomatic, cutadapt
§ Aligners:
  § General purpose: bwa, bowtie, SHRiMP
  § RNA aligners: STAR, tophat, subjunc, hisat2
  § Transcriptome aligners: kallisto, salmon, sailfish, RSEM
  § Other aligners: Blast, Blat, VMATCH
§ De-novo assemblers: trinity, velvet, spades
§ Feature extraction, counting: HTSeq, featureCounts
§ Transcript discovery: cufflinks
§ Specialized tools: MISO, blast2go, …
§ Utilities and conversion tools: samtools, bcftools, bedtools, picard tools

Bioinformatic software jungle (as seen back in 2015)
[Diagram of an RNA-seq analysis platform. Panel labels: starting panel (experiment definition; analysis mode: step-by-step or single-run), filters and trimming, alignment to the genome, genomic feature extraction and counting, statistical analysis tools, functional analysis tools, reporting panel. Tools shown: fastqc, STAR aligner, tophat aligner, ht_seq, SparkSeq (counts, junctions, tests), cufflinks/cuffdiff, MISO/MATS, jsplice, RSEM/BitSeq, DESeq/edgeR, DEXSeq, David, GeneGo (commercial), Ingenuity (commercial), String DB, igenomes, genome browser. Outputs: count and RPKM tables, differential expression report, differential splicing report, BED, CSV.]

Trimming tools
§ trimmomatic, cutadapt
§ Trimming:
  § Removing adapters
  § By quality: e.g. sliding window
  § Fixed position, e.g. several base pairs at each end
§ Typically: fastq on input, fastq on output
§ It gets more complex with paired reads
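
The sliding-window idea from the list above can be illustrated with a small sketch. It is a simplification for teaching and not Trimmomatic's exact algorithm: the function name `sliding_window_keep` is hypothetical, qualities are passed as plain Phred numbers instead of FASTQ quality characters, and the read is cut at the start of the first window whose mean quality falls below the threshold.

```shell
# Simplified sliding-window quality trim (illustration only).
# Usage: sliding_window_keep WINDOW MIN_MEAN_QUALITY Q1 Q2 Q3 ...
# Prints the number of leading bases that would be kept.
sliding_window_keep() {
    window=$1
    min_q=$2
    shift 2
    echo "$*" | awk -v w="$window" -v q="$min_q" '{
        keep = NF                          # default: no window fails, keep all bases
        for (i = 1; i + w - 1 <= NF; i++) {
            sum = 0
            for (j = i; j < i + w; j++) sum += $j
            if (sum / w < q) {             # first window with mean quality below q
                keep = i - 1               # cut at the start of that window
                break
            }
        }
        print keep
    }'
}

# Window of 4 bases, minimum mean quality 15 (as in SLIDINGWINDOW:4:15)
sliding_window_keep 4 15 30 30 30 30 10 10 10 10   # prints 4
```

The window starting at the fourth base has a mean of exactly 15 and passes; the next one has a mean of 10 and triggers the cut, so four bases remain.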

Trimming tools
    module load gdc
    module load java
    module load trimmomatic
Trimmomatic arguments (single-end mode):
    SE {input} {output} ILLUMINACLIP:adapters.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36
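
For paired reads, Trimmomatic's PE mode takes two input files and writes four outputs (paired and unpaired reads for each mate). The sketch below only prints the command so it can be inspected first. The jar location (`TRIMMOMATIC_JAR`), the `_R1`/`_R2` file naming and the thread count are assumptions for illustration, not settings taken from the slides; the trimming steps are the same ones shown for single-end mode.

```shell
# Hypothetical jar path; on a real system use the jar provided by the loaded module.
TRIMMOMATIC_JAR=${TRIMMOMATIC_JAR:-trimmomatic.jar}

# Print the paired-end Trimmomatic command for a sample prefix (dry run only).
trim_pe_cmd() {
    sample=$1
    echo java -jar "$TRIMMOMATIC_JAR" PE -threads 4 \
        "${sample}_R1.fastq.gz" "${sample}_R2.fastq.gz" \
        "${sample}_R1.paired.fastq.gz" "${sample}_R1.unpaired.fastq.gz" \
        "${sample}_R2.paired.fastq.gz" "${sample}_R2.unpaired.fastq.gz" \
        ILLUMINACLIP:adapters.fa:2:30:10 LEADING:3 TRAILING:3 \
        SLIDINGWINDOW:4:15 MINLEN:36
}

trim_pe_cmd mini
```

To actually run the trimming, the printed line can be piped to `sh` once the file names have been checked.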

Bowtie, bowtie2
§ Building genome index
    bowtie2-build --threads 24 /cluster/home/michalo/work_michalo/hg38/Homo_sapiens.GRCh38.dna.primary_assembly.fa /cluster/scratch/michalo/hg38
§ Alignment:
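
A possible alignment call against the index built above might look like the following sketch. The read file name `mini.fastq.gz` and the output name `mini_bowtie2.sam` are assumptions; the `run` helper is a hypothetical dry-run wrapper that prints the command and executes it only when `DRY_RUN=0`.

```shell
# Dry-run helper: print the command; execute it only when DRY_RUN=0.
run() { echo "+ $*"; [ "${DRY_RUN:-1}" = 1 ] || "$@"; }

INDEX=/cluster/scratch/michalo/hg38   # index prefix from the bowtie2-build step
READS=mini.fastq.gz                   # assumed single-end input file
THREADS=24

align_bowtie2() {
    # -p threads, -x index prefix, -U unpaired reads, -S SAM output file
    run bowtie2 -p "$THREADS" -x "$INDEX" -U "$READS" -S mini_bowtie2.sam
}

align_bowtie2
```

Running it with `DRY_RUN=0` on a node where bowtie2 is loaded would start the actual alignment.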

Tophat
§ Classic splice-aware aligner
§ Uses bowtie2 as engine, so also a bowtie2 index
    tophat -p 24 -o tophat_out --library-type fr-firststrand ~/work_michalo/hg38 mini.fastq.gz
§ Manual: http://ccb.jhu.edu/software/tophat/index.shtml

Splice aware aligners for RNA-seq, memory needs
Aligner, year   Index type               Index size = memory needs for hg38   Computing node
Tophat, 2011    bowtie index in a file   -                                    any
STAR, 2014      In memory                ca 40 GB                             EULER normal (65 GB)
Subjunc, 2014   In memory                ca 40 GB                             EULER normal (65 GB)
Hisat2, 2016    In memory                ca 7 GB                              good laptop (>= 8 GB RAM)

Tuxedo suite and “new tuxedo”

“New tuxedo suite”

Cufflinks and stringtie
§ Transcript discovery tools
§ Use coverage and junctions from a BAM file
    cufflinks mini_star.sorted.bam
§ Other tools in the suite: cuffmerge, cuffdiff, cuffquant, cuffnorm, CummeRbund

Cufflinks and stringtie
§ Produces GTF
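
The slides show a cufflinks call only; a comparable stringtie call could look like the sketch below. The annotation file name and the output name `mini_stringtie.gtf` are assumptions, and the `run` helper is a hypothetical dry-run wrapper that only prints the command unless `DRY_RUN=0`.

```shell
# Dry-run helper: print the command; execute it only when DRY_RUN=0.
run() { echo "+ $*"; [ "${DRY_RUN:-1}" = 1 ] || "$@"; }

assemble_stringtie() {
    # -p threads, -G reference annotation used as a guide (optional),
    # -o assembled transcripts in GTF format
    run stringtie mini_star.sorted.bam -p 24 \
        -G Homo_sapiens.GRCh38.86.chr.gtf -o mini_stringtie.gtf
}

assemble_stringtie
```

As with cufflinks, the input is a coordinate-sorted BAM file and the result is a GTF file of assembled transcripts.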

STAR
§ Splice-aware aligner, loading index into memory
§ Results similar to tophat, but faster
§ --genomeLoad LoadAndKeep
§ With specific options, can produce BAM and do the counting too
§ https://github.com/alexdobin/STAR/blob/master/doc/STARmanual.pdf
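
The "specific options" for BAM output and counting can be sketched as follows. This is one possible combination and not necessarily the one used in the course: the index directory `star_index`, the read file name and the output prefix are assumptions, while the FASTA and GTF paths are the ones used elsewhere in the slides. The `run` helper is a hypothetical dry-run wrapper.

```shell
# Dry-run helper: print the command; execute it only when DRY_RUN=0.
run() { echo "+ $*"; [ "${DRY_RUN:-1}" = 1 ] || "$@"; }

GENOME_DIR=star_index   # hypothetical index directory
FASTA=/cluster/home/michalo/work_michalo/hg38/Homo_sapiens.GRCh38.dna.primary_assembly.fa
GTF=/cluster/home/michalo/work_michalo/hg38/Homo_sapiens.GRCh38.86.chr.gtf

build_star_index() {
    # The annotation is included at index time so that gene counting works later.
    run STAR --runMode genomeGenerate --runThreadN 24 \
        --genomeDir "$GENOME_DIR" --genomeFastaFiles "$FASTA" --sjdbGTFfile "$GTF"
}

align_star() {
    # Sorted BAM output plus per-gene read counts in a single run.
    run STAR --runThreadN 24 --genomeDir "$GENOME_DIR" \
        --readFilesIn mini.fastq.gz --readFilesCommand zcat \
        --outSAMtype BAM SortedByCoordinate --quantMode GeneCounts \
        --outFileNamePrefix mini_star.
}

build_star_index
align_star
```

With this prefix STAR would write `mini_star.Aligned.sortedByCoord.out.bam` and `mini_star.ReadsPerGene.out.tab`. The shared-memory option `--genomeLoad LoadAndKeep` is left out of the sketch on purpose, because it has additional requirements when combined with sorted BAM output.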

STAR statistics

subread
§ Includes subjunc, similar to STAR, and featureCounts
§ Building index
    subread-buildindex -o /cluster/home/michalo/work_michalo/hg38/subread_index/hg38 /cluster/home/michalo/work_michalo/hg38/Homo_sapiens.GRCh38.dna.primary_assembly.fa
§ Alignment
    subread-align -T 24 -i /cluster/home/michalo/work_michalo/hg38/subread_index/hg38 -r mini.fastq -o mapped_reads_subjunc/mini.bam
    subjunc -T 24 -i /cluster/home/michalo/work_michalo/hg38/subread_index/hg38 -r mini.fastq -o mapped_reads_subjunc/mini.bam
§ http://bioinf.wehi.edu.au/subread-package/SubreadUsersGuide.pdf

samtools
§ General purpose tool for conversion between BAM and SAM
§ Many other operations: pileup…
§ See: http://www.htslib.org/doc/samtools.html
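
The typical conversion steps can be sketched as below. File names and the thread count are assumptions for illustration; the `run` helper is a hypothetical dry-run wrapper that only prints each command unless `DRY_RUN=0`.

```shell
# Dry-run helper: print the command; execute it only when DRY_RUN=0.
run() { echo "+ $*"; [ "${DRY_RUN:-1}" = 1 ] || "$@"; }

sam_to_sorted_bam() {
    name=$1
    run samtools view -b -o "${name}.bam" "${name}.sam"              # SAM -> BAM
    run samtools sort -@ 4 -o "${name}.sorted.bam" "${name}.bam"     # sort by coordinate
    run samtools index "${name}.sorted.bam"                          # creates the .bai index
}

bam_to_sam() {
    name=$1
    run samtools view -h -o "${name}.sam" "${name}.bam"              # BAM -> SAM, keep header
}

sam_to_sorted_bam mini
```

A sorted and indexed BAM file is what most downstream tools, such as genome browsers and counting tools, expect.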

featureCounts
§ Fast and flexible counting in genomic features
    featureCounts -M -s 2 -T 24 -t gene -g gene_id -a /cluster/home/michalo/work_michalo/hg38/Homo_sapiens.GRCh38.86.chr.gtf -o mini.cnt mini_star.sorted.bam

featureCounts

Counting reads in genes is non-trivial!
HTSeq counting modes by Simon Anders

Tuning featureCounts
§ Default counting is available directly in current STAR
§ Still, for many cases a more careful counting is needed
§ Gene or exon level counting
§ Options
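
As an illustration of exon-level counting, the sketch below changes the feature type and switches featureCounts to per-feature summarization. The choice of options and the output name `mini.exon.cnt` are assumptions, not the settings from the course; the annotation and BAM file are the ones used in the gene-level example. The `run` helper is a hypothetical dry-run wrapper.

```shell
# Dry-run helper: print the command; execute it only when DRY_RUN=0.
run() { echo "+ $*"; [ "${DRY_RUN:-1}" = 1 ] || "$@"; }

GTF=/cluster/home/michalo/work_michalo/hg38/Homo_sapiens.GRCh38.86.chr.gtf

count_exons() {
    # -f  count at feature (exon) level instead of summarizing per gene
    # -O  also count reads that overlap more than one feature
    # -s 2 reverse-stranded library, -T threads, -t feature type in the GTF
    run featureCounts -f -O -s 2 -T 24 -t exon -g gene_id \
        -a "$GTF" -o mini.exon.cnt mini_star.sorted.bam
}

count_exons
```

Whether `-O` and `-M` are appropriate depends on the experiment, which is one reason why counting needs more care than the defaults suggest.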

SRA tools
Command-line download of public datasets from the Sequence Read Archive (SRA)
http://ncbi.github.io/sra-tools/
    fastq-dump --split-files ERR2811092
(from v2.9)
    fasterq-dump --split-files ERR2811092 -e 4

GATK
§ GATK is a genomic toolbox for various operations related mainly to calling genomic variants
§ Operations include producing a variant file *.vcf from an alignment file *.bam
    module load gcc/4.8.2 gdc java/1.8.0_73 gatk/3.5
    java -jar GenomeAnalysisTK.jar -T UnifiedGenotyper -R ref/human_g1k_b37_20.fasta -I bams/exp_design/NA12878_wgs_20.bam -o sandbox/NA12878_wgs_20_UG_calls.vcf -glm BOTH -L 20:10,000-10,200,000
https://software.broadinstitute.org/gatk/documentation/tooldocs/current/
https://software.broadinstitute.org/gatk/documentation/topic?name=tutorials
http://gatkforums.broadinstitute.org/gatk/discussion/7869/howto-discover-variantswith-gatk-a-gatk-workshop-tutorial

Bioinformatics software stack on EULER
§ https://scicomp.ethz.ch/wiki/GDC_software_stack
§ Commands to call modules for a specific tool
§ List compiled in collaboration with the Genomic Diversity Center from D-USYS

Resources needed by bioinformatic tools on EULER
Category               Tools                          Cores      Memory node      Time/queue
Trimming               Trimmomatic, cutadapt          1          -                24 h
Aligners - old         bowtie, bwa, SHRiMP, tophat    many: 24   -                4 h
Aligners - RNAseq      STAR, subread                  many: 24   65 GB            4 h
Aligners - optimized   hisat2                         many: 24   -                4 h
Conversion tools       Samtools, bcftools             1          -                4 h
Counting               featureCounts                  1          -                4 h
De-novo assembly       Spades, velvet                 many       256 GB - 1 TB    24 h or more
Variant calling        GATK, pileup                   many       256 GB           24 h
…
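
When a batch system expects memory per core instead of per job, the total has to be divided by the number of cores. The sketch below works through that arithmetic for a job that needs about 40 GB on 24 cores for 4 hours. It assumes an LSF-style `bsub` submission with memory requested per core in MB, which is not stated in the slides; the function names and the script `run_star.sh` are hypothetical, and the command is only printed, never submitted.

```shell
# Per-core memory in MB so that CORES * per-core >= TOTAL_GB (rounded up).
mem_per_core_mb() {
    total_gb=$1
    cores=$2
    echo $(( (total_gb * 1024 + cores - 1) / cores ))
}

# Print an LSF-style submission line (assumption: bsub with per-core memory in MB).
job_cmd() {
    total_gb=$1
    cores=$2
    hours=$3
    mem=$(mem_per_core_mb "$total_gb" "$cores")
    echo "bsub -n $cores -W ${hours}:00 -R \"rusage[mem=${mem}]\" ./run_star.sh"
}

mem_per_core_mb 40 24   # prints 1707
job_cmd 40 24 4
```

Rounding up matters here: 24 cores at 1706 MB each would fall just short of 40 GB.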

Thank you!
Using genomic software on Euler