
High Performance Computing for genomic applications
Using genomic software on Euler
Scientific IT Services
ID SIS | Michal Okoniewski, Scientific IT ETH | 12/23/2021

Bioinformatic modules on EULER
    module load gdc
    module avail
    module purge

Bioinformatic software jungle - categories
• Trimming tools: trimmomatic, cutadapt
• Aligners:
  • General purpose: bwa, bowtie, SHRiMP
  • RNA aligners: STAR, tophat, subjunc, hisat2
  • Transcriptome aligners: kallisto, salmon, sailfish, RSEM
  • Other aligners: BLAST, BLAT, VMATCH
• De novo assemblers: trinity, velvet, spades
• Feature extraction, counting: HTSeq, featureCounts
• Transcript discovery: cufflinks
• Specialized tools: MISO, blast2go, …
• Utilities and conversion tools: samtools, bcftools, bedtools, picard tools

Bioinformatic software jungle (as seen back in 2015)
[Figure: workflow diagram of an RNA-seq analysis pipeline. A starting panel (experiment definition, genome, analysis mode: step-by-step or single-run) feeds filtering and trimming (fastqc), alignment to the genome (STAR, tophat), genomic feature extraction and counting (HTSeq, SparkSeq, cufflinks/cuffdiff, RSEM/BitSeq, MISO/MATS, jsplice), statistical analysis tools (DESeq/edgeR, DEXSeq), functional analysis tools (DAVID, STRING DB, GeneGo, Ingenuity) and a reporting panel (differential expression and differential splicing reports, BED/CSV output, genome browser).]

Trimming tools
• trimmomatic, cutadapt
• Trimming:
  • Removing adapters
  • By quality, e.g. sliding window
  • Fixed position, e.g. several base pairs at each end
• Typically: fastq on input, fastq on output
• It gets more complex with paired reads

Trimming tools
    module load gdc
    module load java
    module load trimmomatic
    trimmomatic SE {input} {output} ILLUMINACLIP:adapters.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36
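For paired reads, Trimmomatic's PE mode takes two input files and four output files (paired and unpaired reads for each mate). The sketch below only prints the command so it can be checked before it goes into a job script. The sample prefix and the `_R1`/`_R2` file naming are assumptions; the `trimmomatic` wrapper name and the trimming steps are copied from the single-end call above.

```shell
# Sketch: build a paired-end Trimmomatic command line.
# File naming (<sample>_R1.fastq.gz, <sample>_R2.fastq.gz) is hypothetical.
trim_pe_cmd() {
    sample=$1      # sample prefix, e.g. "mini"
    adapters=$2    # FASTA file with adapter sequences
    # PE mode: 2 inputs, then paired/unpaired outputs for mate 1 and mate 2.
    printf '%s ' trimmomatic PE \
        "${sample}_R1.fastq.gz" "${sample}_R2.fastq.gz" \
        "${sample}_R1.paired.fastq.gz" "${sample}_R1.unpaired.fastq.gz" \
        "${sample}_R2.paired.fastq.gz" "${sample}_R2.unpaired.fastq.gz" \
        "ILLUMINACLIP:${adapters}:2:30:10" LEADING:3 TRAILING:3 \
        SLIDINGWINDOW:4:15 MINLEN:36
    printf '\n'
}

# Print the command for inspection.
trim_pe_cmd mini adapters.fa
```

Only the two `paired` outputs keep mates in sync, so those are the files to pass on to a paired-end aligner.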

Bowtie, bowtie2
• Building genome index
    bowtie2-build --threads 24 /cluster/home/michalo/work_michalo/hg38/Homo_sapiens.GRCh38.dna.primary_assembly.fa /cluster/scratch/michalo/hg38
• Alignment:
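The slide does not show the alignment command itself. A minimal single-end sketch, assuming the index prefix built above and the `mini.fastq.gz` reads used on later slides, could look like this; the output file name is hypothetical.

```shell
# Sketch: single-end bowtie2 alignment against the hg38 index.
index=/cluster/scratch/michalo/hg38   # index prefix passed to bowtie2-build
reads=mini.fastq.gz                   # reads file used elsewhere in the slides
out=mini_bowtie2.sam                  # hypothetical output name

# -p threads, -x index prefix, -U unpaired reads, -S SAM output file
cmd="bowtie2 -p 24 -x $index -U $reads -S $out"

echo "$cmd"       # inspect the command first
# eval "$cmd"     # uncomment on a node where the bowtie2 module is loaded
```

For paired reads, bowtie2 takes `-1` and `-2` in place of `-U`.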

Tophat
• Classic splice-aware aligner
• Uses bowtie2 as its engine, so it also uses the bowtie2 index
    tophat -p 24 -o tophat_out --library-type fr-firststrand ~/work_michalo/hg38 mini.fastq.gz
• Manual: http://ccb.jhu.edu/software/tophat/index.shtml

Splice-aware aligners for RNA-seq, memory needs

Aligner, year    Index type                Index size = memory needs for hg38    Computing node
Tophat, 2011     bowtie index in a file    -                                     any
STAR, 2014       in memory                 ca 40 GB                              EULER normal (65 GB)
Subjunc, 2014    in memory                 ca 40 GB                              EULER normal (65 GB)
Hisat2, 2016     in memory                 ca 7 GB                               good laptop (>= 8 GB RAM)

Tuxedo suite and “new tuxedo”

“New tuxedo suite”

Cufflinks and stringtie
• Transcript discovery tools
• Use coverage and junctions from a BAM file
    cufflinks mini_star.sorted.bam
• Other: cuffmerge, cuffdiff, cuffquant, cuffnorm, CummeRbund

Cufflinks and stringtie
• Produces GTF
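The slides show a cufflinks call only. A stringtie counterpart might be sketched as below; the function and the output file name are hypothetical, while the BAM and annotation names are the ones used elsewhere in the slides. `-o` names the output GTF, `-p` sets the thread count, and `-G` optionally supplies a reference annotation to guide assembly.

```shell
# Sketch: print a stringtie command, with or without a guide annotation.
stringtie_cmd() {
    bam=$1            # coordinate-sorted BAM
    out_gtf=$2        # assembled transcripts (GTF)
    guide_gtf=${3-}   # optional reference annotation
    if [ -n "$guide_gtf" ]; then
        echo "stringtie $bam -p 24 -G $guide_gtf -o $out_gtf"
    else
        echo "stringtie $bam -p 24 -o $out_gtf"
    fi
}

stringtie_cmd mini_star.sorted.bam mini_stringtie.gtf
stringtie_cmd mini_star.sorted.bam mini_stringtie.gtf Homo_sapiens.GRCh38.86.chr.gtf
```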

STAR
• Splice-aware aligner, loading index into memory
• Results similar to tophat, but faster
    --genomeLoad LoadAndKeep
• With specific options, can produce BAM and do the counting too
• https://github.com/alexdobin/STAR/blob/master/doc/STARmanual.pdf
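The "specific options" for BAM output and counting are not listed on the slide. One possible combination is `--outSAMtype BAM SortedByCoordinate` with `--quantMode GeneCounts`, sketched below in dry-run form. The index directory name is hypothetical, and gene counting assumes the index was generated with a GTF annotation.

```shell
# Sketch: STAR alignment producing a sorted BAM plus per-gene counts.
# run() prints the command by default; set DRY_RUN=0 to execute it.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then echo "$*"; else "$@"; fi
}

star_index=star_index_hg38   # hypothetical directory from STAR --runMode genomeGenerate
reads=mini.fastq.gz

# --readFilesCommand zcat lets STAR read gzipped fastq directly.
run STAR --runThreadN 24 \
    --genomeDir "$star_index" \
    --readFilesIn "$reads" --readFilesCommand zcat \
    --outSAMtype BAM SortedByCoordinate \
    --quantMode GeneCounts \
    --outFileNamePrefix mini_star.
```

With this prefix STAR writes `mini_star.Aligned.sortedByCoord.out.bam` and `mini_star.ReadsPerGene.out.tab`. The sketch leaves out `--genomeLoad LoadAndKeep`; combining a shared-memory index with sorted BAM output needs an explicit `--limitBAMsortRAM`, so check the manual before adding it.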

STAR statistics

subread
• Includes subjunc (similar to STAR) and featureCounts
• Building index
    subread-buildindex -o /cluster/home/michalo/work_michalo/hg38/subread_index/hg38 /cluster/home/michalo/work_michalo/hg38/Homo_sapiens.GRCh38.dna.primary_assembly.fa
• Alignment
    subread -T 24 -i /cluster/home/michalo/work_michalo/hg38/subread_index/hg38 -r mini.fastq -o mapped_reads_subjunc/mini.bam
    subjunc -T 24 -i /cluster/home/michalo/work_michalo/hg38/subread_index/hg38 -r mini.fastq -o mapped_reads_subjunc/mini.bam
• http://bioinf.wehi.edu.au/subread-package/SubreadUsersGuide.pdf

samtools
• General-purpose tool for conversion between BAM and SAM
• Many other operations: pileup…
• See: http://www.htslib.org/doc/samtools.html
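A few typical conversion steps, as a sketch: SAM to BAM, coordinate sorting, indexing, and BAM back to SAM. The file names are hypothetical. Each command is prefixed with `$RUN`, which defaults to `echo` so the block only prints what it would do.

```shell
# Sketch: common samtools conversions.
# RUN defaults to "echo" (dry run); call with RUN= to execute for real.
RUN=${RUN-echo}

$RUN samtools view -b -o mini.bam mini.sam                 # SAM -> BAM
$RUN samtools sort -@ 4 -o mini.sorted.bam mini.bam        # sort by coordinate, 4 extra threads
$RUN samtools index mini.sorted.bam                        # writes mini.sorted.bam.bai
$RUN samtools view -h -o mini.sorted.sam mini.sorted.bam   # BAM -> SAM, keeping the header
```

Tools such as cufflinks and genome browsers generally expect the sorted BAM.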

featureCounts
• Fast and flexible counting in genomic features
    featureCounts -M -s 2 -T 24 -t gene -g gene_id -a /cluster/home/michalo/work_michalo/hg38/Homo_sapiens.GRCh38.86.chr.gtf -o mini.cnt mini_star.sorted.bam
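The output file (`mini.cnt` above) starts with a `#` comment line and a header line, followed by tab-separated columns Geneid, Chr, Start, End, Strand, Length and one count column per BAM. A small sketch for reducing it to a two-column gene/count table; the function name and the output file name are hypothetical.

```shell
# Sketch: keep gene ID and the count column of the first BAM (column 7),
# skipping the "#" comment line and the "Geneid" header line.
counts_table() {
    awk -F '\t' '!/^#/ && $1 != "Geneid" { print $1 "\t" $7 }' "$1"
}

# Run only when the featureCounts output is actually present.
if [ -f mini.cnt ]; then
    counts_table mini.cnt > mini.genes.tsv
fi
```

The resulting table is the usual starting point for count-based tools such as DESeq or edgeR.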

featureCounts

Counting reads in genes is non-trivial!
HTSeq counting modes by Simon Anders

Tuning featureCounts
• Default counting features can be found in current STAR
• Still, in many cases more careful counting is needed
• Gene- or exon-level counting
• Options

SRA tools
• Command-line download of public datasets from the Sequence Read Archive
• http://ncbi.github.io/sra-tools/
    fastq-dump --split-files ERR2811092
• From v2.9:
    fasterq-dump --split-files ERR2811092 -e 4
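To fetch several runs, the same `fasterq-dump` call can be wrapped in a loop. The sketch below prints one command per accession; the function name is hypothetical, and the single accession is the one from the slide.

```shell
# Sketch: print one fasterq-dump command per run accession.
sra_cmds() {
    threads=4    # passed to -e, as on the slide
    for acc in "$@"; do
        echo "fasterq-dump --split-files $acc -e $threads"
    done
}

sra_cmds ERR2811092
# sra_cmds ERR2811092 | sh    # execute on a node where sra-tools is loaded
```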

GATK
• GATK is a genomic toolbox for various operations related mainly to genomic variant calling
• Operations include producing a variant file *.vcf from an alignment file *.bam
    module load gcc/4.8.2 gdc java/1.8.0_73 gatk/3.5
    java -jar GenomeAnalysisTK.jar -T UnifiedGenotyper -R ref/human_g1k_b37_20.fasta -I bams/exp_design/NA12878_wgs_20.bam -o sandbox/NA12878_wgs_20_UG_calls.vcf -glm BOTH -L 20:10,000-10,200,000
• https://software.broadinstitute.org/gatk/documentation/tooldocs/current/
• https://software.broadinstitute.org/gatk/documentation/topic?name=tutorials
• http://gatkforums.broadinstitute.org/gatk/discussion/7869/howto-discover-variantswith-gatk-a-gatk-workshop-tutorial

Bioinformatics software stack on EULER
• https://scicomp.ethz.ch/wiki/GDC_software_stack
• Commands to call modules for a specific tool
• List compiled in collaboration with the Genetic Diversity Centre (GDC) from D-USYS

Resources needed by bioinformatic tools on EULER

Category                Tools                          Cores       Memory node     Time/queue
Trimming                Trimmomatic, cutadapt          1           -               24 h
Aligners - old          bowtie, bwa, SHRiMP, tophat    many: 24    -               4 h
Aligners - RNAseq       STAR, subread                  many: 24    65 GB           4 h
Aligners - optimized    hisat2                         many: 24    -               4 h
Conversion tools        Samtools, bcftools             1           -               4 h
Counting                featureCounts                  1           -               4 h
De novo assembly        Spades, velvet                 many        256 GB-1 TB     24 h or more
Variant calling         GATK, pileup                   many        256 GB          24 h
…
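The slides do not say how these resources are requested. Assuming the cluster runs the LSF scheduler (an assumption, not stated in the slides), a request could be built with `bsub`, where `-n` is the number of cores, `-W` the run-time limit, and `-R "rusage[mem=...]"` the memory request. The helper name and the memory value are hypothetical, and the unit of `mem` depends on the site configuration.

```shell
# Sketch: print a bsub submission line for a given resource request.
# Assumes LSF; the command is only printed, never submitted.
bsub_cmd() {
    cores=$1    # value for -n
    hours=$2    # wall-clock limit in hours, formatted as H:00 for -W
    mem=$3      # value for rusage[mem=...]; unit is site-specific
    shift 3
    printf 'bsub -n %s -W %s:00 -R "rusage[mem=%s]" "%s"\n' \
        "$cores" "$hours" "$mem" "$*"
}

# Conversion tool: 1 core, 4 h, as in the table above.
bsub_cmd 1 4 2000 samtools index mini_star.sorted.bam
```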

Thank you! Using genomic software on Euler
