Infrastructure for Deploying GATK Best Practices Pipeline
The Broad Institute GATK Best Practices pipeline has helped standardize genomic analysis by providing step-by-step recommendations for pre-processing and variant discovery. Pre-processing generates analysis-ready mapped reads from raw reads using tools such as BWA*, Picard* tools, and the Genome Analysis Toolkit (GATK). These analysis-ready reads are then passed through the Variant Calling step of Variant Discovery analysis to generate per-sample variants. The first part of the GATK Best Practices pipeline takes two FASTQ files, a reference genome, and the dbSNP and 1000g_indels VCF files as input, and outputs one gVCF file per sample. These gVCF files are further analyzed in the Joint Genotyping and Variant Filtering steps of Variant Discovery analysis.
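The per-sample flow described above can be sketched as an ordered list of tool invocations. This is a minimal illustration under stated assumptions: the tool flags, intermediate file names, and helper function are hypothetical and not the exact commands used in this benchmark.

```python
def gatk_preprocessing_commands(sample, ref, dbsnp, known_indels, threads=36):
    """Return the ordered command lines for one sample (FASTQ pair -> gVCF).

    Illustrative sketch only; flags and file names are assumptions, not the
    benchmarked configuration.
    """
    return [
        # 1. Map raw paired-end reads to the reference with BWA-MEM.
        f"bwa mem -t {threads} {ref} {sample}_1.fastq {sample}_2.fastq > {sample}.sam",
        # 2. Sort and mark duplicates with Picard tools.
        f"picard SortSam I={sample}.sam O={sample}.sorted.bam SORT_ORDER=coordinate",
        f"picard MarkDuplicates I={sample}.sorted.bam O={sample}.dedup.bam M={sample}.metrics",
        # 3. Recalibrate base quality scores with GATK, using the known-sites VCFs.
        f"gatk BaseRecalibrator -I {sample}.dedup.bam -R {ref} "
        f"--known-sites {dbsnp} --known-sites {known_indels} -O {sample}.recal.table",
        f"gatk ApplyBQSR -I {sample}.dedup.bam -R {ref} "
        f"--bqsr-recal-file {sample}.recal.table -O {sample}.recal.bam",
        # 4. Call variants per sample into a gVCF with HaplotypeCaller.
        f"gatk HaplotypeCaller -I {sample}.recal.bam -R {ref} -ERC GVCF -O {sample}.g.vcf.gz",
    ]
```

Each stage consumes the previous stage's output, which is why the paper profiles them as a single end-to-end pipeline rather than as independent tools.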
The tools in the GATK Best Practices pipeline require substantial computational power and long run times. Benchmarking such a pipeline allows users to determine the recommended hardware and optimize parameters to reduce execution time. In an effort to advance the standardization and optimization of genomic pipelines, Intel has benchmarked the GATK Best Practices pipeline using Workflow Profiler, an open-source tool that provides insight into system resources (such as CPU/disk utilization and committed memory) and helps identify and eliminate resource bottlenecks.
By using the recommended hardware and applying thread-level and process-level optimizations to the single-sample Solexa-272221 WGS* dataset, we achieve different levels of performance. The accompanying chart shows how execution time scales with the number of threads and processes across the pipeline components. For this dataset, every component shows a decrease in run time going from 1 to 36 threads. Overall, the end-to-end execution time from BWA-MEM* through HaplotypeCaller went from 227 hours to 36 hours, a 6x speed-up.1 These performance guidelines can be used to size genomics clusters running GATK Best Practices pipelines.
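The reported speed-up follows directly from the two wall-clock figures in the text; a quick back-of-the-envelope check, with a derived parallel-efficiency figure that is our own arithmetic rather than a number from the study:

```python
# Wall-clock times reported in the text for the BWA-MEM -> HaplotypeCaller span.
serial_hours = 227    # 1 thread
parallel_hours = 36   # 36 threads

speedup = serial_hours / parallel_hours   # ~6.3x, reported as 6x
print(f"speed-up: {speedup:.1f}x")

# Parallel efficiency on 36 threads (speed-up divided by thread count) --
# well short of linear, consistent with some stages scaling better than others.
efficiency = speedup / 36
print(f"efficiency: {efficiency:.0%}")
```

The sub-linear efficiency is why profiling individual components (as Workflow Profiler does) matters: the poorly scaling stages dominate at high thread counts.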
This benchmarking study provides recommendations for Intel® hardware and guidelines for running a set of whole genome sequences through the GATK Best Practices pipeline. Researchers who aim to use this pipeline on multiple datasets may use this paper to scale the number of machines to match the number of datasets requiring analysis. For example, an institution that aims to analyze 100 whole genome sequences (WGS) per month may need about 5 machines (each with 36 cores) running in parallel to achieve this goal.
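The sizing example above can be reproduced from the measured ~36-hour end-to-end runtime. A minimal sketch, assuming one sample runs at a time per 36-core machine and a 30-day month; these scheduling assumptions are ours, not the paper's:

```python
import math

samples_per_month = 100
hours_per_sample = 36          # measured end-to-end runtime on a 36-core node
hours_per_month = 30 * 24      # 720 machine-hours available per node per month

# Each node can finish 720 / 36 = 20 samples per month.
samples_per_node = hours_per_month // hours_per_sample

# 100 samples / 20 samples-per-node -> 5 nodes.
nodes_needed = math.ceil(samples_per_month / samples_per_node)
print(nodes_needed)  # 5
```

The same arithmetic scales linearly: doubling the monthly sample target, or halving per-sample runtime through further optimization, halves or doubles the cluster size accordingly.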
A popular software package for mapping low-divergent sequences against a large-reference genome, such as the human genome.
An open-source implementation of the HMMER* protein sequence analysis suite.
An algorithm for comparing primary biological sequence information.
A software package developed at the Broad Institute to analyze next-generation sequencing data.
QIAGEN Bioinformatics* solutions deliver faster time to insight by combining powerful analytics that are able to interpret complex biological processes.
Halvade* is a MapReduce implementation of the best-practice DNA sequencing pipeline as recommended by Broad Institute.
ABySS* is an open-source de novo genome assembler for short paired-end reads.
DIDA* performs large-scale alignment tasks by distributing the indexing and alignment stages into smaller subtasks over a cluster of compute nodes.
elPrep* is a high-performance tool for preparing SAM/BAM/CRAM files for variant calling in genomic sequencing pipelines.
Product and Performance Information
Benchmark results were obtained prior to the implementation of recent software patches and firmware updates intended to address exploits referred to as "Spectre" and "Meltdown". Implementation of these updates may make these results inapplicable to your device or system.
Software and workloads used in performance tests may have been optimized for performance only on Intel® microprocessors. Performance tests, such as SYSmark* and MobileMark*, are measured using specific computer systems, components, software, operations, and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more information, visit http://www.intel.co.kr/benchmarks.