DeepVariant whole exome sequencing (WES) case study

Similar to the case study on whole genome sequencing data, in this study we describe applying DeepVariant to a real exome sample using a single machine.

Please see the metrics page for details on runtime and data.

Prepare environment

Tools

Docker will be used to run DeepVariant and hap.py,

Download Reference

We will be using GRCh38 for this case study.

mkdir -p reference

FTPDIR=ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids

curl ${FTPDIR}/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz | gunzip > reference/GRCh38_no_alt_analysis_set.fasta
curl ${FTPDIR}/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.fai > reference/GRCh38_no_alt_analysis_set.fasta.fai

Download Genome in a Bottle Benchmarks

We will benchmark our variant calls against v4.2.1 of the Genome in a Bottle small variant benchmarks for HG003.

mkdir -p benchmark

FTPDIR=ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/release/AshkenazimTrio/HG003_NA24149_father/NISTv4.2.1/GRCh38

curl ${FTPDIR}/HG003_GRCh38_1_22_v4.2.1_benchmark_noinconsistent.bed > benchmark/HG003_GRCh38_1_22_v4.2.1_benchmark_noinconsistent.bed
curl ${FTPDIR}/HG003_GRCh38_1_22_v4.2.1_benchmark.vcf.gz > benchmark/HG003_GRCh38_1_22_v4.2.1_benchmark.vcf.gz
curl ${FTPDIR}/HG003_GRCh38_1_22_v4.2.1_benchmark.vcf.gz.tbi > benchmark/HG003_GRCh38_1_22_v4.2.1_benchmark.vcf.gz.tbi

Download HG003 BAM

mkdir -p input
HTTPDIR=https://storage.googleapis.com/deepvariant/exome-case-study-testdata

curl ${HTTPDIR}/HG003.novaseq.wes_idt.100x.dedup.bam > input/HG003.novaseq.wes_idt.100x.dedup.bam
curl ${HTTPDIR}/HG003.novaseq.wes_idt.100x.dedup.bam.bai > input/HG003.novaseq.wes_idt.100x.dedup.bam.bai

Download capture target BED file

In this case study we'll use idt_capture_novogene.grch38.bed as the capture target BED file. For evaluation, hap.py will intersect this BED with the GIAB confident regions.

HTTPDIR=https://storage.googleapis.com/deepvariant/exome-case-study-testdata

curl ${HTTPDIR}/idt_capture_novogene.grch38.bed > input/idt_capture_novogene.grch38.bed

Running on a CPU-only machine

mkdir -p output
mkdir -p output/intermediate_results_dir

BIN_VERSION="1.10.0"

sudo docker run \
  -v "${PWD}/input":"/input" \
  -v "${PWD}/output":"/output" \
  -v "${PWD}/reference":"/reference" \
  google/deepvariant:"${BIN_VERSION}" \
  /opt/deepvariant/bin/run_deepvariant \
  --model_type WES \
  --ref /reference/GRCh38_no_alt_analysis_set.fasta \
  --reads /input/HG003.novaseq.wes_idt.100x.dedup.bam \
  --regions /input/idt_capture_novogene.grch38.bed \
  --output_vcf /output/HG003.output.vcf.gz \
  --output_gvcf /output/HG003.output.g.vcf.gz \
  --num_shards $(nproc) \
  --intermediate_results_dir /output/intermediate_results_dir

By specifying --model_type WES, you'll be using a model that is best suited for Illumina Whole Exome Sequencing data.

NOTE: If you want to run each of the steps separately, add --dry_run=true to the command above to figure out what flags you need in each step. Based on the different model types, different flags are needed in the make_examples step.

--intermediate_results_dir flag is optional. By specifying it, the intermediate outputs of make_examples and call_variants stages can be found in the directory. After the command, you can find these files in the directory:

call_variants_output.tfrecord.gz
gvcf.tfrecord-?????-of-?????.gz
make_examples.tfrecord-?????-of-?????.gz

For running on GPU machines, or using Singularity instead of Docker, see Quick Start.

Benchmark on all chromosomes

mkdir -p happy

sudo docker pull jmcdani20/hap.py:v0.3.12

sudo docker run \
  -v "${PWD}/benchmark":"/benchmark" \
  -v "${PWD}/input":"/input" \
  -v "${PWD}/output":"/output" \
  -v "${PWD}/reference":"/reference" \
  -v "${PWD}/happy:/happy" \
  jmcdani20/hap.py:v0.3.12 /opt/hap.py/bin/hap.py \
  /benchmark/HG003_GRCh38_1_22_v4.2.1_benchmark.vcf.gz \
  /output/HG003.output.vcf.gz \
  -f /benchmark/HG003_GRCh38_1_22_v4.2.1_benchmark_noinconsistent.bed \
  -T /input/idt_capture_novogene.grch38.bed \
  -r /reference/GRCh38_no_alt_analysis_set.fasta \
  -o /happy/happy.output \
  --engine=vcfeval \
  --pass-only

Output:

Benchmarking Summary:
Type Filter  TRUTH.TOTAL  TRUTH.TP  TRUTH.FN  QUERY.TOTAL  QUERY.FP  QUERY.UNK  FP.gt  FP.al  METRIC.Recall  METRIC.Precision  METRIC.Frac_NA  METRIC.F1_Score  TRUTH.TOTAL.TiTv_ratio  QUERY.TOTAL.TiTv_ratio  TRUTH.TOTAL.het_hom_ratio  QUERY.TOTAL.het_hom_ratio
INDEL    ALL         1051      1024        27         1485         8        430      5      2       0.974310          0.992417        0.289562         0.983280                     NaN                     NaN                   1.747283                   1.796935
INDEL   PASS         1051      1024        27         1485         8        430      5      2       0.974310          0.992417        0.289562         0.983280                     NaN                     NaN                   1.747283                   1.796935
  SNP    ALL        25279     24983       296        27709        60       2665     36      4       0.988291          0.997604        0.096178         0.992926                2.854703                2.761297                   1.623027                   1.628821
  SNP   PASS        25279     24983       296        27709        60       2665     36      4       0.988291          0.997604        0.096178         0.992926                2.854703                2.761297                   1.623027                   1.628821

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

DeepVariant whole exome sequencing (WES) case study

Prepare environment

Tools

Download Reference

Download Genome in a Bottle Benchmarks

Download HG003 BAM

Download capture target BED file

Running on a CPU-only machine

Benchmark on all chromosomes

FilesExpand file tree

deepvariant-exome-case-study.md

Latest commit

History

deepvariant-exome-case-study.md

File metadata and controls

DeepVariant whole exome sequencing (WES) case study

Prepare environment

Tools

Download Reference

Download Genome in a Bottle Benchmarks

Download HG003 BAM

Download capture target BED file

Running on a CPU-only machine

Benchmark on all chromosomes