This document walks through an example of running the DeepVariant Fast Pipeline with PacBio data.

Fast Pipeline is a DeepVariant feature that parallelizes the `make_examples` and `call_variants` stages. It is especially useful on machines with a GPU: examples are streamed directly into `call_variants` inference, so the CPU and GPU are utilized simultaneously. Note that this feature is still experimental.
This setup requires a machine with a GPU. For this case study, we will use an
n1-standard-16 compute instance with 1 NVIDIA P4 GPU. However, this setup is
not optimal, as 16 cores may not be sufficient to fully utilize the GPU. In a
real-life scenario, allocating 32 cores for `make_examples` would ensure better
GPU utilization and improved runtime.
Here we create a Google Cloud compute instance. You may skip this step if you are running the case study on a local machine with a GPU.
```bash
gcloud compute instances create "deepvariant-fast-pipeline" \
  --scopes "compute-rw,storage-full,cloud-platform" \
  --maintenance-policy "TERMINATE" \
  --accelerator=type=nvidia-tesla-p4,count=1 \
  --image-family "ubuntu-2204-lts" \
  --image-project "ubuntu-os-cloud" \
  --machine-type "n1-standard-16" \
  --boot-disk-size "100" \
  --zone "us-central1-a"
```

You can then ssh into the machine by running:

```bash
gcloud compute ssh "deepvariant-fast-pipeline" --zone us-central1-a
```

CUDA drivers and the NVIDIA Container Toolkit are required to run the case study. Please refer to the following documentation for more details: the NVIDIA CUDA Installation Guide for Linux (or Install GPU drivers for installation on GCP), and Installing the NVIDIA Container Toolkit.
```bash
BIN_VERSION="1.10.0"
sudo docker pull google/deepvariant:"${BIN_VERSION}-gpu"
```

Before you start running, you need to have the following input files:
- A reference genome in [FASTA] format and its corresponding index file (.fai).

```bash
mkdir -p reference
gcloud storage cp gs://deepvariant/case-study-testdata/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna* reference/
```

- An aligned reads file in [BAM] format and its corresponding index file (.bai). You get this by aligning the reads from a sequencing instrument, using an aligner like [BWA] for example.

```bash
mkdir -p input
gcloud storage cp gs://deepvariant/pacbio-case-study-testdata/HG003.SPRQ.pacbio.GRCh38.nov2024.chr20.bam* input/
```

The `fast_pipeline` binary in the DeepVariant Docker image runs the `make_examples` and `call_variants` stages of DeepVariant in streaming mode. Below is the command line to run `fast_pipeline`.
The config files below contain all input data parameters for DeepVariant. All other
model-specific parameters are applied automatically from
`model.example_info.json`.

The `--examples` and `--gvcf` flags are set with sharded file names. It is
important that the number of shards matches across all config files and the
`--num_shards` flag of the `fast_pipeline` binary; in our case it is set to 14.

The machine has 16 virtual cores, but we set the number of shards to 14 to reserve 2 cores for the input pipeline in `call_variants`. Insufficient CPU resources for the inference pipeline can cause an input bottleneck, slowing down the inference stage.
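If you run on a machine with a different core count, the shard count can be derived rather than hard-coded. A minimal sketch, assuming the same 2-core reservation described above (the variable names here are our own):

```shell
# Reserve 2 cores for the call_variants input pipeline;
# the remaining cores become make_examples shards.
TOTAL_CORES=$(nproc)             # e.g. 16 on an n1-standard-16
RESERVED_CORES=2
NUM_SHARDS=$((TOTAL_CORES - RESERVED_CORES))
echo "num_shards=${NUM_SHARDS}"  # e.g. num_shards=14
```

Remember to use the same value in the `@N` sharded file names, `--cpus`, and `--num_shards`.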
```bash
mkdir -p config
FILE=config/make_examples.ini
cat <<EOM >$FILE
--examples=/tmp/examples.tfrecords@14.gz
--gvcf=/tmp/examples.gvcf.tfrecord@14.gz
--mode=calling
--reads=/input/HG003.SPRQ.pacbio.GRCh38.nov2024.chr20.bam
--ref=/reference/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz
--output_phase_info
--checkpoint=/opt/models/pacbio
--regions=chr20
EOM

FILE=config/call_variants.ini
cat <<EOM >$FILE
--outfile=/output/case_study.cvo.tfrecord.gz
--checkpoint=/opt/models/pacbio
--batch_size=1024
--writer_threads=1
EOM

FILE=config/postprocess_variants.ini
cat <<EOM >$FILE
--ref=/reference/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz
--infile=/output/case_study.cvo.tfrecord.gz
--nonvariant_site_tfrecord_path=/tmp/examples.gvcf.tfrecord@14.gz
--outfile=/output/variants.chr20.vcf.gz
--gvcf_outfile=/output/variants.gvcf.chr20.vcf.gz
--small_model_cvo_records=/tmp/examples_call_variant_outputs.tfrecords@14.gz
--cpus=14
EOM
```

```bash
time sudo docker run \
  -v "${PWD}/config":"/config" \
  -v "${PWD}/input":"/input" \
  -v "${PWD}/output":"/output" \
  -v "${PWD}/reference":"/reference" \
  -v /tmp:/tmp \
  --gpus all \
  -e DV_BIN_PATH=/opt/deepvariant/bin \
  --shm-size=2gb \
  google/deepvariant:"${BIN_VERSION}-gpu" \
  /opt/deepvariant/bin/fast_pipeline \
  --make_example_flags /config/make_examples.ini \
  --call_variants_flags /config/call_variants.ini \
  --postprocess_variants_flags /config/postprocess_variants.ini \
  --shm_prefix dv \
  --num_shards 14 \
  --buffer_size 10485760 \
  2>&1 | tee /tmp/fast_pipeline.docker.log
```

Docker flags:

- `-v` maps a local directory into the Docker container.
- `-e` sets the `DV_BIN_PATH` environment variable to point to the DeepVariant binaries directory inside the container.
- `--shm-size` sets the size of shared memory available to the container. It has to be larger than `--buffer_size` × `--num_shards`. In our case the buffer size is 10 MiB and we run 14 shards, so 2 GB is large enough to accommodate the buffers and all synchronization objects for each shard.

`fast_pipeline` flags:

- `--make_example_flags` - path to the file containing `make_examples` command line parameters.
- `--call_variants_flags` - path to the file containing `call_variants` command line parameters.
- `--postprocess_variants_flags` - path to the file containing `postprocess_variants` command line parameters.
- `--shm_prefix` - prefix for the shared memory files; an arbitrary name.
- `--num_shards` - number of parallel `make_examples` processes.
- `--buffer_size` - shared memory buffer size for each process.
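The shared-memory requirement can be sanity-checked with quick arithmetic (numbers taken from this case study):

```shell
BUFFER_SIZE=10485760   # --buffer_size: 10 MiB per shard
NUM_SHARDS=14          # --num_shards
# Total buffer memory that must fit, with headroom for
# synchronization objects, inside --shm-size.
TOTAL_BYTES=$((BUFFER_SIZE * NUM_SHARDS))
echo "$((TOTAL_BYTES / 1024 / 1024)) MiB"   # 140 MiB, well under 2 GB
```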
On successful completion, the output directory will contain two VCF files:

- `variants.chr20.vcf.gz`
- `variants.gvcf.chr20.vcf.gz`

With these settings the pipeline takes approximately 10 minutes:

```
real 10m23.879s
user 0m0.041s
sys 0m0.074s
```
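As a quick sanity check of the output (assuming `zcat` is available on the host; the exact record count will vary):

```shell
# Count the variant records (non-header lines) in the output VCF.
zcat output/variants.chr20.vcf.gz | grep -vc '^#'
```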
Download benchmark data:
```bash
mkdir -p benchmark
FTPDIR=ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/release/AshkenazimTrio/HG003_NA24149_father/NISTv4.2.1/GRCh38
curl ${FTPDIR}/HG003_GRCh38_1_22_v4.2.1_benchmark_noinconsistent.bed > benchmark/HG003_GRCh38_1_22_v4.2.1_benchmark_noinconsistent.bed
curl ${FTPDIR}/HG003_GRCh38_1_22_v4.2.1_benchmark.vcf.gz > benchmark/HG003_GRCh38_1_22_v4.2.1_benchmark.vcf.gz
curl ${FTPDIR}/HG003_GRCh38_1_22_v4.2.1_benchmark.vcf.gz.tbi > benchmark/HG003_GRCh38_1_22_v4.2.1_benchmark.vcf.gz.tbi
```

Run hap.py to evaluate the results against the truth set:

```bash
HAPPY_VERSION=v0.3.12
time sudo docker run \
  -v ${PWD}/output:/output \
  -v ${PWD}/benchmark:/benchmark \
  -v ${PWD}/reference:/reference \
  jmcdani20/hap.py:${HAPPY_VERSION} \
  /opt/hap.py/bin/hap.py \
  /benchmark/HG003_GRCh38_1_22_v4.2.1_benchmark.vcf.gz \
  /output/variants.chr20.vcf.gz \
  -f /benchmark/HG003_GRCh38_1_22_v4.2.1_benchmark_noinconsistent.bed \
  -r /reference/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna \
  -o /output/happy.output \
  --engine=vcfeval \
  --pass-only \
  -l "chr20"
```
Benchmarking Summary:

```
Type  Filter  TRUTH.TOTAL  TRUTH.TP  TRUTH.FN  QUERY.TOTAL  QUERY.FP  QUERY.UNK  FP.gt  FP.al  METRIC.Recall  METRIC.Precision  METRIC.Frac_NA  METRIC.F1_Score  TRUTH.TOTAL.TiTv_ratio  QUERY.TOTAL.TiTv_ratio  TRUTH.TOTAL.het_hom_ratio  QUERY.TOTAL.het_hom_ratio
INDEL ALL     10628        10561     67        22717        70        11671      31     30     0.993696       0.993663          0.513756        0.993679         NaN                     NaN                     1.748961                   2.174472
INDEL PASS    10628        10561     67        22717        70        11671      31     30     0.993696       0.993663          0.513756        0.993679         NaN                     NaN                     1.748961                   2.174472
SNP   ALL     70166        70106     60        103051       60        32792      7      5      0.999145       0.999146          0.318211        0.999145         2.296566                1.720961                1.883951                   1.409186
SNP   PASS    70166        70106     60        103051       60        32792      7      5      0.999145       0.999146          0.318211        0.999145         2.296566                1.720961                1.883951                   1.409186
```