Commit 5b37230

pichuan authored and copybara-github committed
Update 2 remaining docs for v1.4.0 release.
PiperOrigin-RevId: 452458249
1 parent 7b43b45 commit 5b37230

File tree: 2 files changed (+44, −39 lines)

docs/deepvariant-training-case-study.md

Lines changed: 32 additions & 27 deletions
@@ -20,8 +20,8 @@ We demonstrated that by training on 1 replicate of BGISEQ-500 whole genome data
 (everything except for chromosome 20-22), we can significantly improve the
 accuracy comparing to the WGS model as a baseline:
 
-* Indel F1 :93.4908% --> 98.0862%;
-* SNP F1: 99.8838% --> 99.8943%.
+* Indel F1 :93.4908% --> 98.1305%;
+* SNP F1: 99.8838% --> 99.9011%.
 
 Training for 50,000 steps took about 1.5 hours on 1 GPU. Currently we cannot
 train on multiple GPUs.
@@ -40,7 +40,7 @@ YOUR_PROJECT=REPLACE_WITH_YOUR_PROJECT
 OUTPUT_GCS_BUCKET=REPLACE_WITH_YOUR_GCS_BUCKET
 
 BUCKET="gs://deepvariant"
-BIN_VERSION="1.3.0"
+BIN_VERSION="1.4.0"
 
 MODEL_BUCKET="${BUCKET}/models/DeepVariant/${BIN_VERSION}/DeepVariant-inception_v3-${BIN_VERSION}+data-wgs_standard"
 GCS_PRETRAINED_WGS_MODEL="${MODEL_BUCKET}/model.ckpt"
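For reviewers checking the version bump, the derived GCS paths expand as follows. This is a small Python sketch mirroring the shell parameter expansion above; the bucket layout is taken from the doc, not verified against GCS:

```python
# Variables from the updated setup block (v1.4.0).
BUCKET = "gs://deepvariant"
BIN_VERSION = "1.4.0"

# Mirrors the shell expansion of MODEL_BUCKET and GCS_PRETRAINED_WGS_MODEL.
MODEL_BUCKET = (
    f"{BUCKET}/models/DeepVariant/{BIN_VERSION}/"
    f"DeepVariant-inception_v3-{BIN_VERSION}+data-wgs_standard"
)
GCS_PRETRAINED_WGS_MODEL = f"{MODEL_BUCKET}/model.ckpt"

print(GCS_PRETRAINED_WGS_MODEL)
# -> gs://deepvariant/models/DeepVariant/1.4.0/DeepVariant-inception_v3-1.4.0+data-wgs_standard/model.ckpt
```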
@@ -94,7 +94,7 @@ gunzip "${DATA_DIR}/ucsc_hg19.fa.gz"
 ```
 sudo apt -y update
 sudo apt -y install parallel
-curl -O https://raw.githubusercontent.com/google/deepvariant/r1.3/scripts/install_nvidia_docker.sh
+curl -O https://raw.githubusercontent.com/google/deepvariant/r1.4/scripts/install_nvidia_docker.sh
 bash -x install_nvidia_docker.sh
 ```

@@ -155,8 +155,12 @@ Starting in v1.4.0, we added an extra channel in our WGS setting using the
 
 ```
 $ cat "${OUTPUT_DIR}/training_set.with_label.tfrecord-00000-of-00016.gz.example_info.json"
-{"version": "1.3.0", "shape": [100, 221, 7], "channels": [1, 2, 3, 4, 5, 6, 19]}
-````
+{"version": "1.4.0", "shape": [100, 221, 7], "channels": [1, 2, 3, 4, 5, 6, 19]}
+```
+
+Depending on your data type, you might want to tweak the flags for the
+`make_examples` step, which can result in different shape of the output
+examples.
 
 We will want to shuffle this on Dataflow later, so we copy the data to GCS
 bucket first:
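The shape/channels relationship in that metadata can be sanity-checked programmatically. A minimal sketch using only the JSON string shown in the diff (stdlib `json`; no DeepVariant install assumed):

```python
import json

# The metadata printed in the case study for v1.4.0 (channel 19 is the new one).
example_info = (
    '{"version": "1.4.0", "shape": [100, 221, 7],'
    ' "channels": [1, 2, 3, 4, 5, 6, 19]}'
)

info = json.loads(example_info)
height, width, num_channels = info["shape"]

# The last dimension of the pileup-image shape should equal the channel count.
assert num_channels == len(info["channels"])
print(f"{info['version']}: {height}x{width} pileups, {num_channels} channels")
```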
@@ -230,7 +234,7 @@ Then, get the code that shuffles:
 
 ```
 mkdir -p ${SHUFFLE_SCRIPT_DIR}
-wget https://raw.githubusercontent.com/google/deepvariant/r1.3/tools/shuffle_tfrecords_beam.py -O ${SHUFFLE_SCRIPT_DIR}/shuffle_tfrecords_beam.py
+wget https://raw.githubusercontent.com/google/deepvariant/r1.4/tools/shuffle_tfrecords_beam.py -O ${SHUFFLE_SCRIPT_DIR}/shuffle_tfrecords_beam.py
 ```
 
 Next, we shuffle the data using DataflowRunner. Before that, please make sure
@@ -281,13 +285,13 @@ In the output, the `tfrecord_path` should be valid paths in gs://.
 
 ```
 # Generated by shuffle_tfrecords_beam.py
+# class2: 124564
+# class1: 173668
+# class0: 44526
 #
 # --input_pattern_list=OUTPUT_BUCKET/training_set.with_label.tfrecord-?????-of-00016.gz
 # --output_pattern_prefix=OUTPUT_BUCKET/training_set.with_label.shuffled
 #
-# class2: 124564
-# class1: 173668
-# class0: 44526
 
 name: "HG001"
 tfrecord_path: "OUTPUT_GCS_BUCKET/training_set.with_label.shuffled-?????-of-?????.tfrecord.gz"
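The class counts the shuffler writes into these header comments are handy for eyeballing class balance. A small sketch that parses them with a regex; the header string is copied from the training-set config above, and the comment format is assumed stable across versions:

```python
import re

# Header comments emitted by shuffle_tfrecords_beam.py (training set above).
header = """\
# Generated by shuffle_tfrecords_beam.py
# class2: 124564
# class1: 173668
# class0: 44526
"""

counts = {int(c): int(n) for c, n in re.findall(r"# class(\d+): (\d+)", header)}
total = sum(counts.values())
for cls in sorted(counts):
    print(f"class{cls}: {counts[cls]} ({counts[cls] / total:.1%})")
```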
@@ -319,13 +323,13 @@ cat "${OUTPUT_DIR}/validation_set.dataset_config.pbtxt"
 
 ```
 # Generated by shuffle_tfrecords_beam.py
+# class0: 5595
+# class1: 31852
+# class2: 21954
 #
 # --input_pattern_list=OUTPUT_DIR/validation_set.with_label.tfrecord-?????-of-00016.gz
 # --output_pattern_prefix=OUTPUT_DIR/validation_set.with_label.shuffled
 #
-# class2: 21954
-# class1: 31852
-# class0: 5595
 
 name: "HG001"
 tfrecord_path: "OUTPUT_DIR/validation_set.with_label.shuffled-?????-of-?????.tfrecord.gz"
@@ -425,7 +429,7 @@ gsutil cat "${TRAINING_DIR}"/best_checkpoint.txt
 ```
 
 In my run, this showed that the model checkpoint that performs the best on the
-validation set was `${TRAINING_DIR}/model.ckpt-50000`.
+validation set was `${TRAINING_DIR}/model.ckpt-33739`.
 
 It's possible that training more steps can result in better accuracy. For now
 let's use this model to do the final evaluation on the test set and see how we
@@ -437,17 +441,17 @@ sudo docker run --gpus 1 \
   google/deepvariant:"${BIN_VERSION}-gpu" \
   /opt/deepvariant/bin/run_deepvariant \
   --model_type WGS \
-  --customized_model "${TRAINING_DIR}/model.ckpt-50000" \
+  --customized_model "${TRAINING_DIR}/model.ckpt-33739" \
   --ref "${REF}" \
   --reads "${BAM_CHR20}" \
   --regions "chr20" \
   --output_vcf "${OUTPUT_DIR}/test_set.vcf.gz" \
   --num_shards=${N_SHARDS}
 ```
 
-Note that in v1.4.0, by using `--model_type WGS`, it will automatically add
-`insert_size` as an extra channel. So we don't need to add it in
-`--make_examples_extra_args`.
+In v1.4.0, by using `--model_type WGS`, `run_deepvariant` will
+automatically add `insert_size` as an extra channel in the `make_examples` step.
+So we don't need to add it in `--make_examples_extra_args`.
 
 Once this is done, we have the final callset in VCF
 format here: `${OUTPUT_DIR}/test_set.vcf.gz`. Next step is to run `hap.py` to
@@ -475,24 +479,25 @@ The output of `hap.py` is here:
 ```
 [I] Total VCF records: 3775119
 [I] Non-reference VCF records: 3775119
-[W] overlapping records at chr20:11311221 for sample 0
-[W] Variants that overlap on the reference allele: 1
+[W] overlapping records at chr20:35754687 for sample 0
+[W] Variants that overlap on the reference allele: 3
 [I] Total VCF records: 132914
-[I] Non-reference VCF records: 96625
+[I] Non-reference VCF records: 96580
+2022-06-02 00:36:08,582 WARNING Creating template for vcfeval. You can speed this up by supplying a SDF template that corresponds to /home/pichuan_google_com/training-case-study/input/data/ucsc_hg19.fa
 Benchmarking Summary:
 Type Filter TRUTH.TOTAL TRUTH.TP TRUTH.FN QUERY.TOTAL QUERY.FP QUERY.UNK FP.gt FP.al METRIC.Recall METRIC.Precision METRIC.Frac_NA METRIC.F1_Score TRUTH.TOTAL.TiTv_ratio QUERY.TOTAL.TiTv_ratio TRUTH.TOTAL.het_hom_ratio QUERY.TOTAL.het_hom_ratio
-INDEL ALL 10023 9801 222 19307 167 8940 109 34 0.977851 0.983891 0.463044 0.980862 NaN NaN 1.547658 2.058546
-INDEL PASS 10023 9801 222 19307 167 8940 109 34 0.977851 0.983891 0.463044 0.980862 NaN NaN 1.547658 2.058546
-SNP ALL 66237 66156 81 78318 59 12065 16 2 0.998777 0.999109 0.154051 0.998943 2.284397 2.199583 1.700387 1.797428
-SNP PASS 66237 66156 81 78318 59 12065 16 2 0.998777 0.999109 0.154051 0.998943 2.284397 2.199583 1.700387 1.797428
+INDEL ALL 10023 9806 217 19266 163 8898 107 33 0.978350 0.984279 0.461850 0.981305 NaN NaN 1.547658 2.046311
+INDEL PASS 10023 9806 217 19266 163 8898 107 33 0.978350 0.984279 0.461850 0.981305 NaN NaN 1.547658 2.046311
+SNP ALL 66237 66160 77 78315 54 12065 15 4 0.998838 0.999185 0.154057 0.999011 2.284397 2.200204 1.700387 1.798656
+SNP PASS 66237 66160 77 78315 54 12065 15 4 0.998838 0.999185 0.154057 0.999011 2.284397 2.200204 1.700387 1.798656
 ```
 
 To summarize, the accuracy is:
 
 Type | # FN | # FP | Recall | Precision | F1\_Score
 ----- | ---- | ---- | -------- | --------- | ---------
-INDEL | 222 | 167 | 0.977851 | 0.983891 | 0.980862
-SNP | 81 | 59 | 0.998777 | 0.999109 | 0.998943
+INDEL | 217 | 163 | 0.978350 | 0.984279 | 0.981305
+SNP | 77 | 54 | 0.998838 | 0.999185 | 0.999011
 
 The baseline we're comparing to is to directly use the WGS model to make the
 calls, using this command:
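The F1 scores in these tables are the harmonic mean of recall and precision, so the updated numbers can be cross-checked directly. A quick sketch using the values from the summary above (hap.py computes from unrounded counts, so agreement is only to about four decimal places):

```python
def f1(recall: float, precision: float) -> float:
    """Harmonic mean of recall and precision."""
    return 2 * recall * precision / (recall + precision)

# Recall/precision pairs from the hap.py summary above.
indel_f1 = f1(0.978350, 0.984279)
snp_f1 = f1(0.998838, 0.999185)

# Both agree with the reported F1 scores (0.981305 and 0.999011)
# to within rounding of the inputs.
print(round(indel_f1, 4), round(snp_f1, 4))
```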

docs/metrics-deeptrio.md

Lines changed: 12 additions & 12 deletions
@@ -8,12 +8,12 @@ Runtime is on HG002/HG003/HG004 (all chromosomes).
 
 Stage                            | Wall time (minutes)
 -------------------------------- | -----------------
-make_examples                    | ~501m
-call_variants for HG002          | ~348m
-call_variants for HG003          | ~344m
-call_variants for HG004          | ~344m
-postprocess_variants (parallel)  | ~67m
-total                            | ~1604m = ~26.73 hours
+make_examples                    | ~508m
+call_variants for HG002          | ~341m
+call_variants for HG003          | ~349m
+call_variants for HG004          | ~347m
+postprocess_variants (parallel)  | ~65m
+total                            | ~1610m = ~26.83 hours
 
 ### Accuracy
 
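The `total` row should equal the sum of the stage rows. A quick arithmetic check on the updated numbers in the first runtime table (stage names copied from the table; minutes only, the `~` qualifiers dropped):

```python
# Updated per-stage wall times (minutes) from the first runtime table.
stages = {
    "make_examples": 508,
    "call_variants for HG002": 341,
    "call_variants for HG003": 349,
    "call_variants for HG004": 347,
    "postprocess_variants (parallel)": 65,
}

total = sum(stages.values())
print(f"total | ~{total}m = ~{total / 60:.2f} hours")
# -> total | ~1610m = ~26.83 hours
```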
@@ -49,12 +49,12 @@ Runtime is on HG002/HG003/HG004 (all chromosomes).
 
 Stage                            | Wall time (minutes)
 -------------------------------- | -------------------
-make_examples                    | ~798m
-call_variants for HG002          | ~261m
-call_variants for HG003          | ~264m
-call_variants for HG004          | ~263m
+make_examples                    | ~829m
+call_variants for HG002          | ~263m
+call_variants for HG003          | ~266m
+call_variants for HG004          | ~268m
 postprocess_variants (parallel)  | ~78m
-total                            | ~1664m = ~27.73 hours
+total                            | ~1704m = ~28.4 hours
 
 ### Accuracy
 
@@ -137,7 +137,7 @@ Use `gcloud compute ssh` to log in to the newly created instance.
 Download and run any of the following case study scripts:
 
 ```
-curl -O https://raw.githubusercontent.com/google/deepvariant/r1.3/scripts/inference_deeptrio.sh
+curl -O https://raw.githubusercontent.com/google/deepvariant/r1.4/scripts/inference_deeptrio.sh
 
 # WGS
 bash inference_deeptrio.sh --model_preset WGS
