@@ -20,8 +20,8 @@ We demonstrated that by training on 1 replicate of BGISEQ-500 whole genome data
 (everything except for chromosome 20-22), we can significantly improve the
 accuracy compared to the WGS model as a baseline:
 
-* Indel F1: 93.4908% --> 98.0862%;
-* SNP F1: 99.8838% --> 99.8943%.
+* Indel F1: 93.4908% --> 98.1305%;
+* SNP F1: 99.8838% --> 99.9011%.
 
 Training for 50,000 steps took about 1.5 hours on 1 GPU. Currently we cannot
 train on multiple GPUs.
@@ -40,7 +40,7 @@ YOUR_PROJECT=REPLACE_WITH_YOUR_PROJECT
 OUTPUT_GCS_BUCKET=REPLACE_WITH_YOUR_GCS_BUCKET
 
 BUCKET="gs://deepvariant"
-BIN_VERSION="1.3.0"
+BIN_VERSION="1.4.0"
 
 MODEL_BUCKET="${BUCKET}/models/DeepVariant/${BIN_VERSION}/DeepVariant-inception_v3-${BIN_VERSION}+data-wgs_standard"
 GCS_PRETRAINED_WGS_MODEL="${MODEL_BUCKET}/model.ckpt"
@@ -94,7 +94,7 @@ gunzip "${DATA_DIR}/ucsc_hg19.fa.gz"
 ```
 sudo apt -y update
 sudo apt -y install parallel
-curl -O https://raw.githubusercontent.com/google/deepvariant/r1.3/scripts/install_nvidia_docker.sh
+curl -O https://raw.githubusercontent.com/google/deepvariant/r1.4/scripts/install_nvidia_docker.sh
 bash -x install_nvidia_docker.sh
 ```
 
@@ -155,8 +155,12 @@ Starting in v1.4.0, we added an extra channel in our WGS setting using the
 
 ```
 $ cat "${OUTPUT_DIR}/training_set.with_label.tfrecord-00000-of-00016.gz.example_info.json"
-{"version": "1.3.0", "shape": [100, 221, 7], "channels": [1, 2, 3, 4, 5, 6, 19]}
-````
+{"version": "1.4.0", "shape": [100, 221, 7], "channels": [1, 2, 3, 4, 5, 6, 19]}
+```
+
+Depending on your data type, you might want to tweak the flags for the
+`make_examples` step, which can result in a different shape for the output
+examples.
 
 We will want to shuffle this on Dataflow later, so we copy the data to a GCS
 bucket first:
@@ -230,7 +234,7 @@ Then, get the code that shuffles:
 
 ```
 mkdir -p ${SHUFFLE_SCRIPT_DIR}
-wget https://raw.githubusercontent.com/google/deepvariant/r1.3/tools/shuffle_tfrecords_beam.py -O ${SHUFFLE_SCRIPT_DIR}/shuffle_tfrecords_beam.py
+wget https://raw.githubusercontent.com/google/deepvariant/r1.4/tools/shuffle_tfrecords_beam.py -O ${SHUFFLE_SCRIPT_DIR}/shuffle_tfrecords_beam.py
 ```
 
 Next, we shuffle the data using DataflowRunner. Before that, please make sure
@@ -281,13 +285,13 @@ In the output, the `tfrecord_path` should be valid paths in gs://.
 
 ```
 # Generated by shuffle_tfrecords_beam.py
+# class2: 124564
+# class1: 173668
+# class0: 44526
 #
 # --input_pattern_list=OUTPUT_BUCKET/training_set.with_label.tfrecord-?????-of-00016.gz
 # --output_pattern_prefix=OUTPUT_BUCKET/training_set.with_label.shuffled
 #
-# class2: 124564
-# class1: 173668
-# class0: 44526
 
 name: "HG001"
 tfrecord_path: "OUTPUT_GCS_BUCKET/training_set.with_label.shuffled-?????-of-?????.tfrecord.gz"
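The `# classN` header comments written by `shuffle_tfrecords_beam.py` are easy to tally with a short script. A sketch using the training-set counts from the hunk above; that class 0/1/2 correspond to DeepVariant's three genotype classes is background knowledge, not stated in this doc:

```python
# Per-class counts as written by shuffle_tfrecords_beam.py (copied from above).
header = """\
# Generated by shuffle_tfrecords_beam.py
# class2: 124564
# class1: 173668
# class0: 44526
"""

counts = {}
for line in header.splitlines():
    if line.startswith("# class"):
        # Strip the leading "# " and split "classN: count" into key and value.
        name, value = line[2:].split(": ")
        counts[name] = int(value)

total = sum(counts.values())
print(counts, total)
```

Summing the classes gives the total number of labeled training examples, which is a quick way to confirm the shuffle did not drop records.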
@@ -319,13 +323,13 @@ cat "${OUTPUT_DIR}/validation_set.dataset_config.pbtxt"
 
 ```
 # Generated by shuffle_tfrecords_beam.py
+# class0: 5595
+# class1: 31852
+# class2: 21954
 #
 # --input_pattern_list=OUTPUT_DIR/validation_set.with_label.tfrecord-?????-of-00016.gz
 # --output_pattern_prefix=OUTPUT_DIR/validation_set.with_label.shuffled
 #
-# class2: 21954
-# class1: 31852
-# class0: 5595
 
 name: "HG001"
 tfrecord_path: "OUTPUT_DIR/validation_set.with_label.shuffled-?????-of-?????.tfrecord.gz"
@@ -425,7 +429,7 @@ gsutil cat "${TRAINING_DIR}"/best_checkpoint.txt
 ```
 
 In my run, this showed that the model checkpoint that performs the best on the
-validation set was `${TRAINING_DIR}/model.ckpt-50000`.
+validation set was `${TRAINING_DIR}/model.ckpt-33739`.
 
 It's possible that training for more steps can result in better accuracy. For now
 let's use this model to do the final evaluation on the test set and see how we
@@ -437,17 +441,17 @@ sudo docker run --gpus 1 \
   google/deepvariant:"${BIN_VERSION}-gpu" \
   /opt/deepvariant/bin/run_deepvariant \
   --model_type WGS \
-  --customized_model "${TRAINING_DIR}/model.ckpt-50000" \
+  --customized_model "${TRAINING_DIR}/model.ckpt-33739" \
   --ref "${REF}" \
   --reads "${BAM_CHR20}" \
   --regions "chr20" \
   --output_vcf "${OUTPUT_DIR}/test_set.vcf.gz" \
   --num_shards=${N_SHARDS}
 ```
 
-Note that in v1.4.0, by using `--model_type WGS`, it will automatically add
-`insert_size` as an extra channel. So we don't need to add it in
-`--make_examples_extra_args`.
+In v1.4.0, by using `--model_type WGS`, `run_deepvariant` will automatically
+add `insert_size` as an extra channel in the `make_examples` step, so we don't
+need to add it in `--make_examples_extra_args`.
 
 Once this is done, we have the final callset in VCF
 format here: `${OUTPUT_DIR}/test_set.vcf.gz`. The next step is to run `hap.py` to
@@ -475,24 +479,25 @@ The output of `hap.py` is here:
 ```
 [I] Total VCF records: 3775119
 [I] Non-reference VCF records: 3775119
-[W] overlapping records at chr20:11311221 for sample 0
-[W] Variants that overlap on the reference allele: 1
+[W] overlapping records at chr20:35754687 for sample 0
+[W] Variants that overlap on the reference allele: 3
 [I] Total VCF records: 132914
-[I] Non-reference VCF records: 96625
+[I] Non-reference VCF records: 96580
+2022-06-02 00:36:08,582 WARNING Creating template for vcfeval. You can speed this up by supplying a SDF template that corresponds to /home/pichuan_google_com/training-case-study/input/data/ucsc_hg19.fa
 Benchmarking Summary:
 Type Filter TRUTH.TOTAL TRUTH.TP TRUTH.FN QUERY.TOTAL QUERY.FP QUERY.UNK FP.gt FP.al METRIC.Recall METRIC.Precision METRIC.Frac_NA METRIC.F1_Score TRUTH.TOTAL.TiTv_ratio QUERY.TOTAL.TiTv_ratio TRUTH.TOTAL.het_hom_ratio QUERY.TOTAL.het_hom_ratio
-INDEL ALL 10023 9801 222 19307 167 8940 109 34 0.977851 0.983891 0.463044 0.980862 NaN NaN 1.547658 2.058546
-INDEL PASS 10023 9801 222 19307 167 8940 109 34 0.977851 0.983891 0.463044 0.980862 NaN NaN 1.547658 2.058546
-SNP ALL 66237 66156 81 78318 59 12065 16 2 0.998777 0.999109 0.154051 0.998943 2.284397 2.199583 1.700387 1.797428
-SNP PASS 66237 66156 81 78318 59 12065 16 2 0.998777 0.999109 0.154051 0.998943 2.284397 2.199583 1.700387 1.797428
+INDEL ALL 10023 9806 217 19266 163 8898 107 33 0.978350 0.984279 0.461850 0.981305 NaN NaN 1.547658 2.046311
+INDEL PASS 10023 9806 217 19266 163 8898 107 33 0.978350 0.984279 0.461850 0.981305 NaN NaN 1.547658 2.046311
+SNP ALL 66237 66160 77 78315 54 12065 15 4 0.998838 0.999185 0.154057 0.999011 2.284397 2.200204 1.700387 1.798656
+SNP PASS 66237 66160 77 78315 54 12065 15 4 0.998838 0.999185 0.154057 0.999011 2.284397 2.200204 1.700387 1.798656
 ```
 
 To summarize, the accuracy is:
 
 Type | # FN | # FP | Recall | Precision | F1\_Score
 ----- | ---- | ---- | -------- | --------- | ---------
-INDEL | 222 | 167 | 0.977851 | 0.983891 | 0.980862
-SNP | 81 | 59 | 0.998777 | 0.999109 | 0.998943
+INDEL | 217 | 163 | 0.978350 | 0.984279 | 0.981305
+SNP | 77 | 54 | 0.998838 | 0.999185 | 0.999011
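The F1 scores in the summary are simply the harmonic mean of the reported recall and precision, which is easy to verify from the table. A quick check using the updated INDEL and SNP numbers:

```python
# F1 score as the harmonic mean of recall and precision
# (values copied from the hap.py summary above).
def f1(recall, precision):
    return 2 * recall * precision / (recall + precision)

indel_f1 = f1(0.978350, 0.984279)
snp_f1 = f1(0.998838, 0.999185)
# These agree with the reported F1_Score values (0.981305 and 0.999011)
# up to rounding of the inputs.
print(indel_f1, snp_f1)
```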
 
 The baseline we're comparing to is to directly use the WGS model to make the
 calls, using this command: