tutorials-middle/episodes/severity-static.Rmd at eb25cc0a09dac2a8d3408e60d61c0b03aa9654b6 · epiverse-trace/tutorials-middle · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
---
title: 'Estimation of outbreak severity'
teaching: 10
exercises: 2
editor_options:
  chunk_output_type: inline
---

:::::::::::::::::::::::::::::::::::::: questions

- Why do we estimate the clinical severity of an epidemic?

- How can the Case Fatality Risk (CFR) be estimated early in an ongoing
epidemic?

::::::::::::::::::::::::::::::::::::::::::::::::

::::::::::::::::::::::::::::::::::::: objectives

- Estimate the CFR from aggregated case data using `{cfr}`.

- Estimate a delay-adjusted CFR using `{epiparameter}` and `{cfr}`.

- Estimate a delay-adjusted severity for an expanding time series using `{cfr}`.

::::::::::::::::::::::::::::::::::::::::::::::::

::::::::::::::::::::::::::::::::::::: prereq

## Prerequisites

This episode requires you to be familiar with:

**Data science**: Basic programming with R.

**Epidemic theory**: [Delay distributions](../learners/reference.md#delaydist).

:::::::::::::::::::::::::::::::::

## Introduction

Common questions at the early stage of an epidemic include:

- What is the likely public health impact of the outbreak in terms of clinical severity?
- What are the most severely affected groups?
- Does the outbreak have the potential to cause a very severe pandemic?

We can assess the pandemic potential of an epidemic with two critical measurements: the transmissibility and the clinical severity
([Fraser et al., 2009](https://www.science.org/doi/full/10.1126/science.1176062),
[CDC, 2016](https://www.cdc.gov/flu/pandemic-resources/national-strategy/severity-assessment-framework-508.html)).

![HHS Pandemic Planning Scenarios based on the Pandemic Severity Assessment Framework. This uses a combined measure of clinical severity and transmissibility to characterise influenza pandemic scenarios. **HHS**: United States Department of Health and Human Services ([CDC, 2016](https://www.cdc.gov/flu/pandemic-resources/national-strategy/severity-assessment-framework-508.html)).](fig/cfr-hhs-scenarios-psaf.png){alt='The horizontal axis is the scaled measure of clinical severity, ranging from 1 to 7, where 1 is low, 4 is moderate, and 7 is very severe. The vertical axis is the scaled
measure of transmissibility, ranging from 1 to 5, where 1 is low, 3 is moderate, and 5 is highly transmissible. On the graph, HHS pandemic planning scenarios are labeled across four quadrants (A, B, C and D). From left to right, the scenarios are “seasonal range”, “moderate pandemic”, “severe pandemic” and “very severe pandemic.” As clinical severity increases along the horizontal axis, or as transmissibility increases along the vertical axis, the severity of the pandemic planning scenario also increases.'}

One epidemiological approach to estimating the clinical severity is quantifying the Case Fatality Risk (CFR). CFR is the conditional probability of death given confirmed diagnosis, calculated as the ratio of the cumulative number of deaths from an infectious disease to the number of confirmed diagnosed cases. However, calculating this directly during the course of an epidemic tends to result in a naive or biased CFR given the time [delay](../learners/reference.md#delaydist) from onset to death, varying substantially as the epidemic progresses and stabilising at the later stages of the outbreak ([Ghani et al., 2005](https://academic.oup.com/aje/article/162/5/479/82647?login=false#620743)).

![Observed biased confirmed case fatality risk (CFR) estimates as a function of time (thick line) calculated as the cumulative number of deaths over confirmed cases at time t. The estimate at the end of an outbreak (~May 30) is the realised CFR by the end of the epidemic. The horizontal continuous line and dotted lines show the expected value and the 95% confidence intervals ($95\%$ CI) of the predicted delay-adjusted CFR estimate only by using the observed data until 27 Mar 2003 ([Nishiura et al., 2009](https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0006852))](fig/cfr-pone.0006852.g003-fig_c.png){alt='The periods are relevant: Period 1 -- 15 days where CFR is zero to indicate this is due to no reported deaths; Period from Mar 15 -- Apr 26 where CFR appears to be rising; Period Apr 30 -- May 30 where the CFR estimate stabilises.'}

::::::::::::::::::::::: instructor

The periods are relevant: Period 1 -- 15 days where CFR is zero to indicate this is due to no reported deaths; Period from Mar 15 -- Apr 26 where CFR appears to be rising; Period Apr 30 -- May 30 where the CFR estimate stabilises.

:::::::::::::::::::::::

More generally, estimating severity can be helpful even outside of a pandemic planning scenario and in the context of routine public health.
Knowing whether an outbreak has or had a different severity from the historical record can motivate causal investigations,
which could be intrinsic to the infectious agent (e.g., a new, more severe strain) or due to underlying factors in the population (e.g. reduced immunity or morbidity factors) ([Lipsitch et al., 2015](https://journals.plos.org/plosntds/article?id=10.1371/journal.pntd.0003846)).

In this tutorial we are going to learn how to use the `{cfr}` package to calculate and adjust a CFR estimation using [delay distributions](../learners/reference.md#delaydist) from `{epiparameter}` or elsewhere, based on the methods developed by [Nishiura et al., 2009](https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0006852). We will also explore how we can reuse `{cfr}` functions for more severity measurements.

We'll use the pipe `%>%` operator from the `{magrittr}` package to connect  functions from {cfr}, including others from the package `{dplyr}`. Hence, we will load the `{tidyverse}` package which contains both `{magrittr}` and `{dplyr}`.

```{r message=FALSE, warning=FALSE}
# install packages if their are not already installed
if (!require("pak")) install.packages("pak")
if (!require("cfr")) pak::pak("cfr")
if (!require("epiparameter")) pak::pak("epiparameter")
if (!require("tidyverse")) pak::pak("tidyverse")
if (!require("outbreaks")) pak::pak("outbreaks")

# load packages
library(cfr)
library(epiparameter)
library(tidyverse)
library(outbreaks)
```

::::::::::::::::::: checklist

### The double-colon

The double-colon `::` in R lets you call a specific function from a package without loading the entire package into the current environment.

For example, `dplyr::filter(data, condition)` uses the `filter()` function from the `{dplyr}` package.

This helps us remember package functions and avoid namespace conflicts.

:::::::::::::::::::

:::::::::::::::::::: discussion

Have you been a member of an epidemic response team? If yes:

- How did you assess the clinical severity of the outbreak?
- What were the primary sources of bias?
- What did you do to take into account the identified bias?
- What complementary analysis would you do to solve the bias?

::::::::::::::::::::

## Data sources for clinical severity

What data sources can we use to estimate the clinical severity of a disease outbreak? [Verity et al., 2020](https://www.thelancet.com/journals/laninf/article/PIIS1473-3099(20)30243-7/fulltext) summarises the spectrum of COVID-19 cases:

![Spectrum of COVID-19 cases. The CFR aims to estimate the proportion of deaths among confirmed cases in an epidemic.
([Verity et al., 2020](https://www.thelancet.com/journals/laninf/article/PIIS1473-3099(20)30243-7/fulltext#gr1))](fig/cfr-spectrum-cases-covid19.jpg)

- At the top of the pyramid, those who met the WHO case criteria for **severe** or **critical** cases would likely have been identified in the hospital setting, presenting with atypical viral pneumonia. These cases would have been identified in mainland China and among those categorised internationally as local transmission.
- Many more cases are likely to be **symptomatic** (i.e., with fever, cough, or myalgia) but might not require hospitalisation. These cases would have been identified through links to international travel to high-risk areas and through contact-tracing of contacts of confirmed cases. They might be identifiable through population surveillance of, for example, influenza-like illness.
- The bottom part of the pyramid represents **mild** (and possibly **asymptomatic**) cases. These cases might be identifiable through contact tracing and subsequently via serological testing.

## Naive CFR

We measure disease severity in terms of case fatality risk (CFR). The CFR is interpreted as the conditional probability of death given confirmed diagnosis, calculated as the ratio of the cumulative number of deaths $D_{t}$ to the cumulative number of confirmed cases $C_{t}$ at a certain time $t$. We can refer it to the _naive CFR_ (also crude or biased CFR, $b_{t}$):

$$ b_{t} =  \frac{D_{t}}{C_{t}} $$

This calculation is _naive_ because it tends to yield a biased and mostly underestimated CFR due to the time-delay from onset to death, only stabilising at the later stages of the outbreak.

To calculate the naive CFR, the `{cfr}` package requires an input data frame with three columns named:

- `date`
- `cases`
- `deaths`

Let's explore the `ebola1976` dataset, included in {cfr}, which comes from the first Ebola outbreak in Zaire (now the Democratic Republic of the Congo) in 1976, as analysed by Camacho et al. (2014).

```{r}
# Load the Ebola 1976 data provided with the {cfr} package
data("ebola1976")

# Plot the incidence of cases and death reports
ebola1976 %>%
  incidence2::incidence(
    date_index = "date",
    counts = c("cases", "deaths")
  ) %>%
  plot()
```

We'll frame this episode under the context of an **ongoing outbreak** with only the **first 30 days** of data observed.

```{r}
# Assume we only have the first 30 days of this data
ebola_30days <- ebola1976 %>%
  dplyr::slice_head(n = 30) %>%
  dplyr::as_tibble()

ebola_30days
```

:::::::::::::::::: callout

### We need aggregated incidence data

`{cfr}` reads **aggregated** incidence data.

<!-- Similar to the `{EpiNow2}` with the difference that for `{cfr}` we need one more column named `deaths`. -->

```{r,eval=FALSE,echo=FALSE}
EpiNow2::example_confirmed %>%
  dplyr::as_tibble()
```

This data input should be **aggregated** by day, which means one observation *per day*, containing the *daily* number of reported cases and deaths. Observations with zero or missing values should also be included, similar to time-series data.

Also, `{cfr}` currently works for *daily* data only, but not for other temporal units of data aggregation, e.g., weeks.

<!-- suggest ways to deal with weekly incidence data -->
<!-- https://github.com/epiverse-trace/cfr/issues/117 -->

::::::::::::::::::

When we apply the `cfr_static()` function to the input `data` directly, we are calculating the naive CFR:

```{r}
# Calculate the naive CFR for the first 30 days
cfr::cfr_static(data = ebola_30days)
```

:::::::::::::::::::::::::::::::::::::::: challenge

Download the file [sarscov2_cases_deaths.csv](data/sarscov2_cases_deaths.csv) and read it into R.

Estimate the naive CFR.

:::::::::::::::::::: hint

Inspect the format of the data input.

- Does it contain daily data?
- Does the column names are as required by `cfr_static()`?
- How would you rename column names from a data frame?

::::::::::::::::::::

:::::::::::::::::::: solution

We read the data input using `readr::read_csv()`. This function recognize that the column `date` is a `<date>` class vector.

```{r, eval=TRUE,echo=FALSE,warning=FALSE,message=FALSE}
# read data from the tutorial repository R project
sarscov2_input <-
  readr::read_csv(file.path("data", "sarscov2_cases_deaths.csv"))
```

```{r,eval=FALSE,echo=TRUE,warning=FALSE,message=FALSE}
# read data
# e.g.: if path to file is data/raw-data/ebola_cases.csv then:
sarscov2_input <-
  readr::read_csv(here::here("data", "raw-data", "sarscov2_cases_deaths.csv"))
```

```{r}
# Inspect data
sarscov2_input
```

We can use `cleanepi::standardize_column_names()` to adapt the external data to fit the data input for `cfr::cfr_static()`.

```{r}
# Rename before Estimate naive CFR
sarscov2_input %>%
  cleanepi::standardize_column_names(
    rename = c(cases = "cases_jpn", deaths = "deaths_jpn")
  ) %>%
  cfr::cfr_static()
```

::::::::::::::::::::

::::::::::::::::::::::::::::::::::::::::

## Biases that affect CFR estimation

::::::::::::::::::::::::::::: discussion

### Two biases that affect CFR estimation

[Lipsitch et al., 2015](https://journals.plos.org/plosntds/article?id=10.1371/journal.pntd.0003846) describe two potential biases that can affect the estimation of CFR (and their potential solutions):

:::::::::::::::::::::::::::::

::::::::::::: solution

### 1. Preferential ascertainment of severe cases

For diseases with a _spectrum_ of clinical presentations, those cases that come to the attention of public health authorities and are registered into surveillance databases will typically be people with the most severe symptoms who seek medical care, are admitted to a hospital, or die.

Therefore, the CFR will typically be higher among _detected cases_ than among the entire population of cases, given that the latter may include individuals with mild, subclinical, and (under some definitions of “case”) asymptomatic presentations.

:::::::::::::

:::::::::::: solution

### 2. Bias due to delayed reporting of death

During an _ongoing_ epidemic, there is a delay between the time someone dies and the time their death is reported. Therefore, at any given moment in time, the list of cases includes people who will die and whose death has not yet occurred or has occurred but not yet been reported. Thus, dividing the cumulative number of reported deaths by the cumulative number of reported cases at a specific time point during an outbreak will underestimate the true CFR.

The key determinants of the magnitude of the bias are the epidemic _growth rate_ and the _distribution of delays_ from case-reporting to death-reporting; the longer the delays and the faster the growth rate, the greater the bias.

In this tutorial episode, we are going to focus on solutions to deal with this specific bias using `{cfr}`!

::::::::::::

:::::::::::::::::::: solution

### Case study: Influenza A (H1N1), Mexico, 2009

Improving an _early_ epidemiological assessment of a delay-adjusted CFR is crucial for determining virulence, shaping the level and choices of public health intervention, and providing advice to the general public.

In 2009, during the swine-flu virus, Influenza A (H1N1), Mexico had an early biased estimation of the CFR. Initial reports from the government of Mexico suggested a virulent infection, whereas, in other countries, the same virus was perceived as mild ([TIME, 2009](https://content.time.com/time/health/article/0,8599,1894534,00.html)).

In the USA and Canada, no deaths were attributed to the virus in the first ten days following the World Health Organization's declaration of a public health emergency. Even under similar circumstances at the early stage of the global pandemic, public health officials, policymakers and the general public want to know the virulence of an emerging infectious agent.

[Fraser et al., 2009](https://www.science.org/doi/full/10.1126/science.1176062) reinterpreted the data assessing the biases and getting a clinical severity lower than the 1918 influenza pandemic but comparable with that seen in the 1957 pandemic.

::::::::::::::::::::

:::::::::::::::::::: instructor

We can showcase this last bias using the [concept described in this `{cfr}` vignette](https://epiverse-trace.github.io/cfr/articles/cfr.html#concept-how-reporting-delays-bias-cfr-estimates).

<!-- create code and then a .gif? -->

::::::::::::::::::::

## Delay-adjusted CFR

[Nishiura et al., 2009](https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0006852) developed a method that considers the **time delay** from the onset of symptoms to death.

Real-time outbreaks may have a number of deaths that are insufficient to determine the time distribution between onset and death. Therefore, we can estimate the _distribution delay_ from historical outbreaks or reuse the ones accessible via R packages like `{epiparameter}` or `{epireview}`, which collect them from published scientific literature. For a step-by-step guide, read the tutorial episode on how to [access epidemiological delays](https://epiverse-trace.github.io/tutorials-early/delays-reuse.html).

Let's use `{epiparameter}`:

```{r, message=FALSE, warning=FALSE}
# Get delay distribution
onset_to_death_ebola <-
  epiparameter::epiparameter_db(
    disease = "Ebola",
    epi_name = "onset_to_death",
    single_epiparameter = TRUE
  )

# Plot <epiparameter> object
plot(onset_to_death_ebola, day_range = 0:40)
```

To calculate the delay-adjusted CFR, we can use the `cfr_static()` function with the `data` and `delay_density` arguments.

```{r}
# Calculate the delay-adjusted CFR
# for the first 30 days
cfr::cfr_static(
  data = ebola_30days,
  delay_density = function(x) density(onset_to_death_ebola, x)
)
```

```{r,echo=FALSE}
out_delay_adjusted <-
  cfr::cfr_static(
    data = ebola_30days,
    delay_density = function(x) density(onset_to_death_ebola, x)
  )

out_estimate <- out_delay_adjusted %>% pull(severity_estimate)
out_low <- out_delay_adjusted %>% pull(severity_low)
out_high <- out_delay_adjusted %>% pull(severity_high)
```

The delay-adjusted CFR indicated that the overall disease severity _at the end of the outbreak_ or with the _latest data available at the moment_ is `r out_estimate` with a 95% confidence interval between `r out_low` and `r out_high`, slightly higher than the naive one.

:::::::::::::::::: callout

### Use the epiparameter class

When using an `<epiparameter>` class object we can use this expression as a template:

`function(x) density(<EPIPARAMETER_OBJECT>, x)`

For distribution functions with parameters not available in `{epiparameter}`, we suggest you two alternatives:

- Create an `<epiparameter>` class object, to plug into other R packages of the outbreak analytics pipeline. Read the [reference documentation of `epiparameter::epiparameter()`](https://epiverse-trace.github.io/epiparameter/reference/epiparameter.html)

- Read `{cfr}` vignette for [a primer on working with delay distributions](https://epiverse-trace.github.io/cfr/articles/delay_distributions.html).

::::::::::::::::::

:::::::::::::::::::::::::::::::::::::::: challenge

Use the same file from the previous challenge ([sarscov2_cases_deaths.csv](data/sarscov2_cases_deaths.csv)).

Estimate the delay-adjusted CFR using the appropriate distribution delay. Then:

- Compare the naive and the delay-adjusted CFR solutions!

:::::::::::::::::::: hint

- Find the appropriate `<epiparameter>` object!

::::::::::::::::::::

:::::::::::::::::::: solution

We use `{epiparameter}` to access a delay distribution for the SARS-CoV-2 aggregated incidence data:

```{r,message=FALSE,warning=FALSE}
library(epiparameter)

sarscov2_delay <-
  epiparameter::epiparameter_db(
    disease = "covid",
    epi_name = "onset to death",
    single_epiparameter = TRUE
  )
```

We read the data input using `readr::read_csv()`. This function recognize that the column `date` is a `<date>` class vector.

```{r, eval=TRUE,echo=FALSE,warning=FALSE,message=FALSE}
# read data from the tutorial repository R project
sarscov2_input <-
  readr::read_csv(file.path("data", "sarscov2_cases_deaths.csv"))
```

```{r,eval=FALSE,echo=TRUE}
# read data
# e.g.: if path to file is data/raw-data/ebola_cases.csv then:
sarscov2_input <-
  readr::read_csv(here::here("data", "raw-data", "sarscov2_cases_deaths.csv"))
```

```{r}
# Inspect data
sarscov2_input
```

We can use the `cleanepi::standardize_column_names()` to adapt the external data to fit the data input for `cfr_static()`.

```{r}
# Rename before Estimate naive CFR
sarscov2_input %>%
  cleanepi::standardize_column_names(
    rename = c(cases = "cases_jpn", deaths = "deaths_jpn")
  ) %>%
  cfr::cfr_static(
    delay_density = function(x) density(sarscov2_delay, x)
  )
```

Interpret the comparison between the naive and delay-adjusted CFR estimates.

::::::::::::::::::::

::::::::::::::::::::::::::::::::::::::::

:::::::::::::::::: spoiler

### When to use discrete distributions?

For  `cfr_static()` and all the `cfr_*()` family of functions, the most appropriate choice to pass are **discrete** distributions. This is because we will work with daily case and death data.

We can assume that evaluating the Probability Distribution Function (PDF) of a *continuous* distribution is equivalent to the Probability Mass Function (PMF) of the equivalent *discrete* distribution.

However, this assumption may not be appropriate for distributions with larger peaks. For instance, diseases with an onset-to-death distribution that is strongly peaked with a low variance. In such cases, the average disparity between the PDF and PMF is expected to be more pronounced compared to distributions with broader spreads. One way to deal with this is to discretise the continuous distribution using `epiparameter::discretise()` to an `<epiparameter>` object.

::::::::::::::::::


::::::::::::::::::::::::::: spoiler

### How does {cfr} works?

To adjust the CFR, [Nishiura et al., 2009](https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0006852) use the case and death incidence data to estimate the number of cases with known outcomes:

$$
  u_t = \dfrac{\sum_{i = 0}^t
        \sum_{j = 0}^\infty c_{i - j} f_{j}}{\sum_{i = 0} c_i},
$$

where:

- $c_{t}$ is the daily case incidence at time $t$,
- $f_{t}$ is the value of the Probability Mass Function (PMF) of the **delay distribution** between onset and death, and
- $u_{t}$ represents the underestimation factor of the known outcomes.

$u_{t}$ is used to **scale** the value of the cumulative number of cases in the denominator in the calculation of the CFR. This is calculated internally with the [`estimate_outcomes()`](https://epiverse-trace.github.io/cfr/reference/estimate_outcomes.html) function.

The estimator for CFR can be written as:

$$p_{t} = \frac{b_{t}}{u_{t}}$$

where $p_{t}$ is the realized proportion of confirmed cases to die from the infection (or the unbiased CFR), and $b_{t}$, the crude and biased estimate of CFR (also naive CFR).

From this last equation, we observe that the unbiased CFR $p_{t}$ is larger than biased CFR $b_{t}$ because in $u_{t}$ the numerator is smaller than the denominator (note that $f_{t}$ is the probability distribution of the *delay distribution* between onset and death). Therefore, we refer to $b_{t}$ as the biased estimator of CFR.

When we observe the entire course of an epidemic (from $t \rightarrow \infty$), $u_{t}$ tends to 1, making $b_{t}$ tends to $p_{t}$ and become an unbiased estimator ([Nishiura et al., 2009](https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0006852)).

:::::::::::::::::::::::::::


## An early-stage CFR estimate

The **naive** estimate is useful to get an overall severity estimate of the outbreak (so far). Once the outbreak has ended or has progressed such that more deaths are reported, the estimated CFR is then closest to the 'true' unbiased CFR.

On the other hand, the **delay-adjusted** estimate can assess the severity of an emerging infectious disease *earlier* than the biased or naive CFR, during an epidemic.

We can explore the **early** determination of the _delay-adjusted CFR_ using the `cfr_rolling()` function.

:::::::::::::::::::::: callout

`cfr_rolling()` is a utility function that automatically calculates CFR for each day of the outbreak with the data available up to that day, saving the user time.

::::::::::::::::::::::

`cfr_rolling()` shows the estimated CFR on each outbreak day, given that future data on cases and deaths is unavailable at the time. The final value of `cfr_rolling()` estimates is identical to `cfr_static()` on the same data.

```{r}
# for all the 73 days in the Ebola dataset
# Calculate the rolling daily naive CFR
rolling_cfr_naive <- cfr::cfr_rolling(data = ebola1976)
```

```{r}
# for all the 73 days in the Ebola dataset
# Calculate the rolling daily delay-adjusted CFR
rolling_cfr_adjusted <- cfr::cfr_rolling(
  data = ebola1976,
  delay_density = function(x) density(onset_to_death_ebola, x)
)
```

With `utils::tail()`, we view the latest CFR estimates. The naive and delay-adjusted estimates have overlapping ranges of 95% confidence intervals.

```{r,eval=FALSE,echo=TRUE}
# Print the tail of the data frame
utils::tail(rolling_cfr_naive)
utils::tail(rolling_cfr_adjusted)
```

Now, let's visualise both results in a time series. How would the naive and delay-adjusted CFR estimates perform in real time?

```{r,echo=TRUE,warning=FALSE,message=FALSE}
# bind by rows both output data frames
dplyr::bind_rows(
  list(
    naive = rolling_cfr_naive,
    adjusted = rolling_cfr_adjusted
  ),
  .id = "method"
) %>%
  # visualise both adjusted and unadjusted rolling estimates
  ggplot() +
  geom_ribbon(
    aes(
      date,
      ymin = severity_low,
      ymax = severity_high,
      fill = method
    ),
    alpha = 0.2, show.legend = FALSE
  ) +
  geom_line(
    aes(date, severity_estimate, colour = method)
  )
```

The red and blue lines represent the delay-adjusted and naive CFR, respectively, throughout the outbreak. The bands around them represent the 95% confidence intervals (95% CI).

**Notice** that this delay-adjusted calculation is particularly useful when an _epidemic curve of confirmed cases_ is the only data available (i.e. when individual data from onset to death are unavailable, especially during the early stage of the epidemic). When there are few deaths or none at all, an assumption has to be made for the *delay distribution* from onset to death, e.g. from literature based on previous outbreaks. [Nishiura et al., 2009](https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0006852) depict this in the figures with data from the SARS outbreak in Hong Kong, 2003.

:::::::::::::::::::::::::::::::::: spoiler

### Case study: SARS outbreak, Hong Kong, 2003

Figures A and B show the cumulative numbers of cases and deaths of SARS, and Figure C shows the observed (biased) CFR estimates as a function of time, i.e. the cumulative number of deaths over cases at time $t$. Due to the delay from the onset of symptoms to death, the biased estimate of CFR at time $t$ underestimates the realised CFR at the end of an outbreak (i.e. 302/1755 = 17.2 %).

![Observed (biased) confirmed case fatality risk of severe acute respiratory syndrome (SARS) in Hong Kong, 2003. ([Nishiura et al., 2009](https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0006852))](fig/cfr-pone.0006852.g003-fig_abc.png)

Nevertheless, even by only using the observed data for the period March 19 to April 2, `cfr_static()` can yield an appropriate prediction (Figure D), e.g. the delay-adjusted CFR at March 27 is 18.1 % (95% CI: 10.5, 28.1). An overestimation is seen in the very early stages of the epidemic, but the 95% confidence limits in the later stages include the realised CFR (i.e. 17.2 %).

![Early determination of the delay-adjusted confirmed case fatality risk of severe acute respiratory syndrome (SARS) in Hong Kong, 2003. ([Nishiura et al., 2009](https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0006852))](fig/cfr-pone.0006852.g003-fig_d.png)

::::::::::::::::::::::::::::::::::

:::::::::::::::::::::::::::::::::::::::::::: discussion

### Interpret the early-stage CFR estimate

Based on the figure above:

- How much difference in days is between the date in which the 95% CI of the estimated _delay-adjusted CFR_ vs _naive CFR_ cross with the CFR estimated at the end of the outbreak?

Discuss:

- What are the Public health policy implications of having a _delay-adjusted CFR_ estimate?

::::::::::::::::::::::::::::::::::::::::::::

:::::::::::::::::::::: hint

We can either use a visual inspection or analyse the output data frames.

::::::::::::::::::::::

:::::::::::::::::::::: solution

There is almost one month of difference.

Note that the estimate has considerable uncertainty at the beginning of the time series. After two weeks, the delay-adjusted CFR approaches the overall CFR estimate at the outbreak's end.

Is this pattern similar to other outbreaks? We can use the data sets in this episode's challenges. We invite you to find it out!

::::::::::::::::::::::

:::::::::::::::::::::: discussion

### Checklist

With `{cfr}`, we estimate the CFR as the proportion of deaths among **confirmed** cases.

By only using **confirmed** cases, it is clear that all cases that do not seek medical treatment or are not notified will be missed, as well as all asymptomatic cases. This means that the CFR estimate is higher than the proportion of deaths among the infected.

::::::::::::::::::::::

::::::::::::::::::::::::::: solution

### Why the naive and delay-adjusted differ?

`{cfr}` method aims to obtain an unbiased estimator "well before" observing the entire course of the outbreak. For this, `{cfr}` uses the underestimation factor $u_{t}$ to estimate the unbiased CFR $p_{t}$ using maximum-likelihood methods, given the *sampling process* defined by [Nishiura et al., 2009](https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0006852).

:::::::::::::::::::::::::::

:::::::::::::::::::::::::: solution

### What is the sampling process?

![The population of confirmed cases and sampling process for estimating the unbiased CFR during the course of an outbreak. ([Nishiura et al., 2009](https://doi.org/10.1371/journal.pone.0006852.g001))](fig/cfr-pone.0006852.g001.png)

From *aggregated incidence data*, at time $t$ we know the cumulative number of confirmed cases and deaths, $C_{t}$ and $D_{t}$, and wish to estimate the unbiased CFR $\pi$, by way of the factor of underestimation $u_{t}$.

If we knew the factor of underestimation $u_{t}$ we could specify the size of the population of confirmed cases no longer at risk ($u_{t}C_{t}$, **shaded**), although we do not know which surviving individuals belong to this group. A proportion $\pi$ of those in the group of cases still at risk (size $(1- u_{t})C_{t}$, **unshaded**) is expected to die.

Because each case no longer at risk had an independent probability of dying, $\pi$, the number of deaths, $D_{t}$, is a sample from a binomial distribution with sample size $u_{t}C_{t}$, and probability of dying $p_{t}$ = $\pi$.

This is represented by the following likelihood function to obtain the maximum likelihood estimate of the unbiased CFR $p_{t}$ = $\pi$:

$$
  {\sf L}(\pi | C_{t},D_{t},u_{t}) = \log{\dbinom{u_{t}C_{t}}{D_{t}}} + D_{t} \log{\pi} +
  (u_{t}C_{t} - D_{t})\log{(1 - \pi)},
$$

This estimation is performed by the internal function `?cfr:::estimate_severity()`.

::::::::::::::::::::::::::

:::::::::::::::::::::::::: solution

### Limitations

- The delay-adjusted CFR does not address all sources of error in data like the underdiagnosis of infected individuals.

::::::::::::::::::::::::::

## Challenges

:::::::::::::::::::::::::::::::: discussion

### More severity measures

Suppose we need to assess the clinical severity of the epidemic in a context different from surveillance data, like the severity among cases that arrive at hospitals or cases you collected from a representative serological survey.

Using `{cfr}`, we can change the inputs for the numerator (`cases`) and denominator (`deaths`) to estimate more severity measures like the Infection fatality risk (IFR) or the Hospitalisation Fatality Risk (HFR). We can follow this analogy:

::::::::::::::::::::::::::::::::

:::::::::::::::::::::::::::: solution

### Infection and Hospitalization fatality risk

If for a _Case_ fatality risk (CFR), we require:

- _case_ and death incidence data, with a
- case-to-death delay distribution (or close approximation, such as symptom onset-to-death).

Then, the _Infection_ fatality risk (IFR) requires:

- _infection_ and death incidence data, with an
- exposure-to-death delay distribution (or close approximation).

Similarly, the _Hospitalisation_ Fatality Risk (HFR) requires:

- _hospitalisation_ and death incidence data, and a
- hospitalisation-to-death delay distribution.

::::::::::::::::::::::::::::::::

:::::::::::::::::::::::::::: solution

### Data sources for more severity measures

[Yang et al., 2020](https://www.nature.com/articles/s41467-020-19238-2/figures/1) summarises different definitions and data sources:

![Severity levels of infections with SARS-CoV-2 and parameters of interest. Each level is assumed to be a subset of the level below.](fig/cfr-s41467-020-19238-2-fig_a.png)

- sCFR symptomatic case-fatality risk,
- sCHR symptomatic case-hospitalisation risk,
- mCFR medically attended case-fatality risk,
- mCHR medically attended case-hospitalisation risk,
- HFR hospitalisation-fatality risk.

![Schematic diagram of the baseline analyses. Red, blue, and green arrows denote the data flow from laboratory-confirmed cases of passive surveillance, clinically-diagnosed cases, and laboratory-confirmed cases of active screenings.](fig/cfr-s41467-020-19238-2-fig_b.png){alt='Data source of COVID-19 cases in Wuhan: D1) 32,583 laboratory-confirmed COVID-19 cases as of March 84, D2) 17,365 clinically-diagnosed COVID-19 cases during February 9–194, D3)daily number of laboratory-confirmed cases on March 9–April 243, D4) total number of COVID-19 deaths as of April 24 obtained from the Hubei Health Commission3, D5) 325 laboratory-confirmed cases and D6) 1290 deaths were added as of April 16 through a comprehensive and systematic verification by Wuhan Authorities3, and D7) 16,781 laboratory-confirmed cases identified through universal screening10,11. Pse: RT-PCR sensitivity12. Pmed.care: proportion of seeking medical assistance among patients suffering from acute respiratory infections13.'}

::::::::::::::::::::::::::::

::::::::::::::::: callout

### Aggregated data differ from linelists

*Aggregated* incidence data differs from **linelist** data, where each observation contains individual-level data.

```{r}
outbreaks::ebola_sierraleone_2014 %>% as_tibble()
```

:::::::::::::::::

:::::::::::::::::::::::::::::::::: challenge

### Use incidence2 to rearrange your data

From the `{outbreaks}` package, load the MERS linelist of cases from the `mers_korea_2015` object.

Rearrange your this linelist to fit into the `{cfr}` package input.

Estimate the delay-adjusted CFR using the corresponding distribution delay.

::::::::::::::::: hint

**How to rearrange my input data?**

Rearranging the input data for data analysis can take most of the time. To get ready-to-analyse _aggregated incidence data_, we encourage you to use `{incidence2}`!

First, in the [Get started](https://www.reconverse.org/incidence2/articles/incidence2.html) vignette from the `{incidence2}` package, explore how to use the `date_index` argument when reading a linelist with dates in multiple column.

Then, refer to the `{cfr}` reference manual on [Prepare common epidemiological data formats for CFR estimation](https://epiverse-trace.github.io/cfr/reference/prepare_data.html) on how to use the `cfr::prepare_data()` function from incidence2 objects.

<!-- cite howto entry one lineslist + incidence2 + cfr connection -->

:::::::::::::::::

::::::::::::::::: solution

```{r,message=FALSE,warning=FALSE}
# Load packages
library(cfr)
library(epiparameter)
library(incidence2)
library(outbreaks)
library(tidyverse)

# Access delay distribution
mers_delay <- epiparameter::epiparameter_db(
  disease = "mers",
  epi_name = "onset to death",
  single_epiparameter = TRUE
)

# Read linelist
mers_korea_2015$linelist %>%
  as_tibble() %>%
  select(starts_with("dt_"))

# Use {incidence2} to count daily incidence
mers_incidence <- mers_korea_2015$linelist %>%
  # converto to incidence2 object
  incidence(date_index = c("dt_onset", "dt_death")) %>%
  # complete dates from first to last
  incidence2::complete_dates()

# Inspect incidence2 output
mers_incidence

# Prepare data from {incidence2} to {cfr}
mers_incidence %>%
  cfr::prepare_data(
    cases_variable = "dt_onset",
    deaths_variable = "dt_death"
  )

# Estimate delay-adjusted CFR
mers_incidence %>%
  cfr::prepare_data(
    cases_variable = "dt_onset",
    deaths_variable = "dt_death"
  ) %>%
  cfr::cfr_static(delay_density = function(x) density(mers_delay, x))
```


:::::::::::::::::

::::::::::::::::::::::::::::::::::

:::::::::::::::::::::::::::::::::::::::::::::::::: challenge

### Severity heterogeneity

The CFR may differ across populations (e.g. age, space, treatment); quantifying these heterogeneities can help target resources appropriately and compare different care regimens ([Cori et al., 2017](https://royalsocietypublishing.org/doi/10.1098/rstb.2016.0371)).

Use the `cfr::covid_data` data frame to estimate a delay-adjusted CFR stratified by country.

::::::::::::::::::::::::: hint

One way to do a _stratified analysis_ is to apply a model to nested data. This [`{tidyr}` vignette](https://tidyr.tidyverse.org/articles/nest.html#nested-data-and-models) shows you how to apply the `group_by()` + `nest()` to nest data, and then `mutate()` + `map()` to apply the model.

:::::::::::::::::::::::::

::::::::::::::::::::::::: solution

```{r,message=FALSE,warning=FALSE}
library(cfr)
library(epiparameter)
library(tidyverse)

covid_data %>% glimpse()

delay_onset_death <-
  epiparameter::epiparameter_db(
    disease = "covid",
    epi_name = "onset to death",
    single_epiparameter = TRUE
  )

covid_data %>%
  group_by(country) %>%
  nest() %>%
  mutate(
    temp =
      map(
        .x = data,
        .f = cfr::cfr_static,
        delay_density = function(x) density(delay_onset_death, x)
      )
  ) %>%
  unnest(cols = temp)
```

Great! Now you can use similar code for any other stratified analysis like age, regions or more!

But, how can we interpret that there is a country variability of severity from the same diagnosed pathogen?

Local factors like testing capacity, the case definition, and sampling regime can affect the report of cases and deaths, thus affecting case ascertainment. Take a look to the `{cfr}` vignette on [Estimating the proportion of cases that are ascertained during an outbreak](https://epiverse-trace.github.io/cfr/articles/estimate_ascertainment.html)!

:::::::::::::::::::::::::

::::::::::::::::::::::::::::::::::::::::::::::::::


## Appendix

The `{cfr}` package has a function called `cfr_time_varying()` with functionality that differs from `cfr_rolling()`.

::::::::::::::::: callout

### When to use cfr_rolling()?

`cfr_rolling()` shows the estimated CFR on each outbreak day, given that future data on cases and deaths is unavailable at the time. The final value of `cfr_rolling()` estimates is identical to `cfr_static()` on the same data.

Remember, as shown above, `cfr_rolling()` is helpful to get early-stage CFR estimates and check whether an outbreak's CFR estimate has stabilised. Thus, `cfr_rolling()` is not sensitive to the length or size of the epidemic.

:::::::::::::::::

::::::::::::::::: callout

### When to use `cfr_time_varying()`?

On the other hand, `cfr_time_varying()` calculates the CFR over a moving window and helps to understand changes in CFR due to changes in the epidemic, e.g. due to a new variant or increased immunity from vaccination.

However, `cfr_time_varying()` is sensitive to sampling uncertainty. Thus, it is sensitive to the size of the outbreak. The higher the number of cases with expected outcomes on a given day, the more reasonable estimates of the time-varying CFR we will get.

For example, with 100 cases, the fatality risk estimate will, roughly speaking, have a 95% confidence interval ±10% of the mean estimate (binomial CI). So if we have >100 cases with expected outcomes *on a given day*, we can get reasonable estimates of the time varying CFR. But if we only have >100 cases *over the course of the whole epidemic*, we probably need to rely on `cfr_rolling()` that uses the cumulative data.

We invite you to read this [vignette about the `cfr_time_varying()` function](https://epiverse-trace.github.io/cfr/articles/estimate_time_varying_severity.html).

:::::::::::::::::


::::::::::::::::::::::::::::::::::::: keypoints

- Use `{cfr}` to estimate severity

- Use `cfr_static()` to estimate the overall CFR with the latest data available.

- Use `cfr_rolling()` to show what the estimated CFR would be on each day of the outbreak.

- Use the `delay_density` argument to adjust the CFR by the corresponding delay distribution.

::::::::::::::::::::::::::::::::::::::::::::::::