Commit 38a48ef

Merge pull request #242 from xiaodaigh/development
Development
2 parents 4f54a73 + 950fa77 commit 38a48ef

41 files changed

Lines changed: 655 additions & 497 deletions

.gitignore

Lines changed: 2 additions & 0 deletions
@@ -51,3 +51,5 @@ cran-mirror
 misc/disk.frame-report_files/
 .httr-oauth
 README_cache
+vignettes/
+README.html

CRAN-RELEASE

Lines changed: 2 additions & 2 deletions
@@ -1,2 +1,2 @@
-This package was submitted to CRAN on 2019-11-23.
-Once it is accepted, delete this file and tag the release (commit a5f1cee4f9).
+This package was submitted to CRAN on 2019-12-18.
+Once it is accepted, delete this file and tag the release (commit 5c386003ef).

DESCRIPTION

Lines changed: 1 addition & 1 deletion
@@ -2,7 +2,7 @@ Type: Package
 Package: disk.frame
 Title: Larger-than-RAM Disk-Based Data Manipulation Framework
 Version: 0.3.0
-Date: 2019-12-15
+Date: 2019-12-17
 Authors@R: c(
 person("Dai", "ZJ", email = "zhuojia.dai@gmail.com", role = c("aut", "cre")),
 person("Jacky", "Poon", role = c("ctb"))

NEWS.md

Lines changed: 1 addition & 1 deletion
@@ -1,5 +1,5 @@
 # disk.frame 0.3.0
-* experimental group-by framework!
+* experimental one-stage group-by framework!
 * bug fixes for data.table trigger by integration with tidyfast
 * removed assertthat from imports
 * add benchmarkme to Suggests

R/recommend_nchunks.r

Lines changed: 3 additions & 2 deletions
@@ -120,7 +120,7 @@ df_ram_size <- function() {

 if(is.na(ram_size)) {
 warning("RAM size can't be determined. Assume you have 16GB of RAM.")
-warning("Please report this error github.com/xiaodaigh/disk.frame/issues")
+warning("Please report this error at github.com/xiaodaigh/disk.frame/issues")
 warning(glue::glue("Please include your operating system, R version, and if using RStudio the Rstudio version number"))
 return(16)
 } else {
@@ -130,7 +130,8 @@ df_ram_size <- function() {
 } else{
 if(is.na(ram_size)) {
 warning("RAM size can't be determined. Assume you have 16GB of RAM.")
-warning("Please report this error github.com/xiaodaigh/disk.frame/issues")
+warning("Please try to install install.packages('benchmarkme') and try again.")
+warning("If error persists, please report this error at github.com/xiaodaigh/disk.frame/issues")
 warning(glue::glue("Please include your operating system, R version, and if using RStudio the Rstudio version number"))
 return(16)
 } else {

README.Rmd

Lines changed: 16 additions & 11 deletions
@@ -15,11 +15,6 @@ knitr::opts_chunk$set(
 ```
 # disk.frame <img src="inst/figures/disk.frame.png" align="right">

-<details>
-<summary>Please take a moment to star the disk.frame Github repo if you like disk.frame. It keeps me going. </summary>
-<iframe src="https://ghbtns.com/github-btn.html?user=xiaodaigh&repo=disk.frame&type=star&count=true&size=large" frameborder="0" scrolling="0" width="160px" height="30px"></iframe>
-</details>
-
 <!-- badges: start -->
 <!-- ![disk.frame logo](inst/figures/disk.frame.png?raw=true "disk.frame logo") -->
 <!-- badges: end -->
@@ -64,7 +59,7 @@ install.packages("disk.frame", repo="https://cran.rstudio.com")
 Please see these vignettes and articles about `{disk.frame}`

 - [Quick start:
-`{disk.frame}`](https://daizj.net/disk.frame/articles/intro-disk-frame.html)
+`{disk.frame}`](https://diskframe.com/articles/intro-disk-frame.html)
 which replicates the `sparklyr` vignette for manipulating the
 `nycflights13` flights data.
 - [Ingesting data into `{disk.frame}`](https://diskframe.com/articles/ingesting-data.html) which lists some commons way of creating disk.frames
@@ -158,8 +153,8 @@ flights.df %>%



-### Group by
-Starting from {disk.frame} v0.2.2, there is for support `group_by` for a limited set of functions. For example:
+### Group-by
+Starting from `{disk.frame}` v0.3.0, there is for support `group_by` for a limited set of functions. For example:

 ```r
 result_from_disk.frame = iris %>%
@@ -178,11 +173,11 @@ result_from_disk.frame = iris %>%
 collect
 ```

-The results should be exactly the same as if applying the same group-by operations on a data.frame. If not then please [report a bug](https://github.com/xiaodaigh/disk.frame/issues).
+The results should be exactly the same as if applying the same group-by operations on a data.frame. If not, please [report a bug](https://github.com/xiaodaigh/disk.frame/issues).

 #### List of supported group-by functions

-If a function you like is missing, please make a feature request [here](https://github.com/xiaodaigh/disk.frame/issues). It is a limitation that function that depend on the order a column can only obtained using estimated methods.
+If a function you like is missing, please make a feature request [here](https://github.com/xiaodaigh/disk.frame/issues). It is a limitation that function that depend on the order a column can only be obtained using estimated methods.

 | Function | Exact/Estimate | Notes |
 | -- | -- | -- |
@@ -304,7 +299,7 @@ Thank you to all our backers! [[Become a backer](https://opencollective.com/disk

 <a href="https://opencollective.com/diskframe#backers" target="_blank"><img src="https://opencollective.com/diskframe/backers.svg?width=890"></a>

-### Sponsors
+### Sponsor and back `{disk.frame}`

 Support `{disk.frame}` development by becoming a sponsor. Your logo will show up here with a link to your website. [[Become a sponsor](https://opencollective.com/diskframe#sponsor)]

@@ -315,6 +310,16 @@ Support `{disk.frame}` development by becoming a sponsor. Your logo will show up
 **Do you need help with machine learning and data science in R, Python, or Julia?**
 I am available for Machine Learning/Data Science/R/Python/Julia consulting! [Email me](mailto:dzj@analytixware.com)

+## Non-financial ways to contribute
+
+Do you wish to give back the open-source community in non-financial ways? Here are some ways you can contribute
+
+* Write a blogpost about your `{disk.frame}`. I would love to learn more about how `{disk.frame}` has helped you
+* Tweet or post on social media (e.g LinkedIn) about `{disk.frame}` to help promote it
+* Bring attention to typos and grammatical errors by correcting and making a PR. Or simply by [raising an issue here](https://github.com/xiaodaigh/disk.frame/issues)
+* Star the [`{disk.frame}` Github repo](https://github.com/xiaodaigh/disk.frame)
+* Star any repo that `{disk.frame}` depends on e.g. [`{fst}`](https://github.com/fstpackage/fst) and [`{future}`](https://github.com/HenrikBengtsson/future)
+
 ## Download Counts & Build Status

 [![](https://cranlogs.r-pkg.org/badges/disk.frame)](https://cran.r-project.org/package=disk.frame)

README.md

Lines changed: 41 additions & 37 deletions
@@ -3,14 +3,6 @@

 # disk.frame <img src="inst/figures/disk.frame.png" align="right">

-<details>
-
-<summary>Please take a moment to star the disk.frame Github repo if you
-like disk.frame. It keeps me going. </summary>
-<iframe src="https://ghbtns.com/github-btn.html?user=xiaodaigh&repo=disk.frame&type=star&count=true&size=large" frameborder="0" scrolling="0" width="160px" height="30px"></iframe>
-
-</details>
-
 <!-- badges: start -->

 <!-- ![disk.frame logo](inst/figures/disk.frame.png?raw=true "disk.frame logo") -->
@@ -63,7 +55,7 @@ install.packages("disk.frame", repo="https://cran.rstudio.com")
 Please see these vignettes and articles about `{disk.frame}`

 - [Quick start:
-`{disk.frame}`](https://daizj.net/disk.frame/articles/intro-disk-frame.html)
+`{disk.frame}`](https://diskframe.com/articles/intro-disk-frame.html)
 which replicates the `sparklyr` vignette for manipulating the
 `nycflights13` flights data.
 - [Ingesting data into
@@ -225,21 +217,18 @@ flights.df %>%
 filter(year == 2013) %>%
 mutate(origin_dest = paste0(origin, dest)) %>%
 head(2)
-#> year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
-#> 1 2013 1 1 517 515 2 830 819
-#> 2 2013 1 1 533 529 4 850 830
-#> arr_delay carrier flight tailnum origin dest air_time distance hour minute
-#> 1 11 UA 1545 N14228 EWR IAH 227 1400 5 15
-#> 2 20 UA 1714 N24211 LGA IAH 227 1416 5 29
-#> time_hour origin_dest
-#> 1 2013-01-01 05:00:00 EWRIAH
-#> 2 2013-01-01 05:00:00 LGAIAH
+#> year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum
+#> 1 2013 1 1 517 515 2 830 819 11 UA 1545 N14228
+#> 2 2013 1 1 533 529 4 850 830 20 UA 1714 N24211
+#> origin dest air_time distance hour minute time_hour origin_dest
+#> 1 EWR IAH 227 1400 5 15 2013-01-01 05:00:00 EWRIAH
+#> 2 LGA IAH 227 1416 5 29 2013-01-01 05:00:00 LGAIAH
 ```

-### Group by
+### Group-by

-Starting from {disk.frame} v0.2.2, there is for support `group_by` for a
-limited set of functions. For example:
+Starting from `{disk.frame}` v0.3.0, there is for support `group_by` for
+a limited set of functions. For example:

 ``` r
 result_from_disk.frame = iris %>%
@@ -259,14 +248,14 @@ result_from_disk.frame = iris %>%
 ```

 The results should be exactly the same as if applying the same group-by
-operations on a data.frame. If not then please [report a
+operations on a data.frame. If not, please [report a
 bug](https://github.com/xiaodaigh/disk.frame/issues).

 #### List of supported group-by functions

 If a function you like is missing, please make a feature request
 [here](https://github.com/xiaodaigh/disk.frame/issues). It is a
-limitation that function that depend on the order a column can only
+limitation that function that depend on the order a column can only be
 obtained using estimated methods.

 | Function | Exact/Estimate | Notes |
@@ -290,6 +279,7 @@ obtained using estimated methods.

 ``` r
 library(data.table)
+#> data.table 1.12.8 using 6 threads (see ?getDTthreads). Latest news: r-datatable.com
 #>
 #> Attaching package: 'data.table'
 #> The following object is masked from 'package:purrr':
@@ -336,31 +326,27 @@ To find out where the disk.frame is stored on disk:
 ``` r
 # where is the disk.frame stored
 attr(flights.df, "path")
-#> [1] "C:\\Users\\RTX2080\\AppData\\Local\\Temp\\Rtmpgv1Q1Y\\filebf052f045d8.df"
+#> [1] "C:\\Users\\RTX2080\\AppData\\Local\\Temp\\Rtmpeoxh5E\\file4c5c517b5f0c.df"
 ```

 A number of data.frame functions are implemented for disk.frame

 ``` r
 # get first few rows
 head(flights.df, 1)
-#> year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
-#> 1: 2013 1 1 517 515 2 830 819
-#> arr_delay carrier flight tailnum origin dest air_time distance hour minute
-#> 1: 11 UA 1545 N14228 EWR IAH 227 1400 5 15
-#> time_hour
-#> 1: 2013-01-01 05:00:00
+#> year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum
+#> 1: 2013 1 1 517 515 2 830 819 11 UA 1545 N14228
+#> origin dest air_time distance hour minute time_hour
+#> 1: EWR IAH 227 1400 5 15 2013-01-01 05:00:00
 ```

 ``` r
 # get last few rows
 tail(flights.df, 1)
-#> year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
-#> 1: 2013 9 30 NA 840 NA NA 1020
-#> arr_delay carrier flight tailnum origin dest air_time distance hour minute
-#> 1: NA MQ 3531 N839MQ LGA RDU NA 431 8 40
-#> time_hour
-#> 1: 2013-09-30 08:00:00
+#> year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum
+#> 1: 2013 9 30 NA 840 NA NA 1020 NA MQ 3531 N839MQ
+#> origin dest air_time distance hour minute time_hour
+#> 1: LGA RDU NA 431 8 40 2013-09-30 08:00:00
 ```

 ``` r
@@ -427,7 +413,7 @@ backer](https://opencollective.com/diskframe#backer)\]

 <a href="https://opencollective.com/diskframe#backers" target="_blank"><img src="https://opencollective.com/diskframe/backers.svg?width=890"></a>

-### Sponsors
+### Sponsor and back `{disk.frame}`

 Support `{disk.frame}` development by becoming a sponsor. Your logo will
 show up here with a link to your website. \[[Become a
@@ -442,6 +428,24 @@ or Julia?** I am available for Machine Learning/Data
 Science/R/Python/Julia consulting\! [Email
 me](mailto:dzj@analytixware.com)

+## Non-financial ways to contribute
+
+Do you wish to give back the open-source community in non-financial
+ways? Here are some ways you can contribute
+
+- Write a blogpost about your `{disk.frame}`. I would love to learn
+more about how `{disk.frame}` has helped you
+- Tweet or post on social media (e.g LinkedIn) about `{disk.frame}` to
+help promote it
+- Bring attention to typos and grammatical errors by correcting and
+making a PR. Or simply by [raising an issue
+here](https://github.com/xiaodaigh/disk.frame/issues)
+- Star the [`{disk.frame}` Github
+repo](https://github.com/xiaodaigh/disk.frame)
+- Star any repo that `{disk.frame}` depends on
+e.g. [`{fst}`](https://github.com/fstpackage/fst) and
+[`{future}`](https://github.com/HenrikBengtsson/future)
+
 ## Download Counts & Build Status

 [![](https://cranlogs.r-pkg.org/badges/disk.frame)](https://cran.r-project.org/package=disk.frame)

book/02-intro-disk-frame.Rmd

Lines changed: 2 additions & 2 deletions
@@ -226,7 +226,7 @@ The `by` variables that were used to shard the dataset are called the `shardkey`

 ## Group-by

-`{disk.frame}` implements the `group_by` operation some caveats. In the `{disk.frame}` framework, only a set functions are supported in `summarize`. However, the user can create more custom `group-by` functions can be defined. For more information see [group-by](10-group-by.Rmd)
+`{disk.frame}` implements the `group_by` operation some caveats. In the `{disk.frame}` framework, only a set functions are supported in `summarize`. However, the user can create more custom `group-by` functions can be defined.

 ```{r, dependson='asdiskframe'}
 flights.df %>%
@@ -290,7 +290,7 @@ flights.df %>%

 `{disk.frame}` supports all `data.frame` operations, unlike Spark which can only perform those operations that Spark has implemented. Hence windowing functions like `min_rank` and `rank` are supported out of the box.

-For the following example, we will use the `hard_group_by` which performs a group-by and also reorganises the chunks so that all records with the same `year`, `month`, and `day` end up in the same chunk. This is typically not adviced, as `hard_group_by` can be slow for large datasets.
+For the following example, we will use the `hard_group_by` which performs a group-by and also reorganises the chunks so that all records with the same `year`, `month`, and `day` end up in the same chunk. This is typically not advised, as `hard_group_by` can be slow for large datasets.

 ```{r, dependson='asdiskframe'}
 # Find the most and least delayed flight each day

book/03-concepts.Rmd

Lines changed: 2 additions & 2 deletions
@@ -60,9 +60,9 @@ future::nbrOfWorkers()

 ## How `{disk.frame}` works

-When `df %>% some_fn %>% collect` is callled. The `some_fn` is applied to each chunk of `df`. The collect will row-bind the results from `some_fn(chunk)`together if the returned value of `some_fn` is a data.frame, or it will return a `list` containing the results of `some_fn`.
+When `df %>% some_fn %>% collect` is called. The `some_fn` is applied to each chunk of `df`. The collect will row-bind the results from `some_fn(chunk)`together if the returned value of `some_fn` is a data.frame, or it will return a `list` containing the results of `some_fn`.

-The session that receives these results is called the **main session**. In general, we should try to minimise the amount of data passed from the worker sessions back to the main session, because passing data around can be slow.
+The session that receives these results is called the **main session**. In general, we should try to minimize the amount of data passed from the worker sessions back to the main session, because passing data around can be slow.

 Also, please note that there is no communication between the workers, except for workers passing data back to the main session.

book/08-more-epic.Rmd

Lines changed: 0 additions & 30 deletions
@@ -244,33 +244,3 @@ So there you go! {disk.frame} can be even more "epic"! Here are the two main tak
 1. Load CSV files as many individual files if possible to take advantage of multi-core parallelism
 2. `srckeep` is your friend! Disk IO is often the bottleneck in data manipulation, and you can reduce disk IO by specifying only columns that you will use with `srckeep(c(columns1, columns2, ...))`.

-## Advertisements
-
-### Interested in learning {disk.frame} in a structured course?
-
-Please register your interest at:
-
-https://leanpub.com/c/taminglarger-than-ramwithdiskframe
-
-### Open Collective
-
-If you like disk.frame and want to speed up its development or perhaps you have a feature request? Please consider sponsoring {disk.frame} on Open Collective. Your logo will show up here with a link to your website.
-
-#### Backers
-
-Thank you to all our backers! 🙏 [[Become a backer](https://opencollective.com/diskframe#backer)]
-
-<a href="https://opencollective.com/diskframe#backers" target="_blank"><img src="https://opencollective.com/diskframe/backers.svg?width=890"></a>
-
-[![Backers on Open Collective](https://opencollective.com/diskframe/backers/badge.svg)](#backers)
-
-#### Sponsors
-
-[[Become a sponsor](https://opencollective.com/diskframe#sponsor)]
-
-[![Sponsors on Open Collective](https://opencollective.com/diskframe/sponsors/badge.svg)](#sponsors)
-
-### Contact me for consulting
-
-**Do you need help with machine learning and data science in R, Python, or Julia?**
-I am available for Machine Learning/Data Science/R/Python/Julia consulting! [Email me](mailto:dzj@analytixware.com)
