The results should be exactly the same as if applying the same group-by operations on a data.frame. If not, please [report a bug](https://github.com/xiaodaigh/disk.frame/issues).
#### List of supported group-by functions
If a function you like is missing, please make a feature request [here](https://github.com/xiaodaigh/disk.frame/issues). One limitation is that functions that depend on the ordering of a column can only be obtained using estimated methods.
| Function | Exact/Estimate | Notes |
| -- | -- | -- |
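As a sketch of how a supported function is used inside `summarize` (this assumes the `flights.df` disk.frame used in the book's examples; the choice of `mean` here is illustrative only, not a claim about the full table):

```r
library(disk.frame)
library(dplyr)

# flights.df is assumed to be an existing disk.frame,
# e.g. flights.df <- as.disk.frame(nycflights13::flights)
flights.df %>%
  group_by(month) %>%
  summarize(mean_dep_delay = mean(dep_delay, na.rm = TRUE)) %>%
  collect
```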
Thank you to all our backers! [[Become a backer](https://opencollective.com/diskframe#backer)]
Support `{disk.frame}` development by becoming a sponsor. Your logo will show up here with a link to your website. [[Become a sponsor](https://opencollective.com/diskframe#sponsor)]
**Do you need help with machine learning and data science in R, Python, or Julia?**
I am available for Machine Learning/Data Science/R/Python/Julia consulting! [Email me](mailto:dzj@analytixware.com)
## Non-financial ways to contribute
Do you wish to give back to the open-source community in non-financial ways? Here are some ways you can contribute:
* Write a blog post about your use of `{disk.frame}`. I would love to learn more about how `{disk.frame}` has helped you
* Tweet or post on social media (e.g. LinkedIn) about `{disk.frame}` to help promote it
* Bring attention to typos and grammatical errors by correcting them and making a PR, or simply by [raising an issue here](https://github.com/xiaodaigh/disk.frame/issues)
* Star the [`{disk.frame}` GitHub repo](https://github.com/xiaodaigh/disk.frame)
* Star any repo that `{disk.frame}` depends on e.g. [`{fst}`](https://github.com/fstpackage/fst) and [`{future}`](https://github.com/HenrikBengtsson/future)
---

`book/02-intro-disk-frame.Rmd`
The `by` variables that were used to shard the dataset are called the `shardkey`.
## Group-by
`{disk.frame}` implements the `group_by` operation with some caveats. In the `{disk.frame}` framework, only a set of functions is supported in `summarize`. However, the user can define additional custom `group-by` functions.
```{r, dependson='asdiskframe'}
flights.df %>%
  group_by(carrier) %>%    # the chunk body was truncated in this excerpt;
  summarize(n = n()) %>%   # this is a minimal group-by completion
  collect
```
`{disk.frame}` supports all `data.frame` operations, unlike Spark which can only perform those operations that Spark has implemented. Hence windowing functions like `min_rank` and `rank` are supported out of the box.
For the following example, we will use `hard_group_by`, which performs a group-by and also reorganises the chunks so that all records with the same `year`, `month`, and `day` end up in the same chunk. This is typically not advised, as `hard_group_by` can be slow for large datasets.
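A minimal sketch of such a `hard_group_by` call (this assumes the `flights.df` disk.frame from earlier in the chapter; the summary column is illustrative):

```r
# hard_group_by physically reshuffles the chunks so that each
# (year, month, day) combination lives in a single chunk
flights.df %>%
  hard_group_by(year, month, day) %>%
  summarize(mean_arr_delay = mean(arr_delay, na.rm = TRUE)) %>%
  collect
```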
---

`book/03-concepts.Rmd`
## How `{disk.frame}` works
When `df %>% some_fn %>% collect` is called, `some_fn` is applied to each chunk of `df`. `collect` will row-bind the results of `some_fn(chunk)` together if the returned value of `some_fn` is a data.frame, or it will return a `list` containing the results of `some_fn` otherwise.
The session that receives these results is called the **main session**. In general, we should try to minimize the amount of data passed from the worker sessions back to the main session, because passing data around can be slow.
Also, please note that there is no communication between the workers, except for workers passing data back to the main session.
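A small sketch of this chunk-apply model (a hedged example: `cmap`, which applies a function to each chunk in the worker sessions, is assumed to be available as in recent CRAN releases of `{disk.frame}`):

```r
library(disk.frame)
setup_disk.frame(workers = 2)  # start two background worker sessions

df <- as.disk.frame(mtcars, nchunks = 2)

# The per-chunk function returns a data.frame, so collect
# row-binds the chunk results back together in the main session
df %>%
  cmap(~ .x[.x$mpg > 25, ]) %>%
  collect
```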
---

`book/08-more-epic.Rmd`
So there you go! {disk.frame} can be even more "epic"! Here are the two main takeaways:
1. Load CSVs as many individual files, if possible, to take advantage of multi-core parallelism
2. `srckeep` is your friend! Disk IO is often the bottleneck in data manipulation, and you can reduce disk IO by specifying only the columns you will use with `srckeep(c(columns1, columns2, ...))`.
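For instance, a minimal sketch (the column names assume the flights data used earlier in the chapter):

```r
# Only month and dep_delay are read from disk; all other columns are skipped
flights.df %>%
  srckeep(c("month", "dep_delay")) %>%
  group_by(month) %>%
  summarize(mean_dep_delay = mean(dep_delay, na.rm = TRUE)) %>%
  collect
```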