Clarification of the "control flow" page in the "DataOps" User Guide section #2010
emassoulie wants to merge 4 commits into skrub-data:main from
| Running complex operations on DataOps variables: deferred evaluation | ||
| ==================================================================== | ||
| Why DataOps cannot handle complex operations |
They can. Maybe something like "Why some operations need to be inside functions" or similar.
| This remains true even if we have provided a value for ``orders`` and we can | ||
| see a result for that value: |
I think this still needs to be stated somewhere because it can be confusing: "What do you mean it will be computed later? I see it right here in the repr."
| transformation that we apply must not modify its input, but leave it unchanged | ||
| and return a new value. | ||
| Consider the transformers in a scikit-learn pipeline: each computes a new |
From oral discussions, I remember that this comparison can help readers understand why each node must return a new value instead of modifying its input.
Thanks for the PR, @emassoulie. I left a few comments with some suggested changes.
| columns: it is a skrub DataOp that will produce a list of columns, later, | ||
| over the columns. This is the way any computation on any variable is usually run, | ||
| referred to here as *eager* evaluation. However, ``orders.columns`` is not an actual | ||
| list of columns: it is a skrub DataOp that will produce a list of columns, later, |
I wonder if this section could benefit from a small example, something like
>>> import pandas as pd
>>> import skrub
>>> df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
>>> df
>>> a = skrub.var("df", df)
>>> cols = a.columns
>>> cols.skb.eval({"df": df.drop(columns="b")})
to show that the result of the evaluation depends on what's inside the variable.
| :func:`deferred` function. But we should make a (shallow) copy of the inputs and | ||
| return a new value. | ||
| Finally, there are other situations where using :func:`deferred` can be helpful: |
I think the "Finally..." paragraph should be moved to the end of the previous section, possibly with the examples.
| >>> csv_path = skrub.var("csv_path") | ||
| >>> data = skrub.deferred(pd.read_csv)(csv_path) | ||
| Unpacking multiple outputs from deferred functions |
I wonder whether, after editing the rest of this section, the part about unpacking should be put into a dropdown as additional explanation, something like "Note about unpacking".
| applying ``deferred`` and calling the function as shown above we can use | ||
| .. warning:: | ||
| DataOps are evaluated *lazily* (we are building a pipeline, not immediately |
I am not sure I understand what this section is saying
Proposed reordering and renaming of the "Control flow" page, to better indicate the use of the page: