Hadley Wickham advocates for pure, predictable and pipeable functions in the tidyverse. Although non-standard evaluation (NSE) makes many tidyverse functions not referentially transparent (which makes them harder to reuse, for instance inside your own functions), it also contributes to a cleaner language, at least for beginneRs. With {svFlow} we want both to make tidyverse-style NSE more easily reusable, and to make the data analysis workflow based on pipelines and the pipe operator ({magrittr}'s pipe in the tidyverse) even more data-aware.
The initial name was {flow}, because it is short. However, that name is now used by another package on CRAN. Other names such as {workflow} and {workplan} are also already taken. So, {svFlow} it is, for SciViews' flow.
The {performanceEstimation} package has a Workflow object and a Workflow() function. Also, the {zoon} package has a workflow() function, but it creates a {zoonWorkflow} object, so there is no clash there. In the {drake} package, now superseded by {targets}, there was also a workflow() function, but it was deprecated in favor of workplan(), which is used to organize different analyses in data frames. Hence, as we can see, the {workflow} and {workplan} names are already widely used in the R ecosystem.
There is also the {flowr} package, which uses (internal) flow() and is.flow() functions, and a flow S3 object. It is designed for complex bioinformatics (work)flows, but it would, of course, be a source of potential problems when the {flowr} and {svFlow} packages are used simultaneously, if both objects bore the same class name. That is why in {svFlow}, objects are named Flow with an uppercase F, to avoid such a conflict.
The {wrapr} package provides an alternate pipe operator: %.>%, the "dot arrow pipe". It is very simple: "a %.>% b" is to be treated as if the user had written "{ . <- a; b }", with "%.>%" being treated as left-associative.
There are several interesting points with this pipe operator. Unlike %>%, or |> introduced in R 4.1.0, it is not just a syntactic flavor that transforms the code into nested function calls internally. The base R pipe operator has many advantages, but also many limitations that %.>% tries to eliminate. The only drawback with this pipe operator is that it is not pure, since it modifies the calling environment (it assigns . in it before evaluating the right-hand side expression). However, if you never use . as a name for other objects, this is not much of a problem. In {wrapr}, there is a synonym, %>.%, that its author never uses in the examples, vignettes or on his blog. So, we decided to reuse %>.% as our pipe operator in {svFlow}.
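To make these semantics concrete, here is a minimal base R sketch of such a dot pipe. The operator name %..>% is made up for the illustration; this is neither {wrapr}'s nor {svFlow}'s actual implementation.

```r
# Minimal sketch of the dot pipe semantics: "a %..>% b" behaves like
# "{ . <- a; b }". Note the impurity: `.` is assigned in the calling
# environment before the right-hand side is evaluated.
`%..>%` <- function(lhs, rhs) {
  env <- parent.frame()
  assign(".", lhs, envir = env)        # side effect: `.` in the caller
  eval(substitute(rhs), envir = env)   # evaluate RHS with `.` available
}

1:10 %..>% sum(.) %..>% sqrt(.)
#> [1] 7.416198

exists(".")  # `.` is left behind in the calling environment
#> [1] TRUE
```

Since %..>% is left-associative, each step assigns . and evaluates the next expression, which is exactly the { . <- a; b } rewriting described above.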
We add two things to it:

- It is also aware of Flow objects (see below) and behaves accordingly.
- The expression to be evaluated is also recorded in the calling environment as .call. This way, it becomes easy to debug the last expression that failed during the pipeline execution (since . is also available, one can inspect it, rerun eval(.call), … or use debug_flow() to get extra information):
library(svFlow)
# An example pipeline with an error in the middle:
library(dplyr)
iris %>.%
filter(., Sepal.Length < 5.1, Sepal.Width < 3.1) %>.%
mutate(., logS = log(Species)) %>.%
group_by(., Species) %>.%
summarise(., mean_logS = mean(logS))
#> Error in `mutate()`:
#> ℹ In argument: `logS = log(Species)`.
#> Caused by error in `Math.factor()`:
#> ! 'log' not meaningful for factors
# Both . and .call are available and can be explored
head(.)
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1 4.9 3.0 1.4 0.2 setosa
#> 2 4.4 2.9 1.4 0.2 setosa
#> 3 4.8 3.0 1.4 0.1 setosa
#> 4 4.3 3.0 1.1 0.1 setosa
#> 5 5.0 3.0 1.6 0.2 setosa
#> 6 4.4 3.0 1.3 0.2 setosa
.call
#> mutate(., logS = log(Species))
eval(.call)
#> Error in `mutate()`:
#> ℹ In argument: `logS = log(Species)`.
#> Caused by error in `Math.factor()`:
#> ! 'log' not meaningful for factors
… or even more easily:
debug_flow()
#> Last expression run in the pipeline:
#> mutate(., logS = log(Species))
#>
#> with . being:
#> 'data.frame': 12 obs. of 5 variables:
#> $ Sepal.Length: num 4.9 4.4 4.8 4.3 5 4.4 4.5 4.8 4.9 5 ...
#> $ Sepal.Width : num 3 2.9 3 3 3 3 2.3 3 2.4 2 ...
#> $ Petal.Length: num 1.4 1.4 1.4 1.1 1.6 1.3 1.3 1.4 3.3 3.5 ...
#> $ Petal.Width : num 0.2 0.2 0.1 0.1 0.2 0.2 0.3 0.3 1 1 ...
#> $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 2 2 ...
#>
#> producing:
#> Error in `mutate()`:
#> ℹ In argument: `logS = log(Species)`.
#> Caused by error in `Math.factor()`:
#> ! 'log' not meaningful for factors
From there, you can manipulate ., .call, or both, and rerun debug_flow() to fix the pipeline.
In {pipeR}, Kun Ren proposes several alternative pipe operators to the now traditional {magrittr} one (%>%). Pipe() is interesting since it essentially encapsulates the pipeline steps inside an object. The pipe operator is then simply replaced by $. It is striking to note the similarity of the $ operator for Pipe and proto objects (from the {proto} package), although they are designed with different purposes in mind. The proto objects are class-less, prototype-based objects that support simple inheritance. They are convenient for manipulating sets of objects in a common place and, internally, they use an environment to store these objects. Pipe objects also use an environment internally to store everything related to the pipeline operations. However, there is no way to add custom objects, nor to define inheritance between Pipe objects. Satellite variables may be used in pipelines. They are currently placed in the calling environment (usually .GlobalEnv), and they "pollute" it. There is no way to define "local" variables with the pipe, like, say, in a function. Yet, if we could combine the Pipe behavior for pipeline operations with proto objects, to store various items locally and allow inheritance, this would be a wonderful way to drive analysis workflows. The Flow object does just that.
Flow objects are indeed proto objects with a .value item that contains the result obtained from the last pipeline operation. The pipe operator %>.% behaves differently when a Flow object (constructed using flow()) is passed to it: . is taken from flow_obj$.value, and the result updates it. Also, a .. object is created in the calling environment: it is the Flow object itself. That way, one can access items stored in the Flow object with ..$item within pipeline expressions. This allows embedding pipeline temporary variables directly in the Flow object.
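As a rough sketch of this mechanism (the names flow_sketch and %f>% are made up; this is not svFlow's implementation, which builds on proto), a Flow-like object can be modeled as a classed environment holding .value, with a pipe that detects it and exposes it as .. :

```r
# Sketch of a Flow-like object: an environment with a `.value` item.
# The hypothetical pipe `%f>%` detects it, takes `.` from `$.value`,
# exposes the object itself as `..`, and stores the result back.
flow_sketch <- function(value, ...) {
  obj <- new.env()
  obj$.value <- value
  items <- list(...)
  for (nm in names(items)) assign(nm, items[[nm]], envir = obj)
  class(obj) <- "FlowSketch"
  obj
}

`%f>%` <- function(lhs, rhs) {
  env <- parent.frame()
  if (inherits(lhs, "FlowSketch")) {
    assign("..", lhs, envir = env)        # the Flow-like object itself
    assign(".", lhs$.value, envir = env)  # last pipeline result
    lhs$.value <- eval(substitute(rhs), envir = env)
    lhs                                   # keep passing the object along
  } else {
    assign(".", lhs, envir = env)
    eval(substitute(rhs), envir = env)
  }
}

fl <- flow_sketch(1:5, offset = 100)
fl <- fl %f>% (. + ..$offset) %f>% sum(.)
fl$.value
#> [1] 515
```

Because the object travels unchanged through the pipe while only .value is updated, satellite variables like offset stay available at every step through .. instead of polluting the calling environment.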
The second pipe operator in the {svFlow} package, %>_%, does the opposite of %>.%: it constructs a Flow object if it does not receive one, and it returns a Flow object containing the results in flow_obj$.value. Finally, to get the value out of a Flow object, one can also end the pipeline with %>_% ., which extracts flow_obj$.value and returns it. Here is an example of use:
data(iris)
fl <- iris %>_% # Create a Flow object
filter(., Sepal.Length < 5.1, Sepal.Width < 3.1) %>_%
mutate(., logSL = log(Sepal.Length))
# Interrupt the pipeline, and inspect or modify the flow object:
fl
#> <Flow object with $.value being>
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species logSL
#> 1 4.9 3.0 1.4 0.2 setosa 1.589235
#> 2 4.4 2.9 1.4 0.2 setosa 1.481605
#> 3 4.8 3.0 1.4 0.1 setosa 1.568616
#> 4 4.3 3.0 1.1 0.1 setosa 1.458615
#> 5 5.0 3.0 1.6 0.2 setosa 1.609438
#> 6 4.4 3.0 1.3 0.2 setosa 1.481605
#> 7 4.5 2.3 1.3 0.3 setosa 1.504077
#> 8 4.8 3.0 1.4 0.3 setosa 1.568616
#> 9 4.9 2.4 3.3 1.0 versicolor 1.589235
#> 10 5.0 2.0 3.5 1.0 versicolor 1.609438
#> 11 5.0 2.3 3.3 1.0 versicolor 1.609438
#> 12 4.9 2.5 4.5 1.7 virginica 1.589235
With the Flow object, you can continue the pipeline where you left off, because all the required variables are recorded inside it.
fl %>_%
group_by(., Species) %>_%
summarise(., mean_logSL = mean(logSL)) %>_% . # Get final result
#> # A tibble: 3 × 2
#> Species mean_logSL
#> <fct> <dbl>
#> 1 setosa 1.53
#> 2 versicolor 1.60
#> 3 virginica 1.59
With the flow() function, you can explicitly create the Flow object and easily add variables to it, including those you want to keep as quosures (by ending their names with _):
fl <- flow(iris, var1_ = Sepal.Length, thresh1 = 5.1)
str(fl)
#> Flow object
#> $ var1 : language ~Sepal.Length
#> ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
#> $ thresh1: num 5.1
#> parent: Flow root
Note that a quosure is recorded as var, not var_! Indeed, everything works as if the trailing underscore were a unary suffixed operator applied to var, which converts it into a quosure.
You could use var in the pipeline expression to manipulate the quosure directly, but you would most probably use var_, which treats var as a tidyeval expression and unquotes it transparently in non-standard expressions. Here is the same pipeline as above, but with all the possible variables stored, either as quosures or as usual R objects, inside the Flow object:
fl <- flow(iris,
var1_ = Sepal.Length,
var2_ = Sepal.Width,
var_group_ = Species,
var1_log_ = logSL,
var1_mean_ = mean_logSL,
thresh1 = 5.1,
thresh2 = 3.1) %>_%
filter(., var1_ < thresh1_, var2_ < thresh2_) %>_%
{..$temp_data <- mutate(., var1_log_ = log(var1_))} %>_%
group_by(., var_group_) %>_%
summarise(., var1_mean_ = mean(var1_log_))
str(fl)
#> Flow object
#> $ var1_log : language ~logSL
#> ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
#> $ var1 : language ~Sepal.Length
#> ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
#> $ var2 : language ~Sepal.Width
#> ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
#> $ var1_mean: language ~mean_logSL
#> ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
#> $ thresh1 : num 5.1
#> $ thresh2 : num 3.1
#> $ var_group: language ~Species
#> ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
#> $ temp_data:'data.frame': 12 obs. of 6 variables:
#> parent: Flow root
fl$temp_data # The temporary variable
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species logSL
#> 1 4.9 3.0 1.4 0.2 setosa 1.589235
#> 2 4.4 2.9 1.4 0.2 setosa 1.481605
#> 3 4.8 3.0 1.4 0.1 setosa 1.568616
#> 4 4.3 3.0 1.1 0.1 setosa 1.458615
#> 5 5.0 3.0 1.6 0.2 setosa 1.609438
#> 6 4.4 3.0 1.3 0.2 setosa 1.481605
#> 7 4.5 2.3 1.3 0.3 setosa 1.504077
#> 8 4.8 3.0 1.4 0.3 setosa 1.568616
#> 9 4.9 2.4 3.3 1.0 versicolor 1.589235
#> 10 5.0 2.0 3.5 1.0 versicolor 1.609438
#> 11 5.0 2.3 3.3 1.0 versicolor 1.609438
#> 12 4.9 2.5 4.5 1.7 virginica 1.589235
fl %>_% . # The final results
#> # A tibble: 3 × 2
#> Species mean_logSL
#> <fct> <dbl>
#> 1 setosa 1.53
#> 2 versicolor 1.60
#> 3 virginica 1.59
Notice that even standard variables, like thresh1 or thresh2, must be called thresh1_ and thresh2_ to be looked up inside the Flow object. Otherwise, they will be looked up in the calling environment as usual. Also, the Flow object can be accessed and manipulated directly through .. if you need to.