--- title: "(Work)flow and alternate pipe operator, rationate" author: "Philippe Grosjean" date: "`r Sys.Date()`" output: rmarkdown::html_vignette: toc: true toc_depth: 3 fig_caption: yes vignette: > %\VignetteIndexEntry{(Work)flow and alternate pipe operator, rationate} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include=FALSE} knitr::opts_chunk$set(collapse = TRUE, comment = "#>") library(svFlow) ``` ## Pure, predictable, pipeable, ... and data-aware Hadley Wickham advocates for pure, predictable and pipeable functions in the tidyverse. Although non-standard evaluation (NSE) makes many of the functions in tidyverse not referentially transparent (which makes them more difficult in reusable contexts like a function) this also contributes to a cleaner language, at least for beginneRs. With {svFlow} we want both to make tidyverse-style NSE more easily reusable, and the data analysis workflow based on pipelines and the pipe operator ({magrittr}'s pipe operator in tidyverse) even more data-aware. ## Choice of the name The initial name was {flow}, because it is short. However, it is now used by an,other package on CRAN. Other names as {workflow} and {workplan} are also already used. So, {svFlow} for SciViews' flow. The {performanceEstimation} package has {Workflow} object and a `Workflow()` function. Also, the {zoon} package has a `workflow()` function, but it creates a {zoonWorkflow} object, so no clash here. In the {drake} package, now superseded by {targets}, there is also a `workflow()` function, but deprecated in favor of `workplan()`. {targets} is used to organize different analyses in **data.frame**s. Hence, as we see here, {workflow} or {workplan} names are already pretty much used in the R ecosystem. There is the {flowr} package which uses (internal) `flow()`, and `is.flow()` functions, and a **flow** S3 object. This is for complex, bioinformatics (work)flows, but of course, the source of potential problems when both {flowr} and {svFlow} packages are used simultaneously, if both objects bear the same class name. That is why in {svFlow}, objects are named **Flow** with an uppercase **F**, to avoid such a conflict. ## A simpler and more efficient pipe operator? The {wrapr} package provides an alternate pipe operator: `%.>%`, the "dot arrow pipe". It is very simple: > "a %.\>% b" is to be treated as if the user had written "{ . \<- a; b };" with "%.\>%" being treated as left-associative. There are three interesting points with this pipe operator: 1. It does not alter the expression evaluated, and the dot can be placed everywhere in the expression. It means that *any* expression is suitable and is "pipe-aware" with this operator. 2. It is *explicit*, that is, you don't have to guess what will happen, the location of the dot replacement(s) in the expression is explicitly indicated. It should makes it also easier to understand from a beginneR's point of view, providing he is not used to {magrittr} style, of course. 3. Since the expression is not reworked, it is very fast in comparison to the complex-rules that must be computed each time you call {magrittr}'s `%>%`. 4. On the contrary to the base R pipe operator `|>` introduced in R 4.1.0, it is not just a syntactic flavor that transforms the code into imbricated functions calls internally. The base R pipe operator has many advantages, but also many limitations that `%>.%` tries to eliminate. The only drawback with this pipe operator is that it is not pure, since it modifies the calling environment (it assigns `.` in it before evaluation of the right-hand side expression). However, if you never use `.` as a name for other objects, this is not much a problem. In {wrapr}, there is a synonym: `%>.%`, but that its author never uses in the examples, vignettes or on its blog. So, we decide to reuse `%>.%` as our pipe operator in {svFlow}. We add two things in it: 1. It is also aware of **Flow** objects (see here under) and behaves accordingly, 2. The expression to be evaluated is also recorded in the calling environment as `.call`. This way, it becomes easy to debug the last expression that failed during the pipeline execution (since `.` is also available, one can inspect it, or rerun `eval(.call)`, ... or use `debug_flow()` to get extra information): ```{r error=TRUE} library(svFlow) # An example pipeline with an error in the middle: library(dplyr) iris %>.% filter(., Sepal.Length < 5.1, Sepal.Width < 3.1) %>.% mutate(., logS = log(Species)) %>.% group_by(., Species) %>.% summarise(., mean_logS = mean(logS)) ``` ```{r error=TRUE} # Both . and .call are available and can be explored head(.) .call eval(.call) ``` ... or even more easily: ```{r error=TRUE} debug_flow() ``` From there, you can manipulate `.`, `.call`, or both, and rerun `debug_flow()` to fix the pipeline. ## Mixing Pipe() and proto(): the Flow object In {pipeR}, Kun Ren proposes several alternative pipe operators to the now traditional {magrittr}'s one (`%>%`). `Pipe()` is interesting since it encapsulates essentially the pipeline steps inside an object. The pipe operator is then simply replaced by `$`. It is striking to note the similitude of the `$` operator for **Pipe** and **proto** objects (from the {proto} package), although they are designed for different purposes in mind. The **proto** objects are class-less prototype-based objects that support simple inheritance. They are convenient to manipulate sets of objects in a common place, and internally, they use an environment to store these objects. **Pipe** objects also use internally an environment to store everything related to the pipeline operations. However, there is no mean to add custom objects, nor to define inheritance between **Pipe** objects. Satellite variables may be used in pipelines. They are currently placed in the calling environment (usually `.GlobalEnv`), and they "pollute" it. There is no mean to define "local" variables like, say in function, with the pipe. Yet, if we could combine **Pipe** behavior for pipeline operation, with **proto** objects to store locally various items and allow inheritance, this would be a wonderful way to drive analyses workflows. The **Flow** object just does that. **Flow** objects are indeed **proto** with a `.value` item that contains the result obtained from the last pipeline operation. The pipe operator `%>.%` is behaving differently when a **Flow** object (constructed using `flow()`) is passed to it: (1) `.` is taken from `flow_obj$.value`, and result updates it. Also, a `..` object is created in the calling environment that is the **Flow** object. That way, one can access items stored in the **Flow** object by `..$item` within pipeline expressions. This allows to embed pipeline temporary variables directly in the **Flow** object. The second pipe operator in the {.flow} package, `%>_%`, does the opposite to `%>.%`: it **constructs** a `Flow` object if it does not receives one, and returns a **Flow** object containing the results in `flow_obj$.value`. Finally, to get the value out of a **Flow** object, on can also end the pipeline by `%>_% .`, which extracts `flow_obj$.value` and returns it. Here is an example of use: ```{r} data(iris) fl <- iris %>_% # Create a Flow object filter(., Sepal.Length < 5.1, Sepal.Width < 3.1) %>_% mutate(., logSL = log(Sepal.Length)) # Interrupt the pipeline, and inspect or modify the flow object: fl ``` With the **Flow** object, you can continue the pipeline where you left it, because all the required variables are recorded inside it. ```{r} fl %>_% group_by(., Species) %>_% summarise(., mean_logSL = mean(logSL)) %>_% . # Get final result ``` With the `flow()` function, you can explicitly create the **Flow** object and easily add variables to it, including those you want to keep as **quosure**s (by ending their names with `_`): ```{r} fl <- flow(iris, var1_ = Sepal.Length, thresh1 = 5.1) str(fl) ``` Note that a **quosure** is recorded as `var`, not `var_`! Indeed, everything works as if the trailing underscore was an unary suffixed operator applied to `var`, which converts it into a **quosure**. You could use `var` in the pipeline expression to manipulate the quosure directly, but you would most probably use `var_` which will also treat `var` as a tidyeval expression and will unquote it transparently in non-standard expressions. Here is the same pipeline as above, but with all the possible variables stored either as **quosure**, or as usual R objects inside the **Flow** object: ```{r} fl <- flow(iris, var1_ = Sepal.Length, var2_ = Sepal.Width, var_group_ = Species, var1_log_ = logSL, var1_mean_ = mean_logSL, thresh1 = 5.1, thresh2 = 3.1) %>_% filter(., var1_ < thresh1_, var2_ < thresh2_) %>_% {..$temp_data <- mutate(., var1_log_ = log(var1_))} %>_% group_by(., var_group_) %>_% summarise(., var1_mean_ = mean(var1_log_)) str(fl) fl$temp_data # The temporary variable fl %>_% . # The final results ``` Notice that, even standard variables, like `thresh1` or `thresh2` must be called `thresh1_` and `thresh2_` to look for them **inside** the **Flow** object. Otherwise, they will be looked for in the calling environment as usual. Also, the `Flow` object can be accessed and manipulated directly through `..` if you need to.