Data Analysis Work Flow and Pipeline Operator for ‘SciViews::R’

This document still needs substantial editing! It is left here because it may still be useful in its current state.

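Here is a simple pipeline built with the %>.% operator, where the dot (.) explicitly marks where the piped data goes in each step: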

library(dplyr)
library(svFlow)
threshold <- 1.5
iris %>.%
  filter(., Petal.Length > threshold) %>.%
  mutate(., log_var = log(Petal.Length)) %>.%
  head(.) %>.% .
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species   log_var
#> 1          5.4         3.9          1.7         0.4  setosa 0.5306283
#> 2          4.8         3.4          1.6         0.2  setosa 0.4700036
#> 3          5.7         3.8          1.7         0.3  setosa 0.5306283
#> 4          5.4         3.4          1.7         0.2  setosa 0.5306283
#> 5          5.1         3.3          1.7         0.5  setosa 0.5306283
#> 6          4.8         3.4          1.9         0.2  setosa 0.6418539
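
For comparison, here is a minimal sketch of the same computation written as nested dplyr calls, without any pipe operator. It only illustrates what the chain above computes step by step; the name res is introduced here for illustration, and the sketch does not use svFlow.

```r
library(dplyr)

threshold <- 1.5

# Same three steps as the pipeline, read from the inside out:
# 1. keep rows with Petal.Length above the threshold,
# 2. add the log-transformed column,
# 3. keep the first six rows.
res <- head(
  mutate(
    filter(iris, Petal.Length > threshold),
    log_var = log(Petal.Length)
  )
)
res
```

The nested form must be read from the innermost call outwards, whereas the pipeline lists the steps in the order they are applied.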

Use flow() to add local variables inside the pipeline and to get convenient, transparent resolution of the tidyeval mechanism:

flow(iris, var_ = Petal.Length, threshold = 1.5) %>_%
  filter(., var_ > threshold_) %>_%
  {..$tab <- mutate(., log_var = log(var_))} %>_%
  head(.) %>_% .
#> Warning: Assigning non-quosure objects to quosure lists is deprecated as of rlang 0.3.0.
#> Please coerce to a bare list beforehand with `as.list()`
#> This warning is displayed once every 8 hours.
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species   log_var
#> 1          5.4         3.9          1.7         0.4  setosa 0.5306283
#> 2          4.8         3.4          1.6         0.2  setosa 0.4700036
#> 3          5.7         3.8          1.7         0.3  setosa 0.5306283
#> 4          5.4         3.4          1.7         0.2  setosa 0.5306283
#> 5          5.1         3.3          1.7         0.5  setosa 0.5306283
#> 6          4.8         3.4          1.9         0.2  setosa 0.6418539

Convert this into a reusable function by replacing flow() with function() and starting the pipeline with enflow():

my_process <- function(data, var_ = Petal.Length, threshold = 1.5)
  enflow(data) %>_%
  filter(., var_ > threshold_) %>_%
  {..$tab <- mutate(., log_var = log(var_))} %>_%
  head(.) %>_% .

You then use it just like a plain function. The trailing _ on argument names is also a good way to immediately spot the arguments that are treated specially by the tidyeval mechanism! Here, we redo the analysis:

my_process(iris)
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species   log_var
#> 1          5.4         3.9          1.7         0.4  setosa 0.5306283
#> 2          4.8         3.4          1.6         0.2  setosa 0.4700036
#> 3          5.7         3.8          1.7         0.3  setosa 0.5306283
#> 4          5.4         3.4          1.7         0.2  setosa 0.5306283
#> 5          5.1         3.3          1.7         0.5  setosa 0.5306283
#> 6          4.8         3.4          1.9         0.2  setosa 0.6418539

Here, we change the variable and the threshold:

my_process(iris, var_ = Sepal.Width, threshold = 0.5)
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species  log_var
#> 1          5.1         3.5          1.4         0.2  setosa 1.252763
#> 2          4.9         3.0          1.4         0.2  setosa 1.098612
#> 3          4.7         3.2          1.3         0.2  setosa 1.163151
#> 4          4.6         3.1          1.5         0.2  setosa 1.131402
#> 5          5.0         3.6          1.4         0.2  setosa 1.280934
#> 6          5.4         3.9          1.7         0.4  setosa 1.360977
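
For readers more used to plain dplyr, the following is a rough sketch of a comparable function written with the {{ }} (embrace) operator from rlang. It assumes that the special treatment of var_ amounts to capturing the column name unevaluated and injecting it into the dplyr verbs. The name my_process_plain is hypothetical and not part of svFlow, and the column is passed explicitly rather than through a default value.

```r
library(dplyr)

# Hypothetical plain-dplyr counterpart of my_process(); not part of svFlow.
# `var` is passed as a bare column name and forwarded with {{ }}.
my_process_plain <- function(data, var, threshold = 1.5) {
  data %>%
    filter({{ var }} > threshold) %>%
    mutate(log_var = log({{ var }})) %>%
    head()
}

res <- my_process_plain(iris, var = Sepal.Width, threshold = 0.5)
res
```

In this version nothing in the signature distinguishes var from threshold, which is the point made above about the trailing _ convention.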

Flow objects to test alternate scenarios

Flow objects can be subclassed. This is somewhat similar to branches in a git repository (although with automatic updates from the master branch). It could be worthwhile to keep this comparison as close as possible and to make both approaches conceptually similar, so that one can work with flow() much as one works with git. We would need tools to create, delete, switch to, merge (into master only?), rebase, and diff branches.

The biggest difference is that git branches do not dynamically inherit objects from their parents, whereas proto/Flow objects do (in this case, the main branch is indeed the ancestor). So, we could create a function branch() that does something like this:

# Create a new branch called lda
#..$branch("lda", model = mlLda(Species ~ ., data = .))
# Switching to a branch or referring to a branch: use one of those two syntaxes
#..@lda(model = mlLda(Species ~ ., data = .))
#..(lda)(model = mlLda(Species ~ ., data = .))
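
The dynamic inheritance that separates Flow objects from git branches can be illustrated with plain R environments, which are what proto objects are built on. This is only a sketch in base R: the names main and lda are hypothetical stand-ins, and no svFlow function is used.

```r
# Plain environments standing in for Flow objects (hypothetical names).
main <- new.env()
lda  <- new.env(parent = main)   # a "branch" whose ancestor is main

# An object defined in the ancestor is visible from the branch.
main$threshold <- 1.5
from_parent <- get("threshold", envir = lda)

# A later change in the ancestor is seen by the branch immediately.
main$threshold <- 2
after_update <- get("threshold", envir = lda)

# Assigning in the branch creates a local override; the ancestor is untouched.
lda$threshold <- 0.5
overridden <- get("threshold", envir = lda)
unchanged  <- main$threshold
```

A git branch would instead keep its own snapshot until an explicit merge or rebase, which is why those operations would need a different meaning for Flow objects.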

Visual map of Flow objects inheritance

TODO: an older version of the proto package had a nice graph.proto() function, and I made my own version with other dependencies. Reimplement it for {svFlow} in order to show the workflow in a way similar to how a git repository is often depicted.