---
title: "Data Analysis Work Flow and Pipeline Operator for 'SciViews::R'"
author: "Philippe Grosjean"
date: "`r Sys.Date()`"
output:
  rmarkdown::html_vignette:
    toc: true
    toc_depth: 3
    fig_caption: yes
vignette: >
  %\VignetteIndexEntry{Data Analysis Work Flow and Pipeline Operator for 'SciViews::R'}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(collapse = TRUE, comment = "#>")
library(dplyr)
library(svFlow)
```

**This document still needs substantial editing! It is left here because it may still be useful in its current state.**

Here is a pipeline:

```{r}
library(dplyr)
library(svFlow)
threshold <- 1.5
iris %>.%
  # Keep only the rows with a petal length above the threshold
  filter(., Petal.Length > threshold) %>.%
  # Add the log-transformed petal length
  mutate(., log_var = log(Petal.Length)) %>.%
  head(.) %>.% .
```

Use `flow()` to add local variables inside the pipeline and to get convenient, transparent resolution of the tidyeval mechanism:

```{r}
# `var_` and `threshold` are local variables of the pipeline
flow(iris, var_ = Petal.Length, threshold = 1.5) %>_%
  filter(., var_ > threshold_) %>_%
  {..$tab <- mutate(., log_var = log(var_))} %>_%
  head(.) %>_% .
```

Convert this into a reusable function by replacing `flow()` with `function()` and starting the pipeline with `enflow()`:

```{r}
# Same pipeline as above, now with the local variables as function arguments
my_process <- function(data, var_ = Petal.Length, threshold = 1.5)
  enflow(data) %>_%
  filter(., var_ > threshold_) %>_%
  {..$tab <- mutate(., log_var = log(var_))} %>_%
  head(.) %>_% .
```

You then use it like any plain function. The trailing `_` in argument names also makes it easy to spot at a glance those that are treated specially by the tidyeval mechanism! Here, we redo the analysis:

```{r}
my_process(iris)
```

Here, we change the variable and the threshold:

```{r}
my_process(iris, var_ = Sepal.Width, threshold = 0.5)
```

## Flow objects to test alternate scenarios

The **Flow** objects can be subclassed. This is somewhat similar to branches in git repositories (although with automatic updates from the master branch).
It would be nice to keep this comparison as close as possible and to make both approaches conceptually similar, so that one can work in the same way with `flow()` and with git. We would need tools to create, delete, switch to, merge (into master only?), rebase and diff branches. The biggest difference is that git branches do not dynamically inherit objects from their parents, whereas **proto**/**Flow** objects do (in this case, the **main** branch is indeed the ancestor). So, we could create a function `branch()` that does something like this:

```{r}
# Create a new branch called lda
#..$branch("lda", model = mlLda(Species ~ ., data = .))
# Switch to a branch or refer to it: use one of these two syntaxes
#..@lda(model = mlLda(Species ~ ., data = .))
#..(lda)(model = mlLda(Species ~ ., data = .))
```

## Visual map of Flow objects inheritance

**TODO:** an older version of the **proto** package had a nice `graph.proto()` function, and I made my own with other dependencies. Reimplement it for {svFlow} to show the workflow in much the same way a git repository is often depicted.
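
As a rough illustration of the dynamic inheritance that such a map would depict, here is a minimal sketch in base R. It uses plain environments as stand-ins for **Flow** objects, so it is not how {svFlow} is implemented. The names `main`, `lda`, `flows` and `lineage()` are hypothetical and exist only for this example.

```{r}
# Illustration only: base R environments stand in for Flow objects.
# None of the names below (main, lda, flows, lineage) belong to svFlow.
main <- new.env()
assign("threshold", 1.5, envir = main)

# A "branch" is a child environment whose parent is main
lda <- new.env(parent = main)
assign("model", "placeholder for a fitted model", envir = lda)

# The branch sees the objects of its ancestor without copying them...
get("threshold", envir = lda)

# ...and it follows later changes made in the ancestor
assign("threshold", 2, envir = main)
get("threshold", envir = lda)

# Objects created in the branch are not visible from the ancestor
exists("model", envir = main, inherits = FALSE)

# Walk up the parent chain and report the known ancestors as a text "map"
flows <- list(main = main, lda = lda)
lineage <- function(env, known) {
  path <- character(0)
  repeat {
    hit <- vapply(known, identical, logical(1), env)
    if (!any(hit)) break  # we left the set of labelled environments
    path <- c(path, names(known)[hit])
    env <- parent.env(env)
  }
  paste(rev(path), collapse = " -> ")
}
lineage(lda, flows)
```

The second call to `get()` returns the updated value, which is the "automatic update" that a git branch does not provide. A graphical version would draw the same parent links as a tree instead of pasting them into a string.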