---
title: "Performance of svTidy Functions"
author: "Philippe Grosjean"
date: "`r Sys.Date()`"
output:
  rmarkdown::html_vignette:
    toc: true
    toc_depth: 3
    fig_caption: yes
vignette: >
  %\VignetteIndexEntry{Performance of svTidy Functions}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(collapse = TRUE, comment = "#>")
library(dplyr)
library(tidyr)
library(data.table)
library(collapse)
library(svTidy)
```

> One goal of the {svTidy} package is to provide an interface similar to {dplyr} and {tidyr} on top of performant code (both in speed and in memory use), possibly using the {data.table} or {collapse} packages under the hood. In this document, we compare the speed and memory use of {svTidy} with those of {dplyr} and {tidyr}.

Keep in mind that most benchmarks (including the present ones) are artificial and do not necessarily reflect real use cases. Test with your own data. Also ask yourself whether you really need faster or more memory-efficient code: it depends on your particular context! Take your hardware into account too (number of CPU cores and amount of RAM).

We test {svTidy} functions in both standard evaluation (SE) and non-standard evaluation (NSE) modes: the latter usually requires more computing time, but it is more convenient to write and read, and closer to {dplyr}/{tidyr} syntax.

## Functions that are considered optimized

Preparation: using 3/4 of available cores for parallel code in {data.table} and {collapse}.

```{r}
data.table::setDTthreads(percent = 75)
(.nthreads <- data.table::getDTthreads())
options(collapse_nthreads = .nthreads)
options(collapse_na.rm = FALSE)
options(collapse_mask = "all")
```

Small and medium data sets.

```{r}
# Small one
data(mtcars)
mtcars <- as_tibble(mtcars, rownames = "model")
mtcars_dt <- as.data.table(mtcars)
# Medium one
data(babynames, package = 'babynames')
babynames <- as_tibble(babynames)
babynames_dt <- as.data.table(babynames)
```

Here are a couple of examples of fast and memory-efficient {svTidy} functions.

### `filter_()`

Small data set.

```{r}
# Note: collapse::qDF() = quickly convert to a data.frame, for identical results
bench::mark(
  dplyr      = filter(mtcars, mpg > 20) |> qDF(),
  data.table = mtcars_dt[mpg > 20] |> qDF(),
  svTidyNSE  = filter_(mtcars, ~mpg > 20) |> qDF(),
  svTidySE   = filter_(mtcars, mtcars$mpg > 20) |> qDF())
```

Medium data set.

```{r}
bench::mark(
  dplyr      = filter(babynames, n > 1000) |> qDF(),
  data.table = babynames_dt[n > 1000] |> qDF(),
  svTidyNSE  = filter_(babynames, ~n > 1000) |> qDF(),
  svTidySE   = filter_(babynames, babynames$n > 1000) |> qDF())
```

### `arrange_()`

Small data set.

```{r}
bench::mark(
  dplyr      = arrange(mtcars, cyl, desc(vs)) |> qDF(),
  data.table = mtcars_dt[order(cyl, -vs)] |> qDF(),
  svTidyNSE  = arrange_(mtcars, ~cyl, ~ -vs) |> qDF(),
  svTidySE   = arrange_(mtcars, 'cyl', '-vs') |> qDF())
```

Medium data set.

```{r}
bench::mark(
  dplyr      = arrange(babynames, sex, desc(n)) |> qDF(),
  data.table = babynames_dt[order(sex, -n)] |> qDF(),
  svTidyNSE  = arrange_(babynames, ~sex, ~ -n) |> qDF(),
  svTidySE   = arrange_(babynames, 'sex', '-n') |> qDF())
```

## Functions that could still be optimized

Not all {svTidy} functions are currently faster or more memory-efficient than their {dplyr} or {tidyr} counterparts. Those that are not still need refactoring. Here are some examples.

### `bind_rows_()`

```{r}
df1 <- tibble(x = 1:2, y = letters[1:2])
df1_dt <- as.data.table(df1)
bench::mark(
  dplyr      = bind_rows(df1, df1) |> qDF(),
  base       = rbind(df1, df1) |> qDF(),
  data.table = rbindlist(list(df1_dt, df1_dt)) |> qDF(),
  svTidy     = bind_rows_(df1, df1) |> qDF(),
  svTidy2    = bind_rows_(list(df1, df1)) |> qDF())
```

```{r}
bench::mark(
  dplyr      = bind_rows(babynames, babynames) |> qDF(),
  base       = rbind(babynames, babynames) |> qDF(),
  data.table = rbindlist(list(babynames_dt, babynames_dt)) |> qDF(),
  svTidy     = bind_rows_(babynames, babynames) |> qDF(),
  svTidy2    = bind_rows_(list(babynames, babynames)) |> qDF())
```

### `bind_cols_()`

```{r}
df1 <- tibble(x = 1:2, y = letters[1:2])
df2 <- tibble(z = 10:11, w = factor(5:6))
df1_dt <- as.data.table(df1)
df2_dt <- as.data.table(df2)
# check = FALSE: do not require all expressions to return equal results
bench::mark(check = FALSE,
  dplyr      = bind_cols(df1, df2) |> qDF(),
  base       = cbind(df1, df2) |> qDF(),
  data.table = cbind(df1_dt, df2_dt) |> qDF(),
  svTidy     = bind_cols_(df1, df2) |> qDF())
```
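
## Testing with your own data

The introduction recommends testing with your own data. The chunk below is a minimal sketch of how that could look, not a definitive recipe: `my_data` is a hypothetical stand-in that you would replace with your own data frame, and the two base R expressions are only placeholders for the {dplyr}, {data.table} or {svTidy} calls you actually want to compare. It uses only `bench::mark()` and its `summary()` method with `relative = TRUE`, which reports each measure relative to the best value in its column.

```r
library(bench)

# Hypothetical stand-in for your own data (replace with a real data frame)
set.seed(42)
my_data <- data.frame(
  g = sample(letters, 1e5, replace = TRUE),
  x = runif(1e5)
)

# Placeholder expressions: substitute the calls you want to compare.
# By default, bench::mark() checks that all expressions return equal results.
res <- bench::mark(
  bracket = my_data[my_data$x > 0.5, ],
  subset  = subset(my_data, x > 0.5),
  min_iterations = 5
)

# Timings and memory allocations relative to the best value in each column
summary(res, relative = TRUE)
```

The `median` and `mem_alloc` columns are usually the most informative when comparing speed and memory use across expressions.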