---
title: "Performance of svTidy Functions"
author: "Philippe Grosjean"
date: "`r Sys.Date()`"
output:
  rmarkdown::html_vignette:
    toc: true
    toc_depth: 3
    fig_caption: yes
vignette: >
  %\VignetteIndexEntry{Performance of svTidy Functions}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(collapse = TRUE, comment = "#>")
library(dplyr)
library(tidyr)
library(data.table)
library(collapse)
library(svTidy)
```

> One goal of the {svTidy} package is to provide an interface similar to {dplyr} and {tidyr} on top of performant code (both in speed and memory use), possibly using the {data.table} or {collapse} packages under the hood.

In this document, we compare the speed and memory use of {svTidy} with those of {dplyr} and {tidyr}. Keep in mind that most benchmarks (including the present ones) are artificial and do not necessarily reflect real use cases, so test with your own data. Also ask yourself whether you really need faster or more memory-efficient code: it depends on your particular context. Finally, take your hardware into account (number of CPU cores and amount of RAM).
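
As a starting point for testing with your own data, a minimal sketch could look like the following. The `my_data` data frame and both benchmarked expressions are hypothetical placeholders, not part of {svTidy}; substitute your own data and the operations you actually care about.

```{r}
# Hypothetical stand-in for your own data set (an assumption for illustration)
set.seed(42)
my_data <- data.frame(g = sample(letters, 1e5, replace = TRUE), v = runif(1e5))

# Number of logical CPU cores detected on this machine
parallel::detectCores()

# Two equivalent base R ways to keep the rows where v > 0.5
res_own <- bench::mark(
  logical_index = my_data[my_data$v > 0.5, ],
  integer_index = my_data[which(my_data$v > 0.5), ])

# median = typical run time, mem_alloc = memory allocated by the expression
res_own[c("expression", "median", "mem_alloc")]
```

By default, `bench::mark()` also checks that all expressions return equal results, which guards against comparing code that does different things.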

We test {svTidy} functions in both standard evaluation (SE) and non-standard evaluation (NSE) modes: the latter usually requires more computing time, but it is more convenient to write and read, and closer to the {dplyr}/{tidyr} syntax.
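
The difference between the two modes can be illustrated in base R terms. This is a simplified sketch, not the actual {svTidy} implementation: with a formula, the expression is captured unevaluated and has to be evaluated later against the data, while in the SE form the caller has already computed the logical vector.

```{r}
# NSE-like form: the formula stores the unevaluated expression `mpg > 20`
f <- ~mpg > 20
expr <- f[[2]]  # right-hand side of the one-sided formula

# The expression is evaluated later, using the data frame columns as variables
keep_nse <- eval(expr, datasets::mtcars, environment(f))

# SE form: the logical vector is computed directly by the caller
keep_se <- datasets::mtcars$mpg > 20

# Both routes select the same rows
identical(keep_nse, keep_se)
```

The extra capture and evaluation step is one plausible source of the additional time observed for the NSE variants in the benchmarks of this document.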

## Functions that are considered optimized

Preparation: we use 3/4 of the available cores for parallel code in {data.table} and {collapse}.

```{r}
data.table::setDTthreads(percent = 75)
(.nthreads <- data.table::getDTthreads())
options(collapse_nthreads = .nthreads)
options(collapse_na.rm = FALSE)
options(collapse_mask = "all")
```

Small and medium data sets.

```{r}
# Small one
data(mtcars)
mtcars <- as_tibble(mtcars, rownames = "model")
mtcars_dt <- as.data.table(mtcars)

# Medium one
data(babynames, package = 'babynames')
babynames <- as_tibble(babynames)
babynames_dt <- as.data.table(babynames)
```


Here are a couple of examples of fast and memory-efficient {svTidy} functions.

### `filter_()`

Small data set.

```{r}
# Note: collapse::qDF() quickly converts to a data.frame, so that all results are identical
bench::mark(
  dplyr      = filter(mtcars, mpg > 20) |> qDF(),
  data.table = mtcars_dt[mpg > 20] |> qDF(), 
  svTidyNSE  = filter_(mtcars, ~mpg > 20) |> qDF(),
  svTidySE   = filter_(mtcars, mtcars$mpg > 20) |> qDF())
```

Medium data set.

```{r}
bench::mark(
  dplyr      = filter(babynames, n > 1000) |> qDF(),
  data.table = babynames_dt[n > 1000] |> qDF(),
  svTidyNSE  = filter_(babynames, ~n > 1000) |> qDF(),
  svTidySE   = filter_(babynames, babynames$n > 1000) |> qDF())
```

### `arrange_()`

Small data set.

```{r}
bench::mark(
  dplyr      = arrange(mtcars, cyl, desc(vs)) |> qDF(),
  data.table = mtcars_dt[order(cyl, -vs)] |> qDF(),
  svTidyNSE  = arrange_(mtcars, ~cyl, ~ -vs) |> qDF(),
  svTidySE   = arrange_(mtcars, 'cyl', '-vs') |> qDF())
```

Medium data set.

```{r}
bench::mark(
  dplyr      = arrange(babynames, sex, desc(n)) |> qDF(),
  data.table = babynames_dt[order(sex, -n)] |> qDF(),
  svTidyNSE  = arrange_(babynames, ~sex, ~ -n) |> qDF(),
  svTidySE   = arrange_(babynames, 'sex', '-n') |> qDF())
```

## Functions that could still be optimized

Not all {svTidy} functions are currently faster or more memory-efficient than their {dplyr} or {tidyr} counterparts. These functions still need refactoring. Here are some examples.
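
One way to spot such cases is to express timings and memory use relative to the best expression, using `summary()` with `relative = TRUE` on a {bench} result. The sketch below only uses base R expressions on a hypothetical `small_df` as placeholders; the same call could be applied to any benchmark result.

```{r}
# Hypothetical small data frame, used only for illustration
small_df <- data.frame(x = 1:2, y = letters[1:2])

res_rel <- bench::mark(
  rbind   = rbind(small_df, small_df),
  do_call = do.call(rbind, list(small_df, small_df)))

# relative = TRUE rescales each column so that the best value equals 1
rel <- summary(res_rel, relative = TRUE)
rel[c("expression", "median", "mem_alloc")]
```

A value clearly above 1 for an expression indicates that it is slower, or allocates more memory, than the best alternative in the same benchmark.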

### `bind_rows_()`

Small data set.

```{r}
df1 <- tibble(x = 1:2, y = letters[1:2])
df1_dt <- as.data.table(df1)

bench::mark(
  dplyr      = bind_rows(df1, df1) |> qDF(),
  base       = rbind(df1, df1) |> qDF(),
  data.table = rbindlist(list(df1_dt, df1_dt)) |> qDF(),
  svTidy     = bind_rows_(df1, df1) |> qDF(),
  svTidy2    = bind_rows_(list(df1, df1)) |> qDF())
```

Medium data set.

```{r}
bench::mark(
  dplyr      = bind_rows(babynames, babynames) |> qDF(),
  base       = rbind(babynames, babynames) |> qDF(),
  data.table = rbindlist(list(babynames_dt, babynames_dt)) |> qDF(),
  svTidy     = bind_rows_(babynames, babynames) |> qDF(),
  svTidy2    = bind_rows_(list(babynames, babynames)) |> qDF())
```

### `bind_cols_()`

```{r}
df1 <- tibble(x = 1:2, y = letters[1:2])
df2 <- tibble(z = 10:11, w = factor(5:6))
df1_dt <- as.data.table(df1)
df2_dt <- as.data.table(df2)

# check = FALSE: skip the verification that all expressions return equal results
bench::mark(check = FALSE,
  dplyr      = bind_cols(df1, df2) |> qDF(),
  base       = cbind(df1, df2) |> qDF(),
  data.table = cbind(df1_dt, df2_dt) |> qDF(),
  svTidy     = bind_cols_(df1, df2) |> qDF())
```