The {data.io} package provides several example datasets in a standardized way, as well as a read() function to retrieve them or to import external datasets in different formats in a unified way. A cache mechanism is implemented for datasets that are read from a URL. A “sidecar” R script can also be used to preformat or preprocess the data. A write() function eases the export of R objects in various formats.
Many datasets are spread across various R packages, but there is no clear convention for naming them or their variables, or for the units to use (some use metric units, others the imperial unit system). Here, we propose a set of datasets, partly converted from other packages and partly new, that respect the following conventions:
English for variable names,
snake_case names, both for the datasets and their variables,
Uppercase for factor levels (but less strict),
data frames are converted according to user preferences indicated in options(SciViews.as_dtx = ...). The default is as_dtt, which converts into a data.table. Other options are as_dtf to convert into base R data.frame objects, or as_dtbl to convert into {tibble}’s tbl_df objects.
variables have a label attribute with a more meaningful (short) description of the variable, and a units attribute, if applicable.
the origin of the data is recorded as a src attribute of the comment if this is an R package dataset, or as a srcfile attribute of the comment if it is read from a file.
For instance, the iris dataset in the {datasets} package uses variable names like Petal.Length that do not follow the rules listed above. When this dataset is retrieved with data.io::read(), these names are “corrected”. Labels and units are also added automatically.
library(data.io)
# Instead of data(iris), we use:
iris <- read("iris", package = "datasets")
head(iris)
#> sepal_length sepal_width petal_length petal_width species
#> <num> <num> <num> <num> <fctr>
#> 1: 5.1 3.5 1.4 0.2 setosa
#> 2: 4.9 3.0 1.4 0.2 setosa
#> 3: 4.7 3.2 1.3 0.2 setosa
#> 4: 4.6 3.1 1.5 0.2 setosa
#> 5: 5.0 3.6 1.4 0.2 setosa
#> 6: 5.4 3.9 1.7 0.4 setosa
With str(), one can see the labels and units added for each variable:
str(iris)
#> Classes 'data.table' and 'data.frame': 150 obs. of 5 variables:
#> $ sepal_length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
#> ..- attr(*, "label")= chr "Length of the sepals"
#> ..- attr(*, "units")= chr "cm"
#> $ sepal_width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
#> ..- attr(*, "label")= chr "Width of the sepals"
#> ..- attr(*, "units")= chr "cm"
#> $ petal_length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
#> ..- attr(*, "label")= chr "Length of the petals"
#> ..- attr(*, "units")= chr "cm"
#> $ petal_width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
#> ..- attr(*, "label")= chr "Width of the petals"
#> ..- attr(*, "units")= chr "cm"
#> $ species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
#> ..- attr(*, "label")= chr "Iris species"
#> - attr(*, "comment")= chr [1:2] "The 'iris' from 'datasets', but with variables names in snake_case" "(Sepal.Length -> sepal_length, Species -> species)."
#> ..- attr(*, "lang")= chr "en"
#> ..- attr(*, "lang_encoding")= chr "UTF-8"
#> ..- attr(*, "src")= chr "datasets::iris"
#> - attr(*, ".internal.selfref")=<externalptr>
The comment gives some general information about the dataset.
comment(iris)
#> [1] "The 'iris' from 'datasets', but with variables names in snake_case"
#> [2] "(Sepal.Length -> sepal_length, Species -> species)."
#> attr(,"lang")
#> [1] "en"
#> attr(,"lang_encoding")
#> [1] "UTF-8"
#> attr(,"src")
#> [1] "datasets::iris"
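Because labels and units are stored as ordinary R attributes, they can be read back with base attr(). The following is a minimal sketch that does not call {data.io} at all: it attaches the attributes by hand to a copy of datasets::iris and builds an axis label from them. The helper name axis_label() and the "label [units]" layout are illustrative assumptions, not part of the package.

```r
# Hand-made stand-in for a dataset that follows the conventions above
# (base R only; read() is not called here)
flowers <- datasets::iris
names(flowers) <- tolower(gsub(".", "_", names(flowers), fixed = TRUE))
attr(flowers$sepal_length, "label") <- "Length of the sepals"
attr(flowers$sepal_length, "units") <- "cm"
attr(flowers$species, "label") <- "Iris species"

# Hypothetical helper: combine label and units when both are present
axis_label <- function(x) {
  label <- attr(x, "label")
  units <- attr(x, "units")
  if (is.null(units)) label else sprintf("%s [%s]", label, units)
}

axis_label(flowers$sepal_length)
axis_label(flowers$species)
```

The same attr() calls should work on an object returned by read(), since str() shows the label and units attributes attached to each column.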
French is supported too. With lang = "fr", labels and comments are in French:
iris <- read("iris", package = "datasets", lang = "fr")
str(iris)
#> Classes 'data.table' and 'data.frame': 150 obs. of 5 variables:
#> $ sepal_length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
#> ..- attr(*, "label")= chr "Longueur des sépales"
#> ..- attr(*, "units")= chr "cm"
#> $ sepal_width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
#> ..- attr(*, "label")= chr "Largeur des sépales"
#> ..- attr(*, "units")= chr "cm"
#> $ petal_length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
#> ..- attr(*, "label")= chr "Longueur des pétales"
#> ..- attr(*, "units")= chr "cm"
#> $ petal_width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
#> ..- attr(*, "label")= chr "Largeur des pétales"
#> ..- attr(*, "units")= chr "cm"
#> $ species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
#> ..- attr(*, "label")= chr "Espèces d'Iris"
#> - attr(*, "comment")= chr [1:2] "Jeu de données 'iris' de 'datasets', mais avec noms de variables modifiées" "(Sepal.Length -> sepal_length, Species -> species)."
#> ..- attr(*, "lang")= chr "fr"
#> ..- attr(*, "lang_encoding")= chr "UTF-8"
#> ..- attr(*, "src")= chr "datasets::iris"
#> - attr(*, ".internal.selfref")=<externalptr>
All datasets from R packages can be loaded with read("<dataset_name>", package = "<package_name>"), but only a small subset of these datasets have labels and units automatically set. They are listed in the man page ?Datasets.
Another feature is the conversion of quantitative variables into the SI unit system when they are expressed in the imperial system in use in the US. Here is an example with the trees dataset from the {datasets} package, whose lengths are in inches or feet and whose volume is in cubic feet. When this dataset is loaded with read(), the units are converted to meters and cubic meters (also, Girth is replaced by diameter since it is really the diameter of the tree that is reported).
trees <- read("trees", package = "datasets")
head(trees)
#> diameter height volume
#> <num> <num> <num>
#> 1: 0.211 21.3 0.292
#> 2: 0.218 19.8 0.292
#> 3: 0.224 19.2 0.289
#> 4: 0.267 21.9 0.464
#> 5: 0.272 24.7 0.532
#> 6: 0.274 25.3 0.558
str(trees)
#> Classes 'data.table' and 'data.frame': 31 obs. of 3 variables:
#> $ diameter: num 0.211 0.218 0.224 0.267 0.272 0.274 0.279 0.279 0.282 0.284 ...
#> ..- attr(*, "label")= chr "Diameter at 1.4m"
#> ..- attr(*, "units")= chr "m"
#> $ height : num 21.3 19.8 19.2 21.9 24.7 25.3 20.1 22.9 24.4 22.9 ...
#> ..- attr(*, "label")= chr "Height"
#> ..- attr(*, "units")= chr "m"
#> $ volume : num 0.292 0.292 0.289 0.464 0.532 0.558 0.442 0.515 0.64 0.563 ...
#> ..- attr(*, "label")= chr "Volume of timber"
#> ..- attr(*, "units")= chr "m^3"
#> - attr(*, "comment")= chr [1:3] "The 'trees' from 'datasets' but with variables renamed and in m or m^3" "(Girth [in] -> diameter [m], Height [ft] -> height [m]," "Volume [ft^3] -> volume [m^3])."
#> ..- attr(*, "lang")= chr "en"
#> ..- attr(*, "lang_encoding")= chr "UTF-8"
#> ..- attr(*, "src")= chr "datasets::trees"
#> - attr(*, ".internal.selfref")=<externalptr>
You get the same result using lang = "fr". If you want the original data, you can still use data(), of course. Here it is, for comparison:
data(trees)
head(trees)
#> Girth Height Volume
#> 1 8.3 70 10.3
#> 2 8.6 65 10.3
#> 3 8.8 63 10.2
#> 4 10.5 72 16.4
#> 5 10.7 81 18.8
#> 6 10.8 83 19.7
str(trees)
#> 'data.frame': 31 obs. of 3 variables:
#> $ Girth : num 8.3 8.6 8.8 10.5 10.7 10.8 11 11 11.1 11.2 ...
#> $ Height: num 70 65 63 72 81 83 66 75 80 75 ...
#> $ Volume: num 10.3 10.3 10.2 16.4 18.8 19.7 15.6 18.2 22.6 19.9 ...
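The conversion can be checked by hand from the original data. This is a minimal base R sketch, not the code used by {data.io}: the conversion factors are the standard definitions (1 in = 0.0254 m, 1 ft = 0.3048 m), while the rounding digits are an assumption inferred from the values printed above.

```r
# Conversion factors (exact by definition of the international inch and foot)
in_to_m   <- 0.0254
ft_to_m   <- 0.3048
ft3_to_m3 <- ft_to_m^3

orig <- datasets::trees
converted <- data.frame(
  diameter = round(orig$Girth  * in_to_m,   3),  # inches     -> m
  height   = round(orig$Height * ft_to_m,   1),  # feet       -> m
  volume   = round(orig$Volume * ft3_to_m3, 3)   # cubic feet -> m^3
)
head(converted)
```

The first rows obtained this way should match the diameter, height and volume columns returned by read("trees", package = "datasets") above.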
If you use read() without arguments, a list of all datasets from installed R packages is opened in RStudio or in the web browser. If you just specify package = "<package_name>", only datasets in that package are listed.
The read() and write() functions implement a type = argument to specify the format. The format specification is optional for read() if the file extension is explicit enough. However, it is mandatory for write().
An alternate and more compact syntax is advised: one can “subset” the read() or write() function with the type. For instance, to write df in a CSV file “data/df.csv”, one can use write(df, "data/df.csv", type = "csv"), but one can also use write$csv(df, "data/df.csv"). The latter form is more compact and easier to read.
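To illustrate how a function can be “subsetted” with $ in R, here is a minimal sketch based on an S3 method for the $ operator. It is only one possible way to obtain this syntax and is not claimed to be how {data.io} implements it; toy_export() and the class name toy_subsettable are hypothetical names invented for this example.

```r
# A toy function with a 'type' argument, standing in for read()/write()
toy_export <- function(x, type) {
  switch(type,
    csv = paste(x, collapse = ","),
    tsv = paste(x, collapse = "\t"),
    stop("unsupported type: ", type)
  )
}

# Give the function a class so that `$` can dispatch on it
class(toy_export) <- c("toy_subsettable", "function")

# f$name returns a new function that calls f(..., type = "name")
`$.toy_subsettable` <- function(x, name) {
  function(...) x(..., type = name)
}

toy_export(1:3, type = "csv")  # explicit type argument
toy_export$csv(1:3)            # equivalent compact form
```

Both calls return the same value, which mirrors the equivalence between write(df, "data/df.csv", type = "csv") and write$csv(df, "data/df.csv").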
The {data.io} package contains an “extdata” folder with a series of example datasets in different formats. The data_example() function can be used to get the path to these files. For instance, data_example("iris.csv.gz") returns the path to the “iris.csv.gz” file. You can then import this compressed CSV file with read():
read$csv.gz(data_example("iris.csv.gz")) # Type optional (explicit extension)
#> Rows: 150 Columns: 5
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr (1): Species
#> dbl (4): Sepal.Length, Sepal.Width, Petal.Length, Petal.Width
#>
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> <num> <num> <num> <num> <char>
#> 1: 5.1 3.5 1.4 0.2 setosa
#> 2: 4.9 3.0 1.4 0.2 setosa
#> 3: 4.7 3.2 1.3 0.2 setosa
#> 4: 4.6 3.1 1.5 0.2 setosa
#> 5: 5.0 3.6 1.4 0.2 setosa
#> ---
#> 146: 6.7 3.0 5.2 2.3 virginica
#> 147: 6.3 2.5 5.0 1.9 virginica
#> 148: 6.5 3.0 5.2 2.0 virginica
#> 149: 6.2 3.4 5.4 2.3 virginica
#> 150: 5.9 3.0 5.1 1.8 virginica
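The type is optional in the call above because the file name carries an explicit extension. As a rough illustration of how a type could be deduced from a file name, including a compression suffix, here is a base R sketch. The function guess_type() and its list of compression extensions are hypothetical and do not reproduce the actual logic of read().

```r
# Hypothetical helper: deduce a type such as "csv" or "csv.gz" from a file name
guess_type <- function(file) {
  compressions <- c("gz", "bz2", "xz", "zip")  # assumed list, for illustration
  ext <- tolower(tools::file_ext(file))
  if (ext %in% compressions) {
    # Look at the extension that precedes the compression suffix
    inner <- tolower(tools::file_ext(tools::file_path_sans_ext(file)))
    if (nzchar(inner)) paste(inner, ext, sep = ".") else ext
  } else {
    ext
  }
}

guess_type("iris.csv.gz")
guess_type("data/df.csv")
guess_type("http://tinyurl.com/Biostat-Ble")  # no extension: empty string
```

When the helper returns an empty string, as for the last URL, nothing can be deduced from the name, which is the situation where the type has to be given explicitly.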
To add labels and units to variables in a data.frame, you can use the labelise() function. Here is an example with some synthetic data:
df <- data.frame(
  age  = 1:10,
  size = 3 + 0.5 * (1:10) + rnorm(10),
  sex  = sample(c("M", "F"), 10, replace = TRUE)
)
# Add labels and units
df <- labelise(df,
  label = list(age = "Age", size = "Body size", sex = "Sex"),
  units = list(age = "years", size = "cm"))
str(df)
#> 'data.frame': 10 obs. of 3 variables:
#> $ age : int 1 2 3 4 5 6 7 8 9 10
#> ..- attr(*, "label")= chr "Age"
#> ..- attr(*, "units")= chr "years"
#> $ size: num 4.24 2.22 4.14 4.87 6.85 ...
#> ..- attr(*, "label")= chr "Body size"
#> ..- attr(*, "units")= chr "cm"
#> $ sex : chr "F" "F" "M" "M" ...
#> ..- attr(*, "label")= chr "Sex"
You do not have to provide labels or units for all the variables (here, there are no units for sex). For more general metadata, you can add them with the base comment(df) <- "Some metadata..." instruction.
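As a short base R illustration of comment(), the sketch below attaches a free-text description to a small data frame and reads it back. The data frame growth and its content are made up for the example.

```r
# Made-up data frame, for illustration only
growth <- data.frame(age = 1:3, size = c(3.6, 4.1, 4.4))

# Attach general metadata to the whole object
comment(growth) <- "Synthetic growth data, for illustration only"

# Read it back; the comment is not displayed when the object is printed
comment(growth)
print(growth)
```

The comment is stored as a regular attribute named "comment", so attr(growth, "comment") returns the same string.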
Most file formats (except those that save R objects natively) lack features to fully express the structure of the data or metadata such as labels and units. The ubiquitous CSV format is a good example. It is not possible to indicate in the CSV file whether a character string column should be treated as character or as factor, for instance. Also, Date or POSIXt fields are imported as character. Consequently, the dataset must be postprocessed in R to make those corrections.
With data.io::read(), there is another mechanism available, using sidecar R scripts. Such a script is in the same folder as the dataset and bears the same name with the .R extension appended to the name of the dataset. In the “extdata” folder of {data.io}, there is an example with a dataset named “iris_sidecar.csv”, and its complement, “iris_sidecar.csv.R”.
(iris_sidecar_csv_file <- data_example("iris_sidecar.csv"))
#> [1] "/tmp/RtmpVVrxss/Rinst158932e0f593/data.io/extdata/iris_sidecar.csv"
data_example("iris_sidecar.csv.R")
#> [1] "/tmp/RtmpVVrxss/Rinst158932e0f593/data.io/extdata/iris_sidecar.csv.R"
The sidecar file contains code that is executed after the data is imported. It can transform or rename variables, add labels and units, calculate derived variables, handle codes for missing data, etc. The sidecar file is used by default. You have to indicate the argument sidecar_file = FALSE in read() to not use it. Here the “iris_sidecar.csv” file is imported first without using the sidecar file, and then with it:
# Without sidecar file
(iris_no_sc <- read$csv(iris_sidecar_csv_file, sidecar_file = FALSE))
#> Rows: 150 Columns: 5
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr (1): Species
#> dbl (4): Sepal.Length, Sepal.Width, Petal.Length, Petal.Width
#>
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> <num> <num> <num> <num> <char>
#> 1: 5.1 3.5 1.4 0.2 setosa
#> 2: 4.9 3.0 1.4 0.2 setosa
#> 3: 4.7 3.2 1.3 0.2 setosa
#> 4: 4.6 3.1 1.5 0.2 setosa
#> 5: 5.0 3.6 1.4 0.2 setosa
#> ---
#> 146: 6.7 3.0 5.2 2.3 virginica
#> 147: 6.3 2.5 5.0 1.9 virginica
#> 148: 6.5 3.0 5.2 2.0 virginica
#> 149: 6.2 3.4 5.4 2.3 virginica
#> 150: 5.9 3.0 5.1 1.8 virginica
str(iris_no_sc)
#> Classes 'data.table' and 'data.frame': 150 obs. of 5 variables:
#> $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
#> $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
#> $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
#> $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
#> $ Species : chr "setosa" "setosa" "setosa" "setosa" ...
#> - attr(*, "spec")=
#> .. cols(
#> .. Sepal.Length = col_double(),
#> .. Sepal.Width = col_double(),
#> .. Petal.Length = col_double(),
#> .. Petal.Width = col_double(),
#> .. Species = col_character()
#> .. )
#> - attr(*, "problems")=<externalptr>
#> - attr(*, "comment")= chr ""
#> ..- attr(*, "lang")= chr "en"
#> ..- attr(*, "lang_encoding")= chr "UTF-8"
#> ..- attr(*, "srcfile")= chr "/tmp/RtmpVVrxss/Rinst158932e0f593/data.io/extdata/iris_sidecar.csv"
#> - attr(*, ".internal.selfref")=<externalptr>
# With sidecar file (sidecar_file = TRUE is the default)
(iris_sc <- read$csv(iris_sidecar_csv_file))
#> Rows: 150 Columns: 5
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr (1): Species
#> dbl (4): Sepal.Length, Sepal.Width, Petal.Length, Petal.Width
#>
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> Warning in cmt[] <- c(cmt, comments): number of items to replace is not a
#> multiple of replacement length
#> sepal_length sepal_width petal_length petal_width species
#> <num> <num> <num> <num> <fctr>
#> 1: 5.1 3.5 1.4 0.2 setosa
#> 2: 4.9 3.0 1.4 0.2 setosa
#> 3: 4.7 3.2 1.3 0.2 setosa
#> 4: 4.6 3.1 1.5 0.2 setosa
#> 5: 5.0 3.6 1.4 0.2 setosa
#> ---
#> 146: 6.7 3.0 5.2 2.3 virginica
#> 147: 6.3 2.5 5.0 1.9 virginica
#> 148: 6.5 3.0 5.2 2.0 virginica
#> 149: 6.2 3.4 5.4 2.3 virginica
#> 150: 5.9 3.0 5.1 1.8 virginica
str(iris_sc)
#> Classes 'data.table' and 'data.frame': 150 obs. of 5 variables:
#> $ sepal_length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
#> ..- attr(*, "label")= chr "Length of the sepals"
#> ..- attr(*, "units")= chr "cm"
#> $ sepal_width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
#> ..- attr(*, "label")= chr "Width of the sepals"
#> ..- attr(*, "units")= chr "cm"
#> $ petal_length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
#> ..- attr(*, "label")= chr "Length of the petals"
#> ..- attr(*, "units")= chr "cm"
#> $ petal_width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
#> ..- attr(*, "label")= chr "Width of the petals"
#> ..- attr(*, "units")= chr "cm"
#> $ species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
#> ..- attr(*, "label")= chr "Iris species"
#> - attr(*, "spec")=
#> .. cols(
#> .. Sepal.Length = col_double(),
#> .. Sepal.Width = col_double(),
#> .. Petal.Length = col_double(),
#> .. Petal.Width = col_double(),
#> .. Species = col_character()
#> .. )
#> - attr(*, "problems")=<externalptr>
#> - attr(*, "comment")= chr ""
#> ..- attr(*, "lang")= chr "en"
#> ..- attr(*, "lang_encoding")= chr "UTF-8"
#> ..- attr(*, "srcfile")= chr "/tmp/RtmpVVrxss/Rinst158932e0f593/data.io/extdata/iris_sidecar.csv.R"
#> - attr(*, ".internal.selfref")=<externalptr>
The sidecar script renamed the variables in iris_sc. Note that the species variable of iris_sc is converted into a factor, while Species of iris_no_sc is still a character variable. Note also that labels and units are added for each variable of iris_sc.
The sidecar file is convenient for quick preprocessing of your datasets. That way, you do not have to resave your data in a different format that keeps the metadata and types of your variables.
The example sidecar file is rather complex and it deals with several languages through the lang = argument of read(). Usually, your own sidecar file would be much shorter, just dealing with a couple of adjustments in the dataset.
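To make the kind of adjustment discussed here concrete, the following base R sketch round-trips iris through a CSV file, shows that the factor is lost, and then repairs the data in a small post-processing function. It only illustrates the sort of steps a short sidecar script might contain; the function postprocess() is hypothetical, and the exact conventions a sidecar script must follow for {data.io} are not shown here.

```r
# Write iris to a temporary CSV file and read it back with base R
csv_file <- tempfile(fileext = ".csv")
write.csv(datasets::iris, csv_file, row.names = FALSE)
raw <- read.csv(csv_file, stringsAsFactors = FALSE)
class(raw$Species)  # Species comes back as a character vector

# Hypothetical post-processing step: rename, restore the factor, add metadata
postprocess <- function(dat) {
  names(dat) <- tolower(gsub(".", "_", names(dat), fixed = TRUE))
  dat$species <- factor(dat$species)
  attr(dat$sepal_length, "label") <- "Length of the sepals"
  attr(dat$sepal_length, "units") <- "cm"
  dat
}

fixed <- postprocess(raw)
str(fixed)
```

Keeping such code next to the data file means the corrections are applied every time the file is imported, instead of being repeated in each analysis script.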
The read() function can also import data from a URL for all supported file formats (note that the code that reads from a URL is not executed in the vignette to avoid problems when checking the package, but you can run the code yourself).
If the URL does not end with an explicit extension, you have to specify the file format as the type (here read$csv(....) because the dataset is in CSV format).
Reading data from an external URL is convenient, especially for big datasets that you do not want to include, say, in a git repository. However, it can be slow to retrieve those big datasets from the internet each time. The read() function implements a cache mechanism that you activate by indicating, in the cache_file = argument, the file in which you want to store a cached copy of your dataset. Here is an example:
# Here, we use the temporary directory for the example
# but you should use a permanent directory in your project
ble_cache_file <- file.path(tempdir(), "ble.csv")
(ble <- read$csv("http://tinyurl.com/Biostat-Ble",
  cache_file = ble_cache_file))
Now, there is a copy of the dataset in CSV format in ble_cache_file.
If your project is managed with git, you would most probably indicate the folder that contains the cached copies of your large datasets in .gitignore. That way, you can use large, or even huge datasets in your git repositories without versioning these large files. They are downloaded from the internet only once. Every time you read the ble dataset again, it is imported from the local cache file.
In case you have to refresh the cached version from the URL, just erase the cache file and read again, or use force = TRUE.
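The caching logic described here can be sketched in a few lines of base R. This is only an illustration of the idea (download when the cache file is missing or when a refresh is forced, then always read the local copy), not the implementation used by read(). The function cached_read_csv() is hypothetical, and a local file:// URL on a Unix-like system stands in for a remote dataset so that the example runs without internet access.

```r
# Hypothetical helper: read a CSV from a URL through a local cache file
cached_read_csv <- function(url, cache_file, force = FALSE) {
  if (force || !file.exists(cache_file)) {
    download.file(url, cache_file, quiet = TRUE)
  }
  read.csv(cache_file)
}

# A local file plays the role of the remote dataset (assumption for the demo)
src <- tempfile(fileext = ".csv")
write.csv(data.frame(x = 1:3), src, row.names = FALSE)
src_url <- paste0("file://", normalizePath(src, winslash = "/"))
cache <- tempfile(fileext = ".csv")

first <- cached_read_csv(src_url, cache)                # downloads, then reads
write.csv(data.frame(x = 4:6), src, row.names = FALSE)  # the "remote" data change
second <- cached_read_csv(src_url, cache)               # still the cached copy
third <- cached_read_csv(src_url, cache, force = TRUE)  # cache is refreshed
```

Here second is identical to first because the cache file already exists, while third reflects the updated source, which is the behavior expected from force = TRUE.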
The list of file formats that read() and write() can handle is summarized in the table produced by data_types() (the default view = TRUE automatically opens a view of that table in RStudio or the web browser):
data.io::data_types(view = FALSE)
#> # A tibble: 32 × 5
#> type read_fun read_header write_fun comment
#> <chr> <chr> <chr> <chr> <chr>
#> 1 csv readr::read_csv data.io::hread_text readr::write_csv comma …
#> 2 csv2 readr::read_csv2 data.io::hread_text <NA> semico…
#> 3 xlcsv readr::read_csv data.io::hread_text readr::write_excel_csv write …
#> 4 tsv readr::read_tsv data.io::hread_text readr::write_tsv tab se…
#> 5 fwf readr::read_fwf data.io::hread_text <NA> fixed …
#> 6 log readr::read_log <NA> <NA> standa…
#> 7 rds readr::read_rds <NA> readr::write_rds R data…
#> 8 txt readr::read_file <NA> readr::write_file text f…
#> 9 raw readr::read_file_raw <NA> <NA> binary…
#> 10 ssv readr::read_table data.io::hread_text <NA> space …
#> # ℹ 22 more rows