Title: | Read and Write Data in Different Formats |
---|---|
Description: | Read or write data from many different formats (tabular datasets, from statistic software ...) into R objects. Add labels and units in different languages. |
Authors: | Philippe Grosjean [aut, cre] |
Maintainer: | Philippe Grosjean <[email protected]> |
License: | MIT + file LICENSE |
Version: | 1.5.1 |
Built: | 2024-11-20 05:52:10 UTC |
Source: | https://github.com/SciViews/data.io |
The {data.io} package focuses on reading and writing datasets in different formats in a unified and convenient way. It can deal with label and units metadata for variables, translation into different languages, and can even use a sidecar file to preprocess the dataset automatically. The same features are also available for a subset of datasets from R packages.
read() is the main function to read data from R packages or files.
write() is the main function to write data to disk. It is compatible with base::write() but provides many more features if you indicate type= or use it like write$type().
labelise() adds a label attribute, and possibly a units attribute, to an object, to be used while pretty printing a table or plot.
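The write$type() calling style mentioned in this overview can be pictured with a small base R sketch. It only illustrates how a function can be made subsettable with $ through an S3 method; the class name subsettable_sketch and the function reader_sketch() are hypothetical and are not part of {data.io}, whose actual implementation may differ.

```r
# Hypothetical reader: it only reports what it would do
reader_sketch <- function(file, type = "csv") {
  paste("reading", file, "as", type)
}
# Add a class so that `$` can dispatch on the function object
class(reader_sketch) <- c("subsettable_sketch", "function")

# S3 method for `$`: reader_sketch$tsv(...) becomes
# reader_sketch(type = "tsv", ...)
`$.subsettable_sketch` <- function(x, name) {
  function(...) x(type = name, ...)
}

res_default <- reader_sketch("data.txt")   # usual call, default type
res_tsv <- reader_sketch$tsv("data.txt")   # type given after the `$`
```

The function stays callable in the usual way, so both calling styles can coexist.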
Convert an object into a dataframe and check for it. A dataframe (without dot) is both a data.frame (with dot, the default rectangular dataset structure in R) and a tibble, the tidyverse equivalent. In fact, dataframes behave almost completely like tibbles, except for a few details explained in the details section.
as_dataframe(x, ...)

as.dataframe(x, ...)

## Default S3 method:
as_dataframe(x, tz = "UTC", ...)

## S3 method for class 'data.frame'
as_dataframe(x, ..., rownames = "rownames")

## S3 method for class 'dataframe'
as_dataframe(
  x,
  ...,
  rownames = "rownames",
  .name_repair = c("check_unique", "unique", "universal", "minimal")
)

## S3 method for class 'list'
as_dataframe(
  x,
  .name_repair = c("check_unique", "unique", "universal", "minimal"),
  ...
)

## S3 method for class 'matrix'
as_dataframe(x, ..., rownames = "rownames")

## S3 method for class 'table'
as_dataframe(x, n = "n", ...)

is_dataframe(x)

is.dataframe(x)
x | An object to convert to a |
... | Additional parameters. |
tz | The time zone. Useful for converting |
rownames | Name of the column that is prepended to the |
.name_repair | Treatment for problematic column names. |
n | The name for the column containing the number of items, |
TODO: explain difference between dataframes and tibbles here...
A dataframe, which is an S3 object with class c("dataframe", "tbl_df", "tbl", "data.frame").
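As a rough illustration of what this class vector implies, the base R sketch below sets the same classes by hand on a plain data.frame and checks inheritance. This is not how as_dataframe() works internally (it also handles row names, column names, time zones, etc.); the helper is_df_sketch() is hypothetical.

```r
# A plain data.frame
df <- data.frame(x = 1:3, y = c("a", "b", "c"))

# Set the class vector documented above by hand (illustration only)
class(df) <- c("dataframe", "tbl_df", "tbl", "data.frame")

# Hypothetical check, in the spirit of is_dataframe()
is_df_sketch <- function(x) inherits(x, "dataframe")

checks <- c(
  dataframe  = is_df_sketch(df),             # has the "dataframe" class
  tibble     = inherits(df, "tbl_df"),       # also seen as a tibble
  data.frame = is.data.frame(df),            # and still a data.frame
  plain      = is_df_sketch(datasets::iris)  # a plain data.frame is not
)
```

Because "data.frame" remains in the class vector, code written for data.frame objects still accepts such an object.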
Philippe Grosjean [email protected]
class(as.dataframe(mtcars))
class(as.dataframe(tibble::tribble(~x, ~y, 1, 2, 3, 4)))

# Any object, like a vector
v1 <- 1:10
is_dataframe(v1)
(df1 <- as_dataframe(v1))
is_dataframe(df1)

# Check names of an existing dataframe
(as_dataframe(df1, .name_repair = "universal"))

# A data.frame with trivial row names
datasets::iris
as_dataframe(datasets::iris)

# A data.frame containing meaningful row names
datasets::mtcars
as_dataframe(datasets::mtcars)

# A list
l1 <- list(x = 1:3, y = rnorm(3))
as_dataframe(l1)

# A matrix with column and row names
(m1 <- matrix(1:9, nrow = 3L, dimnames = list(letters[1:3], LETTERS[1:3])))
as_dataframe(m1)

# A table
set.seed(756)
(t1 <- table(sample(letters[1:5], 50, replace = TRUE)))
as_dataframe(t1)
# compare with the base R function:
as.data.frame(t1)
Get the full path to example datasets included in different formats in the "data.io" package.
data_example(path)
path | The subpath to a file inside the "extdata" subdirectory of the "data.io" package. |
The path to the file, or "" if it is not found.
Philippe Grosjean [email protected]
data_example("iris.csv")
Display information about the data types that read() and write() can use, as well as the original functions to which the work is delegated (see their respective help pages for more information and to know which additional parameters can be used in read() and write()).
data_types(types_only = FALSE, view = TRUE)
types_only | If |
view | If |
The function is mainly designed to be used interactively and to provide information about file types that read() or write() can handle. This cannot be done through a man page because this list is dynamic and other packages could add or change entries there. With view = FALSE, the function can nevertheless also be used in a script or an R Markdown/Notebook document.
A tibble with types_only = FALSE, or a character vector.
Philippe Grosjean [email protected]
## Not run:
data_types()
data_types(TRUE)

## End(Not run)
# For non-interactive use, specify view = FALSE
data_types(view = FALSE)
data_types(TRUE, view = FALSE)
Use name <- read("data", package = "pkg", lang = "xx") to read these datasets together with the metadata (labels, units, comments, ...).
From data:
mauna_loa
Temperature and atmospheric CO2 at Mauna Loa, Hawaii. 5 vars x 768 obs. Time series of monthly averages from 1955 to 2018.
urchin_bio
Sea urchins biometry. 19 vars x 421 obs. Morphometric variables measured on two populations of sea urchins, incl. one circular variable (maturity).
urchin_growth
Sea urchins growth. 3 vars x 7024 obs. Size at age for a cohort of sea urchins followed over more than 10 years.
zooplankton
Zooplankton image analysis. 20 vars x 1262 obs. A training set with 19 measurements made on images of zooplankton and their respective class as attributed by taxonomists.
From datasets:
anscombe
Anscombe's quartet of 'identical' simple linear Regressions. 8 vars x 11 obs. Artificial data.
iris
Edgar Anderson's iris data. 5 vars x 150 obs. Morphometry of the flowers of three iris species (50 for each species).
lynx
Annual Canadian lynx trappings 1821–1934. 2 vars x 114 obs. Long (> 1 century) time series.
trees
Black cherry trees measurements. 3 vars x 31 obs. Measurement of tree timber of various sizes.
From ggplot2:
Prices of 50,000 round cut diamonds. 10 vars x 53940 obs. Price and other attributes of 10,000's of diamonds.
Fuel economy data from 1999 and 2008 for popular cars. 11 vars x 234 obs. Data are for most popular U.S. market cars only.
From MASS:
crabs
Morphological measurements on Leptograpsus crabs. 8 vars x 200 obs. Morphological measurements of Leptograpsus variegatus crabs, either blue or orange, males and females.
geyser
Old Faithful geyser data. 2 vars x 299 obs. Duration and waiting time for eruptions from August 1 to August 15, 1985.
From nycflights13:
Airlines by their carrier codes. 2 vars x 16 obs.
Various metadata about New York city airports. 8 vars x 1458 obs.
On-time data for all flights that departed NYC (i.e., JFK, LGA or EWR) in 2013. 19 vars x 336776 obs.
Planes metadata. 9 vars x 3322 obs.
Hourly meteorological data for JFK, LGA and EWR. 15 vars x 26130 obs.
Set the label, and possibly the units, attributes of an object. The label can be used for better display as plot axis labels, or as table headers in pretty-formatted R outputs. The units are usually associated with the label in axis labels for plots. cl() is a shortcut for concatenate (c()) and labelise().
labelise(x, label, units = NULL, as_labelled = FALSE, ...)

labelize(x, label, units = NULL, as_labelled = FALSE, ...)

## Default S3 method:
labelise(x, label, units = NULL, as_labelled = FALSE, ...)

## S3 method for class 'data.frame'
labelise(x, label, units = NULL, as_labelled = FALSE, self = TRUE, ...)

cl(..., label = NULL, units = NULL, as_labelled = FALSE)

unlabelise(x, ...)

unlabelize(x, ...)

## Default S3 method:
unlabelise(x, ...)

## S3 method for class 'data.frame'
unlabelise(x, self = TRUE, ...)
x | An object. |
label | The character string to set as |
units | The units (optional) as a character string to set for |
as_labelled | Should the object be converted as a |
... | Further arguments: items to be concatenated in a vector using |
self | Do we label the |
The same mechanism as the one used in package Hmisc is used here. However, Hmisc always adds the labelled class to an object, while here, this is optional. Setting this class makes the object print more nicely and makes it subsettable without losing these attributes. But it conflicts with a class of the same name in package haven, used for other purposes. So, here, one can also opt not to set it, using as_labelled = FALSE.
The x object plus a label attribute, and possibly, a units attribute.
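The label and units attributes described here are ordinary R attributes, so the mechanism can be sketched in base R without {data.io}. The helper axis_label_sketch() and the class labelled_sketch below are hypothetical names used only for illustration; they are not the package's (or Hmisc's) implementation. The sketch also shows why a dedicated class can help: plain subsetting drops such attributes.

```r
# Attach label and units by hand (what labelise() is documented to add)
x <- 1:10
attr(x, "label") <- "A suite of integers"
attr(x, "units") <- "cm"

# Hypothetical helper: build an axis label from the two attributes
axis_label_sketch <- function(v) {
  lab <- attr(v, "label")
  un <- attr(v, "units")
  if (is.null(lab)) return(NULL)
  if (is.null(un) || is.na(un)) lab else paste0(lab, " [", un, "]")
}
lab_x <- axis_label_sketch(x)

# Base R subsetting drops these attributes on a plain vector
lost <- attr(x[1:3], "label")

# A class with its own `[` method can restore them after subsetting
`[.labelled_sketch` <- function(x, ...) {
  out <- NextMethod("[")
  attr(out, "label") <- attr(x, "label")
  attr(out, "units") <- attr(x, "units")
  class(out) <- class(x)
  out
}
y <- structure(x, class = "labelled_sketch")
kept <- axis_label_sketch(y[1:3])
```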
Philippe Grosjean [email protected]
# Labelise a vector:
x <- 1:10
x <- labelise(x, label = "A suite of integers", units = "cm")
x
# or, in a single operation:
x <- cl(1:10, label = "A suite of integers", units = "cm")
x
# Not adding the labelled class:
x <- cl(1:10, label = "Integers", units = "cm", as_labelled = FALSE)
x
# Unlabelising a labelised object
unlabelise(x)

# Labelise a data.frame
iris <- labelise(datasets::iris, "The famous iris dataset")
unlabelise(iris)
# but if you indicate self = FALSE, you can labelise variables within the
# data.frame (use a list or character vector of same length as x, or a
# named list or character vector):
iris <- labelise(iris, self = FALSE, label = list(
  Sepal.Length = "Length of the sepals",
  Petal.Length = "Length of the petals"
  ), units = c(rep("cm", 4), NA))
iris <- unlabelise(iris, self = FALSE)
Monthly averages of temperatures and CO2 concentrations, maximal and minimal monthly temperatures at Mauna Loa slope observatory from 1955 to 2018.
mauna_loa
An object of class mts (inherits from ts, matrix) with 768 rows and 4 columns.
Atmospheric CO2 concentration is mole fraction in dry air, micromol/mol, abbreviated as ppm. Temperatures are in degree Celsius.
class(mauna_loa)
head(mauna_loa)
plot(mauna_loa)

# Using read(), the dataset becomes an annotated dataframe
(ml_en <- read("mauna_loa", package = "data.io"))
class(ml_en)
# Indicating lang = "EN_US" (all uppercase!) also converts temperatures
# into degrees Fahrenheit
(ml_en_us <- read("mauna_loa", package = "data.io", lang = "EN_US"))
# Each variable is also labelled:
ml_en$avg_co2
# The same in French:
(ml_fr <- read("mauna_loa", package = "data.io", lang = "fr"))
ml_fr$avg_co2
Read and return an R object from data on disk, from URL, or from packages.
read(
  file,
  type = NULL,
  header = "#",
  header.max = 50L,
  skip = 0L,
  locale = default_locale(),
  lang = getOption("data.io_lang", "en"),
  lang_encoding = "UTF-8",
  as_dataframe = FALSE,
  as_labelled = FALSE,
  comments = NULL,
  package = NULL,
  sidecar_file = TRUE,
  fun_list = NULL,
  hfun = NULL,
  fun = NULL,
  data,
  cache_file = NULL,
  method = "auto",
  quiet = FALSE,
  force = FALSE,
  ...
)

type_from_extension(file, full = FALSE)

hread_text(file, header.max, skip = 0L, locale = default_locale(), ...)

hread_xls(file, header.max, skip = 0L, locale = default_locale(), ...)

hread_xlsx(file, header.max, skip = 0L, locale = default_locale(), ...)

## S3 method for class 'subsettable_type'
x$name

## S3 method for class 'read_function_subset'
.DollarNames(x, pattern = "")
file | The path to the file to read, or the name of the dataset to get from an R package (in that case, you must provide the |
type | The type (format) of data to read. |
header | The character to use for the header and other comments. |
header.max | The maximum number of lines to consider for the header. |
skip | The number of lines to skip at the beginning of the file. |
locale | A readr locale object with all the data required to correctly interpret country-related items. The default value matches R defaults as US English + UTF-8 encoding, and it is advised to use it as much as possible. |
lang | The language to use (mainly for comment, label and units), but also for factor levels or other character strings if a translation exists and if the language is spelled with uppercase characters (e.g., |
lang_encoding | Encoding used by R scripts for translation. They should all be encoded as |
as_dataframe | Deprecated: now use |
as_labelled | Are variables converted into 'labelled' objects? This allows keeping labels and units when the vector is manipulated, but it can lead to incompatibilities with some R code (hence, it is |
comments | Comments to add in the created object. |
package | The package where to look for the dataset. If |
sidecar_file | If |
fun_list | The table with correspondence of the types, read, and write functions. |
hfun | The function to read the header (lines starting with a special mark, usually '#' at the beginning of the file). This function must have the same arguments as |
fun | The function to delegate reading of the data. If |
data | A synonym to |
cache_file | The path to a local file to use as a cache when file is downloaded (http://, https://, ftp://, or file:// protocols). If cache_file already exists, data are read from this cache, except if |
method | The downloading method used ( |
quiet | In case we have to download files, do it silently ( |
force | If |
... | Further arguments passed to the function |
full | Do we return the full extension, like |
x | A |
name | The value to use for the |
pattern | A regular expression to list matching names. |
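To make the header, header.max and skip arguments more concrete, here is a minimal base R sketch of collecting the leading lines that start with the header mark. It is an assumption about the general idea only: the function read_header_sketch() and the content of the example header lines are hypothetical, and the real hread_text() may parse the header quite differently.

```r
# Hypothetical helper: return the leading lines starting with the header mark
read_header_sketch <- function(file, header = "#", header.max = 50L,
                               skip = 0L) {
  # Read at most 'skip' + 'header.max' lines from the top of the file
  lines <- readLines(file, n = skip + header.max)
  if (skip > 0L) lines <- lines[-seq_len(skip)]
  # Keep the uninterrupted run of header lines at the beginning
  is_header <- startsWith(lines, header)
  n <- if (all(is_header)) length(lines) else which(!is_header)[1] - 1L
  lines[seq_len(n)]
}

# A small temporary file with two commented lines before the data
tmp <- tempfile(fileext = ".csv")
writeLines(c("# Title: a tiny dataset", "# Units: cm",
             "length,width", "5.1,3.5"), tmp)
hdr <- read_header_sketch(tmp)
unlink(tmp)
```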
read() provides a single entry point to read various kinds of data, but it delegates the actual work to various other functions dispatched across several R packages. See getOption("read_write").
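The delegation described here can be pictured as a lookup in a table of types and reader functions. The base R sketch below assumes a tiny registry with type and read_fun columns (the column names that appear in the read_write_option() example of this manual); the registry content, resolve_fun_sketch() and read_sketch() are hypothetical and only illustrate the idea, not the package's actual code.

```r
# A tiny, hypothetical registry: one row per type, reader given as "pkg::fun"
registry <- data.frame(
  type = c("csv", "rds"),
  read_fun = c("utils::read.csv", "base::readRDS"),
  stringsAsFactors = FALSE
)

# Turn a "pkg::fun" string into the corresponding function
resolve_fun_sketch <- function(spec) {
  parts <- strsplit(spec, "::", fixed = TRUE)[[1]]
  getExportedValue(parts[1], parts[2])
}

# Single entry point: guess the type from the extension, then delegate
read_sketch <- function(file, type = tools::file_ext(file), ...) {
  row <- registry[registry$type == type, ]
  if (nrow(row) != 1L) stop("Unknown type: ", type)
  resolve_fun_sketch(row$read_fun)(file, ...)
}

# Usage: write a small CSV file, then read it back through the entry point
tmp <- tempfile(fileext = ".csv")
write.csv(head(datasets::iris), tmp, row.names = FALSE)
res <- read_sketch(tmp)
unlink(tmp)
```

Storing function names as strings keeps the table editable at run time, which is consistent with the statement that other packages can add or change entries.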
An R object with the data (its class depends on the data being read).
Philippe Grosjean [email protected]
data_types(), write(), read_csv()
# Use of read() as a more flexible substitute to data() (can change dataset
# name and syntax more similar to read R datasets and datasets from files)
read() # List all available datasets in your installed version of R

# List datasets in one particular package
read(package = "data.io")

# Read one dataset from this package, possibly changing its name
(urchin <- read("urchin_bio", package = "data.io"))

# Same, but using labels in French
(urchin <- read("urchin_bio", package = "data.io", lang = "fr"))

# ... and also the levels of factors in French (note: uppercase FR)
(urchin <- read("urchin_bio", package = "data.io", lang = "FR"))

# Read one dataset from another package, but with labels and comments
data(iris) # The R way: you got the initial datasets
# Same result, using read()
ir2 <- read("iris", package = "datasets", lang = NULL)
# ir2 records that it comes from datasets::iris
attr(comment(ir2), "src")
# otherwise, it is identical to iris, except it may be a data.table or a
# tibble, depending on user preferences
comment(ir2) <- NULL
# Force coercion into a data.frame
ir2 <- svBase::as_dtf(ir2)
identical(iris, ir2)

# More interesting: you can get an enhanced version of iris with read():
# (note that variable names are in snake-case now!)
(ir3 <- read("iris", package = "datasets"))
class(ir3)
comment(ir3)
ir3$sepal_length

# ... and you can get it in French too!
(ir_fr <- read("iris", package = "datasets", lang = "fr"))
class(ir_fr)
comment(ir_fr)
ir_fr$sepal_length

# Sometimes, datasets are more deeply reworked. For instance, trees has
# variables in imperial units (in, ft, and cubic ft), but it is automatically
# reworked by read() into metric variables (m or m^3):
data(trees)
head(trees)
(trees2 <- read("trees", package = "datasets"))
comment(trees2)
trees2$volume

# Read from a Github Gist (need to specify the type here!)
# (ble <- read$csv("http://tinyurl.com/Biostat-Ble"))

# Various versions of the famous iris dataset
(iris <- read(data_example("iris.csv")))
(iris <- read(data_example("iris.csv.zip")))
(iris <- read(data_example("iris.csv.gz")))
(iris <- read(data_example("iris.csv.bz2")))
(iris <- read(data_example("iris.tsv")))
(iris <- read(data_example("iris.xls")))
(iris <- read(data_example("iris.xlsx")))
(iris <- read(data_example("iris.rds"))) # Does not transform into tibble!
#(iris <- read(data_example("iris.syd"))) ##
#(iris <- read(data_example("iris.csvy"))) ##
#(iris <- read(data_example("iris.csvy.zip"))) ##

# A file with a header both in English (default) and in French
(iris <- read(data_example("iris_short_header.csv")))
(iris_fr <- read(data_example("iris_short_header.csv"), lang = "fr"))
# Headers are also recognized in xls/xlsx files
(iris_fr <- read(data_example("iris_short_header.xls"), lang = "fr"))

# Read a file with a sidecar file (same name + '.R')
(iris <- read(data_example("iris_sidecar.csv"))) # lang = "en" by default
(iris <- read(data_example("iris_sidecar.csv"), lang = "EN")) # Full lang
(iris <- read(data_example("iris_sidecar.csv"), lang = "en_us")) # US (in)
(iris <- read(data_example("iris_sidecar.csv"), lang = "fr")) # French
(iris <- read(data_example("iris_sidecar.csv"), lang = "FR_BE")) # Belgian
(iris <- read(data_example("iris_sidecar.csv"), lang = NULL)) # No labels

# Require the feather package
#(iris <- read(data_example("iris.feather"))) # Not available for all Win

# Challenging datasets from the readr package
library(readr)
(mtcars <- read(readr_example("mtcars.csv")))
(mtcars <- read(readr_example("mtcars.csv.zip")))
(mtcars <- read(readr_example("mtcars.csv.bz2")))
(challenge <- read(readr_example("challenge.csv"), guess_max = 1001))
(massey <- read(readr_example("massey-rating.txt")))
# By default, the type cannot be guessed from the extension
# This is a space-separated values file (ssv)
(massey <- read(readr_example("massey-rating.txt"), type = "ssv"))
# or ...
(massey <- read$ssv(readr_example("massey-rating.txt")))
(epa <- read$ssv(readr_example("epa78.txt"), col_names = FALSE))
(example_log <- read(readr_example("example.log")))
# There are different ways to specify columns for fixed-width files (fwf)
# See ?read_fwf in package readr
(fwf_sample <- read$fwf(readr_example("fwf-sample.txt"),
  col_positions = fwf_cols(name = 20, state = 10, ssn = 12)))

# Various examples of Excel datasets from readxl
library(readxl)
(xl <- read(readxl_example("datasets.xls")))
(xl <- read(readxl_example("datasets.xlsx"), sheet = "mtcars"))
(xl <- read(readxl_example("datasets.xlsx"), sheet = 3))
# Accommodate a column with disparate types via col_type = "list"
(clip <- read(readxl_example("clippy.xls"), col_types = c("text", "list")))
(clip <- read(readxl_example("clippy.xlsx"), col_types = c("text", "list")))
tibble::deframe(clip)
# Read from a specific range in a sheet
(xl <- read(readxl_example("datasets.xlsx"), range = "mtcars!B1:D5"))
(deaths <- read(readxl_example("deaths.xls"), range = cell_rows(5:15)))
(deaths <- read(readxl_example("deaths.xlsx"), range = cell_rows(5:15)))
(type_me <- read(readxl_example("type-me.xls"), sheet = "logical_coercion",
  col_types = c("logical", "text")))
(type_me <- read(readxl_example("type-me.xlsx"), sheet = "numeric_coercion",
  col_types = c("numeric", "text")))
(type_me <- read(readxl_example("type-me.xls"), sheet = "date_coercion",
  col_types = c("date", "text")))
(type_me <- read(readxl_example("type-me.xlsx"), sheet = "text_coercion",
  col_types = c("text", "text")))
(xl <- read(readxl_example("geometry.xls"), col_names = FALSE))
(xl <- read(readxl_example("geometry.xlsx"), range = cell_rows(4:8)))

# Various examples from haven
library(haven)
haven_example <- function(path)
  system.file("examples", path, package = "haven", mustWork = TRUE)
(iris2 <- read(haven_example("iris.dta"))) # Stata v. 8-14
(iris2 <- read(haven_example("iris.sav"))) # SPSS, TODO: labelled -> factor?
(pbc <- read(data_example("pbc.por"))) # SPSS, POR format
(iris2 <- read$sas(haven_example("iris.sas7bdat"))) # SAS file
(afalfa <- read(data_example("afalfa.xpt"))) # SAS transport file

# Note that where completion is available, you have a completion list of file
# format after typing read$<tab>
Define the functions that read() or write() must call to import or export data for the different types (formats).
read_write_option(new_type)
new_type | A data.frame with four columns: |
The data.frame with all known formats is returned invisibly. The same data.frame is also saved in the read_write option, and can be retrieved directly with getOption("read_write").
Philippe Grosjean [email protected]
# The default options
(read_write_option())
# To add a new type:
tail(read_write_option(data.frame(type = "png", read_fun = "png::readPNG",
  read_header = NA, write_fun = "png::writePNG", comment = "PNG image")))
After normalizing both file and dir, try to find a common ancestor directory to build a path for file relative to dir.
relative_path(file, dir = getwd())
file | A single string with the path to a file or directory to transform as relative. |
dir | A single string with the "reference" directory (by default, the directory provided by |
A single character string with the relative path, or file unmodified if file is totally unrelated to dir.
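The common-ancestor idea can be sketched in a few lines of base R. The function rel_path_sketch() below is a hypothetical illustration, not the code of relative_path(): it assumes both paths are already normalized, absolute and use "/" as separator, and it does not touch the file system.

```r
# Hypothetical sketch: path of 'file' relative to 'dir' via a common ancestor
rel_path_sketch <- function(file, dir) {
  # Split both paths into their non-empty components
  f <- strsplit(file, "/", fixed = TRUE)[[1]]
  d <- strsplit(dir, "/", fixed = TRUE)[[1]]
  f <- f[nzchar(f)]
  d <- d[nzchar(d)]
  # Count the leading components shared by both paths
  n <- min(length(f), length(d))
  same <- f[seq_len(n)] == d[seq_len(n)]
  common <- if (all(same)) n else which(!same)[1] - 1L
  # No common ancestor: return 'file' unmodified
  if (common == 0L) return(file)
  # Go up from 'dir' to the ancestor, then down to 'file'
  up <- rep("..", length(d) - common)
  down <- f[seq_len(length(f) - common) + common]
  paste(c(up, down), collapse = "/")
}

p1 <- rel_path_sketch("/Users/me/project/subdir/file.txt", "/Users/me/project")
p2 <- rel_path_sketch("/Users/me/file.txt", "/Users/me/project")
p3 <- rel_path_sketch("/Unrelated/file.txt", "/Users/me/project")
```

In this sketch, p1 descends into subdir, p2 climbs one level with "..", and p3 is returned unchanged because the two paths share no leading directory.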
Philippe Grosjean [email protected]
relative_path("/Users/me/project/file.txt", "/Users/me/project")
relative_path("/Users/me/project/subdir/file.txt", "/Users/me/project")
relative_path("/Users/me/file.txt", "/Users/me/project")
relative_path("/Users/me/subdir/file.txt", "/Users/me/project")
relative_path("/Users/file.txt", "/Users/me/project")
relative_path("/Users/subdir1/subdir2/file.txt", "/Users/me/project")
relative_path("/Unrelated/file.txt", "/Users/me/project")
relative_path("file.txt", "/Users/me/project")
relative_path("~/file.txt", "/Users/me/project")
relative_path("./file.txt", "/Users/me/project")
relative_path(file.path(getwd(), "data.io", "file.txt"))
Various measurements on Paracentrotus lividus sea urchins originating either from a fishery (Brittany, France) or from a sea urchin farm in Normandy.
urchin_bio
A data frame with 19 variables:
origin
A factor with two levels: "Culture" and "Fishery".
diameter1
Diameter (in mm) of the test measured at the ambitus (its widest part).
diameter2
A second diameter (in mm) measured at the ambitus, perpendicular to the first one. The idea here is to calculate the average of `diameter1` and `diameter2` in order to eliminate the effect of a possible slight departure from a nearly circular ambitus.
height
The height of the test (in mm), measured from mouth to anus, thus orthogonally to the two diameters.
buoyant_weight
Weight (in g) of the sea urchin immersed in seawater.
weight
Weight (in g) of the whole animal.
solid_parts
Weight (in g) of the animal after draining its coelomic fluid out of the test.
integuments
Weight (in g) of the sea urchin after taking out the whole content of the test (coelomic fluid, digestive tract and gonads).
dry_integuments
Dry weight (in g) of the integuments.
digestive_tract
Weight (in g) of the digestive tract, including its content.
dry_digestive_tract
Dry weight (in g) of the digestive tract and its content.
gonads
Weight (in g) of the gonads.
dry_gonads
Dry weight (in g) of the gonads.
skeleton
Weight of the skeleton (g), calculated as the sum of lantern + test + spines.
lantern
Dry weight (in g) of the lantern (the jaw and teeth of the sea urchin).
test
Dry weight (in g) of the calcareous part of the test.
spines
Dry weight (in g) of calcareous parts of the spines.
maturity
Gonads maturity index (integer), measured on a scale of 3 states: state 0 means the gonad is absent or spent, state 1 means it is growing but not mature, and state 2 means the gonad is mature. This should be treated as a circular variable, since the reproductive cycle is 0 -> 1 -> 2 -> 0 (spawning).
sex
When possible, the sex of the animal is determined by visual inspection of the gonads (factor with levels "F" and "M").
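Several of the variables above are defined from others: the mean of the two diameters, the skeleton as a sum of three parts, and the circular maturity scale. A minimal base R sketch of these relationships follows. The data frame `urchins` and all its values are invented for illustration and are not taken from `urchin_bio`.

```r
# Hypothetical measurements for three sea urchins (invented values)
urchins <- data.frame(
  diameter1 = c(20.1, 35.4, 50.2),
  diameter2 = c(19.7, 36.0, 49.6),
  lantern   = c(0.10, 0.45, 1.20),
  test      = c(0.80, 3.90, 10.50),
  spines    = c(0.60, 2.80, 7.10),
  maturity  = c(0L, 1L, 2L)
)

# Mean diameter smooths out slight departures from a circular ambitus
urchins$diameter <- (urchins$diameter1 + urchins$diameter2) / 2

# Skeleton weight as the sum of lantern + test + spines
urchins$skeleton <- urchins$lantern + urchins$test + urchins$spines

# Next state in the circular maturity cycle 0 -> 1 -> 2 -> 0
urchins$next_maturity <- (urchins$maturity + 1L) %% 3L
```

The modulo operation is one simple way to respect the circular nature of `maturity`; treating it as an ordinary ordered scale would wrongly place state 2 far from state 0.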
A stratified sample was performed to make sure all size classes (in steps of 5 mm in test diameter) from each sub-population are equally represented in the dataset. Hence, the size-class or weight-class distributions within each population cannot be studied with this dataset. However, these data are well suited to exploring allometric relationships between body measurements and/or body parts of the sea urchins over the whole size range.
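An allometric relationship is commonly modeled as a power law, which becomes a straight line on log-log scales. The sketch below fits such a line with `lm()`. It uses simulated data (the coefficient 0.0006 and the exponent 2.9 are arbitrary assumptions), not `urchin_bio`, so the numbers carry no biological meaning.

```r
set.seed(42)

# Simulated data, NOT the urchin_bio dataset:
# weight = a * diameter^b with multiplicative noise
diameter <- seq(5, 80, by = 5)
weight <- 0.0006 * diameter^2.9 * exp(rnorm(length(diameter), sd = 0.05))

# A power law is linear on log-log scales:
# log(weight) = log(a) + b * log(diameter)
fit <- lm(log(weight) ~ log(diameter))

b_hat <- unname(coef(fit)[2])       # estimated allometric exponent
a_hat <- exp(unname(coef(fit)[1]))  # estimated allometric coefficient
```

With the real dataset, the same model formula could be applied to any pair of body measurements, keeping in mind that the stratified sampling makes the data suitable for this kind of relationship but not for size distributions.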
For further details on the farming of these sea urchins, see here.
Size at age for a cohort of farmed sea urchins, Paracentrotus lividus.
urchin_growth
An object of class `data.frame` with 7024 rows and 3 columns.
Because the same cohort of farmed sea urchins is measured at various time intervals, the observations are not completely independent of each other: the same individuals are measured repeatedly. As the sea urchins are not individually tagged, it is not possible to track them from one measurement to the next. However, the whole dataset is representative of growth, and of the spread of growth, in a single cohort. Mortality could also be derived from the number of measurements made at each time period, since all the individuals still alive are measured (no sub-sampling).
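The idea of deriving mortality from the number of measurements per time period can be sketched as follows. The `age` vector is invented for illustration (one entry per individual measured at each age) and does not come from `urchin_growth`.

```r
# Hypothetical ages at measurement, one entry per individual measured
# (invented values, NOT the urchin_growth dataset)
age <- rep(c(0.5, 1.0, 1.5, 2.0), times = c(100, 80, 72, 60))

# Number of individuals measured, i.e., still alive, at each age
n_alive <- as.vector(table(age))

# Proportion of the initial count surviving to each age
survival <- n_alive / n_alive[1]
# 1.00 0.80 0.72 0.60

# Mortality between successive measurement periods
mortality <- 1 - n_alive[-1] / n_alive[-length(n_alive)]
```

This only works because no sub-sampling takes place: each count is a census of the survivors at that time.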
library(ggplot2)
ggplot(urchin_growth, aes(age, diameter)) +
  geom_jitter(alpha = 0.2) +
  xlab(label(urchin_growth$age, units = TRUE)) +
  ylab(label(urchin_growth$diameter, units = TRUE)) +
  ggtitle("Growth of a cohort of sea urchins")
Write R data into a file, in different formats.
write(
  data,
  file = "data",
  ncolumns = if (is.character(data)) 1 else 5,
  append = FALSE,
  sep = " ",
  type = NULL,
  fun_list = NULL,
  x,
  ...
)

## S3 method for class 'write_function_subset'
.DollarNames(x, pattern = "")
data | An object to write in a file. The accepted class depends on what the delegated function expects (in many cases, a |
file | The path to the file to write to. If |
ncolumns | The number of columns to write the data in when |
append | If |
sep | A string used to separate columns. Using |
type | The type (format) of data to write. |
fun_list | The table with correspondence of the types, read, and write functions. |
x | Same as |
... | Further arguments passed to the write function, when |
pattern | A regular expression to list matching names. |
This function is designed to be fully compatible with `base::write()`, while also allowing you to specify `type` and get a more interesting behavior in that case. Hence, when `type` is not provided, either with `write(type = ...)` or `write$...()`, the default code is used and a plain text file with fields separated by spaces (by default) is written. When `type` is provided, the export is delegated to specific functions (see `data_types()`) to write the data in different formats.
`data` is returned invisibly (contrary to `base::write()`, which returns `NULL`).
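The delegation logic described above can be sketched in base R. This is a simplified illustration under stated assumptions, not the data.io implementation: `write_sketch()` and `fun_table` are hypothetical, the table holds only two types mapped to `utils` functions, and the sketch does not guess the type from the file extension.

```r
# Hypothetical lookup table: type -> "package::function" string
fun_table <- data.frame(
  type = c("csv", "tsv"),
  write_fun = c("utils::write.csv", "utils::write.table"),
  stringsAsFactors = FALSE
)

# Hypothetical wrapper (not data.io's write())
write_sketch <- function(data, file, type, ...) {
  if (missing(type)) {
    # No type given: fall back to the base::write() behavior
    base::write(data, file, ...)
  } else {
    fun_name <- fun_table$write_fun[fun_table$type == type]
    if (length(fun_name) != 1) stop("unknown type: ", type)
    # Resolve the "package::function" string into a function object
    parts <- strsplit(fun_name, "::", fixed = TRUE)[[1]]
    fun <- getExportedValue(parts[1], parts[2])
    fun(data, file, ...)
  }
  # Return the data invisibly, unlike base::write() which returns NULL
  invisible(data)
}

# Usage with temporary files
df <- data.frame(a = 1:3, b = c("x", "y", "z"))
csv_file <- tempfile(fileext = ".csv")
res <- write_sketch(df, csv_file, type = "csv", row.names = FALSE)

txt_file <- tempfile(fileext = ".txt")
write_sketch(1:6, txt_file, ncolumns = 3)
```

Returning the data invisibly lets such a function be used in the middle of a pipeline, which is one plausible motivation for departing from `base::write()` on this point.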
Philippe Grosjean
`data_types()`, `read()`, `write_csv()`, `base::write()`
# Always specify type to delegate to more sophisticated functions
# (type = NULL explicitly indicated meaning: "guess from file extension")
urchin <- read("urchin_bio", package = "data.io")
write(urchin, "urchin_temporary.csv", type = NULL)
# To use a format more easily readable by Excel
write(urchin, "urchin_temporary.csv", type = "xlcsv")
# ... equivalently (and more compact)
write$xlcsv(urchin, "urchin_temporary.csv")
# Tidy up
unlink("urchin_temporary.csv")

# Write in Excel format
write$xlsx(urchin, "urchin_temporary.xlsx")
# Tidy up
unlink("urchin_temporary.xlsx")

# Use base::write() code to output atomic vectors (and matrices) in text files
# when you don't specify type=
mat1 <- matrix(1:12, nrow = 4)
# To get a similar presentation in the file, you have to do:
write(t(mat1), "my_temporary_data.txt", ncolumns = 3)
file.show("my_temporary_data.txt")
# Tidy up
unlink("my_temporary_data.txt")
rm(mat1)
Various features measured by image analysis with the package zooimage and ImageJ on samples of zooplankton originating from Tulear, Madagascar. The taxonomic classification is also provided in the `class` variable.
zooplankton
A data frame with 20 variables (19 measured features plus `class`):
ecd
The "equivalent circular diameter", the diameter of a circle with the same area as the particle (in mm).
area
The area of the particle on the image (in mm^2).
perimeter
The perimeter of the particle (in mm).
feret
The Feret diameter, that is, the largest measured diameter of the particle on the image (mm).
major
The major axis of the ellipsoid matching the particle (mm).
minor
The minor axis of the same ellipsoid (mm).
mean
The mean value of the gray levels calibrated in optical density (OD), thus, unitless.
mode
The most frequent gray level in that particle in OD.
min
The most transparent part in OD.
max
The most opaque part in OD.
std_dev
The standard deviation of the OD distribution inside the particle.
range
Transparency range as `max` - `min`.
size
The mean diameter of the particle, as the average of `minor` and `major` (mm).
aspect
Aspect ratio of the particle as `minor` / `major`.
elongation
The `area` divided by the area of a circle with the same `perimeter` as the particle.
compactness
sqrt((4 / pi) * `area`) / `major`.
transparency
1 - (`ecd` - `size`).
circularity
4 * pi * (`area` / `perimeter`^2).
density
Density integrated over the surface covered by each gray level (i.e., OD) inside the particle.
class
The classification of this particle. There are 17 classes. Note that Copepods are Calanoid + Cyclopoid + Harpacticoid + Poecilostomatoid and they represent the most abundant zooplankton at sea.
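Several of the shape descriptors above are simple functions of `area`, `perimeter`, `major` and `minor`. The sketch below evaluates the formulas as written in this list for a hypothetical, perfectly circular particle, for which the diameter-based measures coincide and the ratios all equal 1. The values are invented and do not come from `zooplankton`.

```r
# A hypothetical, perfectly circular particle of radius 0.5 mm
r <- 0.5
area <- pi * r^2        # mm^2
perimeter <- 2 * pi * r # mm
major <- 2 * r          # mm
minor <- 2 * r          # mm

# Equivalent circular diameter: diameter of a circle with the same area
ecd <- 2 * sqrt(area / pi)

# Descriptors computed with the formulas given in the variable list
size <- (major + minor) / 2
aspect <- minor / major
compactness <- sqrt((4 / pi) * area) / major
circularity <- 4 * pi * (area / perimeter^2)
```

For a real, irregular particle these ratios depart from 1, which is what makes them useful as features for a classifier.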
This is a typical training set used to train a plankton classifier with machine learning algorithms. Organisms originate from various samples (different seasons, depths, etc.) in order to take the variability into account. However, the abundance of the different classes does not match the abundance found in each sample, i.e., rare classes are over-represented in this training set. Only zooplankton classes are present in the dataset. The full data also contain classes for phytoplankton, marine snow, etc. Take care that several variables are correlated!
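One simple way to spot correlated variables before training a classifier is to inspect the correlation matrix. The sketch below does so on simulated features, not on `zooplankton`; the 0.9 threshold and the column name `mean_od` are arbitrary assumptions made for the illustration.

```r
set.seed(1)

# Simulated particles, NOT the zooplankton dataset
area <- runif(200, min = 0.05, max = 2)  # mm^2
features <- data.frame(
  area = area,
  ecd = 2 * sqrt(area / pi),                  # derived from area, hence correlated
  mean_od = runif(200, min = 0.1, max = 0.9)  # independent gray level
)

cors <- cor(features)

# Variable pairs whose absolute correlation exceeds 0.9
# (upper triangle only, so each pair is listed once)
high <- which(abs(cors) > 0.9 & upper.tri(cors), arr.ind = TRUE)
correlated <- data.frame(
  var1 = rownames(cors)[high[, "row"]],
  var2 = colnames(cors)[high[, "col"]],
  r = cors[high]
)
```

Here only the `area`/`ecd` pair is flagged, as expected from the way `ecd` is derived from `area`.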
Grosjean, Ph. & K. Denis (2014). Supervised classification of images, applied to plankton samples using R and ZooImage. Chap. 12 of Data Mining Applications with R. Zhao, Y. & Y. Cen (eds). Elsevier. Pp. 331-365. https://doi.org/10.1016/C2012-0-00333-X.
table(zooplankton$class)
library(ggplot2)
ggplot(zooplankton, aes(circularity, transparency, color = class)) +
  geom_point()