RStudio project

Open the RStudio project that we created in the previous session. We recommend using this RStudio project for the entire course and within the RStudio project create separate R scripts for each session.

  • Create a new empty R script via the menu “File”, then “New File”, then “R Script”. In the new R script, type the comment # Session 4: Data import and export as the first line and save the file in the folder “scripts” within your project folder, e.g. as “4_DataImportExport.R”.

R offers several options for accessing different data formats. Standard formats include text files (in ASCII format), binary files, R files (single objects or entire workspaces), and databases. These data can be read from files using a file path, but also from the Internet using a URL or from the clipboard. You can also write to these same destinations. A good overview is provided in the manual “R Data Import/Export” (R Core Team 2019), which you can access by typing help.start(). The manual also contains information on some useful packages, e.g. foreign, which provides access to datasets created with other statistical software such as SAS, SPSS, S-PLUS etc. Data import using the tidyverse packages is briefly explained in the RStudio cheat sheets.

1 Read text data

Download the data files veggyshopping.txt, veggyshopping.csv and myColumn.txt from the course data repository and move them to the data folder inside your project folder.

Since you are working in an RStudio project, the R session already points to the project directory. To load data from the data folder inside the project directory, all we need to do is specify the relative path to the data folder. This is best done using file.path(), which joins the path components with the platform's file separator. In R this is the forward slash /, which R also accepts on Windows, so you never have to type backslashes yourself. Let’s list the files inside the data folder:

# file.path() joins the path components with a separator that R understands on every operating system

file.path(".", "data") # the . is a placeholder for your current working directory. It can be omitted, but we keep it as an example of how paths are built.

list.files(path = file.path(".", "data"))

# Since we are not navigating to subfolders within the data folder, in this specific case we can also use:
list.files(path = "data")
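
As a quick check, the following sketch shows what file.path() returns and how to test whether a file is where you expect it before reading it. The file name is the one used in this session; the object name veggy_path is only chosen for illustration. If the file has not been downloaded yet, file.exists() simply returns FALSE.

```r
# Build a relative path from its components
veggy_path <- file.path("data", "veggyshopping.txt")
veggy_path

# Check whether the file exists before trying to read it
file.exists(veggy_path)

# List only the text files in the data folder (pattern is a regular expression)
list.files(path = "data", pattern = "\\.txt$")

# The working directory that all relative paths start from
getwd()
```
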

Read the file veggyshopping.txt from your data folder. The file is in table format (data frame); every column has a name, and columns are separated by commas. This information is important for reading the file correctly:

my_df <- read.table(file.path('.', 'data', 'veggyshopping.txt'))

If you get an error message or an unexpected result, there may be several reasons. Is the data table correct? Do all columns contain the same number of elements? Yes, the data file is correct. So the problem probably lies in the arguments passed to read.table(). Have a look at the help page (?read.table) and the default settings.

Important arguments:

  • file: (path) and file name
  • header: a logical value indicating whether the file contains column names; defaults to FALSE
  • sep: column separator. Default: "" (any white space, i.e. one or more spaces or tabs)
  • dec: character used for decimal points. Default: “.”
  • colClasses: optional vector giving data types for each column
  • stringsAsFactors: set to FALSE to avoid automatic conversion of character columns to factors. Since R 4.0.0, FALSE is the default.

R expects white space as the field separator, but our data columns were separated by commas when the file was saved. Also, R does not assume column names by default, so we need to tell R both:

(my_df <- read.table(file.path("data", "veggyshopping.txt"), sep = ',', header = TRUE))
str(my_df)
names(my_df)
summary(my_df)
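
To see the effect of these arguments in isolation, here is a minimal sketch on a small made-up table written to a temporary file. The column names and values are hypothetical and are not the contents of veggyshopping.txt.

```r
# Hypothetical example data, written to a temporary file
tmp <- tempfile(fileext = ".txt")
writeLines(c("item,price,organic",
             "carrot,1.20,TRUE",
             "leek,2.50,FALSE",
             "pumpkin,3.00,TRUE"), tmp)

# Default settings: each line is read as a single field, giving one column
wrong <- read.table(tmp)
dim(wrong)

# Correct separator and header: three named columns
right <- read.table(tmp, sep = ",", header = TRUE)
str(right)

# Column types can be set explicitly with colClasses
typed <- read.table(tmp, sep = ",", header = TRUE,
                    colClasses = c("factor", "numeric", "logical"))
sapply(typed, class)
```

Setting colClasses is optional, but it documents the expected types in the script and makes R complain early if a column does not match.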

There are also some useful modifications for common formats, e.g. for comma-separated files (csv):

read.csv(file.path("data", "veggyshopping.txt"))
read.csv(file.path("data", "veggyshopping.csv"))
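
For orientation, a short sketch of the most common variants, using made-up data passed in via the text argument instead of a file. read.csv() assumes sep = "," and header = TRUE, read.csv2() assumes sep = ";" and dec = ",", and read.delim() assumes tab-separated columns.

```r
# Comma-separated, decimal point
a <- read.csv(text = "item,price\ncarrot,1.2\nleek,2.5")

# Semicolon-separated, decimal comma (common in many European locales)
b <- read.csv2(text = "item;price\ncarrot;1,2\nleek;2,5")

# Tab-separated
d <- read.delim(text = "item\tprice\ncarrot\t1.2\nleek\t2.5")

# Compare the results of the three calls
all.equal(a, b)
all.equal(a, d)
```
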

Sometimes, read.table() and its variants are not the right tool. The next example file does not contain a data table but only a (very long) vector. read.table() will still work, but it will be slow and will create a data frame with many rows in a single column, which is not very sensible.

read.table(file.path("data", "myColumn.txt"))

For such one-row or one-column data files, it is better to use functions like scan(). With some practice, this function can also be used for reading very complex data structures.

scan(file.path("data", "myColumn.txt"))
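
A minimal sketch of how scan() behaves, assuming a one-column file of numbers. The file is created here as a temporary stand-in and is not myColumn.txt itself.

```r
# Hypothetical one-column file with 1000 numbers
tmp <- tempfile(fileext = ".txt")
writeLines(as.character(seq(0.5, 500, by = 0.5)), tmp)

# scan() returns a plain numeric vector (what = double() is the default)
v <- scan(tmp)
length(v)
is.vector(v)

# Other types and separators are set with what and sep
scan(text = "a,b,c", what = character(), sep = ",")

# quiet = TRUE suppresses the "Read N items" message
v2 <- scan(tmp, quiet = TRUE)
```
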

Other packages provide more convenient functions for data import, e.g. readr (part of the tidyverse) or data.table. Often, these functions automatically detect the data structure in the files.

library(tidyverse)
read_delim(file.path("data", "veggyshopping.txt"))

# older readr versions do not detect the column separator here; if the call above fails, set delim explicitly
read_delim(file.path("data", "veggyshopping.txt"), delim = ',')
read_csv(file.path("data", "veggyshopping.txt"))

library(data.table)
fread(file.path("data", "veggyshopping.txt"))

2 Write text data

All objects created in R can be exported to a file. Data frames can be written to a file with write.table() and its variants:

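# veggy_shopping must exist in your workspace (e.g. from an earlier session); otherwise use my_df from above
write.table(veggy_shopping, file.path("data", "veggyshopping.txt"), row.names = FALSE, col.names = TRUE, sep = ',')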
write.csv(veggy_shopping, file.path("data", "veggyshopping.csv"))

This will overwrite the two files you downloaded above without asking, because the file paths are identical to those of the downloaded files.
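
If you would rather experiment without touching the downloaded files, the following sketch does a full round trip through temporary files. The data frame df is a made-up stand-in for the shopping data.

```r
# Hypothetical data frame standing in for the shopping data
df <- data.frame(item = c("carrot", "leek"), price = c(1.2, 2.5))

# Write to temporary files so that nothing in the data folder is overwritten
txt_file <- tempfile(fileext = ".txt")
csv_file <- tempfile(fileext = ".csv")

write.table(df, txt_file, sep = ",", row.names = FALSE, col.names = TRUE)

# write.csv() writes row names by default; switch them off to avoid an extra column
write.csv(df, csv_file, row.names = FALSE)

# Look at the raw text that was written
readLines(csv_file)

# Read the file back and compare with the original data
df_back <- read.csv(csv_file)
all.equal(df, df_back)
```
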

Important arguments:

  • file: (path) and file name
  • sep: column separator. Default: " " (a single space) for write.table() and "," for write.csv()
  • dec: character used for decimal points. Default: “.”

The function write() allows writing vectors and matrices to a file. However, it has some surprising default settings. Check out the help page and the example below. For demonstration purposes, we send the output to the console by passing an empty string ("") as the file name.

# Write vector:
x <- 1:500
write(x, "") # prints to console

# Write matrix:
(Xmat <- matrix(1:10, ncol = 5))
write(Xmat, "") # prints to console

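Surprisingly, the matrix is not exported in the right structure: write() goes through the matrix column by column and, for numeric data, starts a new line after every five values by default. Check out the help page ?write to find out more about it.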

# transpose matrix before exporting: works!
write(t(Xmat), "")
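
Note that the transposed call above only produces the right layout because Xmat happens to have five columns, which matches the default of the ncolumns argument. A more general sketch, using a hypothetical 4 x 3 matrix X, sets ncolumns explicitly or uses write.table() instead:

```r
X <- matrix(1:12, ncol = 3)  # hypothetical 4 x 3 matrix

# Without ncolumns, write() would break the lines after every 5 values
write(t(X), "", ncolumns = ncol(X))

# write.table() keeps the matrix layout without transposing
write.table(X, "", row.names = FALSE, col.names = FALSE)

# Round trip through a temporary file
tmp <- tempfile(fileext = ".txt")
write(t(X), tmp, ncolumns = ncol(X))
X_back <- matrix(scan(tmp, quiet = TRUE), ncol = ncol(X), byrow = TRUE)
all.equal(X, X_back)
```
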

3 Read and write R objects

When your data structure is more complex than simple data frames or matrices, other formats are likely preferable to text files.

  • save(): saves R objects to .Rdata format. When you load an .Rdata file (load()), objects saved within appear in your global environment under the same name they were stored with.
  • saveRDS(): stores R objects to .rds format. Arguably the most efficient way of storing any type of data that only needs to be accessed by R. readRDS() loads .rds files and you can assign the loaded data to a new object name (unlike load() which automatically assigns the object name of the data from when it was stored).
  • save.image(): stores your entire workspace (equivalent to save(), but for all objects currently in the global environment). This way, you can continue an R session at any later time by loading the stored workspace image (load()). Depending on your RStudio settings, the workspace image is saved (or you are asked whether to save it) when you close the project. When your loaded objects take up a lot of memory, saving and loading can become quite slow, and it may be preferable to disable this setting.

# save object Xmat to Rdata format
save(Xmat, file = file.path("data", "Xmat.Rdata"))
load(file.path("data", "Xmat.Rdata")) # object Xmat will be loaded into global environment

saveRDS(Xmat, file = file.path("data", "Xmat.rds"))
Xmat_differentname <- readRDS(file.path("data", "Xmat.rds")) # same matrix, but is loaded into a different object name in global environment
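
The difference between the two formats can be made visible with a small sketch that writes to temporary files. The objects X and y and the environment e are only illustrative names. Loading into a separate environment is one way to inspect an .Rdata file without overwriting objects of the same name in your global environment.

```r
X <- matrix(1:10, ncol = 5)
y <- c(a = 1, b = 2)

rdata_file <- tempfile(fileext = ".Rdata")
rds_file   <- tempfile(fileext = ".rds")

# save() can store several objects in one file
save(X, y, file = rdata_file)

# load() restores the objects under their stored names and
# invisibly returns those names as a character vector
e <- new.env()
loaded <- load(rdata_file, envir = e)
loaded
identical(e$X, X)

# saveRDS() stores exactly one object, without its name
saveRDS(X, rds_file)
X_copy <- readRDS(rds_file)
identical(X, X_copy)
```
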

Exercises:

Task 1: Download the file surveys_pract3.txt and save it in the data folder of your R project. The data set contains vertebrate survey data with information about the survey date (year, month and day columns), the survey plot and the surveyed species (species id, sex, hindfoot length and weight) (Ernest et al. 2018).

  • Read the file into your workspace (hint: separator? header?) and inspect the data set (e.g. with str() and summary()). If you see an error message, open the file in a text editor to find out what the problem might be.
  • Are the data types of the columns correct? (hint: categorical variables should be factor and not character)
  • Do you notice something about the sex column?
  • Load the data again, this time making sure that all columns have the correct type and missing values are filled with NA.
  • Save the data set as surveys_pract3_correct.csv in the data folder. Specify a suitable separator.

References

Ernest, M., J. Brown, T. Valone, and E. White. 2018. “Portal Project Teaching Database.” Figshare Dataset. https://doi.org/10.6084/m9.figshare.1314459.
R Core Team. 2019. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.