RStudio project

Open the RStudio project that we created in the previous session. We recommend to use this RStudio project for the entire course and within the RStudio project create separate R scripts for each session.

Create a new empty R script by going to the tab “File”, select “New File” and then “R script”. In the new R script, type # Session 7: Simple data wrangling and save the file in your folder “scripts” within your project folder, e.g. as “7_DataWrangling.R”

R offers different options for inspecting, subsetting and manipulating data. We use the iris data that is contained in the R distribution

data(iris)

# Inspect the first six lines
head(iris)

# Inspect the last six lines
tail(iris)

# Check data structure:
str(iris)

1 Extract data

We have already seen how to subset columns by indexing:

iris[, c('Petal.Length')]
summary(iris[, c('Petal.Length')])

Rows can also be extracted using indexing, or by using the function subset(). Here, we need the logical and relational operators mentioned earlier

# selecting rows using indexing
iris[iris$Species == 'virginica', ]
iris[iris$Species == 'virginica' & iris$Sepal.Length > 6.5, ]

# selecting rows using subset
subset(iris, Species == 'virginica')
subset(iris, Species == 'virginica' & Sepal.Length > 6.5)

2 Join/merge data

When processing data, it is often necessary to join/merge different data sets. As a simple example, we create two data frames, one with a price list for groceries and one with our shopping list.

prices_market <- data.frame(product=c('apples', 'tomatoes', 'blueberries', 'strawberries'), price_per_kg = c(2.5, 3.5, 5.20, 4.50))
shopping_list <- data.frame(product = c('apples', 'strawberries', 'bananas'), amount_in_kg = c(1.5, 0.5, 2))

We now use the function merge() for joining the two data frames. Thereby, we have the option of ignoring non-matching entries or filling these up with NAs.

# The joined data frame contains only matching entries:
merge(prices_market, shopping_list, by = 'product')

# NAs will be added for the non-matching entries
merge(prices_market, shopping_list, by = 'product', all = TRUE)

3 Family `apply`

If you want to do complex operations on matrices, data frames or lists, apply() and its derivates are most often the tool of choice as they run a lot faster than loops. This is what is often termed vectorizing your code.

apply() allows row-wise and column-wise execution of functions over matrices, data.frames and arrays. Thereby, the argument MARGIN refers to the dimension you want the operation to run over (1 means row-wise, 2 column-wise). Also you have to provide an argument FUN that specifies which function you want to run over the array. In the following examples, we calculate the mean.

# Build a numeric matrix
(Xmat = matrix(1:12, nrow = 4)) 

# Calculate the mean of all rows
apply(X = Xmat, MARGIN = 1, FUN = mean) # named version
apply(Xmat, 1, mean) # same thing, but without naming the arguments

# Calculate the mean of all columns
apply(X = Xmat, MARGIN = 2, FUN = mean) # named version
apply(Xmat, 2, mean) # same thing, but without naming the arguments

You can also define own anonymous functions in the form FUN=function(x){...}.

(Xdf <- data.frame(x = rep(1, 10), y = rep (3, 10)))

apply(Xdf, 1, FUN = function(x) {sum(x) ^ 2})

Also check out `lapply() and sapply() that are useful functions for operations over vectors or lists.

3.1 rowSums(), rowMeans() etc

There are some convenient and fast implementations to form row and column sums and means for numeric matrices (or arrays) and data frames. They are equivalent to using the function mean or sum with apply() as above but probably easier to handle for you.

# Calculate the mean and sum of all rows
rowMeans(Xmat)
rowSums(Xmat)

# Calculate the mean and sum of all columns

colMeans(Xmat)
colSums(Xmat)

# Get unique entries (meaningful when you have recurring categories)

unique(Xmat[,1])

4 Manipulate data

4.1 Analyse data by groups

aggregate() lets you calculate summaries of groups of your data. The grouping can be expressed with the by argument, but you can also use a formula syntax. In both cases, you can also summarize by multiple grouping variables. The function returns a data frame containing the calculated summaries for each group. You can think of the function as “summarizing variable x by groups Z, using function f”.

# calculate standard deviation and mean of the petal length for each iris species

# grouping via "by" argument
aggregate(iris$Petal.Length, by = list(iris$Species), FUN = 'mean') # mean
aggregate(iris$Petal.Length, by = list(iris$Species), FUN = 'sd') # standard dev

# grouping via formula syntax
aggregate(Petal.Length ~ Species, data = iris, FUN = 'mean') # mean
aggregate(Petal.Length ~ Species, data = iris, FUN = 'sd') # mean

# user-defined function: calculating the percentage of flowers with petals longer than 4.5 cm within each species
aggregate(Petal.Length ~ Species, data = iris, FUN = function(x) length(which(x > 4.5))/length(x) * 100)

4.2 Manipulating character strings

Manipulation of strings is important for piecing together character strings, e.g. for automatically processing a list of files, for generating formulas, or when you want to add text or mathematical annotations to a plot. Here are some useful functions for string manipulations:

paste(): Concatenate vectors after conversion to character
strsplit(): Split characters strings into substrings
nchar: Count number of characters
grep(): Pattern matching
sub(), gsub: Pattern replacement
substr(), substring: Extract or replace substrings
toupper(), tolower(): Translate characters from lower to upper case and vice versa

Let’s take a look at some examples.

# make variable names
(varnames <- paste('Variable', 1:10, sep='_'))

# Split variable names at the underscore
strsplit(varnames[1:2],'_')

# Replace an expression
sub('Variable','Covariate',varnames)

# Find variable 10
grep('10', varnames) # finds index
grep('10', varnames, value=T) # returns name

# Find the letter 'h' in the alphabet
grep('h', letters)  # get index
grep('h', letters, value=T)  # get value (trivial in this example)

Main exercise block 2:

We use a dataset on bird abundances at Hubbard Brook (1969-2010) that you can download here hb_bird.txt Read more about the metadata here.

Task 1: Download and import the hb_bird.txt data.

Read in the hb_bird data using the reading function read.delim().
What is the main difference between read.delim() and read.table()?

Task 2: Explore the data set (Hint: use str() to understand how the data set is structured).

How many bird species are included in the hb_bird data?
Missing data are indicated by a “t”. Find these and replace them with an NA. To do so, you can write a loop to iterate through all columns seperately. I provide a few more hints how to structure the loop below. However, it is often advisable to first define and test the function that should be carried out within the loop. To do so,
Select the second column hb_bird[,2] and find all elements in this column that contain a “t”. Use the grep() function for this as we did above when searching for the letter h in the alphabet. Store the index/indices you get as result of grep() in a new variable, e.g. row_index and use this to identify the elements in the selected column that should be replaced. Also, turn the column into numeric values using the function as.numeric().
If all of the above works well, then put it into a for loop. Hint: the loop header could be for (i in 2:ncol(hb_bird)) meaning that you loop through the indices of all columns starting at column 2. The loop index i can then be used to select the ith column, and the rest is done analogous to above.
Run the loop over the entire data frame.
Use summary() to explore your resulting data frame.
In which year do you have most NAs ?
Which bird species has the highest average abundance? Hint: use rowMeans() and take care to ignore NA by adding the argument na.rm=T in the function call.

Task 3: Visualise bird abundance

Plot the abundance over time for the bird species with the highest average abundance and the bird species with the lowest average abundance.
Select three years (e.g. 1970, 1990, and 2010) and plot the abundances of all species in boxplots.

Simple data wrangling