RStudio project
Open the RStudio project that we created in the previous session. We recommend to use this RStudio project for the entire course and within the RStudio project create separate R scripts for each session.
# Session 7: Simple data wrangling
and save the file in
your folder “scripts” within your project folder, e.g. as
“7_DataWrangling.R”R offers different options for inspecting, subsetting and manipulating data. We use the iris data that is contained in the R distribution
data(iris)
# Inspect the first six lines
head(iris)
# Inspect the last six lines
tail(iris)
# Check data structure:
str(iris)
We have already seen how to subset columns by indexing:
iris[, c('Petal.Length')]
summary(iris[, c('Petal.Length')])
Rows can also be extracted using indexing, or by using the function
subset()
. Here, we need the logical and relational
operators mentioned earlier
# selecting rows using indexing
iris[iris$Species == 'virginica', ]
iris[iris$Species == 'virginica' & iris$Sepal.Length > 6.5, ]
# selecting rows using subset
subset(iris, Species == 'virginica')
subset(iris, Species == 'virginica' & Sepal.Length > 6.5)
When processing data, it is often necessary to join/merge different data sets. As a simple example, we create two data frames, one with a price list for groceries and one with our shopping list.
prices_market <- data.frame(product=c('apples', 'tomatoes', 'blueberries', 'strawberries'), price_per_kg = c(2.5, 3.5, 5.20, 4.50))
shopping_list <- data.frame(product = c('apples', 'strawberries', 'bananas'), amount_in_kg = c(1.5, 0.5, 2))
We now use the function merge()
for joining the two data
frames. Thereby, we have the option of ignoring non-matching entries or
filling these up with NAs.
# The joined data frame contains only matching entries:
merge(prices_market, shopping_list, by = 'product')
# NAs will be added for the non-matching entries
merge(prices_market, shopping_list, by = 'product', all = TRUE)
apply
If you want to do complex operations on matrices, data frames or
lists, apply()
and its derivates are most often the tool of
choice as they run a lot faster than loops. This is what is often termed
vectorizing your code.
apply()
allows row-wise and column-wise execution of
functions over matrices, data.frames and arrays. Thereby, the argument
MARGIN
refers to the dimension you want the operation to
run over (1 means row-wise, 2 column-wise). Also you have to provide an
argument FUN
that specifies which function you want to run
over the array. In the following examples, we calculate the mean.
# Build a numeric matrix
(Xmat = matrix(1:12, nrow = 4))
# Calculate the mean of all rows
apply(X = Xmat, MARGIN = 1, FUN = mean) # named version
apply(Xmat, 1, mean) # same thing, but without naming the arguments
# Calculate the mean of all columns
apply(X = Xmat, MARGIN = 2, FUN = mean) # named version
apply(Xmat, 2, mean) # same thing, but without naming the arguments
You can also define own anonymous functions in the form
FUN=function(x){...}
.
(Xdf <- data.frame(x = rep(1, 10), y = rep (3, 10)))
apply(Xdf, 1, FUN = function(x) {sum(x) ^ 2})
Also check out `lapply()
and sapply()
that
are useful functions for operations over vectors or lists.
There are some convenient and fast implementations to form row and
column sums and means for numeric matrices (or arrays) and data frames.
They are equivalent to using the function mean
or
sum
with apply()
as above but probably easier
to handle for you.
# Calculate the mean and sum of all rows
rowMeans(Xmat)
rowSums(Xmat)
# Calculate the mean and sum of all columns
colMeans(Xmat)
colSums(Xmat)
# Get unique entries (meaningful when you have recurring categories)
unique(Xmat[,1])
aggregate()
lets you calculate summaries of groups of
your data. The grouping can be expressed with the by
argument, but you can also use a formula syntax. In both cases, you can
also summarize by multiple grouping variables. The function returns a
data frame containing the calculated summaries for each group. You can
think of the function as “summarizing variable x by groups Z, using
function f”.
# calculate standard deviation and mean of the petal length for each iris species
# grouping via "by" argument
aggregate(iris$Petal.Length, by = list(iris$Species), FUN = 'mean') # mean
aggregate(iris$Petal.Length, by = list(iris$Species), FUN = 'sd') # standard dev
# grouping via formula syntax
aggregate(Petal.Length ~ Species, data = iris, FUN = 'mean') # mean
aggregate(Petal.Length ~ Species, data = iris, FUN = 'sd') # mean
# user-defined function: calculating the percentage of flowers with petals longer than 4.5 cm within each species
aggregate(Petal.Length ~ Species, data = iris, FUN = function(x) length(which(x > 4.5))/length(x) * 100)
Manipulation of strings is important for piecing together character strings, e.g. for automatically processing a list of files, for generating formulas, or when you want to add text or mathematical annotations to a plot. Here are some useful functions for string manipulations:
paste()
: Concatenate vectors after conversion to
characterstrsplit()
: Split characters strings into
substringsnchar
: Count number of charactersgrep()
: Pattern matchingsub()
, gsub
: Pattern replacementsubstr()
, substring
: Extract or replace
substringstoupper()
, tolower()
: Translate characters
from lower to upper case and vice versaLet’s take a look at some examples.
# make variable names
(varnames <- paste('Variable', 1:10, sep='_'))
# Split variable names at the underscore
strsplit(varnames[1:2],'_')
# Replace an expression
sub('Variable','Covariate',varnames)
# Find variable 10
grep('10', varnames) # finds index
grep('10', varnames, value=T) # returns name
# Find the letter 'h' in the alphabet
grep('h', letters) # get index
grep('h', letters, value=T) # get value (trivial in this example)
Main exercise block 2:
We use a dataset on bird abundances at Hubbard Brook (1969-2010) that
you can download here hb_bird.txt
Read more about the
metadata here.
Task 1: Download and import the hb_bird.txt
data.
hb_bird
data using the reading function
read.delim()
.read.delim()
and
read.table()
?Task 2: Explore the data set (Hint: use str()
to
understand how the data set is structured).
hb_bird
data?NA
. To do so, you can write a loop to iterate
through all columns seperately. I provide a few more hints how to
structure the loop below. However, it is often advisable to first define
and test the function that should be carried out within the loop. To do
so,hb_bird[,2]
and find all
elements in this column that contain a “t”. Use the grep()
function for this as we did above when searching for the letter
h in the alphabet. Store the index/indices you get as result of
grep()
in a new variable, e.g. row_index
and
use this to identify the elements in the selected column that should be
replaced. Also, turn the column into numeric values using the function
as.numeric()
.for
loop. Hint: the loop header could be
for (i in 2:ncol(hb_bird))
meaning that you loop through
the indices of all columns starting at column 2. The loop index
i
can then be used to select the ith column, and
the rest is done analogous to above.summary()
to explore your resulting data
frame.NA
s ?rowMeans()
and take care to ignore NA
by
adding the argument na.rm=T
in the function call.Task 3: Visualise bird abundance