RStudio project

Open the RStudio project that we created in the first session. I recommend to use this RStudio project for the entire module and within the RStudio project create separate R scripts for each session.

  • Create a new empty R script by going to the tab “File”, select “New File” and then “R script”
  • In the new R script, type # Session b1: Species data and save the file in your folder “scripts” within your project folder, e.g. as “b1_SpeciesData.R”

Many different types of biodiversity data exist, e.g. from standardised monitoring schemes, citizen science platforms, or expert knowledge. Each comes with own challenges. Here, we will concentrate on GBIF as publicly available source for terrestrial species occurrence data.

GBIF stands for Global Biodiversity Information Facility. GBIF defines itself as “an international network and research infrastructure funded by the world’s governments and aimed at providing anyone, anywhere, open access to data about all types of life on Earth”. GBIF contains worldwide point records from observations as well as museum records or other contributions. The data are thus not standardised and often it is unclear which spatial resolution the data represent. To understand this, let’s look at some example under https://www.gbif.org/.

1 Occurrence data from GBIF

We will use the package rgbif to search and retrieve data from GBIF. A good tutorial to this package is offered here. Remember that we can install new packages with the function install.packages(). Alternatively, the dismo package offers the function gbif() to download gbif records. Robert Hijmans, who wrote the raster and the dismo package, also offers great tutorials on his website: http://rspatial.org.

library(rgbif)

# Check out the number of occurrences found in GBIF:
occ_count()
## [1] 2275004213
# number of observations:
occ_count(basisOfRecord='OBSERVATION')
## [1] 23372979
# number of occurrences reported for Germany:
occ_count(country=isocodes[grep("Germany", isocodes$name), "code"])
## [1] 49292805
# number of observations reported for Germany:
occ_count(country=isocodes[grep("Germany", isocodes$name), "code"],basisOfRecord='OBSERVATION')
## [1] 155477

As example, I picked the Alpine shrew (Sorex alpinus) for today, a small mammal species occurring in Central and Suuth-Eastern European mountain ranges. Its conservation status is near threatened (Link to IUCN redlist).

**Figure 1. The Alpine shrew (*Sorex alpinus*). Picture by Dr. Richard Kraft, downloaded from https://kleinsaeuger.at/sorex-alpinus.html.**

We first check whether any synonyms exist and how many records exist for the species. Download will be slow for high numbers of records.

# Check for synonyms
name_suggest(q='Sorex alpinus', rank='species')
## Records returned [1] 
## No. unique hierarchies [0] 
## Args [q=Sorex alpinus, limit=100, rank=species, fields1=key,
##      fields2=canonicalName, fields3=rank] 
## # A tibble: 1 × 3
##       key canonicalName rank   
##     <int> <chr>         <chr>  
## 1 2435986 Sorex alpinus SPECIES
# Check number of records - here filtered to those with coordinate information
occ_search(scientificName = "Sorex alpinus", hasCoordinate=T, limit = 10)
## Records found [1271] 
## Records returned [10] 
## No. unique hierarchies [1] 
## No. media records [10] 
## No. facets [0] 
## Args [hasCoordinate=TRUE, limit=10, offset=0, scientificName=Sorex alpinus,
##      fields=all] 
## # A tibble: 10 × 83
##    key        scientificName  decimalLatitude decimalLongitude issues datasetKey
##    <chr>      <chr>                     <dbl>            <dbl> <chr>  <chr>     
##  1 3824140382 Sorex alpinus …            48.7             8.36 cdrou… 8a863029-…
##  2 3784645455 Sorex alpinus …            46.8            11.3  cdrou… 50c9509d-…
##  3 3873231122 Sorex alpinus …            46.5            10.6  cdrou… 50c9509d-…
##  4 3881729801 Sorex alpinus …            47.5            10.7  cdrou… 50c9509d-…
##  5 3892282027 Sorex alpinus …            46.6             6.11 cdrou… 50c9509d-…
##  6 3902041214 Sorex alpinus …            47.1            12.7  cdrou… 50c9509d-…
##  7 3904820501 Sorex alpinus …            47.8             8.34 cdrou… 8a863029-…
##  8 3906193269 Sorex alpinus …            48.8             8.44 cdrou… 8a863029-…
##  9 3902763723 Sorex alpinus …            50.2            17.3  cdrou… 50c9509d-…
## 10 3923242090 Sorex alpinus …            46.9             6.64 cdrou… 50c9509d-…
## # … with 77 more variables: publishingOrgKey <chr>, installationKey <chr>,
## #   publishingCountry <chr>, protocol <chr>, lastCrawled <chr>,
## #   lastParsed <chr>, crawlId <int>, hostingOrganizationKey <chr>,
## #   basisOfRecord <chr>, individualCount <int>, occurrenceStatus <chr>,
## #   lifeStage <chr>, taxonKey <int>, kingdomKey <int>, phylumKey <int>,
## #   classKey <int>, orderKey <int>, familyKey <int>, genusKey <int>,
## #   speciesKey <int>, acceptedTaxonKey <int>, acceptedScientificName <chr>, …

One of the most interesting items from the outputs is the “Records found” at the very top of the output. Please be aware that occ_search() will not allow to download more than 100’000 records. If the GBIF data contain more, then you can set additional filters (e.g. set time period with argument year or geographic extent with arguments decimalLatitude and decimalLongitude) or split the area into spatial tiles (by setting geographic extent) and download the tiles separately.

GBIF data also contain a lot of extra information about observer, basis of record etc. Take a look at the arguments you can set in ?occ_search to get an overview. For example, we can limit the data to those observations that were done by humans:

occ_search(scientificName = "Sorex alpinus", hasCoordinate=T, basisOfRecord='HUMAN_OBSERVATION', limit = 10)
## Records found [614] 
## Records returned [10] 
## No. unique hierarchies [1] 
## No. media records [10] 
## No. facets [0] 
## Args [hasCoordinate=TRUE, limit=10, offset=0, scientificName=Sorex alpinus,
##      basisOfRecord=HUMAN_OBSERVATION, fields=all] 
## # A tibble: 10 × 83
##    key        scientificName  decimalLatitude decimalLongitude issues datasetKey
##    <chr>      <chr>                     <dbl>            <dbl> <chr>  <chr>     
##  1 3824140382 Sorex alpinus …            48.7             8.36 cdrou… 8a863029-…
##  2 3784645455 Sorex alpinus …            46.8            11.3  cdrou… 50c9509d-…
##  3 3873231122 Sorex alpinus …            46.5            10.6  cdrou… 50c9509d-…
##  4 3881729801 Sorex alpinus …            47.5            10.7  cdrou… 50c9509d-…
##  5 3892282027 Sorex alpinus …            46.6             6.11 cdrou… 50c9509d-…
##  6 3902041214 Sorex alpinus …            47.1            12.7  cdrou… 50c9509d-…
##  7 3904820501 Sorex alpinus …            47.8             8.34 cdrou… 8a863029-…
##  8 3906193269 Sorex alpinus …            48.8             8.44 cdrou… 8a863029-…
##  9 3902763723 Sorex alpinus …            50.2            17.3  cdrou… 50c9509d-…
## 10 3923242090 Sorex alpinus …            46.9             6.64 cdrou… 50c9509d-…
## # … with 77 more variables: publishingOrgKey <chr>, installationKey <chr>,
## #   publishingCountry <chr>, protocol <chr>, lastCrawled <chr>,
## #   lastParsed <chr>, crawlId <int>, hostingOrganizationKey <chr>,
## #   basisOfRecord <chr>, individualCount <int>, occurrenceStatus <chr>,
## #   lifeStage <chr>, taxonKey <int>, kingdomKey <int>, phylumKey <int>,
## #   classKey <int>, orderKey <int>, familyKey <int>, genusKey <int>,
## #   speciesKey <int>, acceptedTaxonKey <int>, acceptedScientificName <chr>, …

Now, let’s download the records and plot them. We had a bit less than 600 records for the shrew, and use this as limit. Of course, the map will only help us judging the data quality if we have a rough idea where the species should occur. So, look up the species on the web first!

gbif_shrew <- occ_search(scientificName = "Sorex alpinus", hasCoordinate=T, basisOfRecord='HUMAN_OBSERVATION', limit = 600) 

# We are just interested in the data frame containing the records
gbif_shrew <- gbif_shrew$data

library(maptools)
data(wrld_simpl)
plot(wrld_simpl,xlim=c(5,50), ylim=c(40,55))
points(gbif_shrew$decimalLongitude, gbif_shrew$decimalLatitude, col='red',  pch=19)

2 Cleaning and cross-checking occurrence data

You should always critically assess the quality of your data. This also holds true for GBIF data. Look at the map again, do all records look plausible?

Not all coordinates seem to be correct, and we thus need to cross-check these. Robert Hijmans provides some code in his dismo tutorials. We will here use the new package CoordinateCleaner (Zizka et al. (2019); see tutorials here). The function clean_coordinates() allows cleaning geographic coordinates using different cross-checks. Here, we first compare whether the coordinates for each entry match the country code provided (test centroids) for each entry and are no outliers (test outliers). Also, we test for vicinity to a biodiversity institution like botanical or zoological gardens (test ìnstitutions and radius inst_rad), and for duplicates (test duplicates). There are many more tests that you can check in the help pages ?clean_coordinates.

library(CoordinateCleaner)

# We use only those data entries with coordinate information - Note that you don't need this if you have used the hasCoordinate=T in the occ_search() function:
gbif_shrew <- subset(gbif_shrew, !is.na(decimalLatitude))

# We now clean the coordinates and check for outliers - see ?clean_coordinates for more options
gbif_shrew_cleaned_coord <- clean_coordinates(gbif_shrew, lon="decimalLongitude", lat="decimalLatitude", countries="countryCode", tests=c("centroids", "outliers", "duplicates", "institutions"), inst_rad = 1000)
## Testing coordinate validity
## Flagged 0 records.
## Testing country centroids
## Flagged 0 records.
## Testing geographic outliers
## Flagged 15 records.
## Testing biodiversity institutions
## Flagged 0 records.
## Testing duplicates
## Flagged 189 records.
## Flagged 204 of 600 records, EQ = 0.34.
# Plot world map
plot(wrld_simpl,xlim=c(5,50), ylim=c(40,55))
# Plot all gbif points downloaded
points(gbif_shrew$decimalLongitude, gbif_shrew$decimalLatitude, col='red',  pch=19)
# Plot all remaining points after cleaning
points(gbif_shrew$decimalLongitude[gbif_shrew_cleaned_coord$.summary], gbif_shrew$decimalLatitude[gbif_shrew_cleaned_coord$.summary], col='blue',  pch=18)

# Store the cleaned point locations in a new object
gbif_shrew_cleaned <- gbif_shrew[gbif_shrew_cleaned_coord$.summary,]

Have a look at Zizka et al. (2019) and the examples by Robert Hijmans (http://rspatial.org for finding out about other typical problems with GBIF and how to deal with these.

Finally, save your data, for example by writing the final data frame to file or by saving the R object(s).

save(gbif_shrew_cleaned,file='data/gbif_shrew_cleaned.RData')

Exercise:

Pick a species of your choice, look it up on the web (www.gbif.org), and then download and clean the GBIF data using occ_search() and clean_coordinates().

  • Be careful with very large data sets as download will take long. Try, for example, the Iberian lynx, the Gorilla, the Caucasian Salamander, or any other species with max. 500-1000 records.
  • Summarise the data: how many occurrence records are there in total, how many with coordinates, how many human observations, which time span do the records cover?
  • Map the data
  • Clean the data: how many geographic outliers did you find, how many duplicates?

3 [EXTRA] Occurrence data for marine species: OBIS

Although we will concentrate on terrestrial species in the practicals, I would at least like to mention that marine data are available from OBIS, the the Ocean Biodiversity Information System. Currently, the OBIS database holds more than 135 000 marine species. We can access the database using the package robis. This can be installed from github.

library(devtools)
install_github("iobis/robis")

You can visually check out the data on the OBIS Mapper webpage. Occurrence can be downloaded from the website or using the function occurrence() from robis package.

library(robis)

# Download OBIS data for the sea otter
otter <- occurrence('Enhydra lutris')

# Map data
library(maps)
map('world')
points(otter$decimalLongitude, otter$decimalLatitude, col='blue',  pch=18)

Showing the occurrence records in our typically Atlantic-centred world view is a little confusing. We can obtain a Pacific-centred view by simply wrapping the maps.

map("world", wrap=c(0,360))
points(ifelse(otter$decimalLongitude<0,otter$decimalLongitude+360,otter$decimalLongitude), otter$decimalLatitude, col='blue',  pch=18)

Overall, the data look quite plausible. In general, however, we also need to pay attention to data cleaning in OBIS data. As the OBIS data follow the same data standards as GBIF data, we can also use the package CoordinateCleaner for data cleaning.

# We test for spatial outliers and remove duplicates
otter_cleaned_coord <- clean_coordinates(otter, lon="decimalLongitude", lat="decimalLatitude", tests=c("duplicates","outliers"))
## Testing coordinate validity
## Flagged 0 records.
## Testing geographic outliers
## Flagged 0 records.
## Testing duplicates
## Flagged 2140 records.
## Flagged 2140 of 3259 records, EQ = 0.66.
# remove the flagged records
otter_cleaned <- otter[otter_cleaned_coord$.summary,]

References

Zizka, A., D. Silvestro, T. Andermann, J. Azevedo, C. Duarte Ritter, D. Edler, H. Farooq, et al. 2019. CoordinateCleaner : Standardized Cleaning of Occurrence Records from Biological Collection Databases.” Methods in Ecology and Evolution 10 (5): 744–51. https://doi.org/10.1111/2041-210x.13152.