Improving species biodiversity analyses and citizen science feedback through mining data

Interdisciplinary project funded by the Swiss Data Science Center (ETH Board, 2017-2019; awarded to Niklaus E. Zimmermann and co-PIs Dirk N. Karger and Damaris Zurell).


To conserve and manage biodiversity, we need a better understanding of the essential drivers of biodiversity and better predictions of the resulting biodiversity patterns in space and time. Here, we propose a novel approach based on data mining and iterative machine learning to improve biodiversity models, make better use of existing data, and guide future sampling efforts. Modern data-mining techniques are well placed to improve traditional species distribution modelling. On the one hand, massive amounts of biodiversity data are becoming available through citizen science and technical advances in monitoring, with growing volumes of data on species occurrences, morphological traits, evolutionary history, and environmental variables. On the other hand, these data are often incomplete: clear sampling designs are missing, and information is not equally accurate or complete for all species. Modern machine learning algorithms could help fill these gaps by finding a way through the maze of uncertainties in these data, in which scientists so easily get lost.

Although the approach is applicable to any species group worldwide, the proposed project will focus on floristic data from Switzerland as a pilot system in which to set up and study the benefits of data-mining and machine learning techniques for biodiversity assessments. Structured into four work packages, the project will combine various data sources in a novel way and strengthen the link between ecological sciences and citizen science. In doing so, it will pave the way towards automated quality checks of citizen science data, improved uncertainty analyses, the identification of hidden information in large-scale biodiversity inventories, and real-time guidance of observer efforts in citizen-science-based data collection. Key milestones of the project include: a) an operational framework linking real-time data streams and a citizen science interface, b) iterative models of the distribution of individual species and the associated spatiotemporal uncertainty patterns, built using machine learning and data mining, c) a meta-learning approach to detect ecologically relevant, higher-level processes structuring biodiversity, and d) a model-based catalogue of criteria to guide citizen scientists towards improved data collection.