I’m excited to be a part of Big Data Week this year. For those of you who aren’t familiar with the phenomenon of big data, IBM has a pretty good definition. In essence, we are collecting huge amounts of data by virtue of living in a technologically advanced world, and those data are collected rapidly in a diverse range of formats. The challenge now is what to do with all of it! Big Data Week, which is running from 22-28 April 2013, is an international movement that was established in 2011 to connect businesses, data scientists, and technology groups to explore novel social, political, technological and commercial applications of big data. Leeds Data Thing is my local big data group, formed in 2013 to provide a venue for the discussion of local big data applications. They are putting on a range of events for BDW 2013, and I have volunteered to give a short presentation at one of those events.
So why am I involved?
I have been interested in large biological datasets for several years (a current analysis that I am working on uses 23 million records of sightings of plants and animals from across the UK over the past couple of centuries). In some respects, the field of ecology has been involved with big data for a long time! In particular, I am interested in pulling out long-term patterns from these kinds of datasets. I am also involved with the Yorkshire Dales Environment Network (YDEN), which is based at the University of Leeds and which seeks to link conservation organisations (like Yorkshire Wildlife Trust and BugLife), government agencies (like the Environment Agency and the Forestry Commission), and landowners (for example the National Farmers Union). YDEN is currently looking at ways in which we can enhance the quality, quantity, and ease-of-use of biological recording data in the Yorkshire Dales National Park (YDNP), with the aim of promoting the conservation of Yorkshire wildlife, and it is on that topic that I will be speaking at the Leeds “Bring Your Own Data” evening on 24th April 2013.
Biological data in the Yorkshire Dales
Just like any other scientific endeavour, the conservation of the natural world relies on the collection and analysis of data to make informed decisions. In this sense, it bears a close resemblance to medicine: both are “crisis disciplines” where we are forced to make important decisions without access to all the necessary information, both need customised solutions to particular problems, and both need a scientific approach to find which interventions work and which do not. It is my hope that there are applications of big data that can be used to close the knowledge gap and aid in wildlife conservation. There are three specific needs that we have identified:
1. Data collection – At present, there is a considerable amount of information about the YDNP floating around on the internet and on various servers. However, quality control has been variable and the data come from a wide variety of sources. A major need is a fresh start in data collection and curation, using techniques and tools that are easy to use. These data will include habitat surveys (describing the areas in which species are found) and species surveys (recording where precisely the species of interest are found). We have considered the use of existing web apps as well as particular programs, and data collection could be by a combination of skilled surveyors and the general public. Data can also be collected from other sources (such as satellites and aerial photographs).
2. Data analysis – Without wanting to go too far into the technical aspects of the analysis, one of the major tools used in conservation biology is the “habitat suitability model” (a type of species distribution model). This approach takes information about where a species is known to occur and uses that information to predict where else it might be able to occur. For example, we can sample a small number of sites in the YDNP and see whether a butterfly is present. If we assume that the sites at which the butterfly is present constitute the suitable habitat, we can then use our knowledge of the rest of the park to predict the butterfly’s distribution based on where else those suitable conditions are present. However, the tools for implementing these models are complex – we would like to make them more straightforward for conservation practitioners, who often don’t have the training or the resources to carry out these analyses. There is a range of other analyses that can also be carried out, and we’d be interested in hearing more about ideas for general tools.
3. Data visualisation – I first became interested in Leeds Data Thing after seeing a presentation on OpenStreetMap. This seems like one potentially powerful tool to visualise geographical data without the need for expensive and complex mapping tools (like ArcGIS, the industry standard, or QGIS, an open source alternative with a similarly steep learning curve). We would be interested in ways to visualise both the underlying data (habitat and species) and the results of the habitat suitability models described above. Ultimately, while the underlying data and models are the true evidence behind the conservation decisions, clarity in the presentation of those data is fundamental to making a case for a particular course of action.
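To give a flavour of the habitat suitability idea described under data analysis above, here is a minimal sketch in Python. It is not the method YDEN uses, and every site, variable name and value in it is invented for illustration. It uses about the simplest approach I can think of, an environmental "envelope": take the range of conditions at the sites where the butterfly was found, then flag any unsurveyed grid cell whose conditions fall inside that range. Real species distribution models are considerably more sophisticated, and this one ignores the absence records entirely.

```python
# Hypothetical survey data: each site has two environmental measurements and
# a flag saying whether the butterfly was seen. All values are invented.
surveyed_sites = [
    {"elevation_m": 250, "grass_cover": 0.70, "present": True},
    {"elevation_m": 310, "grass_cover": 0.55, "present": True},
    {"elevation_m": 400, "grass_cover": 0.60, "present": True},
    {"elevation_m": 620, "grass_cover": 0.20, "present": False},
    {"elevation_m": 150, "grass_cover": 0.90, "present": False},
]

# Hypothetical variable names; a real model would use many more predictors.
VARIABLES = ("elevation_m", "grass_cover")


def fit_envelope(sites, variables=VARIABLES):
    """Return the (min, max) of each variable across the presence sites."""
    presences = [site for site in sites if site["present"]]
    if not presences:
        raise ValueError("need at least one presence record")
    return {
        var: (min(s[var] for s in presences), max(s[var] for s in presences))
        for var in variables
    }


def is_suitable(cell, envelope):
    """A cell is suitable if every variable lies inside the envelope."""
    return all(low <= cell[var] <= high for var, (low, high) in envelope.items())


# Hypothetical unsurveyed grid cells elsewhere in the park.
unsurveyed_cells = {
    "cell_A": {"elevation_m": 300, "grass_cover": 0.65},
    "cell_B": {"elevation_m": 550, "grass_cover": 0.60},
    "cell_C": {"elevation_m": 380, "grass_cover": 0.30},
}

envelope = fit_envelope(surveyed_sites)
predictions = {
    name: is_suitable(cell, envelope) for name, cell in unsurveyed_cells.items()
}
print(envelope)
print(predictions)
```

With these made-up numbers only cell_A falls inside the envelope on both variables, so it is the only cell predicted as suitable. Painting those predictions onto a map is exactly the kind of output that the visualisation tools mentioned above would need to handle.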
I look forward to engaging with the LDT and BDW crowd tomorrow night, and here are the slides that I will be presenting:
Image sources (from top to bottom, and in the slideshow): Big Data Week logo; Paul Barber; Marek, P., Shear, W., Bond, J. (2012). “A redescription of the leggiest animal, the millipede Illacme plenipes, with notes on its natural history and biogeography (Diplopoda, Siphonophorida, Siphonorhinidae)”. ZooKeys 241: 77. doi:10.3897/zookeys.241.3831; OpenStreetMap logo.