The New York Botanical Garden Canopy Classification Dataset - A Dataset for Machine Learning Tasks in Biodiversity Informatics

by Christian Kittim Belardi

The Challenge

This summer, I worked at the New York Botanical Garden (NYBG) as a Siegel Family Endowment PiTech PhD Impact Fellow. I supported NYBG’s mission to develop biodiversity monitoring technology.

Motivation

Biodiversity monitoring is an important task in conservation biology. However, it is resource intensive: it requires time, money, expertise, and a large on-the-ground presence that is not always possible because of conflict, politics, and other obstacles. Traditional biodiversity monitoring techniques could potentially be supplemented with an inexpensive tool that uses remote sensing data and deep learning models to monitor biodiversity remotely.

Such a tool would enable faster detection and response to disease and detrimental anthropogenic activities (e.g. illegal logging). It would also be particularly useful for monitoring highly inaccessible areas. A major challenge to developing the deep learning models necessary to build this tool is the limited biodiversity data available. 

When faced with limited data, a common approach is to find a related task with plenty of available data, train a model for that task, and then fine-tune the model for the task of interest. This is known as transfer learning. NYBG had previously tried transfer learning to improve their biodiversity model, which struggled because of the lack of data for the biodiversity task. They trained the model for land cover classification and then fine-tuned it for biodiversity estimation, but found that this provided limited improvement. Essentially, the land cover classification task was not forcing the model to learn the complex features that would be useful for biodiversity estimation. Because the model was only ever asked to distinguish between forests, crops, cities, and so on, it never learned anything about the different species of trees. This motivated the development of the NYBG Canopy Classification Dataset. The idea behind it is that the features a model learns in order to distinguish between different species of tree will be more useful later, when estimating biodiversity.

The Project

Over the summer, I completed all the initial development for the NYBG Canopy Classification Dataset, building on previous work from NYBG. The dataset is constructed entirely from publicly available data collected from a number of sources. My first objective was to collect known occurrences of trees in order to find the areas around the world with the densest labelings. These occurrences were collected from the Global Biodiversity Information Facility (GBIF) and iDigBio. The primary challenge in this task was the lack of consistent naming across databases, so I relied on algorithms previously developed by NYBG to reconcile scientific names with the World Checklist of Vascular Plants. Once the scientific names were reconciled, it was simple to plot the occurrences of different tree species around the world and select regions with many occurrences as well as high species diversity among those occurrences.
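
The region selection step can be illustrated with a minimal sketch. It assumes occurrences are available as (species, latitude, longitude) records with names already reconciled; the one-degree grid, the function name rank_grid_cells, and the ranking rule (occurrence count first, then number of distinct species) are illustrative assumptions, not the procedure actually used at NYBG.

```python
from collections import defaultdict
import math


def rank_grid_cells(occurrences, cell_deg=1.0):
    """Bin (species, lat, lon) records into square grid cells and rank the cells.

    Hypothetical helper: cells are ranked by occurrence count, with the
    number of distinct species as a tie-breaker (an assumed rule).
    """
    counts = defaultdict(int)
    species = defaultdict(set)
    for name, lat, lon in occurrences:
        # Index of the grid cell that contains this occurrence.
        cell = (math.floor(lat / cell_deg), math.floor(lon / cell_deg))
        counts[cell] += 1
        species[cell].add(name)
    summary = [(cell, counts[cell], len(species[cell])) for cell in counts]
    summary.sort(key=lambda row: (row[1], row[2]), reverse=True)
    return summary


# Made-up example records, for illustration only.
occurrences = [
    ("Quercus rubra", 40.86, -73.88),
    ("Acer rubrum", 40.87, -73.87),
    ("Quercus rubra", 40.85, -73.89),
    ("Pinus strobus", 44.10, -71.20),
]
ranked = rank_grid_cells(occurrences)
best_cell, n_occurrences, n_species = ranked[0]
```

In this toy example the top-ranked cell holds three occurrences of two species. A real pipeline would likely weigh density and diversity more carefully than a simple sort.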

Next, for the selected regions, I collected satellite data from the European Space Agency’s (ESA) Copernicus program, which included elevation, multispectral, and synthetic aperture radar data. Aggregating the satellite data required a number of processing steps. First, complete images of each region had to be assembled from partial images that sometimes overlapped. Second, the data from the different sensor modalities were captured from different platforms and were therefore not temporally aligned, so I paired each multispectral image with the synthetic aperture radar image closest to it in time. Third, the elevation, multispectral, and synthetic aperture radar images had to be reprojected into the same coordinate reference system so that the structures in the data were aligned across all sensor modalities. During this processing, lower-resolution channels were upsampled to 10 m resolution using nearest neighbor interpolation, matching the highest-resolution channels. I then constructed a permanent water mask and a building mask for each region from the Copernicus Global Land Service.
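
Two of these steps, temporal pairing and nearest neighbor upsampling, are easy to sketch. The example below assumes acquisition times are available as datetime objects and that each channel is a 2D NumPy array whose resolution is an integer multiple of 10 m. The function names, dates, and the 20 m example channel are hypothetical, and mosaicking and reprojection are left out.

```python
from datetime import datetime

import numpy as np


def pair_closest_in_time(multispectral_times, sar_times):
    """For each multispectral time, return the index of the closest SAR time."""
    pairs = []
    for t in multispectral_times:
        closest = min(range(len(sar_times)), key=lambda i: abs(sar_times[i] - t))
        pairs.append(closest)
    return pairs


def upsample_nearest(channel, factor):
    """Nearest neighbor upsampling of a 2D array by an integer factor."""
    # Repeating every pixel along both axes copies the nearest source value.
    return np.repeat(np.repeat(channel, factor, axis=0), factor, axis=1)


# Hypothetical acquisition dates for one region.
ms_times = [datetime(2021, 6, 1), datetime(2021, 7, 15)]
sar_times = [datetime(2021, 5, 28), datetime(2021, 6, 20), datetime(2021, 7, 14)]
pairs = pair_closest_in_time(ms_times, sar_times)

# A made-up 20 m channel upsampled by a factor of 2 to a 10 m grid.
coarse = np.array([[1, 2], [3, 4]])
fine = upsample_nearest(coarse, 2)
```

Nearest neighbor interpolation only copies existing pixel values, so it never introduces measurements that the sensor did not record.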

Finally, I built tree masks for each region using the tree occurrences collected earlier. I projected the geopositions of the trees into the coordinate reference system used for the rest of the data and then drew a circle around each occurrence, with a radius that depends on the uncertainty of the GPS reading for that occurrence. Every batch of sensor data and its corresponding labels were stacked and stored as TIFs. The result is multiple TIFs of the same region over time, capturing the area with multiple sensing modalities, along with labels denoting the positions of different tree species in the region as well as water and buildings.
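
The mask drawing step might look roughly like the sketch below. It assumes tree positions have already been converted to pixel row and column coordinates on the 10 m grid and that the circle radius simply equals the positional uncertainty in meters. Both are assumptions for illustration, since the exact mapping from uncertainty to radius is not specified here, and draw_tree_mask is a hypothetical name.

```python
import numpy as np


def draw_tree_mask(shape, trees, pixel_size_m=10.0):
    """Rasterize tree occurrences as filled circles in a binary mask.

    trees: iterable of (row, col, uncertainty_m) in pixel coordinates.
    Assumption: radius in pixels = uncertainty_m / pixel_size_m, with a
    floor of half a pixel so every occurrence marks at least one pixel.
    """
    mask = np.zeros(shape, dtype=np.uint8)
    rows, cols = np.ogrid[:shape[0], :shape[1]]
    for row, col, uncertainty_m in trees:
        radius_px = max(uncertainty_m / pixel_size_m, 0.5)
        inside = (rows - row) ** 2 + (cols - col) ** 2 <= radius_px ** 2
        mask[inside] = 1
    return mask


# One made-up tree at pixel (3, 3) with 20 m positional uncertainty.
mask = draw_tree_mask((7, 7), [(3, 3, 20.0)])
```

One such mask would be produced per species, so that each label channel marks where a given species has been recorded.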

Dataset Details

The NYBG Canopy Classification Dataset can be thought of as a collection of images, but each image has many more channels than just red, green, and blue. The first 19 channels are meant to be the input to a model, and the remaining channels are labels to be used as ground truth during training. The first 19 channels include a digital elevation map, so the model can account for the elevation of the land, as well as multispectral and synthetic aperture radar data, from which we hope the model can learn the signatures of different tree species. The label channels contain a permanent water mask and a building mask, so the model can learn to recognize water and buildings, along with tree masks for the most common tree species in the region. The tree masks are essentially images with a dot everywhere there is a recorded tree, where the size of the dot depends on the uncertainty of the geographic position measurement. They provide the feedback that lets the model learn to recognize different species of tree.
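
As a rough illustration of this layout, the sketch below splits one stacked image into model inputs and labels. Only the count of 19 input channels comes from the description above. The channels-first array layout, the image size, and the number of species masks are assumptions, and a real pipeline would first read the array from a TIF with a raster library.

```python
import numpy as np

NUM_INPUT_CHANNELS = 19  # elevation, multispectral, and SAR channels


def split_inputs_and_labels(stack):
    """Split a (channels, height, width) array into inputs and label masks."""
    inputs = stack[:NUM_INPUT_CHANNELS]
    labels = stack[NUM_INPUT_CHANNELS:]
    return inputs, labels


# Placeholder stack: 19 inputs, a water mask, a building mask, and an
# assumed 3 species masks, on a made-up 8 x 8 pixel tile.
stack = np.zeros((19 + 2 + 3, 8, 8), dtype=np.float32)
inputs, labels = split_inputs_and_labels(stack)
```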

Impact and Path Forward

The primary outcome of this summer’s work is the NYBG Canopy Classification Dataset, which NYBG will use to develop biodiversity monitoring technology. Beyond this specific use case, we hope that others working in biodiversity informatics and related areas will find the dataset useful. We also think the dataset’s weak labeling presents an interesting problem for researchers in machine learning and remote sensing to explore. The dataset will be made publicly available in the future.

