1 Introduction

Marine scientists spend enormous amounts of resources on understanding and studying life in our oceans. These studies hold numerous benefits for environmental protection and scientific advancement, including the ability to identify areas of the ocean where certain habitats and substrates exist and where certain species gather. As scientists better understand biodiversity in the oceans and where in the ocean life flourishes, they can begin working toward more focused conservation efforts with those areas in mind. Further, scientists can revisit those same areas and perform surveys in the future to monitor how life is changing in the ocean as a result of conservation efforts (Fig. 1).

Fig. 1

Cropped frame from DUSIA with examples of three classes of interest: fragile pink urchin (blue), gray gorgonian (green), and squat lobster (red). Variations in perspective, occlusion, and size can create large differences across appearances of species individuals and make some individuals, like small squat lobsters, almost invisible, especially in a single frame. Crop shown is 690x487 pixels from a 1920x1080 frame (Color figure online)

A common method for studying underwater habitats consists of planning underwater routes, called transects, then following those paths and recording the environment either by a diver with a camera or using an underwater ROV (Shester et al., 2017; Drap et al., 2015). Once the transects have been recorded and the videos matched with their GPS locations, common annotation methods require researchers to review each video several times: annotating the substrates that the camera passes over in the first few passes, counting invertebrates in another pass, and counting fish species in a final pass, building a picture of where in the ocean particular substrates exist and where different species live. This information is vital to determining species hotspots and finding ways to protect the environment while also meeting human needs for usage of our oceans. These studies ultimately lead to new discoveries as they facilitate exploration of unknown oceanic regions. Currently, however, the sheer amount of data researchers collect can be overwhelmingly expensive and difficult to annotate and utilize, as the multiple annotation passes can push annotation time to many times the duration of the video. Additionally, researchers spend a lot of time sifting through video of bare substrate (like rocks or mud) with no visible life, and methods that can identify lifeless sections may help researchers quickly filter those portions of video out of invertebrate counting.

Computer vision and machine learning models can significantly aid in managing, utilizing, analyzing, and understanding these videos, ultimately reducing the overall costs of these studies and freeing researchers from tedious annotation tasks. However, developing and training these models require annotated data. Further, the types of annotations generated and used by domain scientists do not directly correspond with the typical types of annotations generated and used by computer vision researchers, requiring new approaches to learning from video data and their annotations.

As a step toward advancement in efficiently computationally analyzing videos from a marine science setting, we introduce DUSIA, a real world scientific dataset including videos collected and annotated by marine scientists who directly use a superset of these videos to advance their own research and exploration. To our knowledge, DUSIA is the first public dataset to contain videos recorded in this challenging moving-camera setting where an underwater ROV drives and records over the ocean floor. This dataset allows us to create solutions to a host of difficult computer vision problems that have not yet been explored such as classifying and temporally localizing underwater habitats and substrates, counting and tracking invertebrate species as they appear in ROV video, and using these explicit substrate and habitat classifications to help detect and classify invertebrate species. Further, the types of annotations provided in DUSIA differ from those of typical computer vision datasets, requiring new approaches to learning.

Our contributions can be summarized as follows:

  • DUSIA provides the first publicly available dataset of annotated, full-length videos captured via an underwater ROV. DUSIA’s videos are annotated by expert marine scientists with temporal labels indicating substrates, count labels for 59 invertebrate species, partial bounding box labels for ten invertebrate species of interest in the training set, and full bounding box labels for those species of interest in the validation and testing sets.

  • We introduce the novel Context-Driven Detector (CDD), which uses implicit context representations and explicit context labels to improve bounding box detections. In our case, context refers to explicit class labels of the background. Specifically, our context labels describe the substrate present on the ocean floor, which determine the environment and habitat in which the organisms live. In natural images, context might refer to indoor vs outdoor images or subcategories within such as school, office, library, or supermarket.

  • We propose Negative Region Dropping, an approach for improving performance of an object detector trained on a dataset with partially annotated images.

  • Finally, we offer a baseline method for counting invertebrate species individuals in this challenging setting using a detection plus tracking pipeline.

In Sect. 2 we review other datasets and methods with similar data and highlight how DUSIA differs from previous datasets. Next, in Sect. 3 we discuss the contents and collection of DUSIA’s data and annotations. Section 4 describes some of the tasks for which DUSIA can be used, and Sect. 5 discusses our approaches to those tasks including the novel CDD, Negative Region Dropping, and baseline tracking method. Section 6 describes our experiments and results, and Sect. 7 discusses our findings.

2 Related Works

Analyzing underwater animals and habitats remains a challenge for computer vision models. Marine scientists collect a wide variety of visual data for an even wider variety of tasks, so when it comes to solving specific tasks, there often exists a scarcity of well-annotated underwater data. Although there are a few efforts from the computer vision community to collect and annotate underwater data (Pedersen et al., 2019; King et al., 2018; Boom et al., 2014; Marini et al., 2018; Joly et al., 2014), they are hardly enough to tackle this daunting problem, and few of these efforts collect data in the same way or provide annotations for the same goals. In general, collecting underwater image or video data is far more difficult than collecting data on land or everyday images of common objects. Collecting underwater data is so difficult, in fact, that Ishiwaka et al. (2021) proposed a method for generating synthetic datasets. DUSIA aims to be a collaborative, comprehensive effort to guide the exploration and automated analysis of underwater ecosystems.

2.1 Underwater Marine Datasets

Many of the existing underwater marine datasets were developed to detect and recognize the presence or behaviors of fish (Konovalov et al., 2019; Måløy et al., 2019; Boom et al., 2014; Joly et al., 2014; Levy et al., 2018). Numerous recent works (Konovalov et al., 2019; Måløy et al., 2019; Levy et al., 2018; Ditria et al., 2020) have validated their fish detection and fish behavior recognition models on these datasets. Interestingly, these methods mainly focus on developing novel data-hungry algorithms, but the data on which the algorithms operate is limited by its static perspective. For example, Måløy et al. (2019) propose a dual spatial-temporal recurrent network, but the algorithm is trained and tested on a dataset constrained by having no camera movement and a covered recording area. Similarly, Konovalov et al. (2019) augment the underwater fish images they use with underwater non-fish images from VOC2012 (Everingham et al., 2015) by restricting their model to binary (fish vs. no fish) predictions. In the same way, Ditria et al. (2020) and Levy et al. (2018) confine their analysis to a single fish species, and Marini et al. (2018) automate fish counting without distinguishing among species. In contrast, DUSIA provides dynamic, high definition ROV video showcasing a rich and varied environment with many species occurring in intermingling groups.

Additionally, unlike existing datasets, a novel feature of DUSIA is the utilization of explicit, human-annotated, contextual information such as substrates or habitat in the analysis workflow. Such contextual information can play a vital role in making accurate predictions, especially in the case of identifying fish or other marine animals. Recently, Rashid and Chennu (2020) developed a large scale dataset for habitat mapping using both RGB images and hyperspectral images. This dataset contains a large number of annotated images for classifying different coral reef habitats, but marine animal information is not included. DUSIA, in contrast, is unique in this aspect, as it has both explicit substrate and invertebrate annotations. Tables 1 and 2 highlight the differences among many underwater image and video datasets.

Table 1 Underwater datasets with labelled images
Table 2 Underwater datasets with labelled videos

2.2 Methodologies

Beery et al. (2020) propose Context R-CNN to utilize long-term and short-term temporal context to improve recognition in passive monitoring deployments, though they lack explicit labels for the background context of their data. Because their data is collected via static cameras, the background context is unchanging, does not have explicit labels, and may not contribute much to their detection.

As mentioned in the previous section, recently, different works have developed deep learning-based algorithms to detect marine species (mostly fishes). Li et al. (2015) uses a Fast-RCNN (Girshick, 2015) based network to classify twelve different species of fish. Salman et al. (2016) present a deep network to detect fish in \(32\times 32\) size video frames. Siddiqui et al. (2018) use a pre-trained object detection CNN network as a generalized feature extractor. The extracted features are then fed to an SVM (support vector machine) for classification of fish.

Our baseline method aims to alleviate some of these methods’ shortcomings by using explicit substrate predictions to enhance species detections.

3 Dataset

DUSIA consists of over 10 h of footage captured from preplanned transects along the ocean floor near the Channel Islands of California. This includes 25 HD videos recorded using RGB video cameras attached to an observation class ROV equipped with multiple lighting fixtures recording at depths between 100 and 400 m. Three of the 25 videos do not contain species of interest, so they are excluded from the experiments presented in this paper. DUSIA's videos are part of a large collection, and we plan to release more similar videos from different excursions in the future.

DUSIA's videos can assist in studies of the 57 annotated invertebrate species because many of those species are widely distributed along the west coast of North America and beyond. For example, the fragile pink urchin, Strongylocentrotus fragilis, inhabits the upper continental slope along the entire eastern North Pacific from Alaska to Baja California, ranging in depth from 200–1200 m off central California (Taylor et al., 2014). Several species of squat lobster (Munidopsis spp.) are also common across the Eastern Pacific and are similar enough in appearance that collections would be needed to identify them to species (Wicksten, 1989). The yellow gorgonian, Acanthogorgia gracillima, is also found across the North Pacific from at least Japan to California (Horvath, 2019).

Recently, the diversity and abundance of megafaunal taxa, such as those labelled in DUSIA, have been identified as a high priority essential variable for understanding changes in marine ecosystems (Danovaro et al., 2020).

Fig. 2

Illustration of the ROV attached to the catamaran, substrate layers, and habitat characterization. Substrates are divided into soft (mud, cobble and sand), hard (rock and boulder), or mixed (a combination of any soft and hard substrates). Illustration courtesy of Marine Applied Research and Exploration (MARE) Group

Fig. 3

Example frames, each containing just one substrate, indicated by the in-frame text

Table 3 Description of the four substrates present in DUSIA

3.1 Data Collection

Surveys of wildlife on the ocean floor generally start with planning a group of paths, called transects, across some region in order to efficiently cover and survey one section of the ocean (Shester et al., 2017); however, to protect these fragile ecosystems, DUSIA does not make specific GPS coordinates publicly available.

Some surveys use scuba divers to collect video along transects, but DUSIA covers larger, deeper areas using an ROV attached to a 77-foot catamaran. During the collection process, the ROV is connected to the catamaran by a cable. Once the boat arrives near the beginning of the desired transects, the ROV is placed in the water and remains on a long leash attached to the boat. The catamaran follows the transects roughly while the ROV follows its path more precisely via inputs from a remote operator on the boat, who makes use of the ROV's cameras, lights, GPS, and other instruments that indicate the ROV's location relative to the boat and allow its GPS location to be computed. Figure 2 roughly illustrates the ROV rig used for data collection.

3.2 Substrate Classes and Annotations

After the collection stage, researchers return to a laboratory where they review, analyze, and annotate each video. DUSIA includes four different substrates: boulder, cobble, mud, and rock. An illustration of each is shown in Fig. 2, and frames from the dataset are shown in Fig. 3. The difference between them depends on the material makeup of the ocean floor. A description of each substrate can be found in Table 3, and Table 4 shows a toy example of the annotation format.

Each of these substrates may overlap such that a given frame can have multiple substrate labels if enough of multiple substrates are visible. The annotation process includes multiple passes, one for each substrate, where the annotators indicate the start and end times of each substrate occurrence. This arduous process can be alleviated by our methods.

3.3 Invertebrate Classes and Annotations

Once the substrate annotations are completed, scientists make yet another pass over each video, this time annotating invertebrate species, often referencing substrate labels as certain species have a tendency to occur in certain substrates. When a group or individual of a species touches the bottom of the video frame, the annotator pauses the video, counts the individuals touching the bottom of the frame, and notes the time stamp at which the count occurred, giving domain researchers insight into where in the video, in the ocean, and in which substrate each species tends to occur. We refer to these labels as CABOF, Count At the Bottom of the Frame, labels. Figure 4 illustrates the CABOF label collection procedure.

Count labels provide guidance in learning to classify and detect invertebrate species, ensure that species individuals are not counted multiple times, and could be used by a human to learn to label further videos. However, current computer vision methods do not perform as well with weak supervision as they do with strong supervision (Bearman et al., 2016; McEver & Manjunath, 2020; Ahn et al., 2019), and count labels of this nature are unusual for current machine learning methods.

Table 4 Example of combined substrate and CABOF, Count At the Bottom of the Frame, annotations. Substrates are labeled with beginning and end times, and invertebrate CABOF labels include a single timestamp shown in the Begin column and count

3.3.1 Bounding Box Labels

To address this difficulty, we further annotate a subset of the dataset with bounding box tracks, using the marine scientists' CABOF labels, to enable current computer vision methods, which often require bounding boxes for training and testing, and to validate those methods on DUSIA. First, we select a subset of species to annotate with stronger annotations. We choose ten species, each visualized in Fig. 5, because they are among the most abundant species in the dataset. Appendix A shows the counts of all invertebrate species annotated with count labels across DUSIA.

Fig. 4

Sequence of video begins on the left and continues to the right as indicated by the time stamps in the top right of the video. The basket star indicated by the white arrow will be counted when it first touches the bottom of the frame in the middle frame at time 21:00:36. The yellow gorgonian indicated by the yellow arrow will be counted when it touches the bottom of the frame later in the video

To generate our training set, we randomly select a subset of frames containing count labels for our species of interest. We seek to those frames and back up in the video until the annotated species individual or group, i.e. our annotation target(s), is either in the top half of the screen or first appearing. In the ROV viewpoint, objects typically appear at the top of the frame as the ROV moves forward. Once we back up sufficiently far, we then draw a bounding box or boxes on the annotated target(s), ignoring other instances of species of interest (thus creating partial annotations) due to annotation budget and visibility constraints.

We then jump 10–30 frames at a time adjusting the box location for the annotation target(s) in each frame we land on, referred to as keyframes. This process allows for efficient annotation and allows us to interpolate box locations between keyframes for additional annotation points.
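For readers who wish to densify the keyframe boxes in the same way, a minimal sketch of the linear interpolation step is shown below (the helper name and the (x1, y1, x2, y2) box format are our own conventions for illustration, not part of the released tooling):

```python
import numpy as np

def interpolate_boxes(box_a, box_b, frame_a, frame_b):
    """Linearly interpolate a box between two annotated keyframes.

    box_a, box_b: (x1, y1, x2, y2) boxes at frame indices frame_a < frame_b.
    Returns a dict mapping each intermediate frame index to its interpolated box.
    """
    box_a, box_b = np.asarray(box_a, float), np.asarray(box_b, float)
    boxes = {}
    for f in range(frame_a + 1, frame_b):
        t = (f - frame_a) / (frame_b - frame_a)    # fraction of the way to frame_b
        boxes[f] = (1.0 - t) * box_a + t * box_b   # elementwise linear blend
    return boxes

# e.g. two keyframes 20 frames apart yield 19 interpolated boxes
dense = interpolate_boxes((100, 400, 180, 480), (140, 520, 230, 600), 0, 20)
```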

The result of this annotation process is a partially annotated training set for learning to detect and later count species of interest. These annotations are partial because we did not attempt to always label every individual of each species of interest in the training set. Instead, we focused only on the annotation targets. Because some individuals of the ten species of interest may be labelled while other individuals of the ten species may not be, we consider these partial labels.

We chose to partially annotate our training set so that we could collect boxes tracking each species. In populated areas, there are many species hiding, coming, and going, making collecting full annotations extremely difficult, especially across many frames.

Additionally, we provide some fully annotated frames where we guarantee that all individuals of the ten species of interest in the bottom half of each frame are labelled with a bounding box. We were constrained to the bottom half of the frame due to darkness, murky waters, low visibility, and text embedded in the videos during the collection process. Therefore, we use only the fully annotated bottom half of the validation and testing frames during training, testing, and presenting our detection results. Seeing as the marine scientists count the creatures that touch the bottom of the frame, we expect the bottom half of the frame to provide a good metric for count estimations. These frames are provided for validation and testing.

In order to generate these fully annotated validation and testing frames, we randomly selected a subset of count annotated frames in the validation and test sets. For each of those selected frames, we labelled all instances of species of interest in the bottom half of the frame including but not limited to the original targets. For rare species, we often labelled frames a second or two before and/or after the count annotated frame in order to provide more validation and testing frames. Still, the number of validation and testing frames is limited by the difficulty in collecting these fully annotated frames as well as the scarcity of some species.

These fully annotated frames took on average 146.5 s per frame for trained individuals to annotate. For reference, it took annotators approximately 22.1 s per image to fully annotate with single point annotation and 34.9 s per image with squiggle supervision on the VOC2012 natural image dataset of 20 classes including cats, buses, and similar common object classes (Bearman et al., 2016). Collecting bounding boxes, consisting of two precise points, for half the number of classes should take a similar amount of time, but the difference in time spent per image illustrates the challenge of annotating DUSIA, as each annotator struggled to find every object of interest even after being trained specifically to localize the species of interest. An example of a fully labelled validation frame is shown in Fig. 6.

Fig. 5

Cropped screenshots of each of the ten species of interest: basket star (BS), fragile pink urchin (FPU), gray gorgonian (GG), long-legged sunflower star (LLS), red swifita gorgonian (RSG), squat lobster (SL), laced sponge (LS), white slipper sea cucumber (WSSC), white spine sea cucumber (WSpSC), and yellow gorgonian (YG)

Fig. 6

Fully annotated frame example. Color to species map is as follows: yellow: laced sponge, magenta: white spine sea cucumber, cyan: white slipper sea cucumber, green: squat lobster

3.4 Dataset Splits

We provide a split of the dataset into training, validation, and testing sets with 13, 3, and 6 videos in each split respectively. The training set includes 8682 keyframes used for training the detector (described in detail in Sect. 3.3). The validation and test sets respectively include 514 and 677 frames with fully annotated lower halves. Between each split, we attempted to maintain a relatively even distribution across our species of interest; however, preserving this distribution leads to a slightly uneven distribution of substrate occurrences.

3.5 Statistical Analysis of Data

Table 5 shows the frequency of each of the substrate classes present in our dataset.

Table 5 Distribution of number of frames containing each substrate across DUSIA and its splits. Note that a given frame may have multiple labels
Table 6 Distribution of bounding box annotations of each species across splits
Table 7 Distribution of CABOF labels across DUSIA and its splits.

Table 6 shows the frequency of bounding box labels for invertebrate species of interest represented in our dataset, and Table 7 illustrates the frequency of CABOF labels for invertebrate species.

Table 8 illustrates the distributions of CABOF labels for each species across the different substrates. While not weighted against the relative presence of each substrate, this table still illustrates that certain species occur much more frequently in certain substrates. For example, fragile pink urchins (FPU) rarely occur in the boulder substrate, and frequently occur in mud while laced sponges (LS) almost always occur in a substrate that includes rock. These correlations suggest that learning to predict substrate may aid in learning the relationship between substrate and species and motivate a context driven approach for species detection and counting.

Table 8 Percentage of total species individuals occurring in each substrate according to CABOF labels

4 Tasks

While our dataset has a plethora of uses, we present two specific tasks for which our dataset is well suited.

4.1 Substrate Temporal Localization

The first step marine researchers take in analyzing the videos that they collect is to define the temporal spans of each substrate by indicating the start and end times of each substrate as it changes while the ROV drives over the ocean floor. Multiple substrates may occur simultaneously, which slightly complicates the problem, making it a multi-label classification problem. Our dataset makes it possible to develop and test automated methods for this problem.

Localization Evaluation We evaluate the performance of substrate temporal localization using mean Average Precision (mAP). For each frame, we make a prediction for each substrate class with some confidence value. We use these predictions and ground truth to compute per class AP and take the mean of the per class AP scores to compute mAP.
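As a concrete illustration, per-class AP and mAP in this per-frame, multi-label setting can be computed as sketched below (using scikit-learn; the array names are ours):

```python
import numpy as np
from sklearn.metrics import average_precision_score

def substrate_map(y_true, y_score):
    """Per-class AP and mAP for multi-label, per-frame substrate predictions.

    y_true:  (num_frames, num_substrates) binary ground truth matrix.
    y_score: (num_frames, num_substrates) per-substrate confidence scores.
    """
    per_class_ap = [
        average_precision_score(y_true[:, c], y_score[:, c])
        for c in range(y_true.shape[1])
    ]
    return per_class_ap, float(np.mean(per_class_ap))
```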

4.2 Counting Species Individuals

DUSIA also makes it possible to count the number of individuals of species occurring in the videos. Counting can be achieved in three stages: detection, tracking, and then counting. We present a simple baseline method for achieving this. While many computer vision methods for counting may rely on localization information such as bounding boxes, marine researchers are interested in the number of individuals occurring in the video and are less interested in where exactly in the frame an organism occurs. They can use video timestamps of those individuals’ occurrence to map those timestamps back to their GPS coordinate time log from the expedition in which the video was captured, generating population density maps for different species.

Additionally, we provide bounding box labels for ten species of interest as described in Sect. 3.3.

Detection Evaluation We use these bounding box labels to evaluate the performance of the object detection stage of counting with mean Average Precision (mAP). For each bounding box prediction, we compute its intersection over union (IOU) with each ground truth box. If a box's IOU exceeds a threshold, the box is counted as a true positive. Each ground truth box can correspond to only one prediction, and additional predicted boxes with high IOU with that ground truth box are counted as false positives. Using this assignment of true and false positives, we compute the average precision (AP) for each class and take the mean of the per class APs to compute mean Average Precision (mAP). This object detection computation follows standard practice (Lin et al., 2014; Everingham et al., 2015).
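The IOU test at the heart of this matching procedure is simple; a minimal sketch (box format (x1, y1, x2, y2); the function name is ours) follows:

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes in (x1, y1, x2, y2) form."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# A prediction is a true positive only if iou(pred, gt) exceeds the threshold
# (0.5 here) and that ground truth box has not already been matched.
```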

Counting Evaluation To evaluate our counting performance, we simply compute relative error (RE)

$$\begin{aligned} RE = \frac{Predicted - Actual}{Actual} \end{aligned}$$
(1)

using our predicted counts. Negative RE indicates that a species was under-counted, and positive RE indicates that a species was over-counted. In order to summarize the performance of our counting method, we take the mean of the absolute values of the per class REs.
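A small sketch of this metric, assuming per-class predicted and ground-truth count dictionaries (the names and example numbers are illustrative only):

```python
def relative_errors(predicted, actual):
    """Per-class relative error and the mean of their absolute values."""
    re = {c: (predicted[c] - actual[c]) / actual[c] for c in actual}
    mean_abs_re = sum(abs(v) for v in re.values()) / len(re)
    return re, mean_abs_re

# Negative values mean under-counting, positive values mean over-counting.
re, mre = relative_errors({"FPU": 90, "GG": 130}, {"FPU": 100, "GG": 120})
```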

5 Methods

While our dataset can be used to train models to solve a wide variety of problems including substrate classification, species hotspot estimation, species counting, and invertebrate tracking, we present methods for substrate temporal localization and invertebrate species detection using partially supervised frames with our primary focus on invertebrate species detection. We feed our detection results to ByteTrack’s tracking algorithm (Zhang et al., 2021) to track invertebrate species and present a simple method for using these tracks to count invertebrate individuals.

Fig. 7

Context-Driven Detector: the Context Description Branch (green) takes features from the backbone, classifies context explicitly (blue), and feeds a global representation of context (purple) to the box classification layer to enhance detections. We show that using this branch enhances the detections overall indicating that learning from explicit context labels can enhance detections (Color figure online)

5.1 Substrate Classification

For a baseline, we train two basic classifiers for substrate classification. First, we trained an out-of-the-box ResNet-50 based (He et al., 2016) classification CNN, pre-trained on ImageNet (Deng et al., 2009), on frames pulled from training videos to predict all four substrates at once. Then, we trained four separate ResNet-50 classifiers, one per substrate, and combined their results by assigning each classifier's confidence prediction to its corresponding class, since substrate classification allows multiple substrates to be present in a single frame.
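A minimal sketch of the single multi-label classifier is shown below (a torchvision ResNet-50 with a four-way sigmoid head; the exact training configuration is given in Sect. 6.1.1, so the hyperparameters here are illustrative). The combined variant simply trains four copies of this network with a single-logit head, one per substrate.

```python
import torch
import torch.nn as nn
from torchvision import models

NUM_SUBSTRATES = 4  # boulder, cobble, mud, rock

model = models.resnet50(pretrained=True)  # ImageNet initialization
model.fc = nn.Linear(model.fc.in_features, NUM_SUBSTRATES)

criterion = nn.BCEWithLogitsLoss()        # independent sigmoid per substrate
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

def train_step(images, substrate_targets):
    """images: (B, 3, H, W) tensor; substrate_targets: (B, 4) binary labels."""
    logits = model(images)
    loss = criterion(logits, substrate_targets.float())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```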

5.2 Invertebrate Species Detection

We trained an out-of-the-box Faster RCNN model using our partially annotated keyframes (see Sect. 3.3 for partial annotation description). We chose Faster RCNN for its adaptability and ability to classify smaller boxes, with which some object detectors struggle. As shown in Fig. 10, many classes in DUSIA are made up of small boxes.

Figure 7 shows vanilla Faster RCNN in black. An image is fed to a backbone network, and image features are fed to a region proposal network. Then, region of interest pooling selects proposed regions. Finally, fully connected layers classify each region and regress the bounding box coordinates to refine their localization. We made no modifications to Faster R-CNN for this baseline model and refer to this version as vanilla Faster RCNN with the loss function, \(L_{v}\), described by Ren et al. (2015):

$$\begin{aligned} L_{v} = L_{d} + L_{p} \end{aligned}$$
(2)

where \(L_{d}\) is the loss for the detector and \(L_{p}\) is the loss for the region proposal network. Since we make no modifications to this part of the loss, we leave the details of the original loss description to the source paper.

5.2.1 Negative Region Dropping

Because much of our partially annotated training set contains unlabelled individuals of species of interest, we propose an approach for teaching the detection network to pay more attention to the true positive labels, and to pay less attention to potential false positives during training because a false positive may actually just be an unlabelled positive. There is generally no way of being sure whether an individual of interest is not present given a partially labelled training set, but all of the boxes provided for training are correct, true positive examples. Since humans can make sense of such a scenario, we aim to create a method for a detector to emulate that process.

Faster RCNN’s region proposal network (RPN) generates proposals and computes a loss to learn which proposal contains an object of interest or not. Each proposal is assigned a label, positive or negative, based on whether it has sufficient overlap with a ground truth box (positive) or not (negative). Because DUSIA’s training set contains unlabelled positives, we propose randomly dropping out a percentage of the negative proposals, thereby giving negative examples a lower weight and positive examples a higher weight. Dropping these negative proposals simply equates to not including them in the RPN’s loss, \(L_{p}\).
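A sketch of how Negative Region Dropping could be wired into the RPN's proposal labeling (an illustration in PyTorch, not our exact implementation; we assume the common convention of 1 for positive, 0 for negative, and -1 for ignored proposals):

```python
import torch

def drop_negative_proposals(proposal_labels, rho):
    """Randomly mark a fraction rho of negative RPN proposals as ignored.

    proposal_labels: 1D tensor with 1 = positive, 0 = negative, -1 = ignored.
    Ignored proposals contribute nothing to the RPN objectness loss L_p,
    softening the penalty on proposals that may cover unlabelled positives.
    """
    labels = proposal_labels.clone()
    neg_idx = torch.nonzero(labels == 0, as_tuple=False).squeeze(1)
    num_drop = int(rho * neg_idx.numel())
    if num_drop > 0:
        drop = neg_idx[torch.randperm(neg_idx.numel())[:num_drop]]
        labels[drop] = -1
    return labels
```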

We explore different percentages, \(\rho \), to drop in Sect. 6, and show that dropping negative proposals in this way leads to significant improvement in detection performance on DUSIA.

5.2.2 Context Driven Detection

To improve invertebrate detection using context annotations, we introduce the novel Context Description Branch as shown in green in Fig. 7. The first iteration of the context description branch (blue in Fig. 7) flattens the feature map from the backbone network and feeds this flattened vector to a fully connected layer which is trained in tandem with the detection branch to predict the multi-class substrate label. Simply backpropagating a weighted binary cross entropy loss to the backbone network to predict the substrate label increases the model’s performance and generalizability (as measured by performance on the test set) by teaching the network about context via explicit context classification. This joint optimization generates cues in the backbone feature map that improve the invertebrate detection. For this iteration of the network, the loss function looks the same as Eq. (2) with the additional loss for explicit context classification.

$$\begin{aligned} L = L_{v} + \alpha *L_{c} \end{aligned}$$
(3)

where \(\alpha \) is a hyperparameter weight and \(L_{c}\) is a binary cross entropy loss for context labels.
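A sketch of the explicit context head and the combined objective in Eq. (3) is shown below (module names and shapes are ours; for brevity we summarize the backbone map with global average pooling where the branch described above flattens it, and we omit the class weighting of the binary cross entropy):

```python
import torch
import torch.nn as nn

class ContextClassificationHead(nn.Module):
    """Predicts the multi-label substrate vector from the backbone feature map."""
    def __init__(self, in_channels=2048, num_substrates=4):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)   # compact summary of the feature map
        self.fc = nn.Linear(in_channels, num_substrates)

    def forward(self, feature_map):
        x = self.pool(feature_map).flatten(1)  # (B, C)
        return self.fc(x)                      # substrate logits

context_head = ContextClassificationHead()
bce = nn.BCEWithLogitsLoss()                   # pos_weight can be set for class weighting

def total_loss(l_v, feature_map, substrate_targets, alpha):
    """L = L_v + alpha * L_c, where L_v is the vanilla Faster R-CNN loss."""
    l_c = bce(context_head(feature_map), substrate_targets.float())
    return l_v + alpha * l_c
```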

By feeding global features alongside local features to the box classification layer, we can also enhance the model's performance; however, for the network to learn from them simultaneously, the global and local features must be on similar orders of magnitude. For vanilla Faster RCNN, the local box features are vectors of size 1,024. Global features from the ResNet-50 backbone, though, are much larger. To address this size mismatch, we add a 1D convolution layer to the context description branch, which reduces the dimension of the backbone's feature map. This reduced map represents the global context information, which is largely the visible substrate, at a dimensionality on the same order of magnitude as each of the box features that are fed to the box classification head's fully-connected layer. Along those lines, we also scale the global features to match the local box feature vector by multiplying the global features element-wise with a scalar hyperparameter, \(\beta \).

Because Faster RCNN predicts the class of each box based on a set of box features, which is a local representation of the object that is being classified, we enhance these box classifications by concatenating each image’s global context information to each of its box features. This concatenation fuses together local and global features and allows the network to draw more immediate conclusions about the global information, object features, and their relationship, which is especially relevant when classifying invertebrate species in this setting. Here, we make no changes to the loss function from Eq. (3), and the 1D convolution kernel is learned.
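A sketch of this fusion step is given below (channel and output dimensions are illustrative; the box classification layer downstream would need its input width increased to accept the concatenated vector, and for brevity we assume an equal number of proposed boxes per image):

```python
import torch
import torch.nn as nn

class GlobalContextFusion(nn.Module):
    """Reduce the backbone map to a compact global vector and fuse it with box features."""
    def __init__(self, in_channels=2048, reduced_channels=64, beta=0.01):
        super().__init__()
        self.beta = beta
        # 1D convolution over the flattened spatial dimension reduces the channel count
        self.reduce = nn.Conv1d(in_channels, reduced_channels, kernel_size=1)
        self.pool = nn.AdaptiveAvgPool1d(16)   # fixed spatial summary: 64 * 16 = 1024 dims

    def forward(self, feature_map, box_features):
        b = feature_map.shape[0]
        g = self.reduce(feature_map.flatten(2))   # (B, reduced_channels, H*W)
        g = self.pool(g).flatten(1)               # (B, 1024) global context vector
        g = self.beta * g                         # rescale toward box feature magnitude
        n_boxes = box_features.shape[0] // b      # assumes equal boxes per image for brevity
        g = g.repeat_interleave(n_boxes, dim=0)   # one copy of the global vector per box
        return torch.cat([box_features, g], dim=1)  # input to the box classification layer
```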

5.3 Invertebrate Tracking and Counting

To illustrate an example pipeline for invertebrate counting, we use a detection plus tracking approach. First, we train our detector on keyframes from our training set, and then we run inference on the full validation and testing videos at 30 fps saving all detections including their spatial and temporal locations, class labels, and confidence scores.

As an intermediate step, we filter out all low confidence detections under different thresholds so that the tracker does not see low confidence detections.

ByteTrack (Zhang et al., 2021) takes as input the detections (box coordinates and confidence scores) of a single class at a time and metadata from the images (e.g. image size). In short, ByteTrack applies a modified Kalman filter based algorithm to the detections in order to link them across adjacent frames and assign each detection a track ID, or filter it out.

We apply a second filter to the output of ByteTrack such that track IDs that occur in too few frames are filtered out.

Finally, we count species individuals. To emulate the process used by marine scientists, we only count species individuals that touch the bottom of the frame. So, if a tracked species' box touches the bottom of the frame, we mark its track ID as counted and increment its class's count. This way, for each video, we obtain a total count per species, from which we compute relative error using our predicted counts and the sum of the video's CABOF labels.
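A sketch of this final counting rule (the field names and the pixel margin are our own illustrative choices):

```python
from collections import defaultdict

def count_tracks(tracked_detections, frame_height, margin=2):
    """Count each track the first time its box touches the bottom of the frame.

    tracked_detections: iterable of (class_name, track_id, (x1, y1, x2, y2)).
    Returns a per-class count dictionary.
    """
    counted = set()
    counts = defaultdict(int)
    for cls, track_id, (x1, y1, x2, y2) in tracked_detections:
        touches_bottom = y2 >= frame_height - margin
        if touches_bottom and (cls, track_id) not in counted:
            counted.add((cls, track_id))
            counts[cls] += 1
    return dict(counts)
```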

6 Experiments

We test a few models and methods for the substrate temporal localization task in an effort to provide a baseline for other works to improve upon.

6.1 Substrate Temporal Localization

6.1.1 Single Classifier

We test a simple ResNet-50 based image classifier trained with a batch size of 32, learning rate of 0.1, and up to 50 epochs, selecting the epoch weights that perform best on the validation set. We also tested learning rates of 0.01 and 0.001 for our classifiers, and these models performed similarly but slightly worse. Table 9 shows the results of these experiments as predictions were made on the fully annotated frames of our validation and testing sets. These two sets are included for comparison with the context classification performance of CDD with explicit context classification, though CDD is optimized to perform detection simply using substrate prediction as a guiding sub-task. For substrate localization, though, we have annotations for almost every frame. So, we also present our classification performance on the test_1fps set, which includes many more frames from the test videos. To generate test_1fps we simply sample the entire test videos uniformly at one frame per second. We then classify each frame, and present the AP scores. test_1fps aims to illustrate the performance of the substrate classifiers across the length of the entire video rather than only on small parts of the video containing bounding box labels for species of interest.
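For reference, sampling a video at one frame per second, as done to build test_1fps, can be sketched with OpenCV (the output naming convention is a placeholder):

```python
import cv2

def sample_one_fps(video_path, out_pattern="frame_{:06d}.jpg"):
    """Save roughly one frame per second of a video to disk."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0   # fall back to 30 fps if metadata is missing
    step = int(round(fps))
    idx, saved = 0, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            cv2.imwrite(out_pattern.format(saved), frame)
            saved += 1
        idx += 1
    cap.release()
    return saved
```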

Table 9 Substrate classifier performance. Per class APs are shown for the test_1fps set, described in Sect. 6.1.1. CDD shows the classification performance of the CDD with \(\alpha \) = 0.0001 and \(\rho \) = 0.75, which was not run on test_1fps because CDD is not a dedicated substrate classifier

6.1.2 Combination of Binary Classifiers

As mentioned in previous sections, substrate annotations are currently completed by trained marine scientists in multiple passes through each video, one pass per substrate. Inspired by this approach, we use one binary classifier network per substrate class. Each ResNet-50 image classification network is trained independently on the training set; however, each network is trained to simply indicate whether one substrate is present or not. We use each classifier’s prediction together to predict the multi-class label and refer to this method as our combined approach. Table 9 shows that this method improves performance over a single multi-classifier for most substrates, indicating that each approach may have different use cases.

All classifiers seem to struggle with correctly identifying the boulder substrate, and, given the nuanced differences between hard substrates, this is not surprising considering the classifiers have little scale information with which to determine and differentiate the exact sizes of different pieces of cobble, boulders, or larger rock formations. Additionally, the changing perspective of the ROV makes it difficult to understand scale in the videos. That said, the combined approach of dedicated per-substrate classifiers out-performed the single classifier method overall due to its impressive performance classifying the mud class.

6.2 Invertebrate Species Detection

In order to detect species individuals, we present mean average precision (mAP) results for object detection with an intersection over union (IOU) threshold of 0.5 because our counting task is not particularly sensitive to high overlap. This metric is known as AP\(_{50}\) from the popular COCO evaluation metric suite (Zhao et al., 2019). As long as the object is detected reliably, the quality of the localization is not as important as coming up with the correct counts of species individuals. We present the full COCO suite of evaluations for a more in depth analysis of our best CDD model in Table 16.

We offer a comparison of single-stage, transformer-based, and two-stage out-of-the-box detection models on DUSIA in Table 10. YOLOv5l (Jocher et al., 2022) is the large variant of the single-stage YOLOv5 object detector and performs best of all default YOLOv5 model sizes. The DEtection TRansformer (DETR) is a recent object detection model that uses a transformer-based architecture to make object detections.

For each Faster-RCNN and CDD detection experiment, we initialize our models with weights pretrained on ImageNet and then train the network for up to 15 epochs. All detection networks (including YOLOv5 and DETR) only ever see the bottom half of any given video frame. That is, the top half is cropped out, and the models are trained on the bottom half. Section 3.3.1 describes more on the reasoning for avoiding the top half of DUSIA’s video frames for object detection.

We select the model from the epoch with the best performance on the fully annotated frames of the validation set. Then, we run inference on the fully annotated frames of the test set using those selected model weights. We repeat the training and testing procedure four times for each experiment and report the average results over the four runs because PyTorch does not support deterministic training for our model at the time of writing.

We first train vanilla Faster RCNN (Ren et al., 2015) with a batch size of 8 and try several learning rates after initializing with weights pre-trained on COCO (Lin et al., 2014) provided by PyTorch (Paszke et al., 2019). The results are shown in Table 11.

Table 10 Detection models tested, the approximate number of parameters of each model, and their performance on DUSIA. YOLOv5l and DETR were trained and tested with default parameters
Table 11 Performance of vanilla Faster RCNN with varying learning rates

We then perform hyperparameter searches for each of our method contributions described in Sect. 5: \(\alpha \) for explicit context learning and backbone refinement, \(\beta \) for global context feature fusion, and \(\rho \) for Negative Region Dropping. After testing each hyperparameter independently, we try combinations of each and discuss the results. We prioritize test mAP over val mAP as test mAP is more indicative of the generalizability of our model since the best model weights are selected on best val mAP.

6.2.1 Negative Region Dropping Percent \(\rho \)

Table 12 shows that Negative Region Dropping consistently improves training on DUSIA by teaching the network to focus more on learning from true examples than from negative examples. Interestingly, setting \(\rho \) to 1.0 severely harms performance, indicating that having some negative regions contribute to the region proposal loss is still important.

Table 12 Performance of Faster-RCNN with varying Negative Region Dropping percentages

6.2.2 Global Feature Fusion Scalar \(\beta \)

By creating a global feature representation and feeding it later in the network, the network is better able to classify boxes correctly, but concatenating a global feature representation with the local box features requires that the features come in at similar scales. As described in Sect. 5.2.2, \(\beta \) is used as an element-wise multiplicative scalar to re-scale the global features. Table 13 shows the effect of different scalar values for this fusion.

Table 13 Performance of the Context Driven Detector given different \(\beta \) scalar values

6.2.3 Context Loss Weight \(\alpha \)

By modifying the detector to simultaneously classify the context of an image in parallel with detection, we demonstrate that simply backpropagating information useful for classifying substrate to the backbone also serves to help improve detection performance. Training a joint task in this way leads to less powerful context classifications than a dedicated context classifier, but it leads to a more powerful object detector. Table 14 shows the effects of \(\alpha \) on the detection performance.

Table 14 Performance of the Context Driven Detector given different context loss scaling \(\alpha \) values

6.2.4 Hyperparameter Combinations

We illustrate that each hyperparameter alone can improve the detector performance over the baseline out-of-the-box models. We further illustrate that Negative Region Dropping and context driven detection can work in tandem to further improve performance. We also find that a context driven detector with both implicit attention to context (global feature fusion) and explicit context classification does not necessarily outperform implicit context usage or explicit classification alone; the two forms of context learning may interfere with each other when trained simultaneously. Still, we emphasize that learning from context can significantly improve object detection performance in this setting, and we aim to find even better ways to utilize contextual information to better classify objects in future work.

Table 15 highlights the best hyperparameter settings revealed during our search, and Appendix B goes into more detail on the settings tested for this study. Note that a \(\beta \) value of zero in the table indicates not that the global features are scaled by 0 but that they are not concatenated with the local box features at all.

Table 15 Average performance of best models from each hyperparameter combination

Table 16 shows the performance of the best CDD model over the whole COCO evaluation suite, which is commonly used to evaluate the performance of object detectors (Zhao et al., 2019). This suite shows the performance of object detectors over a range of IOU thresholds and for different box sizes: small, medium, and large. AP\(_{50}\) is the metric shown in other object detection performance tables as our tasks are not particularly sensitive to high IOU detections. The metrics show that our detector struggles to find small objects.

Table 16 Full COCO suite showing performance of the best single CDD model with \(\beta \)=0.01 and \(\rho \)=0.75 on both the val and test sets

We find that Negative Region Dropping increases the overall performance of both vanilla Faster RCNN and context driven detectors. While explicit and implicit context usage may conflict with one another in training, independently they can achieve performance increases. The best model overall is achieved with global context feature fusion and Negative Region Dropping, and a model with explicit context classification and Negative Region Dropping follows close behind. We find that using context to influence detections leads to a 7.4% increase in mAP, using Negative Region Dropping leads to a 12.3% increase, and together they achieve a 14.3% increase in mAP on the fully annotated frames in DUSIA's test set.

Figure 8 illustrates the per class AP detection performance of our best model compared with vanilla Faster RCNN showing that our model significantly increases performance on all classes. Figure 9 shows qualitative examples of success and failure cases of the best version of CDD.

Fig. 8

Per class test AP comparison of vanilla Faster RCNN and the best Context Driven Detector

Fig. 9

Detection examples from our dataset. Blue indicates fragile pink urchin; green, gray gorgonian; and red, squat lobster. We show the success of our detector with the exception of the bottom right image. A crab (not a species of interest) is mislabeled as a fragile pink urchin toward the top center of the image. In the left side of the image, two pieces of floating debris are labelled as urchins, and close to the center two urchins are counted thrice. Right of center, a rock is labelled as an urchin. These failure cases demonstrate some of the challenges of DUSIA. In the top right corner of the bottom right image, a very difficult to see pink urchin is correctly detected (Color figure online)

6.3 Invertebrate Species Counting

There are some noteworthy differences between the detection and counting problems. As mentioned in Sect. 3.4, we partition DUSIA's videos into three sets: training, validation, and testing. However, the detector sees only a small fraction of each video, as only a small subset of each video has bounding box annotations. Further, while we refer to three of our videos as validation videos, our detection models do not train on those videos at all, and only 514 frames from those roughly 124,000 validation video frames are used in the detection validation process to select our best model weights.

Table 17 Relative errors of our counting method with no thresholding and the best threshold settings. Darker color indicates better performance. See Table 7 for ground truth counts for each species

In contrast, our counting method runs our detector over the entire length of each video in the validation and testing sets, posing a great challenge to the generalizability and robustness of an object detection model. That is, the sets of frames used for the counting task are much larger than those used for detection. Also, the frames annotated with invertebrate species (i.e. all the frames in the detector's training set) all include instances of the species of interest. In contrast, each full video contains long time spans of both densely and sparsely annotated areas, including some long regions with no species of interest. As a result, counting species individuals poses a very challenging problem, and much work remains to be done to improve the power of a detector and its ability to differentiate between background and species of interest in both sparsely and densely populated environments.

Still, we aim to demonstrate the challenge of this problem with a simple baseline method, though much work remains to be done to achieve a result that would be able to replace the annotation abilities of trained marine scientists. We hope that DUSIA can aid in pushing the limits of computer vision models and extend computer vision methods’ usefulness into more challenging, scientific data.

In order to count invertebrate individuals, we first run the best performing version of CDD on each of our val and test videos at the full frame rate of 30 fps and save all detections. Then, we filter out all detections with confidence scores under a threshold, \(\tau \), before feeding the remaining detections to ByteTrack. We then filter the output of ByteTrack by discarding any track ID with fewer than \(\gamma \) detections in its track. That is, if a track ID is assigned to boxes in only a few frames, we discard that track ID. We experimented with ByteTrack's hyperparameters and found that their effect was significantly smaller than the effects of \(\tau \) and \(\gamma \), so we use the default hyperparameter settings for ByteTrack. We leave the details of ByteTrack to the original work (Zhang et al., 2021). Finally, for each species, we count the number of that species' track IDs that touch the bottom of any frame.

We applied the two aforementioned filters because, without any filters, our method vastly over counts all species through all videos. Figure 9 shows examples of a few false positive detections, and these types of errors likely contribute heavily to our method’s over counting as the detector is run over hours of videos, accumulating false positive results.

To address the over counting issue, we opted to feed the tracker only our most confident detections and to only count tracks that occur across multiple frames. This filtering significantly improved the performance, but the error remains unacceptably high.

Table 17 shows the relative error for each class on the val and test videos as well as the mean relative error, averaged over all classes, as we vary the \(\tau \) and \(\gamma \) parameters. We leave the error sign to indicate over (positive error) or under (negative error) counting, but we compute the mean errors using the absolute value of the error values for each class. Clearly, the detector hardly learns some of the rarer classes (e.g. long-legged sunflower star and red swiftia gorgonian) and regularly misclassifies background, which may include species outside of our ten species of interest, as our species of interest. Appendix B contains more experiment error results for varying these filter thresholds.

Ultimately, these baseline results indicate that this simple method is not powerful enough to put into practice given the effectiveness of our current detection model. Much work on methods for this problem remains to be done. We could look deeper into per class thresholds, but we expect that improving the object detections, the false positive filtering, and the tracking algorithm would yield more robust gains. We leave these improvements to future work.

7 Discussion and Future Work

Our baseline methods' detection and counting performance leaves plenty of room for improvement, as a counting system with 51.5% average error cannot replace human annotators. Our detection methods can improve because they do not yet enforce any sort of temporal continuity present in the ROV videos, which would likely improve performance, and they do not yet take advantage of the abundant, weak CABOF labels during training. Further, Table 16 reveals that our detector struggles to find small objects. This weakness may be an area to improve in future work.

It is interesting to find differences in performance between the different types of substrate classifiers. Overall, the substrate classification results are good enough for some substrates, and in future work we hope to see results good enough to fully automate this process. Additionally, marine scientists are interested in real time substrate classifiers that can indicate which substrates the ROV is passing over in real time. Any indication of species hotspots in real time during expeditions can improve each excursion's productivity by reducing reliance on more manual means of searching for given substrates, habitats, and species hotspots.

The detection results of the Context Driven Detector provide a baseline, but in order to fully translate these detections to tracks with individual re-identification and counting, there is much work to be done. We hope to next take full advantage of the CABOF labels and to use context in more powerful ways to improve detection performance in future work. Further, we plan to enforce temporal continuity to improve our counting predictions. These improvements can lead us to eventually begin automating some of the invertebrate counting that is currently done manually.

By making DUSIA public, we also invite other collaborators to work independently or in cooperation with us to help improve our methods.

8 Supplementary Information

DUSIA’s data, annotations, and baseline methods will be made publicly available at the time of publication.