1 Introduction

Fig. 1

Conceptual overview. We utilise a student-teacher paradigm for learning where the teacher produces pseudo-labels for the student to learn from while being updated by an exponential moving average (EMA) of the student model. We apply five dynamic policies to this learning loop that we show can lead to effective (i.e. virtuous) self-training cycles: an unlabelled data sampling policy to control the unlabelled sample input, a confidence threshold policy to filter unreliable pseudo-labels, a data augmentation policy to diversify the unlabelled training data, an unsupervised loss weighting policy to balance unsupervised and supervised losses, and a teacher momentum policy to adjust the update speed of the teacher model

Motivation-Automated visual monitoring of animals filmed in their natural habitats is gaining significant traction, boosted recently by a plethora of deep learning methods and applications (Tabak et al., 2019; Norouzzadeh et al., 2021; Tuia et al., 2022). However, developing and advancing relevant computer vision tools remains challenging due to several factors. Animals in their natural environments are often hard to detect, obscured by dynamic backgrounds, varying illumination conditions, occlusions, camouflage effects, and more. Deploying network models trained on prevalent image and video databases, such as ImageNet (Deng et al., 2009), MS-COCO (Lin et al., 2014), and Kinetics (Carreira and Zisserman, 2017), is often insufficient on its own, even after taking advantage of the potential of transfer learning. To further exacerbate the difficulty of deploying machine learning methods to their fullest extent in the domain, there is still a distinct lack of large-scale, annotated training datasets for particular species despite evolving general frameworks (Beery et al., 2019). Whilst crowdsourcing annotations can help, low labelling rates relative to archive sizes remain the norm in the field. For great apes in particular, several recent works have attempted to address some of the above-mentioned challenges (Yang et al., 2019; Schofield et al., 2019; Sakib & Burghardt, 2021; Bain et al., 2021). However, these works still either only pretrain on datasets from other domains or rely on relatively small datasets for supervised training due to the complexities associated with obtaining annotations. Thus, while these methods have advanced the cause somewhat regarding great ape detection in jungle settings, they have also emphasised the urgent need for better use of the vast archives of completely unlabelled camera trap footage.

Paper Concept-In response, this paper introduces a novel curriculum learning approach that intertwines traditional supervised detector training with unlabelled data utilisation. The approach demonstrates by proof-of-concept that, exemplified for great apes, large unlabelled camera trap archives can indeed be exploited to enrich and empower real-world animal detector construction without any further labelling efforts. We leverage lessons learned from recent self-supervised (Grill et al., 2020; Caron et al., 2021; Chen and He, 2021) and semi-supervised (Sohn et al., 2020b, a; Xu et al., 2021) methods on feature representation learning and image classification to propose an end-to-end student-teacher based detection pipeline that integrates self-training (via pseudo-labels) and dynamic training policies into one cyclical curriculum learning design. Our model learns from unlabelled data in the curriculum by generating high-quality pseudo-labels on the fly. In turn, these virtual annotations of otherwise unlabelled samples are exploited by the student, whose update influences the teacher and the next round of pseudo-label generation. This cyclical self-training idea can be illustrated conceptually as a learning loop shown in Fig. 1. We will demonstrate that carefully fine-tuned curriculum learning policies in this loop can blend labelled and unlabelled sample input in a way that leads to virtuous training cycles (as opposed to vicious training cycles) which increasingly and consistently improve model performance. Critically, we show that dynamic learning adjustments can be controlled stably by policies and can improve performance over static learning. Intuitively, the approach expands model coverage of the vast space of animal appearance slowly from the labelled sample base, guided and channelled by the policies. We show that this approach can significantly improve great ape detection benchmarks, as well as other benchmarks including Bees and Snapshot Serengeti. We also demonstrate that the method is applicable beyond the targeted animal domain and achieves competitive or state-of-the-art results on the MS-COCO and PASCAL-VOC object detection challenges without a need for dataset-specific hyperparameter fine-tuning.

Contributions-Overall, the contributions of this paper can be summarised as follows: (i) a novel end-to-end dynamic detection framework for semi-supervised curriculum learning designed to improve species detectors built from sparsely labelled datasets, (ii) a dynamic policy system with stable hyper-parameters for temporal control over changing learning properties in semi-supervised detector training, promoting self-reinforcing virtuous training loops, (iii) extensive experiments and ablations on a large-scale real-world great ape camera trap dataset - we report improvements to the state-of-the-art for the semi-supervised great ape detection task evaluated on the Extended PanAfrican Dataset, (iv) new semi-supervised detection benchmarks on sparsely labelled versions of two other animal datasets - Bees and Snapshot Serengeti - contributing towards handling annotation shortage in the animal domain, and finally (v) competitive and state-of-the-art semi-supervised object detection results for the MS-COCO and PASCAL-VOC datasets, demonstrating broader applicability.

2 Related Work

In this section, we consider works related to the key topics of interest with focus on the state-of-the-art.

Semi-supervised Learning (SSL)-SSL exploits the potential of unlabelled data to facilitate model learning with limited amounts of annotated data (Rebuffi et al., 2020). Training computer vision models, such as object detection or action recognition networks, relies on the availability of annotated datasets, which can be costly to generate. This has motivated the development of semi-supervised methods (Jeong et al., 2019; Berthelot et al., 2019; Zhai et al., 2019; Sohn et al., 2020a, b; Zhang et al., 2021; Xu et al., 2021; Tang et al., 2021).

One dominant SSL approach is consistency regularisation, where the model is regularised to generate consistent predictions on data with different augmentations (Jeong et al., 2019; Berthelot et al., 2019; Zhai et al., 2019). Another approach is based on generating pseudo-labels for unlabelled data and updating the model by training on a mix of unlabelled data with pseudo-labels and labelled data with manually-annotated labels (Sohn et al., 2020a, b; Zhang et al., 2021; Xu et al., 2021; Tang et al., 2021; Liu et al., 2021). What type of pseudo-labelling to use is critical to the success of SSL in particular scenarios. FixMatch (Sohn et al., 2020a) applied a high confidence threshold for mining pseudo-labels, and these sharpened and strongly-augmented pseudo-labels were then utilised for model training. STAC (Sohn et al., 2020b) extended FixMatch from image classification to object detection by introducing self-training and augmentation-driven consistency regularisation. More recently, Xu et al. (2021) introduced the soft teacher mechanism to alleviate the issue of unreliable pseudo-labels generated by the teacher in SSL object detection. Liu et al. (2021) jointly train a student and a teacher in a mutually-beneficial manner by applying a class-balance loss to down-weight overly confident pseudo-label impact. In light of the success of these methods, our approach follows the pseudo-labelling concept, but addresses the model learning challenges differently.

Fig. 2

Detailed end-to-end self-training great ape detection pipeline. We utilise the Deformable DETR (Zhu et al., 2020) framework with a ResNet backbone as detector architecture. Both the student network (light green) and the teacher network (dark green) use this architecture. All labelled data along with dynamically sampled and policy-controlled unlabelled data are mixed during training. The teacher performs pseudo-label generation with purely unlabelled input on the fly. The pseudo-labels are filtered with an adaptive threshold and then augmented via a bounding box-aware transformation. The teacher network is updated by the student model via a dynamic momentum coefficient. The final loss is the sum of supervised and unsupervised detection losses balanced by a policy-controlled dynamic weight. We carefully designed the policies of the system to achieve an effective (a.k.a. virtuous) self-reinforcing training cycle (Color figure online)

Object detection-This area of computer vision has advanced in leaps and bounds since the very start of the modern era of deep learning. Some notable early works are: (i) single-stage detection frameworks, such as (Redmon et al., 2016; Liu et al., 2016; Lin et al., 2017b; Tian et al., 2019), which perform object classification and bounding box regression directly, without using pre-generated region proposals. They are typically applied over a dense sampling of possible object locations to estimate the class probabilities and bounding box coordinates directly. (ii) In contrast, two-stage detection frameworks, such as (Ren et al., 2015; He et al., 2017; Lin et al., 2017a), utilise a region proposal network to generate class-agnostic regions of interest (ROIs) and only then perform ROI bounding box regression and object classification. More recently, the DEtection TRansformer (DETR) (Carion et al., 2020) built the first end-to-end detection pipeline by viewing object detection as a direct set-prediction problem. DETR eliminated the need for anchor-based target assignment pre-processing and non-maximum suppression (NMS) post-processing, prevalent in commonly used object detectors. It combined CNNs for feature extraction and transformers for feature interpretation to directly translate object queries to classes and bounding boxes by leveraging cross attention (Vaswani et al., 2017) on image features. However, the vanilla DETR suffers from slow convergence and hence longer training time than detectors based on YOLO, SSD and Faster-RCNN. Deformable DETR (Zhu et al., 2020) proposed a deformable attention module that only attends to a small set of prominent key elements to replace the attention in DETR. This improvement led to faster convergence and better performance. We select this variant as our model for the various detection components of our proposed curriculum learning framework.

Curriculum Learning (CL)-The CL training approach (Bengio et al., 2009; Wang et al., 2022b) has had significant impact on the design of computer vision algorithms, such as (Karras et al., 2018; Wang et al., 2018; Huang et al., 2020; Wang et al., 2022a; Zhang et al., 2021). Wang et al. (2018), for instance, use the average precision of each sample to re-rank the data from easy to hard and train the object detector on this pre-ranked order in an easy-to-hard fashion. Wang et al. (2022a) propose a pseudo-labelled auto-curriculum learning framework that engages reinforcement learning to learn a series of dynamic thresholds for the pseudo-labels for semi-supervised key-point localisation. FixMatch (Sohn et al., 2020a), on the other hand, applied a constant threshold to select unlabelled samples for training, which fails to address the learning difficulties at different time steps. Thus, it can allow poor-quality samples to get through. FlexMatch (Zhang et al., 2021) improved on FixMatch by dynamically adjusting the threshold at each time step to filter unlabelled samples and pseudo-labels.

Both FixMatch and FlexMatch applied a pre-trained pseudo-label generator which does not get updated during the semi-supervised learning stage, thus failing to consider the evolution of the pseudo-label generator as the learning progresses. To address this issue, we propose a student-teacher learning paradigm inspired by the recent advances in self-supervised learning methods (Grill et al., 2020; Chen and He, 2021; Caron et al., 2021) that evolves the teacher component dynamically, guided by a set of curriculum learning policies and controls. We will now describe our approach in detail.

3 Proposed Method

We introduce an end-to-end curriculum learning pipeline for effective semi-supervised great ape detection in camera trap footage. Our framework follows a student-teacher training scheme, as illustrated in detail in Fig. 2, and operates as follows: in each learning iteration, we train a student model built around a Deformable DETR detector (Zhu et al., 2020) on a mix of labelled and unlabelled videos, where the unlabelled videos are sampled by our curriculum sampling policy \(\pi \). The teacher performs pseudo-label generation with unlabelled input. Pseudo-labels are then refined by a dynamic threshold \(\varsigma _t\) and transformed by augmentation policy \(\mathcal {A}\). Together, both the pseudo-labels and manually-annotated labels are fed into the student network for learning. The student network is then updated by the gradient from the overall loss, which is balanced by the unsupervised loss weight \(\alpha _t\). Finally, the teacher network is updated by the exponential moving average (EMA) of the student parameters via a dynamic momentum coefficient \(m_t\). This completes one iteration of the learning loop, leading to an updated teacher and student model. Our goal is to design the mentioned policies and an appropriate loss such that virtuous, that is effective, learning can be achieved in practice.
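To make this loop concrete, the following schematic sketch outlines a single training iteration. It is an illustration only, not the released implementation: `detection_loss` and `filter_by_confidence` are hypothetical placeholders for the Deformable DETR set-prediction loss and the pseudo-label filtering step, and the `policies` object is assumed to bundle the schedules introduced in the following subsections.

```python
import torch

@torch.no_grad()
def ema_update(teacher, student, m):
    # Eq. (5): theta'_t <- m * theta'_{t-1} + (1 - m) * theta_t
    for p_t, p_s in zip(teacher.parameters(), student.parameters()):
        p_t.mul_(m).add_(p_s, alpha=1.0 - m)

def train_step(student, teacher, optimizer, labelled_batch, unlabelled_batch,
               policies, t, T):
    """One iteration of the self-training loop of Figs. 1 and 2 (schematic)."""
    x, y = labelled_batch          # labelled frames with ground-truth boxes/classes
    u = unlabelled_batch           # unlabelled frames drawn by the sampling policy pi

    # Teacher generates pseudo-labels on the fly (no gradients through the teacher).
    with torch.no_grad():
        raw_predictions = teacher(u)
    pseudo = filter_by_confidence(raw_predictions, policies.threshold(t, T))

    # Bounding box-aware augmentation of unlabelled frames and their pseudo-labels.
    u_aug, pseudo_aug = policies.augment(u, pseudo)

    # Eq. (1): overall loss = supervised loss + alpha_t * unsupervised loss.
    loss = detection_loss(student(x), y) \
        + policies.alpha(t, T) * detection_loss(student(u_aug), pseudo_aug)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Teacher follows the student via EMA with the dynamic momentum m_t.
    ema_update(teacher, student, policies.momentum(t, T))
    return loss.detach()
```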

3.1 Problem Definition

Let us consider that frames are sampled from the video at a frequency of \(\omega \) for both labelled and unlabelled videos. The teacher is trained to generate the pseudo-labels for unlabelled frames only, while the student is trained to fit the pseudo-labels with the unlabelled input frames, as well as the ground-truth labels with the labelled input frames. Thus, the overall loss for the student is defined as the weighted sum of supervised and unsupervised losses:

$$\begin{aligned} \mathcal {L}_{all}=\mathcal {L}+\alpha \mathcal {L}^{\prime }, \end{aligned}$$
(1)

where \(\mathcal {L}\) and \(\mathcal {L}^{\prime }\) denote the supervised loss of labelled samples and unsupervised loss of unlabelled samples respectively, and \(\alpha \) represents the balancing weight.

Fig. 3

Bipolar behavioural dynamics of learning via self-training loops. Two representative cases illustrating the bipolar dynamics of training success when using the proposed architecture: vicious collapse in (a) and virtuous effective learning in (b). The scenarios differ only in policy parameterisation. (For reproducibility of behaviour in Fig. 3, the exact parameters used were: (a) \(\pi \): constant policy, constant \(\varsigma =0.1\) and constant \(\alpha =0.1\); (b) \(\pi \): linear increase policy, linear increase \(\varsigma ~0.3\rightarrow 0.5\), constant \(\alpha =0.5\). Both were trained for 1000 epochs with the first 250 epochs for warmup, lr decreases at the 800\(^{th}\) epoch). In both plots, the right ordinate indicates the \(AP_{50}\) on the validation set whilst the left ordinate represents the average number of samples with confidence score \(\zeta > 0.9\) or \(\zeta >0.5\)

Further consider we have a labelled sample set \(D_X\) with M labelled samples \(X_i\) and their corresponding class and box labels \((C_i, B_i)\) and an unlabelled sample set \(D_{U}\) with N unlabelled samples \(U_j\) (regardless of the sampling approach) used for training. Also, let \(\mathfrak {S}(\theta )\) be the student model parameterised by \(\theta \), and \(\mathfrak {T}(\theta ^{\prime })\) be the teacher model parameterised by \(\theta ^{\prime }\). Eq. (1) can then be expanded to:

$$\begin{aligned} \mathcal {L}_{all}(\theta )&= \mathcal {L}+\alpha \mathcal {L}^{\prime } \\&= \frac{1}{M} \sum _{X_i \in D_X} L_{\theta }\left( X_{i}\right) +\alpha \frac{1}{N}\sum _{U_j \in D_{U}} L_{\theta }^{\prime }\left( U_j\right) ~, \end{aligned}$$
(2)

where \(L_{\theta }\left( X_{i}\right) \) is the loss for labelled sample \(X_i\),

$$\begin{aligned} L_{\theta }\left( X_{i}\right)&= {\text {Loss}} \big ( \mathfrak {S}\left( X_{i},{\theta }\right) , \left[ C_{i},B_i \right] \big ) \\&= L_{reg}\big (\mathfrak {S}\left( X_{i},{\theta }\right) ,B_i\big )+L_{ce}\big (\mathfrak {S}\left( X_{i},{\theta }\right) ,C_i\big ) ~, \end{aligned}$$
(3)

and \(L_{\theta }^{\prime }\left( U_j\right) \) is the loss for the unlabelled sample \(U_j\),

$$\begin{aligned} L_{\theta }^{\prime }\left( U_j\right)&= {\text {Loss}} \big (\mathfrak {S}\left( U_j,{\theta }\right) , \mathfrak {T}\left( U_j,{\theta ^{\prime }}\right) \big ) \\&= L_{reg}\big (\mathfrak {S}\left( U_j,\theta \right) ,\mathfrak {T}\left( U_j,\theta ^{\prime }\right) \big ) +L_{ce}\big (\mathfrak {S}\left( U_j,\theta \right) ,\mathfrak {T}\left( U_j,\theta ^{\prime }\right) \big ) ~, \end{aligned}$$
(4)

where \(L_{reg}\) represents the bounding box regression loss and \(L_{ce}\) represents the classification loss.
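For illustration, a much-simplified reading of these per-sample loss terms is sketched below. It assumes predictions have already been matched to targets and omits the Hungarian matching and the GIoU component of the full Deformable DETR criterion, so it is a hedged sketch rather than the actual training loss; the function name is our own.

```python
import torch.nn.functional as F

def per_sample_detection_loss(pred_logits, pred_boxes, target_classes, target_boxes):
    """Simplified L = L_reg + L_ce (cf. Eqs. 3-4), assuming matched predictions."""
    l_ce = F.cross_entropy(pred_logits, target_classes)   # classification loss L_ce
    l_reg = F.l1_loss(pred_boxes, target_boxes)           # box regression loss L_reg
    return l_reg + l_ce
```

For unlabelled samples, the targets are simply replaced by the teacher's filtered pseudo-labels, as in Eq. (4).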

We follow common practice in self-supervised learning methods, such as (Caron et al., 2021; Grill et al., 2020), so that the teacher is updated by the EMA of the student,

$$\begin{aligned} \theta ^{\prime }_{t} \leftarrow m\theta ^{\prime }_{t-1}+(1-m)\theta _{t} ~. \end{aligned}$$
(5)

Our objective is to find a set of student parameters \(\theta ^*\) that minimises the expected overall loss \(\mathcal {L}_{all}(\theta )\), such that

$$\begin{aligned} \begin{array}{c} \theta ^*=\underset{\theta }{\arg \min }\ \mathcal {L}_{all}(\theta ) ~. \end{array} \end{aligned}$$
(6)

3.2 Self-reinforcing Training Loop

Fig. 4

Processes within self-reinforcing loops. Illustration of four key processes (arrows) involved in training loops. Note that a destabilised Virtuous Cycle, in which low-quality pseudo-labels or highly inaccurate student or teacher networks are produced, turns into a Vicious Cycle, and vice versa. Thus, effective parameterisations and policies for the key processes are required to promote stable learning

The evolution of the student and teacher network is conceptually a cyclic relationship. On the one hand, the performance of the student detector depends on the quality of the pseudo-labels, which in turn relies on the teacher, and on the other hand, the teacher is updated according to student status. Thus, there is an intricate interdependence between the student, the teacher, and the pseudo-labels forming a self-training loop which is controlled by the learning policies.

In practice, we observe a bipolarisation phenomenon for the training of models with different settings: they gradually become more confident in their predictions, but show two drastically different performance trajectories. As illustrated by the sample training runs shown in Fig. 3, a gradual increase of the confidence indicators \(\mathbb {E}[N_{\zeta >0.9}]\) and \(\mathbb {E}[N_{\zeta >0.5}]\), which represent the average number of predicted objects whose confidence scores \(\zeta \) exceed 0.9 or 0.5 respectively, can be observed in both cases; yet the validation performance can be erratic, either collapsing or improving effectively. In the example illustrated in Fig. 3a, a decrease of \(AP_{50}\) on the validation set was observed after a few hundred training epochs. Initially, one may assume the model is simply over-fitting at this stage in learning. However, as shown in the second example in Fig. 3b for a different parameterisation, a long-term increase in \(AP_{50}\) can be observed, which suggests the model can keep learning from the training set well into later training cycles. We hypothesise that the bipolar collapse or success of learning with regard to generalisation is critically linked to the self-reinforcement property of the training loop parameterisation and policies.

We categorise this bipolarisation as two different types of learning cycles, effective Virtuous Cycles and collapsing Vicious Cycles, as illustrated in Fig. 4. In the virtuous cycle state, the teacher model generates pseudo-labels of sufficient quality to contribute to the training of the student model, allowing both models to improve continually. In contrast, in the vicious cycle the teacher generates pseudo-labels of insufficient quality, which degrade the training of the student model, so that both models degenerate continually.

Fig. 4 depicts four key processes in the learning loop (shown as arrows) which crucially influence the trajectory of learning: (i) Initialisation: initialising the student model before the self-training phase; (ii) Teacher Update: updating the teacher network according to student status; (iii) Pseudo-label Generation: generating pseudo-labels with the teacher; (iv) Student Training: using pseudo-labels to update the student. Our goal is to find suitable controls that guide the above processes and can maintain the development of a virtuous self-training loop and, for robustness, also transition from a vicious to a virtuous setting.

For initialisation, we confirmed experimentally that the proposed system operates in a stable manner with fixed, standard backbone initialisations across all tested datasets. In particular, we use the self-supervised ImageNet pre-trained ResNet weights from SWAV (Caron et al., 2020) for our detection backbone. In addition, to show that general training stability can also be maintained in a supervised initialisation scenario, we test supervised ImageNet pre-trained ResNet (He et al., 2016) weights too. Note that such a fixed initialisation is essential, as random initialisation triggers vicious training cycles; however, the choice is not sensitive to target dataset properties, as transfer between scenarios still produces stable learning (see Sect. 7).

For the other three processes above, we propose appropriate 'policies' that guide learning within the confines of effective 'virtuous' learning cycles; these are described next.

3.3 Student Training

Student network training is guided by two policies that allow the student model to exploit the unlabelled sample data and their pseudo-labels effectively.

Fig. 5

Unlabelled data sampling policies & Unsupervised Loss Dynamics. (a) models a linear increase of the unlabelled data ratio from 0 to 1 over epochs, (b) combines warm-up and cool-down phases with linear increase, (c) linear decrease of the unlabelled data ratio, (d) combines linear decrease with warm-up and cool-down phases, (e) keeps a constant ratio, and (f) shows the unsupervised loss \(L_\theta ^{\prime }\) observed in training

Unlabelled Data Sampling Policy controls the number of unlabelled samples to use in the self-training loop at different time steps. It can be expressed with an additional Bayesian prior on Eq. (2), i.e.

$$\begin{aligned} \begin{aligned} \mathcal {L}_{\pi }(\theta )&= \hat{\mathbb {E}}\,\left[ L_{\theta }\right] \\&= \frac{1}{M} \sum _{X_i\in D_X}L_{\theta }\left( X_{i}\right) +\alpha \frac{1}{N\mathbb {E}[\pi ]}\sum _{U_j\in D_U}L_{\theta }^{\prime }\left( U_j\right) \pi (U_j), \end{aligned} \end{aligned}$$
(7)

where \(\pi (U_j)\) is the probability for using the unsupervised loss of \(U_j\) in the self-training stage. For simplicity, we substitute \(\alpha \frac{\pi (U_j)}{N\mathbb {E}[\pi ]}\) with \(p(U_j)\) in Eq. (7) to obtain:

$$\begin{aligned} \mathcal {L}_{\pi }(\theta )&= \frac{1}{M} \sum _{i=1}^{M} L_{\theta }\left( X_{i}\right) + \sum _{j=1}^{N} L_{\theta }^{\prime }\left( U_j\right) p(U_j)\nonumber \\&=\frac{1}{M} \sum _{i=1}^{M} L_{\theta }\left( X_{i}\right) + \sum _{j=1}^{N} \Big ( L_{\theta }^{\prime }\left( U_j\right) p(U_j) - \hat{\mathbb {E}}[L_\theta ^{\prime }]p(U_j) \nonumber \\&\quad - \hat{\mathbb {E}}[p]L_{\theta }^{\prime }\left( U_j\right) + \hat{\mathbb {E}}[L_\theta ^{\prime }]\hat{\mathbb {E}}[p] \Big ) + N\hat{\mathbb {E}}[L_\theta ^{\prime }]\hat{\mathbb {E}}[p] \nonumber \\&=\frac{1}{M} \sum _{i=1}^{M} L_{\theta }\left( X_{i}\right) +N\hat{\mathbb {E}}[L_\theta ^{\prime }]\hat{\mathbb {E}}[p]\nonumber \\&\quad +\sum _{j=1}^{N} (L_{\theta }^{\prime }\left( U_j\right) -\hat{\mathbb {E}}[L_\theta ^{\prime }])(p(U_j)-\hat{\mathbb {E}}[p]) \nonumber \\ {}&=\frac{1}{M} \sum _{i=1}^{M} L_{\theta }\left( X_{i}\right) +N\hat{\mathbb {E}}[L_\theta ^{\prime }]\hat{\mathbb {E}}[p] + N\hat{{\text {Cov}}}[L_\theta ^{\prime },p] ~ .\quad \end{aligned}$$
(8)

Based on the definition of \(\mathcal {L}_{all}(\theta )\) in Eq. (2), Eq. (8) can be simplified to:

$$\begin{aligned} \mathcal {L}_{\pi }(\theta )&= \mathcal {L}_{all}(\theta ) + N\hat{{\text {Cov}}}[L_\theta ^{\prime },p] \end{aligned}$$
(9)

The goal is to search for the best unlabelled data sampling policy \(\pi ^*\) that can yield the lowest possible loss for Eq. (9), such that:

$$\begin{aligned} \pi ^*&=\underset{\pi }{\arg \min }\ ~\mathcal {L}_{\pi }(\theta )\\&= \underset{\pi }{\arg \min }\ ~ \mathcal {L}_{all}(\theta ) + N\hat{{\text {Cov}}}[L_\theta ^{\prime },p]\\&=\underset{\pi }{\arg \min }\ ~ \hat{{\text {Cov}}}[L_\theta ^{\prime },p] \end{aligned}$$
(10)

Equation (10) suggests that if \(L_\theta ^{\prime }\) and p are negatively correlated then we can arrive at an effective policy \(\pi ^*\). Given that p is positively correlated with \(\pi \), since \(\frac{\alpha }{N\mathbb {E}[\pi ]}\) is positive, an effective unlabelled data sampling policy \(\pi ^*\) should be negatively correlated to \(L_\theta ^{\prime }\). The model gets updated for each iteration, thus one may assume naïvely that \(L_\theta ^{\prime }(U_{j+1})<L_\theta ^{\prime }(U_{j})\), because \(L_\theta ^{\prime }(U_{j+1})\) is generated after backpropagation of \(L_\theta ^{\prime }(U_{j})\). In practice, during training, we also observed such a decrease of \(\mathbb {E}[L_\theta ^{\prime }]\) as shown in Fig. 5(f).

In summary, considering that \(\pi ^*\) and \(L_\theta ^{\prime }\) are negatively correlated, and that \(L_\theta ^{\prime }\) is indeed decreasing over time, we conclude that an effective \(\pi \) can be obtained via cyclical curriculum learning. Practically, this may be carried out via a gradual increase of unlabelled sample input in the ways shown in Fig. 5a or b, where the latter includes warm-up and cool-down periods. Conceptually, these policies expand learning slowly but steadily towards the unexplored data domain in order to allow for a gradual expansion of high quality model expertise and prevent erratic learning collapse. For comparison and to emphasise the importance of this policy choice, we later also experimentally examine the other policies depicted in Figs. 5c, d, and e.
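As a concrete illustration of the 'phase in' schedules of Fig. 5a and b, the short sketch below maps the training epoch t to the fraction of the unlabelled pool sampled at that epoch; the function and its default arguments are our own simplification and not taken from the released code.

```python
def unlabelled_ratio(t, T, warmup=0.0, cooldown=0.0):
    """Fraction of the unlabelled pool sampled at epoch t of T epochs.

    warmup/cooldown are the fractions of training held at the extremes
    (Fig. 5b); warmup=cooldown=0 gives the plain linear increase of Fig. 5a.
    """
    start, end = warmup * T, (1.0 - cooldown) * T
    if t <= start:
        return 0.0                       # warm-up: no unlabelled samples yet
    if t >= end:
        return 1.0                       # cool-down: the full unlabelled pool
    return (t - start) / (end - start)   # linear ramp in between
```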

Fig. 6

Pseudo-label analysis. We use 70k pseudo-labels generated from the teacher network to conduct this analysis. Note that unlabelled samples do not use ground truth in training, but we use it for this analysis. (a) and (b) show the distributions of each pseudo-label’s IOU with the ground truth against the confidence score \(\zeta \), visualised at the early stage (100\(^{th}\) epoch snapshot) and later stage (800\(^{th}\) epoch snapshot) of training, respectively. (c) the pseudo-label mean precision at IOU\(_{50}\) and IOU\(_{75}\) without applying a threshold over the training epochs. (d) pseudo-label quality against the threshold \(\varsigma \), represented by IOU\(_{50}\) and IOU\(_{75}\) precision averaged across epochs, and IOU with real ground truth labels averaged over all epochs. (e) mean recall at IOU\(_{50}\) and IOU\(_{75}\) for different \(\varsigma \) values, (f) heatmap indicating the normalised \(F_\beta \) score for \(\varsigma _t\) at different epochs - light to dark colours for low to high scores, orange shows the best \(F_\beta \) score at each epoch, and the dark green plot represents our confidence threshold policy as the \(\arctan \) function that approximates the best \(F_\beta \) scores (Color figure online)

Unsupervised Loss Weighting Policy is tasked with balancing the weighting between the supervised and unsupervised losses. The performance of the student detector depends on the quality of the pseudo-labels. Fig. 6a and b depict pseudo-label distributions captured at an early stage and a late stage of the training, respectively, plotted against the confidence score \(\zeta \). Label confidence and IOU quality clearly increase over training at these snapshot points. The associated @IOU\(_{50}\) and @IOU\(_{75}\) precision curves in Fig. 6c illustrate that the average quality of the pseudo-labels increases over time. We note that recent works (Sohn et al., 2020b; Xu et al., 2021; Tang et al., 2021) on this topic only applied a fixed weighting to all pseudo-labels throughout the training. Yet, given this observed gradual change in pseudo-label quality, there is an opportunity to design an adaptive weighting policy that applies smaller unsupervised loss weights for less reliable pseudo-labels in the early training stages and larger weights for more reliable pseudo-labels generated in the later training stages.

To implement this, we use a curriculum learning approach for the unsupervised loss weighting parameter \(\alpha \), which is made subject to an adaptive weighting policy. Theoretically, better policies would keep track of the bounding-box pseudo-label qualities. However, in practice, this is hard to do on the fly due to the unavailability of the ground truth and extensive computational needs. We thus opt for a simple linear increase of \(\alpha \) as a first approximation.
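A minimal sketch of such a linear schedule is given below; the endpoint values are illustrative placeholders rather than the exact weights used in our experiments.

```python
def alpha_schedule(t, T, alpha_min=0.1, alpha_max=1.0):
    """Linearly increase the unsupervised loss weight alpha over training, so that
    unreliable early pseudo-labels contribute less to the overall loss."""
    progress = min(max(t / T, 0.0), 1.0)
    return alpha_min + (alpha_max - alpha_min) * progress
```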

3.4 Pseudo-label Generation

We use two policies to generate reliable pseudo-labels from the teacher’s output to promote a virtuous cycle for training.

Confidence Threshold Policy is designed by examining the reliability of pseudo-labels at different threshold values of \(\varsigma \), ranging from 0.1 to 0.9 and averaged across epochs. Our aim is to select an optimal value that discards unreliable pseudo-labels most effectively.

Three metrics are applied to assess the quality of the pseudo-labels: IOU, Precision @IOU\(_{50}\) and Precision @IOU\(_{75}\). The plots in Fig. 6d for all measures show that they increase as \(\varsigma \) does. Trivially, the higher the value of \(\varsigma \), the higher the probability of obtaining more reliable pseudo-labels. So for highest quality one could select \(\varsigma =0.9\). However, as a consequence the recall rate is significantly suppressed, with both mean @IOU\(_{50}\) and @IOU\(_{75}\) recalls of course negatively correlated to \(\varsigma \) (see Fig. 6e). For example, when \(\varsigma =0.9\), mean precision reaches approx. \(95\%\), while the mean recall drops to approx. \(15\%\).

Fig. 7

Augmentation strategy. Visualisation of colour augmentation and geometric augmentation examples used in the experiments. Augmentations are selected such that the results reflect the variance found across different camera and acquisition settings commonly seen in the dataset. (best viewed under zoom)

To address this issue, our dynamic confidence threshold policy increases the quality of pseudo-labels by controlling false negatives explicitly and thereby balancing precision and recall. We apply the \(F_{\beta }\) score, the weighted harmonic mean of precision and recall, with \(\beta =0.5\) so that the \(F_{\beta }\) score assigns more weight to precision than to recall, on the basis that false positives have a more negative impact than false negatives in our pipeline. Compared to anchor-based detectors, which assign missed bounding boxes as negatives (non-object class), such as Ren et al. (2015); He et al. (2017); Lin et al. (2017b), the DETR family of methods does not. Their bipartite matching stage only matches the predictions with the ground truth; thus any missed bounding boxes are ignored and there are no penalties for them in training. Further, bounding box-aware crops are applied in the augmentation stage, so false negative areas may be removed from the image altogether. For example, see the chimp on the right side of the first image in the last column in Fig. 7: if undetected by the teacher, it will disappear after a geometric transformation, as shown in the last image of the same column.

After fixing \(\beta \), which may be changed for different application scenarios, the goal of our confidence threshold policy is to search for the threshold \(\varsigma \) at time step t that can maximise \(F_{\beta }\), i.e.

$$\begin{aligned} \begin{array}{c} \varsigma ^*_{t}=\underset{\varsigma _t}{\arg \max }~ F_{\beta } (\mathcal {P}_t , \mathcal {R}_t | \varsigma _t),\\ \end{array} \end{aligned}$$
(11)

where \(\mathcal {P}_t\) and \(\mathcal {R}_t\) represent the precision and recall rates at time step t, determined by threshold \(\varsigma _t\). The heatmap in Fig. 6(f) shows the normalised \(F_\beta \) score for each time step (darker colour means higher value), with the best \(F_\beta \) at threshold \(\varsigma _t\) shown in the orange plot. We use the approximate fit of the \(\arctan \) function (dark green plot) to represent our confidence threshold policy.
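The following sketch spells out (i) the \(F_\beta \) criterion used to score a candidate threshold and (ii) an \(\arctan \)-shaped threshold schedule of the kind fitted in Fig. 6f. The constants controlling the curve (`low`, `high`, `sharpness`) are illustrative assumptions, not the fitted values.

```python
import math

def f_beta(precision, recall, beta=0.5):
    """Weighted harmonic mean of precision and recall; beta=0.5 favours precision."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1.0 + b2) * precision * recall / (b2 * precision + recall)

def threshold_schedule(t, T, low=0.3, high=0.6, sharpness=6.0):
    """arctan-shaped increase of the confidence threshold over training: it rises
    quickly early on and flattens later, approximating the per-epoch best F_beta
    threshold (Fig. 6f). Endpoint and shape constants are illustrative only."""
    progress = min(max(t / T, 0.0), 1.0)
    shape = math.atan(sharpness * progress) / math.atan(sharpness)
    return low + (high - low) * shape
```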

Data Augmentation Policy ensures consistent augmentation of unlabelled data under the pseudo-labels produced by the teacher.

This is an indispensable element in semi-supervised and self-supervised methods. Self-supervised methods, such as DINO (Caron et al., 2021) and BYOL (Grill et al., 2020), minimise the discrepancy between different views of the data generated by data augmentation. Recently, semi-supervised methods such as FixMatch (Sohn et al., 2020a) and STAC (Sohn et al., 2020b) use augmentation-driven consistency regularisation for classification and detection.


Following STAC, we explore different variants of transformations on the Extended PanAfrican Dataset as our augmentation policy \(\mathcal {A}\). We apply transformation operations in sequence as follows: first, we randomly apply bounding-box-aware crop and resize on the image, and then we apply a randomly-selected geometric transformation, followed by a random transformation on the colour statistics of the image (see code for all details).

Finally, for strong augmentation \(\mathcal {A}_s\), we apply random erase (Zhong et al., 2020) or cutout (DeVries and Taylor, 2017) at multiple random locations of the whole image. For a weak augmentation \(\mathcal {A}_w\), we just decrease the intensity for each transformation. Some examples are shown in Fig. 7 to illustrate the augmentation process.
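The sequence described above can be summarised in the schematic below. The helper functions (`bbox_aware_crop_resize`, `random_geometric`, `random_colour`, `random_erase`) are hypothetical placeholders for the operations listed in the text and are assumed to keep the bounding boxes consistent with the transformed image; the intensity handling is likewise illustrative.

```python
import random

def augment(image, boxes, strong=True):
    """Schematic augmentation policy A with strong (A_s) and weak (A_w) variants."""
    intensity = 1.0 if strong else 0.5                        # weak variant lowers intensity
    image, boxes = bbox_aware_crop_resize(image, boxes)       # bounding-box-aware crop + resize
    image, boxes = random_geometric(image, boxes, intensity)  # random geometric transformation
    image = random_colour(image, intensity)                   # random colour-statistics transformation
    if strong:
        for _ in range(random.randint(1, 3)):                 # multiple random locations
            image = random_erase(image)                       # random erase or cutout
    return image, boxes
```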

3.5 Teacher Update

Teacher Momentum Policy controls the update speed of the teacher model and is encapsulated in the momentum coefficient m. In Fig. 6c, we can see that the pseudo-label precision rate increases steeply in the early training stages, but slowly in the later training stages. Given that learning happens faster in the early stages, this motivates a dynamic momentum policy which takes this fact into account and stabilises teacher updates. Eq. (5) suggests that a lower momentum coefficient allows faster updates of the teacher model. To match the learning speed of the model at different time steps, we use a lower momentum coefficient m at the early stages and gradually increase it with time. In practice, we use a cosine increase of m in our pipeline, which has also been explored in DINO (Caron et al., 2021). In experiments, we find that this dynamic momentum policy leads to consistently better performance than a constant one (see Table 2).
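A hedged sketch of the cosine momentum schedule is given below; the default endpoints correspond to the 0.998 to 0.9998 range reported in Sect. 4, and the schedule shape follows the cosine rule also used in DINO.

```python
import math

def momentum_schedule(t, T, m_start=0.998, m_end=0.9998):
    """Cosine increase of the EMA momentum m: low early on (the teacher follows the
    student quickly) and close to 1 later (teacher updates slow down and stabilise)."""
    progress = min(max(t / T, 0.0), 1.0)
    return m_start + (m_end - m_start) * 0.5 * (1.0 - math.cos(math.pi * progress))
```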

3.6 Combined Policy Application

All five policies are implemented in unison as a dynamic curriculum learning strategy \(\Pi \) for our wildlife detection pipeline, comprising the unlabelled data sampling policy \(\pi _t=\Pi _\pi (t)\), the unsupervised loss weighting policy \(\alpha _t=\Pi _\alpha (t)\), the confidence threshold policy \(\varsigma _t=\Pi _\varsigma (t)\), the data augmentation policy \(\mathcal {A}=\Pi _{\mathcal {A}}(t)\) and the teacher momentum policy \(m_t=\Pi _m(t)\). Algorithm 1 illustrates this curriculum learning strategy \(\Pi =\{\pi _t,\alpha _t,\varsigma _t,\mathcal {A},m_t\}\) in its complete form.
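Wiring the schedules together, the strategy \(\Pi \) can be read as a small table of policy values queried at each time step. The sketch below refers to the illustrative functions sketched in the preceding subsections, with the quarter-long warm-up and cool-down phases taken from the training setup described in Sect. 4; it is not the released implementation of Algorithm 1.

```python
def curriculum(t, T):
    """Return the policy values {pi_t, alpha_t, sigma_t, A, m_t} for time step t."""
    return {
        "pi":    unlabelled_ratio(t, T, warmup=0.25, cooldown=0.25),  # sampling policy
        "alpha": alpha_schedule(t, T),                                # loss weighting policy
        "sigma": threshold_schedule(t, T),                            # confidence threshold policy
        "A":     augment,                                             # augmentation policy
        "m":     momentum_schedule(t, T),                             # teacher momentum policy
    }
```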

4 Experiments

Datasets-We test our method on the Extended PanAfrican Dataset from the PanAf programme (Max-Planck-Institute, 2022) which contains camera-trap footage captured in natural Great Ape habitats in central Africa. There are two major species of Great Apes in the dataset, gorillas and chimpanzees. The archive footage contains around 20K videos adding up to around 600 hours. We use a subset of 5219 videos, with 500 videos (totalling over 180K frames) manually annotated with per frame great ape location bounding boxes, species and further categories (Yang et al., 2019; Sakib & Burghardt, 2021). This labelled data is split into trainset, valset, testset at a ratio of \(80\%, 5\%, 15\%\) respectively. All labels and metadata are fully published (Yang et al., 2019) and source videos may be obtained as detailed in the Acknowledgements.

Table 1 Results and detailed comparative evaluation on the Extended PanAfrican Dataset

Following standard evaluation protocols as used in (Sohn et al., 2020b; Xu et al., 2021; Zhou et al., 2021; Liu et al., 2021), we utilise the Extended PanAfrican Dataset for system training and benchmarking under two general paradigms:

1. Partially Labelled Data (PLD). In this setting, either \(10\%\), \(20\%\), or \(50\%\) of the annotated trainset data are sampled as labelled training data, and the complete remainder of all data is used as unlabelled data. For each quantity, we create 3 different data folds and report the performance on testset with mean average precision (mAP) as the evaluation metric.

2. Fully Labelled Data (FLD). In this setting, the whole annotated trainset is utilised as the labelled training data and only the remaining \(\sim \)5K unlabelled videos, totalling \(\sim \)1.8M frames, are used as additional unlabelled data.

In addition, we investigate two other animal datasets under sparse labelling settings - Bees and Snapshot Serengeti (Swanson et al., 2015) - to further explore system effectiveness across the domain of animal imagery. We also present results on the MS-COCO (Lin et al., 2014) and PASCAL-VOC (Everingham et al., 2010) datasets to explore wider applicability of the introduced concepts to mainstream object detection.

Implementation Details-We use a Deformable DETR architecture with a ResNet-50 backbone as our default detection model (see Fig. 2) for evaluating the effectiveness of our method. The transformer encoder and decoder are randomly initialised, and the ImageNet pre-trained ResNet-50 weights from SWAV (Caron et al., 2020) are used as initial parameters for our backbone. The student model is trained with the AdamW optimizer (Loshchilov and Hutter, 2018) with a weight decay of 0.0004 and a batch size of 64, distributed over 4 GPUs. We follow Caron et al. (2021) in using a linear scale rule of \(lr = 0.0005\times batchsize /64\) and apply a lower learning rate of \(0.1\times lr\) for the backbone.

We use randomly sampled frames from each video at each epoch with frequency \(\omega =10\), and the frames are rescaled so that the shorter side of the frame is in the range [320, 480]. The PLD model is trained for 1000 epochs with the first quarter as the warm-up phase and the last quarter as the cool-down phase (Fig. 5a), and lr decreases to \(5e-5\) at the \(800^{th}\) epoch. The momentum m for updating the teacher follows a cosine schedule from 0.998 to 0.9998. Since the amount of training data for the partially labelled data setting and the fully labelled data setting is quite different, the training parameters for FLD vary slightly from those for PLD.
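One way to read the optimiser settings above is sketched below; this is a hedged interpretation rather than the released training script, and the module name prefix `backbone` is an assumption about how the model parameters are organised.

```python
import torch

def build_optimizer(model, batch_size=64, weight_decay=0.0004):
    """AdamW with the linear scale rule lr = 0.0005 * batch_size / 64 and a 10x
    lower learning rate for the backbone, following the implementation details."""
    lr = 0.0005 * batch_size / 64
    backbone = [p for n, p in model.named_parameters()
                if n.startswith("backbone") and p.requires_grad]
    others = [p for n, p in model.named_parameters()
              if not n.startswith("backbone") and p.requires_grad]
    return torch.optim.AdamW(
        [{"params": others, "lr": lr},
         {"params": backbone, "lr": 0.1 * lr}],
        lr=lr, weight_decay=weight_decay)
```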

Fig. 8

Relative improvement comparisons. Relative improvement of mAP for our method over the supervised baseline, STAC, SoftTeacher, and UbTeacher across various PLD settings. We find our method shows particularly strong performance in lower annotation ratio regimes typical for many wildlife data settings

Comparative Evaluation-We first evaluate our method for the PLD and FLD settings against a supervised baseline and state-of-the-art works STAC (Sohn et al., 2020b), SoftTeacher (Xu et al., 2021) and UbTeacher (Liu et al., 2021) at various ratios of labelled data. Table 1 summarises the results.

Our proposed method shows significant performance improvements under almost all test settings. For example, in the mAP column, we outperform the supervised baseline by \(13.79\%\), \(12.08\%\), \(3.89\%\), STAC by \(7.92\%\), \(7.66\%\), \(3.46\%\), SoftTeacher by \(6.59\%\), \(8.14\%\), \(2.92\%\) and UbTeacher by \(1.93\%\), \(3.23\%\), \(1.73\%\) when \(10\%\), \(20\%\), \(50\%\) of labelled data are provided, respectively. We find that our method works better than others particularly when the provided labelled data is small as illustrated in Fig. 8. We note again that such a setting is particularly common in wildlife applications where camera trap archives are large and accurate annotation ratios are very small. We can see that competitor methods also show sizeable improvements over the supervised baseline for smaller splits, indicating unsurprisingly that extra unlabelled data has particularly high value when very little labelling is available in the first place. However, note that the performance gap between the proposed method and other approaches is also particularly large in exactly this setting, confirming the specific applicability of our enhanced dynamics for curriculum learning in low labelling ratio settings. Qualitative results across all methods are exemplified and discussed in Fig. 9. This is complemented by visualisations of some failure cases in Fig. 10.

Fig. 9

Qualitative detection examples. We compare our method with other state-of-the-art approaches tested and the supervised baseline under the PLD setting with Labelled Ratios of \(10\%\), \(20\%\), \(50\%\). Note examples where our proposed method reliably detects partly occluded apes and ignores tree structures which distract some of the other models (best viewed under zoom)

Fig. 10

Examples of failure cases. Visualised are failure cases under the \(10\%\) PLD setting. Ground-truth labels are annotated in red, and our detection results are shown in green. Note that partial occlusions form one hard-to-learn aspect given sparse label availability for training (best viewed under zoom) (Color figure online)

Table 2 Ablation studies of all learning policies

5 Ablation Studies

In this section, we evaluate our key contributions by examining the importance of each policy. We use the fold1 split of the \(50\%\) PLD setting as the base for the conducted ablations.

Unlabelled Data Sampling Policy-The motivation of our unlabelled data sampling policy (Eq. 10) is that the minimum possible loss can be achieved if the unsupervised loss \(L_\theta ^{\prime }\) and the sampling policy \(\pi \) are negatively correlated. Based on this hypothesis, we designed five experiments covering five possible characteristics of policy \(\pi \): (i) a linear increase of the number of unlabelled samples used in training over the training process (curriculum learning, depicted in Fig. 5a), (ii) a linear increase with warm-up and cool-down phases at the beginning and end, respectively (Fig. 5b), (iii) starting with all unlabelled samples and linearly decreasing their number throughout training (Fig. 5c), (iv) the opposite of (ii) (Fig. 5d), and (v) using all unlabelled samples constantly throughout training (Fig. 5e).

The results in Table 2(a) show that a gradual increase of the number of unlabelled samples during the self-training phase gains around \(1.16\%\) mAP compared with constantly using all the unlabelled samples. The best performance, however, is achieved by introducing a warm-up and a cool-down phase, at \(64.3\%\) mAP. This ablation experiment demonstrates that 'phasing in' unlabelled data, as underpinned by our theoretical discussion in Sect. 3.3, as well as adding warm-up and cool-down phases, has a positive, measurable effect on learning performance.

Unsupervised Loss Weight Policy-The results in Table 2(b) demonstrate the effects of the unsupervised loss weight policy. We find that setting the unsupervised loss weight \(\alpha \) is a challenge, since both a large and a small loss weight can harm performance. A \(9.30\%\) mAP drop occurs when \(\alpha =2.0\) compared to \(\alpha =0.5\). We argue that constantly applying a large \(\alpha \) harms training at the beginning, because it assigns a large weight to the loss produced by unreliable pseudo-labels in the early stages, which misleads the model and can trap it in vicious training cycles. In contrast, with our dynamic weighting approach, the performance reaches \(64.73\%\), which is \(0.75\%\) better than a constant \(\alpha =0.5\), \(2.55\%\) better than \(\alpha =0.1\), and about \(10\%\) better than \(\alpha =2\). As discussed in Sect. 3.3, while our linear increase performs best, it is a naïve approach, since the best global policy is difficult and computationally costly to find across the space of monotonically increasing functions. Further exploration of this policy is subject to future work.

Confidence Threshold Policy-Table 2(c) displays the effects of different approaches for the confidence threshold policy. Both low and high thresholds, e.g. \(\varsigma =0.05\) and \(\varsigma =0.9\) respectively, cause significant performance degradation, with lower thresholds being worse. This suggests that false positive pseudo-labels (appearing at low thresholds) have a more negative impact than false negative pseudo-labels (appearing at higher thresholds). As noted in Sect. 3.4, this motivated us to use the weighted metric \(F_\beta \) to assess the best choice of threshold for this policy. Applying the \(\arctan \)-based increasing \(\varsigma \) approach, we see a significant increase in performance.

For comparison, we also evaluated a linear increase from 0.1 to 0.6, which takes similar strides; however, the \(\arctan \) schedule increases more aggressively in the early stages and achieves a slightly better outcome.

Augmentation Policy-A simple ablation is performed on the proposed augmentation policy, i.e. \(\mathcal {A}_w\) and \(\mathcal {A}_s\) augmentations versus no augmentation. Results show a significant improvement of \(9.32\%\), as seen in Table 2(d).

Teacher Momentum Policy-We compare a static momentum coefficient approach with a dynamic momentum approach for m which is used to update the teacher network. As shown in Table 2(e), both settings have a similar expected value but the dynamic momentum policy improves the performance significantly by \(4.66\%\).

Initialisation-The initial status of the student model is crucial since it can affect training direction from the start towards an effective virtuous or catastrophic vicious cycle of learning. We see in Table 2(f) that a random initialisation of the model can lead to such catastrophic failure in training, while a SWAV-based self-supervised initialisation outperforms a supervised one.

Table 3 Comparison with state-of-the-art methods on MS-COCO val2017 with PLD setting
Table 4 Comparison with state-of-the-art methods on MS-COCO val2017 with FLD Setting

6 Experiments on the MS-COCO Dataset

Our method primarily addresses the problem of handling sparsity of labelled data in animal biometrics whenever large amounts of unlabelled data are available. It is nevertheless both conceptually and practically applicable to mainstream object detection. The concept of slowly expanding detection capabilities of a model in a policy-controlled way to learn highly complex and variable object appearance is indeed not limited to animal detection. In order to experimentally support any claim of wider applicability, we next evaluated our proposed method on the popular MS-COCO dataset under a low data regime (PLD) and with extra unlabelled data (FLD). For a fair comparison, we followed the evaluation approach used by STAC (Sohn et al., 2020b), using their splits between labelled and unlabelled data for PLD settings. We trained our model with Labelled Ratios of 1%, 5%, and 10%, evaluated on the standard COCO val2017 with the mAP\(_{50:95}\) metric. For the FLD option, we trained our model using the fully labelled COCO train2017, plus additional unlabelled COCO unlabeled2017, following the same procedure described in (Sohn et al., 2020b; Yang et al., 2021; Liu et al., 2021; Tang et al., 2021; Zhou et al., 2021; Xu et al., 2021; Kim et al., 2022). As shown in Tables 3 and 4, we achieve leading state-of-the-art results for a 10% PLD Labelled Ratio and the FLD setting. At other ratios, our benchmarks remain competitive: for 5% PLD our method trails only \(0.90\%\) below the best result by SoftTeacher, and for 1% PLD it scores \(4.52\%\) below the SOTA MUM model. This demonstrates that the introduced concepts of dynamic control in curriculum learning are certainly applicable to the wider domain of general object detection. We find that our curriculum learning method is less sensitive to hyper-parameters: in practice, the hyper-parameter configurations for the COCO dataset are inherited from the hyper-parameters fine-tuned on the PanAfrican dataset. They can indeed outperform the state-of-the-art under certain configurations after searching. Further research will be required to establish to what extent truly dataset-optimal hyper-parameterisation of dynamic training regimes such as the one presented is computationally feasible. For practical purposes, it is important to note that hyper-parameter transfer does not lead to learning collapse or vastly degraded performance, as will be shown again in our experiments outlined in the next section.

Table 5 Comparison on PASCAL VOC dataset
Table 6 Experimental results on the Bees dataset

7 Experiments on the PASCAL VOC Dataset

In order to understand applicability to mainstream object detection further, we utilise another popular object detection benchmark to evaluate our model. We follow the standard FLD evaluation process on the PASCAL VOC dataset (Everingham et al., 2010), as in (Sohn et al., 2020b; Liu et al., 2021; Zhou et al., 2021), with the performance of our model reported on VOC07-test, trained using VOC07-trainval as the labelled training set, and VOC12-trainval or VOC12-trainval + COCO20cls as the unlabelled training data.

As shown in Table 5, we explore two different policy-parameter settings in the experiments: (i) without policy-parameter searching (marked with \(\dagger \) in the table), and (ii) with pseudo-label analysis and policy-parameter searching. For VOC12, Row 13 shows the leading 57.02% mAP and the second best 81.89% mAP\(_{50}\), and for VOC12 + COCO20cls, Row 13 offers the second best 58.28% mAP and a competitive 81.82% mAP\(_{50}\) among state-of-the-art methods. It also achieves a 14.89% and 16.15% gain in mAP over the supervised baselines, respectively, by simply adopting the configuration from the MS-COCO experiments. This further supports the argument that policy and parameter transfer does not lead to learning collapse or vast performance degradation.

When we systematically analyse the pseudo-labels and perform policy hyper-parameter fine-tuning, the performance of our method can be boosted further, achieving a state-of-the-art \(57.65\%\) mAP for VOC12 and \(82.34\%\) mAP\(_{50}\) for VOC12 + COCO20cls (row 14 in Table 5).

Our experimental results on PASCAL VOC suggest that (i) the proposed method has applications beyond animal detection, and (ii) it does not need heuristic tuning of hyper-parameters, since merely adopting the COCO ones for PASCAL VOC can lead to virtuous training cycles and achieve competitive results.

Table 7 Experimental results on the Snapshot Serengeti dataset

8 Experiments on the Bees Dataset

The Bees dataset contains approximately 5K images of bees captured in hives. The bees and pollen that appear in each image are annotated with bounding boxes, and most of the data comprises crowded scenes where bees are densely located.

In this experiment, we consider only PLD settings as we do not have extra bee data. We randomly split 80% of the whole data as a training set and use the rest as a testing set. For the training set, we construct three different PLD labelled ratios sampled with five different random seeds, where the labels are randomly masked so that the proportion of labelled data is 5%, 10%, and 20%, respectively. We evaluate the supervised baseline (using the same baseline approach as for MS-COCO) and our proposed model over five data folds for the 5%, 10%, and 20% labelled ratios and report the mean and standard deviation of mAP. In Table 6, we demonstrate a substantial performance boost by applying our policy-guided semi-supervised learning, especially under lower data regimes, with a \(6.66\%\) gain over the baseline in the 5% PLD setting.

9 Experiments on the Snapshot Serengeti Dataset

We finally conducted experiments on sparsely labelled versions of the Snapshot Serengeti dataset (Swanson et al., 2015), in which overall around 78K images (out of 7.1M) are labelled with instance-level bounding boxes that allow us to test our proposed method. We conducted our experiments under PLD settings where the model was trained with 5% and 10% of the 78K labelled images.

As shown in Table 7, substantial boosts of mAP, mAP\(_{50}\), mAP\(_{75}\) can be observed under limited label regimes when comparing the supervised baseline (composed as before for the MS-COCO and Bees datasets) to our full system. Moreover, our method can reach similar or better performance than the supervised baseline while using only half of the labelled data.

To provide some further context to the wider literature on this dataset, we note that using only \(20\%\) of the labels our method’s performance at an \(mAP_{50}\) of 82.7 comes close to published results for fully supervised training with \(100\%\) of the Snapshot Serengeti labels using Mask-RCNN (Ibraheam et al., 2021) at an \(mAP_{50}\) of 85.7, and significantly outperforms full-label training with Faster-RCNN (Ibraheam et al., 2021) at an \(mAP_{50}\) of 73.2 or Context-RCNN (Beery et al., 2020) at an \(mAP_{50}\) of 55.9.

Finally, we note that the multi-dataset learning approach of MegaDetectorV5a* (Beery et al., 2019), leading to an \(mAP_{50}\) of 90.65 on this dataset, prevents a fair apples-to-apples comparison with our method. However, the multi-dataset training regime is clearly highly effective in utilising label information across animal species boundaries. Future work on benchmarking our presented approach in such multi-dataset training settings seems a promising avenue to improve results for species-specific and species-agnostic detectors further.

10 Conclusion

In this paper, we introduced an end-to-end dynamic curriculum learning framework for semi-supervised detection in sparsely labelled datasets, unlocking information in the unlabelled data portions. We demonstrated that bipolarity in the behaviour of cyclical student-teacher training regimes can lead to either effective virtuous or collapsing vicious training loops. We discussed the importance of expanding model coverage of new data slowly and in a controlled way, so as to keep improving detector and label quality without collapse. To achieve this, we proposed five policies to guide the dynamics of training and promote steady, simultaneous improvements to the student detector, the teacher detector, and the quality of the pseudo-labels. We showed that the described approach is effective in significantly advancing the state-of-the-art in great ape detection performance when evaluated under various settings on the large Extended PanAfrican Dataset. Our method is also shown to be beneficial to sparsely labelled versions of other datasets without specialising hyper-parameterisations or policies; we have demonstrated this for the Bees and Snapshot Serengeti datasets in the animal domain. Finally, we showed that evaluation on general object detection tasks in MS-COCO and PASCAL-VOC achieves competitive or superior performance over existing state-of-the-art methods.

We conclude that dynamic curriculum learning controlled by training policies holds promise for effective application to sparsely labelled wildlife data, thereby helping to unlock the full wealth of information so far largely sealed in steadily growing unlabelled camera trap archives.