Watching the Birds: A New Dataset for Understanding Animal Behavior

Author: Denis Avetisyan


Researchers have released a comprehensive dataset and benchmarking framework to track individual birds over extended periods, enabling deeper insights into their natural behaviors.

The CHIRP dataset addresses complex video understanding by simultaneously tackling the challenges of identifying <i>who</i> is present, determining <i>what</i> actions are being performed – leveraging both action recognition and [latex]2D[/latex] keypoint estimation – and providing detailed annotations like object segmentation and bounding boxes, culminating in application-specific benchmarks evaluating performance through biologically relevant metrics such as individual feeding rates and co-occurrence patterns.

The CHIRP dataset facilitates long-term, individual-level monitoring of wild bird populations using computer vision and provides application-specific benchmarks for evaluating algorithm performance.

Long-term, individual-level behavioral data are crucial for understanding animal ecology and evolution, yet obtaining such data remains a significant challenge for wild populations. To address this, we introduce CHIRP (‘CHIRP dataset: towards long-term, individual-level, behavioral monitoring of bird populations in the wild’), a novel resource curated from a long-term study of Siberian jays, supporting tasks including individual re-identification, action recognition, and object detection. This dataset is coupled with CORVID, a new pipeline for bird identification based on colored leg rings, and a biologically relevant benchmarking paradigm evaluating metrics like feeding rates and co-occurrence. Will this approach bridge the gap between computer vision research and the increasingly data-hungry field of behavioral ecology?


Decoding the Whispers of Jay Society

Accurately deciphering the intricacies of animal behavior is fundamentally reliant on the ability to consistently track and identify individuals, a challenge amplified within highly social species like the Siberian Jay. These birds exhibit complex relationships and cooperative breeding strategies, meaning understanding any single action requires knowing who is performing it and with whom. Traditional observational methods, while valuable, struggle to keep pace with the rapid, nuanced interactions within a group, often leading to incomplete or biased data. Consequently, researchers increasingly depend on automated tracking systems, leveraging computer vision and machine learning to monitor individuals across extended periods and within dynamic natural environments, unlocking the potential to reveal previously hidden patterns in social organization and communication.

The study of animal societies benefits immensely from comprehensive observational data, and the CHIRP dataset represents a significant advancement in this area by providing an unprecedentedly large and detailed video record of wild Siberian Jays. This collection isn’t simply a long-duration recording; it meticulously captures the nuanced interactions within a complex social group, offering researchers a wealth of information about jay behavior. The sheer scale of the dataset – encompassing numerous hours of footage – allows for the investigation of rarely observed events and subtle behavioral patterns, moving beyond anecdotal observations towards statistically robust conclusions about social dynamics, communication, and cooperative breeding strategies. It’s a resource poised to unlock a deeper understanding of how these intelligent birds navigate their social world and the evolutionary pressures shaping their behavior.

Extracting meaningful insights from the extensive CHIRP video dataset demands sophisticated automated analysis techniques. Researchers developed a computational pipeline integrating video re-identification, action recognition, and pose estimation to process this complex behavioral data. This system allows for the tracking of individual birds and the categorization of their actions, ultimately enabling large-scale studies of Siberian Jay social interactions. Critically, the pipeline achieves an action recognition accuracy of 0.72, demonstrating performance comparable to established 3D convolutional neural network (C3D) models and providing a robust foundation for detailed ethological investigations.
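The overall data flow can be illustrated with a minimal sketch: each tracked bird yields an identity and an action label, which downstream behavior metrics consume. The stage functions below are stand-in stubs for illustration only, not the paper's actual models (YOLOv8/BoTSORT for detection and tracking, CORVID for identity, a C3D-style network for actions), and the track fields are hypothetical.

```python
# Illustrative sketch of the per-track analysis flow; the stubs below
# stand in for the real learned components described in the paper.

def reidentify(track):
    # stub: in CORVID, identity comes from classifying leg-ring colors
    return track["ring_code"]

def recognize_action(track):
    # stub: a real model would classify the video clip (e.g. "feeding")
    return "feeding" if track.get("at_feeder") else "other"

def analyze(tracks):
    # produce (individual, action) records for downstream behavior metrics
    return [(reidentify(t), recognize_action(t)) for t in tracks]

# 'oaor' is the example ring code from the paper; 'rrao' is made up here
tracks = [{"ring_code": "oaor", "at_feeder": True},
          {"ring_code": "rrao", "at_feeder": False}]
# analyze(tracks) → [("oaor", "feeding"), ("rrao", "other")]
```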

The CORVID pipeline identifies birds in video by segmenting ring-shaped objects, analyzing their color distributions using a random forest model, and selecting the most probable bird based on a probability matrix derived from these color features.

Automating Observation: The CORVID System

CORVID is an automated pipeline designed for the individual identification of Siberian Jays within the CHIRP dataset, a long-term video record of a wild Siberian jay population. The framework utilizes the unique color ring combinations present on each bird as identifying markers. By automating this process, CORVID circumvents the need for manual annotation, which is both time-consuming and prone to human error. The system is specifically tailored to analyze image and video data collected from the CHIRP dataset, allowing for scalable and repeatable identification of individual birds over time, thereby facilitating long-term behavioral and ecological studies.

CORVID employs a two-stage process for locating and identifying birds within video footage. Initially, the YOLOv8 object detection model is utilized to identify all potential bird instances in each frame. Subsequently, the BoTSORT tracking algorithm is applied to these detections, linking the same individual bird across consecutive frames to create consistent tracks. This combination allows for robust detection and tracking even with instances of occlusion or rapid movement, providing the basis for subsequent color ring segmentation and identification.
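BoTSORT itself combines Kalman-filter motion prediction with appearance cues, but the frame-to-frame association idea it builds on can be sketched as greedy IoU matching between each track's last known box and the new detections. This is a deliberate simplification for illustration, not the BoTSORT algorithm.

```python
# Simplified track-to-detection association via greedy IoU matching;
# real trackers like BoTSORT add motion prediction and appearance cues.

def box_iou(a, b):
    # boxes as (x1, y1, x2, y2)
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def associate(tracks, detections, iou_threshold=0.3):
    # tracks: {track_id: last_box}; greedily claim the best unmatched detection
    assignments, used = {}, set()
    for tid, last_box in tracks.items():
        best, best_iou = None, iou_threshold
        for i, det in enumerate(detections):
            if i in used:
                continue
            score = box_iou(last_box, det)
            if score > best_iou:
                best, best_iou = i, score
        if best is not None:
            assignments[tid] = best
            used.add(best)
    return assignments
```

Detections with no sufficiently overlapping track would spawn new tracks, and unmatched tracks are kept alive for a few frames to bridge brief occlusions.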

The CORVID pipeline utilizes Mask2Former for precise color ring segmentation, achieving a mean Intersection over Union (IOU) of 0.84 on the CHIRP dataset. This segmentation provides the basis for individual bird identification; the generated masks are then input to a Random Forest Classifier, which predicts the unique ring identity. The Random Forest Classifier was trained to associate segmented ring features with known individual identities, enabling automated re-identification of Siberian Jays across video frames with high accuracy.
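As a hedged sketch of this classification step, the snippet below summarizes a ring's color distribution with simple per-channel statistics and trains a scikit-learn random forest on synthetic "orange" versus "red" pixel samples. The feature design and synthetic RGB data are illustrative only; the actual pipeline trains on features derived from real Mask2Former masks.

```python
# Toy version of ring-color classification: per-channel color statistics
# of masked ring pixels fed to a random forest. Data here is synthetic.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

def ring_features(pixels):
    # mean and std per RGB channel of the segmented ring pixels
    return np.concatenate([pixels.mean(axis=0), pixels.std(axis=0)])

# synthetic "orange" vs "red" ring pixel samples (50 pixels each)
orange = [rng.normal([230, 140, 30], 10, size=(50, 3)) for _ in range(20)]
red = [rng.normal([200, 30, 30], 10, size=(50, 3)) for _ in range(20)]
X = np.array([ring_features(p) for p in orange + red])
y = np.array(["orange"] * 20 + ["red"] * 20)

clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
probe = ring_features(rng.normal([230, 140, 30], 10, size=(50, 3)))
# clf.predict([probe]) → ["orange"] for this well-separated toy data
```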

Evaluation of the CORVID pipeline, alongside fine-tuned MegaDescriptor and random assignments, reveals accurate predictions of biological measurements, including individual feeding rates ([latex]pecks/min[/latex]) and co-occurrence rates, as assessed against ground truth data using a consistent object detection (YOLOv8), tracking (BoTSORT), and action recognition (C3D) framework.

Beyond Pixels: Biologically-Inspired Benchmarking

Standard object detection metrics, such as mean average precision (mAP) and intersection over union (IoU), are designed for scenarios with clear object boundaries and consistent viewpoints. However, biological datasets, particularly those involving animal social behavior, present unique challenges including occlusion, varying illumination, and non-rigid poses. These metrics often fail to adequately capture subtle but critical aspects of animal identification and tracking, leading to an inaccurate assessment of algorithm performance in ecologically relevant contexts. Consequently, a shift towards benchmarks incorporating biologically-inspired metrics is necessary to effectively evaluate and compare algorithms designed for analyzing complex biological data.

The Application-Specific Benchmark assesses re-identification performance by quantifying behaviors relevant to social interactions. Specifically, feeding rate measures the frequency with which individuals are observed engaging in feeding behavior within a defined timeframe, providing insight into foraging success and resource competition. Co-occurrence rate quantifies the proportion of time two or more individuals are present within a specified proximity of each other, reflecting social grouping and potential collaborative activities. These metrics, calculated across video sequences, provide a performance evaluation grounded in ecologically relevant behaviors, contrasting with traditional metrics focused solely on detection or tracking accuracy.
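Both metrics are straightforward to compute once identities are resolved. The sketch below uses deliberately simplified definitions: pecks per minute for a single individual, and shared-frame overlap for co-occurrence, whereas the paper's proximity-based definition is richer. The data structures are illustrative.

```python
# Simplified application-specific metrics computed from resolved tracks.

def feeding_rate(peck_timestamps, duration_min):
    # pecks per minute over the observation window
    return len(peck_timestamps) / duration_min

def co_occurrence_rate(frames_a, frames_b):
    # fraction of frames where both birds appear, out of frames where
    # either appears (a Jaccard-style simplification of co-occurrence)
    both = len(frames_a & frames_b)
    either = len(frames_a | frames_b)
    return both / either if either else 0.0

# feeding_rate([3.1, 10.2, 25.0, 41.7, 55.3, 58.0], duration_min=2) → 3.0
# co_occurrence_rate({1, 2, 3}, {2, 3, 4}) → 0.5
```

Because these quantities depend on identity assignments being stable over whole sequences, a re-identification error propagates directly into the biological estimate, which is exactly what the benchmark is designed to expose.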

Comparative analysis reveals that the CORVID system surpasses the performance of the MegaDescriptor foundation model when evaluated on biologically-relevant metrics, specifically feeding rate and co-occurrence rate. Quantitative results demonstrate CORVID’s superior ability to accurately track and associate individuals within a social context. Furthermore, utilizing the ViTPose large model for 2D keypoint estimation, CORVID achieves high Percentage Correct Keypoints (PCK) values, indicating robust and precise localization of anatomical landmarks. These combined results establish CORVID as a strong performer in tasks requiring both individual identification and behavioral analysis.
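PCK, in its common form, counts a predicted keypoint as correct when it falls within a threshold distance of the ground truth, with the threshold set as a fraction alpha of a reference scale. Conventions for the reference scale vary (e.g. PCKh normalizes by head size), so the sketch below is one plausible variant rather than the exact formulation used in the paper.

```python
# One common formulation of Percentage of Correct Keypoints (PCK).
import math

def pck(pred, gt, ref_scale, alpha=0.5):
    # pred, gt: lists of (x, y) keypoints; ref_scale: e.g. bbox diagonal
    correct = sum(
        1
        for (px, py), (gx, gy) in zip(pred, gt)
        if math.hypot(px - gx, py - gy) <= alpha * ref_scale
    )
    return correct / len(gt)

# pck([(0, 0), (10, 0)], [(0, 0), (0, 0)], ref_scale=10, alpha=0.5) → 0.5
```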

Revealing the Social Fabric: Implications for Behavioral Ecology

The intricacies of animal social life are increasingly being revealed through automated individual identification, and a recent advancement utilizes CORVID – an automated detection, tracking, and classification pipeline – to achieve remarkably accurate re-identification of Siberian Jays. This technology complements traditional marking approaches: the birds carry colored leg rings, but reading those rings by eye is impractical for continuous, long-term studies. By segmenting and classifying the ring colors in video, CORVID enables researchers to track individuals across extended periods and within complex social structures. Consequently, detailed maps of social networks emerge, allowing for quantitative analysis of interactions – who associates with whom, how often, and for how long – ultimately providing a deeper understanding of cooperative behaviors and the dynamics of this fascinating bird species.

Detailed quantification of social interactions becomes possible through precise individual tracking, revealing the nuances of cooperative behaviors in animal populations. Researchers can move beyond simple observation to statistically analyze the frequency with which individuals engage in specific actions – such as allopreening, food sharing, or joint defense – and the duration of those interactions. This data allows for the calculation of social network metrics, identifying key individuals and the strength of relationships within the group. Understanding these patterns is crucial for deciphering the evolutionary pressures driving cooperation, and how these behaviors contribute to the overall fitness and survival of the species. Ultimately, this approach provides a robust framework for investigating the complex social lives of animals and the ecological factors that shape them.

The analytical pipeline detailed in this research extends far beyond the specific study of Siberian Jays, offering a broadly applicable methodology for investigating animal social dynamics. Designed for flexibility, the system accommodates diverse image and video datasets, and is readily adaptable to a wide range of species, from primates and marine mammals to insects and birds. This ease of implementation promises to significantly accelerate research in behavioral ecology by automating a traditionally labor-intensive process. By providing a standardized and scalable approach to individual identification and interaction analysis, the pipeline facilitates comparative studies across species and habitats, ultimately enabling a deeper understanding of the evolutionary drivers of social behavior and the complex relationships within animal communities.

Siberian jays are uniquely identified by the color sequence of rings on their legs, listed from top left to bottom right, as illustrated by an example bird with an ‘oaor’ combination (orange, aluminium, orange, red) and detailed in the provided class distribution of ring masks.
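The letter-to-color mapping below is inferred only from the single example above (‘o’ for orange, ‘a’ for aluminium, ‘r’ for red); the full alphabet of ring colors used in the study is not given here, so treat this as a minimal sketch of how a ring code decodes into an identity.

```python
# Partial, inferred letter-to-color mapping; only the letters appearing
# in the paper's example code 'oaor' are known here.
RING_COLORS = {"o": "orange", "a": "aluminium", "r": "red"}

def decode_ring_code(code):
    # expand a compact ring code into its color sequence, top-left to
    # bottom-right
    return [RING_COLORS[c] for c in code]

# decode_ring_code("oaor") → ["orange", "aluminium", "orange", "red"]
```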

The CHIRP dataset, with its focus on long-term individual tracking, doesn’t merely catalog bird behavior; it attempts to divine patterns within the chaos of natural action. The pursuit of ‘accuracy’ in recognizing these behaviors is a fool’s errand, a grasping for certainty in a world defined by nuance. As David Marr observed, “Representation is the key to understanding.” The dataset isn’t about building perfect classifiers; it’s about crafting representations that allow one to glimpse the underlying structure, the whispers of intent hidden within each chirp and flutter. This application-specific benchmarking, seeking biological relevance over raw numerical scores, acknowledges the fundamental truth: data are shadows, and models are ways to measure the darkness.

What’s Next?

The CHIRP dataset represents a necessary, if temporary, victory over the chaos of long-term observation. It is, after all, merely a larger, more insistent truce between observation error and the stubborn refusal of birds to conform to tidy categories. The application-specific benchmarks are a welcome adjustment; one grows weary of algorithms optimized for statistical significance, rather than biological plausibility. Still, the question lingers: how much of what is ‘recognized’ is simply the system hallucinating patterns, and how much is genuine behavioral insight?

The true test won’t be pixel accuracy, but predictive power. Can these algorithms anticipate shifts in foraging strategy? Detect subtle precursors to disease? Anything less feels… decorative. The dataset, as comprehensive as it is, still captures a limited slice of the avian world. The next iteration will require a reckoning with the uncaptured variables – the humidity, the subtle shifts in wind, the unquantifiable effects of a particularly judgmental squirrel.

One anticipates, inevitably, a proliferation of models. Each claiming incremental gains in recognition, each quietly failing in the face of truly novel behavior. It is a comforting thought, in a way. Data isn’t truth; it’s a negotiation. And everything unnormalized is still, stubbornly, alive.


Original article: https://arxiv.org/pdf/2603.25524.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/
