

Layout: typeset by the author using LaTeX. Cover illustration: Jeroen P. Jagt


Tracking multiple pigs in a pig pen through body part detection

Towards deep learning-based livestock monitoring

Jeroen P. Jagt 11834684

Bachelor thesis (18 EC)

Bachelor Kunstmatige Intelligentie

University of Amsterdam
Faculty of Science
Science Park 904
1098 XH Amsterdam

Supervisors:
Mr Devanshu Arya, Informatics Institute, Faculty of Science, University of Amsterdam, Science Park 904, 1098 XH Amsterdam
dr. Rajat M. Thomas, Department of Psychiatry, Amsterdam Medical Center, Meibergdreef 9, 1105 AZ Amsterdam


Abstract

An important task in large-scale pig farming is the monitoring of animal behaviour in order to detect disease and lameness. Automated pig monitoring requires highly accurate tracking of individual pigs. In this thesis, a method to individually track multiple pigs in a pig pen is proposed where pig instances are detected by a fully convolutional Hourglass network and subsequently associated based on position, forming tracks. The proposed method achieves 89.1% IDF1 and 32.2% MOTA on unseen recordings of the same pig pen it was trained on, and 61.4% IDF1 and 33.4% MOTA on recordings of a pig pen unseen in the training data.

Acknowledgements

First off, I want to thank my supervisors Devanshu and Rajat for their attentive supervision and enjoyable discussions. My thanks go to Putri, for providing invaluable feedback on my work. Thank you, also, Jesse, for your ever-rational technical tips and tricks, and Lennart, for your readiness to critique my writing. And I want to thank Marina for the support that she has continuously provided throughout this project.

This work would not have been possible without Serket Tech and Kristof Nagy, who provided the dataset chiefly used in this research. I also want to thank Psota et al. for making their pig dataset publicly available, no doubt advancing the field of livestock monitoring with it. Finally, additional thanks to Eric Psota for his swift communication regarding their research.


Contents

1 Introduction
2 Related research
3 Datasets
4 Method
  4.1 Detection stage
  4.2 Feature extraction stage
  4.3 Affinity computation stage
  4.4 Association stage
  4.5 Evaluation
5 Results
  5.1 Experimental results
  5.2 Detection stage
  5.3 Affinity stage
  5.4 Multi-pig tracking
6 Conclusion
  6.1 Discussion
  6.2 Future research
Appendices


1 Introduction

Research has shown that changes in animal behaviour correspond to changes in animal health (Taylor et al., 1986). Monitoring an animal's behaviour is therefore a way to assess its health and well-being. The health and well-being of livestock is valuable to livestock farms and facilities in both ethical and economic terms. An animal in poor health might be unsuitable for consumption and is instead culled, which means that the investment in that animal is lost. Furthermore, preventive treatment against disease among animals has led to excessive use of veterinary antibiotics, which pollute the soil and contribute to antibiotic resistance (Jechalke et al., 2014). For these reasons, monitoring animal behaviour is a valuable task in livestock farming.

Due to a growing world population and increasing welfare, the demand for livestock products is rising (Ritchie, 2017), and many livestock farms have massively increased in size in order to meet the growing demand. However, the laborious task of monitoring individual animals' behaviour becomes less feasible as a farm increases in size. For instance, only two seconds of daily observation per pig is recommended in modern swine facilities (Psota et al., 2019).

One solution to the infeasibility of this important task would be to automate it, which is cheaper and more practical than traditional human observation. As an additional benefit, automated monitoring techniques do not suffer from lapses in analytical performance resulting from fatigue or observer bias (Tuyttens et al., 2014). However, the performance of current automated monitoring techniques is inadequate for large-scale adoption at present. Especially challenging are the highly accurate and robust methods for tracking the location of individual animals over time that effective automated monitoring requires in order to provide individual behavioural analysis.

Generally, two technologies can be employed to track individual livestock. The first technology comes in the form of wearables, which are physical devices that animals wear. The other technology involves the processing of camera footage of animals in their environments. Approaches that use wearables – typically devices mounted on the animal that emit specific radio frequencies – directly provide spatial information for individual animals. However, wearable devices come at a high monetary cost and are intrusive for animals. Furthermore, the high degree of physical contact in livestock environments leaves wearables prone to damage, leading to higher costs and a higher level of intrusion.

On the other hand, camera-based approaches suffer none of these disadvantages: they are generally cheaper to implement, non-intrusive for the animals, and not specifically prone to failure in livestock environments. However, the cameras typically used record visible light, which is not continuously available in most environments. More importantly, tracking multiple livestock in a video remains a challenging task in the field of Computer Vision.

The task of individually tracking multiple objects in a single scene or video is called Multi-Object Tracking (MOT). In this research, the objects that will be tracked are pigs in a pig pen. While in recent years the adoption of deep learning techniques such as Convolutional Neural Networks (CNNs) has led to significant increases in MOT performance, tracking multiple pigs individually remains particularly challenging for a number of reasons. First of all, pigs have a similar appearance and are often huddled together, lying side-to-side or even on top of one another, which makes it difficult to distinguish one pig from another. Second, despite a fixed camera and pig pen, the image can become blurry or unclear, lighting conditions can change rapidly, and occlusions might occur due to, for instance, insects crossing the lens of the camera.

MOT algorithms can be divided into online (or real-time) and offline (or batch) methods, the latter of which are allowed to use information from both previous and future frames for detection and association in the current frame, whereas online methods only use information from previous frames (and, of course, from the current frame) (Ciaparrone et al., 2020). The responsive nature of livestock monitoring warrants a real-time approach, so this research focused exclusively on online methods.

In this thesis, a method for individually tracking pigs in a pig pen is proposed that uses two deep learning models in a typical MOT setup. The first of these models detects pig instances in a single frame, and is based on the approach proposed by Psota et al. (2019). The second model, a feedforward neural network, is used to associate and match detected instances across frames. Additionally, an algorithm is proposed which uses both of these models to track individual pigs throughout a video of a pig pen.

The rest of this thesis is structured as follows. Related research in the field of MOT is summarized in Section 2. The datasets used are briefly discussed in Section 3. Section 4 describes the implementational details of the proposed method. Section 5 displays the results that were achieved using the proposed method. Finally, I discuss these results in Section 6, as the conclusion of my thesis.

2 Related research

Classical MOT

The field of MOT was historically characterized by complex, hand-crafted algorithms that were specifically designed for a narrow set of tasks. Following the seminal introduction of Convolutional Neural Networks (CNNs) in Krizhevsky et al. (2012), MOT research has largely shifted towards deep learning (DL) techniques. For context, non-DL animal tracking techniques are briefly summarized here. A more detailed survey of non-DL techniques in livestock monitoring can be found in Nasirahmadi et al. (2017).

The primary step in MOT is to detect and localize instances of objects in a single image, or frame. Typically, this is achieved by performing background subtraction in order to isolate and segment foreground objects. For instance, Nasirahmadi et al. (2016) aimed to detect mounting events among pigs by separating foreground and background using the Otsu method (Otsu, 1979). Ellipses were fitted to the remaining structures in the foreground, representing the possible pig locations. A similar ellipse fitting method was applied by Kashiha et al. (2013), where detected pigs were identified by means of a unique pattern which had been painted on the back of each pig. While identification accuracy was reasonably high (87%), techniques where pigs are manually marked are generally neither feasible nor desirable in large-scale applications, where marking (and re-marking) individual livestock would require a large amount of labour.

Ahrendt et al. (2011) used a 5-dimensional Gaussian Mixture Model (GMM) to model the 2D coordinates and colour channels in order to predict pig trajectories in a loose-housed pen. While three pigs could be tracked for a substantial duration, it was stated that tracking failed when pigs were spinning rapidly, jumped, or mounted each other, suggesting that a more crowded pen would prove challenging.

Underlying all these methods is the fundamentally challenging task of background and foreground separation. 2D cameras do not capture depth information, meaning that this separation needs to be inferred from the contents of the frame. 3D cameras, such as Kinect sensors, on the other hand, do record depth information in addition to (visible) light. Several approaches have investigated the value of depth information from 3D cameras (Kulikov et al., 2014; Zhu et al., 2015; Stavrakakis et al., 2015; Kim et al., 2017; Matthews et al., 2017; Ju et al., 2018). However, 3D cameras are costlier than standard 2D cameras and have a smaller field of view, to the point where multiple 3D cameras are required to fully observe a single pig pen. Furthermore, most of the literature addresses 2D cameras rather than 3D cameras. In any case, since our dataset was recorded using a 2D camera, this branch of research was not explored further.

Deep learning in MOT

Following the introduction of CNNs in Krizhevsky et al. (2012), they are increasingly utilized in object instance detection. For a full review of deep learning in MOT, cf. Ciaparrone et al. (2020). Recently, Psota et al. (2019) applied a novel image-space representation of pig instances in combination with a single fully-convolutional neural network (FCN) to detect pigs in a pig pen. Locations of pig instances in a frame were represented by a 16-channel image with the same height and width as that frame, rather than a collection of coordinates. The first four channels of this representation indicated, per pixel, the existence probability of any left ear, right ear, shoulder, and tail belonging to any pig in the frame, respectively. In order to extract complete pig instances, the four different body parts encoded on these channels need to be associated with each other for each pig in the frame. These body-part associations were represented in the other twelve channels which, all together, encoded the real-valued offsets from one body part to the other (i.e., the 2D bidirectional vectors between the shoulder points and the left ear, right ear, and tail points). The FCN was then trained to output such a representation for any given frame, which could be parsed in order to retrieve the set of coordinates that denoted (the body parts of) each individual pig instance. This setup could detect pigs with 99% precision and 96% recall in environments previously seen by the network during training, and 91% precision and 67% recall in environments and lighting conditions unseen by the network.

In MOT, instance detection is just one part of the story. In order to actually track objects throughout a sequence of frames, detected instances from adjacent frames need to be associated with each other. Typically, detections are associated using positional or appearance features, or a combination of the two. Associating objects based on position is relatively straightforward, but its performance is limited in situations where two or more objects in transition are close to each other. More sophisticated positional association models mitigate this to some extent by utilizing a sequence of preceding positions to estimate the object's velocity (e.g. Kalman filters and SORT (Bewley et al., 2016)), but scenes with proximate objects remain challenging for association models using positional features exclusively. Association models based on appearance features, on the other hand, are less affected by the proximity of objects, but do not inherently respect physical constraints, such as the reasonable travel distance of an object between frames. One example of a model which uses both positional and appearance features is DeepSORT (Wojke et al., 2017).

An instance detection model and a method for associating instances can be combined into a model that can perform MOT. Wu et al. (2019a) implemented a pig tracking model using Mask R-CNN (He et al., 2017) to detect pig instances and a custom algorithm to associate detected pigs. This association algorithm aimed to consolidate the resulting tracks in a number of steps, using positional features in the form of bounding box Intersection over Union (IoU) and the Euclidean distance between bounding box centroids. One significant downside to their approach is the fact that this algorithm requires the number of pigs in the pen to be known in advance. Additionally, their dataset consisted of videos containing only two pigs, so it is questionable how well their results would scale up to environments containing at least three pigs.

Cowton et al. (2019) investigated the performance of a combination of existing state-of-the-art detection and association models on the task of pig tracking. In order to detect pig instances, a Faster R-CNN (Ren et al., 2015) was used. For the association model, both SORT and DeepSORT (Bewley et al., 2016; Wojke et al., 2017) were implemented and compared, of which DeepSORT yielded the higher performance. From the resulting tracks, some behavioural metrics were extracted. Pre-training the Faster R-CNN and DeepSORT models on generic object detection and re-identification datasets (the Pascal Visual Object Classes Challenge 2007 dataset (Everingham et al., 2007) and the Motion Analysis and Re-identification Set (Zheng et al., 2016), respectively) increased performance by a small amount. Overall, their approach achieved 92% MOTA and 73.4% IDF1.

Approaches like these are known as tracking-by-detection, defined by their characteristic of separately performing instance detection in every frame. Methods that perform tracking-by-detection are highly dependent on the performance of the detection model, as errors in detections can easily confuse the association model (Ciaparrone et al., 2020). In order to overcome this dependency, some MOT models have been proposed that update the positions of known instances from the previous frame, similar to the 5D-GMM model of Ahrendt et al. (2011). The advantages of such an approach include a smaller degree of reliance on the robustness of detections, as well as more consistent (i.e., less fragmented) tracks. However, detection still needs to occur at least once to acquire initial positions, and many of these tracker-based techniques do not include a detection method, instead requiring initial positions to be given.

One branch of tracker-based models that has performed well in benchmarks are the models based on Discriminative Correlation Filters (DCFs), the first of which was the seminal MOSSE filter proposed by Bolme et al. (2010). This technique has seen numerous improvements and refinements, including the introduction of DL techniques to its inner workings (Lukezic et al., 2017; Danelljan et al., 2017; Valmadre et al., 2017; Danelljan et al., 2019).

In essence, a DCF-based model aims to localise an object by learning a template which can discriminate the object from the rest of the scene, or at least from its surrounding area. This template is a multi-channel filter whose convolution with the image yields a maximum response at the location of the target object. The filter is learned from a set of target images, i.e. a set of frames cropped to only display the target object, collected from a number of preceding frames.

While DCF-based models have achieved top performance in single-object tracking benchmarks (Kristan et al., 2014, 2015, 2017), challenges become apparent when comparing the nature of single-object tracking to that of multi-pig tracking. Besides the additional requirement for the target-specific filter to discriminate its target from other targets, rather than from just the background, this difficulty is compounded by the fact that pigs are similar in shape, texture, and colour, and are typically huddled together. These circumstances increase the likelihood of the tracker drifting away from the target.

In an attempt to mitigate tracker drifting, Zhang et al. (2019) composed a model that identified track drifts using output from a detection model that ran in parallel to their tracking model. Their approach consisted of three components. Primarily, a DCF-based model was used to track a medium-sized area on the back of each pig in the scene. A Single Shot Detector (Liu et al., 2016) was used to detect pig instances. Detections in the initial frame were used to initialize the track areas, and detections in subsequent frames were used in an auxiliary way to correct the tracking efforts of the DCF-based model. The tracks and detections from these models were then employed by the third component, an algorithm that identified the occurrence of track drifting based on inconsistent track and detection boxes, in which case it corrected the location of the track area.

Figure 1: Sample images from the three partitions of the Serket dataset, showing the single pig pen recorded in serket:train and serket:val, and the second pig pen recorded in serket:test. One challenging aspect of this dataset is the tendency of the camera to over-expose, which caused some of the backsides of pigs to contain white segments.

Figure 2: Sample images from the three partitions of the Psota dataset, showing one of multiple environments from each of the three partitions (psota:train, psota:seen and psota:unseen), with psota:unseen representing environments and lighting conditions that were not seen in either of the other two partitions.

3 Datasets

Two datasets were employed in this research. The first dataset, which we will call the Psota dataset, was made available by Psota et al. (2019) and consists of 2000 annotated images sampled at random, with an average interval of more than two hours between samples, from top-down video recordings of 17 different pig pens. The second dataset is a proprietary dataset produced by Serket Tech, a company concerned with livestock health management, and consists of 40 annotated video recordings of 2 different pig pens, using 2D cameras that were angled diagonally downwards, with a mode length of 30 seconds per video. Every 20th frame was extracted from each video, yielding a total of 1756 frames.


Name           Frames   Sequence   Description
psota:train    1600     No         Set of frames used for training the Hourglass model.
psota:seen     200      No         Frames that depict similar environments and lighting conditions as the frames in the psota:train set.
psota:unseen   200      No         Frames that represent new environments and/or lighting conditions compared to those of the psota:train set, useful for measuring the generalizability of the model.
serket:train   1022     Yes        Set of frames (depicting a single pig pen) used to train the models used in the detection and association stages.
serket:val     249      Yes        Set of frames depicting the same pig pen as the training set, used for validation.
serket:test    485      Yes        Set of frames depicting a different pig pen from the one depicted in serket:train and serket:val, but with similar lighting conditions, used for measuring the model's generalizability to multiple pens.

Table 1: A list of the dataset partitions and their details. Sequence indicates whether the partition consists of frames that form one or multiple videos. Frames denotes the number of frames.

In the Serket dataset, the annotations consisted of the coordinates of the shoulder and tail points for each pig in a video. In the Psota dataset, the annotations consisted of the coordinates of the shoulder, tail, left ear, and right ear points for each pig in each frame, although only the shoulder and tail coordinates were used. The images of both datasets were scaled down to a width of 480 pixels in order to improve processing speed.

Both datasets were partitioned in order to create subsets to be used for training, validation, and evaluation (testing). Table 1 describes these partitions in more detail. As the Serket dataset consisted of recordings of only two different pig pens, the serket:test set contained all 11 videos that recorded one of the pig pens, while serket:train and serket:val contained all remaining videos, which recorded the other pig pen. Figure 1 and Figure 2 display a sample image for each partition of the Serket dataset and the Psota dataset, respectively.

4 Method

The proposed method is an online tracking-by-detection MOT algorithm whose purpose is to track multiple pigs in a pig pen. Tracking-by-detection MOT, i.e., MOT which is based on detecting and subsequently matching instances in each frame, is commonly split up into a sequence of four tasks, or stages (Ciaparrone et al., 2020):

• Detection stage: each input frame is processed by some object detection method to find instances of the target objects, in this case pigs.

• Feature extraction stage: one or more positional and/or appearance features are extracted from the detected instances.


• Affinity computation stage: based on the extracted instance features, the affinities between pairs of detections and/or tracklets are computed using some distance or similarity method.

• Association stage: finally, using the computed affinities, detected instances are associated with other detected instances to form track(let)s.

The proposed method also followed this pattern of separation into four stages. In practice, due to the online nature of the method, these stages are performed for each frame in a sequence, rather than for the sequence in its entirety, which is detailed in the description of the Association stage. In the remainder of this section, the approaches taken to handle each of these stages are described in detail, and the methods of evaluation are discussed.

4.1 Detection stage

Hourglass model

We incorporated the Hourglass model of Psota et al. (2019) as the basis for our detection model, for a number of reasons. First and foremost, their model is trained on data whose labels consist of shoulder and tail coordinates (s&t coordinates), rather than the bounding-box labels ubiquitous in MOT research. Since our dataset was also labelled with s&t coordinates and it is non-trivial to infer accurate bounding boxes from s&t coordinates without manual correction or additional labelling, an approach utilising s&t coordinates would be substantially easier to adopt than approaches that utilise bounding boxes. Second, the high level of precision achieved in both seen (99%) and unseen (91%) environments is indicative of high-quality detections, which are, as mentioned, particularly important for MOT.

In essence, the major contribution of Psota et al. (2019) is the introduction of an image-like 16-channel representation of the locations of multiple pigs in a pig pen, from which coordinate locations can be inferred and extracted, and which can be (approximately) constructed, for any frame of a pig pen, by an hourglass-type fully convolutional neural network which receives that frame as input. This 16-channel representation encodes pig instances in terms of their left ear, right ear, shoulder, and tail points.

Because left and right ears were not annotated in the Serket dataset, this representation was adapted to exclude those points, instead forming a six-channel image-space representation. This six-channel representation, visualized in Figure 3, consists of two channels that encode body part locations, and four channels that encode associations between body parts, which are necessary to form complete pig instances. The two location channels encode the probability that there exists a body part, belonging to any pig, at any point, or pixel, in the image. The first location channel encodes this for all shoulder points, whereas the second location channel encodes this for all tail points.

Just these location channels indicate the positions of all body parts of all pigs in a frame, but they do not provide enough information to reconstruct complete pig instances, which are represented by a pair of shoulder and tail points. The purpose of the four association channels is to link body parts together by encoding the 2D vectors that point from one body part to another (for the body parts that belong to the same pig). Each channel encodes the values of a single dimension of the 2D vectors from one body part to the other, totalling 2 × 2 = 4 channels for the two-way 2D vectors between the shoulder and tail points. As an example, the value v at some point in channel 3 encodes that the shoulder point at that location corresponds with a tail point that is offset −v pixels in the horizontal direction.

More formally, let any frame F contain N pig instances (I_1, ..., I_N), where instance I_i is represented by a 2D shoulder point s_i = (x_{s_i}, y_{s_i}) and a 2D tail point t_i = (x_{t_i}, y_{t_i}). Then, the encoding of each channel in the six-channel representation is presented in Table 2.

Figure 3: A visualization of the six-channel image-space representation, adapted from the 16-channel representation proposed by Psota et al. (a) displays the original labels consisting of shoulder (red) and tail (blue) points with purple lines indicating complete pig instances. (b) visualizes the Gaussian kernels encoded in the location channels (1, 2), denoting the existence probabilities of shoulders (red) and tails (blue). (c) and (d) visualize the 2D vectors encoded in the association channels (3-6) that denote the offsets of shoulder to tail (encoded by channels 3 & 4, displayed light purple) and tail to shoulder (encoded by channels 5 & 6, displayed green).

Channel   Encoding
1         ∝ P(s | I)
2         ∝ P(t | I)
3         ∝ (x_{s_i} − x_{t_i})
4         ∝ (y_{s_i} − y_{t_i})
5         ∝ (x_{t_i} − x_{s_i})
6         ∝ (y_{t_i} − y_{s_i})

Table 2: The encoding, or interpretation, of each of the six channels in the output of the Hourglass model. Channels 1 and 2 encode the probability of a shoulder and tail point, respectively. Channels 3 to 6 encode real-valued offsets – taken together, they represent 2D vectors between the two body parts that constitute a single pig instance.

This six-channel representation forms the output of the Hourglass model, which is trained using target representations constructed from the body part labels present in the dataset. The architecture of the Hourglass model is illustrated in Figure 4.

In the original implementation, polygon masks were utilized to denote the area of the pig pen, and all pixels outside of these masks were set to black, in order to mask out pigs from adjacent pens which were not labelled (Psota et al., 2019, p. 11). Such masks were not implemented in our approach, because they were not available for any of the datasets used. However, using masks would almost certainly have led to improved performance, both because no unlabelled pigs would be visible and because the number of detections outside of the pig pen would have decreased, as these areas would have been set to black.



Figure 4: The hourglass-shaped network used by the proposed method to convert images to 16-channel image-space instance detection maps. Image and caption taken from Psota et al. (2019, p. 9).

One additional change between the original implementation and ours was made to the equation used to determine the standard deviation σ_n of the Gaussian kernels that correspond to the locations of pig n. Originally, σ_n was calculated as follows (Psota et al., 2019, p. 12):

σ_n = 0.16 × (µ_{st} + δ_{st}^{(n)})    (1)

where µ_{st} denotes the average length of all pigs in the same frame as pig (instance) n, and δ_{st}^{(n)} denotes the length of pig n. However, it was determined through qualitative analysis that this equation resulted in kernels that extended over parts of the background, and so the method of computation was changed to:

σ_n = α × ((2 − β) µ_{st} + β δ_{st}^{(n)})    (2)

where α and β denote scalar parameters that control the size of the resulting kernel.
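To make this construction concrete, the sketch below shows one way to build such a six-channel target map from the shoulder and tail annotations of a single frame: Gaussian kernels with σ_n from Equation 2 are placed on the location channels, and the offsets of Table 2 are written into the association channels around each annotated point. The function name, the neighbourhood radius over which offsets are written, and other details are illustrative assumptions; the exact construction used in the thesis is specified separately and is not reproduced here.

import numpy as np

def build_target(shape, instances, alpha=0.1, beta=1.05, offset_radius=3):
    """Build a six-channel target map for one frame.

    shape:     (H, W) of the frame.
    instances: list of (shoulder, tail) pairs, each point given as (x, y).
    Channels:  0-1 location (shoulder, tail); 2-5 association offsets
               (shoulder-to-tail x, y and tail-to-shoulder x, y), cf. Table 2.
    """
    H, W = shape
    target = np.zeros((6, H, W), dtype=np.float32)

    lengths = [np.hypot(s[0] - t[0], s[1] - t[1]) for s, t in instances]
    mu = np.mean(lengths) if lengths else 0.0
    ys, xs = np.mgrid[0:H, 0:W]

    for (s, t), length in zip(instances, lengths):
        # Equation (2): kernel size depends on the mean pig length and this pig's length.
        sigma = alpha * ((2 - beta) * mu + beta * length)
        for ch, (px, py) in ((0, s), (1, t)):
            kernel = np.exp(-((xs - px) ** 2 + (ys - py) ** 2) / (2 * sigma ** 2))
            target[ch] = np.maximum(target[ch], kernel)

        # Association channels: write the offset to the paired body part in a small
        # neighbourhood around each annotated point (the radius is an assumption).
        for ch_x, ch_y, (px, py), (qx, qy) in ((2, 3, s, t), (4, 5, t, s)):
            region = (np.abs(xs - px) <= offset_radius) & (np.abs(ys - py) <= offset_radius)
            target[ch_x][region] = px - qx
            target[ch_y][region] = py - qy

    return target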

Temporal Hourglass

One single frame essentially contains all the information needed to infer the positions of objects present on that frame, called the target frame. However, objects are sometimes occluded in the target frame, or their detection might be hampered by conditions in the environment, such as lighting fluctuations. These conditions can vary rapidly over time: in other words, the detection of objects can be made difficult by transitory conditions. Considering this, the performance of the detection model might increase if frames that are adjacent to the target frame are provided as additional input, as small differences in the positions of pigs and in their environment might enable the model to detect pig instances more robustly. This temporal dimension was not explored by Psota et al. To investigate the impact of including adjacent frames in the input, the Hourglass model was adapted to have the option of being passed n adjacent frames both before and after the target frame, resulting in an input of 2n + 1 frames. Adaptations of the model with both n = 1 and n = 2 were trained and evaluated.

The dataset for this adapted model was adjusted so that each data point consisted of a sequence of 2n + 1 frames, of which the target output was the six-channel representation of the object locations on the centre target frame. Of each video, all frames were used as data points, except for the first and last n frames, for which either preceding or subsequent frames were not available. In the model itself, the number of input and output channels of each of the first three Conv+BN+ReLU layers was changed from c_1 = 3 : 16, c_2 = c_1^o : 32, c_3 = c_2^o : 64 to:

c_1 = 3n : (16 × max(1, n/3))    (3)
c_2 = c_1^o : (32 × max(1, n/6))    (4)
c_3 = c_2^o : (64 × max(1, n/9))    (5)

where c_k = i : o denotes that layer k maps i input channels to o output channels, and c_k^o is the number of output channels of layer k.
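To illustrate how such temporal samples can be assembled, the following sketch stacks each target frame with its n preceding and n following frames along the channel dimension and skips the first and last n frames of a video; the function name and data layout are assumptions, not taken from the thesis.

import numpy as np

def temporal_samples(frames, targets, n=1):
    """Yield (stacked_input, target) pairs for a temporal Hourglass variant.

    frames:  list of frames, each an array of shape (3, H, W).
    targets: per-frame six-channel target maps, shape (6, H, W).
    n:       number of adjacent frames taken on each side of the target frame,
             so every input stacks 2n + 1 frames (3 * (2n + 1) channels).
    """
    for t in range(n, len(frames) - n):
        window = frames[t - n:t + n + 1]              # 2n + 1 consecutive frames
        stacked = np.concatenate(window, axis=0)      # concatenate along the channel axis
        yield stacked, targets[t]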

Location-only and Association-only Hourglass

Due to the structure of the Hourglass model, its layers are concurrently optimized for both location and association channel outputs, which are rather different in target range and connotation. While the same loss function is applied to the entire output, this difference in connotation classifies the Hourglass as a multi-task learning model (Ruder, 2017). While multi-task learning has been shown to improve a model's performance in a multitude of DL topics, including MOT (Ruder, 2017; Son et al., 2017; Wu et al., 2019b), it is not at all guaranteed that performance will improve. The difference in connotation of locations versus associations implies a corresponding difference in the optimal kernels pertaining to the preceding convolutional layers.

In order to investigate the impact on performance of this multi-task setup, two additional variants of the Hourglass model were trained. The first variant was adapted to only detect and output body part locations by changing the number of output channels to 2, and using the Mean Squared Error (MSE) between those channels and the location channels of the target output as its loss function. The second variant was adapted to only detect body part associations; correspondingly, its output consisted of four channels, and its loss function was the MSE between that four-channel output and the association channels of the target output.

Custom loss

The loss function used to train the baseline model calculates the MSE over all pixels in the two location channels (1-2) of the output, while for the association channels (3-6), only those pixels in the output whose corresponding target pixels are actually defined, i.e. have a value that is not zero, contribute to the MSE loss, as specified in the original paper. This selective training prevented the model output from being specifically zero-valued in areas that were not used when extracting locations (Psota et al., 2019, p. 8).

The nature of the six-channel representation, however, entailed that the association channels contributed a significantly larger part to the loss than the location channels. The two location channels encode probability values; therefore, the range of the values contained in these channels was [0, 1]. The four association channels, on the other hand, represent real-valued offsets in pixels, with an average range of [−325, 690]. The consequence of this difference is that changes in the association channels are much more influential on the loss than changes in the location channels. This means that the model will 'focus' on correctly estimating the association channels, rather than the location channels (note that this does not mean that the location channels are completely ignored). However, the number of detections is defined by the location channels, suggesting that the quality of detections is heavily dependent on the location channels. Therefore, balancing the contribution of each channel to the loss might improve performance. In order to achieve this balance, the loss function was adapted to:

L(Y, Ŷ) = z(γ) ( (1/γ) ||Y_{1:2,·,·} − Ŷ_{1:2,·,·}||²₂ + ||Y_{3:6,·,·} − Ŷ_{3:6,·,·}||²₂ )    (6)

where Y_{q:r,·,·} corresponds to channels q to r of matrix Y, ||x||²₂ denotes the squared L2 norm of x, and

z(γ) = 6 / (2/γ + 4)    (7)

The additional factor z(γ) was added in an attempt to keep the range of the loss more or less the same, regardless of the value of γ, because, when optimizing with Stochastic Gradient Descent (SGD), scaling the loss effectively equals scaling the learning rate, which would make it difficult to compare between different values of γ.
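A minimal PyTorch sketch of this weighted loss, under the reconstruction of Equations 6 and 7 given above, could look as follows; as in the original selective loss, the association term is only evaluated at pixels where the target is defined (non-zero). Names and tensor layout are assumptions.

import torch

def weighted_hourglass_loss(pred, target, gamma=0.05):
    """Channel-weighted MSE between predicted and target six-channel maps.

    pred, target: tensors of shape (B, 6, H, W); channels 0-1 are locations,
    channels 2-5 are associations. The association term is only evaluated at
    pixels where the target is non-zero (selective loss); the weighting follows
    Equations (6) and (7) as reconstructed above.
    """
    loc_err = (pred[:, :2] - target[:, :2]) ** 2            # location channels, full image
    assoc_target = target[:, 2:]
    assoc_mask = (assoc_target != 0).float()                # only defined target pixels
    assoc_err = ((pred[:, 2:] - assoc_target) ** 2) * assoc_mask

    z = 6.0 / (2.0 / gamma + 4.0)                            # z(gamma), Eq. (7): keeps the summed channel weights at 6
    return z * ((1.0 / gamma) * loc_err.sum() + assoc_err.sum())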

4.2 Feature extraction stage

Just the detection model on its own does not have the capability to perform MOT, but is only part of the first stage. In order to achieve tracking capabilities, the per-frame detected instances that the Hourglass model outputs need to be associated with other detections in adjacent frames. This association is made based on some set of features that are extracted from the detected instances.

Typically, two types of features can be extracted from object instances: positional features, which are based on the location of the object on the frame, and appearance features, which are location-invariant features based on the visual appearance of the object. For our purposes, only positional features were used, because it was expected that, in the case of a robust detection method, they would suffice to uniquely identify a pig across two frames.

From a single pig instance, comprising two 2D coordinates denoting the shoulder and tail point of said pig, the most straightforward features are those coordinates. However, different pigs are often located very close to each other, meaning that one, or both, of their coordinates will be similar in value. Therefore, two additional features were selected that varied little for two instances of the same pig in adjacent frames, but that helped to distinguish them from instances of other pigs. These two features, which can be inferred from the set of 2D coordinates, are the length and angle of the main axis of a pig.

Figure 5: Distributions of the differences between a pair of pig instance features appearing in adjacent frames, for all occurrences in the training dataset. Two instances that belong to the same pig are denoted as a positive sample, while two instances that belong to different pigs are denoted as a negative sample.

More formally, the length l_i and angle θ_i belonging to pig instance I_i, which consists of shoulder and tail coordinates s_i and t_i respectively, are defined as:

l_i = ||s_i − t_i||    (8)
θ_i = atan2(x_{s_i} − x_{t_i}, y_{s_i} − y_{t_i})    (9)

These extracted features will be used to compare pairs of instances from adjacent frames. A pair of instances that belong to the same pig, called a positive sample, should have similar features, while a pair of instances that belong to different pigs, called a negative sample, should have dissimilar features. In order to easily distinguish positive from negative samples, the distributions of feature differences for positive and negative samples, which are displayed in Figure 5, should have as little overlap as possible.
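As a small illustration, the features of Equations 8 and 9 can be computed as follows; the argument order of atan2 mirrors Equation 9 as written, and the function name is illustrative.

import math

def instance_features(shoulder, tail):
    """Return (shoulder, tail, length, angle) features for one pig instance.

    shoulder, tail: (x, y) coordinates of the two body parts.
    """
    dx = shoulder[0] - tail[0]
    dy = shoulder[1] - tail[1]
    length = math.hypot(dx, dy)     # Eq. (8): Euclidean length of the main axis
    angle = math.atan2(dx, dy)      # Eq. (9): note the (dx, dy) argument order
    return shoulder, tail, length, angle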

4.3 Affinity computation stage

In order to associate pairs of detected instances between frames, we need to define some affinity measure that computes the distance between two detections. Since the features extracted from detections are expressed as values which are expected to vary little for instances of the same pig in different frames, perhaps the most intuitive distance measure would be to take the sum of the differences between those features. The difference between a pair of coordinate shoulder or tail features can be expressed by taking the L2 norm of the difference between the pair of coordinates. Despite the fact that length and angle were typically similar in value for two subsequent instances of the same pig, it was found that including their differences in the distance measure led to a decrease in performance. Thus, only the differences between the shoulder and tail points were used.


Figure 6: The network architecture of the Association model, which takes two instances I_i and I_j from subsequent frames and outputs the probability p that these instances belong to the same pig. σ denotes the Sigmoid function. w is a parameter defining the width of the network. All BatchNorm layers have eps = 1e−5 and momentum = 0.1, and all Linear layers add bias.

Specifically, the distance δ_real between two detections I_i and I_j was defined as

δ_real(I_i, I_j) = ||s_i − s_j|| + ||t_i − t_j||    (10)

There exist infinitely many possible distance measures that utilize these four features. In order to investigate to what extent the distance measure could be optimized by deep learning, a linear feedforward neural network was defined that mapped the features of a pair of pig instances to the probability that they were instances of the same pig in subsequent frames. The layers of this network, which we will refer to as the Association model, or Asc for short, are specified in Figure 6. The distance δ_net that uses this Association model is defined as:

δ_net(I_i, I_j) = −Asc([x_{s_i}, x_{t_i}, y_{s_i}, y_{t_i}, x_{s_j}, x_{t_j}, y_{s_j}, y_{t_j}, l_i, l_j, θ_i, θ_j]^T)    (11)

Since Asc returns the probability that a pair of instances belong to the same pig, its output can be negated to represent a distance between two instances, so that a lower distance between two instances indicates an increased likelihood of those instances belonging to the same pig.

In order to compare these two distance measures, a composite distance function was made, which uses a parameter e to interpolate between the two distance functions:

δ_composite(I_i, I_j, e) = ((1 − e) + e · δ_net(I_i, I_j)) × (e + (1 − e) · δ_real(I_i, I_j))    (12)
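The three distance measures can be sketched as follows; asc_model stands in for the Association model of Figure 6 and is assumed to map the twelve-dimensional feature vector to a probability, so its architecture is not reproduced here. Function names are illustrative.

import math

def main_axis(shoulder, tail):
    """Length and angle of a pig's main axis (Eqs. 8 and 9)."""
    dx, dy = shoulder[0] - tail[0], shoulder[1] - tail[1]
    return math.hypot(dx, dy), math.atan2(dx, dy)

def delta_real(inst_i, inst_j):
    """Eq. (10): sum of shoulder-to-shoulder and tail-to-tail distances."""
    (si, ti), (sj, tj) = inst_i, inst_j
    return math.dist(si, sj) + math.dist(ti, tj)

def delta_net(inst_i, inst_j, asc_model):
    """Eq. (11): negated output of the Association model (assumed to return a probability)."""
    (si, ti), (sj, tj) = inst_i, inst_j
    l_i, th_i = main_axis(si, ti)
    l_j, th_j = main_axis(sj, tj)
    features = [si[0], ti[0], si[1], ti[1], sj[0], tj[0], sj[1], tj[1], l_i, l_j, th_i, th_j]
    return -asc_model(features)

def delta_composite(inst_i, inst_j, e, asc_model):
    """Eq. (12): interpolates between delta_real (e = 0) and delta_net (e = 1)."""
    return ((1 - e) + e * delta_net(inst_i, inst_j, asc_model)) * \
           (e + (1 - e) * delta_real(inst_i, inst_j))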

4.4 Association stage

To achieve the tracking of multiple pigs throughout a sequence of frames, the detected instances in every frame, generated by the detection model, can be associated with each other based on the affinity measure described above. Let I^{(t)} be the frame at time t, and D^{(t)} = Hourglass(I^{(t)}) its corresponding set of detections. Furthermore, let ψ_j^{(t)} ∈ Ψ^{(t)} be the tracked instance that belongs to tracklet j at time t.

For every separate sequence of frames, the algorithm needs to be initialized, which is done using the detections of the first frame I^{(0)}. The detections of the first frame define the initial positions of the tracks, and the number of detections equals the number of tracks in the entire sequence: i.e., Ψ^{(0)} = D^{(0)}.

Then, for any frame I^{(t)}, the following steps are undertaken to find the updated tracks Ψ^{(t)}. First, the detected instances D^{(t)} are extracted from the frame using the detection model. Each instance d_i^{(t)} ∈ D^{(t)} is paired with each instance ψ_j^{(t−1)} ∈ Ψ^{(t−1)} and their affinity score is calculated, resulting in an affinity matrix A^{(t)} of size |D^{(t)}| × |Ψ^{(t−1)}|, where A_{ij}^{(t)} = δ(d_i^{(t)}, ψ_j^{(t−1)}). The set of detection-tracklet pairs (d_i^{(t)}, ψ_j^{(t−1)}) that yields a globally optimal affinity score is found by solving A^{(t)} using the Hungarian Algorithm (Kuhn, 1955). For each of these pairs (d_i^{(t)}, ψ_j^{(t−1)}), the next instance of tracklet j then becomes ψ_j^{(t)} = d_i^{(t)}. If, however, for a tracked instance ψ_j^{(t−1)} no detected instance d_i^{(t)} exists such that δ(d_i^{(t)}, ψ_j^{(t−1)}) < η(l_i^{(t)} + l_j^{(t−1)}), where η denotes an arbitrarily chosen hyper-parameter serving as a maximum distance factor, then the tracked instance j stays in the same position, meaning that ψ_j^{(t)} = ψ_j^{(t−1)}.
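A sketch of this per-frame association step is given below, using SciPy's linear_sum_assignment as the Hungarian solver; the gating rule mirrors the condition above, and tracklets without an accepted match keep their previous position. Function and variable names are assumptions.

import numpy as np
from scipy.optimize import linear_sum_assignment

def length_of(inst):
    """Length of an instance's main axis, used in the gating rule."""
    (sx, sy), (tx, ty) = inst
    return np.hypot(sx - tx, sy - ty)

def update_tracks(tracks, detections, delta, eta=1.1):
    """One association step: match detections of frame t to tracklets of frame t-1.

    tracks:     list of (shoulder, tail) instances at time t-1 (one per tracklet).
    detections: list of (shoulder, tail) instances detected at time t.
    delta:      affinity/distance function between two instances (e.g. delta_real).
    eta:        maximum-distance factor; a match is only accepted if the distance
                is below eta * (length of detection + length of tracklet instance).
    """
    if not detections:
        return list(tracks)                                  # all tracklets stay in place

    affinity = np.array([[delta(d, p) for p in tracks] for d in detections])
    det_idx, trk_idx = linear_sum_assignment(affinity)       # globally optimal assignment

    updated = list(tracks)                                   # default: keep previous position
    for i, j in zip(det_idx, trk_idx):
        gate = eta * (length_of(detections[i]) + length_of(tracks[j]))
        if affinity[i, j] < gate:
            updated[j] = detections[i]
    return updated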

4.5 Evaluation

The variants on the Hourglass model and the Association model can be evaluated using common metrics that are inferred from the matches of detections to ground truths, i.e. true and false positives and negatives; and from those, precision, recall, and F1-score.

The tracking algorithm described in Section 4.4 can be evaluated by a selection of metrics from the three sets of metrics that are commonly used in MOT (Ciaparrone et al., 2020). These three sets of metrics are increasingly complex, and are defined as follows. First, there exist the "classical metrics", originally proposed by Wu and Nevatia (2006). These metrics can be seen as aggregates of the different types of errors that can be made in MOT, and consist of:

• Mostly tracked (MT): the number of ground truth tracks that are correctly tracked in at least 80% of the frames.

• Mostly lost (ML): the number of ground truth tracks that are correctly tracked in less than 20% of the frames.

• Fragments (FM): the number of generated tracks that cover at most 80% of a ground truth track.

• Identity switches (IDs): the number of times that the generated identity of a ground truth track changes (meaning that the object continues to be tracked, but the tracker recognises it as a different object).

Second, there exist the CLEAR MOT metrics (Bernardin and Stiefelhagen, 2008), which are loosely defined as follows:

• Multi-object tracking accuracy (MOTA): the number of mistakes (false positives, false negatives, and identity switches) per ground truth detection; see the formula below.

• Multi-object tracking precision (MOTP): the degree of overlap between the bounding box of the detection and the bounding box of the corresponding ground truth. However, since bounding boxes are not used in our approach, this metric will not be used.
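For reference, MOTA is computed as defined by Bernardin and Stiefelhagen (2008):

MOTA = 1 − ( Σ_t (FN_t + FP_t + IDs_t) ) / ( Σ_t GT_t )

where FN_t, FP_t, IDs_t, and GT_t denote the numbers of false negatives, false positives, identity switches, and ground truth detections in frame t, respectively.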

While the CLEAR MOT metrics indicate the prevalence of events that are considered errors, they fall short in assessing the ability of the algorithm to track a single object for a long time (Ciaparrone et al., 2020). In order to "evaluate how well computed identities conform to true identities, while disregarding where or why mistakes occur", three identification-based (ID) metrics were proposed by Ristani et al. (2016). These three metrics are meant as a complement to the CLEAR MOT metrics. Rather than evaluating the algorithm's associations between detections and GT on a frame-by-frame basis, which is how the CLEAR MOT metrics are computed, the ID metrics are computed using associations that are evaluated globally, i.e. for the entire sequence of frames. The specifics of the computation of the ID metrics will not be discussed here (cf. Ristani et al., 2016), but their interpretations are as follows:

• Identification precision (IDP): the precision of the generated tracks, which are optimally matched to GT tracks – i.e., the percentage of generated tracklets that correctly tracked a GT object.

• Identification recall (IDR): the recall of the generated tracks, which are optimally matched to GT tracks – i.e., the percentage of GT object tracks that were correctly tracked by the model.

• Identification F1 (IDF1): the harmonic mean of IDP and IDR – i.e., a measure that indicates how well the model performs when considering both IDP and IDR.

For the performance of the Tracker module, the following metrics will be reported: IDF1, IDP, IDR, and MOTA, as well as the number of GT, MT, PT, and ML tracks, the number of IDs and FMs, and finally, the number of false positive (FP) and false negative (FN) detections. In order to provide a single set of metrics per dataset partition, which consists of multiple videos, the percentage-based metrics (IDF1, IDP, IDR, MOTA) are averaged, while the remaining metrics are summed.

One constituent of the computation of these MOT metrics is the non-trivial method which decides at what point a detected instance and a ground truth (GT) instance can be considered to match. In other words, when should a detected instance be considered a true positive, and when should it be considered a false positive? To match each GT instance g_i^{(t)} in frame t to a detected instance d_j^{(t)} in that frame, an affinity matrix B^{(t)} was defined similarly to the affinity matrix A^{(t)} described in Section 4.4, where B_{ij}^{(t)} = δ(g_i^{(t)}, d_j^{(t)}). Additionally, a pair of instances (g_i^{(t)}, d_j^{(t)}) could only be matched if δ(g_i^{(t)}, d_j^{(t)}) < 1.5 (l_i^{(t)} + l_j^{(t)}), where the value 1.5 was empirically deemed to be lenient enough to allow for some offsets in the detected locations, while sufficiently strict for the deduced metrics to form a realistic assessment of performance.

5 Results

5.1 Experimental results

All the models in the detection stage were trained on the serket:train set, except for the HGPsota model, which was trained on the psota:train set. These sets were augmented heavily in order to increase the variety of pig positions and stances available to the model, undergoing any combination of random left-right flips, random XY-shifts with a range of (−20 px, 20 px) in both directions, random 0–360 deg rotations, and random scaling with a factor in the range (0.5, 1.5).


Each variant of the Hourglass model was trained for a total of 120 epochs, using SGDM with a momentum of 0.9 and a batch size of 4 for both training and validation. The learning rate was set to 1e−4 for the first 60 epochs, and 3e−5 for the subsequent 60 epochs. In Equation 2, β was set to 1.05, while α was set to 0.12 for the first 60 epochs and 0.1 for the remaining epochs. Using a larger α in the first 60 epochs was found to be effective in preventing the network from outputting empty location channels.

The Association model was trained for 160 epochs, using SGDM with a momentum of 0.9, a batch size of 128 for both training and validation sets, and a learning rate of 1e−3. In order to maintain a balance between efficiency and performance, the width of the network, denoted by w in Figure 6, was set to 128. When assessing the model's performance, two pig instances were considered to belong to the same pig if the model's output probability (the value returned by Asc) exceeded 0.5.

Processing details

Like in the original implementation, the six-channel output of the Hourglass model was smoothed using a 5×5 average-pool layer appended to the end of the network during evaluation, in order to reduce the disruptive effects of noise on the regional max-response detection. For the regional max-response detection in the method that extracts coordinate locations from the location channels of the six-channel image representation, a radius of 15 pixels was used. The minimum required probability to denote a body part was controlled by thresholding the location channels. The value of this threshold is called the part threshold. This part threshold essentially defines a trade-off between recall and precision; a higher threshold leads to fewer detected instances that the model is increasingly confident about, typically resulting in lower recall but higher precision. The F1-score, being the harmonic mean of recall and precision, is an indication of how well the network is able to detect pigs, and is therefore the metric primarily focused on.

Sometimes, instances extracted from the model’s output were much longer than any pig in the dataset, spanning a large part of the image space. In order to eliminate these obvious false positives, any detected instance with a length larger than 150 pixels was discarded. However, this was only done after the instances were extracted from the detection model’s output, which meant that the shoulder and tail points belonging to the discarded instances could not be used for additional body part matching, despite the fact that they could have denoted a ground truth accurately.
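The post-processing described above can be sketched as follows: the location channels are average-pooled, thresholded at the part threshold, and reduced to regional maxima; each shoulder peak is then paired with the tail peak closest to the position suggested by the association channels, and implausibly long instances are discarded. The nearest-peak pairing rule is a simplification of the matching procedure of Psota et al. (2019); names and details are assumptions.

import numpy as np
from scipy.ndimage import maximum_filter, uniform_filter

def extract_instances(output, part_threshold=0.3, radius=15, max_length=150):
    """Extract (shoulder, tail) instances from a six-channel Hourglass output.

    output: array of shape (6, H, W); channels 0-1 hold shoulder/tail probabilities,
    channels 2-3 hold the shoulder-to-tail x/y offsets (cf. Table 2, 1-indexed 3-4).
    """
    smoothed = np.stack([uniform_filter(c, size=5) for c in output[:2]])   # 5x5 average pooling

    peaks = []
    for ch in range(2):
        prob = smoothed[ch]
        # Regional maxima above the part threshold, within the given radius.
        local_max = (prob == maximum_filter(prob, size=2 * radius + 1)) & (prob > part_threshold)
        ys, xs = np.nonzero(local_max)
        peaks.append(list(zip(xs, ys)))
    shoulders, tails = peaks

    instances = []
    for (sx, sy) in shoulders:
        if not tails:
            break
        # Predicted tail position from the shoulder-to-tail offset channels.
        pred_tail = (sx - output[2, sy, sx], sy - output[3, sy, sx])
        tx, ty = min(tails, key=lambda p: np.hypot(p[0] - pred_tail[0], p[1] - pred_tail[1]))
        length = np.hypot(sx - tx, sy - ty)
        if length <= max_length:                     # discard implausibly long instances
            instances.append(((sx, sy), (tx, ty)))
    return instances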

5.2 Detection stage

Hourglass

The performance of the HGPsota model, which was trained on the Psota et al. training dataset, and of the baseline model HGNormal are displayed in Figure 7. The number of instances detected by HGPsota drops off at a lower part threshold than for other models, but note that this does not impact its overall performance.

The HGPsota model, which was the Hourglass model trained on the psota:train set, performed best on the psota:seen set with a part threshold of 0.15, where it achieved 34% precision, 71% recall, and an F1-score of 41%. On the psota:unseen set, its best performance was 76% recall, 23% precision, and an F1-score of 31% (at a part threshold of 0.1; more details are available in Appendix A). For comparison, the original implementation that this Hourglass model was modeled after, and that was trained on the same dataset, attained 99% precision and 96% recall on the psota:seen set, and 91% precision and 67% recall on the psota:unseen set.

Model            serket:val    serket:test   psota:seen    psota:unseen
                 PT    F1↑     PT    F1↑     PT    F1↑     PT    F1↑
HGNormal         0.2   49.8    0.2   52.2    0.2   34.6    0.15  32.5
HGPsota          0.15  33.7    0.15  37.7    0.15  41.3    0.1   30.9
HG3Frames        0.2   49.9    0.2   50.4    -     -       -     -
HG5Frames        0.2   49.0    0.2   59.4    -     -       -     -
HGAlpha0.1       0.25  60.8    0.2   60.5    0.25  41.9    0.15  40.2
HGAlpha0.05      0.3   64.3    0.25  66.7    0.2   40.7    0.15  44.6
HGAlpha0.01      0.3   63.1    0.3   66.9    0.15  50.0    0.15  52.6
HGCompositeTrue  0.3   73.5    0.2   71.5    0.15  41.6    0.2   44.0
HGCompositeLoc   0.3   73.5    0.2   71.2    0.15  40.8    0.2   42.8
HGCompositeAsc   0.2   49.6    0.2   51.1    0.15  34.5    0.15  32.5

Table 3: The highest F1-score (F1) achieved by each Hourglass model evaluated, alongside the part threshold (PT) at which this highest score was achieved, for all four testing dataset partitions. The highest overall score per dataset is denoted in red. HG3Frames and HG5Frames could not be evaluated on the Psota et al. sets, as its samples are not temporally related. Complete results for all part thresholds are available in Appendix A.

Figure 7: F1-score of the HGPsota model (trained on the Psota et al. dataset) compared to the baseline model (HGNormal) for different part thresholds.

Temporal Hourglass

Figure 8 compares the performance of the single-frame HGNormal baseline model to the HG3Frames and HG5Frames models, which received 1 and 2 adjacent frames on each side of the target frame as input respectively, on both the validation set (serket:val) and the test set (serket:test). These three models perform nearly identically, with the HG5Frames model performing slightly better on the test set, as can be seen in Table 3 as well.


Figure 8: F1-score of the temporal models (HG3Frames, HG5Frames), which were passed multiple frames as input, compared to the baseline model (HGNormal), which was passed only a single frame as input, for different part thresholds.

Figure 9: F1-score of the custom loss models (HGAlpha0.1, HGAlpha0.05, HGAlpha0.01), trained using a custom loss function with α = 0.1, α = 0.05, α = 0.01 respectively, compared to the baseline model (HGNormal), trained using a loss function with α=1.0, for different part thresholds.

Location-only and Association-only Hourglass

The Location-only and Association-only models do not output a complete six-channel representation on their own, which is required in order to extract detections and subsequently measure the quality of these detections. Therefore, in order to evaluate both models, the baseline model HGNormal was compared against three composite models: one with locations from the Location-only model and associations from the baseline Hourglass, called HGCompositeLoc; another with locations from the baseline Hourglass and associations from the Association-only model, called HGCompositeAsc; and finally, locations from the Location-only and associations from the Association-only model put together, called HGCompositeTrue. The performance of these composite models is displayed in Figure 10. The performance of both the Location-only and Association-only model can then be assessed by taking the difference between the model that uses their output and the model that does not. Figure 11 displays the non-relative difference in F1-score between the HGNormal and HGCompositeLoc models, as well as the HGNormal and HGCompositeAsc models.

Figure 10: F1-score of the composite models (HGCompositeTrue, HGCompositeLoc, HGCompositeAsc, see text for details) compared to the baseline model (HGNormal) for different part thresholds.

Figure 11: Difference in F1-score between the HGNormal and HGCompositeLoc models as well as the HGNormal and HGCompositeAsc models, for different part thresholds. This is a non-relative difference, e.g. 90% − 10% = 80%. The green bars are all nearly zero and therefore barely visible.

Using the output of the Association-only model yields a negligible difference in performance when compared to using the association channels from the output of the baseline model. On the other hand, using the output of the Location-only model, rather than that of the baseline model, yields a significant improvement in performance for nearly all thresholds. Keep in mind that the difference in performance for higher part thresholds alone does not signify much, because this discrepancy is mainly due to the range of output values in the location channels, while that range does not matter for the extraction of locations. However, from Figure 10, as well as Table 3, it becomes clear that both models using output from the Location-only model, HGCompositeLoc and HGCompositeTrue, significantly outperform any other evaluated detection model.

Custom loss

A custom loss function was implemented in which the weight of the location channels in the loss (relative to the weight of the association channels) could be controlled with the loss function parameter α (the γ in Equation 6). In order to assess the impact of this parameter, three models (HGAlpha0.1, HGAlpha0.05, HGAlpha0.01) were trained using loss functions with α = 0.1, α = 0.05, and α = 0.01, respectively, and compared to the baseline model HGNormal, which was effectively trained on the custom loss function with α = 1.0. The performance of these models is displayed in Figure 9.

Figure 12: Distributions of δ_real, which is the sum of distances between the body parts of a pair of pig instances appearing in adjacent frames, for all occurrences in the training dataset. A pair of instances that belong to the same pig are denoted as a positive sample, while a pair of instances that belong to different pigs are denoted as a negative sample.

Metric      serket:train   serket:test
Recall      98.2           97.7
Precision   95.9           96.4
F1          96.9           96.9
Accuracy    99.6           99.5

Table 4: Performance of the trained Association model on the dataset partitions serket:train and serket:test, in percentages.

5.3 Affinity stage

In this section, the performance of both δreal and δnetto distinguish positive from negative samples is shown. Note that these results are preliminary, because they concern labelled data, rather than detected instances. The performance of these distance measures can be better assessed by their impact on tracking performance, which is shown in the next section, when the results of changing e are discussed.

The distributions of δreal for pairs of positive and negative samples are visualized in Figure 12. The less overlap there is between these distributions, the better the distance measure is able to distinguish between positive and negative samples.
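The sketch below illustrates the simple distance measure δreal: the summed Euclidean distance between corresponding body parts (shoulder and tail) of two pig instances from adjacent frames. The representation of an instance as a dictionary of (x, y) pixel coordinates is an assumption made for illustration.

```python
import numpy as np

def delta_real(instance_a, instance_b, parts=("shoulder", "tail")):
    # Sum the Euclidean distances between corresponding body parts.
    return sum(
        float(np.linalg.norm(np.asarray(instance_a[p]) - np.asarray(instance_b[p])))
        for p in parts
    )

# Example: instances of the same pig in adjacent frames are close together
# and therefore yield a small distance.
prev_frame = {"shoulder": (120, 200), "tail": (160, 240)}
curr_frame = {"shoulder": (123, 204), "tail": (158, 246)}
print(delta_real(prev_frame, curr_frame))  # approximately 11.3 pixels
```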

Table 4 displays how well the Association model performed on both the training and testing sets. Based on these results, it can be said that the model is able to learn the required function.


Figure 13: Tracking performance expressed in IDF1 for different values of e on both serket:val and serket:test. e controls the interpolation between the two distance measures δreal and δnet, where at e=0 the only function used is δreal, and at e=1 the only function used is δnet.

Figure 14: Tracking performance expressed in IDF1 for different values of η on both serket:val and serket:test. η is a factor for the maximum distance at which two detected instances can be associated with each other. If no detected instances in the current frame can be associated with a tracklet, that tracklet will stay in place for that frame.

5.4 Multi-pig tracking

The results in this section are based on the instances detected by the network with the highest overall precision and F1-score, which is the HGCompositeTrue model with a part threshold of 0.3.

Figure 13, which displays the IDF1-score of tracking for different values of e, clearly shows that both affinity measures δreal and δnet yield a similar performance, with δreal resulting in roughly 1% higher IDF1-scores all-round. Because of this, as well as the benefit of δreal being a computationally more efficient method, δreal was used as the affinity measure for further tracking results. Finally, the parameter η was set to 1.1 as this yielded the optimal score on the test dataset, as shown in Figure 14.
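The sketch below illustrates how the interpolated affinity and the η-gated association fit together. A linear interpolation between δreal and δnet is assumed, and the greedy nearest-detection matching is purely illustrative; it stands in for the full association algorithm of Section 4.4. The parameter max_dist is a hypothetical base distance that η scales.

```python
def affinity(a, b, e, delta_real, delta_net):
    # Linear interpolation between the two distance measures (assumed form).
    return (1.0 - e) * delta_real(a, b) + e * delta_net(a, b)

def gated_match(tracklet_tail, detections, e, eta, max_dist, delta_real, delta_net):
    """Return the detection closest to the last instance of a tracklet, or
    None when every detection lies beyond eta * max_dist, in which case the
    tracklet stays in place for this frame."""
    best, best_score = None, float("inf")
    for det in detections:
        score = affinity(tracklet_tail, det, e, delta_real, delta_net)
        if score < best_score:
            best, best_score = det, score
    if best is None or best_score > eta * max_dist:
        return None
    return best
```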


              IDF1↑   IDP↑    IDR↑    MOTA↑   GT    MT↑   PT↓   ML↓   FP↓    FN↓    IDs↓   FM↓
serket:val    89.1%   88.0%   73.8%   32.2%   84    36    23    25    680    2135   72     336
serket:test   61.4%   68.9%   55.4%   33.4%   154   69    42    43    1545   2840   18     60

Table 5: Evaluation of the final MOT setup on the validation and test sets. The up- and down-arrows indicate whether a higher or lower number is better, respectively.
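For reference, a summary of this form can be produced with the open-source py-motmetrics package; whether that package was used in this thesis is an assumption, and the positions below are toy values for illustration only.

```python
import numpy as np
import motmetrics as mm

acc = mm.MOTAccumulator(auto_id=True)

gt_ids = [1, 2]   # ground-truth track ids in this frame
hyp_ids = [1, 2]  # predicted track ids in this frame
dist = mm.distances.norm2squared_matrix(
    np.array([[100.0, 100.0], [300.0, 250.0]]),  # ground-truth positions
    np.array([[102.0, 98.0], [305.0, 255.0]]),   # predicted positions
    max_d2=200.0,                                # gating threshold
)
acc.update(gt_ids, hyp_ids, dist)  # repeat for every frame of a sequence

mh = mm.metrics.create()
summary = mh.compute(
    acc,
    metrics=["idf1", "idp", "idr", "mota", "num_switches", "num_fragmentations"],
    name="toy",
)
print(mm.io.render_summary(summary, formatters=mh.formatters,
                           namemap=mm.io.motchallenge_metric_names))
```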

6 Conclusion

In this thesis, a method was proposed to individually track multiple pigs in a pig pen, consisting of two deep learning models and an algorithm that composes both models into a single MOT setup. The Hourglass model used for the detection stage extended the pig instance detection model proposed by Psota et al. (2019), to which various augmentations were made and evaluated. Performance increased relative to the baseline model when using a custom loss function that increased the importance of the location channels, or when the location channels were predicted by a separate model. These results suggest that the quality of detections depends largely on the location channels, rather than the association channels, of the six-channel image-space pig instance representation.

The other deep learning model, which computed the affinity between two pig instances, was compared to a simple distance measure obtained by summing the distances between the features of two pig instances. These features consisted of the shoulder and tail locations of the pig instance. It was found that the difference in performance between the deep learning model and the simple distance measure was negligible.

Finally, the complete tracking setup was evaluated using the common MOT metrics as listed in Section 4.5. Using the best-performing detection model and the simple distance measure, this setup achieved 89.1% IDF1, 88.0% IDP, 73.8% IDR, and 32.2% MOTA on a validation set serket:val, which contained frames of the same pig pen as the training set, and 61.4% IDF1, 68.9% IDP, 55.4% IDR, and 33.4% MOTA on the test set serket:test, which consisted of frames recorded in a pig pen that the model had not seen before. The performance of the proposed method does not exceed that of state-of-the-art pig tracking methods such as those described in (Cowton et al., 2019; Zhang et al., 2019). Unfortunately, due to the lack of a common, widely used benchmark for pig tracking, a direct comparison between results from different methods remains difficult. However, qualitative analysis of the proposed method confirms that its performance remains lacklustre in many situations. The model was often able to successfully track pigs that were located away from other pigs and were not moving rapidly, but often switched identities of pigs that were positioned closely to each other, partly as a result of erratic detections.

6.1 Discussion

To what factors can this mediocre performance be attributed? First and foremost, it is probably a result of the poor detection results. As mentioned before, high-quality detections are paramount to any well-performing MOT algorithm implemented through tracking-by-detection. This is especially the case when tracking pigs, which typically move slowly and spend long periods lying down, meaning that the association of detected instances across frames is generally not very difficult. The Hourglass model that was used for detections was heavily based on the well-performing pig detection model proposed


by Psota et al. (2019), but a comparable level of performance was never reached; our HGPsota model, trained using many of the same deep learning hyper-parameters and using the same dataset for training and evaluation, achieved significantly lower precision and recall on both test sets than the original implementation. This suggests that our implementation was sub-optimal and can presumably be improved upon. An additional reason is the relatively small size of the available dataset, compounded by its low variance in environments: only a single pig pen was present in the serket:train set.

A second factor might be hinted at in the results of using the Association model, or the corresponding affinity measure δnet, in the tracking setup. Despite the model's high performance during evaluation, which can be seen in Table 4, tracking performance did not improve when using δnet instead of the simpler distance measure δreal, as evidenced by Figure 13. Of course, the model was trained and evaluated using only instances that were manually annotated, which are much more accurate and less noisy than the instances produced by the detection model and, as a result, easier to accurately classify as belonging (or not belonging) to the same pig. In other words, the data that the Association model was trained on was most likely not representative of the detections that formed its actual input in the tracking setup. A possible solution would be to devise a setup in which the Association model is trained during tracking, with its loss defined as some function of the tracking errors in which association plays a role (such as identity switches, fragmentations, and MOTA). Alternatively, the outputs of the detection model on the training set could be used to create a new dataset by labelling those outputs, either manually or automatically using the GT locations, on which the Association model could then be trained. This way, the Association model would be trained on data that is, in fact, representative of the inputs it receives during tracking.

Finally, the tracking setup proposed in this thesis entails a complex interaction between multiple models and hyper-parameters. While an attempt was made to optimize these parameters as much as possible, it remains plausible that small modifications to the setup could lead to substantial boosts in performance; this became apparent during qualitative evaluation of the resulting tracks, where some of the failures appeared relatively easy to identify.

One additional downside of the proposed method is its low processing speed. While the number of frames processed per second is difficult to assess, because it is highly dependent on the computational power of the machine that runs the model, it can be said with relative certainty that the proposed method is not able to run in real-time.

As a final remark, it is notable that the achieved ID metrics were relatively high compared to the achieved MOTA. This means that while the method made a large number of mistakes, it could still consistently produce accurate tracks. Furthermore, this method can most likely be substantially improved upon, considering that some of its constituents were approached naively, which motivates additional research on the proposed method.

6.2 Future research

One significant characteristic of the Hourglass model is its ability to associate body parts located on different parts of the image. This ability suggests that the model might be able to associate two instances of the same pig in adjacent frames. From that temporal association, affinities can be extracted, thus replacing the need for a separate affinity computation stage. A problem of the proposed Association model in this research was that the data it was trained on did not represent the detections it received as input during tracking.


Potentially, problems similar to this would be avoided by extending the Hourglass model to output temporal instance affinity.

Alternatively, to overcome this problem pertaining to the proposed Association model, some setup could be devised in which the Association model is trained using the detected instances directly, rather than just the labelled training data. This could be achieved, for instance, by automatically labelling the detected instances using the training data based on real distance, and then training the Association model on those labelled detections. Alternatively, the training phase of one or both models could be embedded in the full tracking setup, where a combination of MOT metrics would form the loss. The challenging aspect of such a training-during-tracking setup is that the loss would somehow need to be back-propagated through the entire association algorithm. Nevertheless, such a setup would ensure that both detection and affinity models are trained to function optimally in the overall MOT setup.

Finally, one improvement that can be made to the association algorithm is the method of deciding the number of tracks. Currently, this is set to the number of detections in the first frame and is not altered later. However, this will result in a high number of false positives if the number of detections in the first frame is higher than the number of pigs in the pen. Generally, the number of mistakes made by the algorithm would certainly drop if the number of tracks were able to vary over time.
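A minimal sketch of such variable track management is given below. The Track class, the matching input, and the threshold are hypothetical illustrations of the general idea, not part of the method proposed in this thesis.

```python
MAX_MISSES = 30  # illustrative: terminate a track after this many unmatched frames

class Track:
    def __init__(self, detection):
        self.detections = [detection]
        self.misses = 0

def update_tracks(tracks, matches, unmatched_detections):
    """matches: list of (track, detection) pairs from the association step;
    unmatched_detections: detections that were not assigned to any track."""
    matched_ids = set()
    for track, detection in matches:
        track.detections.append(detection)
        track.misses = 0
        matched_ids.add(id(track))
    for track in tracks:
        if id(track) not in matched_ids:
            track.misses += 1
    # Birth: start a candidate track for every unmatched detection.
    tracks = tracks + [Track(d) for d in unmatched_detections]
    # Death: drop tracks that have gone unmatched for too long.
    return [t for t in tracks if t.misses <= MAX_MISSES]
```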

References

David J Taylor et al. Pig diseases. 4th edition. Dr. DJ Taylor, 31 North Birbiston Road, 1986.

Sven Jechalke, Holger Heuer, Jan Siemens, Wulf Amelung, and Kornelia Smalla. Fate and effects of veterinary antibiotics in soil. Trends in Microbiology, 22(9):536–545, 2014.

Hannah Ritchie. Meat and dairy production. Our World in Data, 2017. https://ourworldindata.org/meat-production.

Eric T Psota, Mateusz Mittek, Lance C Pérez, Ty Schmidt, and Benny Mote. Multi-pig part detection and association with a fully-convolutional network. Sensors, 19(4):852, 2019.

FAM Tuyttens, Sophie de Graaf, Jasper LT Heerkens, Leonie Jacobs, Elena Nalon, Sanne Ott, Lisanne Stadig, Eva Van Laer, and Bart Ampe. Observer bias in animal behaviour research: can we believe what we score, if we score what we believe? Animal Behaviour, 90:273–280, 2014.

Gioele Ciaparrone, Francisco Luque Sánchez, Siham Tabik, Luigi Troiano, Roberto Tagliaferri, and Francisco Herrera. Deep learning in video multi-object tracking: A survey. Neurocomputing, 381:61–88, 2020.

Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.

Abozar Nasirahmadi, Sandra A Edwards, and Barbara Sturm. Implementation of machine vision for detecting behaviour of cattle and pigs. Livestock Science, 202:25–38, 2017.

(30)

Abozar Nasirahmadi, Oliver Hensel, Sandra A Edwards, and Barbara Sturm. Automatic detection of mounting behaviours among pigs using image analysis. Computers and Electronics in Agriculture, 124:295–302, 2016.

Nobuyuki Otsu. A threshold selection method from gray-level histograms. IEEE transac-tions on systems, man, and cybernetics, 9(1):62–66, 1979.

Mohammadamin Kashiha, Claudia Bahr, Sanne Ott, Christel PH Moons, Theo A Niewold, Frank O Ödberg, and Daniel Berckmans. Automatic identification of marked pigs in a pen using image pattern recognition. Computers and electronics in agriculture, 93:111–120, 2013.

Peter Ahrendt, Torben Gregersen, and Henrik Karstoft. Development of a real-time computer vision system for tracking loose-housed pigs. Computers and Electronics in Agriculture, 76(2):169–174, 2011.

Victor A Kulikov, Nikita V Khotskin, Sergey V Nikitin, Vasily S Lankin, Alexander V Kulikov, and Oleg V Trapezov. Application of 3-d imaging sensor for tracking minipigs in the open field test. Journal of neuroscience methods, 235:219–225, 2014.

Qiming Zhu, Jinchang Ren, David Barclay, Samuel McCormack, and Willie Thomson. Automatic animal detection from kinect sensed images for livestock monitoring and assessment. In 2015 IEEE International Conference on Computer and Information Technology; Ubiquitous Computing and Communications; Dependable, Autonomic and Secure Computing; Pervasive Intelligence and Computing, pages 1154–1157. IEEE, 2015.

Sophia Stavrakakis, Wei Li, Jonathan H Guy, Graham Morgan, Gary Ushaw, Garth R Johnson, and Sandra A Edwards. Validity of the microsoft kinect sensor for assessment of normal walking patterns in pigs. Computers and Electronics in Agriculture, 117:1–7, 2015.

Jinseong Kim, Yeonwoo Chung, Younchang Choi, Jaewon Sa, Heegon Kim, Yongwha Chung, Daihee Park, and Hakjae Kim. Depth-based detection of standing-pigs in moving noise environments. Sensors, 17(12):2757, 2017.

Stephen G Matthews, Amy L Miller, Thomas Plötz, and Ilias Kyriazakis. Automated tracking to measure behavioural changes in pigs for health and welfare monitoring. Scientific Reports, 7(1):1–12, 2017.

Miso Ju, Younchang Choi, Jihyun Seo, Jaewon Sa, Sungju Lee, Yongwha Chung, and Daihee Park. A kinect-based segmentation of touching-pigs for real-time monitoring. Sensors, 18(6):1746, 2018.

Alex Bewley, Zongyuan Ge, Lionel Ott, Fabio Ramos, and Ben Upcroft. Simple online and realtime tracking. In 2016 IEEE International Conference on Image Processing (ICIP), pages 3464–3468. IEEE, 2016.

Nicolai Wojke, Alex Bewley, and Dietrich Paulus. Simple online and realtime tracking with a deep association metric. In 2017 IEEE international conference on image processing (ICIP), pages 3645–3649. IEEE, 2017.

SiFeng Wu, XueBin Zhao, Hao Zhou, and Jun Lu. Multi object tracking based on detection with deep learning and hierarchical clustering. In 2019 IEEE 4th International Conference on Image, Vision and Computing (ICIVC), pages 367–370. IEEE, 2019a.
