
Layout: typeset by the author using LaTeX.


Group-level monitoring of group-housed pigs with a Fully-Convolutional Neural Network

Bart P. de Rooij (11883138)

Bachelor thesis
Credits: 18 EC

Bachelor Kunstmatige Intelligentie

University of Amsterdam
Faculty of Science
Science Park 904
1098 XH Amsterdam

Supervisors
Mr D. Arya
Informatics Institute
Faculty of Science
University of Amsterdam
Science Park 904
1098 XH Amsterdam

dr. R.M. Thomas
Department of Psychiatry
Amsterdam Medical Center
Meibergdreef 9
1105 AZ Amsterdam


Abstract

Monitoring the health of animals is an essential part of agricultural livestock farming. In this research, a non-invasive monitoring technique is explored to detect the behaviour of group-housed pigs. The technique is based on sensory data from stationary cameras capturing indoor pen environments containing several pigs. The behaviours chosen for monitoring are the feeding and drinking behaviour of a group of pigs.

Using a fully-convolutional neural network, individual pig instances are detected in 2D RGB images. The network consists of a downsampling section and a mirrored upsampling section, with max pooling and max unpooling layers respectively, connected via skip-connections. The network generates a 6-channel representation of an image, where the first two channels represent the locations of shoulder and tail points and the remaining four channels represent associations between these points. From these 6-channel representations, optimal combinations of shoulder and tail locations are made to detect individual pig instances. The number of pigs engaging in eating and drinking behaviour is determined by counting the occlusions of shoulder points with a predetermined mask of the feeding and drinking stations. A final recall of 0.828 and precision of 0.743 was achieved on the validation set of seen environments. This data set was separate from the group-level data set, on which the model did not perform well: the final results for counting group-level behaviour were comparable to random guessing. However, with some adjustments proposed in this research, the implementation could provide value for monitoring group-level behaviour of pigs housed in an indoor pen environment.


Contents

1 Introduction
2 Related Work
   2.1 Sensor approach
   2.2 Image processing approaches
   2.3 Deep learning approaches
   2.4 State of the art approach
3 Method
   3.1 Datasets
   3.2 Pig location representation
   3.3 Hourglass model
   3.4 Counting heuristics
4 Results
5 Conclusion
6 Discussion
   6.1 Hourglass model
   6.2 Camera angle and image quality
   6.3 Focus on group-level counting
7 Future Work
8 Appendix
   8.1 A: Software and Hardware
   8.2 B: Generated output of network


1 Introduction

Animal farming is a substantial part of the agricultural sector. Every year, millions of animals such as pigs are bred, raised and processed into meat products. A considerable number of pigs spend a large part of their lives indoors in a pen environment with other pigs. In this research, these pigs will be referred to as group-housed pigs. The farming of group-housed pigs is a part of the livestock sector that has seen major developments over the last century. Where historically farmers often had a small number of animals to take care of, they are now responsible for thousands of pigs at the same time. With a growing world population, the demand for meat products will also rise, and this demand will mostly consist of meat from livestock such as poultry and pigs [Wathes et al., 2012]. To keep up with the rising demand, the supply of pork products needs to be increased. This will most likely result in an expansion of pig farms and an increased pig population to house and monitor. Because group-housed pig farming is one of the most cost-effective and viable ways of producing meat, it is necessary that the industry prepares for this increase in demand.

One crucial aspect of this preparation is to ensure the health of the animals in the livestock farming sector. The well-being and health of the pigs is a major responsibility for farmers. With the goal of maximizing profits and minimizing costs, manually monitoring the health of each individual animal would be an infeasible task. Assuming a farmer takes care of a few hundred pigs, even a few seconds of monitoring per individual animal would add up to a significant amount of time each day. Precision livestock farming is the answer to this problem. Systems for automatic monitoring of group-housed pigs are needed to safeguard both livestock health and financial profit. Such systems assist farmers in improving productivity and reduce the average number of man-hours needed per animal over the span of its life. They also allow farmers to house more animals at the same time, thus meeting the future demands of a rising world population.

Automation of livestock health monitoring is not a novel idea. Several ways of automating this process have been proposed and implemented for commercial use over the last decades. Microphones, Radio Frequency Identification (RFID) ear tags, GPS trackers, accelerometers and 2D or 3D cameras have all been tested and deployed to monitor group-housed pigs. However, these methods each pose different problems, such as inaccuracy, implementation cost, difficulty of scaling up or invasiveness of the monitoring [Benjamin and Yik, 2019].

In this research, a model based on deep learning techniques is created and trained on images captured with stationary RGB cameras mounted in indoor pig pens. The use of a stationary RGB camera allows the implementation to be cost-effective, non-invasive, easy to install and scalable. An implementation based on inexpensive RGB cameras can work in virtually any pig pen environment and does not require extensive maintenance. The specific behaviours selected for detection are the eating and drinking of group-housed pigs in a pen. Research has shown that sickness behaviour of pigs often presents as lethargy and reduced drinking and feeding [Millman, 2007]. The combination of these behaviours can therefore give an impression of the health of an animal, and reductions in the feeding and drinking frequencies of a group of pigs can indicate the overall health of the pen's population. Real-time detection and tracking of normal and anomalous patterns in these behaviours reduces the cost of animal production and the losses from disease and mortality, while improving the job satisfaction of workers [Nasirahmadi et al., 2017].


The use of stationary cameras does pose some challenges. The main challenge is instance detection of individual pigs. Group-housed pigs have a tendency to group up and lie on top of each other. Studies show that pigs lie down approximately 85% of the time and show ‘social lying behaviour’, i.e. lying close to each other [Ekkel et al., 2003]. This behaviour can lead to occlusion of pigs, which complicates detecting instances of single pigs. Monitoring of pigs is needed at all times of the day, so another challenge to overcome is the varying lighting conditions at different times of the day.

The goal of this project is to investigate a solution to the problem of monitoring group-housed pigs. Developing a computer vision model that detects the behaviour of group-housed pigs in real time using deep learning, based on footage from stationary RGB cameras, is a complex task. The research goals are therefore:

1. Develop a model that identifies pigs in indoor pen environments under varied lighting conditions, pig sizes, camera angles and numbers of pigs.

2. Identify the number of pigs engaging in drinking and eating behaviour at any given time based on a single frame.

The biggest challenge is to develop a robust model that detects the location and orientation of individual pigs in the often crowded pen environments the pigs live in. The model must be robust enough to perform well in a plethora of situations with varying lighting levels, numbers of pigs, camera angles and times of day. The monitoring of pig health is a constant process, so pigs should be detectable in all of these situations. Once this model is developed and performs well, the locations and orientations can be used in combination with simple heuristics to detect the number of pigs engaging in a certain behaviour. The monitoring of pig health does not only include eating and drinking behaviour. Because of difficulties implementing a viable solution, tracking behaviours such as walking and lying was not achievable. Other, more complex behaviours, such as aggression, were not selected because they require the expertise of an experienced ethologist to develop a gold standard. It was decided that these kinds of behaviours were beyond the scope of this project.


2 Related Work

The monitoring of group-housed pigs is a fairly niche subject in the context of artificial intelligence (AI). The body of research has been expanding over the last couple of years, but it is not extensive. There is a considerable body of research outside of AI on monitoring the health of group-housed pigs. However, the main focus of this research is to find ways of incorporating deep learning techniques to aid precision livestock farming. It is important to highlight the different approaches in the history of monitoring pigs in group-housed environments to develop the best-fitting solution for the problem. Therefore, the following section highlights some solutions for monitoring group-housed pigs, with a primary focus on deep learning approaches.

2.1 Sensor approach

One example of a non-AI-driven solution for tracking feeding behaviour comes from a study using radio-frequency identification (RFID) technology by Brown-Brandl et al. [2013]. Over four grow-out periods, data was collected on 960 pigs. The research illustrated that feeding behaviour is an important way of tracking the development of pigs. It also showed that the setup of the system is a labour- and cost-intensive process: individual pigs needed to be tagged and retagged with RFID tags in their ears, and panels with antennas needed to be installed to detect the feeding behaviour. The use of cameras avoids this invasive technique and is more desirable because a camera needs to be installed just once.

2.2 Image processing approaches

One way of avoiding these invasive tracking techniques is to use cameras as sensors to collect data. This way, tagging of an animal can be avoided, which reduces stress for the pig. As far back as 1997, ways of detecting pigs with cameras were being developed. In the beginning, these methods often consisted of simple image processing approaches. Using a point distribution model, a single pig could be tracked over a sequence of images from a top-down view. Both the position and orientation of the single pig could be tracked by training a model on 49 landmark points [Tillett et al., 1997]. One drawback of this approach is the laborious process of labeling these 49 landmark points. Frost et al. [2000] later expanded on this research by using image analysis to create models that predict the positions of arbitrary points on the body of pigs using 3 × 3 grids, reducing the number of points needed for an animal to be detected. An interesting observation from this research is that rear positions were significantly harder to predict than head positions because the head was obscured by the feeding station. Thus, it may benefit detection accuracy to use the shoulders as a reference point instead of the head.

Recently, more advanced 3D cameras have become widely available, which has had consequences for the agricultural sector. Vázquez-Arellano et al. [2016] showed the significance of 3D imaging systems for agricultural applications. Data from devices such as the widely available Kinect camera provide a 3D image using an infrared light source, which can greatly improve tracking accuracy in a multitude of settings. In a study by Lao et al. [2016], an image processing system was developed that automatically recognized sow behaviors in a farrowing crate by dividing the sows into different body parts with a grid. Behaviours of lactating sows such as lying, standing, feeding and drinking were predicted with accuracies of 99.9%, 99.2%, 97.4% and 92.7% respectively. However, the data provided for this research consists only of 2D images from RGB cameras. The main advantage of 3D cameras is that the distance to animals can be observed, which is not as easy with a 2D image.

An approach using 2D top-down cameras to detect pigs was proposed by Kashiha et al. [2013]. Orientations and centroids were fitted using ellipse fitting algorithms to locate individual pigs with an average accuracy of 88.7%. It showed that with image pattern recognition, fairly accurate predictions of animals could be made. However, an accuracy of 88.7% leads to a relatively high rate of error, which will increase when the detections are used for counting purposes.

Another example of using ellipses to track individual pigs is a study by Nasirahmadi et al. [2016]. An algorithm that detects mounting behaviour in pigs had a sensitivity, specificity and accuracy of 94.5%, 88.6% and 92.7% respectively. This was achieved by using binary images from a top-view camera to draw ellipses based on data points such as the head, tail and sides of each pig.

This is similar to a study by Shao and Xin [2008], in which a computer vision system was created that could classify thermal comfort for group-housed pigs and motion in new-born pigs based on differences in pixels between two consecutive images. Global thresholding was used to create a binary image, after which morphological filtering and blob-filling operations were performed to prepare the image for classification. This shows that using blobs or ellipses to represent individual pigs is a viable way to track animals.

2.3 Deep learning approaches

Over the last couple of years, it has become clear that deep learning models also perform well for monitoring group-housed pigs in a pen environment. Just as with other image processing tasks, deep learning approaches have proven to be beneficial and are driving innovation in the pig farming sector. A model developed by Cowton et al. [2019] based on a faster region-based convolutional neural network (Faster R-CNN) can track individual pigs across video footage with 92% Multi-Object Tracking Accuracy and a 73.4% Identity F1-score. The eventual model performed well with a 0.90 mean average precision. This was achieved by splitting the Faster R-CNN into a feature extractor, the Region Proposal Network and the fully-connected layers. In a similar way, Yang et al. [2018] achieved a precision rate of 99.6% and recall rate of 86.93% using a Faster R-CNN to detect feeding behaviours of group-housed pigs. This demonstrates the strength of region-based convolutional neural networks compared to more rudimentary approaches that use only image processing techniques.

Techniques comparable to those discussed above were tested in prior research based on the same data set provided by Serket, a company providing solutions for Livestock Health Management. A similar ellipse fitting algorithm was used to detect instances of individual pigs. Group-level tracking of pigs was also attempted with a model that could detect several behaviours using a combination of computer vision techniques and the ResNet18 convolutional neural network architecture. In this earlier research, the images of a stationary camera were combined with a mask of the feeding and drinking stations and the boundaries of the pen to count the number of pigs engaging in drinking and feeding activities. A mean squared error of 1.106 was achieved with this method on the validation data set, but no concrete results were obtained for accuracy.


Another example of the strengths of fully-convolutional neural networks was demonstrated by Wu et al. [2019], who proposed a method for multi-object tracking using a Faster R-CNN in combination with bounding box regression and segmentation masks on Regions of Interest. Bounding boxes were arranged in chronological order and classes were generated by a customized shortest-distance clustering method. The bounding boxes with the smallest distances were merged, and objects were tracked through a time sequence using linear interpolation. The drawback is that the pen environments in that study were usually less crowded than those in the data set used here. Thus, it cannot be ensured that this computationally heavy implementation will work with a higher number of animals.

A simpler yet also successful way of tracking individual pigs was created by Ahrendt et al. [2011]. Images were converted into support maps pointing to preliminary pig segments. From the support map segments, a 5D Gaussian model representing position and shape was used to track individual pigs. This created a robust model that could almost perfectly track three different pigs for at least eight minutes in a realistic experiment. It showed that a high-dimensional representation could be used to track individual pigs. The research was based on loose-housed pigs, which also means a lower number of individual pigs to be tracked.

2.4 State of the art approach

The remainder of this related work focuses on the approach proposed by Psota et al. [2019]. This research aims to use that implementation for group-level tracking of behaviour instead of its original aim of tracking individual pigs. From the studies discussed above, it becomes clear that fully-convolutional neural networks could provide a powerful solution for detecting individual instances of pigs. The implementation by Psota et al. [2019] combines many of the components mentioned in those studies. The proposed method uses a single fully-convolutional neural network to detect the location and orientation of each individual pig. An image space representing the locations of four body parts (tail, shoulder, and left and right ear) and their pairwise associations is created from images recorded with a top-down 2D RGB camera. This image space representation proves to be a powerful tool for locating single instances of individual pigs in crowded pen environments. The image space consists of rows and columns with 16 channels per point. The first 4 channels represent the locations of shoulder, tail, right ear and left ear points of a pig; an example can be seen in figure 1. To allow an approximation of each body part, a 2D Gaussian distribution was used to represent the annotated body parts.

The remaining 12 channels were used to represent the associations between the body parts in both directions. Channels 5-8 encode the left-ear-shoulder association, 9-12 the right-ear-shoulder association and 13-16 the shoulder-tail association. These additional channels encode a real-valued offset from one point to another in the x and y dimensions per association relation. Using the Hungarian algorithm, the most likely locations and associations were used to determine individual pigs in the cluttered environment of a pen. In addition to the method, a publicly available data set for group-housed pigs consisting of 2000 images from 17 different pen environments was also published. A fully-convolutional neural network was used to predict these high-dimensional representations from simple RGB images as input. An hourglass network architecture with symmetry between the downsampling and upsampling stages was used for this. By using several max pooling layers with corresponding unpooling layers in combination with skip-connections, a robust model for generating these 16-channel representations was created.


Figure 1: Example from the Psota et al. [2019] data set of an input image (a), the target mapping with Gaussian kernels for different body parts (b) and the combination of the two (c)

A precision of 99% and a recall of over 96% was achieved on images of previously seen pig pen environments. On images of previously unseen environments, a precision of 91% and recall of 67% was achieved.

In this paper, an implementation based on Psota et al. [2019] is developed in combination with a heuristic approach for counting the different behaviours. This decision was made because of how robust the hourglass model proved to be, combined with the fact that the publicly available data set closely resembles the data set provided by Serket. Both data sets are from crowded pig pen environments with numerous pigs walking around. The annotations for the Serket group-level data do not include coordinates of single pigs, so training the proposed network on it would not be feasible. However, Serket does have other data sets that include annotated shoulder and tail points for pigs. It was therefore concluded that developing this model could provide useful insights for future work on those data sets. This thesis focuses primarily on group-level monitoring, but in a closely related parallel research paper, this model is tested on tracking individual animals. The labeling of the Serket data in these papers consists of only shoulder and tail points for individual animals. The 16-channel representation is therefore simplified to a 6-channel representation to accommodate the fewer annotation points. In the following section, this simplified solution is explained in more detail and the implementation is discussed.


3 Method

As discussed before, this section focuses on the implementation of the simplified 6-channel representation and the hourglass model proposed by Psota et al. [2019]. The Serket data set with annotations for shoulder and tail points was not used in this paper but was used in a related research project carried out in parallel with this one. The following sections describe the data sets of both Psota et al. and Serket, explain the simplified 6-channel representation, and describe the hourglass model architecture and the counting heuristics. All training and experimentation was done via remote access to a workstation in the RoboLab of the University of Amsterdam. The full list of specifications of this workstation and the Python libraries used can be found in the appendix.

3.1 Datasets

In this research, two data sets are used. The first, from Psota et al. [2019], is publicly available and is used to train the fully-convolutional neural network; it will be referred to as the Multi-Pig Part Detection and Association (MPPDA) data set. The second is provided by Serket and is used to make predictions for group-level monitoring of group-housed pigs.

Multi-Pig Part Detection and Association data set

The MPPDA data set, which was published and made publicly available by Psota et al. [2019], contains 2000 images of 17 different pig pen environments with 24,842 uniquely labeled pig instances. Each image was recorded with an RGB camera and has a resolution of 1080 × 1920 pixels. Most pens were recorded from a perpendicular top-down view, but different camera angles were also included. Pigs range in age from 1.5 to 5.5 months and are all annotated with the x and y positions of tail, shoulder, and right and left ear points. The MPPDA data set also includes a mask for each pen environment which blacks out adjacent pens so as not to confuse the network. For this research, the x and y position annotations for both shoulder and tail locations of individual pigs were used. The mask of each pen was applied to its corresponding image. As can be seen in figure 2, some pigs are still visible through the bars of the pen environments. These sections cannot be masked over because that could affect the actively tracked pigs inside the pen.

The data set is split into a training set (1800 images), a validation set of seen environments (200 images) and a test set of unseen environments (200 images) to train and evaluate the model. These splits cover a wide array of pen sizes, lighting conditions and numbers of pigs. It is a diverse data set that reflects the diversity of pig farms, comparable to the Serket data set.


Figure 2: Examples of 4 different pen environments from the MPPDA data set

Serket data set

The Serket data set consists of videos from two different pig pens. Pens "137" and "147" were recorded on five and three different dates respectively, spanning a period of two months. The pigs of pen "137" were piglets on the first recording dates and grew up over the recording period, which makes the data set diverse in pig size. For pen "137", each recording date includes approximately two videos for every hour between 6am and 6pm. Pen "147" has recordings twice an hour for each hour of the day on each recording date. Every video has an additional csv file with the number of pigs eating and drinking for each frame. The recordings were made with a 25 fps camera at a resolution of 1280 × 960 pixels, and every 20th consecutive frame was collected with its corresponding labels. Two examples per pen are displayed below in figure 3. As can be seen, the lighting conditions were not always optimal, partly because of flies and dust in the pen. For pen "137" the lighting conditions were fairly good because recordings were only made during operational hours. This was not the case for pen "147", where images at night were taken using the infrared night vision mode. In both pens the camera lenses had smudges from grime, which lowered the image quality in many recordings. In addition to the recordings, a mask of the outer perimeter and the feeding and drinking stations was included. These masks are put on top of the image to black out adjacent pens and to count occlusions with pigs.
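As an illustration of this sampling step, the sketch below pairs every 20th frame of a recording with its per-frame counts. It is a minimal sketch, not the thesis code: the paths, the CSV layout (one row per video frame) and the helper name `sample_frames` are hypothetical.

```python
import imageio          # imageio is among the project's listed libraries
import pandas as pd

def sample_frames(video_path, csv_path, step=20):
    """Yield (frame, labels) for every `step`-th frame of a recording.

    Assumes the csv has one row per frame with the number of pigs
    eating and drinking; column names are hypothetical placeholders.
    """
    labels = pd.read_csv(csv_path)
    reader = imageio.get_reader(video_path)
    for idx, frame in enumerate(reader):
        if idx % step == 0 and idx < len(labels):
            yield frame, labels.iloc[idx]

# usage sketch:
# for frame, row in sample_frames("pen147.mp4", "pen147.csv"):
#     process(frame, row["eating"], row["drinking"])
```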

Figure 4 shows histograms of the annotated number of pigs eating and drinking per pen. The distributions of the number of pigs drinking are almost identical, albeit slightly larger for pen "137" because of the extra recording dates. The range of the number of pigs eating is slightly larger for pen "147", but the overall distribution of data points is still similar.


(a) Pen "137" young pigs (b) Pen "137" grown pigs

(c) Pen "147" normal lighting (d) Pen "147" challenging lighting

Figure 3: Examples from Serket data set.

(a) (b)


3.2 Pig location representation

To detect instances of single pigs in a pen environment, single 3-channel RGB images need to be converted into a higher-dimensional 6-channel representation of locations and associations for tail and shoulder points. This proposed method is an adjusted version of the 16-channel representation proposed by Psota et al. [2019]. Reducing the dimensions from 16 to 6 channels reduces the number of parameters the training has to fit: less computational power is needed to convert the images and train the network, and memory use is reduced. In addition, the labeling of new data sets takes less time, and this representation is more compatible with other Serket data sets, which makes it more usable for Serket.

Image to 6-channel representation

For an image with dimensions $H \times W$, a 6-channel mapping with the same dimensions $H \times W$ is created. In each frame of a pen there are $N$ animals visible, where each animal has a number $n \in \{1, 2, \ldots, N\}$. Each instance of an animal $n$ has shoulder and tail coordinates $s_n = (x_{s_n}, y_{s_n})$ and $t_n = (x_{t_n}, y_{t_n})$ respectively, where $x < W$ and $y < H$.

Because the pigs in the images were labeled by human observers, the labels are approximated using a 2D Gaussian kernel. One pixel in the $H \times W$ image is annotated as a shoulder or tail point and used as the center of the Gaussian distribution, with a standard deviation denoted by $\sigma_n$ for pig $n$. This $\sigma_n$ is calculated for each individual pig based on the distance between its tail and shoulder points $pl_n$ and the average pig length $\frac{1}{N}\sum_{i=1}^{N} pl_i$:

$$\sigma_n = \alpha \times \left( \beta \times pl_n + (2 - \beta) \times \frac{1}{N} \sum_{i=1}^{N} pl_i \right)$$

The factor $\alpha$ reduces $\sigma_n$ and the corresponding range of the Gaussian kernel to a desirable scale. Because of the slanted camera, a factor $\beta$ is also introduced to weigh the individual pig length against the average pig length: pigs further away from the camera will optically have different sizes than animals closer to the camera, even though they might be the same size.

The kernel is then normalised to values between 0 and 1, with the center of the Gaussian being 1, so that locations can be identified with a simple threshold operation. This makes the first and second channels of the 6-channel representation probabilities for shoulder and tail positions respectively.

The remaining 4 channels represent the associations between shoulder and tail points in the following order: $(s \to t)_x$, $(s \to t)_y$, $(t \to s)_x$ and $(t \to s)_y$. These associations are needed to locate single instances of pigs; a naive distance-matching technique would mismatch shoulder and tail points in the crowded pen environment. Each association channel represents a real-valued offset between a shoulder and tail (or tail and shoulder) in the x or y direction. The values in channels 3-6 represent $(x_{s_n} - x_{t_n})$, $(y_{s_n} - y_{t_n})$, $(x_{t_n} - x_{s_n})$ and $(y_{t_n} - y_{s_n})$ respectively. Just like the locations, these relations are denoted by circular regions sized according to the $\sigma_n$ of each pig. The Gaussian distributions of each channel are then normalised and summed, and values below a kernel threshold (0.05) are set to zero to avoid noise. Figure 5 shows an example of a 6-channel representation generated from shoulder and tail coordinates, with the Gaussian kernel $\sigma_n$ calculated with $\alpha = 0.08$ and $\beta = 1.05$.
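To make this construction concrete, the sketch below builds the two location channels and the four association channels from annotated shoulder and tail points. It is a minimal sketch under stated assumptions, not the thesis code: the offsets are written into circular regions assumed to be centred on the annotated points, overlapping Gaussians are combined with a maximum instead of the normalise-and-sum described above, and the function name and arguments are illustrative.

```python
import numpy as np

def make_target(shoulders, tails, H, W, alpha=0.08, beta=1.05,
                kernel_thresh=0.05):
    """Build a (6, H, W) target from (N, 2) arrays of (x, y) points.

    Channels: 0 shoulder prob., 1 tail prob., 2-5 association offsets
    (s->t)_x, (s->t)_y, (t->s)_x, (t->s)_y.
    """
    target = np.zeros((6, H, W), dtype=np.float32)
    lengths = np.linalg.norm(shoulders - tails, axis=1)        # pl_n
    mean_len = lengths.mean()                                  # average length
    ys, xs = np.mgrid[0:H, 0:W]                                # pixel grid

    for (sx, sy), (tx, ty), pl in zip(shoulders, tails, lengths):
        sigma = alpha * (beta * pl + (2 - beta) * mean_len)    # per-pig sigma
        for ch, (cx, cy) in ((0, (sx, sy)), (1, (tx, ty))):
            g = np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))
            g[g < kernel_thresh] = 0.0                         # drop noise
            target[ch] = np.maximum(target[ch], g)             # peaks stay 1
        # circular regions (assumed centred on each part) carry the offsets
        s_reg = (xs - sx) ** 2 + (ys - sy) ** 2 <= sigma ** 2
        t_reg = (xs - tx) ** 2 + (ys - ty) ** 2 <= sigma ** 2
        target[2][s_reg] = sx - tx                             # (s->t)_x
        target[3][s_reg] = sy - ty                             # (s->t)_y
        target[4][t_reg] = tx - sx                             # (t->s)_x
        target[5][t_reg] = ty - sy                             # (t->s)_y
    return target
```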


Figure 5: Example of an original RGB image (5a), with generated locations/associations (5b) from the deconstructed 6-channel representation (5c)


6-channel representation to pig instances

These 6-channel representations are the target output for the neural network and can be generated for each image in the MPPDA data set. Because these representations are produced as output by the neural network, a way to decode the shoulder and tail locations from them is also needed.

The first two channels are probabilities of locations for shoulder and tail points. In the target output these points are the peaks of the Gaussian distributions and can be determined using regional max response detection. This is done for the shoulder and tail channels separately by setting all probabilities under the part threshold to zero and sorting the pixel indices of the image by value. These indices are then iterated over, and only the maximum within a certain radius in both the x and y dimensions is marked as a location; this way the maximum of each Gaussian kernel is found. Once the probabilities of the remaining indices reach 0, no more locations can be found and the locations are returned. A minimal sketch of this procedure is given below.
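The sketch thresholds a channel, visits pixels from high to low probability, and suppresses everything around each accepted peak. The part threshold value follows the thesis; the radius value and the square suppression window are assumptions.

```python
import numpy as np

def extract_peaks(channel, part_threshold=0.2, radius=9):
    """Greedy regional max detection on one (H, W) location channel."""
    probs = channel.copy()
    probs[probs < part_threshold] = 0.0          # drop weak responses
    locations = []
    order = np.argsort(probs, axis=None)[::-1]   # high to low probability
    for flat in order:
        y, x = np.unravel_index(flat, probs.shape)
        if probs[y, x] == 0.0:                   # only zeros remain
            break
        locations.append((x, y))
        # suppress the surrounding region so one kernel yields one peak
        probs[max(0, y - radius):y + radius + 1,
              max(0, x - radius):x + radius + 1] = 0.0
    return locations
```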

This gives a list of locations $\{p_1, \ldots, p_N\} = \{(x_{p_1}, y_{p_1}), \ldots, (x_{p_N}, y_{p_N})\}$ for $p \in \{s, t\}$. These part locations are then combined with the remaining 4 channels. For every shoulder part $s_n \in \{s_1, \ldots, s_N\}$, the real-valued offsets in the x and y directions from the $(s \to t)$ association channels at $(x_{s_n}, y_{s_n})$ are subtracted to get an estimated tail location, named $(s \to t)_n$. The same is done for every tail point $t_m$ to generate estimated shoulder points $(t \to s)_m$.

With these detected and estimated shoulder and tail locations, a pairwise distance matrix is created, where the distance $d(s_n, t_m)$ between an arbitrary shoulder location $s_n$ and tail location $t_m$ is given by

$$d(s_n, t_m) = \frac{\lVert (s \to t)_n - t_m \rVert + \lVert (t \to s)_m - s_n \rVert}{2}$$

where $\lVert (s \to t)_n - t_m \rVert$ and $\lVert (t \to s)_m - s_n \rVert$ are the norms of the detected locations subtracted from the estimated locations. The resulting pairwise distance matrix looks like this:

$$D_{s,t} = \begin{pmatrix} d(s_1, t_1) & d(s_1, t_2) & \cdots & d(s_1, t_N) \\ d(s_2, t_1) & d(s_2, t_2) & \cdots & d(s_2, t_N) \\ \vdots & \vdots & \ddots & \vdots \\ d(s_N, t_1) & d(s_N, t_2) & \cdots & d(s_N, t_N) \end{pmatrix}$$

Finally, the optimal combination of body parts is determined with the Hungarian algorithm from the SciPy library. The resulting pig instances can be seen in figure 5b, where each shoulder point is red and each tail point is blue; the connecting vectors were determined by the pairwise distance comparison.
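The matching step can be sketched as follows, using `scipy.optimize.linear_sum_assignment`, SciPy's implementation of the Hungarian algorithm. The distance follows the formula above; the estimated points are assumed to have been computed from the association channels as described.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment   # Hungarian algorithm

def match_instances(shoulders, tails, est_tails, est_shoulders):
    """Pair shoulders with tails using the averaged estimation error.

    shoulders, tails:   (N, 2) and (M, 2) detected part locations.
    est_tails[n]:       tail estimated from shoulder n's (s->t) offsets.
    est_shoulders[m]:   shoulder estimated from tail m's (t->s) offsets.
    """
    N, M = len(shoulders), len(tails)
    D = np.zeros((N, M))
    for n in range(N):
        for m in range(M):
            D[n, m] = (np.linalg.norm(est_tails[n] - tails[m]) +
                       np.linalg.norm(est_shoulders[m] - shoulders[n])) / 2
    rows, cols = linear_sum_assignment(D)           # optimal combination
    return [(tuple(shoulders[n]), tuple(tails[m]))
            for n, m in zip(rows, cols)]
```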


3.3 Hourglass model

To generate these 6-channel representations from an arbitrary 2D RGB image, a fully-convolutional neural network with an hourglass-shaped architecture is used, similar to the one implemented in the Psota et al. [2019] paper. It was shown that this kind of architecture can provide powerful high-dimensional estimations from simple RGB images as input. The target outputs for training the network are the 6-channel representations generated from the MPPDA data set. The input of the network is an RGB image resized to 480 × 270. This reduced dimensionality allows the network to train and recognize patterns faster but is still detailed enough to give precise predictions. The output dimensions of the 6-channel representations match the input dimensions.

A PyTorch neural network template by Victor Huang was used to implement the hourglass architecture (see appendix A). Figure 6 shows the hourglass architecture from the Psota et al. [2019] paper.

Figure 6: Hourglass fully-convolutional neural network architecture from Psota et al. [2019]

The first half of the network is the downsampling stage, in which convolutional layers with batch normalization and rectified linear units (ReLU) are used in combination with max pooling layers. Before each max pooling layer, a copy of the input is stored to form a skip-connection with the upsampling layers. These copies are used in depth concatenations after each corresponding max unpooling layer in the upsampling stage. This speeds up training and reduces the number of features the network needs to learn. In the implementation in this paper, these depth concatenations could concatenate multiple images together to allow the model to be extended to consecutive frame inputs. The kernel size of each convolutional layer was set to 3. The final layer consists of a single convolutional layer with no batch normalization or ReLU, outputting the 6-channel representation.
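The sketch below is a toy two-level version of this pattern in PyTorch, illustrating the named components: conv/batch-norm/ReLU blocks, max pooling that returns its indices, max unpooling that reuses them, depth-concatenated skip-connections, and a final convolution without batch normalization or ReLU. It is an illustration under simplifying assumptions, not the thesis network, which is deeper; the sketch assumes input height and width divisible by 4.

```python
import torch
import torch.nn as nn

class MiniHourglass(nn.Module):
    """Toy two-level hourglass with skip-connections (illustrative only)."""

    def __init__(self, out_channels=6):
        super().__init__()
        def block(cin, cout):
            return nn.Sequential(nn.Conv2d(cin, cout, 3, padding=1),
                                 nn.BatchNorm2d(cout),
                                 nn.ReLU(inplace=True))
        self.down1 = block(3, 32)
        self.down2 = block(32, 64)
        self.pool = nn.MaxPool2d(2, return_indices=True)   # keep indices
        self.unpool = nn.MaxUnpool2d(2)
        self.up2 = block(64 + 64, 32)     # channels doubled by skip concat
        self.up1 = block(32 + 32, 32)
        self.head = nn.Conv2d(32, out_channels, 3, padding=1)  # no BN/ReLU

    def forward(self, x):
        s1 = self.down1(x)                # skip copy before pooling
        p1, i1 = self.pool(s1)
        s2 = self.down2(p1)
        p2, i2 = self.pool(s2)
        u = self.unpool(p2, i2)           # mirror pooling with its indices
        u = self.up2(torch.cat([u, s2], dim=1))   # depth concatenation
        u = self.unpool(u, i1)
        u = self.up1(torch.cat([u, s1], dim=1))
        return self.head(u)               # raw 6-channel output

out = MiniHourglass()(torch.randn(1, 3, 272, 480))   # -> (1, 6, 272, 480)
```

The stored pooling indices make each unpooling layer the exact mirror of its pooling layer, which is what lets the depth concatenations line up spatially with the stored copies.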

The chosen optimizer for this network is stochastic gradient descent (SGD) with momentum; because the network solves a high-dimensional problem with a lot of depth, this was found to be the best choice. A custom loss function based on Mean Squared Error (MSE) was developed to aid training. For the first two channels, which represent location probabilities, the MSE loss was computed over the whole channel. For the last four association channels, the loss was calculated only over the non-zero target values and their corresponding output values. A weighting factor $\alpha$ was added to balance the association loss against the location loss; a value of $\alpha = 0.03$ was chosen because of the difference in range between the location probabilities and the real-valued association offsets.
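The custom loss can be sketched as below: full-channel MSE on the two location channels, MSE restricted to non-zero targets on the four association channels, and the association term scaled by α = 0.03. The exact reduction used in the thesis (mean versus sum) is an assumption here.

```python
import torch

def six_channel_loss(pred, target, alpha=0.03):
    """MSE on location channels + alpha * MSE on non-zero association targets."""
    loc_loss = ((pred[:, :2] - target[:, :2]) ** 2).mean()
    mask = target[:, 2:] != 0                    # annotated offset regions only
    if mask.any():
        diff = (pred[:, 2:] - target[:, 2:])[mask]
        assoc_loss = (diff ** 2).mean()
    else:
        assoc_loss = pred.new_zeros(())
    return loc_loss + alpha * assoc_loss
```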

During training, random modifications were applied to the input to improve the robustness of the model. This way the training set covered a more diverse set of camera angles and positions without having to expand the data set. These random modifications included flipping the image across the x axis, scaling the image by a factor between 0.5 and 1.5, applying a random rotation with angle $\varphi$ where $0 < \varphi < 360$ degrees, and shifting all pixels in the x and y directions separately by a random value $\lambda$ where $-20 < \lambda < 20$. After these modifications, the input dimensions of 480 × 270 were preserved by adding black borders or cropping the image.
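These augmentations can be sketched with torchvision as below. Only the image side is shown; in practice the 6-channel target must be transformed consistently (and the offset channels recomputed under rotation), which this sketch omits. `TF.affine` keeps the output at the input size, filling uncovered regions with black, which matches the border/crop behaviour described above.

```python
import random
import torchvision.transforms.functional as TF

def augment(img):
    """Random flip, scale, rotation and shift on a (C, H, W) image tensor."""
    if random.random() < 0.5:
        img = TF.hflip(img)                      # mirror the image
    angle = random.uniform(0.0, 360.0)           # random rotation phi
    scale = random.uniform(0.5, 1.5)             # random scaling factor
    dx = random.randint(-20, 20)                 # random shift lambda (x)
    dy = random.randint(-20, 20)                 # random shift lambda (y)
    # one affine call applies rotation, translation and scale; regions
    # moved out of frame are cropped and new regions are filled black
    return TF.affine(img, angle=angle, translate=[dx, dy],
                     scale=scale, shear=[0.0], fill=0)
```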

To evaluate the model, correct instances of pigs were counted according to their labels. Using the described method of extracting pig instances from 6-channel representations, instances were generated with a certain part threshold and radius. A pairwise distance matrix was then created between the ground-truth locations $gt$ and the generated pigs $gp$:

$$D_{gt,gp} = \begin{pmatrix} d(gt_1, gp_1) & d(gt_1, gp_2) & \cdots & d(gt_1, gp_n) \\ d(gt_2, gp_1) & d(gt_2, gp_2) & \cdots & d(gt_2, gp_n) \\ \vdots & \vdots & \ddots & \vdots \\ d(gt_m, gp_1) & d(gt_m, gp_2) & \cdots & d(gt_m, gp_n) \end{pmatrix}$$

Each $gt$ and $gp$ consists of $((s_x, s_y), (t_x, t_y))$. The distance $d(gt_m, gp_n)$ between two instances was calculated by summing the Euclidean distances of the tail and shoulder locations. For each ground-truth instance, the nearest generated pig instance was collected, and if this distance was small enough (< 100 pixels), it was counted as a true positive (TP). False positives (FP) were calculated as the number of generated pig instances minus the true positives, and false negatives (FN) as the number of ground-truth instances minus the true positives.
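The evaluation can be sketched as follows: for each ground-truth instance the nearest generated instance is looked up, and a match closer than 100 pixels counts as a true positive. The one-to-one constraint (each prediction matched at most once) is an assumption on top of the description above.

```python
import numpy as np

def count_matches(gt, pred, max_dist=100.0):
    """Count TP/FP/FN; gt and pred are lists of ((sx, sy), (tx, ty))."""
    tp = 0
    used = set()
    for gs, gt_tail in gt:
        best_j, best_d = None, np.inf
        for j, (ps, pt) in enumerate(pred):
            if j in used:
                continue
            # summed Euclidean distance of shoulder and tail points
            d = (np.linalg.norm(np.subtract(gs, ps)) +
                 np.linalg.norm(np.subtract(gt_tail, pt)))
            if d < best_d:
                best_j, best_d = j, d
        if best_j is not None and best_d < max_dist:
            tp += 1
            used.add(best_j)
    fp = len(pred) - tp
    fn = len(gt) - tp
    return tp, fp, fn
```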


3.4 Counting heuristics

To use the locations generated from the 6-channel representations for group-level monitoring, a simple heuristic counting approach is proposed in this research. A mask of the feeding and drinking stations and the outer perimeter of the pen is made. These feeding and drinking stations cover only a small section of the pen, so it is not necessary to locate every pig in the pen to count drinking and eating behaviour. A crop of the total image is made which only includes the feeding and drinking stations, with enough room to locate the pigs. An example of this cropped image in combination with the mask for both feeding and drinking stations can be seen in figure 7. This cropped image is fed into the hourglass model and a 6-channel representation is generated from it. The pig instances generated from the 6-channel representation give predictions of the shoulder and tail points of single animals. The numbers of pigs eating and drinking are determined by counting the overlap of shoulder points with the feeding station mask and drinking station mask respectively. The masks of the feeding and drinking stations are drawn such that the shoulder point of a pig overlaps with the mask when the pig is eating or drinking. A sketch of this counting step is given below figure 7.

Figure 7: Mask of the feeding and drinking stations for pen "147"
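A minimal sketch of the counting heuristic, assuming the masks are boolean arrays aligned with the cropped input and the shoulder points come from the instance extraction described in section 3.2:

```python
def count_behaviour(shoulders, feeder_mask, drinker_mask):
    """Count eating/drinking pigs as shoulder points on the station masks.

    shoulders: iterable of (x, y) detected shoulder points.
    feeder_mask, drinker_mask: boolean (H, W) arrays, True on the station.
    """
    eating = drinking = 0
    for x, y in shoulders:
        xi, yi = int(round(x)), int(round(y))
        if 0 <= yi < feeder_mask.shape[0] and 0 <= xi < feeder_mask.shape[1]:
            if feeder_mask[yi, xi]:
                eating += 1
            elif drinker_mask[yi, xi]:
                drinking += 1
    return eating, drinking
```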

To measure the accuracy of this approach, the mean absolute error (MAE) is calculated for each number of pigs eating and drinking. The MAE is calculated as

$$\mathrm{MAE} = \frac{1}{N} \sum_{i=1}^{N} \lvert Y_i - \hat{Y}_i \rvert$$

where $Y_i$ is the correct number of pigs engaging in a certain behaviour and $\hat{Y}_i$ the predicted number of pigs.

MAE is a conventional way to assess counting mistakes and to predict the performance of the model. By using the MAE, an evaluation per category of number of pigs can be given for both behaviours in each pen. This allows the proposed method to be evaluated as a whole: looking at the MAE per number of pigs gives an estimate of performance for high or low numbers of pigs engaging in a behaviour.
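For completeness, a short sketch of this per-count evaluation: the MAE from the formula above, grouped by the ground-truth number of pigs. The array names are illustrative.

```python
import numpy as np

def mae_per_count(y_true, y_pred):
    """MAE grouped by ground-truth count, one entry per observed count."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    return {int(k): float(np.mean(np.abs(y_pred[y_true == k] - k)))
            for k in np.unique(y_true)}
```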


4 Results

Several different values for the learning rate, momentum and learning rate schedule were used to train the network. Figure 8 shows the training and validation loss of the most successful run. This was achieved with the SGD optimizer parameters set to a learning rate of $5 \times 10^{-5}$, momentum of 0.99 and weight decay of 0, with the learning rate decreased every 90 epochs by a factor $\gamma = 0.6$. As discussed before, an $\alpha$ of 0.03 was chosen for the association loss to balance the loss of the location and association channels.

Figure 8: Training and validation loss for the network

The network does not converge to a loss of 0, but the loss does go down substantially. After 120 epochs of training, the training loss of 24.95 and validation loss of 11.44 did not decrease further; beyond that point the loss moved in a volatile way without ever improving. It was therefore decided that this would serve as the best model. In figure 9, a near-perfect estimation on the training set can be seen. From the 6-channel representation it can be seen that every tail and shoulder point of every pig is predicted correctly. The real-valued offsets for the relations from shoulder to tail points form distinct clusters pointing in the same direction. From the location points and their connections it can be seen that the result is almost 100% correct. The only incorrect prediction is at the top left of the image, where the predicted Gaussian distributions have two peaks for both the shoulder and tail locations, causing the network to detect an extra pig. The areas in the association channels which are dark or light but not associated with a tail or shoulder location do not produce pig instances. Another example from the MPPDA data set can be found in the appendix; it shows a generated 6-channel representation and found locations for a pig pen environment with a slanted camera angle.


Figure 9: Example of a deconstructed 6-channel representation (left) with generated locations/associations (right)

The extraction of locations from the 6-channel representation can be done with different part thresholds. Table 1 shows the results for different part thresholds; only the most relevant range from 0.1 to 0.4 is displayed, as values outside this range gave precision or recall values that were too small. A part threshold that is too low gives a low precision, because too many Gaussian peaks are extracted as shoulder and tail locations. A higher part threshold causes the recall to drop, because the predicted Gaussian peaks representing probabilities are not high enough. This evaluation was done on the training split of the MPPDA data set. A part threshold of 0.2 yields the best results, with a good balance of recall and precision: a recall of 0.821 and a precision of 0.781, giving the best F1-score of 0.800.

Part threshold   Recall   Precision   F1-score
0.1              0.852    0.541       0.662
0.15             0.843    0.710       0.771
0.2              0.821    0.781       0.800
0.25             0.762    0.805       0.783
0.3              0.640    0.808       0.715
0.35             0.422    0.810       0.555
0.4              0.157    0.762       0.260

Table 1: Results with different part thresholds on the training split of the MPPDA data set

With this optimal part threshold, the recall and precision for the validation and test splits of the MPPDA data set were computed; the results are in table 2. For the validation split, which comprised environments seen in the training split, a recall of 0.828 and a precision of 0.743 was achieved. On the unseen test split, a recall of 0.720 and a precision of 0.673 was achieved.


Data split    Recall   Precision   F1-score
Training      0.821    0.781       0.800
Validation    0.828    0.743       0.783
Test          0.720    0.673       0.696

Table 2: Results for the hourglass model with optimal parameters

The trained network was then used to generate 6-channel representations for the cropped images of the feeding and drinking stations of the Serket data set. One example from pen "147" can be seen in figure 10. The generated locations and vectors between tails and shoulders do not give accurate results for most images in the data set. Most of the time, the pigs at a feeding or drinking station are too far away and at an angle to the camera at which the network cannot detect them. Pigs which were closer to the camera and in full view had a higher chance of being accurately detected than pigs facing away from or only partially visible to the camera. Another example, from pen "137", can be found in the appendix and displays the same issues. An interesting point to note is that the reflections of pigs in the metal feeder were occasionally predicted to be pig parts. For the group-level Serket data set, no quantitative analysis of the detections was possible because the pigs were not annotated. From the examples it can be seen that some head and tail locations are switched and several pigs are not detected at all. The predictions are often wrong or inaccurate because both the location channels and the association channels are less precise compared to the MPPDA examples.

Figure 10: Example of a deconstructed 6-channel representation (left) with generated locations/associations (right) for pen 147

Figure 11 shows the mean absolute error of the counting heuristics. The predictions for pen "137" are very similar to those for pen "147", and the MAE is high regardless of the number of pigs, pen environment or behaviour.

For both pens, the mean absolute error in predicting the number of pigs approaches at least half of the ground truth. For pen "137", the errors for 12, 13 and 14 pigs eating and 3 pigs drinking are 0 because no instances of those numbers were present in the Serket data set.


Figure 11: Histograms of MAE for eating (a) and drinking (b) behaviour

5 Conclusion

The initial research goal of creating a robust model to monitor group-level behaviour of pigs was not reached in this research. The implemented hourglass model performed substantially worse than in the original paper: where the recall and precision were 96% and 99% in the original paper, a recall of 82.8% and precision of 74.3% was achieved here. This could be due to the reduction of the 16-channel representation to a 6-channel representation. Another reason for the poor performance of the network could be a difference in loss calculations. The recall and precision of 72% and 67.3% respectively on unseen pen environments is another indication of why the model did not perform well on the Serket data set, which consisted only of unseen pen environments with even more challenging conditions. The resulting accuracy of the location predictions is therefore even lower for this data set, which creates a big choke point in the pipeline for predicting the number of pigs engaging in a certain behaviour.

The MPPDA data set was not diverse enough to translate to the Serket group-level data set. The predicted counts for both eating and drinking are meaningless because the locations generated by the model for the Serket pen environments are so inaccurate. This causes the MAE to approach almost half of the ground-truth value. For every number of pigs, pen environment and type of behaviour, the mean absolute error indicates that the predictions are not usable. For both pens, the MAE increases roughly linearly for both eating and drinking behaviour, which shows that the model is essentially guessing and only getting correct results by chance. It can therefore be concluded that the model is not robust enough to predict or monitor different types of behaviour of group-housed pigs. The shortcomings and challenges that led to these inadequate results are highlighted in the following discussion section.


6 Discussion

There is a lot to be discussed about the implementation and the difficulties encountered during this research. These difficulties and errors fall into three categories, which are discussed in this section.

6.1 Hourglass model

The biggest contributor to the failure to predict accurate behaviour counts is the hourglass model. The model performed moderately well on the MPPDA data set but was not robust enough to carry over to a totally new pig pen environment. The fact that the predictions of single pig instances were worse than in the original paper could have several reasons. One is the simplification from the 16-channel to the 6-channel representation. There were two reasons for this simplification: first, reducing the dimensions makes training easier because fewer parameters need to be trained and memory use is reduced; second, many of the annotated pig instances from Serket only feature the tail and shoulder points of pigs, so the reduced representation requires less annotation work and gives a more ready-to-implement model for Serket. However, this change made the implementation less reliable. The accuracy of predictions for unseen environments on the MPPDA data is a good indication of why the model did not perform well on the unseen environments of the Serket data set, which had even more challenges to overcome. From the validation and training loss it can be seen that the losses did not converge below 24.95 and 11.44. This could be because of the nature of the loss function, which computes a mix of probabilities and real-valued offsets over the Gaussian distributions of the points. By using a simple range reduction for the real-valued offsets, training focused too much on associations instead of locations. A one-to-one mapping from an image to a target output was hard for the model to find because of the Gaussian distribution in each of the 6 channels. The training loss would not decrease further, and the network did not manage to learn everything from the MPPDA data set.

6.2 Camera angle and image quality

One noticeable thing in the MPPDA results was that the model had difficulty predicting locations of pigs at the far end of the pen in images taken at an angle, as can be seen in appendix B. This difficulty with slanted camera angles causes problems for the Serket data set, where cameras were installed at an angle and the feeding and drinking stations were at the far side of the pen. Because of this, the Serket data consisted only of pen environments that were difficult for the model to make predictions on. There were some slanted camera angles in the MPPDA data set, but these samples proved too sparse to produce a model robust to these kinds of situations. Had the Serket data consisted of top-down views, such as in figure 9, the model would likely have performed much better.

Attempts were made to use the model on whole images from the Serket data set, but this generated 6-channel representations with so much noise that they were not usable. The sheer number and density of pigs in one pen made it hard to detect single instances. To make this a viable implementation, the training data set for the model needs to be expanded to cover a wider array of pig pens. There was no way to achieve that in this project, so it was decided to use cropped sections of the images to detect feeding and drinking behaviour, with the hypothesis that this would give better results because of the reduced number and complexity of detectable pigs.


However, this did not circumvent the many other issues with the images. The biggest problem was the distance to the feeder and drinker: the long distance in combination with the unusual camera angle contributed greatly to the difficulty of the data set. Cropping the image to a certain region of the pen also left some pigs only partially visible, which made the matching of predicted tail and shoulder locations even more inaccurate. Other problems were smudges on the camera and dust and flies in many of the images, which obscured the view and lowered image quality. The feeding stations were made of metal and reflected the pigs, which caused false positive detections of pig body parts; this never occurred in the MPPDA data set and therefore generated many incorrect predictions. The fact that many of the recordings in pen "147" were made with the infrared night vision mode made them harder to predict: such images were present in the MPPDA data set, but not in a proportional manner. In combination with the aforementioned problems, this led to even poorer predictions.

6.3 Focus on group-level counting

A mistake made during this project was to focus too much on the implementation for detecting separate instances of pigs using the 6-channel representations and the hourglass model. A lot of experimentation was done to get this to work at a sufficient level. Progress was made, but this did not result in a robust model that worked in different environments. By focusing too much on this one aspect of the problem, the overall goal was not reached to a satisfactory degree. The choice to include only shoulder and tail locations and to train a model on a different data set for use on the Serket data set may have been naive; the hypothesis was that the MPPDA data set would be diverse enough to translate to the Serket pen environments. One final problem was that this project was largely done on a remote-access workstation, which was the solution to not having enough local computing power. This solved the problem of training the network, but meant that experimentation took substantially longer because output and input constantly needed to be transferred between local and server locations to be evaluated.

Despite all of its shortcomings, this research gave some insights into the monitoring of group-housed pigs. The MPPDA data set proved not to be representative enough for camera angles other than top-down. The 6-channel representation of 2D images could still be a viable option if the problems with calculating the loss can be solved.

7 Future Work

Although the results of this research were not favorable, a viable model for monitoring different behaviours could grow out of it. With some expansion and adjustments to the 6-channel representation and hourglass model, it could be a powerful tool to locate single instances of pigs. With an improved loss function that weighs the locations and associations in a more balanced way, the model could converge into a robust network. In addition, an expanded data set with annotated shoulder and tail locations should be created, preferably from top-down cameras, to negate the size differences of pigs at different distances and the distance to the feeding and drinking stations. This would also allow the network to generate accurate 6-channel representations for whole images from the Serket data set, as was the case for the MPPDA data set. The recall and precision for seen pen environments, even from difficult camera angles, was significantly better than for unseen pen environments; adding the pen environments from the Serket data set to the training data would thus be a great improvement.

Once this is implemented, the counting heuristics for eating and drinking behaviours could be tested again and reevaluated. The set of detected behaviours could also be expanded: by extracting individual animals in consecutive frames and comparing their differences, walking, lying and standing behaviours could be counted. The final goal would be to incorporate more complicated behaviours such as aggression; for this, a new data set labeled by an experienced ethologist is needed to determine these kinds of behaviour.

Once all these adjustments are made, the methods used in this thesis could provide a useful tool to monitor different behaviours in crowded indoor pen environments.

References

P. Ahrendt, T. Gregersen, and H. Karstoft. Development of a real-time computer vision system for tracking loose-housed pigs. Computers and Electronics in Agriculture, 76(2):169–174, 2011. ISSN 0168-1699. doi:10.1016/j.compag.2011.01.011.

M. Benjamin and S. Yik. Precision livestock farming in swine welfare: A review for swine practitioners. Animals (Basel), 9(4):133, 2019. doi:10.3390/ani9040133.

T. Brown-Brandl, G. Rohrer, and R. Eigenberg. Analysis of feeding behavior of group housed growing–finishing pigs. Computers and Electronics in Agriculture, 96:246–252, 2013. ISSN 0168-1699. doi:10.1016/j.compag.2013.06.002.

J. Cowton, I. Kyriazakis, and J. Bacardit. Automated individual pig localisation, tracking and behaviour metric extraction using deep learning. IEEE Access, 7:108049–108060, 2019.

E. Ekkel, H. A. Spoolder, I. Hulsegge, and H. Hopster. Lying characteristics as determinants for space requirements in pigs. Applied Animal Behaviour Science, 80(1):19–30, 2003. ISSN 0168-1591. doi:10.1016/S0168-1591(02)00154-5.

A. Frost, R. Tillett, and S. Welch. The development and evaluation of image analysis procedures for guiding a livestock monitoring sensor placement robot. Computers and Electronics in Agriculture, 28(3):229–242, 2000. ISSN 0168-1699. doi:10.1016/S0168-1699(00)00129-0.

M. Kashiha, C. Bahr, S. Ott, C. P. Moons, T. A. Niewold, F. Ödberg, and D. Berckmans. Automatic identification of marked pigs in a pen using image pattern recognition. Computers and Electronics in Agriculture, 93:111–120, 2013. ISSN 0168-1699. doi:10.1016/j.compag.2013.01.013.

F. Lao, T. Brown-Brandl, J. Stinn, K. Liu, G. Teng, and H. Xin. Automatic recognition of lactating sow behaviors through depth image processing. Computers and Electronics in Agriculture, 125:56–62, 2016. doi:10.1016/j.compag.2016.04.026.

S. T. Millman. Sickness behaviour and its relevance to animal welfare assessment at the group level. Animal Welfare, 16(2):123–125, 2007.

A. Nasirahmadi, O. Hensel, S. Edwards, and B. Sturm. Automatic detection of mounting behaviours among pigs using image analysis. Computers and Electronics in Agriculture, 124:295–302, 2016. doi:10.1016/j.compag.2016.04.022.

A. Nasirahmadi, S. A. Edwards, and B. Sturm. Implementation of machine vision for detecting behaviour of cattle and pigs. Livestock Science, 202:25–38, 2017. ISSN 1871-1413. doi:10.1016/j.livsci.2017.05.014.

E. T. Psota, M. Mittek, L. C. Pérez, T. Schmidt, and B. Mote. Multi-pig part detection and association with a fully-convolutional network. Sensors (Basel), 19(4), 2019. doi:10.3390/s19040852.

B. Shao and H. Xin. A real-time computer vision assessment and control of thermal comfort for group-housed pigs. Computers and Electronics in Agriculture, 62:15–21, 2008. doi:10.1016/j.compag.2007.09.006.

R. Tillett, C. Onyango, and J. Marchant. Using model-based image processing to track animal movements. Computers and Electronics in Agriculture, 17(2):249–261, 1997. ISSN 0168-1699. doi:10.1016/S0168-1699(96)01308-7.

M. Vázquez-Arellano, H. Griepentrog, D. Reiser, and D. Paraforos. 3-D imaging systems for agricultural applications—a review. Sensors, 16(5), 2016.

C. Wathes, H. Maggs, M. Campbell, and H. Buller. Towards livestock production in the 21st century: A perfect storm averted? Volume 3, 2012. doi:10.13031/2013.42319.

S. Wu, X. Zhao, H. Zhou, and J. Lu. Multi object tracking based on detection with deep learning and hierarchical clustering. In 2019 IEEE 4th International Conference on Image, Vision and Computing (ICIVC), pages 367–370, 2019.

Q. Yang, D. Xiao, and S. Lin. Feeding behavior recognition for group-housed pigs with the Faster R-CNN. Computers and Electronics in Agriculture, 155:453–460, 2018. ISSN 0168-1699. doi:10.1016/j.compag.2018.11.002.


8 Appendix

8.1 A: Software and Hardware

Calculations, training of networks and predictions were done via remote SSH access to a workstation in the RoboLab at the University of Amsterdam. The specifications of this workstation are listed below:

• CPU: Intel Core i7 9700 (8 cores/16 threads)
• Memory: 64 GB DDR4
• GPU: NVidia 2080 Ti (11 GB) with CUDA 10 and driver 415

The hourglass architecture is based on a free-to-use PyTorch template by Victor Huang (https://github.com/victoresque/pytorch-template). Python 3 was used for the implementation, and all required libraries are listed below:

• PyTorch
• NumPy
• Pandas
• Matplotlib
• SciPy
• imageio
• torchvision
• cudatoolkit
• tqdm
• pip
• TensorBoard


8.2 B: Generated output of network

Two different 6-channel representations and their predicted pig instance locations are displayed. Figure 12 shows an example from the MPPDA data set with a slanted camera angle. Figure 13 shows an example from pen "137" of the Serket data set.

Figure 12: Example of a deconstructed 6-channel representation (left) with generated locations/associations (right) from the MPPDA data set

Figure 13: Example of a deconstructed 6-channel representation (left) with generated locations/associations (right) for pen 137
