Weakly Supervised 3D Human Pose Estimation Through Latent Space Dependencies

(1)

MSc Artificial Intelligence

Master Thesis

Weakly Supervised 3D Human Pose

Estimation Through Latent Space

Dependencies

by

Orestis Kompougias

12126500

June 18, 2020

Number of Credits: 48 ECTS November 2019 - June 2020

Supervisor: Yahui Zhang

Assessor & Examiner: Dr Pascal Mettes

(2)

Abstract

The focus of this thesis is the task of 3D human pose estimation from a single image. It is a widely researched task with a lot of well known and documented intricacies. Research is mainly aimed at combating the scarcity of 3D annotated datasets and their lack of out-door images where generalization suffers. In this work, a probabilistic perspective aimed at improving generalization to combat these issues is presented. To accompany this, the well known dependency between 2D and depth information is explicitly integrated as part of the model. The motivation for our method is to explore whether this dependency can be further taken advantage of. The problem is split into two subtasks, namely 2D pose and depth estimation, with weight sharing during the feature extraction stage of the neural network. A bottleneck is also introduced for both tasks where latent representations are sampled from learned distributions. By constraining the spread of the latent representa-tions the problem of the constrained indoor domain of our training dataset is partially alleviated due to the induced regularization effect. The depth task is also given useful information cues from the 2D task during the regression stage such as a depth guess given 2D latent information. Accuracy comparable to other similar methods is achieved while we conduct several experiments to verify the effectiveness of our method.

(3)

Acknowledgements

I would like to thank my supervisor Yahui Zhang for his incredible support and guidance throughout the thesis period. From early on, he helped me get on a track that made me comfortable with useful concepts and existing research concerning this task. He provided me with feedback and ideas through all ups and downs of the thesis. I would also like to thank Dr. Pascal Mettes for agreeing to be my assessor and examiner and taking the time to do so. Furthermore, I would like to thank my family and friends for their support and believing in me these past two years of this master’s program. Finally, I would like to thank the University of Amsterdam and the Artificial Intelligence master’s program. The past two years have been incredibly demanding, both emotionally and in terms of workload, while the outcome has been profoundly rewarding. It has been an eye opening experience to get to partake in a program of such high academic level.

(4)

1 Introduction

Human pose estimation has been a long-standing problem in computer vision. The wide variety of applications that it envelops puts it amongst the most extensively researched computer vision problems today. Such applications include, but are not limited to, virtual character creation and animation [1], pedestrian detection [2, 3], surveillance [4], video games [5, 6, 7], enhancement of sport training [8, 9], human-computer interaction [10] and various medical applications [11, 12].

In recent years, there have been significant advances in 2D human pose estimation because of the rapidly evolving field of deep learning and the widely available annotated datasets for that task. For most of the applications where human pose estimation is needed however, having depth information is important and in some cases, such as character animation, it is vital. These applications pose a need for a 3D human pose estimation framework that can generate accurate depth information given images that contain humans in many different settings and activities.

One of the main challenges in 3D pose estimation is the sparsity of 3D annotated datasets. These datasets require expensive professional equipment and very specific set-tings to create. For that reason, the available 3D datasets are much fewer than their 2D counterparts. Another main challenge is the fact that 3D datasets are limited on setting meaning that models trained on these datasets do not generalize well to images in-the-wild. This is caused by different lighting conditions and backgrounds that can exist in various in-the-wild images that might be unexpected by a trained neural network model. In various works [13, 14, 15, 16, 17], it has been shown that incorporating 2D datasets, datasets without depth annotations, in the training process can improve generalization on the 3D human pose estimation task. Examples of these works include: exploiting the al-ready well developed field of 2D human pose estimation for use in 3D pose estimation in an end-to-end manner [14, 15], using Generative Adversarial Networks as weakly supervised regressors [16, 18, 17] and inducing a geometric loss to the regression of unlabeled samples [15]. These works pinpoint the fact that 2D and 3D human pose estimation are largely correlated tasks, therefore, it is beneficial to use both kinds of datasets when training a 3D human pose estimation neural network. This can be exploited, for example, by using the predicted 2D information as direct input to a depth regressor or by weight sharing during feature extraction.

For human pose estimation, we mainly use the human body’s joints as our pose repre-sentation. The datasets used for this task contain images of people and are accompanied by labels of their 2D and 3D joint locations. However, there are some issues with the joint representation. For example, joints are defined differently in various datasets leading to inconsistencies in predictions between datasets. Some joint definition discrepancies can be somewhat overcome with post-processing while others can not.

In this thesis, we explicitly model the dependence between 3D and 2D pose in latent space and exploit it to improve generalization. By modelling such a dependence we can train a model that facilitates it throughout the training process, enabling weakly super-vised learning by taking advantage of the 2D dataset for the depth estimation task. For both the 2D task and the depth task, we also weight share during the feature extraction part of the model, further enforcing the dependence of the two tasks. To further improve generalization, we model the 2D and depth latent representations to follow a unit Gaussian distribution, constraining latent space and aiding the pose prediction of unseen samples.

First, we use an encoder network that performs feature extraction from the images. Then, the latent encodings of the 2D pose and depth information are sampled from learned

(6)

distributions that are constrained to follow a unit Gaussian distribution similarly to Vari-ational Autoencoders [19] for regularization reasons. Finally, two decoders are trained to output the 2D and depth predictions from their latent encodings. To accomplish our goal of improving generalization through weak supervision, we research the dependencies the two tasks have in latent space and explicitly model them. In the end, we evaluate our model thoroughly with an ablation study. We also compare our performance to other works on the validation set of our 3D annotated dataset and on another unused 3D dataset which contains in-the-wild images.

2 Related Work

There are various approaches for 3D human pose estimation that have been proposed in literature. Most of those are explored in the following subsections.

2.1 Direct 3D human pose estimation

Directly regressing the 3D pose from a 3D labeled dataset is the standard supervised learning approach. The work of Li and Chan [20] proposes a multi-task neural network that can be trained for joint detection and regression. They get similar results for training them sequentially and in parallel. They empirically show that such a framework can successfully disentangle dependencies between body parts.

In the work of Zhou et al. [21], the structural information of the human pose such as bone length and bone rotation angles are enforced through a proposed kinematic layer to train the model. A different approach by Tekin et al. [22], uses an auto-encoder to learn a high dimensional latent representation of the 3D pose and uses that as the regression target for the next stage where a CNN is introduced to project to this latent space. Following that, the decoder is added back again to fine tune the network. Li et al. [23] propose a score network with an image and pose as inputs with the goal of outputting a high score for matching poses. They also aid the network’s learning process by performing regression in an auxiliary 3D joint prediction task. Kostrikov and Gall [24] train a variation of Random Forests for direct 3D human pose estimation.

These approaches, while performing great given the task of 3D human pose estima-tion, are limited to the domain of images captured in a controlled environment that are present in 3D datasets. The most common solution to this domain limitation is to in-clude 2D datasets, which are mostly made up of in-the-wild images, in order to improve generalization.

2.2 3D human pose estimation from 2D joint positions

Regression of 3D pose from 2D joint positions is a well researched topic in this field [13, 25, 26, 27, 28, 29, 30, 31, 32, 33, 14, 15, 16, 17]. The main assumption is that depth coordinates are largely dependant on the 2D joint positions and thus, this intermediate task aids in the extraction of more accurate 3D poses. 2D datasets are also generally made up of images in-the-wild in contrast to 3D datasets which contain mostly images created under a constrained lab environment. This helps with generalization for both the 2D and 3D tasks. An effective baseline proposed by Matinez et al. [30] simply performs regression of 3D coordinates from 2D coordinates using a residual network.

The work of Zhou et al. [13] uses an Expectation-Maximization algorithm to calculate the 3D pose using intermediate 2D heatmaps and a geometric prior on the 3D pose. In the work of Chen and Ramanan [26], the 2D heatmaps are directly matched to a 3D pose

(7)

library with a Nearest Neighbour algorithm. Following a published statistical 3D body shape model [34], Bogo et al. [27] optimize its parameters to best fit 2D joint locations. Habibie et al. [35] propose learning the viewpoint parameters for projecting 3D pose estimates to the 2D image space in order to allow weak supervision.

In contrast to the direct regression of joint locations proposed by other works, Pavlakos et al. [29] propose a volumetric representation as the regression target. This aims to sim-plify the network’s learning task by bringing the regression target closer to what it actually represents. Another proposal that uses a different representation is by Moreno-Noguer [31]. In that paper, pairwise Euclidean Distance Matrices are chosen as the representation for both 2D and 3D joint locations, arguing that the structural information of the pose and the correlations between joints are better captured this way.

A different approach, where the heatmap representation is still used but the regres-sion target has been modified was proposed by Sun et al. [36]. In their work, they propose integral regression, the integration of all locations in a heatmap weighted by their probabilities, allowing end-to-end regression with coordinates without sacrificing the detection-based performance of using heatmaps.

2.3 Human Mesh Reconstruction

Besides the popular joint representation, there are also works that model the human pose as a mesh. This representation can be helpful for applications such as foreground seg-mentation and animation and has various other advantages compared to joint estimation. For example, when using a human mesh representation, the joint angle limits and bone lengths are implicitly learnt.

Kanazawa et al. [16] propose performing regression on Skinned Multi-Person Linear (SMPL) [34] parameters which determine the shape and pose of the human body. They use a discriminator to tell between real or fake parameters and use a projection of the human mesh to joints in order to help guide training. Kolotouros et al. [37] encode extracted features onto vertices of a graph, based on the human mesh template of the SMPL model. Then, after a series of graph convolutions, the predicted coordinates of the vertices are given within the vertices themselves.

Zhu et al. [38] propose a hierarchical mesh deformation framework that refines an initial SMPL mesh prediction in a coarse-to-fine manner. They predict motion vectors that are used for Laplacian mesh deformation that refine the initial mesh prediction in three stages for three separate groups of anchors. Zheng et al. [39] also propose a coarse-to-fine recoarse-to-finement framework that gradually recoarse-to-fines a SMPL human mesh in voxel space using extracted image features. The voxel representation is then further enriched by refining its projected normal maps, alongside the input image, to get the final mesh.

2.4 Weakly supervised methods

Some works introduce methods that enable weak supervision, where the 2D data points that have no depth labels can directly contribute to the depth regression task through various means, other than just their 2D coordinates which was discussed in Section 2.2.

One popular weakly supervised method is to train Generative Adversarial Networks (GAN). Kanazawa et al. [16] propose an architecture that does not require paired 3D samples with the generator being trained on 2D annotated samples (with the purpose of outputting 3D poses) and the discriminator on unpaired 3D samples. Yang et al. [17] use the discriminator to detect fake or real 3D poses that came either from the generator or the ground truth. This enables weakly supervised learning as the generator is forced to

(8)

incorporate the 2D dataset and produce 3D poses that can fool the discriminator. Wandt and Rosenhahn [40] also propose a weakly supervised learning framework without needing 3D paired samples and include a camera parameter module. This module along with the 3D pose output from the generator are used to reproject the output 3D pose from the generator down to 2D and use the output as reprojection loss to help with training.

There are also methods that train 2D pose and depth networks in parallel by sharing CNN intermediate features. One of those works is proposed by Mehta et al. [14], where such networks were shown to generalize better to images in-the-wild and thus showing that the two tasks are entangled. Zhou et al. [15] propose using a geometric loss based on bone length to enable weakly supervised learning of the 3D regression module with a hybrid dataset.

We also use a two-stage method by first training a 2D pose network and then using those weights to initialize the joint task. Our focus however, is the correlation between the latent representations of heatmaps and depthmaps and how we can further exploit that, implicitly or explicitly, in order to improve generalization.

3 Methodology

In this section, we go over some processing steps that are widely used in human pose estimation as they are crucial to our task. We then present our baseline architecture and explain the structural components we add to it in order to get our proposed architecture.

3.1 Human pose estimation pre-processing

We perform human pose estimation by predicting human joint coordinates in the camera coordinate system. The number of joints varies per dataset, with MPII [41], the 2D dataset we are using, having 16 joints and Human3.6M [42, 43], the 3D dataset, having 17 joints. We use the MPII joints as our base representation for this thesis so we exclude the one extra joint from Human3.6M.

For both 2D and depth coordinates, the torso joint is set as the point between the two shoulders, which we use as the root joint for training. This is performed similarly to Zhou et al.[15] since the MPII [41] and Human3.6M [42, 43] datasets have different torso joint definitions.

As stated previously, we use heatmap and depthmap representations for 2D and depth coordinates respectively. In 2D human pose estimation, heatmaps have been proven to be easier to regress than direct x, y coordinates. As mentioned in Tompson et al. [44], continuous regression from RGB images to x, y coordinates adds unnecessary learning complexity which hurts generalization. Heatmaps result in a more manageable, discrete regression task. Similarly, most works on 3D human pose estimation choose to use a depthmap representation for depth coordinates.

Thus, we pre-process the raw coordinates into heatmap representations by drawing Gaussians centered on the coordinates and with their standard deviation being a hyperpa-rameter. For the depthmaps, we require the network to output the relative distances from the root joint. This representation makes learning a much simpler task as the network now performs regression in a local coordinate system.

3.2 Human pose estimation post-processing

In order to obtain the real 3D pose from the outputs of the network, we have to perform some post-processing steps. We start by first obtaining the 2D coordinates of the

(9)

interme-diate representation by using the argmax operation on the heatmaps. Since we trained the network to output Gaussians centered on the predicted joint locations, the indexes obtained correspond to the intermediate x, y coordinates themselves.

xj, yj = argmax(Hj), (1)

H ∈ R16×H×W

It is worth noting that because the output size H × W is smaller than the input image size, quantization errors occur. However, keeping the output size small is necessary for computational reasons. Next, in order to get the depth values that those x,y coordinates correspond to, we use them to index the depthmaps.

zj = Dj,xj,yj, (2)

D ∈ R16×H×W

For validation needs, we use the pelvis joint as our root joint as it is commonly used in bibliography for validation metrics. Thus, we subtract the pelvis joint from our inter-mediate coordinates. Lastly, we need to upscale this interinter-mediate representation to get the real 3D pose in millimeters. We multiply the coordinates by a pre-computed mean of bone length sums and then divide by the bone length sum of the current coordinates.

Out = Out ×ms

s (3)

3.3 Baseline architecture

Figure 1: Baseline architecture. We use ResNet-50 as a feature extractor from which we learn two posterior distributions q(z1 | x) and q(z2 | x). We sample z1 and z2 and

then both latent vectors get passed to their respective decoders to get the heatmaps and depthmaps from which we get the predicted 3D pose.

We use the architecture defined in Figure 1 as our baseline model throughout this thesis. In the end, we perform an ablation study starting with the baseline model and adding the proposed components one by one in order to determine their effectiveness. Details about the depth and heatmap decoder architectures can be found in Tables 1 and 2.

(10)

For extracting image features, a ResNet-50 [45] pretrained on ImageNet [46] is used. ResNet-50 is a well performing classification network that we modify to use as a feature extractor for our task. It is well suited for our task’s needs in terms of accuracy and computational cost and is also not too complex to cause major overfitting issues so its simplicity is also a positive.

A 2D pose and a depth regression network is trained in parallel in order to output the heatmaps and depthmaps. This works as a form of semi-supervised learning wherein we do not have depth annotations for the 2D dataset, however, due to the similarity of the tasks they can provide useful information to one another through weight sharing.

The network learns explicit posterior distributions similar to Variational Autoencoders [19], where those distributions are sampled in order to obtain the latent representations of the depthmaps and heatmaps, before the decoding step. This is defined as:

z ∼ q(z | x) = N (µ, σ2), (4)

where z is the latent vector, x the input image, µ the mean of the distribution and σ2 the variance. Following the standard VAE reparameterization trick to allow for gradient flow, we sample using:

z = µ + σ × , (5)

∼ N (0, 1) (6)

The reason we model the latent variables this way is because we use a Kullback-Leibler (KL) divergence error term to force them into following a unit Gaussian distribution. The KL error term puts a constraint on the network to learn better latent representations and it works as a regularization technique against overfitting. Another reason for using KL divergence is because we need the latent space of the depth regression network to be as dense as possible to make the decoders interpolate better, similarly to VAE. We argue that this helps the decoders generalize better on unseen samples.

The overall loss of this network is defined as:

L(θ1, θ2, φ1, φ2) = Ez1∼q_θ1(z|x)[(y1− dφ1(z1)) 2_] + λdepth× Ez2∼q_θ2(z|x)[(y2− dφ2(z2)) 2_] + λKL1× KL(qθ1(z1 | x) || p(z1| x)) + λKL2× KL(qθ2(z2 | x) || p(z2| x)), (7)

where θ1 and θ2 are the encoder parameters for the 2D and depth tasks respectively.

Similarly, φ1 and φ2 are the regression network parameters for each task. The first two

terms are the regression losses between the labels y and the output of the decoders d. In the KL divergence loss terms, the approximations to the true posterior distributions are defined as q(z | x) and the true posterior distributions are given by p(z | x), which we model using unit Gaussians. λdepth, λKL1 and λKL2 are the weight hyperparameters of

the respective loss terms.

The baseline architecture (Fig. 1) is trained in two stages. The first stage does not train the depthmap branch of the network and uses just the 2D dataset, MPII [41], to do regression on heatmaps. The second stage has been trained using a mixed and balanced dataset of MPII [41] and Human3.6M [42, 43] on the entire network for both tasks. The results obtained serve as baseline metrics for the experiments.

(11)

3.4 Latent vector dependencies

During experimentation, we ran Pearson correlation tests between the latent vectors and discovered that the Pearson correlation coefficients between them were considerably dif-ferent depending on if the sample came from the 3D or 2D dataset. One reason why this could have been happening is the different variety of poses and actions being performed in the 2D dataset versus the 3D dataset. The most likely reason however, was that our latent representation of the depthmaps, z2, was not generalizing well to the in-the-wild

images present in the 2D dataset. This makes sense given our setup, as there is no back-ward gradient through the depth decoder for 2D annotated images and our 3D dataset is constrained to an indoor environment.

In various weakly supervised learning methods, it is common practice to input the pre-dicted heatmaps to the depth regression task [15, 47]. Prepre-dicted heatmaps give additional cues and spatial information that help with the depth regression task. This dependency between the two tasks is widely known and is the primary factor for sharing weights in the feature extraction stage between tasks as well as pursuing other methods to improve weakly supervised learning. In this work, we present a method of exploiting the depen-dency between the two tasks in latent space without weight sharing.

We introduce an explicitly defined distribution that models the dependency of the two tasks and gives an estimate of the depth latent vector z2 based on information derived

from the heatmap latent vector z1:

q(z2 | z1) = N (ˆµ, ˆσ2) (8)

We model this distribution using intermediate layers originating from the heatmap latent vector z1. What this distribution aims to capture is the idea that a set of 2D human

pose coordinates can correspond to a variety of valid depth values. Doing this during the latent phase, however, where the information is more rich and hasn’t yet been reduced to a set of coordinates, allows for a higher degree of estimation and information quality. To summarize, we use this distribution for passing stochastically inferred information about depth for the given 2D latent information to the depth regression task.

In order to get the depth latent vector z2, we use the following formulas to get the

parameters of the posterior distribution q(z2 | x, z1) and sample it. The formulas are

derived by calculating the product of two univariate Gaussian probability density functions [48]. It is a method of combining normal distributions taking into account the variance of each, known as Inverse-variance weighting. This can be interpreted as obtaining a posterior given a prior and a likelihood distribution. Given the two distributions q(z2 | x)

and q(z2 | z1) we use the following formulas to get the parameters of q(z2 | x, z1):

z2∼ q(z2|x, z1) = N (µ, σ2), (9) µ = µˆeσ 2_{+ ˆ}_µ e σ2 e σ2_{+ ˆ}_σ2 , (10) σ = r e σ2_σ_ˆ2 e σ2_{+ ˆ}_σ2, (11)

where ˆµ, ˆσ2 _{are the mean and variance of q(z}

2 | z1) and µ,e eσ

2 _{are the mean and variance}

of q(z2 | x). Again, following the standard VAE reparameterization trick to allow for

gradient flow, we sample using:

z2 = µ + σ × , (12)

(12)

3.5 Depthmap-Heatmap dependencies

Similarly to some other methods [15, 47], we input the predicted heatmap to the depth regression task. This provides a polished and finalized representation of the 2D prediction to the depth regression task which has been proven to be very beneficial due to providing spatial cues for the joints of interest. The previous component proposed passes richer information at the latent vector stage while this one passes more targeted information at a later stage.

Since our method does not perform convolutions end-to-end however, we decided it would be more appropriate to add the heatmap to an intermediate layer of the depth decoder rather than introduce it as part of the distribution sampling where it would have to be vectorized. Thus, we perform a convolution operation on the heatmap to bring it down to an appropriate size to be added to the output of a convolutional layer in the depth decoder.

3.6 Proposed Architecture

Figure 2: Proposed architecture. We use ResNet-50 as a feature extractor from which we learn two posterior distributions q(z1 | x) and q(z2 | x). We sample z1 and then learn

another distribution q(z2 | z1) which is combined with q(z2 | x) using inverse-variance

weighting to give us q(z2| x, z1). It is from that distribution that z2is then sampled from.

Both latent vectors get passed to their respective decoders and the predicted heatmaps are passed to an intermediate layer of the depthmap decoder. We then get the predicted 3D pose from the heatmaps and depthmaps.

The proposed architecture can be seen in Figure 2. In the same manner as the baseline architecture, we perform stochastic sampling of the depth and heatmap latent vectors from approximated posterior distributions and we use KL divergence loss terms. Additionally, we include the two proposed architectural components introduced previously. We use two KL divergence loss terms for the two distributions we learn for the sampling of z2 as

described in Section 3.4. For the heatmap input to the depthmap decoder, we had to find the perfect balance between making the heatmap too small, leading to information loss or introducing it at later layers in the depth decoder, leading to insufficient capability to exploit that information.

(13)

For both the baseline and proposed architecture, the heatmap decoder is defined in Table 1 while the depthmap decoder is defined in Table 2. These decoder architectures were found during a search amongst a multitude of others where we chose the ones that generalized best without overfitting issues. The inputs to the decoders have shape 64×1×1 where 64 is the size of the latent vectors z. The outputs have shape 16 × 64 × 64 where 16 is the number of joints and 64 × 64 is our depthmap and heatmap output size.

The overall loss of this network is defined as:

L(θ₁, θ2, φ1, φ2) = Ez1∼q_θ1(z|x)[(y1− dφ1(z1)) 2_] + λdepth× Ez2∼q_θ2(z|x)[(y2− dφ2(z2)) 2_] + λKL1× KL(qθ1(z1 | x) || p(z1| x)) + λKL2× KL(qθ2(z2 | x) || p(z2| x)) + λKL2× KL(qθ3(z2 | z1) || p(z2 | z1)) (14)

Filter Size (Stride, Pad) Filters Output Size

9x9 (2, 0) 64 64x9x9

11x11 (2, 0) 32 32x27x27

12x12 (2, 0) 16 16x64x64

Table 1: Heatmap decoder architecture. (Transposed convolutions)

Filter Size (Stride, Pad) Filters Output Size

4x4 (2, 1) 128 128x2x2 4x4 (2, 1) 128 128x4x4 4x4 (2, 1) 128 128x8x8 4x4 (2, 1) 128 128x16x16 4x4 (2, 1) 128 128x32x32 4x4 (2, 1) 128 128x64x64 1x1 (1, 0) 16 16x64x64

Table 2: Depthmap decoder architecture. (Transposed convolutions)

4 Implementation and Experiments

4.1 Datasets

For this thesis, use the MPII [41], Human3.6M [42, 43] and MPI-INF-3DHP [49] datasets. MPII. The MPII dataset [41] serves as our 2D dataset. We use this dataset in com-bination with the Human3.6M dataset for training, effectively augmenting the overall training dataset. This dataset contains 25 thousand training images and 3 thousand vali-dation images. The images have been extracted from YouTube videos of people covering 410 activities. It is one of the main datasets used for 2D human pose estimation and is also used for training 3D human pose estimation networks in stages with mixed 2D and 3D datasets, similarly to our method. It is especially useful in 3D human pose estimation as it contains in-the-wild images.

(14)

Human3.6M. The Human3.6M dataset [42, 43] is the most widely used 3D dataset in bibliography. It contains 3.6 million 3D annotated human poses and their respective images captured using a professional motion capture system in an indoor space. The system was composed of 4 digital video cameras, 1 time-of-flight sensor and 10 motion cameras and recording was done in an area of 4 by 3 meters. The dataset depicts 11 professional actors performing 15 activities, however only seven of those subjects have 3D annotations. The indoor and controlled nature of the dataset presents a limitation to its generalization capabilities but it serves as a baseline between researchers for finding methods of improving generalization, other than using additional datasets. We use the “standard protocol” [20, 21, 13] where we use subjects 1, 5, 6, 7 and 8 for training and every 64th frame of subjects 9, 11 for testing.

MPI-INF-3DHP. The MPI-INF-3DHP dataset [49] is a newer 3D dataset that is used for validation purposes. It contains both indoor and outdoor images and thus serves as an in-the-wild generalization benchmark for models trained using the Human3.6M dataset, which only contains indoor images. We use its test set containing 2935 images of six sub-jects performing seven actions for evaluation. The creators of the dataset proposed using only 14 of the joints for compatibility with other datasets and thus we do the same by excluding the thorax and pelvis joints when evaluating with this dataset. The incompati-bility comes from the fact that the joint definitions are different between datasets, which is one of the obstacles that make it hard to train using all available 3D datasets. It is also worth noting that the subjects in this dataset do not wear visible motion capture markers, unlike Human3.6M where their presence might provide visual cues.

4.2 Metrics

For monitoring the depth estimation task during training, we use an absolute distance metric between predicted depth and ground truth in order to verify performance of only the depth prediction. For the full 3D prediction we use the mean per joint position error (MPJPE) in millimeters which is widely used in bibliography. MPJPE represents the mean Euclidean distance between the predicted joint positions and the ground truth.

M P J P E = 1 N N X i=1 k(J_i− J_root) − (J_iGT − JGT root)k2, (15)

where N is the number of joints (which in our case is 16), J the 3D joint coordinates in millimeters and Jroot the pelvis joint coordinates which serve as our root.

Another widely used metric which we use is Procrustes analysis MPJPE (PA-MPJPE) where MPJPE is calculated after the estimated 3D pose is aligned to the ground truth by the Procrustes method.

For the MPI-INF-3DHP dataset, a different set of metrics were proposed [49]. The Percentage of Correct Keypoints (PCK) metric assumes that a joint is correctly detected if its distance from the ground truth is within a certain threshold. The Area Under the Curve (AUC) metric measures average PCK over a range of thresholds up until the one used for the PCK metric. Similarly, we use PCK with a threshold of 150mm and AUC for the range 0-150mm. As mentioned before, we evaluate with this dataset using 14 joints instead of the 16 used for training.

4.3 Implementation Details

The baseline and proposed networks were implemented using Python 3 and PyTorch. All experiments were conducted on DAS-4 [50] on an NVIDIA GTX TitanX GPU. The code

(15)

for pre-processing and post-processing was based on the public implementation of [15] on Github. We performed grid search for hyperparameter tuning.

For our feature extractor, ResNet-50 [45], we do not freeze or remove any layers and instead we fine tune all the weights that have been pretrained on ImageNet [46]. We keep the original 1000 dimensional vector intended for classification as an intermediate latent representation from where both the 2D and depth tasks branch off. We found that increasing or decreasing the size of this vector did not improve performance.

We use the Adam optimizer [51] with β1 = 0.9, β2 = 0.999 and a learning rate of 0.0005

with step decay at various epoch intervals. Our input image size is 256 × 256 and our output heatmap and depthmap size is 64 × 64. Convolutions, besides the last layers, are followed by batch normalization [52] and rectified linear units [53]. Kaiming initialization [54] is used for the weight initialization of convolutional and fully connected layers.

Each network was trained in two stages. The first stage consists of training the heatmap regression task using only the MPII dataset for 300 epochs with early stopping. In the second stage, we initialize the network with the learned weights from stage 1 and then train the entire network with a mixed and balanced dataset consisting of samples from both MPII and Human3.6M for 140 epochs with early stopping. We use a batch size of 32, drawing samples at random from either MPII or Human3.6M during stage 2. The total training time of the entire model was 3 days on the mentioned GPU.

In the loss function of our network (Equation 14) we also use weights for some of the loss terms. We use λdepth = 0.1 for the depth loss, λKL1 = 0.00001 for the KL

divergence of the heatmap distribution and λKL2= 0.00001 for the KL divergence of the

depthmap distribution. The depth loss weight serves as a way of preserving the in-the-wild features learnt by stage 1 by not overwhelming the learnt network parameters given the depth regression task, as suggested by Mehta et al. [14]. The KL divergence weights control the spread of the heatmap and depthmap representations in latent space, serving as regularization parameters.

4.4 Ablation Study

We perform an ablation study to determine the added pose estimation accuracy that each component provides. Starting from the baseline model we add each component one at a time and train each model with the same hyperparameters. Namely, we perform the experiments explained below.

Baseline. The baseline model from Section 3.3 where the 2D task does not explic-itly pass information to the depth regression task but only relies on implicit information exchange due to weight sharing during the feature extraction phase.

Interm. Adding to the baseline model, we introduce the distribution q(z2 | z1) to

pass 2D information to the depth regression task at the latent vector phase. As mentioned in Section 3.4, this distribution is a probabilistic approach of modelling the most likely depth information given the 2D information. We combine this distribution with q(z2| x)

to obtain q(z2 | x, z1) from which we sample the depth latent vector z2.

Interm+H. We add to the previous model by introducing the predicted heatmaps di-rectly into the depth regression task. These heatmaps provide the final stages of the depth regression task with polished spatial information concerning the predicted 2D locations of the joints of interest.

Notably, we do not perform an ablation study on the explicit distribution sampling present in all our models because the rest of the architecture was built around the premise of the latent variables being normally distributed and as such we believe the results won’t be comparable. Since we do not perform such an ablation study, it is worth noting that

(16)

raising or lowering the weights of the KL divergence loss terms makes the accuracy of the model considerably worse. The lowering aspect makes us believe that forcing the latent variables to be normally distributed contributes to the overall accuracy of the model positively and confirms our hypothesis that it improves the generalization capabilities of the decoders.

The results of the ablation study can be seen in Tables 3, 4 and 5. We observe consid-erable accuracy increase compared to the baseline model with each added component. The improvement is a lot more prominent on the MPI-INF-3DHP dataset in Table 5, signifying that our proposed components directly influence the generalization to other domains.

4.5 Results

Method Dir. Disc. Eat Greet Phone Photo Pose Purch.

Mehta et al.[14] 59.7 69.7 60.6 68.8 76.4 85.4 59.1 75.0 Pavlakos et al.[29] 67.4 71.9 66.7 69.1 72.0 77.0 65.0 68.3 Katircioglu et al.[55] 54.9 63.3 57.3 62.3 70.3 77.4 56.7 57.1 Zhou et al.[15] 54.8 60.7 58.2 71.4 62.0 65.5 53.8 55.6 Baseline 56.1 63.0 58.5 65.4 69.0 74.4 56.4 62.0 Interm 54.4 61.7 54.6 62.9 66.7 72.7 55.1 59.1 Interm+H 53.8 60.1 55.9 61.9 66.2 71.7 54.5 58.4

Sit SitD Smoke Wait WalkD Walk WalkT Avg

Mehta et al.[14] 96.2 122.9 70.8 68.5 54.4 82.0 59.8 74.1 Pavlakos et al.[29] 83.7 96.5 71.7 65.8 74.9 59.1 63.2 71.9 Katircioglu et al.[55] 79.0 97.1 64.3 61.9 67.1 49.8 62.3 65.4 Zhou et al.[15] 75.2 111.6 64.2 66.1 51.4 63.2 55.3 64.9 Baseline 81.8 108.9 66.6 64.9 69.7 54.9 55.9 67.4 Interm 80.3 108.8 64.5 62.2 67.6 52.7 56.2 65.6 Interm+H 76.6 102.9 63.7 61.2 66.7 51.4 54.7 64.3

Table 3: Comparison of the Mean Per Joint Position Error (MPJPE) for each action in the Human3.6M dataset al ong with the ablation study results.

Method MPJPE PA-MPJPE

Mehta et al.[14] 74.1 54.6 Pavlakos et al.[29] 71.9 51.9 Katircioglu et al.[55] 65.4 50.69 Zhou et al.[15] 64.9 -Baseline 67.4 51.8 Interm 65.6 51.0 Interm+H 64.3 50.3

Table 4: MPJPE and PAMPJPE comparison between other works on the Human3.6M dataset.

Comparisons of our results with other similar works on Human3.6M can be seen in Table 3 and Table 4. Additionally, visual results can be seen in Figure 3 for Human3.6M and Figure 4 for MPI-INF-3DHP. We achieve good performance compared to similar methods that additionally use either a geometric constraint for the bone lengths or use temporal dependencies. For the purposes of this thesis, we don’t compare our results with

(17)

newer methods that are based on the differentiable soft argmax [56] or integral regression [36] as we do not use either of those.

Method PCK AUC Mehta et al. 64.7 31.7 Zhou et al. 69.2 32.5 Baseline 49.5 17.8 Interm 53.7 20.1 Interm+H 58.4 22.5

Table 5: PCK and AUC result comparison on the MPI-INF-3DHP dataset.

Figure 3: Visual results from the Human3.6M dataset. On the first and third column are the 2D joint predictions and on the second and fourth column are the 3D joint predictions. The left part of the body is shown in blue while the right part is shown in red.

(18)

Figure 4: Visual results from the MPI-INF-3DHP dataset. On the first and third col-umn are the 2D joint predictions and on the second and fourth colcol-umn are the 3D joint predictions. The left part of the body is shown in blue while the right part is shown in red.

We achieve results comparable to other methods on the MPI-INF-3DHP dataset which can be seen in Table 5. For reference, Mehta et al. [14] achieve 72.5PCK and 36.9AUC when using the MPI-INF-3DHP dataset for training. Zhou et al. [15] use post-processing to raise the hip joints towards the neck as the definition of the hip and pelvis joints between MPII/Human3.6M and MPI-INF-3DHP is different. They also use a geometric constraint on the bone lengths during training. While presumably, the creators of MPI-INF-3DHP did not remove the left and right hip joints from their evaluation, they did mention removing the pelvis joint due to incompatibility with other datasets. It is apparent however, that the hip joints, being directly attached to the pelvis joint, also have different

(19)

definitions from the other datasets. We thus had to make a decision for our evaluation of this dataset where we decided to follow the creators’ advice and include the left and right hip joints without any kind of post-processing. Perhaps they should be entirely excluded in a similar manner as the pelvis joint but then we would be making the comparison problem worse. Nonetheless, we achieve significant generalization to the in-the-wild images of this dataset without using any of the aforementioned techniques.

We show qualitative results on the MPII dataset in Figure 5. We observe good results even with images significantly different from our constrained indoor domain 3D training set, showing that our method can generalize effectively to in-the-wild images.

Figure 5: Visual results from the MPII dataset. On the first and third column are the 2D joint predictions and on the second and fourth column are the 3D joint predictions. The left part of the body is shown in blue while the right part is shown in red.

(20)

5 Discussion

During experimentation, we found that it was quite easy to go down a path of overfitting on the indoor domain of the Human3.6M images. With certain architectures we tested for example, it was possible to get much better accuracy on the testing set of Human3.6M while significantly impacting accuracy on MPI-INF-3DHP. We also noticed that if we trained our models for more epochs, the validation loss and validation MPJPE was still decreasing but those models when evaluated on MPI-INF-3DHP performed significantly worse. For that reason, it was absolutely necessary to always check the performance of a model on both Human3.6M and MPI-INF-3DHP as the latter is meant to check for generalization to in-the-wild images.

On the subject of MPI-INF-3DHP, due to the lack of compatibility between itself and the MPII/Human3.6M datasets, there has not been a standardised approach to combining them for training purposes. Such an approach would enhance our training with in-the-wild images which are necessary to improve overall generalization capabilities without having to rely on domain adaptation tricks. This incompatibility also affects the consistency when comparing results of different studies who might or might not have used post-processing to make the dataset more “compatible” with others. This calls for a more standardised approach for future creation of human pose datasets.

Additionally, due to the stochastic nature of our framework, different seeds used for training might provide quite different results while the seeds used for evaluating have smaller deviations. The results we provided in the previous section were obtained by training our model 10 times with random seeds and then testing each of those models 10 times, again, using random seeds. We provide the results obtained from the best perform-ing model out of those runs. It’s worth notperform-ing that while performance on Human3.6M did not deviate more than 0.5 MPJPE between trained models, for MPI-INF-3DHP the largest deviation we obtained was 4 PCK. We theorize that those models had started overfitting on the indoor domain of the Human3.6M dataset.

As most other 3D human pose estimation methods using monocular images, our model struggles the most when parts of the body are occluded. This usually happens in images where the subject is sitting, crouching or leaning in some direction and it depends on the viewing angle. There is also a bias that is induced by the datasets where the viewpoint is at a fixed location which can further exacerbate point-of-view issues such as occlusion.

6 Conclusion and Recommendations

We presented an end-to-end method that predicts a 3D human pose from a single image and enhanced its accuracy by making some probabilistic assumptions in latent space. It was shown that constraining latent representations of heatmaps and depthmaps in latent space can be beneficial to the overall accuracy. Additionally, the motivation behind the inner workings of our framework was explained and an ablation study was performed to confirm the added value of our proposed components. Furthermore, we achieve comparable accuracy and generalization capabilities for in-the-wild samples to other methods similar to ours.

We believe the accuracy can be improved further by using data augmentation to gener-ate different lightning conditions so that the model generalizes better to in-the-wild images that are mostly outdoors. Of course, nothing can be better for generalization than having a 3D dataset with in-the-wild images that is compatible with existing ones, however, such a dataset does not exist yet. As for the architecture, using PixelCNN[57, 58] layers in the

(21)

decoder similar to PixelVAE[59] might prove beneficial to the overall accuracy.

The biggest and perhaps most significant change one could make however, would be to implement one of the newer techniques used for performing supervised regression on heatmaps such as soft argmax [56] or integral regression [36]. With these methods it is possible to extract the predicted coordinates during training while allowing gradient flow. Overall, they provide a higher degree of flexibility and accuracy since the output of the network becomes a set of continuous variables while keeping the heatmap representation for its proven spatial information advantages during training.

To summarize, our method provides a probabilistic view of tackling this problem which can serve as a baseline or motivation for future endeavors interested in such an approach. While the dependency of the heatmap and depthmap tasks is well known and being taken advantage of using shared weights, it is well worth exploring that dependency more intri-cately.

(22)

References

[1] Yupeng Zhang, Teng Han, Zhimin Ren, Nobuyuki Umetani, Xin Tong, Yang Liu, Takaaki Shiratori, and Xiang Cao. Bodyavatar: Creating freeform 3d avatars using first-person body gestures. pages 387–396, 10 2013.

[2] M. Andriluka, S. Roth, and B. Schiele. Monocular 3d pose estimation and tracking by detection. In 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 623–630, June 2010.

[3] Wonhui Kim, Manikandasriram Srinivasan Ramanagopal, Charles Barto, Ming-Yuan Yu, Karl Rosaen, Nick Goumas, Ram Vasudevan, and Matthew Johnson-Roberson. Pedx: Benchmark dataset for metric 3d pose estimation of pedestrians in complex urban intersections. CoRR, abs/1809.03605, 2018.

[4] M. Hofmann and Dariu Gavrila. Multi-view 3d human pose estimation in complex environment. International Journal of Computer Vision, 96:103–124, 01 2012. [5] J. Shotton, R. Girshick, A. Fitzgibbon, T. Sharp, M. Cook, M. Finocchio, R. Moore,

P. Kohli, A. Criminisi, A. Kipman, and A. Blake. Efficient human pose estimation from single depth images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(12):2821–2840, Dec 2013.

[6] J. Shotton, A. Fitzgibbon, M. Cook, T. Sharp, M. Finocchio, R. Moore, A. Kipman, and A. Blake. Real-time human pose recognition in parts from single depth images. In CVPR 2011, pages 1297–1304, June 2011.

[7] Evan Suma Rosenberg, Belinda Lange, Albert Rizzo, David Krum, and Mark Bolas. Faast: The flexible action and articulated skeleton toolkit. pages 247–248, 03 2011. [8] L. Unzueta, J. Goenetxea, M. Rodriguez, and M. T. Linaza. Viewpoint-dependent 3d

human body posing for sports legacy recovery from images and video. In 2014 22nd European Signal Processing Conference (EUSIPCO), pages 361–365, Sep. 2014. [9] M. Fastovets, J. Guillemaut, and A. Hilton. Athlete pose estimation from

monoc-ular tv sports footage. In 2013 IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 1048–1054, June 2013.

[10] Yale Song, David Demirdjian, and Randall Davis. Continuous body and hand gesture recognition for natural human-computer interaction. ACM Trans. Interact. Intell. Syst., 2(1), March 2012.

[11] Leslie Casas, Nassir Navab, and Stefanie Demirci. Patient 3d body pose estimation from pressure imaging. International Journal of Computer Assisted Radiology and Surgery, 12 2018.

[12] Henry M. Clever, Ariel Kapusta, Daehyung Park, Zackory M. Erickson, Yash Chitalia, and Charles C. Kemp. Estimating 3d human pose on a configurable bed from a single pressure image. CoRR, abs/1804.07873, 2018.

[13] Xiaowei Zhou, Menglong Zhu, Spyridon Leonardos, Konstantinos G. Derpanis, and Kostas Daniilidis. Sparseness meets deepness: 3d human pose estimation from monoc-ular video. CoRR, abs/1511.09439, 2015.

(23)

[14] Dushyant Mehta, Helge Rhodin, Dan Casas, Oleksandr Sotnychenko, Weipeng Xu, and Christian Theobalt. Monocular 3d human pose estimation using transfer learning and improved CNN supervision. CoRR, abs/1611.09813, 2016.

[15] Xingyi Zhou, Qixing Huang, Xiao Sun, Xiangyang Xue, and Yichen Wei. Weakly-supervised transfer for 3d human pose estimation in the wild. CoRR, abs/1704.02447, 2017.

[16] Angjoo Kanazawa, Michael J. Black, David W. Jacobs, and Jitendra Malik. End-to-end recovery of human shape and pose. CoRR, abs/1712.06584, 2017.

[17] Wei Yang, Wanli Ouyang, Xiaolong Wang, Jimmy S. J. Ren, Hongsheng Li, and Xiaogang Wang. 3d human pose estimation in the wild by adversarial learning. CoRR, abs/1803.09722, 2018.

[18] Yasunori Kudo, Keisuke Ogaki, Yusuke Matsui, and Yuri Odagiri. Unsupervised adversarial learning of 3d human pose from 2d joint locations. CoRR, abs/1803.08244, 2018.

[19] Diederik P Kingma and Max Welling. Auto-encoding variational bayes, 2013. [20] Sijin Li and Antoni Chan. 3d human pose estimation from monocular images with

deep convolutional neural network. volume 9004, pages 332–347, 11 2014.

[21] Xingyi Zhou, Xiao Sun, Wei Zhang, Shuang Liang, and Yichen Wei. Deep kinematic pose regression. CoRR, abs/1609.05317, 2016.

[22] Bugra Tekin, Isinsu Katircioglu, Mathieu Salzmann, Vincent Lepetit, and Pascal Fua. Structured prediction of 3d human pose with deep neural networks. CoRR, abs/1605.05180, 2016.

[23] Sijin Li, Weichen Zhang, and Antoni B. Chan. Maximum-margin structured learning with deep networks for 3d human pose estimation. CoRR, abs/1508.06708, 2015. [24] Ilya Kostrikov and Juergen Gall. Depth sweep regression forests for estimating 3d

human pose from images. In Proceedings of the British Machine Vision Conference. BMVA Press, 2014.

[25] Denis Tom`e, Chris Russell, and Lourdes Agapito. Lifting from the deep: Convolu-tional 3d pose estimation from a single image. CoRR, abs/1701.00295, 2017.

[26] Ching-Hang Chen and Deva Ramanan. 3d human pose estimation = 2d pose estima-tion + matching. CoRR, abs/1612.06524, 2016.

[27] Federica Bogo, Angjoo Kanazawa, Christoph Lassner, Peter V. Gehler, Javier Romero, and Michael J. Black. Keep it SMPL: automatic estimation of 3d human pose and shape from a single image. CoRR, abs/1607.08128, 2016.

[28] Jiajun Wu, Tianfan Xue, Joseph J. Lim, Yuandong Tian, Joshua B. Tenenbaum, Antonio Torralba, and William T. Freeman. Single image 3d interpreter network. CoRR, abs/1604.08685, 2016.

[29] Georgios Pavlakos, Xiaowei Zhou, Konstantinos G. Derpanis, and Kostas Dani-ilidis. Coarse-to-fine volumetric prediction for single-image 3d human pose. CoRR, abs/1611.07828, 2016.

(24)

[30] Julieta Martinez, Rayat Hossain, Javier Romero, and James J. Little. A simple yet effective baseline for 3d human pose estimation. CoRR, abs/1705.03098, 2017. [31] Francesc Moreno-Noguer. 3d human pose estimation from a single image via distance

matrix regression. CoRR, abs/1611.09010, 2016.

[32] B. X. Nie, P. Wei, and S. Zhu. Monocular 3d human pose estimation by predicting depth on joints. In 2017 IEEE International Conference on Computer Vision (ICCV), pages 3467–3475, Oct 2017.

[33] Ehsan Jahangiri and Alan L. Yuille. Generating multiple hypotheses for human 3d pose consistent with 2d joint detections. CoRR, abs/1702.02258, 2017.

[34] Matthew Loper, Naureen Mahmood, Javier Romero, Gerard Pons-Moll, and Michael J. Black. Smpl: A skinned multi-person linear model. ACM Trans. Graph., 34(6), October 2015.

[35] Ikhsanul Habibie, Weipeng Xu, Dushyant Mehta, Gerard Pons-Moll, and Christian Theobalt. In the wild human pose estimation using explicit 2d features and interme-diate 3d representations. CoRR, abs/1904.03289, 2019.

[36] Xiao Sun, Bin Xiao, Shuang Liang, and Yichen Wei. Integral human pose regression. CoRR, abs/1711.08229, 2017.

[37] Nikos Kolotouros, Georgios Pavlakos, and Kostas Daniilidis. Convolutional mesh regression for single-image human shape reconstruction. CoRR, abs/1905.03244, 2019. [38] Hao Zhu, Xinxin Zuo, Sen Wang, Xun Cao, and Ruigang Yang. Detailed human shape estimation from a single image by hierarchical mesh deformation. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 4486–4495, 2019.

[39] Georgios Pavlakos, Nikos Kolotouros, and Kostas Daniilidis. Texturepose: Supervis-ing human mesh estimation with texture consistency. 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Oct 2019.

[40] Bastian Wandt and Bodo Rosenhahn. Repnet: Weakly supervised training of an ad-versarial reprojection network for 3d human pose estimation. CoRR, abs/1902.09868, 2019.

[41] Mykhaylo Andriluka, Leonid Pishchulin, Peter Gehler, and Bernt Schiele.

[42] Catalin Ionescu, Dragos Papava, Vlad Olaru, and Cristian Sminchisescu. Human3.6m: Large scale datasets and predictive methods for 3d human sensing in natural environ-ments. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(7):1325– 1339, jul 2014.

[43] Cristian Sminchisescu Catalin Ionescu, Fuxin Li. Latent structured models for human pose estimation. In International Conference on Computer Vision, 2011.

[44] Jonathan Tompson, Ross Goroshin, Arjun Jain, Yann LeCun, and Christoph Bre-gler. Efficient object localization using convolutional networks. CoRR, abs/1411.4280, 2014.

(25)

[45] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. CoRR, abs/1512.03385, 2015.

[46] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A Large-Scale Hierarchical Image Database. In CVPR09, 2009.

[47] Rishabh Dabral, Anurag Mundhada, Uday Kusupati, Safeer Afaque, and Arjun Jain. Structure-aware and temporally coherent 3d human pose estimation. CoRR, abs/1711.09250, 2017.

[48] P Bromiley. Products and convolutions of gaussian distributions. 01 2003.

[49] Dushyant Mehta, Helge Rhodin, Dan Casas, Pascal Fua, Oleksandr Sotnychenko, Weipeng Xu, and Christian Theobalt. Monocular 3d human pose estimation in the wild using improved cnn supervision. In 3D Vision (3DV), 2017 Fifth International Conference on. IEEE, 2017.

[50] H. Bal, D. Epema, C. de Laat, R. van Nieuwpoort, J. Romein, F. Seinstra, C. Snoek, and H. Wijshoff. A medium-scale distributed system for computer science research: Infrastructure for the long term. Computer, 49(5):54–63, 2016.

[51] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization, 2014.

[52] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. CoRR, abs/1502.03167, 2015.

[53] Geoffrey Hinton. Rectified linear units improve restricted boltzmann machines vinod nair. volume 27, pages 807–814, 06 2010.

[54] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. CoRR, abs/1502.01852, 2015.

[55] Isinsu Katircioglu, Bugra Tekin, Mathieu Salzmann, Vincent Lepetit, and Pascal Fua. Learning latent representations of 3d human pose with deep neural networks. International Journal of Computer Vision, 126:1326–1341, 2018.

[56] Diogo C. Luvizon, Hedi Tabia, and David Picard. Human pose regression by com-bining indirect part detection and contextual information. CoRR, abs/1710.02322, 2017.

[57] Aaron van den Oord, Nal Kalchbrenner, Oriol Vinyals, Lasse Espeholt, Alex Graves, and Koray Kavukcuoglu. Conditional image generation with pixelcnn decoders, 2016. [58] Tim Salimans, Andrej Karpathy, Xi Chen, and Diederik P. Kingma. Pixelcnn++: Improving the pixelcnn with discretized logistic mixture likelihood and other modifi-cations. CoRR, abs/1701.05517, 2017.

[59] Ishaan Gulrajani, Kundan Kumar, Faruk Ahmed, Adrien Ali Taiga, Francesco Visin, David Vazquez, and Aaron Courville. Pixelvae: A latent variable model for natural images, 2016.

Weakly Supervised 3D Human Pose Estimation Through Latent Space Dependencies

MSc Artificial Intelligence

Master Thesis