Part-Based Object and People Detection
Bernt Schiele
Department of Computer Science
TU Darmstadt, Germany
Cognitive Science Summer School 2009
Overview
• Introduction (part 1)
  ‣ why study computer vision in general and object recognition in particular :)
• Object Recognition Methods
  ‣ Bag of Words Models (BoW) (part 2)
    • model: histogram of local features, e.g. interest points (scale invariant)
    • BoW: no spatial relationships
  ‣ Global Feature Models + Classifier (part 3)
    • e.g. HOG = Histogram of Oriented Gradients: a global object feature/description with fixed spatial relationships
    • e.g. SVM = Support Vector Machine: a widely used discriminative classifier
  ‣ Part-Based Object Models (part 4)
    • e.g. Implicit Shape Model (ISM): local parts & a global constellation of parts, with flexible spatial relationships
Global Feature Models
Example: static HOG feature vector
[Diagram: input image → detection window → HOG (Histogram of Oriented Gradients) feature vector]
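To make the descriptor concrete, here is a minimal, illustrative HOG sketch assuming NumPy. It keeps only the core idea of the slide (per-cell histograms of unsigned gradient orientations); the real Dalal & Triggs descriptor additionally normalizes overlapping 2x2-cell blocks and preprocesses the image, which is omitted here for brevity.

```python
import numpy as np

def hog_descriptor(img, cell=8, bins=9):
    """Minimal HOG sketch: per-cell histograms of gradient orientations,
    L2-normalised over the whole window. (Illustrative only; the full
    descriptor uses overlapping block normalisation.)"""
    gy, gx = np.gradient(img.astype(float))
    mag = np.hypot(gx, gy)
    ang = np.rad2deg(np.arctan2(gy, gx)) % 180.0   # unsigned orientation in [0, 180)
    H, W = img.shape
    ch, cw = H // cell, W // cell
    hist = np.zeros((ch, cw, bins))
    for i in range(ch):
        for j in range(cw):
            a = ang[i*cell:(i+1)*cell, j*cell:(j+1)*cell]
            m = mag[i*cell:(i+1)*cell, j*cell:(j+1)*cell]
            idx = (a / (180.0 / bins)).astype(int) % bins
            for b in range(bins):
                hist[i, j, b] = m[idx == b].sum()   # magnitude-weighted votes
    v = hist.ravel()
    return v / (np.linalg.norm(v) + 1e-6)

desc = hog_descriptor(np.random.rand(128, 64))   # one 64x128-pixel detection window
```

For a 64x128 window with 8x8 cells and 9 bins this yields a 16 * 8 * 9 = 1152-dimensional vector, which a linear SVM can then classify.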
Global Feature Models
Example: static HOG feature vector
• Most important cues:
  ‣ head, shoulder, and leg silhouettes
  ‣ vertical gradients inside a person are counted as negative
  ‣ overlapping blocks just outside the contour are most important
• "Local context" use:
  ‣ note that Dalal & Triggs obtain their best performance by including quite substantial context/background around the person
Global Feature Models: Sliding Window Method for Detection
• Sliding-window-based object & people detection: scan image → extract feature vector → classify feature vector → non-maxima suppression
• Two important questions: 1) which feature vector? 2) which classifier?
• 'Slide' the detection window over all positions & scales
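The scan-classify loop above can be sketched as follows. This is a hedged, illustrative pipeline, not the lecture's implementation: the classifier is passed in as a black box, the image pyramid uses a crude nearest-neighbour resize, and non-maxima suppression is omitted.

```python
import numpy as np

def sliding_window_detect(image, classify, win=(128, 64), stride=8,
                          scales=(1.0, 1.2, 1.44), threshold=0.0):
    """Sketch of sliding-window detection: score a fixed-size window at
    every position and scale, keep windows above a threshold.
    (Non-maxima suppression would follow as a separate step.)"""
    detections = []
    for s in scales:
        # coarse nearest-neighbour downscaling to simulate an image pyramid
        h, w = int(image.shape[0] / s), int(image.shape[1] / s)
        ys = (np.arange(h) * s).astype(int)
        xs = (np.arange(w) * s).astype(int)
        scaled = image[np.ix_(ys, xs)]
        wh, ww = win
        for y in range(0, scaled.shape[0] - wh + 1, stride):
            for x in range(0, scaled.shape[1] - ww + 1, stride):
                score = classify(scaled[y:y+wh, x:x+ww])
                if score > threshold:
                    # report box in original-image coordinates
                    detections.append((x * s, y * s, ww * s, wh * s, score))
    return detections

dets = sliding_window_detect(np.zeros((256, 256)),
                             classify=lambda w: float(w.mean()),
                             threshold=-1.0)
```

The dummy `classify` (mean intensity) stands in for the HOG + SVM scoring function of the previous slides.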
Overview of lecture parts 3 & 4...
• Global Feature Based Methods for People Detection (part 3)
  ‣ A Performance Evaluation of Single and Multi-Feature People Detection [Wojek,Schiele@DAGM-08]
  ‣ Pedestrian Detection: A New Benchmark [Dollar,Wojek,Perona,Schiele@CVPR-09]
  ‣ Multi-Cue Onboard Pedestrian Detection [Wojek,Walk,Schiele@CVPR-09]
• Part-Based Models for People & Object Detection (part 4)
  ‣ Detection by Tracking and Tracking by Detection [Andriluka,Roth,Schiele@CVPR-08]
  ‣ Pictorial Structures Revisited: People Detection and Articulated Pose Estimation [Andriluka,Roth,Schiele@CVPR-09]
  ‣ A Shape-Based Object Class Model for Knowledge Transfer

Comparison of Pedestrian Detectors
• Motivation (global feature & sliding window approaches):
  ‣ what is the best feature?
  ‣ what is the best classifier?
  ‣ how complementary are these features/classifiers?
Reimplementation of Approaches
• Comparison of our reimplementations & the available binaries:
  ‣ our HOG = published binary
  ‣ our Haar wavelets > OpenCV implementation
  ‣ our shapelets >> published binary
Comparison of Different Features & Classifiers
• Same classifier, different features:
• Conclusions:
  ‣ best features: HOG & dense shape context
Combination of Different Features
• Combination of:
  ‣ dense shape context & Haar wavelets
  ‣ HOG & Haar wavelets
  ‣ all three
Sample Detections: HOG vs. Multi-Feature Combination
• Comparison (on the INRIA people dataset):
  ‣ 1st row: HOG
  ‣ 2nd row: combination of dense shape context & Haar wavelets (linear SVM)
Pedestrian Detection: A New Benchmark
• Features of the new pedestrian dataset:
  ‣ 11 h of 'normal' driving in an urban environment (greater LA area)
  ‣ annotation:
    - 250,000 frames (~137 min) annotated with 350,000 labeled bounding boxes of 2,300 unique pedestrians
    - occlusion annotation: 2 bounding boxes per occluded pedestrian (entire pedestrian & visible region)
    - distinction between 'single person' and 'groups of people'

Comparison to Existing Datasets
• New pedestrian dataset:
  ‣ 1-2 orders of magnitude larger than any existing dataset
  ‣ new features: temporal correlation of pedestrians, occlusion labeling, ...
Evaluation Criteria
• Different approaches:
  ‣ false positives per window (FPPW) vs.
  ‣ false positives per image (FPPI)
• Comparison of different algorithms on INRIA-person:
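To illustrate the per-image criterion, here is a small, hedged sketch of how FPPI and miss rate can be computed from per-image detections and ground truth. This is not the benchmark's official matching protocol (which uses ranked, score-ordered matching); it greedily matches each detection to the best-overlapping unmatched ground-truth box.

```python
def fppi_miss_rate(detections_per_image, gts_per_image, iou_thresh=0.5):
    """Count false positives (detections matching no ground truth) and
    misses (unmatched ground-truth boxes) over a set of images.
    Boxes are (x, y, w, h) tuples; matching is greedy, for illustration."""
    def iou(a, b):
        ax, ay, aw, ah = a; bx, by, bw, bh = b
        ix = max(0, min(ax + aw, bx + bw) - max(ax, bx))
        iy = max(0, min(ay + ah, by + bh) - max(ay, by))
        inter = ix * iy
        union = aw * ah + bw * bh - inter
        return inter / union if union else 0.0
    fp = miss = n_gt = 0
    for dets, gts in zip(detections_per_image, gts_per_image):
        matched = set()
        for d in dets:
            best = max(range(len(gts)), key=lambda i: iou(d, gts[i]), default=None)
            if best is not None and best not in matched and iou(d, gts[best]) >= iou_thresh:
                matched.add(best)
            else:
                fp += 1                      # detection with no valid match
        miss += len(gts) - len(matched)      # ground truth left undetected
        n_gt += len(gts)
    n_img = len(detections_per_image)
    return fp / n_img, (miss / n_gt if n_gt else 0.0)

fppi, mr = fppi_miss_rate([[(0, 0, 10, 10)], [(100, 100, 10, 10)]],
                          [[(0, 0, 10, 10)], [(0, 0, 10, 10)]])
```

Sweeping the detector threshold and re-evaluating yields the miss-rate-vs.-FPPI curves shown on the following slides.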
Comparison of Algorithms
• 7 algorithms tested (FPPI: false positives per image): overall performance
Distribution of Pedestrian Sizes
• Differentiation between different sizes:
  ‣ far: pedestrians < 30 pixels in size
  ‣ medium: 30-80 pixels in size
Remaining Failure Cases (for the INRIA-people dataset)
• Missing detections:
• 149 missing detections:
  ‣ 44 difficult contrast & backgrounds
  ‣ 43 occlusion & carried bags
  ‣ 37 unusual articulations
  ‣ 18 over-/underexposure
  ‣ 7 wrong scale (too small/large)
Remaining Failure Cases (for the INRIA-people dataset)
• False positives:
• 149 false positives:
  ‣ 54 vertical structures / street signs
  ‣ 31 cluttered background
  ‣ 28 too small scale (body parts)
  ‣ 24 too large scale detections
  ‣ 12 people that are not annotated :-)
Motion HOG
• Encoding of the optic flow differences:
  ‣ here: Internal Motion Histogram (IMH)
  ‣ other possibility: Motion Boundary Histogram (MBH)
Combined HOG: Joint Modeling of ...
TUD-Brussels: Urban Onboard Dataset
• Training set (image resolution 720x576 pixels):
  ‣ positive set: 1092 image pairs (Darmstadt) with 1776 pedestrians
  ‣ negative set: 192 image pairs
    - 85 pairs taken from the inner city (Darmstadt)
    - 107 pairs recorded from a moving car (Brussels)
    - to find hard examples: 26 image pairs with 183 pedestrians
• Test set (image resolution 640x480 pixels)
Quantitative Results: Static & Motion Features
• Most important results:
  ‣ motion features (right) outperform static features (left)
  ‣ overall best: multi-cue + linear SVM / HIK-SVM
  ‣ MPLBoost is competitive with HIK-SVM for static features
[Recall vs. 1-precision curves for static features (HOG, Haar) and motion-augmented features (HOG + IMHwd + Haar), each combined with linear SVM, HIK-SVM, MPLBoost (K=3/4), and AdaBoost classifiers]
Motion-Based People Detection vs. Full ETH System
• Comparison:
  ‣ our best static person detector
  ‣ our best motion-based person detector
  ‣ the full ETH-Zurich system (Ess, Leibe, van Gool) using stereo, ground plane, structure from motion, tracking, ...
[Wojek,Walk,Schiele@CVPR-09]
Motion-Based People Detection vs. Full ETH System
• Quantitative comparison:
  ‣ blue line: our best static person detector
  ‣ red line: our best motion-based person detector
  ‣ black line: the full ETH-Zurich system (Ess, Leibe, van Gool) using stereo, ground plane, structure from motion, tracking, ...

[Recall vs. false-positives-per-image curves on the ETH-1, ETH-2, and ETH-3 sequences, comparing our detectors (HOG / IMHwd / Haar with SVM or MPLBoost) against the full system of Ess et al. (ICCV'07)]
Part-Based Object and People Detection
Bernt Schiele
Department of Computer Science
TU Darmstadt, Germany
Cognitive Science Summer School 2009
Different Connectivity Structures
Fergus et al. ’03 Fei-Fei et al. ‘03
Leibe et al. ’04 Crandall et al. ‘05 Fergus et al. ’05
Crandall et al. ‘05 Felzenszwalb & Huttenlocher ‘00
Bouchard & Triggs ‘05 Carneiro & Lowe ‘06 Csurka ’04
Vasconcelos ‘00
Spatial Models Considered Here (back in 2007)
• Fully connected shape model (e.g. Constellation Model):
  ‣ parts fully connected
  ‣ recognition complexity: O(N^P) for N image features and P parts
  ‣ method: exhaustive search
• "Star" shape model (e.g. ISM):
  ‣ parts mutually independent given the object center
  ‣ recognition complexity: O(N·P)
  ‣ method: generalized Hough transform
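The O(N·P) complexity of the star model follows directly from the voting scheme: each of the N matched features casts votes independently using its stored occurrences. A toy sketch of such generalized Hough voting for the object centre (data structures here are illustrative, not the ISM implementation):

```python
from collections import defaultdict

def hough_vote(features, codebook, bin_size=10):
    """Star-model recognition sketch: each matched local feature votes for
    the object centre independently of the others, so the total cost is
    O(N * P) in the number of features N and stored occurrences P."""
    votes = defaultdict(float)
    for (fx, fy, descriptor_id) in features:
        # each codebook entry stores (dx, dy, weight) offsets to the centre
        for (dx, dy, w) in codebook.get(descriptor_id, []):
            cx, cy = fx + dx, fy + dy
            votes[(int(cx // bin_size), int(cy // bin_size))] += w
    return max(votes.items(), key=lambda kv: kv[1]) if votes else None

codebook = {0: [(5, 5, 1.0)], 1: [(15, -5, 0.8)]}   # toy occurrences
features = [(10, 10, 0), (0, 20, 1)]                # both imply centre (15, 15)
best = hough_vote(features, codebook)
```

By contrast, a fully connected constellation model must jointly search over assignments of features to all P parts, hence the exponential O(N^P).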
Overview of lecture parts 3 & 4...
• Global Feature Based Methods for People Detection (part 3)
  ‣ A Performance Evaluation of Single and Multi-Feature People Detection [Wojek,Schiele@DAGM-08]
  ‣ Pedestrian Detection: A New Benchmark [Dollar,Wojek,Perona,Schiele@CVPR-09]
  ‣ Multi-Cue Onboard Pedestrian Detection [Wojek,Walk,Schiele@CVPR-09]
• Part-Based Models for People & Object Detection (part 4)
  ‣ Detection by Tracking and Tracking by Detection [Andriluka,Roth,Schiele@CVPR-08]
  ‣ Pictorial Structures Revisited: People Detection and Articulated Pose Estimation [Andriluka,Roth,Schiele@CVPR-09]
  ‣ A Shape-Based Object Class Model for Knowledge Transfer

Motivation:
People Detection and Tracking
• Challenges for detection:
  ‣ partial occlusions
  ‣ appearance variation
  ‣ difficult data association
• Challenges for tracking:
  ‣ dynamic backgrounds
  ‣ multiple people
  ‣ frequent long-term occlusions
Overview
Three stages of our multi-person detection and tracking system:
1. Single-frame detection
2. Tracklet detection
3. Tracking through occlusion
Single-Frame Detector: partISM
• Appearance of parts: Implicit Shape Model (ISM) [Leibe, Seemann & Schiele, CVPR 2005]
Implicit Shape Model: Representation
• Learn an appearance codebook:
  ‣ extract features at interest points (e.g. DoG)
  ‣ agglomerative clustering ⇒ codebook
• Learn spatial occurrence distributions (position & scale):
  ‣ match the codebook to training images
  ‣ record matching positions on the object

Recognition pipeline: interest points → matched codebook entries (interpretations I of each image feature e) → probabilistic voting for object and position (o, x) in a continuous voting space → backprojection of maxima → refined hypothesis (uniform sampling) → segmentation
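The codebook-building step can be sketched as bottom-up average-linkage clustering: merge the two closest clusters until the smallest inter-cluster distance exceeds a cut threshold, and keep each cluster centre as a codeword. The ISM papers use a comparable agglomerative scheme; the distance measure and threshold here are illustrative.

```python
import numpy as np

def agglomerative_codebook(descriptors, cut_threshold):
    """Naive O(n^3) agglomerative clustering sketch: repeatedly merge the
    two clusters with the closest mean descriptors, stop at the threshold,
    and return the cluster centres as codebook entries."""
    clusters = [[d] for d in descriptors]
    while len(clusters) > 1:
        best, bi, bj = None, None, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = np.linalg.norm(np.mean(clusters[i], axis=0)
                                   - np.mean(clusters[j], axis=0))
                if best is None or d < best:
                    best, bi, bj = d, i, j
        if best > cut_threshold:
            break                               # no pair close enough to merge
        clusters[bi] = clusters[bi] + clusters[bj]
        del clusters[bj]
    return [np.mean(c, axis=0) for c in clusters]

pts = [np.array([0.0, 0.0]), np.array([0.1, 0.0]),
       np.array([5.0, 5.0]), np.array([5.0, 5.1])]
codebook = agglomerative_codebook(pts, cut_threshold=1.0)
```

Real systems cluster thousands of high-dimensional patch descriptors and use more efficient linkage algorithms; the stopping threshold controls codebook size and specificity.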
Categorization: "Closing the Loop"
Pipeline: interest points → matched codebook entries → probabilistic voting → backprojection of maximum → refined hypothesis (uniform sampling) → backprojected hypothesis
Detection Results
• Qualitative performance (UIUC database, 200 cars):
  ‣ recognizes different kinds of cars
Single-Frame Detector: partISM
• Appearance of parts: Implicit Shape Model (ISM) [Leibe, Seemann & Schiele, CVPR 2005]
• Part decomposition and inference: pictorial structures model [Felzenszwalb & Huttenlocher, IJCV 2005]

p(L | E) ∝ p(E | L) p(L)

where E is the image evidence and L = {x_o, x_1, ..., x_8} are the body-part positions (object center x_o and parts x_1, ..., x_8).
Single-Frame Detection
• Detections at equal error rate: HOG vs. 4D-ISM

Single-Frame Detection Results
(TUD pedestrians data, no occlusions)
• partISM clearly outperforms 4D-ISM [Seemann et al., DAGM'06]
• partISM outperforms HOG [Dalal & Triggs, CVPR'05] with much less training data
Overview
1. Single-frame detection
2. Tracklet detection
3. Tracking through occlusion
Tracklet Detection in Short Subsequences
• Given: image evidence over the m frames of an overlapping subsequence, E = [E_1, ..., E_m]
• Want: body positions X_o* = [x_o,1*, ..., x_o,m*] and body configurations Y* = [y_1*, ..., y_m*]
• Posterior over positions and configurations:

p(X_o*, Y* | E) ∝ p(E | X_o*, Y*) p(X_o*) p(Y*)

with a likelihood model p(E | X_o*, Y*) given by partISM, a Gaussian speed model for p(X_o*) (here: constant speed), and a dynamical body model p(Y*) given by an hGPLVM; the sequence is processed in overlapping subsequences.
Modeling Body Dynamics
• Y* is high-dimensional: full body poses in m frames.
• Model the body dynamics using a hierarchical Gaussian process latent variable model (hGPLVM) [Lawrence & Moore, ICML 2007]:

p(Y | Z, θ) = ∏_{i=1}^{D} N(Y_{:,i} | 0, K_Z)

p(Z | T, θ̂) = ∏_{i=1}^{q} N(Z_{:,i} | 0, K_T)

where Y = [y_i ∈ R^D] are the training body configurations, Z = [z_i ∈ R^q] is the latent space, and T = [t_i ∈ R] is the time (frame #).
Modeling Body Dynamics
• Visualization of the hierarchical Gaussian process
Single-Frame Detector vs. Tracklet Detector
• At equal error rate (partISM vs. tracklet detections):
  ‣ fewer false positives
  ‣ more robust detection of partially occluded people
Overview
1. Single-frame detection
2. Tracklet detection
3. Tracking through occlusion
Occlusion Recovery
• Greedily link partial tracks based on:
  ‣ motion & articulation compatibility
  ‣ plus appearance compatibility
Pictorial Structures Revisited: Model Components [CVPR-09]
• Appearance model: local features → AdaBoost → part likelihoods (likelihood of part 1, ..., part N, for orientations 1, ..., K)
• Prior and inference: part posteriors → estimated pose
Likelihood Model
• Build on recent advances in object detection:
  ‣ state-of-the-art image descriptor: shape context [Belongie et al., PAMI'02; Mikolajczyk & Schmid, PAMI'05]
  ‣ dense representation
  ‣ discriminative model: an AdaBoost classifier for each body part
    - shape context: 96 dimensions (4 angular, 3 radial, 8 gradient orientations)
    - feature vector: concatenation of the descriptors inside the part bounding box
      - head: 4032 dimensions
      - torso: 8448 dimensions
Likelihood Model
• Part likelihood derived from the boosting score:

p̃(d_i | l_i) = max( Σ_t α_{i,t} h_t(x(l_i)) / Σ_t α_{i,t} , ε_0 )

where x(l_i) are the features at part location l_i, h_t is the decision stump output, α_{i,t} is the decision stump weight, and ε_0 is a small constant to deal with part occlusions.
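The formula above is just a normalized AdaBoost margin with a floor. A minimal sketch (names are illustrative; the real system evaluates the stumps on dense shape-context features at location l_i):

```python
import numpy as np

def part_likelihood(stump_outputs, stump_weights, eps0=1e-4):
    """Normalised AdaBoost margin as a pseudo-likelihood, following the
    slide's formula: max(sum_t a_t * h_t / sum_t a_t, eps0). The floor
    eps0 keeps an occluded part from zeroing out the whole posterior."""
    score = np.dot(stump_weights, stump_outputs) / np.sum(stump_weights)
    return max(score, eps0)
```

For example, two stumps with equal weight where one fires gives a likelihood of 0.5, while a part whose stumps all output 0 (e.g. because it is occluded) still contributes ε_0 rather than 0.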
Likelihood Model
[Comparison of part likelihoods on an input image for head, torso, and upper leg: [Ramanan, NIPS'06] vs. our part likelihoods]
Bernt Schiele | Part-Based Object and People Detection | Aug 27, 2009 |
Kinematic Tree Prior
• Represent pairwise part relations [Felzenszwalb & Huttenlocher, IJCV'05]:
p(l_2 | l_1) = N( T_12(l_2) | T_21(l_1), Σ_12 )

p(L) = p(l_0) ∏_{(i,j)∈E} p(l_i | l_j)

where T_12 and T_21 transform the part locations l_1 and l_2 into the coordinate frame of the joint connecting them.
Kinematic Tree Prior
• Prior parameters: {T_ij, Σ_ij}
• The parameters of the prior are estimated with maximum likelihood.

Figure 2. (left) Kinematic prior learned on the multi-view and multi-articulation dataset from [15]. The mean part position is shown using blue dots; the covariance of the part relations in the transformed space is shown using red ellipses. (right) Several independent samples from the learned prior (for ease of visualization, given fixed torso position and orientation).
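Samples like those in Figure 2 (right) can be drawn by ancestral sampling down the kinematic tree. The sketch below is a strong simplification of the model above: it samples each child's 2D position from a Gaussian around its parent plus a learned mean offset, ignoring the joint-frame transformations T_ij, orientation, and scale.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_pose(tree_edges, mean_offset, cov, root=0):
    """Toy ancestral sampling from a kinematic tree prior: place the root
    part at the origin, then draw each child from N(parent + offset, cov).
    All structures (edge list, offset/cov dicts) are illustrative."""
    pose = {root: np.zeros(2)}
    for parent, child in tree_edges:            # edges in topological order
        pose[child] = rng.multivariate_normal(
            pose[parent] + mean_offset[(parent, child)],
            cov[(parent, child)])
    return pose

# e.g. a head part roughly 10 units above a torso root
pose = sample_pose([(0, 1)], {(0, 1): np.array([0.0, 10.0])},
                   {(0, 1): 0.01 * np.eye(2)})
```

The learned covariances Σ_ij control how tightly each sampled limb stays near its mean configuration.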
[14], and use AdaBoost [7] to train discriminative part classifiers. Our detectors are evaluated densely and are bootstrapped to improve performance. Strong detectors of that type have been commonplace in the pedestrian detection literature [1, 12, 13, 24]. In these cases, however, the employed body models are often simplistic. A simple star model for representing part articulations is, for example, used in [1], whereas [12] does not use an explicit part representation at all. This precludes the applicability to strongly articulated people, and consequently these approaches have been applied to upright people detection only.
We combine this discriminative appearance model with a generative pictorial structures approach by interpreting the normalized classifier margin as the image evidence that is being generated. As a result, we obtain a generic model for people detection and pose estimation, which not only outperforms recent work in both areas by a large margin, but is also surprisingly simple and allows for exact and efficient inference.
More related work: Besides the already mentioned related work, there is an extensive literature on both people (and pedestrian) detection and articulated pose estimation. A large amount of work has been advocating strong body models, and another substantial set of related work relies on powerful appearance models.
Strong body models have appeared in various forms. A certain focus has been the development of non-tree models. [17] imposes constraints not only between limbs on the same extremity, but also between extremities, and relies on integer programming for inference. Another approach incorporates self-occlusion in a non-tree model [8]. Either approach relies on matching simple line features, and only appears to work on relatively clean backgrounds. In contrast, our method also works well on complex, cluttered backgrounds. [20] also uses non-tree models to improve occlusion handling, but still relies on simple features, such as color. A fully connected graphical model for representing articulations is proposed in [2], which also uses discriminative part detectors. However, the method has several restrictions, such as relying on absolute part orientations, which makes it applicable to people in upright poses only. Moreover, the fully connected graph complicates inference. Other work has focused on discriminative tree models [16, 18], but due to the use of simple features, these methods fall short in terms of performance. [25] proposes a complex hierarchical model for pruning the space of valid articulations, but also relies on relatively simple features. In [5], discriminative training is combined with a strong appearance representation based on HOG features; however, the model is applied to detection only.
Discriminative part models have also been used in conjunction with generative body models, as we do here. [11, 21], for example, use them as proposal distributions ("shouters") for MCMC or nonparametric belief propagation. Our paper, however, directly integrates the part detectors and uses them as the appearance model.
2. Generic Model for People Detection and Pose Estimation
To facilitate reliable detection of people across a wide variety of poses, we follow [4] and assume that the body model is decomposed into a set of parts. Their configuration is denoted as L = {l_0, l_1, ..., l_N}, where the state of part i is given by l_i = (x_i, y_i, θ_i, s_i). Here x_i and y_i are the position of the part center in image coordinates, θ_i is the absolute part orientation, and s_i is the part scale, which we assume to be relative to the size of the part in the training set.
Depending on the task, the number of object parts may vary (see Figs. 2 and 3). For upper-body detection (or pose estimation), we rely on 6 different parts: head, torso, as well as left and right lower and upper arms. In case of full-body detection, we additionally consider 4 lower-body parts: left and right upper and lower legs, resulting in a 10-part model. For pedestrian detection we do not use arms, but add feet, leading to an 8-part model.
Given the image evidence D, the posterior of the part configuration L is modeled as p(L | D) ∝ p(D | L) p(L), where p(D | L) is the likelihood of the image evidence given a particular body part configuration. In the pictorial structures approach, p(L) corresponds to a kinematic tree prior. Here, both these terms are learned from training data, either from generic data or trained more specifically for the application at hand. To make such a seemingly generic and simple approach work well, and to compete with more specialized models on a variety of tasks, it is necessary to carefully pick the appropriate prior p(L) and an appropriate image likelihood p(D | L). In Sec. 2.1, we will first introduce our generative kinematic model p(L), which closely follows the pictorial structures approach [4]. In Sec. 2.2, we will then introduce our discriminatively trained appearance model p(D | L).
Results for Articulated Pose Estimation [CVPR-09]

Figure 7. Comparison of full-body pose estimation results between our approach (top) and [15] (bottom). The numbers on the left of each image indicate the number of correctly localized body parts (e.g. 8/10, 6/10, ...).
Table 2. Full-body pose estimation: comparison of body part detection rates and evaluation of different components of the model on the "Iterative Image Parsing" (IIP) dataset [15]. Numbers give the percentage of correctly detected parts; the total number of part segments is 10 × 205 = 2050. Leg and arm columns list left / right values.

| Method | Torso | Upper leg | Lower leg | Upper arm | Forearm | Head | Total |
| IIP [15], 1st parse (edge features only) | 39.5 | 21.4 / 20 | 23.9 / 17.5 | 13.6 / 11.7 | 12.1 / 11.2 | 21.4 | 19.2 |
| IIP [15], 2nd parse (edge + color feat.) | 52.1 | 30.2 / 31.7 | 27.8 / 30.2 | 17 / 18 | 14.6 / 12.6 | 37.5 | 27.2 |
| Our part detectors | 29.7 | 12.6 / 12.1 | 20 / 17 | 3.4 / 3.9 | 6.3 / 2.4 | 40.9 | 14.8 |
| Our inference, edge features from [15] | 63.4 | 47.3 / 48.7 | 41.4 / 34.14 | 30.2 / 23.4 | 21.4 / 19.5 | 45.3 | 37.5 |
| Our inference, our part detectors | 81.4 | 67.3 / 59 | 63.9 / 46.3 | 47.3 / 47.8 | 31.2 / 32.1 | 75.6 | 55.2 |
in order to model occlusions (cf. [20]). We expect that such additional constraints will further improve the performance and should be explored in future work.
Acknowledgements: The authors are thankful to Krystian Mikolajczyk for the shape context implementation and Christian Wojek for the AdaBoost code and helpful suggestions. Mykhaylo Andriluka gratefully acknowledges a scholarship from DFG GRK 1362 "Cooperative, Adaptive and Responsive Monitoring in Mixed Mode Environments".
References
[1] M. Andriluka, S. Roth, and B. Schiele. People-tracking-by-detection and people-detection-by-tracking. CVPR 2008.
[2] M. Bergtholdt, J. Kappes, S. Schmidt, and C. Schnörr. A study of parts-based object class detection using complete graphs. IJCV, 2009. In press.
[3] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. CVPR 2005.
[4] P. F. Felzenszwalb and D. P. Huttenlocher. Pictorial structures for object recognition. IJCV, 61(1):55–79, 2005.
[5] P. F. Felzenszwalb, D. McAllester, and D. Ramanan. A discriminatively trained, multiscale, deformable part model. CVPR 2008.
[6] V. Ferrari, M. Marin, and A. Zisserman. Progressive search space reduction for human pose estimation. CVPR 2008.
[7] Y. Freund and R. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119–139, 1997.
[8] H. Jiang and D. R. Martin. Global pose estimation using non-tree models. CVPR 2008.
[9] F. R. Kschischang, B. J. Frey, and H.-A. Loeliger. Factor graphs and the sum-product algorithm. IEEE T. Info. Theory, 47(2):498–519, Feb. 2001.
[10] X. Lan and D. P. Huttenlocher. Beyond trees: Common-factor models for 2D human pose recovery. ICCV 2005.
[11] M. W. Lee and I. Cohen. Proposal maps driven MCMC for estimating human body pose in static images. CVPR 2004.
[12] B. Leibe, E. Seemann, and B. Schiele. Pedestrian detection in crowded scenes. CVPR 2005.
[13] K. Mikolajczyk, B. Leibe, and B. Schiele. Multiple object class detection with a generative model. CVPR 2006.
[14] K. Mikolajczyk and C. Schmid. A performance evaluation of local descriptors. PAMI, 27(10):1615–1630, 2005.
[15] D. Ramanan. Learning to parse images of articulated objects. NIPS 2006.
[16] D. Ramanan and C. Sminchisescu. Training deformable models for localization. CVPR 2006.
[17] X. Ren, A. C. Berg, and J. Malik. Recovering human body configurations using pairwise constraints between parts. ICCV 2005.
[18] R. Ronfard, C. Schmid, and B. Triggs. Learning to parse pictures of people. ECCV 2002.
[19] E. Seemann, B. Leibe, K. Mikolajczyk, and B. Schiele. An evaluation of local shape-based features for pedestrian detection. BMVC 2005.
[20] L. Sigal and M. J. Black. Measure locally, reason globally: Occlusion-sensitive articulated pose estimation. CVPR 2006.
[21] L. Sigal and M. J. Black. Predicting 3D people from 2D pictures. AMDO 2006.
[22] Z. Tu, X. Chen, A. L. Yuille, and S.-C. Zhu. Image parsing: Unifying segmentation, detection, and recognition. IJCV, 63(2):113–140, 2005.
[23] R. Urtasun, D. J. Fleet, and P. Fua. 3D people tracking with Gaussian process dynamical models. CVPR 2006.
[24] P. Viola, M. Jones, and D. Snow. Detecting pedestrians using patterns of motion and appearance. ICCV 2003.
[25] J. Zhang, J. Luo, R. Collins, and Y. Liu. Body localization in still images using hierarchical models and hybrid search. CVPR 2006.
(g)
8/10
(h)
0/10
(i)
8/10
(j)
2/10
(k)
7/10
(l)
6/10
7/10
0/10
4/10
6/10
3/10
3/10
Figure 7.
Comparison of full body pose estimation results between our approach (top) and [15] (bottom). The numbers on the left of
each image indicate the number of correctly localized body parts.
Method
Torso
Upper leg
Lower leg
Upper arm
Forearm
Head
Total
IIP [15], 1st parse (edge features only)
39.5
21.4
20
23.9
17.5
13.6
11.7
12.1
11.2
21.4
19.2
IIP [15], 2nd parse (edge + color feat.)
52.1
30.2
31.7
27.8
30.2
17
18
14.6
12.6
37.5
27.2
Our part detectors
29.7
12.6
12.1
20
17
3.4
3.9
6.3
2.4
40.9
14.8
Our inference, edge features from [15]
63.4
47.3
48.7
41.4
34.14
30.2
23.4
21.4
19.5
45.3
37.5
Our inference, our part detectors
81.4
67.3
59
63.9
46.3
47.3
47.8
31.2
32.1
75.6
55.2
Table 2.
Full body pose estimation: Comparison of body part detection rates and evaluation of different components of the model on
the “Iterative Image Parsing” (IIP) dataset [15] (numbers indicate the percentage of the correctly detected parts. The total number of part
segments is 10 × 205 = 2050 ).
in order to model occlusions (c.f . [20]). We expect that such
additional constraints will further improve the performance
and should be explored in future work.
Acknowledgements: The authors are thankful to
Krys-tian Mikolajczyk for the shape context implementation and
Christian Wojek for the AdaBoost code and helpful
sugges-tion. Mykhaylo Andriluka gratefully acknowledges a
schol-arship from DFG GRK 1362 “Cooperative, Adaptive and
Responsive Monitoring in Mixed Mode Environments”.
References
[1] M. Andriluka, S. Roth, and B. Schiele. People-tracking-by-detection and people-detection-by-tracking. CVPR 2008.
[2] M. Bergtholdt, J. Kappes, S. Schmidt, and C. Schn¨orr. A study of parts-based object class detection using complete graphs. IJCV, 2009. In press.
[3] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. CVPR 2005.
[4] P. F. Felzenszwalb and D. P. Huttenlocher. Pictorial structures for object recognition. IJCV, 61(1):55–79, 2005.
[5] P. F. Felzenszwalb, D. McAllester, and D. Ramanan. A discrimina-tively trained, multiscale, deformable part model. CVPR 2008.
[6] V. Ferrari, M. Marin, and A. Zisserman. Progressive search space reduction for human pose estimation. CVPR 2008.
[7] Y. Freund and R. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer
and System Sciences, 55(1):119–139, 1997.
[8] H. Jiang and D. R. Martin. Global pose estimation using non-tree models. CVPR 2008.
[9] F. R. Kschischang, B. J. Frey, and H.-A. Loelinger. Factor graphs and the sum-product algorithm. IEEE T. Info. Theory, 47(2):498– 519, Feb. 2001.
[10] X. Lan and D. P. Huttenlocher. Beyond trees: Common-factor mod-els for 2D human pose recovery. ICCV 2005.
[11] M. W. Lee and I. Cohen. Proposal maps driven MCMC for estimating human body pose in static images. CVPR 2004.
[12] B. Leibe, E. Seemann, and B. Schiele. Pedestrian detection in crowded scenes. CVPR 2005.
[13] K. Mikolajczyk, B. Leibe, and B. Schiele. Multiple object class de-tection with a generative model. CVPR 2006.
[14] K. Mikolajczyk and C. Schmid. A performance evaluation of local descriptors. PAMI, 27(10):1615–1630, 2005.
[15] D. Ramanan. Learning to parse images of articulated objects.
NIPS*2006.
[16] D. Ramanan and C. Sminchisescu. Training deformable models for localization. CVPR 2006.
[17] X. Ren, A. C. Berg, and J. Malik. Recovering human body configu-rations using pairwise constraints between parts. ICCV 2005.
[18] R. Ronfard, C. Schmid, and B. Triggs. Learning to parse pictures of people. ECCV 2002.
[19] E. Seemann, B. Leibe, K. Mikolajczyk, and B. Schiele. An evalu-ation of local shape-based features for pedestrian detection. BMVC
2005.
[20] L. Sigal and M. J. Black. Measure locally, reason globally:
Occlusion-sensitive articulated pose estimation. CVPR 2006.
[21] L. Sigal and M. J. Black. Predicting 3D people from 2D pictures.
AMDO 2006.
[22] Z. Tu, X. Chen, A. L. Yuille, and S.-C. Zhu. Image parsing: Unify-ing segmentation, detection, and recognition. IJCV, 63(2):113–140, 2005.
[23] R. Urtasun, D. J. Fleet, and P. Fua. 3D people tracking with Gaussian process dynamical models. CVPR 2006.
[24] P. Viola, M. Jones, and D. Snow. Detecting pedestrians using patterns of motion and appearance. ICCV 2003.
[25] J. Zhang, J. Luo, R. Collins, and Y. Liu. Body localization in still images using hierarchical models and hybrid search. CVPR 2006. (a) 8/10 (b) 6/10 (c) 3/10 (d) 7/10 (e) 9/10 (f) 2/10 0/10 5/10 4/10 3/10 6/10 1/10
(g)
8/10
(h)
0/10
(i)
8/10
(j)
2/10
(k)
7/10
(l)
6/10
7/10
0/10
4/10
6/10
3/10
3/10
Figure 7.
Comparison of full body pose estimation results between our approach (top) and [15] (bottom). The numbers on the left of
each image indicate the number of correctly localized body parts.
Method
Torso
Upper leg
Lower leg
Upper arm
Forearm
Head
Total
IIP [15], 1st parse (edge features only)
39.5
21.4
20
23.9
17.5
13.6
11.7
12.1
11.2
21.4
19.2
IIP [15], 2nd parse (edge + color feat.)
52.1
30.2
31.7
27.8
30.2
17
18
14.6
12.6
37.5
27.2
Our part detectors
29.7
12.6
12.1
20
17
3.4
3.9
6.3
2.4
40.9
14.8
Our inference, edge features from [15]
63.4
47.3
48.7
41.4
34.14
30.2
23.4
21.4
19.5
45.3
37.5
Our inference, our part detectors
81.4
67.3
59
63.9
46.3
47.3
47.8
31.2
32.1
75.6
55.2
Table 2.
Full body pose estimation: Comparison of body part detection rates and evaluation of different components of the model on
the “Iterative Image Parsing” (IIP) dataset [15] (numbers indicate the percentage of the correctly detected parts. The total number of part
segments is 10 × 205 = 2050 ).
in order to model occlusions (c.f . [20]). We expect that such
additional constraints will further improve the performance
and should be explored in future work.
Acknowledgements: The authors are thankful to
Krys-tian Mikolajczyk for the shape context implementation and
Christian Wojek for the AdaBoost code and helpful
sugges-tion. Mykhaylo Andriluka gratefully acknowledges a
schol-arship from DFG GRK 1362 “Cooperative, Adaptive and
Responsive Monitoring in Mixed Mode Environments”.
References
[1] M. Andriluka, S. Roth, and B. Schiele. People-tracking-by-detection and people-detection-by-tracking. CVPR 2008.
[2] M. Bergtholdt, J. Kappes, S. Schmidt, and C. Schnörr. A study of parts-based object class detection using complete graphs. IJCV, 2009. In press.
[3] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. CVPR 2005.
[4] P. F. Felzenszwalb and D. P. Huttenlocher. Pictorial structures for object recognition. IJCV, 61(1):55–79, 2005.
[5] P. F. Felzenszwalb, D. McAllester, and D. Ramanan. A discriminatively trained, multiscale, deformable part model. CVPR 2008.
[6] V. Ferrari, M. Marin, and A. Zisserman. Progressive search space reduction for human pose estimation. CVPR 2008.
[7] Y. Freund and R. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119–139, 1997.
[8] H. Jiang and D. R. Martin. Global pose estimation using non-tree models. CVPR 2008.
[9] F. R. Kschischang, B. J. Frey, and H.-A. Loeliger. Factor graphs and the sum-product algorithm. IEEE Transactions on Information Theory, 47(2):498–519, Feb. 2001.
[10] X. Lan and D. P. Huttenlocher. Beyond trees: Common-factor models for 2D human pose recovery. ICCV 2005.
[11] M. W. Lee and I. Cohen. Proposal maps driven MCMC for estimating human body pose in static images. CVPR 2004.
[12] B. Leibe, E. Seemann, and B. Schiele. Pedestrian detection in crowded scenes. CVPR 2005.
[13] K. Mikolajczyk, B. Leibe, and B. Schiele. Multiple object class detection with a generative model. CVPR 2006.
[14] K. Mikolajczyk and C. Schmid. A performance evaluation of local descriptors. PAMI, 27(10):1615–1630, 2005.
[15] D. Ramanan. Learning to parse images of articulated objects. NIPS 2006.
[16] D. Ramanan and C. Sminchisescu. Training deformable models for localization. CVPR 2006.
[17] X. Ren, A. C. Berg, and J. Malik. Recovering human body configurations using pairwise constraints between parts. ICCV 2005.
[18] R. Ronfard, C. Schmid, and B. Triggs. Learning to parse pictures of people. ECCV 2002.
[19] E. Seemann, B. Leibe, K. Mikolajczyk, and B. Schiele. An evaluation of local shape-based features for pedestrian detection. BMVC 2005.
[20] L. Sigal and M. J. Black. Measure locally, reason globally: Occlusion-sensitive articulated pose estimation. CVPR 2006.
[21] L. Sigal and M. J. Black. Predicting 3D people from 2D pictures. AMDO 2006.
[22] Z. Tu, X. Chen, A. L. Yuille, and S.-C. Zhu. Image parsing: Unifying segmentation, detection, and recognition. IJCV, 63(2):113–140, 2005.
[23] R. Urtasun, D. J. Fleet, and P. Fua. 3D people tracking with Gaussian process dynamical models. CVPR 2006.
[24] P. Viola, M. Jones, and D. Snow. Detecting pedestrians using patterns of motion and appearance. ICCV 2003.
[25] J. Zhang, J. Luo, R. Collins, and Y. Liu. Body localization in still images using hierarchical models and hybrid search. CVPR 2006.