Part-Based Object and People Detection
Bernt Schiele
Department of Computer Science
TU Darmstadt, Germany
Cognitive Science Summer School 2009
Overview
• Introduction (part 1)
  ‣ why study computer vision in general and object recognition in particular :)
• Object Recognition Methods
  ‣ Bag of Words Models (BoW) (part 2)
    • model: histogram of local features, e.g. interest points (scale invariant)
    • BoW: no spatial relationships
  ‣ Global Feature Models + Classifier (part 3)
    • e.g. HOG = Histogram of Oriented Gradients: a global object feature/description with fixed spatial relationships
    • e.g. SVM = Support Vector Machine: a widely used discriminative classifier
  ‣ Part-Based Object Models (part 4)
    • e.g. Implicit Shape Model (ISM): local parts & a global constellation of parts, with flexible spatial relationships
Global Feature Models
Example: static HOG feature vector
[Diagram: input image → detection window → HOG (Histogram of Oriented Gradients) feature vector]
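To make the descriptor concrete, here is a minimal, illustrative HOG sketch assuming NumPy. It keeps only the core idea of the slide (per-cell histograms of unsigned gradient orientations); the real Dalal & Triggs descriptor additionally normalizes overlapping 2x2-cell blocks and preprocesses the image, which is omitted here for brevity.

```python
import numpy as np

def hog_descriptor(img, cell=8, bins=9):
    """Minimal HOG sketch: per-cell histograms of gradient orientations,
    L2-normalised over the whole window. (Illustrative only; the full
    descriptor uses overlapping block normalisation.)"""
    gy, gx = np.gradient(img.astype(float))
    mag = np.hypot(gx, gy)
    ang = np.rad2deg(np.arctan2(gy, gx)) % 180.0   # unsigned orientation in [0, 180)
    H, W = img.shape
    ch, cw = H // cell, W // cell
    hist = np.zeros((ch, cw, bins))
    for i in range(ch):
        for j in range(cw):
            a = ang[i*cell:(i+1)*cell, j*cell:(j+1)*cell]
            m = mag[i*cell:(i+1)*cell, j*cell:(j+1)*cell]
            idx = (a / (180.0 / bins)).astype(int) % bins
            for b in range(bins):
                hist[i, j, b] = m[idx == b].sum()   # magnitude-weighted votes
    v = hist.ravel()
    return v / (np.linalg.norm(v) + 1e-6)

desc = hog_descriptor(np.random.rand(128, 64))   # one 64x128-pixel detection window
```

For a 64x128 window with 8x8 cells and 9 bins this yields a 16 * 8 * 9 = 1152-dimensional vector, which a linear SVM can then classify.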
Global Feature Models
Example: static HOG feature vector
• Most important cues:
  ‣ head, shoulder, and leg silhouettes
  ‣ vertical gradients inside a person are counted as negative
  ‣ overlapping blocks just outside the contour are most important
• "Local context" use:
  ‣ note that Dalal & Triggs obtain their best performance by including quite substantial context/background around the person
Global Feature Models: Sliding Window Method for Detection
• Sliding-window-based object & people detection: scan image → extract feature vector → classify feature vector → non-maxima suppression
• Two important questions: 1) which feature vector? 2) which classifier?
• 'Slide' the detection window over all positions & scales
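The scan-classify loop above can be sketched as follows. This is a hedged, illustrative pipeline, not the lecture's implementation: the classifier is passed in as a black box, the image pyramid uses a crude nearest-neighbour resize, and non-maxima suppression is omitted.

```python
import numpy as np

def sliding_window_detect(image, classify, win=(128, 64), stride=8,
                          scales=(1.0, 1.2, 1.44), threshold=0.0):
    """Sketch of sliding-window detection: score a fixed-size window at
    every position and scale, keep windows above a threshold.
    (Non-maxima suppression would follow as a separate step.)"""
    detections = []
    for s in scales:
        # coarse nearest-neighbour downscaling to simulate an image pyramid
        h, w = int(image.shape[0] / s), int(image.shape[1] / s)
        ys = (np.arange(h) * s).astype(int)
        xs = (np.arange(w) * s).astype(int)
        scaled = image[np.ix_(ys, xs)]
        wh, ww = win
        for y in range(0, scaled.shape[0] - wh + 1, stride):
            for x in range(0, scaled.shape[1] - ww + 1, stride):
                score = classify(scaled[y:y+wh, x:x+ww])
                if score > threshold:
                    # report box in original-image coordinates
                    detections.append((x * s, y * s, ww * s, wh * s, score))
    return detections

dets = sliding_window_detect(np.zeros((256, 256)),
                             classify=lambda w: float(w.mean()),
                             threshold=-1.0)
```

The dummy `classify` (mean intensity) stands in for the HOG + SVM scoring function of the previous slides.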
Overview of lecture parts 3 & 4...
• Global Feature Based Methods for People Detection (part 3)
  ‣ A Performance Evaluation of Single and Multi-Feature People Detection [Wojek,Schiele@DAGM-08]
  ‣ Pedestrian Detection: A New Benchmark [Dollar,Wojek,Perona,Schiele@CVPR-09]
  ‣ Multi-Cue Onboard Pedestrian Detection [Wojek,Walk,Schiele@CVPR-09]
• Part-Based Models for People & Object Detection (part 4)
  ‣ Detection by Tracking and Tracking by Detection [Andriluka,Roth,Schiele@CVPR-08]
  ‣ Pictorial Structures Revisited: People Detection and Articulated Pose Estimation [Andriluka,Roth,Schiele@CVPR-09]
  ‣ A Shape-Based Object Class Model for Knowledge Transfer

Comparison of Pedestrian Detectors
• Motivation (global feature & sliding window approaches):
  ‣ what is the best feature?
  ‣ what is the best classifier?
  ‣ how complementary are these features/classifiers?
Reimplementation of Approaches
• Comparison of our reimplementations & the available binaries:
  ‣ our HOG = published binary
  ‣ our Haar wavelets > OpenCV implementation
  ‣ our shapelets >> published binary
Comparison of Different Features & Classifiers
• Same classifier, different features:
• Conclusions:
  ‣ best features: HOG & dense shape context
Combination of Different Features
• Combination of:
  ‣ dense shape context & Haar wavelets
  ‣ HOG & Haar wavelets
  ‣ all three
Sample Detections: HOG vs. Multi-Feature Combination
• Comparison (on the INRIA people dataset):
  ‣ 1st row: HOG
  ‣ 2nd row: combination of dense shape context & Haar wavelets (linear SVM)
Pedestrian Detection: A New Benchmark
• Features of the new pedestrian dataset:
  ‣ 11 h of 'normal' driving in an urban environment (greater LA area)
  ‣ annotation:
    - 250,000 frames (~137 min) annotated with 350,000 labeled bounding boxes of 2,300 unique pedestrians
    - occlusion annotation: 2 bounding boxes per occluded pedestrian (entire pedestrian & visible region)
    - distinction between 'single person' and 'groups of people'

Comparison to Existing Datasets
• New pedestrian dataset:
  ‣ 1-2 orders of magnitude larger than any existing dataset
  ‣ new features: temporal correlation of pedestrians, occlusion labeling, ...
Evaluation Criteria
• Different approaches:
  ‣ false positives per window (FPPW) vs.
  ‣ false positives per image (FPPI)
• Comparison of different algorithms on INRIA-person:
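To illustrate the per-image criterion, here is a small, hedged sketch of how FPPI and miss rate can be computed from per-image detections and ground truth. This is not the benchmark's official matching protocol (which uses ranked, score-ordered matching); it greedily matches each detection to the best-overlapping unmatched ground-truth box.

```python
def fppi_miss_rate(detections_per_image, gts_per_image, iou_thresh=0.5):
    """Count false positives (detections matching no ground truth) and
    misses (unmatched ground-truth boxes) over a set of images.
    Boxes are (x, y, w, h) tuples; matching is greedy, for illustration."""
    def iou(a, b):
        ax, ay, aw, ah = a; bx, by, bw, bh = b
        ix = max(0, min(ax + aw, bx + bw) - max(ax, bx))
        iy = max(0, min(ay + ah, by + bh) - max(ay, by))
        inter = ix * iy
        union = aw * ah + bw * bh - inter
        return inter / union if union else 0.0
    fp = miss = n_gt = 0
    for dets, gts in zip(detections_per_image, gts_per_image):
        matched = set()
        for d in dets:
            best = max(range(len(gts)), key=lambda i: iou(d, gts[i]), default=None)
            if best is not None and best not in matched and iou(d, gts[best]) >= iou_thresh:
                matched.add(best)
            else:
                fp += 1                      # detection with no valid match
        miss += len(gts) - len(matched)      # ground truth left undetected
        n_gt += len(gts)
    n_img = len(detections_per_image)
    return fp / n_img, (miss / n_gt if n_gt else 0.0)

fppi, mr = fppi_miss_rate([[(0, 0, 10, 10)], [(100, 100, 10, 10)]],
                          [[(0, 0, 10, 10)], [(0, 0, 10, 10)]])
```

Sweeping the detector threshold and re-evaluating yields the miss-rate-vs.-FPPI curves shown on the following slides.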
Comparison of Algorithms
• 7 algorithms tested (FPPI: false positives per image): overall performance
Distribution of Pedestrian Sizes
• Differentiation between different sizes:
  ‣ far: pedestrians < 30 pixels in size
  ‣ medium: 30-80 pixels in size
Remaining Failure Cases (for the INRIA-people dataset)
• Missing detections:
• 149 missing detections:
  ‣ 44 difficult contrast & backgrounds
  ‣ 43 occlusion & carried bags
  ‣ 37 unusual articulations
  ‣ 18 over-/underexposure
  ‣ 7 wrong scale (too small/large)
Remaining Failure Cases (for the INRIA-people dataset)
• False positives:
• 149 false positives:
  ‣ 54 vertical structures / street signs
  ‣ 31 cluttered background
  ‣ 28 too small scale (body parts)
  ‣ 24 too large scale detections
  ‣ 12 people that are not annotated :-)
Motion HOG
• Encoding of the optic flow differences:
  ‣ here: Internal Motion Histogram (IMH)
  ‣ other possibility: Motion Boundary Histogram (MBH)
Combined HOG: Joint Modeling of ...
TUD-Brussels: Urban Onboard Dataset
• Training set (image resolution 720x576 pixels):
  ‣ positive set: 1092 image pairs (Darmstadt) with 1776 pedestrians
  ‣ negative set: 192 image pairs
    - 85 pairs taken from the inner city (Darmstadt)
    - 107 pairs recorded from a moving car (Brussels)
    - to find hard examples: 26 image pairs with 183 pedestrians
• Test set (image resolution 640x480 pixels)
Quantitative Results: Static & Motion Features
• Most important results:
  ‣ motion features (right) outperform static features (left)
  ‣ overall best: multi-cue + linear SVM / HIK-SVM
  ‣ MPLBoost is competitive with HIK-SVM for static features
[Recall vs. 1-precision curves for static features (HOG, Haar) and motion-augmented features (HOG + IMHwd + Haar), each combined with linear SVM, HIK-SVM, MPLBoost (K=3/4), and AdaBoost classifiers]
Motion-Based People Detection vs. Full ETH System
• Comparison:
  ‣ our best static person detector
  ‣ our best motion-based person detector
  ‣ the full ETH-Zurich system (Ess, Leibe, van Gool) using stereo, ground plane, structure from motion, tracking, ...
[Wojek,Walk,Schiele@CVPR-09]
Motion-Based People Detection vs. Full ETH System
• Quantitative comparison:
  ‣ blue line: our best static person detector
  ‣ red line: our best motion-based person detector
  ‣ black line: the full ETH-Zurich system (Ess, Leibe, van Gool) using stereo, ground plane, structure from motion, tracking, ...

[Recall vs. false-positives-per-image curves on the ETH-1, ETH-2, and ETH-3 sequences, comparing our detectors (HOG / IMHwd / Haar with SVM or MPLBoost) against the full system of Ess et al. (ICCV'07)]
Part-Based Object and People Detection
Bernt Schiele
Department of Computer Science
TU Darmstadt, Germany
Cognitive Science Summer School 2009
Different Connectivity Structures
Fergus et al. ’03 Fei-Fei et al. ‘03
Leibe et al. ’04 Crandall et al. ‘05 Fergus et al. ’05
Crandall et al. ‘05 Felzenszwalb & Huttenlocher ‘00
Bouchard & Triggs ‘05 Carneiro & Lowe ‘06 Csurka ’04
Vasconcelos ‘00
Spatial Models Considered Here (back in 2007)
• Fully connected shape model (e.g. Constellation Model):
  ‣ parts fully connected
  ‣ recognition complexity: O(N^P) for N image features and P parts
  ‣ method: exhaustive search
• "Star" shape model (e.g. ISM):
  ‣ parts mutually independent given the object center
  ‣ recognition complexity: O(N·P)
  ‣ method: generalized Hough transform
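The O(N·P) complexity of the star model follows directly from the voting scheme: each of the N matched features casts votes independently using its stored occurrences. A toy sketch of such generalized Hough voting for the object centre (data structures here are illustrative, not the ISM implementation):

```python
from collections import defaultdict

def hough_vote(features, codebook, bin_size=10):
    """Star-model recognition sketch: each matched local feature votes for
    the object centre independently of the others, so the total cost is
    O(N * P) in the number of features N and stored occurrences P."""
    votes = defaultdict(float)
    for (fx, fy, descriptor_id) in features:
        # each codebook entry stores (dx, dy, weight) offsets to the centre
        for (dx, dy, w) in codebook.get(descriptor_id, []):
            cx, cy = fx + dx, fy + dy
            votes[(int(cx // bin_size), int(cy // bin_size))] += w
    return max(votes.items(), key=lambda kv: kv[1]) if votes else None

codebook = {0: [(5, 5, 1.0)], 1: [(15, -5, 0.8)]}   # toy occurrences
features = [(10, 10, 0), (0, 20, 1)]                # both imply centre (15, 15)
best = hough_vote(features, codebook)
```

By contrast, a fully connected constellation model must jointly search over assignments of features to all P parts, hence the exponential O(N^P).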
Overview of lecture parts 3 & 4...
• Global Feature Based Methods for People Detection (part 3)
  ‣ A Performance Evaluation of Single and Multi-Feature People Detection [Wojek,Schiele@DAGM-08]
  ‣ Pedestrian Detection: A New Benchmark [Dollar,Wojek,Perona,Schiele@CVPR-09]
  ‣ Multi-Cue Onboard Pedestrian Detection [Wojek,Walk,Schiele@CVPR-09]
• Part-Based Models for People & Object Detection (part 4)
  ‣ Detection by Tracking and Tracking by Detection [Andriluka,Roth,Schiele@CVPR-08]
  ‣ Pictorial Structures Revisited: People Detection and Articulated Pose Estimation [Andriluka,Roth,Schiele@CVPR-09]
  ‣ A Shape-Based Object Class Model for Knowledge Transfer

Motivation:
People Detection and Tracking
• Challenges for detection:
  ‣ partial occlusions
  ‣ appearance variation
  ‣ difficult data association
• Challenges for tracking:
  ‣ dynamic backgrounds
  ‣ multiple people
  ‣ frequent long-term occlusions
Overview
Three stages of our multi-person detection and tracking system:
1. Single-frame detection
2. Tracklet detection
3. Tracking through occlusion
Single-Frame Detector: partISM
• Appearance of parts: Implicit Shape Model (ISM) [Leibe, Seemann & Schiele, CVPR 2005]
Implicit Shape Model: Representation
• Learn an appearance codebook:
  ‣ extract features at interest points (e.g. DoG)
  ‣ agglomerative clustering ⇒ codebook
• Learn spatial occurrence distributions (position & scale):
  ‣ match the codebook to training images
  ‣ record matching positions on the object

Recognition pipeline: interest points → matched codebook entries (interpretations I of each image feature e) → probabilistic voting for object and position (o, x) in a continuous voting space → backprojection of maxima → refined hypothesis (uniform sampling) → segmentation
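The codebook-building step can be sketched as bottom-up average-linkage clustering: merge the two closest clusters until the smallest inter-cluster distance exceeds a cut threshold, and keep each cluster centre as a codeword. The ISM papers use a comparable agglomerative scheme; the distance measure and threshold here are illustrative.

```python
import numpy as np

def agglomerative_codebook(descriptors, cut_threshold):
    """Naive O(n^3) agglomerative clustering sketch: repeatedly merge the
    two clusters with the closest mean descriptors, stop at the threshold,
    and return the cluster centres as codebook entries."""
    clusters = [[d] for d in descriptors]
    while len(clusters) > 1:
        best, bi, bj = None, None, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = np.linalg.norm(np.mean(clusters[i], axis=0)
                                   - np.mean(clusters[j], axis=0))
                if best is None or d < best:
                    best, bi, bj = d, i, j
        if best > cut_threshold:
            break                               # no pair close enough to merge
        clusters[bi] = clusters[bi] + clusters[bj]
        del clusters[bj]
    return [np.mean(c, axis=0) for c in clusters]

pts = [np.array([0.0, 0.0]), np.array([0.1, 0.0]),
       np.array([5.0, 5.0]), np.array([5.0, 5.1])]
codebook = agglomerative_codebook(pts, cut_threshold=1.0)
```

Real systems cluster thousands of high-dimensional patch descriptors and use more efficient linkage algorithms; the stopping threshold controls codebook size and specificity.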
Categorization: "Closing the Loop"
Pipeline: interest points → matched codebook entries → probabilistic voting → backprojection of maximum → refined hypothesis (uniform sampling) → backprojected hypothesis
Detection Results
• Qualitative performance (UIUC database, 200 cars):
  ‣ recognizes different kinds of cars
Single-Frame Detector: partISM
• Appearance of parts: Implicit Shape Model (ISM) [Leibe, Seemann & Schiele, CVPR 2005]
• Part decomposition and inference: pictorial structures model [Felzenszwalb & Huttenlocher, IJCV 2005]

p(L | E) ∝ p(E | L) p(L)

where E is the image evidence and L = {x_o, x_1, ..., x_8} are the body-part positions (object center x_o and parts x_1, ..., x_8).
Single-Frame Detection
• Detections at equal error rate: HOG vs. 4D-ISM

Single-Frame Detection Results
(TUD pedestrians data, no occlusions)
• partISM clearly outperforms 4D-ISM [Seemann et al., DAGM'06]
• partISM outperforms HOG [Dalal & Triggs, CVPR'05] with much less training data
Overview
1. Single-frame detection
2. Tracklet detection
3. Tracking through occlusion
Tracklet Detection in Short Subsequences
• Given: image evidence over the m frames of an overlapping subsequence, E = [E_1, ..., E_m]
• Want: body positions X_o* = [x_o,1*, ..., x_o,m*] and body configurations Y* = [y_1*, ..., y_m*]
• Posterior over positions and configurations:

p(X_o*, Y* | E) ∝ p(E | X_o*, Y*) p(X_o*) p(Y*)

with a likelihood model p(E | X_o*, Y*) given by partISM, a Gaussian speed model for p(X_o*) (here: constant speed), and a dynamical body model p(Y*) given by an hGPLVM; the sequence is processed in overlapping subsequences.
Modeling Body Dynamics
• Y* is high-dimensional: full body poses in m frames.
• Model the body dynamics using a hierarchical Gaussian process latent variable model (hGPLVM) [Lawrence & Moore, ICML 2007]:

p(Y | Z, θ) = ∏_{i=1}^{D} N(Y_{:,i} | 0, K_Z)

p(Z | T, θ̂) = ∏_{i=1}^{q} N(Z_{:,i} | 0, K_T)

where Y = [y_i ∈ R^D] are the training body configurations, Z = [z_i ∈ R^q] is the latent space, and T = [t_i ∈ R] is the time (frame #).
Modeling Body Dynamics
• Visualization of the hierarchical Gaussian process
Single-Frame Detector vs. Tracklet Detector
• At equal error rate (partISM vs. tracklet detections):
  ‣ fewer false positives
  ‣ more robust detection of partially occluded people
Overview
1. Single-frame detection
2. Tracklet detection
3. Tracking through occlusion
Occlusion Recovery
• Greedily link partial tracks based on:
  ‣ motion & articulation compatibility
  ‣ plus appearance compatibility
Pictorial Structures Revisited: Model Components [CVPR-09]
• Appearance model: local features → AdaBoost → part likelihoods (likelihood of part 1, ..., part N, for orientations 1, ..., K)
• Prior and inference: part posteriors → estimated pose
Likelihood Model
• Build on recent advances in object detection:
  ‣ state-of-the-art image descriptor: shape context [Belongie et al., PAMI'02; Mikolajczyk & Schmid, PAMI'05]
  ‣ dense representation
  ‣ discriminative model: an AdaBoost classifier for each body part
    - shape context: 96 dimensions (4 angular, 3 radial, 8 gradient orientations)
    - feature vector: concatenation of the descriptors inside the part bounding box
      - head: 4032 dimensions
      - torso: 8448 dimensions
Likelihood Model
• Part likelihood derived from the boosting score:

p̃(d_i | l_i) = max( Σ_t α_{i,t} h_t(x(l_i)) / Σ_t α_{i,t} , ε_0 )

where x(l_i) are the features at part location l_i, h_t is the decision stump output, α_{i,t} is the decision stump weight, and ε_0 is a small constant to deal with part occlusions.
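The formula above is just a normalized AdaBoost margin with a floor. A minimal sketch (names are illustrative; the real system evaluates the stumps on dense shape-context features at location l_i):

```python
import numpy as np

def part_likelihood(stump_outputs, stump_weights, eps0=1e-4):
    """Normalised AdaBoost margin as a pseudo-likelihood, following the
    slide's formula: max(sum_t a_t * h_t / sum_t a_t, eps0). The floor
    eps0 keeps an occluded part from zeroing out the whole posterior."""
    score = np.dot(stump_weights, stump_outputs) / np.sum(stump_weights)
    return max(score, eps0)
```

For example, two stumps with equal weight where one fires gives a likelihood of 0.5, while a part whose stumps all output 0 (e.g. because it is occluded) still contributes ε_0 rather than 0.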
Likelihood Model
[Comparison of part likelihoods on an input image for head, torso, and upper leg: [Ramanan, NIPS'06] vs. our part likelihoods]
Bernt Schiele | Part-Based Object and People Detection | Aug 27, 2009 |
Kinematic Tree Prior
• Represent pairwise part relations [Felzenszwalb & Huttenlocher, IJCV'05]:
p(l_2 | l_1) = N( T_12(l_2) | T_21(l_1), Σ_12 )

p(L) = p(l_0) ∏_{(i,j)∈E} p(l_i | l_j)

where T_12 and T_21 transform the part locations l_1 and l_2 into the coordinate frame of the joint connecting them.
Kinematic Tree Prior
• Prior parameters: {T_ij, Σ_ij}
• The parameters of the prior are estimated with maximum likelihood.

Figure 2. (left) Kinematic prior learned on the multi-view and multi-articulation dataset from [15]. The mean part position is shown using blue dots; the covariance of the part relations in the transformed space is shown using red ellipses. (right) Several independent samples from the learned prior (for ease of visualization, given fixed torso position and orientation).
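Samples like those in Figure 2 (right) can be drawn by ancestral sampling down the kinematic tree. The sketch below is a strong simplification of the model above: it samples each child's 2D position from a Gaussian around its parent plus a learned mean offset, ignoring the joint-frame transformations T_ij, orientation, and scale.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_pose(tree_edges, mean_offset, cov, root=0):
    """Toy ancestral sampling from a kinematic tree prior: place the root
    part at the origin, then draw each child from N(parent + offset, cov).
    All structures (edge list, offset/cov dicts) are illustrative."""
    pose = {root: np.zeros(2)}
    for parent, child in tree_edges:            # edges in topological order
        pose[child] = rng.multivariate_normal(
            pose[parent] + mean_offset[(parent, child)],
            cov[(parent, child)])
    return pose

# e.g. a head part roughly 10 units above a torso root
pose = sample_pose([(0, 1)], {(0, 1): np.array([0.0, 10.0])},
                   {(0, 1): 0.01 * np.eye(2)})
```

The learned covariances Σ_ij control how tightly each sampled limb stays near its mean configuration.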
[14], and use AdaBoost [7] to train discriminative part classifiers. Our detectors are evaluated densely and are bootstrapped to improve performance. Strong detectors of that type have been commonplace in the pedestrian detection literature [1, 12, 13, 24]. In these cases, however, the employed body models are often simplistic. A simple star model for representing part articulations is, for example, used in [1], whereas [12] does not use an explicit part representation at all. This precludes the applicability to strongly articulated people, and consequently these approaches have been applied to upright people detection only.
We combine this discriminative appearance model with a generative pictorial structures approach by interpreting the normalized classifier margin as the image evidence that is being generated. As a result, we obtain a generic model for people detection and pose estimation, which not only outperforms recent work in both areas by a large margin, but is also surprisingly simple and allows for exact and efficient inference.
More related work: Besides the already mentioned related work, there is an extensive literature on both people (and pedestrian) detection and articulated pose estimation. A large amount of work has been advocating strong body models, and another substantial set of related work relies on powerful appearance models.
Strong body models have appeared in various forms. A certain focus has been the development of non-tree models. [17] imposes constraints not only between limbs on the same extremity, but also between extremities, and relies on integer programming for inference. Another approach incorporates self-occlusion in a non-tree model [8]. Either approach relies on matching simple line features, and only appears to work on relatively clean backgrounds. In contrast, our method also works well on complex, cluttered backgrounds. [20] also uses non-tree models to improve occlusion handling, but still relies on simple features, such as color. A fully connected graphical model for representing articulations is proposed in [2], which also uses discriminative part detectors. However, the method has several restrictions, such as relying on absolute part orientations, which makes it applicable to people in upright poses only. Moreover, the fully connected graph complicates inference. Other work has focused on discriminative tree models [16, 18], but due to the use of simple features, these methods fall short in terms of performance. [25] proposes a complex hierarchical model for pruning the space of valid articulations, but also relies on relatively simple features. In [5], discriminative training is combined with a strong appearance representation based on HOG features; however, the model is applied to detection only.
Discriminative part models have also been used in conjunction with generative body models, as we do here. [11, 21], for example, use them as proposal distributions ("shouters") for MCMC or nonparametric belief propagation. Our paper, however, directly integrates the part detectors and uses them as the appearance model.
2. Generic Model for People Detection and Pose Estimation
To facilitate reliable detection of people across a wide variety of poses, we follow [4] and assume that the body model is decomposed into a set of parts. Their configuration is denoted as L = {l_0, l_1, ..., l_N}, where the state of part i is given by l_i = (x_i, y_i, θ_i, s_i). Here x_i and y_i are the position of the part center in image coordinates, θ_i is the absolute part orientation, and s_i is the part scale, which we assume to be relative to the size of the part in the training set.
Depending on the task, the number of object parts may vary (see Figs. 2 and 3). For upper-body detection (or pose estimation), we rely on 6 different parts: head, torso, as well as left and right lower and upper arms. In case of full-body detection, we additionally consider 4 lower-body parts: left and right upper and lower legs, resulting in a 10-part model. For pedestrian detection we do not use arms, but add feet, leading to an 8-part model.
Given the image evidence D, the posterior of the part configuration L is modeled as p(L | D) ∝ p(D | L) p(L), where p(D | L) is the likelihood of the image evidence given a particular body part configuration. In the pictorial structures approach, p(L) corresponds to a kinematic tree prior. Here, both these terms are learned from training data, either from generic data or trained more specifically for the application at hand. To make such a seemingly generic and simple approach work well, and to compete with more specialized models on a variety of tasks, it is necessary to carefully pick the appropriate prior p(L) and an appropriate image likelihood p(D | L). In Sec. 2.1, we will first introduce our generative kinematic model p(L), which closely follows the pictorial structures approach [4]. In Sec. 2.2, we will then introduce our discriminatively trained appearance model p(D | L).
Results for Articulated Pose Estimation [CVPR-09]

Figure 7. Comparison of full-body pose estimation results between our approach (top) and [15] (bottom). The numbers on the left of each image indicate the number of correctly localized body parts (e.g. 8/10, 6/10, ...).
Table 2. Full-body pose estimation: comparison of body part detection rates and evaluation of different components of the model on the "Iterative Image Parsing" (IIP) dataset [15]. Numbers give the percentage of correctly detected parts; the total number of part segments is 10 × 205 = 2050. Leg and arm columns list left / right values.

| Method | Torso | Upper leg | Lower leg | Upper arm | Forearm | Head | Total |
| IIP [15], 1st parse (edge features only) | 39.5 | 21.4 / 20 | 23.9 / 17.5 | 13.6 / 11.7 | 12.1 / 11.2 | 21.4 | 19.2 |
| IIP [15], 2nd parse (edge + color feat.) | 52.1 | 30.2 / 31.7 | 27.8 / 30.2 | 17 / 18 | 14.6 / 12.6 | 37.5 | 27.2 |
| Our part detectors | 29.7 | 12.6 / 12.1 | 20 / 17 | 3.4 / 3.9 | 6.3 / 2.4 | 40.9 | 14.8 |
| Our inference, edge features from [15] | 63.4 | 47.3 / 48.7 | 41.4 / 34.14 | 30.2 / 23.4 | 21.4 / 19.5 | 45.3 | 37.5 |
| Our inference, our part detectors | 81.4 | 67.3 / 59 | 63.9 / 46.3 | 47.3 / 47.8 | 31.2 / 32.1 | 75.6 | 55.2 |
in order to model occlusions (cf. [20]). We expect that such additional constraints will further improve the performance and should be explored in future work.
Acknowledgements: The authors are thankful to Krystian Mikolajczyk for the shape context implementation and Christian Wojek for the AdaBoost code and helpful suggestions. Mykhaylo Andriluka gratefully acknowledges a scholarship from DFG GRK 1362 "Cooperative, Adaptive and Responsive Monitoring in Mixed Mode Environments".
References
[1] M. Andriluka, S. Roth, and B. Schiele. People-tracking-by-detection and people-detection-by-tracking. CVPR 2008.
[2] M. Bergtholdt, J. Kappes, S. Schmidt, and C. Schnörr. A study of parts-based object class detection using complete graphs. IJCV, 2009. In press.
[3] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. CVPR 2005.
[4] P. F. Felzenszwalb and D. P. Huttenlocher. Pictorial structures for object recognition. IJCV, 61(1):55–79, 2005.
[5] P. F. Felzenszwalb, D. McAllester, and D. Ramanan. A discriminatively trained, multiscale, deformable part model. CVPR 2008.
[6] V. Ferrari, M. Marin, and A. Zisserman. Progressive search space reduction for human pose estimation. CVPR 2008.
[7] Y. Freund and R. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119–139, 1997.
[8] H. Jiang and D. R. Martin. Global pose estimation using non-tree models. CVPR 2008.
[9] F. R. Kschischang, B. J. Frey, and H.-A. Loeliger. Factor graphs and the sum-product algorithm. IEEE T. Info. Theory, 47(2):498–519, Feb. 2001.
[10] X. Lan and D. P. Huttenlocher. Beyond trees: Common-factor models for 2D human pose recovery. ICCV 2005.
[11] M. W. Lee and I. Cohen. Proposal maps driven MCMC for estimating human body pose in static images. CVPR 2004.
[12] B. Leibe, E. Seemann, and B. Schiele. Pedestrian detection in crowded scenes. CVPR 2005.
[13] K. Mikolajczyk, B. Leibe, and B. Schiele. Multiple object class detection with a generative model. CVPR 2006.
[14] K. Mikolajczyk and C. Schmid. A performance evaluation of local descriptors. PAMI, 27(10):1615–1630, 2005.
[15] D. Ramanan. Learning to parse images of articulated objects. NIPS 2006.
[16] D. Ramanan and C. Sminchisescu. Training deformable models for localization. CVPR 2006.
[17] X. Ren, A. C. Berg, and J. Malik. Recovering human body configurations using pairwise constraints between parts. ICCV 2005.
[18] R. Ronfard, C. Schmid, and B. Triggs. Learning to parse pictures of people. ECCV 2002.
[19] E. Seemann, B. Leibe, K. Mikolajczyk, and B. Schiele. An evaluation of local shape-based features for pedestrian detection. BMVC 2005.
[20] L. Sigal and M. J. Black. Measure locally, reason globally: Occlusion-sensitive articulated pose estimation. CVPR 2006.
[21] L. Sigal and M. J. Black. Predicting 3D people from 2D pictures. AMDO 2006.
[22] Z. Tu, X. Chen, A. L. Yuille, and S.-C. Zhu. Image parsing: Unifying segmentation, detection, and recognition. IJCV, 63(2):113–140, 2005.
[23] R. Urtasun, D. J. Fleet, and P. Fua. 3D people tracking with Gaussian process dynamical models. CVPR 2006.
[24] P. Viola, M. Jones, and D. Snow. Detecting pedestrians using patterns of motion and appearance. ICCV 2003.
[25] J. Zhang, J. Luo, R. Collins, and Y. Liu. Body localization in still images using hierarchical models and hybrid search. CVPR 2006.
(g)
8/10
(h)
0/10
(i)
8/10
(j)
2/10
(k)
7/10
(l)
6/10
7/10
0/10
4/10
6/10
3/10
3/10
Figure 7.
Comparison of full body pose estimation results between our approach (top) and [15] (bottom). The numbers on the left of
each image indicate the number of correctly localized body parts.
Method
Torso
Upper leg
Lower leg
Upper arm
Forearm
Head
Total
IIP [15], 1st parse (edge features only)
39.5
21.4
20
23.9
17.5
13.6
11.7
12.1
11.2
21.4
19.2
IIP [15], 2nd parse (edge + color feat.)
52.1
30.2
31.7
27.8
30.2
17
18
14.6
12.6
37.5
27.2
Our part detectors
29.7
12.6
12.1
20
17
3.4
3.9
6.3
2.4
40.9
14.8
Our inference, edge features from [15]
63.4
47.3
48.7
41.4
34.14
30.2
23.4
21.4
19.5
45.3
37.5
Our inference, our part detectors
81.4
67.3
59
63.9
46.3
47.3
47.8
31.2
32.1
75.6
55.2
Table 2.
Full body pose estimation: Comparison of body part detection rates and evaluation of different components of the model on
the “Iterative Image Parsing” (IIP) dataset [15] (numbers indicate the percentage of the correctly detected parts. The total number of part
segments is 10 × 205 = 2050 ).
in order to model occlusions (c.f . [20]). We expect that such
additional constraints will further improve the performance
and should be explored in future work.
Acknowledgements: The authors are thankful to
Krys-tian Mikolajczyk for the shape context implementation and
Christian Wojek for the AdaBoost code and helpful
sugges-tion. Mykhaylo Andriluka gratefully acknowledges a
schol-arship from DFG GRK 1362 “Cooperative, Adaptive and
Responsive Monitoring in Mixed Mode Environments”.
References
[1] M. Andriluka, S. Roth, and B. Schiele. People-tracking-by-detection and people-detection-by-tracking. CVPR 2008.
[2] M. Bergtholdt, J. Kappes, S. Schmidt, and C. Schn¨orr. A study of parts-based object class detection using complete graphs. IJCV, 2009. In press.
[3] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. CVPR 2005.
[4] P. F. Felzenszwalb and D. P. Huttenlocher. Pictorial structures for object recognition. IJCV, 61(1):55–79, 2005.
[5] P. F. Felzenszwalb, D. McAllester, and D. Ramanan. A discrimina-tively trained, multiscale, deformable part model. CVPR 2008.
[6] V. Ferrari, M. Marin, and A. Zisserman. Progressive search space reduction for human pose estimation. CVPR 2008.
[7] Y. Freund and R. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer
and System Sciences, 55(1):119–139, 1997.
[8] H. Jiang and D. R. Martin. Global pose estimation using non-tree models. CVPR 2008.
[9] F. R. Kschischang, B. J. Frey, and H.-A. Loelinger. Factor graphs and the sum-product algorithm. IEEE T. Info. Theory, 47(2):498– 519, Feb. 2001.
[10] X. Lan and D. P. Huttenlocher. Beyond trees: Common-factor mod-els for 2D human pose recovery. ICCV 2005.
[11] M. W. Lee and I. Cohen. Proposal maps driven MCMC for estimating human body pose in static images. CVPR 2004.
[12] B. Leibe, E. Seemann, and B. Schiele. Pedestrian detection in crowded scenes. CVPR 2005.
[13] K. Mikolajczyk, B. Leibe, and B. Schiele. Multiple object class de-tection with a generative model. CVPR 2006.
[14] K. Mikolajczyk and C. Schmid. A performance evaluation of local descriptors. PAMI, 27(10):1615–1630, 2005.
[15] D. Ramanan. Learning to parse images of articulated objects.
NIPS*2006.
[16] D. Ramanan and C. Sminchisescu. Training deformable models for localization. CVPR 2006.
[17] X. Ren, A. C. Berg, and J. Malik. Recovering human body configu-rations using pairwise constraints between parts. ICCV 2005.
[18] R. Ronfard, C. Schmid, and B. Triggs. Learning to parse pictures of people. ECCV 2002.
[19] E. Seemann, B. Leibe, K. Mikolajczyk, and B. Schiele. An evalu-ation of local shape-based features for pedestrian detection. BMVC
2005.
[20] L. Sigal and M. J. Black. Measure locally, reason globally:
Occlusion-sensitive articulated pose estimation. CVPR 2006.
[21] L. Sigal and M. J. Black. Predicting 3D people from 2D pictures.
AMDO 2006.
[22] Z. Tu, X. Chen, A. L. Yuille, and S.-C. Zhu. Image parsing: Unify-ing segmentation, detection, and recognition. IJCV, 63(2):113–140, 2005.
[23] R. Urtasun, D. J. Fleet, and P. Fua. 3D people tracking with Gaussian process dynamical models. CVPR 2006.
[24] P. Viola, M. Jones, and D. Snow. Detecting pedestrians using patterns of motion and appearance. ICCV 2003.
[25] J. Zhang, J. Luo, R. Collins, and Y. Liu. Body localization in still images using hierarchical models and hybrid search. CVPR 2006. (a) 8/10 (b) 6/10 (c) 3/10 (d) 7/10 (e) 9/10 (f) 2/10 0/10 5/10 4/10 3/10 6/10 1/10
(g)
8/10
(h)
0/10
(i)
8/10
(j)
2/10
(k)
7/10
(l)
6/10
7/10
0/10
4/10
6/10
3/10
3/10
Figure 7.
Comparison of full body pose estimation results between our approach (top) and [15] (bottom). The numbers on the left of
each image indicate the number of correctly localized body parts.
Method
Torso
Upper leg
Lower leg
Upper arm
Forearm
Head
Total
IIP [15], 1st parse (edge features only)
39.5
21.4
20
23.9
17.5
13.6
11.7
12.1
11.2
21.4
19.2
IIP [15], 2nd parse (edge + color feat.)
52.1
30.2
31.7
27.8
30.2
17
18
14.6
12.6
37.5
27.2
Our part detectors
29.7
12.6
12.1
20
17
3.4
3.9
6.3
2.4
40.9
14.8
Our inference, edge features from [15]
63.4
47.3
48.7
41.4
34.14
30.2
23.4
21.4
19.5
45.3
37.5
Our inference, our part detectors
81.4
67.3
59
63.9
46.3
47.3
47.8
31.2
32.1
75.6
55.2
Table 2.
Full body pose estimation: Comparison of body part detection rates and evaluation of different components of the model on
the “Iterative Image Parsing” (IIP) dataset [15] (numbers indicate the percentage of the correctly detected parts. The total number of part
segments is 10 × 205 = 2050 ).
in order to model occlusions (c.f . [20]). We expect that such
additional constraints will further improve the performance
and should be explored in future work.
Acknowledgements: The authors are thankful to
Krys-tian Mikolajczyk for the shape context implementation and
Christian Wojek for the AdaBoost code and helpful
sugges-tion. Mykhaylo Andriluka gratefully acknowledges a
schol-arship from DFG GRK 1362 “Cooperative, Adaptive and
Responsive Monitoring in Mixed Mode Environments”.
References
[1] M. Andriluka, S. Roth, and B. Schiele. People-tracking-by-detection and people-detection-by-tracking. CVPR 2008.
[2] M. Bergtholdt, J. Kappes, S. Schmidt, and C. Schnörr. A study of parts-based object class detection using complete graphs. IJCV, 2009. In press.
[3] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. CVPR 2005.
[4] P. F. Felzenszwalb and D. P. Huttenlocher. Pictorial structures for object recognition. IJCV, 61(1):55–79, 2005.
[5] P. F. Felzenszwalb, D. McAllester, and D. Ramanan. A discriminatively trained, multiscale, deformable part model. CVPR 2008.
[6] V. Ferrari, M. Marin, and A. Zisserman. Progressive search space reduction for human pose estimation. CVPR 2008.
[7] Y. Freund and R. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119–139, 1997.
[8] H. Jiang and D. R. Martin. Global pose estimation using non-tree models. CVPR 2008.
[9] F. R. Kschischang, B. J. Frey, and H.-A. Loeliger. Factor graphs and the sum-product algorithm. IEEE Transactions on Information Theory, 47(2):498–519, Feb. 2001.
[10] X. Lan and D. P. Huttenlocher. Beyond trees: Common-factor models for 2D human pose recovery. ICCV 2005.
[11] M. W. Lee and I. Cohen. Proposal maps driven MCMC for estimating human body pose in static images. CVPR 2004.
[12] B. Leibe, E. Seemann, and B. Schiele. Pedestrian detection in crowded scenes. CVPR 2005.
[13] K. Mikolajczyk, B. Leibe, and B. Schiele. Multiple object class detection with a generative model. CVPR 2006.
[14] K. Mikolajczyk and C. Schmid. A performance evaluation of local descriptors. PAMI, 27(10):1615–1630, 2005.
[15] D. Ramanan. Learning to parse images of articulated objects. NIPS 2006.
[16] D. Ramanan and C. Sminchisescu. Training deformable models for localization. CVPR 2006.
[17] X. Ren, A. C. Berg, and J. Malik. Recovering human body configurations using pairwise constraints between parts. ICCV 2005.
[18] R. Ronfard, C. Schmid, and B. Triggs. Learning to parse pictures of people. ECCV 2002.
[19] E. Seemann, B. Leibe, K. Mikolajczyk, and B. Schiele. An evaluation of local shape-based features for pedestrian detection. BMVC 2005.
[20] L. Sigal and M. J. Black. Measure locally, reason globally: Occlusion-sensitive articulated pose estimation. CVPR 2006.
[21] L. Sigal and M. J. Black. Predicting 3D people from 2D pictures. AMDO 2006.
[22] Z. Tu, X. Chen, A. L. Yuille, and S.-C. Zhu. Image parsing: Unifying segmentation, detection, and recognition. IJCV, 63(2):113–140, 2005.
[23] R. Urtasun, D. J. Fleet, and P. Fua. 3D people tracking with Gaussian process dynamical models. CVPR 2006.
[24] P. Viola, M. Jones, and D. Snow. Detecting pedestrians using patterns of motion and appearance. ICCV 2003.
[25] J. Zhang, J. Luo, R. Collins, and Y. Liu. Body localization in still images using hierarchical models and hybrid search. CVPR 2006.