Signal Processing: Image Communication


A line based pose representation for human action recognition

Sermetcan Baysal*, Pınar Duygulu

Bilkent University, Department of Computer Engineering, 06800 Ankara, Turkey

* Corresponding author. Tel.: +90 532 2563397. E-mail addresses: sermetcan@cs.bilkent.edu.tr (S. Baysal), duygulu@cs.bilkent.edu.tr (P. Duygulu).

Article history: Received 26 February 2012; accepted 23 January 2013; available online 4 February 2013.

Keywords: Human motion; Action recognition; Pose similarity; Pose matching; Line-flow

Abstract

In this paper, we utilize a line based pose representation to recognize human actions in videos. We represent the pose in each frame by employing a collection of line-pairs, so that limb and joint movements are better described and the geometrical relationships among the lines forming the human figure are captured. We contribute to the literature by proposing a new method that matches the line-pairs of two poses to compute the similarity between them. Moreover, to encapsulate the global motion information of a pose sequence, we introduce line-flow histograms, which are extracted by matching line segments in consecutive frames. Experimental results on the Weizmann and KTH datasets emphasize the power of our pose representation, and show the effectiveness of using pose ordering and line-flow histograms together in grasping the nature of an action and distinguishing one action from the others.

© 2013 Elsevier B.V. All rights reserved.

http://dx.doi.org/10.1016/j.image.2013.01.005

1. Introduction

Recognizing and analyzing human actions in videos has been receiving increasing attention from computer vision researchers in both academia and industry.

A reliable and effective solution to this problem is essential for a large variety of applications, such as athletic performance analysis, medical diagnostics and visual surveillance [1].

However, automatically recognizing human actions in videos is challenging since people can perform the same action in different ways with various execution speeds.

Furthermore, recording conditions such as illumination or viewpoint may differ as well.

The human brain can more or less recognize what a person is doing in a video even by looking at a single frame, without examining the whole sequence. From this observation it can be inferred that the human pose encapsulates useful information about the action being performed. In this study we focus on the representation of actions and use the human pose as our primitive representative unit.

Some of the previous studies [4,5,27] attempt to represent the shape of a pose by using human silhouettes.

Although these approaches are robust to variations in the appearance of actors, they require static cameras and a good background model, which may not be possible under realistic conditions [15]. A more severe limitation of such methods is that they ignore limb movements remaining inside the silhouette boundaries; for example, 'standing still' is likely to be confused with 'hand clapping' when the action is performed facing the camera and the hands are in front of the torso.

An alternative shape representation can be established using contour features. Motivated by the work of Ferrari et al. [10], where encouraging results were obtained using line segments as descriptors for object recognition, we represent the shape of a pose as a collection of line segments fitted to the contours of a human figure.

Utilizing only shape information may fail to capture differences between actions with similar pose appearances, such as 'running' and 'jogging'. In such cases, the speed and direction of the movement are important in making a distinction.

In addition to our pose-based action representation, we also extract global line-flow histograms for a pose sequence by matching lines in consecutive frames, in order to identify differences between actions with similar appearances.

The overview of our approach (depicted in Fig. 1) is as follows. For each frame, a Contour Segment Network (CSN) consisting of roughly straight lines is constructed.

Next, noise elimination is applied and the human figure is detected by utilizing the densest area of line segments.

Then an N × N grid structure is placed over the human figure for localization of the line segments. To obtain the global line-flow of a pose sequence, line displacement vectors are extracted for each frame by matching its set of lines with the ones in the previous frame. Then these vectors are represented by a single compact line-flow histogram. Given a sequence of poses, recognition is performed by combining the decisions of separate weighted k-nearest neighbor (k-NN) classifiers for both pose and line-flow features.

In this work, we concentrate on the representation of actions and make two main contributions to the literature.

First, we propose a new matching method¹ between two poses to compute their similarity. Second, we introduce global line-flow to encapsulate motion information for a collection of poses formed by line segments.

2. Related work

Human action recognition has been a widely studied topic in computer vision. In this section, we first give a brief review of recent studies focusing on representation, and then we provide a discussion.

2.1. Review of previous studies

Space-time volumes are utilized for action recognition in the following studies. Blank et al. [4] regard human actions as 3D shapes induced by the silhouettes in the space-time volume. Similarly, Ke et al. [15] segment videos into space-time volumes; however, their spatio-temporal shape-based correlation algorithm does not require background subtraction.

There are a large number of studies which employ space-time interest points (STIP) for action representation. Dollar et al. [7] propose a spatio-temporal interest point detector based on 1D Gabor filters to find local regions of interest in space and time (cuboids) and use histograms of these cuboids to perform action recognition. These linear filters were also applied in [21,25,26] to extract STIP. There are also other studies which use different spatio-temporal interest point detectors. Laptev et al. [17] detect interest points using a space-time extension of the Harris operator. However, instead of performing a scale selection, multiple levels of spatio-temporal scales are extracted. The same STIP detection technique is also adopted by Thi et al. in [34]. They extend the Implicit Shape Model to 3D, enabling them to robustly integrate the set of local features into a global configuration, while still being able to capture local saliency.

Fig. 1. The overview of our approach.

¹ A preliminary version of this matching method was presented in [3] at the International Conference on Pattern Recognition, Istanbul, Turkey, August 2010.

Among the STIP based approaches, Refs. [7,17,20,25] quantize local space-time features to form a visual vocabulary and construct a bag-of-words model to represent a video. However, Kovashka et al. [16] and Ta et al. [33] believe that the orderless bag-of-words lacks cues about motion trajectories, before–after relationships and the spatio-temporal layout of the local features, which may be almost as important as the features themselves. So, Kovashka et al. [16] propose to learn the shapes of space-time feature neighborhoods that are most representative for an action category. Similarly, Ta et al. [33] present pairwise features, which encode both the appearance and the spatio-temporal relations of the local features for action recognition. In contrast to using hand-designed local features for action recognition, Le et al. [18] present an extension of the Independent Subspace Analysis algorithm to learn invariant spatio-temporal features from unlabeled video data.

A group of studies use flow-based techniques, which estimate the optical flow field between adjacent frames to represent actions. In [8], Efros et al. introduce a motion descriptor based on blurred optical flow measurements in a spatio-temporal volume for each stabilized human figure, which describes motion over a local period of time. Wang et al. [38] also use the same motion descriptor for frame representation and represent video sequences with a bag-of-words model. Fathi et al. [9] extend the work of Efros to a 3D spatio-temporal volume. Different from the flow-based studies above, Ahmad et al. [2] represent an action as a set of multi-dimensional combined local-global (CLG) optic flow and shape flow feature vectors in the spatio-temporal action boundary.

Actions are represented by poses in the following studies. Carlsson et al. [6] demonstrate that specific actions can be recognized by matching shape information extracted from individual frames to stored prototypes representing key frames of an action. Following this study and using the same shape matching scheme, which compares edge maps of poses, Loy et al. [23] present a method for automatically extracting key frames from an image sequence. Ikizler et al. [14] propose a bag-of-rectangles method that represents the human body as a collection of rectangular patches and calculates their histograms based on their orientation. Hatun et al. [12] describe the pose in each frame using histogram of gradients (HOG) features obtained from a radial partitioning of the frame. Similarly, Thurau et al. [35] extend the HOG based descriptor to represent pose primitives. In order to include local temporal context, they compute histograms of n-gram instances. Tran et al. [36] propose a generative representation of the motion of human body-parts to learn and classify human actions. They transfer the motion of different human body-parts into polar histograms.

In another group of studies, both shape (pose) and motion (flow) features are combined to represent actions. Ikizler et al. [13] introduce a new shape descriptor based on the distribution of lines fitted to the boundaries of human figures. Poses are represented by employing histograms of lines based on their orientations and spatial locations. Moreover, a dense representation of optical flow and global temporal information is utilized for action recognition. Schindler et al. [30] propose a method that separately extracts local shape, using the responses of Gabor filters at multiple orientations, and dense optic flow from each frame. The shape and flow feature vectors are then merged by simple concatenation before applying SVM classification for action recognition. Lin et al. [19] capture correlations between shape and motion cues by learning action prototype trees in a joint feature space. Their shape descriptor is formed by simply counting the number of foreground pixels, either in silhouettes or in appearance-based likelihoods. Their motion descriptor is an extension of the one introduced by Efros et al. [8], in which background motion components are removed. Shao et al. [32] propose a color based method and a motion based method for human action temporal segmentation under a stationary background condition. They apply a shape-based feature descriptor, Pyramid Correlogram of Oriented Gradients (PCOG), aiming to detect different action classes within the same video sequence.

2.2. Discussion of related studies

The studies of Hatun et al. [12], Ikizler et al. [13,14] and Thurau et al. [35] share the common property of employing histograms to represent the pose information in each frame. However, using histograms for pose representation results in the loss of geometrical information among the components (e.g. lines, rectangles, gradients) forming the pose. For action recognition such a loss is intolerable, since the configuration of the components is crucial in describing the nature of a human action involving limb and joint movements. By representing the pose in a frame as a collection of line-pairs, our work differs from these studies in preserving the geometrical configuration of the lines as the components encapsulated in poses.

In this study, we propose to capture the global motion information in a video by tracking line displacements across adjacent frames, which can be compared to the optical flow representations in [2,8,9,38]. Although optical flow often serves as a good approximation of the true physical motion projected onto the image plane, in practice its computation is susceptible to noise and illumination changes, as stated in [37]. Lines are less affected by variations in the appearance of actors and are easier to track than lower-level features such as color/intensity changes. Thus, we believe that line-flow can be a good alternative to optical flow.

3. Pose extraction

Before presenting our proposed pose matching method and line-flow histograms, we first give the details of our line-based pose extraction in this section. Given an action sequence, the pose in each frame is extracted as follows (depicted in Fig. 2):

1. The global probability of boundaries (GPB), presented by Maire et al. as a high-performance detector for contours in natural images (see [24] for details), is computed to extract the edges of the human figure in a frame.

2. To eliminate the effect of noise caused by short and/or weak edges, hysteresis thresholding is applied to obtain a binary image consisting of edge pixels (edgels).

3. Edgels are chained by using closeness and orientation information. The edgel-chains are partitioned into roughly straight contour segments. This chained structure is used to construct a contour segment network (CSN).

4. The CSN is represented by the scale-invariant k-Adjacent Segments (kAS) descriptor, introduced by Ferrari et al. in [10], which encodes the geometric configuration of the segments.

As defined in [10], the segments in a kAS form a path of length k through the CSN. Two segments are considered connected in the CSN when they are adjacent along some object contour, even if there is a small gap separating them physically. More complex structures can be captured as k increases in a kAS: 1AS are just individual lines, 2AS include L-shapes and 3AS can form C, F and Z shapes.

Human pose, especially limb and joint movements, can be better described by using L-shapes. Therefore, in our work we select k = 2 and refer to 2AS features as line-pairs. Example line-pairs can be seen in Fig. 2(d). As in [10], each line-pair consisting of line segments s_1 and s_2 is represented with the following descriptor:

\[
V_{\mathrm{linepair}} = \left( \frac{r_{x2}}{N_d},\; \frac{r_{y2}}{N_d},\; \theta_1,\; \theta_2,\; \frac{l_1}{N_d},\; \frac{l_2}{N_d} \right) \tag{1}
\]

where r_2 = (r_{x2}, r_{y2}) is the vector going from the midpoint of s_1 to the midpoint of s_2, θ_i is the orientation and l_i = ‖s_i‖ is the length of s_i (i = 1, 2). N_d is the distance between the two midpoints, which is used as the normalization factor. The center of the two midpoints (the center of the vector r_2) is used as the coordinates of the line-pair on the image.

Fig. 2. This figure illustrates the steps of pose extraction. Given any frame (a), GPB is computed to extract the contours (b). Then hysteresis thresholding is applied to obtain a binary image consisting of edge pixels (edgels) (c). Next, edgel-chains are partitioned into roughly straight contour segments forming the CSN (d). Finally, the CSN is represented by the kAS descriptor.
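For concreteness, a minimal Python/NumPy sketch (not the authors' code) of how the descriptor in Eq. (1) could be computed from two line segments, each given by its two endpoints; the function and variable names are illustrative assumptions.

```python
import numpy as np

def linepair_descriptor(s1, s2):
    """Sketch of the line-pair (2AS) descriptor of Eq. (1).

    s1, s2: line segments as 2x2 arrays [[x_start, y_start], [x_end, y_end]].
    Returns the 6-D descriptor and the line-pair coordinates on the image.
    """
    s1, s2 = np.asarray(s1, float), np.asarray(s2, float)
    m1, m2 = s1.mean(axis=0), s2.mean(axis=0)           # segment midpoints
    r2 = m2 - m1                                        # vector from midpoint of s1 to midpoint of s2
    Nd = np.linalg.norm(r2) + 1e-12                     # normalization factor
    theta1 = np.arctan2(s1[1, 1] - s1[0, 1], s1[1, 0] - s1[0, 0])
    theta2 = np.arctan2(s2[1, 1] - s2[0, 1], s2[1, 0] - s2[0, 0])
    l1 = np.linalg.norm(s1[1] - s1[0])
    l2 = np.linalg.norm(s2[1] - s2[0])
    descriptor = np.array([r2[0] / Nd, r2[1] / Nd, theta1, theta2, l1 / Nd, l2 / Nd])
    center = (m1 + m2) / 2.0                            # center of the two midpoints
    return descriptor, center
```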

3.1. Noise elimination

Fig. 3. This figure illustrates the steps of noise elimination. Notice that after the pose extraction steps, the CSN contains erroneous line segments that do not belong to the human figure (a). So in (b), edge_img is projected onto the x- and y-axes to form a bounding box around the densest area of line segments in csn_img. Line segments that remain outside the bounding box are eliminated from the CSN (c).

Under realistic conditions (varying illumination, cluttered backgrounds, reflection of shadows, etc.), the edge detection results may contain erroneous line segments that do not belong to the human figure. Assuming that the densest area of line segments in the CSN contains the human figure, the following noise elimination steps are applied after pose extraction (depicted in Fig. 3):

1. Project edge_img onto the x-axis. Then calculate the area under each separate curve peak. Set x_1 and x_2 to be the boundaries of the isolated curve peak with the largest area.

2. Project edge_img onto the y-axis. Then calculate the projected length of each separate curve peak on the y-axis. Set y_1 and y_2 to be the boundaries of the longest isolated curve peak.

3. Place a bounding box on csn_img with (x_1, y_1) and (x_2, y_2) being its upper-left and lower-right corner coordinates, respectively.

4. Recall that csn_img contains a set of line segments such that csn_img = {l_1, l_2, ..., l_n}. Eliminate a line segment l_i ∈ csn_img from the CSN if the coordinates of its center are not in the bounding box.
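A rough sketch of this projection-based bounding box, assuming edge_img is a binary edge image and the line segments carry precomputed centers; the peak handling (largest/longest contiguous non-zero run of the projection) is a simplifying assumption, not the authors' exact implementation.

```python
import numpy as np

def _runs(profile):
    """Return (start, end) index pairs of contiguous non-zero runs in a 1-D profile."""
    nz = profile > 0
    edges = np.diff(nz.astype(int))
    starts = list(np.where(edges == 1)[0] + 1)
    ends = list(np.where(edges == -1)[0] + 1)
    if nz[0]:
        starts = [0] + starts
    if nz[-1]:
        ends = ends + [len(profile)]
    return list(zip(starts, ends))

def noise_elimination(edge_img, segments):
    """Keep only line segments whose centers fall inside the densest-edge bounding box.

    edge_img: 2-D binary array; segments: list of dicts with a 'center' = (x, y) entry.
    """
    proj_x = edge_img.sum(axis=0)          # projection onto the x-axis (per column)
    proj_y = edge_img.sum(axis=1)          # projection onto the y-axis (per row)
    # x-range: the run with the largest area under the projection curve
    x1, x2 = max(_runs(proj_x), key=lambda r: proj_x[r[0]:r[1]].sum())
    # y-range: the longest run (largest projected length)
    y1, y2 = max(_runs(proj_y), key=lambda r: r[1] - r[0])
    kept = [s for s in segments
            if x1 <= s['center'][0] <= x2 and y1 <= s['center'][1] <= y2]
    return kept, (x1, y1, x2, y2)
```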

3.2. Spatial binning

The descriptor presented in [10] (Eq. (1)) encodes the scale, orientation and length of the line-pairs, but it lacks positional information. Therefore, in order to capture the spatial locations of the line-pairs, the human figure is first cropped from the frame using the bounding box previously formed in the noise elimination process. Then, to be used in the later stages, the human figure is divided into equal-sized spatial bins forming an N × N grid structure. Finally, each line-pair is assigned to a specific bin depending on its coordinates.
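A small illustrative helper (assumed names, not the authors' code) that maps a line-pair center inside the bounding box to one of the N × N bins:

```python
def spatial_bin(center, bbox, N=3):
    """Map a line-pair center (x, y) inside bbox = (x1, y1, x2, y2) to a bin index in [0, N*N)."""
    x1, y1, x2, y2 = bbox
    col = min(int((center[0] - x1) / (x2 - x1 + 1e-12) * N), N - 1)
    row = min(int((center[1] - y1) / (y2 - y1 + 1e-12) * N), N - 1)
    return row * N + col
```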

4. Finding similarity between poses

Recall that the pose in each frame is represented by a set of line-pair descriptors. The similarity between two line-pair descriptors v_a and v_b is computed by the following formula, as suggested in [10]:

\[
d_{\mathrm{linepair}}(a,b) = w_r \,\lVert \mathbf{r}_2^a - \mathbf{r}_2^b \rVert + w_\theta \sum_{i=1}^{2} D_\theta(\theta_i^a, \theta_i^b) + \sum_{i=1}^{2} \left| \log\!\left( l_i^a / l_i^b \right) \right| \tag{2}
\]

where the first term is the difference in the relative location of the line-pairs, the second term measures the orientation difference of the line-pairs and the last term accounts for the difference in lengths. The weights of the terms are w_r = 4 and w_θ = 2. Note that Eq. (2), proposed in [10], computes the similarity only between two individual line-pairs. However, we need to compare two poses. Therefore, in this paper, we introduce a method to find the similarity between two poses consisting of multiple line-pairs.
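A minimal sketch of the distance in Eq. (2) between two 6-D descriptors as produced above; taking D_θ as the smallest absolute angle difference is an assumption about the exact implementation.

```python
import numpy as np

def theta_diff(a, b):
    """Smallest absolute difference between two orientations (radians)."""
    d = np.abs(a - b) % (2 * np.pi)
    return min(d, 2 * np.pi - d)

def linepair_distance(va, vb, w_r=4.0, w_theta=2.0):
    """Distance of Eq. (2) between two 6-D line-pair descriptors va and vb."""
    va, vb = np.asarray(va, float), np.asarray(vb, float)
    loc = np.linalg.norm(va[:2] - vb[:2])                                  # relative-location term
    ori = sum(theta_diff(va[2 + i], vb[2 + i]) for i in range(2))          # orientation term
    length = sum(abs(np.log(va[4 + i] / vb[4 + i])) for i in range(2))     # length term
    return w_r * loc + w_theta * ori + length
```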

4.1. Pose matching

To compute a similarity value between two poses, we first need to find a correspondence between their line-pairs. Any two poses consisting of multiple line-pair descriptors can mathematically be thought of as two sets X and Y with different cardinalities. We seek a 'one-to-one' match between two subsets X′ ⊂ X and Y′ ⊂ Y, so that an element in X′ is associated with exactly one element in Y′. For instance, x_i and y_j are matched if and only if g(x_i) = y_j and h(y_j) = x_i, where g: X′ → Y′, h: Y′ → X′, x_i ∈ X′, y_j ∈ Y′.

To describe our pose matching mechanism more formally, let f_1 and f_2 be two poses having sets of line-pair descriptors V_1 = {v_1^1, v_1^2, ..., v_1^n} and V_2 = {v_2^1, v_2^2, ..., v_2^m}, where n and m are the numbers of line-pair descriptors in V_1 and V_2, respectively. We compare each line-pair descriptor v_1^i ∈ V_1 with each line-pair descriptor v_2^j ∈ V_2 to find matching line-pairs. v_1^i and v_2^j are matched if and only if, among the descriptors in V_2, v_2^j has the minimum distance to v_1^i and, among the descriptors in V_1, v_1^i has the minimum distance to v_2^j. To include location information, we apply a constraint in which matching is allowed only between line-pairs within the same spatial bin.

As an output of our pose matching method, two matrices D and M of size n × m are generated. The distance matrix D stores the similarity of each line-pair in f_1 to each line-pair in f_2, where D(i, j) indicates the similarity value between v_1^i and v_2^j. The match matrix M is a binary matrix, where M(i, j) = 1 indicates that the i-th line-pair in f_1 and the j-th line-pair in f_2 are matched. These matrices are utilized when an overall similarity distance between two poses is calculated.
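A sketch of this mutual-nearest-neighbour matching, restricted to line-pairs in the same spatial bin; the descriptor/bin containers are illustrative assumptions.

```python
import numpy as np

def match_poses(pose1, pose2, distance_fn):
    """Return distance matrix D and binary match matrix M between two poses.

    pose1, pose2: lists of (descriptor, bin_index) tuples, one per line-pair.
    distance_fn: e.g. linepair_distance from Eq. (2).
    """
    n, m = len(pose1), len(pose2)
    D = np.full((n, m), np.inf)
    for i, (vi, bi) in enumerate(pose1):
        for j, (vj, bj) in enumerate(pose2):
            if bi == bj:                       # only compare line-pairs in the same spatial bin
                D[i, j] = distance_fn(vi, vj)
    M = np.zeros((n, m), dtype=bool)
    for i in range(n):
        for j in range(m):
            if np.isfinite(D[i, j]) and j == np.argmin(D[i]) and i == np.argmin(D[:, j]):
                M[i, j] = True                 # mutual nearest neighbours are matched
    return D, M
```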

4.2. Calculating a similarity value

Having established a correspondence between poses f_1 and f_2 by matching their line-pairs (as shown in Fig. 4), we now need to express this correspondence numerically.

Fig. 4. This figure illustrates the matched line-pairs in two frames having similar poses.

The first approach would be to take the average of the matched line-pair distances. This can be calculated by utilizing the matrices D and M as follows:

\[
\mathrm{sim}_1(f_1, f_2) = \frac{\mathrm{sum}(D \wedge M)}{\lvert \mathrm{match}(f_1, f_2) \rvert} \tag{3}
\]

where sum(D ∧ M) is the sum of the distances between matched line-pairs and |match(f_1, f_2)| is the number of matched line-pairs between f_1 and f_2.

The function sim_1 calculates a 'weak' similarity value between f_1 and f_2, since it utilizes the distances between only the matched line-pairs. However, poses of distinct actions may be very similar, differing only in the configuration of a single limb (see Fig. 5). To compute a 'stronger' similarity value, the unmatched line-pairs in both f_1 and f_2 should also be utilized. Thus, we present another similarity calculation function, sim_2, which assumes that a perfect match between sets X and Y is established when both sets have an equal number of elements and both the 'one-to-one' and 'onto' set properties are satisfied, so that each element in X is associated with exactly one element in Y. The function sim_2 calculates the overall similarity distance by penalizing unmatched line-pairs in the frame having the greater number of elements, as follows:

\[
\mathrm{sim}_2(f_1, f_2) = \frac{\mathrm{sum}(D \wedge M) + p \cdot \left( \max(m, n) - \lvert \mathrm{match}(f_1, f_2) \rvert \right)}{\max(m, n)} \tag{4}
\]

where p = mean(D ∧ ¬M) is the penalty value denoting the average dissimilarity between the two poses, max(m, n) is the maximum of m and n, and m and n are the numbers of line-pairs in f_1 and f_2, respectively. The penalty value p is computed by excluding the matched line-pair values in the distance matrix D, bitwise-ANDing it with the complement of the match matrix M and taking the average of the positive values. This is simply the average of the pairwise similarity values of all the unmatched line-pairs. The relative performance of sim_1 and sim_2 will be evaluated in Section 7.
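A sketch of the two similarity functions in Eqs. (3) and (4), operating on the D and M matrices above; excluding infinite (cross-bin) entries from the penalty average is our implementation assumption.

```python
import numpy as np

def sim1(D, M):
    """Eq. (3): average distance over matched line-pairs only."""
    n_match = M.sum()
    return D[M].sum() / n_match if n_match else np.inf

def sim2(D, M):
    """Eq. (4): matched distances plus a penalty p for every unmatched line-pair
    in the frame having more line-pairs."""
    n, m = D.shape
    n_match = M.sum()
    unmatched = (~M) & np.isfinite(D)          # comparable but unmatched pairs
    p = D[unmatched].mean() if unmatched.any() else 0.0
    return (D[M].sum() + p * (max(m, n) - n_match)) / max(m, n)
```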

5. Line-flow extraction

By utilizing only shape information, it is sometimes difficult to distinguish actions having similar poses, such as jogging and running. In such cases, the speed of the transition from one pose to the next is crucial in distinguishing actions. In our work, we characterize this transition by extracting the global flow of lines throughout an action sequence. We have experimentally found that the flow of lines describes motion patterns better than the flow of line-pairs.

Given an action sequence, consecutive frames are compared to find matching lines. The same pose matching method as in Section 4.1 is applied; however, this time lines are matched instead of line-pairs. To do so, Eq. (2) is modified as follows to compute a distance between two line segments:

\[
d_{\mathrm{line}}(a, b) = w_\theta \, D_\theta(\theta_a, \theta_b) + \lvert \log( l_a / l_b ) \rvert \tag{5}
\]

where the first term is the orientation difference of the lines and the second term accounts for the difference in lengths. The weighting coefficient is w_θ = 2.

As depicted in Fig. 6, after finding matches between consecutive frames, the displacement of each matched line with respect to the previous frame is represented by a line-flow vector F. This vector is then separated into four non-negative components, F = {F_{x+}, F_{x-}, F_{y+}, F_{y-}}, representing its magnitudes when projected onto the x+, x−, y+ and y− axes of the xy-plane. For each j-th spatial bin, where j ∈ {1, ..., N × N}, we define the line-flow histogram h_j(i) as follows:

\[
h_j(i) = \sum_{k \in B_j} \vec{F}_k \tag{6}
\]

where F_k represents a line-flow vector in spatial bin j, B_j is the set of flow vectors in spatial bin j and i ∈ {1, ..., n}, where n is the number of frames in the action sequence. To obtain a single line-flow histogram h(i) for the i-th frame, we concatenate the line-flow histograms h_j of all spatial bins.
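A sketch of how the per-frame line-flow histogram of Eq. (6) could be assembled from matched line displacements; taking the displacement of a matched line as the difference of its segment midpoints is our reading of Fig. 6, not a confirmed detail.

```python
import numpy as np

def lineflow_histogram(displacements, bins, N=3):
    """Build the concatenated line-flow histogram h(i) for one frame.

    displacements: list of 2-D flow vectors (dx, dy), one per matched line.
    bins: spatial bin index of each matched line (0 .. N*N-1).
    Returns a vector of length 4 * N * N (components x+, x-, y+, y- per bin).
    """
    h = np.zeros((N * N, 4))
    for (dx, dy), b in zip(displacements, bins):
        # split the flow vector into its four non-negative components
        h[b] += [max(dx, 0.0), max(-dx, 0.0), max(dy, 0.0), max(-dy, 0.0)]
    return h.ravel()                            # concatenate the per-bin histograms
```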

6. Recognizing actions

Given the details of our feature extraction steps in the previous sections, we now describe our action recognition methods in the following subsections.

6.1. Using pose ordering

In this classification method, recognition is performed by comparing two action sequences and finding a correspondence between their pose orderings. However, comparing two pose sequences is not straightforward, since actions can be performed with various speeds and periods, resulting in sequences of different lengths. Therefore, we first align the two sequences by means of Dynamic Time Warping (DTW) [28] and then utilize the distances between aligned poses to derive an overall similarity.

Fig. 5. This figure illustrates matched line-pairs in similar (a) and slightly different (b) poses. Red lines (straight) denote the matched line-pairs common in both (a) and (b). Blue lines (dashed) indicate that these line-pairs are only matched in (a). sim_1(f_1, f_2) is calculated by taking the average of the red and blue lines (assuming that they represent a distance value between matching line-pairs), and sim_1(f_2, f_3) is calculated by averaging only the red lines. Since the red lines are common in both scenarios, the similarity distance in (a) may be very close to or even greater than (b), depending on the distances represented by the blue lines. Therefore, unmatched line-pairs, shown by blue dots in (b), should be utilized to produce a 'stronger' similarity distance. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

Fig. 6. This figure illustrates the extraction of line-flow vectors and histograms for a single frame. Given an action sequence, the i-th frame is matched with the previous (i−1)-th frame. Line-flow vectors (in green) show the displacement of matched lines with respect to the previous frame. Each line-flow vector is then separated into four non-negative components. We employ a histogram for each spatial bin to represent these line-flow vectors. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

DTW is an algorithm to compare time series and find the optimal alignment between them by means of dynamic programming. As formalized in [29], given two action sequences A = a_1, a_2, ..., a_i, ..., a_{|A|} and B = b_1, b_2, ..., b_j, ..., b_{|B|} of lengths |A| and |B|, DTW constructs a warp path W = w_1, w_2, ..., w_K (depicted in Fig. 7), where K is the length of the warp path and w_k = (i, j) is the k-th element of the warp path, indicating that the i-th element of A and the j-th element of B are aligned. Using the aligned poses, the distance between two action sequences A and B is calculated as follows:

\[
\mathrm{Dist}_{\mathrm{DTW}}(A, B) = \frac{\sum_{k=1}^{K} \mathrm{dist}(w_{ki}, w_{kj})}{K} \tag{7}
\]

where dist(w_{ki}, w_{kj}) is the distance between the two frames a_i ∈ A and b_j ∈ B that are aligned at the k-th index of the warp path, calculated using our pose matching function. Refer to [29] for the details of finding the minimum-distance warp path using a dynamic programming approach.
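A compact dynamic-programming sketch of the DTW alignment and the distance of Eq. (7); a practical implementation would follow FastDTW [29], so this quadratic version is only illustrative.

```python
import numpy as np

def dtw_distance(seq_a, seq_b, frame_dist):
    """Align two pose sequences with DTW and return Dist_DTW of Eq. (7).

    seq_a, seq_b: lists of per-frame pose representations.
    frame_dist: pose similarity function applied to two frames, e.g. sim2.
    """
    la, lb = len(seq_a), len(seq_b)
    cost = np.full((la + 1, lb + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, la + 1):
        for j in range(1, lb + 1):
            d = frame_dist(seq_a[i - 1], seq_b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    # backtrack to recover the warp path and its length K
    i, j, K, total = la, lb, 0, 0.0
    while i > 0 and j > 0:
        total += frame_dist(seq_a[i - 1], seq_b[j - 1])
        K += 1
        step = np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return total / K
```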

We use a weighted k-NN classifier, which assigns a given test pose sequence to the class most common amongst its k nearest training pose sequences, using Dist_DTW (Eq. (7)) as its distance metric. In addition, we weight the contributions of the neighbors by 1/d, where d is the distance to the test sequence, so that nearer neighbors contribute more to the decision than distant ones. We denote this classifier as c_pose, to be used in Section 6.3.

6.2. Using global line-flow histograms

In Section 5, the extraction of a line-flow histogram h(i) for a single frame was shown. In order to represent a video, we simply sum the line-flow histograms of all frames to form a single compact representation of the entire action sequence consisting of n frames, as follows:

\[
H = \sum_{i=1}^{n} h(i) \tag{8}
\]

We compute the flow similarity between two action sequences A and B by comparing their global line-flow histograms H_a and H_b using the chi-square distance χ², as follows:

\[
\chi^2(A, B) = \frac{1}{2} \sum_{j} \frac{\left( H_a(j) - H_b(j) \right)^2}{H_a(j) + H_b(j)} \tag{9}
\]

where j ∈ {1, 2, ..., k} and k is the number of bins in the histogram.

In order to classify a given pose sequence, we employ a weighted k-NN classifier (as in Section 6.1) which uses χ² (Eq. (9)) as its distance metric. This classifier is denoted as c_flow, to be used in Section 6.3. The global line-flow of different actions can be seen in Fig. 8.
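A sketch of the χ² distance of Eq. (9) and the 1/d-weighted k-NN voting used by both c_pose and c_flow; normalizing the votes into a probability-like decision vector is our assumption about the decision vectors used in Section 6.3.

```python
import numpy as np

def chi_square(Ha, Hb, eps=1e-12):
    """Eq. (9): chi-square distance between two global line-flow histograms."""
    Ha, Hb = np.asarray(Ha, float), np.asarray(Hb, float)
    return 0.5 * np.sum((Ha - Hb) ** 2 / (Ha + Hb + eps))

def weighted_knn_decision(test_dists, train_labels, n_classes, k=5, eps=1e-12):
    """Return a normalized decision vector from distances to all training sequences.

    Each of the k nearest neighbours votes for its class with weight 1/d.
    """
    test_dists = np.asarray(test_dists, float)
    order = np.argsort(test_dists)[:k]
    votes = np.zeros(n_classes)
    for idx in order:
        votes[train_labels[idx]] += 1.0 / (test_dists[idx] + eps)
    return votes / (votes.sum() + eps)          # decision vector with entries in [0, 1]
```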

6.3. Using combination of pose ordering and line-flow

In the previous sections, two action recognition methods were introduced. The first one utilizes the pose ordering of an action sequence and the second one captures global motion cues by using line-flow histograms. These two methods are combined in this final classification scheme, in order to overcome the limitations of either shape-based or flow-based features alone and achieve a higher accuracy.

Fig. 7. This figure illustrates the alignment of two action sequences. The frame-to-frame similarity matrix of two actions can be seen on the left. Brighter pixels indicate smaller similarity distances (more similar frames). The 'blue line' overlaid on the matrix indicates the warp path obtained by DTW. The frame correspondence based on the alignment path is shown on the right. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

Fig. 8. This figure illustrates the global line-flow of different actions (from left to right): bend, jumping jack, jump in place, running and walking. Notice that the line-flow vectors are in different orientations in different spatial locations, so that 'bend', 'jumping jack' and 'jump in place' can be easily distinguished. Although the global line-flow of 'running' and 'walking' may seem similar, notice the difference in the density of the lines. Sparser lines represent faster motion, whereas dense lines represent actions with slower motion.

To classify a given pose sequence, we employ the decision vectors d_pose and d_flow, generated by the weighted k-NN classifiers c_pose (see Section 6.1) and c_flow (see Section 6.2), respectively. Each decision vector is normalized such that d(i) ∈ [0, 1] for all i ∈ {1, 2, ..., n}, where d(i) is the probability of the test sequence belonging to the i-th class and n is the number of classes. We combine the two normalized decision vectors d_pose and d_flow using a simple linear weighting scheme to obtain the final decision vector d_combined as follows:

\[
\vec{d}_{\mathrm{combined}} = \alpha \, \vec{d}_{\mathrm{pose}} + (1 - \alpha) \, \vec{d}_{\mathrm{flow}} \tag{10}
\]

where α is the weighting coefficient of the decision vectors. It determines the relative influence of the pose (shape) and line-flow (motion) features on the final classification. Finally, the test pose sequence is assigned to the class having the highest probability value in the combined decision vector d_combined. The effect of choosing α will be evaluated in Section 7.
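A one-line sketch of the late fusion in Eq. (10), applied to the two decision vectors produced by the classifiers above.

```python
import numpy as np

def combine_decisions(d_pose, d_flow, alpha=0.5):
    """Eq. (10): linear combination of the pose and line-flow decision vectors."""
    d_combined = alpha * np.asarray(d_pose) + (1.0 - alpha) * np.asarray(d_flow)
    return int(np.argmax(d_combined)), d_combined   # predicted class index and fused vector
```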

7. Experiments

In this section we evaluate the performance of our approach. First we introduce the state-of-the-art action recognition datasets. Then we give the details of our experiments and results. Finally, we compare our results to related studies and provide a discussion.

7.1. Datasets

In our experiments, we evaluate our method on the Weizmann and KTH datasets, which are currently considered the benchmark datasets for single-view action recognition. We adopt leave-one-out cross validation as our experimental setup on all the datasets, in order to compare our performance fairly and completely with other studies, as recommended in [11].

7.1.1. Weizmann dataset

This single-view dataset was introduced by Blank et al. in [4], containing 10 actions performed by nine different actors. We use the same set of nine actions for our experiments as in [4]: bend, jumping jack (jack), jump forward (jump), jump in place (pjump), run, gallop sideways (side), walk, one-hand wave (wave1) and two-hands wave (wave2). Example frames are shown in Fig. 9. For this dataset we used the available silhouettes, which were obtained using background subtraction, and applied Canny edge detection to extract the edges. So we start our pose extraction process (see Section 3) from step 3.

7.1.2. KTH dataset

This dataset was introduced by Schuldt et al. in [31]. It contains six actions: boxing, hand clapping, hand waving, jogging, running and walking. Each action is performed by 25 subjects in four different shooting conditions: outdoor recordings with a stable camera (sc1), outdoor recordings with camera zoom effects and different viewpoints (sc2), outdoor recordings in which the actors wear different outfits and carry items (sc3), and indoor recordings with illumination changes and shadow effects (sc4). Example frames are shown in Fig. 9. KTH is considered a more challenging dataset than Weizmann due to its different, realistic shooting conditions. In addition, it contains two similar actions: jogging and running. In this dataset, the length of a video generally exceeds 100 frames and actions are performed multiple times in a video. In order to reduce the extensive computational cost, we trim the action sequences to 20–50 frames for our experiments, so that an action in a video is performed only once or twice. Note that since the global line-flow histogram is calculated from all the frames in an action video, the number of frames influences the results, so trimming action sequences to a specific number of repetitions results in better performance of the global line-flow approach. Although the action sequences were segmented manually in our experiments, this can easily be automated. DTW already contains the required information, since it seeks an alignment between two pose sequences; we can first apply DTW to detect the matching subsequence and then extract the line-flow using only those poses.

7.2. Experimental results

In this section, we present the experimental results evaluating our approach in recognizing human actions. First, the effect of applying spatial binning is examined (Section 7.2.1). Then, pose matching is evaluated and the optimal configuration of our pose similarity calculation function is found (Section 7.2.2). Next, pose and flow features are evaluated (Section 7.2.3), and the effect of applying noise elimination is discussed (Section 7.2.4). Afterwards, regarding classification, the weighting between pose ordering and line-flow is examined (Section 7.2.5). Finally, the computational cost of our approach is addressed (Section 7.2.6).

7.2.1. Evaluation of spatial binning

Recall that in Section 3.2 we place an N × N imaginary grid structure over the human figure in order to capture the locations of line segments in a frame. The choice of N is important, because in our pose matching method we only allow matching between line-pairs within the same spatial bin. Similarly, during line-flow extraction between consecutive frames, lines are required to be in the same spatial bin in order to be matched. More importantly, since a line-flow histogram is extracted for each spatial bin, the choice of N directly affects the size of the global line-flow feature vector.

Table 1 compares the use of different-sized grid structures. The worst results are obtained when N = 1, which means that no spatial binning is used and matching is allowed between lines or line-pairs located anywhere in the frame. N = 2 gives better results compared to no spatial binning, and the best results are obtained when a 3 × 3 grid structure is placed over the human figure. This justifies that the spatial locations of the line-pairs provide useful clues when comparing poses. Regarding line-flow, we can infer from the results that using spatial binning and histogramming the line-flow in each spatial bin better describes the local motion of separate body parts.

Fig. 9. Example frames are shown from different datasets. Top row: Weizmann; bottom row: KTH.

7.2.2. Evaluation of pose similarity calculation function

In addition to our classification methods, in order to experimentally evaluate our pose matching mechanism, we employ a single pose (SP) based classification scheme. This experiment discards the order of poses and performs classification based on the individual votes of each frame. Therefore, the performance of this method directly depends on the accuracy of our pose matching.

Given a sequence of images A = {a_1, a_2, ..., a_n} to be classified as one of the available classes C = {c_1, c_2, ..., c_m}, we calculate the similarity distance d_i(j) of each frame a_i ∈ A to each class c_j ∈ C by finding the most similar training frame from class c_j (depicted in Fig. 10). In order to classify A, we seek the class having the smallest average distance, where the average distance to each class c_j ∈ C is computed as follows:

\[
D(j) = \frac{\sum_{i=1}^{n} d_i(j)}{n} \tag{11}
\]
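A sketch of this single pose (SP) scheme, Eq. (11): every test frame is compared against all training frames of each class, and the class with the smallest average distance wins; the data layout is an assumption.

```python
def single_pose_classify(test_frames, train_frames_by_class, frame_dist):
    """Classify a sequence by averaging per-frame minimum distances to each class.

    test_frames: list of per-frame pose representations.
    train_frames_by_class: dict {class_label: list of training frames}.
    frame_dist: pose similarity function, e.g. sim2 applied to two frames.
    """
    avg_dist = {}
    for label, train_frames in train_frames_by_class.items():
        per_frame = [min(frame_dist(f, t) for t in train_frames) for f in test_frames]
        avg_dist[label] = sum(per_frame) / len(per_frame)   # D(j) of Eq. (11)
    return min(avg_dist, key=avg_dist.get)
```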

Observing the results of single pose based classification in Table 1, we can say that it achieves acceptable results on both datasets, considering that the ordering of the poses is totally discarded in this classification method. This demonstrates the power of the pose matching mechanism, since the performance of this method mainly depends on the accuracy of our pose similarity function. A higher accuracy is obtained by sim_2 because of its strict constraints on pose matching, which results in a 'stronger' function. More importantly, when comparing test poses to the stored templates, there is always a frame obeying these strict constraints, since the single pose based classification seeks a matching pose within the set of all training frames.

After finding a correspondence between two poses by matching their line-pairs, two pose similarity calculation functions were introduced in Section 4.2 in order to calculate an overall similarity between the frames. Recall that sim_1 utilizes only the distances between matching line-pairs, whereas sim_2 also penalizes the unmatched line-pairs. Table 1 compares the relative performances of these functions.

When the pose ordering classification is used, notice that the accuracy of sim_1 significantly increases for both datasets, whereas sim_2 is about the same for Weizmann but decreases below sim_1 for the KTH dataset. First of all, the increase in the accuracy of sim_1 shows the importance of including the ordering of poses in action recognition. Regarding the performance of sim_2, we can say that since the data is 'clean' in the Weizmann dataset, similar pose sequences can still be found under strict matching constraints, which slightly increases the accuracy. However, the accuracy of sim_2 drops below sim_1 for the KTH dataset. This means that requiring strict matching constraints when comparing two poses in a 'noisy' dataset results in the addition of an unrealistic penalty, due to the high number of unmatched line-pairs that do not actually belong to the human figure.

In summary, sim_2 is more accurate when the edges of the human figure are successfully extracted, and at classifying individual poses when pose ordering is not available. However, it is wiser to employ sim_1 on more realistic data. Hence, the sim_2 function is used for the Weizmann dataset, where the edges are extracted from background-subtracted silhouettes, and sim_1 is used for the KTH dataset, where the edges are extracted from contour information.

Table 1
Action recognition accuracies on the Weizmann and KTH datasets using different classification methods (SP, single pose; PO, pose ordering; LF, line-flow) with respect to the choice of pose similarity calculation function (sim1 and sim2) and different spatial binnings (N × N). N = 1 indicates that no spatial binning is applied.

| Classification method |      | Weizmann |       |       | KTH   |       |       |
|                       |      | N = 1    | N = 2 | N = 3 | N = 1 | N = 2 | N = 3 |
| SP (%)                | sim1 | 64.2     | 71.6  | 74.1  | 61.7  | 63.3  | 66.3  |
|                       | sim2 | 92.6     | 92.6  | 93.8  | 71.3  | 75.0  | 75.3  |
| PO (%)                | sim1 | 69.1     | 81.5  | 85.2  | 74.3  | 77.2  | 81.3  |
|                       | sim2 | 92.6     | 92.6  | 95.1  | 56.2  | 68.5  | 73.3  |
| LF (%)                |      | 48.1     | 64.2  | 87.7  | 71.3  | 74.8  | 80.5  |

Fig. 10. This figure illustrates the classification of an action sequence utilizing only single pose information throughout the video. For each frame in the test sequence, its distance to each class is computed by finding the most similar training frame from that class. In order to classify the sequence, we take the average distance of all frames to each class and assign the class label with the smallest average distance.

7.2.3. Evaluation of pose and flow features

Having decided on the optimal spatial binning value and chosen a suitable pose similarity calculation function depending on the conditions, in this section we evaluate the performance of pose and flow features in recognizing human actions on single-camera datasets by comparing the action recognition accuracies of different classification methods, namely pose ordering (PO), global line-flow (LF) and the combination of pose ordering and global line-flow (PO+LF).

The confusion matrices for the Weizmann dataset in Fig. 11 contain insightful information for comparing pose and flow features by examining the misclassifications made by each recognition method. As expected, the best results are achieved when pose information is combined with global motion cues, as in the PO+LF classification method, in which we obtain a perfect accuracy of 100%. We achieve an overall recognition rate of 90.7% using PO+LF on the KTH dataset (Fig. 12 shows the misclassifications). The decrease in performance with respect to the Weizmann dataset is reasonable, considering the relative complexity of the KTH dataset.

Fig. 11. Confusion matrix of each classification method for the Weizmann dataset. Misclassifications of the pose ordering (PO) method belong to actions having similar poses, such as run and walk, pjump and jump, and pjump and side (both include standing-still human poses). Global line-flow (LF) confuses actions having similar line-flow directions and magnitudes in the same spatial bin; however, its set of misclassifications does not overlap with PO. Therefore, when they are combined in PO+LF, we obtain a perfect accuracy of 100%. (a) Pose ordering, 95.1%; (b) global line-flow, 87.7%.

Fig. 12. Confusion matrix of the PO+LF classification method for the KTH dataset. The average accuracy over all scenarios on this dataset is 90.7%. Most of the confusions occur among jogging, running and walking, which is quite reasonable considering their visual similarity.


Fig. 13 compares the recognition performances on the individual scenarios of the KTH dataset. As expected, the highest performance is obtained in sc1, which is the simplest scenario of the KTH dataset. This shows that the combination of pose ordering and line-flow features can achieve high recognition rates when line segments are accurately extracted. The second and third highest performances are obtained in sc3 and sc4, respectively. Notice that in these scenarios the accuracy of pose ordering is lower than that of global line-flow. This can be explained by the decrease in the performance of our pose matching, due to the different outfits (e.g. long coats) worn by actors resulting in unusual configurations of line segments in sc3, and due to the existence of erroneous line segments belonging to the floor and to shadows reflected on the walls in sc4. In contrast, the performance of line-flow is lower than pose ordering in sc2, which implies that zooming and viewpoint variance have a negative effect on line-flow extraction. Although the relative performances of pose ordering and line-flow alter from one scenario to another, the overall accuracy is always boosted when these features are combined together in the PO+LF classification method.

7.2.4. Effect of noise elimination

To evaluate the effect of our noise elimination algorithm (see Section 3.1), we test our approach without applying any noise elimination. Note that when noise elimination is not applied we cannot form a bounding box around the human figure, so spatial binning is also omitted in this case. Fig. 13 reports the overall accuracy of our approach in each scenario of the KTH dataset when noise elimination is not applied. It is obvious that applying noise elimination and spatial binning significantly improves the recognition rate of each scenario. More specifically, our approach is less affected by noise in the standard outdoor (sc1) and indoor (sc4) settings. However, the recognition rates on sc2 and sc3 are significantly affected by noise, due to the existence of cluttered backgrounds in these conditions, resulting in inaccurate line segments.

7.2.5. Weighting between pose ordering and line-flow

Recall that in the PO+LF classification method (see Section 6.3), pose ordering is combined with global line-flow features in a linear weighting scheme, where α is the weighting coefficient of the combination, which determines the influence of the individual components on the final classification decision. Fig. 14 shows the change in recognition rates with respect to the choice of α.

In the KTH dataset, the individual performances of pose ordering and global line-flow are about the same, so the best accuracy is achieved when they are combined with equal weights at α = 0.5. We obtain similar results to those of Ikizler et al. [13], who find the best combination of line and optic-flow features at α = 0.5. This is also in agreement with the observations of Ke et al. [15], stating that shape and motion features are complementary to each other.

The perfect accuracy of 100% is reached on the Weizmann dataset when pose ordering has more influence on the final classification decision. This is because the individual performance of pose ordering is better than that of line-flow, since actions are mostly differentiable based on their appearances in the Weizmann dataset.

7.2.6. Computational cost

We have implemented our method in MATLAB and have not applied any significant optimizations. Table 2 shows the computational cost of our approach on the KTH dataset for pose extraction, similarity matrix calculation and classification. All the results are obtained using a 2.2 GHz Intel Core i7 laptop and are averaged over all frames/videos.

Our pose extraction process, consisting of edge detection, line-pair extraction and noise elimination, takes about 1.75 s per frame. These steps are the most time consuming ones of our approach; however, their running times can be dramatically reduced when coded in OpenCV or when a GPU is utilized. Having extracted the line-pairs in each frame of a video, comparing two line-pairs takes about 0.0009 s. The number of line-pairs in a frame varies between 10 and 20 for different actions, so that comparing all possible line-pairs in two frames takes about 0.36 s in the worst case and about 0.09 s in the best case. The remaining operations are significantly simpler in terms of computational cost. Extracting the line-flow histograms from a video using all the frames takes about 0.03 s on average, and comparing the line-flows of two videos takes 0.0003 s. We can see that the computational costs of the classification steps are negligible after computing the similarity matrices in the previous steps.

Fig. 13. Recognition accuracies on each scenario in the KTH dataset using different classification methods. 'White bars' show the overall accuracy when noise elimination is not applied. In addition, spatial binning is also omitted, since a bounding box around the human figure cannot be formed. It is apparent that applying noise elimination and then spatial binning significantly improves the performance in all of the scenarios.

8. Comparison to related studies

In this section, we compare our method's performance to other studies in the literature that have reported results on the KTH dataset. A comparison of results over the Weizmann dataset is not given, since most of the recent approaches, including ours, obtain perfect recognition rates on this simple dataset. A comparison over the KTH dataset is given, although making a fair and accurate one is difficult, since different researchers employ different experimental setups. As stated by Gao et al. in [11], the performances on the KTH dataset can differ by 10.67% when different n-fold cross-validation methods are used. Moreover, the performance is dramatically affected by the choice of scenarios used in training and testing. To evaluate our approach, as recommended in [11], we use simple leave-one-out cross validation as the most easily replicable clear-cut partitioning.

In Table 3, we compare our method's performance to the results of other studies on the KTH dataset. Our main concern in this study is to present a new pose representation, but our action recognition results are still higher than those of a considerable number of studies. Taking into account its straightforward approach to combining pose and line-flow features, our results are also comparable to the best ones [19,20,33,38]. Although our recognition results are slightly lower than these top studies, we claim that our approach is advantageous in terms of its pose (shape) representation. In order to demonstrate this, we provide a detailed comparison in Table 4 with the work of Lin et al. [19], which lies at the top position of our rankings table for the KTH dataset.

Fig. 14. This graph shows the change in the recognition accuracy on the Weizmann and KTH datasets with respect to the choice of α (weighting coefficient). α = 0 means that only line-flow features are used, whereas α = 1 corresponds to using only pose ordering information.

Table 2
Computational cost of our approach on the KTH dataset.

| Operation                         | Execution time (s) | Per        |
| Edge detection                    | 0.01100            | Frame      |
| Line-pair extraction              | 1.28000            | Frame      |
| Noise elimination                 | 0.46000            | Frame      |
| Compare line-pairs in two frames  | 0.00090            | Frame-pair |
| Line-flow histogram extraction    | 0.02980            | Video      |
| Compare two line-flow histograms  | 0.00030            | Hist-pair  |
| DTW classification                | 0.00320            | Video      |
| Line-flow classification          | 0.00004            | Video      |
| DTW + line-flow classification    | 0.00329            | Video      |

Table 3
Comparison of our approach to other studies over the KTH dataset.

| Method        | Evaluation    | Accuracy (%) |
| Lin [19]      | Leave-one-out | 95.77        |
| Ta [33]       | Leave-one-out | 93.00        |
| Liu [20]      | Leave-one-out | 91.80        |
| Wang [38]     | Leave-one-out | 91.20        |
| Our approach  | Leave-one-out | 90.70        |
| Fathi [9]     | Split         | 90.50        |
| Ahmad [2]     | Split         | 88.33        |
| Nowozin [26]  | Split         | 87.04        |
| Niebles [25]  | Leave-one-out | 83.30        |
| Dollar [7]    | Leave-one-out | 81.17        |
| Ke [15]       | Leave-one-out | 80.90        |
| Liu [22]      | Leave-one-out | 73.50        |
| Schuldt [31]  | Split         | 71.72        |


From Table 4 we can observe that the combined shape and motion results reported in [19] (95.8%) are better than our PO+LF classification (90.7%) on the KTH dataset. Although the motion-only action recognition of [19] performs slightly better than our global line-flow based classification, we outperform their shape-only results by more than 20% on KTH and by 14% on Weizmann using our pose ordering based classification. This reflects the effectiveness of the pose features and pose similarity function presented in this study. It also reveals the disadvantage of our linear combination scheme when compared to the action prototype-tree learning approach used in [19] to combine shape and motion features. In these experiments, we have demonstrated the potential of our line-pair features for human pose representation. We will further research and expand our studies on classification techniques for better utilization of these features, which will result in higher recognition performance.

In Table 5, we compare the performance of global line-flow with the results of other flow-based studies on the KTH dataset. Examining the results, we can say that global line-flow histograms perform better than the flow-based correlation method used in [15]. In the study of Ikizler et al. [13], a similar approach is used for flow-based action recognition: they utilize optic flow histograms and perform action recognition using an SVM classifier. Our 80.5% recognition rate is close to the result reported in [13]; we believe that the minor difference is due to our simple k-NN classification scheme compared to the SVM classifier. The top studies in Table 5 [9,38] use optic flow as their low-level features and build mid-level features and a codebook on top of them. In addition, they use more sophisticated classification methods compared to k-NN to get the maximum out of their flow features. Recall that, for motion-only classification, we simply aggregate our line-flow vectors extracted from each frame into a histogram. So, in our future studies we plan to exploit line-flow features by seeking alternative approaches to global histogramming and by using more complicated classification methods.

9. Conclusions

In this paper, we introduce a line based pose representation and explore its ability in recognizing human actions. We encapsulate a human pose into a collection of line-pairs, preserving the geometrical configurations of the components forming the human figure. The correspondences between the sets of line-pairs in two frames are captured by means of the proposed matching mechanism, in order to compute a pose similarity. To include the ordering of poses, we compare two sequences and find the optimal alignment between them using Dynamic Time Warping. In addition to our pose-based representation, the speed and direction of movement in an action sequence are embodied in global line-flow histograms.

Experimental results show that the combination of pose ordering and line-flow features overcomes the limitations of either shape or motion features alone, and thus increases the overall recognition accuracy. When our approach is compared to other studies combining shape and motion features, we observe that they obtain higher accuracies using features with relatively lower individual performances. This reflects the effectiveness of our pose and motion features; it also reveals the disadvantage of our simple combination scheme. It is apparent that the overall recognition rates could be increased by employing a more complex method to combine pose ordering and line-flow features.

The limitation of our approach is that it relies on good edge detection, so that Contour Segment Networks consisting of accurate lines can be constructed for each frame. The experiments on the Weizmann and KTH datasets show that our approach can successfully distinguish actions with high recognition rates when the lines are accurately extracted. However, our pose matching performance is negatively affected when the number of erroneous line segments in each frame increases. Although line-flow is less tolerant to zoom effects than pose features, it performs better under noisy conditions.

In this study we mainly concentrated on the representation of actions. As future work, many improvements can be made regarding classification. First, a more sophisticated method can be developed for combining pose ordering and line-flow features. Second, to always extract accurate lines, the edge detection scheme can be specialized for human actions. Finally, our powerful pose matching mechanism can be applied to recognize actions in still images.

Table 4
Comparison of our results to [19] with respect to different features: shape only (s), motion only (m), and combined shape and motion (s+m). For our study, s and m refer to pose ordering and line-flow, respectively. Although the s+m results reported in [19] are higher than our PO+LF, we outperform their shape-only results by more than 20% on KTH and by 14% on Weizmann using our pose-ordering based classification.

| Method          | Weizmann |       |         | KTH   |       |         |
|                 | s (%)    | m (%) | s+m (%) | s (%) | m (%) | s+m (%) |
| Our study       | 95.1     | 87.7  | 100     | 81.3  | 80.5  | 90.7    |
| Lin et al. [19] | 81.1     | 88.9  | 100     | 60.9  | 86.0  | 95.8    |

Table 5
Comparison of our global line-flow to other flow-based studies over the KTH dataset.

| Flow-based feature               | Classification | Accuracy (%) |
| Codebook from optic flow in [38] | S-LDA          | 91.20        |
| Mid-level motion features in [9] | AdaBoost       | 90.5         |
| Codebook from optic flow in [38] | S-CTM          | 90.33        |
| Optic flow histogram in [13]     | SVM            | 84           |
| Global line-flow                 | k-NN           | 80.5         |
| Flow-based correlation in [15]   | SVM            | 70           |

References

[1] J.K. Aggarwal, Q. Cai, Human motion analysis: a review, Computer Vision and Image Understanding 73 (3) (1999) 428–440.

[2] M. Ahmad, S. Lee, Human action recognition using shape and CLG-motion flow from multi-view image sequences, Pattern Recognition 41 (7) (2008) 2237–2252.

[3] S. Baysal, M.C. Kurt, P. Duygulu, Recognizing human actions using key poses, in: ICPR, 2010.

[4] M. Blank, L. Gorelick, E. Shechtman, M. Irani, R. Basri, Actions as space-time shapes, in: ICCV, 2005.

[5] A.F. Bobick, J.W. Davis, The recognition of human motion using temporal templates, Pattern Analysis and Machine Intelligence 23 (3) (2001) 257–267.

[6] S. Carlsson, J. Sullivan, Action recognition by shape matching to key frames, in: Workshop on Models versus Exemplars in Computer Vision, 2001.

[7] P. Dollar, V. Rabaud, G. Cottrell, S. Belongie, Behavior recognition via sparse spatio-temporal features, in: VS-PETS, 2005.

[8] A.A. Efros, A.C. Berg, G. Mori, J. Malik, Recognizing action at a distance, in: ICCV, 2003.

[9] A. Fathi, G. Mori, Action recognition by learning mid-level motion features, in: CVPR, 2008.

[10] V. Ferrari, L. Fevrier, F. Jurie, C. Schmid, Groups of adjacent contour segments for object detection, IEEE Transactions on Pattern Analysis and Machine Intelligence 30 (1) (2008) 36–51.

[11] Z. Gao, M. Chen, A.G. Hauptmann, A. Cai, Comparing evaluation protocols on the KTH dataset, in: Human Behavior Understanding, Lecture Notes in Computer Science, vol. 6219, 2008, pp. 88–100.

[12] K. Hatun, P. Duygulu, Pose sentences: a new representation for action recognition using sequence of pose words, in: ICPR, 2008.

[13] N. Ikizler, R.G. Cinbis, P. Duygulu, Human action recognition with line and flow histograms, in: ICPR, 2008.

[14] N. Ikizler, P. Duygulu, Human action recognition using distribution of oriented rectangular patches, Image and Vision Computing 27 (10) (2009).

[15] Y. Ke, R. Sukthankar, M. Hebert, Spatio-temporal Shape and Flow Correlation for Action Recognition, Visual Surveillance Workshop, 2007.

[16] A. Kovashka, K. Grauman, Learning a hierarchy of discriminative space-time neighborhood features for human action recognition, in: CVPR, 2010.

[17] I. Laptev, M. Marszalek, C. Schmid, B. Rozenfeld, Learning realistic human actions from movies, in: CVPR, 2008.

[18] Q.V. Le, W.Y. Zou, S.Y. Yeung, A.Y. Ng, Learning hierarchical invariant spatio-temporal features for action recognition with independent subspace analysis, in: CVPR, 2011.

[19] Z. Lin, Z. Jiang, L.S. Davis, Recognizing actions by shape-motion prototype trees, in: ICCV, 2009.

[20] J. Liu, J. Luo, M. Shah, Recognizing realistic actions from videos in the wild, in: CVPR, 2009.

[21] J. Liu, M. Shah, Learning human actions via information maximization, in: CVPR, 2008.

[22] J. Liu, J. Yang, Y. Zhang, X. He, Action recognition by multiple features and hyper-sphere multi-class svm, in: ICPR, 2010.

[23] G. Loy, J. Sullivan, S. Carlsson, Pose Based Clustering in Action Sequences, Workshop on Higher-Level Knowledge in 3D Modeling and Motion Analysis, 2003.

[24] M. Maire, P. Arbelaez, C. Fowlkes, J. Malik, Using contours to detect and localize junctions in natural images, in: CVPR, 2008.

[25] J.C. Niebles, H. Wang, L. Fei-Fei, Unsupervised learning of human action categories using spatial-temporal words, International Journal of Computer Vision 79 (3) (2008) 299–318.

[26] S. Nowozin, G. Bakir, K. Tsuda, Discriminative subsequence mining for action classification, in: ICCV, 2007.

[27] H. Qu, L. Wang, C. Leckie, Action recognition using space-time shape difference images, in: ICPR, 2010.

[28] L. Rabiner, B. Juang, Fundamentals of Speech Recognition, Prentice- Hall, 1993.

[29] S. Salvador, P. Chan, FastDTW: toward accurate dynamic time warping in linear time and space, in: KDD Workshop on Mining Temporal and Sequential Data, 2004.

[30] K. Schindler, L.V. Gool, Action snippets: how many frames does human action recognition require?, in: CVPR, 2008.

[31] C. Schuldt, I. Laptev, B. Caputo, Recognizing human actions: a local svm approach, in: ICPR, 2004.

[32] L. Shao, L. Ji, Y. Liu, J. Zhang, Human action segmentation and recognition via motion and shape analysis, Pattern Recognition Letters 33 (2012) 438–445.

[33] A. Ta, C. Wolf, G. Lavoue, A. Baskurt, J. Jolion, Pairwise features for human action recognition, in: ICPR, 2010.

[34] T. Thi, L. Cheng, J. Zhang, L. Wang, S. Satoh, Weakly supervised action recognition using implicit shape models, in: ICPR, 2010.

[35] C. Thurau, V. Hlavac, Pose primitive based human action recognition in videos or still images, in: CVPR, 2008.

[36] K.N. Tran, I.A. Kakadiaris, S.K. Shah, Part-based motion descriptor image for human action recognition, Pattern Recognition 45 (7) (2012) 2562–2572.

[37] P. Turaga, R. Chellappa, V.S. Subrahmanian, O. Udrea, Machine recognition of human activities: a survey, Circuits and Systems for Video Technology 18 (11) (2008) 1473–1488.

[38] Y. Wang, P. Sabzmeydani, G. Mori, Semi-latent Dirichlet allocation: a hierarchical model for human action recognition, in: ICCV Workshop on Human Motion, 2007.
