Pattern Recognition 36 (2003) 585–601
www.elsevier.com/locate/patcog

Recent developments in human motion analysis

Liang Wang, Weiming Hu, Tieniu Tan*
National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing 100080, People's Republic of China

Received 27 November 2001; accepted 13 May 2002

Abstract

Visual analysis of human motion is currently one of the most active research topics in computer vision. This strong interest is driven by a wide spectrum of promising applications in many areas such as virtual reality, smart surveillance, perceptual interfaces, etc. Human motion analysis concerns the detection, tracking and recognition of people, and more generally, the understanding of human behaviors, from image sequences involving humans. This paper provides a comprehensive survey of research on computer-vision-based human motion analysis. The emphasis is on three major issues involved in a general human motion analysis system, namely human detection, tracking and activity understanding. Various methods for each issue are discussed in order to examine the state of the art. Finally, some research challenges and future directions are discussed.
© 2002 Pattern Recognition Society. Published by Elsevier Science Ltd. All rights reserved.

Keywords: Human motion analysis; Detection; Tracking; Behavior understanding; Semantic description

1. Introduction

As one of the most active research areas in computer vision, visual analysis of human motion attempts to detect, track and identify people, and more generally, to interpret human behaviors, from image sequences involving humans. Human motion analysis has attracted great interest from computer vision researchers due to its promising applications in many areas such as visual surveillance, perceptual user interfaces, content-based image storage and retrieval, video conferencing, athletic performance analysis, virtual reality, etc.

Human motion analysis has been investigated under several large research projects worldwide. For example, the Defense Advanced Research Projects Agency (DARPA) funded a multi-institution project on Video Surveillance and Monitoring (VSAM) [1], whose purpose was to develop automatic video understanding technology enabling a single human operator to monitor activities over complex areas such as battlefields and civilian scenes. The real-time visual surveillance system W4 [2] employed a combination of shape analysis and tracking, and constructed models of people's appearances so as to detect and track multiple people, as well as monitor their activities, even in the presence of occlusion in an outdoor environment. Researchers in the UK have also done much research on the tracking of vehicles and people and the recognition of their interactions [3]. In addition, companies such as IBM and Microsoft are investing in research on human motion analysis [4,5].

* Corresponding author. Tel.: +86-10-6264-7441; fax: +86-10-6255-1993. E-mail address: [email protected] (T. Tan).
In recent years, human motion analysis has been featured in a number of leading international journals such as the International Journal of Computer Vision (IJCV), Computer Vision and Image Understanding (CVIU), IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), and Image and Vision Computing (IVC), as well as prestigious international conferences and workshops such as the International Conference on Computer Vision (ICCV), the IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), the European Conference on Computer Vision (ECCV), the Workshop on Applications of Computer Vision (WACV), and the IEEE International Workshop on Visual Surveillance (IWVS).

All the above activities have demonstrated a great and growing interest in human motion analysis from the pattern recognition and computer vision community. The primary purpose of this paper is thus to review the recent developments in this exciting research area, especially the progress since previous such reviews.

1.1. Potential applications

Human motion analysis has a wide range of potential applications such as smart surveillance, advanced user interfaces, and motion-based diagnosis, to name a few [6].

1.1.1. Visual surveillance

The strong need for smart surveillance systems [7,8] stems from security-sensitive areas such as banks, department stores, parking lots, and borders. Surveillance cameras are already prevalent in commercial establishments, with camera outputs usually recorded on tape or stored in video archives. These video data are currently used only "after the fact" as a forensic tool, losing their primary benefit as an active real-time medium. What is needed is real-time analysis of surveillance data to alert security officers to a burglary in progress, or to a suspicious individual wandering around the parking lot. Nowadays, tracking and recognition techniques for face [9–12] and gait [13–16] have been strongly motivated by the purpose of access control. As well as the obvious security applications, smart surveillance has also been proposed to measure traffic flow, monitor pedestrian congestion in public spaces [17,18], compile consumer demographics in shopping malls, etc.

1.1.2. Advanced user interfaces

Another important application domain is advanced user interfaces, in which human motion analysis is usually used to provide control and command. Generally speaking, communication among people is mainly realized by speech, so speech understanding has already been widely used in early human-machine interfaces. However, it is subject to restrictions from environmental noise and distance. Vision is very useful to complement speech recognition and natural language understanding for more natural and intelligent communication between humans and machines: more detailed cues can be obtained from gestures, body poses, facial expressions, etc. [19–22]. Hence, future machines must be able to independently sense the surrounding environment, e.g., detecting human presence and interpreting human behavior. Other applications in the user interface domain include sign-language translation, gesture-driven controls, and signaling in high-noise environments such as factories and airports [23].
1.1.3. Motion-based diagnosis and identification

It is particularly useful to segment the various body parts of a human in an image, track the movement of joints over an image sequence, and recover the underlying 3-D body structure for the analysis and training of athletic performance. With the development of digital libraries, interpreting video sequences automatically using content-based indexing will save tremendous human effort in sorting and retrieving images or video in huge databases. Traditional gait analysis [24–26] aims at providing medical diagnosis and treatment support, while human gait can also be used as a new biometric feature for personal identification [13–16]. Some other applications of vision-based motion analysis lie in personalized training systems for various sports, medical diagnostics of orthopedic patients, choreography of dance and ballet, etc.

In addition, human motion analysis shows its importance in other related areas. For instance, typical applications in virtual reality include chat-rooms, games, virtual studios, character animation, teleconferencing, etc. Computer games [27], in particular, have become very prevalent in entertainment, and people are often surprised at the realism of virtual humans and simulated actions in them. This realism benefits greatly from computer graphics work on devising realistic models of human bodies and synthesizing human movement, which in turn builds on the acquisition of human body models, the recovery of body pose, human behavior analysis, etc. Also, model-based image coding (e.g., encoding only the pose of the tracked face in more detail than the uninteresting background in a videophone setting) will bring about very low bit-rate video compression for more effective image storage and transmission.

1.2. Previous surveys

The importance and popularity of human motion analysis have led to several previous surveys. Each such survey is discussed in the following in order to put the current review in context.

The earliest relevant review was probably due to Aggarwal et al. [28]. It covered various methods used in articulated and elastic non-rigid motion prior to 1994. As for articulated motion, approaches with and without a priori shape models were described. Cédras and Shah [29] presented an overview of methods for motion extraction prior to 1995, in which human motion analysis was treated as action recognition, recognition of body parts, and body configuration estimation. Aggarwal and Cai gave another survey of human motion analysis [30], which covered the work prior to 1997. Their later review [31], covering 69 publications, was an extension of their workshop paper [30]. That paper provided an overview of the various tasks involved in motion analysis of the human body prior to 1998. The focus was on three major areas related to interpreting human motion: (a) motion analysis involving human body parts, (b) tracking a moving human from a single view or multiple camera perspectives, and (c) recognizing human activities from image sequences. A similar survey by Gavrila [6] described the work in human motion analysis prior to 1998. Its emphasis was on discussing various methodologies, grouped into 2-D approaches with or without explicit shape models and 3-D approaches. It concluded with two main future directions in 3-D tracking and action recognition.
Recently, a relevant study by Pentland [32] centered on person identification, surveillance/monitoring, 3-D methods, and smart rooms/perceptual user interfaces to review the state of the art of "looking at people". The paper was not intended to survey current work on human motion analysis, but touched on several interesting topics in human motion analysis and its applications. The latest survey of computer-vision-based human motion capture was presented by Moeslund and Granum [33]. Its focus was on a general overview based on a taxonomy of system functionalities, viz., initialization, tracking, pose estimation and recognition. It covered the achievements from 1980 to the first half of 2000. In addition, a number of general assumptions used in this research field were identified and suggestions for future research directions were offered.

1.3. Purpose and contributions of this paper

The growing interest in human motion analysis has led to significant progress in recent years, especially on high-level vision issues such as human activity and behavior understanding. This paper provides a comprehensive survey of work on human motion analysis from 1989 onwards; approximately 70% of the references discussed in this paper appeared after 1996. In contrast to the previous reviews, the current review focuses on the most recent developments, especially on intermediate-level and high-level vision issues.

Different surveys usually select different taxonomies to group individual papers depending on their purposes. Unlike previous reviews, we will give a more general overview of the overall process of a human motion analysis system, shown in Fig. 1. Three major tasks in the process of human motion analysis (namely human detection, human tracking and human behavior understanding) will be of particular concern. Although they do have some overlap (e.g., the use of motion detection during tracking), this general classification provides a good framework for discussion throughout this survey. The majority of past work in human motion analysis concerns tracking and action recognition; similar in principle to earlier reviews, we introduce both processes in detail. We also review motion segmentation as used in human detection, and the semantic description of behaviors as used in human activity interpretation. Compared with previous reviews, we include more comprehensive discussions of research challenges and future open directions in the domain of vision-based human motion analysis.

Fig. 1. A general framework for human motion analysis.

Instead of detailed summaries of individual publications, our emphasis is on discussing various methods for the different tasks involved in a general human motion analysis system. Each issue is accordingly divided into sub-processes or categories of methods to examine the state of the art, and only the principles of each group of methods are described in this paper. Unlike previous reviews, this paper is clearly organized in a hierarchical manner from low-level vision, through intermediate-level vision, to high-level vision according to the general framework of human motion analysis. This, we believe, will help readers, especially newcomers to this area, not only to obtain an understanding of the state of the art in human motion analysis but also to appreciate the major components of a general human motion analysis system and the links between them.
In summary, the primary purpose and contributions of this paper are as follows (when compared with the existing survey papers on human motion analysis):

(1) This paper aims to provide a comprehensive survey of the most recent developments in vision-based human motion analysis. It covers the latest research, ranging mainly from 1997 to 2001, and thus contains many new references not found in previous surveys.

(2) Unlike previous reviews, this paper is organized in a hierarchical manner (from low-level vision, through intermediate-level vision, to high-level vision) according to a general framework of human motion analysis systems.

(3) Unlike other reviews, this paper adopts a taxonomy based on functionalities, namely detection, tracking and behavior understanding within human motion analysis systems.

(4) This paper focuses more on the overall methods and general characteristics involved in the above three issues (functionalities), so each issue is accordingly divided into sub-processes and categories of approaches so as to provide more detailed discussions.

(5) In contrast to past surveys, we provide a detailed introduction to motion segmentation and object classification (an important basis for human motion analysis systems) and to the semantic description of behaviors (an interesting direction which has recently received increasing attention).

(6) We also provide more detailed discussions of research challenges and future research directions in human motion analysis than earlier reviews.

The remainder of this paper is organized as follows. Section 2 reviews the work on human detection, including motion segmentation and moving object classification. Section 3 covers human tracking, which is divided into four categories of methods: model-based, region-based, active-contour-based and feature-based. The paper then extends the discussion to the recognition and description of human activities in image sequences in Section 4. Section 5 analyzes some challenges and presents some possible directions for future research at length. Section 6 concludes the paper.

2. Detection

Nearly every system for vision-based human motion analysis starts with human detection. Human detection aims at segmenting regions corresponding to people from the rest of an image. It is a significant issue in a human motion analysis system since subsequent processes such as tracking and action recognition depend greatly on it. This process usually involves motion segmentation and object classification.

2.1. Motion segmentation

Motion segmentation in video sequences is known to be a significant and difficult problem. It aims at detecting regions corresponding to moving objects such as vehicles and people in natural scenes. Detecting moving blobs provides a focus of attention for later processes such as tracking and activity analysis, because only the changing pixels need be considered. However, changes due to weather, illumination, shadow and repetitive motion from clutter make motion segmentation difficult to perform quickly and reliably. At present, most segmentation methods use either the temporal or the spatial information of the images. Several conventional approaches to motion segmentation are outlined in the following.

2.1.1. Background subtraction

Background subtraction [2,34–42] is a particularly popular method for motion segmentation, especially in situations with a relatively static background.
It attempts to detect moving regions in an image by differencing the current image and a reference background image in a pixel-by-pixel fashion. However, it is extremely sensitive to changes of dynamic scenes due to lighting and extraneous events. The numerous approaches to this problem differ in the type of background model used and in the procedure used to update it. The simplest background model is the temporally averaged image, a background approximation similar to the current static scene. Based on the observation that the median value is far more robust than the mean, Yang and Levine [36] proposed an algorithm for constructing the background primal sketch by taking the median value of each pixel's color over a series of images. The median value, together with a threshold determined by a histogram procedure based on the least median of squares method, was used to create the difference image. This algorithm could handle some of the inconsistencies due to lighting changes, etc. Most researchers are more interested in building adaptive background models in order to reduce the influence of dynamic scene changes on motion segmentation. For instance, early studies by Karmann and Brandt [34] and by Kilger [35] proposed adaptive background models based on Kalman filtering to adapt to temporal changes of weather and lighting.

2.1.2. Statistical methods

Inspired by the basic background subtraction methods described above, some statistical methods have recently been proposed to extract changed regions from the background. These approaches use the characteristics of individual pixels or groups of pixels to construct more advanced background models, and the statistics of the background can be updated dynamically during processing. Each pixel in the current image can then be classified as foreground or background by comparing it against the statistics of the current background model. This approach is becoming increasingly popular due to its robustness to noise, shadow, changes of lighting conditions, etc.

As an example of statistical methods, Stauffer and Grimson [38] presented an adaptive background mixture model for real-time tracking. They modeled each pixel as a mixture of Gaussians and used an online approximation to update the model. The Gaussian distributions of the adaptive mixture models were then evaluated to determine which pixels most likely arose from a background process, resulting in a reliable, real-time outdoor tracker that can deal with lighting changes and clutter. A recent study by Haritaoglu et al. [2] built a statistical model by representing each pixel with three values: its minimum and maximum intensity values, and the maximum intensity difference between consecutive frames observed during the training period; the model parameters were updated periodically. The quantities characterized statistically are typically colors or edges. For example, McKenna et al. [39] used an adaptive background model combining color and gradient information, in which each pixel's chromaticity was modeled using means and variances, and its gradients in the x and y directions were modeled using gradient means and magnitude variances. Background subtraction was then performed to cope effectively with shadows and unreliable color cues. Another example was Pfinder [37], in which the subject was modeled by a number of blobs with individual color and shape statistics.
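To make the statistical modeling concrete, the following is a minimal sketch (our own illustrative Python/NumPy code, not taken from any of the cited systems) that maintains a single adaptive Gaussian per pixel; the mixture models of Stauffer and Grimson [38] generalize this to several weighted Gaussians per pixel. The learning rate and threshold are arbitrary example values.

```python
import numpy as np

class RunningGaussianBackground:
    """Per-pixel single-Gaussian background model for grayscale frames:
    a simplified stand-in for the adaptive statistical models above."""

    def __init__(self, first_frame, alpha=0.02, k=2.5):
        f = first_frame.astype(np.float64)
        self.mean = f.copy()              # per-pixel background mean
        self.var = np.full_like(f, 25.0)  # per-pixel variance (initial guess)
        self.alpha = alpha                # learning rate of the update
        self.k = k                        # foreground threshold in std. devs.

    def apply(self, frame):
        f = frame.astype(np.float64)
        # A pixel deviating by more than k standard deviations is foreground.
        fg = np.abs(f - self.mean) > self.k * np.sqrt(self.var)
        # Update the statistics only where the background matched, so that
        # foreground objects are not absorbed into the model immediately.
        bg = ~fg
        d = f - self.mean
        self.mean[bg] += self.alpha * d[bg]
        self.var[bg] += self.alpha * (d[bg] ** 2 - self.var[bg])
        return fg.astype(np.uint8)
```

Calling apply() once per frame yields a binary foreground mask while the model slowly adapts to lighting drift; the per-pixel update is exactly the exponential forgetting described for the adaptive models above.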
2.1.3. Temporal differencing

The temporal differencing approach [1,12,43–46] makes use of the pixel-wise differences between two or three consecutive frames in an image sequence to extract moving regions. Temporal differencing is very adaptive to dynamic environments, but generally does a poor job of extracting all the relevant feature pixels, e.g., possibly leaving holes inside moving entities. As an example of this method, Lipton et al. [43] detected moving targets in real video streams using temporal differencing. After the absolute difference between the current and the previous frame was obtained, a threshold function was used to determine change. Using connected component analysis, the extracted moving sections were clustered into motion regions. These regions were classified into predefined categories according to image-based properties for later tracking. An improved version uses three-frame differencing instead of two-frame differencing [1,46]. For instance, VSAM [1] successfully developed a hybrid algorithm for motion segmentation by combining an adaptive background subtraction algorithm with a three-frame differencing technique. This hybrid algorithm is very fast and surprisingly effective for detecting moving objects in image sequences.

2.1.4. Optical flow

Optical flow is generally used to describe the coherent motion of points or features between image frames. Motion segmentation based on optical flow [26,47–51] uses the characteristics of the flow vectors of moving objects over time to detect changed regions in an image sequence. For example, Meyer et al. [26] computed the displacement vector field to initialize a contour-based tracking algorithm, called active rays, for the extraction of articulated objects to be used in gait analysis. The work by Rowley and Rehg [50] also focused on the segmentation of optical flow fields of articulated objects. Its major contributions were to add kinematic motion constraints to each pixel, and to combine motion segmentation with estimation in an expectation-maximization (EM) framework; the motion addressed, however, was restricted to 2-D affine transforms. Also, in Bregler's work [52], each pixel was represented by its optical flow; these flow vectors were grouped into blobs having coherent motion and characterized by a mixture of multivariate Gaussians. Optical flow methods can be used to detect independently moving objects even in the presence of camera motion. However, most flow computation methods are computationally complex and very sensitive to noise, and cannot be applied to video streams in real time without specialized hardware. A more detailed discussion of optical flow can be found in Barron's work [49].

In addition to the basic methods described above, there are other approaches to motion segmentation. Using the extended EM algorithm [53], Friedman and Russell [54] implemented a mixture-of-Gaussians classification model for each pixel. This model explicitly classified pixel values into three separate, predetermined distributions corresponding to background, foreground and shadow, while automatically updating the mixture component for each class according to the likelihood of membership. Hence slow-moving objects were handled well, and shadows were eliminated much more effectively. Stringa [55] also proposed a novel morphological algorithm for scene change detection, which yielded a stationary system even under varying environmental conditions.
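As a concrete complement to Section 2.1.3, the differencing stage can be sketched in a few lines of NumPy (illustrative code with an arbitrary threshold; the VSAM hybrid additionally combines such a mask with adaptive background subtraction):

```python
import numpy as np

def three_frame_difference(prev, curr, nxt, thresh=15):
    """Mark a pixel as moving only if the current frame differs from BOTH
    its temporal neighbors; this suppresses the 'ghost' that two-frame
    differencing leaves behind at an object's previous position."""
    c = curr.astype(np.int16)                  # avoid uint8 wrap-around
    d1 = np.abs(c - prev.astype(np.int16))
    d2 = np.abs(c - nxt.astype(np.int16))
    return ((d1 > thresh) & (d2 > thresh)).astype(np.uint8)
```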
From a practical point of view, the statistical methods described in Section 2.1.2 are often the better choice due to their adaptability in more unconstrained applications.

2.2. Object classification

Different moving regions may correspond to different moving targets in natural scenes. For instance, the image sequences captured by surveillance cameras mounted in road traffic scenes probably include pedestrians, vehicles, and other moving objects such as flying birds, floating clouds, etc. To further track people and analyze their activities, it is necessary to correctly distinguish them from other moving objects. The purpose of moving object classification [1,43,56–61] is to extract precisely the regions corresponding to people from all the moving blobs obtained by the motion segmentation methods discussed above. Note that this step may not be needed in situations where the moving objects are known to be human. For the purpose of describing the overall process of human detection, we present a brief discussion of moving object classification here. At present, there are two main categories of approaches to moving object classification.

2.2.1. Shape-based classification

Different descriptions of the shape information of motion regions, such as point, box, silhouette and blob representations, are available for classifying moving objects. For example, Collins et al. [1] classified moving object blobs into four classes (single human, vehicle, human group and clutter) using a viewpoint-specific three-layer neural network classifier. The input features to the network were a mixture of image-based and scene-based object parameters such as image blob dispersedness, image blob area, apparent aspect ratio of the blob bounding box, and camera zoom. Classification was performed on each blob at every frame, and the results were accumulated in a histogram; at each time step, the most likely class label for the blob was chosen as the final classification. Another work by Lipton et al. [43] used the dispersedness and area of image blobs as classification metrics to classify all moving object blobs into humans, vehicles and clutter, with temporal consistency constraints imposed to make the classification results more precise. As an example of silhouette-based shape representation for object classification, Kuno and Watanabe [56] described a reliable method of human detection for visual surveillance systems. The merit of this method was the use of simple shape parameters of human silhouette patterns to distinguish humans from other moving objects such as butterflies and autonomous vehicles; these shape parameters were the mean and standard deviation of the silhouette projection histograms and the aspect ratio of the rectangle circumscribing the moving region.

2.2.2. Motion-based classification

Generally speaking, non-rigid articulated human motion shows a periodic property, and this has been used as a strong cue for moving object classification [57–59]. For example, Cutler and Davis [57] described a similarity-based technique to detect and analyze periodic motion. By tracking a moving object of interest, they computed its self-similarity as it evolved over time. For periodic motion, the self-similarity measure is itself periodic, so they applied time-frequency analysis to detect and characterize the periodic motion, and implemented tracking and classification of moving objects using periodicity.
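The self-similarity cue can be illustrated with a short sketch (our own toy code, not Cutler and Davis's implementation, which uses more careful alignment and time-frequency analysis): stack the tracked object's image windows over time, compute pairwise frame differences, and read the period off the spectrum of one row of the resulting matrix.

```python
import numpy as np

def self_similarity(patches):
    """patches: equally sized grayscale object windows over time.
    Returns the T x T matrix S where a small S[i, j] means frames i and j
    look alike; periodic motion such as walking produces a regular
    lattice of minima in S."""
    X = np.stack([p.astype(np.float64).ravel() for p in patches])
    return np.abs(X[:, None, :] - X[None, :, :]).mean(axis=2)

def dominant_period_seconds(S, fps):
    """Crudely estimate the motion period from one row of S via its
    power spectrum."""
    sig = S[0] - S[0].mean()
    spec = np.abs(np.fft.rfft(sig)) ** 2
    k = spec[1:].argmax() + 1        # strongest non-DC frequency bin
    return len(sig) / (k * fps)      # bin k = k cycles per len(sig) frames
```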
Optical flow is also very useful for object classification. In Lipton's work [58], residual flow was used to analyze the rigidity and periodicity of moving entities. It was expected that rigid objects would present little residual flow, whereas a non-rigid moving object such as a human being would have higher average residual flow and even display a periodic component. Based on this cue, human motion can be distinguished from that of other moving objects such as vehicles.

The two common approaches mentioned above, shape-based and motion-based classification, can also be combined effectively for moving object classification [2]. Furthermore, Stauffer [61] proposed a novel method based on a time co-occurrence matrix to hierarchically classify both objects and behaviors. It is expected that more precise classification results can be obtained by using extra features such as color and velocity.

In a word, finding people [62–64] in images is a particularly difficult object recognition problem. Generally, human detection follows the processing described above. However, several recent papers provide an improved approach in which component-based or segment-based methods are combined with geometric configuration constraints on the human body [65–67]. For example, in Mohan et al.'s work [65], the system was structured with four distinct example-based detectors trained to find four components of the human body separately: the head, the legs, the left arm, and the right arm. After ensuring that these components were present in the proper geometric configuration, a second example-based classifier was used to classify a pattern as either a person or a non-person. Although this method was relatively complex, it can provide more robust results than full-body person detection in that it is capable of locating partially occluded views of people and people whose body parts have little contrast with the background.

3. Tracking

Object tracking in video streams has been a popular topic in the field of computer vision. Tracking is a particularly important issue in human motion analysis since it serves as a means to prepare data for pose estimation and action recognition. In contrast to human detection, human tracking is a higher-level computer vision problem, although tracking algorithms within human motion analysis usually overlap considerably with motion segmentation during processing. Tracking over time typically involves matching objects in consecutive frames using features such as points, lines or blobs. That is to say, tracking may be considered equivalent to establishing coherent relations of image features between frames with respect to position, velocity, shape, texture, color, etc. Useful mathematical tools for tracking include the Kalman filter [68], the Condensation algorithm [69,70], and the Dynamic Bayesian Network [71].

Kalman filtering is a state estimation method based on Gaussian distributions. Unfortunately, it is restricted to situations where the probability distribution of the state parameters is unimodal; that is, it is inadequate for the multi-modal distributions that arise in the presence of occlusion, cluttered backgrounds resembling the tracked objects, etc. The Condensation algorithm has been shown to be a powerful alternative. It is a conditional density propagation method for visual tracking.
Based on sampling the posterior distribution estimated in the previous frame, it propagates these samples iteratively to successive images. By combining a tractable dynamic model with visual observations, it can accomplish highly robust tracking of object motion. However, it usually requires a relatively large number of samples to ensure a fair approximation of the posterior distribution over the current state.
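A minimal Condensation-style update can be sketched as follows (our own illustrative code; the measurement model `likelihood`, which scores position hypotheses against image evidence such as edges or color, is assumed to be supplied by the application):

```python
import numpy as np

rng = np.random.default_rng(0)

def condensation_step(particles, weights, likelihood, noise_std=2.0):
    """One iteration of factored sampling for a 2-D constant-velocity
    state [x, y, vx, vy]. `weights` must sum to one."""
    n = len(particles)
    # 1. Resample old hypotheses in proportion to their weights.
    idx = rng.choice(n, size=n, p=weights)
    p = particles[idx].copy()
    # 2. Predict each sample through the dynamic model, plus diffusion noise.
    p[:, 0] += p[:, 2]
    p[:, 1] += p[:, 3]
    p += rng.normal(0.0, noise_std, p.shape)
    # 3. Re-weight every hypothesis by the measurement likelihood.
    w = likelihood(p[:, :2]) + 1e-12
    return p, w / w.sum()
```

The sample set as a whole represents the multi-modal posterior that a single Kalman filter cannot, which is precisely the advantage noted above; the price is the large number of samples required.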
Tracking can be divided into various categories according to different criteria. As far as the tracked objects are concerned, tracking may be classified into tracking of human body parts such as the hand, face and leg [9–12,21,22,72–76] and tracking of the whole body [77–112]. If the number of views is considered, there are single-view [70,113–115,77–86], multiple-view [87,107–110,116], and omnidirectional-view [111] tracking. Tracking can also be grouped according to other criteria such as the dimension of the tracking space (2-D vs. 3-D), the tracking environment (indoors vs. outdoors), the number of tracked humans (single human, multiple humans, human groups), the camera's state (moving vs. stationary), the sensor's multiplicity (monocular vs. stereo), etc. Our focus is on the methods used in the tracking process, and the tracking methods studied extensively in past work are summarized as follows.

3.1. Model-based tracking

Traditionally, the geometric structure of the human body can be represented as a stick figure, a 2-D contour or a volumetric model [117], so body segments can be approximated as lines, 2-D ribbons and 3-D volumes accordingly.

3.1.1. Stick figure

The essence of human motion is typically captured by the movements of the torso, head and four limbs, so the stick-figure representation can be used to approximate a human body as a combination of line segments linked by joints [77–79,118–122]. The stick figure can be obtained in various ways, e.g., by means of the medial axis transform or the distance transform [123]. The motion of the joints provides a key to motion estimation and recognition of the whole figure. For example, Guo et al. [78] represented the human body structure in the silhouette by a stick figure model with ten sticks articulated at six joints, transforming the problem into finding the stick figure with minimal energy in a potential field. In addition, prediction and angle constraints on individual joints were introduced to reduce the complexity of the matching process. Karaulova et al. [77] also used this kind of body representation to build a novel hierarchical model of human dynamics encoded using hidden Markov models, and realized view-independent tracking of the human body in monocular video sequences.

3.1.2. 2-D contour

This kind of human body representation is directly related to the projection of the human body in the image plane. In such descriptions, human body segments are analogous to 2-D ribbons or blobs [80–83,124–126]. For instance, Ju et al. [83] proposed a "cardboard people" model, in which the human limbs were represented by a set of connected planar patches. The parameterized image motion of these patches was constrained to enforce articulated movement, and was used for the analysis of the articulated motion of human limbs. In the work by Leung and Yang [80], the subject's outline was estimated as edge regions represented by 2-D ribbons, which were U-shaped edge segments; the 2-D ribbon model was used to guide the labeling of the image data. A silhouette or contour is relatively easy to extract from both the model and the image. Based on the 2-D contour representation, Niyogi and Adelson [82] used the spatio-temporal pattern in XYT space to track, analyze and recognize walking figures. They first examined the characteristic braided pattern produced by the lower limbs of a walking human; the projection of head movement was then located in the spatio-temporal domain, followed by the identification of other joint trajectories. Finally, the contour of a walking figure was outlined using these joint trajectories, and a more accurate gait analysis was carried out on the outlined 2-D contour for the recognition of specific humans.

3.1.3. Volumetric models

The disadvantage of 2-D models is their restriction to the camera angle, so many researchers have tried to depict the geometric structure of the human body in more detail using 3-D models built from primitives such as elliptical cylinders, cones and spheres [84–88,127–130]. The more complex the 3-D volumetric model, the better the results that may be expected, but more parameters are required, leading to more expensive computation during the matching process. An early work by Rohr [84] used 14 elliptical cylinders to model the human body in 3-D. The origin of the coordinate system was fixed at the center of the torso. Eigenvector line fitting was applied to outline the human image, and the 2-D projections were then fitted to the 3-D human model using a similar distance measure. Aiming at generating 3-D descriptions of people by modeling, Wachter and Nagel [85] recently attempted to establish the correspondence between a 3-D body model of connected elliptical cones and a real image sequence. Based on iterative extended Kalman filtering, incorporating both edge and region information to determine the degrees of freedom of the joints and the orientation to the camera, they obtained a qualitative description of human motion in monocular image sequences. An important advantage of 3-D human models is the ability to handle occlusion and to obtain more significant data for action analysis. However, such approaches often rest on impractical simplifying assumptions that ignore body kinematic constraints, and they have high computational complexity as well.

3.2. Region-based tracking

The idea here is to identify a connected region associated with each moving object in an image, and then track it over time using a cross-correlation measure. Region-based tracking [118] is widely used today. For example, Wren et al. [37] explored the use of small blob features to track a single human in an indoor environment. In their work, a human body was considered as a combination of blobs representing various body parts such as the head, torso and four limbs. Both the human body and the background scene were modeled with Gaussian distributions, and the pixels belonging to the human body were assigned to the different body-part blobs using a log-likelihood measure. By tracking each small blob, the moving person could be successfully tracked. Recent work by McKenna et al. [39] proposed an adaptive background subtraction method combining color and gradient information to cope effectively with shadows and unreliable color cues in motion segmentation. Tracking was then performed at three levels of abstraction: regions, people, and groups. Each region, which could merge and split, had a bounding box.
A human was composed of one or more regions grouped together under geometric structure constraints of the human body, and a human group consisted of one or more people grouped together. Using the region tracker and individual color appearance models, they achieved good tracking of multiple people, even during occlusion.

The region-based tracking approach works reasonably well. However, difficulties arise in two important situations. The first is that of long shadows, which may result in connecting up blobs that should have been associated with separate people. This problem may be resolved to some extent by using color, or by exploiting the fact that shadow regions tend to be devoid of texture. The more serious, and so far intractable, problem for video tracking is that of congested situations, in which people partially occlude one another instead of being spatially isolated. This makes the task of segmenting individual humans very difficult; resolving it may require tracking systems that use multiple cameras.

3.3. Active-contour-based tracking

Tracking based on active contour models, or snakes [64,90–96], aims at directly extracting the shapes of the subjects. The idea is to represent the bounding contour of the object and to keep updating it dynamically over time. Active-contour-based tracking has been intensively studied over the past few years. For instance, Isard and Blake [91] adopted stochastic differential equations to describe complex motion models, and combined this approach with deformable templates for people tracking. Recent work by Paragios and Deriche [92] presented a variational framework for detecting and tracking multiple moving objects in image sequences. A statistical framework, in which the observed inter-frame difference density function was approximated by a mixture model, provided the initial motion detection boundary. The detection and tracking problems were then addressed in a common framework employing a geodesic active contour objective function. Using the level-set formulation, complex curves could be detected and tracked while topological changes of the evolving curves were handled naturally. Also, Peterfreund [94] explored a new active contour model based on a Kalman filter for tracking non-rigid moving targets, such as people, in spatio-velocity space. The model employed measurements of the gradient-based image potential and of the optical flow along the contour as system measurements, and an optical-flow-based detection mechanism was proposed to improve robustness to clutter and occlusion.

In contrast to region-based tracking, the advantage of an active-contour-based representation is reduced computational complexity. However, it requires a good initial fit: if a separate contour could somehow be initialized for each moving object, tracking could continue even in the presence of partial occlusion, but such initialization is quite difficult, especially for complex articulated objects.

3.4. Feature-based tracking

Abandoning the idea of tracking objects as a whole, this class of methods uses sub-features such as distinguishable points or lines on the object to realize the tracking task. Its benefit is that even in the presence of partial occlusion, some of the sub-features of the tracked objects remain visible. Feature-based tracking includes feature extraction and feature matching. Low-level features such as points are easier to extract.
It is relatively more difficult to track higher-level features such as lines and blobs, so there is usually a trade-off between feature complexity and tracking efficiency. Polana and Nelson's work [97] is a good example of point-feature tracking. In their work, a person was bounded by a rectangular box, whose centroid was selected as the feature point for tracking. Even when occlusion occurred between two subjects during tracking, as long as the velocities of the centroids could be distinguished effectively, tracking was still successful. In addition, Segen and Pingali's tracking system [101] used the corner points of moving silhouettes as the features to track; these feature points were matched between successive frames using a distance measure based on the positions and curvatures of the points. The tracking of point and line features based on Kalman filtering [76,131,115] has been well developed in the computer vision community. In recent work by Jang and Choi [76], an active template characterizing the regional and structural features of an object was built dynamically from the shape, texture, color, and edge features of the region. Using motion estimation based on a Kalman filter, a non-rigid moving object could be successfully tracked by minimizing a feature energy function during the matching process.

Another aspect of tracking, the use of multiple cameras [87,107–110], has recently been actively studied. Multi-camera tracking is very helpful for reducing ambiguity, handling occlusion and providing generally reliable data. As a good example, Cai et al. [107] proposed a probabilistic framework for tracking human motion in an indoor environment. Multivariate Gaussian models were applied to find the best matches of human subjects between consecutive frames taken by cameras mounted at various locations, and an automatic switching mechanism between neighboring cameras was discussed in detail. In the work by Utsumi [109], a multiple-view-based tracking algorithm for multiple-human motion in the surrounding environment was proposed. Human positions were tracked using Kalman filtering, and a best-viewpoint selection mechanism was used to address the problems of self-occlusion and mutual occlusion between people. A tracking system based on multiple cameras needs to decide which camera or image to use at each time instant; that is, how the selection of, and data fusion between, cameras are handled is an important problem for a successful multi-camera tracking system.
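Since several of the trackers above rest on Kalman filtering, a minimal constant-velocity filter for a blob centroid is sketched below (our own illustrative code; the noise levels are arbitrary):

```python
import numpy as np

class CentroidKalman:
    """Constant-velocity Kalman filter for a blob centroid,
    state = [x, y, vx, vy]: the standard predict/update recursion
    underlying many of the trackers cited above."""

    def __init__(self, x0, y0, q=1.0, r=4.0):
        self.x = np.array([x0, y0, 0.0, 0.0])
        self.P = np.eye(4) * 100.0                 # state covariance
        self.F = np.array([[1, 0, 1, 0],           # dynamics: x += vx, y += vy
                           [0, 1, 0, 1],
                           [0, 0, 1, 0],
                           [0, 0, 0, 1]], float)
        self.H = np.array([[1, 0, 0, 0],           # only (x, y) is observed
                           [0, 1, 0, 0]], float)
        self.Q = np.eye(4) * q                     # process noise
        self.R = np.eye(2) * r                     # measurement noise

    def step(self, z):
        # Predict.
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        # Update with the measured centroid z = (x, y).
        y = np.asarray(z, float) - self.H @ self.x
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)
        self.x = self.x + K @ y
        self.P = (np.eye(4) - K @ self.H) @ self.P
        return self.x[:2]                          # filtered position
```

In use, step() is called once per frame with the measured centroid; the predicted state also supports data association, i.e., deciding which detected blob in the next frame belongs to this track.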
4. Behavior understanding

After successfully tracking the moving humans from frame to frame in an image sequence, the problem of understanding human behaviors from image sequences follows naturally. Behavior understanding involves action recognition and description. As a final, long-term goal, human behavior understanding can guide the development of many human motion analysis systems, and in our opinion it will be the most important area of future research in human motion analysis.

Behavior understanding is the analysis and recognition of human motion patterns, and the production of high-level descriptions of actions and interactions. It may be considered simply as a classification problem on time-varying feature data, i.e., matching an unknown test sequence against a group of labeled reference sequences representing typical human actions. The basic problems of human behavior understanding are thus how to learn the reference action sequences from training samples, how to enable both training and matching to cope effectively with small variations in the spatial and temporal scales of similar classes of motion patterns, and how to interpret actions effectively using natural language. All of these are hard problems that have received increasing attention from researchers.

4.1. General techniques

Action recognition within behavior understanding may be regarded as a time-varying data matching problem. The general analytical methods for matching time-varying data are outlined in the following.

4.1.1. Dynamic time warping

Dynamic time warping (DTW) [132], widely used for speech recognition in the early days, is a template-based dynamic programming matching technique. It has the advantages of conceptual simplicity and robust performance, and has recently been used in the matching of human movement patterns [133,134]. With DTW, even if the time scales of a test pattern and a reference pattern are inconsistent, matching can still be established successfully as long as the time ordering constraints hold.

4.1.2. Hidden Markov models

Hidden Markov models (HMMs) [135], a kind of stochastic state machine, are a more sophisticated technique for analyzing time-varying data with spatio-temporal variability. An HMM consists of a hidden Markov chain and a finite set of output probability distributions, and its use involves two stages: training and classification. In the training stage, the number of states of the HMM must be specified, and the corresponding state transition and output probabilities are optimized so that the generated symbols correspond to the observed image features of the examples within a specific movement class. In the matching stage, the probability that a particular HMM generates the test symbol sequence corresponding to the observed image features is computed. HMMs are superior to DTW in processing unsegmented sequential data, and are therefore being applied extensively to the matching of human motion patterns [136–140].

4.1.3. Neural networks

The neural network (NN) [79,141] is also an interesting approach for analyzing time-varying data. As larger data sets become available, more emphasis is being placed on neural networks for representing temporal information. For example, Guo et al. [79] used a neural network to understand human motion patterns, and Rosenblum et al. [141] recognized human emotion from motion using a radial basis function network architecture.

In addition to the three approaches described above, principal component analysis (PCA) [142,143] and variants of HMMs and NNs, such as coupled hidden Markov models (CHMMs) [139], the variable-length Markov model (VLMM) [144] and the time-delay neural network (TDNN) [145], have also appeared in the literature.
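For concreteness, a minimal implementation of the DTW matching of Section 4.1.1 follows (our own illustrative code; practical systems add slope constraints and path normalization):

```python
import numpy as np

def dtw_distance(a, b):
    """Dynamic time warping between feature sequences a (n x d) and
    b (m x d): the minimum cumulative frame-to-frame cost over all
    monotonic alignments, which tolerates the differing time scales
    discussed above as long as time ordering is preserved."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    if a.ndim == 1:
        a = a[:, None]            # treat a scalar series as d = 1
    if b.ndim == 1:
        b = b[:, None]
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])
            # match, or stretch one sequence relative to the other
            D[i, j] = cost + min(D[i - 1, j - 1], D[i - 1, j], D[i, j - 1])
    return D[n, m]
```

A test sequence is then classified by the label of the reference sequence yielding the smallest dtw_distance.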
4.2. Action recognition

Similar to the survey by Aggarwal and Cai [31], we discuss human activity and action recognition under the following groups of approaches.

4.2.1. Template matching

Approaches based on template matching [22,97,146–148,131] first convert an image sequence into a static shape pattern, and then compare it to prestored action prototypes during recognition. In the early work by Polana and Nelson [97], features consisting of 2-D meshes were used to recognize human action. First, optical flow fields were computed between successive frames, and each flow frame was decomposed into a spatial grid in both the horizontal and vertical directions. The motion magnitude in each cell was then accumulated to form a high-dimensional feature vector for recognition. In order to normalize the duration of the motion, they assumed that human motion was periodic, so that the entire sequence could be divided into repeating cycles of the activity, which were averaged into temporal divisions. Finally, they adopted a nearest-neighbor algorithm for human action recognition.

Recent work by Bobick and Davis [147] proposed a view-based approach to the representation and recognition of action using temporal templates. They made use of the binary Motion Energy Image (MEI) and the Motion History Image (MHI) to interpret human movement in an image sequence. First, the motion images in a sequence were extracted by differencing and accumulated over time to form the MEI; the MEI was then enhanced into the MHI, a scalar-valued image. Taken together, the MEI and MHI can be considered as a two-component version of a temporal template, a vector-valued image in which each component of each pixel is some function of the motion at that pixel position. Finally, these view-specific templates were matched against stored models of views of known actions during the recognition process. Based on PCA, Chomat and Crowley [142] generated motion templates using a set of spatio-temporal filters computed by PCA, and a Bayes classifier was used to perform action selection.

The advantages of template matching are its low computational complexity and simple implementation. However, it is usually more susceptible to noise and to variations in the durations of movements, and it is viewpoint dependent.
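The temporal templates of Bobick and Davis can be sketched in a few lines (our own illustrative code; in the original work, matching is performed on moment-based descriptors of the resulting images against stored action models):

```python
import numpy as np

def motion_history(silhouettes, tau=30):
    """Build the MEI/MHI temporal templates from a sequence of binary
    motion silhouettes: each MHI pixel holds a value that decays with
    the time since motion last occurred there, and the MEI is simply
    the thresholded (binary) MHI."""
    mhi = np.zeros(silhouettes[0].shape, np.float32)
    for d in silhouettes:                       # d: binary motion mask
        mhi = np.where(d > 0, float(tau), np.maximum(mhi - 1.0, 0.0))
    mei = (mhi > 0).astype(np.uint8)
    return mei, mhi
```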
4.2.2. State-space approaches

Approaches based on state-space models [136–139,52,149] define each static posture as a state, and use certain probabilities to connect these states; any motion sequence can then be considered as a tour through the states of these static postures. The tour with the maximum joint probability is selected as the criterion for action classification. State-space models have been widely applied to the prediction, estimation, and detection of time series, and the HMM is the most representative method for studying discrete time series. For example, Yamato et al. [138] made use of the mesh features of 2-D moving human blobs, such as motion, color and texture, to identify human behavior. In the learning stage, HMMs were trained to generate symbolic patterns for each action class, and the model parameters were optimized using the forward-backward algorithm. In the recognition stage, given an image sequence, the output of the forward calculation was used to identify the action. As an improved version, Brand et al. [139] applied coupled HMMs to recognize human actions. Moreover, in recent work by Bregler [52], a comprehensive framework using a statistical decomposition of human body dynamics at different levels of abstraction was presented to recognize human motion. In the low-level processing, small blobs were estimated as mixtures of Gaussian models based on motion and color similarity, spatial proximity, and prediction from previous frames, while the regions of various body parts were implicitly tracked over time. During the intermediate-level processing, regions with coherent motion were fitted with simple movements of dynamic systems. Finally, HMMs were used as mixtures of these intermediate-level dynamic systems to represent complex motion. Given an input image sequence, recognition was accomplished by maximizing the posterior probability of the HMM.

Although the state-space approach may overcome the disadvantages of template matching, it usually involves complex iterative computation, and how to select the proper number of states and the dimensionality of the feature vector remains a difficult issue. As a whole, recognition of human actions is still in its infancy, and there exists a trade-off between computational cost and accuracy; it thus remains an open area deserving further attention.

4.3. Semantic description

Applying concepts from natural language to vision systems is becoming popular, and the semantic description of human behaviors [150–156] has recently received considerable attention. Its purpose is to choose a suitable group of motion words or short expressions to report the behaviors of moving objects in natural scenes. As a good example, Intille and Bobick [152] provided an automated annotation system for sport scenes, in which each formation of players was represented by belief networks based on visual evidence and temporal constraints. Another work by Remagnino et al. [150] proposed an event-based surveillance system involving pedestrians and vehicles, which could provide text-based descriptions of the dynamic actions and interactions of moving objects in 3-D scenes. Recently, Kojima et al. [151] proposed a new method to generate natural language descriptions of human behaviors appearing in real video sequences. First, the head region of a human, as a portion of the whole body, was extracted from each frame, and its 3-D pose and position were estimated using a model-based approach. Next, the trajectory of these parameters was divided into segments of monotonous movement, and the conceptual features of each segment, such as the degree of change of pose and position and the relative distance from other objects in the surroundings, were evaluated. Meanwhile, the most suitable verbs and other syntactic elements were selected. Finally, natural language text interpreting the human behaviors was generated by machine translation technology.

Compared with vehicle movement, the description of human motion in image sequences is more complex. Moreover, natural language contains inherently varied concepts of actions, events and states, so selecting effective and adequate expressions to convey the meanings of scenes is quite difficult. At present, human behavior description is still restricted to simple and specific action patterns; research on the semantic description of human behaviors in complex, unconstrained scenes remains open.

5. Discussions

Although a large amount of work has been done in human motion analysis, many issues are still open and deserve further research, especially in the following areas:

(1) Segmentation. Fast and accurate motion segmentation is a significant but difficult problem. The images captured in dynamic environments are often affected by many factors such as weather, lighting, clutter, shadow, occlusion, and even camera motion. Taking shadows as just one example, they may either be in contact with the detected object or disconnected from it.
In the first case, the shadow distorts the object's shape, making subsequent shape recognition methods less reliable. In the second case, the shadow may be classified as a wholly erroneous object in the scene. Nearly every system for human motion analysis starts with segmentation, so segmentation is of fundamental importance. Although current motion segmentation methods mainly focus on background subtraction, how to develop more reliable background models that adapt to dynamic changes in complex environments is still a challenge.

(2) Occlusion handling. At present, the majority of human motion analysis systems cannot effectively handle self-occlusion of the human body or mutual occlusion between objects, especially when detecting and tracking multiple people under congested conditions. Typically, during occlusion, only portions of each person are visible, often at very low resolution. This problem is generally intractable, and motion segmentation based on background subtraction may become unreliable. To reduce the ambiguities due to occlusion, better models need to be developed to cope with the correspondence problem between features and body parts. Interesting progress is being made using statistical methods [157], which essentially try to predict body pose, position, and so on, from the available image information. Perhaps the most promising practical method for addressing occlusion is the use of multiple cameras.

(3) 3-D modeling and tracking. 2-D approaches have shown some early successes in the visual analysis of human motion, especially in low-resolution application areas where precise posture reconstruction is not needed (e.g., pedestrian tracking in a traffic surveillance setting). However, the major drawback of 2-D approaches is their restriction to particular camera angles. Compared with 2-D approaches, 3-D approaches are more effective for accurate estimation in physical space, effective handling of occlusion, and high-level judgments about complex human movements such as wandering around, shaking hands and dancing [87,107–109]. However, 3-D tracking requires more parameters and more computation during the matching process. As a whole, current research on 3-D tracking is still in its infancy. Vision-based 3-D tracking also brings a number of challenges such as the acquisition of human models [158], occlusion handling, and parameterized body modeling [159–163]. Taking modeling as an example, human models for vision have been adequately parameterized by various shape parameters, but few have incorporated joint constraints or the dynamical properties of body parts. Also, almost all past work assumes that the 3-D model is fully specified in advance according to prior assumptions; in practice, the shape parameters of the 3-D model need to be estimated from the images. So 3-D modeling and tracking deserve more attention in future work.

(4) Use of multiple cameras. It is obvious that future systems for human motion analysis will benefit greatly from the use of multiple cameras. Information from multiple cameras can be extremely helpful because it not only expands the surveillance area, but also provides multiple viewpoints for resolving occlusion effectively. Tracking with a single camera easily produces ambiguity due to occlusion or depth, which may be eliminated from another view. For multi-camera tracking systems, one needs to decide which camera or image to use at each time instant.
(2) Occlusion handling. At present, the majority of human motion analysis systems cannot effectively handle self-occlusion of the human body or mutual occlusions between objects, especially in the detection and tracking of multiple people under congested conditions. Typically, during occlusions only portions of each person are visible, often at very low resolution. This problem is generally intractable, and motion segmentation based on background subtraction may become unreliable. To reduce ambiguities due to occlusion, better models need to be developed to cope with the correspondence problem between features and body parts. Interesting progress is being made using statistical methods [157], which essentially try to predict body pose, position and so on from the available image information. Perhaps the most promising practical method for addressing occlusion is the use of multiple cameras.

(3) 3-D modeling and tracking. 2-D approaches have shown some early successes in the visual analysis of human motion, especially in low-resolution applications where precise posture reconstruction is not needed (e.g., pedestrian tracking in a traffic surveillance setting). However, the major drawback of 2-D approaches is their restriction on camera angle. Compared with 2-D approaches, 3-D approaches are more effective for accurate estimation in physical space, for handling occlusion, and for high-level discrimination between complex human movements such as wandering around, shaking hands and dancing [87,107–109]. However, 3-D tracking requires more parameters and more computation during the matching process, and current research on 3-D tracking is still in its infancy. Vision-based 3-D tracking also brings a number of challenges, such as the acquisition of the human model [158], occlusion handling, and parameterized body modeling [159–163]. Taking modeling as an example, human models for vision have been adequately parameterized by various shape parameters, but few incorporate joint constraints or the dynamical properties of body parts. Moreover, almost all past work assumes that the 3-D model is fully specified in advance according to prior assumptions, whereas in practice the shape parameters of the 3-D model need to be estimated from the images. So 3-D modeling and tracking deserve more attention in future work.

(4) Use of multiple cameras. It is obvious that future human motion analysis systems will greatly benefit from the use of multiple cameras, which not only expand the surveillance area but also provide multiple viewpoints for resolving occlusion effectively. Tracking with a single camera easily generates ambiguity due to occlusion or depth, which may be eliminated from another view. For multi-camera tracking systems, one needs to decide which camera or image to use at each time instant; that is, the coordination of cameras and the fusion of information between them are significant problems, as the toy sketch below illustrates.
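The sketch below illustrates both problems in miniature (coordination: which view to use; fusion: how to combine views). It assumes the hard parts are already solved: the cameras are calibrated to a common ground plane, the target has been matched across views, and each view reports a position estimate with a visibility score. All names and numbers are invented for the example.

    from dataclasses import dataclass

    @dataclass
    class ViewEstimate:
        camera_id: str
        position: tuple      # estimated (x, y) on a common ground plane
        visibility: float    # in [0, 1], e.g. unoccluded fraction of the target

    def select_camera(estimates):
        # Coordination: hand the target to the least-occluded view.
        return max(estimates, key=lambda e: e.visibility)

    def fuse_positions(estimates):
        # Fusion: visibility-weighted average of the per-view positions.
        total = sum(e.visibility for e in estimates)
        x = sum(e.visibility * e.position[0] for e in estimates) / total
        y = sum(e.visibility * e.position[1] for e in estimates) / total
        return x, y

    views = [ViewEstimate("cam1", (2.0, 3.1), 0.9),
             ViewEstimate("cam2", (2.4, 2.8), 0.3)]   # cam2 partly occluded
    print(select_camera(views).camera_id, fuse_positions(views))

A deployed system would of course add cross-camera data association and temporal filtering (e.g., a Kalman filter per target), but the sketch shows where the coordination and fusion decisions enter.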
(5) Action understanding. Since the final objective of "looking at people" is to analyze and interpret human actions and the interactions between people and other objects, better understanding of human behaviors is the most interesting long-term open issue facing human motion analysis. For instance, the W4 system [2] can recognize simple events between people and objects, such as carrying an object, depositing an object, and exchanging bags. However, human motion understanding still stresses the tracking and recognition of standard postures and simple action analysis, e.g., the definition and classification of a group of typical actions (running, standing, jumping, climbing, pointing, etc.). Some recent progress has been made in building statistical models of human behaviors using machine learning, but action recognition is still in its infancy, and restrictions are usually imposed to decrease ambiguity during the matching of feature sequences. The difficulties of behavior understanding therefore still lie in feature selection and machine learning. The state-space and template matching approaches to action recognition currently trade off computational cost against recognition accuracy, so efforts should be made to improve the performance of behavior recognition while effectively reducing computational complexity. Furthermore, we should make full use of existing achievements from areas such as artificial intelligence to extend current simple action recognition to higher-level natural language description in more complex scenes.

(6) Performance evaluation. Generally speaking, robustness, accuracy and speed are the three major demands on practical human motion analysis systems [33]. Robustness is very important for surveillance applications because they are usually required to work automatically and continuously; such systems should be insensitive to noise, lighting, weather, clothing, etc., and the fewer assumptions a system imposes on its operational conditions, the better. Accuracy is important for behavior recognition in surveillance or control situations. Processing speed also deserves more attention, especially in surveillance settings where real-time operation is needed. It will be important to test the robustness of any system on large amounts of data, with a number of different users, and in various environments, and it is an interesting direction to find more effective ideas for real-time and accurate online processing. Incorporating various data types and processing methods seems helpful, and even necessary, to make a human motion analysis system robust to all possible situations.

Another interesting topic for future research is the combination of human motion analysis and biometrics, which is becoming increasingly important for security-sensitive applications. For instance, a surveillance system can recognize an intruder for access control by tracking and recognizing his or her face at near distance. If the distance is too great, facial features may be at too low a resolution to recognize; instead, human gait can be employed as a new biometric feature for personal recognition. As such, human gait has recently attracted interest from many researchers [13–16]. Also, in advanced human–machine interfaces, the machine should not only sense human presence, position and behavior, but also know who the user is through biometric identification.

6. Conclusions

Computer-vision-based human motion analysis has become an active research area. It is strongly driven by many promising applications such as smart surveillance, virtual reality, and advanced user interfaces. Recent technical developments have demonstrated that visual systems can successfully deal with complex human movements, and it is exciting to see many researchers gradually carrying their achievements into more intelligent practical applications.

Bearing in mind a general processing framework for human motion analysis systems, we have presented an overview of recent developments in human motion analysis in this paper. The state of the art in each key issue has been described, with a focus on three major tasks: detection, tracking and behavior understanding. Human detection involves motion segmentation and object classification; four types of techniques for motion segmentation were addressed (background subtraction, statistical methods, temporal differencing and optical flow), of which the statistical methods may be the better choice in more unconstrained situations. Tracking objects is equivalent to establishing correspondences of image features between frames; we discussed four intensively studied approaches: model-based, active-contour-based, region-based and feature-based. The task of recognizing human activity in image sequences assumes that feature tracking has been accomplished; two types of techniques were reviewed, template matching and state-space approaches. In addition, we examined the state of the art in human behavior description. Although a large amount of work has been done in this area, many issues remain open, such as segmentation, modeling and occlusion handling. At the end of this survey, we have given some detailed discussion of research difficulties and future directions in human motion analysis.

Acknowledgements

The authors would like to thank H.Z. Ning and the referee for their valuable suggestions. This work is supported in part by NSFC (Grant Nos. 69825105 and 60105002) and the Institute of Automation (Grant No. 1M01J02), Chinese Academy of Sciences.

References

[1] R.T. Collins, et al., A system for video surveillance and monitoring: VSAM final report, Technical Report CMU-RI-TR-00-12, Carnegie Mellon University, 2000.
[2] I. Haritaoglu, D. Harwood, L.S. Davis, W4: real-time surveillance of people and their activities, IEEE Trans. Pattern Anal. Mach. Intell. 22 (8) (2000) 809–830.
[3] P. Remagnino, T. Tan, K. Baker, Multi-agent visual surveillance of dynamic scenes, Image Vision Comput. 16 (8) (1998) 529–532.
[4] C. Maggioni, B. Kammerer, Gesture computer: history, design, and applications, in: R. Cipolla, A. Pentland (Eds.), Computer Vision for Human–Machine Interaction, Cambridge University Press, Cambridge, 1998.
[5] W. Freeman, C. Weissman, Television control by hand gestures, Proceedings of the International Conference on Automatic Face and Gesture Recognition, 1995, pp. 179–183.
[6] D.M. Gavrila, The visual analysis of human movement: a survey, Comput. Vision Image Understanding 73 (1) (1999) 82–98.
[7] R.T. Collins, A.J. Lipton, T. Kanade, Introduction to the special section on video surveillance, IEEE Trans. Pattern Anal. Mach. Intell. 22 (8) (2000) 745–746.
[8] S. Maybank, T. Tan, Introduction to special section on visual surveillance, Int. J. Comput. Vision 37 (2) (2000) 173.
[9] J. Steffens, E. Elagin, H. Neven, PersonSpotter: fast and robust system for human detection, tracking and recognition, Proceedings of the IEEE International Conference on Automatic Face and Gesture Recognition, 1998, pp. 516–521.
[10] J. Yang, A. Waibel, A real-time face tracker, Proceedings of the IEEE CS Workshop on Applications of Computer Vision, Sarasota, FL, 1996, pp. 142–147.
[11] B. Moghaddam, W. Wahid, A. Pentland, Beyond eigenfaces: probabilistic matching for face recognition, Proceedings of the IEEE International Conference on Automatic Face and Gesture Recognition, 1998, pp. 30–35.
[12] C. Wang, M.S. Brandstein, A hybrid real-time face tracking system, Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, Seattle, WA, 1998.
[13] J.J. Little, J.E. Boyd, Recognizing people by their gait: the shape of motion, J. Comput. Vision Res. 1 (2) (1998).
[14] J.D. Shutler, M.S. Nixon, C.J. Harris, Statistical gait recognition via velocity moments, Proceedings of the IEE Colloquium on Visual Biometrics, 2000, pp. 10/1–10/5.
[15] P.S. Huang, C.J. Harris, M.S. Nixon, Human gait recognition in canonical space using temporal templates, Proc. IEE Vision Image Signal Process. 146 (2) (1999) 93–100.
[16] D. Cunado, M.S. Nixon, J.N. Carter, Automatic gait recognition via model-based evidence gathering, Proceedings of the Workshop on Automatic Identification Advanced Technologies, New Jersey, 1998, pp. 27–30.
[17] B.A. Boghossian, S.A. Velastin, Image processing system for pedestrian monitoring using neural classification of normal motion patterns, Meas. Control 32 (9) (1999) 261–264.
[18] B.A. Boghossian, S.A. Velastin, Motion-based machine vision techniques for the management of large crowds, Proceedings of the IEEE Sixth International Conference on Electronics, Circuits and Systems, September 5–8, 1999.
[19] Y. Li, S. Ma, H. Lu, Human posture recognition using multi-scale morphological method and Kalman motion estimation, Proceedings of the IEEE International Conference on Pattern Recognition, 1998, pp. 175–177.
[20] J. Segen, S. Kumar, Shadow gestures: 3D hand pose estimation using a single camera, Proceedings of the IEEE CS Conference on Computer Vision and Pattern Recognition, 1999, pp. 479–485.
[21] M.-H. Yang, N. Ahuja, Recognizing hand gesture using motion trajectories, Proceedings of the IEEE CS Conference on Computer Vision and Pattern Recognition, 1999, pp. 468–472.
[22] Y. Cui, J.J. Weng, Hand segmentation using learning-based prediction and verification for hand sign recognition, Proceedings of the IEEE CS Conference on Computer Vision and Pattern Recognition, 1997, pp. 88–93.
[23] M. Turk, Visual interaction with lifelike characters, Proceedings of the IEEE International Conference on Automatic Face and Gesture Recognition, Killington, 1996, pp. 368–373.
[24] H.M. Lakany, G.M. Hayes, M. Hazlewood, S.J. Hillman, Human walking: tracking and analysis, Proceedings of the IEE Colloquium on Motion Analysis and Tracking, 1999, pp. 5/1–5/14.
[25] M. Kohle, D. Merkl, J. Kastner, Clinical gait analysis by neural networks: issues and experiences, Proceedings of the IEEE Symposium on Computer-Based Medical Systems, 1997, pp. 138–143.
[26] D. Meyer, J. Denzler, H. Niemann, Model based extraction of articulated objects in image sequences for gait analysis, Proceedings of the IEEE International Conference on Image Processing, 1997, pp. 78–81.
[27] W. Freeman, et al., Computer vision for computer games, Proceedings of the International Conference on Automatic Face and Gesture Recognition, 1996, pp. 100–105.
[28] J.K. Aggarwal, Q. Cai, W. Liao, B. Sabata, Articulated and elastic non-rigid motion: a review, Proceedings of the IEEE Workshop on Motion of Non-Rigid and Articulated Objects, 1994, pp. 2–14.
[29] C. Cedras, M. Shah, Motion-based recognition: a survey, Image Vision Comput. 13 (2) (1995) 129–155.
[30] J.K. Aggarwal, Q. Cai, Human motion analysis: a review, Proceedings of the IEEE Workshop on Motion of Non-Rigid and Articulated Objects, 1997, pp. 90–102.
[31] J.K. Aggarwal, Q. Cai, Human motion analysis: a review, Comput. Vision Image Understanding 73 (3) (1999) 428–440.
[32] A. Pentland, Looking at people: sensing for ubiquitous and wearable computing, IEEE Trans. Pattern Anal. Mach. Intell. 22 (1) (2000) 107–119.
[33] T.B. Moeslund, E. Granum, A survey of computer vision-based human motion capture, Comput. Vision Image Understanding 81 (3) (2001) 231–268.
[34] K.P. Karmann, A. Brandt, Moving object recognition using an adaptive background memory, in: V. Cappellini (Ed.), Time-Varying Image Processing and Moving Object Recognition, Vol. 2, Elsevier, Amsterdam, The Netherlands, 1990.
[35] M. Kilger, A shadow handler in a video-based real-time traffic monitoring system, Proceedings of the IEEE Workshop on Applications of Computer Vision, 1992, pp. 1060–1066.
[36] Y.H. Yang, M.D. Levine, The background primal sketch: an approach for tracking moving objects, Mach. Vision Appl. 5 (1992) 17–34.
[37] C.R. Wren, A. Azarbayejani, T. Darrell, A.P. Pentland, Pfinder: real-time tracking of the human body, IEEE Trans. Pattern Anal. Mach. Intell. 19 (7) (1997) 780–785.
[38] C. Stauffer, W. Grimson, Adaptive background mixture models for real-time tracking, Proceedings of the IEEE CS Conference on Computer Vision and Pattern Recognition, Vol. 2, 1999, pp. 246–252.
[39] S.J. McKenna, et al., Tracking groups of people, Comput. Vision Image Understanding 80 (1) (2000) 42–56.
[40] S. Arseneau, J.R. Cooperstock, Real-time image segmentation for action recognition, Proceedings of the IEEE Pacific Rim Conference on Communications, Computers and Signal Processing, 1999, pp. 86–89.
[41] H.Z. Sun, T. Feng, T.N. Tan, Robust extraction of moving objects from image sequences, Proceedings of the Fourth Asian Conference on Computer Vision, Taiwan, 2000, pp. 961–964.
[42] A. Elgammal, D. Harwood, L.S. Davis, Non-parametric model for background subtraction, Proceedings of the Sixth European Conference on Computer Vision, 2000.
[43] A.J. Lipton, H. Fujiyoshi, R.S. Patil, Moving target classification and tracking from real-time video, Proceedings of the IEEE Workshop on Applications of Computer Vision, 1998, pp. 8–14.
[44] C. Anderson, P. Bert, G. Vander Wal, Change detection and tracking using pyramid transformation techniques, Proceedings of the SPIE Intelligent Robots and Computer Vision, Vol. 579, 1985, pp. 72–78.
[45] J.R. Bergen, et al., A three-frame algorithm for estimating two-component image motion, IEEE Trans. Pattern Anal. Mach. Intell. 14 (9) (1992) 886–896.
[46] Y. Kameda, M. Minoh, A human motion estimation method using 3-successive video frames, Proceedings of the International Conference on Virtual Systems and Multimedia, 1996.
[47] A. Verri, S. Uras, E. DeMicheli, Motion segmentation from optical flow, Proceedings of the Fifth Alvey Vision Conference, 1989, pp. 209–214.
[48] A. Meygret, M. Thonnat, Segmentation of optical flow and 3D data for the interpretation of mobile objects, Proceedings of the International Conference on Computer Vision, Osaka, Japan, December 1990.
[49] J. Barron, D. Fleet, S. Beauchemin, Performance of optical flow techniques, Int. J. Comput. Vision 12 (1) (1994) 42–77.
[50] H.A. Rowley, J.M. Rehg, Analyzing articulated motion using expectation-maximization, Proceedings of the International Conference on Pattern Recognition, 1997, pp. 935–941.
[51] A.M. Baumberg, D. Hogg, Learning spatio-temporal models from training examples, Technical Report, University of Leeds, September 1995.
[52] C. Bregler, Learning and recognizing human dynamics in video sequences, Proceedings of the IEEE CS Conference on Computer Vision and Pattern Recognition, 1997, pp. 568–574.
[53] G.J. McLachlan, T. Krishnan, The EM Algorithm and Extensions, Wiley Interscience, New York, 1997.
[54] N. Friedman, S. Russell, Image segmentation in video sequences: a probabilistic approach, Proceedings of the 13th Conference on Uncertainty in Artificial Intelligence, August 1–3, 1997.
[55] E. Stringa, Morphological change detection algorithms for surveillance applications, British Machine Vision Conference, 2000, pp. 402–411.
[56] Y. Kuno, T. Watanabe, Y. Shimosakoda, S. Nakagawa, Automated detection of human for visual surveillance system, Proceedings of the International Conference on Pattern Recognition, 1996, pp. 865–869.
[57] R. Cutler, L.S. Davis, Robust real-time periodic motion detection, analysis, and applications, IEEE Trans. Pattern Anal. Mach. Intell. 22 (8) (2000) 781–796.
[58] A.J. Lipton, Local application of optic flow to analyse rigid versus non-rigid motion, available at https://www.eecs.lehigh.edu/FRAME/Lipton/iccvframe.html.
[59] A. Selinger, L. Wixson, Classifying moving objects as rigid or non-rigid without correspondences, Proceedings of the DARPA Image Understanding Workshop, Vol. 1, 1998, pp. 341–358.
[60] M. Oren, et al., Pedestrian detection using wavelet templates, Proceedings of the IEEE CS Conference on Computer Vision and Pattern Recognition, 1997, pp. 193–199.
[61] C. Stauffer, Automatic hierarchical classification using time-based co-occurrences, Proceedings of the IEEE CS Conference on Computer Vision and Pattern Recognition, 1999, pp. 333–339.
[62] A. Iketani, et al., Detecting persons on changing background, Proceedings of the International Conference on Pattern Recognition, Vol. 1, 1998, pp. 74–76.
[63] A.M. Elgammal, L.S. Davis, Probabilistic framework for segmenting people under occlusion, Proceedings of the International Conference on Computer Vision, 2001.
[64] D. Meyer, J. Denzler, H. Niemann, Model based extraction of articulated objects in image sequences, Proceedings of the Fourth International Conference on Image Processing, 1997.
[65] A. Mohan, C. Papageorgiou, T. Poggio, Example-based object detection in images by components, IEEE Trans. Pattern Anal. Mach. Intell. 23 (4) (2001) 349–361.
[66] L. Zhao, C. Thorpe, Recursive context reasoning for human detection and parts identification, Proceedings of the IEEE Workshop on Human Modeling, Analysis and Synthesis, June 2000.
[67] S. Ioffe, D. Forsyth, Probabilistic methods for finding people, Int. J. Comput. Vision 43 (1) (2001) 45–68.
[68] G. Welch, G. Bishop, An introduction to the Kalman filter, Technical Report TR95-041, UNC-Chapel Hill, November 2000, available from https://www.cs.unc.edu.
[69] M. Isard, A. Blake, Condensation: conditional density propagation for visual tracking, Int. J. Comput. Vision 29 (1) (1998) 5–28.
[70] H. Sidenbladh, M.J. Black, D.J. Fleet, Stochastic tracking of 3D human figures using 2D image motion, Proceedings of the European Conference on Computer Vision, 2000.
[71] V. Pavlovic, J.M. Rehg, T.-J. Cham, K.P. Murphy, A dynamic Bayesian network approach to figure tracking using learned dynamic models, Proceedings of the International Conference on Computer Vision, 1999, pp. 94–101.
[72] L. Goncalves, E.D. Bernardo, E. Ursella, P. Perona, Monocular tracking of the human arm in 3D, Proceedings of the Fifth International Conference on Computer Vision, Cambridge, 1995, pp. 764–770.
[73] J. Rehg, T. Kanade, Visual tracking of high DOF articulated structures: an application to human hand tracking, Proceedings of the European Conference on Computer Vision, 1994, pp. 35–46.
[74] D. Meyer, et al., Gait classification with HMMs for trajectories of body parts extracted by mixture densities, British Machine Vision Conference, 1998, pp. 459–468.
[75] P. Fieguth, D. Terzopoulos, Color-based tracking of heads and other mobile objects at video frame rate, Proceedings of the IEEE CS Conference on Computer Vision and Pattern Recognition, 1997, pp. 21–27.
[76] D.-S. Jang, H.-I. Choi, Active models for tracking moving objects, Pattern Recognition 33 (7) (2000) 1135–1146.
[77] I.A. Karaulova, P.M. Hall, A.D. Marshall, A hierarchical model of dynamics for tracking people with a single video camera, British Machine Vision Conference, 2000, pp. 352–361.
[78] Y. Guo, G. Xu, S. Tsuji, Tracking human body motion based on a stick figure model, J. Visual Commun. Image Representation 5 (1994) 1–9.
[79] Y. Guo, G. Xu, S. Tsuji, Understanding human motion patterns, Proceedings of the International Conference on Pattern Recognition, 1994, pp. 325–329.
[80] M.K. Leung, Y.H. Yang, First sight: a human body outline labeling system, IEEE Trans. Pattern Anal. Mach. Intell. 17 (4) (1995) 359–377.
[81] I.-C. Chang, C.-L. Huang, Ribbon-based motion analysis of human body movements, Proceedings of the International Conference on Pattern Recognition, Vienna, 1996, pp. 436–440.
[82] S.A. Niyogi, E.H. Adelson, Analyzing and recognizing walking figures in XYT, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 1994, pp. 469–474.
[83] S. Ju, M. Black, Y. Yacoob, Cardboard people: a parameterized model of articulated image motion, Proceedings of the IEEE International Conference on Automatic Face and Gesture Recognition, 1996, pp. 38–44.
[84] K. Rohr, Towards model-based recognition of human movements in image sequences, CVGIP: Image Understanding 59 (1) (1994) 94–115.
[85] S. Wachter, H.-H. Nagel, Tracking persons in monocular image sequences, Comput. Vision Image Understanding 74 (3) (1999) 174–192.
[86] J.M. Rehg, T. Kanade, Model-based tracking of self-occluding articulated objects, Proceedings of the Fifth International Conference on Computer Vision, Cambridge, 1995, pp. 612–617.
[87] I.A. Kakadiaris, D. Metaxas, Model-based estimation of 3-D human motion with occlusion based on active multi-viewpoint selection, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, San Francisco, 1996, pp. 81–87.
[88] N. Goddard, Incremental model-based discrimination of articulated movement from motion features, Proceedings of the IEEE Workshop on Motion of Non-Rigid and Articulated Objects, Austin, 1994, pp. 89–94.
[89] J. Badenas, J. Sanchiz, F. Pla, Motion-based segmentation and region tracking in image sequences, Pattern Recognition 34 (2001) 661–670.
[90] A. Baumberg, D. Hogg, An efficient method for contour tracking using active shape models, Proceedings of the IEEE Workshop on Motion of Non-Rigid and Articulated Objects, Austin, 1994, pp. 194–199.
[91] M. Isard, A. Blake, Contour tracking by stochastic propagation of conditional density, Proceedings of the European Conference on Computer Vision, 1996, pp. 343–356.
[92] N. Paragios, R. Deriche, Geodesic active contours and level sets for the detection and tracking of moving objects, IEEE Trans. Pattern Anal. Mach. Intell. 22 (3) (2000) 266–280.
[93] M. Bertalmio, G. Sapiro, G. Randall, Morphing active contours, IEEE Trans. Pattern Anal. Mach. Intell. 22 (7) (2000) 733–737.
[94] N. Peterfreund, Robust tracking of position and velocity with Kalman snakes, IEEE Trans. Pattern Anal. Mach. Intell. 22 (6) (2000) 564–569.
[95] Y. Zhong, A.K. Jain, M.P. Dubuisson-Jolly, Object tracking using deformable templates, IEEE Trans. Pattern Anal. Mach. Intell. 22 (5) (2000) 544–549.
[96] A. Baumberg, D. Hogg, Generating spatio-temporal models from examples, Image Vision Comput. 14 (8) (1996) 525–532.
[97] R. Polana, R. Nelson, Low level recognition of human motion, Proceedings of the IEEE CS Workshop on Motion of Non-Rigid and Articulated Objects, Austin, TX, 1994, pp. 77–82.
[98] P. Tissainayagam, D. Suter, Visual tracking with automatic motion model switching, Pattern Recognition 34 (2001) 641–660.
[99] A. Azarbayejani, A. Pentland, Real-time self-calibrating stereo person tracking using 3D shape estimation from blob features, Proceedings of the International Conference on Pattern Recognition, 1996, pp. 627–632.
[100] Q. Cai, A. Mitiche, J.K. Aggarwal, Tracking human motion in an indoor environment, Proceedings of the International Conference on Image Processing, Vol. 1, 1995, pp. 215–218.
[101] J. Segen, S. Pingali, A camera-based system for tracking people in real time, Proceedings of the International Conference on Pattern Recognition, 1996, pp. 63–67.
[102] T.-J. Cham, J.M. Rehg, A multiple hypothesis approach to figure tracking, Proceedings of the IEEE CS Conference on Computer Vision and Pattern Recognition, 1999, pp. 239–245.
[103] Y. Ricquebourg, P. Bouthemy, Real-time tracking of moving persons by exploiting spatio-temporal image slices, IEEE Trans. Pattern Anal. Mach. Intell. 22 (8) (2000) 797–808.
[104] T. Darrell, G. Gordon, M. Harville, J. Woodfill, Integrated person tracking using stereo, color, and pattern detection, Int. J. Comput. Vision 37 (2) (2000) 175–185.
[105] M. Rossi, A. Bozzoli, Tracking and counting people, Proceedings of the International Conference on Image Processing, Austin, 1994, pp. 212–216.
[106] H. Fujiyoshi, A.J. Lipton, Real-time human motion analysis by image skeletonization, Proceedings of the IEEE Workshop on Applications of Computer Vision, 1998, pp. 15–21.
[107] Q. Cai, J.K. Aggarwal, Tracking human motion using multiple cameras, Proceedings of the 13th International Conference on Pattern Recognition, 1996, pp. 68–72.
[108] D. Gavrila, L. Davis, 3-D model-based tracking of humans in action: a multi-view approach, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, San Francisco, 1996, pp. 73–80.
[109] A. Utsumi, H. Mori, J. Ohya, M. Yachida, Multiple-view-based tracking of multiple humans, Proceedings of the International Conference on Pattern Recognition, 1998, pp. 597–601.
[110] T.H. Chang, S. Gong, E.J. Ong, Tracking multiple people under occlusion using multiple cameras, British Machine Vision Conference, 2000, pp. 566–575.
[111] T. Boult, Frame-rate multi-body tracking for surveillance, Proceedings of the DARPA Image Understanding Workshop, Morgan Kaufmann, Monterey, CA, November 1998.
[112] Q. Zheng, R. Chellappa, Automatic feature point extraction and tracking in image sequences for arbitrary camera motion, Int. J. Comput. Vision 15 (1995) 31–76.
[113] C. Barron, I.A. Kakadiaris, Estimating anthropometry and pose from a single uncalibrated image, Comput. Vision Image Understanding 81 (3) (2001) 269–284.
[114] Y. Wu, T.S. Huang, A co-inference approach to robust visual tracking, Proceedings of the International Conference on Computer Vision, 2001.
[115] H.T. Nguyen, M. Worring, R. Boomgaard, Occlusion robust adaptive template tracking, Proceedings of the International Conference on Computer Vision, 2001.
[116] E.J. Ong, S. Gong, Tracking 2D–3D human models from multiple views, Proceedings of the International Workshop on Modeling People at ICCV, 1999.
[117] J.K. Aggarwal, Q. Cai, W. Liao, B. Sabata, Non-rigid motion analysis: articulated and elastic motion, Comput. Vision Image Understanding 70 (2) (1998) 142–156.
[118] C.R. Wren, B.P. Clarkson, A. Pentland, Understanding purposeful human motion, Proceedings of the International Conference on Automatic Face and Gesture Recognition, France, March 2000.
[119] Y. Iwai, K. Ogaki, M. Yachida, Posture estimation using structure and motion models, Proceedings of the International Conference on Computer Vision, Greece, September 1999.
[120] Y. Luo, F.J. Perales, J. Villanueva, An automatic rotoscopy system for human motion based on a biomechanic graphical model, Comput. Graphics 16 (4) (1992) 355–362.
[121] C. Yaniz, J. Rocha, F. Perales, 3D region graph for reconstruction of human motion, Proceedings of the Workshop on Perception of Human Motion at ECCV, 1998.
[122] M. Silaghi, et al., Local and global skeleton fitting techniques for optical motion capture, Proceedings of the Workshop on Modeling and Motion Capture Techniques for Virtual Environments, Switzerland, November 1998.
[123] S. Iwasawa, et al., Real-time estimation of human body posture from monocular thermal images, Proceedings of the IEEE CS Conference on Computer Vision and Pattern Recognition, 1997.
[124] Y. Kameda, M. Minoh, K. Ikeda, Three-dimensional pose estimation of an articulated object from its silhouette image, Proceedings of the Asian Conference on Computer Vision, 1993.
[125] Y. Kameda, M. Minoh, K. Ikeda, Three-dimensional pose estimation of a human body using a difference image sequence, Proceedings of the Asian Conference on Computer Vision, 1995.
[126] C. Hu, et al., Extraction of parametric human model for posture recognition using genetic algorithm, Proceedings of the Fourth International Conference on Automatic Face and Gesture Recognition, France, March 2000.
[127] C. Bregler, J. Malik, Tracking people with twists and exponential maps, Proceedings of the IEEE CS Conference on Computer Vision and Pattern Recognition, 1998.
[128] O. Munkelt, et al., A model driven 3D image interpretation system applied to person detection in video images, Proceedings of the International Conference on Pattern Recognition, 1998.
[129] Q. Delamarre, O. Faugeras, 3D articulated models and multi-view tracking with silhouettes, Proceedings of the International Conference on Computer Vision, Greece, September 1999.
[130] J.P. Luck, D.E. Small, C.Q. Little, Real-time tracking of articulated human models using a 3D shape-from-silhouette method, Proceedings of the Robot Vision Conference, Auckland, New Zealand, 2001.
[131] R. Rosales, S. Sclaroff, 3D trajectory recovery for tracking multiple objects and trajectory guided recognition of actions, Proceedings of the IEEE CS Conference on Computer Vision and Pattern Recognition, June 1999.
[132] C. Myers, L. Rabiner, A. Rosenberg, Performance tradeoffs in dynamic time warping algorithms for isolated word recognition, IEEE Trans. Acoust. Speech Signal Process. 28 (6) (1980) 623–635.
[133] A. Bobick, A. Wilson, A state-based technique for the summarization and recognition of gesture, Proceedings of the International Conference on Computer Vision, Cambridge, 1995, pp. 382–388.
[134] K. Takahashi, S. Seki, et al., Recognition of dexterous manipulations from time varying images, Proceedings of the IEEE Workshop on Motion of Non-Rigid and Articulated Objects, Austin, 1994, pp. 23–28.
[135] A.B. Poritz, Hidden Markov models: a guided tour, Proceedings of the International Conference on Acoustics, Speech and Signal Processing, 1988, pp. 7–13.
[136] L. Rabiner, A tutorial on hidden Markov models and selected applications in speech recognition, Proc. IEEE 77 (2) (1989) 257–285.
[137] T. Starner, A. Pentland, Real-time American Sign Language recognition from video using hidden Markov models, Proceedings of the International Symposium on Computer Vision, 1995, pp. 265–270.
[138] J. Yamato, J. Ohya, K. Ishii, Recognizing human action in time-sequential images using hidden Markov model, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 1992, pp. 379–385.
[139] M. Brand, N. Oliver, A. Pentland, Coupled hidden Markov models for complex action recognition, Proceedings of the IEEE CS Conference on Computer Vision and Pattern Recognition, 1997, pp. 994–999.
[140] C. Vogler, D. Metaxas, ASL recognition based on a coupling between HMMs and 3D motion analysis, Proceedings of the International Conference on Computer Vision, 1998, pp. 363–369.
[141] M. Rosenblum, Y. Yacoob, L. Davis, Human emotion recognition from motion using a radial basis function network architecture, Proceedings of the IEEE Workshop on Motion of Non-Rigid and Articulated Objects, Austin, 1994, pp. 43–49.
[142] O. Chomat, J.L. Crowley, Recognizing motion using local appearance, International Symposium on Intelligent Robotic Systems, University of Edinburgh, 1998.
[143] Y. Yacoob, M.J. Black, Parameterized modeling and recognition of activities, Proceedings of the International Conference on Computer Vision, India, 1998.
[144] A. Galata, N. Johnson, D. Hogg, Learning variable-length Markov models of behavior, Comput. Vision Image Understanding 81 (3) (2001) 398–413.
[145] C.-T. Lin, H.-W. Nein, W.-C. Lin, A space-time delay neural network for motion recognition and its application to lipreading, Int. J. Neural Systems 9 (4) (1999) 311–334.
[146] J.E. Boyd, J.J. Little, Global versus structured interpretation of motion: moving light displays, Proceedings of the IEEE CS Workshop on Motion of Non-Rigid and Articulated Objects, 1997, pp. 18–25.
[147] A.F. Bobick, J. Davis, Real-time recognition of activity using temporal templates, Proceedings of the IEEE CS Workshop on Applications of Computer Vision, 1996, pp. 39–42.
[148] J.W. Davis, A.F. Bobick, The representation and recognition of action using temporal templates, Technical Report 402, MIT Media Lab, Perceptual Computing Group, 1997.
[149] L. Campbell, A. Bobick, Recognition of human body motion using phase space constraints, Proceedings of the International Conference on Computer Vision, Cambridge, 1995, pp. 624–630.
[150] P. Remagnino, T. Tan, K. Baker, Agent orientated annotation in model based visual surveillance, Proceedings of the International Conference on Computer Vision, 1998, pp. 857–862.
[151] A. Kojima, et al., Generating natural language description of human behaviors from video images, Proceedings of the International Conference on Pattern Recognition, 2000, pp. 728–731.
[152] S. Intille, A. Bobick, Representation and visual recognition of complex, multi-agent actions using belief networks, Technical Report 454, Perceptual Computing Section, MIT Media Lab, 1998.
[153] G. Herzog, K. Rohr, Integrating vision and language: towards automatic description of human movements, Proceedings of the 19th Annual German Conference on Artificial Intelligence, 1995, pp. 257–268.
[154] A. Pentland, A. Liu, Modeling and prediction of human behavior, Neural Comput. 11 (1999) 229–242.
[155] M. Thonnat, N. Rota, Image understanding for visual surveillance applications, Proceedings of the Third International Workshop on Cooperative Distributed Vision, 1999, pp. 51–82.
[156] N. Rota, M. Thonnat, Video sequence interpretation for visual surveillance, Proceedings of the Workshop on Visual Surveillance, Ireland, 2000, pp. 59–67.
[157] G. Rigoll, S. Eickeler, S. Muller, Person tracking in real world scenarios using statistical methods, Proceedings of the International Conference on Automatic Face and Gesture Recognition, France, March 2000.
[158] I.A. Kakadiaris, D. Metaxas, Three-dimensional human body model acquisition from multiple views, Int. J. Comput. Vision 30 (3) (1998) 191–218.
[159] H. Sidenbladh, F. Torre, M.J. Black, A framework for modeling the appearance of 3D articulated figures, Proceedings of the International Conference on Automatic Face and Gesture Recognition, France, March 2000.
[160] N. Johnson, A. Galata, D. Hogg, The acquisition and use of interaction behavior models, Proceedings of the IEEE CS Conference on Computer Vision and Pattern Recognition, 1998, pp. 866–871.
[161] P. Fua, et al., Human body modeling and motion analysis from video sequences, International Symposium on Real Time Imaging and Dynamic Analysis, Japan, June 1998.
[162] R. Plankers, P. Fua, Articulated soft objects for video-based body modeling, Proceedings of the International Conference on Computer Vision, 2001.
[163] A. Hilton, P. Fua, Foreword: modeling people toward vision-based understanding of a person's shape, appearance, and movement, Comput. Vision Image Understanding 81 (3) (2001) 227–230.
About the Author—LIANG WANG received his B.Sc. (1997) and M.Sc. (2000) from the Department of Electronics Engineering and Information Science, Anhui University, China. He is currently a Ph.D. candidate in the National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing, China. He has published more than six papers in major national journals and international conferences. His main research interests include computer vision, pattern recognition, digital image processing and analysis, multimedia and visual surveillance.

About the Author—WEIMING HU received his Ph.D. degree from the Department of Computer Science and Engineering, Zhejiang University, China. From April 1998 to March 2000, he was a Postdoctoral Research Fellow at the Institute of Computer Science and Technology, Founder Research and Design Center, Peking University. Since April 2000, he has worked at the National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, as an Associate Professor. His research interests include visual surveillance and monitoring of dynamic scenes, neural networks, 3-D computer graphics, the physical design of ICs, and map publishing systems. He has published more than 20 papers in major national journals, such as Science in China, Chinese Journal of Computers, Chinese Journal of Software, and Chinese Journal of Semiconductors.

About the Author—TIENIU TAN received his B.Sc. (1984) in electronic engineering from Xi'an Jiaotong University, China, and his M.Sc. (1986), DIC (1986) and Ph.D. (1989) in electronic engineering from Imperial College of Science, Technology and Medicine, London, UK. In October 1989, he joined the Computational Vision Group at the Department of Computer Science, The University of Reading, England, where he worked as Research Fellow, Senior Research Fellow and Lecturer. In January 1998, he returned to China to join the National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing, China. He is currently Professor and Director of the National Laboratory of Pattern Recognition as well as President of the Institute of Automation. Dr. Tan has published widely on image processing, computer vision and pattern recognition. He is a Senior Member of the IEEE and was an elected member of the Executive Committee of the British Machine Vision Association and Society for Pattern Recognition (1996–1997). He serves as a referee for many major national and international journals and conferences. He is an Associate Editor of the International Journal of Pattern Recognition, the Asia Editor of the International Journal of Image and Vision Computing, and a founding co-chair of the IEEE International Workshop on Visual Surveillance. His current research interests include speech and image processing, machine and computer vision, pattern recognition, multimedia, and robotics.