US8995717B2 - Method for building and extracting entity networks from video - Google Patents

Info

Publication number
US8995717B2
Authority
US
United States
Prior art keywords
entities
event
temporal
video data
spatio
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active, expires
Application number
US13/597,698
Other versions
US20120321137A1 (en
Inventor
Hui Cheng
Jiangjian Xiao
Harpreet Sawhney
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
SRI International Inc
Original Assignee
SRI International Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by SRI International Inc
Priority to US13/597,698
Publication of US20120321137A1
Application granted
Publication of US8995717B2
Active
Adjusted expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/20 Analysis of motion
    • G06K9/00771
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/50 Context or environment of the image
    • G06V20/52 Surveillance or monitoring of activities, e.g. for recognising suspicious objects
    • G06K2009/3291
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/62 Extraction of image or video features relating to a temporal dimension, e.g. time-based feature extraction; Pattern tracking

Definitions

  • AEN attribute entity network
  • SVM Support Vector Machine
  • Referring now to FIG. 2, a block diagram of an attribute entity network (AEN) 40, constructed in accordance with an embodiment of the present invention, is depicted.
  • Video data 42 is input to an entity association engine 44 from which the AEN 40 is constructed as a graph.
  • the AEN 40 comprises a plurality of nodes 46 and links 48 which may be extracted at least in part from the video data 42.
  • the nodes 46 represent entities, such as vehicles, people and sites (e.g., buildings, parking lots, roads), and the links 48 represent the relationships amongst two or more nodes 46 observed from video data 42.
  • Each of the nodes 46 has a unique ID 49 and an associated entity type and entity attributes (not shown), such as locations of buildings or tracks of vehicles and people.
  • Each of the links 48 includes a type (indicated by a color 50), a confidence measure 52 (probability) and a pointer (not shown) to the associated evidence, i.e., the video segment from which the link is established.
  • the attribute entity network 40 can be stored in a database (not shown) for searching, exploitation and fusion with other entity or social networks created from other information sources.
  • Evidence associated with a linkage, such as a frame or a clip, can also be stored in the database for forensic analysis and verification.
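The node and link structure described above can be illustrated with a minimal Python sketch. The class names, field names, link-type strings and clip identifiers are hypothetical and chosen only for illustration; the text specifies only that nodes carry a unique ID, an entity type and attributes, and that links carry a type, a confidence measure and a pointer to video evidence.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    node_id: str                                     # unique ID of the entity
    entity_type: str                                 # e.g. "vehicle", "person", "site"
    attributes: dict = field(default_factory=dict)   # e.g. location or track


@dataclass
class Link:
    source: str        # node_id of one entity
    target: str        # node_id of the other entity
    link_type: str     # e.g. "mount", "convoy" (illustrative labels)
    confidence: float  # probability-like confidence in [0, 1]
    evidence: str      # pointer to the supporting video segment


class AttributeEntityNetwork:
    """Graph of entities (nodes) and observed associations (links)."""

    def __init__(self):
        self.nodes = {}
        self.links = []

    def add_node(self, node):
        self.nodes[node.node_id] = node

    def add_link(self, link):
        if link.source not in self.nodes or link.target not in self.nodes:
            raise KeyError("both endpoints must be added as nodes first")
        if not 0.0 <= link.confidence <= 1.0:
            raise ValueError("confidence must lie in [0, 1]")
        self.links.append(link)

    def neighbors(self, node_id, min_confidence=0.0):
        """Return IDs of entities linked to node_id with enough confidence."""
        result = []
        for link in self.links:
            if link.confidence < min_confidence:
                continue
            if link.source == node_id:
                result.append(link.target)
            elif link.target == node_id:
                result.append(link.source)
        return result


aen = AttributeEntityNetwork()
aen.add_node(Node("V1", "vehicle"))
aen.add_node(Node("P1", "person"))
aen.add_node(Node("S1", "site", {"kind": "parking lot"}))
aen.add_link(Link("P1", "V1", "mount", 0.9, "clip_0001:frames 120-180"))
aen.add_link(Link("V1", "S1", "parked_at", 0.3, "clip_0001:frames 400-420"))
```

Filtering neighbors by a confidence threshold is one simple way an analyst's query could ignore weak links while the evidence pointer remains available for verification.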
  • Entity associations used to derive the links 48 via the entity association engine 44 are found from the a priori understanding of people and vehicle movements and activities through track analysis, to be discussed in connection with FIG. 3 hereinbelow.
  • Some typical associations include:
  • Referring now to FIG. 3, a block diagram pertaining to a pattern analysis technique for track based entity association to derive the links 48, link types, and link certainties of FIG. 2 is depicted.
  • People and vehicle movements are key features for linking people, vehicles and sites. A vehicle that leaves a house, picks up a passenger from another house, and enters the garage of an office building links not only the two people in the vehicle, but also the three sites and the other people in the two residences.
  • Association using tracks may not always be as straightforward as in the above example. Vehicles parked in a parking lot or stopped at an intersection may or may not be related. Therefore, in addition to detecting links, it is necessary to assess the certainty and importance of a link between two entities using tracks (i.e., the trajectory of an object in a video).
  • UAV video 60 is fed to the first module, tracking and analysis module 62.
  • the tracking and analysis module 62 breaks entity tracks from the video 60 into tracklets 64 by identifying entity behaviors, also called spatial actions, such as start, stop, turn, appear and disappear.
  • the tracking and analysis module 62 determines which objects in the video 60 are moving and then tracks those objects.
  • the UAV video 60 is fed to an urban structure extraction module 66 (along with optional GIS data 67), which determines which structures in the video 60 are buildings and builds descriptions of site locations (sites 68) from the video 60.
  • tracklets are represented as 3D (x, y, t) poly-lines.
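As a rough illustration of breaking a track into tracklets at spatial actions, the following Python sketch splits an (x, y, t) poly-line wherever the object starts or stops moving or makes a sharp turn. The function name and the two thresholds (`stop_speed`, `turn_deg`) are assumptions made for this example; the text does not state how spatial actions are detected. Timestamps are assumed to be strictly increasing.

```python
import math


def split_into_tracklets(track, stop_speed=0.5, turn_deg=45.0):
    """Split a track [(x, y, t), ...] into tracklets at stops, starts and turns.

    A break is placed at an interior point when the moving/stopped state
    changes across it, or when both adjacent segments are moving and the
    heading changes by more than turn_deg degrees.
    """
    if len(track) < 3:
        return [list(track)]
    breaks = []
    for i in range(1, len(track) - 1):
        (x0, y0, t0), (x1, y1, t1), (x2, y2, t2) = track[i - 1], track[i], track[i + 1]
        v_in = math.hypot(x1 - x0, y1 - y0) / (t1 - t0)
        v_out = math.hypot(x2 - x1, y2 - y1) / (t2 - t1)
        h_in = math.atan2(y1 - y0, x1 - x0)
        h_out = math.atan2(y2 - y1, x2 - x1)
        dh = abs(math.degrees(h_out - h_in))
        dh = min(dh, 360.0 - dh)  # wrap the heading change into [0, 180]
        moving_in, moving_out = v_in >= stop_speed, v_out >= stop_speed
        if moving_in != moving_out or (moving_in and moving_out and dh > turn_deg):
            breaks.append(i)
    tracklets, start = [], 0
    for b in breaks:
        tracklets.append(track[start:b + 1])  # adjacent tracklets share the break point
        start = b
    tracklets.append(track[start:])
    return tracklets


# Drive east, turn north, then stop.
track = [(0, 0, 0), (1, 0, 1), (2, 0, 2), (2, 1, 3), (2, 2, 4), (2, 2, 5), (2, 2, 6)]
tracklets = split_into_tracklets(track)
```

Here the track yields three tracklets: the eastbound leg, the northbound leg, and the stationary interval.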
  • $S_1$ is a similarity measure for the closeness of two tracklets, which is used for detecting a convoy.
  • $S_2$ measures the (x, y, t) distance of the start/end points of two tracklets, which is used for detecting entity interactions, such as mounting and dismounting.
  • a similarity measure, i.e., a metric for the closeness of two tracklets
  • an adjacency graph is built for the tracklets 64 and the sites 68.
  • links 70 are generated from the adjacency graph.
  • Pre-defined similarity measures are related to activities or behavior of interest. For example, a spatio-temporal similarity measure at the tracklet level can be used to detect a convoy or a group of people walking together. The distance in the (x, y, t) space between two end points of two tracklets can be used to detect people mounting or dismounting vehicles. Similarity measures discovered from tracklets can cue analysts to unknown patterns that might be of interest.
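The two pre-defined measures can be sketched in Python as follows. The exact formulas are not given in the text, so both functions are assumptions: the first takes the mean spatial distance between two tracklets over their shared time span as a stand-in for tracklet closeness, and the second takes the Euclidean distance in (x, y, t) between the end of one tracklet and the start of another. The `time_scale` parameter, which trades seconds against spatial units, is likewise hypothetical.

```python
import math


def interpolate(tracklet, t):
    """Linearly interpolate the (x, y) position of a tracklet at time t."""
    for (x0, y0, t0), (x1, y1, t1) in zip(tracklet, tracklet[1:]):
        if t0 <= t <= t1:
            a = 0.0 if t1 == t0 else (t - t0) / (t1 - t0)
            return (x0 + a * (x1 - x0), y0 + a * (y1 - y0))
    raise ValueError("time outside tracklet")


def s1_closeness(a, b, samples=10):
    """Mean spatial distance over the shared time span (smaller means closer).

    Returns None when the two tracklets do not overlap in time.
    """
    t_start = max(a[0][2], b[0][2])
    t_end = min(a[-1][2], b[-1][2])
    if t_end <= t_start:
        return None
    total = 0.0
    for k in range(samples + 1):
        # Clamp to guard against floating-point overshoot at the last sample.
        t = min(t_start + (t_end - t_start) * k / samples, t_end)
        (ax, ay), (bx, by) = interpolate(a, t), interpolate(b, t)
        total += math.hypot(ax - bx, ay - by)
    return total / (samples + 1)


def s2_endpoint_distance(a, b, time_scale=1.0):
    """(x, y, t) distance from the end of tracklet a to the start of tracklet b."""
    (ax, ay, at), (bx, by, bt) = a[-1], b[0]
    return math.sqrt((ax - bx) ** 2 + (ay - by) ** 2 + (time_scale * (at - bt)) ** 2)


# Two trucks driving side by side: a convoy candidate.
truck1 = [(0, 0, 0), (10, 0, 10)]
truck2 = [(0, 2, 0), (10, 2, 10)]
convoy_gap = s1_closeness(truck1, truck2)

# A vehicle stops, and a person track appears next to it one second later:
# a dismount candidate.
vehicle = [(0, 0, 0), (5, 0, 5)]
person = [(5, 1, 6), (5, 4, 9)]
dismount_gap = s2_endpoint_distance(vehicle, person)
```

Thresholding `convoy_gap` and `dismount_gap` would then produce candidate edges for the adjacency graph; suitable thresholds depend on the video's ground resolution and frame rate.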
  • FIG. 5A shows preliminary results of automated entity association through track analysis using WAVS video data.
  • Vehicle-Vehicle, Vehicle-People and People-People associations derived from track analysis include track-related behaviors such as vehicles traveling as a convoy, and people mounting and dismounting vehicles.
  • the two columns of chips are vehicle chips on the left 82 and people chips on the right 84.
  • the links 86 among vehicle chips represent detected convoy behavior.
  • Links 88 among vehicle and people chips represent mounting and dismounting activity.
  • FIG. 5B shows mounting activity captured as evidence for associating a person with a vehicle.
  • the link classification module 71 classifies the links 70 into categories (link types 72) based on attributes of the tracklets 64 and the sites 68. For instance, when a stopped vehicle track and the appearance of a person track indicate a dismounting activity, the person is determined to be the occupant of the vehicle.
  • the last module, context normalization module 80, estimates the importance and the certainty of a link 70.
  • the context normalization module 80 weights each link by the type of the link 70, the type of the site 68 and the inverse frequency of the same link types 72 and the site 68. In this way, two cars parked together in a parking lot will form a much weaker link than two cars parked together in a deserted area or around a house.
  • link certainty 76 is assigned to the link 70 and is a measure of the degree of confidence in assigning the link 70.
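The inverse-frequency weighting can be sketched as below. This is only one plausible reading of the description: the function name, the per-type prior weights, and the use of a logarithmic inverse frequency (borrowed from TF-IDF practice) are all assumptions, not the disclosed formula.

```python
import math
from collections import Counter


def context_weights(observations, type_weight, site_weight):
    """Weight each (link_type, site_type) pair by how rarely it is observed.

    observations: list of (link_type, site_type) tuples seen in the data.
    type_weight, site_weight: hypothetical prior weights per link/site type.
    """
    counts = Counter(observations)
    total = len(observations)
    weights = {}
    for (link_type, site_type), n in counts.items():
        rarity = math.log(total / n) + 1.0  # inverse frequency, always >= 1
        weights[(link_type, site_type)] = (
            type_weight[link_type] * site_weight[site_type] * rarity
        )
    return weights


# Nine pairs of cars parked together in parking lots, one in a deserted area.
obs = [("parked_together", "parking_lot")] * 9 + [("parked_together", "deserted_area")]
w = context_weights(
    obs,
    type_weight={"parked_together": 1.0},
    site_weight={"parking_lot": 1.0, "deserted_area": 1.0},
)
```

With equal priors, the rare co-parking in a deserted area receives roughly three times the weight of the common parking-lot case, matching the qualitative behavior described above.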
  • the present invention detects events at multiple levels from primitive events, such as actions, to complex events over large spatial and temporal extent involving multiple agents.
  • actions/events can be classified into spatial actions and behavioral actions. Spatial actions, such as start, disappear, turn, etc., can be inferred purely from tracks or the interactions among tracks. Spatial actions are detected using track analysis as described above.
  • Another category of actions is behavioral actions, which are coordinated movements of different parts of an object, e.g., load, unload, push, throw, and other human actions.
  • Behavioral actions typically involve people, objects and their interactions in a short time interval, such as talking/fighting, loading/unloading, etc.
  • the motion of people in behavioral actions can be more complicated than in spatial actions.
  • In a loading action, a hand or arm movement is associated with the object being lifted.
  • STO Spatial-Temporal Object
  • a focus-of-attention of moving objects in a video is obtained using Spatio-Temporal Cues (such as motion and appearance).
  • Zooming in on the focus of attention, instead of using an object as a whole, spatio-temporal fragments or part models extracted from the objects are used, such as hand and arm regions from a person.
  • the fragment extraction algorithm uses spatio-temporal video features such as human poses computed from videos and model based matching that is resilient to occlusion and background clutter.
  • the poses of people are refined.
  • objects associated with moving people having a given pose are extracted and classified, such as a box 98 associated with the people 100 having similar poses.
  • feature words are extracted from the objects and people moving together which are called spatio-temporal object words, such as walking, running, carrying, digging, etc.
  • the spatio-temporal representation of object fragments, including location, states and motion, is expressed using an STO vocabulary, both as words for each time instant and as sentences for a time interval.
  • these spatio-temporal object words and sentences are encoded as feature vectors.
  • STO sentences are then classified using a Support Vector Machine (SVM), with the optional aid of a database 108 of STO sentences of known activities, into different actions, such as “Two people carried a box from a truck to a house.”
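The encoding step can be illustrated with a simple bag-of-words scheme in Python. This is an assumed encoding, since the text does not specify how STO words and sentences are turned into feature vectors; the helper names are hypothetical. The resulting fixed-length vectors are the kind of input a standard SVM implementation accepts, and the classifier itself is omitted here.

```python
def build_vocabulary(sentences):
    """Map each distinct STO word to a column index, in sorted order."""
    words = sorted({w for s in sentences for w in s})
    return {w: i for i, w in enumerate(words)}


def encode(sentence, vocab):
    """Count occurrences of each vocabulary word; unknown words are ignored."""
    vec = [0] * len(vocab)
    for w in sentence:
        if w in vocab:
            vec[vocab[w]] += 1
    return vec


# Each STO sentence is a sequence of per-instant STO words over a time interval.
sentences = [
    ["walking", "carrying", "carrying", "walking"],
    ["running", "running", "digging"],
]
vocab = build_vocabulary(sentences)
vectors = [encode(s, vocab) for s in sentences]
```

A bag-of-words vector discards word order; an encoding that preserves temporal order (for example, counts of adjacent word pairs) may be closer to the "sentence" notion in the text, at the cost of a larger vocabulary.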
  • the present invention meets these goals by merging an event ontology with hierarchical weighted graph matching to reduce the candidate space. Only a small number of key sub-events are matched with detailed matching involving only well-qualified candidates. Additionally, a Markov Logic Network is used for reasoning and inferencing in visual and geo-spatial domains.
  • a composite event can be represented as an event graph 110 for graph matching.
  • a node 112 represents a sub-event and a link 114 represents the type of temporal transition between two nodes, such as “after”, “during”, etc.
  • Each node 112 is also assigned a weight (not shown) proportional to its importance to the overall event.
  • the weights can be user defined in the event ontology.
  • the weights can also be computed from training examples. The weights effectively reveal a level of importance. From a description of an event provided by a user or extracted from a video, a hierarchical event description is formed by removing nodes with small weights and combining the links between nodes with large weights.
  • the example in FIG. 7 depicts a three-level hierarchical event graph of cross-border weapons smuggling.
  • observations are first matched at the highest level. Only those observations receiving a predetermined minimum matching score pass to a next level for verification. This process is repeated with other observations until a predetermined confidence level is achieved for an event hypothesis to be accepted or rejected. In this way, a large number of observations are quickly filtered and detailed matching is only performed on credible candidates.
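The coarse-to-fine filtering can be sketched in Python as follows. For brevity the sketch matches only sets of sub-events and ignores the temporal links between them, and it uses a weighted-fraction matching score; both simplifications, along with all names, weights and thresholds, are assumptions made for illustration.

```python
def build_levels(sub_event_weights, thresholds):
    """Return one set of sub-events per level, coarsest (highest threshold) first."""
    return [
        {e for e, w in sub_event_weights.items() if w >= th}
        for th in sorted(thresholds, reverse=True)
    ]


def match_score(observed, required, weights):
    """Weighted fraction of the required sub-events present in the observation."""
    total = sum(weights[e] for e in required)
    hit = sum(weights[e] for e in required if e in observed)
    return hit / total if total else 0.0


def hierarchical_match(observations, weights, thresholds, min_score):
    """Keep only observations that pass every level, from coarse to fine."""
    candidates = list(observations)
    for required in build_levels(weights, thresholds):
        candidates = [
            name for name in candidates
            if match_score(observations[name], required, weights) >= min_score
        ]
    return candidates


weights = {"cross_border": 0.9, "transfer_cargo": 0.8, "vehicle_stop": 0.3, "person_walk": 0.1}
observations = {
    "A": {"cross_border", "transfer_cargo", "vehicle_stop"},
    "B": {"vehicle_stop", "person_walk"},
    "C": {"cross_border", "person_walk"},
}
accepted = hierarchical_match(observations, weights, thresholds=[0.5, 0.0], min_score=0.7)
```

Observations B and C are rejected at the coarsest level, which checks only the two heavily weighted sub-events, so the detailed comparison against all four sub-events runs only for A.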
  • the similarity between two events is computed. Based on the event ontology, the similarity of a pair of objects or actions is computed using a shortest path length measure between two objects/actions in the object/action taxonomy. For example, among the actions “walk”, “run” and “pick-up”, the similarity value of (walk, run) will be larger than that of (walk, pick-up).
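A shortest-path similarity over a tree-shaped taxonomy can be sketched as below. The toy taxonomy and the mapping from path length to similarity, 1 / (1 + path length), are assumptions; the text states only that a shortest path length measure is used.

```python
def ancestors(node, parent):
    """Return the chain [node, parent, grandparent, ..., root]."""
    chain = [node]
    while node in parent:
        node = parent[node]
        chain.append(node)
    return chain


def path_length(a, b, parent):
    """Number of edges on the path between a and b through their lowest common ancestor."""
    depth_b = {n: d for d, n in enumerate(ancestors(b, parent))}
    for d_a, n in enumerate(ancestors(a, parent)):
        if n in depth_b:
            return d_a + depth_b[n]
    raise ValueError("no common ancestor")


def similarity(a, b, parent):
    """Map path length to (0, 1]; identical concepts score 1.0."""
    return 1.0 / (1.0 + path_length(a, b, parent))


# Hypothetical action taxonomy, stored as child -> parent.
parent = {
    "walk": "locomotion",
    "run": "locomotion",
    "locomotion": "action",
    "pick-up": "manipulation",
    "manipulation": "action",
}
```

In this toy taxonomy "walk" and "run" are two edges apart while "walk" and "pick-up" are four edges apart, so (walk, run) scores higher than (walk, pick-up), as in the example above.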
  • the Complex Event Similarity (CES) can be computed as:
  • $SSE(a_i, b_i)$ is the similarity between two corresponding simple events $a_i$ and $b_i$ from the two streams.
  • $W_i$ is the importance weight for the simple event $a_i$.
  • the weights are computed using a Term-Frequency Inverse Document Frequency (TFIDF) scheme, which has been successfully used to measure the similarity of documents.
  • TFIDF Term-Frequency Inverse Document Frequency
  • the weights are the product of the frequency of the simple event in the event to be matched to (event template) times the log of the inverse of the frequency of the same simple events observed in the Region-Of-Interest (ROI).
  • ROI Region-Of-Interest
  • the dependence of a sub-event's weight on the ROI makes the event matching scheme of the present invention adaptive to the environment. For example, in a desert, the frequency of observing a moving object is low. So, when matching an event related to moving objects in a desert, a higher weight is given to the action of moving than when detecting the same event in an urban environment with heavy traffic.
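The weighting rule stated above (template frequency times the log of the inverse ROI frequency) can be written directly in Python. How the weights and the simple-event similarities combine into the Complex Event Similarity is not reproduced in this text, so the normalized weighted average used below is an assumption, as are the function names and the example frequencies.

```python
import math


def tfidf_weight(template_freq, roi_freq):
    """Frequency in the event template times the log of the inverse ROI frequency."""
    return template_freq * math.log(1.0 / roi_freq)


def complex_event_similarity(sse, weights):
    """Weighted average of simple-event similarities (the normalization is assumed)."""
    total = sum(weights)
    return sum(w * s for w, s in zip(weights, sse)) / total if total else 0.0


# The same "moving" sub-event is rare in a desert ROI and common in a city ROI.
w_desert = tfidf_weight(template_freq=0.5, roi_freq=0.01)
w_city = tfidf_weight(template_freq=0.5, roi_freq=0.5)

# Two simple events: the first matches perfectly, the second not at all.
ces = complex_event_similarity(sse=[1.0, 0.0], weights=[3.0, 1.0])
```

The desert weight comes out several times larger than the city weight, which reproduces the adaptive behavior described above: a rarely observed sub-event contributes more to the match.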
  • the weight of an object can be inferred from how it was carried, and the status of a person can be inferred from how he gets out of a car and how he is greeted by others.
  • MLN Markov Logic Networks
  • the present invention employs Markov Logic Networks (MLN) as a probabilistic framework for accounting for the uncertainty of video processing and to enable learning.
  • MLN seamlessly integrates learning, logic and probabilistic inferencing and can be used based on either rules or annotated examples or both for event detection and reasoning.
  • a Markov Logic Network is a set of pairs (F, w) where F is a formula in first-order logic and w is a weight (real number). These weights can be determined a priori, or can be learned from observed data or examples.
  • MLN defines a network with one node for each grounding (achieved by assigning a constant to a variable) of each predicate in an MLN.
  • a sample ground MLN is shown in FIG. 8.
  • the network 116 includes edges 118 between pairs of atoms 120, which are groundings of predicates.
  • the probability distribution over possible worlds, x, specified by a ground Markov network is: $P(X = x) = \frac{1}{Z} \exp\left(\sum_i w_i \, n_i(x)\right)$
  • $w_i$ represents the weight of formula i
  • $n_i(x)$ is the number of true groundings of formula i in x
  • Z is a normalization factor
  • MLN is used to infer properties of objects and outcomes of events or actions.
  • a geo-spatial and visual ontology can be developed to provide the attribute set of an object and a rule set for inferencing.
  • the inputs to the MLN reasoning engine are factlets (i.e., assertions about the video content) extracted from WAVS videos.
  • the goal of employing an MLN is to infer information from these factlets, such as inferring that a box is heavy if two people, rather than one, are carrying it.
  • Based on factlets from WAVS data, the MLN dynamically creates a network and learns the appropriate weights for the formulae that constitute the knowledge base. Once the weights have been updated, the MLN can be used to answer queries, e.g., does the knowledge base entail a specific event-related hypothesis (e.g., “Is the box heavy?” in FIG. 8)? Inferencing using MLNs reduces to the problem of computing the probability that Formula x is true given that Formula i is true. MLN enables inferencing about properties/attributes of objects, outcomes of actions and occurrence of complex events such as clandestine meetings, ambush or transportation of weapons or bomb making materials.
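A toy ground network makes the "Is the box heavy?" query concrete. The Python sketch below enumerates all possible worlds over two ground atoms, scores each world with the exponential of the weighted count of true formula groundings, normalizes, and answers a conditional query. The atom names, the single formula and its weight of 2.0 are illustrative assumptions; exhaustive enumeration is feasible only for very small networks, and practical systems use approximate inference.

```python
import itertools
import math


def world_probabilities(atoms, formulas):
    """Enumerate all truth assignments and return {assignment: probability}.

    formulas: list of (weight, function(world_dict) -> number of true groundings).
    """
    scores = {}
    for values in itertools.product([False, True], repeat=len(atoms)):
        world = dict(zip(atoms, values))
        scores[values] = math.exp(sum(w * n(world) for w, n in formulas))
    z = sum(scores.values())  # normalization factor
    return {values: s / z for values, s in scores.items()}


atoms = ["TwoCarriers", "Heavy"]
# One ground formula: TwoCarriers => Heavy, with an illustrative weight of 2.0.
formulas = [(2.0, lambda w: int((not w["TwoCarriers"]) or w["Heavy"]))]
probs = world_probabilities(atoms, formulas)

# Conditional query: P(Heavy | TwoCarriers).
p_both = probs[(True, True)]
p_evidence = probs[(True, True)] + probs[(True, False)]
p_heavy_given_two = p_both / p_evidence
```

Because the formula is soft, a world in which two people carry a light box is less probable but not impossible; the query returns e²/(e² + 1), about 0.88, rather than 1.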
  • an activity model of each urban entity is built using statistics of related vehicles, people and their movement.
  • the activity model of a building will be the number and the type of vehicles entering/leaving the building as a function of time and date.
  • urban context also captures cultural information, such as differences between weekday and weekend activities and differences in vehicle activities in different parts of a city.
  • urban structures can be classified not only into broad categories, such as residential area, shopping district, factory and office complex, but also into fine classifications, such as movie theaters, retail stores, restaurants, garages and mosques. For example, a large number of vehicles will arrive at and leave movie theaters at regular intervals based on the movie schedule, while vehicles arrive at and leave a retail store continuously throughout the day, fluctuating with the time of day but much less predictably.
  • activity models can also identify functional components that are difficult to detect purely based on appearances.
  • the present invention can label the entrance of a building, egress/ingress points of an area, such as gates or check-points, parking lots, drive ways or alleys, etc.
  • the activity of a given structure or site can be compared with the activity of structures of the same type. In this way, abnormal structures are identified, such as a house or a store that has much more car activity than the norm for its class.
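A minimal Python sketch of an activity model and a class-norm comparison follows. The hourly histogram, the use of the class median as the "norm", and the factor of 3 are all assumptions chosen for illustration; the text does not specify the statistics used.

```python
from collections import Counter
from statistics import median


def hourly_activity(events):
    """Build {site_id: Counter(hour -> count)} from (site_id, hour) vehicle events."""
    model = {}
    for site, hour in events:
        model.setdefault(site, Counter())[hour] += 1
    return model


def abnormal_sites(daily_counts, factor=3.0):
    """Flag sites whose activity exceeds factor times the median of their class."""
    norm = median(daily_counts.values())
    return sorted(s for s, c in daily_counts.items() if c > factor * norm)


# Vehicle arrivals at one building, keyed by hour of day.
model = hourly_activity([("b1", 8), ("b1", 8), ("b1", 17)])

# Daily vehicle counts for sites of the same class (houses).
houses = {"house_a": 4, "house_b": 6, "house_c": 5, "house_d": 40}
flagged = abnormal_sites(houses)
```

The median is used instead of the mean so that the outlier itself does not inflate the norm it is compared against.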
  • the present invention can provide advanced capabilities for searching, browsing, retrieval and visualization:
  • FIG. 9 shows an envisioned entity-centric analyzer that allows an analyst to co-exploit entity tracks, entity sightings, entity networks and videos containing entities and their interactions in the same GUI window. Analysts also can easily move the focal point from one entity to a related entity with a single click.

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

A computer implemented method for deriving an attribute entity network (AEN) from video data is disclosed, comprising the steps of: extracting at least two entities from the video data; tracking the trajectories of the at least two entities to form at least two tracks; deriving at least one association between at least two entities by detecting at least one event involving the at least two entities, said detecting of at least one event being based on detecting at least one spatio-temporal motion correlation between the at least two entities; and constructing the AEN by creating a graph wherein the at least two objects form at least two nodes and the at least one association forms a link between the at least two nodes.

Description

CROSS-REFERENCE TO RELATED APPLICATIONS
This application is a continuation of issued U.S. Pat. No. 8,294,763B2 (U.S. non-provisional patent application Ser. No. 12/271,173, filed Nov. 14, 2008), which further claims the benefit of U.S. provisional patent application No. 61/013,888, filed Dec. 14, 2007. The aforementioned related patent applications are herein incorporated by reference in their entirety.
GOVERNMENT RIGHTS IN THIS INVENTION
This invention was made with U.S. government support under contract number NBCH-C-07-0062. The U.S. government has certain rights in this invention.
FIELD OF THE INVENTION
The present invention relates generally to vision systems, and more particularly to a method and system that automatically detects and relates entities found in video and builds entity networks that can be stored in a database for later recall.
BACKGROUND OF THE INVENTION
Entities can include people, vehicles, houses, etc. Entity association in the context of gathering and relating entity data for defense, surveillance systems, sports and entertainment archiving systems is traditionally accomplished using text or structured data, such as known affiliations. In such contexts, it would be desirable to associate structured text data with images and/or video taken of a scene to enhance the meaning of the structured text data and allow for the extraction of meaningful inferences about the data with a high degree of certainty. For example, if a plurality of trucks is traveling together on a highway for an extended period of time, it can be inferred that the collection of trucks are traveling in a convoy. Thus, the video would be tagged with the label “convoy.” In another example, a person is seen entering a car driven by another person. Then, the two persons are likely to know each other.
It would be desirable to associate visual attributes to entities and with video imagery. Persistent and wide-area coverage of video imagery provides an opportunity to monitor the behavior of entities, such as vehicles, people and sites, over long periods of time and large geo-spatial extents. It would also be desirable to deduce the relationship of entities under different contexts and in the presence of clutter and under uncertainties inherent in detecting, classifying and tracking entities from video data. Any entity information derived from videos has an associated probability or belief computed from the data. Inferences of associations use propagation of uncertainties within a network representation built from the data. Therefore, linkages can be established and hidden relationships can be discovered among entities automatically.
Accordingly, what would be desirable, but has not yet been provided, is a system and method for effectively and automatically detecting and relating entities from video data, deducing inferences from the data and their relationships, automatically constructing entity networks, and storing and later retrieving the entity networks for later analysis.
SUMMARY OF THE INVENTION
The above-described problems are addressed and a technical solution is achieved in the art by providing a computer implemented method for deriving an attribute entity network (AEN) from video data, comprising the steps of extracting at least two entities from the video data; tracking the trajectories of the at least two entities to form at least two tracks; deriving at least one association between at least two entities by detecting at least one event involving the at least two entities, said detecting of at least one event being based on detecting at least one spatio-temporal motion correlation between the at least two entities; and constructing the AEN by creating a graph wherein the at least two objects form at least two nodes and the at least one association forms a link between the at least two nodes. The entity extraction step further comprises the steps of detecting moving objects and classifying them into vehicles and people; and determining which structures in the video data are at least one of roads, parking lots, and buildings, and building descriptions of sites. The deriving step further comprises the steps of calculating a similarity measure of the closeness of two tracklets; identifying entity behaviors (spatial actions and behavioral actions); and performing pattern analysis to group tracklets and sites.
The at least one event is classified as one of a spatial action and a behavioral action. A behavioral action is detected using Spatial-Temporal Object (STO) Analysis. STO Analysis comprises the steps of: obtaining a focus-of-attention of moving objects in the video data using Spatio-Temporal Cues; obtaining spatio-temporal fragments extracted from the moving objects within the focus-of-attention, the moving objects including at least one person; combining the obtained spatio-temporal fragments to compute at least one pose of the at least one person; extracting and classifying at least one object associated with the at least one person; extracting feature words from the at least one object and the at least one person to create spatio-temporal object words; encoding the spatio-temporal object words as feature vectors; and classifying the feature vectors using a Support Vector Machine (SVM).
The method can further comprise the steps of merging an event ontology with hierarchical weighted graph matching to reduce the candidate space, which in turn comprises the steps of constructing an event graph wherein a node represents a sub-event and a link represents the type of temporal transition between two nodes, the link being assigned a weight that is proportional to the importance of the temporal transition to the overall event; forming a hierarchical event description by removing nodes with small weights and combining the links between nodes with large weights; matching observations using the hierarchical event graph at its highest level, wherein observations receiving a predetermined minimum matching score are passed to a next level for verification; and repeating the step of matching with other observations until a predetermined confidence level is achieved for accepting or rejecting an event. The step of matching further comprises the step of computing the similarity between two events using a shortest path length measure between two objects/actions in an object/action taxonomy.
The method can further comprise the step of employing a Markov Logic Network for reasoning and inferencing in visual and geo-spatial domains.
BRIEF DESCRIPTION OF THE DRAWINGS
The present invention will be more readily understood from the detailed description of exemplary embodiments presented below considered in conjunction with the attached drawings, of which:
FIG. 1 is a block diagram of a hardware architecture for a system for deriving an attribute entity network (AEN) from video, constructed in accordance with an embodiment of the present invention;
FIG. 2 is a block diagram of an attribute entity network associated with the system of FIG. 1;
FIG. 3 is a block diagram pertaining to a pattern analysis technique for track based entity association to derive the links, link types, and link certainties of FIG. 2;
FIG. 4 is a graph depicting a 3D representation of tracklets as poly-lines;
FIG. 5A shows a snapshot of preliminary results of automated entity association through track analysis using WAVS video data;
FIG. 5B shows a snapshot of mounting activity captured as evidence for associating a person with a vehicle derived from FIG. 5A;
FIG. 6 depicts the steps of a Spatial-Temporal Object (STO) Analysis process for recognizing behavioral actions;
FIG. 7 is a diagram depicting a composite event represented as an event graph for graph matching;
FIG. 8 is a diagram representing a Markov Logic Network; and
FIG. 9 is a screen shot of an entity-centric analyzer GUI that allows an analyst to co-exploit entity tracks, entity sightings, entity networks and videos containing entities and their interactions.
It is to be understood that the attached drawings are for purposes of illustrating the concepts of the invention and may not be to scale.
DETAILED DESCRIPTION OF THE INVENTION
Referring now to FIG. 1, a system for deriving entity networks from video is depicted, generally indicated at 10. By way of a non-limiting example, the system 10 receives digitized video from one or more cameras 12, which may be rigidly mounted on an aerial platform. The system 10 can also include a digital video capture system 14 and a computing platform 16. The digital video capture system 14 processes streams of digital video, or converts analog video to digital video, to a form which can be processed by the computing platform 16. The digital video capture system 14 may be stand-alone hardware, or cards such as Firewire cards which can plug directly into the computing platform 16. The computing platform 16 may include a personal computer or work-station (e.g., a Pentium-M 1.8 GHz PC-104 or higher) comprising one or more processors 20 and including a bus system 22 which is fed by video data streams 24 via the one or more processors 20 or directly to a computer-readable medium 26. The computer readable medium 26 can also be used for storing the instructions of the system 10 to be executed by the one or more processors 20, including an operating system, such as the Windows or the Linux operating system. The computer readable medium 26 can further be used for the storing and retrieval of the entity networks and associated video clips of the present invention in one or more databases. The computer readable medium 26 can include a combination of volatile memory, such as RAM memory, and non-volatile memory, such as flash memory, optical disk(s), and/or hard disk(s). Portions of a processed video data stream 28 can be stored temporarily in the computer readable medium 26 for later output along with visual diagrams of entity network constructs to a monitor 30. The monitor 30 can display the processed video data stream and the entity network constructs.
Referring now to FIG. 2, a block diagram of an attribute entity network (AEN) 40, constructed in accordance with an embodiment of the present invention, is depicted. Video data 42 is input to an entity association engine 44 from which the AEN 40 is constructed as a graph. The AEN 40 comprises a plurality of nodes 46 and links 48 which may be extracted at least in part from the video data 42. The nodes 46 represent entities, such as vehicles, people and sites (e.g., buildings, parking lots, roads), and the links 48 represent the relationships among two or more nodes 46 observed from video data 42. Each of the nodes 46 has a unique ID 49, and an associated entity type and entity attributes (not shown), such as locations of buildings or tracks of vehicles and people. Each of the links 48 includes a type (indicated by a color 50), a confidence measure 52 (probability) and a pointer (not shown) to the associated evidence, i.e., the video segment from which the link is established. There can be multiple links between two entities, each of which represents an association of two entities observed from WAVS data (e.g., the links 54 a-54 c indicated by multiple colors). The attribute entity network 40 can be stored in a database (not shown) for searching, exploitation and fusion with other entity or social networks created from other information sources. Evidence associated with a linkage, such as a frame or a clip, can also be stored in the database for forensic analysis and verification.
Entity associations used to derive the links 48 via the entity association engine 44 are found from the a priori understanding of people and vehicle movements and activities through track analysis, to be discussed in connection with FIG. 3 herein below. Some typical associations include:
    • Vehicle-vehicle association: Convoy; vehicles parked close by in a deserted area; vehicle-to-vehicle transfer of materials; interactions among occupants of vehicles, etc.
    • People-people association: Walking, running together; meeting; entering/leaving the same vehicle/house; involved in the same activities, such as loading, unloading a vehicle.
    • People-vehicle association: Entering, exiting, loading and unloading a vehicle.
    • Vehicle-site association: Entering or leaving a garage or the parking lot of a building; parked close to a house.
    • People-site association: Entering or leaving a house or a building; often seen in an area or on a road or a walkway.
    • Site-site association: Site-to-site association is mainly established through people and vehicles associated with the two or more sites. For instance, two or more vehicles driven by a few people leaving a warehouse and ending at slightly different times at a chemical factory establishes an association between the warehouse and the factory through the agents connecting them, the vehicles and people.
Referring now to FIG. 3, a block diagram pertaining to a pattern analysis technique for track based entity association to derive the links 48, link types, and link certainties of FIG. 2 is depicted. People and vehicle movements are key features for linking people, vehicles and sites. For example, a vehicle that leaves a house, picks up a passenger from another house, and enters the garage of an office building links not only the two people in the vehicle, but also the three sites and other people in the two residences. However, association using tracks may not always be as straightforward as in the above example. Vehicles parked in a parking lot or stopped at an intersection may or may not be related. Therefore, in addition to detecting links, it is necessary to assess the certainty and importance of a link between two entities using tracks (i.e., the trajectory of an object in a video).
As shown in FIG. 3, UAV video 60 is fed to the first module, tracking and analysis module 62. The tracking and analysis module 62 breaks entity tracks from the video 60 into tracklets 64 by identifying entity behaviors, also called spatial actions, such as start, stop, turn, appear and disappear. The tracking and analysis module 62 determines which objects in the video 60 are moving and then tracks those objects. At substantially the same time, the UAV video 60 is fed to an urban structure extraction module 66 (along with optional GIS data 67), which determines which structures in the video 60 are buildings and builds descriptions of site locations (sites 68) from the video 60. With the set of tracklets 64 and site locations and extents 68, pattern analysis 69 is performed to group the tracklets 64 or the tracklets 64 and the sites 68. Referring now to FIG. 4, tracklets are represented as 3D (x, y, t) poly-lines. S1 is a similarity measure for the closeness of two tracklets, which is used for detecting a convoy. S2 measures the (x, y, t) distance of the start/end points of two tracklets, which is used for detecting entity interactions, such as mounting and dismounting. Using a similarity measure (i.e., a metric for the closeness of two tracklets), an adjacency graph is built for the tracklets 64 and the sites 68. Using graph analysis, links 70 are generated from the adjacency graph.
Both pre-defined similarity measures and those extracted from track data using intrinsic dimension analysis can be employed. Pre-defined similarity measures are related to activities or behavior of interest. For example, a spatio-temporal similarity measure at the tracklet level can be used to detect a convoy and group of people walking together. The distance in the (x, y, t) space between two end points of two tracklets can be used to detect people mounting or dismounting vehicles. Similarity measures discovered from tracklets can cue analysts to unknown patterns that might be of interest.
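By way of a non-limiting illustration, the two pre-defined measures and the adjacency graph can be sketched as follows. The specification does not give formulas for S1 or S2, so the mean distance at shared time stamps, the Euclidean endpoint distance, the thresholds `convoy_max` and `endpoint_max`, and all function names are assumptions made only for this sketch.

```python
import math
from itertools import combinations

# A tracklet is a poly-line in (x, y, t): a list of (x, y, t) samples.

def s1_closeness(a, b):
    """Assumed S1: mean (x, y) distance at shared time stamps; None if no overlap."""
    pos_b = {t: (x, y) for x, y, t in b}
    dists = [math.hypot(x - pos_b[t][0], y - pos_b[t][1])
             for x, y, t in a if t in pos_b]
    return sum(dists) / len(dists) if dists else None

def s2_endpoint_distance(a, b, time_scale=1.0):
    """Assumed S2: (x, y, t) distance between the end of `a` and the start of `b`."""
    (xa, ya, ta), (xb, yb, tb) = a[-1], b[0]
    return math.sqrt((xa - xb) ** 2 + (ya - yb) ** 2
                     + (time_scale * (ta - tb)) ** 2)

def build_links(tracklets, convoy_max=5.0, endpoint_max=3.0):
    """Return adjacency-graph edges (id_a, id_b, label) from the two measures."""
    links = []
    for (ia, a), (ib, b) in combinations(sorted(tracklets.items()), 2):
        s1 = s1_closeness(a, b)
        if s1 is not None and s1 <= convoy_max:
            links.append((ia, ib, "moving together"))
        s2 = min(s2_endpoint_distance(a, b), s2_endpoint_distance(b, a))
        if s2 <= endpoint_max:
            links.append((ia, ib, "endpoint interaction"))
    return links

tracklets = {
    "car1": [(0, 0, 0), (10, 0, 1), (20, 0, 2)],
    "car2": [(0, 3, 0), (10, 3, 1), (20, 3, 2)],
    "person1": [(21, -1, 3), (22, -5, 4)],  # appears where car1 stopped
}
links = build_links(tracklets)
```

In this toy data the two cars travel side by side (a convoy candidate) and the person track starts next to the end of the `car1` track (a dismount candidate).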
FIG. 5A shows preliminary results of automated entity association through track analysis using WAVS video data. Vehicle-Vehicle, Vehicle-People and People-People associations derived from tracks analysis include track-related behaviors such as vehicles traveling as a convoy, and people mounting and dismounting vehicles. The two columns of chips are vehicle chips on the left 82 and people chips on the right 84. The links 86 among vehicle chips represent detected convoy behavior. Links 88 among vehicle and people chips represent mounting and dismounting activity. FIG. 5B shows mounting activity captured as evidence for associating a person with a vehicle.
Referring again to FIG. 3, the link classification module 71 classifies the links 70 into categories (link types 72) based on attributes of the tracklets 64 and the sites 68. For instance, when a stopped vehicle track and appearance of a person track indicate a dismounting activity, the person is determined to be the occupant of the vehicle. The last module, context normalization module 80, estimates the importance and the certainty of a link 70. The context normalization module 80 weights each link by the type of the link 70, the type of the site 68 and the inverse frequency of the same link types 72 and the site 68. In this way, two cars parked together in a parking lot will be a much weaker link than two cars parking together in a deserted area or around a house. However, certain link types 72, e.g., mounting and dismounting, are considered strong links even if the mounting and dismounting occur in a parking lot. The output of the context normalization module 74 is a link certainty 76, which is assigned to the link 70 and is a measure of the degree of confidence of assigning the link 70.
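One possible reading of this weighting is sketched below. The base strengths in `BASE_STRENGTH`, the logarithmic inverse-frequency term and the normalization by the strongest link are all assumptions for illustration; the specification states only that link type, site type and inverse frequency are combined.

```python
import math
from collections import Counter

# Hypothetical base strengths per link type (not given in the specification).
BASE_STRENGTH = {"parked_together": 0.3, "mount": 0.9, "dismount": 0.9}

def link_certainties(links):
    """links: list of (link_type, site_type). Returns one score in (0, 1] per link."""
    counts = Counter(links)
    total = len(links)
    scores = []
    for link_type, site_type in links:
        # Inverse frequency: a rare (link type, site type) pair is more informative.
        inverse_freq = math.log(1 + total / counts[(link_type, site_type)])
        scores.append(BASE_STRENGTH[link_type] * inverse_freq)
    top = max(scores)
    return [s / top for s in scores]

observed = ([("parked_together", "parking_lot")] * 8
            + [("parked_together", "deserted_area")]
            + [("dismount", "parking_lot")])
scores = link_certainties(observed)
```

With these assumed numbers, cars parked together in a parking lot score lowest, cars parked together in a deserted area score higher, and a dismount in the parking lot scores highest, which mirrors the ordering described above.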
To capture the associations between entities by means of track analysis and to better detect behaviors of interest, it is desirable to detect events. The present invention detects events at multiple levels, from primitive events, such as actions, to complex events over large spatial and temporal extent involving multiple agents. Referring now to Table 1, actions/events can be classified into spatial actions and behavioral actions. Spatial actions, such as start, disappear, turn, etc., can be inferred purely from tracks or the interactions among tracks. Spatial actions are detected using track analysis as described above. The other category of actions is behavioral actions, which are coordinated movements of different parts of an object, e.g., load, unload, push, throw, and other human actions.
TABLE 1
Spatial actions: start, stop, appear, disappear, accelerate, de-accelerate, move, enter, leave, meet, disperse, follow/chase, pass, turn
Behavioral actions: load, unload, drop, pick-up, throw, push, drag, carry, dig, kick, crouch
Behavioral actions typically involve people, objects and their interactions in a short time interval, such as talking/fighting, loading/unloading, etc. The motion of people in behavioral actions can be more complicated than in spatial actions. For example, in a loading action, a hand or arm movement is associated with the object being lifted. To recognize behavioral actions, Spatial-Temporal Object (STO) Analysis is employed which integrates object and object part interactions and generates spatio-temporal motion correlations.
Referring now to FIG. 6, the Spatial-Temporal Object (STO) Analysis process is illustrated. At step 90, a focus-of-attention of moving objects in a video is obtained using Spatio-Temporal Cues (such as motion and appearance). Within the focus of attention, instead of using an object as a whole, spatio-temporal fragments or part models extracted from the objects are used, such as hand and arm regions of a person. As a result, behavioral actions involving articulation can be handled effectively. At step 92, the fragment extraction algorithm uses spatio-temporal video features, such as human poses computed from videos, and model based matching that is resilient to occlusion and background clutter. At step 94, the poses of people are refined. At step 96, objects associated with moving people having a given pose are extracted and classified, such as a box 98 associated with the people 100 having similar poses. At step 102, feature words, called spatio-temporal object words, are extracted from the objects and people moving together, such as walking, running, carrying, digging, etc. The spatio-temporal representation of object fragments, including location, states and motion, is expressed using an STO vocabulary both as words for each time instant and as sentences for a time interval. At step 104, these spatio-temporal object words and sentences are encoded as feature vectors. At step 106, STO sentences are then classified into different actions, such as “Two people carried a box from a truck to a house,” using a Support Vector Machine (SVM) with the optional aid of a database 108 of STO sentences of known activities.
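Steps 104 and 106 can be illustrated with a minimal sketch. The encoding is assumed here to be a bag-of-words count vector, and the SVM is a small linear one trained by sub-gradient descent on the regularized hinge loss; the vocabulary, the training sentences, the hyper-parameters and all function names are hypothetical and are not taken from the specification.

```python
def encode(sentence, vocab):
    """Bag-of-words count vector for a list of STO words."""
    return [sentence.count(word) for word in vocab]

def train_linear_svm(xs, ys, epochs=50, lr=0.1, lam=0.01):
    """Sub-gradient descent on the L2-regularized hinge loss (no bias term)."""
    w = [0.0] * len(xs[0])
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            margin = y * sum(wi * xi for wi, xi in zip(w, x))
            for j in range(len(w)):
                # The hinge term contributes only when the margin is violated.
                grad = lam * w[j] - (y * x[j] if margin < 1 else 0.0)
                w[j] -= lr * grad
    return w

def predict(w, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) >= 0 else -1

# Toy STO sentences: +1 = carrying, -1 = digging (illustrative labels only).
training = [
    (["person", "walk", "carry", "box"], 1),
    (["person", "carry", "box", "truck"], 1),
    (["person", "crouch", "dig"], -1),
    (["person", "dig", "ground"], -1),
]
vocab = sorted({word for sentence, _ in training for word in sentence})
xs = [encode(sentence, vocab) for sentence, _ in training]
ys = [label for _, label in training]
w = train_linear_svm(xs, ys)
```

A practical system would more likely use an established SVM library and a richer encoding of location, state and motion; the sketch only shows how words become vectors and how a maximum-margin classifier separates them.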
Complex composite events over large spatial and temporal extent involving multiple agents present unique challenges for automated detection:
    • Huge candidate space—Because of the volume of WAVS data, the number of objects detected is huge and the number of interactions among these objects is prohibitively large. Moreover, the number of events to be detected grows over time. Therefore, it is desired that an event detection system be able to scale-up to the amount of data and the number of events in WAVS data.
    • Lack of observability and causality in event detection—Not all actions or sub-events can be observed or detected. Therefore, event detection algorithms have to be robust and smart. For example, when unloading of material cannot be directly observed, it can be inferred from the changes from the back of a truck or how people carry a barrel at two time instants. Thus, an effective event algorithm needs to have advanced reasoning capabilities.
    • Uncertainty—Uncertainty in varying degrees is associated with every step of event detection. It is a challenge to manage and reason with uncertainty in event detection.
The present invention addresses these challenges by merging an event ontology with hierarchical weighted graph matching to reduce the candidate space. Only a small number of key sub-events are matched, with detailed matching involving only well-qualified candidates. Additionally, a Markov Logic Network is used for reasoning and inferencing in visual and geo-spatial domains.
Referring now to FIG. 7, a composite event can be represented as an event graph 110 for graph matching. In the event graph 110, a node 112 represents a sub-event and a link 114 represents the type of temporal transition between two nodes, such as “after”, “during”, etc. Each node 112 is also assigned a weight (not shown) proportional to its importance to the overall event. The weights can be user defined in the event ontology. The weights can also be computed from training examples. The weights effectively reveal a level of importance. From a description of an event provided by a user or extracted from a video, a hierarchical event description is formed by removing nodes with small weights and combining the links between nodes with large weights. The example in FIG. 7 depicts a three-level hierarchical event graph of cross-border weapons smuggling.
Using the hierarchical event graph 110, observations are first matched at the highest level. Only those observations receiving a predetermined minimum matching score pass to a next level for verification. This process is repeated with other observations until a predetermined confidence level is achieved for an event hypothesis to be accepted or rejected. In this way, a large number of observations are quickly filtered and detailed matching is only performed on credible candidates.
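The coarse-to-fine filtering can be sketched as follows, under simplifying assumptions: the levels are formed by thresholding node weights, the matching score is the weighted fraction of a level's sub-events present in an observation, and the temporal links between nodes are ignored. The sub-event names, weights, thresholds and minimum score are hypothetical.

```python
def build_levels(node_weights, thresholds):
    """Coarsest level first: keep only nodes whose weight meets each threshold."""
    return [{n for n, w in node_weights.items() if w >= t} for t in thresholds]

def match_score(level_nodes, node_weights, observed):
    """Weighted fraction of this level's sub-events found in the observation."""
    total = sum(node_weights[n] for n in level_nodes)
    hit = sum(node_weights[n] for n in level_nodes if n in observed)
    return hit / total

def filter_candidates(observations, node_weights, thresholds, min_score):
    """Pass each observation down the hierarchy; drop it at the first failed level."""
    survivors = dict(observations)
    for level in build_levels(node_weights, thresholds):
        survivors = {name: obs for name, obs in survivors.items()
                     if match_score(level, node_weights, obs) >= min_score}
    return sorted(survivors)

node_weights = {"load_truck": 0.9, "cross_border": 1.0, "unload_truck": 0.8,
                "stop_at_checkpoint": 0.3, "refuel": 0.1}
observations = {
    "obs_A": {"load_truck", "cross_border", "unload_truck", "stop_at_checkpoint"},
    "obs_B": {"cross_border", "refuel"},
    "obs_C": {"load_truck", "cross_border", "unload_truck"},
}
accepted = filter_candidates(observations, node_weights,
                             thresholds=[0.8, 0.3, 0.0], min_score=0.95)
```

Here `obs_B` is rejected at the highest level, `obs_C` passes the highest level but fails verification at the next one, and only `obs_A` survives all three levels.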
To match an event graph, the similarity between two events is computed. Based on the event ontology, the similarity of a pair of objects or actions is computed using a shortest path length measure between two objects/actions in the object/action taxonomy. For example, among the actions “walk”, “run” and “pick-up”, the similarity value of (walk, run) will be larger than that of (walk, pick-up). The Complex Event Similarity (CES) can be computed as:
$$CES(A, B) = \frac{\sum_{i=1}^{n} w_i \cdot SSE(a_i, b_i)}{\sum_{i=1}^{n} w_i}$$
where SSE(a_i, b_i) is the similarity between two corresponding simple events a_i and b_i from the two streams, and w_i is the importance weight for the simple event a_i. The weights are computed using a Term-Frequency Inverse Document Frequency (TF-IDF) scheme, which has been successfully used to measure the similarity of documents. Each weight is the product of the frequency of the simple event in the event to be matched to (the event template) and the log of the inverse of the frequency of the same simple event observed in the Region-Of-Interest (ROI). Because the weight of a sub-event depends on the ROI, the event matching scheme of the present invention adapts to the environment. For example, in a desert, the frequency of observing a moving object is low. So, when matching an event related to moving objects in a desert, a higher weight is given to the action of moving than when detecting the same event in an urban environment with heavy traffic.
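The computation can be sketched as follows. The small taxonomy in `PARENT` and the mapping from path length d to similarity, 1 / (1 + d), are assumptions for illustration; the specification states only that a shortest path length measure is used and that the weights follow the TF-IDF scheme.

```python
import math
from collections import deque

# Hypothetical action taxonomy, stored as child -> parent edges.
PARENT = {"walk": "locomotion", "run": "locomotion", "locomotion": "action",
          "pick-up": "manipulation", "manipulation": "action"}

def path_length(a, b):
    """Number of edges on the shortest path between two taxonomy nodes."""
    graph = {}
    for child, parent in PARENT.items():
        graph.setdefault(child, set()).add(parent)
        graph.setdefault(parent, set()).add(child)
    queue, seen = deque([(a, 0)]), {a}
    while queue:
        node, dist = queue.popleft()
        if node == b:
            return dist
        for nxt in graph[node] - seen:
            seen.add(nxt)
            queue.append((nxt, dist + 1))
    raise ValueError("nodes are not connected")

def sse(a, b):
    """Simple-event similarity: 1 for identical events, lower for longer paths."""
    return 1.0 / (1.0 + path_length(a, b))

def tfidf_weight(template_freq, roi_freq):
    """Template frequency times the log of the inverse ROI frequency."""
    return template_freq * math.log(1.0 / roi_freq)

def ces(events_a, events_b, weights):
    """Weighted average of simple-event similarities, as in the CES formula."""
    num = sum(w * sse(a, b) for w, a, b in zip(weights, events_a, events_b))
    return num / sum(weights)
```

With the assumed taxonomy, `sse("walk", "run")` exceeds `sse("walk", "pick-up")`, and `tfidf_weight(1, 0.01)` (a rarely observed action, as in a desert) exceeds `tfidf_weight(1, 0.5)` (a commonly observed one, as in heavy traffic).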
For robust and effective event detection, advanced reasoning is needed to fill in the gaps using what is observed and to extract intelligence beyond what is visible in a video. For example, the weight of an object can be inferred from how it was carried, and the status of a person can be inferred from how he gets out of a car and how he is greeted by others. To reason based on objects, tracks, actions, and primitive and complex events, it is desirable to leverage the ease of ingestion and the power of inferencing using first order logic while minimizing the brittleness and scalability limitations of rule-based methods. To this effect, the present invention employs Markov Logic Networks (MLN) as a probabilistic framework for accounting for the uncertainty of video processing and to enable learning. MLN seamlessly integrates learning, logic and probabilistic inferencing and can be used based on either rules or annotated examples or both for event detection and reasoning.
A Markov Logic Network is a set of pairs (F, w) where F is a formula in first-order logic and w is a weight (real number). These weights can be determined a priori, or can be learned from observed data or examples. Together with a set of constants, MLN defines a network with one node for each grounding (achieved by assigning a constant to a variable) of each predicate in a MLN. A sample ground MLN is shown in FIG. 8. The network 116 includes edges 118 between pairs of atoms 120, which are groundings of predicates. The probability distribution over possible worlds, x, specified by a ground Markov network is:
$$P(x) = \frac{1}{Z} \exp\left( \sum_{i} w_i \, n_i(x) \right)$$
where w_i represents the weight of formula i, n_i(x) is the number of true groundings of formula i in x, and Z is a normalization factor.
MLN is used to infer properties of objects and outcomes of events or actions. A geo-spatial and visual ontology can be developed to provide the attribute set of an object and a rule set for inferencing. The inputs to the MLN reasoning engine are factlets (i.e., assertions of the video content) extracted from WAVS videos. The goal of employing an MLN is to infer information from these factlets, such as inferring that a box is heavy if two people, rather than one, are carrying it.
Based on factlets from WAVS data, MLN dynamically creates a network and learns the appropriate weights for the formulae that constitute the knowledge base. Once the weights have been updated, MLN can be used to answer queries, e.g., does the knowledge base entail a specific event-related hypothesis? (e.g., “Is the box heavy?” in FIG. 8). Inferencing using MLNs reduces to the problem of computing the probability that Formula_x is true given that Formula_i is true. MLN enables inferencing about properties/attributes of objects, outcomes of actions and occurrence of complex events such as clandestine meetings, ambush or transportation of weapons or bomb making materials.
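For a very small ground network, the distribution P(x) and a conditional query can be evaluated by brute-force enumeration of possible worlds, as sketched below. The atoms `CarriedByTwo` and `Heavy`, the single formula and its weight of 1.5 are hypothetical; practical MLN engines use approximate inference rather than enumeration.

```python
import math
from itertools import product

ATOMS = ("CarriedByTwo", "Heavy")

# (weight, n) pairs: n maps a world (dict atom -> bool) to the number of true
# groundings of the formula (0 or 1 here, since there is a single constant).
FORMULAS = [
    # CarriedByTwo => Heavy, with an assumed weight of 1.5
    (1.5, lambda world: int((not world["CarriedByTwo"]) or world["Heavy"])),
]

def unnormalized(world):
    """exp(sum_i w_i * n_i(x)) for one possible world x."""
    return math.exp(sum(weight * n(world) for weight, n in FORMULAS))

WORLDS = [dict(zip(ATOMS, values))
          for values in product([False, True], repeat=len(ATOMS))]
Z = sum(unnormalized(world) for world in WORLDS)  # normalization factor

def probability(world):
    return unnormalized(world) / Z

def conditional(query, evidence):
    """P(query atom is true | evidence atom is true), by enumeration."""
    consistent = [world for world in WORLDS if world[evidence]]
    joint = sum(probability(world) for world in consistent if world[query])
    return joint / sum(probability(world) for world in consistent)

p_heavy = conditional("Heavy", "CarriedByTwo")
```

Under these assumptions the query "Is the box heavy, given that two people carry it?" evaluates to exp(1.5) / (exp(1.5) + 1), roughly 0.82; a larger formula weight pushes the answer closer to 1.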
To accurately detect anomalous behaviors and anomalous changes of behaviors of an entity, the function of the entity in its urban environment needs to be understood. To this end, ongoing activities in urban areas are observed and functional characteristics of urban entities are modeled and inferred to create an urban context. Using GIS and image analysis, major urban structures, such as roads, buildings, squares, lots, water bodies and open spaces, are labeled. Then, an activity model of each urban entity is built using statistics of related vehicles, people and their movement. For example, the activity model of a building will be the number and the type of vehicles entering/leaving the building as a function of time and date. In this way, urban context also captures cultural information, such as the difference between weekday and weekend activities and the difference in vehicle activities in different parts of a city.
Using activity models together with the physical characteristics of an urban structure, urban structures can be classified not only into broad categories, such as residential area, shopping district, factory and office complex, but also into fine classifications, such as movie theaters, retail stores, restaurants, garages and mosques. For example, a large number of vehicles will arrive at and leave movie theaters at regular intervals based on the movie schedule, while vehicles arrive at and leave a retail store continuously throughout the day, in numbers that fluctuate according to the time of day but are much less predictable.
Additionally, activity models can also identify functional components that are difficult to detect purely based on appearances. Using tracks and track statistics, the present invention can label the entrance of a building, egress/ingress points of an area, such as gates or check-points, parking lots, driveways or alleys, etc. The activity of a given structure or site can be compared with the activity of structures of the same type. In this way, abnormal structures are identified, such as a house or a store that has much more car activity than the norm of its class.
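A simple form of such an activity model and class-norm comparison is sketched below. Counting visits per hour and vehicle type, and flagging sites whose daily count lies more than `z_threshold` standard deviations above the class mean, are assumptions for illustration; the specification does not name a particular statistic.

```python
from collections import Counter
from statistics import mean, pstdev

def activity_model(events):
    """events: iterable of (hour_of_day, vehicle_type). Counts per (hour, type)."""
    return Counter(events)

def abnormal_sites(daily_counts, z_threshold=2.0):
    """daily_counts: site -> vehicle visits per day, for sites of one class."""
    values = list(daily_counts.values())
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return []  # all sites behave identically; nothing stands out
    return sorted(site for site, count in daily_counts.items()
                  if (count - mu) / sigma > z_threshold)

model = activity_model([(8, "car"), (8, "car"), (17, "truck")])

# Hypothetical daily vehicle visits for ten houses in one neighborhood.
houses = {"h1": 4, "h2": 5, "h3": 3, "h4": 4, "h5": 6,
          "h6": 5, "h7": 4, "h8": 5, "h9": 4, "h10": 40}
flagged = abnormal_sites(houses)
```

In this toy data only `h10`, with about ten times the car activity of its neighbors, exceeds the threshold.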
The present invention can provide advanced capabilities for searching, browsing, retrieval and visualization:
    • In addition to the simple space-time coverage based search, the present invention enables content-based search and retrieval for entities, entity associations, entity tracks, events and anomalies. The present invention supports searches on people and vehicle activities, such as people and vehicles entering/leaving an area of interest (AOI) in a time interval or vehicles briefly stopped along a given road.
    • The present invention also enables composite queries defined through a workflow. Composite queries, such as “find vehicles that leave area A, take different routes and meet in area B”, cannot be expressed as a single simple query. The present invention allows analysts to build composite queries using simple queries and workflow tools.
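The composite query quoted above can be sketched as a composition of simple queries. The rectangular areas, the way-point comparison used to decide that routes differ, and all function names are hypothetical; the sketch also does not check that the vehicles arrive in area B at about the same time, which a fuller "meet" test would require.

```python
def box(x0, y0, x1, y1):
    """Axis-aligned area of interest, returned as a point-membership test."""
    return lambda p: x0 <= p[0] <= x1 and y0 <= p[1] <= y1

def leaves(track, area):
    """Simple query: the track starts inside `area` and ends outside it."""
    return area(track[0]) and not area(track[-1])

def ends_in(track, area):
    """Simple query: the track ends inside `area`."""
    return area(track[-1])

def route(track):
    """Intermediate way-points (everything except the first and last sample)."""
    return tuple((x, y) for x, y, _ in track[1:-1])

def leave_a_meet_in_b(tracks, area_a, area_b):
    """Composite query built from the simple queries above."""
    hits = {name: t for name, t in tracks.items()
            if leaves(t, area_a) and ends_in(t, area_b)}
    routes = {route(t) for t in hits.values()}
    # Require at least two vehicles, each taking a distinct route.
    return sorted(hits) if len(hits) >= 2 and len(routes) == len(hits) else []

tracks = {  # (x, y, t) samples per vehicle
    "v1": [(0, 0, 0), (5, 0, 1), (10, 10, 2)],
    "v2": [(1, 1, 0), (0, 5, 1), (10, 11, 3)],
    "v3": [(0, 1, 0), (5, 5, 1), (30, 30, 2)],
}
result = leave_a_meet_in_b(tracks, box(0, 0, 2, 2), box(9, 9, 12, 12))
```

Here `v1` and `v2` leave area A by different routes and both end in area B, while `v3` ends elsewhere and is excluded.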
The present invention provides entity and event centric browsing tools that help analysts exploit complex relationships among entities and events for both intelligence and forensic analysis. FIG. 9 shows an envisioned entity-centric analyzer that allows an analyst to co-exploit entity tracks, entity sightings, entity networks and videos containing entities and their interactions in the same GUI window. Analysts also can easily move the focal point from one entity to a related entity with a single click.
It is to be understood that the exemplary embodiments are merely illustrative of the invention and that many variations of the above-described embodiments may be devised by one skilled in the art without departing from the scope of the invention. It is therefore intended that all such variations be included within the scope of the following claims and their equivalents.

Claims (20)

What is claimed is:
1. A computer implemented method for deriving, from video data, an association between at least two entities in motion, comprising:
extracting the entities from the video data;
tracking trajectories of the entities based on the video data to form two or more tracklets;
deriving one or more associations between the entities by: detecting an event based on at least one spatio-temporal motion correlation between the entities;
calculating a similarity measure of the closeness of the tracklets;
identifying entity behaviors comprising at least one of spatial actions and behavioral action;
performing pattern analysis to group the tracklets and sites; and
merging an event ontology with hierarchical weighted graph matching to reduce candidate space wherein the candidate space comprises all entities to be tracked.
2. The method of claim 1, wherein extracting the entities further comprises:
detecting moving objects and classifying them as at least one of vehicles or people; and
determining which structures in the video data are sites, comprising at least one of roads, parking lots, buildings and building descriptions.
3. The method of claim 1, wherein the at least one event is classified as one of a spatial action and a behavioral action.
4. The method of claim 3, wherein a behavioral action is detected using Spatial-Temporal Object (STO) Analysis.
5. The method of claim 4, wherein STO Analysis further comprises:
obtaining a focus-of-attention of moving objects in the video data using Spatio-Temporal Cues;
obtaining spatio-temporal fragments extracted from the moving objects within the focus-of-attention, the moving objects including at least one person;
combining the obtained spatio-temporal fragments to compute at least one pose of the at least one person;
extracting and classifying at least one object associated with the at least one person;
extracting feature words from the at least one object and the at least one person to create spatio-temporal object words;
encoding the spatio-temporal object words as feature vectors; and
classifying the feature vectors using a Support Vector Machine (SVM).
6. The method of claim 1, wherein the step of merging an event ontology with hierarchical weighted graph matching further comprises the steps of:
constructing an event graph wherein a node represents a sub-event and a link represents a type of a temporal transition between two nodes, the link being assigned a weight that is proportional to an importance of the temporal transition to the overall event;
forming a hierarchical event description by removing nodes with small weights and combining the links between nodes with large weights;
matching observations using the event graph at its highest level, wherein observations receiving a predetermined minimum matching score are passed to a next level for verification; and
repeating the step of matching with other observations until a predetermined confidence level is achieved for accepting or rejecting an event.
7. The method of claim 6, wherein the step of matching further comprises the step of computing the similarity between two events using a shortest path length measure between two entities/actions in an entity/action taxonomy.
8. An apparatus for deriving an association between at least two entities in motion from video data captured by at least one sensor, comprising:
a processor communicatively connected to said at least one sensor, the processor being configured for:
extracting the entities from the video data;
tracking trajectories of the entities based on the video data to form two or more tracklets; and
deriving one or more associations between the entities by: detecting an event based on at least one spatio-temporal motion correlation between the entities;
calculating a similarity measure of the closeness of the tracklets;
identifying entity behaviors comprising at least one of spatial actions and behavioral action;
performing pattern analysis to group the tracklets and sites; and
merging an event ontology with hierarchical weighted graph matching to reduce candidate space wherein the candidate space comprises all entities to be tracked.
9. The apparatus of claim 8, wherein extracting the entities further comprises:
detecting moving objects and classifying them as at least one of vehicles or people; and
determining which structures in the video data are sites, comprising at least one of roads, parking lots, buildings and building descriptions.
10. The apparatus of claim 8, wherein the at least one event is classified as one of a spatial action and a behavioral action.
11. The apparatus of claim 10, wherein a behavioral action is detected by:
obtaining a focus-of-attention of moving objects in the video data using Spatio-Temporal Cues;
obtaining spatio-temporal fragments extracted from the entities in motion within the focus-of-attention, the entities in motion including at least one person;
combining the spatio-temporal fragments to compute at least one pose of the at least one person;
extracting and classifying at least one object associated with the at least one person;
extracting feature words from the at least one object and the at least one person to create spatio-temporal object words;
encoding the spatio-temporal object words as feature vectors; and
classifying the feature vectors using a Support Vector Machine (SVM).
12. The apparatus of claim 8, wherein the processor is further configured for merging an event ontology with hierarchical weighted graph matching to reduce candidate space wherein the candidate space comprises all entities to be tracked comprising:
constructing an event graph wherein a node represents a sub-event and a link represents a type of a temporal transition between two nodes, the link being assigned a weight that is proportional to an importance of the temporal transition to the event;
forming a hierarchical event description by removing nodes with small weights and combining the links between nodes with large weights;
matching observations using the event graph at its highest level, wherein observations receiving a predetermined minimum matching score are passed to a next level for verification; and
repeating the step of matching with other observations until a predetermined confidence level is achieved for accepting or rejecting an event.
13. A computer implemented method for deriving, from video data, an association between at least two entities in motion, comprising:
extracting the entities from the video data;
tracking trajectories of the entities;
deriving one or more associations between the entities by detecting at least one event based on at least one spatio-temporal motion correlation between the entities; and
merging an event ontology with hierarchical weighted graph matching to reduce candidate space wherein the candidate space comprises all entities to be tracked.
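As an illustration of deriving associations from a spatio-temporal motion correlation, the sketch below links two entities when their tracked positions stay close for enough time steps. The proximity test, the `radius` and `min_frames` parameters, the track format `(t, x, y)`, and the entity names are assumptions chosen for the example; the claim does not prescribe a particular correlation measure.

```python
from itertools import combinations
from math import hypot

def co_movement_frames(track_a, track_b, radius):
    """Count time steps at which both entities are within `radius` of each other."""
    positions_b = {t: (x, y) for t, x, y in track_b}
    return sum(
        1
        for t, x, y in track_a
        if t in positions_b
        and hypot(x - positions_b[t][0], y - positions_b[t][1]) <= radius
    )

def build_entity_network(tracks, radius=5.0, min_frames=3):
    """Return {(entity_a, entity_b): frames together} for associated pairs."""
    network = {}
    for a, b in combinations(sorted(tracks), 2):
        frames = co_movement_frames(tracks[a], tracks[b], radius)
        if frames >= min_frames:
            network[(a, b)] = frames
    return network

# Hypothetical tracks: vehicle_2 follows vehicle_1, person_1 stands far away.
tracks = {
    "vehicle_1": [(t, 10.0 * t, 0.0) for t in range(5)],
    "vehicle_2": [(t, 10.0 * t - 3.0, 1.0) for t in range(5)],
    "person_1": [(t, 0.0, 50.0) for t in range(5)],
}
network = build_entity_network(tracks)
print(network)
```

The resulting dictionary is a small entity network: keys are links between entities and values indicate how long the supporting event lasted.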
14. The method of claim 13, wherein extracting the entities further comprises:
detecting moving objects and classifying them as at least one of vehicles or people; and
determining which structures in the video data are sites, comprising at least one of roads, parking lots, buildings and building descriptions.
15. The method of claim 13, wherein the at least one event is classified as one of a spatial action and a behavioral action.
16. The method of claim 13 wherein tracking the trajectories of the entities forms two or more tracklets, and the method further comprises:
calculating a similarity measure of the closeness of the two or more tracklets; and
performing pattern analysis using the similarity measure to group the two or more tracklets.
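The two steps of claim 16 can be sketched as below. The similarity measure (an exponential decay of the spatial gap between the end of the earlier tracklet and the start of the later one) and the grouping rule (connected components over pairs whose similarity exceeds a threshold) are stand-ins chosen for brevity; the claim does not specify either, and the `scale` and `threshold` values are hypothetical.

```python
from math import exp, hypot

def similarity(a, b, scale=10.0):
    """Closeness of two tracklets, in (0, 1].

    Each tracklet is a time-ordered list of (t, x, y). The tracklet that
    ends no later than the other starts is treated as the earlier one.
    """
    first, second = (a, b) if a[-1][0] <= b[0][0] else (b, a)
    _, x1, y1 = first[-1]
    _, x2, y2 = second[0]
    return exp(-hypot(x2 - x1, y2 - y1) / scale)

def group_tracklets(tracklets, threshold=0.5):
    """Group tracklets whose pairwise similarity reaches the threshold."""
    names = sorted(tracklets)
    parent = {name: name for name in names}

    def find(name):
        # Union-find root lookup with path halving.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if similarity(tracklets[a], tracklets[b]) >= threshold:
                parent[find(b)] = find(a)

    groups = {}
    for name in names:
        groups.setdefault(find(name), []).append(name)
    return sorted(groups.values())

# Hypothetical tracklets: t2 continues t1 after a short gap, t3 is elsewhere.
tracklets = {
    "t1": [(0, 0.0, 0.0), (1, 5.0, 0.0)],
    "t2": [(2, 8.0, 0.0), (3, 13.0, 0.0)],
    "t3": [(0, 100.0, 100.0), (1, 105.0, 100.0)],
}
groups = group_tracklets(tracklets)
print(groups)
```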
17. A computer implemented method for deriving, from video data, an association between at least two entities in motion, comprising:
extracting the entities from the video data;
tracking trajectories of the entities;
deriving one or more associations between the entities by detecting at least one event based on at least one spatio-temporal motion correlation between the entities; and
employing a Markov Logic Network for reasoning and inferencing in visual and geo-spatial domains.
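To show what reasoning with a Markov Logic Network involves, the sketch below performs exact inference over a tiny, fully grounded network by enumerating possible worlds, where each world has probability proportional to the exponential of the summed weights of the formulas it satisfies. The atoms, the two rules, and their weights are hypothetical examples, and brute-force enumeration is used only because the example is small; practical systems rely on approximate inference.

```python
from itertools import product
from math import exp

# Hypothetical ground atoms for two tracked entities A and B.
ATOMS = ["Follows(A,B)", "SameSite(A,B)", "Associated(A,B)"]

def implies(p, q):
    return (not p) or q

# Weighted ground formulas: (weight, function mapping a world to True/False).
FORMULAS = [
    (2.0, lambda w: implies(w["Follows(A,B)"], w["Associated(A,B)"])),
    (1.0, lambda w: implies(w["SameSite(A,B)"], w["Associated(A,B)"])),
]

def world_weight(world):
    """Unnormalized probability: exp(sum of weights of satisfied formulas)."""
    return exp(sum(weight for weight, formula in FORMULAS if formula(world)))

def query(atom, evidence, atoms=ATOMS):
    """P(atom is True | evidence), by enumerating all consistent worlds."""
    hidden = [a for a in atoms if a not in evidence]
    numerator = denominator = 0.0
    for values in product([False, True], repeat=len(hidden)):
        world = {**evidence, **dict(zip(hidden, values))}
        weight = world_weight(world)
        denominator += weight
        if world[atom]:
            numerator += weight
    return numerator / denominator

p_with_evidence = query(
    "Associated(A,B)", {"Follows(A,B)": True, "SameSite(A,B)": True}
)
p_without_support = query(
    "Associated(A,B)", {"Follows(A,B)": False, "SameSite(A,B)": False}
)
print(p_with_evidence, p_without_support)
```

Unlike hard logical rules, a violated formula lowers a world's probability rather than ruling it out, which is what makes the approach tolerant of noisy visual detections.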
18. The method of claim 17, wherein extracting the entities further comprises:
detecting moving objects and classifying them as at least one of vehicles or people; and
determining which structures in the video data are sites, comprising at least one of roads, parking lots, buildings and building descriptions.
19. The method of claim 17, wherein the at least one event is classified as one of a spatial action and a behavioral action.
20. The method of claim 17 wherein tracking the trajectories of the entities forms two or more tracklets, and the method further comprises:
calculating a similarity measure of the closeness of the two or more tracklets; and
performing pattern analysis using the similarity measure to group the two or more tracklets.
US13/597,698 2007-12-14 2012-08-29 Method for building and extracting entity networks from video Active 2029-01-14 US8995717B2 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US13/597,698 US8995717B2 (en) 2007-12-14 2012-08-29 Method for building and extracting entity networks from video

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
US1388807P 2007-12-14 2007-12-14
US12/271,173 US8294763B2 (en) 2007-12-14 2008-11-14 Method for building and extracting entity networks from video
US13/597,698 US8995717B2 (en) 2007-12-14 2012-08-29 Method for building and extracting entity networks from video

Related Parent Applications (2)

Application Number Title Priority Date Filing Date
US12/217,173 Continuation US20080269661A1 (en) 2006-04-21 2008-07-02 Easy-to-peel securely attaching bandage
US12/271,173 Continuation US8294763B2 (en) 2007-12-14 2008-11-14 Method for building and extracting entity networks from video

Publications (2)

Publication Number Publication Date
US20120321137A1 US20120321137A1 (en) 2012-12-20
US8995717B2 US8995717B2 (en) 2015-03-31

Family

ID=40752659

Family Applications (2)

Application Number Title Priority Date Filing Date
US12/271,173 Active 2031-07-20 US8294763B2 (en) 2007-12-14 2008-11-14 Method for building and extracting entity networks from video
US13/597,698 Active 2029-01-14 US8995717B2 (en) 2007-12-14 2012-08-29 Method for building and extracting entity networks from video

Country Status (1)

Country Link
US (2) US8294763B2 (en)

Families Citing this family (25)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7953690B2 (en) * 2008-01-25 2011-05-31 Eastman Kodak Company Discovering social relationships from personal photo collections
WO2011022577A1 (en) * 2009-08-20 2011-02-24 Purdue Research Foundation Predictive duty cycle adaptation scheme for event-driven wireless sensor networks
US8548203B2 (en) 2010-07-12 2013-10-01 International Business Machines Corporation Sequential event detection from video
EP2676221A1 (en) * 2011-02-16 2013-12-25 Siemens Aktiengesellschaft Object recognition for security screening and long range video surveillance
US9147273B1 (en) * 2011-02-16 2015-09-29 Hrl Laboratories, Llc System and method for modeling and analyzing data via hierarchical random graphs
US8903128B2 (en) 2011-02-16 2014-12-02 Siemens Aktiengesellschaft Object recognition for security screening and long range video surveillance
JP5821625B2 (en) * 2011-08-29 2015-11-24 カシオ計算機株式会社 Image editing apparatus and program
US9911043B2 (en) * 2012-06-29 2018-03-06 Omni Ai, Inc. Anomalous object interaction detection and reporting
US10186123B2 (en) * 2014-04-01 2019-01-22 Avigilon Fortress Corporation Complex event recognition in a sensor network
US9697828B1 (en) * 2014-06-20 2017-07-04 Amazon Technologies, Inc. Keyword detection modeling using contextual and environmental information
US10572825B2 (en) 2017-04-17 2020-02-25 At&T Intellectual Property I, L.P. Inferring the presence of an occluded entity in a video captured via drone
US10089640B2 (en) * 2015-02-26 2018-10-02 Conduent Business Services, Llc Methods and systems for interpretable user behavior profiling in off-street parking
US9824281B2 (en) * 2015-05-15 2017-11-21 Sportlogiq Inc. System and method for tracking moving objects in videos
US9758246B1 (en) * 2016-01-06 2017-09-12 Gopro, Inc. Systems and methods for adjusting flight control of an unmanned aerial vehicle
US10606814B2 (en) * 2017-01-18 2020-03-31 Microsoft Technology Licensing, Llc Computer-aided tracking of physical entities
US11206375B2 (en) 2018-03-28 2021-12-21 Gal Zuckerman Analyzing past events by utilizing imagery data captured by a plurality of on-road vehicles
US11138418B2 (en) 2018-08-06 2021-10-05 Gal Zuckerman Systems and methods for tracking persons by utilizing imagery data captured by on-road vehicles
CN109388663A (en) * 2018-08-24 2019-02-26 中国电子科技集团公司电子科学研究院 A kind of big data intellectualized analysis platform of security fields towards the society
CN109242024B (en) * 2018-09-13 2021-09-14 中南大学 Vehicle behavior similarity calculation method based on checkpoint data
CN109919078B (en) * 2019-03-05 2024-08-09 腾讯科技(深圳)有限公司 Video sequence selection method, model training method and device
CN110457686A (en) * 2019-07-23 2019-11-15 福建奇点时空数字科技有限公司 A kind of information technology data entity attribute abstracting method based on deep learning
US20220036087A1 (en) * 2020-07-29 2022-02-03 Optima Sports Systems S.L. Computing system and a computer-implemented method for sensing events from geospatial data
CN112581761B (en) * 2020-12-07 2022-04-19 浙江宇视科技有限公司 Collaborative analysis method, device, equipment and medium for 5G mobile Internet of things node
WO2022242827A1 (en) * 2021-05-17 2022-11-24 NEC Laboratories Europe GmbH Information aggregation in a multi-modal entity-feature graph for intervention prediction
CN113779169B (en) * 2021-08-31 2023-09-05 西南电子技术研究所(中国电子科技集团公司第十研究所) Space-time data stream model self-enhancement method

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5093869A (en) * 1990-12-26 1992-03-03 Hughes Aircraft Company Pattern recognition apparatus utilizing area linking and region growth techniques
US20040120581A1 (en) * 2002-08-27 2004-06-24 Ozer I. Burak Method and apparatus for automated video activity analysis
US20050288911A1 (en) * 2004-06-28 2005-12-29 Porikli Fatih M Hidden markov model based object tracking and similarity metrics
US7046169B2 (en) 2003-04-09 2006-05-16 Bucholz Andrew J System and method of vehicle surveillance
US20070263900A1 (en) * 2004-08-14 2007-11-15 Swarup Medasani Behavior recognition using cognitive swarms and fuzzy graphs
US7363548B2 (en) 2003-09-29 2008-04-22 Nortel Networks Limited Probable cause fields in telecommunications network alarm indication messages
US20080123900A1 (en) * 2006-06-14 2008-05-29 Honeywell International Inc. Seamless tracking framework using hierarchical tracklet association
US20080273751A1 (en) * 2006-10-16 2008-11-06 Chang Yuan Detection and Tracking of Moving Objects from a Moving Platform in Presence of Strong Parallax
US7599544B2 (en) 2003-12-01 2009-10-06 Green Vision Systems Ltd Authenticating and authentic article using spectral imaging and analysis
US7787656B2 (en) 2007-03-01 2010-08-31 Huper Laboratories Co., Ltd. Method for counting people passing through a gate
US7999857B2 (en) 2003-07-25 2011-08-16 Stresscam Operations and Systems Ltd. Voice, lip-reading, face and emotion stress analysis, fuzzy logic intelligent camera system

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Graciano et al. "Graph-based Object Tracking Using Structural Pattern Recognition." XX Brazilian Symposium on Computer Graphics and Image Processing, Oct. 7, 2007, pp. 179-186. *
Junejo et al. "Multi Feature Path Modeling for Video Surveillance." Proceedings of the 17th International Conference on Pattern Recognition, vol. 2, Aug. 2004, pp. 716-719. *

Cited By (34)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9544361B2 (en) 2013-12-02 2017-01-10 Qbase, LLC Event detection through text analysis using dynamic self evolving/learning module
US9317565B2 (en) 2013-12-02 2016-04-19 Qbase, LLC Alerting system based on newly disambiguated features
US9208204B2 (en) 2013-12-02 2015-12-08 Qbase, LLC Search suggestions using fuzzy-score matching and entity co-occurrence
US9223833B2 (en) 2013-12-02 2015-12-29 Qbase, LLC Method for in-loop human validation of disambiguated features
US9177254B2 (en) * 2013-12-02 2015-11-03 Qbase, LLC Event detection through text analysis using trained event template models
US9230041B2 (en) 2013-12-02 2016-01-05 Qbase, LLC Search suggestions of related entities based on co-occurrence and/or fuzzy-score matching
US9239875B2 (en) 2013-12-02 2016-01-19 Qbase, LLC Method for disambiguated features in unstructured text
US9542477B2 (en) 2013-12-02 2017-01-10 Qbase, LLC Method of automated discovery of topics relatedness
US9336280B2 (en) 2013-12-02 2016-05-10 Qbase, LLC Method for entity-driven alerts based on disambiguated features
US9547701B2 (en) 2013-12-02 2017-01-17 Qbase, LLC Method of discovering and exploring feature knowledge
US9355152B2 (en) 2013-12-02 2016-05-31 Qbase, LLC Non-exclusionary search within in-memory databases
US9984427B2 (en) 2013-12-02 2018-05-29 Qbase, LLC Data ingestion module for event detection and increased situational awareness
US9424524B2 (en) 2013-12-02 2016-08-23 Qbase, LLC Extracting facts from unstructured text
US9424294B2 (en) 2013-12-02 2016-08-23 Qbase, LLC Method for facet searching and search suggestions
US9430547B2 (en) 2013-12-02 2016-08-30 Qbase, LLC Implementation of clustered in-memory database
US9507834B2 (en) 2013-12-02 2016-11-29 Qbase, LLC Search suggestions using fuzzy-score matching and entity co-occurrence
US9223875B2 (en) 2013-12-02 2015-12-29 Qbase, LLC Real-time distributed in memory search architecture
US9201744B2 (en) 2013-12-02 2015-12-01 Qbase, LLC Fault tolerant architecture for distributed computing systems
US9348573B2 (en) 2013-12-02 2016-05-24 Qbase, LLC Installation and fault handling in a distributed system utilizing supervisor and dependency manager nodes
US9613166B2 (en) 2013-12-02 2017-04-04 Qbase, LLC Search suggestions of related entities based on co-occurrence and/or fuzzy-score matching
US9619571B2 (en) 2013-12-02 2017-04-11 Qbase, LLC Method for searching related entities through entity co-occurrence
US9626623B2 (en) 2013-12-02 2017-04-18 Qbase, LLC Method of automated discovery of new topics
US9659108B2 (en) 2013-12-02 2017-05-23 Qbase, LLC Pluggable architecture for embedding analytics in clustered in-memory databases
US9710517B2 (en) 2013-12-02 2017-07-18 Qbase, LLC Data record compression with progressive and/or selective decomposition
US9720944B2 (en) 2013-12-02 2017-08-01 Qbase Llc Method for facet searching and search suggestions
US9785521B2 (en) 2013-12-02 2017-10-10 Qbase, LLC Fault tolerant architecture for distributed computing systems
US9910723B2 (en) 2013-12-02 2018-03-06 Qbase, LLC Event detection through text analysis using dynamic self evolving/learning module
US9916368B2 (en) 2013-12-02 2018-03-13 QBase, Inc. Non-exclusionary search within in-memory databases
US9922032B2 (en) 2013-12-02 2018-03-20 Qbase, LLC Featured co-occurrence knowledge base from a corpus of documents
US9361317B2 (en) 2014-03-04 2016-06-07 Qbase, LLC Method for entity enrichment of digital content to enable advanced search functionality in content management systems
US20210103718A1 (en) * 2016-10-25 2021-04-08 Deepnorth Inc. Vision Based Target Tracking that Distinguishes Facial Feature Targets
US11544964B2 (en) * 2016-10-25 2023-01-03 Deepnorth Inc. Vision based target tracking that distinguishes facial feature targets
US20210209144A1 (en) * 2020-01-03 2021-07-08 International Business Machines Corporation Internet of things sensor equivalence ontology
US11875550B2 (en) 2020-12-18 2024-01-16 International Business Machines Corporation Spatiotemporal sequences of content

Also Published As

Publication number Publication date
US20090153661A1 (en) 2009-06-18
US20120321137A1 (en) 2012-12-20
US8294763B2 (en) 2012-10-23

Similar Documents

Publication Publication Date Title
US8995717B2 (en) Method for building and extracting entity networks from video
Hakeem et al. Video analytics for business intelligence
US20150279182A1 (en) Complex event recognition in a sensor network
US9569531B2 (en) System and method for multi-agent event detection and recognition
US20100036875A1 (en) system for automatic social network construction from image data
Razi et al. Deep learning serves traffic safety analysis: A forward‐looking review
Ferryman et al. Robust abandoned object detection integrating wide area visual surveillance and social context
Kwak et al. Detection of dominant flow and abnormal events in surveillance video
Beyer et al. Towards a principled integration of multi-camera re-identification and tracking through optimal bayes filters
Youssef et al. Automatic vehicle counting and tracking in aerial video feeds using cascade region-based convolutional neural networks and feature pyramid networks
Athanesious et al. Detecting abnormal events in traffic video surveillance using superorientation optical flow feature
Anisha et al. Automated vehicle to vehicle conflict analysis at signalized intersections by camera and LiDAR sensor fusion
Lejmi et al. Event detection in video sequences: Challenges and perspectives
Noh et al. Analysis of vehicle–pedestrian interactive behaviors near unsignalized crosswalk
Jiao et al. Traffic behavior recognition from traffic videos under occlusion condition: a Kalman filter approach
Chow et al. Robust object detection fusion against deception
Kooij et al. Mixture of switching linear dynamics to discover behavior patterns in object tracks
Ganagavalli et al. YOLO-based anomaly activity detection system for human behavior analysis and crime mitigation
Kwak et al. Abandoned luggage detection using a finite state automaton in surveillance video
Zhang et al. A Multiple Instance Learning and Relevance Feedback Framework for Retrieving Abnormal Incidents in Surveillance Videos.
Porter et al. A framework for activity detection in wide-area motion imagery
Lauffenburger et al. Traffic sign recognition: Benchmark of credal object association algorithms
Ji et al. An expert ensemble for detecting anomalous scenes, interactions, and behaviors in autonomous driving
Levchuk et al. Adversarial behavior recognition from layered and persistent sensing systems
Zhang et al. Semantic retrieval of events from indoor surveillance video databases

Legal Events

Date Code Title Description
FEPP Fee payment procedure

Free format text: PAYOR NUMBER ASSIGNED (ORIGINAL EVENT CODE: ASPN); ENTITY STATUS OF PATENT OWNER: SMALL ENTITY

STCF Information on status: patent grant

Free format text: PATENTED CASE

MAFP Maintenance fee payment

Free format text: PAYMENT OF MAINTENANCE FEE, 4TH YR, SMALL ENTITY (ORIGINAL EVENT CODE: M2551); ENTITY STATUS OF PATENT OWNER: SMALL ENTITY

Year of fee payment: 4

MAFP Maintenance fee payment

Free format text: PAYMENT OF MAINTENANCE FEE, 8TH YR, SMALL ENTITY (ORIGINAL EVENT CODE: M2552); ENTITY STATUS OF PATENT OWNER: SMALL ENTITY

Year of fee payment: 8