JP2022019339A

JP2022019339A - Information processing apparatus, information processing method, and program

Info

Publication number: JP2022019339A
Application number: JP2020123119A
Authority: JP
Inventors: 俊太舘; Shunta Tachi; 修平小川; Shuhei Ogawa; 裕輔御手洗; Hirosuke Mitarai
Original assignee: Canon Inc
Current assignee: Canon Inc
Priority date: 2020-07-17
Filing date: 2020-07-17
Publication date: 2022-01-27

Abstract

To allow tracking to continue stably even when objects with similar postures or similar appearance features come close to each other. SOLUTION: An information processing apparatus that detects one or more objects from an image has: estimation means that, based on a learned model that has learned image features indicating the occlusion relationship between an occluding object and an occluded object, estimates, for each object detected from the image, its occlusion relationship with the other objects detected from the image; and specification means that, based on the occlusion relationship estimated by the estimation means, specifies, for each object detected from the image, its correspondence with an object detected in an image captured at a time different from that of the image. SELECTED DRAWING: Figure 3

Description

The present invention relates to a technique for tracking a subject.

Techniques for tracking a specific subject in an image include methods that use luminance or color information and template matching. In recent years, techniques using deep neural networks (hereinafter, DNNs) have attracted attention as highly accurate tracking techniques.

Non-Patent Document 1 describes one method for tracking a specific subject in an image. An image showing the tracking target and an image serving as the search range are each input to convolutional neural networks (hereinafter, CNNs) that share the same weights. The position of the tracking target within the search-range image is then identified by computing the cross-correlation between the features obtained from the CNNs. While such a tracking method can localize the target accurately, it is prone to tracking the wrong target when objects similar to the target overlap on the screen.
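The cross-correlation step described above can be illustrated with a toy sketch. It is not the method of Non-Patent Document 1 itself: it assumes single-channel feature maps given as nested lists, and the names `cross_correlate` and `argmax_2d` are hypothetical helpers introduced only for this illustration.

```python
def cross_correlate(search, template):
    """Slide `template` over `search` and return the response map.

    Both arguments are 2-D lists of numbers standing in for feature maps
    (an assumption; real trackers correlate multi-channel CNN features).
    Only positions where the template fits entirely are evaluated.
    """
    sh, sw = len(search), len(search[0])
    th, tw = len(template), len(template[0])
    response = []
    for y in range(sh - th + 1):
        row = []
        for x in range(sw - tw + 1):
            score = sum(
                search[y + dy][x + dx] * template[dy][dx]
                for dy in range(th)
                for dx in range(tw)
            )
            row.append(score)
        response.append(row)
    return response


def argmax_2d(response):
    """Return (row, col) of the largest value in the response map."""
    best = max(
        ((value, y, x) for y, row in enumerate(response) for x, value in enumerate(row)),
        key=lambda item: item[0],
    )
    return best[1], best[2]
```

If the search region contains two copies of the same pattern, the response map has two equally high peaks, which is the ambiguity the text attributes to this kind of tracker when similar objects overlap.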

To avoid this, there are methods, typified by that of Patent Document 1, that build histograms from the color features and depth information of the detected object's region and examine changes in them to determine whether the object is occluded.

Patent Document 1: US Patent Application No. 10185877 (B2)

Non-Patent Document 1: Bertinetto et al., Fully-Convolutional Siamese Networks for Object Tracking, arXiv 2016

However, with the method of Patent Document 1, when objects with similar postures or similar appearance features overlap on the screen, the histograms of features such as color and texture hardly differ, so occlusion cannot be determined. For example, in team sports, several people within a small area often have the same clothing and posture, and a different person may be mistaken for the same person and tracked. The present invention has been made in view of this problem, and its object is to continue tracking stably even when objects with similar appearance features or postures come close to each other.

An information processing apparatus according to the present invention that solves the above problem is an information processing apparatus that detects one or more objects from an image, and has: estimation means that, based on a learned model that has learned image features indicating the occlusion relationship between an occluding object and an occluded object, estimates, for each object detected from the image, its occlusion relationship with the other objects detected from the image; and specification means that, based on the occlusion relationship estimated by the estimation means, specifies, for each object detected from the image, its correspondence with an object detected in an image captured at a time different from that of the image.

According to the present invention, tracking can be continued stably even when objects with similar appearance features or postures come close to each other.

Schematic diagram illustrating an example of object detection.
Diagram showing an example hardware configuration of the information processing apparatus.
Block diagram showing an example functional configuration of the information processing apparatus.
Flowchart showing the processing procedure executed by the information processing apparatus.
Diagram showing example results of processing by the information processing apparatus.
Diagram showing example results of processing by the information processing apparatus.
Flowchart showing the processing procedure executed by the information processing apparatus.
Examples of derived forms of occlusion information.
Block diagram showing an example functional configuration of the information processing apparatus.
Diagram showing example results of processing by the information processing apparatus.
Flowchart showing the processing procedure executed by the information processing apparatus.
Diagram showing an example of the learning process of the information processing apparatus.
Diagram showing example results of processing by the information processing apparatus.
Diagram showing example results of processing by the information processing apparatus.
Diagram showing example results of processing by the information processing apparatus.
Diagram showing an example of the learning process of the information processing apparatus.

<Embodiment 1>
An information processing apparatus according to the embodiment will be described with reference to the drawings. Elements with the same reference numerals across drawings operate in the same way, and redundant description of them is omitted. The components described in this embodiment are merely examples and are not intended to limit the scope of the invention to them.

This embodiment describes a function that detects and tracks people in a moving image or in continuously shot still-image frames. The technique itself does not restrict the category of objects to be detected and tracked, but Embodiment 1 limits the targets to people. In this embodiment, people are detected in each of the temporally consecutive images, and tracking is achieved by associating, between consecutive images, which detected person is the same as which. In particular, this embodiment assumes shooting of a sporting event or the like, where people's clothing, movement directions, and so on are similar and people frequently come close to and cross each other. In such cases, simply associating people whose position or appearance features, such as clothing color, are close in each image tends to produce wrong associations. Such a failure is referred to here as a mismatch.

This embodiment focuses on the occlusion information output by a learned model that has learned the patterns of occlusion relationships that arise when objects overlap as seen from the camera. Using the occlusion relationship output by the learned model together with other cues when identifying object correspondences suppresses the failure of associating a person in front with a person behind, and improves tracking accuracy.

FIG. 1 illustrates this schematically. Images 2100, 2120, and 2140 in FIG. 1(A) are three still frames from a moving image, shot from above, of two playing cards with the same pattern crossing each other on a table (arranged in chronological order from left to right). Observing images 2100, 2120, and 2140 alone does not allow one to determine how each card moved between images. FIG. 1(B), in contrast, is an example of the same scene shot at a higher frame rate than in FIG. 1(A), that is, a group of images captured at shorter time intervals. By observing the images of FIG. 1(B) in chronological order, one can associate each card with its position in the next image. Overall, it can be inferred relatively easily that the left card 2201 passed over the right card 2202 and moved to the right.
As shown in images 2210 and 2230, observing the appearance that arises transiently at the moment the objects cross makes it possible to determine which card passes in front and which is behind in image 2220. This determination does not necessarily require a special sensor, such as one providing a 2.5-dimensional depth image, or information that is costly to generate, such as optical flow. It can be solved as a pattern recognition problem relating occlusion relationships to appearance features, that is, what kind of appearance tends to arise when objects overlap front to back. The same applies to scenes such as the crossing of people shown in FIGS. 1(C) and 1(D). In FIGS. 1(C) and 1(D), the people are assumed to have the same appearance (clothing, posture, and so on) and the same movement direction. Even in this case, by focusing on the appearance before and after the objects cross, it can be determined that in image 2420 of FIG. 1(D) person 2401 is in front and person 2402 is behind. If this occlusion relationship is maintained in subsequent images, then once person 2401 has occluded person 2402, the apparatus tracks the front person 2401 and retains the occlusion relationship and image features of the rear person 2402. When the occlusion is resolved, the rear person 2402 can be detected again while tracking of person 2401 continues. This concludes the overview of the principle of this embodiment. Detailed processing is described later.

FIG. 2 is a hardware configuration diagram of the information processing apparatus 1, which tracks a tracking target by image recognition in this embodiment. The CPU H101 controls the entire apparatus by executing a control program stored in the ROM H102. The RAM H103 temporarily stores various data from each component; programs are also loaded into it so that the CPU H101 can execute them. The storage unit H104 stores the data to be processed in this embodiment, including the data to be tracked. An HDD, flash memory, various optical media, or the like can be used as the medium of the storage unit H104. The input unit H105 consists of a keyboard, touch panel, dial, or the like, accepts input from the user, and is used, for example, when setting the tracking target. The display unit H106 consists of a liquid crystal display or the like and shows the subject and the tracking results to the user. The apparatus can also communicate with other devices, such as an imaging device, via the communication unit H107.

FIG. 3 is a block diagram showing an example of the functional configuration of the information processing apparatus. In FIG. 3, the processes executed by the CPU H101 are shown as functional blocks. The information processing apparatus 1 has an image acquisition unit 201, an object detection unit 202, an occlusion information generation unit 203, an extraction unit 204, and an association unit 205, and is connected to an external storage unit 206. The storage unit 206 may instead be inside the information processing apparatus 1. Each function is described briefly below. The image acquisition unit 201 acquires images of a moving image or continuous still images in which an imaging device has captured a specific object (a person in this embodiment). The object detection unit 202 detects, from the images acquired by the image acquisition unit 201, image features indicating a predetermined object set in advance. For example, it detects the region of a person in an image based on a learned model that has learned in advance, from images of people in various postures, the image features indicating a human body (head or torso). The occlusion information generation unit 203 estimates, based on a learned model that has learned image features indicating the occlusion relationship between an occluding object and an occluded object, occlusion information indicating, for each object detected from the image, its occlusion relationship with the other objects detected from the image. The occlusion information is a likelihood (the occluded/occluding score) representing the possibility that the object of interest is occluded by another object. For example, if an object is likely to be occluded by another object, its occluded/occluding score is brought close to 1; if an object is likely to be occluding another object, its score is brought close to 0. The occluded/occluding score indicating this relationship is estimated using the learned model. The extraction unit 204 stores, in the storage unit 206, the occlusion information for each object detected in an image. The association unit 205 associates objects detected across multiple images. That is, based on the occlusion information, it identifies, for each object detected in one image, its correspondence with an object detected in an image captured at a different time. Objects can be tracked by correctly associating the objects detected in images captured at different times. Furthermore, by assuming that the occlusion relationship between objects is maintained over a certain period, objects can be associated with each other using the occlusion relationship. The storage unit 206 stores the occluded score of each detected object. The processing of each functional unit is described in detail with reference to the flowchart of FIG. 4.

FIG. 4 is a flowchart showing the flow of processing in this embodiment. In the following description, each step is denoted with a leading S, and the word "step" is omitted. The information processing apparatus need not perform every step described in this flowchart. The processing shown in the flowchart of FIG. 4 is executed by the CPU H101 of FIG. 2, which is a computer, in accordance with a computer program stored in the storage unit H104.

In S301, the information processing apparatus 1 starts a loop that repeats for each moving-image frame. In S302, the image acquisition unit 201 sequentially acquires image frames of a moving image or continuous still images in which people have been captured. The subsequent processing, S301 to S311, is performed on each image in turn. The image acquisition unit 201 may acquire images captured by an imaging device connected to the information processing apparatus, or images stored in the storage unit H104. Moving-image frames 3100, 3110, 3120, 3130, and 3140 in FIG. 5(A) are examples of acquired image frames.

Next, in S303, the object detection unit 202 detects one or more predetermined objects in the acquired image based on the image features of the predetermined object (here, a person). Known techniques for detecting objects in an image include the method of Liu et al. (Liu, SSD: Single Shot MultiBox Detector. In: ECCV 2016). FIG. 5(A) shows the result of detecting candidate objects in the images. The rectangular frames 3101, 3102, 3103, 3111, 3112, 3113, 3121, 3122, 3131, 3132, 3141, and 3142 in FIG. 5(A) are bounding boxes (hereinafter, BBs) indicating the detected object regions.

In S304, the occlusion map generation unit 203 (the occlusion information generation unit 203 of FIG. 3) generates, for each image, an occlusion map that gives, region by region, occlusion information on whether the region is occluded (the occlusion relationship). For each image, the occlusion map generation unit 203 estimates the visible regions of occluded objects (occluded-object regions). Here, it determines whether each person overlaps another person and, if so, whether that person is behind or in front, and outputs the result for each region as a score (likelihood) of the occlusion state. This is a kind of semantic segmentation task and can be implemented using known methods such as that of Chen et al. (Chen, DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs, 2016).

FIG. 6(A) shows a schematic diagram illustrating the occlusion map generation process together with an example result. The neural network 402 determines, from an input image, the occlusion state of each pixel of that image. When the RGB image 401 is input, the neural network 402 estimates whether there is a person in the image and whether that person is occluded, and outputs the result as the occlusion map 404. In this map, an occluded score of 0 is output for regions estimated to be unoccluded people or non-person regions, and 1 for regions of occluded people. Darker regions in the occlusion map 404 indicate higher occluded scores; that is, black regions were estimated to be regions of occluded people. The neural network 402 has been trained in advance to produce such output for input images (training is described later). Note that the occlusion map 404 shown in the figure is an example of an ideal output as an estimation result.

As a derivative form, a 2.5-dimensional depth image 405 may additionally be acquired using a dedicated sensor or the like, alongside the RGB image 401. The depth image 405 is concatenated with the three-channel RGB image 401, and the resulting four-channel data is used as the image input for training and recognition in place of the RGB image. This can make the occluded-region information more accurate.
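The four-channel input can be pictured with a minimal sketch. It assumes images are nested lists with one tuple per pixel, and `stack_rgbd` is a hypothetical helper name; the source does not specify the data layout or any depth normalization.

```python
def stack_rgbd(rgb, depth):
    """Concatenate a 3-channel RGB image with a 1-channel depth image.

    rgb:   H x W nested list of (r, g, b) tuples
    depth: H x W nested list of depth values
    Returns an H x W nested list of (r, g, b, d) tuples.
    The layout is an assumption made for this illustration.
    """
    same_shape = len(rgb) == len(depth) and all(
        len(rgb_row) == len(depth_row) for rgb_row, depth_row in zip(rgb, depth)
    )
    if not same_shape:
        raise ValueError("RGB and depth images must have the same height and width")
    return [
        [tuple(pixel) + (d,) for pixel, d in zip(rgb_row, depth_row)]
        for rgb_row, depth_row in zip(rgb, depth)
    ]
```

A network taking this input would need its first layer to accept four input channels instead of three; everything downstream can stay as it is.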

Next, in S305, the information processing apparatus 1 starts the loop of S306 to S307, which is executed for each object detected in S303. In S305 to S308, the extraction unit 204 extracts, from the generated occlusion map, information indicating the occlusion relationship for each detected object and stores it in the storage unit 206. In S306, the extraction unit 204 extracts the information indicating the occlusion relationship for each detected object from the occlusion map. Specifically, it averages the occluded scores of the occlusion map 404 within the person detection frame 407 of FIG. 6(A). The closer this occluded score is to 1, the more likely the object is occluded; the closer it is to 0, the more likely the object is not occluded. To allow for displacement of the detection frame and noise in the occlusion map 404, the score is obtained as a weighted average that emphasizes the center of the frame, as shown in FIG. 6(B). Operation 412 in the figure denotes the element-wise product (Hadamard product) over each partial region of the image. Map 413 is a two-dimensional Gaussian function with its peak at the center and values summing to 1 over the image block (its height and width are scaled to fit the person detection frame).
Examples of the results are shown in FIG. 6(A) as occluded score values 409 and 410, labeled occ. The left detection frame is judged to have a high occluded score because the person is behind, and the right detection frame a low score because the person is in front. FIG. 5(B) (occlusion maps) and FIG. 5(C) (estimated occluded score of each frame) show example results of applying the above processing to the input images of FIG. 5. They show that, from the start to the end of the crossing, the occluded score of person 3101, who is behind, is relatively higher than that of person 3102.
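The center-weighted average of S306 can be sketched as follows. This is an illustration under assumptions, not the patented implementation: the map is a nested list of per-pixel scores in [0, 1], the box is given as (left, top, width, height) and lies inside the map, and the Gaussian spread `sigma_scale` is a hypothetical parameter that the source does not specify.

```python
import math


def gaussian_weights(height, width, sigma_scale=0.25):
    """2-D Gaussian peaked at the box center, normalized so the weights sum to 1.

    sigma_scale (fraction of the box size used as standard deviation) is an
    assumed value chosen for illustration.
    """
    cy, cx = (height - 1) / 2.0, (width - 1) / 2.0
    sy = max(height * sigma_scale, 1e-6)
    sx = max(width * sigma_scale, 1e-6)
    raw = [
        [math.exp(-0.5 * (((y - cy) / sy) ** 2 + ((x - cx) / sx) ** 2)) for x in range(width)]
        for y in range(height)
    ]
    total = sum(sum(row) for row in raw)
    return [[value / total for value in row] for row in raw]


def occluded_score(occlusion_map, box, sigma_scale=0.25):
    """Weighted average of the occlusion map inside one detection box.

    Multiplies the map crop by the Gaussian weights element by element
    (Hadamard product) and sums the result.
    """
    left, top, width, height = box
    weights = gaussian_weights(height, width, sigma_scale)
    return sum(
        occlusion_map[top + y][left + x] * weights[y][x]
        for y in range(height)
        for x in range(width)
    )
```

Because the weights sum to 1, the result stays in the same 0-to-1 range as the map, and a stray high value near the box edge moves the score less than the same value at the center.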

In S307, the storage unit 206 stores the occluded score of each detected object obtained by the extraction unit 204. It also stores information on the position and size of each detected object, and features relating to the object's appearance, such as color and texture histograms. These several kinds of quantities are collectively referred to here as the features of the detected object. Intermediate-layer information of a neural network or the like may also be used as appearance features (for example, Hariharan et al., Hypercolumns for Object Segmentation and Fine-grained Localization, in CVPR 2015).

In S308, the information processing apparatus 1 ends the loop that repeats for each image and the loop that repeats for each person detected in each image. For each image, this loop ends when occlusion information has been obtained for all people detected in that image. Next, in S309 to S310, the association unit 205 associates objects between the preceding and current images (this is skipped for the first moving-image frame, since there is no past frame). First, in S309, the association unit 205 acquires the features of past objects stored in the storage unit 206: occluded score, position and size, and appearance features. Next, in S310, the association unit 205 associates the objects detected in the past moving-image frame with the objects detected in the frame currently being processed.

FIG. 7 shows the detailed processing flow of the association unit 205 in S310. In S501, the association unit 205 first forms all possible pairs between the objects detected in the current frame and those detected in the previous frame. If n people and m people were detected in the two frames, n × m pairs are generated in total. Next, in S502, the association unit 205 computes a similarity for every object pair. An index based on the differences between the features of the detected objects can be used as the similarity. As an example, the similarity between a past detected object c_1 and a current detected object c_2 is computed as follows.
(Formula 1)
\[ L(c_1, c_2) = -W_1 \lVert BB_1 - BB_2 \rVert - W_2 \lVert f_1 - f_2 \rVert - W_3 \lVert occ_1 - occ_2 \rVert \]
Here, BB is a vector of four variables for each object (center x coordinate, center y coordinate, width, height), and f denotes the features of each object. \lVert x \rVert is the L^p norm of x. occ is the occluded score of each object. W_1, W_2, and W_3 are balance coefficients of 0 or more, each tuned empirically or by machine learning. The spread of each feature may be determined statistically in advance and used to normalize the features. Even when objects cross, the apparatus keeps tracking the person who occludes the other. When the occluded person becomes visible in the image again, an attempt to associate that person with the person who was occluding just before makes the occluded-score difference large, so the third term of Formula 1 becomes strongly negative and the computed similarity is low. In other words, this processing makes people with different occlusion relationships less likely to be matched and suppresses tracking mismatches.
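Formula 1 can be written out as a short sketch. The dictionary keys `bb`, `f`, and `occ`, the default weights of 1.0, and the choice p = 2 are assumptions made for illustration; the source leaves the weights and the norm order open.

```python
def lp_distance(a, b, p=2):
    """L^p norm of the difference between two equal-length sequences."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)


def similarity(obj1, obj2, w1=1.0, w2=1.0, w3=1.0, p=2):
    """Similarity of Formula 1: negated weighted sum of feature distances.

    Each object is a dict with (hypothetical) keys:
      "bb":  (center_x, center_y, width, height)
      "f":   appearance feature vector
      "occ": scalar per-object occlusion score in [0, 1]
    """
    return -(
        w1 * lp_distance(obj1["bb"], obj2["bb"], p)
        + w2 * lp_distance(obj1["f"], obj2["f"], p)
        + w3 * lp_distance([obj1["occ"]], [obj2["occ"]], p)
    )


def pairwise_similarity(past, current, **kwargs):
    """n x m table of similarities between past and current detections."""
    return [[similarity(c1, c2, **kwargs) for c2 in current] for c1 in past]
```

With identical position and appearance, two detections differ only in the third term, so a front person and a rear person receive a lower similarity than two detections with the same occlusion state. Setting `w3=0.0` removes that cue and reproduces the ambiguity the text describes.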

Next, in S503, the matching unit 205 performs association (matching) to identify the correspondence between objects based on the similarity between past objects and current objects. Several matching methods exist, for example matching candidates preferentially in descending order of similarity, or using the Hungarian algorithm. Here, the former is used.

In S503, if association has not been completed for all objects in the current frame, the matching unit 205 associates pairs as the same person in S506, starting from the pair with the highest similarity. Objects belonging to a pair that has already been associated are removed from the association candidates. If, during this processing, the highest similarity among the remaining pairs falls below a predetermined threshold, no similar object pairs remain. In that case, association ends without forcing any further matches (S505).
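The similarity of Formula 1 and the greedy association of S503 to S506 can be sketched as follows. This is a minimal illustration, not the implementation of the embodiment: the dictionary keys `bb`, `f`, and `occ`, the function names, the default weights, and the choice p = 2 are all assumptions made for the example.

```python
def lp_norm(a, b, p=2):
    """L^p norm of the difference between two equal-length vectors."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)


def similarity(c1, c2, w=(1.0, 1.0, 1.0), p=2):
    """Formula 1: negative weighted sum of box, feature and shielded-score distances."""
    w1, w2, w3 = w
    return -(w1 * lp_norm(c1["bb"], c2["bb"], p)
             + w2 * lp_norm(c1["f"], c2["f"], p)
             + w3 * abs(c1["occ"] - c2["occ"]))


def greedy_match(prev_objs, curr_objs, threshold, w=(1.0, 1.0, 1.0)):
    """Associate pairs in descending order of similarity; stop below the threshold."""
    pairs = sorted(
        ((similarity(a, b, w), i, j)
         for i, a in enumerate(prev_objs)
         for j, b in enumerate(curr_objs)),
        reverse=True,
    )
    used_prev, used_curr, matches = set(), set(), []
    for score, i, j in pairs:
        if score < threshold:
            break  # no sufficiently similar pair remains, so stop without forcing a match
        if i in used_prev or j in used_curr:
            continue  # one of the two objects has already been associated
        matches.append((i, j))
        used_prev.add(i)
        used_curr.add(j)
    return matches
```

In this sketch, two people with identical appearance and nearly identical boxes are still separated by the third term when their shielded scores differ, which is the situation the paragraph above describes.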

The above processes S301 to S311 are performed for each moving image frame. As a result, as shown in the example in FIG. 5(D), people are detected in the moving image, and a series of tracking results indicating where each object has moved is obtained (the tracking result is shown by labeling the same person across frames with the symbols A, B, and C).

<Modification example>
In this embodiment, the matching similarity between object pairs is based on differences: the distances of the individual indices, such as the shielded score and appearance, are combined as a weighted sum. Here, it is also conceivable to use, for example, the KL divergence. It is also conceivable to perform metric learning to obtain a more accurate distance index. Furthermore, instead of using a single similarity only once, rule-based or stepwise methods are conceivable, for example first judging similarity by appearance features and then judging the candidates that satisfy the condition by the similarity of their shielding-state scores. It is also possible to use a known classifier such as a neural network or a support vector machine, learning and classifying with the feature amounts as explanatory variables and the result of whether the two are the same object as the objective variable, and to judge matching from the resulting value. As described above, the association of objects between frames is not limited to a specific form.

As yet another derivative form, in order to stabilize the estimated value of the shielded score, it is conceivable to use a moving average of past scores, as in the following equation.
(Formula 2)
$$\mathrm{occ}^{\mathrm{EMA}}_{(t)} = (1-\alpha)\,\mathrm{occ}^{\mathrm{EMA}}_{(t-1)} + \alpha\,\mathrm{occ}_{(t)}$$
The value given by the above equation is called an exponential moving average. occ^EMA_(t) is the exponential moving average of the shielded score at time t, occ_(t) is the shielded score at time t, and α is a coefficient satisfying 0 < α ≤ 1. For an object that has been tracked over multiple past frames, the exponential moving average is calculated with the above equation, and when similarities are compared, the exponential moving average shielded score is used instead of the original shielded score. As a result, when the overlapping state develops gradually over multiple frames during a crossing, matching can be based on the average of the shielded scores over those frames, so the association between objects becomes more stable.
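The update of Formula 2 can be sketched as below. The function names are hypothetical, and initializing the average with the first observed score is an assumption, since the text does not state how the average is started.

```python
def ema_update(prev_ema, occ_t, alpha):
    """Formula 2: exponential moving average of the shielded score."""
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must satisfy 0 < alpha <= 1")
    if prev_ema is None:
        # Assumption: the first observation initialises the average.
        return occ_t
    return (1.0 - alpha) * prev_ema + alpha * occ_t


def ema_series(scores, alpha):
    """Apply the update to a sequence of per-frame shielded scores."""
    ema, out = None, []
    for s in scores:
        ema = ema_update(ema, s, alpha)
        out.append(ema)
    return out
```

With α = 1 the average reduces to the raw per-frame score; smaller values of α weight the past more heavily and smooth abrupt changes during a crossing.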

As yet another derivative form, matching may use not only the similarity between consecutive frames but also the feature amounts and positions from multiple past frames up to n steps back. With this method, even if an object is shielded and cannot be tracked in some frame, tracking can resume once the shielding is resolved in a later frame. In this form, for example, the average of the object's feature amounts is obtained by going back n frames, and the similarity to an object detected in the current frame is calculated from that average. Alternatively, the similarity between the object in each of the past n frames and the object in the current frame is obtained, and the current object is associated with the object for which the average of the resulting n similarities is highest. It is also conceivable to make the judgment bidirectionally by using the results of n future frames in addition to the past. Because the result is not known until the future frames have been processed, this form is inferior in real-time performance, but it is more accurate than a method that looks only at the past.
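The second alternative above, averaging the similarities to each of the past n frames, might look like the following sketch. The negative Euclidean distance used as the per-frame similarity, the `deque`-based history, and the function names are assumptions for illustration only.

```python
from collections import deque


def neg_l2(a, b):
    """Per-frame similarity: negative Euclidean distance between feature vectors."""
    return -(sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5)


def best_track_by_mean_similarity(histories, feat):
    """histories maps a track id to a deque holding its last n feature vectors."""
    best_id, best_score = None, float("-inf")
    for track_id, past in histories.items():
        if not past:
            continue  # nothing recorded yet for this track
        score = sum(neg_l2(p, feat) for p in past) / len(past)
        if score > best_score:
            best_id, best_score = track_id, score
    return best_id, best_score


# Example: keep at most n = 3 past frames per track.
histories = {
    "A": deque([[0.0, 0.0], [0.0, 1.0]], maxlen=3),
    "B": deque([[5.0, 5.0], [5.0, 6.0]], maxlen=3),
}
best_id, best_score = best_track_by_mean_similarity(histories, [0.0, 0.5])
```

A `deque` with `maxlen=n` discards the oldest entry automatically, so each track keeps only its n most recent feature vectors.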

As yet another derivative form, a form for coping with detection failures is conceivable. In object detection and tracking, detection may fail temporarily at the object detection stage, for example because the posture of an object has changed into an unusual shape. When such a missed detection occurs, an object that existed in the previous frame is judged to have no counterpart in the current frame when frames are associated, and tracking is interrupted at that point. To prevent such failures, the following measure is possible. When an unmatched person arises in matching, that person's information is stored in a list and added to the association candidates when the next frame is matched (if the person is still unmatched after a certain time has elapsed, the object itself is judged to no longer exist and is removed from the list; this is called timeout processing here).
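The list with timeout processing described above can be sketched as a small container. The class name `LostList`, its method names, and the choice to count the timeout in frames are hypothetical; the text only specifies that entries are dropped after "a certain time".

```python
class LostList:
    """Keeps unmatched tracks as extra association candidates until they time out."""

    def __init__(self, timeout_frames):
        self.timeout = timeout_frames
        self.lost = {}  # track id -> [stored object info, frames waited]

    def add(self, track_id, obj):
        """Store an object that found no counterpart in the current frame."""
        self.lost[track_id] = [obj, 0]

    def candidates(self):
        """Objects to add to the association candidates for the next frame."""
        return {tid: entry[0] for tid, entry in self.lost.items()}

    def resolve(self, track_id):
        """Remove a track that has been matched again."""
        self.lost.pop(track_id, None)

    def tick(self):
        """Advance one frame and drop entries that have waited longer than the timeout."""
        expired = []
        for tid, entry in self.lost.items():
            entry[1] += 1
            if entry[1] > self.timeout:
                expired.append(tid)
        for tid in expired:
            del self.lost[tid]
        return expired
```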

As described above, there are various conceivable ways of associating objects across moving image frames, and the association is not limited to a specific form.

<Variations of the form of shielding information and learning methods>
In this embodiment, the visible region of a shielded object (the shielded object region) is estimated as the shielding map. Various derivative forms are conceivable here as well. An example is shown in FIG. 8. As shown in FIG. 8(B), forms other than estimating the visible region of the far-side object, as in image 801, may be used. For example, the entire region of the far-side object is estimated, as in image 802. Alternatively, the region of the shielding object on the near side is estimated, as in image 803 (in the figure, an object that does not overlap any other object, such as region 440, is also included in the near-side region; as another form, such an isolated object may be excluded from the near-side region). Instead of estimating a foreground region as in images 801 to 803, it is also conceivable to estimate the position of the center or center of gravity of each object, as in image 804. In the case of image 804, a large positive value is estimated in the region near the center of a shielded object, and a small negative value is estimated near the center of a shielding object (the central region of an object here may, for example, be estimated as a Gaussian-shaped region, as illustrated).

Here, a method of learning the shielding state information is described with reference to FIG. 6(D). The neural network 402, as given by the above-mentioned method of Chen et al. and the like, outputs a shielded score map 403 for the RGB image 401 that is the input image. An example of the result 403 is shown as 430. The method of Chen et al. estimates the foreground region of objects of a specific category, but here a teacher value 431 of the shielding information is given, and the neural network 402 is trained so that its estimate yields a map similar to the teacher value 431. Specifically, the output map 403 is compared with the teacher value 431, and the loss value calculation 432 is performed by a known method such as cross entropy or squared error. The weight parameters of the neural network 402 are adjusted by error backpropagation or the like so that the loss value gradually decreases (this processing is the same as in the method of Chen et al., so details are omitted). A sufficient quantity of input images and teacher values must be provided. Because creating teacher values for regions of overlapping objects is costly, it is also conceivable to create training data using CG, or by an image composition method that cuts out object images and superimposes them. The above is the learning method.
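The loss value calculation 432 can be illustrated with the two losses named above. This is a sketch only: it assumes the output map and the teacher value are 2-D lists of equal size with values in [0, 1], and the function names and the clipping constant `eps` are not from the source.

```python
import math


def binary_cross_entropy(pred_map, teacher_map, eps=1e-7):
    """Mean per-pixel cross entropy between an output map and its teacher value."""
    total, n = 0.0, 0
    for pred_row, teacher_row in zip(pred_map, teacher_map):
        for p, t in zip(pred_row, teacher_row):
            p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
            total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
            n += 1
    return total / n


def mean_squared_error(pred_map, teacher_map):
    """Mean per-pixel squared error between an output map and its teacher value."""
    total, n = 0.0, 0
    for pred_row, teacher_row in zip(pred_map, teacher_map):
        for p, t in zip(pred_row, teacher_row):
            total += (p - t) ** 2
            n += 1
    return total / n
```

Either loss decreases as the output map approaches the teacher map, which is the quantity the weight adjustment drives down.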

Furthermore, in this embodiment, an index called the shielded score is obtained by acquiring values within the frame of each object obtained above. Various examples of how the shielding information may be acquired are shown in FIG. 8(C). FIG. 8(C1) is the form used in this embodiment. Various other forms are conceivable, such as (C2) acquiring, within the object frame, the difference between the far-side shielded score and the near-side shielded score, or (C3) referring to the score at a single point at the center of the object. When acquiring values within the frame of an object, it is also conceivable to exclude regions that overlap other object frames, because it is unclear which object such a region belongs to.
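The acquisition variants of FIG. 8(C) can be sketched as follows. The details are assumptions: the score map is taken to be a 2-D list indexed as `score_map[y][x]`, the box is given as hypothetical pixel bounds `(x0, y0, x1, y1)` with exclusive upper bounds, and "acquiring within the frame" is interpreted here as taking the mean inside the box.

```python
def box_mean(score_map, box):
    """Mean of the map inside the box; one possible reading of the in-frame acquisition."""
    x0, y0, x1, y1 = box
    vals = [score_map[y][x] for y in range(y0, y1) for x in range(x0, x1)]
    return sum(vals) / len(vals)


def box_mean_difference(far_map, near_map, box):
    """(C2): difference between the far-side and near-side scores inside the box."""
    return box_mean(far_map, box) - box_mean(near_map, box)


def center_value(score_map, box):
    """(C3): score at the single pixel at the center of the box."""
    x0, y0, x1, y1 = box
    return score_map[(y0 + y1) // 2][(x0 + x1) // 2]
```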

Furthermore, a method of performing the above-mentioned <estimation of the shielding state> and <acquisition of the shielded score of each object> simultaneously is also conceivable. One example is the technique called anchors, a known method used in the method of Liu and others. Because this technique yields a set of candidate frames for objects, it can be used to estimate, for each candidate frame, a shielded score indicating whether it is a shielding object or a shielded object, and to perform association with it (this form is described in detail in Embodiment 3, so the description is omitted here).

Furthermore, shielding information in each of the multiple forms shown above may be acquired and used as a multidimensional shielding feature for the subsequent association of objects. Alternatively, the proportion of the object's area that is shielded may be estimated by machine learning from this multidimensional shielding feature and used. In that case, the multidimensional shielding feature is the explanatory variable, the proportion of the object's area that is shielded is the target variable, and regression estimation is performed with a known technique such as logistic regression.

<Embodiment 2>
In this embodiment, people are detected and tracked as in Embodiment 1. The hardware configuration is the same as in FIG. 2 of Embodiment 1. FIG. 9(A) is a block diagram showing an example of the functional configuration of this embodiment. A shielding state determination unit 301 is newly added to the configuration of Embodiment 1. In Embodiment 1, a person's frame may fail to be detected during tracking because people overlap. For example, as shown in the moving image frame 4120 in FIG. 10(A), when the overlapping area between people is large, the person on the far side often cannot be detected. In such cases, the shielding state determination unit 301 determines that the person exists but is in the shielded state.

Unlike a missed detection caused by a temporary detection failure of the object detection unit, as described in Embodiment 1, the undetected state continues for a long time when, for example, a group of people is moving in the same direction at the same speed. In addition, the on-screen position at which the object leaves the shielded state may be far from the position at which the shielded state began. For this reason, it is desirable to raise the tracking success rate by actively determining that the object is in the shielded state and performing processing according to the estimated state.

The overall processing flow of this embodiment is the same as in FIG. 4 of Embodiment 1, but the details of the processing in S310 differ as follows. Only the processing of S310 that differs from Embodiment 1 is described here. The detailed flow of the S310 processing performed by the shielding state determination unit 301 is described with reference to FIG. 11. First, as before, objects are associated between the current frame and the previous frame in S601. If S602 finds an object of the previous frame that was not associated, that object may have entered the shielded state. S603 therefore checks whether the object's shielded score up to that point is at or above a threshold. This is because, if the frame rate of the moving image is sufficiently high, the shielded score is often high immediately before and after the object becomes undetected due to shielding. Further, S604 checks whether the number of objects detected in the current frame has decreased in the region around the object. If both results are true, the object is estimated to have entered the shielded state and is stored in the list of shielded objects (S605). For each object stored in the list, the feature amount and position from the last time it was detected are stored as well. This increases the possibility that the object can be tracked when the shielding is resolved and it is detected again.

S606 to S610 are the processing for determining whether an object in the shielded state has reappeared. If there is an object of the current frame that was not associated in S603, that object may have left the shielded state and become detectable again. S607 therefore checks whether the object's shielded score is at or above a threshold. Further, S608 checks whether the number of objects detected in the current frame has increased in the region around the object. If both results are true, and the similarity between the object and any of the objects stored in the list of shielded objects is at or above a predetermined threshold (S608), the object is estimated to have left the shielded state and reappeared. The associated object is then removed from the list of shielded objects (S609). For an object removed from the list, the feature amount and position detected from the current input image are acquired.
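The two judgments above, entering the shielded state (S603 to S605) and reappearing from it (S607 to S609), can be sketched as a pair of predicates. The function and parameter names are hypothetical, and how the "peripheral region" detection counts are obtained is left outside the sketch.

```python
def entered_shielded_state(last_occ, nearby_prev, nearby_curr, occ_threshold):
    """An unmatched previous-frame object is judged shielded when its shielded score
    was high and fewer objects are now detected around it."""
    return last_occ >= occ_threshold and nearby_curr < nearby_prev


def reappeared(curr_occ, nearby_prev, nearby_curr, best_similarity,
               occ_threshold, sim_threshold):
    """An unmatched current-frame object is judged to have reappeared when its shielded
    score is high, more objects are detected around it, and it is sufficiently similar
    to some object stored in the list of shielded objects."""
    return (curr_occ >= occ_threshold
            and nearby_curr > nearby_prev
            and best_similarity >= sim_threshold)
```

Here `best_similarity` stands for the highest similarity between the unmatched object and the entries of the shielded list, computed with whatever similarity the association step uses.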

Here, the association processing may be refined in ways such as the following: when matching objects between frames, the distance-based similarity penalty is reduced for matches with a person in the shielded state; the timeout for waiting for reappearance is made longer; and the threshold for association with an object in the shielded state is set lower than the threshold for matching between ordinary objects.

Furthermore, although the description here assumes two overlapping people, overlaps may also occur among three or more people. In that case, each object judged to have entered the shielded state is added to the list of shielded objects, and each time one reappears it is associated with the previous frame and removed from the list. This makes it possible to handle three or more people to some extent.

<Embodiment 3>
In this embodiment, a form of tracking a single object specified by the user is described. Here the tracking target is not limited to a specific category such as the human body; the form handles tracking of an unspecified object designated by the user. For example, the target may be an animal such as a dog, or a vehicle such as a car.

The functional block diagram is FIG. 9(B). A tracking object designation unit 302 is newly added to the configuration described so far. The functions of the tracking object designation unit 302 and the object detection unit 202 can be realized easily using the method of Non-Patent Document 1. The tracking object designation unit 302 is the functional unit with which the user specifies the frame position of the object to be tracked in a moving image frame. This initializes the features of the object to be tracked. The object detection unit 202 identifies, in each moving image frame, the image region that best matches the target object. An example of the identified results is shown in FIG. 12(A). The frame 5111 on the moving image frame 5110 in FIG. 12(A) is the frame of the tracking object specified by the user. In the moving image frame 5120 the object has moved to the right on the screen and is detected by the object detection unit 202 as the frame 5121. The method of Non-Patent Document 1 is an excellent object tracking method, but erroneous switches occur easily between similar objects. Therefore, in this embodiment, as in the preceding embodiments, information on the shielding state of the tracking object is estimated, and it is determined whether an erroneous switch has occurred.

For this purpose, a neural network 6300 as shown in FIG. 13(B) is used as the shielding information generation unit 203. This is a classifier that, given the image 6301 of the detected tracking object (here, the aspect ratio is normalized to a square image to simplify processing), examines the image pattern and outputs the classification result 6302 of whether the object is shielded (Yes) or not (No). The presence or absence of shielding is defined by whether at least a given percentage of the object's area is shielded. These two class values are given as teacher values to train the neural network 6300. This technique is a widely known method, the same as an ordinary image classification task, so details are omitted. If, instead of the binary presence or absence of shielding, the teacher value (target variable) is given as the proportion of shielded area and regression learning is performed, the proportion of shielding can be estimated, as in the estimation result 6303. For this regression learning, the squared error or the like is used as the loss value during training.

FIGS. 14(A) and 14(B) show the results of estimating the degree of shielding of the tracking object candidates with the shielding information generation unit 203. For the object detection results of the object detection unit 202 shown in FIG. 14(A), the values labeled occ in FIG. 14(B) are the estimated shielded areas. In this figure, the fluctuation of the shielded score is smaller than a predetermined value (for example, 0.3), so it can be determined that tracking has not failed (here, the success or failure of tracking may be judged using not only the shielded score but also the similarity of position and appearance feature amounts as used in Embodiment 1).

In FIG. 14(C), on the other hand, the object 7201 passes behind the object 7202 between the moving image frames 7220 and 7230, and as a result the object detection unit 202 erroneously estimates the position of the object in the moving image frame 7230 as the frame 7231. In this case, the shielded score changes sharply from 0.4 to 0.0, as shown in FIG. 14(D), so it can be determined that erroneous tracking has occurred due to the crossing. Once erroneous tracking is known to have occurred, measures can be taken such as stopping detection at that point or correcting the result at a later stage.
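The fluctuation check described above can be sketched as follows. The function name is hypothetical, and treating a frame-to-frame change at or above the predetermined value as a failure is an assumption drawn from the statement that a smaller change indicates success; 0.3 is the example value given in the text.

```python
def first_tracking_failure(occ_scores, max_change=0.3):
    """Return the index of the first frame whose shielded score differs from the
    previous frame by max_change or more, or None if no such frame exists."""
    for t in range(1, len(occ_scores)):
        if abs(occ_scores[t] - occ_scores[t - 1]) >= max_change:
            return t
    return None
```

For a score sequence ending in the change from 0.4 to 0.0 described for FIG. 14(D), the function returns the index of the frame with score 0.0.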

The above is the description of this embodiment.

As shown in 5110 to 5150 of FIG. 12(A), it is desirable to train the shielding information generation unit 203 on the shielding states of a wide variety of objects, so that the shielding state can be determined for unspecified objects.

As another derivative form: FIG. 13(B) shows the image 6301, cropped to the object frame, as the input image. However, because observing the surroundings of the object as well as the object itself is important for determining whether it is in the shielded state, it is also conceivable to input a wider range as the input image (in that case, the same range is cropped and input at estimation time as well).

As yet another derivative form, as shown in FIG. 13(C), object detection and estimation of the degree of shielding may be performed simultaneously using candidate frames called anchors, as in the above-mentioned method of Liu. The anchor frames are a set of candidate frames with multiple sizes and aspect ratios, as shown in FIG. 13(D) (three types of anchor frame are illustrated here). As shown in the result image 6450 of FIG. 13(C), the anchor frames are placed in each block region of the image. When an image is input, the neural network 6400 generates a shielded score map 6430 indicating whether the object is present at each anchor of each block region. The shielded score map 6430 consists of three maps corresponding to the three types of anchor frame. An example of the estimation result is shown as 6450 in FIG. 13(C) (this technique is widely known; see the above-mentioned method of Liu for details).

Here, as a derivative form of this embodiment, a shielded score map 6440 of the object is generated at the same time as the estimation of whether an object is present. This map estimates, for each anchor frame, what the shielded proportion would be if an object were present there. This map also consists of three maps corresponding to the number of anchor types (during training, the neural network 6400 is trained by giving each anchor frame in each block of the image a teacher value of the shielded score). An example of the result is shown as 6460 in FIG. 13(C). An example of the final integration of the two estimated maps is illustrated as the integration result example 6470.

The above description is an example for object detection, but because the method of Non-Patent Document 1 is also based on anchor candidate frames, a derivative form can be constructed that estimates the shielded score of an object at the same time as tracking it.

<Embodiment 4>
In this embodiment, a form of tracking a single object specified by the user is described. The functional block diagram is FIG. 9(B), the same as in Embodiment 3. In the embodiments so far, objects were associated between frames on a rule basis when comparing similarity, for example by comparing feature amounts between the immediately preceding and immediately following frames, or by comparing over the n frames before and after. In this embodiment, this part is replaced with machine learning to achieve more accurate association.

A recurrent neural network is a technique that can process time-series data to perform identification, classification, and the like. A representative method is the long short-term memory network (hereinafter LSTM), known from the method of Byeon et al. (Byeon et al., Scene labeling with LSTM recurrent neural networks, CVPR 2015). With this method, changes in the features of an object over time can be discriminated and objects can be associated with each other. FIG. 15 shows a schematic diagram of the configuration of this embodiment and an example of the results. Here, one object 9102 is designated as the tracking target and is tracked by a method such as that of Bertinetto et al. (an erroneous switch occurs in the video frame at t = 2). The LSTM units 9501 to 9504 in FIG. 15(C) receive the features 9401 to 9404 of the object being tracked at each time, determine whether the tracking is succeeding or failing, and output the result as outputs 9701 to 9704. Although several LSTM units are drawn in the figure, they do not represent separate units; they show the state of one and the same unit at each time. The LSTM unit at each time sends a recurrent input 9802 to the LSTM unit at the next time.
The LSTM unit changes its internal state as necessary based on the features of the object at that time and the recurrent input 9802, which carries the past information up to that point. This makes it possible to determine whether the current tracking is successful while taking into account how the pattern of the object has changed over time.
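The recurrent judgment described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the embodiment: the class name `TinyLSTM`, the dimensions, the sigmoid readout, and the randomly initialized (untrained) weights are all hypothetical. The pair `(h, c)` plays the role of the recurrent input 9802, each element of `features` corresponds to one of the features 9401 to 9404, and each returned value corresponds to one of the outputs 9701 to 9704.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


class TinyLSTM:
    """One LSTM unit reused at every time step (hypothetical, untrained)."""

    def __init__(self, in_dim, hid_dim, seed=0):
        rng = random.Random(seed)
        n = in_dim + hid_dim

        def mat():
            return [[rng.uniform(-0.1, 0.1) for _ in range(n)]
                    for _ in range(hid_dim)]

        # Input, forget, output gates and candidate cell values.
        self.W = {name: mat() for name in "ifog"}
        self.b = {name: [0.0] * hid_dim for name in "ifog"}
        # Linear readout mapping the hidden state to a success score.
        self.w_out = [rng.uniform(-0.1, 0.1) for _ in range(hid_dim)]
        self.hid_dim = hid_dim

    def step(self, x, h, c):
        z = list(x) + list(h)  # current feature + recurrent state

        def gate(name, act):
            return [act(sum(w * v for w, v in zip(row, z)) + b)
                    for row, b in zip(self.W[name], self.b[name])]

        i = gate("i", sigmoid)
        f = gate("f", sigmoid)
        o = gate("o", sigmoid)
        g = gate("g", math.tanh)
        c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c, i, g)]
        h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
        return h, c

    def run(self, features):
        """Return one tracking-success probability per time step."""
        h = [0.0] * self.hid_dim
        c = [0.0] * self.hid_dim
        outputs = []
        for x in features:
            h, c = self.step(x, h, c)  # (h, c) is passed to the next time
            outputs.append(sigmoid(sum(w * v for w, v in zip(self.w_out, h))))
        return outputs


model = TinyLSTM(in_dim=3, hid_dim=4)
probs = model.run([[0.1, 0.2, 0.3]] * 4)  # four frames, t = 1..4
```

With untrained weights the probabilities carry no meaning; they only become a success/failure judgment after the weights are adjusted against teacher values.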

As the feature quantity input to the LSTM unit, the feature quantity of the neural network described in Embodiment 3 or the like can be used. For example, the input value to the final layer 6320 of the neural network 6310 in FIG. 13(B), which determines the shielded score of an object, is used. Note that the input value of that layer (a multidimensional vector) is used rather than its output value (a single scalar). The purpose is to bring a wide variety of shielding-related information into the LSTM by using the same multidimensional features that were used to judge the shielding state. This can be expected to allow various shielding patterns to be determined.

When training the LSTM unit, whether the tracking at each moment is succeeding or failing is given as the teacher value, and each weight parameter of the LSTM is adjusted. As another form, as shown in FIG. 16(D), if training is performed with a teacher value indicating whether or not the object is in a shielded state, instead of the success or failure of tracking, the unit can also be made to determine whether or not the object is shielded.

As yet another form, as in the derived form described in Embodiment 3, the tracked object may be detected on an anchor-frame basis, and the feature quantity 6420 of the neural network 6400 in FIG. 13(C), which simultaneously determines the position and the shielded score of an object, may be used as the feature 9401. With this form, tracking or detection of an object and determination of its shielded score can be performed simultaneously and at high speed.

<Embodiment 5>
In this embodiment, a mode for tracking a single object specified by the user will be described. The basic functional configuration is the same as that of Embodiment 1. In this embodiment, relative near-far (perspective) information between objects is used as the shielding information of the objects.

An example is shown in FIG. 16(A1). Here, an RGB image 801 is prepared as a training image. In addition, a distance image 833 corresponding to the RGB image 801 has been obtained by a laser range finder, stereo measurement, or the like. The distance image 833 represents the absolute distance from the camera in grayscale, where a whiter color means a nearer object. In the present embodiment, the weights of the neural network 402 are learned with the RGB image 801 as the input image and the distance image as the teacher value 831. However, obtaining an output result 830 that exactly matches the distance image 831 in absolute value is a relatively difficult pattern recognition problem, and the shielding information used in this embodiment does not need to be that accurate. Therefore, in this embodiment, training is performed so as to estimate the relative near-far relationship between neighboring objects.

For example, as shown in the output result 830 of the same figure, the estimated distances 8301 and 8302 for the persons 8011 and 8012 are not correct as absolute values. The estimation results 8301 and 8303, which correspond to the person 8011 and the distant person 8013, also have an incorrect near-far relationship. However, if attention is restricted to the near-far order of the two neighboring persons 8011 and 8012, the result is correct. In this way, the network is trained so that the <near-far order relationship> <between local objects> is estimated correctly, and these estimates are aggregated and used as the shielding information of the objects.

The above is realized by devising the loss calculation during training as follows. FIG. 16(A2) shows a teacher value region 831a, which is an enlargement of the region near the symbol * on the teacher value 831 of FIG. 16(A1). The corresponding output result region 830a is also shown. Attention is paid to each pair of a pixel i and a pixel j in the region 831a, and the loss of the pixel pair is determined by whether or not their near-far relationship is estimated correctly. In this example, the near-far relationship between the pixel i and the pixel j in the region 830a matches the teacher value, so no loss occurs. In contrast, if the estimation result were like the region 830b, the near-far relationship would be incorrect and a loss would be counted. This judgment is made for all pixel pairs within a predetermined distance of each other. Finally, the number of pairs whose near-far relationship is wrong, divided by the total number of pairs, is taken as the total loss value. The output result 834 of the distance estimated by the neural network 802 after training in this way is shown in FIG. 16(B).
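The pair-counting loss described above can be sketched as follows. This is an illustration under stated assumptions: the function name `ordinal_pair_loss`, the square window used for "within a predetermined distance" (`max_dist`), and the choice to skip pairs whose teacher values are equal are not taken from the source. Note also that a pure count like this is piecewise constant in the network output, so gradient-based training would presumably need a differentiable surrogate; the sketch only shows the quantity being counted.

```python
def ordinal_pair_loss(pred, teacher, max_dist=1):
    """Fraction of nearby pixel pairs whose near/far order disagrees
    with the teacher distance image.

    pred, teacher: 2-D lists of distance values with the same shape.
    max_dist: pairs are compared only if both |dx| and |dy| <= max_dist
              (assumed definition of "within a predetermined distance").
    """
    h, w = len(teacher), len(teacher[0])
    wrong = 0
    total = 0
    for y in range(h):
        for x in range(w):
            for dy in range(0, max_dist + 1):
                for dx in range(-max_dist, max_dist + 1):
                    # Visit each unordered pair exactly once.
                    if dy == 0 and dx <= 0:
                        continue
                    y2, x2 = y + dy, x + dx
                    if not (0 <= y2 < h and 0 <= x2 < w):
                        continue
                    t = teacher[y][x] - teacher[y2][x2]
                    p = pred[y][x] - pred[y2][x2]
                    if t == 0:
                        continue  # assumption: no order to learn from ties
                    total += 1
                    if t * p <= 0:  # order reversed (or not separated)
                        wrong += 1
    return wrong / total if total else 0.0


teacher = [[1, 2], [3, 4]]
same_order = [[10, 20], [30, 40]]   # wrong absolute values, correct order
reversed_order = [[4, 3], [2, 1]]
one_swap = [[1, 2], [4, 3]]
```

In this example `same_order` incurs no loss even though its absolute values are far from the teacher, which is the behavior the paragraph asks for.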

Next, the relative-distance output result 834 is aggregated to obtain the shielded likelihood of each object. Here, the person detection frames 8351 and 8352, which have been detected separately, are used to aggregate the result for each detection frame. The distance values within each frame are averaged to give $d^{ave}_1$ and $d^{ave}_2$. These distance values are then compared between adjacent object frames and normalized to convert them into a shielded-likelihood score occ, for example by the following formula.
(Formula 3)
$$occ_i = \mathrm{Sigmoid}\left(\log\left(d^{ave}_i / d^{ave}_j\right)\right) = \frac{1}{1 + d^{ave}_j / d^{ave}_i},$$
$$occ_j = \frac{1}{1 + d^{ave}_i / d^{ave}_j},$$
where
$$\mathrm{Sigmoid}(x) = \frac{1}{1 + \exp(-x)}.$$
Here, i and j are two adjacent detected object frames that have an overlapping portion. When three or more objects overlap, the shielded score $occ_i$ may be obtained with the above formula for each overlapping partner, and the maximum of those values may be used as the shielded score of the object.
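Formula 3 and the aggregation per detection frame can be sketched as follows. The function names, the `(x0, y0, x1, y1)` box convention with exclusive upper bounds, and the score of 0.0 for a frame that overlaps no other frame are assumptions made for illustration. The sketch also assumes that the estimated values are positive and that a larger value means a farther object.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def box_mean(depth, box):
    """Average estimated distance inside a detection frame."""
    x0, y0, x1, y1 = box
    vals = [depth[y][x] for y in range(y0, y1) for x in range(x0, x1)]
    return sum(vals) / len(vals)


def occ_pair(d_i, d_j):
    """Formula 3: shielded score of frame i relative to frame j."""
    return sigmoid(math.log(d_i / d_j))  # equals 1 / (1 + d_j / d_i)


def overlaps(a, b):
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def occlusion_scores(depth, boxes):
    """Shielded score per frame; the maximum over overlapping partners."""
    means = [box_mean(depth, b) for b in boxes]
    scores = []
    for i, bi in enumerate(boxes):
        cands = [occ_pair(means[i], means[j])
                 for j, bj in enumerate(boxes)
                 if j != i and overlaps(bi, bj)]
        scores.append(max(cands) if cands else 0.0)  # assumed default
    return scores


depth = [[1, 1, 4, 4],
         [1, 1, 4, 4]]
boxes = [(0, 0, 3, 2), (1, 0, 4, 2)]  # mean distances 2.0 and 3.0
scores = occlusion_scores(depth, boxes)
```

For a single overlapping pair the two scores sum to 1, so the frame with the larger mean distance receives the score above 0.5.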

The above is the processing from relative distance estimation up to aggregation of the shielded score. The tracking process using the shielded score is the same as in Embodiment 1 and is therefore omitted here.

The following derivative refinements of the training are conceivable.
(1) The loss is aggregated only over pairs whose difference in teacher distance values is equal to or greater than a predetermined threshold Θ. This makes the training robust against noise in the observed distance image.
(2) Refinement (1) is applied and, in addition, a margin is set. That is, a loss is generated not only when the near-far relationship of a pair is wrong, but also when the relationship is correct yet the output values are not separated by at least the predetermined threshold Θ.
(3) For a pixel pair whose difference in teacher distance values is less than the threshold Θ, a case where the difference in output values is equal to or greater than the threshold Θ is also treated as an error and given a loss. This suppresses noise-like output.
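The three refinements can be combined into a single per-pair rule, sketched below. The function names and the decision to combine (1) to (3) in one function are assumptions for illustration; the source presents them as separate options. `t_i`, `t_j` are teacher distances and `p_i`, `p_j` are the network outputs for the two pixels.

```python
def pair_loss(t_i, t_j, p_i, p_j, theta):
    """Return 1 if the pixel pair counts as an error, else 0."""
    t = t_i - t_j
    p = p_i - p_j
    if abs(t) >= theta:
        # (1) only clearly separated teacher pairs define an order;
        # (2) the output must agree in sign AND be separated by theta.
        sign = 1 if t > 0 else -1
        return 0 if sign * p >= theta else 1
    # (3) teacher says "about the same distance": a large output gap
    # is treated as a noise-like error.
    return 1 if abs(p) >= theta else 0


def mean_pair_loss(pairs, theta):
    """Average pair_loss over an iterable of (t_i, t_j, p_i, p_j)."""
    pairs = list(pairs)
    if not pairs:
        return 0.0
    return sum(pair_loss(*pr, theta) for pr in pairs) / len(pairs)


pairs = [
    (5.0, 1.0, 9.0, 2.0),  # correct order, well separated -> 0
    (5.0, 1.0, 3.0, 2.0),  # correct order, inside the margin -> 1
    (5.0, 1.0, 1.0, 5.0),  # reversed order -> 1
    (3.0, 2.5, 9.0, 1.0),  # teacher nearly equal, output far apart -> 1
    (3.0, 2.5, 3.0, 2.9),  # teacher nearly equal, output nearly equal -> 0
]
loss = mean_pair_loss(pairs, theta=2.0)
```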

As described above, various forms are possible, but any form in which distance can be learned relatively and locally may be used, and the invention is not limited to a single form.

The present invention is also realized by executing the following processing. That is, software (a program) that realizes the functions of the above-described embodiments is supplied to a system or apparatus via a data communication network or various storage media, and a computer (or a CPU, MPU, or the like) of the system or apparatus reads and executes the program. The program may also be provided recorded on a computer-readable recording medium.

1 Information processing device
201 Image acquisition unit
202 Object detection unit
203 Shielding information generation unit
204 Feature quantity acquisition unit
205 Association unit
206 Storage unit

Claims

An information processing apparatus that detects at least one object from an image, comprising:
estimation means for estimating, for each object detected from the image, shielding information indicating a shielding relationship with another object detected from the image, based on a trained model that has learned image features indicating the shielding relationship between a shielding object and a shielded object; and
specifying means for specifying, for each object detected from the image and based at least on the shielding information, a correspondence relationship with an object detected in an image captured at a time different from that of the image.

The information processing apparatus according to claim 1, wherein the estimation means estimates, for each object detected from the image, the shielding information indicating a shielded partial region of the object, based on the trained model, which estimates the shielded region of an object in an input image.

The information processing apparatus according to claim 1 or 2, wherein the specifying means specifies the correspondence relationship with an object detected in an image captured at a time different from that of the image, based on an image feature of the object and the shielding information estimated by the estimation means.

The information processing apparatus according to any one of claims 1 to 3, wherein the estimation means estimates, as the shielding information, a likelihood indicating that each object detected from the image is shielded by another object,
the apparatus further comprising holding means for holding the likelihood in association with an image feature of the object.

The information processing apparatus according to claim 4, wherein the shielding information indicates, for each region of the image, a larger likelihood in a region where an object is shielded and a smaller likelihood in other regions.

The information processing apparatus according to any one of claims 1 to 5, wherein the specifying means specifies, for each object detected from the image, the correspondence relationship with an object detected from an image captured at a time different from that of the image, based on the position of the object in the image, the image feature detected from the image, and the shielding relationship in the image.

The information processing apparatus according to any one of claims 1 to 6, further comprising acquisition means for acquiring a region of each object from the image,
wherein the estimation means estimates the shielding information indicating the presence or absence of a shielded object in the region of each object acquired by the acquisition means.

The information processing apparatus according to any one of claims 1 to 7, further comprising:
determination means for determining whether or not a shielded object exists, based on the correspondence relationship, specified by the specifying means, between each object detected from the image and the objects detected in an image captured at a time different from that of the image; and
storage means for storing the shielded object.

The information processing apparatus according to claim 8, wherein the determination means determines, as a first object, an object that, among the objects detected in an image captured before the image, corresponds to none of the objects detected from the image, and
the storage means stores that the first object is shielded at the time when the image is captured.

The information processing apparatus according to claim 9, wherein the determination means determines, as a second object, an object that, among the objects detected from the image, does not correspond to any object detected in the image captured before the image, and
the storage means stores that the first object is not shielded at the time when the image is captured, when the similarity between the second object detected from the image and the first object stored by the storage means as shielded in the image captured before the image is larger than a predetermined threshold value.

The information processing apparatus according to any one of claims 1 to 10, wherein, when two objects are detected in a first image and one object is detected in a second image captured after the first image,
the estimation means estimates the shielding information indicating that the object detected in the second image is shielding another object, and
the specifying means specifies, based on the shielding information estimated by the estimation means, that the object that is shielding another object among the objects detected in the first image is the same object as the object detected in the second image.

The information processing apparatus according to claim 11, wherein, when two objects are detected from a third image captured after the second image,
the estimation means estimates the shielding information indicating that, of the two objects detected from the third image, the object associated with the image feature of the object detected from the second image is shielding the other object, and
the specifying means specifies that, of the objects detected from the third image, the object other than the object associated with the image feature of the object detected from the second image is the same object as the object that is shielded by another object among the objects detected from the first image and that is shielded in the second image.

The information processing apparatus according to any one of claims 1 to 12, wherein the trained model is a neural network.

The information processing apparatus according to any one of claims 1 to 13, wherein the trained model is a model that has learned image features indicating the shielding relationship of objects based on a plurality of images captured at shorter time intervals.

A program for causing a computer to function as each means of the information processing apparatus according to any one of claims 1 to 14.

An information processing method for detecting at least one object from an image, comprising:
an estimation step of estimating, for each object detected from the image, shielding information indicating a shielding relationship with another object detected from the image, based on a trained model that has learned image features indicating the shielding relationship between a shielding object and a shielded object; and
a specifying step of specifying, for each object detected from the image and based at least on the shielding information, a correspondence relationship with an object detected in an image captured at a time different from that of the image.