JP7295053B2

JP7295053B2 - Scene extraction method, device and program

Info

Publication number: JP7295053B2
Application number: JP2020037619A
Authority: JP
Inventors: 和之田坂
Original assignee: KDDI Corp
Current assignee: KDDI Corp
Priority date: 2020-03-05
Filing date: 2020-03-05
Publication date: 2023-06-20
Anticipated expiration: 2040-03-05
Also published as: JP2021141434A

Description

The present invention relates to a method, device, and program for extracting a scene of interest from a moving image. In particular, it relates to a scene extraction method, device, and program that estimate the posture of a person and, when a specific posture scene that serves as the basis for extracting a scene of interest is detected, can automatically extract the scene of interest by going back to the related posture scene that laid the groundwork for, or triggered, that specific posture scene.

Patent Documents 1 to 3 disclose techniques for extracting scenes of interest from a moving image.

Patent Document 1 discloses a technique that identifies a referee in a sports video, estimates the referee's posture, estimates the referee's motion from the posture estimate, and accurately extracts a play section based on the estimated motion.

Patent Document 2 discloses a technique that extracts a specific telop (on-screen caption) contained in a sports video and extracts the temporal section before and after a change in that telop as an event section. Patent Document 2 also discloses a correction in which the time at which the specific telop last disappeared within the event section is extracted, and a time a fixed interval before it is set as the start point of the event section.

Patent Document 3 discloses a technique for switching a sub-screen to the main screen when a feature value of the content displayed on the sub-screen exceeds a predetermined threshold. In a live soccer broadcast, for example, crowd cheering grows louder in a scoring scene, so the volume and the audio signal at predetermined frequencies increase. Using these as feature values, the scoring scene can be detected and shown on the main screen from the moment the excitement begins.

Patent Document 1: Japanese Patent Application No. 2016-556198
Patent Document 2: Japanese Patent Application No. 2006-98340
Patent Document 3: Japanese Patent Application No. 2005-208839

Non-Patent Document 1: Z. Cao, T. Simon, S. Wei and Y. Sheikh, "Realtime Multi-person 2D Pose Estimation Using Part Affinity Fields," 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, 2017, pp. 1302-1310.

With Patent Document 1, a specific important play section can be extracted from the referee's movements. With Patent Documents 2 and 3, a section that includes the periods before and after a telop change or a rise in audio level can be extracted as an important play section. In sports broadcasts, however, an important play section is often preceded by a related scene that lays the groundwork for it, and it is desirable that the important play section can be viewed continuously, starting from that related scene.

In a soccer broadcast, for example, shot scenes and goal scenes can be important scenes, but they are preceded by assist-related scenes such as passes and crosses (centering) that lay the groundwork for them, and important scenes often arise from these.

However, the prior art described above focuses only on the specific important scene and does not extract the scene of interest by going back to the related scene that laid the groundwork for it.

An object of the present invention is to solve the above technical problem and to provide a scene extraction method, device, and program that, when a scene serving as the basis for extracting a scene of interest is detected from the result of player posture estimation, can automatically extract the scene of interest by going back to a scene related to it.

To achieve the above object, the present invention provides a method, device, and program for extracting a scene of interest from a moving image, characterized by the following configurations.

(1) The posture of a person extracted from a moving image is estimated. Related posture scenes, in which the estimated posture is a related posture, and specific posture scenes, in which the estimated posture is a specific posture, are detected. When a specific posture scene is detected, the video section from an already detected related posture scene up to that specific posture scene is extracted as the scene of interest. Here, a related posture is a posture that lays the groundwork for a specific posture.

(2) Objects in the moving image are tracked, and specific posture scenes and related posture scenes are detected based on both the result of person posture estimation and the result of object tracking.

(3) The section from a related posture scene to the specific posture scene is extracted as the scene of interest only if the elapsed time from that related posture scene to the specific posture scene is within a predetermined time.

(4) When a plurality of related posture scenes are detected whose elapsed time to the specific posture scene is within the predetermined time, the candidate scenes of interest, each running from one related posture scene to the specific posture scene, are displayed as a list, and the operator selects one of them as the scene of interest.

According to the present invention, the following effects are achieved.

(1) When a specific posture scene is detected based on a person's posture, the process goes back to the related posture scene that laid the groundwork for it, and the video section from the related posture scene to the specific posture scene is extracted as the scene of interest. An important scene can therefore be extracted objectively and continuously, starting from the scene that triggered it.

(2) Because objects in the moving image are tracked and specific posture scenes and related posture scenes are detected from both the posture estimation result and the object tracking result, a wide variety of specific and related posture scenes that cannot be detected from posture estimation alone can be detected accurately.

(3) Because only the section from a related posture scene whose elapsed time to the specific posture scene is within the predetermined time is extracted as the scene of interest, a scene of interest is prevented from being extracted from a specific posture scene and a related posture scene that are unrelated to each other.

(4) When a plurality of related posture scenes are detected before a specific posture scene, the candidate scenes of interest, each running from one related posture scene to the specific posture scene, can be listed for the operator to choose from, which enables subjective scene extraction that involves human judgment.

FIG. 1 is a functional block diagram of a scene extraction device according to a first embodiment of the present invention.
FIG. 2 is a diagram showing an example of the skeleton to be extracted in posture estimation.
FIG. 3 is a flowchart showing the operation of the first embodiment.
FIG. 4 is a diagram schematically showing a method of determining a scene of interest.
FIG. 5 is a functional block diagram of a scene extraction device according to a second embodiment of the present invention.
FIG. 6 is a diagram showing a scene detection method that takes object tracking results into account.
FIG. 7 is a diagram showing another method of extracting a scene of interest.
FIG. 8 is a diagram showing another example of playing back a scene of interest.

Embodiments of the present invention are described in detail below with reference to the drawings. FIG. 1 is a functional block diagram showing the configuration of the main parts of a scene extraction device 1 according to the first embodiment of the present invention. The description here takes as an example the extraction of a scene of interest that includes a goal scene from camera video of a soccer match.

Such a scene extraction device 1 can be configured by installing an application (program) that implements the functions described below on at least one general-purpose computer or mobile terminal equipped with a CPU, memory, interfaces, and a bus connecting them. Alternatively, it can be configured as a dedicated or single-function machine in which part of the application is implemented in hardware or as a program.

The camera video acquisition unit 101 acquires camera video from a plurality of cameras Cam that capture the playing field. The camera video is recorded in a video database (DB) 103, and the frame image acquisition unit 102 acquires frame images from the camera video. Alternatively, the camera video may first be recorded in the video DB 103, after which the frame image acquisition unit 102 reads it back from the video DB 103 and acquires frame images from the video it has read. The posture estimation unit 104 includes a person region extraction unit 104a and a skeleton information extraction unit 104b, and estimates, for each camera, the posture of each person extracted from each frame image.

The person region extraction unit 104a extracts person regions from each frame image of the camera video. For example, SSD (Single Shot Multibox Detector) can be used to extract person regions.

The skeleton information extraction unit 104b extracts, from each person region of the frame image, the skeleton elements registered in advance as extraction targets, and registers their positions and their connections to other skeleton elements as skeleton information. An existing skeleton extraction technique (Cascaded Pyramid Network) can be used for this.

FIG. 2 shows the skeleton that the skeleton information extraction unit 104b extracts: the left and right elbow joints P3 and P6, the left and right wrist joints P4 and P7, the left and right knee joints P9 and P12, the left and right ankle joints P10 and P13, and the bones connecting these joints, among others.

The skeleton extraction method is not limited to one that operates on person regions extracted in advance as described above. For example, as disclosed in Non-Patent Document 1, two sequential prediction processes may be applied in turn to a feature map extracted from the frame image: one using Confidence Maps, which encode the positions of body parts, and one using Part Affinity Fields (PAFs), which encode the connectivity between body parts. The positions and connectivity of the body parts of each person object (user) in the frame image are then estimated bottom-up in a single inference, and a skeleton model is built from the result.

In this case, by implementing processing that excludes from estimation the connectivity between body parts extracted from different partial regions, the positions and connectivity of body parts can be estimated per partial region. That is, a skeleton model of the object can be estimated for each user.

The scene detection unit 105 includes a specific posture scene detection unit 105a, a related posture scene detection unit 105b, a learning model 105c, and a scene registration unit 105d. Based on the posture estimation result, it detects specific posture scenes Qs, in which a person is in a specific posture, and related posture scenes Rs, in which a person is in a related posture.

A specific posture is a representative player posture that serves as the basis for extracting a scene of interest, for example a player's posture in a goal scene or a shot scene. A related posture is a representative player posture related to a specific posture, for example the posture of a player making a pass, a cross (centering), a free kick, a corner kick, or a header. A related posture scene Rs is therefore detected before the specific posture scene Qs and can be regarded as a video scene that lays the groundwork for that specific posture scene Qs. The learning model 105c stores a prediction model M trained in advance to estimate whether a player posture estimated by the posture estimation unit 104 corresponds to a specific posture or a related posture.

The specific posture scene detection unit 105a detects a specific posture scene Qs, in which a player is in a specific posture, by applying the posture estimation result to the prediction model M. The related posture scene detection unit 105b detects a related posture scene Rs, in which a player is in a related posture, by applying the posture estimation result to the prediction model M. The playback times of the specific posture scenes Qs and of the related posture scenes Rs are registered in the scene registration unit 105d.
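The detection and registration described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the `SceneRegistry` class, the `detect_and_register` function, the label strings, and the representation of a pose as a flat sequence of numbers are all assumptions introduced here, and `model` stands in for the prediction model M as any callable that returns a label.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Sequence

Pose = Sequence[float]  # assumed: flattened joint coordinates


@dataclass
class SceneRegistry:
    """Hypothetical stand-in for the scene registration unit 105d."""
    related_times: List[float] = field(default_factory=list)   # playback times of Rs
    specific_times: List[float] = field(default_factory=list)  # playback times of Qs


def detect_and_register(pose: Pose, playback_time: float,
                        model: Callable[[Pose], str],
                        registry: SceneRegistry) -> str:
    """Apply the model to one estimated pose and register the playback time.

    The model is assumed to return "related", "specific", or "none".
    """
    label = model(pose)
    if label == "related":
        registry.related_times.append(playback_time)
    elif label == "specific":
        registry.specific_times.append(playback_time)
    return label


def toy_model(pose: Pose) -> str:
    """Toy classifier used only to exercise the sketch."""
    if pose[0] > 0.9:
        return "specific"
    return "related" if pose[0] > 0.5 else "none"


registry = SceneRegistry()
for t, pose in [(10.0, [0.6]), (12.0, [0.1]), (15.0, [0.95])]:
    detect_and_register(pose, t, toy_model, registry)
```

Keeping only playback times in the registry mirrors the text, which registers times rather than video data; the video itself stays in the video DB.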

The scene-of-interest determination unit 106 includes a candidate presentation unit 106a and a selection unit 106b, and determines the scene of interest based on the playback time of the specific posture scene Qs and the playback times of the related posture scenes Rs. The scene-of-interest playback unit 107 plays back the scene of interest thus determined.

If a related posture scene Rs has been detected on the basis of, say, a player's heading posture, the scene of interest should preferably also include the footage shortly before the heading posture was detected, that is, the lead-up to the related posture: the player running to where the ball will arrive, jumping, and contesting the ball in the air with an opposing player. In this embodiment, therefore, the start time of the scene of interest may be set a predetermined time Δt before the playback time of the related posture scene Rs. For the same reason, the end time of the scene of interest may be set a predetermined time Δt after the playback time of the specific posture scene Qs. The predetermined time Δt may be a fixed value or may be set in advance for each type of related posture or specific posture.
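The padding of the start and end times can be written out as a small helper. This is a sketch under stated assumptions: the function name `scene_bounds`, the per-type table `DT_BY_TYPE`, and all numeric values are hypothetical, and clamping the start at zero is an added safeguard that the text does not mention.

```python
# Hypothetical per-type padding in seconds; the source only says that
# Δt may be fixed or set in advance per posture type.
DT_BY_TYPE = {"heading": 3.0, "pass": 1.5, "goal": 2.0}
DEFAULT_DT = 2.0


def scene_bounds(related_time: float, specific_time: float,
                 related_type: str = "", specific_type: str = "") -> tuple:
    """Return (start, end) of the scene of interest in playback seconds.

    start = related_time - Δt, end = specific_time + Δt, where each Δt
    is looked up by posture type and falls back to DEFAULT_DT.
    """
    if specific_time < related_time:
        raise ValueError("the related scene must precede the specific scene")
    dt_before = DT_BY_TYPE.get(related_type, DEFAULT_DT)
    dt_after = DT_BY_TYPE.get(specific_type, DEFAULT_DT)
    start = max(0.0, related_time - dt_before)  # assumed clamp at video start
    end = specific_time + dt_after
    return start, end
```

For a header detected at 40 s and a goal at 52 s, this sketch yields the interval from 37 s to 54 s with the illustrative values above.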

FIG. 3 is a flowchart showing the operation of this embodiment, and FIG. 4 schematically shows how the scene-of-interest determination unit 106 determines the scene of interest.

In FIG. 3, in step S1, camera video is acquired in real time and recorded in the video DB 103. In step S2, the frame image acquisition unit 102 takes a frame image from the acquired or recorded camera video. In step S3, the person region extraction unit 104a extracts person regions from the frame image. In step S4, the skeleton information extraction unit 104b extracts the skeleton information of each player, and the player's posture is estimated.

In step S5, the related posture scene detection unit 105b determines whether the estimated player posture corresponds to a predefined related posture. Many related postures are defined in this embodiment, and if the estimated posture corresponds to any of them, the process proceeds to step S6. In step S6, the scene registration unit 105d registers the playback time t1 (t2, t3) of the related posture as the timing of a related posture scene Rs. The process then returns to step S2, and the registration of related posture scenes Rs is repeated.

If, on the other hand, step S5 determines that the player's posture is not a related posture, the process proceeds to step S7, where the specific posture scene detection unit 105a determines whether the player posture estimated in step S4 corresponds to a predefined specific posture. A plurality of specific postures are defined in this embodiment, and if the estimated posture corresponds to none of them, the process returns to step S2 and the same processing is repeated for the next frame image.

If the estimated player posture does correspond to one of the specific postures, the process proceeds to step S8, and the scene registration unit 105d registers its playback time t4 as the timing of a specific posture scene Qs. In step S9, from among the related posture scenes Rs registered before time t4, at which the specific posture was estimated, all related posture scenes Rs whose elapsed time ΔT to time t4 (= t4 - t1, t4 - t2, t4 - t3) is within a predetermined time ΔTref are extracted.
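Step S9 amounts to a time-window filter over the registered related posture scenes. The following is a minimal sketch; the function name and the sample times are illustrative and not taken from the source.

```python
def related_within_window(related_times, t_specific, dt_ref):
    """Return the registered related-scene times t with 0 <= t_specific - t <= dt_ref.

    related_times: playback times of related posture scenes Rs (seconds)
    t_specific:    playback time t4 of the specific posture scene Qs
    dt_ref:        the predetermined time ΔTref
    """
    return [t for t in related_times if 0 <= t_specific - t <= dt_ref]


# Four related scenes were registered; the one at 5 s is too old to be
# treated as groundwork for the specific scene at 60 s.
selected = related_within_window([5, 40, 50, 55], t_specific=60, dt_ref=25)
```

The lower bound `0 <=` also discards any related scene registered after the specific scene, which cannot have laid the groundwork for it.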

In step S10, the candidate presentation unit 106a displays, as a list of candidate scenes of interest, the video sections running from each extracted related posture scene Rs to the specific posture scene Qs.

FIG. 4 shows an example of how the candidates are presented when three related posture scenes Rs1, Rs2, and Rs3 have been extracted, in that order, for a specific posture scene Qs. At the top, thumbnails of the three related posture scenes Rs1, Rs2, and Rs3 and of the specific posture scene Qs are displayed at the positions of their playback times t1, t2, t3, and t4, and below them the three candidate scenes of interest are listed together with their playback durations.

In this embodiment, the list shows a first scene of interest from the first related posture scene Rs1 to the specific posture scene Qs, a second scene of interest from the second related posture scene Rs2 to the specific posture scene Qs, and a third scene of interest from the third related posture scene Rs3 to the specific posture scene Qs. A thumbnail of the related posture scene Rs1, Rs2, or Rs3 is displayed at the start position of each scene of interest, and a thumbnail of the specific posture scene Qs at its end position.
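The candidate list of step S10 can be modelled as one entry per extracted related posture scene, all sharing the same end time. The sketch below uses hypothetical names (`Candidate`, `build_candidates`, `format_candidates`) and a text listing in place of the thumbnail display described for FIG. 4.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    index: int    # 1 = earliest related scene (longest candidate)
    start: float  # playback time of the related posture scene Rs
    end: float    # playback time of the specific posture scene Qs

    @property
    def duration(self) -> float:
        return self.end - self.start


def build_candidates(related_times, t_specific):
    """One candidate per related scene, ordered from earliest to latest."""
    return [Candidate(i, t, t_specific)
            for i, t in enumerate(sorted(related_times), start=1)]


def format_candidates(candidates):
    """Text stand-in for the list display with playback durations."""
    return [f"Candidate {c.index}: {c.start:.1f}s to {c.end:.1f}s ({c.duration:.1f}s)"
            for c in candidates]


candidates = build_candidates([50.0, 40.0, 55.0], t_specific=60.0)
lines = format_candidates(candidates)
```

The operator's choice in step S11 then reduces to picking one `Candidate` by its index.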

When the operator selects one of the candidates as the scene of interest and the selection unit 106b detects this in step S11, the process proceeds to step S12. In step S12, the selected scene of interest is played back from its related posture scene Rs to the specific posture scene Qs. For example, if the second scene of interest is selected, the video from time t2 (or from t2 - Δt, the predetermined time Δt earlier) to time t4 (or to t4 + Δt, the predetermined time Δt later) is played back.

In the embodiment above, all related posture scenes Rs within the predetermined time ΔTref are extracted when a specific posture scene Qs is detected, but the present invention is not limited to this. Related posture scenes Rs may be associated with each type of specific posture scene Qs (goal scene, shot scene, and so on), so that when a specific posture scene Qs is detected, only the related posture scenes Rs within the predetermined time ΔTref that correspond to the type of that specific posture scene Qs are extracted.
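This type-dependent variant adds a lookup step to the time-window filter. The sketch below is illustrative only: the association table `RELATED_TYPES_FOR` and its contents are invented for the example, since the source does not say which related postures correspond to which specific postures.

```python
# Hypothetical association between specific-posture types and the
# related-posture types treated as groundwork for them.
RELATED_TYPES_FOR = {
    "goal": {"pass", "centering", "corner_kick", "heading"},
    "shot": {"pass", "centering"},
}


def related_for_type(related_scenes, specific_type, t_specific, dt_ref):
    """Filter registered related scenes by type association and by ΔTref.

    related_scenes: list of (playback_time, related_type) tuples
    """
    allowed = RELATED_TYPES_FOR.get(specific_type, set())
    return [(t, kind) for t, kind in related_scenes
            if kind in allowed and 0 <= t_specific - t <= dt_ref]


scenes = [(30.0, "pass"), (45.0, "heading"), (50.0, "pass"), (55.0, "free_kick")]
for_shot = related_for_type(scenes, "shot", t_specific=60.0, dt_ref=20.0)
```

With these illustrative settings, a shot at 60 s keeps only the pass at 50 s: the pass at 30 s falls outside the window, and the header and free kick are not associated with shots in the assumed table.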

FIG. 5 is a functional block diagram showing the configuration of the main parts of a scene extraction device 1 according to a second embodiment of the present invention. Reference numerals identical to those above denote identical or equivalent parts, so their description is omitted.

This embodiment is characterized in that an object tracking unit 108 that tracks objects across frame images is provided, and the scene detection unit 105 detects specific posture scenes Qs and related posture scenes Rs based on the tracking result of each object together with the posture estimation result of each player.

In the object tracking unit 108, the object detection unit 108a detects objects in each frame image. In this embodiment, the players and the soccer ball are the objects to be detected. The ID assignment unit 108b tracks each object across frame images based on the similarity of its shape, size, and/or texture (for example, the uniform design) and on position estimation based on its motion vector, and realizes frame-to-frame object tracking by assigning the same ID (object identifier) to objects estimated to be identical.
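The ID assignment can be sketched with a deliberately simplified matcher. Note what is swapped in here: this example uses only motion-vector position prediction with greedy nearest-neighbour matching, and omits the shape, size, and texture similarity that the text also mentions. The function name, the track representation, and the distance threshold are assumptions.

```python
import math
from itertools import count


def assign_ids(tracks, detections, max_dist, id_source):
    """Assign an object ID to each detection in the current frame.

    tracks:     dict id -> (x, y, vx, vy) from the previous frame (updated in place)
    detections: list of (x, y) positions detected in the current frame
    max_dist:   largest allowed distance between predicted and detected position
    id_source:  iterator that yields fresh IDs for unmatched detections
    """
    # Predict each tracked object's position from its motion vector.
    predicted = {tid: (x + vx, y + vy) for tid, (x, y, vx, vy) in tracks.items()}
    pairs = sorted(
        (math.hypot(px - dx, py - dy), tid, j)
        for tid, (px, py) in predicted.items()
        for j, (dx, dy) in enumerate(detections)
    )
    ids = [None] * len(detections)
    used = set()
    for dist, tid, j in pairs:  # greedy: closest pairs first
        if dist > max_dist:
            break
        if tid in used or ids[j] is not None:
            continue
        ids[j] = tid
        used.add(tid)
    for j, (dx, dy) in enumerate(detections):
        if ids[j] is None:  # unmatched detection starts a new track
            ids[j] = next(id_source)
            tracks[ids[j]] = (dx, dy, 0.0, 0.0)
        else:  # matched: update position and motion vector
            x, y, _, _ = tracks[ids[j]]
            tracks[ids[j]] = (dx, dy, dx - x, dy - y)
    return ids


tracks = {1: (0.0, 0.0, 1.0, 0.0), 2: (10.0, 10.0, 0.0, 0.0)}
ids = assign_ids(tracks, [(10.2, 10.1), (1.1, 0.0), (50.0, 50.0)], 2.0, count(3))
```

A production tracker would typically combine appearance similarity with position in a single cost and solve the assignment globally rather than greedily; the greedy form is used here only to keep the sketch short.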

In the first embodiment, each scene was detected from the result of player posture estimation alone, so attributes such as each player's team and position (for example, whether the player is a goalkeeper) could not be identified, and the movement of the ball could not be reflected in scene detection. In this embodiment, by contrast, each player and the ball can be identified and tracked, so each scene can be detected in finer detail and more accurately.

FIG. 6 schematically shows how related posture scenes Rs and a specific posture scene Qs are detected in this embodiment. It shows an example sequence in which player a1 of one team (team A) is in possession of the ball 30, the ball 30 is passed to another player a2 of team A, player a2 passes the ball 30 to player a3 of team A, and player a3 evades goalkeeper bg of the other team (team B) and shoots the ball 30 to score a goal.

In such a case, when the ball 30 moves between players, the movement can be classified as a pass if the players are on the same team and as an interception (cut) if they are on different teams, so scenes can be detected accurately.
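The pass/interception rule is simple enough to state directly in code. The sketch below assumes that team membership is already known for each tracked player ID; the function name, the label strings, and the extra player ID `b1` are illustrative.

```python
def classify_ball_transfer(team_of, from_player, to_player):
    """Classify a ball movement between two tracked players.

    team_of: dict mapping player ID to team label
    Returns "pass" for same-team transfers and "interception" otherwise.
    """
    if from_player == to_player:
        return "keep"  # the same player still has the ball
    if team_of[from_player] == team_of[to_player]:
        return "pass"
    return "interception"


team_of = {"a1": "A", "a2": "A", "a3": "A", "bg": "B", "b1": "B"}
```

For instance, a transfer from a1 to a2 is classified as a pass, while a transfer from a2 to b1 is classified as an interception.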

In this embodiment, because the ball 30 that player a3 shot and the ball 30 that entered the goal immediately afterward are the same (that is, the ball 30 has been tracked successfully; the same applies below), the scene can be detected as a specific posture (shot) scene, and player a3 can be recognized as the scorer.

Furthermore, because player a3 and player a2 are on the same team, and the ball 30 that player a2 kicked and the ball 30 that player a3 received immediately afterward are the same, the action is found to be a pass from player a2 to player a3, and player a2 is found to have made the assist. The scene can therefore be detected as a related posture (assist) scene.

Likewise, because player a2 and player a1 are on the same team, and the ball 30 that player a1 kicked and the ball 30 that player a2 received immediately afterward are the same, the action is found to be a pass from player a1 to player a2, and player a1 may also have contributed an assist. This scene can therefore also be detected as a related posture (assist) scene.
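The reasoning in the last three paragraphs walks backward from the shot through consecutive same-team ball transfers. One way to sketch it, assuming the tracker yields a chronological possession sequence, is shown below; the function name, the tuple layout, and the sample times are hypothetical.

```python
def trace_assists(possessions, team_of, shot_index):
    """Collect the same-team passes that lead up to a shot.

    possessions: chronological list of (time, player_id) possession starts
    shot_index:  index of the possession in which the shot was taken
    Returns (time, passer, receiver) tuples in chronological order, stopping
    at the first transfer between players of different teams.
    """
    chain = []
    i = shot_index
    while i > 0:
        t_prev, passer = possessions[i - 1]
        _, receiver = possessions[i]
        if team_of[passer] != team_of[receiver]:
            break  # an interception ends the build-up
        chain.append((t_prev, passer, receiver))
        i -= 1
    chain.reverse()
    return chain


team_of = {"a1": "A", "a2": "A", "a3": "A", "b1": "B"}
possessions = [(5.0, "b1"), (10.0, "a1"), (14.0, "a2"), (18.0, "a3")]
assists = trace_assists(possessions, team_of, shot_index=3)
```

Each returned tuple corresponds to one related posture (assist) scene candidate; in the FIG. 6 example these are the passes from a1 to a2 and from a2 to a3.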

Thus, according to this embodiment, each player can be identified and tracked and the ball can be tracked. Not only can ball movement between players be classified as either a pass or an interception (cut), but related posture scenes Rs such as assists and passes can also be detected accurately by referring to the tracking results of the players and the ball leading up to the specific posture scene Qs of the shot.

In other words, similar scenes that could not be distinguished from player posture alone can now be distinguished, and a wide variety of scenes that were difficult to detect accurately from player posture alone can now be detected accurately.

In the embodiments above, related posture scenes that may lay the groundwork for a specific posture detected later are detected and registered in advance, and when a specific posture is then detected, the scene of interest is determined by going back to the corresponding registered related posture scene. The present invention is not limited to this, however. Related posture scenes need not be detected in advance: when a specific posture is detected, the camera video may be traced backward to detect the related posture scene corresponding to that specific posture, and the scene of interest determined from it.
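This on-demand variant can be sketched as a backward scan over recorded frames. The sketch makes two assumptions that the paragraph above does not state: poses for past frames are available (or can be recomputed from the recorded video), and the scan stops after a maximum look-back, borrowed here from the ΔTref idea of the first embodiment. All names are hypothetical.

```python
def search_backward(frames, specific_index, classify, max_lookback):
    """Scan backward from a specific-posture frame for related-posture frames.

    frames:         chronological list of (playback_time, pose)
    specific_index: index of the frame where the specific posture was detected
    classify:       callable pose -> "related" / "specific" / "none"
    max_lookback:   how far back to search, in seconds (assumed limit)
    Returns indices of related-posture frames, most recent first.
    """
    t_specific = frames[specific_index][0]
    found = []
    for i in range(specific_index - 1, -1, -1):
        t, pose = frames[i]
        if t_specific - t > max_lookback:
            break  # older frames are outside the search window
        if classify(pose) == "related":
            found.append(i)
    return found


def toy_classify(pose):
    """Toy classifier: the pose is just a label string in this example."""
    return pose


frames = [(0.0, "related"), (10.0, "none"), (20.0, "related"),
          (25.0, "none"), (30.0, "specific")]
hits = search_backward(frames, 4, toy_classify, max_lookback=15.0)
```

Compared with registering related scenes in advance, this trades storage of a registry for repeated classification work each time a specific posture is found.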

The embodiments above do not address whether the camera cam that captured the image in which a related posture scene was detected is the same as the camera cam that captured the image in which the specific posture scene was detected. When the playing field is captured by multiple cameras, as in these embodiments, techniques for tracking each object not only across frames but also across cameras are well established. Therefore, when a specific posture scene is detected in an image from one camera cam, the video section leading up to it may start from a related posture scene detected in an image from a different camera, not only from one detected in an image from the same camera, and that section may be used as the scene of interest (candidate).

In the example shown in FIG. 7, related posture scene Rs1 is detected at time t1 in the camera video captured by camera cam1, related posture scene Rs2 is detected at time t2 in the camera video captured by camera cam2, and related posture scene Rs3 is detected at time t3 in the camera video captured by camera cam3. When specific posture scene Rs is subsequently detected at time t4 in the camera video captured by camera cam4, the scene of interest is extracted by connecting the video section from time t1 to t2 captured by camera cam2, the video section from time t2 to t3 captured by camera cam3, and the video section from time t3 to t4 captured by camera cam4.
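
The FIG. 7 example can be sketched as follows. This is a minimal sketch, not the embodiment's implementation: the `Detection` and `Segment` types and the function `stitch_scene` are hypothetical names, and times are assumed to be plain seconds. Each segment ends at a detection and is taken from the camera that made that detection, which is the pairing the example describes.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Detection:
    camera: str  # camera whose video contained the detected scene
    time: float  # detection time in seconds (assumed unit)


@dataclass(frozen=True)
class Segment:
    camera: str
    start: float
    end: float


def stitch_scene(related: List[Detection], specific: Detection) -> List[Segment]:
    """Build the scene of interest as consecutive per-camera segments."""
    # Order the related posture scenes by time, then append the specific one.
    points = sorted(related, key=lambda d: d.time) + [specific]
    # Pair each detection with the one before it; the later detection's
    # camera supplies the video for the interval between them.
    return [
        Segment(camera=cur.camera, start=prev.time, end=cur.time)
        for prev, cur in zip(points, points[1:])
    ]


related = [Detection("cam1", 1.0), Detection("cam2", 2.0), Detection("cam3", 3.0)]
segments = stitch_scene(related, Detection("cam4", 4.0))
```

With the values above, `segments` holds three sections: cam2 for t1 to t2, cam3 for t2 to t3, and cam4 for t3 to t4.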

If the scene of interest is also to include, as described earlier, a video section of a predetermined time Δt leading up to related posture scene Rs1, then, as shown in FIG. 8, the video section from time t1-Δt to t1 captured by camera cam1 may be connected in front of the scene of interest.
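
The lead-in of FIG. 8 amounts to prepending one more section to the list. The sketch below assumes a scene of interest is represented as a list of `(camera, start, end)` tuples; that layout, the function name `prepend_lead_in`, and the clamping of the start time at zero are assumptions made for illustration only.

```python
from typing import List, Tuple

# A segment is (camera, start_time, end_time); this layout is an assumption.
Segment = Tuple[str, float, float]


def prepend_lead_in(
    scene: List[Segment], first_camera: str, delta_t: float
) -> List[Segment]:
    """Return a new scene with a lead-in of delta_t before its first section."""
    if not scene:
        return []
    t1 = scene[0][1]  # start of the existing scene, i.e. time of Rs1
    # Clamp at 0 so the lead-in never starts before the video does (assumption).
    lead_in = (first_camera, max(0.0, t1 - delta_t), t1)
    return [lead_in] + scene


scene = [("cam2", 10.0, 20.0), ("cam3", 20.0, 30.0), ("cam4", 30.0, 40.0)]
extended = prepend_lead_in(scene, "cam1", 3.0)
```

The lead-in is taken from cam1 because that is the camera in which Rs1 was detected at t1.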

Reference Signs List: 1…scene extraction device, 30…ball, 101…camera video acquisition unit, 102…frame image acquisition unit, 103…video DB, 104…posture estimation unit, 105…scene detection unit, 106…scene-of-interest determination unit, 107…scene-of-interest playback unit, 108…object tracking unit

Claims

1. A scene extraction device for extracting a scene of interest from a moving image, comprising:
posture estimation means for estimating the posture of a person extracted from the moving image;
means for detecting a related posture scene, in which the posture estimation result is a related posture;
means for detecting a specific posture scene, in which the posture estimation result is a specific posture;
extracting means for, when the specific posture scene is detected, going back to an already-detected related posture scene whose elapsed time up to the specific posture scene is within a predetermined time, and extracting the video up to the specific posture scene as the scene of interest; and
means for displaying, when a plurality of related posture scenes whose elapsed time up to the specific posture scene is within the predetermined time have been detected, a list of a plurality of scene-of-interest candidates, each running from one of the related posture scenes to the specific posture scene,
wherein the extracting means takes one separately designated scene-of-interest candidate as the scene of interest.

2. The scene extraction device according to claim 1, further comprising means for tracking an object in the moving image,
wherein the means for detecting a specific posture scene and the means for detecting a related posture scene each detect their scene based on the posture estimation result and the object tracking result.

3. The scene extraction device according to claim 1, wherein the related posture is a posture that lays the groundwork for the specific posture.

4. The scene extraction device according to any one of claims 1 ..., comprising means for associating a related posture with each type of specific posture,
wherein, when a specific posture scene is detected, the extracting means goes back to the related posture scene corresponding to the type of that specific posture scene and extracts the video up to the specific posture scene as the scene of interest.

5. The scene extraction device according to claim 1, wherein the scene of interest is formed by connecting video sections extracted from a plurality of moving images captured from different viewpoints.

6. A scene extraction method in which a computer extracts a scene of interest from a moving image, the method comprising:
estimating the posture of a person extracted from the moving image;
detecting a related posture scene, in which the posture estimation result is a related posture;
detecting a specific posture scene, in which the posture estimation result is a specific posture;
when the specific posture scene is detected, going back to an already-detected related posture scene whose elapsed time up to the specific posture scene is within a predetermined time, and extracting the video up to the specific posture scene as the scene of interest; and
when a plurality of related posture scenes whose elapsed time up to the specific posture scene is within the predetermined time have been detected, displaying a list of a plurality of scene-of-interest candidates, each running from one of the related posture scenes to the specific posture scene,
wherein, in extracting the scene of interest, one separately designated scene-of-interest candidate is taken as the scene of interest.

7. The scene extraction method according to claim 6, further comprising tracking an object in the moving image, wherein the specific posture scene and the related posture scene are detected based on the posture estimation result and the object tracking result.

8. A scene extraction program for extracting a scene of interest from a moving image, the program causing a computer to execute:
a procedure for estimating the posture of a person extracted from the moving image;
a procedure for detecting a related posture scene, in which the posture estimation result is a related posture;
a procedure for detecting a specific posture scene, in which the posture estimation result is a specific posture;
an extracting procedure for, when the specific posture scene is detected, going back to an already-detected related posture scene whose elapsed time up to the specific posture scene is within a predetermined time, and extracting the video up to the specific posture scene as the scene of interest; and
a procedure for displaying, when a plurality of related posture scenes whose elapsed time up to the specific posture scene is within the predetermined time have been detected, a list of a plurality of scene-of-interest candidates, each running from one of the related posture scenes to the specific posture scene,
wherein the extracting procedure takes one separately designated scene-of-interest candidate as the scene of interest.

9. The scene extraction program according to claim 8, further comprising a procedure for tracking an object in the moving image,
wherein the specific posture scene and the related posture scene are detected based on the posture estimation result and the object tracking result.