JP2012053855A

JP2012053855A - Content browsing device, content display method and content display program

Info

Publication number: JP2012053855A
Application number: JP2010198132A
Authority: JP
Inventors: Masato Aranishi; 誠人新西
Original assignee: Ricoh Co Ltd
Current assignee: Ricoh Co Ltd
Priority date: 2010-09-03
Filing date: 2010-09-03
Publication date: 2012-03-15

Abstract

PROBLEM TO BE SOLVED: To provide a content browsing device and the like that make it easy to grasp important contexts and their key points when reviewing information that carries time information, such as video content. SOLUTION: The content browsing device holds video content together with audio and still images linked to it on a time axis. It stores video scenes extracted from the video content, and stores keywords extracted from the audio and the still images in association with an importance degree obtained by analyzing each keyword. It reads out a predetermined number of keywords with the highest importance degrees and, as display data, the audio or still image from which each read-out keyword was extracted and the video scene linked to it on the same time axis. From the read-out display data, it displays each group consisting of a video scene, audio and/or a still image, and a keyword linked on the same time axis and, if there are plural groups, arranges and displays the groups in chronological order.

Description

The present invention relates to a content browsing device, a content display method, and a content display program.

Conventionally, the minutes of a meeting have mainly been recorded as written minutes. In this method, a secretary attends the meeting and writes the minutes. The secretary sometimes records every statement made during the meeting (especially for important meetings), but in most cases the secretary writes summarized minutes, selecting the key points while following the content of the meeting. Because summarized written minutes gather the important matters together, the content of the meeting can be grasped easily and quickly when it is reviewed at a later date.

In recent years, with the development of digital technology, the content of a meeting can easily be recorded as video content. If a meeting is recorded with a video camera, it can be played back later exactly as it took place. However, video content recorded with a video camera is continuous information that flows with time. Therefore, unlike the summarized written minutes described above, a user who wants to review the content (a summary) of the meeting quickly and efficiently has to play the video back while repeatedly fast-forwarding and rewinding, so merely finding the important points (scenes) takes a long time.

For the purpose of later reviewing video content of a recorded meeting, techniques are known that tag the video content for search, calculate importance, and identify the positions of the important points (scenes) of the video on the time axis. For example, Patent Document 1 discloses, for a conference system using multimedia information such as audio and images, a configuration intended to provide a device and a user interface with which important items in a conference can be edited simply and quickly. In that configuration, conference data is input through at least one of moving image input means, still image input means, audio input means, pen input means, pointing means, and key input means; the input data is stored in data storage means; the temporal relationships of the data are analyzed to create a search file; the created search file is stored; the corresponding data is read from the data storage means based on the result of referring to the stored search file; and the data is displayed and edited, thereby supporting the creation of minutes.

However, conventional techniques for reviewing video (for example, Patent Document 1) display the stream of video content along the time axis and mark the portions of the displayed video whose context has high importance by color coding or the like. Particularly for a long meeting video, finding the important scenes therefore still takes a considerable amount of time. Moreover, even if the positions of important scenes can be identified from the color-coded video, it is difficult to grasp the key points without actually playing back those scenes and checking their content. In other words, from the viewpoint of user-friendliness, the conventional techniques leave room for improvement in how content is displayed when reviewing information with time information, such as video content.

The present invention has been made in view of the above points, and an object thereof is to provide a content browsing device, a content display method, and a content display program with which important contexts and their key points can be easily grasped when reviewing information with time information, such as video content.

To achieve the above object, a content browsing device according to the present invention is a content browsing device that displays content on display means, comprising: data storage means that holds video content and at least one of audio and a still image linked to the video content on a time axis; analysis data storage means that stores video scenes extracted from the video content and stores keywords extracted from at least one of the audio and the still image in association with an importance degree analyzed for each keyword; read control means that reads out, from the analysis data storage means, a predetermined number of keywords with the highest importance degrees and reads out, as display data, the audio or still image from which each read-out keyword was extracted and the video scene linked to that audio or still image on the same time axis; and display control means that displays, from the display data read out by the read control means, a group consisting of a video scene, audio and/or a still image, and a keyword linked on the same time axis and, when there are plural groups, causes the display means to display the groups arranged in chronological order.

Also to achieve the above object, the content browsing device further comprises: second analysis data storage means that stores video scenes extracted from the video content in association with an importance degree analyzed for each video scene; second read control means that reads out, from the second analysis data storage means, a predetermined number of video scenes with the highest importance degrees and reads out, as second display data, the audio and still image linked to each read-out video scene on the same time axis; and second display control means that displays, from the display data read out by the second read control means, a group consisting of a video scene and audio and/or a still image linked on the same time axis and, when there are plural groups, causes the display means to display the groups arranged in chronological order.

Also to achieve the above object, in the content browsing device, the display control means displays with emphasis the audio or still image from which a keyword was extracted, and the second display control means displays with emphasis the predetermined number of video scenes with the highest importance degrees.

The components and expressions of the present invention, or any combination of those components, applied to a method, a device, a system, a computer program, a recording medium, or the like are also effective as aspects of the present invention.

According to the present invention, it is possible to provide a content browsing device, a content display method, and a content display program with which important contexts and their key points can be easily grasped when reviewing information with time information, such as video content.

FIG. 1: A diagram showing the flow of a series of processes of the conference recording system according to the embodiment.
FIG. 2: An example of a display screen according to the present embodiment.
FIG. 3: A hardware configuration diagram showing the main components of one embodiment of the conference recording system 1.
FIG. 4: A functional block diagram showing the main functions of one embodiment of the conference recording system 1.
FIG. 5: A diagram showing an example DB configuration of the data storage means and the analysis data storage means.
FIG. 6: A configuration example of the video content data DB 510.
FIG. 7: A configuration example of the audio data DB 520.
FIG. 8: A configuration example of the whiteboard image DB 530.
FIG. 9: A configuration example of the slide image DB 540.
FIG. 10: A configuration example of the video scene DB 511 cut out from the video content data.
FIG. 11: A configuration example of the scene change DB 512 extracted from the video scenes.
FIG. 12: A configuration example of the speech recognition data DB 521 extracted from the audio data.
FIG. 13: A configuration example of the speaker DB 522 analyzed from the speech recognition data.
FIG. 14: A configuration example of the keyword DB 550 analyzed from the speech recognition data, the whiteboard images, and the slide images.
FIG. 15: A flowchart explaining the data read-out and display processing.
FIG. 16: An example of the analysis data read out.

The best mode for carrying out the present invention will be described below with reference to the drawings.

[System configuration]
(Overview)
The present invention provides a content browsing device with which important contexts and their key points can be easily grasped when reviewing information with time information, such as video content. Because video content must be acquired and analyzed before it can be browsed, the following embodiment applies the present invention to a conference recording system (so called for convenience) that provides a series of processing functions covering acquisition, analysis, display (browsing), and playback of video content.

FIG. 1 is a diagram showing the flow of a series of processes of the conference recording system according to the embodiment. The system provides a series of processing functions covering acquisition, analysis, display (browsing), and playback of video content and, as shown in the figure, proceeds broadly through steps S100 to S400.

S100: Data input step. In this embodiment, data means conference data relating to a conference, specifically video content data, audio (which may be included in the video content data), images, and input instruction data from a mouse or keyboard. As the conference proceeds, the conference recording system keeps receiving this conference data through the respective input means. When the conference ends, data input also ends.

S200: Data analysis step. The data input in S100 has been stored in the storage means, and the conference recording system analyzes it. The analysis includes, for example, dividing the video stream into scenes, calculating an importance degree for each scene, extracting text from the audio and identifying the speaker, and analyzing whiteboards and slides to extract keywords and calculate their importance degrees. The analysis results are stored in the storage means.

S300: Data read-out and display step. In this step the user reviews the video content of a conference. For example, when the user designates one conference on the display means of the conference recording system and performs a predetermined operation, important contexts from that conference's content are displayed, as many as the number of frames that the display means can show. In addition to the contexts judged important, the other contexts at the same time on the time axis are displayed alongside them.

FIG. 2 shows an example of a display screen according to the present embodiment. In this conference, video content data, audio data, whiteboard images, and slide (electronic slide) images were input, so data analysis decomposes the input into the contexts of video scene, speaker, whiteboard, slide, and keyword. Of these, the contexts with high importance (shown with a bold frame) are extracted, and the other contexts at the same time on the time axis are displayed in parallel with them. Four high-importance contexts are extracted and displayed for this conference. The user can thus recognize that these four are the scenes (video scenes) with particularly important content and can also tell from the keywords more specifically what that content was.

S400: Scene selection and video playback step. Because several important contexts are displayed on the display means of the conference recording system, the user can select from among them the video scene to play back. The video content is then actually played back from the point of the selected scene within the conference, so the user can review the conference locally. For example, when the user selects one of the contexts and presses the "Play" button, the display switches to a playback screen and the video content is played back from that time.
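The read-out logic of S300 can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions: the `Context` record, the `overlaps` test, and `build_groups` are hypothetical names that do not appear in the source, and "linked on the same time axis" is approximated here as overlapping time intervals.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Context:
    """Hypothetical record for one analyzed context (not from the source)."""
    kind: str            # e.g. "scene", "speaker", "whiteboard", "slide", "keyword"
    label: str
    start: datetime
    end: datetime
    importance: float = 0.0


def overlaps(a, b):
    """Assumption: two contexts share a time slot if their intervals overlap."""
    return a.start < b.end and b.start < a.end


def build_groups(contexts, top_n):
    """Pick the top_n keywords by importance and group each one with the
    non-keyword contexts at the same time, then order groups chronologically."""
    keywords = [c for c in contexts if c.kind == "keyword"]
    top = sorted(keywords, key=lambda c: c.importance, reverse=True)[:top_n]
    groups = []
    for kw in top:
        linked = [c for c in contexts
                  if c.kind != "keyword" and overlaps(c, kw)]
        groups.append({"keyword": kw, "linked": linked})
    # Groups are displayed in time order, not in importance order.
    groups.sort(key=lambda g: g["keyword"].start)
    return groups
```

Here `top_n` stands in for the number of frames the display can show. Ranking by importance first and sorting by time afterwards mirrors the two-stage behavior described for S300.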

(Hardware)
The hardware configuration of the conference recording system (conference recording device) 1 is briefly described here. FIG. 3 is a hardware configuration diagram showing the main components of one embodiment of the conference recording system 1. The conference recording system 1 includes, as its main components, a CPU 101, a ROM (Read Only Memory) 102, a RAM (Random Access Memory) 103, an auxiliary storage device 104, a storage medium reading device 105, an input device 106, a display device 107, a communication device 108, and an interface device 109.

The CPU 101 consists of a microprocessor and its peripheral circuits and controls the entire device. The ROM 102 is a memory that stores predetermined control programs (software components) executed by the CPU 101. The RAM 103 is a memory used as a work area when the CPU 101 executes the control programs stored in the ROM 102 to perform various kinds of control.

The auxiliary storage device 104 stores various information, including a general-purpose OS (Operating System) and various programs. A nonvolatile storage device such as an HDD (Hard Disk Drive) is used.

The input device 106 is a device with which the user performs various input operations. It includes a mouse, a keyboard, a touch panel switch superimposed on the display screen of the display device 107, and the like. The display device 107 displays various data on a display screen and consists of, for example, an LCD (Liquid Crystal Display) or a CRT (Cathode Ray Tube).

The communication device 108 communicates with other devices via a network. It supports communication over various forms of network, including wired and wireless networks.

The interface device 109 is an interface for connecting external peripheral devices. In the present embodiment, it is connected mainly to data capture devices such as a video camera (for moving images), a digital camera (for still images), and a microphone, and data is input through it.

(Functional configuration)
Next, the main functional components of the conference recording system 1 according to the present embodiment are briefly described in turn. FIG. 4 is a functional block diagram showing the main functions of one embodiment of the conference recording system 1. As shown in the figure, the conference recording system 1 includes, as its main functions, a data input unit 201, a timing unit 202, a data registration unit 203, a data storage unit 204, a data analysis unit 205, an analysis data storage unit 206, a display control unit 207, a display data read control unit 208, an operation unit 209, a display unit 210, and a playback unit 211.

The data input unit 201 is a unit for inputting data. It is realized by, for example, the interface device 109 described above. The input data includes, for example, video content data from a video camera, audio data from a microphone, captures of a whiteboard taken by a camera (still images), and slide material data (still images).

The timing unit 202 measures time, for example in units of seconds, and returns the current time in response to a request from the data registration unit 203.

The data registration unit 203 stores the data input from the data input unit 201 in the data storage unit 204. If the data is not yet registered in the data storage unit 204, the data registration unit 203 queries the timing unit 202 for the time and registers that time information together with the data. For streaming information such as video or audio that is already registered, it does not query the timing unit 202. For a still image, it likewise queries the timing unit 202 and registers the time information together with the image.
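The registration rule above (time-stamp a stream once when it is first registered, time-stamp every still image) can be illustrated with a short Python sketch. The class and method names below are hypothetical and chosen for illustration only; the in-memory dictionaries merely stand in for the data storage unit 204.

```python
from datetime import datetime, timezone


class Clock:
    """Stand-in for the timing unit 202: returns the current time on request."""

    def now(self):
        return datetime.now(timezone.utc)


class DataRegistrar:
    """Illustrative stand-in for the data registration unit 203."""

    def __init__(self, clock):
        self.clock = clock
        self.streams = {}   # stream_id -> {"start": time, "chunks": [...]}
        self.stills = []    # (image_id, capture_time, image)

    def register_stream_chunk(self, stream_id, chunk):
        entry = self.streams.get(stream_id)
        if entry is None:
            # First chunk of a new stream: ask the clock once and keep the
            # answer as the stream's start time.
            entry = {"start": self.clock.now(), "chunks": []}
            self.streams[stream_id] = entry
        # Already-registered streams are appended to without a new time query.
        entry["chunks"].append(chunk)
        return entry["start"]

    def register_still(self, image_id, image):
        # Every still image gets its own time stamp.
        stamp = self.clock.now()
        self.stills.append((image_id, stamp, image))
        return stamp
```

Passing the clock in as a constructor argument is a design choice of this sketch; it keeps the registrar easy to exercise with a fake clock.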

The data storage unit 204 stores the data that the data registration unit 203 requests it to register, together with its time information. It also retrieves stored data in response to requests from the data analysis unit 205 and the display data read control unit 208. The data storage unit 204 is realized by, for example, the auxiliary storage device 104 described above and may be an HDD (hard disk) or the like. The data storage unit 204 manages the stored data with a DB (database); specific examples are described later.

The data analysis unit 205 retrieves the data stored in the data storage unit 204, applies specific processing to it to extract metadata, and calculates importance degrees based on the metadata. The data analysis unit 205 consists of, for example, scene change extraction means, speaker identification means, key frame extraction means, slide extraction means, speech recognition means, and keyword extraction means.

The analysis data storage unit 206 stores the data analyzed by the data analysis unit 205 and retrieves data in response to requests from the display data read control unit 208. It manages the stored data with a DB (specific examples are described later).

The display control unit 207 issues a read request to the display data read control unit 208 based on the information entered through the operation unit 210. It also issues a display request to the display unit 209 according to the information read out.

The display data read control unit 208 requests the analysis data storage unit 206 to retrieve analysis data (analysis results) in response to a request from the display control unit 207. It then performs comparison operations on the analysis data and requests the data storage unit 204 to retrieve the necessary data. It passes the retrieved data to the display control unit 207 as display data.

The display unit 209 displays what the display control unit 207 requests it to display. It is realized by, for example, the display device 107 described above and may be a display or the like.

The operation unit 210 provides the means by which the user performs operations. It is realized by, for example, the input device 106 described above and may be a mouse, a keyboard, a pen input device, or the like.

The playback unit 211 plays back the video content stored in the data storage unit 204. When the user specifies the target video content, the playback time, and so on, it plays back the video content. It may be realized by a media player or the like.

In practice, these functions are realized on a computer by programs executed by the CPU 101 of the device.
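For the playback unit 211, starting playback at a selected scene requires turning the scene's wall-clock time into a position within the video file. The source does not say how this is computed. One plausible sketch, assuming the position is simply the scene time minus the recorded start time of the video, is shown below; the function name is hypothetical.

```python
from datetime import datetime


def playback_offset_seconds(video_start, scene_start):
    """Return how many seconds into the video the selected scene begins.

    Assumption (not stated in the source): the offset is the difference
    between the scene's time stamp and the video's recorded start time.
    """
    offset = (scene_start - video_start).total_seconds()
    if offset < 0:
        raise ValueError("scene starts before the video does")
    return offset
```

A media player could then be asked to seek to the returned offset before playing.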

[Information processing]
As described above, the conference recording system 1 according to the embodiment provides a series of processing functions covering acquisition, analysis, display (browsing), and playback of video content and proceeds broadly through steps S100 to S400 (FIG. 1).

FIG. 5 is a diagram showing an example DB configuration of the data storage unit and the analysis data storage unit. The following description refers to this figure as well.

(Data input: S100)
Peripheral devices for data input are connected to the conference recording system 1 according to the embodiment. In the present embodiment, the input data is assumed to be video content data, audio data, whiteboard images (still images), and slide images (still images).

For example, a video camera is installed in the conference room so that it overlooks the entire room. Alternatively, a camera operator may handle the video camera and shoot freely as needed. The video acquired from the video camera is streaming data that includes time information. A whiteboard is also installed in the conference room, for example, and an image of the handwriting on the whiteboard is captured at regular intervals or by user operation. In addition, electronic material data on a user's PC is, for example, projected onto a screen on the wall or shared among users' PCs, and these slide images are captured.

When the conference starts, input of this data begins, and it ends when the conference ends. During the conference, the data is stored in the data storage unit 204 by the data registration unit 203. The data registration unit 203 queries the timing unit 202 for the time and registers that time information in the data storage unit 204 together with the data.

FIG. 6 shows a configuration example of the video content data DB 510. The data storage unit 204 manages video content data with the DB shown in the figure. The video content data DB 510 consists of an ID, a file location, a start time, and so on. The ID uniquely identifies a video in the DB. The file location indicates where the video is stored and may specify the storage location with a description such as a file path; for example, "0001.avi" is the video content data itself. The start time is the time at which storage of the video began, as supplied by the timing unit 202, and is recorded in, for example, UTC format. In the example in the figure, video content data with three IDs is stored (registered), which shows that videos for three conferences are stored in the DB.
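As a rough illustration of such a table, the following Python sketch builds an in-memory SQLite table with the three fields named above. The table and column names, the directory paths, the IDs other than "m1", and all start times are made-up placeholders, since the source describes only the fields and not their concrete values.

```python
import sqlite3

# In-memory database standing in for the video content data DB 510.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE video_content (
        id            TEXT PRIMARY KEY,  -- uniquely identifies a video
        file_location TEXT NOT NULL,     -- where the video file is stored
        start_time    TEXT NOT NULL      -- UTC time at which storage began
    )
    """
)

# Placeholder rows: one row per recorded conference (values are hypothetical).
rows = [
    ("m1", "/data/video/0001.avi", "2010-04-05T13:00:00Z"),
    ("m2", "/data/video/0002.avi", "2010-04-06T10:00:00Z"),
    ("m3", "/data/video/0003.avi", "2010-04-07T15:30:00Z"),
]
conn.executemany("INSERT INTO video_content VALUES (?, ?, ?)", rows)


def list_conferences(connection):
    """Return all registered videos ordered by start time."""
    cursor = connection.execute(
        "SELECT id, file_location, start_time "
        "FROM video_content ORDER BY start_time"
    )
    return cursor.fetchall()
```

Storing the start time as an ISO 8601 UTC string is one simple choice here because such strings sort chronologically.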

FIG. 7 shows a configuration example of the audio data DB 520. The data storage unit 204 manages audio data with the DB shown in the figure. The audio data DB 520 consists of an ID, a file location, a start time, and so on. The ID uniquely identifies an audio recording in the DB. The file location indicates where the audio is stored; for example, "0001.wav" is the audio data itself. The start time is the time at which storage of the audio began, as supplied by the timing unit 202. In the example in the figure, audio data with three IDs is stored (registered), which shows that audio for three conferences is stored in the DB.

FIG. 8 shows a configuration example of the whiteboard image DB 530. The data storage means 204 manages stored whiteboard images using the DB shown in the figure. The whiteboard image DB 530 consists of an ID, a file location, a start time, and so on. The ID uniquely identifies a whiteboard image in the DB. The file location indicates where the whiteboard image is stored. In this example, “w0001.jpg” is the whiteboard image data itself. The start time is the time at which storage of the whiteboard image began, as assigned by the time measuring means 202. Judging in particular from its time information (2010-04-05 13:10:33 onward), the whiteboard image in this example is whiteboard image data captured in the conference corresponding to ID “m1” in FIG. 6 and ID “a1” in FIG. 7.

FIG. 9 shows a configuration example of the slide image DB 540. The data storage means 204 manages stored slide images using the DB shown in the figure. The slide image DB 540 consists of an ID, a file location, a start time, and so on. The ID uniquely identifies a slide image in the DB. The file location indicates where the slide image is stored. In this example, “s0001.jpg” is the slide image data itself. The start time is the time at which storage of the slide image began, as assigned by the time measuring means 202. Judging in particular from its time information (2010-04-05 13:10:33 onward), the slide image in this example is slide image data captured in the conference corresponding to ID “m1” in FIG. 6 and ID “a1” in FIG. 7.
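The four input DBs described above share the same shape (ID, file location, start time), so one record type can model all of them. The following Python sketch is illustrative only: the `MediaRecord` type, the `find_recording` helper, the matching rule (latest recording that started at or before the image time), the file names other than “0001.avi” and “w0001.jpg”, and every video start time are assumptions that do not come from the figures.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass(frozen=True)
class MediaRecord:
    """One row of an input DB: ID, file location, start time."""
    id: str
    location: str
    start_time: datetime


def find_recording(recordings: List[MediaRecord], t: datetime) -> Optional[MediaRecord]:
    """Return the recording with the latest start time at or before t, if any."""
    candidates = [r for r in recordings if r.start_time <= t]
    return max(candidates, key=lambda r: r.start_time) if candidates else None


# Start times (and the second and third file names) are assumed for illustration.
videos = [
    MediaRecord("m1", "0001.avi", datetime(2010, 4, 5, 13, 0, 0)),
    MediaRecord("m2", "0002.avi", datetime(2010, 4, 6, 10, 0, 0)),
    MediaRecord("m3", "0003.avi", datetime(2010, 4, 7, 15, 0, 0)),
]
whiteboard = MediaRecord("w1", "w0001.jpg", datetime(2010, 4, 5, 13, 10, 33))

match = find_recording(videos, whiteboard.start_time)
print(match.id if match else None)  # prints "m1" under the assumed start times
```

Matching on time information alone mirrors how the text associates the whiteboard image with the conference of ID “m1”; a real system might also store an explicit conference ID.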

(Data analysis: S200)
The data input as described above is stored in the data storage means 204, and the conference recording system 1 performs data analysis on it. Data analysis includes, for example, dividing the video stream, calculating an importance for each scene, extracting text from the audio and identifying the speaker, analyzing whiteboards and slides to extract keywords, and calculating the importance of those keywords. The analysis results are stored in the analysis data storage means 206. The description below also refers to FIG. 5.

FIG. 10 shows a configuration example of the video scene DB 511, which holds scenes cut out from the video content data. The data analysis means 205 cuts out an image for each video scene from the video content data stream and stores it in the video scene DB 511. The video ID in FIG. 10 therefore corresponds to the ID in FIG. 6. For example, ID “si1” in FIG. 10 was cut out and extracted from “0001.avi” of ID “m1” in the video content data DB 510 of FIG. 6. The time period from which the video scene image was cut out is stored as time information.

FIG. 11 shows a configuration example of the scene change DB 512 extracted from the video scenes. The data analysis means 205 extracts, from the video scenes, those judged to be particularly important and stores them in the scene change DB 512. The video scene ID in FIG. 11 therefore corresponds to the ID in FIG. 10. For example, ID “sc1” in FIG. 11 was extracted because ID “si1” in the video scene DB 511 of FIG. 10 was judged important. The data analysis means 205 also calculates an importance for each video scene judged to represent a scene change and stores it in the DB together with the scene. The importance expresses how much information changed at the scene change; for a specific calculation method, reference can be made to, for example, Japanese Patent No. 4414254 by the present applicant.

FIG. 12 shows a configuration example of the speech recognition data DB 521 derived from the audio data. The data analysis means 205 performs speech recognition on the audio data stream, converts it to text, divides the text into predetermined phrases to create speech recognition data, and stores the result in the speech recognition data DB 521. The audio ID in FIG. 12 therefore corresponds to the ID in FIG. 7. For example, ID “t1” in FIG. 12 was extracted from “0001.wav” of ID “a1” in the audio data DB 520 of FIG. 7. The time information is calculated from the start time of the audio and the elapsed time, and indicates when the speech recognition data begins within the original audio data.
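The time information described above is the sum of the audio start time and the elapsed time within the recording. A minimal Python sketch follows; the function name `recognition_time`, the assumed audio start time of 13:00:00, and the 632-second offset are hypothetical values chosen so that the result matches the “2010-04-05 13:10:32” time of ID “t1”.

```python
from datetime import datetime, timedelta


def recognition_time(audio_start: datetime, elapsed_seconds: float) -> datetime:
    """Absolute time of a recognized phrase = audio start time + elapsed time."""
    return audio_start + timedelta(seconds=elapsed_seconds)


# Assumed values: neither the audio start time nor the offset is given in the text.
audio_start = datetime(2010, 4, 5, 13, 0, 0)
t1_time = recognition_time(audio_start, 632)
print(t1_time.isoformat(sep=" "))  # prints "2010-04-05 13:10:32"
```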

FIG. 13 shows a configuration example of the speaker DB 522 derived from the speech recognition data. The data analysis means 205 performs speech recognition processing on the audio data or the speech recognition data, identifies the speaker (speaker ID), and stores the result in the speaker DB 522. The audio ID in FIG. 13 therefore corresponds to the ID in FIG. 7. For example, ID “a1” in FIG. 13 indicates a speaker identified from “0001.wav” of ID “a1” in the audio data DB 520 of FIG. 7. The time information is calculated from the start time of the audio and the elapsed time, and indicates when the speech recognition data begins within the original audio data. One way to identify a speaker is to register in advance, in association with one another, the expected speakers' names, speaker IDs, photo images, voiceprints, and so on, and to match the audio data or speech recognition data against them.

FIG. 14 shows a configuration example of the keyword DB 550 derived from the speech recognition data, whiteboard images, and slide images. The keyword DB 550 consists of an ID, content, a keyword, an importance, time information, and so on. The ID uniquely identifies a keyword in the DB. The content field indicates the source from which the keyword was extracted and holds an ID from the speech recognition data DB, the whiteboard image DB, the slide image DB, or the like. The keyword field stores a keyword or key phrase extracted from the input data or from analysis data derived from it, that is, from speech recognition data, whiteboard images, or slide images. The importance field stores the computed keyword importance.

The data analysis means 205 extracts keywords from the speech recognition data, whiteboard images, and slide images, calculates the importance of each extracted keyword, and stores the results in the DB of the analysis data storage means 206. As for the extraction method, text is first obtained from whiteboard images and slide images by character recognition (OCR) and stored in the whiteboard image OCR DB 531 and the slide image OCR DB 541. Text from the audio data has already been obtained as speech recognition data and stored in the speech recognition data DB 521. These texts are decomposed into morphemes, an index of importance such as the TF-IDF value (TF: term frequency; IDF: inverse document frequency) is calculated for each morpheme, and morphemes whose index (importance) is at or above a certain level can be taken as keywords.
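The TF-IDF thresholding mentioned above can be sketched as follows. This is an illustration under assumptions, not the method of the embodiment: the texts are taken to be already split into tokens (morphological analysis and OCR are out of scope), TF is the relative frequency within one text, IDF is the natural logarithm of the number of texts divided by the number of texts containing the term, and the function name `tfidf_keywords`, the sample tokens, and the threshold 0.3 are all hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List


def tfidf_keywords(docs: List[List[str]], threshold: float) -> List[Dict[str, float]]:
    """For each tokenized text, return the terms whose TF-IDF score is >= threshold."""
    n_docs = len(docs)
    # Document frequency: number of texts in which each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    results = []
    for doc in docs:
        scores = {}
        for term, count in Counter(doc).items():
            tf = count / len(doc)                # relative frequency in this text
            idf = math.log(n_docs / df[term])    # 0 when the term is in every text
            score = tf * idf
            if score >= threshold:
                scores[term] = score
        results.append(scores)
    return results


# Hypothetical tokenized texts (e.g. one per speech phrase, whiteboard, or slide).
docs = [
    ["system", "configuration", "meeting"],
    ["schedule", "meeting"],
    ["budget", "meeting"],
]
keywords = tfidf_keywords(docs, threshold=0.3)
for scores in keywords:
    print(sorted(scores))
```

A term that appears in every text, such as “meeting” here, gets an IDF of zero and is never selected, which is the usual reason for preferring TF-IDF over raw frequency.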

In the figure, for example, keyword ID “k1” indicates that the keyword “System” was extracted from content “sl1” (a slide image) and that its importance is “0.4”. Keyword ID “k2” indicates that the keyword “Architecture” was extracted from content “w1” (a whiteboard image) and that its importance is “0.2”. Likewise, keyword ID “k3” indicates that the keyword “system configuration” was extracted from content “t1” (speech recognition data) and that its importance is “0.6”. The content of speech recognition data “t1” (FIG. 12) is “The first agenda item is the system configuration.”, and the keyword “system configuration”, which names the agenda item of this conference, is given a relatively high importance of “0.6”.
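The three example rows above can be written out as records to make the keyword DB layout concrete. In this Python sketch the `KeywordRecord` type and the `source_type` helper are hypothetical, and resolving the extraction source from the ID prefix (“sl”, “w”, “t”) is an assumption based only on the example IDs; the time information field is omitted because its values are not given for every row.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KeywordRecord:
    """One row of the keyword DB (time information omitted in this sketch)."""
    id: str
    content: str       # ID of the extraction source
    keyword: str
    importance: float


# Assumed convention: the ID prefix tells which DB the source belongs to.
SOURCE_BY_PREFIX = (
    ("sl", "slide image"),
    ("w", "whiteboard image"),
    ("t", "speech recognition data"),
)


def source_type(content_id: str) -> str:
    """Resolve the kind of extraction source from the content ID prefix."""
    for prefix, name in SOURCE_BY_PREFIX:
        if content_id.startswith(prefix):
            return name
    raise ValueError(f"unknown content ID: {content_id}")


rows = [
    KeywordRecord("k1", "sl1", "System", 0.4),
    KeywordRecord("k2", "w1", "Architecture", 0.2),
    KeywordRecord("k3", "t1", "system configuration", 0.6),
]
for row in rows:
    print(row.id, row.keyword, row.importance, source_type(row.content))
```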

(Data reading and display: S300)
In this step the user browses the video content of a conference. For example, on the display means of the conference recording system 1, the user selects one conference from the list of recorded conferences and performs an operation to display the contexts of its important scenes. In response, the conference recording system 1 extracts and displays the important contexts of the selected conference based on the analysis data in the analysis data storage means 206.

FIG. 15 is a flowchart explaining the data reading and display processing. The conference recording system 1 according to this embodiment starts the processing of the flowchart when the user selects one conference and performs the important-scene display operation. During this operation the user specifies whether the context subject to importance judgment is the video scene, the keyword, or both. The user also specifies how many frames (scenes) are to be displayed at once. The description here assumes that “both video scene and keyword” and “4 frames” have been specified.

S301: First, based on the conference selected by the user, the display data read control means 208 arranges the keywords in the keyword DB 550 that correspond to this conference in order of importance. The keyword DB 550 may already have been sorted in order of importance by the analysis data storage means 206.

S302: Similarly, based on the conference selected by the user, the display data read control means 208 arranges the video scenes in the video scene DB 511 that correspond to this conference in order of importance. The video scene DB 511 may already have been sorted in order of importance by the analysis data storage means 206.

S303: The display data read control means 208 acquires (reads out) a predetermined number of top-ranked keywords and video scenes as display data. The predetermined number is determined by the designated number of frames. Since the designated number of frames here is “4 frames”, the predetermined number is 4, and keywords and video scenes up to the fourth rank are acquired. As for the acquisition method, when the content subject to importance judgment is “keyword” (only), the four keywords with the highest importance are acquired. When it is “video scene” (only), the four video scenes with the highest importance are acquired. When it is “both video scene and keyword”, two of each may be acquired, that is, the two video scenes and the two keywords with the highest importance; if one type is to be emphasized, weighting may be used to give it priority. Alternatively, if the importance values of video scenes and keywords are normalized (standardized), the four items with the highest importance may be acquired from the combined pool of video scenes and keywords.
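The selection alternatives of S303 can be sketched in a few lines of Python. The function names, the mode labels, and the scene importance values are assumptions made for illustration; only the keyword IDs and importances (k1 to k3) come from the example in the text, and the weighted variant is not shown.

```python
from typing import List, Tuple

Item = Tuple[str, float]  # (ID, importance)


def top(items: List[Item], n: int) -> List[Item]:
    """Return the n items with the highest importance, highest first."""
    return sorted(items, key=lambda item: item[1], reverse=True)[:n]


def select_display_items(keywords: List[Item], scenes: List[Item],
                         frames: int, mode: str) -> List[Item]:
    """Pick display items for the designated number of frames."""
    if mode == "keyword":
        return top(keywords, frames)
    if mode == "scene":
        return top(scenes, frames)
    if mode == "both_split":      # half keywords, remainder scenes
        half = frames // 2
        return top(keywords, half) + top(scenes, frames - half)
    if mode == "both_merged":     # assumes importances are normalized
        return top(keywords + scenes, frames)
    raise ValueError(f"unknown mode: {mode}")


keywords = [("k1", 0.4), ("k2", 0.2), ("k3", 0.6)]
scenes = [("sc1", 0.9), ("sc2", 0.5), ("sc3", 0.1)]   # assumed importances

print(select_display_items(keywords, scenes, 4, "both_split"))
print(select_display_items(keywords, scenes, 4, "both_merged"))
```

With these values the split mode yields k3, k1, sc1, sc2, while the merged mode yields sc1, k3, sc2, k1. The merged mode is only meaningful when both importance scales are comparable, which is why the text makes normalization a precondition.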

S304: The display data read control means 208 also acquires, as display data, the contexts at the same time as each acquired keyword or video scene. Specifically, when a keyword is acquired on the basis of importance, that keyword was extracted from speech recognition data, a whiteboard image, or a slide image; the time information of this source context is identified, and the other contexts at the same time as the identified time information are acquired.

As a concrete example, suppose that in S303 the keyword “system configuration” was acquired because keyword ID “k3”, with importance “0.6”, fell within the predetermined number of top-ranked items. According to the keyword DB 550, the extraction source of this keyword is content “t1”, that is, ID “t1” of the speech recognition data (FIG. 12). The time information of speech recognition data ID “t1” is “2010-04-05 13:10:32”. Speech recognition data ID “t1” is therefore acquired as the context at the same time as this keyword. As the other contexts at the same time, the video scene at “2010-04-05 13:10:32”, the whiteboard image at “2010-04-05 13:10:32”, the slide image at “2010-04-05 13:10:32”, and the speaker as of “2010-04-05 13:10:32” are each acquired. The video scene can be read from the video scene DB 511, the whiteboard image from the whiteboard image DB 530 (or the whiteboard image OCR DB 531), the slide image from the slide image DB 540 (or the slide image OCR DB 541), and the speaker from the speaker DB 522. For the slide image DB 540 of FIG. 9, if there is no slide image at the same time, the most recent slide image is acquired instead (because a slide image is captured each time the slide changes).
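The fallback to the most recent slide image can be sketched as a sorted lookup. This Python example reads “most recent” as the latest record at or before the requested time, which is an assumption (nearest in time would be another reading); the helper name `latest_at_or_before`, the record IDs “slA” and “slB”, and their capture times are hypothetical.

```python
from bisect import bisect_right
from datetime import datetime
from typing import List, Optional, Tuple

Record = Tuple[datetime, str]  # (time information, record ID)


def latest_at_or_before(records: List[Record], t: datetime) -> Optional[str]:
    """Return the ID of the latest record whose time is <= t.

    records must be sorted by time in ascending order.
    """
    times = [time for time, _ in records]
    index = bisect_right(times, t)
    return records[index - 1][1] if index > 0 else None


# Hypothetical slide captures, one per slide change.
slides = [
    (datetime(2010, 4, 5, 13, 0, 10), "slA"),
    (datetime(2010, 4, 5, 13, 12, 0), "slB"),
]
keyword_time = datetime(2010, 4, 5, 13, 10, 32)
print(latest_at_or_before(slides, keyword_time))  # prints "slA"
```

The same helper would work for any context that is stored only when it changes, since the record captured last before the requested time is the one that was current at that time.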

FIG. 16 shows an example of the analysis data that is read out. Through S303 and S304 above, five contexts (elements) are extracted and read out as shown in the figure. These five contexts occurred at the same time during the conference and can be called a same-time context group. Since the designated number of frames to display at once is “4 frames” as described above, four same-time context groups are extracted and read out as display data in this manner.

S305: The display control means 207 causes the display means 209 to display the display data read out by the display data read control means 208. Refer again to FIG. 2. As shown there, for example, each same-time context group is laid out vertically in order of context type. For the same-time context group at “13:10:32” (FIG. 16), the context elements in the group are aligned vertically from the top in the order “video scene”, “speaker”, “whiteboard”, “slide”, “keyword”. The other three same-time context groups are aligned vertically in the same way. The vertically aligned same-time context groups are then arranged along the horizontal axis in chronological order. Because these contexts were extracted on the basis of importance, the context that served as the basis for the extraction is highlighted.
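The layout of S305 amounts to a grid with one column per same-time context group (in chronological order) and one row per context type. The Python sketch below builds such a grid as plain text; the `build_grid` name, the bracket notation for highlighting, the dictionary layout, the speaker labels, and the second group are all assumptions made for illustration.

```python
from typing import Dict, List

ROW_ORDER = ["video scene", "speaker", "whiteboard", "slide", "keyword"]


def build_grid(groups: List[Dict]) -> List[List[str]]:
    """Rows follow ROW_ORDER; columns are groups sorted by time."""
    columns = sorted(groups, key=lambda group: group["time"])
    grid = []
    for kind in ROW_ORDER:
        row = []
        for group in columns:
            cell = group["contexts"].get(kind, "")
            if cell and kind == group["basis"]:
                cell = f"[{cell}]"   # mark the context the extraction was based on
            row.append(cell)
        grid.append(row)
    return grid


groups = [
    {"time": "13:10:32", "basis": "keyword",
     "contexts": {"video scene": "si1", "speaker": "speaker A", "whiteboard": "w1",
                  "slide": "sl1", "keyword": "system configuration"}},
    {"time": "13:02:10", "basis": "video scene",   # hypothetical second group
     "contexts": {"video scene": "si0", "speaker": "speaker B",
                  "slide": "sl0", "keyword": "agenda"}},
]
for kind, row in zip(ROW_ORDER, build_grid(groups)):
    print(f"{kind:12} | " + " | ".join(row))
```

Times are compared as zero-padded strings here, which sorts correctly only within a single day; a fuller version would use full timestamps.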

(Scene selection and video playback: S400)
In this step, several important contexts are displayed on the display means of the conference recording system 1, and the user can select one of the video scenes and play it back. Referring again to FIG. 2, the user selects the video scene to be played (or another content item in the same group) and presses the “Play” button. The playback means 211 acquires the time information of the selected video scene, identifies the file corresponding to the video scene (FIG. 6), and starts playback of that file from the acquired time. In the example so far, the video file “0001.avi” of this conference is played back, video and audio included, from “13:10:32” (the time information) on its time stream.

[Summary]
As described above, with the conference recording system 1 according to this embodiment, when video content data is reviewed, only the items of high importance among the video content data and the other input data are listed along the time axis. Among the displayed contexts, those of high importance are shown with highlighting (including color coding) or the like to make them easier for the user to see. Other contexts that occurred in parallel at the same time in the conference scene are displayed as well. The user can thus recall and review a conference scene not only from the video but also from several other elements. In addition, because the analyzed keywords are attached to the display, the user can grasp at a glance, in the form of keywords, the key points at the time of each video scene, along with the highly important video and contexts (speaker, whiteboard, slide).

In other words, the embodiment described above makes it possible to provide a content browsing device and the like with which important contexts and their key points can be easily grasped when reviewing information that has time information, such as video content.

The present invention has been described on the basis of the embodiments, but it is not limited to the requirements shown here, such as combinations of the elements given in the above embodiments with other elements. These points can be changed without departing from the gist of the present invention and can be determined appropriately according to the form of application. Applying the constituent elements or expressions of the present invention, or any combination of the constituent elements, to a method, a device, a system, a computer program, a recording medium, or the like is also effective as an aspect of the present invention.

1 Conference support system
101 CPU
102 ROM
103 RAM
104 Auxiliary storage device
105 Storage medium reading device
106 Input device
107 Display device
108 Communication device
109 Interface device
201 Data input means
202 Time measuring means
203 Data registration means
204 Data storage means
205 Data analysis means
206 Analysis data storage means
207 Display control means
208 Display data read control means
209 Operation means
210 Display means
211 Playback means
510 Video content data DB
511 Video scene DB
512 Scene change DB
520 Audio data DB
521 Speech recognition data DB
522 Speaker DB
530 Whiteboard image DB
531 Whiteboard image OCR DB
540 Slide image DB
541 Slide OCR image DB
550 Keyword DB

Japanese Patent No. 3185505

Claims

A content browsing device for displaying content on a display means,
Data storage means having at least one of video content, audio linked to the video content on a time axis, and a still image;
Analysis data in which a video scene extracted from the video content is stored, and a keyword extracted from at least one of the audio and the still image is associated with an importance level analyzed for each keyword Storage means;
The keyword with the highest degree of importance is read from the analysis data storage means, and the voice or still image from which the read keyword is extracted and on the same time axis as the voice or still image Read control means for reading out the linked video scene as display data;
Display control means for displaying, from the display data read by the read control means, a group of a video scene, audio and/or a still image, and a keyword linked on the same time axis, and, when there are a plurality of the groups, arranging the groups in time series and displaying them on the display means;
A content browsing apparatus comprising:

A second analysis data storage means in which a video scene extracted from the video content and an importance level analyzed for each video scene are stored in association with each other;
A predetermined number of video scenes with higher importance are read from the second analysis data storage means, and audio and still images linked on the same time axis as the read video scene are displayed as second display data. Second read control means for reading as:
Second display control means for displaying, from the second display data read by the second read control means, a group of a video scene and audio and/or a still image linked on the same time axis, and, when there are a plurality of the groups, arranging the groups in time series and displaying them on the display means,
The content browsing apparatus according to claim 1, further comprising:

The display control means emphasizes and displays the voice or still image that is the extraction source,
The second display control means emphasizes and displays the predetermined number of video scenes with the highest importance;
The content browsing apparatus according to claim 2.

A content display method in a content browsing apparatus for displaying content on a display means,
The content browsing device
Data storage means having at least one of video content, audio linked to the video content on a time axis, and a still image;
Analysis data in which a video scene extracted from the video content is stored, and a keyword extracted from at least one of the audio and the still image is associated with an importance level analyzed for each keyword Storage means;
With
The keyword with the highest degree of importance is read from the analysis data storage means, and the voice or still image from which the read keyword is extracted and on the same time axis as the voice or still image A readout control procedure for reading out linked video scenes as display data;
A display control procedure for displaying, from the display data read by the readout control procedure, a group of a video scene, audio and/or a still image, and a keyword linked on the same time axis, and, when there are a plurality of the groups, arranging the groups in time series and displaying them on the display means,
A content display method characterized by comprising:

The content browsing device
A second analysis data storage means in which a video scene extracted from the video content and an importance level analyzed for each video scene are stored in association with each other;
With
A predetermined number of video scenes with higher importance are read from the second analysis data storage means, and audio and still images linked on the same time axis as the read video scene are displayed as second display data. A second reading control procedure to read as
A second display control procedure for displaying, from the second display data read out by the second readout control procedure, a group of a video scene and audio and/or a still image linked on the same time axis, and, when there are a plurality of the groups, arranging the groups in time series and displaying them on the display means;
The content display method according to claim 4, further comprising:

The display control procedure emphasizes and displays the voice or still image that is the extraction source,
The second display control procedure is to emphasize and display the predetermined number of video scenes with higher importance.
The content display method according to claim 5.

A content display program for causing a computer to execute the content display method according to claim 4.