US20140142941A1 - Generation of timed text using speech-to-text technology, and applications thereof - Google Patents
- Publication number
- US20140142941A1
- Authority
- US
- United States
- Prior art keywords
- video
- text
- audio
- transcription
- timed text
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Abandoned
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/26—Speech to text systems
-
- G—PHYSICS
- G11—INFORMATION STORAGE
- G11B—INFORMATION STORAGE BASED ON RELATIVE MOVEMENT BETWEEN RECORD CARRIER AND TRANSDUCER
- G11B27/00—Editing; Indexing; Addressing; Timing or synchronising; Monitoring; Measuring tape travel
- G11B27/10—Indexing; Addressing; Timing or synchronising; Measuring tape travel
- G11B27/102—Programmed access in sequence to addressed parts of tracks of operating record carriers
- G11B27/105—Programmed access in sequence to addressed parts of tracks of operating record carriers of operating discs
-
- G—PHYSICS
- G11—INFORMATION STORAGE
- G11B—INFORMATION STORAGE BASED ON RELATIVE MOVEMENT BETWEEN RECORD CARRIER AND TRANSDUCER
- G11B27/00—Editing; Indexing; Addressing; Timing or synchronising; Monitoring; Measuring tape travel
- G11B27/10—Indexing; Addressing; Timing or synchronising; Measuring tape travel
- G11B27/19—Indexing; Addressing; Timing or synchronising; Measuring tape travel by using information detectable on the record carrier
- G11B27/28—Indexing; Addressing; Timing or synchronising; Measuring tape travel by using information detectable on the record carrier by using information signals recorded by the same method as the main recording
-
- G—PHYSICS
- G11—INFORMATION STORAGE
- G11B—INFORMATION STORAGE BASED ON RELATIVE MOVEMENT BETWEEN RECORD CARRIER AND TRANSDUCER
- G11B27/00—Editing; Indexing; Addressing; Timing or synchronising; Monitoring; Measuring tape travel
- G11B27/10—Indexing; Addressing; Timing or synchronising; Measuring tape travel
- G11B27/34—Indicating arrangements
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/28—Constructional details of speech recognition systems
- G10L15/30—Distributed recognition, e.g. in client-server systems, for mobile phones or network applications
Definitions
- the present field generally relates to captioning web video.
- Video is increasingly being accessed by remote users over networks using web video services, such as the YOUTUBE service made available by Google Inc.
- the rise of the World Wide Web, including various web applications, protocols, and related networking and computing technologies has made it possible for remote users to view and to play video.
- Timed text, such as captions or subtitles, is sometimes provided with video content and is “timed” so that certain text appears in association with certain portions of a video content.
- Timed text can serve a number of purposes. First, timed text can make the dialogue understandable to the hearing impaired. Second, timed text can make the video understandable in environments where audio is unavailable or not permitted. Third, timed text can provide commentary to video with educational or entertainment value. Finally, timed text can translate the audio for those who do not understand the language of the dialogue.
- manual transcription of timed text can be expensive and time-consuming.
- Embodiments relate to generation of timed text in web video.
- a computer-implemented method generates timed text for online video.
- a request to play a timed text track of a video incorporated into a web video service is received from a client computing device.
- audio of the video is processed to determine intermediate timed text data.
- the intermediate timed text data lacks a complete text transcription of the audio, but includes data to enable the complete text transcription to be generated when playing the video.
- a text transcription of the audio is determined using the intermediate data.
- the text transcription of the audio is sent to the client computing device for display along with the video.
- a system generates timed text for online video.
- the system includes a timed text player module that receives, from a client computing device, a request to play a video incorporated into a web video service.
- a preprocessor module processes audio of the video to determine intermediate timed text data.
- the intermediate timed text data lacks a complete text transcription of the audio, but includes sufficient data to enable the complete text transcription to be generated when playing the video.
- a text generator module determines a text transcription of the audio using the intermediate data.
- the timed text player module sends the text transcription of the audio to the client computing device for display along with the video.
- a computer-implemented method generates timed text for online video.
- a transcript, input by a user, to incorporate into a web video service for a video is received.
- a quality value of the transcript of the video is determined. The quality value represents how closely the transcript specifies audio for the video.
- a determining step determines time codes indicating when to display respective portions of the transcript to align the transcript with the audio of the video. The time codes are provided to a client computing device to display the transcript along with the video.
- a system generates timed text for online video.
- the system includes a timed text module that enables a user to input a transcript to incorporate into a web video service for a video.
- a transcript alignment module determines a quality of the transcript of the video, the quality representing how closely the transcript specifies audio for the video.
- the transcript alignment module also determines, when the transcript is of sufficient quality, time codes indicating when to display respective portions of the transcript to align the transcript with the audio of the video.
- a timed text player module provides the time codes determined by the transcript alignment module to a client computing device to display the transcript along with the video.
- FIG. 1 is a diagram illustrating a system for generating timed text for web video according to an embodiment.
- FIG. 2 is a diagram illustrating the system for generating timed text in FIG. 1 in greater detail.
- FIG. 3 is a flowchart illustrating a method for uploading web video according to an embodiment.
- FIG. 4 is a flowchart illustrating a method for generating timed text for web video according to an embodiment.
- FIGS. 5-10 are diagrams illustrating an example user interface that may be used in the system of FIG. 1 .
- This description relates to generating timed text for web video using a real-time algorithm.
- a video is preprocessed to determine intermediate data that is sufficient to generate timed text in real-time. Examples of such intermediate data are further described below.
- the intermediate data enables the captions to be generated in real-time, but does not include the text transcription of the video.
- a real-time recognition algorithm is used to generate a text transcription of the web video in real-time based on the intermediate data.
- captions may be automatically generated without having to store the text transcription.
- the timed text may be translated into other languages, such as the user's language.
- the timed text may be customized in other ways. For example, the timed text may be reformatted as needed for the specific device.
- FIG. 1 is a diagram illustrating a system 100 for generating timed text for web video according to an embodiment.
- System 100 includes a timed text server 110 , a speech recognition server 120 , a video server 130 , and a client computing device 140 coupled via one or more networks 106 , such as the Internet, an intranet or any combination thereof.
- Timed text server 110 is coupled to a timed text database 102 .
- Speech recognition server 120 and video server 130 are both coupled to video database 104 .
- system 100 may operate as follows to upload a new video into a web video service.
- a user of client computing device 140 may have a video to incorporate into the web video service.
- Client computing device 140 may send the video to video server 130 .
- video server 130 may store the video in video database 104 and conduct some in-processing on the video.
- video server 130 may request that timed text server 110 generate timed text data.
- Timed text server 110 may communicate with speech recognition server 120 to generate intermediate timed text data sufficient to generate captions in real-time, but that does not include a text transcription of the video.
- Timed text server 110 may store the intermediate timed text data in timed text database 102 for later use to generate the text transcription for timed text when the video is played.
- Client computing device 140 may include a browser 150 .
- browser 150 receives an HTTP response containing a file.
- the file may be encoded in hypertext markup language (HTML) or FLASH file format made available by Adobe, Inc. of California.
- the browser may interpret the file to instantiate a timed text editor module 152 .
- Timed text editor module 152 enables users to upload and manage timed text tracks for their videos incorporated into the web video service.
- timed text editor module 152 may enable a user to request that timed text data be created for a video with no existing timed text data.
- timed text editor module 152 may also enable users to upload new videos and create timed text tracks for the new videos. Screenshots illustrating the operation of an example timed text editor module 152 are presented in FIGS. 6-9 . Timed text editor module 152 may use remote procedure calls (RPC) or HTTP requests to communicate with video server 130 .
- Video server 130 includes a video uploader module 132 and may host videos for a web video service.
- video uploader module 132 may store the new video into video database 104 to incorporate the new video into the web video service. With the new video stored in video database 104 , other users may request to view the video, and video server 130 may stream the video to their video player for display.
- video uploader module 132 may conduct some initial processing on the new video.
- video uploader module 132 may invoke a timed text generator module 112 to generate timed text data for the video.
- timed text generator module 112 is located on timed text server 110 .
- video uploader module 132 may invoke timed text generator module 112 using a remote procedure call.
- timed text generator module 112 may be located on video server 130 , and video uploader module 132 may call timed text generator module 112 with a direct function call.
- timed text generator module 112 may obtain the audio track of a video and package the audio track as a request to speech recognition server 120 to generate timed text data.
- the request may include an identifier identifying a video in video database 104 , instead of including the entire audio track.
- speech recognition server 120 may retrieve the audio track from video database 104 .
- a preprocessor module 122 included within speech recognition server 120 may determine timed text data.
- preprocessor module 122 may determine intermediate data that is sufficient to generate timed text when the video is played, but does not include a text transcription of the video.
- intermediate data may be a partial representation of the timed text.
- the intermediate format generated by preprocessor module 122 may include a list of timestamped segments. Each segment may identify a portion of the audio track with, for example, a timestamp for the beginning of the portion and a timestamp for the end of the portion. Each segment may provide information about that portion of the audio track without including the transcribed words. Further, the information may be independent of the language of the speech in the audio.
- the intermediate format may describe a portion of the audio track by identifying a type of sound included in the portion of the track. For example, the segment may have a sound type field with values that identify whether the sound in its portion is either “speech”, “noise”, “music”, or other types of sound known to those of skill in the art. Preprocessor module 122 may identify the different types of sound and generate the intermediate data using speech recognition algorithms known to those of skill in the art. In other embodiments, the sound type field may be omitted.
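- The segment list described above can be sketched as a small data structure. This is only an illustration under stated assumptions: the class name `Segment`, its field names, and the helper `speech_segments` are hypothetical and do not appear in the description, which leaves the concrete layout open.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Segment:
    """One timestamped portion of the audio track (hypothetical layout)."""
    start_s: float                    # timestamp for the beginning of the portion
    end_s: float                      # timestamp for the end of the portion
    sound_type: Optional[str] = None  # e.g. "speech", "noise", "music"; may be omitted


def speech_segments(segments: List[Segment]) -> List[Segment]:
    """Return the segments that may contain speech.

    Segments without a sound type are kept, because the field is optional
    and its absence says nothing about the content.
    """
    return [s for s in segments if s.sound_type in (None, "speech")]


# Example intermediate data: no transcribed words are stored anywhere.
intermediate = [
    Segment(0.0, 4.25, "music"),
    Segment(4.25, 9.5, "speech"),
    Segment(9.5, 11.0, "noise"),
    Segment(11.0, 15.5),  # sound type omitted
]
to_transcribe = speech_segments(intermediate)
```

- Note that the structure carries only timing and sound-type information, so it stays independent of the language spoken in the audio.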
- a specific example of intermediate data would use a Google “protocol buffer” to marshal the binary data generated by the speech recognition process.
- Examples of such data would be groups of tentative words combined with timestamps to mark the time of occurrence and their duration, along with word tokens each of which can have an associated probability estimate. There may be one such token with high probability, or several of lesser probability.
- In the final process of converting this intermediate data to captions for the video player, the word tokens must be correctly ordered based on timestamps, selected based on their confidence levels, divided into groups to form captions of readable length and duration, translated or corrected based on natural language processing, and re-formatted according to the needs of the specific client (e.g. plain text for some, variants of XML or HTML for others).
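- The ordering, selection, and grouping steps of that conversion can be sketched as follows. This is a minimal sketch, not the described implementation: the names `WordToken`, `pick_best`, and `group_captions`, and the limits `max_words` and `max_duration_s`, are assumptions chosen for illustration. Translation, correction, and client-specific formatting are left out.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class WordToken:
    text: str
    start_s: float      # time of occurrence
    duration_s: float
    probability: float  # probability estimate associated with the token


Caption = Tuple[float, float, str]  # (start, end, text)


def pick_best(alternatives: List[List[WordToken]]) -> List[WordToken]:
    """Select the most probable token from each group, then order by timestamp."""
    best = [max(group, key=lambda t: t.probability) for group in alternatives if group]
    return sorted(best, key=lambda t: t.start_s)


def _close(words: List[WordToken]) -> Caption:
    """Turn a run of words into one caption spanning first start to last end."""
    end = words[-1].start_s + words[-1].duration_s
    return (words[0].start_s, end, " ".join(w.text for w in words))


def group_captions(words: List[WordToken], max_words: int = 4,
                   max_duration_s: float = 3.0) -> List[Caption]:
    """Divide ordered words into captions of bounded length and duration."""
    captions: List[Caption] = []
    current: List[WordToken] = []
    for w in words:
        if current:
            span = (w.start_s + w.duration_s) - current[0].start_s
            if len(current) >= max_words or span > max_duration_s:
                captions.append(_close(current))
                current = []
        current.append(w)
    if current:
        captions.append(_close(current))
    return captions


# Tentative words, deliberately out of order and with competing alternatives.
alternatives = [
    [WordToken("text", 0.5, 0.5, 0.9)],
    [WordToken("timed", 0.0, 0.5, 0.6), WordToken("time", 0.0, 0.5, 0.3)],
    [WordToken("for", 1.0, 0.25, 0.8)],
    [WordToken("video", 1.25, 0.5, 0.7), WordToken("radio", 1.25, 0.5, 0.2)],
    [WordToken("captions", 5.0, 0.5, 0.9)],
]
captions = group_captions(pick_best(alternatives))
```

- With these sample tokens the result is two captions: "timed text for video" followed by "captions".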
- the intermediate data may be stored in timed text database 102 for later use to generate the text transcription for timed text when the new video is played.
- the intermediate data may be used to automatically generate a text transcription.
- preprocessor module 122 enables text transcription to be generated in real-time when the request is made. Further, generating an intermediate format in advance, instead of the text itself, may obviate the need to store the text transcription. When text transcription is determined in real-time, the quality of the transcription may improve as the automated speech recognition algorithms improve. As a further benefit of generating timed text at request time, the timed text may be translated into other languages, such as the user's language.
- FIG. 2 is a diagram illustrating system 100 for generating timed text in greater detail.
- FIG. 2 shows a timed text player module 214 included within timed text server 110 ; a text generator module 224 and a transcript alignment module 226 included within speech recognition server 120 ; and a player module 154 included within browser 150 .
- Player module 154 enables a user to play a video and a corresponding timed text track.
- player module 154 may be instantiated by a browser plug-in using a FLASH file.
- Player module 154 may stream video over the Internet to display to a user.
- Player module 154 may also include various controls, for example, conventional video controls as well as other timed text controls to view a timed text track.
- the timed text controls may list which timed text tracks are available.
- the timed text controls may also indicate whether an automated speech recognition track is available (e.g. whether the intermediate data to generate an automated transcription is available).
- timed text controls may include an indication that the track is automatically generated and may be of lower quality than manually generated tracks.
- a screenshot of an example player module 154 is illustrated in FIG. 10 .
- player module 154 may request a tracklist from timed text server 110 .
- Timed text server 110 may return the tracklist in, for example, XML format.
- the XML tracklist may include an attribute to indicate whether a timed text track is automatically or manually generated.
- player module 154 may enable a user to request an automatically generated timed text track.
- player module 154 may be configured always to request timed text.
- player module 154 may request an automatically generated timed text track if no manually generated timed text tracks are available.
- player module 154 may send an HTTP request with a special parameter to timed text server 110 .
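- The track selection behavior described above can be sketched on the client side as follows. The description does not give the XML schema, the server URL, or the name of the special parameter, so the element name `track`, the attributes `id`, `lang`, and `auto`, the host `example.invalid`, and the query parameters are all hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import Optional
from urllib.parse import urlencode

# Hypothetical tracklist as it might be returned by the timed text server.
TRACKLIST_XML = """
<tracklist>
  <track id="0" lang="en" auto="false"/>
  <track id="1" lang="en" auto="true"/>
</tracklist>
"""


def choose_track(xml_text: str) -> Optional[dict]:
    """Prefer a manually generated track; fall back to an automatic one."""
    tracks = ET.fromstring(xml_text.strip()).findall("track")
    manual = [t for t in tracks if t.get("auto") != "true"]
    automatic = [t for t in tracks if t.get("auto") == "true"]
    chosen = manual[0] if manual else (automatic[0] if automatic else None)
    return None if chosen is None else dict(chosen.attrib)


def track_request_url(base_url: str, video_id: str, track: dict) -> str:
    """Build the request URL, adding a special parameter for automatic tracks."""
    params = {"v": video_id, "track": track["id"]}
    if track.get("auto") == "true":
        params["auto"] = "1"  # hypothetical "special parameter"
    return base_url + "?" + urlencode(params)


selected = choose_track(TRACKLIST_XML)
```

- A player configured always to request timed text could call `choose_track` once per video and skip the request only when it returns `None`.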
- Timed text server 110 includes a timed text player module 214 that receives the request from player module 154 .
- Timed text player module 214 may retrieve the intermediate data from timed text database 102 .
- Timed text generator module 112 then uses the intermediate data to determine a text transcription for the video.
- timed text generator module 112 may use only the intermediate data to determine a text transcription for the video.
- timed text generator module 112 may use both the intermediate data and the audio of the video.
- Timed text generator module 112 may generate a list of words timestamped at specific times in the video. Then, the words may be combined such that a group of words are displayed during a given time period. To determine the words, timed text generator module 112 may use the intermediate data determined prior to the request. The intermediate data may enable timed text generator module 112 to generate the text transcription more quickly. Using the intermediate data, timed text generator module 112 generates a timed text track describing the speech of the video. The timed text track may define a series of groups of words with each group of words having a corresponding period of time to be displayed in the video.
- Once timed text generator module 112 generates the timed text track, timed text generator module 112 sends the timed text track to timed text player module 214 in timed text server 110 . Timed text player module 214 then sends the timed text track to player module 154 for display with the video. In another embodiment, the timed text track may be sent to player module 154 in band with the video. For example, video server 130 may request text from timed text server 110 and combine the text with the video stream that it sends to client device 140 .
- a transcript may be available for a video. But, the transcript may not have time codes to indicate when to display the text in the video.
- timed text editor module 152 may enable a user to upload a transcript.
- speech recognition server 120 may include a transcript alignment module 226 that determines when different portions of the transcript should be displayed in the video.
- Transcript alignment module 226 may determine time codes corresponding to different portions of the transcript text. The time codes define when to display the text in the video and correspond to when the text is spoken in the audio track. When transcript alignment module 226 aligns the text with the video, transcript alignment module 226 may evaluate the quality of the transcript. If the transcript quality is good, the resulting transcript may be saved in timed text database 102 to be played with the video. If the transcript quality is poor, then the resulting transcript may not be available to be played with the video. Alternatively, transcript alignment module 226 may evenly distribute the transcript text over the duration of the video.
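- Two pieces of this behavior lend themselves to a short sketch: scoring how closely a transcript matches the audio, and the fallback of spreading the transcript evenly over the video. The description does not say how quality is computed, so the word-level similarity below (Python's `difflib.SequenceMatcher` applied to the transcript and to a recognizer's output) is a stand-in, and the function names are hypothetical.

```python
from difflib import SequenceMatcher
from typing import List, Tuple


def transcript_quality(transcript_words: List[str], recognized_words: List[str]) -> float:
    """Stand-in quality value in [0, 1]: similarity between the uploaded
    transcript and the words a recognizer heard in the audio."""
    return SequenceMatcher(None, transcript_words, recognized_words).ratio()


def distribute_evenly(lines: List[str], duration_s: float) -> List[Tuple[float, float, str]]:
    """Fallback: give every transcript line an equal share of the video."""
    if not lines:
        return []
    step = duration_s / len(lines)
    return [(i * step, (i + 1) * step, line) for i, line in enumerate(lines)]


quality = transcript_quality(
    "hello world this is video".split(),
    "hello word this is video".split(),
)
fallback = distribute_evenly(["First line", "Second line"], 10.0)
```

- In the sample, four of five words match on each side, so the ratio is 0.8. Where to set the threshold for "good" quality is a design choice the description leaves open.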
- Each of timed text server 110 , speech recognition server 120 , video server 130 , and client computing device 140 may be implemented on any type of computing device.
- Such computing device can include, but is not limited to, a personal computer, mobile device such as a mobile phone, workstation, embedded system, game console, television, set-top box, or any other computing device.
- a computing device can include, but is not limited to, a device having a processor and memory for executing and storing instructions.
- Software may include one or more applications and an operating system.
- Hardware can include, but is not limited to, a processor, memory and graphical user interface display.
- the computing device may also have multiple processors and multiple shared or separate memory components.
- the computing device may be a clustered computing environment or server farm.
- Each of timed text database 102 and video database 104 may be any type of structured memory, including a persistent memory.
- each database may be implemented as a relational database.
- Each of timed text player module 214 , timed text generator module 112 , preprocessor module 122 , text generator module 224 , transcript alignment module 226 , video uploader module 132 , browser 150 , player module 154 , and timed text editor module 152 may be implemented in hardware, software, firmware or any combination thereof.
- Timed text server 110 may include a web server.
- a web server is a software component that responds to a hypertext transfer protocol (HTTP) request with an HTTP response.
- the web server may be, without limitation, Apache HTTP Server, Apache Tomcat, MICROSOFT® Internet Information Server, JBoss Application Server, WEBLOGIC Application Server®, or SUN Java™ System Web Server.
- the web server may serve content such as hypertext markup language (HTML), extensible markup language (XML), documents, videos, images, multimedia features, MACROMEDIA Flash programs, or any combination thereof. These examples are strictly illustrative and do not limit the present invention.
- FIG. 3 is a flowchart illustrating a method 300 for uploading a web video according to an embodiment.
- Method 300 may be used in operation of system 100 in FIGS. 1-2 .
- method 300 is described with respect to components of system 100 , but it is not limited thereto. A person of skill in the art given this description would recognize additional applications of method 300 .
- Method 300 begins with client computing device 140 sending a new video to video server 130 at step 302 .
- video server 130 stores the new video in a video database.
- video server 130 sends a request to speech recognition server 120 to pre-process timed text for the new video at step 306 .
- speech recognition server 120 determines intermediate timed text data for the video at step 308 .
- the intermediate data may not include any transcribed text or complete transcribed text, but may include sufficient data to transcribe the text in real time when playing the video.
- speech recognition server 120 may send the intermediate data to timed text server 110 at step 310 .
- timed text server 110 stores the intermediate data in timed text database 102 .
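- The upload path of method 300 can be sketched with in-memory stand-ins for the two databases. Everything here is illustrative: the dictionaries `video_db` and `timed_text_db`, the function `upload_video`, and the placeholder `fake_preprocess` are hypothetical, and a real preprocessor would analyze the audio rather than return a fixed segment.

```python
from typing import Callable, Dict, List

# In-memory stand-ins for video database 104 and timed text database 102.
video_db: Dict[str, bytes] = {}
timed_text_db: Dict[str, List[dict]] = {}


def upload_video(video_id: str, video: bytes,
                 preprocess: Callable[[bytes], List[dict]]) -> List[dict]:
    """Store the video, preprocess its audio, and store only intermediate data."""
    video_db[video_id] = video               # video server stores the new video
    intermediate = preprocess(video)         # speech recognition server, step 308
    timed_text_db[video_id] = intermediate   # timed text server stores the result
    return intermediate


def fake_preprocess(video: bytes) -> List[dict]:
    """Placeholder preprocessor: returns one speech segment and no words."""
    return [{"start_s": 0.0, "end_s": 2.0, "sound_type": "speech"}]


upload_video("vid-1", b"\x00\x01", fake_preprocess)
```

- After the call, `timed_text_db` holds segments for the video but no transcribed text, which mirrors the point that the transcription is produced only when the video is played.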
- FIG. 4 is a flowchart illustrating a method 400 for generating timed text for web video according to an embodiment.
- Method 400 may be used in operation of system 100 in FIGS. 1-2 .
- method 400 is described with respect to components of system 100 , but it is not limited thereto. A person of skill in the art given this description would recognize additional applications of method 400 .
- Method 400 begins with client computing device 140 requesting an automatically generated timed text track from timed text server 110 at step 402 .
- timed text server 110 retrieves intermediate timed text data corresponding to the video, and a speech recognition algorithm is used to generate a timed text track based on the intermediate data received from timed text server 110 at step 406 .
- Any suitable speech recognition algorithm can be used depending upon a particular application or design need.
- the timed text track may define a series of groups of words with each group of words having a corresponding period of time to be displayed in the video. Speech recognition server 120 sends the timed text track to timed text server 110 .
- timed text server 110 sends the timed text track on to client computing device 140 .
- Client computing device 140 displays the timed text track to a user along with the video at step 408 .
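- The request path of method 400 can be sketched as a handler that looks up the stored intermediate data, generates the track on demand, and formats it for the requesting client. The function names, the plain-text layout, and the XML element names `timedtext` and `text` are assumptions made for this sketch; the generation step is passed in as a callable because the description allows any suitable speech recognition algorithm.

```python
from typing import Callable, Dict, List, Tuple
from xml.sax.saxutils import escape

Caption = Tuple[float, float, str]  # (start, end, text)


def format_track(track: List[Caption], client_format: str = "text") -> str:
    """Serialize a timed text track as plain text or as a simple XML document."""
    if client_format == "xml":
        rows = [
            f'<text start="{s:.2f}" end="{e:.2f}">{escape(t)}</text>'
            for s, e, t in track
        ]
        return "<timedtext>" + "".join(rows) + "</timedtext>"
    return "\n".join(f"[{s:.2f}-{e:.2f}] {t}" for s, e, t in track)


def handle_track_request(video_id: str, store: Dict[str, list],
                         generate: Callable[[list], List[Caption]],
                         client_format: str = "text") -> str:
    """Retrieve intermediate data, generate the track now, and format it."""
    intermediate = store[video_id]
    track = generate(intermediate)  # nothing is cached or stored afterwards
    return format_track(track, client_format)


# Hypothetical store and a trivial generator used only to exercise the handler.
store = {"vid-1": [{"start_s": 0.0, "end_s": 1.5}]}
response = handle_track_request(
    "vid-1", store,
    lambda segs: [(seg["start_s"], seg["end_s"], "a & b") for seg in segs],
    client_format="xml",
)
```

- Because the track is rebuilt on every request, swapping in an improved `generate` function changes what viewers see without reprocessing stored data.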
- FIGS. 5-10 are diagrams illustrating an example user interface that may be used in the system of FIG. 1 .
- these user interfaces and accompanying display screens may be implemented using browser 150 on client computing device 140 or on any other remote client device with a browser.
- FIG. 5 shows a screenshot 500 of an example user interface to manage a video with no timed text tracks.
- Screenshot 500 shows a video 502 in a video player.
- a frame 504 indicates that no timed text tracks are presently available for video 502 .
- a button 506 when pressed, navigates a user to an interface to add a manually generated timed text track or known transcript.
- the interface to add a manually generated timed text track or known transcript is described in more detail below with respect to FIG. 9 .
- Screenshot 500 also includes a button 508 that, when pressed, requests that timed text data (either intermediate data or a complete timed text transcription) be generated. When pressed, button 508 also navigates a user to an interface as illustrated in FIG. 6 .
- FIG. 6 shows a screenshot 600 of an example user interface to manage a video with an automatically generated timed text track shown at frame 604 .
- frame 604 may appear when intermediate data has been generated to enable speech-to-text transcription in real-time when the video is played.
- frame 604 may appear when a complete text transcription has been generated.
- FIG. 7 shows a screenshot 700 of an example user interface to manage a video with multiple timed text tracks.
- Frame 704 shows a listing of multiple timed text tracks.
- Each listed manually generated timed text track includes a checkbox to enable display of the track to users (such as a checkbox 712 ), a button to navigate to an interface to adjust settings for the track (such as a button 706 ), a button to download the track (such as a button 708 ), and a button to remove the track (such as a button 710 ).
- Each listed timed text track generated with an automated speech to text algorithm may include a button to reprocess the video (such as a button 702 ). Reprocessing the video may include regenerating the timed text track or the intermediate data.
- Screenshot 700 further includes an “Add Caption” button 714 .
- button 714 may navigate a user to an interface to upload a caption as illustrated in FIG. 9 . Selecting a listed timed text track may result in displaying the track as shown in FIG. 8 .
- FIG. 8 shows a screenshot 800 of an example user interface to display a timed text track.
- Frame 802 lists caption text in one column and the times to begin and end display of the caption text in another column.
- Navigation controls such as a slider may also be provided to scroll through the timed text track.
- FIG. 9 shows a screenshot 900 of an example user interface to add a manually generated timed text track or known transcript.
- a field 902 enables a user to specify a file with the manually generated timed text track or known transcript.
- the file may, for example, be a comma-delimited text file.
- Radio boxes 904 and 906 indicate whether the file is a manually generated timed text track or known transcript.
- a drop down menu 908 enables a user to select a language of the manually generated timed text track or known transcript.
- field 910 enables a user to name the timed text track.
- time codes for the known transcript may be determined as described above with respect to transcript alignment module 226 .
- transcript alignment module 226 may determine time codes corresponding to different portions of the transcript text. The time codes define when to display the text in the video and correspond to when the text is spoken in the audio track.
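- The comma-delimited file mentioned for field 902 could be read as shown below. The column order (start time, end time, caption text) is an assumption, since the description does not define the file layout, and `parse_timed_text_csv` is a hypothetical helper name.

```python
import csv
import io
from typing import List, Tuple


def parse_timed_text_csv(text: str) -> List[Tuple[float, float, str]]:
    """Parse rows of start,end,caption into (start_s, end_s, caption) tuples.

    Blank or short rows are skipped; quoted captions may contain commas.
    """
    rows = []
    for record in csv.reader(io.StringIO(text)):
        if len(record) < 3:
            continue
        rows.append((float(record[0]), float(record[1]), record[2].strip()))
    return rows


sample = '0.0,2.5,"Hello, world"\n\n2.5,5.0,Second caption\n'
track = parse_timed_text_csv(sample)
```

- A known transcript without time codes would not fit this layout; in that case the time codes would first be determined by alignment as described above.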
- FIG. 10 shows a screenshot 1000 of an example user interface with a menu to play a timed text track.
- a menu option 1004 enables a user to request a timed text track that is automatically generated with an algorithm.
- a menu option 1006 enables a user to request a timed text track translated into another language.
- menu options 1008 enable a user to select a language of the requested timed text track.
Landscapes
- Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)
- Information Transfer Between Computers (AREA)
Abstract
Description
- This is a continuation of U.S. patent application Ser. No. 12/949,527 filed Nov. 18, 2010, which claims benefit under 35 U.S.C. §119(e) to U.S. Provisional Application No. 61/262,426 filed Nov. 18, 2009, the contents of which are hereby incorporated by reference.
- The present field generally relates to captioning web video.
- Video is increasingly being accessed by remote users over networks using web video services, such as the YOUTUBE service made available by Google Inc. The rise of the World Wide Web, including various web applications, protocols, and related networking and computing technologies has made it possible for remote users to view and to play video.
- Timed text, such as captions or subtitles, is sometimes provided with video content and is “timed” so that certain text appears in association with certain portions of a video content. Timed text can serve a number of purposes. First, timed text can make the dialogue understandable to the hearing impaired. Second, timed text can make the video understandable in environments where audio is unavailable or not permitted. Third, timed text can provide commentary to video with educational or entertainment value. Finally, timed text can translate the audio for those who do not understand the language of the dialogue. However, manual transcription of timed text can be expensive and time-consuming.
- Embodiments relate to generation of timed text in web video. In a first embodiment, a computer-implemented method generates timed text for online video. In the method, a request to play a timed text track of a video incorporated into a web video service is received from a client computing device. Prior to receipt of the request, audio of the video is processed to determine intermediate timed text data. The intermediate timed text data lacks a complete text transcription of the audio, but includes data to enable the complete text transcription to be generated when playing the video. In response to receipt of the request, a text transcription of the audio is determined using the intermediate data. Finally, the text transcription of the audio is sent to the client computing device for display along with the video.
- In a second embodiment, a system generates timed text for online video. The system includes a timed text player module that receives, from a client computing device, a request to play a video incorporated into a web video service. Prior to receipt of the request by the timed text player, a preprocessor module processes audio of the video to determine intermediate timed text data. The intermediate timed text data lacks a complete text transcription of the audio, but includes sufficient data to enable the complete text transcription to be generated when playing the video. In response to the request by the timed text player, a text generator module determines a text transcription of the audio using the intermediate data. Finally, the timed text player module sends the text transcription of the audio to the client computing device for display along with the video.
- In a third embodiment, a computer-implemented method generates timed text for online video. In the method, a transcript, input by a user, to incorporate into a web video service for a video is received. A quality value of the transcript of the video is determined. The quality value represents how closely the transcript specifies audio for the video. When the transcript is of sufficient quality for alignment, a determining step determines time codes indicating when to display respective portions of the transcript to align the transcript with the audio of the video. The time codes are provided to a client computing device to display the transcript along with the video.
- In a fourth embodiment, a system generates timed text for online video. The system includes a timed text module that enables a user to input a transcript to incorporate into a web video service for a video. A transcript alignment module determines a quality of the transcript of the video, the quality representing how closely the transcript specifies audio for the video. When the transcript is of sufficient quality, the transcript alignment module also determines time codes indicating when to display respective portions of the transcript to align the transcript with the audio of the video. Finally, a timed text player module provides the time codes determined by the transcript alignment module to a client computing device to display the transcript along with the video.
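By way of illustration only, the quality check and time code determination of the third and fourth embodiments might be sketched as follows. The description does not specify how the quality value is computed, so this sketch assumes it is a similarity ratio between the user's transcript and a recognizer's word hypotheses. The function name `align_transcript`, the constant `QUALITY_THRESHOLD`, and the input shapes are hypothetical.

```python
from difflib import SequenceMatcher

# Hypothetical cutoff; the description does not specify a threshold.
QUALITY_THRESHOLD = 0.6


def align_transcript(transcript, recognized, threshold=QUALITY_THRESHOLD):
    """Score a user transcript against recognizer output and derive time codes.

    transcript: list of words typed by the user.
    recognized: list of (word, start_seconds) pairs from a speech recognizer.
    Returns (quality, time_codes); time_codes is None when quality is too low.
    """
    typed = [w.lower() for w in transcript]
    heard = [w.lower() for w, _ in recognized]
    matcher = SequenceMatcher(None, typed, heard, autojunk=False)
    quality = matcher.ratio()  # 1.0 means the word sequences are identical
    if quality < threshold:
        return quality, None
    time_codes = []
    # Each matching block pairs a run of transcript words with recognized words.
    for block in matcher.get_matching_blocks():
        for k in range(block.size):
            start = recognized[block.b + k][1]
            time_codes.append((start, transcript[block.a + k]))
    return quality, time_codes
```

In this sketch, transcript words with no matching recognized word receive no time code; an actual alignment module would presumably interpolate times for them.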
- Further embodiments, features, and advantages of the invention, as well as the structure and operation of the various embodiments of the invention are described in detail below with reference to accompanying drawings.
- The accompanying drawings, which are incorporated herein and form a part of the specification, illustrate the present invention and, together with the description, further serve to explain the principles of the invention and to enable a person skilled in the pertinent art to make and use the invention.
- FIG. 1 is a diagram illustrating a system for generating timed text for web video according to an embodiment.
- FIG. 2 is a diagram illustrating the system for generating timed text in FIG. 1 in greater detail.
- FIG. 3 is a flowchart illustrating a method for uploading web video according to an embodiment.
- FIG. 4 is a flowchart illustrating a method for generating timed text for web video according to an embodiment.
- FIGS. 5-10 are diagrams illustrating an example user interface that may be used in the system of FIG. 1.
- The drawing in which an element first appears is typically indicated by the leftmost digit or digits in the corresponding reference number. In the drawings, like reference numbers may indicate identical or functionally similar elements.
- Systems and methods are needed to automatically generate timed text for web video.
- This description relates to generating timed text for web video using a real-time algorithm. In an embodiment, a video is preprocessed to determine intermediate data that is sufficient to generate timed text in real-time. Examples of such intermediate data are further described below. The intermediate data enables the captions to be generated in real-time, but does not include the text transcription of the video. Then, when a user requests timed text to be played with the video, a real-time recognition algorithm is used to generate a text transcription of the web video in real-time based on the intermediate data. In this way, captions may be automatically generated without having to store the text transcription. As a further benefit of generating timed text at request time, the timed text may be translated into other languages, such as the user's language. Also, the timed text may be customized in other ways. For example, the timed text may be reformatted as needed for the specific device. This and other embodiments are described below with reference to the accompanying drawings.
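By way of illustration only, the two-phase arrangement described above (preprocess when the video is uploaded, generate text when a timed text track is requested) might be sketched as follows. The class `TimedTextPipeline`, the toy `toy_preprocess` and `toy_generate` functions, and the shape of the intermediate data are all hypothetical; an in-memory dictionary stands in for a timed text database.

```python
class TimedTextPipeline:
    """Stores only intermediate data; text is produced at request time."""

    def __init__(self, preprocess, generate):
        self.preprocess = preprocess  # expensive step, run once per upload
        self.generate = generate      # cheap step, run on every play request
        self.store = {}               # stand-in for a timed text database

    def upload(self, video_id, audio):
        # No text transcription is stored, only the intermediate data.
        self.store[video_id] = self.preprocess(audio)

    def play(self, video_id, reformat=None):
        intermediate = self.store.get(video_id)
        if intermediate is None:
            return None  # no automatically generated track is available
        captions = self.generate(intermediate)
        if reformat is not None:
            # Request-time customization, e.g. translation or device formatting.
            captions = [(s, e, reformat(text)) for s, e, text in captions]
        return captions


def toy_preprocess(audio):
    # Toy input: (start_s, end_s, {candidate_word: probability}) tuples.
    return sorted(audio, key=lambda segment: segment[0])


def toy_generate(intermediate):
    # Pick the most probable candidate word for each segment.
    return [(s, e, max(c, key=c.get)) for s, e, c in intermediate]
```

Because the text is regenerated on each request, an improved `generate` function or a different `reformat` hook takes effect without reprocessing the audio.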
- FIG. 1 is a diagram illustrating a system 100 for generating timed text for web video according to an embodiment. System 100 includes a timed text server 110, a speech recognition server 120, a video server 130, and a client computing device 140 coupled via one or more networks 106, such as the Internet, an intranet or any combination thereof. Timed text server 110 is coupled to a timed text database 102. Speech recognition server 120 and video server 130 are both coupled to video database 104.
- In general, system 100 may operate as follows to upload a new video into a web video service. A user of client computing device 140 may have a video to incorporate into the web video service. Client computing device 140 may send the video to video server 130. In response to receipt of the video, video server 130 may store the video in video database 104 and conduct some in-processing on the video. Among the in-processing, video server 130 may request that timed text server 110 generate timed text data. Timed text server 110 may communicate with speech recognition server 120 to generate intermediate timed text data sufficient to generate captions in real-time, but that does not include a text transcription of the video. Timed text server 110 may store the intermediate timed text data in timed text database 102 for later use to generate the text transcription for timed text when the video is played. Each of the components and their operation are described in greater detail below.
- Client computing device 140 may include a browser 150. In an embodiment, browser 150 receives an HTTP response containing a file. As an example, the file may be encoded in hypertext markup language (HTML) or FLASH file format made available by Adobe, Inc. of California. The browser may interpret the file to instantiate a timed text editor module 152.
- Timed text editor module 152 enables users to upload and manage timed text tracks for their videos incorporated into the web video service. In an embodiment, timed text editor module 152 may enable a user to request that timed text data be created for a video with no existing timed text data. In a further embodiment, timed text editor module 152 may also enable users to upload new videos and create timed text tracks for the new videos. Screenshots illustrating the operation of an example timed text editor module 152 are presented in FIGS. 6-9. Timed text editor module 152 may use remote procedure calls (RPC) or HTTP requests to communicate with video server 130.
- Video server 130 includes a video uploader module 132 and may host videos for a web video service. When video uploader module 132 receives a new video, video uploader module 132 may store the new video into video database 104 to incorporate the new video into the web video service. With the new video stored in video database 104, other users may request to view the video, and video server 130 may stream the video to their video player for display. In addition to storing the new video, video uploader module 132 may conduct some initial processing on the new video. In an embodiment, video uploader module 132 may invoke a timed text generator module 112 to generate timed text data for the video.
- As shown in FIG. 1, timed text generator module 112 is located on timed text server 110. In this embodiment, video uploader module 132 may invoke timed text generator module 112 using a remote procedure call. In another embodiment, timed text generator module 112 may be located on video server 130, and video uploader module 132 may call timed text generator module 112 with a direct function call.
- In an embodiment, timed text generator module 112 may obtain the audio track of a video and package the audio track as a request to speech recognition server 120 to generate timed text data. In another embodiment, the request may include an identifier identifying a video in video database 104, instead of including the entire audio track. In that embodiment, speech recognition server 120 may retrieve the audio track from video database 104.
- In response to the request from timed text generator module 112, a preprocessor module 122 included within speech recognition server 120 may determine timed text data. In an embodiment, preprocessor module 122 may determine intermediate data that is sufficient to generate timed text when the video is played, but does not include a text transcription of the video. In an example, intermediate data may be a partial representation of the timed text.
- In an embodiment, the intermediate format generated by preprocessor module 122 may include a list of timestamped segments. Each segment may identify a portion of the audio track with, for example, a timestamp for the beginning of the portion and a timestamp for the end of the portion. Each segment may provide information about that portion of the audio track without including the transcribed words. Further, the information may be independent of the language of the speech in the audio. In an embodiment, the intermediate format may describe a portion of the audio track by identifying a type of sound included in the portion of the track. For example, the segment may have a sound type field with values that identify whether the sound in its portion is “speech”, “noise”, “music”, or another type of sound known to those of skill in the art. Preprocessor module 122 may identify the different types of sound and generate the intermediate data using speech recognition algorithms known to those of skill in the art. In other embodiments, the sound type field may be omitted.
- A specific example of intermediate data would use a Google “protocol buffer” to marshal the binary data generated by the speech recognition process. Examples of such data would be groups of tentative words combined with timestamps to mark the time of occurrence and their duration, along with word tokens each of which can have an associated probability estimate. There may be one such token with high probability, or several of lesser probability. In the final process of converting this intermediate data to captions for the video player, the word tokens must be correctly ordered based on timestamps, selected based on their confidence levels, divided into groups to form captions of readable length and duration, translated or corrected based on natural language processing, and re-formatted according to the needs of the specific client (e.g. plain text for some, variants of XML or HTML for others).
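By way of illustration only, the final conversion described above (ordering word tokens by timestamp, selecting by confidence, grouping into captions of readable length and duration, and re-formatting for the client) might be sketched as follows. Plain Python dataclasses stand in for the protocol buffer messages; the names `WordToken`, `Caption`, `build_captions`, and `render`, the default limits, and the XML element and attribute names are all hypothetical. Translation and natural language correction are omitted.

```python
from dataclasses import dataclass
from xml.sax.saxutils import escape


@dataclass
class WordToken:
    start: float      # seconds from the start of the audio
    duration: float   # seconds
    candidates: dict  # tentative word -> probability estimate


@dataclass
class Caption:
    start: float
    end: float
    text: str


def _flush(words):
    # words is a non-empty list of (start, end, word) tuples.
    return Caption(words[0][0], words[-1][1], " ".join(w for _, _, w in words))


def build_captions(tokens, min_confidence=0.5, max_words=7, max_seconds=4.0):
    captions, current = [], []
    for tok in sorted(tokens, key=lambda t: t.start):  # order by timestamp
        if not tok.candidates:
            continue
        word, prob = max(tok.candidates.items(), key=lambda kv: kv[1])
        if prob < min_confidence:  # select by confidence level
            continue
        end = tok.start + tok.duration
        # Start a new caption when the current one is long enough.
        if current and (len(current) >= max_words
                        or end - current[0][0] > max_seconds):
            captions.append(_flush(current))
            current = []
        current.append((tok.start, end, word))
    if current:
        captions.append(_flush(current))
    return captions


def render(captions, fmt="plain"):
    # Re-format according to the needs of the specific client.
    if fmt == "plain":
        return "\n".join(c.text for c in captions)
    if fmt == "xml":
        rows = "".join(
            f'<text start="{c.start:.2f}" dur="{c.end - c.start:.2f}">'
            f"{escape(c.text)}</text>"
            for c in captions
        )
        return f"<transcript>{rows}</transcript>"
    raise ValueError(f"unsupported format: {fmt}")
```

Keeping the grouping limits as parameters reflects the point that the same intermediate data can yield differently shaped captions for different clients.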
- Once the intermediate data is generated, it is stored in timed text database 102 for later use to generate the text transcription for timed text when the new video is played. As discussed in more detail below, when a user requests to play a timed text track, the intermediate data may be used to automatically generate a text transcription.
- Often, automated speech recognition algorithms cannot be run in real-time with streaming video, because they require too many computing resources. By determining this intermediate data prior to a request to play a timed text track, preprocessor module 122 enables text transcription to be generated in real-time when the request is made. Further, generating an intermediate format in advance, instead of the text itself, may obviate the need to store the text transcription. When text transcription is determined in real-time, the quality of the transcription may improve as the automated speech recognition algorithms improve. As a further benefit of generating timed text at request time, the timed text may be translated into other languages, such as the user's language.
- FIG. 2 is a diagram illustrating system 100 for generating timed text in greater detail. In addition to the components in FIG. 1, FIG. 2 shows a timed text player module 214 included within timed text server 110; a text generator module 224 and a transcript alignment module 226 included within speech recognition server 120; and a player module 154 included within browser 150.
- Player module 154 enables a user to play a video and a corresponding timed text track. In one embodiment, player module 154 may be instantiated by a browser plug-in using a FLASH file. Player module 154 may stream video over the Internet to display to a user. Player module 154 may also include various controls, for example, conventional video controls as well as other timed text controls to view a timed text track. The timed text controls may list which timed text tracks are available. The timed text controls may also indicate whether an automated speech recognition track is available (e.g. whether the intermediate data to generate an automated transcription is available). In an example, timed text controls may include an indication that the track is automatically generated and may be of lower quality than manually generated tracks. A screenshot of an example player module 154 is illustrated in FIG. 10.
- To generate the timed text controls, player module 154 may request a tracklist from timed text server 110. Timed text server 110 may return the tracklist in, for example, XML format. The XML tracklist may include an attribute to indicate whether a timed text track is automatically or manually generated.
- In an embodiment, player module 154 may enable a user to request an automatically generated timed text track. In another embodiment, player module 154 may be configured always to request timed text. In that embodiment, player module 154 may request an automatically generated timed text track if no manually generated timed text tracks are available. To request timed text generated using an automated algorithm, player module 154 may send an HTTP request with a special parameter to timed text server 110.
- Timed text server 110 includes a timed text player module 214 that receives the request from player module 154. Timed text player module 214 may retrieve the intermediate data from timed text database 102. Timed text generator module 112 then uses the intermediate data to determine a text transcription for the video. In an embodiment, timed text generator module 112 may use only the intermediate data to determine a text transcription for the video. In an alternative embodiment, timed text generator module 112 may use both the intermediate data and the audio of the video.
- Timed text generator module 112 may generate a list of words timestamped at specific times in the video. Then, the words may be combined such that a group of words is displayed during a given time period. To determine the words, timed text generator module 112 may use the intermediate data determined prior to the request. The intermediate data may enable timed text generator module 112 to generate the text transcription more quickly. Using the intermediate data, timed text generator module 112 generates a timed text track describing the speech of the video. The timed text track may define a series of groups of words with each group of words having a corresponding period of time to be displayed in the video.
- Once timed text generator module 112 generates the timed text track, timed text generator module 112 sends the timed text track to timed text player module 214 in timed text server 110. Timed text player module 214 then sends the timed text track to player module 154 for display with the video. In another embodiment, the timed text track may be sent to player module 154 in band with the video. For example, video server 130 may request text from timed text server 110 and combine the text with the video stream that it sends to client device 140.
- In some cases, a transcript may be available for a video. However, the transcript may not have time codes to indicate when to display the text in the video. In an embodiment, timed text editor module 152 may enable a user to upload a transcript, and speech recognition server 120 may include a transcript alignment module 226 that determines when different portions of the transcript should be displayed in the video.
- Transcript alignment module 226 may determine time codes corresponding to different portions of the transcript text. The time codes define when to display the text in the video and correspond to when the text is spoken in the audio track. When transcript alignment module 226 aligns the text with the video, transcript alignment module 226 may evaluate the quality of the transcript. If the transcript quality is good, the resulting transcript may be saved in timed text database 102 to be played with the video. If the transcript quality is poor, then the resulting transcript may not be available to be played with the video. Alternatively, transcript alignment module 226 may evenly distribute the transcript text over the duration of the video.
- Each of timed text server 110, speech recognition server 120, video server 130, and client computing device 140 may be implemented on any type of computing device. Such a computing device can include, but is not limited to, a personal computer, mobile device such as a mobile phone, workstation, embedded system, game console, television, set-top box, or any other computing device. Further, a computing device can include, but is not limited to, a device having a processor and memory for executing and storing instructions. Software may include one or more applications and an operating system. Hardware can include, but is not limited to, a processor, memory and graphical user interface display. The computing device may also have multiple processors and multiple shared or separate memory components. For example, the computing device may be a clustered computing environment or server farm.
- Each of timed text database 102 and video database 104 may be any type of structured memory, including a persistent memory. In examples, each database may be implemented as a relational database.
- Each of timed text player module 214, timed text generator module 112, preprocessor module 122, text generator module 224, transcript alignment module 226, video uploader module 132, browser 150, player module 154, and timed text editor module 152 may be implemented in hardware, software, firmware or any combination thereof.
- Timed text server 110 may include a web server. A web server is a software component that responds to a hypertext transfer protocol (HTTP) request with an HTTP response. As illustrative examples, the web server may be, without limitation, Apache HTTP Server, Apache Tomcat, MICROSOFT® Internet Information Server, JBoss Application Server, WEBLOGIC Application Server®, or SUN Java™ System Web Server. The web server may serve content such as hypertext markup language (HTML), extensible markup language (XML), documents, videos, images, multimedia features, MACROMEDIA Flash programs, or any combination thereof. These examples are strictly illustrative and do not limit the present invention.
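By way of illustration only, the tracklist exchange described above between player module 154 and timed text server 110 might be sketched as follows. The description specifies neither the XML schema nor the name of the special request parameter, so the element name `track`, the attributes `kind` and `lang`, the query parameters, and the URL below are all hypothetical.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode


def choose_track(tracklist_xml):
    """Prefer a manually generated track; fall back to an automatic one."""
    tracks = ET.fromstring(tracklist_xml).findall("track")
    if not tracks:
        return None
    # Hypothetical convention: kind="asr" marks an automatically generated track.
    manual = [t for t in tracks if t.get("kind") != "asr"]
    chosen = manual[0] if manual else tracks[0]
    return {"lang": chosen.get("lang"), "automatic": chosen.get("kind") == "asr"}


def build_request_url(base_url, video_id, track):
    """Build the HTTP request URL for the chosen timed text track."""
    params = {"v": video_id, "lang": track["lang"]}
    if track["automatic"]:
        # The special parameter that asks for automatically generated text.
        params["kind"] = "asr"
    return base_url + "?" + urlencode(params)


# Example: only an automatically generated track is listed.
example = '<tracklist><track lang="en" kind="asr"/></tracklist>'
track = choose_track(example)
url = build_request_url("https://example.com/timedtext", "abc123", track)
```

Carrying the automatic or manual distinction in the tracklist itself lets the player label automatically generated tracks as potentially lower in quality before any text is requested.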
- FIG. 3 is a flowchart illustrating a method 300 for uploading a web video according to an embodiment. Method 300 may be used in operation of system 100 in FIGS. 1-2. For clarity, method 300 is described with respect to components of system 100, but it is not limited thereto. A person of skill in the art given this description would recognize additional applications of method 300.
- Method 300 begins with client computing device 140 sending a new video to video server 130 at step 302. At step 304, video server 130 stores the new video in a video database. Then, video server 130 sends a request to speech recognition server 120 to pre-process timed text for the new video at step 306.
- In response to the request from video server 130, speech recognition server 120 determines intermediate timed text data for the video at step 308. The intermediate data may not include any transcribed text or complete transcribed text, but may include sufficient data to transcribe the text in real time when playing the video. At step 310, speech recognition server 120 may send the intermediate data to timed text server 110. At step 312, timed text server 110 stores the intermediate data in timed text database 102.
- FIG. 4 is a flowchart illustrating a method 400 for generating timed text for web video according to an embodiment. Method 400 may be used in operation of system 100 in FIGS. 1-2. For clarity, method 400 is described with respect to components of system 100, but it is not limited thereto. A person of skill in the art given this description would recognize additional applications of method 400.
- Method 400 begins with client computing device 140 requesting an automatically generated timed text track from timed text server 110 at step 402. At step 404, timed text server 110 retrieves intermediate timed text data corresponding to the video, and a speech recognition algorithm is used to generate a timed text track based on the intermediate data received from timed text server 110 at step 406. Any suitable speech recognition algorithm can be used depending upon a particular application or design need. The timed text track may define a series of groups of words with each group of words having a corresponding period of time to be displayed in the video. Speech recognition server 120 sends the timed text track to timed text server 110.
- At step 406, timed text server 110 sends the timed text track on to client computing device 140. Client computing device 140 displays the timed text track to a user along with the video at step 408.
- FIGS. 5-10 are diagrams illustrating an example user interface that may be used in the system of FIG. 1. In one example, these user interfaces and accompanying display screens may be implemented using browser 150 on client computing device 140 or on any other remote client device with a browser. FIG. 5 shows a screenshot 500 of an example user interface to manage a video with no timed text tracks. Screenshot 500 shows a video 502 in a video player. A frame 504 indicates that no timed text tracks are presently available for video 502. A button 506, when pressed, navigates a user to an interface to add a manually generated timed text track or known transcript. The interface to add a manually generated timed text track or known transcript is described in more detail below with respect to FIG. 9. Screenshot 500 also includes a button 508 that, when pressed, requests that timed text data (either intermediate data or a complete timed text transcription) be generated. When pressed, button 508 also navigates a user to an interface as illustrated in FIG. 6.
- FIG. 6 shows a screenshot 600 of an example user interface to manage a video with an automatically generated timed text track shown at frame 604. In an embodiment, frame 604 may appear when intermediate data has been generated to enable speech-to-text transcription in real-time when the video is played. In another embodiment, frame 604 may appear when a complete text transcription has been generated.
- FIG. 7 shows a screenshot 700 of an example user interface to manage a video with multiple timed text tracks. Frame 704 shows a listing of multiple timed text tracks. Each listed manually generated timed text track includes a checkbox to enable display of the track to users (such as a checkbox 712), a button to navigate to an interface to adjust settings for the track (such as a button 706), a button to download the track (such as a button 708), and a button to remove the track (such as a button 710). Each listed timed text track generated with an automated speech-to-text algorithm may include a button to reprocess the video (such as a button 702). Reprocessing the video may include regenerating the timed text track or the intermediate data. Enabling the user to reprocess this data may be useful as speech-to-text algorithms improve in accuracy. Screenshot 700 further includes an “Add Caption” button 714. When selected, button 714 may navigate a user to an interface to upload a caption as illustrated in FIG. 9. Selecting a listed timed text track may result in displaying the track as shown in FIG. 8.
- FIG. 8 shows a screenshot 800 of an example user interface to display a timed text track. Frame 802 lists caption text in one column and the times to begin and end display of the caption text in another column. Navigation controls such as a slider may also be provided to scroll through the timed text track.
- FIG. 9 shows a screenshot 900 of an example user interface to add a manually generated timed text track or known transcript. A field 902 enables a user to specify a file with the manually generated timed text track or known transcript. The file may, for example, be a comma-delimited text file. Radio boxes menu 908 enables a user to select a language of the manually generated timed text track or known transcript. Finally, field 910 enables a user to name the timed text track. When a user uploads a known transcript, time codes for the known transcript may be determined as described above with respect to transcript alignment module 226. As described above, transcript alignment module 226 may determine time codes corresponding to different portions of the transcript text. The time codes define when to display the text in the video and correspond to when the text is spoken in the audio track.
- FIG. 10 shows a screenshot 1000 of an example user interface with a menu to play a timed text track. A menu option 1004 enables a user to request a timed text track that is automatically generated with an algorithm. A menu option 1006 enables a user to request a timed text track translated into another language. Finally, menu options 1008 enable a user to select a language of the requested timed text track.
- The present invention has been described above with the aid of functional building blocks illustrating the implementation of specified functions and relationships thereof. The boundaries of these functional building blocks have been arbitrarily defined herein for the convenience of the description. Alternate boundaries can be defined so long as the specified functions and relationships thereof are appropriately performed.
- The foregoing description of the specific embodiments will so fully reveal the general nature of the invention that others can, by applying knowledge within the skill of the art, readily modify and/or adapt for various applications such specific embodiments, without undue experimentation, without departing from the general concept of the present invention. Therefore, such adaptations and modifications are intended to be within the meaning and range of equivalents of the disclosed embodiments, based on the teaching and guidance presented herein. It is to be understood that the phraseology or terminology herein is for the purpose of description and not of limitation, such that the terminology or phraseology of the present specification is to be interpreted by the skilled artisan in light of the teachings and guidance.
- The breadth and scope of the present invention should not be limited by any of the above-described exemplary embodiments, but should be defined only in accordance with the following claims and their equivalents.
Claims (20)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US14/165,484 US20140142941A1 (en) | 2009-11-18 | 2014-01-27 | Generation of timed text using speech-to-text technology, and applications thereof |
Applications Claiming Priority (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US26242609P | 2009-11-18 | 2009-11-18 | |
US12/949,527 US8645134B1 (en) | 2009-11-18 | 2010-11-18 | Generation of timed text using speech-to-text technology and applications thereof |
US14/165,484 US20140142941A1 (en) | 2009-11-18 | 2014-01-27 | Generation of timed text using speech-to-text technology, and applications thereof |
Related Parent Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US12/949,527 Continuation US8645134B1 (en) | 2009-11-18 | 2010-11-18 | Generation of timed text using speech-to-text technology and applications thereof |
Publications (1)
Publication Number | Publication Date |
---|---|
US20140142941A1 (en) | 2014-05-22 |
Family
ID=50001771
Family Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US12/949,527 Active 2032-02-25 US8645134B1 (en) | 2009-11-18 | 2010-11-18 | Generation of timed text using speech-to-text technology and applications thereof |
US14/165,484 Abandoned US20140142941A1 (en) | 2009-11-18 | 2014-01-27 | Generation of timed text using speech-to-text technology, and applications thereof |
Family Applications Before (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US12/949,527 Active 2032-02-25 US8645134B1 (en) | 2009-11-18 | 2010-11-18 | Generation of timed text using speech-to-text technology and applications thereof |
Country Status (1)
Country | Link |
---|---|
US (2) | US8645134B1 (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20120301111A1 (en) * | 2011-05-23 | 2012-11-29 | Gay Cordova | Computer-implemented video captioning method and player |
CN106412678A (en) * | 2016-09-14 | 2017-02-15 | 安徽声讯信息技术有限公司 | Method and system for transcribing and storing video news in real time |
TWI747417B (en) * | 2020-08-05 | 2021-11-21 | 國立陽明交通大學 | Method for generating caption file through url of an av platform |
Families Citing this family (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9412372B2 (en) * | 2012-05-08 | 2016-08-09 | SpeakWrite, LLC | Method and system for audio-video integration |
US9946712B2 (en) * | 2013-06-13 | 2018-04-17 | Google Llc | Techniques for user identification of and translation of media |
US9953646B2 (en) | 2014-09-02 | 2018-04-24 | Belleau Technologies | Method and system for dynamic speech recognition and tracking of prewritten script |
US9886423B2 (en) * | 2015-06-19 | 2018-02-06 | International Business Machines Corporation | Reconciliation of transcripts |
US20180130484A1 (en) * | 2016-11-07 | 2018-05-10 | Axon Enterprise, Inc. | Systems and methods for interrelating text transcript information with video and/or audio information |
US10560656B2 (en) * | 2017-03-19 | 2020-02-11 | Apple Inc. | Media message creation with automatic titling |
TWI661319B (en) * | 2017-11-30 | 2019-06-01 | 財團法人資訊工業策進會 | Apparatus, method, and computer program product thereof for generatiing control instructions based on text |
US10580410B2 (en) * | 2018-04-27 | 2020-03-03 | Sorenson Ip Holdings, Llc | Transcription of communications |
CN114143592B (en) * | 2021-11-30 | 2023-10-27 | 抖音视界有限公司 | Video processing method, video processing apparatus, and computer-readable storage medium |
Citations (16)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6076059A (en) * | 1997-08-29 | 2000-06-13 | Digital Equipment Corporation | Method for aligning text with audio signals |
US20030236663A1 (en) * | 2002-06-19 | 2003-12-25 | Koninklijke Philips Electronics N.V. | Mega speaker identification (ID) system and corresponding methods therefor |
US6941266B1 (en) * | 2000-11-15 | 2005-09-06 | At&T Corp. | Method and system for predicting problematic dialog situations in a task classification system |
US7231351B1 (en) * | 2002-05-10 | 2007-06-12 | Nexidia, Inc. | Transcript alignment |
US7440895B1 (en) * | 2003-12-01 | 2008-10-21 | Lumenvox, Llc. | System and method for tuning and testing in a speech recognition system |
US20080288250A1 (en) * | 2004-02-23 | 2008-11-20 | Louis Ralph Rennillo | Real-time transcription system |
US20090030698A1 (en) * | 2007-03-07 | 2009-01-29 | Cerra Joseph P | Using speech recognition results based on an unstructured language model with a music system |
US20090119704A1 (en) * | 2004-04-23 | 2009-05-07 | Koninklijke Philips Electronics, N.V. | Method and apparatus to catch up with a running broadcast or stored content |
US20090125534A1 (en) * | 2000-07-06 | 2009-05-14 | Michael Scott Morton | Method and System for Indexing and Searching Timed Media Information Based Upon Relevance Intervals |
US20100257212A1 (en) * | 2009-04-06 | 2010-10-07 | Caption Colorado L.L.C. | Metatagging of captions |
US20100299131A1 (en) * | 2009-05-21 | 2010-11-25 | Nexidia Inc. | Transcript alignment |
US20110087491A1 (en) * | 2009-10-14 | 2011-04-14 | Andreas Wittenstein | Method and system for efficient management of speech transcribers |
US8131545B1 (en) * | 2008-09-25 | 2012-03-06 | Google Inc. | Aligning a transcript to audio data |
US20120188446A1 (en) * | 2008-09-22 | 2012-07-26 | International Business Machines Corporation | Verbal description |
US8281231B2 (en) * | 2009-09-11 | 2012-10-02 | Digitalsmiths, Inc. | Timeline alignment for closed-caption text using speech recognition transcripts |
US20120293522A1 (en) * | 1996-12-05 | 2012-11-22 | Interval Licensing Llc | Browser for Use in Navigating a Body of Information, with Particular Application to Browsing Information Represented by Audiovisual Data |
Family Cites Families (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20040080528A1 (en) * | 2000-06-21 | 2004-04-29 | Watchit.Com,Inc. | Systems and methods for presenting interactive programs over the internet |
US7035804B2 (en) * | 2001-04-26 | 2006-04-25 | Stenograph, L.L.C. | Systems and methods for automated audio transcription, translation, and transfer |
US7292975B2 (en) * | 2002-05-01 | 2007-11-06 | Nuance Communications, Inc. | Systems and methods for evaluating speaker suitability for automatic speech recognition aided transcription |
US8009966B2 (en) * | 2002-11-01 | 2011-08-30 | Synchro Arts Limited | Methods and apparatus for use in sound replacement with automatic synchronization to images |
NO327155B1 (en) * | 2005-10-19 | 2009-05-04 | Fast Search & Transfer Asa | Procedure for displaying video data within result presentations in systems for accessing and searching for information |
US8120638B2 (en) * | 2006-01-24 | 2012-02-21 | Lifesize Communications, Inc. | Speech to text conversion in a videoconference |
US8204891B2 (en) * | 2007-09-21 | 2012-06-19 | Limelight Networks, Inc. | Method and subsystem for searching media content within a content-search-service system |
US20080284910A1 (en) * | 2007-01-31 | 2008-11-20 | John Erskine | Text data for streaming video |
US9087331B2 (en) * | 2007-08-29 | 2015-07-21 | Tveyes Inc. | Contextual advertising for video and audio media |
US7437291B1 (en) * | 2007-12-13 | 2008-10-14 | International Business Machines Corporation | Using partial information to improve dialog in automatic speech recognition systems |
US8041716B2 (en) * | 2008-10-03 | 2011-10-18 | At&T Intellectual Property I, L.P. | Apparatus, methods and computer products for selection of content based on transcript searches |
- 2010-11-18: US application US12/949,527, patent US8645134B1 (en), Active
- 2014-01-27: US application US14/165,484, publication US20140142941A1 (en), Abandoned
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20120301111A1 (en) * | 2011-05-23 | 2012-11-29 | Gay Cordova | Computer-implemented video captioning method and player |
US8923684B2 (en) * | 2011-05-23 | 2014-12-30 | Cctubes, Llc | Computer-implemented video captioning method and player |
CN106412678A (en) * | 2016-09-14 | 2017-02-15 | 安徽声讯信息技术有限公司 | Method and system for transcribing and storing video news in real time |
TWI747417B (en) * | 2020-08-05 | 2021-11-21 | 國立陽明交通大學 | Method for generating caption file through url of an av platform |
Also Published As
Publication number | Publication date |
---|---|
US8645134B1 (en) | 2014-02-04 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US8645134B1 (en) | Generation of timed text using speech-to-text technology and applications thereof | |
US8260604B2 (en) | System and method for translating timed text in web video | |
US10306328B2 (en) | Systems and methods for rendering text onto moving image content | |
US20110067059A1 (en) | Media control | |
WO2017063399A1 (en) | Video playback method and device | |
US20150319510A1 (en) | Interactive viewing experiences by detecting on-screen text | |
US20230071845A1 (en) | Interactive viewing experiences by detecting on-screen text | |
US10354676B2 (en) | Automatic rate control for improved audio time scaling | |
US9767825B2 (en) | Automatic rate control based on user identities | |
US11197048B2 (en) | Transmission device, transmission method, reception device, and reception method | |
US8676578B2 (en) | Meeting support apparatus, method and program | |
KR20240084531A (en) | The system and an appratus for providig contents based on a user utterance | |
US20230345082A1 (en) | Interactive pronunciation learning system | |
CN117376593A (en) | Subtitle processing method and device for live stream, storage medium and computer equipment |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
 | AS | Assignment | Owner name: GOOGLE INC., CALIFORNIA Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:HARRENSTIEN, KENNETH;JUE, TOLIVER;ALBERTI, CHRISTOPHER;AND OTHERS;SIGNING DATES FROM 20110125 TO 20110314;REEL/FRAME:034270/0969 |
 | STCB | Information on status: application discontinuation | Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION |
 | AS | Assignment | Owner name: GOOGLE LLC, CALIFORNIA Free format text: CHANGE OF NAME;ASSIGNOR:GOOGLE INC.;REEL/FRAME:044144/0001 Effective date: 20170929 |
 | AS | Assignment | Owner name: GOOGLE LLC, CALIFORNIA Free format text: CORRECTIVE ASSIGNMENT TO CORRECT THE THE REMOVAL OF THE INCORRECTLY RECORDED APPLICATION NUMBERS 14/149802 AND 15/419313 PREVIOUSLY RECORDED AT REEL: 44144 FRAME: 1. ASSIGNOR(S) HEREBY CONFIRMS THE CHANGE OF NAME;ASSIGNOR:GOOGLE INC.;REEL/FRAME:068092/0502 Effective date: 20170929 |