US20140278370A1 - Systems and Methods for Customizing Text in Media Content - Google Patents

Systems and Methods for Customizing Text in Media Content

Publication number
US20140278370A1
Authority
US
United States
Prior art keywords
text
media content
media
semantic
content
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
US14/085,963
Other versions
US9645985B2 (en
Inventor
Hsieh-Wei Chen
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
CyberLink Corp
Original Assignee
CyberLink Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by CyberLink Corp
Priority to US14/085,963 (patent US9645985B2)
Assigned to CYBERLINK CORP. Assignment of assignors interest (see document for details). Assignors: CHEN, HSIEH-WEI
Publication of US20140278370A1
Application granted
Publication of US9645985B2
Legal status: Active
Adjusted expiration

Classifications

    • G06F17/24
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/166 Editing, e.g. inserting or deleting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis

Definitions

  • one embodiment is a method implemented in a media processing device.
  • the method comprises obtaining, by the media processing device, media content and performing, by the media processing device, semantic analysis on text in at least a portion of the media content to obtain at least one semantic textual segment each corresponding to a text section of the media content, wherein the text section comprises at least one word in the text in the at least a portion of the media content.
  • the method further comprises generating, by the media processing device, at least one context token corresponding to the at least one semantic textual segment and visually accentuating, by the media processing device, the text section according to the context token.
  • Another embodiment is a system for editing media content, comprising a processor and at least one application executable in the processor.
  • the at least one application comprises a media interface for obtaining media content and a content analyzer for performing semantic analysis on text in at least a portion of the media content to obtain at least one semantic textual segment each corresponding to a text section of the media content, wherein the text section comprises at least one word in the text in the at least a portion of the media content.
  • the at least one application further comprises a tokenizer for generating at least one context token corresponding to the at least one semantic textual segment and a visualizer for visually accentuating the text section according to the context token.
  • Another embodiment is a non-transitory computer-readable medium embodying a program executable in a computing device, comprising code that obtains media content and code that performs semantic analysis on text in at least a portion of the media content to obtain at least one semantic textual segment each corresponding to a text section of the media content, wherein the text section comprises at least one word in the text in the at least a portion of the media content.
  • the code further comprises code that generates at least one context token corresponding to the at least one semantic textual segment and code that visually accentuates the text section according to the context token.
  • Another embodiment is a method implemented in a media processing device.
  • the method comprises obtaining, by the media processing device, media content and performing semantic analysis on a textual portion of the media content and generating, by the media processing device, textual context tokens based on the semantic analysis.
  • the method further comprises performing semantic analysis on an audio portion and on a visual portion of the media content corresponding to the textual portion and generating context tokens relating to the audio and visual portions.
  • the method further comprises combining, by the media processing device, the textual context tokens and the context tokens relating to the audio and visual portions and visually accentuating, by the media processing device, at least one context portrayed in at least a portion of media content according to the combined context tokens.
  • Another embodiment is a method implemented in a media processing device.
  • the method comprises obtaining, by the media processing device, a photo collection comprising digital images and textual content and performing, by the media processing device, semantic analysis on the textual content to obtain at least one semantic textual segment each corresponding to a text section of the photo collection, wherein the text section comprises at least one word in the textual content in the at least a portion of the photo collection.
  • the method further comprises generating, by the media processing device, at least one context token corresponding to the at least one semantic textual segment and visually accentuating, by the media processing device, the text section according to the context token.
  • FIG. 1A is a block diagram of a media processing system for facilitating automatic media editing in accordance with various embodiments of the present disclosure.
  • FIG. 1B illustrates the process flow between various components of the media processing system of FIG. 1A in accordance with various embodiments of the present disclosure.
  • FIG. 2 is a detailed view of the media processing system of FIG. 1A in accordance with various embodiments of the present disclosure.
  • FIGS. 3A and 3B illustrate the format of a context token generated by the media processing system of FIG. 1A in accordance with various embodiments of the present disclosure.
  • FIG. 4 is a top-level flowchart illustrating examples of functionality implemented as portions of the media processing system of FIG. 1A for facilitating automatic media editing according to various embodiments of the present disclosure.
  • FIGS. 5-8 illustrate various examples of subtitle modification performed by the visualizer in the media processing system of FIG. 1A in accordance with various embodiments of the present disclosure.
  • FIG. 9 is a top-level flowchart illustrating examples of functionality implemented as portions of the media processing system of FIG. 1A for facilitating automatic media editing according to an alternative embodiment of the present disclosure.
  • FIG. 10 is a top-level flowchart illustrating examples of functionality implemented as portions of the media processing system of FIG. 1A for facilitating automatic media editing according to an alternative embodiment of the present disclosure.
  • the editing process may involve, for example, stylizing existing subtitles by changing the font color, font size, location of the subtitles, and so on.
  • the editing process may also include inserting captions relating to commentary, descriptions, and so on into media content.
  • editing media content on a frame-by-frame basis can be time consuming.
  • media content is obtained and semantic analysis is performed on at least a portion of the media content, wherein the semantic analysis may involve analyzing visual, audio, and textual cues embedded in the media content that convey the emotions and/or context corresponding to events portrayed in the media content.
  • context tokens characterizing the emotions, context, etc. associated with events being portrayed in the portion of media content are generated.
  • a semantic fusion operation is applied to the context tokens to combine the context tokens, and the combined context tokens are mapped to the text that takes place in the portion of media content, where such text may comprise, for example, subtitles corresponding to dialog in the portion of media content and/or captions in the portion of media content (e.g., a caption describing a sound that occurs in a scene).
  • the subtitles or text corresponding to the mapping are stylized in an automated fashion without the need for a user to manually apply special effects. The subtitles may be stylized by modifying the font, font size, or subtitle location. The modification(s) may also include animation or effects applied to the subtitles.
  • FIG. 1A is a block diagram of a media processing system 102 in which embodiments of the techniques for visually accentuating semantic context of text or events portrayed within media content may be implemented.
  • the media processing system 102 may be embodied, for example, as a desktop computer, computer workstation, laptop, a smartphone 109 , a tablet, or other computing platform that includes a display 104 and may include such input devices as a keyboard 106 and a mouse 108 .
  • for embodiments where the media processing system 102 is embodied as a smartphone 109 or tablet, the user may interface with the media processing system 102 via a touchscreen interface (not shown).
  • the media processing system 102 may be embodied as a video gaming console 171 , which includes a video game controller 172 for receiving user preferences.
  • the video gaming console 171 may be connected to a television (not shown) or other display 104 .
  • the media processing system 102 is configured to retrieve, via the media interface 112 , digital media content 115 stored on a storage medium 120 such as, by way of example and without limitation, a compact disc (CD) or a universal serial bus (USB) flash drive, wherein the digital media content 115 may then be stored locally on a hard drive of the media processing system 102 .
  • the digital media content 115 may be encoded in any of a number of formats including, but not limited to, JPEG (Joint Photographic Experts Group) files, TIFF (Tagged Image File Format) files, PNG (Portable Network Graphics) files, GIF (Graphics Interchange Format) files, BMP (bitmap) files or any number of other digital formats.
  • the digital media content 115 may be encoded in other formats including, but not limited to, Motion Picture Experts Group (MPEG)-1, MPEG-2, MPEG-4, H.264, Third Generation Partnership Project (3GPP), 3GPP-2, Standard-Definition Video (SD-Video), High-Definition Video (HD-Video), Digital Versatile Disc (DVD) multimedia, Video Compact Disc (VCD) multimedia, High-Definition Digital Versatile Disc (HD-DVD) multimedia, Digital Television Video/High-definition Digital Television (DTV/HDTV) multimedia, Audio Video Interleave (AVI), Digital Video (DV), QuickTime (QT) file, Windows Media Video (WMV), Advanced System Format (ASF), Real Media (RM), Flash Media (FLV), an MPEG Audio Layer III (MP3), an MPEG Audio Layer II (MP2), Waveform Audio Format (WAV), Windows Media Audio (WMA), or any number of other digital formats.
  • the media interface 112 in the media processing system 102 may also be configured to retrieve digital media content 115 directly from a digital recording device 107 where a cable 111 or some other interface may be used for coupling the digital recording device 107 to the media processing system 102 .
  • the media processing system 102 may support any one of a number of common computer interfaces, such as, but not limited to IEEE-1394 High Performance Serial Bus (Firewire), USB, a serial connection, and a parallel connection.
  • the digital recording device 107 may also be coupled to the media processing system 102 over a wireless connection or other communication path.
  • the media processing system 102 may be coupled to a network 118 such as, for example, the Internet, intranets, extranets, wide area networks (WANs), local area networks (LANs), wired networks, wireless networks, or other suitable networks, etc., or any combination of two or more such networks.
  • the media processing system 102 may receive digital media content 115 from another computing system 103 .
  • the media processing system 102 may access one or more media content sharing websites 134 hosted on a server 137 via the network 118 to retrieve digital media content 115 .
  • the components executed on the media processing system 102 include a content analyzer 114 , a tokenizer 116 , a semantic fusion operator 119 , a visualizer 121 , and other applications, services, processes, systems, engines, or functionality not discussed in detail herein.
  • the content analyzer 114 is executed to perform semantic analysis on the media content received by the media interface 112 .
  • the tokenizer 116 is executed to generate context tokens based on the semantic analysis, where the context tokens may be generated based on classification of visual cues, audio cues, and textual cues extracted by the content analyzer 114 .
  • the semantic fusion operator 119 is executed to combine the context tokens generated by the tokenizer 116 , and the visualizer 121 is executed to visually accentuate at least one context portrayed in the media content according to the context tokens.
  • the visualizer 121 modifies the appearance of subtitles/captions in the media content by modifying the font, font size, subtitle location, and so on.
  • the user may specify predetermined modifications to be applied for certain contexts. For example, the user may specify that if the content analyzer 114 determines that the context in the media content involves a scary scene, a certain font (e.g., a Gothic font style) is automatically applied to the subtitles relating to that scene or event.
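  • The user-specified modifications described above can be pictured as a lookup from a detected context to a subtitle style. The following is a minimal sketch under stated assumptions: the `SubtitleStyle` fields, the context labels, and the concrete font names, sizes, and colors are all hypothetical and do not come from the disclosure.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class SubtitleStyle:
    """Hypothetical container for the attributes a visualizer might change."""
    font: str
    size_pt: int
    color: str


# Fallback style used when no user rule matches the detected context.
DEFAULT_STYLE = SubtitleStyle(font="Arial", size_pt=24, color="#FFFFFF")

# Illustrative user preferences: context label -> style to apply.
USER_STYLE_RULES: Dict[str, SubtitleStyle] = {
    "horror": SubtitleStyle(font="Gothic", size_pt=28, color="#B00000"),
    "romance": SubtitleStyle(font="Script", size_pt=24, color="#FF8FB1"),
}


def style_for_context(context: str) -> SubtitleStyle:
    """Return the user-specified style for a context, or the default style."""
    return USER_STYLE_RULES.get(context, DEFAULT_STYLE)
```

  • In this sketch, a scene classified as "horror" receives the Gothic style, while any context without a rule keeps the default subtitle appearance.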
  • the media interface 112 obtains media content, where the media content may include subtitles 151 corresponding to the text or commentary within the media content.
  • the subtitles 151 may be embedded directly into the media content, stored separately and superimposed during playback, or stored according to other means as known to those skilled in the art.
  • the media interface 112 forwards the media content to the content analyzer 114, which includes an image analyzer 162, an audio analyzer 164, a tokenizer 116, a text analyzer 170, and other applications, services, processes, systems, engines, or functionality not discussed in detail herein.
  • the content analyzer 114 analyzes the semantic-rich media content to extract information later used for modifying or generating stylized subtitles corresponding to the media content.
  • the media content may comprise video content as well as digital images that include embedded captions stored, for example, as metadata.
  • the image analyzer 162 analyzes the media content and identifies such visual cues as facial expressions, body language of individuals depicted in the media content, physical attributes of individuals, and so on.
  • the image analyzer 162 may also analyze attributes of the media content including, for example, lighting, color temperature, color hue, contrast level, and so on.
  • the audio analyzer 164 analyzes the media content and identifies such audio cues as speech tones of individuals within the media content, speed in which individuals are talking, speech volume, direction of speech, tone, and so on.
  • the audio cues may also include intonation that may serve as an indication of one or more emotions of a speaker.
  • the tokenizer 116 extracts textual information from the media content.
  • the tokenizer 116 may directly process the subtitles 151 and tokenize the words in the subtitles 151 .
  • the tokenizer 116 may be configured to process the audio portion of the media content and extract text information.
  • a speech recognition component 117 in the tokenizer 116 converts audio data into text data when the media content does not include subtitles 151 .
  • the tokenizer 116 processes textual information and breaks the information into meaningful elements that are significant as a group, wherein tokenization may be performed based on lexical analysis.
  • the lexical analysis performed by the tokenizer 116 may be based on regular expressions, specific key words, and so on where such information may be stored in a database 178 .
  • specific key words may comprise any of transition words, conjunctions, words that convey emphasis, repeated words, symbols, predefined keywords from a database, or any combination thereof.
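  • The lexical analysis described above, based on regular expressions and specific key words, might look roughly like the following. This is only an illustrative sketch: the regular expression, the keyword sets, and the label names are assumptions, and a real system would presumably load the key words from the database 178.

```python
import re
from typing import List, Set, Tuple

# Hypothetical keyword sets; the disclosure stores such key words in a database.
EMPHASIS_KEYWORDS: Set[str] = {"awesome", "never", "always", "very"}
TRANSITION_KEYWORDS: Set[str] = {"but", "however", "therefore"}

# A word (optionally with an apostrophe suffix) or a run of ! and ? symbols.
TOKEN_PATTERN = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)?|[!?]+")


def tokenize(subtitle: str) -> List[Tuple[str, str]]:
    """Split a subtitle into (token, label) pairs using simple lexical rules."""
    tokens: List[Tuple[str, str]] = []
    previous = None
    for match in TOKEN_PATTERN.finditer(subtitle):
        text = match.group(0)
        lowered = text.lower()
        if text[0] in "!?":
            label = "symbol"
        elif lowered in EMPHASIS_KEYWORDS:
            label = "emphasis"
        elif lowered in TRANSITION_KEYWORDS:
            label = "transition"
        elif lowered == previous:
            label = "repeated"  # same word twice in a row
        else:
            label = "plain"
        tokens.append((text, label))
        previous = lowered
    return tokens
```

  • The labels produced here correspond loosely to the categories named in the text (transition words, words that convey emphasis, repeated words, symbols); a downstream text analyzer could turn them into textual cues.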
  • based on the lexical analysis performed by the tokenizer 116, the text analyzer 170 extracts textual cues from the media content.
  • the data stored in the database 178 may also include key attributes such as visual attributes (e.g., lighting level, human facial expressions, body language, themes, color hue, color temperature), audio attributes (e.g., volume level), and other attributes.
  • the image analyzer 162 , audio analyzer 164 , and text analyzer 170 respectively generate context tokens 174 relating to the media content.
  • the semantic fusion operator 119 processes the context tokens 174 and combines context tokens relating to similar points within the media content. Note that for some embodiments, the context tokens 174 may be sent directly to the visualizer 121 without being processed by the semantic fusion operator 119 .
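  • One way to read "context tokens relating to similar points within the media content" is that tokens whose time windows overlap belong together. The sketch below groups tokens on that basis. The overlap criterion and the tuple layout (start in seconds, end in seconds, modality name) are assumptions made for illustration, not something the disclosure specifies.

```python
from typing import List, Tuple

# (start_seconds, end_seconds, modality); a hypothetical, simplified token.
Token = Tuple[float, float, str]


def group_by_overlap(tokens: List[Token]) -> List[List[Token]]:
    """Group tokens whose time windows overlap, scanning in start order."""
    groups: List[List[Token]] = []
    group_end = 0.0
    for token in sorted(tokens, key=lambda t: t[0]):
        start, end, _ = token
        if groups and start <= group_end:
            # Window overlaps the current group: extend the group.
            groups[-1].append(token)
            group_end = max(group_end, end)
        else:
            # No overlap: start a new group.
            groups.append([token])
            group_end = end
    return groups
```

  • Each resulting group could then be handed to a fusion step, while tokens that stand alone could be passed on unchanged.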
  • the content analyzer 114 may be configured to first analyze the textual content followed by the audio content and the visual content. Alternatively, the content analyzer 114 may be configured to first analyze the visual content followed by the text content and the audio content. In this regard, the content analyzer 114 may be configured to analyze the various components of the media content in a particular order or concurrently.
  • the semantic fusion operator 119 combines the context tokens 174 , and the mapping module 176 maps the combined context tokens 174 to specific text associated with the event or context in the media content, as described in more detail below.
  • the visualizer 121 modifies the subtitles 151 corresponding to the text, where the modification may include, for example and without limitation, a change in the subtitle font, change in font size, change in font color, and change in subtitle location.
  • the visualizer 121 incorporates the stylistic changes and outputs the modified media content 180 .
  • each context token 174 comprises a media stamp 302 and a semantic vector 304 , where the media stamp 302 corresponds to the media content.
  • the media stamp comprises a time stamp corresponding to a position within the media content.
  • the media stamp may also specify a window of time relative to the time stamp.
  • the media stamp 302 may specify that the corresponding semantic vector 304 corresponds to a time interval spanning 10:33 to 10:57 in the media content.
  • the semantic vector 304 corresponds to semantic concepts derived by the image analyzer 162 ( FIG. 1B ), audio analyzer 164 ( FIG. 1B ), and the text analyzer 170 ( FIG. 1B ).
  • Each semantic vector 304 within a context token may contain one or more entries where each entry comprises a semantic dimension 306 and a corresponding strength value 308 .
  • a semantic dimension 306 corresponds to a contextual cue within the media content and may include visual cues, audio cues, textual cues, and so on.
  • a context token $c_i$ is represented by the following expression:
  • $c_i = \{t_i, v_i\}$,
  • where $t_i$ denotes the media stamp 302 of the context token and $v_i$ denotes the semantic vector 304, which is expressed as:
  • $v_i = (d_1, d_2, \ldots, d_n)$.
  • $d_j$ represents a strength or likelihood value towards a particular semantic dimension, such as but not limited to, a positive atmosphere, a negative atmosphere, a feeling of happiness, sadness, anger, horror, a feeling of mystery, a feeling of romance, a feminine theme, a masculine theme, and so on.
  • the visual content of a particular scene with dark and gray visual attributes may be assigned a higher strength value towards a semantic dimension of negativity, horror, and a feeling of mystery.
  • Speech (i.e., audio) content expressing delight and characterized by a high pitch intonation pattern may be assigned a higher strength value towards a positive feeling, a feeling of happiness, a feminine theme, while a soft, gentle, and low pitch intonation pattern may be assigned a higher strength value towards a positive feeling, a feeling of romance, and a masculine theme.
  • Textual context comprising specific transition keywords may be assigned a higher strength value to a semantic dimension reflecting strong emphasis. For example, a specific phrase such as “with great power, comes great responsibility” may be assigned a higher strength value reflecting strong emphasis, a positive atmosphere, and a masculine theme.
  • the corresponding strength value 308 reflects a confidence level of the semantic dimension 306 .
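  • The context token described above, a media stamp paired with a semantic vector of dimension and strength entries, can be modeled with a small data structure. The following is a sketch only: the class and field names are hypothetical, time is assumed to be expressed in whole seconds, and the strength values in the example are invented for illustration.

```python
from dataclasses import dataclass
from typing import Dict


def mmss_to_seconds(text: str) -> int:
    """Convert an 'MM:SS' position such as '10:33' into seconds."""
    minutes, seconds = text.split(":")
    return int(minutes) * 60 + int(seconds)


@dataclass(frozen=True)
class MediaStamp:
    """Window of time within the media content, in seconds."""
    start_s: int
    end_s: int


@dataclass
class ContextToken:
    """A media stamp plus a semantic vector (dimension -> strength value)."""
    stamp: MediaStamp
    vector: Dict[str, float]

    def dominant_dimension(self) -> str:
        """Return the semantic dimension with the highest strength value."""
        return max(self.vector, key=self.vector.get)


# Example token for the 10:33 to 10:57 window; strengths are made up.
token = ContextToken(
    stamp=MediaStamp(mmss_to_seconds("10:33"), mmss_to_seconds("10:57")),
    vector={"happiness": 0.8, "sadness": 0.1, "anger": 0.05, "fear": 0.05},
)
```

  • Storing the vector as a mapping keeps dimensions sparse, so an analyzer only needs to report the dimensions it actually detected.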
  • the semantic fusion operator 119 ( FIG. 1B ) combines the context tokens 174 to generate a fused semantic vector.
  • the mapping module 176 maps the combined context tokens 174 to specific text associated with the event or context in the media content.
  • a fused semantic vector $v_f^T$ associated with a specified media stamp $T$ is determined by the following expression:
  • $v_f^T = f(v_v^T, v_a^T, v_t^T)$,
  • where $v_v^T$ denotes the semantic vector of visual content for media stamp $T$, $v_a^T$ denotes the semantic vector of audio content for media stamp $T$, $v_t^T$ denotes the semantic vector of text content for media stamp $T$, and $f(\cdot)$ denotes the fusion function.
  • the fusion function may be implemented as an operator for combining semantic vectors.
  • the fusion function may be expressed as a weighted summation function:
  • $f(v_v^T, v_a^T, v_t^T) = w_v^T v_v^T + w_a^T v_a^T + w_t^T v_t^T$,
  • where each $w_i^T$ corresponds to the weight value of each type of semantic vector (i.e., semantic vector of visual content, semantic vector of audio content, and semantic vector of textual content).
  • Each weight value represents the confidence level of a particular semantic vector. For example, the weight value $w_a^T$ for the audio semantic vector $v_a^T$ may be higher if the audio cues during time period $T$ comprise dramatic intonations that occur in a given scene. On the other hand, the weight value $w_v^T$ for the visual semantic vector $v_v^T$ may be lower if the same scene provides few visual cues.
  • the fusion function may also be implemented according to a neural network model.
  • the mapping module 176 then maps the fused semantic vector $v_f^T$ to media or corresponding subtitles according to the media stamp $T$.
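  • The weighted summation form of the fusion function can be sketched in a few lines. This is an illustrative sketch under assumptions: semantic vectors are represented as dictionaries from dimension name to strength value, a dimension missing from a vector is treated as zero, and the `fuse` name and modality keys are hypothetical.

```python
from typing import Dict, Mapping

Vector = Dict[str, float]


def fuse(vectors: Mapping[str, Vector], weights: Mapping[str, float]) -> Vector:
    """Weighted sum of per-modality semantic vectors for one media stamp.

    `vectors` maps a modality name (e.g. "visual", "audio", "text") to its
    semantic vector; `weights` maps the same names to confidence weights.
    """
    fused: Vector = {}
    for modality, vector in vectors.items():
        weight = weights[modality]
        for dimension, strength in vector.items():
            fused[dimension] = fused.get(dimension, 0.0) + weight * strength
    return fused
```

  • For a scene with dramatic intonation but few visual cues, a caller might pass a larger audio weight and a smaller visual weight, so that the audio vector dominates the fused result. Whether the weights should be normalized to sum to one is not stated in the text and is left to the caller here.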
  • FIG. 3A provides an example of a context token with a plurality of semantic dimensions 306 a , 306 b and corresponding strength values 308 a , 308 b .
  • the context token 174 characterizes a window of time in the media content spanning from 10:33 to 10:57, where various semantic dimensions 306 a , 306 b are portrayed in the media content.
  • the image analyzer 162 , audio analyzer 164 , and/or the text analyzer 170 determines based on various contextual cues within the media content that one or more individuals in the media content exhibit such emotions as happiness, sadness, anger, and fear.
  • each of the semantic dimensions 306 a, 306 b has a corresponding strength value 308 a, 308 b, where the semantic dimension 306 corresponding to happiness has the highest confidence level.
  • FIG. 3B is an example of a textual context token 320 comprising a media stamp 322 that specifies the time in which the corresponding text 324 is to be displayed.
  • the textual context token 320 further comprises an entropy value 326 and a semantic vector 328 , wherein the entropy value 326 represents the information content of the particular text section.
  • the text content comprises the subtitle “That's Awesome!”
  • the text segment is tokenized into two text tokens—“That's” and “Awesome”.
  • the text “That's” contains less useful information and is therefore assigned a lower entropy value, whereas the text “Awesome” is assigned a higher entropy value.
  • the higher entropy value triggers the visual accentuation.
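  • The text does not say how the entropy value 326 is computed. One plausible reading, used here purely as an assumption, is the self-information of a word, $-\log_2 p(\text{word})$, so that common words score low and rare words score high. The word probabilities and the threshold below are invented for illustration.

```python
import math
from typing import Dict, List

# Hypothetical unigram probabilities; a real system would estimate these
# from a corpus. Unknown words fall back to a small default probability.
WORD_PROBABILITY: Dict[str, float] = {"that's": 0.02, "awesome": 0.0005}
UNKNOWN_PROBABILITY = 0.0001

# Hypothetical cutoff (in bits) above which a word is visually accentuated.
ACCENT_THRESHOLD_BITS = 8.0


def information_bits(word: str) -> float:
    """Self-information of a word in bits: rarer words carry more information."""
    probability = WORD_PROBABILITY.get(word.lower(), UNKNOWN_PROBABILITY)
    return -math.log2(probability)


def words_to_accentuate(words: List[str]) -> List[str]:
    """Return the words whose information content meets the threshold."""
    return [w for w in words if information_bits(w) >= ACCENT_THRESHOLD_BITS]
```

  • With these made-up numbers, "That's" scores below the threshold and "Awesome" scores above it, which matches the behavior described for the example subtitle.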
  • a negative value for a semantic dimension resolves the contradiction between a visual context token and an audio context token.
  • for example, the audio context token of the corresponding portion has a sadness value of -0.6 while the video context token has a sadness value of 0.4.
  • the bias would be corrected as the sadness dimension is adjusted to a neutral state of zero given the values -0.6 and 0.4.
  • FIG. 2 is a schematic diagram of the media processing system 102 shown in FIG. 1A .
  • the media processing system 102 may be embodied in any one of a wide variety of wired and/or wireless computing devices, such as a desktop computer, portable computer, dedicated server computer, multiprocessor computing device, smartphone 109 ( FIG. 1A ), tablet computing device, and so forth.
  • the media processing system 102 comprises memory 214, a processing device 202, a number of input/output interfaces 204, a network interface 206, a display 104, a peripheral interface 211, and mass storage 226, wherein each of these devices is connected across a local data bus 210.
  • the processing device 202 may include any custom made or commercially available processor, a central processing unit (CPU) or an auxiliary processor among several processors associated with the media processing system 102 , a semiconductor based microprocessor (in the form of a microchip), a macroprocessor, one or more application specific integrated circuits (ASICs), a plurality of suitably configured digital logic gates, and other well known electrical configurations comprising discrete elements both individually and in various combinations to coordinate the overall operation of the computing system.
  • the memory 214 can include any one of a combination of volatile memory elements (e.g., random-access memory (RAM, such as DRAM, and SRAM, etc.)) and nonvolatile memory elements (e.g., ROM, hard drive, tape, CDROM, etc.).
  • the memory 214 typically comprises a native operating system 217 , one or more native applications, emulation systems, or emulated applications for any of a variety of operating systems and/or emulated hardware platforms, emulated operating systems, etc.
  • the applications may include application specific software which may comprise some or all the components (media interface 112 , content analyzer 114 , tokenizer 116 , semantic fusion operator 119 , visualizer 121 ) of the media processing system 102 depicted in FIG. 1A .
  • the components are stored in memory 214 and executed by the processing device 202 .
  • the memory 214 can, and typically will, comprise other components which have been omitted for purposes of brevity.
  • executable may refer to a program file that is in a form that can ultimately be run by the processing device 202 .
  • Examples of executable programs may be, for example, a compiled program that can be translated into machine code in a format that can be loaded into a random access portion of the memory 214 and run by the processing device 202 , source code that may be expressed in proper format such as object code that is capable of being loaded into a random access portion of the memory 214 and executed by the processing device 202 , or source code that may be interpreted by another executable program to generate instructions in a random access portion of the memory 214 to be executed by the processing device 202 , etc.
  • An executable program may be stored in any portion or component of the memory 214 including, for example, random access memory (RAM), read-only memory (ROM), hard drive, solid-state drive, USB flash drive, memory card, optical disc such as compact disc (CD) or digital versatile disc (DVD), floppy disk, magnetic tape, or other memory components.
  • Input/output interfaces 204 provide any number of interfaces for the input and output of data.
  • for embodiments where the media processing system 102 comprises a personal computer, these components may interface with one or more user input devices via the I/O interfaces 204, where the user input devices may comprise a keyboard 106 ( FIG. 1A ) or a mouse 108 ( FIG. 1A ).
  • the display 104 may comprise a computer monitor, a plasma screen for a PC, a liquid crystal display (LCD), a touchscreen display, or other display device 104 .
  • a non-transitory computer-readable medium stores programs for use by or in connection with an instruction execution system, apparatus, or device. More specific examples of a computer-readable medium may include by way of example and without limitation: a portable computer diskette, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM, EEPROM, or Flash memory), and a portable compact disc read-only memory (CDROM) (optical).
  • network interface 206 comprises various components used to transmit and/or receive data over a network environment.
  • the network interface 206 may include a device that can communicate with both inputs and outputs, for instance, a modulator/demodulator (e.g., a modem), a wireless (e.g., radio frequency (RF)) transceiver, a telephonic interface, a bridge, a router, a network card, etc.
  • the media processing system 102 may communicate with one or more computing devices via the network interface 206 over the network 118 ( FIG. 1A ).
  • the media processing system 102 may further comprise mass storage 226 .
  • the peripheral interface 211 supports various interfaces including, but not limited to IEEE-1394 High Performance Serial Bus (Firewire), USB, a serial connection, and a parallel connection.
  • FIG. 4 is a flowchart 400 in accordance with one embodiment for facilitating automatic media editing performed by the media processing system 102 of FIG. 1A . It is understood that the flowchart 400 of FIG. 4 provides merely an example of the many different types of functional arrangements that may be employed to implement the operation of the various components of the media processing system 102 . As an alternative, the flowchart of FIG. 4 may be viewed as depicting an example of steps of a method implemented in the media processing system 102 according to one or more embodiments.
  • FIG. 4 shows a specific order of execution, it is understood that the order of execution may differ from that which is depicted. For example, the order of execution of two or more blocks may be scrambled relative to the order shown. Also, two or more blocks shown in succession in FIG. 4 may be executed concurrently or with partial concurrence. It is understood that all such variations are within the scope of the present disclosure.
  • media content is obtained and in block 420 , semantic analysis is performed on text in at least a portion of the media content to obtain at least one semantic textual segment each corresponding to a text section of the media content.
  • the text section comprises at least one word in the text in the at least a portion of the media content.
  • at least one context token corresponding to the at least one semantic textual segment is generated.
  • the text section is visually accentuated according to the context token.
  • visually accentuating the text section comprises modifying the text section in the at least a portion of the media content and generating captions in the at least a portion of the media content.
  • modifying the visual appearance of text may be performed according to the literal meaning of the text section. For example, if the text section includes the word “fire” or “flame,” the visual appearance of the text section may be modified with a fiery font. As another example, if the text section includes the word “big” or “huge,” the visual appearance of the text section may be enlarged.
  • FIGS. 5-7 provide various examples of modifications performed by the visualizer 121 ( FIG. 1A ) in the media processing system 102 ( FIG. 1A ) in accordance with various embodiments.
  • FIG. 5 is an example where the visualizer 121 ( FIG. 1A ) changes the font size/style as well as the location of the subtitles.
  • the content analyzer 114 ( FIG. 1A ) analyzes such contextual cues as speech volume (e.g., one or more individuals shouting), keywords/phrases (e.g., “watch out”, “warning”), the presence of exclamation points in the subtitles, and so on.
  • the media processing system 102 is “text-aware” and is capable of visually accentuating a text section within text content.
  • the visualizer 121 selectively modifies the text sections containing the text “AWESOME” and the exclamation point in the subtitles. That is, rather than visually accentuating the entire line of subtitles, the visualizer 121 may be configured to visually accentuate only a portion of the subtitles (e.g., selective words/phrases/punctuation marks). In the example shown, only the word “AWESOME” and the exclamation point are visually accentuated by increasing the font size.
  • visually accentuating the text section according to the context token may comprise modifying the text section in the at least a portion of the media content and/or generating captions in the at least a portion of the media content.
  • the visualizer 121 also incorporates animation to further emphasize the words being spoken by the individual.
  • Other forms of animation may include, for example and without limitation, a shrinking/stretching effect, a fade-in/fade-out effect, a shadowing effect, a flipping effect, and so on.
  • the example in FIG. 5 also depicts graphics (i.e., lines) inserted into the media content by the visualizer 121 to indicate which individual is speaking.
  • FIG. 6 is an example where the visualizer 121 ( FIG. 1A ) changes the font size/style of the captions based on the body language of the individual as well as the presence of exclamation marks in the subtitles.
  • FIG. 7 is an example where the visualizer 121 ( FIG. 1A ) changes the font size/style of the captions based on image attributes (e.g., low lighting; night time), keyword (e.g., “Halloween”), the presence of an exclamation mark in the subtitles, and so on.
  • FIG. 8 is an example where the media content comprises digital photos, and where comments, descriptions, and other forms of annotation are embedded with the digital photos.
  • the media processing system 102 may retrieve media content from online photo sharing albums where one or more users upload photos and viewers add corresponding descriptions, comments, etc. to the uploaded photos.
  • the media content comprises digital photos with corresponding descriptions.
  • the text section comprising the word “beautiful” is visually accentuated to place emphasis on this word. Note that only the appearance of “beautiful” is modified.
  • FIG. 9 is a flowchart 900 in accordance with an alternative embodiment for facilitating automatic media editing performed by the media processing system 102 of FIG. 1A . It is understood that the flowchart 900 of FIG. 9 provides merely an example of the many different types of functional arrangements that may be employed to implement the operation of the various components of the media processing system 102 . As an alternative, the flowchart of FIG. 9 may be viewed as depicting an example of steps of a method implemented in the media processing system 102 according to one or more embodiments.
  • FIG. 9 shows a specific order of execution, it is understood that the order of execution may differ from that which is depicted. For example, the order of execution of two or more blocks may be scrambled relative to the order shown. Also, two or more blocks shown in succession in FIG. 9 may be executed concurrently or with partial concurrence. It is understood that all such variations are within the scope of the present disclosure.
  • media content is obtained and semantic analysis is performed on a textual portion of the media content.
  • the media content obtained by the media interface 112 may include subtitles 151 ( FIG. 1B ) or captions.
  • textual context tokens are generated based on the semantic analysis
  • semantic analysis is performed on an audio portion and on a visual portion of the media content corresponding to the textual portion.
  • the image analyzer 162 ( FIG. 1B ) and the audio analyzer 164 ( FIG. 1B ) in the content analyzer 114 ( FIG. 1B ) may be configured to analyze portions of the media content where dialog between individuals takes place.
  • context tokens relating to the audio and visual portions are generated.
  • the textual context tokens are combined with the context tokens relating to the audio and visual portions, and in block 960 , at least one context portrayed in the at least a portion of media content is visually accentuated according to the combined context tokens.
  • FIG. 10 is a flowchart 1000 in accordance with an alternative embodiment for facilitating automatic media editing performed by the media processing system 102 of FIG. 1A . It is understood that the flowchart 1000 of FIG. 10 provides merely an example of the many different types of functional arrangements that may be employed to implement the operation of the various components of the media processing system 102 . As an alternative, the flowchart of FIG. 10 may be viewed as depicting an example of steps of a method implemented in the media processing system 102 according to one or more embodiments.
  • FIG. 10 shows a specific order of execution, it is understood that the order of execution may differ from that which is depicted. For example, the order of execution of two or more blocks may be scrambled relative to the order shown. Also, two or more blocks shown in succession in FIG. 10 may be executed concurrently or with partial concurrence. It is understood that all such variations are within the scope of the present disclosure.
  • a photo collection comprising digital images and textual content is obtained, and in block 1020 , semantic analysis is performed on the textual content to obtain at least one semantic textual segment each corresponding to a text section of the photo collection.
  • the text section comprises at least one word in the textual content in the at least a portion of the photo collection.
  • at least one context token corresponding to the at least one semantic textual segment is generated, and in block 1040 , the text section is visually accentuated according to the context token.
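
The literal-meaning and selective accentuation examples above (a fiery font for “fire” or “flame,” enlargement for “big” or “huge,” and styling only part of a subtitle line) can be illustrated with a minimal Python sketch. The style table, the style keys, and the function name are assumptions introduced for illustration only; the disclosure does not specify how the visualizer 121 represents styles.

```python
import re

# Hypothetical keyword-to-style table; keys and style names are illustrative
# assumptions, not part of the disclosure.
LITERAL_STYLES = {
    "fire": {"font": "fiery"},
    "flame": {"font": "fiery"},
    "big": {"scale": 1.5},
    "huge": {"scale": 1.5},
}


def accentuate(line):
    """Split a subtitle line into (text, style) runs.

    Only pieces that match a keyword receive a non-empty style, so the
    rest of the line keeps its default appearance.
    """
    runs = []
    # Alternate between runs of word characters and runs of everything else,
    # so that joining the pieces reproduces the original line exactly.
    for piece in re.findall(r"\w+|\W+", line):
        style = LITERAL_STYLES.get(piece.lower(), {})
        runs.append((piece, style))
    return runs
```

Calling `accentuate("A huge fire!")` yields one run per word or separator, with a style attached only to “huge” and “fire.” A renderer could then draw each run with its own style.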

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

Various embodiments are disclosed for facilitating automatic media editing. Media content is obtained and semantic analysis is performed on text in at least a portion of the media content to obtain at least one semantic textual segment each corresponding to a text section of the media content, wherein the text section comprises at least one word in the text in the at least a portion of the media content. At least one context token corresponding to the at least one semantic textual segment is generated. The text section is visually accentuated according to the context token.

Description

    CROSS-REFERENCE TO RELATED APPLICATION
  • This application claims priority to, and the benefit of, U.S. Provisional Patent Application entitled, “Subtitle Modalization,” having Ser. No. 61/788,741, filed on Mar. 15, 2013, which is incorporated by reference in its entirety.
  • BACKGROUND
  • With the ever-growing amount of digital content available to consumers through the Internet and other sources, consumers have access to a vast amount of content. With existing media editing tools, users manually edit subtitles or add captions in order to achieve a desired effect or style. This typically involves a great deal of effort on the part of the user in order to emphasize or convey the context of the media content being viewed. Thus, while many media editing tools are readily available, the editing process can be tedious and time-consuming.
  • SUMMARY
  • Briefly described, one embodiment, among others, is a method implemented in a media processing device. The method comprises obtaining, by the media processing device, media content and performing, by the media processing device, semantic analysis on text in at least a portion of the media content to obtain at least one semantic textual segment each corresponding to a text section of the media content, wherein the text section comprises at least one word in the text in the at least a portion of the media content. The method further comprises generating, by the media processing device, at least one context token corresponding to the at least one semantic textual segment and visually accentuating, by the media processing device, the text section according to the context token.
  • Another embodiment is a system for editing media content, comprising a processor and at least one application executable in the processor. The at least one application comprises a media interface for obtaining media content and a content analyzer for performing semantic analysis on text in at least a portion of the media content to obtain at least one semantic textual segment each corresponding to a text section of the media content, wherein the text section comprises at least one word in the text in the at least a portion of the media content. The at least one application further comprises a tokenizer for generating at least one context token corresponding to the at least one semantic textual segment and a visualizer for visually accentuating the text section according to the context token.
  • Another embodiment is a non-transitory computer-readable medium embodying a program executable in a computing device, comprising code that obtains media content and code that performs semantic analysis on text in at least a portion of the media content to obtain at least one semantic textual segment each corresponding to a text section of the media content, wherein the text section comprises at least one word in the text in the at least a portion of the media content. The code further comprises code that generates at least one context token corresponding to the at least one semantic textual segment and code that visually accentuates the text section according to the context token.
  • Another embodiment is a method implemented in a media processing device. The method comprises obtaining, by the media processing device, media content and performing semantic analysis on a textual portion of the media content and generating, by the media processing device, textual context tokens based on the semantic analysis. The method further comprises performing semantic analysis on an audio portion and on a visual portion of the media content corresponding to the textual portion and generating context tokens relating to the audio and visual portions. The method further comprises combining, by the media processing device, the textual context tokens and the context tokens relating to the audio and visual portions and visually accentuating, by the media processing device, at least one context portrayed in at least a portion of media content according to the combined context tokens.
  • Another embodiment is a method implemented in a media processing device. The method comprises obtaining, by the media processing device, a photo collection comprising digital images and textual content and performing, by the media processing device, semantic analysis on the textual content to obtain at least one semantic textual segment each corresponding to a text section of the photo collection, wherein the text section comprises at least one word in the textual content in the at least a portion of the photo collection. The method further comprises generating, by the media processing device, at least one context token corresponding to the at least one semantic textual segment and visually accentuating, by the media processing device, the text section according to the context token.
  • Other systems, methods, features, and advantages of the present disclosure will be or become apparent to one with skill in the art upon examination of the following drawings and detailed description. It is intended that all such additional systems, methods, features, and advantages be included within this description, be within the scope of the present disclosure, and be protected by the accompanying claims.
  • BRIEF DESCRIPTION OF THE DRAWINGS
  • Many aspects of the disclosure can be better understood with reference to the following drawings. The components in the drawings are not necessarily to scale, emphasis instead being placed upon clearly illustrating the principles of the present disclosure. Moreover, in the drawings, like reference numerals designate corresponding parts throughout the several views.
  • FIG. 1A is a block diagram of a media processing system for facilitating automatic media editing in accordance with various embodiments of the present disclosure.
  • FIG. 1B illustrates the process flow between various components of the media processing system of FIG. 1A in accordance with various embodiments of the present disclosure.
  • FIG. 2 is a detailed view of the media processing system of FIG. 1A in accordance with various embodiments of the present disclosure.
  • FIGS. 3A and 3B illustrate the format of a context token generated by the media processing system of FIG. 1A in accordance with various embodiments of the present disclosure.
  • FIG. 4 is a top-level flowchart illustrating examples of functionality implemented as portions of the media processing system of FIG. 1A for facilitating automatic media editing according to various embodiments of the present disclosure.
  • FIGS. 5-8 illustrate various examples of subtitle modification performed by the visualizer in the media processing system of FIG. 1A in accordance with various embodiments of the present disclosure.
  • FIG. 9 is a top-level flowchart illustrating examples of functionality implemented as portions of the media processing system of FIG. 1A for facilitating automatic media editing according to an alternative embodiment of the present disclosure.
  • FIG. 10 is a top-level flowchart illustrating examples of functionality implemented as portions of the media processing system of FIG. 1A for facilitating automatic media editing according to an alternative embodiment of the present disclosure.
  • DETAILED DESCRIPTION
  • One perceived shortcoming with conventional media editing applications is the amount of time involved in manually editing subtitles or inserting captions into media content. The editing process may involve, for example, stylizing existing subtitles by changing the font color, font size, location of the subtitles, and so on. The editing process may also include inserting captions relating to commentary, descriptions, and so on into media content. However, editing media content on a frame-by-frame basis can be time consuming.
  • Various embodiments are disclosed for automatically modifying or generating stylized captions for semantic-rich media. In accordance with various embodiments, media content is obtained and semantic analysis is performed on at least a portion of the media content, wherein the semantic analysis may involve analyzing visual, audio, and textual cues embedded in the media content that convey the emotions and/or context corresponding to events portrayed in the media content.
  • As a result of the semantic analysis, context tokens characterizing the emotions, context, etc. associated with events being portrayed in the portion of media content are generated. A semantic fusion operation is applied to the context tokens to combine the context tokens, and the combined context tokens are mapped to the text that takes place in the portion of media content, where such text may comprise, for example, subtitles corresponding to dialog in the portion of media content and/or captions in the portion of media content (e.g., a caption describing a sound that occurs in a scene). Based on the mapping, the subtitles or text corresponding to the mapping are stylized in an automated fashion without the need for a user to manually apply special effects. The subtitles may be stylized by modifying the font, font size, or subtitle location. The modification(s) may also include animation or effects applied to the subtitles.
  • A system for facilitating automatic media editing is now described, followed by a discussion of the operation of the components within the system. FIG. 1A is a block diagram of a media processing system 102 in which embodiments of the techniques for visually accentuating semantic context of text or events portrayed within media content may be implemented. The media processing system 102 may be embodied, for example, as a desktop computer, computer workstation, laptop, a smartphone 109, a tablet, or other computing platform that includes a display 104 and may include such input devices as a keyboard 106 and a mouse 108.
  • For embodiments where the media processing system 102 is embodied as a smartphone 109 or tablet, the user may interface with the media processing system 102 via a touchscreen interface (not shown). In other embodiments, the media processing system 102 may be embodied as a video gaming console 171, which includes a video game controller 172 for receiving user preferences. For such embodiments, the video gaming console 171 may be connected to a television (not shown) or other display 104.
  • The media processing system 102 is configured to retrieve, via the media interface 112, digital media content 115 stored on a storage medium 120 such as, by way of example and without limitation, a compact disc (CD) or a universal serial bus (USB) flash drive, wherein the digital media content 115 may then be stored locally on a hard drive of the media processing system 102. As one of ordinary skill will appreciate, the digital media content 115 may be encoded in any of a number of formats including, but not limited to, JPEG (Joint Photographic Experts Group) files, TIFF (Tagged Image File Format) files, PNG (Portable Network Graphics) files, GIF (Graphics Interchange Format) files, BMP (bitmap) files or any number of other digital formats.
  • The digital media content 115 may be encoded in other formats including, but not limited to, Motion Picture Experts Group (MPEG)-1, MPEG-2, MPEG-4, H.264, Third Generation Partnership Project (3GPP), 3GPP-2, Standard-Definition Video (SD-Video), High-Definition Video (HD-Video), Digital Versatile Disc (DVD) multimedia, Video Compact Disc (VCD) multimedia, High-Definition Digital Versatile Disc (HD-DVD) multimedia, Digital Television Video/High-definition Digital Television (DTV/HDTV) multimedia, Audio Video Interleave (AVI), Digital Video (DV), QuickTime (QT) file, Windows Media Video (WMV), Advanced System Format (ASF), Real Media (RM), Flash Media (FLV), an MPEG Audio Layer III (MP3), an MPEG Audio Layer II (MP2), Waveform Audio Format (WAV), Windows Media Audio (WMA), or any number of other digital formats.
  • As depicted in FIG. 1A, the media interface 112 in the media processing system 102 may also be configured to retrieve digital media content 115 directly from a digital recording device 107 where a cable 111 or some other interface may be used for coupling the digital recording device 107 to the media processing system 102. The media processing system 102 may support any one of a number of common computer interfaces, such as, but not limited to IEEE-1394 High Performance Serial Bus (Firewire), USB, a serial connection, and a parallel connection.
  • The digital recording device 107 may also be coupled to the media processing system 102 over a wireless connection or other communication path. The media processing system 102 may be coupled to a network 118 such as, for example, the Internet, intranets, extranets, wide area networks (WANs), local area networks (LANs), wired networks, wireless networks, or other suitable networks, etc., or any combination of two or more such networks. Through the network 118, the media processing system 102 may receive digital media content 115 from another computing system 103. Alternatively, the media processing system 102 may access one or more media content sharing websites 134 hosted on a server 137 via the network 118 to retrieve digital media content 115.
  • The components executed on the media processing system 102 include a content analyzer 114, a tokenizer 116, a semantic fusion operator 119, a visualizer 121, and other applications, services, processes, systems, engines, or functionality not discussed in detail herein. The content analyzer 114 is executed to perform semantic analysis on the media content received by the media interface 112. The tokenizer 116 is executed to generate context tokens based on the semantic analysis, where the context tokens may be generated based on classification of visual cues, audio cues, and textual cues extracted by the content analyzer 114.
  • The semantic fusion operator 119 is executed to combine the context tokens generated by the tokenizer 116, and the visualizer 121 is executed to visually accentuate at least one context portrayed in the media content according to the context tokens. For various embodiments, the visualizer 121 modifies the appearance of subtitles/captions in the media content by modifying the font, font size, subtitle location, and so on. For some embodiments, the user may specify predetermined modifications to be applied for certain contexts. For example, the user may specify that if the content analyzer 114 determines that the context in the media content involves a scary scene, a certain font (e.g., a Gothic font style) is automatically applied to the subtitles relating to that scene or event.
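
As a rough illustration of the user-specified modifications described above, the following Python sketch looks up a preset style for a detected context. The `SubtitleStyle` fields, the preset table, and the context label `"scary"` are hypothetical names chosen for this example; only the scary-scene-to-Gothic-font pairing comes from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SubtitleStyle:
    """Assumed set of style attributes a visualizer might apply."""
    font: str = "default"
    size_pt: int = 24
    position: str = "bottom"


# Hypothetical user presets: a scary scene uses a Gothic font style.
USER_PRESETS = {
    "scary": SubtitleStyle(font="Gothic"),
}


def style_for_context(context, presets=USER_PRESETS):
    """Return the user's preset for a context, or the default style."""
    return presets.get(context, SubtitleStyle())
```

A context with no preset falls back to the default style, which keeps the subtitles unchanged unless the user has asked for a modification.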
  • The process flow between the various components of the media processing system 102 is now described. Reference is made to FIG. 1B, which illustrates various components of the media processing system 102 in FIG. 1A. To begin, the media interface 112 obtains media content, where the media content may include subtitles 151 corresponding to the text or commentary within the media content. The subtitles 151 may be embedded directly into the media content, stored separately and superimposed during playback, or stored according to other means as known to those skilled in the art.
  • The media interface 112 forwards the media content to the content analyzer 114, which includes an image analyzer 162, an audio analyzer 164, a tokenizer 116, a text analyzer 170, and other applications, services, processes, systems, engines, or functionality not discussed in detail herein. The content analyzer 114 analyzes the semantic-rich media content to extract information later used for modifying or generating stylized subtitles corresponding to the media content. Note that the media content may comprise video content as well as digital images that include embedded captions stored, for example, as metadata.
  • The image analyzer 162 analyzes the media content and identifies such visual cues as facial expressions, body language of individuals depicted in the media content, physical attributes of individuals, and so on. The image analyzer 162 may also analyze attributes of the media content including, for example, lighting, color temperature, color hue, contrast level, and so on.
  • The audio analyzer 164 analyzes the media content and identifies such audio cues as speech tones of individuals within the media content, speed at which individuals are talking, speech volume, direction of speech, tone, and so on. The audio cues may also include intonation that may serve as an indication of one or more emotions of a speaker. The tokenizer 116 extracts textual information from the media content. For some embodiments, the tokenizer 116 may directly process the subtitles 151 and tokenize the words in the subtitles 151. For situations where the media content does not include subtitles 151, the tokenizer 116 may be configured to process the audio portion of the media content and extract text information. For some embodiments, a speech recognition component 117 in the tokenizer 116 converts audio data into text data when the media content does not include subtitles 151.
  • The tokenizer 116 processes textual information and breaks the information into meaningful elements that are significant as a group, wherein tokenization may be performed based on lexical analysis. The lexical analysis performed by the tokenizer 116 may be based on regular expressions, specific key words, and so on where such information may be stored in a database 178. For some embodiments, specific key words may comprise any of transition words, conjunctions, words that convey emphasis, repeated words, symbols, predefined keywords from a database, or any combination thereof. Based on the lexical analysis performed by the tokenizer 116, the text analyzer 170 extracts textual cues from the media content.
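
The lexical analysis described above can be sketched in Python using a regular expression together with small keyword sets. The word lists, the token categories, and the regular expression are assumptions made for this example; in the disclosure such information may be stored in the database 178, and the actual rules are not specified.

```python
import re

# Hypothetical keyword sets standing in for entries in the database 178.
EMPHASIS_WORDS = {"very", "really", "never"}
TRANSITION_WORDS = {"however", "but", "therefore"}

# Words (letters and apostrophes) or runs of '!' / '?' symbols.
TOKEN_RE = re.compile(r"[A-Za-z']+|[!?]+")


def tokenize(text):
    """Return (token, kind) pairs for a line of subtitle text."""
    tokens = []
    previous = None
    for match in TOKEN_RE.finditer(text):
        raw = match.group()
        word = raw.lower()
        if raw[0] in "!?":
            kind = "symbol"
        elif word == previous:
            kind = "repeated"      # same word twice in a row
        elif word in EMPHASIS_WORDS:
            kind = "emphasis"
        elif word in TRANSITION_WORDS:
            kind = "transition"
        else:
            kind = "word"
        tokens.append((raw, kind))
        previous = word
    return tokens
```

For the line "But it is very very big!", this sketch tags "But" as a transition word, the first "very" as emphasis, the second "very" as a repeated word, and "!" as a symbol. A text analyzer could then derive textual cues from those tags.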
  • The data stored in the database 178 may also include key attributes such as visual attributes (e.g., lighting level, human facial expressions, body language, themes, color hue, color temperature), audio attributes (e.g., volume level), and other attributes. The image analyzer 162, audio analyzer 164, and text analyzer 170 respectively generate context tokens 174 relating to the media content. The semantic fusion operator 119 processes the context tokens 174 and combines context tokens relating to similar points within the media content. Note that for some embodiments, the context tokens 174 may be sent directly to the visualizer 121 without being processed by the semantic fusion operator 119.
  • Note that the content analyzer 114 may be configured to first analyze the textual content followed by the audio content and the visual content. Alternatively, the content analyzer 114 may be configured to first analyze the visual content followed by the text content and the audio content. In this regard, the content analyzer 114 may be configured to analyze the various components of the media content in a particular order or concurrently. The semantic fusion operator 119 combines the context tokens 174, and the mapping module 176 maps the combined context tokens 174 to specific text associated with the event or context in the media content, as described in more detail below. The visualizer 121 modifies the subtitles 151 corresponding to the text, where the modification may include, for example and without limitation, a change in the subtitle font, change in font size, change in font color, and change in subtitle location. The visualizer 121 incorporates the stylistic changes and outputs the modified media content 180.
  • With reference to FIG. 3A, each context token 174 comprises a media stamp 302 and a semantic vector 304, where the media stamp 302 corresponds to the media content. For some embodiments, the media stamp comprises a time stamp corresponding to a position within the media content. The media stamp may also specify a window of time relative to the time stamp. For example, the media stamp 302 may specify that the corresponding semantic vector 304 corresponds to a time interval spanning 10:33 to 10:57 in the media content.
  • The semantic vector 304 corresponds to semantic concepts derived by the image analyzer 162 (FIG. 1B), audio analyzer 164 (FIG. 1B), and the text analyzer 170 (FIG. 1B). Each semantic vector 304 within a context token may contain one or more entries where each entry comprises a semantic dimension 306 and a corresponding strength value 308. A semantic dimension 306 corresponds to a contextual cue within the media content and may include visual cues, audio cues, textual cues, and so on.
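
One possible in-memory representation of a context token, with a media stamp and a semantic vector of dimension/strength entries, is sketched below in Python. The class and field names, the reading of "10:33" as minutes and seconds, and the strength values are assumptions for illustration; the disclosure does not prescribe a data layout.

```python
from dataclasses import dataclass, field


def mmss(text):
    """Convert an 'MM:SS' string to seconds (assumed time-stamp format)."""
    minutes, seconds = text.split(":")
    return int(minutes) * 60 + int(seconds)


@dataclass
class ContextToken:
    """A media stamp (time window) plus a semantic vector."""
    start_s: int
    end_s: int
    # Maps a semantic dimension name to its strength value.
    semantic_vector: dict = field(default_factory=dict)

    def covers(self, t_s):
        """True if time t_s (in seconds) falls inside the media stamp."""
        return self.start_s <= t_s <= self.end_s


# Example token for the 10:33 to 10:57 window; strengths are made up.
token = ContextToken(mmss("10:33"), mmss("10:57"),
                     {"happiness": 0.8, "fear": 0.1})
```

The `covers` helper shows how a mapping step might decide which subtitles fall within a token's time window.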
  • During pre-processing by the image analyzer 162, audio analyzer 164, and the text analyzer 170, visual, audio, and textual content are analyzed and represented by a context token $c_i$, which comprises a media stamp and one or more semantic vectors 304. A context token $c_i$ is represented by the following expression:

  • $$c_i = \{t_i \mid v_i\},$$
  • where $t_i$ denotes the media stamp of the context token, and $v_i$ denotes the semantic vector 304, which is expressed as:

  • $$v_i = (d_1, d_2, \ldots, d_n).$$
  • In the expression above, $d_j$ represents a strength or likelihood value towards a particular semantic dimension, such as but not limited to, a positive atmosphere, a negative atmosphere, a feeling of happiness, sadness, anger, horror, a feeling of mystery, a feeling of romance, a feminine theme, a masculine theme, and so on. For example, the visual content of a particular scene with dark and gray visual attributes may be assigned a higher strength value towards a semantic dimension of negativity, horror, and a feeling of mystery.
  • Speech (i.e., audio) content expressing delight and characterized by a high pitch intonation pattern may be assigned a higher strength value towards a positive feeling, a feeling of happiness, a feminine theme, while a soft, gentle, and low pitch intonation pattern may be assigned a higher strength value towards a positive feeling, a feeling of romance, and a masculine theme. Textual context comprising specific transition keywords may be assigned a higher strength value to a semantic dimension reflecting strong emphasis. For example, a specific phrase such as “with great power, comes great responsibility” may be assigned a higher strength value reflecting strong emphasis, a positive atmosphere, and a masculine theme. In this regard, the corresponding strength value 308 reflects a confidence level of the semantic dimension 306.
  • The semantic fusion operator 119 (FIG. 1B) combines the context tokens 174 to generate a fused semantic vector, and the mapping module 176 (FIG. 1B) maps the combined context tokens 174 to specific text associated with the event or context in the media content. Specifically, a fused semantic vector $v_f^T$ associated with a specified media stamp $T$ is determined by the following expression:

  • $v_f^T = f(v_v^T, v_a^T, v_t^T)$,
  • where $v_v^T$ denotes the semantic vector of visual content for media stamp $T$, $v_a^T$ denotes the semantic vector of audio content for media stamp $T$, $v_t^T$ denotes the semantic vector of text content for media stamp $T$, and $f(\cdot)$ denotes the fusion function. The fusion function may be implemented as an operator for combining semantic vectors. For some embodiments, the fusion function may be expressed as a weighted summation function:

  • $f(v_v^T, v_a^T, v_t^T) = \sum_{i \in \{v,a,t\}} w_i^T v_i^T = w_v^T v_v^T + w_a^T v_a^T + w_t^T v_t^T$,
  • where $w_i^T$ corresponds to the weight value of each type of semantic vector (i.e., semantic vector of visual content, semantic vector of audio content, and semantic vector of textual content). Each weight value represents the confidence level of a particular semantic vector. For example, the weight value $w_a^T$ for the audio semantic vector $v_a^T$ may be higher if the audio cues during time period $T$ comprise dramatic intonations that occur in a given scene. On the other hand, the weight value $w_v^T$ for the visual semantic vector $v_v^T$ may be lower if the same scene provides few visual cues. The fusion function may also be implemented according to a neural network model. The mapping module 176 then maps the fused semantic vector $v_f^T$ to media or corresponding subtitles according to the media stamp $T$.
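  • The weighted summation form of the fusion function can be sketched as follows. This is an illustrative example under stated assumptions rather than the disclosed implementation: the function name `fuse`, the modality keys, the weights, and the strength values are hypothetical, and dimensions absent from a modality's vector are assumed to contribute zero.

```python
from typing import Dict

Vector = Dict[str, float]


def fuse(vectors: Dict[str, Vector], weights: Dict[str, float]) -> Vector:
    """Weighted sum of per-modality semantic vectors for one media stamp T.

    `vectors` maps a modality name ("visual", "audio", "text") to its
    semantic vector; `weights` maps the same names to confidence weights.
    Dimensions missing from a vector are treated as 0 (an assumption).
    """
    fused: Vector = {}
    for modality, vector in vectors.items():
        w = weights[modality]
        for dimension, strength in vector.items():
            fused[dimension] = fused.get(dimension, 0.0) + w * strength
    return fused


# Hypothetical inputs: dramatic audio gets the largest weight,
# a visually sparse scene gets the smallest.
vectors = {
    "visual": {"happiness": 0.2, "sadness": 0.4},
    "audio": {"happiness": 0.8, "sadness": -0.6},
    "text": {"happiness": 0.5},
}
weights = {"visual": 0.2, "audio": 0.5, "text": 0.3}
fused = fuse(vectors, weights)
print({dim: round(value, 2) for dim, value in fused.items()})
```

  • With these inputs the happiness component is 0.2·0.2 + 0.5·0.8 + 0.3·0.5 = 0.59 and the sadness component is 0.2·0.4 + 0.5·(−0.6) = −0.22, so the heavily weighted audio vector dominates the fused result.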
  • FIG. 3A provides an example of a context token with a plurality of semantic dimensions 306 a, 306 b and corresponding strength values 308 a, 308 b. In the example shown, the context token 174 characterizes a window of time in the media content spanning from 10:33 to 10:57, where various semantic dimensions 306 a, 306 b are portrayed in the media content. In this example, the image analyzer 162, audio analyzer 164, and/or the text analyzer 170 determines based on various contextual cues within the media content that one or more individuals in the media content exhibit such emotions as happiness, sadness, anger, and fear. As shown, each of the semantic dimensions 306 a, 306 b has a corresponding strength value 308 a, 308 b, where the semantic dimension 306 corresponding to happiness has the highest confidence level.
  • FIG. 3B is an example of a textual context token 320 comprising a media stamp 322 that specifies the time in which the corresponding text 324 is to be displayed. The textual context token 320 further comprises an entropy value 326 and a semantic vector 328, wherein the entropy value 326 represents the information content of the particular text section. In the example shown, the text content comprises the subtitle “That's Awesome!” The text segment is tokenized into two text tokens: “That's” and “Awesome”. The text “That's” contains less useful information and is therefore assigned a lower entropy value, whereas the text “Awesome” is assigned a higher entropy value. The higher entropy value triggers the visual accentuation. Moreover, a negative value for a semantic dimension can resolve a contradiction between visual and audio context tokens. For example, the audio context token of the corresponding portion has a sadness value of −0.6 while the video context token has a sadness value of 0.4. In this case, the bias would be corrected as the sadness dimension is adjusted to a neutral state of zero given the values −0.6 and 0.4.
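  • The description does not specify how the entropy value is computed. As one hedged illustration, the sketch below uses the self-information of a word, −log2(p), as a stand-in for the entropy value and flags tokens above a cutoff. The probability table, the fallback probability, the threshold, and the function names are all hypothetical.

```python
import math
import re
from typing import Dict, List, Tuple

# Hypothetical relative word frequencies; a real system would estimate
# these from a corpus. Common function words get a high probability.
WORD_PROBABILITY: Dict[str, float] = {"that's": 0.05, "awesome": 0.001}
DEFAULT_PROBABILITY = 0.01   # assumed fallback for unlisted words
ENTROPY_THRESHOLD = 8.0      # assumed cutoff, in bits


def information_content(word: str) -> float:
    """Self-information -log2(p) of a word, used here as its entropy value."""
    p = WORD_PROBABILITY.get(word.lower(), DEFAULT_PROBABILITY)
    return -math.log2(p)


def tokens_to_accentuate(subtitle: str) -> List[Tuple[str, float, bool]]:
    """Tokenize a subtitle and flag tokens whose value exceeds the threshold."""
    words = re.findall(r"[A-Za-z']+", subtitle)
    result = []
    for word in words:
        value = information_content(word)
        result.append((word, value, value > ENTROPY_THRESHOLD))
    return result


result = tokens_to_accentuate("That's Awesome!")
for word, value, flagged in result:
    print(word, round(value, 2), flagged)
```

  • Under these assumed probabilities, “That's” scores roughly 4.3 bits and is left unchanged, while “Awesome” scores roughly 10 bits and is flagged for visual accentuation.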
  • FIG. 2 is a schematic diagram of the media processing system 102 shown in FIG. 1A. The media processing system 102 may be embodied in any one of a wide variety of wired and/or wireless computing devices, such as a desktop computer, portable computer, dedicated server computer, multiprocessor computing device, smartphone 109 (FIG. 1A), tablet computing device, and so forth. As shown in FIG. 2, the media processing system 102 comprises memory 214, a processing device 202, a number of input/output interfaces 204, a network interface 206, a display 104, a peripheral interface 211, and mass storage 226, wherein each of these devices is connected across a local data bus 210.
  • The processing device 202 may include any custom made or commercially available processor, a central processing unit (CPU) or an auxiliary processor among several processors associated with the media processing system 102, a semiconductor based microprocessor (in the form of a microchip), a macroprocessor, one or more application specific integrated circuits (ASICs), a plurality of suitably configured digital logic gates, and other well known electrical configurations comprising discrete elements both individually and in various combinations to coordinate the overall operation of the computing system.
  • The memory 214 can include any one of a combination of volatile memory elements (e.g., random-access memory (RAM, such as DRAM, and SRAM, etc.)) and nonvolatile memory elements (e.g., ROM, hard drive, tape, CDROM, etc.). The memory 214 typically comprises a native operating system 217, one or more native applications, emulation systems, or emulated applications for any of a variety of operating systems and/or emulated hardware platforms, emulated operating systems, etc.
  • The applications may include application specific software which may comprise some or all the components (media interface 112, content analyzer 114, tokenizer 116, semantic fusion operator 119, visualizer 121) of the media processing system 102 depicted in FIG. 1A. In accordance with such embodiments, the components are stored in memory 214 and executed by the processing device 202. One of ordinary skill in the art will appreciate that the memory 214 can, and typically will, comprise other components which have been omitted for purposes of brevity.
  • In this regard, the term “executable” may refer to a program file that is in a form that can ultimately be run by the processing device 202. Examples of executable programs may be, for example, a compiled program that can be translated into machine code in a format that can be loaded into a random access portion of the memory 214 and run by the processing device 202, source code that may be expressed in proper format such as object code that is capable of being loaded into a random access portion of the memory 214 and executed by the processing device 202, or source code that may be interpreted by another executable program to generate instructions in a random access portion of the memory 214 to be executed by the processing device 202, etc. An executable program may be stored in any portion or component of the memory 214 including, for example, random access memory (RAM), read-only memory (ROM), hard drive, solid-state drive, USB flash drive, memory card, optical disc such as compact disc (CD) or digital versatile disc (DVD), floppy disk, magnetic tape, or other memory components.
  • Input/output interfaces 204 provide any number of interfaces for the input and output of data. For example, where the media processing system 102 comprises a personal computer, these components may interface with one or more user input devices via the I/O interfaces 204, where the user input devices may comprise a keyboard 106 (FIG. 1A) or a mouse 108 (FIG. 1A). The display 104 may comprise a computer monitor, a plasma screen for a PC, a liquid crystal display (LCD), a touchscreen display, or other display device 104.
  • In the context of this disclosure, a non-transitory computer-readable medium stores programs for use by or in connection with an instruction execution system, apparatus, or device. More specific examples of a computer-readable medium may include by way of example and without limitation: a portable computer diskette, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM, EEPROM, or Flash memory), and a portable compact disc read-only memory (CDROM) (optical).
  • With further reference to FIG. 2, network interface 206 comprises various components used to transmit and/or receive data over a network environment. For example, the network interface 206 may include a device that can communicate with both inputs and outputs, for instance, a modulator/demodulator (e.g., a modem), a wireless (e.g., radio frequency (RF)) transceiver, a telephonic interface, a bridge, a router, a network card, etc. The media processing system 102 may communicate with one or more computing devices via the network interface 206 over the network 118 (FIG. 1A). The media processing system 102 may further comprise mass storage 226. The peripheral interface 211 supports various interfaces including, but not limited to, IEEE-1394 High Performance Serial Bus (Firewire), USB, a serial connection, and a parallel connection.
  • Reference is made to FIG. 4, which is a flowchart 400 in accordance with one embodiment for facilitating automatic media editing performed by the media processing system 102 of FIG. 1A. It is understood that the flowchart 400 of FIG. 4 provides merely an example of the many different types of functional arrangements that may be employed to implement the operation of the various components of the media processing system 102. As an alternative, the flowchart of FIG. 4 may be viewed as depicting an example of steps of a method implemented in the media processing system 102 according to one or more embodiments.
  • Although the flowchart of FIG. 4 shows a specific order of execution, it is understood that the order of execution may differ from that which is depicted. For example, the order of execution of two or more blocks may be scrambled relative to the order shown. Also, two or more blocks shown in succession in FIG. 4 may be executed concurrently or with partial concurrence. It is understood that all such variations are within the scope of the present disclosure.
  • Beginning with block 410, media content is obtained and in block 420, semantic analysis is performed on text in at least a portion of the media content to obtain at least one semantic textual segment each corresponding to a text section of the media content. For some embodiments, the text section comprises at least one word in the text in the at least a portion of the media content. In block 430, at least one context token corresponding to the at least one semantic textual segment is generated. In block 440, the text section is visually accentuated according to the context token. For some embodiments, visually accentuating the text section comprises modifying the text section in the at least a portion of the media content and generating captions in the at least a portion of the media content. Note that modifying the visual appearance of text may be performed according to the literal meaning of the text section. For example, if the text section includes the word “fire” or “flame,” the visual appearance of the text section may be modified with a fiery font. As another example, if the text section includes the word “big” or “huge,” the visual appearance of the text section may be enlarged.
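  • The literal-meaning modification described above can be sketched as a simple lookup. This is an illustration under stated assumptions: the table `LITERAL_STYLES`, the style attribute names `font_effect` and `font_scale`, and the scale factor are hypothetical placeholders and do not correspond to any particular rendering API.

```python
from typing import Dict, Optional

# Hypothetical style table keyed by word; the attribute names are
# placeholders rather than any real rendering API.
LITERAL_STYLES: Dict[str, Dict[str, object]] = {
    "fire": {"font_effect": "fiery"},
    "flame": {"font_effect": "fiery"},
    "big": {"font_scale": 1.5},
    "huge": {"font_scale": 1.5},
}


def literal_style(word: str) -> Optional[Dict[str, object]]:
    """Return a style for a word based on its literal meaning, if any."""
    # Normalize case and drop trailing punctuation before the lookup.
    return LITERAL_STYLES.get(word.lower().strip(".,!?"))


print(literal_style("Fire!"))  # {'font_effect': 'fiery'}
print(literal_style("the"))    # None
```

  • Words without an entry return `None`, so a caller can fall back to accentuation driven by the context token alone.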
  • To further illustrate the media editing techniques disclosed, reference is made to FIGS. 5-7, which provide various examples of modifications performed by the visualizer 121 (FIG. 1A) in the media processing system 102 (FIG. 1A) in accordance with various embodiments. FIG. 5 is an example where the visualizer 121 (FIG. 1A) changes the font size/style as well as the location of the subtitles. In the example to the left in FIG. 5, the content analyzer 114 (FIG. 1A) analyzes such contextual cues as speech volume (e.g., one or more individuals shouting), keywords/phrases (e.g., “watch out”, “warning”), the presence of exclamation points in the subtitles, and so on. In this regard, the media processing system 102 is “text-aware” and is capable of visually accentuating a text section within text content.
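  • The textual cues mentioned above (keywords/phrases and exclamation points) could be collected as in the following sketch. The function name `textual_cues` and the returned keys are hypothetical; the phrase list contains only the two examples given in the description, and audio cues such as speech volume are outside the scope of this example.

```python
import re
from typing import Dict, List

# Hypothetical keyword list; the description mentions "watch out" and "warning".
KEY_PHRASES = ("watch out", "warning")


def textual_cues(subtitle: str) -> Dict[str, object]:
    """Collect simple textual cues from one line of subtitles."""
    lowered = subtitle.lower()
    # Match whole phrases only, so "warnings" would not count as "warning".
    phrases: List[str] = [
        phrase for phrase in KEY_PHRASES
        if re.search(r"\b" + re.escape(phrase) + r"\b", lowered)
    ]
    return {
        "key_phrases": phrases,
        "exclamations": subtitle.count("!"),
    }


cues = textual_cues("WATCH OUT! It's a trap!")
print(cues)  # {'key_phrases': ['watch out'], 'exclamations': 2}
```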
  • In the example to the right in FIG. 5, the visualizer 121 selectively modifies the text sections containing the text “AWESOME” and the exclamation point in the subtitles. That is, rather than visually accentuating the entire line of subtitles, the visualizer 121 may be configured to visually accentuate only a portion of the subtitles (e.g., selective words/phrases/punctuation marks). In the example shown, only the word “AWESOME” and the exclamation point are visually accentuated by increasing the font size. In this regard, visually accentuating the text section according to the context token may comprise modifying the text section in the at least a portion of the media content and/or generating captions in the at least a portion of the media content.
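  • Selective accentuation of only part of a subtitle line can be sketched as follows. The description does not prescribe an output format; emitting HTML `span` elements with an enlarged font size is an assumption made purely for illustration, as are the function name `accentuate` and the 1.5 scale factor.

```python
import html
import re
from typing import Set


def accentuate(line: str, targets: Set[str], scale: float = 1.5) -> str:
    """Wrap only the targeted words/punctuation in a larger-font span."""
    # Split into words (with optional apostrophe suffix), single
    # punctuation marks, and whitespace runs so the line can be rebuilt.
    parts = re.findall(r"\w+(?:'\w+)?|[^\w\s]|\s+", line)
    out = []
    for part in parts:
        escaped = html.escape(part)
        if part.lower() in targets:
            out.append(f'<span style="font-size:{scale}em">{escaped}</span>')
        else:
            out.append(escaped)
    return "".join(out)


styled = accentuate("That's AWESOME!", {"awesome", "!"})
print(styled)
```

  • Here only “AWESOME” and the exclamation point are wrapped, while “That's” passes through unchanged apart from HTML escaping of the apostrophe.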
  • As shown, the visualizer 121 also incorporates animation to further emphasize the words being spoken by the individual. Other forms of animation may include, for example and without limitation, a shrinking/stretching effect, a fade-in/fade-out effect, a shadowing effect, a flipping effect, and so on. The example in FIG. 5 also depicts graphics (i.e., lines) inserted into the media content by the visualizer 121 to indicate which individual is speaking.
  • FIG. 6 is an example where the visualizer 121 (FIG. 1A) changes the font size/style of the captions based on the body language of the individual as well as the presence of exclamation marks in the subtitles. FIG. 7 is an example where the visualizer 121 (FIG. 1A) changes the font size/style of the captions based on image attributes (e.g., low lighting; night time), keyword (e.g., “Halloween”), the presence of an exclamation mark in the subtitles, and so on.
  • FIG. 8 is an example where the media content comprises digital photos, and where comments, descriptions, and other forms of annotation are embedded with the digital photos. For example, with reference back to the media content website 134 shown in FIG. 1, the media processing system 102 may retrieve media content from online photo sharing albums where one or more users upload photos and viewers add corresponding descriptions, comments, etc. to the uploaded photos. In the example shown in FIG. 8, the media content comprises digital photos with corresponding descriptions. As shown, the text section comprising the word “beautiful” is visually accentuated to place emphasis on this word. Note that only the appearance of “beautiful” is modified.
  • Reference is made to FIG. 9, which is a flowchart 900 in accordance with an alternative embodiment for facilitating automatic media editing performed by the media processing system 102 of FIG. 1A. It is understood that the flowchart 900 of FIG. 9 provides merely an example of the many different types of functional arrangements that may be employed to implement the operation of the various components of the media processing system 102. As an alternative, the flowchart of FIG. 9 may be viewed as depicting an example of steps of a method implemented in the media processing system 102 according to one or more embodiments.
  • Although the flowchart of FIG. 9 shows a specific order of execution, it is understood that the order of execution may differ from that which is depicted. For example, the order of execution of two or more blocks may be scrambled relative to the order shown. Also, two or more blocks shown in succession in FIG. 9 may be executed concurrently or with partial concurrence. It is understood that all such variations are within the scope of the present disclosure.
  • Beginning with block 910, media content is obtained and semantic analysis is performed on a textual portion of the media content. For example, as shown in FIG. 1B, the media content obtained by the media interface 112 (FIG. 1B) may include subtitles 151 (FIG. 1B) or captions.
  • In block 920, textual context tokens are generated based on the semantic analysis, and in block 930, semantic analysis is performed on an audio portion and on a visual portion of the media content corresponding to the textual portion. For example, the image analyzer 162 (FIG. 1B) and the audio analyzer 164 (FIG. 1B) in the content analyzer 114 (FIG. 1B) may be configured to analyze portions of the media content where dialog between individuals takes place.
  • In block 940, context tokens relating to the audio and visual portions are generated. In block 950, the textual context tokens are combined with the context tokens relating to the audio and visual portions, and in block 960, at least one context portrayed in the at least a portion of media content is visually accentuated according to the combined context tokens.
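  • The combining step of block 950 can be sketched as matching tokens whose media stamps overlap in time. This is a simplified illustration, not the disclosed method: the tuple layout, the `overlaps` and `combine` helpers, and the unweighted per-dimension sum are assumptions, and a weighted fusion function could be substituted for the sum.

```python
from typing import Dict, List, Tuple

Stamp = Tuple[float, float]              # (start, end) in seconds
Token = Tuple[Stamp, Dict[str, float]]   # (media stamp, semantic vector)


def overlaps(a: Stamp, b: Stamp) -> bool:
    """True when two time windows share any interval."""
    return a[0] < b[1] and b[0] < a[1]


def combine(text_token: Token, other_tokens: List[Token]) -> Token:
    """Merge a textual token with overlapping audio/visual tokens.

    Strength values are summed per dimension (an unweighted assumption).
    """
    stamp, vector = text_token
    combined = dict(vector)
    for other_stamp, other_vector in other_tokens:
        if overlaps(stamp, other_stamp):
            for dimension, strength in other_vector.items():
                combined[dimension] = combined.get(dimension, 0.0) + strength
    return stamp, combined


# Hypothetical tokens: the audio token overlaps the subtitle window,
# the visual token belongs to a later scene and is ignored.
text_token = ((10.0, 12.0), {"emphasis": 0.5})
audio_token = ((9.0, 11.0), {"emphasis": 0.25, "happiness": 0.5})
visual_token = ((20.0, 25.0), {"horror": 0.9})
combined = combine(text_token, [audio_token, visual_token])
print(combined)  # ((10.0, 12.0), {'emphasis': 0.75, 'happiness': 0.5})
```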
  • Reference is made to FIG. 10, which is a flowchart 1000 in accordance with an alternative embodiment for facilitating automatic media editing performed by the media processing system 102 of FIG. 1A. It is understood that the flowchart 1000 of FIG. 10 provides merely an example of the many different types of functional arrangements that may be employed to implement the operation of the various components of the media processing system 102. As an alternative, the flowchart of FIG. 10 may be viewed as depicting an example of steps of a method implemented in the media processing system 102 according to one or more embodiments.
  • Although the flowchart of FIG. 10 shows a specific order of execution, it is understood that the order of execution may differ from that which is depicted. For example, the order of execution of two or more blocks may be scrambled relative to the order shown. Also, two or more blocks shown in succession in FIG. 10 may be executed concurrently or with partial concurrence. It is understood that all such variations are within the scope of the present disclosure.
  • Beginning with block 1010, a photo collection comprising digital images and textual content is obtained, and in block 1020, semantic analysis is performed on the textual content to obtain at least one semantic textual segment each corresponding to a text section of the photo collection. For some embodiments, the text section comprises at least one word in the textual content in the at least a portion of the photo collection. In block 1030, at least one context token corresponding to the at least one semantic textual segment is generated, and in block 1040, the text section is visually accentuated according to the context token.
  • It should be emphasized that the above-described embodiments of the present disclosure are merely possible examples of implementations set forth for a clear understanding of the principles of the disclosure. Many variations and modifications may be made to the above-described embodiment(s) without departing substantially from the spirit and principles of the disclosure. All such modifications and variations are intended to be included herein within the scope of this disclosure and protected by the following claims.

Claims (32)

At least the following is claimed:
1. A method implemented in a media processing device, comprising:
obtaining, by the media processing device, media content;
performing, by the media processing device, semantic analysis on text in at least a portion of the media content to obtain at least one semantic textual segment each corresponding to a text section of the media content, wherein the text section comprises at least one word in the text in the at least a portion of the media content;
generating, by the media processing device, at least one context token corresponding to the at least one semantic textual segment; and
visually accentuating, by the media processing device, the text section according to the at least one context token.
2. The method of claim 1, wherein performing semantic analysis on text in the at least a portion of the media content comprises:
performing lexical analysis and tokenizing the text; and
identifying predetermined key words in the tokenized text.
3. The method of claim 2, wherein the predetermined key words comprise at least one of:
transition words;
conjunctions;
words that convey emphasis;
repeated words;
symbols; and
predefined keywords from a database.
4. The method of claim 1, wherein the media content comprises visual content and wherein the text comprises subtitles.
5. The method of claim 4, wherein visually accentuating the text section according to the at least one context token comprises visually accentuating the text section within a line of subtitles according to a media stamp specifying a time for displaying the line of subtitles, and wherein the text section comprises a portion of the line.
6. The method of claim 1, wherein the media content comprises digital photos, and wherein the text comprises annotation of the digital photos.
7. The method of claim 6, wherein the text section comprises a portion of the annotation.
8. The method of claim 1, wherein visually accentuating, by the media processing device, the text section according to the at least one context token comprises at least one of:
modifying the text section in the at least a portion of the media content; and
generating captions in the at least a portion of the media content.
9. The method of claim 8, wherein generating captions comprises at least one of:
generating text with animated graphics; and
generating text with a varying position.
10. The method of claim 8, wherein modifying the text section comprises:
mapping the at least one context token to the text section; and
modifying a visual appearance of the text section according to the mapping.
11. The method of claim 10, wherein modifying the visual appearance comprises at least one of:
modifying a font type of the text section;
modifying a font size of the text section;
modifying a font color of the text section;
modifying a font effect of the text section; and
modifying a location of the text section.
12. The method of claim 10, wherein modifying the visual appearance of text further comprises modifying the visual appearance of the text section according to the literal meaning of the text section.
13. The method of claim 1, wherein the text section comprises a plurality of words in the text in the at least a portion of the media content, and wherein visually accentuating the text section according to the at least one context token comprises visually accentuating each of the words in the text section differently.
14. The method of claim 1, further comprising:
performing semantic analysis on at least one of: audio content and visual content in at least a portion of the media content to obtain at least one of: a semantic audio segment and a semantic visual segment, each corresponding to at least one of: an audio section and a visual section of the media content;
generating at least one context token corresponding to the at least one of: the semantic audio segment and the semantic visual segment;
combining the at least one context token; and
visually accentuating the text section according to the combined context tokens.
15. The method of claim 14, wherein performing semantic analysis on visual content in the at least a portion of the media content comprises analyzing at least one of:
human facial expressions;
body language;
themes;
color hue;
color temperature of the at least a portion of the media content; and
predefined image patterns/styles from a database.
16. The method of claim 14, wherein performing semantic analysis on audio content in the at least a portion of the media content comprises analyzing at least one of:
speech tone;
speech speed;
fluency;
punctuation;
location in which audio content originates;
direction in which audio content is conveyed;
speech volume; and
predefined audio patterns/styles from a database.
17. The method of claim 14, wherein each context token comprises a media stamp and a semantic vector.
18. The method of claim 17, wherein each semantic vector comprises at least one semantic dimension and a corresponding strength value of the semantic dimension.
19. The method of claim 1, wherein performing semantic analysis comprises analyzing emotional expressions and contexts of the at least a portion of the media content.
20. The method of claim 19, wherein performing semantic analysis is performed on at least one of text, visual content, and audio content in the at least a portion of the media content.
21. A system for editing media content, comprising:
a processor; and
at least one application executable in the processor, the at least one application comprising:
a media interface for obtaining media content;
a content analyzer for performing semantic analysis on text in at least a portion of the media content to obtain at least one semantic textual segment each corresponding to a text section of the media content, wherein the text section comprises at least one word in the text in the at least a portion of the media content;
a tokenizer for generating at least one context token corresponding to the at least one semantic textual segment; and
a visualizer for visually accentuating the text section according to the at least one context token.
22. The system of claim 21, wherein each context token comprises a media stamp and a semantic vector.
23. The system of claim 22, wherein each semantic vector comprises at least one semantic dimension and a corresponding strength value of the semantic dimension, and wherein each media stamp comprises a time stamp in the media content corresponding to the semantic vector.
24. The system of claim 21, wherein the content analyzer performs semantic analysis by analyzing emotional expressions and contexts of the at least a portion of the media content.
25. The system of claim 24, wherein performing semantic analysis is performed on at least one of text, images, and audio in the at least a portion of the media content.
26. The system of claim 21, wherein the visualizer visually accentuates the text section by performing at least one of:
modifying text in the at least a portion of the media content; and
generating captions in the at least a portion of the media content.
27. A non-transitory computer-readable medium embodying a program executable in a computing device, comprising:
code that obtains media content;
code that performs semantic analysis on text in at least a portion of the media content to obtain at least one semantic textual segment each corresponding to a text section of the media content, wherein the text section comprises at least one word in the text in the at least a portion of the media content;
code that generates at least one context token corresponding to the at least one semantic textual segment; and
code that visually accentuates the text section according to the at least one context token.
28. The non-transitory computer-readable medium of claim 27, wherein the code that visually accentuates at least one context further comprises:
code that modifies text in the at least a portion of the media content; and
code that generates captions in the at least a portion of the media content.
29. A method implemented in a media processing device, comprising:
obtaining, by the media processing device, media content and performing semantic analysis on a textual portion of the media content;
generating, by the media processing device, textual context tokens based on the semantic analysis;
performing semantic analysis on an audio portion and on a visual portion of the media content corresponding to the textual portion;
generating context tokens relating to the audio and visual portions;
combining, by the media processing device, the textual context tokens and the context tokens relating to the audio and visual portions; and
visually accentuating, by the media processing device, at least one context portrayed in at least a portion of media content according to the combined context tokens.
30. A method implemented in a media processing device, comprising:
obtaining, by the media processing device, a photo collection comprising digital images and textual content;
performing, by the media processing device, semantic analysis on the textual content to obtain at least one semantic textual segment each corresponding to a text section of the photo collection, wherein the text section comprises at least one word in the textual content in the at least a portion of the photo collection;
generating, by the media processing device, at least one context token corresponding to the at least one semantic textual segment; and
visually accentuating, by the media processing device, the text section according to the at least one context token.
31. The method of claim 30, wherein the textual content comprises annotation of the digital photos.
32. The method of claim 31, wherein the text section comprises a portion of the annotation.
US14/085,963 2013-03-15 2013-11-21 Systems and methods for customizing text in media content Active 2034-07-18 US9645985B2 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US14/085,963 US9645985B2 (en) 2013-03-15 2013-11-21 Systems and methods for customizing text in media content

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US201361788741P 2013-03-15 2013-03-15
US14/085,963 US9645985B2 (en) 2013-03-15 2013-11-21 Systems and methods for customizing text in media content

Publications (2)

Publication Number Publication Date
US20140278370A1 true US20140278370A1 (en) 2014-09-18
US9645985B2 US9645985B2 (en) 2017-05-09

Family

ID=51531799

Family Applications (1)

Application Number Title Priority Date Filing Date
US14/085,963 Active 2034-07-18 US9645985B2 (en) 2013-03-15 2013-11-21 Systems and methods for customizing text in media content

Country Status (1)

Country Link
US (1) US9645985B2 (en)

Citations (27)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
US7089504B1 (en) * | 2000-05-02 | 2006-08-08 | Walt Froloff | System and method for embedment of emotive content in modern text processing, publishing and communication
US20080320378A1 (en) * | 2005-10-22 | 2008-12-25 | Jeff Shuter | Accelerated Visual Text to Screen Translation Method
US20090153288A1 (en) * | 2007-12-12 | 2009-06-18 | Eric James Hope | Handheld electronic devices with remote control functionality and gesture recognition
US20090164888A1 (en) * | 2007-12-19 | 2009-06-25 | Thomas Phan | Automated Content-Based Adjustment of Formatting and Application Behavior
US20090208118A1 (en) * | 2008-02-19 | 2009-08-20 | Xerox Corporation | Context dependent intelligent thumbnail images
US20100318360A1 (en) * | 2009-06-10 | 2010-12-16 | Toyota Motor Engineering & Manufacturing North America, Inc. | Method and system for extracting messages
US20110047508A1 (en) * | 2009-07-06 | 2011-02-24 | Onerecovery, Inc. | Status indicators and content modules for recovery based social networking
US20110231180A1 (en) * | 2010-03-19 | 2011-09-22 | Verizon Patent And Licensing Inc. | Multi-language closed captioning
US20110276327A1 (en) * | 2010-05-06 | 2011-11-10 | Sony Ericsson Mobile Communications Ab | Voice-to-expressive text
US8166051B1 (en) * | 2009-02-03 | 2012-04-24 | Sandia Corporation | Computation of term dominance in text documents
US20120179982A1 (en) * | 2011-01-07 | 2012-07-12 | Avaya Inc. | System and method for interactive communication context generation
US20120242897A1 (en) * | 2009-12-31 | 2012-09-27 | Tata Consultancy Services Limited | method and system for preprocessing the region of video containing text
US20120288203A1 (en) * | 2011-05-13 | 2012-11-15 | Fujitsu Limited | Method and device for acquiring keywords
US20130036117A1 (en) * | 2011-02-02 | 2013-02-07 | Paul Tepper Fisher | System and method for metadata capture, extraction and analysis
US8374646B2 (en) * | 2009-10-05 | 2013-02-12 | Sony Corporation | Mobile device visual input system and methods
US20130067319A1 (en) * | 2011-09-06 | 2013-03-14 | Locu, Inc. | Method and Apparatus for Forming a Structured Document from Unstructured Information
US20130121410A1 (en) * | 2011-11-14 | 2013-05-16 | Mediatek Inc. | Method and Apparatus of Video Encoding with Partitioned Bitstream
US20130218858A1 (en) * | 2012-02-16 | 2013-08-22 | Dmitri Perelman | Automatic face annotation of images contained in media content
US20130298159A1 (en) * | 2012-05-07 | 2013-11-07 | Industrial Technology Research Institute | System and method for allocating advertisements
US8588825B2 (en) * | 2010-05-25 | 2013-11-19 | Sony Corporation | Text enhancement
US20140032259A1 (en) * | 2012-07-26 | 2014-01-30 | Malcolm Gary LaFever | Systems and methods for private and secure collection and management of personal consumer data
US20140081619A1 (en) * | 2012-09-18 | 2014-03-20 | Abbyy Software Ltd. | Photography Recognition Translation
US20140258851A1 (en) * | 2013-03-11 | 2014-09-11 | Microsoft Corporation | Table of Contents Detection in a Fixed Format Document
US20140257789A1 (en) * | 2013-03-11 | 2014-09-11 | Microsoft Corporation | Detection and Reconstruction of East Asian Layout Features in a Fixed Format Document
US20150363478A1 (en) * | 2008-07-11 | 2015-12-17 | Michael N. Haynes | Systems, Devices, and/or Methods for Managing Data
US9317485B2 (en) * | 2012-01-09 | 2016-04-19 | Blackberry Limited | Selective rendering of electronic messages by an electronic device
US9342613B2 (en) * | 2004-09-17 | 2016-05-17 | Snapchat, Inc. | Display and installation of portlets on a client platform

Family Cites Families (5)

Publication number | Priority date | Publication date | Assignee | Title
JP2002244688A (en) | 2001-02-15 | 2002-08-30 | Sony Computer Entertainment Inc | Information processor, information processing method, information transmission system, medium for making information processor run information processing program, and information processing program
US20070011012A1 (en) | 2005-07-11 | 2007-01-11 | Steve Yurick | Method, system, and apparatus for facilitating captioning of multi-media content
US8126220B2 (en) | 2007-05-03 | 2012-02-28 | Hewlett-Packard Development Company L.P. | Annotating stimulus based on determined emotional response
CN101534377A (en) | 2008-03-13 | 2009-09-16 | 扬智科技股份有限公司 | Method and system for automatically changing subtitle setting according to program content
US8259992B2 (en) | 2008-06-13 | 2012-09-04 | International Business Machines Corporation | Multiple audio/video data stream simulation method and system

Cited By (9)

Publication number | Priority date | Publication date | Assignee | Title
US10091202B2 (en) | 2011-06-20 | 2018-10-02 | Google Llc | Text suggestions for images
US10049477B1 (en) * | 2014-06-27 | 2018-08-14 | Google Llc | Computer-assisted text and visual styling for images
US20170192939A1 (en) * | 2016-01-04 | 2017-07-06 | Expressy, LLC | System and Method for Employing Kinetic Typography in CMC
US10467329B2 (en) * | 2016-01-04 | 2019-11-05 | Expressy, LLC | System and method for employing kinetic typography in CMC
WO2018074658A1 (en) * | 2016-10-17 | 2018-04-26 | 주식회사 엠글리쉬 | Terminal and method for implementing hybrid subtitle effect
US10474750B1 (en) * | 2017-03-08 | 2019-11-12 | Amazon Technologies, Inc. | Multiple information classes parsing and execution
US10945041B1 (en) * | 2020-06-02 | 2021-03-09 | Amazon Technologies, Inc. | Language-agnostic subtitle drift detection and localization
CN114286154A (en) * | 2021-09-23 | 2022-04-05 | 腾讯科技(深圳)有限公司 | Subtitle processing method and device for multimedia file, electronic equipment and storage medium
CN117319757A (en) * | 2023-09-08 | 2023-12-29 | 北京优酷科技有限公司 | Subtitle display method and device, electronic equipment and computer storage medium

Also Published As

Publication number | Publication date
US9645985B2 (en) | 2017-05-09

Similar Documents

Publication | Title
US9645985B2 (en) | Systems and methods for customizing text in media content
US11749241B2 (en) | Systems and methods for transforming digitial audio content into visual topic-based segments
Durand et al. | The Oxford handbook of corpus phonology
US20090327856A1 (en) | Annotation of movies
KR20200118894A (en) | Automated voice translation dubbing for pre-recorded videos
US20140178043A1 (en) | Visual summarization of video for quick understanding
US10665267B2 (en) | Correlation of recorded video presentations and associated slides
US20220208155A1 (en) | Systems and methods for transforming digital audio content
US20180226101A1 (en) | Methods and systems for interactive multimedia creation
CN110781328A (en) | Video generation method, system, device and storage medium based on voice recognition
CN110750996B (en) | Method and device for generating multimedia information and readable storage medium
CN111930289B (en) | Method and system for processing pictures and texts
US20180270446A1 (en) | Media message creation with automatic titling
US20180189249A1 (en) | Providing application based subtitle features for presentation
Jing et al. | Content-aware video2comics with manga-style layout
Matamala | The VIW project: Multimodal corpus linguistics for audio description analysis
US10691871B2 (en) | Devices, methods, and systems to convert standard-text to animated-text and multimedia
Chi et al. | Synthesis-Assisted Video Prototyping From a Document
CN117177024A (en) | Video dubbing method and related device, electronic equipment and storage medium
JP2020129189A (en) | Moving image editing server and program
JP7133367B2 (en) | MOVIE EDITING DEVICE, MOVIE EDITING METHOD, AND MOVIE EDITING PROGRAM
KR102281298B1 (en) | System and method for video synthesis based on artificial intelligence
KR20150121928A (en) | System and method for adding caption using animation
EP4295248A1 (en) | Systems and methods for transforming digital audio content
EP3121734A1 (en) | A method and device for performing story analysis

Legal Events

Code | Title | Description
AS | Assignment | Owner name: CYBERLINK CORP., TAIWAN. Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:CHEN, HSIEH-WEI;REEL/FRAME:031647/0637. Effective date: 20131121
STCF | Information on status: patent grant | Free format text: PATENTED CASE
MAFP | Maintenance fee payment | Free format text: PAYMENT OF MAINTENANCE FEE, 4TH YEAR, LARGE ENTITY (ORIGINAL EVENT CODE: M1551); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY. Year of fee payment: 4
MAFP | Maintenance fee payment | Free format text: PAYMENT OF MAINTENANCE FEE, 8TH YEAR, LARGE ENTITY (ORIGINAL EVENT CODE: M1552); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY. Year of fee payment: 8