US20140278370A1 - Systems and Methods for Customizing Text in Media Content - Google Patents
- Publication number
- US20140278370A1 (application US 14/085,963)
- Authority
- United States (US)
- Prior art keywords: text, media content, media, semantic, content
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G06F17/24
- G—PHYSICS › G06—COMPUTING; CALCULATING OR COUNTING › G06F—ELECTRIC DIGITAL DATA PROCESSING › G06F40/00—Handling natural language data › G06F40/10—Text processing › G06F40/166—Editing, e.g. inserting or deleting
- G—PHYSICS › G06—COMPUTING; CALCULATING OR COUNTING › G06F—ELECTRIC DIGITAL DATA PROCESSING › G06F40/00—Handling natural language data › G06F40/30—Semantic analysis
Definitions
- one embodiment is a method implemented in a media processing device.
- the method comprises obtaining, by the media processing device, media content and performing, by the media processing device, semantic analysis on text in at least a portion of the media content to obtain at least one semantic textual segment each corresponding to a text section of the media content, wherein the text section comprises at least one word in the text in the at least a portion of the media content.
- the method further comprises generating, by the media processing device, at least one context token corresponding to the at least one semantic textual segment and visually accentuating, by the media processing device, the text section according to the context token.
- Another embodiment is a system for editing media content, comprising a processor and at least one application executable in the processor.
- the at least one application comprises a media interface for obtaining media content and a content analyzer for performing semantic analysis on text in at least a portion of the media content to obtain at least one semantic textual segment each corresponding to a text section of the media content, wherein the text section comprises at least one word in the text in the at least a portion of the media content.
- the at least one application further comprises a tokenizer for generating at least one context token corresponding to the at least one semantic textual segment and a visualizer for visually accentuating the text section according to the context token.
- Another embodiment is a non-transitory computer-readable medium embodying a program executable in a computing device, comprising code that obtains media content and code that performs semantic analysis on text in at least a portion of the media content to obtain at least one semantic textual segment each corresponding to a text section of the media content, wherein the text section comprises at least one word in the text in the at least a portion of the media content.
- the code further comprises code that generates at least one context token corresponding to the at least one semantic textual segment and code that visually accentuates the text section according to the context token.
- Another embodiment is a method implemented in a media processing device.
- the method comprises obtaining, by the media processing device, media content and performing semantic analysis on a textual portion of the media content and generating, by the media processing device, textual context tokens based on the semantic analysis.
- the method further comprises performing semantic analysis on an audio portion and on a visual portion of the media content corresponding to the textual portion and generating context tokens relating to the audio and visual portions.
- the method further comprises combining, by the media processing device, the textual context tokens and the context tokens relating to the audio and visual portions and visually accentuating, by the media processing device, at least one context portrayed in at least a portion of media content according to the combined context tokens.
- Another embodiment is a method implemented in a media processing device.
- the method comprises obtaining, by the media processing device, a photo collection comprising digital images and textual content and performing, by the media processing device, semantic analysis on the textual content to obtain at least one semantic textual segment each corresponding to a text section of the photo collection, wherein the text section comprises at least one word in the textual content in the at least a portion of the photo collection.
- the method further comprises generating, by the media processing device, at least one context token corresponding to the at least one semantic textual segment and visually accentuating, by the media processing device, the text section according to the context token.
- FIG. 1A is a block diagram of a media processing system for facilitating automatic media editing in accordance with various embodiments of the present disclosure.
- FIG. 1B illustrates the process flow between various components of the media processing system of FIG. 1A in accordance with various embodiments of the present disclosure.
- FIG. 2 is a detailed view of the media processing system of FIG. 1A in accordance with various embodiments of the present disclosure.
- FIGS. 3A and 3B illustrate the format of a context token generated by the media processing system of FIG. 1A in accordance with various embodiments of the present disclosure.
- FIG. 4 is a top-level flowchart illustrating examples of functionality implemented as portions of the media processing system of FIG. 1A for facilitating automatic media editing according to various embodiments of the present disclosure.
- FIGS. 5-8 illustrate various examples of subtitle modification performed by the visualizer in the media processing system of FIG. 1A in accordance with various embodiments of the present disclosure.
- FIG. 9 is a top-level flowchart illustrating examples of functionality implemented as portions of the media processing system of FIG. 1A for facilitating automatic media editing according to an alternative embodiment of the present disclosure.
- FIG. 10 is a top-level flowchart illustrating examples of functionality implemented as portions of the media processing system of FIG. 1A for facilitating automatic media editing according to an alternative embodiment of the present disclosure.
- the editing process may involve, for example, stylizing existing subtitles by changing the font color, font size, location of the subtitles, and so on.
- the editing process may also include inserting captions relating to commentary, descriptions, and so on into media content.
- editing media content on a frame-by-frame basis can be time consuming.
- media content is obtained and semantic analysis is performed on at least a portion of the media content, wherein the semantic analysis may involve analyzing visual, audio, and textual cues embedded in the media content that convey the emotions and/or context corresponding to events portrayed in the media content.
- context tokens characterizing the emotions, context, etc. associated with events being portrayed in the portion of media content are generated.
- a semantic fusion operation is applied to the context tokens to combine the context tokens, and the combined context tokens are mapped to the text that takes place in the portion of media content, where such text may comprise, for example, subtitles corresponding to dialog in the portion of media content and/or captions in the portion of media content (e.g., a caption describing a sound that occurs in a scene).
- the subtitles or text corresponding to the mapping are stylized in an automated fashion without the need for a user to manually apply special effects. The subtitles may be stylized by modifying the font, font size, or subtitle location. The modification(s) may also include animation or effects applied to the subtitles.
- FIG. 1A is a block diagram of a media processing system 102 in which embodiments of the techniques for visually accentuating semantic context of text or events portrayed within media content may be implemented.
- the media processing system 102 may be embodied, for example, as a desktop computer, computer workstation, laptop, a smartphone 109 , a tablet, or other computing platform that includes a display 104 and may include such input devices as a keyboard 106 and a mouse 108 .
- where the media processing system 102 is embodied as a smartphone 109 or tablet, the user may interface with the media processing system 102 via a touchscreen interface (not shown).
- the media processing system 102 may be embodied as a video gaming console 171 , which includes a video game controller 172 for receiving user preferences.
- the video gaming console 171 may be connected to a television (not shown) or other display 104 .
- the media processing system 102 is configured to retrieve, via the media interface 112 , digital media content 115 stored on a storage medium 120 such as, by way of example and without limitation, a compact disc (CD) or a universal serial bus (USB) flash drive, wherein the digital media content 115 may then be stored locally on a hard drive of the media processing system 102 .
- the digital media content 115 may be encoded in any of a number of formats including, but not limited to, JPEG (Joint Photographic Experts Group) files, TIFF (Tagged Image File Format) files, PNG (Portable Network Graphics) files, GIF (Graphics Interchange Format) files, BMP (bitmap) files or any number of other digital formats.
- the digital media content 115 may be encoded in other formats including, but not limited to, Motion Picture Experts Group (MPEG)-1, MPEG-2, MPEG-4, H.264, Third Generation Partnership Project (3GPP), 3GPP-2, Standard-Definition Video (SD-Video), High-Definition Video (HD-Video), Digital Versatile Disc (DVD) multimedia, Video Compact Disc (VCD) multimedia, High-Definition Digital Versatile Disc (HD-DVD) multimedia, Digital Television Video/High-definition Digital Television (DTV/HDTV) multimedia, Audio Video Interleave (AVI), Digital Video (DV), QuickTime (QT) file, Windows Media Video (WMV), Advanced System Format (ASF), Real Media (RM), Flash Media (FLV), an MPEG Audio Layer III (MP3), an MPEG Audio Layer II (MP2), Waveform Audio Format (WAV), Windows Media Audio (WMA), or any number of other digital formats.
- the media interface 112 in the media processing system 102 may also be configured to retrieve digital media content 115 directly from a digital recording device 107 where a cable 111 or some other interface may be used for coupling the digital recording device 107 to the media processing system 102 .
- the media processing system 102 may support any one of a number of common computer interfaces, such as, but not limited to IEEE-1394 High Performance Serial Bus (Firewire), USB, a serial connection, and a parallel connection.
- the digital recording device 107 may also be coupled to the media processing system 102 over a wireless connection or other communication path.
- the media processing system 102 may be coupled to a network 118 such as, for example, the Internet, intranets, extranets, wide area networks (WANs), local area networks (LANs), wired networks, wireless networks, or other suitable networks, etc., or any combination of two or more such networks.
- the media processing system 102 may receive digital media content 115 from another computing system 103 .
- the media processing system 102 may access one or more media content sharing websites 134 hosted on a server 137 via the network 118 to retrieve digital media content 115 .
- the components executed on the media processing system 102 include a content analyzer 114 , a tokenizer 116 , a semantic fusion operator 119 , a visualizer 121 , and other applications, services, processes, systems, engines, or functionality not discussed in detail herein.
- the content analyzer 114 is executed to perform semantic analysis on the media content received by the media interface 112 .
- the tokenizer 116 is executed to generate context tokens based on the semantic analysis, where the context tokens may be generated based on classification of visual cues, audio cues, and textual cues extracted by the content analyzer 114 .
- the semantic fusion operator 119 is executed to combine the context tokens generated by the tokenizer 116 , and the visualizer 121 is executed to visually accentuate at least one context portrayed in the media content according to the context tokens.
- the visualizer 121 modifies the appearance of subtitles/captions in the media content by modifying the font, font size, subtitle location, and so on.
- the user may specify predetermined modifications to be applied for certain contexts. For example, the user may specify that if the content analyzer 114 determines that the context in the media content involves a scary scene, a certain font (e.g., a Gothic font style) is automatically applied to the subtitles relating to that scene or event.
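As a sketch of such user-specified, context-triggered styling, the following maps a detected context label to a subtitle style; all names and style values here (STYLE_RULES, pick_style, the hex colors) are illustrative assumptions rather than details from the patent.

```python
# Hypothetical user-defined mapping from a detected context to a subtitle
# style; e.g. a scary scene automatically receives a Gothic font.
STYLE_RULES = {
    "scary":   {"font": "Gothic", "color": "#8B0000", "size": 28},
    "romance": {"font": "Script", "color": "#FF69B4", "size": 24},
}
DEFAULT_STYLE = {"font": "Sans", "color": "#FFFFFF", "size": 22}

def pick_style(context_label):
    """Return the user-specified style for a detected context, or a default
    style when the user has not configured one for that context."""
    return STYLE_RULES.get(context_label, DEFAULT_STYLE)
```

A lookup like this lets the visualizer apply the predetermined modification without any per-frame user interaction.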
- the media interface 112 obtains media content, where the media content may include subtitles 151 corresponding to the text or commentary within the media content.
- the subtitles 151 may be embedded directly into the media content, stored separately and superimposed during playback, or stored according to other means as known to those skilled in the art.
- the media interface 112 forwards the media content to the content analyzer 114 , which includes an image analyzer 162 , an audio analyzer 164 , a tokenizer 116 , a text analyzer 170 , and other applications, services, processes, systems, engines, or functionality not discussed in detail herein.
- the content analyzer 114 analyzes the semantic-rich media content to extract information later used for modifying or generating stylized subtitles corresponding to the media content.
- the media content may comprise video content as well as digital images that include embedded captions stored, for example, as metadata.
- the image analyzer 162 analyzes the media content and identifies such visual cues as facial expressions, body language of individuals depicted in the media content, physical attributes of individuals, and so on.
- the image analyzer 162 may also analyze attributes of the media content including, for example, lighting, color temperature, color hue, contrast level, and so on.
- the audio analyzer 164 analyzes the media content and identifies such audio cues as speech tones of individuals within the media content, speed in which individuals are talking, speech volume, direction of speech, tone, and so on.
- the audio cues may also include intonation that may serve as an indication of one or more emotions of a speaker.
- the tokenizer 116 extracts textual information from the media content.
- the tokenizer 116 may directly process the subtitles 151 and tokenize the words in the subtitles 151 .
- the tokenizer 116 may be configured to process the audio portion of the media content and extract text information.
- a speech recognition component 117 in the tokenizer 116 converts audio data into text data when the media content does not include subtitles 151 .
- the tokenizer 116 processes textual information and breaks the information into meaningful elements that are significant as a group, wherein tokenization may be performed based on lexical analysis.
- the lexical analysis performed by the tokenizer 116 may be based on regular expressions, specific key words, and so on where such information may be stored in a database 178 .
- specific key words may comprise any of transition words, conjunctions, words that convey emphasis, repeated words, symbols, predefined keywords from a database, or any combination thereof.
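A minimal sketch of this lexical tokenization is shown below; the regular expression and the keyword sets stand in for the data the patent stores in database 178, and are assumptions for illustration only.

```python
import re

# Illustrative keyword tables; in the system described, such lexical data
# (regular expressions, transition words, emphasis words) lives in database 178.
TRANSITION_WORDS = {"however", "therefore", "meanwhile"}
EMPHASIS_WORDS = {"great", "awesome", "never"}

def tokenize(text):
    """Break subtitle text into word tokens via a simple regular expression."""
    return re.findall(r"[A-Za-z']+", text.lower())

def flag_keywords(tokens):
    """Mark each token that matches a transition or emphasis keyword list."""
    return [(t, t in TRANSITION_WORDS or t in EMPHASIS_WORDS) for t in tokens]
```

Tokens flagged this way form the "meaningful elements that are significant as a group" on which the text analyzer can base textual cues.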
- Based on the lexical analysis performed by the tokenizer 116 , the text analyzer 170 extracts textual cues from the media content.
- the data stored in the database 178 may also include key attributes such as visual attributes (e.g., lighting level, human facial expressions, body language, themes, color hue, color temperature), audio attributes (e.g., volume level), and other attributes.
- the image analyzer 162 , audio analyzer 164 , and text analyzer 170 respectively generate context tokens 174 relating to the media content.
- the semantic fusion operator 119 processes the context tokens 174 and combines context tokens relating to similar points within the media content. Note that for some embodiments, the context tokens 174 may be sent directly to the visualizer 121 without being processed by the semantic fusion operator 119 .
- the content analyzer 114 may be configured to first analyze the textual content followed by the audio content and the visual content. Alternatively, the content analyzer 114 may be configured to first analyze the visual content followed by the text content and the audio content. In this regard, the content analyzer 114 may be configured to analyze the various components of the media content in a particular order or concurrently.
- the semantic fusion operator 119 combines the context tokens 174 , and the mapping module 176 maps the combined context tokens 174 to specific text associated with the event or context in the media content, as described in more detail below.
- the visualizer 121 modifies the subtitles 151 corresponding to the text, where the modification may include, for example and without limitation, a change in the subtitle font, change in font size, change in font color, and change in subtitle location.
- the visualizer 121 incorporates the stylistic changes and outputs the modified media content 180 .
- each context token 174 comprises a media stamp 302 and a semantic vector 304 , where the media stamp 302 corresponds to the media content.
- the media stamp comprises a time stamp corresponding to a position within the media content.
- the media stamp may also specify a window of time relative to the time stamp.
- the media stamp 302 may specify that the corresponding semantic vector 304 corresponds to a time interval spanning 10:33 to 10:57 in the media content.
- the semantic vector 304 corresponds to semantic concepts derived by the image analyzer 162 ( FIG. 1B ), audio analyzer 164 ( FIG. 1B ), and the text analyzer 170 ( FIG. 1B ).
- Each semantic vector 304 within a context token may contain one or more entries where each entry comprises a semantic dimension 306 and a corresponding strength value 308 .
- a semantic dimension 306 corresponds to a contextual cue within the media content and may include visual cues, audio cues, textual cues, and so on.
- a context token c i is represented by the following expression: c i =( t i ,v i ), where
- t i denotes the media stamp of the context token
- v i denotes the semantic vector 304 , which is expressed as:
- v i =( d 1 ,d 2 , . . . ,d n ).
- d j represents a strength or likelihood value towards a particular semantic dimension, such as but not limited to, a positive atmosphere, a negative atmosphere, a feeling of happiness, sadness, anger, horror, a feeling of mystery, a feeling of romance, a feminine theme, a masculine theme, and so on.
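The context token c i = (t i, v i) described above can be sketched as a small data structure; the field and method names here are illustrative assumptions, with the semantic vector held as a mapping from each semantic dimension d j to its strength value.

```python
from dataclasses import dataclass
from typing import Dict, Tuple

@dataclass
class ContextToken:
    """Sketch of a context token: a media stamp plus a semantic vector."""
    media_stamp: Tuple[str, str]       # e.g. ("10:33", "10:57") time window
    semantic_vector: Dict[str, float]  # semantic dimension -> strength value

    def dominant_dimension(self):
        """Return the semantic dimension with the highest strength value."""
        return max(self.semantic_vector, key=self.semantic_vector.get)
```

For instance, a token spanning 10:33 to 10:57 with a high happiness strength would report happiness as its dominant dimension.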
- the visual content of a particular scene with dark and gray visual attributes may be assigned a higher strength value towards a semantic dimension of negativity, horror, and a feeling of mystery.
- Speech (i.e., audio) content expressing delight and characterized by a high pitch intonation pattern may be assigned a higher strength value towards a positive feeling, a feeling of happiness, a feminine theme, while a soft, gentle, and low pitch intonation pattern may be assigned a higher strength value towards a positive feeling, a feeling of romance, and a masculine theme.
- Textual context comprising specific transition keywords may be assigned a higher strength value to a semantic dimension reflecting strong emphasis. For example, a specific phrase such as “with great power, comes great responsibility” may be assigned a higher strength value reflecting strong emphasis, a positive atmosphere, and a masculine theme.
- the corresponding strength value 308 reflects a confidence level of the semantic dimension 306 .
- the semantic fusion operator 119 ( FIG. 1B ) combines the context tokens 174 to generate a fused semantic vector
- the mapping module 176 maps the combined context tokens 174 to specific text associated with the event or context in the media content.
- a fused semantic vector v̂ T associated with a specified media stamp T is determined by the following expression:
- v̂ T =ƒ( v v T ,v a T ,v t T ), where
- v v T denotes the semantic vector of visual content for media stamp T
- v a T denotes the semantic vector of audio content for media stamp T
- v t T denotes the semantic vector of text content for media stamp T
- ƒ( ) denotes the fusion function.
- the fusion function may be implemented as an operator for combining semantic vectors.
- the fusion function may be expressed as a weighted summation function: v̂ T =w v T v v T +w a T v a T +w t T v t T , where
- ( w i ) corresponds to the weight value of each type of semantic vector (i.e., semantic vector of visual content, semantic vector of audio content, and semantic vector of textual content).
- Each weight value represents the confidence level of a particular semantic vector. For example, the weight value (w a T ) for the audio semantic vector (v a T ) may be higher if the audio cues during time period (T) comprise dramatic intonations that occur in a given scene. On the other hand, the weight value (w v T ) for the visual semantic vector (v v T ) may be lower if the same scene provides few visual cues.
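Assuming each semantic vector is represented as a mapping from dimension name to strength, the weighted-summation fusion can be sketched as below; the dictionary representation and function name are assumptions for illustration.

```python
def fuse(vectors, weights):
    """Combine per-modality semantic vectors (dicts of dimension -> strength)
    into one fused vector, weighting each modality by its confidence weight:
    fused[d] = sum_i(w_i * v_i[d]) over the visual, audio, and text vectors."""
    dims = set().union(*(v.keys() for v in vectors))
    return {
        d: sum(w * v.get(d, 0.0) for v, w in zip(vectors, weights))
        for d in dims
    }
```

A scene with dramatic intonations but few visual cues would simply pass a larger audio weight and a smaller visual weight into this function.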
- the fusion function may also be implemented according to a neural network model.
- the mapping module 176 then maps the fused semantic vector v ⁇ T to media or corresponding subtitles according to the media stamp T.
- FIG. 3A provides an example of a context token with a plurality of semantic dimensions 306 a , 306 b and corresponding strength values 308 a , 308 b .
- the context token 174 characterizes a window of time in the media content spanning from 10:33 to 10:57, where various semantic dimensions 306 a , 306 b are portrayed in the media content.
- the image analyzer 162 , audio analyzer 164 , and/or the text analyzer 170 determines based on various contextual cues within the media content that one or more individuals in the media content exhibit such emotions as happiness, sadness, anger, and fear.
- each of the semantic dimensions 306 a , 306 b having corresponding strength values 308 a , 308 b where the semantic dimension 306 corresponding to happiness has the highest confidence level.
- FIG. 3B is an example of a textual context token 320 comprising a media stamp 322 that specifies the time in which the corresponding text 324 is to be displayed.
- the textual context token 320 further comprises an entropy value 326 and a semantic vector 328 , wherein the entropy value 326 represents the information content of the particular text section.
- the text content comprises the subtitle “That's Awesome!”
- the text segment is tokenized into two text tokens—“That's” and “Awesome”.
- the text “That's” contains less useful information and is therefore assigned a lower entropy value, whereas the text “Awesome” is assigned a higher entropy value.
- the higher entropy value triggers the visual accentuation.
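The entropy-gated accentuation just described can be sketched as follows; the threshold value and the emphasis markup are assumptions, since the patent does not specify either.

```python
# Only tokens whose entropy value exceeds a threshold trigger visual
# accentuation; low-information tokens such as "That's" are left unchanged.
ENTROPY_THRESHOLD = 0.5

def accentuate(text_tokens):
    """Given (word, entropy) pairs, wrap high-entropy words in an emphasis
    marker and rejoin the text."""
    return " ".join(
        f"<em>{word}</em>" if entropy > ENTROPY_THRESHOLD else word
        for word, entropy in text_tokens
    )
```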
- a negative value for a semantic dimension resolves a contradiction between a visual and an audio context token.
- For example, suppose the audio context token of the corresponding portion has a sadness value of −0.6 while the video context token has a sadness value of 0.4.
- the bias would be corrected by adjusting the sadness dimension to a neutral state of zero given the contradictory values −0.6 and 0.4.
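One way to sketch this bias correction is below. The patent only states that the contradictory pair (−0.6 audio, 0.4 video) is adjusted to a neutral zero, so the exact rule here, clamping sign-contradictory pairs to zero and averaging agreeing pairs, is an assumption.

```python
def reconcile(audio_val, video_val):
    """Return a neutral 0.0 when the two modalities disagree in sign for a
    semantic dimension (e.g. sadness of -0.6 vs 0.4); otherwise average them."""
    if audio_val * video_val < 0:
        return 0.0
    return (audio_val + video_val) / 2.0
```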
- FIG. 2 is a schematic diagram of the media processing system 102 shown in FIG. 1A .
- the media processing system 102 may be embodied in any one of a wide variety of wired and/or wireless computing devices, such as a desktop computer, portable computer, dedicated server computer, multiprocessor computing device, smartphone 109 ( FIG. 1A ), tablet computing device, and so forth.
- the media processing system 102 comprises memory 214 , a processing device 202 , a number of input/output interfaces 204 , a network interface 206 , a display 104 , a peripheral interface 211 , and mass storage 226 , wherein each of these devices are connected across a local data bus 210 .
- the processing device 202 may include any custom made or commercially available processor, a central processing unit (CPU) or an auxiliary processor among several processors associated with the media processing system 102 , a semiconductor based microprocessor (in the form of a microchip), a macroprocessor, one or more application specific integrated circuits (ASICs), a plurality of suitably configured digital logic gates, and other well known electrical configurations comprising discrete elements both individually and in various combinations to coordinate the overall operation of the computing system.
- the memory 214 can include any one of a combination of volatile memory elements (e.g., random-access memory (RAM, such as DRAM, and SRAM, etc.)) and nonvolatile memory elements (e.g., ROM, hard drive, tape, CDROM, etc.).
- the memory 214 typically comprises a native operating system 217 , one or more native applications, emulation systems, or emulated applications for any of a variety of operating systems and/or emulated hardware platforms, emulated operating systems, etc.
- the applications may include application specific software which may comprise some or all the components (media interface 112 , content analyzer 114 , tokenizer 116 , semantic fusion operator 119 , visualizer 121 ) of the media processing system 102 depicted in FIG. 1A .
- the components are stored in memory 214 and executed by the processing device 202 .
- the memory 214 can, and typically will, comprise other components which have been omitted for purposes of brevity.
- executable may refer to a program file that is in a form that can ultimately be run by the processing device 202 .
- Examples of executable programs may be, for example, a compiled program that can be translated into machine code in a format that can be loaded into a random access portion of the memory 214 and run by the processing device 202 , source code that may be expressed in proper format such as object code that is capable of being loaded into a random access portion of the memory 214 and executed by the processing device 202 , or source code that may be interpreted by another executable program to generate instructions in a random access portion of the memory 214 to be executed by the processing device 202 , etc.
- An executable program may be stored in any portion or component of the memory 214 including, for example, random access memory (RAM), read-only memory (ROM), hard drive, solid-state drive, USB flash drive, memory card, optical disc such as compact disc (CD) or digital versatile disc (DVD), floppy disk, magnetic tape, or other memory components.
- Input/output interfaces 204 provide any number of interfaces for the input and output of data.
- where the media processing system 102 comprises a personal computer, these components may interface with one or more user input devices via the I/O interfaces 204 , where the user input devices may comprise a keyboard 106 ( FIG. 1A ) or a mouse 108 ( FIG. 1A ).
- the display 104 may comprise a computer monitor, a plasma screen for a PC, a liquid crystal display (LCD), a touchscreen display, or other display device 104 .
- a non-transitory computer-readable medium stores programs for use by or in connection with an instruction execution system, apparatus, or device. More specific examples of a computer-readable medium may include by way of example and without limitation: a portable computer diskette, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM, EEPROM, or Flash memory), and a portable compact disc read-only memory (CDROM) (optical).
- network interface 206 comprises various components used to transmit and/or receive data over a network environment.
- the network interface 206 may include a device that can communicate with both inputs and outputs, for instance, a modulator/demodulator (e.g., a modem), wireless (e.g., radio frequency (RF)) transceiver, a telephonic interface, a bridge, a router, network card, etc.).
- the media processing system 102 may communicate with one or more computing devices via the network interface 206 over the network 118 ( FIG. 1A ).
- the media processing system 102 may further comprise mass storage 226 .
- the peripheral interface 211 supports various interfaces including, but not limited to, IEEE-1394 High Performance Serial Bus (FireWire), USB, a serial connection, and a parallel connection.
- FIG. 4 is a flowchart 400 in accordance with one embodiment for facilitating automatic media editing performed by the media processing system 102 of FIG. 1A . It is understood that the flowchart 400 of FIG. 4 provides merely an example of the many different types of functional arrangements that may be employed to implement the operation of the various components of the media processing system 102 . As an alternative, the flowchart of FIG. 4 may be viewed as depicting an example of steps of a method implemented in the media processing system 102 according to one or more embodiments.
- Although FIG. 4 shows a specific order of execution, it is understood that the order of execution may differ from that which is depicted. For example, the order of execution of two or more blocks may be scrambled relative to the order shown. Also, two or more blocks shown in succession in FIG. 4 may be executed concurrently or with partial concurrence. It is understood that all such variations are within the scope of the present disclosure.
- media content is obtained and in block 420 , semantic analysis is performed on text in at least a portion of the media content to obtain at least one semantic textual segment each corresponding to a text section of the media content.
- the text section comprises at least one word in the text in the at least a portion of the media content.
- at least one context token corresponding to the at least one semantic textual segment is generated.
- the text section is visually accentuated according to the context token.
- visually accentuating the text section comprises modifying the text section in the at least a portion of the media content and generating captions in the at least a portion of the media content.
- modifying the visual appearance of text may be performed according to the literal meaning of the text section. For example, if the text section includes the word “fire” or “flame,” the visual appearance of the text section may be modified with a fiery font. As another example, if the text section includes the word “big” or “huge,” the visual appearance of the text section may be enlarged.
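By way of illustration, the literal-meaning rule described above might be sketched as a simple keyword-to-style lookup. This is an editor's sketch, not the disclosed implementation; the keyword table and style names are assumptions.

```python
# Illustrative sketch of literal-meaning styling: map keywords found in a
# text section to visual-style hints. The keyword table and style names
# are assumptions for demonstration only.
KEYWORD_STYLES = {
    "fire":  {"font": "flame-styled", "color": "orange-red"},
    "flame": {"font": "flame-styled", "color": "orange-red"},
    "big":   {"scale": 1.5},   # enlarge the rendered text
    "huge":  {"scale": 2.0},
}

def literal_styles(text_section):
    """Return style hints for each word whose literal meaning has a mapping."""
    styles = {}
    for word in text_section.lower().split():
        word = word.strip("!?,.")
        if word in KEYWORD_STYLES:
            styles[word] = KEYWORD_STYLES[word]
    return styles

print(literal_styles("A huge wall of fire"))
```

In such a sketch, a renderer would then apply the returned hints (font substitution, enlargement) only to the matched words rather than to the whole subtitle line.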
- FIGS. 5-7 provide various examples of modifications performed by the visualizer 121 ( FIG. 1A ) in the media processing system 102 ( FIG. 1A ) in accordance with various embodiments.
- FIG. 5 is an example where the visualizer 121 ( FIG. 1A ) changes the font size/style as well as the location of the subtitles.
- the content analyzer 114 ( FIG. 1A ) analyzes such contextual cues as speech volume (e.g., one or more individuals shouting), keywords/phrases (e.g., “watch out”, “warning”), the presence of exclamation points in the subtitles, and so on.
- the media processing system 102 is “text-aware” and is capable of visually accentuating a text section within text content.
- the visualizer 121 selectively modifies the text sections containing the text “AWESOME” and the exclamation point in the subtitles. That is, rather than visually accentuating the entire line of subtitles, the visualizer 121 may be configured to visually accentuate only a portion of the subtitles (e.g., selective words/phrases/punctuation marks). In the example shown, only the word “AWESOME” and the exclamation point are visually accentuated by increasing the font size.
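Selective accentuation of only part of a subtitle line, as in this example, might be sketched as follows. The emphasis test (a small keyword list plus exclamation points) and the markup used to mark enlarged text are assumptions for illustration; the visualizer 121 itself renders styled fonts rather than markup.

```python
# Sketch: visually accentuate only selected words/punctuation in a subtitle
# line, leaving the rest of the line untouched. The emphasis keyword list
# and the <big>...</big> markup are illustrative assumptions.
EMPHASIS_WORDS = {"awesome", "warning"}

def accentuate(subtitle):
    out = []
    for token in subtitle.split():
        bare = token.strip("!").lower()
        # Accentuate emphasis keywords and any token carrying an exclamation point
        if bare in EMPHASIS_WORDS or "!" in token:
            out.append("<big>" + token + "</big>")
        else:
            out.append(token)
    return " ".join(out)

print(accentuate("That's AWESOME!"))
```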
- visually accentuating the text section according to the context token may comprise modifying the text section in the at least a portion of the media content and/or generating captions in the at least a portion of the media content.
- the visualizer 121 also incorporates animation to further emphasize the words being spoken by the individual.
- Other forms of animation may include, for example and without limitation, a shrinking/stretching effect, a fade-in/fade-out effect, a shadowing effect, a flipping effect, and so on.
- the example in FIG. 5 also depicts graphics (i.e., lines) inserted into the media content by the visualizer 121 to indicate which individual is speaking.
- FIG. 6 is an example where the visualizer 121 ( FIG. 1A ) changes the font size/style of the captions based on the body language of the individual as well as the presence of exclamation marks in the subtitles.
- FIG. 7 is an example where the visualizer 121 ( FIG. 1A ) changes the font size/style of the captions based on image attributes (e.g., low lighting; night time), keyword (e.g., “Halloween”), the presence of an exclamation mark in the subtitles, and so on.
- FIG. 8 is an example where the media content comprises digital photos with comments, descriptions, and other forms of annotation embedded with the digital photos.
- the media processing system 102 may retrieve media content from online photo sharing albums where one or more users upload photos and viewers add corresponding descriptions, comments, etc. to the uploaded photos.
- the media content comprises digital photos with corresponding descriptions.
- the text section comprising the word “beautiful” is visually accentuated to place emphasis on this word. Note that only the appearance of “beautiful” is modified.
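A sketch of this kind of single-word accentuation in a photo description follows; the CSS-like span markup and the scale factor are assumptions, chosen only to show that the rest of the caption is left unchanged.

```python
import re

# Sketch: emphasize a single descriptive word inside a photo caption while
# leaving the rest of the caption unchanged. The span markup and scale
# factor are illustrative assumptions.
def emphasize_word(caption, word, scale=1.5):
    pattern = re.compile(re.escape(word), re.IGNORECASE)
    return pattern.sub(
        lambda m: '<span style="font-size:%.0f%%">%s</span>' % (scale * 100, m.group(0)),
        caption,
    )

print(emphasize_word("What a beautiful sunset", "beautiful"))
```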
- FIG. 9 is a flowchart 900 in accordance with an alternative embodiment for facilitating automatic media editing performed by the media processing system 102 of FIG. 1A . It is understood that the flowchart 900 of FIG. 9 provides merely an example of the many different types of functional arrangements that may be employed to implement the operation of the various components of the media processing system 102 . As an alternative, the flowchart of FIG. 9 may be viewed as depicting an example of steps of a method implemented in the media processing system 102 according to one or more embodiments.
- Although FIG. 9 shows a specific order of execution, it is understood that the order of execution may differ from that which is depicted. For example, the order of execution of two or more blocks may be scrambled relative to the order shown. Also, two or more blocks shown in succession in FIG. 9 may be executed concurrently or with partial concurrence. It is understood that all such variations are within the scope of the present disclosure.
- media content is obtained and semantic analysis is performed on a textual portion of the media content.
- the media content obtained by the media interface 112 may include subtitles 151 ( FIG. 1B ) or captions.
- textual context tokens are generated based on the semantic analysis
- semantic analysis is performed on an audio portion and on a visual portion of the media content corresponding to the textual portion.
- the image analyzer 162 ( FIG. 1B ) and the audio analyzer 164 ( FIG. 1B ) in the content analyzer 114 ( FIG. 1B ) may be configured to analyze portions of the media content where dialog between individuals takes place.
- context tokens relating to the audio and visual portions are generated.
- the textual context tokens are combined with the context tokens relating to the audio and visual portions, and in block 960 , at least one context portrayed in the at least a portion of media content is visually accentuated according to the combined context tokens.
- FIG. 10 is a flowchart 1000 in accordance with an alternative embodiment for facilitating automatic media editing performed by the media processing system 102 of FIG. 1A . It is understood that the flowchart 1000 of FIG. 10 provides merely an example of the many different types of functional arrangements that may be employed to implement the operation of the various components of the media processing system 102 . As an alternative, the flowchart of FIG. 10 may be viewed as depicting an example of steps of a method implemented in the media processing system 102 according to one or more embodiments.
- Although FIG. 10 shows a specific order of execution, it is understood that the order of execution may differ from that which is depicted. For example, the order of execution of two or more blocks may be scrambled relative to the order shown. Also, two or more blocks shown in succession in FIG. 10 may be executed concurrently or with partial concurrence. It is understood that all such variations are within the scope of the present disclosure.
- a photo collection comprising digital images and textual content is obtained, and in block 1020 , semantic analysis is performed on the textual content to obtain at least one semantic textual segment each corresponding to a text section of the photo collection.
- the text section comprises at least one word in the textual content in the at least a portion of the photo collection.
- at least one context token corresponding to the at least one semantic textual segment is generated, and in block 1040 , the text section is visually accentuated according to the context token.
Description
- This application claims priority to, and the benefit of, U.S. Provisional Patent Application entitled, “Subtitle Modalization,” having Ser. No. 61/788,741, filed on Mar. 15, 2013, which is incorporated by reference in its entirety.
- With the ever-growing amount of digital content available to consumers through the Internet and other sources, consumers have access to a vast amount of content. With existing media editing tools, users manually edit subtitles or add captions in order to achieve a desired effect or style. This typically involves a great deal of effort on the part of the user in order to emphasize or convey the context of the media content being viewed. Thus, while many media editing tools are readily available, the editing process can be tedious and time-consuming.
- Briefly described, one embodiment, among others, is a method implemented in a media processing device. The method comprises obtaining, by the media processing device, media content and performing, by the media processing device, semantic analysis on text in at least a portion of the media content to obtain at least one semantic textual segment each corresponding to a text section of the media content, wherein the text section comprises at least one word in the text in the at least a portion of the media content. The method further comprises generating, by the media processing device, at least one context token corresponding to the at least one semantic textual segment and visually accentuating, by the media processing device, the text section according to the context token.
- Another embodiment is a system for editing media content, comprising a processor and at least one application executable in the processor. The at least one application comprises a media interface for obtaining media content and a content analyzer for performing semantic analysis on text in at least a portion of the media content to obtain at least one semantic textual segment each corresponding to a text section of the media content, wherein the text section comprises at least one word in the text in the at least a portion of the media content. The at least one application further comprises a tokenizer for generating at least one context token corresponding to the at least one semantic textual segment and a visualizer for visually accentuating the text section according to the context token.
- Another embodiment is a non-transitory computer-readable medium embodying a program executable in a computing device, comprising code that obtains media content and code that performs semantic analysis on text in at least a portion of the media content to obtain at least one semantic textual segment each corresponding to a text section of the media content, wherein the text section comprises at least one word in the text in the at least a portion of the media content. The code further comprises code that generates at least one context token corresponding to the at least one semantic textual segment and code that visually accentuates the text section according to the context token.
- Another embodiment is a method implemented in a media processing device. The method comprises obtaining, by the media processing device, media content, performing semantic analysis on a textual portion of the media content, and generating, by the media processing device, textual context tokens based on the semantic analysis. The method further comprises performing semantic analysis on an audio portion and on a visual portion of the media content corresponding to the textual portion and generating context tokens relating to the audio and visual portions. The method further comprises combining, by the media processing device, the textual context tokens and the context tokens relating to the audio and visual portions and visually accentuating, by the media processing device, at least one context portrayed in at least a portion of media content according to the combined context tokens.
- Another embodiment is a method implemented in a media processing device. The method comprises obtaining, by the media processing device, a photo collection comprising digital images and textual content and performing, by the media processing device, semantic analysis on the textual content to obtain at least one semantic textual segment each corresponding to a text section of the photo collection, wherein the text section comprises at least one word in the textual content in the at least a portion of the photo collection. The method further comprises generating, by the media processing device, at least one context token corresponding to the at least one semantic textual segment and visually accentuating, by the media processing device, the text section according to the context token.
- Other systems, methods, features, and advantages of the present disclosure will be or become apparent to one with skill in the art upon examination of the following drawings and detailed description. It is intended that all such additional systems, methods, features, and advantages be included within this description, be within the scope of the present disclosure, and be protected by the accompanying claims.
- Many aspects of the disclosure can be better understood with reference to the following drawings. The components in the drawings are not necessarily to scale, emphasis instead being placed upon clearly illustrating the principles of the present disclosure. Moreover, in the drawings, like reference numerals designate corresponding parts throughout the several views.
-
FIG. 1A is a block diagram of a media processing system for facilitating automatic media editing in accordance with various embodiments of the present disclosure. -
FIG. 1B illustrates the process flow between various components of the media processing system of FIG. 1A in accordance with various embodiments of the present disclosure. -
FIG. 2 is a detailed view of the media processing system of FIG. 1A in accordance with various embodiments of the present disclosure. -
FIGS. 3A and 3B illustrate the format of a context token generated by the media processing system of FIG. 1A in accordance with various embodiments of the present disclosure. -
FIG. 4 is a top-level flowchart illustrating examples of functionality implemented as portions of the media processing system of FIG. 1A for facilitating automatic media editing according to various embodiments of the present disclosure. -
FIGS. 5-8 illustrate various examples of subtitle modification performed by the visualizer in the media processing system of FIG. 1A in accordance with various embodiments of the present disclosure. -
FIG. 9 is a top-level flowchart illustrating examples of functionality implemented as portions of the media processing system of FIG. 1A for facilitating automatic media editing according to an alternative embodiment of the present disclosure. -
FIG. 10 is a top-level flowchart illustrating examples of functionality implemented as portions of the media processing system of FIG. 1A for facilitating automatic media editing according to an alternative embodiment of the present disclosure. - One perceived shortcoming with conventional media editing applications is the amount of time involved in manually editing subtitles or inserting captions into media content. The editing process may involve, for example, stylizing existing subtitles by changing the font color, font size, location of the subtitles, and so on. The editing process may also include inserting captions relating to commentary, descriptions, and so on into media content. However, editing media content on a frame-by-frame basis can be time consuming.
- Various embodiments are disclosed for automatically modifying or generating stylized captions for semantic-rich media. In accordance with various embodiments, media content is obtained and semantic analysis is performed on at least a portion of the media content, wherein the semantic analysis may involve analyzing visual, audio, and textual cues embedded in the media content that convey the emotions and/or context corresponding to events portrayed in the media content.
- As a result of the semantic analysis, context tokens characterizing the emotions, context, etc. associated with events being portrayed in the portion of media content are generated. A semantic fusion operation is applied to the context tokens to combine the context tokens, and the combined context tokens are mapped to the text that takes place in the portion of media content, where such text may comprise, for example, subtitles corresponding to dialog in the portion of media content and/or captions in the portion of media content (e.g., a caption describing a sound that occurs in a scene). Based on the mapping, the subtitles or text corresponding to the mapping are stylized in an automated fashion without the need for a user to manually apply special effects. The subtitles may be stylized by modifying the font, font size, and subtitle location. The modification(s) may also include animation or effects applied to the subtitles.
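The overall flow described above (analyze, generate context tokens, fuse, stylize) might be sketched as follows. All analyzer stubs, cue names, and thresholds here are placeholder assumptions standing in for the visual, audio, and textual analysis; they are not the disclosed algorithms.

```python
# End-to-end sketch of the described flow: analyze -> context tokens ->
# fuse -> stylize the subtitle text. Analyzer stubs, cue names, and
# thresholds are illustrative placeholders.
def analyze(media_segment):
    # Stand-ins for visual, audio, and textual semantic analysis.
    return {
        "visual": {"darkness": 0.8},
        "audio": {"shouting": 0.7},
        "text": {"exclamation": 1.0 if "!" in media_segment["subtitle"] else 0.0},
    }

def fuse(tokens):
    # Combine all cue dictionaries into one fused context description.
    fused = {}
    for cues in tokens.values():
        fused.update(cues)
    return fused

def stylize(subtitle, context):
    # Pick a style from the fused context (threshold chosen arbitrarily).
    if context.get("shouting", 0) > 0.5 or context.get("exclamation", 0) > 0.5:
        return subtitle.upper()
    return subtitle

segment = {"subtitle": "Watch out!"}
print(stylize(segment["subtitle"], fuse(analyze(segment))))
```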
- A system for facilitating automatic media editing is now described, followed by a discussion of the operation of the components within the system. -
FIG. 1A is a block diagram of a media processing system 102 in which embodiments of the techniques for visually accentuating the semantic context of text or events portrayed within media content may be implemented. The media processing system 102 may be embodied, for example, as a desktop computer, computer workstation, laptop, a smartphone 109, a tablet, or other computing platform that includes a display 104 and may include such input devices as a keyboard 106 and a mouse 108.
- For embodiments where the media processing system 102 is embodied as a smartphone 109 or tablet, the user may interface with the media processing system 102 via a touchscreen interface (not shown). In other embodiments, the media processing system 102 may be embodied as a video gaming console 171, which includes a video game controller 172 for receiving user preferences. For such embodiments, the video gaming console 171 may be connected to a television (not shown) or other display 104.
- The media processing system 102 is configured to retrieve, via the media interface 112, digital media content 115 stored on a storage medium 120 such as, by way of example and without limitation, a compact disc (CD) or a universal serial bus (USB) flash drive, wherein the digital media content 115 may then be stored locally on a hard drive of the media processing system 102. As one of ordinary skill will appreciate, the digital media content 115 may be encoded in any of a number of formats including, but not limited to, JPEG (Joint Photographic Experts Group) files, TIFF (Tagged Image File Format) files, PNG (Portable Network Graphics) files, GIF (Graphics Interchange Format) files, BMP (bitmap) files, or any number of other digital formats.
- The digital media content 115 may be encoded in other formats including, but not limited to, Motion Picture Experts Group (MPEG)-1, MPEG-2, MPEG-4, H.264, Third Generation Partnership Project (3GPP), 3GPP-2, Standard-Definition Video (SD-Video), High-Definition Video (HD-Video), Digital Versatile Disc (DVD) multimedia, Video Compact Disc (VCD) multimedia, High-Definition Digital Versatile Disc (HD-DVD) multimedia, Digital Television Video/High-definition Digital Television (DTV/HDTV) multimedia, Audio Video Interleave (AVI), Digital Video (DV), QuickTime (QT) file, Windows Media Video (WMV), Advanced System Format (ASF), Real Media (RM), Flash Media (FLV), MPEG Audio Layer III (MP3), MPEG Audio Layer II (MP2), Waveform Audio Format (WAV), Windows Media Audio (WMA), or any number of other digital formats.
- As depicted in FIG. 1A, the media interface 112 in the media processing system 102 may also be configured to retrieve digital media content 115 directly from a digital recording device 107, where a cable 111 or some other interface may be used for coupling the digital recording device 107 to the media processing system 102. The media processing system 102 may support any one of a number of common computer interfaces, such as, but not limited to, IEEE-1394 High Performance Serial Bus (FireWire), USB, a serial connection, and a parallel connection.
- The digital recording device 107 may also be coupled to the media processing system 102 over a wireless connection or other communication path. The media processing system 102 may be coupled to a network 118 such as, for example, the Internet, intranets, extranets, wide area networks (WANs), local area networks (LANs), wired networks, wireless networks, or other suitable networks, etc., or any combination of two or more such networks. Through the network 118, the media processing system 102 may receive digital media content 115 from another computing system 103. Alternatively, the media processing system 102 may access one or more media content sharing websites 134 hosted on a server 137 via the network 118 to retrieve digital media content 115.
- The components executed on the media processing system 102 include a content analyzer 114, a tokenizer 116, a semantic fusion operator 119, a visualizer 121, and other applications, services, processes, systems, engines, or functionality not discussed in detail herein. The content analyzer 114 is executed to perform semantic analysis on the media content received by the media interface 112. The tokenizer 116 is executed to generate context tokens based on the semantic analysis, where the context tokens may be generated based on classification of visual cues, audio cues, and textual cues extracted by the content analyzer 114.
- The semantic fusion operator 119 is executed to combine the context tokens generated by the tokenizer 116, and the visualizer 121 is executed to visually accentuate at least one context portrayed in the media content according to the context tokens. For various embodiments, the visualizer 121 modifies the appearance of subtitles/captions in the media content by modifying the font, font size, subtitle location, and so on. For some embodiments, the user may specify predetermined modifications to be applied for certain contexts. For example, the user may specify that if the content analyzer 114 determines that the context in the media content involves a scary scene, a certain font (e.g., a Gothic font style) is automatically applied to the subtitles relating to that scene or event.
- The process flow between the various components of the media processing system 102 is now described. Reference is made to FIG. 1B, which illustrates various components of the media processing system 102 in FIG. 1A. To begin, the media interface 112 obtains media content, where the media content may include subtitles 151 corresponding to the text or commentary within the media content. The subtitles 151 may be embedded directly into the media content, stored separately and superimposed during playback, or stored according to other means as known to those skilled in the art.
- The media interface 112 forwards the media content to the content analyzer 114, which includes an image analyzer 162, an audio analyzer 164, a tokenizer 116, a text analyzer 170, and other applications, services, processes, systems, engines, or functionality not discussed in detail herein. The content analyzer 114 analyzes the semantic-rich media content to extract information later used for modifying or generating stylized subtitles corresponding to the media content. Note that the media content may comprise video content as well as digital images that include embedded captions stored, for example, as metadata.
- The image analyzer 162 analyzes the media content and identifies such visual cues as facial expressions, body language of individuals depicted in the media content, physical attributes of individuals, and so on. The image analyzer 162 may also analyze attributes of the media content including, for example, lighting, color temperature, color hue, contrast level, and so on.
- The audio analyzer 164 analyzes the media content and identifies such audio cues as speech tones of individuals within the media content, the speed at which individuals are talking, speech volume, direction of speech, tone, and so on. The audio cues may also include intonation that may serve as an indication of one or more emotions of a speaker. The tokenizer 116 extracts textual information from the media content. For some embodiments, the tokenizer 116 may directly process the subtitles 151 and tokenize the words in the subtitles 151. For situations where the media content does not include subtitles 151, the tokenizer 116 may be configured to process the audio portion of the media content and extract text information. For some embodiments, a speech recognition component 117 in the tokenizer 116 converts audio data into text data when the media content does not include subtitles 151.
- The tokenizer 116 processes textual information and breaks the information into meaningful elements that are significant as a group, wherein tokenization may be performed based on lexical analysis. The lexical analysis performed by the tokenizer 116 may be based on regular expressions, specific key words, and so on, where such information may be stored in a database 178. For some embodiments, specific key words may comprise any of transition words, conjunctions, words that convey emphasis, repeated words, symbols, predefined keywords from a database, or any combination thereof. Based on the lexical analysis performed by the tokenizer 116, the text analyzer 170 extracts textual cues from the media content.
- The data stored in the database 178 may also include key attributes such as visual attributes (e.g., lighting level, human facial expressions, body language, themes, color hue, color temperature), audio attributes (e.g., volume level), and other attributes. The image analyzer 162, audio analyzer 164, and text analyzer 170 respectively generate context tokens 174 relating to the media content. The semantic fusion operator 119 processes the context tokens 174 and combines context tokens relating to similar points within the media content. Note that for some embodiments, the context tokens 174 may be sent directly to the visualizer 121 without being processed by the semantic fusion operator 119.
- Note that the content analyzer 114 may be configured to first analyze the textual content followed by the audio content and the visual content. Alternatively, the content analyzer 114 may be configured to first analyze the visual content followed by the text content and the audio content. In this regard, the content analyzer 114 may be configured to analyze the various components of the media content in a particular order or concurrently. The semantic fusion operator 119 combines the context tokens 174, and the mapping module 176 maps the combined context tokens 174 to specific text associated with the event or context in the media content, as described in more detail below. The visualizer 121 modifies the subtitles 151 corresponding to the text, where the modification may include, for example and without limitation, a change in the subtitle font, change in font size, change in font color, and change in subtitle location. The visualizer 121 incorporates the stylistic changes and outputs the modified media content 180.
FIG. 3A , each context token 174 comprises amedia stamp 302 and asemantic vector 304, where themedia stamp 302 corresponds to the media content. For some embodiments, the media stamp comprises a time stamp corresponding to a position within the media content. The media stamp may also specify a window of time relative to the time stamp. For example, themedia stamp 302 may specify that the correspondingsemantic vector 304 corresponds to a time interval spanning 10:33 to 10:57 in the media content. - The
semantic vector 304 corresponds to semantic concepts derived by the image analyzer 162 (FIG. 1B ), audio analyzer 164 (FIG. 1B ), and the text analyzer 170 (FIG. 1B ). Eachsemantic vector 304 within a context token may contain one or more entries where each entry comprises a semantic dimension 306 and a corresponding strength value 308. A semantic dimension 306 corresponds to a contextual cue within the media content and may include visual cues, audio cues, textual cues, and so on. - During pre-processing by the
image analyzer 162,audio analyzer 164, and thetext analyzer 170, visual, audio, and textual content are analyzed and represented by a context token ci, which comprises a media stamp and one or moresemantic vectors 304. A context token ci is represented by the following expression: -
c_i = { t_i | v_i },
- where t_i denotes the media stamp of the context token, and v_i denotes the semantic vector 304, which is expressed as:
-
v_i = (d_1, d_2, . . . , d_n).
- In the expression above, d_j represents a strength or likelihood value towards a particular semantic dimension, such as but not limited to, a positive atmosphere, a negative atmosphere, a feeling of happiness, sadness, anger, horror, a feeling of mystery, a feeling of romance, a feminine theme, a masculine theme, and so on. For example, the visual content of a particular scene with dark and gray visual attributes may be assigned a higher strength value towards a semantic dimension of negativity, horror, and a feeling of mystery.
- Speech (i.e., audio) content expressing delight and characterized by a high pitch intonation pattern may be assigned a higher strength value towards a positive feeling, a feeling of happiness, a feminine theme, while a soft, gentle, and low pitch intonation pattern may be assigned a higher strength value towards a positive feeling, a feeling of romance, and a masculine theme. Textual context comprising specific transition keywords may be assigned a higher strength value to a semantic dimension reflecting strong emphasis. For example, a specific phrase such as “with great power, comes great responsibility” may be assigned a higher strength value reflecting strong emphasis, a positive atmosphere, and a masculine theme. In this regard, the corresponding strength value 308 reflects a confidence level of the semantic dimension 306.
- The semantic fusion operator 119 (
FIG. 1B ) combines thecontext tokens 174 to generate a fused semantic vector, and the mapping module 176 (FIG. 1B ) maps the combinedcontext tokens 174 to specific text associated with the event or context in the media content. Specifically, a fused semantic vector vƒ T associated with a specified media stamp T is determined by the following expression: -
v ƒ T=ƒ(v v T ,v a T ,v t T), - where vv T denotes the semantic vector of visual content for media stamp T, va T denotes the semantic vector of audio content for media stamp T, vt T denotes the semantic vector of text content for media stamp T, and ƒ( ) denotes the fusion function. The fusion function may be implemented as an operator for combining semantic vectors. For some embodiments, the fusion function may be expressed as a weighted summation function:
-
ƒ(v v T ,v a T ,v t T)=Σ{v,a,t} w i T ,v i T =w v T v v T +w a T v a T +w t T v t T, - where (wi) corresponds to the weight value of each type of semantic vector (i.e., semantic vector of visual content, semantic vector of audio content, and semantic vector of textual content). Each weight value represents the confidence level of a particular semantic vector. For example, the weight value (wa T) for the audio semantic vector (va T) may be higher if the audio cues during time period (T) comprise dramatic intonations that occur in a given scene. On the other hand, the weight value (wv T) for the visual semantic vector (vv T) may be lower if the same scene provides few visual cues. The fusion function may also be implemented according to a neural network model. The
mapping module 176 then maps the fused semantic vector vƒ T to media or corresponding subtitles according to the media stamp T. -
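The weighted-summation fusion described above can be sketched as follows. This is a minimal illustration, not the disclosed implementation; the two-dimensional vectors, the modality weights, and all numeric values are made-up assumptions:

```python
from typing import Dict, List

def fuse(vectors: Dict[str, List[float]], weights: Dict[str, float]) -> List[float]:
    """Weighted-summation fusion: f(v_v, v_a, v_t) = sum over i in {v, a, t} of w_i * v_i.

    `vectors` maps each modality (visual/audio/text) to its semantic vector;
    `weights` maps each modality to its confidence weight w_i.
    """
    length = len(next(iter(vectors.values())))
    fused = [0.0] * length
    for modality, vector in vectors.items():
        for j, d in enumerate(vector):
            fused[j] += weights[modality] * d
    return fused

# A scene with dramatic audio cues but few visual cues: weight audio higher.
vectors = {"visual": [0.2, 0.1], "audio": [0.9, 0.6], "text": [0.5, 0.3]}
weights = {"visual": 0.2, "audio": 0.6, "text": 0.2}
print(fuse(vectors, weights))  # approximately [0.68, 0.44]
```

Because each weight models the confidence in its modality, raising w_a and lowering w_v lets strong audio cues dominate the fused vector, as in the example above.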
FIG. 3A provides an example of a context token with a plurality of semantic dimensions 306. In this example, the semantic dimensions 306 correspond to emotions, where the image analyzer 162, audio analyzer 164, and/or the text analyzer 170 determines, based on various contextual cues within the media content, that one or more individuals in the media content exhibit such emotions as happiness, sadness, anger, and fear. As shown, each of the semantic dimensions 306 is assigned a corresponding strength value 308. -
FIG. 3B is an example of a textual context token 320 comprising a media stamp 322 that specifies the time at which the corresponding text 324 is to be displayed. The textual context token 320 further comprises an entropy value 326 and a semantic vector 328, wherein the entropy value 326 represents the information content of the particular text section. In the example shown, the text content comprises the subtitle "That's Awesome!" The text segment is tokenized into two text tokens, "That's" and "Awesome." The text "That's" contains less useful information and is therefore assigned a lower entropy value, whereas the text "Awesome" is assigned a higher entropy value; the higher entropy value triggers the visual accentuation. Moreover, a negative value for a semantic dimension can resolve a contradiction with a visual or audio context token. For example, suppose the audio context token of the corresponding portion has a sadness value of −0.6 while the video context token has a sadness value of 0.4. In this case, the bias is corrected by adjusting the sadness dimension to a neutral state of zero given the conflicting values −0.6 and 0.4. -
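The entropy-driven accentuation and the sign-conflict correction described for FIG. 3B can be sketched as follows. The corpus frequency counts, the threshold, and the zero-reset rule for conflicting signs are illustrative assumptions, not the disclosed implementation:

```python
import math

def entropy_scores(tokens, corpus_counts, corpus_total):
    """Score each token by surprisal: rare, content-bearing words score higher."""
    return {t: -math.log2(corpus_counts.get(t.lower(), 1) / corpus_total)
            for t in tokens}

def accentuate(tokens, scores, threshold):
    """Visually accentuate only high-entropy tokens (upper-casing stands in
    for enlarging the font, changing style, animating, etc.)."""
    return " ".join(t.upper() if scores[t] >= threshold else t for t in tokens)

def neutralize(audio_value, video_value):
    """If two modalities disagree in sign (e.g. sadness -0.6 vs. 0.4),
    reset the dimension to a neutral state of zero; otherwise average."""
    if audio_value * video_value < 0:
        return 0.0
    return (audio_value + video_value) / 2

corpus_counts = {"that's": 9000, "awesome": 50}  # made-up frequency counts
tokens = ["That's", "Awesome!"]
scores = entropy_scores(tokens, corpus_counts, corpus_total=10000)
print(accentuate(tokens, scores, threshold=3.0))  # That's AWESOME!
print(neutralize(-0.6, 0.4))                      # 0.0
```

The common word scores near zero bits while the rare word scores high, so only "Awesome!" crosses the threshold and is accentuated, mirroring the FIG. 3B example.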
FIG. 2 is a schematic diagram of the media processing system 102 shown in FIG. 1A. The media processing system 102 may be embodied in any one of a wide variety of wired and/or wireless computing devices, such as a desktop computer, portable computer, dedicated server computer, multiprocessor computing device, smartphone 109 (FIG. 1A), tablet computing device, and so forth. As shown in FIG. 2, the media processing system 102 comprises memory 214, a processing device 202, a number of input/output interfaces 204, a network interface 206, a display 104, a peripheral interface 211, and mass storage 226, wherein each of these devices is connected across a local data bus 210. - The
processing device 202 may include any custom-made or commercially available processor, a central processing unit (CPU) or an auxiliary processor among several processors associated with the media processing system 102, a semiconductor-based microprocessor (in the form of a microchip), a macroprocessor, one or more application-specific integrated circuits (ASICs), a plurality of suitably configured digital logic gates, or other well-known electrical configurations comprising discrete elements, both individually and in various combinations, that coordinate the overall operation of the computing system. - The
memory 214 can include any one of a combination of volatile memory elements (e.g., random-access memory (RAM), such as DRAM and SRAM) and nonvolatile memory elements (e.g., ROM, hard drive, tape, CDROM, etc.). The memory 214 typically comprises a native operating system 217 and one or more native applications, emulation systems, or emulated applications for any of a variety of operating systems and/or emulated hardware platforms, emulated operating systems, etc. - The applications may include application-specific software which may comprise some or all of the components (
media interface 112, content analyzer 114, tokenizer 116, semantic fusion operator 119, and visualizer 121) of the media processing system 102 depicted in FIG. 1A. In accordance with such embodiments, the components are stored in memory 214 and executed by the processing device 202. One of ordinary skill in the art will appreciate that the memory 214 can, and typically will, comprise other components which have been omitted for purposes of brevity. - In this regard, the term "executable" may refer to a program file that is in a form that can ultimately be run by the
processing device 202. Examples of executable programs may be, for example, a compiled program that can be translated into machine code in a format that can be loaded into a random access portion of the memory 214 and run by the processing device 202, source code that may be expressed in a proper format such as object code that is capable of being loaded into a random access portion of the memory 214 and executed by the processing device 202, or source code that may be interpreted by another executable program to generate instructions in a random access portion of the memory 214 to be executed by the processing device 202, etc. An executable program may be stored in any portion or component of the memory 214 including, for example, random access memory (RAM), read-only memory (ROM), hard drive, solid-state drive, USB flash drive, memory card, optical disc such as a compact disc (CD) or digital versatile disc (DVD), floppy disk, magnetic tape, or other memory components. - Input/
output interfaces 204 provide any number of interfaces for the input and output of data. For example, where the media processing system 102 comprises a personal computer, these components may interface with one or more user input devices via the I/O interfaces 204, where the user input devices may comprise a keyboard 106 (FIG. 1A) or a mouse 108 (FIG. 1A). The display 104 may comprise a computer monitor, a plasma screen for a PC, a liquid crystal display (LCD), a touchscreen display, or other display device 104. - In the context of this disclosure, a non-transitory computer-readable medium stores programs for use by or in connection with an instruction execution system, apparatus, or device. More specific examples of a computer-readable medium may include, by way of example and without limitation, a portable computer diskette, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM, EEPROM, or Flash memory), and a portable compact disc read-only memory (CDROM) (optical).
- With further reference to
FIG. 2, network interface 206 comprises various components used to transmit and/or receive data over a network environment. For example, the network interface 206 may include a device that can communicate with both inputs and outputs, for instance, a modulator/demodulator (e.g., a modem), a wireless (e.g., radio frequency (RF)) transceiver, a telephonic interface, a bridge, a router, a network card, etc. The media processing system 102 may communicate with one or more computing devices via the network interface 206 over the network 118 (FIG. 1A). The media processing system 102 may further comprise mass storage 226. The peripheral interface 211 supports various interfaces including, but not limited to, IEEE-1394 High Performance Serial Bus (FireWire), USB, a serial connection, and a parallel connection. - Reference is made to
FIG. 4, which is a flowchart 400 in accordance with one embodiment for facilitating automatic media editing performed by the media processing system 102 of FIG. 1A. It is understood that the flowchart 400 of FIG. 4 provides merely an example of the many different types of functional arrangements that may be employed to implement the operation of the various components of the media processing system 102. As an alternative, the flowchart of FIG. 4 may be viewed as depicting an example of steps of a method implemented in the media processing system 102 according to one or more embodiments. - Although the flowchart of
FIG. 4 shows a specific order of execution, it is understood that the order of execution may differ from that which is depicted. For example, the order of execution of two or more blocks may be scrambled relative to the order shown. Also, two or more blocks shown in succession in FIG. 4 may be executed concurrently or with partial concurrence. It is understood that all such variations are within the scope of the present disclosure. - Beginning with
block 410, media content is obtained, and in block 420, semantic analysis is performed on text in at least a portion of the media content to obtain at least one semantic textual segment, each corresponding to a text section of the media content. For some embodiments, the text section comprises at least one word in the text in the at least a portion of the media content. In block 430, at least one context token corresponding to the at least one semantic textual segment is generated. In block 440, the text section is visually accentuated according to the context token. For some embodiments, visually accentuating the text section comprises modifying the text section in the at least a portion of the media content and generating captions in the at least a portion of the media content. Note that modifying the visual appearance of text may be performed according to the literal meaning of the text section. For example, if the text section includes the word "fire" or "flame," the visual appearance of the text section may be modified with a fiery font. As another example, if the text section includes the word "big" or "huge," the visual appearance of the text section may be enlarged. - To further illustrate the media editing techniques disclosed, reference is made to
FIGS. 5-7, which provide various examples of modifications performed by the visualizer 121 (FIG. 1A) in the media processing system 102 (FIG. 1A) in accordance with various embodiments. FIG. 5 is an example where the visualizer 121 (FIG. 1A) changes the font size/style as well as the location of the subtitles. In the example to the left in FIG. 5, the content analyzer 114 (FIG. 1A) analyzes such contextual cues as speech volume (e.g., one or more individuals shouting), keywords/phrases (e.g., "watch out", "warning"), the presence of exclamation points in the subtitles, and so on. In this regard, the media processing system 102 is "text-aware" and is capable of visually accentuating a text section within text content. - In the example to the right in
FIG. 5, the visualizer 121 selectively modifies the text sections containing the text "AWESOME" and the exclamation point in the subtitles. That is, rather than visually accentuating the entire line of subtitles, the visualizer 121 may be configured to visually accentuate only a portion of the subtitles (e.g., selected words, phrases, or punctuation marks). In the example shown, only the word "AWESOME" and the exclamation point are visually accentuated by increasing the font size. In this regard, visually accentuating the text section according to the context token may comprise modifying the text section in the at least a portion of the media content and/or generating captions in the at least a portion of the media content. - As shown, the
visualizer 121 also incorporates animation to further emphasize the words being spoken by the individual. Other forms of animation may include, for example and without limitation, a shrinking/stretching effect, a fade-in/fade-out effect, a shadowing effect, a flipping effect, and so on. The example in FIG. 5 also depicts graphics (i.e., lines) inserted into the media content by the visualizer 121 to indicate which individual is speaking. -
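The keyword-driven modifications illustrated in FIGS. 5-7, together with the literal-meaning styling described for block 440, can be sketched as a simple lookup. The style table and its entries beyond the disclosed "fire"/"flame" and "big"/"huge" examples are hypothetical assumptions:

```python
# Hypothetical keyword-to-style table. The disclosure names "fire"/"flame"
# (fiery font) and "big"/"huge" (enlarged text); the style keys are assumptions.
LITERAL_STYLES = {
    "fire":  {"font": "fiery"},
    "flame": {"font": "fiery"},
    "big":   {"scale": 2.0},
    "huge":  {"scale": 2.0},
}

def style_for(text_section: str) -> dict:
    """Choose a visual modification from the literal meaning of a text section."""
    style = {}
    for word in text_section.lower().split():
        style.update(LITERAL_STYLES.get(word.strip("!?.,"), {}))
    # Exclamation points are a contextual cue for emphasis (cf. FIG. 5).
    if "!" in text_section:
        style["emphasis"] = True
    return style

print(style_for("The flame spreads"))  # {'font': 'fiery'}
print(style_for("That's HUGE!"))       # {'scale': 2.0, 'emphasis': True}
```

A production visualizer would merge such styles with the context-token cues (speech volume, punctuation, body language) rather than rely on keywords alone.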
FIG. 6 is an example where the visualizer 121 (FIG. 1A) changes the font size/style of the captions based on the body language of the individual as well as the presence of exclamation marks in the subtitles. FIG. 7 is an example where the visualizer 121 (FIG. 1A) changes the font size/style of the captions based on image attributes (e.g., low lighting; night time), keywords (e.g., "Halloween"), the presence of an exclamation mark in the subtitles, and so on. -
FIG. 8 is an example where the media content comprises digital photos in which comments, descriptions, and other forms of annotation are embedded. For example, with reference back to the media content website 134 shown in FIG. 1A, the media processing system 102 may retrieve media content from online photo-sharing albums where one or more users upload photos and viewers add corresponding descriptions, comments, etc. to the uploaded photos. In the example shown in FIG. 8, the media content comprises digital photos with corresponding descriptions. As shown, the text section comprising the word "beautiful" is visually accentuated to place emphasis on this word. Note that only the appearance of "beautiful" is modified. - Reference is made to
FIG. 9, which is a flowchart 900 in accordance with an alternative embodiment for facilitating automatic media editing performed by the media processing system 102 of FIG. 1A. It is understood that the flowchart 900 of FIG. 9 provides merely an example of the many different types of functional arrangements that may be employed to implement the operation of the various components of the media processing system 102. As an alternative, the flowchart of FIG. 9 may be viewed as depicting an example of steps of a method implemented in the media processing system 102 according to one or more embodiments. - Although the flowchart of
FIG. 9 shows a specific order of execution, it is understood that the order of execution may differ from that which is depicted. For example, the order of execution of two or more blocks may be scrambled relative to the order shown. Also, two or more blocks shown in succession in FIG. 9 may be executed concurrently or with partial concurrence. It is understood that all such variations are within the scope of the present disclosure. - Beginning with
block 910, media content is obtained and semantic analysis is performed on a textual portion of the media content. For example, as shown in FIG. 1B, the media content obtained by the media interface 112 (FIG. 1B) may include subtitles 151 (FIG. 1B) or captions. - In
block 920, textual context tokens are generated based on the semantic analysis, and in block 930, semantic analysis is performed on an audio portion and on a visual portion of the media content corresponding to the textual portion. For example, the image analyzer 162 (FIG. 1B) and the audio analyzer 164 (FIG. 1B) in the content analyzer 114 (FIG. 1B) may be configured to analyze portions of the media content where dialog between individuals takes place. - In
block 940, context tokens relating to the audio and visual portions are generated. In block 950, the textual context tokens are combined with the context tokens relating to the audio and visual portions, and in block 960, at least one context portrayed in the at least a portion of the media content is visually accentuated according to the combined context tokens. - Reference is made to
FIG. 10, which is a flowchart 1000 in accordance with an alternative embodiment for facilitating automatic media editing performed by the media processing system 102 of FIG. 1A. It is understood that the flowchart 1000 of FIG. 10 provides merely an example of the many different types of functional arrangements that may be employed to implement the operation of the various components of the media processing system 102. As an alternative, the flowchart of FIG. 10 may be viewed as depicting an example of steps of a method implemented in the media processing system 102 according to one or more embodiments. - Although the flowchart of
FIG. 10 shows a specific order of execution, it is understood that the order of execution may differ from that which is depicted. For example, the order of execution of two or more blocks may be scrambled relative to the order shown. Also, two or more blocks shown in succession in FIG. 10 may be executed concurrently or with partial concurrence. It is understood that all such variations are within the scope of the present disclosure. - Beginning with
block 1010, a photo collection comprising digital images and textual content is obtained, and in block 1020, semantic analysis is performed on the textual content to obtain at least one semantic textual segment, each corresponding to a text section of the photo collection. For some embodiments, the text section comprises at least one word in the textual content in the at least a portion of the photo collection. In block 1030, at least one context token corresponding to the at least one semantic textual segment is generated, and in block 1040, the text section is visually accentuated according to the context token. - It should be emphasized that the above-described embodiments of the present disclosure are merely possible examples of implementations set forth for a clear understanding of the principles of the disclosure. Many variations and modifications may be made to the above-described embodiment(s) without departing substantially from the spirit and principles of the disclosure. All such modifications and variations are intended to be included herein within the scope of this disclosure and protected by the following claims.
Claims (32)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US14/085,963 US9645985B2 (en) | 2013-03-15 | 2013-11-21 | Systems and methods for customizing text in media content |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US201361788741P | 2013-03-15 | 2013-03-15 | |
US14/085,963 US9645985B2 (en) | 2013-03-15 | 2013-11-21 | Systems and methods for customizing text in media content |
Publications (2)
Publication Number | Publication Date |
---|---|
US20140278370A1 | 2014-09-18
US9645985B2 (en) | 2017-05-09
Family
ID=51531799
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US14/085,963 Active 2034-07-18 US9645985B2 (en) | 2013-03-15 | 2013-11-21 | Systems and methods for customizing text in media content |
Country Status (1)
Country | Link |
---|---|
US (1) | US9645985B2 (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10102272B2 (en) * | 2015-07-12 | 2018-10-16 | Aravind Musuluri | System and method for ranking documents |
Citations (27)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7089504B1 (en) * | 2000-05-02 | 2006-08-08 | Walt Froloff | System and method for embedment of emotive content in modern text processing, publishing and communication |
US20080320378A1 (en) * | 2005-10-22 | 2008-12-25 | Jeff Shuter | Accelerated Visual Text to Screen Translation Method |
US20090153288A1 (en) * | 2007-12-12 | 2009-06-18 | Eric James Hope | Handheld electronic devices with remote control functionality and gesture recognition |
US20090164888A1 (en) * | 2007-12-19 | 2009-06-25 | Thomas Phan | Automated Content-Based Adjustment of Formatting and Application Behavior |
US20090208118A1 (en) * | 2008-02-19 | 2009-08-20 | Xerox Corporation | Context dependent intelligent thumbnail images |
US20100318360A1 (en) * | 2009-06-10 | 2010-12-16 | Toyota Motor Engineering & Manufacturing North America, Inc. | Method and system for extracting messages |
US20110047508A1 (en) * | 2009-07-06 | 2011-02-24 | Onerecovery, Inc. | Status indicators and content modules for recovery based social networking |
US20110231180A1 (en) * | 2010-03-19 | 2011-09-22 | Verizon Patent And Licensing Inc. | Multi-language closed captioning |
US20110276327A1 (en) * | 2010-05-06 | 2011-11-10 | Sony Ericsson Mobile Communications Ab | Voice-to-expressive text |
US8166051B1 (en) * | 2009-02-03 | 2012-04-24 | Sandia Corporation | Computation of term dominance in text documents |
US20120179982A1 (en) * | 2011-01-07 | 2012-07-12 | Avaya Inc. | System and method for interactive communication context generation |
US20120242897A1 (en) * | 2009-12-31 | 2012-09-27 | Tata Consultancy Services Limited | method and system for preprocessing the region of video containing text |
US20120288203A1 (en) * | 2011-05-13 | 2012-11-15 | Fujitsu Limited | Method and device for acquiring keywords |
US20130036117A1 (en) * | 2011-02-02 | 2013-02-07 | Paul Tepper Fisher | System and method for metadata capture, extraction and analysis |
US8374646B2 (en) * | 2009-10-05 | 2013-02-12 | Sony Corporation | Mobile device visual input system and methods |
US20130067319A1 (en) * | 2011-09-06 | 2013-03-14 | Locu, Inc. | Method and Apparatus for Forming a Structured Document from Unstructured Information |
US20130121410A1 (en) * | 2011-11-14 | 2013-05-16 | Mediatek Inc. | Method and Apparatus of Video Encoding with Partitioned Bitstream |
US20130218858A1 (en) * | 2012-02-16 | 2013-08-22 | Dmitri Perelman | Automatic face annotation of images contained in media content |
US20130298159A1 (en) * | 2012-05-07 | 2013-11-07 | Industrial Technology Research Institute | System and method for allocating advertisements |
US8588825B2 (en) * | 2010-05-25 | 2013-11-19 | Sony Corporation | Text enhancement |
US20140032259A1 (en) * | 2012-07-26 | 2014-01-30 | Malcolm Gary LaFever | Systems and methods for private and secure collection and management of personal consumer data |
US20140081619A1 (en) * | 2012-09-18 | 2014-03-20 | Abbyy Software Ltd. | Photography Recognition Translation |
US20140258851A1 (en) * | 2013-03-11 | 2014-09-11 | Microsoft Corporation | Table of Contents Detection in a Fixed Format Document |
US20140257789A1 (en) * | 2013-03-11 | 2014-09-11 | Microsoft Corporation | Detection and Reconstruction of East Asian Layout Features in a Fixed Format Document |
US20150363478A1 (en) * | 2008-07-11 | 2015-12-17 | Michael N. Haynes | Systems, Devices, and/or Methods for Managing Data |
US9317485B2 (en) * | 2012-01-09 | 2016-04-19 | Blackberry Limited | Selective rendering of electronic messages by an electronic device |
US9342613B2 (en) * | 2004-09-17 | 2016-05-17 | Snapchat, Inc. | Display and installation of portlets on a client platform |
Family Cites Families (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2002244688A (en) | 2001-02-15 | 2002-08-30 | Sony Computer Entertainment Inc | Information processor, information processing method, information transmission system, medium for making information processor run information processing program, and information processing program |
US20070011012A1 (en) | 2005-07-11 | 2007-01-11 | Steve Yurick | Method, system, and apparatus for facilitating captioning of multi-media content |
US8126220B2 (en) | 2007-05-03 | 2012-02-28 | Hewlett-Packard Development Company L.P. | Annotating stimulus based on determined emotional response |
CN101534377A (en) | 2008-03-13 | 2009-09-16 | 扬智科技股份有限公司 | Method and system for automatically changing subtitle setting according to program content |
US8259992B2 (en) | 2008-06-13 | 2012-09-04 | International Business Machines Corporation | Multiple audio/video data stream simulation method and system |
- 2013-11-21: US application 14/085,963 filed; granted as US 9645985 B2 (status: active)
Cited By (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10091202B2 (en) | 2011-06-20 | 2018-10-02 | Google Llc | Text suggestions for images |
US10049477B1 (en) * | 2014-06-27 | 2018-08-14 | Google Llc | Computer-assisted text and visual styling for images |
US20170192939A1 (en) * | 2016-01-04 | 2017-07-06 | Expressy, LLC | System and Method for Employing Kinetic Typography in CMC |
US10467329B2 (en) * | 2016-01-04 | 2019-11-05 | Expressy, LLC | System and method for employing kinetic typography in CMC |
WO2018074658A1 (en) * | 2016-10-17 | 2018-04-26 | 주식회사 엠글리쉬 | Terminal and method for implementing hybrid subtitle effect |
US10474750B1 (en) * | 2017-03-08 | 2019-11-12 | Amazon Technologies, Inc. | Multiple information classes parsing and execution |
US10945041B1 (en) * | 2020-06-02 | 2021-03-09 | Amazon Technologies, Inc. | Language-agnostic subtitle drift detection and localization |
CN114286154A (en) * | 2021-09-23 | 2022-04-05 | 腾讯科技(深圳)有限公司 | Subtitle processing method and device for multimedia file, electronic equipment and storage medium |
CN117319757A (en) * | 2023-09-08 | 2023-12-29 | 北京优酷科技有限公司 | Subtitle display method and device, electronic equipment and computer storage medium |
Also Published As
Publication number | Publication date |
---|---|
US9645985B2 (en) | 2017-05-09 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
AS | Assignment |
Owner name: CYBERLINK CORP., TAIWAN Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:CHEN, HSIEH-WEI;REEL/FRAME:031647/0637 Effective date: 20131121 |
|
STCF | Information on status: patent grant |
Free format text: PATENTED CASE |
|
MAFP | Maintenance fee payment |
Free format text: PAYMENT OF MAINTENANCE FEE, 4TH YEAR, LARGE ENTITY (ORIGINAL EVENT CODE: M1551); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY Year of fee payment: 4 |
|
MAFP | Maintenance fee payment |
Free format text: PAYMENT OF MAINTENANCE FEE, 8TH YEAR, LARGE ENTITY (ORIGINAL EVENT CODE: M1552); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY Year of fee payment: 8 |