CN114051154A - News video strip splitting method and system - Google Patents
- Publication number
- CN114051154A (application CN202111305567.6A)
- Authority
- CN
- China
- Prior art keywords
- video
- voice
- characters
- feature vector
- news
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- H04N21/233 — Processing of audio elementary streams (server side)
- H04N21/23418 — Analysing video streams, e.g. detecting features or characteristics
- H04N21/23424 — Splicing one content stream with another, e.g. for inserting or substituting an advertisement
- H04N21/4394 — Analysing the audio stream at the client, e.g. detecting features or characteristics
- H04N21/44008 — Analysing video streams at the client, e.g. detecting features or characteristics
- H04N21/44016 — Splicing one content stream with another at the client, e.g. for substituting a video clip
- H04N21/8456 — Structuring of content by decomposing it in the time domain, e.g. into time segments
- H04N21/8547 — Content authoring involving timestamps for synchronizing content
Abstract
The invention discloses a news video strip splitting method and system. The method comprises: acquiring video data, converting the speech data in the video data into speech text, and converting the subtitles in the video data into subtitle text; acquiring the timestamps corresponding to the speech text converted from the speech data and the timestamps corresponding to the subtitle text; cutting the video data sentence by sentence according to the speech text to generate video segments; for each video segment, splicing its speech text with its subtitle text, inserting the special token CLS after splicing, feeding the overall text features including CLS into a BERT model, and outputting the semantic feature vector of the video segment; calculating the time interval between adjacent speech sentences according to the timestamps of the speech text, constructing a one-hot vector from the interval as a speech feature vector, splicing the speech feature vector with the semantic feature vector, feeding the spliced vector into a binary classification model, and outputting a result according to the classification score.
Description
Technical Field
The invention relates to the technical field of news media, and in particular to a news video strip splitting method and a news video strip splitting system.
Background
The main task of news strip splitting is to cut the content of a given news video (such as a national news broadcast, a 30-minute news programme, or a local news broadcast) into segments according to certain business logic, providing a data basis for subsequent material organization and content distribution. At present there are two main technical approaches: 1) image-based, which splits the video on shot/scene changes, so that a live broadcast in which the anchor sits motionless offers no scene change to detect; 2) rule-based, which judges news segmentation points from features such as the position, size, and timing of fixed captions. The prior art has the following defects: 1. splitting on shot/scene changes ignores the semantic content of the news and cannot cover videos in which the anchor sits still throughout or in which the picture switches continuously; 2. rule-based segmentation has poor generality and reusability and incurs high labor cost.
Disclosure of Invention
One objective of the present invention is to provide a news video strip splitting method and system that use automatic speech recognition (ASR) and optical character recognition (OCR) simultaneously to obtain the text of the speech broadcast and of the video subtitles, respectively, together with the timestamps corresponding to that text; judging the video segmentation points with these two recognition means effectively improves the accuracy of the segmentation points.
One objective of the invention is to provide a news video strip splitting method and system that splice the text obtained by speech recognition with the text obtained from the subtitles in the video and feed the spliced text into the pre-trained model BERT to generate a joint semantic feature vector; this semantic feature vector avoids the inaccurate strip splitting otherwise caused by an anchor sitting motionless or by continuous picture switching.
One objective of the invention is to provide a news video strip splitting method and system that splice the ASR time-difference feature with the joint semantic feature and judge through a classification model whether a sentence is the ending sentence of a news item before performing the strip splitting, so that the strip splitting involved in the invention need not rely on hand-written rules and has better applicability.
To achieve at least one of the above objects, the present invention provides a news video strip splitting method, comprising:
acquiring video data, converting the speech data in the video data into speech text, and converting the subtitles in the video data into subtitle text;
acquiring the timestamps corresponding to the speech text converted from the speech data and the timestamps corresponding to the subtitle text;
cutting the video data sentence by sentence according to the speech text to generate video segments, splicing the speech text and subtitle text of each video segment, inserting the special token CLS after splicing, feeding the overall text features including CLS into a BERT model, and outputting the semantic feature vector of the video segment;
calculating the time interval between adjacent speech sentences according to the timestamps of the speech text, constructing a one-hot vector from the interval as a speech feature vector, and splicing the speech feature vector with the semantic feature vector;
feeding the spliced speech feature vector and semantic feature vector into a binary classification model for training, and finally outputting a result according to the classification score.
According to one preferred embodiment of the present invention, ASR speech recognition is used to convert the speech data in the video data into speech text and obtain the corresponding timestamps, and OCR character recognition is used to recognize the video subtitle text and obtain the corresponding timestamps.
According to another preferred embodiment of the present invention, the strip splitting method further comprises: cutting the acquired speech text into sentences, cutting the corresponding video data according to the cut speech text to generate corresponding video segments, acquiring the subtitle text of each cut video segment, and merging and splicing the subtitle text of each cut video segment.
According to another preferred embodiment of the present invention, the strip splitting method further comprises: labeling the acquired speech-text sentences, setting tag characters for ending sentences and for non-ending sentences, and establishing the tag feature vectors of the speech-text sentences.
According to another preferred embodiment of the present invention, the strip splitting method comprises: dividing the video into consecutive, non-overlapping sub-blocks of 128 video segments each, with each sub-block serving as an independent video used as input data.
According to another preferred embodiment of the present invention, the speech feature vector is constructed as follows: each sentence in the video segment is assigned a value according to the time interval to the following sentence, where an interval of 0 s is assigned 0, an interval in (0 s, 5 s] is assigned 1, an interval in (5 s, 10 s] is assigned 2, and an interval in (10 s, +∞) is assigned 3; the values 0, 1, 2, 3 are then converted into one-hot vectors used as the speech feature vectors.
According to another preferred embodiment of the present invention, the strip splitting method further comprises: feeding the spliced speech feature vector and semantic feature vector into the pre-trained model BERT for feature extraction, feeding the extracted features into a fully connected layer, and connecting a binary classification model built with a sigmoid function to classify and judge whether the current segment is an ending sentence.
According to another preferred embodiment of the present invention, during the training of the binary classification model, the cross-entropy error over a video sub-block composed of a plurality of video segments is calculated from the predicted ending-sentence probabilities:
J = -(1/n) Σᵢ [yᵢ·log(pᵢ) + (1 − yᵢ)·log(1 − pᵢ)]
where J is the cross-entropy error, yᵢ is the ending-sentence label of the i-th segment, and pᵢ is the predicted probability that the i-th segment is an ending sentence. The minimum of the cross-entropy error is sought by gradient descent as the training-completion criterion, and the training result of the binary classification model is verified with a verification set.
To achieve at least one of the above objects, the present invention further provides a news video strip splitting system that performs the above news video strip splitting method.
The present invention further provides a computer-readable storage medium storing a computer program that can be executed by a processor to perform the above news video strip splitting method.
Drawings
FIG. 1 is a schematic flow chart of the news video strip splitting method according to the present invention;
FIG. 2 is a schematic model diagram of the news video strip splitting system according to the present invention.
Detailed Description
The following description is presented to disclose the invention so as to enable any person skilled in the art to practice the invention. The preferred embodiments in the following description are given by way of example only, and other obvious variations will occur to those skilled in the art. The basic principles of the invention, as defined in the following description, may be applied to other embodiments, variations, modifications, equivalents, and other technical solutions without departing from the spirit and scope of the invention.
It should be understood that the terms "a" and "an" are to be interpreted as "at least one" or "one or more"; that is, in one embodiment the number of an element may be one, while in another embodiment it may be plural, and these terms are not to be construed as limiting the number.
Referring to figs. 1-2, the present invention discloses a news video strip splitting method and system. The method comprises the following steps. First, video data is collected; the video data may be obtained from the network using crawler technology — for example, 1000 news videos are crawled, of which 80% serve as the training set and 20% as the verification set. After collection, the news video data is preprocessed as follows: an existing automatic speech recognition (ASR) technology is used to convert the speech data in the news videos into speech text, i.e. text in written form, and the timestamp corresponding to each speech sentence is acquired; each news video is further decomposed into picture frames, and OCR (optical character recognition) is applied to obtain the subtitle text of each frame together with its timestamp. It should be noted that both ASR and OCR are prior art, and their recognition processes are not described in detail in the present invention.
Further, after preprocessing the news video data, the complete news video is cut into segments according to the acquired speech text. The cutting method is as follows: the video is segmented sentence by sentence according to the speech text produced by speech recognition, where the recognized sentences can be written as S = (s1, s2, s3, ..., sn), timestamps are obtained from the speech text, and si denotes any sentence in the set S. The video is correspondingly divided into segments V = (v1, v2, v3, ..., vn), where vi is the video segment corresponding to the speech sentence si. Further, the subtitles appearing within each cut video segment vi are spliced together into the subtitle text ci of that segment.
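The sentence-wise cutting and subtitle attachment described above can be sketched as follows. The (text, start, end) layout of the ASR output and the (text, timestamp) layout of the OCR output are assumptions made for illustration; a real ASR/OCR service returns its own schema.

```python
def cut_and_attach_subtitles(asr_sentences, ocr_lines):
    """For each ASR sentence s_i with time span (start, end), collect the
    OCR subtitle lines whose timestamps fall inside the span and join them
    into the segment's subtitle text c_i.

    asr_sentences: list of (text, start_sec, end_sec) tuples
    ocr_lines:     list of (text, timestamp_sec) tuples
    """
    segments = []
    for text, start, end in asr_sentences:
        caption = "".join(t for t, ts in ocr_lines if start <= ts < end)
        segments.append({"speech": text, "span": (start, end), "caption": caption})
    return segments
```

Each returned dict corresponds to one segment vi, pairing its speech sentence si with its spliced subtitle text ci.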
Further, the sentences S = (s1, s2, s3, ..., sn) obtained by segmenting the recognized speech text need to be manually labeled. According to the topic and content of the news, each segmented sentence is judged to be an ending sentence or not: if the current sentence si is an ending sentence, it is labeled 1, forming the ending-sentence tag of that sentence; if it is not, it is labeled 0, forming the non-ending-sentence tag. Over all segmented sentences this yields a 0/1 sequence Y = (y1, y2, y3, ..., yn), where yi ∈ {0, 1} is the ending-sentence label of the corresponding cut sentence si. For example: "xxx attends xxx meetings. [END] A temple fair was held in Beijing. Many people came enthusiastically to take part. The activities include xxx. [END]" Here [END] marks an ending sentence, whose label maps to 1, while the labels after all other periods map to 0. That is, in a specific news context a sentence terminated by a period is not necessarily a true ending sentence, so the form that ending sentences take in that context is captured through manual labeling, which facilitates the subsequent model training.
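Converting such annotated sentences into the label sequence Y = (y1, ..., yn) can be sketched as follows; the trailing-"[END]" convention mirrors the example above, and the function name is hypothetical.

```python
def labels_from_annotation(annotated_sentences):
    """Turn manually annotated sentences, where a trailing '[END]' marks an
    ending sentence, into (clean sentences, labels) with y_i in {0, 1}."""
    clean, labels = [], []
    for s in annotated_sentences:
        if s.endswith("[END]"):
            labels.append(1)                           # ending sentence
            clean.append(s[: -len("[END]")].rstrip())  # drop the marker
        else:
            labels.append(0)                           # non-ending sentence
            clean.append(s)
    return clean, labels
```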
It is worth mentioning that after the manual labeling of the ending-sentence tags, semantic features need to be extracted. The semantic feature extraction method is as follows: taking the video segment corresponding to each cut sentence as the granularity, the speech text of each video segment is spliced with the corresponding subtitle text in the form si = (wi1, wi2, ..., wim) [SEP] ci = (ti1, ti2, ..., tik), where wim is a single character of the cut speech sentence, tik is a single character of the corresponding subtitle sentence, and [SEP] is the separator token. During splicing, the special token [CLS] is simultaneously inserted at the head of the spliced sentence, so that the complete spliced feature [CLS] si [SEP] ci is formed. The spliced features are fed into a pre-trained BERT model for semantic feature extraction, and the output vector at the [CLS] position represents the joint semantic feature vector of each video segment.
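A minimal string-level sketch of the [CLS] si [SEP] ci splicing follows. In a real pipeline a BERT tokenizer (for example from the transformers library) would insert the special tokens itself when given a sentence pair, and the hidden state at the [CLS] position would be taken as the joint semantic feature vector; the helper name here is hypothetical.

```python
def build_bert_input(speech_sentence, caption_text):
    """Form the joint input "[CLS] s_i [SEP] c_i" described above.

    This string form only illustrates the layout; a tokenizer normally
    adds [CLS]/[SEP] and converts characters to input IDs itself.
    """
    return "[CLS]" + speech_sentence + "[SEP]" + caption_text
```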
After the joint semantic feature vector of each video segment has been constructed, the invention further constructs the speech feature vector, as follows:
the video is divided into video sub-blocks of 128 video segments each (the number 128 is not mandatory; the invention merely uses it for illustration), and each sub-block serves as an independent video used as input data of the classification model. A speech feature vector is then constructed from the time interval between each sentence of the speech text and the following sentence: an interval of 0 s is assigned the value 0, an interval in (0 s, 5 s] the value 1, an interval in (5 s, 10 s] the value 2, and an interval in (10 s, +∞) the value 3. These values 0, 1, 2, 3 are converted into one-hot vectors serving as the speech feature vector of the current sentence; the last video segment, which has no following sentence, is assigned the value 3. For example, if the time interval between the second sentence and the first sentence is 3 s, the speech feature vector value of the first sentence is 1.
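The interval bucketing and one-hot construction above can be sketched directly; the function names and the (start, end) span layout are assumptions for illustration.

```python
def interval_bucket(gap_seconds):
    """Map the pause after a sentence to a bucket:
    0 -> 0, (0, 5] -> 1, (5, 10] -> 2, (10, +inf) -> 3."""
    if gap_seconds <= 0:
        return 0
    if gap_seconds <= 5:
        return 1
    if gap_seconds <= 10:
        return 2
    return 3

def speech_feature_vectors(sentence_spans):
    """One-hot speech feature vectors from the gaps between adjacent
    sentences; sentence_spans is a list of (start_sec, end_sec) pairs.
    The last sentence, having no successor, gets bucket 3 as stated above."""
    feats = []
    n = len(sentence_spans)
    for i in range(n):
        if i + 1 < n:
            gap = sentence_spans[i + 1][0] - sentence_spans[i][1]
            bucket = interval_bucket(gap)
        else:
            bucket = 3
        one_hot = [0, 0, 0, 0]
        one_hot[bucket] = 1
        feats.append(one_hot)
    return feats
```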
Further, the speech feature vector and the semantic feature vector of each video segment are spliced by direct concatenation of the two vectors, so that their dimensions add. The concatenation result for each video segment is fed into the pre-trained BERT model again to extract further features; the feature vectors extracted by the BERT model are fed into a fully connected layer, which is followed by a binary classification model built with a sigmoid function to classify and judge whether the current segment is an ending sentence. For a video sub-block consisting of n video segments, the cross-entropy error is defined as:
J = -(1/n) Σᵢ [yᵢ·log(pᵢ) + (1 − yᵢ)·log(1 − pᵢ)]
where yᵢ is the ending-sentence label described above and pᵢ is the predicted probability that segment i is an ending sentence. This cross-entropy error is minimized on the training set by gradient descent, the effect is verified on the verification set, and the round that performs best on the verification set is saved as the final model.
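The sigmoid output and the sub-block cross-entropy J can be written out directly. This pure-Python version is a sketch of the loss computation only; the fully connected layer, the BERT feature extractor, and the gradient-descent training loop of the actual model are omitted.

```python
import math

def sigmoid(z):
    """Squash a fully-connected-layer score into a probability p_i."""
    return 1.0 / (1.0 + math.exp(-z))

def cross_entropy(labels, probs):
    """J = -(1/n) * sum_i [ y_i*log(p_i) + (1 - y_i)*log(1 - p_i) ]
    over a sub-block of n video segments, as defined above."""
    n = len(labels)
    return -sum(y * math.log(p) + (1 - y) * math.log(1 - p)
                for y, p in zip(labels, probs)) / n
```

A training loop would update the classifier weights so that J decreases, and the checkpoint scoring best on the verification set would be kept.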
After the training of the binary classification model is completed, a video is recognized according to the above steps. The recognition result of the binary classification model may, for example, be 0010001: from this result it is known that the third and seventh sentences are ending sentences, so the first three sentences are merged into one segment and the fourth through seventh sentences into another, completing the strip-splitting result of the news video into per-item clips.
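The final merging step — grouping segments up to each predicted ending sentence — can be sketched as follows; with the example result 0010001 it reproduces the grouping described above (sentences 1–3 and 4–7).

```python
def merge_by_end_sentences(predictions):
    """Group consecutive segment indices; each group ends at a segment
    predicted to be an ending sentence (label 1)."""
    groups, current = [], []
    for i, p in enumerate(predictions):
        current.append(i)
        if p == 1:
            groups.append(current)
            current = []
    if current:  # trailing segments with no final ending sentence
        groups.append(current)
    return groups
```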
In particular, according to the embodiments of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer-readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network via the communication section, and/or installed from a removable medium. The computer program, when executed by a central processing unit (CPU), performs the above-described functions defined in the method of the present application. It should be noted that the computer-readable medium mentioned above in the present application may be a computer-readable signal medium or a computer-readable storage medium or any combination of the two. The computer-readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present application, a computer-readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device.
In this application, however, a computer-readable signal medium may include a propagated data signal with computer-readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electromagnetic, optical, or any suitable combination thereof. A computer-readable signal medium may also be any computer-readable medium that is not a computer-readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer-readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
It will be understood by those skilled in the art that the embodiments of the present invention described above and illustrated in the drawings are given by way of example only and not by way of limitation, the objects of the invention having been fully and effectively achieved, the functional and structural principles of the present invention having been shown and described in the embodiments, and that various changes or modifications may be made in the embodiments of the present invention without departing from such principles.
Claims (10)
1. A news video strip splitting method, the method comprising:
acquiring video data, converting voice data in the video data into voice characters, and converting subtitles in the video data into subtitle characters;
acquiring a timestamp corresponding to voice characters converted from voice data and acquiring a timestamp corresponding to subtitle characters;
cutting video data sentence by sentence according to voice characters to generate a video segment, splicing the voice characters and subtitle characters in the video segment, inserting special characters CLS after splicing, further inputting the overall character features including the CLS into a BERT model, and outputting semantic feature vectors of the video segment;
calculating the time interval between adjacent voice character sentences according to the time stamp corresponding to the voice character, constructing a one-hot vector as a voice feature vector according to the time interval, and splicing the voice feature vector and the semantic feature vector;
inputting the spliced voice feature vector and the semantic feature vector into a two-classification model for training, and finally outputting a result according to a classification score.
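The splicing step of claim 1 — concatenating the BERT semantic vector with the one-hot timing vector before classification — can be sketched as follows. This is a minimal illustration, not the patent's implementation: a real system would use a 768-dimensional BERT vector, and the 4-dimensional values here are hypothetical.

```python
def splice_features(semantic_vec, speech_vec):
    """Concatenate the BERT semantic vector with the timing one-hot vector,
    producing the joint feature fed to the binary classifier."""
    return list(semantic_vec) + list(speech_vec)

# Hypothetical toy values: 4-dim "semantic" vector, 4-dim one-hot timing vector
# whose hot position marks an inter-sentence gap in (0s, 5s].
features = splice_features([0.2, -0.1, 0.5, 0.3], [0, 1, 0, 0])
```

The one-hot tail keeps the timing signal explicit rather than letting it be absorbed into the dense semantic embedding.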
2. The news video strip splitting method of claim 1, wherein ASR (automatic speech recognition) is used to convert the speech data in the video data into speech text and to obtain the timestamps corresponding to the speech text, and OCR (optical character recognition) is used to recognize the video subtitle text and to obtain the timestamps corresponding to that text.
3. The news video strip splitting method of claim 1, further comprising: cutting the acquired speech text into sentences, cutting the corresponding video data according to the cut speech text to generate corresponding video segments, acquiring the subtitle text of each cut video segment, and merging the subtitle text within each cut video segment.
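The merging step of claim 3 can be sketched as follows: given per-sentence time spans from the ASR timestamps and timed subtitle lines from OCR, each subtitle line is assigned to the sentence span containing its midpoint. The midpoint heuristic and the sample data are assumptions for illustration, not specified by the patent.

```python
def merge_subtitles(sentence_spans, subtitles):
    """For each sentence's (start, end) span, collect and join the subtitle
    lines whose temporal midpoint falls inside that span."""
    merged = []
    for start, end in sentence_spans:
        lines = [text for text, s, e in subtitles if start <= (s + e) / 2 < end]
        merged.append(" ".join(lines))
    return merged

# Hypothetical ASR sentence spans and OCR subtitle lines (text, start, end):
spans = [(0.0, 4.0), (4.0, 9.0)]
subs = [("breaking news", 0.5, 2.0),
        ("from the capital", 2.0, 3.8),
        ("markets rallied today", 4.5, 8.0)]
merged = merge_subtitles(spans, subs)
```

Here the first two subtitle lines land in the first sentence span and are joined; the third falls in the second span.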
4. The news video strip splitting method of claim 1, further comprising: labeling the acquired speech-text sentences, setting a tag character for ending sentences and a tag character for non-ending sentences, and establishing tag feature vectors for the speech-text sentences.
5. The news video strip splitting method of claim 1, wherein the splitting method comprises: dividing the video into consecutive, non-overlapping sub-blocks of 128 video segments each, each sub-block serving as an independent video used as input data.
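The sub-block division of claim 5 is plain fixed-size chunking; a minimal sketch (the 128-segment block size is from the claim, the toy input is not):

```python
def to_subblocks(segments, block_size=128):
    """Split a list of video segments into consecutive, non-overlapping
    sub-blocks of at most block_size segments; the last block may be shorter."""
    return [segments[i:i + block_size] for i in range(0, len(segments), block_size)]

# 300 toy segments -> two full sub-blocks of 128 plus a remainder of 44.
blocks = to_subblocks(list(range(300)), block_size=128)
```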
6. The news video strip splitting method of claim 1, wherein the speech feature vector construction method comprises: assigning a bucket value to the time interval between adjacent sentences in the video segment, wherein a time interval of 0 s is assigned 0, an interval in (0 s, 5 s] is assigned 1, an interval in (5 s, 10 s] is assigned 2, and an interval in (10 s, +∞) is assigned 3; the values 0, 1, 2 and 3 are then converted into one-hot vectors used as the speech feature vectors.
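The bucketing of claim 6 maps an inter-sentence gap to one of four values and then to a one-hot vector; a direct sketch of that mapping:

```python
def interval_bucket(seconds):
    """Map the inter-sentence time gap to the bucket of claim 6:
    0s -> 0, (0s,5s] -> 1, (5s,10s] -> 2, (10s,+inf) -> 3."""
    if seconds <= 0:
        return 0
    if seconds <= 5:
        return 1
    if seconds <= 10:
        return 2
    return 3

def one_hot(bucket, size=4):
    """Encode the bucket index as a one-hot speech feature vector."""
    vec = [0] * size
    vec[bucket] = 1
    return vec
```

A long pause between sentences (bucket 3) is a strong cue for a story boundary, which is why the gap is made an explicit feature alongside the text semantics.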
7. The news video strip splitting method of claim 6, further comprising: inputting the spliced speech feature vector and semantic feature vector into the pre-trained BERT model for feature extraction, feeding the extracted features into a fully connected layer, and passing the output to a binary classification model built with a sigmoid function to classify whether the current segment is an ending sentence.
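The classifier head of claim 7 — a fully connected layer followed by a sigmoid — can be sketched in plain Python. The feature dimension and random weights here are hypothetical stand-ins for the BERT-extracted features and learned parameters.

```python
import math
import random

def dense_sigmoid(features, weights, bias):
    """Fully connected layer followed by a sigmoid, yielding a score in (0, 1);
    a score >= 0.5 marks the segment as an ending sentence."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-z))

random.seed(0)
features = [random.uniform(-1, 1) for _ in range(8)]   # toy extracted features
weights = [random.uniform(-1, 1) for _ in range(8)]    # toy learned weights
score = dense_sigmoid(features, weights, bias=0.0)
is_ending_sentence = score >= 0.5
```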
8. The news video strip splitting method of claim 7, wherein in the training process of the binary classification model, the cross-entropy error of a video sub-block composed of a plurality of video segments is calculated and used for estimating the probability that each segment is an ending sentence:

J = -(1/N) Σᵢ [ yᵢ log pᵢ + (1 - yᵢ) log(1 - pᵢ) ]

where J is the cross-entropy error, yᵢ is the label, and pᵢ is the predicted probability; the minimum of the cross-entropy error is found by a gradient descent method as the training-completion criterion, and a verification set is used to verify the training result of the binary classification model.
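The training objective of claim 8 can be sketched with a toy one-feature logistic model trained by gradient descent; this is an illustration of cross-entropy minimisation, not the patent's BERT-based classifier, and the data values are invented.

```python
import math

def bce_loss(labels, probs):
    """Binary cross-entropy J averaged over a sub-block of video segments."""
    return -sum(y * math.log(p) + (1 - y) * math.log(1 - p)
                for y, p in zip(labels, probs)) / len(labels)

def train_logistic(xs, ys, lr=0.5, steps=200):
    """Minimise J by gradient descent on a one-feature logistic model."""
    w, b = 0.0, 0.0
    for _ in range(steps):
        probs = [1.0 / (1.0 + math.exp(-(w * x + b))) for x in xs]
        # Gradients of mean BCE with a sigmoid output: dJ/dw, dJ/db
        gw = sum((p - y) * x for p, y, x in zip(probs, ys, xs)) / len(xs)
        gb = sum(p - y for p, y in zip(probs, ys)) / len(xs)
        w -= lr * gw
        b -= lr * gb
    return w, b

xs = [0.0, 0.2, 0.8, 1.0]   # toy feature: large values mark ending sentences
ys = [0, 0, 1, 1]
w, b = train_logistic(xs, ys)
probs = [1.0 / (1.0 + math.exp(-(w * x + b))) for x in xs]
```

After training, the loss falls below the untrained value of ln 2 and the model separates the two classes on this toy data.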
9. A news video strip splitting system, wherein the system performs the news video strip splitting method as claimed in any one of claims 1-8.
10. A computer-readable storage medium storing a computer program, the computer program being executable by a processor to perform the news video strip splitting method as claimed in any one of claims 1-8.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111305567.6A CN114051154A (en) | 2021-11-05 | 2021-11-05 | News video strip splitting method and system |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114051154A true CN114051154A (en) | 2022-02-15 |
Family
ID=80207387
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111305567.6A Pending CN114051154A (en) | 2021-11-05 | 2021-11-05 | News video strip splitting method and system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114051154A (en) |
Citations (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1770262A (en) * | 2004-11-01 | 2006-05-10 | 英业达股份有限公司 | Speech display system and method |
CN101616264A (en) * | 2008-06-27 | 2009-12-30 | 中国科学院自动化研究所 | News video categorization and system |
CN103546667A (en) * | 2013-10-24 | 2014-01-29 | 中国科学院自动化研究所 | Automatic news splitting method for volume broadcast television supervision |
CN107066488A (en) * | 2016-12-27 | 2017-08-18 | 上海东方明珠新媒体股份有限公司 | Video display bridge section automatic division method based on movie and television contents semantic analysis |
CN107181986A (en) * | 2016-03-11 | 2017-09-19 | 百度在线网络技术(北京)有限公司 | The matching process and device of video and captions |
CN110012349A (en) * | 2019-06-04 | 2019-07-12 | 成都索贝数码科技股份有限公司 | End-to-end news program structuring method and structuring framework system |
CN110267061A (en) * | 2019-04-30 | 2019-09-20 | 新华智云科技有限公司 | News video strip splitting method and system |
CN111145728A (en) * | 2019-12-05 | 2020-05-12 | 厦门快商通科技股份有限公司 | Speech recognition model training method, system, mobile terminal and storage medium |
CN111310413A (en) * | 2020-02-20 | 2020-06-19 | 阿基米德(上海)传媒有限公司 | Intelligent broadcasting program audio strip removing method and device based on program series list |
WO2020224362A1 (en) * | 2019-05-07 | 2020-11-12 | 华为技术有限公司 | Video segmentation method and video segmentation device |
CN112101003A (en) * | 2020-09-14 | 2020-12-18 | 深圳前海微众银行股份有限公司 | Sentence text segmentation method, device and equipment and computer readable storage medium |
CN112733660A (en) * | 2020-12-31 | 2021-04-30 | 支付宝(杭州)信息技术有限公司 | Method and device for splitting video strip |
CN112733654A (en) * | 2020-12-31 | 2021-04-30 | 支付宝(杭州)信息技术有限公司 | Method and device for splitting video strip |
CN113178193A (en) * | 2021-03-22 | 2021-07-27 | 浙江工业大学 | Chinese self-defined awakening and Internet of things interaction method based on intelligent voice chip |
Non-Patent Citations (1)
Title |
---|
Chen Zhuo et al.: "Cross-modal video clip retrieval based on visual-text relation alignment", SCIENTIA SINICA Informationis (《中国科学:信息科学》) * |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114694657A (en) * | 2022-04-08 | 2022-07-01 | 网易有道信息技术(北京)有限公司 | Method for cutting audio file and related product |
CN116886992A (en) * | 2023-09-06 | 2023-10-13 | 北京中关村科金技术有限公司 | Video data processing method and device, electronic equipment and storage medium |
CN116886992B (en) * | 2023-09-06 | 2023-12-01 | 北京中关村科金技术有限公司 | Video data processing method and device, electronic equipment and storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112818906B (en) | Intelligent cataloging method of all-media news based on multi-mode information fusion understanding | |
CN109117777B (en) | Method and device for generating information | |
CN112668559B (en) | Multi-mode information fusion short video emotion judgment device and method | |
CN106878632B (en) | Video data processing method and device | |
CN113613065B (en) | Video editing method and device, electronic equipment and storage medium | |
CN114465737B (en) | Data processing method and device, computer equipment and storage medium | |
CN113766314B (en) | Video segmentation method, device, equipment, system and storage medium | |
CN112733660B (en) | Method and device for splitting video strip | |
CN114051154A (en) | News video strip splitting method and system | |
CN111488487B (en) | Advertisement detection method and detection system for all-media data | |
CN112925905B (en) | Method, device, electronic equipment and storage medium for extracting video subtitles | |
CN115834935B (en) | Multimedia information auditing method, advertisement auditing method, device and storage medium | |
CN112784078A (en) | Video automatic editing method based on semantic recognition | |
CN114064968B (en) | Method and system for generating news subtitle abstract | |
CN113992944A (en) | Video cataloging method, device, equipment, system and medium | |
CN115269884A (en) | Method, device and related equipment for generating video corpus | |
CN114694070A (en) | Automatic video editing method, system, terminal and storage medium | |
WO2024139300A1 (en) | Video text processing method and apparatus, and electronic device and storage medium | |
CN117953898A (en) | Voice recognition method for video data, server and storage medium | |
CN116017088A (en) | Video subtitle processing method, device, electronic equipment and storage medium | |
CN110381367B (en) | Video processing method, video processing equipment and computer readable storage medium | |
CN114780757A (en) | Short media label extraction method and device, computer equipment and storage medium | |
CN113194333A (en) | Video clipping method, device, equipment and computer readable storage medium | |
CN111274960A (en) | Video processing method and device, storage medium and processor | |
CN116112763B (en) | Method and system for automatically generating short video content labels |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication ||
SE01 | Entry into force of request for substantive examination ||
RJ01 | Rejection of invention patent application after publication ||
Application publication date: 20220215 |