CN111681676B - Method, system, device and readable storage medium for constructing audio by video object recognition - Google Patents

Method, system, device and readable storage medium for constructing audio by video object recognition

Info

Publication number
CN111681676B
CN111681676B (application CN202010517903.2A)
Authority
CN
China
Prior art keywords
audio
specific
matching score
video
introduction
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010517903.2A
Other languages
Chinese (zh)
Other versions
CN111681676A (en)
Inventor
薛媛
金若熙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Xinghe Shangshi Film Media Co ltd
Original Assignee
Hangzhou Xinghe Shangshi Film Media Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Xinghe Shangshi Film Media Co ltd
Priority to CN202010517903.2A
Publication of CN111681676A
Application granted
Publication of CN111681676B
Legal status: Active
Anticipated expiration: legal-status pending

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/48 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L 25/51 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/08 - Learning methods
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 - Scenes; Scene-specific elements
    • G06V 20/40 - Scenes; Scene-specific elements in video content
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/48 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L 25/51 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L 25/54 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for retrieval
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/48 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L 25/51 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L 25/57 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for processing of video signals
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Signal Processing (AREA)
  • Multimedia (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • General Physics & Mathematics (AREA)
  • Biophysics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)
  • Television Signal Processing For Recording (AREA)

Abstract

The invention discloses a method for constructing audio through video object recognition, comprising the following steps: setting a frame-extraction frequency based on information about the video to be processed, extracting video key frames, and generating a frame-image stream; performing modular multi-object recognition on the frame-image stream with a deep convolutional neural network model to obtain modular specific sound-producing objects; performing at least two rounds of recognition and analysis on the modular specific sound-producing objects with a deep residual network model to obtain the categories of the specific sound-producing objects; and extracting the sound-production characteristics of each specific sound-producing object based on its category and constructing audio appropriate to the object category of that specific sound-producing object. Because the video is first recognized modularly by the deep convolutional neural network model and then analyzed a second or further time by the deep residual network model, more accurate categories of the specific sound-producing objects are obtained and more suitable audio can be constructed.

Description

Method, system, device and readable storage medium for constructing audio by video object recognition
Technical Field
The present invention relates to the field of computer vision video detection technology, and in particular, to a method, a system, a device, and a readable storage medium for video object recognition and audio construction.
Background
In the prior art, neural network models are applied in more and more technical fields, such as security, autonomous driving, and image recognition, and ever higher recognition accuracy is pursued. Existing object recognition methods, however, have several shortcomings: an image in a video cannot be accurately assigned to the category of a specific sound-producing object, and category recognition is not fine-grained enough, which makes the subsequent automatic dubbing process inaccurate and difficult.
Disclosure of Invention
Aiming at the defects in the prior art, the invention provides a method, a system, a device, and a readable storage medium for constructing audio through video object recognition.
To solve the above technical problems, the invention adopts the following technical scheme:
A method for constructing audio through video object recognition, comprising the following steps:
setting a frame-extraction frequency based on information about the video to be processed, extracting video key frames, and generating a frame-image stream;
performing modular multi-object recognition on the frame-image stream with a deep convolutional neural network model to obtain modular specific sound-producing objects;
performing at least two rounds of recognition and analysis on the modular specific sound-producing objects with a deep residual network model to obtain the categories of the specific sound-producing objects;
extracting the sound-production characteristics of each specific sound-producing object based on its category, and constructing audio appropriate to the object category of that specific sound-producing object.
In one implementation, the deep residual network model is obtained as follows:
acquiring a number of images containing specific sound-producing objects, and removing unqualified images to obtain qualified images of the specific sound-producing objects;
preprocessing the qualified images to obtain an image data set of qualified specific sound-producing objects, and dividing the data set into a training set and a validation set;
inputting the training set into an initial deep residual network model for training, and validating the training result on the validation set, to obtain a deep residual network model that can determine the category of a specific sound-producing object.
In one implementation, the audio includes an audio introduction and audio keywords. The audio introduction is descriptive text for the audio; the audio keywords comprise at least three words describing the audio, including the category name of the specific sound-producing object and the category name of the sound it produces.
In one implementation, extracting the sound-production characteristics of a specific sound-producing object based on its category and constructing audio appropriate to the object category comprises the following steps:
performing score matching based on the object category of the specific sound-producing object, the audio introduction, and the audio keywords, to obtain a first matching score and a neural network matching score respectively;
obtaining a video-audio matching score from the first matching score and the neural network matching score, and obtaining at least one suitable audio for the specific sound-producing object according to the video-audio matching score.
In one implementation, performing score matching based on the object category of the specific sound-producing object, the audio introduction, and the audio keywords to obtain a first matching score and a neural network matching score comprises the following steps:
performing word segmentation on the object category of the specific sound-producing object and on the audio introduction to obtain words;
computing the proportions of words of the object category that overlap with the audio introduction and with the audio keywords, obtaining a first proportion and a second proportion, and taking a weighted average of the two to obtain the word matching score, where word matching score = (word-overlap proportion between the object category and the audio introduction) × introduction weight + (word-overlap proportion between the object category and the audio keywords) × keyword weight, and introduction weight + keyword weight = 1;
obtaining the TF-IDF vector of the object category based on the statistics of the audio introductions, and taking the first cosine similarity between the object-category TF-IDF vector and the audio-introduction TF-IDF vector as the TF-IDF matching score, where TF-IDF matching score = cosine_similarity(object-category TF-IDF vector, audio-introduction TF-IDF vector);
taking a weighted average of the word matching score and the TF-IDF matching score to obtain the first matching score, where first matching score = word matching score × word weight + TF-IDF matching score × TF-IDF weight, and word weight + TF-IDF weight = 1;
obtaining the BERT vector of the object category of the specific sound-producing object and the BERT vector of the audio introduction, computing the cosine similarity between these BERT vectors, and taking that cosine similarity as the neural network matching score.
In one implementation, the video-audio matching score is obtained from the first matching score and the neural network matching score as follows:
taking a weighted average of the first matching score and the neural network matching score, where video-audio matching score = first matching score × first weight + neural network matching score × neural network weight, and first weight + neural network weight = 1.
In one implementation, after extracting the sound-production characteristics of the specific sound-producing object based on its category and constructing audio appropriate to the object category, the method further comprises the following steps:
searching and matching the specific sound-producing object with the selected audio according to the video-audio matching score, so that the audio introduction, the audio keywords, and the object category of the specific sound-producing object match one another;
mixing all the audio into a complete audio file, and adding the audio file to the video's audio track so that the audio file and the video are synchronized.
A system for constructing audio through video object recognition comprises a frame-image stream generation module, a first processing module, a second processing module, and an extraction-and-construction module;
the frame-image stream generation module is configured to: set a frame-extraction frequency based on information about the video to be processed, extract video key frames, and generate a frame-image stream;
the first processing module is configured to perform modular multi-object recognition on the frame-image stream with a deep convolutional neural network model to obtain modular specific sound-producing objects;
the second processing module is configured to perform at least two rounds of recognition and analysis on the modular specific sound-producing objects with a deep residual network model to obtain the categories of the specific sound-producing objects;
the extraction-and-construction module is configured to: extract the sound-production characteristics of each specific sound-producing object based on its category, and construct audio appropriate to the object category of that specific sound-producing object.
A computer-readable storage medium storing a computer program which, when executed by a processor, performs the following method steps:
setting a frame-extraction frequency based on information about the video to be processed, extracting video key frames, and generating a frame-image stream;
performing modular multi-object recognition on the frame-image stream with a deep convolutional neural network model to obtain modular specific sound-producing objects;
performing at least two rounds of recognition and analysis on the modular specific sound-producing objects with a deep residual network model to obtain the categories of the specific sound-producing objects;
extracting the sound-production characteristics of each specific sound-producing object based on its category, and constructing audio appropriate to the object category of that specific sound-producing object.
A video object recognition device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, the processor implementing the following method steps:
setting a frame-extraction frequency based on information about the video to be processed, extracting video key frames, and generating a frame-image stream;
performing modular multi-object recognition on the frame-image stream with a deep convolutional neural network model to obtain modular specific sound-producing objects;
performing at least two rounds of recognition and analysis on the modular specific sound-producing objects with a deep residual network model to obtain the categories of the specific sound-producing objects;
extracting the sound-production characteristics of each specific sound-producing object based on its category, and constructing audio appropriate to the object category of that specific sound-producing object.
By adopting the above technical scheme, the invention achieves notable technical effects:
a frame-extraction frequency is set based on information about the video to be processed, video key frames are extracted, and a frame-image stream is generated; modular multi-object recognition is performed on the frame-image stream with a deep convolutional neural network model to obtain modular specific sound-producing objects; at least two rounds of recognition and analysis are performed on the modular specific sound-producing objects with a deep residual network model to obtain the categories of the specific sound-producing objects; the sound-production characteristics of each specific sound-producing object are extracted based on its category, and audio appropriate to the object category is constructed. Because the video is first recognized modularly by the deep convolutional neural network model and then analyzed a second or further time by the deep residual network model, more accurate categories of the specific sound-producing objects are obtained and more suitable audio can be constructed.
Drawings
In order to illustrate the embodiments of the invention or the technical solutions of the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the invention; a person skilled in the art could obtain other drawings from them without inventive effort.
FIG. 1 is a schematic flow diagram of the method of the present invention;
FIG. 2 is a schematic diagram of the system architecture of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the following examples, which are illustrative of the present invention and are not intended to limit the present invention thereto.
A method for constructing audio through video object recognition, as shown in FIG. 1, comprises the following steps:
S100, setting a frame-extraction frequency based on information about the video to be processed, extracting video key frames, and generating a frame-image stream;
S200, performing modular multi-object recognition on the frame-image stream with a deep convolutional neural network model to obtain modular specific sound-producing objects;
S300, performing at least two rounds of recognition and analysis on the modular specific sound-producing objects with a deep residual network model to obtain the categories of the specific sound-producing objects;
S400, extracting the sound-production characteristics of each specific sound-producing object based on its category, and constructing audio appropriate to the object category of that specific sound-producing object.
In this embodiment, the video to be processed is a video clip provided by a user that requires sound effects. Video key frames are extracted from it by down-sampled frame extraction. The frame-extraction frequency is an adjustable parameter with no lower limit; its upper limit is determined by the frame rate of the video itself (usually 25 frames per second). Extracting frames from the video to be processed yields a sequence of static frames ordered in time, i.e., a frame-image stream (Frame Image Stream), which is used in the next step to recognize specific sound-producing objects.
In the implementation, the video key frames are first extracted at a reduced frequency. The rationale is that an object or person worth dubbing in the video needs to persist for a certain continuous duration; dubbing an object that appears and disappears within one or two frames is generally not considered, because it has little value from a dubbing point of view. Concretely, for the video key frames in the frame-image stream: if the object category recognized in a frame does not appear in the frames of the preceding 2 seconds, the object is considered to start producing sound at that second; if it is present throughout the preceding 2 seconds up to the current frame, it is considered to be producing sound continuously, and the minimum sound-production duration is set to 5 seconds. In practice, different continuous-sounding durations and minimum sounding durations can be set for different objects according to their sound-production patterns. Object recognition is then performed on key frames extracted at a reduced frequency: for example, for a 25-frames-per-second video, the key-frame sampling frequency is reduced to 1 frame per second, i.e., one frame is taken out of every 25 as the recognition input sample for that second of video, which simply and effectively reduces the number of reads and improves processing speed. The frame-extraction frequency remains an adjustable parameter with no lower limit, and with an upper limit set by the video's own frame rate (usually 25 frames per second), so that users can choose a frequency suited to the characteristics of their own video material.
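A minimal sketch of this down-sampled frame extraction, assuming OpenCV for decoding; the function name and defaults are illustrative, not taken from the patent:

```python
# Illustrative sketch: key-frame extraction at an adjustable, capped frequency.
import cv2

def extract_frame_stream(video_path: str, frames_per_second: float = 1.0):
    """Yield (timestamp_seconds, frame) pairs at the requested sampling rate."""
    cap = cv2.VideoCapture(video_path)
    native_fps = cap.get(cv2.CAP_PROP_FPS) or 25.0   # upper limit: the video's own rate
    rate = min(frames_per_second, native_fps)        # adjustable parameter, no lower limit
    step = int(round(native_fps / rate))             # e.g. 25 fps -> every 25th frame
    index = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % step == 0:
            yield index / native_fps, frame
        index += 1
    cap.release()
```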
Next, the frame-image stream produced by key-frame extraction undergoes modular multi-object recognition based on embedded deep convolutional neural networks (Deep CNNs). For each static frame in the stream, the network applies highly non-linear operations to the RGB three-channel pixel values of every pixel and generates probability vectors centered on each recognizable specific sound-producing object. The deep convolutional neural network determines the category of a specific sound-producing object from the maximum value of its probability vector, and determines the size of the object's selection box from the numerical distribution of the probability vectors in a rectangular region around the object's center. The generated selection box is used to crop a screenshot of the particular sound-producing object from each frame for the second, more detailed recognition stage. Note that all neural networks involved in this step come from the pre-trained Fast-RCNN network in the object-recognition library of the TensorFlow deep learning framework, in the Python language.
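A hedged sketch of this detection-and-cropping step. The `detector` callable is an assumption: it is taken to return normalized boxes, class labels, and confidence scores, as typical TensorFlow object-detection models do; it is not the patent's exact interface.

```python
# Run an assumed pre-trained detector over one frame and crop each detected
# sound-producing object for second-stage recognition. Names are illustrative.
def detect_and_crop(frame, detector, score_threshold=0.5):
    h, w = frame.shape[:2]
    boxes, classes, scores = detector(frame)          # assumed detector interface
    crops = []
    for (ymin, xmin, ymax, xmax), cls, score in zip(boxes, classes, scores):
        if score < score_threshold:
            continue
        y0, y1 = int(ymin * h), int(ymax * h)
        x0, x1 = int(xmin * w), int(xmax * w)
        crops.append({"coarse_class": cls, "confidence": float(score),
                      "box": (x0, y0, x1, y1), "image": frame[y0:y1, x0:x1]})
    return crops
```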
This embodiment obtains modular specific sound-producing objects, and correspondingly uses a modular design in which the deep convolutional neural network of each recognition stage is an embedded, replaceable component. The deep neural network required at any recognition stage can be swapped at will to suit special usage scenes or special object types; for instance, the recognition network that finely classifies shoes and ground is not based on any pre-trained CNN model. The modular design can also be extended so that several deep convolutional neural networks are embedded at each recognition stage, and an ensemble learning (Ensemble Learning) algorithm is used to improve the overall accuracy of object recognition, localization, and fine-grained classification.
For example, multiple deep neural networks may recognize objects in the same video key frame, and each network's selection box for a given specific sound-producing object may differ slightly in size and position. The ensemble learning algorithm takes a weighted average of these selection boxes using each network's confidence value (between 0 and 1; the closer to 1, the more certain the network is that its box is correct; the confidence value is the model's probability judgment of whether its recognition is correct, i.e., how confident the model is in a single recognition, and higher confidence implies higher accuracy). This fine-tunes a more reliable selection box for object localization and produces higher-quality screenshots for the recognition in subsequent steps, as sketched below.
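A minimal sketch of this confidence-weighted box fusion, assuming boxes in (x0, y0, x1, y1) form; the function and data layout are illustrative:

```python
# Fuse selection boxes proposed by several detectors for the same object,
# weighting each box by its network's confidence value.
def fuse_boxes(proposals):
    """proposals: list of ((x0, y0, x1, y1), confidence) from different networks."""
    total = sum(conf for _, conf in proposals)
    fused = [sum(box[i] * conf for box, conf in proposals) / total for i in range(4)]
    return tuple(fused), total / len(proposals)   # fused box and mean confidence
```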
The modular specific sound-producing objects then undergo multi-stage recognition and analysis through the deep residual network model, which yields the categories of the specific sound-producing objects, whose sound-production characteristics are then extracted. The details are as follows.
Existing deep neural networks cannot recognize the details of every object in a natural image, so a multi-stage object-recognition framework is proposed. In this embodiment, the multi-stage recognition and analysis follows a coarse-to-fine design: for each static frame in the frame-image stream, a first-stage deep neural network performs a preliminary analysis to obtain coarse categories of specific sound-producing objects (such as people, shoes, doors and windows); then, for the detail screenshot at each object's position, a further neural network performs multi-stage recognition of the object's sub-category to obtain the fine category of the specific sound-producing object (for example, whether the shoes are sports shoes, skate shoes, or leather shoes). The multi-stage recognition of this embodiment can be extended to an image-recognition framework with more stages (three or more); in general, because the resolution of the extracted frames used in the experiments is limited, two-stage recognition with a two-stage deep neural network is sufficient.
The two-stage recognition process is described here. The preliminary recognition uses a first-stage deep recognition network derived from the pre-trained Fast-RCNN network. The multi-stage recognition uses a multi-stage deep recognition network; in the two-stage case, the second-stage deep recognition network further refines individual key objects recognized by the first stage. For example, when the first-stage network recognizes "shoes" in a static frame, the second-stage network performs a second round of recognition on the screenshot of the "shoes" region to determine the "shoe type" and the "ground type". More specifically, this embodiment can recognize four different shoe types (sports shoes, leather shoes, high-heeled shoes, and others) and five different ground types (tile, wooden floor, cement, sand, and others). The network architecture of the second-stage recognition network is based on a 50-layer deep residual network (Resnet50). The deep residual network model is obtained as follows:
S310, acquiring a number of images containing specific sound-producing objects, and removing unqualified images to obtain qualified images of the specific sound-producing objects;
S320, preprocessing the qualified images to obtain an image data set of qualified specific sound-producing objects, and dividing the data set into a training set and a validation set;
S330, inputting the training set into an initial deep residual network model for training, and validating the training result on the validation set, to obtain a deep residual network model that can determine the category of a specific sound-producing object.
There is no deep residual network in the prior art pre-trained for recognizing shoes, ground, or other specific sound-producing objects. The deep residual network used in this embodiment is not based on any pre-trained parameters: its network parameters are trained entirely from random initial values. The images required for training all come from screenshots of actual videos and are manually labelled with shoe and ground types. The data set contains more than 17,000 pictures of varying sizes and aspect ratios with resolution no higher than 480p, whose subjects are shoes, ground, or other specific sound-producing objects. When training the deep residual network model, unqualified images, such as very blurry pictures or pictures in which the object is incomplete, must be removed, and the remaining qualified images are divided into a training set and a validation set. These pictures differ from public image-recognition data sets in that most are low-resolution and non-square, reflecting the fact that screenshots from video clips in real usage scenes have irregular shapes and that resolution may be reduced by video compression. The irregularity and low resolution can be understood as noise in the image data set, so a network trained on it has stronger noise robustness and is specifically optimized for shoes and ground. With the deep residual network of this embodiment, the recognition accuracy (computed on the test set) for the five fine-grained ground categories is 73.4%, far higher than random selection (20%) and secondary selection (35.2%); the recognition accuracy for the four shoe types is of the same order. The actual recognition speed reaches 100 pictures per second on a single NVIDIA P100 graphics card.
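A hedged training sketch for this from-scratch setup. PyTorch/torchvision (recent versions), the folder layout, the 80/20 split, and all hyper-parameters are assumptions; the patent only states that a Resnet50 is trained from random initialization on manually labelled screenshots split into training and validation sets.

```python
# Train a Resnet50 from random initialization on labelled crops (illustrative).
import torch
from torch import nn
from torch.utils.data import DataLoader, random_split
from torchvision import datasets, transforms, models

transform = transforms.Compose([
    transforms.Resize((224, 224)),      # linear interpolation to 224x224x3
    transforms.ToTensor(),
])
dataset = datasets.ImageFolder("shoe_crops/", transform=transform)  # assumed layout
train_len = int(0.8 * len(dataset))
train_set, val_set = random_split(dataset, [train_len, len(dataset) - train_len])

model = models.resnet50(weights=None, num_classes=4)  # random init, 4 shoe types
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

for epoch in range(30):                 # validation loop omitted for brevity
    model.train()
    for images, labels in DataLoader(train_set, batch_size=32, shuffle=True):
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
```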
In addition, the single-layer perceptron at the end of the standard Resnet50 is deepened into a two-layer (multi-layer) perceptron combined with a random-inactivation design (dropout = 0.5), so that it can match the number of recognition categories required for each type of specific object; to a certain extent this also avoids the overfitting caused by an excessive number of network parameters (where recognition on the training set is far better than on the test set).
The deep residual network (Resnet50) adopted in this embodiment is trained on the basis of the existing deep residual network architecture so that it can recognize the categories of specific sound-producing objects required here; the per-picture computation and recognition flow is kept, with corresponding modifications for the specific application scene. The deep residual network (Resnet50) reads square RGB images of at least 224×224 pixels; for rectangular input images whose sides are not 224 pixels, this embodiment first uses conventional linear interpolation to deform them into a regular 224×224×3 floating-point matrix (RGB three color channels). After the matrix is fed into the network, it passes through a series of convolution blocks and is transformed into feature maps of increasing abstraction and decreasing size. Convolution blocks are the basic units of conventional convolutional neural network (CNN) design; the blocks used in Resnet50 consist of three to four two-dimensional convolution layers (2D convolutions) combined with dropout, batch normalization layers, and linear rectification layers (rectified linear units, ReLU), and each block also has a parallel residual path (containing only a single simple two-dimensional convolution layer, or a plain copy of the input matrix). The feature map output by the previous block is processed separately through the residual path and the convolution-block path, producing two new matrices of identical dimensions, which are simply added to form the input matrix of the next block. The number in the name of the deep residual network (Resnet50) refers to the 50 two-dimensional convolution layers contained in all the convolution blocks. After all convolution blocks, the deep residual network outputs a 2048-dimensional vector, which a perceptron layer maps to a 1000-dimensional vector; on top of this, a further perceptron layer with an adjustable output dimension is added so that the output matches the number of object sub-categories to be recognized, namely an output dimension of 4 for shoe recognition and 5 for ground recognition. Each element of the final output vector represents the probability that the image belongs to a certain category, and the final category label is determined by the maximum probability. Common deep residual networks similar to Resnet50 include Resnet34 and Resnet101; other common image-recognition networks such as Alexnet, VGGnet, and InceptionNet could also be applied in this embodiment, but they do not work as well, so the deep residual network (Resnet50) is chosen.
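A sketch of the modified classification head described above, assuming torchvision's Resnet50. The 2048-to-1000 and 1000-to-num_classes dimensions follow the text; the placement of dropout within the head is an assumption.

```python
# Resnet50 with a two-layer perceptron head, dropout = 0.5, adjustable output size.
import torch.nn as nn
from torchvision import models

def build_fine_classifier(num_classes: int) -> nn.Module:
    net = models.resnet50(weights=None)        # random initialization, no pre-training
    net.fc = nn.Sequential(
        nn.Linear(2048, 1000),                 # first perceptron layer (2048 -> 1000)
        nn.ReLU(),
        nn.Dropout(p=0.5),                     # random inactivation
        nn.Linear(1000, num_classes),          # adjustable output: 4 (shoes) or 5 (ground)
    )
    return net
```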
In addition, the second-stage recognition architecture in this embodiment, i.e., the deep residual network (Resnet50), also supports a feedback learning mode: when the recognition accuracy of the second-stage network does not meet the scene requirements, screenshots can be taken from the frame-image stream using the object selection boxes recognized by the first-stage network, manually labelled as a new data set, and used to fine-tune the second-stage network (Resnet50). In this way, when the content of the videos to be processed changes significantly, high recognition accuracy can be regained quickly from the already-trained model and a small amount of new data, shortening the preparation period for adapting to a new application scene. The first-stage recognition network can likewise be retrained step by step as video types or application scenes change, to adapt to the characteristics of new video data.
Furthermore, the specific sound-producing object information recognized at each stage of the two-stage recognition network is merged and stored in a common format. The information stored for each object is: the object major class (recognized by the first-stage network) and its confidence value, the object minor class (recognized by the second-stage network) and its confidence value, and the width, height, and center of the object's selection box (measured in frame-image pixels). All of this information is saved in json file format for the next processing step; an illustrative record is shown below.
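An illustrative sketch of such a per-object record serialized as json. The field names and example values are assumptions; the patent only specifies which quantities are stored.

```python
# One per-object record in the merged json format (illustrative field names/values).
import json

record = {
    "major_class": "shoes",            # first-stage network result
    "major_confidence": 0.93,
    "minor_class": "sports shoes",     # second-stage (Resnet50) result
    "minor_confidence": 0.81,
    "box": {"width": 96, "height": 64, "center_x": 412, "center_y": 310},  # frame pixels
}
print(json.dumps(record, ensure_ascii=False))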
After the object category of a specific sound-producing object has been recognized, in order to associate that object category with audio for the object, this embodiment uses natural language as the intermediate representation for matching the object categories of the video to be processed with audio. Using natural language as the matching representation makes the representation easy for people to understand and annotate, which helps in organizing and maintaining the audio library.
For the video to be processed, the object category recognized from the video is represented in natural language (e.g., "cat"). For audio, two kinds of natural-language annotation are used: an audio introduction and audio keywords; that is, each audio clip has an audio introduction and audio keywords. The audio introduction describes the audio content in a sentence or phrase (e.g., "a person walking in snow"), while the audio keywords describe it with three key words (e.g., "shoes/snow/footsteps"). Unlike the audio introduction, the audio keywords must include the sound-producing object and the sound category; the audio keywords are introduced to bridge the mismatch between object-recognition categories and audio introductions. In short, each audio clip is parsed into an audio introduction and audio keywords, where the audio introduction is the descriptive text of the audio and the audio keywords comprise at least three words describing the audio, including the category name of the specific sound-producing object and the category name of the sound it produces.
For a specific sound-producing object, the class name from object recognition is used directly as its natural-language representation. Because a computer cannot understand natural language directly, this representation is further mapped to a vector representation. Specifically, this embodiment introduces two natural-language vector representations: TF-IDF (term frequency-inverse document frequency) and BERT (Bidirectional Encoder Representations from Transformers).
In a specific embodiment, TF-IDF vectors are computed from the audio-introduction texts; within a text, the TF-IDF value of each word indicates how much that word contributes to the semantics of the whole text. The procedure is as follows. First, all audio introductions are segmented into Chinese words with the "jieba" word segmenter. Then the term frequency TF of each word within each audio introduction and the document frequency DF of each word over the set of all audio introductions are computed. For a given audio introduction, the TF-IDF of any word can then be calculated as TF-IDF = TF × log(1/DF + 1); note that this is a normalized TF-IDF formula chosen for numerical stability. Finally, the TF-IDF vector of any text is computed: the vocabulary of the text corpus is put in a fixed order, the TF-IDF value of each vocabulary word in the text is computed in that order, and words not contained in the text receive the value 0. The result is a vector whose length equals the corpus vocabulary size, i.e., the TF-IDF vector representation of the text.
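A minimal sketch of this TF-IDF construction, assuming the jieba segmenter and the normalized formula TF-IDF = TF × log(1/DF + 1) given above; function names are illustrative.

```python
# Build document frequencies over all audio introductions, then TF-IDF vectors.
import math
import jieba

def build_vocab_df(audio_introductions):
    """Document frequency (fraction of introductions containing each word)."""
    counts = {}
    for text in audio_introductions:
        for word in set(jieba.lcut(text)):
            counts[word] = counts.get(word, 0) + 1
    n_docs = len(audio_introductions)
    return {w: c / n_docs for w, c in counts.items()}

def tfidf_vector(text, df):
    words = jieba.lcut(text)
    tf = {w: words.count(w) / len(words) for w in set(words)}   # term frequency
    # one fixed position per vocabulary word; 0 if the word is absent from the text
    return [tf.get(w, 0.0) * math.log(1.0 / df[w] + 1.0) for w in sorted(df)]
```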
Further, the BERT vector is computed. In this embodiment, BERT is a Transformer neural network whose parameters are trained by large-scale unsupervised learning; the resulting model can be applied directly to downstream natural-language-understanding problems and maps natural-language sentences and phrases directly to vectors. This embodiment combines both methods (together with simple word matching) so that the results are more accurate.
In one embodiment, the BERT vector of a sentence is computed with the pre-trained Chinese BERT model in the pytorch_pretrained_bert package of the Pytorch ecosystem. To keep matching efficient, the smallest Chinese BERT model, "bert-base-chinese", is used. Specifically, the sentence is split into individual characters, the tokens [CLS] and [SEP] are added at the beginning and end respectively, these tokens are used as the input indexed_tokens, an all-zero list of the same length is used as the input segment_ids, both inputs are fed into the pre-trained BERT model, and the output vector at the first token ("[CLS]") of the last layer of the neural network is taken as the BERT vector of the sentence.
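A hedged sketch of this sentence-to-vector step, assuming the pytorch_pretrained_bert package and its "bert-base-chinese" checkpoint; this is not the authors' exact code.

```python
# Map a sentence to the last-layer [CLS] vector of a pre-trained Chinese BERT.
import torch
from pytorch_pretrained_bert import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
model = BertModel.from_pretrained("bert-base-chinese")
model.eval()

def bert_vector(sentence: str) -> torch.Tensor:
    tokens = ["[CLS]"] + tokenizer.tokenize(sentence) + ["[SEP]"]
    indexed_tokens = tokenizer.convert_tokens_to_ids(tokens)
    segment_ids = [0] * len(indexed_tokens)          # single-sentence input
    with torch.no_grad():
        layers, _ = model(torch.tensor([indexed_tokens]),
                          torch.tensor([segment_ids]))
    return layers[-1][0, 0]                          # last layer, "[CLS]" position
```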
In one embodiment, more specifically, step S400, extracting the sound-production characteristics of a specific sound-producing object based on its category and constructing audio appropriate to the object category, comprises the following steps:
S410, performing score matching based on the object category of the specific sound-producing object, the audio introduction, and the audio keywords, to obtain a first matching score and a neural network matching score respectively;
S420, obtaining a video-audio matching score from the first matching score and the neural network matching score, and obtaining at least one suitable audio for the specific sound-producing object according to the video-audio matching score.
The audio-construction process matches the object category recognized in the video against the audio introductions and audio keywords, and selects suitable audio by the computed matching score. The matching score is computed in two ways: a traditional method and a neural network method. The traditional method has the advantage that the score can be computed precisely when the natural-language expressions of the audio and the video share the same words; the neural network has the advantage that, even when the two expressions share no words at all, it can still match them by meaning. This embodiment uses both methods and combines their scores, so that the two methods complement each other.
In one embodiment, in step S410, performing score matching based on the object category of the specific sound-producing object, the audio introduction, and the audio keywords to obtain a first matching score and a neural network matching score comprises the following steps:
S411, performing word segmentation on the object category of the specific sound-producing object and on the audio introduction to obtain words;
S412, computing the proportions of words of the object category that overlap with the audio introduction and with the audio keywords, obtaining a first proportion and a second proportion, and taking a weighted average of the two to obtain the word matching score, where word matching score = (word-overlap proportion between the object category and the audio introduction) × introduction weight + (word-overlap proportion between the object category and the audio keywords) × keyword weight, and introduction weight + keyword weight = 1;
S413, obtaining the TF-IDF vector of the object category based on the statistics of the audio introductions, and taking the first cosine similarity between the object-category TF-IDF vector and the audio-introduction TF-IDF vector as the TF-IDF matching score, where TF-IDF matching score = cosine_similarity(object-category TF-IDF vector, audio-introduction TF-IDF vector);
S414, taking a weighted average of the word matching score and the TF-IDF matching score to obtain the first matching score, where first matching score = word matching score × word weight + TF-IDF matching score × TF-IDF weight, and word weight + TF-IDF weight = 1;
S415, obtaining the BERT vector of the object category of the specific sound-producing object and the BERT vector of the audio introduction, computing the cosine similarity between the BERT vectors, and taking that cosine similarity as the neural network matching score.
The modules of the video object sound-effect search-and-match system can be implemented wholly or partly in software, in hardware, or in a combination of the two. The modules may be embedded in, or independent of, the processor of the computer device or mobile terminal in hardware form, or stored in software in the memory of the computer device or mobile terminal, so that the processor can invoke and execute the operations corresponding to the modules.
Steps S411 to S414 belong to the traditional method. First, the object category of the specific sound-producing object and the audio introduction are segmented with the jieba word segmenter. Then the proportions of words of the object category that overlap with the audio introduction and with the audio keywords are computed, and the two proportions are weighted and averaged to obtain the word matching score. The TF-IDF vector representation of the object category is obtained from the statistics of the audio-introduction texts, and the cosine similarity between the object-category TF-IDF vector and the audio-introduction TF-IDF vector is computed as the TF-IDF matching score. Finally, the word matching score and the TF-IDF matching score are weighted and averaged to give the traditional-method matching score, i.e., the first matching score above. Of course, in other embodiments the first matching score may be obtained by other technical means, which are not described here.
In step S415, the BERT vector of the object category of the specific sound-producing object and the BERT vector of the audio introduction are obtained, their cosine similarity is computed, and this cosine similarity is used as the neural network matching score, implementing the neural-network matching method. A sketch of the score computation follows.
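A minimal sketch of the first ("traditional") matching score of steps S411-S414, assuming jieba segmentation and a TF-IDF helper like the one sketched earlier; the equal default weights are illustrative and only need to sum to 1.

```python
# Word-overlap score combined with a TF-IDF cosine score (steps S411-S414).
import math
import jieba

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def word_overlap(category_words, other_words):
    return len(set(category_words) & set(other_words)) / max(len(set(category_words)), 1)

def first_matching_score(category, intro, keywords, tfidf_vec,
                         intro_w=0.5, keyword_w=0.5, word_w=0.5, tfidf_w=0.5):
    cat_words = jieba.lcut(category)
    word_score = (word_overlap(cat_words, jieba.lcut(intro)) * intro_w +
                  word_overlap(cat_words, keywords) * keyword_w)
    tfidf_score = cosine(tfidf_vec(category), tfidf_vec(intro))
    return word_score * word_w + tfidf_score * tfidf_w
```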
In one embodiment, obtaining the video-audio matching score from the first matching score and the neural network matching score in step S420 is done as follows: a weighted average of the first matching score and the neural network matching score is taken, where video-audio matching score = first matching score × first weight + neural network matching score × neural network weight, and first weight + neural network weight = 1.
In practice, the weights of the weighted average can be adjusted as needed: if the name of the object category of the specific sound-producing object is expected to appear literally in the audio introduction or keywords, the weight of the traditional matching score can be increased to improve precision; if the category name is not expected to appear in the audio introduction or keywords but the meanings should agree, the weight of the neural network matching score can be increased to improve generalization.
Specifically, the 10 best-matching audio clips may be selected as dubbing recommendations for each recognized object according to the final matching score, although other numbers are possible; see the sketch below.
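A sketch of the final combination and top-10 selection; the 0.5/0.5 split between the traditional and neural scores is an adjustable assumption, and the audio-library interface is illustrative.

```python
# Combine the two scores and rank the audio library for one object category.
def video_audio_score(first_score, bert_score, first_w=0.5):
    return first_score * first_w + bert_score * (1.0 - first_w)

def recommend_audio(object_category, audio_library, score_fn, top_k=10):
    """audio_library: iterable of audio entries; score_fn(category, audio) -> score."""
    ranked = sorted(audio_library,
                    key=lambda audio: score_fn(object_category, audio),
                    reverse=True)
    return ranked[:top_k]
```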
In one embodiment, after extracting the sound-production characteristics of the specific sound-producing object based on its category and constructing audio appropriate to the object category, the method further comprises the following steps:
S500, searching and matching the specific sound-producing object with the selected audio according to the video-audio matching score, so that the audio introduction, the audio keywords, and the object category of the specific sound-producing object match one another;
S600, mixing all the audio into a complete audio file, and adding the audio file to the video's audio track so that the audio file and the video are synchronized.
Specifically, the specific sound-producing object is searched and matched with the selected audio according to the video-audio matching score, so that the audio introduction, the audio keywords, and the object category match one another; that is, where the prior art relies on manually simulated (Foley) sound, here each specific sound-producing object in the video is matched with audio through the video-audio matching score, so that individual dubbing is performed for each sound-producing object. The method can also perform overall dubbing, i.e., mixing the generated audio: once the audio files required for dubbing and the start and end times at which each is played have been determined, all required audio files are read and converted into a unified frequency-domain signal format to facilitate subsequent editing.
In this embodiment, audio files in any common format, including wav and mp3, can be read, which broadens the usable scenes and the ability to generalize to other sound-effect libraries.
The specific process of mixing all the audio is as follows. Each audio segment is intelligently stretched or compressed to the length required for dubbing. First, the silent portions at the beginning and end of the audio are cut off, so that the dubbing and the picture that triggers it occur simultaneously and the dubbing effect is optimal. Then the system checks whether the audio, with leading and trailing silence removed, is longer than the required playing time. If it is, the audio is cut to the required dubbing length and a fade-out is applied at the end to remove the abruptness of a sudden stop. If it is not, the audio is looped until it reaches the playing time required for dubbing; at each loop point, the end of the preceding segment and the start of the following segment overlap for a certain duration with a cross-fade, so that the loop point is joined seamlessly, the long audio sounds natural and complete, and the user has the best listening experience. The fade duration equals the overlap duration and is determined by a piecewise function of the audio length: if the original audio is shorter than 20 seconds, the overlap and fade time is set to 10% of the audio length, which keeps the overlap moderate, helps the adjacent segments transition smoothly, and leaves more of the short clip's non-overlapping portion to be played to the user; if the original audio is longer than 20 seconds, the overlap and fade time is set to 2 seconds, which avoids an unnecessarily long transition in long audio and lets as much non-overlapping audio as possible be played. A sketch of these rules follows.
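A hedged sketch of these mixing rules using the pydub library as an assumed stand-in for the authors' audio tooling; the 2-second fade-out on trimmed clips is an assumption, everything else follows the piecewise rule above.

```python
# Fit one audio segment to the dubbing duration: strip silence, then cut-and-fade
# or loop-with-crossfade depending on length (durations in milliseconds).
from pydub import AudioSegment
from pydub.silence import detect_leading_silence

def fit_to_duration(seg: AudioSegment, target_ms: int) -> AudioSegment:
    seg = seg[detect_leading_silence(seg):]            # strip leading silence
    rev = seg.reverse()
    seg = rev[detect_leading_silence(rev):].reverse()  # strip trailing silence
    if len(seg) >= target_ms:
        return seg[:target_ms].fade_out(min(2000, target_ms))  # cut, fade out the end
    # overlap/fade duration: 10% of the clip if shorter than 20 s, otherwise 2 s
    overlap = int(len(seg) * 0.1) if len(seg) < 20_000 else 2_000
    out = seg
    while len(out) < target_ms:
        out = out.append(seg, crossfade=overlap)       # seamless looped playback
    return out[:target_ms]
```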
And finally, combining all the audios processed according to the steps, adding the audios into the audio tracks of the video, outputting a new video file with dubbing, and completing the whole dubbing process.
Example 2:
A system for constructing audio through video object recognition, as shown in FIG. 2, comprises a frame-image stream generation module 100, a first processing module 200, a second processing module 300, and an extraction-and-construction module 400;
the frame-image stream generation module 100 is configured to: set a frame-extraction frequency based on information about the video to be processed, extract video key frames, and generate a frame-image stream;
the first processing module 200 is configured to perform modular multi-object recognition on the frame-image stream with a deep convolutional neural network model to obtain modular specific sound-producing objects;
the second processing module 300 is configured to perform at least two rounds of recognition and analysis on the modular specific sound-producing objects with a deep residual network model to obtain the categories of the specific sound-producing objects;
the extraction-and-construction module 400 is configured to: extract the sound-production characteristics of each specific sound-producing object based on its category, and construct audio appropriate to the object category of that specific sound-producing object.
In one embodiment, the deep residual network model in the second processing module 300 is obtained as follows:
acquiring a number of images containing specific sound-producing objects, and removing unqualified images to obtain qualified images of the specific sound-producing objects;
preprocessing the qualified images to obtain an image data set of qualified specific sound-producing objects, and dividing the data set into a training set and a validation set;
inputting the training set into an initial deep residual network model for training, and validating the training result on the validation set, to obtain a deep residual network model that can determine the category of a specific sound-producing object.
In one embodiment, in the extraction-and-construction module 400, the audio includes an audio introduction and audio keywords; the audio introduction is the descriptive text of the audio, and the audio keywords comprise at least three words describing the audio, including the category name of the specific sound-producing object and the category name of the sound it produces.
In one embodiment, the extraction-and-construction module 400 is configured to extract the sound-production characteristics of the specific sound-producing object based on its category and to construct audio appropriate to the object category, specifically by:
performing score matching based on the object category of the specific sound-producing object, the audio introduction, and the audio keywords, to obtain a first matching score and a neural network matching score respectively;
obtaining a video-audio matching score from the first matching score and the neural network matching score, and obtaining at least one suitable audio for the specific sound-producing object according to the video-audio matching score.
In one embodiment, the extraction and construction module 400 is configured to:
word segmentation processing is carried out on the object category of the specific sounding object and on the audio introduction to obtain words;
respectively obtaining the word overlap proportions between the object category of the specific sounding object and the audio introduction and between the object category and the audio keywords, namely a first proportion and a second proportion, and carrying out weighted average processing on the first proportion and the second proportion to obtain a word matching score, wherein word matching score = word overlap proportion of the object category and the audio introduction × audio introduction weight + word overlap proportion of the object category and the audio keywords × audio keyword weight, and audio introduction weight + audio keyword weight = 1;
based on the statistical data of the audio introductions, obtaining an object category TF-IDF vector, computing the first cosine similarity between the object category TF-IDF vector and the audio introduction TF-IDF vector, and taking the first cosine similarity as the TF-IDF matching score, wherein TF-IDF matching score = cosine_similarity(object category TF-IDF vector, audio introduction TF-IDF vector);
performing weighted average processing on the word matching score and the TF-IDF matching score to obtain the first matching score, wherein first matching score = word matching score × word weight + TF-IDF matching score × TF-IDF weight, and word weight + TF-IDF weight = 1;
and obtaining the BERT vector of the object category of the specific sounding object and the BERT vector of the audio introduction, computing the cosine similarity of the two BERT vectors, and taking this cosine similarity as the neural network matching score.
In one embodiment, the extraction and construction module 400 is configured to obtain the video-audio matching score based on the first matching score and the neural network matching score, specifically as follows:
carrying out weighted average processing on the first matching score and the neural network matching score to obtain the video-audio matching score, wherein video-audio matching score = first matching score × first weight + neural network matching score × neural network weight, and first weight + neural network weight = 1.
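The scoring pipeline of the extraction and construction module (word overlap, TF-IDF cosine similarity, their weighted average into the first matching score, BERT cosine similarity as the neural network matching score, and the final weighted combination) can be sketched as follows. The sketch assumes scikit-learn for the TF-IDF step and pre-computed BERT sentence vectors from any encoder; the default weight values are illustrative, as the embodiment only requires each weight pair to sum to 1.

```python
# Hedged sketch of the matching-score computation; weights and helper names are illustrative.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def overlap_ratio(category_words, other_words):
    """Proportion of the object-category words that also appear in the other word list."""
    if not category_words:
        return 0.0
    other = set(other_words)
    return sum(w in other for w in category_words) / len(category_words)

def word_match_score(category_words, intro_words, keyword_words, w_intro=0.6, w_kw=0.4):
    # audio introduction weight + audio keyword weight = 1
    return (overlap_ratio(category_words, intro_words) * w_intro +
            overlap_ratio(category_words, keyword_words) * w_kw)

def tfidf_match_score(category_text, intro_text, all_introductions):
    vec = TfidfVectorizer().fit(all_introductions)   # statistics of the audio introductions
    a = vec.transform([category_text])
    b = vec.transform([intro_text])
    return float(cosine_similarity(a, b)[0, 0])

def first_match_score(word_score, tfidf_score, w_word=0.5, w_tfidf=0.5):
    return word_score * w_word + tfidf_score * w_tfidf   # word weight + TF-IDF weight = 1

def neural_match_score(category_vec: np.ndarray, intro_vec: np.ndarray):
    """Cosine similarity of the two BERT sentence vectors."""
    return float(np.dot(category_vec, intro_vec) /
                 (np.linalg.norm(category_vec) * np.linalg.norm(intro_vec)))

def video_audio_match_score(first_score, neural_score, w_first=0.5, w_nn=0.5):
    return first_score * w_first + neural_score * w_nn   # first weight + neural network weight = 1
```

For a given specific sounding object, these scores would be computed against every candidate audio in the library and the candidates ranked by video_audio_match_score.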
In one embodiment, the system further comprises a search matching module 500 and a mixing processing module 600;
the search matching module 500 is configured to search and match a specific sounding object with a selected audio according to a video-audio matching score, so that an audio introduction, an audio keyword and an object category of the specific sounding object are matched with each other;
The audio mixing processing module 600 is configured to mix all the audio to form a complete audio file, and add the audio file to an audio track of the video so that the audio file and the video are synchronized.
A simple and easy-to-use functional interface is provided in the audio mixing processing module, so that the dubbed audio and video can be generated with one click, which greatly improves the user's working efficiency. Although the mixing processing module 600 uses very common audio tools, the specific mixing steps and parameters of the method are designed specifically for films, dramas and short videos: for example, the silence removal and the compression or extension of sound-effect audio described in the method embodiment specifically solve the dubbing problem of this category of videos, namely that the audio lengths in the sound-effect library frequently fail to meet the dubbing requirements of the video, and the specific audio processing parameters are also the most suitable for this embodiment, an effect that other techniques or audio processing parameters cannot achieve.
For system embodiments, the description is relatively simple as it is substantially similar to method embodiments, and reference is made to the description of method embodiments for relevant points.
Example 3:
A computer readable storage medium storing a computer program which, when executed by a processor, performs the method steps of:
setting frame extraction frequency based on related information of a video to be processed, extracting video key frames and generating a frame image stream;
performing modularized multi-object recognition on the frame image stream by adopting a deep convolutional neural network model to obtain modularized specific sounding objects;
performing at least two rounds of recognition and analysis on the modularized specific sounding object through a deep residual network model to obtain the type of the specific sounding object;
extracting sound production characteristics of the specific sound production object based on the kind of the specific sound production object and constructing the object kind of the specific sound production object and proper audio of the specific sound production object.
In one embodiment, when the processor executes the computer program, the recognition processing of the video to be processed to obtain the type of the specific sounding object in the video and to extract its sounding characteristics is implemented as follows:
reducing the frame extraction frequency of the related information of the video to be processed, and extracting the video key frames;
generating a frame image stream from the extracted video key frames;
performing modularized multi-object recognition on the frame image stream by adopting a deep convolutional neural network model to obtain modularized specific sounding objects;
and carrying out multistage recognition and analysis on the modularized specific sounding object through the deep residual network model to obtain the type of the specific sounding object in the video to be processed and to extract its sounding characteristics.
In one embodiment, when the processor executes the computer program, the audio introduction is implemented as the introduction text of the audio, and the audio keywords include at least three words that describe the audio, among them the category name of the specific sounding object and the category name of the sound it produces.
In one embodiment, when the processor executes the computer program, the score matching process is performed based on the object category, the audio introduction and the audio keyword of the specific sound object to obtain a first matching score and a neural network matching score, which specifically are:
word segmentation processing is carried out on the object category of the specific sounding object and on the audio introduction to obtain words;
respectively obtaining the word overlap proportions between the object category of the specific sounding object and the audio introduction and between the object category and the audio keywords, namely a first proportion and a second proportion, and carrying out weighted average processing on the first proportion and the second proportion to obtain a word matching score, wherein word matching score = word overlap proportion of the object category and the audio introduction × audio introduction weight + word overlap proportion of the object category and the audio keywords × audio keyword weight, and audio introduction weight + audio keyword weight = 1;
based on the statistical data of the audio introductions, obtaining an object category TF-IDF vector, computing the first cosine similarity between the object category TF-IDF vector and the audio introduction TF-IDF vector, and taking the first cosine similarity as the TF-IDF matching score, wherein TF-IDF matching score = cosine_similarity(object category TF-IDF vector, audio introduction TF-IDF vector);
performing weighted average processing on the word matching score and the TF-IDF matching score to obtain the first matching score, wherein first matching score = word matching score × word weight + TF-IDF matching score × TF-IDF weight, and word weight + TF-IDF weight = 1;
and obtaining the BERT vector of the object category of the specific sounding object and the BERT vector of the audio introduction, computing the cosine similarity of the two BERT vectors, and taking this cosine similarity as the neural network matching score.
In one embodiment, when the processor executes the computer program, the video and audio matching score obtained based on the first matching score and the neural network matching score is specifically:
and carrying out weighted average processing on the first matching score and the neural network matching score to obtain the video-audio matching score, wherein video-audio matching score = first matching score × first weight + neural network matching score × neural network weight, and first weight + neural network weight = 1.
In one embodiment, when the processor executes the computer program, the obtaining of at least one proper audio of the specific sounding object according to the video-audio matching score further includes:
searching and matching the specific sounding object with the selected audio according to the video-audio matching score, so that the audio introduction, the audio keywords and the object types of the specific sounding object are matched with each other;
and mixing all the audios to form a complete audio file, and adding the audio file into an audio track of the video to synchronize the audio file and the video.
Example 4:
In one embodiment, a video object recognition audio construction device is provided, which may be a server or a mobile terminal. The device includes a processor, a memory, a network interface and a database connected by a system bus. The processor of the device provides computing and control capabilities. The memory of the device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program and a database. The internal memory provides an environment for running the operating system and the computer program stored in the non-volatile storage medium. The database is used to store all data of the device. The network interface of the device is used for communicating with an external terminal through a network connection. The computer program, when executed by the processor, implements the method for constructing audio by video object recognition.
In this specification, the embodiments are described in a progressive manner, each embodiment focuses on its differences from the other embodiments, and for the identical or similar parts among the embodiments reference may be made to one another.
It will be apparent to those skilled in the art that embodiments of the present invention may be provided as a method, apparatus, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, terminal devices (systems), and computer program products according to the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing terminal device to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing terminal device, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
It should be noted that:
reference in the specification to "one embodiment" or "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the invention. Thus, the appearances of the phrase "one embodiment" or "an embodiment" in various places throughout this specification are not necessarily all referring to the same embodiment.
While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. It is therefore intended that the following claims be interpreted as including the preferred embodiments and all such alterations and modifications as fall within the scope of the invention.
In addition, the specific embodiments described in the present specification may differ in terms of parts, shapes of components, names, and the like. All equivalent or simple changes of the structure, characteristics and principle according to the inventive concept are included in the protection scope of the present invention. Those skilled in the art may make various modifications or additions to the described embodiments or substitutions in a similar manner without departing from the scope of the invention as defined in the accompanying claims.

Claims (7)

1. A method for constructing audio by video object recognition, comprising the steps of:
setting frame extraction frequency based on related information of a video to be processed, extracting video key frames and generating a frame image stream;
performing modularized multi-object recognition on the frame image stream by adopting a deep convolutional neural network model to obtain modularized specific sounding objects;
performing at least two rounds of recognition and analysis on the modularized specific sounding object through a deep residual network model to obtain the type of the specific sounding object;
extracting sound production characteristics of the specific sound production object based on the type of the specific sound production object and constructing an object type of the specific sound production object and proper audio of the specific sound production object;
the audio comprises an audio introduction and an audio keyword, wherein the audio introduction is an introduction text of the audio, the audio keyword comprises at least three words for describing the audio, and the words for describing the audio comprise category names of specific sound-producing objects and category names of sound-producing sounds;
the extracting sound production characteristics of the specific sounding object based on the type of the specific sounding object and constructing the object type of the specific sounding object and the proper audio of the specific sounding object specifically comprises the following steps:
performing score matching processing on the basis of the object type, the audio introduction and the audio keyword of the specific sounding object to respectively obtain a first matching score and a neural network matching score;
obtaining video-audio matching scores based on the first matching scores and the neural network matching scores, and obtaining at least one proper audio of the specific sounding object according to the video-audio matching scores;
the performing score matching processing based on the object category, the audio introduction and the audio keywords of the specific sounding object to respectively obtain a first matching score and a neural network matching score specifically comprises the following steps:
word segmentation processing is carried out on the object category of the specific sounding object and on the audio introduction to obtain words;
respectively obtaining the word overlap proportions between the object category of the specific sounding object and the audio introduction and between the object category and the audio keywords, namely a first proportion and a second proportion, and carrying out weighted average processing on the first proportion and the second proportion to obtain a word matching score, wherein word matching score = word overlap proportion of the object category and the audio introduction × audio introduction weight + word overlap proportion of the object category and the audio keywords × audio keyword weight, and audio introduction weight + audio keyword weight = 1;
based on the statistical data of the audio introductions, obtaining an object category TF-IDF vector, computing the first cosine similarity between the object category TF-IDF vector and the audio introduction TF-IDF vector, and taking the first cosine similarity as the TF-IDF matching score, wherein TF-IDF matching score = cosine_similarity(object category TF-IDF vector, audio introduction TF-IDF vector);
performing weighted average processing on the word matching score and the TF-IDF matching score to obtain the first matching score, wherein first matching score = word matching score × word weight + TF-IDF matching score × TF-IDF weight, and word weight + TF-IDF weight = 1;
and obtaining the BERT vector of the object category of the specific sounding object and the BERT vector of the audio introduction, computing the cosine similarity of the two BERT vectors, and taking this cosine similarity as the neural network matching score.
2. The method for constructing audio by video object recognition according to claim 1, wherein the deep residual network model is obtained as follows:
acquiring a plurality of images containing specific sounding objects, and removing the unqualified images of the specific sounding objects to obtain the qualified images of the specific sounding objects;
preprocessing an image of a qualified specific sounding object to obtain an image data set of the qualified specific sounding object, and dividing the image data set into a training set and a verification set;
and inputting the training set into the initial deep residual network model for training, and verifying the training result with the verification set, so as to obtain a deep residual network model capable of identifying the type of the specific sounding object.
3. The method for constructing audio according to claim 1, wherein the video-audio matching score is obtained based on the first matching score and the neural network matching score, specifically:
and carrying out weighted average processing on the first matching score and the neural network matching score to obtain the video-audio matching score, wherein video-audio matching score = first matching score × first weight + neural network matching score × neural network weight, and first weight + neural network weight = 1.
4. The method for constructing audio by video object recognition according to claim 1, wherein the method extracts sound characteristics of the specific sound object based on the kind of the specific sound object and constructs an object class of the specific sound object and an appropriate audio of the specific sound object, further comprising the steps of:
searching and matching the specific sounding object with the selected audio according to the video-audio matching score, so that the audio introduction, the audio keywords and the object types of the specific sounding object are matched with each other;
and mixing all the audios to form a complete audio file, and adding the audio file into an audio track of the video to synchronize the audio file and the video.
5. A video object recognition audio construction system, characterized by comprising a frame image stream generation module, a first processing module, a second processing module and an extraction and construction module;
the frame image stream generation module is configured to: setting frame extraction frequency based on related information of a video to be processed, extracting video key frames and generating a frame image stream;
the first processing module is used for carrying out modularized multi-object identification on the frame image stream by adopting a deep convolutional neural network model to obtain modularized specific sounding objects;
the second processing module is used for carrying out at least two rounds of recognition and analysis on the modularized specific sounding object through the deep residual network model to obtain the type of the specific sounding object;
the extraction and construction module is configured to: extracting sound production characteristics of the specific sounding object based on the type of the specific sounding object and constructing the object type of the specific sounding object and the proper audio of the specific sounding object;
the audio comprises an audio introduction and an audio keyword, wherein the audio introduction is an introduction text of the audio, the audio keyword comprises at least three words for describing the audio, and the words for describing the audio comprise category names of specific sound-producing objects and category names of sound-producing sounds;
the extracting sound production characteristics of the specific sounding object based on the type of the specific sounding object and constructing the object type of the specific sounding object and the proper audio of the specific sounding object specifically comprises the following steps:
performing score matching processing on the basis of the object type, the audio introduction and the audio keyword of the specific sounding object to respectively obtain a first matching score and a neural network matching score;
obtaining video-audio matching scores based on the first matching scores and the neural network matching scores, and obtaining at least one proper audio of the specific sounding object according to the video-audio matching scores;
the performing score matching processing based on the object category, the audio introduction and the audio keywords of the specific sounding object to respectively obtain a first matching score and a neural network matching score specifically comprises the following steps:
word segmentation processing is carried out on the object category of the specific sounding object and on the audio introduction to obtain words;
respectively obtaining the word overlap proportions between the object category of the specific sounding object and the audio introduction and between the object category and the audio keywords, namely a first proportion and a second proportion, and carrying out weighted average processing on the first proportion and the second proportion to obtain a word matching score, wherein word matching score = word overlap proportion of the object category and the audio introduction × audio introduction weight + word overlap proportion of the object category and the audio keywords × audio keyword weight, and audio introduction weight + audio keyword weight = 1;
based on the statistical data of the audio introductions, obtaining an object category TF-IDF vector, computing the first cosine similarity between the object category TF-IDF vector and the audio introduction TF-IDF vector, and taking the first cosine similarity as the TF-IDF matching score, wherein TF-IDF matching score = cosine_similarity(object category TF-IDF vector, audio introduction TF-IDF vector);
performing weighted average processing on the word matching score and the TF-IDF matching score to obtain the first matching score, wherein first matching score = word matching score × word weight + TF-IDF matching score × TF-IDF weight, and word weight + TF-IDF weight = 1;
and obtaining the BERT vector of the object category of the specific sounding object and the BERT vector of the audio introduction, computing the cosine similarity of the two BERT vectors, and taking this cosine similarity as the neural network matching score.
6. A computer readable storage medium storing a computer program, which when executed by a processor implements the method of any one of claims 1 to 4.
7. A video object recognition building audio apparatus comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the method according to any one of claims 1 to 4 when executing the computer program.
CN202010517903.2A 2020-06-09 2020-06-09 Method, system, device and readable storage medium for constructing audio frequency by video object identification Active CN111681676B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010517903.2A CN111681676B (en) 2020-06-09 2020-06-09 Method, system, device and readable storage medium for constructing audio frequency by video object identification

Publications (2)

Publication Number Publication Date
CN111681676A CN111681676A (en) 2020-09-18
CN111681676B true CN111681676B (en) 2023-08-08

Family

ID=72435680

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010517903.2A Active CN111681676B (en) 2020-06-09 2020-06-09 Method, system, device and readable storage medium for constructing audio frequency by video object identification

Country Status (1)

Country Link
CN (1) CN111681676B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112188215B (en) * 2020-09-23 2022-02-22 腾讯科技(深圳)有限公司 Video decoding method, device, equipment and storage medium

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2002095508A1 (en) * 2001-05-24 2002-11-28 Test Advantage Inc Methods and apparatus for data smoothing
CN105279495A (en) * 2015-10-23 2016-01-27 天津大学 Video description method based on deep learning and text summarization
CN108197572A (en) * 2018-01-02 2018-06-22 京东方科技集团股份有限公司 A kind of lip reading recognition methods and mobile terminal
CN109819313A (en) * 2019-01-10 2019-05-28 腾讯科技(深圳)有限公司 Method for processing video frequency, device and storage medium
CN109919031A (en) * 2019-01-31 2019-06-21 厦门大学 A kind of Human bodys' response method based on deep neural network
CN110046599A (en) * 2019-04-23 2019-07-23 东北大学 Intelligent control method based on depth integration neural network pedestrian weight identification technology
CN111199238A (en) * 2018-11-16 2020-05-26 顺丰科技有限公司 Behavior identification method and equipment based on double-current convolutional neural network

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6829018B2 (en) * 2001-09-17 2004-12-07 Koninklijke Philips Electronics N.V. Three-dimensional sound creation assisted by visual information
WO2008073366A2 (en) * 2006-12-08 2008-06-19 Sobayli, Llc Target object recognition in images and video
US11409791B2 (en) * 2016-06-10 2022-08-09 Disney Enterprises, Inc. Joint heterogeneous language-vision embeddings for video tagging and search
US10679182B2 (en) * 2017-05-17 2020-06-09 International Business Machines Corporation System for meeting facilitation

Also Published As

Publication number Publication date
CN111681676A (en) 2020-09-18

Similar Documents

Publication Publication Date Title
CN110795543B (en) Unstructured data extraction method, device and storage medium based on deep learning
CN111968649B (en) Subtitle correction method, subtitle display method, device, equipment and medium
CN111681678B (en) Method, system, device and storage medium for automatically generating sound effects and matching videos
CN106980624B (en) Text data processing method and device
CN111046133A (en) Question-answering method, question-answering equipment, storage medium and device based on atlas knowledge base
CN114465737B (en) Data processing method and device, computer equipment and storage medium
CN112784696B (en) Lip language identification method, device, equipment and storage medium based on image identification
CN112528637B (en) Text processing model training method, device, computer equipment and storage medium
US20180039859A1 (en) Joint acoustic and visual processing
CN109949799B (en) Semantic parsing method and system
CN113642536B (en) Data processing method, computer device and readable storage medium
CN114143479B (en) Video abstract generation method, device, equipment and storage medium
CN114357206A (en) Education video color subtitle generation method and system based on semantic analysis
CN111681676B (en) Method, system, device and readable storage medium for constructing audio frequency by video object identification
CN111681680B (en) Method, system, device and readable storage medium for acquiring audio frequency by video recognition object
CN115169472A (en) Music matching method and device for multimedia data and computer equipment
CN111681679B (en) Video object sound effect searching and matching method, system, device and readable storage medium
CN107610720B (en) Pronunciation deviation detection method and device, storage medium and equipment
CN114241279A (en) Image-text combined error correction method and device, storage medium and computer equipment
CN111681677B (en) Video object sound effect construction method, system, device and readable storage medium
CN111523430A (en) Customizable interactive video production method and device based on UCL
Chen et al. An Alignment Method Leveraging Articulatory Features for Mispronunciation Detection and Diagnosis in L2 English.
CN113470617B (en) Speech recognition method, electronic equipment and storage device
CN115965810A (en) Short video rumor detection method based on multi-modal consistency
CN114238587A (en) Reading understanding method and device, storage medium and computer equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant