CN116563669A - Model training method, video classification method, device and equipment


Info

Publication number
CN116563669A
CN116563669A
Authority
CN
China
Prior art keywords
training
text
video
feature extraction
image
Prior art date
Legal status
Pending
Application number
CN202310539771.7A
Other languages
Chinese (zh)
Inventor
崔东林
李滨伯
Current Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202310539771.7A
Publication of CN116563669A
Legal status: Pending

Classifications

    • G06V 10/774 - Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06N 3/088 - Non-supervised learning, e.g. competitive learning
    • G06V 10/761 - Proximity, similarity or dissimilarity measures
    • G06V 10/764 - Image or video recognition or understanding using pattern recognition or machine learning, using classification, e.g. of video objects
    • G06V 10/82 - Image or video recognition or understanding using pattern recognition or machine learning, using neural networks
    • G06V 20/40 - Scenes; Scene-specific elements in video content

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The disclosure provides a model training method, a video classification method, a device, and equipment, and relates to the technical field of artificial intelligence, in particular to the technical fields of deep learning, video processing and the like. The specific implementation scheme is as follows: performing unsupervised training on the feature extraction model based on the original dataset; wherein the feature extraction model comprises an image encoder and a text encoder; performing supervised training on the feature extraction model obtained by training based on the annotation data set; inputting the original dataset into a feature extraction model obtained by supervised training, and screening a target dataset from the original dataset according to the similarity between the image features output by the image encoder and the text features output by the text encoder; and performing unsupervised training on the feature extraction model obtained through supervised training based on the target data set again to obtain a trained feature extraction model.

Description

Model training method, video classification method, device and equipment
Technical Field
The present disclosure relates to the field of artificial intelligence, and in particular, to the technical fields of deep learning, video processing, and the like.
Background
With the development of artificial intelligence technology, a deep learning model is widely applied, for example, the deep learning model can be applied to various video processing scenes. When training the deep learning model, a large amount of sample data is often needed, and the larger the data amount of the sample data is, the better the learning effect of the deep learning model is. However, if sample noise in the sample data is large, the learning effect of the deep learning model is seriously affected.
Disclosure of Invention
The present disclosure provides a training method and apparatus for a feature extraction model, a training method and apparatus for a video classification model, a video classification method and apparatus, an electronic device, a storage medium, and a computer program product.
According to an aspect of the present disclosure, there is provided a training method of a feature extraction model, including: performing unsupervised training on the feature extraction model based on the original dataset; the original data set comprises a plurality of first sample video frames and text descriptions corresponding to the first sample video frames, the feature extraction model comprises an image encoder and a text encoder, the image encoder is used for encoding the first sample video frames to obtain corresponding image features, and the text encoder is used for encoding the text descriptions to obtain corresponding text features; performing supervised training on the feature extraction model obtained by training based on the annotation data set; the annotation data set comprises a second sample video frame, image features annotated for the second sample video frame, text descriptions corresponding to the second sample video frame and text features annotated for the text descriptions; inputting the original dataset into a feature extraction model obtained by supervised training, and screening a target dataset from the original dataset according to the similarity between the image features output by the image encoder and the text features output by the text encoder; and performing unsupervised training on the feature extraction model obtained through supervised training based on the target data set again to obtain a trained feature extraction model.
According to another aspect of the present disclosure, there is provided a training method of a video classification model, including: obtaining a training sample; wherein the training sample comprises a sample video and a classification label of the sample video; inputting at least one video frame in the sample video into a video classification model for classification prediction to obtain a classification result of the sample video; the video classification model comprises an image encoder and a full-connection layer which are sequentially connected, wherein the image encoder is an image encoder in a feature extraction model which is obtained by training by using the training method of the feature extraction model; calculating loss according to the classification labels and the classification results, and adjusting parameters of a full-connection layer in the video classification model according to the loss until convergence conditions are met; and outputting the trained video classification model.
According to another aspect of the present disclosure, there is provided a video classification method, including: acquiring videos to be classified; inputting at least one video frame in the video to be classified into a video classification model for classification prediction to obtain a classification result of the video to be classified; the video classification model is obtained through training according to a training method of the video classification model.
According to another aspect of the present disclosure, there is provided a training apparatus of a feature extraction model, including: the first training module is used for performing unsupervised training on the feature extraction model based on the original data set; the original data set comprises a plurality of first sample video frames and text descriptions corresponding to the first sample video frames, the feature extraction model comprises an image encoder and a text encoder, the image encoder is used for encoding the first sample video frames to obtain corresponding image features, and the text encoder is used for encoding the text descriptions to obtain corresponding text features; the second training module is used for performing supervised training on the feature extraction model obtained by training based on the labeling data set; the annotation data set comprises a second sample video frame, image features annotated for the second sample video frame, text descriptions corresponding to the second sample video frame and text features annotated for the text descriptions; the data screening module is used for inputting the original data set into a feature extraction model obtained by supervised training, and screening a target data set from the original data set according to the similarity between the image features output by the image encoder and the text features output by the text encoder; and the third training module is used for performing unsupervised training on the feature extraction model obtained by the supervised training based on the target data set to obtain a trained feature extraction model.
According to another aspect of the present disclosure, there is provided a training apparatus for a video classification model, including: the first acquisition module is used for acquiring training samples; wherein the training sample comprises a sample video and a classification label of the sample video; the first prediction module is used for inputting at least one video frame in the sample video into a video classification model to perform classification prediction so as to obtain a classification result of the sample video; the video classification model comprises an image encoder and a full-connection layer which are sequentially connected, wherein the image encoder is an image encoder in a feature extraction model obtained by using the training device of the feature extraction model; the parameter processing module is used for calculating loss according to the classification labels and the classification results, and adjusting parameters of a full-connection layer in the video classification model according to the loss until convergence conditions are met; and the model output module is used for outputting the trained video classification model.
According to another aspect of the present disclosure, there is provided a video classification apparatus, including: the second acquisition module is used for acquiring videos to be classified; the second prediction module is used for inputting at least one video frame in the video to be classified into a video classification model to perform classification prediction so as to obtain a classification result of the video to be classified; the video classification model is obtained through training according to the training device of the video classification model.
According to another aspect of the present disclosure, there is provided an electronic device including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of training the feature extraction model, the method of training the video classification model, or the method of video classification.
According to another aspect of the present disclosure, there is provided a non-transitory computer-readable storage medium storing computer instructions for causing the computer to perform the above-described training method of the feature extraction model, the above-described training method of the video classification model, or the above-described video classification method.
According to another aspect of the present disclosure, there is provided a computer program product comprising a computer program which, when executed by a processor, implements the above-described training method of a feature extraction model, the above-described training method of a video classification model, or the above-described video classification method.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following specification.
Drawings
The drawings are for a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 is a flow chart of a training method of a feature extraction model according to an embodiment of the disclosure;
FIG. 2 is a schematic diagram of a training method of a feature extraction model according to an embodiment of the disclosure;
FIG. 3 is a schematic diagram of a training method of a feature extraction model according to another embodiment of the disclosure;
FIG. 4 is a schematic diagram of a training apparatus of a feature extraction model according to an embodiment of the disclosure;
FIG. 5 is a flow chart of a method of training a video classification model according to an embodiment of the disclosure;
FIG. 6 is a schematic diagram of a video classification model provided in accordance with an embodiment of the disclosure;
FIG. 7 is a schematic diagram of a training apparatus for a video classification model according to an embodiment of the disclosure;
FIG. 8 is a flow chart of a video classification method according to an embodiment of the disclosure;
FIG. 9 is a schematic diagram of a video classification device according to an embodiment of the disclosure;
fig. 10 is a block diagram of an electronic device for implementing the methods of embodiments of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present disclosure to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
It should be noted that the terms "first," "second," and the like in the description and claims of the present disclosure and in the foregoing figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the disclosure described herein may be capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
In accordance with embodiments of the present disclosure, there is provided an embodiment of a training method for a feature extraction model, it being noted that the steps illustrated in the flowcharts of the figures may be performed in a computer system such as a set of computer executable instructions, and although a logical order is illustrated in the flowcharts, in some cases the steps illustrated or described may be performed in an order other than that illustrated herein.
Fig. 1 is a flowchart of a training method of a feature extraction model according to an embodiment of the present disclosure. As shown in fig. 1, the training method includes the following steps S101 to S104:
and step S101, performing unsupervised training on the feature extraction model based on the original data set. The original data set comprises a plurality of first sample video frames and text descriptions corresponding to the first sample video frames, the feature extraction model comprises an image encoder and a text encoder, the image encoder is used for encoding the first sample video frames to obtain corresponding image features, and the text encoder is used for encoding the text descriptions to obtain corresponding text features.
In an implementation, the text description corresponding to the first sample video frame may be a description text of the content of the video where the first sample video frame is located, for example, may be a title of the video.
In a specific implementation, frames are extracted from the sample videos, and a plurality of first sample video frames are selected from the resulting video frame sequence. Assume that the original data set includes N first sample video frames and N corresponding text descriptions, where N is an integer greater than 1. In the process of performing unsupervised training on the feature extraction model, each first sample video frame is input into the image encoder of the feature extraction model to obtain its image features, and the text description corresponding to that first sample video frame is input into the text encoder of the feature extraction model to obtain its text features. The feature extraction model is then trained with the objective of maximizing the similarity between the N paired image features and text features while minimizing the similarity between the N²-N unpaired combinations of image features and text features.
In practical applications, the image features may also be referred to as image vectors and the text features as text vectors. In the example shown in fig. 2, the original data set comprises 3 first sample video frames and 3 corresponding text descriptions. The 3 first sample video frames are input into the image encoder to obtain image vectors F1, F2 and F3, respectively, and the 3 text descriptions are input into the text encoder to obtain text vectors A1, A2 and A3, respectively, where the image vector F1 is paired with the text vector A1, F2 with A2, and F3 with A3. A cross-entropy loss is calculated to maximize the similarity between the paired vectors F1 and A1, F2 and A2, and F3 and A3, while minimizing the similarity between the unpaired vectors (F1 and A2, F1 and A3, F2 and A1, F2 and A3, F3 and A1, F3 and A2), and the feature extraction model comprising the image encoder and the text encoder is trained accordingly.
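As an illustration of this contrastive objective, the following PyTorch-style sketch shows one way the unsupervised training step could be implemented; the encoder modules, the temperature value, and all variable names are assumptions made for illustration rather than an implementation prescribed by this disclosure.

```python
import torch
import torch.nn.functional as F

def contrastive_step(image_encoder, text_encoder, frames, texts, temperature=0.07):
    """One unsupervised training step on a batch of N paired (frame, text description) samples.

    Maximizes the similarity of the N paired image/text features and minimizes the
    similarity of the N^2 - N unpaired combinations via a symmetric cross-entropy loss.
    `texts` is assumed to already be tokenized into the form the text encoder expects.
    """
    img_feat = F.normalize(image_encoder(frames), dim=-1)   # (N, D) image vectors
    txt_feat = F.normalize(text_encoder(texts), dim=-1)     # (N, D) text vectors

    logits = img_feat @ txt_feat.t() / temperature          # (N, N) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)

    # Diagonal entries correspond to paired samples; off-diagonal entries are unpaired.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2
```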
And step S102, performing supervised training on the feature extraction model obtained by training based on the labeling data set. The annotation data set comprises a second sample video frame, image features annotated for the second sample video frame, text descriptions corresponding to the second sample video frame and text features annotated for the text descriptions.
It should be noted that the data volume of the original data set is generally much larger than the data volume of the labeled data set. The data volume of the original data set in this embodiment may reach tens of millions.
In step S102, the annotation data set is input into the feature extraction model obtained by training in step S101. Specifically, the second sample video frame is input into the image encoder of the feature extraction model to obtain predicted image features, and the text description corresponding to the second sample video frame is input into the text encoder of the feature extraction model to obtain predicted text features. A loss is then calculated from the predicted image features and the annotated image features, and from the predicted text features and the annotated text features, and the parameters of the feature extraction model are adjusted according to the calculated loss until a convergence condition is satisfied.
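A minimal sketch of this supervised stage is given below; the disclosure does not specify the form of the loss between predicted and annotated features, so the mean-squared-error terms here are an assumption, as are all names.

```python
import torch.nn.functional as F

def supervised_step(image_encoder, text_encoder, frames, texts,
                    annotated_img_feat, annotated_txt_feat):
    """One supervised training step on the annotation data set.

    Predicted image/text features are compared against the annotated features;
    the MSE loss below is an assumed choice, not mandated by the disclosure.
    """
    pred_img = image_encoder(frames)   # predicted image features for second sample video frames
    pred_txt = text_encoder(texts)     # predicted text features for the corresponding descriptions
    return F.mse_loss(pred_img, annotated_img_feat) + F.mse_loss(pred_txt, annotated_txt_feat)
```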
Step S103, inputting the original dataset into a feature extraction model obtained by supervised training, and screening a target dataset from the original dataset according to the similarity between the image features output by the image encoder and the text features output by the text encoder.
The similarity may be cosine similarity, Euclidean distance, Manhattan distance, or the like.
The original data set is input into the feature extraction model obtained by training in step S102. The higher the similarity between the output image features and text features, the smaller the noise of the corresponding sample data; the lower the similarity, the larger the noise of the corresponding sample data. Screening the original data set by this similarity can therefore effectively alleviate sample noise in the large-scale original data set and yield a target data set with smaller sample noise.
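The screening signal itself can be computed as below, assuming cosine similarity and the hypothetical encoder interfaces used in the earlier sketches.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def pairwise_similarity(image_encoder, text_encoder, frames, texts):
    """Cosine similarity between each first sample video frame and its own text description."""
    img_feat = F.normalize(image_encoder(frames), dim=-1)
    txt_feat = F.normalize(text_encoder(texts), dim=-1)
    return (img_feat * txt_feat).sum(dim=-1)   # shape (N,): one score per frame/description pair
```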
And step S104, performing unsupervised training on the feature extraction model obtained by the supervised training based on the target data set to obtain a trained feature extraction model.
The feature extraction model obtained by training in step S102 is trained again based on the target data set, which has smaller sample noise, so that the learning effect of the feature extraction model can be effectively improved.
The conditions for stopping training in steps S101, S102, and S104 may be set according to the actual situation, for example, the decrease in loss between iterations becomes insignificant, or the loss of an iteration reaches a preset value.
As shown in fig. 3, the training method of the feature extraction model provided in the embodiment of the present disclosure first optimizes the feature extraction model by using the original data set, where this optimization process is unsupervised training. It then optimizes the feature extraction model by using the annotation data set, where this optimization process is supervised training. Next, the original data set is screened by the feature extraction model obtained through the supervised training to obtain the target data set. Finally, the feature extraction model is optimized by using the target data set, and this optimization process is again unsupervised training.
It should be noted that, in order to further improve the learning effect of the feature extraction model, the target data set may be screened further. That is, the feature extraction model obtained by training in step S104 may be optimized again by using the annotation data set, the target data set may then be screened by using the optimized feature extraction model to obtain a data set with smaller sample noise, and this may be repeated until the sample noise of the data set meets the requirement. Finally, the optimized feature extraction model is trained with the data set that meets the requirement, so as to obtain the trained feature extraction model.
In an optional embodiment, step S103 specifically includes: if the similarity between the image features output by the image encoder and the text features output by the text encoder is greater than a preset value, determining the first sample video frame corresponding to the image features and the text description corresponding to the text features in the original data set as the target data set. The preset value may be set according to the practical situation, for example, to 60%.
In this embodiment, the target data set is screened from the original data set according to whether the similarity is greater than the preset value, so the sample noise of the screened target data set is determined by the preset value: the greater the preset value, the smaller the sample noise of the target data set; the smaller the preset value, the larger the sample noise of the target data set.
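A sketch of this threshold-based screening, assuming the similarities computed above and a data set stored as a list of (frame, text description) pairs, might look as follows.

```python
def screen_by_threshold(dataset, similarities, preset_value=0.6):
    """Keep samples whose image-text similarity is greater than the preset value.

    `dataset` is assumed to be a list of (frame, text) pairs aligned with `similarities`;
    the default of 0.6 mirrors the 60% example above.
    """
    return [sample for sample, sim in zip(dataset, similarities) if sim > preset_value]
```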
In an alternative embodiment, the step S103 specifically includes the following steps S103a to S103c:
step S103a, for the similarity between the image feature output by the image encoder and the text feature output by the text encoder, the order is from high to low.
Step S103b, selecting the image features and text features corresponding to a preset number of top-ranked similarities. The preset number may be set according to the actual situation.
Step S103c, determining the first sample video frame corresponding to the image feature and the text description corresponding to the text feature in the original dataset as a target dataset.
In this embodiment, the target data set is screened from the original data set according to the preset number of top-ranked similarities, so the sample noise of the screened target data set is determined by the preset number: the larger the preset number, the larger the sample noise of the target data set; the smaller the preset number, the smaller the sample noise of the target data set.
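The top-ranked variant can be sketched similarly; the tensor-based selection below is one possible implementation under the same assumptions.

```python
import torch

def screen_by_top_k(dataset, similarities, preset_number):
    """Keep the preset number of samples with the highest image-text similarity."""
    top_idx = torch.topk(torch.as_tensor(similarities), k=preset_number).indices
    return [dataset[i] for i in top_idx.tolist()]
```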
There is further provided, in accordance with an embodiment of the present disclosure, an embodiment of a training apparatus for a feature extraction model, where fig. 4 is a schematic diagram of a training apparatus for a feature extraction model according to an embodiment of the present disclosure, and the training apparatus includes a first training module 401, a second training module 402, a data screening module 403, and a third training module 404. The first training module 401 is configured to perform unsupervised training on the feature extraction model based on the original data set; the original data set comprises a plurality of first sample video frames and text descriptions corresponding to the first sample video frames, the feature extraction model comprises an image encoder and a text encoder, the image encoder is used for encoding the first sample video frames to obtain corresponding image features, and the text encoder is used for encoding the text descriptions to obtain corresponding text features. The second training module 402 is configured to perform supervised training on the feature extraction model obtained by training based on the labeling data set; the annotation data set comprises a second sample video frame, image features annotated for the second sample video frame, text descriptions corresponding to the second sample video frame and text features annotated for the text descriptions. The data filtering module 403 is configured to input the original dataset into a feature extraction model obtained by supervised training, and filter a target dataset from the original dataset according to a similarity between an image feature output by the image encoder and a text feature output by the text encoder. The third training module 404 is configured to perform unsupervised training on the feature extraction model obtained by the supervised training based on the target data set again, so as to obtain a trained feature extraction model.
It should be noted that the first training module 401, the second training module 402, the data filtering module 403, and the third training module 404 correspond to steps S101 to S104 in the above embodiment, and the four modules are the same as examples and application scenarios implemented by the corresponding steps, but are not limited to the disclosure of the above embodiment.
In an optional embodiment, the data filtering module is specifically configured to determine, as the target data set, a first sample video frame in the original data set corresponding to the image feature and a text description corresponding to the text feature when a similarity between the image feature output by the image encoder and the text feature output by the text encoder is greater than a preset value.
In an alternative embodiment, the data filtering module is specifically configured to sort the similarities between the image features output by the image encoder and the text features output by the text encoder in descending order, select the image features and text features corresponding to a preset number of top-ranked similarities, and determine the first sample video frame corresponding to the image features and the text description corresponding to the text features in the original data set as the target data set.
In accordance with embodiments of the present disclosure, there is also provided an embodiment of a method of training a video classification model, it being noted that the steps illustrated in the flowchart of the figures may be performed in a computer system, such as a set of computer executable instructions, and, although a logical order is illustrated in the flowchart, in some cases, the steps illustrated or described may be performed in an order other than that illustrated herein.
Fig. 5 is a flowchart of a training method of a video classification model according to an embodiment of the present disclosure. As shown in fig. 5, the training method includes the following steps S501 to S504:
step S501, obtaining a training sample. Wherein, the training sample comprises a sample video and a classification label of the sample video.
Step S502, inputting at least one video frame in the sample video into a video classification model for classification prediction to obtain a classification result of the sample video. The video classification model comprises an image encoder and a full-connection layer which are sequentially connected, wherein the image encoder is an image encoder in a feature extraction model which is obtained by training by using the training method of the feature extraction model.
In a specific implementation, the feature extraction model is first trained by using the above training method of the feature extraction model, and the video classification model is then constructed by taking the image encoder from the feature extraction model and connecting a full connection layer after the image encoder.
And step S503, calculating loss according to the classification labels and the classification results, and adjusting parameters of a full connection layer in the video classification model according to the loss until convergence conditions are met.
Step S504, outputting the trained video classification model.
In the embodiment of the disclosure, only the parameters of the full connection layer are adjusted in the training process of the video classification model, and the parameters of the image encoder are not adjusted. Because the image encoder has learned the image features of a large number of video frames in the training process of the feature extraction model, parameters of the image encoder are directly utilized in the training process of the video classification model, so that the image features of the video frames can be well understood, and the training cost of the video classification model can be effectively reduced.
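A sketch of such a video classification model is shown below; the feature dimension, the number of classes, and the way the encoder is frozen are assumptions made for illustration, not values fixed by the disclosure.

```python
import torch
import torch.nn as nn

class VideoClassifier(nn.Module):
    """Image encoder from the trained feature extraction model followed by a full connection layer.

    Only the full connection layer is trained; the image encoder's parameters are frozen.
    The feature dimension and class count are illustrative assumptions.
    """

    def __init__(self, image_encoder, feature_dim=512, num_classes=10):
        super().__init__()
        self.image_encoder = image_encoder
        for p in self.image_encoder.parameters():
            p.requires_grad = False
        self.fc = nn.Linear(feature_dim, num_classes)

    def forward(self, frames):              # frames: (batch, C, H, W) video frames
        with torch.no_grad():
            feat = self.image_encoder(frames)
        return self.fc(feat)                # classification logits

# Only the full connection layer's parameters are handed to the optimizer, e.g.:
# optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
```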
In order to improve the learning effect of the video classification model, in an alternative embodiment, the video classification model further includes a convolution layer, as shown in fig. 6, disposed between the image encoder and the fully-connected layer. In this embodiment, the step S503 specifically includes: and adjusting parameters of a convolution layer and a full connection layer in the video classification model according to the loss.
In a specific implementation, the number of the convolution layers may be one or more. In the training process of the video classification model, only the parameters of the convolution layer and the full connection layer are adjusted, and the parameters of the image encoder are not adjusted.
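The disclosure does not detail how the convolution layer sits between the image encoder and the full connection layer; one plausible reading, assumed in the sketch below, is a 1D convolution applied over the per-frame features of several sampled frames before classification.

```python
import torch
import torch.nn as nn

class VideoClassifierWithConv(nn.Module):
    """Variant with a convolution layer between the frozen image encoder and the full connection layer.

    Treating the convolution as a temporal 1D convolution over per-frame features is an
    assumption; only the convolution layer and the full connection layer are trained.
    """

    def __init__(self, image_encoder, feature_dim=512, num_classes=10):
        super().__init__()
        self.image_encoder = image_encoder
        for p in self.image_encoder.parameters():
            p.requires_grad = False
        self.conv = nn.Conv1d(feature_dim, feature_dim, kernel_size=3, padding=1)
        self.fc = nn.Linear(feature_dim, num_classes)

    def forward(self, frames):                                # frames: (batch, T, C, H, W)
        b, t = frames.shape[:2]
        with torch.no_grad():
            feat = self.image_encoder(frames.flatten(0, 1))   # (b*T, D) per-frame features
        feat = feat.view(b, t, -1).transpose(1, 2)            # (b, D, T)
        feat = self.conv(feat).mean(dim=-1)                   # temporal convolution + pooling -> (b, D)
        return self.fc(feat)
```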
There is further provided, according to an embodiment of the present disclosure, an embodiment of a training apparatus for a video classification model, where fig. 7 is a schematic diagram of a training apparatus for a video classification model according to an embodiment of the present disclosure, and the training apparatus includes a first acquisition module 701, a first prediction module 702, a parameter processing module 703, and a model output module 704. The first obtaining module 701 is configured to obtain a training sample; wherein, the training sample comprises a sample video and a classification label of the sample video. The first prediction module 702 is configured to input at least one video frame in the sample video into a video classification model to perform classification prediction, so as to obtain a classification result of the sample video; the video classification model comprises an image encoder and a full-connection layer which are sequentially connected, wherein the image encoder is an image encoder in a feature extraction model obtained by using the training device of the feature extraction model. The parameter processing module 703 is configured to calculate a loss according to the classification label and the classification result, and adjust parameters of a full-connection layer in the video classification model according to the loss until a convergence condition is satisfied. Model output module 704 is configured to output a trained video classification model.
It should be noted that, the first obtaining module 701, the first predicting module 702, the parameter processing module 703, and the model output module 704 correspond to steps S501 to S504 in the above embodiments, and the four modules are the same as examples and application scenarios implemented by the corresponding steps, but are not limited to those disclosed in the above embodiments.
In an alternative embodiment, the video classification model further comprises a convolution layer disposed between the image encoder and the fully-connected layer. In this embodiment, the parameter processing module is specifically configured to adjust parameters of a convolution layer and a full connection layer in the video classification model according to the loss.
In accordance with embodiments of the present disclosure, there is also provided an embodiment of a video classification method, it being noted that the steps shown in the flowchart of the figures may be performed in a computer system, such as a set of computer executable instructions, and, although a logical order is shown in the flowchart, in some cases, the steps shown or described may be performed in an order other than that shown or described herein.
Fig. 8 is a flowchart of a video classification method according to an embodiment of the present disclosure. As shown in fig. 8, the video classification method includes the following steps S801 to S802:
Step S801, obtaining videos to be classified.
Step S802, inputting at least one video frame in the video to be classified into a video classification model for classification prediction to obtain a classification result of the video to be classified. The video classification model is obtained through training according to the training method of the video classification model.
In a specific implementation, a plurality of video frames may be extracted from the video to be classified, and the extracted video frames are input into the video classification model for classification prediction, so as to obtain the classification result of the video to be classified. The video classification model may classify the videos to be classified in different ways. For example, it may classify the content of the videos to be classified, in which case the classification result may include food, video, travel, and the like; or it may classify the quality of the videos to be classified, in which case the classification result may include high quality, medium quality, low quality, and the like.
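For the inference side, a minimal sketch is given below; uniformly sampling a fixed number of frames and averaging the per-frame logits is an assumed strategy, not one fixed by the disclosure, and it uses the single-frame classifier sketched earlier.

```python
import torch

@torch.no_grad()
def classify_video(model, video_frames, num_frames=8):
    """Predict a class for a video to be classified.

    `video_frames` is a (T, C, H, W) tensor of decoded frames; a fixed number of frames
    is sampled uniformly and the per-frame logits are averaged before taking the argmax.
    """
    idx = torch.linspace(0, video_frames.size(0) - 1, steps=num_frames).long()
    logits = model(video_frames[idx])        # (num_frames, num_classes)
    return logits.mean(dim=0).argmax().item()
```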
In the embodiment of the disclosure, the training effect of the training method of the video classification model is improved, so that the learning effect of the video classification model obtained by training is also improved, namely, the classification result of video classification by using the video classification model is more accurate.
There is further provided, in accordance with an embodiment of the present disclosure, an embodiment of a video classification apparatus, wherein fig. 9 is a schematic diagram of the video classification apparatus according to an embodiment of the present disclosure, the video classification apparatus including a second acquisition module 901 and a second prediction module 902. The second obtaining module 901 is configured to obtain a video to be classified. The second prediction module 902 is configured to input at least one video frame in the video to be classified into a video classification model to perform classification prediction, so as to obtain a classification result of the video to be classified; the video classification model is obtained through training according to the training device of the video classification model.
It should be noted that, the second obtaining module 901 and the second predicting module 902 correspond to step S801 to step S802 in the above embodiment, and the two modules are the same as examples and application scenarios implemented by the corresponding steps, but are not limited to those disclosed in the above embodiment.
In the technical scheme of the present disclosure, the collection, storage, use, processing, transmission, provision, disclosure, and other handling of users' personal information all comply with the provisions of relevant laws and regulations and do not violate public order and good customs.
According to an embodiment of the present disclosure, there is also provided an electronic device, a readable storage medium and a computer program product.
Fig. 10 shows a schematic block diagram of an example electronic device 1000 that may be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 10, the apparatus 1000 includes a computing unit 1001 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 1002 or a computer program loaded from a storage unit 1008 into a Random Access Memory (RAM) 1003. In the RAM 1003, various programs and data required for the operation of the device 1000 can also be stored. The computing unit 1001, the ROM 1002, and the RAM 1003 are connected to each other by a bus 1004. An input/output (I/O) interface 1005 is also connected to bus 1004.
Various components in device 1000 are connected to I/O interface 1005, including: an input unit 1006 such as a keyboard, a mouse, and the like; an output unit 1007 such as various types of displays, speakers, and the like; a storage unit 1008 such as a magnetic disk, an optical disk, or the like; and communication unit 1009 such as a network card, modem, wireless communication transceiver, etc. Communication unit 1009 allows device 1000 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunications networks.
The computing unit 1001 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of computing unit 1001 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 1001 performs the respective methods and processes described above, for example, a training method of a feature extraction model, a training method of a video classification model, or a video classification method. For example, in some embodiments, the above-described feature extraction model training method, video classification model training method, or video classification method may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as the storage unit 1008. In some embodiments, part or all of the computer program may be loaded and/or installed onto device 1000 via ROM 1002 and/or communication unit 1009. When the computer program is loaded into the RAM 1003 and executed by the computing unit 1001, one or more steps of the training method of the feature extraction model, the training method of the video classification model, or the video classification method described above may be performed. Alternatively, in other embodiments, the computing unit 1001 may be configured to perform the training method of the feature extraction model, the training method of the video classification model, or the video classification method described above in any other suitable manner (e.g., by means of firmware).
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuit systems, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems On Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include being implemented in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be a special-purpose or general-purpose programmable processor capable of receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for carrying out methods of the present disclosure may be written in any combination of one or more programming languages. These program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowchart and/or block diagram to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a background component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such background, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), wide Area Networks (WANs), and the internet.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server incorporating a blockchain.
It should be appreciated that various forms of the flows shown above may be used to reorder, add, or delete steps. For example, the steps recited in the present disclosure may be performed in parallel or sequentially or in a different order, provided that the desired results of the technical solutions of the present disclosure are achieved, and are not limited herein.
The above detailed description should not be taken as limiting the scope of the present disclosure. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present disclosure are intended to be included within the scope of the present disclosure.

Claims (17)

1. A training method of a feature extraction model, comprising:
performing unsupervised training on the feature extraction model based on the original dataset; the original data set comprises a plurality of first sample video frames and text descriptions corresponding to the first sample video frames, the feature extraction model comprises an image encoder and a text encoder, the image encoder is used for encoding the first sample video frames to obtain corresponding image features, and the text encoder is used for encoding the text descriptions to obtain corresponding text features;
performing supervised training on the feature extraction model obtained by training based on the annotation data set; the annotation data set comprises a second sample video frame, image features annotated for the second sample video frame, text descriptions corresponding to the second sample video frame and text features annotated for the text descriptions;
inputting the original dataset into a feature extraction model obtained by supervised training, and screening a target dataset from the original dataset according to the similarity between the image features output by the image encoder and the text features output by the text encoder;
and performing unsupervised training on the feature extraction model obtained through supervised training based on the target data set again to obtain a trained feature extraction model.
2. The training method of claim 1, wherein the screening the target dataset from the original dataset according to a similarity between image features output by the image encoder and text features output by the text encoder, comprises:
and if the similarity between the image characteristics output by the image encoder and the text characteristics output by the text encoder is larger than a preset value, determining the first sample video frame corresponding to the image characteristics and the text description corresponding to the text characteristics in the original data set as a target data set.
3. The training method of claim 1, wherein the screening the target dataset from the original dataset according to a similarity between image features output by the image encoder and text features output by the text encoder, comprises:
sorting the similarity between the image features output by the image encoder and the text features output by the text encoder according to the sequence from high to low;
selecting image features and text features corresponding to a preset number of similarity degrees which are ranked in front;
and determining the first sample video frame corresponding to the image characteristic and the text description corresponding to the text characteristic in the original data set as a target data set.
4. A training method according to any of claims 1-3, wherein the text description is a title.
5. A method of training a video classification model, comprising:
obtaining a training sample; wherein the training sample comprises a sample video and a classification label of the sample video;
inputting at least one video frame in the sample video into a video classification model for classification prediction to obtain a classification result of the sample video; the video classification model comprises an image encoder and a full-connection layer which are sequentially connected, wherein the image encoder is an image encoder in a feature extraction model obtained by training by the training method according to any one of claims 1-4;
calculating loss according to the classification labels and the classification results, and adjusting parameters of a full-connection layer in the video classification model according to the loss until convergence conditions are met;
and outputting the trained video classification model.
6. The training method of claim 5, wherein the video classification model further comprises a convolutional layer disposed between the image encoder and the fully-connected layer;
the adjusting parameters of the full connection layer in the video classification model according to the loss comprises the following steps: and adjusting parameters of a convolution layer and a full connection layer in the video classification model according to the loss.
7. A method of video classification, comprising:
acquiring videos to be classified;
inputting at least one video frame in the video to be classified into a video classification model for classification prediction to obtain a classification result of the video to be classified;
wherein the video classification model is trained according to the training method of claim 5 or 6.
8. A training device of a feature extraction model, comprising:
the first training module is used for performing unsupervised training on the feature extraction model based on the original data set; the original data set comprises a plurality of first sample video frames and text descriptions corresponding to the first sample video frames, the feature extraction model comprises an image encoder and a text encoder, the image encoder is used for encoding the first sample video frames to obtain corresponding image features, and the text encoder is used for encoding the text descriptions to obtain corresponding text features;
the second training module is used for performing supervised training on the feature extraction model obtained by training based on the labeling data set; the annotation data set comprises a second sample video frame, image features annotated for the second sample video frame, text descriptions corresponding to the second sample video frame and text features annotated for the text descriptions;
the data screening module is used for inputting the original data set into a feature extraction model obtained by supervised training, and screening a target data set from the original data set according to the similarity between the image features output by the image encoder and the text features output by the text encoder;
and the third training module is used for performing unsupervised training on the feature extraction model obtained by the supervised training based on the target data set to obtain a trained feature extraction model.
9. The training device according to claim 8, wherein the data filtering module is specifically configured to determine, as the target data set, a first sample video frame in the original data set corresponding to the image feature and a text description corresponding to the text feature, in a case where a similarity between the image feature output by the image encoder and the text feature output by the text encoder is greater than a preset value.
10. The training device according to claim 8, wherein the data filtering module is specifically configured to sort the similarities between the image features output by the image encoder and the text features output by the text encoder in descending order, select the image features and text features corresponding to a preset number of top-ranked similarities, and determine the first sample video frame corresponding to the image features and the text description corresponding to the text features in the original data set as the target data set.
11. The training device of any of claims 8-10, wherein the textual description is a title.
12. A training apparatus for a video classification model, comprising:
the first acquisition module is used for acquiring training samples; wherein the training sample comprises a sample video and a classification label of the sample video;
the first prediction module is used for inputting at least one video frame in the sample video into a video classification model to perform classification prediction so as to obtain a classification result of the sample video; the video classification model comprises an image encoder and a full-connection layer which are sequentially connected, wherein the image encoder is an image encoder in a feature extraction model obtained by using the training device according to any one of claims 8-11;
the parameter processing module is used for calculating a loss according to the classification label and the classification result, and adjusting parameters of the fully-connected layer in the video classification model according to the loss until a convergence condition is met;
and the model output module is used for outputting the trained video classification model.
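A minimal sketch, under assumed names, of the loop described by the parameter processing module of claim 12: the image encoder from the trained feature extraction model stays frozen, and only the fully-connected layer is updated; a fixed epoch count stands in for the convergence condition.

```python
import torch
import torch.nn as nn


def train_classifier(image_encoder: nn.Module, fc: nn.Linear, loader,
                     epochs: int = 10, lr: float = 1e-3) -> nn.Linear:
    for p in image_encoder.parameters():
        p.requires_grad = False                            # encoder parameters are not adjusted
    optimizer = torch.optim.Adam(fc.parameters(), lr=lr)   # only the fully-connected layer is updated
    criterion = nn.CrossEntropyLoss()
    for _ in range(epochs):                                # stand-in for the convergence condition
        for frames, labels in loader:                      # frames: (b, t, C, H, W); labels: (b,)
            b, t = frames.shape[:2]
            with torch.no_grad():
                feats = image_encoder(frames.flatten(0, 1)).view(b, t, -1).mean(dim=1)
            loss = criterion(fc(feats), labels)            # loss between classification result and label
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return fc
```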
13. The training device of claim 12, wherein the video classification model further comprises a convolution layer disposed between the image encoder and the fully-connected layer;
the parameter processing module is specifically configured to adjust parameters of the convolution layer and the fully-connected layer in the video classification model according to the loss.
14. A video classification apparatus comprising:
the second acquisition module is used for acquiring videos to be classified;
the second prediction module is used for inputting at least one video frame in the video to be classified into a video classification model to perform classification prediction so as to obtain a classification result of the video to be classified;
wherein the video classification model is trained in accordance with the training apparatus of claim 12 or 13.
15. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the training method of the feature extraction model according to any one of claims 1-4, the training method of the video classification model according to claim 5 or 6, or the video classification method according to claim 7.
16. A non-transitory computer-readable storage medium storing computer instructions for causing the computer to perform the training method of the feature extraction model according to any one of claims 1-4, the training method of the video classification model according to claim 5 or 6, or the video classification method according to claim 7.
17. A computer program product comprising a computer program which, when executed by a processor, implements the method of training a feature extraction model according to any one of claims 1-4, the method of training a video classification model according to claim 5 or 6, or the method of video classification according to claim 7.
CN202310539771.7A 2023-05-12 2023-05-12 Model training method, video classification method, device and equipment Pending CN116563669A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310539771.7A CN116563669A (en) 2023-05-12 2023-05-12 Model training method, video classification method, device and equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310539771.7A CN116563669A (en) 2023-05-12 2023-05-12 Model training method, video classification method, device and equipment

Publications (1)

Publication Number Publication Date
CN116563669A true CN116563669A (en) 2023-08-08

Family

ID=87491256

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310539771.7A Pending CN116563669A (en) 2023-05-12 2023-05-12 Model training method, video classification method, device and equipment

Country Status (1)

Country Link
CN (1) CN116563669A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117690130A (en) * 2023-08-18 2024-03-12 荣耀终端有限公司 Image title generation method and related device
CN117690130B (en) * 2023-08-18 2024-09-06 荣耀终端有限公司 Image title generation method and related device

Similar Documents

Publication Publication Date Title
CN112966522B (en) Image classification method and device, electronic equipment and storage medium
CN113436100B (en) Method, apparatus, device, medium, and article for repairing video
CN113642583B (en) Deep learning model training method for text detection and text detection method
JP2023531350A (en) A method for incrementing a sample image, a method for training an image detection model and a method for image detection
CN114882321A (en) Deep learning model training method, target object detection method and device
CN114663952A (en) Object classification method, deep learning model training method, device and equipment
CN114444619B (en) Sample generation method, training method, data processing method and electronic device
CN110633717A (en) Training method and device for target detection model
US20160085857A1 (en) Grouping data using dynamic thresholds
CN114969332A (en) Method and device for training text audit model
CN114037059A (en) Pre-training model, model generation method, data processing method and data processing device
CN115409855B (en) Image processing method, device, electronic equipment and storage medium
CN116563669A (en) Model training method, video classification method, device and equipment
CN114970540A (en) Method and device for training text audit model
CN114724144B (en) Text recognition method, training device, training equipment and training medium for model
CN115527069A (en) Article identification and article identification system construction method and apparatus
CN114707638A (en) Model training method, model training device, object recognition method, object recognition device, object recognition medium and product
CN114863162A (en) Object classification method, deep learning model training method, device and equipment
CN114328123A (en) Abnormality determination method, training method, device, electronic device, and storage medium
CN113361621A (en) Method and apparatus for training a model
CN113362304B (en) Training method of definition prediction model and method for determining definition level
CN114547448B (en) Data processing method, model training method, device, equipment, storage medium and program
CN114821801B (en) Motion recognition method, model training method, device, electronic device and storage medium
CN114549883B (en) Image processing method, training method, device and equipment for deep learning model
CN114445811B (en) Image processing method and device and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination