CN114363617A - Network lightweight video stream transmission method, system and equipment - Google Patents
- Publication number
- CN114363617A (application CN202210266889.2A)
- Authority
- CN
- China
- Prior art keywords
- video
- frame
- key
- resolution
- super
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/10—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
- H04N19/134—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or criterion affecting or controlling the adaptive coding
- H04N19/136—Incoming video signal characteristics or properties
- H04N19/137—Motion inside a coding unit, e.g. average field, frame or block difference
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/10—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
- H04N19/134—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or criterion affecting or controlling the adaptive coding
- H04N19/136—Incoming video signal characteristics or properties
- H04N19/137—Motion inside a coding unit, e.g. average field, frame or block difference
- H04N19/139—Analysis of motion vectors, e.g. their magnitude, direction, variance or reliability
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/42—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals characterised by implementation details or hardware specially adapted for video compression or decompression, e.g. dedicated software implementation
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/70—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals characterised by syntax aspects related to video coding, e.g. related to compression standards
Landscapes
- Engineering & Computer Science (AREA)
- Multimedia (AREA)
- Signal Processing (AREA)
- Compression Or Coding Systems Of Tv Signals (AREA)
Abstract
The invention discloses a network lightweight video stream transmission method, system and device. The encoding end down-samples the spatial resolution of the original video before encoding it, in five steps: key-frame judgment, video down-sampling, key-frame encoding, non-key-frame encoding and code-stream multiplexing. The decoding end restores the corresponding spatial resolution of the decoded video through super-resolution reconstruction, in three steps (code-stream demultiplexing, video decoding and video super-resolution reconstruction) together with a video-frame delay buffer. The method judges key frames by the inter-frame motion complexity, spatially samples video frames with edge-preserving down-sampling, and reconstructs high-resolution video frames with a video super-resolution reconstruction network. The invention significantly reduces the data volume of the video stream with essentially no sacrifice of video clarity, which benefits the network transmission of compressed video.
Description
Technical Field
The invention belongs to the technical field of multimedia and relates to a network video streaming method, system and device, in particular to a network lightweight video stream transmission method, system and device.
Background
The sudden outbreak of COVID-19 led most countries around the world to adopt social-distancing measures, abruptly moving people's daily life, work, study and social contact into cyberspace. Network traffic surged as a result, challenging applications that depend heavily on video, such as remote visual offices and online classes. As society has grown used to the advantages and convenience of working and studying online, networked remote office, study, conferencing and socializing will become the norm, and the traffic congestion caused by a chronically overloaded network will become the norm as well.
An effective way to improve the efficiency of video streaming is video coding. Since the 1980s, international standards organizations have established a series of video coding standards, forming a hybrid coding framework of block-wise prediction plus transform. However, the compression efficiency of video coding roughly doubles only every ten years, and the evolution cycle of an international video coding standard likewise takes about ten years: a full decade passed between the release of the H.264 standard in 2003 and the release of the H.265 standard in 2013. Progress in coding efficiency therefore clearly cannot keep up with the growth of video data volume, and a new solution must be found for the era of high-load network video services.
Super-Resolution (SR) refers to techniques that recover a high-resolution image from a low-resolution image or image sequence. Deep-learning-based video and image super-resolution has achieved great success. In 2017, Seoul National University in Korea developed an enhanced deep super-resolution network whose performance surpassed previous SR methods. In the same year, the RAISR (Rapid and Accurate Image Super-Resolution) technology proposed by Google used machine learning to convert low-resolution images into high-resolution ones, matching or even exceeding the original image while saving 75% of the bandwidth. In 2019, the HiSR super-resolution technology was developed, which converts low-resolution pictures into high-definition pictures by means of a deep learning algorithm and enables fast preview of high-definition pictures on mobile terminals.
Disclosure of Invention
The invention aims to provide a network lightweight video stream transmission method, system and device that build on the strong image-detail reconstruction capability of super-resolution technology. The spatial resolution of the original video frames is reduced by spatial-domain down-sampling, the down-sampled video is compressed so that the compressed code stream is smaller, and the original spatial resolution of the video frames is then restored by super-resolution reconstruction, significantly reducing bandwidth occupation while essentially preserving video clarity.
The technical scheme adopted by the method of the invention is as follows: a network lightweight video stream transmission method comprising an encoding process and a decoding process;
the encoding process is specifically realized by the following steps:
step 1: judging key frames aiming at an input video;
if the inter-frame motion complexity C exceeds a preset threshold T, the frame is judged to be a key frame; otherwise, it is a non-key frame;
step 2: encoding the key frames directly; for a non-key frame, first down-sampling it and then encoding it in combination with the motion vectors;
step 3: encapsulating the compressed code streams of the key frames and non-key frames before transmission so that the receiving end can distinguish key frames from non-key frames;
the decoding process is specifically realized by the following steps:
step 4: splitting the code streams of the key frames and the non-key frames;
step 5: decoding the video to obtain key frames and non-key frames;
the decoded video frames are sent to a video-frame delay buffer, which supplies the several consecutive video frames required by video super-resolution reconstruction; meanwhile, the parsed motion vector parameters are sent to the video super-resolution reconstruction network;
step 6: performing video super-resolution reconstruction with the video super-resolution reconstruction network to obtain super-resolution non-key frames;
step 7: restoring and outputting the video from the key frames decoded in step 5 and the super-resolution non-key frames from step 6.
The technical scheme adopted by the system of the invention is as follows: a network lightweight video stream transmission system comprises an encoding end and a decoding end;
the encoding end comprises the following modules:
the module 1 is used for judging key frames aiming at input videos;
if the inter-frame motion complexity C exceeds a preset threshold T, the frame is judged to be a key frame; otherwise, it is a non-key frame;
a module 2, used for encoding the key frames directly and, for a non-key frame, first down-sampling it and then encoding it in combination with the motion vectors;
a module 3, used for encapsulating the compressed code streams of the key frames and non-key frames before transmission so that the receiving end can distinguish key frames from non-key frames;
the decoding end comprises the following modules:
the module 4 is used for splitting the code stream of the key frame and the non-key frame;
a module 5, used for video decoding to obtain key frames and non-key frames;
the decoded video frames are sent to a video-frame delay buffer, which supplies the several consecutive video frames required by video super-resolution reconstruction; meanwhile, the parsed motion vector parameters are sent to the video super-resolution reconstruction network;
a module 6, used for performing video super-resolution reconstruction with the super-resolution reconstruction network to obtain super-resolution non-key frames;
and a module 7, used for restoring and outputting the video from the key frames decoded by module 5 and the super-resolution non-key frames from module 6.
The technical scheme adopted by the equipment of the invention is as follows: a network lightweight video streaming device, comprising:
one or more processors;
a storage device to store one or more programs that, when executed by the one or more processors, cause the one or more processors to implement the network lightweight video streaming method.
The advantages and positive effects of the invention are as follows:
(1) The original video frames are down-sampled in spatial resolution before encoding, so the amount of compressed code stream is remarkably reduced and network transmission is smoother; meanwhile, super-resolution reconstruction at the decoding end restores the original resolution, so that video quality is essentially not sacrificed compared with traditional direct encoding and decoding.
(2) The down-sampling method and the video super-resolution reconstruction network are original designs; the edge-preserving down-sampling strategy and the deep-learning-based super-resolution network both perform excellently, ensuring the restoration quality of object edges and texture details from both the sampling side and the reconstruction side.
Drawings
FIG. 1 is a flow chart of a method of an embodiment of the present invention;
fig. 2 is a schematic structural diagram of a video super-resolution reconstruction network according to an embodiment of the present invention.
Detailed Description
To help those of ordinary skill in the art understand and implement the present invention, it is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the embodiments described herein are merely illustrative and explanatory and do not limit the invention.
Referring to fig. 1, the method for transmitting a network lightweight video stream provided by the present invention includes an encoding process and a decoding process;
the encoding process of this embodiment is specifically implemented by the following steps:
step 1: judging key frames aiming at an input video;
if the inter-frame motion complexity C exceeds a preset threshold T, the frame is judged to be a key frame; otherwise, it is a non-key frame;
the key frame of the compressed video does not relate to inter-frame prediction, so that the accumulation of inter-frame prediction errors can be prevented, and the video decoding and restoring quality can be improved. There are generally two strategies to start key frame encoding, timing key frames and mandatory key frames for scene cuts. The timing interval for timing the key frames is typically 10 s; scene change refers to whether a video picture has a large amount of violent motion. When a scene is switched, the efficiency of inter-frame coding is not high, which is not as good as improving the fault-tolerant performance of a code stream by adopting intra-frame coding, so that key frame coding is often adopted when the scene is switched. The judgment basis of scene switching is interframe motion complexity, including motion change amplitude and content change strength, wherein the motion change amplitude is measured based on an accumulated motion vector, and the content change strength is measured through accumulated frame difference. To this end, inter-frame motion complexityCThe calculation is as follows:
wherein,Nthe number of macroblocks of size 16 x 16 pixels,xMV i 、yMV i respectively representing horizontal and vertical motion vectors of the macroblock,SAD i represents the inter-frame motion estimation error of the macroblock,is a predetermined weight.
The motion vectors and inter-frame errors of the macroblocks are obtained with an existing motion-estimation algorithm. Since motion vectors are also used in non-key-frame encoding, they are passed on to the non-key-frame encoding operation to avoid the computational cost of needlessly repeating motion estimation.
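For illustration only (not part of the claimed method), the following Python sketch computes C and applies the threshold test; the function names and the default value of the weight λ are assumptions, and the per-macroblock motion vectors and SAD values are assumed to come from an existing motion-estimation algorithm:

```python
import numpy as np

def motion_complexity(mvs_x, mvs_y, sads, weight=0.5):
    """Inter-frame motion complexity C for one frame.

    mvs_x / mvs_y are per-macroblock horizontal/vertical motion vectors,
    sads the per-macroblock motion-estimation errors (SAD), all length N.
    `weight` stands in for the predetermined weight lambda; 0.5 is purely
    illustrative.
    """
    # motion change amplitude: accumulated motion-vector magnitude
    motion_term = np.abs(np.asarray(mvs_x)).sum() + np.abs(np.asarray(mvs_y)).sum()
    # content change strength: accumulated frame difference (SAD)
    content_term = np.asarray(sads).sum()
    return motion_term + weight * content_term

def is_key_frame(mvs_x, mvs_y, sads, threshold):
    """A frame is a key frame iff C exceeds the preset threshold T."""
    return motion_complexity(mvs_x, mvs_y, sads) > threshold
```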
Step 2: the key frames are encoded directly; a non-key frame is first down-sampled and then encoded in combination with the motion vectors;
considering that the traditional down-sampling algorithm is easy to blur the edge of an object and has too much damage to the definition of a target, the embodiment adopts a down-sampling algorithm for image edge preservation.
Let f_0, f_1, f_2, f_3 be a group of spatially consecutive pixels. To examine the correlation among f_0, f_1, f_2, f_3, they are divided into two groups, I (f_0, f_1, f_2) and II (f_1, f_2, f_3), and the second-order differences are calculated as follows:

$$\Delta^2_{\mathrm{I}} = f_0 - 2f_1 + f_2,\qquad \Delta^2_{\mathrm{II}} = f_1 - 2f_2 + f_3$$

The absolute value of the second-order difference of the 3 adjacent points serves as the measure of correlation: the smaller the absolute value, the larger the correlation, and vice versa. The larger the correlation, the more likely the pixels lie in a homogeneous image region, so it is more reasonable to select the neighboring pixels with large correlation for interpolation. Based on this principle, the interpolated pixel I is calculated by the following second-order (Newton) interpolation formula:

$$I = f_0 + t\,\Delta f_0 + \frac{t(t-1)}{2}\,\Delta^2$$

where t is the distance between the interpolated pixel I and the source pixel f_0, with 1 ≤ t ≤ 2; Δf_0 = f_1 − f_0 is the first-order difference, and Δ² is the second-order difference of the group with the larger correlation.
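For illustration, a minimal Python sketch of this edge-preserving rule follows, under the assumptions made in the reconstruction above (the group with the smaller absolute second-order difference supplies the curvature term); the helper names and the sampling factor are hypothetical:

```python
import numpy as np

def edge_preserving_pixel(f0, f1, f2, f3, t):
    """Second-order interpolation with correlation-based group selection.

    The group (I or II) with the smaller |second-order difference|,
    i.e. the larger correlation, supplies the curvature term; t is the
    distance from the source pixel f0, with 1 <= t <= 2.
    """
    d2_I = f0 - 2.0 * f1 + f2     # group I  (f0, f1, f2)
    d2_II = f1 - 2.0 * f2 + f3    # group II (f1, f2, f3)
    d2 = d2_I if abs(d2_I) <= abs(d2_II) else d2_II
    return f0 + t * (f1 - f0) + 0.5 * t * (t - 1.0) * d2

def downsample_row(row, factor=1.5):
    """Resample one image row by `factor` (border handling simplified)."""
    n_out = int(len(row) / factor)
    out = np.empty(n_out)
    for j in range(n_out):
        x = j * factor                              # source-grid coordinate
        i = min(max(int(x) - 1, 0), len(row) - 4)   # window start, keeps t near [1, 2]
        out[j] = edge_preserving_pixel(row[i], row[i + 1], row[i + 2], row[i + 3], x - i)
    return out
```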
In this embodiment the key frames and non-key frames are encoded with the mature H.264 or H.265 coding techniques, where key frames correspond to intra frames of the coding standard and non-key frames to predicted frames. Key frames are encoded at the original spatial resolution, non-key frames at the reduced spatial resolution. The non-key-frame encoding module does not recompute motion vectors; the required motion vectors come from the key-frame judgment step.
Step 3: encapsulating the compressed code streams of the key frames and the non-key frames before transmission so that the receiving end can distinguish key frames from non-key frames. In this embodiment, when network transmission bandwidth is scarce, the key frames are encapsulated preferentially (an illustrative encapsulation sketch is given together with the splitting step below).
The decoding process of this embodiment is specifically implemented by the following steps:
and 4, step 4: splitting the code stream of the key frame and the non-key frame;
the embodiment splits the code stream of the key frame and the non-key frame, so that the decoding and super-resolution reconstruction of the non-key frame can be conveniently carried out in the back, and the non-key frame is directly decoded and output.
Step 5: decoding the video to obtain key frames and non-key frames;
the present embodiment performs decoding of a corresponding standard according to an encoding standard of a compressed code stream. And sending the decoded video frames into a video frame delay buffer area for buffering, and supplying a plurality of continuous video frames required by video super-resolution reconstruction. Meanwhile, the motion vector parameters analyzed by the video decoder are sent to a video super-resolution reconstruction network, so that the calculated amount is saved.
The video super-resolution reconstruction network of this embodiment restores the spatial resolution of the decoded non-key frames to compensate for the loss of detail caused by down-sampling the non-key frames at the encoding end. A deep-learning-based video super-resolution scheme is adopted, reconstructing each high-resolution frame from a series of adjacent low-resolution frames.
Referring to fig. 2, the video super-resolution reconstruction network of this embodiment comprises bicubic up-sampling, a motion-estimation layer, a motion-compensation layer, a feature-extraction layer, a multi-memory detail-fusion layer, a feature-reconstruction layer, a sub-pixel amplification layer and a residual-addition operation. The input low-resolution frames are first converted into compensated frames through motion estimation and motion compensation; the compensated frames then pass through feature extraction, multi-memory detail fusion, feature reconstruction and sub-pixel amplification in turn; finally, the bicubic up-sampled frame is added to the sub-pixel amplification result to obtain the reconstructed high-resolution frame.
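The following PyTorch skeleton illustrates only the data flow of fig. 2 (a bicubic branch added to a learned branch that ends in sub-pixel amplification); the single-convolution stand-ins for the feature-extraction, fusion and reconstruction stages, the channel widths and the grayscale input are assumptions, not the actual network:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VSRNet(nn.Module):
    """Sketch of the fig. 2 pipeline: bicubic branch + learned residual branch."""

    def __init__(self, scale=2, channels=64):
        super().__init__()
        self.scale = scale
        self.feat = nn.Conv2d(1, channels, 3, padding=1)         # feature extraction (stand-in)
        self.fuse = nn.Conv2d(channels, channels, 3, padding=1)  # multi-memory fusion (stand-in)
        self.recon = nn.Conv2d(channels, scale * scale, 3, padding=1)
        self.upsample = nn.PixelShuffle(scale)                   # sub-pixel amplification

    def forward(self, compensated_lr):
        # compensated_lr: motion-compensated low-resolution frame, shape (B, 1, H, W)
        bicubic = F.interpolate(compensated_lr, scale_factor=self.scale,
                                mode="bicubic", align_corners=False)
        x = torch.relu(self.feat(compensated_lr))
        x = torch.relu(self.fuse(x))
        residual = self.upsample(self.recon(x))
        return bicubic + residual  # residual addition with the bicubic branch
```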
The motion estimation and compensation of this embodiment handle the temporal correlation between successive low-resolution frames. No new motion estimation is performed here; motion compensation is performed directly with the motion vectors obtained from decoding, saving the computational overhead of complex motion estimation.
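As an illustration of reusing decoder-side motion vectors, a simple block-wise compensation sketch follows; the whole-pixel 16 × 16 motion-vector layout is an assumption:

```python
import numpy as np

def compensate(prev_frame, mvs, block=16):
    """Block-wise motion compensation reusing decoded motion vectors.

    mvs[by, bx] holds the (dx, dy) whole-pixel motion vector of the 16x16
    block at block coordinates (bx, by); shape (H // 16, W // 16, 2).
    """
    h, w = prev_frame.shape
    out = np.zeros_like(prev_frame)
    for by in range(h // block):
        for bx in range(w // block):
            dx, dy = mvs[by, bx]
            ys = int(np.clip(by * block + dy, 0, h - block))
            xs = int(np.clip(bx * block + dx, 0, w - block))
            out[by * block:(by + 1) * block, bx * block:(bx + 1) * block] = \
                prev_frame[ys:ys + block, xs:xs + block]
    return out
```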
The feature extraction function of this embodiment is implemented with a residual block structure, which is composed of a series of convolutional layers. The process is described as follows:

$$O_n = \mathrm{Conv}_n(I_n) + I_n,\qquad I_{n+1} = O_n$$
wherein Conv_n denotes the n-th convolutional layer in the residual block, and I_n and O_n denote the input and output of the n-th convolutional layer. Through its skip connections the residual block retains information from the previous convolutional layers and passes it to all subsequent convolutional layers. The residual blocks used in feature reconstruction have the same structure as the feature-extraction residual blocks, but their position in the network, and hence their role, is different.
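A PyTorch sketch matching this formulation follows; the channel width, depth and the ReLU activation are assumptions for illustration:

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Feature-extraction residual block: O_n = Conv_n(I_n) + I_n per layer."""

    def __init__(self, channels=64, n_layers=3):
        super().__init__()
        self.convs = nn.ModuleList(
            nn.Conv2d(channels, channels, 3, padding=1) for _ in range(n_layers))
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        for conv in self.convs:
            x = self.act(conv(x)) + x  # identity skip carries earlier information forward
        return x
```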
The multi-memory detail fusion function of this embodiment is implemented with a multi-memory residual block structure, the residual block being composed of a series of convolutional long short-term memory (ConvLSTM) layers. When a low-resolution frame passes through the residual block, the cell state of the ConvLSTM layer retains the feature-map information of that frame; when the next frame enters the residual block, it receives the feature map inherited from the previous frame. In this way, the ConvLSTM layer learns which valid information should be remembered and which invalid information should be forgotten. The process of a ConvLSTM layer is expressed as:

$$i_t = I(X_t, H_{t-1}),\quad f_t = F(X_t, H_{t-1}),\quad o_t = O(X_t, H_{t-1})$$
$$C_t = f_t \circ C_{t-1} + i_t \circ \tanh\big(C(X_t, H_{t-1})\big)$$
$$H_t = o_t \circ \tanh(C_t)$$
wherein i_t, f_t, C_t, o_t and H_t respectively denote the input gate, the forget gate, the cell state, the output gate and the hidden state; X_t denotes the feature map and ∘ the Hadamard product; I(·), F(·), C(·) and O(·) denote the functions of the input gate, forget gate, cell state and output gate as defined by the standard long short-term memory network LSTM; tanh(·) denotes the hyperbolic tangent activation function.
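For illustration, a ConvLSTM cell consistent with these equations can be sketched as follows; computing all four gates with a single convolution is a common implementation choice, and the channel widths are assumptions:

```python
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    """Convolutional LSTM layer following the equations above (a sketch)."""

    def __init__(self, in_ch=16, hid_ch=16, kernel=3):
        super().__init__()
        # one convolution computes all four gates on [X_t, H_{t-1}]
        self.gates = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, kernel, padding=kernel // 2)

    def forward(self, x_t, state):
        h_prev, c_prev = state
        i, f, o, g = torch.chunk(self.gates(torch.cat([x_t, h_prev], dim=1)), 4, dim=1)
        i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
        c_t = f * c_prev + i * torch.tanh(g)   # C_t = f_t∘C_{t-1} + i_t∘tanh(C(.))
        h_t = o * torch.tanh(c_t)              # H_t = o_t∘tanh(C_t)
        return h_t, c_t
```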
A multi-memory residual block contains 3 ConvLSTM layers, each using convolution kernels of size 3 × 3 but in different numbers. Since a ConvLSTM layer consumes much GPU memory (about 4 times that of an ordinary convolutional layer), the input feature map is first reduced from 64 channels to 16 channels to lower the GPU memory cost and computational complexity.
In convolutional neural networks, the most common method for up-scaling a feature map is transposed convolution; Caballero et al. proposed the sub-pixel amplification method for up-scaling feature maps. This embodiment chooses sub-pixel amplification because it requires less computational cost and performs better in similar networks.
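The operation is available directly in PyTorch as nn.PixelShuffle; a minimal usage sketch (the feature sizes are illustrative):

```python
import torch
import torch.nn as nn

# Sub-pixel amplification: a Conv2d produces r*r times the target channels,
# then nn.PixelShuffle rearranges them into an r-times larger feature map.
r = 2
to_subpixel = nn.Conv2d(64, 64 * r * r, kernel_size=3, padding=1)
shuffle = nn.PixelShuffle(r)

feat = torch.randn(1, 64, 90, 160)        # low-resolution feature map
print(shuffle(to_subpixel(feat)).shape)   # torch.Size([1, 64, 180, 320])
```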
Step 6: utilizing a video super-resolution reconstruction network to carry out video super-resolution reconstruction to obtain a super-resolution non-key frame;
and 7: and restoring and outputting the video according to the key frame obtained by decoding in the step 5 and the super-resolution non-key frame in the step 6.
It should be understood that the above description of the preferred embodiments is given for clarity and not by way of limitation; various changes, substitutions and alterations can be made without departing from the spirit and scope of the invention as defined by the appended claims.
Claims (10)
1. A network lightweight video stream transmission method is characterized by comprising an encoding process and a decoding process;
the encoding process is specifically realized by the following steps:
step 1: judging key frames aiming at an input video;
if the inter-frame motion complexity C exceeds a preset threshold T, the frame is judged to be a key frame; otherwise, it is a non-key frame;
step 2: encoding the key frames directly; for a non-key frame, first down-sampling it and then encoding it in combination with the motion vectors;
step 3: encapsulating the compressed code streams of the key frames and non-key frames before transmission so that the receiving end can distinguish key frames from non-key frames;
the decoding process is specifically realized by the following steps:
step 4: splitting the code streams of the key frames and the non-key frames;
step 5: decoding the video to obtain key frames and non-key frames;
the decoded video frames are sent to a video-frame delay buffer, which supplies the several consecutive video frames required by video super-resolution reconstruction; meanwhile, the parsed motion vector parameters are sent to the video super-resolution reconstruction network;
step 6: performing video super-resolution reconstruction with the video super-resolution reconstruction network to obtain super-resolution non-key frames;
step 7: restoring and outputting the video from the key frames decoded in step 5 and the super-resolution non-key frames from step 6.
2. The network lightweight video streaming method according to claim 1, wherein the inter-frame motion complexity in step 1 is

$$C = \sum_{i=1}^{N}\left(\left|xMV_i\right| + \left|yMV_i\right|\right) + \lambda \sum_{i=1}^{N} SAD_i$$

wherein N is the number of macroblocks of size 16 × 16 pixels, xMV_i and yMV_i respectively denote the horizontal and vertical motion vectors of macroblock i, SAD_i denotes the inter-frame motion-estimation error of macroblock i, and λ is a predetermined weight.
3. The network lightweight video streaming method according to claim 1, wherein: in step 2, the non-key frame is down-sampled with an edge-preserving down-sampling method. Let f_0, f_1, f_2, f_3 be a group of spatially consecutive pixels; to examine the correlation among f_0, f_1, f_2, f_3, they are divided into two groups, I (f_0, f_1, f_2) and II (f_1, f_2, f_3), and the second-order differences are calculated as follows:

$$\Delta^2_{\mathrm{I}} = f_0 - 2f_1 + f_2,\qquad \Delta^2_{\mathrm{II}} = f_1 - 2f_2 + f_3$$

The absolute value of the second-order difference of the 3 adjacent points serves as the measure of correlation: the smaller the absolute value, the larger the correlation, and vice versa. The neighboring pixels with large correlation are selected for interpolation, and the interpolated pixel I is calculated by the following second-order interpolation formula:

$$I = f_0 + t\,\Delta f_0 + \frac{t(t-1)}{2}\,\Delta^2$$

where t is the distance between the interpolated pixel I and the source pixel f_0, with 1 ≤ t ≤ 2; Δf_0 = f_1 − f_0 is the first-order difference, and Δ² is the second-order difference of the group with the larger correlation.
4. The network lightweight video streaming method according to claim 1, wherein: in step 2, the non-key frames are first down-sampled and then encoded in combination with the motion vectors, the motion vectors coming from the key-frame judgment step.
5. The network lightweight video streaming method according to claim 1, wherein: the video super-resolution reconstruction network in step 5 comprises bicubic up-sampling, a motion-estimation layer, a motion-compensation layer, a feature-extraction layer, a multi-memory detail-fusion layer, a feature-reconstruction layer, a sub-pixel amplification layer and a residual-addition operation; the input low-resolution frames are first converted into compensated frames through motion estimation and motion compensation; the compensated frames then pass through feature extraction, multi-memory detail fusion, feature reconstruction and sub-pixel amplification in turn; finally, the bicubic up-sampled frame is added to the sub-pixel amplification result to obtain the reconstructed high-resolution frame.
6. The network lightweight video streaming method according to claim 5, wherein: the input low-resolution frames are converted into compensated frames by performing motion compensation directly with the motion vectors obtained from decoding.
7. The network lightweight video streaming method according to claim 5, wherein: the feature extraction is implemented with a residual block structure, the residual block being composed of a series of convolutional layers; the specific process is as follows:

$$O_n = \mathrm{Conv}_n(I_n) + I_n,\qquad I_{n+1} = O_n$$

wherein Conv_n denotes the n-th convolutional layer in the residual block, and I_n and O_n denote the input and output of the n-th convolutional layer.
8. The network lightweight video streaming method according to claim 5, wherein: the multi-memory detail fusion is implemented with a multi-memory residual block structure, the multi-memory residual block being composed of a series of convolutional long short-term memory layers; when a low-resolution frame passes through the residual block, the cell state of the convolutional long short-term memory layer retains the feature-map information of that frame; when the next frame enters the residual block, it receives the feature map inherited from the previous frame; the process of a convolutional long short-term memory layer is expressed as:

$$i_t = I(X_t, H_{t-1}),\quad f_t = F(X_t, H_{t-1}),\quad o_t = O(X_t, H_{t-1})$$
$$C_t = f_t \circ C_{t-1} + i_t \circ \tanh\big(C(X_t, H_{t-1})\big)$$
$$H_t = o_t \circ \tanh(C_t)$$

wherein i_t, f_t, C_t, o_t and H_t respectively denote the input gate, the forget gate, the cell state, the output gate and the hidden state; X_t denotes the feature map and ∘ the Hadamard product; I(·), F(·), C(·) and O(·) denote the functions of the input gate, forget gate, cell state and output gate as defined by the standard long short-term memory network LSTM; tanh(·) denotes the hyperbolic tangent activation function;
a multi-memory residual block contains 3 convolutional long-short term memory layers, each layer using a convolutional kernel of size 3 x 3.
9. A network lightweight video stream transmission system is characterized by comprising an encoding end and a decoding end;
the encoding end comprises the following modules:
the module 1 is used for judging key frames aiming at input videos;
if the inter-frame motion complexity C exceeds a preset threshold T, the frame is judged to be a key frame; otherwise, it is a non-key frame;
a module 2, used for encoding the key frames directly and, for a non-key frame, first down-sampling it and then encoding it in combination with the motion vectors;
a module 3, used for encapsulating the compressed code streams of the key frames and non-key frames before transmission so that the receiving end can distinguish key frames from non-key frames;
the decoding end comprises the following modules:
the module 4 is used for splitting the code stream of the key frame and the non-key frame;
a module 5, used for video decoding to obtain key frames and non-key frames;
the decoded video frames are sent to a video-frame delay buffer, which supplies the several consecutive video frames required by video super-resolution reconstruction; meanwhile, the parsed motion vector parameters are sent to the video super-resolution reconstruction network;
a module 6, used for performing video super-resolution reconstruction with the super-resolution reconstruction network to obtain super-resolution non-key frames;
and a module 7, used for restoring and outputting the video from the key frames decoded by module 5 and the super-resolution non-key frames from module 6.
10. A network lightweight video streaming device, comprising:
one or more processors;
storage means for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the network lightweight video streaming method of any of claims 1 to 8.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210266889.2A CN114363617A (en) | 2022-03-18 | 2022-03-18 | Network lightweight video stream transmission method, system and equipment |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210266889.2A CN114363617A (en) | 2022-03-18 | 2022-03-18 | Network lightweight video stream transmission method, system and equipment |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114363617A true CN114363617A (en) | 2022-04-15 |
Family
ID=81094906
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210266889.2A Pending CN114363617A (en) | 2022-03-18 | 2022-03-18 | Network lightweight video stream transmission method, system and equipment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114363617A (en) |
- 2022-03-18: application CN202210266889.2A filed in China (CN); publication CN114363617A (en); status: Pending
Patent Citations (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101710993A (en) * | 2009-11-30 | 2010-05-19 | 北京大学 | Block-based self-adaptive super-resolution video processing method and system |
CN101938656A (en) * | 2010-09-27 | 2011-01-05 | 上海交通大学 | Video coding and decoding system based on keyframe super-resolution reconstruction |
CN102800047A (en) * | 2012-06-20 | 2012-11-28 | 天津工业大学 | Method for reconstructing super resolution of single-frame image |
CN103813174A (en) * | 2012-11-12 | 2014-05-21 | 腾讯科技(深圳)有限公司 | Mixture resolution encoding and decoding method and device |
CN103905769A (en) * | 2012-12-26 | 2014-07-02 | 苏州赛源微电子有限公司 | Video deinterlacing algorithm without local frame buffer and solution thereof |
CN103632359A (en) * | 2013-12-13 | 2014-03-12 | 清华大学深圳研究生院 | Super-resolution processing method for videos |
US20170334066A1 (en) * | 2016-05-20 | 2017-11-23 | Google Inc. | Machine learning methods and apparatus related to predicting motion(s) of object(s) in a robot's environment based on image(s) capturing the object(s) and based on parameter(s) for future robot movement in the environment |
CN109118431A (en) * | 2018-09-05 | 2019-01-01 | 武汉大学 | A kind of video super-resolution method for reconstructing based on more memories and losses by mixture |
CN109068134A (en) * | 2018-09-17 | 2018-12-21 | 鲍金龙 | Method for video coding and device |
CN111726614A (en) * | 2019-03-18 | 2020-09-29 | 四川大学 | HEVC (high efficiency video coding) optimization method based on spatial domain downsampling and deep learning reconstruction |
CN110349090A (en) * | 2019-07-16 | 2019-10-18 | 合肥工业大学 | A kind of image-scaling method based on newton second order interpolation |
CN113129212A (en) * | 2019-12-31 | 2021-07-16 | 深圳市联合视觉创新科技有限公司 | Image super-resolution reconstruction method and device, terminal device and storage medium |
WO2021164176A1 (en) * | 2020-02-20 | 2021-08-26 | 北京大学 | End-to-end video compression method and system based on deep learning, and storage medium |
CN113674151A (en) * | 2021-07-28 | 2021-11-19 | 南京航空航天大学 | Image super-resolution reconstruction method based on deep neural network |
Non-Patent Citations (2)
Title |
---|
Z. Wang et al.: "Multi-Memory Convolutional Neural Network for Video Super-Resolution", IEEE Transactions on Image Processing *
Liu Zhenglin, Xiao Jianping, Zou Xuecheng, Guo Xu: "Research on an edge-based real-time image scaling algorithm", Journal of Image and Graphics *
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115361582A (en) * | 2022-07-19 | 2022-11-18 | 鹏城实验室 | Video real-time super-resolution processing method and device, terminal and storage medium |
CN115361582B (en) * | 2022-07-19 | 2023-04-25 | 鹏城实验室 | Video real-time super-resolution processing method, device, terminal and storage medium |
CN115834922A (en) * | 2022-12-20 | 2023-03-21 | 南京大学 | Picture enhancement type decoding method facing real-time video analysis |
CN116523758A (en) * | 2023-07-03 | 2023-08-01 | 清华大学 | End cloud combined super-resolution video reconstruction method and system based on key frames |
CN116523758B (en) * | 2023-07-03 | 2023-09-19 | 清华大学 | End cloud combined super-resolution video reconstruction method and system based on key frames |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN114363617A (en) | Network lightweight video stream transmission method, system and equipment | |
CN108769682B (en) | Video encoding method, video decoding method, video encoding apparatus, video decoding apparatus, computer device, and storage medium | |
WO2019242491A1 (en) | Video encoding and decoding method and device, computer device, and storage medium | |
WO2019242486A1 (en) | Video encoding method, video decoding method, apparatuses, computer device, and storage medium | |
CN101511017B (en) | Hierarchical encoder of stereo video space based on grid and decoding method thereof | |
CN110493596B (en) | Video coding system and method based on neural network | |
KR100913088B1 (en) | Method and apparatus for encoding/decoding video signal using prediction information of intra-mode macro blocks of base layer | |
CN110087087A (en) | VVC interframe encode unit prediction mode shifts to an earlier date decision and block divides and shifts to an earlier date terminating method | |
WO2022068682A1 (en) | Image processing method and apparatus | |
WO2021036795A1 (en) | Video super-resolution processing method and device | |
KR100703788B1 (en) | Video encoding method, video decoding method, video encoder, and video decoder, which use smoothing prediction | |
MX2007000254A (en) | Method and apparatus for using frame rate up conversion techniques in scalable video coding. | |
CN101860748A (en) | Side information generating system and method based on distribution type video encoding | |
SG183888A1 (en) | Method and device for video predictive encoding | |
CN113810763A (en) | Video processing method, device and storage medium | |
CN117730338A (en) | Video super-resolution network and video super-resolution, encoding and decoding processing method and device | |
WO2022067805A1 (en) | Image prediction method, encoder, decoder, and computer storage medium | |
CN109361919A (en) | A kind of image coding efficiency method for improving combined super-resolution and remove pinch effect | |
CN116437102B (en) | Method, system, equipment and storage medium for learning universal video coding | |
CN111726614A (en) | HEVC (high efficiency video coding) optimization method based on spatial domain downsampling and deep learning reconstruction | |
WO2011063747A1 (en) | Video encoding method and device, video decoding method and device | |
JP5860337B2 (en) | Video encoding method and apparatus | |
CN114202463B (en) | Cloud fusion-oriented video super-resolution method and system | |
CN112929629B (en) | Intelligent virtual reference frame generation method | |
CN112601095A (en) | Method and system for creating fractional interpolation model of video brightness and chrominance |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20220415 |