CN113573058B - Interframe image coding method based on space-time significance fusion - Google Patents
Interframe image coding method based on space-time significance fusion
- Publication number
- CN113573058B, CN113573058A, CN202111112916A (application CN202111112916.2A)
- Authority
- CN
- China
- Prior art keywords
- saliency map
- time
- node
- space
- saliency
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- H—ELECTRICITY
  - H04—ELECTRIC COMMUNICATION TECHNIQUE
    - H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
      - H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
        - H04N19/10—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
          - H04N19/102—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or selection affected or controlled by the adaptive coding
            - H04N19/124—Quantisation
          - H04N19/134—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or criterion affecting or controlling the adaptive coding
            - H04N19/146—Data rate or code amount at the encoder output
          - H04N19/169—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding
            - H04N19/182—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding the unit being a pixel
Abstract
The invention discloses an inter-frame image coding method based on spatio-temporal saliency fusion, which relates to the technical field of image processing and mainly comprises the following steps: acquiring a temporal saliency map from the temporal motion vector of each pixel point; extracting a single-layer graph with superpixel characteristics from the inter-frame image according to the mean features of all pixel points; obtaining a transition matrix from the single-layer graph and the mean features of the pixel points corresponding to its nodes; acquiring a spatial saliency map from the transition matrix based on Markov-chain theory; acquiring a spatio-temporal saliency map from the weighted combination of the temporal saliency map and the spatial saliency map; acquiring a saliency map from the spatio-temporal saliency map; and dynamically adjusting the quantization parameter according to the mean feature of the corresponding pixel point in the saliency map and encoding the inter-frame image with that quantization parameter. By combining the motion characteristics of the image in the temporal and spatial domains into a spatio-temporal saliency map used for coding, the encoded data retain more image information and the fidelity of the decoded data is improved.
Description
Technical Field
The invention relates to the technical field of image processing, in particular to an interframe image coding method based on space-time significance fusion.
Background
Currently, with the increasingly widespread application of H.265/HEVC and its extended coding standards, the known perceptual computing models are mainly classified into four categories: the region-of-interest (ROI) computation model, the visual attention model, the visual sensitivity model, and the cross-modal attention model. Perceptual coding methods can in turn be classified into three categories: preprocessing methods, non-scalable coding methods, and scalable coding methods. Preprocessing methods perform visual optimization on the inter-frame images of the original video before encoding and do not require any change to the encoder. Non-scalable coding methods require changes to the codec to perform the visual optimization. Scalable coding methods only need to change the encoder to perform the visual optimization. The criterion for evaluating perceptual coding performance is the improvement in coding efficiency or visual quality; for some real-time applications, the computational complexity of the perceptual model also needs to be verified.
Although research on inter-frame image coding based on visual perception has advanced greatly in recent years, shortcomings remain. (1) Saliency computation for inter-frame images: there is currently a lack of efficient computational models suited to inter-frame image coding applications. Although international research on the saliency of static images has progressed considerably, research on saliency detection for dynamic inter-frame images is still in its infancy and has not yet formed a system. (2) The existing perceptual coding framework is not sufficiently complete: the information interaction between the inter-frame image perception module and the encoder is limited to salient-object detection, which does not favour information sharing between the two (such as foreground/background partition information, motion-type information, and the like).
With the development of the mass-media industry, the requirements on the timeliness and fidelity of video transmission are ever higher. Accordingly, research on computational models for inter-frame image coding still has considerable room for development.
Disclosure of Invention
In order to overcome the shortcomings of the prior art in coding inter-frame images, the invention provides an inter-frame image coding method based on spatio-temporal saliency fusion, which comprises the following steps:
s1: acquiring an inter-frame image and extracting the mean value characteristic of each pixel point through a color difference calculation space;
s2: acquiring a time domain motion vector of each pixel point in the interframe image through an optical flow algorithm, and acquiring a time saliency map according to the time domain motion vector of each pixel point;
s3: extracting a single-layer graph with super-pixel characteristics from the inter-frame image according to the mean value characteristics of all the pixel points;
s4: obtaining a transfer matrix according to the weight relation between the nodes and the edges in the single-layer graph and the mean value characteristics of the corresponding pixels of the nodes;
s5: acquiring a space saliency map based on a Markov chain theory according to the transfer matrix;
s6: acquiring a space-time saliency map according to the weight relation between the time saliency map and the space saliency map;
s7: carrying out normalization processing in a preset color gradation range according to the space-time saliency map to obtain a saliency map;
s8: and dynamically adjusting the quantization parameter according to the mean value characteristic of the corresponding pixel point in the saliency map, and encoding the interframe image according to the quantization parameter.
Further, the time saliency map is composed of the time saliency value of each pixel point, where the acquisition of the time saliency value can be expressed as

$$MV(x,y)=\sqrt{MV_x(x,y)^{2}+MV_y(x,y)^{2}},\qquad S_t(i)=N\big(\widetilde{MV}(x,y)\big)$$

where (x, y) is the pixel coordinate of pixel point i, MV(x, y) is the magnitude of the temporal motion vector, $MV_x(x, y)$ is the horizontal component of the temporal motion vector and $MV_y(x, y)$ the vertical component; $\widetilde{MV}(x, y)$ is the enhanced magnitude, obtained from MV(x, y) with two constant parameters $\lambda$ and $\gamma$; $N(\cdot)$ normalizes the enhanced magnitude into the preset tone-scale range, and $S_t(i)$ is the time saliency value corresponding to pixel point i.
Further, the nodes of the single-layer graph include transient nodes and absorbing nodes, where each node is connected to the transient nodes that are adjacent to it or that share an edge with its adjacent nodes, and the step S4 further includes the step of:
acquiring the weight of the edge between adjacent nodes according to the mean features of the pixel points corresponding to the nodes, and renumbering the nodes.
Further, the weight of the edge between adjacent nodes can be expressed as

$$w_{mn}=e^{-\frac{\lVert x_m-x_n\rVert}{\sigma^{2}}}$$

where m and n are two adjacent nodes in the single-layer graph, $w_{mn}$ is the weight of the edge between node m and node n, $x_m$ and $x_n$ are the mean features of the pixel points corresponding to node m and node n respectively, $\sigma$ is a constant, and e is the Euler number.
Further, the transition matrix can be represented by the following formulas:

$$a_{\tilde m\tilde n}=\begin{cases}w_{mn}, & n\in N(m)\\ 1, & \tilde m=\tilde n\\ 0, & \text{otherwise}\end{cases},\qquad d_{\tilde m\tilde m}=\sum_{\tilde n}a_{\tilde m\tilde n},\qquad P=D^{-1}A$$

where $\tilde m$ is the number of node m after renumbering, $\tilde n$ is the number of node n after renumbering, A is the adjacency matrix with entries $a_{\tilde m\tilde n}$, N(m) denotes the set of nodes connected to node m, D is the degree matrix, P is the transition matrix, and t is the number of transient nodes (the first t renumbered nodes are transient).
Further, the spatial saliency map is composed of the spatial saliency value of each pixel point, where the acquisition of the spatial saliency value is expressed as

$$P=\begin{pmatrix}Q & R\\ \mathbf{0} & I_r\end{pmatrix},\qquad y=(I_t-Q)^{-1}c,\qquad S_s(i)=\bar y(\tilde i)$$

where Q contains the transition probabilities between any two transient states after the transition matrix P is expressed as a Markov absorbing chain, $I_r$ is an r×r identity matrix, $I_t$ a t×t identity matrix, r is the number of absorbing nodes, c is a t-dimensional column vector with all elements equal to 1, and y is the absorption time of the corresponding transient node; $\tilde i$ is the renumbered index of the transient node corresponding to pixel point i, $\bar y$ is the normalized absorption-time vector of the transient nodes, and $S_s(i)$ is the spatial saliency value.
Further, the spatio-temporal saliency map consists of the spatio-temporal saliency value of each pixel point, and the acquisition of the spatio-temporal saliency value can be expressed as

$$S_{st}(i)=\omega_s S_s(i)+\omega_t S_t(i)$$

where $\omega_s$ is the weight of the spatial saliency map, $\omega_t$ is the weight of the temporal saliency map, $S_s(i)$ is the spatial saliency value of pixel point i, $S_t(i)$ is the temporal saliency value of pixel point i, and $S_{st}(i)$ is the spatio-temporal saliency value of pixel point i.
Further, the adjustment of the quantization parameter in the step S8 can be expressed as

$$QP(x,y)=QP_0+\Delta QP$$

where u(x, y) is the mean feature of the corresponding pixel point i(x, y) in the saliency map; $q_1$ and $q_2$ are the thresholds controlling the quantization parameter, each in a proportional relation to the mean feature of the corresponding pixel point i(x, y) in the saliency map; $QP_0$ is the initial value of the quantization parameter, $\Delta QP$ is the correction value of the quantization parameter QP(x, y), computed from u(x, y) and the thresholds $q_1$ and $q_2$, and Int is the rounding operation.
Compared with the prior art, the invention has at least the following effects:
(1) The inter-frame image coding method based on spatio-temporal saliency fusion of the invention combines the temporal and spatial domains of the image, taking into account the motion characteristics of the image in both domains, to obtain a spatio-temporal saliency map, and codes the video inter-frame images according to the normalized result, so that the information interaction between the perception module and the encoder is no longer limited to salient-object detection, and more foreground/background partition information arising during the change of the video inter-frame images, as well as the motion-type information of the inter-frame images, can be obtained;
(2) through the interaction of more information between the perception module and the encoder, the encoded compressed data retain more relevant information, so that video images of higher definition and fidelity are obtained when the compressed data are decoded, and the method can be applied to the compression coding of high-definition video;
(3) while the information interaction is improved, the coding quantization parameter is dynamically adjusted by means of the spatio-temporal saliency value, which reduces the coding bit rate and increases the coding speed.
Drawings
FIG. 1 is a diagram of method steps for an inter-frame image coding method based on spatio-temporal saliency fusion;
FIG. 2 is a schematic diagram of spatio-temporal saliency fusion.
Detailed Description
The following are specific embodiments of the present invention and are further described with reference to the drawings, but the present invention is not limited to these embodiments.
Embodiment 1
The invention aims to solve the problem in the prior art that inter-frame image coding in video involves insufficient information interaction, so that video distortion is easily introduced when the data are decoded. As shown in FIG. 1, the invention provides an inter-frame image coding method based on spatio-temporal saliency fusion, which comprises the following steps:
s1: acquiring an inter-frame image and extracting the mean value characteristic of each pixel point through a color difference calculation space;
s2: acquiring a time domain motion vector of each pixel point in the interframe image through an optical flow algorithm, and acquiring a time saliency map according to the time domain motion vector of each pixel point;
s3: extracting a single-layer graph with super-pixel characteristics from the inter-frame image according to the mean value characteristics of all the pixel points;
s4: obtaining a transfer matrix according to the weight relation between the nodes and the edges in the single-layer graph and the mean value characteristics of the corresponding pixels of the nodes;
s5: acquiring a space saliency map based on a Markov chain theory according to the transfer matrix;
s6: acquiring a space-time saliency map according to the weight relation between the time saliency map and the space saliency map;
s7: carrying out normalization processing in a preset color gradation range according to the space-time saliency map to obtain a saliency map;
s8: and dynamically adjusting the quantization parameter according to the mean value characteristic of the corresponding pixel point in the saliency map, and encoding the interframe image according to the quantization parameter.
Considering that the human visual perception system is more sensitive to the colour-difference computing space (CIELab), and in order to make the inter-frame images decoded from the encoded video data better conform to human visual perception habits, the invention preprocesses each inter-frame image of the video after it is obtained: the colour of the inter-frame image is converted from the RGB space to the colour-difference computing space, and the mean value of each pixel point in the colour-difference computing space is computed as the feature of that pixel point.
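A minimal sketch of this preprocessing step is given below. It assumes OpenCV's Lab conversion and takes the per-pixel mean of the L, a and b channels as the mean feature; the exact definition of the mean feature is not spelled out in this description, so that choice is an assumption.

```python
import cv2
import numpy as np

def lab_mean_features(frame_bgr):
    """Convert a frame to the CIELab colour-difference space and return, for
    every pixel, the mean of its L, a, b values as the per-pixel mean feature.
    Averaging the three channels is an assumed reading of 'mean feature'."""
    lab = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2LAB).astype(np.float32)
    return lab.mean(axis=2)  # one scalar feature per pixel
```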
Since the method is based on spatio-temporal saliency fusion, it must include both temporal saliency and spatial saliency. For the temporal saliency, the invention acquires the temporal saliency map of the inter-frame image with an optical flow algorithm (Lucas-Kanade). Specifically, the horizontal component $MV_x(x, y)$ and the vertical component $MV_y(x, y)$ of the temporal motion vector of each pixel point are obtained by the optical flow algorithm, and the magnitude MV(x, y) of the temporal motion vector is then obtained from the two components:

$$MV(x,y)=\sqrt{MV_x(x,y)^{2}+MV_y(x,y)^{2}}$$

Further, the magnitude of the temporal motion vector is enhanced using two constant parameters $\lambda$ and $\gamma$; in this embodiment the value of $\lambda$ is selected as 10 and the value of $\gamma$ as 2. Finally, the enhanced magnitude $\widetilde{MV}(x, y)$ is normalized into the preset tone-scale range (selected as [0, 255] in this embodiment), and the temporal saliency value of pixel point i(x, y) is the normalized result:

$$S_t(i)=N_{[0,255]}\big(\widetilde{MV}(x,y)\big)$$

According to the obtained temporal saliency value of each pixel point and the coordinates of the pixel point in the inter-frame image, the temporal saliency values are mapped to the corresponding coordinates, thereby obtaining the temporal saliency map.
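A sketch of the temporal-saliency computation follows. Dense Farneback flow is used here as a stand-in for the per-pixel Lucas-Kanade flow named above, and the power-law form lam * MV**gamma is only an assumed shape for the unspecified enhancement with constants 10 and 2.

```python
import cv2
import numpy as np

def temporal_saliency(prev_gray, curr_gray, lam=10.0, gamma=2.0):
    """Temporal saliency map from the per-pixel motion-vector magnitude."""
    flow = cv2.calcOpticalFlowFarneback(prev_gray, curr_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    mv = np.sqrt(flow[..., 0] ** 2 + flow[..., 1] ** 2)  # magnitude MV(x, y)
    mv_enh = lam * mv ** gamma                            # assumed enhancement form
    # Normalise the enhanced magnitude into the preset tone-scale range [0, 255].
    return cv2.normalize(mv_enh, None, 0, 255, cv2.NORM_MINMAX)
```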
For the spatial saliency, the invention acquires a superpixel-level spatial saliency map with a Markov-chain saliency detection method. First, a single-layer graph G(V, E) with superpixel characteristics is extracted from the inter-frame image according to the mean features of all pixel points, where V and E respectively denote the nodes and edges of the single-layer graph G. On the single-layer graph G, each node V is connected to the transient nodes that are adjacent to it or that share an edge with its adjacent nodes. On this basis, the weight of the edge E between two adjacent nodes m (the current node) and n (a transient node connected to the current node) is defined as

$$w_{mn}=e^{-\frac{\lVert x_m-x_n\rVert}{\sigma^{2}}}$$

where $x_m$ and $x_n$ are the mean features of the pixel points corresponding to node m and node n respectively, $\sigma$ is a constant, and e is the Euler number. The nodes can then be renumbered according to the weights, so that the first t numbered nodes are the transient nodes and the last r numbered nodes are the absorbing nodes, where t is the number of transient nodes and r is the number of absorbing nodes.
On this basis, it should further be understood that the transition matrix P on the single-layer graph can be calculated from the adjacency matrix A and the degree matrix D as $P=D^{-1}A$. Therefore, to calculate the transition matrix P, the adjacency matrix A and the degree matrix D must first be determined.
According to the weights of the edges between adjacent nodes, the adjacency matrix A can be expressed as

$$a_{\tilde m\tilde n}=\begin{cases}w_{mn}, & n\in N(m)\\ 1, & \tilde m=\tilde n\\ 0, & \text{otherwise}\end{cases}$$

and, from the adjacency matrix, the degree matrix D is the diagonal matrix with entries

$$d_{\tilde m\tilde m}=\sum_{\tilde n}a_{\tilde m\tilde n}$$

where $\tilde m$ is the number of node m after renumbering, $\tilde n$ is the number of node n after renumbering, $a_{\tilde m\tilde n}$ is the corresponding entry of the adjacency matrix, and N(m) denotes the set of nodes connected to node m; the case $\tilde m=\tilde n$ corresponds to the diagonal entries of the adjacency matrix.
Then, the absorption time of each transient node is calculated from the obtained transition matrix P based on Markov-chain theory. With the renumbered nodes ordered so that the t transient states come first and the r absorbing states last, the transition matrix can be written in the canonical form

$$P=\begin{pmatrix}Q & R\\ \mathbf{0} & I_r\end{pmatrix}$$

where Q contains the transition probabilities between any two transient states after the transition matrix P is expressed as a Markov absorbing chain, R contains the probabilities of moving from any transient state to any absorbing state, $I_r$ is an r×r identity matrix, r is the number of absorbing nodes, and c is a t-dimensional column vector with all elements equal to 1. For the absorbing chain, its basic property can be derived: the fundamental matrix

$$K=(I_t-Q)^{-1}$$

whose entry $k_{mn}$ can be understood as the expected number of times the chain visits transient node n before absorption, assuming that the chain starts in transient node m. Summing these expected counts before absorption gives the absorption time of the corresponding transient node, i.e.

$$y=Kc=(I_t-Q)^{-1}c$$

The basic idea of the Markov-chain model is to detect saliency using the temporal properties of the absorbing Markov chain. The virtual boundary nodes are taken as prior, boundary-based absorbing nodes, and saliency is calculated as the absorption time of the transient nodes. On the basis of this Markov-chain saliency model, the spatial saliency value can be expressed as

$$S_s(i)=\bar y(\tilde i)$$

where $\tilde i$ is the renumbered index of the transient node corresponding to pixel point i, $\bar y$ is the normalized absorption-time vector of the transient nodes, and $S_s(i)$ is the spatial saliency value. Then, according to the obtained spatial saliency value of each pixel point and the coordinates of the pixel point in the inter-frame image, the spatial saliency values are mapped to the corresponding coordinates, thereby obtaining the spatial saliency map.
The temporal saliency map and the spatial saliency map are obtained through the above analysis and calculation. The temporal saliency map reflects the dynamic characteristics of the inter-frame images in the video, while the spatial saliency map reflects their static characteristics; linearly fusing the two makes them complement each other. With the weight of the temporal saliency map set to $\omega_t$ and the weight of the spatial saliency map set to $\omega_s$, the spatio-temporal saliency value after linear fusion is

$$S_{st}(i)=\omega_s S_s(i)+\omega_t S_t(i)$$

where $S_{st}(i)$ is the spatio-temporal saliency value of pixel point i; $\omega_s,\omega_t\in[0,1]$ and $\omega_s+\omega_t=1$; the fusion weight is a constant that generally takes a value in the range 0.3 to 0.5.
Then, according to the obtained spatio-temporal saliency value of each pixel point and the coordinates of the pixel point in the inter-frame image, the spatio-temporal saliency values are mapped to the corresponding coordinates, thereby obtaining the spatio-temporal saliency map. The pixel-level saliency $S_{st}$ is then normalized within the preset tone-scale range ([0, 255]), and the resulting saliency value is assigned to all the pixel points it covers, giving a saliency map $S_{map}$ containing every pixel point.
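A sketch of the linear fusion and the final normalisation follows; w_t = 0.4 is one value picked from the 0.3 to 0.5 range given above (which of the two weights the constant applies to is an assumption), and both input maps are brought to a common scale before fusion.

```python
import numpy as np

def fuse_and_normalize(s_spatial_nodes, labels, s_temporal, w_t=0.4):
    """S_st = w_s*S_s + w_t*S_t per pixel, then normalised to [0, 255]."""
    w_s = 1.0 - w_t
    s_s = s_spatial_nodes[labels]                        # node saliency -> pixels
    s_t = s_temporal / max(float(s_temporal.max()), 1e-12)
    s_st = w_s * s_s + w_t * s_t
    smap = 255.0 * (s_st - s_st.min()) / max(float(np.ptp(s_st)), 1e-12)
    return smap                                          # saliency map S_map
```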
In HEVC, a video inter-frame image is divided into a number of coding block units, and the bit rate of a coding block is closely related to the quantization parameter QP and the quantization step $QP_{step}$; the value range of QP is [0, 51]. In general, the larger the quantization parameter QP, the higher the distortion of the image. Meanwhile, the foreground region of the video needs a larger allocation of data resources (bits), while the background region should save data resources (bits). Therefore, the invention further dynamically adjusts the quantization parameter based on the mean feature u(x, y) of each pixel point of the coding block in the saliency map $S_{map}$, so that the foreground region is coded with a lower QP and the background region with a higher QP; the adjusted quantization parameter is expressed as

$$QP(x,y)=QP_0+\Delta QP$$

where u(x, y) is the mean feature of the corresponding pixel point i(x, y) in the saliency map; $q_1$ and $q_2$ are the thresholds controlling the quantization parameter, each in a proportional relation to the mean feature of the corresponding pixel point i(x, y) in the saliency map (in this embodiment $q_1=0.5\times 2x$ and $q_2=0.8\times 2x$, where x is the mean feature of the corresponding pixel point); $QP_0$ is the initial value of the quantization parameter, $\Delta QP$ is the correction value of the quantization parameter QP(x, y), computed from u(x, y) and the thresholds $q_1$ and $q_2$, and Int is the rounding operation.
FIG. 2.a shows an inter-frame image from a certain video. Extracting its spatial saliency yields the spatial saliency map shown in FIG. 2.b, from which the foreground region and the background region can be clearly separated. Extracting the temporal saliency of the inter-frame image yields the temporal saliency map shown in FIG. 2.c, which expresses the moving object and the motion type. Combining the two according to the method of the invention yields the spatio-temporal saliency map shown in FIG. 2.d, in which the information of both maps is effectively combined, so that more relevant information can be retained when encoding according to the saliency map.
In summary, the inter-frame image coding method based on spatio-temporal saliency fusion provided by the invention combines the motion characteristics of the image in the temporal and spatial domains to obtain a spatio-temporal saliency map, and codes the video inter-frame images according to the normalized result, so that the information interaction between the perception module and the encoder is no longer limited to salient-object detection, and more foreground/background partition information and motion-type information of the inter-frame images can be obtained during the change of the video inter-frame images.
Through the interaction of more information between the perception module and the encoder, the encoded compressed data retain more relevant information, so that video images of higher definition and fidelity are obtained when the compressed data are decoded; the method can therefore be applied to the compression coding of high-definition video. While the information interaction is improved, the coding quantization parameter is dynamically adjusted by means of the spatio-temporal saliency value, which reduces the coding bit rate and increases the coding speed.
It should be noted that all the directional indicators (such as up, down, left, right, front, and rear … …) in the embodiment of the present invention are only used to explain the relative position relationship between the components, the movement situation, etc. in a specific posture (as shown in the drawing), and if the specific posture is changed, the directional indicator is changed accordingly.
Moreover, descriptions in the present invention relating to "first", "second", and the like are for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present invention, "a plurality" means at least two, e.g., two or three, unless specifically limited otherwise.
In the present invention, unless otherwise expressly stated or limited, the terms "connected," "secured," and the like are to be construed broadly, and for example, "secured" may be a fixed connection, a removable connection, or an integral part; can be mechanically or electrically connected; they may be directly connected or indirectly connected through intervening media, or they may be connected internally or in any other suitable relationship, unless expressly stated otherwise. The specific meanings of the above terms in the present invention can be understood by those skilled in the art according to specific situations.
In addition, the technical solutions in the embodiments of the present invention may be combined with each other, but it must be based on the realization of those skilled in the art, and when the technical solutions are contradictory or cannot be realized, such a combination of technical solutions should not be considered to exist, and is not within the protection scope of the present invention.
Claims (3)
1. An interframe image coding method based on space-time significance fusion is characterized by comprising the following steps:
s1: acquiring an inter-frame image and extracting the mean value characteristic of each pixel point through a color difference calculation space;
s2: acquiring a time domain motion vector of each pixel point in the interframe image through an optical flow algorithm, and acquiring a time saliency map according to the time domain motion vector of each pixel point;
s3: extracting a single-layer graph with super-pixel characteristics from the inter-frame image according to the mean value characteristics of all the pixel points;
s4: obtaining a transfer matrix according to the weight relation between the nodes and the edges in the single-layer graph and the mean value characteristics of the corresponding pixels of the nodes;
s5: acquiring a space saliency map based on a Markov chain theory according to the transfer matrix;
s6: acquiring a space-time saliency map according to the weight relation between the time saliency map and the space saliency map;
s7: carrying out normalization processing in a preset color gradation range according to the space-time saliency map to obtain a saliency map;
s8: dynamically adjusting quantization parameters according to the mean value characteristics of corresponding pixel points in the saliency map, and encoding the interframe images according to the quantization parameters;
in the step S4, the nodes of the single-layer graph include transient nodes and absorbing nodes, where each node is connected to the transient nodes that are adjacent to it or that share an edge with its adjacent nodes, and the step S4 further includes the step of:
acquiring the weight of edges between adjacent nodes according to the mean value characteristics of the pixels corresponding to the nodes, and renumbering the nodes;
the weight of the edge between the adjacent nodes can be expressed as:
wherein m and n are two adjacent nodes in the single-layer graph, and wmnIs the weight of the edge between node m and node n, xm、xnThe corresponding pixel points of the node m and the node n are respectivelyThe characteristics of the values are such that,is a constant number of times, and is,is the Euler number;
the transition matrix can be represented by the following formulas:

$$a_{\tilde m\tilde n}=\begin{cases}w_{mn}, & n\in N(m)\\ 1, & \tilde m=\tilde n\\ 0, & \text{otherwise}\end{cases},\qquad d_{\tilde m\tilde m}=\sum_{\tilde n}a_{\tilde m\tilde n},\qquad P=D^{-1}A$$

where $\tilde m$ is the number of node m after renumbering, $\tilde n$ is the number of node n after renumbering, A is the adjacency matrix with entries $a_{\tilde m\tilde n}$, N(m) denotes the set of nodes connected to node m, D is the degree matrix, P is the transition matrix, and t is the number of transient nodes;
in the step S5, the spatial saliency map is composed of the spatial saliency value of each pixel point, where the acquisition of the spatial saliency value is expressed as

$$P=\begin{pmatrix}Q & R\\ \mathbf{0} & I_r\end{pmatrix},\qquad y=(I_t-Q)^{-1}c,\qquad S_s(i)=\bar y(\tilde i)$$

where Q contains the transition probabilities between any two transient states after the transition matrix P is expressed as a Markov absorbing chain, $I_r$ is an r×r identity matrix, $I_t$ a t×t identity matrix, r is the number of absorbing nodes, c is a t-dimensional column vector with all elements equal to 1, and y is the absorption time of the corresponding transient node; $\tilde i$ is the renumbered index of the transient node corresponding to pixel point i, $\bar y$ is the normalized absorption-time vector of the transient nodes, and $S_s(i)$ is the spatial saliency value;
the adjustment of the quantization parameter in the step S8 can be expressed as

$$QP(x,y)=QP_0+\Delta QP$$

where u(x, y) is the mean feature of the corresponding pixel point i(x, y) in the saliency map; $q_1$ and $q_2$ are the thresholds controlling the quantization parameter, each in a proportional relation to the mean feature of the corresponding pixel point i(x, y) in the saliency map; $QP_0$ is the initial value of the quantization parameter, $\Delta QP$ is the correction value of the quantization parameter QP(x, y), computed from u(x, y) and the thresholds $q_1$ and $q_2$, and Int is the rounding operation.
2. The method as claimed in claim 1, wherein the temporal saliency map is formed by the temporal saliency value of each pixel point, and the temporal saliency value is obtained as

$$MV(x,y)=\sqrt{MV_x(x,y)^{2}+MV_y(x,y)^{2}},\qquad S_t(i)=N\big(\widetilde{MV}(x,y)\big)$$

where (x, y) is the pixel coordinate of pixel point i, MV(x, y) is the magnitude of the temporal motion vector, $MV_x(x, y)$ is the horizontal component of the temporal motion vector and $MV_y(x, y)$ the vertical component; $\widetilde{MV}(x, y)$ is the enhanced magnitude, obtained from MV(x, y) with two constant parameters; $N(\cdot)$ normalizes the enhanced magnitude into the preset tone-scale range, and $S_t(i)$ is the temporal saliency value corresponding to pixel point i.
3. The method as claimed in claim 1, wherein the spatio-temporal saliency map is composed of the spatio-temporal saliency value of each pixel point, and the spatio-temporal saliency value is obtained by the formula

$$S_{st}(i)=\omega_s S_s(i)+\omega_t S_t(i)$$

where $\omega_s$ is the weight of the spatial saliency map, $\omega_t$ is the weight of the temporal saliency map, $S_s(i)$ and $S_t(i)$ are the spatial and temporal saliency values of pixel point i, and $S_{st}(i)$ is the spatio-temporal saliency value of pixel point i.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111112916.2A CN113573058B (en) | 2021-09-23 | 2021-09-23 | Interframe image coding method based on space-time significance fusion |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111112916.2A CN113573058B (en) | 2021-09-23 | 2021-09-23 | Interframe image coding method based on space-time significance fusion |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113573058A CN113573058A (en) | 2021-10-29 |
CN113573058B true CN113573058B (en) | 2021-11-30 |
Family
ID=78174214
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111112916.2A Active CN113573058B (en) | 2021-09-23 | 2021-09-23 | Interframe image coding method based on space-time significance fusion |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113573058B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113965753B (en) * | 2021-12-20 | 2022-05-17 | 康达洲际医疗器械有限公司 | Inter-frame image motion estimation method and system based on code rate control |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103458238A (en) * | 2012-11-14 | 2013-12-18 | 深圳信息职业技术学院 | Scalable video code rate controlling method and device combined with visual perception |
Family Cites Families (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP4705959B2 (en) * | 2005-01-10 | 2011-06-22 | トムソン ライセンシング | Apparatus and method for creating image saliency map |
CN106611427B (en) * | 2015-10-21 | 2019-11-15 | 中国人民解放军理工大学 | Saliency detection method based on candidate region fusion |
CN109076200B (en) * | 2016-01-12 | 2021-04-23 | 上海科技大学 | Method and device for calibrating panoramic stereo video system |
CN106295542A (en) * | 2016-08-03 | 2017-01-04 | 江苏大学 | A kind of road target extracting method of based on significance in night vision infrared image |
WO2018206551A1 (en) * | 2017-05-09 | 2018-11-15 | Koninklijke Kpn N.V. | Coding spherical video data |
CN107749066A (en) * | 2017-11-10 | 2018-03-02 | 深圳市唯特视科技有限公司 | A kind of multiple dimensioned space-time vision significance detection method based on region |
US11122314B2 (en) * | 2017-12-12 | 2021-09-14 | Google Llc | Bitrate optimizations for immersive multimedia streaming |
CN108734173A (en) * | 2018-04-20 | 2018-11-02 | 河海大学 | Infrared video time and space significance detection method based on Gestalt optimizations |
CN109547803B (en) * | 2018-11-21 | 2020-06-09 | 北京航空航天大学 | Time-space domain significance detection and fusion method |
CN111310768B (en) * | 2020-01-20 | 2023-04-18 | 安徽大学 | Saliency target detection method based on robustness background prior and global information |
CN113259664B (en) * | 2021-07-15 | 2021-11-16 | 康达洲际医疗器械有限公司 | Video compression method based on image binary identification |
- 2021-09-23: CN application CN202111112916.2A filed; granted as patent CN113573058B (en), status Active
Also Published As
Publication number | Publication date |
---|---|
CN113573058A (en) | 2021-10-29 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
 | PB01 | Publication | |
 | SE01 | Entry into force of request for substantive examination | |
 | GR01 | Patent grant | |