CN110033007A - Pedestrian clothing attribute identification method based on depth attitude estimation and multi-feature fusion - Google Patents

Pedestrian clothing attribute identification method based on depth attitude estimation and multi-feature fusion

Info

Publication number
CN110033007A
Authority
CN
China
Prior art keywords
image
pixel
model
label
fusion
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910321093.0A
Other languages
Chinese (zh)
Other versions
CN110033007B (en
Inventor
柯逍
李振达
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fuzhou University
Original Assignee
Fuzhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fuzhou University filed Critical Fuzhou University
Priority to CN201910321093.0A priority Critical patent/CN110033007B/en
Publication of CN110033007A publication Critical patent/CN110033007A/en
Application granted granted Critical
Publication of CN110033007B publication Critical patent/CN110033007B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/23Clustering techniques
    • G06F18/232Non-hierarchical techniques
    • G06F18/2321Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2413Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
    • G06F18/24133Distances to prototypes
    • G06F18/24143Distances to neighbourhood prototypes, e.g. restricted Coulomb energy networks [RCEN]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00Image enhancement or restoration
    • G06T5/70Denoising; Smoothing
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/46Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • G06V10/462Salient features, e.g. scale invariant feature transforms [SIFT]
    • G06V10/464Salient features, e.g. scale invariant feature transforms [SIFT] using a plurality of salient features, e.g. bag-of-words [BoW] representations

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Biophysics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The present invention relates to a pedestrian clothing attribute identification method based on depth attitude estimation and multi-feature fusion. The method first performs appearance-feature matching and selects part of the retrieval results for subsequent attribute recognition; then, through an SSD-based deep human pose estimation method, it effectively locates the foreground region belonging to the pedestrian in the image and largely excludes interference from background factors; finally, it fuses the parsing results of multiple modes and, combined with an iterative smoothing process, assigns labels by maximum a posteriori probability so as to strengthen the correlation between attribute labels and pixels, obtaining the final attribute parsing and recognition result. The invention solves problems such as inaccurate label recognition and pixel parsing region deviation under a single parsing mode. The method is simple and flexible and has strong practical applicability.

Description

Pedestrian clothing attribute identification method based on depth attitude estimation and multi-feature fusion
Technical Field
The invention belongs to the fields of computer vision, deep learning and image processing, and is applied to scenes such as intelligent monitoring, pedestrian re-identification and the like, in particular to a pedestrian clothing attribute identification method based on deep attitude estimation and multi-feature fusion.
Background
The identification of pedestrian attributes in surveillance images acquired from real-world surveillance videos is challenging for the following reasons: (1) imaging quality is poor, resolution is generally low, and the images are susceptible to motion blur; (2) the attributes are affected by the appearance of clothes worn or carried by the pedestrian, and because pedestrians take different postures in different images, the corresponding attributes appear at different spatial positions in the image; (3) attribute-labeled data from surveillance video images is difficult to collect and can only be obtained in small quantities. These factors make it very difficult to learn a pedestrian attribute model through training. Early attribute identification methods relied primarily on manually extracted features such as color or textual annotations of items. In recent years, pedestrian attribute recognition models based on deep learning have attracted more and more research attention, because models obtained by deep learning have strong and stable learning ability on large-scale datasets and can yield general models capable of representing complex characteristics. Meanwhile, the poor quality, low resolution, and complex variation of clothing appearance in images obtained from surveillance videos undoubtedly make deep-learning-based pedestrian attribute identification more difficult.
Pedestrian clothing attribute identification corresponds to a multi-label image classification (MLIC) problem. Existing approaches have explored sequential multi-label prediction, designed around CNN-RNN models. Importantly, these existing MLIC models assume (1) the availability of large-scale labeled training data and (2) sufficiently good image quality. Both assumptions are invalid for pedestrian attribute identification in surveillance images. A recent multi-person image annotation approach advances this sequential MLIC paradigm by incorporating additional interpersonal social relationships and scene context. That approach specifically exploits high-resolution photographs centered on family members and friends, but it does not extend to open-world surveillance scenes with poor-quality image data. Furthermore, it requires strong attribute-level labels, whereas pedestrian attributes are mostly weak labels at the image level.
The weak, image-level nature of pedestrian attribute labels in images obtained from surveillance scenes is also why existing attribute identification methods suffer localization deviation under the influence of environmental factors during recognition.
To solve these problems, a pedestrian posture estimation method is introduced to further delimit the foreground and background regions of the pedestrian in the image and to eliminate interference from background factors. In addition, the image quality in the surveillance scene is improved through image processing, and the accuracy of the identification method is enhanced by means of fused features and a fused attribute identification scheme.
Disclosure of Invention
The invention aims to provide a pedestrian clothing attribute identification method based on depth attitude estimation and multi-feature fusion, which overcomes the defects in the prior art and solves the problems of inaccurate label identification and pixel analysis area deviation in a single analysis mode.
In order to achieve the purpose, the technical scheme of the invention is as follows: a pedestrian clothing attribute identification method based on depth attitude estimation and multi-feature fusion comprises the following steps:
s1, preprocessing an input image in a monitoring scene in an image denoising and image enhancement mode to improve the image quality;
step S2, performing attitude estimation based on a Deep Convolutional Neural Network (DCNN) on the preprocessed input image, defining foreground and background areas in the image, and taking attitude characteristics as one part of fusion characteristics;
s3, extracting fusion features from the foreground region of the image processed in the S2, and performing feature dimensionality reduction through PCA;
step S4, performing a style search on the input image using a common data set, the search including: labeling similar image samples and garment labels of the image samples;
and step S5, inputting the obtained fusion characteristics of different forms into the designed pedestrian clothing attribute identification frame to obtain the final clothing attribute identification result.
In an embodiment of the present invention, the step S1 is specifically implemented as follows:
step S11, solving the interference caused by motion blur in a monitoring scene through an image denoising method based on blind deconvolution, and obtaining a restored image:
Assume an initial restored image f_0 and a degradation function g_0, and assume the blur function is the same over all parts of the image; the input image before blur interference is obtained through the following iterative formula:
where g_i^k denotes the degradation function at the k-th iteration of the i-th round, f_i^k denotes the restored image at the i-th round, c(x) is the degraded image, i.e. the original input image, and ⊗ denotes the convolution operation;
step S12, image enhancement is carried out through a multi-scale Retinex algorithm with color recovery, and the color expression of the image under the monitoring scene is enhanced:
Defining the incident light as L(x, y), the reflectance image of the object as R(x, y), and the observed image as S(x, y), there is: S(x, y) = L(x, y) · R(x, y). Mapping all three components to the log domain gives the corresponding results
log S(x, y), log L(x, y), log R(x, y)
The above formula can then be converted into:
log S(x, y) = log L(x, y) + log R(x, y)
Introducing the color channel weight ω_i in the multi-scale case, there is:
where K denotes the number of center-surround functions F;
Define C_i as the color recovery factor of one of the three channels, used to balance the ratio between the three channels and to highlight relatively dark regions so as to eliminate distortion; the MSRCR model can therefore be expressed as:
and the result is finally mapped from the log domain back to the real-number domain to obtain the final enhancement result.
In an embodiment of the present invention, in the step S2, a specific manner of performing the pose estimation based on the deep convolutional neural network DCNN on the preprocessed input image is as follows:
Step S21: construct an image model G = (V, E) to visually represent a human body model, where V represents the joint points of the human body or certain body parts, E is the set of edges connecting nodes, e ∈ V × V represents the spatial relationship between adjacent nodes, and K = |V| denotes the number of joints. Let I denote an image, i denote the i-th node in the image, l denote the pixel coordinate of a node, and t denote the mixed spatial relationship of the nodes, abstracted and combined by clustering over different pose instances. Then, according to the definitions in the image model, i ∈ {1, ..., K}, l_i ∈ {1, ..., L}, t_i ∈ {1, ..., T}, where l_i = (x_i, y_i) denotes the pixel coordinate of node i, and t_i denotes the spatial-relationship type of node i with its adjacent nodes, i.e. one of T pose types at that node;
Step S22: from step S21, the appearance model of a human joint part may be represented as:
with
φ(l_i, t_i | I; θ) = log p(l_i, t_i | I; θ)
where p(l_i, t_i | I; θ) is the probability obtained by mapping, through the forward-propagation Softmax function in the DCNN, the finally computed score that node part i in image I has mixed pose type t_i and pixel coordinate l_i, and θ is a parameter of the model;
the inter-joint spatial relationship may be expressed as:
adding a standard quadratic deformation to the spatial-relationship model, with the definition ⟨d(l_i − l_j)⟩ = [dx dx² dy dy²]ᵀ, where dx = x_i − x_j and dy = y_i − y_j denote the relative pixel location of node i with respect to node j;
finally we obtain
the above formula, which represents the human body pose estimation model and whose tree structure allows efficient accelerated computation;
s23, carrying out clustering operation on the local image blocks obtained by preprocessing according to the spatial relative position of the central joint and the adjacent joints by a K-means clustering method to obtain a pre-training model;
Step S24: train with the DCNN, mapping images to different pose types through the score-function model; according to the annotation information, adjust the weights and parameters through a loss function so that the mapped score results agree with the actual categories, completing the classification of pose types and obtaining the DCNN multi-classification model.
In an embodiment of the present invention, the step S3 is specifically implemented as follows:
s31, extracting simple features including color, gradient and texture from the input image, and fusing the simple features into different fusion complex features according to different attribute identification stages;
step S32, obtaining m pieces of n-dimensional data after passing through a feature descriptor, and forming a matrix X with m rows and n columns from the original data according to columns;
step S33, subtracting the average value of each line of X;
step S34, solving a covariance matrix;
step S35, solving the eigenvalue of the covariance matrix and the corresponding eigenvector r;
step S36, arranging the eigenvectors r into a matrix from top to bottom according to the corresponding eigenvalue size, and taking the first n' rows to form a new matrix P;
in step S37, the matrix P is the data after dimension reduction to n'.
In an embodiment of the present invention, the step S4 is specifically implemented as follows:
step S41, firstly, establishing a KD-tree index tree on a public data set;
s42, selecting fusion characteristics as sample characteristics, comparing the appearance fusion characteristics of the input image and the samples in the data set, and searching a KD-tree by a KNN clustering method in combination with L2-distance;
and step S43, selecting the first 25 results to form a nearest neighbor sample set, and forming candidate labels by the labeled clothing label information of the samples, so as to provide help for the subsequent clothing attribute identification.
In an embodiment of the present invention, the step S5 is specifically implemented as follows:
Step S51: define a pixel of an image in the sample as i, the predicted clothing label of the pixel as l_i, and the complex feature of the pixel as f_i. Define the nearest-neighbor sample set obtained through nearest-neighbor retrieval as D, the set of annotated labels in the sample set as τ(D), and let t denote a clothing category label. Each parsing mode is assigned a mixing parameter, Λ ≡ [λ_1, λ_2, λ_3], for the final confidence combination;
Step S52: fuse the results of global parsing based on logistic regression, approximate parsing based on the nearest-neighbor samples, and transferred parsing based on mask transformation, to improve recognition accuracy:
The functional model of the global-parsing pixel-label confidence is as follows:
P denotes the logistic-regression result given the complex feature f_i and the model parameter θ_t^g, i.e. the probability that label t is present in this sample; 1[·] is an indicator function indicating that label t is a member of the label set of the nearest-neighbor samples; the model parameter θ_t^g is trained using the Fashionista dataset as positive samples;
the functional model for approximate recognition pixel-tag confidence is as follows:
model parametersTraining by using a nearest neighbor sample set D as a training set;
the function model of migration-resolved pixel-label confidence based on mask transformation is:
where j denotes a pixel in a nearest-neighbor sample superpixel block, the parameter θ_t^g is the model parameter from global parsing, and M(l_i, s_i, d) denotes the mean of the logistic-regression results obtained by globally parsing the nearest-neighbor sample superpixel-block region;
Step S53: a single confidence is not enough to guarantee the accuracy of the label-assignment result; therefore a fusion of the three parsing modes is considered, with Λ ≡ [λ_1, λ_2, λ_3] defining the respective weight ratios of the three modes, and the confidence of each clothing label-pixel pair is computed through the fusion model:
Step S54: combine with the iterative smoothing procedure. Define the label assignment of all pixels as L ≡ {l_i}, i.e. the clothing-label assignment of each pixel in the image, and the clothing appearance model set as Θ ≡ {θ_t^c}, where θ_t^c is the fused appearance model of clothing-category label t; the final optimization is to find the optimal pixel-label assignment L* and appearance model set Θ*. At the start of the iterative process, the initial pixel-label assignment L^0 denotes the pixel-label result obtained by MAP assignment using the confidences from the first pass of fusion parsing, and the initial set of fused clothing-category appearance models Θ^0, a set of logistic-regression models over the clothing categories, is trained using L^0 as training data. Let the constant k denote the number of iterations performed and E denote the set of adjacent pixel pairs; at the k-th iteration the pixel-label assignment is L^k and the fused appearance model set is Θ^k. On this basis, the model for optimizing the pixel-label assignment is as follows:
whereby the final identification result is obtained.
Compared with the prior art, the invention has the following beneficial effects: the method further defines the foreground region and the background region of the pedestrians in the image by combining the human body posture estimation based on the deep learning, and eliminates the interference of background factors. Meanwhile, the image quality in the monitoring scene is improved by image denoising and image enhancement methods. Moreover, a plurality of simple features are fused to form complex description features, the expressive force of the clothing attributes is strengthened, the result of a single recognition mode is fused, and the recognition accuracy is improved through iterative smoothing processing. The method integrates the depth human body posture estimation result and multiple fusion characteristics, and can accurately identify the clothes attribute of the pedestrian. The method is simple, flexible to implement and high in practicability.
Drawings
FIG. 1 is a flow chart of a pedestrian clothing attribute identification method based on depth attitude estimation and multi-feature fusion.
Detailed Description
The technical scheme of the invention is specifically explained below with reference to the accompanying drawings.
As shown in FIG. 1, the invention provides a pedestrian clothing attribute identification method based on depth attitude estimation and multi-feature fusion. Aiming at the problems that the existing attribute identification method has environmental factor interference and further influences positioning accuracy and the like, the pedestrian attribute identification method based on pedestrian attitude estimation and multi-feature fusion is provided. The method comprises the steps of firstly, selecting a part of retrieval results for subsequent attribute identification through appearance feature matching. And then, a foreground region belonging to the pedestrian in the image can be effectively positioned by a depth human body posture estimation method based on the SSD, and the interference of background factors is well eliminated. And finally, combining the analysis results in various modes, combining an iterative smoothing process, and adopting a mode of maximum posterior probability distribution to strengthen the correlation between the attribute labels and the pixels to obtain a final attribute analysis recognition result. The invention solves the problems of inaccurate label identification, pixel analysis area deviation and the like in a single analysis mode. The method comprises the following specific steps:
and step S1, improving the image quality of the input image in the monitoring scene through a classical image denoising and image enhancement mode.
Step S2: perform pose estimation based on a deep convolutional neural network (DCNN) on the input image to delimit the foreground and background regions in the image; the pose features are also used as one part of the fusion features, and the subsequent recognition work is carried out based on the foreground region of the image.
And step S3, extracting multiple simple features such as color, gradient, texture and the like from the input image, fusing the multiple simple features into complex features, and performing feature dimensionality reduction through PCA.
Step S4: perform style retrieval on the input image using a public dataset, the retrieved results being used to assist subsequent attribute identification. The results comprise (a) similar image samples and (b) the clothing labels of those image samples.
And step S5, inputting the obtained fusion characteristics of different forms into the designed pedestrian clothing attribute identification frame to obtain the final clothing attribute identification result.
Further, in the present embodiment, in the step S1, the image quality is improved by:
and step S11, solving the interference caused by motion blur in the monitored scene to a certain extent through an image denoising method based on blind deconvolution, and obtaining a restored image. Assume an initial restored image f0And a degradation function g0And the fuzzy functions of all parts of the image are the same, and the input image before fuzzy interference is obtained through the following iterative formula:
whereinRepresenting the degradation function at the kth iteration of the ith round, fi kRepresenting the restored image at the ith iteration, c (x) is the degraded image, i.e. the original input image,is a convolution operation.
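For illustration, the following is a minimal numpy/scipy sketch of an alternating (blind) Richardson-Lucy-style deconvolution of the kind the iterative formula above describes; the flat initial kernel, its 9×9 support, and the iteration counts are assumptions for the sketch and are not values taken from the patent.

```python
import numpy as np
from scipy.signal import fftconvolve

def blind_deconvolution(c, n_rounds=10, n_inner=5, eps=1e-7):
    """Alternately refine the degradation function g and the restored image f.

    c : observed (blurred) grayscale image, float array scaled to [0, 1].
    Returns the restored image f (the estimate of the input before blur).
    """
    f = np.full_like(c, 0.5)                      # initial restored image f0
    g = np.zeros_like(c)                          # initial degradation function g0
    h, w = c.shape
    g[h // 2 - 4 : h // 2 + 5, w // 2 - 4 : w // 2 + 5] = 1.0 / 81.0  # flat 9x9 start
    for _ in range(n_rounds):                     # outer rounds (index i)
        for _ in range(n_inner):                  # update the kernel g with f fixed
            ratio = c / (fftconvolve(f, g, mode="same") + eps)
            g = g * fftconvolve(ratio, f[::-1, ::-1], mode="same")
            g = g / (g.sum() + eps)               # keep the kernel normalized
        for _ in range(n_inner):                  # update the image f with g fixed
            ratio = c / (fftconvolve(f, g, mode="same") + eps)
            f = f * fftconvolve(ratio, g[::-1, ::-1], mode="same")
    return np.clip(f, 0.0, 1.0)
```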
Step S12: perform image enhancement using the multi-scale Retinex with color restoration (MSRCR) algorithm to enhance the color representation of the image in the surveillance scene. Define the incident light as L(x, y), the reflectance image of the object as R(x, y), and the observed image as S(x, y); then S(x, y) = L(x, y) · R(x, y). Mapping all three components into the log domain gives the corresponding results log S(x, y), log L(x, y), log R(x, y), and the above equation can be converted into: log S(x, y) = log L(x, y) + log R(x, y). Introducing the color channel weight ω_i in the multi-scale case, there is:
k represents the number of the center surround functions F and takes a value of 3.
Define C_i as the color recovery factor of one of the three channels, used to balance the ratio between the three channels and to highlight relatively dark regions so as to eliminate distortion; the MSRCR model can therefore be expressed as log R_MSRCRi(x, y) = C_i(x, y) · log R_MSRi(x, y), and the result is finally mapped from the log domain back to the real-number domain to obtain the final enhancement result.
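A minimal sketch of such an MSRCR pass is given below; the Gaussian surround scales (15, 80, 250) and the color-restoration constants alpha and beta are common default choices assumed for illustration, not values specified by the patent.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def msrcr(img, sigmas=(15, 80, 250), alpha=125.0, beta=46.0, eps=1.0):
    """Multi-scale Retinex with color restoration on an H x W x 3 uint8 image."""
    s = img.astype(np.float64) + eps                      # S(x, y); offset avoids log(0)
    weight = 1.0 / len(sigmas)                            # equal weights omega_i, K = 3 scales
    msr = np.zeros_like(s)
    for sigma in sigmas:                                  # sum_i omega_i (log S - log(F_i * S))
        blurred = np.stack([gaussian_filter(s[..., ch], sigma) for ch in range(3)], axis=-1)
        msr += weight * (np.log(s) - np.log(blurred + eps))
    # color recovery factor C_i balances the three channels
    c = beta * (np.log(alpha * s) - np.log(s.sum(axis=-1, keepdims=True)))
    out = c * msr                                         # log R_MSRCRi = C_i * log R_MSRi
    # map from the log domain back to a displayable real-number range
    out = (out - out.min()) / (out.max() - out.min() + 1e-12) * 255.0
    return out.astype(np.uint8)
```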
Further, in the present embodiment, in the step S2, the DCNN-based human body pose estimation is performed through the following steps:
step S21, building an image model G ═ (V, E) to visually represent a human body model, where V represents a joint point of the human body or a certain part of the body, E is an edge connecting between nodes, E ∈ V × V represents a spatial relationship between adjacent nodes, and K ═ V | represents the number of joints. Defining I to represent an image, I to represent the ith node in the image, l to represent the pixel coordinate of the node, and t to represent the mixed spatial relationship of the nodes (cluster extraction is carried out by different pose instances)Like union), then according to the definition in the image model, there is i e {1i∈{1,...,L},ti∈{1,...,T},liPixel coordinate { (x) representing node ii,yi)},tiRepresenting a set of spatial relationship types of node i with its neighbors (i.e., T pose types at that node).
Step S22, the appearance model of the human joint part can be expressed as:and has phi (l)i,ti|I;θ)=logp(li,tiI; θ) where p (l)i,tiI; theta) is a probability domain obtained by mapping a score result finally calculated by forward propagation Softmax function in DCNN, and the mixed posture type of the node part I in the image I is predicted to be tiAnd the pixel coordinate is at liIs a parameter of the model. The inter-joint space can be expressed as: adding standard quadratic variation to the spatial relationship model, with definition < d (l)i-lj)>=[dx dx2dy dy2]TAnd dx ═ xi-xj、dy=yi-yjIndicating the relative pixel location of node i with respect to node j.
Finally obtaining
The above formula represents a human body posture estimation model, and represents high efficiency in accelerated computation in a tree structure.
And step S23, carrying out clustering operation on the local image blocks obtained by preprocessing according to the spatial relative position of the central joint and the adjacent joints by a K-means clustering method to obtain a pre-training model, and providing help for subsequent training.
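By way of illustration, the following is a minimal sketch of the clustering in step S23; the joint count, the edge list, and the number of pose types T per joint are placeholders assumed for the sketch, not values fixed by the patent.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_pose_types(joint_coords, edges, n_types=6):
    """Cluster each joint's offsets to its neighboring joint to derive T pose types.

    joint_coords : (N, K, 2) array of (x, y) joint annotations over N training poses
    edges        : list of (i, j) index pairs, the tree edges E of the body model
    n_types      : assumed number of pose types T per joint
    Returns a dict mapping joint index i to a fitted KMeans model over offsets (dx, dy).
    """
    type_models = {}
    for i, j in edges:
        offsets = joint_coords[:, i, :] - joint_coords[:, j, :]  # relative position of i w.r.t. j
        type_models[i] = KMeans(n_clusters=n_types, n_init=10).fit(offsets)
    return type_models

# usage sketch with toy annotations for a hypothetical 14-joint chain model
poses = np.random.rand(200, 14, 2) * 100
edges = [(i, i - 1) for i in range(1, 14)]
models = cluster_pose_types(poses, edges)
```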
Step S24: train with the DCNN, mapping images to different pose types through the score-function model; according to the annotation information, adjust the weights and parameters through a loss function so that the mapped score results agree with the actual categories, completing the classification of pose types and obtaining the DCNN multi-classification model.
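A compact PyTorch sketch of the classification step S24 follows; the network depth, the patch size, the 14 parts, the 6 types per part, and the background class are assumptions made only for this sketch, since the patent does not fix the DCNN architecture.

```python
import torch
import torch.nn as nn

class PartTypeDCNN(nn.Module):
    """Map a local image patch to scores over (joint part, pose type) classes."""
    def __init__(self, n_parts=14, n_types=6):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4))
        self.classifier = nn.Linear(64 * 16, n_parts * n_types + 1)  # +1 for background

    def forward(self, x):
        return self.classifier(self.features(x).flatten(1))  # softmax is applied in the loss

model = PartTypeDCNN()
criterion = nn.CrossEntropyLoss()     # aligns the mapped scores with the annotated class
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
patches = torch.randn(8, 3, 36, 36)               # toy local patches from step S23
labels = torch.randint(0, 14 * 6 + 1, (8,))       # annotated (part, type) class per patch
loss = criterion(model(patches), labels)
optimizer.zero_grad(); loss.backward(); optimizer.step()
```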
Further, in this embodiment, in step S3, the image fusion features are extracted and feature dimensionality reduction is performed to reduce the computational complexity by:
and step S31, extracting simple features such as color, gradient, texture and the like from the input image, and fusing the simple features into different fusion complex features according to different attribute identification stages.
Step S32, obtaining m pieces of 39168-dimensional data after passing through a feature descriptor, and forming a matrix X with m rows and 39168 columns by the original data according to columns;
Step S33, subtracting from each line of X (each representing an attribute field) the average value of that line;
step S34, solving a covariance matrix;
step S35, solving the eigenvalue of the covariance matrix and the corresponding eigenvector r;
Step S36, arranging the eigenvectors into a matrix from top to bottom according to the size of the corresponding eigenvalues, and taking the first 441 rows to form a new matrix P;
in step S37, the matrix P is the data after dimension reduction to 441 dimensions.
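The PCA reduction of steps S32-S37 can be sketched as follows; note that forming a full 39168 × 39168 covariance matrix is memory-heavy, so in practice a truncated SVD would normally stand in for the eigen-decomposition — the literal version is shown only to mirror the steps described above.

```python
import numpy as np

def pca_reduce(X, n_keep=441):
    """Reduce an m x n fused-feature matrix X (n = 39168 here) to n' = 441 dimensions.

    Returns the projected data (m x n') and the projection matrix P (n' x n)."""
    Xc = X - X.mean(axis=0)                    # step S33: remove the mean of each attribute field
    cov = np.cov(Xc, rowvar=False)             # step S34: covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)     # step S35: eigenvalues and eigenvectors r
    order = np.argsort(eigvals)[::-1]          # step S36: sort by decreasing eigenvalue
    P = eigvecs[:, order[:n_keep]].T           # first n' eigenvectors as rows of P
    return Xc @ P.T, P                         # step S37: data reduced to n' dimensions
```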
Further, in the present embodiment, in the step S4, the style search and the acquisition of the candidate tag are performed to assist the subsequent clothing attribute identification by:
step S41 is to first build a KD index tree on a common data set.
Step S42: select the fused features as the sample features, compare the appearance fused features of the input image with those of the samples in the dataset, and search the KD-tree using a KNN (K-nearest neighbors) method combined with the L2 distance.
And step S43, selecting the first 25 results to form a nearest neighbor sample set, and forming candidate labels by the labeled clothing label information of the samples, so as to provide help for the subsequent clothing attribute identification.
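A minimal sketch of steps S41-S43 with scipy's KD-tree is shown below; the feature dimensionality and the per-sample label format of the public dataset are placeholders assumed for illustration.

```python
import numpy as np
from scipy.spatial import cKDTree

def style_retrieval(query_feat, dataset_feats, dataset_labels, k=25):
    """Build a KD index tree over the dataset's appearance features and return the
    25 nearest samples (L2 distance) plus the union of their clothing labels tau(D)."""
    tree = cKDTree(dataset_feats)              # step S41: KD index tree on the public dataset
    _, idx = tree.query(query_feat, k=k)       # step S42: KNN search with L2 distance
    neighbor_set = list(idx)                   # step S43: nearest-neighbor sample set D
    candidate_labels = set()
    for i in neighbor_set:
        candidate_labels.update(dataset_labels[i])
    return neighbor_set, candidate_labels

# usage sketch: 441-dim features after PCA, toy label lists per sample
feats = np.random.rand(1000, 441)
labels = [["skirt"] if n % 2 else ["jeans", "jacket"] for n in range(1000)]
neighbors, candidates = style_retrieval(np.random.rand(441), feats, labels)
```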
Further, in this embodiment, in step S5, the results obtained by fusing the depth human body pose estimation exclude the interference of background factors as much as possible, and the results of multiple recognition modes are fused, and the iterative smoothing process is combined to enhance the recognition accuracy:
step S51, defining the pixel in the sample as i, and the predicted clothing label of the pixel is liThe complex characteristic of the pixel is fi. A nearest neighbor sample set obtained by nearest neighbor search is defined as D, a labeled label set in the sample set is defined as τ (D), and t represents a label of the clothing category. Each analysis is defined with a mixing parameter of lambda ≡ [ lambda ] respectively123]Thereby making the final confidence combination.
And step S52, fusing global identification based on logistic regression, approximate identification based on nearest neighbor samples and identification results of migration identification based on mask conversion, and improving identification accuracy. The functional model of global recognition pixel-tag confidence is as follows:
Cglobal(lifi,D)≡P(li=tfit g)·1[t∈τ(D)]p denotes a given complex feature fiAnd a model parameter thetat gRepresents the probability value of the existence of a certain label t in this sample, 1[ ·]Is an indicator function, which indicates that the label t is a member of the label set of nearest neighbor samples, and the model parameter θt gTraining was performed using the fashionistadataset as a positive sample. The functional model for approximate recognition pixel-tag confidence is as follows:p denotes a given complex feature fiAnd model parametersResults of logistic regression, model parametersThe nearest neighbor sample set D is used as a training set for training. The function model of migration-resolved pixel-label confidence based on mask transformation is:
where j denotes the pixel present in the nearest neighbor sample superpixel block, the parameter θt gIs a model parameter in global parsing because M (l)i,siAnd d) represents the average value of the logistic regression results obtained by performing global analysis on the nearest neighbor sample superpixel block region.
Step S53: a single confidence is not enough to guarantee the accuracy of the label-assignment result; therefore a fusion of the three parsing modes is considered, with Λ ≡ [λ_1, λ_2, λ_3] defining the respective weight ratios, and the confidence of each clothing label-pixel pair is computed through the fusion model:
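A sketch of this weighted combination and of the initial MAP assignment is shown below; the particular weight values are placeholders, not values taken from the patent.

```python
import numpy as np

def fused_confidence(c_global, c_nearest, c_transfer, lam=(0.5, 0.3, 0.2)):
    """Combine the three (n_pixels x n_labels) confidence maps with weights
    Lambda = [lambda_1, lambda_2, lambda_3]."""
    l1, l2, l3 = lam
    return l1 * c_global + l2 * c_nearest + l3 * c_transfer

def map_assignment(conf):
    """Initial pixel-label assignment: maximum a posteriori label per pixel."""
    return conf.argmax(axis=1)

# usage sketch with toy confidences for 5 pixels and 4 clothing labels
rng = np.random.default_rng(0)
conf = fused_confidence(rng.random((5, 4)), rng.random((5, 4)), rng.random((5, 4)))
print(map_assignment(conf))
```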
and step S54, combining with the iterative smoothing processing procedure. Define the label assignment of all pixels as L ≡ { L ≡ LiThe (i.e. the case of labeling the clothing label for each pixel in the image) and the clothing item appearance modelWhereinIs a fused appearance model of the labels t of the clothing categories, the final optimization result is to find the optimal pixel-label distribution L*And appearance model set theta*. At the beginning of the iterative process, an initial pixel-label assignment is defined asRepresenting the pixel-label result obtained by MAP allocation in combination with the confidence of the first pass through fusion analysis, the fusion appearance model set of the initial clothing category label is(set of clothing category logistic regression models), useTrained as training data. The process of defining the iteration is represented by a constant k for the number of iterations performed and E for the adjacent pixel pair, then at the kth iteration its pixel-label assignment isFusion appearance modelFor this, the model for optimizing the pixel-label distribution case is as follows:
wherein:and obtaining a final identification result.
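For illustration, the alternation between refitting the appearance models and MAP re-assignment with a neighborhood smoothness term might be sketched as below; the ICM-style update, the Potts-like smoothness weight, and the iteration count are assumptions of this sketch and do not reproduce the patent's exact energy or optimizer.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def iterative_smoothing(features, fused_conf, neighbors, n_iters=5, smooth_w=0.5):
    """Alternate (a) fitting appearance models on the current assignment and
    (b) MAP re-assignment using confidence, appearance, and neighbor agreement.

    features   : (n_pixels, d) complex features f_i
    fused_conf : (n_pixels, n_labels) confidences from step S53
    neighbors  : list of integer index arrays, one per pixel (the adjacent pairs E)
    """
    labels = fused_conf.argmax(axis=1)                 # initial assignment L^0
    n_labels = fused_conf.shape[1]
    for _ in range(n_iters):
        clf = LogisticRegression(max_iter=200).fit(features, labels)   # appearance models
        probs = np.zeros_like(fused_conf)
        probs[:, clf.classes_] = clf.predict_proba(features)
        for i in range(len(labels)):                   # ICM-style sweep over pixels
            votes = np.bincount(labels[neighbors[i]], minlength=n_labels)
            score = (np.log(fused_conf[i] + 1e-8)
                     + np.log(probs[i] + 1e-8)
                     + smooth_w * votes)
            labels[i] = int(score.argmax())
    return labels
```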
The above are preferred embodiments of the present invention, and all changes made according to the technical scheme of the present invention that produce functional effects do not exceed the scope of the technical scheme of the present invention belong to the protection scope of the present invention.

Claims (6)

1. A pedestrian clothing attribute identification method based on depth attitude estimation and multi-feature fusion is characterized by comprising the following steps:
s1, preprocessing an input image in a monitoring scene in an image denoising and image enhancement mode to improve the image quality;
step S2, performing attitude estimation based on a Deep Convolutional Neural Network (DCNN) on the preprocessed input image, defining foreground and background areas in the image, and taking attitude characteristics as one part of fusion characteristics;
s3, extracting fusion features from the foreground region of the image processed in the S2, and performing feature dimensionality reduction through PCA;
step S4, performing a style search on the input image using a common data set, the search including: labeling similar image samples and garment labels of the image samples;
and step S5, inputting the obtained fusion characteristics of different forms into the designed pedestrian clothing attribute identification frame to obtain the final clothing attribute identification result.
2. The pedestrian clothing attribute identification method based on depth attitude estimation and multi-feature fusion as claimed in claim 1, wherein the step S1 is implemented as follows:
step S11, solving the interference caused by motion blur in a monitoring scene through an image denoising method based on blind deconvolution, and obtaining a restored image:
assume an initial restored image f_0 and a degradation function g_0, and assume the blur function is the same over all parts of the image; the input image before blur interference is obtained through the following iterative formula:
where g_i^k denotes the degradation function at the k-th iteration of the i-th round, f_i^k denotes the restored image at the i-th round, c(x) is the degraded image, i.e. the original input image, and ⊗ denotes the convolution operation;
step S12, image enhancement is carried out through a multi-scale Retinex algorithm with color recovery, and the color expression of the image under the monitoring scene is enhanced:
defining the incident light as L(x, y), the reflectance image of the object as R(x, y), and the observed image as S(x, y), there is: S(x, y) = L(x, y) · R(x, y); mapping all three components to the log domain gives the corresponding results
log S(x,y),log L(x,y),log R(x,y)
The above formula can then be converted into:
log S(x,y)=log L(x,y)+log R(x,y)
introducing the color channel weight ω_i in the multi-scale case, there is:
where K denotes the number of center-surround functions F;
defining C_i as the color recovery factor of one of the three channels, used to balance the ratio between the three channels and to highlight relatively dark regions so as to eliminate distortion, the MSRCR model can then be expressed as:
log R_MSRCRi(x, y) = C_i(x, y) · log R_MSRi(x, y)
and the result is finally mapped from the log domain back to the real-number domain to obtain the final enhancement result.
3. The pedestrian clothing attribute recognition method based on depth pose estimation and multi-feature fusion of claim 1, wherein in the step S2, the pose estimation based on the depth convolution neural network DCNN is performed on the preprocessed input image in the following specific manner:
step S21, constructing an image model G = (V, E) to visually represent a human body model, where V represents the joint points of the human body or certain body parts, E is the set of edges connecting nodes, e ∈ V × V represents the spatial relationship between adjacent nodes, and K = |V| denotes the number of joints; defining I to represent an image, i to represent the i-th node in the image, l to represent the pixel coordinate of a node, and t to represent the mixed spatial relationship of the nodes, abstracted and combined by clustering over different pose instances; then, according to the definitions in the image model, i ∈ {1, ..., K}, l_i ∈ {1, ..., L}, t_i ∈ {1, ..., T}, where l_i = (x_i, y_i) denotes the pixel coordinate of node i, and t_i denotes the spatial-relationship type of node i with its adjacent nodes, namely one of T pose types at that node;
step S22, from step S21, the appearance model of a human joint part may be represented as:
with
φ(l_i, t_i | I; θ) = log p(l_i, t_i | I; θ)
where p(l_i, t_i | I; θ) is the probability obtained by mapping, through the forward-propagation Softmax function in the DCNN, the finally computed score that node part i in image I has mixed pose type t_i and pixel coordinate l_i, and θ is a parameter of the model;
the inter-joint spatial relationship may be expressed as:
adding a standard quadratic deformation to the spatial-relationship model, with the definition ⟨d(l_i − l_j)⟩ = [dx dx² dy dy²]ᵀ, where dx = x_i − x_j and dy = y_i − y_j denote the relative pixel location of node i with respect to node j;
finally we obtain
the above formula, which represents the human body pose estimation model and whose tree structure allows efficient accelerated computation;
s23, carrying out clustering operation on the local image blocks obtained by preprocessing according to the spatial relative position of the central joint and the adjacent joints by a K-means clustering method to obtain a pre-training model;
and step S24, training with the DCNN, mapping images to different pose types through the score-function model, adjusting the weights and parameters through a loss function according to the annotation information so that the mapped score results agree with the actual categories, completing the classification of pose types, and obtaining the DCNN multi-classification model.
4. The pedestrian clothing attribute identification method based on depth attitude estimation and multi-feature fusion as claimed in claim 1, wherein the step S3 is implemented as follows:
s31, extracting simple features including color, gradient and texture from the input image, and fusing the simple features into different fusion complex features according to different attribute identification stages;
step S32, obtaining m pieces of n-dimensional data after passing through a feature descriptor, and forming a matrix X with m rows and n columns from the original data according to columns;
step S33, subtracting the average value of each line of X;
step S34, solving a covariance matrix;
step S35, solving the eigenvalue of the covariance matrix and the corresponding eigenvector r;
step S36, arranging the eigenvectors r into a matrix from top to bottom according to the corresponding eigenvalue size, and taking the first n' rows to form a new matrix P;
in step S37, the matrix P is the data after dimension reduction to n'.
5. The pedestrian clothing attribute identification method based on depth attitude estimation and multi-feature fusion as claimed in claim 1, wherein the step S4 is implemented as follows:
step S41, firstly, establishing a KD-tree index tree on a public data set;
s42, selecting fusion characteristics as sample characteristics, comparing the appearance fusion characteristics of the input image and the samples in the data set, and searching a KD-tree by a KNN clustering method in combination with L2-distance;
and step S43, selecting the first 25 results to form a nearest neighbor sample set, and forming candidate labels by the labeled clothing label information of the samples, so as to provide help for the subsequent clothing attribute identification.
6. The pedestrian clothing attribute identification method based on depth attitude estimation and multi-feature fusion as claimed in claim 5, wherein the step S5 is implemented as follows:
step S51, defining a pixel of an image in the sample as i, the predicted clothing label of the pixel as l_i, and the complex feature of the pixel as f_i; defining the nearest-neighbor sample set obtained through nearest-neighbor retrieval as D, the set of annotated labels in the sample set as τ(D), and letting t denote a clothing category label; each parsing mode is assigned a mixing parameter, Λ ≡ [λ_1, λ_2, λ_3], for the final confidence combination;
step S52, fusing the results of global parsing based on logistic regression, approximate parsing based on the nearest-neighbor samples, and transferred parsing based on mask transformation, to improve recognition accuracy:
the functional model of the global-parsing pixel-label confidence is as follows:
P denotes the logistic-regression result given the complex feature f_i and the model parameters, i.e. the probability that label t is present in this sample; 1[·] is an indicator function indicating that label t is a member of the label set of the nearest-neighbor samples; the model parameters are trained using the Fashionista dataset as positive samples;
the functional model for approximate recognition pixel-tag confidence is as follows:
the model parameters are trained using the nearest-neighbor sample set D as the training set;
the function model of migration-resolved pixel-label confidence based on mask transformation is:
where j denotes a pixel in a nearest-neighbor sample superpixel block, the parameter is the model parameter from global parsing, and M(l_i, s_i, d) denotes the mean of the logistic-regression results obtained by globally parsing the nearest-neighbor sample superpixel-block region;
step S53, a single confidence is not enough to guarantee the accuracy of the label-assignment result; therefore a fusion of the three parsing modes is considered, with Λ ≡ [λ_1, λ_2, λ_3] defining the respective weight ratios of the three modes, and the confidence of each clothing label-pixel pair is computed through the fusion model:
step S54, combining with the iterative smoothing procedure: defining the label assignment of all pixels as L ≡ {l_i}, i.e. the clothing-label assignment of each pixel in the image, and the clothing appearance model set as Θ ≡ {θ_t^c}, where θ_t^c is the fused appearance model of clothing-category label t; the final optimization is to find the optimal pixel-label assignment L* and appearance model set Θ*; at the start of the iterative process, the initial pixel-label assignment L^0 denotes the pixel-label result obtained by MAP assignment using the confidences from the first pass of fusion parsing, and the initial set of fused clothing-category appearance models Θ^0, a set of logistic-regression models over the clothing categories, is trained using L^0 as training data; the process of iteration is defined with a constant k for the number of iterations performed and E for the adjacent pixel pairs; at the k-th iteration the pixel-label assignment is L^k and the fused appearance model set is Θ^k, on which basis the model for optimizing the pixel-label assignment is:
whereby the final identification result is obtained.
CN201910321093.0A 2019-04-19 2019-04-19 Pedestrian clothing attribute identification method based on depth attitude estimation and multi-feature fusion Active CN110033007B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910321093.0A CN110033007B (en) 2019-04-19 2019-04-19 Pedestrian clothing attribute identification method based on depth attitude estimation and multi-feature fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910321093.0A CN110033007B (en) 2019-04-19 2019-04-19 Pedestrian clothing attribute identification method based on depth attitude estimation and multi-feature fusion

Publications (2)

Publication Number Publication Date
CN110033007A true CN110033007A (en) 2019-07-19
CN110033007B CN110033007B (en) 2022-08-09

Family

ID=67239540

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910321093.0A Active CN110033007B (en) 2019-04-19 2019-04-19 Pedestrian clothing attribute identification method based on depth attitude estimation and multi-feature fusion

Country Status (1)

Country Link
CN (1) CN110033007B (en)

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110555393A (en) * 2019-08-16 2019-12-10 北京慧辰资道资讯股份有限公司 method and device for analyzing pedestrian wearing characteristics from video data
CN111126153A (en) * 2019-11-25 2020-05-08 北京锐安科技有限公司 Safety monitoring method, system, server and storage medium based on deep learning
CN111142671A (en) * 2019-12-27 2020-05-12 江西服装学院 Intelligent garment with intelligent interaction system
CN111368637A (en) * 2020-02-10 2020-07-03 南京师范大学 Multi-mask convolution neural network-based object recognition method for transfer robot
CN113191443A (en) * 2021-05-14 2021-07-30 清华大学深圳国际研究生院 Clothing classification and attribute identification method based on feature enhancement
CN113420173A (en) * 2021-06-22 2021-09-21 桂林电子科技大学 Minority dress image retrieval method based on quadruple deep learning
WO2021233051A1 (en) * 2020-05-21 2021-11-25 华为技术有限公司 Interference prompting method and device
CN114117040A (en) * 2021-11-08 2022-03-01 重庆邮电大学 Text data multi-label classification method based on label specific features and relevance
WO2022047662A1 (en) * 2020-09-02 2022-03-10 Intel Corporation Method and system of neural network object recognition for warpable jerseys with multiple attributes
CN116030418A (en) * 2023-02-14 2023-04-28 北京建工集团有限责任公司 Automobile lifting line state monitoring system and method
CN116206369A (en) * 2023-04-26 2023-06-02 北京科技大学 WMSD risk real-time monitoring method and device based on data fusion and machine vision

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090190798A1 (en) * 2008-01-25 2009-07-30 Sungkyunkwan University Foundation For Corporate Collaboration System and method for real-time object recognition and pose estimation using in-situ monitoring
CN104134076A (en) * 2014-07-10 2014-11-05 杭州电子科技大学 SAR image target recognition method based on CS and SVM decision fusion
CN105678321A (en) * 2015-12-31 2016-06-15 北京工业大学 Human body posture estimation method based on fusion model
WO2016110005A1 (en) * 2015-01-07 2016-07-14 深圳市唯特视科技有限公司 Gray level and depth information based multi-layer fusion multi-modal face recognition device and method
CN108537136A (en) * 2018-03-19 2018-09-14 复旦大学 The pedestrian's recognition methods again generated based on posture normalized image
CN109035329A (en) * 2018-08-03 2018-12-18 厦门大学 Camera Attitude estimation optimization method based on depth characteristic

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090190798A1 (en) * 2008-01-25 2009-07-30 Sungkyunkwan University Foundation For Corporate Collaboration System and method for real-time object recognition and pose estimation using in-situ monitoring
CN104134076A (en) * 2014-07-10 2014-11-05 杭州电子科技大学 SAR image target recognition method based on CS and SVM decision fusion
WO2016110005A1 (en) * 2015-01-07 2016-07-14 深圳市唯特视科技有限公司 Gray level and depth information based multi-layer fusion multi-modal face recognition device and method
CN105678321A (en) * 2015-12-31 2016-06-15 北京工业大学 Human body posture estimation method based on fusion model
CN108537136A (en) * 2018-03-19 2018-09-14 复旦大学 The pedestrian's recognition methods again generated based on posture normalized image
CN109035329A (en) * 2018-08-03 2018-12-18 厦门大学 Camera Attitude estimation optimization method based on depth characteristic

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
DANGWEI LI ET AL.: "Pose Guided Deep Model for Pedestrian Attribute Recognition in Surveillance Scenarios", 《2018 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO (ICME)》 *
MELTEM DEMIRKUS ET AL.: "Hierarchical Spatio-Temporal Probabilistic Graphical Model with Multiple Feature Fusion for Binary Facial Attribute Classification in Real-World Face Videos", 《IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE 》 *
SHI Xiangbin et al.: "Action recognition method based on multi-feature fusion", Journal of Shenyang Aerospace University *
LEI Qing et al.: "New advances in human action recognition research in complex scenes", Computer Science *

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110555393A (en) * 2019-08-16 2019-12-10 北京慧辰资道资讯股份有限公司 method and device for analyzing pedestrian wearing characteristics from video data
CN111126153A (en) * 2019-11-25 2020-05-08 北京锐安科技有限公司 Safety monitoring method, system, server and storage medium based on deep learning
CN111142671A (en) * 2019-12-27 2020-05-12 江西服装学院 Intelligent garment with intelligent interaction system
CN111368637A (en) * 2020-02-10 2020-07-03 南京师范大学 Multi-mask convolution neural network-based object recognition method for transfer robot
CN111368637B (en) * 2020-02-10 2023-08-11 南京师范大学 Transfer robot target identification method based on multi-mask convolutional neural network
WO2021233051A1 (en) * 2020-05-21 2021-11-25 华为技术有限公司 Interference prompting method and device
WO2022047662A1 (en) * 2020-09-02 2022-03-10 Intel Corporation Method and system of neural network object recognition for warpable jerseys with multiple attributes
CN113191443B (en) * 2021-05-14 2023-06-13 清华大学深圳国际研究生院 Clothing classification and attribute identification method based on feature enhancement
CN113191443A (en) * 2021-05-14 2021-07-30 清华大学深圳国际研究生院 Clothing classification and attribute identification method based on feature enhancement
CN113420173A (en) * 2021-06-22 2021-09-21 桂林电子科技大学 Minority dress image retrieval method based on quadruple deep learning
CN114117040A (en) * 2021-11-08 2022-03-01 重庆邮电大学 Text data multi-label classification method based on label specific features and relevance
CN116030418A (en) * 2023-02-14 2023-04-28 北京建工集团有限责任公司 Automobile lifting line state monitoring system and method
CN116030418B (en) * 2023-02-14 2023-09-12 北京建工集团有限责任公司 Automobile lifting line state monitoring system and method
CN116206369A (en) * 2023-04-26 2023-06-02 北京科技大学 WMSD risk real-time monitoring method and device based on data fusion and machine vision
CN116206369B (en) * 2023-04-26 2023-06-27 北京科技大学 Human body posture data acquisition method and device based on data fusion and machine vision

Also Published As

Publication number Publication date
CN110033007B (en) 2022-08-09

Similar Documents

Publication Publication Date Title
CN110033007B (en) Pedestrian clothing attribute identification method based on depth attitude estimation and multi-feature fusion
Lian et al. Road extraction methods in high-resolution remote sensing images: A comprehensive review
CN109325952B (en) Fashionable garment image segmentation method based on deep learning
CN108520226B (en) Pedestrian re-identification method based on body decomposition and significance detection
CN112800903B (en) Dynamic expression recognition method and system based on space-time diagram convolutional neural network
Zhang et al. Deep hierarchical guidance and regularization learning for end-to-end depth estimation
CN110458077B (en) Vehicle color identification method and system
WO2019136591A1 (en) Salient object detection method and system for weak supervision-based spatio-temporal cascade neural network
CN108288051B (en) Pedestrian re-recognition model training method and device, electronic equipment and storage medium
CN108564012B (en) Pedestrian analysis method based on human body feature distribution
CN110796026A (en) Pedestrian re-identification method based on global feature stitching
CN106127197B (en) Image saliency target detection method and device based on saliency label sorting
CN110060273B (en) Remote sensing image landslide mapping method based on deep neural network
CN109509191A (en) A kind of saliency object detection method and system
CN108985298B (en) Human body clothing segmentation method based on semantic consistency
CN110298248A (en) A kind of multi-object tracking method and system based on semantic segmentation
CN111062928A (en) Method for identifying lesion in medical CT image
CN113920472A (en) Unsupervised target re-identification method and system based on attention mechanism
CN109840518B (en) Visual tracking method combining classification and domain adaptation
Hu et al. Hypergraph video pedestrian re-identification based on posture structure relationship and action constraints
Jia et al. Saliency detection via a unified generative and discriminative model
CN111091129A (en) Image salient region extraction method based on multi-color characteristic manifold sorting
Kong et al. Detection model based on improved faster-RCNN in apple orchard environment
Li et al. Arbitrary body segmentation in static images
Khan et al. Image segmentation via multi dimensional color transform and consensus based region merging

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant