CN114782510A - Depth estimation method and device for target object, storage medium and electronic equipment - Google Patents

Depth estimation method and device for target object, storage medium and electronic equipment

Info

Publication number
CN114782510A
CN114782510A
Authority
CN
China
Prior art keywords
depth
value
coordinate
target object
determining
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210467512.3A
Other languages
Chinese (zh)
Inventor
刘方原
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Horizon Information Technology Co Ltd
Original Assignee
Beijing Horizon Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Horizon Information Technology Co Ltd filed Critical Beijing Horizon Information Technology Co Ltd
Priority to CN202210467512.3A
Publication of CN114782510A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/50 Depth or shape recovery
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/10 Image acquisition modality
    • G06T 2207/10028 Range image; Depth image; 3D point clouds

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Image Processing (AREA)

Abstract

The embodiment of the disclosure discloses a depth estimation method and device for a target object, a storage medium and an electronic device, wherein the method comprises the following steps: performing feature extraction on an image which is acquired by a camera and comprises a target object to obtain a feature map; determining a first plane coordinate value of a target object in the image in an image coordinate system based on the feature map; based on the feature map and the first plane coordinate value, determining a plurality of first depth coordinate values of the target object under the local three-dimensional coordinate of the camera and a depth uncertainty corresponding to each first depth coordinate value; and determining a second depth coordinate value of the target object under the local three-dimensional coordinate based on the plurality of first depth coordinate values and the depth uncertainty corresponding to each first depth coordinate value.

Description

Depth estimation method and device for target object, storage medium and electronic equipment
Technical Field
The present disclosure relates to computer vision technologies, and in particular, to a method and an apparatus for estimating depth of a target object, a storage medium, and an electronic device.
Background
Depth estimation refers to obtaining, from an image, the distance from each spatial point in the scene to the camera; the map formed by this distance information is called a depth map. Monocular depth estimation estimates the distance of an object in an image relative to the camera from an RGB image acquired by a single camera, and the problem of estimating depth information based on a monocular camera is referred to as Monocular Depth Estimation (MDE). In the field of automotive assisted driving/automatic driving, depth estimation of vehicles is a very important and fundamental task, and a key step in scene reconstruction and understanding tasks.
Disclosure of Invention
The embodiment of the disclosure provides a depth estimation method and device for a target object, a storage medium and an electronic device.
According to an aspect of the embodiments of the present disclosure, there is provided a depth estimation method of a target object, including:
performing feature extraction on an image which is acquired by a camera and comprises a target object to obtain a feature map;
determining a first plane coordinate value of a target object in the image in an image coordinate system based on the feature map;
based on the feature map and the first plane coordinate value, determining a plurality of first depth coordinate values of the target object under the local three-dimensional coordinate of the camera and depth uncertainty corresponding to each first depth coordinate value;
and determining a second depth coordinate value of the target object under the local three-dimensional coordinate based on the plurality of first depth coordinate values and the depth uncertainty corresponding to each first depth coordinate value.
According to another aspect of the embodiments of the present disclosure, there is provided a depth estimation apparatus of a target object, including:
the characteristic extraction module is used for extracting the characteristics of the image which is acquired by the camera and comprises the target object to obtain a characteristic diagram;
the plane coordinate prediction module is used for determining a first plane coordinate value of a target object in the image in an image coordinate system based on the feature map obtained by the feature extraction module;
the object coordinate estimation module is used for determining a plurality of first depth coordinate values of the target object under the local three-dimensional coordinate of the camera and a depth uncertainty corresponding to each first depth coordinate value based on the feature map obtained by the feature extraction module and the first plane coordinate value determined by the plane coordinate prediction module;
and the target object estimation module is used for determining a second depth coordinate value of the target object under the local three-dimensional coordinate based on the plurality of first depth coordinate values determined by the object coordinate estimation module and the depth uncertainty corresponding to each first depth coordinate value.
According to still another aspect of the embodiments of the present disclosure, there is provided a computer-readable storage medium storing a computer program for executing the depth estimation method of a target object according to any of the embodiments.
According to still another aspect of the embodiments of the present disclosure, there is provided an electronic apparatus including:
a processor;
a memory for storing the processor-executable instructions;
the processor is configured to read the executable instructions from the memory and execute the instructions to implement the depth estimation method for the target object according to any of the above embodiments.
According to the depth estimation method and device of the target object, the storage medium and the electronic device, multiple first depth coordinate values are predicted first, each corresponding to a different depth estimation algorithm, and the second depth coordinate value is determined based on these first depth coordinate values and their uncertainties. Depth estimation combining multiple depth estimation algorithms is thus achieved, the over-reliance caused by a single depth estimation algorithm is avoided, and good robustness and anti-interference capability are obtained.
The technical solution of the present disclosure is further described in detail by the accompanying drawings and examples.
Drawings
The above and other objects, features and advantages of the present disclosure will become more apparent by describing in more detail embodiments of the present disclosure with reference to the attached drawings. The accompanying drawings are included to provide a further understanding of the embodiments of the disclosure and are incorporated in and constitute a part of this specification, illustrate embodiments of the disclosure and together with the description serve to explain the principles of the disclosure and not to limit the disclosure. In the drawings, like reference numbers generally indicate like parts or steps.
Fig. 1a is a network structure diagram of a depth estimation network model according to an exemplary embodiment of the present disclosure.
FIG. 1b is a block diagram of an alternative example of a predictive branch model in the depth estimation network model provided in FIG. 1 a.
Fig. 2 is a schematic flowchart of a depth estimation method for a target object according to an exemplary embodiment of the disclosure.
FIG. 3 is a flow chart illustrating the step 203 in the embodiment shown in FIG. 2 according to the present disclosure.
FIG. 4 is a schematic flow chart of step 2031 in the embodiment of FIG. 3 according to the present disclosure.
FIG. 5 is a flow chart illustrating step 204 in the embodiment of FIG. 2 according to the present disclosure.
Fig. 6 is a flowchart illustrating a depth estimation method for a target object according to another exemplary embodiment of the present disclosure.
Fig. 7 is a schematic structural diagram of a depth estimation device for a target object according to an exemplary embodiment of the present disclosure.
Fig. 8 is a schematic structural diagram of a depth estimation device for a target object according to another exemplary embodiment of the present disclosure.
Fig. 9 is a block diagram of an electronic device provided in an exemplary embodiment of the present disclosure.
Detailed Description
Hereinafter, example embodiments according to the present disclosure will be described in detail with reference to the accompanying drawings. It is to be understood that the described embodiments are merely a subset of the embodiments of the present disclosure and not all embodiments of the present disclosure, with the understanding that the present disclosure is not limited to the example embodiments described herein.
It should be noted that: the relative arrangement of parts and steps, numerical expressions and numerical values set forth in these embodiments do not limit the scope of the present disclosure unless specifically stated otherwise.
It will be understood by those within the art that the terms "first", "second", etc. in the embodiments of the present disclosure are used only for distinguishing between different steps, devices or modules, etc., and do not denote any particular technical meaning or necessary logical order therebetween.
It is also understood that in embodiments of the present disclosure, "a plurality" may refer to two or more and "at least one" may refer to one, two or more.
It is also to be understood that any reference to any component, data, or structure in the embodiments of the disclosure, may be generally understood as one or more, unless explicitly defined otherwise or stated otherwise.
In addition, the term "and/or" in the present disclosure merely describes an association relationship between associated objects and indicates that three relationships may exist; for example, "A and/or B" may indicate that A exists alone, that A and B exist simultaneously, or that B exists alone. In addition, the character "/" in the present disclosure generally indicates that the former and latter associated objects are in an "or" relationship.
It should also be understood that the description of the embodiments in the present disclosure emphasizes the differences between the embodiments, and the same or similar parts may be referred to each other, and are not repeated for brevity.
Meanwhile, it should be understood that the sizes of the respective portions shown in the drawings are not drawn in an actual proportional relationship for the convenience of description.
The following description of at least one exemplary embodiment is merely illustrative in nature and is in no way intended to limit the disclosure, its application, or uses.
Techniques, methods, and apparatus known to those of ordinary skill in the relevant art may not be discussed in detail but are intended to be part of the specification where appropriate.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, further discussion thereof is not required in subsequent figures.
Embodiments of the disclosure may be implemented in electronic devices such as terminal devices, computer systems, servers, etc., which are operational with numerous other general purpose or special purpose computing system environments or configurations. Examples of well known terminal devices, computing systems, environments, and/or configurations that may be suitable for use with electronic devices, such as terminal devices, computer systems, servers, and the like, include, but are not limited to: personal computer systems, server computer systems, thin clients, thick clients, hand-held or laptop devices, microprocessor-based systems, set top boxes, programmable consumer electronics, network pcs, minicomputer systems, mainframe computer systems, distributed cloud computing environments that include any of the above systems, and the like.
Electronic devices such as terminal devices, computer systems, servers, etc. may be described in the general context of computer system-executable instructions, such as program modules, being executed by a computer system. Generally, program modules may include routines, programs, objects, components, logic, data structures, etc. that perform particular tasks or implement particular abstract data types. The computer system/server may be practiced in distributed cloud computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed cloud computing environment, program modules may be located in both local and remote computer system storage media including memory storage devices.
Summary of the application
In the course of implementing the present disclosure, the inventors found that existing depth estimation methods generally perform estimation with a single depth estimation algorithm, which has at least the following problem: because the estimation relies on a single depth estimation algorithm, the result is easily affected by the accuracy of that algorithm and is not robust to noise and misjudgment.
Exemplary network architecture
Fig. 1a is a network structure diagram of a depth estimation network model according to an exemplary embodiment of the present disclosure. As shown in Fig. 1a, the depth estimation network model includes a feature extraction branch model 101, a prediction branch model 102, and a feature fusion branch model 103.
Alternatively, FIG. 1b is a block diagram of an optional example of the prediction branch model in the depth estimation network model provided in FIG. 1a. As shown in FIG. 1b, the prediction branch models 102 may include, but are not limited to: a plane prediction branch model 1021, a depth prediction branch model 1022, a key point branch model 1023, a projection height branch model 1024, and an actual height prediction branch model 1025.
The process of depth estimation using the depth estimation network model provided in Figs. 1a and 1b may include: inputting a single image (for example, an RGB image) acquired by a camera into the feature extraction branch model 101 of the depth estimation network model, and performing feature extraction on the image through the feature extraction branch model 101 to obtain a feature map; inputting the feature map into the plurality of prediction branch models 102; processing the feature map by the plane prediction branch model 1021 to obtain a first plane coordinate (u, v) and a center offset value of a target object (e.g., a vehicle) in the image coordinate system corresponding to the image, where the first plane coordinate represents the coordinate value of the center point of the target object, and the results corresponding to the target object can be determined by the first plane coordinate from the results that the key point branch model 1023, the projection height branch model 1024 and the actual height prediction branch model 1025 output for the at least one object in the image; processing the feature map through the depth prediction branch model 1022 to obtain a first depth estimation value and a first uncertainty corresponding to the target object, where the depth prediction branch model 1022 may output at least one candidate estimation value corresponding to the at least one object in the image, and the candidate estimation value closest to the first plane coordinate is determined as the first depth estimation value; processing the feature map through the key point branch model 1023 to obtain a plurality of key point predicted values and a second uncertainty of the target object, where the key points of the target object in the image may be the coordinate values of the 8 key points of the minimum enclosing cube of the target object in the image, and the specific determination process may include: the key point branch model 1023 predicts at least one group of key points corresponding to the at least one object included in the image, the plurality of key point predicted values are screened through the first plane coordinate (nearest distance), and the 8 key point coordinates closest to the first plane coordinate are determined as the output result of the key point branch model 1023; for example, when the target object is a vehicle, the key points are the 8 vertex coordinate values of the cube enclosing the vehicle; processing the feature map through the projection height branch model 1024 to obtain a projection height value and a third uncertainty of the target object, where the projection height represents the height of the target object in the image; optionally, the projection height branch model 1024 may predict a plurality of projection height values (for example, when the image includes a plurality of objects), and the plurality of predicted projection height values are screened through the first plane coordinate to determine the projection height value of the target object; and processing the feature map through the actual height prediction branch model 1025 to obtain an actual height predicted value of the target object; for example, when the target object is a vehicle, the actual height predicted value is the predicted actual height of the vehicle.
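By way of illustration only, the per-object quantities produced by the prediction branches described above can be organized as in the following sketch; the container type, field names and comments are assumptions introduced for readability and are not part of the disclosed embodiments.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ObjectPrediction:
    """Per-object outputs gathered from the prediction branch models (illustrative field names)."""
    center_uv: Tuple[float, float]           # first plane coordinate (u, v) from the plane prediction branch model 1021
    center_offset: Tuple[float, float]       # center offset value from the plane prediction branch model 1021
    depth_z1: float                          # first depth estimation value from the depth prediction branch model 1022
    depth_sigma1: float                      # first uncertainty
    keypoints_uv: List[Tuple[float, float]]  # 8 key points of the minimum enclosing cube, in image coordinates
    keypoints_sigma2: float                  # second uncertainty
    proj_height_px: float                    # projection height of the object in the image, in pixels
    proj_height_sigma3: float                # third uncertainty
    actual_height: float                     # actual height predicted value (e.g. the real height of a vehicle)
```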
Three depth estimates may be determined based on the outputs of the plurality of prediction branch models 102 described above, optionally including: a first depth estimate Z1; a second depth estimate Z2 determined based on the first plane coordinates, the plurality of key point predicted values and the internal reference matrix of the camera; and a third depth estimate Z3 determined based on the first plane coordinates, the projection height value and the internal reference matrix of the camera.
Alternatively, the second depth estimate may be determined based on the following equation (1):

Z2 = f · H / (k1 - k2)    Equation (1)

wherein k1 represents the coordinate value of the lower key point of the target object in the image (for example, determined by the mean of the coordinate values of the 4 key points belonging to the bottom surface among the 8 key points), k2 represents the coordinate value of the upper key point of the target object in the image (for example, determined by the mean of the coordinate values of the 4 key points belonging to the top surface among the 8 key points), and specifically the 8 key points in this embodiment are obtained by screening based on the first plane coordinate; f represents the focal length in the camera intrinsic parameters; and H represents the actual height predicted value predicted by the actual height prediction branch model.
The third depth estimate may be determined based on the following equation (2):

Z3 = f · H / h    Equation (2)

wherein h represents the projection height value predicted by the projection height branch model, and specifically the projection height value in this embodiment is determined by screening based on the first plane coordinate; f represents the focal length in the camera intrinsic parameters; and H represents the actual height predicted value predicted by the actual height prediction branch model.
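As a minimal sketch of equations (1) and (2), the helper functions below compute the key-point-based and projection-height-based depth estimates; the function and argument names are illustrative assumptions, and a pinhole camera with focal length f expressed in pixels is assumed.

```python
def depth_from_keypoints(bottom_v: float, top_v: float, focal_px: float, actual_height: float) -> float:
    """Equation (1): Z2 = f * H / (k1 - k2), where k1 and k2 are the image-row coordinates of the
    lower and upper key points (each taken as the mean of the 4 bottom-face / 4 top-face key points)."""
    return focal_px * actual_height / (bottom_v - top_v)


def depth_from_projected_height(proj_height_px: float, focal_px: float, actual_height: float) -> float:
    """Equation (2): Z3 = f * H / h, where h is the predicted projection height in pixels."""
    return focal_px * actual_height / proj_height_px


# Illustrative values only: a 1.5 m tall vehicle spanning 50 pixels, camera focal length 1000 px.
z2 = depth_from_keypoints(bottom_v=400.0, top_v=350.0, focal_px=1000.0, actual_height=1.5)   # 30.0
z3 = depth_from_projected_height(proj_height_px=50.0, focal_px=1000.0, actual_height=1.5)    # 30.0
```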
The feature fusion branch model 103 determines a target depth estimation value of the target object in the local three-dimensional coordinate where the camera is located according to the first depth estimation value, the second depth estimation value, the third depth estimation value, the first uncertainty, the second uncertainty, and the third uncertainty. Alternatively, the target depth estimate may be determined based on the following equation (3):
Z = Σi (zi / σi) / Σi (1 / σi)    Equation (3)

wherein Z represents the target depth estimate; i indexes the depth estimates output by the prediction branch models 102, of which there are 3 in this embodiment; σi denotes the corresponding uncertainty, with σ1, σ2 and σ3 representing the first uncertainty, the second uncertainty and the third uncertainty respectively; the reciprocal of each uncertainty can be regarded as the weight of the corresponding depth estimate, that is, equation (3) implements a weighted average.
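A minimal sketch of the uncertainty-weighted fusion of equation (3) follows; the function name and the numeric values are illustrative assumptions, and the weights are the reciprocals of the uncertainties as described above.

```python
def fuse_depth_estimates(depths, uncertainties):
    """Equation (3): Z = sum(z_i / sigma_i) / sum(1 / sigma_i); each depth estimate is weighted by
    the reciprocal of its uncertainty, so less certain estimates contribute less to the fused value."""
    weights = [1.0 / sigma for sigma in uncertainties]
    return sum(w * z for w, z in zip(weights, depths)) / sum(weights)


# Illustrative values: the second estimate has the smallest uncertainty and therefore the largest weight.
fused_z = fuse_depth_estimates(depths=[31.0, 30.0, 29.0], uncertainties=[2.0, 0.5, 1.0])  # about 29.86
```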
In order to better use the target depth estimation value, inverse projection transformation is performed on the first plane coordinates (u, v) in combination with the internal reference matrix K to obtain coordinate values of an X axis and a Y axis of the central point of the target object in the local three-dimensional coordinates, that is, the embodiment realizes determination of the three-dimensional coordinate value of the target object in the local three-dimensional coordinates.
In this embodiment, the depth value of the target object is estimated through three depth estimation algorithms. Compared with a depth estimation method based on a single depth estimation algorithm in the prior art, the dependence on any single depth estimation algorithm is reduced and the method is less sensitive to noise and misjudgment, so more robust, interference-resistant estimation is achieved. In addition, compared with direct depth estimation, the key points and the projection height are simpler to predict from the image, so the depth estimation is easier to realize, and combining the three depth estimation algorithms with their uncertainties improves the accuracy of the depth estimation.
In general, a network model needs to be trained before application, and in the embodiment of the present disclosure the depth estimation network model provided in Figs. 1a and 1b also needs to be trained before application. The training process includes inputting a sample image with a known depth truth value into the depth estimation network model and outputting a depth estimation value; a network loss is determined based on the depth estimate and the uncertainty, and the depth estimation network model is trained using the network loss. Optionally, the network loss may be determined based on the sum of the branch losses Li and the depth loss Ldepth; the branch loss may be determined based on the following formula (4), with i taking the values 1, 2 and 3:
[Formula (4), reproduced as an image in the original publication, expresses the branch loss Li in terms of the estimated depth zi, the depth truth value z* and the uncertainty σi.]
wherein σi represents the uncertainty (for different values of i, the uncertainty output by the depth prediction branch model 1022, the key point branch model 1023 and the projection height branch model 1024, respectively), zi represents the estimated depth value (for different values of i, the depth determined from the output of the depth prediction branch model 1022, the key point branch model 1023 and the projection height branch model 1024, respectively), and z* represents the corresponding depth truth value of the sample image.
The depth loss Ldepth can be determined based on the following equation (5):

Ldepth = |z - z*|    Equation (5)
wherein z represents the depth estimation value of the target object predicted by the depth estimation network model, and z* represents the corresponding depth truth value of the sample image.
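The training objective can be sketched as follows. Formula (4) appears only as an image in the publication, so the uncertainty-weighted L1 branch loss used below (an absolute error divided by the uncertainty plus a log-uncertainty regularizer) is an assumption about its form rather than the disclosed formula; the depth loss follows equation (5), and all function names are illustrative.

```python
import math


def branch_loss(z_i: float, z_true: float, sigma_i: float) -> float:
    """Assumed form of the branch loss L_i in formula (4): an uncertainty-weighted L1 term plus a
    log(sigma) regularizer that keeps the predicted uncertainty from growing without bound.
    The exact disclosed formula is not reproduced here, so this is an illustrative stand-in."""
    return abs(z_i - z_true) / sigma_i + math.log(sigma_i)


def depth_loss(z_pred: float, z_true: float) -> float:
    """Equation (5): L_depth = |z - z*|, applied to the predicted depth of the target object."""
    return abs(z_pred - z_true)


def network_loss(branch_depths, branch_sigmas, fused_depth, z_true):
    """Network loss: the sum of the branch losses (i = 1, 2, 3) plus the depth loss."""
    total = sum(branch_loss(z, z_true, s) for z, s in zip(branch_depths, branch_sigmas))
    return total + depth_loss(fused_depth, z_true)
```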
Exemplary method
Fig. 2 is a flowchart illustrating a depth estimation method for a target object according to an exemplary embodiment of the disclosure. The embodiment can be applied to an electronic device, as shown in fig. 2, and includes the following steps:
step 201, performing feature extraction on an image including a target object acquired by a camera to obtain a feature map.
In this embodiment, feature extraction may be performed through the feature extraction branch model 101 in the depth estimation network model shown in Fig. 1a to obtain a feature map. The depth estimation method can be applied in any scene; for example, when images around a vehicle are collected by a vehicle-mounted camera, the target object can be an obstacle such as another vehicle, a pedestrian or a building, and the depth estimation method of this embodiment can then provide more information for assisted or automatic driving of the vehicle and improve driving safety.
And step 202, determining a first plane coordinate value of the target object in the image coordinate system based on the characteristic diagram.
Alternatively, the present embodiment may implement the prediction of the first plane coordinate value based on the plane prediction branch model 1021 provided in fig. 1 b.
Step 203, based on the feature map and the first plane coordinate value, determining a plurality of first depth coordinate values of the target object under the local three-dimensional coordinate where the camera is located and depth uncertainty corresponding to each first depth coordinate value.
Alternatively, the present embodiment may provide, through fig. 1b, a depth prediction branch model 1022 in the plurality of prediction branch models 102, a key point branch model 1023, and a plurality of branch models in the projected height branch model 1024 and the actual height prediction branch model 1025 to process the feature map, and output, in combination with the first plane coordinate value, a plurality of first depth coordinate values of the target object in the local three-dimensional coordinate where the camera is located and a depth uncertainty corresponding to each first depth coordinate value.
And 204, determining a second depth coordinate value of the target object under the local three-dimensional coordinate based on the plurality of first depth coordinate values and the depth uncertainty corresponding to each first depth coordinate value.
Optionally, a weight value of each first depth coordinate value may be determined by a depth uncertainty corresponding to each first depth coordinate value, and the second depth coordinate value may be determined by performing weighting and averaging in combination with the weight values.
According to the depth estimation method for the target object provided by this embodiment of the disclosure, multiple first depth coordinate values are predicted through the multiple prediction branch models, each corresponding to a different depth estimation algorithm, and the second depth coordinate value is determined based on these first depth coordinate values and their uncertainties. Depth estimation combining multiple depth estimation algorithms is thus realized, the over-reliance caused by a single depth estimation algorithm is avoided, and the method has good robustness and strong anti-interference capability.
As shown in fig. 3, based on the embodiment shown in fig. 2, step 203 may include the following steps:
step 2031, based on the feature map and the first plane coordinate value, determining a plurality of predicted values corresponding to the target object, a prediction uncertainty corresponding to each predicted value, and a predicted value of the actual height of the target object.
The predicted values in this embodiment may include, but are not limited to: a depth predicted value, key point predicted values and a projection height predicted value; alternatively, the actual height predicted value may be predicted by the actual height prediction branch model 1025 provided in Fig. 1b.
Step 2032, determining a plurality of first depth coordinate values under the local three-dimensional coordinate based on the plurality of predicted values and the actual height predicted value.
In this embodiment, when the predicted value is a predicted value of a keypoint or a predicted value of a projection height, the corresponding first depth coordinate value may be determined by performing calculation according to formula (1) or formula (2) in the embodiment provided in fig. 1 b.
Step 2033, determining a depth uncertainty corresponding to each first depth coordinate value based on the plurality of prediction uncertainties.
Because each predicted value corresponds to one prediction uncertainty, the depth uncertainty of the first depth coordinate value determined from a given predicted value is the prediction uncertainty corresponding to that predicted value.
In this depth estimation method, multiple predicted values covering the depth, the key points, the projection height and other information are provided, and multiple first depth coordinate values are determined from these predicted values through different depth estimation algorithms. This avoids relying on only one depth estimation algorithm, prevents the depth estimation result from being inaccurate because a single depth estimation algorithm is inaccurate, and improves the robustness of the depth estimation method provided by this embodiment.
As shown in fig. 4, based on the embodiment shown in fig. 3, step 2031 may include the following steps:
step 401, feature maps are predicted based on multiple prediction branch models, and candidate prediction results corresponding to each object in at least one object in an image are determined.
Optionally, each candidate result includes a candidate predicted value and uncertainty corresponding to the candidate predicted value; the prediction may be performed by using a plurality of the depth prediction branch model 1022, the keypoint branch model 1023, and the projected height branch model 1024 provided in the embodiment shown in fig. 1b, so as to obtain a plurality of candidate prediction results corresponding to each object.
Step 402, determining a candidate prediction result corresponding to the target object from at least one candidate prediction result based on the first plane coordinate value.
In this embodiment, the first plane coordinate value includes the horizontal and vertical coordinate values of the center point of the target object, that is, the coordinate of the center point of the target object in the image. Taking this center point coordinate as a reference, the candidate prediction results corresponding to the target object may be screened out from the candidate prediction results corresponding to the multiple objects; optionally, the screening may be performed according to the distance from the center point coordinate.
And step 403, obtaining a plurality of predicted values, prediction uncertainty corresponding to each predicted value and an actual height predicted value of the target object based on the candidate prediction result corresponding to the target object.
In this embodiment, the candidate prediction result corresponding to the target object is determined from the plurality of candidate prediction results by using the plane coordinate value; since the image may include multiple objects (for example, a road image may include multiple vehicles, pedestrians, buildings, etc.), the set of predicted values corresponding to the target object is determined by using the plane coordinate value of the target object. Through this screening by the first plane coordinate value, the subsequent operations are based only on the candidate prediction results corresponding to the target object and do not need to be executed for the candidate prediction results of every object, which avoids inaccurate results for the target object that would arise from performing depth estimation on all candidate prediction results, and improves the accuracy of the predicted values corresponding to the target object.
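A minimal sketch of this screening step follows, assuming each candidate prediction carries the image-plane center of the object it belongs to; the data layout, names and numeric values are illustrative assumptions.

```python
def select_candidate_for_target(candidates, target_uv):
    """Return the candidate prediction whose associated center point lies closest to the target
    object's first plane coordinate (u, v); `candidates` is assumed to be a list of dicts with a
    'center_uv' entry plus the branch-specific predicted value and its uncertainty."""
    def squared_distance(candidate):
        du = candidate["center_uv"][0] - target_uv[0]
        dv = candidate["center_uv"][1] - target_uv[1]
        return du * du + dv * dv

    return min(candidates, key=squared_distance)


# Illustrative usage: two detected objects, with the target object's center at (320, 240).
candidates = [
    {"center_uv": (318.0, 242.0), "value": 28.5, "sigma": 0.7},
    {"center_uv": (610.0, 230.0), "value": 45.0, "sigma": 1.2},
]
target_candidate = select_candidate_for_target(candidates, target_uv=(320.0, 240.0))  # the first candidate
```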
Alternatively, on the basis of the above embodiment, step 2032 may comprise at least two of the following:
a1, using the direct depth prediction value as a first depth coordinate value.
In this embodiment, the first depth coordinate value may be determined based on the depth prediction branch model 1022 provided in fig. 1b, and optionally, the depth prediction branch model in this embodiment may adopt any network model for performing depth prediction in the prior art, and this embodiment does not specifically limit the network structure for depth prediction.
a2, determining a projected height value of the target object in the image based on the plurality of predicted keypoint values, and determining a first depth coordinate value based on the projected height value, the predicted actual height value and the camera's internal reference matrix.
Alternatively, the prediction of a first depth coordinate value is implemented based on the formula (1) in the embodiment provided in fig. 1b, wherein the key points include 8 key points of a cube surrounding the target object, and the projection height value is determined by the difference of the key points of the upper and lower planes.
a3, determining a first depth coordinate value based on the projected height prediction value, the actual height prediction value and the camera's internal reference matrix.
Optionally, the prediction of a first depth coordinate value is implemented based on equation (2) in the embodiment provided in fig. 1b above.
In this embodiment, the plurality of predicted values includes at least two of: a direct depth predicted value under the local three-dimensional coordinate, a plurality of key point predicted values under the local three-dimensional coordinate, and a projection height predicted value under the local three-dimensional coordinate. Different predicted values are determined through different branch network models, each predicted value yields a first depth coordinate value through a different depth estimation algorithm, and the second depth coordinate value is determined according to the plurality of first depth coordinate values and their uncertainties. Depth estimation combining multiple depth estimation algorithms is thus realized, the dependence on a single depth estimation algorithm is removed, and the robustness of the depth estimation method is improved. In addition, compared with direct depth value estimation, predicting the key points and the projection height from the image is simpler and easier to realize, so the depth estimation of this embodiment is easier to implement, and the depth estimation value obtained by combining multiple predicted values is more accurate.
As shown in fig. 5, based on the embodiment shown in fig. 2, step 204 may include the following steps:
step 2041, based on the plurality of uncertainties, determines a weight value for each first depth coordinate value.
Optionally, the weight value for each first depth coordinate value is based on the reciprocal of each uncertainty.
Step 2042, weighting the plurality of first depth coordinate values based on the plurality of weight values, and obtaining a second depth coordinate value of the target object under the camera coordinate corresponding to the camera.
In this embodiment, after the weight value of each first depth coordinate value is determined, the second depth coordinate value may be determined based on formula (3) in the embodiment provided in Fig. 1b. The second depth coordinate value is determined by weighted averaging, combining multiple depth estimation algorithms, so the dependence of existing depth estimation methods on a single depth estimation algorithm is overcome and a more robust depth estimation method is provided.
Fig. 6 is a flowchart illustrating a depth estimation method for a target object according to another exemplary embodiment of the present disclosure. As shown in fig. 6, the present embodiment may further include the following steps based on the embodiment shown in fig. 2:
step 601, based on the second depth coordinate value and the camera internal reference matrix, performing inverse projection transformation on the first plane coordinate value from the image coordinate system to the local three-dimensional coordinate system.
Optionally, the internal reference matrix of the camera can be treated as known data once the camera is determined; in this embodiment, the plane coordinates in the image coordinate system are mapped to the local three-dimensional coordinates of the camera through an inverse projection transformation that combines the second depth coordinate value and the camera internal reference matrix, so as to obtain the coordinate values of the X axis and the Y axis in the local three-dimensional coordinate system.
Step 602, determining a second plane coordinate value of the target object in the local three-dimensional coordinate system.
In this embodiment, in order to better use the target depth estimation value, the first plane coordinate is subjected to inverse projection transformation by combining with the internal reference matrix K, so as to obtain coordinate values of the central point of the target object on the X axis and the Y axis under the local three-dimensional coordinate, that is, the three-dimensional coordinate value of the target object under the local three-dimensional coordinate is determined, the target object is better positioned, and a basis is provided for subsequent other operations.
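A minimal sketch of the inverse projection in steps 601 and 602 follows, assuming a standard 3x3 pinhole intrinsic matrix K with focal lengths fx, fy and principal point (cx, cy); the function name and the numeric values are illustrative assumptions.

```python
import numpy as np


def backproject_center(u: float, v: float, depth_z: float, K: np.ndarray) -> np.ndarray:
    """Inverse projection of the first plane coordinate (u, v) using the second depth coordinate
    value Z: X = (u - cx) * Z / fx and Y = (v - cy) * Z / fy, giving the center point of the target
    object in the camera's local three-dimensional coordinate system."""
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]
    x = (u - cx) * depth_z / fx
    y = (v - cy) * depth_z / fy
    return np.array([x, y, depth_z])


# Illustrative intrinsics and a fused depth of 30 m.
K = np.array([[1000.0, 0.0, 640.0],
              [0.0, 1000.0, 360.0],
              [0.0, 0.0, 1.0]])
center_xyz = backproject_center(u=700.0, v=400.0, depth_z=30.0, K=K)  # [1.8, 1.2, 30.0]
```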
Optionally, on the basis of the foregoing embodiment, the method may further include:
b1, determining the central offset value of the target object under the image coordinate system based on the characteristic diagram.
Alternatively, as shown in the embodiment provided in fig. 1b, not only the first plane coordinates but also the center offset value may be output by the plane predictive branch model 1021.
b2, correcting the second plane coordinate value based on the center offset value to obtain the corrected third plane coordinate value.
In this embodiment, the second plane coordinate value is corrected using the output center offset value, so that the corrected X-axis and Y-axis coordinate values of the center point of the target object under the local three-dimensional coordinate system are obtained; through this correction, the prediction accuracy of the center point coordinates of the target object is improved.
Any of the depth estimation methods for a target object provided by the embodiments of the present disclosure may be performed by any suitable device with data processing capability, including but not limited to: terminal equipment, a server and the like. Alternatively, the depth estimation method for any target object provided by the embodiments of the present disclosure may be executed by a processor, for example, the processor executes the depth estimation method for any target object mentioned in the embodiments of the present disclosure by calling a corresponding instruction stored in a memory. Which will not be described in detail below.
Exemplary devices
Fig. 7 is a schematic structural diagram of a depth estimation device for a target object according to an exemplary embodiment of the present disclosure. As shown in fig. 7, the apparatus provided in this embodiment includes:
and the feature extraction module 71 is configured to perform feature extraction on the image including the target object acquired by the camera to obtain a feature map.
And the plane coordinate prediction module 72 is configured to determine a first plane coordinate value of the target object in the image coordinate system based on the feature map obtained by the feature extraction module 71.
The object coordinate estimation module 73 is configured to determine, based on the feature map obtained by the feature extraction module 71 and the first plane coordinate value determined by the plane coordinate prediction module 72, a plurality of first depth coordinate values of the target object in the local three-dimensional coordinate where the camera is located and a depth uncertainty corresponding to each first depth coordinate value.
And a target object estimation module 74, configured to determine a second depth coordinate value of the target object in the local three-dimensional coordinate based on the plurality of first depth coordinate values determined by the object coordinate estimation module 73 and the depth uncertainty corresponding to each first depth coordinate value.
According to the depth estimation device for the target object provided by the embodiment of the disclosure, a plurality of first depth coordinate values are predicted in advance, each first depth coordinate value corresponds to a different depth estimation algorithm, and a second depth coordinate value is determined based on the plurality of first depth coordinate values and the uncertainty thereof, so that depth estimation combining with multiple depth estimation algorithms is realized, the problem of dependence on the depth estimation algorithm caused by a single depth estimation algorithm is solved, and the depth estimation device for the target object is good in robustness and strong in anti-interference capability.
Fig. 8 is a schematic structural diagram of a depth estimation device for a target object according to another exemplary embodiment of the present disclosure. As shown in fig. 8, in the apparatus provided in this embodiment, the object coordinate estimation module 73 includes:
an object prediction unit 731, configured to determine, based on the feature map and the first plane coordinate value, a plurality of prediction values corresponding to the target object, a prediction uncertainty corresponding to each prediction value, and a prediction value of an actual height of the target object;
a coordinate value determination unit 732 for determining a plurality of first depth coordinate values under the local three-dimensional coordinates based on the plurality of predicted values and the actual height predicted value;
a depth determination unit 733 for determining a depth uncertainty corresponding to each first depth coordinate value based on the plurality of prediction uncertainties.
Optionally, the object prediction unit 731 is specifically configured to predict the feature map based on a plurality of prediction branch models, and determine a candidate prediction result corresponding to each object in at least one object in the image; determining a candidate prediction result corresponding to the target object from at least one candidate prediction result based on the first plane coordinate value; and obtaining a plurality of predicted values, the prediction uncertainty corresponding to each predicted value and the actual height predicted value of the target object based on the candidate prediction result corresponding to the target object.
Optionally, the plurality of predictors includes at least two of: a direct depth predicted value under a local three-dimensional coordinate, a plurality of key point predicted values under the local three-dimensional coordinate, and a projection height predicted value under the local three-dimensional coordinate;
a coordinate value determination unit 732, specifically configured to use the direct depth prediction value as a first depth coordinate value; and/or determining a projection height value of the target object in the image based on the plurality of key point predicted values, and determining a first depth coordinate value based on the projection height value, the actual height predicted value and the internal reference matrix of the camera; and/or determining a first depth coordinate value based on the projected height prediction value, the actual height prediction value, and the camera's internal reference matrix.
In some alternative embodiments, the target object estimation module 74 includes:
a weight value unit 741 configured to determine a weight value of each of the first depth coordinate values based on the plurality of uncertainties;
the coordinate value estimation unit 742 is configured to weight the plurality of first depth coordinate values based on the plurality of weight values to obtain a second depth coordinate value of the target object at the camera coordinate corresponding to the camera.
In some optional embodiments, further comprising:
an inverse projection module 85, configured to perform inverse projection transformation from the image coordinate system to the local three-dimensional coordinate system on the first plane coordinate value based on the second depth coordinate value and the internal reference matrix of the camera; and determining a second plane coordinate value of the target object under the local three-dimensional coordinate system.
In some optional embodiments, further comprising:
a coordinate correction module 86, configured to determine, based on the feature map, a central offset value of the target object in the image coordinate system; and correcting the second plane coordinate value based on the center offset value to obtain a corrected third plane coordinate value.
Exemplary electronic device
Next, an electronic apparatus according to an embodiment of the present disclosure is described with reference to fig. 9. The electronic device may be either or both of the first device 100 and the second device 200, or a stand-alone device separate therefrom, which stand-alone device may communicate with the first device and the second device to receive the acquired input signals therefrom.
Fig. 9 illustrates a block diagram of an electronic device in accordance with an embodiment of the disclosure.
As shown in fig. 9, the electronic device 90 includes one or more processors 91 and memory 92.
The processor 91 may be a Central Processing Unit (CPU) or other form of processing unit having data processing capabilities and/or instruction execution capabilities, and may control other components in the electronic device 90 to perform desired functions.
Memory 92 may include one or more computer program products that may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory. The volatile memory may include, for example, Random Access Memory (RAM), cache memory (cache), and/or the like. The non-volatile memory may include, for example, Read Only Memory (ROM), hard disk, flash memory, etc. One or more computer program instructions may be stored on the computer-readable storage medium and executed by the processor 91 to implement the depth estimation method of the target object of the various embodiments of the present disclosure described above and/or other desired functions. Various contents such as an input signal, a signal component, a noise component, etc. may also be stored in the computer-readable storage medium.
In one example, the electronic device 90 may further include: an input device 93 and an output device 94, which are interconnected by a bus system and/or other form of connection mechanism (not shown).
For example, when the electronic device is the first device 100 or the second device 200, the input device 93 may be a microphone or a microphone array as described above for capturing an input signal of a sound source. When the electronic device is a stand-alone device, the input means 93 may be a communication network connector for receiving the acquired input signals from the first device 100 and the second device 200.
The input device 93 may also include, for example, a keyboard, a mouse, and the like.
The output device 94 may output various information including the determined distance information, direction information, and the like to the outside. The output devices 94 may include, for example, a display, speakers, a printer, and a communication network and its connected remote output devices, among others.
Of course, for simplicity, only some of the components of the electronic device 90 relevant to the present disclosure are shown in fig. 9, omitting components such as buses, input/output interfaces, and the like. In addition, the electronic device 90 may include any other suitable components, depending on the particular application.
Exemplary computer program product and computer-readable storage Medium
In addition to the above-described methods and apparatus, embodiments of the present disclosure may also be a computer program product comprising computer program instructions that, when executed by a processor, cause the processor to perform the steps in the method of depth estimation of a target object according to various embodiments of the present disclosure described in the "exemplary methods" section of this specification above.
The computer program product may include program code for carrying out operations of embodiments of the present disclosure, written in any combination of one or more programming languages, including an object-oriented programming language such as Java or C++ and conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server.
Furthermore, embodiments of the present disclosure may also be a computer-readable storage medium having stored thereon computer program instructions that, when executed by a processor, cause the processor to perform the steps in the method of depth estimation of a target object according to various embodiments of the present disclosure described in the "exemplary methods" section above in this specification.
The computer readable storage medium may take any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may include, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The basic principles of the present disclosure have been described above in connection with specific embodiments, but it should be noted that advantages, effects, and the like, mentioned in the present disclosure are only examples and not limitations, and should not be considered essential to the various embodiments of the present disclosure. Furthermore, the foregoing disclosure of specific details is for the purpose of illustration and description and is not intended to be limiting, since the disclosure is not intended to be limited to the specific details so described.
In the present specification, the embodiments are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same or similar parts in the embodiments are referred to each other. For the system embodiment, since it basically corresponds to the method embodiment, the description is relatively simple, and reference may be made to the partial description of the method embodiment for relevant points.
The block diagrams of devices, apparatuses, and systems referred to in this disclosure are only given as illustrative examples and are not intended to require or imply that the connections, arrangements, configurations, etc. must be made in the manner shown in the block diagrams. These devices, apparatuses, and systems may be connected, arranged, or configured in any manner, as will be appreciated by those skilled in the art. Words such as "including," "comprising," "having," and the like are open-ended words that mean "including, but not limited to," and are used interchangeably herein. The word "or" as used herein means, and is used interchangeably with, the word "and/or," unless the context clearly dictates otherwise. The word "such as" is used herein to mean, and is used interchangeably with, the phrase "such as but not limited to".
The methods and apparatus of the present disclosure may be implemented in a number of ways. For example, the methods and apparatus of the present disclosure may be implemented by software, hardware, firmware, or any combination of software, hardware, and firmware. The above-described order for the steps of the method is for illustration only, and the steps of the method of the present disclosure are not limited to the order specifically described above unless specifically stated otherwise. Further, in some embodiments, the present disclosure may also be embodied as programs recorded in a recording medium, the programs including machine-readable instructions for implementing the methods according to the present disclosure. Thus, the present disclosure also covers a recording medium storing a program for executing the method according to the present disclosure.
It is also noted that in the apparatus, devices, and methods of the present disclosure, various components or steps may be broken down and/or re-combined. Such decomposition and/or recombination should be considered as equivalents of the present disclosure.
The previous description of the disclosed aspects is provided to enable any person skilled in the art to make or use the present disclosure. Various modifications to these aspects will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other aspects without departing from the scope of the disclosure. Thus, the present disclosure is not intended to be limited to the aspects shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
The foregoing description has been presented for purposes of illustration and description. Furthermore, this description is not intended to limit embodiments of the disclosure to the form disclosed herein. While a number of example aspects and embodiments have been discussed above, those of skill in the art will recognize certain variations, modifications, alterations, additions and sub-combinations thereof.

Claims (10)

1. A method of depth estimation of a target object, comprising:
performing feature extraction on an image which is acquired by a camera and comprises a target object to obtain a feature map;
determining a first plane coordinate value of the target object in the image in an image coordinate system based on the feature map;
based on the feature map and the first plane coordinate value, determining a plurality of first depth coordinate values of the target object in a local three-dimensional coordinate system of the camera and a depth uncertainty corresponding to each first depth coordinate value;
and determining a second depth coordinate value of the target object in the local three-dimensional coordinate system based on the plurality of first depth coordinate values and the depth uncertainty corresponding to each first depth coordinate value.
2. The method of claim 1, wherein the determining, based on the feature map and the first plane coordinate value, of a plurality of first depth coordinate values of the target object in the local three-dimensional coordinate system of the camera and a depth uncertainty corresponding to each of the first depth coordinate values comprises:
determining a plurality of predicted values corresponding to the target object, a prediction uncertainty corresponding to each predicted value, and an actual height predicted value of the target object based on the feature map and the first plane coordinate value;
determining a plurality of first depth coordinate values in the local three-dimensional coordinate system based on the plurality of predicted values and the actual height predicted value;
and determining the depth uncertainty for each of the first depth coordinate values based on the plurality of prediction uncertainties.
3. The method according to claim 2, wherein the determining of the plurality of predicted values corresponding to the target object, the prediction uncertainty corresponding to each of the predicted values, and the actual height predicted value of the target object based on the feature map and the first plane coordinate value comprises:
performing prediction on the feature map based on a plurality of prediction branch models, and determining a candidate prediction result corresponding to each of at least one object in the image;
determining a candidate prediction result corresponding to the target object from the at least one candidate prediction result based on the first plane coordinate value;
and obtaining the plurality of predicted values, the prediction uncertainty corresponding to each predicted value, and the actual height predicted value of the target object based on the candidate prediction result corresponding to the target object.
4. The method of claim 3, wherein the plurality of predicted values includes at least two of: a direct depth prediction value in the local three-dimensional coordinate system, a plurality of key point prediction values in the local three-dimensional coordinate system, and a projection height prediction value in the local three-dimensional coordinate system;
the determining of the plurality of first depth coordinate values in the local three-dimensional coordinate system based on the plurality of predicted values and the actual height predicted value comprises at least two of:
taking the direct depth prediction value as one of the first depth coordinate values;
determining a projected height value of the target object in the image based on the plurality of key point prediction values, and determining one of the first depth coordinate values based on the projected height value, the actual height predicted value, and an intrinsic parameter matrix of the camera;
and determining one of the first depth coordinate values based on the projection height prediction value, the actual height predicted value, and the intrinsic parameter matrix of the camera.
5. The method of any of claims 1-4, wherein the determining of the second depth coordinate value of the target object in the local three-dimensional coordinate system based on the plurality of first depth coordinate values and the depth uncertainty corresponding to each of the first depth coordinate values comprises:
determining a weight value for each of the first depth coordinate values based on the plurality of depth uncertainties;
and weighting the plurality of first depth coordinate values based on the plurality of weight values to obtain the second depth coordinate value of the target object in the camera coordinate system corresponding to the camera.
6. The method of any of claims 1-5, further comprising:
performing an inverse projective transformation of the first plane coordinate value from the image coordinate system to the local three-dimensional coordinate system based on the second depth coordinate value and the intrinsic parameter matrix of the camera;
and determining a second plane coordinate value of the target object in the local three-dimensional coordinate system.
7. The method of claim 6, further comprising:
determining a center offset value of the target object in the image coordinate system based on the feature map;
and correcting the second plane coordinate value based on the center offset value to obtain a corrected third plane coordinate value.
8. A depth estimation device of a target object, comprising:
a feature extraction module for performing feature extraction on an image which is acquired by a camera and comprises a target object, to obtain a feature map;
a plane coordinate prediction module for determining a first plane coordinate value of the target object in the image in an image coordinate system based on the feature map obtained by the feature extraction module;
an object coordinate estimation module for determining, based on the feature map obtained by the feature extraction module and the first plane coordinate value determined by the plane coordinate prediction module, a plurality of first depth coordinate values of the target object in a local three-dimensional coordinate system of the camera and a depth uncertainty corresponding to each first depth coordinate value;
and a target object estimation module for determining a second depth coordinate value of the target object in the local three-dimensional coordinate system based on the plurality of first depth coordinate values determined by the object coordinate estimation module and the depth uncertainty corresponding to each first depth coordinate value.
9. A computer-readable storage medium storing a computer program for executing the depth estimation method of the target object according to any one of claims 1 to 7.
10. An electronic device, comprising:
a processor; and
a memory for storing instructions executable by the processor;
wherein the processor is configured to read the executable instructions from the memory and execute the instructions to implement the depth estimation method of the target object according to any one of claims 1 to 7.
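For illustration of claim 4, the following non-limiting Python sketch shows the pinhole-camera relation by which a candidate depth can be recovered from a projected height, an actual height predicted value, and the camera focal length. The function name, variable names, and numeric values are assumptions introduced here for illustration and are not part of the claimed subject matter.

def depth_from_projected_height(h_proj_px, actual_height_m, fy_px):
    # Pinhole model: an object of actual height H at depth Z projects to a
    # pixel height h = fy * H / Z, so a candidate depth is Z = fy * H / h.
    if h_proj_px <= 0:
        raise ValueError("projected height must be positive")
    return fy_px * actual_height_m / h_proj_px

# Example with hypothetical numbers: a 1.5 m object spanning 60 px with
# fy = 1000 px yields a candidate depth of 25 m.
z_candidate = depth_from_projected_height(60.0, 1.5, 1000.0)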
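For illustration of claim 5, the following non-limiting sketch shows one plausible uncertainty-based weighting, in which each first depth coordinate value is weighted by the inverse of its depth uncertainty and the weights are normalized before the weighted sum is taken. The claims do not fix a particular weighting formula, so the inverse-uncertainty scheme and the sample values are assumptions.

import numpy as np

def fuse_depths(first_depths, depth_uncertainties):
    # Weight each candidate depth by the inverse of its uncertainty, normalize
    # the weights, and return the weighted sum as the fused (second) depth.
    w = 1.0 / np.asarray(depth_uncertainties, dtype=float)
    w = w / w.sum()
    return float(np.dot(w, np.asarray(first_depths, dtype=float)))

# Example with hypothetical numbers: the most certain candidate dominates.
z_fused = fuse_depths([24.5, 25.0, 26.1], [0.8, 0.5, 1.2])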
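For illustration of claims 6 and 7, the following non-limiting sketch back-projects a plane coordinate value from the image coordinate system into the camera's local three-dimensional coordinate system using a depth value and the intrinsic parameter matrix, and applies a center offset in the image plane as one possible reading of the correction in claim 7. The matrix entries, pixel coordinates, and offsets are illustrative assumptions.

import numpy as np

def back_project(u, v, z, K):
    # Inverse projective transformation: given a pixel (u, v), a depth z, and
    # the intrinsic parameter matrix K, recover the point in camera coordinates.
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]
    return np.array([(u - cx) * z / fx, (v - cy) * z / fy, z])

# Hypothetical intrinsic parameters, pixel location, and center offset.
K = np.array([[1000.0,    0.0, 960.0],
              [   0.0, 1000.0, 540.0],
              [   0.0,    0.0,   1.0]])
u, v = 980.0, 520.0           # plane coordinate value in the image
du, dv = 1.5, -2.0            # predicted center offset in the image coordinate system
p_camera = back_project(u + du, v + dv, 25.0, K)  # [x, y, z] in camera coordinates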
CN202210467512.3A 2022-04-29 2022-04-29 Depth estimation method and device for target object, storage medium and electronic equipment Pending CN114782510A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210467512.3A CN114782510A (en) 2022-04-29 2022-04-29 Depth estimation method and device for target object, storage medium and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210467512.3A CN114782510A (en) 2022-04-29 2022-04-29 Depth estimation method and device for target object, storage medium and electronic equipment

Publications (1)

Publication Number Publication Date
CN114782510A true CN114782510A (en) 2022-07-22

Family

ID=82435699

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210467512.3A Pending CN114782510A (en) 2022-04-29 2022-04-29 Depth estimation method and device for target object, storage medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN114782510A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115578431A (en) * 2022-10-17 2023-01-06 北京百度网讯科技有限公司 Image depth processing method and device, electronic equipment and medium
CN115578431B (en) * 2022-10-17 2024-02-06 北京百度网讯科技有限公司 Image depth processing method and device, electronic equipment and medium
CN115866229A (en) * 2023-02-14 2023-03-28 北京百度网讯科技有限公司 Method, apparatus, device and medium for converting view angle of multi-view image
CN115866229B (en) * 2023-02-14 2023-05-05 北京百度网讯科技有限公司 Viewing angle conversion method, device, equipment and medium for multi-viewing angle image

Similar Documents

Publication Publication Date Title
CN109188457B (en) Object detection frame generation method, device, equipment, storage medium and vehicle
JP6893564B2 (en) Target identification methods, devices, storage media and electronics
CN108427927B (en) Object re-recognition method and apparatus, electronic device, program, and storage medium
CN111174782B (en) Pose estimation method and device, electronic equipment and computer readable storage medium
CN114782510A (en) Depth estimation method and device for target object, storage medium and electronic equipment
WO2022205663A1 (en) Neural network training method and apparatus, target object detecting method and apparatus, and driving control method and apparatus
CN114119673B (en) Method and device for determining initial pose, electronic equipment and storage medium
CN111127584A (en) Method and device for establishing visual map, electronic equipment and storage medium
CN112381868A (en) Image depth estimation method and device, readable storage medium and electronic equipment
CN108229494B (en) Network training method, processing method, device, storage medium and electronic equipment
CN115861400B (en) Target object detection method, training device and electronic equipment
CN114882465A (en) Visual perception method and device, storage medium and electronic equipment
CN115147683A (en) Pose estimation network model training method, pose estimation method and device
CN113129249A (en) Depth video-based space plane detection method and system and electronic equipment
CN111639591A (en) Trajectory prediction model generation method and device, readable storage medium and electronic equipment
CN114386481A (en) Vehicle perception information fusion method, device, equipment and storage medium
CN111723926B (en) Training method and training device for neural network model for determining image parallax
CN113592706A (en) Method and device for adjusting homography matrix parameters
CN114049388A (en) Image data processing method and device
CN112304293B (en) Road height detection method and device, readable storage medium and electronic equipment
CN112669382A (en) Image-based distance determination method and device
CN113111692A (en) Target detection method and device, computer readable storage medium and electronic equipment
WO2020054058A1 (en) Identification system, parameter value update method, and program
CN112561836B (en) Method and device for acquiring point cloud set of target object
CN114937251A (en) Training method of target detection model, and vehicle-mounted target detection method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination