CN114140527B - Dynamic environment binocular vision SLAM method based on semantic segmentation - Google Patents
- Publication number
- CN114140527B (application CN202111373890.7A)
- Authority
- CN
- China
- Prior art keywords
- dynamic
- binocular
- feature points
- semantic
- points
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/70—Determining position or orientation of objects or cameras
- G06T7/73—Determining position or orientation of objects or cameras using feature-based methods
-
- G—PHYSICS
- G01—MEASURING; TESTING
- G01C—MEASURING DISTANCES, LEVELS OR BEARINGS; SURVEYING; NAVIGATION; GYROSCOPIC INSTRUMENTS; PHOTOGRAMMETRY OR VIDEOGRAMMETRY
- G01C21/00—Navigation; Navigational instruments not provided for in groups G01C1/00 - G01C19/00
- G01C21/38—Electronic maps specially adapted for navigation; Updating thereof
- G01C21/3804—Creation or updating of map data
- G01C21/3833—Creation or updating of map data characterised by the source of data
- G01C21/3841—Data obtained from two or more sources, e.g. probe vehicles
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/22—Matching criteria, e.g. proximity measures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/20—Analysis of motion
- G06T7/215—Motion-based segmentation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/80—Analysis of captured images to determine intrinsic or extrinsic camera parameters, i.e. camera calibration
- G06T7/85—Stereo camera calibration
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/10—Image acquisition modality
- G06T2207/10004—Still image; Photographic image
- G06T2207/10012—Stereo images
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20081—Training; Learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20084—Artificial neural networks [ANN]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/30—Subject of image; Context of image processing
- G06T2207/30244—Camera pose
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02T—CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
- Y02T10/00—Road transport of goods or passengers
- Y02T10/10—Internal combustion engine [ICE] based vehicles
- Y02T10/40—Engine management systems
Abstract
The invention relates to a dynamic environment binocular vision SLAM method based on semantic segmentation, which comprises the following steps: acquiring a semantic mask of an object, wherein the semantic mask is generated by a deep learning network; acquiring multiple frames of continuous binocular images with a binocular camera; extracting feature points on each frame of binocular image and matching the feature points across adjacent frames; removing the feature points on the semantic mask and calculating the camera pose from the remaining feature points; separating dynamic objects and static objects in the binocular images based on the camera pose; recalculating the camera pose based on the separated static objects; and constructing a static map based on the updated camera pose and the feature points on the static objects. The method uses a binocular camera and, guided by semantically segmented images, can distinguish the dynamic and static objects in a scene and build a map; it is simple to operate, low in cost, and applicable to most practical scenes.
Description
Technical Field
The invention relates to the technical field of visual spatial positioning, and in particular to a dynamic environment binocular vision SLAM method based on semantic segmentation.
Background
With the development of computer technology and artificial intelligence, intelligent autonomous mobile robots have become an important research direction and hotspot in the robotics field. As mobile robots become increasingly intelligent, the demands on their self-positioning and environment mapping grow accordingly. At present, intelligent mobile robots can accomplish self-localization and mapping in known environments in some practical applications, but many challenges remain in unknown environments. The technique for accomplishing positioning and mapping in such environments is called SLAM (Simultaneous Localization and Mapping); its goal is to enable a robot to localize itself and incrementally build a map while moving through an unknown environment.
Traditional SLAM algorithms rely mainly on distance sensors with good stability, such as lidar. However, the range data obtained by lidar is very sparse, so the environment map constructed by SLAM contains only a very small number of landmark points. Such a map can only be used to improve the positioning accuracy of the robot and cannot serve other parts of robot navigation such as path planning. Furthermore, the high price, large volume, weight, and power consumption of lidar limit its application in certain fields. A camera can overcome the disadvantages of lidar in price, volume, mass, and power consumption to some extent while acquiring rich information, but it also has problems such as sensitivity to illumination changes and high computational complexity. Multi-sensor fusion SLAM algorithms have also been proposed, which can effectively alleviate the problems caused by the shortcomings of a single sensor, but they further increase the cost and the complexity of the algorithm.
Most existing visual SLAM algorithms are based on the static-environment assumption: the scene is static, with no objects in relative motion. However, actual outdoor scenes contain a large number of dynamic objects such as pedestrians and vehicles, which limits the use of SLAM systems built on this assumption in practice. To counter the drop in positioning accuracy and stability of visual SLAM in dynamic environments, existing approaches apply algorithms based on probability statistics or geometric constraints to reduce the influence of dynamic objects. For example, when there are few dynamic objects in the scene, they can be culled with probabilistic algorithms such as RANSAC (Random Sample Consensus); but when many dynamic objects appear, such algorithms can no longer distinguish them reliably. Other algorithms use optical flow: even with many dynamic objects in the scene, optical flow can separate them, but computing dense optical flow is time-consuming and reduces the execution efficiency of the SLAM algorithm.
Therefore, providing a dynamic environment binocular vision SLAM method based on semantic segmentation that is simple to operate, low in cost, and applicable to most practical scenes is a technical problem to be solved urgently by those skilled in the art.
Disclosure of Invention
The invention provides a dynamic environment binocular vision SLAM method based on semantic segmentation, aiming to solve the above technical problem.
In order to solve the above technical problem, the invention provides a dynamic environment binocular vision SLAM method based on semantic segmentation, which comprises the following steps:
acquiring a semantic mask of an object, wherein the semantic mask is generated by a deep learning network;
acquiring multiple frames of continuous binocular images with a binocular camera;
extracting feature points on each frame of binocular image, and matching the feature points on adjacent frames of binocular images;
removing the feature points on the semantic mask, and calculating the camera pose according to the remaining feature points;
separating dynamic objects and static objects in the binocular images based on the camera pose;
estimating motion parameters of the dynamic objects based on the separated dynamic objects;
recalculating the camera pose based on the separated static objects;
and constructing a static map based on the updated camera pose and the feature points on the static objects.
Preferably, the deep learning network for generating the semantic Mask is a Mask R-CNN model.
Preferably, the method for extracting the feature points on each frame of binocular image and matching the feature points on adjacent frames comprises the following steps:
extracting the feature points with the ORB method;
acquiring a descriptor for each feature point on each frame of binocular image, calculating the Hamming distance between the descriptors of a feature point in two adjacent frames, and taking the two feature points with the minimum Hamming distance as a matched pair.
Preferably, the method for judging whether a feature point is located on the semantic mask comprises: the semantic mask at least comprises a bounding box of an object; if the coordinates of the feature point fall within the bounding box, the feature point is located on the semantic mask.
Preferably, the method for calculating the camera pose according to the remaining feature points comprises: solving the camera pose with a PnP algorithm.
Preferably, the separating of dynamic objects and static objects in the binocular images based on the camera pose, and the estimating of motion parameters of the dynamic objects based on the separated dynamic objects, comprise the following steps:
separating dynamic objects: calculating the motion probability of the object corresponding to a semantic mask based on the camera pose and the positional relation between the semantic masks in adjacent frames of binocular images, and judging the object corresponding to the semantic mask as a dynamic object if the motion probability is larger than a first threshold;
dynamic object matching: for a dynamic object, calculating the Hu moments, center-point Euclidean distance, and histogram distribution of the corresponding semantic masks in adjacent frames, and from these calculating the probability that the dynamic objects in the adjacent frames match; if the probability is larger than a second threshold, the two dynamic objects in the adjacent frames are the same object; and
dynamic object motion estimation: completing the association of dynamic objects between consecutive frames through the matching, and estimating the motion parameters of the dynamic objects with a PnP algorithm.
Preferably, the step of separating the dynamic object comprises:
calculating the position of the semantic mask of the previous frame corresponding to the current frame based on the camera pose;
calculating the projected three-dimensional coordinates of all feature points on the semantic mask using a disparity map, wherein the disparity map is computed from the binocular images;
calculating the errors of corresponding feature points between the previous frame and the current frame in the x, y, and z directions, and taking the maximum of these errors as the error value of the feature point;
and converting the error value into the motion probability of the object corresponding to the semantic mask where the feature point is located, and judging whether the object corresponding to the semantic mask is a dynamic object or not based on the motion probability.
Preferably, the method for recalculating the camera pose based on the separated static objects comprises: eliminating the feature points on the semantic masks corresponding to dynamic objects, and updating the camera pose with a PnP algorithm according to the remaining feature points.
Preferably, the method for constructing the static map based on the updated camera pose and the feature points on the static object comprises the following steps:
determining a plurality of key frames based on the updated camera pose and feature points located on the static object;
matching the feature points across the key frames, and eliminating unmatched feature points;
checking whether the matched feature points satisfy the epipolar geometric constraint, and eliminating those that do not;
checking whether the remaining feature points have positive depth, sufficient parallax, and consistent reprojection error and scale, eliminating inconsistent feature points, and generating map points from the remaining feature points;
and constructing the static map based on the map points.
Preferably, before the static map is constructed, the method further comprises the step of optimizing the generated map points through bundle adjustment.
Compared with the prior art, the dynamic environment binocular vision SLAM method based on semantic segmentation provided by the invention uses a binocular camera and, guided by semantically segmented images, can distinguish the dynamic and static objects in a scene and build a map; it is simple to operate, low in cost, and applicable to most practical scenes.
Drawings
FIG. 1 is a flow chart of a dynamic environment binocular vision SLAM method based on semantic segmentation in an embodiment of the present invention;
FIG. 2 is a flow chart of separating dynamic objects according to an embodiment of the invention.
Detailed Description
In order to describe the above technical solution in more detail, the following specific examples are given to demonstrate the technical effects; it is emphasized that these examples are illustrative and do not limit the scope of the invention.
As shown in Fig. 1, the dynamic environment binocular vision SLAM method based on semantic segmentation provided by the invention comprises the following steps.
The semantic mask of an object is acquired, generated by a deep learning network; in this embodiment, the network is a Mask R-CNN model, which achieves high-quality semantic segmentation.
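For illustration only (this is not part of the claimed method), a minimal Python sketch of mask generation follows; the torchvision COCO-pretrained model is a stand-in for the patent's unspecified trained Mask R-CNN, and both 0.5 thresholds are assumed values:

```python
# Sketch: semantic masks from an off-the-shelf Mask R-CNN (stand-in model).
import torch
import torchvision
from torchvision.transforms.functional import to_tensor

model = torchvision.models.detection.maskrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

def semantic_masks(image_rgb, score_thresh=0.5, mask_thresh=0.5):
    """Return binary masks and class labels for objects detected in an RGB image."""
    with torch.no_grad():
        out = model([to_tensor(image_rgb)])[0]
    keep = out["scores"] > score_thresh          # drop low-confidence detections
    masks = out["masks"][keep, 0] > mask_thresh  # (N, H, W) boolean instance masks
    return masks.numpy(), out["labels"][keep].numpy()
```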
Multiple frames of continuous binocular images are acquired with a binocular camera, and the third-dimensional depth of two-dimensional image pixels can be recovered from the binocular images. The intrinsic and extrinsic parameters of the binocular camera, mainly the focal length f, the optical center (u, v), and the radial distortion coefficients kc1 and kc2 of the lens, can be obtained through calibration with the Zhang Zhengyou calibration method.
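A minimal calibration sketch follows, assuming OpenCV (whose chessboard calibration implements Zhang Zhengyou's method), a 9x6 board with 25 mm squares, and an iterable `calibration_pairs` of grayscale image pairs; all of these are illustrative assumptions, not specifics of the patent:

```python
# Sketch: binocular intrinsic/extrinsic calibration with OpenCV.
import cv2
import numpy as np

PATTERN = (9, 6)    # inner chessboard corners (assumed board layout)
SQUARE = 0.025      # square edge length in metres (assumed)

objp = np.zeros((PATTERN[0] * PATTERN[1], 3), np.float32)
objp[:, :2] = np.mgrid[0:PATTERN[0], 0:PATTERN[1]].T.reshape(-1, 2) * SQUARE

obj_pts, pts_l, pts_r, size = [], [], [], None
for left, right in calibration_pairs:      # grayscale chessboard pairs (assumed input)
    ok_l, c_l = cv2.findChessboardCorners(left, PATTERN)
    ok_r, c_r = cv2.findChessboardCorners(right, PATTERN)
    if ok_l and ok_r:
        obj_pts.append(objp); pts_l.append(c_l); pts_r.append(c_r)
        size = left.shape[::-1]            # image size as (width, height)

# Per-camera intrinsics: focal length, optical centre, distortion coefficients.
_, K1, d1, _, _ = cv2.calibrateCamera(obj_pts, pts_l, size, None, None)
_, K2, d2, _, _ = cv2.calibrateCamera(obj_pts, pts_r, size, None, None)
# Extrinsics between the two cameras: rotation R and translation T (baseline).
_, K1, d1, K2, d2, R, T, E, F = cv2.stereoCalibrate(
    obj_pts, pts_l, pts_r, K1, d1, K2, d2, size, flags=cv2.CALIB_FIX_INTRINSIC)
```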
Feature points are extracted on each frame of binocular image and matched across adjacent frames. The specific method comprises the following steps:
extracting the feature points with the ORB (Oriented FAST and Rotated BRIEF) method;
acquiring a descriptor for each feature point on each frame of binocular image, calculating the Hamming distance between the descriptors of a feature point in two adjacent frames, and taking the two feature points with the minimum Hamming distance as a matched pair.
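A short sketch of this step, assuming OpenCV and two grayscale frames `prev_gray` and `curr_gray` (hypothetical names); `crossCheck=True` keeps only mutual nearest neighbours, which realises the minimum-Hamming-distance pairing described above:

```python
# Sketch: ORB extraction and brute-force Hamming matching between frames.
import cv2

orb = cv2.ORB_create(nfeatures=2000)                 # feature budget is illustrative
kp1, des1 = orb.detectAndCompute(prev_gray, None)    # previous frame
kp2, des2 = orb.detectAndCompute(curr_gray, None)    # current frame

matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
matches = sorted(matcher.match(des1, des2), key=lambda m: m.distance)
```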
The feature points on the semantic mask are eliminated, and the camera pose is calculated from the remaining feature points. Whether a feature point is located on the semantic mask is judged as follows: the semantic mask at least comprises a bounding box of an object; if the coordinates of the feature point fall within the bounding box, the feature point is located on the semantic mask; otherwise it is not. The camera pose is calculated from the remaining feature points by solving with a PnP (Perspective-n-Point) algorithm: a reprojection error is constructed and optimized as in formula (1),

$$\xi^{*}=\arg\min_{\xi}\;\frac{1}{2}\sum_{i=1}^{n}\left\|u_{i}-\frac{1}{s_{i}}K\exp\left(\xi^{\wedge}\right)P_{i}\right\|_{2}^{2}\tag{1}$$

where $u_i$ is the observed pixel position of the i-th feature point, $P_i$ its three-dimensional position, $s_i$ its depth, $K$ the camera intrinsic matrix, and $\xi$ the camera pose in Lie-algebra form. The optimal solution obtained by minimizing the reprojection error is the required camera pose.
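The pose solve can be sketched as follows, assuming OpenCV and pre-computed arrays `pts3d`, `pts2d`, `on_mask`, `K`, and `dist_coeffs` (all hypothetical names); the RANSAC variant and 3-pixel threshold are implementation choices here, not requirements stated by the patent:

```python
# Sketch: camera pose from the remaining (off-mask) matches via PnP.
import cv2
import numpy as np

keep = ~on_mask                      # discard feature points on semantic masks
ok, rvec, tvec, inliers = cv2.solvePnPRansac(
    pts3d[keep].astype(np.float64),  # 3D points from the previous frame
    pts2d[keep].astype(np.float64),  # their matched pixels in the current frame
    K, dist_coeffs, reprojectionError=3.0)
R_cam, _ = cv2.Rodrigues(rvec)       # rotation matrix of the recovered pose
```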
Dynamic objects and static objects in the binocular images are separated based on the camera pose, by the following method:
Separating dynamic objects: the motion probability of the object corresponding to a semantic mask is calculated based on the camera pose and the positional relation between the semantic masks in adjacent frames of binocular images; if the motion probability is larger than a first threshold, the object corresponding to the semantic mask is judged to be a dynamic object. The specific steps, shown in Fig. 2, include:
calculating the position of the semantic mask of the previous frame corresponding to the current frame based on the camera pose;
calculating the projected three-dimensional coordinates of all feature points on the semantic mask using a disparity map, wherein the disparity map is computed from the binocular images; specifically, the disparity map can be computed with the ELAS (Efficient Large-Scale Stereo Matching) algorithm;
calculating the errors of corresponding feature points between the previous frame and the current frame in the x, y, and z directions, and taking the maximum of these errors as the error value of the feature point;
and converting the error value into the motion probability of the object corresponding to the semantic mask where the feature point is located, and judging whether the object corresponding to the semantic mask is a dynamic object or not based on the motion probability.
From the camera imaging principle, the conversion between the three-dimensional camera coordinate system and the (two-dimensional) pixel coordinate system, and between depth and disparity, is

$$u=f_{x}\frac{X}{Z}+c_{x},\qquad v=f_{y}\frac{Y}{Z}+c_{y}\tag{2}$$

$$Z=\frac{fb}{d}\tag{3}$$

where $(u,v)$ are pixel coordinates, $(X,Y,Z)$ are camera-frame coordinates, $f_x$, $f_y$, $c_x$, $c_y$ are the camera intrinsics, $b$ is the stereo baseline, and $d$ is the disparity.

Denote the coordinate set of the j-th semantic mask of frame t-1 in the pixel coordinate system as $M_{t-1}^{j}$. Through formulas (2) and (3), the three-dimensional coordinate set $P_{t-1}^{j}$ of the semantic mask at that moment is obtained. The three-dimensional point set after the camera motion is obtained through formula (4):

$$P_{t}^{j}=T_{t,t-1}P_{t-1}^{j}\tag{4}$$

where $T_{t,t-1}$ is the estimated camera pose transform from frame t-1 to frame t. Through formula (2), $P_{t}^{j}$ is converted to a set $M_{t}^{j}$ in the pixel coordinate system; then, using $M_{t}^{j}$, the disparity map of frame t, and formulas (2) and (3), the observed three-dimensional set $\tilde{P}_{t}^{j}$ is computed.

Denote $p_{i}=(x_{i},y_{i},z_{i})$ the i-th point of $P_{t}^{j}$ and $\tilde{p}_{i}=(\tilde{x}_{i},\tilde{y}_{i},\tilde{z}_{i})$ the i-th point of $\tilde{P}_{t}^{j}$; the error $\delta_{i}$ between the two points is calculated as

$$\delta_{i}=\max\left(\left|x_{i}-\tilde{x}_{i}\right|,\left|y_{i}-\tilde{y}_{i}\right|,\left|z_{i}-\tilde{z}_{i}\right|\right)\tag{5}$$

The error of the object corresponding to the mask aggregates its N point errors, e.g. as the mean:

$$\Delta_{j}=\frac{1}{N}\sum_{i=1}^{N}\delta_{i}\tag{6}$$

The motion probability $S(\Delta_{j})$ is then obtained by mapping $\Delta_{j}$ into $[0,1]$ with a monotonically increasing function, for example the logistic function:

$$S(\Delta_{j})=\frac{1}{1+e^{-\Delta_{j}}}\tag{7}$$
Dynamic object matching: for a dynamic object, the Hu moments (a set of image moments), center-point Euclidean distance, and histogram distribution of the corresponding semantic masks in adjacent frames of binocular images are calculated, and from these the probability that the dynamic objects in the adjacent frames match is computed; if the probability is larger than a second threshold, the two dynamic objects in the adjacent frames are the same object. The Hu moments of an image are image features invariant to translation, rotation, and scale.
The general moment of an image is calculated as

$$m_{pq}=\sum_{x}\sum_{y}x^{p}y^{q}I(x,y)\tag{8}$$

Computing the Hu moments requires the central moments; first the barycenter coordinates are calculated:

$$\bar{x}=\frac{m_{10}}{m_{00}},\qquad\bar{y}=\frac{m_{01}}{m_{00}}\tag{9}$$

Then the central moments are constructed:

$$\mu_{pq}=\sum_{x}\sum_{y}\left(x-\bar{x}\right)^{p}\left(y-\bar{y}\right)^{q}I(x,y)\tag{10}$$

And the central moments are normalized:

$$\eta_{pq}=\frac{\mu_{pq}}{\mu_{00}^{1+(p+q)/2}}\tag{11}$$

The Hu moments constructed from the normalized central moments comprise 7 invariant moments:

$$\begin{aligned}
\Phi_{1}&=\eta_{20}+\eta_{02}\\
\Phi_{2}&=\left(\eta_{20}-\eta_{02}\right)^{2}+4\eta_{11}^{2}\\
\Phi_{3}&=\left(\eta_{30}-3\eta_{12}\right)^{2}+\left(3\eta_{21}-\eta_{03}\right)^{2}\\
\Phi_{4}&=\left(\eta_{30}+\eta_{12}\right)^{2}+\left(\eta_{21}+\eta_{03}\right)^{2}\\
\Phi_{5}&=\left(\eta_{30}-3\eta_{12}\right)\left(\eta_{30}+\eta_{12}\right)\left[\left(\eta_{30}+\eta_{12}\right)^{2}-3\left(\eta_{21}+\eta_{03}\right)^{2}\right]+\left(3\eta_{21}-\eta_{03}\right)\left(\eta_{21}+\eta_{03}\right)\left[3\left(\eta_{30}+\eta_{12}\right)^{2}-\left(\eta_{21}+\eta_{03}\right)^{2}\right]\\
\Phi_{6}&=\left(\eta_{20}-\eta_{02}\right)\left[\left(\eta_{30}+\eta_{12}\right)^{2}-\left(\eta_{21}+\eta_{03}\right)^{2}\right]+4\eta_{11}\left(\eta_{30}+\eta_{12}\right)\left(\eta_{21}+\eta_{03}\right)\\
\Phi_{7}&=\left(3\eta_{21}-\eta_{03}\right)\left(\eta_{30}+\eta_{12}\right)\left[\left(\eta_{30}+\eta_{12}\right)^{2}-3\left(\eta_{21}+\eta_{03}\right)^{2}\right]+\left(3\eta_{12}-\eta_{30}\right)\left(\eta_{21}+\eta_{03}\right)\left[3\left(\eta_{30}+\eta_{12}\right)^{2}-\left(\eta_{21}+\eta_{03}\right)^{2}\right]
\end{aligned}\tag{12}$$
Denote $\Phi^{t-1,j}$ the Hu moment vector of the j-th semantic mask of frame t-1; the Hu-moment distance between two semantic masks in adjacent frames is then, for example, the L1 distance

$$D_{hu}=\sum_{m=1}^{7}\left|\Phi_{m}^{t-1,j}-\Phi_{m}^{t,k}\right|\tag{13}$$
The center position of each semantic mask is calculated, and the Euclidean distance between the center positions of the corresponding masks in the previous and current frames is recorded as $D_{c}$ (14).
The histogram distribution of each semantic mask is calculated and normalized, recorded as $H$; then the KL divergence (Kullback-Leibler divergence, also called relative entropy) between the corresponding semantic masks of the previous and current frames, $D_{KL}$, is calculated.
Combining the Hu-moment distance, the Euclidean distance, and the histogram divergence, the matching probability is estimated, e.g. as a weighted combination:

$$P_{match}=w_{1}e^{-D_{hu}}+w_{2}e^{-D_{c}}+w_{3}e^{-D_{KL}}\tag{15}$$
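A sketch of these three cues follows, assuming OpenCV, boolean instance masks, grayscale uint8 images, and equal weights w1 = w2 = w3 (the actual weights are not disclosed); the L1 Hu-moment distance and the exponential combination mirror the hedged formulas (13) and (15):

```python
# Sketch: mask-matching cues (Hu moments, centroid distance, histogram KL).
import cv2
import numpy as np

def match_probability(mask_a, mask_b, img_a, img_b):
    # Hu-moment distance, eq. (13) (L1 form assumed).
    hu_a = cv2.HuMoments(cv2.moments(mask_a.astype(np.uint8))).ravel()
    hu_b = cv2.HuMoments(cv2.moments(mask_b.astype(np.uint8))).ravel()
    d_hu = np.abs(hu_a - hu_b).sum()

    # Centroid Euclidean distance, eq. (14).
    (ya, xa), (yb, xb) = [np.argwhere(m).mean(axis=0) for m in (mask_a, mask_b)]
    d_c = np.hypot(xa - xb, ya - yb)

    # Normalized grayscale histograms over each mask, then KL divergence.
    h_a = cv2.calcHist([img_a], [0], mask_a.astype(np.uint8), [32], [0, 256]).ravel()
    h_b = cv2.calcHist([img_b], [0], mask_b.astype(np.uint8), [32], [0, 256]).ravel()
    h_a, h_b = h_a / h_a.sum(), h_b / h_b.sum()
    d_kl = np.sum(h_a * np.log((h_a + 1e-9) / (h_b + 1e-9)))

    w1 = w2 = w3 = 1.0 / 3.0                      # assumed weights, eq. (15)
    return w1*np.exp(-d_hu) + w2*np.exp(-d_c) + w3*np.exp(-d_kl)
```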
The motion parameters of a dynamic object are estimated based on the separated dynamic objects, i.e., dynamic object motion estimation: the association of dynamic objects between consecutive frames is completed through the matching above, and the motion parameters of the dynamic objects are estimated with a PnP algorithm.
The camera pose is then recalculated based on the separated static objects: the feature points on the semantic masks corresponding to dynamic objects are eliminated, and the camera pose is updated with a PnP algorithm from the remaining feature points; the specific calculation follows the method used when the camera pose was first computed.
Based on the updated camera pose and the feature points on the static object, constructing a static map, wherein the specific method comprises the following steps:
determining a plurality of key frames based on the updated camera pose and feature points located on the static object;
matching the feature points across the key frames: the matched feature points are triangulated, and points not yet matched are matched against unmatched feature points in other key frames until all possible matches are found; the remaining unmatched feature points are eliminated;
checking whether the matched feature points satisfy the epipolar geometric constraint, and eliminating those that do not;
checking whether the remaining feature points have positive depth, sufficient parallax, and consistent reprojection error and scale, eliminating inconsistent feature points, and generating map points from the remaining feature points;
and constructing the static map based on the map points.
Preferably, before the static map is constructed, the method further comprises the step of optimizing the generated map points through bundle adjustment (BA).
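A sketch of map-point creation from two keyframes follows, assuming OpenCV and 3x4 projection matrices `P1`, `P2` of the form K[R|t] (hypothetical names); the positive-depth (cheirality) and reprojection-error checks correspond to the consistency checks listed above, with an illustrative 2-pixel threshold:

```python
# Sketch: triangulating matched keyframe points into candidate map points.
import cv2
import numpy as np

def make_map_points(P1, P2, pts1, pts2, max_reproj=2.0):
    X_h = cv2.triangulatePoints(P1, P2, pts1.T, pts2.T)  # 4xN homogeneous points
    X = (X_h[:3] / X_h[3]).T                             # Nx3 Euclidean points

    def reproject(P, X):
        x = P @ np.hstack([X, np.ones((len(X), 1))]).T   # project into the keyframe
        return (x[:2] / x[2]).T, x[2]                    # pixels and depths

    uv1, z1 = reproject(P1, X)
    uv2, z2 = reproject(P2, X)
    err = np.maximum(np.linalg.norm(uv1 - pts1, axis=1),
                     np.linalg.norm(uv2 - pts2, axis=1))
    good = (z1 > 0) & (z2 > 0) & (err < max_reproj)      # cheirality + reprojection
    return X[good]                                       # surviving map points
```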
By processing the binocular images, the method identifies the dynamic objects in them, estimates the poses of the camera and of the dynamic objects, and constructs an environment map, meeting the mobile robot's requirements for a three-dimensional map.
In summary, the dynamic environment binocular vision SLAM method based on semantic segmentation provided by the invention comprises the following steps: acquiring a semantic mask of an object, wherein the semantic mask is generated by a deep learning network; acquiring multiple frames of continuous binocular images with a binocular camera; extracting feature points on each frame of binocular image and matching the feature points across adjacent frames; removing the feature points on the semantic mask and calculating the camera pose from the remaining feature points; separating dynamic objects and static objects in the binocular images based on the camera pose; estimating motion parameters of the dynamic objects based on the separated dynamic objects; recalculating the camera pose based on the separated static objects; and constructing a static map based on the updated camera pose and the feature points on the static objects. The method uses a binocular camera and, guided by semantically segmented images, can distinguish the dynamic and static objects in a scene and build a map; it is simple to operate, low in cost, and applicable to most practical scenes.
It will be apparent to those skilled in the art that various modifications and variations can be made to the present invention without departing from the spirit or scope of the invention. Thus, it is intended that the present invention also include such modifications and alterations insofar as they come within the scope of the appended claims or the equivalents thereof.
Claims (7)
1. A dynamic environment binocular vision SLAM method based on semantic segmentation is characterized by comprising the following steps:
acquiring a semantic mask of an object, wherein the semantic mask is generated through a deep learning network;
acquiring multiple frames of continuous binocular images with a binocular camera;
extracting feature points on each frame of binocular image, and matching the feature points on adjacent frames of binocular images;
removing the feature points on the semantic mask, and calculating the camera pose according to the remaining feature points;
separating a dynamic object and a static object on the binocular image based on the camera pose;
estimating motion parameters of the dynamic object based on the separated dynamic object;
recalculating the camera pose based on the separated static object;
constructing a static map based on the updated camera pose and the feature points on the static object;
the separating of a dynamic object and a static object on the binocular image based on the camera pose, and the estimating of motion parameters of the dynamic object based on the separated dynamic object, comprise the following steps:
separating dynamic objects: calculating the motion probability of the object corresponding to the semantic mask based on the camera pose and the positional relation between the semantic masks in adjacent frames of binocular images, and judging the object corresponding to the semantic mask as a dynamic object if the motion probability is larger than a first threshold;
dynamic object matching: for the dynamic object, calculating the Hu moments, center-point Euclidean distance, and histogram distribution of the corresponding semantic masks in adjacent frames of binocular images, and calculating, based on these, the probability that the dynamic objects in the adjacent frames match, wherein if the probability is larger than a second threshold, the two dynamic objects in the adjacent frames of binocular images are the same object; and
dynamic object motion estimation: completing the association of dynamic objects between consecutive frames through the matching of the dynamic objects, and estimating the motion parameters of the dynamic objects through a PnP algorithm;
the method for recalculating the camera pose based on the separated static object comprises the following steps: removing the feature points on the semantic mask corresponding to the dynamic object, and updating the camera pose with a PnP algorithm according to the remaining feature points;
the method for constructing the static map based on the updated camera pose and the feature points on the static object comprises the following steps:
determining a plurality of key frames based on the updated camera pose and feature points located on the static object;
matching the feature points across the key frames, and eliminating unmatched feature points;
checking whether the matched feature points satisfy the epipolar geometric constraint, and eliminating those that do not;
checking whether the remaining feature points have positive depth, sufficient parallax, and consistent reprojection error and scale, eliminating inconsistent feature points, and generating map points based on the remaining feature points;
and constructing the static map based on the map points.
2. The semantic segmentation based dynamic environment binocular vision SLAM method of claim 1, wherein the deep learning network used to generate the semantic Mask is a Mask R-CNN model.
3. The dynamic environment binocular vision SLAM method based on semantic segmentation of claim 1, wherein the extracting feature points on the binocular image of each frame and matching feature points on the binocular image of the adjacent frame comprises:
extracting the feature points with an ORB method;
acquiring a descriptor for each feature point on each frame of binocular image, calculating the Hamming distance between the descriptors of a feature point in two adjacent frames of binocular images, and taking the two feature points with the minimum Hamming distance as a matched pair.
4. The dynamic environment binocular vision SLAM method based on semantic segmentation of claim 1, wherein the method of judging whether the feature points are located on the semantic mask comprises: the semantic mask at least comprises a bounding box of an object, and if the coordinates of a feature point fall within the bounding box, the feature point is located on the semantic mask.
5. The dynamic environment binocular vision SLAM method based on semantic segmentation according to claim 1, wherein the method of calculating the camera pose according to the remaining feature points comprises: solving the camera pose with a PnP algorithm.
6. The semantic segmentation-based dynamic environment binocular vision SLAM method of claim 1, wherein the step of separating the dynamic objects comprises:
calculating the position of the semantic mask of the previous frame corresponding to the current frame based on the camera pose;
calculating the projected three-dimensional coordinates of all feature points on the semantic mask using a disparity map, wherein the disparity map is computed from the binocular images;
calculating the errors of corresponding feature points between the previous frame and the current frame in the x, y, and z directions, and taking the maximum of these errors as the error value of the feature point;
and converting the error value into the motion probability of the object corresponding to the semantic mask where the feature point is located, and judging whether the object corresponding to the semantic mask is a dynamic object or not based on the motion probability.
7. The semantic segmentation-based dynamic environment binocular vision SLAM method of claim 1, further comprising the step of optimizing the generated map points by bundle adjustment before constructing the static map.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111373890.7A CN114140527B (en) | 2021-11-19 | 2021-11-19 | Dynamic environment binocular vision SLAM method based on semantic segmentation |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111373890.7A CN114140527B (en) | 2021-11-19 | 2021-11-19 | Dynamic environment binocular vision SLAM method based on semantic segmentation |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114140527A CN114140527A (en) | 2022-03-04 |
CN114140527B true CN114140527B (en) | 2024-09-10 |
Family
ID=80390414
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111373890.7A Active CN114140527B (en) | 2021-11-19 | 2021-11-19 | Dynamic environment binocular vision SLAM method based on semantic segmentation |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114140527B (en) |
Families Citing this family (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115049949B (en) * | 2022-04-29 | 2024-09-24 | 哈尔滨工程大学 | Object expression method based on binocular vision |
CN116524026B (en) * | 2023-05-08 | 2023-10-27 | 哈尔滨理工大学 | Dynamic vision SLAM method based on frequency domain and semantics |
CN116883586B (en) * | 2023-06-14 | 2024-08-23 | 重庆大学 | Terrain semantic map construction method, system and product based on binocular camera |
CN116958265A (en) * | 2023-09-19 | 2023-10-27 | 交通运输部天津水运工程科学研究所 | Ship pose measurement method and system based on binocular vision |
CN117788730B (en) * | 2023-12-08 | 2024-10-15 | 中交机电工程局有限公司 | Semantic point cloud map construction method |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110349213B (en) * | 2019-06-28 | 2023-12-12 | Oppo广东移动通信有限公司 | Pose determining method and device based on depth information, medium and electronic equipment |
CN113516664B (en) * | 2021-09-02 | 2024-07-26 | 长春工业大学 | Visual SLAM method based on semantic segmentation dynamic points |
Non-Patent Citations (1)
Title |
---|
Dynamic environment visual SLAM algorithm based on joint geometric-semantic constraints; Shen Yehu et al.; Journal of Data Acquisition and Processing; 2022-05-31; Vol. 37, No. 3; full text *
Also Published As
Publication number | Publication date |
---|---|
CN114140527A (en) | 2022-03-04 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN114140527B (en) | Dynamic environment binocular vision SLAM method based on semantic segmentation | |
CN111462135B (en) | Semantic mapping method based on visual SLAM and two-dimensional semantic segmentation | |
CN112785702B (en) | SLAM method based on tight coupling of 2D laser radar and binocular camera | |
CN110097553B (en) | Semantic mapping system based on instant positioning mapping and three-dimensional semantic segmentation | |
CN110070615B (en) | Multi-camera cooperation-based panoramic vision SLAM method | |
CN109345588B (en) | Tag-based six-degree-of-freedom attitude estimation method | |
CN111201451A (en) | Method and device for detecting object in scene based on laser data and radar data of scene | |
CN110688905B (en) | Three-dimensional object detection and tracking method based on key frame | |
CN108537844B (en) | Visual SLAM loop detection method fusing geometric information | |
US20220051425A1 (en) | Scale-aware monocular localization and mapping | |
JP6782903B2 (en) | Self-motion estimation system, control method and program of self-motion estimation system | |
CN111882602B (en) | Visual odometer implementation method based on ORB feature points and GMS matching filter | |
CN112419497A (en) | Monocular vision-based SLAM method combining feature method and direct method | |
Shi et al. | An improved lightweight deep neural network with knowledge distillation for local feature extraction and visual localization using images and LiDAR point clouds | |
CN110070578B (en) | Loop detection method | |
CN114549542A (en) | Visual semantic segmentation method, device and equipment | |
CN111899345B (en) | Three-dimensional reconstruction method based on 2D visual image | |
CN116468786B (en) | Semantic SLAM method based on point-line combination and oriented to dynamic environment | |
CN117974786A (en) | Multi-vision-based dynamic environment reconstruction and measurement method and system | |
Gan et al. | A dynamic detection method to improve SLAM performance | |
CN114648639B (en) | Target vehicle detection method, system and device | |
CN117409386A (en) | Garbage positioning method based on laser vision fusion | |
CN115880428A (en) | Animal detection data processing method, device and equipment based on three-dimensional technology | |
CN114419259A (en) | Visual positioning method and system based on physical model imaging simulation | |
Zhou et al. | 2D Grid map for navigation based on LCSD-SLAM |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||