CN111462135B - Semantic mapping method based on visual SLAM and two-dimensional semantic segmentation - Google Patents
Semantic mapping method based on visual SLAM and two-dimensional semantic segmentation
- Publication number
- CN111462135B CN111462135B CN202010246158.2A CN202010246158A CN111462135B CN 111462135 B CN111462135 B CN 111462135B CN 202010246158 A CN202010246158 A CN 202010246158A CN 111462135 B CN111462135 B CN 111462135B
- Authority
- CN
- China
- Prior art keywords
- semantic
- camera
- image
- dimensional
- map
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Links
- 238000000034 method Methods 0.000 title claims abstract description 55
- 230000011218 segmentation Effects 0.000 title claims abstract description 50
- 238000013507 mapping Methods 0.000 title claims abstract description 45
- 230000000007 visual effect Effects 0.000 title claims description 32
- 238000004364 calculation method Methods 0.000 claims abstract description 11
- 238000007781 pre-processing Methods 0.000 claims abstract description 7
- 238000013528 artificial neural network Methods 0.000 claims abstract description 5
- 238000001514 detection method Methods 0.000 claims description 17
- 238000005457 optimization Methods 0.000 claims description 15
- 230000033001 locomotion Effects 0.000 claims description 13
- 238000010276 construction Methods 0.000 claims description 10
- 238000012937 correction Methods 0.000 claims description 10
- 239000011159 matrix material Substances 0.000 claims description 10
- 238000013527 convolutional neural network Methods 0.000 claims description 8
- 238000010586 diagram Methods 0.000 claims description 8
- 239000013589 supplement Substances 0.000 claims description 6
- 238000002372 labelling Methods 0.000 claims description 5
- 238000012216 screening Methods 0.000 claims description 5
- 238000012790 confirmation Methods 0.000 claims description 2
- 230000001960 triggered effect Effects 0.000 claims description 2
- 238000013135 deep learning Methods 0.000 abstract description 6
- 238000000605 extraction Methods 0.000 abstract description 4
- 230000004927 fusion Effects 0.000 abstract description 3
- 238000003384 imaging method Methods 0.000 description 7
- 230000008569 process Effects 0.000 description 7
- 230000003287 optical effect Effects 0.000 description 3
- 238000012360 testing method Methods 0.000 description 3
- 238000005516 engineering process Methods 0.000 description 2
- 230000006872 improvement Effects 0.000 description 2
- 230000004807 localization Effects 0.000 description 2
- 238000005259 measurement Methods 0.000 description 2
- 230000008447 perception Effects 0.000 description 2
- 230000035508 accumulation Effects 0.000 description 1
- 238000009825 accumulation Methods 0.000 description 1
- 230000003190 augmentative effect Effects 0.000 description 1
- 230000009286 beneficial effect Effects 0.000 description 1
- 230000000295 complement effect Effects 0.000 description 1
- 230000007547 defect Effects 0.000 description 1
- 230000007812 deficiency Effects 0.000 description 1
- 238000011161 development Methods 0.000 description 1
- 230000018109 developmental process Effects 0.000 description 1
- 230000007613 environmental effect Effects 0.000 description 1
- 238000011156 evaluation Methods 0.000 description 1
- 230000006870 function Effects 0.000 description 1
- 238000003709 image segmentation Methods 0.000 description 1
- 230000002401 inhibitory effect Effects 0.000 description 1
- 238000009434 installation Methods 0.000 description 1
- 230000003993 interaction Effects 0.000 description 1
- 238000007726 management method Methods 0.000 description 1
- 238000012986 modification Methods 0.000 description 1
- 230000004048 modification Effects 0.000 description 1
- 238000012545 processing Methods 0.000 description 1
- 230000001502 supplementing effect Effects 0.000 description 1
- 230000009466 transformation Effects 0.000 description 1
- 238000013519 translation Methods 0.000 description 1
Images
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/10—Segmentation; Edge detection
- G06T7/11—Region-based segmentation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/50—Depth or shape recovery
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/80—Analysis of captured images to determine intrinsic or extrinsic camera parameters, i.e. camera calibration
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/90—Determination of colour characteristics
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/10—Image acquisition modality
- G06T2207/10024—Color image
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20084—Artificial neural networks [ANN]
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02T—CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
- Y02T10/00—Road transport of goods or passengers
- Y02T10/10—Internal combustion engine [ICE] based vehicles
- Y02T10/40—Engine management systems
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Computational Linguistics (AREA)
- Evolutionary Computation (AREA)
- Artificial Intelligence (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Health & Medical Sciences (AREA)
- Data Mining & Analysis (AREA)
- Life Sciences & Earth Sciences (AREA)
- General Health & Medical Sciences (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- General Engineering & Computer Science (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Image Analysis (AREA)
Abstract
The invention relates to the field of the cross-fusion of computer vision and deep learning, and in particular to a semantic mapping method based on visual SLAM and two-dimensional semantic segmentation. The method comprises the following steps: S1, calibrating camera parameters and correcting camera distortion; S2, acquiring an image frame sequence; S3, preprocessing the images; S4, judging whether the current image frame is a key frame, and if so, going to step S6, otherwise going to step S5; S5, dynamic blur compensation; S6, semantic segmentation: extracting ORB feature points from the image frame and performing semantic segmentation with a mask region convolutional neural network (Mask R-CNN) algorithm model; S7, pose calculation: computing the camera pose with a sparse SLAM algorithm model; S8, building a dense semantic map with the assistance of the semantic information, realizing three-dimensional semantic mapping of the global point cloud map. The method improves the performance of an unmanned aerial vehicle semantic mapping system and markedly improves the robustness of feature point extraction and matching in dynamic scenes.
Description
Technical Field
The invention relates to the field of the cross-fusion of computer vision and deep learning, and in particular to a semantic mapping method based on visual SLAM and two-dimensional semantic segmentation.
Background
Unmanned aerial vehicles generally consist of three modules: intelligent decision making, environment perception and motion control, of which environment perception is the foundation of everything else.
To perceive the surrounding environment, an unmanned aerial vehicle requires a set of stable, powerful sensors to act as its "eyes", together with corresponding algorithms and a powerful processing unit to "read" the objects it sees.
In the environment perception module of an unmanned aerial vehicle, the visual sensor is an indispensable part. The visual sensor may be a camera; compared with lidar and millimeter-wave radar, a camera has higher resolution and can capture enough environmental detail to describe, for example, the appearance and shape of objects or to read signs.
Although the Global Positioning System (GPS) aids the positioning process, it cannot replace the visual sensor, because interference from tall trees, buildings, tunnels and the like can make GPS positioning unreliable.
Simultaneous Localization and Mapping (SLAM) refers to a process in which an agent carrying specific sensors estimates its own motion trajectory from the acquired sensor data without any prior information, while building a map of the surrounding environment. It is widely used in robotics, unmanned aerial vehicles, autonomous driving, augmented reality, virtual reality and other applications.
SLAM can be classified into laser SLAM and visual SLAM.
Laser SLAM is mature in both theory and engineering practice, but it has a fatal drawback in robotic applications: the structural information perceived by the lidar is two-dimensional, carries little information and loses a large amount of environmental detail. At the same time, its high cost, large volume and lack of semantic information limit it in certain application scenarios.
The information source perceived by visual SLAM is the camera image.
Depending on the camera type, visual SLAM can be divided into three categories: monocular, binocular (stereo) and depth-camera SLAM. Similar to lidar, a depth camera can directly measure the distance to an obstacle by collecting a point cloud. Depth cameras are simple in structure, easy to install and operate, and low in cost, and therefore have broad application prospects.
With the advent of deep learning, visual SLAM has made great progress in recent years.
Most visual SLAM schemes work at the level of feature points or pixels, whereas an unmanned aerial vehicle needs semantic information in order to accomplish specific tasks or interact intelligently with its surroundings.
A visual SLAM system should be able to select useful information and discard invalid information.
With the development of deep learning, many mature methods for object detection and semantic segmentation provide the conditions for accurate semantic mapping. Semantic maps help improve the autonomy and robustness of an unmanned aerial vehicle, enable it to complete more complex tasks, and turn path planning into task planning.
With the improvement of hardware computing power and the optimization of algorithm structures, deep learning has attracted more and more attention.
Great leaps have been made in the field of computer vision; for RGB images, the work can be roughly divided into object detection and semantic segmentation.
Early work mainly focused on proposing object detection frameworks to achieve more accurate detection.
The mainstream deep-learning object detection frameworks are mainly based on CNNs (Convolutional Neural Networks), among which the YOLO (You Only Look Once) series and the R-CNN (Region-CNN, region-based convolutional neural network) series are comparatively efficient.
As the technology for perceiving objects in three-dimensional data matures, the demand for three-dimensional understanding becomes increasingly urgent. Because of the irregularity of point clouds, most researchers convert the points into regular voxel or grid models and make predictions with deep neural networks.
Direct semantic segmentation of the point cloud space consumes considerable computational resources and weakens the interrelationship between spatial points.
The PointNet proposed in 2017 is the first deep neural network that can directly process the original three-dimensional point cloud.
Most existing visual SLAM systems adopt a dense mapping method that lacks semantic information and cannot meet the requirements of intelligent applications.
A typical assumption of visual SLAM algorithms is that the scene is static; the presence of dynamic objects not only affects the estimation of the camera pose but also leaves ghost trails in the map, degrading map quality.
Images captured by a camera under high-speed motion are prone to blur, which greatly affects the extraction and matching of feature points.
Disclosure of Invention
The invention aims to provide a semantic mapping method based on visual SLAM and two-dimensional semantic segmentation, which solves the technical problem that dynamic objects moving at high speed degrade the quality of the map being built.
In order to achieve the above object, the present invention provides a semantic mapping method based on visual SLAM and two-dimensional semantic segmentation, comprising the following steps:
s1, calibrating camera parameters and correcting camera distortion;
s2, acquiring an image frame sequence, wherein the image frame sequence comprises an RGB image and a depth image;
s3, preprocessing an image, and obtaining coordinates of a real three-dimensional space point corresponding to each pixel point in the RGB image by adopting a pinhole camera model;
s4, judging whether the current image frame is a key frame, if so, turning to a step S6, and if not, turning to a step S5;
s5, performing dynamic blur compensation, and calculating the image block centroids of the current image frame as semantic feature points to supplement the ORB feature points;
s6, performing semantic segmentation on the image frame: extracting ORB feature points, performing semantic segmentation with a mask region convolutional neural network (Mask R-CNN) algorithm model, and obtaining the semantic information of each pixel of the frame image;
s7, pose calculation, namely calculating the pose of the camera by using a sparse SLAM algorithm model;
s8, inputting semantic information into a sparse SLAM algorithm model, assisting in dense semantic map construction, completing traversing of key frames, and realizing three-dimensional semantic map construction of a global point cloud map.
In an embodiment, the correcting camera distortion in step S1 further includes the steps of:
s11, a three-dimensional space point P(X, Y, Z) in the camera coordinate system is projected onto the normalized image plane, giving the normalized coordinates [x, y]^T of the point;
S12, radial and tangential distortion correction is applied to the point [x, y]^T on the normalized plane by the following formulas:
x_corrected = x(1 + k_1 r^2 + k_2 r^4 + k_3 r^6) + 2 p_1 x y + p_2 (r^2 + 2 x^2)
y_corrected = y(1 + k_1 r^2 + k_2 r^4 + k_3 r^6) + p_1 (r^2 + 2 y^2) + 2 p_2 x y
where [x_corrected, y_corrected]^T are the corrected point coordinates, p_1, p_2 are the tangential distortion coefficients of the camera, k_1, k_2, k_3 are the radial distortion coefficients of the camera, and r is the distance from the point P to the origin of the coordinate system;
S13, the corrected point [x_corrected, y_corrected]^T is projected onto the pixel plane through the intrinsic parameter matrix to obtain its correct position [u, v]^T on the image, realized by the following formulas:
u = f_x · x_corrected + c_x
v = f_y · y_corrected + c_y
where f_x, f_y, c_x, c_y are the intrinsic parameters of the camera.
In an embodiment, the image preprocessing in step S3 further includes: the mapping relation from a pixel point [u, v]^T to the real three-dimensional space point P(X, Y, Z) satisfies the following formula:
Z [u, v, 1]^T = K P, with K = [[f_x, 0, c_x], [0, f_y, c_y], [0, 0, 1]]
where K is called the camera intrinsic parameter matrix, f_x, f_y, c_x, c_y are the intrinsic parameters of the camera, P is the real three-dimensional space point coordinate, and [u, v]^T is the pixel coordinate.
In one embodiment, the key frames of step S4 are screened using a sparse SLAM algorithm model.
In an embodiment, the centroid of the image block in the step S5 is obtained by:
labeling each object in the frame image as a specific class;
for each segmented object there is a corresponding labeled region, and the segmented image is called an image block;
calculating the corresponding centroid C as a semantic feature point to supplement the ORB feature points, where C = (m_10/m_00, m_01/m_00) and m_pq = Σ_{(x, y) ∈ B} x^p y^q I(x, y) is the (p+q)-order moment of the image block B, with I(x, y) the pixel intensity.
In an embodiment, the step S6 further includes:
the semantic information of each pixel point comprises a semantic classification label, bounding box coordinates and a confidence score of the classification;
and based on the semantic segmentation result, removing the ORB feature points extracted from any region whose designated category is a dynamic object.
In an embodiment, the semantic segmentation performed by the Mask R-CNN algorithm model in step S6 further includes:
extracting features at different levels of the input image through a feature pyramid network;
proposing regions of interest through a region proposal network;
performing proposal region alignment using RoI Align (region-of-interest alignment);
performing mask segmentation using a fully convolutional network;
and determining region coordinates and performing category classification using fully connected layers.
In an embodiment, the sparse SLAM algorithm model further includes a tracking thread, a local mapping thread and a loop detection thread:
the tracking thread localizes the camera for every frame by finding feature matches with the local map and minimizing the reprojection error using motion-only bundle adjustment;
the local mapping thread manages and optimizes the local map by performing local bundle adjustment; the co-visibility relationship between key frames is maintained through map points, and the poses of the co-visible key frames and the map points are optimized by local bundle adjustment;
the loop detection thread detects large loops and corrects accumulated drift by performing pose-graph optimization; it accelerates the screening of loop-closure candidate frames, optimizes the scale, and optimizes the essential graph and map points by global bundle adjustment.
In an embodiment, the sparse SLAM algorithm model further includes a global bundle adjustment optimization thread, which is triggered after confirmation by the loop detection thread and, after the pose-graph optimization, computes the optimal structure and motion result of the whole system.
In an embodiment, the pose calculation in step S7 further includes: initially computing the camera pose by solving a PnP problem, refining it through back-end pose-graph optimization, and constructing the minimized reprojection error of the camera pose estimation:
ξ* = arg min_ξ (1/2) Σ_{i=1}^{n} || u_i − (1/s_i) K exp(ξ^) P_i ||_2^2
where u_i are the pixel coordinates, P_i are the camera coordinates of the feature points, ξ is the Lie algebra element corresponding to the camera pose, s_i is the depth of the feature point, and K is the camera intrinsic parameter matrix.
According to the semantic mapping method based on visual SLAM and two-dimensional semantic segmentation of the invention, a dense semantic map with dynamic objects removed is built from a Mask R-CNN algorithm model and a sparse SLAM algorithm model based on ORB feature points; inter-frame information and the semantic information of the image frames are used to improve the performance of the unmanned aerial vehicle semantic mapping system and to improve the robustness of feature point extraction and matching in dynamic scenes.
Drawings
The above and other features, properties and advantages of the present invention will become more apparent from the following description of embodiments taken in conjunction with the accompanying drawings in which like reference characters designate like features throughout the drawings, and in which:
FIG. 1 discloses a flow chart of a method according to an embodiment of the invention;
FIG. 2 discloses a calibration plate for camera calibration according to an embodiment of the present invention;
FIG. 3a discloses a pinhole imaging model diagram of a pinhole camera according to one embodiment of the invention;
FIG. 3b discloses a similar triangular schematic of a pinhole camera according to one embodiment of the invention;
FIG. 4 discloses a system flow diagram of a Mask RCNN in accordance with an embodiment of the invention.
Detailed Description
The present invention will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present invention more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
A semantic map is a map containing rich semantic information; it is an abstraction of the spatial geometric relationships in the environment together with the categories and positions of the objects present. Because the semantic map contains both the spatial information and the semantic information of the environment, a mobile robot can, like a person, know not only that an object exists in the environment but also what that object is.
Aiming at the problems and shortcomings of the prior art, the invention provides a semantic mapping system based on visual SLAM and two-dimensional semantic segmentation, which performs semantic segmentation with ORB (Oriented FAST and Rotated BRIEF) feature points, combines a sparse SLAM algorithm model, and completes semantic mapping while realizing localization.
Fig. 1 discloses a flowchart of a semantic mapping method based on visual SLAM and two-dimensional semantic segmentation according to an embodiment of the present invention, and in the embodiment shown in fig. 1, the semantic mapping method based on visual SLAM and two-dimensional semantic segmentation provided by the present invention specifically includes the following steps:
s1, calibrating camera parameters and correcting camera distortion;
s2, acquiring an image frame sequence, wherein the image frame sequence comprises an RGB image and a depth image;
s3, preprocessing an image, and obtaining coordinates of a real three-dimensional space point corresponding to each pixel point in the RGB image by adopting a pinhole camera model;
s4, judging whether the current image frame is a key frame, if so, turning to a step S6, and if not, turning to a step S5;
s5, performing dynamic blur compensation, and calculating the image block centroids of the current image frame as semantic feature points to supplement the ORB feature points;
s6, performing semantic segmentation on the image frame: extracting ORB feature points, performing semantic segmentation with the Mask R-CNN algorithm model, and obtaining the semantic information of each pixel of the frame image;
s7, pose calculation, namely calculating the pose of the camera by using a sparse SLAM algorithm model;
s8, inputting semantic information into a sparse SLAM algorithm model, assisting in dense semantic map construction, completing traversing of key frames, and realizing three-dimensional semantic map construction of a global point cloud map.
Each step is described in detail below.
Step S1: calibrating camera parameters and correcting camera distortion.
In image measurement and machine vision applications, in order to determine the relationship between the three-dimensional geometric position of a point on the surface of a spatial object and its corresponding point in the image, a geometric model of camera imaging must be established; the parameters of this geometric model are the camera parameters.
The distortion coefficients belong to the camera parameters and describe the camera distortion phenomenon. In most cases these parameters must be obtained through experiment and calculation, and this process of solving for the parameters is called camera calibration.
Camera distortion includes radial distortion and tangential distortion.
The radial distortion is caused by the lens shape.
More specifically, in the pinhole model a straight line projects onto the pixel plane as a straight line.
In an actual photograph, however, the camera lens often causes a straight line in the real environment to become a curve in the image; this kind of distortion is called radial distortion.
Tangential distortion arises during camera assembly when the lens cannot be made exactly parallel to the imaging plane.
Since the light projection causes an inconsistency between the actual object and its image projected onto the 2D plane, and this inconsistency is stable, distortion correction of subsequent images can be achieved by calibrating the camera and computing the distortion parameters.
For radial distortion, correction is performed with a polynomial of second and higher order in the distance from the center:
x_corrected = x(1 + k_1 r^2 + k_2 r^4 + k_3 r^6)
y_corrected = y(1 + k_1 r^2 + k_2 r^4 + k_3 r^6)
where [x, y]^T are the coordinates of the uncorrected point, [x_corrected, y_corrected]^T are the corrected point coordinates, k_1, k_2, k_3 are the radial distortion coefficients of the camera, and r is the distance of the point P from the origin of the coordinate system.
Tangential distortion can be corrected with two further parameters p_1, p_2:
x_corrected = x + 2 p_1 x y + p_2 (r^2 + 2 x^2)
y_corrected = y + p_1 (r^2 + 2 y^2) + 2 p_2 x y
where [x, y]^T are the coordinates of the uncorrected point, [x_corrected, y_corrected]^T are the corrected point coordinates, p_1, p_2 are the tangential distortion coefficients of the camera, and r is the distance of the point P from the origin of the coordinate system.
Before the camera is used, its radial and tangential distortion coefficients are calibrated so that three-dimensional information can be obtained from two-dimensional images, enabling distortion correction of images, object measurement, three-dimensional reconstruction and so on.
Fig. 2 shows a calibration board used for camera calibration according to an embodiment of the present invention. The calibration board shown in Fig. 2 is placed within the visible range of the camera; a picture is taken each time the position and orientation of the board are changed, the feature points in the images are detected, and the intrinsic and extrinsic parameters of the camera as well as the distortion coefficients are obtained.
Preferably, the camera parameters are solved using the Camera Calibrator toolbox in MATLAB.
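For readers who prefer an open-source route, the same calibration step can be sketched with OpenCV instead of the MATLAB toolbox. This is only an illustrative sketch, not the patented implementation; the chessboard pattern size, square size and image folder are assumptions.

```python
import glob
import cv2
import numpy as np

# Assumed 9x6 inner-corner chessboard with 25 mm squares; adjust to the actual board.
PATTERN = (9, 6)
SQUARE_MM = 25.0

# 3D coordinates of the board corners in the board's own plane (Z = 0).
obj_template = np.zeros((PATTERN[0] * PATTERN[1], 3), np.float32)
obj_template[:, :2] = np.mgrid[0:PATTERN[0], 0:PATTERN[1]].T.reshape(-1, 2) * SQUARE_MM

obj_points, img_points = [], []
for path in glob.glob("calib/*.png"):          # hypothetical folder of calibration images
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    found, corners = cv2.findChessboardCorners(gray, PATTERN)
    if found:
        corners = cv2.cornerSubPix(
            gray, corners, (11, 11), (-1, -1),
            (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 30, 1e-3))
        obj_points.append(obj_template)
        img_points.append(corners)

# K is the 3x3 intrinsic matrix (f_x, f_y, c_x, c_y); dist holds k1, k2, p1, p2, k3.
rms, K, dist, rvecs, tvecs = cv2.calibrateCamera(
    obj_points, img_points, gray.shape[::-1], None, None)
print("reprojection RMS:", rms, "\nK =", K, "\ndist =", dist.ravel())
```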
For a point P(X, Y, Z) in the camera coordinate system, step S1 of the present invention performs camera distortion correction with the 5 distortion coefficients and finds the correct position of this point on the pixel plane.
The camera distortion is corrected in the following steps:
s11, projecting the three-dimensional space point onto the normalized image plane; let its normalized coordinates be [x, y]^T.
S12, correcting the radial and tangential distortion of the point on the normalized plane:
x_corrected = x(1 + k_1 r^2 + k_2 r^4 + k_3 r^6) + 2 p_1 x y + p_2 (r^2 + 2 x^2)
y_corrected = y(1 + k_1 r^2 + k_2 r^4 + k_3 r^6) + p_1 (r^2 + 2 y^2) + 2 p_2 x y
where [x_corrected, y_corrected]^T are the corrected point coordinates, p_1, p_2 are the tangential distortion coefficients of the camera, k_1, k_2, k_3 are the radial distortion coefficients of the camera, and r is the distance of the point P from the origin of the coordinate system.
S13, projecting the corrected point [x_corrected, y_corrected]^T onto the pixel plane through the intrinsic parameter matrix to obtain the correct position coordinates [u, v]^T of the point on the image:
u = f_x · x_corrected + c_x
v = f_y · y_corrected + c_y
where f_x, f_y, c_x, c_y are the intrinsic parameters of the camera.
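A minimal Python sketch of steps S11-S13, assuming the intrinsic parameters f_x, f_y, c_x, c_y and the five distortion coefficients are already known from step S1; the numeric values below are placeholders, not calibration results from the patent.

```python
import numpy as np

# Placeholder calibration results; replace with the values obtained in step S1.
fx, fy, cx, cy = 525.0, 525.0, 319.5, 239.5
k1, k2, k3 = 0.04, -0.10, 0.0
p1, p2 = 0.001, -0.0005

def project_with_distortion(P):
    """Project a 3D point P = (X, Y, Z) in the camera frame to pixel coordinates (u, v)."""
    X, Y, Z = P
    x, y = X / Z, Y / Z                      # S11: normalized image plane
    r2 = x * x + y * y                       # r^2, with r the distance to the origin
    radial = 1.0 + k1 * r2 + k2 * r2**2 + k3 * r2**3
    x_c = x * radial + 2 * p1 * x * y + p2 * (r2 + 2 * x * x)   # S12: radial + tangential
    y_c = y * radial + p1 * (r2 + 2 * y * y) + 2 * p2 * x * y
    u = fx * x_c + cx                        # S13: intrinsic matrix -> pixel plane
    v = fy * y_c + cy
    return u, v

print(project_with_distortion(np.array([0.3, -0.2, 2.0])))
```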
Step S2, acquiring an image frame sequence.
A Kinect camera is used to acquire an RGB-D image frame sequence, which includes RGB images and depth images.
Step S3, image preprocessing
In one embodiment, an RGB-D camera is used as the primary sensor, an RGB image and a depth image are obtained simultaneously, and a pinhole camera model is used to map pixels of the RGB image into a real three-dimensional space.
Fig. 3a discloses a pinhole imaging model diagram of a pinhole camera according to an embodiment of the present invention, and Fig. 3b discloses a similar-triangles schematic diagram of the pinhole camera. As shown in Figs. 3a and 3b, a camera coordinate system O-x-y-z is established with the optical center of the camera as the origin O and the direction of the arrow as the forward direction.
Using the similar triangles shown in Fig. 3b, a coordinate system O'-x'-y'-z' is established on the imaging plane of the camera, again taking the direction of the arrow as forward.
Suppose the coordinates of point P are [X, Y, Z]^T, and the focal length of the camera lens is f, which is the distance from the camera optical center to the physical imaging plane.
Point P is projected through the optical center to the point P' on the imaging plane, with pixel coordinates [u, v]^T.
From this correspondence, the mapping amounts to a scaling plus a shift of the origin, and the following can be derived:
Z [u, v, 1]^T = K P, with K = [[f_x, 0, c_x], [0, f_y, c_y], [0, 0, 1]]
where K is called the camera intrinsic parameter matrix, an intrinsic property already calibrated in step S1; f_x, f_y, c_x, c_y are the intrinsic parameters of the camera, P is the real three-dimensional space point coordinate, and [u, v]^T is the pixel coordinate.
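The pinhole mapping of step S3 can be sketched as the inverse operation: every pixel [u, v]^T with a valid depth value is back-projected into the camera frame. The depth scale of 1000 (millimetres to metres) is an assumption consistent with typical Kinect-style RGB-D output, not a value stated in the patent.

```python
import numpy as np

def backproject_depth(depth, fx, fy, cx, cy, depth_scale=1000.0):
    """Map every pixel [u, v]^T with valid depth to a 3D point P = (X, Y, Z) in the camera frame."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    Z = depth.astype(np.float32) / depth_scale
    valid = Z > 0
    X = (u - cx) * Z / fx          # inverse of u = f_x * X / Z + c_x
    Y = (v - cy) * Z / fy          # inverse of v = f_y * Y / Z + c_y
    return np.stack([X[valid], Y[valid], Z[valid]], axis=1)   # N x 3 point cloud

# Example with a synthetic 480x640 depth image at a constant 1.5 m.
points = backproject_depth(np.full((480, 640), 1500, np.uint16), 525.0, 525.0, 319.5, 239.5)
print(points.shape)
```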
Step S4, judging whether the frame is a key frame; if so, going to step S6, and if not, going to step S5.
If visual SLAM and semantic segmentation were computed on every frame, the computational load would be too large; therefore, high-quality images are selected as key frames.
In the invention, a sparse SLAM algorithm model based on ORB (Oriented FAST and Rotated BRIEF) feature points is used to screen the key frames.
Each key frame contains an RGB image and a depth image.
Step S5, dynamic blur compensation
Because dynamic objects may exist in each frame, a certain set of target categories is designated as dynamic each time a semantic mapping task is executed. In the image sequence, if such a dynamic target is recognized in a frame, the method removes the corresponding point cloud when converting the two-dimensional pixels into three-dimensional space coordinates, preventing the dynamic object from leaving ghost trails in the map and degrading the mapping quality.
In step S5 of the present invention, if the frame is not a key frame and, because of motion blur, too few ORB feature points are extracted, the following operations are performed as a supplement before the image semantic segmentation of step S6:
Each object in the frame image is labeled as a specific class; each segmented object has a corresponding labeled region, and the segmented image is called an image block. The (p+q)-order moment m_pq of an image block B is computed as:
m_pq = Σ_{(x, y) ∈ B} x^p y^q I(x, y)
and the centroid position C is:
C = (m_10 / m_00, m_01 / m_00)
The centroid serves as a semantic feature point to make up for the shortage of ORB feature points.
Semantic feature points are supplemented for blurred images with severe ORB feature point loss, the matching of the tracking algorithm to dynamic objects is suppressed, the key frames are screened comprehensively, the camera pose is estimated, and the mapping algorithm is prevented from including moving objects as part of the 3D map.
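A rough sketch of the centroid computation using OpenCV image moments, under the assumption that the segmentation mask of a single object (one image block) is available as a binary image.

```python
import cv2
import numpy as np

def block_centroid(mask):
    """Centroid C = (m10 / m00, m01 / m00) of a binary image block, used as a semantic feature point."""
    m = cv2.moments(mask, binaryImage=True)   # m['m00'], m['m10'], m['m01'], ...
    if m["m00"] == 0:
        return None                            # empty block, no centroid
    return (m["m10"] / m["m00"], m["m01"] / m["m00"])

# Toy example: a filled rectangle whose centroid lies at its centre.
mask = np.zeros((120, 160), np.uint8)
cv2.rectangle(mask, (40, 30), (99, 89), 255, thickness=-1)
print(block_centroid(mask))   # approximately (69.5, 59.5)
```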
Step S6, semantic segmentation
ORB feature points are extracted for each image frame, and semantic segmentation is performed with the Mask R-CNN (Mask Region-CNN) algorithm model to acquire the semantic information of each pixel of the frame image.
Based on the semantic segmentation result, if a dynamic object is recognized, the ORB feature points extracted from any region whose class is designated as a dynamic object are removed.
This prevents the visual SLAM algorithm from including moving objects as part of the 3D map during mapping.
In step S6 of the invention, the Mask R-CNN algorithm model is trained on the COCO data set.
COCO (Common Objects in COntext) is a data set provided by the Microsoft team that can be used for image recognition and provides classification information for 80 categories.
FIG. 4 is a system flow chart of Mask R-CNN according to an embodiment of the invention. As shown in FIG. 4, RGB image semantic segmentation of the image frame is realized with the Mask R-CNN algorithm model; the steps of semantic segmentation with the convolutional neural network framework of Mask R-CNN are as follows:
extracting features at different levels of the input image through an FPN (Feature Pyramid Network);
proposing regions of interest through an RPN (Region Proposal Network);
performing proposal region alignment using RoI Align (Region of Interest Align);
performing mask segmentation using an FCN (Fully Convolutional Network);
and determining region coordinates and performing category classification using FC (fully connected) layers.
The frame image is processed by the Mask R-CNN algorithm model to generate a pixel-level semantic classification result, i.e. a semantic classification label for each pixel, and at the same time the bounding box coordinates and the confidence score of the classification are output.
The invention uses ORB feature points for the tracking, mapping and place recognition tasks. ORB feature points are rotation and scale invariant, can be extracted and matched quickly, meet the requirement of real-time operation, and show good precision in bag-of-words-based place recognition.
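The combination of ORB extraction and semantic filtering described above can be sketched as follows. The Mask R-CNN inference is assumed to be available elsewhere and to return per-object class labels and binary masks; the set of dynamic categories and the `extract_static_orb` helper are illustrative, not part of the patent.

```python
import cv2
import numpy as np

# Assumed set of COCO categories treated as dynamic objects (not specified in the patent).
DYNAMIC_CLASSES = {"person", "car", "bicycle"}

def extract_static_orb(rgb, labels, masks):
    """Extract ORB features and discard keypoints falling inside dynamic-object masks.

    `labels` and `masks` are assumed to come from a COCO-trained Mask R-CNN:
    one class name and one image-sized binary mask per detected object.
    """
    orb = cv2.ORB_create(nfeatures=1000)
    gray = cv2.cvtColor(rgb, cv2.COLOR_BGR2GRAY)
    keypoints = orb.detect(gray, None)

    # Union of all masks whose class is declared dynamic.
    dynamic = np.zeros(gray.shape, dtype=bool)
    for label, mask in zip(labels, masks):
        if label in DYNAMIC_CLASSES:
            dynamic |= mask.astype(bool)

    h, w = gray.shape
    kept = []
    for kp in keypoints:
        u = min(int(round(kp.pt[0])), w - 1)
        v = min(int(round(kp.pt[1])), h - 1)
        if not dynamic[v, u]:            # keep only keypoints on static regions
            kept.append(kp)
    return orb.compute(gray, kept)       # (keypoints, descriptors) on static regions only
```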
Step S7, pose calculation
The visual odometry pose estimation is performed between two adjacent frames; the accumulation of many such inter-frame pose estimates is the motion trajectory of the camera.
The camera pose is calculated with a sparse SLAM algorithm model based on ORB (Oriented FAST and Rotated BRIEF) feature points.
After the feature points of the image frames are extracted, the camera pose is estimated with PnP based on the key frames.
PnP is the abbreviation of Perspective-n-Point, a method for solving 3D-to-2D point-pair motion: that is, how to solve the camera pose when n 3D space points and their projected positions are given.
Assume that at time k the camera position is x_k, the camera input data is u_k, and w_k is noise; the motion equation is constructed as:
x_k = f(x_{k-1}, u_k, w_k).
At x_k the landmark point y_j is observed locally, producing a series of observations z_{k,j}; with v_{k,j} as the observation noise, the observation equation is constructed as:
z_{k,j} = h(y_j, x_k, v_{k,j}).
In step S7 of the invention, the camera pose is first computed by solving the PnP problem, and a more accurate camera pose is then obtained by back-end pose-graph optimization.
In step S7 of the invention, the PnP problem of camera pose estimation is constructed as a nonlinear least-squares problem defined on the Lie algebra.
Furthermore, the camera pose estimation in step S7 of the present invention is formulated as a BA (Bundle Adjustment) problem, and the minimized reprojection error of the camera pose estimation is constructed:
ξ* = arg min_ξ (1/2) Σ_{i=1}^{n} || u_i − (1/s_i) K exp(ξ^) P_i ||_2^2
where u_i are the pixel coordinates, P_i are the camera coordinates of the feature points, ξ is the Lie algebra element corresponding to the camera pose, s_i is the depth of the feature point, K is the camera intrinsic parameter matrix, and n is the number of points.
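Assuming known 3D-2D correspondences between map points and the current frame, the initial pose of step S7 can be sketched with OpenCV's PnP solver; the subsequent back-end refinement (done with g2o in the experiments) is omitted here, and the helper names are illustrative.

```python
import cv2
import numpy as np

def initial_pose_pnp(points_3d, points_2d, K):
    """Solve the PnP problem: estimate R, t of the camera from n 3D points and their 2D projections."""
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        points_3d.astype(np.float64), points_2d.astype(np.float64),
        K, distCoeffs=None, flags=cv2.SOLVEPNP_EPNP)
    if not ok:
        raise RuntimeError("PnP failed")
    R, _ = cv2.Rodrigues(rvec)                       # rotation vector -> 3x3 matrix
    return R, tvec, inliers                          # tvec has shape (3, 1)

def reprojection_error(points_3d, points_2d, R, t, K):
    """Mean of || u_i - (1/s_i) K (R P_i + t) ||, the quantity minimized by bundle adjustment."""
    P_cam = (R @ points_3d.T + t).T                  # transform into the camera frame
    proj = (K @ P_cam.T).T
    proj = proj[:, :2] / proj[:, 2:3]                # divide by the depth s_i
    return float(np.mean(np.linalg.norm(proj - points_2d, axis=1)))
```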
Further, step S7 of the present invention also includes relocalization with an embedded place recognition model based on DBoW (bag-of-words model) to recover from tracking failure, as well as re-initialization in known map scenes, loop detection and so on.
In the invention, the sparse SLAM algorithm model used to screen key frames and calculate the camera pose is obtained by improving on ORB-SLAM2 (the second-generation ORB-feature-based simultaneous localization and mapping system).
The SLAM algorithm model consists of 4 threads: a tracking thread, a local mapping thread, a loop detection thread and a global BA optimization thread.
Further, the global BA optimization thread is executed only after the loop detection thread has confirmed a loop.
The first three threads run in parallel and are defined as follows:
1) Tracking thread.
The camera is localized for every frame by finding feature matches with the local map and minimizing the reprojection error with motion-only BA.
Preferably, the matching is performed with a constant-velocity model.
2) Local mapping thread.
The local map is managed and optimized by performing local BA; the co-visibility relationship between key frames is maintained through MapPoints, and the poses of the co-visible key frames and the MapPoints are optimized by local BA.
3) Loop detection thread.
Large loops are detected and drift errors are corrected by performing pose-graph optimization; the screening of loop-closure candidate frames is accelerated with the bag-of-words model, the scale is optimized by Sim3 (the Sim3 transform is a similarity transform), and the Essential Graph and MapPoints are optimized by global BA.
The loop detection thread triggers the global BA optimization thread.
4) Global BA thread: after the pose graph has been optimized, the optimal structure and motion result of the whole system is computed.
Compared with dense SLAM algorithm models in the prior art, the sparse SLAM algorithm model of the invention adds rich semantic segmentation information of the images to the final mapping process through semantic information fusion.
Step S8, three-dimensional semantic mapping
Using the semantic segmentation result of step S6, combined with the frame pose information acquired in step S7 and the real three-dimensional coordinates of the image frame pixels, the semantic information is fed into the sparse SLAM algorithm model; objects with the same semantics in the frame images are projected into the three-dimensional point cloud map with the same label color, assisting the dense semantic map construction. When the traversal of the key frames is completed, the three-dimensional semantic mapping of the global point cloud map is realized.
Step S8 of the present invention further includes (see the sketch after this list):
s81, projecting the three-dimensional space pixels generated from the first key frame into an initial point cloud;
s82, generating a point cloud map from the three-dimensional space coordinates corresponding to each pixel of the current key frame, computed with the pinhole model;
s83, calculating the pose change between the current key frame and the previous key frame;
s84, superimposing and fusing the three-dimensional coordinate points of the two point cloud maps through the pose transformation matrix to generate a point cloud map containing more information;
s85, iterating the above steps continuously; when all the key frames have been traversed, the construction of the global point cloud map is completed.
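Steps S81-S85 can be sketched as the loop below, assuming each key frame provides its back-projected points (for example from a back-projection step like the one sketched earlier), per-point semantic label colors, and a 4x4 camera-to-world pose from the sparse SLAM model; the voxel size used for fusion is an assumption.

```python
import numpy as np

def build_global_map(keyframes):
    """Fuse per-keyframe point clouds into one global semantic point cloud.

    Each keyframe is assumed to be a dict with:
      'points' - N x 3 array in the camera frame,
      'colors' - N x 3 array of semantic label colors,
      'T_wc'   - 4 x 4 camera-to-world pose matrix estimated by the sparse SLAM model.
    """
    all_points, all_colors = [], []
    for kf in keyframes:
        pts_h = np.hstack([kf["points"], np.ones((len(kf["points"]), 1))])  # homogeneous coords
        pts_w = (kf["T_wc"] @ pts_h.T).T[:, :3]        # transform into the world frame
        all_points.append(pts_w)
        all_colors.append(kf["colors"])
    points = np.vstack(all_points)
    colors = np.vstack(all_colors)

    # Simple voxel-grid fusion: keep one point per 2 cm voxel to bound map size.
    voxel = 0.02
    _, idx = np.unique(np.floor(points / voxel).astype(np.int64), axis=0, return_index=True)
    return points[idx], colors[idx]
```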
The test results of the unmanned aerial vehicle semantic mapping method based on visual SLAM and two-dimensional semantic segmentation are described in further detail below with reference to a specific test.
The test is based on the Ubuntu 16.04 operating system and an Nvidia GeForce GTX 1050 graphics card, and uses software tools such as TensorFlow, OpenCV, g2o and the Point Cloud Library; a real scene serves as the experimental condition, and the data are captured with a Kinect V1 camera.
For the evaluation of the three-dimensional semantic map, Q_1 denotes the number of correctly detected objects, Q_2 denotes the number of objects that were detected but misclassified plus the objects that actually exist but were not detected, and Q_3 denotes the number of detections that do not correspond to real objects. P denotes the correct detection rate of three-dimensional objects and is computed as follows:
P = Q_1 / (Q_1 + Q_2 + Q_3)
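For reference, the metric can be computed directly as below; the counts are placeholders, not the values from Table 1.

```python
def detection_rate(q1, q2, q3):
    """P = Q1 / (Q1 + Q2 + Q3): correct detections over all detection outcomes."""
    return q1 / (q1 + q2 + q3)

print(f"{detection_rate(13, 8, 6):.4%}")   # placeholder counts, prints 48.1481%
```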
The experiment of constructing a dense semantic map was recorded 9 times; the calculated average correct detection rate of three-dimensional objects in the map is 48.1086%. The specific experimental results are shown in the following table:
TABLE 1
According to the semantic mapping method based on visual SLAM and two-dimensional semantic segmentation of the invention, a dense semantic map with dynamic objects removed is built from a Mask R-CNN algorithm model and a sparse SLAM algorithm model based on ORB feature points; inter-frame information and the semantic information of the image frames are used to improve the performance of the unmanned aerial vehicle semantic mapping system and to improve the robustness of feature point extraction and matching in dynamic scenes.
While, for purposes of simplicity of explanation, the methodologies are shown and described as a series of acts, it is to be understood and appreciated that the methodologies are not limited by the order of acts, as some acts may, in accordance with one or more embodiments, occur in different orders and/or concurrently with other acts from that shown and described herein or not shown and described herein, as would be understood and appreciated by those skilled in the art.
As used in this application and in the claims, the terms "a," "an," "the," and/or "the" are not specific to the singular, but may include the plural, unless the context clearly dictates otherwise. In general, the terms "comprises" and "comprising" merely indicate that the steps and elements are explicitly identified, and do not constitute an exclusive list, as other steps or elements may be included in a method or apparatus.
The embodiments described above are intended to enable those skilled in the art to make various modifications and variations without departing from the inventive concept; therefore, the scope of protection of the invention is not limited to the embodiments described above, but is to be accorded the broadest scope consistent with the innovative features recited in the claims.
Claims (10)
1. A semantic mapping method based on visual SLAM and two-dimensional semantic segmentation is characterized by comprising the following steps:
s1, calibrating camera parameters and correcting camera distortion;
s2, acquiring an image frame sequence, wherein the image frame sequence comprises an RGB image and a depth image;
s3, preprocessing an image, and obtaining coordinates of a real three-dimensional space point corresponding to each pixel point in the RGB image by adopting a pinhole camera model;
s4, judging whether the current image frame is a key frame, if so, turning to a step S6, and if not, turning to a step S5;
s5, performing dynamic blur compensation, and calculating the image block centroids of the current image frame as semantic feature points to supplement the ORB feature points;
s6, performing semantic segmentation on the image frame: extracting ORB feature points, performing semantic segmentation with a mask region convolutional neural network algorithm model, and obtaining the semantic information of each pixel of the frame image;
s7, pose calculation, namely calculating the pose of the camera by using a sparse SLAM algorithm model;
s8, inputting semantic information into a sparse SLAM algorithm model, assisting in dense semantic map construction, completing traversing of key frames, and realizing three-dimensional semantic map construction of a global point cloud map.
2. The semantic mapping method based on visual SLAM and two-dimensional semantic segmentation according to claim 1, wherein the correcting camera distortion in step S1 further comprises the steps of:
s11, projecting a three-dimensional space point P(X, Y, Z) of the camera coordinate system onto the normalized image plane, giving the normalized coordinates [x, y]^T of the point;
S12, performing radial and tangential distortion correction on the point [x, y]^T on the normalized plane by the following formulas:
x_corrected = x(1 + k_1 r^2 + k_2 r^4 + k_3 r^6) + 2 p_1 x y + p_2 (r^2 + 2 x^2)
y_corrected = y(1 + k_1 r^2 + k_2 r^4 + k_3 r^6) + p_1 (r^2 + 2 y^2) + 2 p_2 x y
where [x_corrected, y_corrected]^T are the corrected point coordinates, p_1, p_2 are the tangential distortion coefficients of the camera, k_1, k_2, k_3 are the radial distortion coefficients of the camera, and r is the distance of the point P from the origin of the coordinate system;
S13, projecting the corrected point [x_corrected, y_corrected]^T onto the pixel plane through the intrinsic parameter matrix to obtain its correct position [u, v]^T on the image, realized by the following formulas:
u = f_x · x_corrected + c_x
v = f_y · y_corrected + c_y
where f_x, f_y, c_x, c_y are the intrinsic parameters of the camera.
3. The semantic mapping method based on visual SLAM and two-dimensional semantic segmentation according to claim 2, wherein the image preprocessing of step S3 further includes: the mapping relation from a pixel point [u, v]^T to the real three-dimensional space point P(X, Y, Z) satisfies the following formula:
Z [u, v, 1]^T = K P, with K = [[f_x, 0, c_x], [0, f_y, c_y], [0, 0, 1]]
where K is called the camera intrinsic parameter matrix, f_x, f_y, c_x, c_y are the intrinsic parameters of the camera, P is the real three-dimensional space point coordinate, and [u, v]^T is the pixel coordinate.
4. The semantic mapping method based on visual SLAM and two-dimensional semantic segmentation according to claim 1, wherein the key frames of step S4 are screened using a sparse SLAM algorithm model.
5. The semantic mapping method based on visual SLAM and two-dimensional semantic segmentation according to claim 1, wherein the centroid of the image block in step S5 is obtained by:
labeling each object in the frame image as a specific class;
for each segmented object there is a corresponding labeled region, and the segmented image is called an image block;
calculating the corresponding centroid C as a semantic feature point to supplement the ORB feature points, where C = (m_10/m_00, m_01/m_00) and m_pq = Σ_{(x, y) ∈ B} x^p y^q I(x, y) is the (p+q)-order moment of the image block B, with I(x, y) the pixel intensity.
6. The semantic mapping method based on visual SLAM and two-dimensional semantic segmentation according to claim 1, wherein the step S6 further comprises:
the semantic information of each pixel point comprises a semantic classification label, bounding box coordinates and a confidence score of the classification;
and based on the semantic segmentation result, removing the ORB feature points extracted from any region whose designated category is a dynamic object.
7. The semantic mapping method based on visual SLAM and two-dimensional semantic segmentation according to claim 1, wherein performing semantic segmentation with the Mask R-CNN algorithm model in step S6 further comprises:
extracting features at different levels of the input image through a feature pyramid network;
proposing regions of interest through a region proposal network;
performing proposal region alignment using RoI Align (region-of-interest alignment);
performing mask segmentation using a fully convolutional network;
and determining region coordinates and performing category classification using fully connected layers.
8. The semantic mapping method based on visual SLAM and two-dimensional semantic segmentation according to claim 1 or claim 4, wherein the sparse SLAM algorithm model further comprises a tracking thread, a local mapping thread and a loop detection thread:
the tracking thread localizes the camera for every frame by finding feature matches with the local map and minimizing the reprojection error using motion-only bundle adjustment;
the local mapping thread manages and optimizes the local map by performing local bundle adjustment; the co-visibility relationship between key frames is maintained through map points, and the poses of co-visible key frames and the map points are optimized by local bundle adjustment;
the loop detection thread detects large loops and corrects accumulated drift by performing pose-graph optimization; it accelerates the screening of loop-closure candidate frames, optimizes the scale, and optimizes the essential graph and map points by global bundle adjustment.
9. The semantic mapping method based on visual SLAM and two-dimensional semantic segmentation according to claim 1, wherein the sparse SLAM algorithm model further comprises a global bundle adjustment optimization thread, which is triggered after confirmation by the loop detection thread and, after the pose-graph optimization, computes the optimal structure and motion result of the whole system.
10. The semantic mapping method based on visual SLAM and two-dimensional semantic segmentation according to claim 1, wherein the pose calculation of step S7 further comprises: initially solving the camera pose through PnP, optimizing the camera pose with back-end pose-graph optimization, and constructing the minimized reprojection error of the camera pose estimation:
ξ* = arg min_ξ (1/2) Σ_{i=1}^{n} || u_i − (1/s_i) K exp(ξ^) P_i ||_2^2
where u_i are the pixel coordinates, P_i are the camera coordinates of the feature points, ξ is the Lie algebra element corresponding to the camera pose, s_i is the depth of the feature point, and K is the camera intrinsic parameter matrix.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010246158.2A CN111462135B (en) | 2020-03-31 | 2020-03-31 | Semantic mapping method based on visual SLAM and two-dimensional semantic segmentation |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010246158.2A CN111462135B (en) | 2020-03-31 | 2020-03-31 | Semantic mapping method based on visual SLAM and two-dimensional semantic segmentation |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111462135A CN111462135A (en) | 2020-07-28 |
CN111462135B true CN111462135B (en) | 2023-04-21 |
Family
ID=71680957
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010246158.2A Active CN111462135B (en) | 2020-03-31 | 2020-03-31 | Semantic mapping method based on visual SLAM and two-dimensional semantic segmentation |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111462135B (en) |
Families Citing this family (49)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109559320B (en) * | 2018-09-18 | 2022-11-18 | 华东理工大学 | Method and system for realizing visual SLAM semantic mapping function based on hole convolution deep neural network |
CN111897332B (en) * | 2020-07-30 | 2022-10-11 | 国网智能科技股份有限公司 | Semantic intelligent substation robot humanoid inspection operation method and system |
CN111950561A (en) * | 2020-08-25 | 2020-11-17 | 桂林电子科技大学 | Semantic SLAM dynamic point removing method based on semantic segmentation |
CN112132893B (en) * | 2020-08-31 | 2024-01-09 | 同济人工智能研究院(苏州)有限公司 | Visual SLAM method suitable for indoor dynamic environment |
CN112017188B (en) * | 2020-09-09 | 2024-04-09 | 上海航天控制技术研究所 | Space non-cooperative target semantic recognition and reconstruction method |
CN112258575B (en) * | 2020-10-13 | 2022-12-02 | 浙江大学 | Method for quickly identifying object in synchronous positioning and map construction |
CN112344922B (en) * | 2020-10-26 | 2022-10-21 | 中国科学院自动化研究所 | Monocular vision odometer positioning method and system |
CN112446882A (en) * | 2020-10-28 | 2021-03-05 | 北京工业大学 | Robust visual SLAM method based on deep learning in dynamic scene |
CN112183476B (en) * | 2020-10-28 | 2022-12-23 | 深圳市商汤科技有限公司 | Obstacle detection method and device, electronic equipment and storage medium |
CN112348868A (en) * | 2020-11-06 | 2021-02-09 | 养哇(南京)科技有限公司 | Method and system for recovering monocular SLAM scale through detection and calibration |
CN112308921B (en) * | 2020-11-09 | 2024-01-12 | 重庆大学 | Combined optimization dynamic SLAM method based on semantics and geometry |
CN112465021B (en) * | 2020-11-27 | 2022-08-05 | 南京邮电大学 | Pose track estimation method based on image frame interpolation method |
CN112396595B (en) * | 2020-11-27 | 2023-01-24 | 广东电网有限责任公司肇庆供电局 | Semantic SLAM method based on point-line characteristics in dynamic environment |
CN112381841A (en) * | 2020-11-27 | 2021-02-19 | 广东电网有限责任公司肇庆供电局 | Semantic SLAM method based on GMS feature matching in dynamic scene |
CN112571415B (en) * | 2020-12-03 | 2022-03-01 | 哈尔滨工业大学(深圳) | Robot autonomous door opening method and system based on visual guidance |
CN112465858A (en) * | 2020-12-10 | 2021-03-09 | 武汉工程大学 | Semantic vision SLAM method based on probability grid filtering |
CN112509051A (en) * | 2020-12-21 | 2021-03-16 | 华南理工大学 | Bionic-based autonomous mobile platform environment sensing and mapping method |
CN112507056B (en) * | 2020-12-21 | 2023-03-21 | 华南理工大学 | Map construction method based on visual semantic information |
CN112734845B (en) * | 2021-01-08 | 2022-07-08 | 浙江大学 | Outdoor monocular synchronous mapping and positioning method fusing scene semantics |
CN112990195A (en) * | 2021-03-04 | 2021-06-18 | 佛山科学技术学院 | SLAM loop detection method for integrating semantic information in complex environment |
CN112991436B (en) * | 2021-03-25 | 2022-09-06 | 中国科学技术大学 | Monocular vision SLAM method based on object size prior information |
CN113034584B (en) * | 2021-04-16 | 2022-08-30 | 广东工业大学 | Mobile robot visual positioning method based on object semantic road sign |
CN113192200B (en) * | 2021-04-26 | 2022-04-01 | 泰瑞数创科技(北京)有限公司 | Method for constructing urban real scene three-dimensional model based on space-three parallel computing algorithm |
CN113516692B (en) * | 2021-05-18 | 2024-07-19 | 上海汽车集团股份有限公司 | SLAM method and device for multi-sensor fusion |
CN113537208B (en) * | 2021-05-18 | 2024-06-11 | 杭州电子科技大学 | Visual positioning method and system based on semantic ORB-SLAM technology |
CN113269831B (en) * | 2021-05-19 | 2021-11-16 | 北京能创科技有限公司 | Visual repositioning method, system and device based on scene coordinate regression network |
CN113674340A (en) * | 2021-07-05 | 2021-11-19 | 北京物资学院 | Binocular vision navigation method and device based on landmark points |
CN113610763B (en) * | 2021-07-09 | 2024-07-16 | 北京航天计量测试技术研究所 | Rocket engine structural member pose motion compensation method in vibration environment |
CN113808251B (en) * | 2021-08-09 | 2024-04-12 | 杭州易现先进科技有限公司 | Dense reconstruction method, system, device and medium based on semantic segmentation |
CN114730471A (en) * | 2021-08-13 | 2022-07-08 | 深圳市大疆创新科技有限公司 | Control method, control device, movable platform and storage medium |
CN113658257B (en) * | 2021-08-17 | 2022-05-27 | 广州文远知行科技有限公司 | Unmanned equipment positioning method, device, equipment and storage medium |
CN113674416B (en) * | 2021-08-26 | 2024-04-26 | 中国电子科技集团公司信息科学研究院 | Three-dimensional map construction method and device, electronic equipment and storage medium |
CN113903011B (en) * | 2021-10-26 | 2024-06-11 | 江苏大学 | Semantic map construction and positioning method suitable for indoor parking lot |
CN114202579B (en) * | 2021-11-01 | 2024-07-16 | 东北大学 | Dynamic scene-oriented real-time multi-body SLAM system |
CN114132360B (en) * | 2021-11-08 | 2023-09-08 | 卡斯柯信号有限公司 | Method, equipment and storage medium for preventing turnout from being squeezed based on image discrimination of turnout state |
CN114359493B (en) * | 2021-12-20 | 2023-01-03 | 中国船舶重工集团公司第七0九研究所 | Method and system for generating three-dimensional semantic map for unmanned ship |
CN114529800A (en) * | 2022-01-12 | 2022-05-24 | 华南理工大学 | Obstacle avoidance method, system, device and medium for rotor unmanned aerial vehicle |
CN114708321B (en) * | 2022-01-12 | 2024-06-21 | 北京航空航天大学 | Semantic-based camera pose estimation method and system |
CN114550186A (en) * | 2022-04-21 | 2022-05-27 | 北京世纪好未来教育科技有限公司 | Method and device for correcting document image, electronic equipment and storage medium |
CN114972470B (en) * | 2022-07-22 | 2022-11-18 | 北京中科慧眼科技有限公司 | Road surface environment obtaining method and system based on binocular vision |
CN115451939B (en) * | 2022-08-19 | 2024-05-07 | 中国人民解放军国防科技大学 | Parallel SLAM method under dynamic scene based on detection segmentation |
CN115164918B (en) * | 2022-09-06 | 2023-02-03 | 联友智连科技有限公司 | Semantic point cloud map construction method and device and electronic equipment |
CN116681755B (en) * | 2022-12-29 | 2024-02-09 | 广东美的白色家电技术创新中心有限公司 | Pose prediction method and device |
CN116342800B (en) * | 2023-02-21 | 2023-10-24 | 中国航天员科研训练中心 | Semantic three-dimensional reconstruction method and system for multi-mode pose optimization |
CN116339336A (en) * | 2023-03-29 | 2023-06-27 | 北京信息科技大学 | Electric agricultural machinery cluster collaborative operation method, device and system |
CN116817887B (en) * | 2023-06-28 | 2024-03-08 | 哈尔滨师范大学 | Semantic visual SLAM map construction method, electronic equipment and storage medium |
CN117392347B (en) * | 2023-10-13 | 2024-04-30 | 苏州煋海图科技有限公司 | Map construction method, device, computer equipment and readable storage medium |
CN117611762B (en) * | 2024-01-23 | 2024-04-30 | 常熟理工学院 | Multi-level map construction method, system and electronic equipment |
CN118447320B (en) * | 2024-05-13 | 2024-09-27 | 华智清创(苏州)农业科技有限公司 | Visual multitasking mounted agricultural inspection method and device based on deep learning |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110097553A (en) * | 2019-04-10 | 2019-08-06 | 东南大学 | The semanteme for building figure and three-dimensional semantic segmentation based on instant positioning builds drawing system |
WO2019169540A1 (en) * | 2018-03-06 | 2019-09-12 | 斯坦德机器人(深圳)有限公司 | Method for tightly-coupling visual slam, terminal and computer readable storage medium |
-
2020
- 2020-03-31 CN CN202010246158.2A patent/CN111462135B/en active Active
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2019169540A1 (en) * | 2018-03-06 | 2019-09-12 | 斯坦德机器人(深圳)有限公司 | Method for tightly-coupling visual slam, terminal and computer readable storage medium |
CN110097553A (en) * | 2019-04-10 | 2019-08-06 | 东南大学 | The semanteme for building figure and three-dimensional semantic segmentation based on instant positioning builds drawing system |
Non-Patent Citations (2)
Title |
---|
Emergency communication method for nuclear radiation monitoring based on BeiDou RDSS; Wang Tingyin et al.; Computer Systems & Applications; 2019-12-15 (No. 12); full text *
Augmented reality image registration technology based on semantic segmentation; Bian Xianzhang et al.; Electronic Technology & Software Engineering; 2018-12-13 (No. 23); full text *
Also Published As
Publication number | Publication date |
---|---|
CN111462135A (en) | 2020-07-28 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111462135B (en) | Semantic mapping method based on visual SLAM and two-dimensional semantic segmentation | |
CN110097553B (en) | Semantic mapping system based on instant positioning mapping and three-dimensional semantic segmentation | |
CN110070615B (en) | Multi-camera cooperation-based panoramic vision SLAM method | |
CN111563415B (en) | Binocular vision-based three-dimensional target detection system and method | |
CN109166149B (en) | Positioning and three-dimensional line frame structure reconstruction method and system integrating binocular camera and IMU | |
CN112734852B (en) | Robot mapping method and device and computing equipment | |
Zhu et al. | Online camera-lidar calibration with sensor semantic information | |
CN103886107B (en) | Robot localization and map structuring system based on ceiling image information | |
CN112396656A (en) | Outdoor mobile robot pose estimation method based on fusion of vision and laser radar | |
CN114140527B (en) | Dynamic environment binocular vision SLAM method based on semantic segmentation | |
CN111998862B (en) | BNN-based dense binocular SLAM method | |
CN113327296B (en) | Laser radar and camera online combined calibration method based on depth weighting | |
CN112101160B (en) | Binocular semantic SLAM method for automatic driving scene | |
US11703596B2 (en) | Method and system for automatically processing point cloud based on reinforcement learning | |
CN112801074A (en) | Depth map estimation method based on traffic camera | |
CN113744315B (en) | Semi-direct vision odometer based on binocular vision | |
CN117593650B (en) | Moving point filtering vision SLAM method based on 4D millimeter wave radar and SAM image segmentation | |
CN112767546B (en) | Binocular image-based visual map generation method for mobile robot | |
CN111899345B (en) | Three-dimensional reconstruction method based on 2D visual image | |
CN104166995B (en) | Harris-SIFT binocular vision positioning method based on horse pace measurement | |
CN114677531A (en) | Water surface unmanned ship target detection and positioning method fusing multi-mode information | |
CN115147344A (en) | Three-dimensional detection and tracking method for parts in augmented reality assisted automobile maintenance | |
CN116804553A (en) | Odometer system and method based on event camera/IMU/natural road sign | |
CN114648639B (en) | Target vehicle detection method, system and device | |
CN117409386A (en) | Garbage positioning method based on laser vision fusion |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |