CN115035260B - Three-dimensional semantic map construction method for indoor mobile robot - Google Patents
Three-dimensional semantic map construction method for indoor mobile robot
- Publication number
- CN115035260B CN202210594240.3A
- Authority
- CN
- China
- Prior art keywords
- point cloud
- key frame
- map
- semantic
- rgb
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T17/00—Three dimensional [3D] modelling, e.g. data description of 3D objects
- G06T17/05—Geographic models
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/10—Segmentation; Edge detection
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/70—Determining position or orientation of objects or cameras
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/762—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using clustering, e.g. of similar faces in social networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/10—Image acquisition modality
- G06T2207/10028—Range image; Depth image; 3D point clouds
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20081—Training; Learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20084—Artificial neural networks [ANN]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/30—Subject of image; Context of image processing
- G06T2207/30244—Camera pose
Abstract
The invention discloses a three-dimensional semantic map construction method for an indoor mobile robot. The ORB-SLAM2 algorithm is run with the RGB-D depth camera carried by the robot to obtain key frames and the corresponding robot poses, and the RGB image and depth image of each key frame are back-projected to form a spatial point cloud; the key-frame point clouds are filtered and stitched according to the poses to form a three-dimensional dense point cloud map; a YOLO v5 network performs target recognition on the key-frame RGB images to obtain 2D semantic information, and 3D semantic labels are obtained by back projection; the dense point cloud map is segmented with a point cloud segmentation algorithm; and the obtained semantic information is fused with the segmentation result of the dense point cloud map to construct the 3D semantic map. The invention processes the RGB-D information acquired by the RGB-D depth camera and fuses the semantic information with the SLAM result to form a three-dimensional map with richer information.
Description
Technical Field
The invention belongs to the technical field of robot semantic map construction, and relates to a robot map construction method, in particular to a construction method of a three-dimensional semantic map of an indoor mobile robot.
Background
With the rapid development of intelligent technology, the mobile robot industry has also grown quickly. Advanced mobile robots carry a number of sensors with specific functions, so that designated tasks in various scenes can be completed without direct human operation. For an intelligent robot whose application scene is indoors, autonomous exploration by the robot is generally adopted to complete the map construction task.
Typical semantic map construction methods for mobile robots currently fall into two main categories: methods based on visual images and methods based on point cloud processing. For example, Zhang Cheng et al., in "A semantic SLAM algorithm in dynamic environments", use a monocular image sequence to carry out semantic segmentation, visual odometry and related work, and finally construct a global semantic map; Zhang Zheng et al., in "A robust semantic SLAM algorithm with variable structure", study the robustness of semantic SLAM using point cloud images.
The prior art has the following problems in constructing a semantic map for a mobile robot: first, a visual image only provides RGB color information, from which three-dimensional spatial information such as a point cloud is difficult to extract; second, semantic information about the represented objects is not easily obtained from point cloud data alone, and the semantics are difficult to fuse with the SLAM mapping and localization results.
Disclosure of Invention
In view of the prior art, the technical problem to be solved by the invention is to provide a three-dimensional semantic map construction method for an indoor mobile robot that processes the RGB-D information acquired by an RGB-D depth camera and fuses the semantic information with the SLAM result to form a three-dimensional map with richer information.
In order to solve the technical problems, the three-dimensional semantic map construction method of the indoor mobile robot comprises the following steps:
step 1: running the ORB-SLAM2 algorithm with an RGB-D depth camera carried by the robot to obtain key frames and the corresponding robot poses, and forming a single-frame spatial point cloud by back-projecting the RGB image and depth image corresponding to each key frame;
step 2: filtering the spatial point cloud corresponding to each key frame, and stitching the point clouds according to the key-frame poses to form a three-dimensional dense point cloud map;
step 3: performing target recognition on the RGB image corresponding to each key frame with a YOLO v5 network to obtain the 2D semantic information in the key frame, and obtaining 3D semantic labels by back projection;
step 4: segmenting the dense point cloud map with a point cloud segmentation algorithm;
step 5: fusing the obtained semantic information with the segmentation result of the dense point cloud map to construct the 3D semantic map.
Further, step 1 includes:
Key frame processing is added to the Tracking thread of the ORB-SLAM2 system to obtain the spatial point cloud data corresponding to the key frame. Specifically, the following operation is applied to every pixel in the key frame to obtain the key frame's point cloud data in the camera coordinate system: let the coordinates of a pixel P' in the image be [u, v]^T, let the camera intrinsic matrix be K, and let the depth corresponding to the pixel be d; the pixel is converted into a three-dimensional point P_c(X', Y', Z') in the camera coordinate system as follows:
P_c = [X', Y', Z']^T = d · K^{-1} · [u, v, 1]^T;
after the spatial coordinates of the three-dimensional point P_c are obtained, the RGB information of the pixel P' is taken as the RGB information of P_c; after all pixels in the key frame have been processed in this way, the point cloud data of the key frame in the camera coordinate system is obtained through the point cloud construction functions of the PCL library;
The point cloud data of the key frame in the camera coordinate system is then converted into point cloud data in the world coordinate system:
Let P_w(X, Y, Z) be the point obtained by converting P_c into the world coordinate system; in homogeneous coordinates P_w satisfies:
P_w = T_cw^{-1} · P_c,
wherein T_cw is the camera pose.
Further, the filtering processing of the spatial point cloud corresponding to the key frame in the step 2 includes:
Outliers in the point cloud data are removed with the statistical outlier removal filter of the PCL point cloud library: the average distance from each point in the spatial point cloud to its neighbouring points is calculated, and points whose average distance exceeds the outlier distance threshold d_pt are identified as outliers and removed from the data. The point cloud data is then downsampled by voxel grid filtering.
Further, the outlier distance threshold d_pt is specifically:
Assuming that the average distances of all points in the spatial point cloud obey the normal distribution (μ, σ²), and that the threshold parameter for judging outliers is H_out, the outlier distance threshold is:
d_pt = μ + H_out·σ.
Further, in the step 2, the step of stitching according to the pose corresponding to the key frame to form a three-dimensional dense point cloud map includes:
The LoopClosing thread of the ORB-SLAM2 system optimizes the poses of the key frames in a detected closed loop. The key frames whose camera pose has changed after closed-loop optimization are extracted, the point cloud data corresponding to these key frames in the world coordinate system is deleted, and the corresponding point cloud data is regenerated according to the optimized camera poses and projected into the world coordinate system to construct the three-dimensional dense point cloud map.
Further, step 3 includes:
step 3.1: establishing data set based on YOLO V5 network model
Pictures of target objects are taken to obtain a data set; the acquired data set is annotated with the labelling tool LabelImg, labels are assigned, and a training set and a test set are selected from the captured pictures; pictures containing the target objects are additionally selected from the COCO data set as further training and test data; the data configuration and the model training parameters of the YOLO v5 network are set, where the data configuration includes the number of object classes and the object class names; the model parameters of the YOLO v5 network are trained with the training set, and the network model is verified with the test set;
step 3.2: extracting 2D target semantics and constructing 3D semantic tags
The RGB image information collected by the RGB-D depth camera is fed to the YOLO v5 network, and the 2D recognition boxes in the RGB image are obtained in real time as the 2D target semantic information;
An interface is added to the Tracking thread of ORB-SLAM2 to read the key frames generated in real time. Target detection is performed on each newly generated key frame with the trained YOLO v5 network model to obtain the pixel coordinates of the centre point of each target object and the length and width of its bounding box. When the spatial point cloud data is generated from the RGB image and depth image corresponding to the key frame, the RGB information of all pixels inside the 2D bounding box of a target object is changed to a preset colour, so that the single-frame point cloud data generated from the key frame carries a 3D semantic label. The key-frame point cloud data is then back-projected into the world coordinate system to complete the point cloud stitching, yielding a global dense point cloud map with 3D semantic labels, i.e. colour information is added to the point clouds of the target objects in the point cloud map.
Further, in step 4, segmenting the dense point cloud map by the point cloud segmentation algorithm includes:
Step 4.1: dividing a dense point cloud map by adopting an ultra-voxel clustering dividing method, converting the point cloud into a surface structure, and representing a dividing result by adopting an adjacent graph G= { v, epsilon } wherein v i epsilon v is a curved surface block obtained by dividing, epsilon represents a curved surface block (v i,vj) connected with adjacent curved surface blocks, and each curved surface block comprises a centroid c i and a normal vector n i;
Step 4.2: after obtaining the curved surface block, using RANSAC to process the curved surface block to generate a candidate plane pc= { PC 1,pc2,…,pcm }, then calculating the distance d (c i,pcm) from the centroid c i of the curved surface block to the plane PC m, setting a threshold delta, obtaining all the curved surface blocks with the distance delta from the plane PC m, named pi= { v i∈V|d(ci,pcm) < delta }, and defining:
Wherein η represents a constraint condition for distinguishing a foreground object from a background, D (pc m) represents a weight of a point cloud plane, and the minimum energy P * of the graph segmentation problem is:
Wherein E (P) is the energy of the fitting, and the following conditions are satisfied:
wherein V and E are the sets of vertices and edges in the graph, respectively, and the adjacent graph g= { V, epsilon } of the corresponding cluster segmentation.
Further, step 5 includes:
A bounding box of the target object point cloud is obtained from the 2D recognition box, where the width b of the bounding box is the maximum depth difference over all pixels in the 2D recognition box; assuming the set of all pixels in the 2D recognition box is P = {P_1, P_2, P_3, …, P_n}, then
b = max(Depth(P_i) − Depth(P_j)),
where P_i, P_j ∈ P and Depth(P_i) is the depth value corresponding to pixel P_i; the 2D pixels are back-projected into the world coordinate system, and the length and width of the 2D recognition box are mapped into the world coordinate system to obtain the length and height of the bounding box;
The result of the point cloud segmentation is then examined, and only when all points of a cluster lie inside the bounding box is the random colour attribute label of that cluster replaced by the 3D semantic label obtained in step 3, yielding the semantic map.
Further, the RGB-D depth camera uses a Kinect V2 camera.
Further, the YOLO v5 network employs the YOLO v5s network.
The invention has the following beneficial effects. The robot is equipped with an RGB-D depth camera (for example, a Kinect camera) to obtain RGB-D information of the indoor environment; based on the key frames generated by the ORB-SLAM2 system, target recognition is completed with a trained YOLO v5s network model and the results are converted into 3D semantic labels. A three-dimensional dense point cloud map is generated by combining the RGB and depth images of the key frames with the key-frame pose information; on this basis, the point cloud map is segmented and the semantic labels are fused with the dense point cloud map to construct a three-dimensional semantic map with rich information.
Drawings
FIG. 1 is a flow chart of the present invention;
FIG. 2 is a dense point cloud construction flow diagram;
FIG. 3 is a dense point cloud map;
FIG. 4 is a partial data set sample acquired;
FIG. 5 is a calibration of a data sample;
FIG. 6 shows the real-time detection results of YOLO v5s;
FIG. 7 is a flow chart for generating a point cloud tag;
FIG. 8 is a point cloud label diagram in a real scene;
FIG. 9 is a super voxel cluster segmentation result;
FIG. 10 is a graph of point cloud segmentation results;
Fig. 11 is a fused semantic map.
Detailed Description
The invention is further described below with reference to the drawings and examples.
The flow of the invention is shown in FIG. 1. A Kinect V2 camera is selected as the vision sensor, and key frames along the camera trajectory are obtained based on the ORB-SLAM2 algorithm. After the camera pose is obtained from the key frames through feature point matching, back-end optimization and related processing, the RGB image and depth image corresponding to each key frame are back-projected to form a spatial point cloud; after filtering, the point clouds are stitched with the camera pose information to form a three-dimensional dense point cloud map. Meanwhile, target recognition is performed on the RGB image corresponding to each key frame with a YOLO v5 network model to obtain the semantic information of the targets, which is projected into the spatial point cloud to construct 3D semantic labels. The dense point cloud map is then segmented with a point cloud segmentation algorithm, and finally the obtained semantic information is fused with the segmentation result of the dense point cloud map to construct the 3D semantic map.
1. Improved ORB-SLAM2 construction of three-dimensional dense point clouds
The method obtains a dense point cloud by modifying some of the threads in ORB-SLAM2. The specific flow is shown in FIG. 2, and the main steps are as follows:
Step 1: two-dimensional pixel point to three-dimensional point cloud data back projection
The Tracking thread of the ORB-SLAM2 system is modified to add key frame processing and obtain the 3D point cloud data corresponding to each key frame. The principle is to back-project the two-dimensional pixel coordinates and depth data of the environment into the world coordinate system through the intrinsic and extrinsic parameters of the camera. Let the coordinates of a pixel P' in the image be [u, v]^T, let the camera intrinsic matrix be K, and let the depth corresponding to the pixel be d; the pixel is converted into a three-dimensional point P_c(X', Y', Z') in the camera coordinate system as follows:
P_c = [X', Y', Z']^T = d · K^{-1} · [u, v, 1]^T (1)
After the spatial coordinates of the three-dimensional point P_c are obtained, the RGB information of pixel P' is taken as the RGB information of P_c. After all pixels in the key frame have been processed in this way, the point cloud data of the key frame in the camera coordinate system is obtained through the point cloud construction functions of the PCL library. The point cloud data corresponding to the key frame is then converted into a point cloud in the world coordinate system. Let P_w(X, Y, Z) be the point obtained by converting P_c into the world coordinate system; with T_cw denoting the camera pose (the world-to-camera transform estimated by ORB-SLAM2), the process can be expressed in homogeneous coordinates as:
P_w = T_cw^{-1} · P_c (2)
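As an illustrative sketch only (not part of the original disclosure), the back projection of equations (1) and (2) can be written with OpenCV, Eigen and PCL roughly as follows; the intrinsics fx, fy, cx, cy, the depth scale and the 16-bit depth format are assumptions made for the example:

```cpp
#include <pcl/point_types.h>
#include <pcl/point_cloud.h>
#include <pcl/common/transforms.h>
#include <opencv2/core.hpp>
#include <Eigen/Geometry>

// Convert one key frame (RGB + depth) into a colored point cloud in the world
// frame. fx, fy, cx, cy come from the intrinsic matrix K; depthScale converts
// the raw depth units to meters (e.g. 1000 for millimeter depth) - assumed here.
pcl::PointCloud<pcl::PointXYZRGB>::Ptr KeyFrameToWorldCloud(
    const cv::Mat& rgb, const cv::Mat& depth,
    float fx, float fy, float cx, float cy,
    float depthScale, const Eigen::Isometry3d& Tcw)
{
    pcl::PointCloud<pcl::PointXYZRGB>::Ptr camCloud(new pcl::PointCloud<pcl::PointXYZRGB>);
    for (int v = 0; v < depth.rows; ++v) {
        for (int u = 0; u < depth.cols; ++u) {
            float d = depth.at<unsigned short>(v, u) / depthScale;  // depth of pixel (u, v)
            if (d <= 0.0f) continue;                                // skip invalid depth
            pcl::PointXYZRGB p;
            p.z = d;                          // Z' = d
            p.x = (u - cx) * d / fx;          // X' = (u - cx) * d / fx
            p.y = (v - cy) * d / fy;          // Y' = (v - cy) * d / fy
            cv::Vec3b c = rgb.at<cv::Vec3b>(v, u);                  // RGB of pixel P'
            p.b = c[0]; p.g = c[1]; p.r = c[2];
            camCloud->push_back(p);
        }
    }
    // Pw = Tcw^{-1} * Pc: project the camera-frame cloud into the world frame.
    pcl::PointCloud<pcl::PointXYZRGB>::Ptr worldCloud(new pcl::PointCloud<pcl::PointXYZRGB>);
    pcl::transformPointCloud(*camCloud, *worldCloud, Tcw.inverse().matrix());
    return worldCloud;
}
```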
step2: dense point cloud data filtering
According to the method, the outlier removing filter in the PCL point cloud library is used for removing abnormal noise points in the point cloud data, the point cloud obtained after outlier removing filtering is smooth, and the method is suitable for subsequent processing of point cloud clustering. And then downsampling the point cloud data by a voxel grid filtering method. When the invention filters the outlier, the average distance from each point to all adjacent points is calculated, the obtained result is subjected to Gaussian distribution, the shape of the result is determined by Gaussian distribution parameters (mu, sigma 2), and the point with the average distance outside the standard range is considered as the outlier and is removed from the data.
Assuming that the average distances of all points in the point cloud obey the normal distribution (μ, σ²), and that the threshold parameter for judging outliers is H_out, the outlier distance threshold d_pt can be expressed as:
d_pt = μ + H_out·σ (3)
That is, points whose average distance lies outside the interval (0, d_pt) are treated as outliers and filtered out.
Experiments carried out on the hardware platform used in this work (Intel Core i5-10400F CPU, NVIDIA GTX 1650) show that the best running performance is obtained when the resolution parameter of the voxel grid filter is set to 0.01, i.e. a voxel size of 1 cm × 1 cm × 1 cm.
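A minimal sketch of this two-stage filtering with PCL, where the standard-deviation multiplier of StatisticalOutlierRemoval plays the role of H_out in equation (3) and the voxel leaf size is the 1 cm value above; the neighbourhood size and multiplier are illustrative values, not taken from the patent:

```cpp
#include <pcl/point_types.h>
#include <pcl/filters/statistical_outlier_removal.h>
#include <pcl/filters/voxel_grid.h>

pcl::PointCloud<pcl::PointXYZRGB>::Ptr FilterKeyFrameCloud(
    const pcl::PointCloud<pcl::PointXYZRGB>::Ptr& cloud)
{
    // Statistical outlier removal: points whose mean distance to their K
    // neighbours exceeds mu + Hout * sigma are discarded (K = 50, Hout = 1.0
    // are example values).
    pcl::PointCloud<pcl::PointXYZRGB>::Ptr denoised(new pcl::PointCloud<pcl::PointXYZRGB>);
    pcl::StatisticalOutlierRemoval<pcl::PointXYZRGB> sor;
    sor.setInputCloud(cloud);
    sor.setMeanK(50);
    sor.setStddevMulThresh(1.0);
    sor.filter(*denoised);

    // Voxel grid downsampling with the 1 cm leaf size reported to work well.
    pcl::PointCloud<pcl::PointXYZRGB>::Ptr downsampled(new pcl::PointCloud<pcl::PointXYZRGB>);
    pcl::VoxelGrid<pcl::PointXYZRGB> voxel;
    voxel.setInputCloud(denoised);
    voxel.setLeafSize(0.01f, 0.01f, 0.01f);
    voxel.filter(*downsampled);
    return downsampled;
}
```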
Step 3: Point cloud stitching
In the invention, the global dense point cloud data obtained directly from Step 1 depends on the camera pose T_cw of each key frame. However, the LoopClosing thread of the ORB-SLAM2 system optimizes the poses of all key frames in a detected closed loop, so the pose estimates of some key frames change.
To avoid ghosting in the constructed dense point cloud map, the key frames whose camera pose has changed after closed-loop optimization are extracted, their point cloud data in the world coordinate system is deleted, and the corresponding point cloud data is regenerated with the optimized poses and projected back into the world coordinate system, completing the point cloud stitching after loop detection. The final 3D dense point cloud map constructed on the basis of the ORB-SLAM2 system is shown in FIG. 3.
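The stitching after loop closure can be sketched as follows; here each key frame is assumed, for the purposes of the example, to cache its camera-frame cloud, so that re-projection with the optimized pose only needs a rigid transform:

```cpp
#include <pcl/point_types.h>
#include <pcl/point_cloud.h>
#include <pcl/common/transforms.h>
#include <Eigen/Geometry>
#include <vector>

struct KeyFrameCloud {
    pcl::PointCloud<pcl::PointXYZRGB>::Ptr cameraCloud;  // cloud in camera coordinates
    Eigen::Isometry3d Tcw;                               // current (optimized) camera pose
};

// Re-assemble the global dense map from all key frames using their latest
// poses; called again after LoopClosing changes some key-frame poses.
pcl::PointCloud<pcl::PointXYZRGB>::Ptr AssembleGlobalMap(
    const std::vector<KeyFrameCloud>& keyframes)
{
    pcl::PointCloud<pcl::PointXYZRGB>::Ptr globalMap(new pcl::PointCloud<pcl::PointXYZRGB>);
    for (const auto& kf : keyframes) {
        pcl::PointCloud<pcl::PointXYZRGB> world;
        // Pw = Tcw^{-1} * Pc for every point of the key frame.
        pcl::transformPointCloud(*kf.cameraCloud, world, kf.Tcw.inverse().matrix());
        *globalMap += world;   // splice into the global dense point cloud map
    }
    return globalMap;
}
```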
2. 3D semantic label construction based on YOLO v5s network model
In the method, target recognition is carried out on the key frames generated by the Tracking thread of the ORB-SLAM2 system, and the obtained 2D semantic information is then mapped into the spatial point cloud data to complete the construction of the 3D semantic labels. The specific steps are as follows:
Step1: establishing a data set based on a YOLO v5s network model
The patent uses the camera to shoot the common object in the laboratory, mainly contains common articles such as computer equipment, books. Part of the acquired image is labeled using a labeling program Labelimg, as shown in fig. 4, and labeled, for example, with a "tv" label, as shown in fig. 5, for example, in the process of labeling a "display". 500 pictures are selected from the photographed pictures to serve as training sets, and 200 pictures are selected from the photographed pictures to serve as test sets. In addition, 1500 pictures containing objects in the indoor scene are selected from the COCO as a training set, 300 pictures are taken as a test set, and for use with the COCO data set, the annotation file is stored in json file format. In the present invention, the number of objects commonly used in indoor environments is 35.
Before training, the YOLO v5s data and the configuration file of the model need to be modified, the object class number nc of the two files is changed to 35, and the object class names are modified in the class list names. The training set comprises 2000 pictures, the test set comprises 500 pictures, the data set is prepared, the parameters of the YOLO v5s network model are trained, and the super-parameter configuration of the training model is shown in table 1.
TABLE 1 parameter settings for the YOLO v5s network model
Step2: extracting 2D target semantics and constructing 3D semantic tags
After the YOLO V5s network training is finished, the Kinect V2 camera is used for collecting corresponding RGB image information in real time, a real-time detection function can be achieved, the result is shown in fig. 6, 2D target semantic information in the RGB image can be obtained, and the requirement of constructing the 3D semantic tag in real time can be met.
The process of mapping 2D semantic information to 3D semantic labels is shown in FIG. 7. An interface is added to the Tracking thread of ORB-SLAM2, and target detection is performed on each newly generated key frame with the trained YOLO v5s network to obtain the pixel coordinates of the centre point of each target object and the length and width of its bounding box. When the spatial point cloud data is generated from the RGB image and depth image corresponding to the key frame, the RGB information of all pixels inside the 2D bounding box of a target object is changed to a preset colour, so that the single-frame point cloud data generated from the key frame carries semantic labels. Combined with the pose estimation result, the key-frame point cloud data is then back-projected into the world coordinate system to complete the point cloud stitching, yielding a global dense point cloud map with 3D semantic labels, i.e. colour information is added to the point clouds of the target objects in the point cloud map.
The key frames generated by ORB-SLAM2 are used as input to obtain key frame images with 2D bounding boxes and confidence values, and the corresponding 3D semantic labels in the key frame are then obtained by back projection. The result is shown in FIG. 8, where the point cloud of the monitor is set to red, that of the keyboard to green, and that of the mouse to blue.
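The colouring rule used to turn 2D detections into 3D semantic labels can be sketched as follows; Box2D, Rgb and the fixed colour palette are illustrative assumptions rather than the actual data structures of the implementation:

```cpp
#include <cstdint>
#include <vector>

struct Box2D {              // one YOLO v5 detection in pixel coordinates
    int classId;
    float cx, cy, w, h;     // centre point and box width/height
};

struct Rgb { std::uint8_t r, g, b; };

// Fixed label colour per class, e.g. monitor = red, keyboard = green, mouse = blue.
Rgb LabelColour(int classId)
{
    static const Rgb palette[] = { {255, 0, 0}, {0, 255, 0}, {0, 0, 255} };
    return palette[classId % 3];
}

// Colour written for pixel (u, v) when the key-frame cloud is generated:
// the label colour if the pixel lies in any detection box, otherwise its RGB value.
Rgb PixelColour(int u, int v, const Rgb& original, const std::vector<Box2D>& boxes)
{
    for (const auto& box : boxes) {
        if (u >= box.cx - box.w / 2 && u <= box.cx + box.w / 2 &&
            v >= box.cy - box.h / 2 && v <= box.cy + box.h / 2) {
            return LabelColour(box.classId);   // point will carry the 3D semantic label
        }
    }
    return original;
}
```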
3. Dense point cloud segmentation
In the method, point cloud segmentation is carried out on the constructed dense point cloud map to obtain a scene that the robot can understand; at the same time, accurate target point clouds are obtained from the segmentation result of the dense point cloud map, which makes the subsequent combination with the 3D semantic labels convenient. First, the original dense point cloud map is segmented with a super-voxel-clustering-based method, which converts the irregular point cloud into a surface structure according to the similarity of points; the surface patches formed after super-voxel segmentation each have a centroid and a normal vector, so the point cloud segmentation can be formulated as a graph segmentation problem, which is then solved with a graph-cut-based method for further segmentation. The detailed steps are as follows:
Step1: super voxel cluster segmentation
The super voxel segmentation is to use a spatial octree structure of point cloud, cluster the points through classes Supervoxel Clustering in a PCL library, add labels to voxel centroids (voxel_centroid_closed) of the octree structure according to segmentation results, and assign random colors to store the voxel centroids in colored _voxel_closed parameters.
The segmentation principle uses k-means-style region growing to segment the point cloud; the super-voxel clustering process is similar to crystallization, in which many crystal nuclei form once a solution becomes supersaturated. The super-voxel clustering segmentation method distributes growth nuclei uniformly throughout the space according to a certain rule, then sets the nucleus spacing and the minimum grain size, and smaller grains are absorbed by larger ones. During super-voxel segmentation, the expansion of a seed point is determined by a feature distance computed in a feature space combining spatial position, colour and normal vector; the distance formula used in super-voxel cluster segmentation is:
D = sqrt(w_c·D_c² + w_s·D_s² + w_n·D_n²) (4)
In the above formula: d c represents the degree of difference in color, D n represents the degree of difference in normal direction, and D s represents the degree of difference in distance of the point cloud. W c、ws、wn in the formula represents the weights of the respective variables. The voxel with the smallest D value is found to be the most similar through searching the surrounding of the crystal nucleus, so that the crystal nucleus for the next growth is represented.
In the invention, a Kinect V2 camera is selected as the sensor and appropriate parameters are set: voxel_resolution = 0.008 (voxel size, the resolution of the spatial octree) and seed_resolution = 0.5 (seed resolution). The weights of colour, spatial distance and normal vector are tuned for the laboratory environment, and the resulting super-voxel cluster segmentation is shown in FIG. 9.
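A minimal sketch of this step with pcl::SupervoxelClustering, using the resolutions quoted above (voxel_resolution = 0.008, seed_resolution = 0.5); the colour, spatial and normal importance values are illustrative and would be tuned for the scene:

```cpp
#include <pcl/point_types.h>
#include <pcl/segmentation/supervoxel_clustering.h>
#include <cstdint>
#include <map>

std::map<std::uint32_t, pcl::Supervoxel<pcl::PointXYZRGBA>::Ptr>
SegmentSupervoxels(const pcl::PointCloud<pcl::PointXYZRGBA>::Ptr& cloud)
{
    pcl::SupervoxelClustering<pcl::PointXYZRGBA> super(0.008f, 0.5f);  // voxel / seed resolution
    super.setInputCloud(cloud);
    super.setColorImportance(0.2f);    // weight w_c of the colour difference D_c
    super.setSpatialImportance(0.4f);  // weight w_s of the spatial difference D_s
    super.setNormalImportance(1.0f);   // weight w_n of the normal difference D_n

    std::map<std::uint32_t, pcl::Supervoxel<pcl::PointXYZRGBA>::Ptr> clusters;
    super.extract(clusters);           // labelled surface patches (super-voxels)

    // Adjacency between neighbouring patches, i.e. the edges of the graph G = {V, ε}.
    std::multimap<std::uint32_t, std::uint32_t> adjacency;
    super.getSupervoxelAdjacency(adjacency);
    return clusters;
}
```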
Step2: point cloud segmentation based on geometric information
The super-voxel segmentation result obtained in the last step is represented by an adjacency graph G= { v, epsilon } wherein v i epsilon v is a segmented curved surface block, and epsilon represents that adjacent curved surface blocks (v i,vj) are connected. After the super-voxel segmentation process, each surface block has a centroid c i and a normal vector n i, and the scene segmentation can be regarded as a graph segmentation problem.
Suppose there are K actual planes {s_1, s_2, …, s_k} in the point cloud map and that these planes have been segmented into patches. A variable {b_i}_1^N with b_i ∈ [0, K] is defined, where b_i = k means that surface patch v_i belongs to plane s_k, and N indicates that the actual plane s_k has been divided into N surface patches; in FIG. 9, for example, the table top is divided into two clusters. If the planes of all objects in the point cloud map can be extracted and the surface patches assigned to these planes, a more accurate point cloud segmentation result is obtained.
In the specific operation, after the patches are obtained they are processed with RANSAC to generate candidate planes PC = {pc_1, pc_2, …, pc_m}, and d(c_i, pc_m), the distance from the patch centroid c_i to the plane pc_m, is calculated. By introducing a threshold δ, all surface patches whose distance to plane pc_m is below δ are obtained, denoted Π = {v_i ∈ V | d(c_i, pc_m) < δ}. The following is defined:
In the formula, η represents a constraint that distinguishes foreground objects from the background, and D(pc_m) represents the weight of the point cloud plane. In the experiments, η = 40 and δ = 4 cm were used. The minimized energy P* of the graph segmentation problem can be expressed as:
P* = argmin E(P) (6)
where E(P) is the fitting energy, given by:
where V and ε are the sets of vertices and edges, respectively, of the adjacency graph G = {V, ε} of the cluster segmentation.
Based on minimum-cut theory for point clouds, the edges that satisfy the energy minimization are cut, i.e. the segmentation result of the point cloud is obtained by merging adjacent surface patches according to the constraint; the purpose of this cut is to merge clusters that belong to the same object in the clustering result. The segmentation result is shown in FIG. 10: target objects such as the monitor, books, cups and bottles are merged well, and the planes of background objects such as the table, the storage cabinet and the floor are also merged well, meeting the requirements of the invention.
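The patch-to-plane assignment Π = {v_i ∈ V | d(c_i, pc_m) < δ} can be sketched as follows with δ = 4 cm; the Patch and Plane structures are illustrative assumptions, and the graph-cut energy minimization itself is omitted:

```cpp
#include <Eigen/Core>
#include <cmath>
#include <vector>

struct Patch { Eigen::Vector3f centroid; Eigen::Vector3f normal; int planeId = -1; };
struct Plane { Eigen::Vector4f coeffs; };  // a*x + b*y + c*z + d = 0, (a, b, c) unit normal

// Point-to-plane distance d(c_i, pc_m).
float DistanceToPlane(const Eigen::Vector3f& c, const Plane& pc)
{
    return std::abs(pc.coeffs.head<3>().dot(c) + pc.coeffs[3]);
}

// Assign every surface patch to the closest RANSAC candidate plane whose
// distance is below delta; unassigned patches keep planeId = -1 (foreground).
void AssignPatchesToPlanes(std::vector<Patch>& patches,
                           const std::vector<Plane>& candidates,
                           float delta = 0.04f)          // delta = 4 cm
{
    for (auto& patch : patches) {
        float best = delta;
        for (std::size_t m = 0; m < candidates.size(); ++m) {
            float d = DistanceToPlane(patch.centroid, candidates[m]);
            if (d < best) { best = d; patch.planeId = static_cast<int>(m); }
        }
    }
}
```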
4. Fusing semantic information and constructing the semantic map
After the dense point cloud segmentation is completed, the segmentation result is combined with the 3D semantic labels: the segmentation result is refined with the semantic information, so that targets that could not be extracted by point cloud segmentation alone can still be segmented; at the same time, semantic information is added to each segmented target, so a more accurate semantic map can be constructed. After segmentation each cluster carries an attribute label, and once the attribute labels of the clusters are replaced by the semantic labels of specific targets, those targets in the point cloud carry semantic information and the semantic map is constructed. The specific implementation steps are as follows:
Step1: storage of semantic tags
After the dense point cloud map is segmented, point clouds with different colors represent a point cloud set and a space sampling area in each super voxel, and the colors are attribute labels of the clusters. However, the attribute label at this time only distinguishes each clustering result, and is a random label without any meaning, and the label is also the attribute label of each cluster in the final cloud segmentation result.
As described in Section 2, the storage of object semantic labels is completed by performing target recognition on the key frames generated by ORB-SLAM2 with YOLO v5s and, according to the 3D semantic labels obtained from the recognition result, adding colour labels with semantic meaning to the point cloud segmentation result.
At this point, however, all pixels inside a 2D object recognition bounding box are mapped directly to 3D semantic labels using the depth data of the Kinect camera. In a cluttered laboratory environment the bounding box from target recognition is not accurate enough, so wrong semantic information is easily attached to the point cloud surrounding the target object; as shown in FIG. 8, semantic labels are erroneously added to the point cloud around the target object.
Step2: fusion of semantic tags and point cloud bounding boxes
In order to obtain a more accurate semantic map, the bounding box of the target object point cloud is obtained according to the 2D recognition frame, wherein the width value b of the bounding box is the maximum depth difference corresponding to all pixel points in the 2D recognition frame. Assuming that all pixel points in the 2D identification frame are set p= { P 1,P2,P3......,Pn }, then
b = max(Depth(P_i) − Depth(P_j)) (8)
where P_i, P_j ∈ P and Depth(P_i) is the depth value corresponding to pixel P_i. The length and height of the bounding box are obtained by mapping the length and width of the 2D recognition box; the principle is the back projection of 2D pixels into the world coordinate system and is not repeated here.
The results of the point cloud segmentation are then examined, and only when all points of a cluster lie inside the bounding box is specific semantic information added, i.e. a specific colour attribute is assigned to those points, yielding a more accurate semantic map. Comparing FIG. 11 with FIG. 10, fusing the bounding box derived from the semantic labels with the point cloud segmentation result allows the object "tv" to be segmented more completely, and the point cloud clusters at the edges of the actual object are also given the correct semantic information. Compared with adding semantic information to the point cloud directly from the 2D object recognition bounding box as in FIG. 8, this also effectively avoids attaching wrong semantic information to the point cloud around the object. At the same time, combining semantic information improves the point cloud segmentation itself: small objects on the desktop, such as the keyboard and mouse in FIG. 11, can be segmented by clustering after the added semantic information changes their colour attributes. Following the two steps above, the three-dimensional semantic map shown in FIG. 11 is obtained.
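A minimal sketch of the fusion rule described in this section; Point3D, Cluster, SemanticBox and the axis-aligned containment test are illustrative assumptions made for the example:

```cpp
#include <algorithm>
#include <vector>

struct Point3D { float x, y, z; };
struct Cluster { std::vector<Point3D> points; int label = -1; };   // -1: random colour label
struct SemanticBox { Point3D minCorner, maxCorner; int classId; }; // 3D bounding box of a target

// b = max(Depth(Pi) - Depth(Pj)) over all pixels of the 2D box,
// i.e. maximum depth minus minimum depth inside the recognition box.
float BoundingBoxWidth(const std::vector<float>& depthsInBox)
{
    if (depthsInBox.empty()) return 0.0f;
    auto mm = std::minmax_element(depthsInBox.begin(), depthsInBox.end());
    return *mm.second - *mm.first;
}

bool InsideBox(const Point3D& p, const SemanticBox& box)
{
    return p.x >= box.minCorner.x && p.x <= box.maxCorner.x &&
           p.y >= box.minCorner.y && p.y <= box.maxCorner.y &&
           p.z >= box.minCorner.z && p.z <= box.maxCorner.z;
}

// Replace a cluster's random colour label with the semantic label only when
// every point of the cluster lies inside the target's bounding box.
void FuseSemantics(std::vector<Cluster>& clusters, const std::vector<SemanticBox>& boxes)
{
    for (auto& cluster : clusters) {
        for (const auto& box : boxes) {
            bool allInside = std::all_of(cluster.points.begin(), cluster.points.end(),
                [&](const Point3D& p) { return InsideBox(p, box); });
            if (allInside) { cluster.label = box.classId; break; }
        }
    }
}
```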
Claims (8)
1. The three-dimensional semantic map construction method of the indoor mobile robot is characterized by comprising the following steps of:
step 1: operating an ORB-SLAM2 algorithm through an RGB-D depth camera carried by the robot to obtain key frame information and corresponding pose of the robot, and forming a single-frame space point cloud through back projection of an RGB image and a depth image corresponding to each key frame;
step 2: filtering the space point cloud corresponding to the key frame, and splicing according to the pose corresponding to the key frame to form a three-dimensional dense point cloud map;
Key frame processing is added to the Tracking thread of the ORB-SLAM2 system to obtain the spatial point cloud data corresponding to the key frame, specifically: the following operation is applied to every pixel in the key frame to obtain the key frame's point cloud data in the camera coordinate system: let the coordinates of a pixel P' in the image be [u, v]^T, let the camera intrinsic matrix be K, and let the depth corresponding to the pixel be d; the pixel is converted into a three-dimensional point P_c(X', Y', Z') in the camera coordinate system as follows:
P_c = [X', Y', Z']^T = d · K^{-1} · [u, v, 1]^T;
after the spatial coordinates of the three-dimensional point P_c are obtained, the RGB information of the pixel P' is taken as the RGB information of P_c; after all pixels in the key frame have been processed in this way, the point cloud data of the key frame in the camera coordinate system is obtained through the point cloud construction functions of the PCL library;
the point cloud data of the key frame in the camera coordinate system is then converted into point cloud data in the world coordinate system:
let P_w(X, Y, Z) be the point obtained by converting P_c into the world coordinate system; in homogeneous coordinates P_w satisfies:
P_w = T_cw^{-1} · P_c,
wherein T_cw is the camera pose;
step 3: performing target recognition on the RGB map corresponding to the key frame by utilizing a YOLO V5 network to obtain 2D semantic information in the key frame, and obtaining a 3D semantic tag according to back projection;
step 3.1: establishing data set based on YOLO V5 network model
pictures of target objects are taken to obtain a data set; the acquired data set is annotated with the labelling tool LabelImg, labels are assigned, and a training set and a test set are selected from the captured pictures; pictures containing the target objects are additionally selected from the COCO data set as further training and test data; the data configuration and the model training parameters of the YOLO v5 network are set, where the data configuration includes the number of object classes and the object class names; the model parameters of the YOLO v5 network are trained with the training set, and the network model is verified with the test set;
step 3.2: extracting 2D target semantics and constructing 3D semantic tags
The YOLO V5 network calls an RGB-D depth camera to collect RGB image information, and a 2D recognition frame in the RGB image is obtained in real time to serve as 2D target semantic information;
adding an interface to the Tracking thread of ORB-SLAM2 to read the key frames generated in real time; performing target detection on each newly generated key frame with the trained YOLO v5 network model to obtain the pixel coordinates of the centre point of each target object and the length and width of its bounding box; when generating the spatial point cloud data from the RGB image and depth image corresponding to the key frame, changing the RGB information of all pixels inside the 2D bounding box of a target object to a preset colour, so that the single-frame point cloud data generated from the key frame carries a 3D semantic label; and then, combined with the pose estimation result, back-projecting the obtained key-frame point cloud data into the world coordinate system to complete the point cloud stitching and obtain a global dense point cloud map with 3D semantic labels, i.e. colour information is added to the point clouds of the target objects in the point cloud map;
step 4: dividing the dense point cloud map by a point cloud dividing algorithm;
step 5: and fusing the obtained semantic information with the segmentation result of the dense point cloud map, and constructing to obtain the 3D semantic map.
2. The method for constructing the three-dimensional semantic map of the indoor mobile robot according to claim 1, wherein the method comprises the following steps of: the filtering processing of the spatial point cloud corresponding to the key frame in the step 2 includes:
removing outliers in the point cloud data with the statistical outlier removal filter of the PCL point cloud library: calculating the average distance from each point in the spatial point cloud to its neighbouring points, identifying points whose average distance exceeds the outlier distance threshold d_pt as outliers and removing them from the data, and then downsampling the point cloud data by voxel grid filtering.
3. The method for constructing the three-dimensional semantic map of the indoor mobile robot according to claim 2, wherein the outlier distance threshold d_pt is specifically:
assuming that the average distances of all points in the spatial point cloud obey the normal distribution (μ, σ²), and that the threshold parameter for judging outliers is H_out, the outlier distance threshold is:
d_pt = μ + H_out·σ.
4. The method for constructing the three-dimensional semantic map of the indoor mobile robot according to claim 1, wherein the method comprises the following steps of: step 2, the step of splicing according to the pose corresponding to the key frame to form a three-dimensional dense point cloud map comprises the following steps:
the LoopClosing thread of the ORB-SLAM2 system optimizes the poses of the key frames in a detected closed loop; the key frames whose camera pose has changed after closed-loop optimization are extracted, the point cloud data corresponding to these key frames in the world coordinate system is deleted, and the corresponding point cloud data is regenerated according to the optimized camera poses and projected into the world coordinate system to construct the three-dimensional dense point cloud map.
5. The method for constructing the three-dimensional semantic map of the indoor mobile robot according to claim 1, wherein the method comprises the following steps of: the step 4 of segmenting the dense point cloud map by the point cloud segmentation algorithm includes:
Step 4.1: dividing a dense point cloud map by adopting an ultra-voxel clustering dividing method, converting the point cloud into a surface structure, and representing a dividing result by adopting an adjacent graph G= { v, epsilon } wherein v i epsilon v is a curved surface block obtained by dividing, epsilon represents a curved surface block (v i,vj) connected with adjacent curved surface blocks, and each curved surface block comprises a centroid c i and a normal vector n i;
Step 4.2: after obtaining the curved surface block, using RANSAC to process the curved surface block to generate a candidate plane pc= { PC 1,pc2,…,pcm }, then calculating the distance d (c i,pcm) from the centroid c i of the curved surface block to the plane PC m, setting a threshold delta, obtaining all the curved surface blocks with the distance delta from the plane PC m, named pi= { v i∈V|d(ci,pcm) < delta }, and defining:
Wherein η represents a constraint condition for distinguishing a foreground object from a background, D (pc m) represents a weight of a point cloud plane, and the minimum energy P * of the graph segmentation problem is:
P*=argminE(P),
Wherein E (P) is the energy of the fitting, and the following conditions are satisfied:
wherein V and E are the sets of vertices and edges in the graph, respectively, and the adjacent graph g= { V, epsilon } of the corresponding cluster segmentation.
6. The method for constructing the three-dimensional semantic map of the indoor mobile robot according to claim 1, wherein the method comprises the following steps of: the step 5 comprises the following steps:
obtaining a bounding box of the target object point cloud from the 2D recognition box, where the width b of the bounding box is the maximum depth difference over all pixels in the 2D recognition box; assuming the set of all pixels in the 2D recognition box is P = {P_1, P_2, P_3, …, P_n}, then
b = max(Depth(P_i) − Depth(P_j)),
where P_i, P_j ∈ P and Depth(P_i) is the depth value corresponding to pixel P_i; the 2D pixels are back-projected into the world coordinate system, and the length and width of the 2D recognition box are mapped into the world coordinate system to obtain the length and height of the bounding box;
and then examining the result of the point cloud segmentation, and only when all points of a cluster lie inside the bounding box, replacing the random colour attribute label of that cluster with the 3D semantic label obtained in step 3, to obtain the semantic map.
7. The three-dimensional semantic map construction method of an indoor mobile robot according to any one of claims 1 to 6, characterized by: the RGB-D depth camera adopts a Kinect V2 camera.
8. The three-dimensional semantic map construction method of an indoor mobile robot according to any one of claims 1 to 6, characterized by: the YOLO v5 network adopts the YOLO v5s network.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210594240.3A CN115035260B (en) | 2022-05-27 | 2022-05-27 | Three-dimensional semantic map construction method for indoor mobile robot |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210594240.3A CN115035260B (en) | 2022-05-27 | 2022-05-27 | Three-dimensional semantic map construction method for indoor mobile robot |
Publications (2)
Publication Number | Publication Date |
---|---|
CN115035260A CN115035260A (en) | 2022-09-09 |
CN115035260B true CN115035260B (en) | 2024-11-05 |
Family
ID=83121169
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210594240.3A Active CN115035260B (en) | 2022-05-27 | 2022-05-27 | Three-dimensional semantic map construction method for indoor mobile robot |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115035260B (en) |
Families Citing this family (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115908734B (en) * | 2022-11-25 | 2023-07-07 | 贵州电网有限责任公司信息中心 | Power grid map updating method, device, equipment and storage medium |
CN116363217B (en) * | 2023-06-01 | 2023-08-11 | 中国人民解放军国防科技大学 | Method, device, computer equipment and medium for measuring pose of space non-cooperative target |
CN116433493B (en) * | 2023-06-07 | 2023-09-22 | 湖南大学 | Workpiece point cloud set splicing method based on metric learning |
CN117542008B (en) * | 2023-10-12 | 2024-10-01 | 哈尔滨工业大学(深圳)(哈尔滨工业大学深圳科技创新研究院) | Semantic point cloud fusion automatic driving scene identification method and storage medium |
CN117517864B (en) * | 2023-11-08 | 2024-04-26 | 南京航空航天大学 | Laser radar-based power transmission line near electricity early warning method and device |
CN117788730B (en) * | 2023-12-08 | 2024-10-15 | 中交机电工程局有限公司 | Semantic point cloud map construction method |
CN117611762B (en) * | 2024-01-23 | 2024-04-30 | 常熟理工学院 | Multi-level map construction method, system and electronic equipment |
CN117906595B (en) * | 2024-03-20 | 2024-06-21 | 常熟理工学院 | Scene understanding navigation method and system based on feature point method vision SLAM |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110243370A (en) * | 2019-05-16 | 2019-09-17 | 西安理工大学 | A kind of three-dimensional semantic map constructing method of the indoor environment based on deep learning |
CN112288857A (en) * | 2020-10-30 | 2021-01-29 | 西安工程大学 | Robot semantic map object recognition method based on deep learning |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111695622B (en) * | 2020-06-09 | 2023-08-11 | 全球能源互联网研究院有限公司 | Identification model training method, identification method and identification device for substation operation scene |
CN111897332B (en) * | 2020-07-30 | 2022-10-11 | 国网智能科技股份有限公司 | Semantic intelligent substation robot humanoid inspection operation method and system |
- 2022-05-27 CN CN202210594240.3A patent/CN115035260B/en active Active
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110243370A (en) * | 2019-05-16 | 2019-09-17 | 西安理工大学 | A kind of three-dimensional semantic map constructing method of the indoor environment based on deep learning |
CN112288857A (en) * | 2020-10-30 | 2021-01-29 | 西安工程大学 | Robot semantic map object recognition method based on deep learning |
Also Published As
Publication number | Publication date |
---|---|
CN115035260A (en) | 2022-09-09 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |