CN115035260B - Three-dimensional semantic map construction method for indoor mobile robot - Google Patents
Three-dimensional semantic map construction method for indoor mobile robot
- Publication number
- CN115035260B CN202210594240.3A
- Authority
- CN
- China
- Prior art keywords
- point cloud
- key frame
- map
- semantic
- rgb
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T17/00—Three dimensional [3D] modelling, e.g. data description of 3D objects
- G06T17/05—Geographic models
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/10—Segmentation; Edge detection
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/70—Determining position or orientation of objects or cameras
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/762—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using clustering, e.g. of similar faces in social networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/10—Image acquisition modality
- G06T2207/10028—Range image; Depth image; 3D point clouds
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20081—Training; Learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20084—Artificial neural networks [ANN]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/30—Subject of image; Context of image processing
- G06T2207/30244—Camera pose
Abstract
The invention discloses a three-dimensional semantic map construction method for an indoor mobile robot. The ORB-SLAM2 algorithm is run with the RGB-D depth camera carried by the robot to obtain key frames and the corresponding robot poses, and the RGB image and depth image of each key frame are back-projected to form a spatial point cloud; the key-frame point clouds are filtered and stitched according to the poses to form a three-dimensional dense point cloud map; a YOLO v5 network performs target recognition on the key-frame RGB images to obtain 2D semantic information, and 3D semantic labels are obtained by back projection; the dense point cloud map is segmented with a point cloud segmentation algorithm; and the obtained semantic information is fused with the segmentation result of the dense point cloud map to construct the 3D semantic map. The invention processes the RGB-D information acquired by the RGB-D depth camera and fuses the semantic information with the SLAM result to form a three-dimensional map with richer information.
Description
Technical Field
The invention belongs to the technical field of robot semantic map construction, and relates to a robot map construction method, in particular to a construction method of a three-dimensional semantic map of an indoor mobile robot.
Background
With the rapid development of intelligent technology, the mobile robot industry has also grown quickly. Advanced mobile robots carry a number of sensors with specific functions, so that designated tasks in various scenes can be completed without direct human operation. For an intelligent robot whose application scene is indoors, autonomous exploration by the robot is generally adopted to complete the map construction task.
Typical semantic map construction methods for mobile robots currently fall into two main categories: methods based on visual images and methods based on point cloud processing. For example, Zhang Cheng et al., in "A semantic SLAM algorithm in dynamic environments", use a monocular image sequence to carry out semantic segmentation, visual odometry and related work, and finally construct a global semantic map; Zhang Zheng et al., in "A robust semantic SLAM algorithm with variable structure", study the robustness of semantic SLAM using point cloud images.
The prior art has the following problems in constructing a semantic map for a mobile robot: first, a visual image only provides RGB color information, from which three-dimensional spatial information such as a point cloud is difficult to extract; second, semantic information about the represented objects is not easily obtained from point cloud data alone, and the semantics are difficult to fuse with the SLAM mapping and localization results.
Disclosure of Invention
In view of the prior art, the technical problem to be solved by the invention is to provide a three-dimensional semantic map construction method for an indoor mobile robot that processes the RGB-D information acquired by an RGB-D depth camera and fuses the semantic information with the SLAM result to form a three-dimensional map with richer information.
In order to solve the technical problems, the three-dimensional semantic map construction method of the indoor mobile robot comprises the following steps:
step 1: running the ORB-SLAM2 algorithm with an RGB-D depth camera carried by the robot to obtain key frames and the corresponding robot poses, and forming a single-frame spatial point cloud by back-projecting the RGB image and depth image corresponding to each key frame;
step 2: filtering the spatial point cloud corresponding to each key frame, and stitching the point clouds according to the key-frame poses to form a three-dimensional dense point cloud map;
step 3: performing target recognition on the RGB image corresponding to each key frame with a YOLO v5 network to obtain the 2D semantic information in the key frame, and obtaining 3D semantic labels by back projection;
step 4: segmenting the dense point cloud map with a point cloud segmentation algorithm;
step 5: fusing the obtained semantic information with the segmentation result of the dense point cloud map to construct the 3D semantic map.
Further, step 1 includes:
Key frame processing is added to the Tracking thread of the ORB-SLAM2 system to obtain the spatial point cloud data corresponding to the key frame. Specifically, the following operation is applied to every pixel in the key frame to obtain the key frame's point cloud data in the camera coordinate system: let the coordinates of a pixel P' in the image be [u, v]^T, let the camera intrinsic matrix be K, and let the depth corresponding to the pixel be d; the pixel is converted into a three-dimensional point P_c(X', Y', Z') in the camera coordinate system as follows:
P_c = [X', Y', Z']^T = d · K^{-1} · [u, v, 1]^T;
after the spatial coordinates of the three-dimensional point P_c are obtained, the RGB information of the pixel P' is taken as the RGB information of P_c; after all pixels in the key frame have been processed in this way, the point cloud data of the key frame in the camera coordinate system is obtained through the point cloud construction functions of the PCL library;
The point cloud data of the key frame in the camera coordinate system is then converted into point cloud data in the world coordinate system:
Let P_w(X, Y, Z) be the point obtained by converting P_c into the world coordinate system; in homogeneous coordinates P_w satisfies:
P_w = T_cw^{-1} · P_c,
wherein T_cw is the camera pose.
Further, the filtering processing of the spatial point cloud corresponding to the key frame in the step 2 includes:
Outliers in the point cloud data are removed with the statistical outlier removal filter of the PCL point cloud library: the average distance from each point in the spatial point cloud to its neighbouring points is calculated, and points whose average distance exceeds the outlier distance threshold d_pt are identified as outliers and removed from the data. The point cloud data is then downsampled by voxel grid filtering.
Further, the outlier distance threshold d_pt is specifically:
Assuming that the average distances of all points in the spatial point cloud obey the normal distribution (μ, σ²), and that the threshold parameter for judging outliers is H_out, the outlier distance threshold is:
d_pt = μ + H_out·σ.
Further, in the step 2, the step of stitching according to the pose corresponding to the key frame to form a three-dimensional dense point cloud map includes:
The LoopClosing thread of the ORB-SLAM2 system optimizes the poses of the key frames in a detected closed loop. The key frames whose camera pose has changed after closed-loop optimization are extracted, the point cloud data corresponding to these key frames in the world coordinate system is deleted, and the corresponding point cloud data is regenerated according to the optimized camera poses and projected into the world coordinate system to construct the three-dimensional dense point cloud map.
Further, step 3 includes:
step 3.1: establishing data set based on YOLO V5 network model
Pictures of target objects are taken to obtain a data set; the acquired data set is annotated with the labelling tool LabelImg, labels are assigned, and a training set and a test set are selected from the captured pictures; pictures containing the target objects are additionally selected from the COCO data set as further training and test data; the data configuration and the model training parameters of the YOLO v5 network are set, where the data configuration includes the number of object classes and the object class names; the model parameters of the YOLO v5 network are trained with the training set, and the network model is verified with the test set;
step 3.2: extracting 2D target semantics and constructing 3D semantic tags
The RGB image information collected by the RGB-D depth camera is fed to the YOLO v5 network, and the 2D recognition boxes in the RGB image are obtained in real time as the 2D target semantic information;
An interface is added to the Tracking thread of ORB-SLAM2 to read the key frames generated in real time. Target detection is performed on each newly generated key frame with the trained YOLO v5 network model to obtain the pixel coordinates of the centre point of each target object and the length and width of its bounding box. When the spatial point cloud data is generated from the RGB image and depth image corresponding to the key frame, the RGB information of all pixels inside the 2D bounding box of a target object is changed to a preset colour, so that the single-frame point cloud data generated from the key frame carries a 3D semantic label. The key-frame point cloud data is then back-projected into the world coordinate system to complete the point cloud stitching, yielding a global dense point cloud map with 3D semantic labels, i.e. colour information is added to the point clouds of the target objects in the point cloud map.
Further, in step 4, segmenting the dense point cloud map by the point cloud segmentation algorithm includes:
Step 4.1: dividing a dense point cloud map by adopting an ultra-voxel clustering dividing method, converting the point cloud into a surface structure, and representing a dividing result by adopting an adjacent graph G= { v, epsilon } wherein v i epsilon v is a curved surface block obtained by dividing, epsilon represents a curved surface block (v i,vj) connected with adjacent curved surface blocks, and each curved surface block comprises a centroid c i and a normal vector n i;
Step 4.2: after obtaining the curved surface block, using RANSAC to process the curved surface block to generate a candidate plane pc= { PC 1,pc2,…,pcm }, then calculating the distance d (c i,pcm) from the centroid c i of the curved surface block to the plane PC m, setting a threshold delta, obtaining all the curved surface blocks with the distance delta from the plane PC m, named pi= { v i∈V|d(ci,pcm) < delta }, and defining:
Wherein η represents a constraint condition for distinguishing a foreground object from a background, D (pc m) represents a weight of a point cloud plane, and the minimum energy P * of the graph segmentation problem is:
Wherein E (P) is the energy of the fitting, and the following conditions are satisfied:
wherein V and E are the sets of vertices and edges in the graph, respectively, and the adjacent graph g= { V, epsilon } of the corresponding cluster segmentation.
Further, step 5 includes:
A bounding box of the target object point cloud is obtained from the 2D recognition box, where the width b of the bounding box is the maximum depth difference over all pixels in the 2D recognition box; assuming the set of all pixels in the 2D recognition box is P = {P_1, P_2, P_3, …, P_n}, then
b = max(Depth(P_i) − Depth(P_j)),
where P_i, P_j ∈ P and Depth(P_i) is the depth value corresponding to pixel P_i; the 2D pixels are back-projected into the world coordinate system, and the length and width of the 2D recognition box are mapped into the world coordinate system to obtain the length and height of the bounding box;
The result of the point cloud segmentation is then examined, and only when all points of a cluster lie inside the bounding box is the random colour attribute label of that cluster replaced by the 3D semantic label obtained in step 3, yielding the semantic map.
Further, the RGB-D depth camera uses a Kinect V2 camera.
Further, the YOLO v5 network employs the YOLO v5s network.
The invention has the following beneficial effects. The robot is equipped with an RGB-D depth camera (for example, a Kinect camera) to obtain RGB-D information of the indoor environment; based on the key frames generated by the ORB-SLAM2 system, target recognition is completed with a trained YOLO v5s network model and the results are converted into 3D semantic labels. A three-dimensional dense point cloud map is generated by combining the RGB and depth images of the key frames with the key-frame pose information; on this basis, the point cloud map is segmented and the semantic labels are fused with the dense point cloud map to construct a three-dimensional semantic map with rich information.
Drawings
FIG. 1 is a flow chart of the present invention;
FIG. 2 is a dense point cloud construction flow diagram;
FIG. 3 is a dense point cloud map;
FIG. 4 is a partial data set sample acquired;
FIG. 5 is a calibration of a data sample;
FIG. 6 shows the real-time detection results of YOLO v5s;
FIG. 7 is a flow chart for generating a point cloud tag;
FIG. 8 is a point cloud label diagram in a real scene;
FIG. 9 is a super voxel cluster segmentation result;
FIG. 10 is a graph of point cloud segmentation results;
Fig. 11 is a fused semantic map.
Detailed Description
The invention is further described below with reference to the drawings and examples.
The flow of the invention is shown in FIG. 1. A Kinect V2 camera is selected as the vision sensor, and key frames along the camera trajectory are obtained based on the ORB-SLAM2 algorithm. After the camera pose is obtained from the key frames through feature point matching, back-end optimization and related processing, the RGB image and depth image corresponding to each key frame are back-projected to form a spatial point cloud; after filtering, the point clouds are stitched with the camera pose information to form a three-dimensional dense point cloud map. Meanwhile, target recognition is performed on the RGB image corresponding to each key frame with a YOLO v5 network model to obtain the semantic information of the targets, which is projected into the spatial point cloud to construct 3D semantic labels. The dense point cloud map is then segmented with a point cloud segmentation algorithm, and finally the obtained semantic information is fused with the segmentation result of the dense point cloud map to construct the 3D semantic map.
1. Improved ORB-SLAM2 construction of three-dimensional dense point clouds
The method obtains a dense point cloud by modifying some of the threads in ORB-SLAM2. The specific flow is shown in FIG. 2, and the main steps are as follows:
Step 1: two-dimensional pixel point to three-dimensional point cloud data back projection
The Tracking thread of the ORB-SLAM2 system is modified to add key frame processing and obtain the 3D point cloud data corresponding to each key frame. The principle is to back-project the two-dimensional pixel coordinates and depth data of the environment into the world coordinate system through the intrinsic and extrinsic parameters of the camera. Let the coordinates of a pixel P' in the image be [u, v]^T, let the camera intrinsic matrix be K, and let the depth corresponding to the pixel be d; the pixel is converted into a three-dimensional point P_c(X', Y', Z') in the camera coordinate system as follows:
P_c = [X', Y', Z']^T = d · K^{-1} · [u, v, 1]^T (1)
After the spatial coordinates of the three-dimensional point P_c are obtained, the RGB information of pixel P' is taken as the RGB information of P_c. After all pixels in the key frame have been processed in this way, the point cloud data of the key frame in the camera coordinate system is obtained through the point cloud construction functions of the PCL library. The point cloud data corresponding to the key frame is then converted into a point cloud in the world coordinate system. Let P_w(X, Y, Z) be the point obtained by converting P_c into the world coordinate system; with T_cw denoting the camera pose (the world-to-camera transform estimated by ORB-SLAM2), the process can be expressed in homogeneous coordinates as:
P_w = T_cw^{-1} · P_c (2)
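As an illustrative sketch only (not part of the original disclosure), the back projection of equations (1) and (2) can be written with OpenCV, Eigen and PCL roughly as follows; the intrinsics fx, fy, cx, cy, the depth scale and the 16-bit depth format are assumptions made for the example:

```cpp
#include <pcl/point_types.h>
#include <pcl/point_cloud.h>
#include <pcl/common/transforms.h>
#include <opencv2/core.hpp>
#include <Eigen/Geometry>

// Convert one key frame (RGB + depth) into a colored point cloud in the world
// frame. fx, fy, cx, cy come from the intrinsic matrix K; depthScale converts
// the raw depth units to meters (e.g. 1000 for millimeter depth) - assumed here.
pcl::PointCloud<pcl::PointXYZRGB>::Ptr KeyFrameToWorldCloud(
    const cv::Mat& rgb, const cv::Mat& depth,
    float fx, float fy, float cx, float cy,
    float depthScale, const Eigen::Isometry3d& Tcw)
{
    pcl::PointCloud<pcl::PointXYZRGB>::Ptr camCloud(new pcl::PointCloud<pcl::PointXYZRGB>);
    for (int v = 0; v < depth.rows; ++v) {
        for (int u = 0; u < depth.cols; ++u) {
            float d = depth.at<unsigned short>(v, u) / depthScale;  // depth of pixel (u, v)
            if (d <= 0.0f) continue;                                // skip invalid depth
            pcl::PointXYZRGB p;
            p.z = d;                          // Z' = d
            p.x = (u - cx) * d / fx;          // X' = (u - cx) * d / fx
            p.y = (v - cy) * d / fy;          // Y' = (v - cy) * d / fy
            cv::Vec3b c = rgb.at<cv::Vec3b>(v, u);                  // RGB of pixel P'
            p.b = c[0]; p.g = c[1]; p.r = c[2];
            camCloud->push_back(p);
        }
    }
    // Pw = Tcw^{-1} * Pc: project the camera-frame cloud into the world frame.
    pcl::PointCloud<pcl::PointXYZRGB>::Ptr worldCloud(new pcl::PointCloud<pcl::PointXYZRGB>);
    pcl::transformPointCloud(*camCloud, *worldCloud, Tcw.inverse().matrix());
    return worldCloud;
}
```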
step2: dense point cloud data filtering
According to the method, the outlier removing filter in the PCL point cloud library is used for removing abnormal noise points in the point cloud data, the point cloud obtained after outlier removing filtering is smooth, and the method is suitable for subsequent processing of point cloud clustering. And then downsampling the point cloud data by a voxel grid filtering method. When the invention filters the outlier, the average distance from each point to all adjacent points is calculated, the obtained result is subjected to Gaussian distribution, the shape of the result is determined by Gaussian distribution parameters (mu, sigma 2), and the point with the average distance outside the standard range is considered as the outlier and is removed from the data.
Assuming that the average distances of all points in the point cloud obey the normal distribution (μ, σ²), and that the threshold parameter for judging outliers is H_out, the outlier distance threshold d_pt can be expressed as:
d_pt = μ + H_out·σ (3)
That is, points whose average distance lies outside the interval (0, d_pt) are treated as outliers and filtered out.
Experiments carried out on the hardware platform used in this work (Intel Core i5-10400F CPU, NVIDIA GTX 1650) show that the best running performance is obtained when the resolution parameter of the voxel grid filter is set to 0.01, i.e. a voxel size of 1 cm × 1 cm × 1 cm.
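A minimal sketch of this two-stage filtering with PCL, where the standard-deviation multiplier of StatisticalOutlierRemoval plays the role of H_out in equation (3) and the voxel leaf size is the 1 cm value above; the neighbourhood size and multiplier are illustrative values, not taken from the patent:

```cpp
#include <pcl/point_types.h>
#include <pcl/filters/statistical_outlier_removal.h>
#include <pcl/filters/voxel_grid.h>

pcl::PointCloud<pcl::PointXYZRGB>::Ptr FilterKeyFrameCloud(
    const pcl::PointCloud<pcl::PointXYZRGB>::Ptr& cloud)
{
    // Statistical outlier removal: points whose mean distance to their K
    // neighbours exceeds mu + Hout * sigma are discarded (K = 50, Hout = 1.0
    // are example values).
    pcl::PointCloud<pcl::PointXYZRGB>::Ptr denoised(new pcl::PointCloud<pcl::PointXYZRGB>);
    pcl::StatisticalOutlierRemoval<pcl::PointXYZRGB> sor;
    sor.setInputCloud(cloud);
    sor.setMeanK(50);
    sor.setStddevMulThresh(1.0);
    sor.filter(*denoised);

    // Voxel grid downsampling with the 1 cm leaf size reported to work well.
    pcl::PointCloud<pcl::PointXYZRGB>::Ptr downsampled(new pcl::PointCloud<pcl::PointXYZRGB>);
    pcl::VoxelGrid<pcl::PointXYZRGB> voxel;
    voxel.setInputCloud(denoised);
    voxel.setLeafSize(0.01f, 0.01f, 0.01f);
    voxel.filter(*downsampled);
    return downsampled;
}
```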
Step 3: Point cloud stitching
In the invention, the global dense point cloud data obtained directly from Step 1 depends on the camera pose T_cw of each key frame. However, the LoopClosing thread of the ORB-SLAM2 system optimizes the poses of all key frames in a detected closed loop, so the pose estimates of some key frames change.
To avoid ghosting in the constructed dense point cloud map, the key frames whose camera pose has changed after closed-loop optimization are extracted, their point cloud data in the world coordinate system is deleted, and the corresponding point cloud data is regenerated with the optimized poses and projected back into the world coordinate system, completing the point cloud stitching after loop detection. The final 3D dense point cloud map constructed on the basis of the ORB-SLAM2 system is shown in FIG. 3.
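The stitching after loop closure can be sketched as follows; here each key frame is assumed, for the purposes of the example, to cache its camera-frame cloud, so that re-projection with the optimized pose only needs a rigid transform:

```cpp
#include <pcl/point_types.h>
#include <pcl/point_cloud.h>
#include <pcl/common/transforms.h>
#include <Eigen/Geometry>
#include <vector>

struct KeyFrameCloud {
    pcl::PointCloud<pcl::PointXYZRGB>::Ptr cameraCloud;  // cloud in camera coordinates
    Eigen::Isometry3d Tcw;                               // current (optimized) camera pose
};

// Re-assemble the global dense map from all key frames using their latest
// poses; called again after LoopClosing changes some key-frame poses.
pcl::PointCloud<pcl::PointXYZRGB>::Ptr AssembleGlobalMap(
    const std::vector<KeyFrameCloud>& keyframes)
{
    pcl::PointCloud<pcl::PointXYZRGB>::Ptr globalMap(new pcl::PointCloud<pcl::PointXYZRGB>);
    for (const auto& kf : keyframes) {
        pcl::PointCloud<pcl::PointXYZRGB> world;
        // Pw = Tcw^{-1} * Pc for every point of the key frame.
        pcl::transformPointCloud(*kf.cameraCloud, world, kf.Tcw.inverse().matrix());
        *globalMap += world;   // splice into the global dense point cloud map
    }
    return globalMap;
}
```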
2. 3D semantic label construction based on YOLO v5s network model
In the method, target recognition is carried out on the key frames generated by the Tracking thread of the ORB-SLAM2 system, and the obtained 2D semantic information is then mapped into the spatial point cloud data to complete the construction of the 3D semantic labels. The specific steps are as follows:
Step1: establishing a data set based on a YOLO v5s network model
The patent uses the camera to shoot the common object in the laboratory, mainly contains common articles such as computer equipment, books. Part of the acquired image is labeled using a labeling program Labelimg, as shown in fig. 4, and labeled, for example, with a "tv" label, as shown in fig. 5, for example, in the process of labeling a "display". 500 pictures are selected from the photographed pictures to serve as training sets, and 200 pictures are selected from the photographed pictures to serve as test sets. In addition, 1500 pictures containing objects in the indoor scene are selected from the COCO as a training set, 300 pictures are taken as a test set, and for use with the COCO data set, the annotation file is stored in json file format. In the present invention, the number of objects commonly used in indoor environments is 35.
Before training, the YOLO v5s data and the configuration file of the model need to be modified, the object class number nc of the two files is changed to 35, and the object class names are modified in the class list names. The training set comprises 2000 pictures, the test set comprises 500 pictures, the data set is prepared, the parameters of the YOLO v5s network model are trained, and the super-parameter configuration of the training model is shown in table 1.
TABLE 1 parameter settings for the YOLO v5s network model
Step2: extracting 2D target semantics and constructing 3D semantic tags
After the YOLO V5s network training is finished, the Kinect V2 camera is used for collecting corresponding RGB image information in real time, a real-time detection function can be achieved, the result is shown in fig. 6, 2D target semantic information in the RGB image can be obtained, and the requirement of constructing the 3D semantic tag in real time can be met.
The process of mapping 2D semantic information to 3D semantic labels is shown in FIG. 7. An interface is added to the Tracking thread of ORB-SLAM2, and target detection is performed on each newly generated key frame with the trained YOLO v5s network to obtain the pixel coordinates of the centre point of each target object and the length and width of its bounding box. When the spatial point cloud data is generated from the RGB image and depth image corresponding to the key frame, the RGB information of all pixels inside the 2D bounding box of a target object is changed to a preset colour, so that the single-frame point cloud data generated from the key frame carries semantic labels. Combined with the pose estimation result, the key-frame point cloud data is then back-projected into the world coordinate system to complete the point cloud stitching, yielding a global dense point cloud map with 3D semantic labels, i.e. colour information is added to the point clouds of the target objects in the point cloud map.
The key frames generated by ORB-SLAM2 are used as input to obtain key frame images with 2D bounding boxes and confidence values, and the corresponding 3D semantic labels in the key frame are then obtained by back projection. The result is shown in FIG. 8, where the point cloud of the monitor is set to red, that of the keyboard to green, and that of the mouse to blue.
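The colouring rule used to turn 2D detections into 3D semantic labels can be sketched as follows; Box2D, Rgb and the fixed colour palette are illustrative assumptions rather than the actual data structures of the implementation:

```cpp
#include <cstdint>
#include <vector>

struct Box2D {              // one YOLO v5 detection in pixel coordinates
    int classId;
    float cx, cy, w, h;     // centre point and box width/height
};

struct Rgb { std::uint8_t r, g, b; };

// Fixed label colour per class, e.g. monitor = red, keyboard = green, mouse = blue.
Rgb LabelColour(int classId)
{
    static const Rgb palette[] = { {255, 0, 0}, {0, 255, 0}, {0, 0, 255} };
    return palette[classId % 3];
}

// Colour written for pixel (u, v) when the key-frame cloud is generated:
// the label colour if the pixel lies in any detection box, otherwise its RGB value.
Rgb PixelColour(int u, int v, const Rgb& original, const std::vector<Box2D>& boxes)
{
    for (const auto& box : boxes) {
        if (u >= box.cx - box.w / 2 && u <= box.cx + box.w / 2 &&
            v >= box.cy - box.h / 2 && v <= box.cy + box.h / 2) {
            return LabelColour(box.classId);   // point will carry the 3D semantic label
        }
    }
    return original;
}
```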
3. Dense point cloud segmentation
In the method, point cloud segmentation is carried out on the constructed dense point cloud map to obtain a scene that the robot can understand; at the same time, accurate target point clouds are obtained from the segmentation result of the dense point cloud map, which makes the subsequent combination with the 3D semantic labels convenient. First, the original dense point cloud map is segmented with a super-voxel-clustering-based method, which converts the irregular point cloud into a surface structure according to the similarity of points; the surface patches formed after super-voxel segmentation each have a centroid and a normal vector, so the point cloud segmentation can be formulated as a graph segmentation problem, which is then solved with a graph-cut-based method for further segmentation. The detailed steps are as follows:
Step1: super voxel cluster segmentation
The super voxel segmentation is to use a spatial octree structure of point cloud, cluster the points through classes Supervoxel Clustering in a PCL library, add labels to voxel centroids (voxel_centroid_closed) of the octree structure according to segmentation results, and assign random colors to store the voxel centroids in colored _voxel_closed parameters.
The segmentation principle uses k-means-style region growing to segment the point cloud; the super-voxel clustering process is similar to crystallization, in which many crystal nuclei form once a solution becomes supersaturated. The super-voxel clustering segmentation method distributes growth nuclei uniformly throughout the space according to a certain rule, then sets the nucleus spacing and the minimum grain size, and smaller grains are absorbed by larger ones. During super-voxel segmentation, the expansion of a seed point is determined by a feature distance computed in a feature space combining spatial position, colour and normal vector; the distance formula used in super-voxel cluster segmentation is:
D = sqrt(w_c·D_c² + w_s·D_s² + w_n·D_n²) (4)
In the above formula: d c represents the degree of difference in color, D n represents the degree of difference in normal direction, and D s represents the degree of difference in distance of the point cloud. W c、ws、wn in the formula represents the weights of the respective variables. The voxel with the smallest D value is found to be the most similar through searching the surrounding of the crystal nucleus, so that the crystal nucleus for the next growth is represented.
In the invention, a Kinect V2 camera is selected as the sensor and appropriate parameters are set: voxel_resolution = 0.008 (voxel size, the resolution of the spatial octree) and seed_resolution = 0.5 (seed resolution). The weights of colour, spatial distance and normal vector are tuned for the laboratory environment, and the resulting super-voxel cluster segmentation is shown in FIG. 9.
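A minimal sketch of this step with pcl::SupervoxelClustering, using the resolutions quoted above (voxel_resolution = 0.008, seed_resolution = 0.5); the colour, spatial and normal importance values are illustrative and would be tuned for the scene:

```cpp
#include <pcl/point_types.h>
#include <pcl/segmentation/supervoxel_clustering.h>
#include <cstdint>
#include <map>

std::map<std::uint32_t, pcl::Supervoxel<pcl::PointXYZRGBA>::Ptr>
SegmentSupervoxels(const pcl::PointCloud<pcl::PointXYZRGBA>::Ptr& cloud)
{
    pcl::SupervoxelClustering<pcl::PointXYZRGBA> super(0.008f, 0.5f);  // voxel / seed resolution
    super.setInputCloud(cloud);
    super.setColorImportance(0.2f);    // weight w_c of the colour difference D_c
    super.setSpatialImportance(0.4f);  // weight w_s of the spatial difference D_s
    super.setNormalImportance(1.0f);   // weight w_n of the normal difference D_n

    std::map<std::uint32_t, pcl::Supervoxel<pcl::PointXYZRGBA>::Ptr> clusters;
    super.extract(clusters);           // labelled surface patches (super-voxels)

    // Adjacency between neighbouring patches, i.e. the edges of the graph G = {V, ε}.
    std::multimap<std::uint32_t, std::uint32_t> adjacency;
    super.getSupervoxelAdjacency(adjacency);
    return clusters;
}
```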
Step2: point cloud segmentation based on geometric information
The super-voxel segmentation result obtained in the last step is represented by an adjacency graph G= { v, epsilon } wherein v i epsilon v is a segmented curved surface block, and epsilon represents that adjacent curved surface blocks (v i,vj) are connected. After the super-voxel segmentation process, each surface block has a centroid c i and a normal vector n i, and the scene segmentation can be regarded as a graph segmentation problem.
Suppose there are K actual planes {s_1, s_2, …, s_k} in the point cloud map and that these planes have been segmented into patches. A variable {b_i}_1^N with b_i ∈ [0, K] is defined, where b_i = k means that surface patch v_i belongs to plane s_k, and N indicates that the actual plane s_k has been divided into N surface patches; in FIG. 9, for example, the table top is divided into two clusters. If the planes of all objects in the point cloud map can be extracted and the surface patches assigned to these planes, a more accurate point cloud segmentation result is obtained.
In the specific operation, after the patches are obtained they are processed with RANSAC to generate candidate planes PC = {pc_1, pc_2, …, pc_m}, and d(c_i, pc_m), the distance from the patch centroid c_i to the plane pc_m, is calculated. By introducing a threshold δ, all surface patches whose distance to plane pc_m is below δ are obtained, denoted Π = {v_i ∈ V | d(c_i, pc_m) < δ}. The following is defined:
In the formula, η represents a constraint that distinguishes foreground objects from the background, and D(pc_m) represents the weight of the point cloud plane. In the experiments, η = 40 and δ = 4 cm were used. The minimized energy P* of the graph segmentation problem can be expressed as:
P* = argmin E(P) (6)
where E(P) is the fitting energy, given by:
where V and ε are the sets of vertices and edges, respectively, of the adjacency graph G = {V, ε} of the cluster segmentation.
Based on minimum-cut theory for point clouds, the edges that satisfy the energy minimization are cut, i.e. the segmentation result of the point cloud is obtained by merging adjacent surface patches according to the constraint; the purpose of this cut is to merge clusters that belong to the same object in the clustering result. The segmentation result is shown in FIG. 10: target objects such as the monitor, books, cups and bottles are merged well, and the planes of background objects such as the table, the storage cabinet and the floor are also merged well, meeting the requirements of the invention.
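The patch-to-plane assignment Π = {v_i ∈ V | d(c_i, pc_m) < δ} can be sketched as follows with δ = 4 cm; the Patch and Plane structures are illustrative assumptions, and the graph-cut energy minimization itself is omitted:

```cpp
#include <Eigen/Core>
#include <cmath>
#include <vector>

struct Patch { Eigen::Vector3f centroid; Eigen::Vector3f normal; int planeId = -1; };
struct Plane { Eigen::Vector4f coeffs; };  // a*x + b*y + c*z + d = 0, (a, b, c) unit normal

// Point-to-plane distance d(c_i, pc_m).
float DistanceToPlane(const Eigen::Vector3f& c, const Plane& pc)
{
    return std::abs(pc.coeffs.head<3>().dot(c) + pc.coeffs[3]);
}

// Assign every surface patch to the closest RANSAC candidate plane whose
// distance is below delta; unassigned patches keep planeId = -1 (foreground).
void AssignPatchesToPlanes(std::vector<Patch>& patches,
                           const std::vector<Plane>& candidates,
                           float delta = 0.04f)          // delta = 4 cm
{
    for (auto& patch : patches) {
        float best = delta;
        for (std::size_t m = 0; m < candidates.size(); ++m) {
            float d = DistanceToPlane(patch.centroid, candidates[m]);
            if (d < best) { best = d; patch.planeId = static_cast<int>(m); }
        }
    }
}
```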
4. Fusing semantic information and constructing the semantic map
After the dense point cloud segmentation is completed, the segmentation result is combined with the 3D semantic labels: the segmentation result is refined with the semantic information, so that targets that could not be extracted by point cloud segmentation alone can still be segmented; at the same time, semantic information is added to each segmented target, so a more accurate semantic map can be constructed. After segmentation each cluster carries an attribute label, and once the attribute labels of the clusters are replaced by the semantic labels of specific targets, those targets in the point cloud carry semantic information and the semantic map is constructed. The specific implementation steps are as follows:
Step1: storage of semantic tags
After the dense point cloud map is segmented, point clouds with different colors represent a point cloud set and a space sampling area in each super voxel, and the colors are attribute labels of the clusters. However, the attribute label at this time only distinguishes each clustering result, and is a random label without any meaning, and the label is also the attribute label of each cluster in the final cloud segmentation result.
As described in Section 2, the storage of object semantic labels is completed by performing target recognition on the key frames generated by ORB-SLAM2 with YOLO v5s and, according to the 3D semantic labels obtained from the recognition result, adding colour labels with semantic meaning to the point cloud segmentation result.
At this point, however, all pixels inside a 2D object recognition bounding box are mapped directly to 3D semantic labels using the depth data of the Kinect camera. In a cluttered laboratory environment the bounding box from target recognition is not accurate enough, so wrong semantic information is easily attached to the point cloud surrounding the target object; as shown in FIG. 8, semantic labels are erroneously added to the point cloud around the target object.
Step2: fusion of semantic tags and point cloud bounding boxes
In order to obtain a more accurate semantic map, the bounding box of the target object point cloud is obtained according to the 2D recognition frame, wherein the width value b of the bounding box is the maximum depth difference corresponding to all pixel points in the 2D recognition frame. Assuming that all pixel points in the 2D identification frame are set p= { P 1,P2,P3......,Pn }, then
b = max(Depth(P_i) − Depth(P_j)) (8)
where P_i, P_j ∈ P and Depth(P_i) is the depth value corresponding to pixel P_i. The length and height of the bounding box are obtained by mapping the length and width of the 2D recognition box; the principle is the back projection of 2D pixels into the world coordinate system and is not repeated here.
The results of the point cloud segmentation are then examined, and only when all points of a cluster lie inside the bounding box is specific semantic information added, i.e. a specific colour attribute is assigned to those points, yielding a more accurate semantic map. Comparing FIG. 11 with FIG. 10, fusing the bounding box derived from the semantic labels with the point cloud segmentation result allows the object "tv" to be segmented more completely, and the point cloud clusters at the edges of the actual object are also given the correct semantic information. Compared with adding semantic information to the point cloud directly from the 2D object recognition bounding box as in FIG. 8, this also effectively avoids attaching wrong semantic information to the point cloud around the object. At the same time, combining semantic information improves the point cloud segmentation itself: small objects on the desktop, such as the keyboard and mouse in FIG. 11, can be segmented by clustering after the added semantic information changes their colour attributes. Following the two steps above, the three-dimensional semantic map shown in FIG. 11 is obtained.
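A minimal sketch of the fusion rule described in this section; Point3D, Cluster, SemanticBox and the axis-aligned containment test are illustrative assumptions made for the example:

```cpp
#include <algorithm>
#include <vector>

struct Point3D { float x, y, z; };
struct Cluster { std::vector<Point3D> points; int label = -1; };   // -1: random colour label
struct SemanticBox { Point3D minCorner, maxCorner; int classId; }; // 3D bounding box of a target

// b = max(Depth(Pi) - Depth(Pj)) over all pixels of the 2D box,
// i.e. maximum depth minus minimum depth inside the recognition box.
float BoundingBoxWidth(const std::vector<float>& depthsInBox)
{
    if (depthsInBox.empty()) return 0.0f;
    auto mm = std::minmax_element(depthsInBox.begin(), depthsInBox.end());
    return *mm.second - *mm.first;
}

bool InsideBox(const Point3D& p, const SemanticBox& box)
{
    return p.x >= box.minCorner.x && p.x <= box.maxCorner.x &&
           p.y >= box.minCorner.y && p.y <= box.maxCorner.y &&
           p.z >= box.minCorner.z && p.z <= box.maxCorner.z;
}

// Replace a cluster's random colour label with the semantic label only when
// every point of the cluster lies inside the target's bounding box.
void FuseSemantics(std::vector<Cluster>& clusters, const std::vector<SemanticBox>& boxes)
{
    for (auto& cluster : clusters) {
        for (const auto& box : boxes) {
            bool allInside = std::all_of(cluster.points.begin(), cluster.points.end(),
                [&](const Point3D& p) { return InsideBox(p, box); });
            if (allInside) { cluster.label = box.classId; break; }
        }
    }
}
```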
Claims (8)
1. The three-dimensional semantic map construction method of the indoor mobile robot is characterized by comprising the following steps of:
step 1: operating an ORB-SLAM2 algorithm through an RGB-D depth camera carried by the robot to obtain key frame information and corresponding pose of the robot, and forming a single-frame space point cloud through back projection of an RGB image and a depth image corresponding to each key frame;
step 2: filtering the space point cloud corresponding to the key frame, and splicing according to the pose corresponding to the key frame to form a three-dimensional dense point cloud map;
Key frame processing is added to the Tracking thread of the ORB-SLAM2 system to obtain the spatial point cloud data corresponding to the key frame, specifically: the following operation is applied to every pixel in the key frame to obtain the key frame's point cloud data in the camera coordinate system: let the coordinates of a pixel P' in the image be [u, v]^T, let the camera intrinsic matrix be K, and let the depth corresponding to the pixel be d; the pixel is converted into a three-dimensional point P_c(X', Y', Z') in the camera coordinate system as follows:
P_c = [X', Y', Z']^T = d · K^{-1} · [u, v, 1]^T;
after the spatial coordinates of the three-dimensional point P_c are obtained, the RGB information of the pixel P' is taken as the RGB information of P_c; after all pixels in the key frame have been processed in this way, the point cloud data of the key frame in the camera coordinate system is obtained through the point cloud construction functions of the PCL library;
the point cloud data of the key frame in the camera coordinate system is then converted into point cloud data in the world coordinate system:
let P_w(X, Y, Z) be the point obtained by converting P_c into the world coordinate system; in homogeneous coordinates P_w satisfies:
P_w = T_cw^{-1} · P_c,
wherein T_cw is the camera pose;
step 3: performing target recognition on the RGB map corresponding to the key frame by utilizing a YOLO V5 network to obtain 2D semantic information in the key frame, and obtaining a 3D semantic tag according to back projection;
step 3.1: establishing data set based on YOLO V5 network model
pictures of target objects are taken to obtain a data set; the acquired data set is annotated with the labelling tool LabelImg, labels are assigned, and a training set and a test set are selected from the captured pictures; pictures containing the target objects are additionally selected from the COCO data set as further training and test data; the data configuration and the model training parameters of the YOLO v5 network are set, where the data configuration includes the number of object classes and the object class names; the model parameters of the YOLO v5 network are trained with the training set, and the network model is verified with the test set;
step 3.2: extracting 2D target semantics and constructing 3D semantic tags
The YOLO V5 network calls an RGB-D depth camera to collect RGB image information, and a 2D recognition frame in the RGB image is obtained in real time to serve as 2D target semantic information;
adding an interface to the Tracking thread of ORB-SLAM2 to read the key frames generated in real time; performing target detection on each newly generated key frame with the trained YOLO v5 network model to obtain the pixel coordinates of the centre point of each target object and the length and width of its bounding box; when generating the spatial point cloud data from the RGB image and depth image corresponding to the key frame, changing the RGB information of all pixels inside the 2D bounding box of a target object to a preset colour, so that the single-frame point cloud data generated from the key frame carries a 3D semantic label; and then, combined with the pose estimation result, back-projecting the obtained key-frame point cloud data into the world coordinate system to complete the point cloud stitching and obtain a global dense point cloud map with 3D semantic labels, i.e. colour information is added to the point clouds of the target objects in the point cloud map;
step 4: dividing the dense point cloud map by a point cloud dividing algorithm;
step 5: and fusing the obtained semantic information with the segmentation result of the dense point cloud map, and constructing to obtain the 3D semantic map.
2. The method for constructing the three-dimensional semantic map of the indoor mobile robot according to claim 1, wherein the method comprises the following steps of: the filtering processing of the spatial point cloud corresponding to the key frame in the step 2 includes:
removing outliers in the point cloud data with the statistical outlier removal filter of the PCL point cloud library: calculating the average distance from each point in the spatial point cloud to its neighbouring points, identifying points whose average distance exceeds the outlier distance threshold d_pt as outliers and removing them from the data, and then downsampling the point cloud data by voxel grid filtering.
3. The method for constructing the three-dimensional semantic map of the indoor mobile robot according to claim 2, wherein the outlier distance threshold d_pt is specifically:
assuming that the average distances of all points in the spatial point cloud obey the normal distribution (μ, σ²), and that the threshold parameter for judging outliers is H_out, the outlier distance threshold is:
d_pt = μ + H_out·σ.
4. The method for constructing the three-dimensional semantic map of the indoor mobile robot according to claim 1, wherein the method comprises the following steps of: step 2, the step of splicing according to the pose corresponding to the key frame to form a three-dimensional dense point cloud map comprises the following steps:
the LoopClosing thread of the ORB-SLAM2 system optimizes the poses of the key frames in a detected closed loop; the key frames whose camera pose has changed after closed-loop optimization are extracted, the point cloud data corresponding to these key frames in the world coordinate system is deleted, and the corresponding point cloud data is regenerated according to the optimized camera poses and projected into the world coordinate system to construct the three-dimensional dense point cloud map.
5. The method for constructing the three-dimensional semantic map of the indoor mobile robot according to claim 1, wherein the method comprises the following steps of: the step 4 of segmenting the dense point cloud map by the point cloud segmentation algorithm includes:
Step 4.1: dividing a dense point cloud map by adopting an ultra-voxel clustering dividing method, converting the point cloud into a surface structure, and representing a dividing result by adopting an adjacent graph G= { v, epsilon } wherein v i epsilon v is a curved surface block obtained by dividing, epsilon represents a curved surface block (v i,vj) connected with adjacent curved surface blocks, and each curved surface block comprises a centroid c i and a normal vector n i;
Step 4.2: after obtaining the curved surface block, using RANSAC to process the curved surface block to generate a candidate plane pc= { PC 1,pc2,…,pcm }, then calculating the distance d (c i,pcm) from the centroid c i of the curved surface block to the plane PC m, setting a threshold delta, obtaining all the curved surface blocks with the distance delta from the plane PC m, named pi= { v i∈V|d(ci,pcm) < delta }, and defining:
Wherein η represents a constraint condition for distinguishing a foreground object from a background, D (pc m) represents a weight of a point cloud plane, and the minimum energy P * of the graph segmentation problem is:
P*=argminE(P),
Wherein E (P) is the energy of the fitting, and the following conditions are satisfied:
wherein V and E are the sets of vertices and edges in the graph, respectively, and the adjacent graph g= { V, epsilon } of the corresponding cluster segmentation.
6. The method for constructing the three-dimensional semantic map of the indoor mobile robot according to claim 1, wherein the method comprises the following steps of: the step 5 comprises the following steps:
obtaining a bounding box of the target object point cloud from the 2D recognition box, where the width b of the bounding box is the maximum depth difference over all pixels in the 2D recognition box; assuming the set of all pixels in the 2D recognition box is P = {P_1, P_2, P_3, …, P_n}, then
b = max(Depth(P_i) − Depth(P_j)),
where P_i, P_j ∈ P and Depth(P_i) is the depth value corresponding to pixel P_i; the 2D pixels are back-projected into the world coordinate system, and the length and width of the 2D recognition box are mapped into the world coordinate system to obtain the length and height of the bounding box;
and then examining the result of the point cloud segmentation, and only when all points of a cluster lie inside the bounding box, replacing the random colour attribute label of that cluster with the 3D semantic label obtained in step 3, to obtain the semantic map.
7. The three-dimensional semantic map construction method of an indoor mobile robot according to any one of claims 1 to 6, characterized by: the RGB-D depth camera adopts a Kinect V2 camera.
8. The three-dimensional semantic map construction method of an indoor mobile robot according to any one of claims 1 to 6, characterized by: the YOLO v5 network adopts the YOLO v5s network.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210594240.3A CN115035260B (en) | 2022-05-27 | 2022-05-27 | Three-dimensional semantic map construction method for indoor mobile robot |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210594240.3A CN115035260B (en) | 2022-05-27 | 2022-05-27 | Three-dimensional semantic map construction method for indoor mobile robot |
Publications (2)
Publication Number | Publication Date |
---|---|
CN115035260A CN115035260A (en) | 2022-09-09 |
CN115035260B true CN115035260B (en) | 2024-11-05 |
Family
ID=83121169
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210594240.3A Active CN115035260B (en) | 2022-05-27 | 2022-05-27 | Three-dimensional semantic map construction method for indoor mobile robot |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115035260B (en) |
Families Citing this family (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115908734B (en) * | 2022-11-25 | 2023-07-07 | 贵州电网有限责任公司信息中心 | Power grid map updating method, device, equipment and storage medium |
CN116363217B (en) * | 2023-06-01 | 2023-08-11 | 中国人民解放军国防科技大学 | Method, device, computer equipment and medium for measuring pose of space non-cooperative target |
CN116433493B (en) * | 2023-06-07 | 2023-09-22 | 湖南大学 | Workpiece point cloud set splicing method based on metric learning |
CN117542008B (en) * | 2023-10-12 | 2024-10-01 | 哈尔滨工业大学(深圳)(哈尔滨工业大学深圳科技创新研究院) | Semantic point cloud fusion automatic driving scene identification method and storage medium |
CN117517864B (en) * | 2023-11-08 | 2024-04-26 | 南京航空航天大学 | Laser radar-based power transmission line near electricity early warning method and device |
CN117788730B (en) * | 2023-12-08 | 2024-10-15 | 中交机电工程局有限公司 | Semantic point cloud map construction method |
CN117611762B (en) * | 2024-01-23 | 2024-04-30 | 常熟理工学院 | Multi-level map construction method, system and electronic equipment |
CN117906595B (en) * | 2024-03-20 | 2024-06-21 | 常熟理工学院 | Scene understanding navigation method and system based on feature point method vision SLAM |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110243370A (en) * | 2019-05-16 | 2019-09-17 | 西安理工大学 | A kind of three-dimensional semantic map constructing method of the indoor environment based on deep learning |
CN112288857A (en) * | 2020-10-30 | 2021-01-29 | 西安工程大学 | Robot semantic map object recognition method based on deep learning |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111695622B (en) * | 2020-06-09 | 2023-08-11 | 全球能源互联网研究院有限公司 | Identification model training method, identification method and identification device for substation operation scene |
CN111897332B (en) * | 2020-07-30 | 2022-10-11 | 国网智能科技股份有限公司 | Semantic intelligent substation robot humanoid inspection operation method and system |
- 2022-05-27 CN CN202210594240.3A patent/CN115035260B/en active Active
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110243370A (en) * | 2019-05-16 | 2019-09-17 | 西安理工大学 | A kind of three-dimensional semantic map constructing method of the indoor environment based on deep learning |
CN112288857A (en) * | 2020-10-30 | 2021-01-29 | 西安工程大学 | Robot semantic map object recognition method based on deep learning |
Also Published As
Publication number | Publication date |
---|---|
CN115035260A (en) | 2022-09-09 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |