CN116168046B - 3D point cloud semantic segmentation method, system, medium and device under complex environment - Google Patents
3D point cloud semantic segmentation method, system, medium and device under complex environment
- Publication number
- CN116168046B CN116168046B CN202310456371.XA CN202310456371A CN116168046B CN 116168046 B CN116168046 B CN 116168046B CN 202310456371 A CN202310456371 A CN 202310456371A CN 116168046 B CN116168046 B CN 116168046B
- Authority
- CN
- China
- Prior art keywords
- point cloud
- augmentation
- semantic segmentation
- data
- image
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Links
- 230000011218 segmentation Effects 0.000 title claims abstract description 78
- 238000000034 method Methods 0.000 title claims abstract description 37
- 230000003416 augmentation Effects 0.000 claims description 50
- 238000013507 mapping Methods 0.000 claims description 12
- 230000003321 amplification Effects 0.000 claims description 11
- 238000009877 rendering Methods 0.000 claims description 11
- 230000008569 process Effects 0.000 claims description 9
- 230000003190 augmentative effect Effects 0.000 claims description 6
- 238000004590 computer program Methods 0.000 claims description 6
- 238000010276 construction Methods 0.000 claims description 6
- 238000011176 pooling Methods 0.000 claims description 6
- 238000010606 normalization Methods 0.000 claims description 5
- 230000006870 function Effects 0.000 claims description 4
- 230000004913 activation Effects 0.000 claims description 3
- 239000011159 matrix material Substances 0.000 claims description 3
- 230000009466 transformation Effects 0.000 claims description 3
- 230000003993 interaction Effects 0.000 abstract description 3
- 239000003086 colorant Substances 0.000 description 4
- 238000000605 extraction Methods 0.000 description 4
- 238000012549 training Methods 0.000 description 4
- 230000008901 benefit Effects 0.000 description 2
- 238000005516 engineering process Methods 0.000 description 2
- 238000007689 inspection Methods 0.000 description 2
- 238000012986 modification Methods 0.000 description 2
- 230000004048 modification Effects 0.000 description 2
- 238000012935 Averaging Methods 0.000 description 1
- 230000009286 beneficial effect Effects 0.000 description 1
- 238000012512 characterization method Methods 0.000 description 1
- 238000010586 diagram Methods 0.000 description 1
- 230000000694 effects Effects 0.000 description 1
- 238000011156 evaluation Methods 0.000 description 1
- 238000002474 experimental method Methods 0.000 description 1
- 230000004927 fusion Effects 0.000 description 1
- 238000011478 gradient descent method Methods 0.000 description 1
- 230000006872 improvement Effects 0.000 description 1
- 238000005065 mining Methods 0.000 description 1
- 230000008447 perception Effects 0.000 description 1
- 238000012545 processing Methods 0.000 description 1
- 238000004088 simulation Methods 0.000 description 1
- 238000013519 translation Methods 0.000 description 1
- 230000000007 visual effect Effects 0.000 description 1
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/10—Segmentation; Edge detection
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/74—Image or video pattern matching; Proximity measures in feature spaces
- G06V10/761—Proximity, similarity or dissimilarity measures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/10—Image acquisition modality
- G06T2207/10028—Range image; Depth image; 3D point clouds
Landscapes
- Engineering & Computer Science (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Artificial Intelligence (AREA)
- Health & Medical Sciences (AREA)
- Computing Systems (AREA)
- Databases & Information Systems (AREA)
- Evolutionary Computation (AREA)
- General Health & Medical Sciences (AREA)
- Medical Informatics (AREA)
- Software Systems (AREA)
- Multimedia (AREA)
- Image Analysis (AREA)
Abstract
The invention belongs to the technical field of point cloud semantic segmentation and provides a 3D point cloud semantic segmentation method, system, medium and device for complex environments. To relieve the problem of label dependence, the method relies on a small amount of labeled point cloud data together with a large amount of unlabeled point cloud data. Based on cross-modal learning, the spatial structural features within the 3D modality and the 2D-3D cross-modal correspondence are mined, and the similarity of 2D and 3D features is maximized in a unified semantic feature space. By means of the 2D appearance semantic information and the spatial invariance of point clouds, interaction between modalities is enhanced and the local geometric information of the point cloud can be captured more efficiently. Moreover, the end-to-end network adapts quickly to complex environments and enhances segmentation accuracy.
Description
Technical Field
The invention belongs to the technical field of point cloud semantic segmentation, and particularly relates to a 3D point cloud semantic segmentation method, system, medium and device in a complex environment.
Background
The statements in this section merely provide background information related to the present disclosure and may not necessarily constitute prior art.
During inspection, a robot needs to apply point cloud segmentation to sense and identify its surrounding environment, so that intelligent inspection can replace manual work. Point cloud segmentation is a technique for separating different parts of point cloud data, thereby realizing environment perception and object recognition, and it is of great significance for intelligent robot operation. Specifically, the robot scans the surrounding environment with a laser radar or a depth camera to acquire environment point cloud data; through point cloud segmentation, obstacles, floors, walls and the like in the environment can be separated, a map model can be constructed, and surrounding objects can be separated and identified.
However, the point cloud segmentation technique faces two main challenges: (1) Complex environments: in complex environments the difficulty of point cloud segmentation increases greatly; for example, the segmentation effect may be poor for objects with complex shapes, objects with multiple colors, and occluded objects. (2) The amount of annotated point cloud data: point cloud data are huge in volume, but manual annotation is time-consuming and labor-intensive, and training a model with only a small amount of annotated point cloud data severely affects segmentation accuracy. Currently, existing point cloud segmentation models require large amounts of data and computational resources, which makes the training process difficult, and existing models have difficulty explaining their segmentation results, which limits model transparency and reliability.
Disclosure of Invention
In order to solve at least one technical problem in the background art, the invention provides a 3D point cloud semantic segmentation method and system for complex environments, based on a weakly supervised cross-modal 3D point cloud semantic segmentation approach that reduces dependence on data labels and enhances the understanding of 3D point cloud data. Based on the contrastive learning paradigm, the spatial invariance of the 3D point cloud modality is learned by learning distinguishing structural features within the 3D point cloud modality and the visual concept mapping relationship between the 2D image and the 3D point cloud.
In order to achieve the above purpose, the present invention adopts the following technical scheme:
the first aspect of the invention provides a 3D point cloud semantic segmentation method under a complex environment, comprising the following steps:
generating a 2D image according to the 3D point cloud data by random rendering;
obtaining a 3D point cloud semantic segmentation result based on the 3D point cloud data, the 2D image and the trained point cloud semantic segmentation model; the construction process of the point cloud semantic segmentation model comprises the following steps:
constructing an augmentation dictionary based on the 3D point cloud data, constructing the augmentation data based on the augmentation dictionary, and extracting 3D image features of the augmentation data through a structure encoder; extracting features of the 2D image to obtain 2D image features;
mapping the 2D image features and the 3D image features to a unified semantic feature space through cross-modal learning, and obtaining feature representation of local points in the 3D point cloud data based on 2D description by capturing the corresponding relation between the two features;
and decoding the learned 3D image features through a decoder to obtain a 3D point cloud semantic segmentation result.
Further, the constructing augmentation data based on the augmentation dictionary specifically includes:
the built augmentation dictionary comprises a plurality of augmentation steps, and each augmentation step is provided with an augmentation factor;
when the 3D point cloud data is augmented, an augmentation probability is randomly generated for each augmentation step; when the augmentation probability is greater than the augmentation factor, the augmentation step is employed, otherwise the augmentation step is not employed.
Further, the 3D image features of the augmented data are extracted by a structural encoder:
step 1: the 3D point cloud data are learned through a convolution block to obtain a first weight, the first weight is multiplied with the 3D point cloud, and a first characteristic tensor is output;
step 2: performing feature extraction on the first feature tensor through the MLP layer, and outputting a second feature tensor; based on the second feature tensor, obtaining a second weight through convolution block learning, and multiplying the second weight by the second feature tensor to obtain a third tensor;
step 3: repeating step 2 while sequentially increasing the feature dimension, and taking the final tensor obtained as the structural coding feature.
Further, the convolution block consists of a multi-layer perceptron shared at each point, a max pooling layer and fully connected layers, and outputs an affine transformation matrix whose size depends on the feature dimension of the input to the convolution block; both the multi-layer perceptron and the max pooling layer include a ReLU activation function and batch normalization operations.
Furthermore, when the point cloud semantic segmentation model is trained, a weakly supervised learning paradigm is adopted: only the part of the data that has labels is upsampled through the decoder, and then the segmentation loss is calculated.
Further, after the augmented data is obtained, the 3D features having distinctiveness in the 3D modality are learned through the contrastive learning paradigm, including: based on the contrastive loss, maximizing the similarity of the augmented data and minimizing the similarity of structural features between different point clouds.
Further, the mapping relation between the 2D image features and the 3D image features is obtained by maximizing feature similarity between the 2D image and the 3D point cloud in a feature space.
A second aspect of the present invention provides a 3D point cloud semantic segmentation system in a complex environment, comprising:
a 2D image rendering module configured to: generating a 2D image according to the 3D point cloud data by random rendering;
a semantic segmentation module configured to: obtaining a 3D point cloud semantic segmentation result based on the 3D point cloud data, the 2D image and the trained point cloud semantic segmentation model; the construction process of the point cloud semantic segmentation model comprises the following steps:
constructing an augmentation dictionary based on the 3D point cloud data, constructing the augmentation data based on the augmentation dictionary, and extracting 3D image features of the augmentation data through a structure encoder; extracting features of the 2D image to obtain 2D image features;
mapping the 2D image features and the 3D image features to a unified semantic feature space through cross-modal learning, and obtaining feature representation of local points in the 3D point cloud data based on 2D description by capturing the corresponding relation between the two features;
and decoding the learned 3D image features through a decoder to obtain a 3D point cloud semantic segmentation result.
A third aspect of the present invention provides a computer-readable storage medium.
A computer readable storage medium having stored thereon a computer program which when executed by a processor performs the steps in a 3D point cloud semantic segmentation method in a complex environment as described above.
A fourth aspect of the invention provides an electronic device.
An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the steps in a 3D point cloud semantic segmentation method in a complex environment as described above when the program is executed.
Compared with the prior art, the invention has the beneficial effects that:
1. According to the invention, based on cross-modal learning, the spatial structural features within the 3D modality and the 2D-3D cross-modal correspondence are mined, and the similarity of the 2D and 3D features is maximized in a unified semantic feature space. By means of the 2D appearance semantic information and the spatial invariance of point clouds, interaction between modalities is enhanced, the local geometric information of the point cloud can be captured more efficiently, and the end-to-end network adapts quickly to complex environments and enhances segmentation accuracy.
2. The method is based on a weak supervision learning paradigm, and the problem of label dependence is relieved by means of a small amount of marked point cloud data and a large amount of unmarked point cloud data.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the invention.
Fig. 1 is a network frame of a 3D point cloud semantic segmentation method in a complex environment provided by an embodiment of the present invention.
Fig. 2 is a block diagram of a structure encoder and decoder provided in an embodiment of the present invention.
Detailed Description
The invention will be further described with reference to the drawings and examples.
It should be noted that the following detailed description is illustrative and is intended to provide further explanation of the invention. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of exemplary embodiments according to the present invention. As used herein, the singular is also intended to include the plural unless the context clearly indicates otherwise, and furthermore, it is to be understood that the terms "comprises" and/or "comprising" when used in this specification are taken to specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof.
Interpretation of the terms
A complex environment refers to a scene of objects having a variety of shapes, sizes, textures, colors, and relationships encountered in the real world. In point cloud segmentation techniques, complex environments can present many challenges to algorithms and models. The following are some specific features of the complex environment:
objects of complex shape: the physical world objects vary in shape from simple geometric shapes (e.g., cubes, spheres) to objects with complex curves and structures (e.g., vegetation, buildings). Complex shaped objects can make point cloud segmentation more difficult.
Objects of multiple colors: objects in the environment may have various colors and textures, which may make it difficult for a point cloud segmentation algorithm to distinguish between adjacent objects or to identify different portions of the same object.
Occlusion and overlap: objects in a complex environment may obscure or overlap each other, making it difficult for the point cloud segmentation algorithm to segment them correctly. In addition, occlusion can also lead to missing and incomplete information in the point cloud data.
Background noise and spurious points: background noise and stray points may be present in the point cloud data due to sensor errors, ambient light, etc. These noise points can interfere with the performance of the point cloud segmentation algorithm.
Dynamic environment: objects in a complex environment may move at different points in time, resulting in dynamic changes in point cloud data. Processing these dynamic changes is a challenge for point cloud segmentation algorithms.
Large scale and high density: point cloud data in complex environments is often characterized by large scale and high density, meaning that point cloud segmentation algorithms need to handle a large number of points and complex relationships between them. This not only increases the computational complexity, but may also lead to memory and storage problems.
Example 1
As shown in fig. 1-2, the present embodiment provides a 3D point cloud semantic segmentation method in a complex environment, including the following steps:
s1: and generating a 2D image according to the 3D point cloud data through random rendering.
Acquiring a 3D point cloud dataset:
D = {(P_i, I_i, Y_i)}_{i=1}^{N_L} ∪ {(P_j, I_j)}_{j=1}^{N_U},
where N_L denotes the number of labeled point clouds and N_U the number of unlabeled point clouds; P denotes point cloud data and I denotes the 2D image rendered from P; each point cloud P contains n data points whose feature dimension is d, the features including position and color information; H and W denote the height and width of the 2D image I; and Y denotes the label of the point cloud data.
For each point cloud P, this embodiment uses a pre-trained DISN (Deep Implicit Surface Network) to render it from a randomly selected angle and obtain the 2D image I.
S2: distinguishing structural features within the 3D modality are learned.
By utilizing the geometric invariance of point clouds in 3D space, two augmented versions of the point cloud data are constructed, and distinguishing 3D features within the 3D modality are learned based on the contrastive learning paradigm.
Specifically, an augmentation dictionary is first built, and a plurality of augmentation steps such as rotation, scaling, translation, normalization, elastic distortion and the like are included in the dictionary, wherein each augmentation step is provided with an augmentation factor with a value between 0 and 1.
For convenience, the augmentation factor of all augmentation steps is set to 0.5 in this embodiment.
When the point cloud data is augmented, an augmentation probability is randomly generated for each augmentation step; when this value is larger than the augmentation factor, the augmentation step is adopted, otherwise it is not adopted.
Thus, for each point cloud, different augmentation data may be obtained by means of a random combination of steps based on the augmentation dictionary.
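As an illustration of this augmentation procedure, the following Python sketch builds the dictionary with the operations and the 0.5 factor named in this embodiment; the concrete transform implementations (rotation axis, scale range, jitter magnitudes) are simplified assumptions rather than the patent's exact parameters.

```python
import numpy as np

AUGMENT_DICT = {
    "rotate":    0.5,   # each entry: augmentation step -> augmentation factor in [0, 1]
    "scale":     0.5,
    "translate": 0.5,
    "normalize": 0.5,
    "elastic":   0.5,
}

def rotate(p):      # random rotation about the z-axis (assumed axis)
    theta = np.random.uniform(0, 2 * np.pi)
    c, s = np.cos(theta), np.sin(theta)
    rot = np.array([[c, -s, 0], [s, c, 0], [0, 0, 1]])
    return np.concatenate([p[:, :3] @ rot.T, p[:, 3:]], axis=1)

def scale(p):       # random isotropic scaling of the coordinates
    return np.concatenate([p[:, :3] * np.random.uniform(0.8, 1.2), p[:, 3:]], axis=1)

def translate(p):   # random global shift
    return np.concatenate([p[:, :3] + np.random.uniform(-0.2, 0.2, 3), p[:, 3:]], axis=1)

def normalize(p):   # center the coordinates and scale them to a unit cube
    xyz = p[:, :3] - p[:, :3].mean(0)
    return np.concatenate([xyz / np.abs(xyz).max(), p[:, 3:]], axis=1)

def elastic(p):     # crude stand-in for elastic distortion: small per-point jitter
    return np.concatenate([p[:, :3] + np.random.normal(0, 0.01, p[:, :3].shape), p[:, 3:]], axis=1)

STEPS = {"rotate": rotate, "scale": scale, "translate": translate,
         "normalize": normalize, "elastic": elastic}

def augment(points):
    """points: (n, d) array, columns 0..2 are xyz, the remaining columns are colors etc."""
    out = points.copy()
    for name, factor in AUGMENT_DICT.items():
        # randomly generate an augmentation probability for this step;
        # the step is applied only when the probability exceeds the factor
        if np.random.rand() > factor:
            out = STEPS[name](out)
    return out

# two augmented versions of the same point cloud, as used by the contrastive stage:
# p1, p2 = augment(points), augment(points)
```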
In the invention, considering the problem of computational complexity, only two augmented versions of each point cloud are used.
Since point cloud data has invariance in geometric space, the data after augmentation is also similar in feature space.
The structural coding features of the two augmented data are extracted by one structure encoder. The augmented data generated from the same point cloud should be similar in feature space, while the augmented data generated from different point clouds should be as far apart as possible in feature space.
Maximizing the similarity of the two augmented versions of the same point cloud while minimizing the similarity of structural features between different point clouds can be defined based on the contrastive loss InfoNCE as:
L_struct = -log( exp(sim(z_i^1, z_i^2)/τ) / Σ_{k=1}^{K} exp(sim(z_i^1, z_k^2)/τ) ),
where sim(·,·) denotes the cosine similarity, τ is a temperature hyperparameter, z_i^1 and z_i^2 denote the structural coding features of the two augmented versions of the i-th point cloud, and K is the number of point clouds participating in training in the same batch.
This objective of maximizing the similarity of augmented versions of the same point cloud and minimizing the similarity of structural features among different point clouds is used to calculate the model gradient, and the model is optimized by the stochastic gradient descent method.
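For clarity, the contrastive objective above can be written in PyTorch roughly as follows; this is a minimal sketch of a standard InfoNCE loss using the cosine similarity, temperature τ (here `tau`) and batch size K described above, not the patent's exact implementation.

```python
import torch
import torch.nn.functional as F

def info_nce(z1: torch.Tensor, z2: torch.Tensor, tau: float = 0.07) -> torch.Tensor:
    """z1, z2: (K, C) structural features of the two augmentations of the same K point clouds."""
    z1 = F.normalize(z1, dim=1)
    z2 = F.normalize(z2, dim=1)
    # cosine similarity between every pair of augmented clouds in the batch
    logits = z1 @ z2.t() / tau                      # (K, K)
    targets = torch.arange(z1.size(0), device=z1.device)
    # diagonal entries (same point cloud) are positives,
    # all other entries (different point clouds) are negatives
    return F.cross_entropy(logits, targets)
```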
The specific coding process of the structure encoder is as follows:
S2.1: The input 3D point cloud passes through a convolution block that learns a weight matrix; the weight is multiplied with the input 3D point cloud so that the point cloud is aligned, ensuring the model's invariance to specific spatial transformations, and a feature tensor is output.
S2.2: The feature tensor obtained in S2.1 is passed through the MLP layer for feature extraction, and a new feature tensor is output. From this feature tensor, a weight is learned through the convolution block; the weight is multiplied with the input feature tensor, and a two-dimensional tensor is output.
S2.3: Step S2.2 is repeated 3 times, sequentially increasing the feature dimension to [128, 512, 1024]; the final two-dimensional tensor is taken as the structural feature.
In S2.2, the convolution block is composed of a multi-layer perceptron (MLP) shared at each point, a max pooling layer and two fully connected (FC) layers, and outputs an affine transformation matrix whose size depends on the feature dimension of the input to the convolution block. Notably, all layers of the convolution block except the last include a ReLU activation function and a batch normalization (BN) operation.
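A possible realization of S2.1–S2.3 is sketched below, assuming a PointNet-style design in which the convolution block is a small network (shared MLP, max pooling, FC layers) predicting a square affine matrix; the input feature dimension of 6 (xyz plus color) and the initial width of 64 are assumptions, while the widths 128, 512 and 1024 follow the embodiment.

```python
import torch
import torch.nn as nn

class ConvBlock(nn.Module):
    """Point-shared MLP + max pooling + FC layers, outputting a k×k affine matrix."""
    def __init__(self, k: int):
        super().__init__()
        self.k = k
        self.mlp = nn.Sequential(
            nn.Conv1d(k, 64, 1), nn.BatchNorm1d(64), nn.ReLU(),
            nn.Conv1d(64, 1024, 1), nn.BatchNorm1d(1024), nn.ReLU(),
        )
        self.fc = nn.Sequential(
            nn.Linear(1024, 256), nn.BatchNorm1d(256), nn.ReLU(),
            nn.Linear(256, k * k),                      # last layer: no ReLU/BN
        )

    def forward(self, x):                               # x: (B, k, n)
        w = self.mlp(x).max(dim=2).values               # (B, 1024) after max pooling
        w = self.fc(w).view(-1, self.k, self.k)         # (B, k, k) affine matrix
        return torch.bmm(x.transpose(1, 2), w).transpose(1, 2)  # re-weighted features

class StructureEncoder(nn.Module):
    def __init__(self, in_dim: int = 6, dims=(64, 128, 512, 1024)):
        super().__init__()
        self.align = ConvBlock(in_dim)                  # S2.1: align the raw point cloud
        stages, prev = [], in_dim
        for d in dims:                                  # S2.2, repeated per S2.3
            stages.append(nn.ModuleDict({
                "mlp": nn.Sequential(nn.Conv1d(prev, d, 1), nn.BatchNorm1d(d), nn.ReLU()),
                "block": ConvBlock(d),
            }))
            prev = d
        self.stages = nn.ModuleList(stages)

    def forward(self, pts):                             # pts: (B, n, in_dim)
        x = self.align(pts.transpose(1, 2))             # (B, in_dim, n)
        for stage in self.stages:
            x = stage["block"](stage["mlp"](x))         # shared MLP, then learned re-weighting
        return x                                        # (B, 1024, n) structural features
```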
S3: and learning a cross-modal mapping relationship between the 3D point cloud data and the 2D image.
In S3, the 2D image is rendered from the point cloud data at a randomly selected angle and carries the planar appearance information of the point cloud, so the two naturally possess a mapping relationship.
Through inter-modality learning, 2D image features and 3D image features can be mapped to a unified semantic feature space, and by capturing correspondence between the two, a more generalized feature representation of local points in 3D point cloud data can be obtained based on the 2D description.
Because of the large attribute difference between 2D images and 3D point clouds, the invention uses the common ResNet network as the image encoder to extract features from the 2D image and obtain the 2D features. In order to learn the cross-modal mapping relationship of the point cloud data, the structural features of the point cloud are encoded at the same time to obtain the 3D feature representation of the point cloud data.
The cross-modal objective is to maximize the feature similarity between the 2D image and the 3D point cloud within the feature space. By maximizing this feature similarity, the cross-modal mapping relationship of the point cloud data can be learned: according to the spatial invariance assumption, the same part of an object, such as a sofa armrest, has features similar to those of the 3D object (3D point cloud) no matter from which angle the object is observed (i.e., in the 2D image), so the feature similarity between the two should be maximized.
The scheme has the advantage that based on the cross-modal learning, the corresponding relation between the spatial structure characteristics in the 3D mode and the 2D-3D cross-mode is mined. By maximizing the similarity of 2D and 3D features in a unified semantic feature space, interaction between modes is enhanced by means of 2D appearance semantic information and the space invariant characteristic of point clouds, local geometric information of the point clouds can be captured more efficiently, complex environments can be quickly adapted through an end-to-end network, and segmentation accuracy is enhanced.
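The cross-modal step of S3 could be sketched as follows, with a ResNet-18 image encoder, projection heads into a shared 256-dimensional space, and a contrastive loss pulling each rendered image toward its own point cloud; the specific ResNet depth, pooling and projection choices are assumptions rather than the patent's exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision

class CrossModalHead(nn.Module):
    def __init__(self, pc_dim: int = 1024, embed_dim: int = 256):
        super().__init__()
        resnet = torchvision.models.resnet18(weights=None)
        self.image_encoder = nn.Sequential(*list(resnet.children())[:-1])  # (B, 512, 1, 1)
        self.proj_2d = nn.Linear(512, embed_dim)
        self.proj_3d = nn.Linear(pc_dim, embed_dim)

    def forward(self, images, pc_features):
        # images: (B, 3, H, W); pc_features: (B, pc_dim, n) from the structure encoder
        f2d = self.proj_2d(self.image_encoder(images).flatten(1))   # (B, embed_dim)
        f3d = self.proj_3d(pc_features.max(dim=2).values)           # pooled global 3D feature
        return F.normalize(f2d, dim=1), F.normalize(f3d, dim=1)

def cross_modal_loss(f2d, f3d, tau: float = 0.07):
    # maximize the similarity of each image with its own point cloud (diagonal entries)
    logits = f2d @ f3d.t() / tau
    targets = torch.arange(f2d.size(0), device=f2d.device)
    return F.cross_entropy(logits, targets)
```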
S4: and outputting a prediction semantic segmentation result.
The learned 3D structural features are mapped back to the original point cloud size through a decoder.
Since only part of the point cloud data in the dataset has labels, only the labeled data is upsampled through the decoder, and then the segmentation loss is calculated.
The specific upsampling process of the decoder is as follows:
S4.1: The 3D structural features and the 2D appearance features are fused to form the input of the decoder, where the fusion weight is a learnable parameter. A convolution block learns a weight that is multiplied with the input features, and a two-dimensional feature tensor is output.
S4.2: The tensor obtained in S4.1 is passed through a shared MLP layer for feature extraction, and a feature tensor is output. From this feature tensor, a weight is learned through the convolution block; the weight is multiplied with the input feature tensor, and a two-dimensional tensor is output.
S4.3: Step S4.2 is repeated 3 times, sequentially adjusting the feature dimension; the final tensor is passed through a fully connected layer and a softmax function to obtain the final semantic segmentation prediction result.
In this embodiment, the segmentation loss in S4 uses a cross entropy loss.
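The decoding and weakly supervised loss of S4 can be illustrated as follows; the fusion here is a simple concatenation, the intermediate widths are assumptions, and the 20-class output matches the ScanNetv2 benchmark used in the experiment below.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SegDecoder(nn.Module):
    def __init__(self, in_dim: int = 1024 + 256, num_classes: int = 20):
        super().__init__()
        self.stages = nn.Sequential(
            nn.Conv1d(in_dim, 512, 1), nn.BatchNorm1d(512), nn.ReLU(),
            nn.Conv1d(512, 256, 1), nn.BatchNorm1d(256), nn.ReLU(),
            nn.Conv1d(256, 128, 1), nn.BatchNorm1d(128), nn.ReLU(),
        )
        self.head = nn.Conv1d(128, num_classes, 1)      # final per-point classification layer

    def forward(self, pc_features, f2d):
        # pc_features: (B, 1024, n) structural features; f2d: (B, 256) global 2D appearance feature
        f2d = f2d.unsqueeze(2).expand(-1, -1, pc_features.size(2))
        fused = torch.cat([pc_features, f2d], dim=1)    # simple concatenation fusion (assumption)
        return self.head(self.stages(fused))            # (B, num_classes, n) logits

def weakly_supervised_seg_loss(logits, labels, has_label_mask):
    # labels: (B, n) class ids; has_label_mask: (B,) bool, True for labeled point clouds
    if not has_label_mask.any():
        return logits.new_zeros(())
    return F.cross_entropy(logits[has_label_mask], labels[has_label_mask])
```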
Throughout the segmentation framework, the unlabeled data is applied only in S2 and S3, where it is used to mine useful structural and appearance information from the point cloud data and to optimize the structure encoder of S2. The labeled data, besides also optimizing the structure encoder, is mainly used to learn the model weights of the decoder.
The above scheme has the advantage that the problem of tag dependence is alleviated by means of a small amount of marked point cloud data and a large amount of unmarked point cloud data based on a weakly supervised learning paradigm.
The scheme of the invention can be applied to instance segmentation in unmanned-driving urban scenes, but is not limited to such scenes; it can also be applied to other complex environments.
Table 1 is a simulation experiment based on the open source 3D point cloud dataset ScanNetv2, where only 20% of the labels were used in the training set and the remaining 80% of the data were treated as unlabeled data. The experiment adopts three evaluation indices: the overall classification accuracy PA (Point Accuracy), i.e., the ratio of correctly classified points to the total number of points in the point cloud; the mean classification accuracy MPA (Mean Point Accuracy), i.e., the ratio of correctly classified points of each class to all points of that class, averaged over the classes; and the mean per-class IoU value MIoU (Mean Intersection over Union).
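The three evaluation indices are standard and can be computed from a confusion matrix as in the following sketch (generic metric code, not taken from the patent):

```python
import numpy as np

def confusion_matrix(pred, gt, num_classes):
    """pred, gt: flattened 1-D integer label arrays of equal length."""
    idx = gt * num_classes + pred
    return np.bincount(idx, minlength=num_classes ** 2).reshape(num_classes, num_classes)

def point_accuracy(cm):                      # PA: correctly classified points / all points
    return np.diag(cm).sum() / cm.sum()

def mean_point_accuracy(cm):                 # MPA: per-class accuracy, then averaged
    per_class = np.diag(cm) / np.maximum(cm.sum(axis=1), 1)
    return per_class.mean()

def mean_iou(cm):                            # MIoU: per-class intersection over union, averaged
    inter = np.diag(cm)
    union = cm.sum(axis=1) + cm.sum(axis=0) - inter
    return (inter / np.maximum(union, 1)).mean()
```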
Table 1 comparison of the accuracy of the invention with other algorithms
Example two
The embodiment provides a 3D point cloud semantic segmentation system under a complex environment, which comprises the following steps:
a 2D image rendering module configured to: generating a 2D image according to the 3D point cloud data by random rendering;
a semantic segmentation module configured to: obtaining a 3D point cloud semantic segmentation result based on the 3D point cloud data, the 2D image and the trained point cloud semantic segmentation model; the construction process of the point cloud semantic segmentation model comprises the following steps:
constructing an augmentation dictionary based on the 3D point cloud data, constructing the augmentation data based on the augmentation dictionary, and extracting 3D image features of the augmentation data through a structure encoder; extracting features of the 2D image to obtain 2D image features;
mapping the 2D image features and the 3D image features to a unified semantic feature space through cross-modal learning, and obtaining feature representation of local points in the 3D point cloud data based on 2D description by capturing the corresponding relation between the two features;
and decoding the learned 3D image features through a decoder to obtain a 3D point cloud semantic segmentation result.
Example III
The present embodiment provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps in a 3D point cloud semantic segmentation method under a complex environment as described above.
Example IV
The embodiment provides an electronic device, which comprises a memory, a processor and a computer program stored on the memory and capable of running on the processor, wherein the processor realizes the steps in the 3D point cloud semantic segmentation method under the complex environment when executing the program.
The above description is only of the preferred embodiments of the present invention and is not intended to limit the present invention, but various modifications and variations can be made to the present invention by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention should be included in the protection scope of the present invention.
Claims (8)
1. The 3D point cloud semantic segmentation method under the complex environment is characterized by comprising the following steps of:
generating a 2D image according to the 3D point cloud data by random rendering;
obtaining a 3D point cloud semantic segmentation result based on the 3D point cloud data, the 2D image and the trained point cloud semantic segmentation model; the construction process of the point cloud semantic segmentation model comprises the following steps:
constructing an augmentation dictionary based on the 3D point cloud data, constructing the augmentation data based on the augmentation dictionary, and extracting 3D image features of the augmentation data through a structure encoder; extracting features of the 2D image to obtain 2D image features;
mapping the 2D image features and the 3D image features to a unified semantic feature space through cross-modal learning, and obtaining feature representation of local points in the 3D point cloud data based on 2D description by capturing the corresponding relation between the two features;
decoding the learned 3D image features through a decoder to obtain a 3D point cloud semantic segmentation result;
the construction of the augmentation data based on the augmentation dictionary specifically comprises the following steps:
the built augmentation dictionary comprises a plurality of augmentation steps, and each augmentation step is provided with an augmentation factor;
when the 3D point cloud data is augmented, an augmentation probability is randomly generated for each augmentation step; when the augmentation probability is greater than the augmentation factor, the augmentation step is employed, otherwise the augmentation step is not employed;
the 3D image features of the augmented data are extracted by a structural encoder:
step 1: the 3D point cloud data are learned through a convolution block to obtain a first weight, the first weight is multiplied with the 3D point cloud, and a first characteristic tensor is output;
step 2: performing feature extraction on the first feature tensor through the multi-layer perceptron, and outputting a second feature tensor; based on the second feature tensor, obtaining a second weight through convolution block learning, and multiplying the second weight by the second feature tensor to obtain a third tensor;
step 3: repeating step 2 while sequentially increasing the feature dimension, and taking the final tensor obtained as the structural coding feature.
2. The 3D point cloud semantic segmentation method according to claim 1, wherein the convolution block consists of a multi-layer perceptron shared at each point, a max pooling layer and a fully connected layer, and outputs an affine transformation matrix whose size depends on the feature dimension of the input to the convolution block; both the multi-layer perceptron and the max pooling layer include a ReLU activation function and batch normalization operations.
3. The 3D point cloud semantic segmentation method under the complex environment according to claim 1, wherein when the point cloud semantic segmentation model is trained, only part of data with labels is up-sampled through a decoder by adopting a weak supervision learning paradigm, and then segmentation loss is calculated.
4. The method for 3D point cloud semantic segmentation in a complex environment according to claim 1, wherein learning the 3D features having distinctiveness in the 3D modality through the contrastive learning paradigm after obtaining the augmented data comprises: based on the contrastive loss, maximizing the similarity of the augmented data and minimizing the similarity of structural features between different point clouds.
5. The 3D point cloud semantic segmentation method under the complex environment according to claim 1, wherein the mapping relationship between the 2D image features and the 3D image features is obtained by maximizing feature similarity between the 2D image and the 3D point cloud in a feature space.
6. The 3D point cloud semantic segmentation system in the complex environment is realized by applying the 3D point cloud semantic segmentation method in the complex environment as claimed in claim 1, and is characterized by comprising the following steps:
a 2D image rendering module configured to: generating a 2D image according to the 3D point cloud data by random rendering;
a semantic segmentation module configured to: obtaining a 3D point cloud semantic segmentation result based on the 3D point cloud data, the 2D image and the trained point cloud semantic segmentation model; the construction process of the point cloud semantic segmentation model comprises the following steps:
constructing an augmentation dictionary based on the 3D point cloud data, constructing the augmentation data based on the augmentation dictionary, and extracting 3D image features of the augmentation data through a structure encoder; extracting features of the 2D image to obtain 2D image features;
mapping the 2D image features and the 3D image features to a unified semantic feature space through cross-modal learning, and obtaining feature representation of local points in the 3D point cloud data based on 2D description by capturing the corresponding relation between the two features;
and decoding the learned 3D image features through a decoder to obtain a 3D point cloud semantic segmentation result.
7. A computer readable storage medium, on which a computer program is stored, characterized in that the program, when being executed by a processor, implements the steps of the 3D point cloud semantic segmentation method in a complex environment according to any of claims 1-5.
8. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the steps of the 3D point cloud semantic segmentation method in a complex environment according to any of claims 1-5 when the program is executed.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310456371.XA CN116168046B (en) | 2023-04-26 | 2023-04-26 | 3D point cloud semantic segmentation method, system, medium and device under complex environment |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310456371.XA CN116168046B (en) | 2023-04-26 | 2023-04-26 | 3D point cloud semantic segmentation method, system, medium and device under complex environment |
Publications (2)
Publication Number | Publication Date |
---|---|
CN116168046A CN116168046A (en) | 2023-05-26 |
CN116168046B true CN116168046B (en) | 2023-08-25 |
Family
ID=86420429
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310456371.XA Active CN116168046B (en) | 2023-04-26 | 2023-04-26 | 3D point cloud semantic segmentation method, system, medium and device under complex environment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116168046B (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116612285B (en) * | 2023-06-15 | 2024-09-20 | 重庆市测绘科学技术研究院 | Building point cloud data segmentation and point cloud data semantic segmentation method and system |
CN116740820B (en) * | 2023-08-16 | 2023-10-31 | 南京理工大学 | Single-view point cloud three-dimensional human body posture and shape estimation method based on automatic augmentation |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112233124A (en) * | 2020-10-14 | 2021-01-15 | 华东交通大学 | Point cloud semantic segmentation method and system based on countermeasure learning and multi-modal learning |
CN113239749A (en) * | 2021-04-27 | 2021-08-10 | 四川大学 | Cross-domain point cloud semantic segmentation method based on multi-modal joint learning |
CN114067112A (en) * | 2021-11-06 | 2022-02-18 | 西北工业大学 | Point cloud segmentation method based on quick graph convolution |
CN114241226A (en) * | 2021-12-07 | 2022-03-25 | 电子科技大学 | Three-dimensional point cloud semantic segmentation method based on multi-neighborhood characteristics of hybrid model |
CN115601275A (en) * | 2022-09-07 | 2023-01-13 | 北京紫光展锐通信技术有限公司(Cn) | Point cloud augmentation method and device, computer readable storage medium and terminal equipment |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11687087B2 (en) * | 2020-03-12 | 2023-06-27 | Honda Motor Co., Ltd. | Systems and methods for shared cross-modal trajectory prediction |
-
2023
- 2023-04-26 CN CN202310456371.XA patent/CN116168046B/en active Active
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112233124A (en) * | 2020-10-14 | 2021-01-15 | 华东交通大学 | Point cloud semantic segmentation method and system based on countermeasure learning and multi-modal learning |
CN113239749A (en) * | 2021-04-27 | 2021-08-10 | 四川大学 | Cross-domain point cloud semantic segmentation method based on multi-modal joint learning |
CN114067112A (en) * | 2021-11-06 | 2022-02-18 | 西北工业大学 | Point cloud segmentation method based on quick graph convolution |
CN114241226A (en) * | 2021-12-07 | 2022-03-25 | 电子科技大学 | Three-dimensional point cloud semantic segmentation method based on multi-neighborhood characteristics of hybrid model |
CN115601275A (en) * | 2022-09-07 | 2023-01-13 | 北京紫光展锐通信技术有限公司(Cn) | Point cloud augmentation method and device, computer readable storage medium and terminal equipment |
Non-Patent Citations (1)
Title |
---|
Detection and segmentation of unlearned objects in unknow environment; Zhang Jianhua; IEEE; full text *
Also Published As
Publication number | Publication date |
---|---|
CN116168046A (en) | 2023-05-26 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Zhang et al. | A review of deep learning-based semantic segmentation for point cloud | |
Han et al. | Image-based 3D object reconstruction: State-of-the-art and trends in the deep learning era | |
Garcia-Garcia et al. | A review on deep learning techniques applied to semantic segmentation | |
CN115240121B (en) | Joint modeling method and device for enhancing local features of pedestrians | |
CN116168046B (en) | 3D point cloud semantic segmentation method, system, medium and device under complex environment | |
Zhang et al. | Deep hierarchical guidance and regularization learning for end-to-end depth estimation | |
Yuniarti et al. | A review of deep learning techniques for 3D reconstruction of 2D images | |
JP2023073231A (en) | Method and device for image processing | |
CN110852182A (en) | Depth video human body behavior recognition method based on three-dimensional space time sequence modeling | |
Cho et al. | Semantic segmentation with low light images by modified CycleGAN-based image enhancement | |
CN113345106A (en) | Three-dimensional point cloud analysis method and system based on multi-scale multi-level converter | |
CN117197727B (en) | Global space-time feature learning-based behavior detection method and system | |
Cao et al. | Skeleton-based action recognition with temporal action graph and temporal adaptive graph convolution structure | |
CN115147601A (en) | Urban street point cloud semantic segmentation method based on self-attention global feature enhancement | |
Li et al. | Deep learning based monocular depth prediction: Datasets, methods and applications | |
CN116912296A (en) | Point cloud registration method based on position-enhanced attention mechanism | |
CN112488117B (en) | Point cloud analysis method based on direction-induced convolution | |
CN117475228A (en) | Three-dimensional point cloud classification and segmentation method based on double-domain feature learning | |
Cao et al. | Label-efficient deep learning-based semantic segmentation of building point clouds at LOD3 level | |
Gao et al. | Semantic Segmentation of Substation Site Cloud Based on Seg-PointNet | |
CN117689887A (en) | Workpiece grabbing method, device, equipment and storage medium based on point cloud segmentation | |
Malah et al. | Generating 3D Reconstructions Using Generative Models | |
Wang et al. | PatchCNN: An explicit convolution operator for point clouds perception | |
CN114638953A (en) | Point cloud data segmentation method and device and computer readable storage medium | |
CN114998990B (en) | Method and device for identifying safety behaviors of personnel on construction site |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |