WO2024113078A1 - Local context feature extraction module for semantic segmentation in 3d point cloud scenario - Google Patents
- Publication number: WO2024113078A1 (application PCT/CN2022/134619)
- Authority
- WO
- WIPO (PCT)
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/10—Segmentation; Edge detection
- G06T7/11—Region-based segmentation
Definitions
- the present invention belongs to the field of computer vision technology, and in particular relates to a local context feature extraction module for semantic segmentation of 3D point cloud scenes.
- PointNet addresses how to apply a neural network directly to raw 3D point clouds, and can extract stable point-set features even when the point cloud is perturbed, noisy, or incomplete.
- Although PointNet++ designed an SA (set abstraction) layer to extract features of surrounding points, the extraction quality is limited and the computational cost is relatively high.
- FPS (farthest point sampling)
- the purpose of the embodiments of this specification is to provide a local context feature extraction module for 3D point cloud scene semantic segmentation.
- the present application provides a local context feature extraction module for 3D point cloud scene semantic segmentation, the local context feature extraction module comprising:
- a rotationally invariant local representation receives local spatial information, and the local spatial information includes coordinate information of a plurality of points;
- the local context feature is determined based on the local rotation invariant representation of each point and the position encoding of the relative point.
- the local spatial information is recorded as (K, 3), and the local rotation-invariant representation of each point includes rotation-invariant representations about the X-axis, Y-axis, and Z-axis;
- the rotation invariance about the Z axis is expressed as:
- where k = (1, 2, ..., K), and (x_im, y_im, z_im) are the coordinates of the centroid of the point cloud containing point i.
- calculating the relative position codes of the centroid point and the neighboring points includes:
- the coordinate difference and Euclidean distance between the centroid and the adjacent points are calculated
- the relative position coding of the centroid point and the neighboring points is determined.
- given the coordinates P_i of the centroid point and the coordinates P_i^k of the kth neighboring point, the relative position encoding r_i^k of the centroid point and the kth neighboring point is:
- determining the local context feature based on the local rotation invariant representation of each point and the position encoding of the relative point includes:
- weights are introduced into the rotation-invariant representations of the X-axis and the Z-axis, and the local context feature is determined from the Y-axis rotation-invariant representation of each point, the relative position encoding, and the weighted X-axis and Z-axis rotation-invariant representations;
- the local context feature is:
- λ is the weight of the X-axis rotation-invariant representation
- μ is the weight of the Z-axis rotation-invariant representation
- the local context feature extraction module further includes: attention pooling, which weights the local context features by learning attention weights through geometric distance and feature distance to obtain enhanced local context features.
- the local context features are weighted by learning attention weights through geometric distance and feature distance to obtain enhanced local context features, including:
- the weights of the attention pool are concatenated with the local context features, and the attention weights are obtained through a shared MLP and a normalized exponential (softmax) function;
- Enhanced local context features are obtained based on the attention weights and neighboring point features.
- the present application provides a 3D point cloud scene semantic segmentation network, which includes the local context feature extraction module and encoder architecture for 3D point cloud scene semantic segmentation of the first aspect;
- the local context feature extraction module for 3D point cloud scene semantic segmentation is embedded in the encoder architecture.
- the solution learns local features with X-Y-Z three-axis rotation invariance, while compensating for the fact that random sampling may cause the loss of many useful point features.
- FIG1 is a schematic diagram of the structure of a local context feature extraction module for 3D point cloud scene semantic segmentation provided by the present application
- FIG2 is a schematic diagram of the structure of the 3D point cloud scene semantic segmentation network provided in this application.
- PointNet has become one of the most promising methods for directly processing 3D point clouds. It uses a shared multi-layer perceptron (MLP) to learn point-by-point features and has achieved good results. Later, the optimized derivative model PointNet++ emerged, which further improved the performance of point cloud segmentation.
- MLP multi-layer perceptron
- the farthest point sampling method used is suitable only for small-scale point clouds; large-scale point clouds have large data volumes, and farthest point sampling (FPS) consumes more memory and reduces computational and network efficiency; 2)
- the point cloud itself is rotationally invariant.
- the segmentation results of point clouds input at different angles should be consistent. For example, chairs at different positions in a conference room necessarily face different directions; no matter from which angle the point cloud is input to the network, the result should be classified as a chair. This shows that the features learned from 3D point clouds are direction-sensitive, and this direction sensitivity degrades point cloud segmentation
- this application adopts random sampling to adapt to large-scale scene-level point cloud data, and proposes a local context feature aggregation module with rotation invariance to learn local features with X-Y-Z three-axis rotation invariance, while compensating for the fact that random sampling may cause the loss of many useful point features.
- FIG. 1 there is shown a schematic diagram of the structure of a local context feature extraction module for 3D point cloud scene semantic segmentation provided in an embodiment of the present application.
- the local context feature extraction module for 3D point cloud scene semantic segmentation may include:
- a rotationally invariant local representation receives local spatial information, and the local spatial information includes coordinate information of a plurality of points;
- the local context feature is determined based on the local rotation invariant representation of each point and the position encoding of the relative point.
- the local context feature extraction module (Local Context Characteristics, LCC): as a geometric object, the learned representation of a point set should remain invariant under rotation transformations. Rotating all points together should change neither the category of the global point cloud nor the segmentation of its partial structures. In many real scenes, for example common chairs, objects belonging to the same category usually have different orientations. Moreover, the same object exhibits rotation invariance not only about the Z axis but also, to a degree, about the X and Y axes. To address this, we propose to learn a new local representation with X-Y-Z axis rotation invariance, which uses polar coordinates to represent the local geometric structure of each point. The overall structure of LCC is shown in Figure 1.
- the local spatial information (K, 3) is input into the LCC block, and the output is a local representation with X-, Y-, and Z-axis rotation-invariant features, respectively.
- the rotation invariance about the Z axis is expressed as:
- where k = (1, 2, ..., K), and (x_im, y_im, z_im) are the coordinates of the centroid of the point cloud containing point i.
- calculating the relative position encoding of the centroid point and the neighboring points includes:
- the coordinate difference and Euclidean distance between the centroid and the adjacent points are calculated
- the relative position coding of the centroid point and the neighboring points is determined.
- given the coordinates P_i of the centroid point and the coordinates P_i^k of the kth neighboring point, the relative position encoding r_i^k of the centroid point and the kth neighboring point is:
- the local context feature is determined based on the local rotation invariant representation of each point and the position encoding of the relative point, including:
- Weights are introduced into the rotation invariance representation of the X-axis and the Z-axis, and the local context features are determined based on the rotation invariance representation of the Y-axis of each point, the position encoding of the relative point, and the rotation invariance representation of the X-axis and the Z-axis after the weights are introduced.
- λ is the weight of the X-axis rotation-invariant representation
- μ is the weight of the Z-axis rotation-invariant representation
- an ablation experiment on the module (introducing the rotation-invariant representations of the X axis, the Y axis, and the Z axis separately) found that the X- and Z-axis representations have a more pronounced effect on the overall segmentation, so the weights (λ, μ) are introduced here to increase the proportion of the X- and Z-axis representations, giving the local context features output by the LCC module.
- the local context feature extraction module further includes: attention pooling, which weights the local context features by learning attention weights through geometric distance and feature distance to obtain enhanced local context features, specifically including:
- the weights of the attention pool are concatenated with the local context features, and the attention weights are obtained through a shared MLP and a normalized exponential (softmax) function;
- Enhanced local context features are obtained based on the attention weights and neighboring point features.
- the weight of the attention pool is learned by computing the negative exponential of the geometric distance and the feature distance, and its instability is adjusted by adding a smoothing parameter.
- the learned weight of the attention pool is:
- the attention weights are obtained through a shared MLP and a normalized exponential (softmax) function:
- the LCC and DP modules together form the LD module to extract and enhance local context features.
- the local context feature extraction module for 3D point cloud scene semantic segmentation learns local features that are invariant to X-Y-Z three-axis rotation, while compensating for the possibility that random sampling may cause the loss of many useful point features.
- FIG. 2 there is shown a schematic diagram of the structure of a 3D point cloud scene semantic segmentation network applicable to an embodiment of the present application.
- the 3D point cloud scene semantic segmentation network includes a local context feature extraction module and an encoder architecture for 3D point cloud scene semantic segmentation;
- the local context feature extraction module for 3D point cloud scene semantic segmentation is embedded in the encoder architecture.
- LD-Net, i.e., the 3D point cloud scene semantic segmentation network
- the input of the network is a point cloud of size n ⁇ d, where n is the number of points and d is the input feature dimension.
- the point cloud is first fed to the shared MLP layer to extract the features of each point, and the feature dimension is uniformly set to 8.
- the overall network structure is shown in Figure 2.
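The "shared MLP" that lifts each input point to the uniform feature dimension of 8 is the same small per-point map applied to all n points. A minimal sketch follows; the single-layer form and the ReLU nonlinearity are illustrative assumptions, not the network's actual layer configuration:

```python
import numpy as np

def shared_mlp(points, W, b):
    """Apply one shared fully connected layer to every point independently:
    (n, d) -> (n, 8). 'Shared' means the same W, b are used for all points."""
    return np.maximum(points @ W + b, 0.0)  # per-point linear map + ReLU

# Example: n = 1000 points with d = 6 input features (e.g. xyz + rgb), lifted to dim 8.
rng = np.random.default_rng(0)
cloud = rng.normal(size=(1000, 6))
W, b = rng.normal(size=(6, 8)) * 0.1, np.zeros(8)
features = shared_mlp(cloud, W, b)  # (1000, 8)
```

Because the map is shared, the parameter count is independent of n, which is what makes the per-point feature extraction scale to large clouds.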
Abstract
Provided in the present application is a local context feature extraction module for semantic segmentation in a 3D point cloud scenario. The local context feature extraction module comprises: a rotation-invariant local representation, which is used for: receiving local spatial information, which comprises coordinate information of several points; calculating a local rotation-invariant representation of each point according to the local spatial information; finding the centroid point of a local neighborhood, and calculating the relative position encoding between the centroid point and its neighboring points; and determining local context features according to the local rotation-invariant representation of each point and the relative position encoding. The solution learns local features having X-Y-Z three-axis rotation invariance, and also compensates for the loss of many useful point features that random sampling may cause.
Description
The present invention belongs to the field of computer vision technology, and in particular relates to a local context feature extraction module for semantic segmentation of 3D point cloud scenes.
With the rise of convolutional networks on 2D images, many researchers began applying neural networks to 3D data. However, most of this work voxelizes the 3D point cloud or converts it into 2D images from multiple viewpoints before applying conventional convolutional neural networks. PointNet addresses how to apply a neural network directly to the raw 3D point cloud, and can extract stable point-set features even when the point cloud is perturbed, noisy, or incomplete. The biggest shortcoming of PointNet and PointNet++, however, is that they extract only global features and lose much information. Although PointNet++ designed an SA (set abstraction) layer to extract features of surrounding points, the extraction quality is limited and the computational cost is high. Moreover, because FPS (farthest point sampling) is used, the features of the point cloud space can be preserved as much as possible, but computing point clouds of large scenes is slow and memory-intensive.
Summary of the invention
The purpose of the embodiments of this specification is to provide a local context feature extraction module for 3D point cloud scene semantic segmentation.
To solve the above technical problems, the embodiments of the present application are implemented in the following ways:
In a first aspect, the present application provides a local context feature extraction module for 3D point cloud scene semantic segmentation, the local context feature extraction module comprising:
a rotation-invariant local representation, which receives local spatial information, the local spatial information including coordinate information of a plurality of points;
calculating, based on the local spatial information, a local rotation-invariant representation of each point;
finding the centroid point of the local neighborhood, and calculating the relative position encoding between the centroid point and its neighboring points;
determining the local context feature based on the local rotation-invariant representation of each point and the relative position encoding.
In one embodiment, the local spatial information is recorded as (K, 3), and the local rotation-invariant representation of each point includes rotation-invariant representations about the X-axis, Y-axis, and Z-axis;
wherein, the rotation invariance about the Z axis is expressed as:
the rotation invariance about the X axis is expressed as:
the rotation invariance about the Y axis is expressed as:
where k = (1, 2, ..., K), (x_ik, y_ik, z_ik) are the coordinates of the kth neighboring point of point i, and (x_im, y_im, z_im) are the coordinates of the centroid of the point cloud containing point i.
In one embodiment, calculating the relative position encoding between the centroid point and its neighboring points includes:
determining the coordinates of the centroid point;
determining the coordinates of the neighboring points;
calculating, from the coordinates of the centroid point and of the neighboring points, the coordinate difference and the Euclidean distance between them;
determining, from the coordinates of the centroid point, the coordinates of the neighboring points, the coordinate difference, and the Euclidean distance, the relative position encoding between the centroid point and the neighboring points.
In one embodiment, given the coordinates P_i of the centroid point and the coordinates P_i^k of the kth neighboring point, the relative position encoding r_i^k of the centroid point and the kth neighboring point is:
wherein P_i - P_i^k denotes the coordinate difference between the centroid point and the kth neighboring point, and ||P_i - P_i^k|| denotes the Euclidean distance between them.
In one embodiment, determining the local context feature based on the local rotation-invariant representation of each point and the relative position encoding includes:
introducing weights into the rotation-invariant representations of the X-axis and the Z-axis, and determining the local context feature from the Y-axis rotation-invariant representation of each point, the relative position encoding, and the weighted X-axis and Z-axis rotation-invariant representations;
where λ is the weight of the X-axis rotation-invariant representation, μ is the weight of the Z-axis rotation-invariant representation, and r_i^k is the relative position encoding.
In one embodiment, the local context feature extraction module further includes attention pooling, which weights the local context features with attention weights learned from the geometric distance and the feature distance, obtaining enhanced local context features.
In one embodiment, weighting the local context features with attention weights learned from the geometric distance and the feature distance to obtain enhanced local context features includes:
calculating the geometric distance between points in the local spatial information;
calculating the feature distance;
learning the weights of the attention pool from the geometric distance and the feature distance;
concatenating the weights of the attention pool with the local context features, and obtaining the attention weights through a shared MLP and a normalized exponential (softmax) function;
obtaining enhanced local context features from the attention weights and the neighboring point features.
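The attention pooling steps above can be sketched numerically. The patent's exact formulas are given only in figures not reproduced in this text, so the negative-exponential weighting form, the smoothing parameter `delta`, and the single shared linear layer `W` (standing in for the shared MLP) are assumptions of this illustration, not the claimed implementation:

```python
import numpy as np

def softmax(x, axis=0):
    """Normalized exponential function, numerically stabilized."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_pooling(neigh_feats, geo_dist, feat_dist, W, delta=1.0):
    """Pool K neighbor features (K, C) into one enhanced feature (C,),
    using attention learned from geometric and feature distances (both (K,))."""
    # Negative exponential of the two distances, adjusted by a smoothing parameter.
    w = np.exp(-(geo_dist + feat_dist) / delta)                   # (K,)
    # Concatenate the pool weight with the local context features ...
    combined = np.concatenate([neigh_feats, w[:, None]], axis=1)  # (K, C+1)
    # ... then a shared linear layer + softmax over the K neighbors
    # yields the attention weights.
    att = softmax(combined @ W, axis=0)                           # (K, C)
    # Enhanced local context feature: attention-weighted sum of neighbor features.
    return (att * neigh_feats).sum(axis=0)                        # (C,)
```

Because the softmax runs over the neighbor axis, each output channel is a convex combination of the K neighbor features, so distant or dissimilar neighbors are softly suppressed rather than hard-dropped by a max pool.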
In a second aspect, the present application provides a 3D point cloud scene semantic segmentation network, which includes the local context feature extraction module of the first aspect and an encoder architecture;
the local context feature extraction module for 3D point cloud scene semantic segmentation is embedded in the encoder architecture.
As can be seen from the technical solutions provided in the above embodiments of this specification, the solution learns local features with X-Y-Z three-axis rotation invariance, while compensating for the loss of many useful point features that random sampling may cause.
In order to more clearly illustrate the embodiments of this specification or the technical solutions in the prior art, the drawings required by the embodiments or the prior-art description are briefly introduced below. Obviously, the drawings described below are only some embodiments recorded in this specification; for those of ordinary skill in the art, other drawings can be obtained from these drawings without creative effort.
FIG. 1 is a schematic diagram of the structure of the local context feature extraction module for 3D point cloud scene semantic segmentation provided by the present application;
FIG. 2 is a schematic diagram of the structure of the 3D point cloud scene semantic segmentation network provided by the present application.
In order to enable those skilled in the art to better understand the technical solutions in this specification, the technical solutions in the embodiments of this specification are described clearly and completely below in conjunction with the drawings. Obviously, the described embodiments are only some, not all, of the embodiments of this specification. All other embodiments obtained by those of ordinary skill in the art from the embodiments in this specification without creative effort shall fall within the scope of protection of this specification.
In the following description, specific details such as particular system structures and technologies are set forth for illustration rather than limitation, in order to provide a thorough understanding of the embodiments of the present application. However, it should be clear to those skilled in the art that the present application may also be practiced in other embodiments without these specific details. Elsewhere, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so that unnecessary detail does not obscure the description of the present application.
It will be apparent to those skilled in the art that various modifications and variations may be made to the specific embodiments described herein without departing from the scope or spirit of the present application. Other embodiments derived from this description will likewise be apparent to those skilled in the art. The description and examples of the present application are merely exemplary.
The terms "comprise", "include", "have", "contain", and the like used herein are open-ended terms, meaning including but not limited to.
Unless otherwise specified, "parts" in this application are parts by mass.
In the related art, the pioneering work PointNet has become one of the most promising methods for directly processing 3D point clouds. It learns point-wise features using a shared multi-layer perceptron (MLP) and has achieved good results. The later, optimized derivative model PointNet++ further improved point cloud segmentation performance, but some problems remain: 1) the farthest point sampling method it adopts is suitable only for small-scale point clouds; large-scale point clouds have large data volumes, and farthest point sampling (FPS) consumes more memory and reduces computational and network efficiency; 2) the point cloud itself is rotation-invariant, so the segmentation results of point clouds input at different angles should be consistent. For example, chairs at different positions in a conference room necessarily face different directions; no matter from which angle the point cloud is input to the network, the result should be classified as a chair. This shows that the features learned from 3D point clouds are direction-sensitive, and this direction sensitivity degrades point cloud segmentation.
To address the above shortcomings, the present application adopts random sampling to accommodate large-scale scene-level point cloud data, and proposes a rotation-invariant local context feature aggregation module to learn local features with X-Y-Z three-axis rotation invariance, while compensating for the loss of many useful point features that random sampling may cause.
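The cost gap between the two sampling strategies can be seen in a small sketch. These are generic illustrative implementations, not the patent's code: greedy FPS evaluates a distance for every remaining point at each of the m selections, O(nm) overall, while uniform random sampling is O(m) regardless of scene size.

```python
import numpy as np

def farthest_point_sampling(points, m):
    """Greedy FPS: repeatedly pick the point farthest from the chosen set.
    Preserves spatial coverage, but costs O(n*m) distance evaluations."""
    n = points.shape[0]
    chosen = np.zeros(m, dtype=int)   # first pick: point 0
    dist = np.full(n, np.inf)         # distance of each point to the chosen set
    for i in range(1, m):
        d = np.linalg.norm(points - points[chosen[i - 1]], axis=1)
        dist = np.minimum(dist, d)
        chosen[i] = int(np.argmax(dist))
    return points[chosen]

def random_sampling(points, m, seed=0):
    """Uniform random subset: O(m) selection, independent of cloud size."""
    idx = np.random.default_rng(seed).choice(points.shape[0], m, replace=False)
    return points[idx]
```

The trade-off motivating the module: random sampling scales to scene-level clouds but may discard informative points, which the local context aggregation is designed to compensate for.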
The present invention is further described in detail below with reference to the accompanying drawings and embodiments.
Referring to FIG. 1, there is shown a schematic diagram of the structure of a local context feature extraction module for 3D point cloud scene semantic segmentation provided in an embodiment of the present application.
As shown in FIG. 1, the local context feature extraction module for 3D point cloud scene semantic segmentation may include:
a rotation-invariant local representation, which receives local spatial information, the local spatial information including coordinate information of a plurality of points;
calculating, based on the local spatial information, a local rotation-invariant representation of each point;
finding the centroid point of the local neighborhood, and calculating the relative position encoding between the centroid point and its neighboring points;
determining the local context feature based on the local rotation-invariant representation of each point and the relative position encoding.
具体的,局部上下文特征提取模块(Local context characteristics,LCC):作为一个几何对象,点集的学习表示应该对旋转变换保持不变。一起旋转的点不应改变全局点云的类别,也不应该改变对点云部分结构的分割。在许多真实场景中,例如常见的椅子,属于同一类别的对象的方向通常不同。此外,可以清楚地理解,同一物体不仅由Z轴的旋转不变性表示,X轴和Y轴也具有一定的旋转不变性。为了解决这个问题,我们建议学习一种新的具有X-Y-Z轴旋转不变的局部表 示,它利用极坐标表示各个点局部几何结构,LCC的整体结构如图1所示。Specifically, the local context feature extraction module (LCC): As a geometric object, the learned representation of a point set should remain unchanged to rotation transformations. Points rotated together should not change the category of the global point cloud, nor should they change the segmentation of the partial structure of the point cloud. In many real scenes, such as common chairs, objects belonging to the same category usually have different orientations. In addition, it can be clearly understood that the same object is not only represented by the rotation invariance of the Z axis, but also has a certain rotation invariance of the X and Y axes. To solve this problem, we propose to learn a new local representation with X-Y-Z axis rotation invariance, which uses polar coordinates to represent the local geometric structure of each point. The overall structure of LCC is shown in Figure 1.
如图1所示,局部空间信息(K,3)被输入到LCC块中,输出是具有X、Y和Z轴旋转不变特征的局部表示,分别为
As shown in Figure 1, the local spatial information (K, 3) is input into the LCC block, and the output is a local representation with X, Y and Z axis rotation invariant features, which are respectively
其中,关于Z轴的旋转不变性表示:Among them, the rotation invariance about the Z axis means:
关于X轴的旋转不变性表示:The rotation invariance about the X axis is expressed as:
关于Y轴的旋转不变性表示:The rotation invariance about the Y axis is expressed as:
其中,k=(1,2,…,K),
为i点的第k邻近点的坐标,(x
im,y
im,z
im)为i点所在点云的质心点的坐标。
Where k = (1, 2, ..., K), are the coordinates of the kth neighboring point of point i, and (x im , y im , z im ) are the coordinates of the centroid of the point cloud where point i is located.
It will be understood that the inverse trigonometric operations above all serve to convert a Cartesian-coordinate representation into a polar-coordinate representation.
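The exact per-axis formulas are given as figures in the original filing and are not reproduced in this text, so the sketch below is an assumed reconstruction from the description (inverse trigonometric functions converting Cartesian offsets into polar form). For each axis, the in-plane radius and the elevation angle out of that plane are unchanged by any rotation about that axis:

```python
import numpy as np

def axis_invariant_repr(neighbors):
    """For each neighbor offset (dx, dy, dz) from the centroid, compute a
    polar-style pair (radius, elevation angle) that is unchanged by rotation
    about each axis. neighbors: (K, 3) array of offsets.
    NOTE: an assumed reconstruction, not the patent's exact formulas."""
    dx, dy, dz = neighbors[:, 0], neighbors[:, 1], neighbors[:, 2]
    # Invariant to rotation about Z: distance in the X-Y plane plus the
    # elevation angle out of that plane (arctan gives the polar angle).
    rho_z = np.sqrt(dx**2 + dy**2)
    inv_z = np.stack([rho_z, np.arctan2(dz, rho_z)], axis=1)
    # Invariant to rotation about X: distance in the Y-Z plane.
    rho_x = np.sqrt(dy**2 + dz**2)
    inv_x = np.stack([rho_x, np.arctan2(dx, rho_x)], axis=1)
    # Invariant to rotation about Y: distance in the X-Z plane.
    rho_y = np.sqrt(dx**2 + dz**2)
    inv_y = np.stack([rho_y, np.arctan2(dy, rho_y)], axis=1)
    return inv_x, inv_y, inv_z

def rot_z(theta):
    """Rotation matrix about the Z axis, used to check invariance."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
```

Rotating the neighborhood about the Z axis leaves `inv_z` numerically unchanged, which is the property the module relies on.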
In one embodiment, computing the relative position encoding of the centroid point and its neighboring points includes:
determining the coordinates of the centroid point;
determining the coordinates of the neighboring points;
computing, from the coordinates of the centroid point and of the neighboring points, their coordinate difference and Euclidean distance;
determining, from the coordinates of the centroid point, the coordinates of the neighboring points, the coordinate difference, and the Euclidean distance, the relative position encoding of the centroid point and the neighboring points.
Specifically, given the coordinates P_i of the centroid point and the coordinates of the k-th neighboring point, the relative position encoding of the centroid point and the k-th neighboring point is:

where the two quantities denote, respectively, the coordinate difference and the Euclidean distance between the centroid point and the k-th neighboring point.
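The position-encoding formula itself is likewise omitted from this extraction. A concatenation consistent with the steps listed above (centroid coordinates, neighbor coordinates, coordinate difference, Euclidean distance) might look like the following sketch; the resulting 10-dimensional layout is an assumption:

```python
import numpy as np

def relative_position_encoding(centroid, neighbors):
    """Per neighbor, concatenate: centroid coords (3), neighbor coords (3),
    their coordinate difference (3), and the Euclidean distance (1) -> (K, 10).
    centroid: (3,), neighbors: (K, 3). Layout assumed, not from the patent text."""
    K = neighbors.shape[0]
    diff = centroid[None, :] - neighbors                 # coordinate difference
    dist = np.linalg.norm(diff, axis=1, keepdims=True)   # Euclidean distance
    center = np.repeat(centroid[None, :], K, axis=0)
    return np.concatenate([center, neighbors, diff, dist], axis=1)
```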
In one embodiment, determining the local context feature from each point's local rotation-invariant representation and the relative position encoding includes:
introducing weights into the X-axis and Z-axis rotation-invariance representations, and determining the local context feature from each point's Y-axis rotation-invariance representation, the relative position encoding, and the weighted X-axis and Z-axis rotation-invariance representations.
where λ is the weight on the X-axis rotation-invariance representation, μ is the weight on the Z-axis rotation-invariance representation, and the remaining term is the relative position encoding.
Specifically, an ablation experiment was performed on the module: the X-axis, Y-axis, and Z-axis rotation-invariance representations were each introduced separately, and the X- and Z-axis representations were found to have a markedly stronger effect on overall segmentation. The weights (λ, μ) are therefore introduced to increase the proportion of the X- and Z-axis representations, yielding the local context feature output by the LCC module.
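Since neither the combination operator nor the values of (λ, μ) are given in this text, the following sketch assumes simple concatenation with scalar scaling of the X- and Z-axis parts:

```python
import numpy as np

def local_context_feature(inv_x, inv_y, inv_z, pos_enc, lam=1.5, mu=1.5):
    """Combine the three axis-wise invariant representations with the relative
    position encoding, scaling the X- and Z-axis parts by (lambda, mu).
    Concatenation and the default weight values are assumptions."""
    return np.concatenate([lam * inv_x, inv_y, mu * inv_z, pos_enc], axis=1)
```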
In one embodiment, the local context feature extraction module further includes attention pooling, which learns attention weights from geometric distance and feature distance and weights the local context features to obtain enhanced local context features, specifically including:
computing the geometric distance between points in the local spatial information;
computing the feature distance;
learning the weights of the attention pool from the geometric distance and the feature distance;
concatenating the weights of the attention pool with the local context features, and obtaining the attention weights via a shared MLP and a normalized exponential (softmax) function;
obtaining the enhanced local context features from the attention weights and the neighboring-point features.
Specifically, applying traditional max pooling or average pooling to the local context features produced by LCC may discard most of the information, so we design an attention-pooling-like module to process the local context features. We first assume that points closer together are more strongly correlated, and compute the geometric distance and the feature distance between points as:
The attention-pool weights are then learned by taking the negative exponential of both distances, with a parameter ζ added to temper their instability; the learned attention-pool weights are:
The learned dual-distance parameters and the local context features are merged by concatenation:
The attention weights are obtained via a shared MLP and a normalized exponential (softmax) function:
Finally, the attention weights are applied to the neighboring-point features to obtain the enhanced local context features.
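A minimal numpy sketch of this dual-distance attention pooling, assuming the shared MLP is a single linear layer (stood in for here by random weights, where a trained model would learn them) and softmax runs over the K neighbors:

```python
import numpy as np

def attention_pool(features, geo_dist, feat_dist, zeta=1.0, rng=None):
    """Dual-distance attention pooling sketch.
    features: (K, C) neighbor features; geo_dist, feat_dist: (K,) distances
    from the centroid. Negative exponentials of the two distances form the
    distance weights; concatenated with the features, a stand-in shared MLP
    plus softmax yields attention scores that weight and sum the features."""
    if rng is None:
        rng = np.random.default_rng(0)
    K, C = features.shape
    w_geo = np.exp(-geo_dist / zeta)    # zeta tempers instability (role assumed)
    w_feat = np.exp(-feat_dist / zeta)
    cat = np.concatenate([features, w_geo[:, None], w_feat[:, None]], axis=1)
    W = rng.normal(size=(C + 2, 1))     # stand-in for the learned shared MLP
    logits = cat @ W
    att = np.exp(logits - logits.max())
    att = att / att.sum()               # softmax over the K neighbors
    return (att * features).sum(axis=0) # (C,) enhanced local context feature
```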
Together, the LCC and DP modules form the LD module, which performs the extraction and enhancement of local context features.
The local context feature extraction module for semantic segmentation of 3D point cloud scenes provided by the embodiments of this application learns local features invariant to rotation about the X, Y, and Z axes, while compensating for the many useful point features that random sampling may discard.
Referring to FIG. 2, a schematic structural diagram of a 3D point cloud scene semantic segmentation network applicable to an embodiment of this application is shown.
As shown in FIG. 2, the 3D point cloud scene semantic segmentation network includes the local context feature extraction module for 3D point cloud scene semantic segmentation and an encoder architecture;
the local context feature extraction module for 3D point cloud scene semantic segmentation is embedded in the encoder architecture.
Specifically, we embed the proposed LD module into the widely used encoder architecture, forming a new network that we name LD-Net (the 3D point cloud scene semantic segmentation network), as shown in FIG. 2. The input to the network is a point cloud of size n×d, where n is the number of points and d is the input feature dimension. The point cloud is first fed to a shared MLP layer to extract per-point features, with the feature dimension uniformly set to 8. We use five encoder-decoder layers to learn per-point features, and finally three consecutive fully connected layers to predict the semantic label of each point. The overall network structure is shown in FIG. 2.
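The "shared MLP" described here applies the same weights to every point independently. A small sketch of the input stage, with n, d, and the class count chosen purely as illustrative assumptions:

```python
import numpy as np

def shared_mlp(points, W, b):
    """A shared MLP applies identical weights to every point:
    (n, d_in) @ (d_in, d_out) + bias -> (n, d_out), followed by ReLU."""
    return np.maximum(points @ W + b, 0.0)

# Dimension flow sketched from the description: input (n, d) -> shared MLP
# to 8 channels -> five encoder/decoder stages -> three FC layers -> labels.
rng = np.random.default_rng(0)
n, d, num_classes = 4096, 6, 13            # example sizes (assumed, not from the filing)
cloud = rng.normal(size=(n, d))
h = shared_mlp(cloud, rng.normal(size=(d, 8)) * 0.1, np.zeros(8))
```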
The embodiments in this specification are described in a progressive manner; identical or similar parts of the embodiments may be understood by cross-reference, and each embodiment focuses on its differences from the others. In particular, since the system embodiments are substantially similar to the method embodiments, their description is relatively brief, and reference may be made to the corresponding parts of the method embodiments.
Claims (9)
- A local context feature extraction module for semantic segmentation of 3D point cloud scenes, characterized in that the local context feature extraction module comprises: a rotation-invariant local representation, the rotation-invariant local representation receiving local spatial information, the local spatial information comprising coordinate information of a number of points; computing a local rotation-invariant representation of each point from the local spatial information; finding the centroid point of a local neighborhood, and computing the relative position encoding of the centroid point and its neighboring points; and determining a local context feature from the local rotation-invariant representation of each point and the relative position encoding.
- The local context feature extraction module according to claim 1, characterized in that the local spatial information is denoted (K, 3), and the local rotation-invariant representation of each point includes rotation-invariant representations about the X, Y, and Z axes; where the rotation invariance about the Z axis is expressed as: the rotation invariance about the X axis is expressed as: the rotation invariance about the Y axis is expressed as:
- The local context feature extraction module according to claim 1, characterized in that computing the relative position encoding of the centroid point and the neighboring points comprises: determining the coordinates of the centroid point; determining the coordinates of the neighboring points; computing the coordinate difference and the Euclidean distance between the centroid point and the neighboring points from their coordinates; and determining the relative position encoding of the centroid point and the neighboring points from the coordinates of the centroid point, the coordinates of the neighboring points, the coordinate difference, and the Euclidean distance.
- The local context feature extraction module according to claim 3, characterized in that, given the coordinates P_i of the centroid point and the coordinates of the k-th neighboring point, the relative position encoding of the centroid point and the k-th neighboring point is:
- The local context feature extraction module according to claim 2, characterized in that determining the local context feature from the local rotation-invariant representation of each point and the relative position encoding comprises: introducing weights into the X-axis and Z-axis rotation-invariance representations, and determining the local context feature from each point's Y-axis rotation-invariance representation, the relative position encoding, and the weighted X-axis and Z-axis rotation-invariance representations.
- The local context feature extraction module according to claim 5, characterized in that the local context feature is:
- The local context feature extraction module according to claim 1, characterized in that the local context feature extraction module further comprises attention pooling, the attention pooling learning attention weights from geometric distance and feature distance and weighting the local context features to obtain enhanced local context features.
- The local context feature extraction module according to claim 7, characterized in that weighting the local context features with attention weights learned from geometric distance and feature distance to obtain enhanced local context features comprises: computing the geometric distance between points in the local spatial information; computing the feature distance; learning the weights of the attention pool from the geometric distance and the feature distance; concatenating the weights of the attention pool with the local context features, and obtaining the attention weights via a shared MLP and a normalized exponential (softmax) function; and obtaining the enhanced local context features from the attention weights and the neighboring-point features.
- A 3D point cloud scene semantic segmentation network, characterized in that the network comprises an encoder architecture and the local context feature extraction module for 3D point cloud scene semantic segmentation according to any one of claims 1-8; the local context feature extraction module for 3D point cloud scene semantic segmentation is embedded in the encoder architecture.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/CN2022/134619 WO2024113078A1 (en) | 2022-11-28 | 2022-11-28 | Local context feature extraction module for semantic segmentation in 3d point cloud scenario |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2024113078A1 true WO2024113078A1 (en) | 2024-06-06 |
Family
ID=91322641
Country Status (1)
Country | Link |
---|---|
WO (1) | WO2024113078A1 (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN118397192A (en) * | 2024-06-14 | 2024-07-26 | 中国科学技术大学 | Point cloud analysis method based on double-geometry learning and self-adaptive sparse attention |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2020192431A1 (en) * | 2019-03-22 | 2020-10-01 | Huawei Technologies Co., Ltd. | System and method for ordered representation and feature extraction for point clouds obtained by detection and ranging sensor |
CN113011430A (en) * | 2021-03-23 | 2021-06-22 | 中国科学院自动化研究所 | Large-scale point cloud semantic segmentation method and system |
CN113807182A (en) * | 2021-08-17 | 2021-12-17 | 北京地平线信息技术有限公司 | Method, apparatus, medium, and electronic device for processing point cloud |
CN114529727A (en) * | 2022-04-25 | 2022-05-24 | 武汉图科智能科技有限公司 | Street scene semantic segmentation method based on LiDAR and image fusion |
CN115222951A (en) * | 2021-04-20 | 2022-10-21 | 上海交通大学 | Image processing method based on three-dimensional point cloud descriptor with rotation invariance |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Gao et al. | LFT-Net: Local feature transformer network for point clouds analysis | |
CN108594816B (en) | Method and system for realizing positioning and composition by improving ORB-SLAM algorithm | |
WO2024060395A1 (en) | Deep learning-based high-precision point cloud completion method and apparatus | |
Jing et al. | Self-supervised feature learning by cross-modality and cross-view correspondences | |
CN111860666A (en) | 3D target detection method based on point cloud and image self-attention mechanism fusion | |
CN112489083B (en) | Image feature point tracking matching method based on ORB-SLAM algorithm | |
WO2023015409A1 (en) | Object pose detection method and apparatus, computer device, and storage medium | |
CN111831844A (en) | Image retrieval method, image retrieval device, image retrieval apparatus, and medium | |
Xu et al. | GraspCNN: Real-time grasp detection using a new oriented diameter circle representation | |
CN108305278B (en) | Image matching correlation improvement method in ORB-SLAM algorithm | |
WO2024113078A1 (en) | Local context feature extraction module for semantic segmentation in 3d point cloud scenario | |
CN113989340A (en) | Point cloud registration method based on distribution | |
CN116188825A (en) | Efficient feature matching method based on parallel attention mechanism | |
CN115222951A (en) | Image processing method based on three-dimensional point cloud descriptor with rotation invariance | |
CN113449612A (en) | Three-dimensional target point cloud identification method based on sub-flow sparse convolution | |
Gao et al. | HDRNet: High‐Dimensional Regression Network for Point Cloud Registration | |
Yu et al. | A DenseNet feature-based loop closure method for visual SLAM system | |
Wang et al. | 6D pose estimation from point cloud using an improved point pair features method | |
WO2023109069A1 (en) | Image retrieval method and apparatus | |
CN116843753A (en) | Robust 6D pose estimation method based on bidirectional matching and global attention network | |
CN118097651A (en) | Local context feature extraction module for 3D point cloud scene semantic segmentation | |
CN114266863A (en) | Point cloud-based 3D scene graph generation method, system, equipment and readable storage medium | |
CN110135340A (en) | 3D hand gestures estimation method based on cloud | |
CN114092650B (en) | Three-dimensional point cloud generation method based on efficient graph convolution | |
Liang et al. | Dual Branch PnP Based Network for Monocular 6D Pose Estimation |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 22966683 Country of ref document: EP Kind code of ref document: A1 |