CN118657932A - Target detection method and device based on data synthesis, electronic equipment and medium

Target detection method and device based on data synthesis, electronic equipment and medium

Info

Publication number
CN118657932A
CN118657932A (application CN202411118951.9A)
Authority
CN
China
Prior art keywords
point cloud
model
virtual model
cloud data
data
Prior art date
Legal status
Pending
Application number
CN202411118951.9A
Other languages
Chinese (zh)
Inventor
张�雄
郭和炀
苗乾坤
Current Assignee
Neolix Technologies Co Ltd
Original Assignee
Neolix Technologies Co Ltd
Priority date
Filing date
Publication date
Application filed by Neolix Technologies Co Ltd
Priority to CN202411118951.9A
Publication of CN118657932A
Legal status: Pending

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T: CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00: Road transport of goods or passengers
    • Y02T10/10: Internal combustion engine [ICE] based vehicles
    • Y02T10/40: Engine management systems

Landscapes

  • Image Analysis (AREA)

Abstract

The application provides a target detection method and device based on data synthesis, electronic equipment, and a medium. The method comprises the following steps: acquiring multi-view image data, original point cloud data, an initial detection frame and a pre-created virtual model; rendering the virtual model into the multi-view image data to generate new multi-view image data; updating the original point cloud data based on the correlation between the virtual model and the original point cloud data; updating the initial detection frame according to the spatial position of the virtual model; training the perception model with the new multi-view image data, the updated point cloud data and the updated detection frame to obtain a trained perception model; and taking actually acquired multi-view image data and point cloud data around the vehicle as input of the trained perception model, performing target detection with the trained perception model, and outputting a target detection result. The application can reduce the labeling cost of perception data and improve the detection precision and accuracy of the model.

Description

Target detection method and device based on data synthesis, electronic equipment and medium
Technical Field
The present application relates to the field of automatic driving technologies, and in particular, to a target detection method, device, electronic apparatus, and medium based on data synthesis.
Background
As a key part of an automatic driving system, the perception module models the surrounding environment from sensor data, including identifying the location, size, orientation, and speed of obstacles, as well as the location, type, and length of lane lines. The main algorithms of the perception module rely on AI-based neural networks, whose core elements are data, algorithms, and computing power. Of these three factors, the quality of the data directly determines the performance of the perception module. However, in an industrial production environment, data annotation costs are extremely high, often requiring investments in the millions to billions.
Even with a large investment in labeling, it is still difficult for the perception system to fully cover long-tail problems (i.e., special situations that are rarely encountered during automatic driving but must still be accurately identified). The cost of solving these long-tail problems is often equivalent to the cost of solving the first 99% of cases, so reducing the cost of handling perception long-tail problems is a technical problem to be solved.
Currently, the BEV (Bird's Eye View) architecture is widely adopted in the automatic driving industry and has been the most popular perception architecture since Tesla AI Day 2021. Compared with the traditional post-fusion architecture, the BEV architecture has stronger scalability and extensibility and is fully data-driven. However, as a data-driven perception architecture, the BEV architecture requires a large amount of high-quality data for training, which worsens the data long-tail problem and further increases the cost of data labeling.
In summary, the prior art faces the following problems in automatic driving perception systems: data labeling costs are high and it is difficult to cover all scenes; long-tail problems are extremely costly to handle and are solved inefficiently; and existing data-driven architectures (such as BEV) demand large amounts of high-quality data, further aggravating the labeling burden and resulting in reduced detection precision and poor accuracy of the model.
Disclosure of Invention
In view of the above, embodiments of the application provide a target detection method and device based on data synthesis, electronic equipment and a medium, which are used for solving the problems in the prior art of high perception data labeling cost, a serious data long-tail problem, and reduced model detection precision and accuracy.
In a first aspect of an embodiment of the present application, there is provided a target detection method based on data synthesis, including: acquiring multi-view image data, original point cloud data, an initial detection frame and a pre-created virtual model; rendering the virtual model into multi-view image data and generating new multi-view image data; updating the original point cloud data based on the correlation between the virtual model and the original point cloud data so that the updated point cloud data contains point cloud information of the virtual model; updating the initial detection frame according to the space position of the virtual model to generate a detection frame containing the virtual model; training a pre-constructed perception model by using new multi-view image data, updated point cloud data and updated detection frames to obtain a trained perception model; and taking actually acquired multi-view image data and point cloud data around the vehicle as input of a trained perception model, carrying out target detection by using the trained perception model, and outputting a target detection result.
In a second aspect of the embodiment of the present application, there is provided a target detection apparatus based on data synthesis, including: the acquisition module is configured to acquire multi-view image data, original point cloud data, an initial detection frame and a pre-created virtual model; a rendering module configured to render the virtual model into multi-view image data and generate new multi-view image data; the updating module is configured to update the original point cloud data based on the correlation between the virtual model and the original point cloud data so that the updated point cloud data contains point cloud information of the virtual model; the generation module is configured to update the initial detection frame according to the space position of the virtual model so as to generate a detection frame containing the virtual model; the training module is configured to train the pre-constructed perception model by utilizing the new multi-view image data, the updated point cloud data and the updated detection frame to obtain a trained perception model; the detection module is configured to take actually acquired multi-view image data and point cloud data around the vehicle as input of a trained perception model, perform target detection by using the trained perception model, and output a target detection result.
In a third aspect of the embodiments of the present application, there is provided an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the steps of the above method when executing the computer program.
In a fourth aspect of the embodiments of the present application, there is provided a computer readable storage medium storing a computer program which, when executed by a processor, implements the steps of the above method.
At least one of the above technical schemes adopted by the embodiments of the application can achieve the following beneficial effects:
Acquiring multi-view image data, original point cloud data, an initial detection frame and a pre-created virtual model; rendering the virtual model into multi-view image data and generating new multi-view image data; updating the original point cloud data based on the correlation between the virtual model and the original point cloud data so that the updated point cloud data contains point cloud information of the virtual model; updating the initial detection frame according to the space position of the virtual model to generate a detection frame containing the virtual model; training a pre-constructed perception model by using new multi-view image data, updated point cloud data and updated detection frames to obtain a trained perception model; and taking actually acquired multi-view image data and point cloud data around the vehicle as input of a trained perception model, carrying out target detection by using the trained perception model, and outputting a target detection result. The application can reduce the labeling cost of the perception data and improve the detection precision and accuracy of the model.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings used in the embodiments or the description of the prior art will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic flow chart of a target detection method based on data synthesis according to an embodiment of the present application;
Fig. 2 is a schematic structural diagram of a target detection device based on data synthesis according to an embodiment of the present application;
Fig. 3 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
In the following description, for purposes of explanation and not limitation, specific details are set forth such as the particular system architecture, techniques, etc., in order to provide a thorough understanding of the embodiments of the present application. It will be apparent, however, to one skilled in the art that the present application may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present application with unnecessary detail.
Perception is an important part of an automatic driving system, aimed at modeling the surrounding environment based on sensor data, such as giving the position/size/orientation/speed of obstacles and the position/type/length of lane lines. Most algorithms in the perception module are realized by AI-based neural networks, whose three elements are data, algorithms, and computing power. Data occupies the most central position in the development of a perception module: the quality of the data directly determines its actual performance, and in industrial production environments, labeling expenses usually run from millions to billions. Even if a huge labeling budget is invested, the perception long-tail problem is difficult to cover, and the cost of the last 1% of long-tail cases is comparable to that of the other 99%, so how to reduce the cost of handling the perception long-tail problem is a problem to be solved.
The current mainstream perception architecture is the BEV (Bird's Eye View) architecture; since Tesla AI Day 2021, BEV has become the most mainstream perception architecture in the automatic driving industry. Compared with the earlier post-fusion architecture, the BEV architecture has stronger scalability and extensibility and is a truly data-driven architecture. As a fully data-driven perception architecture, the BEV architecture requires more data, and thus faces a more serious data long-tail problem.
The BEV architecture models the environment through camera images of multiple perspectives, laser radar point cloud data, and millimeter wave radar data. Such data-driven architecture requires large amounts of data to train, and labeling of such data is costly and difficult to cover all autopilot scenarios, resulting in poor model performance in some scenarios.
Therefore, the main problem faced by existing perception systems is that the cost of labeling perception data is high, especially when handling long-tail problems. The application aims to synthesize long-tail training samples for a perception 3D detection model through computer graphics, so as to reduce the cost of data annotation. In order to explain the technical scheme of the application in more detail, the following examples illustrate the data synthesis concept provided by the application by synthesizing a training sample comprising a bus and a pick-up truck.
The following describes the technical scheme of the present application in detail with reference to the accompanying drawings and specific embodiments.
Fig. 1 is a flow chart of a target detection method based on data synthesis according to an embodiment of the present application. The data synthesis based target detection method of fig. 1 may be performed by an autopilot system. As shown in fig. 1, the target detection method based on data synthesis may specifically include:
S101, acquiring multi-view image data, original point cloud data, an initial detection frame and a pre-created virtual model;
S102, rendering the virtual model into the multi-view image data, and generating new multi-view image data;
S103, updating the original point cloud data based on the correlation between the virtual model and the original point cloud data so that the updated point cloud data contains point cloud information of the virtual model;
S104, updating the initial detection frame according to the spatial position of the virtual model to generate a detection frame containing the virtual model;
S105, training a pre-constructed perception model by using the new multi-view image data, the updated point cloud data and the updated detection frame to obtain a trained perception model;
S106, taking actually collected multi-view image data and point cloud data around the vehicle as input of the trained perception model, carrying out target detection by using the trained perception model, and outputting a target detection result.
First, given the images {I1, I2, ..., In} of n views, the Lidar point cloud P (i.e., the original point cloud data), and the corresponding detection boxes B (i.e., the initial detection frame), for an artist-modeled 3D model M (i.e., the virtual model), the present application uses rendering to place the model M into the scene and to update the corresponding images {I1, I2, ..., In}, the point cloud P, and the 3D detection frames B, so that the imaging of the model M is visible in the updated surround-view images {I1, I2, ..., In}, the Lidar points falling on the model M are added to the point cloud P, and a detection frame containing the model M is added to the detection frames B.
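For concreteness, the following minimal Python sketch shows one way the inputs and outputs of this synthesis step could be organized. It is illustrative only: all names, the box encoding (x, y, z, l, w, h, yaw), and the overall interface are assumptions rather than part of the patent, and the three numbered comments correspond to the image, point cloud, and detection frame updates detailed in the embodiments below.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Sample:
    images: list              # surround-view images {I1, ..., In}, each an HxWx3 array
    point_cloud: np.ndarray   # Lidar points P, shape (N, 3)
    boxes_3d: np.ndarray      # 3D detection frames B, shape (K, 7): x, y, z, l, w, h, yaw

def synthesize_sample(sample: Sample, model_mesh, model_pose) -> Sample:
    """Place the virtual model M (model_mesh at model_pose) into the scene and return
    an updated sample whose images, point cloud, and boxes all reflect the inserted model."""
    # 1) render M into every camera view and composite it over the original images
    # 2) delete Lidar points occluded by M and add the points where beams now hit M
    # 3) append M's 3D bounding box to boxes_3d
    raise NotImplementedError("see the per-step sketches in the embodiments below")
```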
In some embodiments, rendering the virtual model into multi-view image data and generating new multi-view image data includes:
Rendering the virtual model into a view angle space corresponding to each camera by using a computer graphics method;
Superposing the virtual model and the original multi-view image data to generate multi-view image data containing the virtual model;
correcting and fusing the multi-view image data corresponding to all the views and containing the virtual model to obtain new multi-view image data.
Specifically, a virtual model is rendered into a view angle space corresponding to each camera by using a computer graphics method. This process includes creating a virtual scene for each camera view and inserting a pre-created virtual model into the scene. For example, a virtual car model is created using 3D modeling software and placed in the appropriate position for the camera view.
Further, the virtual model is superimposed with the original multi-view image data to generate multi-view image data including the virtual model. Specifically, based on computer graphics technology, the model M (virtual model) is rendered into the space associated with each camera, and the rendered image of the virtual model is superimposed with the original image of the corresponding camera. This makes it possible to accurately present the virtual model on the original image.
For example, in one example, one virtual automobile model is rendered into the front view camera so that it appears in the front view image data, and the same virtual automobile model can also be imaged in all 6 look-around cameras to ensure consistency and continuity of the virtual model at all perspectives.
Further, correcting and fusing the multi-view image data corresponding to all views and containing the virtual model to obtain new multi-view image data. The correcting step comprises adjusting the geometry and the position of the virtual model under each view angle to ensure that the display effect of the virtual model under different view angles is consistent. The fusing step includes integrating the corrected multi-view image data to generate a unified multi-view image dataset. This process may utilize image processing techniques, such as multi-view stereo matching and image stitching techniques, to ensure that the final generated multi-view image data is visually seamless and conforms to the physical laws of the real world.
For example, in one example, one virtual car model may be rendered into a front-view camera, front-view image data generated, and imaged in all 6 look-around cameras, generating corresponding look-around image data.
By the method of this embodiment, multi-view image data including the virtual model can be obtained. The 3D automobile model synthesized in the manner provided by the application has high realism and consistency, can effectively simulate real-world automobile scenes, and improves the training effect of the perception model.
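As a rough illustration of the compositing described above (not the patent's implementation), the sketch below assumes an off-screen rendering callback `render_model` that, given a camera's intrinsics, world-to-camera extrinsics, and image size, returns an HxWx4 RGBA layer of the virtual model seen from that camera; the model layer is then alpha-blended over each original view.

```python
import numpy as np

def composite_model_into_views(images, cam_intrinsics, cam_extrinsics, render_model):
    """Render the virtual model into every camera view and alpha-composite it
    over the original image. `render_model(K, T_world_to_cam, hw)` is a
    hypothetical off-screen rendering callback returning an HxWx4 RGBA layer."""
    new_images = []
    for img, K, T in zip(images, cam_intrinsics, cam_extrinsics):
        rgba = render_model(K, T, img.shape[:2])        # model rendered in this view
        alpha = rgba[..., 3:4].astype(np.float32) / 255.0
        # overlay: model pixels replace the background according to their alpha
        blended = rgba[..., :3] * alpha + img.astype(np.float32) * (1.0 - alpha)
        new_images.append(blended.astype(img.dtype))
    return new_images
```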
In some embodiments, updating the original point cloud data based on the correlation of the virtual model and the original point cloud data includes:
After a virtual model is inserted into a scene, determining points which are blocked by the virtual model in original point cloud data, and deleting the points which are blocked by the virtual model in the original point cloud data;
And calculating an intersection point of each laser radar wire harness and the virtual model, taking the nearest intersection point as newly added point cloud data, and fusing the newly added point cloud data with the original point cloud data to generate point cloud data containing the virtual model.
Specifically, a pre-created 3D virtual model is inserted in the scene. For example, a virtual car model and a bus model are inserted into the scene of the original point cloud data. These virtual models should be consistent with the actual environment in the original point cloud data to ensure the authenticity and accuracy of the updated point cloud data.
Further, after the virtual model is inserted, points in the original point cloud data that are occluded by the virtual model are determined. Specifically, for each point in the original point cloud, the ray from the origin of the laser radar device to that point is tested for intersection with each patch of the virtual model. If the ray intersects a patch of the virtual model, the point is considered to be occluded and needs to be deleted from the original point cloud data. This process ensures that the updated point cloud data does not contain invalid points occluded by the virtual model.
Further, an intersection point of each laser radar harness and the virtual model is calculated, and the nearest intersection point is used as newly added point cloud data. For example, for each lidar harness, its intersections with the virtual model are solved and added to the point cloud data. The newly added point cloud data should be matched with the surface features of the virtual model to ensure that the updated point cloud data can accurately reflect the shape and position of the virtual model.
Further, the newly added point cloud data and the original point cloud data are fused, and updated point cloud data containing the virtual model is generated. In the fusion process, the consistency and continuity of the newly added point cloud data and the original point cloud data in space are required to be ensured, so that complete and accurate point cloud data are obtained. For example, new point cloud data behind a car and a bus is newly added in the original point cloud data.
For example, in one example, after a 3D virtual model is inserted in a scene, occluded points in the original point cloud should be deleted, while the points where Lidar beams hit the newly added object should be added to the point cloud. To delete points from the original point cloud, the ray from the origin of the Lidar device to each point in the original point cloud can be tested for intersection with each patch of the virtual model. If the ray intersects some patch of the virtual model, the point is occluded and needs to be deleted; otherwise, the point is retained. Meanwhile, to make the data more realistic, the new point cloud returns introduced by the virtual object need to be added. To achieve this, the intersection of each Lidar beam with the virtual model may be solved, and the nearest intersection point added to the point cloud data.
By the method of the embodiment, the virtual model can be effectively integrated into the original point cloud data, and updated point cloud data containing the virtual model is generated. The method can improve the recognition accuracy of the perception system to the virtual target and enhance the adaptability of the perception model in complex scenes.
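A simplified sketch of this point cloud update is shown below. It assumes a helper `ray_first_hit(origin, direction, mesh)` that returns the distance to the nearest intersection of a ray with the virtual model (or None if the ray misses it), and it approximates the set of Lidar beams by the directions of the existing returns; a full implementation would iterate over the sensor's actual beam pattern.

```python
import numpy as np

def update_point_cloud(points, lidar_origin, mesh, ray_first_hit):
    """Delete real points occluded by the virtual model and add the points where the
    corresponding beams now hit the model instead (simplified, per-return sketch)."""
    kept, added = [], []
    for p in points:
        direction = p - lidar_origin
        dist_to_point = np.linalg.norm(direction)
        direction = direction / dist_to_point
        hit = ray_first_hit(lidar_origin, direction, mesh)   # distance to model, or None
        if hit is not None and hit < dist_to_point:
            # the virtual model occludes this real return: drop it and record
            # the point where the beam now hits the model surface
            added.append(lidar_origin + hit * direction)
        else:
            kept.append(p)
    return np.vstack(kept + added) if (kept or added) else points
```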
In some embodiments, determining points in the original point cloud data that are occluded by the virtual model includes:
Determining, for each point in the original point cloud data, the ray from the origin of the laser radar device to that point, and judging whether the point in the original point cloud data is occluded by the virtual model according to the intersection between the ray and each patch of the virtual model;
And when the ray intersects with the patch in the virtual model, judging that the point in the original point cloud data belongs to the point shielded by the virtual model, otherwise, judging that the point in the original point cloud data does not belong to the point shielded by the virtual model.
Specifically, rays are first cast from the origin of the lidar device, and the path from the origin to each point in the original point cloud data is determined. Assuming that the origin of the lidar device is O and the points in the original point cloud data are P_i, each ray may be represented as the straight line from O to P_i.
Further, the intersection of the ray with each patch of the virtual model is calculated: for each ray from the origin O of the lidar device to a point P_i, it is determined whether the ray intersects a patch of the virtual model M. The virtual model M is typically composed of a plurality of patches (e.g., triangles or quadrilaterals), and whether the point P_i is occluded by the virtual model can be determined by calculating the intersection of the ray with those patches.
For example, a geometric algorithm may be employed to solve the intersection of a ray with a polygonal patch, such as a ray-triangle intersection test that determines whether the ray intersects a triangular patch of the virtual model M.
Further, when the ray intersects a patch of the virtual model M before reaching P_i, it is determined that the point P_i in the original point cloud data is occluded by the virtual model and needs to be deleted from the original point cloud data. Otherwise, the point P_i is judged not to be occluded by the virtual model and is retained.
Through the judgment, the original point cloud data can be updated, and the points blocked by the virtual model can be deleted. Specifically, to delete a point in the original point cloud, the following steps may be performed:
Rays are emitted from the origin of the lidar device and the path of the rays to each point in the original point cloud data is determined.
The intersection of each ray with each patch of the virtual model M is calculated. If the ray intersects a certain patch of M, the point in the point cloud data is considered to be blocked, and the point needs to be deleted; otherwise, the point is reserved.
For example, in one specific example, assume that the origin of the lidar device is O, the points in the original point cloud data are P_1, P_2, ..., P_n, and the virtual model is M, which is composed of a plurality of triangular patches. For each point P_i, the ray path from O to P_i is computed, and it is determined whether the ray intersects any of the triangular patches of M. If so, the point P_i is deleted; otherwise, the point P_i is retained.
By the method of the embodiment, the points blocked by the virtual model in the original point cloud data can be effectively determined and deleted, so that the point cloud data is updated, and the point cloud data is more real and accurate.
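The ray-triangle test mentioned above can be implemented, for example, with the Möller-Trumbore algorithm. The sketch below is a generic textbook version, not code from the patent, and treats the virtual model as a list of triangles (v0, v1, v2).

```python
import numpy as np

def ray_intersects_triangle(origin, direction, v0, v1, v2, eps=1e-8):
    """Möller-Trumbore ray/triangle intersection test.
    Returns the distance t along the ray to the hit point, or None if there is no hit."""
    e1, e2 = v1 - v0, v2 - v0
    pvec = np.cross(direction, e2)
    det = np.dot(e1, pvec)
    if abs(det) < eps:                  # ray parallel to the triangle plane
        return None
    inv_det = 1.0 / det
    tvec = origin - v0
    u = np.dot(tvec, pvec) * inv_det
    if u < 0.0 or u > 1.0:
        return None
    qvec = np.cross(tvec, e1)
    v = np.dot(direction, qvec) * inv_det
    if v < 0.0 or u + v > 1.0:
        return None
    t = np.dot(e2, qvec) * inv_det
    return t if t > eps else None       # hit must lie in front of the origin

def point_is_occluded(origin, point, triangles):
    """A point P_i is occluded if the segment from the Lidar origin to P_i crosses
    any triangle of the virtual model (i.e. a hit closer than P_i itself)."""
    direction = point - origin
    dist = np.linalg.norm(direction)
    direction = direction / dist
    return any(
        (t := ray_intersects_triangle(origin, direction, *tri)) is not None and t < dist
        for tri in triangles
    )
```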
In some embodiments, updating the initial detection box according to the spatial location of the virtual model includes:
calculating an external cuboid of the virtual model, and converting the calculated external cuboid into a unified coordinate space to generate a detection frame of the virtual model;
the detection frame of the virtual model is added to the initial detection frame so as to update the initial detection frame, and the detection frame containing the virtual model is generated.
Specifically, given a 3D virtual model M to be inserted, first an circumscribed cuboid of the model is calculated. The circumscribed cuboid is the smallest cuboid that completely encloses the virtual model M, with each face tangential to the boundary of the virtual model. In order to calculate the external cuboid, the minimum and maximum values of the virtual model M on each coordinate axis can be determined by using all the vertex coordinates of the virtual model M, so as to define six faces of the external cuboid.
Further, the calculated circumscribed cuboid is converted into a unified coordinate space. The unified coordinate space is a reference coordinate system used by the whole sensing system, so that data of different data sources (such as cameras and laser radars) can be ensured to be processed and fused under the same coordinate system. Specifically, the circumscribed cuboid of the virtual model M may be converted from its local coordinate system to the global coordinate system by a coordinate transformation matrix.
Further, in the unified coordinate space, a detection frame of the virtual model is generated. The detection box is a rectangular box for identifying the position and size of the virtual model in space. This step ensures that the virtual model M has an explicit spatial representation in the global coordinate system, which can be recognized and processed by the subsequent perception processing module.
Further, the detection frame of the generated virtual model is added to the initial detection frame B. The initial detection frame B includes detection frames of all known objects in the original point cloud data and the multi-view image data. By adding the detection frame of the virtual model to the initial detection frame B, a detection frame set containing the virtual model is generated. This process ensures that all the test frames are updated and processed in the same coordinate system.
For example, in one example, assuming a given 3D virtual model M, the updating of the initial detection box includes the steps of:
the external cuboid of the virtual model M is calculated, for example, by acquiring the minimum and maximum coordinate values of the model M on three coordinate axes of x, y and z, thereby defining six faces of the external cuboid.
And converting the circumscribed cuboid from the local coordinate system of the virtual model M into the global coordinate system of the perception system by using the coordinate transformation matrix. This step may involve transformation operations such as translation, rotation, and scaling, to ensure the correct position and orientation of the detection frame of the virtual model in the global coordinate system.
A detection box of the virtual model M is generated in the global coordinate system, which is used to identify the position and size of the virtual model in space.
And adding the generated virtual model detection frame into an initial detection frame B, and updating the initial detection frame B to obtain a detection frame set containing the virtual model.
By the method of the embodiment, the detection frame of the virtual model and the original detection frame can be effectively fused, and an updated detection frame containing the virtual model can be generated. The method can improve the detection capability of the perception system on the virtual target and enhance the adaptability of the system in complex scenes.
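A minimal sketch of these steps is given below. It assumes the model's vertices are provided as a (V, 3) array, that the local-to-global transform is a 4x4 homogeneous matrix whose rotation is about the z axis only, and that boxes use the common (x, y, z, l, w, h, yaw) encoding; none of these conventions is specified in the patent.

```python
import numpy as np

def model_detection_box(vertices_local, T_local_to_global):
    """Compute the model's axis-aligned circumscribed cuboid in its local frame and
    express it as a 3D detection box in the unified (global) coordinate space."""
    mins, maxs = vertices_local.min(axis=0), vertices_local.max(axis=0)
    center_local = (mins + maxs) / 2.0
    size = maxs - mins                              # extents along the local x, y, z axes
    center_global = (T_local_to_global @ np.append(center_local, 1.0))[:3]
    yaw = np.arctan2(T_local_to_global[1, 0], T_local_to_global[0, 0])  # rotation about z
    # box encoded as (x, y, z, l, w, h, yaw), one common 3D-detection convention
    return np.concatenate([center_global, size, [yaw]])

def add_model_box(boxes_3d, model_box):
    """Append the virtual model's detection box to the initial detection frame set B."""
    return np.vstack([boxes_3d, model_box[None, :]]) if boxes_3d.size else model_box[None, :]
```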
In some embodiments, training the pre-constructed perceptual model with the new multi-view image data, the updated point cloud data, and the updated detection box comprises:
determining a model structure and model parameters of a pre-constructed perception model, and constructing a training data set by utilizing new multi-view image data, updated point cloud data and an updated detection frame;
Inputting the training data set into a perception model, optimizing model parameters by using a gradient descent algorithm, and adjusting the model parameters by minimizing a loss function;
And in the model training process, the gradient descent algorithm is repeatedly utilized to execute the optimization operation of the model parameters until the loss function converges, so as to obtain the trained perception model.
Specifically, first, the model structure G and initial model parameters θ of the pre-constructed perception model are determined. The model structure G describes the neural network architecture of the perception model, such as convolutional layers, pooling layers, and fully connected layers. The initial model parameters θ include the weights and biases of these layers.
Further, a training data set is constructed using the new multi-view image data, the updated point cloud data, and the updated detection frames. Specifically, the new multi-view image data and the updated point cloud data are used as the input data of the model, and the updated detection frames are used as the label data of the model. The training dataset is made up of a plurality of sample pairs (x_i, y_i), where x_i is the input data of the model (e.g., multi-view images and point cloud data) and y_i is the corresponding label data (e.g., detection frames).
Further, the training data set is input into the perception model, and the model parameters are optimized by using a gradient descent algorithm. Specifically, stochastic gradient descent (SGD) is used to minimize the loss function, thereby adjusting the model parameters θ. The loss function measures the difference between the model's predictions and the actual label data; common loss functions include cross-entropy loss, mean squared error, and the like.
Further, in the model training process, the optimization operation of the model parameters is repeatedly performed by using a gradient descent algorithm. The method comprises the following specific steps:
and calculating the prediction output of the model, and comparing the prediction output with the label data to obtain a loss value.
And calculating the gradient of the loss function relative to the model parameters, and updating the model parameters by using a back propagation algorithm.
Repeating the steps, and iteratively updating the model parameters until the loss function converges or reaches the preset training round number.
And after the loss function converges, obtaining a trained perception model. At this time, the model parameter θ has been optimized, and accurate target detection can be performed in the new multi-view image data and the point cloud data.
Specifically, given a model structure G and model parameters θ, the main task of model training is to optimize the value of the parameters θ. To achieve this, some labeled training data {(x_1, y_1), (x_2, y_2), ..., (x_m, y_m)} is typically required, where x_i is the input of the model and y_i is the manually annotated label, and the model weights θ are optimized using stochastic gradient descent (SGD). The training process comprises the following steps:
Input data: the training dataset {(x_1, y_1), (x_2, y_2), ..., (x_m, y_m)} is input into the perception model.
Calculating the loss: the model's predicted output is compared with the actual label data y_i to calculate the loss function value.
Calculating the gradient: the gradient of the loss function with respect to the model parameters θ is calculated by the back propagation algorithm.
Updating parameters: the model parameters θ are updated according to the gradient using SGD.
Iterative training: the above steps are repeated until the loss function converges, obtaining the optimized model parameters θ.
By the method of the embodiment, the pre-constructed perception model can be effectively trained by using the new multi-view image data, the updated point cloud data and the updated detection frame, and the target detection accuracy and the robustness of the model are improved.
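The training procedure above is a standard supervised loop. The PyTorch-style sketch below is only a generic illustration, assuming the perception model exposes a task-specific `compute_loss` method and that each training sample has already been collated into input tensors x (images plus point cloud features) and label tensors y (detection frames); the real model, loss, and batch format are not specified by the patent.

```python
import torch
from torch.utils.data import DataLoader

def train_perception_model(model, train_dataset, epochs=10, lr=1e-3, device="cuda"):
    """Generic SGD training loop matching the steps above (sketch, not the patent's code)."""
    model = model.to(device)
    loader = DataLoader(train_dataset, batch_size=4, shuffle=True)
    optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    for epoch in range(epochs):
        for x, y in loader:                        # x: images + point cloud, y: detection frames
            x, y = x.to(device), y.to(device)
            pred = model(x)                        # forward pass
            loss = model.compute_loss(pred, y)     # task-specific loss (assumed method)
            optimizer.zero_grad()
            loss.backward()                        # gradients of the loss w.r.t. θ
            optimizer.step()                       # θ <- θ - lr * gradient
    return model
```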
In some embodiments, taking actually collected multi-view image data and point cloud data around a vehicle as input of a trained perception model, performing target detection by using the trained perception model, and outputting a target detection result, including:
inputting actually collected multi-view image data and point cloud data around the vehicle into a trained perception model;
In the trained perception model, extracting the characteristics of the data by utilizing a characteristic extraction module;
converting the extracted features into a unified aerial view space, and performing time sequence fusion on the features in the aerial view space to obtain time sequence-related multi-mode features;
The multi-mode feature is input into a decoder related to a preset task, a target object in multi-view image data is identified by the decoder, and a target detection result is output.
Specifically, actually acquired multi-view image data around the vehicle and point cloud data are input into a trained perception model. The multi-view image data typically includes an looking-around image of a plurality of cameras, and the point cloud data is from a plurality of Lidar sensors.
Further, in the trained perception model, the feature extraction module is utilized to perform feature extraction on the input data. The feature extraction module includes a plurality of Convolutional Neural Network (CNN) layers for extracting a high-dimensional feature representation from the multi-view image and the point cloud data. These feature representations contain spatial and semantic information of the input data.
Further, the extracted features are converted to a unified bird's eye view (BEV) space. The BEV model architecture maps data from different sensors onto a unified two-dimensional plane through geometric transformations, thereby achieving a unified representation of the data. Specifically, the image features can be projected using the intrinsic and extrinsic parameters of the cameras, and the point cloud features can be mapped using the spatial coordinates of the point cloud data.
Further, time sequence fusion is carried out on the features in the aerial view angle space, and time sequence-related multi-mode features are obtained. The time sequence fusion module captures dynamic change information through the joint processing of multi-frame data, and timeliness and stability of feature representation are improved. This process may be implemented by a Recurrent Neural Network (RNN) or a long short-term memory network (LSTM).
Further, the multi-mode feature is input into a decoder related to a preset task, and a target object in the multi-view image data is identified by the decoder. The decoder adopts different neural network structures, such as a convolution layer, a full connection layer and the like, according to the requirements of specific perception tasks, and outputs a target detection result. Common perception tasks include 3D object detection, lane line recognition, occupancy grid recognition, and the like.
In one example, the perception model of an embodiment of the present application employs a BEV model architecture whose inputs generally include a multi-view image looking around, a plurality of lidar point cloud data, and millimeter wave/ultrasound radar data. The model extracts features of the corresponding sensor data through the respective feature extraction modules and converts the features into a unified BEV space (i.e., a bird's eye view space). For example, multi-view image data and point cloud data are mapped into BEV space by geometric transformation and projection.
In BEV space, the perception model carries out time sequence fusion on the characteristics, and the timeliness of the characteristics is enhanced by utilizing multi-frame data. The time-series-related BEV multi-modal characteristics are then input into a task-related decoder. For a 3D target detection task, the decoder identifies and outputs three-dimensional position, size and type information of a target object; for a lane line identification task, the decoder outputs position and type information of the lane line; for the occupancy grid recognition task, the decoder outputs occupancy conditions around the vehicle.
By the method of the embodiment, the trained perception model can accurately detect the target by utilizing the multi-view image data and the point cloud data, and output a high-precision detection result. The method improves the performance and reliability of the perception system in a complex driving environment.
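The pipeline described above can be summarized by the following skeleton, in which every submodule (image encoder, point cloud encoder, BEV projection, time sequence fusion, task decoder) is a placeholder supplied by the caller. It is a structural sketch of the data flow only, not the patent's actual network.

```python
import torch.nn as nn

class BEVPerception(nn.Module):
    """Skeleton of the inference pipeline described above; all submodules are placeholders."""
    def __init__(self, img_encoder, pc_encoder, bev_projector, temporal_fusion, decoder):
        super().__init__()
        self.img_encoder = img_encoder          # CNN features per camera image
        self.pc_encoder = pc_encoder            # point-cloud feature extractor
        self.bev_projector = bev_projector      # maps both modalities into BEV space
        self.temporal_fusion = temporal_fusion  # fuses BEV features across frames
        self.decoder = decoder                  # task head, e.g. 3D detection

    def forward(self, images, point_cloud, history_bev=None):
        img_feats = [self.img_encoder(img) for img in images]
        pc_feats = self.pc_encoder(point_cloud)
        bev = self.bev_projector(img_feats, pc_feats)   # unified bird's-eye-view features
        bev = self.temporal_fusion(bev, history_bev)    # time sequence fusion
        return self.decoder(bev)                        # e.g. 3D boxes for detection
```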
The technical scheme provided by the embodiments of the application has at least the following advantages:
The method provided by the application can effectively synthesize a large number of long tail training samples, and is more in line with the actual physical rule compared with the traditional 2D layer copy-paste synthesis mode. Specifically, the synthesis process of the application fully follows objective physical logic, and achieves high consistency with real scenes by rendering virtual models into multi-view image data and point cloud data.
Firstly, accurately rendering a virtual model into multi-view image data through a computer graphics method, and updating original point cloud data according to the position and the shape of the virtual model to generate multi-mode data containing the virtual model. The method ensures consistency and continuity of the virtual model under different visual angles, so that the synthesized data is more realistic.
Second, compared with fully virtual rendering, the synthetic training samples of the present application are based entirely on data acquired by real sensors; only the newly added object is rendered. Because the background and environment data in the whole scene are real sensor data, the characteristics of the training samples synthesized by the application are closer to those of real data, thereby improving the training effect and practicality of the perception model.
In addition, the method provided by the application can effectively solve the problem of long tail samples in the automatic driving training data, and obviously reduces the labeling cost of the automatic driving training samples. By synthesizing high-quality long-tail training samples, the workload and cost of manual labeling can be reduced, and the data coverage range and the generalization capability of the model can be improved. The high-efficiency data synthesis method not only improves the performance of a sensing system, but also accelerates the actual application of an automatic driving technology and the landing process.
In summary, the method of the application not only enhances the capability of the automatic driving perception system, but also reduces the labeling cost of training data by synthesizing the high-quality training samples conforming to the physical rule, thereby providing important support for the popularization and application of the automatic driving technology.
The following are apparatus embodiments of the present application, which may be used to perform the method embodiments of the present application. For details not disclosed in the apparatus embodiments of the present application, please refer to the method embodiments of the present application. Fig. 2 is a schematic structural diagram of a target detection device based on data synthesis according to an embodiment of the present application. As shown in fig. 2, the data synthesis-based object detection apparatus includes:
An acquisition module 201 configured to acquire multi-view image data, original point cloud data, an initial detection frame, and a virtual model created in advance;
A rendering module 202 configured to render the virtual model into multi-view image data and generate new multi-view image data;
The updating module 203 is configured to update the original point cloud data based on the correlation between the virtual model and the original point cloud data, so that the updated point cloud data contains the point cloud information of the virtual model;
A generating module 204 configured to update the initial detection frame according to the spatial position of the virtual model, so as to generate a detection frame containing the virtual model;
The training module 205 is configured to train the pre-constructed perception model by using the new multi-view image data, the updated point cloud data and the updated detection frame to obtain a trained perception model;
The detection module 206 is configured to take the actually collected multi-view image data and point cloud data around the vehicle as input of a trained perception model, perform target detection by using the trained perception model, and output a target detection result.
In some embodiments, the rendering module 202 of fig. 2 renders the virtual model into a perspective space corresponding to each camera using a computer graphics method; superposing the virtual model and the original multi-view image data to generate multi-view image data containing the virtual model; correcting and fusing the multi-view image data corresponding to all the views and containing the virtual model to obtain new multi-view image data.
In some embodiments, after the update module 203 of fig. 2 inserts the virtual model in the scene, determining points in the original point cloud data that are blocked by the virtual model, and deleting the points in the original point cloud data that are blocked by the virtual model; and calculating an intersection point of each laser radar wire harness and the virtual model, taking the nearest intersection point as newly added point cloud data, and fusing the newly added point cloud data with the original point cloud data to generate point cloud data containing the virtual model.
In some embodiments, the update module 203 of fig. 2 determines, for each point in the original point cloud data, the ray from the origin of the laser radar device to that point, and determines whether the point in the original point cloud data is occluded by the virtual model according to the intersection between the ray and each patch of the virtual model; when the ray intersects a patch of the virtual model, the point in the original point cloud data is judged to be occluded by the virtual model, and otherwise the point in the original point cloud data is judged not to be occluded by the virtual model.
In some embodiments, the generating module 204 of fig. 2 calculates an external cuboid of the virtual model, converts the calculated external cuboid into a unified coordinate space, and generates a detection frame of the virtual model; the detection frame of the virtual model is added to the initial detection frame so as to update the initial detection frame, and the detection frame containing the virtual model is generated.
In some embodiments, the training module 205 of fig. 2 determines model structures and model parameters of the pre-constructed perceptual model and constructs a training dataset using the new multi-view image data, the updated point cloud data, and the updated detection frame; inputting the training data set into a perception model, optimizing model parameters by using a gradient descent algorithm, and adjusting the model parameters by minimizing a loss function; and in the model training process, the gradient descent algorithm is repeatedly utilized to execute the optimization operation of the model parameters until the loss function converges, so as to obtain the trained perception model.
In some embodiments, the detection module 206 of fig. 2 inputs the actually acquired multi-view image data and point cloud data around the vehicle into the trained perception model; in the trained perception model, extracting the characteristics of the data by utilizing a characteristic extraction module; converting the extracted features into a unified aerial view space, and performing time sequence fusion on the features in the aerial view space to obtain time sequence-related multi-mode features; the multi-mode feature is input into a decoder related to a preset task, a target object in multi-view image data is identified by the decoder, and a target detection result is output.
It should be understood that the sequence numbers of the steps in the foregoing embodiments do not imply an order of execution; the execution order of the processes should be determined by their functions and internal logic, and should not limit the implementation of the embodiments of the present application.
Fig. 3 is a schematic structural diagram of an electronic device 3 according to an embodiment of the present application. As shown in fig. 3, the electronic apparatus 3 of this embodiment includes: a processor 301, a memory 302 and a computer program 303 stored in the memory 302 and executable on the processor 301. The steps of the various method embodiments described above are implemented when the processor 301 executes the computer program 303. Or the processor 301 when executing the computer program 303 performs the functions of the modules/units in the above-described device embodiments.
Illustratively, the computer program 303 may be partitioned into one or more modules/units, which are stored in the memory 302 and executed by the processor 301 to complete the present application. One or more of the modules/units may be a series of computer program instruction segments capable of performing a specific function for describing the execution of the computer program 303 in the electronic device 3.
The electronic device 3 may be an electronic device such as a desktop computer, a notebook computer, a palm computer, or a cloud server. The electronic device 3 may include, but is not limited to, a processor 301 and a memory 302. It will be appreciated by those skilled in the art that fig. 3 is merely an example of the electronic device 3 and does not constitute a limitation of the electronic device 3, and may include more or fewer components than shown, or may combine certain components, or different components, e.g., the electronic device may also include an input-output device, a network access device, a bus, etc.
The processor 301 may be a central processing unit (CPU), another general-purpose processor, a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.
The memory 302 may be an internal storage unit of the electronic device 3, for example, a hard disk or a memory of the electronic device 3. The memory 302 may also be an external storage device of the electronic device 3, for example, a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card, or a flash memory card provided on the electronic device 3. Further, the memory 302 may also include both an internal storage unit and an external storage device of the electronic device 3. The memory 302 is used to store computer programs and other programs and data required by the electronic device. The memory 302 may also be used to temporarily store data that has been output or is to be output.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-described division of the functional units and modules is illustrated, and in practical application, the above-described functional distribution may be performed by different functional units and modules according to needs, i.e. the internal structure of the apparatus is divided into different functional units or modules to perform all or part of the above-described functions. The functional units and modules in the embodiment may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit, where the integrated units may be implemented in a form of hardware or a form of a software functional unit. In addition, the specific names of the functional units and modules are only for distinguishing from each other, and are not used for limiting the protection scope of the present application. The specific working process of the units and modules in the above system may refer to the corresponding process in the foregoing method embodiment, which is not described herein again.
In the foregoing embodiments, each embodiment has its own emphasis; for parts that are not described or illustrated in detail in a particular embodiment, reference may be made to the related descriptions of other embodiments.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
In the embodiments provided by the present application, it should be understood that the disclosed apparatus/computer device and method may be implemented in other manners. For example, the apparatus/computer device embodiments described above are merely illustrative, e.g., the division of modules or elements is merely a logical functional division, and there may be additional divisions of actual implementations, multiple elements or components may be combined or integrated into another system, or some features may be omitted, or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection via interfaces, devices or units, which may be in electrical, mechanical or other forms.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in the embodiments of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated modules/units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer readable storage medium. Based on such understanding, the present application may implement all or part of the flow of the method of the above embodiment, or may be implemented by a computer program to instruct related hardware, and the computer program may be stored in a computer readable storage medium, where the computer program, when executed by a processor, may implement the steps of each of the method embodiments described above. The computer program may comprise computer program code, which may be in source code form, object code form, executable file or some intermediate form, etc. The computer readable medium may include: any entity or device capable of carrying computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and so forth.
The above embodiments are only for illustrating the technical solution of the present application, and are not limiting; although the application has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present application, and are intended to be included in the scope of the present application.

Claims (10)

1. A target detection method based on data synthesis, characterized by comprising the following steps:
acquiring multi-view image data, original point cloud data, an initial detection frame and a pre-created virtual model;
rendering the virtual model into the multi-view image data and generating new multi-view image data;
updating the original point cloud data based on the correlation between the virtual model and the original point cloud data, so that the updated point cloud data contains the point cloud information of the virtual model;
updating the initial detection frame according to the spatial position of the virtual model to generate a detection frame containing the virtual model;
training a pre-constructed perception model by using the new multi-view image data, the updated point cloud data and the updated detection frame to obtain a trained perception model; and
taking actually acquired multi-view image data and point cloud data around the vehicle as input of the trained perception model, performing target detection by using the trained perception model, and outputting a target detection result.
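The following is a minimal sketch of how the data-synthesis steps of claim 1 compose into one augmented training sample. It is an illustration only: the helper callables (render_step, point_cloud_step, box_step) are assumed placeholders for the operations detailed in claims 2 to 5 and do not come from the application.

```python
def synthesize_training_sample(images, point_cloud, boxes, virtual_model,
                               render_step, point_cloud_step, box_step):
    """Compose one augmented sample: render the virtual model into the images (claim 2),
    update the point cloud for occlusion and new returns (claims 3-4), and append the
    model's detection frame (claim 5)."""
    new_images = render_step(images, virtual_model)              # new multi-view image data
    new_points = point_cloud_step(point_cloud, virtual_model)    # updated point cloud data
    new_boxes = list(boxes) + [box_step(virtual_model)]          # updated detection frames
    return new_images, new_points, new_boxes
```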
2. The method of claim 1, wherein rendering the virtual model into the multi-view image data and generating new multi-view image data comprises:
rendering the virtual model into the view-angle space corresponding to each camera by using a computer graphics method;
superimposing the rendered virtual model on the original multi-view image data to generate multi-view image data containing the virtual model; and
correcting and fusing the multi-view image data containing the virtual model corresponding to all the views to obtain the new multi-view image data.
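As an illustration of the per-camera rendering and superposition step of claim 2, the sketch below projects the virtual model's 3D points into one camera and splats them onto the image. It assumes a pinhole camera with known intrinsics K and extrinsics R, t; a real renderer would rasterize the model's triangles with depth testing and lighting, and all names here are illustrative rather than taken from the application.

```python
import numpy as np

def project_model_points(points_world, K, R, t):
    """Project 3D model points (N, 3) in world coordinates into pixel coordinates of one camera."""
    pts_cam = (R @ points_world.T + t.reshape(3, 1)).T   # world frame -> camera frame
    pts_cam = pts_cam[pts_cam[:, 2] > 0]                 # keep only points in front of the camera
    uv = (K @ pts_cam.T).T                               # perspective projection
    uv = uv[:, :2] / uv[:, 2:3]                          # normalize by depth to get pixel coordinates
    return uv, pts_cam[:, 2]

def overlay_on_image(image, uv, color=(0, 255, 0)):
    """Naively splat the projected model points onto a copy of the camera image."""
    out = image.copy()
    h, w = out.shape[:2]
    for u, v in uv.astype(int):
        if 0 <= u < w and 0 <= v < h:
            out[v, u] = color                            # mark the pixel hit by a model point
    return out
```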
3. The method of claim 1, wherein updating the original point cloud data based on the correlation between the virtual model and the original point cloud data comprises:
after the virtual model is inserted into the scene, determining the points in the original point cloud data that are occluded by the virtual model, and deleting those points from the original point cloud data; and
calculating the intersection points of each laser radar beam with the virtual model, taking the nearest intersection point as newly added point cloud data, and fusing the newly added point cloud data with the original point cloud data to generate point cloud data containing the virtual model.
4. The method according to claim 3, wherein determining the points in the original point cloud data that are occluded by the virtual model comprises:
constructing, for each point in the original point cloud data, a ray from the origin of the laser radar device to that point, and judging, according to whether the ray intersects any patch of the virtual model, whether the point is a point in the original point cloud data that is occluded by the virtual model; and
when the ray intersects a patch of the virtual model, judging that the point belongs to the points occluded by the virtual model; otherwise, judging that the point does not belong to the points occluded by the virtual model.
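A minimal sketch of the occlusion test in claims 3 and 4, assuming the virtual model is given as a triangle mesh (a list of vertex triples) and using the standard Möller-Trumbore ray-triangle intersection; the function names and the brute-force loop are illustrative, not the application's implementation. The same intersection routine can also serve the new points of claim 3: cast each laser radar beam against the mesh and keep the nearest hit as a synthetic return.

```python
import numpy as np

def ray_triangle_intersect(origin, direction, v0, v1, v2, eps=1e-9):
    """Moller-Trumbore test: return the distance t along the ray to the hit, or None if there is no hit."""
    e1, e2 = v1 - v0, v2 - v0
    p = np.cross(direction, e2)
    det = e1.dot(p)
    if abs(det) < eps:                      # ray is parallel to the triangle plane
        return None
    inv_det = 1.0 / det
    s = origin - v0
    u = s.dot(p) * inv_det
    if u < 0.0 or u > 1.0:
        return None
    q = np.cross(s, e1)
    v = direction.dot(q) * inv_det
    if v < 0.0 or u + v > 1.0:
        return None
    t = e2.dot(q) * inv_det
    return t if t > eps else None

def point_occluded(lidar_origin, point, triangles):
    """A point is occluded if the ray from the lidar origin to the point hits the model before reaching the point."""
    direction = point - lidar_origin
    dist = np.linalg.norm(direction)
    direction = direction / dist
    for v0, v1, v2 in triangles:
        t = ray_triangle_intersect(lidar_origin, direction, v0, v1, v2)
        if t is not None and t < dist:
            return True
    return False
```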
5. The method of claim 1, wherein updating the initial detection frame according to the spatial position of the virtual model comprises:
calculating a circumscribed cuboid of the virtual model, and converting the calculated circumscribed cuboid into a unified coordinate space to generate a detection frame of the virtual model; and
adding the detection frame of the virtual model to the initial detection frame, so as to update the initial detection frame and generate the detection frame containing the virtual model.
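A minimal sketch of the circumscribed cuboid of claim 5, assuming the virtual model is available as a set of 3D vertices and that an axis-aligned box in the unified coordinate space is sufficient; a heading (yaw) angle, if required, is omitted, and the (cx, cy, cz, dx, dy, dz) encoding is only one common convention, not the application's format.

```python
import numpy as np

def circumscribed_cuboid(vertices_model, T_model_to_unified):
    """Axis-aligned bounding cuboid of the model vertices, expressed in the unified coordinate space.

    vertices_model: (N, 3) array of model vertices; T_model_to_unified: 4x4 homogeneous transform.
    """
    ones = np.ones((vertices_model.shape[0], 1))
    verts = (T_model_to_unified @ np.hstack([vertices_model, ones]).T).T[:, :3]
    mins, maxs = verts.min(axis=0), verts.max(axis=0)
    center = (mins + maxs) / 2.0                      # box center (cx, cy, cz)
    size = maxs - mins                                # box extents (dx, dy, dz)
    return np.concatenate([center, size])
```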
6. The method of claim 1, wherein training the pre-constructed perception model using the new multi-view image data, the updated point cloud data and the updated detection frame comprises:
determining the model structure and model parameters of the pre-constructed perception model, and constructing a training data set using the new multi-view image data, the updated point cloud data and the updated detection frames;
inputting the training data set into the perception model, optimizing the model parameters by using a gradient descent algorithm, and adjusting the model parameters by minimizing a loss function; and
during model training, repeatedly performing the optimization of the model parameters with the gradient descent algorithm until the loss function converges, so as to obtain the trained perception model.
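A minimal sketch of the gradient-descent training loop of claim 6, written with PyTorch. It assumes the dataset yields (images, point cloud, detection frames) tuples and that the perception model returns its training loss when given targets; the batch size, learning rate and convergence tolerance are illustrative values, not the application's settings.

```python
import torch
from torch.utils.data import DataLoader

def train_perception_model(model, dataset, epochs=50, lr=1e-4, tol=1e-4):
    """Optimize model parameters by gradient descent until the loss function converges."""
    loader = DataLoader(dataset, batch_size=4, shuffle=True)
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)     # plain gradient descent
    prev_loss = float("inf")
    for _ in range(epochs):
        epoch_loss = 0.0
        for images, points, boxes in loader:                   # synthetic training samples
            optimizer.zero_grad()
            loss = model(images, points, targets=boxes)        # assumed: model returns its loss given targets
            loss.backward()                                     # back-propagate gradients
            optimizer.step()                                    # gradient-descent parameter update
            epoch_loss += loss.item()
        epoch_loss /= len(loader)
        if abs(prev_loss - epoch_loss) < tol:                   # stop once the loss has converged
            break
        prev_loss = epoch_loss
    return model
```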
7. The method according to claim 1, wherein taking the actually acquired multi-view image data and point cloud data around the vehicle as input of the trained perception model, performing target detection by using the trained perception model, and outputting a target detection result comprises:
inputting the actually collected multi-view image data and point cloud data around the vehicle into the trained perception model;
in the trained perception model, extracting features of the data by using a feature extraction module;
converting the extracted features into a unified bird's-eye-view space, and performing temporal fusion of the features in the bird's-eye-view space to obtain temporally correlated multi-modal features; and
inputting the multi-modal features into a decoder associated with a preset task, identifying the target objects in the multi-view image data by using the decoder, and outputting a target detection result.
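A skeleton of the inference pipeline described in claim 7, in PyTorch. All submodules (image encoder, point-cloud encoder, view transform, temporal fusion, task decoder) are placeholders passed in by the caller; their architectures, the length of the temporal window and their interfaces are assumptions for illustration, not details disclosed in the application.

```python
import torch
import torch.nn as nn

class BEVPerception(nn.Module):
    """Per-modality feature extraction -> unified bird's-eye-view space -> temporal fusion -> task decoder."""

    def __init__(self, img_encoder, pc_encoder, view_transform, temporal_fusion, decoder):
        super().__init__()
        self.img_encoder = img_encoder          # extracts features from the multi-view images
        self.pc_encoder = pc_encoder            # extracts features from the point cloud
        self.view_transform = view_transform    # lifts both feature sets into the unified BEV space
        self.temporal_fusion = temporal_fusion  # fuses current BEV features with past frames
        self.decoder = decoder                  # task-specific head that outputs detection results
        self.history = []                       # cached BEV features from previous frames

    def forward(self, images, point_cloud):
        img_feats = self.img_encoder(images)
        pc_feats = self.pc_encoder(point_cloud)
        bev = self.view_transform(img_feats, pc_feats)        # unified BEV representation
        bev = self.temporal_fusion(bev, self.history)          # temporally correlated multi-modal features
        self.history = (self.history + [bev.detach()])[-3:]    # keep a short sliding window of frames
        return self.decoder(bev)                               # target detection result
```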
8. A target detection apparatus based on data synthesis, characterized by comprising:
an acquisition module configured to acquire multi-view image data, original point cloud data, an initial detection frame and a pre-created virtual model;
a rendering module configured to render the virtual model into the multi-view image data and generate new multi-view image data;
an updating module configured to update the original point cloud data based on the correlation between the virtual model and the original point cloud data, so that the updated point cloud data contains the point cloud information of the virtual model;
a generation module configured to update the initial detection frame according to the spatial position of the virtual model, so as to generate a detection frame containing the virtual model;
a training module configured to train the pre-constructed perception model by using the new multi-view image data, the updated point cloud data and the updated detection frame to obtain a trained perception model; and
a detection module configured to take actually acquired multi-view image data and point cloud data around the vehicle as input of the trained perception model, perform target detection by using the trained perception model, and output a target detection result.
9. An electronic device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor, when executing the computer program, implements the steps of the method according to any one of claims 1 to 7.
10. A computer-readable storage medium storing a computer program, characterized in that the computer program, when executed by a processor, implements the steps of the method according to any one of claims 1 to 7.
CN202411118951.9A 2024-08-15 2024-08-15 Target detection method and device based on data synthesis, electronic equipment and medium Pending CN118657932A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202411118951.9A CN118657932A (en) 2024-08-15 2024-08-15 Target detection method and device based on data synthesis, electronic equipment and medium

Publications (1)

Publication Number Publication Date
CN118657932A true CN118657932A (en) 2024-09-17

Family

ID=92705440

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202411118951.9A Pending CN118657932A (en) 2024-08-15 2024-08-15 Target detection method and device based on data synthesis, electronic equipment and medium

Country Status (1)

Country Link
CN (1) CN118657932A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115797454A (en) * 2023-02-08 2023-03-14 深圳佑驾创新科技有限公司 Multi-camera fusion sensing method and device under bird's-eye view angle
WO2023035822A1 (en) * 2021-09-13 2023-03-16 上海芯物科技有限公司 Target detection method and apparatus, and device and storage medium
WO2023185069A1 (en) * 2022-04-01 2023-10-05 北京京东乾石科技有限公司 Object detection method and apparatus, and computer-readable storage medium and unmanned vehicle
CN116977963A (en) * 2023-07-28 2023-10-31 北京化工大学 Automatic driving multi-mode collaborative sensing method and system based on BEV visual angle
CN117746134A (en) * 2023-12-20 2024-03-22 北京百度网讯科技有限公司 Tag generation method, device and equipment of detection frame and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination