CN118784816A - 3D image generation method and system based on multi-camera shooting
- Publication number
- CN118784816A (application CN202411202744.1A)
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Abstract
The application relates to the technical field of 3D image generation, and discloses a 3D image generation method and system based on multi-camera shooting, wherein the method comprises the following steps: synchronously triggering and collecting images collected by a plurality of cameras to obtain a multi-view original image; preprocessing the multi-view original image to obtain a preprocessed multi-view image; performing feature extraction and matching on the preprocessed multi-view image to obtain a feature matching result; performing alignment and depth information calculation on the preprocessed multi-view image based on the feature matching result to obtain an aligned multi-view image and a depth map; performing 3D model reconstruction based on the aligned multi-view images and the depth map to obtain a scene 3D model; and performing stereo parallax correction and rendering processing on the scene 3D model to generate a stereo 3D image, so that high-precision and high-detail 3D model reconstruction is realized, and a high-quality and high-immersion stereo 3D image is generated.
Description
Technical Field
The application relates to the technical field of 3D image generation, in particular to a 3D image generation method and system based on multi-camera shooting.
Background
With the rapid development of virtual reality, augmented reality and mixed reality technologies, there is an increasing demand for high-quality, highly realistic 3D content. Traditional 3D content production methods often rely on complex modeling and rendering processes, which are time-consuming and labor-intensive and make accurate reconstruction of real scenes difficult. 3D image generation methods based on multi-camera shooting have therefore emerged, aiming to capture a real scene simultaneously with a plurality of cameras and to reconstruct a high-quality 3D model through computer vision and image processing techniques. However, this approach still faces many challenges in practical applications.
In the prior art, synchronization of a multi-camera system has long been a technical difficulty: time differences between cameras lead to temporal inconsistencies in the captured images, degrading the subsequent 3D reconstruction. Secondly, owing to the complexity of the shooting environment, such as illumination changes and occlusion, the acquired multi-view images often suffer from noise, chromatic aberration, distortion and other defects that adversely affect the accuracy of 3D reconstruction. In addition, in key stages such as feature extraction and matching, image alignment and depth information calculation, improving computational efficiency while guaranteeing precision is an important problem. In particular, when processing dynamic scenes, accurately identifying and handling moving objects and ensuring the spatio-temporal consistency of the reconstruction results remains a great challenge in the field. How to generate high-quality stereoscopic 3D images that present a good sense of depth and immersion on various display devices is also a key problem to be solved.
Disclosure of Invention
The application provides a 3D image generation method and system based on multi-camera shooting, which further realize high-precision and high-detail 3D model reconstruction and generate high-quality and high-immersion stereoscopic 3D images.
The first aspect of the present application provides a 3D image generating method based on multi-camera shooting, where the 3D image generating method based on multi-camera shooting includes:
Synchronously triggering and collecting images collected by a plurality of cameras to obtain a multi-view original image;
preprocessing the multi-view original image to obtain a preprocessed multi-view image;
Performing feature extraction and matching on the preprocessed multi-view image to obtain a feature matching result;
Performing alignment and depth information calculation on the preprocessed multi-view image based on the feature matching result to obtain an aligned multi-view image and a depth map;
Performing 3D model reconstruction based on the aligned multi-view images and the depth map to obtain a scene 3D model;
and carrying out stereo parallax correction and rendering processing on the scene 3D model to generate a stereo 3D image.
The second aspect of the present application provides a 3D image generating device based on multi-camera shooting, the 3D image generating device based on multi-camera shooting includes:
the acquisition module is used for synchronously triggering and acquiring the images acquired by the cameras to obtain a multi-view original image;
The preprocessing module is used for preprocessing the multi-view original image to obtain a preprocessed multi-view image;
The matching module is used for carrying out feature extraction and matching on the preprocessed multi-view images to obtain feature matching results;
the computing module is used for carrying out alignment and depth information computation on the preprocessed multi-view images based on the feature matching result to obtain aligned multi-view images and depth maps;
The reconstruction module is used for reconstructing a 3D model based on the aligned multi-view images and the depth map to obtain a scene 3D model;
And the generation module is used for carrying out stereo parallax correction and rendering processing on the scene 3D model to generate a stereo 3D image.
A third aspect of the present application provides an electronic device, comprising: a memory and at least one processor, the memory having instructions stored therein; the at least one processor invokes the instructions in the memory to cause the electronic device to perform the 3D image generation method based on multi-camera shooting described above.
A fourth aspect of the present application provides a computer-readable storage medium having instructions stored therein, which when run on a computer, cause the computer to perform the above-described 3D image generation method based on multi-camera shooting.
Compared with the prior art, the application has the following beneficial effects: clock synchronization is performed through the WiFi module, a microsecond trigger signal is generated through the central control unit, and a high-speed LED array is combined to emit light pulses of a specific mode, so that sub microsecond time alignment correction of multiple cameras is realized, and time consistency of multi-view images is greatly improved. The problems of noise, chromatic aberration, uneven exposure, distortion and the like of the multi-view images are effectively solved by adopting the technologies of a global color mapping model, a self-adaptive exposure equalization function, distortion correction of a polynomial model and the like, efficient and accurate matching among the multi-view images is realized by constructing an affine invariant feature set, global semantic feature extraction and a feature fast index tree, and accurate alignment of the multi-view images and generation of a high-precision depth map are realized by utilizing the technologies of a layered sparse beam adjustment model, multi-view three-dimensional matching cost body analysis and the like. Dynamic objects are effectively identified and processed through space-time voxelization, space-time consistency analysis and fine motion field optimization, and the space-time consistency of a reconstruction result is ensured. The 3D model reconstruction with high precision and high detail is realized by adopting the technologies of multi-resolution voxel representation, signed distance field optimization, self-adaptive surface subdivision and the like. Based on a human eye stereoscopic vision perception model and self-adaptive stereoscopic parallax adjustment, a dense light field sampling and rendering enhancement technology is combined, and a high-quality and high-immersion stereoscopic 3D image is generated.
Drawings
In order to more clearly illustrate the embodiments of the invention or the technical solutions of the prior art, the drawings used in the description of the embodiments or the prior art are briefly introduced below. It is obvious that the drawings described below show only some embodiments of the invention, and that a person skilled in the art can obtain other drawings from them without inventive effort.
The structures, proportions and sizes shown in the drawings are presented only to accompany the present disclosure and are not intended to limit the scope of the invention; any modification of the structures, change of proportions or adjustment of sizes that does not affect the effect achieved or the objective attained should still be considered to fall within the spirit and scope of the invention.
Fig. 1 is a flow chart of a 3D image generating method based on multi-camera shooting according to an embodiment of the present invention;
fig. 2 is a schematic block diagram of a structure of a 3D image generating device based on multi-camera shooting according to an embodiment of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are some, but not all embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The flow diagrams depicted in the figures are merely illustrative and not necessarily all of the elements and operations/steps are included or performed in the order described. For example, some operations/steps may be further divided, combined, or partially combined, so that the order of actual execution may be changed according to actual situations.
It is also to be understood that the terminology used in the description of the application herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used in this specification and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should be further understood that the term "and/or" as used in the present specification and the appended claims refers to any and all possible combinations of one or more of the associated listed items, and includes such combinations. Referring to fig. 1, an embodiment of a method for generating a 3D image based on multi-camera shooting in an embodiment of the present application includes:
Step 100, synchronously triggering and collecting images collected by a plurality of cameras to obtain a multi-view original image;
it can be understood that the execution subject of the present application may be a 3D image generating device based on multi-camera shooting, and may also be a terminal or a server, which is not limited herein. The embodiment of the application is described by taking a server as an execution main body as an example.
Specifically, according to the requirements of scenes and the coverage range of view angles, cameras are arranged according to an array, different view angles of the scenes are covered to the maximum extent, and enough information is ensured to be captured from multiple angles. In the camera array, each camera is provided with a WiFi module, so that the WiFi module is communicated with the central control unit through a WiFi network, and the camera array with the WiFi module is formed. And setting a central control unit, and establishing connection with the camera array with the WiFi module through the WiFi network. The central control unit is used for generating and distributing high-precision synchronous trigger signals so as to ensure that all cameras start shooting at the same time point. To achieve time synchronization on the microsecond level, the central control unit needs to generate microsecond level trigger signals. The microsecond synchronous trigger signal can ensure that the shooting time difference between cameras is less than 1 microsecond, and motion blurring or distortion caused by the time difference in an image is avoided. Such trigger signals are transmitted to the respective cameras via the WiFi network. The transmitted microsecond synchronous trigger signal can ensure that each camera starts exposure and shooting at the moment of receiving the signal, so that a primarily synchronous multi-view image is obtained. In order to improve the accuracy of time synchronization and provide a reliable time reference for subsequent image processing, a high-speed LED array is provided in the shooting scene. The high speed LED array is capable of providing a well-defined time stamp for the scene by emitting pulses of light. The light pulses are controlled by the central control unit based on microsecond synchronous trigger signals, so that the light pulses are synchronous with shooting of the camera. After the camera receives the synchronous trigger signal and starts shooting, the scene with the time mark is recorded, and a multi-view image with an optical pulse mode is formed. These images are transmitted to the cloud APP for further processing via the WiFi network. In the cloud APP, sub-microsecond time alignment correction is performed on the image by using the light pulse marks in the image. The time alignment correction ensures the accurate synchronization of the images of all view angles on a time axis by analyzing the occurrence time of the light pulse in the images, eliminates the tiny time deviation caused by camera hardware or signal transmission, and finally obtains the multi-view original image with high-precision synchronization.
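To make the light-pulse based alignment concrete, the following is a minimal Python sketch (not taken from the application) that estimates each camera's temporal offset from the interval in which the LED pulse first appears in its frames. It assumes every camera's stream has already been reduced to a per-frame mean-brightness series; the threshold choice and function names are illustrative assumptions.

```python
import numpy as np

def pulse_onset_time(timestamps, brightness, threshold_ratio=0.5):
    """Estimate when the LED pulse appears in one camera's stream.

    timestamps : 1-D array of frame capture times (seconds)
    brightness : 1-D array of per-frame mean brightness, same length
    Returns the interpolated time at which brightness first crosses a
    threshold halfway between its pre-pulse baseline and its peak.
    """
    baseline = np.median(brightness[:5])        # pre-pulse level (pulse assumed not in first frames)
    peak = brightness.max()
    thresh = baseline + threshold_ratio * (peak - baseline)
    idx = int(np.argmax(brightness >= thresh))  # first frame at/above the threshold
    if idx == 0:
        return float(timestamps[0])
    t0, t1 = timestamps[idx - 1], timestamps[idx]
    b0, b1 = brightness[idx - 1], brightness[idx]
    # linear interpolation between the two frames around the crossing
    return float(t0 + (thresh - b0) / (b1 - b0) * (t1 - t0))

def camera_offsets(streams):
    """streams: list of (timestamps, brightness) pairs, one per camera.
    Returns each camera's temporal offset relative to camera 0; subtracting
    these offsets from a camera's timestamps aligns the pulse onsets."""
    onsets = [pulse_onset_time(t, b) for t, b in streams]
    return [o - onsets[0] for o in onsets]
```

Subtracting the returned offsets from each camera's timestamps brings the pulse onsets, and hence the exposures, onto a common time axis.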
Step 200, preprocessing the multi-view original image to obtain a preprocessed multi-view image;
Specifically, denoising is performed on the multi-view original image so as to reduce noise interference in the image and obtain a preliminary denoising multi-view image. In order to improve the consistency and color accuracy of images, a standard color card is placed in a shooting scene, and the preliminary noise-reduced multi-view image is analyzed through the known color value of the standard color card so as to establish a global color mapping model. The model is used for correcting chromatic aberration among multi-view images, and ensures consistency of images captured by different cameras in color. And carrying out joint color correction on the primarily noise-reduced multi-view image based on the global color mapping model to obtain a multi-view image after color correction. In order to improve image quality, histogram analysis is performed on the color-corrected multi-view images, and pixel value distribution characteristics of each image are analyzed to obtain detailed information about brightness and contrast. Based on the pixel value distribution characteristics, an adaptive exposure equalization function is constructed, and the exposure and brightness of the function can be adjusted according to the specific situation of each image, so that the final image has higher equalization. And (3) carrying out pixel value adjustment on the multi-view image after color correction based on the self-adaptive exposure balancing function to obtain the multi-view image after exposure balancing, so that the overall brightness of the image is more uniform, and the problem of image quality caused by uneven exposure is avoided. In order to eliminate image distortion caused by the characteristics of a camera lens, the distortion correction of a polynomial model is carried out on the multi-view image after exposure equalization. Geometric deformation introduced by the lens can be accurately corrected through the polynomial model, and a multi-view image after distortion correction is obtained, so that the geometric structures of all view images are consistent. And performing super-resolution processing on the multi-view image after distortion correction to obtain a high-resolution multi-view image. The super-resolution processing technology can improve the resolution of the image through interpolation and detail compensation, so that the finally generated 3D image is clearer and finer. In order to ensure the definition and the overall visual effect of the image edge, edge sharpening and detail enhancement processing are carried out on the high-resolution multi-view image, key details in the image are enhanced, and finally the preprocessed multi-view image is obtained.
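As a rough illustration of the colour and exposure steps above, the sketch below fits a global colour mapping by least squares from standard colour-chart patches and applies an adaptive exposure equalization. The use of a 4x3 affine colour map and of CLAHE is an assumption made for this example; the application does not fix these particular forms.

```python
import cv2
import numpy as np

def fit_color_mapping(measured, reference):
    """Least-squares affine colour map from measured colour-chart patches
    to their known reference values; measured/reference are (N, 3) arrays."""
    X = np.hstack([measured, np.ones((len(measured), 1))])   # (N, 4) with bias column
    M, *_ = np.linalg.lstsq(X, reference, rcond=None)        # (4, 3) mapping matrix
    return M

def apply_color_mapping(img_bgr, M):
    """Apply the fitted global colour mapping to a whole image."""
    h, w = img_bgr.shape[:2]
    px = img_bgr.reshape(-1, 3).astype(np.float64)
    px = np.hstack([px, np.ones((px.shape[0], 1))]) @ M
    return np.clip(px, 0, 255).reshape(h, w, 3).astype(np.uint8)

def equalize_exposure(img_bgr):
    """Adaptive exposure balancing via CLAHE on the lightness channel."""
    lab = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2LAB)
    l, a, b = cv2.split(lab)
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    return cv2.cvtColor(cv2.merge([clahe.apply(l), a, b]), cv2.COLOR_LAB2BGR)
```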
Step 300, extracting and matching the characteristics of the preprocessed multi-view images to obtain characteristic matching results;
it should be noted that, feature point extraction is performed on the preprocessed multi-view images, and a preliminary feature point set with unique properties is identified from each image through an algorithm, where the feature points generally represent key parts with high recognition in the images. Based on the preliminary feature point set, an affine invariant feature set is constructed, the feature sets are ensured to be kept unchanged under affine transformation by utilizing an image processing technology, and the robustness of feature matching under different view angles is improved. And carrying out global semantic feature extraction on the affine invariant feature set, wherein semantic information of objects in the image is considered in the extraction process, so that the feature points not only depend on geometric shapes, but also have certain semantic understanding capability, and a semantic enhancement feature set is formed. By integrating semantic information, the matching accuracy can be effectively improved. The feature rapid index tree is constructed based on the semantic enhancement feature set, and the index tree structure can accelerate the matching process between feature points and reduce the calculation time. And performing rough matching of feature points among the multi-view images on the basis of the feature rapid index tree to obtain a preliminary matching pair set. And the rough matching process effectively screens the feature points through the rapid index tree to determine the feature pairs which are possibly corresponding to the feature points. And constructing a feature point local relation graph on the basis of the initial matching pair set, modeling the relation between feature points by using a graph theory method, analyzing the local geometric relation between the feature points, and improving the matching reliability. And generating an optimized matching pair set according to the characteristic point local relation diagram, removing error matching and adding new correct matching pairs through an optimization algorithm, and forming a more accurate characteristic point matching set. And establishing a global geometric consistency model based on the optimized matching pair set, and ensuring that feature point matching keeps consistency on a global scale by analyzing the geometric relation of the whole image set. And refining the matching result based on the global geometric consistency model to obtain an accurate matching pair set. By fine tuning and adjusting the preliminary matching pairs, it is ensured that each feature point pair of the matching is consistent at both the global and local levels. And based on the accurate matching pair set, carrying out fusion and propagation of multi-view features, comprehensively considering the feature information under different view angles, and obtaining a final feature matching result.
Step 400, aligning and calculating depth information of the preprocessed multi-view images based on the feature matching result to obtain aligned multi-view images and depth maps;
specifically, a multi-view geometric relationship diagram is constructed based on the feature matching result, and a topological structure of the initial view is obtained. The topological structure represents the geometric relations between images of different visual angles, and the relative position and the posture of the camera during shooting are primarily known through analyzing the relations. And optimizing the initial view topological structure, and obtaining the optimized view topological structure by eliminating redundant view relations or enhancing the connection between key views. And establishing a layered sparse beam adjustment model based on the optimized view topological structure, wherein the model can effectively process complex geometric relations among a plurality of view angles and generate pose parameters of an initial camera, namely the space position and the orientation of each camera during shooting. And carrying out iterative optimization on the initial camera pose parameters, and gradually adjusting the camera pose by minimizing the re-projection error so as to enable all view angle images to be optimally aligned in the same three-dimensional space, thereby obtaining the target camera pose parameters. Based on pose parameters of the target camera, performing geometric transformation on the preprocessed multi-view images, and adjusting objects in the images in space so that the objects are consistent as much as possible under all view angles, thereby obtaining roughly aligned multi-view images. And performing multi-view three-dimensional matching cost body analysis on the roughly aligned multi-view images, and primarily estimating the depth of the corresponding pixel point in each view image by calculating parallax information to obtain initial depth estimation. And carrying out multi-view consistency propagation based on the initial depth estimation, and obtaining an optimized depth map by comprehensively considering depth information among a plurality of view angles, propagating and adjusting the depth values. And (3) refining the optimized depth map, eliminating noise and error, and obtaining a high-precision depth map. And carrying out fine alignment on the coarsely aligned multi-view images based on the high-precision depth map, and adjusting the images so that the details of the images under all view angles are completely aligned, thereby obtaining the finely aligned multi-view images. And carrying out space-time consistency analysis on the accurately aligned multi-view images and the high-precision depth map, identifying and processing dynamic objects in the scene, wherein the objects possibly have different positions and forms under different view angles, and ensuring that the objects are consistent in a final alignment result through special processing. And obtaining the aligned multi-view image and depth map.
The essential matrix E between cameras is calculated based on the target camera pose parameters, where E = [t]_× · R is composed of the skew-symmetric (antisymmetric) matrix [t]_× of the translation vector t and the rotation matrix R. The essential matrix E is a key element describing the relative motion of cameras in a multi-camera system, providing the geometric relationship between multi-view images. On this basis, epipolar correction is performed on the preprocessed multi-view images, ensuring that the corresponding points of each image pair lie on the corrected epipolar lines, and corrected image pairs are obtained. Phase consistency transformation is carried out on the corrected image pairs, and a sub-pixel-level displacement field between the images is extracted by calculating a phase correlation function; this step accurately captures tiny displacement errors, and an affine transformation matrix is calculated from this information so that the images are geometrically adjusted, yielding coarsely aligned multi-view images. Anisotropic diffusion filtering is performed on the coarsely aligned multi-view images, smoothing the images while preserving their edges, to generate an edge-preserved smooth image set. Multi-scale Log-Gabor filtering is then performed on the smoothed image set; by analyzing image characteristics at multiple scales, this filtering generates a feature tensor field describing the local structure information of the image. A structure tensor is calculated based on the feature tensor field, and a structure consistency map is obtained by eigenvalue decomposition of the structure tensor, revealing the degree of consistency of the image under different viewing angles. The initially aligned multi-view image sets are non-rigidly registered using the structure consistency map, with finer alignment ensuring that the images are highly consistent at all views, resulting in a finely aligned multi-view image set. A quad-segmentation is constructed based on the finely aligned multi-view image set, decomposing each image into a plurality of super-pixel sets, where each super-pixel set contains a group of pixels with similar characteristics. A multi-view matching cost body is calculated based on the super-pixel sets, and an initial depth map is obtained by integrating image information from multiple viewing angles. The initial depth map is solved by applying a variational method, and the depth information is continuously adjusted and refined through iterative optimization to obtain a high-precision initial depth estimation.
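The construction E = [t]_× R can be written directly; the short sketch below is a plain restatement of that formula rather than code from the application.

```python
import numpy as np

def skew(t):
    """Skew-symmetric (antisymmetric) matrix [t]_x of a 3-vector t."""
    return np.array([[0.0,  -t[2],  t[1]],
                     [t[2],  0.0,  -t[0]],
                     [-t[1], t[0],  0.0]])

def essential_matrix(R, t):
    """Essential matrix E = [t]_x R relating normalised image points of
    two cameras through the epipolar constraint x2^T E x1 = 0."""
    return skew(t) @ R
```

For any correspondence expressed in normalised (intrinsics-removed) homogeneous coordinates, x2 @ essential_matrix(R, t) @ x1 should be close to zero, which is what the subsequent epipolar correction relies on.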
Bilateral joint upsampling is performed on the high-precision depth map and the coarsely aligned multi-view images, improving the resolution of the depth map while maintaining edge information to obtain a high-resolution depth map. A parallax gradient field is calculated based on the high-resolution depth map; this field reflects the rate of change of parallax between different images. On this basis, an affine transformation matrix is constructed and local deformation is applied to the coarsely aligned multi-view images, so that the view images are more accurately aligned in space, and finely aligned multi-view images are obtained. In order to optimize the aligned images, structure tensor analysis is performed on the finely aligned multi-view images; edge information of the images is extracted and their smoothness is maintained, generating edge-maintained smooth images. A multi-scale local phase consistency measure is calculated based on the edge-maintained smooth images; it reflects the structural similarity of the images at different scales, yields a local structural similarity map, and facilitates the identification of tiny differences in the images. A photometric consistency constraint term is constructed based on the local structural similarity map and combined with a geometric smoothness constraint term, and precisely aligned multi-view images are obtained by solving for the displacement field. Spatio-temporal voxel coding is performed on the precisely aligned multi-view images and the high-resolution depth map: the image and depth information are mapped into four-dimensional space, a 4D spatio-temporal voxel grid is constructed, and the color and depth feature vectors of each voxel are calculated to form a characterized spatio-temporal voxel field. A spatio-temporal graph is constructed based on the spatio-temporal voxel field to obtain spatio-temporal segmentation results, and object areas at different times and locations are identified through the segmentation results. Connected-domain analysis is carried out on the spatio-temporal segmentation results, and the spatio-temporal consistency score of each connected domain is calculated. By means of these scores, different dynamic object areas are identified and clustered, and a dynamic object candidate region set R is obtained; this set contains regions where dynamic objects may be present. Based on the dynamic object candidate region set, an optical flow constraint and a depth consistency constraint are constructed, a fine motion field is obtained through joint optimization, and the movement trajectory and morphological changes of the dynamic objects are captured. Time-sequence interpolation and extrapolation are carried out on the precisely aligned multi-view images and the high-resolution depth map according to the fine motion field, ensuring consistency when processing dynamic objects on the time axis, and finally the aligned multi-view images and depth map are obtained.
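A minimal sketch of the bilateral joint upsampling step is shown below: the low-resolution depth map is first upsampled and then filtered with the full-resolution colour image as the guide, so that depth discontinuities follow image edges. It relies on cv2.ximgproc.jointBilateralFilter from the opencv-contrib package, and the parameter values are illustrative assumptions rather than values from the application.

```python
import cv2
import numpy as np

def upsample_depth(depth_low, guide_bgr, d=9, sigma_color=25.0, sigma_space=9.0):
    """Joint (guided) bilateral upsampling of a low-resolution depth map,
    with the full-resolution colour image as the guide."""
    h, w = guide_bgr.shape[:2]
    depth_up = cv2.resize(depth_low.astype(np.float32), (w, h),
                          interpolation=cv2.INTER_LINEAR)      # naive upsampling first
    guide = guide_bgr.astype(np.float32)
    return cv2.ximgproc.jointBilateralFilter(guide, depth_up, d,
                                             sigma_color, sigma_space)
```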
Step 500, reconstructing a 3D model based on the aligned multi-view images and the depth map to obtain a scene 3D model;
Specifically, the aligned multi-view image and depth map are subjected to voxelization, and an initial low-resolution voxel model is obtained by mapping the image and the depth information into a three-dimensional voxel grid, so that scene data is represented in space. And performing octree data structure analysis based on the initial low-resolution voxel model, and recursively dividing the three-dimensional space into smaller subspaces to obtain the multi-resolution voxel representation. The multi-resolution voxel representation is not only capable of efficiently compressing data, but also provides detailed information at different resolutions. Signed distance field value calculation is performed on each voxel in the multi-resolution voxel representation, the values represent the distance between each voxel in the scene and the object surface, and signs are used to distinguish the voxels inside and outside the object, so as to obtain an initial SDF model. And carrying out multi-view depth fusion based on the initial SDF model, integrating depth information from different views, and eliminating deviation and inconsistency among the views to obtain an optimized SDF model. And carrying out self-adaptive subdivision on the optimized SDF model, and generating more details in a region needing higher resolution to obtain a high-precision surface grid model. Based on the high-precision surface grid model, multi-view consistency constraint is carried out, consistency of the model surface under different view angles is ensured, and an objective function of surface optimization is established. And carrying out iterative solution on the surface optimization objective function, and gradually adjusting the vertex positions of the grids to enable the model surface to be finer and obtain a refined surface model. A multi-view texture set is extracted based on the refined surface model, and rich details are added to the model surface by acquiring texture information from images from different view angles. And performing global optimization of graph cut on the multi-view texture set, selecting an optimal texture stitching path through an optimization algorithm, and eliminating boundaries between images to obtain seamless stitched texture mapping. Based on the texture mapping of seamless stitching, the texture mapping is applied to a refined surface model to generate a final scene 3D model.
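To make the signed-distance-field fusion concrete, the sketch below integrates one depth map into a dense truncated SDF voxel grid with running-average weights. A dense grid is used here only for brevity, whereas the text describes an octree-based multi-resolution representation; the truncation distance, grid layout and function names are illustrative assumptions.

```python
import numpy as np

def integrate_depth_tsdf(tsdf, weights, origin, voxel_size, depth, K, R, t, trunc=0.04):
    """Fuse one depth map into a truncated signed-distance voxel grid.

    tsdf, weights : (X, Y, Z) float arrays, updated in place
    origin        : world coordinates of voxel (0, 0, 0)
    K             : 3x3 intrinsics; R, t : world-to-camera rotation/translation
    """
    X, Y, Z = tsdf.shape
    ii, jj, kk = np.meshgrid(np.arange(X), np.arange(Y), np.arange(Z), indexing="ij")
    world = origin + voxel_size * np.stack([ii, jj, kk], axis=-1).reshape(-1, 3)
    cam = world @ R.T + t                        # voxel centres in camera coordinates
    z = cam[:, 2]
    uvw = cam @ K.T                              # project with the intrinsics
    u = np.round(uvw[:, 0] / np.maximum(z, 1e-9)).astype(int)
    v = np.round(uvw[:, 1] / np.maximum(z, 1e-9)).astype(int)
    h, w = depth.shape
    valid = (z > 0) & (u >= 0) & (u < w) & (v >= 0) & (v < h)
    d_meas = np.zeros_like(z)
    d_meas[valid] = depth[v[valid], u[valid]]
    valid &= d_meas > 0
    sdf = np.clip((d_meas - z) / trunc, -1.0, 1.0)   # truncated signed distance
    upd = valid & (d_meas - z > -trunc)              # skip voxels far behind the surface
    flat_t, flat_w = tsdf.reshape(-1), weights.reshape(-1)
    flat_t[upd] = (flat_t[upd] * flat_w[upd] + sdf[upd]) / (flat_w[upd] + 1.0)
    flat_w[upd] += 1.0
```

Calling this once per aligned view accumulates a running average of the signed distances; a surface mesh can afterwards be extracted from the zero level set, for example with marching cubes.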
And 600, performing stereo parallax correction and rendering processing on the scene 3D model to generate a stereo 3D image.
Specifically, depth analysis is performed on a 3D model of a scene based on a stereoscopic vision perception model of human eyes, and depth distribution characteristics of the scene are obtained by simulating perception of the human eyes to the depth, so that relative distances and layering of different objects in space are reflected. According to the depth distribution characteristics of the scene, a self-adaptive stereoscopic parallax adjustment model is constructed, and the model can dynamically adjust parallax so as to adapt to objects in different depth ranges, and the stereoscopic effect of the final image is ensured to be consistent with the visual experience of human eyes. And performing parallax adjustment on the scene 3D model based on the self-adaptive stereoscopic parallax adjustment model to obtain a preliminary stereoscopic parallax correction result. And constructing a parallax continuity optimization model based on the preliminary stereo parallax correction result, carrying out fine processing on a large parallax region through the model, eliminating an unnatural effect caused by overlarge parallax change, and obtaining an optimized stereo parallax correction result. And generating dense light field sampling data based on the optimized stereo parallax correction result, wherein the dense light field sampling data contains abundant ray information and view angle details in a scene. In order to improve the data processing efficiency, the dense light field sampling data is compressed to generate a light field model, and the model greatly reduces the storage and calculation cost while guaranteeing the light field information integrity, thereby being convenient for quick processing in practical application. Based on the light-weight light-field model, an initial rendering result is constructed, and light-field data are converted into visual images. And performing image enhancement processing on the initial rendering result, and improving the detail expressive force and contrast of the image to obtain a high-quality rendering image. And performing space-time analysis and jitter suppression processing on the high-quality rendering image, so that the jitter phenomenon caused by the change of the visual angle or the display equipment is effectively reduced, and a smooth and stable stereoscopic 3D image is obtained.
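As a rough illustration of disparity-based stereo synthesis, the sketch below shifts pixels horizontally by a disparity derived from depth to produce a left/right pair, in the spirit of depth-image-based rendering. It omits the human-eye perception model, parallax continuity optimization, light-field sampling and hole filling described above; the disparity scaling is an assumed stand-in, not the adjustment model of the application.

```python
import numpy as np

def render_stereo_pair(img, depth, baseline_px=20.0):
    """Shift each pixel horizontally by a depth-dependent disparity to
    synthesise a left/right view pair. Assumes a dense, strictly positive
    depth map; disocclusion holes are left unfilled in this sketch."""
    h, w = img.shape[:2]
    # nearest surface receives the full disparity budget, farther ones less
    disparity = baseline_px * depth.min() / depth
    left = np.zeros_like(img)
    right = np.zeros_like(img)
    xs = np.arange(w)
    for y in range(h):
        xl = np.clip((xs + disparity[y] / 2.0).astype(int), 0, w - 1)
        xr = np.clip((xs - disparity[y] / 2.0).astype(int), 0, w - 1)
        left[y, xl] = img[y, xs]
        right[y, xr] = img[y, xs]
    return left, right
```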
In the embodiment of the application, the WiFi module is used for clock synchronization and the central control unit is used for generating microsecond triggering signals, and the high-speed LED array is combined for transmitting the light pulse of a specific mode, so that the sub microsecond time alignment correction of the multiple cameras is realized, and the time consistency of the multi-view images is greatly improved. The problems of noise, chromatic aberration, uneven exposure, distortion and the like of the multi-view images are effectively solved by adopting the technologies of a global color mapping model, a self-adaptive exposure equalization function, distortion correction of a polynomial model and the like, efficient and accurate matching among the multi-view images is realized by constructing an affine invariant feature set, global semantic feature extraction and a feature fast index tree, and accurate alignment of the multi-view images and generation of a high-precision depth map are realized by utilizing the technologies of a layered sparse beam adjustment model, multi-view three-dimensional matching cost body analysis and the like. Dynamic objects are effectively identified and processed through space-time voxelization, space-time consistency analysis and fine motion field optimization, and the space-time consistency of a reconstruction result is ensured. The 3D model reconstruction with high precision and high detail is realized by adopting the technologies of multi-resolution voxel representation, signed distance field optimization, self-adaptive surface subdivision and the like. Based on a human eye stereoscopic vision perception model and self-adaptive stereoscopic parallax adjustment, a dense light field sampling and rendering enhancement technology is combined, and a high-quality and high-immersion stereoscopic 3D image is generated.
In a specific embodiment, the process of executing step 100 may specifically include the following steps:
Arranging the cameras in an array manner to obtain a camera array, and configuring a WiFi module for each camera in the camera array to obtain a camera array with the WiFi module;
setting a central control unit, establishing connection with the camera array with the WiFi module through a WiFi network, and constructing a camera system with the central control unit;
Generating a microsecond-level trigger signal based on the central control unit to obtain a microsecond-level synchronous trigger signal, and transmitting the microsecond-level synchronous trigger signal to each camera through a WiFi network;
Triggering the camera system with the central control unit based on the transmitted microsecond synchronous triggering signal to obtain a primarily synchronous multi-view image;
Setting a high-speed LED array in a shooting scene, and controlling the high-speed LED array to emit light pulses in a specific mode based on the microsecond synchronous trigger signal to obtain a scene with a time mark;
shooting the scene with the time mark to obtain a multi-view image with a light pulse mode, and transmitting the multi-view image with the light pulse mode to a cloud APP through a WiFi network;
And performing sub-microsecond time alignment correction in the cloud APP based on the light pulse mode in the multi-view image with the light pulse mode to obtain a multi-view original image.
Specifically, according to the requirements of scenes and the coverage range of view angles, cameras are arranged according to an array, different view angles of the scenes are covered to the maximum extent, and enough information is ensured to be captured from multiple angles. In the camera array, each camera is provided with a WiFi module, so that the WiFi module is communicated with the central control unit through a WiFi network, and the camera array with the WiFi module is formed. And establishing connection between the central control unit and the camera array through a WiFi network. The central control unit is used for generating and distributing high-precision synchronous trigger signals so as to ensure that all cameras start shooting at the same time point. to achieve time synchronization on the microsecond level, the central control unit needs to generate microsecond level trigger signals. The microsecond synchronous trigger signal can ensure that the shooting time difference between cameras is less than 1 microsecond, and motion blurring or distortion caused by the time difference in an image is avoided. Such trigger signals are transmitted to the respective cameras via the WiFi network. Although WiFi transmissions may introduce some delay, by embedding timestamp information in the signal, these delays can be compensated for at the receiving end. The transmitted microsecond synchronous trigger signal can ensure that each camera starts exposure and shooting at the moment of receiving the signal, so that a primarily synchronous multi-view image is obtained. In order to improve the accuracy of time synchronization and provide a reliable time reference for subsequent image processing, a high-speed LED array is provided in the shooting scene. The high speed LED array is capable of providing a well-defined time stamp for the scene by emitting pulses of light. The light pulses are controlled by the central control unit based on microsecond synchronous trigger signals, so that the light pulses are synchronous with shooting of the camera. The time stamp can be used as a reference for time alignment in subsequent image processing by introducing an identifiable optical signal in the scene. After the camera receives the synchronous trigger signal and starts shooting, the scene with the time mark is recorded, and a multi-view image with an optical pulse mode is formed. Since the image shot by each camera contains the light pulse marks, the marks can be used for performing sub-microsecond time alignment correction on the image in the subsequent processing process. The multi-view image with the light pulse mode is transmitted to the cloud APP through the WiFi network. In the cloud APP, sub-microsecond time alignment correction is performed on the image by using the light pulse marks in the image. The time alignment correction ensures the accurate synchronization of the images of all view angles on a time axis by analyzing the occurrence time of the light pulse in the images, eliminates the tiny time deviation caused by camera hardware, wiFi transmission or signal processing, and finally obtains the multi-view original images with high-precision synchronization. To better understand this time synchronization process, we can describe it by a simple mathematical model. The trigger signal time received by each camera i is set to t i, where i=1, 2. By averaging the time signals of all cameras, a system average time T avg is obtained as a reference time:
T_avg = (1/n) * Σ_{i=1}^{n} t_i;
wherein T_avg is the average of the camera times and serves as the time reference of the whole system. If the clock of a certain camera i deviates, time synchronization of the whole system is achieved by adjusting the internal clock of camera i so that t_i equals T_avg.
In processing the light pulse markers, the time difference Δt_i of the light pulse arrival at camera i is defined, and this difference is corrected by a corresponding correction algorithm, ensuring that all images are recorded at the same point in time. Such time correction can be expressed as:
t_i' = t_i + Δt_i;
where t_i' is the corrected timestamp; by adjusting the timestamp of each camera, it is ensured that the final images are completely consistent in time.
Through the steps, high-precision time synchronization and visual angle alignment can be realized, and finally an accurate multi-visual angle original image is generated in the cloud APP, so that a foundation is laid for subsequent generation of 3D images or dynamic TIF images. The WiFi and cloud processing-based method not only simplifies the hardware requirements, but also provides greater flexibility and expandability, so that the system can be more easily adapted to different application scenes and requirements.
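The averaging and correction formulas above translate directly into a few lines of Python; the sketch below is a plain transcription of T_avg = (1/n) * Σ t_i and t_i' = t_i + Δt_i, with Δt_i chosen so that every corrected timestamp coincides with the reference.

```python
import numpy as np

def synchronise_timestamps(trigger_times):
    """trigger_times: iterable of per-camera trigger reception times t_i.
    Returns the reference T_avg, the per-camera corrections delta_i, and the
    corrected timestamps t_i' = t_i + delta_i (all equal to T_avg)."""
    t = np.asarray(trigger_times, dtype=float)
    t_avg = t.mean()                 # T_avg = (1/n) * sum(t_i)
    deltas = t_avg - t               # correction needed by each camera
    return t_avg, deltas, t + deltas
```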
In a specific embodiment, the process of performing step 200 may specifically include the following steps:
Denoising the multi-view original image to obtain a preliminary denoising multi-view image, placing a standard color card in a shooting scene, and analyzing the preliminary denoising multi-view image based on the standard color card to obtain a global color mapping model;
Performing joint color correction on the primarily noise-reduced multi-view images based on the global color mapping model to obtain color-corrected multi-view images, and performing histogram analysis on the color-corrected multi-view images to obtain pixel value distribution characteristics of each image;
Constructing an adaptive exposure balance function based on the pixel value distribution characteristics, and adjusting pixel values of the color corrected multi-view image based on the adaptive exposure balance function to obtain an exposure balanced multi-view image;
performing distortion correction of a polynomial model on the multi-view image subjected to exposure equalization to obtain a multi-view image subjected to distortion correction, and performing super-resolution processing on the multi-view image subjected to distortion correction to obtain a high-resolution multi-view image;
And carrying out edge sharpening and detail enhancement processing on the high-resolution multi-view image to obtain a preprocessed multi-view image.
Specifically, denoising is performed on the multi-view original image, and noise introduced by various factors in the shooting process, such as sensor noise or random noise under low illumination conditions, is eliminated. Common denoising methods include Gaussian filtering, median filtering, and higher-level non-local means filtering. By analyzing the similarity or spatial relations among pixels, the details and textures of the image are kept as much as possible while noise is removed, and a preliminary noise-reduced multi-view image is obtained. A standard color chart is placed in the shooting scene. The standard color chart is a tool with known color reference values, commonly used for color correction in image processing. By placing a standard color chart in the shooting scene, the preliminary noise-reduced multi-view image is analyzed using the known color reference values, and a global color mapping model is established. The model is constructed by comparing the actual color values of the standard color chart in the image with the expected standard color values, and can identify color shifts in the image and correct them correspondingly. Based on the established global color mapping model, the preliminary noise-reduced multi-view images are subjected to joint color correction, ensuring that the color appearance of all view images is consistent and avoiding color inconsistencies caused by different cameras or shooting conditions. The color values of each image are mapped into a uniform color space to obtain the color-corrected multi-view images. Histogram analysis is carried out on the color-corrected multi-view images. By counting the number of pixels at different brightness levels in each image, the overall brightness distribution and contrast of the image can be reflected, and the shape of the histogram can help identify whether the image suffers from underexposure or overexposure. The pixel value distribution characteristics are extracted by analyzing the histogram of each image, and an adaptive exposure equalization function is constructed based on these characteristics. The function gives the image a more uniform overall brightness distribution by adjusting the exposure of regions of different brightness. The adaptive exposure equalization improves the contrast and sharpness of the image by stretching or compressing its histogram so that the luminance values are more widely distributed throughout the available luminance range. Pixel value adjustment is carried out on the color-corrected multi-view images through this function to obtain the exposure-equalized multi-view images. Distortion correction based on a polynomial model is then carried out on the exposure-equalized multi-view images, eliminating geometric distortion caused by the camera lens, which is particularly obvious with wide-angle or fisheye lenses. A polynomial model may be used to approximate the distortion by adjusting the position of each pixel in the image to restore it to an undistorted view. Assume that the coordinates in the original image are (x, y) and the corrected coordinates are (x', y'); the polynomial model may be expressed as:
x' = Σ a_ij · x^i · y^j ;
y' = Σ b_ij · x^i · y^j ;
wherein a_ij and b_ij are the coefficients of the model, and the distortion correction is realized by fitting these coefficients, so as to obtain distortion-corrected multi-view images. Super-resolution processing is then performed on the distortion-corrected multi-view images. The low-resolution image is converted into a high-resolution image through an algorithm; common methods include conventional interpolation-based methods and modern deep-learning-based methods. The super-resolution processing can restore details and textures in the image, so that the final image is clearer, and a high-resolution multi-view image is obtained. Edge sharpening and detail enhancement processing are performed on the high-resolution multi-view image. Edge sharpening makes object outlines more clearly visible by enhancing the contrast of edge portions of the image, while detail enhancement improves the definition of textures and fine structures so that the image looks finer; finally the preprocessed multi-view image is obtained.
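The polynomial mapping above can be fitted from calibration correspondences (for example, detected grid corners and their ideal positions) by ordinary least squares. The sketch below uses a second-order polynomial purely as an example; the application does not fix the polynomial degree, and the helper names are illustrative.

```python
import numpy as np

def monomials(xy, order=2):
    """Design matrix of monomials x^i * y^j with i + j <= order."""
    x, y = xy[:, 0], xy[:, 1]
    cols = [x**i * y**j for i in range(order + 1) for j in range(order + 1 - i)]
    return np.stack(cols, axis=1)

def fit_polynomial_distortion(xy_distorted, xy_corrected, order=2):
    """Least-squares fit of the polynomial coordinate mapping: each corrected
    coordinate is modelled as a polynomial in the distorted (x, y)."""
    A = monomials(xy_distorted, order)
    coeff_x, *_ = np.linalg.lstsq(A, xy_corrected[:, 0], rcond=None)
    coeff_y, *_ = np.linalg.lstsq(A, xy_corrected[:, 1], rcond=None)
    return coeff_x, coeff_y

def undistort_points(xy, coeff_x, coeff_y, order=2):
    """Apply the fitted mapping to distorted points."""
    A = monomials(xy, order)
    return np.stack([A @ coeff_x, A @ coeff_y], axis=1)
```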
In a specific embodiment, the process of executing step 300 may specifically include the following steps:
Extracting feature points of the preprocessed multi-view image to obtain a preliminary feature point set, and constructing an affine invariant feature set based on the preliminary feature point set;
Carrying out global semantic feature extraction on the affine invariant feature set to obtain a semantic enhancement feature set, and constructing a feature quick index tree based on the semantic enhancement feature set;
Performing rough matching on feature points among the multi-view images based on the feature rapid index tree to obtain a primary matching pair set, and constructing a feature point local relation diagram based on the primary matching pair set;
Generating an optimized matching pair set according to the characteristic point local relation graph, and establishing a global geometric consistency model based on the optimized matching pair set;
And refining the matching result based on the global geometric consistency model to obtain an accurate matching pair set, and carrying out multi-view feature fusion and propagation based on the accurate matching pair set to obtain a feature matching result.
Specifically, feature point extraction is performed on the preprocessed multi-view image. These feature points represent key locations in the image with unique textures or shapes, often used for subsequent image matching and 3D reconstruction. Common feature point extraction algorithms include SIFT (scale invariant feature transform), SURF (speeded up robust feature), ORB (Oriented FAST and Rotated BRIEF), etc., which can detect feature points at multiple scales and calculate a descriptor for each feature point to obtain a preliminary feature point set. To cope with affine transformations between different perspectives, an affine invariant feature set is constructed based on the preliminary feature point set. The affine invariant feature can remain stable when affine transformations such as scaling, rotation, translation, etc. occur to the image. Transforming the preliminary feature point set through an affine transformation matrix ensures that the same object features under different viewing angles can be consistently described. Assuming that the coordinates of a feature point in the image before transformation areTransformed coordinates areThe affine transformation can be expressed as:
;
Where a, b, c, d are parameters of the affine transformation matrix and e, f are translation vectors. By applying this transformation to the feature points, an affine invariant feature set is obtained. In order to extract higher-level feature information, global semantic feature extraction is performed on affine invariant feature sets, semantic information of objects in an image is captured, and for example, object types, shapes and spatial relations in the image are identified. And improving the local features of the image into features with semantic meanings through a deep learning technology to form a semantic enhancement feature set. The semantic enhancement features not only consider pixel-level similarity, but also integrate object recognition and classification information, thereby being beneficial to improving the matching accuracy. And constructing a feature quick index tree based on the semantic enhancement feature set. The feature quick index tree is a data structure, can efficiently organize and search feature points, and can be quickly matched in a large number of feature points. Common indexing structures include KD-trees and LSHs, which enable feature points to be mapped into a particular space by hierarchical or hash methods, thereby speeding up the matching process. And carrying out rough matching on the characteristic points among the multi-view images according to the index tree to obtain a primary matching pair set. Through quick search, the feature point pairs which possibly correspond to different images are found out. And screening the preliminary matching pair set. And constructing a characteristic point local relation diagram based on the preliminary matching pair set. The local relationship graph reveals their relative positions and geometric layout in the image by analyzing the spatial relationship between the feature points. By the graph theory method, the matching pairs which do not accord with the geometric relationship are identified, and are removed, so that an optimized matching pair set is generated. And establishing a global geometric consistency model based on the optimized matching pair set. The global geometric consistency model ensures that feature point matching is consistent on a global scale by comprehensively considering the matching results of all view images. The model generally optimizes the matching results by minimizing the reprojection errors or geometric errors. The re-projection error can be expressed as:
$$E = \sum_{i} \left\| x_i - \hat{x}_i \right\|^2$$
where $x_i$ is the location of the actual matching point, $\hat{x}_i$ is the location of the projected point calculated from the matching model, and $E$ is the sum of the re-projection errors over all matching points. An optimal global geometric consistency model is obtained by optimizing this error function. The matching result is then refined based on the global geometric consistency model: through finer geometric correction and optimization, incorrect matches are eliminated and the confidence of correct matches is enhanced, giving an accurate matching pair set. Multi-view feature fusion and propagation are carried out on the basis of the accurate matching pair set to obtain the feature matching result. Multi-view feature fusion generates more robust and accurate matching results by integrating feature point information from different viewing angles, while feature propagation uses the known matching information to infer new matching pairs in viewing angles that have not yet been matched, enlarging the matching coverage.
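As an illustration of the coarse-to-fine matching described above, the following sketch uses OpenCV ORB features, a FLANN/LSH index standing in for the feature quick index tree, a ratio test for screening, and fundamental-matrix RANSAC as a substitute for the global geometric consistency check; the function choices and thresholds are assumptions for illustration, not the method defined by this embodiment.

```python
# Minimal sketch: feature extraction, indexed coarse matching, and geometric screening.
import cv2
import numpy as np

def coarse_match(img_a, img_b, n_features=2000):
    orb = cv2.ORB_create(nfeatures=n_features)
    kp_a, des_a = orb.detectAndCompute(img_a, None)
    kp_b, des_b = orb.detectAndCompute(img_b, None)

    # LSH index for binary descriptors (a stand-in for the "feature quick index tree")
    index_params = dict(algorithm=6, table_number=6, key_size=12, multi_probe_level=1)
    flann = cv2.FlannBasedMatcher(index_params, dict(checks=50))
    knn = flann.knnMatch(des_a, des_b, k=2)

    # Ratio test keeps only distinctive preliminary matches
    good = [m for m, n in (p for p in knn if len(p) == 2)
            if m.distance < 0.7 * n.distance]

    # Geometric screening: RANSAC on the fundamental matrix removes pairs that
    # violate epipolar geometry, approximating the global consistency step
    pts_a = np.float32([kp_a[m.queryIdx].pt for m in good])
    pts_b = np.float32([kp_b[m.trainIdx].pt for m in good])
    F, mask = cv2.findFundamentalMat(pts_a, pts_b, cv2.FM_RANSAC, 1.0, 0.999)
    if mask is None:                      # too few matches to fit a model
        return kp_a, kp_b, good
    inliers = [m for m, keep in zip(good, mask.ravel()) if keep]
    return kp_a, kp_b, inliers
```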
In a specific embodiment, the process of performing step 400 may specifically include the following steps:
Constructing a multi-view geometric relationship diagram based on the feature matching result to obtain an initial view topological structure, and optimizing the initial view topological structure to obtain an optimized view topological structure;
Establishing a layered sparse beam adjustment model based on the optimized view topological structure, generating initial camera pose parameters, performing iterative optimization on the initial camera pose parameters, and minimizing a reprojection error to obtain target camera pose parameters;
performing geometric transformation on the preprocessed multi-view images based on the pose parameters of the target camera to obtain coarse aligned multi-view images, and performing multi-view three-dimensional matching cost body analysis on the coarse aligned multi-view images to obtain initial depth estimation;
performing multi-view consistency propagation based on the initial depth estimation to obtain an optimized depth map, and performing refinement treatment on the optimized depth map to obtain a high-precision depth map;
and carrying out fine alignment on the coarsely aligned multi-view images based on the high-precision depth map to obtain precisely aligned multi-view images, carrying out space-time consistency analysis on the precisely aligned multi-view images and the high-precision depth map, and identifying and processing dynamic objects to obtain aligned multi-view images and depth maps.
Specifically, a multi-view geometric relationship graph is constructed based on the feature matching results, describing the relative geometric relationships between different viewing angles. A multi-view topological structure is built by analyzing the positional relationships of the matched feature points across all view angles; this topology represents an initial estimate of the relative positions and poses of the cameras. The initial view topological structure is then optimized: unreliable view connections are eliminated, the relations between key views are strengthened, and the relative positions between views are adjusted by an optimization algorithm to obtain the optimized view topological structure. A layered sparse beam adjustment (bundle adjustment) model is established based on the optimized view topological structure. Layered sparse bundle adjustment is a commonly used multi-view reconstruction technique that jointly optimizes the camera poses and the three-dimensional point positions, simultaneously correcting projection errors between images and inaccuracies in the camera poses. The core of this process is to minimize the re-projection error, i.e. to make the projected points of the three-dimensional points on all images as consistent as possible with the actually observed feature points. The re-projection error can be expressed by the following formula:
$$E = \sum_{i}\sum_{j} \left\| x_{ij} - \pi\!\left( R_j X_i + t_j \right) \right\|^2$$
where $E$ is the total re-projection error, $x_{ij}$ represents the actually observed position of the $i$-th three-dimensional point in the $j$-th image, and $\pi(R_j X_i + t_j)$ represents the calculated position of the three-dimensional point $X_i$ after projection with the camera rotation matrix $R_j$ and translation vector $t_j$. The optimal camera pose parameters, i.e. the target camera pose parameters, are obtained by iterative optimization of this error function. The preprocessed multi-view images are geometrically transformed based on the target camera pose parameters, aligning all images into a common three-dimensional space to obtain coarsely aligned multi-view images. Multi-view stereo matching cost body analysis is then performed on the coarsely aligned multi-view images. The stereo matching cost body analysis preliminarily estimates the depth of each pixel by computing a disparity map. A disparity map represents the pixel offset of the same scene object between different viewing angles, which is directly related to the depth of the object. Based on the preliminary disparity estimation, an initial depth map is obtained. Multi-view consistency propagation is performed based on the initial depth estimate: the depth information of the multiple viewing angles is integrated and propagated, inconsistent depth values are eliminated, and an optimized depth map is obtained. The optimized depth map is refined to remove noise and errors, improving its precision and yielding a high-precision depth map. The coarsely aligned multi-view images are finely aligned based on the high-precision depth map: by adjusting the geometric transformation parameters of the images, the images are made to conform to the depth information at the pixel level, giving precisely aligned multi-view images. Space-time consistency analysis is performed on the precisely aligned multi-view images and the high-precision depth map to identify and handle dynamic objects in the scene; such objects may be located at different positions at different time points, causing inconsistencies between the depth map and the images. By analyzing these inconsistencies, dynamic objects are accurately identified and their influence on the 3D reconstruction is eliminated through corresponding processing, finally yielding the aligned multi-view images and depth map.
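A minimal sketch of the re-projection error minimized by bundle adjustment is given below, using SciPy's least_squares and OpenCV's projectPoints; the Rodrigues pose parameterization and the dense (non-sparse) solver are simplifying assumptions rather than the layered sparse beam adjustment of this embodiment.

```python
# Minimal sketch: residuals E_ij = x_ij - pi(R_j X_i + t_j) for a least-squares solver.
import numpy as np
import cv2
from scipy.optimize import least_squares

def reprojection_residuals(params, n_cams, n_pts, cam_idx, pt_idx, observed_xy, K):
    # params packs, per camera, a 3-vector rotation (Rodrigues) and a 3-vector
    # translation, followed by all 3D point coordinates
    cam_params = params[:n_cams * 6].reshape(n_cams, 6)
    points_3d = params[n_cams * 6:].reshape(n_pts, 3)
    residuals = []
    for c, p, xy in zip(cam_idx, pt_idx, observed_xy):
        rvec, tvec = cam_params[c, :3], cam_params[c, 3:]
        proj, _ = cv2.projectPoints(points_3d[p].reshape(1, 3).astype(np.float64),
                                    rvec, tvec, K, None)
        residuals.append(proj.ravel() - xy)
    return np.concatenate(residuals)

# Hypothetical usage, given an initial packed parameter vector x0 and observations:
# result = least_squares(reprojection_residuals, x0,
#                        args=(n_cams, n_pts, cam_idx, pt_idx, observed_xy, K))
```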
In a specific embodiment, the performing step performs geometric transformation on the preprocessed multi-view image based on the pose parameter of the target camera to obtain a coarsely aligned multi-view image, and performs multi-view stereo matching cost-body analysis on the coarsely aligned multi-view image, and the process of obtaining the initial depth estimate may specifically include the following steps:
Calculating an essential matrix E among cameras based on pose parameters of a target camera, wherein E = [t]× R, [t]× is an antisymmetric matrix of a translation vector t, and R is a rotation matrix; performing epipolar line correction on the preprocessed multi-view image to obtain a corrected image pair;
Performing phase consistency transformation on the corrected image pair, calculating a phase correlation function to obtain a sub-pixel level displacement field, and calculating an affine transformation matrix based on the sub-pixel level displacement field to obtain a coarsely aligned multi-view image;
performing anisotropic diffusion filtering on the coarsely aligned multi-view images to obtain a smooth image set with maintained edges, and performing multi-scale Log-Gabor filtering on the smooth image set to obtain a characteristic tensor field;
Calculating a structure tensor based on the characteristic tensor field, performing eigenvalue decomposition on the structure tensor to obtain a structure consistency map, and performing non-rigid registration on the preliminarily aligned multi-view image set based on the structure consistency map to obtain a finely aligned multi-view image set;
And constructing quaternary body segmentation for the finely aligned multi-view image set to obtain a super-pixel set, calculating a multi-view matching cost body based on the super-pixel set to obtain an initial depth map, solving the initial depth map by applying a variational method, and obtaining initial depth estimation through iterative optimization.
Specifically, the essential matrix E between cameras is calculated based on the pose parameters of the target camera. The essential matrix E describes the relative motion between the two cameras and is closely related to the relationship between points in three-dimensional space and their projections onto the image planes. The essential matrix can be expressed as E = [t]× R, where [t]× is the anti-symmetric (skew-symmetric) matrix of the translation vector t and R is the rotation matrix. The anti-symmetric matrix [t]× can be expressed as:
$$[t]_{\times} = \begin{pmatrix} 0 & -t_z & t_y \\ t_z & 0 & -t_x \\ -t_y & t_x & 0 \end{pmatrix}$$
where $t_x$, $t_y$, $t_z$ are the components of the translation vector t along the x, y and z axes, respectively. By substituting the camera rotation matrix R and translation vector t into the above formula, the essential matrix E describing the geometric relationship between the two cameras is obtained. Epipolar line correction is then carried out on the preprocessed multi-view images, constraining corresponding points in the images to the same epipolar line and thereby simplifying the stereo matching process. Through epipolar correction, the image distortion caused by different shooting angles between cameras is eliminated, so that the corresponding points of each pair of images are aligned on corresponding epipolar lines, giving corrected image pairs. A phase consistency transformation is performed on the corrected image pair. By analyzing the phase information of different frequency components in the image, the phase consistency transformation effectively extracts the edge and texture features of the image. A phase correlation function is calculated to obtain the phase consistency information between images, and a sub-pixel-level displacement field is computed. The phase correlation function is typically used to compare similarities between images in the frequency domain and can provide higher accuracy than conventional pixel-level alignment. From this information an affine transformation matrix is calculated and the images are geometrically corrected, yielding coarsely aligned multi-view images. Anisotropic diffusion filtering is performed on the coarsely aligned multi-view images. Anisotropic diffusion filtering removes noise while preserving image edge information: through directional constraints in the diffusion process, the image is kept sharp near edges while noise is removed in smooth areas, giving an edge-preserving smooth image set. Multi-scale Log-Gabor filtering is performed on the edge-preserving smooth image set. The Log-Gabor filter has excellent frequency-domain behaviour and can extract local image features at different scales; applying Log-Gabor filtering on multiple scales generates a feature tensor field containing rich feature information from the image. A structure tensor is calculated based on the feature tensor field. The structure tensor is a mathematical tool for describing the local structural information of an image; performing eigenvalue decomposition on the structure tensor yields a structure consistency map that reflects the structural similarity of different image regions. The preliminarily aligned multi-view image sets are non-rigidly registered based on the structure consistency map. Non-rigid registration allows local deformation between images, achieving higher alignment accuracy at the pixel level; through this registration, a finely aligned multi-view image set is obtained. A quad segmentation is constructed on the finely aligned multi-view image set, generating a super-pixel set. The super-pixel set is a representation that divides the image into many small regions with similar characteristics, which effectively reduces the computational complexity of image processing while preserving important edge information. A multi-view matching cost body is calculated based on the super-pixel set, generating an initial depth map.
The multi-view matching cost body analysis estimates the depth information of each pixel by comparing super-pixels across images from different viewing angles, giving the initial depth map. A variational method is then applied to solve the initial depth map. The variational method obtains a more accurate initial depth estimate by defining an energy function and iteratively optimizing it so as to minimize the overall error of the depth map. The core of this process is to find an optimal depth distribution so that the re-projection errors of all view images are minimized.
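The following sketch assembles the essential matrix E = [t]× R from a given relative pose and derives rectification maps with OpenCV's stereoRectify; the shared intrinsic matrix K, zero distortion, and the specific OpenCV calls are illustrative assumptions rather than the exact correction used in this embodiment.

```python
# Minimal sketch: essential matrix from relative pose, plus epipolar rectification maps.
import numpy as np
import cv2

def skew(t):
    """Anti-symmetric matrix [t]_x of a translation vector t = (tx, ty, tz)."""
    tx, ty, tz = t
    return np.array([[0.0, -tz,  ty],
                     [tz,  0.0, -tx],
                     [-ty, tx,  0.0]])

def essential_matrix(R, t):
    """E = [t]_x R for the relative rotation R and translation t."""
    return skew(t) @ R

def rectify_pair(K, R, t, image_size):
    """Rectification maps for one image pair, assuming identical intrinsics K
    and zero lens distortion (both assumptions for illustration)."""
    dist = np.zeros(5)
    R1, R2, P1, P2, Q, _, _ = cv2.stereoRectify(K, dist, K, dist,
                                                image_size, R, t.reshape(3, 1))
    map_left = cv2.initUndistortRectifyMap(K, dist, R1, P1, image_size, cv2.CV_32FC1)
    map_right = cv2.initUndistortRectifyMap(K, dist, R2, P2, image_size, cv2.CV_32FC1)
    return map_left, map_right   # each is (map_x, map_y), usable with cv2.remap
```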
In a specific embodiment, the performing step performs fine alignment on the coarsely aligned multi-view image based on the high-precision depth map to obtain an accurately aligned multi-view image, performs space-time consistency analysis on the accurately aligned multi-view image and the high-precision depth map, and identifies and processes the dynamic object, so that the process of obtaining the aligned multi-view image and the depth map may specifically include the following steps:
performing bilateral joint upsampling on the high-precision depth map and the coarsely aligned multi-view image to obtain a high-resolution depth map;
calculating a parallax gradient field based on the high-resolution depth map, constructing an affine transformation matrix based on the parallax gradient field, and carrying out local deformation on the coarsely aligned multi-view images to obtain finely aligned multi-view images;
Performing structure tensor analysis on the finely aligned multi-view images to obtain smooth images with maintained edges, and calculating multi-scale local phase consistency measurement based on the smooth images with maintained edges to obtain local structure similarity mapping;
Constructing a luminosity consistency constraint item based on local structure similarity mapping, and solving a displacement field by combining a geometric smoothness constraint item to obtain an accurately aligned multi-view image;
performing space-time voxel formation on the accurately aligned multi-view image and the high-resolution depth map, constructing a 4D space-time voxel grid, and calculating the color and depth feature vector of each voxel to obtain a characteristic space-time voxel field;
Constructing a space-time diagram based on the characteristic space-time voxel field to obtain a space-time segmentation result;
carrying out connected domain analysis on the space-time segmentation result, calculating the space-time consistency score of each connected domain, and clustering the connected domains according to the space-time consistency score to obtain a dynamic object candidate region set R;
based on the dynamic object candidate region set, constructing optical flow constraint and depth consistency constraint, and carrying out joint optimization to obtain a fine motion field;
And carrying out time sequence interpolation and extrapolation on the accurately aligned multi-view image and the high-resolution depth map according to the fine motion field to obtain the aligned multi-view image and the depth map.
Specifically, bilateral joint upsampling is performed on the high-precision depth map and the coarsely aligned multi-view image. Bilateral joint upsampling is a filtering method that combines spatial-domain and color-domain information and can effectively improve the resolution of a depth map while maintaining image edge details. The bilateral joint upsampling formula is as follows:
$$\tilde{D}(x) = \frac{1}{W_x} \sum_{y \in \Omega_x} D(y)\, \exp\!\left( -\frac{\lVert x - y \rVert^2}{2\sigma_s^2} \right) \exp\!\left( -\frac{\lVert I(x) - I(y) \rVert^2}{2\sigma_c^2} \right)$$
where $\tilde{D}(x)$ is the up-sampled depth value, $D(y)$ is the original depth value, $\Omega_x$ is a neighborhood of the pixel $x$, $I(x)$ and $I(y)$ are the color values of pixels $x$ and $y$ respectively, $\sigma_s$ and $\sigma_c$ are the smoothing parameters of the spatial domain and the color domain respectively, and $W_x$ is a normalization factor. Through this process, a high-resolution depth map is generated that matches the resolution of the coarsely aligned multi-view images. A parallax gradient field is calculated based on the high-resolution depth map. The parallax gradient field represents the rate of change of the depth information and can reveal the three-dimensional structural characteristics of objects in the image. Based on the parallax gradient field, an affine transformation matrix is constructed and the coarsely aligned multi-view images are locally deformed. The affine transformation matrix can be expressed as:
$$\begin{pmatrix} x' \\ y' \end{pmatrix} = \begin{pmatrix} a & b \\ c & d \end{pmatrix}\begin{pmatrix} x \\ y \end{pmatrix} + \begin{pmatrix} e \\ f \end{pmatrix}$$
where $(x, y)$ are the original coordinates, $(x', y')$ are the transformed coordinates, a, b, c, d are parameters controlling the transformation, and e and f are translation amounts. The local deformation ensures that images from different perspectives achieve higher geometric alignment accuracy, resulting in finely aligned multi-view images. Structure tensor analysis is performed on the finely aligned multi-view images. The structure tensor is a tool that captures the local structural features of an image; by analyzing the gradient information in the image, an edge-preserving smooth image is obtained. The structure tensor $T$ can be expressed as:
$$T = \begin{pmatrix} I_x^2 & I_x I_y \\ I_x I_y & I_y^2 \end{pmatrix}$$
where $I_x$ and $I_y$ are the gradients of the image in the x and y directions. Through analysis of the structure tensor, an edge-preserving smooth image is obtained; the multi-scale local phase consistency measure is further calculated, generating a local structure similarity map. A photometric consistency constraint term and a geometric smoothness constraint term are constructed based on the local structure similarity map. The photometric consistency constraint $E_{photo}$ can be expressed as:
$$E_{photo} = \sum_{p} \left\| I_1(p) - I_2\big(p + u(p)\big) \right\|^2$$
where $I_1(p)$ and $I_2(p + u(p))$ are the intensity values of pixel $p$ in the two images, and $u(p)$ is the displacement field that maps $p$ to its location in the other image. The geometric smoothness constraint term $E_{smooth}$ is expressed as:
$$E_{smooth} = \sum_{p} \sum_{q \in N(p)} \left\| u(p) - u(q) \right\|^2$$
where $N(p)$ denotes the neighborhood pixels of pixel $p$. By jointly solving these two constraints, accurately aligned multi-view images are obtained. Space-time voxelization is then performed on the precisely aligned multi-view images and the high-resolution depth map, constructing a 4D space-time voxel grid; the color and depth feature vectors of each voxel are calculated to obtain a characteristic space-time voxel field. A space-time graph is constructed based on the voxel field and segmented to obtain a space-time segmentation result. Connected-domain analysis is carried out on the space-time segmentation result, the space-time consistency score of each connected domain is calculated, and the dynamic objects are identified. Based on the dynamic objects, optical flow constraints and depth consistency constraints are constructed, and a fine motion field is obtained through joint optimization. Finally, time-sequence interpolation and extrapolation are performed on the accurately aligned multi-view images and the high-resolution depth map using the fine motion field, yielding the aligned multi-view images and depth map.
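A minimal sketch of the joint bilateral upsampling formula above is given below, with the aligned high-resolution colour image as the guide; the window radius and sigma values are assumptions, and the per-pixel loop favours clarity over speed.

```python
# Minimal sketch: joint bilateral upsampling of a low-resolution depth map,
# guided by a high-resolution colour image (H, W, 3).
import numpy as np
import cv2

def joint_bilateral_upsample(depth_lr, guide_hr, radius=4, sigma_s=3.0, sigma_c=10.0):
    h, w = guide_hr.shape[:2]
    # Nearest-neighbour upsampling gives the initial depth values D(y)
    depth_up = cv2.resize(depth_lr, (w, h), interpolation=cv2.INTER_NEAREST).astype(np.float64)
    guide = guide_hr.astype(np.float64)
    out = np.zeros_like(depth_up)
    for y in range(h):
        for x in range(w):
            y0, y1 = max(0, y - radius), min(h, y + radius + 1)
            x0, x1 = max(0, x - radius), min(w, x + radius + 1)
            dy, dx = np.mgrid[y0:y1, x0:x1]
            # Spatial-domain weight exp(-||x-y||^2 / 2*sigma_s^2)
            spatial = np.exp(-((dy - y) ** 2 + (dx - x) ** 2) / (2 * sigma_s ** 2))
            # Color-domain weight exp(-||I(x)-I(y)||^2 / 2*sigma_c^2)
            colour = np.exp(-np.sum((guide[y0:y1, x0:x1] - guide[y, x]) ** 2, axis=-1)
                            / (2 * sigma_c ** 2))
            weights = spatial * colour
            out[y, x] = np.sum(weights * depth_up[y0:y1, x0:x1]) / np.sum(weights)
    return out
```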
In a specific embodiment, the process of performing step 500 may specifically include the following steps:
Voxel processing is carried out on the aligned multi-view images and depth maps to obtain an initial low-resolution voxel model, and octree data structure analysis is carried out on the basis of the initial low-resolution voxel model to obtain multi-resolution voxel representation;
Carrying out signed distance field value calculation on each voxel in the multi-resolution voxel representation to obtain an initial SDF model, and carrying out multi-view depth fusion based on the initial SDF model to obtain an optimized SDF model;
Performing self-adaptive subdivision on the optimized SDF model to obtain a high-precision surface grid model, and performing multi-view consistency constraint based on the high-precision surface grid model to obtain a surface optimization objective function;
carrying out iterative solution on the surface optimization objective function to obtain a refined surface model, and extracting a multi-view texture set based on the refined surface model;
And performing global optimization of graph cut on the multi-view texture set to obtain seamless spliced texture mapping, and generating a scene 3D model based on the seamless spliced texture mapping.
Specifically, the aligned multi-view image and depth map are subjected to voxelization, which is a process of dividing a three-dimensional space into small cube units (i.e., voxels), and an initial low-resolution voxel model is generated by associating each voxel with pixel or depth information in the image. The low resolution voxel model enables the preliminary capture of the three-dimensional structure of the scene with less computational expense. Octree data structure analysis is performed based on the initial low-resolution voxel model. Octree is a tree structure commonly used for three-dimensional spatial data partitioning, forming a hierarchical voxel representation by recursively partitioning the space into eight subspaces. Each voxel is subdivided into smaller sub-voxels, the resolution of which can be adjusted as required. The multi-resolution voxel representation not only provides higher detail resolution in critical areas, but also reduces computational complexity by compressing non-important areas. A signed distance field value is calculated for each voxel in the multi-resolution voxel representation. Signed distance field values represent the distance of a voxel to the object surface and are used to distinguish, by sign, whether the voxel is inside or outside the object. By computing the signed distance field value for each voxel, an initial SDF model is generated that describes the shape of the object in the whole space. Based on the initial SDF model, multi-view depth fusion is performed. By integrating depth information from different perspectives, errors and uncertainties caused by a single perspective are eliminated, resulting in a more accurate representation of the three-dimensional shape. By fusing these depth information, an optimized SDF model is generated. And performing adaptive subdivision on the optimized SDF model. By increasing the voxel resolution in areas where high precision representation is required to capture finer surface features, a high precision surface mesh model is generated that is capable of accurately describing the three-dimensional shape of the object. To optimize the high-precision surface mesh model, a surface optimization objective function is constructed based on the multi-view consistency constraint. The multi-view consistency constraint can improve the accuracy of the surface model by ensuring that the projections of images from different perspectives onto the grid surface are consistent. The surface optimization objective function can be expressed as:
$$E = \sum_{i}\sum_{p} \left\| I_i(p) - \hat{I}_i(p) \right\|^2 + \lambda \sum \left\| \nabla \phi \right\|^2$$
where $I_i(p)$ represents the pixel value at point $p$ in the $i$-th view image, $\hat{I}_i(p)$ represents the corresponding projected pixel value, the first term measures the projection error, and the second term is a smoothness term on the SDF gradient $\nabla\phi$ that controls the smoothness of the grid. The objective function is solved iteratively, optimizing the grid surface to obtain a refined surface model. A multi-view texture set is extracted on the basis of the refined surface model. The multi-view texture set comprises the texture information corresponding to the surface grid in the images captured by the different cameras. Graph-cut global optimization is performed on the textures: the graph-cut algorithm selects the optimal stitching path among the multi-view textures, eliminating seams between different views and generating a seamlessly stitched texture map. The seamlessly stitched texture map is applied to the refined surface model to generate the final scene 3D model.
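One common way to realise the signed distance field fusion described above is truncated SDF (TSDF) integration over a voxel grid; the sketch below assumes a pinhole camera model, a regular (non-octree) grid, and illustrative truncation and weighting choices, so it is only an approximation of the multi-view depth fusion step.

```python
# Minimal sketch: integrate one depth map into a TSDF voxel grid with a running
# weighted average, the standard TSDF update rule.
import numpy as np

def fuse_depth_into_tsdf(tsdf, weights, origin, voxel_size, depth, K, cam_pose, trunc=0.04):
    nx, ny, nz = tsdf.shape
    # World coordinates of every voxel centre
    ii, jj, kk = np.meshgrid(np.arange(nx), np.arange(ny), np.arange(nz), indexing="ij")
    pts_w = origin + voxel_size * np.stack([ii, jj, kk], axis=-1).reshape(-1, 3)

    # Transform into the camera frame (cam_pose is camera-to-world) and project
    world_to_cam = np.linalg.inv(cam_pose)
    pts_c = (world_to_cam[:3, :3] @ pts_w.T).T + world_to_cam[:3, 3]
    z = pts_c[:, 2]
    z_safe = np.where(z > 0, z, np.inf)
    u = np.round(K[0, 0] * pts_c[:, 0] / z_safe + K[0, 2]).astype(int)
    v = np.round(K[1, 1] * pts_c[:, 1] / z_safe + K[1, 2]).astype(int)

    h, w = depth.shape
    valid = (z > 0) & (u >= 0) & (u < w) & (v >= 0) & (v < h)
    sdf = np.full(z.shape, np.nan)
    sdf[valid] = depth[v[valid], u[valid]] - z[valid]   # signed distance along the ray
    keep = valid & (sdf > -trunc)
    tsdf_new = np.clip(sdf / trunc, -1.0, 1.0)

    # Running weighted average of the truncated signed distances
    flat_tsdf, flat_w = tsdf.reshape(-1), weights.reshape(-1)
    flat_tsdf[keep] = (flat_tsdf[keep] * flat_w[keep] + tsdf_new[keep]) / (flat_w[keep] + 1)
    flat_w[keep] += 1
    return flat_tsdf.reshape(tsdf.shape), flat_w.reshape(weights.shape)
```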
In a specific embodiment, the process of executing step 600 may specifically include the following steps:
performing depth analysis on the 3D model of the scene based on the stereoscopic vision perception model of human eyes to obtain scene depth distribution characteristics, and constructing a self-adaptive stereoscopic parallax adjustment model according to the scene depth distribution characteristics;
Performing parallax adjustment on the scene 3D model based on the self-adaptive stereoscopic parallax adjustment model to obtain a preliminary stereoscopic parallax correction result, and constructing a parallax continuity optimization model based on the preliminary stereoscopic parallax correction result;
Processing the large parallax region through a parallax continuity optimization model to obtain an optimized stereoscopic parallax correction result, and generating dense light field sampling data based on the optimized stereoscopic parallax correction result;
compressing the dense light field sampling data to obtain a light field model, and constructing an initial rendering result based on the light field model;
and carrying out image enhancement on the initial rendering result to obtain a high-quality rendering image, and carrying out space-time analysis and jitter suppression processing based on the high-quality rendering image to obtain a stereoscopic 3D image.
Specifically, depth analysis is performed on a 3D model of a scene through a human eye stereoscopic vision perception model. The human eye stereoscopic perception model simulates how human eyes perceive the depth of an object through parallax. When a scene is observed by the human eye, a slight positional difference between the left and right eyes causes them to see slightly different images, which difference is called parallax. By calculating these parallaxes, the human brain can infer the depth of the object. Based on the model, the 3D model of the scene is subjected to depth analysis, and the depth distribution characteristics of the scene are extracted. The depth profile features of the scene reflect the distance level and depth variations of objects in space. Based on these features, an adaptive stereoscopic parallax adjustment model is constructed. The self-adaptive stereoscopic parallax adjustment model dynamically adjusts the parallax value according to different depth areas so as to ensure that the final stereoscopic image visually accords with the perception habit of human eyes. For example, for distant objects, the parallax should be smaller, and for near objects, the parallax should be larger. The mathematical representation of this adjustment process can be described by the following formula:
$$d_{adj} = \frac{k}{Z}$$
where $d_{adj}$ is the adjusted disparity value, $k$ is the adjustment parameter, and $Z$ represents the depth distance from the object to the camera. The parallax is dynamically adjusted through this formula, giving a preliminary stereoscopic parallax correction result suitable for scenes with different depths. A parallax continuity optimization model is then constructed based on the preliminary stereoscopic parallax correction result. The parallax continuity optimization model addresses the discontinuities caused by excessive parallax changes; such discontinuities can lead to visual illusions or unnatural jerkiness in the stereoscopic image. To avoid these problems, the parallax continuity optimization model makes parallax changes more natural and coherent by smoothing the parallax values. This is typically accomplished by minimizing the following objective function:
$$E_c = \sum_{p} \sum_{q \in N(p)} \left( d_p - d_q \right)^2$$
where $E_c$ is the parallax continuity error, $d_p$ and $d_q$ are the disparity values of adjacent pixels $p$ and $q$ respectively, and $N(p)$ represents the neighborhood of pixel $p$. By minimizing this error, the parallax change is smoothed, resulting in an optimized stereoscopic parallax correction result. Dense light field sampling data are then generated based on the optimized stereoscopic parallax correction result. A dense light field represents the ray information of each point in the scene in multiple directions, and dense light field sampling captures subtle illumination changes and viewing-angle differences in the scene. The amount of dense light field sample data is typically very large and therefore requires compression in order to be processed efficiently with limited storage and computing resources. The dense light field sampling data are compressed, and the light field model is obtained by reducing redundant data. The compression method may include principal component analysis, transform coding, or other efficient compression algorithms; these methods significantly reduce the amount of data while retaining the important light field information. An initial rendering result is constructed based on the lightweight light field model. The initial rendering result is a scene image generated from the compressed light field data, reproducing the basic visual effect of the scene by reconstructing the ray information in the light field. Image enhancement is then performed on the initial rendering result. Image enhancement may include contrast adjustment, color correction, sharpening, etc., to improve the visual effect of the image. For example, the image contrast can be adjusted by the following formula:
$$I'(x) = \alpha \cdot I(x) + \beta$$
where $I'(x)$ is the pixel value of the enhanced image, $I(x)$ is the original pixel value, $\alpha$ is the contrast adjustment factor, and $\beta$ is the luminance offset value. Through this enhancement operation, a high-quality rendered image is obtained. Spatio-temporal analysis and jitter suppression processing are then performed based on the high-quality rendered image. Spatio-temporal analysis ensures consistency of the images over time, avoiding flicker or instability caused by dynamic changes; jitter suppression detects and compensates for small jitters in the image so that the stereoscopic 3D image remains smooth and natural during playback. Through the above steps, a high-quality and stable stereoscopic 3D image is finally obtained.
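A small sketch of the depth-dependent disparity adjustment and the linear contrast enhancement discussed above; the inverse-depth disparity model, the bilateral smoothing of the disparity map, and all constants are illustrative assumptions rather than the exact adjustment model of this embodiment.

```python
# Minimal sketch: depth-dependent disparity adjustment, edge-preserving disparity
# smoothing, and linear contrast enhancement I' = alpha*I + beta.
import numpy as np
import cv2

def adjust_disparity(depth, focal_px, baseline_m, scale=1.0, max_disp=64.0):
    """Smaller disparity for distant objects, larger for near ones (d ~ f*B/Z)."""
    disparity = scale * focal_px * baseline_m / np.maximum(depth, 1e-6)
    return np.clip(disparity, 0.0, max_disp)

def smooth_disparity(disparity, d=9, sigma_color=25, sigma_space=25):
    """Edge-preserving smoothing of the disparity map to reduce abrupt jumps."""
    return cv2.bilateralFilter(disparity.astype(np.float32), d, sigma_color, sigma_space)

def enhance_contrast(image, alpha=1.2, beta=10):
    """Linear contrast/brightness adjustment of an 8-bit rendered image."""
    return cv2.convertScaleAbs(image, alpha=alpha, beta=beta)
```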
The method for generating a 3D image based on multi-camera shooting in the embodiment of the present application is described above, and the following describes a 3D image generating device 10 based on multi-camera shooting in the embodiment of the present application, referring to fig. 2, one embodiment of the 3D image generating device 10 based on multi-camera shooting in the embodiment of the present application includes:
The acquisition module 11 is used for synchronously triggering and acquiring images acquired by a plurality of cameras to obtain a multi-view original image;
A preprocessing module 12, configured to preprocess the multi-view original image to obtain a preprocessed multi-view image;
The matching module 13 is used for carrying out feature extraction and matching on the preprocessed multi-view images to obtain feature matching results;
a calculation module 14, configured to perform alignment and depth information calculation on the preprocessed multi-view image based on the feature matching result, so as to obtain an aligned multi-view image and a depth map;
the reconstruction module 15 is configured to perform 3D model reconstruction based on the aligned multi-view image and depth map, so as to obtain a 3D model of the scene;
The generating module 16 is configured to perform stereo parallax correction and rendering processing on the scene 3D model, and generate a stereo 3D image.
Through the cooperation of the components, clock synchronization is performed through the WiFi module, a microsecond trigger signal is generated through the central control unit, and the light pulse of a specific mode is emitted by combining the high-speed LED array, so that sub microsecond time alignment correction of a plurality of cameras is realized, and the time consistency of multi-view images is greatly improved. The problems of noise, chromatic aberration, uneven exposure, distortion and the like of the multi-view images are effectively solved by adopting the technologies of a global color mapping model, a self-adaptive exposure equalization function, distortion correction of a polynomial model and the like, efficient and accurate matching among the multi-view images is realized by constructing an affine invariant feature set, global semantic feature extraction and a feature fast index tree, and accurate alignment of the multi-view images and generation of a high-precision depth map are realized by utilizing the technologies of a layered sparse beam adjustment model, multi-view three-dimensional matching cost body analysis and the like. Dynamic objects are effectively identified and processed through space-time voxelization, space-time consistency analysis and fine motion field optimization, and the space-time consistency of a reconstruction result is ensured. The 3D model reconstruction with high precision and high detail is realized by adopting the technologies of multi-resolution voxel representation, signed distance field optimization, self-adaptive surface subdivision and the like. Based on a human eye stereoscopic vision perception model and self-adaptive stereoscopic parallax adjustment, a dense light field sampling and rendering enhancement technology is combined, and a high-quality and high-immersion stereoscopic 3D image is generated.
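For orientation only, the following skeleton mirrors how the six modules of the device could be chained in code; every class and method name is a hypothetical placeholder, not the patented interface, and each stage stands in for the processing described above.

```python
# Minimal sketch: a skeletal pipeline mirroring the acquisition, preprocessing,
# matching, alignment/depth, reconstruction, and rendering modules.
from dataclasses import dataclass, field
from typing import List
import numpy as np

@dataclass
class MultiCamera3DPipeline:
    cameras: List[object] = field(default_factory=list)

    def acquire(self):
        """Synchronously triggered capture of the multi-view original images."""
        return [cam.read() for cam in self.cameras]

    def preprocess(self, images):
        """Denoising, color correction, exposure equalization, undistortion."""
        return images

    def match(self, images):
        """Feature extraction and multi-view feature matching."""
        return {}

    def align_and_depth(self, images, matches):
        """Pose estimation, alignment, and depth-map computation."""
        return images, [np.zeros(img.shape[:2]) for img in images]

    def reconstruct(self, images, depths):
        """Voxel/SDF fusion, surface refinement, and texturing."""
        return {"mesh": None, "texture": None}

    def render_stereo(self, model):
        """Stereo parallax correction, light-field rendering, enhancement."""
        return None

    def run(self):
        images = self.preprocess(self.acquire())
        matches = self.match(images)
        aligned, depths = self.align_and_depth(images, matches)
        return self.render_stereo(self.reconstruct(aligned, depths))
```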
The application also provides an electronic device, which comprises a memory and a processor, wherein the memory stores computer readable instructions, and the computer readable instructions, when executed by the processor, cause the processor to execute the steps of the 3D image generating method based on multi-camera shooting in the above embodiments.
The present application also provides a computer readable storage medium, which may be a non-volatile computer readable storage medium, and may also be a volatile computer readable storage medium, where instructions are stored in the computer readable storage medium, when the instructions run on a computer, cause the computer to perform the steps of the 3D image generating method based on multi-camera shooting.
It will be clear to those skilled in the art that, for convenience and brevity of description, specific working procedures of the above-described systems, apparatuses and units may refer to corresponding procedures in the foregoing method embodiments, which are not repeated herein.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application may be embodied essentially or in a part contributing to the prior art or in whole or in part in the form of a software product stored in a storage medium, comprising several instructions for causing an electronic device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a read-only memory (ROM), a random access memory (random access memory, RAM), a magnetic disk, or an optical disk, or other various media capable of storing program codes.
The above embodiments are only for illustrating the technical solution of the present application, and not for limiting the same; although the application has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present application.
Claims (10)
1. The 3D image generation method based on multi-camera shooting is characterized by comprising the following steps:
Synchronously triggering and collecting images collected by a plurality of cameras to obtain a multi-view original image;
preprocessing the multi-view original image to obtain a preprocessed multi-view image;
Performing feature extraction and matching on the preprocessed multi-view image to obtain a feature matching result;
Performing alignment and depth information calculation on the preprocessed multi-view image based on the feature matching result to obtain an aligned multi-view image and a depth map;
Performing 3D model reconstruction based on the aligned multi-view images and the depth map to obtain a scene 3D model;
and carrying out stereo parallax correction and rendering processing on the scene 3D model to generate a stereo 3D image.
2. The method for generating a 3D image based on multi-camera shooting according to claim 1, wherein the step of synchronously triggering and collecting the images collected by the plurality of cameras to obtain a multi-view original image comprises the steps of:
Arranging the cameras in an array manner to obtain a camera array, and configuring a WiFi module for each camera in the camera array to obtain a camera array with the WiFi module;
setting a central control unit, establishing connection with the camera array with the WiFi module through a WiFi network, and constructing a camera system with the central control unit;
Generating a microsecond-level trigger signal based on the central control unit to obtain a microsecond-level synchronous trigger signal, and transmitting the microsecond-level synchronous trigger signal to each camera through a WiFi network;
Triggering the camera system with the central control unit based on the transmitted microsecond synchronous triggering signal to obtain a primarily synchronous multi-view image;
Setting a high-speed LED array in a shooting scene, and controlling the high-speed LED array to emit light pulses in a specific mode based on the microsecond synchronous trigger signal to obtain a scene with a time mark;
shooting the scene with the time mark to obtain a multi-view image with a light pulse mode, and transmitting the multi-view image with the light pulse mode to a cloud APP through a WiFi network;
And performing sub-microsecond time alignment correction in the cloud APP based on the light pulse mode in the multi-view image with the light pulse mode to obtain a multi-view original image.
3. The method for generating a 3D image based on multi-camera shooting according to claim 1, wherein the preprocessing the multi-view original image to obtain a preprocessed multi-view image comprises:
Denoising the multi-view original image to obtain a preliminary denoising multi-view image, and placing a standard color card in a shooting scene, and analyzing the preliminary denoising multi-view image based on the standard color card to obtain a global color mapping model;
Performing joint color correction on the preliminary noise-reduced multi-view image based on the global color mapping model to obtain a color-corrected multi-view image, and performing histogram analysis on the color-corrected multi-view image to obtain pixel value distribution characteristics of each image;
Constructing an adaptive exposure balance function based on the pixel value distribution characteristics, and adjusting pixel values of the color corrected multi-view image based on the adaptive exposure balance function to obtain an exposure balanced multi-view image;
performing distortion correction of a polynomial model on the multi-view image subjected to exposure equalization to obtain a multi-view image subjected to distortion correction, and performing super-resolution processing on the multi-view image subjected to distortion correction to obtain a high-resolution multi-view image;
And carrying out edge sharpening and detail enhancement processing on the high-resolution multi-view image to obtain a preprocessed multi-view image.
4. The method for generating a 3D image based on multi-camera shooting according to claim 1, wherein the performing feature extraction and matching on the preprocessed multi-view image to obtain a feature matching result includes:
Extracting feature points of the preprocessed multi-view image to obtain a preliminary feature point set, and constructing an affine invariant feature set based on the preliminary feature point set;
global semantic feature extraction is carried out on the affine invariant feature set to obtain a semantic enhancement feature set, and a feature quick index tree is constructed based on the semantic enhancement feature set;
performing rough matching on feature points among the multi-view images based on the feature rapid index tree to obtain a primary matching pair set, and constructing a feature point local relation diagram based on the primary matching pair set;
Generating an optimized matching pair set according to the characteristic point local relation graph, and establishing a global geometric consistency model based on the optimized matching pair set;
And refining the matching result based on the global geometric consistency model to obtain an accurate matching pair set, and carrying out multi-view feature fusion and propagation based on the accurate matching pair set to obtain a feature matching result.
5. The method for generating a 3D image based on multi-camera shooting according to claim 1, wherein the performing alignment and depth information calculation on the preprocessed multi-view image based on the feature matching result to obtain an aligned multi-view image and a depth map includes:
Constructing a multi-view geometric relationship diagram based on the feature matching result to obtain an initial view topological structure, and optimizing the initial view topological structure to obtain an optimized view topological structure;
establishing a layered sparse beam adjustment model based on the optimized view topological structure, generating initial camera pose parameters, performing iterative optimization on the initial camera pose parameters, and minimizing a reprojection error to obtain target camera pose parameters;
Performing geometric transformation on the preprocessed multi-view images based on the pose parameters of the target camera to obtain coarse aligned multi-view images, and performing multi-view three-dimensional matching cost body analysis on the coarse aligned multi-view images to obtain initial depth estimation;
Performing multi-view consistency propagation based on the initial depth estimation to obtain an optimized depth map, and performing refinement treatment on the optimized depth map to obtain a high-precision depth map;
And carrying out fine alignment on the coarsely aligned multi-view images based on the high-precision depth map to obtain precisely aligned multi-view images, carrying out space-time consistency analysis on the precisely aligned multi-view images and the high-precision depth map, and identifying and processing dynamic objects to obtain aligned multi-view images and depth maps.
6. The method for generating a 3D image based on multi-camera shooting according to claim 5, wherein the performing geometric transformation on the preprocessed multi-view image based on the pose parameter of the target camera to obtain a coarsely aligned multi-view image, and performing multi-view stereo matching cost-body analysis on the coarsely aligned multi-view image to obtain an initial depth estimate includes:
calculating an essential matrix E among cameras based on the pose parameters of the target camera, wherein E = [t]× R, [t]× is an antisymmetric matrix of a translation vector t, and R is a rotation matrix; performing epipolar line correction on the preprocessed multi-view image to obtain a corrected image pair;
Performing phase consistency transformation on the corrected image pair, calculating a phase correlation function to obtain a sub-pixel level displacement field, and calculating an affine transformation matrix based on the sub-pixel level displacement field to obtain a coarsely aligned multi-view image;
Performing anisotropic diffusion filtering on the coarsely aligned multi-view images to obtain a smooth image set with maintained edges, and performing multi-scale Log-Gabor filtering on the smooth image set to obtain a characteristic tensor field;
calculating a structure tensor based on the characteristic tensor field, performing eigenvalue decomposition on the structure tensor to obtain a structure consistency map, and performing non-rigid registration on the preliminarily aligned multi-view image set based on the structure consistency map to obtain a finely aligned multi-view image set;
And constructing quaternary body segmentation for the finely aligned multi-view image set to obtain a super-pixel set, calculating a multi-view matching cost body based on the super-pixel set to obtain an initial depth map, solving the initial depth map by applying a variational method, and obtaining initial depth estimation through iterative optimization.
7. The method for generating a 3D image based on multi-camera shooting according to claim 5, wherein the performing fine alignment on the coarsely aligned multi-view image based on the high-precision depth map to obtain a precisely aligned multi-view image, performing space-time consistency analysis on the precisely aligned multi-view image and the high-precision depth map, identifying and processing a dynamic object, and obtaining an aligned multi-view image and a depth map, includes:
performing bilateral joint upsampling on the high-precision depth map and the coarsely aligned multi-view image to obtain a high-resolution depth map;
calculating a parallax gradient field based on the high-resolution depth map, constructing an affine transformation matrix based on the parallax gradient field, and carrying out local deformation on the coarsely aligned multi-view images to obtain finely aligned multi-view images;
Performing structure tensor analysis on the finely aligned multi-view images to obtain smooth images with maintained edges, and calculating multi-scale local phase consistency metrics based on the smooth images with maintained edges to obtain local structure similarity mapping;
constructing a luminosity consistency constraint item based on the local structure similarity mapping, and solving a displacement field by combining a geometric smoothness constraint item to obtain an accurately aligned multi-view image;
Performing spatio-temporal voxel formation on the precisely aligned multi-view image and the high-resolution depth map, constructing a 4D spatio-temporal voxel grid, and calculating the color and depth feature vector of each voxel to obtain a characteristic spatio-temporal voxel field;
constructing a space-time diagram based on the characteristic space-time voxel field to obtain a space-time segmentation result;
Carrying out connected domain analysis on the space-time segmentation result, calculating the space-time consistency score of each connected domain, and clustering the connected domains according to the space-time consistency score to obtain a dynamic object candidate region set R;
constructing optical flow constraint and depth consistency constraint based on the dynamic object candidate region set, and performing joint optimization to obtain a fine motion field;
And carrying out time sequence interpolation and extrapolation on the accurately aligned multi-view image and the high-resolution depth map according to the fine motion field to obtain an aligned multi-view image and a depth map.
8. The method for generating a 3D image based on multi-camera shooting according to claim 1, wherein the performing 3D model reconstruction based on the aligned multi-view image and the depth map to obtain a 3D model of a scene includes:
voxel processing is carried out on the aligned multi-view images and the depth map to obtain an initial low-resolution voxel model, and octree data structure analysis is carried out on the basis of the initial low-resolution voxel model to obtain multi-resolution voxel representation;
Carrying out signed distance field value calculation on each voxel in the multi-resolution voxel representation to obtain an initial SDF model, and carrying out multi-view depth fusion based on the initial SDF model to obtain an optimized SDF model;
performing self-adaptive subdivision on the optimized SDF model to obtain a high-precision surface grid model, and performing multi-view consistency constraint based on the high-precision surface grid model to obtain a surface optimization objective function;
Carrying out iterative solution on the surface optimization objective function to obtain a refined surface model, and extracting a multi-view texture set based on the refined surface model;
And performing global optimization of graph cut on the multi-view texture set to obtain seamless spliced texture mapping, and generating a scene 3D model based on the seamless spliced texture mapping.
9. The method for generating a 3D image based on multi-camera shooting according to claim 1, wherein the performing stereo parallax correction and rendering processing on the scene 3D model to generate a stereo 3D image comprises:
Performing depth analysis on the 3D model of the scene based on a human eye stereoscopic vision perception model to obtain scene depth distribution characteristics, and constructing a self-adaptive stereoscopic parallax adjustment model according to the scene depth distribution characteristics;
performing parallax adjustment on the scene 3D model based on the self-adaptive stereoscopic parallax adjustment model to obtain a preliminary stereoscopic parallax correction result, and constructing a parallax continuity optimization model based on the preliminary stereoscopic parallax correction result;
Processing the large parallax region through the parallax continuity optimization model to obtain an optimized stereoscopic parallax correction result, and generating dense light field sampling data based on the optimized stereoscopic parallax correction result;
Compressing the dense light field sampling data to obtain a light field model, and constructing an initial rendering result based on the light field model;
And carrying out image enhancement on the initial rendering result to obtain a high-quality rendering image, and carrying out space-time analysis and jitter suppression processing based on the high-quality rendering image to obtain a stereoscopic 3D image.
10. A 3D image generating apparatus based on multi-camera shooting, for performing the 3D image generating method based on multi-camera shooting according to any one of claims 1 to 9, the apparatus comprising:
the acquisition module is used for synchronously triggering and acquiring the images acquired by the cameras to obtain a multi-view original image;
The preprocessing module is used for preprocessing the multi-view original image to obtain a preprocessed multi-view image;
The matching module is used for carrying out feature extraction and matching on the preprocessed multi-view images to obtain feature matching results;
the computing module is used for carrying out alignment and depth information computation on the preprocessed multi-view images based on the feature matching result to obtain aligned multi-view images and depth maps;
The reconstruction module is used for reconstructing a 3D model based on the aligned multi-view images and the depth map to obtain a scene 3D model;
And the generation module is used for carrying out stereo parallax correction and rendering processing on the scene 3D model to generate a stereo 3D image.
Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
---|---|---|---
CN202411202744.1A CN118784816A (en) | 2024-08-29 | 2024-08-29 | 3D image generation method and system based on multi-camera shooting
Publications (1)

Publication Number | Publication Date
---|---
CN118784816A (en) | 2024-10-15
Family ID: 92991667
Family Applications (1)

Application Number | Title | Priority Date | Filing Date
---|---|---|---
CN202411202744.1A (Pending) | | 2024-08-29 | 2024-08-29

Country Status (1)

Country | Link
---|---
CN | CN118784816A (en)
Legal Events

Date | Code | Title | Description
---|---|---|---
| PB01 | Publication |
| SE01 | Entry into force of request for substantive examination |