US11321937B1 - Visual localization method and apparatus based on semantic error image
- Publication number
- US11321937B1 (application No. US17/473,190)
- Authority
- US
- United States
- Prior art keywords
- semantic
- image
- dimensional
- pose
- hypothesized
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/70—Determining position or orientation of objects or cameras
- G06T7/73—Determining position or orientation of objects or cameras using feature-based methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/10—Segmentation; Edge detection
- G06T7/136—Segmentation; Edge detection involving thresholding
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/70—Determining position or orientation of objects or cameras
- G06T7/73—Determining position or orientation of objects or cameras using feature-based methods
- G06T7/75—Determining position or orientation of objects or cameras using feature-based methods involving models
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/60—Type of objects
- G06V20/64—Three-dimensional objects
- G06V20/647—Three-dimensional objects by matching two-dimensional images to three-dimensional objects
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/70—Labelling scene content, e.g. deriving syntactic or semantic representations
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/10—Character recognition
- G06V30/26—Techniques for post-processing, e.g. correcting the recognition result
- G06V30/262—Techniques for post-processing, e.g. correcting the recognition result using context analysis, e.g. lexical, syntactic or semantic context
- G06V30/274—Syntactic or semantic context, e.g. balancing
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/10—Image acquisition modality
- G06T2207/10028—Range image; Depth image; 3D point clouds
Definitions
- One or more embodiments of the present disclosure relate to the field of image processing technologies, and in particular to a visual localization method and apparatus based on a semantic error image.
- Visual localization obtains the spatial position and orientation of a target, i.e. a pose estimation of the target, from information around the target based on data such as images and three-dimensional point clouds. It is therefore widely applied in the localization and navigation of robots, the navigation of self-driving vehicles, augmented reality, three-dimensional reconstruction and the like.
- Visual localization methods usually fall into three categories: the first is a localization method based on three-dimensional structure, in which the localization accuracy is significantly reduced, or localization even fails, in cases of significant change of the scene environment, a large number of repetitive structures in the scene, weak texture or texture-less structures in the scene, strong changes in light irradiation, motion blur, strong changes in view point and the like; the second is a localization method based on an image, in which pose estimation is performed by searching an image database for the image most similar to a target image, leading to a low localization accuracy; the third is a localization method based on a learning model, in which a learning model is trained in advance and pose estimation is performed using the model, wherein the method cannot process a large scene and lacks generality because a model is constructed for each scene. Image similarity retrieval is present in all the above methods.
- change factors such as light and season will have a huge impact on a scene, and structural overlap
- one or more embodiments of the present disclosure aim to provide a visual localization method and apparatus based on a semantic error image, in which a high localization accuracy can be generated in a case of significant change of a scene.
- one or more embodiments of the present disclosure provide a visual localization method based on a semantic error image, including:
- each matching pair includes a pixel point of the target image and the three-dimensional point of the three-dimensional scene model which are matched in feature;
- each pixel point of the two-dimensional semantic image has corresponding semantic information; and determining semantic information of each matching pair according to the semantic information of each pixel of the two-dimensional semantic image;
- hypothesized pose pool including at least one hypothesized pose according to at least one matching pair
- the semantic error image is obtained in the following manner: constructing a three-dimensional semantic image by using the three-dimensional points in all matching pairs, obtaining a two-dimensional image by performing reprojection for the three-dimensional semantic image according to a current hypothesized pose, assigning semantic information of each theoretical pixel point of the two-dimensional image to the semantic information of the corresponding pixel point of the two-dimensional semantic image, and then constructing the semantic error image based on a semantic error between the semantic information of each theoretical pixel point of the two-dimensional image and the semantic information of the correspondingly-matched three-dimensional point;
- hypothesized pose pool is constructed in the following manner:
- R is a rotation matrix and t is a translation matrix.
- selecting the hypothesized pose with the minimum reprojection error and the minimum semantic error as the pose estimation according to the reprojection error image and the semantic error image of each hypothesized pose includes:
- calculating the total number of correct positions according to the reprojection error image corresponding to each hypothesized pose includes the following:
- the three-dimensional semantic image is reprojected as the two-dimensional image according to the hypothesized pose $h_j$, wherein based on a position coordinate $y_i$ of any three-dimensional point i, a theoretical position coordinate $p'_i$ of the theoretical pixel point i′ of the two-dimensional image obtained through projection is expressed as follows:
- $X_i$, $Y_i$ and $Z_i$ are the position coordinates of the three-dimensional point i in the x, y and z directions, and C is a camera projection matrix;
- the reprojection error image is constructed based on the reprojection error $e_i$, and an inlier threshold $\tau$ of the matching pair is set, such that
- $n_i = \begin{cases} 1, & e_i < \tau \\ 0, & e_i \ge \tau \end{cases} \quad (8)$
- when the reprojection error $e_i$ is smaller than the inlier threshold $\tau$, the theoretical pixel point of the two-dimensional image obtained through projection based on the hypothesized pose is consistent in position with the corresponding pixel point of the two-dimensional semantic image, which is called a correct position;
- $N_i = \sum n_i \quad (9)$
- calculating the total number of correct semantics according to the semantic error image corresponding to each hypothesized pose includes:
- An embodiment of the present disclosure further provides a visual localization apparatus based on a semantic error image, including:
- a matching module configured to perform feature extraction for a target image, and obtain at least one matching pair by performing feature matching for each extracted feature point and each three-dimensional point of a constructed three-dimensional scene model, wherein each matching pair includes a pixel point of the target image and the three-dimensional point of the three-dimensional scene model which are matched in feature;
- a semantic segmenting module configured to: obtain a two-dimensional semantic image of the target image by performing semantic segmentation for the target image, wherein each pixel point of the two-dimensional semantic image has corresponding semantic information; and determine semantic information of each matching pair according to the semantic information of each pixel of the two-dimensional semantic image;
- a pose pool constructing module configured to construct a hypothesized pose pool including at least one hypothesized pose according to at least one matching pair
- an image constructing module configured to, for each hypothesized pose in the hypothesized pose pool, construct a reprojection error image and a semantic error image; wherein the semantic error image is obtained in the following manner: constructing a three-dimensional semantic image by using the three-dimensional points in all matching pairs, obtaining a two-dimensional image by performing reprojection for the three-dimensional semantic image according to a current hypothesized pose, assigning semantic information of each theoretical pixel point of the two-dimensional image to the semantic information of the corresponding pixel point of the two-dimensional semantic image, and then constructing the semantic error image based on a semantic error between the semantic information of each theoretical pixel point of the two-dimensional image and the semantic information of the correspondingly-matched three-dimensional point; and
- a pose estimating module configured to determine a hypothesized pose with a minimum reprojection error and a minimum semantic error as a pose estimation according to the reprojection error image and the semantic error image of each hypothesized pose.
- the pose pool constructing module is configured to: select four matching pairs randomly from all matching pairs, obtain one hypothesized pose through calculation according to a PNP (perspective-n-point) algorithm and the four selected matching pairs, and construct the hypothesized pose pool by using all hypothesized poses obtained based on the random combination of all matching pairs.
- R is a rotation matrix and t is a translation matrix.
- the pose estimating module is configured to: calculate a total number of correct positions according to the reprojection error image corresponding to each hypothesized pose; calculate a total number of correct semantics according to the semantic error image corresponding to each hypothesized pose; and select a hypothesized pose with a maximum total number of correct positions and a maximum total number of correct semantics as an optimal pose estimation.
- feature extraction is performed for a target image, and at least one matching pair is obtained by performing feature matching for each extracted feature point and each three-dimensional point of a constructed three-dimensional scene model; a two-dimensional semantic image of the target image is obtained by performing semantic segmentation for the target image, and semantic information of each matching pair is determined according to semantic information of each pixel of the two-dimensional semantic image; a hypothesized pose pool including at least one hypothesized pose is constructed according to each matching pair; a reprojection error image and a semantic error image are constructed for each hypothesized pose in the hypothesized pose pool; a hypothesized pose with a minimum reprojection error and a minimum semantic error is determined as a pose estimation according to the reprojection error image and the semantic error image of each hypothesized pose.
- optimal pose screening is performed, so as to achieve good localization effect even in a
- FIG. 1 is a flowchart of a method according to one or more embodiments of the present disclosure.
- FIG. 2 is a schematic diagram of a semantic error image according to one or more embodiments of the present disclosure.
- FIG. 3 is a schematic diagram of a matching pair according to one or more embodiments of the present disclosure.
- FIG. 4 is a schematic diagram of semantic information of a theoretical pixel point and a three-dimensional point according to one or more embodiments of the present disclosure.
- FIG. 5 is a structural schematic diagram of an apparatus according to one or more embodiments of the present disclosure.
- FIG. 6 is a structural schematic diagram of an electronic device according to one or more embodiments of the present disclosure.
- one or more embodiments of the present disclosure provide a visual localization method based on a semantic error image, including:
- step S 101 feature extraction is performed for a target image, and at least one matching pair is obtained by performing feature matching for each extracted feature point and each three-dimensional point of a constructed three-dimensional scene model, where each matching pair includes a pixel point of the target image and the three-dimensional point of the three-dimensional scene model which are matched in feature.
- the three-dimensional scene model is built using a plurality of images in a dataset based on an incremental Structure From Motion (SFM) algorithm (e.g. COLMAP method).
- a plurality of feature points are obtained by performing feature extraction for the target image, and at least one matching pair matched in feature is obtained by performing feature matching for each feature point and each three-dimensional point of the three-dimensional scene model.
- feature matching of the two-dimensional feature point and the three-dimensional point may be performed based on Approximate Nearest Neighbor Search algorithm to search and determine the feature point and the three-dimensional point matched in feature.
- a loose error threshold, for example 0.9, may be set in order to improve the matching success rate.
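As a minimal sketch of this 2D-3D matching step, the Python below uses a brute-force nearest-neighbour search in place of an approximate one, and reads the loose threshold of 0.9 as a ratio between the nearest and second-nearest descriptor distances. That reading, the function name `match_features` and the toy descriptors are assumptions made for illustration and are not stated in the text.

```python
import math

def match_features(query_desc, model_desc, ratio=0.9):
    """Return (query_index, model_index) pairs that pass a distance-ratio test.

    Brute-force search stands in for an approximate nearest-neighbour index.
    """
    matches = []
    for qi, q in enumerate(query_desc):
        # Distance from this 2D feature descriptor to every 3D point descriptor.
        dists = sorted((math.dist(q, m), mi) for mi, m in enumerate(model_desc))
        if len(dists) < 2:
            continue
        (d1, best), (d2, _) = dists[0], dists[1]
        # Assumed reading of the loose threshold: keep the match when the
        # nearest distance is below ratio * second-nearest distance.
        if d1 < ratio * d2:
            matches.append((qi, best))
    return matches

# Toy 2D descriptors, for illustration only.
query = [(0.0, 0.0), (5.0, 5.0)]
model = [(0.1, 0.0), (4.0, 4.0), (9.0, 9.0)]
pairs = match_features(query, model)
```

A ratio close to 1 keeps more candidate pairs, including ambiguous ones, which fits the stated aim of raising the matching success rate and leaving outlier rejection to the later pose screening.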
- a two-dimensional semantic image of the target image is obtained by performing semantic segmentation for the target image, wherein each pixel point of the two-dimensional semantic image has corresponding semantic information; and semantic information of each matching pair is determined according to the semantic information of each pixel of the two-dimensional semantic image.
- the two-dimensional semantic image after semantic segmentation and the semantic information of each pixel point of the two-dimensional semantic image may be obtained by performing semantic segmentation for the target image. After the semantic information of each pixel point is determined, the semantic information of each pixel point in each matching pair is taken as semantic information of the matching pair and as semantic information of the three-dimensional point in the matching pair.
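The label transfer described above can be sketched as a simple lookup. The data layout (a row-major grid of class names, and pairs given as a pixel coordinate plus a 3D point id) and the function name `label_matching_pairs` are hypothetical choices for illustration.

```python
def label_matching_pairs(semantic_image, matching_pairs):
    """Attach the class label of each matched pixel to its matching pair.

    semantic_image: 2D list indexed as [row][col], one class label per pixel.
    matching_pairs: list of ((col, row), point3d_id) tuples.
    """
    labelled = []
    for (col, row), point_id in matching_pairs:
        label = semantic_image[row][col]
        # The same label also serves as the semantic information of the 3D point.
        labelled.append({"pixel": (col, row), "point3d": point_id, "label": label})
    return labelled

# Toy 2x3 segmentation result and two matching pairs, for illustration only.
semantic_image = [
    ["sky", "sky", "sky"],
    ["building", "building", "tree"],
]
pairs = [((0, 0), 17), ((2, 1), 42)]
labelled = label_matching_pairs(semantic_image, pairs)
```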
- a hypothesized pose pool including at least one hypothesized pose is constructed according to at least one matching pair.
- the hypothesized pose pool is constructed based on PNP (perspective-n-point) algorithm according to each matching pair.
- the hypothesized pose pool includes at least one hypothesized pose, and each hypothesized pose is determined based on four randomly-selected matching pairs.
- a reprojection error image and a semantic error image are constructed; wherein the semantic error image is obtained in the following manner: constructing a three-dimensional semantic image by using the three-dimensional points in all matching pairs, obtaining a two-dimensional image by performing reprojection for the three-dimensional semantic image according to a current hypothesized pose, assigning semantic information of each theoretical pixel point of the two-dimensional image to the semantic information of the corresponding pixel point of the two-dimensional semantic image, and then constructing the semantic error image based on a semantic error between the semantic information of each theoretical pixel point of the two-dimensional image and the semantic information of the correspondingly-matched three-dimensional point.
- corresponding reprojection error image and semantic error image are constructed for each hypothesized pose.
- the three-dimensional semantic image is constructed using three-dimensional points in all matching pairs.
- the reprojection error image is obtained in the following manner: obtaining the two-dimensional image by performing reprojection for the three-dimensional semantic image according to a current hypothesized pose and constructing the reprojection error image based on a position error between the theoretical position of each theoretical pixel point of the two-dimensional image and the actual position of the corresponding pixel point of the two-dimensional semantic image.
- the semantic error image is obtained in the following manner: assigning semantic information of each theoretical pixel point of the two-dimensional image to the semantic information of the corresponding pixel point of the two-dimensional semantic image, and then constructing the semantic error image based on a semantic error between the semantic information of each theoretical pixel point of the two-dimensional image and the semantic information of the correspondingly-matched three-dimensional point.
- a hypothesized pose with a minimum reprojection error and a minimum semantic error is determined as a pose estimation according to the reprojection error image and the semantic error image of each hypothesized pose.
- a total number of correct positions is calculated according to the reprojection error image corresponding to each hypothesized pose, where a larger total number of correct positions means a smaller reprojection error; and a total number of correct semantics is calculated according to the semantic error image corresponding to each hypothesized pose, where a larger total number of correct semantics means a smaller semantic error.
- a hypothesized pose with a maximum total number of correct positions and a maximum total number of correct semantics is selected as an optimal pose estimation.
- feature extraction is performed for a target image, and at least one matching pair is obtained by performing feature matching for each extracted feature point and each three-dimensional point of a constructed three-dimensional scene model; a two-dimensional semantic image of the target image is obtained by performing semantic segmentation for the target image where each pixel point of the two-dimensional semantic image has corresponding semantic information; semantic information of each matching pair is determined according to semantic information of each pixel of the two-dimensional semantic image; a hypothesized pose pool including at least one hypothesized pose is constructed according to each matching pair; a reprojection error image and a semantic error image are constructed for each hypothesized pose in the hypothesized pose pool; a hypothesized pose with a minimum reprojection error and a minimum semantic error is determined as a pose estimation according to the reprojection error image and the semantic error image of each hypothesized pose.
- the visual localization method of the embodiment introduces semantic information of scene to perform optimal pose screening based on the semantic error image
- the target image is a RGB image.
- the two-dimensional semantic image may be obtained by performing segmentation for the target image using an image segmentation network SegNet.
- the image segmentation network SegNet includes an encoder and a decoder.
- the encoder uses a convolutional layer and a pooling layer alternately
- the decoder uses a convolutional layer and an upsampling layer alternately
- pixel classification employs a Softmax classifier.
- a pooling index placement information of a pooling process
- the key of the image segmentation network SegNet lies in downsampling and upsampling.
- a maximum pixel position index recorded in a downsampling process is used, one batch normalization layer is added after each convolutional layer, and a Rectified Linear Unit (ReLU) activation layer is added after the batch normalization layer, so as to improve the image segmentation effect.
- a max-pooling can realize translation invariance when a small spatial displacement is performed on the input target image. Due to continuous downsampling, a large quantity of spatial information of the target image is overlapped on each pixel of an output feature map. For an image classification task, the multi-layer max-pooling and the downsampling can achieve better robustness due to translation invariance. However, loss of feature map size and spatial information occurs. After downsampling, all encoders only store the max-pooling indices during feature mapping, that is, store a position of a maximum feature value in each pooling window for feature mapping of each encoder.
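To make the role of the stored max-pooling indices concrete, the following sketch pools a small feature map while recording the position of each window maximum, then uses those positions to place the values back during upsampling. It is a plain-Python illustration of the idea only; the function names and the 4x4 feature map are invented for the example, and a real network would apply this per channel inside a deep-learning framework.

```python
def max_pool_with_indices(fmap, size=2):
    """Non-overlapping max-pooling that also records where each maximum was."""
    pooled, indices = [], []
    for r in range(0, len(fmap) - size + 1, size):
        pooled_row, index_row = [], []
        for c in range(0, len(fmap[0]) - size + 1, size):
            window = [(fmap[r + dr][c + dc], (r + dr, c + dc))
                      for dr in range(size) for dc in range(size)]
            value, position = max(window, key=lambda item: item[0])
            pooled_row.append(value)
            index_row.append(position)
        pooled.append(pooled_row)
        indices.append(index_row)
    return pooled, indices

def unpool(pooled, indices, height, width):
    """Put each pooled value back at its recorded position; other cells stay 0."""
    out = [[0] * width for _ in range(height)]
    for pooled_row, index_row in zip(pooled, indices):
        for value, (r, c) in zip(pooled_row, index_row):
            out[r][c] = value
    return out

fmap = [
    [1, 3, 2, 0],
    [4, 2, 1, 5],
    [0, 1, 7, 2],
    [2, 6, 3, 1],
]
pooled, indices = max_pool_with_indices(fmap)
restored = unpool(pooled, indices, 4, 4)
```

Storing only the indices costs far less memory than storing the full encoder feature maps, while still telling the decoder where each maximum came from, which partly offsets the loss of spatial information noted above.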
- step S 103 the hypothesized pose pool including at least one hypothesized pose is constructed according to at least one matching pair in the following manner:
- $C_x$ and $C_y$ are the position of the center point in the pixel coordinate system, $f_x$ and $f_y$ are the focal lengths, S is a coordinate axis tilt parameter, R is a rotation matrix, and t is a translation matrix, where R and t form the extrinsic matrix of the camera.
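For reference, the parameters named here fit the standard pinhole camera model. The relation below is the textbook form and is offered only as an assumption about the coordinate transformation the text refers to; the projective scale factor $\lambda$, the pixel coordinates $(u, v)$ and the world coordinates $(X, Y, Z)$ are symbols introduced for illustration.

```latex
\lambda
\begin{bmatrix} u \\ v \\ 1 \end{bmatrix}
=
\begin{bmatrix} f_x & S & C_x \\ 0 & f_y & C_y \\ 0 & 0 & 1 \end{bmatrix}
\begin{bmatrix} R & t \end{bmatrix}
\begin{bmatrix} X \\ Y \\ Z \\ 1 \end{bmatrix}
```

Given the intrinsic parameters, each 2D-3D matching pair contributes two equations in the unknown R and t, which is why a small set of pairs suffices to compute one hypothesized pose.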
- hypothesized poses can be calculated by randomly selecting four matching pairs based on the PNP algorithm and the above coordinate transformation relationship.
- a corresponding hypothesized pose is calculated according to four matching pairs randomly selected from all matching pairs, a plurality of hypothesized poses are obtained by performing calculation based on random combination of all matching pairs, and the hypothesized pose pool is constructed using all hypothesized poses.
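The pool construction above can be sketched as repeated random sampling of four matching pairs. In the sketch below the PNP solver is a caller-supplied function (`solve_pnp`, hypothetical), and the fixed number of hypotheses and the seed are assumptions; the text does not say how many random combinations are drawn.

```python
import random

def build_pose_pool(matching_pairs, solve_pnp, num_hypotheses=100, seed=0):
    """Build a pool of pose hypotheses from random four-pair subsets.

    solve_pnp is a caller-supplied (hypothetical) function that maps four
    matching pairs to a pose, or returns None when the subset is degenerate.
    """
    if len(matching_pairs) < 4:
        return []
    rng = random.Random(seed)
    pool = []
    for _ in range(num_hypotheses):
        subset = rng.sample(matching_pairs, 4)  # four distinct matching pairs
        pose = solve_pnp(subset)
        if pose is not None:
            pool.append(pose)
    return pool

# Placeholder solver used only to exercise the sampling loop: it returns the
# sorted ids of the sampled pairs rather than a real (R, t) estimate.
def dummy_solver(subset):
    return tuple(sorted(pair_id for pair_id, _ in subset))

pairs = [(i, None) for i in range(10)]
pool = build_pose_pool(pairs, dummy_solver, num_hypotheses=5)
```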
- step S 104 the reprojection error image is constructed in the following manner.
- the actual position coordinate of the pixel point i of the two-dimensional semantic image is $p_i$
- the position coordinate of the three-dimensional point i of the three-dimensional semantic image is $y_i$.
- One hypothesized pose $h_j$ may be obtained by selecting four matching pairs randomly each time based on the PNP algorithm, a plurality of hypothesized poses may be obtained based on random combination of all matching pairs, and the hypothesized pose pool $h_n$ may be constructed using all hypothesized poses, where n is the number of hypothesized poses in the hypothesized pose pool.
- the hypothesized pose $h_j$ is a correct pose
- the three-dimensional semantic image is reprojected as a two-dimensional image according to the hypothesized pose $h_j$.
- the theoretical position coordinate $p'_i$ of the theoretical pixel point i′ of the two-dimensional image obtained through projection is expressed as follows:
- $X_i$, $Y_i$ and $Z_i$ are the position coordinates of the three-dimensional point i in the x, y and z directions
- C is a camera projection matrix
- the reprojection error image is constructed based on the reprojection error present between the theoretical position coordinate $p'_i$ and the actual position coordinate $p_i$.
- an inlier threshold $\tau$ of the matching pair is set, such that
- $n_i = \begin{cases} 1, & e_i < \tau \\ 0, & e_i \ge \tau \end{cases} \quad (8)$
- if the reprojection error $e_i$ is smaller than the inlier threshold $\tau$, the inlier value $n_i$ is 1 and the matching pair $(p_i, y_i)$ is an inlier, which represents that the theoretical pixel point of the two-dimensional image obtained through projection according to the hypothesized pose is consistent in position with the corresponding pixel point of the two-dimensional semantic image, which is called a correct position. If the reprojection error $e_i$ is greater than or equal to the inlier threshold $\tau$, the inlier value $n_i$ is 0.
- $N_i = \sum n_i \quad (9)$
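The inlier test and the count of correct positions can be sketched as follows for a single hypothesized pose. The sketch assumes a pinhole projection with an intrinsic matrix `K` and a pose `(R, t)`, and takes the reprojection error to be the Euclidean pixel distance; the text's own error formula is not reproduced here, so both choices, along with the function names and toy numbers, are assumptions.

```python
import math

def project(point3d, K, R, t):
    """Project a 3D point to pixel coordinates with intrinsics K and pose (R, t)."""
    # Camera-frame coordinates: R @ X + t.
    cam = [sum(R[r][k] * point3d[k] for k in range(3)) + t[r] for r in range(3)]
    # Homogeneous pixel coordinates: K @ cam.
    u, v, w = (sum(K[r][k] * cam[k] for k in range(3)) for r in range(3))
    return (u / w, v / w)

def count_correct_positions(pairs, K, R, t, tau):
    """Return (inlier flags n_i, their total) for one hypothesized pose."""
    flags = []
    for pixel, point3d in pairs:
        # Assumed error measure: Euclidean distance in pixels.
        error = math.dist(pixel, project(point3d, K, R, t))
        flags.append(1 if error < tau else 0)
    return flags, sum(flags)

K = [[100.0, 0.0, 50.0], [0.0, 100.0, 50.0], [0.0, 0.0, 1.0]]
R = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
t = [0.0, 0.0, 0.0]
pairs = [((50.0, 50.0), (0.0, 0.0, 2.0)),   # projects exactly onto the pixel
         ((80.0, 50.0), (0.2, 0.0, 1.0))]   # projects about 10 px away
flags, total = count_correct_positions(pairs, K, R, t, tau=5.0)
```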
- the semantic error image is further constructed in the following manner: determining semantic information of the pixel point of the corresponding position of the two-dimensional semantic image according to the theoretical position coordinate $p'_i$ of the theoretical pixel point of the two-dimensional image, and taking the determined semantic information as semantic information of the theoretical pixel point of the two-dimensional image; determining a semantic error between the semantic information of each theoretical pixel point of the two-dimensional image and the semantic information of each matched three-dimensional point according to the semantic information of each theoretical pixel point of the two-dimensional image; constructing the semantic error image according to the semantic error between the semantic information of each theoretical pixel point and the semantic information of each matched three-dimensional point.
- the semantic information of the pixel point (2D) and the three-dimensional point (3D) in the matching pair is S (e.g. S is sky).
- the theoretical pixel point (2D) and three-dimensional point of the theoretical matching pair are obtained.
- the semantic information for example, B is a building
- the determined semantic information is taken as the semantic information of the theoretical pixel point.
- it is determined that the semantic information B of the theoretical pixel point is different from the semantic information S of the three-dimensional point.
- a semantic error $m_i$ present between them can be expressed as follows:
- the semantic information of the pixel point of the actual position coordinate $p_i$ is $l_i$
- the semantic information of the theoretical pixel point of the theoretical position coordinate $p'_i$ is $l'_i$. If the semantic information of the pixel point is identical to the semantic information of the theoretical pixel point, $m_i$ is 1, which counts as a correct semantic; otherwise $m_i$ is 0.
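Written out from the description above, and by analogy with formulas (8) and (9), the indicator and its total presumably take the following form. This is a reading of the prose rather than a formula copied from the source, and the summation runs over all matching pairs for one hypothesized pose.

```latex
m_i =
\begin{cases}
1, & l_i = l'_i \\
0, & l_i \neq l'_i
\end{cases}
\qquad
M_i = \sum m_i
```

Note that a value of 1 marks agreement, so a larger total means more correct semantics and therefore a smaller semantic error.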
- each hypothesized pose is traversed.
- the total number $N_i$ of correct positions and the total number $M_i$ of correct semantics corresponding to each hypothesized pose are determined according to the formulas (9) and (11).
- a hypothesized pose with a maximum total number of correct positions and a maximum total number of correct semantics is selected therefrom as the optimal pose estimation.
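A minimal sketch of this screening step follows. The text does not say how the two totals are combined when no single pose maximizes both, so the sketch simply adds them; that rule, the function names and the toy labels are assumptions for illustration.

```python
def count_correct_semantics(pixel_labels, theoretical_labels):
    """Flag is 1 where the two labels agree and 0 otherwise; returns (flags, total)."""
    flags = [1 if a == b else 0 for a, b in zip(pixel_labels, theoretical_labels)]
    return flags, sum(flags)

def select_best_pose(scores):
    """Pick the pose id with the largest combined count.

    scores maps a pose id to (correct positions, correct semantics).
    Summing the two counts is an assumed way of combining them.
    """
    return max(scores, key=lambda pose_id: scores[pose_id][0] + scores[pose_id][1])

matched = ["sky", "building", "tree"]
# Labels found at the reprojected positions under two hypothetical poses.
_, m_a = count_correct_semantics(matched, ["sky", "building", "road"])
_, m_b = count_correct_semantics(matched, ["building", "sky", "road"])
scores = {"pose_a": (3, m_a), "pose_b": (3, m_b)}
best = select_best_pose(scores)
```

Here both poses have the same number of correct positions, so the semantic count decides between them, which is the situation the semantic error image is meant to resolve.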
- Evaluation is performed using two evaluation indexes: one index is camera position and the other is camera orientation.
- the evaluation result is made in the form of a percentage that the position and the orientation of the target image reaches a given threshold, where the threshold includes a position threshold and an orientation threshold, the position threshold is in the form of Xm (X meter) and the orientation threshold is in the form of Y° (Y degrees).
- Three different threshold combinations may be adopted: (0.25 meters, 2°), (0.5 meters, 5°), and (5 meters, 10°).
- the threshold combination (0.25 meters, 2°) refers to a percentage of the number of the images in which the final pose estimation and the true pose differ by less than 0.25 meters in position and by less than 2° in orientation to the total number of images after all images are tested.
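The evaluation measure described above can be computed as in the sketch below, which reports, for each threshold combination, the percentage of test images whose position and orientation errors both fall under the thresholds. The function name and the per-image errors are invented for illustration.

```python
def recall_at_thresholds(errors, thresholds):
    """Percentage of images whose pose error is under each (metres, degrees) pair.

    errors: list of (position_error_m, orientation_error_deg), one per test image.
    """
    results = []
    for max_m, max_deg in thresholds:
        hits = sum(1 for pos, rot in errors if pos < max_m and rot < max_deg)
        results.append(100.0 * hits / len(errors))
    return results

thresholds = [(0.25, 2.0), (0.5, 5.0), (5.0, 10.0)]
# Made-up errors for four images, for illustration only.
errors = [(0.1, 1.0), (0.4, 3.0), (2.0, 8.0), (7.0, 20.0)]
percentages = recall_at_thresholds(errors, thresholds)
```

Because an image must satisfy both the position and the orientation threshold to count, the percentages can only stay equal or grow as the thresholds loosen.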
- Table 1 shows the test results under a city dataset of the CMU dataset. According to the test results, over all tested images of the dataset, the method of the embodiment yields a pose estimation within 0.25 meters in position and 2° in orientation of the true pose for 63.1% of the images, within 0.5 meters and 5° for 69.0% of the images, and within 5 meters and 10° for 73.7% of the images.
- the method of the embodiment is clearly superior to other methods in a challenging scene. It can be seen from Table 1 that, on the CMU dataset, the method of this embodiment outperforms the AS, CSL, DenseVLAD and NetVLAD methods at every threshold.
- the scene is more challenging due to influence of season and light irradiation and the like.
- the traditional methods such as AS and CSL have a greatly-reduced localization effect due to light irradiation, view point and repetitive structure and the like.
- the method of this embodiment introduces semantic information and constructs a semantic error image so as to be more robust to some extent in a challenging scene.
- RobotCar Seasons dataset

| Method/dataset | Day time | Night time |
| Meter (m) | 0.25/0.5/5 | 0.25/0.5/5 |
| Degree (deg) | 2/5/10 | 2/5/10 |
| AS | 35.6/67.9/90.4 | 0.9/2.1/4.3 |
| CSL | 45.3/73.5/90.1 | 0.6/2.6/7.2 |
| DenseVLAD | 7.4/31.1/91.0 | 1.0/4.5/22.7 |
| NetVLAD | 2.5/26.3/90.8 | 0.4/2.3/16.0 |
| Present application | 45.5/73.8/92.2 | 6.4/18.1/38.1 |
- the method of this embodiment is superior to the traditional active search (AS) method and the CSL method, as well as to DenseVLAD and NetVLAD, which are based on image retrieval. It can be seen from the night-time results that the pose accuracies on the RobotCar Seasons dataset decrease significantly from day time to night time. Due to the significant change between day time and night time, the localization effects of all methods decrease greatly. In this case, the localization accuracies of the methods based on three-dimensional structure, such as active search and CSL, decrease most significantly, and these methods may even fail. Under significant change of the scene, the method of this embodiment is more robust and remains applicable.
- the method of one or more embodiments of the present disclosure may be performed by a single device, for example, by one computer or server or the like.
- the method of this embodiment may also be applied to a distributed scene and performed by several devices through cooperation.
- one of the several devices may perform only one or more steps of the method according to one or more embodiments of the present disclosure and the several devices may interact with each other to complete the method as above.
- an embodiment of the present disclosure further provides a visual localization apparatus based on a semantic error image, including:
- a semantic information determining module configured to determine a two-dimensional semantic image and a three-dimensional semantic image of a target image, where each pixel point of the two-dimensional semantic image has corresponding two-dimensional semantic information, and each three-dimensional point of the three-dimensional semantic image has corresponding three-dimensional semantic information;
- a matching module configured to determine at least one matching pair formed by the pixel point and the three-dimensional point matched in semantic information according to the two-dimensional semantic image and the three-dimensional semantic image;
- a pose constructing module configured to construct one group of hypothesized poses according to at least one matching pair
- an error image constructing module configured to, for each hypothesized pose, construct a reprojection error image and a semantic error image; wherein the semantic error image is obtained in the following manner: obtaining a two-dimensional image by performing reprojection for the three-dimensional semantic image, assigning semantic information of each theoretical pixel point of the two-dimensional image to the semantic information of the corresponding pixel point of the two-dimensional semantic image, and then constructing the semantic error image based on a semantic error between the semantic information of each theoretical pixel point of the two-dimensional image and the semantic information of the correspondingly-matched three-dimensional point; and
- a pose estimating module configured to select a hypothesized pose with a minimum reprojection error and a minimum semantic error as a pose estimation according to the reprojection error image and the semantic error image of each hypothesized pose.
- the above apparatus is divided into various modules functionally for respective descriptions.
- the functions of the various modules can be implemented in one or more pieces of software and/or hardware.
- the above apparatus of the embodiments is used to implement the corresponding method of the above embodiments and has the beneficial effects of the corresponding method embodiments and thus will not be repeated herein.
- FIG. 6 is a schematic diagram of a hardware structure of a more specific electronic device according to the present disclosure.
- the device may include a processor 1010 , a memory 1020 , an input/output interface 1030 , a communication interface 1040 and a bus 1050 .
- the processor 1010 , the memory 1020 , the input/output interface 1030 and the communication interface 1040 realize mutual communication connection inside the device through the bus 1050 .
- the processor 1010 may be implemented by a general Central Processing Unit (CPU), a microprocessor, an Application Specific Integrated Circuit (ASIC), or one or more integrated circuits or the like to execute relevant programs, so as to realize the technical solution according to the embodiments of the present disclosure.
- the memory 1020 may be implemented in the form of Read Only Memory (ROM), Random Access Memory (RAM), static storage device or dynamic storage device or the like.
- the memory 1020 may store operating system and other application programs.
- relevant program codes are stored in the memory 1020 and may be invoked by the processor 1010 .
- the input/output interface 1030 is used to connect an inputting/outputting module to realize information input and output.
- the inputting/outputting module may be configured in the device as a component (not shown) or externally connected at the device to provide corresponding functions.
- the inputting device may include keyboard, mouse, touch screen, microphone, and various sensors and the like, and the outputting device may include display, loudspeaker, vibrator and indicator lamp and the like.
- the communication interface 1040 is used to connect a communication module (not shown) to realize mutual communication between the present device and other devices.
- the communication module may realize communication in a wired manner (for example, USB or network wire or the like) or in a wireless manner (for example, mobile network, WIFI or Bluetooth or the like).
- the bus 1050 includes a passage through which information can be transmitted among various components of the device (for example, the processor 1010 , the memory 1020 , the input/output interface 1030 and the communication interface 1040 ).
- the device may further include other components required to realize normal operation in a specific implementation process.
- the above device may also only include the components necessary for the technical solution of the embodiments of the present disclosure rather than include all components shown in the drawings.
- the computer readable medium includes permanent, non-permanent, mobile and non-mobile media, which can realize information storage by any method or technology.
- the information may be computer readable instructions, data structures, program modules and other data.
- the examples of the computer storage medium include but are not limited to: Phase-change Random Access Memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM) and other types of RAM, Read-Only Memory (ROM), Electrically-Erasable Programmable Read-Only Memory (EEPROM), Flash Memory or other memory technology, CD-ROM, digital versatile disc (DVD) or other optical storage, cassette type magnetic tape, magnetic disk storage or other magnetic storage device, or any other non-transmission medium for storing information accessible by computing devices.
- the well-known power sources/grounding connections of integrated circuit chips or other components may be shown or not shown in the accompanying drawings.
- the apparatus may be shown in the form of block diagram to avoid making one or more embodiments of the present disclosure difficult to understand, and considerations are given to the following fact, i.e. the details of the implementations of these block diagrams of the apparatus are highly dependent on a platform for implementing one or more embodiments of the present disclosure (i.e. these details should be completely within the understanding scope of those skilled in the art).
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Multimedia (AREA)
- Computational Linguistics (AREA)
- Image Analysis (AREA)
Abstract
The present disclosure provides a visual localization method and apparatus based on a semantic error image. The method includes: performing feature extraction for a target image, and obtaining at least one matching pair by performing feature matching for each extracted feature point and each three-dimensional point of a constructed three-dimensional scene model; obtaining a two-dimensional semantic image of the target image by performing semantic segmentation for the target image; and determining semantic information of each matching pair according to semantic information of each pixel of the two-dimensional semantic image; constructing a hypothesized pose pool including at least one hypothesized pose according to at least one matching pair; for each hypothesized pose, constructing a reprojection error image and a semantic error image; determining a hypothesized pose with a minimum reprojection error and a minimum semantic error as a pose estimation according to the reprojection error image and the semantic error image of each hypothesized pose. Optimal pose screening is performed using the semantic error image constructed based on a semantic error, so as to achieve good localization effect even in a case of significant change of a scene.
Description
One or more embodiments of the present disclosure relate to the field of image processing technologies and in particular to a visual localization method and apparatus based on a semantic error image.
Visual localization is to obtain a spatial position and an orientation of a target, i.e. a pose estimation of the target, by obtaining information around the target based on data such as image and three-dimensional point clouds. Therefore, the visual localization is widely applied in localization and navigation of robots, navigation of self-driving vehicles, augmented reality and three-dimensional reconstruction and the like.
At present, visual localization methods generally fall into three categories. The first is localization based on three-dimensional structure, whose accuracy is significantly reduced, to the point of failure, in cases such as significant change of the scene environment, a large number of repetitive structures in the scene, weak-texture or texture-less structures, strong changes in illumination, motion blur and strong changes of viewpoint. The second is localization based on an image, in which a pose estimation is performed by searching an image database for the image most similar to a target image, leading to a low localization accuracy. The third is localization based on a learning model, in which a learning model is trained in advance and a pose estimation is performed using the model; this method cannot process a large scene and lacks generality because a model must be constructed for each scene. Image similarity retrieval is present in all the above methods. Thus, in an actual application, changing factors such as light and season have a huge impact on a scene, and structural overlaps between images are significantly reduced, leading to poorer localization performance.
In view of this, one or more embodiments of the present disclosure aim to provide a visual localization method and apparatus based on a semantic error image, in which a high localization accuracy can be generated in a case of significant change of a scene.
Based on the above object, one or more embodiments of the present disclosure provide a visual localization method based on a semantic error image, including:
performing feature extraction for a target image, and obtaining at least one matching pair by performing feature matching for each extracted feature point and each three-dimensional point of a constructed three-dimensional scene model, wherein each matching pair includes a pixel point of the target image and the three-dimensional point of the three-dimensional scene model which are matched in feature;
obtaining a two-dimensional semantic image of the target image by performing semantic segmentation for the target image, wherein each pixel point of the two-dimensional semantic image has corresponding semantic information; and determining semantic information of each matching pair according to the semantic information of each pixel of the two-dimensional semantic image;
constructing a hypothesized pose pool including at least one hypothesized pose according to at least one matching pair;
for each hypothesized pose in the hypothesized pose pool, constructing a reprojection error image and a semantic error image; wherein the semantic error image is obtained in the following manner: constructing a three-dimensional semantic image by using the three-dimensional points in all matching pairs, obtaining a two-dimensional image by performing reprojection for the three-dimensional semantic image according to a current hypothesized pose, assigning semantic information of each theoretical pixel point of the two-dimensional image to the semantic information of the corresponding pixel point of the two-dimensional semantic image, and then constructing the semantic error image based on a semantic error between the semantic information of each theoretical pixel point of the two-dimensional image and the semantic information of the correspondingly-matched three-dimensional point;
determining a hypothesized pose with a minimum reprojection error and a minimum semantic error as a pose estimation according to the reprojection error image and the semantic error image of each hypothesized pose.
Optionally, the hypothesized pose pool is constructed in the following manner:
selecting four matching pairs randomly from all matching pairs, obtaining one hypothesized pose through calculation according to a PNP (perspective-n-point) algorithm and the four selected matching pairs, and constructing the hypothesized pose pool by using all hypothesized poses obtained based on the random combination of all matching pairs.
Optionally, the hypothesized pose is calculated in the following formula:
$$h_1 = -R^{-1}\,t \qquad (5)$$
where R is a rotation matrix and t is a translation matrix.
Optionally, selecting the hypothesized pose with the minimum reprojection error and the minimum semantic error as the pose estimation according to the reprojection error image and the semantic error image of each hypothesized pose includes:
calculating a total number of correct positions according to the reprojection error image corresponding to each hypothesized pose;
calculating a total number of correct semantics according to the semantic error image corresponding to each hypothesized pose;
selecting a hypothesized pose with the maximum total number of correct positions and the maximum total number of correct semantics as an optimal pose estimation.
Optionally, calculating the total number of correct positions according to the reprojection error image corresponding to each hypothesized pose includes the following:
for each hypothesized pose hj, j=1, 2 . . . n, where n is the number of hypothesized poses in the hypothesized pose pool, the three-dimensional semantic image is reprojected as the two-dimensional image according to the hypothesized pose hj, wherein based on a position coordinate yi of any three-dimensional point i, a theoretical position coordinate p′i of the theoretical pixel point i′ of the two-dimensional image obtained through projection is expressed as follows:
$$p'_i = C\,h_j\,y_i = C\,h_j\,[X_i,\ Y_i,\ Z_i]^{T} \qquad (6)$$
wherein Xi, Yi and Zi are the position coordinates of the three-dimensional point i in x, y and z directions, and C is a camera projection matrix;
a reprojection error ei present between the theoretical position coordinate p′i of the theoretical pixel point i′ of the two-dimensional image and an actual position coordinate pi of the pixel point i of the two-dimensional semantic image is expressed as follows:
$$e_i = \lVert p_i - p'_i \rVert = \lVert p_i - C\,h_j\,y_i \rVert \qquad (7)$$
the reprojection error image is constructed based on the reprojection error ei, and an inlier threshold τ of the matching pair is set, such that
$$n_i = \begin{cases} 1, & e_i < \tau \\ 0, & e_i \ge \tau \end{cases} \qquad (8)$$
that is, if the reprojection error ei is smaller than the inlier threshold τ, the theoretical pixel point of the two-dimensional image obtained through projection based on the hypothesized pose is consistent in position with the corresponding pixel point of the two-dimensional semantic image, which is called a correct position;
for the reprojection error image corresponding to each hypothesized pose, a total number Ni of inliers is calculated and the total number of correct positions is calculated as follows:
$$N_i = \sum n_i \qquad (9)$$
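The inlier counting described above can be sketched as follows. This is a minimal illustration, not the disclosed implementation: it assumes the hypothesized pose is given as a world-to-camera rotation `R` and translation `t`, that the camera projection uses a skew-free pinhole model, and that points that do not lie in front of the camera are treated as outliers. All function and parameter names are hypothetical.

```python
import math

def project(point, R, t, fx, fy, cx, cy):
    """Project a 3D world point to pixel coordinates (pinhole model, no skew).

    Returns None when the point is not in front of the camera.
    """
    # Camera-frame coordinates: Xc = R * X + t
    xc, yc, zc = (sum(R[r][c] * point[c] for c in range(3)) + t[r] for r in range(3))
    if zc <= 0:
        return None
    return (fx * xc / zc + cx, fy * yc / zc + cy)

def count_correct_positions(pairs, R, t, intrinsics, tau):
    """Count matching pairs whose reprojection error e_i is below tau.

    pairs is a list of (pixel, point3d) tuples; intrinsics is (fx, fy, cx, cy).
    """
    fx, fy, cx, cy = intrinsics
    inliers = 0
    for pixel, point in pairs:
        predicted = project(point, R, t, fx, fy, cx, cy)
        if predicted is None:
            continue  # assumption: unprojectable points count as outliers
        e_i = math.dist(pixel, predicted)  # e_i = ||p_i - p'_i||
        if e_i < tau:
            inliers += 1  # this pair is a "correct position"
    return inliers
```

Calling `count_correct_positions` once per hypothesized pose gives the per-pose total that is later compared across the pool.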
Optionally, calculating the total number of correct semantics according to the semantic error image corresponding to each hypothesized pose includes:
determining a semantic error mi present between the semantic information of the theoretical pixel point of the two-dimensional image and the semantic information of the three-dimensional point;
for the semantic error image corresponding to each hypothesized pose, calculating the total number of the correct semantics Mi:
$$M_i = \sum m_i \qquad (11)$$
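A minimal sketch of the semantic count follows. The definition of the per-point term m_i is not reproduced in this text, so the sketch assumes m_i is 1 when the label found at the projected pixel equals the label of the three-dimensional point and 0 otherwise. Rounding the projection to the nearest pixel and treating out-of-image projections as incorrect are further assumptions, and all names are hypothetical.

```python
def count_correct_semantics(projected_pixels, point_labels, semantic_image):
    """Count projected points whose 2D label agrees with the 3D point's label.

    projected_pixels[i] is the (u, v) projection of 3D point i under one
    hypothesized pose; point_labels[i] is that point's semantic label;
    semantic_image[row][col] is the label of the pixel at that position.
    """
    height, width = len(semantic_image), len(semantic_image[0])
    correct = 0
    for (u, v), label in zip(projected_pixels, point_labels):
        col, row = int(round(u)), int(round(v))
        if not (0 <= row < height and 0 <= col < width):
            continue  # assumption: projections outside the image are incorrect
        if semantic_image[row][col] == label:
            correct += 1  # m_i = 1 (assumed indicator)
    return correct
```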
An embodiment of the present disclosure further provides a visual localization apparatus based on a semantic error image, including:
a matching module, configured to perform feature extraction for a target image, and obtain at least one matching pair by performing feature matching for each extracted feature point and each three-dimensional point of a constructed three-dimensional scene model, wherein each matching pair includes a pixel point of the target image and the three-dimensional point of the three-dimensional scene model which are matched in feature;
a semantic segmenting module, configured to: obtain a two-dimensional semantic image of the target image by performing semantic segmentation for the target image, wherein each pixel point of the two-dimensional semantic image has corresponding semantic information; and determine semantic information of each matching pair according to the semantic information of each pixel of the two-dimensional semantic image;
a pose pool constructing module, configured to construct a hypothesized pose pool including at least one hypothesized pose according to at least one matching pair;
an image constructing module, configured to, for each hypothesized pose in the hypothesized pose pool, construct a reprojection error image and a semantic error image; wherein the semantic error image is obtained in the following manner: constructing a three-dimensional semantic image by using the three-dimensional points in all matching pairs, obtaining a two-dimensional image by performing reprojection for the three-dimensional semantic image according to a current hypothesized pose, assigning semantic information of each theoretical pixel point of the two-dimensional image to the semantic information of the corresponding pixel point of the two-dimensional semantic image, and then constructing the semantic error image based on a semantic error between the semantic information of each theoretical pixel point of the two-dimensional image and the semantic information of the correspondingly-matched three-dimensional point; and
a pose estimating module, configured to determine a hypothesized pose with a minimum reprojection error and a minimum semantic error as a pose estimation according to the reprojection error image and the semantic error image of each hypothesized pose.
Optionally, the pose pool constructing module is configured to: select four matching pairs randomly from all matching pairs, obtain one hypothesized pose through calculation according to a PNP (perspective-n-point) algorithm and the four selected matching pairs, and construct the hypothesized pose pool by using all hypothesized poses obtained based on the random combination of all matching pairs.
Optionally, the hypothesized pose is calculated in the following formula:
$$h_1 = -R^{-1}\,t \qquad (5)$$
wherein R is a rotation matrix and t is a translation matrix.
Optionally, the pose estimating module is configured to: calculate a total number of correct positions according to the reprojection error image corresponding to each hypothesized pose; calculate a total number of correct semantics according to the semantic error image corresponding to each hypothesized pose; and select a hypothesized pose with a maximum total number of correct positions and a maximum total number of correct semantics as an optimal pose estimation.
As can be seen from the above, in the visual localization method and apparatus based on a semantic error image according to one or more embodiments of the present disclosure, feature extraction is performed for a target image, and at least one matching pair is obtained by performing feature matching for each extracted feature point and each three-dimensional point of a constructed three-dimensional scene model; a two-dimensional semantic image of the target image is obtained by performing semantic segmentation for the target image, and semantic information of each matching pair is determined according to semantic information of each pixel of the two-dimensional semantic image; a hypothesized pose pool including at least one hypothesized pose is constructed according to each matching pair; a reprojection error image and a semantic error image are constructed for each hypothesized pose in the hypothesized pose pool; a hypothesized pose with a minimum reprojection error and a minimum semantic error is determined as a pose estimation according to the reprojection error image and the semantic error image of each hypothesized pose. According to the semantic error image constructed based on the semantic error, optimal pose screening is performed, so as to achieve good localization effect even in a case of significant change of scene.
In order to describe the technical solutions in one or more embodiments of the present disclosure or the prior art more clearly, the accompanying drawings required for descriptions of the embodiments or prior art will be briefly introduced below. The accompanying drawings described below merely illustrate one or more embodiments of the present disclosure. Other drawings may be obtained by those skilled in the art based on these accompanying drawings without creative effort.
To make the subject, the technical solutions and advantages of the present disclosure clearer and understandable, the present disclosure will be further described in combination with specific embodiments and accompanying drawings.
It should be noted that unless otherwise defined, the technical terms or scientific terms used in one or more embodiments of the present disclosure shall have general meanings that can be understood by persons of ordinary skills in the art. “First”, “second” and similar words used in one or more embodiments of the present disclosure do not represent any sequence, number or importance but distinguish different components. The terms such as “including” and “containing” mean that an element or article appearing before the words covers an element or article or their equivalents appearing after the words and does not preclude other elements or articles. The terms such as “connect” or “coupling” are not limited to physical or mechanical connection, but may include direct or indirect electrical connection. The terms such as “upper”, “lower”, “left” and “right” are used only to represent relative positional relationship, and when an absolute position of the described object changes, the relative positional relationship will change accordingly.
As shown in FIG. 1 , one or more embodiments of the present disclosure provide a visual localization method based on a semantic error image, including:
At step S101, feature extraction is performed for a target image, and at least one matching pair is obtained by performing feature matching for each extracted feature point and each three-dimensional point of a constructed three-dimensional scene model, where each matching pair includes a pixel point of the target image and the three-dimensional point of the three-dimensional scene model which are matched in feature.
In this embodiment, the three-dimensional scene model is built using a plurality of images in a dataset based on an incremental Structure From Motion (SFM) algorithm (e.g. COLMAP method). With unordered images as input and siftGPU as local feature, the three-dimensional scene model is built during feature extraction. Afterwards, the local feature of each image and all information of the three-dimensional points in the three-dimensional scene model are stored separately for subsequent management and use.
A plurality of feature points are obtained by performing feature extraction for the target image, and at least one matching pair matched in feature is obtained by performing feature matching for each feature point and each three-dimensional point of the three-dimensional scene model.
In some embodiments, feature matching of the two-dimensional feature point and the three-dimensional point may be performed based on Approximate Nearest Neighbor Search algorithm to search and determine the feature point and the three-dimensional point matched in feature. During the search, a loose error threshold, for example, 0.9, may be set in order to improve a successful matching rate.
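The matching step can be illustrated with a brute-force nearest-neighbor search standing in for an approximate index. Interpreting the 0.9 threshold as a ratio between the nearest and second-nearest descriptor distances is an assumption of this sketch, since the text does not say how the threshold is applied. The function name and the data layout are hypothetical.

```python
import math

def match_descriptors(query_descs, model_descs, ratio=0.9):
    """Return (query_index, model_index) pairs that pass a ratio test.

    query_descs are descriptors of 2D feature points, model_descs are
    descriptors attached to 3D points. Brute-force search replaces the
    approximate nearest-neighbor search for clarity.
    """
    matches = []
    for qi, q in enumerate(query_descs):
        # Distances from this 2D descriptor to every 3D point descriptor.
        dists = sorted((math.dist(q, m), mi) for mi, m in enumerate(model_descs))
        if len(dists) < 2:
            continue  # the ratio test needs a runner-up
        (d1, best), (d2, _) = dists[0], dists[1]
        # Accept when the best match is clearly closer than the runner-up.
        if d1 < ratio * d2:
            matches.append((qi, best))
    return matches
```

A looser ratio (closer to 1) accepts more ambiguous matches, which raises the number of candidate pairs at the cost of more outliers for the later screening to reject.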
At step S102, a two-dimensional semantic image of the target image is obtained by performing semantic segmentation for the target image, wherein each pixel point of the two-dimensional semantic image has corresponding semantic information; and semantic information of each matching pair is determined according to the semantic information of each pixel of the two-dimensional semantic image.
In this embodiment, the two-dimensional semantic image after semantic segmentation and the semantic information of each pixel point of the two-dimensional semantic image may be obtained by performing semantic segmentation for the target image. After the semantic information of each pixel point is determined, the semantic information of each pixel point in each matching pair is taken as semantic information of the matching pair and as semantic information of the three-dimensional point in the matching pair.
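As a small illustration of this label transfer, the sketch below looks up the label of each matched pixel in the two-dimensional semantic image and records it for the pair and its three-dimensional point. The data layout (a row-major label grid and `((u, v), point_id)` tuples) and the function name are assumptions made for the example.

```python
def label_matching_pairs(semantic_image, matches):
    """Attach the 2D semantic label to each (pixel, point3d) matching pair.

    semantic_image[v][u] holds the class label of pixel (u, v).
    matches is a list of ((u, v), point3d_id) tuples.
    """
    labelled = []
    for (u, v), point_id in matches:
        label = semantic_image[v][u]  # row index is v, column index is u
        # The same label serves the pair and its 3D point.
        labelled.append({"pixel": (u, v), "point3d": point_id, "label": label})
    return labelled
```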
At step S103, a hypothesized pose pool including at least one hypothesized pose is constructed according to at least one matching pair.
In this embodiment, the hypothesized pose pool is constructed based on PNP (perspective-n-point) algorithm according to each matching pair. The hypothesized pose pool includes at least one hypothesized pose, and each hypothesized pose is determined based on four randomly-selected matching pairs.
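The sampling loop behind the pose pool can be sketched as below. The PNP solver itself is not implemented here: `solve_pnp` is a hypothetical caller-supplied function that takes four matching pairs and returns a pose, or `None` on failure. Drawing a fixed number of samples is also an assumption, since the text does not state how many random combinations are used.

```python
import random

def build_pose_pool(matches, solve_pnp, num_hypotheses, seed=None):
    """Build a pool of pose hypotheses from random 4-pair samples.

    matches is the list of all matching pairs; solve_pnp maps four pairs
    to a pose (or None when no pose can be computed).
    """
    if len(matches) < 4:
        raise ValueError("at least four matching pairs are required")
    rng = random.Random(seed)
    pool = []
    for _ in range(num_hypotheses):
        sample = rng.sample(matches, 4)  # four distinct matching pairs
        pose = solve_pnp(sample)
        if pose is not None:
            pool.append(pose)
    return pool
```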
At step S104, for each hypothesized pose in the hypothesized pose pool, a reprojection error image and a semantic error image are constructed; wherein the semantic error image is obtained in the following manner: constructing a three-dimensional semantic image by using the three-dimensional points in all matching pairs, obtaining a two-dimensional image by performing reprojection for the three-dimensional semantic image according to a current hypothesized pose, assigning semantic information of each theoretical pixel point of the two-dimensional image to the semantic information of the corresponding pixel point of the two-dimensional semantic image, and then constructing the semantic error image based on a semantic error between the semantic information of each theoretical pixel point of the two-dimensional image and the semantic information of the correspondingly-matched three-dimensional point.
In this embodiment, based on the constructed hypothesized pose pool, corresponding reprojection error image and semantic error image are constructed for each hypothesized pose. The three-dimensional semantic image is constructed using three-dimensional points in all matching pairs. The reprojection error image is obtained in the following manner: obtaining the two-dimensional image by performing reprojection for the three-dimensional semantic image according to a current hypothesized pose and constructing the reprojection error image based on a position error between the theoretical position of each theoretical pixel point of the two-dimensional image and the actual position of the corresponding pixel point of the two-dimensional semantic image. The semantic error image is obtained in the following manner: assigning semantic information of each theoretical pixel point of the two-dimensional image to the semantic information of the corresponding pixel point of the two-dimensional semantic image, and then constructing the semantic error image based on a semantic error between the semantic information of each theoretical pixel point of the two-dimensional image and the semantic information of the correspondingly-matched three-dimensional point.
At step S105, a hypothesized pose with a minimum reprojection error and a minimum semantic error is determined as a pose estimation according to the reprojection error image and the semantic error image of each hypothesized pose.
In this embodiment, after the reprojection error image and the semantic error image corresponding to each hypothesized pose are determined, a total number of correct positions is calculated according to the reprojection error image corresponding to each hypothesized pose, where a larger total number of correct positions means a smaller reprojection error; and a total number of correct semantics is calculated according to the semantic error image corresponding to each hypothesized pose, where a larger total number of correct semantics means a smaller semantic error. Afterwards, a hypothesized pose with a maximum total number of correct positions and a maximum total number of correct semantics is selected as an optimal pose estimation.
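The selection step can be sketched as follows. The text does not specify how the two totals are combined when no single pose maximizes both, so ranking by their sum is purely an assumption of this example; a weighted sum or a lexicographic ordering would be equally plausible readings. The function name and tuple layout are hypothetical.

```python
def select_pose(hypotheses):
    """Pick the hypothesis with the largest combined count.

    hypotheses is a list of (pose, n_correct_positions, n_correct_semantics).
    Summing the two counts is an assumption of this sketch.
    """
    if not hypotheses:
        raise ValueError("empty hypothesized pose pool")
    best = max(hypotheses, key=lambda h: h[1] + h[2])
    return best[0]
```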
In the visual localization method based on a semantic error image according to one or more embodiments of the present disclosure, feature extraction is performed for a target image, and at least one matching pair is obtained by performing feature matching for each extracted feature point and each three-dimensional point of a constructed three-dimensional scene model; a two-dimensional semantic image of the target image is obtained by performing semantic segmentation for the target image where each pixel point of the two-dimensional semantic image has corresponding semantic information; semantic information of each matching pair is determined according to semantic information of each pixel of the two-dimensional semantic image; a hypothesized pose pool including at least one hypothesized pose is constructed according to each matching pair; a reprojection error image and a semantic error image are constructed for each hypothesized pose in the hypothesized pose pool; a hypothesized pose with a minimum reprojection error and a minimum semantic error is determined as a pose estimation according to the reprojection error image and the semantic error image of each hypothesized pose. The visual localization method of the embodiment introduces semantic information of scene to perform optimal pose screening based on the semantic error image constructed using the semantic error, so as to achieve a good localization effect even in a case of significant change of scene.
The visual localization method of the embodiment will be detailed below in combination with the accompanying drawings and specific embodiments.
In some embodiments, in step S102, the target image is an RGB image. The two-dimensional semantic image may be obtained by performing segmentation for the target image using the image segmentation network SegNet. SegNet includes an encoder and a decoder. The encoder alternates convolutional layers and pooling layers, the decoder alternates convolutional layers and upsampling layers, and pixel classification employs a Softmax classifier. In the encoding and decoding process, the pooling indices (position information recorded during pooling) are transmitted to the decoder to improve the image segmentation rate. The key of SegNet lies in downsampling and upsampling. During the upsampling process, the maximum pixel position indices recorded in the downsampling process are used; one batch normalization layer is added after each convolutional layer, and a Rectified Linear Unit (ReLU) activation layer is added after the batch normalization layer, so as to improve the image segmentation effect.
A max-pooling can realize translation invariance when a small spatial displacement is performed on the input target image. Due to continuous downsampling, a large quantity of spatial information of the target image is overlapped on each pixel of an output feature map. For an image classification task, the multi-layer max-pooling and the downsampling can achieve better robustness due to translation invariance. However, loss of feature map size and spatial information occurs. After downsampling, all encoders only store the max-pooling indices during feature mapping, that is, store a position of a maximum feature value in each pooling window for feature mapping of each encoder.
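The role of the stored pooling indices can be illustrated with a small pure-Python sketch on a single feature map. It is not the network described above: it only shows non-overlapping max-pooling that records where each maximum came from, and an upsampling step that places values back at those positions. Function names are hypothetical.

```python
def max_pool_with_indices(fmap, size=2):
    """Non-overlapping max-pooling that also records the argmax positions."""
    rows, cols = len(fmap) // size, len(fmap[0]) // size
    pooled = [[0] * cols for _ in range(rows)]
    indices = [[None] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            # Every value in the window, paired with its (row, col) position.
            window = [(fmap[r * size + i][c * size + j], (r * size + i, c * size + j))
                      for i in range(size) for j in range(size)]
            pooled[r][c], indices[r][c] = max(window, key=lambda item: item[0])
    return pooled, indices

def unpool(pooled, indices, shape):
    """Place each pooled value back at its recorded position; zeros elsewhere."""
    out = [[0] * shape[1] for _ in range(shape[0])]
    for r, row in enumerate(pooled):
        for c, value in enumerate(row):
            i, j = indices[r][c]
            out[i][j] = value
    return out
```

Storing only the indices, rather than the full encoder feature maps, is what lets the decoder restore the spatial location of each maximum at low memory cost.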
In step S103, the hypothesized pose pool including at least one hypothesized pose is constructed according to at least one matching pair in the following manner:
There are four major coordinate systems in a matching process of the pixel point and the three-dimensional point: world coordinate system O-XYZ, camera coordinate system Oc−XcYcZc, plane coordinate system O-xy, pixel coordinate system uv. The pixel coordinate (u, v) and the plane coordinate (x, y, z) are subjected to coordinate transformation as follows:
where lengths of each pixel in an x axis direction and a y axis direction are dx and dy respectively, and the coordinate of the pixel under the plane coordinate system is (u0, v0).
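The transformation formula itself is not reproduced in this text. Under the usual pinhole-camera convention, and assuming that (u0, v0) denotes the pixel coordinates of the origin of the plane coordinate system, the relation commonly takes the following form. This is offered as a reference sketch, not as the formula of the disclosure:

```latex
u = \frac{x}{d_x} + u_0, \qquad v = \frac{y}{d_y} + v_0
\quad\Longleftrightarrow\quad
\begin{bmatrix} u \\ v \\ 1 \end{bmatrix}
=
\begin{bmatrix}
\frac{1}{d_x} & 0 & u_0 \\
0 & \frac{1}{d_y} & v_0 \\
0 & 0 & 1
\end{bmatrix}
\begin{bmatrix} x \\ y \\ 1 \end{bmatrix}
```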
By analogous reasoning, the transformation relationship between the pixel coordinate (u, v) and the world coordinate (X, Y, Z) is finally obtained:
where Cx and Cy give the position of the center point in the pixel coordinate system, fx and fy are the focal lengths, S is the coordinate axis tilt (skew) parameter, R is a rotation matrix, and t is a translation matrix; R and t together form the extrinsic matrix of the camera.
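The chain from world coordinate to pixel coordinate can be sketched in a few lines of plain Python. The sketch assumes the common convention that the extrinsics map a world point Pw to the camera frame as Pc = R·Pw + t and that the intrinsics are applied before dividing by depth; the function name `project_point`, its parameter names, and the sample values are illustrative assumptions.

```python
def project_point(world_point, R, t, fx, fy, cx, cy, skew=0.0):
    """Project a world point (X, Y, Z) to pixel coordinates (u, v)."""
    # Extrinsics: world coordinates -> camera coordinates, Pc = R * Pw + t.
    xc, yc, zc = (
        sum(R[row][k] * world_point[k] for k in range(3)) + t[row]
        for row in range(3)
    )
    if zc <= 0:
        raise ValueError("point is behind the camera")
    # Intrinsics (focal lengths, skew, principal point), then division by depth.
    u = (fx * xc + skew * yc) / zc + cx
    v = fy * yc / zc + cy
    return u, v


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
u, v = project_point((1.0, 2.0, 10.0), identity, (0.0, 0.0, 0.0),
                     fx=500.0, fy=500.0, cx=320.0, cy=240.0)
```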
One group of hypothesized poses can be calculated by randomly selecting four matching pairs based on the PNP algorithm and the above coordinate transformation relationship. The hypothesized pose h1 can be obtained according to the pixel coordinate (u, v) and the world coordinate (X, Y, Z) of the four matching pairs in the following formula:
$$h_1 = -R^{-1} t \tag{5}$$
Based on the above principle, a corresponding hypothesized pose is calculated from four matching pairs randomly selected from all matching pairs, a plurality of hypothesized poses are obtained by repeating the calculation over random combinations of all matching pairs, and the hypothesized pose pool is constructed using all hypothesized poses.
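The pool construction can be sketched as follows. The PNP solver itself is not implemented here: `solve_pnp` is a hypothetical callable supplied by the caller, and `num_hypotheses`, `seed`, and the dictionary layout are likewise assumptions made for illustration. The helper `camera_center` evaluates formula (5), using the fact that the inverse of a rotation matrix is its transpose.

```python
import random


def camera_center(R, t):
    """Camera position in world coordinates, h = -R^{-1} t (with R^{-1} = R^T)."""
    return tuple(-sum(R[k][row] * t[k] for k in range(3)) for row in range(3))


def build_pose_pool(matches, solve_pnp, num_hypotheses, seed=0):
    """Draw random 4-subsets of the matching pairs and solve one pose per subset.

    `solve_pnp` is a hypothetical callable: it receives four
    (pixel, point3d) pairs and returns (R, t), or None if no pose is found.
    """
    rng = random.Random(seed)
    pool = []
    for _ in range(num_hypotheses):
        sample = rng.sample(matches, 4)
        result = solve_pnp(sample)
        if result is not None:
            R, t = result
            pool.append({"R": R, "t": t, "center": camera_center(R, t)})
    return pool
```

For example, with a rotation of 90° about the z axis, `R = [[0, -1, 0], [1, 0, 0], [0, 0, 1]]`, and `t = (1, 2, 3)`, `camera_center(R, t)` evaluates to `(-2, 1, -3)`, and substituting that point back into R·h + t gives the zero vector as expected.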
In some embodiments, in step S104, the reprojection error image is constructed in the following manner.
For the position coordinate of the matching pair (pi, yi), the actual position coordinate of the pixel point i of the two-dimensional semantic image is pi, and the position coordinate of the three-dimensional point i of the three-dimensional semantic image is yi. One hypothesized pose hj may be obtained by selecting four matching pairs randomly each time based on the PNP algorithm, a plurality of hypothesized poses may be obtained based on random combination of all matching pairs, and the hypothesized pose pool hn may be constructed using all hypothesized poses, where n is a number of hypothesized poses in the hypothesized pose pool.
For each hypothesized pose hj, j=1, 2 . . . n, the three-dimensional semantic image is reprojected as a two-dimensional image according to hj, on the assumption that hj is a correct pose. Under the hypothesized pose hj, based on the position coordinate yi of any three-dimensional point i, the theoretical position coordinate p′i of the theoretical pixel point i′ of the two-dimensional image obtained through projection is expressed as follows:
$$p'_i = C h_j y_i, \qquad y_i = (X_i, Y_i, Z_i)^{T} \tag{6}$$
where Xi, Yi and Zi are the position coordinates of the three-dimensional point i in x, y and z directions, and C is a camera projection matrix.
Because the hypothesized pose is not necessarily a correct pose, a reprojection error ei exists between the theoretical position coordinate p′i of the theoretical pixel point i′ of the two-dimensional image and the actual position coordinate pi of the pixel point i of the two-dimensional semantic image, and is expressed as follows:
$$e_i = \lVert p_i - p'_i \rVert = \lVert p_i - C h_j y_i \rVert \tag{7}$$
The reprojection error image is constructed based on the reprojection error between the theoretical position coordinate p′i and the actual position coordinate pi. For the reprojection error image, an inlier threshold τ of the matching pair is set, such that
$$n_i = \begin{cases} 1, & e_i < \tau \\ 0, & e_i \ge \tau \end{cases} \tag{8}$$
According to the formula (8), if the reprojection error ei is smaller than the inlier threshold τ, the inlier value ni is 1 and the matching pair (pi, yi) is an inlier. This means that the theoretical pixel point of the two-dimensional image obtained through projection according to the hypothesized pose is consistent in position with the corresponding pixel point of the two-dimensional semantic image, which is called a correct position. If the reprojection error ei is greater than or equal to the inlier threshold τ, the inlier value ni is 0.
For the reprojection error image corresponding to each hypothesized pose, the total number Ni of inliers, that is, the total number of correct positions, is calculated as follows:
$$N_i = \sum n_i \tag{9}$$
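The inlier test and the count of correct positions in formulas (7) to (9) can be sketched in plain Python for one hypothesized pose. The function name `count_inliers`, the argument layout, and the sample coordinates are assumptions for illustration; the projected pixel positions are taken as given.

```python
import math


def count_inliers(actual_pixels, projected_pixels, tau):
    """Return (inlier flags n_i, total N) for one hypothesized pose."""
    flags = []
    for (u, v), (u_proj, v_proj) in zip(actual_pixels, projected_pixels):
        error = math.hypot(u - u_proj, v - v_proj)  # e_i = ||p_i - p'_i||
        flags.append(1 if error < tau else 0)       # strict "<", as in the text
    return flags, sum(flags)


actual = [(10, 10), (20, 20), (30, 30)]
projected = [(11, 10), (20, 26), (30, 33)]
flags, total = count_inliers(actual, projected, tau=3)
```

In this example the three errors are 1, 6 and 3 pixels, so only the first pair is an inlier: an error exactly equal to the threshold does not count.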
The semantic error image is further constructed in the following manner: determining semantic information of the pixel point of the corresponding position of the two-dimensional semantic image according to the theoretical position coordinate p′i of the theoretical pixel point of the two-dimensional image, and taking the determined semantic information as semantic information of the theoretical pixel point of the two-dimensional image; determining a semantic error between the semantic information of each theoretical pixel point of the two-dimensional image and the semantic information of each matched three-dimensional point according to the semantic information of each theoretical pixel point of the two-dimensional image; constructing the semantic error image according to the semantic error between the semantic information of each theoretical pixel point and the semantic information of each matched three-dimensional point.
As shown in FIGS. 2-4 , the semantic information of the pixel point (2D) and the three-dimensional point (3D) in the matching pair is S (e.g. S is sky). After the two-dimensional image is obtained by performing reprojection for the three-dimensional semantic image according to the hypothesized pose, the theoretical pixel point (2D) and the three-dimensional point of the theoretical matching pair are obtained. According to the theoretical position coordinate of the theoretical pixel point, the semantic information (for example, B is a building) of the pixel point at the corresponding position of the two-dimensional semantic image is determined. The determined semantic information is taken as the semantic information of the theoretical pixel point. Then, it is determined that the semantic information B of the theoretical pixel point is different from the semantic information S of the three-dimensional point. The semantic error mi between them can be expressed as follows:
$$m_i = \begin{cases} 1, & l_i = l'_i \\ 0, & l_i \ne l'_i \end{cases} \tag{10}$$
According to the formula (10), for each semantic error image, the semantic information of the pixel point at the actual position coordinate pi is li, and the semantic information of the theoretical pixel point at the theoretical position coordinate p′i is l′i. If the semantic information of the pixel point is identical to the semantic information of the theoretical pixel point, the semantic error mi is 1; otherwise, the semantic error mi is 0.
For each semantic error image corresponding to each hypothesized pose, the total number Mi of correct semantics is calculated in the following formula:
$$M_i = \sum m_i \tag{11}$$
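The semantic comparison behind formulas (10) and (11) can be sketched in the same style. The sketch assumes that the two-dimensional semantic image is a row-major grid of class labels, that projected coordinates are rounded to the nearest pixel, and that a projection falling outside the image counts as a mismatch; the function name `count_correct_semantics` and the sample labels are illustrative assumptions.

```python
def count_correct_semantics(semantic_image, projected_pixels, point_labels):
    """Return (flags m_i, total M) for one hypothesized pose.

    semantic_image[v][u] holds the class label of pixel (u, v);
    point_labels[i] is the label carried by the matched 3D point i.
    """
    height, width = len(semantic_image), len(semantic_image[0])
    flags = []
    for (u, v), label_3d in zip(projected_pixels, point_labels):
        col, row = int(round(u)), int(round(v))
        if 0 <= row < height and 0 <= col < width:
            # Label found at the projected position in the 2D semantic image.
            label_2d = semantic_image[row][col]
            flags.append(1 if label_2d == label_3d else 0)
        else:
            # Assumption: a projection outside the image counts as a mismatch.
            flags.append(0)
    return flags, sum(flags)


semantic_image = [
    ["sky", "sky", "building"],
    ["road", "road", "building"],
]
projected = [(0.2, 0.1), (2.0, 0.9), (1.0, 1.0), (5.0, 0.0)]
labels = ["sky", "sky", "road", "sky"]
flags, total = count_correct_semantics(semantic_image, projected, labels)
```

The second point mirrors the sky/building case described above: it carries the label "sky" but lands on a "building" pixel, so its flag is 0.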
In order to determine the optimal pose estimation, each hypothesized pose is traversed. The total number Ni of correct positions and the total number Mi of correct semantics corresponding to each hypothesized pose are determined according to the formulas (9) and (11). A hypothesized pose with a maximum total number of correct positions and a maximum total number of correct semantics is selected therefrom as the optimal pose estimation.
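The traversal can be sketched as a single pass over the per-pose counts. The text does not state how the two maxima are combined when different poses maximize each count, so the sketch assumes, purely for illustration, that the two counts are summed into one score; the function name `select_best_pose` and the sample counts are also assumptions.

```python
def select_best_pose(scores):
    """Pick the pose identifier with the best combined count.

    scores: list of (pose_id, num_correct_positions, num_correct_semantics).
    Assumption: the two counts are combined by simple addition.
    """
    best = max(scores, key=lambda item: item[1] + item[2])
    return best[0]


scores = [("h1", 40, 35), ("h2", 42, 20), ("h3", 41, 39)]
best_pose = select_best_pose(scores)
```

Here "h2" has the most correct positions, but "h3" is selected because its semantic agreement is far higher. Other combination rules, such as a weighted sum or using the semantic count only to break ties, would fit the same interface.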
A localization effect that can be achieved based on the method of this embodiment is described below in combination with the experimental data.
Evaluation is performed using two evaluation indexes: camera position and camera orientation. The evaluation result is reported as the percentage of target images whose estimated position and orientation fall within a given threshold, where the threshold includes a position threshold in the form of X m (X meters) and an orientation threshold in the form of Y° (Y degrees). Three different threshold combinations may be adopted: (0.25 meters, 2°), (0.5 meters, 5°), and (5 meters, 10°). For example, the result for the threshold combination (0.25 meters, 2°) is the ratio, after all images are tested, of the number of images in which the final pose estimation differs from the true pose by less than 0.25 meters in position and by less than 2° in orientation to the total number of images.
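The metric can be sketched as follows, assuming the per-image position error (in meters) and orientation error (in degrees) have already been computed; the function name `localization_recall` and the sample error values are illustrative and are not experimental data.

```python
def localization_recall(errors, thresholds):
    """Percentage of images within each (position, orientation) threshold.

    errors: list of (position_error_m, orientation_error_deg), one per image.
    thresholds: list of (max_position_m, max_orientation_deg) pairs.
    """
    results = {}
    for max_pos, max_rot in thresholds:
        hits = sum(1 for pos, rot in errors if pos < max_pos and rot < max_rot)
        results[(max_pos, max_rot)] = 100.0 * hits / len(errors)
    return results


errors = [(0.1, 1.0), (0.4, 3.0), (2.0, 8.0), (7.0, 15.0)]
thresholds = [(0.25, 2), (0.5, 5), (5, 10)]
recall = localization_recall(errors, thresholds)
```

With these four made-up images, the three thresholds yield 25%, 50% and 75% respectively, since an image counts only if it satisfies both the position and the orientation bound.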
TABLE 1
Test results on the CMU dataset (percentage of images localized within each threshold combination)
Method | City | Suburb | Park |
---|---|---|---|
Position threshold (m) | 0.25/0.5/5 | 0.25/0.5/5 | 0.25/0.5/5 |
Orientation threshold (deg) | 2/5/10 | 2/5/10 | 2/5/10 |
AS | 55.2/60.3/65.1 | 20.7/25.9/29.9 | 12.7/16.3/20.8 |
CSL | 36.7/42.0/53.1 | 8.6/11.7/21.1 | 7.0/9.6/17.0 |
DenseVLAD | 22.2/48.7/92.8 | 9.9/26.6/85.2 | 10.3/27.0/77.0 |
NetVLAD | 17.4/40.3/93.2 | 7.7/21.0/80.5 | 5.6/15.7/65.8 |
Present application | 63.1/69.0/73.7 | 37.4/45.2/53.3 | 25.5/31.7/40.3 |
Table 1 shows the test results on the CMU dataset. Taking the City subset as an example, among all tested images, the percentage of images in which the pose estimation result and the true pose differ by less than 0.25 meters in position and by less than 2° in orientation is 63.1%, the percentage of images in which they differ by less than 0.5 meters in position and by less than 5° in orientation is 69.0%, and the percentage of images in which they differ by less than 5 meters in position and by less than 10° in orientation is 73.7%.
According to the above test results, the method of this embodiment compares favorably with the other methods in a challenging scene. Table 1 shows that, on the CMU dataset, the method of this embodiment outperforms AS, CSL, DenseVLAD and NetVLAD at the two tighter thresholds, (0.25 meters, 2°) and (0.5 meters, 5°), in all three subsets, while DenseVLAD and NetVLAD score higher at the coarsest threshold (5 meters, 10°). In the CMU dataset, the scene is more challenging due to the influence of season, illumination and the like. For such scenes, traditional methods such as AS and CSL suffer a greatly reduced localization effect due to illumination, viewpoint, repetitive structures and the like. In contrast, the method of this embodiment introduces semantic information and constructs a semantic error image, and is therefore more robust in a challenging scene.
TABLE 2
Test results on the RobotCar Seasons dataset (percentage of images localized within each threshold combination)
Method | Day time | Night time |
---|---|---|
Position threshold (m) | 0.25/0.5/5 | 0.25/0.5/5 |
Orientation threshold (deg) | 2/5/10 | 2/5/10 |
AS | 35.6/67.9/90.4 | 0.9/2.1/4.3 |
CSL | 45.3/73.5/90.1 | 0.6/2.6/7.2 |
DenseVLAD | 7.4/31.1/91.0 | 1.0/4.5/22.7 |
NetVLAD | 2.5/26.3/90.8 | 0.4/2.3/16.0 |
Present application | 45.5/73.8/92.2 | 6.4/18.1/38.1 |
It can be known from the test results in Table 2 that, in a challenging scene, the method of this embodiment is superior to the traditional active search (AS) method and the CSL method, as well as to DenseVLAD and NetVLAD, which are based on image retrieval. Comparing the day-time and night-time results shows that pose accuracy on the RobotCar Seasons dataset decreases significantly at night. Due to the significant change between day time and night time, the localization effects of all methods decrease greatly. In this case, the localization accuracies of the methods based on three-dimensional structure, such as active search and CSL, decrease most significantly, and these methods may even fail. Under such significant scene changes, the method of this embodiment is more robust.
It should be noted that the method of one or more embodiments of the present disclosure may be performed by a single device, for example, by one computer or server or the like. The method of this embodiment may also be applied to a distributed scene and performed by several devices through cooperation. In a case of the distributed scene, one of the several devices may perform only one or more steps of the method according to one or more embodiments of the present disclosure and the several devices may interact with each other to complete the method as above.
Specific embodiments of the present disclosure are described above. Other embodiments not described herein still fall within the scope of the appended claims. In some cases, the actions or steps recorded in the claims may be performed in a sequence different from the embodiments to achieve a desired result. Further, the processes shown in drawings do not necessarily require a particular sequence or a continuous sequence to achieve the desired result. In some embodiments, multi-task processing and parallel processing are possible and may also be advantageous.
As shown in FIG. 5 , an embodiment of the present disclosure further provides a visual localization apparatus based on a semantic error image, including:
a semantic information determining module, configured to determine a two-dimensional semantic image and a three-dimensional semantic image of a target image, where each pixel point of the two-dimensional semantic image has corresponding two-dimensional semantic information, and each three-dimensional point of the three-dimensional semantic image has corresponding three-dimensional semantic information;
a matching module, configured to determine at least one matching pair formed by the pixel point and the three-dimensional point matched in semantic information according to the two-dimensional semantic image and the three-dimensional semantic image;
a pose constructing module, configured to construct one group of hypothesized poses according to at least one matching pair;
an error image constructing module, configured to, for each hypothesized pose, construct a reprojection error image and a semantic error image; wherein the semantic error image is obtained in the following manner: obtaining a two-dimensional image by performing reprojection for the three-dimensional semantic image, assigning semantic information of each theoretical pixel point of the two-dimensional image to the semantic information of the corresponding pixel point of the two-dimensional semantic image, and then constructing the semantic error image based on a semantic error between the semantic information of each theoretical pixel point of the two-dimensional image and the semantic information of the correspondingly-matched three-dimensional point; and
a pose estimating module, configured to select a hypothesized pose with a minimum reprojection error and a minimum semantic error as a pose estimation according to the reprojection error image and the semantic error image of each hypothesized pose.
For ease of description, the above apparatus is divided functionally into various modules that are described separately. Of course, in one or more embodiments of the present disclosure, the functions of the various modules can be implemented in one or more pieces of software and/or hardware.
The above apparatus of the embodiments is used to implement the corresponding method of the above embodiments and has the beneficial effects of the corresponding method embodiments and thus will not be repeated herein.
The processor 1010 may be implemented by a general Central Processing Unit (CPU), a microprocessor, an Application Specific Integrated Circuit (ASIC), or one or more integrated circuits or the like to execute relevant programs, so as to realize the technical solution according to the embodiments of the present disclosure.
The memory 1020 may be implemented in the form of Read Only Memory (ROM), Random Access Memory (RAM), static storage device or dynamic storage device or the like. The memory 1020 may store operating system and other application programs. When the technical solution according to the embodiments of the present disclosure is implemented by software or firmware, relevant program codes are stored in the memory 1020 and may be invoked by the processor 1010.
The input/output interface 1030 is used to connect an inputting/outputting module to realize information input and output. The inputting/outputting module may be configured in the device as a component (not shown) or externally connected at the device to provide corresponding functions. The inputting device may include keyboard, mouse, touch screen, microphone, and various sensors and the like, and the outputting device may include display, loudspeaker, vibrator and indicator lamp and the like.
The communication interface 1040 is used to connect a communication module (not shown) to realize mutual communication between the present device and other devices. The communication module may realize communication in a wired manner (for example, USB or network wire or the like) or in a wireless manner (for example, mobile network, WIFI or Bluetooth or the like).
The bus 1050 includes a passage through which information can be transmitted among various components of the device (for example, the processor 1010, the memory 1020, the input/output interface 1030 and the communication interface 1040).
It should be noted that although the above device only shows the processor 1010, the memory 1020, the input/output interface 1030, the communication interface 1040 and the bus 1050, the device may further include other components required to realize normal operation in a specific implementation process. In addition, those skilled in the art may understand that the above device may also only include the components necessary for the technical solution of the embodiments of the present disclosure rather than include all components shown in the drawings.
In the embodiments of the present disclosure, the computer readable medium includes permanent, non-permanent, mobile and non-mobile media, which can realize information storage by any method or technology. The information may be computer readable instructions, data structures, program modules and other data. Examples of the computer storage medium include but are not limited to: a Phase Change Random Access Memory (PRAM), a Static Random Access Memory (SRAM), a Dynamic Random Access Memory (DRAM), other types of RAM, a Read-Only Memory (ROM), an Electrically-Erasable Programmable Read-Only Memory (EEPROM), a Flash Memory or other memory technology, a CD-ROM, a digital versatile disc (DVD) or other optical storage, a cassette type magnetic tape, a magnetic disk storage or other magnetic storage device, or any other non-transmission medium for storing information accessible by computing devices.
Persons of ordinary skill in the art should understand that the descriptions of the above embodiments are merely illustrative and shall not be intended to imply that the scope of protection of the present disclosure (including the claims) is limited to these embodiments. Based on the idea of the present disclosure, the technical features of the above embodiments or different embodiments can be combined, the steps may be performed in any sequence, and many other changes may be present in different aspects of one or more embodiments of the present disclosure as described above and are not mentioned in the details for simplification.
Furthermore, in order to simplify descriptions and discussions, and make one or more embodiments of the present disclosure not difficult to understand, the well-known power sources/grounding connections of integrated circuit chips or other components may be shown or not shown in the accompanying drawings. In addition, the apparatus may be shown in the form of block diagram to avoid making one or more embodiments of the present disclosure difficult to understand, and considerations are given to the following fact, i.e. the details of the implementations of these block diagrams of the apparatus are highly dependent on a platform for implementing one or more embodiments of the present disclosure (i.e. these details should be completely within the understanding scope of those skilled in the art). In a case that specific details (for example, circuit) are made to describe the exemplary embodiments of the present disclosure, it is apparent to those skilled in the art that one or more embodiments of the present disclosure can be implemented without these specific details or in a case of change of these specific details. As a result, these descriptions shall be considered as explanatory rather than limiting.
Although the present disclosure is described in combination with the specific embodiments of the present disclosure, many substitutions, modifications and variations of these embodiments become apparent to those skilled in the art according to the above descriptions. For example, other memory architecture (for example, DRAM) may use the embodiment discussed herein.
One or more embodiments of the present disclosure are intended to cover all such substitutions, modifications and variations within the broad scope of the appended claims. Therefore, any omissions, modifications, equivalent substitutions, and improvements and the like made within the spirit and principle of one or more embodiments of the present disclosure shall all fall within the scope of protection of the present disclosure.
Claims (10)
1. A visual localization method based on a semantic error image, comprising:
performing feature extraction for a target image, and obtaining at least one matching pair by performing feature matching for each extracted feature point and each three-dimensional point of a constructed three-dimensional scene model, wherein each matching pair comprises a pixel point of the target image and the three-dimensional point of the three-dimensional scene model which are matched in feature;
obtaining a two-dimensional semantic image of the target image by performing semantic segmentation for the target image, wherein each pixel point of the two-dimensional semantic image has corresponding semantic information; and determining semantic information of each matching pair according to the semantic information of each pixel of the two-dimensional semantic image;
constructing a hypothesized pose pool comprising at least one hypothesized pose according to at least one matching pair;
for each hypothesized pose in the hypothesized pose pool, constructing a reprojection error image and a semantic error image; wherein the semantic error image is obtained in the following manner: constructing a three-dimensional semantic image by using the three-dimensional points in all matching pairs, obtaining a two-dimensional image by performing reprojection for the three-dimensional semantic image according to a current hypothesized pose, assigning semantic information of each theoretical pixel point of the two-dimensional image to the semantic information of the corresponding pixel point of the two-dimensional semantic image, and then constructing the semantic error image based on a semantic error between the semantic information of each theoretical pixel point of the two-dimensional image and the semantic information of the correspondingly-matched three-dimensional point;
determining a hypothesized pose with a minimum reprojection error and a minimum semantic error as a pose estimation according to the reprojection error image and the semantic error image of each hypothesized pose.
2. The method according to claim 1 , wherein constructing the hypothesized pose pool comprises:
selecting four matching pairs randomly from all matching pairs, obtaining one hypothesized pose through calculation according to a PNP (perspective-n-point) algorithm and the four selected matching pairs, and constructing the hypothesized pose pool by using all hypothesized poses obtained based on the random combination of all matching pairs.
3. The method according to claim 2 , wherein the hypothesized pose is calculated in the following formula:
$$h_1 = -R^{-1} t \tag{5}$$
wherein R is a rotation matrix, and t is a translation matrix.
4. The method according to claim 1 , wherein selecting the hypothesized pose with the minimum reprojection error and the minimum semantic error as the pose estimation according to the reprojection error image and the semantic error image of each hypothesized pose comprises:
calculating a total number of correct positions according to the reprojection error image corresponding to each hypothesized pose;
calculating a total number of correct semantics according to the semantic error image corresponding to each hypothesized pose;
selecting a hypothesized pose with the maximum total number of correct positions and the maximum total number of correct semantics as an optimal pose estimation.
5. The method according to claim 4 , wherein calculating the total number of correct positions according to the reprojection error image corresponding to each hypothesized pose comprises the following:
for each hypothesized pose hj, j=1, 2 . . . n, the three-dimensional semantic image is reprojected as the two-dimensional image according to the hypothesized pose hj, wherein based on a position coordinate yi of any three-dimensional point i, a theoretical position coordinate p′i of the theoretical pixel point of the two-dimensional image obtained through projection is expressed as follows:
$$p'_i = C h_j y_i, \qquad y_i = (X_i, Y_i, Z_i)^{T} \tag{6}$$
wherein Xi, Yi and Zi are the position coordinates of the three-dimensional point i in x, y and z directions, and C is a camera projection matrix;
a reprojection error ei present between the theoretical position coordinate p′i of the theoretical pixel point i′ of the two-dimensional image and an actual position coordinate pi of the pixel point i of the two-dimensional semantic image is expressed as follows:
$$e_i = \lVert p_i - p'_i \rVert = \lVert p_i - C h_j y_i \rVert \tag{7}$$
the reprojection error image is constructed based on the reprojection error ei and an inlier threshold τ of the matching pair is set, such that,
$$n_i = \begin{cases} 1, & e_i < \tau \\ 0, & e_i \ge \tau \end{cases} \tag{8}$$
if the reprojection error ei is smaller than the inlier threshold τ, the theoretical pixel point of the two-dimensional image obtained through projection based on the hypothesized pose is consistent in position with the corresponding pixel point of the two-dimensional semantic image, which is called correct position;
for the reprojection error image corresponding to each hypothesized pose, a total number Ni of inliers is calculated and the total number of correct positions is calculated as follows:
$$N_i = \sum n_i \tag{9}.$$
6. The method according to claim 5 , wherein calculating the total number of correct semantics according to the semantic error image corresponding to each hypothesized pose comprises:
determining a semantic error mi present between the semantic information of the theoretical pixel point of the two-dimensional image and the semantic information of the three-dimensional point;
for the semantic error image corresponding to each hypothesized pose, calculating the total number of the correct semantics Mi:
$$M_i = \sum m_i \tag{11}.$$
7. A visual localization apparatus based on a semantic error image, comprising:
a matching module, configured to perform feature extraction for a target image, and obtain at least one matching pair by performing feature matching for each extracted feature point and each three-dimensional point of a constructed three-dimensional scene model, wherein each matching pair comprises a pixel point of the target image and the three-dimensional point of the three-dimensional scene model which are matched in feature;
a semantic segmenting module, configured to: obtain a two-dimensional semantic image of the target image by performing semantic segmentation for the target image, wherein each pixel point of the two-dimensional semantic image has corresponding semantic information; and determine semantic information of each matching pair according to the semantic information of each pixel of the two-dimensional semantic image;
a pose pool constructing module, configured to construct a hypothesized pose pool comprising at least one hypothesized pose according to at least one matching pair;
an image constructing module, configured to, for each hypothesized pose in the hypothesized pose pool, construct a reprojection error image and a semantic error image; wherein the semantic error image is obtained in the following manner: constructing a three-dimensional semantic image by using the three-dimensional points in all matching pairs, obtaining a two-dimensional image by performing reprojection for the three-dimensional semantic image according to a current hypothesized pose, assigning semantic information of each theoretical pixel point of the two-dimensional image to the semantic information of the corresponding pixel point of the two-dimensional semantic image, and then constructing the semantic error image based on a semantic error between the semantic information of each theoretical pixel point of the two-dimensional image and the semantic information of the correspondingly-matched three-dimensional point;
a pose estimating module, configured to determine a hypothesized pose with a minimum reprojection error and a minimum semantic error as a pose estimation according to the reprojection error image and the semantic error image of each hypothesized pose.
8. The apparatus according to claim 7 , wherein,
the pose pool constructing module is configured to: select four matching pairs randomly from all matching pairs, obtain one hypothesized pose through calculation according to a PNP (perspective-n-point) algorithm and the four selected matching pairs, and construct the hypothesized pose pool by using all hypothesized poses obtained based on the random combination of all matching pairs.
9. The apparatus according to claim 8 , wherein the hypothesized pose is calculated in the following formula:
$$h_1 = -R^{-1} t \tag{5}$$
wherein R is a rotation matrix and t is a translation matrix.
10. The apparatus according to claim 7 , wherein,
the pose estimating module is configured to: calculate a total number of correct positions according to the reprojection error image corresponding to each hypothesized pose; calculate a total number of correct semantics according to the semantic error image corresponding to each hypothesized pose; and select a hypothesized pose with a maximum total number of correct positions and a maximum total number of correct semantics as an optimal pose estimation.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011199775.8A CN112102411B (en) | 2020-11-02 | 2020-11-02 | Visual positioning method and device based on semantic error image |
CN202011199775.8 | 2020-11-02 |
Publications (2)
Publication Number | Publication Date |
---|---|
US11321937B1 (en) | 2022-05-03 |
US20220138484A1 (en) | 2022-05-05 |
Family
ID=73784300
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US17/473,190 Active US11321937B1 (en) | 2020-11-02 | 2021-09-13 | Visual localization method and apparatus based on semantic error image |
Country Status (2)
Country | Link |
---|---|
US (1) | US11321937B1 (en) |
CN (1) | CN112102411B (en) |
Cited By (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20210312647A1 (en) * | 2018-12-21 | 2021-10-07 | Nikon Corporation | Detecting device, information processing device, detecting method, and information processing program |
US20220004808A1 (en) * | 2018-08-28 | 2022-01-06 | Samsung Electronics Co., Ltd. | Method and apparatus for image segmentation |
CN114677567A (en) * | 2022-05-27 | 2022-06-28 | 成都数联云算科技有限公司 | Model training method and device, storage medium and electronic equipment |
US11494927B2 (en) | 2020-09-15 | 2022-11-08 | Toyota Research Institute, Inc. | Systems and methods for self-supervised depth estimation |
US11615544B2 (en) * | 2020-09-15 | 2023-03-28 | Toyota Research Institute, Inc. | Systems and methods for end-to-end map building from a video sequence using neural camera models |
CN116105603A (en) * | 2023-04-13 | 2023-05-12 | 安徽蔚来智驾科技有限公司 | Method and system for determining the position of a moving object in a venue |
US11682194B2 (en) * | 2021-09-23 | 2023-06-20 | National University Of Defense Technology | Training method for robust neural network based on feature matching |
CN116363218A (en) * | 2023-06-02 | 2023-06-30 | 浙江工业大学 | Lightweight visual SLAM method suitable for dynamic environment |
CN117115238A (en) * | 2023-04-12 | 2023-11-24 | 荣耀终端有限公司 | Pose determining method, electronic equipment and storage medium |
WO2024037562A1 (en) * | 2022-08-19 | 2024-02-22 | 深圳市其域创新科技有限公司 | Three-dimensional reconstruction method and apparatus, and computer-readable storage medium |
CN118089753A (en) * | 2024-04-26 | 2024-05-28 | 江苏集萃清联智控科技有限公司 | Monocular semantic SLAM positioning method and system based on three-dimensional target |
Families Citing this family (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112837367B (en) * | 2021-01-27 | 2022-11-25 | 清华大学 | Semantic decomposition type object pose estimation method and system |
CN112907657A (en) * | 2021-03-05 | 2021-06-04 | 科益展智能装备有限公司 | Robot repositioning method, device, equipment and storage medium |
CN113129419B (en) * | 2021-04-27 | 2023-06-20 | 南昌虚拟现实研究院股份有限公司 | Intelligent visual interaction method and system based on semantics |
CN113362461B (en) * | 2021-06-18 | 2024-04-02 | 盎锐(杭州)信息科技有限公司 | Point cloud matching method and system based on semantic segmentation and scanning terminal |
CN114170366B (en) * | 2022-02-08 | 2022-07-12 | 荣耀终端有限公司 | Three-dimensional reconstruction method based on dotted line feature fusion and electronic equipment |
Citations (41)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20120082371A1 (en) * | 2010-10-01 | 2012-04-05 | Google Inc. | Label embedding trees for multi-class tasks |
US20140198986A1 (en) * | 2013-01-14 | 2014-07-17 | Xerox Corporation | System and method for image selection using multivariate time series analysis |
US20160085310A1 (en) * | 2014-09-23 | 2016-03-24 | Microsoft Corporation | Tracking hand/body pose |
US20160203361A1 (en) * | 2008-08-15 | 2016-07-14 | Brown University | Method and apparatus for estimating body shape |
EP3114833A1 (en) | 2014-03-06 | 2017-01-11 | NEC Laboratories America, Inc. | High accuracy monocular moving object localization |
US20170124713A1 (en) * | 2015-10-30 | 2017-05-04 | Snapchat, Inc. | Image based tracking in augmented reality systems |
CN106803275A (en) | 2017-02-20 | 2017-06-06 | 苏州中科广视文化科技有限公司 | Estimated based on camera pose and the 2D panoramic videos of spatial sampling are generated |
CN107063258A (en) | 2017-03-07 | 2017-08-18 | 重庆邮电大学 | A kind of mobile robot indoor navigation method based on semantic information |
US20180189565A1 (en) * | 2015-08-28 | 2018-07-05 | Imperial College Of Science, Technology And Medicine | Mapping a space using a multi-directional camera |
CN108416840A (en) | 2018-03-14 | 2018-08-17 | 大连理工大学 | A kind of dense method for reconstructing of three-dimensional scenic based on monocular camera |
US20180307310A1 (en) * | 2015-03-21 | 2018-10-25 | Mine One Gmbh | Virtual 3d methods, systems and software |
US20190130214A1 (en) * | 2017-10-30 | 2019-05-02 | Sap Se | Computer vision architecture with machine learned image recognition models |
US20190147221A1 (en) * | 2017-11-15 | 2019-05-16 | Qualcomm Technologies Inc. | Pose estimation and model retrieval for objects in images |
US20190166359A1 (en) * | 2017-11-28 | 2019-05-30 | Paul Lapstun | Viewpoint-Optimized Light Field Display |
US20190172223A1 (en) * | 2017-12-03 | 2019-06-06 | Facebook, Inc. | Optimizations for Dynamic Object Instance Detection, Segmentation, and Structure Mapping |
US10366508B1 (en) | 2016-08-29 | 2019-07-30 | Perceptin Shenzhen Limited | Visual-inertial positional awareness for autonomous and non-autonomous device |
US20190295261A1 (en) * | 2018-03-26 | 2019-09-26 | Samsung Electronics Co., Ltd. | Method and apparatus with image segmentation |
US20190304134A1 (en) * | 2018-03-27 | 2019-10-03 | J. William Mauchly | Multiview Estimation of 6D Pose |
US20190304170A1 (en) * | 2018-03-28 | 2019-10-03 | Apple Inc. | Reconstructing views of real world 3d scenes |
CN110303000A (en) | 2019-07-16 | 2019-10-08 | 江苏维乐益生食品科技有限公司 | A kind of raw materials of food processing cleaning device |
US20190311478A1 (en) * | 2016-07-08 | 2019-10-10 | Avent, Inc. | System and Method for Automatic Detection, Localization, and Semantic Segmentation of Anatomical Objects |
US20190317850A1 (en) * | 2018-04-17 | 2019-10-17 | International Business Machines Corporation | Intelligent responding to error screen associated errors |
US10482618B2 (en) * | 2017-08-21 | 2019-11-19 | Fotonation Limited | Systems and methods for hybrid depth regularization |
US20200005521A1 (en) * | 2018-06-29 | 2020-01-02 | Eloupes, Inc. | Synthesizing an image from a virtual perspective using pixels from a physical imager array weighted based on depth error sensitivity |
US10540577B2 (en) * | 2013-08-02 | 2020-01-21 | Xactware Solutions, Inc. | System and method for detecting features in aerial images using disparity mapping and segmentation techniques |
US20200043190A1 (en) * | 2018-07-31 | 2020-02-06 | Intel Corporation | Removal of projection noise and point-based rendering |
US10600210B1 (en) | 2019-07-25 | 2020-03-24 | Second Spectrum, Inc. | Data processing systems for real-time camera parameter estimation |
US20200104969A1 (en) * | 2018-09-28 | 2020-04-02 | Canon Kabushiki Kaisha | Information processing apparatus and storage medium |
US20200111233A1 (en) * | 2019-12-06 | 2020-04-09 | Intel Corporation | Adaptive virtual camera for indirect-sparse simultaneous localization and mapping systems |
US20200126257A1 (en) * | 2019-12-18 | 2020-04-23 | Intel Corporation | Continuous local 3d reconstruction refinement in video |
US10657391B2 (en) * | 2018-01-05 | 2020-05-19 | Uatc, Llc | Systems and methods for image-based free space detection |
US10685446B2 (en) * | 2018-01-12 | 2020-06-16 | Intel Corporation | Method and system of recurrent semantic segmentation for image processing |
US20200302584A1 (en) * | 2019-03-21 | 2020-09-24 | Sri International | Integrated circuit image alignment and stitching |
US20200311871A1 (en) * | 2017-12-20 | 2020-10-01 | Huawei Technologies Co., Ltd. | Image reconstruction method and device |
US20210074019A1 (en) * | 2019-09-11 | 2021-03-11 | Beijing Xiaomi Intelligent Technology Co., Ltd. | Method, apparatus and medium for object tracking |
US20210089572A1 (en) * | 2019-09-19 | 2021-03-25 | Here Global B.V. | Method, apparatus, and system for predicting a pose error for a sensor system |
US20210166477A1 (en) * | 2019-12-03 | 2021-06-03 | Augustus Intelligence Inc. | Synthesizing images from 3d models |
US20210232851A1 (en) * | 2018-06-07 | 2021-07-29 | Five Al Limited | Image segmentation |
US11094082B2 (en) * | 2018-08-10 | 2021-08-17 | Canon Kabushiki Kaisha | Information processing apparatus, information processing method, robot system, and non-transitory computer-readable storage medium |
US11107244B1 (en) * | 2020-04-17 | 2021-08-31 | Applied Research Associates, Inc. | Location determination in a GPS-denied environment with user annotation |
US20210270722A1 (en) * | 2018-08-28 | 2021-09-02 | Essenlix Corporation | Assay accuracy improvement |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110503688B (en) * | 2019-08-20 | 2022-07-22 | 上海工程技术大学 | Pose estimation method for depth camera |
2020
- 2020-11-02: CN application CN202011199775.8A, patent CN112102411B, status Active
2021
- 2021-09-13: US application US17/473,190, patent US11321937B1, status Active
Patent Citations (43)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20160203361A1 (en) * | 2008-08-15 | 2016-07-14 | Brown University | Method and apparatus for estimating body shape |
US10002460B2 (en) * | 2008-08-15 | 2018-06-19 | Brown University | Method and apparatus for estimating body shape |
US20120082371A1 (en) * | 2010-10-01 | 2012-04-05 | Google Inc. | Label embedding trees for multi-class tasks |
US20140198986A1 (en) * | 2013-01-14 | 2014-07-17 | Xerox Corporation | System and method for image selection using multivariate time series analysis |
US10540577B2 (en) * | 2013-08-02 | 2020-01-21 | Xactware Solutions, Inc. | System and method for detecting features in aerial images using disparity mapping and segmentation techniques |
EP3114833A1 (en) | 2014-03-06 | 2017-01-11 | NEC Laboratories America, Inc. | High accuracy monocular moving object localization |
US20160085310A1 (en) * | 2014-09-23 | 2016-03-24 | Microsoft Corporation | Tracking hand/body pose |
US20180307310A1 (en) * | 2015-03-21 | 2018-10-25 | Mine One Gmbh | Virtual 3d methods, systems and software |
US20180189565A1 (en) * | 2015-08-28 | 2018-07-05 | Imperial College Of Science, Technology And Medicine | Mapping a space using a multi-directional camera |
US20170124713A1 (en) * | 2015-10-30 | 2017-05-04 | Snapchat, Inc. | Image based tracking in augmented reality systems |
US20190311478A1 (en) * | 2016-07-08 | 2019-10-10 | Avent, Inc. | System and Method for Automatic Detection, Localization, and Semantic Segmentation of Anatomical Objects |
US10366508B1 (en) | 2016-08-29 | 2019-07-30 | Perceptin Shenzhen Limited | Visual-inertial positional awareness for autonomous and non-autonomous device |
CN106803275A (en) | 2017-02-20 | 2017-06-06 | 苏州中科广视文化科技有限公司 | Estimated based on camera pose and the 2D panoramic videos of spatial sampling are generated |
CN107063258A (en) | 2017-03-07 | 2017-08-18 | 重庆邮电大学 | A kind of mobile robot indoor navigation method based on semantic information |
US10482618B2 (en) * | 2017-08-21 | 2019-11-19 | Fotonation Limited | Systems and methods for hybrid depth regularization |
US20190130214A1 (en) * | 2017-10-30 | 2019-05-02 | Sap Se | Computer vision architecture with machine learned image recognition models |
US20190147221A1 (en) * | 2017-11-15 | 2019-05-16 | Qualcomm Technologies Inc. | Pose estimation and model retrieval for objects in images |
US20190166359A1 (en) * | 2017-11-28 | 2019-05-30 | Paul Lapstun | Viewpoint-Optimized Light Field Display |
US20190172223A1 (en) * | 2017-12-03 | 2019-06-06 | Facebook, Inc. | Optimizations for Dynamic Object Instance Detection, Segmentation, and Structure Mapping |
US20200311871A1 (en) * | 2017-12-20 | 2020-10-01 | Huawei Technologies Co., Ltd. | Image reconstruction method and device |
US10657391B2 (en) * | 2018-01-05 | 2020-05-19 | Uatc, Llc | Systems and methods for image-based free space detection |
US10685446B2 (en) * | 2018-01-12 | 2020-06-16 | Intel Corporation | Method and system of recurrent semantic segmentation for image processing |
CN108416840A (en) | 2018-03-14 | 2018-08-17 | 大连理工大学 | A kind of dense method for reconstructing of three-dimensional scenic based on monocular camera |
US20190295261A1 (en) * | 2018-03-26 | 2019-09-26 | Samsung Electronics Co., Ltd. | Method and apparatus with image segmentation |
US20190304134A1 (en) * | 2018-03-27 | 2019-10-03 | J. William Mauchly | Multiview Estimation of 6D Pose |
US20190304170A1 (en) * | 2018-03-28 | 2019-10-03 | Apple Inc. | Reconstructing views of real world 3d scenes |
US20190317850A1 (en) * | 2018-04-17 | 2019-10-17 | International Business Machines Corporation | Intelligent responding to error screen associated errors |
US20210232851A1 (en) * | 2018-06-07 | 2021-07-29 | Five Al Limited | Image segmentation |
US20200005521A1 (en) * | 2018-06-29 | 2020-01-02 | Eloupes, Inc. | Synthesizing an image from a virtual perspective using pixels from a physical imager array weighted based on depth error sensitivity |
US20200043190A1 (en) * | 2018-07-31 | 2020-02-06 | Intel Corporation | Removal of projection noise and point-based rendering |
US11094082B2 (en) * | 2018-08-10 | 2021-08-17 | Canon Kabushiki Kaisha | Information processing apparatus, information processing method, robot system, and non-transitory computer-readable storage medium |
US20210270722A1 (en) * | 2018-08-28 | 2021-09-02 | Essenlix Corporation | Assay accuracy improvement |
US20200104969A1 (en) * | 2018-09-28 | 2020-04-02 | Canon Kabushiki Kaisha | Information processing apparatus and storage medium |
US20200302584A1 (en) * | 2019-03-21 | 2020-09-24 | Sri International | Integrated circuit image alignment and stitching |
CN110303000A (en) | 2019-07-16 | 2019-10-08 | 江苏维乐益生食品科技有限公司 | A kind of raw materials of food processing cleaning device |
US10600210B1 (en) | 2019-07-25 | 2020-03-24 | Second Spectrum, Inc. | Data processing systems for real-time camera parameter estimation |
US20210074019A1 (en) * | 2019-09-11 | 2021-03-11 | Beijing Xiaomi Intelligent Technology Co., Ltd. | Method, apparatus and medium for object tracking |
US20210089572A1 (en) * | 2019-09-19 | 2021-03-25 | Here Global B.V. | Method, apparatus, and system for predicting a pose error for a sensor system |
US20210166477A1 (en) * | 2019-12-03 | 2021-06-03 | Augustus Intelligence Inc. | Synthesizing images from 3d models |
US20200111233A1 (en) * | 2019-12-06 | 2020-04-09 | Intel Corporation | Adaptive virtual camera for indirect-sparse simultaneous localization and mapping systems |
US20210398320A1 (en) * | 2019-12-06 | 2021-12-23 | Intel Corporation | Adaptive virtual camera for indirect-sparse simultaneous localization and mapping systems |
US20200126257A1 (en) * | 2019-12-18 | 2020-04-23 | Intel Corporation | Continuous local 3d reconstruction refinement in video |
US11107244B1 (en) * | 2020-04-17 | 2021-08-31 | Applied Research Associates, Inc. | Location determination in a GPS-denied environment with user annotation |
Cited By (16)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20220004808A1 (en) * | 2018-08-28 | 2022-01-06 | Samsung Electronics Co., Ltd. | Method and apparatus for image segmentation |
US11893780B2 (en) * | 2018-08-28 | 2024-02-06 | Samsung Electronics Co., Ltd | Method and apparatus for image segmentation |
US20210312647A1 (en) * | 2018-12-21 | 2021-10-07 | Nikon Corporation | Detecting device, information processing device, detecting method, and information processing program |
US11967094B2 (en) * | 2018-12-21 | 2024-04-23 | Nikon Corporation | Detecting device, information processing device, detecting method, and information processing program |
US11494927B2 (en) | 2020-09-15 | 2022-11-08 | Toyota Research Institute, Inc. | Systems and methods for self-supervised depth estimation |
US11615544B2 (en) * | 2020-09-15 | 2023-03-28 | Toyota Research Institute, Inc. | Systems and methods for end-to-end map building from a video sequence using neural camera models |
US11682194B2 (en) * | 2021-09-23 | 2023-06-20 | National University Of Defense Technology | Training method for robust neural network based on feature matching |
CN114677567B (en) * | 2022-05-27 | 2022-10-14 | 成都数联云算科技有限公司 | Model training method and device, storage medium and electronic equipment |
CN114677567A (en) * | 2022-05-27 | 2022-06-28 | 成都数联云算科技有限公司 | Model training method and device, storage medium and electronic equipment |
WO2024037562A1 (en) * | 2022-08-19 | 2024-02-22 | 深圳市其域创新科技有限公司 | Three-dimensional reconstruction method and apparatus, and computer-readable storage medium |
CN117115238A (en) * | 2023-04-12 | 2023-11-24 | 荣耀终端有限公司 | Pose determining method, electronic equipment and storage medium |
CN116105603A (en) * | 2023-04-13 | 2023-05-12 | 安徽蔚来智驾科技有限公司 | Method and system for determining the position of a moving object in a venue |
CN116105603B (en) * | 2023-04-13 | 2023-09-19 | 安徽蔚来智驾科技有限公司 | Method and system for determining the position of a moving object in a venue |
CN116363218A (en) * | 2023-06-02 | 2023-06-30 | 浙江工业大学 | Lightweight visual SLAM method suitable for dynamic environment |
CN116363218B (en) * | 2023-06-02 | 2023-09-01 | 浙江工业大学 | Lightweight visual SLAM method suitable for dynamic environment |
CN118089753A (en) * | 2024-04-26 | 2024-05-28 | 江苏集萃清联智控科技有限公司 | Monocular semantic SLAM positioning method and system based on three-dimensional target |
Also Published As
Publication number | Publication date |
---|---|
CN112102411A (en) | 2020-12-18 |
US20220138484A1 (en) | 2022-05-05 |
CN112102411B (en) | 2021-02-12 |
Similar Documents
Publication | Title | Publication Date |
---|---|---|
US11321937B1 (en) | Visual localization method and apparatus based on semantic error image | |
Zhang et al. | A UAV-based panoramic oblique photogrammetry (POP) approach using spherical projection | |
WO2020207512A1 (en) | Three-dimensional object modeling method, image processing method, and image processing device | |
CN112396640B (en) | Image registration method, device, electronic equipment and storage medium | |
da Silveira et al. | 3d scene geometry estimation from 360 imagery: A survey | |
US11461911B2 (en) | Depth information calculation method and device based on light-field-binocular system | |
CN107329962B (en) | Image retrieval database generation method, and method and device for enhancing reality | |
CN112435338B (en) | Method and device for acquiring position of interest point of electronic map and electronic equipment | |
CN113689578B (en) | Human body data set generation method and device | |
CN112489099B (en) | Point cloud registration method and device, storage medium and electronic equipment | |
CN113870379A (en) | Map generation method and device, electronic equipment and computer readable storage medium | |
CN112562001B (en) | Object 6D pose estimation method, device, equipment and medium | |
US20230401691A1 (en) | Image defect detection method, electronic device and readable storage medium | |
CN113592015B (en) | Method and device for positioning and training feature matching network | |
CN115035235A (en) | Three-dimensional reconstruction method and device | |
CN117132649A (en) | Ship video positioning method and device for artificial intelligent Beidou satellite navigation fusion | |
CN112085842B (en) | Depth value determining method and device, electronic equipment and storage medium | |
CN114998630B (en) | Ground-to-air image registration method from coarse to fine | |
CN114549927B (en) | Feature detection network training, enhanced actual virtual-actual registration tracking and shielding processing method | |
Bartczak et al. | Extraction of 3D freeform surfaces as visual landmarks for real-time tracking | |
CN113643328B (en) | Calibration object reconstruction method and device, electronic equipment and computer readable medium | |
CN115861922A (en) | Sparse smoke and fire detection method and device, computer equipment and storage medium | |
Guo et al. | Full-automatic high-precision scene 3D reconstruction method with water-area intelligent complementation and mesh optimization for UAV images | |
Rodriguez et al. | Pola4All: survey of polarimetric applications and an open-source toolkit to analyze polarization | |
CN115016647A (en) | Augmented reality three-dimensional registration method for substation fault simulation |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| AS | Assignment | Owner name: NATIONAL UNIVERSITY OF DEFENSE TECHNOLOGY, CHINA. Free format text: ASSIGNMENT OF ASSIGNORS INTEREST; ASSIGNORS: JIANG, JIE; XIN, XING; KANG, LAI; AND OTHERS; SIGNING DATES FROM 20210803 TO 20210901; REEL/FRAME: 057461/0496 |
| FEPP | Fee payment procedure | Free format text: ENTITY STATUS SET TO UNDISCOUNTED (ORIGINAL EVENT CODE: BIG.); ENTITY STATUS OF PATENT OWNER: SMALL ENTITY |
| FEPP | Fee payment procedure | Free format text: ENTITY STATUS SET TO SMALL (ORIGINAL EVENT CODE: SMAL); ENTITY STATUS OF PATENT OWNER: SMALL ENTITY |
| STCF | Information on status: patent grant | Free format text: PATENTED CASE |