CN113570615A - Image processing method based on deep learning, electronic equipment and storage medium


Info

Publication number
CN113570615A
Authority
CN
China
Prior art keywords
image
target
area
region
frame
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110154168.8A
Other languages
Chinese (zh)
Inventor
章子誉
罗国中
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN202110154168.8A
Publication of CN113570615A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 - Image analysis
    • G06T 7/10 - Segmentation; Edge detection
    • G06T 7/11 - Region-based segmentation
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 11/00 - 2D [Two Dimensional] image generation
    • G06T 11/40 - Filling a planar surface by adding surface attributes, e.g. colour or texture
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 - Image analysis
    • G06T 7/10 - Segmentation; Edge detection
    • G06T 7/13 - Edge detection


Abstract

The embodiment of the invention provides an image processing method based on deep learning, an electronic device and a storage medium. The method comprises the following steps: calling an object detection network model to determine a target area image in an image to be processed, wherein the target area image comprises an image of the area where a target detection object is located; calling an image segmentation network model to perform semantic segmentation on the target area image to obtain a semantic segmentation result of the target area image, wherein the semantic segmentation result indicates whether pixels in the target area image belong to the frame area of the target detection object; and determining a filling area from the target area image according to the semantic segmentation result, wherein the filling area refers to the area image other than the frame area within the area where the target detection object is located, and filling a material picture into the filling area. In this way, filling areas in an image, such as those of windows and doors, can be extracted efficiently and accurately, image filling can be completed quickly, and the effect and accuracy of image filling are improved.

Description

Image processing method based on deep learning, electronic equipment and storage medium
Technical Field
The present invention relates to the field of computer technologies, and in particular, to an image processing method based on deep learning, an electronic device, and a storage medium.
Background
With the rapid progress of science and technology, the performance of hardware computing units has greatly improved, allowing artificial intelligence technologies centered on deep learning to develop rapidly and the understanding and application of the three-dimensional world to become increasingly diverse. Entertainment features that use the camera of an electronic device such as a smartphone as the input medium are currently the mainstream way in which most consumers experience artificial intelligence technology.
Among these features, virtual reality technology has grown markedly in recent years compared with traditional techniques such as virtual makeup and face swapping, and its applications are becoming more widespread. In virtual reality features, controllable materials (such as pictures and animated expressions) are usually filled into a partial area of an image, that is, the partial area of the image is covered by the controllable material; the partial area may be the area occupied by an object such as a window or a door in the image. Currently, the filling area in an image is usually determined either by combining a three-dimensional (3D) point cloud with depth information or by performing semantic segmentation on the image. However, the method combining a 3D point cloud and depth information can only extract the approximate position of the filling area, so the accuracy is low and the replacement effect is poor; semantic segmentation of the whole image generally requires a large amount of pixel-level annotation data, involves a large amount of computation, is inefficient, and its segmentation results are prone to speckle and unclear edges. It can be seen that efficiently and accurately extracting the filling area in an image has become an urgent problem to be solved.
Disclosure of Invention
The embodiment of the invention provides an image processing method based on deep learning, electronic equipment and a storage medium, which can efficiently and accurately extract a filling area in an image and improve the effect and accuracy of image filling.
In one aspect, an embodiment of the present invention provides an image processing method based on deep learning, where the method includes:
and acquiring an image to be processed.
And calling an object detection network model to determine a target area image in the image to be processed, wherein the target area image comprises an image of an area where a target detection object is located.
And calling an image segmentation network model to perform semantic segmentation processing on the target area image to obtain a semantic segmentation result of the target area image, wherein the semantic segmentation result is used for indicating whether pixels in the target area image belong to a frame area of the target detection object.
And determining a filling region from the target region image according to the semantic segmentation result, and filling a material picture in the filling region, wherein the filling region comprises a region image except the frame region in the image of the region where the target detection object is located.
In another aspect, an embodiment of the present invention provides an image processing apparatus, including:
and the acquisition module is used for acquiring the image to be processed.
And the determining module is used for calling an object detection network model to determine a target area image in the image to be processed, wherein the target area image comprises an image of an area where a target detection object is located.
And the processing module is used for calling an image segmentation network model to perform semantic segmentation processing on the target area image to obtain a semantic segmentation result of the target area image, and the semantic segmentation result is used for indicating whether pixels in the target area image belong to a frame area of the target detection object.
The determining module is further configured to determine a filling region from the target region image according to the semantic segmentation result, where the filling region includes a region image except the frame region in the image of the region where the target detection object is located.
The processing module is further configured to fill the material picture in the filling area.
In still another aspect, an embodiment of the present invention provides an electronic device, which includes a processor and a storage device, where the processor and the storage device are connected to each other, where the storage device is configured to store a computer program, and the computer program includes program instructions, and the processor is configured to call the program instructions to execute the above deep learning based image processing method.
In still another aspect, an embodiment of the present invention provides a computer-readable storage medium, where a computer program is stored, where the computer program includes program instructions, where the program instructions are executed by a processor to execute the above deep learning-based image processing method.
In yet another aspect, an embodiment of the invention discloses a computer program product or computer program comprising computer instructions stored in a computer-readable storage medium. The processor of a computer device reads the computer instructions from the computer-readable storage medium and executes them, causing the computer device to execute the deep learning-based image processing method.
In the embodiment of the invention, a target area image can be determined in an image to be processed by calling an object detection network model, where the target area image comprises an image of the area where a target detection object is located. By calling an image segmentation network model to perform semantic segmentation on the target area image, a semantic segmentation result can be obtained that indicates whether each pixel in the target area image belongs to the frame area of the target detection object. Performing semantic segmentation only on the target area image avoids the huge amount of computation caused by segmenting the whole image. A filling area is then determined from the target area image according to the semantic segmentation result, where the filling area refers to the area image other than the frame area within the area where the target detection object is located, and the filling area can be filled with a material picture. In this way, filling areas in an image, such as those of windows and doors, can be extracted efficiently and accurately, image filling can be completed quickly, and the effect and accuracy of image filling are improved.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly introduced below. It is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained from these drawings without creative effort.
FIG. 1 is a schematic structural diagram of an image processing framework according to an embodiment of the present invention;
FIG. 2 is a schematic flowchart of an image processing method based on deep learning according to an embodiment of the present invention;
FIG. 3a is a schematic diagram of an image to be processed according to an embodiment of the present invention;
FIG. 3b is a schematic diagram of a target area detection result according to an embodiment of the present invention;
FIG. 3c is a schematic diagram of an image processing effect provided by the embodiment of the invention;
FIG. 4 is a flowchart illustrating another method for processing an image based on deep learning according to an embodiment of the present invention;
FIG. 5a is a schematic diagram of a candidate frame of an image to be processed according to an embodiment of the present invention;
FIG. 5b is a diagram illustrating a target candidate box according to an embodiment of the present invention;
FIG. 5c is a diagram illustrating an expanded target candidate box according to an embodiment of the present invention;
FIG. 5d is a schematic diagram of another image to be processed according to an embodiment of the present invention;
FIG. 5e is a schematic diagram of another image processing effect provided by the embodiment of the invention;
FIG. 6 is a schematic structural diagram of an image processing apparatus according to an embodiment of the present invention;
fig. 7 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Artificial Intelligence (AI) is a theory, method, technique and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend and expand human Intelligence, perceive the environment, acquire knowledge and use the knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technique of computer science that attempts to understand the essence of intelligence and produce a new intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence is the research of the design principle and the realization method of various intelligent machines, so that the machines have the functions of perception, reasoning and decision making.
Artificial intelligence technology is a comprehensive discipline covering a wide range of fields, including both hardware-level and software-level technologies. The basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, mechatronics, and the like. Artificial intelligence software technologies mainly include computer vision, speech processing, natural language processing, and machine learning/deep learning.
Computer Vision (CV) technology is a science that studies how to make machines "see": it uses cameras and computers instead of human eyes to identify, track and measure targets, and further performs graphics processing so that the computer produces images more suitable for human observation or for transmission to instruments for detection. As a scientific discipline, computer vision studies related theories and techniques in an attempt to build artificial intelligence systems that can obtain information from images or multidimensional data. Computer vision technologies generally include image processing, image recognition, image semantic understanding, image retrieval, OCR, video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D technologies, virtual reality, augmented reality, simultaneous localization and mapping, and other technologies, as well as common biometric technologies such as face recognition and fingerprint recognition.
Machine Learning (ML) is a multi-disciplinary field involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory and other disciplines. It specializes in studying how computers simulate or implement human learning behavior so as to acquire new knowledge or skills and reorganize existing knowledge structures to continuously improve their own performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent, and it is applied in all fields of artificial intelligence. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and formal education learning.
The scheme provided by the embodiment of the application mainly relates to the technologies of machine learning, computer vision and the like of artificial intelligence, and is specifically explained by the following embodiments:
referring to fig. 1, which is a schematic structural diagram of an image processing frame according to an embodiment of the present invention, the image processing frame according to an embodiment of the present invention may be formed as follows:
(1) an input image is acquired.
(2) The input image is fed into the object detection network model for processing, so as to identify the specified detection object in the input image and extract the corresponding region image, thereby obtaining a local image from the input image. The local image may be the area where the specified detection object is located in the input image. The specified detection object may specifically be an object consisting of a peripheral frame and the middle region enclosed by that frame, such as a window or a door; a window, for example, can be regarded as a peripheral frame (i.e., the window frame) plus the glass it encloses.
(3) The local image is fed into the image segmentation network model for semantic segmentation. Semantic segmentation classifies each pixel of the local image, and the region enclosed by the peripheral frame of the specified detection object is found from the classification result and used as the filling region.
(4) The filling region is filled with a material picture that matches the current festival atmosphere, producing the image filled with the material picture.
Therefore, the image processing framework first performs target detection on the image to find the region containing the detection object, and then performs semantic segmentation only on that region, avoiding the huge amount of computation caused by segmenting the whole image. After the filling region is determined from the region of the detection object according to the semantic segmentation result, the filling region in the image can be extracted efficiently and accurately, image filling can be completed quickly, and a better filling effect is achieved.
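As a rough illustration only, the detect-segment-fill pipeline described above could be sketched in Python as follows; the `detector` and `segmenter` objects, their method names, and the assumption that the material picture is already resized to the local image are all assumptions made for this sketch and are not defined in the patent.

```python
import numpy as np

def process_image(image, detector, segmenter, material):
    """Minimal sketch of the framework: detect -> segment -> fill."""
    # (2) Object detection: one box (x, y, w, h) around the detection object.
    x, y, w, h = detector.detect(image)
    local = image[y:y + h, x:x + w]

    # (3) Semantic segmentation: per-pixel labels, 1 = peripheral-frame pixel.
    frame_mask = segmenter.segment(local)            # (h, w) array of {0, 1}

    # (4) Fill the enclosed (non-frame) region with the material picture,
    #     assumed here to already have the same shape as the local image.
    fill = frame_mask == 0
    local[fill] = np.asarray(material)[fill]
    image[y:y + h, x:x + w] = local
    return image
```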
Fig. 2 is a schematic flow chart of an image processing method based on deep learning according to an embodiment of the present invention, where the image processing method according to the embodiment of the present invention includes the following steps:
201. and acquiring an image to be processed.
Specifically, the image to be processed may be acquired by calling a shooting device in real time, or may also be an image stored in a storage space such as a local gallery or a cloud gallery, which is not limited in the embodiment of the present invention.
Taking calling the shooting device to acquire the image to be processed as an example: when the user wants to experience the virtual reality function provided by a target application, the user may aim the shooting device of the electronic device at a shooting object such as a window or a door. After the virtual reality function of the target application is opened, the user may trigger a shooting instruction; the electronic device starts the shooting device in response to the shooting instruction, acquires the image captured by the shooting device (which may include the image shown in a preview window), and takes the captured image as the image to be processed.
It should be noted that the electronic device may specifically include a smart phone, a tablet computer, a notebook computer, a vehicle-mounted terminal, an intelligent wearable device, and the like.
202. And calling an object detection network model to determine a target area image in the image to be processed, wherein the target area image comprises an image of an area where a target detection object is located.
The object detection network model can be used for identifying a specified target detection object and determining the position of the target detection object in the image.
Specifically, the object detection network model may be invoked to detect the target detection object in the image to be processed, so as to determine the position of the target detection object in the image to be processed, and the target area image including the area where the target detection object is located is determined according to that position. The target detection object may be a preset shooting object that needs to be filled, such as a window or a door.
Taking a window as the target detection object and the image to be processed shown in fig. 3a as an example: the image to be processed 10 includes the area image of the window 20 and other background area images, and the area image of the window 20 specifically includes a frame area (i.e., the area image of the window frame 21) and the area image of the glass 22. The object detection network model is used to detect the window in the image to be processed, and the detection result is shown in fig. 3b, so that a target area image 30 containing the area where the window is located can be obtained.
203. And calling an image segmentation network model to perform semantic segmentation processing on the target area image to obtain a semantic segmentation result of the target area image, wherein the semantic segmentation result is used for indicating whether pixels in the target area image belong to a frame area of the target detection object.
The image segmentation network model may be used to classify pixels in an image. The image segmentation network model in the invention performs binary classification, that is, each pixel in the image is assigned to one of two categories.
Specifically, the semantic segmentation processing may be performed on the target area image by calling the image segmentation network model, so as to obtain a semantic segmentation result of each pixel in the target area image, where the semantic segmentation result indicates whether each pixel belongs to a frame area of the target detection object.
204. And determining a filling region from the target region image according to the semantic segmentation result, and filling a material picture in the filling region, wherein the filling region comprises a region image except the frame region in the image of the region where the target detection object is located.
Specifically, the region to which each pixel in the target area image belongs may be determined according to the semantic segmentation result, and the filling area that needs to be covered (or replaced) is then determined from those per-pixel results; the filling area refers to the area image other than the frame area within the area where the target detection object is located. The corresponding material picture is then filled into the filling area. Taking fig. 3b as an example, calling the image segmentation network model yields a semantic segmentation result for each pixel in the target area image 30, indicating whether the pixel belongs to the frame region of the window 20 (i.e., the window frame 21). From this result the region enclosed by the window frame 21, i.e., the region where the glass 22 is located in the target area image 30, can be obtained; of course, if the window 20 is open, the region enclosed by the window frame 21 also contains background objects seen through the window 20, such as distant buildings or scenery, in addition to the region where the glass 22 is located. The region enclosed by the window frame 21 may be taken as the filling area. The filling area is then filled with the material picture, with the effect that the glass in the region enclosed by the window frame 21, and any background objects seen through the glass 22, are covered by the material picture, as shown in fig. 3c. The material picture may be a picture selected at random from a material library, or a picture selected by the user after the material library is shown to the user.
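For illustration only, one way this filling step could be realized is sketched below; OpenCV is used here merely as a convenient way to resize the material picture, and the function and variable names are assumptions rather than the patent's own implementation.

```python
import cv2
import numpy as np

def fill_material(local_image, frame_mask, material):
    """Cover every non-frame pixel of the local image with the material picture.
    `frame_mask` is an (h, w) array where 1 marks window-frame pixels."""
    h, w = frame_mask.shape
    patch = cv2.resize(material, (w, h))       # OpenCV expects (width, height)
    out = local_image.copy()
    fill = frame_mask == 0                      # pixels enclosed by the frame
    out[fill] = patch[fill]
    return out
```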
In some possible embodiments, the current time information may be obtained, and a material picture matching the current time information may then be determined from the material library. For example, if the current date is December 25, the holiday atmosphere may be determined to be Christmas, a material picture related to Christmas may be retrieved from the material library, and that picture may be filled into the filling area; as shown in fig. 3c, a picture of Santa Claus riding an elk is filled in. If the holiday is Halloween, the material picture may be a monster lying on the window, and so on. In this way, a virtual display effect consistent with the current holiday atmosphere can be achieved, improving the fun and playability of image processing.
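A minimal sketch of this date-based material selection is shown below; the dictionary-style material library and the file paths are hypothetical, since the patent does not specify how the library is organized.

```python
from datetime import date

# Hypothetical material library keyed by (month, day).
MATERIAL_LIBRARY = {
    (12, 25): "materials/christmas_santa_elk.png",
    (10, 31): "materials/halloween_monster.png",
}

def pick_material(today: date, default: str = "materials/generic.png") -> str:
    """Return the path of a material picture matching the current date."""
    return MATERIAL_LIBRARY.get((today.month, today.day), default)
```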
In the embodiment of the invention, a target area image can be determined in an image to be processed by calling an object detection network model, where the target area image comprises an image of the area where a target detection object is located. By calling an image segmentation network model to perform semantic segmentation on the target area image, a semantic segmentation result can be obtained that indicates whether each pixel in the target area image belongs to the frame area of the target detection object. Performing semantic segmentation only on the target area image avoids the huge amount of computation caused by segmenting the whole image. A filling area is then determined from the target area image according to the semantic segmentation result, where the filling area refers to the area image other than the frame area within the area where the target detection object is located, and the filling area can be filled with a material picture. In this way, filling areas in an image, such as those of windows and doors, can be extracted efficiently and accurately, image filling can be completed quickly, and the effect and accuracy of image filling are improved.
Referring to fig. 4, a schematic flow chart of another deep learning-based image processing method according to an embodiment of the present invention is shown, where the image processing method according to the embodiment of the present invention includes the following steps:
401. and acquiring an image to be processed.
The specific implementation manner of step 401 may refer to the related description of step 201 in the foregoing embodiment, and is not described herein again.
402. And inputting the image to be processed into an object detection network model to obtain a target candidate frame comprising the target detection object.
Specifically, by inputting the image to be processed into the object detection network model, the target detection object in the image to be processed can be detected, and the target candidate frame including the target detection object can be obtained. As shown in fig. 3b, the region 30 in the image to be processed can be used as the target candidate frame including the window 20.
In some possible embodiments, the specific implementation of invoking the object detection network model to obtain the target candidate box may include:
first, an image to be processed is input into an object detection network model, and a plurality of candidate frames of the image to be processed and probability distribution, position information and size information of each candidate frame can be output, wherein the probability distribution is used for indicating the probability that the candidate frame includes a target detection object, the position information can be offset of the center position of the candidate frame from the center position of the target detection object, and the size information can be width and height of the candidate frame. For example, for a certain frame candidate, the output of the object detection network model may include two parts, one part is the probability that the frame candidate includes the target detection object (p0, p1), p1 may represent the probability that the frame candidate includes the target detection object, then p0 may represent the probability that the frame candidate does not include the target detection object, the sum of p0 and p1 is 1, the other part is the position information and the size information (Δ cx, Δ cy, w, h), Δ cx, Δ cy represents the offset amount by which the center position (cx, cy) of the frame candidate deviates from the center position of the target detection object, and w, h represent the width and height of the frame candidate.
Then, at least one candidate frame is determined from the plurality of candidate frames according to the probability distribution; for example, the probability of each candidate frame may be compared with a probability threshold such as 0.6, and the candidate frames whose probability reaches the threshold are retained. After the remaining candidate frames are determined according to the probability, a secondary screening may be performed according to the probability distribution, position information and size information of those candidate frames, and the target candidate frame including the target detection object is determined from them.
It should be noted that the number of the target candidate frames is the same as the number of the target detection objects in the image to be processed, and if the number of the target detection objects in the image to be processed is one, the number of the target candidate frames is also one; if the number of the target detection objects in the image to be processed is k, the number of the target candidate frames is also k, and k is an integer greater than 1. Taking a window as an example, if the image to be processed includes a window, the target candidate frame corresponding to the window can be obtained; if three windows are included in the image to be processed, a target candidate frame corresponding to each of the three windows can be obtained.
In some possible embodiments, performing secondary screening according to the probability distribution, the position information, and the size information of the at least one candidate frame to obtain a specific implementation of the target candidate frame may include:
first, according to the position information and the size information of each candidate frame, a predicted position information of the target detection object can be obtained through calculation, and then, according to the position information and the size information of the at least one candidate frame, at least one predicted position information of the target detection object can be obtained through calculation. For example, if the position information and the size information of a certain frame candidate are (Δ cx, Δ cy, w, h) and the center position of the frame candidate is (cx, cy), the calculated predicted position information of the target detection object is (rx, ry, rw, rh), rx + Δ cx, ry + cy + Δ cy, rw + w, rh-h.
Then, the at least one piece of predicted position information is screened using a Non-Maximum Suppression (NMS) strategy to obtain the most accurate target predicted position information, and the candidate frame from which that target predicted position information was calculated is used as the target candidate frame.
For example, as shown in fig. 5a, five candidate frames A, B, C, D and E remain after screening by probability. Sorted by their probability of containing the window from smallest to largest, they are A, B, C, D, E. Starting from the candidate frame E with the highest probability, it is determined whether the overlap between each of A to D and E is greater than a set threshold; the overlap can be measured by the Intersection over Union (IoU). If the overlaps of A to D with E exceed the threshold, A to D are removed and the candidate frame E is marked as retained, so that the most accurately predicted candidate frame is found and E is used as the target candidate frame including the window.
It will be appreciated that if there are multiple windows in the image to be processed, a corresponding number of target candidate frames will ultimately remain. For example, suppose the image to be processed includes two windows, denoted window 1 and window 2, that 4 candidate frames (A, B, C, D) are retained for window 1 and 3 candidate frames (E, F, G) for window 2, and that sorted by probability from smallest to largest the frames are A, B, E, G, D, C, F. Starting from the candidate frame F with the highest probability, it is determined whether the overlap of each of A, B, E, G, D and C with F is greater than the set threshold; if the overlaps of E and G with F exceed the threshold, E and G are removed and F is marked as retained. Then, from the remaining candidate frames A, B, C and D, the frame C with the highest probability is selected, and the overlaps of A, B and D with C are determined; if they exceed the set threshold, A, B and D are removed and C is marked as retained. Thus the target candidate frame of window 1 is C and that of window 2 is F.
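A compact sketch of the greedy NMS screening described above is given below; it assumes the candidate frames have already been decoded to (x1, y1, x2, y2) corner form, and the 0.5 IoU threshold is an illustrative assumption rather than a value taken from the patent.

```python
import numpy as np

def iou(a, b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy non-maximum suppression; returns indices of retained boxes."""
    order = np.argsort(scores)[::-1]               # highest probability first
    keep = []
    while len(order) > 0:
        best = order[0]
        keep.append(int(best))
        rest = order[1:]
        overlaps = np.array([iou(boxes[best], boxes[i]) for i in rest])
        order = rest[overlaps <= iou_threshold]    # drop boxes overlapping too much
    return keep
```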
403. And determining a target area image from the image to be processed according to the position information and the size information of the target candidate frame.
Specifically, according to the position information and the size information of the target candidate frame, a region image corresponding to the position and the size may be cut out from the image to be processed, and taken as a target region image including the target detection object.
In some feasible embodiments, considering that the image to be processed may exhibit irregularities such as rotation or stretching during shooting and that shooting devices vary widely, the width and height of the target candidate frame may be expanded by a preset proportion (for example, 10%) after the frame is determined, yielding an expanded target candidate frame. The target area image of corresponding position and size is then cut out from the image to be processed according to the position information and size information of the expanded target candidate frame. Expanding the candidate frame helps ensure that the target detection object is fully covered and avoids missing part of it due to an unstable model. As shown in fig. 5b, because the image to be processed was rotated during shooting, the obtained target candidate frame 30 cannot completely cover the target detection object 20 (i.e., a window); the target candidate frame 30 may be expanded by the preset proportion, and as shown in fig. 5c, the expanded target candidate frame 30 can completely cover the target detection object 20, which helps to accurately extract the filling area in the image and improves the image filling effect.
In some possible embodiments, the specific implementation of expanding the target candidate frame to obtain the expanded target candidate frame may include:
and moving the boundary of the target candidate frame to the direction of expanding the target candidate frame according to a preset proportion to obtain an expanded boundary, if the expanded boundary exceeds the boundary of the image to be processed, taking the boundary of the image to be processed as the boundary of the expanded target candidate frame, and then determining the expanded target candidate frame according to the boundary of the expanded target candidate frame.
404. And inputting the target area image into an image segmentation network model for binarization segmentation processing to obtain a classification label of the pixel in the target area image.
Specifically, the image segmentation network model is called to perform binarization segmentation on the pixels of the target area image; that is, the pixels are classified into two categories to obtain a classification label (also referred to as a segmentation mask) for each pixel in the target area image. The classification label indicates whether the pixel belongs to the frame area of the target detection object; for example, a label of 1 indicates that the pixel belongs to the frame area of the target detection object, and a label of 0 indicates that it does not.
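As an illustration only, turning a two-class score map into per-pixel classification labels could look like the sketch below; the (H, W, 2) score layout is an assumption about the segmentation network's output, not something the patent specifies.

```python
import numpy as np

def labels_from_scores(scores):
    """scores: (H, W, 2) per-pixel class scores
    (class 0 = not a frame pixel, class 1 = frame pixel of the detection object).
    Returns an (H, W) array of 0/1 classification labels (the segmentation mask)."""
    return np.argmax(scores, axis=-1).astype(np.uint8)
```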
405. And determining a semantic segmentation result of the target area image according to the classification label of the pixel in the target area image.
406. And obtaining pixels which do not belong to the frame region of the target detection object in the image of the region of the target detection object according to the semantic segmentation result.
407. And taking a second area image formed by the pixels which do not belong to the frame area of the target detection object as a filling area, and filling a material picture in the filling area.
Specifically, after the semantic segmentation result is obtained, the pixels that do not belong to the frame area of the target detection object within the image of the area where the target detection object is located may be obtained according to the pixel classification labels; the region formed by these pixels is used as the filling area, and the material picture is then filled into the filling area. Taking fig. 3b as an example, the pixels within the image of the area where the window 20 is located that do not belong to the frame area 21 of the window may be obtained from the classification labels, and the region formed by those pixels may be used as the filling area.
In some possible embodiments, when determining the filling area, the method may also obtain pixels of a frame area belonging to the target detection object in the target area image according to the classification label of the pixels, determine a first area image composed of the pixels of the frame area belonging to the target detection object, and then determine an area image other than the first area image in the image of the area where the target detection object is located as the filling area. Taking fig. 3b as an example, the pixels belonging to the frame region 21 of the window 20 in the target region image 30 may also be obtained according to the classification label of the pixels, the first region image 21 (i.e., the region image corresponding to the frame region 21) composed of the pixels belonging to the frame region 21 of the window 20 is determined, and then the region image 22 except the first region image 21 in the image 20 of the region where the window is located is determined as the filling region.
In some possible embodiments, the object detection network model and the image segmentation network model may be trained as follows. A training sample set is obtained, where the training sample set includes a plurality of images and annotation information; the annotation information includes an annotation frame for the target detection object and a semantic segmentation result, and the annotation frame is a rectangular frame in the image that contains the target detection object. Taking a window as the target detection object, the plurality of images may include images labeled as windows from public data sets such as SUN RGB-D and COCO, images of indoor windows captured with a shooting device such as a camera, and, as negative examples, images of indoor objects such as office desks, chairs, elevators, floors and stairs captured with a shooting device such as a camera; the negative example images help avoid false alarms during window detection and window semantic segmentation. Then, a neural network is trained using the plurality of images and the annotation frames in the annotation information to obtain the object detection network model, and a neural network is trained using the plurality of images and the semantic segmentation results in the annotation information to obtain the image segmentation network model.
The object detection Network model and the image segmentation Network model may adopt a structure of a Neural Network such as a Convolutional Neural Network (CNN) and a Recurrent Neural Network (RNN), and may specifically include a Convolutional layer, a pooling layer, a nonlinear activation function, an upsampling layer, and the like.
When training the image segmentation network model, a cross-entropy loss function may be used to supervise it. The specific loss for a pixel is L = -log(Pi), where Pi denotes the probability, between 0 and 1, that the prediction for that pixel matches its labeled ground-truth classification label.
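A small numeric sketch of this per-pixel cross-entropy term follows; the array shapes are assumptions, and in practice deep learning frameworks provide equivalent built-in losses.

```python
import numpy as np

def pixel_cross_entropy(probs, labels, eps=1e-9):
    """probs: (H, W, 2) predicted class probabilities per pixel;
    labels: (H, W) ground-truth labels in {0, 1}.
    Returns the mean of L = -log(Pi) over all pixels, where Pi is the
    probability predicted for the true class of pixel i."""
    h, w = labels.shape
    pi = probs[np.arange(h)[:, None], np.arange(w)[None, :], labels]
    return float(np.mean(-np.log(pi + eps)))
```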
In some possible embodiments, there may be multiple target detection objects in the image to be processed. As shown in fig. 5d, the image to be processed contains three windows, and the determined filling areas include areas 41, 42 and 43; the three areas are then filled with the material picture, with the effect shown in fig. 5e. Filling the three areas with a single material picture gives the image a stronger sense of unity. Of course, three material pictures may instead be found, with each material picture filling one area; this is not limited in the embodiment of the present invention.
In some possible embodiments, for the overlapping portions of multiple target candidate frames, OR logic may be adopted: as long as the semantic segmentation result of one target candidate frame indicates that a pixel in the overlapping portion belongs to the target detection object, that pixel belongs to the target detection object in the final detection result.
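This per-pixel OR merge could be sketched as follows; accumulating the per-box masks into one full-image mask is an assumption made for illustration.

```python
import numpy as np

def merge_box_masks(image_shape, boxes, box_masks):
    """Combine per-box binary masks into one full-image mask with OR logic.
    boxes: list of (x, y, w, h); box_masks: matching list of (h, w) 0/1 arrays."""
    merged = np.zeros(image_shape[:2], dtype=np.uint8)
    for (x, y, w, h), mask in zip(boxes, box_masks):
        region = merged[y:y + h, x:x + w]
        np.maximum(region, mask, out=region)   # OR: a pixel is set if any box sets it
    return merged
```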
In the embodiment of the invention, a target candidate frame including the target detection object can be obtained by calling the object detection network model, and a target area image of corresponding position and size is cut out from the image to be processed according to the position information and size information of the target candidate frame. The image segmentation network model is then called to perform binarization segmentation on the target area image to obtain classification labels for its pixels, and, according to those labels, the region composed of pixels that do not belong to the frame area of the target detection object is used as the filling area, such as the filling area of a window or a door; the material picture is then filled into the filling area. Performing semantic segmentation only on the target area image avoids the huge amount of computation caused by segmenting the whole image, so the filling area in the image can be extracted efficiently and accurately, image filling can be completed quickly, and the effect and accuracy of image filling are improved.
Referring to fig. 6, a schematic structural diagram of an image processing apparatus according to an embodiment of the present invention is shown, where the apparatus includes:
the acquiring module 601 is configured to acquire an image to be processed.
A determining module 602, configured to invoke an object detection network model to determine a target area image in the image to be processed, where the target area image includes an image of an area where a target detection object is located.
The processing module 603 is configured to invoke an image segmentation network model to perform semantic segmentation processing on the target area image, so as to obtain a semantic segmentation result of the target area image, where the semantic segmentation result is used to indicate whether a pixel in the target area image belongs to a frame region of the target detection object.
The determining module 602 is further configured to determine a filling region from the target region image according to the semantic segmentation result, where the filling region includes a region image except the frame region in the image of the region where the target detection object is located.
The processing module 603 is further configured to fill the material picture in the filling area.
Optionally, the semantic segmentation result includes a classification label for each pixel, where the classification label is used to indicate whether the pixel belongs to the frame region of the target detection object.
Optionally, the determining module 602 is specifically configured to:
and acquiring pixels belonging to the frame region of the target detection object in the target region image according to the semantic segmentation result.
And determining a first area image formed by pixels of the frame area belonging to the target detection object.
And determining the area image except the first area image in the image of the area where the target detection object is located as a filling area.
Optionally, the determining module 602 is specifically configured to:
and obtaining pixels which do not belong to the frame region of the target detection object in the image of the region of the target detection object according to the semantic segmentation result.
And taking a second area image formed by the pixels which do not belong to the frame area of the target detection object as a filling area.
Optionally, the processing module 603 is specifically configured to:
and inputting the target area image into an image segmentation network model for binarization segmentation processing to obtain a classification label of the pixel in the target area image.
And determining a semantic segmentation result of the target area image according to the classification label of the pixel in the target area image.
Optionally, the determining module 602 is specifically configured to:
and inputting the image to be processed into an object detection network model to obtain a target candidate frame comprising the target detection object.
And determining a target area image from the image to be processed according to the position information and the size information of the target candidate frame.
Optionally, the determining module 602 is specifically configured to:
and expanding the target candidate frame according to a preset proportion to obtain the expanded target candidate frame.
And according to the position information and the size information of the expanded target candidate frame, taking an image area corresponding to the position and the size of the expanded target candidate frame in the image to be processed as a target area image.
Optionally, the determining module 602 is specifically configured to:
and moving the boundary of the target candidate frame to the direction of expanding the target candidate frame according to a preset proportion to obtain the expanded boundary.
And if the expanded boundary exceeds the boundary of the image to be processed, taking the boundary of the image to be processed as the boundary of the expanded target candidate frame.
And determining the expanded target candidate frame according to the boundary of the expanded target candidate frame.
Optionally, the determining module 602 is specifically configured to:
inputting the image to be processed into an object detection network model to obtain probability distribution, position information and size information of a plurality of candidate frames of the image to be processed, wherein the probability distribution is used for indicating the probability that the candidate frames comprise the target detection object.
Determining at least one candidate box from the plurality of candidate boxes according to the probability distribution of the plurality of candidate boxes.
And determining a target candidate frame comprising the target detection object from the at least one candidate frame according to the probability distribution, the position information and the size information of the at least one candidate frame.
Optionally, the determining module 602 is specifically configured to:
and calculating to obtain at least one piece of predicted position information of the target detection object according to the position information and the size information of the at least one candidate frame.
And screening the at least one piece of predicted position information by using a non-maximum suppression screening strategy to obtain target predicted position information.
And taking the candidate frame of which the target prediction position information is calculated from the at least one candidate frame as a target candidate frame comprising the target detection object.
Optionally, the obtaining module 601 is further configured to obtain a training sample set, where the training sample set includes a plurality of images and labeling information, and the labeling information includes a labeling box and a semantic segmentation result for a target detection object.
The processing module 603 is further configured to train a neural network by using the plurality of images and the label boxes in the label information to obtain an object detection network model.
The processing module 603 is further configured to train a neural network by using the semantic segmentation results in the plurality of images and the annotation information, so as to obtain an image segmentation network model.
Optionally, the processing module 603 is specifically configured to:
the current time information is acquired.
And determining a material picture matched with the current time information from a material library.
And filling the material picture in the filling area.
Optionally, the obtaining module 601 is specifically configured to:
and starting the shooting device in response to a shooting instruction triggered by the target application.
Acquiring an image captured by the shooting device, wherein the image captured by the shooting device comprises an image shown in a preview window.
And taking the image captured by the shooting device as an image to be processed.
It should be noted that the functions of each functional module of the image processing apparatus according to the embodiment of the present invention may be specifically implemented according to the method in the foregoing method embodiment, and the specific implementation process may refer to the related description of the foregoing method embodiment, which is not described herein again.
Fig. 7 is a schematic structural diagram of an electronic device according to an embodiment of the present invention, where the electronic device includes a power supply module and the like, and includes a processor 701, a storage 702, a user interface 703, and a shooting device 704. Data can be exchanged between the processor 701, the storage 702, the user interface 703 and the photographing apparatus 704.
The storage 702 may include a volatile memory (volatile memory), such as a random-access memory (RAM); the storage device 702 may also include a non-volatile memory (non-volatile memory), such as a flash memory (flash memory), a solid-state drive (SSD), or the like; the storage means 702 may also comprise a combination of memories of the kind described above.
The user interface 703 may include a display, a touch panel, and the like, and is used for outputting data such as an image and detecting a touch operation of a user.
The capture device 704 may include a camera, such as a front-facing camera, a rear-facing camera, a single-lens camera, or a multi-lens camera, among others.
The processor 701 may be a Central Processing Unit (CPU). In one embodiment, the processor 701 may also be a Graphics Processing Unit (GPU), or a combination of a CPU and a GPU. In one embodiment, the storage 702 is used to store program instructions, and the processor 701 may call the program instructions to perform the following operations:
and acquiring an image to be processed.
And calling an object detection network model to determine a target area image in the image to be processed, wherein the target area image comprises an image of an area where a target detection object is located.
And calling an image segmentation network model to perform semantic segmentation processing on the target area image to obtain a semantic segmentation result of the target area image, wherein the semantic segmentation result is used for indicating whether pixels in the target area image belong to a frame area of the target detection object.
And determining a filling region from the target region image according to the semantic segmentation result, and filling a material picture in the filling region, wherein the filling region comprises a region image except the frame region in the image of the region where the target detection object is located.
Optionally, the semantic segmentation result includes a classification label for each pixel, where the classification label is used to indicate whether the pixel belongs to the frame region of the target detection object.
Optionally, the processor 701 is specifically configured to:
and acquiring pixels belonging to the frame region of the target detection object in the target region image according to the semantic segmentation result.
And determining a first area image formed by pixels of the frame area belonging to the target detection object.
And determining the area image except the first area image in the image of the area where the target detection object is located as a filling area.
Optionally, the processor 701 is specifically configured to:
and obtaining pixels which do not belong to the frame region of the target detection object in the image of the region of the target detection object according to the semantic segmentation result.
And taking a second area image formed by the pixels which do not belong to the frame area of the target detection object as a filling area.
Optionally, the processor 701 is specifically configured to:
and inputting the target area image into an image segmentation network model for binarization segmentation processing to obtain a classification label of the pixel in the target area image.
And determining a semantic segmentation result of the target area image according to the classification label of the pixel in the target area image.
Optionally, the processor 701 is specifically configured to:
and inputting the image to be processed into an object detection network model to obtain a target candidate frame comprising the target detection object.
And determining a target area image from the image to be processed according to the position information and the size information of the target candidate frame.
Optionally, the processor 701 is specifically configured to:
and expanding the target candidate frame according to a preset proportion to obtain the expanded target candidate frame.
And according to the position information and the size information of the expanded target candidate frame, taking an image area corresponding to the position and the size of the expanded target candidate frame in the image to be processed as a target area image.
Optionally, the processor 701 is specifically configured to:
and moving the boundary of the target candidate frame to the direction of expanding the target candidate frame according to a preset proportion to obtain the expanded boundary.
And if the expanded boundary exceeds the boundary of the image to be processed, taking the boundary of the image to be processed as the boundary of the expanded target candidate frame.
And determining the expanded target candidate frame according to the boundary of the expanded target candidate frame.
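A minimal sketch of the expansion-and-clamping step, assuming an (x, y, w, h) box format and a 10% expansion ratio; both the format and the ratio value are assumptions for illustration.

```python
def expand_box(x: int, y: int, w: int, h: int, img_w: int, img_h: int, ratio: float = 0.1):
    """Expand a target candidate frame outward by a preset proportion and
    clamp the expanded boundary to the boundary of the image to be processed."""
    dx, dy = int(w * ratio), int(h * ratio)
    # Move each boundary outward, in the direction that enlarges the box.
    x1, y1 = x - dx, y - dy
    x2, y2 = x + w + dx, y + h + dy
    # If an expanded boundary exceeds the image, fall back to the image boundary.
    x1, y1 = max(0, x1), max(0, y1)
    x2, y2 = min(img_w, x2), min(img_h, y2)
    return x1, y1, x2 - x1, y2 - y1
```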
Optionally, the processor 701 is specifically configured to:
inputting the image to be processed into the object detection network model to obtain probability distribution, position information, and size information of a plurality of candidate frames of the image to be processed, wherein the probability distribution is used for indicating the probability that a candidate frame includes the target detection object;
determining at least one candidate frame from the plurality of candidate frames according to the probability distribution of the plurality of candidate frames;
and determining, from the at least one candidate frame, a target candidate frame including the target detection object according to the probability distribution, position information, and size information of the at least one candidate frame.
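One plausible reading of the first selection step is a confidence filter over the candidate frames; the threshold value below is an assumption for illustration.

```python
import numpy as np

def filter_candidate_boxes(boxes: np.ndarray, probs: np.ndarray, conf_thresh: float = 0.5):
    # boxes: (N, 4) candidate frames; probs: (N,) probability that each
    # candidate frame contains the target detection object.
    keep = probs >= conf_thresh
    return boxes[keep], probs[keep]
```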
Optionally, the processor 701 is specifically configured to:
and calculating to obtain at least one piece of predicted position information of the target detection object according to the position information and the size information of the at least one candidate frame.
And screening the at least one piece of predicted position information by using a non-maximum suppression screening strategy to obtain target predicted position information.
And taking the candidate frame of which the target prediction position information is calculated from the at least one candidate frame as a target candidate frame comprising the target detection object.
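The non-maximum suppression screening strategy itself is standard; a compact sketch is shown below, assuming boxes in (x1, y1, x2, y2) format and a 0.5 IoU threshold (both assumptions).

```python
import numpy as np

def nms(boxes: np.ndarray, scores: np.ndarray, iou_thresh: float = 0.5) -> list:
    """Keep the highest-scoring predictions and suppress heavily overlapping ones."""
    order = scores.argsort()[::-1]   # indices sorted by descending score
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        if order.size == 1:
            break
        rest = order[1:]
        # Intersection of the best box with the remaining boxes.
        xx1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        yy1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        xx2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        yy2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.clip(xx2 - xx1, 0, None) * np.clip(yy2 - yy1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_rest = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_i + area_rest - inter)
        order = rest[iou <= iou_thresh]
    return keep
```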
Optionally, the processor 701 is further configured to:
the method comprises the steps of obtaining a training sample set, wherein the training sample set comprises a plurality of images and marking information, and the marking information comprises a marking frame aiming at a target detection object and a semantic segmentation result.
And training the neural network by using the plurality of images and the labeling frames in the labeling information to obtain an object detection network model.
And training a neural network by using the semantic segmentation results in the plurality of images and the annotation information to obtain an image segmentation network model.
Optionally, the processor 701 is specifically configured to:
the current time information is acquired.
And determining a material picture matched with the current time information from a material library.
And filling the material picture in the filling area.
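A hypothetical example of time-based material selection; the library contents, file paths, and the day/night split are assumptions for illustration only.

```python
from datetime import datetime

# Hypothetical material library keyed by time of day.
MATERIAL_LIBRARY = {
    "day": "materials/blue_sky.png",
    "night": "materials/starry_sky.png",
}

def pick_material(now=None) -> str:
    """Choose a material picture that matches the current time information."""
    now = now or datetime.now()
    key = "day" if 6 <= now.hour < 18 else "night"
    return MATERIAL_LIBRARY[key]
```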
Optionally, the processor 701 is specifically configured to:
the photographing apparatus 704 is started in response to a photographing instruction triggered by the target application.
Acquiring an image captured by the photographing device 704, wherein the image captured by the photographing device 704 comprises an image shown in a preview window.
The image captured by the photographing apparatus 704 is taken as an image to be processed.
In a specific implementation, the processor 701, the storage device 702, the user interface 703, and the photographing device 704 described in this embodiment of the present invention may execute the implementations described in the embodiments of the deep-learning-based image processing method provided in fig. 2 and fig. 4, or may execute the implementation described in the embodiment of the image processing device provided in fig. 6, which is not described herein again.
It will be understood by those skilled in the art that all or part of the processes of the methods in the embodiments described above may be implemented by a computer program instructing relevant hardware. The program may be stored in a computer-readable storage medium and, when executed, may include the processes of the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disc, a read-only memory (ROM), a random access memory (RAM), or the like.
Embodiments of the present application also provide a computer program product or computer program comprising computer instructions stored in a computer-readable storage medium. A processor of a computer device reads the computer instructions from the computer-readable storage medium and executes them, causing the computer device to perform the steps of the method embodiments described above.
The above disclosure describes only some embodiments of the present application and is not intended to limit the scope of the claims of the present application; equivalent changes made according to the claims of the present application still fall within the scope of the present application.

Claims (15)

1. An image processing method based on deep learning, characterized in that the method comprises:
acquiring an image to be processed;
calling an object detection network model to determine a target area image in the image to be processed, wherein the target area image comprises an image of an area where a target detection object is located;
calling an image segmentation network model to perform semantic segmentation processing on the target area image to obtain a semantic segmentation result of the target area image, wherein the semantic segmentation result is used for indicating whether pixels in the target area image belong to a frame area of the target detection object;
and determining a filling region from the target region image according to the semantic segmentation result, and filling a material picture in the filling region, wherein the filling region comprises a region image except the frame region in the image of the region where the target detection object is located.
2. The method of claim 1, wherein the semantic segmentation result comprises a classification label of the pixel, and wherein the classification label is used to indicate whether the pixel belongs to the frame region of the target detection object.
3. The method according to claim 1 or 2, wherein the determining a filling region from the target region image according to the semantic segmentation result comprises:
obtaining pixels belonging to a frame region of the target detection object in the target region image according to the semantic segmentation result;
determining a first area image composed of pixels belonging to the frame area of the target detection object;
and determining the area image except the first area image in the image of the area where the target detection object is located as a filling area.
4. The method according to claim 1 or 2, wherein the determining a filling region from the target region image according to the semantic segmentation result comprises:
obtaining pixels which do not belong to the frame region of the target detection object in the image of the region of the target detection object according to the semantic segmentation result;
and taking a second area image formed by the pixels which do not belong to the frame area of the target detection object as a filling area.
5. The method according to claim 1, wherein the invoking an image segmentation network model to perform semantic segmentation processing on the target area image to obtain a semantic segmentation result of the target area image comprises:
inputting the target area image into an image segmentation network model for binarization segmentation processing to obtain a classification label of a pixel in the target area image;
and determining a semantic segmentation result of the target area image according to the classification label of the pixel in the target area image.
6. The method of claim 1, wherein invoking the object detection network model to determine a target area image in the image to be processed comprises:
inputting the image to be processed into an object detection network model to obtain a target candidate frame comprising the target detection object;
and determining a target area image from the image to be processed according to the position information and the size information of the target candidate frame.
7. The method according to claim 6, wherein the determining a target area image from the image to be processed according to the position information and the size information of the target candidate frame comprises:
expanding the target candidate frame according to a preset proportion to obtain an expanded target candidate frame;
and according to the position information and the size information of the expanded target candidate frame, taking an image area corresponding to the position and the size of the expanded target candidate frame in the image to be processed as a target area image.
8. The method according to claim 7, wherein the expanding the target candidate frame according to the preset ratio to obtain the expanded target candidate frame comprises:
moving the boundary of the target candidate frame to the direction of expanding the target candidate frame according to a preset proportion to obtain an expanded boundary;
if the expanded boundary exceeds the boundary of the image to be processed, taking the boundary of the image to be processed as the boundary of the expanded target candidate frame;
and determining the expanded target candidate frame according to the boundary of the expanded target candidate frame.
9. The method according to any one of claims 6 to 8, wherein the inputting the image to be processed into an object detection network model to obtain a target candidate frame including the target detection object comprises:
inputting the image to be processed into an object detection network model to obtain probability distribution, position information and size information of a plurality of candidate frames of the image to be processed, wherein the probability distribution is used for indicating the probability that the candidate frames comprise the target detection object;
determining at least one candidate box from the plurality of candidate boxes according to the probability distribution of the plurality of candidate boxes;
and determining a target candidate frame comprising the target detection object from the at least one candidate frame according to the probability distribution, the position information and the size information of the at least one candidate frame.
10. The method of claim 9, wherein the determining a target candidate box including the target detection object from the at least one candidate box according to the probability distribution, the position information, and the size information of the at least one candidate box comprises:
calculating to obtain at least one predicted position information of the target detection object according to the position information and the size information of the at least one candidate frame;
screening the at least one piece of predicted position information by using a non-maximum suppression screening strategy to obtain target predicted position information;
and taking, from the at least one candidate frame, the candidate frame from which the target predicted position information is calculated as the target candidate frame comprising the target detection object.
11. The method of claim 1, wherein before the invoking an object detection network model to determine a target area image in the image to be processed, the method further comprises:
acquiring a training sample set, wherein the training sample set comprises a plurality of images and annotation information, and the annotation information comprises an annotation box for a target detection object and a semantic segmentation result;
training a neural network by using the plurality of images and the annotation boxes in the annotation information to obtain an object detection network model;
and training a neural network by using the plurality of images and the semantic segmentation results in the annotation information to obtain an image segmentation network model.
12. The method of claim 1, wherein the filling of the material picture in the filling area comprises:
acquiring current time information;
determining a material picture matched with the current time information from a material library;
and filling the material picture in the filling area.
13. An image processing apparatus, characterized in that the apparatus comprises:
the acquisition module is used for acquiring an image to be processed;
the determining module is used for calling an object detection network model to determine a target area image in the image to be processed, wherein the target area image comprises an image of an area where a target detection object is located;
the processing module is used for calling an image segmentation network model to perform semantic segmentation processing on the target area image to obtain a semantic segmentation result of the target area image, and the semantic segmentation result is used for indicating whether pixels in the target area image belong to a frame area of the target detection object;
the determining module is further configured to determine a filling region from the target region image according to the semantic segmentation result, where the filling region includes a region image except the frame region in an image of a region where the target detection object is located;
the processing module is further configured to fill the material picture in the filling area.
14. An electronic device, characterized in that the electronic device comprises a processor and a storage device connected to each other, wherein the storage device is configured to store a computer program comprising program instructions, and the processor is configured to call the program instructions to execute the deep learning based image processing method according to any one of claims 1 to 12.
15. A computer-readable storage medium, characterized in that the computer storage medium stores a computer program comprising program instructions which, when executed by a processor, cause the processor to perform the deep learning based image processing method according to any one of claims 1 to 12.
CN202110154168.8A 2021-02-04 2021-02-04 Image processing method based on deep learning, electronic equipment and storage medium Pending CN113570615A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110154168.8A CN113570615A (en) 2021-02-04 2021-02-04 Image processing method based on deep learning, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110154168.8A CN113570615A (en) 2021-02-04 2021-02-04 Image processing method based on deep learning, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN113570615A true CN113570615A (en) 2021-10-29

Family

ID=78161128

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110154168.8A Pending CN113570615A (en) 2021-02-04 2021-02-04 Image processing method based on deep learning, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113570615A (en)


Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114677604A (en) * 2022-04-20 2022-06-28 电子科技大学 Window state detection method based on machine vision
CN114677604B (en) * 2022-04-20 2023-04-07 电子科技大学 Window state detection method based on machine vision
CN116863099A (en) * 2023-06-29 2023-10-10 广州城市职业学院 Building automatic modeling method and system based on point cloud data
CN116863099B (en) * 2023-06-29 2023-12-26 广州城市职业学院 Building automatic modeling method and system based on point cloud data

Similar Documents

Publication Publication Date Title
CN112052839B (en) Image data processing method, apparatus, device and medium
CN111754541B (en) Target tracking method, device, equipment and readable storage medium
CN111178183B (en) Face detection method and related device
CN109815843B (en) Image processing method and related product
CN112381104B (en) Image recognition method, device, computer equipment and storage medium
CN112733802B (en) Image occlusion detection method and device, electronic equipment and storage medium
CN111754396A (en) Face image processing method and device, computer equipment and storage medium
CN110046574A (en) Safety cap based on deep learning wears recognition methods and equipment
CN112101386B (en) Text detection method, device, computer equipment and storage medium
CN112257665A (en) Image content recognition method, image recognition model training method, and medium
CN112272295B (en) Method for generating video with three-dimensional effect, method for playing video, device and equipment
CN109670517A (en) Object detection method, device, electronic equipment and target detection model
CN112508989A (en) Image processing method, device, server and medium
CN115797350A (en) Bridge disease detection method and device, computer equipment and storage medium
CN112101344B (en) Video text tracking method and device
CN113570615A (en) Image processing method based on deep learning, electronic equipment and storage medium
CN113205028A (en) Pedestrian detection method and system based on improved YOLOv3 model
WO2023279799A1 (en) Object identification method and apparatus, and electronic system
CN113673567B (en) Panorama emotion recognition method and system based on multi-angle sub-region self-adaption
CN112883827B (en) Method and device for identifying specified target in image, electronic equipment and storage medium
CN112306243A (en) Data processing method, device, equipment and storage medium
CN115861572A (en) Three-dimensional modeling method, device, equipment and storage medium
CN114494302A (en) Image processing method, device, equipment and storage medium
CN116862920A (en) Portrait segmentation method, device, equipment and medium
CN115187497A (en) Smoking detection method, system, device and medium

Legal Events

Date Code Title Description
PB01 Publication
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40053602

Country of ref document: HK

SE01 Entry into force of request for substantive examination