CN118135058A - Image generation method and device
- Publication number
- CN118135058A (application CN202410092090.5A)
- Authority
- CN
- China
- Prior art keywords
- image
- article
- initial
- sample
- noise
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T11/00—2D [Two Dimensional] image generation
- G06T11/60—Editing figures and text; Combining figures or text
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q30/00—Commerce
- G06Q30/02—Marketing; Price estimation or determination; Fundraising
- G06Q30/0241—Advertisements
- G06Q30/0276—Advertisement creation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20081—Training; Learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20084—Artificial neural networks [ANN]
Abstract
Embodiments of the present disclosure provide an image generation method and apparatus. The image generation method includes: in response to acquiring an initial article transparent-background image and an initial article category, constructing a text prompt and a background prompt based on the initial article category; acquiring a target noise image corresponding to the initial article transparent-background image based on a preset noise setting; performing feature extraction on the initial article transparent-background image, the text prompt and the background prompt to obtain article feature information; and performing cyclic image processing on the target noise image, the text prompt and the article feature information to generate a target image corresponding to the initial article transparent-background image. This provides an end-to-end way of generating advertisement backgrounds for articles, improving the automation and efficiency of image generation.
Description
Technical Field
Embodiments of the present disclosure relate to the field of computer technology and the field of information processing technology, and in particular, to an image generating method and apparatus.
Background
Article advertisement background generation aims to generate a natural, realistic background for the transparent-background image of an article in order to construct a high-quality advertisement picture and thereby improve the picture's click-through rate. Existing background generation methods mainly fall into two modes: a "text-to-image" mode and an "image-to-image" mode. In the "text-to-image" mode, a prompt describing the desired picture and the transparent-background image of the commodity are input into a large diffusion model (such as Stable Diffusion combined with ControlNet), and the large model fills the background area around the commodity according to the content of the prompt. The "image-to-image" mode additionally introduces a reference image on top of the "text-to-image" mode: noise of a certain intensity is added to the reference image and used as the initial noise of the large diffusion model, so that the generated background area has a certain similarity to the reference image.
However, both the "text-to-image" mode and the "image-to-image" mode take a lot of time to design and revise prompts, and when a prompt describes the spatial layout or abstract style of a picture poorly, finely customizing the background becomes a great challenge.
Disclosure of Invention
Embodiments of the present disclosure propose an image generation method, an image generation apparatus, an electronic device, and a computer readable medium.
In a first aspect, embodiments of the present disclosure provide an image generation method, the method including: in response to acquiring an initial article transparent-background image and an initial article category, constructing a text prompt and a background prompt based on the initial article category; acquiring a target noise image corresponding to the initial article transparent-background image based on a preset noise setting; performing feature extraction on the initial article transparent-background image, the text prompt and the background prompt to obtain article feature information; and performing cyclic image processing on the target noise image, the text prompt and the article feature information to generate a target image corresponding to the initial article transparent-background image.
In some embodiments, performing cyclic image processing on the target noise image, the text prompt and the article feature information to generate the target image corresponding to the initial article transparent-background image includes: inputting the target noise image and the text prompt to an encoder in a text-to-image diffusion model to obtain background feature information; inputting the article feature information and the background feature information to a decoder in the text-to-image diffusion model to obtain a denoised image; determining whether the denoised image satisfies an image generation condition; and in response to determining that the denoised image satisfies the image generation condition, determining the denoised image as the target image corresponding to the initial article transparent-background image.
In some embodiments, performing cyclic image processing on the target noise image, the text prompt and the article feature information to generate the target image corresponding to the initial article transparent-background image further includes: in response to determining that the denoised image does not satisfy the image generation condition, taking the denoised image as the target noise image.
In some embodiments, the method further includes: acquiring a reference image and an article mask corresponding to the reference image. Acquiring the target noise image corresponding to the initial article transparent-background image based on the preset noise setting includes: acquiring a noise reference image corresponding to the initial article transparent-background image based on the preset noise setting and the reference image. The method further includes: performing feature extraction on the reference image and the article mask corresponding to the reference image to obtain reference feature information. Performing cyclic image processing on the target noise image, the text prompt and the article feature information to generate the target image corresponding to the initial article transparent-background image includes: performing cyclic image processing on the noise reference image, the text prompt, the article feature information and the reference feature information to generate the target image corresponding to the initial article transparent-background image.
In some embodiments, performing cyclic image processing on the noise reference image, the text prompt, the article feature information and the reference feature information to generate the target image corresponding to the initial article transparent-background image includes: inputting the noise reference image and the text prompt to an encoder of a text-to-image diffusion model to obtain background feature information; inputting the article feature information, the reference feature information and the background feature information to a decoder of the text-to-image diffusion model to obtain a denoised image; determining whether the denoised image satisfies an image generation condition; and in response to determining that the denoised image satisfies the image generation condition, determining the denoised image as the target image corresponding to the initial article transparent-background image.
In some embodiments, performing cyclic image processing on the noise reference image, the text prompt, the article feature information and the reference feature information to generate the target image corresponding to the initial article transparent-background image further includes: in response to determining that the denoised image does not satisfy the image generation condition, taking the denoised image as the noise reference image.
In some embodiments, the method further includes: filtering the reference feature information based on an article mask corresponding to the initial article transparent-background image to obtain filtered reference feature information; and performing cyclic image processing on the noise reference image, the text prompt, the article feature information and the reference feature information to generate the target image corresponding to the initial article transparent-background image includes: performing cyclic image processing on the noise reference image, the text prompt, the article feature information and the filtered reference feature information to generate the target image corresponding to the initial article transparent-background image.
In some embodiments, performing feature extraction on the initial article transparent-background image, the text prompt and the background prompt to obtain the article feature information includes: performing feature extraction on the initial article transparent-background image, the text prompt and the background prompt through an article feature extraction model to obtain the article feature information, where the article feature extraction model includes an article-mask-based attention module.
In some embodiments, the article feature extraction model and the text-to-image diffusion model are trained based on the following steps: in response to acquiring a sample article image, a sample article mask and a sample article category, acquiring a sample article transparent-background image based on the sample article image and the sample article mask; constructing a sample text prompt and a sample background prompt based on the sample article category; acquiring a sample noise image based on the sample article image; and taking the sample article transparent-background image, the sample text prompt and the sample background prompt as input of an initial article feature extraction model, taking the sample noise image and the sample text prompt as input of an initial text-to-image diffusion model, and training the initial article feature extraction model and the initial text-to-image diffusion model to obtain the article feature extraction model and the text-to-image diffusion model.
In some embodiments, the method further includes: acquiring a reference image pair. Acquiring the sample noise image based on the sample article image includes: acquiring the sample noise image based on a preset noise setting and the reference image pair. The training then includes: using a machine learning method, taking the sample article transparent-background image, the sample text prompt and the sample background prompt as input of the initial article feature extraction model, taking the reference image pair as input of an initial reference feature extraction model, taking the sample noise image and the sample text prompt as input of the initial text-to-image diffusion model, and training the initial article feature extraction model, the initial reference feature extraction model and the initial text-to-image diffusion model to obtain the article feature extraction model, a reference feature extraction model and the text-to-image diffusion model, where the reference feature extraction model is used for performing feature extraction on a reference image and an article mask corresponding to the reference image to obtain reference feature information.
In some embodiments, acquiring the reference image pair includes: dilating the sample article mask to obtain a dilated article mask; performing data enhancement on the sample article image to obtain a new sample image; performing a translation operation on the new sample image and the dilated article mask to obtain a translated sample image and a translated article mask; performing random masking on the translated article mask to generate a random article mask including a plurality of random rectangular masks; and rotating the translated sample image and the random article mask to obtain a new sample article image and a new sample article mask, the new sample article image and the new sample article mask forming the reference image pair.
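As an illustrative, non-authoritative sketch of this reference-pair construction: the disclosure fixes only the sequence of operations (dilate, enhance, translate, random rectangles, rotate), while the kernel size, shift range, rectangle count, jitter amount and the OpenCV-based realization below are all assumptions made for illustration.

```python
import cv2
import numpy as np

def build_reference_pair(sample_image: np.ndarray, sample_mask: np.ndarray):
    """Construct a (new sample article image, new sample article mask) pair.

    Sketch only: parameter choices are illustrative, not from the disclosure.
    Assumes uint8 inputs; sample_mask is single-channel with article pixels = 255.
    """
    h, w = sample_mask.shape[:2]
    # 1. Dilate the sample article mask (kernel size assumed).
    dilated = cv2.dilate(sample_mask, np.ones((15, 15), np.uint8))
    # 2. Data enhancement on the sample image (here: brightness jitter, assumed).
    enhanced = cv2.convertScaleAbs(sample_image, alpha=1.0,
                                   beta=int(np.random.randint(-20, 21)))
    # 3. Translate the enhanced image and the dilated mask by the same shift.
    dx = np.random.randint(-w // 8, w // 8)
    dy = np.random.randint(-h // 8, h // 8)
    shift = np.float32([[1, 0, dx], [0, 1, dy]])
    t_img = cv2.warpAffine(enhanced, shift, (w, h))
    t_mask = cv2.warpAffine(dilated, shift, (w, h))
    # 4. Random masking: add several random rectangles to the translated mask.
    for _ in range(np.random.randint(1, 5)):
        x0, y0 = np.random.randint(0, w - 1), np.random.randint(0, h - 1)
        x1, y1 = min(w, x0 + w // 6), min(h, y0 + h // 6)
        t_mask[y0:y1, x0:x1] = 255
    # 5. Rotate image and mask together (angle range assumed).
    rot = cv2.getRotationMatrix2D((w / 2, h / 2), np.random.uniform(-15, 15), 1.0)
    new_image = cv2.warpAffine(t_img, rot, (w, h))
    new_mask = cv2.warpAffine(t_mask, rot, (w, h))
    return new_image, new_mask
```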
In a second aspect, embodiments of the present disclosure provide an image generation apparatus, the apparatus including: a construction module configured to, in response to acquiring an initial article transparent-background image and an initial article category, construct a text prompt and a background prompt based on the initial article category; an acquisition module configured to acquire a target noise image corresponding to the initial article transparent-background image based on a preset noise setting; a first extraction module configured to perform feature extraction on the initial article transparent-background image, the text prompt and the background prompt to obtain article feature information; and a generation module configured to perform cyclic image processing on the target noise image, the text prompt and the article feature information to generate a target image corresponding to the initial article transparent-background image.
In some embodiments, the generation module is further configured to: input the target noise image and the text prompt to an encoder in a text-to-image diffusion model to obtain background feature information; input the article feature information and the background feature information to a decoder in the text-to-image diffusion model to obtain a denoised image; determine whether the denoised image satisfies an image generation condition; and in response to determining that the denoised image satisfies the image generation condition, determine the denoised image as the target image corresponding to the initial article transparent-background image.
In some embodiments, the generation module is further configured to: in response to determining that the denoised image does not satisfy the image generation condition, take the denoised image as the target noise image.
In some embodiments, the apparatus further includes a second extraction module. The acquisition module is further configured to: acquire a reference image and an article mask corresponding to the reference image; and acquire a noise reference image corresponding to the initial article transparent-background image based on the preset noise setting and the reference image. The second extraction module is configured to: perform feature extraction on the reference image and the article mask corresponding to the reference image to obtain reference feature information. The generation module is further configured to: perform cyclic image processing on the noise reference image, the text prompt, the article feature information and the reference feature information to generate the target image corresponding to the initial article transparent-background image.
In some embodiments, the generation module is further configured to: input the noise reference image and the text prompt to an encoder of the text-to-image diffusion model to obtain background feature information; input the article feature information, the reference feature information and the background feature information to a decoder of the text-to-image diffusion model to obtain a denoised image; determine whether the denoised image satisfies the image generation condition; and in response to determining that the denoised image satisfies the image generation condition, determine the denoised image as the target image corresponding to the initial article transparent-background image.
In some embodiments, the generation module is further configured to: in response to determining that the denoised image does not satisfy the image generation condition, take the denoised image as the noise reference image.
In some embodiments, the apparatus further includes a filtering module configured to: filter the reference feature information based on the article mask corresponding to the initial article transparent-background image to obtain filtered reference feature information. The generation module is further configured to: perform cyclic image processing on the noise reference image, the text prompt, the article feature information and the filtered reference feature information to generate the target image corresponding to the initial article transparent-background image.
In some embodiments, the first extraction module is further configured to: perform feature extraction on the initial article transparent-background image, the text prompt and the background prompt through an article feature extraction model to obtain the article feature information, where the article feature extraction model includes an article-mask-based attention module.
In some embodiments, the article feature extraction model and the text-to-image diffusion model are trained by a training unit configured to: in response to acquiring a sample article image, a sample article mask and a sample article category, acquire a sample article transparent-background image based on the sample article image and the sample article mask; construct a sample text prompt and a sample background prompt based on the sample article category; acquire a sample noise image based on the sample article image; and take the sample article transparent-background image, the sample text prompt and the sample background prompt as input of an initial article feature extraction model, take the sample noise image and the sample text prompt as input of an initial text-to-image diffusion model, and train the initial article feature extraction model and the initial text-to-image diffusion model to obtain the article feature extraction model and the text-to-image diffusion model.
In some embodiments, the training unit is further configured to: acquire a reference image pair; acquire the sample noise image based on a preset noise setting and the reference image pair; and take the sample article transparent-background image, the sample text prompt and the sample background prompt as input of the initial article feature extraction model, take the reference image pair as input of an initial reference feature extraction model, take the sample noise image and the sample text prompt as input of the initial text-to-image diffusion model, and train the initial article feature extraction model, the initial reference feature extraction model and the initial text-to-image diffusion model to obtain the article feature extraction model, a reference feature extraction model and the text-to-image diffusion model, where the reference feature extraction model is used for performing feature extraction on a reference image and an article mask corresponding to the reference image to obtain reference feature information.
In some embodiments, the training unit is further configured to: dilate the sample article mask to obtain a dilated article mask; perform data enhancement on the sample article image to obtain a new sample image; perform a translation operation on the new sample image and the dilated article mask to obtain a translated sample image and a translated article mask; perform random masking on the translated article mask to generate a random article mask including a plurality of random rectangular masks; and rotate the translated sample image and the random article mask to obtain a new sample article image and a new sample article mask, the new sample article image and the new sample article mask forming the reference image pair.
In a third aspect, embodiments of the present disclosure provide an electronic device comprising: one or more processors; a storage device having one or more programs stored thereon; the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the image generation method as described in any of the embodiments of the first aspect.
In a fourth aspect, embodiments of the present disclosure provide a computer readable medium having stored thereon a computer program which, when executed by a processor, implements an image generation method as described in any of the embodiments of the first aspect.
According to the image generation method provided by the embodiments of the present disclosure, the execution body first constructs, in response to acquiring an initial article transparent-background image and an initial article category, a text prompt and a background prompt based on the initial article category; then acquires a target noise image corresponding to the initial article transparent-background image based on a preset noise setting; next performs feature extraction on the initial article transparent-background image, the text prompt and the background prompt to obtain article feature information; and finally performs cyclic image processing on the target noise image, the text prompt and the article feature information to generate a target image corresponding to the initial article transparent-background image.
Drawings
Other features, objects and advantages of the present disclosure will become more apparent upon reading of the detailed description of non-limiting embodiments, made with reference to the following drawings:
FIG. 1 is an exemplary system architecture diagram in which an embodiment of the present disclosure may be applied;
FIG. 2 is a flow chart of one embodiment of an image generation method according to the present disclosure;
FIG. 3 is a flow chart of one embodiment of generating a target image corresponding to an initial article transparent-background image according to the present disclosure;
FIG. 4 is a flow chart of another embodiment of an image generation method according to the present disclosure;
FIG. 5 is a flow chart of another embodiment of generating a target image corresponding to an initial article transparent-background image according to the present disclosure;
FIG. 6 is a flow chart of one embodiment of training an article feature extraction model and a text-to-image diffusion model according to the present disclosure;
FIG. 7 is a flow chart of one embodiment of acquiring a reference image pair according to the present disclosure;
FIG. 8 is a schematic structural view of one embodiment of an image generation apparatus according to the present disclosure;
Fig. 9 is a schematic structural diagram of an electronic device suitable for use in implementing embodiments of the present disclosure.
Detailed Description
The present disclosure is described in further detail below with reference to the drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the related disclosure and not limiting thereof. It should be further noted that, for convenience of description, only the portions related to the disclosure are shown in the drawings.
It should be noted that, without conflict, the embodiments of the present disclosure and features of the embodiments may be combined with each other. The present disclosure will be described in detail below with reference to the accompanying drawings in conjunction with embodiments.
Fig. 1 shows an exemplary system architecture 100 to which an embodiment of an image generation method or image generation apparatus of the present application may be applied.
As shown in fig. 1, a system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 is used as a medium to provide communication links between the terminal devices 101, 102, 103 and the server 105. The network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, among others.
The user may interact with the server 105 via the network 104 using the terminal devices 101, 102, 103 to receive or send messages or the like. The terminal devices 101, 102, 103 may be user terminal devices on which various client applications, such as e-commerce platform applications, image class applications, video class applications, search class applications, financial class applications, etc., may be installed.
The terminal devices 101, 102, 103 may be a variety of electronic devices having a display screen and supporting receipt of server messages, including but not limited to smartphones, tablets, electronic book readers, electronic players, laptop and desktop computers, and the like.
The terminal devices 101, 102 and 103 can acquire an initial article transparent-background image and an initial article category, construct a text prompt and a background prompt based on the initial article category, acquire a target noise image corresponding to the initial article transparent-background image based on a preset noise setting, perform feature extraction on the initial article transparent-background image, the text prompt and the background prompt to obtain article feature information, and perform cyclic image processing on the target noise image, the text prompt and the article feature information to generate a target image corresponding to the initial article transparent-background image.
The terminal devices 101, 102, 103 may be hardware or software. When they are hardware, they may be any of the various electronic devices listed above; when they are software, they may be installed in the electronic devices listed above and implemented either as multiple pieces of software or software modules (for example, multiple software modules for providing distributed services) or as a single piece of software or software module. No specific limitation is imposed here.
The server 105 may be a server providing various services, such as a background server receiving a request transmitted from a terminal device with which a communication connection is established. The background server can receive and analyze the request sent by the terminal equipment and generate a processing result.
The server 105 may likewise acquire an initial article transparent-background image and an initial article category, construct a text prompt and a background prompt based on the initial article category, then acquire a target noise image corresponding to the initial article transparent-background image based on a preset noise setting, then perform feature extraction on the initial article transparent-background image, the text prompt and the background prompt to obtain article feature information, and finally perform cyclic image processing on the target noise image, the text prompt and the article feature information to generate a target image corresponding to the initial article transparent-background image.
The server may be hardware or software. When the server is hardware, it may be any of various electronic devices that provide services to the terminal devices. When the server is software, it may be implemented as multiple pieces of software or software modules providing services to the terminal devices, or as a single piece of software or software module. No specific limitation is imposed here.
It should be noted that the image generating method provided by the embodiment of the present disclosure may be performed by the terminal device 101, 102, 103 or the server 105, and accordingly, the image generating apparatus may be provided in the terminal device 101, 102, 103 or the server 105.
It should be understood that the number of terminal devices, networks and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
Referring to fig. 2, a flow chart 200 of one embodiment of an image generation method according to the present disclosure is shown. The image generation method comprises the following steps:
Step 210: in response to acquiring an initial article transparent-background image and an initial article category, construct a text prompt and a background prompt based on the initial article category.
In this step, the execution body on which the image generation method runs (e.g., the terminal devices 101, 102, 103 or the server 105 in fig. 1) may acquire an initial article image through network reading and perform article category analysis on it to obtain an initial article category; the initial article category may characterize the category of the article contained in the initial article image, such as milk or lipstick. The execution body may process the initial article image with the article mask corresponding to it to obtain the initial article transparent-background image, which may represent an image containing only the initial article and no background.
After obtaining the initial article transparent-background image and the initial article category, the execution body may read a text construction template for the text prompt and fill the initial article category into the template to generate the text prompt corresponding to the initial article transparent-background image; the text prompt may characterize the article category of the initial article, and the text construction template may be "A photo of [initial article category]". As an example, if the initial article category is C, the text prompt may be "A photo of C".
The execution body may determine an initial article category code corresponding to the initial article category according to a correspondence between article categories and article category codes, and may splice the initial article category code with a preset character (e.g., "sks") to obtain background information corresponding to the initial article category. The execution body may then read a background construction template for the background prompt and fill the background information into the template to generate the background prompt corresponding to the initial article transparent-background image; the background prompt may characterize the background information corresponding to the initial article, and the background construction template may be "in the background of [background information]". For example, if the initial article category is C and the background information is the string D formed from "sks" and C, the background prompt may be "in the background of D".
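A minimal sketch of this prompt construction, assuming the template strings given in the examples above; the helper name and the exact splicing of the "sks" token with the category are illustrative assumptions, not fixed by the disclosure.

```python
def build_prompts(article_category: str, special_token: str = "sks") -> tuple[str, str]:
    """Construct the text prompt and background prompt for an article category."""
    # Text prompt: fill the article category into the text construction template.
    text_prompt = f"A photo of {article_category}"
    # Background information: splice the preset character with the category.
    background_info = f"{special_token} {article_category}"
    # Background prompt: fill the background info into the background template.
    background_prompt = f"in the background of {background_info}"
    return text_prompt, background_prompt

print(build_prompts("lipstick"))
# ('A photo of lipstick', 'in the background of sks lipstick')
```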
Step 220: acquire a target noise image corresponding to the initial article transparent-background image based on a preset noise setting.
In this step, the execution body may perform noise superposition processing on an image according to a preset noise setting to generate the target noise image corresponding to the initial article transparent-background image. The preset noise setting may specify a preset number of noise superposition passes; the image may be a reference image corresponding to the initial article transparent-background image, a random image corresponding to it, or the like; and the target noise image represents the noise-processed image corresponding to the initial article transparent-background image. As an example, if the preset noise setting specifies 50 noise superposition passes, the execution body may superpose noise on the image 50 times to generate the target noise image corresponding to the initial article transparent-background image.
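The repeated noise superposition corresponds to the forward process of a diffusion model. A minimal sketch under the standard DDPM formulation; the linear beta schedule and the 50-step count mirror the example above and are assumptions, not values fixed by the disclosure.

```python
import torch

def add_noise(x0: torch.Tensor, num_steps: int = 50) -> torch.Tensor:
    """Forward-diffuse an image tensor x0 over `num_steps` noise superpositions.

    Uses the closed-form DDPM forward process:
        x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps
    """
    betas = torch.linspace(1e-4, 0.02, num_steps)   # assumed linear schedule
    alpha_bar = torch.cumprod(1.0 - betas, dim=0)   # cumulative product of (1 - beta)
    eps = torch.randn_like(x0)                      # Gaussian noise
    t = num_steps - 1                               # final step of the preset setting
    return alpha_bar[t].sqrt() * x0 + (1.0 - alpha_bar[t]).sqrt() * eps

# Example: noise a random "image" tensor (e.g. a latent or an RGB image).
target_noise_image = add_noise(torch.rand(1, 3, 64, 64))
```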
Step 230: perform feature extraction on the initial article transparent-background image, the text prompt and the background prompt to obtain article feature information.
In this step, the execution body may use a feature extraction network to perform feature extraction on the initial article transparent-background image, the text prompt and the background prompt, extracting the initial article information from the initial article transparent-background image and generating general background information adapted to the initial article category, thereby obtaining the article feature information. Alternatively, the execution body may perform feature analysis on the initial article transparent-background image, the text prompt and the background prompt, analyze the feature information related to the initial article, and extract it to obtain the article feature information.
As an optional implementation, step 230 of performing feature extraction on the initial article transparent-background image, the text prompt and the background prompt to obtain the article feature information may include the following step: performing feature extraction on the initial article transparent-background image, the text prompt and the background prompt through an article feature extraction model to obtain the article feature information, where the article feature extraction model includes an article-mask-based attention module.
Specifically, the execution body may obtain an article feature extraction model, which may be used to characterize the correspondence between the initial article transparent-background image and text prompts on the one hand and the article feature information on the other; the specific structure of the article feature extraction model is substantially the same as ControlNet. The execution body may input the initial article transparent-background image, the text prompt and the background prompt into the article feature extraction model, which performs feature extraction on them and outputs the article feature information; the article feature information may characterize the initial article information extracted from the initial article transparent-background image and is used to generate general background information adapted to the initial article category.
The article feature extraction model may include an article-mask-based attention module; that is, while its specific structure is substantially the same as ControlNet, in the present disclosure the attention module in the article feature extraction model is replaced with an article-mask-based attention module. Letting M denote the article mask corresponding to the initial article transparent-background image, the article-mask-based attention module can be expressed as:
X_out = M · CA(X_in, P_fg) + (1 − M) · CA(X_in, P_bg) + X_in
where X_in and X_out respectively denote the input and output of the article-mask attention module, CA(·) denotes a conventional attention module, · denotes element-wise multiplication, P_fg denotes the text prompt (foreground), and P_bg denotes the background prompt.
In this implementation, the article feature extraction model includes an article-mask-based attention module. When attention is computed, the article region attends only to the text prompt and the background region attends only to the background prompt, ensuring that background-style knowledge is accurately encoded into the background prompt without interference from foreground information. This allows the article feature extraction model to analyze and extract article feature information more accurately and in a more targeted way.
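A minimal PyTorch sketch of the article-mask attention formula above; the injected cross-attention module, tensor shapes and all names are illustrative assumptions, with only the masked combination itself coming from the formula.

```python
import torch.nn as nn

class MaskedCrossAttention(nn.Module):
    """X_out = M * CA(X_in, P_fg) + (1 - M) * CA(X_in, P_bg) + X_in."""

    def __init__(self, cross_attention: nn.Module):
        super().__init__()
        self.ca = cross_attention  # a conventional cross-attention module

    def forward(self, x_in, p_fg, p_bg, mask):
        # mask: 1 inside the article region, 0 in the background region,
        # broadcastable to x_in's shape.
        fg = self.ca(x_in, p_fg)   # article region attends to the text prompt
        bg = self.ca(x_in, p_bg)   # background region attends to the background prompt
        return mask * fg + (1.0 - mask) * bg + x_in  # residual connection
```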
Step 240: perform cyclic image processing on the target noise image, the text prompt and the article feature information to generate a target image corresponding to the initial article transparent-background image.
In this step, the execution body may perform image background generation and denoising on the target noise image, the text prompt and the article feature information. It first extracts background information from the target noise image and the text prompt to obtain corresponding background information, then performs feature fusion and denoising on the background information and the article feature information to generate an image containing the initial article and background information. This image may still contain noise, in which case it is taken as the new target noise image and image background generation and denoising are performed again together with the text prompt and the article feature information, until the resulting image is a clear image containing the initial article and background information. In other words, the execution body performs cyclic image processing on the target noise image, the text prompt and the article feature information, cyclically denoising the image for a preset number of passes (which may correspond to the preset noise setting), until the target image corresponding to the initial article transparent-background image is generated; the target image characterizes an image containing the initial article and a background.
According to the image generation method provided by the embodiments of the present disclosure, the execution body first constructs, in response to acquiring an initial article transparent-background image and an initial article category, a text prompt and a background prompt based on the initial article category; then acquires a target noise image corresponding to the initial article transparent-background image based on a preset noise setting; next performs feature extraction on the initial article transparent-background image, the text prompt and the background prompt to obtain article feature information; and finally performs cyclic image processing on the target noise image, the text prompt and the article feature information to generate a target image corresponding to the initial article transparent-background image.
Referring to fig. 3, a flow 300 of one embodiment of generating a target image corresponding to an initial article transparent-background image is shown. That is, step 240 above, performing cyclic image processing on the target noise image, the text prompt and the article feature information to generate the target image corresponding to the initial article transparent-background image, may include the following steps:
Step 310: input the target noise image and the text prompt to an encoder in a text-to-image diffusion model to obtain background feature information.
In this step, the execution body acquires a text-to-image diffusion model, which may be used to characterize the correspondence between the target noise image, the text prompt and the article feature information on the one hand and the target image on the other. The text-to-image diffusion model may be a pre-trained large diffusion model such as Stable Diffusion (SD) and may include an encoder and a decoder. The execution body may input the target noise image, the text prompt and the article feature information into the text-to-image diffusion model and perform cyclic image processing on them through the model, cyclically denoising the image for a preset number of passes (which may correspond to the preset noise setting), until the target image corresponding to the initial article transparent-background image is generated.
The execution body may input the target noise image and the text prompt to the encoder in the text-to-image diffusion model, which encodes them and extracts background feature information; the background feature information may be used to generate the background corresponding to the initial article.
Step 320: input the article feature information and the background feature information to a decoder in the text-to-image diffusion model to obtain a denoised image.
In this step, the execution body may input the article feature information output by the article feature extraction model and the background feature information output by the encoder together to the decoder in the text-to-image diffusion model. The decoder decodes the article feature information and the background feature information and outputs a corresponding denoised image, which may be either a still-noisy image containing the initial article and a background, or the target image containing the initial article and a background.
Step 330: determine whether the denoised image satisfies an image generation condition.
In this step, the execution body may determine whether the output denoised image satisfies an image generation condition, where the image generation condition may include the image being clear, or the image having undergone a preset number of denoising passes, and the like.
Step 340: in response to determining that the denoised image satisfies the image generation condition, determine the denoised image as the target image corresponding to the initial article transparent-background image.
In this step, if the execution body determines that the denoised image satisfies the image generation condition, it can conclude that the text-to-image diffusion model has performed the preset number of denoising passes and generated a clear image containing the initial article and a background, and the denoised image may be determined as the target image corresponding to the initial article transparent-background image.
In this implementation, the article feature information and the background feature information can be considered simultaneously, so that the text-to-image diffusion model automatically generates a target image that includes the article and a corresponding background, automatically producing a background suited to the article.
With continued reference to fig. 3, step 350: in response to determining that the denoised image does not satisfy the image generation condition, take the denoised image as the target noise image.
In this step, if the execution body determines that the denoised image does not satisfy the image generation condition, it can conclude that noise remains in the denoised image. The denoised image is then taken as the target noise image and, together with the text prompt, is input again to the encoder in the text-to-image diffusion model for encoding and then to the decoder for decoding, cyclically denoising the image until a target image satisfying the image generation condition is generated.
In this implementation, the text-to-image diffusion model can be used to perform cyclic image processing until the target image corresponding to the initial article transparent-background image is generated, improving the efficiency and accuracy of target image generation.
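A minimal sketch of the cyclic denoising loop of fig. 3. The `encoder`/`decoder` callables, the fixed step-count stopping condition and all names are assumptions; the disclosure defines the loop only abstractly through steps 310 to 350.

```python
def generate_target_image(noise_image, text_prompt, article_features,
                          encoder, decoder, num_steps: int = 50):
    """Cyclic image processing: encode, decode, and loop until the
    image generation condition (here assumed to be a fixed number of
    denoising passes) is met."""
    x = noise_image
    for _ in range(num_steps):
        # Step 310: the encoder extracts background features from the
        # current noise image and the text prompt.
        bg_features = encoder(x, text_prompt)
        # Step 320: the decoder fuses article and background features
        # into a denoised image.
        x = decoder(article_features, bg_features)
        # Steps 330/350: if the condition is not yet met, x becomes the
        # new target noise image and the loop continues.
    return x  # Step 340: the final denoised image is the target image
```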
Referring to fig. 4, fig. 4 shows a flow 400 of another embodiment of the image generation method; that is, the image generation method may include the following steps:
Step 410: in response to acquiring an initial article transparent-background image and an initial article category, construct a text prompt and a background prompt based on the initial article category.
In this step, step 410 is the same as step 210 in the embodiment shown in fig. 2 and is not repeated here.
Step 420: acquire a reference image and an article mask corresponding to the reference image.
In this step, the execution body may acquire a reference image through network reading; the reference image may serve as the background reference for the initial article transparent-background image. The execution body may perform article analysis on the reference image to determine the article mask corresponding to the reference image.
Step 430: acquire a noise reference image corresponding to the initial article transparent-background image based on the preset noise setting and the reference image.
In this step, the execution body may perform noise superposition processing on the reference image according to the preset noise setting to generate the noise reference image corresponding to the initial article transparent-background image. The preset noise setting may specify a preset number of noise superposition passes on the reference image, and the noise reference image represents the noise-processed reference image corresponding to the initial article transparent-background image. As an example, if the preset noise setting specifies 50 noise superposition passes, the execution body may superpose noise on the reference image 50 times to generate the noise reference image corresponding to the initial article transparent-background image.
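Unlike step 220, the noise here is superposed on the reference image rather than on a random image, in the spirit of image-to-image generation. A sketch under the same assumed DDPM formulation, reusing the illustrative `add_noise` helper from step 220; the 50-step count is again only the example's value.

```python
import torch

# Illustrative reference image tensor; in practice this is the reference
# image read from the network in step 420.
reference_image = torch.rand(1, 3, 64, 64)

# Superposing noise on the reference image (rather than pure random noise)
# lets the initial noise carry the reference image's layout and colors,
# so the generated background stays similar to the reference.
noise_reference_image = add_noise(reference_image, num_steps=50)
```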
Step 440: perform feature extraction on the initial article transparent-background image, the text prompt and the background prompt to obtain article feature information.
In this step, step 440 is the same as step 230 in the embodiment shown in fig. 2 and is not repeated here.
Step 450: perform feature extraction on the reference image and the article mask corresponding to the reference image to obtain reference feature information.
In this step, the execution body may acquire a pre-trained reference feature extraction model, which may be used to characterize the correspondence between the reference image and its article mask on the one hand and the reference feature information on the other; the specific structure of the reference feature extraction model is substantially the same as ControlNet. The execution body may input the reference image and its article mask to the reference feature extraction model, which performs feature extraction on them and outputs the reference feature information; the reference feature information may characterize the background-region information in the reference image.
Step 460: perform cyclic image processing on the noise reference image, the text prompt, the article feature information and the reference feature information to generate a target image corresponding to the initial article transparent-background image.
In this step, the execution body may perform image background generation and denoising on the noise reference image, the text prompt, the article feature information and the reference feature information. It first extracts background information from the noise reference image and the text prompt to obtain corresponding reference background information, then performs feature fusion and denoising on the reference background information, the reference feature information and the article feature information to generate an image containing the initial article and background information similar to the reference image. This image may still contain noise, in which case it is taken as the new noise reference image and image background generation and denoising are performed again together with the text prompt, the article feature information and the reference feature information, until the resulting image is a clear image containing the initial article and background information similar to the reference image. In other words, the execution body performs cyclic image processing, cyclically denoising the image for a preset number of passes (which may correspond to the preset noise setting), until the target image corresponding to the initial article transparent-background image is generated; the target image characterizes an image containing the initial article and a background similar to the reference image.
In this embodiment, by generating for the initial article a background similar to the reference image in layout, constituent elements, colors, style and the like based on the initial article transparent-background image and the reference image, information such as the layout and texture of the reference image can be extracted to generate a similar background for the initial article.
Referring to fig. 5, a flow 500 of another embodiment of generating a target image corresponding to an initial article transparent-background image is shown. That is, step 460 above, performing cyclic image processing on the noise reference image, the text prompt, the article feature information and the reference feature information to generate the target image corresponding to the initial article transparent-background image, may include the following steps:
Step 510: input the noise reference image and the text prompt to an encoder of the text-to-image diffusion model to obtain background feature information.
In this step, the execution body acquires a text-to-image diffusion model, which may be used to characterize the correspondence between the noise reference image, the text prompt, the reference feature information and the article feature information on the one hand and the target image on the other. The text-to-image diffusion model may be a pre-trained large diffusion model such as Stable Diffusion (SD) and may include an encoder and a decoder. The execution body may input the noise reference image, the text prompt, the reference feature information and the article feature information into the text-to-image diffusion model and perform cyclic image processing on them through the model, cyclically denoising the image for a preset number of passes (which may correspond to the preset noise setting), until the target image corresponding to the initial article transparent-background image is generated.
The execution body may input the noise reference image and the text prompt to the encoder in the text-to-image diffusion model, which encodes them and extracts background feature information; the background feature information may characterize the background-region information of the reference image and is used to generate the background corresponding to the initial article.
Step 520: input the article feature information, the reference feature information and the background feature information to a decoder of the text-to-image diffusion model to obtain a denoised image.
In this step, the execution body may input the article feature information output by the article feature extraction model, the reference feature information output by the reference feature extraction model, and the background feature information output by the encoder together to the decoder in the text-to-image diffusion model. The decoder decodes the article feature information, the reference feature information and the background feature information and outputs a corresponding denoised image, which may be either a still-noisy image containing the initial article and a background, or the target image containing the initial article and a background.
Step 530: determine whether the denoised image satisfies the image generation condition.
In this step, the execution body may determine whether the output denoised image satisfies an image generation condition, where the image generation condition may include the image being clear, or the image having undergone a preset number of denoising passes, and the like.
Step 540: in response to determining that the denoised image satisfies the image generation condition, determine the denoised image as the target image corresponding to the initial article transparent-background image.
In this step, if the execution body determines that the denoised image satisfies the image generation condition, it can conclude that the text-to-image diffusion model has performed the preset number of denoising passes and generated a clear image containing the initial article and a background, and the denoised image may be determined as the target image corresponding to the initial article transparent-background image.
In the implementation manner, the article characteristic information, the reference characteristic information and the background characteristic information can be considered at the same time, so that the literature graph diffusion model automatically generates a target image comprising the article and the reference background and automatically generates a background similar to the reference image.
With continued reference to fig. 5, in response to determining that the denoised image does not satisfy the image generation condition, step 550 is performed, taking the denoised image as the noise reference image.
In this step, if the execution body determines that the denoised image does not satisfy the image generation condition, it may be concluded that the denoised image still contains noise. The denoised image is then taken as the new noise reference image, input together with the text prompt word into the encoder of the text-to-image diffusion model for encoding, and the result input into the decoder for decoding; this denoising of the image is performed cyclically until a target image satisfying the image generation condition is generated.
In this implementation, the text-to-image diffusion model can be used to perform cyclic image processing until the target image corresponding to the initial article transparent-background image is generated, improving the efficiency and accuracy of target image generation. A sketch of this cycle is given below.
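To make the encode-decode-check cycle of steps 510 through 550 concrete, the following is a minimal sketch in Python. The encoder, decoder and their interfaces are not specified in this disclosure, so they are passed in as callables; all names and the step-count-based stopping rule are illustrative assumptions, not the actual implementation.

```python
def generate_target_image(noise_reference, text_prompt, item_features,
                          reference_features, encode_background, decode,
                          preset_steps=50):
    """Cyclic image processing: encode -> decode -> check -> repeat."""
    image = noise_reference
    for step in range(preset_steps):
        # Step 510: extract background features from the (noisy) image
        # and the text prompt word.
        background_features = encode_background(image, text_prompt)
        # Step 520: decode article, reference and background features
        # into a (partially) denoised image.
        image = decode(item_features, reference_features, background_features)
        # Steps 530/540: here the generation condition is simply that
        # the preset number of denoising passes has been performed.
        if step + 1 == preset_steps:
            return image
        # Step 550: otherwise the denoised image becomes the noise
        # reference image for the next pass of the loop.
    return image
```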
As an optional implementation, the image generation method further includes: performing information filtering on the reference feature information based on the article mask corresponding to the initial article transparent-background image, to obtain filtered reference feature information. Accordingly, step 460 of performing cyclic image processing on the noise reference image, the text prompt word, the article feature information and the reference feature information to generate the target image corresponding to the initial article transparent-background image may include: performing cyclic image processing on the noise reference image, the text prompt word, the article feature information and the filtered reference feature information to generate the target image corresponding to the initial article transparent-background image.
Specifically, after acquiring the reference feature information, the execution body may perform information filtering on it using the article mask corresponding to the initial article transparent-background image, obtaining the filtered reference feature information. That is, the reference feature extraction model may output feature maps Y_i (i = 1, ..., N) at different resolutions, which are filtered as Y_i ← (1 − M)·Y_i, where M denotes the article mask corresponding to the initial article transparent-background image; the feature maps are injected in the same manner as ControlNet. A sketch of this filtering follows.
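The formula above translates directly into code. The sketch below assumes the feature maps are PyTorch tensors; resizing the mask to each feature map's resolution with nearest-neighbor interpolation is an assumption the text does not spell out.

```python
import torch
import torch.nn.functional as F

def filter_reference_features(feature_maps, article_mask):
    """Apply Y_i = (1 - M) * Y_i so only background regions survive.

    feature_maps: list of tensors of shape (C_i, H_i, W_i) at
    different resolutions; article_mask: tensor of shape (1, H, W)
    with 1 inside the article and 0 in the background.
    """
    filtered = []
    for y in feature_maps:
        # Resize the mask to this feature map's resolution (assumed).
        m = F.interpolate(article_mask.unsqueeze(0).float(),
                          size=y.shape[-2:], mode="nearest").squeeze(0)
        filtered.append((1.0 - m) * y)
    return filtered
```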
The execution body may then perform image background generation and denoising on the noise reference image, the text prompt word, the article feature information and the filtered reference feature information: background information is extracted from the noise reference image and the text prompt word to obtain the corresponding reference background information, and feature fusion and denoising are performed on the reference background information, the filtered reference feature information and the article feature information to generate an image containing the initial article and background information similar to the reference image. This image may still contain noise, in which case it is taken as the new noise reference image and the background generation and denoising are performed again with the text prompt word, the article feature information and the filtered reference feature information, until a clear image containing the initial article and background information similar to the reference image is obtained. In other words, cyclic image processing, i.e., cyclic denoising, is performed on the noise reference image, the text prompt word, the article feature information and the filtered reference feature information; a preset number of denoising passes may be performed, where the preset number may correspond to the preset noise setting, until the target image corresponding to the initial article transparent-background image is generated.
Alternatively, the execution body may perform the cyclic image processing on the noise reference image, the text prompt word, the article feature information and the filtered reference feature information through the text-to-image diffusion model, generating the target image corresponding to the initial article transparent-background image.
In this implementation, information filtering of the reference feature information lets the filtered reference feature information focus on the background information of the reference image, improving the accuracy and pertinence of the reference feature information and making the target image more accurate.
Referring to FIG. 6, a flowchart 600 of one embodiment of training the article feature extraction model and the text-to-image diffusion model is shown; that is, the article feature extraction model and the text-to-image diffusion model described above may be trained based on the following steps:
Step 610, in response to acquiring a sample article image, a sample article mask and a sample article category, acquiring a sample article transparent-background image based on the sample article image and the sample article mask.
In this step, the execution body may read the sample article image over the network and perform article category analysis on it to obtain the sample article category and the sample article mask. The execution body may use the sample article mask to process the sample article image and obtain the corresponding sample article transparent-background image, which may represent an image containing only the sample article and no background.
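As a small illustration of obtaining a transparent-background image from the mask, the sketch below assumes NumPy arrays and an RGBA output; the text only says the result contains the article and no background, so the RGBA representation is an assumption.

```python
import numpy as np

def make_transparent_background(sample_image, sample_mask):
    """Compose an RGBA image that keeps only the article pixels.

    sample_image: (H, W, 3) uint8 article image;
    sample_mask: (H, W) array with 1 inside the article, 0 elsewhere.
    """
    # Alpha channel: opaque inside the article, transparent outside.
    alpha = (sample_mask.astype(np.uint8) * 255)[..., None]
    # Zero out the background in the color channels.
    rgb = sample_image * sample_mask[..., None].astype(sample_image.dtype)
    return np.concatenate([rgb, alpha], axis=-1)  # (H, W, 4)
```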
Step 620, constructing a sample text prompt word and a sample background prompt word based on the sample article category.
In this step, the execution body may read a text construction template for the text prompt word and use it to construct text from the sample article category, generating the sample text prompt word corresponding to the sample article transparent-background image. The text prompt word may represent the article category of the sample article; the text construction template may be, for example, "a photo of [article category]".
The execution body may determine the sample article category code corresponding to the sample article category according to a correspondence between article categories and article category codes. The execution body may further splice the sample article category code with a preset character to obtain the background information corresponding to the sample article category; the preset character may be, for example, "sks". The execution body may read a background construction template for the background prompt word and use it to construct the sample background prompt word corresponding to the sample article transparent-background image. The background prompt word may represent the background information of the sample article; the background construction template may be, for example, "in the background of [background information]".
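A small sketch of this prompt construction, using the templates quoted above; the concatenation order of the category code and the "sks" preset character is not specified in the text, so the order below is an assumption.

```python
def build_sample_prompts(article_category, category_code, preset_char="sks"):
    """Construct the sample text prompt word and background prompt word."""
    # Text construction template: "a photo of [article category]".
    text_prompt = f"a photo of {article_category}"
    # Splice the category code with the preset character (order assumed).
    background_info = f"{preset_char} {category_code}"
    # Background construction template: "in the background of [background information]".
    background_prompt = f"in the background of {background_info}"
    return text_prompt, background_prompt

# Example (hypothetical category and code):
# build_sample_prompts("sneaker", "c042")
# -> ("a photo of sneaker", "in the background of sks c042")
```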
Step 630, acquiring a sample noise image based on the sample article image.
In this step, the execution body may perform noise superposition on the sample article image according to the preset noise setting, generating the sample noise image corresponding to the sample article image; the preset noise setting may include performing a preset number of noise superposition operations on the image. As an example, the preset noise setting may specify 50 noise superposition operations, in which case the execution body performs noise superposition on the sample article image 50 times to generate the corresponding sample noise image.
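As a sketch, the iterative noise superposition can be written as the forward process of a diffusion model. The Gaussian noise and the linear beta schedule below are standard diffusion choices assumed here; the text specifies only the number of passes.

```python
import torch

def make_sample_noise_image(sample_image, num_steps=50):
    """Superimpose noise on the image num_steps times (forward diffusion)."""
    betas = torch.linspace(1e-4, 0.02, num_steps)  # assumed linear schedule
    x = sample_image
    for beta in betas:
        noise = torch.randn_like(x)
        # One noise superposition pass: mix the image with Gaussian noise.
        x = torch.sqrt(1.0 - beta) * x + torch.sqrt(beta) * noise
    return x
```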
Step 640, taking the sample article transparent-background image, the sample text prompt word and the sample background prompt word as input of an initial article feature extraction model, taking the sample noise image, the sample text prompt word and the output result of the initial article feature extraction model as input of an initial text-to-image diffusion model, and training the initial article feature extraction model and the initial text-to-image diffusion model to obtain the article feature extraction model and the text-to-image diffusion model.
In this step, the execution body may acquire an initial article feature extraction model and an initial text-to-image diffusion model, take the sample article transparent-background image, the sample text prompt word and the sample background prompt word as input of the initial article feature extraction model, take the sample noise image, the sample text prompt word and the output result of the initial article feature extraction model as input of the initial text-to-image diffusion model, and train the two models simultaneously in a self-supervised manner to obtain the article feature extraction model and the text-to-image diffusion model.
Specifically, the execution body may input the sample article transparent-background image, the sample text prompt word and the sample background prompt word into the initial article feature extraction model, which performs feature extraction on them and outputs sample article feature information.
The execution body may input the sample noise image and the sample text prompt word into the initial text-to-image diffusion model, which performs feature extraction on them, decodes the extracted features together with the output result of the initial article feature extraction model, and outputs a predicted image. The network parameters of the initial article feature extraction model and the initial text-to-image diffusion model are adjusted continuously until the predicted image satisfies the training condition, at which point training is complete and the article feature extraction model and the text-to-image diffusion model are obtained.
In this implementation, training the models in a self-supervised manner allows the resulting article feature extraction model and text-to-image diffusion model to take article feature information and background feature information into account simultaneously, so that the target image can be generated more accurately. A sketch of one training step follows.
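The following is a minimal sketch of one joint training step, assuming both models are PyTorch modules and that the training condition is a pixel reconstruction loss against the clean sample image. The loss choice and batch keys are assumptions; the text only says parameters are adjusted until the predicted image meets the training condition.

```python
import torch
import torch.nn.functional as F

def joint_training_step(item_extractor, diffusion_model, optimizer, batch):
    """One self-supervised step updating both initial models together."""
    # Initial article feature extraction model: transparent-background
    # image + text prompt word + background prompt word -> features.
    item_feats = item_extractor(batch["transparent_image"],
                                batch["text_prompt"],
                                batch["background_prompt"])
    # Initial text-to-image diffusion model: noise image + text prompt
    # word + extracted features -> predicted image.
    predicted = diffusion_model(batch["noise_image"],
                                batch["text_prompt"], item_feats)
    # Assumed training condition: reconstruct the clean sample image.
    loss = F.mse_loss(predicted, batch["sample_image"])
    optimizer.zero_grad()
    loss.backward()  # gradients flow into both models' parameters
    optimizer.step()
    return loss.item()
```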
As an optional implementation, the method may further include: acquiring a reference image pair. Accordingly, step 630 of acquiring a sample noise image based on the sample article image may include: acquiring the sample noise image based on the preset noise setting and the reference image pair. And step 640 may include: taking the sample article transparent-background image, the sample text prompt word and the sample background prompt word as input of the initial article feature extraction model, taking the reference image pair as input of an initial reference feature extraction model, taking the sample noise image and the sample text prompt word as input of the initial text-to-image diffusion model, and training the initial article feature extraction model, the initial reference feature extraction model and the initial text-to-image diffusion model to obtain the article feature extraction model, a reference feature extraction model and the text-to-image diffusion model, where the reference feature extraction model is used to perform feature extraction on the reference image and the article mask corresponding to the reference image to obtain reference feature information.
Specifically, the execution body may acquire a reference image pair in advance, where the reference image pair may include an article image and an article mask. The execution body may perform noise superposition on the reference image pair according to the preset noise setting to generate the sample noise image, where the preset noise setting may include performing a preset number of noise superposition operations on the reference image pair.
The execution body may acquire an initial article feature extraction model, an initial reference feature extraction model and an initial text-to-image diffusion model; take the sample article transparent-background image, the sample text prompt word and the sample background prompt word as input of the initial article feature extraction model; take the reference image pair as input of the initial reference feature extraction model; take the sample noise image, the sample text prompt word and the output results of the initial article feature extraction model as input of the initial text-to-image diffusion model; and train the three models simultaneously in a self-supervised manner to obtain the article feature extraction model, the reference feature extraction model and the text-to-image diffusion model. The article feature extraction model is used to perform feature extraction on the initial article transparent-background image, the text prompt word and the background prompt word to obtain article feature information; the reference feature extraction model is used to perform feature extraction on the reference image and the article mask corresponding to the reference image to obtain reference feature information; and the text-to-image diffusion model is used to perform cyclic image processing on the target noise image, the text prompt word, the reference feature information and the article feature information to generate the target image corresponding to the initial article transparent-background image.
Specifically, the execution body may input the sample article transparent-background image, the sample text prompt word and the sample background prompt word into the initial article feature extraction model, which performs feature extraction on them and outputs sample article feature information.
The execution body may input the reference image pair into the initial reference feature extraction model, which performs feature extraction on the reference image pair and outputs sample reference feature information.
The execution body may input the sample noise image and the sample text prompt word into the initial text-to-image diffusion model, which performs feature extraction on them, decodes the extracted features together with the sample article feature information and the sample reference feature information, and outputs a predicted image. The network parameters of the initial article feature extraction model, the initial reference feature extraction model and the initial text-to-image diffusion model are adjusted continuously until the predicted image satisfies the training condition, at which point training is complete and the article feature extraction model, the reference feature extraction model and the text-to-image diffusion model are obtained.
In this implementation, training the models in a self-supervised manner allows the resulting article feature extraction model, reference feature extraction model and text-to-image diffusion model to take article feature information, reference feature information and background feature information into account simultaneously, so that the target image can be generated more accurately.
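Under the same assumptions as the two-model sketch above, the three-model variant adds only the reference branch; the batch keys and loss remain illustrative assumptions.

```python
import torch

def joint_training_step_with_reference(item_extractor, ref_extractor,
                                       diffusion_model, optimizer, batch):
    """One self-supervised step updating all three initial models."""
    item_feats = item_extractor(batch["transparent_image"],
                                batch["text_prompt"],
                                batch["background_prompt"])
    # Initial reference feature extraction model: reference image pair
    # (article image + article mask) -> sample reference features.
    ref_feats = ref_extractor(batch["reference_image"],
                              batch["reference_mask"])
    predicted = diffusion_model(batch["noise_image"], batch["text_prompt"],
                                item_feats, ref_feats)
    # Assumed training condition, as in the two-model sketch above.
    loss = torch.nn.functional.mse_loss(predicted, batch["sample_image"])
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```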
Referring to fig. 7, a flowchart 700 of one embodiment of acquiring a reference image pair is shown; that is, the aforementioned acquisition of a reference image pair may include the following steps:
Step 710, dilating the sample article mask to obtain a dilated article mask.
In this step, the execution body may dilate the sample article mask with a first preset probability to obtain the dilated article mask. Dilation expands the boundary points of the binary article region, merging all background points in contact with the article into the article so that the boundary grows outward; it may include horizontal, vertical and omnidirectional dilation of the sample article mask.
Step 720, performing data enhancement on the sample article image to obtain a new sample image.
In this step, the execution body may perform data enhancement on the sample article image, applying a mixup operation with a preset probability: the pixel values of the sample article image and a random advertisement image are weighted and summed at each position to obtain the new sample image.
Step 730, performing a translation operation on the new sample image and the dilated article mask to obtain a translated sample image and a translated article mask.
In this step, the execution body may translate the new sample image and the dilated article mask with a second preset probability; during translation the two remain synchronized, and the blank regions created are zero-filled, yielding the translated sample image and the translated article mask.
Step 740, performing random mask processing on the translated article mask to generate a random article mask comprising a plurality of random rectangular masks.
In this step, the execution body may perform random mask processing on the translated article mask with a third preset probability; that is, with the third preset probability, three additional rectangular masks at random positions are generated on the translated article mask, yielding a random article mask that contains both the sample article mask and a plurality of random rectangular masks.
Step 750, rotating the translated sample image and the random article mask to obtain a new sample article image and a new sample article mask, and combining them into a reference image pair.
In this step, the execution body may rotate the translated sample image and the random article mask with a fourth preset probability; during rotation the two remain synchronized, and the blank regions created are zero-filled, yielding the new sample article image and the new sample article mask. The execution body may combine the new sample article image and the new sample article mask into the reference image pair.
In this implementation, a series of data enhancement operations on the sample article image constructs a reference image pair from the new sample article image and the new sample article mask, avoiding manual collection of reference image pairs, reducing training cost and improving model training efficiency. A sketch of the full pipeline follows.
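Steps 710 through 750 can be sketched as one augmentation function. All probabilities, the dilation kernel size, the shift and rotation ranges, and the mixup weight below are illustrative assumptions (the text leaves them as preset values); `mix_image` stands in for the random advertisement image and is assumed to match the sample image size.

```python
import random
import numpy as np
import cv2  # OpenCV for dilation and affine warps

def build_reference_pair(sample_image, sample_mask, mix_image,
                         p_dilate=0.5, p_mixup=0.5, p_shift=0.5,
                         p_rect=0.5, p_rotate=0.5):
    """Data-enhancement pipeline of steps 710-750 (sketch)."""
    mask = sample_mask.copy()
    if random.random() < p_dilate:                        # step 710: dilation
        mask = cv2.dilate(mask, np.ones((15, 15), np.uint8))
    image = sample_image.astype(np.float32)
    if random.random() < p_mixup:                         # step 720: mixup
        w = random.uniform(0.0, 0.3)
        image = (1 - w) * image + w * mix_image.astype(np.float32)
    h, w_ = mask.shape[:2]
    if random.random() < p_shift:                         # step 730: translation
        tx = random.randint(-w_ // 8, w_ // 8)
        ty = random.randint(-h // 8, h // 8)
        m = np.float32([[1, 0, tx], [0, 1, ty]])
        image = cv2.warpAffine(image, m, (w_, h))         # blanks zero-filled
        mask = cv2.warpAffine(mask, m, (w_, h))
    if random.random() < p_rect:                          # step 740: random rects
        for _ in range(3):                                # three random rectangles
            x0, y0 = random.randint(0, w_ - 2), random.randint(0, h - 2)
            x1, y1 = random.randint(x0 + 1, w_), random.randint(y0 + 1, h)
            mask[y0:y1, x0:x1] = 1
    if random.random() < p_rotate:                        # step 750: rotation
        angle = random.uniform(-30, 30)                   # assumed range
        m = cv2.getRotationMatrix2D((w_ / 2, h / 2), angle, 1.0)
        image = cv2.warpAffine(image, m, (w_, h))
        mask = cv2.warpAffine(mask, m, (w_, h))
    return image.astype(sample_image.dtype), mask
```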
Referring to fig. 8, as an implementation of the method shown in the above figures, the present disclosure provides an embodiment of an image generating apparatus. This embodiment of the device corresponds to the embodiment of the method shown in fig. 2.
As shown in fig. 8, the image generating apparatus 800 of the present embodiment may include: a construction module 810, an acquisition module 820, a first extraction module 830 and a generation module 840.
Wherein the construction module 810 is configured to, in response to acquiring an initial article transparent-background image and an initial article category, construct a text prompt word and a background prompt word based on the initial article category;
the acquisition module 820 is configured to acquire a target noise image corresponding to the initial article transparent-background image based on a preset noise setting;
the first extraction module 830 is configured to perform feature extraction on the initial article transparent-background image, the text prompt word and the background prompt word to obtain article feature information;
the generation module 840 is configured to perform cyclic image processing on the target noise image, the text prompt word and the article feature information to generate a target image corresponding to the initial article transparent-background image.
In some optional implementations of this embodiment, the generation module 840 is further configured to: input the target noise image and the text prompt word into an encoder in a text-to-image diffusion model to obtain background feature information; input the article feature information and the background feature information into a decoder in the text-to-image diffusion model to obtain a denoised image; determine whether the denoised image satisfies an image generation condition; and in response to determining that the denoised image satisfies the image generation condition, determine the denoised image as the target image corresponding to the initial article transparent-background image.
In some optional implementations of this embodiment, the generation module 840 is further configured to: in response to determining that the denoised image does not satisfy the image generation condition, take the denoised image as the target noise image.
In some optional implementations of this embodiment, the apparatus further includes a second extraction module. The acquisition module 820 is further configured to: acquire a reference image and an article mask corresponding to the reference image; and acquire a noise reference image corresponding to the initial article transparent-background image based on the preset noise setting and the reference image. The second extraction module is configured to: perform feature extraction on the reference image and the article mask corresponding to the reference image to obtain reference feature information. The generation module 840 is further configured to: perform cyclic image processing on the noise reference image, the text prompt word, the article feature information and the reference feature information to generate the target image corresponding to the initial article transparent-background image.
In some optional implementations of this embodiment, the generation module 840 is further configured to: input the noise reference image and the text prompt word into the encoder of the text-to-image diffusion model to obtain background feature information; input the article feature information, the reference feature information and the background feature information into the decoder of the text-to-image diffusion model to obtain a denoised image; determine whether the denoised image satisfies the image generation condition; and in response to determining that the denoised image satisfies the image generation condition, determine the denoised image as the target image corresponding to the initial article transparent-background image.
In some optional implementations of this embodiment, the generation module 840 is further configured to: in response to determining that the denoised image does not satisfy the image generation condition, take the denoised image as the noise reference image.
In some optional implementations of this embodiment, the apparatus further includes a filtering module configured to: perform information filtering on the reference feature information based on the article mask corresponding to the initial article transparent-background image to obtain filtered reference feature information. The generation module 840 is further configured to: perform cyclic image processing on the noise reference image, the text prompt word, the article feature information and the filtered reference feature information to generate the target image corresponding to the initial article transparent-background image.
In some optional implementations of this embodiment, the first extraction module is further configured to: perform feature extraction on the initial article transparent-background image, the text prompt word and the background prompt word through an article feature extraction model to obtain the article feature information, where the article feature extraction model includes an article mask attention module.
In some optional implementations of this embodiment, the article feature extraction model and the text-to-image diffusion model are trained by a training unit configured to: in response to acquiring a sample article image, a sample article mask and a sample article category, acquire a sample article transparent-background image based on the sample article image and the sample article mask; construct a sample text prompt word and a sample background prompt word based on the sample article category; acquire a sample noise image based on the sample article image; and take the sample article transparent-background image, the sample text prompt word and the sample background prompt word as input of an initial article feature extraction model, take the sample noise image and the sample text prompt word as input of an initial text-to-image diffusion model, and train the initial article feature extraction model and the initial text-to-image diffusion model to obtain the article feature extraction model and the text-to-image diffusion model.
In some optional implementations of this embodiment, the training unit is further configured to: acquire a reference image pair; and take the sample article transparent-background image, the sample text prompt word and the sample background prompt word as input of the initial article feature extraction model, take the reference image pair as input of an initial reference feature extraction model, take the sample noise image and the sample text prompt word as input of the initial text-to-image diffusion model, and train the initial article feature extraction model, the initial reference feature extraction model and the initial text-to-image diffusion model to obtain the article feature extraction model, a reference feature extraction model and the text-to-image diffusion model, where the reference feature extraction model is used to perform feature extraction on the reference image and the article mask corresponding to the reference image to obtain reference feature information.
In some optional implementations of this embodiment, the training unit is further configured to: dilate the sample article mask to obtain a dilated article mask; perform data enhancement on the sample article image to obtain a new sample image; perform a translation operation on the new sample image and the dilated article mask to obtain a translated sample image and a translated article mask; perform random mask processing on the translated article mask to generate a random article mask comprising a plurality of random rectangular masks; and rotate the translated sample image and the random article mask to obtain a new sample article image and a new sample article mask, and combine them into the reference image pair.
According to the image generating apparatus provided by this embodiment of the disclosure, the execution body first, in response to acquiring the initial article transparent-background image and the initial article category, constructs the text prompt word and the background prompt word based on the initial article category; then acquires the target noise image corresponding to the initial article transparent-background image based on the preset noise setting; next performs feature extraction on the initial article transparent-background image, the text prompt word and the background prompt word to obtain the article feature information; and finally performs cyclic image processing on the target noise image, the text prompt word and the article feature information to generate the target image corresponding to the initial article transparent-background image.
Those skilled in the art will appreciate that the above-described apparatus also includes some other well-known structures, such as a processor, memory, etc., which are not shown in fig. 8 in order to unnecessarily obscure embodiments of the present disclosure.
It should be noted that the collection, updating, analysis, processing, use, transmission and storage of users' personal information involved in the technical solution of the present disclosure all comply with the relevant laws and regulations, are used for lawful purposes, and do not violate public order and good morals. Necessary measures are taken for users' personal information to prevent illegal access to users' personal information data and to maintain users' personal information security, network security and national security.
Referring now to fig. 9, a schematic diagram of an electronic device 900 suitable for use in implementing embodiments of the present disclosure is shown. The terminal device in the embodiments of the present disclosure may include, but is not limited to, a mobile terminal such as a smart screen, a notebook computer, a PAD (tablet), a PMP (portable multimedia player), an in-vehicle terminal (e.g., in-vehicle navigation terminal), etc., a fixed terminal such as a digital TV, a desktop computer, etc. The terminal device shown in fig. 9 is only one example, and should not impose any limitation on the functions and scope of use of the embodiments of the present disclosure.
As shown in fig. 9, the electronic device 900 may include a processing means (e.g., a central processor, a graphics processor, etc.) 901, which may perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 902 or a program loaded from a storage means 908 into a Random Access Memory (RAM) 903. In the RAM 903, various programs and data necessary for the operation of the electronic device 900 are also stored. The processing device 901, the ROM 902, and the RAM 903 are connected to each other through a bus 904. An input/output (I/O) interface 905 is also connected to the bus 904.
In general, the following devices may be connected to the I/O interface 905: input devices 906 including, for example, a touch screen, touchpad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, and the like; an output device 907 including, for example, a Liquid Crystal Display (LCD), a speaker, a vibrator, and the like; storage 908 including, for example, magnetic tape, hard disk, etc.; and a communication device 909. The communication means 909 may allow the electronic device 900 to communicate wirelessly or by wire with other devices to exchange data. While fig. 9 shows an electronic device 900 having various means, it is to be understood that not all illustrated means are required to be implemented or provided. More or fewer devices may be implemented or provided instead. Each block shown in fig. 9 may represent one device or a plurality of devices as needed.
In particular, according to embodiments of the present disclosure, the processes described above with reference to flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method shown in the flowcharts. In such an embodiment, the computer program may be downloaded and installed from a network via the communication device 909, or installed from the storage device 908, or installed from the ROM 902. The computer program, when executed by the processing device 901, performs the above-described functions defined in the methods of the embodiments of the present disclosure. It should be noted that the computer readable medium of the embodiments of the present disclosure may be a computer readable signal medium or a computer readable storage medium, or any combination of the two. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In an embodiment of the present disclosure, a computer readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device. In embodiments of the present disclosure, the computer-readable signal medium may comprise a data signal propagated in baseband or as part of a carrier wave, with computer-readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, fiber optic cables, RF (radio frequency), and the like, or any suitable combination of the foregoing.
Computer program code for carrying out operations for embodiments of the present disclosure may be written in one or more programming languages, including an object oriented programming language such as Java, smalltalk, C ++ and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units involved in the embodiments of the present application may be implemented in software or in hardware. The described units may also be provided in a processor, for example described as: a processor comprising a construction module, an acquisition module, a first extraction module and a generation module, where the names of these modules do not, in some cases, limit the modules themselves.
As another aspect, the present application also provides a computer readable medium, which may be contained in the electronic device, or may exist alone without being incorporated into the electronic device. The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: in response to acquiring an initial article transparent-background image and an initial article category, construct a text prompt word and a background prompt word based on the initial article category; acquire a target noise image corresponding to the initial article transparent-background image based on a preset noise setting; perform feature extraction on the initial article transparent-background image, the text prompt word and the background prompt word to obtain article feature information; and perform cyclic image processing on the target noise image, the text prompt word and the article feature information to generate a target image corresponding to the initial article transparent-background image.
The foregoing description covers only the preferred embodiments of the present disclosure and the principles of the technology employed. It will be appreciated by those skilled in the art that the scope of the invention in the embodiments of the present disclosure is not limited to the specific combinations of the above technical features, and also encompasses other technical solutions formed by any combination of the above technical features or their equivalents without departing from the inventive concept, for example technical solutions formed by substituting the above features with (but not limited to) features having similar functions disclosed in the embodiments of the present disclosure.
Claims (14)
1. An image generation method, the method comprising:
in response to acquiring an initial article transparent-background image and an initial article category, constructing a text prompt word and a background prompt word based on the initial article category;
acquiring a target noise image corresponding to the initial article transparent-background image based on a preset noise setting;
performing feature extraction on the initial article transparent-background image, the text prompt word and the background prompt word to obtain article feature information;
and performing cyclic image processing on the target noise image, the text prompt word and the article feature information to generate a target image corresponding to the initial article transparent-background image.
2. The method of claim 1, wherein the performing cyclic image processing on the target noise image, the text prompt word and the article feature information to generate a target image corresponding to the initial article transparent-background image comprises:
inputting the target noise image and the text prompt word into an encoder in a text-to-image diffusion model to obtain background feature information;
inputting the article feature information and the background feature information into a decoder in the text-to-image diffusion model to obtain a denoised image;
judging whether the denoised image satisfies an image generation condition;
and in response to determining that the denoised image satisfies the image generation condition, determining the denoised image as the target image corresponding to the initial article transparent-background image.
3. The method of claim 2, wherein the performing cyclic image processing on the target noise image, the text prompt word and the article feature information to generate a target image corresponding to the initial article transparent-background image further comprises:
in response to determining that the denoised image does not satisfy the image generation condition, taking the denoised image as the target noise image.
4. The method of claim 1, wherein the method further comprises: acquiring a reference image and an article mask corresponding to the reference image; and
the acquiring a target noise image corresponding to the initial article transparent-background image based on a preset noise setting comprises:
acquiring a noise reference image corresponding to the initial article transparent-background image based on the preset noise setting and the reference image; and
the method further comprises: performing feature extraction on the reference image and the article mask corresponding to the reference image to obtain reference feature information; and
the performing cyclic image processing on the target noise image, the text prompt word and the article feature information to generate a target image corresponding to the initial article transparent-background image comprises:
performing cyclic image processing on the noise reference image, the text prompt word, the article feature information and the reference feature information to generate the target image corresponding to the initial article transparent-background image.
5. The method of claim 4, wherein the performing cyclic image processing on the noise reference image, the text prompt word, the article feature information and the reference feature information to generate the target image corresponding to the initial article transparent-background image comprises:
inputting the noise reference image and the text prompt word into an encoder of a text-to-image diffusion model to obtain background feature information;
inputting the article feature information, the reference feature information and the background feature information into a decoder of the text-to-image diffusion model to obtain a denoised image;
judging whether the denoised image satisfies an image generation condition;
and in response to determining that the denoised image satisfies the image generation condition, determining the denoised image as the target image corresponding to the initial article transparent-background image.
6. The method of claim 5, wherein the performing cyclic image processing on the noise reference image, the text prompt word, the article feature information and the reference feature information to generate the target image corresponding to the initial article transparent-background image further comprises:
in response to determining that the denoised image does not satisfy the image generation condition, taking the denoised image as the noise reference image.
7. The method of claim 4, wherein the method further comprises: performing information filtering on the reference feature information based on the article mask corresponding to the initial article transparent-background image to obtain filtered reference feature information; and
the performing cyclic image processing on the noise reference image, the text prompt word, the article feature information and the reference feature information to generate the target image corresponding to the initial article transparent-background image comprises:
performing cyclic image processing on the noise reference image, the text prompt word, the article feature information and the filtered reference feature information to generate the target image corresponding to the initial article transparent-background image.
8. The method according to any one of claims 2-7, wherein the performing feature extraction on the initial article transparent-background image, the text prompt word and the background prompt word to obtain article feature information comprises:
performing feature extraction on the initial article transparent-background image, the text prompt word and the background prompt word through an article feature extraction model to obtain the article feature information, wherein the article feature extraction model comprises an article mask attention module.
9. The method of claim 8, wherein the article feature extraction model and the text-to-image diffusion model are trained based on the following steps:
in response to acquiring a sample article image, a sample article mask and a sample article category, acquiring a sample article transparent-background image based on the sample article image and the sample article mask;
constructing a sample text prompt word and a sample background prompt word based on the sample article category;
acquiring a sample noise image based on the sample article image;
and taking the sample article transparent-background image, the sample text prompt word and the sample background prompt word as input of an initial article feature extraction model, taking the sample noise image and the sample text prompt word as input of an initial text-to-image diffusion model, and training the initial article feature extraction model and the initial text-to-image diffusion model to obtain the article feature extraction model and the text-to-image diffusion model.
10. The method of claim 9, wherein the method further comprises: acquiring a reference image pair; and
the acquiring a sample noise image based on the sample article image comprises:
acquiring the sample noise image based on a preset noise setting and the reference image pair; and
the taking the sample article transparent-background image, the sample text prompt word and the sample background prompt word as input of an initial article feature extraction model, taking the sample noise image and the sample text prompt word as input of an initial text-to-image diffusion model, and training the initial article feature extraction model and the initial text-to-image diffusion model to obtain the article feature extraction model and the text-to-image diffusion model comprises:
taking the sample article transparent-background image, the sample text prompt word and the sample background prompt word as input of the initial article feature extraction model, taking the reference image pair as input of an initial reference feature extraction model, taking the sample noise image and the sample text prompt word as input of the initial text-to-image diffusion model, and training the initial article feature extraction model, the initial reference feature extraction model and the initial text-to-image diffusion model to obtain the article feature extraction model, a reference feature extraction model and the text-to-image diffusion model, wherein the reference feature extraction model is used for performing feature extraction on the reference image and the article mask corresponding to the reference image to obtain reference feature information.
11. The method of claim 10, wherein the acquiring a reference image pair comprises:
dilating the sample article mask to obtain a dilated article mask;
performing data enhancement on the sample article image to obtain a new sample image;
performing a translation operation on the new sample image and the dilated article mask to obtain a translated sample image and a translated article mask;
performing random mask processing on the translated article mask to generate a random article mask comprising a plurality of random rectangular masks;
and rotating the translated sample image and the random article mask to obtain a new sample article image and a new sample article mask, and forming the new sample article image and the new sample article mask into the reference image pair.
12. An image generation apparatus, the apparatus comprising:
a construction module configured to, in response to acquiring an initial article transparent-background image and an initial article category, construct a text prompt word and a background prompt word based on the initial article category;
an acquisition module configured to acquire a target noise image corresponding to the initial article transparent-background image based on a preset noise setting;
a first extraction module configured to perform feature extraction on the initial article transparent-background image, the text prompt word and the background prompt word to obtain article feature information;
and a generation module configured to perform cyclic image processing on the target noise image, the text prompt word and the article feature information to generate a target image corresponding to the initial article transparent-background image.
13. An electronic device, comprising:
One or more processors;
Storage means for storing one or more programs,
The one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any of claims 1-11.
14. A computer readable medium on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the method according to any one of claims 1-11.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410092090.5A CN118135058A (en) | 2024-01-23 | 2024-01-23 | Image generation method and device |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410092090.5A CN118135058A (en) | 2024-01-23 | 2024-01-23 | Image generation method and device |
Publications (1)
Publication Number | Publication Date |
---|---|
CN118135058A true CN118135058A (en) | 2024-06-04 |
Family
ID=91235223
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202410092090.5A Pending CN118135058A (en) | 2024-01-23 | 2024-01-23 | Image generation method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN118135058A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN118365972A (en) * | 2024-06-19 | 2024-07-19 | 浙江口碑网络技术有限公司 | Article image generation method, device, electronic equipment and storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110058685B (en) | Virtual object display method and device, electronic equipment and computer-readable storage medium | |
CN109740018B (en) | Method and device for generating video label model | |
CN111523413B (en) | Method and device for generating face image | |
US11670015B2 (en) | Method and apparatus for generating video | |
CN111275784B (en) | Method and device for generating image | |
CN109993150B (en) | Method and device for identifying age | |
CN110021052B (en) | Method and apparatus for generating fundus image generation model | |
CN110070076B (en) | Method and device for selecting training samples | |
CN114004905B (en) | Method, device, equipment and storage medium for generating character style pictogram | |
CN110930220A (en) | Display method, display device, terminal equipment and medium | |
CN109816023B (en) | Method and device for generating picture label model | |
CN110046571B (en) | Method and device for identifying age | |
US20240320807A1 (en) | Image processing method and apparatus, device, and storage medium | |
WO2023138441A1 (en) | Video generation method and apparatus, and device and storage medium | |
CN111741329B (en) | Video processing method, device, equipment and storage medium | |
CN118135058A (en) | Image generation method and device | |
CN117132456A (en) | Image generation method, device, electronic equipment and storage medium | |
CN115311178A (en) | Image splicing method, device, equipment and medium | |
CN110008926B (en) | Method and device for identifying age | |
CN109816670B (en) | Method and apparatus for generating image segmentation model | |
CN111967397A (en) | Face image processing method and device, storage medium and electronic equipment | |
CN109829431B (en) | Method and apparatus for generating information | |
CN111246196A (en) | Video processing method and device, electronic equipment and computer readable storage medium | |
CN109816791B (en) | Method and apparatus for generating information | |
CN113628097A (en) | Image special effect configuration method, image recognition method, image special effect configuration device and electronic equipment |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |