Background
Some existing software can generate other face images from a user's face image. Such software generally includes a model for converting the user's face image into a face image of another style. The training process of such a model is generally unidirectional: a face image is input, the model's output is compared with a corresponding face image of the other style, and the parameters of the model are optimized according to the difference between the two images.
Disclosure of Invention
Embodiments of the present disclosure provide a method and apparatus for generating a cartoon avatar generation model, and a method and apparatus for generating a cartoon avatar.
In a first aspect, an embodiment of the present disclosure provides a method for generating a cartoon avatar generation model, the method including: acquiring a preset training sample set, where each training sample includes a sample face image and a sample cartoon avatar corresponding to the sample face image; acquiring a pre-established initial generative adversarial network, where the initial generative adversarial network includes a cartoon avatar generation network, a face image generation network, a cartoon avatar discrimination network, and a face image discrimination network; and performing the following training step: using a machine learning method, taking a sample face image included in a training sample in the training sample set as input of the cartoon avatar generation network, taking the cartoon avatar output by the cartoon avatar generation network as input of the face image generation network, taking the cartoon avatar output by the cartoon avatar generation network and the corresponding sample cartoon avatar as inputs of the cartoon avatar discrimination network, taking the face image output by the face image generation network and the corresponding sample face image as inputs of the face image discrimination network, training the initial generative adversarial network, and determining the trained cartoon avatar generation network as the cartoon avatar generation model.
In some embodiments, the training step includes: using a machine learning method, taking a sample face image included in a training sample in the training sample set as input of the cartoon avatar generation network, taking the cartoon avatar output by the cartoon avatar generation network as input of the face image generation network, taking the cartoon avatar output by the cartoon avatar generation network and the corresponding sample cartoon avatar as inputs of the cartoon avatar discrimination network, taking the face image output by the face image generation network and the corresponding sample face image as inputs of the face image discrimination network, and training the initial generative adversarial network; and taking the sample cartoon avatar included in the training sample as input of the face image generation network, taking the face image output by the face image generation network as input of the cartoon avatar generation network, taking the face image output by the face image generation network and the corresponding sample face image as inputs of the face image discrimination network, taking the cartoon avatar output by the cartoon avatar generation network and the corresponding sample cartoon avatar as inputs of the cartoon avatar discrimination network, training the initial generative adversarial network, and determining the trained cartoon avatar generation network as the cartoon avatar generation model.
In some embodiments, for a training sample in the training sample set, the similarity between the feature vectors of the sample face image and the sample cartoon avatar included in the training sample is greater than or equal to a preset similarity threshold.
In some embodiments, training the initial generative adversarial network includes: determining a first generation loss value characterizing the difference between the sample face image and the face image output by the face image generation network, and a second generation loss value characterizing the difference between the sample cartoon avatar and the cartoon avatar output by the cartoon avatar generation network; determining a first discrimination loss value, corresponding to the cartoon avatar discrimination network, characterizing the difference between the sample cartoon avatar input to the cartoon avatar discrimination network and the cartoon avatar output by the cartoon avatar generation network, and a second discrimination loss value, corresponding to the face image discrimination network, characterizing the difference between the sample face image input to the face image discrimination network and the face image output by the face image generation network; and training the initial generative adversarial network based on the determined first generation loss value, second generation loss value, first discrimination loss value, and second discrimination loss value.
In some embodiments, the generation loss values are determined by either of the following loss functions: an L1 norm loss function or an L2 norm loss function.
In a second aspect, an embodiment of the present disclosure provides a method for generating a cartoon avatar, the method including: acquiring a target face image; and inputting the target face image into a pre-trained cartoon avatar generation model to obtain and output a cartoon avatar, where the cartoon avatar generation model is generated according to the method described in any embodiment of the first aspect.
In a third aspect, an embodiment of the present disclosure provides an apparatus for generating a cartoon avatar generation model, the apparatus including: a first acquisition unit configured to acquire a preset training sample set, where each training sample includes a sample face image and a sample cartoon avatar corresponding to the sample face image; a second acquisition unit configured to acquire a pre-established initial generative adversarial network, where the initial generative adversarial network includes a cartoon avatar generation network, a face image generation network, a cartoon avatar discrimination network, and a face image discrimination network; and a training unit configured to perform the following training step: using a machine learning method, take a sample face image included in a training sample in the training sample set as input of the cartoon avatar generation network, take the cartoon avatar output by the cartoon avatar generation network as input of the face image generation network, take the cartoon avatar output by the cartoon avatar generation network and the corresponding sample cartoon avatar as inputs of the cartoon avatar discrimination network, take the face image output by the face image generation network and the corresponding sample face image as inputs of the face image discrimination network, train the initial generative adversarial network, and determine the trained cartoon avatar generation network as the cartoon avatar generation model.
In some embodiments, the training unit is further configured to: using a machine learning method, take a sample face image included in a training sample in the training sample set as input of the cartoon avatar generation network, take the cartoon avatar output by the cartoon avatar generation network as input of the face image generation network, take the cartoon avatar output by the cartoon avatar generation network and the corresponding sample cartoon avatar as inputs of the cartoon avatar discrimination network, take the face image output by the face image generation network and the corresponding sample face image as inputs of the face image discrimination network, and train the initial generative adversarial network; and take the sample cartoon avatar included in the training sample as input of the face image generation network, take the face image output by the face image generation network as input of the cartoon avatar generation network, take the face image output by the face image generation network and the corresponding sample face image as inputs of the face image discrimination network, take the cartoon avatar output by the cartoon avatar generation network and the corresponding sample cartoon avatar as inputs of the cartoon avatar discrimination network, train the initial generative adversarial network, and determine the trained cartoon avatar generation network as the cartoon avatar generation model.
In some embodiments, for a training sample in the training sample set, the similarity between the feature vectors of the sample face image and the sample cartoon avatar included in the training sample is greater than or equal to a preset similarity threshold.
In some embodiments, the training unit includes: a first determination module configured to determine a first generation loss value characterizing the difference between the sample face image and the face image output by the face image generation network, and a second generation loss value characterizing the difference between the sample cartoon avatar and the cartoon avatar output by the cartoon avatar generation network; a second determination module configured to determine a first discrimination loss value, corresponding to the cartoon avatar discrimination network, characterizing the difference between the sample cartoon avatar input to the cartoon avatar discrimination network and the cartoon avatar output by the cartoon avatar generation network, and a second discrimination loss value, corresponding to the face image discrimination network, characterizing the difference between the sample face image input to the face image discrimination network and the face image output by the face image generation network; and a training module configured to train the initial generative adversarial network based on the determined first generation loss value, second generation loss value, first discrimination loss value, and second discrimination loss value.
In some embodiments, the generation loss values are determined by either of the following loss functions: an L1 norm loss function or an L2 norm loss function.
In a fourth aspect, an embodiment of the present disclosure provides an apparatus for generating a cartoon avatar, the apparatus including: a face image acquisition unit configured to acquire a target face image; and a cartoon avatar generation unit configured to input the target face image into a pre-trained cartoon avatar generation model to obtain and output a cartoon avatar, where the cartoon avatar generation model is generated according to the method described in any embodiment of the first aspect.
In a fifth aspect, an embodiment of the present disclosure provides an electronic device, including: one or more processors; and a storage device having one or more programs stored thereon which, when executed by the one or more processors, cause the one or more processors to implement the method described in any implementation of the first or second aspect.
In a sixth aspect, an embodiment of the present disclosure provides a computer-readable medium on which a computer program is stored, where the computer program, when executed by a processor, implements the method described in any implementation of the first or second aspect.
The method and apparatus for generating a cartoon avatar generation model provided by the embodiments of the present disclosure acquire a preset training sample set and a pre-established initial generative adversarial network, where the initial generative adversarial network includes a cartoon avatar generation network, a face image generation network, a cartoon avatar discrimination network, and a face image discrimination network. Using a machine learning method, a sample face image included in a training sample in the training sample set is taken as input of the cartoon avatar generation network, the cartoon avatar output by the cartoon avatar generation network is taken as input of the face image generation network, the cartoon avatar output by the cartoon avatar generation network and the corresponding sample cartoon avatar are taken as inputs of the cartoon avatar discrimination network, and the face image output by the face image generation network and the corresponding sample face image are taken as inputs of the face image discrimination network; the initial generative adversarial network is trained, and the trained cartoon avatar generation network is determined as the cartoon avatar generation model. A bidirectional training mode is thus adopted: the input face image is fed into the generative adversarial network to obtain a cartoon avatar, and the cartoon avatar is then converted back into a face image, so that the similarity between the input face image and the reconstructed face image is high. This helps the resulting cartoon avatar generation model generate cartoon avatars with high similarity to the input face image.
Detailed Description
The present disclosure is described in further detail below with reference to the accompanying drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the relevant disclosure and are not limiting of the disclosure. It should be noted that, for the convenience of description, only the parts relevant to the related disclosure are shown in the drawings.
It should be noted that, in the present disclosure, the embodiments and features of the embodiments may be combined with each other without conflict. The present disclosure will be described in detail below with reference to the accompanying drawings in conjunction with embodiments.
Fig. 1 illustrates an exemplary system architecture 100 to which embodiments of the present disclosure of a method or apparatus for generating a cartoon avatar generation model, or of a method or apparatus for generating a cartoon avatar, may be applied.
As shown in fig. 1, the system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 serves as a medium for providing communication links between the terminal devices 101, 102, 103 and the server 105. The network 104 may include various connection types, such as wired or wireless communication links, or fiber-optic cables.
The user may use the terminal devices 101, 102, 103 to interact with the server 105 via the network 104 to receive or send messages or the like. The terminal devices 101, 102, 103 may have installed thereon various communication client applications, such as an image processing application, a web browser application, an instant messaging tool, social platform software, and the like.
The terminal devices 101, 102, and 103 may be hardware or software. When the terminal devices 101, 102, and 103 are hardware, they may be various electronic devices. When the terminal devices 101, 102, and 103 are software, they may be installed in the electronic devices described above, and may be implemented as multiple pieces of software or software modules (e.g., software or software modules for providing distributed services) or as a single piece of software or software module. No specific limitation is imposed here.
The server 105 may be a server providing various services, such as a background server that processes training sample sets uploaded by the terminal devices 101, 102, and 103. The background server may train the initial generative adversarial network using the acquired training sample set to obtain the cartoon avatar generation model. In addition, the background server may also process an input face image using the cartoon avatar generation model to obtain and output a cartoon avatar.
It should be noted that the method for generating a cartoon avatar generation model provided in the embodiments of the present disclosure may be executed by the server 105 or by the terminal devices 101, 102, and 103; accordingly, the apparatus for generating a cartoon avatar generation model may be disposed in the server 105 or in the terminal devices 101, 102, and 103. Likewise, the method for generating a cartoon avatar provided in the embodiments of the present disclosure may be executed by the server 105 or by the terminal devices 101, 102, and 103, and the apparatus for generating a cartoon avatar may accordingly be disposed in the server 105 or in the terminal devices 101, 102, and 103.
The server may be hardware or software. When the server is hardware, it may be implemented as a distributed server cluster composed of multiple servers or as a single server. When the server is software, it may be implemented as multiple pieces of software or software modules (e.g., software or software modules for providing distributed services) or as a single piece of software or software module. No specific limitation is imposed here.
It should be understood that the numbers of terminal devices, networks, and servers in fig. 1 are merely illustrative. There may be any number of terminal devices, networks, and servers as required by the implementation. When the training sample set or the target face image needed to train the model does not have to be acquired remotely, the system architecture may include no network and only a server or a terminal device.
With continued reference to FIG. 2, a flow 200 of one embodiment of a method for generating a cartoon avatar generation model according to the present disclosure is shown. The method for generating a cartoon avatar generation model includes the following steps:
Step 201: acquire a preset training sample set.
In this embodiment, the execution body of the method for generating a cartoon avatar generation model (for example, the server or a terminal device shown in fig. 1) may acquire a preset training sample set remotely or locally through a wired or wireless connection. Each training sample includes a sample face image and a sample cartoon avatar corresponding to the sample face image. Generally, a sample face image is obtained by photographing a real face, and a sample cartoon avatar is a drawn avatar. The correspondence between sample face images and sample cartoon avatars is established in advance. For example, a technician may manually review a large number of sample face images and sample cartoon avatars, select pairs with a high degree of similarity, and set each selected pair as a training sample.
In some optional implementations of this embodiment, for a training sample in the training sample set, the similarity between the feature vectors of the sample face image and the sample cartoon avatar included in the training sample is greater than or equal to a preset similarity threshold. The feature vectors may characterize various features of an image, such as color features and shape features. Specifically, the execution body that generates the training sample set may determine the feature vectors of the sample face images and the sample cartoon avatars using an existing method for determining the feature vector of an image (e.g., an LBP (Local Binary Pattern) algorithm or a neural-network-based algorithm), and then match the feature vectors of the sample face images against those of the sample cartoon avatars pairwise, thereby extracting pairs of matched sample face images and sample cartoon avatars. For each matched pair, the similarity between the corresponding feature vectors is greater than or equal to the preset similarity threshold.
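As an illustration of this pairing step only, the following is a minimal Python sketch of matching face images to cartoon avatars by cosine similarity of their feature vectors. The threshold value, the dict inputs keyed by image id, and the greedy matching strategy are assumptions for illustration, not part of the disclosure; any feature extractor (LBP, a neural network, etc.) could supply the vectors.

```python
import numpy as np

SIMILARITY_THRESHOLD = 0.8  # hypothetical value; the disclosure only requires a preset threshold

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two feature vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def build_training_pairs(face_features: dict, avatar_features: dict) -> list:
    """Greedily match each face image to its most similar cartoon avatar.

    face_features / avatar_features map image ids to feature vectors.
    Only pairs whose similarity reaches the threshold become training samples.
    """
    pairs = []
    for face_id, f in face_features.items():
        best_id, best_sim = None, -1.0
        for avatar_id, a in avatar_features.items():
            sim = cosine_similarity(f, a)
            if sim > best_sim:
                best_id, best_sim = avatar_id, sim
        if best_sim >= SIMILARITY_THRESHOLD:
            pairs.append((face_id, best_id, best_sim))
    return pairs
```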
Step 202: acquire a pre-established initial generative adversarial network.
In this embodiment, the execution body may acquire the pre-established initial generative adversarial network locally or remotely. The initial generative adversarial network may include a cartoon avatar generation network, a face image generation network, a cartoon avatar discrimination network, and a face image discrimination network. The cartoon avatar generation network is used to generate a cartoon avatar from an input face image; the cartoon avatar discrimination network is used to distinguish a cartoon avatar output by the cartoon avatar generation network from a sample cartoon avatar input to the cartoon avatar discrimination network; the face image generation network is used to generate a face image from an input cartoon avatar; and the face image discrimination network is used to distinguish a face image output by the face image generation network from a sample face image input to the face image discrimination network.
It should be understood that the initial generative adversarial network may be an untrained generative adversarial network with initialized parameters, or a generative adversarial network that has already been trained.
The cartoon avatar generation network and the face image generation network may be convolutional neural networks for image processing (e.g., convolutional neural networks of various structures including convolutional layers, pooling layers, unpooling layers, and deconvolutional layers). The cartoon avatar discrimination network and the face image discrimination network may be convolutional neural networks (e.g., convolutional neural networks of various structures including fully connected layers that implement a classification function); alternatively, a discrimination network may be another model implementing a classification function, such as a support vector machine (SVM). Here, the cartoon avatar discrimination network and the face image discrimination network each output a discrimination result. For example, if the cartoon avatar discrimination network determines that an image input to it is a cartoon avatar output by the cartoon avatar generation network, it may output a label 1 (or 0) corresponding to the image; if it determines that the image is not a cartoon avatar output by the cartoon avatar generation network, it may output a label 0 (or 1). Similarly, if the face image discrimination network determines that an image input to it is a face image output by the face image generation network, it may output a label 1 (or 0) corresponding to the image; otherwise it may output a label 0 (or 1). A discrimination network may also output other preset information and is not limited to the values 1 and 0.
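For concreteness, here is a minimal PyTorch sketch of one possible generator/discriminator pair consistent with the structures described above (convolutional and deconvolutional layers in the generator; convolutional layers plus a fully connected classification layer in the discriminator). The layer sizes and the assumed 64x64 RGB input resolution are illustrative choices, not prescribed by the disclosure.

```python
import torch
import torch.nn as nn

class Generator(nn.Module):
    """Image-to-image generator: convolutional encoder + deconvolutional decoder."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 64, 4, stride=2, padding=1),    # 64x64 -> 32x32
            nn.ReLU(inplace=True),
            nn.Conv2d(64, 128, 4, stride=2, padding=1),  # 32x32 -> 16x16
            nn.ReLU(inplace=True),
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1),  # 16x16 -> 32x32
            nn.ReLU(inplace=True),
            nn.ConvTranspose2d(64, 3, 4, stride=2, padding=1),    # 32x32 -> 64x64
            nn.Tanh(),                                   # outputs in [-1, 1]
        )

    def forward(self, x):
        return self.net(x)

class Discriminator(nn.Module):
    """Binary classifier: convolutional features + fully connected output."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 64, 4, stride=2, padding=1),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(64, 128, 4, stride=2, padding=1),
            nn.LeakyReLU(0.2, inplace=True),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(128 * 16 * 16, 1),  # assumes 64x64 inputs
            nn.Sigmoid(),                 # probability that the input is a real sample
        )

    def forward(self, x):
        return self.classifier(self.features(x))
```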
Step 203: perform the following training step: using a machine learning method, take a sample face image included in a training sample in the training sample set as input of the cartoon avatar generation network, take the cartoon avatar output by the cartoon avatar generation network as input of the face image generation network, take the cartoon avatar output by the cartoon avatar generation network and the corresponding sample cartoon avatar as inputs of the cartoon avatar discrimination network, take the face image output by the face image generation network and the corresponding sample face image as inputs of the face image discrimination network, train the initial generative adversarial network, and determine the trained cartoon avatar generation network as the cartoon avatar generation model.
In this embodiment, the execution body may perform the training step described above: using a machine learning method, a sample face image included in a training sample in the training sample set is taken as input of the cartoon avatar generation network, the cartoon avatar output by the cartoon avatar generation network is taken as input of the face image generation network, the cartoon avatar output by the cartoon avatar generation network and the corresponding sample cartoon avatar are taken as inputs of the cartoon avatar discrimination network, and the face image output by the face image generation network and the corresponding sample face image are taken as inputs of the face image discrimination network; the initial generative adversarial network is trained, and the trained cartoon avatar generation network is determined as the cartoon avatar generation model.
Specifically, the execution body may first fix the parameters of either the generation networks (the cartoon avatar generation network and the face image generation network) or the discrimination networks (the cartoon avatar discrimination network and the face image discrimination network) (referred to as the first network) and optimize the networks whose parameters are not fixed (referred to as the second network); it may then fix the parameters of the second network and optimize the first network. This iteration is continued until the cartoon avatar discrimination network cannot determine whether an input image was generated by the cartoon avatar generation network, and the face image discrimination network cannot determine whether an input image was generated by the face image generation network. At that point, the cartoon avatars generated by the cartoon avatar generation network are close to the sample cartoon avatars, and the cartoon avatar discrimination network cannot accurately distinguish them (i.e., its discrimination accuracy is 50%); likewise, the face images generated by the face image generation network are close to the sample face images, and the face image discrimination network cannot accurately distinguish them (i.e., its discrimination accuracy is 50%). The cartoon avatar generation network at this point may be determined as the cartoon avatar generation model. In general, the execution body may train the generation networks and the discrimination networks using existing back-propagation and gradient descent algorithms, adjusting their parameters after each training round and using the adjusted networks as the initial generative adversarial network for the next round. During training, a loss value may be determined using a loss function, and the generation and discrimination networks are trained iteratively so that the loss value determined in each iteration is minimized.
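The alternating fix-and-optimize iteration can be sketched as follows, reusing the Generator and Discriminator classes from the sketch above. Keeping separate optimizers for the generator side and the discriminator side means each step updates one side while the other side's parameters stay fixed. The learning rate and the use of binary cross-entropy are illustrative assumptions; the generation loss values of the optional implementation described below would be added to the generator objective.

```python
import itertools
import torch
import torch.nn as nn
import torch.optim as optim

# G1: face -> cartoon, G2: cartoon -> face, D1: cartoon discriminator, D2: face discriminator
G1, G2, D1, D2 = Generator(), Generator(), Discriminator(), Discriminator()

# One optimizer per side, so the other side's parameters stay fixed during a step.
opt_g = optim.Adam(itertools.chain(G1.parameters(), G2.parameters()), lr=2e-4)
opt_d = optim.Adam(itertools.chain(D1.parameters(), D2.parameters()), lr=2e-4)
bce = nn.BCELoss()

def train_step(face, cartoon):
    """face, cartoon: batches of paired sample images, shape (N, 3, 64, 64)."""
    real = torch.ones(face.size(0), 1)
    fake = torch.zeros(face.size(0), 1)

    # Optimize the generators while the discriminator parameters stay fixed.
    opt_g.zero_grad()
    fake_cartoon = G1(face)
    rec_face = G2(fake_cartoon)
    g_loss = bce(D1(fake_cartoon), real) + bce(D2(rec_face), real)
    g_loss.backward()
    opt_g.step()

    # Optimize the discriminators while the generator parameters stay fixed
    # (detach() cuts gradients back into the generators).
    opt_d.zero_grad()
    d_loss = (bce(D1(cartoon), real) + bce(D1(fake_cartoon.detach()), fake)
              + bce(D2(face), real) + bce(D2(rec_face.detach()), fake))
    d_loss.backward()
    opt_d.step()
    return g_loss.item(), d_loss.item()
```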
As shown in fig. 3A, G1 is the cartoon avatar generation network, G2 is the face image generation network, D1 is the cartoon avatar discrimination network, and D2 is the face image discrimination network. For one training sample, as shown in fig. 3A, the sample face image is taken as the input of G1 and a face image is finally output by G2, thereby training G1, G2, D1, and D2. As can be seen from the figure, this embodiment adopts a bidirectional training mode: the input face image is fed into the generative adversarial network to obtain a cartoon avatar, and the cartoon avatar is then converted back into a face image. After training, the generative adversarial network can restore a generated cartoon avatar to a face image with high similarity to the input face image, so the resulting cartoon avatar generation model can generate cartoon avatars with high similarity to the input face image.
In some optional implementations of this embodiment, the execution body may train the initial generative adversarial network according to the following steps:
First, determine a first generation loss value characterizing the difference between the sample face image and the face image output by the face image generation network, and a second generation loss value characterizing the difference between the sample cartoon avatar and the cartoon avatar output by the cartoon avatar generation network. In general, the first and second generation loss values may be determined using a regression loss function, commonly denoted L(y, y'), whose value characterizes the degree of inconsistency between the true value y (here, the sample face image or the sample cartoon avatar) and the predicted value y' (here, the face image output by the face image generation network or the cartoon avatar output by the cartoon avatar generation network). During training, this loss value is minimized.
Optionally, the generation loss values may be determined by either of the following loss functions: an L1 norm loss function or an L2 norm loss function. The L1 and L2 norm loss functions are existing pixel-level loss functions: they measure the difference between two images pixel by pixel, which improves the accuracy with which the generation loss values characterize the difference between the images.
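For reference, in standard notation (not prescribed by the disclosure), with y_i and y'_i denoting corresponding pixels of the two images and n the number of pixels, the two losses are:

```latex
L_1(y, y') = \sum_{i=1}^{n} \lvert y_i - y'_i \rvert , \qquad
L_2(y, y') = \sum_{i=1}^{n} \left( y_i - y'_i \right)^2
```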
Second, determine a first discrimination loss value, corresponding to the cartoon avatar discrimination network, characterizing the difference between the sample cartoon avatar input to the cartoon avatar discrimination network and the cartoon avatar output by the cartoon avatar generation network; and a second discrimination loss value, corresponding to the face image discrimination network, characterizing the difference between the sample face image input to the face image discrimination network and the face image output by the face image generation network. In general, the discrimination loss values may be determined using a loss function for binary classification (e.g., a cross-entropy loss function).
Third, train the initial generative adversarial network based on the determined first generation loss value, second generation loss value, first discrimination loss value, and second discrimination loss value. Specifically, a total loss value may be obtained by weighted summation of the determined loss values using preset weights corresponding to the respective loss values. During training, the parameters of the cartoon avatar generation network, the face image generation network, the cartoon avatar discrimination network, and the face image discrimination network are continuously adjusted so that the total loss value gradually decreases; when the total loss value satisfies a preset condition (for example, it is less than or equal to a preset loss threshold, or it no longer decreases), model training is determined to be complete.
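A sketch of the weighted total loss follows, under the assumptions of L1 generation losses, binary cross-entropy discrimination losses, and hypothetical weight values (the disclosure only requires preset per-loss weights). The discrimination terms use detached generator outputs so that, when the total is minimized, those terms drive only the discriminators, as in the alternating loop sketched above.

```python
import torch
import torch.nn as nn

l1 = nn.L1Loss()    # pixel-level; nn.MSELoss() is the L2-norm alternative
bce = nn.BCELoss()  # binary cross-entropy for the two discriminators

# Hypothetical weights: the disclosure only requires a preset weight per loss value.
W_G1, W_G2, W_D1, W_D2 = 10.0, 10.0, 1.0, 1.0

def combined_loss(sample_face, rec_face, sample_cartoon, fake_cartoon, D1, D2):
    real = torch.ones(sample_face.size(0), 1)
    fake = torch.zeros(sample_face.size(0), 1)
    gen_loss_1 = l1(rec_face, sample_face)           # first generation loss value
    gen_loss_2 = l1(fake_cartoon, sample_cartoon)    # second generation loss value
    disc_loss_1 = (bce(D1(sample_cartoon), real)     # first discrimination loss value
                   + bce(D1(fake_cartoon.detach()), fake))
    disc_loss_2 = (bce(D2(sample_face), real)        # second discrimination loss value
                   + bce(D2(rec_face.detach()), fake))
    return (W_G1 * gen_loss_1 + W_G2 * gen_loss_2
            + W_D1 * disc_loss_1 + W_D2 * disc_loss_2)
```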
In some optional implementations of this embodiment, the training step may be performed as follows:
Using a machine learning method, a sample face image included in a training sample in the training sample set is taken as input of the cartoon avatar generation network, the cartoon avatar output by the cartoon avatar generation network is taken as input of the face image generation network, the cartoon avatar output by the cartoon avatar generation network and the corresponding sample cartoon avatar are taken as inputs of the cartoon avatar discrimination network, and the face image output by the face image generation network and the corresponding sample face image are taken as inputs of the face image discrimination network, so as to train the initial generative adversarial network. Then, the sample cartoon avatar included in the training sample is taken as input of the face image generation network, the face image output by the face image generation network is taken as input of the cartoon avatar generation network, the face image output by the face image generation network and the corresponding sample face image are taken as inputs of the face image discrimination network, and the cartoon avatar output by the cartoon avatar generation network and the corresponding sample cartoon avatar are taken as inputs of the cartoon avatar discrimination network, so as to train the initial generative adversarial network again; the trained cartoon avatar generation network is determined as the cartoon avatar generation model.
Specifically, refer to figs. 3A and 3B. For one training sample, as shown in fig. 3A, the sample face image is taken as the input of G1 and a face image is finally output by G2, thereby training G1, G2, D1, and D2. As shown in fig. 3B, the same training sample's cartoon avatar is taken as the input of G2 and a cartoon avatar is finally output by G1, again training G1, G2, D1, and D2. As can be seen from figs. 3A and 3B, in this implementation one training sample supports two training passes, so the parameters of the cartoon avatar generation network and of the face image generation network can be optimized alternately, and the accuracy of both networks improves in step; the resulting cartoon avatar generation model can therefore generate cartoon avatars with high similarity to the input face image. A sketch of the two passes follows.
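This minimal sketch reuses G1 and G2 from the training-loop sketch above; loss computation and parameter updates for each pass follow the same pattern as train_step, with the roles of the two generators and discriminators swapped in the second pass.

```python
def bidirectional_passes(face, cartoon):
    """One training sample, two passes (figs. 3A and 3B)."""
    # Pass 1 (fig. 3A): the sample face image drives the chain G1 -> G2.
    fake_cartoon = G1(face)       # compared with the sample cartoon avatar by D1
    rec_face = G2(fake_cartoon)   # compared with the sample face image by D2

    # Pass 2 (fig. 3B): the sample cartoon avatar drives the chain G2 -> G1.
    fake_face = G2(cartoon)       # compared with the sample face image by D2
    rec_cartoon = G1(fake_face)   # compared with the sample cartoon avatar by D1

    return fake_cartoon, rec_face, fake_face, rec_cartoon
```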
With continued reference to fig. 4, fig. 4 is a schematic diagram of an application scenario of the method for generating a cartoon avatar generation model according to this embodiment. In the application scenario of fig. 4, the electronic device 401 first acquires a preset training sample set 402 locally. Each training sample in the training sample set 402 includes a sample face image and a sample cartoon avatar corresponding to the sample face image. The electronic device 401 then locally acquires the pre-established initial generative adversarial network 403. The initial generative adversarial network 403 includes a cartoon avatar generation network G1, a face image generation network G2, a cartoon avatar discrimination network D1, and a face image discrimination network D2.
Then the electronic device 401 performs the following steps: using a machine learning method, a sample face image included in a training sample in the training sample set 402 is input to the cartoon avatar generation network G1, the cartoon avatar output by G1 is input to the face image generation network G2, the cartoon avatar output by G1 and the corresponding sample cartoon avatar are input to the cartoon avatar discrimination network D1, and the face image output by G2 and the corresponding sample face image are input to the face image discrimination network D2, thereby training G1, G2, D1, and D2. When D1 cannot accurately distinguish the cartoon avatar output by G1 from the sample cartoon avatar (i.e., its discrimination accuracy is 50%) and D2 cannot accurately distinguish the face image output by G2 from the sample face image (i.e., its discrimination accuracy is 50%), the cartoon avatar generation network G1 at this point is determined as the cartoon avatar generation model 404.
In the method provided by the above embodiment of the present disclosure, a preset training sample set and a pre-established initial generative adversarial network are acquired, where the initial generative adversarial network includes a cartoon avatar generation network, a face image generation network, a cartoon avatar discrimination network, and a face image discrimination network. Using a machine learning method, a sample face image included in a training sample in the training sample set is taken as input of the cartoon avatar generation network, the cartoon avatar output by the cartoon avatar generation network is taken as input of the face image generation network, the cartoon avatar output by the cartoon avatar generation network and the corresponding sample cartoon avatar are taken as inputs of the cartoon avatar discrimination network, and the face image output by the face image generation network and the corresponding sample face image are taken as inputs of the face image discrimination network; the initial generative adversarial network is trained, and the trained cartoon avatar generation network is determined as the cartoon avatar generation model. A bidirectional training mode is thus adopted: the input face image is fed into the generative adversarial network to obtain a cartoon avatar, and the cartoon avatar is then converted back into a face image, so that the similarity between the input face image and the reconstructed face image is high. This helps the resulting cartoon avatar generation model generate cartoon avatars with high similarity to the input face image.
With further reference to fig. 5, a flow 500 of one embodiment of a method for generating a cartoon avatar according to the present disclosure is shown. The flow 500 of the method for generating a cartoon avatar includes the following steps:
Step 501: acquire a target face image.
In this embodiment, the execution body of the method for generating a cartoon avatar (for example, the server or a terminal device shown in fig. 1) may acquire a target face image remotely or locally through a wired or wireless connection. The target face image is the face image from which a cartoon avatar is to be generated. For example, the target face image may be obtained by photographing the face of a target person with a camera included in the execution body, or with a camera of an electronic device communicatively connected to the execution body; the target person may be a user within the shooting range of the camera.
Step 502: input the target face image into a pre-trained cartoon avatar generation model to obtain and output a cartoon avatar.
In this embodiment, the execution body may input the target face image into the pre-trained cartoon avatar generation model to obtain a cartoon avatar and output it. The cartoon avatar generation model is generated according to the method described in the embodiment corresponding to fig. 2.
The execution body may output the generated cartoon avatar in various ways. For example, the generated cartoon avatar may be displayed on a display screen included in the execution body, or transmitted to another electronic device communicatively connected to the execution body.
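As an illustration of this inference step, here is a minimal sketch using the Generator class from the training section; the checkpoint file name, the image file names, the 64x64 resolution, and the [-1, 1] output range (from the Tanh output layer assumed earlier) are all illustrative assumptions.

```python
import torch
from PIL import Image
from torchvision import transforms

model = Generator()                               # same class as in the training sketch
model.load_state_dict(torch.load("g1_state.pt"))  # hypothetical trained-G1 checkpoint
model.eval()

preprocess = transforms.Compose([
    transforms.Resize((64, 64)),                  # must match the training resolution
    transforms.ToTensor(),
    transforms.Normalize([0.5] * 3, [0.5] * 3),   # map pixel values to [-1, 1]
])

face = preprocess(Image.open("target_face.jpg").convert("RGB")).unsqueeze(0)
with torch.no_grad():
    avatar = model(face).squeeze(0)               # Tanh output in [-1, 1]

# Rescale to [0, 1] before saving; the avatar could equally be displayed on
# screen or transmitted to another device, as described above.
transforms.ToPILImage()(((avatar + 1) / 2).clamp(0, 1)).save("cartoon_avatar.png")
```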
In the method provided by the above embodiment of the present disclosure, a target face image is acquired and input into a cartoon avatar generation model trained in advance according to the method described in the embodiment corresponding to fig. 2, so that a cartoon avatar is obtained and output.
With further reference to fig. 6, as an implementation of the method shown in fig. 2, the present disclosure provides an embodiment of an apparatus for generating a cartoon avatar generation model. The apparatus embodiment corresponds to the method embodiment shown in fig. 2, and the apparatus may be applied to various electronic devices.
As shown in fig. 6, the apparatus 600 for generating a cartoon avatar generation model of this embodiment includes: a first acquisition unit 601 configured to acquire a preset training sample set, where each training sample includes a sample face image and a sample cartoon avatar corresponding to the sample face image; a second acquisition unit 602 configured to acquire a pre-established initial generative adversarial network, where the initial generative adversarial network includes a cartoon avatar generation network, a face image generation network, a cartoon avatar discrimination network, and a face image discrimination network; and a training unit 603 configured to perform the following training step: using a machine learning method, take a sample face image included in a training sample in the training sample set as input of the cartoon avatar generation network, take the cartoon avatar output by the cartoon avatar generation network as input of the face image generation network, take the cartoon avatar output by the cartoon avatar generation network and the corresponding sample cartoon avatar as inputs of the cartoon avatar discrimination network, take the face image output by the face image generation network and the corresponding sample face image as inputs of the face image discrimination network, train the initial generative adversarial network, and determine the trained cartoon avatar generation network as the cartoon avatar generation model.
In this embodiment, the first acquisition unit 601 may acquire the preset training sample set remotely or locally through a wired or wireless connection. Each training sample includes a sample face image and a sample cartoon avatar corresponding to the sample face image. The correspondence between sample face images and sample cartoon avatars is established in advance. For example, a technician may manually review a large number of sample face images and sample cartoon avatars, select pairs with a high degree of similarity, and set each selected pair as a training sample.
In this embodiment, the second acquisition unit 602 may acquire the pre-established initial generative adversarial network. The initial generative adversarial network may include a cartoon avatar generation network, a face image generation network, a cartoon avatar discrimination network, and a face image discrimination network. The cartoon avatar generation network is used to generate a cartoon avatar from an input face image; the cartoon avatar discrimination network is used to distinguish a cartoon avatar output by the cartoon avatar generation network from a sample cartoon avatar input to the cartoon avatar discrimination network; the face image generation network is used to generate a face image from an input cartoon avatar; and the face image discrimination network is used to distinguish a face image output by the face image generation network from a sample face image input to the face image discrimination network.
It should be understood that the initial generative adversarial network may be an untrained generative adversarial network with initialized parameters, or a generative adversarial network that has already been trained.
The cartoon avatar generation network and the face image generation network may be convolutional neural networks for image processing (e.g., convolutional neural networks of various structures including convolutional layers, pooling layers, unpooling layers, and deconvolutional layers). The cartoon avatar discrimination network and the face image discrimination network may be convolutional neural networks (e.g., convolutional neural networks of various structures including fully connected layers that implement a classification function); alternatively, a discrimination network may be another model implementing a classification function, such as a support vector machine (SVM). Here, the cartoon avatar discrimination network and the face image discrimination network each output a discrimination result. For example, if the cartoon avatar discrimination network determines that an image input to it is a cartoon avatar output by the cartoon avatar generation network, it may output a label 1 (or 0) corresponding to the image; otherwise it may output a label 0 (or 1). Similarly, if the face image discrimination network determines that an image input to it is a face image output by the face image generation network, it may output a label 1 (or 0) corresponding to the image; otherwise it may output a label 0 (or 1). A discrimination network may also output other preset information and is not limited to the values 1 and 0.
In this embodiment, the training unit 603 may perform the training step described above: using a machine learning method, a sample face image included in a training sample in the training sample set is taken as input of the cartoon avatar generation network, the cartoon avatar output by the cartoon avatar generation network is taken as input of the face image generation network, the cartoon avatar output by the cartoon avatar generation network and the corresponding sample cartoon avatar are taken as inputs of the cartoon avatar discrimination network, and the face image output by the face image generation network and the corresponding sample face image are taken as inputs of the face image discrimination network; the initial generative adversarial network is trained, and the trained cartoon avatar generation network is determined as the cartoon avatar generation model.
Specifically, the training unit 603 may first fix the parameters of either the generation networks (the cartoon avatar generation network and the face image generation network) or the discrimination networks (the cartoon avatar discrimination network and the face image discrimination network) (referred to as the first network) and optimize the networks whose parameters are not fixed (referred to as the second network); it may then fix the parameters of the second network and optimize the first network. This iteration is continued until the cartoon avatar discrimination network cannot determine whether an input image was generated by the cartoon avatar generation network, and the face image discrimination network cannot determine whether an input image was generated by the face image generation network. At that point, the cartoon avatars generated by the cartoon avatar generation network are close to the sample cartoon avatars, and the cartoon avatar discrimination network cannot accurately distinguish them (i.e., its discrimination accuracy is 50%); likewise, the face images generated by the face image generation network are close to the sample face images, and the face image discrimination network cannot accurately distinguish them (i.e., its discrimination accuracy is 50%). The cartoon avatar generation network at this point may be determined as the cartoon avatar generation model. In general, the training unit 603 may train the generation networks and the discrimination networks using existing back-propagation and gradient descent algorithms, adjusting their parameters after each training round and using the adjusted networks as the initial generative adversarial network for the next round. During training, a loss value may be determined using a loss function, and the generation and discrimination networks are trained iteratively so that the loss value determined in each iteration is minimized.
In some optional implementations of this embodiment, the training unit 603 may be further configured to: using a machine learning method, take a sample face image included in a training sample in the training sample set as input of the cartoon avatar generation network, take the cartoon avatar output by the cartoon avatar generation network as input of the face image generation network, take the cartoon avatar output by the cartoon avatar generation network and the corresponding sample cartoon avatar as inputs of the cartoon avatar discrimination network, take the face image output by the face image generation network and the corresponding sample face image as inputs of the face image discrimination network, and train the initial generative adversarial network; and take the sample cartoon avatar included in the training sample as input of the face image generation network, take the face image output by the face image generation network as input of the cartoon avatar generation network, take the face image output by the face image generation network and the corresponding sample face image as inputs of the face image discrimination network, take the cartoon avatar output by the cartoon avatar generation network and the corresponding sample cartoon avatar as inputs of the cartoon avatar discrimination network, train the initial generative adversarial network, and determine the trained cartoon avatar generation network as the cartoon avatar generation model.
In some optional implementations of this embodiment, for a training sample in the training sample set, the similarity between the feature vectors of the sample face image and the sample cartoon avatar included in the training sample is greater than or equal to a preset similarity threshold.
In some optional implementations of this embodiment, the training unit 603 may include: a first determination module (not shown in the figure) configured to determine a first generation loss value characterizing the difference between the sample face image and the face image output by the face image generation network, and a second generation loss value characterizing the difference between the sample cartoon avatar and the cartoon avatar output by the cartoon avatar generation network; a second determination module (not shown in the figure) configured to determine a first discrimination loss value, corresponding to the cartoon avatar discrimination network, characterizing the difference between the sample cartoon avatar input to the cartoon avatar discrimination network and the cartoon avatar output by the cartoon avatar generation network, and a second discrimination loss value, corresponding to the face image discrimination network, characterizing the difference between the sample face image input to the face image discrimination network and the face image output by the face image generation network; and a training module (not shown in the figure) configured to train the initial generative adversarial network based on the determined first generation loss value, second generation loss value, first discrimination loss value, and second discrimination loss value.
In some optional implementations of this embodiment, the generation loss values are determined by either of the following loss functions: an L1 norm loss function or an L2 norm loss function.
The apparatus 600 provided by the above embodiment of the present disclosure acquires a preset training sample set and a pre-established initial generative adversarial network, where the initial generative adversarial network includes a cartoon avatar generation network, a face image generation network, a cartoon avatar discrimination network, and a face image discrimination network. Using a machine learning method, a sample face image included in a training sample in the training sample set is taken as input of the cartoon avatar generation network, the cartoon avatar output by the cartoon avatar generation network is taken as input of the face image generation network, the cartoon avatar output by the cartoon avatar generation network and the corresponding sample cartoon avatar are taken as inputs of the cartoon avatar discrimination network, and the face image output by the face image generation network and the corresponding sample face image are taken as inputs of the face image discrimination network; the initial generative adversarial network is trained, and the trained cartoon avatar generation network is determined as the cartoon avatar generation model. A bidirectional training mode is thus adopted: the input face image is fed into the generative adversarial network to obtain a cartoon avatar, and the cartoon avatar is then converted back into a face image, so that the similarity between the input face image and the reconstructed face image is high. This helps the resulting cartoon avatar generation model generate cartoon avatars with high similarity to the input face image.
With further reference to fig. 7, as an implementation of the method shown in fig. 5, the present disclosure provides an embodiment of an apparatus for generating a cartoon avatar. The apparatus embodiment corresponds to the method embodiment shown in fig. 5, and the apparatus may be applied to various electronic devices.
As shown in fig. 7, the apparatus 700 for generating a cartoon avatar of this embodiment includes: a face image acquisition unit 701 configured to acquire a target face image; and a cartoon avatar generation unit 702 configured to input the target face image into a pre-trained cartoon avatar generation model to obtain and output a cartoon avatar, where the cartoon avatar generation model is generated according to the method described in the embodiment corresponding to fig. 2.
In this embodiment, the face image acquisition unit 701 may acquire the target face image from a remote or local location through a wired or wireless connection. The target face image is the face image from which a cartoon head portrait is to be generated. For example, the target face image may be an image of a target person's face captured by a camera included in the apparatus 700, or by a camera included in an electronic device communicatively connected to the apparatus 700, where the target person may be a user within the shooting range of the camera.
In this embodiment, the cartoon head portrait generation unit 702 may input the target face image into the pre-trained cartoon head portrait generation model, obtain a cartoon head portrait, and output the cartoon head portrait. The cartoon head portrait generation model is generated according to the method described in the embodiment corresponding to fig. 2.
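For illustration only, a minimal sketch of what the inference step of unit 702 could look like in PyTorch, assuming a trained generator, a 64x64 input size, and the file names shown (all hypothetical; the stand-in model below would in practice be the trained cartoon head portrait generation network with its weights loaded from disk):

    import torch
    import torch.nn as nn
    from PIL import Image
    from torchvision import transforms

    # Hypothetical stand-in for the trained generation network.
    model = nn.Sequential(nn.Conv2d(3, 3, 3, padding=1), nn.Tanh())
    model.eval()

    preprocess = transforms.Compose([
        transforms.Resize((64, 64)),  # assumed model input size
        transforms.ToTensor(),        # PIL image -> CHW float tensor in [0, 1]
    ])

    face = preprocess(Image.open("target_face.jpg").convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        cartoon = model(face)         # generated cartoon head portrait tensor
    transforms.ToPILImage()(cartoon.squeeze(0).clamp(0, 1)).save("cartoon.png")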
The above-described cartoon head portrait generation unit 702 may output the generated cartoon head portrait in various ways. For example, the generated cartoon head portrait may be displayed on a display screen included in the apparatus 700, or transmitted to other electronic devices communicatively connected to the apparatus 700.
The apparatus 700 provided in the foregoing embodiment of the present disclosure acquires a target face image and inputs it into a cartoon head portrait generation model trained in advance according to the method described in the embodiment corresponding to fig. 2, thereby obtaining and outputting a cartoon head portrait.
Referring now to fig. 8, a schematic diagram of an electronic device (e.g., the server or terminal device of fig. 1) 800 suitable for implementing embodiments of the present disclosure is shown. The terminal device in the embodiments of the present disclosure may include, but is not limited to, mobile terminals such as a mobile phone, a notebook computer, a digital broadcast receiver, a PDA (personal digital assistant), a PAD (tablet computer), a PMP (portable multimedia player), and a vehicle-mounted terminal (e.g., a car navigation terminal), as well as fixed terminals such as a digital TV and a desktop computer. The electronic device shown in fig. 8 is only an example and should not impose any limitation on the functions and scope of use of the embodiments of the present disclosure.
As shown in fig. 8, the electronic device 800 may include a processing apparatus (e.g., a central processing unit, a graphics processor, etc.) 801 that may perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 802 or a program loaded from a storage apparatus 808 into a Random Access Memory (RAM) 803. The RAM 803 also stores various programs and data necessary for the operation of the electronic device 800. The processing apparatus 801, the ROM 802, and the RAM 803 are connected to each other by a bus 804. An input/output (I/O) interface 805 is also connected to the bus 804.
Generally, the following devices may be connected to the I/O interface 805: input devices 806 including, for example, a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; output devices 807 including, for example, a Liquid Crystal Display (LCD), speakers, vibrators, and the like; storage 808 including, for example, magnetic tape, hard disk, etc.; and a communication device 809. The communication means 809 may allow the electronic device 800 to communicate wirelessly or by wire with other devices to exchange data. While fig. 8 illustrates an electronic device 800 having various means, it is to be understood that not all illustrated means are required to be implemented or provided. More or fewer devices may alternatively be implemented or provided. Each block shown in fig. 8 may represent one device or may represent multiple devices as desired.
In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication means 809, or installed from the storage means 808, or installed from the ROM 802. The computer program, when executed by the processing apparatus 801, performs the above-described functions defined in the methods of the embodiments of the present disclosure.
It should be noted that the computer readable medium described in the embodiments of the present disclosure may be a computer readable signal medium or a computer readable storage medium, or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the embodiments of the present disclosure, a computer readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device. A computer readable signal medium, by contrast, may comprise a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electromagnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, optical cables, RF (radio frequency), etc., or any suitable combination of the foregoing.
The computer readable medium may be embodied in the electronic device, or may exist separately without being assembled into the electronic device. The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: acquire a preset training sample set, wherein each training sample comprises a sample face image and a sample cartoon head portrait corresponding to the sample face image; acquire a pre-established initial generative adversarial network, wherein the initial generative adversarial network comprises a cartoon head portrait generation network, a face image generation network, a cartoon head portrait discrimination network, and a face image discrimination network; and perform the following training step: using a machine learning method, a sample face image included in a training sample in the training sample set is used as the input of the cartoon head portrait generation network, the cartoon head portrait output by the cartoon head portrait generation network is used as the input of the face image generation network, the cartoon head portrait output by the cartoon head portrait generation network and the corresponding sample cartoon head portrait are used as the inputs of the cartoon head portrait discrimination network, and the face image output by the face image generation network and the corresponding sample face image are used as the inputs of the face image discrimination network; the initial generative adversarial network is trained, and the trained cartoon head portrait generation network is determined as the cartoon head portrait generation model.
Further, the one or more programs, when executed by the electronic device, may cause the electronic device to: acquire a target face image; and input the target face image into a pre-trained cartoon head portrait generation model to obtain a cartoon head portrait and output the cartoon head portrait.
Computer program code for carrying out operations of embodiments of the present disclosure may be written in any combination of one or more programming languages, including object oriented programming languages such as Java, Smalltalk, or C++, as well as conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units described in the embodiments of the present disclosure may be implemented by software or hardware. The described units may also be provided in a processor, which may be described as: a processor comprising a first acquisition unit, a second acquisition unit, and a training unit. The names of these units do not, in some cases, constitute a limitation on the units themselves; for example, the first acquisition unit may also be described as "a unit that acquires a preset training sample set".
The foregoing description presents only the preferred embodiments of the present disclosure and illustrates the principles of the technology employed. It will be appreciated by those skilled in the art that the scope of the invention in the embodiments of the present disclosure is not limited to technical solutions formed by the specific combination of the above technical features, and also covers other technical solutions formed by any combination of the above technical features or their equivalents without departing from the above inventive concept, for example, a technical solution formed by replacing the above features with (but not limited to) technical features having similar functions disclosed in the embodiments of the present disclosure.