CN117076702B - Image searching method and electronic equipment


Info

Publication number
CN117076702B
Authority
CN
China
Prior art keywords
expression image
feature vector
image
expression
information
Legal status
Active
Application number
CN202311184754.2A
Other languages
Chinese (zh)
Other versions
CN117076702A (en)
Inventor
王龙
Current Assignee
Honor Device Co Ltd
Original Assignee
Honor Device Co Ltd
Application filed by Honor Device Co Ltd
Priority to CN202311184754.2A
Publication of CN117076702A
Application granted
Publication of CN117076702B


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50: Information retrieval of still image data
    • G06F16/53: Querying
    • G06F16/532: Query formulation, e.g. graphical querying
    • G06F16/58: Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/583: Retrieval characterised by using metadata automatically derived from the content
    • G06F16/90: Details of database functions independent of the retrieved data types
    • G06F16/95: Retrieval from the web
    • G06F16/953: Querying, e.g. by the use of web search engines
    • G06F16/9536: Search customisation based on social or collaborative filtering

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Library & Information Science (AREA)
  • Mathematical Physics (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

The application discloses an image searching method and electronic equipment. The method includes: acquiring an expression image database, wherein the expression image database includes a plurality of expression images and a first content feature vector and a first intention feature vector corresponding to each of the plurality of expression images; the first content feature vector is a feature vector describing the content of the expression image, and the first intention feature vector is a feature vector describing the expression intention of the expression image; determining a first text feature vector of information to be processed; and searching the image database for a target expression image matching the information to be processed, wherein the first content feature vector and the first intention feature vector corresponding to the target expression image match the first text feature vector. The method can quickly and accurately find the expression image that the user wants to send, which helps improve the efficiency of online chat sessions.

Description

Image searching method and electronic equipment
Technical Field
The embodiment of the application relates to the field of computers, in particular to an image searching method and electronic equipment.
Background
With the development of various social applications and new media technologies, expression packages are used more and more widely. An expression package, which may also be called an expression image, is a way of expressing emotion with an image. Expression packages have become an indispensable communication medium in today's social networks, and people often prefer to use an expression package instead of words to convey their intention and emotion.
However, as the number of expression packages a user stores grows, finding a particular expression package becomes more difficult and less efficient. During online chat, when the user cannot quickly and accurately find the expression package he or she wants to send, the efficiency of the chat session drops.
Disclosure of Invention
The application provides an image searching method and electronic equipment that can quickly and accurately find the expression image a user wants to send, which helps improve the efficiency of online chat sessions.
In a first aspect, the present application provides an image searching method, including: acquiring an expression image database, wherein the expression image database includes a plurality of expression images and a first content feature vector and a first intention feature vector corresponding to each of the plurality of expression images; the first content feature vector is a feature vector describing the content of the expression image, and the first intention feature vector is a feature vector describing the expression intention of the expression image; determining a first text feature vector of information to be processed; and searching the image database for a target expression image matching the information to be processed, wherein the first content feature vector and the first intention feature vector corresponding to the target expression image match the first text feature vector.
With the method of the first aspect, the target expression image matching the information to be processed is found by searching the image database along two dimensions, the expression intention and the semantic content of the expression images, so the expression image the user wants to send can be found more quickly and accurately, which improves the efficiency of online chat sessions.
In one possible implementation, obtaining the expression image database includes: acquiring a plurality of expression images locally collected by a user; invoking an image encoder to encode each of the plurality of expression images to obtain a first image feature vector corresponding to each expression image; invoking an alignment network to process the first image feature vector and a prompt word to obtain the first content feature vector and the first intention feature vector corresponding to each expression image, the prompt word being used to prompt the alignment network to describe the content and the expression intention of the expression image; and constructing the expression image database from the plurality of expression images and the first content feature vector and the first intention feature vector corresponding to each expression image. In this way, the expression image database is built along the two dimensions of expression intention and semantic content, which improves the efficiency of searching for expression images.
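For illustration only, the following Python sketch shows one way such a database could be assembled. The function name, the callables passed in for the image encoder and the two alignment networks, and the prompt wording are hypothetical stand-ins under the description above, not an implementation disclosed by the application.

```python
import numpy as np

def build_expression_image_database(images, image_encoder,
                                    content_align_net, intention_align_net):
    # Hypothetical prompt wording (the "first" and "second" prompt words).
    content_prompt = "Describe the content of the expression image in detail"
    intention_prompt = "Describe the expression intention of the expression image in detail"

    database = []
    for img in images:
        img_vec = image_encoder(img)                                  # first image feature vector
        content_vec = content_align_net(img_vec, content_prompt)      # first content feature vector
        intent_vec = intention_align_net(img_vec, intention_prompt)   # first intention feature vector
        database.append({
            "image": img,
            "content_vector": np.asarray(content_vec, dtype=np.float32),
            "intention_vector": np.asarray(intent_vec, dtype=np.float32),
        })
    return database
```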
In one possible implementation, the alignment network includes a first alignment network and a second alignment network, and the prompt word includes a first prompt word and a second prompt word. Invoking the alignment network to process the first image feature vector and the prompt word to obtain the first content feature vector and the first intention feature vector corresponding to each expression image includes: invoking the first alignment network to process the first image feature vector and the first prompt word to obtain the first content feature vector corresponding to each expression image, the first prompt word being used to prompt the first alignment network to describe the content of the expression image; and invoking the second alignment network to process the first image feature vector and the second prompt word to obtain the first intention feature vector corresponding to each expression image, the second prompt word being used to prompt the second alignment network to describe the expression intention of the expression image. In this way, the accuracy of the first content feature vector and the first intention feature vector corresponding to each expression image is improved.
In one possible implementation, searching the image database for the target expression image matching the information to be processed includes: determining a first vector distance between the first content feature vector corresponding to each expression image in the image database and the first text feature vector, and a second vector distance between the first intention feature vector corresponding to each expression image and the first text feature vector; and determining, based on the first vector distance and the second vector distance, the target expression image matching the information to be processed. In this way, the target expression image matching the information to be processed is searched for along the two dimensions of expression intention and semantic content, which improves the efficiency of searching for expression images.
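The application does not fix a particular distance metric; assuming cosine distance, the two per-image distances could be computed as in the sketch below, over database entries of the form produced by the earlier sketch.

```python
import numpy as np

def cosine_distance(u, v):
    # Smaller value means the two vectors point in more similar directions.
    u = np.asarray(u, dtype=np.float32)
    v = np.asarray(v, dtype=np.float32)
    return 1.0 - float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-8))

def per_image_distances(database, text_vector):
    # For each expression image: (first vector distance, second vector distance).
    return [(cosine_distance(entry["content_vector"], text_vector),
             cosine_distance(entry["intention_vector"], text_vector))
            for entry in database]
```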
In one possible implementation, determining the target expression image matching the information to be processed based on the first vector distance and the second vector distance includes: calculating, from the first vector distance and the second vector distance, a third vector distance corresponding to each expression image; sorting the third vector distances in ascending order; and determining the expression images corresponding to the first N third vector distances as the target expression images matching the information to be processed, where N is a positive integer. In this way, the target expression image matching the information to be processed is found by combining expression intention and semantic content, which improves the efficiency of searching for expression images.
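The application does not spell out how the two distances are combined into the third vector distance; continuing the sketch above, the version below simply sums them, which is one plausible choice, and then keeps the N smallest values.

```python
def top_n_matches(database, text_vector, n=3):
    # Third vector distance assumed here to be the sum of the first and second distances.
    scored = []
    for entry, (d_content, d_intent) in zip(database, per_image_distances(database, text_vector)):
        scored.append((d_content + d_intent, entry["image"]))
    scored.sort(key=lambda item: item[0])        # ascending: smallest distance first
    return [image for _, image in scored[:n]]    # the first N images are the target expression images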
In one possible implementation, searching the image database for the target expression image matching the information to be processed includes: invoking a first alignment network to process the first text feature vector to obtain a fourth content feature vector; invoking a second alignment network to process the first text feature vector to obtain a fifth intention feature vector; determining a third vector distance between the first content feature vector corresponding to each expression image in the image database and the fourth content feature vector, and a fourth vector distance between the first intention feature vector corresponding to each expression image and the fifth intention feature vector; and determining, based on the third vector distance and the fourth vector distance, the target expression image matching the information to be processed. In this way, the target expression image matching the information to be processed is searched for along the two dimensions of expression intention and semantic content, which improves both the efficiency and the accuracy of searching for expression images.
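A sketch of this variant, continuing the assumptions above; the single-argument calls to the two alignment networks are hypothetical stand-ins for "process the first text feature vector", and the way the two distances are combined is again assumed.

```python
def top_n_matches_via_alignment(database, text_vector,
                                content_align_net, intention_align_net, n=3):
    # Project the query text feature vector into both embedding spaces first.
    text_content_vec = content_align_net(text_vector)     # fourth content feature vector
    text_intent_vec = intention_align_net(text_vector)    # fifth intention feature vector

    scored = []
    for entry in database:
        d3 = cosine_distance(entry["content_vector"], text_content_vec)    # third vector distance
        d4 = cosine_distance(entry["intention_vector"], text_intent_vec)   # fourth vector distance
        scored.append((d3 + d4, entry["image"]))                           # combination rule assumed
    scored.sort(key=lambda item: item[0])
    return [image for _, image in scored[:n]]
```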
In one possible implementation, the method further includes: displaying the target expression image matching the information to be processed in a first interface, where the first interface is any one of the following: an image search interface, an input method interface, or an expression image recommendation interface. In this way, the image search result is presented more visually and the target expression image can be displayed more flexibly.
In one possible implementation, the method further includes: displaying an expression image adding interface that includes an expression image frame, a custom label frame, and a save option; when a trigger operation of the user on the save option is detected, saving a first expression image added in the expression image frame and a first label of the first expression image added in the custom label frame; and if the information to be processed is the same as the first label, determining the first expression image as the target expression image. In this way, both the efficiency of finding the target expression image and its accuracy are improved, which helps improve the efficiency of online chat sessions.
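As an illustration of how a custom label could short-circuit the vector search, the sketch below checks for an exact label match before falling back to the ranking above; the label_index mapping and the text_encoder callable are hypothetical names introduced only for this example.

```python
def find_target_images(pending_text, label_index, database, text_encoder, n=3):
    # If the information to be processed equals a user-defined label,
    # return the labelled expression image directly.
    if pending_text in label_index:
        return [label_index[pending_text]]
    # Otherwise fall back to the two-dimensional vector search.
    text_vector = text_encoder(pending_text)      # first text feature vector
    return top_n_matches(database, text_vector, n=n)
```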
In a second aspect, the present application provides an image searching apparatus, which may be an electronic device, an apparatus in an electronic device, or an apparatus capable of being used together with an electronic device; the image searching apparatus may also be a chip system. The image searching apparatus can perform the method performed by the electronic device in the first aspect. The functions of the image searching apparatus may be implemented by hardware, or by hardware executing corresponding software; the hardware or software includes one or more units corresponding to the functions described above, and a unit may be software and/or hardware. For the operations performed by the image searching apparatus and their beneficial effects, reference may be made to the method and beneficial effects described in the first aspect; details are not repeated here.
In a third aspect, the present application provides an electronic device including one or more processors and one or more memories. The one or more memories are coupled to the one or more processors and are configured to store computer program code including computer instructions which, when executed by the one or more processors, cause the electronic device to perform the image searching method in any one of the possible implementations of the first aspect.
In a fourth aspect, the present application provides an image searching apparatus including functions or units for performing the method in any one of the possible implementations of the first aspect.
In a fifth aspect, the present application provides a computer-readable storage medium storing a computer program; the computer program includes program instructions which, when run on an electronic device, cause the electronic device to perform the image searching method in any one of the possible implementations of the first aspect.
In a sixth aspect, the present application provides a computer program product which, when run on a computer, causes the computer to perform the image searching method in any one of the possible implementations of the first aspect.
Drawings
FIG. 1A is a schematic diagram of an expression image according to an embodiment of the present application;
FIG. 1B is a schematic diagram of an online chat session according to an embodiment of the present application;
FIG. 2 is a schematic hardware structure diagram of an electronic device according to an embodiment of the present application;
FIG. 3 is a block diagram of a software architecture of an electronic device according to an embodiment of the present application;
FIG. 4 is a schematic flow chart of an image searching method according to an embodiment of the present application;
FIG. 5 is a schematic flow chart of an electronic device acquiring an expression image database according to an embodiment of the present application;
FIG. 6 is a schematic flow chart of training a first alignment network according to an embodiment of the present application;
FIG. 7A is a schematic diagram of a first sample expression image and corresponding first text information according to an embodiment of the present application;
FIG. 7B is a schematic diagram of a training process of a first alignment network according to an embodiment of the present application;
FIG. 8 is a schematic flow chart of training a second alignment network according to an embodiment of the present application;
FIG. 9A is a schematic diagram of first sample dialogue information according to an embodiment of the present application;
FIG. 9B is a schematic diagram of second sample dialogue information according to an embodiment of the present application;
FIG. 9C is a schematic diagram of a training process of a second alignment network according to an embodiment of the present application;
FIG. 10A is a schematic diagram of information to be processed in an expression image search interface according to an embodiment of the present application;
FIG. 10B is a schematic diagram of information to be processed in a chat frame according to an embodiment of the present application;
FIG. 10C is a schematic diagram of information to be processed in an input method interface according to an embodiment of the present application;
FIG. 11A is a schematic diagram of displaying a target expression image on an expression image search interface according to an embodiment of the present application;
FIG. 11B is a schematic diagram of displaying a target expression image on an expression image recommendation interface according to an embodiment of the present application;
FIG. 11C is a schematic diagram of displaying a target expression image on an input method interface according to an embodiment of the present application;
FIG. 12 is a schematic diagram of an expression image adding interface according to an embodiment of the present application;
FIG. 13 is a schematic structural diagram of an image searching apparatus according to an embodiment of the present application;
FIG. 14 is a schematic structural diagram of a chip according to an embodiment of the present application.
Detailed Description
The technical solutions of the embodiments of the present application are described clearly and thoroughly below with reference to the accompanying drawings. In the description of the embodiments of the present application, unless otherwise indicated, "/" means "or"; for example, A/B may represent A or B. The term "and/or" merely describes an association relationship between associated objects and indicates that three relationships may exist; for example, "A and/or B" may indicate the three cases where A exists alone, A and B exist together, and B exists alone. In addition, in the description of the embodiments of the present application, "plural" means two or more.
The terms "first", "second", and the like are used below for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined with "first" or "second" may explicitly or implicitly include one or more such features. In the description of the embodiments of the application, unless otherwise indicated, "a plurality of" means two or more.
The term "user interface (UI)" in the following embodiments of the present application refers to a media interface for interaction and information exchange between an application program or an operating system and a user; it converts between an internal form of information and a form acceptable to the user. A user interface is defined by source code written in a specific computer language such as Java or extensible markup language (extensible markup language, XML); the interface source code is parsed and rendered on the terminal device and is finally presented as content that the user can recognize. A commonly used presentation form of the user interface is a graphical user interface (graphical user interface, GUI), which refers to a user interface that is related to computer operations and displayed in a graphical manner. It may include visual interface elements such as text, icons, buttons, menus, tabs, text boxes, dialog boxes, status bars, navigation bars, and widgets displayed on the display screen of the terminal device.
In order to facilitate understanding of the solutions provided by the embodiments of the present application, related concepts involved in the embodiments of the present application are described below:
1. Artificial intelligence and machine learning
In an embodiment of the application, artificial intelligence (Artificial Intelligence, AI) technology is involved. AI is a theory, method, technique and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use the knowledge to obtain the best results. AI technology covers a wide field and involves both hardware-level and software-level technologies. At the hardware level, AI technology generally includes technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, and mechatronics. At the software level, AI technology mainly includes computer vision, speech processing, natural language processing, and machine learning/deep learning.
Machine learning (Machine Learning, ML) is a multi-disciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory, and other disciplines. It studies how a computer can simulate or implement human learning behaviour to acquire new knowledge or skills, and reorganize existing knowledge structures to continuously improve its own performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent; it is applied throughout the various areas of artificial intelligence. Machine learning/deep learning typically includes techniques such as artificial neural networks, supervised learning, and unsupervised learning.
Computer vision (Computer Vision, CV) is the science of how to make a machine "see"; more specifically, it replaces human eyes with cameras and computers to perform machine vision tasks such as recognition, tracking, and measurement of a target, and further performs graphic processing so that the result is more suitable for human observation or for transmission to an instrument for detection. As a scientific discipline, computer vision studies the theory and technology needed to build artificial intelligence systems that can acquire information from images or multidimensional data. Computer vision techniques typically include image processing, image recognition, image semantic understanding, image retrieval, optical character recognition (Optical Character Recognition, OCR), video processing, video semantic understanding, video content/behaviour recognition, three-dimensional object reconstruction, 3D techniques, virtual reality, augmented reality, and simultaneous localization and map construction, as well as common biometric recognition techniques such as face recognition and fingerprint recognition.
Natural language processing (Natural Language Processing, NLP) is an important direction in the fields of computer science and artificial intelligence. It studies theories and methods that enable effective communication between people and computers in natural language. Natural language processing is a science that integrates linguistics, computer science, and mathematics; research in this field involves natural language, i.e. the language people use daily, so it is closely related to the study of linguistics. Natural language processing techniques typically include text processing, semantic understanding, machine translation, question answering, and knowledge graph techniques.
With the research and progress of artificial intelligence technology, AI has been developed and applied in many fields. When implementing the image searching method, the embodiments of the present application may specifically involve the machine learning, computer vision, and natural language processing technologies within AI.
2. Alignment network and large language model (Large Language Model, LLM)
In the embodiment of the application, the image searching method uses an alignment network and a large language model from machine learning technology. An alignment network is a network for aligning multiple kinds of features and can be applied in the multi-modal field, for example as a text feature alignment network, an image-text feature alignment network, or an image feature alignment network. Multi-modal refers to information of multiple modalities, including text, images, video, audio, and so on. The image searching method mainly uses an image-text feature alignment network, that is, a pre-trained neural network model for matching image features and text features, such as the CLIP (Contrastive Language-Image Pre-Training) model and the BLIP (Bootstrapping Language-Image Pre-Training) model. The CLIP model, published in 2021, is pre-trained on unlabeled image and text data in a self-supervised manner, so that the model can understand the semantic links between images and texts and can be used for text-to-image retrieval. The BLIP model is a multi-modal transformer model that proposes a multimodal mixture of encoder-decoder (Multimodal mixture of Encoder-Decoder, MED) architecture; the MED is very flexible and can serve as a unimodal encoder, an image-grounded text encoder, or an image-grounded text decoder. The BLIP model is trained jointly with three vision-language objectives: image-text contrastive learning, image-text matching, and image-conditioned language modeling.
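For background only, the following minimal sketch shows CLIP-style image-text matching using the publicly available openai/clip-vit-base-patch32 checkpoint from the Hugging Face transformers library. It illustrates the pre-trained image-text alignment idea in general and is not the specific alignment network of this application; the image path and candidate texts are placeholders.

```python
# Requires: pip install torch transformers pillow
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("expression.png")                 # any locally saved expression image
texts = ["I am very confused", "I like it"]          # candidate descriptions / intentions

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)
probs = outputs.logits_per_image.softmax(dim=-1)     # similarity of the image to each text
print(probs)
```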
A large language model is a deep learning model trained on large amounts of text data that can generate natural language text or understand the meaning of language text. Large language models can handle a variety of natural language tasks, such as text classification, question answering, and dialogue, and are an important path toward artificial intelligence. Common large language models include the LLaMA2 (Large Language Model Meta AI 2) model and the ChatGPT (Chat Generative Pre-trained Transformer) model. The LLaMA2 model is a neural network model for processing sequence problems; it is a large language model published in 2023, with variants whose parameter counts range from 7 billion to 65 billion. The ChatGPT model is a natural language processing tool driven by artificial intelligence technology; it can generate answers based on the patterns and statistical rules seen in the pre-training phase, and can also interact according to the chat context, much like a human.
3. Expression image
An expression image, which may also be called an expression package, is a way of expressing emotion with an image. In current social networks, expression images have become an indispensable communication medium, and people often prefer to use an expression image instead of words to convey their intention and emotion. As shown in (a) of fig. 1A, an expression image may express "I am confused"; as shown in (b) of fig. 1A, an expression image may express "I like it". During online chat, a user can express an emotion or intention by sending an expression image. As shown in fig. 1B, in a chat session, user A sends "No dinner together today"; user B sends expression image a and replies "Didn't you say yesterday that you would be free today?"; user A then sends "The company scheduled a new task for me today, so I am a bit busy". Expression image a expresses user B's confusion, that is, the expression intention corresponding to expression image a is "I am confused".
However, as the number of expression images people store grows, finding a particular expression image becomes more difficult and less efficient. During online chat, when the user cannot quickly and accurately find the expression image he or she wants to send, the efficiency of the chat session drops.
In order to quickly and accurately find the expression image a user wants to send, improve the efficiency of online chat sessions, and improve the user experience, the application provides an image searching method and electronic equipment. In a specific implementation, the image searching method may be performed by the electronic device 100. The electronic device 100 may be, but is not limited to, a mobile phone, a tablet computer, a notebook computer, or a wearable electronic device with a wireless communication function (such as a smart watch). The electronic device 100 is configured with a display screen and may have a preset application (APP) installed, such as a social chat APP. Through the social chat APP, the user can chat online with other users and, during the online chat, send expression images, text, pictures, and the like to them; the chat APP is not limited here. The user can also save expression images locally and, when sending an expression image to other users, search for and find the expression image he or she needs.
The hardware configuration of the electronic device 100 is described below. Referring to fig. 2, fig. 2 is a schematic hardware structure of an electronic device 100 according to an embodiment of the application.
The electronic device 100 may include a processor 110, an external memory interface 120, an internal memory 121, a universal serial bus (universal serial bus, USB) interface 130, a charge management module 140, a power management module 141, a battery 142, an antenna 1, an antenna 2, a mobile communication module 150, a wireless communication module 160, an audio module 170, a speaker 170A, a receiver 170B, a microphone 170C, an earphone interface 170D, a sensor module 180, keys 190, a motor 191, an indicator 192, a camera 193, a display 194, and a subscriber identity module (subscriber identification module, SIM) card interface 195, etc. The sensor module 180 may include a pressure sensor 180A, a gyro sensor 180B, an air pressure sensor 180C, a magnetic sensor 180D, an acceleration sensor 180E, a distance sensor 180F, a proximity sensor 180G, a fingerprint sensor 180H, a temperature sensor 180J, a touch sensor 180K, an ambient light sensor 180L, a bone conduction sensor 180M, and the like.
It should be understood that the illustrated structure of the embodiment of the present application does not constitute a specific limitation on the electronic device 100. In other embodiments of the application, electronic device 100 may include more or fewer components than shown, or certain components may be combined, or certain components may be split, or different arrangements of components. The illustrated components may be implemented in hardware, software, or a combination of software and hardware.
The processor 110 may include one or more processing units, such as: the processor 110 may include an application processor (application processor, AP), a modem processor, a graphics processor (graphics processing unit, GPU), an image signal processor (image signal processor, ISP), a controller, a memory, a video codec, a digital signal processor (digital signal processor, DSP), a baseband processor, and/or a neural network processor (neural-network processing unit, NPU), etc. Wherein the different processing units may be separate devices or may be integrated in one or more processors.
The controller may be a neural hub and a command center of the electronic device 100, among others. The controller can generate operation control signals according to the instruction operation codes and the time sequence signals to finish the control of instruction fetching and instruction execution.
A memory may also be provided in the processor 110 for storing instructions and data. In some embodiments, the memory in the processor 110 is a cache memory. The memory may hold instructions or data that the processor 110 has just used or recycled. If the processor 110 needs to reuse the instruction or data, it can be called directly from the memory. Repeated accesses are avoided and the latency of the processor 110 is reduced, thereby improving the efficiency of the system. The processor 110 invokes instructions or data stored in the memory to cause the electronic device 100 to perform the image lookup method performed by the electronic device in the method embodiment described below.
In some embodiments, the processor 110 may include one or more interfaces. The interfaces may include an integrated circuit (inter-integrated circuit, I2C) interface, an integrated circuit built-in audio (inter-integrated circuit sound, I2S) interface, a pulse code modulation (pulse code modulation, PCM) interface, a universal asynchronous receiver transmitter (universal asynchronous receiver/transmitter, UART) interface, a mobile industry processor interface (mobile industry processor interface, MIPI), a general-purpose input/output (GPIO) interface, a subscriber identity module (subscriber identity module, SIM) interface, and/or a universal serial bus (universal serial bus, USB) interface, among others.
The charge management module 140 is configured to receive a charge input from a charger. The charger can be a wireless charger or a wired charger.
The power management module 141 is used for connecting the battery 142, the charge management module 140 and the processor 110. The power management module 141 receives input from the battery 142 and/or the charge management module 140 to power the processor 110, the internal memory 121, the external memory, the display 194, the camera 193, the wireless communication module 160, and the like. In other embodiments, the power management module 141 may be disposed in the processor 110.
The wireless communication function of the electronic device 100 may be implemented by the antenna 1, the antenna 2, the mobile communication module 150, the wireless communication module 160, a modem processor, a baseband processor, and the like.
The antennas 1 and 2 are used for transmitting and receiving electromagnetic wave signals. Each antenna in the electronic device 100 may be used to cover a single or multiple communication bands. Different antennas may also be multiplexed to improve the utilization of the antennas. For example: the antenna 1 may be multiplexed into a diversity antenna of a wireless local area network. In other embodiments, the antenna may be used in conjunction with a tuning switch.
The mobile communication module 150 may provide a solution for wireless communication including 2G/3G/4G/5G, etc., applied to the electronic device 100. The mobile communication module 150 may include at least one filter, switch, power amplifier, low noise amplifier (low noise amplifier, LNA), etc. The mobile communication module 150 may receive electromagnetic waves from the antenna 1, perform processes such as filtering, amplifying, and the like on the received electromagnetic waves, and transmit the processed electromagnetic waves to the modem processor for demodulation. The mobile communication module 150 can amplify the signal modulated by the modem processor, and convert the signal into electromagnetic waves through the antenna 1 to radiate. In some embodiments, at least some of the functional modules of the mobile communication module 150 may be disposed in the processor 110. In some embodiments, at least some of the functional modules of the mobile communication module 150 may be provided in the same device as at least some of the modules of the processor 110.
The modem processor may include a modulator and a demodulator. The modulator is used for modulating the low-frequency baseband signal to be transmitted into a medium-high frequency signal. The demodulator is used for demodulating the received electromagnetic wave signal into a low-frequency baseband signal. The demodulator then transmits the demodulated low frequency baseband signal to the baseband processor for processing. The low frequency baseband signal is processed by the baseband processor and then transferred to the application processor.
The wireless communication module 160 may provide solutions for wireless communication including wireless local area network (wireless local area networks, WLAN) (e.g., wi-Fi network), bluetooth (BT), BLE broadcast, global navigation satellite system (global navigation satellite system, GNSS), frequency modulation (frequency modulation, FM), near field wireless communication technology (near field communication, NFC), infrared technology (IR), etc., applied on the electronic device 100. The wireless communication module 160 may be one or more devices that integrate at least one communication processing module. The wireless communication module 160 receives electromagnetic waves via the antenna 2, modulates the electromagnetic wave signals, filters the electromagnetic wave signals, and transmits the processed signals to the processor 110. The wireless communication module 160 may also receive a signal to be transmitted from the processor 110, frequency modulate it, amplify it, and convert it to electromagnetic waves for radiation via the antenna 2.
In some embodiments, antenna 1 and mobile communication module 150 of electronic device 100 are coupled, and antenna 2 and wireless communication module 160 are coupled, such that electronic device 100 may communicate with a network and other devices through wireless communication techniques.
The electronic device 100 implements display functions through a GPU, a display screen 194, an application processor, and the like. The GPU is a microprocessor for image processing, and is connected to the display 194 and the application processor. The GPU is used to perform mathematical and geometric calculations for graphics rendering. Processor 110 may include one or more GPUs that execute program instructions to generate or change display information.
The display screen 194 is used to display images, videos, and the like. The display 194 includes a display panel. In some embodiments, the electronic device 100 may include 1 or N display screens 194, N being a positive integer greater than 1.
The electronic device 100 may implement photographing functions through an ISP, a camera 193, a video codec, a GPU, a display screen 194, an application processor, and the like. The ISP is used to process data fed back by the camera 193. The camera 193 is used to capture still images or video. The digital signal processor is used for processing digital signals, and can process other digital signals besides digital image signals. Video codecs are used to compress or decompress digital video. The electronic device 100 may support one or more video codecs.
The NPU is a neural-network (NN) computing processor, and can rapidly process input information by referencing a biological neural network structure, for example, referencing a transmission mode between human brain neurons, and can also continuously perform self-learning.
The external memory interface 120 may be used to connect an external memory card, such as a Micro SD card, to enable expansion of the memory capabilities of the electronic device 100. The external memory card communicates with the processor 110 through an external memory interface 120 to implement data storage functions.
The internal memory 121 may be used to store computer executable program code including instructions. The processor 110 executes various functional applications of the electronic device 100 and data processing by executing instructions stored in the internal memory 121. The internal memory 121 may include a storage program area and a storage data area. The storage program area may store an application program (such as a sound playing function) required for at least one function of the operating system, and the like. The storage data area may store data created during use of the electronic device 100 (e.g., audio data), and so on. In addition, the internal memory 121 may include a high-speed random access memory, and may also include a nonvolatile memory such as a flash memory device or the like.
The electronic device 100 may implement audio functions through an audio module 170, a speaker 170A, a receiver 170B, a microphone 170C, an earphone interface 170D, an application processor, and the like. Such as music playing, recording, etc.
The audio module 170 is used to convert digital audio information into an analog audio signal output and also to convert an analog audio input into a digital audio signal. The audio module 170 may also be used to encode and decode audio signals. In some embodiments, the audio module 170 may be disposed in the processor 110, or a portion of the functional modules of the audio module 170 may be disposed in the processor 110.
The speaker 170A, also referred to as a "horn," is used to convert audio electrical signals into sound signals. A receiver 170B, also referred to as a "earpiece", is used to convert the audio electrical signal into a sound signal. Microphone 170C, also referred to as a "microphone" or "microphone", is used to convert sound signals into electrical signals. The earphone interface 170D is used to connect a wired earphone. The pressure sensor 180A is used to sense a pressure signal, and may convert the pressure signal into an electrical signal. In some embodiments, the pressure sensor 180A may be disposed on the display screen 194. The gyro sensor 180B may be used to determine a motion gesture of the electronic device 100. The air pressure sensor 180C is used to measure air pressure. The magnetic sensor 180D includes a hall sensor. The acceleration sensor 180E may detect the magnitude of acceleration of the electronic device 100 in various directions (typically three axes). A distance sensor 180F for measuring a distance. The proximity light sensor 180G may include, for example, a Light Emitting Diode (LED) and a light detector. The ambient light sensor 180L is used to sense ambient light level. The fingerprint sensor 180H is used to collect a fingerprint. The temperature sensor 180J is for detecting temperature. The touch sensor 180K, also referred to as a "touch panel". The touch sensor 180K may be disposed on the display screen 194, and the touch sensor 180K and the display screen 194 form a touch screen, which is also called a "touch screen". The touch sensor 180K is for detecting a touch operation acting thereon or thereabout. The bone conduction sensor 180M may acquire a vibration signal. The keys 190 include a power-on key, a volume key, etc. The motor 191 may generate a vibration cue. The indicator 192 may be an indicator light, may be used to indicate a state of charge, a change in charge, a message indicating a missed call, a notification, etc. The SIM card interface 195 is used to connect a SIM card.
The software system of the electronic device 100 may employ a layered architecture, an event-driven architecture, a microkernel architecture, a microservice architecture, or a cloud architecture. In the embodiment of the application, an Android system with a layered architecture is taken as an example to illustrate the software structure of the electronic device 100. Fig. 3 is a software structure block diagram of the electronic device 100 according to the embodiment of the present application. The layered architecture divides the software into several layers, each with a clear role and division of labour; the layers communicate with each other through software interfaces. In some embodiments, the Android system is divided into four layers, from top to bottom: an application layer, an application framework layer, the Android runtime and system libraries, and a kernel layer.
The application layer may include a series of application packages. As shown in fig. 3, the application package may include chat APP, camera, gallery, calendar, talk, WLAN, bluetooth, music, video, short message, etc. applications.
The application framework layer provides an application programming interface (application programming interface, API) and programming framework for application programs of the application layer. The application framework layer includes a number of predefined functions. As shown in FIG. 3, the application framework layer may include a window manager, a content provider, a view system, a telephony manager, a resource manager, a notification manager, and the like.
The window manager is used for managing window programs. The window manager can acquire the size of the display screen, judge whether a status bar exists, lock the screen, intercept the screen and the like.
The content provider is used to store and retrieve data and make such data accessible to applications. The data may include video, images, audio, calls made and received, browsing history and bookmarks, phonebooks, etc.
The view system includes visual controls, such as controls to display text, controls to display pictures, and the like. The view system may be used to build applications. The display interface may be composed of one or more views. For example, a display interface including a text message notification icon may include a view displaying text and a view displaying a picture.
The telephony manager is used to provide the communication functions of the electronic device 100. Such as the management of call status (including on, hung-up, etc.).
The resource manager provides various resources for the application program, such as localization strings, icons, pictures, layout files, video files, and the like.
The notification manager allows the application to display notification information in a status bar, can be used to communicate notification type messages, can automatically disappear after a short dwell, and does not require user interaction. Such as notification manager is used to inform that the download is complete, message alerts, etc. The notification manager may also be a notification in the form of a chart or scroll bar text that appears on the system top status bar, such as a notification of a background running application, or a notification that appears on the screen in the form of a dialog window. For example, a text message is prompted in a status bar, a prompt tone is emitted, the electronic device vibrates, and an indicator light blinks, etc.
The Android runtime includes core libraries and a virtual machine. The Android runtime is responsible for the scheduling and management of the Android system.
The core library consists of two parts: one part is the functions that the Java language needs to call, and the other part is the core library of Android.
The application layer and the application framework layer run in a virtual machine. The virtual machine executes java files of the application program layer and the application program framework layer as binary files. The virtual machine is used for executing the functions of object life cycle management, stack management, thread management, security and exception management, garbage collection and the like.
The system library may include a plurality of functional modules. For example: surface manager (surface manager), media library (media library), three-dimensional graphics processing library (e.g., openGL ES), 2D graphics engine (e.g., SGL), etc.
The surface manager is used to manage the display subsystem and provides a fusion of 2D and 3D layers for multiple applications.
Media libraries support a variety of commonly used audio, video format playback and recording, still image files, and the like. The media library may support a variety of audio and video encoding formats.
The three-dimensional graphic processing library is used for realizing three-dimensional graphic drawing, image rendering, synthesis, layer processing and the like.
The 2D graphics engine is a drawing engine for 2D drawing.
The kernel layer is a layer between hardware and software. The inner core layer at least comprises a display driver, a camera driver, an audio driver and a sensor driver.
The image searching method provided by the application is further described in detail below. Referring to fig. 4, fig. 4 is a schematic flow chart of an image searching method according to an embodiment of the present application. As shown in fig. 4, the image searching method includes the following steps S401 to S403. The execution body of the method shown in fig. 4 may be an electronic device (such as the electronic device 100 in fig. 2), or a chip in an electronic device; fig. 4 is described with the electronic device as the execution body. The details are as follows:
s401, the electronic equipment acquires an expression image database.
In an embodiment of the present application, the expression image database includes a plurality of expression images and a first content feature vector and a first intention feature vector corresponding to each of the plurality of expression images. The first content feature vector is a feature vector describing the content of the expression image, and the first intention feature vector is a feature vector describing the expression intention of the expression image. For example, the expression image database includes three expression images, namely expression image a, expression image b, and expression image c; the database further includes the first content feature vector and the first intention feature vector corresponding to each of them, that is, the first content feature vector and the first intention feature vector corresponding to expression image a, those corresponding to expression image b, and those corresponding to expression image c. The electronic device first needs to acquire the expression image database so that the required expression image can later be searched for in it.
Taking the expression image shown in (a) of fig. 1A as an example, the content of the expression image is a white dog sitting sideways on the ground with its eyes looking ahead and a question mark above its head; the expression intention of this expression image is "I am very confused".
The expression image may be an expression image locally collected or added by the user (such as a custom expression image locally collected or added by the user), or may be an expression image downloaded by the user in a cloud mall, which is not limited herein. That is, the expression image database can support the searching of the expression images downloaded in the cloud mall, and also can support the searching of the expression images collected locally. The embodiment of the application is illustrated by taking the plurality of expression images as the expression images locally collected by the user as an example.
It should be noted that, the data related to the user, the expression image locally collected by the user, and the like according to the embodiment of the present application are all obtained after the authorization of the user. Moreover, when embodiments of the present application are applied to specific products or technologies, the data involved requires user approval or consent, and the collection, use and processing of the relevant data requires compliance with relevant laws and regulations and standards of the relevant countries and regions.
In one possible implementation, when the electronic device obtains the expression image database, as shown in fig. 5, a specific implementation may include the following steps s11 to s14. In this way, the expression image database can be built along the two dimensions of expression intention and semantic content, which improves the efficiency of searching for expression images.
And s11, the electronic equipment acquires a plurality of expression images locally collected by the user.
In a specific implementation, the plurality of expression images collected locally by the user may be custom expression images sent by other users collected by the user in the online chat process, or custom expression images made locally by the user, or custom expression images added locally by the user, which is not limited herein.
And s12, the electronic equipment calls an image encoder to encode each expression image in the plurality of expression images to obtain a first image feature vector corresponding to each expression image.
In a specific implementation, the image encoder may be a trained Image Encoder used to convert an image into a feature vector representation; for example, the image encoder may be the encoder of a transformer model, or the encoder of another model, which is not limited here. The electronic device can invoke the image encoder to encode each expression image to obtain the first image feature vector corresponding to each expression image.
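As an example of what "the encoder of a transformer model" could look like in practice, the sketch below uses the public google/vit-base-patch16-224-in21k vision transformer from the transformers library to turn one locally collected expression image into a feature vector. This is an illustration under that assumption, not the encoder actually used by the application; the image path is a placeholder.

```python
# Requires: pip install torch transformers pillow
from PIL import Image
from transformers import ViTImageProcessor, ViTModel

processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k")
encoder = ViTModel.from_pretrained("google/vit-base-patch16-224-in21k")

image = Image.open("expression.png")                            # placeholder path to an expression image
inputs = processor(images=image, return_tensors="pt")
outputs = encoder(**inputs)
first_image_feature_vector = outputs.last_hidden_state[:, 0]    # CLS token as a global image feature
```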
And S13, the electronic equipment calls an alignment network to process the first image feature vector and the prompt word, and a first content feature vector and a first intention feature vector corresponding to each expression image are obtained.
In a specific implementation, the prompt word is used to prompt the alignment network to describe the content and the expression intention of the expression image; for example, the prompt word is "describe the content and the expression intention of the expression image in detail". For each expression image, the electronic device inputs the first image feature vector corresponding to the expression image together with the prompt word into the alignment network for processing, and the alignment network outputs the first content feature vector and the first intention feature vector corresponding to the expression image. The alignment network may be a single alignment network that extracts both the first content feature vector and the first intention feature vector of the expression image, or it may consist of two independent alignment networks, for example a first alignment network for extracting the first content feature vector of the expression image and a second alignment network for extracting the first intention feature vector of the expression image.
Optionally, taking the case where the alignment network consists of two independent alignment networks, the alignment network includes a first alignment network and a second alignment network, and the prompt word includes a first prompt word and a second prompt word. When the electronic device invokes the alignment network to process the first image feature vector and the prompt word to obtain the first content feature vector and the first intention feature vector corresponding to each expression image, a specific implementation may include the following steps 1 and 2; the execution order of step 1 and step 2 is not limited. In this way, the accuracy of the first content feature vector and the first intention feature vector corresponding to each expression image can be improved.
Step 1, the electronic equipment calls a first alignment network to process the first image feature vector and the first prompt word, and a first content feature vector corresponding to each expression image is obtained.
In a specific implementation, the first prompting word is used for prompting the first alignment network to describe the content of the expression image, for example, the first prompting word is "describe the content of the expression image in detail". For each expression image, the electronic device needs to input the first image feature vector corresponding to the expression image and the first prompting word together into the first alignment network for processing, and the first alignment network outputs the first content feature vector corresponding to the expression image.
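A hedged sketch of how the first image feature vector and the first prompting word could be combined as inputs to an alignment network is given below. The AlignmentNetwork class, its toy prompt embedding, and the simple fusion-and-projection logic are assumptions, not the patent's concrete model.

```python
# Sketch only: first image feature vector + first prompting word -> first content feature vector.
import numpy as np


class AlignmentNetwork:
    """Stand-in for a trained alignment network (e.g. a CLIP/BLIP-style module)."""

    def __init__(self, dim: int = 512, seed: int = 0):
        rng = np.random.default_rng(seed)
        # A single projection matrix stands in for the real network parameters.
        self.weight = rng.standard_normal((dim, dim)).astype(np.float32) / np.sqrt(dim)

    def embed_prompt(self, prompt: str, dim: int) -> np.ndarray:
        # Toy prompt embedding; a real system would use a text tokenizer/encoder.
        rng = np.random.default_rng(abs(hash(prompt)) % (2**32))
        return rng.standard_normal(dim).astype(np.float32)

    def __call__(self, image_feature: np.ndarray, prompt: str) -> np.ndarray:
        prompt_feature = self.embed_prompt(prompt, image_feature.shape[0])
        fused = image_feature + 0.1 * prompt_feature   # fuse image feature and prompt
        return self.weight @ fused                      # project into the content feature space


first_alignment_network = AlignmentNetwork()
first_prompt = "describe the content of the expression image in detail"
# first_image_feature is the vector produced by the image encoder for one expression image.
first_image_feature = np.ones(512, dtype=np.float32)
first_content_feature = first_alignment_network(first_image_feature, first_prompt)
```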
Alternatively, the first alignment network may be trained by another computer device and then sent to the electronic device for use. The computer device may be a server or another device; the server may be a separate physical server, a server cluster formed by a plurality of servers, or a cloud computing center (such as a cloud server). A cloud server provides simple, efficient, safe and reliable computing services and is simpler and more efficient to manage than a physical server; it may provide basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, web services, cloud communications, middleware services, domain name services, security services, content delivery networks (content delivery network, CDN), big data, and artificial intelligence platforms. It can be appreciated that the first alignment network is trained offline in the cloud rather than on the device side. In this way, dependence on the hardware resources of the electronic device is reduced, and the burden on the electronic device is reduced.
Taking the computer device being a server as an example, as shown in fig. 6, the method may specifically include the following steps A to H. The execution sequence of steps B to D and steps E to F is not limited: steps B to D may be performed first and then steps E to F, steps E to F may be performed first and then steps B to D, or steps B to D and steps E to F may be performed simultaneously.
And step A, the server acquires a first sample expression image and corresponding first text information.
In a specific implementation, the first sample expression image and the corresponding first text information may form a first training sample, and the server needs to acquire a plurality of first training samples to train the model. For example, the first sample expression image is shown in fig. 7A (a); the first text information corresponding to the first sample expression image is "a white puppy sitting sideways on the ground with eyes looking forward, and a mark with a question mark on the head of the puppy", as shown in fig. 7A (b).
And B, the server calls an image encoder to encode the first sample expression image to obtain a second image feature vector.
In a specific implementation, the server may also input the first sample expression image into a trained image encoder to perform encoding processing, and may obtain a second image feature vector.
And C, the server calls a first initial alignment network to process the second image feature vector and the first prompt word, and a second content feature vector is obtained.
In a specific implementation, the first initial alignment network herein may be a neural network model, such as a CLIP model, a BLIP model, etc., without limitation. The server can input the second image feature vector and the first prompt word into the first initial alignment network for processing, so that the first initial alignment network describes the content of the expression image and obtains the second content feature vector.
And D, the server calls a large language model to process the second content feature vector and the first prompt word, and a third content feature vector is obtained.
In particular implementations, the large language model herein is a deep learning model that has been trained using large amounts of text data, such as the LLaMA2 model, the ChatGPT model, and the like. The server can input the second content feature vector and the first prompt word into the large language model for processing, so that the large language model describes the content of the expression image, and a third content feature vector is obtained.
And E, the server calls a text encoder to encode the first text information to obtain a second text feature vector.
In a specific implementation, the text encoder may be referred to as a Text Encoder; it may be a trained text encoder for converting text information into a feature vector representation. For example, the text encoder may be the encoder in a Transformer model, the encoder in a BERT model, or an encoder in another model, which is not limited herein. The server can input the first text information into the text encoder for encoding processing to obtain a second text feature vector.
And F, the server calls a large language model to process the second text feature vector and the third prompt word, and a third text feature vector is obtained.
In a specific implementation, the third prompting word is used for prompting the large language model to reinterpret the text information, for example, the third prompting word is "reinterpret the description in the text information". The server can input the second text feature vector and the third prompting word into the large language model for processing, so that the large language model re-interprets the first text information to obtain a third text feature vector.
And G, updating model parameters of the first initial alignment network by the server based on the third content feature vector and the third text feature vector to obtain the first alignment network.
In a specific implementation, the server calculates a first model loss value (Loss) by using the third content feature vector and the third text feature vector, where the first model loss value may be the similarity between the third content feature vector and the third text feature vector, for example, the cosine similarity (Cosine Similarity), that is, the cosine of the angle between the two vectors, which measures how similar the two vectors are in direction. The server may adjust the model parameters of the first initial alignment network in a direction that reduces the first model loss value; after the set number of training iterations is reached, the training is completed and the first alignment network is obtained. The specific manner of adjusting the model parameters of the first initial alignment network may be a stochastic gradient descent method, an adaptive gradient algorithm, or the like, which is not limited herein.
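A minimal, hedged sketch of the loss computation in step G follows. The patent only states that the first model loss value may be the cosine similarity between the third content feature vector and the third text feature vector; the sign convention below (loss = 1 - cosine similarity, so that lowering the loss pulls the two vectors together) and the plain gradient-descent update are assumptions for illustration.

```python
# Sketch of the first model loss value and a single parameter update (assumed conventions).
import numpy as np


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    denom = (np.linalg.norm(a) * np.linalg.norm(b)) or 1.0
    return float(np.dot(a, b) / denom)


def first_model_loss(third_content_vec: np.ndarray, third_text_vec: np.ndarray) -> float:
    # Assumption: express the loss so that reducing it increases the alignment.
    return 1.0 - cosine_similarity(third_content_vec, third_text_vec)


def sgd_step(params: np.ndarray, grad: np.ndarray, lr: float = 1e-3) -> np.ndarray:
    # Stochastic gradient descent; an adaptive gradient algorithm could be used instead.
    return params - lr * grad
```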
And step H, the server sends the first alignment network to the electronic equipment. Accordingly, the electronic device receives the first alignment network from the server.
In a specific implementation, after the server trains the first alignment network, the first alignment network may be sent to the electronic device for use by the electronic device. The first alignment network may be used to extract a first content feature vector of the emoji image.
In general, for the training process of the first alignment network, as shown in fig. 7B, the server acquires a plurality of first training samples, each of which includes a first sample expression image and corresponding first text information. For each first training sample, the server inputs the acquired first sample expression image into a trained image encoder for encoding processing to obtain a second image feature vector; inputs the second image feature vector and the first prompt word into the first initial alignment network for processing, so that the first initial alignment network describes the content of the expression image and obtains a second content feature vector; and inputs the second content feature vector and the first prompt word (namely (1) and (2)) into a large language model for processing, so that the large language model describes the content of the expression image and a third content feature vector is obtained. In addition, the server inputs the first text information corresponding to the acquired first sample expression image into a text encoder for encoding processing to obtain a second text feature vector, and inputs the second text feature vector and the third prompt word (namely (3) and (4)) into the large language model for processing, so that the large language model re-interprets the first text information and a third text feature vector is obtained. Finally, the server calculates a first model loss value by using the third content feature vector and the third text feature vector, and adjusts the model parameters of the first initial alignment network in the direction of reducing the first model loss value; after the set number of training iterations is reached, the training is completed and the first alignment network is obtained.
Of course, if the electronic device has sufficient processing capability, the first alignment network may also be trained by the electronic device itself, and the specific training process may refer to the above steps A to G, which is not limited herein.
And step 2, the electronic equipment calls a second alignment network to process the first image feature vector and the second prompting word, and a first intention feature vector corresponding to each expression image is obtained.
In a specific implementation, the second prompting word is used for prompting the second alignment network to describe the expression intention of the expression image, for example, the second prompting word is "describe the expression intention of the expression image in detail". For each expression image, the electronic device needs to input the first image feature vector corresponding to the expression image and the second prompting word together into the second alignment network for processing, and the second alignment network outputs the first intention feature vector corresponding to the expression image.
Alternatively, the second alignment network may be trained by another computer device and then sent to the electronic device for use. The computer device may be a server or another device, where the server may be a separate physical server, a server cluster formed by multiple servers, or a cloud computing center (e.g., a cloud server). It can be appreciated that the second alignment network is also trained offline in the cloud rather than on the device side. In this way, dependence on the hardware resources of the electronic device is reduced, and the burden on the electronic device is reduced.
Taking the computer device being a server as an example, as shown in fig. 8, the method may specifically include the following steps a to i. The execution sequence of steps b to e and steps f to g is not limited: steps b to e may be performed first and then steps f to g, steps f to g may be performed first and then steps b to e, or steps b to e and steps f to g may be performed simultaneously.
Step a, a server acquires first sample conversation information.
In a specific implementation, the first sample dialogue information includes dialogue text information and a second sample expression image. As shown in fig. 9A, the dialogue text information in the first sample dialogue information may be: "We can't get together today", "Didn't you say yesterday that you'd be free today?", "The company assigned me a new task today, so I'm a bit busy"; the second sample expression image is an expression image sent by the user in the first sample dialogue information.
And b, the server determines second text information corresponding to the second sample expression image based on the first sample dialogue information.
Optionally, when the server determines the second text information corresponding to the second sample emotion image based on the first sample dialogue information, a specific implementation manner may be: calling a large language model to process the first sample dialogue information to obtain second sample dialogue information; and determining second text information corresponding to the second sample expression image from the second sample dialogue information.
In a specific implementation, the server may input the first sample dialogue information into the large language model for processing, so that the large language model replaces every expression image in the first sample dialogue information with text information according to the dialogue context, and thus outputs second sample dialogue information, as shown in fig. 9B. At this point, the second text information corresponding to the second sample expression image can be found from the second sample dialogue information; that is, in fig. 9B, the second text information corresponding to the second sample expression image is "I am very confused". The first sample dialogue information and the second sample dialogue information can form a second training sample, and the server needs to acquire a plurality of second training samples to train the model.
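One way this rewriting request could be posed to a large language model is sketched below. The prompt wording, the placeholder convention, and the call_llm function are assumptions for illustration, not the patented prompt.

```python
# Sketch only: ask an LLM to replace expression-image placeholders with context-derived text.
from typing import Callable


def build_rewrite_prompt(dialogue_with_images: str) -> str:
    return (
        "Rewrite the following chat dialogue. Replace every expression image placeholder "
        "such as [emoji_1] with a short sentence stating what the sender meant, judged "
        "from the surrounding context. Keep all other text unchanged.\n\n"
        + dialogue_with_images
    )


def derive_second_sample_dialogue(dialogue_with_images: str,
                                  call_llm: Callable[[str], str]) -> str:
    """call_llm is any function that sends a prompt to a large language model and returns text."""
    return call_llm(build_rewrite_prompt(dialogue_with_images))
```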
And c, the server calls an image encoder to encode the second sample expression image to obtain a third image feature vector.
In a specific implementation, the server may also input the second sample expression image into the trained image encoder for encoding processing, and may obtain a third image feature vector.
And d, the server calls a second initial alignment network to process the third image feature vector and the second prompt word, and a second intention feature vector is obtained.
In a specific implementation, the second initial alignment network herein may also be a neural network model, such as a CLIP model, a BLIP model, and the like, which is not limited herein. The server can input the third image feature vector and the second prompt word into a second initial alignment network for processing, so that the second initial alignment network describes the expression intention of the expression image and obtains a second intention feature vector.
And e, the server calls a large language model to process the second intention feature vector, the second prompt word and the context information extracted from the dialogue text information, so as to obtain a third intention feature vector.
In a specific implementation, the large language model herein may also be a deep learning model that has been trained using a large amount of text data, such as the LLaMA2 model, the ChatGPT model, and the like. The server can input the second intention feature vector, the second prompting word and the context information extracted from the dialogue text information into the large language model for processing, so that the large language model describes the expression intention of the expression image and a third intention feature vector is obtained. The server may extract the context information from the dialogue text information by using a trained language model, where the language model may be a BERT model, a GPT model, an ELMo model, or the like, which is not limited herein.
And f, calling a text encoder by the server to encode the second text information to obtain a fourth text feature vector.
In a specific implementation, the text encoder may be the encoder in a Transformer model, the encoder in a BERT model, or an encoder in another model, which is not limited herein. The server can input the second text information into the text encoder for encoding processing to obtain a fourth text feature vector.
And g, the server calls a large language model to process the fourth text feature vector, the fourth prompt word and the context information, and a fourth intention feature vector is obtained.
In a specific implementation, the fourth prompting word is used for prompting the large language model to describe the expression intention of the text information, for example, the fourth prompting word may be "describe the expression intention of the text information in detail". The server may input the fourth text feature vector, the fourth prompting word, and the extracted context information into the large language model for processing, so that the large language model describes the expression intention of the second text information, and a fourth intention feature vector is obtained.
And h, updating model parameters of the second initial alignment network by the server based on the third intention feature vector and the fourth intention feature vector to obtain a second alignment network.
In a specific implementation, the server calculates a second model loss value by using the third intention feature vector and the fourth intention feature vector, where the second model loss value may be the similarity between the third intention feature vector and the fourth intention feature vector, for example, the cosine similarity, that is, the cosine of the angle between the two vectors, which measures how similar the two vectors are in direction. The server may adjust the model parameters of the second initial alignment network in a direction that reduces the second model loss value; after the set number of training iterations is reached, the training is completed and the second alignment network is obtained. The specific manner of adjusting the model parameters of the second initial alignment network may be a stochastic gradient descent method, an adaptive gradient algorithm, or the like, which is not limited herein.
And step i, the server sends a second alignment network to the electronic equipment. Accordingly, the electronic device receives a second alignment network from the server.
In a specific implementation, after the server trains the second alignment network, the second alignment network may be sent to the electronic device for use by the electronic device. The second alignment network may be used to extract a first intent feature vector of the emoji image.
In general, for the training process of the second alignment network, as shown in fig. 9C, the server acquires first sample dialogue information including dialogue text information and a second sample expression image; inputs the first sample dialogue information into a large language model for processing and outputs second sample dialogue information; and determines the second text information corresponding to the second sample expression image from the second sample dialogue information. The server inputs the second sample expression image into a trained image encoder for encoding processing to obtain a third image feature vector; inputs the third image feature vector and the second prompting word into the second initial alignment network for processing, so that the second initial alignment network describes the expression intention of the expression image and obtains a second intention feature vector; and inputs the second intention feature vector, the second prompting word and the context information extracted from the dialogue text information (namely (1), (2) and (3)) into the large language model for processing, so that the large language model describes the expression intention of the expression image and a third intention feature vector is obtained. The server also inputs the second text information into a text encoder for encoding processing to obtain a fourth text feature vector, and inputs the fourth text feature vector, the fourth prompting word and the extracted context information (namely (3), (4) and (5)) into the large language model for processing, so that the large language model describes the expression intention of the second text information and a fourth intention feature vector is obtained. Finally, the server calculates a second model loss value by using the third intention feature vector and the fourth intention feature vector, and adjusts the model parameters of the second initial alignment network in the direction of reducing the second model loss value; after the set number of training iterations is reached, the training is completed and the second alignment network is obtained.
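The overall loop of steps a to h can be sketched as follows, assuming that the image encoder, text encoder, large language model, initial alignment network and helper functions are supplied as callables; none of these names come from the patent, and the loss convention (1 - cosine similarity) is an illustrative assumption.

```python
# High-level sketch of the second alignment network training loop (steps a to h).
import numpy as np


def cosine(a: np.ndarray, b: np.ndarray) -> float:
    denom = (np.linalg.norm(a) * np.linalg.norm(b)) or 1.0
    return float(np.dot(a, b) / denom)


def train_second_alignment_network(samples, image_encoder, text_encoder, llm,
                                   align_net, extract_context,
                                   second_prompt, fourth_prompt, num_epochs=1):
    """samples: iterable of (dialogue_text, second_sample_image, second_text) tuples."""
    for _ in range(num_epochs):
        for dialogue_text, sample_image, second_text in samples:
            context = extract_context(dialogue_text)           # context info from the dialogue
            img_vec = image_encoder(sample_image)               # step c: third image feature vector
            intent_2 = align_net(img_vec, second_prompt)        # step d: second intention vector
            intent_3 = llm(intent_2, second_prompt, context)    # step e: third intention vector
            text_vec_4 = text_encoder(second_text)              # step f: fourth text feature vector
            intent_4 = llm(text_vec_4, fourth_prompt, context)  # step g: fourth intention vector
            loss = 1.0 - cosine(intent_3, intent_4)             # step h: second model loss value
            align_net.update(loss)                              # gradient-based parameter update
    return align_net
```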
Of course, if the electronic device has sufficient processing capability, the second alignment network may also be trained by the electronic device itself, and the specific training process may refer to the above steps a to h, which is not limited herein.
And s14, the electronic equipment constructs an expression image database by utilizing the plurality of expression images, the first content characteristic vector and the first intention characteristic vector corresponding to each expression image.
In a specific implementation, the plurality of expression images, together with the first content feature vector and the first intention feature vector corresponding to each expression image, are combined to form the expression image database, which is used for subsequent expression image searching.
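One possible in-memory layout for such a database is sketched below: each entry keeps an expression image together with its first content feature vector and first intention feature vector. The dataclass and field names are assumptions, not the patent's storage format.

```python
# Sketch of one possible expression image database layout for step s14.
from dataclasses import dataclass
from typing import List

import numpy as np


@dataclass
class ExpressionImageEntry:
    image_path: str                 # or the raw image data
    content_vector: np.ndarray      # first content feature vector
    intent_vector: np.ndarray       # first intention feature vector


def build_expression_image_database(images: List[str],
                                    content_vectors: List[np.ndarray],
                                    intent_vectors: List[np.ndarray]) -> List[ExpressionImageEntry]:
    return [ExpressionImageEntry(p, c, i)
            for p, c, i in zip(images, content_vectors, intent_vectors)]
```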
S402, the electronic equipment determines a first text feature vector of information to be processed.
In the embodiment of the application, the information to be processed can be text information (such as words, sentences and the like) input or searched by a user, and the text information can be the expression intention conveyed by the expression image to be searched or a description of the content of the expression image to be searched. That is, the image searching method can support both searching by the expression intention of an expression image and searching by the semantic content of an expression image.
For example, the information to be processed may be text input by the user in the expression image search box. Illustratively, as shown in fig. 10A, the user inputs "confusion" in the expression image search box of the expression image search interface, and "confusion" is the information to be processed; alternatively, "a puppy or a kitten with a question mark on its head" can be input into the expression image search box of the expression image search interface, in which case "a puppy or a kitten with a question mark on its head" is the information to be processed.
For another example, the information to be processed may be unsent text entered by the user in the chat box. Illustratively, as shown in fig. 10B, the user edits the unsent text "confusion" in the chat box, and "confusion" is the information to be processed.
For another example, the information to be processed may be text input by the user in an emoticon search field in the input method interface. Illustratively, as shown in fig. 10C (a), the user clicks an expression button in the input method interface, and an expression interface is displayed; as shown in fig. 10C (b), the expression interface includes an Emoji option, a kaomoji option, an emoticon option, and a meme option, and the user can click on the emoticon option to enter an emoticon search interface (which can be considered as a local TAB page); as shown in fig. 10C (c), the emoticon search interface includes an emoticon search field in which the user can input text to be searched, such as "confusion", which is the information to be processed; of course, a sentence to be searched, such as "a puppy or a kitten with a question mark on its head", can also be input into the expression image search field, and this sentence is then the information to be processed.
Of course, the information to be processed may be information input in other manners, which is not limited herein. The electronic device needs to extract the first text feature vector of the information to be processed for subsequent processing.
In one possible implementation manner, when the electronic device determines the first text feature vector of the information to be processed, a specific implementation manner may be: and calling a text encoder to process the information to be processed to obtain a first text feature vector of the information to be processed. That is, the electronic device may also input the information to be processed into the trained text encoder for processing, so as to obtain the first text feature vector of the information to be processed. The text encoder may be the above-mentioned text encoder, or may be other text encoders, and is not limited herein.
S403, the electronic equipment searches the target expression image matched with the information to be processed from the image database.
In the embodiment of the application, the first content feature vector and the first intention feature vector corresponding to the target expression image are matched with the first text feature vector. It can be understood that the image database includes a plurality of expression images and the first content feature vector and the first intention feature vector corresponding to each expression image; the electronic device needs to find, in the image database, the first content feature vector and the first intention feature vector matched with the first text feature vector, that is, the first content feature vector and the first intention feature vector corresponding to a target expression image, and at this time the target expression image is the expression image matched with the information to be processed. There may be one or more target expression images.
Optionally, the electronic device may record the target expression image matched with the information to be processed, so that when the same information to be processed is obtained subsequently, the matched target expression image can be displayed directly. In this way, the efficiency of searching for expression images is improved. In addition, if the user does not use the first target expression image displayed in the first interface within a first preset time period, the electronic device may stop displaying the first target expression image in the first interface, and may also search again for other expression images matched with the information to be processed.
In one possible implementation manner, when the electronic device searches the image database for the target expression image matched with the information to be processed, a specific implementation manner may be: determining a first vector distance between the first content feature vector corresponding to each expression image in the image database and the first text feature vector, and determining a second vector distance between the first intention feature vector corresponding to each expression image in the image database and the first text feature vector; and determining the target expression image matched with the information to be processed based on the first vector distance and the second vector distance. In this way, the target expression image matched with the information to be processed can be searched from the two dimensions of expression intention and semantic content, which improves the efficiency of searching for expression images. The vector distance here may be the Euclidean distance between the vectors. Different manners of determining the target expression image matched with the information to be processed based on the first vector distance and the second vector distance are described in detail below.
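A minimal sketch of the distance computation, assuming Euclidean distance as mentioned above, is given below; the entry attributes follow the database sketch earlier and are assumptions.

```python
# Sketch only: first and second vector distances between each database entry and the query.
import numpy as np


def euclidean(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.linalg.norm(a - b))


def distances_to_query(database, query_text_vector):
    """database is a list of entries with .content_vector and .intent_vector attributes."""
    first = [euclidean(e.content_vector, query_text_vector) for e in database]
    second = [euclidean(e.intent_vector, query_text_vector) for e in database]
    return first, second
```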
Mode one: arranging the first vector distances corresponding to each expression image in a sequence from small to large, and determining the expression images corresponding to the first X first vector distances as target expression images matched with the information to be processed; and arranging the second vector distances corresponding to each expression image in order from small to large, and determining the expression images corresponding to the first Y second vector distances as target expression images matched with the information to be processed. Wherein X and Y are positive integers.
For example, assume that X is 1 and Y is 1; the image database comprises an expression image a, an expression image b and an expression image c. The first vector distance corresponding to the expression image a is 0.3, the first vector distance corresponding to the expression image b is 0.5, and the first vector distance corresponding to the expression image c is 0.7. The first vector distances are arranged from small to large, and the smallest first vector distance (i.e., the first vector distance arranged at the first) is 0.3, so that the expression image a can be used as a target expression image for matching the information to be processed. The second vector distance corresponding to the expression image a is 0.2, the second vector distance corresponding to the expression image b is 0.5, and the second vector distance corresponding to the expression image c is 0.6. These second vector distances are arranged from small to large, and the smallest second vector distance (i.e., the second vector distance arranged at the first) is 0.2, so that the expression image a can be regarded as the target expression image to be matched with the information to be processed. At this time, only the expression image a is used as a target expression image for matching the information to be processed.
For another example, assume that X is 1 and Y is 1; the image database comprises an expression image a, an expression image b and an expression image c. The first vector distance corresponding to the expression image a is 0.3, the first vector distance corresponding to the expression image b is 0.5, and the first vector distance corresponding to the expression image c is 0.7. The first vector distances are arranged from small to large, and the smallest first vector distance (i.e., the first vector distance arranged at the first) is 0.3, so that the expression image a can be used as a target expression image for matching the information to be processed. The second vector distance corresponding to the expression image a is 0.4, the second vector distance corresponding to the expression image b is 0.3, and the second vector distance corresponding to the expression image c is 0.6. These second vector distances are arranged from small to large, and the smallest second vector distance (i.e., the second vector distance arranged at the first) is 0.3, so that the expression image b can be regarded as the target expression image to be matched with the information to be processed. At this time, both the expression image a and the expression image b can be used as target expression images for matching the information to be processed.
For another example, assume that X is 2 and Y is 2; the image database comprises an expression image a, an expression image b and an expression image c. The first vector distance corresponding to the expression image a is 0.3, the first vector distance corresponding to the expression image b is 0.5, and the first vector distance corresponding to the expression image c is 0.7. These first vector distances are arranged from small to large, and the first 2 first vector distances are 0.3 and 0.5, so that the expression image a and the expression image b can be used as target expression images matched with the information to be processed. The second vector distance corresponding to the expression image a is 0.2, the second vector distance corresponding to the expression image b is 0.5, and the second vector distance corresponding to the expression image c is 0.6. These second vector distances are arranged from small to large, and the first 2 second vector distances are 0.2 and 0.5, so that the expression image a and the expression image b are also target expression images matched with the information to be processed. At this time, the expression image a and the expression image b are used as target expression images matched with the information to be processed.
For another example, assume that X is 2 and Y is 2; the image database comprises an expression image a, an expression image b and an expression image c. The first vector distance corresponding to the expression image a is 0.3, the first vector distance corresponding to the expression image b is 0.5, and the first vector distance corresponding to the expression image c is 0.7. These first vector distances are arranged from small to large, and the first vector distances arranged in the first 2 are 0.3 and 0.5, so that the expression image a and the expression image b can be used as target expression images for matching the information to be processed. The second vector distance corresponding to the expression image a is 0.5, the second vector distance corresponding to the expression image b is 0.3, and the second vector distance corresponding to the expression image c is 0.4. These second vector distances are arranged from small to large, and the second vector distances arranged in the first 2 are 0.3 and 0.4, so that the expression image b and the expression image c can be regarded as target expression images for matching the information to be processed. At this time, the expression image a, the expression image b and the expression image c can be used as target expression images matched with the information to be processed.
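The selection in mode one can be sketched as follows: take the expression images with the X smallest first vector distances and the Y smallest second vector distances, then merge the two result sets (duplicates kept once), which reproduces the worked examples above. Names and data structures are assumptions.

```python
# Sketch of mode one: top-X by first vector distance, top-Y by second, merged.
def select_mode_one(entries, first_distances, second_distances, x: int, y: int):
    by_content = sorted(range(len(entries)), key=lambda i: first_distances[i])[:x]
    by_intent = sorted(range(len(entries)), key=lambda i: second_distances[i])[:y]
    selected_indices = sorted(set(by_content) | set(by_intent))
    return [entries[i] for i in selected_indices]
```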
Mode two: performing a calculation (for example, a weighted calculation) on the first vector distance and the second vector distance to obtain a third vector distance corresponding to each expression image; arranging the third vector distances in order from small to large, and determining the expression images corresponding to the first N third vector distances as target expression images matched with the information to be processed; N is a positive integer.
For example, assuming that N is 1, the image database includes an expression image a, an expression image b and an expression image c; the weight corresponding to the first vector distance is 0.4 and the weight corresponding to the second vector distance is 0.6. The first vector distance corresponding to the expression image a is 0.3, the first vector distance corresponding to the expression image b is 0.5, and the first vector distance corresponding to the expression image c is 0.7. The second vector distance corresponding to the expression image a is 0.2, the second vector distance corresponding to the expression image b is 0.5, and the second vector distance corresponding to the expression image c is 0.6. Thus, the calculation results in: the third vector distance corresponding to the expression image a is 0.24, the third vector distance corresponding to the expression image b is 0.5, and the third vector distance corresponding to the expression image c is 0.64. These third vector distances are arranged from small to large, and the smallest third vector distance (namely the third vector distance ranked first) is 0.24, so that the expression image a can be used as the target expression image matched with the information to be processed.
For another example, assuming that N is 2, the image database includes an expression image a, an expression image b and an expression image c; the weight corresponding to the first vector distance is 0.4 and the weight corresponding to the second vector distance is 0.6. The first vector distance corresponding to the expression image a is 0.3, the first vector distance corresponding to the expression image b is 0.5, and the first vector distance corresponding to the expression image c is 0.7. The second vector distance corresponding to the expression image a is 0.2, the second vector distance corresponding to the expression image b is 0.5, and the second vector distance corresponding to the expression image c is 0.6. Thus, the calculation results in: the third vector distance corresponding to the expression image a is 0.24, the third vector distance corresponding to the expression image b is 0.5, and the third vector distance corresponding to the expression image c is 0.64. These third vector distances are arranged from small to large, and the first 2 third vector distances are 0.24 and 0.5, so that the expression image a and the expression image b can be used as target expression images matched with the information to be processed.
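Mode two can be sketched with the example weights above; the weighted sum (0.4 and 0.6) is one possible calculation for the third vector distance, not the only one, and the function names are assumptions.

```python
# Sketch of mode two: weighted combination of the two distances, then top-N selection.
def select_mode_two(entries, first_distances, second_distances, n: int,
                    w_content: float = 0.4, w_intent: float = 0.6):
    third = [w_content * f + w_intent * s
             for f, s in zip(first_distances, second_distances)]
    order = sorted(range(len(entries)), key=lambda i: third[i])[:n]
    return [entries[i] for i in order]
```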
Of course, the manner in which the electronic device determines the target expression image matched with the information to be processed based on the first vector distance and the second vector distance may also adopt other strategies, which is not limited herein.
In one possible implementation manner, when the electronic device searches the image database for the target expression image matched with the information to be processed, a specific implementation manner may be: calling the first alignment network to process the first text feature vector to obtain a fourth content feature vector; calling the second alignment network to process the first text feature vector to obtain a fifth intention feature vector; determining a third vector distance between the first content feature vector corresponding to each expression image in the image database and the fourth content feature vector, and determining a fourth vector distance between the first intention feature vector corresponding to each expression image in the image database and the fifth intention feature vector; and determining the target expression image matched with the information to be processed based on the third vector distance and the fourth vector distance. It can be understood that in this manner the electronic device also extracts the content feature vector of the information to be processed (i.e., the fourth content feature vector) by using the first alignment network and the intention feature vector of the information to be processed (i.e., the fifth intention feature vector) by using the second alignment network, and then purposefully finds the target expression image matched with the information to be processed from the two dimensions of semantic content and expression intention, which improves the accuracy of the target expression image. In addition, the specific implementation manner of determining the target expression image matched with the information to be processed based on the third vector distance and the fourth vector distance may refer to the specific implementation manner of determining the target expression image matched with the information to be processed based on the first vector distance and the second vector distance, which is not described in detail herein.
In one possible implementation, the method further includes: displaying the target expression image matched with the information to be processed in a first interface, wherein the first interface is any one of the following interfaces: an image searching interface, an input method interface, or an expression image recommending interface. That is, after the electronic device finds the target expression image matched with the information to be processed in the expression image database, the target expression image can be displayed in the first interface for the user to select. In this way, the visualization of image searching is improved, and the flexibility of displaying the target expression image is improved.
As shown in fig. 10A, the expression image search interface includes an expression image search box and a search button. If the user inputs "confusion" in the expression image search box, when the user clicks the search button, the electronic device executes the above-described image searching method; after finding the target expression images matching "confusion" in the expression image database, it displays the target expression images on the expression image search interface, as shown in fig. 11A, for the user to select.
As shown in fig. 10B, assuming that the user inputs "confusion" when editing text in the chat box, "confusion" is text not yet sent by the user. At this time, the electronic device may perform the above-described image searching method, find the target expression images matching "confusion" in the expression image database, then pop up the expression image recommendation interface and display the target expression images in the expression image recommendation interface, as shown in fig. 11B, for the user to select, which helps improve conversation efficiency.
As shown in fig. 10C, assume that the user clicks the expression button in the input method interface, and the expression interface is displayed; the user further clicks the emoticon option in the displayed expression interface to enter the emoticon search interface (which may be considered a local TAB page); the user can input the text to be searched, namely "confusion", in the emoticon search field of the emoticon search interface; when the user clicks the search button, the electronic device executes the above-described image searching method, and after finding the target expression images matched with "confusion" in the expression image database, displays the target expression images on the input method interface, as shown in fig. 11C, for the user to select.
Of course, for the display order of the target expression images, the electronic device may display them according to the user's historical habits or historical behavior, for example, in descending order of the user's historical usage frequency within a preset time period, in a random order, or in the chronological order in which the user added the expression images (for example, from most recent to oldest), which is not limited herein.
In one possible implementation, the method further includes: displaying an expression image adding interface, wherein the expression image adding interface comprises an expression image frame, a custom label frame and a save option; when a triggering operation of the user for the save option is detected, saving a first expression image added in the expression image frame and a first label of the first expression image added in the custom label frame; and if the information to be processed is the same as the first label, determining the first expression image as the target expression image. It can be understood that the user may add labels to stored expression images in advance and save them; when the user searches for an expression image, in addition to the target expression images found through step S403, if the information to be processed input by the user is the same as a first label previously saved by the user, the electronic device may also display the first expression image corresponding to that first label as a target expression image for the user to select. In this way, the efficiency of searching for the target expression image and the accuracy of the target expression image can be improved, which helps improve the efficiency of online chat conversations.
As shown in fig. 12, the emoticon adding interface includes an emoticon frame, a custom tab frame, and a save option. The user can add a first emoticon in the emoticon frame and add a first tag of the first emoticon in the custom tag frame. When the user clicks the save option, the electronic device saves the first expression image added in the expression image frame and the first label of the first expression image added in the custom label frame. In the process of searching the expression image by the user, if the information to be processed input by the user is the same as the first label stored by the user, the electronic equipment can also directly take the first expression image as the target expression image.
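The custom label path can be sketched as follows: the first expression image and its first label are saved when the user triggers the save option, and later, if the information to be processed equals a stored label, the corresponding image is returned directly as a target expression image. The data structure and names are assumptions.

```python
# Sketch only: saving custom-labelled expression images and matching them by label.
from typing import Dict, List, Optional


class CustomLabelStore:
    def __init__(self) -> None:
        self._label_to_images: Dict[str, List[str]] = {}

    def save(self, image_path: str, label: str) -> None:
        # Called when the user triggers the save option on the adding interface.
        self._label_to_images.setdefault(label, []).append(image_path)

    def match(self, info_to_process: str) -> Optional[List[str]]:
        # Exact match between the query text and a previously saved first label.
        return self._label_to_images.get(info_to_process)
```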
Based on the above method, the electronic device acquires the expression image database; then determines the first text feature vector of the information to be processed, where the information to be processed can be regarded as text information input by the user, text information searched by the user, and the like; and finally searches the image database for the target expression image matched with the information to be processed and displays the target expression image for the user to select. The first content feature vector and the first intention feature vector corresponding to the target expression image are matched with the first text feature vector. It can be understood that searching the image database for the target expression image matched with the information to be processed amounts to quickly retrieving expression images along the multiple dimensions of expression intention and semantic content, so that the expression image the user wants to send can be found more quickly and accurately, which improves the efficiency of online chat conversations.
Referring to fig. 13, fig. 13 is a schematic structural diagram of an image searching apparatus 1300 according to an embodiment of the present application. The image searching apparatus shown in fig. 13 may be an electronic device, a device in an electronic device, or a device that can be used in cooperation with an electronic device. The image searching apparatus shown in fig. 13 may include an acquisition unit 1301, a determination unit 1302, and a searching unit 1303. Wherein:
an obtaining unit 1301, configured to obtain an expression image database, where the expression image database includes a plurality of expression images, a first content feature vector and a first intention feature vector corresponding to each of the plurality of expression images; the first content feature vector represents a feature vector describing the content of the emoji image, and the first intention feature vector represents a feature vector describing the expression intention of the emoji image;
a determining unit 1302, configured to determine a first text feature vector of information to be processed;
the searching unit 1303 is configured to search, from the image database, a target expression image that matches the information to be processed, where a first content feature vector and a first intention feature vector that correspond to the target expression image match the first text feature vector.
In one possible implementation manner, the obtaining unit 1301 is specifically configured to, when obtaining the expression image database: acquiring a plurality of expression images locally collected by a user; invoking an image encoder to encode each expression image in the plurality of expression images to obtain a first image feature vector corresponding to each expression image; calling an alignment network to process the first image feature vector and the prompt word to obtain a first content feature vector and a first intention feature vector corresponding to each expression image; the prompting words are used for prompting the content and the expression intention of the expression image described by the alignment network; and constructing an expression image database by using the plurality of expression images, the first content feature vector and the first intention feature vector corresponding to each expression image.
In a possible implementation manner, the alignment network includes a first alignment network and a second alignment network, the prompt word includes a first prompt word and a second prompt word, and the obtaining unit 1301 is specifically configured to, when invoking the alignment network to process the first image feature vector and the prompt word to obtain a first content feature vector and a first intention feature vector corresponding to each expression image: invoke the first alignment network to process the first image feature vector and the first prompt word to obtain a first content feature vector corresponding to each expression image, the first prompt word being used for prompting the first alignment network to describe the content of the expression image; and invoke the second alignment network to process the first image feature vector and the second prompt word to obtain a first intention feature vector corresponding to each expression image, the second prompt word being used for prompting the second alignment network to describe the expression intention of the expression image.
In one possible implementation manner, the searching unit 1303 is specifically configured to, when searching the target expression image matched with the information to be processed from the image database: determining a first vector distance between a first content feature vector corresponding to each expression image in the image database and the first text feature vector, and determining a second vector distance between a first intention feature vector corresponding to each expression image in the image database and the first text feature vector; and determining the target expression image matched with the information to be processed based on the first vector distance and the second vector distance.
In one possible implementation manner, the searching unit 1303 is specifically configured to, when determining, based on the first vector distance and the second vector distance, a target expression image that matches the information to be processed: calculating the first vector distance and the second vector distance to obtain a third vector distance corresponding to each expression image; arranging third vector distances in order from small to large, and determining the expression images corresponding to the first N third vector distances as target expression images matched with the information to be processed; the N is a positive integer.
In one possible implementation manner, the searching unit 1303 is specifically configured to, when searching the target expression image matched with the information to be processed from the image database: calling a first alignment network to process the first text feature vector to obtain a fourth content feature vector; invoking a second alignment network to process the first text feature vector to obtain a fifth intention feature vector; determining a third vector distance between a first content feature vector and a fourth content feature vector corresponding to each expression image in the image database, and determining a fourth vector distance between a first intention feature vector and a fifth intention feature vector corresponding to each expression image in the image database; and determining the target expression image matched with the information to be processed based on the third vector distance and the fourth vector distance.
In a possible implementation manner, the device further includes a display unit, where the display unit is configured to display, in a first interface, a target expression image matched with the information to be processed, where the first interface is any one of the following interfaces: an image searching interface, an input method interface and an expression image recommending interface.
In a possible implementation, the display unit is further configured to: displaying an expression image adding interface, wherein the expression image adding interface comprises an expression image frame, a custom label frame and a storage option; the device also comprises a processing unit, wherein the processing unit is used for storing a first expression image added in the expression image frame and a first label of the first expression image added in the custom label frame when the triggering operation of the user for the storage option is detected; and if the information to be processed is the same as the first label, determining the first expression image as the target expression image.
For the case where the image searching apparatus is a chip or a chip system, reference may be made to the schematic structural diagram of the chip shown in fig. 14. The chip 1400 shown in fig. 14 includes a processor 1401 and an interface 1402. Optionally, a memory 1403 may also be included. The number of processors 1401 may be one or more, and the number of interfaces 1402 may be more than one.
For the case where the chip is used to implement the electronic device in the embodiment of the present application:
the interface 1402 is configured to receive or output a signal;
the processor 1401 is configured to perform data processing operations of the electronic device.
It can be understood that, in some scenarios, some optional features of the embodiments of the present application may be implemented independently of other features, for example independently of the scheme on which they are currently based, so as to solve corresponding technical problems and achieve corresponding effects; in other scenarios, they may be combined with other features as required. Accordingly, the image searching apparatus provided in the embodiments of the present application may also implement these features or functions accordingly, which will not be described herein.
It should be appreciated that the processor in embodiments of the present application may be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the above method embodiments may be implemented by integrated logic circuits of hardware in a processor or instructions in software form. The processor may be a general purpose processor, a digital signal processor (digital signal processor, DSP), an application specific integrated circuit (application specific integrated circuit, ASIC), a field programmable gate array (field programmable gate array, FPGA) or other programmable logic device, discrete gate or transistor logic device, discrete hardware components.
It will be appreciated that the memory in embodiments of the application may be volatile memory or nonvolatile memory, or may include both volatile and nonvolatile memory. The nonvolatile memory may be a read-only memory (ROM), a programmable ROM (PROM), an erasable PROM (EPROM), an electrically erasable PROM (EEPROM), or a flash memory. The volatile memory may be random access memory (random access memory, RAM), which acts as an external cache. By way of example, and not limitation, many forms of RAM are available, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), and direct rambus RAM (DR RAM). It should be noted that the memory of the systems and methods described herein is intended to comprise, without being limited to, these and any other suitable types of memory.
The present application also provides a computer readable storage medium having stored therein a computer program comprising program instructions for performing the functions of any of the method embodiments described above when the program instructions are run on an image finding device.
The application also provides a computer program product which, when run on a computer, causes the computer to carry out the functions of any of the method embodiments described above.
In the above embodiments, the implementation may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the processes or functions described in accordance with embodiments of the present application are produced in whole or in part. The computer may be a general purpose computer, a special purpose computer, a computer network, or other programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another computer-readable storage medium, for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by a wired (e.g., coaxial cable, fiber optic, digital subscriber line (digital subscriber line, DSL)) or wireless (e.g., infrared, wireless, microwave, etc.). The computer readable storage medium may be any available medium that can be accessed by a computer or a data storage device such as a server, data center, etc. that contains an integration of one or more available media. The usable medium may be a magnetic medium (e.g., a floppy disk, a hard disk, a magnetic tape), an optical medium (e.g., a high-density digital video disc (digital video disc, DVD)), or a semiconductor medium (e.g., a Solid State Disk (SSD)), or the like.
The foregoing is merely illustrative of the present application, and the present application is not limited thereto, and any person skilled in the art will readily recognize that variations or substitutions are within the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (9)

1. An image finding method, the method comprising:
acquiring a plurality of expression images locally collected by a user;
invoking an image encoder to encode each expression image in the plurality of expression images to obtain a first image feature vector corresponding to each expression image;
invoking a first alignment network to process the first image feature vector and a first prompt word to obtain a first content feature vector corresponding to each expression image; the first prompt word is used for prompting the first alignment network to describe the content of the expression image;
invoking a second alignment network to process the first image feature vector and a second prompt word to obtain a first intention feature vector corresponding to each expression image; the second prompt word is used for prompting the second alignment network to describe the expression intention of the expression image;
constructing an expression image database by utilizing the plurality of expression images, the first content feature vector and the first intention feature vector corresponding to each expression image; the first content feature vector represents a feature vector describing the content of the expression image, and the first intention feature vector represents a feature vector describing the expression intention of the expression image;
determining a first text feature vector of information to be processed;
searching a target expression image matched with the information to be processed from the expression image database, wherein a first content feature vector and a first intention feature vector corresponding to the target expression image are matched with the first text feature vector;
the searching the target expression image matched with the information to be processed from the expression image database comprises the following steps:
invoking the first alignment network to process the first text feature vector to obtain a fourth content feature vector;
invoking the second alignment network to process the first text feature vector to obtain a fifth intention feature vector;
determining a third vector distance between a first content feature vector and the fourth content feature vector corresponding to each expression image in the expression image database, and determining a fourth vector distance between a first intention feature vector and the fifth intention feature vector corresponding to each expression image in the expression image database;
and determining the target expression image matched with the information to be processed based on the third vector distance and the fourth vector distance.
2. The method according to claim 1, wherein the searching the target expression image matched with the information to be processed from the expression image database comprises:
determining a first vector distance between a first content feature vector corresponding to each expression image in the expression image database and the first text feature vector, and determining a second vector distance between a first intention feature vector corresponding to each expression image in the expression image database and the first text feature vector;
and determining the target expression image matched with the information to be processed based on the first vector distance and the second vector distance.
3. The method according to claim 2, wherein the determining the target expression image matched with the information to be processed based on the first vector distance and the second vector distance comprises:
calculating the first vector distance and the second vector distance to obtain a third vector distance corresponding to each expression image;
arranging the third vector distances in ascending order, and determining the expression images corresponding to the first N third vector distances as target expression images matched with the information to be processed, wherein N is a positive integer.
4. A method according to any one of claims 1-3, characterized in that the method further comprises:
displaying the target expression image matched with the information to be processed in a first interface, wherein the first interface is any one of the following interfaces: an image searching interface, an input method interface and an expression image recommending interface.
5. A method according to any one of claims 1-3, characterized in that the method further comprises:
displaying an expression image adding interface, wherein the expression image adding interface comprises an expression image frame, a custom label frame and a storage option;
when a triggering operation by the user on the saving option is detected, saving a first expression image added in the expression image frame and a first label, added in the custom label frame, of the first expression image;
and if the information to be processed is the same as the first label, determining the first expression image as the target expression image.
6. An electronic device, comprising: one or more processors and one or more memories; wherein the one or more memories are coupled to the one or more processors, and the one or more memories are configured to store computer program code comprising computer instructions that, when executed by the one or more processors, cause the electronic device to perform the method of any one of claims 1-5.
7. A chip, comprising a processor and an interface, the processor and the interface being coupled; wherein the interface is configured to receive or output signals, and the processor is configured to execute code instructions to cause the method of any one of claims 1-5 to be performed.
8. A computer readable storage medium, characterized in that the computer readable storage medium has stored therein a computer program comprising program instructions which, when run on an electronic device, cause the electronic device to perform the method of any one of claims 1-5.
9. A computer program product, characterized in that the computer program product, when run on a computer, causes the computer to perform the method according to any of claims 1-5.
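For illustration only, the retrieval flow recited in claims 1-3 and 5 can be sketched in Python as follows. This is a minimal, non-authoritative sketch: the ImageEncoder and AlignNet callables, the wording of the two prompt words, the use of Euclidean distance, and the summation used to fuse the content distance and the intention distance are assumptions made for readability, not details fixed by the claims.

# Illustrative sketch only; the encoder, alignment networks, prompt wording,
# distance metric, and distance fusion below are assumptions, not the patented models.
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence

import numpy as np

Vector = np.ndarray
ImageEncoder = Callable[[bytes], Vector]              # expression image -> first image feature vector
AlignNet = Callable[[Vector, Optional[str]], Vector]  # (feature vector, prompt word or None) -> aligned vector

CONTENT_PROMPT = "describe the content of this expression image"    # assumed wording of the first prompt word
INTENT_PROMPT = "describe the expression intention of this image"   # assumed wording of the second prompt word


@dataclass
class IndexedExpression:
    image: bytes                   # a locally collected expression image
    content_vec: Vector            # first content feature vector
    intent_vec: Vector             # first intention feature vector
    label: Optional[str] = None    # optional user-defined label (claim 5)


def build_database(images: Sequence[bytes],
                   image_encoder: ImageEncoder,
                   content_net: AlignNet,
                   intent_net: AlignNet) -> List[IndexedExpression]:
    """Encode every locally collected expression image and store both feature vectors."""
    database: List[IndexedExpression] = []
    for img in images:
        img_vec = image_encoder(img)  # first image feature vector
        database.append(IndexedExpression(
            image=img,
            content_vec=content_net(img_vec, CONTENT_PROMPT),
            intent_vec=intent_net(img_vec, INTENT_PROMPT),
        ))
    return database


def search(text_vec: Vector,
           database: Sequence[IndexedExpression],
           content_net: AlignNet,
           intent_net: AlignNet,
           top_n: int = 5,
           raw_text: Optional[str] = None) -> List[IndexedExpression]:
    """Return the N expression images whose combined distance to the query is smallest."""
    # Claim 5: an exact match against a user-defined label short-circuits the vector search.
    if raw_text is not None:
        exact = [entry for entry in database if entry.label == raw_text]
        if exact:
            return exact

    # Project the first text feature vector through the same alignment networks
    # (claim 1 recites no prompt word on the text side, hence None).
    query_content = content_net(text_vec, None)  # "fourth" content feature vector
    query_intent = intent_net(text_vec, None)    # "fifth" intention feature vector

    def combined_distance(entry: IndexedExpression) -> float:
        d_content = float(np.linalg.norm(entry.content_vec - query_content))
        d_intent = float(np.linalg.norm(entry.intent_vec - query_intent))
        return d_content + d_intent  # fusing by simple addition is an assumption

    # Ascending order: a smaller distance means a closer match; keep the first N.
    return sorted(database, key=combined_distance)[:top_n]

With text_vec produced by whatever model supplies the first text feature vector of the information to be processed (a text encoder, not shown above), a call such as search(text_vec, database, content_net, intent_net, top_n=3) would return the three closest expression images; passing raw_text additionally enables the label shortcut of claim 5.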
CN202311184754.2A 2023-09-14 2023-09-14 Image searching method and electronic equipment Active CN117076702B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311184754.2A CN117076702B (en) 2023-09-14 2023-09-14 Image searching method and electronic equipment

Publications (2)

Publication Number Publication Date
CN117076702A CN117076702A (en) 2023-11-17
CN117076702B true CN117076702B (en) 2023-12-15

Family

ID=88711727

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311184754.2A Active CN117076702B (en) 2023-09-14 2023-09-14 Image searching method and electronic equipment

Country Status (1)

Country Link
CN (1) CN117076702B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109034203A (en) * 2018-06-29 2018-12-18 北京百度网讯科技有限公司 Training, expression recommended method, device, equipment and the medium of expression recommended models
CN110598037A (en) * 2019-09-23 2019-12-20 腾讯科技(深圳)有限公司 Image searching method, device and storage medium
CN110597963A (en) * 2019-09-23 2019-12-20 腾讯科技(深圳)有限公司 Expression question-answer library construction method, expression search method, device and storage medium
KR20210078927A (en) * 2019-12-19 2021-06-29 주식회사 카카오 Method for providing emoticons in instant messaging service, user device, server and application implementing the method
CN116431855A (en) * 2023-06-13 2023-07-14 荣耀终端有限公司 Image retrieval method and related equipment

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021042763A1 (en) * 2019-09-03 2021-03-11 Guangdong Oppo Mobile Telecommunications Corp., Ltd. Image searches based on word vectors and image vectors

Also Published As

Publication number Publication date
CN117076702A (en) 2023-11-17

Similar Documents

Publication Publication Date Title
WO2023125335A1 (en) Question and answer pair generation method and electronic device
CN112269853B (en) Retrieval processing method, device and storage medium
CN113515942A (en) Text processing method and device, computer equipment and storage medium
CN113806473A (en) Intention recognition method and electronic equipment
CN112989767B (en) Medical term labeling method, medical term mapping device and medical term mapping equipment
CN111881315A (en) Image information input method, electronic device, and computer-readable storage medium
CN111460231A (en) Electronic device, search method for electronic device, and medium
US20210405767A1 (en) Input Method Candidate Content Recommendation Method and Electronic Device
CN112182255A (en) Method and apparatus for storing media files and for retrieving media files
CN112287070B (en) Method, device, computer equipment and medium for determining upper and lower relation of words
CN115691486A (en) Voice instruction execution method, electronic device and medium
CN114281936A (en) Classification method and device, computer equipment and storage medium
CN113742460B (en) Method and device for generating virtual roles
CN112416984B (en) Data processing method and device
KR20190061824A (en) Electric terminal and method for controlling the same
CN117076702B (en) Image searching method and electronic equipment
CN116227629A (en) Information analysis method, model training method, device and electronic equipment
US12118985B2 (en) Electronic device and method for providing on-device artificial intelligence service
CN118057355A (en) Answer generation method, device and storage medium
CN116861066A (en) Application recommendation method and electronic equipment
CN116821321A (en) Conference summary generation method and electronic equipment
CN114281937A (en) Training method of nested entity recognition model, and nested entity recognition method and device
CN113655933A (en) Text labeling method and device, storage medium and electronic equipment
CN116709339B (en) Detection method of application notification message and electronic equipment
CN116304146B (en) Image processing method and related device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant