AR Multimodal Intelligent Virtual Character (With Gemini API)

An AI virtual character project that is rendered through AR, combines multimodal large models, and is built with Unity AR Foundation. It lets you interact with a virtual character in the real environment, enriching the interactive experience with AI and making AI tangible. It has the following features:

  1. Supports Android/iOS; the Android platform has been verified
  2. Supports large models such as OpenAI, Baidu, and Gemini Pro; additional models can be added
  3. TTS & STT: supports Azure, Baidu, etc., and can be customized
  4. Supports the latest Gemini Pro visual understanding
  5. Automatically takes photos and understands the environment based on voice prompts
  6. Supports AR/VR mode switching: automatically detects the phone's posture and switches between AR and VR mode
  7. Explanation animation and facial expression animation

Most of the LLM and TTS & STT code comes from Unity-AI-Chat-Toolkit. The character model used in the project comes from Ready Player Me. The following is a demonstration video:
AR Intelligent Avatar (With Gemini API)

Install

Development environment

  1. Unity 2021.3.28 (future versions should also work)
  2. AR Foundation 4.2.9 (already included in the project, no installation required)
  3. Windows 11 (verified)

Install

git clone https://github.com/hillday/AIRAgentChat.git

Open the project directory in Unity Hub; Unity will automatically create the project and download the relevant dependency packages. The following shows the normal development interface.

Functional modules

VR/AR mode automatic switching

In AR mode, the virtual character interacts with the user in the real environment; in VR mode, it interacts with the user in a virtual environment. The design takes into account that holding the phone up in AR mode is tiring, and the VR experience is better with the phone lowered, so the system automatically switches modes according to changes in the phone's posture. The logic is to check the phone's rotation angle around the X-axis: within [0, 25] degrees it is AR mode, otherwise it is VR mode. The code is as follows:

    // Scene references used below (declarations inferred from usage; assigned in the Inspector).
    [SerializeField] Camera m_MainCamera;     // the AR camera
    [SerializeField] GameObject m_VRPano;     // panorama object shown in VR mode
    [SerializeField] GameObject m_ARSpace;    // root of the AR content

    // Tilting the phone beyond this angle around the X-axis switches to VR mode.
    [SerializeField] float m_EnterVRAngle = 25.0f;
    private float _lastEulerX = 0.0f;

    private void IsEnterVRSpace()
    {
        if (m_MainCamera.enabled)
        {
            Vector3 euler = m_MainCamera.transform.rotation.eulerAngles;
            // Pitch outside [0, 25] degrees (and its wrap-around range) => VR mode.
            if (euler.x > m_EnterVRAngle && euler.x < 360 - m_EnterVRAngle)
            {
                // Show the VR panorama if it is not already active.
                if (!m_VRPano.activeSelf)
                    m_VRPano.SetActive(true);
                // Follow the phone's pitch once it has changed by more than 1 degree.
                if (euler.x - _lastEulerX > 1)
                {
                    m_ARSpace.transform.rotation = Quaternion.Euler(euler.x, 0, 0);
                    _lastEulerX = euler.x;
                }
            }
            else
            {
                // Back within the AR range: hide the panorama and reset the AR space.
                if (m_VRPano.activeSelf)
                    m_VRPano.SetActive(false);

                if (_lastEulerX > 0.0f)
                {
                    m_ARSpace.transform.rotation = Quaternion.Euler(0, 0, 0);
                    _lastEulerX = 0.0f;
                }
            }
        }
    }

LLM

The demonstration video uses Google's latest [Gemini Pro API](https://cloud.google.com/vertex-ai/docs/generative-ai/model-reference/gemini), which provides both a chat model and a visual-understanding model. They are two different models, and the current visual-understanding model does not support chat mode, so the two need to be combined during use.
Other LLMs, such as OpenAI and Baidu, are also supported and can be configured as needed.
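
For reference, the sketch below shows how the two models could be called over HTTP from Unity. It assumes the public generativelanguage.googleapis.com REST endpoint and the model names gemini-pro and gemini-pro-vision; the class and method names are illustrative, and the project's actual integration comes from Unity-AI-Chat-Toolkit.

    using System.Collections;
    using System.Text;
    using UnityEngine;
    using UnityEngine.Networking;

    // Illustrative sketch, not the project's actual code. Assumes the public
    // generativelanguage.googleapis.com REST endpoint and the model names
    // "gemini-pro" (chat) and "gemini-pro-vision" (image understanding).
    public class GeminiClientSketch : MonoBehaviour
    {
        [SerializeField] string m_ApiKey = "YOUR_API_KEY";

        // Text-only request against the chat model.
        public IEnumerator AskText(string prompt, System.Action<string> onDone)
        {
            string url = $"https://generativelanguage.googleapis.com/v1beta/models/gemini-pro:generateContent?key={m_ApiKey}";
            // Note: prompt should be JSON-escaped in real use.
            string body = "{\"contents\":[{\"parts\":[{\"text\":\"" + prompt + "\"}]}]}";
            yield return Post(url, body, onDone);
        }

        // Text + image request against the vision model (no chat history supported).
        public IEnumerator AskVision(string prompt, byte[] jpegBytes, System.Action<string> onDone)
        {
            string url = $"https://generativelanguage.googleapis.com/v1beta/models/gemini-pro-vision:generateContent?key={m_ApiKey}";
            string image = System.Convert.ToBase64String(jpegBytes);
            string body = "{\"contents\":[{\"parts\":[{\"text\":\"" + prompt + "\"},"
                        + "{\"inline_data\":{\"mime_type\":\"image/jpeg\",\"data\":\"" + image + "\"}}]}]}";
            yield return Post(url, body, onDone);
        }

        private IEnumerator Post(string url, string json, System.Action<string> onDone)
        {
            using (var req = new UnityWebRequest(url, "POST"))
            {
                req.uploadHandler = new UploadHandlerRaw(Encoding.UTF8.GetBytes(json));
                req.downloadHandler = new DownloadHandlerBuffer();
                req.SetRequestHeader("Content-Type", "application/json");
                yield return req.SendWebRequest();
                // Raw JSON response; the reply text is at candidates[0].content.parts[0].text.
                onDone(req.downloadHandler.text);
            }
        }
    }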

TTS&STT

To have a voice conversation with the LLM, the input speech is first converted to text and sent to the LLM, and the resulting reply text is then converted back into speech for playback. The demonstration video uses Azure's HTTP API; APIs from other platforms can also be configured.
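
As a reference for the shape of such a call, below is a minimal sketch of an Azure text-to-speech request from Unity. The region, voice name, and output format are assumptions; the project's actual TTS/STT implementation comes from Unity-AI-Chat-Toolkit.

    using System.Collections;
    using System.Text;
    using UnityEngine;
    using UnityEngine.Networking;

    // Illustrative sketch, not the project's actual code. Region, voice and
    // output format are assumptions; replace with your own values.
    public class AzureTtsSketch : MonoBehaviour
    {
        [SerializeField] string m_Region = "eastus";
        [SerializeField] string m_Key = "YOUR_SPEECH_KEY";
        [SerializeField] string m_Voice = "zh-CN-XiaoxiaoNeural";

        // Sends SSML to the Azure TTS REST endpoint and returns the raw WAV bytes.
        public IEnumerator Synthesize(string text, System.Action<byte[]> onAudio)
        {
            string url = $"https://{m_Region}.tts.speech.microsoft.com/cognitiveservices/v1";
            string ssml = "<speak version='1.0' xml:lang='zh-CN'>" +
                          $"<voice name='{m_Voice}'>{text}</voice></speak>";

            using (var req = new UnityWebRequest(url, "POST"))
            {
                req.uploadHandler = new UploadHandlerRaw(Encoding.UTF8.GetBytes(ssml));
                req.downloadHandler = new DownloadHandlerBuffer();
                req.SetRequestHeader("Ocp-Apim-Subscription-Key", m_Key);
                req.SetRequestHeader("Content-Type", "application/ssml+xml");
                req.SetRequestHeader("X-Microsoft-OutputFormat", "riff-16khz-16bit-mono-pcm");
                yield return req.SendWebRequest();
                // WAV bytes; convert to an AudioClip for playback by the avatar.
                onAudio(req.downloadHandler.data);
            }
        }
    }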

Semantic understanding to automatically take photos

The LLM triggers photo capture automatically based on the user's voice instructions. The principle is to give the LLM prior background knowledge describing which system functions are available and what they do, and to let it decide, based on the conversation, whether a function call is needed. When it is, the LLM returns a specific instruction, and the system intercepts that instruction and makes the function call. The camera-related prompt is as follows:

    // Prior prompt (Chinese). It tells the LLM that an external photo function exists with
    // code F0001, to reply "系统功能#F0001" when a photo is needed (e.g. "take a photo",
    // "what do you see"), and otherwise to reply normally in the form "非系统功能#<answer>".
    [SerializeField] protected string m_Prompt = "你具有调用外部系统的能力,现在外部系统有拍照功能,代码为F0001,在交流的过程中请根据场景需要返回功能代码调用外部系统,比如当说拍个照/帮忙分析一下图像/你看到了什么的时候返回调用拍照功能,返回格式为:系统功能#F0001,不需要调用系统功能的时候,请和我正常交流,返回格式为:非系统功能#你的回答。";
    [SerializeField] private string m_PromptFuncSign = "系统功能#";      // marker: system function call
    [SerializeField] private string m_PromptNotFuncSign = "非系统功能#"; // marker: normal answer
    [SerializeField] private string m_PromptForVision = "描述一下这张图片中的内容,需要详细一些,包括看到的对象,相关的知识,历史等。"; // prompt sent with the captured image: "describe this image in detail"

In testing, in most scenarios (for example, asking it to take a photo or to look at the surrounding environment), the model understands the request and returns 系统功能#F0001.
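
A sketch of how the returned instruction could be intercepted is shown below. The method name and the CapturePhotoAndAskVision/Speak helpers are hypothetical, not the project's actual code; the marker fields are the ones defined above.

    // Illustrative sketch: branch on the reply markers defined above.
    // HandleLLMReply, CapturePhotoAndAskVision and Speak are hypothetical names.
    private void HandleLLMReply(string reply)
    {
        if (reply.StartsWith(m_PromptFuncSign))
        {
            // "系统功能#F0001" => take a photo and send it to the vision model.
            string code = reply.Substring(m_PromptFuncSign.Length).Trim();
            if (code.Contains("F0001"))
                StartCoroutine(CapturePhotoAndAskVision(m_PromptForVision));
        }
        else if (reply.StartsWith(m_PromptNotFuncSign))
        {
            // "非系统功能#<answer>" => strip the marker and hand the answer to TTS.
            string answer = reply.Substring(m_PromptNotFuncSign.Length);
            Speak(answer);
        }
    }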

Expression animation

Facial expression animation is implemented through ARKit blendshapes. Ideally, the blendshape parameters would be driven by the voice in real time so that the face looks natural while speaking. It is currently driven by fixed parameters, so it does not look very natural and the mouth shape cannot match the speech. We are developing a technique to automatically generate blendshape parameters from speech and will switch to it once it is ready.
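
For illustration, here is a minimal sketch of driving an ARKit-style blendshape with fixed parameters while the TTS audio plays. The component, the blendshape name "jawOpen", and the use of an AudioSource are assumptions, not the project's actual implementation.

    using UnityEngine;

    // Illustrative sketch: oscillates one ARKit-style blendshape while speech audio plays.
    // The blendshape name and the AudioSource reference are assumptions.
    public class SimpleMouthAnimator : MonoBehaviour
    {
        [SerializeField] SkinnedMeshRenderer m_FaceMesh;  // mesh carrying the ARKit blendshapes
        [SerializeField] AudioSource m_Voice;             // TTS playback source
        [SerializeField] string m_BlendshapeName = "jawOpen";
        [SerializeField] float m_Speed = 8f;              // oscillation speed
        [SerializeField] float m_MaxWeight = 70f;         // maximum blendshape weight (0-100)

        private int _index = -1;

        void Start()
        {
            _index = m_FaceMesh.sharedMesh.GetBlendShapeIndex(m_BlendshapeName);
        }

        void Update()
        {
            if (_index < 0) return;
            // Fixed sine-driven mouth movement while the voice clip is playing.
            float weight = m_Voice.isPlaying
                ? (Mathf.Sin(Time.time * m_Speed) * 0.5f + 0.5f) * m_MaxWeight
                : 0f;
            m_FaceMesh.SetBlendShapeWeight(_index, weight);
        }
    }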

Follow-up

Feature optimization

  1. Speech-driven blendshape generation

Version control

This project uses Git for version management. You can see the currently available versions in the repository.

References

Ready Player Me

ARKit Face Blendshapes

unity-AI-Chat-Toolkit

arfoundation

Author

qchunhai
