Imagine having a lifelike digital avatar that not only looks like a real human but also sounds and behaves like one. This is no longer the realm of science fiction, but a reality made possible by the cutting-edge technology of NVIDIA Jetson AGX Orin Developer Kit, NVIDIA Omniverse, and Unreal Engine.
In this project, we'll dive into the world of photorealistic talking avatars and explore how these technologies come together to create a truly immersive experience. Basically, we can use the avatar to build conversational agents, virtual assistants, chatbots, and more. Whether you're a developer or tech enthusiast, this project is for you.
Architecture
Let me first show you what the overall architecture looks like:
Here's how the system works:
- The system starts with an audio input from a microphone.
- This input is captured by an OSC Client running on an NVIDIA Jetson AGX Orin developer kit.
- The client utilizes Nvidia Riva TTS for text-to-speech technology, Nvidia Riva ASR for speech recognition, and a large language model for inference.
- The processed audio data is then passed through a Python application called llamaspeak, developed by Dustin Franklin (@dusty-nv), to the OSC server.
- The OSC server uses Audio2Face to generate facial animations from the audio.
- Finally, Unreal Engine 5, running on a PC with an RTX graphics card, renders the final facial animation output.
This architecture showcases a complex workflow that transforms audio input into facial animations. It utilizes advanced technologies like speech recognition, text-to-speech, and lip sync, along with powerful computing hardware (NVIDIA Jetson AGX Orin) and sophisticated rendering software (Unreal Engine 5).
Nvidia Omniverse Audio2Face
Let's begin by discussing Nvidia Omniverse Audio2Face, a revolutionary technology that enables the creation of photorealistic avatars with lifelike facial expressions and movements. It is an AI-powered tool that generates facial animation for 3D characters based on audio input. It can process either pre-recorded speech or live audio feeds, making it versatile for various applications. The best thing about Omniverse Audio2Face is that it is easy to use. You simply need to provide an audio file or live audio feed, and the tool automatically generates realistic facial expressions in real time.
Moreover, this tool can be integrated with third-party rendering applications such as Unreal Engine 5. Developers can easily incorporate facial animations into Metahuman characters without any hassle, thereby simplifying the animation workflow.
Audio2Face is available for Windows and Linux users with an Nvidia RTX GPU. It can be downloaded from the Nvidia Omniverse platform.
- Download Omniverse Launcher from here.
- Once downloaded, double-click the installer file
- Click on the Exchange tab
- Search for Audio2Face in the Apps section
- Click the Install button for Audio2Face.
- Once installed, you can click Launch from the Exchange tab to open Audio2Face.
- Navigate to the following directory exts\omni.audio2face.exporter\config within the audio2face installation directory.
- Open the extension.toml file in a text editor.
- Append the following content to the bottom of the file:
[python.pipapi]
requirements=['python-osc']
use_online_index=true
- python-osc is a simple Open Sound Control implementation in Python, providing support for a simple OSC client and OSC server. Open Sound Control (OSC) is an open protocol commonly used in many areas of the audio industry. It is primarily used to network generic audio data between clients, although it can also be used for non-audio data.
- Run the code below to check whether the python-osc library has been correctly installed in the Audio2Face environment.
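A minimal check along these lines, run from Audio2Face's built-in Script Editor, is sufficient (this is an illustrative snippet; it only confirms that the package resolves inside Audio2Face's Python environment):

# Run inside Audio2Face's Script Editor (or any console using Audio2Face's Python interpreter).
# If the pipapi entry above worked, this prints the location of the installed package.
import pythonosc
from pythonosc import udp_client  # the module we will use in facsSolver.py

print("python-osc found at:", pythonosc.__file__)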
- Then navigate to the following directory: exts\omni.audio2face.exporter\omni\audio2face\exporter\scripts within the installation directory.
- Open the facsSolver.py script in a text editor. We will modify this script to establish a communication channel between Audio2Face and Unreal Engine using the Open Sound Control protocol.
- Within the script, add the following lines to create a UDP client (the import goes at the top of the file, and the client is typically created in the solver class's __init__ so it can be reused on every frame):
from pythonosc import udp_client
self.client = udp_client.SimpleUDPClient('127.0.0.1', 5008)
- Then add the code snippet below:
mh_ctl_list = [
['CTRL_expressions_browDownR', "browLowerR", 1.0],
['CTRL_expressions_browDownL', "browLowerL", 1.0],
['CTRL_expressions_browLateralR', "browLowerR", 1.0],
['CTRL_expressions_browLateralL', "browLowerL", 1.0],
['CTRL_expressions_browRaiseinR', "innerBrowRaiserR", 0.5],
['CTRL_expressions_browRaiseinL', "innerBrowRaiserL", 0.5],
['CTRL_expressions_browRaiseouterR', "outerBrowRaiserR", 0.5],
['CTRL_expressions_browRaiseouterL', "outerBrowRaiserL", 0.5],
['CTRL_expressions_eyeLookUpR', "eyesLookUp", 1.0, "eyesLookDown", -1.0],
['CTRL_expressions_eyeLookDownR', "eyesLookUp", 1.0, "eyesLookDown", -1.0],
['CTRL_expressions_eyeLookUpL', "eyesLookUp", 1.0, "eyesLookDown", -1.0],
['CTRL_expressions_eyeLookDownL', "eyesLookUp", 1.0, "eyesLookDown", -1.0],
['CTRL_expressions_eyeLookLeftR', "eyesLookLeft", 1.0, "eyesLookRight", -1.0],
['CTRL_expressions_eyeLookRightR', "eyesLookLeft", 1.0, "eyesLookRight", -1.0],
['CTRL_expressions_eyeLookLeftL', "eyesLookLeft", 1.0, "eyesLookRight", -1.0],
['CTRL_expressions_eyeLookRightL', "eyesLookLeft", 1.0, "eyesLookRight", -1.0],
['CTRL_expressions_eyeBlinkR', "eyesCloseR", 1.0, "eyesUpperLidRaiserR", -1.0],
['CTRL_expressions_eyeBlinkL', "eyesCloseL", 1.0, "eyesUpperLidRaiserL", -1.0],
['CTRL_expressions_eyeSquintinnerR', "squintR", 1.0],
['CTRL_expressions_eyeSquintinnerL', "squintL", 1.0],
['CTRL_expressions_eyeCheekraiseR', "cheekRaiserR", 1.0],
['CTRL_expressions_eyeCheekraiseL', "cheekRaiserL", 1.0],
['CTRL_expressions_mouthCheekSuckR', "cheekPuffR", 0.5],
['CTRL_expressions_mouthCheekBlowR', "cheekPuffR", 0.5],
['CTRL_expressions_mouthCheekSuckL', "cheekPuffL", 0.5],
['CTRL_expressions_mouthCheekBlowL', "cheekPuffL", 0.5],
['CTRL_expressions_noseNostrilDilateR', "noseWrinklerR", 1.0],
['CTRL_expressions_noseNostrilCompressR', "noseWrinklerR", 1.0],
['CTRL_expressions_noseWrinkleR', "noseWrinklerR", 1.0],
['CTRL_expressions_noseNostrilDepressR', "noseWrinklerR", 1.0],
['CTRL_expressions_noseNostrilDilateL', "noseWrinklerL", 1.0],
['CTRL_expressions_noseNostrilCompressL', "noseWrinklerL", 1.0],
['CTRL_expressions_noseWrinkleL', "noseWrinklerL", 1.0],
['CTRL_expressions_noseNostrilDepressL', "noseWrinklerL", 1.0],
['CTRL_expressions_jawOpen', "jawDrop", 1.0, "jawDropLipTowards", 0.6],
['CTRL_R_mouth_lipsTogetherU', "jawDropLipTowards", 1.0],
['CTRL_L_mouth_lipsTogetherU', "jawDropLipTowards", 1.0],
['CTRL_R_mouth_lipsTogetherD', "jawDropLipTowards", 1.0],
['CTRL_L_mouth_lipsTogetherD', "jawDropLipTowards", 1.0],
['CTRL_expressions_jawFwd', "jawThrust", -1.0],
['CTRL_expressions_jawBack', "jawThrust", -1.0],
['CTRL_expressions_jawRight', "jawSlideLeft", -1.0, "jawSlideRight", 1.0],
['CTRL_expressions_jawLeft', "jawSlideLeft", -1.0, "jawSlideRight", 1.0],
['CTRL_expressions_mouthLeft', "mouthSlideLeft", 0.5, "mouthSlideRight", -0.5],
['CTRL_expressions_mouthRight', "mouthSlideLeft", 0.5, "mouthSlideRight", -0.5],
['CTRL_expressions_mouthDimpleR', "dimplerR", 1.0],
['CTRL_expressions_mouthDimpleL', "dimplerL", 1.0],
['CTRL_expressions_mouthCornerPullR', "lipCornerPullerR", 1.0],
['CTRL_expressions_mouthCornerPullL', "lipCornerPullerL", 1.0],
['CTRL_expressions_mouthCornerDepressR', "lipCornerDepressorR", 1.0],
['CTRL_expressions_mouthCornerDepressL', "lipCornerDepressorL", 1.0],
['CTRL_expressions_mouthStretchR', "lipStretcherR", 1.0],
['CTRL_expressions_mouthStretchL', "lipStretcherL", 1.0],
['CTRL_expressions_mouthUpperLipRaiseR', "upperLipRaiserR", 1.0],
['CTRL_expressions_mouthUpperLipRaiseL', "upperLipRaiserL", 1.0],
['CTRL_expressions_mouthLowerLipDepressR', "lowerLipDepressorR", 1.0],
['CTRL_expressions_mouthLowerLipDepressL', "lowerLipDepressorL", 1.0],
['CTRL_expressions_jawChinRaiseDR', "chinRaiser", 1.0],
['CTRL_expressions_jawChinRaiseDL', "chinRaiser", 1.0],
['CTRL_expressions_mouthLipsPressR', "lipPressor", 1.0],
['CTRL_expressions_mouthLipsPressL', "lipPressor", 1.0],
['CTRL_expressions_mouthLipsTowardsUR', "pucker", 1.0],
['CTRL_expressions_mouthLipsTowardsUL', "pucker", 1.0],
['CTRL_expressions_mouthLipsTowardsDR', "pucker", 1.0],
['CTRL_expressions_mouthLipsTowardsDL', "pucker", 1.0],
['CTRL_expressions_mouthLipsPurseUR', "pucker", 1.0],
['CTRL_expressions_mouthLipsPurseUL', "pucker", 1.0],
['CTRL_expressions_mouthLipsPurseDR', "pucker", 1.0],
['CTRL_expressions_mouthLipsPurseDL', "pucker", 1.0],
['CTRL_expressions_mouthFunnelUR', "funneler", 1.0],
['CTRL_expressions_mouthFunnelUL', "funneler", 1.0],
['CTRL_expressions_mouthFunnelDL', "funneler", 1.0],
['CTRL_expressions_mouthFunnelDR', "funneler", 1.0],
['CTRL_expressions_mouthPressUR', "lipSuck", 1.0],
['CTRL_expressions_mouthPressUL', "lipSuck", 1.0],
['CTRL_expressions_mouthPressDR', "lipSuck", 1.0],
['CTRL_expressions_mouthPressDL', "lipSuck", 1.0]
]
facsNames = [
"browLowerL",
"browLowerR",
"innerBrowRaiserL",
"innerBrowRaiserR",
"outerBrowRaiserL",
"outerBrowRaiserR",
"eyesLookLeft",
"eyesLookRight",
"eyesLookUp",
"eyesLookDown",
"eyesCloseL",
"eyesCloseR",
"eyesUpperLidRaiserL",
"eyesUpperLidRaiserR",
"squintL",
"squintR",
"cheekRaiserL",
"cheekRaiserR",
"cheekPuffL",
"cheekPuffR",
"noseWrinklerL",
"noseWrinklerR",
"jawDrop",
"jawDropLipTowards",
"jawThrust",
"jawSlideLeft",
"jawSlideRight",
"mouthSlideLeft",
"mouthSlideRight",
"dimplerL",
"dimplerR",
"lipCornerPullerL",
"lipCornerPullerR",
"lipCornerDepressorL",
"lipCornerDepressorR",
"lipStretcherL",
"lipStretcherR",
"upperLipRaiserL",
"upperLipRaiserR",
"lowerLipDepressorL",
"lowerLipDepressorR",
"chinRaiser",
"lipPressor",
"pucker",
"funneler",
"lipSuck"
]
for i in range(len(mh_ctl_list)):
    ctl_value = 0
    numInputs = (len(mh_ctl_list[i]) - 1) // 2
    for j in range(numInputs):
        weightMat = outWeight.tolist()
        poseIdx = facsNames.index(mh_ctl_list[i][j * 2 + 1])
        ctl_value += weightMat[poseIdx] * mh_ctl_list[i][j * 2 + 2]
    print(mh_ctl_list[i][0], ctl_value)
    self.client.send_message('/' + mh_ctl_list[i][0], ctl_value)
return outWeight
Overall, this code snippet maps the solver's audio-driven blendshape weights to MetaHuman face control values and sends them to Unreal Engine over OSC to animate the 3D character's face. Each entry in mh_ctl_list pairs a MetaHuman control name with one or two Audio2Face poses and scale factors; for example, the CTRL_expressions_eyeLookUpR value is the eyesLookUp weight scaled by 1.0 plus the eyesLookDown weight scaled by -1.0.
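To sanity-check this stream before involving Unreal Engine, you can run a throwaway python-osc listener on the same port. This is only a debugging sketch, assuming the 127.0.0.1:5008 address configured above; stop it before enabling the OSC server in Unreal so the port is free:

# debug_osc_listener.py -- print the control stream coming from facsSolver.py
from pythonosc import dispatcher, osc_server

def print_control(address, *args):
    # Each message address is '/' plus a MetaHuman control name, e.g. /CTRL_expressions_jawOpen
    print(address, args[0] if args else None)

disp = dispatcher.Dispatcher()
disp.set_default_handler(print_control)

# Must match the IP and port passed to udp_client.SimpleUDPClient in facsSolver.py
server = osc_server.BlockingOSCUDPServer(("127.0.0.1", 5008), disp)
print("Listening for Audio2Face OSC messages on 127.0.0.1:5008 ...")
server.serve_forever()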
Then, switch back to the Audio2Face application and drag and drop the "male_bs_arkit.usd" file into the plugin window. This file contains the blendshape data for the male character's face.
Now, go to the A2F Data Conversion section of the Audio2Face plugin. This section allows you to configure the input animation data and blendshape mesh for your character.
In the "A2F Data Conversion" section, select Input Anim and Blendshape Mesh from the dropdown menu. This will tell Audio2Face to use the blendshape data from the "male_bs_arkit.usd" file to create a mesh that can be animated by the audio input.
Finally, click the Set up blendshape solve button to apply the blendshape data to the mesh and enable animation. After clicking this button, the blendshape mesh will begin to follow the audio-driven animation.
Play the audio file within Audio2Face to see the character's face respond with corresponding animations.
Unreal Engine and Metahumans
Unreal Engine is a powerful and versatile real-time 3D creation tool developed by Epic Games. It is primarily used for developing video games.
MetaHuman Creator is an online, user-friendly 3D design tool for creating highly realistic digital humans that can be animated within Unreal Engine. It allows anyone to create customized, photorealistic digital humans in a matter of minutes.
Open Unreal Engine and Create a new Unreal Engine project.
Then go to Edit-> Plugins and Enable the OSC server and Apple ARKit plugins.
- OSC plugin allows communication between applications and devices, making it crucial for receiving animation data from external sources.
- Apple ARKit enables augmented reality experiences within the engine, potentially enhancing your project.
The OSC Server acts as a listening endpoint for messages sent to the local instance of Unreal Engine.
Restart Unreal Engine. Then access Quixel Bridge.
Quixel Bridge is a bridge between Unreal Engine and various content libraries, including MetaHuman Creator.
Import your MetaHuman.
You can either import a MetaHuman you created in MetaHuman Creator or choose from over 50 pre-made models available in Quixel Bridge.
Press the Add button to import the MetaHuman model into the scene.
Drag your model from the Content Browser into your character Blueprint.
Then open the MetaHuman Blueprint. This involves creating visual scripts that define the character's movement and actions.
Blueprints are visual scripting tools within Unreal Engine that allow you to define the behavior and functionality of your characters.
Each Blueprint can contain one or more graphs, depending on its type, each defining the implementation of a particular aspect of the Blueprint. Here, we create an OSC server to receive input data from remote nodes on port 5008.
This Blueprint specifically handles the facial animation and control of your MetaHuman character. Compile and save the Animation Blueprint.
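Before running the full pipeline, it can help to confirm that the Blueprint's OSC handling works on its own. The short python-osc script below is a test sketch (not part of the original project code) that assumes the same 127.0.0.1:5008 target used earlier and drives the CTRL_expressions_jawOpen control from the mapping above, so you should see the MetaHuman's jaw move in the viewport while it runs:

# send_test_control.py -- drive a single MetaHuman control to verify the Blueprint wiring
import math
import time

from pythonosc import udp_client

# Same address and port that the Unreal OSC server listens on
# (adjust the IP if Unreal runs on another machine)
client = udp_client.SimpleUDPClient("127.0.0.1", 5008)

# Sweep the jaw open and closed for about ten seconds at roughly 30 updates per second
for step in range(300):
    value = 0.5 * (1.0 + math.sin(step * 0.2))  # oscillates between 0.0 and 1.0
    client.send_message("/CTRL_expressions_jawOpen", value)
    time.sleep(1.0 / 30.0)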
You've successfully integrated Metahumans into your Unreal Engine project. Now, we can animate them, add them to your scenes, and bring them to life within your project.
Putting it all together
llamaspeak is a web-based voice assistant developed by Dustin Franklin from Nvidia. It transcribes spoken speech with Riva ASR, processes it with a large language model, and generates spoken responses via Riva TTS, offering a ChatGPT-like conversational experience on NVIDIA Jetson boards.
We can now put all the pieces together and run it.
To run Audio2Face on a Windows machine with an RTX graphics card, follow these steps:
Open a command prompt on the Windows machine. Navigate to the directory containing the Audio2Face application by running the following command:
cd C:\Users\admin\AppData\Local\ov\pkg\audio2face-2022.2.1
Launch Audio2Face in headless mode by executing the following command:
audio2face_headless.kit.bat
Headless mode is a feature of Audio2Face that allows you to run your programs without needing a graphical user interface.
Upon execution of the command, you will observe several lines of output indicating the start-up of various extensions required for the functioning of Audio2Face. Once the startup process is complete, the Audio2Face pipeline will be ready to use.
Next, on Nvidia Jetson AGX Orin 64GB, log in to your Docker account by running the command:
docker login
Then, start the Nvidia Riva server by running the following command in the terminal:
bash riva
Execute the following commands to start the text-generation-webui service:
./run.sh --workdir /opt/text-generation-webui $(./autotag text-generation-webui:1.7) \
python3 server.py --listen --verbose --api \
--model-dir=/data/models/text-generation-webui \
--model=mistral-7b-instruct-v0.2.Q4_K_M.gguf \
--loader=llamacpp \
--n-gpu-layers=128 \
--n_ctx=4096 \
--n_batch=4096 \
--threads=$(($(nproc) - 2))
This will start the text-generation-webui service with the specified Mistral 7B Instruct model and configuration.
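Optionally, you can verify that the model is actually being served before starting llamaspeak. The snippet below assumes the OpenAI-compatible API that recent text-generation-webui builds expose on port 5000 when started with --api; adjust the host, port, or endpoint if your build differs:

# check_llm.py -- optional sanity check that text-generation-webui is serving the model
import requests

# Assumes text-generation-webui's OpenAI-compatible API on its default port 5000.
resp = requests.post(
    "http://127.0.0.1:5000/v1/chat/completions",
    json={
        "messages": [{"role": "user", "content": "Say hello in one short sentence."}],
        "max_tokens": 32,
    },
    timeout=120,
)
print(resp.json()["choices"][0]["message"]["content"])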
Run the following command to execute the run.sh script for the Llamaspeak application:
./run.sh --workdir=/opt/llamaspeak \
--env SSL_CERT=/data/cert.pem \
--env SSL_KEY=/data/key.pem \
$(./autotag llamaspeak) \
python3 chat.py --verbose
Afterward, you will be asked whether to pull or build a compatible container.
Found compatible container dustynv/llamaspeak:r35.4.1 (2023-12-05, 5.0GB) - would you like to pull it? [Y/n] n
Couldn't find a compatible container for llamaspeak, would you like to build it? [y/N]y
Respond y when prompted to build the container if it's not found.
running ASR service (en-US)
-- running TTS service (en-US, English-US.Female-1)
-- running AudioMixer thread
-- starting webserver @ 0.0.0.0:8050
-- running LLM service (mistral-7b-instruct-v0.2.Q4_K_M.gguf)
* Serving Flask app 'webserver'
* Debug mode: on
WARNING: This is a development server. Do not use it in a production deployment. Use a production WSGI server instead.
* Running on all addresses (0.0.0.0)
* Running on https://127.0.0.1:8050
* Running on https://10.1.199.193:8050
After a successful build, you'll see messages indicating the Audio Mixer thread, webserver, and LLM service running.
Open a web browser and navigate to https://127.0.0.1:8050 to access the llamaspeak web interface.
The audio input prompt will be processed by Llamaspeak and used in our Unreal Engine project on the Windows PC to drive the animation process. Unreal Engine on Windows receives animation data from the pipeline based on the processed audio and text.
Within the Unreal Engine tab on the Windows PC, click on the Play button to initiate the animation process.
Watch the avatar in action! In the demo below, you'll see a user interacting with a MetaHuman character in Unreal Engine.
As you saw in the demo, these MetaHumans aren't just visually stunning; they can also speak and interact with a user in real time.
The demo above shows a user interacting with a MetaHuman character in Unreal Engine using Nvidia Omniverse Audio2Face and the NVIDIA Jetson AGX Orin Developer Kit. Imagine creating chatbots, educational assistants, or even in-game characters that can hold conversations with users. Feel free to customize the application further based on your requirements.
I hope you found this project useful, and thanks for reading. If you have any questions or feedback, leave a comment below.
All the code referenced in this story is available in my GitHub repo.
Thanks and acknowledgements
- Thanks to Dustin Franklin from Nvidia (@dusty_nv) for his llamaspeak work, which was instrumental in making this project possible.
- This project was made possible through the support, guidance, and assistance of the staff of the Institute of Smart Systems and Artificial Intelligence. Special thanks to Huseyin Atakan Varol and Daniil Filimonov.
References
- Tutorial - llamaspeak - NVIDIA Jetson Generative AI Lab
- Omniverse Audio2Face: An End-to-End Platform For Customized 3D Pipelines
- Build an avatar with ASR, Sentence-transformer, Similarity Search, TTS and Omniverse Audio2Face
- AI-Powered Facial Animation — LiveLink with NVIDIA Audio2Face & Unreal Engine Metahuman
- Livestream to Metahumans in Unreal Engine