Criminal investigations depend heavily on the speed and accuracy of suspect identification, which is traditionally supported by manual composite sketching. These traditional sketching techniques have many limitations. To address this and produce high-quality, realistic forensic facial sketches, we introduce a Stable Diffusion based face generator. This is a step-by-step guide to converting and running the core model of this generator on Ryzen™ AI Software.
The core component of this architecture is a Stable Diffusion model. We will use the FaceGen model created by findnitai on Hugging Face, converted to ONNX format before running it on the Vitis AI Execution Provider (EP). Additionally, we can use LLM and speech-to-text models for input pre-processing (for audio inputs and prompt generation).
Setting Up The NPU Development Environment
NPU Available Devices
The list of Ryzen AI engine enabled devices can be found at https://github.com/amd/RyzenAI-SW/issues/18 . I will be using a Minisforum UM790 Pro (7940HS) Mini PC.
Enable The NPU
After confirming your device is a Ryzen AI engine enabled device, make sure your NPU is enabled. In Windows:
- Go to Device Manager --> System Devices
- Look for the AMD IPU Device
- If it does not appear in the list, you’ll need to enable the device in the BIOS
Enable NPU in BIOS
Source: https://www.hackster.io/512342/amd-pervasive-ai-developer-contest-pc-ai-study-guide-6b49d8
Install NPU Drivers
- Download NPU Drivers from AMD Website
- Extract the downloaded zip file.
- Open a terminal in administrator mode and execute the .\amd_install_kipudrv.bat bat file.
- Check the installation from Device Manager -> System Devices -> AMD IPU Device
Install Dependencies (for Ryzen AI SW 1.1)
When installing Visual Studio 2019, remember to install the Python Development and Desktop Development with C++ extensions (from Tools --> Get Tools and Features).
Remember to set the path variables for your software (Control Panel -> Advanced System Settings -> Environment Variables -> Edit -> add).
Download the Ryzen AI software from the AMD website. Extract the zip and run .\install.bat -env <env name>. Change <env name> to your desired name. You can use conda activate <env name> to activate your environment.
Remember to activate your Ryzen environment when running or developing the code in this project.
To test the installation, run the quicktest provided by AMD.
- Activate the Ryzen environment
conda activate <env name>
- Go to ryzen-ai-sw-1.1\quicktest and run quicktest.py with
cd ryzen-ai-sw-1.1\quicktest
python quicktest.py
- If everything is correctly installed, you will get
[Vitis AI EP] No. of Operators : CPU 2 IPU 398 99.50%
[Vitis AI EP] No. of Subgraphs : CPU 1 IPU 1 Actually running on IPU 1
...
Test Passed
...
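You can also confirm from Python that the Vitis AI execution provider is visible to ONNX Runtime (a quick sanity check inside the activated environment):
import onnxruntime as ort

# 'VitisAIExecutionProvider' should appear in this list if the Ryzen AI setup is working
print(ort.get_available_providers())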
One approach to running your models on Ryzen AI is to convert them to ONNX and quantize them before running them in an ONNX Runtime inference session. Hugging Face Optimum provides tools to make this process much easier.
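If Optimum and its ONNX Runtime extras are not installed in your Ryzen AI environment yet, they can be added with (suggested command):
pip install optimum[onnxruntime]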
Exporting a Model to ONNX
To export an existing model, load it into the appropriate ONNX Runtime class and set export=True.
from optimum.onnxruntime import ORTModelForXXXXXXX
# Give the location of your pre-trained model (e.g. model.bin)
model_id = "Location/to/your/model"
# Load the model into the appropriate runtime class and export it to ONNX
ort_model = ORTModelForXXXXXXX.from_pretrained(model_id, export=True)
# Save the exported model
ort_model.save_pretrained("Location/to/save/converted/onnx/model")
Available ORTModels are
##########################################################################
"modeling_ort":
"ORTModel",
"ORTModelForAudioClassification",
"ORTModelForAudioFrameClassification",
"ORTModelForAudioXVector",
"ORTModelForCustomTasks",
"ORTModelForCTC",
"ORTModelForFeatureExtraction",
"ORTModelForImageClassification",
"ORTModelForMaskedLM",
"ORTModelForMultipleChoice",
"ORTModelForQuestionAnswering",
"ORTModelForSemanticSegmentation",
"ORTModelForSequenceClassification",
"ORTModelForTokenClassification"
##########################################################################
"modeling_seq2seq":
"ORTModelForSeq2SeqLM",
"ORTModelForSpeechSeq2Seq",
"ORTModelForVision2Seq",
"ORTModelForPix2Struct"
##########################################################################
"modeling_decoder": ["ORTModelForCausalLM"]
"optimization": ["ORTOptimizer"]
"quantization": ["ORTQuantizer"
After exporting and saving the model, you will find the .onnx file(s) at the given location.
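As a concrete illustration of the pattern above (the checkpoint name here is only an example), a sequence classification model could be exported like this:
from optimum.onnxruntime import ORTModelForSequenceClassification

# Example checkpoint; replace with your own model
model_id = "distilbert-base-uncased-finetuned-sst-2-english"
ort_model = ORTModelForSequenceClassification.from_pretrained(model_id, export=True)
ort_model.save_pretrained("./distilbert_onnx")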
Quantization
You can use the ONNX Runtime quantization tools provided by the optimum.onnxruntime package to quantize your model.
You can quantize your model using the Optimum CLI:
optimum-cli onnxruntime quantize --help
usage: optimum-cli <command> [<args>] onnxruntime quantize [-h] --onnx_model ONNX_MODEL -o OUTPUT [--per_channel] (--arm64 | --avx2 | --avx512 | --avx512_vnni | --tensorrt | -c CONFIG)
options:
-h, --help show this help message and exit
--arm64 Quantization for the ARM64 architecture.
--avx2 Quantization with AVX-2 instructions.
--avx512 Quantization with AVX-512 instructions.
--avx512_vnni Quantization with AVX-512 and VNNI instructions.
--tensorrt Quantization for NVIDIA TensorRT optimizer.
-c CONFIG, --config CONFIG
`ORTConfig` file to use to optimize the model.
Required arguments:
--onnx_model ONNX_MODEL
Path to the repository where the ONNX models to quantize are located.
-o OUTPUT, --output OUTPUT
Path to the directory where to store generated ONNX model.
Optional arguments:
--per_channel Compute the quantization parameters on a per-channel basis.
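For example, quantizing an exported model for AVX-512 VNNI could look like this (paths are placeholders):
optimum-cli onnxruntime quantize --onnx_model ./onnx_model --avx512_vnni -o ./quantized_model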
Alternatively, you can create an ORTQuantizer to quantize your model:
from optimum.onnxruntime import ORTModelForXXXXXXX
from optimum.onnxruntime import ORTQuantizer
from optimum.onnxruntime import AutoQuantizationConfig
# Give the location of your ONNX model (e.g. model.onnx)
model_id = "Location/to/your/model"
# Load the model into the appropriate runtime class
ort_model = ORTModelForXXXXXXX.from_pretrained(model_id)
# Create a quantizer
quantizer = ORTQuantizer.from_pretrained(ort_model)
# Specify a quantization config based on your requirements,
# e.g. dynamic quantization targeting AVX-512 VNNI:
dqconfig = AutoQuantizationConfig.avx512_vnni(is_static=False, per_channel=False)
# For static quantization (is_static=True), provide or create a calibration
# dataset and pass it via the calibration_dataset argument.
quantizer.quantize(
    save_dir="path/to/output/model",
    quantization_config=dqconfig,
)
If your exported model consists of multiple ONNX files, you have to quantize them individually.
encoder_quantizer = ORTQuantizer.from_pretrained(model_dir, file_name="encoder_model.onnx")
decoder_quantizer = ORTQuantizer.from_pretrained(model_dir, file_name="decoder_model.onnx")
u_net_quantizer = ORTQuantizer.from_pretrained(model_dir, file_name="u_net_model.onnx")
quantizers = [encoder_quantizer, decoder_quantizer, u_net_quantizer]
dqconfig = AutoQuantizationConfig.avx512_vnni(is_static=False, per_channel=False)
for quant in quantizers:
    quant.quantize(save_dir="./quantized", quantization_config=dqconfig)
Source: https://huggingface.co/docs/optimum/en/onnxruntime/usage_guides/quantization
FaceGen ONNX Conversion
Conversion
Start by activating your Ryzen AI environment.
conda activate <env name>
Install the following dependencies:
- onnxruntime
- transformers
- optimum
- diffusers
pip install onnxruntime
pip install transformers
pip install optimum
pip install diffusers
Now we download the required SD model, findnitai/FaceGen, from Hugging Face. Clone the repository with
git clone https://huggingface.co/findnitai/FaceGen
Alternatively, you can let the model be auto-downloaded during the conversion process. Before conversion we need a few additional files: go to https://huggingface.co/runwayml/stable-diffusion-v1-5, download the feature_extractor and safety_checker folders, and move them to the FaceGen folder.
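One way to fetch just those two folders without cloning the entire Stable Diffusion v1-5 repository is the huggingface_hub snapshot_download helper (assuming a recent huggingface_hub; the target path is illustrative):
from huggingface_hub import snapshot_download

# Download only the feature_extractor and safety_checker folders into the FaceGen directory
snapshot_download(
    "runwayml/stable-diffusion-v1-5",
    allow_patterns=["feature_extractor/*", "safety_checker/*"],
    local_dir="FaceGen",
)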
Use the following code to convert the model to ONNX.
from optimum.onnxruntime import ORTStableDiffusionPipeline
model_id = "<download location>/FaceGen"
# export=True converts the original pipeline to ONNX
pipeline = ORTStableDiffusionPipeline.from_pretrained(model_id, export=True)
# Don't forget to save the ONNX model
save_directory = "Location/to/save/converted/onnx/model"
pipeline.save_pretrained(save_directory)
# Quick test of the exported pipeline
prompt = "old man wearing a hat"
image = pipeline(prompt).images[0]
When Stable Diffusion models are exported to the ONNX format using Optimum, they are split into four components, which are combined again during the inference session (a typical output layout is shown after the list).
- The text encoder
- The VAE encoder
- The VAE decoder
- U-NET
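After save_pretrained, the output directory should look roughly like this (exact contents may vary with the Optimum version):
FaceGen-ONNX/
  model_index.json
  feature_extractor/
  scheduler/
  tokenizer/
  text_encoder/model.onnx
  unet/model.onnx
  vae_decoder/model.onnx
  vae_encoder/model.onnx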
Find more info at https://huggingface.co/docs/optimum/en/onnxruntime/usage_guides/models
The converted model can be found in my Hugging Face repository: logicbomb95/FaceGen-ONNX .
Inference
class InferenceSession(
path_or_bytes: str | bytes | PathLike,
sess_options: Sequence | None = None,
providers: Sequence[str] | None = None,
provider_options: Sequence[dict[Any, Any]] | None = None,
**kwargs: Any
)
:param path_or_bytes: Filename or serialized ONNX or ORT format model in a byte string.
:param sess_options: Session options.
:param providers: Optional sequence of providers in order of decreasing
precedence. Values can either be provider names or tuples of (provider name, options dict). If not provided, then all available providers are used with the default precedence.
:param provider_options: Optional sequence of options dicts corresponding
to the providers listed in 'providers'.
The model type will be inferred unless explicitly set in the SessionOptions. To explicitly set:
so = onnxruntime.SessionOptions()
# so.add_session_config_entry('session.load_model_format', 'ONNX') or
so.add_session_config_entry('session.load_model_format', 'ORT')
A file extension of '.ort' will be inferred as an ORT format model. All other filenames are assumed to be ONNX format models.
'providers' can contain either names or names and options. When any options are given in 'providers', 'provider_options' should not be used.
The list of providers is ordered by precedence. For example ['CUDAExecutionProvider', 'CPUExecutionProvider'] means execute a node using CUDAExecutionProvider if capable, otherwise execute using CPUExecutionProvider.
Here we will use providers=['VitisAIExecutionProvider'] as the execution provider and provider_options=[{"config_file":"path to vaip_config.json"}] (the vaip_config.json file) as the provider options.
eg:-
vae_decoder_session = ort.InferenceSession(main_dir+"/vae_decoder/model.onnx", providers=['VitisAIExecutionProvider'], provider_options=[{"config_file":vaip_config}])
vaip_config.json can be found in the ryzen-ai-sw-1.1\ryzen-ai-sw-1.1\voe-4.0-win_amd64 folder.
We will be using "openai/clip-vit-base-patch16"
as the tokenizer.
For the scheduler, we will use the following config:
scheduler_config = {
"_class_name": "PNDMScheduler",
"_diffusers_version": "0.27.2",
"beta_end": 0.012,
"beta_schedule": "scaled_linear",
"beta_start": 0.00085,
"num_train_timesteps": 1000,
"prediction_type": "epsilon",
"set_alpha_to_one": False,
"skip_prk_steps": True,
"steps_offset": 1,
"timestep_spacing": "leading",
"trained_betas": None
}
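The scheduler itself is then instantiated directly from this dictionary (as also done in the full script further below):
from diffusers.schedulers import PNDMScheduler

scheduler = PNDMScheduler(**scheduler_config)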
Since the model is split into several .onnx files, we need to create an ORT inference session for each file.
# Load the ONNX models for VAE decoder, text encoder, and U-NET
vae_decoder_session = ort.InferenceSession(main_dir+r"\vae_decoder\model.onnx", providers=['VitisAIExecutionProvider'],
provider_options=[{"config_file":vaip_config}])
text_encoder_session = ort.InferenceSession(main_dir+r"\text_encoder\model.onnx", providers=['VitisAIExecutionProvider'],
provider_options=[{"config_file":vaip_config}])
unet_session = ort.InferenceSession(main_dir+r"\unet\model.onnx",
providers=['VitisAIExecutionProvider'],
provider_options=[{"config_file":vaip_config}])
# Load the tokenizer
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch16")
You can now run the code to generate images from given prompts. The pipeline below only contains the basic arguments; you can add more arguments to get better results.
pipeline = ORTStableDiffusionPipeline(
vae_decoder_session=vae_decoder_session,
text_encoder_session=text_encoder_session,
unet_session=unet_session,
config={},
tokenizer=tokenizer,
scheduler=scheduler, # Provide the scheduler
# Provide additional arguments if necessary
)
Here's the list of additional arguments you can use; an example call follows the list.
prompt: Optional
height: Optional[int]
width: Optional[int]
num_inference_steps: int
guidance_scale: float
negative_prompt: Optional
num_images_per_prompt:
eta: float
generator: Optional
latents: Optional
prompt_embeds: Optional
negative_prompt_embeds: Optional
output_type: str
return_dict: bool
callback: Optional
callback_steps: int
guidance_rescale: float
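For example, a call using a few of these arguments might look like this (the values are illustrative; tune them to your use case):
image = pipeline(
    prompt="portrait photo of an old man wearing a hat",
    negative_prompt="blurry, distorted, low quality",
    height=512,
    width=512,
    num_inference_steps=30,
    guidance_scale=7.5,
).images[0]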
Using a similar approach, we can perform image-to-image generation as well.
pipeline = ORTStableDiffusionImg2ImgPipeline(
vae_decoder_session=vae_decoder_session,
text_encoder_session=text_encoder_session,
unet_session=unet_session,
config={},
tokenizer=tokenizer,
scheduler=scheduler, # Provide the scheduler
# Provide additional arguments if necessary
)
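Note that image-to-image generation also needs the exported VAE encoder to encode the initial image, so pass the VAE encoder session as well if your Optimum version's constructor accepts it (the argument name vae_encoder_session is an assumption here). A minimal usage sketch, assuming the standard diffusers-style image and strength arguments:
from PIL import Image

# Hypothetical initial sketch; any RGB image resized to the model resolution will do
init_image = Image.open("initial_sketch.png").convert("RGB").resize((512, 512))
result = pipeline(
    prompt="old man wearing a hat",
    image=init_image,
    strength=0.7,  # how strongly the initial image is altered
    num_inference_steps=30,
).images[0]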
Here's the complete code to run the converted Stable Diffusion model on Ryzen™ AI Software.
import onnxruntime as ort
from transformers import CLIPTokenizer
from optimum.onnxruntime import ORTStableDiffusionPipeline
from diffusers.schedulers import PNDMScheduler
main_dir=r"Your model directory"
vaip_config = "location/to/vaip_config.json"
# Load the ONNX models for VAE decoder, text encoder, and U-NET
vae_decoder_session = ort.InferenceSession(main_dir+r"\vae_decoder\model.onnx", providers=['VitisAIExecutionProvider'],
provider_options=[{"config_file":vaip_config}])
text_encoder_session = ort.InferenceSession(main_dir+r"\text_encoder\model.onnx", providers=['VitisAIExecutionProvider'],
provider_options=[{"config_file":vaip_config}])
unet_session = ort.InferenceSession(main_dir+r"\unet\model.onnx",
providers=['VitisAIExecutionProvider'],
provider_options=[{"config_file":vaip_config}])
# Load the tokenizer
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch16")
# Provided scheduler configuration dictionary
scheduler_config = {
"_class_name": "PNDMScheduler",
"_diffusers_version": "0.27.2",
"beta_end": 0.012,
"beta_schedule": "scaled_linear",
"beta_start": 0.00085,
"num_train_timesteps": 1000,
"prediction_type": "epsilon",
"set_alpha_to_one": False,
"skip_prk_steps": True,
"steps_offset": 1,
"timestep_spacing": "leading",
"trained_betas": None
}
# Instantiate the PNDMScheduler with the provided configuration
scheduler = PNDMScheduler(**scheduler_config)
# Instantiate the ORTStableDiffusionPipeline with the scheduler
pipeline = ORTStableDiffusionPipeline(
vae_decoder_session=vae_decoder_session,
text_encoder_session=text_encoder_session,
unet_session=unet_session,
config={}, # Provide any necessary config
tokenizer=tokenizer,
scheduler=scheduler, # Provide the scheduler
# Provide additional arguments if necessary
)
# Provide the prompt
prompt = "super car on the road"
# Run inference
generated_image = pipeline(prompt).images[0]
# Display and save the generated image
import matplotlib.pyplot as plt
plt.imshow(generated_image)
plt.axis("off")
#plt.savefig("generated_image.png")
plt.show()
For more info, see https://huggingface.co/docs/diffusers/v0.13.0/en/stable_diffusion
Supplementary Models - Llama 2
Instead of directly providing prompts to the face generator, we can use an LLM to extract information from a description and generate the required prompts. Here we will use the model preparation workflow provided by AMD to prepare Llama 2.
Preparation
Go to your RyzenAI-SW\example\transformers folder and create a conda environment for the LLM.
cd RyzenAI-SW\example\transformers
conda env create --file=env.yaml
conda activate ryzenai-transformers
For quantization, download the precomputed scales from https://huggingface.co/datasets/mit-han-lab/awq-model-zoo.
You only need the files relevant to your LLM; in this case we will download the files relevant to Llama 2.
Move them to RyzenAI-SW\example\transformers\ext\awq_cache. Now go to RyzenAI-SW\example\transformers and run setup.bat to set up the environment variables.
cd RyzenAI-SW\example\transformers\
setup.bat
You need to run this file every time before running your model. It sets the following environment variables:
SET PWD=%~dp0
SET THIRD_PARTY=%PWD%\third_party
SET TVM_LIBRARY_PATH=%THIRD_PARTY%\lib;%THIRD_PARTY%\bin
SET PATH=%PATH%;%TVM_LIBRARY_PATH%;%PWD%\ops\cpp\;%THIRD_PARTY%
SET PYTORCH_AIE_PATH=%PWD%
SET PYTHONPATH=%PYTHONPATH%;%TVM_LIBRARY_PATH%;%THIRD_PARTY%
SET PYTHONPATH=%PYTHONPATH%;%PWD%\ops\python
SET PYTHONPATH=%PYTHONPATH%;%PWD%\onnx-ops\python
SET PYTHONPATH=%PYTHONPATH%;%PWD%\tools
SET PYTHONPATH=%PYTHONPATH%;%PWD%\ext\smoothquant\smoothquant
SET PYTHONPATH=%PYTHONPATH%;%PWD%\ext\smoothquant\smoothquant
SET PYTHONPATH=%PYTHONPATH%;%PWD%\ext\llm-awq
SET PYTHONPATH=%PYTHONPATH%;%PWD%\ext\llm-awq\awq\quantize
SET PYTHONPATH=%PYTHONPATH%;%PWD%\ext\llm-awq\awq\utils
SET PYTHONPATH=%PYTHONPATH%;%PWD%\ext\llm-awq\awq\kernels
SET AWQ_CACHE=%PWD%\ext\awq_cache\
set XRT_PATH=%THIRD_PARTY%\xrt-ipu
set TARGET_DESIGN=
set DEVICE=phx
set XLNX_VART_FIRMWARE=%PWD%/xclbin/phx
Then build dependencies using
pip install ops\cpp --force-reinstall
Now download Llama 2 chat from https://huggingface.co/meta-llama/Llama-2-7b-chat-hf (you need to request permission from Meta).
RyzenAI-SW expects your Llama 2 directory to follow this structure: /llama-2-wts-hf/7B_chat .
If not, manually set the directory in run_awq.py.
Quantization
After setting up the transformer environment, go to the llama2 folder.
cd models/llama2/
Run the run_awq.py script in CMD with one of these commands to quantize the model.
AWQ - lm_head runs in BF16
python run_awq.py --w_bit 4 --task quantize
AWQ with flash attention - lm_head runs in BF16
python run_awq.py --w_bit 4 --task quantize --flash_attention
AWQ + quantized lm_head
python run_awq.py --w_bit 4 --task quantize --lm_head
AWQ + quantized lm_head, with flash attention
python run_awq.py --w_bit 4 --task quantize --lm_head --flash_attention
The quantized model will be saved as pytorch_llama27b_w_bit_4_awq_fa_lm_amd.pt.
Inference
To check the quantized model, you can use the decode task in run_awq.py:
python run_awq.py --task decode --target aie --w_bit 4
Use the following code to build a simple chatbot that runs on the NPU.
import torch
import logging
import time
from transformers import set_seed
from transformers import LlamaTokenizer
import qlinear
set_seed(123)
model_dir = "path/to/llama directory"
ckpt = "path/to/converted/model/pytorch_llama27b_w_bit_4_awq_fa_lm_amd.pt"
tokenizer = LlamaTokenizer.from_pretrained(model_dir)
print(f"Loading from ckpt: {ckpt}")
model = torch.load(ckpt)
model = model.to(torch.bfloat16)
for n, m in model.named_modules():
    if isinstance(m, qlinear.QLinearPerGrp):
        print(f"Preparing weights of layer : {n}")
        m.device = "aie"
        m.quantize_weights()

def decode_prompt(model, tokenizer, prompt, input_ids=None, max_new_tokens=30):
    if input_ids is None:
        print(f"prompt: {prompt}")
        start = time.time()
        inputs = tokenizer(prompt, return_tensors="pt")
        end = time.time()
    else:
        start, end = 0, 0
    print("Input Setup - Elapsed Time: " + str(end - start))
    prompt_tokens = 0
    if prompt is None:
        start = time.time()
        generate_ids = model.generate(input_ids, max_new_tokens=max_new_tokens)
        end = time.time()
        prompt_tokens = input_ids.shape[1]
    else:
        start = time.time()
        generate_ids = model.generate(inputs.input_ids, max_length=50)
        end = time.time()
        prompt_tokens = inputs.input_ids.shape[1]
    num_tokens_out = generate_ids.shape[1]
    new_tokens_generated = num_tokens_out - prompt_tokens
    generate_time = (end - start)
    print("Generation - Elapsed Time: " + str(generate_time))
    start = time.time()
    response = tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
    end = time.time()
    print(f"response: {response}")
    print("Tokenizer Decode - Elapsed Time: " + str(end - start))

logging.disable(logging.CRITICAL)

def main():
    while True:
        prompt = input("You: ")
        if prompt.lower() in ["exit", "quit", "q"]:
            print("Exiting chat...")
            break
        decode_prompt(model, tokenizer, prompt)

if __name__ == "__main__":
    main()
Check out the chatbot demo video.
This quantized model can be used to extract information from a description and generate the required prompts for the Stable Diffusion pipeline.
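For example, a hypothetical prompt template built around the decode_prompt helper above could turn a witness description into a concise prompt for the face generator:
# Hypothetical glue code: ask the chatbot model to condense a description into an SD prompt
description = "He was an older man, around sixty, with a grey beard, deep wrinkles and a black hat."
llm_prompt = (
    "Extract the facial features from the following description and write a short, "
    "comma-separated prompt for a face image generator:\n" + description
)
decode_prompt(model, tokenizer, llm_prompt)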
The converted model can be found in my Hugging Face repository: logicbomb95/Llama-2-7b-chat-hf-AMD-NPU .
Supplementary Models - Whisper-Small
Another alternative input method for the face generator is a verbal description of the face, which is then converted to text using an STT model. Here we use whisper-small. Conveniently, an INT8 ONNX model is available at Intel/whisper-small-int8-dynamic-inc. Download the model with
git clone https://huggingface.co/Intel/whisper-small-int8-dynamic-inc
Since the model is already converted and quantized, we can go directly to inference. Run the model in your Ryzen AI conda environment. Here is example code for testing.
import os
import torchaudio
import time
from transformers import WhisperProcessor, PretrainedConfig
from optimum.onnxruntime import ORTModelForSpeechSeq2Seq
model_name = 'openai/whisper-small'
model_path = 'path/to/the/model'  # local clone of Intel/whisper-small-int8-dynamic-inc
vaip_config = 'location/to/vaip_config.json'
processor = WhisperProcessor.from_pretrained(model_name)
# Path to the actual audio file
audio_file_path = 'testaudio.wav'  # Update this with the path to your audio file
# Load the audio file
waveform, sample_rate = torchaudio.load(audio_file_path)
# Load the ONNX models (encoder, decoder, decoder with past) on the Vitis AI EP
sessions = ORTModelForSpeechSeq2Seq.load_model(
    os.path.join(model_path, 'encoder_model.onnx'),
    os.path.join(model_path, 'decoder_model.onnx'),
    os.path.join(model_path, 'decoder_with_past_model.onnx'),
    providers=['VitisAIExecutionProvider'],
    provider_options=[{"config_file": vaip_config}])
model_config = PretrainedConfig.from_pretrained(model_name)
model = ORTModelForSpeechSeq2Seq(sessions[0], sessions[1], model_config, model_path, sessions[2])
# Process the audio
input_features = processor(waveform.numpy(), sampling_rate=sample_rate, return_tensors="pt").input_features
# Measure time taken for generation
start_time = time.time()
predicted_ids = model.generate(input_features)[0]
end_time = time.time()
generation_time = end_time - start_time
transcription = processor.decode(predicted_ids)
prediction = processor.tokenizer.normalize(transcription)
# Calculate token speed
num_tokens = len(predicted_ids)
token_speed = num_tokens / generation_time
print("Transcription:", transcription)
print("Normalized Prediction:", prediction)
print("Time taken to generate:", generation_time, "seconds")
print("Token Speed:", token_speed, "tokens per second")
Conclusion
Now we have all the core and supplementary models required for the face generator. All of the models can be connected to the UI using an API (WIP). The UI is developed using Next.js and can be found in the GitHub repository. Planned next steps:
- Facial Features Finetuning - Refiner Stage
- Better Prompt Generation (Using LLM and STT)
- Output Moderation Stage
- Different Quantization Methods for Faster Inference