Accelerate AI: Introducing OpenVINO™ 2023.0
Learn what to expect from OpenVINO™ 2023.0 in this article breaking down new features you can use in your designs.
This year, we are celebrating OpenVINO™’s 5-year anniversary with a brand-new release of the AI inferencing toolkit: OpenVINO 2023.0. This latest release comes with a range of new features and capabilities designed to empower developers to achieve more by making it easier than ever to deploy and accelerate AI workloads.
But before we get into everything you need to know about this release, we'd like to take a moment to express our deepest appreciation for you, our developer community. It has been an incredible journey thus far, and we could not have done it without you. Thanks to your continued loyalty and support over the past five years, OpenVINO has reached more than 1 million downloads.
Thank you for being part of this community. We look forward to seeing all the amazing things you do with this new release.
Now, what to expect in 2023.0!
The Top Features
The 2023.0 version of OpenVINO was designed to improve the developer journey through minimizing offline conversions, broadening model support, and advancing hardware optimizations. Below are the top highlights AI developers need to know about this release, but you can find the full release notes here.
Model Selection: These new features minimize code changes, allowing AI developers to adopt, maintain, and align code better with deep learning frameworks.
- New TensorFlow integration: To simplify the workflow from training to deployment of TensorFlow models.
- Now on Conda Forge: To provide easier OpenVINO Runtime access for C++ developers who prefer Conda.
- Broader processor support: OpenVINO CPU inferencing is now supported on ARM processors, with dynamic shapes, full processor performance, and broad sample code/notebook tutorial coverage.
- Extended Python support: Added support for Python 3.11 for more potential performance improvements.
Optimize: With this next set of additions, AI developers can optimize and deploy more models with ease, including NLP models, and access more AI acceleration through new hardware feature capabilities.
- Broader model support: Support for generative AI models, text processing models, transformer models, etc. is now available.
- Dynamic shapes support on GPU: This means there is no need to reshape models to static shapes when leveraging GPU, providing more flexibility in coding, especially for NLP models (see the sketch after this list).
- NNCF as the quantization tool of choice: We've merged post-training quantization (POT) into the Neural Network Compression Framework (NNCF) to make it easier to unlock tremendous performance improvements through model compression.
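As a quick illustration of the dynamic shapes support mentioned above, here is a minimal sketch; the IR path and the choice of a dynamic second input dimension are hypothetical placeholders.

import openvino.runtime as ov

core = ov.Core()
model = core.read_model("text_model.xml")  # hypothetical IR with a [batch, sequence] input
# Keep the sequence dimension dynamic instead of reshaping to a fixed length
model.reshape({0: ov.PartialShape([1, -1])})
compiled_model = core.compile_model(model, "GPU")  # dynamic shapes now compile and run on GPU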
Deploy: These capabilities are designed to give solutions an immediate performance boost with automatic device discovery, load balancing and dynamic inference parallelism across CPU, GPU and more.
- Thread scheduling in CPU plugin: AI developers can now optimize for performance or power saving by running inference on E-cores, P-cores, or both for 12th Gen Intel® Core™ CPU and up.
- Default inference precision: Devices now default to their highest-performance precision (FP16 on GPU, BF16 on CPU where available) to deliver optimal performance without manual conversion.
- Extended model caching: To reduce first-inference latency for both GPU and CPU (see the sketch after this list).
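For the model caching feature mentioned above, enabling the cache is a one-line property. Here is a minimal sketch, assuming a local ./model_cache directory and a hypothetical model.xml.

import openvino.runtime as ov

core = ov.Core()
core.set_property({"CACHE_DIR": "./model_cache"})  # compiled blobs are stored and reused from here
model = core.read_model("model.xml")               # hypothetical IR file
compiled_model = core.compile_model(model, "GPU")  # first run compiles and caches; later runs load from the cache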
Explore OpenVINO 2023.0's latest capabilities
Let’s take a deeper look into some of these new features introduced above and what exactly they mean for AI developers:
New TensorFlow Experience
OpenVINO 2023.0 enables TensorFlow developers to move from training to deployment more easily.
With this feature, there is no longer a need to convert TensorFlow format model files to OpenVINO IR format offline — it happens automatically at runtime.
Now, developers can experiment with the --use_new_frontend option passed to Model Optimizer or the Model Conversion API to enjoy improved conversion time for a limited scope of models, or load a standard TensorFlow model directly in OpenVINO Runtime or OpenVINO Model Server for deployment. (Currently, the SavedModel format and the binary frozen format .pb are supported.) We recommend leveraging OpenVINO Runtime for even more performance benefits with model compression, but developers now have options based on their needs.
The following diagram shows a simple example:
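Alongside the diagram, here is a minimal code sketch of loading a TensorFlow model directly at runtime; the SavedModel path shown is a hypothetical placeholder.

import openvino.runtime as ov

core = ov.Core()
# Read a TensorFlow SavedModel (or frozen .pb) directly; no offline conversion to IR is needed
model = core.read_model("my_saved_model")  # hypothetical path to a SavedModel directory
compiled_model = core.compile_model(model, "CPU")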
Broader Model Support
With new model support additions, AI developers now have extended support for generative AI models such as Stable Diffusion 2.0 (Figure 2) and Stable Diffusion with ControlNet (Figure 3); text processing models; transformer models such as CLIP, BLIP, S-BERT, and GPT-J; as well as Detectron2, PaddleSlim, Segment Anything Model (SAM) (Figure 4), YOLOv8, RNN-T, and more.
Default inference precision
The latest update also brings a significant improvement in inference performance: devices now operate in a high-performance mode by default. For GPU devices this means FP16 inference, while CPU devices use BF16 inference when available (Figure 5). Previously, users had to convert the IR to FP16 themselves to enable FP16 execution on GPU. Now, every device selects its default inference precision automatically, and this selection is decoupled from the IR precision. In the rare event that high-performance mode impacts accuracy, users can adjust the inference precision hint.
Additionally, developers can now control the IR precision separately. By default, we recommend setting it to FP16 to reduce the model size by 2x for floating-point models. It is important to note that IR precision does not affect how devices execute the model but serves to compress the model by reducing the weight precision.
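If the default high-performance precision does affect accuracy, the inference precision hint can be overridden per device. Below is a minimal sketch of doing so from Python; the model.xml path is a placeholder, and the string property name and value format ("INFERENCE_PRECISION_HINT", "f32") are assumptions worth verifying against the documentation.

import openvino.runtime as ov

core = ov.Core()
model = core.read_model("model.xml")  # hypothetical IR file
# Override the GPU default (FP16) and force full-precision execution
compiled_model = core.compile_model(model, "GPU", {"INFERENCE_PRECISION_HINT": "f32"})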
Neural Network Compression Framework (NNCF) is the quantization tool of choice
Previously, OpenVINO had separate tools for post-training optimization (POT) and quantization-aware training. We've combined both methods into NNCF, which helps reduce model size, memory footprint, and latency, as well as improve computational efficiency.
NNCF provides a suite of advanced algorithms for Neural Networks inference optimization in OpenVINO with minimal accuracy drop. It is designed to work with models from PyTorch, TensorFlow, ONNX and OpenVINO (Figure 6).
The post-training quantization algorithm takes samples from the representative dataset, inputs them into the network, and calibrates the network based on the resulting weights and activation values. Once calibration is complete, values in the network are converted to 8-bit integer format. The basic POT quantization flow in NNCF is the simplest way to apply 8-bit quantization to the model:
- Set up an environment and install dependencies.
pip install nncf
- Prepare the calibration dataset.
import nncf
import torch

calibration_loader = torch.utils.data.DataLoader(...)

def transform_fn(data_item):
    images, _ = data_item
    return images

calibration_dataset = nncf.Dataset(calibration_loader, transform_fn)
- Run nncf.quantize() to get a quantized model.
model = ...  # OpenVINO / ONNX / PyTorch / TensorFlow model object
quantized_model = nncf.quantize(model, calibration_dataset)
Tutorials on how to use NNCF for model quantization and compression can be found here. We have validated post-training quantization on a YOLOv5 model with little accuracy drop (Figure 7).
Thread scheduling on Intel 12th Gen Core and up
OpenVINO 2023.0 improves multi-thread scheduling on Intel® platforms.
With the new ov::hint::scheduling_core_type property, developers can configure for performance or power saving on 12th Gen Intel® Core™ hybrid platforms and newer by choosing where inference runs: ov::hint::SchedulingCoreType::ANY_CORE, ov::hint::SchedulingCoreType::PCORE_ONLY, or ov::hint::SchedulingCoreType::ECORE_ONLY.
By setting the ov::hint::enable_hyper_threading property to true, both physical and logical cores can be enabled on the P-cores of Intel® platforms.
Another new property is ov::hint::enable_cpu_pinning. By default it is set to true, which means the threads running inference requests of multiple deep learning models are scheduled by OpenVINO Runtime (TBB). In this mode, inference of one deep learning model is treated as an overall graph, and each of its threads is pinned to a CPU core, avoiding cache misses and additional overhead. However, when inference runs simultaneously for two neural networks, threads from different inference requests can be scheduled on the same CPU cores, leading to competition for the same processor resources (as shown in Figure 9).
To avoid this competition, processor binding can be disabled by setting ov::hint::enable_cpu_pinning to false, letting the operating system schedule processor resources for each thread of the network. In this mode, inference on different layers of the same deep learning model may be switched across processors, resulting in cache misses and additional overhead (as shown in Figures 10 and 11). Developers can decide whether to enable CPU pinning based on their own validation results.
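As an illustration, here is a minimal sketch of setting these hints from the Python API. The model.xml path is a placeholder, and the string property names and values used here (SCHEDULING_CORE_TYPE, ENABLE_HYPER_THREADING, ENABLE_CPU_PINNING, PCORE_ONLY) are assumptions that should be checked against the documentation; the C++ property names above are the authoritative ones.

import openvino.runtime as ov

core = ov.Core()
model = core.read_model("model.xml")  # placeholder IR file
compiled_model = core.compile_model(
    model,
    "CPU",
    {
        "SCHEDULING_CORE_TYPE": "PCORE_ONLY",  # assumed string form of ov::hint::scheduling_core_type
        "ENABLE_HYPER_THREADING": False,       # assumed string form of ov::hint::enable_hyper_threading
        "ENABLE_CPU_PINNING": True,            # assumed string form of ov::hint::enable_cpu_pinning
    },
)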
Upgrade to OpenVINO™ 2023.0
With these latest features, OpenVINO aims to get the most out of your AI application from start to finish. With your continued support, we can produce valuable upgrades for AI developers everywhere. And with its smart and comprehensive capabilities, it can be like having your very own performance engineer by your side.
But enough about what OpenVINO can do for you. Try it out for yourself and upgrade using the following command:
pip install --upgrade openvino-dev
Be sure to check all your dependencies, because the upgrade may update other packages beyond OpenVINO. If you wish to install the C/C++ API, pull a pre-built Docker image, or download from another repository, visit the download page to find a package that suits your needs. If you are looking for model serving instructions, check out the new documentation.
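Once the upgrade completes, a quick sanity check from Python confirms which runtime version is installed; this is a minimal snippet assuming the Python package is available.

from openvino.runtime import get_version

print(get_version())  # should report a 2023.0 build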
Additional Resources
Provide Feedback & Report Issues
Notices & Disclaimers
Intel technologies may require enabled hardware, software or service activation.
No product or component can be absolutely secure.
Your costs and results may vary.
© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others.
A special thanks to everyone who participated in this blog:
Zhuo Wu, Ethan Yang, Adrian Boguszewski, Anisha Udayakumar, Yiwei Lee, Stephanie Maluso, Raymond Lo, Ryan Loney, Ansley Dunn, Wanglei Shen