MicroZed Chronicles: NEON & SIMD


Adam Taylor
4 years ago

A few months ago, we examined how we could implement Single Instruction Multiple Data (SIMD) processing using DSP48s in programmable logic. Using SIMD in this manner in the PL enables us to obtain both optimal power performance and resource utilization.

In this blog, we are going to explore how we can achieve something similar using the SIMD / vector floating-point unit (NEON) in the processing system.

The NEON engine allows us to process large data sets in parallel using a single instruction. This is very useful for applications like image and audio processing, where algorithms require data sets to be processed using simple instructions (multiply, add, etc.) many times with little control code. In applications such as these, leveraging the SIMD unit can result in a significant increase in performance.

SIMD is not limited to image and audio processing; it also brings significant benefits to applications including:

  • Matrix multiplication
  • Error correction — e.g. Reed-Solomon
  • Elliptic-curve cryptography

At this point, you may wonder what the difference is between implementing the algorithm using SIMD on the NEON unit and accelerating the design in programmable logic using HLS to leverage SIMD in the DSP48s. The answer really depends upon the performance requirements we are trying to achieve.

As part of the standard Arm Cortex-A9 architecture, the NEON unit is supported out of the box for the software developer. This also means there is a wide ecosystem of support for the NEON unit across a range of libraries and APIs — especially when working with higher-level operating systems like PetaLinux, where libraries such as FFmpeg, FFTW and Project Ne10 support NEON use. This out-of-the-box functionality supports adoption and eases both program creation and debugging.

If we decide to accelerate the design using programmable logic, we of course gain higher performance, with a lower latency and more deterministic response. However, as this requires the development of custom intellectual property (IP) cores, this approach will require more engineering resources to implement.

Whether to use the NEON unit or PL SIMD therefore depends upon the performance requirements of the application. Being aware of the NEON/VFPU and its capabilities provides us with another tool in our embedded system toolbox to consider when architecting our solutions.

On the Zynq-7000, the NEON unit provides thirty-two 64-bit registers, which can also be viewed as sixteen 128-bit registers. Data lanes can be 8, 16, 32 or 64 bits wide, holding signed or unsigned integers or single-precision floating-point values. The number of lanes determines how many operations a single instruction performs in parallel.

When code is compiled to execute on the NEON unit, special NEON assembly instructions are used. If we are targeting the NEON, we can double-check that NEON instructions are being used by looking in the ELF.

Opening the ELF in SDK will show us the NEON assembly instructions. We can identify these easily in the disassembly as they start with a V, for example VADD, VMUL, etc. The general instruction format is:

V{<mod>}<op>{<shape>}{<cond>}{.<dt>} <dest>, src1, src2

For example, VADD.I16 q0, q1, q2 adds eight 16-bit integer lanes in parallel.

To ensure we use the NEON in our Zynq applications, we need to configure the C/C++ build settings correctly in SDK.

The settings to use are:

  • -std=c99 — C99 introduces new features which can be used by the NEON engine.
  • -mfpu=neon — identifies which floating-point unit is available on the hardware.
  • optimisation level = -O3 — enables aggressive optimisation and in-lining, increasing speed. It also enables -ftree-vectorize, which enables vectorisation of C/C++ code for the NEON.
  • -mfloat-abi=softfp / -mfloat-abi=hard — specifies which floating-point ABI is used.
  • -mvectorize-with-neon-quad — vectorizes with quad words (128-bit) as opposed to double words (64-bit), which is the GCC 4.4 default.
  • -ftree-vectorizer-verbose=n — enables us to see information on the vectorisation process.
  • -fdump-tree-vect — will generate a dump file which reports the vectorisation effort.
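Pulled together on one command line, these settings might look like the following. This is a sketch only: in SDK the flags are entered through the C/C++ Build Settings dialog rather than typed by hand, the source file name is a placeholder, and the verbose level 4 stands in for the n above.

```shell
# Hypothetical invocation of the SDK standalone toolchain; in practice
# these flags are configured in the SDK C/C++ Build Settings GUI.
arm-xilinx-eabi-gcc -std=c99 -O3 \
    -mfpu=neon -mfloat-abi=softfp \
    -mvectorize-with-neon-quad \
    -ftree-vectorizer-verbose=4 -fdump-tree-vect \
    -o dot_product dot_product.c
```

Note that -ftree-vectorize does not need to be passed explicitly, since -O3 already enables it.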

Taking a simple example of code that generates a dot product, we can compile the code with the settings above to use the NEON (see XAPP1206 for the source code).

Examining the generated instructions by looking within the ELF file, we can see NEON commands in several places.

We can also examine the vect file under the SRC directory, which provides information on the vectorisation effort. This file is generated provided we set -ftree-vectorizer-verbose=n and -fdump-tree-vect in the build options.

Examining the file in SDK will show the effort and considerations undertaken in the NEON compilation process.

When I ran the example code above, which includes both the automatically vectorized code and the hand-vectorized code, the automatically vectorized version performed slightly better.

I will come back to the NEON unit soon and look at how it is used when we are running Linux on the processing system. But for now, we can see it can provide significant benefits for a range of applications.

See My FPGA / SoC Projects: Adam Taylor on Hackster.io

Get the Code: ATaylorCEngFIET (Adam Taylor)

Access the MicroZed Chronicles Archives with over 300 articles on the FPGA / Zynq / Zynq MPSoC updated weekly at MicroZed Chronicles.

Adam Taylor
Adam Taylor is an expert in the design and development of embedded systems and FPGAs for several end applications (space, defense, automotive).