Artificial Intelligence has been one of the biggest buzzword phrases in the technical community for quite some time now as it has been reshaping the world we live in. Examples such as automation robots on assembly lines, smart assistants like Siri and Amazon Alexa, self-driving cars, and the incorporation of machine learning in devices ranging from robotic vacuum cleaners to medical equipment prove that AI has more or less become a staple in our daily lives.
As AI workloads have grown more processing-intensive, efforts to distribute that workload have led to the concept of edge computing, or computing at the edge. This simply means that instead of sending all data to a single location to be processed and then returned, appropriate chunks of data are delegated to be processed at various other points in the network.
For example, a security camera that waits until it has detected motion before sending its recording to a cloud service cuts down on both the storage consumed on that service and the network traffic between the cloud and the camera. Even in this simple example, it is obvious that the security camera needs processing power beyond what is required for basic video capture. The camera now needs a processor capable of processing the video signal in real time to detect motion and then send only the segment of video that captures the motion to the cloud server.
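The motion-gating idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: frames are treated as flat lists of grayscale pixel values, motion is detected by a simple mean absolute difference between consecutive frames, and the names `mean_abs_diff`, `frames_to_upload`, and the threshold of 10.0 are hypothetical choices, not taken from any particular camera product.

```python
from typing import Iterable, List, Sequence

Frame = Sequence[int]  # flattened grayscale pixels, each 0-255 (assumed format)


def mean_abs_diff(prev: Frame, curr: Frame) -> float:
    """Average per-pixel absolute difference between two equally sized frames."""
    if not curr or len(prev) != len(curr):
        raise ValueError("frames must be non-empty and the same size")
    return sum(abs(a - b) for a, b in zip(prev, curr)) / len(curr)


def frames_to_upload(frames: Iterable[Frame], threshold: float = 10.0) -> List[int]:
    """Return the indices of frames that differ enough from the previous frame.

    Only these frames would be sent to the cloud; the rest stay on the device.
    """
    selected = []
    prev = None
    for index, frame in enumerate(frames):
        if prev is not None and mean_abs_diff(prev, frame) > threshold:
            selected.append(index)
        prev = frame
    return selected


if __name__ == "__main__":
    still = [0] * 16
    moved = [200] * 16
    print(frames_to_upload([still, still, moved, moved, still]))
```

A real camera would use a more robust detector and would upload a window of video around each event, but the trade-off is the same: a little extra computation at the edge in exchange for far less storage and network traffic.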
This small-scale example of injecting processing power into various points in a network, through added hardware and software, demonstrates what has created demand for products such as the BittWare 250-SoC card. Equipped with a Xilinx ZU19EG Zynq UltraScale+ MPSoC chip, the 250-SoC is an FPGA solution for hardware acceleration in a PCIe card form factor. It lets a server, or any other computer with the appropriate PCIe slot, offload its intensive processing tasks onto an FPGA to decrease both its processing latency and power draw.
Just as an extra processor is installed in the security camera to process the video signal in real time, the 250-SoC gives a server the option of offloading tasks such as packet processing, which cuts down on its networking latency. The hardware acceleration potential falls into three main categories: storage, computation, and networking.
Acquired by Molex in 2018, BittWare is a company that offers FPGA solutions for applications such as hardware acceleration and network packet processing targeted for deployment in data centers. BittWare is well known for its board-level computing technologies, integrated systems, and software expertise. The 250-SoC card stands out in BittWare's FPGA accelerator card lineup as the only Zynq UltraScale+ MPSoC-based low-profile PCIe card. This makes it an attractive option for tight spaces in server racks that still need a powerful FPGA solution.
As can be seen in the photo of the 250-SoC above, it is built to fit into a PCIe slot, which creates the link between the FPGA and the host computer. The 250-SoC is also equipped with QSFP28 and OCuLink connectors for networking and external drive connectivity.
To dive a bit deeper into the benefits of these types of connectors, let's take a minute to look at their functionality. QSFP28, where QSFP stands for Quad Small Form-factor Pluggable, is a compact transceiver connector used for high-speed data communications and telecommunications. The 'Q' for 'quad' stands for the four channels of the connector (four transmit and four receive channels), which can support Ethernet, Fibre Channel, InfiniBand, and SONET/SDH standards with different data rate options. Each of the four channels is capable of up to a 28G data rate, which means that the 250-SoC can support 100G data center networking connection speeds.
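As a rough check on that arithmetic, the sketch below works through how four lanes add up to 100G in the Ethernet case. The per-lane line rate of 25.78125 Gb/s and the 64b/66b line coding are the standard figures for four-lane 100 Gigabit Ethernet rather than numbers from the 250-SoC documentation, and the function name is a hypothetical one chosen for illustration.

```python
from fractions import Fraction

LANES = 4                                   # the "quad" in QSFP28
LANE_CEILING_GBPS = 28                      # per-channel maximum quoted above
LINE_RATE_GBPS = Fraction(2578125, 100000)  # 25.78125 Gb/s per lane (100GbE)
ENCODING = Fraction(64, 66)                 # 64b/66b: 64 payload bits per 66 sent


def payload_rate_gbps(lanes=LANES, line_rate=LINE_RATE_GBPS, encoding=ENCODING):
    """Aggregate payload rate after removing line-coding overhead."""
    return lanes * line_rate * encoding


if __name__ == "__main__":
    assert LINE_RATE_GBPS <= LANE_CEILING_GBPS  # the lane rate fits within 28G
    print(float(payload_rate_gbps()), "Gb/s")
```

Exact fractions are used so that the result comes out as a clean 100 rather than a floating-point approximation.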
An ideal solution for a data center is to have best-of-breed 100G switches and routers with low power consumption and the maximum number of ports. This ideal solution would also be able to implement something like a DWDM network when traffic needs to be carried over long distances. These requirements are where the flexibility of FPGA-based hardware like the 250-SoC comes into play. A custom-tailored switch and/or router topology can be implemented on the Zynq UltraScale+ and then updated or upgraded as needed. The traffic is then pushed out at up to 100G data rates through the QSFP28 connectors on the 250-SoC.
As mentioned previously, the heart and brain of the 250-SoC is the ZU19EG Zynq UltraScale+ MPSoC chip. This FPGA chip hosts a quad-core Arm Cortex-A53 processor and a dual-core Cortex-R5 real-time processor physically instantiated alongside programmable logic. Having the physical Arm-core processors available means that the Zynq UltraScale+ MPSoC chip can achieve faster processing speeds compared to the processing speeds of a soft processor instantiated via HDL in the programmable logic of the FPGA. Since the programmable logic in the Zynq UltraScale+ chip is not consumed by the physical Arm-cores, it leaves more resources available for a custom design.
The Zynq UltraScale+ chip also has built-in high-speed connectivity interfaces that connect directly to both its Arm-core processing system and the programmable logic, which helps achieve the most resource- and timing-efficient design. This is where interfaces such as PCIe and SATA can be fully taken advantage of in a design.
On the 250-SoC, each of the four channels on the QSFP28 connectors is routed to the programmable logic via the GTY transceivers, as they can support line rates from 500 Mb/s to 30.5 Gb/s. The other key connector interface of the 250-SoC, the OCuLink connectors, is split between the GTY and the GTH transceivers connected to the programmable logic of the Zynq UltraScale+ chip. This allows each of the four 8-lane OCuLink connectors on the 250-SoC to be used either as a single 8-lane channel or split into two separate 4-lane channels.
OCuLink stands for “Optical Copper (Cu) Link” and is a PCI Express (PCIe) interconnect system most commonly used with solid-state drives (SSDs). This allows the FPGA on the BittWare 250-SoC to be directly connected to an NVMe SSD to read and write data instead of having to send it through a server's CPU.
This idea of being able to connect hard drives directly to an FPGA is an eye-catching feature of the 250-SoC and eliminates what potentially can be a huge bottleneck in a system when the host computer's CPU has to act as a gatekeeper to large non-volatile memory such as a hard drive.
Taking a step back to look at the 250-SoC ZU19EG MPSoC board overall, it can optionally include an extender board on the PCI front panel and is equipped with the following features:
- Full-height half-length form factor or half-height half-length form factor
- Single-width Active heatsink
- PCIe Gen 3 16-lane
- XCZU19EG-2FFVD1760E Xilinx UltraScale+ MPSoC
- QSFP28 network ports - x2
- DDR4 memory banks - x2
- 4GB x72 Processing System (PS) memory @ 2400 MT/s
- 4GB x72 Programmable Logic (PL) memory @ 2400 MT/s
- OCuLink x8 connectors (cables not included) - x4
- Micro-USB connector for Arm USB-to-UART access - x1
- RJ45 connector for Arm 1GbE network access - x1
- Full-height PCI bracket with QSFP28, micro-USB & RJ45 openings (250-SoC-ZU19EG-E-2A-10) or half-height PCI bracket with QSFP28 & micro-USB openings (250-SoC-ZU19EG-E-2A-20)
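To put two of the figures in this list into perspective, the following sketch estimates the theoretical peak bandwidth of the PCIe Gen 3 16-lane host link and of one DDR4 bank running at 2400 megatransfers per second. It relies on assumptions that are not stated in the list: PCIe Gen 3 signals at 8 GT/s per lane with 128b/130b encoding (standard for that generation), and the x72 memory width is taken to mean 64 data bits plus 8 ECC bits. The function names are hypothetical, and real throughput will be lower once protocol overhead is included.

```python
def pcie_gen3_gbytes_per_s(lanes: int) -> float:
    """Peak PCIe Gen 3 bandwidth per direction in GB/s.

    8 GT/s per lane, 128b/130b encoding, 8 bits per byte.
    Packet headers and flow control are not accounted for.
    """
    return 8.0 * (128 / 130) * lanes / 8


def ddr4_peak_gbytes_per_s(mega_transfers: float, data_bits: int = 64) -> float:
    """Peak DDR4 bandwidth in GB/s for one bank.

    data_bits defaults to 64, assuming the remaining 8 bits of x72 carry ECC.
    """
    return mega_transfers * 1e6 * (data_bits / 8) / 1e9


if __name__ == "__main__":
    print(f"PCIe Gen 3 x16: {pcie_gen3_gbytes_per_s(16):.2f} GB/s per direction")
    print(f"DDR4-2400 bank: {ddr4_peak_gbytes_per_s(2400):.1f} GB/s")
```

Under these assumptions, each DDR4 bank has somewhat more peak bandwidth than the host link, so on-card memory is unlikely to be the first limit when streaming data to or from the host.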
To take full advantage of the 250-SoC's hardware, the best combination is custom IP written in HDL to provide the hardware acceleration, together with an embedded Linux image running on the Arm cores of the Zynq UltraScale+ chip to handle things like standard networking protocols.
As a bare-bones starting point, BittWare developed an embedded Linux image in PetaLinux that creates a network interface linking a host PC and the Arm cores of the Zynq UltraScale+. It tunnels Ethernet packets through the 64KiB of shared on-chip memory, which allows any of the standard network protocols, such as SSH, to be used to communicate with the Arm cores. BittWare also provides a host PC driver that handles the host side of this interface.
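To give a feel for what tunneling packets through a small shared memory region can look like, here is a minimal single-producer, single-consumer ring buffer in Python. It is purely illustrative: the memory layout (two 32-bit offsets followed by length-prefixed frames), the class name `FrameRing`, and its methods are all hypothetical and are not BittWare's actual driver design. Only the 64 KiB region size comes from the description above. A real implementation would also need memory barriers and some form of notification between the two sides.

```python
import struct

SHARED_SIZE = 64 * 1024          # 64 KiB of shared on-chip memory
WRITE_OFF, READ_OFF = 0, 4       # hypothetical: two 32-bit offsets at the start
DATA_OFF = 8                     # frame data follows the two offsets
DATA_SIZE = SHARED_SIZE - DATA_OFF


class FrameRing:
    """Length-prefixed Ethernet frames in a circular byte buffer."""

    def __init__(self, mem=None):
        self.mem = mem if mem is not None else bytearray(SHARED_SIZE)

    def _get(self, off):
        return struct.unpack_from("<I", self.mem, off)[0]

    def _set(self, off, value):
        struct.pack_into("<I", self.mem, off, value)

    def _copy_in(self, pos, data):
        for i, byte in enumerate(data):
            self.mem[DATA_OFF + (pos + i) % DATA_SIZE] = byte

    def _copy_out(self, pos, count):
        return bytes(self.mem[DATA_OFF + (pos + i) % DATA_SIZE] for i in range(count))

    def push(self, frame: bytes) -> bool:
        """Producer side: returns False when the ring has no room."""
        write, read = self._get(WRITE_OFF), self._get(READ_OFF)
        used = (write - read) % DATA_SIZE
        needed = 2 + len(frame)              # 2-byte length prefix + payload
        if needed > DATA_SIZE - 1 - used:    # keep one byte free: full != empty
            return False
        self._copy_in(write, struct.pack("<H", len(frame)) + frame)
        self._set(WRITE_OFF, (write + needed) % DATA_SIZE)  # publish last
        return True

    def pop(self):
        """Consumer side: returns the oldest frame, or None when empty."""
        write, read = self._get(WRITE_OFF), self._get(READ_OFF)
        if write == read:
            return None
        (length,) = struct.unpack("<H", self._copy_out(read, 2))
        frame = self._copy_out(read + 2, length)
        self._set(READ_OFF, (read + 2 + length) % DATA_SIZE)
        return frame
```

In this sketch the producer only ever updates the write offset and the consumer only updates the read offset, which is what lets the two sides share the region without a lock. One such ring per direction would be enough to carry a bidirectional network link.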
This same PetaLinux board support package for the 250-SoC card serves as a starting point for users to develop their own custom Linux image for the 250-SoC. Depending on the size of the embedded Linux image and the required boot time of the system, the image can be stored on either the 250-SoC's onboard QSPI flash or eMMC memory.
BittWare has a dedicated developer portal that users can sign up for, stocked with getting-started resources and example designs for all of its available boards. This includes the PetaLinux BSP just mentioned and the HDL for a built-in self-test (BIST).
There are also some reference designs, developed in-house at BittWare, available for demonstration and reference purposes. These include PCIe data capture, a 100G network load test application, a software-defined network interface card design, and a loopback via Xilinx's CMAC. Even though not all of these reference designs target the 250-SoC directly, they serve as a great starting point to see exactly how a PCIe FPGA accelerator card can function in a system.
Vendor Atomic Rules has also created three IP cores, available for purchase, for use with BittWare PCIe cards. The first is a data mover IP core, Arkville DPDK, targeted at computational packet processing applications with the goal of offloading server cycles to FPGA gates.
Their TimeServo IP provides a solution for the FPGA’s system timer or clock when sub-nanosecond resolution and sub-microsecond accuracy are needed in a design. The third IP offered by Atomic Rules is a UDP Offload Engine, which implements the UDP standard RFC 768, including checksum, segmentation, and reassembly, in hardware. This moves much of the UDP processing from software into hardware so that line rates of 25, 50, and 100 GbE can be achieved.
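For a sense of the per-packet work that such an engine takes off the CPU, here is a plain software sketch of the RFC 768 checksum for UDP over IPv4. This is a reference illustration of the standard only, not Atomic Rules' implementation, and the function names are hypothetical. The checksum is the 16-bit ones' complement of the ones' complement sum over a pseudo-header (source address, destination address, protocol number 17, UDP length), the UDP header, and the payload.

```python
import ipaddress
import struct


def ones_complement_sum(data: bytes) -> int:
    """16-bit ones' complement sum, padding odd-length input with a zero byte."""
    if len(data) % 2:
        data += b"\x00"
    total = sum(word for (word,) in struct.iter_unpack("!H", data))
    while total >> 16:                       # fold carries back into 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return total


def udp_checksum(src_ip: str, dst_ip: str, src_port: int, dst_port: int,
                 payload: bytes) -> int:
    """Checksum for a UDP datagram carried over IPv4 (RFC 768)."""
    length = 8 + len(payload)                # UDP header is 8 bytes
    pseudo_header = (
        ipaddress.IPv4Address(src_ip).packed
        + ipaddress.IPv4Address(dst_ip).packed
        + struct.pack("!BBH", 0, 17, length)  # zero, protocol 17 (UDP), length
    )
    header = struct.pack("!HHHH", src_port, dst_port, length, 0)
    checksum = ~ones_complement_sum(pseudo_header + header + payload) & 0xFFFF
    return checksum or 0xFFFF                # a computed 0 is sent as all ones


if __name__ == "__main__":
    value = udp_checksum("192.168.0.1", "192.168.0.2", 1234, 5678, b"hello")
    print(f"checksum: 0x{value:04x}")
```

A receiver can verify a datagram by summing the same fields with the transmitted checksum in place; a valid datagram sums to 0xFFFF. Doing this for every packet at 100 GbE is exactly the kind of repetitive, bit-level work that maps well onto FPGA logic.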
Overall, the BittWare 250-SoC is an ideal solution for upgrading an existing server by means of hardware acceleration. Integrating AI in a system is already a huge challenge for developers, so having the right hardware and corresponding software resources, as the 250-SoC does, makes all the difference.