Team Southeast Asia:

Published March 29, 2022 © Apache-2.0

ThunderGP: HLS-based Graph Processing Framework on FPGAs

ThunderGP enables data scientists to enjoy the performance of FPGA-based graph processing without compromising programmability.

IntermediateProtip24 hours892

Adaptable Compute Acceleration: 3rd Place

Adaptive Computing Developer Contest

ThunderGP: HLS-based Graph Processing Framework on FPGAs

Things used in this project

Hardware components

AMD Alveo

Software apps and online services

AMD Xilinx Software Development Kit

Story

“The entire world of e-commerce on the internet is driven by graph analytics" because graph structure could naturally represent datasets of many important application domains such as social networks, cybersecurity, and machine learning. In the current formation age, the exponential growth of data from these applications has created a pressing demand for performant graph processing.

A large body of research in building efficient FPGA-based accelerators for graph processing; however, there still remains a gap between high-level graph applications and underlying CPU-FPGA platforms, which requires developers to understand hardware details and program with lots of eﬀorts (e.g., programming with hardware description language, tuning the pipeline and conducting memory optimizations). This gap largely hinders the adoption of FPGAs for datacenter application developers.

What is The Big Deal about ThunderGP?

ThunderGP closes the above gap by bringing both performance and programmability for FPGA-accelerated graph processing, and it has been accepted in FPGA’21.

ThunderGP is an HLS-based open-source graph processing framework on FPGAs, which supports both Vitis and SDAccel development environments and suits the Xilinx Alveo platforms such as U50, U200, U250 and VCU1525. With ThunderGP, the developers only need to write high-level functions that use explicit high-level language (C++) based APIs that are hardware agnostic. Subsequently, ThunderGP automatically generates a high-performance accelerator on state-of-the-art FPGA platforms with multiple super-logic regions (SLRs) and manages the accelerator’s deployment.

Figure 1: The overview of ThunderGP. It is compatible with both Vitis and SDAccel tools.

The overview of ThunderGP is shown in Figure 1. We briefly illustrate the main building blocks as follows.

Build-in accelerator template. ThunderGP adopts the Gather-Apply-Scatter (GAS) model as the abstraction of various graph algorithms and realizes the model by a build-in highly-paralleled and memory-efficient accelerator template.
Automated accelerator generation. The automated accelerator generation produces synthesizable accelerators with unleashing the full potentials of the underlying FPGA platform. In addition to the build-in accelerator template, it takes the user-defined functions (UDFs) of the scatter, the gather, and the apply stages (from the GAS model) of the graph algorithm and the FPGA platform model (e.g., U50) from developers as inputs.
Graph partitioning and scheduling. ThunderGP adopts a vertical partitioning method based on destination vertex without introducing heavy preprocessing operations such as edge-sorting to enable vertex buffering with on-chip RAMs.
High-levelAPIs. ThunderGP provides two sets of C++ based APIs: accelerator APIs (Acc-APIs) for customizing accelerators for graph algorithms and Host-APIs for accelerator deployment and execution.

For details of the GAS model, APIs and the design of ThunderGP, please refer to ThunderGP technical report (in the attachment or on GitHub).

How Easy-to-Use is ThunderGP?

We conduct a case study - propagation prediction of COVID-19 on Alveo U50 board with Vitis 2020.1 - to demonstrate how ThunderGP can be easily applied to a real-life graph processing problem.

Timely prediction of the time-varying prevalence of infection at the population level plays an important role in deploying proper blocking actions such as quarantine or social distance to mitigating the spread of the virus. Current propagation prediction models are generally composed by the spatial cellular automata (CA) and the temporal susceptible infectious removed (SIR) model, where the cell represents a residential area (e.g., a county) and maintains its status (e.g., infection rate) which is updated by the SIR model according to transmissions between neighbour cells. Hence, the propagation can be formulated as a graph processing problem, where the counties and their connections are represented by a graph, and the SIR updating by the propagation within the graph.

We have implemented three propagation models with ThunderGP: the CA-SIR [1], the CA-SEIR [2], and the CA-SAIR [3] models. The dataset is obtained from the COVID-19 Impact Analysis Platform [4], containing 3.1K counties and 2.3M connections.

Here, we showcase the example of implementing the accelerator for the CA-SAIR model in Listing 1. For the scatter stage, each county (a cell) calculates the infection rate to push to a neighbour county according to its infection rate and their connectivity strength that quantifies both the volume and frequency of inter-county movement. For the gather stage, the county accumulates all infection rates that are pushed to it. In the apply stage, the gathered infection rate is used for calculating the prevalence of infection. Note that the apply stage involves many user-defined parameters (ThunderGP supports user-defined parameters for the apply stage, details in the technical report).

Listing 1: The user inputs for accelerating propagation prediction of COVID-19 on U50 platform.

Figure 2 shows the visualization of the infection Risk in the USA after one week from the time of conducting predication with the public datasets. The results have matched with the open-sourced Python program [3] executed on the CPU side.

Figure 2: The visualization of the infect risk after one week from the predicting time.

Table 1 quantifies the development efforts involved with ThunderGP on this task and also shows the performance comparison with the Python-based CPU implementation[3]. Based on the results, the benefit of using ThunderGP for this problem is twofold. Firstly, ThunderGP achieves up to 419 times speedup over the CPU-based solution. Being able to predicate the propagation in a short time could assist fast and timely reactions to the spread condition. Secondly, CA-SIR models are fast evolving with the increasing understanding of the virus. With ThunderGP, the developers write only dozens of lines of code for accelerating the prediction typically for a day, which minimizes the development effort. This preliminary result is promising, and the system is open-sourced, and we believe more case studies can be performed to further assess the programmability improvement.

Table 1: The development efforts involved with ThunderGP on propagation prediction of COVID-19 on U50 platform. The code for formatting the dataset to standard graph format is not counted. The compiling time for the FPGA image is excluded in development time.

[1] MA Fuentes et al. Physica A: Statistical Mechanics and its Applications, 1999.

[2] José M Carcione et al. A simulation of a covid-19 epidemic based on a deterministic seir model. arXiv, 2020.

[3] Yiwang Zhou et al. A spatiotemporal epidemiological prediction model to inform county-level covid-19 risk in the united states. Harvard Data Science Review, 2020.

[4] University of Maryland COVID-19 Impact Analysis Platform. https://data.covid.umd.edu, 2020-09-10.

How Efficient is ThunderGP?

As mentioned before, there has been a large body of research work on FPGA-based graph processing accelerators. In this chapter, we make a fair comparison with the state-of-the-art designs to demonstrate the efficiency of ThunderGP. For datasets and graph applications, please refer to ThunderGP technical report.

We first compare ThunderGP with the state-of-the-art RTL-based work: Hitgraph [1], as shown in Table 2. The performance metric is Million Edges Traversed Per Second (MTEPS). All the implementations are based on four SLRs, but the difference is that HitGraph does not consider the overhead of utilizing multiple SLRs as its performance is based on simulation with simply scaling to the memory bandwidth of multiple SLRs. The performance speedup is up to 2.9 times. What's more important is that we make the design executed on real hardware.

Table 2: The performance comparison with the state-of-the-art design[1].

We then compare ThunderGP with HLS-based frameworks: Chen et al. [2] and GraphOps[3]. As their experiments are not conducted with multiple SLRs hence less memory bandwidth, to make a fair comparison, we use bandwidth efficiency (MTEPS/(GB/s)) as one metric. As shown in Table 3, ThunderGP achieves up to 29.2 times absolute speedup and 12.3 times improvement on bandwidth efficiency over GraphOps, and 5.2 times absolute speedup and 2.4 times improvement on bandwidth efficiency over Chen et al..

Table 3: Comparison with state-of-the-art HLS-based frameworks in terms of absolute performance and bandwidth efficiency (B.W. Efficiency).

The speedup is coming from the advanced design of ThunderGP. Please check the technical report for more design details.

[1] Shijie Zhou et al. HitGraph: High-throughput Graph Processing Framework on FPGA. TPDS,2019.

[2] Xinyu Chen et al. On-the-fly parallel data shuffling for graph processing on opencl-based fpgas. FPL, 2019

[3] Tayo Oguntebi et al. Graphops: A dataflow library for graph analytics acceleration. FPGA, 2016.

Let’s Get Started with ThunderGP!

So far, you might be interested in ThunderGP!

No worries, we provide a step by step guide for working with ThunderGP in the GitHub repository.

For the first level of usage, we write the guideline for users that only requires build-in graph processing algorithms.

For the second level of usage, we guide the users to customize the accelerator for their own applications with the system provided APIs.

Check the detailed instructions in README https://github.com/Xtra-Computing/ThunderGP/tree/develop_u50.

Enjoy, and have fun!

Code

Credits

Xinyu Chen

2 projects • 11 followers

PhD student at National University of Singapore.

Comments

Awards

Adaptable Compute Acceleration: 3rd Place

Adaptive Computing Developer Contest

ThunderGP: HLS-based Graph Processing Framework on FPGAs