Team DatenLord:

WanZheng Weng

Created March 31, 2022

TRIDENT: A Hardware Implemented Poseidon Hasher

We implemented the ZK-SNARK friendly Poseidon hasher on FPGA to boost performance of blockchain storage proof.

Big Data Analytics: 1st Place

Adaptive Computing Challenge 2021

Things used in this project

Hardware components

AMD Varium C1100 Blockchain Accelerator Card

NVIDIA GeForce RTX 3070 GPU

The GPU is used for comparison.

AMD Ryzen 5900X CPU

Software apps and online services

AMD Vivado Design Suite

Microsoft VS Code

SpinalHDL - Scala Based Hardware Construction Language

Cocotb - Python Based Hardware Verification Environment

Lotus - Filecoin Node Implementation

Ubuntu

Story

Abstract:

The Poseidon hasher is widely used in blockchain projects such as Filecoin, Mina Protocol, and Dusk Network for their zero-knowledge proofs (zk-SNARKs). This project aims to accelerate zk-SNARKs by implementing the Poseidon hasher in hardware. To our knowledge, there was no open-source hardware implementation of the Poseidon hasher before this project.

Compared to traditional hashers such as SHA256 and SHA3, Poseidon operates directly on finite fields, so it is far more efficient inside zero-knowledge proofs. TRIDENT implements the “Filecoin” version of Poseidon, which operates on the scalar field of BLS12-381, but it can be modified for use with other elliptic curves as well.

We implemented TRIDENT as a block design in Vivado Design Suite 2021.2. The Poseidon hasher is written in SpinalHDL, converted to Verilog, and used as a custom IP in Vivado. The Xilinx Varium C1100 Blockchain Accelerator Card is used in this project. We use the XDMA PCIe IP in AXI4-Stream mode to write data to and read data from the FPGA. Filecoin’s “Neptune” Rust API is modified to switch Poseidon hashing from the GPU to the FPGA. So far, TRIDENT has achieved more than twice the throughput of the CPU implementation and a much higher performance-power ratio than GPUs.

In general, TRIDENT provides a complete solution for accelerating the Poseidon hasher on FPGA, including the hardware design and a software API. Filecoin storage providers can deploy it in the sealing process of mining to improve efficiency.

The remainder of our project documentation is organized as follows. Section 1 gives a brief introduction to the Poseidon hash function. Section 2 and Section 3 describe the details of our implementation, including the IP design and the system construction. Section 4 shows the detailed performance results of TRIDENT. Finally, Section 5 explains how to use the project repository and how to deploy TRIDENT.

Section 1: A Brief Introduction Of Poseidon

The area of practical computational integrity proof systems, like zk-SNARK, is seeing a very dynamic development with several constructions having appeared recently with improved properties and relaxed setup requirements. Many use cases of such systems involve, often as their most expensive part, proving the knowledge of a preimage under a certain cryptographic hash function, which is expressed as a circuit over a large prime field.

Poseidon is a new hash function designed to be friendly to zero-knowledge applications, specifically by minimizing the proof generation time, the proof size, and the verification time. For example, the Poseidon hasher is used in the zero-knowledge proof system of Filecoin, an IPFS-based decentralized storage network. The computation of the Poseidon hasher involves a large number of compute-intensive modular multiplications, making it one of the performance bottlenecks in the Filecoin mining process. Currently, GPUs are often used to accelerate Poseidon, at the cost of higher power consumption. Our project, TRIDENT, implements the Poseidon hasher on FPGA to reach a better performance-power ratio than a GPU.

The implementation of TRIDENT targets Filecoin’s Poseidon instantiation, but it can be adapted to other zero-knowledge systems that use Poseidon. In the remainder of Section 1, we take Filecoin’s Poseidon instantiation as an example to show the details of the Poseidon hash function.

The Poseidon hash function maps a preimage of (t-1) prime field elements to a single field element. For Filecoin, t can be 3, 5, 9, or 12, which means that the preimage length can be 2, 4, 8, or 11, and each prime field element is 255 bits wide. First, the preimage is expanded into an internal state of t prime field elements through a domain separation process. Then the internal state is transformed over R (R=RF+RP) rounds of constant addition, S-boxes, and MDS matrix mixing. Once all rounds have been performed, Poseidon outputs the second element of the internal state. The data flow of Poseidon is shown in the picture below:

As the picture above shows, Poseidon has two kinds of rounds: RP partial rounds and RF full rounds. Poseidon computes half of the full rounds first, then all of the partial rounds, and finally the remaining half of the full rounds. The only difference between the two kinds is that partial rounds apply the S-box only to the first element of the internal state, while full rounds apply it to all elements.

In the round constant addition stage, each prime field element is added to its corresponding round constant, and the constants are different in each round. For Filecoin’s Poseidon instantiation, the S-box computes the fifth power of the state element. In the MDS mixing stage, the internal state vector is multiplied by the MDS matrix, which has size t*t and is the same in every round.
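The round structure described above can be sketched in Python, the language TRIDENT also uses for its reference model. This is only a minimal illustration and not the project's golden model: the round constants, the MDS matrix (a small Cauchy matrix), the domain tag, and the names poseidon_sketch and toy_parameters are placeholders invented for this example. Only the field modulus, the BLS12-381 scalar field, is a real parameter.

```python
# BLS12-381 scalar field modulus (a 255-bit prime).
P = 0x73eda753299d7d483339d80809a1d80553bda402fffe5bfeffffffff00000001


def sbox(x):
    """S-box of the Filecoin instantiation: x^5 in the field."""
    return pow(x, 5, P)


def mds_mix(state, mds):
    """Vector-matrix product state * mds over the field."""
    t = len(state)
    return [sum(state[i] * mds[i][j] for i in range(t)) % P for j in range(t)]


def poseidon_sketch(preimage, round_constants, mds, rf, rp, domain_tag=0):
    """Unoptimized Poseidon data flow; domain_tag is a placeholder value."""
    state = [domain_tag % P] + [x % P for x in preimage]  # t = len(preimage) + 1
    half = rf // 2
    for r in range(rf + rp):
        # 1) add the round constants of round r
        state = [(s + c) % P for s, c in zip(state, round_constants[r])]
        # 2) S-box: first element only in partial rounds, all elements otherwise
        if half <= r < half + rp:
            state[0] = sbox(state[0])
        else:
            state = [sbox(s) for s in state]
        # 3) MDS mixing
        state = mds_mix(state, mds)
    return state[1]  # the second element of the internal state is the digest


def toy_parameters(t, rf, rp):
    """Placeholder constants for illustration only (NOT Filecoin's values)."""
    rc = [[(r * t + i + 1) % P for i in range(t)] for r in range(rf + rp)]
    mds = [[pow(i + t + j, -1, P) for j in range(t)] for i in range(t)]
    return rc, mds


if __name__ == "__main__":
    # 8 full and 55 partial rounds for t = 3 are an assumption of this sketch.
    rc, mds = toy_parameters(t=3, rf=8, rp=55)
    print(hex(poseidon_sketch([1, 2], rc, mds, rf=8, rp=55)))
```

Because the constants are placeholders, the digest printed here will not match a real Filecoin hash; the sketch only shows the order of the three stages and the split between full and partial rounds.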

In addition to the unoptimized Poseidon hasher shown above, there is also an optimized version that computes fewer multiplications per hash. Its main advantage over the unoptimized one is that the constant matrix used in its MDS mixing stages is sparse, which avoids many compute-intensive modular multiplications. The data flow of the optimized Poseidon is shown in the picture below:

For the same preimage, the optimized and unoptimized Poseidon produce identical hash results. In TRIDENT, we have implemented both kinds of Poseidon hasher on FPGA; the optimized Poseidon achieves higher throughput but requires a more complicated hardware structure. The details of optimized and unoptimized Poseidon, including the definition and computation of the round constants and MDS matrices, can be found in the spec of neptune, a Rust implementation of Poseidon.

The hardware implementation of TRIDENT consists of two parts: the Poseidon IP design and the integrated system construction. We introduce these two parts in Section 2 and Section 3 respectively:

Section 2: Poseidon IP Design

In Section 2, we introduce the design of the Poseidon accelerator IP. In general, the computation of the Poseidon hasher can be viewed as a continuous stream of modular arithmetic operations. The design of the Poseidon IP is therefore about two things: how to implement modular arithmetic circuits with a high performance-area ratio, and how to organize these arithmetic modules to achieve better utilization and throughput. Additionally, IP design and verification in TRIDENT are done with SpinalHDL and Cocotb, which greatly improves the efficiency and quality of our design. So we first introduce how SpinalHDL and Cocotb are used in TRIDENT.

2.1 Digital Design With SpinalHDL And Cocotb

SpinalHDL is a Scala-based hardware construction language (HCL), or more precisely a Scala package aimed at agile chip design. It is similar to Chisel, which is also based on Scala and known for its use in RISC-V CPU design. Designing hardware in SpinalHDL can be divided into three steps: 1) use Scala and the SpinalHDL package to describe the structure and logic of your hardware design; 2) compile and execute the Scala program to generate the corresponding Verilog/VHDL code, which describes the same structure as the Scala source; 3) use any simulator, such as Icarus Verilog, Verilator, or the Vivado simulator, for hardware simulation and verification. The design flow of SpinalHDL is shown in the picture below:

SpinalHDL has various advantages over traditional HDLs, and it can greatly simplify digital hardware design by describing circuits at a higher level of abstraction. However, unlike HLS, which describes hardware at the behavioral level, SpinalHDL speeds up the design process without sacrificing the performance or resource utilization of the generated hardware. Importantly, SpinalHDL has almost the same description granularity as traditional HDLs like Verilog or VHDL: the designer retains fine control over the number of registers and the length of the logic path between registers. The best evidence of this is that every RTL-level syntax element in Verilog/VHDL has a counterpart in SpinalHDL.

Besides working at the same description level as Verilog, SpinalHDL is more expressive. Developers can use the advanced language features of Scala, such as functional programming, object orientation, and recursion, to describe their hardware design. Take the adder tree module used in TRIDENT as an example. The adder tree is used in the MDS vector-matrix multiplication; it takes 12 finite field elements as input and outputs their sum. Using recursion in Scala, we can easily describe this structure and reuse the code to generate an adder tree with any number of input elements, which is cumbersome to express in plain Verilog. The circuit structure and the corresponding Scala code of the adder tree are shown below.

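import spinal.core._
import spinal.lib._

// Recursively builds a pipelined modular adder tree.
// Returns the final sum and the total pipeline latency in cycles.
// ModAdderPiped is TRIDENT's pipelined modular adder, defined elsewhere in the project.
object AdderTreeGenerator{
  def apply(opNum:Int, dataWidth:Int, input:Vec[UInt]):(UInt,Int) = {
    if(opNum == 2) {
      // Base case: a single adder sums the last two operands
      (ModAdderPiped(input(0), input(1)), ModAdderPiped.latency)
    }
    else {
      // Add the operands pairwise; this forms one level of the tree
      val adderOutputs = for(i <- 0 until opNum/2) yield
        ModAdderPiped(input(2*i),input(2*i+1))
      if(opNum%2==0){
        val next = Vec(UInt(dataWidth bits),opNum/2)
        next.assignFromBits(adderOutputs.asBits())
        val (nextOutput, latency) = AdderTreeGenerator(opNum/2, dataWidth, next)
        // Each tree level adds the latency of one pipelined adder
        (nextOutput, latency+ModAdderPiped.latency)
      } else{
        val next = Vec(UInt(dataWidth bits),opNum/2+1)
        // The unpaired operand is delayed so it stays aligned with the adder outputs
        val temp = Delay(input(opNum-1), ModAdderPiped.latency)
        next.assignFromBits(temp.asBits##adderOutputs.asBits())
        val (nextOutput,latency)= AdderTreeGenerator(opNum/2+1, dataWidth, next)
        (nextOutput, latency+ModAdderPiped.latency)
      }
    }
  }
}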

In addition, SpinalHDL provides good abstractions of classic circuits that are frequently used in digital hardware design, such as Counter, StateMachine, and FIFO. Using these abstractions reduces the designer's workload and the chance of making errors during development. The abstraction used most in TRIDENT is Stream. The Stream class implements the handshake protocol of digital design, which includes three signals: valid, ready, and payload. By calling the member methods of the Stream class, we can realize different data transfer operations under the handshake protocol. For example, we can use the stage() function of the Stream class to implement a pipeline with backpressure.

Cocotb is a Python-based testbench environment for verifying VHDL and Verilog RTL designs. Its most important advantage is that Python is far more productive, expressive, and succinct than languages traditionally used for verification, such as Verilog, VHDL, and SystemVerilog. At the same time, we can build a reference model more quickly and conveniently thanks to Python’s rich packages and community. For example, in TRIDENT we also implement the Poseidon hasher in Python and use it as the golden model in verification. One difficulty in implementing it in software is that the variables in the Poseidon hasher are 255 bits wide, which many programming languages do not support natively. Python, however, supports integers of arbitrary width, which greatly simplifies the implementation. It is also worth noting that the code in TRIDENT that passed the Cocotb testbenches worked correctly and stably on the FPGA without any additional hardware debugging with Vivado tools such as ILA.

Beyond the points mentioned above, SpinalHDL and Cocotb have many other advantages over traditional HDLs in digital design and verification, which we believe will simplify FPGA design and promote the application of FPGAs in many fields. Currently, however, SpinalHDL and Cocotb are not fully supported by Vivado. For example, we cannot run a Cocotb testbench in the Vivado simulator, and it is troublesome to instantiate and simulate Xilinx IP in SpinalHDL. We hope that Vivado and other Xilinx design tools will support SpinalHDL and Cocotb better in the future.

2.2 Modular Arithmetic Operator

It’s worth noting that the Poseidon hasher operates on elements of a finite field, which means all arithmetic operations are modular. Compared to ordinary arithmetic, modular arithmetic operations are more complicated to implement in circuits and lack mature off-the-shelf IP. Two kinds of modular operations are involved in the Poseidon hasher: modular addition and modular multiplication. In TRIDENT, we implement both on top of the existing adder IP and multiplier IP provided in Vivado.

Modular Adder:

The modulo m addition of two numbers x and y (with 0 ≤ x, y < m) is defined by: (x + y) mod m = x + y if x + y < m, and x + y - m otherwise.

The regular modular addition can be implemented straightforwardly with a normal adder, a comparator, and a subtracter following the definition above. Specifically, x and y are first added through the adder, then the modulus is subtracted from the sum through the subtracter, and finally the sum is compared with the modulus through the comparator to decide whether to output the addition result or the subtraction result.

But this implementation is expensive both in terms of area and delay due to the usage of a comparator. In TRIDENT, we implement modular adder based on the algorithm below:

In this algorithm, we just need two adders and a Mux to realize the function of modular addition. The circuit structure is shown in the picture below.
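Since the algorithm itself is only given as a picture, the following Python model shows one common way to build a modular adder from two adders and a mux, without a comparator. It is a sketch under assumptions: the second adder adds the two's complement of the modulus, and its carry-out drives the mux. The function name mod_add and the chosen bit widths are illustrative and not taken from TRIDENT's source.

```python
import random

WIDTH = 255  # operand width in bits
P = 0x73eda753299d7d483339d80809a1d80553bda402fffe5bfeffffffff00000001


def mod_add(x, y, m=P, width=WIDTH):
    """(x + y) mod m for 0 <= x, y < m, using two additions and one select."""
    k = width + 1                    # x + y needs one extra bit
    s = x + y                        # adder 1: plain sum, s < 2m < 2^k
    d = s + ((1 << k) - m)           # adder 2: s - m, as s + two's complement of m
    carry = d >> k                   # carry-out is 1 exactly when s >= m
    # mux: pick the reduced value when the carry is set, else the plain sum
    return d & ((1 << k) - 1) if carry else s


if __name__ == "__main__":
    rng = random.Random(0)
    x, y = rng.randrange(P), rng.randrange(P)
    print(mod_add(x, y) == (x + y) % P)
```

The carry-out of the second adder replaces the comparator of the straightforward design, which is why only two adders and a mux are needed.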

Because the elements in Poseidon are 255 bits wide and a 255-bit addition causes a long logic delay, we implement the 255-bit addition with the Xilinx Adder/Sub IP, in which the number of pipeline stages can be specified. Alternatively, TRIDENT also provides another version of the modular adder in which the adders are written directly with the plus operator and have no pipeline stages. The configuration of the Xilinx Adder IP in TRIDENT is:

Modular Multiplication:

The modular multiplication problem is defined as the computation of P = A × B (mod m), given the integers A, B and m. It is usually assumed that A and B are non-negative integers with 0 ≤ A, B < m. There are several approaches to modular multiplication, such as multiplying and then dividing, or interleaving multiplication and reduction. A naive approach multiplies first and then repeatedly subtracts the modulus from the product until the result is smaller than the modulus, which is very time-consuming when the product is much larger than the modulus. The reduction can also be performed by dividing by the modulus, but this requires many hardware resources and takes a long time, since division is a complicated and compute-intensive operation in hardware.

In TRIDENT, modular multiplication uses the Montgomery reduction algorithm, which avoids the compute-intensive division. The main idea of the Montgomery algorithm is to transform the operands into the n-residue (Montgomery) domain, where the division needed to reduce a product in the normal domain can be replaced by shifting. A brief description of the Montgomery algorithm follows:

1) Given integers a, b < n (n is the modulus), we define their n-residues with respect to r (r is a power of two and greater than the modulus) as: ā = a · r mod n and b̄ = b · r mod n.

2) In order to describe the Montgomery reduction algorithm, we need an additional quantity, n', which is the integer with the property: r · r^(-1) - n · n' = 1, where r^(-1) is the inverse of r modulo n.

3) Given the n-residues ā and b̄, the Montgomery product is defined as: MonPro(ā, b̄) = ā · b̄ · r^(-1) mod n. It is computed as t = ā · b̄, m = t · n' mod r, u = (t + m · n) / r, and the result is u - n if u ≥ n and u otherwise.

After the Montgomery product, the result is still in the n-residue domain. You can transform it back into the normal domain by computing MonPro(res, 1), or continue with another multiplication. For more details about the Montgomery algorithm, you can access this website.
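The steps above can be checked with a short Python model. It is a sketch, not TRIDENT's hardware description: the choice r = 2^256 and the helper names to_mont, from_mont, and mon_pro are assumptions made for this example (the text only requires r to be a power of two greater than the modulus).

```python
import random

# Modulus n: the BLS12-381 scalar field prime used by Filecoin's Poseidon.
N = 0x73eda753299d7d483339d80809a1d80553bda402fffe5bfeffffffff00000001
K = 256                          # assumed: r = 2^256
R = 1 << K
N_PRIME = (-pow(N, -1, R)) % R   # n' with r * r^(-1) - n * n' = 1


def mon_pro(a_bar, b_bar):
    """Montgomery product: a_bar * b_bar * r^(-1) mod n, without division."""
    t = a_bar * b_bar                        # multiplication 1
    m = ((t & (R - 1)) * N_PRIME) & (R - 1)  # multiplication 2, "mod r" is a mask
    u = (t + m * N) >> K                     # multiplication 3, "/ r" is a shift
    return u - N if u >= N else u            # conditional subtraction


def to_mont(a):
    """Map a into the n-residue domain: a * r mod n."""
    return (a * R) % N


def from_mont(a_bar):
    """Map back to the normal domain via MonPro(a_bar, 1)."""
    return mon_pro(a_bar, 1)


if __name__ == "__main__":
    rng = random.Random(0)
    a, b = rng.randrange(N), rng.randrange(N)
    product = from_mont(mon_pro(to_mont(a), to_mont(b)))
    print(product == (a * b) % N)
```

The model uses exactly three wide multiplications, one addition, a shift, and a conditional subtraction per product, so the only operations that remain expensive in hardware are the multiplications.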

In the formula of the Montgomery product above, n, n' and r are fixed and can be precomputed, and r is a power of 2, which means the division in Step 3 reduces to shifting. The arithmetic operations in this algorithm are therefore just three normal multiplications, one normal addition, and a conditional subtraction. The structure of the modular multiplier based on the Montgomery algorithm is shown in the picture below. To reach higher performance, the normal multipliers and adders used in the Montgomery multiplier are all instantiated from existing IPs provided by Xilinx and are fully pipelined, which means the design can accept a new 255-bit modular multiplication every cycle.

However, the maximum width of the multiplier IP in Vivado is 64 bits, so a 255-bit multiplier cannot be obtained by instantiating the existing IP directly. We need to combine several narrower multipliers into a 255-bit multiplier. In TRIDENT, we use the Karatsuba-Ofman algorithm, which needs far fewer small multipliers than other algorithms to build a multiplier of the same size. The way Karatsuba-Ofman decomposes a bigger multiplier into small multipliers is shown below:

The algorithm uses only 3 half-word multipliers to build a full-word multiplier. Applied recursively, the 256-bit multiplier is decomposed into 128-bit multipliers, then into 64-bit multipliers, and finally into 32-bit multipliers, the width we set in the Xilinx multiplier IP.
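The recursive decomposition can be illustrated with a small Python model that also counts how many base multiplications are needed. It is a sketch under assumptions: the function name karatsuba and its interface are made up for this example, and the model ignores the extra carry bit that the sums a0 + a1 and b0 + b1 need in a real circuit.

```python
import random


def karatsuba(a, b, width=256, base=32):
    """Return (a * b, number of base-width multiplications used)."""
    if width <= base:
        return a * b, 1                      # one small multiplier IP instance
    half = width // 2
    mask = (1 << half) - 1
    a0, a1 = a & mask, a >> half             # low and high halves of a
    b0, b1 = b & mask, b >> half             # low and high halves of b
    low, n0 = karatsuba(a0, b0, half, base)              # a0 * b0
    high, n1 = karatsuba(a1, b1, half, base)             # a1 * b1
    cross, n2 = karatsuba(a0 + a1, b0 + b1, half, base)  # (a0 + a1)(b0 + b1)
    mid = cross - low - high                 # equals a0*b1 + a1*b0
    product = (high << (2 * half)) + (mid << half) + low
    return product, n0 + n1 + n2


if __name__ == "__main__":
    rng = random.Random(0)
    a, b = rng.getrandbits(255), rng.getrandbits(255)
    product, count = karatsuba(a, b)
    print(product == a * b, count)
```

With three recursion levels (256, 128, 64 bits) the model uses 3 × 3 × 3 = 27 base multiplications, compared with 8 × 8 = 64 for a schoolbook decomposition into 32-bit pieces. Whether TRIDENT's circuit uses exactly this count is not stated in the text.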

In TRIDENT, the structure we used to build a 255-bit multiplier from 32-bit Xilinx Multiplier IPs is shown as below:

The specific configurations of multiplier IP in TRIDENT are shown below. Either "Mults" option indicating DSPs or "Luts" option indicating LUTs can be selected in the "Multiplier Construction" to make full use of multiple kinds of resources in FPGA to instantiate more modular multipliers.

2.3 Accelerator Architecture

The architecture of the Poseidon accelerator is modelled on the dataflow picture of Poseidon shown in Section 1, with some serial-parallel conversions added. The structure of the accelerator is shown in the picture below. The bigger arrows in the picture indicate that data is transferred in parallel, with all elements of the internal state (3, 5, 9 or 12 elements) sent in one cycle. The smaller arrows indicate that data is transferred serially, with one element of the internal state sent per cycle. The basic idea is that the PoseidonThread module implements the complete computation of one Poseidon round (the number of rounds in Poseidon ranges from 63 to 65), and the accelerator reuses PoseidonThread to compute all rounds, either partial or full, and outputs the second element of the internal state when all rounds are finished.

The two main ideas for accelerating Poseidon in this accelerator are parallelization and pipelining. As for pipelining, all arithmetic operators, including multipliers and adders, are pipelined to shorten the logic delay in our design. To be more specific, the 255-bit modular multiplier and adder are divided into 47 and 33 pipeline stages respectively. It takes approximately 200 cycles for data to travel from the input port of PoseidonThread to its output port, and the whole design can operate at a frequency of more than 100MHz. In terms of parallelization, the data flows serially before MDSMatrixMultiplier and in parallel in the rest of PoseidonThread. The reasons for this partial parallelization are: 1) the top-level input of the Poseidon accelerator from the XDMA IP is serial, which means one element is transferred per cycle; 2) the vector-matrix multiplication in the MDS mixing stage needs up to 144 modular multipliers if all elements of the internal state are computed in parallel, which is impossible to implement with the limited FPGA resources. So only 12 multipliers are implemented in MDSMatrixMultiplier, which lets it compute in parallel all 12 multiplications related to one element of the vector. Additionally, if the project is ported to FPGA boards with more resources, parallelism and performance can be improved by instantiating more PoseidonLoop modules.

The operating mechanism of the Poseidon accelerator is as follows:
1) The AXI4StreamReceiver module receives the serial input (one state element per cycle) from the XDMA IP, determines the size of the internal state from the last signal of the AXI4-Stream protocol, and then outputs the internal state in parallel.
2) StreamArbiter in PoseidonLoop is a 2-1 priority arbiter that receives the outputs of AXI4StreamReceiver and LoopbackDemux; the loopback input has the higher priority to avoid deadlock in the pipeline loop.
3) PoseidonSerializer serializes the parallel output of the arbiter and sends it to PoseidonThread, which executes the arithmetic operations of the Poseidon hasher.
4) In PoseidonThread, SBox5Stage and AddRoundConstantStage correspond to the S-boxes and Add Round Constants stages in the dataflow picture shown in Section 1. MDSMatrixMultiplier and MDSMatrixAdders jointly perform the MDS mixing of the Poseidon hasher.
5) DataDemux is a 1-2 router that transfers the output of PoseidonThread to AXI4Transmitter or StreamArbiter. If all rounds are completed, the internal state is transmitted to AXI4Transmitter; otherwise it loops back to StreamArbiter and starts the computation of the next round.
6) AXI4Transmitter outputs the hash result to the XDMA IP under the AXI4-Stream protocol; it contains a FIFO to reorder and buffer the input from LoopbackDemux.
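To make the multiplier count concrete, here is a small functional model in Python of the serial scheme described above: one state element arrives per cycle and is multiplied by its 12 matrix entries at once. It is only a behavioral sketch. The function name mds_serial, the placeholder matrix, and the running accumulation (which stands in for the MDSMatrixAdders stage, whose real structure is an adder tree) are assumptions of this example.

```python
P = 0x73eda753299d7d483339d80809a1d80553bda402fffe5bfeffffffff00000001


def mds_direct(state, mds):
    """Reference: full vector-matrix product, t * t multiplications at once."""
    t = len(state)
    return [sum(state[i] * mds[i][j] for i in range(t)) % P for j in range(t)]


def mds_serial(state, mds):
    """One state element per cycle, t multipliers working in parallel."""
    t = len(state)
    acc = [0] * t
    cycles = 0
    for i, x in enumerate(state):
        # the t products that involve element i, computed in the same cycle
        products = [x * mds[i][j] % P for j in range(t)]
        acc = [(a + p) % P for a, p in zip(acc, products)]
        cycles += 1
    return acc, cycles


if __name__ == "__main__":
    t = 12
    state = [i + 1 for i in range(t)]
    # placeholder matrix for illustration, not Filecoin's MDS matrix
    mds = [[pow(i + t + j, -1, P) for j in range(t)] for i in range(t)]
    result, cycles = mds_serial(state, mds)
    print(result == mds_direct(state, mds), cycles, t * t)
```

For t = 12 the fully parallel product needs 12 × 12 = 144 multipliers, while the serial scheme needs 12 multipliers and takes 12 input cycles per state vector.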

Section 3: Integrated System Design

The architecture block diagram of the TRIDENT system is shown in the figure below. On the CPU side, TRIDENT provides a Rust API for Lotus, a popular Filecoin node implementation, to send data to and receive data from the FPGA; the API is adapted from the Xilinx XDMA driver. The FPGA hardware consists of three modules: the Xilinx XDMA IP, an asynchronous FIFO, and the Poseidon accelerator IP. XDMA provides a unified PCIe interface to the host CPU server and a unified AXI4-Stream interface to the hardware design in the FPGA. The working frequency of the XDMA IP, 250MHz, is higher than the frequency of the Poseidon IP, which ranges from 100MHz to 200MHz, so the asynchronous FIFO handles the clock-domain crossing between XDMA and the accelerator. Finally, the Poseidon accelerator IP, with a standard AXI4-Stream interface, is the computation core of the hardware system and is responsible for accelerating the Poseidon hasher.

We implement the whole FPGA hardware system design through the Block design tool in Vivado. The block design diagram which shows the connections between different modules is as below:

The specific configurations of Xilinx IPs involved in the block design above are shown in pictures below:

Xilinx XDMA Configuration:

Basic Tab:

PCIe ID Tab:

PCIe BARs Tab:

PCIe: MISC Tab:

PCIe: DMA Tab:

Clock Wizard Configuration:

Asynchronous FIFO Configuration:

General Tab:

Flags Tab:

Section 4: Implementation And Performance

4.1 Vivado Implementation Result:

The implementation result and performance of TRIDENT are shown in this Section. After Vivado implementation, the utilization of FPGA hardware resources is shown in the table below. The shortage of DSP slices and LUTs used in modular multipliers hinders more instantiations of these arithmetic elements to achieve better performance.

Implementation Results:

Timing Information:

The working frequency of the Poseidon IP is set to 100MHz, and the timing report of the Vivado implementation is shown below. The current design just meets this target frequency, which is still below our goal of 250MHz, the output frequency of the XDMA IP. To get a higher throughput, we will continue to improve the design by shortening its critical path.

4.2 Performance Of TRIDENT

In TRIDENT, two ways have been used to measure the performance of the whole system.

The first one: we wrote a C program that sends preimages directly to the FPGA and receives the hash results from it, and measures the total time from sending the first preimage to receiving the last hash result. The exact performance of TRIDENT for three preimage lengths is shown in the table below. For arity2, TRIDENT finishes 850000 hashes in 0.877s with a data transfer rate of 29.1651 MB/s, which corresponds to approximately 0.97M hashes per second (850000 / 0.877 ≈ 969,000).

The second one: as mentioned above, TRIDENT also provides a Rust-based software API for Lotus, a Filecoin node implementation, so that Lotus can use the FPGA to accelerate the sealing process in the same way it uses a GPU. Lotus includes a dedicated benchmark program, lotus-bench, which measures the performance of the different computation steps of Filecoin mining. We use lotus-bench to test the performance of the GPU, the CPU, and TRIDENT in the part of the sealing process that relies on the Poseidon hasher. The result is shown in the table below:
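The throughput figure follows directly from the two measured numbers, as this short Python calculation shows. The cycles-per-hash value is an extra derived quantity that assumes the 100MHz Poseidon IP clock stated in Section 4.1; it is an estimate of this example and not a number reported by the authors.

```python
hashes = 850_000        # arity2 hashes completed (from the measurement above)
seconds = 0.877         # total time, first preimage sent to last result received
clock_hz = 100_000_000  # assumed: Poseidon IP clock of 100MHz

hashes_per_second = hashes / seconds
cycles_per_hash = clock_hz / hashes_per_second

print(f"{hashes_per_second:,.0f} hashes/s")
print(f"{cycles_per_hash:.1f} clock cycles per hash on average")
```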

The picture below shows the output information from terminal when executing lotus-bench program with TRIDENT accelerating the computation of Poseidon.

In the performance results above, we can see that TRIDENT's hash rate is more than twice that of the CPU but still falls behind the GPU's. Given the high power consumption of the GPU, however, TRIDENT still has a much better performance-power ratio. The power consumption of TRIDENT is shown in the picture below:

We can see that the total on-chip power of the FPGA is 24.823W, while the maximum power consumption listed in the RTX 3070 specification on Nvidia's official website is 220W, roughly nine times higher. This indicates that the FPGA is much more efficient than the GPU in terms of performance-power ratio.

Although TRIDENT is currently slower than the GPU, we see a lot of room for improvement, mainly in three aspects: 1) optimizing the timing of the current design to reach a higher frequency; 2) redesigning the arithmetic operation modules to achieve a better performance-area ratio; 3) modifying the architecture of the accelerator IP and the whole FPGA system to improve the utilization of the arithmetic modules. We will continue to work on this project, and we are confident that TRIDENT can outperform the GPU while keeping its power consumption low.

Section 5: Instructions On Deployment

In this section, we will introduce the repository of TRIDENT and how to deploy TRIDENT in the Filecoin application.

5.1 TRIDENT Repository

The file directory structure of TRIDENT is shown below:

.
├── images
│   ├── adderTree.jpg
│   ├── block_design.png
│   ├── device.png
│   ├── opt_poseidon.png
│   ├── poseidon.png
│   ├── spinal.jpg
│   ├── timing.JPG
│   └── whole_arch.jpg
├── LICENSE
├── poseidon-spinal // design sources of Poseidon accelerator IP
│   ├── build.sc
│   ├── images
│   │   ├── AdderBasedModAdder.drawio.png
│   │   ├── MDSMatrixAdder.drawio.png
│   │   ├── MDSMatrixMultiplier.drawio.png
│   │   ├── ModMultiplier.drawio.png
│   │   ├── src
│   │   ├── Thread.drawio.png
│   │   └── TopLevel.drawio.png
│   ├── LICENSE
│   ├── poseidon_constants
│   │   ├── mds_matrixs
│   │   ├── mds_matrixs_ff
│   │   ├── round_constants
│   │   └── round_constants_ff
│   ├── README.md
│   ├── run.sh
│   └── src
│       ├── main
│       ├── reference_model
│       └── tests
├── README.md
├── utils // software design of Poseidon
│   ├── lotus
│   │   ├── dma_utils.c
│   │   ├── fpga.cpp
│   │   ├── gpu.rs
│   │   └── Readme.txt
│   └── trident_tester
│       ├── arity_12_inputs.txt
│       ├── arity_12_outputs.txt
│       ├── arity_3_inputs.txt
│       ├── arity_3_outputs.txt
│       ├── arity_9_inputs.txt
│       ├── arity_9_outputs.txt
│       ├── dma_utils.c
│       ├── trident_tester.cpp
│       └── trident_tester.h
└── vivado // vivado projects of Poseidon
    ├── poseidon_ip.tar.gz
    ├── poseidon.tar.gz
    └── trident100MhzC1100.tar.gz

There are three main directories in TRIDENT:

(1) poseidon-spinal: design sources of Poseidon accelerator IP

build.sc: the configuration of mill, a Scala build tool, in which we set the SpinalHDL library dependencies and some Scala compile options;
poseidon_constants: includes all constants used in both the optimized and the unoptimized Poseidon hasher, in *.txt file format;
run.sh: the shell script that generates Verilog code from SpinalHDL and runs the verification process;
src: contains all source code files of the Poseidon accelerator, including the hardware design in “main”, a Python reference model of Poseidon in “reference_model”, and the verification code in "tests";

(2) utils: software design of TRIDENT

lotus: Rust-based software API for Lotus to interact with the FPGA
trident_tester: C program to test the performance of TRIDENT

(3) vivado: vivado projects of TRIDENT based on Vivado 2021.2

poseidon_ip.tar.gz: the Vivado custom IP project of the Poseidon accelerator
poseidon.tar.gz: the Vivado project of the whole TRIDENT system
trident100MhzC1100.tar.gz: bitstream file of TRIDENT, which can be programmed directly onto the Xilinx Varium C1100 card

5.2 Deployment

There are three main ways to deploy TRIDENT:

Use the .bit file in "vivado" and program it onto a Xilinx Varium C1100 FPGA card connected to the host server through the PCIe interface. You can use the official Xilinx XDMA driver to interact with the FPGA directly; we provide a C program in “/utils/trident_tester” as a reference. We also provide a Rust API in “/utils/lotus”, which can be used in Lotus to accelerate the sealing process of Filecoin.
Besides using the .bit file, you can edit the Vivado project we provide to customize your own FPGA design for the Varium C1100 card. For example, you can replace the PCIe interface with another communication protocol, or add the existing Poseidon IP to another block design.
If you want to deploy TRIDENT on another FPGA platform or modify the design of the Poseidon IP, you can edit our SpinalHDL design and then generate Verilog code using “mill” or the "run.sh" script we provide. For example, you can replace AXI4-Stream with AXI4-Lite or full AXI4 and add the Poseidon IP to your SoC design.

Credits

Steve Wu

WanZheng Weng
