MicroZed Chronicles: Block RAM Optimization


Adam Taylor
5 years ago

A few weeks ago we looked at how we could use Xilinx Parameterized Macros (XPM) in place of the Block Memory Generator and the benefits these offered.

One of the things I briefly touched on in that blog was how we can optimize memory structures for performance and power. It is an interesting area to examine and as such, it is what we are going to be looking at in this article.

Before we can optimize, though, we first need to understand the Block RAM structure provided in the 7 Series and UltraScale+ families.

These Block RAM structures are very flexible: each Block RAM stores 36Kb and can be configured either as two 18Kb RAMs or as one 36Kb RAM.

It is also possible to configure these RAMs further by trading address space for data width. A 36Kb RAM can implement structures from 32K by 1 bit to 1K by 36 bits, while an 18Kb RAM can implement structures from 16K by 1 bit to 1K by 18 bits, along with the valid configurations in between.
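The depth-for-width trade can be tabulated in a few lines of Python. This is a minimal sketch rather than anything Vivado produces: the list and function names are made up for illustration, and the wider simple dual-port modes are left out. It shows that the 1, 2 and 4-bit wide modes do not use the parity bits, so a 36Kb primitive in those modes offers only 32Kb of storage, and that the deepest mode of the 18Kb primitive is 16K by 1.

```python
# Aspect ratios as (depth, width). Widths of 9, 18 and 36 include the
# parity bits; widths of 1, 2 and 4 leave them unused.
# Names below are illustrative, not part of any Xilinx tool or library.
RAMB36_MODES = [(32 * 1024, 1), (16 * 1024, 2), (8 * 1024, 4),
                (4 * 1024, 9), (2 * 1024, 18), (1 * 1024, 36)]
RAMB18_MODES = [(16 * 1024, 1), (8 * 1024, 2), (4 * 1024, 4),
                (2 * 1024, 9), (1 * 1024, 18)]


def usable_bits(depth, width):
    """Number of bits that can actually be stored in this aspect ratio."""
    return depth * width


def describe(modes):
    """Return one human-readable line per aspect ratio."""
    return [f"{d // 1024}K x {w}: {usable_bits(d, w) // 1024} Kb usable"
            for d, w in modes]


for line in describe(RAMB36_MODES) + describe(RAMB18_MODES):
    print(line)
```

The narrow modes matter later in the article because deep, narrow primitives avoid multiplexing, while wide, shallow ones pack more bits into each primitive.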

Ensuring Vivado implements the correct structure for our needs is therefore important not only to ensure optimal resource utilisation but also to ensure timing and power requirements are addressed.

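By default, when we implement a large RAM structure that uses multiple Block RAMs, Vivado avoids multiplexing to provide the best performance. It achieves this by leveraging the address depth: each Block RAM is configured deep enough to span the whole address range and supplies a narrow slice of the data bus.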

For example, a 6K by 256 RAM structure would be implemented using 64 BRAMs, each configured as 8K by 4 bits.

However, a more resource-efficient implementation uses 7 BRAMs configured as 1K by 36 to provide 252 bits of the data bus, and repeats this structure 6 times to cover the required address range. The final 4 bits of the data bus can be provided by a single 8K by 4-bit memory, for a total of 43 BRAMs.
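The two counts can be checked with a little arithmetic. The sketch below only reproduces the sums from the example; the function names are hypothetical, and it makes no attempt to model how Vivado actually maps memories.

```python
from math import ceil

DEPTH, WIDTH = 6 * 1024, 256  # the 6K by 256 memory from the example


def brams_full_depth(width, prim_width=4):
    """Default style: each 8K x 4 primitive spans every address and
    contributes prim_width bits of the data bus."""
    return ceil(width / prim_width)


def brams_decomposed(depth, width):
    """Decomposed style: tile 1K x 36 primitives, then cover the leftover
    data bits with 8K x 4 primitives (assumes depth is at most 8K)."""
    columns = width // 36                 # whole 36-bit slices of the data bus
    rows = ceil(depth / 1024)             # 1K-deep rows needed to cover the depth
    leftover_bits = width - columns * 36  # bits not covered by the 36-bit slices
    extra = ceil(leftover_bits / 4)       # full-depth 8K x 4 primitives
    return columns * rows + extra


print(brams_full_depth(WIDTH))         # 64
print(brams_decomposed(DEPTH, WIDTH))  # 43
```

Here 256 // 36 gives 7 columns (252 bits), 6 rows cover the 6K addresses, and the remaining 4 bits need one more primitive: 7 x 6 + 1 = 43.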

Although this implementation requires additional logic to select between the Block RAMs, which will impact the timing performance, it uses fewer Block RAMs and therefore reduces power dissipation.

We can make Vivado implement the second structure by using the RAM_DECOMP (RAM decomposition) constraint. This constraint can be applied either in the source RTL or via the XDC, where the format is:

set_property ram_decomp power [get_cells myram]

This is not the only RAM implementation constraint in our toolbox, however. Along with RAM_DECOMP, we can also control the RAM cascade height using the CASCADE_HEIGHT constraint.

The cascade height constraint allows us to minimize the number of built-in multiplexers used within larger RAMs so that we can achieve better timing performance.

Again, this constraint can be defined in either the source RTL or the XDC. The XDC format is:

set_property cascade_height 1 [get_cells myram]

So far you might be thinking that all of the options offer a choice of either performance (cascade height) or power (RAM decomp).

However, we can also combine the CASCADE_HEIGHT and RAM_DECOMP constraints to implement memory structures which provide both good power efficiency and good timing performance, as shown in the diagram below where an 8K by 36-bit RAM is required.

In the example, the default implementation uses 8 Block RAMs, which are cascaded together via the internal multiplexers. As only one BRAM is active at any one time, it is power efficient. However, due to the internal multiplexing, the timing performance will be slower.

If we were to apply just the RAM_DECOMP constraint, we would get the same structure.

Using the CASCADE_HEIGHT constraint on its own will result in better performance, as a lower cascade height reduces the multiplexing through cascaded RAMs.

With this structure, however, more than one Block RAM is active at any one time, so while the timing performance is good, the power efficiency is not.

If we use both the cascade height and RAM decomposition constraints, we can get the best of both worlds: a RAM structure where only one RAM is active at any one time and the cascade height is minimized, so we gain the best possible timing performance and power efficiency.
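To get a feel for the trade-off, the 8K by 36-bit memory can be tiled by hand for each 36Kb aspect ratio. The sketch below is a deliberately simple model, not a description of what Vivado builds for any particular constraint setting. It assumes that only the row of primitives containing the addressed location is enabled on each access, and the function and key names are hypothetical.

```python
from math import ceil

# 36Kb primitive aspect ratios to compare, as (depth, width).
MODES = [(1024, 36), (2048, 18), (4096, 9), (8192, 4)]


def tile(depth, width, prim_depth, prim_width):
    """Tile a depth x width memory with identical primitives.

    rows: primitives stacked to cover the address range; more rows means
          more selection or cascading between primitives.
    cols: primitives placed side by side to cover the data bus; under the
          assumption above, all of these are enabled on every access.
    """
    rows = ceil(depth / prim_depth)
    cols = ceil(width / prim_width)
    return {"total": rows * cols, "active_per_access": cols, "rows": rows}


for prim_depth, prim_width in MODES:
    result = tile(8 * 1024, 36, prim_depth, prim_width)
    print(f"{prim_depth // 1024}K x {prim_width}: {result}")
```

Under these assumptions the 1K by 36 tiling gives 8 primitives with one active per access but 8 rows to select between, while the 8K by 4 tiling needs no row selection but uses 9 primitives, all of them active on every access. The intermediate aspect ratios sit between these two extremes, which is the space the two constraints let us explore.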

Being aware of these constraints and using them correctly within our design provides us with another tool to help us achieve our project requirements.

See My FPGA / SoC Projects: Adam Taylor on Hackster.io

Get the Code: ATaylorCEngFIET (Adam Taylor)

Additional Information on Xilinx FPGA / SoC Development can be found weekly on MicroZed Chronicles.

Adam Taylor
Adam Taylor is an expert in the design and development of embedded systems and FPGAs for several end applications (Space, Defense, Automotive).