# Build Your Own Domain-specific Solutions with RapidWright

Chris Lavin and Alireza Kaviani Xilinx Research Labs 2/24/19



- > Why are Domain-specific solutions important?
  - RapidWright value proposition
  - >> Why open source?

> What is RapidWright?

> How to use RapidWright?



# **FPGA Industry and Community Dynamics**



> Continuous industry and community engagement



## The Age of Domain Specific Architectures

#### 40 years of Processor Performance





from: "There's Plenty of Room at the Top," Leiserson et al., to appear.

- > Achieve higher efficiency by tailoring the architecture to characteristics of the domain
  - >> More effective parallelism for a specific domain, More effective use of memory bandwidth
  - >> Domain specific programming language



# Raising the Abstraction of Design Entry









**SDx** 







ASICs

# **Focus on Emerging Applications**



- > Module-based approach to implementation
  - >> Lock-in performance with reusable modules
  - >> Fewer inter-block timing closure issues

#### > Goals

- >> Productivity
  - Order of magnitude reduction in compile time per domain
- >> Performance (near-spec)
- Predictable timing closure



# **Proposed Domain-specific Tool Flows**



# **Domain Tool Flow Example**



- > Fact
  - >> Emerging domains such as surveillance or vision have high replication
- > Community role
  - >> Identify and extract operators and functions in the domain
- > RapidWright value proposition
  - >> Assemble relocatable pre-implemented domain operators
  - >> Deliver the best inference/watt



# **Building Relocatable Domain-specific Shells**



- > Fact
  - >> Advances in silicon have created QoR opportunity
- > Community role
  - >> Domain-specific shell design or overlays
- > RapidWright value proposition
  - >> Achieve near-spec performance



# Success Scenario: Rapid Domain-specific Assembly





# What is RapidWright?



# RapidWright Overview

#### > Companion framework for Vivado

- >> Fast, light-weight, open source
- Communicates through Design CheckPoints<sup>1</sup> (DCPs)
- >> Java code, Python scripting

#### > Enables targeted solutions

- Reuse & relocate pre-implemented modules
- >> Just-in-time implementations
- Create shells & overlays

#### > Power user ecosystem

- Academic algorithm validation
- Rapid prototyping of CAD concepts







# 4 Ways to Design in RapidWright

#### **BUILD ROUTED CIRCUITS**



- > Well-defined circuits in seconds
- > Parameterizable library of generators

#### **REUSE P&R CIRCUITS**

> Reuse/relocate P&R circuits from Vivado

#### **SHELLS & OVERLAYS**

> Combine P&R circuits together

#### **FROM VIVADO**



# A Modular Pre-implemented Methodology

#### **USER TASKS (MANUAL)**

- 1. Design selection attributes:
  - Modular
  - Latency tolerant
  - Prefers replication
- 2. Placement planning



Match Design Structure to Architecture Patterns

#### **TOOL TASKS (AUTOMATED)**

- 3. P&R modules cached:
  - Relocatable
  - Reusable
  - Timing predictable

- 4. Run implementation
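The "relocatable" property in step 3 can be pictured as shifting every site a cached module occupies by a fixed offset. The sketch below is a toy model in Python, not RapidWright's API: `relocate_site` and `relocate_sites` are hypothetical helpers and the site names are made-up examples. On a real device each site type has its own X/Y grid, and a legal relocation must also land on compatible columns, so treat this purely as an illustration of the idea.

```python
import re

# Hypothetical helper pattern (not a RapidWright call): matches names such as "SLICE_X10Y20".
SITE_NAME = re.compile(r"^(?P<prefix>[A-Z0-9_]+)_X(?P<x>\d+)Y(?P<y>\d+)$")


def relocate_site(site_name: str, dx: int, dy: int) -> str:
    """Return the site name shifted by (dx, dy) in that site type's coordinate grid."""
    match = SITE_NAME.match(site_name)
    if match is None:
        raise ValueError(f"unrecognized site name: {site_name}")
    x = int(match.group("x")) + dx
    y = int(match.group("y")) + dy
    if x < 0 or y < 0:
        raise ValueError("relocation moves the site off the device")
    return f"{match.group('prefix')}_X{x}Y{y}"


def relocate_sites(site_names, dx: int, dy: int):
    """Shift a list of same-type sites by one common offset."""
    return [relocate_site(name, dx, dy) for name in site_names]


# Example: stamp a second copy of a two-slice module 60 rows higher.
copy_sites = relocate_sites(["SLICE_X10Y20", "SLICE_X11Y20"], dx=0, dy=60)
```

Because the offset is applied uniformly, the relative placement inside the module is unchanged, which is what lets timing results from the cached implementation carry over to each copy.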





# Creating Pre-implemented Modules (Vivado OOC Flow)





RapidWright Pre-implemented Module Flow







# **Design Performance Results**

| Design     | Target<br>Device | Baseline<br>(initial design) | RapidWright <sup>1</sup><br>Flow | Gain |
|------------|------------------|------------------------------|----------------------------------|------|
| Seismic    | KU040            | 270MHz                       | 390MHz                           | 41%  |
| FMA        | KU115            | 270MHz                       | 417MHz                           | 54%  |
| GEMM       | KU115            | 391MHz                       | 462MHz                           | 16%  |
| ML overlay | ZU9EG            | 368MHz                       | 541MHz                           | 50%  |

Speed Grade: -2

#### **Utilization table**

| Design           | LUT | FF  | DSP | BRAM |
|------------------|-----|-----|-----|------|
| Seismic          | 93% | 5%  | -   | -    |
| FMA (HPC design) | 25% | 50% | 97% | 6%   |
| GEMM             | 19% | 20% | 87% | -    |
| ML overlay       | 46% | 29% | 42% | 96%  |

<sup>1</sup>RapidWright: Enabling Custom Crafted Implementations for FPGAs, FCCM 2018



# Re-locatability & Reuse of Multiple Implementations

| RUN         | F <sub>MAX</sub> (MHz) |
|-------------|------------------------|
| Vivado      | 270                    |
| RapidWright | 417 (+53%)             |

- > 97% DSP utilization
- > 4.4 TeraOp/s
- > "Fabric discontinuities"
  - >> SLR boundary
  - >> IO Columns
  - >> Laguna Tiles







# Latency Flexibility: AXI Stream Register Slices





- > Exploiting latency-tolerance and architectural knowledge
  - >> Automatic insertion of latency blocks



# Debugging with an ILA (ChipScope)

I downloaded my design and it's not working. But it works in simulation!

You'll need to recompile with an ILA to debug it.

I added an ILA, but the bug is gone!





# **Experiment: Insert Pre-implemented ILA**

- > Preserves existing
  - >> Placement
  - >> Routing
- > Occupies only unused resources





# **Preserve Existing Placement & Routing**

Debug Blocks Inserted by RapidWright





### **Debug Instrumentation Speedup**





# **Beyond a Pre-implemented Methodology**

- > RapidWright probe router enables higher productivity
  - >> 21X more debug turns per day
  - >> Highest level of routing preservation possible
  - >> Future innovation:
    - iteration with extra probe inputs
    - Automatic insertion of pipeline flops to manage timing

| Vivado modify_debug_probes | RapidWright ProbeRouter | Speedup |
|----------------------------|-------------------------|---------|
| 130 mins                   | 6.3 mins                | 21X     |
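The 21X figure follows directly from the two run times in the table. As a quick check, the sketch below recomputes it and translates it into debug turns per day. The 8-hour working day is an assumption made here for illustration, not a number from the slides.

```python
def speedup(baseline_minutes: float, new_minutes: float) -> float:
    """Ratio of the baseline run time to the new run time."""
    return baseline_minutes / new_minutes


def turns_per_day(minutes_per_turn: float, working_minutes: float = 8 * 60) -> int:
    """Whole debug iterations that fit in one working day (8 hours assumed)."""
    return int(working_minutes // minutes_per_turn)


vivado_minutes = 130.0       # from the table above
rapidwright_minutes = 6.3    # from the table above

ratio = speedup(vivado_minutes, rapidwright_minutes)
vivado_turns = turns_per_day(vivado_minutes)
rapidwright_turns = turns_per_day(rapidwright_minutes)
```

Under that assumption, `ratio` rounds to 21, and the two flows allow 3 and 76 probe changes per day respectively.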



RapidWright Probes Rerouted



# **Pre-implemented Data Movement Shell**

- > Goals
  - >> Minimize overhead of compute (and overlays)
  - Prove shell assembly model

- > Build-to-order LinkBlaze<sup>1</sup> shell
  - >> 512 bit, bi-directional
  - >> RapidWright Pre-implemented modules

| Vivado | RapidWright   |
|--------|---------------|
| 516MHz | 620MHz (+20%) |







<sup>1</sup>LinkBlaze: Efficient global data movement for FPGAs (ReConFig 2017)

## **Just-in-time, Circuit Module Generators**

#### > Build modules on-demand

- >> Placed and routed *in seconds*
- >> Reusable and composable
- >> Target spec performance

#### > Parameterizable Generators

- >> Adder
- >> Subtractor
- >> Multiplier

#### > Expression Generator

- >> Invokes math generators
- >> Built to spec: 775MHz

 $x^2+3*x-5$ 
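One way to picture how an expression generator "invokes math generators" is to walk the expression tree and count which operator modules it needs. The sketch below does this with Python's standard `ast` module. It is an illustration only: the `count_operator_modules` function and the mapping of operators to module names are assumptions made here, not RapidWright's implementation, and the square is written as `x*x` so that only the three generators listed above are needed.

```python
import ast
from collections import Counter

# Assumed mapping from expression operators to the generator that would build them.
OPERATOR_MODULES = {
    ast.Add: "Adder",
    ast.Sub: "Subtractor",
    ast.Mult: "Multiplier",
}


def count_operator_modules(expression: str) -> Counter:
    """Count the operator modules needed to evaluate an arithmetic expression."""
    tree = ast.parse(expression, mode="eval")
    counts = Counter()
    for node in ast.walk(tree):
        if isinstance(node, ast.BinOp):
            module = OPERATOR_MODULES.get(type(node.op))
            if module is None:
                raise ValueError(f"unsupported operator: {type(node.op).__name__}")
            counts[module] += 1
    return counts


# x^2 + 3*x - 5, with the square expanded to a multiplication.
modules = count_operator_modules("x*x + 3*x - 5")
```

For the example expression this yields two multipliers, one adder and one subtractor, each of which could be stamped out as a pre-placed, pre-routed module.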









# RapidWright SLR Crossing DCP Creator

- SLR crossing module from scratch
  - >> Parameterizable
  - >> Closes timing at 760MHz
    - Clk Period: 1.313ns
  - >> Routed clock, placed and routed
  - >> Runs in seconds
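The clock period and frequency quoted above are two views of the same number. A minimal conversion sketch follows; the function names are chosen here for illustration.

```python
def period_ns_to_mhz(period_ns: float) -> float:
    """Convert a clock period in nanoseconds to a frequency in MHz."""
    if period_ns <= 0:
        raise ValueError("clock period must be positive")
    return 1000.0 / period_ns


def mhz_to_period_ns(frequency_mhz: float) -> float:
    """Convert a frequency in MHz to a clock period in nanoseconds."""
    if frequency_mhz <= 0:
        raise ValueError("frequency must be positive")
    return 1000.0 / frequency_mhz


achieved_mhz = period_ns_to_mhz(1.313)   # period reported on this slide
default_mhz = period_ns_to_mhz(1.538)    # default of the -x option in the help text
```

A 1.313 ns period works out to just under 762 MHz, consistent with the 760MHz headline, while the default 1.538 ns constraint corresponds to roughly 650 MHz.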

This RapidWright program creates a placed and routed DCP that can be imported into UltraScale+ designs to aid in high speed SLR crossings. See RapidWright documentation for more information.



```
Option                                  Description
-?, -h                                  Print Help
-a [String: Clk input net name]         (default: clk_in)
-b [String: Clock BUFGCE site name]     (default: BUFGCE_X0Y218)
-c [String: Clk net name]               (default: clk)
-d [String: Design Name]                (default: slr_crosser)
-i [String: Input bus name prefix]      (default: input)
-l [String: Comma separated list of     (default: LAGUNA_X2Y120)
  Laguna sites for each SLR crossing]
-n [String: North bus name suffix]      (default: north)
-o [String: Output DCP File Name]       (default: slr_crosser.dcp)
-p [String: UltraScale+ Part Name]      (default: xcvu9p-flgc2104-2-i)
-q [String: Output bus name prefix]     (default: output)
-r [String: INT clk Laguna RX flops]    (default: GCLK_B_0_1)
-s [String: South bus name suffix]      (default: south)
-t [String: INT clk Laguna TX flops]    (default: GCLK_B_0_0)
-u [String: Clk output net name]        (default: clk_out)
-v [Boolean: Print verbose output]      (default: true)
-w [Integer: SLR crossing bus width]    (default: 512)
-x [Double: Clk period constraint (ns)] (default: 1.538)
-y [String: BUFGCE cell instance name]  (default: BUFGCE_inst)
-z [Boolean: Use common centroid]       (default: false)
```

# Ongoing Work: C Code to Full Chip Accelerator in Seconds

#### > RapidWright generator capabilities

- >> UltraScale+ VU3P, 100% DSP utilization
- >> Front-end C code parser still in development
- >> Prototype back-end flow
- >> Runs in seconds (37 seconds)
- >> Achieves spec frequency (775 MHz)

#### > Future integration work:

- >> SLR crossing generator - target 750 MHz
- >> LinkBlaze (data movement) solution





## Leveraging Algorithmic Engines

#### > SAT Solver

- >> Resolve difficult, localized congestion routing
  - Finds solutions where Vivado cannot
- >> RapidWright front-end to SAT solver engine<sup>1</sup>
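To make the SAT formulation concrete, the sketch below encodes a tiny routing problem as boolean clauses and solves it by exhaustive search. It is a toy under stated assumptions: the net names, wire names and candidate routes are invented, the brute-force `solve` stands in for a real SAT engine, and none of this reflects the actual RapidWright front-end or the solver cited above. Each variable means "this net uses this candidate route"; clauses require exactly one route per net and forbid two nets from sharing a wire.

```python
from itertools import combinations, product


def build_clauses(candidates):
    """candidates maps net -> list of routes, each route a set of wire names.

    Returns (variables, clauses). A clause is a list of (variable, polarity)
    literals and is satisfied when at least one literal holds.
    """
    variables = [(net, i) for net, routes in candidates.items() for i in range(len(routes))]
    clauses = []
    for net, routes in candidates.items():
        own = [(net, i) for i in range(len(routes))]
        clauses.append([(v, True) for v in own])              # at least one route per net
        for a, b in combinations(own, 2):
            clauses.append([(a, False), (b, False)])          # at most one route per net
    for a, b in combinations(variables, 2):
        if a[0] != b[0] and candidates[a[0]][a[1]] & candidates[b[0]][b[1]]:
            clauses.append([(a, False), (b, False)])          # routes share a wire
    return variables, clauses


def solve(variables, clauses):
    """Exhaustive search stand-in for a SAT engine; fine only for tiny problems."""
    for values in product([False, True], repeat=len(variables)):
        assignment = dict(zip(variables, values))
        if all(any(assignment[v] == pol for v, pol in clause) for clause in clauses):
            return assignment
    return None


def route(candidates):
    """Return {net: chosen route index}, or None if the nets cannot all be routed."""
    variables, clauses = build_clauses(candidates)
    assignment = solve(variables, clauses)
    if assignment is None:
        return None
    return {net: i for (net, i), chosen in assignment.items() if chosen}


# Two nets competing for wires w2 and w4 in a congested region (made-up data).
congested = {
    "net_a": [{"w1", "w2"}, {"w3", "w4"}],
    "net_b": [{"w2", "w5"}, {"w4", "w6"}],
}
solution = route(congested)
```

The appeal of this formulation is that the solver considers all nets at once, so it can find a legal combination, or prove none exists, in cases where routing nets one at a time gets stuck.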

#### > Future Work

- Simultaneous SAT placement and routing solution
- >> ILP Solvers
  - Potential for placement solutions



<sup>1</sup>Fraisse, H., Gaitonde, D., A SAT-based timing driven Place and Route flow for critical soft IP (FPL 2018)



# How do I get started with RapidWright?





# Run RapidWright in Your Browser











## **FPGA'19 Invited Tutorial Paper**

#### Build Your Own Domain-specific Solutions with RapidWright

Invited Tutorial

Chris Lavin and Alireza Kaviani Xilinx Research Labs San Jose, CA chris.lavin@xilinx.com,alireza.kaviani@xilinx.com

#### ABSTRACT

As the complexity of programmable architectures increases with advances in silicon process technology, there is a growing need to extract greater productivity and performance from the tools. Due to their inherent reconfigurability, FPGAs are proving to be valuable targets for more efficient domain-specific architectures. However, FPGA implementation tools are designed for a broad set

In this paper we describe RapidWright, an open source framework that enables customized implementations for Xilinx FPGAs. RapidWright enables implementation tools that can take advantage of the great potential of domain-specific attributes, leading to greater productivity and performance. The focus of this paper is to provide an introductory reference of RapidWright and its use cases so that others may be empowered to adapt their implementations to their domain-specific applications.

#### CCS CONCEPTS

• Hardware → Reconfigurable logic and FPGAs; • Computer systems organization → Reconfigurable computing;

#### KEYWORDS

Domain-specific, Open Source, FPGA, Xilinx, Vivado

Chris Lavin and Alireza Kaviani. 2019. Build Your Own Domain-specific Solutions with RapidWright. In The 2019 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA '19), February 24-26, 2019, Seaside, CA, USA. ACM, New York, NY, USA, Article 4, 9 pages. https://doi.org/10.1145/3289602.3293928

#### 1 INTRODUCTION

RapidWright [1] is an open source platform with a gateway to Xilinx's back-end implementation tools (Vivado) that raises the implementation abstraction while maintaining the full potential of advanced FPGA silicon. RapidWright works synergistically with Vivado through design checkpoints (DCPs, see Figure 1) to enable highly customizable implementations. Vivado can produce highly optimized implementations for key design modules to deliver the highest performance. RapidWright can then replicate, relocate and assemble these tuned modules to compose a complete application and preserve high performance.

Figure 1: Vivado and RapidWright DCP Compatibility

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

FPGA '19, February 24-26, 2019, Seaside, CA, USA

© 2019 Copyright held by the owner/author(s). Publication rights licensed to Association for Computing Machinery. ACM ISBN 978-1-4503-6137-8/19/02...\$15.00

https://doi.org/10.1145/3289602.3293928

RapidWright's native gateway to Vivado also sets the groundwork for an ecosystem aimed at further advancing FPGA tools. It empowers academic and industry researchers by combining the commercial credibility of FPGA tools with the agility of an open source framework, leading to innovative solutions that might not be feasible otherwise.

This paper serves as a supplemental reference to the RapidWright tutorial with an aim to provide some fundamentals about the framework and introductory use cases. In the remainder of this paper we describe RapidWright and its capabilities in Section 2, some example use cases in Section 3 and conclude in Section 4. Supplementary material on Xilinx architecture is included in Appendix A to help orient the reader regarding specific RapidWright constructs.

#### 2 RAPIDWRIGHT STRUCTURE

RapidWright is implemented in Java and distributed with a foundational API library that provides access to design checkpoint (DCP) files and Vivado-compatible device models. A high-level diagram of the project's organization is shown in Figure 2. There are three core Java packages (groups of classes) within RapidWright: device, edif (logical netlist) and design (physical netlist). This section describes the purpose and composition of each one.



from being moved.

Routing nets inside of a site (intra-site) is different from routing outside of sites (inter-site), and the SiteInst maintains all relevant information concerning intra-site routing. Routing inside of a site must account for placed cells, their type and context. In general, when constructing placed and routed logic, it can be beneficial to compare SiteInst content from Vivado-generated implementations to ensure correctness. This can be done by loading placed and routed DCPs from Vivado into RapidWright and querying the respective SiteInst objects to establish patterns for site wire and

Routing is accomplished inside a site through SitePIPs, which establish a connection through routing BELs and some logic BELs (such as LUTs). The SiteInst object in RapidWright maintains site PIP usage. By default, all site PIPs are turned off. If a SitePIP is added to the SiteInst then it is marked as being turned on or

A Net in RapidWright contains the routing information to physically connect placed cells using device interconnect or PIPs. Many logical nets can map to the same physical net; for example, consider the net depicted in Figure 5. This figure shows the logical netlist


2.7 N



Figure 6: Physical netlist view of a particular physical net

definition of an implementation. This object is unique to RapidWright and is one of its enabling constructs that allows placed and routed information to be preserved, relocated and replicated. A module contains both the logical and physical netlist elements and corresponds to a hierarchical cell within a netlist. It is similar to a placed and routed out-of-context DCP, however RapidWright



ces entering/leaving the cell). Figure 3 illustrates how Rapid-

#### 2.3 Design Package (Physical Netlist)

The design package is the collection of objects used to describe how a logical netlist maps to a device netlist. A design is also referred to as a physical netlist or implementation. It contains all of the primitive logical cell mappings to hardware, specifically the collection of

mation for a design. It keeps track of the logical netlist, physical netlist, constraints, the device and part references among other things. The Design class is most similar to a design checkpoint in that it contains all the information necessary to create a DCP file. The remainder of this subsection describes the major object classes found in the design package.

#### 2.4 Cell (A BEL Instance)

At the lowest level, a RapidWright Cell maps a logical leaf cell from the EDIF netlist (EDIFCellInst) to a BEL as shown in Figure 4. The cell name is typically the full hierarchical logical name of the leaf cell to which it maps. A cell also maintains the logical cell pin



Figure 4: Shows mapping between BEL/Cell, Site/SiteInst and Device/Design.

#### 2.5 SiteInst

Design representation and implementation in Vivado is BEL-centric (BELs and cells). The SiteInst keeps track of three major mappings:

- (1) Map of all cells to BELs (placements in site)
- (3) Nets to site wires (intra-site routing)

Each SiteInst maps to a single, compatible site within a device. The SiteInst is configured to a type using a SiteTypeEnum that is either the primary type or an alternate site type of the host site. RapidWright also preserves the same Vivado 'fixed' flag which is



Figure 11: Xilinx FPGA Architecture Hierarchy



Figure 12: Intersite and Intrasite Routing Resource

referred to as the "placement" of the cell. Non-leaf cells represent hierarchy of the netlist and do not require placement. Thus, when one runs the Vivado command place\_design, it is essentially mapping cells in the netlist to compatible and legal BEL sites. Routing BELs are programmable muxes used to route signals between BELs. Routing BELs do not support any design elements (cells from the netlist do not occupy routing BEL sites). However, some routing BELs do have optional inversion

BELs have input and output pins and configurable connections that connect an input pin to an output pin. These BEL-based configurable connections are called site PIPs (Programmable Interconnect Points). Both logic BELs and routing BELs can have site PIPs. However, in the case of a logic BEL, the site must be unoccupied by a cell for the site PIP to be usable. These site PIPs, when implemented in logic BELs (such as a LUT), are called "route-thrus." When routing a design, it is sometimes necessary to route through unused LUTs (or other BELs) using site PIPs to complete a route.
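The occupancy rule for route-thrus can be captured in a few lines. The sketch below is a toy Python model and not the RapidWright SiteInst class: `ToySite`, its method names, and the BEL and pin names used in the example are all illustrative assumptions. It tracks which logic BELs hold a cell and which site PIPs are turned on, and rejects a route-thru on a logic BEL that is already occupied.

```python
class ToySite:
    """Toy stand-in for a site; BEL and pin names used with it are illustrative only."""

    def __init__(self, logic_bels, routing_bels):
        # Logic BELs can host a cell; routing BELs never do.
        self.cells = {bel: None for bel in logic_bels}
        self.routing_bels = set(routing_bels)
        # Every site PIP starts turned off, so the "on" set begins empty.
        self.site_pips_on = set()

    def place_cell(self, bel, cell_name):
        """Place a netlist cell on a free logic BEL."""
        if bel not in self.cells:
            raise ValueError(f"{bel} is not a logic BEL")
        if self.cells[bel] is not None:
            raise ValueError(f"{bel} already holds {self.cells[bel]}")
        if any(pip_bel == bel for pip_bel, _, _ in self.site_pips_on):
            raise ValueError(f"{bel} is in use as a route-thru")
        self.cells[bel] = cell_name

    def turn_on_site_pip(self, bel, in_pin, out_pin):
        """Turn on a site PIP; on a logic BEL this is a route-thru and needs the BEL free."""
        if bel in self.cells:
            if self.cells[bel] is not None:
                raise ValueError(f"{bel} is occupied and cannot be a route-thru")
        elif bel not in self.routing_bels:
            raise ValueError(f"unknown BEL: {bel}")
        self.site_pips_on.add((bel, in_pin, out_pin))


site = ToySite(logic_bels=["A6LUT", "B6LUT"], routing_bels=["FFMUXA1"])
site.turn_on_site_pip("FFMUXA1", "D6", "OUT1")   # routing BEL: no occupancy check needed
site.turn_on_site_pip("A6LUT", "A1", "O6")       # unused LUT serves as a route-thru
site.place_cell("B6LUT", "u0/lut_0")             # occupied LUT can no longer route through
```

Calling `site.turn_on_site_pip("B6LUT", "A1", "O6")` after the placement above raises `ValueError`, mirroring the rule that a logic BEL must be free of cells before its site PIP can be used.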



# RapidWright Resources: www.rapidwright.io



# **Today After Lunch (1:45PM)**

# RapidWright FPGA 2019 Deep Dive Tutorial

| Tutorial Segment                 | Time    | Purpose                                             |
|----------------------------------|---------|-----------------------------------------------------|
| Hello, World (Jupyter)           | 5 mins  | Intro to RapidWright within Jupyter Notebook        |
| Create Netlist from Scratch      | 10 mins | How to build a netlist from scratch                 |
| Pipeline Generator               | 15 mins | How to generate a circuit in RapidWright            |
| Pre-implemented Modules: Part I  | 15 mins | How to create a pre-implemented module              |
| Pre-implemented Modules: Part II | 15 mins | How to use and relocate pre-implemented modules     |
| Probe Re-router (Jupyter)        | 20 mins | Fast probe routing on existing implementation       |
| SAT Router (Jupyter)             | 15 mins | How to use a SAT engine to solve routing congestion |
| Create and Use an SLR Bridge     | 25 mins | Combine Vivado and RapidWright generated circuits   |





# Conclude



# **Summary**





- > Build routed circuits & reuse P&R circuits
- > RapidWright enables:
  - >> Performance improvement of 50%
  - >> Debug productivity >10X
- > Leverage algorithmic engines (SAT, ILP, ...)
- > www.rapidwright.io



# RapidWright Enables DSA Compilers



- > Hard problems, let's work together
- > Domain-specific optimizations
- > Architecture exploration
- > Empower those closest to the problem





# Adaptable. Intelligent.



