This is a network layer 4 load balancer implemented on an FPGA. The purpose of this project was to create a proof of concept rather than a fully-featured product. Therefore, this documentation is not purely technical; it also contains my thoughts and reflections on the whole project, as I hope they may be educational. The project was created as part of the FPGA Programming course that I took at my university.
- The hardware part is mostly written in Hardcaml (a domain-specific language embedded in OCaml).
- Fully pipelined and takes ~30 cycles to process one packet.
- Uses a RISC-V soft core (VexRiscv) to control the load balancer.
- The software running on the core uses the FreeRTOS kernel and is written in C.
- Written for the Altera DE2-115 FPGA board.
The main part of the load balancer is the data plane. It contains all the logic required to forward a packet. The MAC (Triple-Speed Ethernet Intel® FPGA IP) is connected to the data plane through a streaming interface and to the Ethernet PHY chip through an RGMII interface.
The RISC-V core is used to control the load balancer and to communicate with a PC through JTAG. Internally, an Avalon-MM interface is used as the bus.
There are two clock domains: a fast one used on the datapath (the MAC and the data plane), and a slower one used for the remaining logic.
The load balancer works on the Transport Layer, i.e. it is oblivious to the data that is being sent. It works in DSR (Direct Server Return) mode.
The following diagram shows how the balancer operates:
Steps:
- Receive a packet with the destination address set to a VIP (virtual IP address - the IP address of the service).
- Choose a real server - compute a hash from 4 values (src IP, dst IP, src port, dst port) and pick a server based on that hash. This way, packets from the same session are always sent to the same real server (see the sketch after this list).
- Encapsulate the packet in another IP packet and send it to the chosen real server.
- The real server processes the packet and replies directly to the user.
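To make the selection step concrete, here is a minimal software model of it in C. This is only an illustration (the real logic is implemented in the Hardcaml data plane); the hash function, ring size and names below are made up:

```c
#include <stdint.h>

#define RING_SIZE 32u               /* illustrative ring size, not the real one */

/* Each ring entry holds the index of a real server; the real's IP is kept
 * in a separate table and becomes the destination of the outer IP header. */
static uint32_t hash_ring[RING_SIZE];

/* Toy mixing step standing in for whatever hash the data plane computes. */
static uint32_t mix(uint32_t h, uint32_t v) {
    h ^= v;
    h *= 0x9e3779b1u;               /* arbitrary odd constant */
    return h ^ (h >> 16);
}

/* Pick a real server for a flow: the same 4-tuple always yields the same
 * hash, so packets of one session keep going to the same real. */
uint32_t pick_real(uint32_t src_ip, uint32_t dst_ip,
                   uint16_t src_port, uint16_t dst_port) {
    uint32_t h = 0;
    h = mix(h, src_ip);
    h = mix(h, dst_ip);
    h = mix(h, ((uint32_t)src_port << 16) | dst_port);
    return hash_ring[h % RING_SIZE];
}
```

The index returned here is used to look up the real server's IP address, which then becomes the destination of the encapsulating IP header from the step above.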
The data plane looks up a real server using a hash ring, which is configured by the core. In practice, a consistent hashing algorithm should be used here, but for the sake of simplicity the hash ring is programmed in a naive way - every entry of the hash ring is specified manually (see the sketch below).
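To show what "specified manually" means in practice, the control software could fill each entry of the ring itself, for example by striping the configured reals round-robin across it. The sketch below is hypothetical: the base address, register layout and sizes are made up; only the idea of writing every entry explicitly matches the project:

```c
#include <stdint.h>

#define HASH_RING_BASE 0x40000000u   /* made-up Avalon-MM address of the ring table */
#define RING_SIZE      32u           /* illustrative ring size */

static volatile uint32_t *const hash_ring =
    (volatile uint32_t *)HASH_RING_BASE;

/* Naive ring setup: no consistent hashing, just write every entry explicitly,
 * spreading the real-server indices round-robin over the ring. */
void program_hash_ring(const uint32_t *real_ids, uint32_t n_reals) {
    for (uint32_t i = 0; i < RING_SIZE; i++) {
        hash_ring[i] = real_ids[i % n_reals];
    }
}
```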
The balancer can handle multiple VIPs (i.e. multiple services with different IPs), each with its own separate collection of real servers.
The project is divided into several components, each located in a separate directory. For a more detailed description, please refer to the README in each sub-directory.
- env - scripts for creating a Docker image with all tools, packages and toolchains required to build the project
- hardcaml - hardware design of the load balancer
- ip - IP components used in the project
- monitor - a tool communicating with the balancer through UART. Used for monitoring and sending commands
- quartus - all files related to synthesis: Quartus Project Files, QSF files, SDC files. It also contains the description of the Platform Designer system used for interconnecting logic to intellectual property functions and subsystems.
- rtl - Verilog source files with the top-level module and a few other auxiliary modules
- sim - a simulator of the RISC-V core
- spinal - configuration of the VexRiscv core
- sw - the software running on the soft-core
- tools - miscellaneous tools and scripts
Please note that the only supported FPGA board is the Altera DE2-115.
The build system works well, but it is not perfect. Therefore, to build the project you have to follow these steps:
- Ensure you have all the tools set up (the easiest way is to work in the provided Docker container)
- Build the data plane - run `make` in the hardcaml directory
- Build the software - run `make` in the sw directory
- Build the Quartus project - run `make` in the quartus directory
These commands will generate all Verilog sources, synthesize them, perform P&R and timing analysis, and generate a bitstream ready to program an FPGA. Then simply run `make program` in the quartus directory to load the bitstream.
A simple framework for testing the RISC-V core is located in the sim subdirectory. All bus slaves that are normally connected to the core are emulated in software. Verilator is used to compile the HDL sources of the core to C++.
Separately, there is a simulator for the data plane. It is written exclusively in Hardcaml in the form of testbenches. Some of them are fine-grained (testing a single circuit) and others are more comprehensive. There is one big regression test of the whole data plane, which simply injects a data stream and verifies that the output is correct.
It is also possible to plug the load balancer into Linux's networking stack. The code in the hardcaml/test/tap directory creates a TAP device, receives packets from it, forwards them through the simulated load balancer and writes the outgoing packets back to the TAP device. Effectively, it allows testing the load balancer as if it were a normal physical device, and Linux's networking utilities become very handy (a minimal sketch of the mechanism is shown below).
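The TAP harness itself is OCaml code inside the Hardcaml testbench, but the underlying Linux mechanism is the standard TUN/TAP interface. A minimal C sketch of the same loop (interface name, buffer size and error handling are simplified and not taken from the project):

```c
#include <fcntl.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/if.h>
#include <linux/if_tun.h>

int main(void) {
    struct ifreq ifr;
    int fd = open("/dev/net/tun", O_RDWR);
    if (fd < 0)
        return 1;

    memset(&ifr, 0, sizeof(ifr));
    ifr.ifr_flags = IFF_TAP | IFF_NO_PI;          /* raw Ethernet frames, no extra header */
    strncpy(ifr.ifr_name, "lb0", IFNAMSIZ - 1);   /* arbitrary interface name */
    if (ioctl(fd, TUNSETIFF, &ifr) < 0)
        return 1;

    unsigned char frame[2048];
    for (;;) {
        ssize_t n = read(fd, frame, sizeof(frame));   /* frame coming out of Linux */
        if (n <= 0)
            break;
        /* ... run `frame` through the simulated data plane here ... */
        write(fd, frame, (size_t)n);                  /* forwarded frame back into Linux */
    }
    return close(fd);
}
```

Once the interface is brought up, ordinary tools such as ping or tcpdump can be pointed at it, which is exactly what makes this setup convenient.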
Resource | LUTs | Registers | Memory |
---|---|---|---|
Whole system | 24211 | 22249 | 830 Kbit |
Data plane | 11685 | 10927 | 2.9 Kbit |
The data plane runs at 100 MHz (fmax is ~125 MHz and is limited by Intel's MAC). Everything else runs at 50 MHz.
The maximum throughput achieved on the FPGA board is 1 Gbit/s and 1.56 Mp/s, which is expected because the PHY chip on the board supports only Gigabit Ethernet. However, the theoretical throughput of the data plane is 4 Gbit/s and 6.25 Mp/s (assuming UDP packets with a 16-byte payload). These numbers could be scaled up even further by widening the data path (currently 32 bits).
As this was only a half-semester project, I had limited time to work on it. It was never meant to be a fully working load balancer, but rather an opportunity to learn Hardcaml and how to program FPGAs in general. Therefore, many crucial features are missing, including:
- Packet fragmentation
- IPv4 header checksum validation (a small reference sketch is shown after this list)
- Reasonable VIP management (right now, only 8 VIPs and 32 real servers can be programmed)
- Consistent hashing
- Simulation of the load balancer and the core together
- and many many more ...
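As a side note, the missing IPv4 header checksum validation is the standard one's-complement sum from RFC 1071; a small reference version in C (not taken from the project) looks like this:

```c
#include <stddef.h>
#include <stdint.h>

/* One's-complement sum over the IPv4 header (RFC 1071).
 * `len` is the header length in bytes. For a received header that already
 * contains its checksum field, a result of 0 means the header is valid. */
uint16_t ipv4_checksum(const uint8_t *hdr, size_t len) {
    uint32_t sum = 0;
    for (size_t i = 0; i + 1 < len; i += 2)
        sum += ((uint32_t)hdr[i] << 8) | hdr[i + 1];
    while (sum >> 16)                       /* fold carries back in */
        sum = (sum & 0xffffu) + (sum >> 16);
    return (uint16_t)~sum;
}
```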
This project is MIT licensed, as found in the LICENSE file.