A fine-grained dataflow library for reconfigurable streaming accelerators

Open Access
Chandrashekhar, Aarti
Graduate Program:
Electrical Engineering
Master of Science
Document Type:
Master Thesis
Date of Defense:
October 05, 2011
Committee Members:
  • Vijaykrishnan Narayanan, Thesis Advisor
Keywords:
  • FPGA
  • framework
  • dataflow
  • reconfigurable
In this thesis, a library of basic operators for accelerating complex algorithms on a Field-Programmable Gate Array (FPGA) is proposed. The components of this custom Register Transfer Level (RTL) hardware library are designed to provide fine-grained control over resources while accelerating algorithms on an FPGA. The library is also extensible, allowing designers to develop custom operators. A hardware framework that eases the composition of systems from the library's components is also presented. This approach facilitates dataflow programming at the application level for mapping an algorithm onto the hardware components. The framework is highly modular and configurable in terms of hardware resources, bit-width allocation, and accuracy. In addition, its hierarchical nature allows recursive definitions of custom operators, so that complex operators can be built from the library operators. The framework is well suited to image processing tasks because it takes into account the streaming requirements of such applications. The initial architecture of the framework and its drawbacks are discussed, and an improved architecture that overcomes these drawbacks is presented. Biologically inspired vision processing algorithms, with applications such as saliency detection and object recognition, are studied as a use case of the framework. In particular, the implementation of a bio-inspired architecture of the retinal and Lateral Geniculate Nucleus (LGN) processing stages using the proposed framework is detailed. All hardware examples are synthesized and verified on a Xilinx Virtex-6 SX475T FPGA. The FPGA implementation is also compared with a multi-core CPU implementation of the algorithm, and the FPGA-based implementation is shown to outperform the CPU-based implementation by a factor on the order of 10.