Rapid VLIW Processor Customization for Signal Processing Applications Using Combinational Hardware Functions

Hoare, Raymond R.; Jones, Alex K.; Kusic, Dara; Fazekas, Joshua; Foster, John; Tung, Shenchih; McCloud, Michael

doi:10.1155/ASP/2006/46472

Research Article
Open access
Published: 01 December 2006

EURASIP Journal on Advances in Signal Processing volume 2006, Article number: 046472 (2006)

Abstract

This paper presents an architecture that combines VLIW (very long instruction word) processing with the capability to introduce application-specific customized instructions and highly parallel combinational hardware functions for the acceleration of signal processing applications. To support this architecture, a compilation and design automation flow is described for algorithms written in C. The key contributions of this paper are as follows: (1) a 4-way VLIW processor implemented in an FPGA, (2) large speedups through hardware functions, (3) a hardware/software interface with zero overhead, (4) a design methodology for implementing signal processing applications on this architecture, and (5) tractable design automation techniques for extracting and synthesizing hardware functions. Several design tradeoffs for the architecture were examined, including the number of VLIW functional units and the register file size. The architecture was implemented on an Altera Stratix II FPGA. The Stratix II device was selected because it offers a large number of high-speed DSP (digital signal processing) blocks that execute multiply-accumulate operations. Using the MediaBench benchmark suite, we tested our methodology and architecture to accelerate software. Our combined VLIW processor with hardware functions was compared to software executing on a RISC processor, specifically the soft-core embedded NIOS II processor. For software kernels converted into hardware functions, we show a hardware performance multiplier of up to times that of software with an average times faster. For the entire application in which only a portion of the software is converted to hardware, the performance improvement is as much as 30X over the nonaccelerated application, with a 12X improvement on average.
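
The abstract distinguishes the speedup of an individual kernel from the speedup of the whole application when only part of the software is moved into hardware. One common way to relate the two is Amdahl's law; the sketch below is an illustration under that assumption and is not a model taken from the paper. The function name `application_speedup` and the sample fractions and kernel speedup are hypothetical, chosen only to show the shape of the relationship.

```python
def application_speedup(kernel_fraction: float, kernel_speedup: float) -> float:
    """Amdahl's law: overall speedup when only part of the run time is accelerated.

    kernel_fraction: share of the original run time spent in the accelerated
                     kernels, in [0, 1] (hypothetical input, not from the paper).
    kernel_speedup:  speedup of those kernels alone (hypothetical input).
    """
    if not 0.0 <= kernel_fraction <= 1.0:
        raise ValueError("kernel_fraction must be in [0, 1]")
    if kernel_speedup <= 0.0:
        raise ValueError("kernel_speedup must be positive")
    # The unaccelerated part keeps its original cost; the kernel part shrinks.
    remaining_time = (1.0 - kernel_fraction) + kernel_fraction / kernel_speedup
    return 1.0 / remaining_time


if __name__ == "__main__":
    # Illustrative values only: a kernel that runs 100x faster in hardware.
    for fraction in (0.50, 0.90, 0.99):
        overall = application_speedup(fraction, 100.0)
        print(f"kernel share {fraction:.2f} -> application speedup {overall:.2f}x")
```

Under this model the application-level gain is bounded by the share of time that stays in software, which is why a whole-application figure is typically much smaller than the per-kernel multiplier.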

Author information

Authors and Affiliations

Department of Electrical and Computer Engineering, University of Pittsburgh, Pittsburgh, PA, 15261, USA
Raymond R. Hoare, Alex K. Jones, Dara Kusic, Joshua Fazekas, John Foster, Shenchih Tung & Michael McCloud

Corresponding author

Correspondence to Raymond R. Hoare.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 2.0 International License ( https://creativecommons.org/licenses/by/2.0 ), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

About this article

Cite this article

Hoare, R.R., Jones, A.K., Kusic, D. et al. Rapid VLIW Processor Customization for Signal Processing Applications Using Combinational Hardware Functions. EURASIP J. Adv. Signal Process. 2006, 046472 (2006). https://doi.org/10.1155/ASP/2006/46472

Received: 12 October 2004
Revised: 30 June 2005
Accepted: 12 July 2005
Published: 01 December 2006
DOI: https://doi.org/10.1155/ASP/2006/46472
