Moving Beyond Single Processor on Chip




Muataz H. Salih Al-Doori
Research Engineer, PhD
B.Sc. Computer Engineering (Univ. of Technology, Baghdad/Iraq), M.Sc. Computer Engineering (Univ. of Technology, Baghdad/Iraq).

Project Title: Design and Implementation of Embedded Multiprocessor SoC for Tracking and Navigation Systems Using FPGA Technology



In the 1990s, RISC microprocessors became so popular that they replaced some CISC architectures. The success of ASICs and SoCs then eased the advent of post-RISC processors, which are usually generic RISC architectures augmented with additional components. Several approaches to accelerating a microprocessor have been proposed, but the main idea is generally to start from a general-purpose RISC core and add extra components such as dedicated hardware accelerators. The reason is that applications and protocols change quickly, so it is worth keeping a degree of software programmability: a programmable core gives the platform generality and flexibility. One way to accelerate a programmable core is to exploit the instruction-level and/or data-level parallelism of applications by giving the processor Very Long Instruction Word (VLIW) or Single-Instruction Multiple-Data (SIMD) extensions. Another way is to add special functional units (for example MAC circuits, barrel shifters, or other components designed to speed up particular algorithms) to the datapath of the programmable core, so that the instruction set is extended with specialized instructions for operations that are both heavy and frequent. This approach, however, is not always possible or convenient, especially if the component to be plugged into the pipeline is very large.
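The effect of extending the instruction set with a MAC instruction can be illustrated with a minimal sketch. The `ToyCore` class and the `dot` function are hypothetical names introduced here for illustration only; the model simply counts executed arithmetic instructions and ignores pipelining, memory accesses, and loop overhead.

```python
from dataclasses import dataclass


@dataclass
class ToyCore:
    """Hypothetical single-issue core that only counts arithmetic instructions."""
    has_mac: bool = False
    executed: int = 0

    def mul(self, a, b):
        self.executed += 1
        return a * b

    def add(self, a, b):
        self.executed += 1
        return a + b

    def mac(self, acc, a, b):
        # Specialized instruction: multiply and accumulate in one issue slot.
        if not self.has_mac:
            raise RuntimeError("this core has no MAC unit")
        self.executed += 1
        return acc + a * b


def dot(core, xs, ys):
    """Dot product, using the MAC instruction when the core provides one."""
    acc = 0
    for x, y in zip(xs, ys):
        if core.has_mac:
            acc = core.mac(acc, x, y)
        else:
            acc = core.add(acc, core.mul(x, y))
    return acc


xs, ys = [1, 2, 3, 4], [5, 6, 7, 8]
plain, extended = ToyCore(), ToyCore(has_mac=True)
print(dot(plain, xs, ys), plain.executed)        # 70 8
print(dot(extended, xs, ys), extended.executed)  # 70 4
```

Under these assumptions the extended core produces the same result with half the arithmetic instructions, which is the kind of gain expected for operations that are both heavy and frequent.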

Moreover, large ASIC blocks are less attractive today because they are usually expensive and inflexible: they become useless whenever the application or the standard they implement changes. For this reason many microprocessors come with a special interface meant to ease the attachment of external accelerators. There are basically two possibilities: large, general-purpose accelerators intended to serve as many applications as possible, or very large, powerful, run-time reconfigurable accelerators. The design and verification issues of coprocessors can be addressed independently of those of the main processor. Design activities can therefore proceed in parallel, which saves time; alternatively, if the core already exists before the coprocessors are designed, the coprocessors can simply be plugged into the system as black boxes, with no need to modify the architecture of the processor.

Today a single chip can host an entire system that is cheap and yet powerful enough to run several applications, including demanding ones such as image, video, graphics, and audio processing, which are becoming extremely popular at the consumer level, even in portable devices. These applications must be carefully analyzed to determine the best way to map them onto the available hardware. Since applications normally consist of a control part and a computation part, the first stage usually consists in locating the computational kernels of the algorithms.
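One rough way to judge which kernels are worth locating and offloading is Amdahl's law, which the text does not name but which follows from the split into a control part and a computation part. The sketch below is an illustration under that assumption; the function name `overall_speedup` and the example figures are hypothetical.

```python
def overall_speedup(kernel_fraction, kernel_speedup):
    """Amdahl's law: whole-application speedup when only the kernel is accelerated.

    kernel_fraction: share of the original run time spent in the kernel (0..1).
    kernel_speedup:  factor by which the accelerator speeds up the kernel (> 0).
    """
    if not 0.0 <= kernel_fraction <= 1.0:
        raise ValueError("kernel_fraction must lie between 0 and 1")
    if kernel_speedup <= 0:
        raise ValueError("kernel_speedup must be positive")
    control_time = 1.0 - kernel_fraction            # stays in software
    kernel_time = kernel_fraction / kernel_speedup  # runs on the accelerator
    return 1.0 / (control_time + kernel_time)


# Same 10x accelerator, two different kernels (illustrative numbers only).
print(f"{overall_speedup(0.8, 10):.2f}")  # 3.57
print(f"{overall_speedup(0.2, 10):.2f}")  # 1.22
```

The same accelerator pays off far more for a kernel that dominates the run time, which is why the analysis starts by finding the computational kernels.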

These kernels are usually mapped onto dedicated parts of the system (namely dedicated processing engines), optimized to exploit the regularity of the operations performed on large amounts of data, while the remaining part of the code (the control part) is implemented in software running on a regular microprocessor. Sometimes special versions of known algorithms are devised to meet the demand for an optimal implementation in hardware. Different application domains call for different kinds of accelerators. For example, applications such as robotics, automation, Dolby Digital audio, and 3D graphics require floating-point computation, which makes the insertion of floating-point units (FPUs) very useful and sometimes even necessary. Covering the broad range of modern, computationally demanding applications such as imaging, video compression, and multimedia requires other kinds of accelerators: those applications usually benefit from regular, vector architectures able to exploit the regularity of the data while satisfying the high bandwidth requirements. One possibility is to produce so-called multimedia SoCs, which are usually a particular kind of multiprocessor system-on-chip (MPSoC) containing different types of processors and which meet these demands far better than homogeneous MPSoCs.

Such machines are usually quite large, so a widely accepted and very effective way of addressing this problem is to make those architectures run-time reconfigurable. This means that the hardware is designed so that the datapath of the architecture can be changed by modifying the value of special bits, called configuration bits. An early example of reconfigurable hardware that became very popular is the FPGA, which can implement virtually any circuit when the right configuration bits are sent to the device. The idea of reconfigurability was then developed further, leading to custom devices used to implement powerful computation engines. In this way several different functionalities can be implemented on the same component, saving area while tailoring the hardware at run time to implement an optimal circuit for a given application. Reconfigurability is an excellent means of combining the performance of hardware circuits with the flexibility of programmable architectures.
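The role of configuration bits can be shown with a toy model of a 2-input look-up table, the kind of basic element found in FPGA logic. This is a minimal sketch and not a model of any real device; the class `Lut2` and the helper `truth_table` are hypothetical names used only for illustration.

```python
class Lut2:
    """Toy 2-input look-up table: 4 configuration bits select the Boolean function."""

    def __init__(self, config_bits=0b0000):
        self.configure(config_bits)

    def configure(self, config_bits):
        # Loading new configuration bits changes the function, not the hardware.
        if not 0 <= config_bits <= 0b1111:
            raise ValueError("a 2-input LUT has exactly 4 configuration bits")
        self.config_bits = config_bits

    def evaluate(self, a, b):
        # The two inputs select which configuration bit drives the output.
        index = ((a & 1) << 1) | (b & 1)
        return (self.config_bits >> index) & 1


def truth_table(lut):
    """Outputs for inputs (0,0), (0,1), (1,0), (1,1), in that order."""
    return [lut.evaluate(a, b) for a in (0, 1) for b in (0, 1)]


lut = Lut2()
lut.configure(0b1000)    # behaves as AND
print(truth_table(lut))  # [0, 0, 0, 1]
lut.configure(0b0110)    # same component, now behaves as XOR
print(truth_table(lut))  # [0, 1, 1, 0]
```

The same component implements different functions at different times, which is the area saving and run-time tailoring described above, here reduced to a single logic element.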

Accelerators come in different forms and can differ greatly from one another. The differences can relate to the purpose for which they are designed (an accelerator can be designed to implement a single algorithm, or it can support a broad range of applications), their implementation technology (ASIC custom design, ASIC standard cells, FPGAs), the way in which they interface with the rest of the system, and their architecture. As a result, multiprocessor systems now exist on a single chip.

