# FR500 VLIW-architecture High-performance Embedded Microprocessor

Takao Sukemura

(Manuscript received January 20, 2000)

A new-concept FR500 microprocessor using the VLIW architecture has been developed for digital consumer products. It can issue four instructions simultaneously and can be configured in a small-scale circuit, making it possible to implement a low-cost, high-performance system. It combines two 32-bit integer operation units, two 32-bit floating-point operation units, and two 16-bit media processing operation units, providing a peak performance of 532 MIPS, 1064 MFLOPS, and 4256 MOPS at 266 MHz. The flexible instruction set provided by the VLIW architecture makes it possible to add specialized DSP instructions for signal processing as well as customer-defined instructions. This enables development of a microprocessor optimized for digital consumer products – a market that is expecting explosive growth.

## 1. Introduction

Conventionally, Fujitsu provides two families of 32-bit microprocessors: the FR and SPARClite families. The FR family is a series of RISC microcontrollers with powerful processing for control systems and is used in products such as digital video cameras (DVCs) and portable phones. The SPARClite family is a series of RISC processors with a workstation (WS) SPARC architecture optimized for embedded applications. Due to its high performance, the SPARClite family is used mainly in image processors such as digital still cameras (DSCs) and laser beam printers (LBPs).

However, the further development of new digital multimedia consumer products requires a new high-performance processor, because conventional RISC processors alone have insufficient processing power. Processing of multimedia video images requires more than 10 times the previous performance, and this higher performance should be implemented using highly flexible software.

For these digital consumer products, we have developed a new FR500 microprocessor that has a very long instruction word (VLIW) architecture. The FR500 is the first product in the FR-V family, which is Fujitsu's generic name for VLIW architecture microprocessors. The FR-V family will be made more flexible in the future so that new products optimized for a wide variety of digital consumer products can be produced (**Figure 1**).



Figure 1 Roadmap of Fujitsu microprocessors.

## 2. Requirements for embedded microprocessors

The FR500 is a general-purpose microprocessor tailored for embedded applications. Consequently, the FR500 usage environment is different from that of general-purpose microprocessors used in personal computers (PCs) and WSs, resulting in substantial differences in development concepts.

General-purpose microprocessors designed for PCs/WSs are developed on the following principles:

- Binary compatibility as an absolute requirement
- Performance over cost

Embedded microprocessors require the following:

- Low cost/high performance
- High code efficiency
- Low power consumption
- Reduced system cost
- Provision of the CPU core as IP for System-on-a-Chip (SOC) designs
- A substantial software development environment
- Easy debugging
- A faster real-time response
- Flexibility to meet customers' needs

## 3. FR500 hardware

The FR500 adopts the following two parallel processing concepts to achieve higher performance:

- Parallelism in instruction issuance (VLIW)
- Parallelism in data (SIMD: Single Instruction stream, Multiple Data stream)

There are two possible methods for enhancing performance: increasing the clock frequency and using parallel processing. Due to the special characteristics of embedded microprocessors, we decided to use the above two parallel processing concepts to enhance execution performance without increasing the clock frequency.

Conventional RISC processors generally perform parallel processing using the super-scalar method, in which hardware extracts simultaneously executable instructions and then executes two or more instructions simultaneously. By using this method, the higher-level products in the SPARClite family already achieve a performance above 400 MIPS.<sup>1)</sup> However, in the super-scalar method, evaluation of parallelism is performed in hardware, so the instruction-issuing circuit is larger.

A two-way super-scalar method that executes two instructions simultaneously is practicable for embedded processors, but using the super-scalar approach to execute three or more instructions simultaneously makes the circuit far too complex. Consequently, we decided to use the VLIW architecture to reduce the circuit size and increase parallelism. In the VLIW architecture, the compiler assures instruction parallelism, which greatly reduces the size of the instruction-issuance circuit compared to the super-scalar method. We estimate that to execute four instructions simultaneously, the VLIW architecture will require a 50% smaller circuit than the super-scalar method.

The FR500 has a 4-way VLIW architecture and can issue four 32-bit instructions simultaneously. These four instructions are allocated to two integer instruction slots and to two floating-point operation instruction slots or media processing instruction slots. In the 4-way VLIW architecture, loads can be executed in the integer instruction slots during floating-point or media processing operations, reducing the possibility of bottlenecks in the data supply to the floating-point or media processing units. Also, a branch instruction can be issued to any of the four slots, which increases the number of instructions that can be executed simultaneously. The combinations of slots and execution units are shown in **Table 1**.

The floating-point operation unit is an IEEE754-compliant 32-bit single-precision floating-point operational circuit which executes addition (FADD) and multiplication (FMUL) in two cycles. It has the SIMD function to execute two operations using one operation instruction, so the two floating-point units can execute four floating-point operations. Therefore, the peak performance at 266 MHz is $266 \times 4 \text{ MFLOPS} = 1064 \text{ MFLOPS}$.

| VLIW instruction          | Instruction n      | Instruction n + 1  | Instruction n + 2  | Instruction n + 3  |
|---------------------------|--------------------|--------------------|--------------------|--------------------|
| Instruction slot          | Slot 0             | Slot 1             | Slot 2             | Slot 3             |
| Execution unit (pipeline) | Integer-0          | Float-0 or Media-0 | Integer-1          | Float-1 or Media-1 |
|                           | Integer-0          | Float-0 or Media-0 | Integer-1          |                    |
|                           | Integer-0          | Float-0 or Media-0 | Integer-1          | Branch-0           |
|                           | Integer-0          | Float-0 or Media-0 | Float-1 or Media-1 |                    |
|                           | Integer-0          | Float-0 or Media-0 | Float-1 or Media-1 | Branch-0           |
|                           | Integer-0          | Float-0 or Media-0 |                    |                    |
|                           | Integer-0          | Float-0 or Media-0 | Branch-0           |                    |
|                           | Integer-0          | Float-0 or Media-0 | Branch-0           | Branch-1           |
|                           | Integer-0          | Integer-1          |                    |                    |
|                           | Integer-0          | Integer-1          | Branch-0           |                    |
|                           | Integer-0          | Integer-1          | Branch-0           | Branch-1           |
|                           | Float-0 or Media-0 | Float-1 or Media-1 |                    |                    |
|                           | Float-0 or Media-0 | Float-1 or Media-1 | Branch-0           |                    |
|                           | Float-0 or Media-0 | Float-1 or Media-1 | Branch-0           | Branch-1           |
|                           | Integer-0          |                    |                    |                    |
|                           | Integer-0          | Branch-0           |                    |                    |
|                           | Integer-0          | Branch-0           | Branch-1           |                    |
|                           | Float-0 or Media-0 |                    |                    |                    |
|                           | Float-0 or Media-0 | Branch-0           |                    |                    |
|                           | Float-0 or Media-0 | Branch-0           | Branch-1           |                    |
|                           | Branch-0           |                    |                    |                    |
|                           | Branch-0           | Branch-1           |                    |                    |
|                           | Control            |                    |                    |                    |

Table 1 Combinations of slots and execution units.
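The combinations in Table 1 can be treated as a lookup table when reasoning about which instructions may share a packet. The following is a minimal sketch, not the FR500 issue logic: the one-letter shorthand (`I`, `F`, `B`, `C`) and the function name are hypothetical, and only the set of combinations is taken from the table.

```python
# Issue combinations transcribed from Table 1.
# One letter per occupied slot, in slot order (hypothetical shorthand):
#   I = integer, F = float or media, B = branch, C = control
LEGAL_COMBINATIONS = {
    "IFIF", "IFI", "IFIB", "IFF", "IFFB", "IF", "IFB", "IFBB",
    "II", "IIB", "IIBB",
    "FF", "FFB", "FFBB",
    "I", "IB", "IBB",
    "F", "FB", "FBB",
    "B", "BB",
    "C",
}


def can_issue_together(kinds):
    """Return True if the instruction kinds form a packet listed in Table 1."""
    return "".join(kinds) in LEGAL_COMBINATIONS


print(can_issue_together(["I", "F", "I", "F"]))  # full 4-way packet
print(can_issue_together(["F", "F", "B", "B"]))  # two float/media ops, two branches
print(can_issue_together(["I", "I", "I"]))       # not listed: only two integer units
```

A compiler scheduler would consult a table of this kind when deciding where to end a packet.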

The media processing unit uses 16-bit fixed-point operation and can execute multiply-and-accumulate (MMAC) and other operations in two cycles. One media processing unit can execute two 16-bit data operations simultaneously and also has a double-SIMD architecture, permitting simultaneous execution of four operations. Therefore, when the two media processing units operate simultaneously, eight operations can be executed. One MMAC executed by the media processing unit is counted as two operations, so the peak performance at 266 MHz is $266 \times 16 \text{ MOPS} = 4256 \text{ MOPS}$.
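The three peak figures quoted for the FR500 follow from the unit counts and the 266 MHz clock. The short sketch below reproduces the arithmetic; the constant names are illustrative, and counting MIPS from the two integer units alone is an assumption inferred from the 532 MIPS figure.

```python
CLOCK_MHZ = 266

# Integer: two units, one instruction each per cycle (assumed basis for MIPS).
INTEGER_UNITS = 2
peak_mips = CLOCK_MHZ * INTEGER_UNITS

# Floating point: two units, each SIMD instruction performs two operations.
FLOAT_UNITS = 2
FLOAT_OPS_PER_INSTRUCTION = 2
peak_mflops = CLOCK_MHZ * FLOAT_UNITS * FLOAT_OPS_PER_INSTRUCTION

# Media: two units, four 16-bit data operations each (double SIMD),
# and one MMAC is counted as two operations (multiply + accumulate).
MEDIA_UNITS = 2
MEDIA_DATA_OPS_PER_UNIT = 4
OPS_PER_MMAC = 2
peak_mops = CLOCK_MHZ * MEDIA_UNITS * MEDIA_DATA_OPS_PER_UNIT * OPS_PER_MMAC

print(peak_mips, peak_mflops, peak_mops)  # 532 1064 4256
```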

To increase the parallel count of instructions and data, the FR500 uses a special architecture with 1R/1RW dual-port SRAM as internal cache memory to implement the "dual fetch approach." As a result, instructions contiguous to the branch instruction and the instruction at the branch target can be fetched simultaneously, which reduces the branch penalty.<sup>2)</sup> The data cache also implements dual load, making it possible to execute two 8-byte load instructions simultaneously in one cycle, which reduces the memory-access bottleneck.

The data cache also has write-through and copy-back modes that can be selected by the user to optimize the data cache according to the application. In addition, the data cache has a memory mode, which enables it to function as internal memory and hide the external transfer latency via external DMA.

The FR500 is equipped with these mechanisms to increase the parallel count and enhance the performance without increasing the operation frequency. **Figure 2** shows the block diagram of the FR500, and **Figure 3** shows the configuration of the operation units.

Figure 2 FR500 microprocessor block diagram.

Figure 3 Execution units.

The FR500 uses the VLIW architecture to increase the parallel count, but a VLIW architecture generally requires an instruction for every slot, so a No Operation (NOP) must be placed in each slot that has nothing to execute. These NOPs waste memory and cache space. Code efficiency is very important in embedded microprocessors because it directly affects system cost. In the FR500, a bit called the "packing flag" is defined in each instruction; it eliminates unnecessary NOPs and allows code to be held in compressed form in memory and cache.

One bit in the 32-bit instructions is used as a packing-flag bit for determining VLIW packet breaks. This method is compatible with the different parallel counts of instructions, for example, 2-way and 4-way (**Figure 4**). By using this packing flag, the FR500 achieves a code efficiency equivalent to a 32-bit RISC processor, despite its VLIW architecture.
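As a rough illustration of how a packing flag can mark packet breaks, the sketch below groups a stream of 32-bit instruction words into VLIW packets and compares the stored size with a NOP-padded layout. The bit position and its polarity (set on the last word of a packet) are assumptions made for this example; the text specifies only that one bit of each instruction is used. The function names are likewise hypothetical.

```python
PACKING_FLAG = 1 << 31  # hypothetical bit position; the text says only "one bit"
MAX_WAYS = 4            # the FR500 issues up to four instructions per packet


def split_packets(words):
    """Group 32-bit instruction words into VLIW packets.

    Assumes, for illustration, that a set flag marks the last word of a packet.
    Packets longer than MAX_WAYS are not checked for here.
    """
    packets, current = [], []
    for word in words:
        current.append(word)
        if word & PACKING_FLAG:
            packets.append(current)
            current = []
    if current:  # trailing words without a terminating flag
        packets.append(current)
    return packets


def padded_size(packets):
    """Words needed if every packet were padded to MAX_WAYS with NOPs."""
    return len(packets) * MAX_WAYS


END = PACKING_FLAG
stream = [0x01, 0x02 | END, 0x03 | END, 0x04, 0x05, 0x06, 0x07 | END]
packets = split_packets(stream)
print([len(p) for p in packets])          # [2, 1, 4]
print(len(stream), padded_size(packets))  # 7 12
```

In this made-up stream, seven stored words replace the twelve that a fixed 4-slot encoding would need. Because the break is carried by the instructions themselves, the same stream format can serve cores with different issue widths.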

An embedded processor must also reduce the cost of the application system so that the final product can be sold at competitive prices. The FR500 reduces system cost and, at the same time, incorporates a dedicated SDRAM interface in the processor to enhance performance. This SDRAM interface is separate from the ordinary system bus and can be coupled directly to a 133 MHz SDRAM, so the user can use SDRAM without needing to perform the difficult task of designing a high-speed memory interface.

Also, connection to ROM and low-speed I/O is made using an FR500-dedicated companion chip coupled directly to the FR500 processor via the 133 MHz system bus. The companion chip provides the bus bridge function and incorporates resources such as a DMAC, timer, interrupt controller, and serial interface (**Figure 5**).

In the first FR500 products, the processor and companion chip are separate because users pursuing higher performance for LBPs, etc., generally tend to develop their own ASICs that are equivalent to the companion chip for faster data transfer. It is possible to provide the design data for our companion chip as a reference design for users developing their own ASICs.

The FR500 series will also include products that integrate the processor and companion chip for users of single-chip-oriented products (e.g., DSCs).

The FR500 processor core is tailored for embedded applications and is designed on the premise that it will be provided as an IP core for SOCs. Design of the FR500 processor core is based on automatic layout using highly technology-portable standard cells, thereby reducing the development period.

## 4. FR500 software

In VLIW processors, the compiler extracts the instruction parallelism, so compiler performance is reflected directly in processor performance. Since the compiler plays a key role in determining processing performance, we introduced the optimization technology of the VLIW and vector processing used in a Fujitsu supercomputer. Also, to enhance the compiler performance, the parallel count of instructions extracted by the compiler must be increased. However, this means increasing the number of instructions in the basic block, which is the compiler processing unit. In the FR500, a non-excepting instruction (exception inhibit instruction) and a predicate instruction (conditional instruction) are defined to assist the compiler's global scheduling.

Figure 5 FR500 system configuration.
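To illustrate why a predicate instruction helps enlarge basic blocks, the sketch below models a tiny straight-line block in which each entry carries an optional predicate register. This is a conceptual model only: the tuple format, register names, and `run_block` helper are hypothetical and do not represent FR500 instruction encodings, and non-excepting instructions are not modeled.

```python
def run_block(instructions, regs):
    """Execute a straight-line block of (predicate, dest, function) entries.

    predicate is None (always execute) or the name of a boolean register.
    There are no branches: a false predicate simply suppresses the write.
    """
    for predicate, dest, func in instructions:
        if predicate is None or regs[predicate]:
            regs[dest] = func(regs)
    return regs


# Predicated form of:  if x > limit: x = limit
# followed by unrelated work that now sits in the same block.
clamp = [
    (None, "p", lambda r: r["x"] > r["limit"]),  # compare sets predicate p
    ("p", "x", lambda r: r["limit"]),            # move takes effect only if p is true
    (None, "y", lambda r: r["x"] + 1),           # later work, same basic block
]

high = run_block(clamp, {"x": 300, "limit": 255})
low = run_block(clamp, {"x": 10, "limit": 255})
print(high["x"], high["y"])  # 255 256
print(low["x"], low["y"])    # 10 11
```

With the branch replaced by a predicated move, the three entries form one basic block, which gives a scheduler more instructions to choose from when filling the slots of a packet.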

For an embedded processor, it is very important to provide an environment in which the user can develop software. SOFTUNE Workbench is provided for the FR500 as an integrated development environment containing the above compiler, the In-Circuit Emulator (ICE) using the internal Debug Support Unit (DSU), and a real-time OS (REALOS) (**Figure 6**). Industry-standard development environments, such as GNUPro<sup>TM</sup> and Tornado<sup>TM</sup> II/VxWorks<sup>®</sup>, are also provided.

An evaluation platform containing the FR500 processor and companion chip as well as an evaluation kit with various interfaces will be provided (**Figures 7** and **8**).

## 5. FR-V architecture

The new FR500 processor is the first product in the world to have the FR-V architecture. It has a 4-way VLIW architecture containing two integer units and two floating-point/media units, but the FR-V architecture itself does not define combinations of the number of ways and instruction sets. In the FR-V architecture, five instruction sets (I, i, F, M, and D) are defined as shown in **Figure 9**, and the customers can define their own instruction sets (C).

Figure 6 Integrated development environment.

The FR500 processor has been developed with an emphasis on media processing using a combination of these instruction sets. The FR300 is specialized for DSP (**Figure 10**). The flexible combination of instruction sets allowed by the FR-V architecture makes it possible to develop application-specific microprocessors.<sup>3)</sup>

## 6. Conclusion

The FR500 is the world's first embedded general-purpose microprocessor to have the VLIW architecture. The VLIW architecture and SIMD function provide a high performance of 532 MIPS, 1064 MFLOPS, and 4256 MOPS in a small chip size. A photograph of the FR500 die is shown in **Figure 11**, and the chip specifications are given in **Table 2**. We intend to develop an application-specific FR-V family based on this microprocessor core by making the most of the flexibility of the instruction sets of the FR-V architecture.

## References

1) H. Fujiyama et al.: Superscalar Embedded Processor with SDRAM Interface, COOL Chips II, Kyoto, April 1999.
2) A. Suga et al.: A 4-way VLIW Embedded Multimedia Processor, ISSCC 2000, San Francisco, February 2000.
3) R. Abrishami: FR500: A New VLIW Processor for Consumer Appliances, Microprocessor Forum 1999, San Jose, October 1999.



Figure 7 Development kit block diagram.



Figure 8 Development kit photo.

Figure 9 Processor core elements.

Figure 10 FR-V application-specific VLIW cores.

Table 2 Chip specifications.

| Item               | Specification                                         |
|--------------------|-------------------------------------------------------|
| Process            | 0.18 μm CMOS, 5-metal                                 |
| Issues             | 4-issue VLIW (2 integer, 2 floating-point or 2 media) |
| Frequency          | 266 MHz                                               |
| Peak performance   | 532 MIPS + 1064 MFLOPS or 4256 MOPS                   |
| Register file      | GR: (5R/4W) 32-bit $\times$ 64-word                   |
|                    | FR: (5R/4W) 32-bit $\times$ 64-word                   |
| Cache              | Instruction: (1RW/1R) 16 KB, 4-way set-associative    |
|                    | Data: (1RW/1R) 16 KB, 4-way set-associative           |
| Bus interface      | SDRAM bus: max. 133 MHz, 1 Gbyte/s                    |
|                    | System bus: max. 133 MHz, 1 Gbyte/s                   |
| No. of transistors | Logic: 3.2 M, RAM: 3.5 M, Total: 6.7 M                |
| Chip size          | 7.5 mm $\times$ 7.5 mm                                |
| Package            | Plastic BGA352                                        |
| Power dissipation  | 2.0 W at 1.8 V                                        |



**Takao Sukemura** received the B.S. degree in Physics from Tokyo Metropolitan University, Tokyo, Japan in 1982. He joined Fujitsu Ltd., Electronic Devices Group, Kawasaki, Japan in 1982, where he was engaged in development of microcontrollers and microprocessors. He has been developing the new VLIW-architecture embedded microprocessor since 1998. From April 1999 to March 2000, he worked on this microprocessor at Fujitsu Laboratories Ltd., Kawasaki, Japan.

E-mail: t.sukemura@ed.fujitsu.co.jp