halled had have (tid) . Repair Harris in the his the king a little in (VI-3100) > (title in )



## IMPLEMENTATIONS OF 8×8 DCT AND IDCT ON DIFFERENT FPGA TECHNOLOGIES USING THE MODIFIED LOEFFLER ALGORITHM

Asst. Lecturer N. H. Abbas and Comment (1984) The both 21 TOO beneate middle and believe and Dept. of Elect. The to make approaches not assessment College of Eng. - University of Baghdad Baghdad-Iraq

In this paper the hardware implementations is investing of 8x8 Discrete Cosine Transform (DCT) and Inverse Discrete Cosine Transform (IDCT) on different Field Programmable Gate Array (FPGA) technologies using the modified Loeffler algorithm. The investigations involved simulations and synthesis of Very High Speed Integrated Circuit Hardware Description Language (VHDL) code utilizing recent FPGA families of Xilinx, Altera, and Lucent. The paper achieving the most demanding real-time requirements of some standardized frame resolutions and rates. Synthesis results for 8-point DCT/IDCT implementations indicate operating frequencies of 50 MHz, 60 MHz, and 22 MHz for the investigated Xilinx, Altera and Lucent FPGA chips, respectively. These frequencies allow 2193 Source Input Format (SIF) and 100 High Definition Television (HDTV) frames to be processed by the Xillinx FPGA. The resulting frame processing rates for Lucent are 877 and 40 for SIF and HDTV, while for Altera they are 647 and 29, respectively. Results indicate that the investigated FPGA implementations would speed DCT based compression algorithms up to frame rates well above the real-time requirements of SIF, International Consulting Committee on Radio & Television (CCIR-TV) and HDTV frame formats.

في هذه المنشورة تم الاستقصاء عن البناء المادي ل8 × 8 تحويلة الجيب تمام المتقطعة (DCT) ومعكوسة تحويلة الجيب تمام المتقطعة (IDCT) في عدة تقنيات لترتيب بوابة برمجة المجال (FPGA) باستخدام خوارزمية Leoffler المحورة. الاستقصاء يتضمن التشبيه والتركيب لشفرة لغة وصف الكيان المادي ذات السرعة العالية جدا (VHDL) المستخدمة في الاونة الاخيرة عن طريق عوائل ترتيب بوابة برمجة المجال (FPGA) من شركة Lucent & Xilinx Altera . إن اغلب متطلبات الزمن الحقيقي (real-time) لبعض تصاميم الهياكل ومعدلاتها القياسي تم تحقيقها. نتائج التركيب لبناء 8 نقط تحويلة جتا المتقطعة (DCT) او معكوسها تبين انه ترددات العمل هي 50 ميغاهرتز و 22 ميغاهرتز ، 60 ميغاهرتز و 22 ميغاهرتز لقطع Lucent & ، Altera ، Xillinx ترتيب بوابة المجال على التوالي. النتائج تبين انه استخدام ال FPGA لبناء DCT & IDCT يسرع منه وكذلك الاقتراب يكون أكثر من تحقيق متطلبات الزمن الحقيقي لصيغة ادخال المصدر (SIF) ، الجمعية الاستشارية الدولية للراديو والتلفزيون (CCIR-TV) ، والتلفزيون العالي التعريف (HDTV) ، والتلفزيون

RECT LEGUNOLOGIES LISTAGE LITE WODEL

### **KEY WORDS**

Discrete Cosine Transform (DCT); VHDL; FPGA; Reconfigurable Processor.

### INTRODUCTION

Discrete cosine transform (DCT) & inverse discrete cosine transform (IDCT) are widely used as a transform in image processing, especially for contemporary visual data compression algorithms (JPEG, MPEG, etc.). Some of the applications of two-dimensional DCT involve still image compression and compression of individual video frames, while multidimensional DCT is mostly used for compression of video streams and volume spaces. Transform is also useful for transferring multidimensional data to DCT frequency domain, where different operations, like spread-spectrum data watermarking, can be performed in easier and more efficient manner. Implementing DCT/IDCT, as an ASIC is a design solution, which meets the real time processing requirements, but it lacks flexibility. Another, more flexible solution, still capable to achieve real-time performance, is the reconfigurable realization of the transforms. Such DCT/IDCT implementations mapped on FPGAs will be discussed here [Van 1999].

It is investigated in this paper the implementations of 8x8 DCT and IDCT hardware units mapped on various FPGA technologies using the modified Loeffler algorithm [Van 1999], [Loeffler 1989]. The work achieve the most demanding real-time requirements of some standardized frame resolutions such as the Source Input Format (SIF) and the International Consulting Committee on Radio and Felevision (CCIR-TV). The particular interest was in performance improvements for the High Definition Television (HDTV) standard. The investigation involved generation, simulation and synthesis of VHDL code using ModelSim and LeonardoSpectrumas design environments. VHDL is used libraries during the design process for the recent FPGA families of Xilinx, Altera, and Lucent. Synthesis results for an 8-bit IDCT implementation indicate:

1- 214 Configurable Logic Block slices and 22 multipliers in Xilinx Virtex II Technology; 1482 Altera Acex-1K Logical Cells; 1488 Lucent's Orca Look-UP Tables.

2- Operating frequencies of 50 MHz for Xilinx, 60 MHz for Altera and 22 MHz for Lucent.

3- 2193 SIF and 100 HDTV frames per second to be processed by Xillinx Virtex II FPGA; 877 SIF and 40 HDTV frames per second processing speed by Lucent's Orca; 647 SIF and 29 HDTV frames per second throughput by Altera's Acex.

Synthesis and simulation results prove that the investigated FPGA implementations can speed up DCT to frame rates well above the real-time requirements of SIF, CCIR-TV and HDTV

[Vassiliadis 2001].

The remainder of the paper is organized as follows. It is briefly describe in Background some DCT/IDCT theoretical background and the modified Loeffler algorithm. 3<sup>rd</sup> Section will discuss the methodology of the experimentation and some hardware implementation issues. Experimental results are reported in fourth Section. Finally, concluding remarks are presented in last Section.

### BACKGROUND

DCT and IDCT have been widely used in video data compression standards. The decorrelation and energy compaction properties of the transform have been exploited to achieve high compression ratios in MPEG and JPEG. The N-point 1-D DCT is defined by [Rao 1990]:

$$X(h) = \frac{2}{N} c_v \sum_{n=0}^{N-1} y(n) \cos \left[ \frac{(2n+1)k\pi}{2N} \right], h = 0, 1, ..., N-1;$$
(1)



Where X(k) = The N-point 1-D DCT.

And Y(n)= The N-point 1-D IDCT is defined by:

$$y(n) = \frac{2}{N} \sum_{k=0}^{N-1} c_k X(k) \cos \left[ \frac{(2n-1)k\pi}{2N} \right], \ n = 0, 1, ..., N-1;$$
where  $c_k = \begin{cases} 1/\sqrt{2}, k = 0 \\ 1, k \neq 0 \end{cases}$  (2)

where 
$$c_k = \begin{cases} 1/\sqrt{2}, k = 0\\ 1, k \neq 0 \end{cases}$$

DCT and IDCT are highly computational intensive, which creates prerequisites for performance bottlenecks in systems utilizing them. To overcome this problem, a number of algorithms have been proposed for more efficient computations of these transforms. 8-point is used 1-D DCT/IDCT algorithm in all experiments, proposed by van Eijdhoven and Sijstermans [Van 1999]. This algorithm is a slight modification of the original Loeffler algorithm [Loeffler 1989], which provides one of the most computationally efficient 1-D DCT/IDCT calculations. The modified Loeffler algorithm for calculating 8-point 1-D IDCT is illustrated in Fig. (1).



Fig.(1). The 8-point IDCT - modified Loeffler algorithm

The round block in Figure 1 signifies a multiplication by  $\sqrt{1/2}$ . The butterfly block it is presented in Fig. (2).



Fig. (2). The Butterfly

Where:

$$\mathbf{O}_0 = I_0 - I_1 \tag{3}$$

$$\mathbf{O}_1 = I_0 - I_1$$

The rectangular block depicts a rotation, which transforms a pair of inputs  $[I_0, I_1]$  into outputs  $[O_0, O_1]$ . The symbol is depicted in **Fig. (3).** 



Fig. (3). The rotator

$$O_0 = I_0 \log \left[ \frac{n\pi}{16} \right] - I_1 \sin \left[ \frac{n\pi}{16} \right] = C_2 I_0 - S_2 I_1$$

$$O_1 = I_0 \sin \left[ \frac{n\pi}{16} \right] + I_1 \cos \left[ \frac{n\pi}{16} \right] = S_2 I_0 + C_2 I_1$$

$$(6)$$

The implementation of the rotator depicted in Fig. (4) utilizes four multipliers and two adders to shorten critical path and improve numerical accuracy. This direct implementation has been proven to be ideal for fixed-point arithmetic [Sim 2001]. Indeed, some other implementations of the rotator are possible, e.g., with three multipliers and three adders. These alternative designs, however, have longer critical paths and involve initial additions, which may lead to overflows and may affect the accuracy of the calculations [Babic 2003].



Fig. (4). Implementation of the rotator for IDCT



The algorithm is depicted of 8-point DCT in Fig. (5).



Fig. (5). The 8-point DCT - modified Loeffler algorithm

The functionality of the rotator in DCT is slightly different than in IDCT, while the round block and the butterfly are exactly the same [Cho 1991].

The DCT rotator block equations are:

$$O_0 = I_0 k \cos \left[ \frac{n\pi}{16} \right] - I_1 k \sin \left[ \frac{n\pi}{16} \right] = C_0 I_0 - S_0 I_1$$
 (7)

$$O_1 = -I_0 k \sin \left[ \frac{n\pi}{16} \right] - I_1 k \cos \left[ \frac{n\pi}{16} \right] = -S_n^* I_0 + C_n^* I_1$$
 (8)

In video data compression standards, the 2-D DCT/IDCT is defined. One possible approach to compute the 2-D DCT/IDCT is the standard row-column separation. In this approach, the 1-D transform is applied to each row [Cho 1991], &[Ren 1998].On each column of the result 1-D transform is performed again, to produce the final result of the 2-D DCT/IDCT. In our experiments we use this strategy.

# METHODOLOGY OF THE IMPLEMENTATION

All experiments involve processing video data with different frame formats. It will have chosen the SIF, CCIRTV and the HDTV formats, since they have been considered by many video compression standards. The frame resolutions for SIF, CCIR-TV and HDTV are 352x288, 525x720 and 1152x1926, respectively. It is written synthesizable VHDL models of two units, one describing 1-D DCT and the other - 1-D IDCT. The designs have been implemented according to the modified

### IMPLEMENTATIONS OF 8× 8 DCT AND IDCT ON DIFFERENT FPGA TECHNOLOGIES USING THE MODIFIED LOEFFLER ALGORITHM

Loeffler algorithm. Both simulated and synthesized the VHDL models for three different FPGA technologies, namely Virtex II, Acex-1K and Orca using the following design tools:

1-ModelSim SE/EE from Model Technology, version 54.b, revision 2000.06, for simulating the VHDL source code;

2-LeonardoSpectrum from Exemplar, version v2000.1a2.75, for the synthesis of VHDL source

8-bit is considered input data to design DCT for consistency with the 8-bit color presentation in code. visual data compression standards like MPEG and JPEG. The output data width was designed to be 10-bit. Similarly, 10-bit inputs and 8-bit outputs were considered for the IDCT design. The rowcolumn separation strategy was used to compute the 2-D DCT/IDCT. As it have used 8-point 1-D DCT and IDCT, the FPGA I/O ports delay, reported by the synthesis software, is in essence the data processing delay for 8 pixels. Implementing matrix transposition without extra delay, can multiply the 8point 1-D DCT I/O latency by 16 to calculate the latency of the 8x8 DCT transform. This is in essence the processing latency for one 8x8 pixel block. Given this latency, it can easily calculate the time, required to transform all 8x8 blocks in any video frame for the selected formats - SIF, CCIR-TV and HDTV. Frame processing rate (frames per second) of the implemented DCT/IDCT was the main criterion used to estimate and compare the FPGA mappings on the three different technologies.

### IMPLEMENTATION RESULTS

Synthesis results for 8-point DCT and IDCT units are included in Table (1). These results indicate that the Xilinx FPGA implementations of DCT/IDCT can process higher numbers of frames per time unit, compared to the other two FPGA technologies. One reason for this considerable data processing speed is the utilization of coarse-grain reconfigurable resources available in the Virtex II FPGA. In particular, the usage of hardwired multipliers and fast carry chains lead to a severe acceleration of the implemented computations.

Table (1). Synthesis results for different FBGA technologies

| Implemented function              | DCT IDCT Lucent ORCA-3C/3T series FPGA |                                   | DCT IDCT Altera Acex-1K series FPGA |                | DCT IDCT Xilinx Virtex-II seriesFPGA |      |
|-----------------------------------|----------------------------------------|-----------------------------------|-------------------------------------|----------------|--------------------------------------|------|
| Max Clock frequency ( MHz)        | 22                                     | 22                                | 16                                  | 16             | 54                                   | 60   |
| No. of multipliers used           | LL                                     | THE REPORT OF THE PERSON NAMED IN |                                     |                | 22                                   | 22   |
| No. of LUTs used                  | 1320                                   | 1488                              |                                     |                |                                      |      |
| No. of LCs used                   |                                        |                                   | 1303                                | 1482           |                                      |      |
| No. of CLBs used                  |                                        |                                   |                                     | Description of | 205                                  | 214  |
| No. of SIF frames per second      | 896                                    | 877                               | 647                                 | 647            | 2193                                 | 2469 |
| No. of CCIR –TV frames per second | 1219                                   | 214                               | 158                                 | 158            | 536                                  | 600  |
| No. of HDTV frames per second     | 40                                     | 40                                | 29                                  | 29             | 100                                  | 112  |

It was particularly interested whether the FPGA implementations of the designs would be fast enough to meet the real time requirements of the selected video formats. For SIF and CCIR-TV, the requirements for real time processing rates are 25 frames per second. It can be observed in Fig. (6) that Xilinx FPGA implementations process the highest number of SIF frames per second. The other two FPGA technologies, using finer-grain resources, although slower, are capable of processing SIF frames at speeds, much higher than the required real-time rates (25 frames per second). Therefore, in this case the advantages of the Virtex II technology can not be utilized efficiently.





Fig. (6). Comparing different FPGA technologies to SIF Frames processing per second

Regarding CCIR-TV format, performance results impose similar conclusions see Fig. (7). All FPGA technologies provide DCT/IDCT processing speeds well above the required real time values.



Fig. (7). Comparing different FPGA technologies to

CCIR-TV frames processed per second

The advantages of using coarse-grain reconfigurable resources for speeding up the DCT and IDCT operations are illustrated in Fig. (8). Only Xilinx FPGA is able to process twice the rate required by HDTV, which is 50 frames per second. The other two FPGA can not achieve the requirements for real time processing.



Fig. (8). Comparing different FPGA technologies to HDTV frames processed per second

### SUMMARY AND CONCLUSIONS

It is reported in this paper the results from an investigation on reconfigurable implementations of DCT/IDCT mapped on different FPGA technologies. Synthesis and simulation results from the experiments indicate that real-time requirements of SIF, CCIR-TV and even HDTV can be met by the implemented DCT and IDCT designs. From the reported results can conclude that all investigated FPGA implementations can speed up DCT based compression standards dramatically. However, for computationally intensive algorithms like DCT/IDCT better results can be achieved by coarser-grained reconfigurable logic, like the one realized by the Virtex II technology of Xilinx. It is intended in future, to integrate the investigated DCT/IDCT designs into a custom computing machine organization, called MOLEN [Vassiliadis 2001]. The MOLEN processors utilize microcode to control both reconfiguration and execution process of the reconfigurable unit. The primary goal will be to investigate the influence of FPGA reconfiguration time on the overall performance of the system.

### REFERENCES

- T.J. van Eijndhven and F.W. Sijstermans, (Van 1999), Data Processing Device and method of Computing the Cosine Transform of a Mtrix, PCT Patent WO 99948025, to Koninklijke Philips Electronics, World Intellectual Property Organization, International Bureau,
- C. Loeffler and A. Lightenberg, (1989), Practical fast 1-D DCT algorithms with 11 Multiplications," Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP '89), Scotland, May, pp. 988-991.
- S. Vassiliadis, S. Wong and S. Cotofana, (2001), The MOLEN .µcoded processor, Proceedings of the 11th International Conference on Field Programmable Logic and Applications (FPL),.
- K.R. Rao and P.Yip, (1990), Discrete Cosine Transform. Algorithms, Advantages, Applications, Academic Press, San Diego, California,.
- M. Sima, S. Cotafona, S. Vassiliadis and J.T.J. van Eindhoven, (2001), 8x8 IDCT Implementation on an FPGA-augmented Trimedia, IEEE Symposium on FPGAs for Custom Computing Machines (FCCM 2001), California, April.
- D. Babic. Discrete Cosine Transform Algorithm for FPGA Devices, M. sc. (2003), Thesis, Electrical Engineering computing in Zagreb,.
- N. Ik Cho and Sang Uk Lee. (1991), Fast algorithm and implementation of 2-D discrete cosine transform. IEEE Trans. on Circuits and Systems, 38(3):297-306, March.
- H. Ren Wu and Zhihong Man. Comments on (1998), Fast algorithms and implementation of 2-D discrete cosine transform, Circuits and Systems for Video Technology, IEEE Transactions on Volume, 8(2):128–129, April.