

# ISSN 2278 – 0211 (Online)

# Low Power and Area Efficient Linear Phase FIR Filter of Odd Length with Symmetric Coefficient Using Improved CSLA

Bindhiya Balan, PG Scholar, Electronics and Communication, Coimbatore Institute of Engineering & Technology, Coimbatore, India

M. Ramkumar Raja, Assistant Professor, Electronics & Communication, Coimbatore Institute of Engineering & Technology, Coimbatore, India

# Abstract:

This brief proposes a low-power, area-efficient linear-phase FIR digital filter of odd length using an improved carry select adder (CSLA). The proposed parallel FIR architecture exploits the symmetry of the coefficients and, using the fast FIR algorithm, reduces the number of multipliers in the sub filter section at the expense of additional adders in the preprocessing and post processing blocks. As the length of the FIR filter increases, the number of multipliers saved increases, while the number of additional adders in the pre/post processing blocks stays fixed. Adders occupy less area than multipliers, so exchanging multipliers for adders reduces the area and power of the filter. For further area reduction, an improved CSLA is used instead of a SQRT CSLA.

*Key words:* Finite impulse response filter (FIR filter), improved carry select adder (CSLA), square root carry select adder (SQRT CSLA), symmetric convolution, very large scale integration (VLSI)

# 1. Introduction

In digital signal processing there are many instances in which the input signal to a system contains unnecessary content or additional noise that can degrade the quality of the desired portion. In such cases digital filters are used to remove the unwanted content. A digital filter modifies the frequency properties of the input signal x(n) to meet specific design requirements. The properties of a causal digital filter can be completely characterized by its unit sample response h(n), by its frequency response $H(e^{j\omega})$, or by its difference equation [3]. In this brief the parallel FIR filter is realized using the difference equation.

Reducing power consumption is one of the most important goals in the design of current multimedia and wireless systems. Reduced power consumption leads to reduced cooling cost and increased reliability of processors and workstations. In portable systems, reduced power consumption directly extends the battery life, i.e. the number of hours the device can operate. Low-power implementation of DSP systems is therefore important.

Because of their multipliers, digital FIR filters require a significant amount of power and computation: a multiplication is computation intensive and consumes a large amount of power. The proposed structures exploit the symmetry of the coefficients of an odd-length filter and further reduce the number of multipliers required at the expense of additional adders. Since multipliers outweigh adders in hardware cost, it is profitable to exchange multipliers for adders. Moreover, the number of additional adders stays fixed as the length of the FIR filter grows, whereas the number of multipliers saved increases with the length of the filter.

Additional adder units are used in the preprocessing and post processing blocks of the FIR filter, so the power consumption of the adders plays an important role in the overall power consumption of the filter. The CSLA is used in FIR filters to alleviate the carry propagation delay by independently generating multiple carries and then selecting one to generate the sum. However, the CSLA is not area efficient, because it uses multiple pairs of ripple carry adders (RCA) to generate the partial sum and carry for carry inputs Cin=0 and Cin=1; the final sum and carry are then selected by multiplexers (mux). To achieve low power consumption, a binary to excess-1 converter (BEC) is used instead of the RCA with Cin=1 in the regular CSLA. Replacing an n-bit RCA requires an (n+1)-bit BEC. The modified CSLA using BEC has reduced area and power consumption with a slight increase in delay, because the sum and carry for Cin=0 and Cin=1 are no longer calculated simultaneously: the result for Cin=1 is derived only after the result for Cin=0 is available.

Several papers have proposed ways to reduce the complexity of parallel FIR filters [1]-[11]. In [3]-[4], polyphase decomposition is discussed, where small-sized parallel FIR structures are derived first and larger block-sized ones are then constructed by cascading or iterating the small-sized parallel FIR blocks. In [1]-[5], the fast FIR algorithm is used. In [7]-[11], parallel FIR filters are designed using linear convolution.

However, in both categories of methods the additional area due to the adders in the pre/post processing blocks has not yet been taken into consideration; addressing it can lead to a significant reduction in hardware cost. In this brief, we discuss symmetric convolution of odd length combined with an improved CSLA. The concepts of the SQRT CSLA are discussed in [15]-[16], and a few papers propose ways to reduce the area and power of the CSLA [12]-[14]. In this paper a new approach is presented, based on a gate-level modification of the SQRT CSLA.

This paper is organized as follows. A brief introduction to the parallel FIR filter is given in Section 2. Section 3 presents the existing parallel FIR system. Section 4 describes the proposed architecture. Section 5 presents the power comparison. Section 6 gives the conclusion.

# 2. Parallel FIR Filter

A parallel FIR filter uses parallel processing, in which multiple outputs are computed in parallel in one clock period. The effective sampling speed is therefore increased by the level of parallelism. To obtain a parallel processing structure, the single input single output (SISO) system must be converted into a multiple input multiple output (MIMO) system. The following set of equations describes a parallel system with 3 inputs per clock cycle.

$$y(3k) = a_1 x(3k) + a_2 x(3k - 1) + a_3 x(3k - 2) \quad (1)$$

$$y(3k + 1) = a_1 x(3k + 1) + a_2 x(3k) + a_3 x(3k - 1) \quad (2)$$

$$y(3k + 2) = a_1 x(3k + 2) + a_2 x(3k + 1) + a_3 x(3k) \quad (3)$$

Here k denotes the clock cycle. At the $k^{th}$ clock cycle the 3 inputs x(3k), x(3k+1) and x(3k+2) are processed and 3 samples are generated at the output. Parallel processing systems are also referred to as block processing systems, and the number of inputs processed in a clock cycle is referred to as the block size. The main disadvantage of this kind of filter is that its circuit area increases linearly with the block size. A 3-parallel FIR filter is shown in Figure 1.
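The block formulation of equations (1)-(3) can be checked against an ordinary sample-by-sample FIR filter. The sketch below is only an illustration: the function names, the coefficient values and the zero initial conditions (x(n) = 0 for n < 0) are assumptions, not part of the original design.

```python
def fir_direct(a, x):
    """Sample-by-sample FIR: y(n) = sum_i a[i] * x(n - i), with x(n) = 0 for n < 0."""
    y = []
    for n in range(len(x)):
        acc = 0
        for i, coeff in enumerate(a):
            if n - i >= 0:
                acc += coeff * x[n - i]
        y.append(acc)
    return y


def fir_3_parallel(a, x):
    """Block form of equations (1)-(3): three outputs per clock cycle k."""
    a1, a2, a3 = a

    def sample(n):
        # Samples outside the recorded signal are treated as zero.
        return x[n] if 0 <= n < len(x) else 0

    y = []
    for k in range(len(x) // 3):
        y.append(a1 * sample(3 * k) + a2 * sample(3 * k - 1) + a3 * sample(3 * k - 2))
        y.append(a1 * sample(3 * k + 1) + a2 * sample(3 * k) + a3 * sample(3 * k - 1))
        y.append(a1 * sample(3 * k + 2) + a2 * sample(3 * k + 1) + a3 * sample(3 * k))
    return y


coeffs = [2, -1, 3]                      # hypothetical a1, a2, a3
signal = [1, 4, 0, -2, 5, 7, 3, 3, -6]   # length is a multiple of the block size
print(fir_3_parallel(coeffs, signal) == fir_direct(coeffs, signal))
```

Each loop iteration corresponds to one clock cycle of the block filter, which is why the throughput scales with the block size while every output still needs all three multiplications.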



Figure 1: parallel FIR filter

# 3. Existing System

The existing system is a linear phase FIR digital filter based on the fast FIR algorithm (FFA). The hardware complexity of this architecture is lower than that of the traditional parallel FIR filter: the FFA reduces the number of multipliers in the sub filter block at the expense of additional adders in the pre/post processing blocks, so the silicon area and power consumption are lower than for the traditional parallel FIR filter. A problem arises when the block size increases, because the number of adders in the pre/post processing blocks then increases rapidly, which increases the hardware requirement. In addition, the existing system uses regular carry select adders, which increase the silicon area and the power.

Parallel processing is a powerful technique that can be applied to FIR filters either to increase the throughput or to decrease the power consumption. However, parallel processing is often avoided because of the linear increase in hardware cost that it causes. The fast FIR algorithm (FFA) is used to produce parallel FIR filtering structures of reduced complexity. A three-parallel FFA produces a parallel filtering structure of block size 3 and can be derived from the traditional parallel filtering approach. The polyphase representation of a traditional parallel FIR filter is

$$\sum_{i=0}^{L-1} Y_i(z^L) z^{-i} = \sum_{j=0}^{L-1} H_j(z^L) z^{-j} \sum_{k=0}^{L-1} X_k(z^L) z^{-k} \quad (4)$$

where

$$Y_i(z) = \sum_{m=0}^{\infty} z^{-m} y_{mL+i}$$

$$X_k(z) = \sum_{m=0}^{\infty} z^{-m} x_{mL+k}$$

$$H_j(z) = \sum_{m=0}^{\infty} z^{-m} h_{mL+j}$$

and $i, j, k = 0, 1, 2, \dots, L-1$.

The traditional 3-parallel filtering structure requires 3N multiplications and 3(N-1) additions. Through a series of algebraic manipulations, the number of filtering operations can be reduced, which in turn reduces the total number of multipliers required to realize the 3-parallel filtering structure. The 3-parallel FFA filter can then be represented as

$$Y_{0} = H_{0}X_{0} - z^{-3}H_{2}X_{2} + z^{-3} \left[(H_{1} + H_{2})(X_{1} + X_{2}) - H_{1}X_{1}\right] \quad (5)$$

$$Y_{1} = \left[(H_{0} + H_{1})(X_{0} + X_{1}) - H_{1}X_{1}\right] - \left(H_{0}X_{0} - z^{-3}H_{2}X_{2}\right) \quad (6)$$

$$Y_{2} = \left[(H_{0} + H_{1} + H_{2})(X_{0} + X_{1} + X_{2})\right] - \left[(H_{0} + H_{1})(X_{0} + X_{1}) - H_{1}X_{1}\right] - \left[(H_{1} + H_{2})(X_{1} + X_{2}) - H_{1}X_{1}\right] \quad (7)$$
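The 3-parallel FFA can be checked numerically against a direct convolution. The following is a minimal sketch, assuming NumPy is available; the helper names (`polyphase`, `add`, `delay`, `interleave`, `ffa3`) are hypothetical, and the delay term of the FFA is modelled as a one-sample delay at the block rate.

```python
import numpy as np


def polyphase(seq, L=3):
    """Split seq into L components: component i holds seq[i], seq[i+L], ..."""
    return [np.array(seq[i::L], dtype=np.int64) for i in range(L)]


def add(*terms):
    """Add sequences of possibly different lengths (zero padded at the end)."""
    out = np.zeros(max(len(t) for t in terms), dtype=np.int64)
    for t in terms:
        out[:len(t)] += t
    return out


def delay(seq):
    """One-sample delay at the block rate (the delay term in Y0 and Y1)."""
    return np.concatenate(([0], seq))


def interleave(parts):
    """Rebuild the full-rate output: y[L*k + i] = parts[i][k]."""
    L = len(parts)
    out = np.zeros(L * max(len(p) for p in parts), dtype=np.int64)
    for i, p in enumerate(parts):
        out[i:i + L * len(p):L] = p
    return out


def ffa3(h, x):
    """3-parallel FFA built from six sub filters of length N/3."""
    H0, H1, H2 = polyphase(h)
    X0, X1, X2 = polyphase(x)
    h0x0 = np.convolve(H0, X0)
    h1x1 = np.convolve(H1, X1)
    h2x2 = np.convolve(H2, X2)
    h01 = np.convolve(add(H0, H1), add(X0, X1))
    h12 = np.convolve(add(H1, H2), add(X1, X2))
    h012 = np.convolve(add(H0, H1, H2), add(X0, X1, X2))
    a = add(h01, -h1x1)                  # (H0+H1)(X0+X1) - H1X1
    b = add(h12, -h1x1)                  # (H1+H2)(X1+X2) - H1X1
    y0 = add(h0x0, -delay(h2x2), delay(b))
    y1 = add(a, -h0x0, delay(h2x2))
    y2 = add(h012, -a, -b)
    return y0, y1, y2


rng = np.random.default_rng(0)
h = rng.integers(-5, 6, size=9)          # filter length: a multiple of 3
x = rng.integers(-5, 6, size=12)         # input length: a multiple of 3
y = interleave(ffa3(h, x))
full = np.convolve(h, x)
print(np.array_equal(y[:len(full)], full))
```

Only six calls to `np.convolve` on length-N/3 filters are needed, compared with the nine sub filter products of the traditional polyphase form.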

The implementation of these equations is shown in Figure 2. The three-parallel FIR filter is constructed from 6 sub filters of length N/3, namely $H_0$, $H_1$, $H_2$, $H_0 + H_1$, $H_1 + H_2$ and $H_0 + H_1 + H_2$, together with 3 preprocessing and 7 post processing additions; i.e. this structure requires 6 length-N/3 FIR filters and 10 pre/post processing additions to realize the proper transfer function. The (3-by-3) FFA structure therefore requires 6(N/3) multipliers and 6(N/3-1)+10 adders. Comparing the implementation cost of the traditional and reduced-complexity 3-parallel structures, the reduced-complexity structure saves approximately 33% of the multipliers of the traditional structure. However, when the block size increases, the number of adders in the pre/post processing blocks increases rapidly, and the use of regular carry select adders increases the silicon area and the power.
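As a worked check of these counts, the expressions above can be evaluated for a general length N and for one illustrative length, N = 27 (chosen here only as an example, with no coefficient symmetry exploited):

$$M_{\text{trad}} = 3N, \qquad M_{\text{FFA}} = 6 \cdot \frac{N}{3} = 2N, \qquad \frac{M_{\text{trad}} - M_{\text{FFA}}}{M_{\text{trad}}} = \frac{3N - 2N}{3N} = \frac{1}{3} \approx 33\%$$

$$A_{\text{trad}} = 3(N - 1), \qquad A_{\text{FFA}} = 6\left(\frac{N}{3} - 1\right) + 10 = 2N + 4$$

$$N = 27: \quad M_{\text{trad}} = 81, \quad M_{\text{FFA}} = 54, \quad A_{\text{trad}} = 78, \quad A_{\text{FFA}} = 58$$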

The CSLA uses multiple pairs of ripple carry adders (RCA) to generate the partial sum and carry for carry inputs Cin=0 and Cin=1; the final sum and carry are then selected by multiplexers (mux). Since the device utilization of the regular CSLA is higher, its power consumption is higher, but it offers high speed because the sum and carry for Cin=1 and Cin=0 are calculated simultaneously. The SQRT carry select adder is shown in Figure 3.
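The selection principle can be illustrated with a small bit-level model. This is a behavioural sketch only: the function names are hypothetical, and the group sizes (2, 2, 3, 4, 5) for a 16-bit adder are an assumption used for illustration, not a statement about the synthesized design.

```python
def ripple_carry_add(a_bits, b_bits, cin):
    """Ripple carry adder on LSB-first bit lists; returns (sum_bits, carry_out)."""
    carry, out = cin, []
    for a, b in zip(a_bits, b_bits):
        out.append(a ^ b ^ carry)
        carry = (a & b) | (carry & (a ^ b))
    return out, carry


def to_bits(value, width):
    return [(value >> i) & 1 for i in range(width)]


def from_bits(bits):
    return sum(bit << i for i, bit in enumerate(bits))


def csla_add(a, b, width=16, groups=(2, 2, 3, 4, 5)):
    """Regular CSLA: every group after the first holds two RCAs and a mux."""
    assert sum(groups) == width
    a_bits, b_bits = to_bits(a, width), to_bits(b, width)
    result, carry, pos = [], 0, 0
    for index, size in enumerate(groups):
        ga, gb = a_bits[pos:pos + size], b_bits[pos:pos + size]
        if index == 0:
            s, carry = ripple_carry_add(ga, gb, 0)
        else:
            s0, c0 = ripple_carry_add(ga, gb, 0)   # RCA assuming Cin = 0
            s1, c1 = ripple_carry_add(ga, gb, 1)   # RCA assuming Cin = 1
            s, carry = (s1, c1) if carry else (s0, c0)   # mux driven by the previous carry
        result += s
        pos += size
    return from_bits(result), carry


total, cout = csla_add(40000, 30000)
print(total + (cout << 16) == 70000)
```

In hardware both RCAs of a group run in parallel, so only the mux delay is added per group; the price is the duplicated adder in every group.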





Figure 3: 16-bit SQRT CSLA

# 4. Proposed System

A new parallel FIR filter structure of odd length for symmetric convolutions, based on the fast FIR algorithm, is proposed. It reduces the area by reducing the number of multipliers in the sub filter block at the expense of 5 additional adders in the pre/post processing blocks. For further area and power reduction, this work uses a simple and efficient gate-level modification of the CSLA.

#### 4.1. 3×3 Proposed FFA

By manipulating the polyphase decomposition, sub filter blocks that contain symmetric coefficients can be obtained, so that half the number of multipliers within a single sub filter block suffices for the multiplications of all its taps. The existing two-parallel FFA structure naturally benefits symmetric convolutions of odd length: for a set of odd-length symmetric coefficients, two out of its three sub filters, i.e. H0 and H1, contain symmetric coefficients. The existing three-parallel FFA structure, however, is not as advantageous. A new three-parallel FIR filter structure is therefore proposed, which enables more multiplier sharing in the sub filter section and can save more hardware cost than the existing FFA. For a set of symmetric coefficients of odd length N, when ($N \mod 3$) equals zero, the proposed structure obtains two more sub filter blocks containing symmetric coefficients than the existing system. Manipulating the equations of the existing system gives the following new set of equations.

$$Y_{0} = H_{0}X_{0} + z^{-3} \left\{ (H_{1} + H_{2})(X_{1} + X_{2}) - H_{1}X_{1} - \left( (H_{0} + H_{2})(X_{0} + X_{2}) - H_{0}X_{0} - \frac{1}{2}\left[(H_{0} + H_{2})(X_{0} + X_{2}) - (H_{0} - H_{2})(X_{0} - X_{2})\right] \right) \right\} \quad (8)$$

$$\begin{aligned} Y_{1} = {} & (H_{0} + H_{1} + H_{2})(X_{0} + X_{1} + X_{2}) - (H_{1} + H_{2})(X_{1} + X_{2}) - (H_{0} + H_{2})(X_{0} + X_{2}) \\ & + \left\{ (H_{0} + H_{2})(X_{0} + X_{2}) - \frac{1}{2}\left[(H_{0} + H_{2})(X_{0} + X_{2}) - (H_{0} - H_{2})(X_{0} - X_{2})\right] - H_{0}X_{0} \right\} \\ & + z^{-3} \left\{ (H_{0} + H_{2})(X_{0} + X_{2}) - \frac{1}{2}\left[(H_{0} + H_{2})(X_{0} + X_{2}) - (H_{0} - H_{2})(X_{0} - X_{2})\right] - H_{0}X_{0} \right\} \end{aligned} \quad (9)$$

$$Y_2 = H_1 X_1 + \frac{1}{2} \left[(H_0 + H_2)(X_0 + X_2) - (H_0 - H_2)(X_0 - X_2)\right] \quad (10)$$
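The proposed decomposition can be checked numerically in the same way as any polyphase identity. The sketch below assumes NumPy and integer test data; all helper names are hypothetical, and the delay term is modelled as a one-sample delay at the block rate. The six sub filters used are H0, H1, H0+H2, H0-H2, H1+H2 and H0+H1+H2, and the product H2X2 is recovered from them instead of being filtered separately.

```python
import numpy as np


def add(*terms):
    """Add sequences of possibly different lengths (zero padded at the end)."""
    out = np.zeros(max(len(t) for t in terms), dtype=np.int64)
    for t in terms:
        out[:len(t)] += t
    return out


def delay(seq):
    """One-sample delay at the block rate."""
    return np.concatenate(([0], seq))


def interleave(parts):
    """Rebuild the full-rate output: y[3*k + i] = parts[i][k]."""
    L = len(parts)
    out = np.zeros(L * max(len(p) for p in parts), dtype=np.int64)
    for i, p in enumerate(parts):
        out[i:i + L * len(p):L] = p
    return out


def proposed_ffa3(h, x):
    H0, H1, H2 = (np.array(h[i::3], dtype=np.int64) for i in range(3))
    X0, X1, X2 = (np.array(x[i::3], dtype=np.int64) for i in range(3))
    h0x0 = np.convolve(H0, X0)
    h1x1 = np.convolve(H1, X1)
    p = np.convolve(add(H0, H2), add(X0, X2))        # (H0+H2)(X0+X2)
    m = np.convolve(add(H0, -H2), add(X0, -X2))      # (H0-H2)(X0-X2)
    h12 = np.convolve(add(H1, H2), add(X1, X2))
    h012 = np.convolve(add(H0, H1, H2), add(X0, X1, X2))
    cross = add(p, -m) // 2                          # H0X2 + H2X0, always even before halving
    h2x2 = add(p, -h0x0, -cross)                     # H2X2 without a dedicated sub filter
    y0 = add(h0x0, delay(add(h12, -h1x1, -h2x2)))
    y1 = add(h012, -h12, -p, h2x2, delay(h2x2))
    y2 = add(h1x1, cross)
    return y0, y1, y2


rng = np.random.default_rng(1)
half = rng.integers(-5, 6, size=5)
h = np.concatenate((half, half[-2::-1]))             # symmetric, odd length 9
x = rng.integers(-5, 6, size=12)
y = interleave(proposed_ffa3(h, x))
full = np.convolve(h, x)
print(np.array_equal(y[:len(full)], full))
```

The identity holds for any coefficients; the symmetry of `h` only matters for how many multipliers each sub filter needs, not for the correctness of the outputs.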

The implementation of the proposed three-parallel FIR filter is shown in Figure 4. In the proposed structure four out of six sub filter blocks, i.e. H1, $H0 \pm H2$ and H0 + H1 + H2, have symmetric coefficients, so each of these sub filter blocks can be realized with only half the number of multipliers. Each multiplier output serves two taps, except the middle one. Compared with the existing FFA three-parallel FIR filter structure, the proposed structure has two more sub filter blocks that contain symmetric coefficients. Therefore, for an *N*-tap three-parallel FIR filter, the proposed structure saves about *N*/3 multipliers compared with the existing FFA structure. This comes at the price of five additional adders in the preprocessing and post processing blocks.

Since the implementation cost of a multiplier is much greater than that of an adder, the cost of implementing the proposed filtering structure can be approximated as proportional to the number of multipliers required, i.e. the adders contribute little to the hardware cost. The additional adders nevertheless affect the area and power of the filter slightly. To reduce their contribution, an improved CSLA is used, which requires less silicon area and power. The improved CSLA is designed by a gate-level modification of the conventional SQRT CSLA and has lower power and area than the existing adder, with a slight increase in delay. The additional adders therefore have little effect on the power and area of the proposed filter structure.
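Which sub filter blocks allow multiplier sharing can be checked directly on a small example. The sketch below uses a hypothetical symmetric 9-tap coefficient set and hypothetical function names; it also assumes that an antisymmetric block (as arises for H0 - H2) shares multipliers in the same way as a symmetric one, with the pre-adder replaced by a subtractor.

```python
def shareable(taps):
    """True when taps are symmetric or antisymmetric, so tap pairs can share a multiplier."""
    rev = taps[::-1]
    return taps == rev or taps == [-t for t in rev]


def plus(*parts):
    return [sum(values) for values in zip(*parts)]


def existing_blocks(h):
    """Sub filter blocks of the existing 3-parallel FFA."""
    H0, H1, H2 = h[0::3], h[1::3], h[2::3]
    return {"H0": H0, "H1": H1, "H2": H2,
            "H0+H1": plus(H0, H1), "H1+H2": plus(H1, H2),
            "H0+H1+H2": plus(H0, H1, H2)}


def proposed_blocks(h):
    """Sub filter blocks of the proposed 3-parallel structure."""
    H0, H1, H2 = h[0::3], h[1::3], h[2::3]
    return {"H0": H0, "H1": H1,
            "H0+H2": plus(H0, H2),
            "H0-H2": [a - b for a, b in zip(H0, H2)],
            "H1+H2": plus(H1, H2),
            "H0+H1+H2": plus(H0, H1, H2)}


h = [1, 2, 3, 4, 5, 4, 3, 2, 1]   # symmetric, odd length, N mod 3 == 0
for name, blocks in (("existing", existing_blocks(h)), ("proposed", proposed_blocks(h))):
    shared = [block for block, taps in blocks.items() if shareable(taps)]
    print(name, shared)
```

For this example the existing structure yields two shareable blocks and the proposed structure four, in line with the counts stated in the text.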



Figure 4: Implementation of the Proposed Structure

#### 4.2. Improved Carry Select Adder

A binary to excess-1 converter (BEC) is a circuit that adds 1 to its input number. The circuit of a 4-bit BEC and its truth table are shown in Figure 4 and Table 3.2 respectively. The BEC can be used to replace the RCA with Cin=1. Using the BEC reduces the number of gates, and with fewer gates the power consumption is reduced.

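| Binary input B[3:0] | Excess-1 output X[3:0] |
|---------------------|------------------------|
| 0000                | 0001                   |
| 0001                | 0010                   |
| 0010                | 0011                   |
| 0011                | 0100                   |
| 0100                | 0101                   |
| 0101                | 0110                   |
| ⋮                   | ⋮                      |
| 1111                | 0000                   |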
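The behaviour of the 4-bit BEC can be written as four Boolean expressions and checked exhaustively. This is a sketch using a common gate-level formulation (one inverter, XOR and AND gates); the function name `bec4` and the signal names are hypothetical.

```python
def bec4(b0, b1, b2, b3):
    """4-bit binary to excess-1 converter; b0 is the least significant bit."""
    x0 = b0 ^ 1                  # inverter
    x1 = b0 ^ b1
    x2 = b2 ^ (b0 & b1)
    x3 = b3 ^ (b0 & b1 & b2)
    return x0, x1, x2, x3


def bec4_value(value):
    """Apply bec4 to an integer 0..15 and return the result as an integer."""
    bits = [(value >> i) & 1 for i in range(4)]
    return sum(bit << i for i, bit in enumerate(bec4(*bits)))


for value in range(16):
    print(f"{value:04b} -> {bec4_value(value):04b}")
```

Each output bit toggles exactly when all lower input bits are 1, which is the carry chain of a "+1" operation without full adders.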

The structure of the 16-bit carry select adder with BEC is shown in Figure 5. It has five groups of ripple carry adders of different sizes, each combined with a binary to excess-1 converter. In the carry select adder using BEC the number of gates is considerably reduced, and so is the power consumption. However, the sum and carry for Cin=1 are calculated only after the sum and carry for Cin=0 have been calculated, which causes some speed penalty. The regular CSLA is not area efficient because it uses multiple pairs of ripple carry adders (RCA) to generate the partial sum and carry for Cin=1 and Cin=0 before the final sum and carry are selected by the multiplexers. The modified CSLA architecture is developed using the binary to excess-1 converter (BEC); it has a slightly larger delay, but its area and power are significantly reduced.
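The replacement of the Cin=1 adder by a BEC can be modelled at bit level as follows. This is a behavioural sketch, not the synthesized design: the function names are hypothetical and the group sizes (2, 2, 3, 4, 5) are an assumption chosen to give five groups in a 16-bit adder.

```python
def ripple_carry_add(a_bits, b_bits, cin):
    """Ripple carry adder on LSB-first bit lists; returns (sum_bits, carry_out)."""
    carry, out = cin, []
    for a, b in zip(a_bits, b_bits):
        out.append(a ^ b ^ carry)
        carry = (a & b) | (carry & (a ^ b))
    return out, carry


def bec(bits):
    """Binary to excess-1 converter on an LSB-first bit list (adds 1, modulo 2**len)."""
    out, lower_all_ones = [], 1
    for bit in bits:
        out.append(bit ^ lower_all_ones)
        lower_all_ones &= bit
    return out


def csla_bec_add(a, b, width=16, groups=(2, 2, 3, 4, 5)):
    """CSLA in which an (n+1)-bit BEC replaces the n-bit RCA with Cin = 1."""
    assert sum(groups) == width
    a_bits = [(a >> i) & 1 for i in range(width)]
    b_bits = [(b >> i) & 1 for i in range(width)]
    result, carry, pos = [], 0, 0
    for index, size in enumerate(groups):
        ga, gb = a_bits[pos:pos + size], b_bits[pos:pos + size]
        s0, c0 = ripple_carry_add(ga, gb, 0)       # single RCA per group, Cin = 0
        if index > 0 and carry:
            plus_one = bec(s0 + [c0])              # BEC works on sum bits plus carry
            s0, c0 = plus_one[:-1], plus_one[-1]   # mux selects the BEC output
        result += s0
        carry = c0
        pos += size
    return sum(bit << i for i, bit in enumerate(result)), carry


total, cout = csla_bec_add(51234, 60001)
print(total + (cout << 16) == 51234 + 60001)
```

The BEC input depends on the RCA output, so the two candidate results are no longer produced at the same time; this dependency is the source of the delay penalty described above.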



Figure 5: 16-bit Improved SQRT Carry Select Adder

# 5. Comparison and Analysis

A comparison of the number of multipliers and adders in the FFA and the proposed structure is given in Table 2. The design proposed here is implemented in Verilog HDL and synthesized using Xilinx ISE 12.3.

| Length | Structure | M   | Reduced multipliers | A   | Increased adders |
|--------|-----------|-----|---------------------|-----|------------------|
| 9      | FFA       |     |                     |     |                  |
| 9      | Proposed  |     |                     |     | 5                |
| 27     | FFA       | 46  |                     | 58  |                  |
| 27     | Proposed  | 38  | 8 (17.4%)           | 63  | 5                |
| 81     | FFA       | 136 |                     | 166 |                  |
| 81     | Proposed  | 110 | 26 (19.1%)          | 171 | 5                |

Table 2: Comparison of Existing & Proposed FFA Structures

M: number of multipliers; A: number of adders
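The tabulated counts for lengths 27 and 81 can be reproduced with a simple counting model. The model is an assumption made for illustration: six sub filter blocks of length N/3 (with N a multiple of 3 and N/3 odd), a block with symmetric coefficients needing one multiplier per tap pair plus one for the middle tap, 10 pre/post processing adders for the existing FFA and 15 for the proposed structure. The function names are hypothetical.

```python
def multipliers(n_taps, symmetric_blocks, total_blocks=6, parallelism=3):
    """Multiplier count when `symmetric_blocks` of the sub filters share tap pairs."""
    sub_len = n_taps // parallelism
    shared = (sub_len + 1) // 2          # tap pairs plus the middle tap
    return (total_blocks - symmetric_blocks) * sub_len + symmetric_blocks * shared


def adders(n_taps, pre_post, total_blocks=6, parallelism=3):
    """Adder count: (sub_len - 1) per sub filter plus pre/post processing adders."""
    return total_blocks * (n_taps // parallelism - 1) + pre_post


for n in (27, 81):
    m_ffa, m_new = multipliers(n, 2), multipliers(n, 4)
    saving = 100.0 * (m_ffa - m_new) / m_ffa
    print(n, m_ffa, m_new, f"{saving:.1f}%", adders(n, 10), adders(n, 15))
```

Under this model the multiplier saving grows with the filter length while the adder overhead stays at five, which is the trade-off the table summarizes.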

Table 3 shows the comparison of the SQRT CSLA and the improved CSLA in terms of delay and power. The total power consists of the thermal, dynamic, static and I/O power dissipation. The proposed adder takes less power than the conventional CSLA. It is also observed that the proposed adder achieves a large reduction in area with only a small delay penalty compared with the conventional CSLA. As the input word length increases, the area and power savings remain, while the delay penalty does not grow in the same proportion. Since the area of the proposed adder is much smaller, its power consumption is also lower.

| Structure     | Parameter  | 16 bit | 32 bit |
|---------------|------------|--------|--------|
| SQRT CSLA     | Power (mW) | 140    | 221    |
| SQRT CSLA     | Delay (ns) | 21.378 | 29.949 |
| Improved CSLA | Power (mW) | 100    | 171    |
| Improved CSLA | Delay (ns) | 23.110 | 32.540 |

Table 3: Comparison of CSLA & Improved CSLA
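The trade-off in Table 3 is easier to read as relative changes. The short sketch below only recomputes percentages from the tabulated values; the dictionary layout and the function name are hypothetical.

```python
# (power in mW, delay in ns) taken from Table 3
table3 = {
    16: {"sqrt": (140, 21.378), "improved": (100, 23.110)},
    32: {"sqrt": (221, 29.949), "improved": (171, 32.540)},
}


def relative_change(old, new):
    """Percentage change from old to new; negative values are reductions."""
    return 100.0 * (new - old) / old


for width, rows in table3.items():
    power = relative_change(rows["sqrt"][0], rows["improved"][0])
    delay = relative_change(rows["sqrt"][1], rows["improved"][1])
    print(f"{width}-bit: power {power:+.1f}%, delay {delay:+.1f}%")
```

A negative power figure paired with a small positive delay figure corresponds to the "reduced power with slight increase in delay" behaviour described for the improved CSLA.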

The delay and power of the CSLA and of the improved CSLA with BEC are compared in Figures 6 and 7. Comparing the conventional CSLA with the CSLA with BEC in terms of power utilization and combinational path delay, it is evident from the graphs that the proposed adder provides a good tradeoff between delay and power consumption.



Figure 6: Comparison of delay

Figure 7: Comparison of power

The FIR filter is designed in Verilog HDL and synthesized using Xilinx ISE 12.3. The synthesis result for an 81-tap FIR filter is given below.

| Synthesis result    | Structure | L = 3   |
|---------------------|-----------|---------|
| Power               | FFA       | 116 mW  |
| Power               | Proposed  | 129 mW  |
| Critical path delay | FFA       | 9.91 ns |
| Critical path delay | Proposed  | 9.90 ns |

Table 4: Synthesis Result for an 81-Tap FIR Filter

Table 4 lists the synthesis results for both filter structures in terms of delay and power. The power and delay of the proposed structure are reported as reduced.

# 6. Conclusion

The power and delay of the FFA structure are evaluated and compared with those of the proposed FIR filter with improved CSLA. The proposed structure takes less power than the existing structure. It is also observed that the proposed filter achieves a large reduction in area with only a small delay penalty compared with the FFA filter. As the filter length increases, the number of multipliers saved increases accordingly, while the number of additional adders in the pre/post processing blocks stays fixed. Moreover, because the area of the proposed adder is much smaller, the power consumption is further reduced.

# 7. References

- 1. Y.-C. Tsao and K. Choi, "Area-efficient VLSI implementation for parallel linear-phase FIR digital filters of odd length based on fast FIR algorithm," IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 59, no. 6, Jun. 2012.
- 2. Y.-C. Tsao and K. Choi, "Area-efficient parallel FIR digital filter structures for symmetric convolutions based on fast FIR algorithm," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 20, no. 2, pp. 366–371, Feb. 2012.
- 3. D. A. Parker and K. K. Parhi, "Low-area/power parallel FIR digital filter implementations," J. VLSI Signal Process. Syst., vol. 17, no. 1, pp. 75–92, Sep. 1997.
- 4. J. G. Chung and K. K. Parhi, "Frequency-spectrum-based low-area low-power parallel FIR filter design," EURASIP J. Appl. Signal Process., vol. 2002, no. 9, pp. 444–453, Jan. 2002.
- 5. K. K. Parhi, VLSI Digital Signal Processing Systems: Design and Implementation. New York: Wiley, 1999.
- 6. Z.-J. Mou and P. Duhamel, "Short-length FIR filters and their use in fast nonrecursive filtering," IEEE Trans. Signal Process., vol. 39, no. 6, pp. 1322–1332, Jun. 1991.
- 7. J. I. Acha, "Computational structures for fast implementation of L-path and L-block digital filters," IEEE Trans. Circuits Syst., vol. 36, no. 6, pp. 805–812, Jun. 1989.
- 8. C. Cheng and K. K. Parhi, "Hardware efficient fast parallel FIR filter structures based on iterated short convolution," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 51, no. 8, pp. 1492–1500, Aug. 2004.
- 9. C. Cheng and K. K. Parhi, "Further complexity reduction of parallel FIR filters," in Proc. IEEE ISCAS, May 2005, vol. 2, pp. 1835–1838.
- 10. C. Cheng and K. K. Parhi, "Low-cost parallel FIR structures with 2-stage parallelism," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 54, no. 2, pp. 280–290, Feb. 2007.
- 11. I.-S. Lin and S. K. Mitra, "Overlapped block digital filtering," IEEE Trans. Circuits Syst. II, Analog Digit. Signal Process., vol. 43, no. 8, pp. 586–596, Aug. 1996.
- 12. Y.-C. Tsao and K. Choi, "Area-efficient parallel FIR digital filter structures for symmetric convolutions based on fast FIR algorithm," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 20, no. 2, pp. 366–371, Feb. 2012.
- 13. P. Devi, A. Girdher, and B. Singh, "Improved carry select adder with reduced area and low power consumption," Int. J. Comput. Appl. (0975-8887), vol. 3, no. 4, 2010.
- 14. B. Ramkumar, H. M. Kittur, and P. M. Kannan, "ASIC implementation of modified faster carry save adder," Eur. J. Sci. Res., vol. 42, no. 1, pp. 53–58, 2010.
- 15. Y. He, C. H. Chang, and J. Gu, "An area efficient 64-bit square root carry-select adder for low power applications," in Proc. IEEE Int. Symp. Circuits Syst., 2005, vol. 4, pp. 4082–4085.
- 16. J. M. Rabaey, Digital Integrated Circuits: A Design Perspective. Upper Saddle River, NJ: Prentice-Hall, 2001.