



# HIGH PERFORMANCE VLSI INTEGER TRANSFORM ARCHITECTURE FOR HEVC

1T. STELLA SUSHMA, 2N. SUBBARAYUDU

1M.Tech student, Dept. of ECE, Sri Mittapalli Institute of Technology for Women, A.P., India

2Assistant Professor, Dept. of ECE, Sri Mittapalli Institute of Technology for Women, A.P., India

**ABSTRACT:** High Efficiency Video Coding (HEVC) is currently being prepared as the newest video coding standard of the ITU-T Video Coding Experts Group and the International Organization for Standardization / International Electrotechnical Commission (ISO/IEC) Moving Picture Experts Group. A VLSI architecture is proposed for the integer transform of the HEVC encoder. The architecture is based on signed bit transform (SBT) matrices, whose elements are only 0, 1, or -1. These SBT matrices are simple, have a low bit width, and need fewer addition operations because they contain many zero elements. An adder reuse strategy can therefore be applied, which reduces power consumption and area. As a result, the proposed transform hardware architecture can be synthesized with reasonable area and can process video data at high speed.

**KEY WORDS**: High Efficiency Video Coding, Signed Bit Transform.

#### I. INTRODUCTION

A picture is worth a thousand words. This expresses the essential difference between the human ability to perceive linguistic information and the ability to perceive visual information. For the same message, a visual representation tends to be perceived as more efficient than spoken or written words. The processing of language is inherently serial: words and their meanings are recorded or perceived one at a time in a causal manner. Visual information, in contrast, is processed in parallel. In the mammalian visual system, this parallelism is evident from the retina right through to the higher-order structures in the visual cortex and beyond. Applications such as video conferencing, medical data transfer, and business data transfer require ever more image data to be transmitted and stored online. Because of the Internet, huge volumes of information are transmitted, and the processed data require more storage, higher processor speed, and more transmission bandwidth.

In the rapidly growing field of Internet applications, not only still images but also small image sequences are used to enhance the design of private and commercial web pages. Meeting bandwidth requirements while maintaining acceptable image quality is a challenge. Wavelets are mathematical functions that provide good-quality compression at very high compression ratios because of their ability to decompose signals into different scales or resolutions. The standard methods of image compression come in numerous varieties. Most of the well-established compression schemes use the bivariate Discrete Wavelet Transform (DWT) in wavelet-based image coding. At high compression rates, wavelet-based methods provide much better image quality than the JPEG (Joint Photographic Experts Group) standard, which relies on the discrete cosine transform (DCT). The good results obtained from the DWT are due to multiresolution analysis, which essentially brings out information about the statistical structure of the image data. The currently most popular methods rely on removing the high-frequency components of the image and storing only the low-frequency components (e.g., DCT-based algorithms). Although some information loss can be tolerated in most of these applications, there are certain image processing applications that demand no pixel difference between the original and the reconstructed image.

Fractal image compression is a lossy compression method, so there will be data loss in the compressed image. In fractal coding, an image is represented by fractals rather than pixels. Each fractal is defined by a unique Iterated Function System (IFS) consisting of a group of affine transformations. Therefore, the key point of fractal coding is to find fractals that best approximate the original image and then to represent them as a set of affine transformations. Standard fractal coding methods rise above many other image coding techniques in that they maintain high image quality after decoding while achieving high compression ratios during encoding. Where exact reconstruction is required, mathematically lossless compression techniques are favoured over lossy compression with its relatively high compression ratio. A lossless scheme typically achieves a compression ratio of the order of two, but allows exact recovery of the original image from the compressed version.

#### II. RELATED WORK

One of the main problems in real-time communication is the repetition of corrupted messages. The data should be delivered with low delay, and the techniques used should avoid overloading the channel with transmissions. When digital information is transmitted through a channel, errors are practically inevitable. To ensure reliable transmission, the data are therefore encoded with an Error Correcting Code (ECC), which can be used to detect and correct errors. In this work the well-known binary linear block Hamming codes are used because they have been used in the optimization problems that we accelerate with the circuit explained further on. A binary linear (N, k) code is a k-dimensional subspace of the space of N-bit words and therefore has $2^k$ code words. We, however, solve for blocks or subsets of M code words in the code, where $M \le 2^k$, used to transmit a message. When a message is transmitted, its binary string can suffer modifications (changed bits), so that an incorrect code word arrives at the receiver side.
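As a concrete instance of a binary linear (N, k) Hamming code, the following minimal sketch encodes 4 data bits into a 7-bit code word and corrects a single flipped bit. The choice of the (7, 4) code, the bit layout, and the function names `hamming74_encode` and `hamming74_decode` are assumptions made for illustration; the text does not state which Hamming code is used.

```python
def hamming74_encode(data):
    """Encode 4 data bits into a 7-bit code word.

    Bit positions 1..7 hold: p1 p2 d1 p3 d2 d3 d4 (assumed layout).
    """
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4  # parity over positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4  # parity over positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]


def hamming74_decode(received):
    """Correct at most one flipped bit; return (data_bits, error_position).

    error_position is 0 when no error was detected, else the 1-based index.
    """
    r = list(received)
    s1 = r[0] ^ r[2] ^ r[4] ^ r[6]
    s2 = r[1] ^ r[2] ^ r[5] ^ r[6]
    s3 = r[3] ^ r[4] ^ r[5] ^ r[6]
    position = s1 + 2 * s2 + 4 * s3  # syndrome read as a binary number
    if position:
        r[position - 1] ^= 1  # flip the erroneous bit back
    return [r[2], r[4], r[5], r[6]], position


if __name__ == "__main__":
    word = hamming74_encode([1, 0, 1, 1])
    word[4] ^= 1  # simulate a single changed bit in the channel
    print(hamming74_decode(word))
```

With k = 4 the code has $2^4 = 16$ code words, and any single changed bit is mapped back to the transmitted code word.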

As technology scales, reliability becomes a challenge for CMOS circuits. Reliability issues appear, for example, during device manufacturing as defects that can compromise production yield. Once the devices are in the field, other reliability issues appear in the form of soft errors or age-induced permanent failures. Memory devices are among those affected by these issues because of their high level of integration. Current techniques to address these reliability issues in memories include the use of redundant elements to repair manufacturing defects and the use of Error Correcting Codes (ECC) to deal with soft errors once the device is in operation. Different techniques are thus used for defects and for soft errors. ECC can also be used to correct errors caused by defects, but its ability to correct soft errors may then be compromised, leading to reduced reliability. However, to the best of our knowledge, there is no previous work on how the use of ECC to deal with defects affects the reliability of memory in the field.

Networking applications require high-speed processing of data and thus rely on complex integrated circuits. In routers and switches, packets typically enter the device through one port, are processed, and are then sent to one or more output ports. During this processing, data are stored and moved through the device. Reliability is a key requirement for networking equipment such as core routers. Therefore, the stored data must be protected to detect and correct errors. This is commonly done using error-correcting codes (ECCs). One problem that occurs when protecting the data in networking applications is that, to facilitate its processing, a few control bits are added to each data block. For example, flags to mark the start of a packet (SOP), the end of a packet (EOP), or an error (ERR) are commonly used. These flags are used to determine the processing of the data, and the associated control logic is commonly on the critical timing path. To access the control bits, if they are protected with an ECC, they must first be decoded. This decoding adds delay and may limit the overall frequency. Several codes are used to evaluate the proposed method, which is then compared with the existing solutions in terms of decoding delay and area.

#### III. BASIC IMAGE COMPRESSION SCHEMES

A universal compression algorithm can be applied to an image by simply encoding the sequence of pixels extracted in raster scan order. Such a sequence, however, is hard for a universal algorithm to compress. Universal algorithms are usually designed for alphabet sizes not exceeding $2^8$ and do not directly exploit the features of image data: images are 2-dimensional, the intensities of neighbouring pixels are highly correlated, and images contain noise added during the acquisition process.

Modern gray-scale image compression algorithms employ techniques used in universal statistical compression algorithms. However, prior to statistical modelling and entropy coding, the image data are transformed to make them easier to compress. To make the image data easily compressible, 2-dimensional image transforms such as the DCT or the wavelet transform are used. In transform algorithms, a matrix of transform coefficients is encoded instead of pixel intensities. Transforms can be used for both lossless and lossy compression, although transform algorithms are more popular in lossy compression. Apart from lossy and lossless compression and decompression of whole images, transform algorithms deliver many interesting features such as progressive transmission and region-of-interest coding. Which algorithm is used depends mostly on the information content of the images and the type of application.

Lossless compression algorithms are often predictive in nature. In a predictive algorithm, a predictor function is used to guess the pixel intensities, and the prediction errors are calculated. The prediction errors are the differences between the actual and the predicted pixel intensities. To calculate the prediction for a specific pixel, the intensities of a small number of already processed neighbouring pixels are usually used. Next, the sequence of prediction errors, called the residuum, is encoded. The distribution of the prediction errors is close to Laplacian, that is, symmetrically exponential. Therefore, the entropy of the prediction errors is significantly smaller than that of the pixel values, which is why the residuum is easier to compress. For lossless compression, predictive algorithms give better results in terms of computational speed.
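The effect of prediction on entropy can be illustrated with a minimal sketch. It assumes the simplest possible predictor (each pixel is predicted by its left neighbour, and the first pixel by 0) and a made-up row of pixel values; neither is taken from the text, and practical predictors use several neighbouring pixels.

```python
import math
from collections import Counter


def entropy(symbols):
    """Empirical zeroth-order entropy in bits per symbol."""
    counts = Counter(symbols)
    total = len(symbols)
    return -sum(c / total * math.log2(c / total) for c in counts.values())


def left_neighbour_residuals(row):
    """Prediction errors when each pixel is predicted by the previous one."""
    previous = 0  # assumed prediction for the first pixel
    residuals = []
    for pixel in row:
        residuals.append(pixel - previous)
        previous = pixel
    return residuals


def reconstruct(residuals):
    """Invert the prediction: the scheme is lossless."""
    previous = 0
    row = []
    for error in residuals:
        previous += error
        row.append(previous)
    return row


if __name__ == "__main__":
    # Hypothetical smooth image row with small noise.
    row = [100, 101, 103, 102, 104, 105, 107, 106,
           108, 110, 111, 113, 112, 114, 115, 117]
    residuals = left_neighbour_residuals(row)
    print("pixel entropy   :", entropy(row))
    print("residual entropy:", entropy(residuals))
    print("lossless        :", reconstruct(residuals) == row)
```

Because neighbouring intensities are correlated, the residuals cluster around a few small values, so their entropy is lower than that of the raw pixels while the original row remains exactly recoverable.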

a) Context Adaptive Lossless Image Coding (CALIC) Algorithm

CALIC obtains higher lossless compression for continuous-tone images than other techniques reported in the literature. Its nonlinear predictor adapts via an error feedback mechanism. CALIC employs a two-step (prediction/residual) approach. In the prediction step, CALIC employs a simple gradient-based nonlinear prediction scheme called the gradient-adjusted predictor (GAP), which adjusts the prediction coefficients based on estimates of local gradients. Predictions are then made context-sensitive and adaptive by modelling the prediction errors and feeding back the expected error conditioned on properly chosen modelling contexts. The modelling context is a combination of quantized local gradient and texture pattern, two features that are indicative of the error behaviour. The context-based error modelling is done at a low model cost. By estimating expected prediction errors rather than error probabilities in the different modelling contexts, CALIC can afford a large number of modelling contexts without suffering from either the context dilution (sparse context) problem or excessive memory use.

b) Using Single Error Correction Codes to Protect Against Isolated Defects and Soft Errors

The technology scaling process provides high-density, low-cost, high-performance integrated circuits. To cope with defects in memory chips, many different techniques have been proposed, all of them based on the use of redundant elements to replace defective ones. For example, when all remaining defective cells are located in one half of the array, the other half can still be used as a memory with reduced capacity. This reduction is done by permanently setting the most significant bit of the addresses to either 0 or 1, depending on which part of the memory is to be used. However, in most cases the remaining defective cells are evenly distributed across the whole array and not clustered in one half, which makes this technique useless.

Soft errors are a different threat: a disturbance such as a single event upset (SEU) changes a voltage level in the circuit. This change in the voltage level changes the state of a transistor, which in turn changes the value stored in a memory cell. For example, if a memory cell holds "1," an SEU will force it to "0." Single error correction codes can handle such isolated upsets, but unfortunately they fail when multiple cell upsets (MCUs) appear. The most common approach to deal with multiple errors has been the use of interleaving in the physical arrangement of the memory cells, so that cells that belong to the same logical word are separated. As the errors in an MCU are physically close, they then cause single errors in different words, which can be corrected by Single Error Correction-Double Error Detection (SEC-DED) codes. However, interleaving cannot be used, for example, in small memories or register files, and in other cases its use may have an impact on floor-planning, access time, and power consumption.
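The benefit of interleaving can be made concrete with a small sketch. It assumes a simple column-interleaved layout in which physical column `c` of a memory row belongs to logical word `c % interleave` at bit `c // interleave`; this mapping and the function names are illustrative assumptions, not a layout given in the text.

```python
def physical_to_logical(column, interleave):
    """Map a physical column to (logical word index, bit index)."""
    return column % interleave, column // interleave


def errors_per_word(hit_columns, interleave):
    """Count how many flipped bits each logical word receives."""
    counts = {}
    for column in hit_columns:
        word, _bit = physical_to_logical(column, interleave)
        counts[word] = counts.get(word, 0) + 1
    return counts


if __name__ == "__main__":
    mcu = [9, 10, 11]  # three physically adjacent cells upset together
    print("interleave 4:", errors_per_word(mcu, 4))
    print("interleave 1:", errors_per_word(mcu, 1))
```

With an interleaving degree of 4, the three adjacent upsets land in three different words, one bit each, which a SEC-DED code can correct. With no interleaving (degree 1), all three fall into the same word and exceed the correction capability of the code.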

#### IV. EXISTING SYSTEM

The transform is a frequently used module in video compression; thus, the complexity of the transform has an important effect on the overall complexity of the video encoder. Chen et al. derived the factorization relationship between the N × N and N/2 × N/2 DCT matrices by analyzing the periodic property of the cosine function. With this factorization relationship, the number of arithmetic operations of the transform can be reduced. Ahmed et al. decomposed the DCT matrix into sparse submatrices in which multiplications are avoided by using the lifting scheme. Arai et al. proposed the Arai, Agui, and Nakajima (AAN) fast algorithm, based on a common factor extraction algorithm in which the complicated common factors are moved from the transform kernel to the scaling part. Only five multipliers are required in the AAN transform kernel. A multiplier is expensive compared with an adder in an integrated circuit, so multiplication operations are usually replaced by additions in circuit design. Tsui and Chan developed an efficient multiplierless fast Fourier transform (FFT)-like transform based on a recursive noise model that minimizes the hardware resources of the transform while maintaining high performance. A multiplierless hardware implementation using a second-order cone programming technique has also been presented, in which the dynamic ranges of the intermediate data are minimized through geometric programming.
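The relationship between the N-point and the N/2-point DCT can be checked numerically. The sketch below uses the unnormalized DCT-II and verifies only the even-indexed half of the factorization: the even-indexed N-point coefficients equal the N/2-point DCT of the folded input x[i] + x[N-1-i]. This is an illustration of the symmetry that such factorizations exploit, not a reproduction of the algorithm of Chen et al.; the odd-indexed half needs further steps that are not shown.

```python
import math


def dct2(x):
    """Unnormalized N-point DCT-II computed directly from the definition."""
    n = len(x)
    return [
        sum(x[i] * math.cos(math.pi * (2 * i + 1) * k / (2 * n)) for i in range(n))
        for k in range(n)
    ]


def even_coefficients_via_half_dct(x):
    """Even-indexed N-point DCT coefficients from one N/2-point DCT."""
    n = len(x)
    folded = [x[i] + x[n - 1 - i] for i in range(n // 2)]  # N/2 additions
    return dct2(folded)


if __name__ == "__main__":
    x = [3, 1, 4, 1, 5, 9, 2, 6]  # arbitrary test input
    full = dct2(x)
    half = even_coefficients_via_half_dct(x)
    for m, value in enumerate(half):
        print(m, round(full[2 * m], 6), round(value, 6))
```

Half of the output coefficients are thus obtained from a transform of half the size, at the cost of N/2 extra additions, which is where the reduction in arithmetic operations comes from.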

The existing transform architectures pay more attention to reducing the number of arithmetic operators, such as adders and multipliers, than to the data bit width in the transform. In fact, the data bit width is also an important factor affecting the circuit speed and area of a VLSI architecture. A circuit with a large bit width needs logic gates with a larger fan-in or fan-out, and more MOS devices are required in the logic gate circuits. Thus, the capacitive load and the resistance of the logic gates both increase with widening bit width. According to the first-order resistance and capacitance (RC) circuit model, the delay of the circuit is related to RC: a large RC leads to a long circuit delay. The circuit delay therefore grows with increasing input bit width, as observed for two typical CMOS processes (SMIC 40 nm and GF 28 nm). For an adder, the carry chain is the critical path for the circuit delay, and it also depends on the input and output bit widths.

Thus, aside from the number of arithmetic operations, the bit width is the other optimization factor for a fast transform architecture. In this brief, we propose a new VLSI architecture for the integer transforms of the HEVC standard that reduces the bit widths of the data. The integer transform matrix is decomposed into several signed bit-plane transform (SBT) matrices that are used in the proposed architecture. Moreover, a number of adders are reused based on the redundancy among the elements of the bit matrices. With the bit matrix-based transform algorithm, the proposed VLSI transform architecture can reach a maximum data throughput of 32 pixels/cycle with a very high working frequency and reasonable area.

a) Bit-Plane Decomposition of Integer Transform in Signed Bit Matrix-Based Transform Algorithm

In order to narrow the bit width of the intermediate transformed data, we propose a bit decomposition algorithm that decomposes the integer transform matrix into several SBT matrices. When the SBT algorithm is applied to the transform architecture, SBT matrix circuits are implemented instead of integer transform matrix circuits, and the input data are transformed by each SBT matrix circuit separately. Because the elements of the SBT matrices are so simple, the bit widths of the intermediate transformed data and of the output data are significantly reduced. The bit width of the output data is at most $n + \lceil \log_2 N \rceil$, where $n$ is the bit width of the input data and $N$ is the transform size. Taking the 32 × 32 1-D integer transform as an example, the output bit width grows by only 5 b with the SBT algorithm, compared with 11 b for the straightforward integer transform. The bit widths in SBT increase slowly as the intermediate data are processed stage by stage, which shortens the circuit delay and allows a shorter clock cycle. Although the delay of the integer transform circuit is reduced by the proposed bit transform algorithm, more adders are required because there are more SBTs.



However, the bit widths of the adders used in SBT are so low that the addition operations are also very fast. Additionally, it can be observed from (6) that the SBT matrices contain many zero elements. By the rule of matrix multiplication, a zero element requires no addition, so the number of additions actually required is small. The sparsity of the SBT matrices thus helps to reduce the addition operations in the transform process.
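The decomposition can be sketched in a few lines. The example assumes that each SBT matrix is the signed bit plane of the integer matrix, so that the integer matrix equals the sum over k of 2^k times the k-th plane, and it uses the 4-point HEVC core transform matrix as input. Both the decomposition rule and the function names are assumptions for illustration; the exact construction used in the architecture is defined by the equations of the original brief, which are not reproduced here.

```python
# 4-point HEVC core transform matrix (taken from the HEVC standard).
HEVC_4X4 = [
    [64, 64, 64, 64],
    [83, 36, -36, -83],
    [64, -64, -64, 64],
    [36, -83, 83, -36],
]


def signed_bit_planes(matrix):
    """Split an integer matrix C into planes B_k with entries in {-1, 0, 1}
    such that C = sum_k 2**k * B_k (assumed decomposition rule)."""
    max_magnitude = max(abs(c) for row in matrix for c in row)
    planes = []
    for k in range(max_magnitude.bit_length()):
        planes.append([
            [(1 if c > 0 else -1) * ((abs(c) >> k) & 1) for c in row]
            for row in matrix
        ])
    return planes


def matvec(matrix, x):
    """Plain matrix-vector product."""
    return [sum(a * b for a, b in zip(row, x)) for row in matrix]


def transform_via_planes(matrix, x):
    """Transform x with each bit plane (additions only), then combine the
    partial results with shifts."""
    partials = [matvec(plane, x) for plane in signed_bit_planes(matrix)]
    return [sum(p[i] << k for k, p in enumerate(partials))
            for i in range(len(matrix))]


if __name__ == "__main__":
    x = [10, -3, 7, 255]  # hypothetical input samples
    planes = signed_bit_planes(HEVC_4X4)
    zeros = sum(v == 0 for plane in planes for row in plane for v in row)
    total = sum(len(row) for plane in planes for row in plane)
    print("planes:", len(planes), "zero entries:", zeros, "of", total)
    print("match :", transform_via_planes(HEVC_4X4, x) == matvec(HEVC_4X4, x))
```

Each plane only adds or subtracts inputs, so its outputs need at most $\lceil \log_2 N \rceil$ more bits than the inputs, and most plane entries are zero, which is the sparsity the text refers to.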

#### Fig. 1: HIERARCHICAL STRUCTURE OF SBT

There are many addition operation redundancies in SBT. Thus, we propose the adder reuse method based on the element redundancy characteristic of SBT matrices for reducing the number of adders in the next section.
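A rough way to see the redundancy is to count how many row sums are actually distinct. The sketch below assumes signed bit planes of the 4-point HEVC matrix as stand-ins for the SBT matrices and treats two rows as shareable when they are identical or negatives of each other. This counting rule is an assumption for illustration only: it ignores the shift-and-accumulate additions that combine the planes and is not the reuse scheme of the proposed architecture.

```python
# 4-point HEVC core transform matrix (taken from the HEVC standard).
HEVC_4X4 = [
    [64, 64, 64, 64],
    [83, 36, -36, -83],
    [64, -64, -64, 64],
    [36, -83, 83, -36],
]


def signed_bit_planes(matrix):
    """Planes B_k with entries in {-1, 0, 1} and matrix = sum_k 2**k * B_k."""
    bits = max(abs(c) for row in matrix for c in row).bit_length()
    return [
        [[(1 if c > 0 else -1) * ((abs(c) >> k) & 1) for c in row] for row in matrix]
        for k in range(bits)
    ]


def canonical(row):
    """Normalize a row so that its first nonzero entry is +1.

    Rows that differ only by a global sign share the same adder tree.
    Returns None for an all-zero row, which needs no adder.
    """
    for value in row:
        if value != 0:
            return tuple(row) if value > 0 else tuple(-v for v in row)
    return None


def count_row_additions(planes):
    """Return (additions without reuse, additions with row-level reuse)."""
    without_reuse = 0
    shared = {}
    for plane in planes:
        for row in plane:
            key = canonical(row)
            if key is None:
                continue
            additions = sum(1 for v in key if v != 0) - 1
            without_reuse += additions
            shared[key] = additions  # each distinct pattern is computed once
    return without_reuse, sum(shared.values())


if __name__ == "__main__":
    print(count_row_additions(signed_bit_planes(HEVC_4X4)))
```

In this small case the same differences, such as x0 - x3 and x1 - x2, recur in several planes, so computing each distinct pattern once removes more than half of the row-level additions.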

#### V. PROPOSED SYSTEM

Convolutional codes do not process the data bits block-wise; rather, they process consecutive bits and use the convolution property of polynomials to generate the code words. A convolutional encoder encodes L data bits into M > L code bits in every time step. The encoding is not memoryless, because the code bits depend on the data bits encoded at past time steps. This is another major difference from block codes, which are memoryless. Because they are memoryless, block codes need long block lengths, and their performance improves with block length.

#### Fig. 2: PROPOSED SYSTEM

The above figure (2) shows the architecture of the proposed system, which involves both the encoder and the decoder process. The entire operation consists of ten steps, as shown in figure (2). First, the input bits are passed to the zig-zag scanner, which scans them and sends them to the run level encoder. The run level encoder produces two bits of encoded information for each bit of input information, so it is called a rate 1/2 encoder. At each input, a bit is shifted into the leftmost stage, and the bits already in the shift registers are moved one position to the right. After the modulo-2 operations are applied, the corresponding outputs are obtained. This process continues for as long as data arrive at the input of the encoder. A run level encoder is generally characterized in (n, k, m) format, where n is the number of outputs of the encoder, k is the number of inputs of the encoder, and m is the number of memory elements (flip-flops) of the longest shift register of the encoder. The rate of an (n, k, m) encoder is k/n. The modulator takes the signal from the encoder and produces an analog signal. The analog signal is quantized and converted into a digital signal in the quantization block. We assume that the proposed decoder receives successive code symbols in parallel, in which the boundaries of the symbols and the frames have been identified. After decoding, the signals are demodulated, and the output produced is synchronized with HEVC. Overall, the proposed system gives effective results compared with the existing system.
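The shift-register encoding described above can be sketched as a rate 1/2 encoder in (n, k, m) = (2, 1, 2) form. The generator taps 111 and 101 (7 and 5 in octal) and the zero-bit flushing at the end are common textbook choices assumed here for illustration; the text does not specify the generators of the encoder in Fig. 2.

```python
def parity(value):
    """Modulo-2 sum of the bits of value."""
    return bin(value).count("1") & 1


def conv_encode(bits, generators=(0b111, 0b101), memory=2):
    """Rate 1/len(generators) convolutional encoder.

    The register holds memory + 1 bits. Each new bit enters at the leftmost
    (most significant) stage and the older bits move one position right.
    Each output bit is the modulo-2 sum of the stages selected by one
    generator mask. The generator values are assumed, not from the paper.
    """
    register = 0
    encoded = []
    for bit in list(bits) + [0] * memory:  # trailing zeros flush the register
        register = (bit << memory) | (register >> 1)
        for mask in generators:
            encoded.append(parity(register & mask))
    return encoded


if __name__ == "__main__":
    message = [1, 0, 1, 1]
    code = conv_encode(message)
    print(code, "rate ~", len(message), "/", len(code))
```

Every input bit yields two output bits, and each output depends on the current bit and the two previous ones, which is the memory that distinguishes this encoder from a block code.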

#### VI. RESULTS



Fig.5: OUTPUT WAVEFORM



#### VII. CONCLUSION

The emerging HEVC standard has been developed and standardized collaboratively by the Video Coding Experts Group and the Moving Picture Experts Group. A fast integer transform VLSI architecture based on the sparse signed bit transform (SBT) is proposed for real-time ultra HD video coding conforming to the HEVC standard. The integer transform matrix, which has a high bit width, is decomposed into several low bit width matrices by a matrix decomposition method. A circuit reuse strategy exploits the redundancy of the SBT matrices to reduce the number of adders in the VLSI architecture. The proposed transform hardware architecture can process video data at higher speed and with reasonable area compared with previous work.
