
1997 IEEE International Symposium on Circuits and Systems, June 9-12, 1997, Hong Kong

A NOVEL SCALABLE ARCHITECTURE WITH MEMORY INTERLEAVING ORGANIZATION FOR FULL SEARCH BLOCK-MATCHING ALGORITHM

Yeong-Kang Lai, Liang-Gee Chen, Tsung-Han Tsai, and Po-Cheng Wu

Department of Electrical Engineering
National Taiwan University
Taipei, Taiwan, R.O.C. China

ABSTRACT

This paper describes a high-throughput scalable architecture for the full-search block-matching algorithm (FSBMA). The number of processing elements (PEs) is scalable according to the variable algorithm parameters and the performance required for different applications. By use of efficient PE-rings and an intelligent memory-interleaving organization, the efficiency of the architecture is increased. Techniques for reducing interconnections and external memory accesses are also presented. Our results demonstrate that the scalable PE-ringed architecture is a flexible and high-performance solution for FSBMA.

1. INTRODUCTION

Among various video compression techniques, motion-compensated hybrid coding is the most popular and has been adopted by several standards and proposals [1]-[3]. The block-matching algorithm for motion estimation is used in a wide variety of applications. It removes the temporal redundancy within frame sequences and thus provides these coding systems with a significant bit-rate reduction. However, it also requires a large amount of computation and a heavy memory bandwidth.

Different real-time video applications have different performance requirements. These requirements can be evaluated from a few essential parameters: the block size, the search area size, the frame size, and the frame rate. The number of PEs needed for real-time operation depends on these requirements. For numerous application domains, optimal block-matching parameters have not been specified yet. Furthermore, flexibility in parameters is explicitly favored for emulation of algorithms in an early phase of system development. Hence, there exists a substantial need for a flexible motion estimator that allows the user to select their own parameters and to check the influence of parameter variations.

Several dedicated hardware implementations have been realized for full-search (FS) or full-search-based block-matching algorithms [4]-[11]. Principally, these realizations are systolic arrays laid out for a specific set of parameter values. Hence they usually offer only limited flexibility, or none at all. A parallel architecture with multiple PEs can efficiently increase throughput. However, the number of input pins, the difficulty of data addressing, and the interconnection complexity between memory modules and PEs make it hard to implement. We propose a novel scalable PE-ringed architecture that addresses all of these problems.

In this paper, a scalable, parametrizable, yet efficient full-search block-matching architecture is described. The proposed approach supports the different data flows that result from different parameters and thus achieves an efficiency close to 100 percent. It also exploits data reuse to reduce the number of input pins. Section 2 presents the overall scalable architecture. Section 3 presents techniques for memory bandwidth reduction, data arrangement, and interconnection simplification. In Section 4, the performance of the proposed VLSI architecture is analyzed alongside various conventional full-search block-matching architectures. Finally, Section 5 gives a conclusion.

0-7803-3583-X/97 $10.00 ©1997 IEEE

Figure 1. Template block and search area for 256 possible displacements (N = 16, p = 8). The 16 x 16 template block a(i, j) is taken from the current frame and the 31 x 31 search area from the previous frame; the candidate blocks are labeled B0 to B255.

2. THE PROPOSED VLSI ARCHITECTURE

The procedure of a block-matching algorithm is to find the best-matched displaced block from the previous frame F_{t-1}, within a search range, for each N x N block in the present frame F_t. A straightforward method, the full search, exhaustively matches all possible candidates to find the displacement (called the motion vector) with minimal distortion. As the criterion of distortion, the mean absolute difference (MAD) is calculated for each candidate location (u, v):

$$
\mathrm{MAD}(u, v) = \sum_{l=0}^{N-1} \sum_{m=0}^{N-1} \left| F_t(x + l,\, y + m) - F_{t-1}(x + l + u,\, y + m + v) \right| \tag{1}
$$

where (x, y) is the coordinate of the upper-left pixel of the current block in F_t, and the values of u and v are limited to the range from -p to p - 1. Fig. 1 illustrates the procedure of FSBMA. The candidate blocks are labeled B0 to B255.

Figure 2. The scalable PE-ringed architecture. (Labels: H MM columns, V MM rows, H PE columns, V PE rows, current block, from the above PE, from memory module, time-sharing common bus, motion vector.)
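As a software illustration of Eq. (1) and the exhaustive search it drives, the following is a minimal sketch, not the hardware data flow of the paper. The function names `mad` and `full_search`, the list-of-lists frame layout, and the small test frames are assumptions made for this example. The distortion is computed as the plain double sum of Eq. (1), without any normalization, and the caller is assumed to keep the search window inside the frame.

```python
def mad(cur, prev, x, y, u, v, n):
    """Distortion of Eq. (1): sum of absolute differences over an n x n block."""
    return sum(
        abs(cur[x + l][y + m] - prev[x + l + u][y + m + v])
        for l in range(n)
        for m in range(n)
    )


def full_search(cur, prev, x, y, n, p):
    """Return (best distortion, (u, v)) over all displacements in [-p, p - 1].

    Assumes every index x + l + u and y + m + v stays inside prev.
    """
    best = None
    for u in range(-p, p):
        for v in range(-p, p):
            d = mad(cur, prev, x, y, u, v, n)
            if best is None or d < best[0]:
                best = (d, (u, v))
    return best


# Hypothetical 12 x 12 test frames: cur is prev displaced by (1, -2).
SIZE = 12
prev = [[(7 * i * i + 13 * j + 3 * i * j) % 251 for j in range(SIZE)]
        for i in range(SIZE)]
cur = [[prev[(i + 1) % SIZE][(j - 2) % SIZE] for j in range(SIZE)]
       for i in range(SIZE)]

# Match the 4 x 4 block at (4, 4) with p = 2, i.e. 16 candidate displacements.
best = full_search(cur, prev, x=4, y=4, n=4, p=2)
print(best)
```

With N = 16 and p = 8 the two loops in `full_search` visit the 256 candidates B0 to B255 of Fig. 1, each costing 256 absolute differences, which is the workload the PE array is meant to parallelize.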

Fig. 2 shows the scalable PE-ringed architecture that performs the FSBMA. To exploit the parallelism in FSBMA, the proposed architecture consists of H x V identical on-chip memory modules (MMs) and H x V identical processing elements (PEs). The PEs are connected in a ringed fashion. Each PE is composed of an absolute difference unit, an accumulator, and a final-result latch, arranged as a pipeline. The detailed hardware architecture of a PE is shown in Fig. 3. According to the performance requirements of different real-time video applications, the number of MMs and PEs, i.e., H x V, is determined by the following evaluation equation. Assume that the cycle time is T_p for the N x N current block with a search range of -p to p - 1.

$$
T = \frac{\frac{W}{N} \times \frac{L}{N} \times (2p)^2 \times N^2 \times T_p}{H \times V} < \frac{1}{f_r} \tag{2}
$$

where T denotes the total block-matching time of all blocks in a frame (frame size: W x L). It should be smaller than the frame period, 1/f_r. The search area pixels to be computed are first stored in the MMs. The current block pixels are sequentially input and broadcast to all PEs in raster-scan order. The upper-left H x V candidate blocks are computed first. After 256 clock cycles, the MAD of each candidate block is produced in each PE and transferred to the PE's final-result latch. These H x V latched MADs can then be sent to the minimum extractor to get the minimum MAD. During the comparisons, the PEs continue to perform the block matching of the next H x V candidate blocks, so no extra cycles are consumed to fill the pipeline.

Figure 3. Architecture of a PE. (Labels: current block pixels, from the left PE, to the right PE, to the below PE, to time-sharing common bus.)

3. TECHNIQUES FOR DATA MANAGEMENT

This architecture is based on three data management techniques: (1) on-chip memory configuration for data reuse, (2) memory interleaving organization for parallel data accesses, and (3) propagation of the accumulated partial results to eliminate the interconnection overhead between the PEs and the MMs. These techniques are described in the following subsections. Fig. 4 shows a simple example that illustrates the operation of the proposed architecture. In terms of (2), we assume that 4 x 4 MMs and 4 x 4 PEs are needed to meet a certain real-time application.

Figure 4. A simple architecture example for N = 4, p = 2, H = 4, and V = 4. (Some interconnections are omitted.)

3.1. On-chip Memory Configuration for Data Reuse

Memory bandwidth is the main bottleneck in supporting high-performance motion estimators. However, pixels are used repeatedly to evaluate different candidate blocks. To avoid extremely high bandwidth requirements for chip I/O and the memory system, we exploit the overlap between the search areas of adjacent current blocks (see Fig. 5) and propose a scheme of three half-search-area (HSA) segments. Each memory module is partitioned into three HSA segments, as shown in Fig. 6(a). One HSA block (31 x 16 pixels) is interleaved to reside in the same-labeled segments of the 4 x 4 MMs. Fig. 6(b) illustrates the operation of the three HSA segments. When executing task 0 (matching current block 0), the PEs access the search area pixels from segments 0 and 1. While the FS of current block 0 is being performed, HSA segment 2 in all memory modules is filled with the right half of the search area for task 1 (HSA C). When executing task 1 (matching current block 1), the search area pixels are accessed by the PEs from segments 1 and 2 of the 4 x 4 MMs, and the new data (HSA D) are input to segment 0. In this cyclic manner, the 31 x 16 new pixels can easily be fetched from system memory through only one input port while the block matching of the current block is performed.

Figure 5. Overlapped search areas of adjacent current blocks. (HSA blocks of 31 x 16 pixels, labeled A to H; current blocks of 16 x 16 pixels; search areas of current blocks 0, 1, and 2 for tasks 0, 1, and 2.)

3.2. Memory Interleaving Organization

The on-chip memory is used to reduce the load on chip I/O and the memory system. The next problem is how to provide all required data to the H x V PEs simultaneously. Our approach is to divide the memory into H x V memory modules and to interleave the input pixels across these H x V modules. An example is shown in Fig. 7. The label (0 to 15) for each pixel indicates the module that stores the pixel. This memory interleaving organization provides a solution for parallel data accesses, but it requires that each PE be able to access its corresponding memory module in parallel.
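The interleaving idea of Section 3.2 can be sketched in a few lines. The mapping used here, module = (row mod V) x H + (column mod H), is an assumption chosen for illustration; the text only states that each pixel carries a label from 0 to 15 naming its module. The names `module_of` and `modules_in_window` are likewise hypothetical. The sketch checks the property that matters for parallel access: any window of V rows by H columns, such as the search pixels needed at one instant by H x V adjacent candidate blocks, touches every module exactly once.

```python
def module_of(row, col, h, v):
    """Assumed interleaving rule: module label for search pixel (row, col)."""
    return (row % v) * h + (col % h)


def modules_in_window(top, left, h, v):
    """Sorted module labels touched by the v-row by h-column window at (top, left)."""
    return sorted(
        module_of(top + r, left + c, h, v)
        for r in range(v)
        for c in range(h)
    )


H, V = 4, 4

# Labels for the upper-left 8 x 8 corner of a search area.
labels = [[module_of(r, c, H, V) for c in range(8)] for r in range(8)]
for row in labels:
    print(" ".join(f"{label:2d}" for label in row))

# Every 4 x 4 window, whatever its position, uses each of the 16 modules once.
conflict_free = all(
    modules_in_window(top, left, H, V) == list(range(H * V))
    for top in range(7)
    for left in range(7)
)
print(conflict_free)
```

Under this assumed rule no two of the 16 simultaneous reads collide on a module, which is why the modules can be single-ported.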

Figure 6. (a) The three partitioned HSA segments (HSA segment 0, HSA segment 1, HSA segment 2) of a memory module. (b) Contents of the three HSA segments over tasks T0 to T6, where Ti denotes task i, the full search for current block i (write operations are marked by small frames).

3.3. Propagation of the Accumulated Partial-Results

In most multiple-PE designs, a fully connected interconnection is demanded between the multiple PEs and the multiple MMs. However, this interconnection adds extra delay and consumes a larger routing area because of the large numbers of buses, multiplexers, and tri-state buffers. To overcome this drawback, the adjacent PEs are connected in horizontal rings and vertical rings, and we arrange the data flow from the memory modules to the PEs and redistribute the PE operations. This method allows the operations belonging to a candidate block to be performed by several PEs. In every clock cycle, each PE propagates its accumulated partial result to the adjacent PE to form the next partial result of the candidate block. After 256 clock cycles, the final result of each candidate block is produced in each PE. The data flow is shown in Table 1. Through propagation of the accumulated partial results, each PE is connected to only one memory module. This eliminates the complicated interconnection and switching circuitry between the PEs and the memory modules.
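The propagation scheme can be illustrated with a deliberately simplified one-dimensional model. It is a sketch under assumptions, not the data flow of Table 1: a single ring of h PEs, a single row of pixels, and a column-wise interleaving in which module k holds the search pixels whose column index is congruent to k modulo h. The function name `ring_sad_1d` and the test data are hypothetical. The point of the sketch is that each PE reads only its own module, yet every candidate sum comes out correct because the partial sums, rather than the pixels, travel around the ring.

```python
def ring_sad_1d(cur_row, search_row, h):
    """Simulate h ring-connected PEs computing h candidate sums for one row.

    PE k may read only module k, which holds the search pixels whose
    column index is congruent to k modulo h (assumed 1-D interleaving).
    """
    n = len(cur_row)
    assert len(search_row) >= n + h - 1
    modules = [search_row[k::h] for k in range(h)]
    acc = [0] * h  # acc[k]: partial sum currently held by PE k
    for j, pixel in enumerate(cur_row):  # pixel j is broadcast to all PEs
        for k in range(h):
            u = (k - j) % h      # candidate whose partial sum sits in PE k
            col = u + j          # search column that candidate needs now
            assert col % h == k  # that column lives in PE k's own module
            acc[k] += abs(pixel - modules[k][col // h])
        acc = acc[-1:] + acc[:-1]  # pass every partial sum to the next PE
    # After n cycles, the sum of candidate u has arrived at PE (u + n) % h.
    return [acc[(u + n) % h] for u in range(h)]


# Hypothetical data: 8 current pixels, 4 candidates, 11 search pixels.
cur_row = [3, 1, 4, 1, 5, 9, 2, 6]
search_row = [2, 7, 1, 8, 2, 8, 1, 8, 2, 8, 4]

ring = ring_sad_1d(cur_row, search_row, h=4)
direct = [
    sum(abs(cur_row[j] - search_row[u + j]) for j in range(len(cur_row)))
    for u in range(4)
]
print(ring == direct)
```

The architecture described above extends the same idea to two dimensions with horizontal and vertical rings; that extension is not modeled here.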

4. PERFORMANCE ANALYSIS

In this section, we analyze the performance of the proposed VLSI architecture for the full-search block-matching algorithm. Table 2 presents a comparison between the proposed and previous architectures. The 2-D array provides high-speed motion estimators at very high cost. The 1-D array is an efficient low-cost design for some low-speed applications. The proposed architecture, by contrast, gives a satisfactory solution when scalability, input ports, and computation speed are all taken into account.
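The cycle counts reported in Table 2 can be reproduced from the per-block term of Eq. (2), namely (2p)^2 x N^2 / (H x V) clock cycles per block. The sketch below assumes N = 16 and p = 8 for Table 2, which matches the listed values but is not stated with the table. The function names, the CIF frame size, the 30 frames/s rate, and the 20 ns cycle time are all hypothetical values chosen for illustration, not figures from the paper.

```python
def cycles_per_block(n, p, num_pes):
    """Clock cycles to match one n x n block: (2p)^2 * n^2 / (H * V)."""
    total_ops = (2 * p) ** 2 * n * n
    if total_ops % num_pes:
        raise ValueError("num_pes must divide (2p)^2 * n^2 in this sketch")
    return total_ops // num_pes


def frame_time(width, height, n, p, num_pes, cycle_time):
    """Total block-matching time T of Eq. (2) for one width x height frame."""
    blocks = (width // n) * (height // n)
    return blocks * cycles_per_block(n, p, num_pes) * cycle_time


def meets_real_time(width, height, n, p, num_pes, cycle_time, frame_rate):
    """True if T is smaller than the frame period 1 / frame_rate."""
    return frame_time(width, height, n, p, num_pes, cycle_time) < 1.0 / frame_rate


# Per-block cycle counts for 128, 64, 32, and 16 PEs (N = 16, p = 8 assumed).
table2_cycles = [cycles_per_block(16, 8, k) for k in (128, 64, 32, 16)]
print(table2_cycles)

# Hypothetical sizing check: 352 x 288 frames at 30 frames/s, 20 ns cycle time.
ok_16 = meets_real_time(352, 288, 16, 8, 16, 20e-9, 30)
ok_8 = meets_real_time(352, 288, 16, 8, 8, 20e-9, 30)
print(ok_16, ok_8)
```

Used this way, Eq. (2) gives the smallest H x V that satisfies the frame-period constraint for a given set of parameters, which is how the PE count would be chosen per application.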

5. CONCLUSION

A scalable PE-ringed architecture for FSBMA has been described. The number of processing elements (PEs) is scalable according to the variable algorithm parameters and the performance required for different applications. A configuration of random-access on-chip memory modules solves the problems of chip I/O and memory bandwidth requirements. Input data are arranged in the memory modules by the memory interleaving organization. Combined with the propagation of accumulated partial results, the interleaved memory modules provide every PE with its required data simultaneously without introducing complicated interconnections and switching circuitry. In summary, the proposed architecture has the following desirable features: (1) low hardware cost, (2) high throughput rate, (3) low latency, (4) low I/O and memory bandwidth requirements, and (5) 100 percent computation efficiency.

Table 1. Data flow for FSBMA (N = 16, p = 8). Columns: Time | Data Sequence | PE0 | PE1 | ... | PE14 | PE15. Each row gives the clock cycle (i x 16 + j), the broadcast current-block pixel a(i, j), and the candidate block handled by each PE in that cycle.

Table 2. Comparison between the proposed and previous architectures.

Architecture                    | Proposed architecture    | 2-D array [4]            | 1-D array [5]
No. of PEs                      | 128 / 64 / 32 / 16       | 128 / 64 / 32 / 16       | 128 / 64 / 32 / 16
No. of input data pins          | 16 / 16 / 16 / 16        | 24 / 24 / 24 / 24        | 136 / 72 / 40 / 24
Clock cycles per block matching | 512 / 1024 / 2048 / 4096 | 512 / 1024 / 2048 / 4096 | 512 / 1024 / 2048 / 4096

Figure 7. An example for the distribution of search area pixels to 4 x 4 memory modules.

REFERENCES

[1] CCITT SGXV Working Party XV/1 Specialists Group on Coding for Visual Telephony, Document #584, Nov. 1989.
[2] ISO/IEC/JTC1/SC29/WG11/Draft/CD 13818-2, Recommendation H.262, Nov. 1993.
[3] ISO/IEC/JTC1/SC29/WG11/Draft/CD 13818-1, MPEG-2 Systems, Nov. 1993.
[4] Luc De Vos and Michael Stegherr, "Parameterisable VLSI Architectures for the Full-Search Block-Matching Algorithm," IEEE Transactions on Circuits and Systems, vol. 36, no. 10, pp. 1309-1316, Oct. 1989.
[5] K. M. Yang, M. T. Sun, and L. Wu, "A family of VLSI designs for the motion compensation block-matching algorithm," IEEE Transactions on Circuits and Systems, vol. 36, no. 10, pp. 1317-1325, Oct. 1989.
[6] Y. S. Jehng, L. G. Chen, and T. D. Chiueh, "An efficient and simple VLSI tree architecture for motion estimation algorithm," IEEE Transactions on Signal Processing, vol. 41, no. 2, Feb. 1993.
[7] H. M. Jong, L. G. Chen, and T. D. Chiueh, "Parallel Architecture for 3-Step Hierarchical Search Block-Matching Algorithm," IEEE Transactions on Circuits and Systems for Video Technology, vol. 4, no. 4, Aug. 1994.
[8] Shifan Chang, Juin-Haur Hwang, and Chein-Wei Jen, "Scalable Array Architecture for Full Search Block Matching," IEEE Transactions on Circuits and Systems for Video Technology, vol. 5, no. 4, pp. 332-353, Aug. 1995.
[9] Luc De Vos and Matthias Schobinger, "VLSI Architecture for a Flexible Block Matching Processor," IEEE Transactions on Circuits and Systems for Video Technology, vol. 5, no. 5, pp. 417-428, Oct. 1995.
[10] Santanu Dutta and Wayne Wolf, "A Flexible Parallel Architecture Adapted to Block-Matching Motion-Estimation Algorithms," IEEE Transactions on Circuits and Systems for Video Technology, vol. 6, no. 1, pp. 74-86, Feb. 1996.
[11] Gangan Gupta and Chaitali Chakrabarti, "Architectures for Hierarchical and Other Block Matching Algorithms," IEEE Transactions on Circuits and Systems for Video Technology, vol. 5, no. 6, pp. 477-489, Dec. 1995.