
THE INSTITUTE OF ELECTRICAL AND ELECTRONICS ENGINEERS INC.

SIGNAL PROCESSING ALGORITHMS, ARCHITECTURES, ARRANGEMENTS, AND APPLICATIONS

SPA 2010, September 23-25, 2010, Poznań, POLAND

Parallelizing of digital signal processing using GPU

Wojciech Bożejko 1, Andrzej Dobrucki 2, Maciej Walczyński 2

1 Wrocław University of Technology, Institute of Computer Engineering, Control and Robotics, Wrocław, Poland,
e-mail: [email protected]

2 Wrocław University of Technology, Institute of Telecommunications, Teleinformatics and Acoustics, Wrocław, Poland,
e-mail: [email protected], [email protected]

Abstract: In this paper we show the parallelization process for a class of algorithms used in digital signal processing. We present this approach on the instance of the popular LMS algorithm, which is used in noise reduction, echo cancellation problems, and digital signal processing in general. We propose an approach which uses GPGPU technology. The parallel approach allows us to decompose the problem into a number of smaller ones, which can be computed faster. The obtained results, especially the increase of speed and efficiency, show that the parallel method implemented on a GPU is much more effective than other existing procedures and that it can be used in real-time systems.

1 Introduction

We can observe that for about 5 years processors have had a constant maximal clock frequency of about 3 GHz. This means that hardware technology has reached a barrier in some sense, and increasing the speed of a single core is uneconomical (massive cooling would be required, for example with liquid nitrogen or Freon). Nowadays, computational power comes from multiplying cores: we have 2, 4, or 6 cores inside a single CPU, and there are prototypes with 80 cores (made by Intel). Executing an existing sequential algorithm on a multicore processor does not give any acceleration. We have to design a new kind of algorithms, namely parallel filters, to take advantage of the new processors' hardware architecture.

There are many algorithms used in digital signal processing which need strong computational power to work in real time. In many situations the most complex (from the computational point of view) part of those algorithms is the problem of large matrix multiplication. We propose a parallelization of those algorithms on the example of the LMS algorithm.

2 The Problem on the LMS example

The LMS (Least Mean Square) filtration is based on the minimization of the mean square error. These filters are stable and easy to implement [1],[2]. Unfortunately, the parallelization of this algorithm, especially in distributed-memory parallel computing systems, is not so obvious. The main disadvantage of the LMS algorithm is its slow convergence. There is a number of LMS variants, including PNLMS (Proportional Normalized Least Mean Square), which are focused on improving the weak convergence of the original LMS method.

The procedure of filter adaptation requires a significant calculation and time cost, which has to be minimized. The most complex element of the computational process is the matrix multiplication procedure. By parallelizing it we obtain a concurrent algorithm which works as the sequential one, but much faster (so-called single-walk parallelization [3],[4]).

3 Sequential LMS

In this paragraph we show (following [4]) that the most complex part of the algorithm is the matrix multiplication part. Sequential LMS (Least Mean Square) is one of the most popular adaptive filtering algorithms. It belongs to the gradient adaptive filter class. In these filters we assume that the modification $\Delta h(n)$ of the vector $h(n)$ of filter parameters should be proportional, at each time moment $n$, to the gradient vector of the cost function $J(n)$, which can be written as the equation:

$$ h(n+1) = h(n) + \Delta h(n) = h(n) - \frac{1}{2}\,\mu(n)\,\frac{\partial J(h(n))}{\partial h(n)}, \qquad (1) $$

where $\mu(n)$ is a scale variable which influences the speed of the filter modification. In the general case it depends on time. To speed up the adaptation process, an additional weight matrix $W(n)$ is introduced. Equation (1) modified in this way takes the form of:

$$ h(n+1) = h(n) + \Delta h(n) = h(n) - \frac{1}{2}\,\mu(n)\,W(n)\,\frac{\partial J(h(n))}{\partial h(n)}. \qquad (2) $$

In the case of LMS a temporal error value is minimized. Therefore the error criterion takes the form of:

$$ J(h(n)) = e^2(n). \qquad (3) $$


From this the cost function derivative is given by:

$$ \frac{\partial J(h(n))}{\partial h(n)} = \left[ \frac{\partial J(h(n))}{\partial h_0(n)}, \frac{\partial J(h(n))}{\partial h_1(n)}, \ldots, \frac{\partial J(h(n))}{\partial h_M(n)} \right]^T, \qquad (4) $$

where $M$ denotes the filter dimension. In turn:

$$ \frac{\partial e^2(n)}{\partial h_k(n)} = 2e(n)\,\frac{\partial e(n)}{\partial h_k(n)} = 2e(n)\,\frac{\partial \left( d(n) - \sum_{k=0}^{M} h_k x(n-k) \right)}{\partial h_k(n)} = -2e(n)\,x(n-k), \qquad (5) $$

where $\sum_{k=0}^{M} h_k x(n-k)$ is an estimator $\hat{d}(n)$ of the reference signal $y(n)$. Finally, equation (2) takes the form of:

$$ h(n+1) = h(n) + \mu(n)\,W(n)\,e(n)\,x(n), \qquad (6) $$

which can be formulated in the matrix form as:

(7)
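To make the sequential update concrete, below is a minimal sketch of one step of update (6) in C, assuming the common simplification $W(n) = I$ and a constant step size mu; the function and variable names are illustrative additions, not taken from the paper:

/* One step of the sequential LMS update of equation (6), assuming
   W(n) = I and a constant step size mu -- an illustrative sketch,
   not the authors' implementation.
   h : filter coefficients h_0..h_M (length M+1)
   x : recent input samples, x[k] = x(n-k)
   d : reference sample d(n)
   Returns the temporal error e(n). */
double lms_step(double *h, const double *x, double d, double mu, int M)
{
    double y = 0.0;
    for (int k = 0; k <= M; k++)     /* filter output: estimator of d(n) */
        y += h[k] * x[k];
    double e = d - y;                /* temporal error e(n) */
    for (int k = 0; k <= M; k++)     /* h(n+1) = h(n) + mu * e(n) * x(n) */
        h[k] += mu * e * x[k];
    return e;
}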

4 Parallel approach

The most complex part of the LMS algorithm is the process of updating the filter coefficients. As we showed above, this process may be reduced to matrix multiplication. We parallelize this part of the algorithm using GPGPU.

4.1 Parallel algorithms efficiency measures

The basic terms connected with parallel programming are: speedup, efficiency, and cost. Speedup is a parameter which describes how many times the parallel algorithm is faster than the sequential one. It is given by:

$$ S_M(p) = \frac{T_s}{T_M(p)}, \qquad (8) $$

where $T_s$ is the computation time of a sequential algorithm solving a problem $P$ on a sequential machine, and $T_M(p)$ is the computation time of a parallel algorithm solving the problem $P$ on a $p$-processor machine $M$. Efficiency takes the form of:

$$ \eta_M(p) = \frac{S_M(p)}{p}. \qquad (9) $$

Cost is given by the equation:

$$ C_M(p) = p \cdot T_M(p), \qquad (10) $$

and can be understood as the amount of resources (for example, electric energy) needed to execute the parallel algorithm.
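These measures follow directly from measured runtimes; a small illustrative C helper (our naming, not from the paper) could look as follows:

/* Illustrative helpers for equations (8)-(10).
   Ts, Tp : measured sequential and parallel runtimes (same unit),
   p      : number of processors. */
double speedup(double Ts, double Tp)           { return Ts / Tp; }
double efficiency(double Ts, double Tp, int p) { return Ts / (Tp * p); }
double cost(double Tp, int p)                  { return p * Tp; }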

4.2 Parallel matrix multiplication problem on CUDA

The main problem in the parallelization of existing algorithms on CUDA devices is the appropriate use of memory. On the nVidia Tesla architecture, a thread block has 16 kB of shared memory visible to all threads of the block. All threads have access to the same global memory. Shared memory is much faster than global memory: when there are no bank conflicts, accessing shared memory is as fast as accessing a register. For comparison, access to global memory takes 400-600 cycles. It is possible to use shared memory only for the smallest test instances (small matrices).
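For illustration, a sketch of the standard tiled multiplication pattern follows; a 16 x 16 tile of int values for each input matrix occupies 2 * 16 * 16 * 4 B = 2 kB of the 16 kB of shared memory available per block. This is the textbook CUDA pattern, shown only to illustrate the shared-memory usage discussed above; it is not the kernel used in the experiments below.

#define TILE 16

/* Tiled matrix multiplication c = a * b for n x n row-major matrices
   (n assumed divisible by TILE for brevity) -- textbook pattern,
   not the experimental kernel of Section 6. */
__global__ void matrix_mult_tiled(const int *a, const int *b, int *c, int n)
{
    __shared__ int sa[TILE][TILE];   /* tile of a in fast shared memory */
    __shared__ int sb[TILE][TILE];   /* tile of b in fast shared memory */

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    int sum = 0;

    for (int t = 0; t < n / TILE; t++) {
        /* each thread loads one element of each tile from global memory */
        sa[threadIdx.y][threadIdx.x] = a[row * n + t * TILE + threadIdx.x];
        sb[threadIdx.y][threadIdx.x] = b[(t * TILE + threadIdx.y) * n + col];
        __syncthreads();
        for (int k = 0; k < TILE; k++)
            sum += sa[threadIdx.y][k] * sb[k][threadIdx.x];
        __syncthreads();
    }
    c[row * n + col] = sum;
}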

5 Computational experiments

There were two versions of the algorithm: sequential and parallel. Both were coded in C using the CUDA libraries and run on four nVidia devices:

1. nVidia GeForce 9600M GS with 32 streaming processors, installed in a Lenovo Y530 with an Intel Pentium Dual-Core 2 GHz CPU and 3 GB RAM under the 32-bit Windows Vista Home Premium operating system,

2. nVidia GeForce GTX 295 with 480 streaming processors, installed with an Intel Core2Duo 2.4 GHz and 2 GB RAM under the 32-bit Windows Vista Business operating system,

3. nVidia Tesla C870 GPU (512 GFLOPS) with 128 streaming processor cores, installed in a Hewlett-Packard server based on 2 Dual-Core 1 GHz AMD Opteron processors with 1 MB cache memory and 8 GB RAM, working under the 64-bit Linux Debian 5.0 operating system,

4. nVidia GeForce GTX480 with 480 streaming processors and 1610153984 bytes of total global memory, installed in the same computer set as the nVidia GeForce GTX295.

Table 1. Sequential runtimes (in ms) on nVidia GeForce 9600M GS

  n x n        min.     max.     average
  10 x 10       0,87     1,86      1,18
  20 x 20       5,88     8,29      6,62
  30 x 30      19,26    21,51     20,53
  40 x 40      45,90    48,46     46,39
  50 x 50      90,41    92,38     91,28
  100 x 100   715,72   722,64    718,09


Table 2. Sequential runtimes (in ms) on nVidia GeForce GTX295

  n x n        min.     max.     average
  10 x 10       0,65     2,54      0,98
  20 x 20       4,37    10,38      5,17
  30 x 30      15,32    17,05     16,01
  40 x 40      34,06    40,24     35,66
  50 x 50      67,59    72,40     68,68
  100 x 100   527,97   529,63    528,63

Table 3. Sequential runtimes (in ms) on nVidia GeForce GTX480

  n x n         min.       max.       average
  10 x 10       0,1356     0,2144     0,1496
  20 x 20       0,7235     0,8461     0,7506
  30 x 30       2,28889    2,417      2,3106
  40 x 40       5,3027     5,4392     5,3359
  50 x 50      10,2585    10,3362    10,2758
  100 x 100   118,0734   118,7849   118,1404

Tables 1-3 contain minimal, maximal and average times of sequential matrix multiplication for different matrix dimensions on three different GPUs.

Table 4. Parallel runtimes (in ms) on nVidia GeForce 9600M GS

  n x n        min.     max.     average
  10 x 10       0,18     1,23      0,67
  20 x 20       0,28     1,22      0,75
  30 x 30       0,39     1,32      0,84
  40 x 40       0,62     1,37      1,10
  50 x 50       0,98     1,95      1,52
  100 x 100     6,14     7,70      6,67
  200 x 200    46,18    49,40     47,48
  300 x 300   168,46   172,57    171,17
  400 x 400   355,81   366,84    363,08
  500 x 500   696,82   711,55    707,41

Table 5. Parallel runtimes (in ms) on nVidia GeForce GTX295

  n x n          min.     max.     average
  10 x 10         0,15     1,92      0,42
  20 x 20         0,12     2,91      0,45
  30 x 30         0,17     2,89      0,56
  40 x 40         0,17     2,02      0,49
  50 x 50         0,20     2,00      0,51
  100 x 100       0,43     2,29      0,75
  200 x 200       2,40     4,35      2,82
  300 x 300       8,16    10,05      9,11
  400 x 400      21,40    23,49     22,60
  500 x 500      50,36    53,60     52,11
  1000 x 1000   566,08   581,77    575,51

Table 6. Parallel runtimes (in ms) on nVidia GeForce GTX480

  n x n           min.      max.      average
  10 x 10          0,06      0,12       0,06
  20 x 20          0,06      0,17       0,07
  30 x 30          0,07      0,13       0,08
  40 x 40          0,07      0,17       0,08
  50 x 50          0,08      0,15       0,09
  100 x 100        0,16      0,22       0,17
  200 x 200        0,76      0,94       0,81
  300 x 300        2,42      2,72       2,52
  400 x 400        6,20      6,84       6,43
  500 x 500       13,90     14,58      14,20
  1000 x 1000    129,15    130,85     130,14
  2000 x 2000   1111,14   1126,97    1118,53
  3000 x 3000   3948,34   3963,22    3957,35


Table 7. Parallel runtimes (in ms) on nVidia Tesla C870 GPU

  n x n           min.      max.      average
  10 x 10          0,04      0,08       0,04
  20 x 20          0,06      0,10       0,06
  30 x 30          0,09      0,13       0,09
  40 x 40          0,12      0,22       0,12
  50 x 50          0,20      0,25       0,21
  100 x 100        1,13      1,19       1,16
  200 x 200       10,95     12,78      11,79
  300 x 300       29,42     30,21      29,74
  400 x 400       71,12     75,36      73,31
  500 x 500      134,33    138,24     136,55
  1000 x 1000   1128,36   1155,37    1137,37
  2000 x 2000   9057,08   9189,94    9129,85

Tables 4-7 contain minimal, maximal and average times of parallel matrix multiplication for different matrix dimensions on four different GPUs. Comparing the results from Tables 1-3 with the results of Tables 4-6 shows the merits of parallelization: the parallel algorithm is much faster than the sequential one on each GPU. These results are clearly visible in Table 8, which contains the average speedup (as a function of matrix dimension) for three nVidia cards: GeForce 9600M GS, GeForce GTX295 and GeForce GTX480.

Table 8. Speedup values for different matrix sizes

  n x n       9600M GS   GTX295   GTX480
  10 x 10        1,76      2,32     2,34
  20 x 20        8,82     11,54    11,41
  30 x 30       24,48     28,79    30,16
  40 x 40       42,05     72,93    65,41
  50 x 50       59,89    135,76   114,36
  100 x 100    107,68    702,52   690,28

As we can see, it was possible to obtain a high speedup, which means that the parallel algorithm works over 100 times faster than the sequential one on the 9600M GS, and about 700 times faster on both GTX-family cards (e.g., for 100 x 100 matrices on the GTX480, the average times in Tables 3 and 6 give 118,14 / 0,17 ≈ 695, close to the averaged per-run speedup of 690,28 in Table 8).

The results from Tables 4-7 and Table 8 are shown in graphic form in Figures 1 and 2.

Fig. 1. Speedup as a function of matrix size computed on nVidia GeForce 9600M GS, GTX295 and GTX480 (speedup on a logarithmic scale versus matrix dimension).

Fig. 2. Time of parallel matrix multiplication as a function of matrix dimension on nVidia GeForce 9600M GS, GTX295, GTX480 and Tesla C870 GPU (time in ms on a logarithmic scale).

6 GPU programming

The code was implemented in Microsoft Visual C++ with CUDA 3.0. The main code is presented below.

Streaming processors code:


__global__ void matrix_mult(int *a, int *b, int *c, int n, int m)
{
    /* 1-based coordinates of the result element handled by this thread */
    int idx = blockIdx.x * blockDim.x + threadIdx.x + 1;
    int idy = blockIdx.y * blockDim.y + threadIdx.y + 1;

    /* accumulate the scalar product of row idx of a and column idy of b;
       c is zeroed on the host before the kernel is launched */
    for (int k = 1; k <= n; k++)
        c[idx + m*idy] += a[m*(idx-1) + k] * b[m*(k-1) + idy];
}


Host code (memory management and kernel launch):

int N, M;
unsigned int mem_size;
int *devC, *hostC, *devB, *hostB, *hostA, *devA;

mem_size = sizeof(int) * (N+1) * (M+1);

hostC = (int*) malloc(mem_size);
hostB = (int*) malloc(mem_size);
hostA = (int*) malloc(mem_size);

cudaMalloc((void**) &devA, mem_size);
cudaMalloc((void**) &devC, mem_size);
cudaMalloc((void**) &devB, mem_size);

cudaMemcpy(devA, hostA, mem_size, cudaMemcpyHostToDevice);
cudaMemcpy(devB, hostB, mem_size, cudaMemcpyHostToDevice);

/* zero the result matrix on the host and copy it to the device */
memset(hostC, 0, mem_size);
cudaMemcpy(devC, hostC, mem_size, cudaMemcpyHostToDevice);

/* launch configuration as printed in the paper:
   one block of (N+1)*(M+1) threads */
matrix_mult<<< 1, (N+1)*(M+1) >>>(devA, devB, devC, N, M);

cudaMemcpy(hostC, devC, mem_size, cudaMemcpyDeviceToHost);

free(hostC); free(hostB); free(hostA);
cudaFree(devC); cudaFree(devB); cudaFree(devA);
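The listing above omits error handling for brevity; in practice every CUDA runtime call returns a status code that can be checked, for example with a small wrapper macro (our illustrative addition, using only the standard CUDA runtime API):

#include <stdio.h>
#include <stdlib.h>

/* Illustrative status check for CUDA runtime calls. */
#define CUDA_CHECK(call)                                        \
    do {                                                        \
        cudaError_t err = (call);                               \
        if (err != cudaSuccess) {                               \
            fprintf(stderr, "CUDA error: %s\n",                 \
                    cudaGetErrorString(err));                   \
            exit(EXIT_FAILURE);                                 \
        }                                                       \
    } while (0)

/* usage: CUDA_CHECK(cudaMemcpy(devA, hostA, mem_size,
                                cudaMemcpyHostToDevice)); */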

7 Concluding remarks

In this paper the most computationally complex part of the algorithm was identified, and on this basis we proposed a new algorithm, based on the original LMS, which takes advantage of the parallel approach. As we showed, this method is much faster than the existing ones. In future work we can investigate more complex algorithms from the field of digital signal processing, such as echo cancellation in telecommunication networks, for possible parallelization.

References

[1] J. Benesty, T. Gänsler, D.R. Morgan, M.M. Sondhi, S.L. Gay, Digital Signal Processing: Advances in Network and Acoustic Echo Cancellation, Springer, 2001.

[2] A. Perry, Fundamentals of Voice-Quality Engineering in Wireless Networks, Cambridge University Press, 2007.

[3] T.P. Zieliński, Digital signal processing. From theory to applications (in Polish), Wydawnictwa Komunikacji i Łączności, 2007.

[4] W. Bożejko, M. Walczyński, M. Wodecki, Application of beam-search algorithm based on fast Fourier transform for signal analysis (in Polish), Automatyka, Zeszyty Naukowe Politechniki Śląskiej, z. 150, Gliwice 2008, pp. 31-38.

[5] W. Bożejko, C. Smutnicki, M. Uchroński, Parallel calculating of the goal function in metaheuristics using GPU, in: G. Allen et al. (Eds.), ICCS 2009, Part I, LNCS 5544, Springer, 2009, pp. 1022-1031.
