
    Efficiency Improvements For A Speech Recognition Coprocessor

    Hui Geng², Weiqian Liang¹*, Ming Dong², Runsheng Liu¹

    ¹Dept. of Electronic Engineering, Tsinghua Univ., Beijing 100084, China

    ²Beijing VoiceOn Technology Co. Ltd., Beijing 100084, China

    Abstract: This paper proposes a revised MSAC (Multiplier Square Accumulate Calculation) coprocessor, called VPU3010, which calculates the Mahalanobis distance for ASR (Automatic Speech Recognition). We improved the address generating unit to provide frame-increment and resetting functions, and added a feedback signal as an output port, which the main processor can use as an interrupt trigger or a query signal. The design of VPU3010 was first verified on a Xilinx FPGA platform and then implemented in UMC 0.18 µm technology. Experiments show that the real-time factor of a system based on VPU3010 working at 50 MHz is 0.72, compared with 0.82 for MSAC at 50 MHz, which indicates a significant improvement in the efficiency of the C-program calling the coprocessor.

    Key words: ASR; embedded system; ARM; co-processor

    I. INTRODUCTION AND WORK REVIEW

    Nowadays, most high-performance automatic speech recognition (ASR) systems are based on the Continuous Hidden Markov Model (CHMM) algorithm. Mobile devices with speech interfaces, such as cell phones, personal digital assistants (PDAs) and Pocket Personal Computers (PPCs), are increasingly common. However, the CHMM algorithm is very complex, so traditional embedded systems, which are mainly based on General Purpose Processors (GPPs), have to work at very high frequencies to meet the requirement of real-time computation [3][6]. On the other hand, speech recognition application-specific integrated circuits (ASICs), which are designed for special applications [6], achieve high performance and low power at the price of a lack of flexibility [4][6]. Therefore a hardware-software co-design approach was proposed for ASR systems in [1], [5], [7] and [8], and achieved better performance than both GPP systems and dedicated hardware implementations.

    In typical CHMM-based ASR systems there are three main steps: feature extraction, output probability calculation (including Mahalanobis distance calculation (MDC) and log-add calculation), and Viterbi decoding [1].

    Both [7] and [8] showed that output probability calculation takes most of the execution time, and [1] further showed that the Mahalanobis distance calculation accounts for most of the computational load. We therefore previously designed the first version of our speech recognition co-processor, MSAC, to calculate the Mahalanobis distance [1]. An encouraging result was that MSAC needed a processing time similar to that of the OAK DSP [11] even when the OAK DSP ran at 5 times the clock frequency of MSAC. However, after implementing the whole speech recognition algorithm, we found that the efficiency of the C-program calling MSAC is low, for reasons detailed in Section 2. Since the CHMM algorithm calls the Mahalanobis distance calculation frequently, it is necessary and meaningful to improve the efficiency of the C-program.

    *Corresponding author: Weiqian Liang, Assistant Professor, Email: [email protected]

    The rest of the paper is organized as follows. Section 2 summarizes the CHMM algorithm and analyzes the algorithm efficiency. Section 3 details the proposed improvements of VPU3010. Section 4 describes the implementation and experimental results of VPU3010. Finally, conclusions are drawn in Section 5.

    II. C-PROGRAM EFFICIENCY ANALYSIS

    The real-time algorithm in this paper processes a block of frames at a time. According to references [1] and [2], for most applications the Mahalanobis distance can be written as follows:

    $$D_{jg}(o_t) = \sum_{d=1}^{M} \frac{(o_{td} - \mu_{jgd})^2}{\sigma_{jgd}^2} \qquad (1)$$

    where M is the dimension of the observation vector, and $o_{td}$, $\mu_{jgd}$ and $\sigma_{jgd}^2$ are the d-th elements of the feature vector at frame t, the mean vector and the (diagonal) covariance matrix, respectively, for the jth state and the gth mixture component.

    In (1), the data needed by MDC comprise the feature $o_{td}$ and the model data $\mu_{jgd}$ and $\sigma_{jgd}^2$, which are all cached in the on-chip SRAM. However, given the structure of MSAC in [1], the MSAC needs the main processor to calculate the addresses for it in order to fetch the right data for MDC. The resulting flow chart of the C-program calling MSAC is shown in Fig. 1.

    Fig. 1 Flow diagram of the C-program calling MSAC
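As a minimal sketch of the calculation in (1), the following Python function accumulates the squared, variance-scaled difference over the M dimensions. It assumes a diagonal covariance, so that each dimension contributes one variance term; the function name and the toy values are illustrative and do not come from the paper.

```python
def mahalanobis_distance(feature, mean, variance):
    """Sum over dimensions of (o_d - mu_d)^2 / sigma_d^2 (diagonal covariance assumed)."""
    if not (len(feature) == len(mean) == len(variance)):
        raise ValueError("feature, mean and variance must share the same dimension M")
    total = 0.0
    for o, mu, var in zip(feature, mean, variance):
        diff = o - mu                # subtract the mean
        total += diff * diff / var   # square, scale by the variance, accumulate
    return total


# Toy 3-dimensional example (illustrative values only).
o_t = [1.0, 2.0, 3.0]
mu_jg = [0.0, 2.0, 1.0]
var_jg = [1.0, 4.0, 2.0]
distance = mahalanobis_distance(o_t, mu_jg, var_jg)
```

The loop body is the subtract, square and accumulate pattern that the coprocessor performs in hardware; the software side only has to supply the right operands, which is why address generation matters in the analysis that follows.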

    978-1-61284-840-2/11/$26.00 ©2011 IEEE


    In Fig. 1, t is the index of the HMM state taken as the present state, T is the number of HMM states, i is the index of the present frame, and F is the number of frames in a block. FSD is the starting address of the features stored in SRAM, MSD is the starting address of the means, and VSD is that of the variances.

    There are two main loops: the inner one takes the present speech frame index as its loop variable and iterates over the G mixture components; the outer one takes the current HMM state as its loop variable. Therefore, for an S-state, G-mixture HMM model and speech with F frames, the inner loop is executed F*S times and calls MSAC F*S*G times. The arithmetic operations used to calculate addresses in the inner loop are as follows: 1. for the first mixture, one multiply-add operation and two assignment operations; 2. for the second mixture, two addition operations; 3. for the third and each subsequent mixture, two multiply-add operations per mixture. In total, this fragment of the C-program therefore takes [2*(G-2)+1]*F*S multiply-add, 2*F*S addition and 2*F*S assignment operations to generate addresses. Furthermore, using ARMulator we found the internal core cycles of these operations to be 4, 2 and 2 cycles respectively, so the calling function theoretically spends 4*[2*(G-2)+1]*F*S + 2*2*F*S + 2*2*F*S = 4*(2*G-1)*F*S cycles on calculating addresses.
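The cycle count above can be checked with a short Python sketch that tallies the per-mixture operations and compares the total with the closed form 4*(2*G-1)*F*S. The function names are hypothetical, the block size F = 10 is an arbitrary illustrative value, and the closed form only holds for G >= 2, since the text assumes a second mixture exists.

```python
MULADD_CYCLES = 4   # core cycles per multiply-add, as quoted in the text
ADD_CYCLES = 2      # core cycles per addition
ASSIGN_CYCLES = 2   # core cycles per assignment


def address_ops_per_inner_iteration(num_mixtures):
    """Return (multiply-adds, additions, assignments) for one pass of the inner loop."""
    muladds = additions = assignments = 0
    for g in range(1, num_mixtures + 1):
        if g == 1:
            muladds += 1        # first mixture: one multiply-add ...
            assignments += 2    # ... and two assignments
        elif g == 2:
            additions += 2      # second mixture: two additions
        else:
            muladds += 2        # third mixture onwards: two multiply-adds each
    return muladds, additions, assignments


def address_cycles(num_mixtures, num_frames, num_states):
    """Total core cycles spent on address calculation for F frames and S states."""
    muladds, additions, assignments = address_ops_per_inner_iteration(num_mixtures)
    per_iteration = (muladds * MULADD_CYCLES
                     + additions * ADD_CYCLES
                     + assignments * ASSIGN_CYCLES)
    return per_iteration * num_frames * num_states


# G = 3 mixtures and S = 358 states; F = 10 frames per block is an assumed value.
example_cycles = address_cycles(num_mixtures=3, num_frames=10, num_states=358)
```

With these inputs each inner-loop pass costs 20 cycles of pure address arithmetic, which is the overhead the hardware changes are meant to remove.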

    On the other hand, the two loops are regular and have simple control, so they are well suited to hardware implementation. We therefore designed VPU3010, the improved version of MSAC presented in this paper, which is detailed in Section 3.

    III. STRUCTURE OF VPU3010

    3.1 Changes to the address generating unit

    MSAC uses on-chip SRAM to store the data needed in MDC, and divides the SRAM into four blocks that store features, means, variances and results separately. The starting address of each block is held in a register group, and four address generators produce the addresses for the feature, mean, variance and result blocks respectively.

    The initial values of the address generators come from the register group. When a read or write operation happens, an address generator automatically increases or decreases the present address by one. In other words, if the generators are given the right starting addresses, they can generate the successive addresses needed.

    Table I shows how the SRAM addresses change in the inner loop. First, the co-processor processes the speech feature of the first frame for the first mixture, after which the feature, mean and variance addresses are all at their starting address + D (the feature dimension). Next, the feature address should go back to its starting address, while the mean and variance addresses should stay at starting address + D, in order to process the first frame for the second mixture. The remaining mixtures are handled in the same way. For the second frame, however, the feature address should be the starting address + D and the mean and variance addresses should go back to their starting addresses; the addresses then change as they did for the previous frame.

    This procedure requires that the feature address generator can execute a frame-increment function, and that the mean and variance generators can reset their addresses when needed. We adopt the architecture shown in Fig. 2 to implement these functions, which minimizes the changes to the original structure of MSAC. In this unit, control signals such as select and load are generated by the control unit, which is implemented as a finite state machine (FSM). The outputs of the address generators are selected and connected to the SRAM address ports to fetch the right data from SRAM.

    TABLE I. THE SRAM ADDRESS CHANGING IN THE INNER LOOP.

    frame index   mixture index   starting address of present frame   starting address of present mean/variance
    1             1               FSA                                 M/VSA
    1             2               FSA                                 M/VSA + D
    1             3               FSA                                 M/VSA + 2D
    2             1               FSA + D                             M/VSA
    2             2               FSA + D                             M/VSA + D
    2             3               FSA + D                             M/VSA + 2D

    FSA: starting address of feature block in SRAM
    M/VSA: starting address of mean/variance block in SRAM
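The pattern in Table I can be reproduced by a small Python sketch of the address arithmetic that the main processor would otherwise perform in software. The function name and the concrete values of FSA, M/VSA and D below are illustrative assumptions, not taken from the paper.

```python
def table_one_rows(num_frames, num_mixtures, fsa, mvsa, dim):
    """Return (frame, mixture, feature start, mean/variance start) tuples as in Table I."""
    rows = []
    for frame in range(1, num_frames + 1):
        for mixture in range(1, num_mixtures + 1):
            feature_start = fsa + (frame - 1) * dim    # advances once per frame
            model_start = mvsa + (mixture - 1) * dim   # advances once per mixture
            rows.append((frame, mixture, feature_start, model_start))
    return rows


# Two frames and three mixtures, with assumed FSA = 0, M/VSA = 100 and D = 27.
rows = table_one_rows(num_frames=2, num_mixtures=3, fsa=0, mvsa=100, dim=27)
```

Each row needs a multiplication and an addition per block when computed this way, which is the per-call cost that the frame-increment and resetting functions are intended to eliminate.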

    VPU3010 implements the frame-increment function by adding one control bit. When this control bit is set, the present feature address, which is the output of the address generator, is loaded into the feature starting-address register. When the resetting-function control bit is set, the new starting address stored in the register is loaded into the address generator, so the present feature address increases or decreases by one from the new starting address on each access. That means that when the calculation of one frame's feature for one group of mixtures is finished, the feature address can move to the beginning of the next frame automatically, without any arithmetic operations in the C-program.

    Fig. 2 Block diagram of the proposed addressing unit.

    Similarly, the resetting function is provided for the feature, mean, variance and result address generators through another four control bits. When these control bits are set to 1, the starting addresses stored in the starting-address register group are loaded into the corresponding address generators as the basis of the next address, instead of the present address. Therefore, any address needed by the co-processor can be obtained just through different configurations of the control bits. The new C-program calling VPU3010 is shown in Fig. 3; it does not include any arithmetic operations in the loops, which significantly reduces the computation spent on address generation.
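The behaviour of the frame-increment and resetting control bits can be illustrated with a toy software model. This is a sketch under stated assumptions, not the VPU3010 register interface: the class, its method names and the loop ordering are hypothetical, only the feature and mean generators are modelled (the variance generator would follow the mean generator), and addresses are assumed to increase by one per access.

```python
class AddressGenerator:
    """Toy model of one address generator with its starting-address register."""

    def __init__(self, start):
        self.start = start      # starting-address register
        self.current = start    # present address driven to the SRAM

    def access(self):
        """Return the present address, then auto-increment by one."""
        addr = self.current
        self.current += 1
        return addr

    def reset(self):
        """Resetting control bit: reload the generator from the start register."""
        self.current = self.start

    def frame_increment(self):
        """Frame-increment control bit: latch the present address as the new start."""
        self.start = self.current


def run_block(num_frames, num_mixtures, dim, fsa, msa):
    """Record the (feature, mean) address at the start of every MDC call."""
    feature = AddressGenerator(fsa)
    mean = AddressGenerator(msa)
    trace = []
    for _frame in range(num_frames):
        mean.reset()                      # back to the first mixture
        for _mixture in range(num_mixtures):
            feature.reset()               # same frame again for the next mixture
            trace.append((feature.current, mean.current))
            for _ in range(dim):          # one MDC call reads D entries from each block
                feature.access()
                mean.access()
        feature.frame_increment()         # next frame starts where this one ended
    return trace


trace = run_block(num_frames=2, num_mixtures=3, dim=27, fsa=0, msa=100)
```

In this model the caller only toggles control actions and never computes an address, yet the recorded starting addresses follow the same pattern as Table I.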

    Fig. 3 Flow diagram of the C-program calling VPU3010

    3.2 Feedback signal as interrupt trigger or query

    In the original version of MSAC [1] there is no feedback signal to the main processor, so a C-program based on the original MSAC has to execute a waiting procedure to ensure that MSAC has finished the MDC. In tests with ARMulator, with MSAC and the ARM both working at 50 MHz, 114 cycles are spent on waiting. Although this is roughly in line with the theoretical value of 94 cycles, it is still a waste, since those cycles could be used for other operations if an interrupt function were available. Besides that, it is very inconvenient to set a suitable waiting time, because the number of cycles for MDC varies with the model and the feature dimension. Therefore interrupt and query functions are included in VPU3010. Once VPU3010 completes the MDC, the query bit of the control register is set to 1 under control of the FSM. An output circuit is also attached to the control register, so by reading the control register the main processor can determine whether VPU3010 has finished the MDC. In addition, an interrupt-and-query pin is added to VPU3010, which can conveniently be used as an interrupt trigger.
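The difference between a fixed waiting time and polling a query bit can be sketched with a toy model. Everything here is hypothetical: the bit position, the class and the one-tick-per-poll timing are simplifying assumptions rather than the actual VPU3010 control register layout, and a real poll would itself cost bus cycles.

```python
QUERY_BIT = 0x1  # hypothetical bit position of the "MDC finished" flag


class CoprocessorModel:
    """Toy model: the query bit is set mdc_cycles ticks after start()."""

    def __init__(self, mdc_cycles):
        if mdc_cycles <= 0:
            raise ValueError("mdc_cycles must be positive")
        self.mdc_cycles = mdc_cycles
        self.remaining = 0
        self.control = 0

    def start(self):
        self.control &= ~QUERY_BIT          # clear the flag for the new calculation
        self.remaining = self.mdc_cycles

    def tick(self):
        """Advance one clock cycle; set the query bit when the MDC completes."""
        if self.remaining > 0:
            self.remaining -= 1
            if self.remaining == 0:
                self.control |= QUERY_BIT

    def read_control(self):
        return self.control


def wait_by_polling(coproc):
    """Poll the control register and return the number of cycles spent waiting."""
    waited = 0
    while not (coproc.read_control() & QUERY_BIT):
        coproc.tick()
        waited += 1
    return waited


coproc = CoprocessorModel(mdc_cycles=94)    # theoretical MDC length quoted in the text
coproc.start()
polled_wait = wait_by_polling(coproc)
fixed_wait = 114                            # fixed waiting time quoted in the text
wasted_cycles = fixed_wait - polled_wait
```

Polling adapts to however long the MDC takes for a given model and feature dimension, whereas a fixed delay has to be tuned for the worst case; an interrupt would additionally free the main processor to do other work while it waits.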

    IV. EXPERIMENTS AND RESULTS

    First, the design of VPU3010 was verified on a Xilinx FPGA platform, and then we implemented it in UMC 0.18 µm technology. The die area is about 2.89 mm², as shown in Fig. 4.

    The experimental platform is based on the S3C-44B0x (ARM7) and VPU3010, as shown in Fig. 5. The test uses a 3-mixture, 358-state, 27-dimension HMM model and a 500-word list; the main processor ARM7 works at 49 MHz with a bus bandwidth of 87 Mbps.

    Fig. 4 VPU3010 under Microscope

    Fig. 5 Testing board with ARM7 and VPU3010.

    Table II shows the specification of VPU3010; these data were measured in experiments under normal conditions (25 °C). They show that VPU3010 offers high performance and low power consumption, which is desirable for embedded systems.

    TABLE II. SPECIFICATION OF VPU3010.

    Technology                  UMC 0.18 µm
    IO Voltage                  3.3 V
    Core Voltage                1.8 V
    MAX frequency               150 MHz
    Power consumption of Core   0.14 mW/MHz
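As a rough worked example of the per-MHz figure in Table II, and assuming that core power scales linearly with clock frequency (an assumption, not a measurement reported in the paper), the core power at the 50 MHz operating point and at the 150 MHz maximum frequency would be approximately:

$$P_{\text{core}}(50\ \text{MHz}) \approx 0.14\ \tfrac{\text{mW}}{\text{MHz}} \times 50\ \text{MHz} = 7\ \text{mW}$$

$$P_{\text{core}}(150\ \text{MHz}) \approx 0.14\ \tfrac{\text{mW}}{\text{MHz}} \times 150\ \text{MHz} = 21\ \text{mW}$$

These figures cover the core only and exclude IO power.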

    TABLE III. PERFORMANCE COMPARISON AMONG DIFFERENT ASR SYSTEMS.

                                     software based   system based   system based
                                     ARM7             ARM7+MSAC      ARM7+VPU3010
    sample                           22528            22528          22528
    voice time (ms)                  2816             2816           2816
    recognition time (ms)            7630             2297           2025
    frequency of ARM7 (MHz)          49               49             49
    frequency of coprocessor (MHz)   -                50             50
    real-time factor                 2.71             0.82           0.72

    word accuracy: 97.40%


    Table III shows the performance comparison among different embedded speech recognition systems. According to the results in Table III, the software system based on ARM7 has the highest real-time factor, 2.71, and the software-hardware co-design systems are about 3.5 times faster than the pure software system, which shows the efficiency of software-hardware co-design. Comparing the MSAC and VPU3010 systems, VPU3010 achieves a real-time factor of 0.72 compared with 0.82 for MSAC.
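The real-time factors quoted above can be reproduced from the recognition times in Table III. The sketch below reads the voice time as 2816 ms, which is an assumption: it is the reading under which recognition time divided by voice time matches the reported factors. The dictionary keys and function name are illustrative.

```python
VOICE_TIME_MS = 2816  # voice time, read as milliseconds (assumption, see lead-in)

recognition_ms = {
    "ARM7 software": 7630,
    "ARM7 + MSAC": 2297,
    "ARM7 + VPU3010": 2025,
}


def real_time_factor(recognition_time_ms, voice_time_ms):
    """Real-time factor: processing time divided by the duration of the speech."""
    return recognition_time_ms / voice_time_ms


factors = {name: round(real_time_factor(ms, VOICE_TIME_MS), 2)
           for name, ms in recognition_ms.items()}

# Speed-up of each co-design system over the pure software system.
speedup_msac = round(recognition_ms["ARM7 software"] / recognition_ms["ARM7 + MSAC"], 2)
speedup_vpu = round(recognition_ms["ARM7 software"] / recognition_ms["ARM7 + VPU3010"], 2)
```

The two speed-ups bracket the "about 3.5 times" figure in the text, and a factor below 1 means the system recognizes speech faster than it is spoken.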

    V. CONCLUSION

    The VPU3010 proposed in this paper shows the best performance among the three speech recognition systems, at the cost of only a small increase in resources compared with MSAC [1].

    Besides that, it can easily be used on different platforms for diverse recognition tasks by changing its configuration, which is done by writing the control register.

    However, this solution still has some shortcomings. The most important one is that a large number of data exchanges happen between the main processor and the coprocessor. With the low bus bandwidth available to VPU3010 this problem is difficult to solve, so our future work will propose a speech recognition System on Chip (SoC) for higher performance.

    REFERENCES

    [1] P. Lee, M. Dong, and W. Q. Liang, "Design of Speech Recognition Co-Processor for the Embedded Implementation," Proc. of IEEE EDSSC, pp. 1163-1166, 2007.

    [2] Hui Geng, Weiqian Liang, Ming Dong, "A speech recognition SoC based on ARM7-TDMI core and a MSAC co-processor," SOCC 2009, IEEE International, pp. 235-238.

    [3] M. Yuan, T. Lee, P. C. Ching, Y. Zhu, "Speech Recognition on DSP: Issues on Computational Efficiency and Performance Analysis," Proc. of IEEE ICCCAS, pp. 852-856, 2005.

    [4] S. Yoshizawa, N. Wada, N. Hayasaka, Y. Miyanaga, "Scalable architecture for word HMM-based speech recognition and VLSI implementation in complete system," IEEE Trans. on Circuits and Systems I, Vol. 53, No. 1, pp. 70-77, Jan 2006.

    [5] D. Chandra, U. Pazhayaveetil, P. D. Franzon, "Architecture for Low Power Large Vocabulary Speech Recognition," in Proc. IEEE Int. SOC Conference, Sept. 2006, pp. 25-28.

    [6] P. Li, H. Tang, and W. Q. Liang, "Low Power Embedded Speech Recognition System Based on A MCU and A Coprocessor," Proc. of IEEE ICASSP, 2009.

    [7] H. Lim, K. You, and W. Sung, "Design and implementation of speech recognition on a softcore based FPGA," in Proceedings of ICASSP, pp. 1044-1047, 2006.

    [8] Octavian Cheng, Waleed Abdulla, Zoran Salcic, "Hardware-Software Co-design of Automatic Speech Recognition System for Embedded Real-time Applications," Industrial Electronics, IEEE Transactions on, accepted for future publication.

    [9] Zhu Xuan, Chen Yining, Liu Jia, Liu Runsheng, "A novel efficient decoding algorithm for CDHMM-based speech recognizer on chip," ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings, v 2, p 293-296, 2003.

    [10] ARM, "Developer Suite AXD and armsd Debuggers Guide 5.5.8.," ARM Inc., 2001.

    [11] "OakDSPCore Architecture Specification Revision 4.0," DSP Group Inc., Santa Clara, CA, DSPG Publication, 1998.