8/13/2019 05764208

Efficiency Improvements For A Speech Recognition Coprocessor

Hui Geng2, Weiqian Liang1*, Ming Dong2, Runsheng Liu1

1Dept. of Electronic Engineering, Tsinghua Univ., Beijing 100084, China

2Beijing VoiceOn Technology Co. Ltd., Beijing 100084, China

Abstract: This paper proposes a revised MSAC (Multiplier Square Accumulate Calculation) coprocessor, called VPU3010, which calculates the Mahalanobis distance for ASR (Automatic Speech Recognition). We improved the address generating unit to provide frame-increment and resetting functions, and added a feedback signal as an output port, which the main processor can use as an interrupt trigger or a query signal. The design of VPU3010 was first verified on a Xilinx FPGA platform and then implemented in UMC 0.18 µm technology. Experiments show that the real-time factor of a system based on VPU3010 working at 50 MHz is 0.72, compared with 0.82 for MSAC at 50 MHz, which indicates a significant improvement in the efficiency of the C-program calling the coprocessor.

Key words: ASR; embedded system; ARM; co-processor

I. INTRODUCTION AND WORK REVIEW

Nowadays, most high-performance automatic speech recognition (ASR) systems are based on the Continuous Hidden Markov Model (CHMM) algorithm. Mobile devices with a speech interface, such as cell phones, personal digital assistants (PDA) and Pocket Personal Computers (PPC), are increasingly common. However, the CHMM algorithm is very complex, so traditional embedded systems, which are mainly based on General Purpose Processors (GPP), have to work at a very high frequency to meet the requirements of real-time computation [3][6]. On the other hand, speech recognition application-specific integrated circuits (ASIC), which are designed for special applications [6], offer high performance and low power at the price of a lack of flexibility [4][6]. Therefore, a hardware-software co-design approach was proposed for ASR systems in [1], [5], [7] and [8], and it achieved better performance than both GPP systems and dedicated hardware implementations.

A typical CHMM-based ASR system has three main steps: feature extraction; output probability calculation, which includes Mahalanobis distance calculation (MDC) and log-add calculation; and Viterbi decoding [1].

Both [7] and [8] showed that output probability calculation takes most of the execution time, and [1] further showed that Mahalanobis distance calculation accounts for most of the computation load. We therefore previously designed the first version of our speech recognition co-processor, MSAC, to calculate the Mahalanobis distance [1]. Notably, MSAC needed a processing time similar to that of the OAK DSP [11] even when the OAK DSP ran at five times the clock frequency of MSAC. However, after implementing the whole speech recognition algorithm, we found that the efficiency of the C-program calling MSAC is low, for reasons detailed in Section 2. Since, according to the CHMM algorithm, Mahalanobis distance calculation needs to be called frequently, it is necessary and meaningful to improve the efficiency of the C-program.

The rest of the paper is organized as follows. Section 2 summarizes the CHMM algorithm and analyzes the efficiency of the algorithm. Section 3 details the proposed improvements in VPU3010. Section 4 describes the implementation and experimental results of VPU3010. Finally, conclusions are drawn in Section 5.

*Corresponding author: Weiqian Liang, Assistant Professor, Email: [email protected]

II. C-PROGRAM EFFICIENCY ANALYSIS

The real-time algorithm in this paper processes a block of frames each time. According to [1] and [2], for most applications the Mahalanobis distance can be written as follows:

$$d_{jg}(o_t) = \sum_{d=1}^{M} \frac{(o_{td} - \mu_{jgd})^2}{\sigma_{jgd}^2} \qquad (1)$$

where M is the dimension of the observation vector, and $o_{td}$, $\mu_{jgd}$ and $\sigma_{jgd}^2$ are the d-th components of the feature vector, the mean vector and the (diagonal) covariance matrix, respectively, for the j-th state and the g-th mixture component.

In (1), the data needed by MDC include the feature $o_{td}$ and the model data $\mu_{jgd}$ and $\sigma_{jgd}^2$, which are both cached in the on-chip SRAM. However, according to the structure of MSAC in [1], MSAC needs the main processor to calculate the addresses for it in order to fetch the right data for MDC. The resulting flow chart of the C-program calling MSAC is shown in Fig. 1.

Fig. 1 Flow diagram of the C-program calling MSAC
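As a minimal sketch of what one MDC call computes, the following Python function evaluates Eq. (1) under the assumption of a diagonal covariance, so that each dimension contributes a squared difference divided by a variance. The function name `mahalanobis_distance` and its argument names are illustrative and do not come from the paper.

```python
def mahalanobis_distance(feature, mean, variance):
    """Sum over dimensions of (o_d - mu_d)^2 / sigma_d^2 (diagonal covariance assumed)."""
    if not (len(feature) == len(mean) == len(variance)):
        raise ValueError("feature, mean and variance must have the same dimension M")
    total = 0.0
    for o_d, mu_d, var_d in zip(feature, mean, variance):
        diff = o_d - mu_d
        total += diff * diff / var_d   # one multiply, square and accumulate step
    return total


# Example with M = 2: (1 - 0)^2 / 1 + (2 - 0)^2 / 4
distance = mahalanobis_distance([1.0, 2.0], [0.0, 0.0], [1.0, 4.0])
```

In the coprocessor, the three input vectors would be read from the feature, mean and variance blocks of the on-chip SRAM rather than passed as Python lists.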

978-1-61284-840-2/11/$26.00 ©2011 IEEE

In Fig. 1, t is the index of the HMM state taken as the present state, T is the number of HMM states, i is the index of the present frame, and F is the number of frames in a block. FSD is the starting address of the features stored in SRAM, MSD is the starting address of the means, and VSD is the starting address of the variances.

There are two main loops. The inner loop takes the index of the present speech frame as its loop variable and covers the N mixture components; the outer loop takes the current HMM state as its loop variable. Therefore, for an S-state, G-mixture HMM model and speech with F frames, the inner loop is executed F*S times and MSAC is called F*S*G times. The arithmetic operations used to calculate addresses in the inner loop are as follows: 1. for the first mixture, there are one multiply-add operation and two assignment operations; 2. for the second mixture, there are two addition operations; 3. for the third mixture and each one after it, there are two multiply-add operations. In total, this fragment of the C-program therefore takes [2*(G-2)+1]*F*S multiply-add, 2*F*S addition and 2*F*S assignment operations to generate addresses. Furthermore, ARMulator reports that these operations take 4, 2 and 2 internal core cycles respectively, so the calling function theoretically spends 4*(2*G-1)*F*S cycles on calculating addresses.
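The cycle count above can be checked by enumerating the loops directly. The sketch below is an illustration only: the function names are hypothetical, the per-operation cycle costs are the ones quoted in the text, and the closed form is assumed to apply for G >= 2 (with a single mixture the second-mixture additions never occur).

```python
# Cycle costs quoted in the text (measured with ARMulator).
MAC_CYCLES = 4      # one multiply-add
ADD_CYCLES = 2      # one addition
ASSIGN_CYCLES = 2   # one assignment


def address_cycles(G, S, F):
    """Count address-calculation cycles by walking the state/frame/mixture loops."""
    total = 0
    for _state in range(S):              # outer loop over HMM states
        for _frame in range(F):          # inner loop over frames in the block
            for g in range(1, G + 1):    # mixture components
                if g == 1:               # first mixture: 1 multiply-add, 2 assignments
                    total += MAC_CYCLES + 2 * ASSIGN_CYCLES
                elif g == 2:             # second mixture: 2 additions
                    total += 2 * ADD_CYCLES
                else:                    # third and later mixtures: 2 multiply-adds each
                    total += 2 * MAC_CYCLES
    return total


def address_cycles_closed_form(G, S, F):
    """Closed form from the text: 4*(2*G-1)*F*S cycles, assumed valid for G >= 2."""
    return 4 * (2 * G - 1) * F * S
```

For the 3-mixture, 358-state model used later in the experiments, both functions give 20 cycles per state and frame, that is 7160 cycles of address arithmetic for every frame in a block.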

On the other hand, the two loops are regular and have simple control, so they are suitable for hardware implementation. We therefore designed VPU3010, the improved version of MSAC presented in this paper, which is detailed in Section 3.

III. STRUCTURE OF VPU3010

3.1 Changes with the address generating unit

MSAC uses on-chip SRAM to store the data needed in MDC, and divides the SRAM into four blocks that store features, means, variances and results separately. The starting address of each block is given by an address stored in the register group, and four address generators produce the addresses for the feature block, mean block, variance block and result block respectively.

The initial values of the address generators come from the register group. When a read or write operation happens, an address generator automatically increases or decreases its present address by one. In other words, if the right starting address is provided to a generator, the generator can produce the successive addresses that are needed.

Table I shows how the SRAM addresses change in the inner loop. First, the co-processor processes the feature of the first speech frame for the first mixture, after which the feature, mean and variance addresses are all at the starting address + D (the dimension of the feature). Next, the feature address should go back to the starting address, while the mean and variance addresses should stay at the starting address + D, so that the first frame can be processed for the second mixture. The remaining mixtures are handled in the same way. For the second frame, however, the feature address should be the starting address + D and the mean and variance addresses should go back to their starting addresses; the addresses then change as they did for the previous frame.

According to the above procedure, the feature address generator must be able to execute a frame-increment function, and the mean and variance generators must be able to reset their addresses when needed. We use the architecture shown in Fig. 2 to implement these functions, which minimizes the changes to the original structure of MSAC. In this unit, control signals such as select and load are generated by the control unit, which is implemented as a finite state machine (FSM). The outputs of the address generators are selected and connected to the SRAM address ports to fetch the right data from the SRAM.

TABLE I. THE SRAM ADDRESS CHANGING IN THE INNER LOOP.

frame index   mixture index   starting address of present frame   starting address of present mean/variance
1             1               FSA                                 M/VSA
1             2               FSA                                 M/VSA + D
1             3               FSA                                 M/VSA + 2D
2             1               FSA + D                             M/VSA
2             2               FSA + D                             M/VSA + D
2             3               FSA + D                             M/VSA + 2D

FSA: starting address of feature block in SRAM

M/VSA: starting address of mean/variance block in SRAM
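The pattern in Table I amounts to two multiply-add expressions: the feature address advances by D per frame and the mean/variance address advances by D per mixture. The sketch below reproduces the table rows in software, which is the kind of arithmetic the main processor has to perform when the address generators cannot do it themselves. The function name, the tuple layout and the concrete numbers in the example are assumptions made for illustration.

```python
def table_i_addresses(fsa, mvsa, D, frames, mixtures):
    """Return (frame, mixture, feature_start, mean_variance_start) rows as in Table I."""
    rows = []
    for i in range(frames):              # frame index, 0-based internally
        for g in range(mixtures):        # mixture index, 0-based internally
            feature_start = fsa + i * D  # FSA + (frame - 1) * D
            mv_start = mvsa + g * D      # M/VSA + (mixture - 1) * D
            rows.append((i + 1, g + 1, feature_start, mv_start))
    return rows


# Two frames, three mixtures, 27-dimensional features, with made-up base addresses.
rows = table_i_addresses(fsa=0, mvsa=1000, D=27, frames=2, mixtures=3)
```

Each row costs one multiply-add per address when computed this way, which is the overhead that the address generating unit described in this section is meant to remove.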

VPU3010 implements the frame-increment function by adding one control bit. When this control bit is set, the present feature address, which is the output of the address generator, is loaded into the feature starting address register. When the resetting control bit is set, the new starting address stored in the register is loaded into the address generator. The feature address therefore increases or decreases by one from the new starting address on each access. That is, when the calculation of one frame's feature for one group of mixtures is finished, the feature address can move to the beginning address of the next frame automatically, without any arithmetic operations in the C-program.

Fig. 2 Block diagram of the proposed addressing unit.

Similarly, the resetting function is provided for the feature, mean, variance and result address generators through another four control bits. When these control bits are set to 1, the starting addresses stored in the starting address register group are loaded into the corresponding address generators as the basis of the next address, instead of the present address. Therefore, any address needed by the co-processor can be obtained just through different configurations of the control bits. The new C-program calling VPU3010 is shown in Fig. 3; it does not include any arithmetic operations in the loops, which significantly reduces the computation required.

Fig. 3 Flow diagram of the C-program calling VPU3010
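The behaviour of the frame-increment and resetting control bits can be illustrated with a small software model. This is a sketch under stated assumptions, not the VPU3010 design: the class `AddressGenerator`, its method names and the driver `run_block` are hypothetical, the generators only count upwards, and the mean and variance generators are merged into one because they move in lockstep.

```python
class AddressGenerator:
    """Toy model of one address generator with its starting-address register."""

    def __init__(self, start):
        self.start = start      # starting-address register
        self.current = start    # address presented to the SRAM

    def access(self):
        addr = self.current
        self.current += 1       # automatic increment on every access
        return addr

    def reset(self):
        # Resetting control bit: reload the generator from the start register.
        self.current = self.start

    def frame_increment(self):
        # Frame-increment control bit: latch the present address as the new start.
        self.start = self.current


def run_block(fsa, mvsa, D, frames, mixtures):
    """Trace the start addresses seen by each MDC call using control bits only."""
    feat = AddressGenerator(fsa)
    mv = AddressGenerator(mvsa)
    trace = []
    for i in range(frames):
        for g in range(mixtures):
            trace.append((i + 1, g + 1, feat.current, mv.current))
            for _ in range(D):              # one MDC reads D values from each block
                feat.access()
                mv.access()
            if g < mixtures - 1:
                feat.reset()                # same frame again for the next mixture
        feat.frame_increment()              # next frame starts where this one ended
        mv.reset()                          # back to the first mixture's parameters
    return trace
```

With `fsa=0`, `mvsa=1000`, `D=27`, two frames and three mixtures, the trace follows the rows of Table I even though the driver never adds or multiplies an address itself; it only decides which control bit to set.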

3.2 Feedback signal as interrupt trigger or query

The original version of MSAC [1] provides no feedback signal to the main processor, so a C-program based on the original MSAC has to execute a waiting procedure to ensure that MSAC has finished MDC. In tests with ARMulator, with MSAC and ARM both working at 50 MHz, 114 cycles are spent on waiting. Although this is close to the theoretical value of 94 cycles, it is still a waste, since the time could be used for other operations if an interrupt function were available. In addition, it is very inconvenient to set a suitable waiting time, because the number of cycles needed for MDC varies with the model and the feature dimension. Interrupt and query functions are therefore included in VPU3010. Once VPU3010 completes the MDC, the query bit of the control register is set to 1 under the control of the FSM. An output circuit is also attached to the control register, so that the main processor can determine whether VPU3010 has finished MDC by reading the control register. In addition, an interrupt and query pin is added to VPU3010, which makes it convenient to use this pin as an interrupt trigger.
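The difference between a fixed waiting procedure and querying a completion bit can be sketched with a toy model. Everything here is hypothetical: `MockCoprocessor` merely stands in for a device that sets a query bit when its work is done, and the cycle numbers in the usage note are borrowed from the text purely as illustrative inputs.

```python
class MockCoprocessor:
    """Hypothetical stand-in for a coprocessor that sets a query bit on completion."""

    def __init__(self):
        self.remaining = 0
        self.query_bit = 0

    def start(self, mdc_cycles):
        self.remaining = mdc_cycles
        self.query_bit = 0

    def tick(self):
        # Advance one clock cycle; raise the query bit when the work is finished.
        if self.remaining > 0:
            self.remaining -= 1
            if self.remaining == 0:
                self.query_bit = 1


def fixed_wait(cop, mdc_cycles, wait_cycles):
    """Wait a preset number of cycles, whether or not the work has finished."""
    cop.start(mdc_cycles)
    for _ in range(wait_cycles):
        cop.tick()
    return cop.query_bit == 1, wait_cycles


def poll(cop, mdc_cycles, max_cycles=10_000):
    """Check the query bit every cycle and stop as soon as it is set."""
    cop.start(mdc_cycles)
    waited = 0
    while not cop.query_bit and waited < max_cycles:
        cop.tick()
        waited += 1
    return cop.query_bit == 1, waited
```

With a 94-cycle job, `fixed_wait(cop, 94, 114)` finishes successfully but always spends 114 cycles, while `poll(cop, 94)` stops after 94. If the job grows to 120 cycles, the fixed wait of 114 returns before the result is ready, which is the tuning problem the query bit avoids.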

IV. EXPERIMENTS AND RESULTS

The design of VPU3010 was first verified on a Xilinx FPGA platform and then implemented in UMC 0.18 µm technology. The die area is about 2.89 mm², as shown in Fig. 4.

The experimental platform is based on the S3C-44B0x (ARM7) and VPU3010, as shown in Fig. 5. The test uses a 3-mixture, 358-state, 27-dimension HMM model and a 500-word list; the main processor, ARM7, works at 49 MHz with a bus bandwidth of 87 Mbps.

Fig. 4 VPU3010 under Microscope

Fig. 5 Testing board with ARM7 and VPU3010.

Table II shows the specification of VPU3010. These data were obtained from experiments under normal conditions at 25 °C. They show that VPU3010 offers high performance and low power consumption, which are desirable in embedded systems.

TABLE II. SPECIFICATION OF VPU3010.

Technology                  UMC 0.18 µm
IO Voltage                  3.3 V
Core Voltage                1.8 V
MAX frequency               150 MHz
Power consumption of Core   0.14 mW/MHz

TABLE III. PERFORMANCE COMPARISON AMONG DIFFERENT ASR SYSTEMS.

                                 software based   system based   system based
                                 ARM7             ARM7+MSAC      ARM7+VPU3010
sample                           22528            22528          22528
voice time (ms)                  2816             2816           2816
recognition time (ms)            7630             2297           2025
frequency of ARM7 (MHz)          49               49             49
frequency of coprocessor (MHz)   -                50             50
real-time factor                 2.71             0.82           0.72
word accuracy                    97.40%

Table III compares the performance of the different embedded speech recognition systems. According to the results in Table III, the software system based on ARM7 has the highest real-time factor, 2.71, and the software-hardware co-design systems are about 3.5 times faster than the pure software system, which shows the efficiency of software-hardware co-design. Between the two co-design systems, VPU3010 achieves a real-time factor of 0.72, compared with 0.82 for MSAC.
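The real-time factors in Table III can be reproduced as the ratio of recognition time to voice duration. The sketch below assumes that the voice duration of 2816 is expressed in the same unit as the recognition times (milliseconds), since that is the reading under which the reported factors come out; the variable and function names are illustrative.

```python
def real_time_factor(recognition_time, voice_time):
    """Recognition time divided by speech duration; below 1.0 means faster than real time."""
    return recognition_time / voice_time


VOICE_MS = 2816
RECOGNITION_MS = {
    "ARM7": 7630,
    "ARM7+MSAC": 2297,
    "ARM7+VPU3010": 2025,
}

# Real-time factor of each system, rounded as in Table III.
rtf = {name: round(real_time_factor(t, VOICE_MS), 2) for name, t in RECOGNITION_MS.items()}

# Speed-up of each system relative to the pure software system.
speedup = {name: RECOGNITION_MS["ARM7"] / t for name, t in RECOGNITION_MS.items()}
```

The speed-up over pure software works out to roughly 3.3 for ARM7+MSAC and 3.8 for ARM7+VPU3010, which brackets the figure of about 3.5 quoted in the text.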

V. CONCLUSION

VPU3010, proposed in this paper, shows the best performance among the three speech recognition systems while requiring only slightly more resources than MSAC [1].

In addition, it can easily be used on different platforms for diverse recognition tasks by changing its configuration, which is done by writing the control register.

However, this solution still has some shortcomings. The most important one is that a large number of data exchanges take place between the main processor and the coprocessor. With the low bus bandwidth available to VPU3010, this problem is difficult to solve, so our future work will propose a speech recognition System on Chip (SoC) for higher performance.

REFERENCES

[1] P. Lee, M. Dong, and W. Q. Liang, "Design of Speech Recognition Co-Processor for the Embedded Implementation," Proc. of IEEE EDSSC, pp. 1163-1166, 2007.

[2] Hui Geng, Weiqian Liang, and Ming Dong, "A speech recognition SoC based on ARM7-TDMI core and a MSAC co-processor," SOCC 2009, IEEE International, pp. 235-238.

[3] M. Yuan, T. Lee, P. C. Ching, and Y. Zhu, "Speech Recognition on DSP: Issues on Computational Efficiency and Performance Analysis," Proc. of IEEE ICCCAS, pp. 852-856, 2005.

[4] S. Yoshizawa, N. Wada, N. Hayasaka, and Y. Miyanaga, "Scalable architecture for word HMM-based speech recognition and VLSI implementation in complete system," IEEE Trans. on Circuits and Systems I, Vol. 53, No. 1, pp. 70-77, Jan. 2006.

[5] D. Chandra, U. Pazhayaveetil, and P. D. Franzon, "Architecture for Low Power Large Vocabulary Speech Recognition," in Proc. IEEE Int. SOC Conference, Sept. 2006, pp. 25-28.

[6] P. Li, H. Tang, and W. Q. Liang, "Low Power Embedded Speech Recognition System Based on A MCU and A Coprocessor," Proc. of IEEE ICASSP, 2009.

[7] H. Lim, K. You, and W. Sung, "Design and implementation of speech recognition on a softcore based FPGA," in Proceedings of ICASSP, pp. 1044-1047, 2006.

[8] Octavian Cheng, Waleed Abdulla, and Zoran Salcic, "Hardware-Software Co-design of Automatic Speech Recognition System for Embedded Real-time Applications," IEEE Transactions on Industrial Electronics, accepted for future publication.

[9] Zhu Xuan, Chen Yining, Liu Jia, and Liu Runsheng, "A novel efficient decoding algorithm for CDHMM-based speech recognizer on chip," ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings, v 2, p 293-296, 2003.

[10] ARM, Developer Suite AXD and armsd Debuggers Guide 5.5.8, ARM Inc., 2001.

[11] OakDSPCore Architecture Specification Revision 4.0, DSP Group Inc., Santa Clara, CA, DSPG Publication, 1998.
