Dev Days, Speech Recognition, LM Aubert
DESCRIPTION
Overview of Automatic Speech Recognition (ASR) for embedded devices
- Large vocabulary, continuous speech recognition
- Technical overview
- Potential applications
- Upcoming alternatives to embedded engines
Presented at DevDays, Belfast, UK, 24 April 2009. Louis-Marie Aubert, ECIT, Queen's University Belfast
TRANSCRIPT
![Page 1: Dev Days, Speech Recognition, LM Aubert](https://reader036.vdocuments.pub/reader036/viewer/2022081602/54b3b2704a7959f61e8b45a2/html5/thumbnails/1.jpg)
Speech Recognition on embedded devices
Louis-Marie Aubert
ECIT – Queen’s University Belfast
DevDays – Belfast – April 24, 2009
![Page 2: Dev Days, Speech Recognition, LM Aubert](https://reader036.vdocuments.pub/reader036/viewer/2022081602/54b3b2704a7959f61e8b45a2/html5/thumbnails/2.jpg)
What should we expect from speech recognition?
![Page 3: Dev Days, Speech Recognition, LM Aubert](https://reader036.vdocuments.pub/reader036/viewer/2022081602/54b3b2704a7959f61e8b45a2/html5/thumbnails/3.jpg)
Speech Recognition success?
• Natural continuous speech
• Real-time
• Large vocabulary (up to 100,000 words)
• No training (speaker independent)
• Adaptive to speaker accent
• Robust against
– Background noise
– Audio frontend imperfections
• N-best hypotheses with confidence value
![Page 4: Dev Days, Speech Recognition, LM Aubert](https://reader036.vdocuments.pub/reader036/viewer/2022081602/54b3b2704a7959f61e8b45a2/html5/thumbnails/4.jpg)
What are the solutions on the market?
![Page 5: Dev Days, Speech Recognition, LM Aubert](https://reader036.vdocuments.pub/reader036/viewer/2022081602/54b3b2704a7959f61e8b45a2/html5/thumbnails/5.jpg)
Existing solutions
• Server-based
– Telephony, IVR
– Dictation (health care industry)
– Audio indexing
Either offline or with significant delays
![Page 6: Dev Days, Speech Recognition, LM Aubert](https://reader036.vdocuments.pub/reader036/viewer/2022081602/54b3b2704a7959f61e8b45a2/html5/thumbnails/6.jpg)
Existing solutions
• Desktop-based
– Real-time dictation
– Language learning
Requires a good setup, powerful computer, quiet environment
Very good accuracy, no training required
![Page 7: Dev Days, Speech Recognition, LM Aubert](https://reader036.vdocuments.pub/reader036/viewer/2022081602/54b3b2704a7959f61e8b45a2/html5/thumbnails/7.jpg)
Existing solutions
• Embedded applications
– Simple voice commands (‘Call-mum’ type commands)
– Disconnected word recognition
Small vocabulary and lack of naturalness restrict the range of applications
![Page 8: Dev Days, Speech Recognition, LM Aubert](https://reader036.vdocuments.pub/reader036/viewer/2022081602/54b3b2704a7959f61e8b45a2/html5/thumbnails/8.jpg)
Is it so difficult?
![Page 9: Dev Days, Speech Recognition, LM Aubert](https://reader036.vdocuments.pub/reader036/viewer/2022081602/54b3b2704a7959f61e8b45a2/html5/thumbnails/9.jpg)
Technical challenge
Speech waveform → Speech Recognizer → Transcription: ‘Hello world’
![Page 10: Dev Days, Speech Recognition, LM Aubert](https://reader036.vdocuments.pub/reader036/viewer/2022081602/54b3b2704a7959f61e8b45a2/html5/thumbnails/10.jpg)
Technical challenge
Speech waveform → Spectral Analyser → Acoustic feature vectors (~40 coefficients every 10 ms)
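A minimal sketch of the framing arithmetic behind this step. It assumes a 16 kHz sample rate (not stated on the slide) and consecutive, non-overlapping 10 ms frames; the function names are illustrative, and the spectral analysis that turns each frame into about 40 coefficients is deliberately left out.

```python
SAMPLE_RATE_HZ = 16_000   # assumption: the slides do not give a sample rate
FRAME_MS = 10             # from the slide: one feature vector every 10 ms
COEFFS_PER_FRAME = 40     # from the slide: roughly 40 coefficients per vector


def split_into_frames(samples, sample_rate_hz=SAMPLE_RATE_HZ, frame_ms=FRAME_MS):
    """Cut a waveform into consecutive, non-overlapping frames of frame_ms."""
    frame_len = sample_rate_hz * frame_ms // 1000
    usable = len(samples) - len(samples) % frame_len  # drop a trailing partial frame
    return [samples[i:i + frame_len] for i in range(0, usable, frame_len)]


def feature_values_per_second(frame_ms=FRAME_MS, coeffs=COEFFS_PER_FRAME):
    """Number of feature values the recognizer has to consume per second."""
    return (1000 // frame_ms) * coeffs


one_second = [0.0] * SAMPLE_RATE_HZ      # one second of silence as a stand-in signal
frames = split_into_frames(one_second)
print(len(frames), len(frames[0]))       # 100 frames of 160 samples each
print(feature_values_per_second())       # 4000 feature values per second
```

Real front ends usually analyse overlapping windows rather than disjoint frames; the sketch ignores that and only illustrates the 10 ms rate.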
![Page 11: Dev Days, Speech Recognition, LM Aubert](https://reader036.vdocuments.pub/reader036/viewer/2022081602/54b3b2704a7959f61e8b45a2/html5/thumbnails/11.jpg)
Technical challenge
Acoustic feature vectors → Recognizer → Transcription: ‘Hello world’
Recognizer blocks: Senone calculation, Word Lexicon, Phoneme Lexicon, Acoustic Models, Viterbi decoding, Statistical Language Model
![Page 12: Dev Days, Speech Recognition, LM Aubert](https://reader036.vdocuments.pub/reader036/viewer/2022081602/54b3b2704a7959f61e8b45a2/html5/thumbnails/12.jpg)
![Page 13: Dev Days, Speech Recognition, LM Aubert](https://reader036.vdocuments.pub/reader036/viewer/2022081602/54b3b2704a7959f61e8b45a2/html5/thumbnails/13.jpg)
Technical challenge
Acoustic feature vectors → Recognizer → Transcription: ‘Hello world’
Recognizer blocks: Multi-dimensional Gaussian mixture calculation, Word Lexicon, Phoneme Lexicon, Acoustic Models, Viterbi decoding, Statistical Language Model
Acoustic Models
• 4000 acoustic models
• Sub-acoustic unit
• Functions that score 10 ms of speech
• Sets of 40-long mean and variance vectors for mixtures of 16 Gaussians
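The scoring function described on this slide can be sketched as a Gaussian mixture with one mean vector and one variance vector per component, that is, a diagonal covariance. The mixture weights and all function names below are assumptions for illustration; the slide only gives the sizes (4000 models, 16 Gaussians, 40-long vectors).

```python
import math

NUM_MODELS = 4000   # from the slide
NUM_MIXTURES = 16   # from the slide
DIM = 40            # from the slide


def log_gaussian_diag(x, mean, var):
    """Log density of a Gaussian with diagonal covariance."""
    total = 0.0
    for xi, mi, vi in zip(x, mean, var):
        total += math.log(2.0 * math.pi * vi) + (xi - mi) ** 2 / vi
    return -0.5 * total


def log_gmm_score(x, weights, means, variances):
    """Log-likelihood of one feature vector under a mixture (log-sum-exp)."""
    logs = [math.log(w) + log_gaussian_diag(x, m, v)
            for w, m, v in zip(weights, means, variances)]
    peak = max(logs)
    return peak + math.log(sum(math.exp(value - peak) for value in logs))


def parameter_count(models=NUM_MODELS, mixtures=NUM_MIXTURES, dim=DIM):
    """Means and variances only; mixture weights are not counted."""
    return models * mixtures * 2 * dim


# Toy check in two dimensions: one unit Gaussian evaluated at its mean.
score = log_gmm_score([0.0, 0.0], [1.0], [[0.0, 0.0]], [[1.0, 1.0]])
print(round(score, 4))       # -1.8379, i.e. -log(2*pi)
print(parameter_count())     # 5120000 mean and variance values in total
```

With the sizes from the slide, the means and variances alone amount to about 5 million stored values.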
![Page 14: Dev Days, Speech Recognition, LM Aubert](https://reader036.vdocuments.pub/reader036/viewer/2022081602/54b3b2704a7959f61e8b45a2/html5/thumbnails/14.jpg)
![Page 15: Dev Days, Speech Recognition, LM Aubert](https://reader036.vdocuments.pub/reader036/viewer/2022081602/54b3b2704a7959f61e8b45a2/html5/thumbnails/15.jpg)
![Page 16: Dev Days, Speech Recognition, LM Aubert](https://reader036.vdocuments.pub/reader036/viewer/2022081602/54b3b2704a7959f61e8b45a2/html5/thumbnails/16.jpg)
Technical challenge
Acoustic feature vectors → Recognizer → Transcription: ‘Hello world’
Recognizer blocks: Multi-dimensional Gaussian mixture calculation, Word Lexicon, Phoneme Lexicon, Senone Lexicon, Viterbi decoding, Statistical Language Model
Phoneme
• 50 in English
• Distinguishable sounds
• Each represents a sequence of senones: an HMM (Hidden Markov Model)
‘ah’: ah1 ah2 ah3
‘l’: l1 l2 l3
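One way to picture the phoneme lexicon is a table from each phoneme to its three states, using the two entries shown on the slide. The left-to-right topology with self-loops is a common HMM layout and is an assumption here, as are the names in the code.

```python
# Each phoneme maps to a short sequence of acoustic states, as on the slide.
PHONEME_TO_SENONES = {
    "ah": ["ah1", "ah2", "ah3"],
    "l": ["l1", "l2", "l3"],
}


def left_to_right_arcs(states):
    """Arcs of a left-to-right HMM: a self-loop on each state plus a step forward."""
    arcs = []
    for i, state in enumerate(states):
        arcs.append((state, state))              # stay in the same state
        if i + 1 < len(states):
            arcs.append((state, states[i + 1]))  # move on to the next state
    return arcs


for phoneme, states in PHONEME_TO_SENONES.items():
    print(phoneme, left_to_right_arcs(states))
```

The self-loops let a state absorb several consecutive 10 ms frames, which is how a three-state model can cover phonemes of different durations.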
![Page 17: Dev Days, Speech Recognition, LM Aubert](https://reader036.vdocuments.pub/reader036/viewer/2022081602/54b3b2704a7959f61e8b45a2/html5/thumbnails/17.jpg)
Technical challenge
Acoustic feature vectors → Recognizer → Transcription: ‘Hello world’
Recognizer blocks: Multi-dimensional Gaussian mixture calculation, Word Lexicon, Phoneme Lexicon, Senone Lexicon, Viterbi decoding, Statistical Language Model
Triphone
• 2500 in English
• Distinguishable sounds in their context, for continuous speech
‘hh-ah+l’: ah1 ah2 ah3
‘ah-l+ow’: l1 l2 l3
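The triphone labels on this slide follow a left-centre+right pattern. A small sketch of how a phoneme sequence could be expanded into such labels is shown here; the `sil` boundary symbol and the function name are assumptions, not taken from the slides.

```python
def to_triphones(phonemes, boundary="sil"):
    """Expand a phoneme sequence into left-centre+right triphone labels."""
    padded = [boundary] + list(phonemes) + [boundary]
    return [
        f"{padded[i - 1]}-{padded[i]}+{padded[i + 1]}"
        for i in range(1, len(padded) - 1)
    ]


print(to_triphones(["hh", "ah", "l", "ow"]))
# ['sil-hh+ah', 'hh-ah+l', 'ah-l+ow', 'l-ow+sil']
```

The two middle labels match the examples on the slide; the first and last depend on how word boundaries are handled, which the slide does not specify.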
![Page 18: Dev Days, Speech Recognition, LM Aubert](https://reader036.vdocuments.pub/reader036/viewer/2022081602/54b3b2704a7959f61e8b45a2/html5/thumbnails/18.jpg)
Technical challenge
Acoustic feature vectors → Recognizer → Transcription: ‘Hello world’
Recognizer blocks: Senone calculation, Word Lexicon, Triphone Lexicon, Acoustic Models, Viterbi decoding, Statistical Language Model
![Page 19: Dev Days, Speech Recognition, LM Aubert](https://reader036.vdocuments.pub/reader036/viewer/2022081602/54b3b2704a7959f61e8b45a2/html5/thumbnails/19.jpg)
![Page 20: Dev Days, Speech Recognition, LM Aubert](https://reader036.vdocuments.pub/reader036/viewer/2022081602/54b3b2704a7959f61e8b45a2/html5/thumbnails/20.jpg)
Technical challenge
Acoustic feature vectors → Recognizer → Transcription: ‘Hello world’
Recognizer blocks: Multi-dimensional Gaussian mixture calculation, Word Lexicon, Phoneme Lexicon, Senone Lexicon, Viterbi decoding, Statistical Language Model
Word
• Large vocabulary: 64,000 words
• Each represents a sequence of phonemes/triphones
‘hello’: hh ah l ow
‘world’: w er l d
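The word lexicon can be pictured as a pronunciation dictionary. The sketch below uses the two entries from the slide and assumes three HMM states per phoneme, as in the ‘ah’ and ‘l’ examples earlier in the talk; the function names are illustrative.

```python
LEXICON = {
    "hello": ["hh", "ah", "l", "ow"],
    "world": ["w", "er", "l", "d"],
}
STATES_PER_PHONEME = 3  # assumption: three states per phoneme


def pronounce(sentence):
    """Concatenate the phoneme sequences of every word in the sentence."""
    phonemes = []
    for word in sentence.lower().split():
        phonemes.extend(LEXICON[word])
    return phonemes


def state_count(sentence):
    """Number of HMM states traversed when the sentence is spelled out."""
    return len(pronounce(sentence)) * STATES_PER_PHONEME


print(pronounce("Hello world"))    # ['hh', 'ah', 'l', 'ow', 'w', 'er', 'l', 'd']
print(state_count("Hello world"))  # 24
```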
![Page 21: Dev Days, Speech Recognition, LM Aubert](https://reader036.vdocuments.pub/reader036/viewer/2022081602/54b3b2704a7959f61e8b45a2/html5/thumbnails/21.jpg)
![Page 22: Dev Days, Speech Recognition, LM Aubert](https://reader036.vdocuments.pub/reader036/viewer/2022081602/54b3b2704a7959f61e8b45a2/html5/thumbnails/22.jpg)
![Page 23: Dev Days, Speech Recognition, LM Aubert](https://reader036.vdocuments.pub/reader036/viewer/2022081602/54b3b2704a7959f61e8b45a2/html5/thumbnails/23.jpg)
Technical challenge
Acoustic feature vectors → Recognizer → Transcription: ‘Hello world’
Recognizer blocks: Multi-dimensional Gaussian mixture calculation, Word Lexicon, Phoneme Lexicon, Senone Lexicon, Viterbi decoding, Statistical Language Model
Statistical language model
• Bi-gram / tri-gram
• Gives the probability of a sequence of 2/3 words
• 64,000 words lead to roughly 10 million states / 50 million arcs
[Diagram: word graph with nodes hello, world, mum, dad and arc probabilities 0.2, 0.05, 0.3]
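A bigram model of this kind can be sketched as a table of conditional probabilities. The numbers below reuse the three values visible on the slide, but which arc carries which value is an assumption, and the floor for unseen pairs is a hypothetical placeholder for proper smoothing.

```python
import math

# Illustrative bigram probabilities P(next | previous).
# Assumption: the mapping of the slide's values to specific arcs is a guess.
BIGRAMS = {
    ("hello", "world"): 0.2,
    ("hello", "mum"): 0.05,
    ("hello", "dad"): 0.3,
}
UNSEEN_PROB = 1e-6  # hypothetical floor for word pairs missing from the table


def sentence_log_prob(words):
    """Sum of log bigram probabilities over consecutive word pairs."""
    return sum(
        math.log(BIGRAMS.get(pair, UNSEEN_PROB))
        for pair in zip(words, words[1:])
    )


def best_next_word(previous):
    """Most probable follower of a word according to the table."""
    candidates = {nxt: p for (prev, nxt), p in BIGRAMS.items() if prev == previous}
    return max(candidates, key=candidates.get)


print(sentence_log_prob(["hello", "world"]))  # log(0.2)
print(best_next_word("hello"))                # dad
```

A full bigram table over 64,000 words would have 64,000 × 64,000, about 4.1 billion, entries, so the roughly 50 million arcs quoted on the slide suggest that only a small fraction of word pairs is stored explicitly.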
![Page 24: Dev Days, Speech Recognition, LM Aubert](https://reader036.vdocuments.pub/reader036/viewer/2022081602/54b3b2704a7959f61e8b45a2/html5/thumbnails/24.jpg)
![Page 25: Dev Days, Speech Recognition, LM Aubert](https://reader036.vdocuments.pub/reader036/viewer/2022081602/54b3b2704a7959f61e8b45a2/html5/thumbnails/25.jpg)
Technical challenge
Acoustic feature vectors → Recognizer → Transcription: ‘Hello world’
Recognizer blocks: Senone calculation, Word Lexicon, Triphone Lexicon, Acoustic Models, Viterbi decoding, Statistical Language Model
~ 25 million states / 250 million arcs
![Page 26: Dev Days, Speech Recognition, LM Aubert](https://reader036.vdocuments.pub/reader036/viewer/2022081602/54b3b2704a7959f61e8b45a2/html5/thumbnails/26.jpg)
![Page 27: Dev Days, Speech Recognition, LM Aubert](https://reader036.vdocuments.pub/reader036/viewer/2022081602/54b3b2704a7959f61e8b45a2/html5/thumbnails/27.jpg)
Technical challenge
Acoustic feature vectors → Recognizer → Transcription: ‘Hello world’
Recognizer blocks: Multi-dimensional Gaussian mixture calculation, Word Lexicon, Triphone Lexicon, Senone Lexicon, Viterbi decoding, Statistical Language Model
~ 25 million states / 250 million arcs
Viterbi decoding
• Token passing algorithm
• 5,000 to 10,000 tokens to propagate every 10 ms
• Select the most promising tokens and output the associated sequence of: senones → triphones → words → sentence
[Diagram: network of three-state chains l1 l2 l3, s1 s2 s3, ow1 ow2 ow3, ey1 ey2 ey3, d1 d2 d3, v1 v2 v3]
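The token passing idea can be sketched on a toy network. This is a simplified illustration under stated assumptions, not the engine described in the talk: the three-state chain, the transition probabilities, the per-frame scores and the tiny beam of two tokens are all invented for the example.

```python
import math


def token_passing(frames, arcs, start_state, max_tokens=2):
    """Viterbi search by token passing with a simple top-N beam.

    frames: list of dicts mapping state -> acoustic log score for that frame
    arcs:   dict mapping state -> list of (next_state, transition log prob)
    Returns the best (log score, state path) after the last frame.
    """
    tokens = {start_state: (0.0, [])}  # state -> (score, path so far)
    for scores in frames:
        new_tokens = {}
        for state, (score, path) in tokens.items():
            for nxt, trans in arcs.get(state, []):
                if nxt not in scores:
                    continue
                cand = score + trans + scores[nxt]
                # Keep only the best token arriving in each state.
                if nxt not in new_tokens or cand > new_tokens[nxt][0]:
                    new_tokens[nxt] = (cand, path + [nxt])
        # Beam pruning: keep only the most promising tokens.
        ranked = sorted(new_tokens.items(), key=lambda kv: kv[1][0], reverse=True)
        tokens = dict(ranked[:max_tokens])
    return max(tokens.values(), key=lambda token: token[0])


HALF = math.log(0.5)
ARCS = {
    "<s>": [("l1", 0.0)],
    "l1": [("l1", HALF), ("l2", HALF)],
    "l2": [("l2", HALF), ("l3", HALF)],
    "l3": [("l3", 0.0)],
}
FRAMES = [
    {"l1": -1.0, "l2": -5.0, "l3": -5.0},
    {"l1": -4.0, "l2": -1.0, "l3": -5.0},
    {"l1": -5.0, "l2": -1.0, "l3": -4.0},
    {"l1": -5.0, "l2": -5.0, "l3": -1.0},
]

best_score, best_path = token_passing(FRAMES, ARCS, "<s>")
print(best_path)  # ['l1', 'l2', 'l2', 'l3']
```

In the setting described on the slide, the beam would hold thousands of tokens rather than two, and each surviving token would also record the triphones and words it has passed through.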
![Page 28: Dev Days, Speech Recognition, LM Aubert](https://reader036.vdocuments.pub/reader036/viewer/2022081602/54b3b2704a7959f61e8b45a2/html5/thumbnails/28.jpg)
![Page 29: Dev Days, Speech Recognition, LM Aubert](https://reader036.vdocuments.pub/reader036/viewer/2022081602/54b3b2704a7959f61e8b45a2/html5/thumbnails/29.jpg)
Challenges in embedded systems
• Low computational resources
• Power consumption constraints
• Noisy environment, poor audio quality
For a truly embedded speech recognition engine that works, we must move away from the pure software approach:
• Make the best of all hardware acceleration available
• Dedicated chip (accelerator) to unload the CPU and relax memory constraints
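A rough back-of-the-envelope sketch of why a pure software approach is demanding, using only figures quoted earlier in the talk (4000 acoustic models, 16 Gaussians each, 40-long vectors, one frame every 10 ms, up to 10,000 tokens). It assumes, pessimistically, that every model is scored on every frame; the function names are illustrative.

```python
MODELS = 4000              # acoustic models
MIXTURES = 16              # Gaussians per model
DIM = 40                   # coefficients per feature vector
FRAMES_PER_SECOND = 100    # one frame every 10 ms
TOKENS_PER_FRAME = 10_000  # upper figure quoted for Viterbi decoding


def gaussian_terms_per_second():
    """Per-dimension Gaussian terms if every model is scored on every frame."""
    return MODELS * MIXTURES * DIM * FRAMES_PER_SECOND


def token_updates_per_second():
    """Token propagations per second at the upper beam size."""
    return TOKENS_PER_FRAME * FRAMES_PER_SECOND


print(gaussian_terms_per_second())  # 256000000
print(token_updates_per_second())   # 1000000
```

An engine that scores only the models currently needed by active tokens would do less work than this upper bound.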
![Page 30: Dev Days, Speech Recognition, LM Aubert](https://reader036.vdocuments.pub/reader036/viewer/2022081602/54b3b2704a7959f61e8b45a2/html5/thumbnails/30.jpg)
Why do we want speech recognition on embedded devices anyway?
![Page 31: Dev Days, Speech Recognition, LM Aubert](https://reader036.vdocuments.pub/reader036/viewer/2022081602/54b3b2704a7959f61e8b45a2/html5/thumbnails/31.jpg)
Applications on mobiles
• Complement touch screen interface with speech interface
• Speech-enable existing mobile applications
– Browse complex menus
– Easily find items in large libraries, local or online (contacts, music…)
– Browse the Web and search maps
– Games
– Compose text messages, emails…
![Page 32: Dev Days, Speech Recognition, LM Aubert](https://reader036.vdocuments.pub/reader036/viewer/2022081602/54b3b2704a7959f61e8b45a2/html5/thumbnails/32.jpg)
Applications on mobiles
• Speech-enable mobile applications
Rubicon, "The Apple iPhone: Successes and Challenges for the Mobile Industry", 31 March 2008
![Page 33: Dev Days, Speech Recognition, LM Aubert](https://reader036.vdocuments.pub/reader036/viewer/2022081602/54b3b2704a7959f61e8b45a2/html5/thumbnails/33.jpg)
Applications on mobiles
• Key to safety when driving
– Text-messaging
– Satellite-navigation function
• Voice memo
– Shopping list
– Activity scheduler
• Market of speech technology in embedded devices
– $125 million in 2006
– $500 million in 2010
Opus Research report, March 2007
![Page 34: Dev Days, Speech Recognition, LM Aubert](https://reader036.vdocuments.pub/reader036/viewer/2022081602/54b3b2704a7959f61e8b45a2/html5/thumbnails/34.jpg)
Other markets
• Developing countries
– Access to information technology for illiterate people
  • Administrative tasks
  • Education
  • Social integration
• Health care at home (self-management of diseases)
– Exploding market
  • Chronic diseases
  • Elderly people (Baby Boomers reach retirement age)
  • Market for home health care products is valued at $4.3 billion today
– Place for speech recognition
  • Inexperience of patients with electronic interfaces
  • Poor physical condition (e.g. low vision)
  • Illiteracy
Medical device today, March 2009
![Page 35: Dev Days, Speech Recognition, LM Aubert](https://reader036.vdocuments.pub/reader036/viewer/2022081602/54b3b2704a7959f61e8b45a2/html5/thumbnails/35.jpg)
Other applications
• Speech translation
– IraqCom
![Page 36: Dev Days, Speech Recognition, LM Aubert](https://reader036.vdocuments.pub/reader036/viewer/2022081602/54b3b2704a7959f61e8b45a2/html5/thumbnails/36.jpg)
Okay, I can’t wait! Is there anything I can use now?
![Page 37: Dev Days, Speech Recognition, LM Aubert](https://reader036.vdocuments.pub/reader036/viewer/2022081602/54b3b2704a7959f61e8b45a2/html5/thumbnails/37.jpg)
Upcoming solutions
• Voicemail accessible via text message, email or dedicated application
– Server-based
– Requires agreement and implementation by the carriers
![Page 38: Dev Days, Speech Recognition, LM Aubert](https://reader036.vdocuments.pub/reader036/viewer/2022081602/54b3b2704a7959f61e8b45a2/html5/thumbnails/38.jpg)
Upcoming solutions
• Nuance Voice Control 2
– Online search
– Text-messaging
• Embedded software for simple voice command
• Server-based engine for large vocabulary speech recognition
• Speech Recognition API on Android 1.5
![Page 39: Dev Days, Speech Recognition, LM Aubert](https://reader036.vdocuments.pub/reader036/viewer/2022081602/54b3b2704a7959f61e8b45a2/html5/thumbnails/39.jpg)
So?
![Page 40: Dev Days, Speech Recognition, LM Aubert](https://reader036.vdocuments.pub/reader036/viewer/2022081602/54b3b2704a7959f61e8b45a2/html5/thumbnails/40.jpg)
Conclusion
• A truly embedded speech recognition system
– A range of exciting applications
  • Real-time dictation with no perceived delay
  • Natural language interface (ASR + TTS)
  • Applications independent of the carrier
– But… not available yet!
• New speech recognition APIs are arriving soon
– Rely on network/server availability
– Can still lead to innovative applications
![Page 41: Dev Days, Speech Recognition, LM Aubert](https://reader036.vdocuments.pub/reader036/viewer/2022081602/54b3b2704a7959f61e8b45a2/html5/thumbnails/41.jpg)
Conclusion
• Keys to success
– Robustness, accuracy
– Fast to load and execute
– Well-designed interface
  • Speech cannot be used on its own
  • Should be cleverly combined with other interfaces
    – Graphical
    – Touch
    – …
– Don’t put customers off with clumsy speech recognition widgets, again!
![Page 42: Dev Days, Speech Recognition, LM Aubert](https://reader036.vdocuments.pub/reader036/viewer/2022081602/54b3b2704a7959f61e8b45a2/html5/thumbnails/42.jpg)
Questions?