NETtalk: A parallel network that learns to read aloud
Korea Maritime and Ocean University NLP
Jung Tae LEE [email protected]
Terrence J. Sejnowski and Charles R. Rosenberg (1986)
01 Introduction
02 Network Architecture
03 Performance
04 Summary
`
01 Introduction of NETtalk
1. Introduction of NETtalk
NETtalk
A method for converting text to speech (TTS).
An automated learning procedure for a parallel network of deterministic processing units.
The conventional approach converts text by applying phonological rules and handling exceptions with a look-up table.
After training, the network achieves good performance and generalizes to novel words.
`
Characteristics of TTS in English
English is among the most difficult languages to read aloud.
Speech sounds have exceptions that are often context-sensitive.
- Ex) the "a" in almost all words ending in "ave", such as "brave" and "gave", is a long vowel, but not in "have"; and some words can vary in pronunciation with their syntactic role.
This is the problem with the conventional approach.
`
DECtalk: the commercial product DECtalk used two methods for converting text to phonemes.
1. A word is first looked up in a pronunciation dictionary of common words; if it is not found there, a set of phonological rules is applied. (Novel words are therefore not always correctly pronounced.)
2. An alternative approach is based on massively-parallel network models. Knowledge in these models is distributed over many processing units, and decisions are made by the exchange of information between the processing units.
`
In this paper:
A network learning algorithm with three layers of units.
NETtalk can be trained on any dialect of any language.
It demonstrates that a relatively small network can capture most of the significant regularities in English pronunciation, as well as absorb many of the irregularities.
`
2. Network Architecture
02 Network Architecture
Processing Unit
The network is composed of processing units that non-linearly transform their summed, continuous-valued inputs.
The connection strength, or weight, linking one unit to another can be a positive or negative real value.
Processing Unit
The output of the ith unit is determined by first summing all of its inputs:
E_i = Σ_j w_ij s_j

where w_ij is the weight from the jth to the ith unit, and then applying a sigmoidal transformation:

s_i = P(E_i) = 1 / (1 + e^(−E_i))
A positive weight represents an excitatory influence, and a negative weight an inhibitory influence, of the first unit on the output of the second unit.
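The summed-input-plus-sigmoid unit above can be sketched minimally in Python (function name is illustrative, not from the paper):

```python
import math

def unit_output(inputs, weights):
    """Output of one processing unit: sigmoid of the weighted input sum."""
    # E_i = sum_j w_ij * s_j
    total = sum(w * s for w, s in zip(weights, inputs))
    # s_i = P(E_i) = 1 / (1 + e^(-E_i))
    return 1.0 / (1.0 + math.exp(-total))
```

For a zero net input the unit outputs 0.5; large positive net inputs push it toward 1, large negative ones toward 0.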
NETtalk is hierarchically arranged into three layers of units
Representations of Letters and Phonemes
There are seven groups of units in the input layer.
- Each input group encodes one letter of the input text.
- Seven letters are presented to the input units at any one time.
There is one group of units in each of the other two layers.
- The desired output of the network is the correct phoneme, or contrastive speech sound, associated with the center (fourth) letter.
- The letters other than the center letter provide a partial context for this decision.
- The text is stepped through the window letter-by-letter. At each step, the network computes a phoneme, and after each word the weights are adjusted according to how closely the computed pronunciation matches the correct one.
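The letter-by-letter stepping of the seven-letter window can be sketched as follows (padding character and function name are illustrative assumptions):

```python
def seven_letter_windows(text, pad="_"):
    """Step a 7-letter window through the text; the 4th (center) letter
    is the one whose phoneme the network must predict."""
    padded = pad * 3 + text + pad * 3
    for i in range(len(text)):
        window = padded[i:i + 7]
        yield window, window[3]  # (context window, center letter)
```

Each position of the text yields one training pattern: the full window as context plus the center letter whose phoneme is the target.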
The letters are represented by one unit per letter of the alphabet, plus an additional 3 units to encode punctuation and word boundaries.
The phonemes are represented in terms of 23 articulatory features, such as point of articulation, voicing, vowel height, and so on.
Three additional units encode stress and syllable boundaries.
The goal of the learning algorithm is to adjust the weights between the units in the network in order to make the hidden units good feature detectors.
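The letter encoding above amounts to a one-hot vector per input group. A minimal sketch (the choice of which 3 punctuation/boundary symbols to use is an illustrative assumption):

```python
ALPHABET = "abcdefghijklmnopqrstuvwxyz"
EXTRAS = [" ", ".", ","]  # word boundary and punctuation (illustrative choice)
SYMBOLS = list(ALPHABET) + EXTRAS  # 26 + 3 = 29 units per input group

def encode_letter(ch):
    """One-hot vector: exactly one unit active per letter group."""
    vec = [0.0] * len(SYMBOLS)
    vec[SYMBOLS.index(ch)] = 1.0
    return vec
```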
Learning Algorithm
Two texts were used to train the network:
- Phonetic transcriptions from the informal, continuous speech of a child
- A 20,012-word corpus from a dictionary
A subset of 1,000 words was chosen from this dictionary, taken from the Brown corpus of the most common words in English.
Letters and phonemes were aligned like this: "phone" - /f-on-/
Training proceeds according to the discrepancy between the desired and actual values of the output units.
This error was "back-propagated" from the output to the input layer.
Each weight is adjusted to minimize its contribution to the total mean square error of this discrepancy.
Briefly, the weights were updated according to:

Δw_ij(t+1) = ε δ_i s_j + α Δw_ij(t)

where w_ij is the weight from the jth unit in layer n to the ith unit in layer n + 1, α smooths the gradient by over-relaxation, and ε is the learning rate.
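The weight update with over-relaxation (momentum) can be sketched per weight as follows (the learning-rate and momentum values are illustrative, not the paper's):

```python
def update_weight(delta_i, s_j, prev_dw, lr=0.1, momentum=0.9):
    """One weight change: dw = lr * delta_i * s_j + momentum * prev_dw,
    where delta_i is the unit's error signal and s_j the source activity."""
    return lr * delta_i * s_j + momentum * prev_dw
```

The momentum term reuses the previous step's weight change, smoothing the gradient across successive patterns.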
For the output units the error signal is

δ_i = P'(E_i)(s_i* − s_i)

and the differences are recursively back-propagated to lower layers, where P'(E) is the first derivative of P(E), s_i* is the desired value of the ith unit in the output layer, and s_i is the actual value obtained from the network.
Back-propagation condition: error margin > 0.1.
Weights initialized uniformly in [−0.3, 0.3].
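A minimal sketch of the output-unit error signal and the weight initialization (for the logistic P, P'(E) = s(1 − s) with s the unit's output; function names are illustrative):

```python
import random

def output_delta(actual, desired):
    """Error signal for an output unit: delta = P'(E) * (desired - actual)."""
    return actual * (1.0 - actual) * (desired - actual)

def init_weights(n, low=-0.3, high=0.3, seed=0):
    """Initialize n weights uniformly in [-0.3, 0.3], as in the slides."""
    rng = random.Random(seed)
    return [rng.uniform(low, high) for _ in range(n)]
```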
`
3. Performance
03 Performance
Performance
Two measures of performance were computed:
- Best guess: the phoneme making the smallest angle with the output vector.
- Perfect match: the value of each articulatory feature is within a margin of 0.1 of its correct value.
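The two measures can be sketched directly (smallest angle is found via largest cosine; names are illustrative):

```python
import math

def best_guess(output, phoneme_vectors):
    """Best guess: the phoneme whose feature vector makes the smallest
    angle (largest cosine) with the network's output vector."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(y * y for y in b))
        return dot / (na * nb)
    return max(phoneme_vectors, key=lambda p: cos(output, phoneme_vectors[p]))

def perfect_match(output, target, margin=0.1):
    """Perfect match: every feature within a 0.1 margin of its correct value."""
    return all(abs(o - t) < margin for o, t in zip(output, target))
```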
`
Continuous Informal Speech
Learning after 50,000 words: perfect matches were at 55%.
`
Continuous Informal Speech
Examples of raw output from the simulator (figure): text, phonemes, and stresses after 1 iteration and after 25 iterations over a 200-word corpus.
`
Continuous Informal Speech
Graphical summary of the weights between the letter units and some of the hidden units (figure): negative weights are inhibitory, positive weights are excitatory.
`
Continuous Informal Speech
Damage to the network and recovery from damage.
`
Dictionary
Used the 1,000 most common words in English.
(Figure: learning curves shown separately for hard and soft pronunciations.)
`
04 Summary
4. Summary
• Seven groups of nodes in the input layer.
• Strings of seven letters were thus presented to the input layer at any one time.
• The text was stepped through the window on a letter-by-letter basis.
• Trained with the standard back-propagation algorithm.