Coupling between ASR and MT in Speech-to-Speech Translation

Arthur Chan

Prepared for Advanced Machine Translation Seminar


Page 1: Coupling between  ASR and MT in Speech-to-Speech Translation

Coupling between ASR and MT in Speech-to-Speech Translation

Arthur Chan

Prepared for Advanced Machine Translation Seminar

Page 2: Coupling between  ASR and MT in Speech-to-Speech Translation

This Seminar

• Introduction (6 slides)
• Ringger's categorization of coupling between ASR and NLU (7 slides)
• Interfaces in loose coupling
• 1-best and N-best (5 slides)
• Lattices / confusion networks / confidence estimation (12 slides)
• Results from the literature
• Tight coupling: Ney's theory and two methods of implementation (14 slides); sorry, FST approaches are not covered.
• Some "as-is" ideas on this topic

Page 3: Coupling between  ASR and MT in Speech-to-Speech Translation

Six papers on the coupling of speech-to-speech translation:

H. Ney, "Speech translation: Coupling of recognition and translation," in Proc. ICASSP, 1999.

Casacuberta et al., "Architectures for speech-to-speech translation using finite-state models," in Proc. Workshop on Speech-to-Speech Translation, 2002.

E. Matusov, S. Kanthak, and H. Ney, "On the integration of speech recognition and statistical machine translation," in Proc. Interspeech, 2005.

S. Saleem, S. C. Jou, S. Vogel, and T. Schultz, "Using word lattice information for a tighter coupling in speech translation systems," in Proc. ICSLP, 2004.

V. H. Quan et al., "Integrated N-best re-ranking for spoken language translation," in Proc. Eurospeech, 2005.

N. Bertoldi and M. Federico, "A new decoder for spoken language translation based on confusion networks," in Proc. IEEE ASRU Workshop, 2005.

Page 4: Coupling between  ASR and MT in Speech-to-Speech Translation

A Conceptual Model of Speech-to-Speech Translation

waveforms → [Speech Recognizer] → decoding result(s) → [Machine Translator] → [Speech Synthesizer] → translated waveforms

Page 5: Coupling between  ASR and MT in Speech-to-Speech Translation

Motivation for Tight Coupling between ASR and MT

The 1-best ASR output could be wrong. MT could benefit from the wide range of supplementary information ASR can provide:

• N-best list
• Lattice
• Sentence- or word-based confidence scores, e.g. word posterior probability
• Confusion network, i.e. consensus decoding (Mangu 1999)

MT quality may depend on the WER of ASR (?)

Page 6: Coupling between  ASR and MT in Speech-to-Speech Translation

Scope of this talk.

waveforms → [Speech Recognizer] → (1-best? N-best? Lattice? Confusion network?) → [Machine Translator] → [Speech Synthesizer] → translated waveforms

Page 7: Coupling between  ASR and MT in Speech-to-Speech Translation

Topics Covered Today

The concept of coupling

• "Tightness" of coupling between ASR and technology X (Ringger 95)

Two questions:

• What could ASR provide in loose coupling? (A discussion of the interfaces between ASR and MT in loose coupling.)
• What is the status of tight coupling? (Ney's formulation.)

Page 8: Coupling between  ASR and MT in Speech-to-Speech Translation

Topics not covered

Direct modeling

• Uses features of both ASR and MT
• Sometimes referred to as "ASR and MT unification"

Implications of the MT search algorithm for the coupling

Generation of speech from text

• The presenter doesn't know enough about it.

Page 9: Coupling between  ASR and MT in Speech-to-Speech Translation

The Concept of Coupling

Page 10: Coupling between  ASR and MT in Speech-to-Speech Translation

Classification of Coupling of ASR and Natural Language Understanding (NLU)

Proposed in Ringger 95 and Harper 94. Three dimensions of ASR/NLU coupling:

• Complexity of the search algorithm: a simple N-gram?
• Incrementality of the coupling: on-line? Left-to-right?
• Tightness of the coupling: tight? Loose? Semi-tight?

Page 11: Coupling between  ASR and MT in Speech-to-Speech Translation

Tightness of Coupling

Tight

Semi-Tight

Loose

Page 12: Coupling between  ASR and MT in Speech-to-Speech Translation

Notes:

Semi-tight coupling could appear as:

• A feedback loop between ASR and technology X over the whole utterance of speech
• Or a feedback loop between ASR and technology X at every frame

The Ringger system is a good way to understand how a speech-based system is developed.

Page 13: Coupling between  ASR and MT in Speech-to-Speech Translation

Example 1: LM

Suppose someone asserts that ASR has to be used with a 13-gram LM.

• In tight coupling: a search is devised to find the word sequence with the best combined acoustic score and 13-gram likelihood.
• In loose coupling: a simple search generates some outputs (an N-best list, a lattice, etc.), and the 13-gram is then used to rescore them.
• In semi-tight coupling: (1) a simple search generates results; (2) the 13-gram is applied at word ends only (but the exact history is not stored).

Page 14: Coupling between  ASR and MT in Speech-to-Speech Translation

Example 2: Higher-Order AM

Segmental models assume observation probabilities are not conditionally independent. Suppose someone asserts that a segmental model is better than a plain HMM.

• Tight coupling: direct search for the best word sequence using the segmental model.
• Loose coupling: use the segmental model to rescore.
• Semi-tight coupling: a hybrid HMM/segmental-model algorithm?

Page 15: Coupling between  ASR and MT in Speech-to-Speech Translation

Summary of Coupling between ASR and NLU

Page 16: Coupling between  ASR and MT in Speech-to-Speech Translation

Implications for ASR/MT Coupling

This classification generalizes many systems:

• Loose coupling: any system that uses the 1-best, N-best, lattice, or other outputs for one-way module communication, e.g. (Bertoldi 2005), the CMU system (Saleem 2004), (Matusov 2005)
• Tight coupling: (Ney 1999), (Casacuberta 2002)
• Semi-tight coupling: (Quan 2005)

Page 17: Coupling between  ASR and MT in Speech-to-Speech Translation

Interfaces in Loose Coupling:1-best and N-best

Page 18: Coupling between  ASR and MT in Speech-to-Speech Translation

Perspectives

ASR outputs:

• 1-best results
• N-best results
• Lattice
• Consensus network
• Confidence scores

How does ASR generate these outputs? Why are they generated? What if there are multiple ASR systems (and what if their results are combined)?

Page 19: Coupling between  ASR and MT in Speech-to-Speech Translation

Origin of the 1-Best

Decoding in HMM-based ASR = searching for the best path in a huge lattice of HMM states.

The 1-best ASR result is the best path one can find by backtracking.

State lattice (next page)

Page 20: Coupling between  ASR and MT in Speech-to-Speech Translation
Page 21: Coupling between  ASR and MT in Speech-to-Speech Translation

Notes on the 1-Best

Most of the time, the 1-best is a word sequence. Why?

• In LVCSR, storing the backtracking pointer table for the full state sequence takes a lot of memory (even nowadays). [Compare this with the number of frames of scores one would need to store.]
• Usually a backtrack pointer stores the previous words before the current word.
• A clever structure dynamically allocates the backtracking pointer table.

Page 22: Coupling between  ASR and MT in Speech-to-Speech Translation

What Is an N-Best List?

A traceback not only from the 1st-best path, but also from the 2nd-best, 3rd-best, etc.

Ways to generate it:

• Directly from the search backtrack pointer table
• The exact N-best algorithm (Chow 90)
• The word-pair N-best algorithm (Chow 91)
• A* search using the Viterbi score as a heuristic (Chow 92)
• Generate a lattice first, then generate the N-best list from the lattice
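The last pathway (lattice first, then N-best from the lattice) can be sketched as a best-first enumeration of paths through a toy word lattice. This is an illustrative sketch, not any of the cited algorithms: the lattice layout and scores are invented, and a real decoder would use A* with Viterbi backward scores as the heuristic (Chow 92).

```python
import heapq

def nbest_paths(lattice, start, end, n):
    """Enumerate the n best-scoring word sequences in a word lattice.

    lattice: dict mapping a node to a list of (next_node, word,
    log_prob) arcs; a path's score is the sum of its arc log-probs.
    Plain best-first search over partial paths -- fine for a toy
    lattice, far too slow for a real one.
    """
    heap = [(0.0, [], start)]            # (negated score, words, node)
    results = []
    while heap and len(results) < n:
        neg, words, node = heapq.heappop(heap)
        if node == end:
            results.append((-neg, words))
            continue
        for nxt, word, lp in lattice.get(node, []):
            heapq.heappush(heap, (neg - lp, words + [word], nxt))
    return results

# Toy lattice: best path is "the cat", second best "the hat".
toy = {
    "s": [("a", "the", -0.1), ("b", "a", -1.2)],
    "a": [("e", "cat", -0.3), ("e", "hat", -0.9)],
    "b": [("e", "cat", -0.4)],
}
```

Because the heap always pops the partial path with the highest accumulated score, complete paths reach the end node in descending score order, so the first n completions are exactly the N-best list.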

Page 23: Coupling between  ASR and MT in Speech-to-Speech Translation

Interfaces in Loose Coupling:Lattice, Consensus Network and Confidence Estimation

Page 24: Coupling between  ASR and MT in Speech-to-Speech Translation

What Is a Lattice?

A compact representation of the state lattice in which only word nodes (or links) are kept.

Difference between an N-best list and a lattice: a lattice can be a compact representation of an N-best list.

Page 25: Coupling between  ASR and MT in Speech-to-Speech Translation
Page 26: Coupling between  ASR and MT in Speech-to-Speech Translation

How Is a Lattice Generated?

From the decoding backtracking pointer table:

• Only the links between word nodes are recorded.

From an N-best list:

• The lattice becomes a compact representation of the N-best list. [Sometimes spurious links are introduced.]

Page 27: Coupling between  ASR and MT in Speech-to-Speech Translation

How is a lattice generated when there are phone contexts at word ends?

This is very complicated when phonetic context is involved:

• Not only the word ends but also the phone contexts need to be stored.
• The lattice carries word identities as well as contexts.
• The lattice can become very large.

Page 28: Coupling between  ASR and MT in Speech-to-Speech Translation

How Is This Resolved?

• Some use only approximate triphones to generate the lattice in the first stage (BBN).
• Some generate the lattice with full CD phones but convert it back to a context-free lattice (RWTH).
• Some use the lattice with full CD-phone contexts directly (RWTH).

Page 29: Coupling between  ASR and MT in Speech-to-Speech Translation

What do ASR folks do when the lattice is still too large?

Use some criterion to prune the lattice. Example criteria:

• Word posterior probability
• Application of another LM or AM, then filtering
• A general confidence score
• Maximum lattice density (number of words in the lattice / number of words)

Or generate an even more compact representation than the lattice, e.g. a consensus network.

Page 30: Coupling between  ASR and MT in Speech-to-Speech Translation

Conclusions on Lattices

Lattice generation itself can be a complicated issue. Sometimes what the post-processing stage (e.g. MT) receives is a pre-filtered, pre-processed result.

Page 31: Coupling between  ASR and MT in Speech-to-Speech Translation

Confusion Network and Consensus Hypothesis

A confusion network is also called a "sausage network" or a "consensus network."

Page 32: Coupling between  ASR and MT in Speech-to-Speech Translation

Special Properties (?)

More "local" than a lattice:

• One can apply simple criteria to find the best result; e.g. "consensus decoding" applies word posterior probabilities to the confusion network.

More tractable in terms of size.

Found to be useful in:

• ?
• ?
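Consensus decoding as described here reduces to an independent argmax in each slot of the sausage. A minimal sketch, under the illustrative assumption that each slot maps candidate words (including an epsilon arc standing for a deletion) to posterior probabilities:

```python
def consensus_decode(sausage, eps="*eps*"):
    """Consensus decoding over a confusion network ('sausage').

    sausage: a list of slots, each a dict mapping candidate words
    (plus an epsilon/deletion arc) to posterior probabilities.
    Emit the argmax word of each slot; when epsilon wins, the slot
    contributes nothing to the output.
    """
    output = []
    for slot in sausage:
        best = max(slot, key=slot.get)
        if best != eps:
            output.append(best)
    return output
```

This locality is the point of the slide: the decision at each slot needs no global path search over the lattice.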

Page 33: Coupling between  ASR and MT in Speech-to-Speech Translation

How Is a Consensus Network Generated?

From the lattice; a summary of Mangu's algorithm:

• Intra-word clustering
• Inter-word clustering

Page 34: Coupling between  ASR and MT in Speech-to-Speech Translation

Notes on the Consensus Network

• Time information might not be preserved in the confusion network.
• The similarity function directly affects the final output of the consensus network.

Page 35: Coupling between  ASR and MT in Speech-to-Speech Translation

Other Ways to Generate a Confusion Network

From the N-best list, using ROVER:

• A mixture of voting and adding word confidences
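A toy sketch of the ROVER voting step. Real ROVER first aligns the hypotheses into a word transition network; this sketch assumes the hypotheses are already aligned and of equal length, and the per-system confidence weighting is an illustrative assumption:

```python
from collections import Counter

def rover_vote(hypotheses, system_confidences=None):
    """Toy ROVER-style combination by position-wise voting.

    hypotheses: one word list per ASR system, assumed already aligned
    (equal length). Each system's vote can be weighted by a per-system
    confidence; real ROVER can also weight by per-word confidences.
    """
    if system_confidences is None:
        system_confidences = [1.0] * len(hypotheses)
    combined = []
    for slot in zip(*hypotheses):        # words competing at one position
        tally = Counter()
        for word, conf in zip(slot, system_confidences):
            tally[word] += conf
        combined.append(tally.most_common(1)[0][0])
    return combined
```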

Page 36: Coupling between  ASR and MT in Speech-to-Speech Translation

Confidence Measures

Anything other than the likelihood that can tell whether an answer is useful. E.g.:

• Word posterior probability P(W|A), usually computed using lattices
• Language model backoff mode
• Other posterior probabilities (frame, sentence)
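The lattice-based word posterior P(W|A) can be sketched with a forward-backward pass over the lattice arcs, in the spirit of the Wessel-style computation referred to later in the talk; the arc format and the toy example are illustrative assumptions:

```python
import math
from collections import defaultdict

def arc_posteriors(arcs, start, end):
    """Word posterior of each lattice arc via forward-backward.

    arcs: list of (src_node, dst_node, word, log_prob) over a DAG.
    Returns {arc_index: posterior}: the summed probability of all
    paths through that arc, normalized by the total lattice mass.
    """
    succ = defaultdict(list)
    for i, (u, _, _, _) in enumerate(arcs):
        succ[u].append(i)

    # Topological order of the nodes, start node first.
    order, seen = [], set()
    def visit(u):
        if u in seen:
            return
        seen.add(u)
        for i in succ[u]:
            visit(arcs[i][1])
        order.append(u)
    visit(start)
    order.reverse()

    def logadd(a, b):  # log(e^a + e^b), safe against -inf
        if a == -math.inf:
            return b
        if b == -math.inf:
            return a
        m = max(a, b)
        return m + math.log(math.exp(a - m) + math.exp(b - m))

    fwd = {u: -math.inf for u in order}
    fwd[start] = 0.0
    for u in order:                      # forward pass
        for i in succ[u]:
            _, v, _, lp = arcs[i]
            fwd[v] = logadd(fwd[v], fwd[u] + lp)

    bwd = {u: -math.inf for u in order}
    bwd[end] = 0.0
    for u in reversed(order):            # backward pass
        for i in succ[u]:
            _, v, _, lp = arcs[i]
            bwd[u] = logadd(bwd[u], lp + bwd[v])

    total = fwd[end]
    return {i: math.exp(fwd[u] + lp + bwd[v] - total)
            for i, (u, v, _, lp) in enumerate(arcs)}
```

On a two-arc fork carrying 60%/40% of the path mass, the returned posteriors are 0.6 and 0.4; an arc every path crosses gets posterior 1.0.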

Page 37: Coupling between  ASR and MT in Speech-to-Speech Translation

Interfaces in Loose Coupling:Results from the Literature

Page 38: Coupling between  ASR and MT in Speech-to-Speech Translation

General Remarks

Coupling in SST is still pretty new. Papers were chosen according to whether some ASR outputs have been used; other techniques, such as direct modeling, may be mixed into the papers.

Page 39: Coupling between  ASR and MT in Speech-to-Speech Translation

N-Best Lists (Quan 2005)

An N-best list is used for reranking; the interpolation weights of the AM and TM are then optimized.

Summary: reranking gives improvements.
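The reranking step can be sketched as a log-linear combination of hypothesis scores; the feature names and data layout here are illustrative assumptions, and the weight optimization itself is not shown:

```python
def rerank_nbest(nbest, weights):
    """Log-linear reranking of an ASR N-best list.

    Each hypothesis carries per-feature log-scores (here 'am' for the
    acoustic model and 'tm' for the translation model -- hypothetical
    names); the hypothesis maximizing the weighted sum wins. Tuning
    the interpolation weights is the optimization step the slide
    refers to.
    """
    def combined(hyp):
        return sum(w * hyp["scores"][name] for name, w in weights.items())
    return max(nbest, key=combined)
```

Changing the weights can change the winner, which is why the weights are tuned on held-out data rather than fixed a priori.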

Page 40: Coupling between  ASR and MT in Speech-to-Speech Translation

Lattices: CMU Results (Saleem 2004)

Summary of results:

• Lattice word error rate improves as lattice density increases.
• Lattice density and the weight on acoustic scores turn out to be important parameters to tune; values too large or too small can hurt.

Page 41: Coupling between  ASR and MT in Speech-to-Speech Translation

LWER against Lattice Density

Page 42: Coupling between  ASR and MT in Speech-to-Speech Translation

Modified BLEU scores against lattice density

Page 43: Coupling between  ASR and MT in Speech-to-Speech Translation

Optimal density and score weight based on Utterance Length.

Page 44: Coupling between  ASR and MT in Speech-to-Speech Translation

Consensus Networks

Bertoldi 2005 is probably the only work on a confusion-network-based method.

Summary of results:

• When direct modeling is applied, the consensus network doesn't beat the N-best method.
• The authors argue for the speed and simplicity of the algorithm.

Page 45: Coupling between  ASR and MT in Speech-to-Speech Translation

Confidence: Does It Help?

According to Zhang 2006, yes.

• Confidence measure (CM) filtering is used to filter out unnecessary results in the N-best list.
• Note: the approach used is quite different.

Page 46: Coupling between  ASR and MT in Speech-to-Speech Translation

Conclusion on Loose Coupling

SR could give a rich sets of output. It is still an unknown what type of output

should be used in pipeline. Currently, it seem to lack of comprehensive

experimental studies on which method is the best.

Usage of confusion network and confidence estimation seem to be under-explored.

Page 47: Coupling between  ASR and MT in Speech-to-Speech Translation

Tight Coupling : Theory and Practice

Page 48: Coupling between  ASR and MT in Speech-to-Speech Translation

Theory (Ney 1999)

Derivation steps:

• Bayes' rule
• Introduce f as a hidden variable
• Bayes' rule again
• Assume x doesn't depend on the target language
• Replace the sum by a max
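The formulas on this slide did not survive the transcript; the following is a reconstruction of the derivation the step labels refer to, in the standard form of Ney's 1999 argument, with x the acoustic input, f a source-language sentence, and e the target-language sentence:

```latex
\begin{aligned}
\hat{e} &= \arg\max_{e} \Pr(e \mid x) \\
        &= \arg\max_{e} \Pr(e)\,\Pr(x \mid e)
           && \text{(Bayes' rule)} \\
        &= \arg\max_{e} \Pr(e) \sum_{f} \Pr(f \mid e)\,\Pr(x \mid f, e)
           && \text{(introduce } f \text{ as a hidden variable)} \\
        &= \arg\max_{e} \Pr(e) \sum_{f} \Pr(f \mid e)\,\Pr(x \mid f)
           && \text{(} x \text{ does not depend on the target language)} \\
        &\approx \arg\max_{e} \Pr(e) \max_{f} \Pr(f \mid e)\,\Pr(x \mid f)
           && \text{(sum replaced by max)}
\end{aligned}
```

The three surviving factors Pr(e), Pr(f|e), and Pr(x|f) are exactly the target language model, translation model, and acoustic model listed on the next slide.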

Page 49: Coupling between  ASR and MT in Speech-to-Speech Translation

A Layman's Point of View

Three factors:

• Pr(e): target language model
• Pr(f|e): translation model
• Pr(x|f): acoustic model

Note: the assumption has been made that only the best-matching f for each e is used.

Page 50: Coupling between  ASR and MT in Speech-to-Speech Translation

Comparison with SR

In SR: Pr(f), the source language model.

In tight coupling: Pr(f|e) and Pr(e), the translation model and the target language model.

Page 51: Coupling between  ASR and MT in Speech-to-Speech Translation

An Algorithmic Point of View

Brute-force method: instead of incorporating an LM into the standard Viterbi algorithm, incorporate P(e) and P(f|e) directly => very complicated.

Page 52: Coupling between  ASR and MT in Speech-to-Speech Translation

Assumptions in Modeling

Alignment models (HMM).

Acoustic modeling:

• The speech recognizer produces a word graph.
• Each link carrying a word hypothesis covers a portion of the acoustic scores. (The notation in the paper is confusing.)

Page 53: Coupling between  ASR and MT in Speech-to-Speech Translation

Lexicon Modeling

A further assumption beyond the standard IBM models:

• Each word is assumed to depend on the previous word.
• So, in fact, a source LM is actually there.

Page 54: Coupling between  ASR and MT in Speech-to-Speech Translation

First Implementation: The Local Average Assumption

P(x|e) is used to capture the local characteristics of the acoustics.

Page 55: Coupling between  ASR and MT in Speech-to-Speech Translation

Justification for Using the Local Average Assumption

Rephrased from the author (p. 3, para. 2):

• Lexicon modeling and language modeling make f_{j-1}, f_{j}, and f_{j+1} appear in the math; in other words, it is too complicated to carry out exactly.
• Computational advantage: the local score can be obtained from the word graph alone, before translation, so a full translation strategy can still be carried out.

Page 56: Coupling between  ASR and MT in Speech-to-Speech Translation

Computation of P(x|e)

Makes use of the best source sequence. See also Wessel 98:

• A commonly used word posterior probability algorithm for lattices
• A forward-backward-like procedure is used

Page 57: Coupling between  ASR and MT in Speech-to-Speech Translation

Second Method: Monotone Alignment Assumption - Network

Page 58: Coupling between  ASR and MT in Speech-to-Speech Translation

Monotone Alignment Assumption – Formula for Text Input

A closed-form DP solution exists: O(JE^2)
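The O(JE^2) DP for text input can be sketched under illustrative assumptions (one target word per source position, bigram dependency between adjacent target words, probabilities given as sparse log-probability tables); this is a simplification for illustration, not Ney's exact formulation:

```python
import math

def monotone_translate(source, trans_logprob, lm_logprob, vocab):
    """Monotone DP decoding of a source sentence (text input).

    Q[j][e] = best log-score of covering source[:j+1] with target
    word e at position j:
        Q[j][e] = max_{e'} Q[j-1][e'] + log p(e | e') + log p(source[j] | e)
    The double loop over target words gives O(J * |E|^2) time,
    matching the complexity quoted on the slide.
    """
    NEG = -math.inf
    J = len(source)
    Q = [{e: NEG for e in vocab} for _ in range(J)]
    back = [{e: None for e in vocab} for _ in range(J)]
    for e in vocab:  # first position conditions on a start token
        Q[0][e] = (lm_logprob.get((e, "<s>"), NEG)
                   + trans_logprob.get((source[0], e), NEG))
    for j in range(1, J):
        for e in vocab:
            emit = trans_logprob.get((source[j], e), NEG)
            for prev in vocab:
                score = Q[j - 1][prev] + lm_logprob.get((e, prev), NEG) + emit
                if score > Q[j][e]:
                    Q[j][e], back[j][e] = score, prev
    # Backtrack from the best final target word.
    e = max(vocab, key=lambda w: Q[J - 1][w])
    out = [e]
    for j in range(J - 1, 0, -1):
        e = back[j][e]
        out.append(e)
    return list(reversed(out))
```

The speech-input variant on the next slide adds a search over source-word boundaries in the word graph, which is where the extra F^2 factor comes from.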

Page 59: Coupling between  ASR and MT in Speech-to-Speech Translation

Monotone Alignment Assumption – Formula for Speech Input

DP: O(JE^2F^2)

Page 60: Coupling between  ASR and MT in Speech-to-Speech Translation

How can the monotone assumption be made to work?

Words need to be reordered as part of the search strategy.

Is the acoustic model assumption used? I.e., are we talking about a word lattice or still a state lattice?

• Unclear, but it seems we are actually talking about a word lattice, as supported by Matusov 2005.

Page 61: Coupling between  ASR and MT in Speech-to-Speech Translation

Experimental Results in Matusov, Kanthak, and Ney 2005

Summary of the results:

• Translation quality is improved by tight coupling only when the lattice density is not high.
• As in Saleem 2004, incorporating acoustic scores helps.

Page 62: Coupling between  ASR and MT in Speech-to-Speech Translation

Conclusion: Possible Issues with Tight Coupling

Possibilities:

• In SR, the source n-gram LM is already very close to the best configuration.
• The complexity of the algorithm is too high; approximation is still necessary to make it work.
• When the tight-coupling criterion is used, the LM and the TM may need to be jointly estimated.
• The current approaches still haven't really implemented tight coupling.
• There might be bugs in the programs.

Page 63: Coupling between  ASR and MT in Speech-to-Speech Translation

Conclusion

Two major issues in the coupling of SST were discussed:

• In loose coupling: consensus networks and confidence scoring are still not fully utilized.
• In tight coupling: the approach seems to be haunted by the very high complexity of constructing the search algorithm.

Page 64: Coupling between  ASR and MT in Speech-to-Speech Translation

Discussion

Page 65: Coupling between  ASR and MT in Speech-to-Speech Translation

The End. Thanks.

Page 66: Coupling between  ASR and MT in Speech-to-Speech Translation

Literature

R. Zhang and G. Kikui, "Integration of speech recognition and machine translation: Speech recognition word lattice translation," Speech Communication, vol. 48, issues 3-4, 2006.

H. Ney, "Speech translation: Coupling of recognition and translation," in Proc. ICASSP, 1999.

E. Matusov, S. Kanthak, and H. Ney, "On the integration of speech recognition and statistical machine translation," in Proc. Interspeech, 2005.

S. Saleem, S. C. Jou, S. Vogel, and T. Schultz, "Using word lattice information for a tighter coupling in speech translation systems," in Proc. ICSLP, 2004.

V. H. Quan et al., "Integrated N-best re-ranking for spoken language translation," in Proc. Eurospeech, 2005.

N. Bertoldi and M. Federico, "A new decoder for spoken language translation based on confusion networks," in Proc. IEEE ASRU Workshop, 2005.

L. Mangu, E. Brill, and A. Stolcke, "Finding consensus in speech recognition: Word error minimization and other applications of confusion networks," Computer Speech and Language, vol. 14, no. 4, pp. 373-400, 2000.

E. Ringger, "A Robust Loose Coupling for Speech Recognition and Natural Language Understanding," 1995.