Project Proposal: Translation Example Search Engine
DESCRIPTION
I propose to use a local document fingerprinting algorithm, Winnowing, to find near matches of natural language translation samples.
TRANSCRIPT
Project Proposal
CSC 630, Fall 2013, University of Arizona
Sumin Byeon
Example-Based Machine Translation
• Translation example sets (S₁→T₁), (S₂→T₂), (S₃→T₃), ...
• Given a query text S, find the closest match S’ such that (S’→T’) is in the example set
• T’ is accepted as the translation of S
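The lookup described above can be sketched as follows. This is only an illustration under stated assumptions: the example pairs are made up, and the similarity measure is Python's `difflib.SequenceMatcher`, used here as a stand-in and not the fingerprint-based matching this proposal is about. The names `EXAMPLES`, `similarity`, and `translate_by_example` are hypothetical.

```python
from difflib import SequenceMatcher

# Hypothetical toy example set of (source, target) pairs, i.e. (S_i -> T_i).
EXAMPLES = [
    ("good morning", "bonjour"),
    ("good night", "bonne nuit"),
    ("thank you very much", "merci beaucoup"),
]


def similarity(a, b):
    """Stand-in similarity in [0, 1]; an assumption, not the proposed method."""
    return SequenceMatcher(None, a, b).ratio()


def translate_by_example(query, examples=EXAMPLES):
    """Return (S', T') where S' is the stored source closest to the query S."""
    return max(examples, key=lambda pair: similarity(query, pair[0]))


closest_source, translation = translate_by_example("good morning!")
print(closest_source, "->", translation)
```

A linear scan like this compares the query against every stored example, which is the cost that an index over fingerprints is meant to avoid.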
Hypothesis
[Diagram: example pairs (S₁→T₁), (S₂→T₂), …, (Sₙ→Tₙ); query S; labels h(S), h(Sσ), φ(S); output Tᵢ]
Which hash function? Optimal value of k? Window size?
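To make the three open parameters concrete, here is a minimal sketch of Winnowing fingerprint selection. It assumes CRC32 (`zlib.crc32`) as a placeholder hash, k = 5, and a window of w = 4 hashes; none of these are choices made by this proposal, they are exactly the questions left open above. Text is normalized by dropping non-alphanumeric characters and lowercasing, and offsets refer to the normalized text. The function names are hypothetical.

```python
import zlib


def kgram_hashes(text, k):
    """Hash every k-gram of the normalized text (placeholder hash: CRC32)."""
    normalized = "".join(ch.lower() for ch in text if ch.isalnum())
    return [
        zlib.crc32(normalized[i:i + k].encode("utf-8"))
        for i in range(len(normalized) - k + 1)
    ]


def winnow(text, k=5, w=4):
    """Return the selected fingerprints as (hash, offset) pairs.

    In each window of w consecutive k-gram hashes, keep the minimum hash,
    preferring the rightmost occurrence on ties. A selection is recorded
    only once even when several windows pick the same position.
    """
    hashes = kgram_hashes(text, k)
    fingerprints = []
    if not hashes:
        return fingerprints
    # Texts with fewer than w hashes are treated as a single short window.
    for start in range(max(1, len(hashes) - w + 1)):
        window = hashes[start:start + w]
        smallest = min(window)
        rightmost = len(window) - 1 - window[::-1].index(smallest)
        pick = (smallest, start + rightmost)
        if not fingerprints or fingerprints[-1] != pick:
            fingerprints.append(pick)
    return fingerprints


for fingerprint in winnow("A do run run run, a do run run"):
    print(fingerprint)
```

With these settings, any two texts that share a normalized substring of at least w + k - 1 = 8 characters select at least one common hash, which is the property that makes near-match search possible. Larger k reduces accidental matches, and larger w produces fewer fingerprints per text.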
Relationship with Content Addressability
• Content recognizability
• Hash - Winnowing
• Content recoverability
• By locating or reconstructing
• Unlike other projects like NDN or Receipt, mine is relatively straightforward
• Simple key-value storage
• Key: hash
• Value: (reference to original text, offset)
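The key-value layout above could look roughly like the following in-memory sketch. It assumes fingerprints arrive as precomputed (hash, offset) pairs, uses a plain dictionary in place of a real storage backend, and ranks candidates by how many query hashes they share. The class and method names are hypothetical, and the vote-counting step is an illustrative assumption, not something specified in the proposal.

```python
from collections import Counter, defaultdict


class FingerprintIndex:
    """Maps each fingerprint hash to the (text reference, offset) pairs it came from."""

    def __init__(self):
        self._table = defaultdict(list)  # key: hash, value: [(text_id, offset), ...]

    def add(self, text_id, fingerprints):
        """Store every (hash, offset) fingerprint of one sample text."""
        for hash_value, offset in fingerprints:
            self._table[hash_value].append((text_id, offset))

    def lookup(self, hash_value):
        """Return all (text_id, offset) entries stored under one hash."""
        return list(self._table.get(hash_value, []))

    def candidates(self, query_fingerprints):
        """Rank stored texts by the number of distinct query hashes they share."""
        votes = Counter()
        for hash_value in {h for h, _ in query_fingerprints}:
            for text_id in {tid for tid, _ in self._table.get(hash_value, [])}:
                votes[text_id] += 1
        return votes.most_common()


index = FingerprintIndex()
index.add("S1", [(101, 0), (205, 3)])
index.add("S2", [(205, 1), (999, 4)])
print(index.candidates([(205, 0), (101, 2)]))
```

Storing the offset alongside the text reference is what lets a match be located inside the original sample, which corresponds to the "locating" half of content recoverability.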
Text Matching
• Full-text search may be an effective solution, but...
• Loses information regarding the ordering of the query words
• Limited support for phrase search
• Certain linguistic features will be ignored (e.g., “a”, “the”)
• Matching long enough partial text
• Longer text - lower probability of finding matches
• Shorter text - higher probability of ambiguity (e.g., homonyms, false cognates)
Grand Plan
• Winnowing algorithm implementation
• Index a large number of samples (10,000+)
• Translation sample search engine with simple RESTful interface
• Integrate it with Better Translator
Better Translator
• Language translator exploiting an indirect translation trick
• e.g., (Korean)→(Japanese)→(English)
• A perfect platform to test the hypothesis
• 여러분이 몰랐던 구글 번역기
• Google Translate: You did not know Google Translate
• Better Translator: Google Translate you did not know