Machine Translation at translate.google.com Slav Petrov on behalf of many people from Google Research (slides borrowed in part from Dan Klein, Thorsten Brants and Peng Xu)
Machine Translation (French)
Machine Translation (Japanese)
General Approaches Rule-based approaches (outdated?) Expert system-like rewrite systems Interlingua methods (analyze and generate) Lexicons come from humans Can be very fast, and can accumulate a lot of knowledge over time (e.g. Systran) Statistical approaches (topic of this talk) Word-to-word translation Phrase-based translation Syntax-based translation (tree-to-tree, tree-to-string) Trained on parallel corpora Usually noisy-channel (at least in spirit)
Levels of Transfer
Corpus-Based MT Modeling correspondences between languages Sentence-aligned parallel corpus: Yo lo haré mañana I will do it tomorrow Hasta pronto See you soon Hasta pronto See you around Machine translation system: Yo lo haré pronto Novel Sentence Model of translation I will do it soon I will do it around See you tomorrow
The Noisy-Channel Model We want to predict a sentence given its translation. The noisy-channel approach: Translation model: a table with phrase translations and their probabilities. Language model: a distribution over sequences of words (sentences).
MT System Components Noisy-Channel Model: Language Model (source): P(e) generates e. Translation Model (channel): P(f | e) generates f. Decoder: from the observed f, find the best e via argmax_e P(e | f) = argmax_e P(f | e) P(e)
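The decoder rule above follows from Bayes' rule; written out (a standard reconstruction of the slide's equation, with e the English sentence and f the observed French sentence):

```latex
\hat{e} \;=\; \arg\max_{e} P(e \mid f)
       \;=\; \arg\max_{e} \frac{P(f \mid e)\, P(e)}{P(f)}
       \;=\; \arg\max_{e} \underbrace{P(f \mid e)}_{\text{translation model}} \;\; \underbrace{P(e)}_{\text{language model}}
```

The denominator P(f) can be dropped because it is constant over all candidate translations e.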
Statistical MT: Translation as a search problem
Unsupervised Word Alignment Input: a bitext: pairs of translated sentences nous acceptons votre opinion. we accept your view. Output: alignments: pairs of translated words When words have unique sources, can represent as a (forward) alignment function a from French to English positions
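As a minimal illustration (not from the slides) of the forward alignment function, the pair above can be represented as a map from French to English positions; for this pair the alignment happens to be the identity:

```python
# Forward alignment a: French position j -> English position a(j), 0-based.
french  = ["nous", "acceptons", "votre", "opinion", "."]
english = ["we", "accept", "your", "view", "."]

a = {0: 0, 1: 1, 2: 2, 3: 3, 4: 4}   # "nous"->"we", "acceptons"->"accept", ...

for j, i in a.items():
    print(f"{french[j]:10s} -> {english[i]}")
```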
Word Alignment [Alignment grids for the sentence pair:] What is the anticipated cost of collecting fees under the new proposal? ↔ En vertu des nouvelles propositions, quel est le coût prévu de perception des droits? (and, with the French side tokenized: En vertu de les nouvelles propositions, quel est le coût prévu de perception de les droits?)
Example from Kevin Knight farok crrrok hihok yorok clok kantok ok-yurp 1c. ok-voon ororok sprok. 1a. at-voon bichat dat. 2c. ok-drubel ok-voon anok plok sprok. 2a. at-drubel at-voon pippat rrat dat. 3c. erok sprok izok hihok ghirok. 3a. totat dat arrat vat hilat. 4c. ok-voon anok drok brok jok. 4a. at-voon krat pippat sat lat. 5c. wiwok farok izok stok. 5a. totat jjat quat cat. 6c. lalok sprok izok jok stok. 6a. wat dat krat quat cat. 7c. lalok farok ororok lalok sprok izok enemok. 7a. wat jjat bichat wat dat vat eneat. 8c. lalok brok anok plok nok. 8a. iat lat pippat rrat nnat. 9c. wiwok nok izok kantok ok-yurp. 9a. totat nnat quat oloat at-yurp. 10c. lalok mok nok yorok ghirok clok. 10a. wat nnat gat mat bat hilat. 11c. lalok nok crrrok hihok yorok zanzanok. 11a. wat nnat arrat mat zanzanat. 12c. lalok rarok nok izok hihok mok. 12a. wat nnat forat arrat vat gat.
Bidirectional Alignment
Alignment Heuristics
Extracting Phrases
From an Intro to Stat MT slide by Chris Callison-Burch and Philipp Koehn
Probabilistic Language Models Goal: Assign useful probabilities P(x) to sentences x Input: many observations of training sentences x Output: system capable of computing P(x) Probabilities should broadly indicate plausibility of sentences: P(I saw a van) >> P(eyes awe of an) Not grammaticality: P(artichokes intimidate zippers) ≈ 0 In principle, "plausible" depends on the domain, context, speaker One option: empirical distribution over training sentences? Problem: doesn't generalize (at all) Two aspects of generalization Decomposition: break sentences into small pieces which can be recombined in new ways (conditional independence) Smoothing: allow for the possibility of unseen pieces
N-Gram Model Decomposition Chain rule: break sentence probability down Impractical to condition on everything before: P(??? | Turn to page 134 and look at the picture of the)? N-gram models: assume each word depends only on a short linear history Example:
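The two equations this slide alludes to, reconstructed in standard notation (the slide's own example is not recoverable): the exact chain-rule decomposition, and the n-gram (Markov) approximation that truncates each history, e.g. to the two previous words for a trigram model:

```latex
P(w_1 w_2 \dots w_n) \;=\; \prod_{i=1}^{n} P(w_i \mid w_1 \dots w_{i-1})
\;\;\approx\;\; \prod_{i=1}^{n} P(w_i \mid w_{i-2}\, w_{i-1}) \qquad \text{(trigram model)}
```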
More (parallel) data is better data [Chart: test-data BLEU (0.10-0.40) vs. parallel corpus size, from 1M up to 1B words, for pt-en, fr-en, es-en, fi-en, de-en and zh-en; BLEU rises steadily as the corpus grows.]
Improvement over time: data & algorithms [Chart: test-data BLEU (0.10-0.22) per quarter from Q4'07 to Q1'09 for Hindi-English, Thai-English and Hungarian-English.]
Motivation Train 5-gram language models Vary amount of training data from 13 million to 2 trillion tokens Data is divided into four sets Train separate language model for each set Use up to four language models during decoding
Motivation [Chart: BLEU as training data grows across the four sets: target, +ldcnews, +webnews, +web; green: Kneser-Ney, blue: Stupid Backoff.] Growth is nearly linear within domains. The improvement is smaller for web data, but doesn't seem to level off. Stupid Backoff achieves higher BLEU than Kneser-Ney with >8G tokens of training data.
MapReduce Recap Automatic parallelization & distribution Fault-tolerant Provides status and monitoring tools Clean abstraction for programmers
MapReduce Recap 1. Backup map workers compensate for the failure of a map task. 2. The sorting phase requires all map jobs to be completed before the reduce phase starts.
Language Models n-gram language models n-gram statistics, relative frequency
Stupid Backoff Smoothing Statistics required: counts
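A minimal sketch of the Stupid Backoff score of Brants et al. (2007), which needs nothing beyond raw n-gram counts; the recursive structure and the 0.4 backoff factor follow the paper, while the toy count store below is purely illustrative:

```python
def stupid_backoff(word, history, counts, total_tokens, alpha=0.4):
    """Score S(word | history); history is a tuple of preceding words.

    Not a true probability: scores are not normalized, which is why
    only counts (no discounting or backoff weights) are required.
    """
    if not history:
        # Base case: unigram relative frequency.
        return counts.get((word,), 0) / total_tokens
    ngram = history + (word,)
    hist_count = counts.get(history, 0)
    if counts.get(ngram, 0) > 0 and hist_count > 0:
        return counts[ngram] / hist_count
    # Back off to a shorter history, multiplying by the fixed factor alpha.
    return alpha * stupid_backoff(word, history[1:], counts, total_tokens, alpha)

# Toy counts: S("cat" | "the") = 2/3; S("dog" | "the") backs off to 0.4 * 1/6.
counts = {("the",): 3, ("cat",): 2, ("dog",): 1, ("the", "cat"): 2}
print(stupid_backoff("cat", ("the",), counts, total_tokens=6))
print(stupid_backoff("dog", ("the",), counts, total_tokens=6))
```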
Stupid Backoff Smoothing Traditional LM tools: Scan text and keep a trie data structure in memory Direct (or simple/fast) access to all necessary counts To scale to >1 trillion words: MapReduce, with no in-memory access to all counts Steps: Collect unigram counts Determine vocabulary Collect n-gram counts
Collect Unigram Counts Mapper: input={document, …}, output={word, count}
Collect Unigram Counts Reducer: input={word, {count}}, output={word, count}
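A toy, single-process sketch of these two steps (in a real MapReduce run the framework performs the shuffle that groups values by key; the names below are illustrative):

```python
from collections import defaultdict

def unigram_mapper(document_text):
    """Mapper: emit one (word, 1) pair per token in the document."""
    for word in document_text.split():
        yield word, 1

def unigram_reducer(word, partial_counts):
    """Reducer: sum all partial counts for one word."""
    return word, sum(partial_counts)

# Simulated shuffle phase: group mapper outputs by key.
grouped = defaultdict(list)
for doc in ["a rose is a rose", "a daisy is a flower"]:
    for word, count in unigram_mapper(doc):
        grouped[word].append(count)

unigram_counts = dict(unigram_reducer(w, c) for w, c in grouped.items())
print(unigram_counts)   # {'a': 4, 'rose': 2, 'is': 2, 'daisy': 1, 'flower': 1}
```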
Determine Vocabulary Crucial to reduce the amount of intermediate data storage Simple count cut-off, words below cutoff are mapped to <UNK> Frequent words have small IDs: good for compression
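One plausible way to implement the cutoff and ID assignment described here (the cutoff value and the <UNK> convention below are illustrative, not the production settings):

```python
def build_vocabulary(unigram_counts, cutoff=2):
    """Keep words at or above the cutoff; rare words all map to <UNK>.

    Words are numbered in decreasing frequency, so the most frequent
    words get the smallest IDs, which compresses well.
    """
    vocab = {"<UNK>": 0}
    for word, count in sorted(unigram_counts.items(), key=lambda kv: -kv[1]):
        if count >= cutoff:
            vocab[word] = len(vocab)
    return vocab

vocab = build_vocabulary({"a": 4, "rose": 2, "is": 2, "daisy": 1}, cutoff=2)
# {'<UNK>': 0, 'a': 1, 'rose': 2, 'is': 3}  -- "daisy" falls below the cutoff
```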
Collect N-Gram Counts In a straightforward way: Similar to unigram counting, count n-grams for all orders 1, …, n For each word in the corpus, O(n) copies of it are generated in the intermediate data Unnecessary overhead for shuffling, sorting, etc. in MapReduce Better way: Only get the highest-order n-grams Split the training corpus into smaller, more manageable chunks and combine the counts later
Collect N-Gram Counts Mapper: input={document, …}, output={reversed_ngram, count}
Collect N-Gram Counts Reducer: input={reversed_ngram, {count}}, output={reversed_ngram, count}
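A toy sketch of the mapper/reducer pair on the last two slides: only the highest-order n-grams are emitted, keyed in reversed word order so that n-grams sharing a final context end up adjacent once MapReduce sorts the keys (the order and tokenization here are illustrative):

```python
def ngram_mapper(document_text, order=3):
    """Emit (reversed_ngram, 1) for every highest-order n-gram in the document."""
    words = document_text.split()
    for i in range(len(words) - order + 1):
        ngram = words[i:i + order]
        # Reverse so "a rose is" is keyed as "is rose a": n-grams that end
        # in the same words become neighbours in the sorted key space.
        yield " ".join(reversed(ngram)), 1

def ngram_reducer(reversed_ngram, partial_counts):
    return reversed_ngram, sum(partial_counts)

print(list(ngram_mapper("a rose is a rose", order=3)))
# [('is rose a', 1), ('a is rose', 1), ('rose a is', 1)]
```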
Combine Ngram Counts Split input into several MapReduce jobs for scalability Combine separate ngram statistics Catch: Ngrams are processed in sequential order No access to other ngrams in general Possibly large amount of intermediate data Trick: Input ngrams in reverse order Restore the normal ngram order in Map Use a cache to accumulate counts in Map
Combine Ngram Counts Example: E D 2, E D C 3, E D C B 1, E D C A 1, F 2 Book-keeping: E=2, ED=2 E=5, ED=5, EDC=3 E=6, ED=6, EDC=4, EDCB=1 E=7, ED=7, EDC=5, EDCB=1, EDCA=1: Output("BCDE", 1); E=7, ED=7, EDC=5, EDCA=1, F=2: Output("ACDE", 1); Output("CDE", 5); Output("DE", 7); Output("E", 7); The size of the cache is only the size of the highest order n
Combine Ngram Counts
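A sketch of the cache trick behind the worked example above (a reconstruction under the stated assumptions, not the production code): because the reversed n-grams arrive in sorted order, every prefix of a key is a lower-order n-gram, so a cache of at most n entries can accumulate all lower-order counts and flush them, in restored word order, as soon as the stream moves past them.

```python
def combine_counts(sorted_reversed_ngrams):
    """Turn sorted (reversed_ngram_words, count) pairs into counts of all orders.

    Input example (matching the slide): (["E","D"], 2), (["E","D","C"], 3), ...
    Output: (ngram_in_normal_order, count) pairs such as ("B C D E", 1).
    """
    cache = []        # one [prefix_words, count] entry per n-gram order
    results = []

    def flush(depth):
        # Emit and discard every cached prefix deeper than `depth`.
        while len(cache) > depth:
            words, count = cache.pop()
            results.append((" ".join(reversed(words)), count))

    for words, count in sorted_reversed_ngrams:
        # How many leading words does this key share with the cache?
        shared = 0
        while (shared < len(cache) and shared < len(words)
               and cache[shared][0][shared] == words[shared]):
            shared += 1
        flush(shared)
        # Create cache entries for the new, unshared prefixes ...
        for d in range(len(cache), len(words)):
            cache.append([words[:d + 1], 0])
        # ... and add this count to every prefix of the key.
        for entry in cache[:len(words)]:
            entry[1] += count
    flush(0)
    return results

stream = [(["E", "D"], 2), (["E", "D", "C"], 3), (["E", "D", "C", "B"], 1),
          (["E", "D", "C", "A"], 1), (["F"], 2)]
print(combine_counts(stream))
# [('B C D E', 1), ('A C D E', 1), ('C D E', 5), ('D E', 7), ('E', 7), ('F', 2)]
```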
Generate Final Distributed LM So far, we have: Stupid backoff: {ngram, count} Katz/Kneser-Ney: {ngram, prob/backoff} Need to pack them into the final distributed LM Depends on the serving architecture in a client/server approach
Distributed LM Architecture Final LM data is served on multiple servers Each server performs lookups for multiple clients Saves memory: if the model is 8GB, 100 clients need 800GB of memory with local models, but only 80GB if 10 servers serve 10 copies.
Distributed LM Architecture [Diagram: the model is split across Server 0 … Server M-1 plus a vocab server; each shard has replicas (Replica server 0 … Replica server M-1, Vocab Replica) for redundancy and higher load; load balancers distribute load between replicas; clients (Client 0, Client 1) keep local caches and use a user-defined hash function to decide which server to contact.]
Distributed LM Latency Network latency: in-memory access takes X hundred nanoseconds, network access takes Y milliseconds A typical sentence of 25 words in machine translation touches roughly 100K n-grams during search, which would require >100 seconds to access all probabilities one at a time Batch processing: one batch per word position, 25 batches in ~100 milliseconds
Distributed LM For each ngram A B C D E, we may need: Stupid backoff: P(E | ABCD), P(E | BCD), P(E | CD), P(E | D), P(E) Katz/Kneser-Ney: P(E | ABCD), α(ABCD), P(E | BCD), α(BCD), P(E | CD), α(CD), P(E | D), α(D), P(E) Ideally these entries should live on the same server; in practice this is not possible for Katz/Kneser-Ney Hashing ngrams by the last word (E) could work well: one network request per ngram for Stupid Backoff, two network requests per ngram for Katz/Kneser-Ney In practice, we hash ngrams by the last two words (D E) and duplicate unigram entries on every server for better balanced shards
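A sketch of the sharding rule described above, with an illustrative hash function and server count (the slides specify neither): hashing on the last two words keeps an n-gram and all of its backoffs except the unigram on one shard, and the unigrams are simply duplicated everywhere.

```python
import hashlib

NUM_SERVERS = 10   # illustrative shard count

def shard_for(ngram_words):
    """Choose the server holding this n-gram: hash its last two words."""
    key = " ".join(ngram_words[-2:])
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_SERVERS

# P(E | A B C D) and its backoffs P(E | B C D), P(E | C D), P(E | D) all end
# in "D E", so they live on the same shard and need one request; the unigram
# P(E) is duplicated on every server.
assert shard_for(["A", "B", "C", "D", "E"]) == shard_for(["D", "E"])
```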
Client for Distributed LM Typical client usage: From a given set of histories, tentatively expand to the possible next word (or words): dummy lookups Batch all required ngrams (and histories too in case of Katz/Kneser-Ney) Send all batched entries to the servers Wait for the servers to return probabilities (and possibly backoff weights) Revisit all expansions with real probabilities, perform pruning: true lookups
Typical Decoder as Client Repeatedly switch between dummy lookups and true lookups [Figure: search graph with active and pruned nodes.]
Distributed LM Data Structure The served model data structure can be flexible: compact trie, block encoding, perfect hashing Different applications may choose different data structures One MapReduce to generate the served models
Compressing Language Models Data structures with tradeoffs between Space (compression), Runtime (n-gram lookups), Accuracy (quantization, lossy compression), and Functionality (enumeration, next word):

                 Space     Runtime   Accuracy  Functionality
Trie             baseline  baseline  baseline  baseline
Distributed LM   better    worse     same      same
Quantization     better    same      worse     same
Randomized       better    better    worse     worse
Value Quantization Storing probabilities as floating point numbers is expensive Only a small number of bits is needed to achieve the same quality [Chart: BLEU (0.40-0.47) vs. number of bits per value, 3 to 8.] Store the negative log probability with 2^#bits different values The experiment above uses simple uniformly sized bins
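A minimal sketch of the uniform-bin scheme described above: negative log probabilities are clipped to a fixed range and mapped to one of 2^#bits equally sized bins (the range and bit width below are illustrative, not the values used in the experiment):

```python
import math

NUM_BITS = 5            # 2**5 = 32 distinct values per probability
MAX_NEG_LOGPROB = 12.0  # clip -log10(p) to [0, 12]

def quantize(prob):
    """Map -log10(prob) into one of 2**NUM_BITS uniformly sized bins."""
    levels = 2 ** NUM_BITS
    neg_logprob = min(-math.log10(prob), MAX_NEG_LOGPROB)
    return min(int(neg_logprob / MAX_NEG_LOGPROB * levels), levels - 1)

def dequantize(bin_id):
    """Reconstruct a probability from the centre of the bin."""
    levels = 2 ** NUM_BITS
    neg_logprob = (bin_id + 0.5) / levels * MAX_NEG_LOGPROB
    return 10.0 ** (-neg_logprob)

p = 3.2e-4
b = quantize(p)
print(b, dequantize(b))   # 9, ~2.7e-04 -- a few bits already recover p roughly
```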
Machine Translation Results with Bloomier Filters Vary the number of value and error bits Results for 7 or more value bits are essentially identical to full floating point Results for 12 or more error bits are equivalent to the lossless model
Conclusions Google uses Statistical Machine Translation There is no data like more data (but smarter algorithms help, too) Challenges in a distributed environment differ from those on a single core Approximate solutions work well in practice (especially in the presence of massive data) Think about the massive computation that happens the next time you use translate.google.com