Machine Translation at translate.google.com

Machine Translation at translate.google.com Slav Petrov on behalf of many people from Google Research (slides borrowed in part from Dan Klein, Thorsten Brants and Peng Xu)

Machine Translation (French)

Machine Translation (Japanese)

General Approaches

Rule-based approaches (outdated?)
- Expert-system-like rewrite systems
- Interlingua methods (analyze and generate)
- Lexicons come from humans
- Can be very fast, and can accumulate a lot of knowledge over time (e.g. Systran)

Statistical approaches (topic of this talk)
- Word-to-word translation
- Phrase-based translation
- Syntax-based translation (tree-to-tree, tree-to-string)
- Trained on parallel corpora
- Usually noisy-channel (at least in spirit)

Levels of Transfer

Corpus-Based MT

Modeling correspondences between languages.

Sentence-aligned parallel corpus:
- Yo lo haré mañana / I will do it tomorrow
- Hasta pronto / See you soon
- Hasta pronto / See you around

Machine translation system: given a model of translation and a novel sentence ("Yo lo haré pronto"), produce candidate outputs:
- I will do it soon
- I will do it around
- See you tomorrow

The Noisy-Channel Model

We want to predict an English sentence e given its foreign translation f. The noisy-channel approach combines:
- Translation model: a table with phrase translations and their probabilities, P(f|e)
- Language model: a distribution over sequences of words (sentences), P(e)

MT System Components

Noisy-channel model: a source generates English e with probability P(e) (language model); a channel transforms it into foreign f with probability P(f|e) (translation model). Given observed f, the decoder finds the best e:

  argmax_e P(e|f) = argmax_e P(f|e) P(e)
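As a toy illustration, the noisy-channel argmax can be run over the tiny corpus from the Corpus-Based MT slide. All probabilities below are invented for illustration:

```python
# Toy noisy-channel decoder: argmax_e P(f|e) * P(e).
# Translation model P(f|e): probability of the foreign phrase given English.
tm = {
    ("See you soon", "Hasta pronto"): 0.7,
    ("See you around", "Hasta pronto"): 0.3,
}
# Language model P(e): plausibility of the English side (invented numbers).
lm = {"See you soon": 0.05, "See you around": 0.01}

def decode(f):
    # Enumerate English candidates that the channel could map to f,
    # then pick the one maximizing the noisy-channel objective.
    candidates = [e for (e, ff) in tm if ff == f]
    return max(candidates, key=lambda e: tm[(e, f)] * lm[e])

print(decode("Hasta pronto"))
```

Here the language model breaks the tie between two channel-plausible outputs, which is exactly the division of labor the noisy-channel factorization is after.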

Statistical MT: Translation as a search problem

Unsupervised Word Alignment

Input: a bitext, i.e. pairs of translated sentences:
  nous acceptons votre opinion .
  we accept your view .

Output: alignments, i.e. pairs of translated words. When words have unique sources, we can represent the alignment as a (forward) alignment function a from French to English positions.
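A classic way to learn such alignments is IBM-Model-1-style EM. The slides do not commit to a particular model, so the sketch below is just the standard starting point, run on an invented toy bitext built around the slide's example pair:

```python
from collections import defaultdict

# Minimal IBM-Model-1-style EM: learn lexical translation
# probabilities t(f|e) from sentence pairs (toy data, invented).
bitext = [
    ("we accept your view".split(), "nous acceptons votre opinion".split()),
    ("we".split(),                  "nous".split()),
    ("your view".split(),           "votre opinion".split()),
]

t = defaultdict(lambda: 1.0)            # uniform-ish initialization
for _ in range(10):                     # a few EM iterations
    count = defaultdict(float)
    total = defaultdict(float)
    for e_sent, f_sent in bitext:
        for f in f_sent:                # E-step: expected alignment counts
            z = sum(t[(f, e)] for e in e_sent)
            for e in e_sent:
                c = t[(f, e)] / z
                count[(f, e)] += c
                total[e] += c
    for (f, e), c in count.items():     # M-step: renormalize per English word
        t[(f, e)] = c / total[e]

# Most likely English source for "acceptons":
print(max("we accept your view".split(), key=lambda e: t[("acceptons", e)]))
```

The short pair ("we" / "nous") pins "nous" to "we", which lets EM shift the mass for "acceptons" onto "accept" — the usual explaining-away effect that makes unsupervised alignment work.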

Word Alignment

  What is the anticipated cost of collecting fees under the new proposal?
  En vertu des nouvelles propositions, quel est le coût prévu de perception des droits?

After tokenization (splitting contractions such as "des" into "de les"):
  En vertu de les nouvelles propositions, quel est le coût prévu de perception de les droits?

Example from Kevin Knight

Sentence to translate: farok crrrok hihok yorok clok kantok ok-yurp

1c.  ok-voon ororok sprok.                    1a.  at-voon bichat dat.
2c.  ok-drubel ok-voon anok plok sprok.       2a.  at-drubel at-voon pippat rrat dat.
3c.  erok sprok izok hihok ghirok.            3a.  totat dat arrat vat hilat.
4c.  ok-voon anok drok brok jok.              4a.  at-voon krat pippat sat lat.
5c.  wiwok farok izok stok.                   5a.  totat jjat quat cat.
6c.  lalok sprok izok jok stok.               6a.  wat dat krat quat cat.
7c.  lalok farok ororok lalok sprok izok enemok.  7a.  wat jjat bichat wat dat vat eneat.
8c.  lalok brok anok plok nok.                8a.  iat lat pippat rrat nnat.
9c.  wiwok nok izok kantok ok-yurp.           9a.  totat nnat quat oloat at-yurp.
10c. lalok mok nok yorok ghirok clok.         10a. wat nnat gat mat bat hilat.
11c. lalok nok crrrok hihok yorok zanzanok.   11a. wat nnat arrat mat zanzanat.
12c. lalok rarok nok izok hihok mok.          12a. wat nnat forat arrat vat gat.


Bidirectional Alignment

Alignment Heuristics

Extracting Phrases

From "Intro to Stat MT" slides by Chris Callison-Burch and Philipp Koehn


Probabilistic Language Models

Goal: assign useful probabilities P(x) to sentences x.
- Input: many observations of training sentences x
- Output: a system capable of computing P(x)

Probabilities should broadly indicate plausibility of sentences:
- P(I saw a van) >> P(eyes awe of an)
- Not grammaticality: P(artichokes intimidate zippers) ≈ 0
- In principle, "plausible" depends on the domain, context, speaker

One option: the empirical distribution over training sentences? Problem: it doesn't generalize (at all).

Two aspects of generalization:
- Decomposition: break sentences into small pieces which can be recombined in new ways (conditional independence)
- Smoothing: allow for the possibility of unseen pieces

N-Gram Model Decomposition

- Chain rule: break sentence probability down.
- Impractical to condition on everything before: P(??? | Turn to page 134 and look at the picture of the)?
- N-gram models: assume each word depends only on a short linear history.
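A minimal bigram instance of this decomposition, with an invented toy corpus: each word is conditioned only on its immediate predecessor, and probabilities are relative frequencies.

```python
from collections import Counter

# Bigram sketch of the chain-rule decomposition with a Markov assumption:
# P(w1..wn) ≈ Π P(wi | w(i-1)), estimated by relative frequency.
corpus = "i saw a van . i saw a cat .".split()
bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus)

def p(sentence):
    words = sentence.split()
    prob = 1.0
    for prev, w in zip(words, words[1:]):
        # Counter returns 0 for unseen bigrams, zeroing the product.
        prob *= bigrams[(prev, w)] / unigrams[prev]
    return prob

print(p("i saw a van"))   # plausible word order
print(p("van saw a i"))   # implausible word order
```

Unseen bigrams get probability zero here, which is exactly the generalization problem that smoothing (next slides) addresses.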

More (parallel) data is better data

[Chart: test-data BLEU vs. parallel-corpus size, from 1M up to 1B words, for pt-en, fr-en, es-en, fi-en, de-en, and zh-en. BLEU climbs steadily (roughly 0.10 to 0.40) as the corpus grows.]

Improvement over time: data & algorithms

[Chart: test-data BLEU per quarter, Q4'07 through Q1'09, for Hindi-English, Thai-English, and Hungarian-English; scores rise from roughly 0.10 toward 0.22.]

Motivation

- Train 5-gram language models
- Vary the amount of training data from 13 million to 2 trillion tokens
- Data is divided into four sets; train a separate language model for each
- Use up to four language models during decoding

Motivation

[Chart: BLEU vs. LM training-data size for four data sets (target, +ldcnews, +webnews, +web); green = Kneser-Ney, blue = Stupid Backoff.]
- Growth is nearly linear within domains
- Improvement is smaller for web data, but doesn't seem to level off
- Stupid Backoff yields higher BLEU than Kneser-Ney with >8G training tokens

MapReduce Recap

- Automatic parallelization & distribution
- Fault-tolerant
- Provides status and monitoring tools
- Clean abstraction for programmers

MapReduce Recap

1. Backup map workers compensate for the failure of a map task.
2. The sorting phase requires all map jobs to complete before the reduce phase starts.

Language Models

- n-gram language models
- n-gram statistics, relative frequency

Stupid Backoff Smoothing Statistics required: counts

Stupid Backoff Smoothing

Traditional LM tools: scan text and keep a trie data structure in memory, with direct (or simple/fast) access to all necessary counts.

Scaling to >1 trillion words: MapReduce, with no in-memory access to all counts. Steps:
1. Collect unigram counts
2. Determine vocabulary
3. Collect n-gram counts
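Stupid Backoff itself fits in a few lines: it returns a relative-frequency score (not a normalized probability), backing off to the shorter context with a fixed factor of 0.4 when the full n-gram is unseen. The counts below are invented toy data:

```python
# Stupid Backoff: S(w | context) = count(context, w) / count(context)
# if seen, else 0.4 * S(w | shorter context). A score, not a probability.
counts = {  # toy n-gram counts (invented)
    ("the", "cat", "sat"): 2, ("the", "cat"): 5, ("cat", "sat"): 3,
    ("cat",): 10, ("sat",): 4, ("the",): 20,
}
TOTAL = 100   # total tokens in the toy corpus
ALPHA = 0.4   # fixed backoff factor

def score(ngram):
    if len(ngram) == 1:
        return counts.get(ngram, 0) / TOTAL
    context = ngram[:-1]
    if counts.get(ngram, 0) > 0 and counts.get(context, 0) > 0:
        return counts[ngram] / counts[context]
    return ALPHA * score(ngram[1:])   # recurse on the shorter context

print(score(("the", "cat", "sat")))  # seen trigram: 2/5
print(score(("a", "cat", "sat")))    # unseen: 0.4 * (3/10)
```

Because only raw counts are required (no discounting statistics), this is the smoothing method that parallelizes cleanly in the MapReduce pipeline described next.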

Collect Unigram Counts Mapper: input={document, }, output={word, count}

Collect Unigram Counts Reducer: input={word, {count}}, output={word, count}
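The mapper/reducer pair above can be simulated locally. This is a sketch of the MapReduce shape only (not Google's infrastructure); the sort stands in for the shuffle phase:

```python
from collections import defaultdict
from itertools import groupby

# Unigram counting in MapReduce shape:
# map emits (word, 1) per token; shuffle groups by key; reduce sums.
def mapper(doc):
    for word in doc.split():
        yield (word, 1)

def reducer(word, counts):
    return (word, sum(counts))

docs = ["a rose is a rose", "is a rose"]
pairs = sorted(kv for d in docs for kv in mapper(d))   # map + shuffle
result = dict(reducer(w, [c for _, c in grp])
              for w, grp in groupby(pairs, key=lambda kv: kv[0]))
print(result)
```

In the real system the documents, intermediate pairs, and reduce partitions are spread across machines, but the per-key contract is the same.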

Determine Vocabulary Crucial to reduce the amount of intermediate data storage Simple count cut-off, words below cutoff are mapped to <UNK> Frequent words have small IDs: good for compression

Collect N-Gram Counts

The straightforward way: similar to unigram counting, count n-grams for all orders 1, ..., n.
- For each word in the corpus, O(n) copies of it are generated in the intermediate data
- Unnecessary overhead for shuffling, sorting, etc. in MapReduce

A better way: only emit the highest-order n-grams.
- Split the training corpus into smaller, more manageable chunks and combine the counts later
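A sketch of the "better way": the mapper emits one highest-order n-gram per starting position, and lower-order counts are recovered later by summation rather than being emitted separately.

```python
# Emit only highest-order n-grams; lower orders are derived downstream.
# Sketch only: sentence-boundary padding is omitted here, and a sentence
# shorter than n is emitted whole as a single "n-gram".
def highest_order_ngrams(tokens, n):
    if len(tokens) < n:
        return [tuple(tokens)]
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

print(highest_order_ngrams("a b c d".split(), 3))
```

This keeps the intermediate data proportional to the corpus size instead of multiplying it by the number of n-gram orders.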

Collect Ngram Count Mapper: input={document, }, output={reversed_ngram, count}

Collect Ngram Count Reducer: input={reversed_ngram, {count}}, output={reversed_ngram, count}

Combine Ngram Counts

- Split input into several MapReduce jobs for scalability; combine the separate n-gram statistics.
- Catch: n-grams are processed in sequential order, with no access to other n-grams in general, and a possibly large amount of intermediate data.
- Trick: feed in n-grams in reversed word order; restore the normal n-gram order in Map; use a cache to accumulate counts in Map.

Combine Ngram Counts

Example input (reversed n-grams with counts): E D 2, E D C 3, E D C B 1, E D C A 1, F 2

Book-keeping in the cache:
- after "E D 2":     E=2, ED=2
- after "E D C 3":   E=5, ED=5, EDC=3
- after "E D C B 1": E=6, ED=6, EDC=4, EDCB=1
- after "E D C A 1": E=7, ED=7, EDC=5, EDCA=1; Output("B C D E", 1)
- after "F 2":       F=2; Output("A C D E", 1); Output("C D E", 5); Output("D E", 7); Output("E", 7)

The size of the cache is bounded by the highest n-gram order n.
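The bookkeeping above can be written out as code (a local sketch; the real version runs inside Map). Because keys arrive sorted in reversed word order, every lower-order n-gram is a prefix of the current key, so a cache of at most n entries suffices:

```python
# Accumulate prefix counts from reversed, key-sorted n-grams.
# Flushed entries are re-reversed back into normal word order.
def combine(reversed_ngrams):
    cache, out = [], []   # cache: list of (reversed_prefix_tuple, count)
    def flush(keep):      # emit every cached entry deeper than `keep`
        while len(cache) > keep:
            prefix, c = cache.pop()
            out.append((" ".join(reversed(prefix)), c))
    for key, count in reversed_ngrams:
        words = tuple(key.split())
        shared = 0        # length of the shared prefix with the cache
        while (shared < min(len(cache), len(words))
               and cache[shared][0] == words[:shared + 1]):
            shared += 1
        flush(shared)                              # prefixes that ended
        for i in range(shared, len(words)):        # new, deeper prefixes
            cache.append((words[:i + 1], 0))
        for i in range(len(words)):                # add count to all prefixes
            cache[i] = (cache[i][0], cache[i][1] + count)
    flush(0)
    return out

data = [("E D", 2), ("E D C", 3), ("E D C B", 1), ("E D C A", 1), ("F", 2)]
print(combine(data))
```

Running this on the slide's example reproduces its outputs: BCDE=1, ACDE=1, CDE=5, DE=7, E=7 (plus F=2 at the final flush).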

Combine Ngram Counts

Generate Final Distributed LM

So far, we have:
- Stupid Backoff: {ngram, count}
- Katz/Kneser-Ney: {ngram, prob/backoff}

We still need to pack these into the final distributed LM. The packaging depends on the serving architecture, a client/server approach.

Distributed LM Architecture

- Final LM data is served on multiple servers
- Each server performs lookups for multiple clients
- Saves memory: if the model is 8GB, 100 clients need 800GB of memory with local models, but only 80GB if 10 servers each serve one of 10 copies

Distributed LM Architecture

[Diagram: M servers (Server 0 ... Server M-1), each with replicas (Replica server 0 ... Replica server M-1), plus a vocabulary server with its own replica; clients (Client 0, Client 1, ...) with local caches.]
- Replicas provide redundancy and handle higher load
- Load balancers distribute load between replicas
- Clients use a user-defined hash function to decide which server to contact

Distributed LM Latency

Network latency:
- In-memory access: X hundred nanoseconds
- Network access: Y milliseconds

A typical sentence of 25 words in machine translation generates roughly 100K n-gram lookups during search, which would require >100 seconds if each probability were fetched individually. Batch processing: one batch per word position, i.e. 25 batches, completes in ~100 milliseconds.

Distributed LM

For each n-gram A B C D E, we may need:
- Stupid Backoff: P(E|ABCD), P(E|BCD), P(E|CD), P(E|D), P(E)
- Katz/Kneser-Ney: P(E|ABCD), α(ABCD), P(E|BCD), α(BCD), P(E|CD), α(CD), P(E|D), α(D), P(E)

Ideally these entries would all sit on the same server; in practice that is not possible for Katz/Kneser-Ney. Hashing n-grams by the last word (E) could work well: one network request per n-gram for Stupid Backoff, two per n-gram for Katz/Kneser-Ney. In practice, we hash n-grams by the last two words (D E) and duplicate unigram entries on every server, for better-balanced shards.
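The sharding rule can be sketched as follows. The server count and the particular hash function are illustrative choices, not the production ones; the point is that an n-gram and its backoff contexts ending in the same two words land on the same shard:

```python
import hashlib

NUM_SERVERS = 10  # illustrative shard count

def shard(ngram_words):
    # Route by the last two words (or the single word, for unigrams),
    # so P(E|ABCD), P(E|BCD), P(E|CD), P(E|D) share a server.
    key = " ".join(ngram_words[-2:])
    h = hashlib.md5(key.encode()).hexdigest()   # any stable hash works
    return int(h, 16) % NUM_SERVERS

# All of these backoff contexts map to the same shard:
print({shard(n.split()) for n in ["A B C D E", "B C D E", "C D E", "D E"]})
```

Unigrams like P(E) still hash elsewhere, which is why the slide duplicates unigram entries on every server.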

Client for Distributed LM

Typical client usage:
1. From a given set of histories, tentatively expand to the possible next word (or words): dummy lookups
2. Batch all required n-grams (and histories too, in the case of Katz/Kneser-Ney)
3. Send all batched entries to the servers
4. Wait for the servers to return probabilities (and possibly backoff weights)
5. Revisit all expansions with real probabilities and perform pruning: true lookups

Typical Decoder as Client

The decoder repeatedly switches between dummy lookups and true lookups. [Search-graph figure: legend distinguishes active nodes from pruned nodes.]

Distributed LM Data Structure

The served model's data structure can be flexible:
- Compact trie
- Block encoding
- Perfect hashing

Different applications may choose different data structures; one MapReduce job generates the served models.

Compressing Language Models

Data structures trade off between:
- Space (compression)
- Runtime (n-gram lookups)
- Accuracy (quantization, lossy compression)
- Functionality (enumeration, next word)

                  Space     Runtime   Accuracy  Functionality
  Trie            baseline  baseline  baseline  baseline
  Distributed LM  better    worse     same      same
  Quantization    better    same      worse     same
  Randomized      better    better    worse     worse

Value Quantization

Storing probabilities as floating-point numbers is expensive; only a small number of bits is needed to achieve the same quality.

[Chart: BLEU (roughly 0.40 to 0.47) vs. number of bits, for 3 to 8 bits.]

Store the negative log probability with 2^#bits different values. The experiment above uses simple, uniformly sized bins.
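A sketch of uniform quantization of negative log probabilities. The bit width and the value range clipped to are invented assumptions here; the real system tunes both:

```python
import math

BITS, LO, HI = 5, 0.0, 8.0   # assume -log10 P lies in [0, 8] (invented range)
LEVELS = 2 ** BITS           # 2^#bits representable values

def quantize(p):
    # Clip -log10(p) to the range, then snap to the nearest uniform bin.
    x = min(max(-math.log10(p), LO), HI)
    return round((x - LO) / (HI - LO) * (LEVELS - 1))

def dequantize(q):
    # Map the bin index back to a probability (the bin's representative).
    x = LO + q / (LEVELS - 1) * (HI - LO)
    return 10 ** -x

p = 0.0123
q = quantize(p)
print(q, dequantize(q))
```

The maximum error is half a bin width in log space, which is why a handful of bits already matches full floating point on the BLEU curve above.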

Machine Translation Results with Bloomier Filters

Vary the number of value bits and error bits:
- Results with 7 or more value bits are identical to full floating point
- Results with 12 or more error bits are equivalent to the lossless model

Conclusions

- Google uses Statistical Machine Translation
- There is no data like more data (but smarter algorithms help, too)
- Challenges in a distributed environment differ from those on a single core
- Approximate solutions work well in practice (especially in the presence of massive data)
- Think about the massive computation that happens the next time you use translate.google.com