Technical Report Series on Corpus Building




Technical Report Series on Corpus Building
Vol. 8 (June 2013)
Czech Corpora
Uwe Quasthoff, Dirk Goldhahn
Abteilung Automatische Sprachverarbeitung, Institut für Informatik, Universität Leipzig

Affiliation of the authors:
Uwe Quasthoff, Dirk Goldhahn: Institut für Informatik, Universität Leipzig
{quasthoff, dgoldhahn}@informatik.uni-leipzig.de

Copyright: Abteilung Automatische Sprachverarbeitung, Institut für Informatik, Universität Leipzig, http://asv.informatik.uni-leipzig.de/

Technical Report Series on Corpus Building
Vol. 1: Deutscher Wortschatz 2013
Vol. 2: Danish Corpora
Vol. 3: Dutch Corpora
Vol. 4: Icelandic Corpora
Vol. 5: Hungarian Corpora
Vol. 6: Ukrainian Corpora
Vol. 7: Indonesian Corpora
Vol. 8: Czech Corpora

This PDF document was created using the open source tool mwlib. For more information, see http://code.pediapress.com/
PDF generated at: 25 June 2013

Contents

Unless noted otherwise, each appendix below is given once per corpus, in this order: ces news 2005-2007, ces news 2008, ces news 2009, ces news 2010, ces news 2011, ces news 2012, ces newscrawl 2011, ces newscrawl 2012, ces wikipedia 2007, ces wikipedia 2012, ces web 2002, ces web 2011, ces web 2012, ces mixed 2012.

Czech corpora
  Introduction to corpus creation
  CES - a processing related language description
  CES corpora
  CES corpus comparison

Processing details
  Appendix: Database summary

Content details
  Appendix: Size of different TLDs (all corpora except the two Wikipedia corpora)
  Appendix: Size of largest domains (all corpora except the two Wikipedia corpora)
  Appendix: Number of sources by time period (ces news corpora only)

Word details
  Appendix: Words by length without multiplicity
  Appendix: Words by length with multiplicity
  Appendix: The most frequent 50 words
  Appendix: Longest words in top-1.000 by rank

Character N-gram details
  Appendix: Alphabet as used in the top-100.000 words

Abbreviation details
  Appendix: Most frequent abbreviations
  Appendix: Left neighbors of the full stop
  Appendix: Left neighbors of the full stop with additional internal full stops

Sentences details
  Appendix: Shortest sentences
  Appendix: Longest sentences
  Appendix: Length of sentences in characters
  Appendix: Length of sentences in words

Oddities details
  Appendix: Longest words
  Appendix: Sentences with high average word length
  Appendix: Problems with sentence segmentation - words ending in a stopword

Czech corpora

Introduction to corpus creation

The Leipzig Corpora Collection (LCC) collects Web-based corpora for many different languages. The main text genres are newspaper texts, Wikipedias and randomly collected web pages. All corpora are processed in the same way:

- Crawling Web pages
- HTML stripping
- Language identification
- Sentence segmentation
- Cleaning: removal of ill-formed sentences
- Duplicate removal
- Calculation of word frequencies and word co-occurrences

As a result, we have a corpus containing only well-formed sentences in the language under consideration. The sentences are in random order; hence, sharing the corpus does not violate copyright law because it is impossible to reconstruct the original texts.

The pre-processing contains both language-independent steps (like HTML stripping and duplicate removal) and language-dependent steps (like language identification and sentence segmentation). The language-specific parts in particular are vulnerable to specific processing problems. The aim of the paper is to identify possible problems and evaluate the results. The following problems are addressed:

- A processing-focused language description
- Language size: How much text is available for this language? What are the biggest sources?
- Corpus description: genre, size, crawling and processing date.
- Possible problems in language identification: Which languages are similar?
- Character set and alphabet
- Inspecting the word list: most frequent words, longer high-frequency words and the longest words overall. Word length distribution.
- Can abbreviations confuse sentence segmentation? Information about the abbreviation list.
- Inspecting sentences: inspect the shortest and longest sentences to identify possible segmentation problems. Sentence length distribution.

The paper describes the results of these inspections; the appendices show the exact results for the different corpora. This helps to compare the corpora with respect to quality. In the section Quality Overview, an overall quality description for each corpus is given. All corpora contain only minor problems which are irrelevant for most applications. Where this was not initially the case, the corpus creation was iterated.
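The later stages of the processing chain listed above (cleaning, duplicate removal, random ordering, frequency counting) can be illustrated with a minimal Python sketch. It is not the LCC toolchain: the function names, the well-formedness rule (capitalized start, final ., ! or ?, at most 256 characters) and the simple regex tokenizer are assumptions made for illustration only.

```python
import random
import re
from collections import Counter


def is_well_formed(sentence, max_chars=256):
    """Hypothetical cleaning rule: non-empty, not too long,
    starts with an uppercase letter and ends with ., ! or ?."""
    s = sentence.strip()
    return 0 < len(s) <= max_chars and s[0].isupper() and s[-1] in ".!?"


def build_corpus(sentences, seed=0):
    """Clean, remove exact duplicates, and shuffle the sentences."""
    cleaned = [s.strip() for s in sentences if is_well_formed(s)]
    unique = list(dict.fromkeys(cleaned))  # keeps the first occurrence only
    random.Random(seed).shuffle(unique)    # random order hides the source text
    return unique


def word_frequencies(sentences):
    """Count lowercased word tokens (assumed tokenizer: runs of word characters)."""
    counts = Counter()
    for s in sentences:
        counts.update(re.findall(r"\w+", s.lower()))
    return counts


raw = ["Praha je město.", "Praha je město.", "bez velkého písmene.", "Brno je město také!"]
corpus = build_corpus(raw)
frequencies = word_frequencies(corpus)
```

Here the duplicate sentence and the sentence without a capitalized first word are dropped, so two sentences remain. Near-duplicate detection and co-occurrence counting are left out to keep the sketch short.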

CES - a processing related language description

General properties of the Czech language

Native name: Čeština
Classification: Indo-European, Slavic, West, Czech-Slovak
Total number of speakers: 12M
Largest countries with number of speakers: Czech Republic (10M)
Sources: http://www.ethnologue.com, Wikipedia

Processing summary

- Latin alphabet with some additional characters
- The full stop is used as sentence boundary and for abbreviations
- Apostrophes are used rarely

Properties important for processing

Alphabet and punctuation

The alphabet is Latin based, with the following specialities (sources: http://en.wikipedia.org/wiki/Alphabets_derived_from_the_Latin and http://de.wikipedia.org/wiki/Tschechische_Sprache#Alphabet):

- Czech includes all 26 base letters and Á, Č, Ď, É, Ě, Í, Ň, Ó, Ř, Š, Ť, Ú, Ů, Ý, Ž
- In foreign words some more
- Diphthongs: au, eu, ou
- Usual Latin punctuation
- Usage of uppercase letters: at sentence beginnings and for proper names (of persons, organisations, countries etc.)

Sentence segmentation and word tokenization

Sentence beginnings
Sentences begin with a capitalized first word.

Abbreviations
Abbreviations that can be confused with sentence boundaries: a special abbreviation list has to be inspected.
Sources for abbreviations: ###
Abbreviations with a full stop may appear in the word list without the full stop.

Apostrophes
Use of apostrophes: very infrequent (???)
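Because the full stop serves both as sentence boundary and as abbreviation marker, and sentences begin with a capitalized word, a segmenter needs an abbreviation list. The following Python sketch shows the idea under stated assumptions: the abbreviation set is a small illustrative sample of common Czech abbreviations, not the list used for the corpora, and the function name and whitespace tokenization are hypothetical.

```python
# Illustrative sample only; the abbreviation list actually used is not given here.
ABBREVIATIONS = frozenset({"např.", "tzv.", "atd.", "č.", "str.", "mj."})


def split_sentences(text, abbreviations=ABBREVIATIONS):
    """Split at ., ! or ? when the token is not a known abbreviation
    and the next token starts with an uppercase letter (or text ends)."""
    tokens = text.split()
    sentences, current = [], []
    for i, tok in enumerate(tokens):
        current.append(tok)
        if tok[-1] not in ".!?":
            continue
        if tok.lower() in abbreviations:
            continue  # full stop belongs to the abbreviation
        nxt = tokens[i + 1] if i + 1 < len(tokens) else None
        if nxt is None or nxt[0].isupper():
            sentences.append(" ".join(current))
            current = []
    if current:
        sentences.append(" ".join(current))
    return sentences


text = "Viz např. Praha a Brno. To je vše."
with_list = split_sentences(text)
without_list = split_sentences(text, abbreviations=frozenset())
```

With the abbreviation list the text yields two sentences. With an empty list a spurious boundary appears after "např.", because the following word is a capitalized proper name.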

Sources and ranking (2012)

Estimated number of webpages containing text:
- Google.com top-5 words: 121.000.000 results for "a" "se" "na" "v" "je"
- Google.com top-10 words: 1.280.000 results for "a" "se" "na" "v" "je" "že" "to" "z" "s" "o"

Rank according to number of speakers (Ethnologue): 75
Rank according to Wikipedia size (see http://de.wikipedia.org/wiki/Wikipedia:Sprachen, 01/2013): rank 18 with 253.000 articles.
Rank according to number of newspapers as found by AbyZ (5/2012): 134 newspapers, rank 18.
Rank according to number of newspapers with RSS feeds (5/2012): 131 newspapers, rank 12.
Rank according to our corpus size (9/2012): 17

CES corpora

Quality Overview

Quality Ratings

A: Very good quality. Ready to use (or already used) for a frequency dictionary.
   - Size as large as possible
   - Only minimal errors
   - Multiple genres (if possible)
A-: Small problems identified. They should not affect usage.
B: Native speaker quality.
   - Information about abbreviations and sentence boundaries by native speaker
   - Resulting statistics checked by native speaker, possible errors corrected
C: Non-native speaker quality.
   - Obvious problems shown in corpus statistics are corrected
D: First version.
   - Pre-processing with default abbreviation list and default sentence boundaries
E: Poor quality: old, outdated or faulty.

Corpus Quality

The quality of the corpora differs slightly because the corpus processing toolchain changed slightly over several years. Moreover, the original data are often no longer available. Hence, improving quality often means removing incomplete or doubtful sentences. Forthcoming editions of all corpora might therefore have a slightly smaller number of sentences. This especially applies to near-duplicate sentences, which are removed only sparingly. The following table shows the quality of the corpora. Minimal errors are still possible and are described in the sections below. All possible major improvements are mentioned here.

Corpus              Quality rating  Known problems  To-dos
ces_news_2005-2007  A               -               -
ces_news_2008       A               -               -
ces_news_2009       A               -               -
ces_news_2010       A               -               -
ces_news_2011       A               -               -
ces_news_2012       A               -               -
ces_newscrawl_2011  A               -               -
ces_newscrawl_2012  A               -               -
ces_web_2002        A               -               -
ces_web_2011        A               -               -
ces_web_2012        A               -               -
ces_wikipedia_2007  A               -               -
ces_wikipedia_2012  A               -               -
ces_mixed_2012      A               -               -

Processing Overview

For more details, see Appendix: Database Summary and Appendix: Number of sources by time period.

Corpus              Size (M sentences)  Size (M running words)  Multiwords  Crawling date            Production date
ces_news_2005-2007  1.2                 18                      10.250      end of 2005-end of 2007  2010
ces_news_2008       1.9                 30                      13.314      daily 2008               2011
ces_news_2009       2.1                 33                      13.277      daily 2009               2011
ces_news_2010       2.2                 34                      12.862      daily 2010               2011
ces_news_2011       1.9                 31                      11.943      daily 2011               2012
ces_news_2012       1.9                 29                      11.727      daily 2012               2013
ces_newscrawl_2011  4.0                 65                      18.746      04/2012                  2012
ces_newscrawl_2012  4.8                 71                      20.317      04/2013                  2013
ces_web_2002        4.4                 67                      20.461      batch crawl 2002         2010
ces_web_2011        7.4                 134                     25.800      12/2010-12/2011          2012
ces_web_2012        9.4                 137                     26.015      1/2012-12/2012           2013
ces_wikipedia_2007  0.47                7.5                     22.350      10/2007                  2010
ces_wikipedia_2012  1.3                 20                      36.358      01/2012                  2012
ces_mixed_2012      37                  548                     54.872      see above                2013
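The two size columns of the processing overview allow a quick plausibility check: dividing running words by sentences gives the average sentence length in words. The sketch below does this for a few rows copied from the table. Since the table values are rounded, the ratios are only approximate, and the variable and function names are chosen for illustration.

```python
# (M sentences, M running words), copied from the processing overview table.
SIZES = {
    "ces_news_2005-2007": (1.2, 18),
    "ces_web_2011": (7.4, 134),
    "ces_wikipedia_2007": (0.47, 7.5),
    "ces_mixed_2012": (37, 548),
}


def words_per_sentence(m_sentences, m_words):
    """Average sentence length in running words (both inputs in millions)."""
    return m_words / m_sentences


averages = {
    name: round(words_per_sentence(sents, words), 1)
    for name, (sents, words) in SIZES.items()
}
```

For these four rows the rounded averages fall between roughly 15 and 18 words per sentence.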

Content Overview

For more details, see Appendix: Size of different TLDs and Appendix: Size of different domains.

Corpus              Type of sources  Countries                         Number of sources  Publishing date         Biggest source
ces_news_2005-2007  News             cs(93%), sk(7%)                   75                 mainly 3/2007-12/2007   ihned.cz/
ces_news_2008       News             cs(99%), sk(1%)                   98                 2008                    HN.IHNED.CZ/
ces_news_2009       News             cs                                78                 2009                    HN.IHNED.CZ/
ces_news_2010       News             cs                                81                 2010                    hn.ihned.cz/
ces_news_2011       News             cs                                119                2011                    zpravy.idnes.cz/
ces_news_2012       News             cs(94%), com(6%)                  105                2012                    isport.blesk.cz/
ces_newscrawl_2011  News             cs                                35                 2011 and before         www.profit.cz/
ces_newscrawl_2012  News             cs                                36                 2012 and before         www.novinky.cz/
ces_web_2002        Web              cs (100%)                         13.471             2002 and before         www.grimoar.cz/
ces_web_2011        Web              cs(92%), sk(2%), com(2%), eu(2%)  83.208             2011 and before         abc.blesk.cz/
ces_web_2012        Web              cs(93%), sk(1%), com(2%), eu(2%)  91.292             2012 and before         darren-shan.ic.cz/
ces_wikipedia_2007  Wikipedia        -                                 1                  2007 and before         wikipedia.org
ces_wikipedia_2012  Wikipedia        -                                 1                  2012 and before         wikipedia.org
ces_mixed_2012      Mixed Sources    cs(95%), sk(1%), com(1%), eu(1%)  117.779            2012 and before         HN.IHNED.CZ/

Words

Appendix: Words by Length without multiplicity shows a plot of the corresponding length distribution. A smooth asymmetric bell-shaped curve is expected.

Appendix: Words by Length with multiplicity shows a plot of the corresponding length distribution. A smooth asymmetric bell-shaped curve is expected.

Appendix: The Most Frequent 50 Words shows the most frequent stopwords as well as one or more words related to the region.

Appendix: Longest Words in Top-1000 by rank shows the 25 longest words within the top-1000. They usually give an impression of the main topics treated in the corpus.

Appendix: Longest Words with minimum frequency 2 should give an idea of very long words. In the case of processing problems, different types of non-words may appear. This might help to improve the word definition.
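The difference between the two word length distributions can be made concrete with a short Python sketch: "without multiplicity" counts each distinct word once, "with multiplicity" weights it by its frequency. The function names and the toy frequency list are assumptions for illustration; the input is expected to be a collections.Counter of word frequencies.

```python
from collections import Counter


def length_distributions(word_counts):
    """Return (without_multiplicity, with_multiplicity) length histograms."""
    without_mult = Counter()
    with_mult = Counter()
    for word, freq in word_counts.items():
        without_mult[len(word)] += 1   # each distinct word counted once
        with_mult[len(word)] += freq   # each occurrence counted
    return without_mult, with_mult


def longest_in_top(word_counts, top_n=1000, k=25):
    """The k longest words among the top_n most frequent words."""
    top = [word for word, _ in word_counts.most_common(top_n)]
    return sorted(top, key=len, reverse=True)[:k]


# Toy frequency list, not taken from any of the corpora.
counts = Counter({"a": 10, "se": 5, "na": 4, "republika": 2})
types_by_length, tokens_by_length = length_distributions(counts)
```

In the toy data, length 2 has two distinct words but nine occurrences, which is why the two plots can differ considerably: short stopwords dominate the distribution with multiplicity.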
Corpus              Word length graph     Word length graph  Most Frequent  Longest Words  Longest Words with
                    without multiplicity  with multiplicity  50 Words       in Top-1000    minimum frequency 2
ces_news_2005-2007  okay                  okay               okay           okay           URLs, missing blanks
ces_news_2008       okay                  okay               okay           okay           URLs
ces_news_2009       okay                  okay               okay           okay           URLs
ces_news_2010       okay                  okay               okay           okay           URLs, missing blanks, junk
ces_news_2011       okay                  okay               okay           okay           URLs, missing blanks
ces_news_2012       okay                  okay               okay           okay           URLs, routes, missing blanks
ces_newscrawl_2011  okay                  okay               okay           okay           Missing blanks, routes, chemicals, URLs
ces_newscrawl_2012  okay                  okay               okay           okay           URLs, missing blanks, junk, etc.
ces_web_2002        okay                  okay               okay           okay           URLs, missing blanks
ces_web_2011        okay                  okay               okay           okay           URLs, missing blanks
ces_web_2012        okay                  okay               okay           okay           Routes, missing blanks, URLs
ces_wikipedia_2007  okay                  okay               okay           okay           okay
ces_wikipedia_2012  okay                  okay               okay           okay           chemicals, URLs
ces_mixed_2012      okay                  okay               okay           okay           all of the above

Abbreviations

A full stop after a known abbreviation is usually not treated as a sentence boundary. Conversely, abbreviations missing from the list can lead to overgenerated sentence boundaries. Due to limitations in the processing chain, the list of abbreviations used for sentence boundary detection can differ from the abbreviations in the word list.

Appendix: Most Frequent Abbreviations shows possible under-generation of sentence boundaries caused by wrong abbreviations (i.e. words ending in a full stop) in the word list.

Sentences

Appendix: Shortest sentences shows the shortest declarative, exclamatory and interrogative sentences. In preprocessing, a minimum length for sentences might be specified. Missing abbreviations are often visible as faulty sentence endings.

Appendix: Longest sentences shows the longest declarative, exclamatory and interrogative sentences. Usually, the maximum sentence length is defined as 256 characters (not 256 bytes). Very long exclamatory or interrogative sentences often contain an overlooked sentence boundary.

Appendix: Length of sentences in characters shows the distribution of the sentence length. A large and balanced corpus will result in a smooth, bell-shaped curve. Isolated local maxima usually result from large sets of near-duplicate sentences.

Corpus              Shortest   Longest                               Length distribution           Length distribution
                    sentences  sentences                             (in characters)               (in words)
ces_news_2005-2007  okay       max. 255 bytes instead of characters  several near duplicate peaks  okay
ces_news_2008       okay       okay                                  okay                          okay
ces_news_2009       okay       okay                                  okay                          okay
ces_news_2010       okay       okay                                  okay                          okay
ces_news_2011       okay       okay                                  okay                          okay
ces_news_2012       okay       okay                                  several near duplicate peaks  okay
ces_newscrawl_2011  okay       okay                                  okay                          okay
ces_newscrawl_2012  okay       okay                                  okay                          okay
ces_web_2002        okay       okay                                  max. 255 bytes instead of characters  okay
ces_web_2011        okay       okay                                  okay                          okay
ces_web_2012        okay       okay                                  okay                          okay
ces_wikipedia_2007  okay       okay                                  several near duplicate peaks  several near duplicate peaks
ces_wikipedia_2012  okay       okay                                  several near duplicate peaks  okay
ces_mixed_2012      okay       okay                                  okay                          okay

Oddities

Appendix: Sentences with high average word length: Average sentences contain many stopwords, and these stopwords are usually short. Hence, they limit the average word length in a sentence. Conversely, sentences with a high average word length are often ill-formed. They may be used to improve pre-processing.

Appendix: Problems with sentence segmentation - Words ending in a stopword: If there are many ill-formed word or sentence boundaries without a blank between two words, they will generate new ill-formed words. The appendix shows the most frequent words ending in an uppercase stopword. If they are infrequent, the data are of high quality.

Corpus              Sentences with high average word length      Words ending in a stopword
ces_news_2005-2007  URLs                                         okay
ces_news_2008       URLs                                         okay
ces_news_2009       URLs, missing blanks                         Numbers like 05:00Na, maxfreq=23
ces_news_2010       URLs, missing blanks, very long words        Numbers like 05:00Na, maxfreq=26
ces_news_2011       URLs, missing blanks, very long words, junk  Numbers like 05:00Na, maxfreq=8
ces_news_2012       URLs, missing blanks, very long words, junk  maxfreq=9
ces_newscrawl_2011  URLs, missing blanks, very long words        maxfreq=7
ces_newscrawl_2012  URLs, missing blanks, very long words        maxfreq=18
ces_web_2002        URLs, often: missing blanks                  maxfreq=16
ces_web_2011        URLs, often: missing blanks, routes          maxfreq=16
ces_web_2012        URLs, missing blanks, routes                 okay
ces_wikipedia_2007  URLs, chemicals, Russian words               okay
ces_wikipedia_2012  URLs, missing blanks, Chinese                okay
ces_mixed_2012      as above                                     as above
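The sentence-level checks described in the Sentences and Oddities sections can be sketched roughly as follows. This is only an illustration under stated assumptions: whitespace tokenisation, UTF-8 encoding, and a hypothetical three-entry stopword list stand in for whatever the actual processing chain uses.

```python
def length_in_chars_and_bytes(sentence, encoding="utf-8"):
    """Return (characters, bytes); the two differ for non-ASCII text."""
    return len(sentence), len(sentence.encode(encoding))


def average_word_length(sentence):
    """Mean length of whitespace-separated tokens, 0.0 for an empty sentence."""
    words = sentence.split()
    return sum(len(w) for w in words) / len(words) if words else 0.0


def ends_in_uppercase_stopword(word, stopwords):
    """True if a capitalised stopword is glued onto the end of a longer token."""
    return any(len(word) > len(s) and word.endswith(s) for s in stopwords)


# Hypothetical stopword list, for illustration only.
STOPWORDS = ("Na", "Se", "Pro")

word = "k\u016f\u0148"  # "kůň": 3 characters, 5 bytes in UTF-8
print(length_in_chars_and_bytes(word))
print(average_word_length("www.example.cz/velmi/dlouha/adresa"))
print(ends_in_uppercase_stopword("05:00Na", STOPWORDS))
```

For Czech text, a limit applied in bytes cuts sentences earlier than the same limit applied in characters, because every accented letter takes two bytes in UTF-8. That is consistent with the "bytes instead of characters" remarks in the table above.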

CES corpus comparison

Automated Corpus comparison

For the following comparisons, these tests on the top-1000 words are performed:

- Vectors based on the frequencies of the top-1000 words are created for the analysed languages. The cosine of the angle between these vectors is computed and reported as a distance: identical languages receive a value of 0, completely distinct languages a value of 1.
- The same analysis is conducted using the frequencies of the top-1000 typical letter trigrams of the languages.

Monolingual word list comparison (top-1000 words)

As one can expect, the comparisons show:

- The different news corpora have different word lists, with a maximum distance of 0.16 (ces_newscrawl_2012 and ces_news_2011).
- The Wikipedia corpora are similar, with a maximum distance of 0.07.
- The web corpora have a maximum distance of 0.12 (ces_web_2012 and ces_web_2002).
- The mixed corpus ces_mixed_2012 holds a central position, with a maximum distance of 0.32 to the other corpora.

Multilingual word list comparison (top-1000 words)

Both the comparison of the top-1000 words and the comparison of the letter trigrams used in these words show that there are similar languages in our data, all of them members of the Slavic family. The distance of the mixed corpus to the nearest language, Slovak, is 0.54 for the words and 0.59 for the letter trigrams. Both distances are about average: the average value for the most similar language is 0.58 for trigrams.
The most similar languages based on words: Slovak, Polish, Croatian

+--------+---------------------+----------------+-------------+
| source | language_short_name | language_name  | cos_logfreq |
+--------+---------------------+----------------+-------------+
| ces    | slk                 | Slovak         | 0.541585    |
| ces    | pol                 | Polish         | 0.658309    |
| ces    | hrv                 | Croatian       | 0.744599    |
| ces    | hsb                 | Sorbian, Upper | 0.74595     |
| ces    | slv                 | Slovenian      | 0.754234    |
+--------+---------------------+----------------+-------------+

The most similar languages based on letter trigrams: Slovak, Croatian, Slovenian

+--------+---------------------+-----------------+-------------+
| source | language_short_name | language_name   | cos_logfreq |
+--------+---------------------+-----------------+-------------+
| ces    | slk                 | Slovak          | 0.595796    |
| ces    | hrv                 | Croatian        | 0.731652    |
| ces    | slv                 | Slovenian       | 0.75994     |
| ces    | srp-latn            | Serbian (Latin) | 0.764132    |
| ces    | pol                 | Polish          | 0.773368    |
+--------+---------------------+-----------------+-------------+
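The comparison described above can be approximated with a short sketch. The report does not give the exact formula, so two points are assumptions here: that the reported value is 1 minus the cosine similarity (so that identical lists score 0), and that frequencies are log-scaled, as the column name cos_logfreq suggests. The function names are hypothetical.

```python
import math


def top_n(freq, n=1000):
    """Keep the n most frequent items of an item -> count mapping."""
    ranked = sorted(freq.items(), key=lambda kv: kv[1], reverse=True)
    return dict(ranked[:n])


def cosine_distance(freq_a, freq_b, use_log=True):
    """1 - cosine similarity of two item -> count mappings (assumed definition)."""
    def weight(f):
        # log(1 + f) keeps absent items (f = 0) at weight 0.
        return math.log(1 + f) if use_log else float(f)

    vocab = set(freq_a) | set(freq_b)
    a = [weight(freq_a.get(w, 0)) for w in vocab]
    b = [weight(freq_b.get(w, 0)) for w in vocab]
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm if norm else 1.0


# Toy frequency lists, for illustration only.
czech = top_n({"a": 50, "se": 30, "na": 25, "je": 20})
slovak = top_n({"a": 45, "sa": 28, "na": 24, "je": 18})
print(cosine_distance(czech, slovak))
```

The same function works for letter trigrams if the mappings go from trigram to count instead of from word to count.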

Processing details

Appendix to ces news 2005-2007: Database summary

Values for some general parameters

Parameter                                 Value
Number of sentences                       1180422
Number of running word forms              18487116
Number of distinct word forms             580861
Number of multiwords                      10250
Percentage of words with frequency=1      45.4012
Number of sentence based co-occurrences   5100404
Number of neighbour co-occurrences        664587

Appendix to ces news 2008: Database summary

Values for some general parameters

Parameter                                 Value
Number of sentences                       1964452
Number of running word forms              30490884
Number of distinct word forms             679842
Number of multiwords                      13314
Percentage of words with frequency=1      43.8117
Number of sentence based co-occurrences   8167016
Number of neighbour co-occurrences        1016121
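As a rough illustration of how some of the parameters in these summaries relate to each other, the following sketch computes them from tokenised sentences. It assumes that "Percentage of words with frequency=1" is taken relative to the number of distinct word forms; the report does not state this explicitly. Multiwords and co-occurrences are left out, and the function name is hypothetical.

```python
from collections import Counter


def database_summary(sentences):
    """Compute basic corpus parameters from a list of token lists."""
    freq = Counter(token for sentence in sentences for token in sentence)
    # Word forms that occur exactly once in the whole corpus.
    hapax = sum(1 for n in freq.values() if n == 1)
    return {
        "Number of sentences": len(sentences),
        "Number of running word forms": sum(freq.values()),
        "Number of distinct word forms": len(freq),
        "Percentage of words with frequency=1":
            100.0 * hapax / len(freq) if freq else 0.0,
    }


# Toy corpus, for illustration only.
corpus = [["to", "je", "to"], ["je", "tam", "pes"]]
for name, value in database_summary(corpus).items():
    print(name, value)
```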

Appendix to ces news 2009: Database summary

Values for some general parameters

Parameter                                 Value
Number of sentences                       2114181
Number of running word forms              32911080
Number of distinct word forms             695997
Number of multiwords                      13277
Percentage of words with frequency=1      43.8546
Number of sentence based co-occurrences   9107518
Number of neighbour co-occurrences        1092894

Appendix to ces news 2010: Database summary

Values for some general parameters

Parameter                                 Value
Number of sentences                       2154346
Number of running word forms              33990535
Number of distinct word forms             750971
Number of multiwords                      12862
Percentage of words with frequency=1      44.7730
Number of sentence based co-occurrences   9945368
Number of neighbour co-occurrences        1152708

Appendix to ces news 2011: Database summary

Values for some general parameters

Parameter                                 Value
Number of sentences                       1909971
Number of running word forms              30748072
Number of distinct word forms             702495
Number of multiwords                      11943
Percentage of words with frequency=1      44.5590
Number of sentence based co-occurrences   9853154
Number of neighbour co-occurrences        1089863

Appendix to ces news 2012: Database summary

Values for some general parameters

Parameter                                                 Value
Number of sentences                                       1940717
Number of running word forms                              28661212
Number of distinct word forms                             680532
Number of multiwords                                      11727
Percentage of words with frequency=1                      44.9491
Number of sentence based co-occurrences                   7753538
Number of neighbour co-occurrences                        960442
Number of distributional similar word pairs (NOT READY)   0
Number of similar sentence pairs (NOT READY)              0

Appendix to ces newscrawl 2011: Database summary

Values for some general parameters

Parameter                                 Value
Number of sentences                       4073054
Number of running word forms              64752481
Number of distinct word forms             1271178
Number of multiwords                      18746
Percentage of words with frequency=1      49.8621
Number of sentence based co-occurrences   19223320
Number of neighbour co-occurrences        2074562

Appendix to ces newscrawl 2012: Database summary

Values for some general parameters

Parameter                                 Value
Number of sentences                       4847073
Number of running word forms              71216871
Number of distinct word forms             1398252
Number of multiwords                      20317
Percentage of words with frequency=1      50.9530
Number of sentence based co-occurrences   17637540
Number of neighbour co-occurrences        2098380

Appendix to ces wikipedia 2007: Database summary

Values for some general parameters

Parameter                                 Value
Number of sentences                       468368
Number of running word forms              7468623
Number of distinct word forms             498904
Number of multiwords                      22350
Percentage of words with frequency=1      52.4802
Number of sentence based co-occurrences   1901412
Number of neighbour co-occurrences        261746

Appendix to ces wikipedia 2012: Database summary

Values for some general parameters

Parameter                                 Value
Number of sentences                       1270501
Number of running word forms              20126736
Number of distinct word forms             923691
Number of multiwords                      36358
Percentage of words with frequency=1      52.7002
Number of sentence based co-occurrences   5137322
Number of neighbour co-occurrences        677076

Appendix to ces web 2002: Database summary

Values for some general parameters

Parameter                                 Value
Number of sentences                       4402068
Number of running word forms              66959742
Number of distinct word forms             1753989
Number of multiwords                      20461
Percentage of words with frequency=1      52.4787
Number of sentence based co-occurrences   17884722
Number of neighbour co-occurrences        2125253

Appendix to ces web 2011: Database summary

Values for some general parameters

Parameter                                 Value
Number of sentences                       6581096
Number of running word forms              100087964
Number of distinct word forms             2038980
Number of multiwords                      24793
Percentage of words with frequency=1      52.8270
Number of sentence based co-occurrences   25451492
Number of neighbour co-occurrences        2885270

Appendix to ces web 2012: Database summary

Values for some general parameters

Parameter                                 Value
Number of sentences                       7421871
Number of running word forms              111154082
Number of distinct word forms             2200205
Number of multiwords                      25045
Percentage of words with frequency=1      53.3842
Number of sentence based co-occurrences   27948258
Number of neighbour co-occurrences        3146617

Appendix to ces mixed 2012: Database summary

Values for some general parameters

Parameter                                 Value
Number of sentences                       31524632
Number of running word forms              487761861
Number of distinct word forms             4860819
Number of multiwords                      53539
Percentage of words with frequency=1      54.0793
Number of sentence based co-occurrences   109930766
Number of neighbour co-occurrences        11172425

Content details

Appendix to ces news 2005-2007: Size of different TLDs

TLDs larger than 1%

TLD    # of sources  %
.cz/   66938         93.00
.sk/   4984          6.92

Appendix to ces news 2008: Size of different TLDs

TLDs larger than 1%

TLD    # of sources  %
.cz/   118161        98.61
.sk/   1597          1.33
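A table of this kind can be derived from a list of source host names along the following lines. This is a sketch under assumptions: each source is given as a host name with a trailing slash, as in the domain lists of this report, and the TLD is simply the last dot-separated label. The function names and the sample data are hypothetical.

```python
from collections import Counter


def tld_of(source):
    """Return the TLD of a host-style source, e.g. 'www.blesk.cz/' -> '.cz/'."""
    host = source.rstrip("/").split("/")[0]
    return "." + host.rsplit(".", 1)[-1].lower() + "/"


def tld_shares(sources, min_percent=1.0):
    """List (TLD, count, percentage) for all TLDs above min_percent."""
    counts = Counter(tld_of(s) for s in sources)
    total = sum(counts.values())
    return [
        (tld, n, 100.0 * n / total)
        for tld, n in counts.most_common()
        if 100.0 * n / total > min_percent
    ]


# Toy source list, for illustration only.
sources = ["www.blesk.cz/"] * 3 + ["HNonline.sk/"]
for tld, n, percent in tld_shares(sources):
    print(f"{tld:<6} {n:<8} {percent:.2f}")
```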

Appendix to ces news 2009: Size of different TLDs

TLDs larger than 1%

TLD    # of sources  %
.cz/   124324        99.73

Appendix to ces news 2010: Size of different TLDs

TLDs larger than 1%

TLD    # of sources  %
.cz/   141509        99.85

Appendix to ces news 2011: Size of different TLDs

TLDs larger than 1%

TLD    # of sources  %
.cz/   151144        99.56

Appendix to ces news 2012: Size of different TLDs

TLDs larger than 1%

TLD    # of sources  %
.cz/   112567        94.03
com/   7132          5.96

Appendix to ces newscrawl 2011: Size of different TLDs

TLDs larger than 1%

TLD    # of sources  %
.cz/   284115        100.00

Appendix to ces newscrawl 2012: Size of different TLDs

TLDs larger than 1%

TLD    # of sources  %
.cz/   288877        100.00

Appendix to ces web 2002: Size of different TLDs

TLDs larger than 1%

TLD    # of sources  %
.cz/   13471         100.00

Appendix to ces web 2011: Size of different TLDs

TLDs larger than 1%

TLD    # of sources  %
.cz/   688666        92.39
.eu/   14169         1.90
.sk/   13776         1.85
com/   13095         1.76

Appendix to ces web 2012: Size of different TLDs

TLDs larger than 1%

TLD    # of sources  %
.cz/   704776        92.59
com/   17167         2.26
.eu/   17089         2.24
.sk/   8105          1.06

Appendix to ces mixed 2012: Size of different TLDs

TLDs larger than 1%

TLD    # of sources  %
.cz/   2352875       95.62
com/   33470         1.36
.eu/   27191         1.11

Appendix to ces news 2005-2007: Size of largest domains

Largest domains

Source                     # of sentences
ihned.cz/                  252872
www.halonoviny.cz/         186261
HN.IHNED.CZ/               178129
zpravy.idnes.cz/           88880
www.blisty.cz/             82650
EKONOM.IHNED.CZ/           56186
HNonline.sk/               50817
www.financninoviny.cz/     47354
pes.eunet.cz/              35674
www.mobilmania.cz/         28825

# of distinct sources      75

Appendix to ces news 2008: Size of largest domains

Largest domains

Source                     # of sentences
HN.IHNED.CZ/               369105
www.halonoviny.cz/         331577
zpravy.idnes.cz/           291675
deniksport.blesk.cz/       183546
www.blesk.cz/              173152
www.blisty.cz/             127803
www.financninoviny.cz/     112839
ihned.cz/                  87217
Domaci.iHNed.cz/           41656
Zahranicni.iHNed.cz/       39427

# of distinct sources      98

Appendix to ces news 2009: Size of largest domains

Largest domains

Source                     # of sentences
HN.IHNED.CZ/               355253
zpravy.idnes.cz/           315590
www.halonoviny.cz/         309624
deniksport.blesk.cz/       270186
www.blesk.cz/              200648
www.blisty.cz/             119844
www.financninoviny.cz/     118557
EKONOM.IHNED.CZ/           68067
Domaci.iHNed.cz/           56867
Sport.iHNed.cz/            47277

# of distinct sources      78

Appendix to ces news 2010: Size of largest domains

Largest domains

Source                     # of sentences
hn.ihned.cz/               424157
zpravy.idnes.cz/           343312
www.blesk.cz/              225124
www.blisty.cz/             134746
www.financninoviny.cz/     133354
www.halonoviny.cz/         126680
deniksport.blesk.cz/       125441
EKONOM.IHNED.CZ/           100145
isport.blesk.cz/           95268
Domaci.iHNed.cz/           75706

# of distinct sources      81

Appendix to ces news 2011: Size of largest domains

Largest domains

Source                     # of sentences
zpravy.idnes.cz/           325754
isport.blesk.cz/           248174
rssfeeds.ihned.cz/         203344
www.blesk.cz/              186302
HN.IHNED.CZ/               158698
www.financninoviny.cz/     139827
www.blisty.cz/             118766
byznys.ihned.cz/           78120
zpravy.ihned.cz/           74046
sport.ihned.cz/            73746

# of distinct sources      119

Appendix to ces news 2012: Size of largest domains

Largest domains

Source                     # of sentences
isport.blesk.cz/           223166
zpravy.idnes.cz/           210435
zpravy.ihned.cz/           190895
www.blesk.cz/              148323
www.financninoviny.cz/     144899
HN.IHNED.CZ/               120971
byznys.ihned.cz/           112814
www.blisty.cz/             105750
sport.ihned.cz/            95252
idnes.cz.feedsportal.com/  91465

# of distinct sources      105

Appendix to ces newscrawl 2011: Size of largest domains

Largest domains

Source                     # of sentences
www.profit.cz/             658633
ekonom.ihned.cz/           643031
www.blesk.cz/              472907
www.novinky.cz/            463174
www.rozhlas.cz/            401231
www.iprima.cz/             339979
plzensky.denik.cz/         213952
www.tyden.cz/              188189
www.denik.cz/              125213
www.pressweb.cz/           107812

# of distinct sources      35

Appendix to ces newscrawl 2012: Size of largest domains

Largest domains

Source                     # of sentences
www.novinky.cz/            846859
www.rozhlas.cz/            751087
www.denik.cz/              509684
www.iprima.cz/             509423
plzensky.denik.cz/         340130
www.blesk.cz/              327380
ekonom.ihned.cz/           316357
www.pressweb.cz/           185080
www.lidovky.cz/            166900
www.tyden.cz/              156609

# of distinct sources      36

Appendix to ces web 2002: Size of largest domains

Largest domains

Source                     # of sentences
www.grimoar.cz/            39963
www.env.cebin.cz/          36147
www.regionalist.cz/        29871
www.musicpage.cz/          28566
www.harry.cz/              28399
krystal.op.cz/             25860
www.automa.cz/             24608
www.radioservis-as.cz/     20718
www.baraka.cz/             19823
osz.cmkos.cz/              19688

# of distinct sources      13471

Appendix to ces web 2011: Size of largest domains

Largest domains

Source                     # of sentences
abc.blesk.cz/              149583
abicko.avcr.cz/            58200
www.automatizace.cz/       22526
sw.gurroa.cz/              11893
osz.cmkos.cz/              9179
www.hutka.cz/              9045
www.skyfly.cz/             8831
www.zitova.cz/             8678
www.jesuit.cz/             8370
www.chorvatsko.cz/         7650

# of distinct sources      78437

Appendix to ces web 2012: Size of largest domains

Largest domains

Source                     # of sentences
darren-shan.ic.cz/         12594
sw.gurroa.cz/              11381
oficialnistranky.cz/       9019
www.zitova.cz/             7926
synopse.startrek.cz/       7167
www.hutka.cz/              6289
www.chorvatsko.cz/         6257
www.dvs.cz/                6220
www.hcjb.cz/               6180
web.meulovo.cz/            5980

# of distinct sources      87781

Appendix to ces mixed 2012: Size of largest domains

Largest domains

Source                     # of sentences
(not given)                1403681
HN.IHNED.CZ/               1396933
zpravy.idnes.cz/           1386734
www.blesk.cz/              1153560
www.novinky.cz/            920413
www.rozhlas.cz/            911077
www.halonoviny.cz/         838714
www.financninoviny.cz/     625736
EKONOM.IHNED.CZ/           614566
www.blisty.cz/             598923

# of distinct sources      112486

Appendix to ces news 2005-2007: Number of sources by time period

Number of sources by year, month, and day

Number of sources per year

year     # of sources  %
2007     70541         98.01

Number of sources per month

month    # of sources  %
2007-03  4682          6.51
2007-04  5049          7.02
2007-05  7837          10.89
2007-06  8182          11.37
2007-07  5342          7.42
2007-08  9287          12.90
2007-09  6940          9.64
2007-10  8719          12.11
2007-11  7909          10.99
2007-12  6594          9.16

Appendix to ces news 2008: Number of sources by time period

Number of sources by year, month, and day

Number of sources per year

year     # of sources  %
2008     119822        100.00

Number of sources per month

month    # of sources  %
2008-01  9230          7.70
2008-02  7197          6.01
2008-03  8948          7.47
2008-04  9372          7.82
2008-05  9019          7.53
2008-06  10926         9.12
2008-07  11006         9.19
2008-08  10870         9.07
2008-09  10990         9.17
2008-10  11238         9.38
2008-11  10650         8.89
2008-12  10376         8.66

Appendix to ces news 2009: Number of sources by time period

Number of sources by year, month, and day

Number of sources per year

year     # of sources  %
2009     124661        100.00

Number of sources per month

month    # of sources  %
2009-01  11480         9.21
2009-02  10899         8.74
2009-03  11591         9.30
2009-04  11206         8.99
2009-05  10750         8.62
2009-06  10810         8.67
2009-07  9650          7.74
2009-08  9697          7.78
2009-09  10585         8.49
2009-10  11049         8.86
2009-11  7155          5.74
2009-12  9789          7.85

Appendix to ces news 2010: Number of sources by time period

Number of sources by year, month, and day

Number of sources per year

year     # of sources  %
2010     141726        100.00

Number of sources per month

month    # of sources  %
2010-01  10941         7.72
2010-02  9541          6.73
2010-03  10922         7.71
2010-04  10180         7.18
2010-05  9264          6.54
2010-06  9821          6.93
2010-07  9497          6.70
2010-08  20936         14.77
2010-09  14054         9.92
2010-10  12548         8.85
2010-11  12044         8.50
2010-12  11978         8.45

Appendix to ces news 2011: Number of sources by time period

Number of sources by year, month, and day

Number of sources per year

year     # of sources  %
2011     151807        100.00

Number of sources per month

month    # of sources  %
2011-01  12754         8.40
2011-02  12172         8.02
2011-03  12396         8.17
2011-04  13242         8.72
2011-05  12895         8.49
2011-06  10990         7.24
2011-07  11982         7.89
2011-08  14006         9.23
2011-09  13964         9.20
2011-10  13646         8.99
2011-11  13939         9.18
2011-12  9821          6.47

Appendix to ces news 2012: Number of sources by time period

Number of sources by year, month, and day

Number of sources per year

year     # of sources  %
2012     119720        100.00

Number of sources per month

month    # of sources  %
2012-01  10313         8.61
2012-02  9728          8.13
2012-03  10194         8.51
2012-04  9627          8.04
2012-05  9069          7.58
2012-06  10022         8.37
2012-07  10317         8.62
2012-08  10338         8.64
2012-09  9475          7.91
2012-10  11159         9.32
2012-11  10468         8.74
2012-12  9010          7.53

Word details

Appendix to ces news 2005-2007: Words by length without multiplicity

Percentage of words of fixed length in characters, counted without multiplicity

Average word length  8.6512

word length  percentage
1            0.0303
2            0.2577
3            1.5527
4            3.7561
5            7.2599
6            10.2580
7            13.4530
8            14.9797
9            14.2247
10           11.6258
11           8.4237
12           5.6885
13           3.7248
14           2.3295
15           1.4544
16           0.8926
17           0.5905
18           0.3815
19           0.2551
20           0.1802
21           0.1229
22           0.0878
23           0.0615
24           0.0406
25           0.0267
26           0.0213
27           0.0172
28           0.0138
29           0.0114
30           0.0102
31           0.0065
32           0.0067
33           0.0048
34           0.0041
35           0.0036
36           0.0036
37           0.0028
38           0.0026
39           0.0031
40           0.0009
41           0.0015
42           0.0010
43           0.0007
44           0.0003
45           0.0009
46           0.0005
47           0.0002
48           0.0002
49           0.0003
50           0.0002

Appendix to ces news 2008: Words by length without multiplicity

Percentage of words of fixed length in characters, counted without multiplicity

Average word length  8.6512

word length  percentage
1            0.0219
2            0.2562
3            1.5304
4            3.7354
5            7.3829
6            10.2906
7            13.5971
8            14.9076
9            14.1834
10           11.5650
11           8.3600
12           5.6757
13           3.6386
14           2.3635
15           1.4980
16           0.9476
17           0.6372
18           0.4205
19           0.2795
20           0.1955
21           0.1291
22           0.0968
23           0.0621
24           0.0447
25           0.0304
26           0.0238
27           0.0190
28           0.0153
29           0.0129
30           0.0088
31           0.0069
32           0.0056
33           0.0041
34           0.0029
35           0.0029
36           0.0021
37           0.0013
38           0.0012
39           0.0013
40           0.0006
41           0.0009
42           0.0010
43           0.0007
44           0.0001
45           0.0003
46           0.0006
47           0.0004
48           0.0004
49           0.0001
50           0.0003

Appendix to ces news 2009: Words by length without multiplicity

Percentage of words of fixed length in characters, counted without multiplicity

Average word length  8.6738

word length  percentage
1            0.0257
2            0.2483
3            1.5323
4            3.7438
5            7.3252
6            10.1683
7            13.5597
8            14.8502
9            14.0771