Focused web crawling in the acquisition of comparable corpora
|
|
- Drahomíra Sedláková
- před 6 lety
- Počet zobrazení:
Transkript
1 Inf Retrieval (2008) 11: DOI 1007/s Focused web crawling in the acquisition of comparable corpora Tuomas Talvensaari Æ Ari Pirkola Æ Kalervo Järvelin Æ Martti Juhola Æ Jorma Laurikkala Received: 4 July 2007 / Accepted: 25 February 2008 / Published online: 15 March 2008 Ó Springer Science+Business Media, LLC 2008 Abstract Cross-Language Information Retrieval (CLIR) resources, such as dictionaries and parallel corpora, are scarce for special domains. Obtaining comparable corpora automatically for such domains could be an answer to this problem. The Web, with its vast volumes of data, offers a natural source for this. We experimented with focused crawling as a means to acquire comparable corpora in the genomics domain. The acquired corpora were used to statistically translate domain-specific words. The same words were also translated using a high-quality, but non-genomics-related parallel corpus, which fared considerably worse. We also evaluated our system with standard information retrieval (IR) experiments, combining statistical translation using the Web corpora with dictionary-based translation. The results showed improvement over pure dictionary-based translation. Therefore, mining the Web for comparable corpora seems promising. Keywords Cross-language information retrieval Focused crawling Comparable corpora 1 Introduction In Cross-Language Information Retrieval (CLIR), the aim is to find documents that are written in a language different from the query. Consequently, besides the usual information retrieval (IR) issues, in CLIR one has to address the problem of crossing the language barrier. Usually, the query is translated from the source language into the target language, i.e., the language of the documents, after which a normal monolingual retrieval process can take place. The query translation approaches can be categorized according to the linguistic resources employed. The main approaches use either machine-readable dictionaries, T. Talvensaari (&) M. Juhola J. Laurikkala Department of Computer Sciences, University of Tampere, Kanslerinrinne 1, Tampere 33014, Finland tuomas.talvensaari@cs.uta.fi A. Pirkola K. Järvelin Department of Information Studies, University of Tampere, Kanslerinrinne 1, Tampere 33014, Finland
2 428 Inf Retrieval (2008) 11: machine translation (MT) systems, fuzzy cognate matching, multilingual corpora, or a combination of these resources. In dictionary-based translation, the source language query keys are replaced by their target language counterparts in a dictionary. This seems straightforward, but multiple translation alternatives may introduce ambiguity into the resulting target language query. In addition, dictionaries are limited in scope, often missing crucial query vocabulary, such as proper nouns and domain-specific terminology. Naturally, such shortcomings can severely impair query performance (Pirkola et al. 2001). MT systems aim to produce readable, grammatically correct translations. However, queries are often just lists of words, and using sophisticated natural language processing techniques on them would seem out of proportion. Similarly to dictionary-based translation, domain-specific vocabulary is often missing from MT systems. On the other hand, MT systems often have ways to guess (e.g. by weighting) the most probable translation candidate, which decreases translation ambiguity. Fuzzy cognate matching is the least resource-laden of the mentioned techniques. Proper nouns and technical terms often vary only slightly between languages, and rather simple techniques, such as n-gram matching, can be used to translate such words. Also, transformation rules can be learned and used to capture stereotypical variation between languages. The English German word pair construction konstruktion is a typical example of such variation. However, fuzzy matching is usually inadequate when used alone (save maybe for languages that are closely related, such as Swedish and Norwegian). More often it is used as a complementary technique (Pirkola et al. 2006). In approaches based on multilingual corpora, the translation knowledge is extracted statistically from the corpora used. These methods can further be categorized based on the relatedness of the corpora. A parallel corpus consists of document pairs that are more or less exact translations of each other. In a comparable corpus, the document pairs are not exact translations but have similar vocabulary (Sheridan and Ballerini 1996). The aligned documents can be, e.g., accounts of the same news event written independently in different countries. Naturally, the most reliable translation knowledge is obtained from large parallel corpora, such as the Canadian Hansard corpus (Gale and Church 1991) or the JRC-Acquis corpus of EU legislation (Steinberger et al. 2006). However, such collections are relatively rare, and they are often not available for particular domains. Moreover, CLIR resources in general are scarce for special domains. For this reason, the acquisition and use of comparable corpora in domain specific CLIR is an appealing idea. The Web, with its vast volumes of data in almost any domain and language, is a natural resource for corpus-based CLIR. In this paper, we experiment with focused Web crawling as a means to build domain-specific comparable corpora. To our knowledge, such experiments have not previously been published. Focused crawling refers to the acquisition of material specific to a given subject from the Web, taking advantage of its hyperlink structure (Chakrabarti et al. 1999). The domain of choice for our experiments is genomics, a fast-growing field with a fast-growing vocabulary. Cross-lingual resources for such a domain would have to keep up with the pace of the field, and we think our method has potential in this respect as well. We aim to show that: it is possible to mine comparable texts in predefined domains and languages the gathered texts can be aligned and the alignments can be employed as a similarity thesaurus it is possible to derive good quality translations from the alignments
3 Inf Retrieval (2008) 11: a domain-specific comparable Web corpus provides better translations than a generalpurpose parallel corpus, even if the latter is of much higher alignment quality the system can achieve competitive CLIR performance when used together with other resources In Sect. 3 we introduce our focused crawler that was used to gather genomics-specific text in English, Spanish, and German. In Sect. 4, a brief overview of some of the tools used in the study, is presented. The crawled text was aligned at paragraph level Spanish and German paragraphs were aligned with the English ones. This procedure is explained in Sect. 5. The alignments were employed to extract statistical translation knowledge. This was done with our Comparable Corpus Translation program, Cocot (Talvensaari et al. 2007), which is introduced in Sect. 6. We performed IR experiments based on the genomics track of the 2004 TREC conference (Hersh 2005). Several translation system setups and translation approaches were used in the tests, which are described in Sect Section 7.2 gives a more in-depth analysis of the quality of the translations provided by Cocot. We translated individual domain-specific words from the source languages (Spanish and German) into the target language (English), and compared the quality of the translations to that achieved with the help of the JRC-Acquis (Steinberger et al. 2006) parallel corpus. Section 8 provides a discussion on how well the fore-mentioned aims were achieved. 2 Previous work Most of the research on the automatic creation of comparable corpora has involved established research corpora, such as the collections of TREC and CLEF conferences (Sheridan and Ballerini 1996; Braschler and Schauble 1998). Surprisingly few studies exist on the acquisition of such corpora from the Web. (Cheng et al. 2004) note that Comparable corpora are far easier to obtain; however, how to automatically gather appropriate comparable corpora from the Web is still a challenging task. Utsuro et al. (2002) come closest to this. They collected Japanese and English news articles and aligned those having matching dates. Unfortunately, date-based alignment is not generally applicable, because the assumption that articles published on the same day report on the same events does not hold in the Web in general. Hassan et al. (2007) derived named entity (i.e. proper noun) translations from comparable and parallel corpora. They used a sophisticated method for aligning comparable texts that was based on word clustering. They did not, however, address the problem of acquiring the comparable corpora. Steinberger et al. (2005) use thesauri and named entities to cluster news documents crosslingually. However, they do not apply the clusters to CLIR, but to browsing and information extraction. Parallel corpora, on the other hand, have been mined from the Web. Multilingual news services and web sites of multinational corporations are only some examples of parallel content (Nie et al. 1999; Resnik 1999; Yang and Li 2004). However, the more specific the domain of interest, the harder it is to find parallel pages in the Web. The present work is an extension of our previous work and it differs from the previous one (Talvensaari et al. 2007) in these respects:
4 430 Inf Retrieval (2008) 11: The comparable corpora are mined from the Web. Previously, we used readily available IR test corpora. This is a profound extension that greatly improves the portability of our method. We concentrate on a specific domain. Additional CLIR resources are especially needed for domain-specific vocabulary. The alignments are made on paragraph-level, not document-to-document. Unlike in the news domain, the alignments could not be made based on the dates of the documents. We compare our system s CLIR performance to more other CLIR approaches than previously. Hence, the experiment setup is more competitive. 3 Focused crawling The process of focused crawling can be summarized as follows (Cho et al. 1998; Bra et al. 1994; Chakrabarti et al. 1999). At the start of the process a set of seed URLs are inserted into a queue. One by one, the URL at the head of the queue is removed, and the web page pointed to by the URL is retrieved. The page is processed in some manner (e.g. it can be scanned to build an index for a search engine) and the out-links of the page are extracted and inserted into the queue. The queue can be prioritized, for example, based on how well the anchor text of the links matches against a driver query that consists of words of the wanted domain. The driver query is meant to steer the crawler to pages whose content is relevant to the domain in question. Crawling continues until the URL queue is empty or the process is interrupted. In our experiments, the crawler collected domain-specific text in some predefined languages to be used in statistical translation. This brings forth special requirements for the implementation of the crawler: 1. In statistical translation, it is essential that the words to be translated appear in their natural contexts; random lists of words cannot be used as sources of translation knowledge. However, web pages often contain lots of noise from the domain s point of view e.g. links to out-of-domain pages or personal contact information. For this reason, our crawler extracts text paragraphs from the retrieved pages entire pages are not used. Also, we made the alignments at the paragraph level, because this level of granularity seems appropriate for statistical translation. A lengthy web page can handle various topics, whereas a paragraph is most often a concise expression of a single idea. 2. Unlike in many focused crawling applications, we are not that interested in the popularity or importance of the encountered pages; we only need some specific words appearing in their contexts. Consequently, our crawler need not to use wellknown measures such as PageRank to evaluate the pages. Next, we describe the functioning of our crawler in detail. The crawler was coded in Perl. Figure 1 provides an outline of the crawler. 3.1 Acquiring the seed URLs Prior to the actual crawling phase, we semi-automatically collected domain-specific vocabulary in all of the three languages. The vocabularies play a central role in the
5 Inf Retrieval (2008) 11: Fig. 1 The crawling process crawling process they are used in acquiring the seed URLs and as driver queries to filter domain-specific content in the actual crawling process, as seen in Fig. 1. The vocabularies were acquired with simple Google searches, such as, for English, ðbiotechnology OR genomicsþ AND ðvocabulary OR lexiconþ: The word lists were extracted from the found pages, and, for each language, we constructed a combined word list that had the words sorted according to decreasing frequency in the lists. This phase involved manual work, since the lists had to be extracted from the non-uniformly coded web pages. It should be stressed that the lists were collected independently for each language, and it took roughly one working day from a non-domain-expert (the first author) to collect them. We constructed Boolean search phrases from the acquired word lists, to search for the seed URLs. Each query consisted of two parts connected with the AND operator: a constant context facet and a varying set of domain words that were chosen based on their frequency. The context facet provided the correct context for the query; it contained the ten most frequent vocabulary words. (In the English queries, the context facet was ðgene OR dna OR pcr OR mutation OR karyotype OR genotype OR genome OR trans location OR translation OR transcriptionþ:þ The second set also contained 10 words which were connected with the OR operator. In the first query it consisted of words from the ranks 11 to 20 in the frequency-sorted domain lexicon. For the second query, the set included words from the ranks 21 to 30, and so on. A total of 50 queries were constructed and run for each language. The aim of this procedure was to use a wide array of domain words as search keys, in order to find lots of prospective seed URLs the order in which they were picked was not important. Shorter queries were also needed due to technical limitations. There is a limit to effective query length in Google (2006): query keys in excess of 32 key did not affect the results at all, they were presumably ignored by the search engine. After the queries were executed with Google, the retrieved URLs (no more than 1,000 for each query) were scored. The more times a URL appeared in the results, and the higher it ranked, the more points it scored. After this, the scores were combined host-wise, that is, each host s score was the sum of its indivual pages URLs. The hosts were sorted according to their score. For the highest-scoring hosts (a few dozen per language), the page
6 432 Inf Retrieval (2008) 11: that had the highest individual score was chosen as a seed URL. Note that the root of a host (e.g. could not be automatically chosen as a seed URL, since the domain-specific content might be only a small proportion of the overall content of a host. The described approach ensured that there was at most one seed URL for each host. It is probable, and certainly preferable from a user s point of view, that other domain-specific pages are reachable through the link structure of the site. 3.2 The crawl For each web page encountered, its text paragraphs were extracted (see Fig. 1). This was done by examining the intended lay-out of the page (as laid out by Perl s HTML :: FormatText module), not the HTML mark-up. A text segment that spanned more than one line and contained three or more sentences was considered a paragraph. Sentences, on the other hand, were defined as character sequences that start with an upper-case letter and end with one of the punctuation symbols. An array of miscellaneous heuristics was applied to prune exceptions to this simple rule. For example, in George W. Bush, George W. was not considered a sentence. The language of each paragraph was detected with a simple n-gram-based algorithm (Cavnar and Trenkle 1994). If the paragraph was in one of the sought-for languages, it was matched against the driver query of the particular language. The queries consisted of about 300 domain words acquired earlier. If the match score exceeded a threshold, the paragraph was saved to disk. The threshold was decided by manually sampling brief test crawls with varying thresholds. The score was simply the proportion of domain words (words in the driver query) versus the total number of words in the paragraph. A more sophisticated tf.idf score could also be used, but this would require collection-wide statistics. They could be incorporated from some readily available corpus, and accumulated as the crawl would progress. The out-links of the page were extracted and scored. The score of a link l, which is located on page P is calculated as follows: scoreðlþ ¼w a qðaðlþþ þ w p qðpþþw h qðhostðpþþ; where a(l) is the anchor text of link l, host(p) is the set of pages visited so far that have the same host as P, and q(x) is the proportion of domain words in a text segment x. The weights w a, w p and w h add up to 1, and after some experimentation we ended up choosing w p = w h = 0.45, w a =. The link URLs were priority-enqueued, based on the scores. The encountered URLs were kept in memory, so that every URL was visited only once. Paragraphs were also tracked, because often the same paragraph came up on different pages and different URLs pointed to pages with the same content. Table 1 depicts the sizes of the corpora acquired with the described approach. Table 1 Sizes of the acquired corpora Language Size (MB) Words ( ) Paragraphs English ,500 Spanish ,800 German ,200
7 Inf Retrieval (2008) 11: Tools employed In the present research we make frequent use of the following tools: the Utaclir query translator, the FITE-TRT cognate translator, the Babelfish MT system, the Cocot query translator, and the Lemur search engine. These are briefly described below. The Utaclir query translator (Keskustalo et al. 2002) is a dictionary based query-generator. It employs stop word elimination, lemmatization and stemming, compound splitting, and off-the-shelf translation dictionaries. It produces synonym structured queries (the Pirkola method (Pirkola 1998)), where each source word is translated into a synonym set #syn(...) and these are combined by a probabilistic sum operator #sum(...). The sizes of the dictionaries used in this study were 29,000 and 35,000 source word entries for German English and Spanish English, respectively. The FITE-TRT cognate translator is based on transliteration rules automatically mined from bilingual word lists (Pirkola et al. 2006). In addition, it uses large frequency lists for the source and target languages of translation in order to resolve between candidate translations generated. The BabelFish MT system (babelfish.altavista.com) is used in the CLIR experiments (see Sect. 7). The Lemur search engine (Lemur homepage) is based on language modelling. It supports various modes of operation, including structured queries of the kind Utaclir produces. In the present study, Lemur was used in the InQuery mode (Allan et al. 1996). The Cocot Comparable Corpus Translation Program is introduced in Sect Paragraph alignment In the following section, the word document actually refers to the paragraphs extracted from the Web pages in the crawling phase. Document is used instead of paragraph, because generally the aligned entities are not necessarily paragraphs; they can be of any chosen granularity. Let d S [ C S and d T [ C T be documents in the source and target collections, respectively. We aim to produce a set of alignments A ¼fhd S ; Di jd 6¼ ;g; where D ¼ fd T jsimðd S ; d T Þ [ hg: In other words, we aim to map each source document to a set of target documents whose similarity with the source document exceeds some threshold h. Each set D is called a hyper document. It is not realistic to expect that we could find a satisfying counterpart for every source language document. Thus, we expect that A \ C S. The alignment method resembles the one by Talvensaari et al. (2007). First, queries were formed from each source document. Second, the queries were translated into the target language (English) with Utaclir. Words that were not in Utaclir s dictionary were transmuted by FITE-TRT. Third, the translated queries were run against the English paragraphs with the Lemur IR toolkit. (See Sect. 4 for description of the mentioned tools). The Lemur score was used as an indication of the similarity between the source document and the target documents. For each source document, at most 20 target documents whose similarity exceeded a score threshold were chosen into the set D. The threshold was chosen among a few predefined threshold levels. For each level, a test alignment was created, and the alignments were used to translate a test vocabulary with Cocot (see Sect. 6). The level that brought the best translation quality was chosen. Table 2 depicts the statistics of the alignments created for the Spanish English, and the German English comparable corpora. First, the number of source paragraphs for which at
8 434 Inf Retrieval (2008) 11: Table 2 Alignment statistics Languages A Avg. D Unique target paragraphs Source words Target words Spa-Eng 16, ,664 1,100,000 1,700,000 Ger-Eng 30, ,049 3,800,000 3,200,000 least one alignment pair was found, is shown. Average D is the average number of target paragraphs aligned per source document, while the fourth column depicts the number of target documents that appear in at least one hyper document. The last two columns show the number of words in source and unique target documents, respectively. 6 Cocot employing the alignments Cocot, a Comparable Corpus Translation program (Talvensaari et al. 2007), uses the aligned corpus as a similarity thesaurus, which implies calculating similarity scores between a source language word and the words in the target documents. The similarity thesaurus similarity score can be calculated by using traditional IR weighting approaches, reversing the roles of documents and words. A source language word is thought of as the query, and target language words are retrieved as the answer. For a document d j, in which a word t i appears, the Cocot system calculates the weight w ij as follows: ( w ij ¼ 0:5 þ 0:5 0 if tf ij ¼ 0 tf ij ln otherwise, Maxtf j where tf ij is the frequency of word t i in document d j, Maxtf j the largest term frequency in document d j, dl j the number of unique words in the document. NT can be the number of unique words in the collection, or its approximation. For a hyper document D k (see Sect. 5) in which a word t i appears, the weight is W ik ¼ X w ij ln ðrank jk þ 1Þ ; d j 2D k where rank jk is the rank of the document d j in the hyper document D k, i.e. the rank calculated by Lemur in the alignment phase. The lower the rank, the less similar the target document is to the source document, according to Lemur. Thus, the lower rank documents can be trusted less as a source of translation knowledge. This is echoed in the equation above. Finally we can calculate Cocot s similarity score between a word s i appearing in the source documents, and a word t j appearing in the target hyper documents: P hd simðs i ; t j Þ¼ w k;d ki2a ik W jk ; kt ks i k ð1 slopeþþslope jk NT dl j avg trg vlength where A is the set of alignments, s i and t j are the feature vectors representing s i and t j, and avg_trg_vlength the average length of the target word vectors. The formula employs the pivoted vector length normalization scheme, introduced by Singhal et al. (1996). The slope
9 Inf Retrieval (2008) 11: Table 3 Example Cocot translations Rank Alelo Alergia Alogénico 1 Allele 14.0 Allergy 4.1 tcr Dominant 10.6 Allergic 2.5 Allogenic Recessive 10.6 Allergen 2.0 mhc Gene 9.8 Ragweed 1.9 apc Heterozygous 9.6 Non-allergic 1.6 lfa value is a parameter of this scheme (we used slope = 0.2). The scheme was applied because standard cosine normalization favors words with short feature vectors, i.e. rare words. When the above score is calculated between a source language word and every word appearing in the target documents, we get a rank of the target words. Table 3 shows Cocot ranks for three genomics-related Spanish words. Score thresholding and word cut-off values (WCV) can be used as translation parameters to define Cocot s query translation behavior. For example, the parameters WCV = 4, h = 4.0 mean that for the word alelo, the four highest ranking words would be returned, whereas, for alergia, only the first word would be used as the translation. 7 Test runs and results To evaluate our system, we experimented with the test topics of the genomics track of the 2004 TREC conference (Hersh 2005). The test collection was a subset of the MEDLINE database of about 4.6 million documents. There were 50 English topics, which were translated into Spanish and German by knowledgeable speakers. Figure 2 presents an example topic in English and German. Only the title and need parts of the topics were used. The experiments consisted of two distinct set-ups: Fig. 2 Example topic in English and German
10 436 Inf Retrieval (2008) 11: Standard IR experiments with the topics. Queries were constructed from the translated topics, which were then translated back to English with various CLIR systems. Cocot, with the Web corpus alignments, was combined with dictionary-based translation, and the performance of the combination was compared to that of other CLIR systems. These experiments are reported on in Sect Translating individual domain-specific words extracted from the Spanish and German topics. To gain more detailed analysis on the performance of the Web corpus Cocot, we compared its performance in translating individual domain words with the performance of Cocot using the JRC-Acquis corpus. The experiments and their results are presented in Sect Retrieval experiments In the experiments, the Spanish and German topics were translated into English with several different translation approaches. Test runs were performed with the translated queries and the original English topics, which provided the monolingual baseline. The first 20 topics were used as training topics to decide Cocot s translation parameters. The values WCV = 3, h = 4.0 were chosen for both language pairs. The acquired comparable Web corpus was used as Cocot s translation corpus. For the last 30 topics used in the evaluation runs, there were 6,594 relevant documents in the recall base. Besides the monolingual baseline, we applied six query construction strategies that used different translation methods: 1. Utaclir alone (UC for short). 2. Utaclir and Cocot (UC-CC). Utaclir with Cocot to translate out-of-vocabulary (OOV) words, that is, words not found in Utaclir s dictionary. Since Cocot s strength lies in its ability to translate domain vocabulary, and provide expansion keys (see Sect. 7.2), we paired Cocot with a resource that could handle the more general vocabulary. Comparing the UC-CC approach with UC provided us evidence of the improvement Cocot brings to pure dictionary-based translation. The queries were formed in the following way. Utaclir produces a synonym set of translations, enclosed with the #sum operator: #sum(#syn(t 1 ) #syn(t 2 )... #syn(t n )), where T i is the set of dictionary translations to some source word. It also produces a set of untranslated words w 1, w 2... w n. The UC-CC approach produces the query #sum(#syn(t 1 ) #syn(t 2 )... #syn(t n ) #syn(c1) #syn(c2)... #syn(c n )), where C i is the Cocot s translation set (its size determined by parameters WCV, h) of the word w i. 3. FITE-TRT alone (FI). 4. Utaclir-FITE-TRT (UC-FI). Utaclir used with the FITE-TRT technique to translate OOV words. 5. Utaclir-FITE-TRT-Cocot (UC-FI-CC). The above combination concatenated with the Cocot translations of Utaclir s OOV words. 6. Babel Fish (BA). Table 4 shows the results for the runs in mean average precision and precision after 10 documents; Fig. 3a and b depicts the recall-precision curves for the Spanish English and German English runs, respectively. The Friedman test (Siegel and Castellan 1988) showed significant (p\ 0.05) difference in the performance for both language pairs. In the pairwise comparisons, the monolingual baseline was most often significantly better than the other
11 Inf Retrieval (2008) 11: Table 4 Mean average precision and precision after 10 retrieved documents for monolingual baseline and 6 translation approaches, N = 30. [ XX indicates statistically significant (p \ 0.05) difference over method XX Spa-Eng Ger-Eng MAP P@10 MAP P@10 MONO UC UC-CC ([UC) 0.42 FI ([UC) 0.41 UC-FI 2 ([UC, FI) 8 ([BA, FI, UC) UC-FI-CC 0 7 ([BA, FI, UC) 0.25 ([UC) 0.48 ([UC) BA 0.25 ([UC) ([UC) 0.42 methods, as expected. Significant differences between the different translation approaches are depicted in Table 4. All translation methods perform clearly better in the Spanish English runs than in the German English runs. This is probably due to the large number of Latin-based cognates between Spanish and English, and perhaps also to the morphological complexity of German when compared to Spanish. In the Spanish runs, UC-FI performed exceptionally well, nearly matching the monolingual baseline. Cocot seems to bring some improvement into dictionary-based translation, and in the German runs, the improvement was statistically significant. In the German runs, all of the CLIR approaches, save for UC, appear to perform quite alike. The combination UC-FI-CC has a slight edge over the rest of the approaches. It is noteworthy that in some cases, an added resource actually decreases the performance level. This is true in the Spanish runs with UC-FI-CC, and with UC-FI in the German runs. This could perhaps be fixed by weighting the components. Figures 4 and 5 present a query-by-query analysis of the results for the Spanish and German runs, respectively. The median of the average precision of each query represents the zero level in the histograms. The median was calculated only among the translation approaches, the monolingual baseline was not included. In the Spanish runs, FI and BA are wildly uneven, while UC is consistently below the median. UC-CC performs quite near the median all the way, which confirms the improvement that Cocot brings to Utaclir. In the German runs, BA and UC are much like in the Spanish runs. The other approaches seem to perform quite steadily above the median in all of the queries. In general, combining different resources seems to bring consistency to the performance: in the combined approaches there are very few queries that drop significantly below the median. 7.2 Word translation tests The word translation tests are meant to test whether the following two assumptions hold: 1. In addition to correct translations, Cocot (and similarity thesauri in general) gives words related to the correct translations which are often good expansion keys in IR.
12 438 Inf Retrieval (2008) 11: (a) MONO UC UC-CC FI UC-FI UC-FI-CC BA Precision (b) Precision Recall Spanish-English MONO UC UC-CC FI UC-FI UC-FI-CC BA Recall German-English Fig. 3 Recall-precision curves for the retrieval experiments 2. Cocot, using the comparable Web corpus, can translate genomics-specific vocabulary better than Cocot that uses a high-quality parallel corpus with more general vocabulary. As the parallel corpus, we used the JRC-Acquis corpus (Steinberger et al. 2006), which consists of official EU documents in all official EU languages. Its size for the languages used in this study is depicted in Table 5. The version of the corpus used was 2.2. The alignments in the corpus were mostly on paragraph level. They were created by Steinberger et al. (2006) with an algorithm that was based on the famous algorithm by Gale and Church (1991).
13 Inf Retrieval (2008) 11: (a) (b) UC - UC-CC (c) (d) FI - UC-FI (e) (f) UC-FI-CC - BA Fig. 4 Difference to median average precision query-by-query, Spanish queries To test the performance of Cocot with different translation corpora, we needed a measure for the goodness of the words returned by Cocot in relation to the queries they are part of. Since a good word may be expected to appear more often in the relevant documents than in the rest of the documents, we devised a simple measure of goodness by using document frequencies. Definition 1 The relative document frequency of a word w in a document set D is rdf(w, D) =df(w, D)/ D, where df(w, D) is the document frequency of w in D.
14 440 Inf Retrieval (2008) 11: (a) (b) UC UC-CC (c) (d) FI - UC-FI (e) (f) UC-FI-CC - BA Fig. 5 Difference to median average precision query-by-query, German queries Definition 2 Let C and R q ð CÞ be the target test collection set and the set of relevant documents for a query q, respectively. Furthermore, let t be a Cocot translation to a word in a source language query q. The goodness of the key t in the query q is gðt; qþ ¼ rdf ðt; R q Þ rdf ðt; C n R q Þ: In other words, the goodness is the difference in the relative document frequency of a key among documents relevant to the query (R q ) and among the rest of the documents ðc n R q Þ: The larger the positive difference, the better the key, and vise versa. The measure has the range [-1, 1].
15 Inf Retrieval (2008) 11: Table 5 Size of the JRC-Acquis parallel corpus, version 2.2 Language Words English 7,547,154 Spanish 8,006,579 German 6,481,949 Definition 3 A translation key t is good, if g(t, q) C 0.2 and the one-tailed binomial test (Siegel and Castellan 1988) indicates that relative document frequency was significantly (p \ 0.05) higher in R q than in C n R q : We extracted domain-specific vocabulary from the Spanish and German topics and translated them with Cocot, using the two translation corpora. To gain more reliable results, we only considered topics for whom R q C 5. For example, the word transgenen appears in the German example topic 2 in Fig. 2. For this topic, R 2 = 101, i.e. there are 101 relevant documents for the topic. Since there are 4,591,008 documents in the entire collection, jc n R 2 j¼ ¼ : When the Web corpus is used as the translation corpus, and WCV = 3 is applied, Cocot gives the translation set {plant, transgene, transgenic} for the word transgenen. The word plant appears in 105,292 documents in the collection, of which only one is in R 2. Accordingly, for the word plant in topic 2, gðplant; 2Þ ¼1= = ¼ 0:01 ðp ¼ 0:90Þ: According to this measure, the other translation set words are much better, because g(transgene, 2) = 0 (p & 0) and g(transgenic, 2) = 0.97 (p & 0). Table 6 shows the results for the word translation tests. In the tests, we used parameters WCV = 5, h = 0 for Cocot. The number of OOV words (i.e. words not found in the source language documents) is a crucial statistic; for example, in the German JRC runs, 93 out of 148 unique source language words could not be translated at all. When the Web corpus is used, this number reduces to 31. The German Web corpus seems to work slightly better than the Spanish one, perhaps due to its larger size. The JRC corpus, on the other hand, seems to perform evenly for both languages, as indicated by the percentage of good translations and average Table 6 Results of the word translation tests. Cocot returned five words per source language word, unless the word was OOV Spanish German Queries Source words Unique source words Web JRC Web JRC OOV Translations Good translations Good translations % Avg. goodness
16 442 Inf Retrieval (2008) 11: Table 7 CLIR results based on the goodness analysis German Spanish JRC Web JRC Web MAP P@10 MAP P@10 MAP P@10 MAP P@10 Source Good Bad goodness of the translations. All of the measures indicate that the Web corpus outperforms the JRC corpus in translating genomics-related words. Note also that for the Web corpus in both languages, the number of good translations exceeds the number of source language words, which shows that Cocot also provides related expansion keys, besides correct translations. Therefore, both of the above-stated assumptions seem to hold Validating the translation goodness analysis In the previous section it was shown that our genomics-specific corpora could produce translations that appear frequently in relevant documents in the recall base. However, it is not self evident that the good translations (as determined by the above analysis) would also result in good CLIR performance. The validity of the goodness analysis was tested by making IR experiments based on the tests. For both of the language pairs, we first made a baseline run where the source language queries were run without translation. Then, for both language pairs and translation corpora, we formed good and bad queries. In the good queries, only the translations that were judged as good by the above analysis were used as translations. In the bad queries, only the translations that were not judged as good were used. In both query sets, the output of Cocot was concatenated with the baseline queries, so that the untranslated source language words were also part of both good and bad queries. All 50 topics of the TREC genomics track were used in these experiments. Table 7 presents the results of the runs in mean average precision and precision after 10 documents. The good queries always clearly outperform the bad ones. Also, the good queries always outperform the source language queries. These observations prove that translation goodness and CLIR performance correlate strongly. It should be noted that the results of Table 7 should not be compared to the official IR results of Table 4, or used to compare the performance of the translation corpora. This is because prior knowledge of the relevant documents was used in forming the good and bad queries. 8 Discussion Our aim was to devise a novel technique for multilingual focused crawling for the creation of comparable corpora, and to use the acquired corpora in statistical query translation. We built a focused crawler that applied language detection, driver queries, and URL prioritizing to collect domain specific text in predefined languages. We managed to collect
17 Inf Retrieval (2008) 11: considerable amounts of text in the genomics domain in Spanish, German, and English. Alignments were made between the collections at the paragraph level; we mapped source language texts with similar counterparts in the target language collection. The alignments were employed in statistical translation with the Cocot translation system. In the word translation experiments, we showed that Cocot was capable of providing good quality translations of domain vocabulary, as well as contributing semantically connected, but morphologically or taxonomically unrelated, expansion keys. Such query expansion is a feature missing from the other CLIR approaches. We were also able to prove that a high-quality parallel corpus with more general vocabulary was unable to provide equally good translation knowledge for domain-specific vocabulary. This indicates that resorting to noisier comparable corpora that can be created (semi-)automatically is necessary when dealing with special domains. In the standard IR experiments Cocot, using the Web corpus as the translation corpus, clearly improved dictionary-based translation. However, compared to other translation approaches the results are more varied. Especially in the Spanish runs, the UC-CC approach was inferior to the approach where FITE- TRT was used in OOV translation (although the difference was statistically insignificant). The excellent performance of FITE-TRT in the Spanish runs could be explained by the large number of Latin-based cognates between Spanish and English. This kind of crosslingual variation is where FITE-TRT is at its best. The greater similarity between Spanish and English may also be the reason behind the better overall results in the Spanish English runs than in the German English ones. Cocot s poorer relative performance in the Spanish runs is probably also due to the smaller size of the corpus. In the German runs, UC-CC fared as well as UC-FI and machine translation. It might be that in general, Cocot would be of better use with language pairs that have relatively low number of cognates. We also note that there are various aspects of the present Cocot configuration that could be improved: the alignment process, the choice of Cocot s translation parameters, more complex query structuring (e.g. weighting strategies) could be employed, and combination with other resources could be made more effective. Our method should naturally be easily portable to other domains, and indeed we believe it is. Gathering domain-specific vocabulary for the crawling phase is not a trivial task, but it can be done quickly even by a non-specialist, as in our experiments. The vocabulary gathering could further be automated by automatically extracting domain-specific vocabulary from user-specified seed pages. In the alignment phase a smallish general purpose dictionary suffices. In addition, some kind of cognate matching can be used, such as FITE- TRT which was used in the present study. It could be argued that the genomics domain is a relatively easy domain for CLIR, because much of the central vocabulary protein names etc. is the same across languages. In other technical domains, where more than cognate matching is needed, Cocot could perhaps bring greater improvement over other methods. This, though, remains to be shown readily available test environments for special domains are rare. While the contribution of the automatically acquired comparable corpora was not dramatic, the results were at least encouraging. This research contributed the first method for the acquisition of such corpora. It should also be noted that the techniques presented here are not the only way to employ the acquired corpora. For instance, they could also be used to prune translation alternatives in dictionary-based translation. A further potential application of the cross-language word associations Cocot provides on the basis of comparable corpora is cross-lingual document similarity calculation (e.g. for cross-lingual document retrieval through query by example document; cross-lingual plagiarism detection, etc.). Even in its present design, however, the method resulted in good translations of
18 444 Inf Retrieval (2008) 11: OOV words and provided useful expansion keys that improve CLIR effectiveness. Therefore, the method seems competitive and promising, especially for rapidly developing special domains. Acknowledgments Tuomas Talvensaari was supported by the Tampere Graduate School in Information Science and Engineering (TISE). Ari Pirkola was supported by the Academy of Finland, Grants Number and ( Focused Retrieval of Web Documents ). Kalervo Järvelin was supported by the Academy of Finland, Grants Number and References Allan, J., Callan, J. P., Croft, W. B., Ballesteros, L., Broglio, J., Xu, J., & Shu, H. (1996). Inquery at TREC- 5. In TREC-5: The Fifth Text Retrieval Conference (pp ). National Institute of Standards and Technology. Bra, P. D., Houben, G.-J., Kornatzky, Y., & Post, R. (1994). Information retrieval in distributed hypertexts. In Proceedings of the 4th RIAO Conference (pp ). Braschler, M., & Schäuble, P. (1998). Multilingual information retrieval based on document alignment techniques. In ECDL 98: Proceedings of the Second European Conference on Research and Advanced Technology for Digital Libraries (pp ). London: Springer-Verlag. Cavnar, W. B., & Trenkle, J. M. (1994). N-gram-based text categorization. In Proceedings of SDAIR-94, 3rd Annual Symposium on Document Analysis and Information Retrieval (pp ). Chakrabarti, S., van den Berg, M., & Dom, B. (1999). Focused crawling: A new approach to topic-specific web resource discovery. In WWW 99: Proceeding of the Eighth International Conference on World Wide Web (pp ). New York: Elsevier North-Holland, Inc. Cheng, P.-J., Teng, J.-W., Chen, R.-C., Wang, J.-H., Lu, W.-H., & Chien, L.-F. (2004). Translating unknown queries with web corpora for cross-language information retrieval. In SIGIR 04: Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (pp ). New York: ACM Press. Cho, J., Garcia-Molina, H., & Page, L. (1998). Efficient crawling through URL ordering. In WWW7: Proceedings of the Seventh International Conference on World Wide Web (pp ). Amsterdam: Elsevier Science Publishers B. V. Gale, W. A., & Church, K. W. (1991). A program for aligning sentences in bilingual corpora. In ACL 91: Proceedings of the 29th Annual Meeting of the Association for Computational Linguistics (pp ). Morristown, NJ: Association for Computational Linguistics. Hassan, A., Fahmy, H., & Hassan, H. (2007). Improving named entity translation by exploiting comparable and parallel corpora. In Proceedings of the 2007 Conference on Recent Advances in Natural Language Processing (RANLP), AMML Workshop. Hersh, W. R. (2005). Report on the TREC 2004 genomics track. SIGIR Forum, 39(1), Keskustalo, H., Hedlund, T., & Airio, E. (2002). Utaclir general query translation framework for several language pairs. In SIGIR 02: Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (pp ). New York: ACM Press. Lemur homepage. (2008). The Lemur toolkit homepage. Accessed 22 February Nie, J.-Y., Simard, M., Isabelle, P., & Durand, R. (1999). Cross-language information retrieval based on parallel texts and automatic mining of parallel texts from the web. In SIGIR 99: Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (pp ). New York: ACM Press. Pirkola, A. (1998). The effects of query structure and dictionary setups in dictionary-based cross-language information retrieval. In SIGIR 98: Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (pp ). New York: ACM Press. Pirkola, A., Hedlund, T., Keskustalo, H., & Järvelin, K. (2001). Dictionary-based cross-language information retrieval: Problems, methods, and research findings. Information Retrieval, 4(3 4), Pirkola, A., Toivonen, J., Keskustalo, H., & Järvelin, K. (2006). FITE-TRT: A high quality translation technique for OOV words. In SAC 06: Proceedings of the 2006 ACM Symposium on Applied Computing (pp ). New York: ACM Press.
19 Inf Retrieval (2008) 11: Resnik, P. (1999). Mining the web for bilingual text. In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics on Computational Linguistics (pp ). Morristown, NJ: Association for Computational Linguistics. Sheridan, P., & Ballerini, J. P. (1996). Experiments in multilingual information retrieval using the SPIDER system. In SIGIR 96: Proceedings of the 19th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (pp ). New York: ACM Press. Siegel, S., & Castellan, N. J., Jr. (1988). Nonparametric statistics for the behavioral sciences. New York: McGraw-Hill. Singhal, A., Buckley, C., & Mitra, M. (1996). Pivoted document length normalization. In SIGIR 96: Proceedings of the 19th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (pp ). New York: ACM Press. Steinberger, R., Pouliquen, B., & Ignat, C. (2005). Navigating multilingual news collections using automatically extracted information. Journal of Computing and Information Technology, 13(4), Steinberger, R., Pouliquen, B., Widiger, A., Ignat, C., Erjavec, T., Tufiş, D., & Varga, D. (2006). The JRC- Acquis: A multilingual aligned parallel corpus with 20+ languages. In LREC 2006: Proceedings of the 5th International Conference on Language Resources and Evaluation. Talvensaari, T., Laurikkala, J., Järvelin, K., Juhola, M., & Keskustalo, H. (2007). Creating and exploiting a comparable corpus in cross-language information retrieval. ACM Transactions on Information Systems, 25(1), 4. Utsuro, T., Horiuchi, T., Chiba, Y., & Hamamoto, T. (2002). Semi-automatic compilation of bilingual lexicon entries from cross-lingually relevant news articles on WWW news sites. In AMTA 02: Proceedings of the 5th Conference of the Association for Machine Translation in the Americas on Machine Translation: From Research to Real Users (pp ). London: Springer-Verlag. Yang, C. C., & Li, K. W. (2004). Building parallel corpora by automatic title alignment using length-based and text-based approaches. Information Processing & Management, 40(6),
Compression of a Dictionary
Compression of a Dictionary Jan Lánský, Michal Žemlička zizelevak@matfyz.cz michal.zemlicka@mff.cuni.cz Dept. of Software Engineering Faculty of Mathematics and Physics Charles University Synopsis Introduction
Tento materiál byl vytvořen v rámci projektu Operačního programu Vzdělávání pro konkurenceschopnost.
Tento materiál byl vytvořen v rámci projektu Operačního programu Vzdělávání pro konkurenceschopnost. Projekt MŠMT ČR Číslo projektu Název projektu školy Klíčová aktivita III/2 EU PENÍZE ŠKOLÁM CZ.1.07/1.4.00/21.2146
Gymnázium, Brno, Slovanské nám. 7 WORKBOOK. Mathematics. Teacher: Student:
WORKBOOK Subject: Teacher: Student: Mathematics.... School year:../ Conic section The conic sections are the nondegenerate curves generated by the intersections of a plane with one or two nappes of a cone.
Klepnutím lze upravit styl předlohy. nadpisů. nadpisů.
1/ 13 Klepnutím lze upravit styl předlohy Klepnutím lze upravit styl předlohy www.splab.cz Soft biometric traits in de identification process Hair Jiri Prinosil Jiri Mekyska Zdenek Smekal 2/ 13 Klepnutím
Dynamic Development of Vocabulary Richness of Text. Miroslav Kubát & Radek Čech University of Ostrava Czech Republic
Dynamic Development of Vocabulary Richness of Text Miroslav Kubát & Radek Čech University of Ostrava Czech Republic Aim To analyze a dynamic development of vocabulary richness from a methodological point
Návrh a implementace algoritmů pro adaptivní řízení průmyslových robotů
Návrh a implementace algoritmů pro adaptivní řízení průmyslových robotů Design and implementation of algorithms for adaptive control of stationary robots Marcel Vytečka 1, Karel Zídek 2 Abstrakt Článek
USING VIDEO IN PRE-SET AND IN-SET TEACHER TRAINING
USING VIDEO IN PRE-SET AND IN-SET TEACHER TRAINING Eva Minaříková Institute for Research in School Education, Faculty of Education, Masaryk University Structure of the presentation What can we as teachers
Využití hybridní metody vícekriteriálního rozhodování za nejistoty. Michal Koláček, Markéta Matulová
Využití hybridní metody vícekriteriálního rozhodování za nejistoty Michal Koláček, Markéta Matulová Outline Multiple criteria decision making Classification of MCDM methods TOPSIS method Fuzzy extension
Litosil - application
Litosil - application The series of Litosil is primarily determined for cut polished floors. The cut polished floors are supplied by some specialized firms which are fitted with the appropriate technical
Introduction to MS Dynamics NAV
Introduction to MS Dynamics NAV (Item Charges) Ing.J.Skorkovský,CSc. MASARYK UNIVERSITY BRNO, Czech Republic Faculty of economics and business administration Department of corporate economy Item Charges
CHAPTER 5 MODIFIED MINKOWSKI FRACTAL ANTENNA
CHAPTER 5 MODIFIED MINKOWSKI FRACTAL ANTENNA &KDSWHUSUHVHQWVWKHGHVLJQDQGIDEULFDW LRQRIPRGLILHG0LQNRZVNLIUDFWDODQWHQQD IRUZLUHOHVVFRPPXQLFDWLRQ7KHVLPXODWHG DQGPHDVXUHGUHVXOWVRIWKLVDQWHQQDDUH DOVRSUHVHQWHG
Czech Republic. EDUCAnet. Střední odborná škola Pardubice, s.r.o.
Czech Republic EDUCAnet Střední odborná škola Pardubice, s.r.o. ACCESS TO MODERN TECHNOLOGIES Do modern technologies influence our behavior? Of course in positive and negative way as well Modern technologies
GUIDELINES FOR CONNECTION TO FTP SERVER TO TRANSFER PRINTING DATA
GUIDELINES FOR CONNECTION TO FTP SERVER TO TRANSFER PRINTING DATA What is an FTP client and how to use it? FTP (File transport protocol) - A protocol used to transfer your printing data files to the MAFRAPRINT
Transportation Problem
Transportation Problem ١ C H A P T E R 7 Transportation Problem The transportation problem seeks to minimize the total shipping costs of transporting goods from m origins (each with a supply s i ) to n
Theme 6. Money Grammar: word order; questions
Theme 6 Money Grammar: word order; questions Čas potřebný k prostudování učiva lekce: 8 vyučujících hodin Čas potřebný k ověření učiva lekce: 45 minut KLÍNSKÝ P., MÜNCH O., CHROMÁ D., Ekonomika, EDUKO
DC circuits with a single source
Název projektu: utomatizace výrobních procesů ve strojírenství a řemeslech egistrační číslo: Z..07/..0/0.008 Příjemce: SPŠ strojnická a SOŠ profesora Švejcara Plzeň, Klatovská 09 Tento projekt je spolufinancován
Střední průmyslová škola strojnická Olomouc, tř.17. listopadu 49
Střední průmyslová škola strojnická Olomouc, tř.17. listopadu 49 Výukový materiál zpracovaný v rámci projektu Výuka moderně Registrační číslo projektu: CZ.1.07/1.5.00/34.0205 Šablona: III/2 Anglický jazyk
WORKSHEET 1: LINEAR EQUATION 1
WORKSHEET 1: LINEAR EQUATION 1 1. Write down the arithmetical problem according the dictation: 2. Translate the English words, you can use a dictionary: equations to solve solve inverse operation variable
VY_32_INOVACE_06_Předpřítomný čas_03. Škola: Základní škola Slušovice, okres Zlín, příspěvková organizace
VY_32_INOVACE_06_Předpřítomný čas_03 Autor: Růžena Krupičková Škola: Základní škola Slušovice, okres Zlín, příspěvková organizace Název projektu: Zkvalitnění ICT ve slušovské škole Číslo projektu: CZ.1.07/1.4.00/21.2400
Radiova meteoricka detekc nı stanice RMDS01A
Radiova meteoricka detekc nı stanice RMDS01A Jakub Ka kona, kaklik@mlab.cz 15. u nora 2014 Abstrakt Konstrukce za kladnı ho softwarove definovane ho pr ijı macı ho syste mu pro detekci meteoru. 1 Obsah
Just write down your most recent and important education. Remember that sometimes less is more some people may be considered overqualified.
CURRICULUM VITAE - EDUCATION Jindřich Bláha Výukový materiál zpracován v rámci projektu EU peníze školám Autorem materiálu a všech jeho částí, není-li uvedeno jinak, je Bc. Jindřich Bláha. Dostupné z Metodického
AIC ČESKÁ REPUBLIKA CZECH REPUBLIC
ČESKÁ REPUBLIKA CZECH REPUBLIC ŘÍZENÍ LETOVÉHO PROVOZU ČR, s.p. Letecká informační služba AIR NAVIGATION SERVICES OF THE C.R. Aeronautical Information Service Navigační 787 252 61 Jeneč A 1/14 20 FEB +420
Gymnázium, Brno, Slovanské nám. 7, SCHEME OF WORK Mathematics SCHEME OF WORK. cz
SCHEME OF WORK Subject: Mathematics Year: first grade, 1.X School year:../ List of topisc # Topics Time period Introduction, repetition September 1. Number sets October 2. Rigtht-angled triangle October,
EXACT DS OFFICE. The best lens for office work
EXACT DS The best lens for office work EXACT DS When Your Glasses Are Not Enough Lenses with only a reading area provide clear vision of objects located close up, while progressive lenses only provide
Tabulka 1 Stav členské základny SK Praga Vysočany k roku 2015 Tabulka 2 Výše členských příspěvků v SK Praga Vysočany Tabulka 3 Přehled finanční
Příloha I Seznam tabulek Tabulka 1 Stav členské základny SK Praga Vysočany k roku 2015 Tabulka 2 Výše členských příspěvků v SK Praga Vysočany Tabulka 3 Přehled finanční odměny pro rozhodčí platný od roku
Next line show use of paragraf symbol. It should be kept with the following number. Jak může státní zástupce věc odložit zmiňuje 159a.
1 Bad line breaks The follwing text has prepostions O and k at end of line which is incorrect according to Czech language typography standards: Mezi oblíbené dětské pohádky patří pohádky O Palečkovi, Alenka
PC/104, PC/104-Plus. 196 ept GmbH I Tel. +49 (0) / I Fax +49 (0) / I I
E L E C T R O N I C C O N N E C T O R S 196 ept GmbH I Tel. +49 (0) 88 61 / 25 01 0 I Fax +49 (0) 88 61 / 55 07 I E-Mail sales@ept.de I www.ept.de Contents Introduction 198 Overview 199 The Standard 200
User manual SŘHV Online WEB interface for CUSTOMERS June 2017 version 14 VÍTKOVICE STEEL, a.s. vitkovicesteel.com
1/ 11 User manual SŘHV Online WEB interface for CUSTOMERS June 2017 version 14 2/ 11 Contents 1. MINIMUM SYSTEM REQUIREMENTS... 3 2. SŘHV ON-LINE WEB INTERFACE... 4 3. LOGGING INTO SŘHV... 4 4. CONTRACT
Vliv metody vyšetřování tvaru brusného kotouče na výslednou přesnost obrobku
Vliv metody vyšetřování tvaru brusného kotouče na výslednou přesnost obrobku Aneta Milsimerová Fakulta strojní, Západočeská univerzita Plzeň, 306 14 Plzeň. Česká republika. E-mail: anetam@kto.zcu.cz Hlavním
FIRE INVESTIGATION. Střední průmyslová škola Hranice. Mgr. Radka Vorlová. 19_Fire investigation CZ.1.07/1.5.00/
FIRE INVESTIGATION Střední průmyslová škola Hranice Mgr. Radka Vorlová 19_Fire investigation CZ.1.07/1.5.00/34.0608 Výukový materiál Číslo projektu: CZ.1.07/1.5.00/21.34.0608 Šablona: III/2 Inovace a zkvalitnění
Czech Technical University in Prague DOCTORAL THESIS
Czech Technical University in Prague Faculty of Nuclear Sciences and Physical Engineering DOCTORAL THESIS CERN-THESIS-2015-137 15/10/2015 Search for B! µ + µ Decays with the Full Run I Data of The ATLAS
The Over-Head Cam (OHC) Valve Train Computer Model
The Over-Head Cam (OHC) Valve Train Computer Model Radek Tichanek, David Fremut Robert Cihak Josef Bozek Research Center of Engine and Content Introduction Work Objectives Model Description Cam Design
Dynamic programming. Optimal binary search tree
The complexity of different algorithms varies: O(n), Ω(n ), Θ(n log (n)), Dynamic programming Optimal binary search tree Různé algoritmy mají různou složitost: O(n), Ω(n ), Θ(n log (n)), The complexity
WYSIWYG EDITOR PRO XML FORM
WYSIWYG EDITOR PRO XML FORM Ing. Tran Thanh Huan, Ing. Nguyen Ba Nghien, Doc. Ing. Josef Kokeš, CSc Abstract: In this paper, we introduce the WYSIWYG editor pro XML Form. We also show how to create a form
Gymnázium a Střední odborná škola, Rokycany, Mládežníků 1115
Číslo projektu: Číslo šablony: Název materiálu: Gymnázium a Střední odborná škola, Rokycany, Mládežníků 1115 CZ.1.07/1.5.00/34.0410 II/2 Parts of a computer IT English Ročník: Identifikace materiálu: Jméno
2. Entity, Architecture, Process
Evropský sociální fond Praha & EU: Investujeme do vaší budoucnosti Praktika návrhu číslicových obvodů Dr.-Ing. Martin Novotný Katedra číslicového návrhu Fakulta informačních technologií ČVUT v Praze Miloš
ITICA. SAP Školení přehled 2012. Seznam kurzů
ITICA SAP Školení přehled 2012 Seznam kurzů SAP Školení v roce 2012 Způsob realizace školení Naše školení jsou zaměřena především na cíl předvést obrovský a rozsáhlý systém SAP jako použitelný a srozumitelný
Výukový materiál v rámci projektu OPVK 1.5 Peníze středním školám
VY_22_INOVACE_AJOP40764ČER Výukový materiál v rámci projektu OPVK 1.5 Peníze středním školám Číslo projektu: Název projektu: Číslo šablony: CZ.1.07/1.5.00/34.0883 Rozvoj vzdělanosti II/2 Datum vytvoření:
VYSOKÁ ŠKOLA HOTELOVÁ V PRAZE 8, SPOL. S R. O.
VYSOKÁ ŠKOLA HOTELOVÁ V PRAZE 8, SPOL. S R. O. Návrh konceptu konkurenceschopného hotelu v době ekonomické krize Diplomová práce 2013 Návrh konceptu konkurenceschopného hotelu v době ekonomické krize Diplomová
By David Cameron VE7LTD
By David Cameron VE7LTD Introduction to Speaker RF Cavity Filter Types Why Does a Repeater Need a Duplexer Types of Duplexers Hybrid Pass/Reject Duplexer Detail Finding a Duplexer for Ham Use Questions?
Database systems. Normal forms
Database systems Normal forms An example of a bad model SSN Surnam OfficeNo City Street No ZIP Region President_of_ Region 1001 Novák 238 Liteň Hlavní 10 26727 Středočeský Rath 1001 Novák 238 Bystřice
Website review vaznikystrechy.eu
Generated on August 02 2016 10:08 AM The score is 56/100 SEO Content Title Dřevěné příhradové vazníky pro všechy typy střech Length : 49 Perfect, your title contains between 10 and 70 characters. Description
Website review ciporat.ru
Website review ciporat.ru Generated on October 12 2016 02:27 AM The score is 39/100 SEO Content Title Polina Gagarina: Zhubnout 40 kg může být i se špatným dědičnosti. Vím, že pro sebe! - Svět Ženy Length
ENGLISH FOR ACADEMIC PURPOSES
Slezská univerzita v Opavě Fakulta veřejných politik v Opavě ENGLISH FOR ACADEMIC PURPOSES Distanční studijní opora Veronika Portešová Opava 2012 Projekt č. CZ.1.07/2.2.00/15.0178 Inovace studijního programu
Social Media a firemní komunikace
Social Media a firemní komunikace TYINTERNETY / FALANXIA YOUR WORLD ENGAGED UČTE SE OD STARTUPŮ ANALYSIS -> PARALYSIS POUŽIJTE TO, CO ZNÁ KAŽDÝ POUŽIJTE TO, CO ZNÁ KAŽDÝ POUŽIJTE TO, CO ZNÁ KAŽDÝ POUŽIJTE
Content Language level Page. Mind map Education All levels 2. Go for it. We use this expression to encourage someone to do something they want.
Study newsletter 2015, week 40 Content Language level Page Phrase of the week Go for it All levels 1 Mind map Education All levels 2 Czenglish Stressed vs. in stress Pre-intermediate (B1-) Advanced (C1)
Informace o písemných přijímacích zkouškách. Doktorské studijní programy Matematika
Informace o písemných přijímacích zkouškách (úplné zadání zkušebních otázek či příkladů, které jsou součástí přijímací zkoušky nebo její části, a u otázek s výběrem odpovědi správné řešení) Doktorské studijní
CZ.1.07/1.5.00/
Projekt: Příjemce: Digitální učební materiály ve škole, registrační číslo projektu CZ.1.07/1.5.00/34.0527 Střední zdravotnická škola a Vyšší odborná škola zdravotnická, Husova 3, 371 60 České Budějovice
Perception Motivated Hybrid Approach to Tone Mapping
Perception Motivated Hybrid Approach to Tone Mapping Martin Čadík Czech Technical University in Prague, Czech Republic Content HDR tone mapping Hybrid Approach Perceptually plausible approach Cognitive
Table of contents. 5 Africa poverty reduction. Africa's growth. Africa - productivity. Africa - resources. Africa development
Africa Table of contents 1 Africa's growth 2 Africa - productivity 3 Africa - resources 4 Africa development 5 Africa poverty reduction 6 Africa Trade opportunities Africa's growth Different approaches
Bibliometric probes into the world of scientific publishing: Economics first
Bibliometric probes into the world of scientific publishing: Economics first Daniel Münich VŠE, Nov 7, 2017 Publication space Field coverage of WoS Source: Henk F. Moed, Citation Analysis in Research Evaluation,
Progressive loyalty V1.0. Copyright 2017 TALENTHUT
Progressive loyalty Copyright 2017 TALENTHUT www.talenthut.io 1. Welcome The Progressive Loyalty Siberian CMS module will allow you to launch a loyalty program and reward your customers as they buy from
Dynamic Signals. Ananda V. Mysore SJSU
Dynamic Signals Ananda V. Mysore SJSU Static vs. Dynamic Signals In principle, all signals are dynamic; they do not have a perfectly constant value over time. Static signals are those for which changes
Enabling Intelligent Buildings via Smart Sensor Network & Smart Lighting
Enabling Intelligent Buildings via Smart Sensor Network & Smart Lighting Petr Macháček PETALIT s.r.o. 1 What is Redwood. Sensor Network Motion Detection Space Utilization Real Estate Management 2 Building
TKGA3. Pera a klíny. Projekt "Podpora výuky v cizích jazycích na SPŠT"
Projekt "Podpora výuky v cizích jazycích na SPŠT" Pera a klíny TKGA3 Tento projekt je spolufinancován Evropským sociálním fondem a státním rozpočtem ČR Pera a klíny Pera a klíny slouží k vytvoření rozbíratelného
DATA SHEET. BC516 PNP Darlington transistor. technický list DISCRETE SEMICONDUCTORS Apr 23. Product specification Supersedes data of 1997 Apr 16
zákaznická linka: 840 50 60 70 DISCRETE SEMICONDUCTORS DATA SHEET book, halfpage M3D186 Supersedes data of 1997 Apr 16 1999 Apr 23 str 1 Dodavatel: GM electronic, spol. s r.o., Křižíkova 77, 186 00 Praha
TECHNICKÁ NORMALIZACE V OBLASTI PROSTOROVÝCH INFORMACÍ
TECHNICKÁ NORMALIZACE V OBLASTI PROSTOROVÝCH INFORMACÍ Ing. Jiří Kratochvíl ředitel Odboru technické normalizace Úřad pro technickou normalizaci, metrologii a státní zkušebnictví kratochvil@unmz.cz http://cs-cz.facebook.com/normy.unmz
Úvod do datového a procesního modelování pomocí CASE Erwin a BPwin
Úvod do datového a procesního modelování pomocí CASE Erwin a BPwin (nově AllFusion Data Modeller a Process Modeller ) Doc. Ing. B. Miniberger,CSc. BIVŠ Praha 2009 Tvorba datového modelu Identifikace entit
CZ.1.07/1.5.00/
Projekt: Příjemce: Digitální učební materiály ve škole, registrační číslo projektu CZ.1.07/1.5.00/34.0527 Střední zdravotnická škola a Vyšší odborná škola zdravotnická, Husova 3, 371 60 České Budějovice
Střední průmyslová škola strojnická Olomouc, tř.17. listopadu 49
Střední průmyslová škola strojnická Olomouc, tř.17. listopadu 49 Výukový materiál zpracovaný v rámci projektu Výuka moderně Registrační číslo projektu: CZ.1.07/1.5.00/34.0205 Šablona: III/2 Anglický jazyk
DOPLNĚK K FACEBOOK RETRO EDICI STRÁNEK MAVO JAZYKOVÉ ŠKOLY MONCHHICHI
MONCHHICHI The Monchhichi franchise is Japanese and held by the Sekiguchi Corporation, a famous doll company, located in Tokyo, Japan. Monchhichi was created by Koichi Sekiguchi on January 25, 1974. Sekiguchi
Zubní pasty v pozměněném složení a novém designu
Energy news4 Energy News 04/2010 Inovace 1 Zubní pasty v pozměněném složení a novém designu Od října tohoto roku se začnete setkávat s našimi zubními pastami v pozměněném složení a ve zcela novém designu.
TechoLED H A N D B O O K
TechoLED HANDBOOK Světelné panely TechoLED Úvod TechoLED LED světelné zdroje jsou moderním a perspektivním zdrojem světla se širokými možnostmi použití. Umožňují plnohodnotnou náhradu žárovek, zářivkových
Hledání nápadů v textových zdrojích
Hledání nápadů v textových zdrojích Obsah Motivace: hledání postupů, které podpoří vznik bisociací uvnitř bohatých (medicínských) informačních zdrojů pro hledání nových hypotéz, které vysvětlují zkoumaný
Digitální učební materiál
Digitální učební materiál Projekt Šablona Tématická oblast DUM č. CZ.1.07/1.5.00/34.0415 Inovujeme, inovujeme III/2 Inovace a zkvalitnění výuky prostřednictvím ICT (DUM) Anglický jazyk pro obor podnikání
Air Quality Improvement Plans 2019 update Analytical part. Ondřej Vlček, Jana Ďoubalová, Zdeňka Chromcová, Hana Škáchová
Air Quality Improvement Plans 2019 update Analytical part Ondřej Vlček, Jana Ďoubalová, Zdeňka Chromcová, Hana Škáchová vlcek@chmi.cz Task specification by MoE: What were the reasons of limit exceedances
Projekt: ŠKOLA RADOSTI, ŠKOLA KVALITY Registrační číslo projektu: CZ.1.07/1.4.00/21.3688 EU PENÍZE ŠKOLÁM
ZÁKLADNÍ ŠKOLA OLOMOUC příspěvková organizace MOZARTOVA 48, 779 00 OLOMOUC tel.: 585 427 142, 775 116 442; fax: 585 422 713 email: kundrum@centrum.cz; www.zs-mozartova.cz Projekt: ŠKOLA RADOSTI, ŠKOLA
Cambridge International Examinations Cambridge International General Certifi cate of Secondary Education
Cambridge International Examinations Cambridge International General Certifi cate of Secondary Education FIRST LANGUAGE CZECH 0514/01 Paper 1 Reading For Examination from 2016 SPECIMEN MARK SCHEME 2 hours
Angličtina v matematických softwarech 2 Vypracovala: Mgr. Bronislava Kreuzingerová
Angličtina v matematických softwarech 2 Vypracovala: Mgr. Bronislava Kreuzingerová Název školy Název a číslo projektu Název modulu Obchodní akademie a Střední odborné učiliště, Veselí nad Moravou Motivace
Škola: Střední škola obchodní, České Budějovice, Husova 9. Inovace a zkvalitnění výuky prostřednictvím ICT
Škola: Střední škola obchodní, České Budějovice, Husova 9 Projekt MŠMT ČR: EU PENÍZE ŠKOLÁM Číslo projektu: CZ.1.07/1.5.00/34.0536 Název projektu školy: Výuka s ICT na SŠ obchodní České Budějovice Šablona
Syntactic annotation of a second-language learner corpus
Syntactic annotation of a second-language Jirka Hana & Barbora Hladká Charles University Prague ICBLT 2018 CzeSL Corpus of L2 Czech ICBLT 2018 2 CzeSL Czech as a Second Language Part of AKCES Acquisition
HASHING GENERAL Hashovací (=rozptylovací) funkce
Níže uvedené úlohy představují přehled otázek, které se vyskytly v tomto nebo v minulých semestrech ve cvičení nebo v minulých semestrech u zkoušky. Mezi otázkami semestrovými a zkouškovými není žádný
Zelené potraviny v nových obalech Green foods in a new packaging
Energy News1 1 Zelené potraviny v nových obalech Green foods in a new packaging Již v minulém roce jsme Vás informovali, že dojde k přebalení všech tří zelených potravin do nových papírových obalů, které
CZ.1.07/1.5.00/
Projekt: Příjemce: Digitální učební materiály ve škole, registrační číslo projektu CZ.1.07/1.5.00/34.0527 Střední zdravotnická škola a Vyšší odborná škola zdravotnická, Husova 3, 371 60 České Budějovice
Fytomineral. Inovace Innovations. Energy News 04/2008
Energy News 4 Inovace Innovations 1 Fytomineral Tímto Vám sdělujeme, že již byly vybrány a objednány nové lahve a uzávěry na produkt Fytomineral, které by měly předejít únikům tekutiny při přepravě. První
STŘEDNÍ ODBORNÁ ŠKOLA a STŘEDNÍ ODBORNÉ UČILIŠTĚ, Česká Lípa, 28. října 2707, příspěvková organizace
Název školy STŘEDNÍ ODBORNÁ ŠKOLA a STŘEDNÍ ODBORNÉ UČILIŠTĚ, Česká Lípa, 28. října 2707, příspěvková organizace Číslo a název projektu: CZ.1.07/1.5.00/34.0880 Digitální učební materiály www.skolalipa.cz
Klepnutím lze upravit styl předlohy. Klepnutím lze upravit styl předlohy. nadpisů. nadpisů. Aleš Křupka.
1 / 13 Klepnutím lze upravit styl předlohy Klepnutím lze upravit styl předlohy www.splab.cz Aleš Křupka akrupka@phd.feec.vutbr.cz Department of Telecommunications Faculty of Electrotechnical Engineering
místo, kde se rodí nápady
místo, kde se rodí nápady a private european network of information centres on materials and innovative products. Created in 2001 in Paris, it provides members with a large selection of specific, reproducible
Škola: Střední škola obchodní, České Budějovice, Husova 9. Inovace a zkvalitnění výuky prostřednictvím ICT
Škola: Střední škola obchodní, České Budějovice, Husova 9 Projekt MŠMT ČR: EU PENÍZE ŠKOLÁM Číslo projektu: CZ.1.07/1.5.00/34.0536 Název projektu školy: Výuka s ICT na SŠ obchodní České Budějovice Šablona
Drags imun. Innovations
Energy news 2 Inovace Innovations 1 Drags imun V příštích plánovaných výrobních šaržích dojde ke změně balení a designu tohoto produktu. Designové změny sledují úspěšný trend započatý novou generací Pentagramu
ANGLICKÁ KONVERZACE PRO STŘEDNĚ POKROČILÉ
ANGLICKÁ KONVERZACE PRO STŘEDNĚ POKROČILÉ MGR. VLADIMÍR BRADÁČ ROZVOJ KOMPETENCÍ MANAGEMENTU A PRACOVNÍKŮ VŠ MSK (S PODPOROU ICT) PROJEKT OP VK 2.2: CZ.1.07/2.2.00/15.0176 OSTRAVA 2012 Tento projekt je
Klepnutím lze upravit styl Click to edit Master title style předlohy nadpisů.
nadpisu. Case Study Environmental Controlling level Control Fifth level Implementation Policy and goals Organisation Documentation Information Mass and energy balances Analysis Planning of measures 1 1
http://www.zlinskedumy.cz
Číslo projektu Číslo a název šablony klíčové aktivity Tematická oblast Autor CZ.1.07/1.5.00/34.0514 III/2 Inovace a zkvalitnění výuky prostřednictvím ICT Výklad a cvičení z větné stavby, vy_32_inovace_ma_33_01
A Note on Generation of Sequences of Pseudorandom Numbers with Prescribed Autocorrelation Coefficients
KYBERNETIKA VOLUME 8 (1972), NUMBER 6 A Note on Generation of Sequences of Pseudorandom Numbers with Prescribed Autocorrelation Coefficients JAROSLAV KRAL In many applications (for example if the effect
Výukový materiál zpracovaný v rámci projektu EU peníze do škol. illness, a text
Výukový materiál zpracovaný v rámci projektu EU peníze do škol ZŠ Litoměřice, Ladova Ladova 5 412 01 Litoměřice www.zsladovaltm.cz vedeni@zsladovaltm.cz Pořadové číslo projektu: CZ.1.07/1.4.00/21.0948
Mikrokvadrotor: Návrh,
KONTAKT 2011 Mikrokvadrotor: Návrh, Modelování,, Identifikace a Řízení Autor: Jaromír r Dvořák k (md( md@unicode.cz) Vedoucí: : Zdeněk Hurák (hurak@fel.cvut.cz) Katedra řídicí techniky FEL ČVUT Praha 26.5.2011
Case Study Czech Republic Use of context data for different evaluation activities
Case Study Czech Republic Use of context data for different evaluation activities Good practice workshop on Choosing and using context indicators in Rural Development, Lisbon, 15 16 November 2012 CONTENT
UPM3 Hybrid Návod na ovládání Čerpadlo UPM3 Hybrid 2-5 Instruction Manual UPM3 Hybrid Circulation Pump 6-9
www.regulus.cz UPM3 Hybrid Návod na ovládání Čerpadlo UPM3 Hybrid 2-5 Instruction Manual UPM3 Hybrid Circulation Pump 6-9 CZ EN UPM3 Hybrid 1. Úvod V továrním nastavení čerpadla UPM3 Hybrid je profil PWM
Energy vstupuje na trh veterinárních produktů Energy enters the market of veterinary products
Energy news2 1 Energy vstupuje na trh veterinárních produktů Energy enters the market of veterinary products Doposud jste Energy znali jako výrobce a dodavatele humánních přírodních doplňků stravy a kosmetiky.
Střední průmyslová škola strojnická Olomouc, tř.17. listopadu 49
Střední průmyslová škola strojnická Olomouc, tř.17. listopadu 49 Výukový materiál zpracovaný v rámci projektu Výuka moderně Registrační číslo projektu: CZ.1.07/1.5.00/34.0205 Šablona: III/2 Anglický jazyk
Are you a healthy eater?
Are you a healthy eater? VY_32_INOVACE_97 Vzdělávací oblast: Jazyk a jazyková komunikace Vzdělávací obor: Anglický jazyk Ročník: 8. 9.roč. 1. What does a nutrition expert tell four teenagers about their
Aplikace matematiky. Dana Lauerová A note to the theory of periodic solutions of a parabolic equation
Aplikace matematiky Dana Lauerová A note to the theory of periodic solutions of a parabolic equation Aplikace matematiky, Vol. 25 (1980), No. 6, 457--460 Persistent URL: http://dml.cz/dmlcz/103885 Terms
. 1 st International School of Ostrava - mezinárodní gymnázium, s. r. o., Gregorova 2582/3, Ostrava. IZO: Forma vzdělávání: denní
. 1 st International School of Ostrava - mezinárodní gymnázium, s. r. o., Gregorova 2582/3, 702 00 Ostrava IZO: 150 077 009 Forma vzdělávání: denní Kritéria pro I. kolo přijímacího řízení pro školní rok
Gymnázium a Střední odborná škola, Rokycany, Mládežníků 1115
Číslo projektu: Číslo šablony: Název materiálu: Gymnázium a Střední odborná škola, Rokycany, Mládežníků 1115 CZ.1.07/1.5.00/34.0410 II/2 Websites and communication tools IT English Ročník: Identifikace
kupi.cz Michal Mikuš
kupi.cz Michal Mikuš redisgn website kupi.cz, reduce the visual noise. ADVERT ADVERT The first impression from the website was that i dint knew where to start. It was such a mess, adverts, eyes, products,
PRAVIDLA ZPRACOVÁNÍ STANDARDNÍCH ELEKTRONICKÝCH ZAHRANIČNÍCH PLATEBNÍCH PŘÍKAZŮ STANDARD ELECTRONIC FOREIGN PAYMENT ORDERS PROCESSING RULES
PRAVIDLA ZPRACOVÁNÍ STANDARDNÍCH ELEKTRONICKÝCH ZAHRANIČNÍCH PLATEBNÍCH PŘÍKAZŮ STANDARD ELECTRONIC FOREIGN PAYMENT ORDERS PROCESSING RULES Použité pojmy Platební systém Elektronický platební příkaz Účetní
Příručka ke směrnici 89/106/EHS o stavebních výrobcích / Příloha III - Rozhodnutí Komise
Příručka ke směrnici 89/106/EHS o stavebních výrobcích / Příloha III - Rozhodnutí Komise ROZHODNUTÍ KOMISE ze dne 27. června 1997 o postupu prokazování shody stavebních výrobků ve smyslu čl. 20 odst. 2
ČSN EN ISO OPRAVA 2
ČESKÁ TECHNICKÁ NORMA ICS 03.120.10; 11.040.01 Květen 2010 Zdravotnické prostředky Systémy managementu jakosti Požadavky pro účely předpisů ČSN EN ISO 13485 OPRAVA 2 85 5001 idt EN ISO 13485:2003/AC:2009-08
USER'S MANUAL FAN MOTOR DRIVER FMD-02
USER'S MANUAL FAN MOTOR DRIVER FMD-02 IMPORTANT NOTE: Read this manual carefully before installing or operating your new air conditioning unit. Make sure to save this manual for future reference. FMD Module
Číslo projektu: CZ.1.07/1.5.00/34.0036 Název projektu: Inovace a individualizace výuky
Číslo projektu: CZ.1.07/1.5.00/34.0036 Název projektu: Inovace a individualizace výuky Autor: Mgr. Libuše Matulová Název materiálu: Education Označení materiálu: VY_32_INOVACE_MAT27 Datum vytvoření: 10.10.2013