Dynamic Development of Vocabulary Richness of Text Miroslav Kubát & Radek Čech University of Ostrava Czech Republic
Aim To analyze the dynamic development of vocabulary richness from a methodological point of view. Q1: Do various text segmentations affect the development? Q2: Are there significant differences between texts? Q3: Are there significant differences between genres?
Figure: Example of different vocabulary richness development in two hypothetical texts (text_1, text_2); x-axis: chapter (1 to 10), y-axis: vocabulary richness (0 to 1).
Figure: Development of vocabulary richness in two texts (anglické_listy, výlet_do_španěl), segmentation by 300 tokens; x-axis: 1 to 33, y-axis: vocabulary richness (0.74 to 0.84).
Average vocabulary richness: anglické_listy 0.791, výlet_do_španěl 0.789 (non-significant difference according to the u-test, α = 0.05).
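MATTR Distance (d) measurement
Figure: two subsequent values, MATTR_i and MATTR_{i+1}, plotted one step apart on the i axis, with (i + 1) - i = 1; d_i is the distance between the two points.
$$d_i = \left[ \left( \mathrm{MATTR}_i - \mathrm{MATTR}_{i+1} \right)^2 + 1 \right]^{1/2}$$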
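The distance measure can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names `distances` and `average_distance` and the sample values are hypothetical.

```python
import math

def distances(values):
    """Distance d_i between each pair of subsequent values, one step apart."""
    return [math.sqrt((a - b) ** 2 + 1) for a, b in zip(values, values[1:])]

def average_distance(values):
    """Mean of all d_i; needs at least two values."""
    d = distances(values)
    return sum(d) / len(d)

# Hypothetical MATTR values of four subsequent segments (not data from the talk).
richness = [0.80, 0.83, 0.79, 0.79]
print(distances(richness))
print(average_distance(richness))
```

Because the horizontal step is always 1, every d_i is at least 1, and it equals 1 only when two subsequent segments have the same vocabulary richness.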
Vocabulary richness MATTR
A text is divided into overlapping subtexts (windows) of the same length. TTR is computed for every subtext. MATTR (moving-average type-token ratio) is defined as the mean of these TTR values.
$$\mathrm{MATTR}_L = \frac{\sum_{i=1}^{N-L+1} V_i}{L\,(N-L+1)}$$
L: arbitrarily chosen length of a window, L < N
N: text length in tokens
V_i: number of types in an individual window
Vocabulary richness MATTR
Example text: a, b, c, a, a, d, f (N = 7, L = 3)
Windows and their numbers of types V_i: (a, b, c) 3; (b, c, a) 3; (c, a, a) 2; (a, a, d) 2; (a, d, f) 3
$$\mathrm{MATTR}_3 = \frac{\sum_{i=1}^{N-L+1} V_i}{L\,(N-L+1)} = \frac{3+3+2+2+3}{3\,(7-3+1)} = \frac{13}{15} \approx 0.87$$
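The worked example can be checked with a short Python sketch. The function name `mattr` and its signature are illustrative assumptions; tokens are taken to be an already tokenized list.

```python
def mattr(tokens, window):
    """Moving-average type-token ratio over all overlapping windows."""
    n = len(tokens)
    if not 0 < window <= n:
        raise ValueError("window must satisfy 0 < window <= len(tokens)")
    # One window starts at every position 0 .. n - window.
    type_counts = [len(set(tokens[i:i + window])) for i in range(n - window + 1)]
    return sum(type_counts) / (window * len(type_counts))

example = ["a", "b", "c", "a", "a", "d", "f"]
print(round(mattr(example, 3), 2))  # 0.87
```

Recomputing the set of types for every window is quadratic in the worst case. For long texts, a running frequency counter that is updated as the window slides would be the usual optimization.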
Data 16 long Czech texts. 3 genres (travel books, novels, scientific texts). One author (Karel Čapek).
Three ways of text segmentation Chapters Paragraphs Sequences of 300 tokens
Figure: Development of vocabulary richness in two texts (anglické_listy, výlet_do_španěl), segmentation by 300 tokens; x-axis: 1 to 33, y-axis: vocabulary richness (0.74 to 0.84).
Average vocabulary richness: anglické_listy 0.791, výlet_do_španěl 0.789 (non-significant difference according to the u-test, α = 0.05).
Figure: Development of d in two texts (výlet do španěl, anglické listy), segmentation by 300 tokens; y-axis: d (1 to 1.009).
Wilcoxon-Mann-Whitney test p-value: 0.00004.
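For readers unfamiliar with the test, the following is a minimal pure-Python sketch of a two-sided Wilcoxon-Mann-Whitney test. It assumes the normal approximation without tie or continuity correction. The slides do not say which implementation was used, so the function name `mann_whitney_u` and the sample values are hypothetical.

```python
import math

def mann_whitney_u(x, y):
    """Two-sided Mann-Whitney U test, normal approximation.

    Assumption: no tie correction and no continuity correction.
    Returns (U, p_value).
    """
    n1, n2 = len(x), len(y)
    # U1 counts pairs (a, b) with a > b; a tie counts as one half.
    u1 = sum((a > b) + 0.5 * (a == b) for a in x for b in y)
    u = min(u1, n1 * n2 - u1)
    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mean_u) / sd_u
    # Two-sided p-value under the standard normal distribution.
    p = math.erfc(abs(z) / math.sqrt(2))
    return u, p

# Hypothetical d sequences of two texts (not data from the talk).
d_text_1 = [1.0001, 1.0003, 1.0002, 1.0004]
d_text_2 = [1.0010, 1.0008, 1.0012, 1.0009]
print(mann_whitney_u(d_text_1, d_text_2))
```

For small samples or many ties an exact or tie-corrected version is preferable; a library routine such as `scipy.stats.mannwhitneyu` covers those cases.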
Figure: Average distances (d) in texts for three segmentations (chapters, paragraphs, sequences_300); y-axis: average d (1 to 1.0035).
Figure: Average distances (d) for three segmentations (chapters, paragraphs, sequences_300); y-axis: average d (1 to 1.0018).
Figure: Average distances (d) for three sequence lengths (sequences_300, sequences_500, sequences_1000); y-axis: average d (1 to 1.001).
Wilcoxon-Mann-Whitney Test (α = 0.05): p-values of pairwise comparisons between texts (row text versus column text)
Texts: 1 anglické listy, 2 cesta na sever, 3 italské listy, 4 obrázky z holandska, 5 výlet do španěl, 6 hordubal, 7 krakatit, 8 obyčejný život, 9 povětroň, 10 první parta, 11 továrna na absolutno, 12 válka s mloky, 13 život a dílo skladatele foltýna, 14 objektivní metoda, 15 pragmatismus, 16 směry v estetice

Chapter segmentation (29 significant differences)
        1      2      3      4      5      6      7      8      9     10     11     12     13     14     15
 2  0.385
 3  0.469  0.920
 4  0.240  0.582  0.539
 5  0.688  0.111  0.212  0.116
 6  0.468  0.030  0.078  0.034  0.754
 7  0.941  0.229  0.340  0.135  0.526  0.304
 8  0.391  0.941  0.997  0.515  0.174  0.028  0.289
 9  0.740  0.101  0.227  0.089  0.877  0.700  0.591  0.117
10  0.081  0.000  0.003  0.002  0.154  0.164  0.012  0.000  0.075
11  0.041  0.107  0.123  0.628  0.011  0.003  0.008  0.102  0.009  0.000
12  0.335  0.933  0.834  0.548  0.120  0.045  0.223  0.870  0.126  0.001  0.160
13  0.721  0.116  0.241  0.164  0.845  0.867  0.553  0.264  0.924  0.224  0.046  0.172
14  0.027  0.000  0.001  0.001  0.071  0.087  0.002  0.000  0.034  0.677  0.000  0.000  0.147
15  0.684  0.138  0.215  0.184  0.978  0.937  0.525  0.257  0.879  0.202  0.048  0.129  0.979  0.123
16  0.529  0.781  0.892  0.558  0.242  0.061  0.329  0.980  0.221  0.004  0.188  0.933  0.315  0.003  0.334

Paragraph segmentation (45 significant differences)
        1      2      3      4      5      6      7      8      9     10     11     12     13     14     15
 2  0.019
 3  0.138  0.454
 4  0.091  0.000  0.003
 5  0.023  0.977  0.483  0.001
 6  0.035  0.983  0.518  0.001  0.951
 7  0.453  0.036  0.284  0.014  0.059  0.062
 8  0.003  0.645  0.245  0.000  0.654  0.699  0.003
 9  0.043  0.580  0.748  0.000  0.652  0.658  0.066  0.268
10  0.063  0.441  0.925  0.001  0.490  0.533  0.148  0.165  0.716
11  0.107  0.000  0.002  0.837  0.000  0.000  0.006  0.000  0.000  0.000
12  0.012  0.962  0.438  0.000  0.917  0.994  0.015  0.613  0.549  0.402  0.000
13  0.024  0.972  0.528  0.000  0.969  0.972  0.049  0.643  0.658  0.506  0.000  0.983
14  0.492  0.056  0.323  0.022  0.082  0.096  0.965  0.008  0.092  0.201  0.011  0.036  0.078
15  0.003  0.463  0.142  0.000  0.397  0.502  0.005  0.681  0.173  0.115  0.000  0.432  0.463  0.010
16  0.481  0.073  0.348  0.021  0.108  0.110  0.909  0.014  0.120  0.230  0.013  0.054  0.094  0.993  0.012

300 tokens segmentation (28 significant differences)
        1      2      3      4      5      6      7      8      9     10     11     12     13     14     15
 2  0.234
 3  0.023  0.175
 4  0.487  0.944  0.228
 5  0.000  0.001  0.120  0.009
 6  0.001  0.018  0.602  0.064  0.117
 7  0.015  0.320  0.421  0.332  0.001  0.063
 8  0.039  0.397  0.456  0.420  0.003  0.075  0.880
 9  0.185  0.888  0.116  0.852  0.000  0.008  0.212  0.360
10  0.010  0.166  0.651  0.234  0.011  0.215  0.585  0.521  0.106
11  0.132  0.753  0.260  0.645  0.002  0.037  0.514  0.605  0.696  0.320
12  0.002  0.045  0.955  0.149  0.044  0.555  0.151  0.186  0.015  0.443  0.083
13  0.050  0.387  0.526  0.377  0.011  0.155  0.941  0.846  0.348  0.688  0.601  0.311
14  0.018  0.150  0.748  0.256  0.022  0.302  0.553  0.519  0.135  0.980  0.331  0.616  0.709
15  0.015  0.093  0.775  0.130  0.242  0.840  0.245  0.298  0.078  0.425  0.156  0.828  0.345  0.452
16  0.219  0.913  0.115  0.753  0.000  0.012  0.303  0.343  0.957  0.162  0.647  0.034  0.349  0.156  0.066
Figure: Average distances (d) in genres (travel_book, novel, scientific_text) for three segmentations (chapters, paragraphs, sequences_300); y-axis: average d (1 to 1.0025).
u-test, α = 0.05; u ≥ 1.96 means a significant difference

CHAPTERS          travel book   novel
novel             1.008178      x
scientific text   1.604092      0.807497

PARAGRAPHS        travel book   novel
novel             0.512984      x
scientific text   0.60703       0.11041

300_TOKENS        travel book   novel
novel             0.482236      x
scientific text   0.751127      0.578005

There is no significant difference between genres in our corpus.
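The slides do not spell out the u statistic. One common form, assumed here and not confirmed by the source, is the asymptotic test for the difference of two sample means, u = |m1 - m2| / sqrt(s1²/n1 + s2²/n2). A minimal sketch under that assumption, with a hypothetical function name and hypothetical data:

```python
import math
from statistics import mean, variance

def u_statistic(x, y):
    """Assumed form of the u-test: asymptotic test for two sample means.

    Each sample needs at least two values, and the samples must not both
    have zero variance (the standard error would be zero).
    """
    standard_error = math.sqrt(variance(x) / len(x) + variance(y) / len(y))
    return abs(mean(x) - mean(y)) / standard_error

# Hypothetical average distances of texts in two genres (not data from the talk).
genre_a = [1.0012, 1.0015, 1.0011, 1.0014]
genre_b = [1.0013, 1.0016, 1.0010, 1.0015]
u = u_statistic(genre_a, genre_b)
print(u, "significant" if u >= 1.96 else "not significant")
```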
Chapters "Natural" units. Relatively long units. Appropriate units for linguistic and literary research. Different lengths.
Paragraphs "Natural" units. Appropriate units for linguistic and literary research. Very short units for vocabulary richness measurement. Different lengths.
Sequences of text with the same length Same length. Good length for vocabulary richness measurement. Artificial units. Last part of a text must be removed. Problematic linguistic and literary interpretation.
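Segmentation into sequences of the same length can be sketched as follows. The sketch assumes consecutive, non-overlapping sequences and an already tokenized text; the function name `fixed_length_sequences` is hypothetical.

```python
def fixed_length_sequences(tokens, length=300):
    """Split tokens into consecutive sequences of `length` tokens.

    The incomplete last part of the text is dropped.
    """
    full = len(tokens) // length
    return [tokens[i * length:(i + 1) * length] for i in range(full)]

# Hypothetical tokenized text of 1000 tokens.
tokens = ["w%d" % i for i in range(1000)]
segments = fixed_length_sequences(tokens)
print(len(segments), len(segments[-1]))  # 3 300
```

In this example the last 100 tokens are discarded, which is the cost of keeping every unit the same length.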
Preliminary Conclusion and Discussion The longer the sequences of text, the smaller the average distances between subsequent sequences. Neither chapters nor paragraphs are suitable for development analysis because their lengths differ. Sequences of an arbitrarily chosen length seem to be the best option; however, the artificial character of these units makes the interpretation of the results problematic. Development of vocabulary richness could be used for text analysis. Differences among genres were not corroborated in our corpus. Is there another unit for this analysis?
Thank You For Your Attention!