Quantitative Properties of Russian Adjective-Noun Collocations across Dictionaries and Corpora* Maria Khokhlova[0000-0001-9085-0284] St. Petersburg State University, 7/9 Universitetskaya nab., 199034 St. Petersburg, Russia m.khokhlova@spbu.ru Abstract. The paper discusses the differences between collocations extracted from a number of Russian dictionaries paying attention to their frequency char- acteristics based on corpora. The aim of the study was, first, to analyze how col- locations and set expressions are described in Russian explanatory and special- ized dictionaries and to what extent their data coincide with each other, and, secondly, to investigate how collocations presented in dictionaries are reflected in text corpora. This will make it possible to examine the interrelation between the “manually” collected data and modern corpora (the Russian National Cor- pus and ruTenTen). We tested the following hypothesis, i.e. high collocation frequencies correspond to the fact that the item is represented in several dic- tionaries. In our paper we considered 180 collocations built according to the “adjective / participle + noun” model. The results show the heterogeneity of the dictionary data while the choice of lexical items does not coincide with its frequency characteristics: the examples are low-frequency and about 34% are absent in the disambiguated subcorpus. Explanatory dictionaries and collocation dictionaries show the smallest overlap. Keywords: Collocations, Russian Language, Dictionaries, Corpora, Statistics. 1 Introduction Our project deals with the process of building a database that will represent Russian collocations extracted from dictionaries and corpora [8]. The results of this research can be used in various NLP tasks and also in different fields of theoretical and applied linguistics, i.e. Russian lexicology, morphology and syntax or teaching the Russian language. Data about Russian collocability can be valuable for machine translation, clustering of words and word combinations, sentiment analysis, text summarization, disambiguation etc. It is expected that the collocations extracted from dictionaries will be used for the evaluation of machine learning algorithms dealing with automatic text processing, since today there is no single standard that would include verified infor- mation and at the same time in sufficient quantity. * This work was supported by the grant of the Russian Science Foundation (Project No. 19-78- 00091). Copyright © 2020 for this paper by its authors. Use permitted under Creative Com- mons License Attribution 4.0 International (CC BY 4.0). 2 Since dictionaries are an important source of data about collocations, it is im- portant to analyze them in order to understand the possible distinctions. At the same time, it can be tricky to compare different dictionaries. In the paper we consider collocations which were extracted from six Russian dic- tionaries, analyzing how they are reflected in corpora of the Russian language. Within the framework of our research, we dwell on two tasks. First, to analyze how recurrent word combinations are presented in different dictionaries and how much they coin- cide with each other. Secondly, to investigate the extent to which collocations that are reflected in dictionaries can be found in corpora and, therefore, trace the intersection between “manually” collected data and modern corpora. 2 Related Work The tradition of Russian lexicography has a rich history, however, there are not many projects dealing with collocability in Russian and moreover based on corpus data or created by automatic methods. At the same time corpora are seen as the main source of language data in Western lexicography and up-to-date projects implement them (Macmillan Dictionary). The issue of selecting collocations is crucial in lexicography, and not only for monolingual dictionaries, representing “the most controversial and vulnerable part of almost every bilingual dictionary” [2, p. 61]. Atkins and Rundell [1] point out the difficulty of selecting examples from the corpus and suggest using collocation lists for this task. As some authors note [4, 12] the issue of differentiating phrases of various types is still controversial, which leads to the fact that “specific cases of idiomatic combinations often do not receive an unambiguous qualification, which is reflected, in particular, in dictionaries” [12, p. 2]. What kind of phrases to include in a computer dictionary is discussed, for example, in [14]. Multi-word expressions can be seen as an umbrella term and it is true for lexicographic resources. Authors list different word combinations in dictionary entries calling them idioms, phrasemes, collocations etc. A detailed overview of the available Russian dictionaries was given in the paper [9]. The paper [11] describes the machine learning procedure of selecting coefficients for searching collocations based on the examples selected from the dictionaries. The idea of comparison between dictionaries and corpora attracted attention from scholars a few decades ago. The interest was focused on automatic extraction (either rule-based or statistical one) from the sources and their further evaluation. The results of the analysis complement each other and can be applied for constructing NLP lexi- cons [6]. The research presented later in [20] puts emphasis on collocations in ma- chine translation. Their implementation improved the quality of the analyzed systems. There is also a number of works involving comparison between text corpora (for example, [17, 10]). Most of them deal with building frequency lists and calculating statistical metrics based on several corpora trying to find a suitable test “supporting the comparison of small and large corpora” [10, p. 258]. 3 3 Experiment: Description The merging of dictionaries from different sources implies not only a single lexico- graphic format, but also raises the question of data relevance. When describing collo- cations, a lexicographer needs to select examples taking into account their representa- tiveness in corpus, coverage in dictionaries, and also suitability for language users and their purposes. We identified items from several Russian dictionaries of different types: 1. Explanatory dictionaries, i.e. the Dictionary of the Russian Language (DRL [5]); the Large Explanatory Dictionary of the Russian Language (LEDR [13]); 2. Collocation dictionaries [18, 16, 3]; 3. Online dictionary [12]. In our research, we tested the following hypothesis: high collocation frequencies in the corpus correspond to high values of the dictionary index introduced in [8]. This index is understood as the number of dictionaries in which the item is recorded. That is, we expect to see a directly proportional relationship between lexicographic and corpus data and, therefore, a positive correlation between dictionaries and corpora. None of the collocations was recorded in all six dictionaries we examined, so the maximum value of the dictionary index turned out to be 4. To assess the extracted data across text corpora, we randomly selected 20 colloca- tions from groups with dictionary indices 2, 3 and 4 (see Table 1 for the examples) resulting in 60 collocations. During the statistical analysis we implemented nonpara- metric Friedman and Kruskal-Wallis tests for the comparison between corpora. We also analyzed 20 collocations that are present only in one dictionary (i.e., their dictionary index was 1) resulting in 120 collocations. As a material for our study we used the Russian National Corpus (RNC), i.e. a dis- ambiguated subcorpus of 6 mln tokens and the main corpus of 321 mln tokens. The given subcorpora represent a “classical” (traditional) approach to corpus building compared to an automatic one but are not so large, as opposed to other Russian corpo- ra. Therefore we also consulted ruTenTen corpus of more than 18 bln tokens that was crawled automatically [7]. Table 1. Examples of collocations Collocation Boriso- Kusto- Oubine DRL Reginina, LEDR Diction- va 1995 va 1987 Tyurina, ary index 2008 Shiroko- va 1980 1. adskaya 01 1 1 0 0 0 2 bol’ ‘hellish pain’ 2. glubokaya 0 1 1 0 1 0 3 drevnost’ 1 1 and 0 stand for presence or absence of the collocation in the dictionary. 4 ‘great an- tiquity’ 3. goryachaya 1 1 1 0 1 0 4 lyubov’ ‘burning love’ 4. ostraya 1 1 1 0 1 0 4 diskussiya ‘heated discussion’ 5. vysokoye 0 1 1 0 1 0 3 masterstvo ‘superior skill’ 6. zverinaya 0 1 1 0 0 0 2 zhestokost’ ‘monstrous cruelty’ In our study we will also address to the following questions: 1) can we use corpora of a smaller volume for collocation analysis or in tasks dealing with their automatic pro- cessing? 2) do “traditional” and large web corpora produce the same results? 3) do dictionaries present homogeneous data, i.e. collocations of the same language nature demonstrate similar quantitative features? 4 Experiment: Results 4.1 Dictionary index 4 Even for the high value of dictionary index, the results are not uniform (see examples in Table 2). 4 collocations are absent in the disambiguated subcorpus, although they are present in several dictionaries. Spearman’s rank correlation coefficient is about 0.81 for all three pairs of corpora, which indicates a strong positive relationship be- tween them and their similar ranking of collocations. Table 2. Collocations with dictionary index 4 № collocation RNC, RNC ruTen- subset Ten 1. bol’shaya raznitsa ‘big difference’ 1.33 2.06 1.85 2. bol’shoy uspekh ‘great success’ 5.16 7.61 4.92 3. bol’shoye znacheniye ‘great meaning’ 7.33 9.27 12.65 4. glubokiy smysl ‘deep meaning’ 2.17 1.52 1.23 5. glubokoye ‘deep satisfaction’ 0.83 0.55 0.34 udovletvoreniye 6. glubokoye uvazeniye ‘deep respect’ 0.50 1.88 1.10 7. goryachaya lyubov’ ‘burning love’ 0.17 1.20 0.24 5 8. krepkaya druzhba ‘strong friendship’ 0.17 0.22 0.29 9. ostraya diskussiya ‘heated discussion’ 0.00 0.28 0.33 10. ostraya kritika ‘sharp criticism’ 0.33 0.25 0.21 11. ostraya nuzhda ‘desperate need’ 0.00 0.42 0.16 12. polnaya svoboda ‘complete freedom’ 2.17 4.76 2.24 13. shirokaya diskussiya ‘broad discussion’ 0.17 0.15 0.15 14. shirokaya izvestnost’ ‘great fame’ 1.00 1.00 1.26 15. shirokaya podderzhka ‘broad support’ 0.00 0.25 0.39 16. shirokiy razmakh ‘wide scope’ 0.50 0.72 0.31 17. slaboye mesto ‘weak point’ 1.50 2.20 3.14 18. vysokiy rezul’tat ‘high score’ 1.00 0.53 3.83 19. vysokiy urozhay ‘high yield’ 0.00 0.92 0.75 20. yarkiy primer ‘vivid example’ 2.50 2.83 4.84 The average frequency in the subcorpus of the RNC was 1.34, the main corpus of the RNC and ruTenTen showed 1.93 and 2.01 respectively, but the differences between the data are statistically insignificant (p > 0.05 according to the Friedman test). Hence, the frequencies are homogeneous but two collocations with the lexeme bol’shoy ‘large’ (bol’shoye znacheniye ‘great meaning’ and bol’shoy uspekh ‘great success’) show outliers. 4.2 Dictionary index 3 The frequencies of this group of collocations (see Table 3) show lower correlation (Spearman’s rank correlation coefficient varies between 0.62 and 0.79), while the differences between them in corpora are significant (p < 0.05 according to the Fried- man test). Table 3. Collocations with dictionary index 3 № collocation RNC, RNC ruTen- subset Ten 1. bol’shaya beda ‘big trouble’ 0.83 1.76 0.60 2. bol’shaya pol’za ‘great benefit’ 0.50 2.42 1.13 3. bol’shaya pomosch ‘great help’ 0.66 0.98 1.23 4. bol’shaya vazhnost’ ‘great importance’ 0.33 0.93 0.25 5. gigantskiy shag ‘giant step’ 1.00 0.62 0.14 6. glubokaya drevnost’ ‘great antiquity’ 1.83 1.80 1.62 7. glubokoye vliyaniye ‘deep influence’ 0.33 0.19 0.17 8. korennoy interes ‘core interest’ 0.50 0.20 0.14 9. lyutaya nenavist’ ‘fierce hatred’ 0.83 0.45 0.21 10. lyutyy moroz ‘bitter frost’ 1.50 0.73 0.42 11. nabityy durak ‘perfect fool’ 0.00 0.13 0.01 12. posledneye izvestiye ‘last news’ 2.00 2.23 0.24 13. ravnoye pravo ‘equal right’ 1.00 1.45 1.63 6 14. tesnaya druzhba ‘close friendship’ 0.50 0.81 0.16 15. tyazhelaya zadacha ‘difficult task’ 0.00 0.21 0.13 16. velikoye pereseleniye ‘great relocation’ 0.33 0.45 0.34 17. vysokaya tre- ‘high exactingness’ 0.17 0.14 0.13 bovatel’nost’ 18. vysokoye masterstvo ‘high skill’ 0.50 0.38 0.68 19. zhguchiy styd ‘burning shame’ 0.50 0.23 0.04 20. zhiznennyy put’ ‘life path’ 2.83 4.03 3.87 Thus, in the case of the above given examples that are present in three dictionaries, the frequencies tend to reveal more diversity. 4.3 Dictionary index 2 For collocations selected from two dictionaries, we see that 12 units out of 20 (60%) were not found in the smallest corpus, and 3 of them were not recorded in the main RNC corpus either (see Table 4 for the results). Table 4. Collocations with dictionary index 2 № collocation RNC, RNC ruTen- subset Ten 1. adskaya bol’ ‘hellish pain’ 0.50 0.17 0.15 2. bezgranichnaya toska ‘boundless longing’ 0.00 0.01 0.01 3. bezmernaya glubina ‘immense depth’ 0.17 0.03 0.01 4. isklyuchitel’noye ‘exceptional diversity’ 0.00 0.01 0.01 mnogoobraziye 5. l’vinaya chast’ ‘lion's share’ 0.00 0.12 0.13 6. mestnyy padezh ‘local case’ 0.00 0.02 0.01 7. nervnaya sistema ‘nervous system’ 9.66 9.46 17.72 8. neukrotimaya zloba ‘indomitable malice’ 0.17 0.05 0.01 9. ogromnyy diapazon ‘huge range’ 0.00 0.07 0.09 10. otchayannaya ‘desperate courage’ 0.00 0.23 0.04 khrabrost’ 11. polnyy vostorg ‘complete delight’ 0.33 0.98 0.88 12. porazitel’naya predos- ‘astounding precaution’ 0.00 0.00 0.01 torozhnost’ 13. putevodnaya nit’ ‘guiding thread’ 0.17 0.40 0.17 14. total’naya slezhka ‘total surveillance’ 0.00 0.04 0.08 15. tsepnaya reaktsiya ‘chain reaction’ 1.50 1.97 1.32 16. uzhasnaya groza ‘terrible thunderstorm’ 0.17 0.11 0.02 17. zhestokoye nakazaniye ‘cruel punishment’ 0.00 0.71 0.21 18. zhguchaya zlost’ ‘burning anger’ 0.00 0.00 0.01 19. zverinaya skuka ‘animal boredom’ 0.00 0.00 0.01 20. zverinaya zhestokost’ ‘bestial cruelty’ 0.00 0.07 0.04 7 Spearman’s correlation coefficient increased up to 0.92 for ruTenTen and the main RNC corpus while it decreased to 0.53 for ruTenTen and the disambiguated RNC subcorpus. Hence we can register differences in ranking in the latter case. But the fluctuations in the frequencies between all three corpora are not significant (p > 0.05 according to the Friedman test). This result enables us to suggest that collocations found only in two dictionaries are rare. 4.4 Dictionary index 1 Despite the fact that the online dictionary of idiomatic expressions [12] was compiled on the basis of the RNC, only half of the collocations were recorded in the disambig- uated subcorpus. Collocations extracted from the dictionary are characterized by ex- tremely low frequencies in all three corpora and show minimal values compared to other lexicographic resources. This is the poorest result among collocations obtained for all dictionaries. The differences in frequency values between corpora are insignifi- cant (p > 0.05 according to the Friedman test), and the standard deviation values are also low, i.e. one can assume some homogeneity of noun collocations in this diction- ary. The collocations extracted from the dictionary of lexical intensifiers [16] are also characterized by low frequencies in corpora and insignificant differences. The dictionary of set expressions [18] shows the highest results for the collocation frequencies (the differences between corpora are also insignificant, p > 0.05 according to the Friedman test), i.e. it can be concluded that this lexicographic resource reflects more frequent collocations. For example, vyssheye obrazovaniye ‘higher education’ and dukhovnaya zhizn’ ‘spiritual life’. Analysis of the data in the collocations dictionary [3] suggests that the selected items occupy an intermediate position according to their frequency characteristics, i.e. 8 items are not recorded in the disambiguated subcorpus, and 5 collocations have only 1 occurrence in the main RNC corpus. But nevertheless the collocations extracted from the given dictionary prove to be the only ones showing significant differences between corpora. The results for collocations from both explanatory dictionaries suggest that the sources differ to a certain degree in how they represent unique phrases. For units from the LEDR, the distribution of frequencies in three corpora is characterized by outliers and a large range of values (for example, aktsionernoye obscestvo ‘joint stock compa- ny’, organicheskoye veschestvo ‘organic matter’, pochtovyy yaschik ‘letterbox’). Col- locations from the DRL have smaller deviations from the mean values. Both explana- tory dictionaries show very little overlap with other lexicographic sources. This can be explained by the fact that dictionaries are aimed at describing different vocabulary: for example, the dictionary [12] represents only phrases with the meaning of high intensity, while DRL and LEDR are aimed at a more complete presentation of vocab- ulary and dictionary entries list phraseological units. 8 5 Discussion The analysis shows that in total there are no significant differences between corpora in frequencies of collocations with the same dictionary indexes (or from the same dictionary). Thus the analyzed items prove to be rare units. About 34% of the consid- ered collocations are absent in the RNC disambiguated subcorpus, i.e. it can be as- sumed that the volume of 6 mln tokens is not enough to study collocability. About 12% of the analyzed collocations yield less than 0.01 occurrences per million even in the largest ruTenTen corpus. It should be mentioned that collocation frequencies in corpora are steadily decreas- ing with the decrease of dictionary index (the differences are statistically significant, p < 0.05 according to the Kruskal-Wallis test), and the value of the Spearman’s rank correlation coefficient also decreases. Collocations represented in four dictionaries tend to be more widespread in corpora but also have low frequencies. It is worth noting that with the rise of corpus volume the unique collocations (with dictionary index equal to 1) tend to show more diversity in their frequencies. Only the collocation dictionaries [12] and [18] demonstrated significant differences on the smallest RNC corpus while the main RNC corpus could exemplify one more pair, e.g. the dictionary of set expressions [18] and the dictionary of lexical intensifiers [16]. The ruTenTen corpus proved to have the largest number of pairs with significant differences in frequencies (here we can name additionally, firstly, the dictionary of idiomatic expressions [12] and the collocations dictionary [3], and secondly, the for- mer [12] and LEDR [13]). This can suggest that the dictionary of set expressions [18] includes more frequent phrases compared to other sources, while the dictionary of Russian idiomatics [12] contains the least recurrent units. With the exception of a few collocations (bol’shoye znacheniye ‘great importance’, bol’shoy uspekh ‘great success’, vyssheye obrazovaniye ‘higher education’ and nerv- naya sistema ‘nervous system’ the examples turn out to be low-frequency in all three corpora. The hypothesis is confirmed that with the decrease of the dictionary index, the relative frequencies of collocations in the corpus decrease (with the exception of unique collocations in the dictionary [18], whose frequencies, on the contrary, exceed the others). The presence of collocations in several dictionaries indicates their higher frequencies and hence possible prediction by automatic methods. 6 Conclusion In our study we examined the Russian collocations which were extracted from six dictionaries. Their quantitative characteristics obtained on corpora of different vol- umes show that the analyzed examples turn out to be low-frequency and demonstrate their ambiguous nature. The overwhelming majority of dictionary collocations are unique, i.e. presented in only one dictionary; hence such items are difficult to be iden- tified in corpora by using automatic methods. The issue of data volume deserves much more attention, and the very phenomenon of collocability must be investigated in larger corpora as small volume does not show 9 any occurrences for a number of collocations. Automatically crawled large corpora reveal more fascinating findings as well as peculiarities obscured in smaller text col- lections. Hence we can assume that machine learning algorithms that process word combinations should be based on large datasets counting several bln tokens. The number of dictionary collocations depends on the dictionaries used. Unfortu- nately, despite the processing of several sources, the volume of the extracted data is still insufficient, therefore it is important to analyze other dictionaries and lexico- graphic sources and extract examples from them. Explanatory dictionaries may con- tain set expressions in other parts of dictionary entries (in the texts of quotations or illustrative examples), therefore, their further analysis is necessary, which will be performed at the next stage of our work. In future we plan to consider other resources and to study collocations based on other syntactic models. References 1. Atkins, S., Rundell, M.: The Oxford Guide to Practical Lexicography. Oxford U.P., Ox- ford (2008). 2. Berkov, V.: Bilingual lexicography [Dvuyazychnaya leksikografiya]. 2nd edition. AST, Moscow (2004). 3. Borisova, E.: A Word in a Text. A Dictionary of Russian Collocations with English- Russian Dictionary of Keywords [Slovo v tekste. Slovar’ kollokatsiy (ustoychivykh so- chetaniy) russkogo yazyka s anglo-russkim slovarem klyuchevykh slov]. Filologiya, Mos- cow (1995). 4. Calzolari, N., Fillmore, Ch., Grishman, R., Ide, N., Lenci, A., MacLeod, C., Zampolli, A.: Towards Best Practice for Multiword Expressions in Computational Lexicons. In: Pro- ceedings of LREC – 2002, 1934–1940 (2002). 5. The Dictionary of the Russian Language [Slovar’ russkogo jazyka v 4 tomakh]. Yevgen’yeva, A. P. (ed.-in-chief). Vol. 1–4, 2nd edition, revised and supplemented. Russ- kij jazyk, Moscow (1981–1984). 6. Fontenelle, T.: Collocation acquisition from a corpus or from a dictionary: a comparison. In: Proceedings I-II Papers submitted to the 5th EURALEX International Congress on Lexicography in Tampere, 221–228 (1992). 7. Jakubíček, M., Kilgarriff, A., Kovář, V., Rychlý, P., Suchomel, V.: The TenTen Corpus Family. In: Proceedings of the 7th International Corpus Linguistics Conference CL 2013, the United Kingdom, July 2013, 125–127 (2013). 8. Khokhlova, M.: Building a Gold Standard for a Russian Collocations Database. In: Pro- ceedings of the XVIII EURALEX International Congress: Lexicography in Global Con- texts. Ljubljana, 863–869 (2018). 9. Khokhlova M.: Collocations in Russian Lexicography and Russian Collocations Database. In: Proceedings of The 12th Language Resources and Evaluation Conference. Marseille, France. European Language Resources Association, 3191–3199 (2020). 10. Kilgarriff, A.: Comparing Corpora. International Journal of Corpus Linguistics, 6 (1), 97– 133 (2001). 11. Klyshinsky, E., Khokhlova, M.: In Search of Lost Collocations: Combining Measures to Reach the Top Range. In: Internet and Modern Society: Proceedings of the International Conference IMS-2017 (St. Petersburg; Russian Federation, 21–24 June 2017). Ra- domir V. Bolgov, Nikolai V. Borisov, Leonid V. Smorgunov, Irina I. Tolstikova, Vic- 10 tor P. Zakharov (eds.). ACM International Conference Proceeding Series, 160–163. ACM Press, N.Y. (2017). 12. Kustova, G.: Dictionary of Russian Idiomatic Expressions [Slovar’ russkoyj idiomatiki. Sochetaniya slov so znacheniyem vysokoy stepeni] (2008), http://dict.ruslang.ru, last ac- cessed 2020/10/14. 13. The Large Explanatory Dictionary of the Russian Language [Bol’shoy tolkovyy slovar’ russkogo yazyka]. Kuznetsov, S. (ed.). Norint, St. Petersburg (1998). 14. Lukashevich, N., Dobrov, B., Chuyko, D.: Selecting word phrases for an automatic text processing system dictionary [Otbor slovosochetaniy dlya slovarya sistemy avtomatich- eskoy obrabotki tekstov]. In: Computational linguistics and intellectual technologies: Pro- ceedings of Int. Conf. “Dialog–2008”, 339–344. RSUH, Moscow (2008). 15. Macmillan Dictionary, https://www.macmillandictionary.com, last accessed 2020/10/14. 16. Oubine, I.: Dictionary of Russian and English Lexical Intensifiers [Slovar’ usilitel’nykh slovoso-chetaniy russkogo I angliyskogo yazykov]. Russian Language, Moscow (1987). 17. Rayson, P., Garside, R.: Comparing corpora using frequency profiling. In: Proceedings of the workshop on Comparing Corpora. Association for Computational Linguistics, 1–6 (2000). 18. Reginina, K., Tjurina, G., Shirokova, L.: Set Expressions of the Russian Language. A Ref- erence Book for Foreign Students [Ustoychivye slovosochetaniya russkogo yazyka: Uchebnoye posobiye dlya studentov-inostrantsev]. Shirokova, L. I. (ed.). Moscow (1980). 19. Russian National Corpus, http://ruscorpora.ru, last accessed 2020/10/14. 20. Wehrli, E., Seretan, V., Nerima, L., Russo, L.: Collocations in a rule-based MT system: A case study evaluation of their translation adequacy. In: Proceedings of the 13th Annual Conference of the EAMT, Barcelona, 128–135 (2009).